To fully exploit the computational power of the GPU, a large amount of data parallelism must be expressed. If your problem does not possess a sufficient amount of data parallelism, a second option is to combine data parallelism with task parallelism on the GPU through the use of concurrent kernels. To facilitate task parallelism, the NVIDIA Kepler K20x features Hyper-Q, a set of 32 hardware-managed work queues. When using CUDA streams, each stream will be automatically mapped onto Hyper-Q, allowing up to 32 streams to execute concurrently. The NVIDIA Multi-Process Service allows multiple processes, such as intra-node MPI ranks, to be mapped onto Hyper-Q. This tutorial will demonstrate how to take advantage of GPU concurrency on Titan through the use of Hyper-Q. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.
C
The Serial, OpenMP, and MPI C samples make use of the following GPU wrapper code. The wrapper provides the following functions:

- sleep(): GPU kernel to sleep for a given number of clock cycles
- get_cycles(): Return the number of cycles required to sleep for the requested number of seconds
- create_streams(): Create a given number of non-default streams
- sleep_kernel(): Launch a single GPU thread that sleeps for a given number of cycles
- wait_for_streams(): Wait for streams 1 through the number of streams specified to complete any work
- destroy_streams(): Destroy the non-default streams
sleep.cu
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

static cudaStream_t *streams;

// CUDA kernel to pause for at least num_cycles cycles
__global__ void sleep(int64_t num_cycles)
{
    int64_t cycles = 0;
    int64_t start = clock64();
    while(cycles < num_cycles) {
        cycles = clock64() - start;
    }
}

// Returns number of cycles required for requested seconds
extern "C" int64_t get_cycles(float seconds)
{
    // Get device frequency in KHz
    int64_t Hz;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    Hz = int64_t(prop.clockRate) * 1000;

    // Calculate number of cycles to wait
    int64_t num_cycles;
    num_cycles = (int64_t)(seconds * Hz);

    return num_cycles;
}

// Create streams
extern "C" void create_streams(int num_streams)
{
    // Allocate streams
    streams = (cudaStream_t *) malloc((num_streams+1)*sizeof(cudaStream_t));

    // Default stream
    streams[0] = NULL;

    // Primer kernel launch
    sleep<<< 1, 1 >>>(1);

    // Create streams
    for(int i = 1; i <= num_streams; i++)
        cudaStreamCreate(&streams[i]);
}

// Launches a kernel that sleeps for num_cycles
extern "C" void sleep_kernel(int64_t num_cycles, int stream_id)
{
    // Launch a single GPU thread to sleep
    int blockSize, gridSize;
    blockSize = 1;
    gridSize = 1;

    // Execute the kernel
    sleep<<< gridSize, blockSize, 0, streams[stream_id] >>>(num_cycles);
}

// Wait for stream to complete
extern "C" void wait_for_stream(int stream_id)
{
    cudaStreamSynchronize(streams[stream_id]);
}

// Wait for streams to complete
extern "C" void wait_for_streams(int num_streams)
{
    for(int i = 1; i <= num_streams; i++)
        cudaStreamSynchronize(streams[i]);
}

// Destroy stream objects
extern "C" void destroy_streams(int num_streams)
{
    // Clean up streams
    for(int i = 1; i <= num_streams; i++)
        cudaStreamDestroy(streams[i]);

    free(streams);
}
Compile:
The following will compile the wrapper functions into object code to be linked in with our examples.
$ module load cudatoolkit
$ nvcc -arch=sm_35 -c sleep.cu
Serially Launched Concurrent Kernels in C
The following sample launches multiple GPU kernels that sleep a single GPU thread for one second. GPU kernel launches are asynchronous on the CPU, which allows a for loop to be used to launch concurrent kernels. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.
launcher.c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/time.h>

int64_t get_cycles(float seconds);
void sleep_kernel(int64_t num_cycles, int stream_id);
void create_streams(int num_streams);
void wait_for_streams(int num_streams);
void destroy_streams(int num_streams);

int main(int argc, char *argv[])
{
    uint64_t cycles;
    struct timeval start, stop;
    int i, num_kernels;

    // Get number of cycles to sleep for 1 second
    cycles = get_cycles(1.0);

    // Max number of kernels to launch
    int max_kernels = 33;

    // Loop through number of kernels to launch, from 1 to max_kernels
    for(num_kernels=1; num_kernels<=max_kernels; num_kernels++)
    {
        // Start timer
        gettimeofday(&start, NULL);

        // Create streams
        create_streams(num_kernels);

        // Launch num_kernels kernels asynchronously
        for(i=1; i<=num_kernels; i++){
            sleep_kernel(cycles, i);
        }

        // Wait for kernels to complete
        wait_for_streams(num_kernels);

        // Clean up streams
        destroy_streams(num_kernels);

        // Print seconds elapsed
        gettimeofday(&stop, NULL);
        double seconds;
        seconds = (stop.tv_sec - start.tv_sec);
        seconds += (stop.tv_usec - start.tv_usec) / 1000000.0;
        printf("Total time for %d kernels: %f s\n", num_kernels, seconds);
    }

    return 0;
}
Compile:
$ module load cudatoolkit
$ cc sleep.o launcher.c -o serial.out
Run:
$ module load cudatoolkit
$ aprun ./serial.out
OpenMP Launched Concurrent Kernels in C
The following example uses OpenMP to launch multiple sleep kernels in a parallel region, using the OMP thread number to specify which stream to use. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.
launcher.c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <omp.h>

int64_t get_cycles(float seconds);
void sleep_kernel(int64_t num_cycles, int stream_id);
void create_streams(int num_streams);
void wait_for_streams(int num_streams);
void destroy_streams(int num_streams);

int main(int argc, char *argv[])
{
    uint64_t cycles;
    double start, stop;
    int num_kernels;

    // Get number of cycles to sleep for 1 second
    cycles = get_cycles(1.0);

    // Max number of kernels to launch
    int max_kernels = 33;

    // Loop through number of kernels to launch, from 1 to max_kernels
    for(num_kernels=1; num_kernels<=max_kernels; num_kernels++)
    {
        // Set number of OMP threads
        omp_set_num_threads(num_kernels);

        // Start timer
        start = omp_get_wtime();

        // Create streams
        create_streams(num_kernels);

        // Launch num_kernels kernels asynchronously
        #pragma omp parallel firstprivate(cycles)
        {
            int stream_id = omp_get_thread_num()+1;
            sleep_kernel(cycles, stream_id);
        }

        // Wait for kernels to complete
        wait_for_streams(num_kernels);

        // Clean up streams
        destroy_streams(num_kernels);

        // Stop timer
        stop = omp_get_wtime();
        printf("Total time for %d kernels: %f s\n", num_kernels, stop-start);
    }

    return 0;
}
Compile:
PGI
$ module load cudatoolkit
$ cc -mp sleep.o launcher.c -o openmp.out
GNU
$ module load cudatoolkit
$ cc -fopenmp sleep.o launcher.c -o openmp.out
Intel
$ module load cudatoolkit
$ cc -openmp sleep.o launcher.c -o openmp.out
Cray
$ module load cudatoolkit
$ cc sleep.o launcher.c -o openmp.out
Run:
$ module load cudatoolkit
$ aprun -d16 ./openmp.out
MPI Launched Concurrent Kernels in C
The following example uses MPI to launch a single GPU kernel per MPI process into the default stream. Titan’s GPU compute mode does not allow multiple processes to access the GPU simultaneously, so CRAY_CUDA_PROXY must be enabled. When enabled, the Cray CUDA proxy works in conjunction with Hyper-Q to allow kernels from up to 32 different MPI processes to run concurrently. Note, however, that on Titan the maximum number of MPI processes per node is currently 16.
launcher.c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "mpi.h"

int64_t get_cycles(float seconds);
void sleep_kernel(int64_t num_cycles, int stream_id);
void create_streams(int num_streams);
void wait_for_stream(int stream_id);
void destroy_streams(int num_streams);

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    uint64_t cycles;
    double start, stop;
    int num_kernels;

    // Get number of cycles to sleep for 1 second
    cycles = get_cycles(1.0);

    // Number of kernels to launch
    int max_kernels = size;

    // Setup default stream in sleep.cu wrapper
    create_streams(0);

    // Loop through number of kernels to launch, from 1 to max_kernels
    for(num_kernels=1; num_kernels<=max_kernels; num_kernels++)
    {
        // Start timer
        MPI_Barrier(MPI_COMM_WORLD);
        if(rank == 0)
            start = MPI_Wtime();

        // Launch kernel into default stream
        if(rank < num_kernels)
            sleep_kernel(cycles, 0);

        // Wait for all ranks to submit kernel
        MPI_Barrier(MPI_COMM_WORLD);

        // Wait for default stream
        if(rank < num_kernels)
            wait_for_stream(0);

        // Wait for all ranks to complete
        MPI_Barrier(MPI_COMM_WORLD);

        // Print seconds elapsed
        if(rank == 0) {
            stop = MPI_Wtime();
            printf("Total time for %d kernels: %f s\n", num_kernels, stop-start);
        }
    }

    destroy_streams(0);

    MPI_Finalize();

    return 0;
}
Compile:
$ module load cudatoolkit
$ cc sleep.o launcher.c -o proxy.out
Run:
$ module load cudatoolkit
$ export CRAY_CUDA_PROXY=1
$ aprun -n16 ./proxy.out
OpenACC Launched Concurrent Kernels in C
By default, OpenACC will block on the CPU when launching kernels or performing data movement. The async(i) clause allows asynchronous kernel launches and data transfers. Although the behavior is implementation dependent, the integer argument i will generally determine which stream the kernel or data operation is launched into. At the time of this writing, each implementation handles the integer argument slightly differently.
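As a minimal sketch of this pattern (illustrative only; the loop bound num_kernels and the region body are placeholders, not part of the full example below), several regions can be queued asynchronously and then synchronized with a single wait:

// Minimal sketch: each parallel region is queued on async queue i
for (i = 1; i <= num_kernels; i++) {
    #pragma acc parallel async(i)
    {
        // ... independent work for this kernel ...
    }
}
// Block until work on all async queues has completed
#pragma acc wait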
launcher.c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "math.h"
#include <sys/time.h>

int main(int argc, char *argv[])
{
    uint64_t num_cycles;
    struct timeval start, stop;
    int i, num_kernels;

    // Set number of cycles to sleep for 1 second
    // We'll use frequency/instruction latency to estimate this
    num_cycles = 730000000/15;

    // Number of kernels to launch
    int max_kernels = 33;

    // Loop through number of kernels to launch, from 1 to max_kernels
    for(num_kernels=1; num_kernels<max_kernels; num_kernels++)
    {
        // Start timer
        gettimeofday(&start, NULL);

        for(i=0; i<=num_kernels; i++)
        {
            #pragma acc parallel async(i) vector_length(1) num_gangs(1)
            {
                uint64_t cycles;
                #pragma acc loop seq
                for(cycles=0; cycles<num_cycles; cycles++) {
                    cycles = cycles;
                }
            }
        }

        // Wait for all async streams to complete
        #pragma acc wait

        // Print seconds elapsed
        gettimeofday(&stop, NULL);
        double seconds;
        seconds = (stop.tv_sec - start.tv_sec);
        seconds += (stop.tv_usec - start.tv_usec) / 1000000.0;
        printf("Total time for %d kernels: %f s\n", num_kernels, seconds);
    }

    return 0;
}
Note: The compiler may optimize away the spin loop shown above. Adding the volatile qualifier to the declaration of cycles should force the loop to be kept; however, the volatile qualifier is currently not supported in PGI accelerator regions.
Compile:
PGI
$ module load cudatoolkit
$ cc -acc launcher.c -o acc.out
Cray
$ module load cudatoolkit
$ cc -hpragma=acc launcher.c -o acc.out
Run:
$ module load cudatoolkit
$ aprun ./acc.out
Fortran
The Serial, OpenMP, and MPI Fortran samples make use of the following CUDA Fortran GPU wrapper code. The wrapper provides the following functions:

- sleep(): GPU kernel to sleep for a given number of clock cycles
- get_cycles(): Return the number of cycles required to sleep for the requested number of seconds
- create_streams(): Create a given number of non-default streams
- sleep_kernel(): Launch a single GPU thread that sleeps for a given number of cycles
- wait_for_streams(): Wait for streams 1 through the number of streams specified to complete any work
- destroy_streams(): Destroy the non-default streams
module sleep
  use cudadevice
  use cudafor
  implicit none

  integer, dimension(:), allocatable :: streams

contains

  !CUDA kernel to pause for at least num_cycles cycles
  attributes(global) subroutine sleep(num_cycles)
    integer(8), value :: num_cycles
    integer(8) :: cycles
    integer(8) :: start

    cycles = 0
    start = clock64
    do while (cycles < num_cycles)
      cycles = clock64 - start
    enddo
  end subroutine sleep

  !Returns number of cycles required for requested seconds
  integer(8) function get_cycles(seconds) result(num_cycles)
    real(8), intent(in) :: seconds
    integer(8) :: istat, Hz
    type(cudadeviceprop) :: prop

    istat = cudaGetDeviceProperties(prop, 0)
    Hz = prop%clockRate * 1000
    num_cycles = seconds * Hz
  end function get_cycles

  !Create streams
  subroutine create_streams(num_streams)
    integer :: num_streams, istat, i

    allocate(streams(num_streams+1))

    ! Default stream
    streams(1) = 0

    ! Primer kernel launch
    call sleep<<< 1, 1 >>>(int8(1))

    do i=2,num_streams+1
      istat = cudaStreamCreate(streams(i))
    enddo
  end subroutine create_streams

  !Launches a kernel that sleeps for num_cycles
  subroutine sleep_kernel(num_cycles, stream_id)
    integer(8) :: num_cycles
    integer :: stream_id
    type(dim3) :: blockSize, gridSize

    blockSize = dim3(1,1,1)
    gridSize = dim3(1,1,1)

    call sleep<<<gridSize, blockSize, 0, streams(stream_id)>>>(num_cycles)
  end subroutine sleep_kernel

  ! Wait for stream to complete
  subroutine wait_for_stream(stream_id)
    integer :: stream_id, istat

    istat = cudaStreamSynchronize(streams(stream_id))
  end subroutine wait_for_stream

  ! Wait for streams to complete
  subroutine wait_for_streams(num_streams)
    integer :: num_streams, istat, i

    do i=2,num_streams+1
      istat = cudaStreamSynchronize(streams(i))
    enddo
  end subroutine wait_for_streams

  ! Destroy streams
  subroutine destroy_streams(num_streams)
    integer :: num_streams, i, istat

    do i=2,num_streams+1
      istat = cudaStreamDestroy(streams(i))
    enddo
    deallocate(streams)
  end subroutine destroy_streams

end module sleep
Serially Launched Concurrent Kernels in Fortran
The following sample launches multiple GPU kernels that sleep a single GPU thread for one second. GPU kernel launches are asynchronous on the CPU, which allows a do loop to be used to launch concurrent kernels. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.
launcher.f90
program main

  use sleep

  integer(8) :: cycles
  integer :: max_kernels, num_kernels, i
  real(8) :: start, stop, seconds

  ! Get number of cycles to sleep for 1 second
  seconds = 1.0
  cycles = get_cycles(seconds)

  ! Maximum number of kernels to launch
  max_kernels = 32

  ! Loop through number of kernels to launch, from 1 to max_kernels
  do num_kernels = 1, max_kernels

    ! Start timer
    call cpu_time(start)

    ! Create streams
    call create_streams(num_kernels)

    ! Launch num_kernels kernels asynchronously
    do i = 2, num_kernels+1
      call sleep_kernel(cycles, i)
    enddo

    ! Wait for kernels to complete
    call wait_for_streams(num_kernels)

    ! Clean up streams
    call destroy_streams(num_kernels)

    ! Stop timer
    call cpu_time(stop)

    print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds'

  enddo

end program main
Compile:
PGI
$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o serial.out
OpenMP Launched Concurrent Kernels in Fortran
The following example uses OpenMP to launch multiple sleep kernels in a parallel region, using the OMP thread number to specify which stream to use. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.
launcher.f90
program main

  use sleep
  use omp_lib

  implicit none

  integer(8) :: cycles
  integer :: max_kernels, num_kernels, stream_id
  real(8) :: start, stop, seconds

  ! Get number of cycles to sleep for 1 second
  seconds = 1.0
  cycles = get_cycles(seconds)

  ! Maximum number of kernels to launch
  max_kernels = 33

  ! Loop through number of kernels to launch, from 1 to max_kernels
  do num_kernels = 1, max_kernels

    ! Set number of OMP threads
    call omp_set_num_threads(num_kernels)

    ! Start timer
    start = omp_get_wtime()

    ! Create streams
    call create_streams(num_kernels)

    ! Launch num_kernels kernels asynchronously
    !$omp parallel private(stream_id) firstprivate(cycles)
    stream_id = omp_get_thread_num()+2
    call sleep_kernel(cycles, stream_id)
    !$omp end parallel

    ! Wait for kernels to complete
    call wait_for_streams(num_kernels)

    ! Clean up streams
    call destroy_streams(num_kernels)

    ! Stop timer
    stop = omp_get_wtime()

    print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds'

  enddo

end program main
Compile:
PGI
$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o openmp.out
MPI Launched Concurrent Kernels in Fortran
The following example uses MPI to launch a single GPU kernel per MPI process into the default stream. Titan’s GPU compute mode does not allow multiple processes to access the GPU simultaneously, so CRAY_CUDA_PROXY must be enabled. When enabled, the Cray CUDA proxy works in conjunction with Hyper-Q to allow kernels from up to 32 different MPI processes to run concurrently. Note, however, that on Titan the maximum number of MPI processes per node is currently 16.
launcher.f90
program main

  use sleep

  implicit none
  include 'mpif.h'

  integer :: max_kernels, num_kernels, i, ierr, rank, size
  integer(8) :: cycles
  real(8) :: start, stop, seconds

  call MPI_Init(ierr)

  ! Get number of cycles to sleep for 1 second
  seconds = 1.0
  cycles = get_cycles(seconds)

  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, size, ierr)

  ! Number of kernels to launch
  max_kernels = size

  ! Setup default stream in sleep.cuf wrapper
  call create_streams(0)

  ! Loop through number of kernels to launch, from 1 to max_kernels
  do num_kernels = 1, max_kernels

    ! Start timer
    call MPI_Barrier(MPI_COMM_WORLD, ierr)
    if (rank == 0) then
      start = MPI_Wtime()
    endif

    ! Launch kernel into default stream
    if (rank < num_kernels) then
      call sleep_kernel(cycles, 1)
    endif

    ! Wait for all ranks to submit kernel
    call MPI_Barrier(MPI_COMM_WORLD, ierr)

    ! Wait for kernel to complete (default stream is streams(1))
    if(rank < num_kernels) then
      call wait_for_stream(1)
    endif

    ! Wait for all ranks to complete
    call MPI_Barrier(MPI_COMM_WORLD, ierr)

    ! Print seconds elapsed
    if (rank == 0) then
      stop = MPI_Wtime()
      print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds'
    endif

  enddo

  ! Clean up array in wrapper; no stream is actually destroyed
  call destroy_streams(0)

  call MPI_Finalize(ierr)

end program main
Compile:
PGI
$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o proxy.out
Run:
$ module load cudatoolkit
$ export CRAY_CUDA_PROXY=1
$ aprun -n16 ./proxy.out
OpenACC Launched Concurrent Kernels in Fortran
By default, OpenACC will block on the CPU when launching kernels or performing data movement. The async(i) clause allows asynchronous kernel launches and data transfers. Although the behavior is implementation dependent, the integer argument i will generally determine which stream the kernel or data operation is launched into. At the time of this writing, each implementation handles the integer argument slightly differently.
launcher.f90
program main

  implicit none

  integer(8) :: num_cycles, cycles, i
  integer :: max_kernels, num_kernels, stream_id
  real(8) :: start, stop, seconds
  real(4) :: foo(33)

  ! Set number of cycles to sleep for 1 second
  ! We'll use frequency/instruction latency to estimate this
  num_cycles = 730000000/(15*5)

  ! Maximum number of kernels to launch
  max_kernels = 33

  ! Loop through number of kernels to launch, from 1 to max_kernels
  do num_kernels = 1, max_kernels

    ! Start timer
    call cpu_time(start)

    ! Launch num_kernels kernels asynchronously
    do i = 1, num_kernels
      !$acc parallel async(i) vector_length(1) num_gangs(1) copy(foo)
      !$acc loop seq
      do cycles = 1, num_cycles
        foo(i) = sin(1.5708)
      enddo
      !$acc end loop
      !$acc end parallel
    enddo

    ! Wait for all async streams to complete
    !$acc wait

    ! Stop timer
    call cpu_time(stop)

    print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds'

  enddo

end program main
Note: The compiler may optimize away the spin loop shown above. Adding the volatile qualifier to the declaration of foo should force the loop to be kept; however, the volatile qualifier is currently not supported in PGI accelerator regions.
Compile:
PGI
$ module load cudatoolkit
$ ftn -acc launcher.f90 -o acc.out
Cray
$ module load cudatoolkit
$ ftn -hacc launcher.f90 -o acc.out
Run:
$ module load cudatoolkit
$ aprun ./acc.out
This tutorial covers how to write a parallel program to calculate π using the Monte Carlo method. The first code is a simple serial implementation; the next codes are parallelized using MPI and OpenMP; and the final code sample combines both of these parallel techniques.
Serial Implementation
The codes use Monte Carlo methods to estimate π. To set up the estimate, randomly located points are generated within a 2×2 square which has a circle inscribed within it (think of a game of darts). The algorithm generates a large number of points and checks whether the coordinates, x and y, of each point lie inside the circle, i.e., x² + y² ≤ 1. The ratio, P, of points inside the circle to the total number of points tried is calculated. Using this ratio, the approximation of π can be computed by assuming that P approaches the ratio of the area of the circle to the area of the square when the number of points, niter, is large. Thus we can say: P = (πr²)/(4r²) = π/4, and solve for π. To do this, P is multiplied by 4 and the result is an approximation of π. (Note: the more points generated (niter), the more accurate the approximation will be.)
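As a quick worked example with made-up numbers: suppose 785,398 of 1,000,000 generated points land inside the circle. Then P = 785398/1000000 = 0.785398, and the estimate is π ≈ 4 × 0.785398 = 3.141592, close to the true value.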
serialpi.c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

int main(int argc, char* argv[])
{
    double niter = 10000000;
    double x,y;
    int i;
    int count=0;
    double z;
    double pi;
    srand(time(NULL));
    //main loop
    for (i=0; i<niter; ++i)
    {
        //get random points
        x = (double)random()/RAND_MAX;
        y = (double)random()/RAND_MAX;
        z = sqrt((x*x)+(y*y));
        //check to see if point is in unit circle
        if (z<=1)
        {
            ++count;
        }
    }
    pi = ((double)count/(double)niter)*4.0;   //p = 4(m/n)
    printf("Pi: %f\n", pi);
    return 0;
}
Next, compile the code using the compiler wrappers on the supercomputer:
cc serialpi.c -o a.out
To run this program on Titan, or any of the other clusters, a job needs to be submitted in the form of a PBS batch script. Here is an example batch script, titan_pi.pbs:
#!/bin/bash
# Begin PBS directives
#PBS -A AAA###
#PBS -N pi
#PBS -j oe
#PBS -l walltime=0:01:00,nodes=1
#PBS -V
#PBS -q debug
# End PBS directives and begin shell commands
cd $PBS_O_WORKDIR
date
aprun -n 1 ./a.out
Replace AAA### with the appropriate project ID that the time will be charged to. See the Titan System User Guide for a more in-depth look at writing batch scripts.
To submit this batch script, run the following:
qsub titan_pi.pbs
When executed, the result will be something like:
Pi: 3.141664
Parallelize with MPI
MPI Send/Recv
In this next version of the program, the code has been parallelized with MPI. In order to do this, a few things need to be set up in the code. First, the mpi.h header file needs to be included.
#include "mpi.h"
Next, an MPI instance needs to be created. To do this, add this:
MPI_Init(&argc, &argv);
right before the code that needs to be parallelized with MPI. To utilize the created MPI instance, the code needs to have MPI sends and receives present. This message sending and receiving is the basic concept behind MPI; many MPI functions can be implemented with combinations of basic send and receive calls. Here, our code is set up to do computations on all but the master node (MPI rank 0). The only thing that rank 0 does is receive the messages (the calculated points in/out of the circle) sent by the other nodes and calculate the ratio based upon that data. The MPI_Recv() function needs to be added so that our master node can receive the messages. To set up the receive, it needs the following information:
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status);
- A buffer to store the received data
- The maximum number of elements in the receive buffer
- An MPI_Datatype of the appropriate type for the data to be received
- The rank number that the message will be sent from. This can be the wildcard MPI_ANY_SOURCE
- A message tag. This serves as a label to identify a particular message. It is defined by the application. It can be the wildcard MPI_ANY_TAG
- The name of the MPI communicator, e.g., MPI_COMM_WORLD
- A status variable
So the first MPI_Recv line should look like this:
MPI_Recv(&recieved[i], nodenum, MPI_INT, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
Next, an MPI_Send() needs to be set up so that the master rank has data to receive. MPI_Send() only needs to know:
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm);
- A pointer to the send buffer (data to send)
- Number of elements in the send buffer
- An MPI_Datatype for the kind of data that's being sent
- The destination rank
- A message tag to help identify a specific message or type of message, as specified/defined by the application
- The MPI communicator
So with this in mind, the first MPI_Send needs to look like this:
MPI_Send(&count, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
Finally, the MPI instance should be closed out before the program exits. To do this simply add this line after the parallel code:
MPI_Finalize();
This code is written so that each node runs the same algorithm to place the random points, categorize them, and count them. The master node collects the total number of tried points and the total number of points inside the circle, takes the ratio of the two, and multiplies it by 4 to get a more accurate estimation of π. It is more accurate because each node runs as many iterations as the single node in the serial code ran, and the master node treats the results as if the values all came from the same circle. The same thing can be achieved with the serial code by multiplying niter, the number of iterations, by the total number of nodes used in the parallelized version, but that is much less efficient than the MPI version because it does not divide the work over several nodes simultaneously.
mpiPi.c
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
#include <math.h>
#define SEED 35791246

int main(int argc, char* argv[])
{
    long niter = 1000000;
    int myid;           //holds process's rank id
    double x,y;         //x,y value for the random coordinate
    int i, count=0;     //count holds the number of "good" coordinates
    double z;           //used to check if x^2+y^2<=1
    double pi;          //holds approx value of pi
    int nodenum;

    MPI_Init(&argc, &argv);                 //Start MPI
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);   //get rank of node's process
    MPI_Comm_size(MPI_COMM_WORLD, &nodenum);
    int recieved[nodenum];
    long recvniter[nodenum];
    srand(SEED+myid);   //Give rand() a seed value. Needs to be different on each node

    if(myid != 0)
    {
        for (i=0; i<niter; ++i)     //main loop
        {
            x = ((double)rand())/RAND_MAX;  //gets a random x coordinate
            y = ((double)rand())/RAND_MAX;  //gets a random y coordinate
            z = sqrt(x*x+y*y);              //Checks to see if point is inside unit circle
            if (z<=1)
            {
                count++;            //if it is, consider it a valid random point
            }
        }
        for(i=0; i<nodenum; ++i)
        {
            MPI_Send(&count, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
            MPI_Send(&niter, 1, MPI_LONG, 0, 2, MPI_COMM_WORLD);
        }
    }
    else if (myid == 0)
    {
        for(i=0; i<nodenum; ++i)
        {
            MPI_Recv(&recieved[i], nodenum, MPI_INT, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(&recvniter[i], nodenum, MPI_LONG, MPI_ANY_SOURCE, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

    if (myid == 0)      //if root process
    {
        int finalcount = 0;
        long finalniter = 0;
        for(i = 0; i<nodenum; ++i)
        {
            finalcount += recieved[i];
            finalniter += recvniter[i];
        }

        pi = ((double)finalcount/(double)finalniter)*4.0;   //p = 4(m/n)
        printf("Pi: %f\n", pi);     //Print the calculated value of pi
    }

    MPI_Finalize();     //Close the MPI instance
    return 0;
}
Compile this code by running:
cc mpiPi.c -o a.out
The job submission is the same as the previous program but two lines in the batch script need to be edited:
#PBS -l walltime=0:01:00,nodes=1
The “nodes=1” needs to be changed to however many nodes are needed to run this code,
and
aprun -n 1 ./a.out
The “-n 1” needs to be changed to reflect the change made in the previous line (e.g., if the previous line now reads “nodes=100”, then the aprun line needs to say “-n 100”), as illustrated below.
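For example, to request a hypothetical count of 100 nodes, the two edited lines in titan_pi.pbs would read:

#PBS -l walltime=0:01:00,nodes=100
aprun -n 100 ./a.out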
Now the job can be submitted to the batch system to be run when the number of requested nodes becomes available.
qsub titan_pi.pbs
MPI Reduce
In the following version of the code, the serial code has been parallelized with MPI again, but this time the collective MPI_Reduce() operation is used. MPI_Reduce() gathers the specified values from all nodes and reduces them to one value on the master node according to the specified operation (e.g., MPI_SUM). This has the same effect as the previous code, but makes for cleaner, more readable code.
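Like MPI_Recv() and MPI_Send() above, MPI_Reduce() takes a send buffer, a receive buffer (significant only on the root rank), an element count and datatype, the reduction operation, the root rank, and the communicator:

int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm);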
mpirPi.c
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
#include <math.h>
#include <time.h>

int main(int argc, char* argv[])
{
    int niter = 100000;
    int myid;           //holds process's rank id
    double x,y;         //x,y value for the random coordinate
    int i;
    int count=0;        //count holds the number of "good" coordinates
    double z;           //used to check if x^2+y^2<=1
    double pi;          //holds approx value of pi
    int reducedcount;   //total number of "good" points from all nodes
    int reducedniter;   //total number of ALL points from all nodes

    MPI_Init(&argc, &argv);                 //Start MPI
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);   //get rank of node's process

    srandom(time(NULL)+myid);   //Give random() a seed value unique on each node (times are synced)

    /* Everyone can do the work */
    for (i=0; i<niter; ++i)     //main loop
    {
        x = (double)random()/RAND_MAX;  //gets a random x coordinate
        y = (double)random()/RAND_MAX;  //gets a random y coordinate
        z = sqrt((x*x)+(y*y));          //Checks to see if point is inside unit circle
        if (z<=1)
        {
            ++count;            //if it is, consider it a valid random point
        }
    }

    MPI_Reduce(&count, &reducedcount, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(&niter, &reducedniter, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (myid == 0)      //if root process
    {
        pi = ((double)reducedcount/(double)reducedniter)*4.0;   //p = 4(m/n)
        printf("Pi: %f\n%i\n%d\n", pi, reducedcount, reducedniter);     //Print the calculated value of pi
    }

    MPI_Finalize();     //Close the MPI instance
    return 0;
}
Compile this code by running
cc mpirPi.c -o a.out
Again, the batch script needs to be edited to reflect the number of nodes requested for this code. Once that is done, submit the job by once again running
qsub titan_pi.pbs
Parallelize with OpenMP
Here the original serial code has been parallelized with OpenMP instead of MPI. OpenMP is used to create separate threads to run on the multicore CPUs of Titan. To use OpenMP, the following line needs to be added:
#include <omp.h>
Also, the #pragma omp parallel compiler directive needs to be added. This tells the compiler that the following block will be OpenMP parallel code. This code creates 16 threads, one for each core of the CPU. Since OpenMP is a shared memory API, one needs to keep in mind which variables need to be accessible by which threads. Here, each thread needs its own copy of x, y, z, and i, and the variable count needs to be accessible by all threads so that it can be incremented as needed. To accomplish this, OpenMP is told that x, y, z, and i need to be private; the firstprivate() clause is used in the omp pragma to declare them as thread private. The firstprivate() clause also automatically initializes the variables inside its parentheses. The variable count needs to be accessible to all threads (i.e., shared), so the shared() clause is used. Finally, OpenMP is told to spawn 16 threads (one for each core of the processor) using the num_threads() clause.
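Putting these clauses together, the directive used in the code below looks like this:

#pragma omp parallel firstprivate(x, y, z, i) shared(count) num_threads(numthreads)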
omppi.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <math.h>
#include <time.h>

int main(int argc, char* argv[])
{
    int niter = 1000000;    //number of iterations per FOR loop
    double x,y;             //x,y value for the random coordinate
    int i;                  //loop counter
    int count=0;            //count holds the number of "good" coordinates
    double z;               //used to check if x^2+y^2<=1
    double pi;              //holds approx value of pi
    int numthreads = 16;

    #pragma omp parallel firstprivate(x, y, z, i) shared(count) num_threads(numthreads)
    {
        srandom((int)time(NULL) ^ omp_get_thread_num());    //Give random() a seed value
        for (i=0; i<niter; ++i)     //main loop
        {
            x = (double)random()/RAND_MAX;  //gets a random x coordinate
            y = (double)random()/RAND_MAX;  //gets a random y coordinate
            z = sqrt((x*x)+(y*y));          //Checks to see if point is inside unit circle
            if (z<=1)
            {
                ++count;        //if it is, consider it a valid random point
            }
        }
    }

    pi = ((double)count/(double)(niter*numthreads))*4.0;
    printf("Pi: %f\n", pi);
    return 0;
}
Compiling it is similar to the previous examples, with one exception: when the compiler is called, the correct OpenMP flag needs to be added. This flag differs depending on which programming environment module is loaded (run module list and look for a module named PrgEnv-*****):
- PrgEnv-pgi: -mp
- PrgEnv-intel: -openmp
- PrgEnv-cray: -openmp
- PrgEnv-gnu: -fopenmp
Now the code can be compiled using this format:
cc -(appropriate flag here) omppi.c -o a.out
The batch script needs to be edited so that only 1 node is requested (there is no MPI code in this version to use multiple nodes yet).
Submit the script:
qsub titan_pi.pbs
OpenMP + MPI Hybrid
This next code sample combines both MPI and OpenMP. This allows it to fully exploit the parallel architecture of Titan’s multicore CPUs as well as add a layer of parallelism from MPI. In this version, 16 OpenMP threads are launched on each MPI rank. Here, a different approach is taken with the OpenMP section: instead of declaring count as shared, the reduction(+:count) clause is used. This clause effectively creates a private copy of count on each thread to be incremented, and once all threads complete, OpenMP reduces all instances of count into the original count variable according to the mathematical operator given. In this case, the total from all threads is needed, so the copies are reduced by taking their sum. Then MPI_Reduce is used to gather and reduce count from each node into one variable on the master node.
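The OpenMP directive in the hybrid code below therefore becomes:

#pragma omp parallel firstprivate(x, y, z, i) reduction(+:count) num_threads(numthreads)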
mpiOMPpi.c
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
#include <omp.h>
#include <math.h>
#include <time.h>

int main(int argc, char* argv[])
{
    int niter = 1000000;    //number of iterations per FOR loop
    int myid;               //holds process's rank id
    double x,y;             //x,y value for the random coordinate
    int i;                  //loop counter
    int count=0;            //count holds the number of "good" coordinates
    double z;               //used to check if x^2+y^2<=1
    double pi;              //holds approx value of pi
    int reducedcount;       //total number of "good" points from all nodes
    int reducedniter;       //total number of ALL points from all nodes
    int ranknum = 0;        //total number of nodes available
    int numthreads = 16;

    MPI_Init(&argc, &argv);                     //Start MPI
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);       //get rank of node's process
    MPI_Comm_size(MPI_COMM_WORLD, &ranknum);    //Gets number of nodes available to process

    if(myid != 0)           //Do the following on all except the master node
    {
        //Start OpenMP code: 16 threads/node
        #pragma omp parallel firstprivate(x, y, z, i) reduction(+:count) num_threads(numthreads)
        {
            srandom((int)time(NULL) ^ omp_get_thread_num());    //Give random() a seed value
            for (i=0; i<niter; ++i)     //main loop
            {
                x = (double)random()/RAND_MAX;  //gets a random x coordinate
                y = (double)random()/RAND_MAX;  //gets a random y coordinate
                z = sqrt((x*x)+(y*y));          //Checks to see if point is inside unit circle
                if (z<=1)
                {
                    ++count;    //if it is, consider it a valid random point
                }
            }
        }
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Reduce(&count, &reducedcount, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    reducedniter = numthreads*niter*(ranknum-1);

    if (myid == 0)          //if root process/master node
    {
        //p = 4(m/n)
        pi = ((double)reducedcount/(double)reducedniter)*4.0;
        //Print the calculated value of pi
        printf("Pi: %f\n%i\n%d\n", pi, reducedcount, reducedniter);
    }

    MPI_Finalize();         //Close the MPI instance
    return 0;
}
Compiling this code is similar to the past examples (don’t forget to use the correct OpenMP flag according to the programming environment used). To compile run:
cc -(OpenMP flag) mpiOMPpi.c -o a.out
The batch script can be edited now so that it can request the appropriate number of nodes (see above). After the batch script is edited, submit the job like before:
qsub titan_pi.pbs
If you have any questions about these examples or how to run them please email help@nccs.gov.
All code from this tutorial, including a Makefile and batch scripts, can be found in a git repository at https://github.com/olcf/Serial-to-Parallel–Monte-Carlo-Pi