To fully exploit the computational power of the GPU, a large amount of data parallelism must be expressed. If your problem does not possess a sufficient amount of data parallelism, a second option is to combine data parallelism with task parallelism on the GPU through the use of concurrent kernels. To facilitate task parallelism the NVIDIA Kepler K20X features Hyper-Q, a set of 32 hardware-managed work queues. When using CUDA streams, each stream will be automatically mapped onto Hyper-Q, allowing up to 32 streams to execute concurrently. The NVIDIA Multi-Process Service (MPS) allows multiple processes, such as intra-node MPI ranks, to be mapped onto Hyper-Q. This tutorial will demonstrate how to take advantage of GPU concurrency on Titan through the use of Hyper-Q. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.
C
The Serial, OpenMP, and MPI C samples make use of the following GPU wrapper code. The wrapper provides the following functions:
sleep(): GPU kernel to sleep for a given number of clock cycles
get_cycles(): Return the number of cycles required to sleep for the requested number of seconds
create_streams(): Create a given number of non-default streams
sleep_kernel(): Launch a single GPU thread that sleeps for a given number of cycles
wait_for_streams(): Wait for streams 1 through the number of streams specified to complete any work
destroy_streams(): Destroy non-default streams
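Taken together, the wrappers are intended to be called in the order shown in the following minimal sketch (the number of streams is illustrative; the launcher examples below add timing and vary the number of kernels):

/* Minimal sketch of the intended call sequence (link against sleep.o). */
#include <stdint.h>

int64_t get_cycles(float seconds);
void    create_streams(int num_streams);
void    sleep_kernel(int64_t num_cycles, int stream_id);
void    wait_for_streams(int num_streams);
void    destroy_streams(int num_streams);

int main()
{
    int     i;
    int     num_streams = 4;                 /* illustrative value        */
    int64_t cycles      = get_cycles(1.0);   /* cycles in one second      */

    create_streams(num_streams);             /* streams 1..num_streams    */
    for (i = 1; i <= num_streams; i++)
        sleep_kernel(cycles, i);             /* one kernel per stream     */
    wait_for_streams(num_streams);           /* block until all finish    */
    destroy_streams(num_streams);
    return 0;
}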
sleep.cu
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <time.h> static cudaStream_t *streams; // CUDA kernel to pause for at least num_cycle cycles __global__ void sleep(int64_t num_cycles) { int64_t cycles = 0; int64_t start = clock64(); while(cycles < num_cycles) { cycles = clock64() - start; } } // Returns number of cycles required for requested seconds extern "C" int64_t get_cycles(float seconds) { // Get device frequency in KHz int64_t Hz; cudaDeviceProp prop; cudaGetDeviceProperties(&prop, 0); Hz = int64_t(prop.clockRate) * 1000; // Calculate number of cycles to wait int64_t num_cycles; num_cycles = (int64_t)(seconds * Hz); return num_cycles; } // Create streams extern "C" void create_streams(int num_streams) { // Allocate streams streams = (cudaStream_t *) malloc((num_streams+1)*sizeof(cudaStream_t)); // Default stream streams[0] = NULL; // Primer kernel launch sleep<<< 1, 1 >>>(1); // Create streams for(int i = 1; i <= num_streams; i++) cudaStreamCreate(&streams[i]); } // Launches a kernel that sleeps for num_cycles extern "C" void sleep_kernel(int64_t num_cycles, int stream_id) { // Launch a single GPU thread to sleep int blockSize, gridSize; blockSize = 1; gridSize = 1; // Execute the kernel sleep<<< gridSize, blockSize, 0, streams[stream_id] >>>(num_cycles); } // Wait for stream to complete extern "C" void wait_for_stream(int stream_id) { cudaStreamSynchronize(streams[stream_id]); } // Wait for streams to complete extern "C" void wait_for_streams(int num_streams) { for(int i = 1; i <= num_streams; i++) cudaStreamSynchronize(streams[i]); } // Destroy stream objects extern "C" void destroy_streams(int num_streams) { // Clean up stream for(int i = 1; i <= num_streams; i++) cudaStreamDestroy(streams[i]); free(streams); }
Compile:
The following will compile the wrapper functions into object code to be linked in with our examples.
$ module load cudatoolkit
$ nvcc -arch=sm_35 -c sleep.cu
Serially Launched Concurrent Kernels in C
The following sample launches multiple GPU kernels that sleep a single GPU thread for one second. GPU kernel launches are asynchronous on the CPU, which allows a for loop to be used to launch concurrent kernels. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.
launcher.c
#include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <sys/time.h> int64_t get_cycles(float seconds); void sleep_kernel(int64_t num_cycles, int stream_id); void create_streams(int num_streams); void wait_for_streams(int num_streams); void destroy_streams(int num_streams); int main(int argc, char *argv[]) { uint64_t cycles; struct timeval start, stop; int i, num_kernels; // Get number of cycles to sleep for 1 second cycles = get_cycles(1.0); // Max number of kernels to launch int max_kernels = 33; // Loop through number of kernels to launch, from 1 to num_kernels for(num_kernels=1; num_kernels<=max_kernels; num_kernels++) { // Start timer gettimeofday(&start, NULL); // Create streams create_streams(num_kernels); // Launch num_kernel kernels asynchrnously for(i=1; i<=num_kernels; i++){ sleep_kernel(cycles, i); } // Wait for kernels to complete wait_for_streams(num_kernels); // Clean up streams destroy_streams(num_kernels); // Print seconds ellapsed gettimeofday(&stop, NULL); double seconds; seconds = (stop.tv_sec - start.tv_sec); seconds += (stop.tv_usec - start.tv_usec) / 1000000.0; printf("Total time for %d kernels: %f s\n", num_kernels, seconds); } return 0; }
Compile:
$ module load cudatoolkit
$ cc sleep.o launcher.c -o serial.out
Run:
$ module load cudatoolkit
$ aprun ./serial.out
OpenMP Launched Concurrent Kernels in C
The following example uses OpenMP to launch multiple sleep kernels in a parallel region, using the OMP thread number to specify which stream to use. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.
launcher.c
#include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <omp.h> int64_t get_cycles(float seconds); void sleep_kernel(int64_t num_cycles, int stream_id); void create_streams(int num_streams); void wait_for_streams(int num_streams); void destroy_streams(int num_streams); int main(int argc, char *argv[]) { uint64_t cycles; double start, stop; int i, num_kernels; // Get number of cycles to sleep for 1 second cycles = get_cycles(1.0); // Number of kernels to launch int max_kernels = 33; // Loop through number of kernels to launch, from 1 to num_kernels for(num_kernels=1; num_kernels<=max_kernels; num_kernels++) { // Set number of OMP threads omp_set_num_threads(num_kernels); // Start timer start = omp_get_wtime(); // Create streams create_streams(num_kernels); // Launch num_kernel kernels asynchrnously #pragma omp parallel firstprivate(cycles) { int stream_id = omp_get_thread_num()+1; sleep_kernel(cycles, stream_id); } // Wait for kernels to complete wait_for_streams(num_kernels); // Wait for kernels to complete and clean up streams destroy_streams(num_kernels); // Stop timer stop = omp_get_wtime(); printf("Total time for %d kernels: %f s\n", num_kernels, stop-start); } return 0; }
Compile:
PGI
$ module load cudatoolkit
$ cc -mp sleep.o launcher.c -o openmp.out
GNU
$ module load cudatoolkit
$ cc -fopenmp sleep.o launcher.c -o openmp.out
Intel
$ module load cudatoolkit
$ cc -openmp sleep.o launcher.c -o openmp.out
Cray
$ module load cudatoolkit
$ cc sleep.o launcher.c -o openmp.out
Run:
$ module load cudatoolkit
$ aprun -d16 ./openmp.out
MPI Launched Concurrent Kernels in C
The following example uses MPI to launch a single GPU kernel per MPI process into the default stream. Titan’s GPU compute mode does not allow multiple processes to access the GPU simultaneously, so the CRAY_CUDA_PROXY must be enabled. When enabled, the Cray CUDA proxy works in conjunction with Hyper-Q to allow kernels from up to 32 different MPI processes to run concurrently. Note, however, that Titan currently allows a maximum of 16 MPI processes per node.
launcher.c
#include <stdio.h> #include <stdlib.h> #include <stdint.h> #include "mpi.h" int64_t get_cycles(float seconds); void sleep_kernel(int64_t num_cycles, int stream_id); void create_streams(int num_streams); void wait_for_stream(int stream_id); void destroy_streams(int num_streams); int main(int argc, char *argv[]) { MPI_Init(&argc, &argv); int rank, size; MPI_Comm_rank(MPI_COMM_WORLD, &rank); MPI_Comm_size(MPI_COMM_WORLD, &size); uint64_t cycles; double start, stop; int num_kernels; // Get number of cycles to sleep for 1 second cycles = get_cycles(1.0); // Number of kernels to launch int max_kernels = size; // Setup default stream in sleep.cu wrapper create_streams(0); // Loop through number of kernels to launch, from 1 to max_kernels for(num_kernels=1; num_kernels<=max_kernels; num_kernels++) { // Start timer MPI_Barrier(MPI_COMM_WORLD); if(rank == 0) start = MPI_Wtime(); // Launch kernel into default stream if(rank < num_kernels) sleep_kernel(cycles, 0); // Wait for all ranks to submit kernel MPI_Barrier(MPI_COMM_WORLD); // Wait for default stream if(rank < num_kernels) wait_for_stream(0); // Wait for all ranks to complete MPI_Barrier(MPI_COMM_WORLD); // Print seconds ellapsed if(rank == 0) { stop = MPI_Wtime(); printf("Total time for %d kernels: %f s\n", num_kernels, stop-start); } } destroy_streams(0); MPI_Finalize(); return 0; }
Compile:
$ module load cudatoolkit
$ cc sleep.o launcher.c -o proxy.out
Run:
$ module load cudatoolkit
$ export CRAY_CUDA_PROXY=1
$ aprun -n16 ./proxy.out
OpenACC Launched Concurrent Kernels in C
By default, OpenACC blocks on the CPU when launching kernels or performing data movement. The async(i) clause allows asynchronous kernel launches and data transfers. Although the behavior is implementation dependent, the integer argument i will generally determine which stream the kernel or data operation is launched into. At the time of this writing each implementation handles the integer argument slightly differently.
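As a brief aside, the OpenACC wait directive may also take a queue argument to synchronize a single async queue rather than all of them; a minimal sketch (array names and sizes are illustrative):

/* Sketch: two kernels launched into separate async queues, then synchronized
 * individually with wait(queue). */
#include <stdio.h>
#define N 1024

int main()
{
    int   i;
    float a[N], b[N];

    #pragma acc parallel loop async(1)   /* queue 1 */
    for (i = 0; i < N; i++) a[i] = 1.0f;

    #pragma acc parallel loop async(2)   /* queue 2 */
    for (i = 0; i < N; i++) b[i] = 2.0f;

    #pragma acc wait(1)                  /* wait on queue 1 only */
    #pragma acc wait(2)                  /* wait on queue 2 only */

    printf("a[0]=%f b[0]=%f\n", a[0], b[0]);
    return 0;
}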
launcher.c
#include <stdio.h> #include <stdlib.h> #include <stdint.h> #include "math.h" #include <sys/time.h> int main(int argc, char *argv[]) { uint64_t num_cycles; struct timeval start, stop; int i,num_kernels; // Set number of cycles to sleep for 1 second // We'll use frequency/instruction latency to estimate this num_cycles = 730000000/15; // Number of kernels to launch int max_kernels = 33; // Loop through number of kernels to launch, from 1 to num_kernels for(num_kernels=1; num_kernels<max_kernels; num_kernels++) { // Start timer gettimeofday(&start, NULL); for(i=0; i<=num_kernels; i++) { #pragma acc parallel async(i) vector_length(1) num_gangs(1) { uint64_t cycles; #pragma acc loop seq for(cycles=0; cycles<num_cycles; cycles++) { cycles = cycles; } } } // Wait for all async streams to complete #pragma acc wait // Print seconds ellapsed gettimeofday(&stop, NULL); double seconds; seconds = (stop.tv_sec - start.tv_sec); seconds += (stop.tv_usec - start.tv_usec) / 1000000.0; printf("Total time for %d kernels: %f s\n", num_kernels, seconds); } return 0; }
Note: Adding the volatile qualifier to the declaration of cycles should force the loop to be kept; however, the volatile qualifier is currently not supported in PGI accelerator regions.
Compile:
PGI
$ module load cudatoolkit
$ cc -acc launcher.c -o acc.out
Cray
$ module load cudatoolkit
$ cc -hpragma=acc launcher.c -o acc.out
Run:
$ module load cudatoolkit
$ aprun ./acc.out
Fortran
The Serial, OpenMP, and MPI Fortran samples make use of the following CUDA Fortran GPU wrapper code. The wrapper provides the following functions:
sleep(): GPU kernel to sleep for a given number of clock cycles
get_cycles(): Return the number of cycles required to sleep for the requested number of seconds
create_streams(): Create a given number of non-default streams
sleep_kernel(): Launch a single GPU thread that sleeps for a given number of cycles
wait_for_streams(): Wait for streams 1 through the number of streams specified to complete any work
destroy_streams(): Destroy non-default streams
module sleep use cudadevice use cudafor implicit none integer, dimension(:), allocatable :: streams contains !CUDA kernel to pause for at least num_cycle cycles attributes(global) subroutine sleep(num_cycles) integer(8), value :: num_cycles integer(8) :: cycles integer(8) :: start cycles = 0 start = clock64 do while (cycles < num_cycles) cycles = clock64 - start enddo end subroutine sleep !Returns number of cycles required for requested seconds integer(8) function get_cycles(seconds) result(num_cycles) real(8), intent(in) :: seconds integer(8) :: istat, Hz type(cudadeviceprop) :: prop istat = cudaGetDeviceProperties(prop, 0) Hz = prop%clockRate * 1000 num_cycles = seconds * Hz end function get_cycles !Create streams subroutine create_streams(num_streams) integer :: num_streams, istat, i allocate(streams(num_streams+1)) streams(1) = 0 ! Primer kernel launch call sleep<<< 1, 1 >>>(int8(1)); do i=2,num_streams+1 istat = cudaStreamCreate(streams(i)) enddo end subroutine create_streams !Launches a kernel that sleeps for num_cycles subroutine sleep_kernel(num_cycles, stream_id) integer(8) :: num_cycles integer :: stream_id type(dim3) :: blockSize, gridSize blockSize = dim3(1,1,1) gridSize = dim3(1,1,1) call sleep<<<gridSize, blockSize, 0, streams(stream_id)>>>(num_cycles) end subroutine sleep_kernel ! Wait for stream to complete subroutine wait_for_stream(stream_id) integer :: stream_id, istat istat = cudaStreamSynchronize(streams(stream_id)) end subroutine wait_for_stream ! Wait for streams to complete subroutine wait_for_streams(num_streams) integer :: num_streams, istat, i do i=2,num_streams+1 istat = cudaStreamSynchronize(streams(i)) enddo end subroutine wait_for_streams ! Destroy streams subroutine destroy_streams(num_streams) integer :: num_streams, i, istat do i=2,num_streams+1 istat = cudaStreamDestroy(streams(i)) enddo deallocate(streams) end subroutine destroy_streams end module sleep
Serially Launched Concurrent Kernels in Fortran
The following sample launches multiple GPU kernels that sleep a single GPU thread for one second. GPU kernel launches are asynchronous on the CPU, which allows a do loop to be used to launch concurrent kernels. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.
launcher.f90
program main use sleep integer(8) :: cycles integer :: max_kernels, num_kernels, i real(8) :: start, stop, seconds ! Get number of cycles to sleep for 1 second seconds = 1.0 cycles = get_cycles(seconds) ! Maximum number of kernels to launch max_kernels = 32 ! Loop through number of kernels to launch, from 1 to num_kernels do num_kernels = 1, max_kernels ! Start timer call cpu_time(start) ! Create streams call create_streams(num_kernels) ! Launch num_kernel kernels asynchrnously do i = 2, num_kernels+1 call sleep_kernel(cycles, i) enddo ! Wait for kernels to complete and clean up streams call destroy_streams(num_kernels) ! Stop timer call cpu_time(stop) print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds' enddo end program main
Compile:
PGI
$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o serial.out
OpenMP Launched Concurrent Kernels in Fortran
The following example uses OpenMP to launch multiple sleep kernels in a parallel region, using the OMP thread number to specify which stream to use. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.
launcher.f90
program main use sleep use omp_lib implicit none integer(8) :: cycles integer :: max_kernels, num_kernels, stream_id real(8) :: start, stop, seconds ! Get number of cycles to sleep for 1 second seconds = 1.0 cycles = get_cycles(seconds) ! Maximum number of kernels to launch max_kernels = 33 ! Loop through number of kernels to launch, from 1 to num_kernels do num_kernels = 1, max_kernels ! Set number of OMP threads call omp_set_num_threads(num_kernels) ! Start timer start = omp_get_wtime() ! Create streams call create_streams(num_kernels) ! Launch num_kernel kernels asynchrnously !$omp parallel private(stream_id) firstprivate(cycles) stream_id = omp_get_thread_num()+2 call sleep_kernel(cycles, stream_id) !$omp end parallel ! Wait for kernels to complete and clean up streams call destroy_streams(num_kernels) ! Stop timer stop = omp_get_wtime() print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds' enddo end program main
Compile:
PGI
$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o openmp.out
MPI Launched Concurrent Kernels in Fortran
The following example uses MPI to launch a single GPU kernel per MPI process into the default stream. Titan’s GPU compute mode does not allow multiple processes to access the GPU simultaneously, so the CRAY_CUDA_PROXY must be enabled. When enabled, the Cray CUDA proxy works in conjunction with Hyper-Q to allow kernels from up to 32 different MPI processes to run concurrently. Note, however, that Titan currently allows a maximum of 16 MPI processes per node.
launcher.f90
program main use sleep implicit none include 'mpif.h' integer :: max_kernels, num_kernels, i, ierr, rank, size integer(8) :: cycles real(8) :: start, stop, seconds call MPI_Init(ierr) ! Get number of cycles to sleep for 1 second seconds = 1.0 cycles = get_cycles(seconds) call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr) call MPI_Comm_size(MPI_COMM_WORLD, size, ierr) ! Number of kernels to launch max_kernels = size ! Setup default stream in sleep.cu wrapper call create_streams(0); ! Loop through number of kernels to launch, from 1 to max_kernels do num_kernels = 1, max_kernels ! Start timer call MPI_Barrier(MPI_COMM_WORLD, ierr) if (rank == 0) then start = MPI_Wtime() endif ! Launch num_kernel kernels asynchrnously if (rank < num_kernels) then call sleep_kernel(cycles, 1) endif ! Wait for all ranks to submit kernel call MPI_Barrier(MPI_COMM_WORLD, ierr) ! Wait for kernel to complete if(rank < num_kernels) then call wait_for_stream(0) endif ! Wait for all ranks to complete call MPI_Barrier(MPI_COMM_WORLD, ierr) ! Print seconds ellapsed if (rank == 0) then stop = MPI_Wtime() print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds' endif enddo ! clean up array in wrapper, no stream actually destroyed call destroy_streams(0) call MPI_Finalize(ierr) end program main
Compile:
PGI
$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o proxy.out
Run:
$ module load cudatoolkit
$ export CRAY_CUDA_PROXY=1
$ aprun -n16 ./proxy.out
OpenACC Launched Concurrent Kernels in Fortran
By default, OpenACC blocks on the CPU when launching kernels or performing data movement. The async(i) clause allows asynchronous kernel launches and data transfers. Although the behavior is implementation dependent, the integer argument i will generally determine which stream the kernel or data operation is launched into. At the time of this writing each implementation handles the integer argument slightly differently.
launcher.f90
program main implicit none integer(8) :: num_cycles, cycles, i integer :: max_kernels, num_kernels, stream_id real(8) :: start, stop, seconds real(4) :: foo(33) ! Set number of cycles to sleep for 1 second ! We'll use frequency/instruction latency to estimate this num_cycles = 730000000/(15*5) ! Maximum number of kernels to launch max_kernels = 33 ! Loop through number of kernels to launch, from 1 to num_kernels do num_kernels = 1, max_kernels ! Start timer call cpu_time(start) ! Launch num_kernel kernels asynchrnously do i = 1, num_kernels !$acc parallel async(i) vector_length(1) num_gangs(1) copy(foo) !$acc loop seq do cycles = 1, num_cycles foo(i) = sin(1.5708) enddo !$acc end loop !$acc end parallel enddo ! Wait for all async streams to complete !$acc wait ! Stop timer call cpu_time(stop) print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds' enddo end program main
Note: Adding the volatile qualifier to the declaration of foo should force the loop to be kept; however, the volatile qualifier is currently not supported in PGI accelerator regions.
Compile:
PGI
$ module load cudatoolkit
$ ftn -acc launcher.f90 -o acc.out
Cray
$ module load cudatoolkit
$ ftn -hacc launcher.f90 -o acc.out
Run:
$ module load cudatoolkit
$ aprun ./acc.out
Cray’s implementation of MPICH2 allows GPU memory buffers to be passed directly to MPI function calls, eliminating the need to manually copy GPU data to the host before passing data through MPI. Several examples of using this feature are given below. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.
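Conceptually, the device buffer can be handed to MPI directly instead of being staged through a host copy first. A minimal sketch of the difference, assuming illustrative buffer names d_buf and h_buf and message parameters (run with at least two ranks; the direct path requires MPICH_RDMA_ENABLED_CUDA=1):

/* Sketch: the same send done two ways -- staged through the host, and passed
 * directly from device memory. */
#include <stdlib.h>
#include <cuda_runtime.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1024;
    int *h_buf = (int*)malloc(count*sizeof(int));
    int *d_buf;
    cudaMalloc(&d_buf, count*sizeof(int));
    cudaMemset(d_buf, 0, count*sizeof(int));

    if (rank == 0) {
        /* Traditional path: copy device data to the host, then send. */
        cudaMemcpy(h_buf, d_buf, count*sizeof(int), cudaMemcpyDeviceToHost);
        MPI_Send(h_buf, count, MPI_INT, 1, 0, MPI_COMM_WORLD);

        /* GPUDirect path: hand the device pointer straight to MPI. */
        MPI_Send(d_buf, count, MPI_INT, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(h_buf, count, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(d_buf, count, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(h_buf);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}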
How to Enable
To enable GPUDirect the following steps must be taken before invoking aprun:
$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
Examples
The following examples will initialize memory on the GPU and then perform an MPI_Allgather operation between GPUs.
For these examples use two nodes and request an interactive job.
$ qsub -I -A PROJ### -lnodes=2,walltime=00:30:00
The executables must be placed in /tmp/work/$USER to run properly on the GPU.
CUDA C
direct.cpp
#include <stdio.h> #include <stdlib.h> #include <cuda_runtime.h> #include <mpi.h> int main( int argc, char** argv ) { MPI_Init (&argc, &argv); int direct; int rank, size; int *h_buff = NULL; int *d_rank = NULL; int *d_buff = NULL; size_t bytes; int i; // Ensure that RDMA ENABLED CUDA is set correctly direct = getenv("MPICH_RDMA_ENABLED_CUDA")==NULL?0:atoi(getenv ("MPICH_RDMA_ENABLED_CUDA")); if(direct != 1){ printf ("MPICH_RDMA_ENABLED_CUDA not enabled!\n"); exit (EXIT_FAILURE); } // Get MPI rank and size MPI_Comm_rank (MPI_COMM_WORLD, &rank); MPI_Comm_size (MPI_COMM_WORLD, &size); // Allocate host and device buffers and copy rank value to GPU bytes = size*sizeof(int); h_buff = (int*)malloc(bytes); cudaMalloc(&d_buff, bytes); cudaMalloc(&d_rank, sizeof(int)); cudaMemcpy(d_rank, &rank, sizeof(int), cudaMemcpyHostToDevice); // Preform Allgather using device buffer MPI_Allgather(d_rank, 1, MPI_INT, d_buff, 1, MPI_INT, MPI_COMM_WORLD); // Check that the GPU buffer is correct cudaMemcpy(h_buff, d_buff, bytes, cudaMemcpyDeviceToHost); for(i=0; i<size; i++){ if(h_buff[i] != i) { printf ("Alltoall Failed!\n"); exit (EXIT_FAILURE); } } if(rank==0) printf("Success!\n"); // Clean up free(h_buff); cudaFree(d_buff); cudaFree(d_rank); MPI_Finalize(); return 0; }
Compiling CUDA C
For ease of compiling the GNU environment will be used.
$ module load cudatoolkit
$ module switch PrgEnv-pgi PrgEnv-gnu
$ CC -lcudart direct.cpp -o direct.out
Running CUDA C
$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ aprun -n2 -N1 ./direct.out
CUDA Fortran
direct.cuf
program GPUdirect use cudafor implicit none include 'mpif.h' integer :: direct character(len=255) :: env_var integer :: rank, size, ierror integer,dimension(:),allocatable :: h_buff integer,device :: d_rank integer,dimension(:),allocatable,device :: d_buff integer :: i call getenv("MPICH_RDMA_ENABLED_CUDA", env_var) read( env_var, '(i10)' ) direct if (direct .NE. 1) then print *, 'MPICH_RDMA_ENABLED_CUDA not enabled!' call exit(1) endif call MPI_INIT(ierror) ! Get MPI rank and size call MPI_COMM_RANK (MPI_COMM_WORLD, rank, ierror) call MPI_COMM_SIZE (MPI_COMM_WORLD, size, ierror) ! Initialize host and device buffers allocate(h_buff(size)) allocate(d_buff(size)) ! Implicity copy rank to device d_rank = rank ! Preform allgather using device buffers call MPI_ALLGATHER(d_rank, 1, MPI_INTEGER, d_buff, 1, MPI_INTEGER, MPI_COMM_WORLD, ierror); ! Check that buffer is correct h_buff = d_buff(1:size) do i=1,size if (h_buff(i) .NE. i-1) then print *, 'Alltoall Failed!' call exit(1) endif enddo if (rank .EQ. 0) then print *, 'Success!' endif ! Clean up deallocate(h_buff) deallocate(d_buff) call MPI_FINALIZE(ierror) end program GPUdirect
Compiling CUDA Fortran
$ module load cudatoolkit
$ ftn direct.cuf -o direct.out
Running CUDA Fortran
$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ aprun -n2 -N1 ./direct.out
OpenACC C
direct.c
#include <stdio.h> #include <stdlib.h> #include <mpi.h> int main( int argc, char** argv ) { MPI_Init (&argc, &argv); int direct; int rank, size; int *restrict buff = NULL; size_t bytes; int i; // Ensure that RDMA ENABLED CUDA is set correctly direct = getenv("MPICH_RDMA_ENABLED_CUDA")==NULL?0:atoi(getenv ("MPICH_RDMA_ENABLED_CUDA")); if(direct != 1){ printf ("MPICH_RDMA_ENABLED_CUDA not enabled!\n"); exit (EXIT_FAILURE); } // Get MPI rank and size MPI_Comm_rank (MPI_COMM_WORLD, &rank); MPI_Comm_size (MPI_COMM_WORLD, &size); // Initialize buffer bytes = size*sizeof(int); buff = (int*)malloc(bytes); // Copy buff to device at start of region and back to host and end of region #pragma acc data copy(rank, buff[0:size]) { // Inside this region the device data pointer will be used #pragma acc host_data use_device(rank, buff) { MPI_Allgather(&rank, 1, MPI_INT, buff, 1, MPI_INT, MPI_COMM_WORLD); } } // Check that buffer is correct for(i=0; i<size; i++){ if(buff[i] != i) { printf ("Alltoall Failed!\n"); exit (EXIT_FAILURE); } } if(rank==0) printf("Success!\n"); // Clean up free(buff); MPI_Finalize(); return 0; }
Compiling OpenACC C
CRAY
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ cc -hpragma=acc direct.c -o direct.out
PGI
$ module load cudatoolkit
$ cc -acc -lcudart direct.c -o direct.out
Running OpenACC C
CRAY
$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ aprun -n2 -N1 ./direct.out
PGI
$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ aprun -n2 -N1 ./direct.out
OpenACC Fortran
direct.f90
program GPUdirect include 'mpif.h' integer :: direct character(len=255) :: env_var integer :: rank, size, ierror integer,dimension(:),allocatable :: buff integer :: i call getenv("MPICH_RDMA_ENABLED_CUDA", env_var) read( env_var, '(i10)' ) direct if (direct .NE. 1) then print *, 'MPICH_RDMA_ENABLED_CUDA not enabled!' call exit(1) endif call MPI_INIT(ierror) ! Get MPI rank and size call MPI_COMM_RANK (MPI_COMM_WORLD, rank, ierror) call MPI_COMM_SIZE (MPI_COMM_WORLD, size, ierror) ! Initialize buffer allocate(buff(size)) ! Copy buff to device at start of region and back to host and end of region !$acc data copy(rank, buff(1:size)) ! Inside this region the device data pointer will be used !$acc host_data use_device(rank, buff) ! Preform all to all using device buffer call MPI_ALLGATHER(rank, 1, MPI_INT, buff, 1, MPI_INT, MPI_COMM_WORLD, ierror); !$acc end host_data !$acc end data ! Check that buffer is correct do i=1,size if (buff(i) .NE. i-1) then print *, 'Alltoall Failed!' call exit(1) endif enddo if (rank .EQ. 0) then print *, 'Success!' endif ! Clean up deallocate(buff) call MPI_FINALIZE(ierror) end program GPUdirect
Compiling OpenACC Fortran
CRAY
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ ftn -hacc direct.f90 -o direct.out
PGI
$ module load cudatoolkit
$ ftn -acc -lcudart direct.f90 -o direct.out
Running OpenACC Fortran
CRAY
$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ aprun -n2 -N1 ./direct.out
PGI
$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ aprun -n2 -N1 ./direct.out
Optimizations
Several optimizations for improving performance are given below. These optimizations are highly application dependent and may require some trial and error tuning to achieve best results.
Pipelining
Pipelining allows for overlapping of GPU to GPU MPI messages and may improve message passing performance for large bandwidth bound messages. Setting the environment variable MPICH_G2G_PIPELINE=N allows a maximum of N GPU to GPU messages to be in flight at any given time. The default value of MPICH_G2G_PIPELINE is 16 and messages under 8 Kilobytes in size are never pipelined.
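For example, to allow up to 64 GPU to GPU messages in flight when running one of the examples above (the value 64 is purely illustrative):

$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ export MPICH_G2G_PIPELINE=64
$ aprun -n2 -N1 ./direct.out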
Nemesis
Applications using asynchronous MPI calls may benefit from enabling the MPICH asynchronous progress feature. Setting the MPICH_NEMESIS_ASYNC_PROGRESS=1 environment variable enables additional threads to be spawned to progress the MPI state.
This feature requires that the thread level be set to multiple: MPICH_MAX_THREAD_SAFETY=multiple.
This feature works best when used in conjunction with core specialization: aprun -r N, which allows for N CPU cores to be reserved for system services.
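A sketch of enabling asynchronous progress for one of the examples above, together with the required thread level and core specialization (reserving one core per node here is illustrative):

$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ export MPICH_NEMESIS_ASYNC_PROGRESS=1
$ export MPICH_MAX_THREAD_SAFETY=multiple
$ aprun -r1 -n2 -N1 ./direct.out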
The default GPU compute mode for Titan is Exclusive Process. In this mode, many threads within a process may access the GPU context. To allow multiple processes access to the GPU context, such as multiple MPI tasks on a single node accessing the GPU, the CUDA proxy server was developed. Once enabled, the proxy server transparently manages work issued to the GPU context from multiple processes. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.
How to Enable
To enable the proxy server the following steps must be taken before invoking aprun:
$ export CRAY_CUDA_MPS=1
Issues
Currently, GPU debugging and profiling are not supported when the proxy is enabled. Specifying the qsub flag -lfeature=gpudefault will switch the compute mode from exclusive process to the CUDA default mode. In the default mode, debugging and profiling are available and multiple MPI ranks will be able to access the GPU. The default compute mode is not recommended on Titan: in the default compute mode approximately 120 MB of device memory is used per process accessing the GPU, and inconsistent behavior may be encountered under certain conditions.
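For example, an interactive job requesting the default compute mode could be submitted as follows (not recommended for production use, as noted above):

$ qsub -I -A PROJ### -lnodes=1,walltime=00:30:00 -lfeature=gpudefault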
Examples
The following examples will demonstrate when and how to use the CUDA proxy. In each example several kernels will be launched to demonstrate use of the CUDA proxy. These examples will be used in a future tutorial to investigate how multiple kernels are scheduled on the GPU.
For these examples use a single node and request an interactive job.
$ qsub -I -A PROJ### -lnodes=1,walltime=00:30:00
C Wrapper
The following C wrapper functions will be used to setup and launch a kernel that sleeps a single GPU thread for a specified number of seconds. This file will be compiled with NVCC and then linked into our examples.
pauseWrapper.cu
#include <stdio.h> #include <stdlib.h> #include <time.h> // CUDA kernel to pause for at least num_cycle cycles __global__ void sleep(int64_t num_cycles) { int64_t cycles = 0; int64_t start = clock64(); while(cycles < num_cycles) { cycles = clock64() - start; } } // Returns number of cycles required for requested seconds extern "C" int64_t get_cycles(int seconds) { // Get device frequency in Hz int64_t Hz; cudaDeviceProp prop; cudaGetDeviceProperties(&prop, 0); Hz = int64_t(prop.clockRate) * 1000; // Calculate number of cycles to wait int64_t num_cycles; num_cycles = seconds * Hz; return num_cycles; } // Launches a kernel that sleeps for at least num_cycles extern "C" void sleep_kernel(int64_t num_cycles) { // Our kernel will launch a single thread to sleep the kernel int blockSize, gridSize; blockSize = 1; gridSize = 1; // Execute the kernel sleep<<<gridSize, blockSize>>>(num_cycles); } // Wait for all pending GPU transactions to end extern "C" void wait_for_gpu() { cudaDeviceSynchronize(); }
Compiling
The following will compile the wrapper functions into object code to be linked in with our examples.
$ module load cudatoolkit
$ nvcc -c -arch=sm_35 pauseWrapper.cu
Multiple kernels single process
Launching multiple kernels from a single process does not require the CUDA proxy.
launcher.cpp
#include <stdio.h> #include <stdlib.h> #include <stdint.h> extern "C" int64_t get_cycles(int seconds); extern "C" void sleep_kernel(int64_t num_cycles); extern "C" void wait_for_gpu(); int main(int argc, char *argv[]) { // Get number of cycles to sleep for 1 second uint64_t cycles; cycles = get_cycles(1); // Number of kernels to launch int num_kernels = 14; // Launch num_kernel kernels asynchrnously for(int i=0; i<num_kernels; i++){ sleep_kernel(cycles); } // Wait for the kernel to complete wait_for_gpu(); return 0; }
Compiling and running multiple kernels from a single process
$ CC pauseWrapper.o launcher.cpp -o sleep.out
$ aprun -n1 ./sleep.out
Multiple kernels launched from multiple threads
Launching multiple kernels from multiple threads spawned from a single process does not require the CUDA proxy.
launcherOMP.cpp
#include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <omp.h> extern "C" int64_t get_cycles(int seconds); extern "C" void sleep_kernel(int64_t num_cycles); extern "C" void wait_for_gpu(); int main(int argc, char *argv[]) { // Get number of cycles to sleep for 1 second uint64_t cycles; cycles = get_cycles(1); // Number of kernels to launch int num_kernels = 14; // Launch kernel omp_set_num_threads(num_kernels); #pragma omp parallel for shared(cycles) num_threads(num_kernels) for(int i=0; i<num_kernels; i++){ sleep_kernel(cycles); } // Wait for the kernel to complete wait_for_gpu(); }
Compiling and running multiple kernels launched from multiple threads
$ CC pauseWrapper.o launcherOMP.cpp -o sleep.out
$ aprun -n1 -d14 ./sleep.out
Multiple kernels launched from multiple MPI tasks
Launching multiple kernels from multiple MPI tasks on a single node does require the CUDA proxy.
launcherMPI.cpp
#include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <mpi.h> extern "C" int64_t get_cycles(int seconds); extern "C" void sleep_kernel(int64_t num_cycles); extern "C" void wait_for_gpu(); int main(int argc, char *argv[]) { MPI_Init(&argc, &argv); int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank); // Get number of cycles to sleep for 1 second uint64_t cycles; cycles = get_cycles(1); // Sleep kernel for 1 second sleep_kernel(cycles); // Wait for the kernel to complete wait_for_gpu(); MPI_Finalize(); return 0; }
Compiling and running multiple kernels launched from multiple MPI tasks
$ CC pauseWrapper.o launcherMPI.cpp -o sleep.out
$ export CRAY_CUDA_MPS=1
$ aprun -n14 -N14 ./sleep.out
This tutorial describes how to use PETSc for domain decomposition to parallelize (with MPI) a solver for the simple advection equation (1.1).
$$\frac{\partial D}{\partial t} + V_x \frac{\partial D}{\partial x} + V_y \frac{\partial D}{\partial y} + V_z \frac{\partial D}{\partial z} = 0 \qquad (1.1)$$

Equation 1.1
In equation (1.1), D is a scalar density and V = { Vx,Vy,Vz } is a velocity field. We assume only Vx is non-zero and constant. A simple time-explicit scheme for solving equation (1.1) is given by (1.2)
$$D^{n+1}_{i} = D^{n}_{i} - \frac{\Delta t}{\Delta x}\left( F^{n}_{i+1/2} - F^{n}_{i-1/2} \right) \qquad (1.2)$$

Equation 1.2
where, for example, the “upwind flux” gives (1.3)
$$F^{n}_{i+1/2} = V_x\, D^{n}_{i} \quad \text{for } V_x > 0 \qquad (1.3)$$

Equation 1.3
Thus, the numerical scheme couples nearest neighbors, which means that data must be exchanged between MPI tasks during the evolution. Although the equation solved in this tutorial example is simple, the strategy outlined here can easily be generalized to more complex problems using more advanced numerical methods. The example code provided (with the driver Advection3D.F90) performs three main tasks: (1) it creates a distributed, logically Cartesian grid/mesh (in MeshModule.F90; Section 4), (2) it creates physical fields associated with each grid cell (in FieldsModule.F90; Section 5), and (3) it evolves the simple advection equation on the distributed grid (in Advection3D.F90 and AdvectionModule.F90). The main data structure used by the program (UniverseType, defined in UniverseModule.F90) holds one data structure for the mesh (MeshType, defined in MeshModule.F90) and one data structure for the physical fields (FieldsType, defined in FieldsModule.F90). The actual time stepping loop is in the main driver (Advection3D.F90).

Additional PETSc examples can be found in the following directories of the PETSc source tree:
• $PETSC_DIR/source/src/dm/examples/tutorials/
• $PETSC_DIR/source/src/ksp/ksp/examples/tutorials/
• $PETSC_DIR/source/src/mat/examples/tutorials/
• $PETSC_DIR/source/src/snes/examples/tutorials/
• $PETSC_DIR/source/src/sys/examples/tutorials/
• $PETSC_DIR/source/src/ts/examples/tutorials/
• $PETSC_DIR/source/src/vec/vec/examples/tutorials/
In the following sections we describe the individual modules of a Fortran code that uses PETSc for domain decomposition. In each section we provide links to PETSc documentation for all the PETSc routines used by the code. The user is encouraged to download, compile, and run the provided code.
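As a rough sketch, on a Cray system the example might be built and run as follows (the module name, compilation order, and rank count are assumptions; see the OLCF GitHub repository for the actual build instructions):

$ module load cray-petsc
$ ftn -c MeshModule.F90 FieldsModule.F90 AdvectionModule.F90 UniverseModule.F90
$ ftn MeshModule.o FieldsModule.o AdvectionModule.o UniverseModule.o Advection3D.F90 -o Advection3D.out
$ aprun -n8 ./Advection3D.out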
Please direct any questions or comments to help@nccs.gov.
Advection3D.F90
Mesh and Data Structures
The call to CreateUniverse creates the main data structure UniverseType. PETSc (and MPI) are also initialized in CreateUniverse. It should be the first call in any program using the data structure defined by UniverseType. Notable input parameters are the number of mesh cells in each spatial dimension, the inner and outer boundaries of the computational domain, and the type of boundary conditions used. BoundaryConditions are of type DMDABoundaryType, and PETSc supports DMDA_BOUNDARY_NONE, DMDA_BOUNDARY_GHOST, and DMDA_BOUNDARY_PERIODIC.
! Program solves dD/dt+Vx*dD/dx+Vy*dD/dy+Vz*dD/dz=0 on periodic unit domain ! with constant advection velocity (V1, V2, V3) = (1, 0, 0), ! and initial condition D = Sin(2 * Pi * X1) ! Create Mesh and Data Structures to Hold Physical Fields: nZones(1:3) = 256 CALL CreateUniverse( & U, Name = 'Advection3D', & nZones & = nZones, & InnerBoundaries & = (/ Zero, Zero, Zero /), & OuterBoundaries & = (/ One, One, One /), & BoundaryConditions & = (/ DMDA_BOUNDARY_PERIODIC, & DMDA_BOUNDARY_PERIODIC, & DMDA_BOUNDARY_PERIODIC /))
The call to DestroyUniverse frees up memory allocated by CreateUniverse and finalizes PETSc (and MPI) to end cleanly.
! Destroy Mesh and Data Structures that Hold Physical Fields:
CALL DestroyUniverse(U)
UniverseModule.F90
The following PETSc routines are called
• PetscInitialize
• PetscFinalize
• PetscGetTime
UniverseType
UniverseType contains Mesh (MeshType) and Fields (FieldsType), and additional rudimentary variables associated with parallel execution (e.g., the integer Communicator).
TYPE, PUBLIC :: UniverseType CHARACTER(80) :: & Name PetscInt :: & Communicator, & MPI_Rank, & N_MPI_Ranks PetscLogDouble :: & WallTimeInitialize, & WallTimeFinalize TYPE(MeshType), POINTER :: & Mesh ! TYPE(FieldsType), POINTER :: & Fields ! END TYPE UniverseType
CreateUniverse
CreateUniverse initializes PETSc (and MPI) with the call to PetscInitialize. It also initializes the rudimentary variables and creates the data structures MeshType and FieldsType with the calls to CreateMesh and CreateFields.
SUBROUTINE CreateUniverse(U, Name, & nZones, InnerBoundaries, OuterBoundaries, BoundaryConditions) TYPE(UniverseType), POINTER :: & U CHARACTER(len=*), INTENT(in) :: & Name PetscInt, DIMENSION(:), INTENT(in) :: & nZones PetscReal, DIMENSION(:), INTENT(in) :: & InnerBoundaries, & OuterBoundaries DMDABoundaryType, DIMENSION(:), INTENT(in) :: & BoundaryConditions PetscInt :: & Error CALL PetscInitialize(PETSC_NULL_CHARACTER, Error) ALLOCATE(U) U % Communicator = MPI_COMM_WORLD CALL MPI_Comm_Rank(U % Communicator, U % MPI_Rank, Error) CALL MPI_Comm_Size(U % Communicator, U % N_MPI_Ranks, Error) U % Name = Name CALL PetscGetTime(U % WallTimeInitialize, Error) IF(U % MPI_RANK == 0)THEN PRINT* PRINT*, "INFO: Creating Universe ", TRIM(U % Name) PRINT* PRINT*, " INFO: Number of MPI Ranks = ", U % N_MPI_Ranks END IF ALLOCATE(U % Mesh) CALL CreateMesh(U % Mesh, U % Communicator, U % MPI_Rank, & nZones, InnerBoundaries, OuterBoundaries, BoundaryConditions) ALLOCATE(U % Fields) CALL CreateFields(U % Fields, U % Mesh, U % Communicator, U % MPI_Rank) END SUBROUTINE CreateUniverse
The call to PetscGetTime is used to time the program (see DestroyUniverse).
DestroyUniverse
DestroyUniverse frees up memory allocated by CreateUniverse. In particular, Fields and Mesh are deallocated by calls to DestroyFields and DestroyMesh. DestroyUniverse also finalizes PETSc (and MPI) to exit cleanly.
SUBROUTINE DestroyUniverse(U) TYPE(UniverseType), POINTER :: & U PetscInt :: & Error IF(U % MPI_Rank == 0)THEN PRINT* PRINT*, "INFO: Destroying Universe ", TRIM(U % Name) END IF CALL DestroyFields(U % Fields) DEALLOCATE(U % Fields) CALL DestroyMesh(U % Mesh) DEALLOCATE(U % Mesh) CALL PetscGetTime(U % WallTimeFinalize, Error) IF(U % MPI_Rank == 0)THEN PRINT* PRINT*, " INFO: ", U % N_MPI_Ranks, " MPI Ranks Finished in ", & U % WallTimeFinalize - U % WallTimeInitialize, " Seconds" PRINT* END IF CALL PetscFinalize(Error) DEALLOCATE(U) END SUBROUTINE DestroyUniverse
MeshModule.F90
The following PETSc routines are called
• DMDACreate1D
• DMDACreate2D
• DMDACreate3D
• DMDAGetCoordinateDA
• DMCreateGlobalVector
• VecDuplicate
• DMDAGetCorners
• DMCreateLocalVector
• DMDAGetGhostCorners
• DMDAVecGetArrayF90
• DMDAVecRestoreArrayF90
• DMGlobalToLocalBegin
• DMGlobalToLocalEnd
• VecDestroy
• DMDestroy
X1,X2,X3 are integers used to access the mesh coordinates (see e.g., CreateMesh3D for usage).
! Coordinate Names
PetscInt, PUBLIC, PARAMETER :: &
  X1 = 0, &
  X2 = 1, &
  X3 = 2
MeshType
MeshType contains basic information about the mesh; e.g., dimensionality (nDimensions), number of zones in each spatial dimension (nZones), the coordinates of the inner and outer boundaries (InnerBoundaries and OuterBoundaries), the type of boundary condition used in each dimension (BoundaryType). StencilWidth is the width of the computational stencil and sets the depth of the ghost zone layer to be communicated. (StencilWidth=1 in the current example, but can be larger for higher-order methods.) StencilType can be of type DMDA_STENCIL_BOX or DMDA_STENCIL_STAR to accommodate different computational stencils (see note below). The main data structures in MeshType are MeshDA and CoordinateDA, both are DM objects that are used to manage data for a structured grid in 1, 2, or 3 dimensions. MeshType also contains Positions (PositionsType), which holds mesh coordinate values.
TYPE, PUBLIC :: MeshType PetscInt :: & nDimensions, & StencilWidth PetscInt, DIMENSION(:), ALLOCATABLE :: & nZones PetscReal, DIMENSION(:), ALLOCATABLE :: & InnerBoundaries, & OuterBoundaries DMDAStencilType :: & StencilType DMDABoundaryType, DIMENSION(:), ALLOCATABLE :: & BoundaryType DM :: & MeshDA, & CoordinateDA TYPE(PositionsType), POINTER :: & Positions END TYPE MeshType
PositionsType
PositionsType contains distributed vectors (of type Vec) to hold position coordinates. InnerEdge, Center, and OuterEdge hold the inner, center, and outer coordinates of a computational mesh cell. Vectors with the “Global” suffix (e.g., InnerEdgeGlobal) are created with DMCreateGlobalVector and do not contain ghost zones. Vectors with the “Local” suffix (e.g., InnerEdgeLocal) are created with DMCreateLocalVector and contain ghost zones (see CreatePositions in Section 4e). The integers iBX(1:3) (“i Begin X”), iWX(1:3) (“i Width X”), and iEX(1:3) (“i End X”) are used to access specific cells on a mesh portion owned by a particular process. We have also defined the corresponding iBXG(1:3), iWXG(1:3), and iEXG(1:3) for the ghost domains. iBX and iWX, and iBXG and iWXG, are set with calls to DMDAGetCorners and DMDAGetGhostCorners (see Section 4e).
TYPE :: PositionsType PetscInt, DIMENSION(3) :: & iBX, & ! Inner Indices of Local Domain iWX, & ! Width of Local Domain iEX, & ! Outer Indices of Local Domain iBXG, & ! Inner Indices of Local Ghosted Domain iWXG, & ! Width of Local Ghosted Domain iEXG ! Outer Indices of Local Ghosted Domain Vec :: & InnerEdgeGlobal, & CenterGlobal, & OuterEdgeGlobal, & InnerEdgeLocal, & CenterLocal, & OuterEdgeLocal END TYPE PositionsType
CreateMesh
CreateMesh sets basic information about the mesh, and, depending on the dimensionality of the problem, calls CreateMesh1D, CreateMesh2D, or CreateMesh3D, which creates the data structures MeshDA and CoordinateDA and sets the position coordinates. After the global position coordinates are initialized (e.g., InnerEdgeGlobal), the calls DMGlobalToLocalBegin and DMGlobalToLocalEnd populate ghost values in local position coordinates (e.g., InnerEdgeLocal).
SUBROUTINE CreateMesh(M, Comm, Rank, & nZones, InnerBoundaries, OuterBoundaries, BoundaryConditions) TYPE(MeshType), POINTER :: & M PetscInt, INTENT(in) :: & Comm, & Rank PetscInt, DIMENSION(:), INTENT(in) :: & nZones PetscReal, DIMENSION(:), INTENT(in) :: & InnerBoundaries, & OuterBoundaries DMDABoundaryType, DIMENSION(:), INTENT(in) :: & BoundaryConditions PetscInt :: & iDim, & Error M % nDimensions = SIZE(nZones) ALLOCATE( M % nZones (M % nDimensions) ) ALLOCATE( M % InnerBoundaries(M % nDimensions) ) ALLOCATE( M % OuterBoundaries(M % nDimensions) ) ALLOCATE( M % BoundaryType (M % nDimensions) ) DO iDim = 1, M % nDimensions M % nZones(iDim) = nZones(iDim) M % InnerBoundaries(iDim) = InnerBoundaries(iDim) M % OuterBoundaries(iDim) = OuterBoundaries(iDim) M % BoundaryType(iDim) = BoundaryConditions(iDim) END DO M % StencilWidth = 1 M % StencilType = DMDA_STENCIL_STAR IF(Rank == 0)THEN PRINT* PRINT*, " INFO: Creating Mesh" PRINT*, " Dimensionality = ", M % nDimensions PRINT*, " Stencil Width = ", M % StencilWidth DO iDim = 1, M % nDimensions PRINT*, " iDim, nZones = ", iDim, M % nZones(iDim) PRINT*, " Inner Boundary = ", M % InnerBoundaries(iDim) PRINT*, " Outer Boundary = ", M % OuterBoundaries(iDim) END DO END IF SELECT CASE (M % nDimensions) CASE (1) CALL CreateMesh1D(M, Comm) CASE (2) CALL CreateMesh2D(M, Comm) CASE (3) CALL CreateMesh3D(M, Comm) END SELECT ! Populate Ghost Zones with Coordinates: CALL DMGlobalToLocalBegin( & M % CoordinateDA, & M % Positions % InnerEdgeGlobal, INSERT_VALUES, & M % Positions % InnerEdgeLocal, Error) CALL DMGlobalToLocalEnd( & M % CoordinateDA, & M % Positions % InnerEdgeGlobal, INSERT_VALUES, & M % Positions % InnerEdgeLocal, Error) CALL DMGlobalToLocalBegin( & M % CoordinateDA, & M % Positions % OuterEdgeGlobal, INSERT_VALUES, & M % Positions % OuterEdgeLocal, Error) CALL DMGlobalToLocalEnd( & M % CoordinateDA, & M % Positions % OuterEdgeGlobal, INSERT_VALUES, & M % Positions % OuterEdgeLocal, Error) CALL DMGlobalToLocalBegin( & M % CoordinateDA, & M % Positions % CenterGlobal, INSERT_VALUES, & M % Positions % CenterLocal, Error) CALL DMGlobalToLocalEnd( & M % CoordinateDA, & M % Positions % CenterGlobal, INSERT_VALUES, & M % Positions % CenterLocal, Error) END SUBROUTINE CreateMesh
CreateMesh3D
CreateMesh3D creates a three-dimensional mesh. The call to the PETSc routine DMDACreate3D creates the object MeshDA (of type DM) to manage the communication of three-dimensional data that is distributed across processors (see the provided code for the corresponding 1D and 2D versions). The call to DMDAGetCoordinateDA initializes CoordinateDA (also of type DM), which is used to scatter between global and local coordinates and to access the coordinate vectors defined in PositionsType. The call to CreatePositions creates/allocates the position coordinate vectors and initializes indices used to access specific cells in the mesh owned by a particular process. To fill in values in a global vector (e.g., InnerEdgeGlobal) we must call DMDAVecGetArrayF90, which returns a multi-dimensional array (e.g., InnerEdge) that shares data with the underlying vector. In the 3D case, InnerEdge is a rank-four array, where the first index denotes the coordinate (i.e., X1, X2, or X3) and the following three indices denote the logical position in the mesh. After the position coordinates have been set and access to InnerEdge is no longer needed, one must call DMDAVecRestoreArrayF90.
SUBROUTINE CreateMesh3D(M, Comm) TYPE(MeshType), POINTER :: & M PetscInt, INTENT(in) :: & Comm PetscInt :: & Error, & iX1, iX2, iX3 PetscReal, DIMENSION(3) :: & dX PetscReal, DIMENSION(:,:,:,:), POINTER :: & InnerEdge, & OuterEdge, & Center CALL DMDACreate3D( & Comm, & M % BoundaryType(1), M % BoundaryType(2), M % BoundaryType(3), & M % StencilType, & M % nZones(1), M % nZones(2), M % nZones(3), & PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE, & 1, M % StencilWidth, & PETSC_NULL_INTEGER, PETSC_NULL_INTEGER, PETSC_NULL_INTEGER, & M % MeshDA, Error) CALL DMDAGetCoordinateDA( & M % MeshDA, M % CoordinateDA, Error) ! ! Create vectors to hold coordinate values: ! ALLOCATE(M % Positions) CALL CreatePositions(M % Positions, M % nDimensions, M % CoordinateDA) ! ! Fill in coordinate values: ! CALL DMDAVecGetArrayF90( & M % CoordinateDA, M % Positions % InnerEdgeGlobal, InnerEdge, Error) CALL DMDAVecGetArrayF90( & M % CoordinateDA, M % Positions % OuterEdgeGlobal, OuterEdge, Error) CALL DMDAVecGetArrayF90( & M % CoordinateDA, M % Positions % CenterGlobal, Center, Error) ! Equidistant mesh for simplicity dX = ( M % OuterBoundaries - M % InnerBoundaries ) / REAL( M % nZones ) DO iX3 = M % Positions % iBX(3), M % Positions % iEX(3) DO iX2 = M % Positions % iBX(2), M % Positions % iEX(2) DO iX1 = M % Positions % iBX(1), M % Positions % iEX(1) InnerEdge(X1, iX1, iX2, iX3) & = M % InnerBoundaries(1) + dX(1) * iX1 OuterEdge(X1, iX1, iX2, iX3) & = InnerEdge(X1, iX1, iX2, iX3) + dX(1) Center (X1, iX1, iX2, iX3) & = Half * ( InnerEdge(X1, iX1, iX2, iX3) & + OuterEdge(X1, iX1, iX2, iX3) ) InnerEdge(X2, iX1, iX2, iX3) & = M % InnerBoundaries(2) + dX(2) * iX2 OuterEdge(X2, iX1, iX2, iX3) & = InnerEdge(X2, iX1, iX2, iX3) + dX(2) Center (X2, iX1, iX2, iX3) & = Half * ( InnerEdge(X2, iX1, iX2, iX3) & + OuterEdge(X2, iX1, iX2, iX3) ) InnerEdge(X3, iX1, iX2, iX3) & = M % InnerBoundaries(3) + dX(3) * iX3 OuterEdge(X3, iX1, iX2, iX3) & = InnerEdge(X3, iX1, iX2, iX3) + dX(3) Center (X3, iX1, iX2, iX3) & = Half * ( InnerEdge(X3, iX1, iX2, iX3) & + OuterEdge(X3, iX1, iX2, iX3) ) END DO END DO END DO CALL DMDAVecRestoreArrayF90( & M % CoordinateDA, M % Positions % InnerEdgeGlobal, InnerEdge, Error) CALL DMDAVecRestoreArrayF90( & M % CoordinateDA, M % Positions % OuterEdgeGlobal, OuterEdge, Error) CALL DMDAVecRestoreArrayF90( & M % CoordinateDA, M % Positions % CenterGlobal, Center, Error) END SUBROUTINE CreateMesh3D
CreatePositions
CreatePositions creates global and local vectors containing position coordinates (e.g., InnerEdgeLocal and InnerEdgeGlobal). Vectors created with DMCreateGlobalVector do not have ghost points accessible. Vectors created with DMCreateLocalVector have accessible ghost points. Calls to DMDAGetCorners and DMDAGetGhostCorners return the indices of the lower left corner of the local region and the extent (width) of the local region.
SUBROUTINE CreatePositions(Positions, nDimensions, CoordinateDA) TYPE(PositionsType), POINTER :: & Positions PetscInt :: & nDimensions DM :: & CoordinateDA PetscInt :: & Error, & iDim ! Global (Non-Ghosted) Position Vectors: CALL DMCreateGlobalVector( & CoordinateDA, Positions % InnerEdgeGlobal, Error) CALL VecDuplicate( & Positions % InnerEdgeGlobal, Positions % CenterGlobal, Error) CALL VecDuplicate( & Positions % InnerEdgeGlobal, Positions % OuterEdgeGlobal, Error) Positions % iBX(:) = 0 Positions % iWX(:) = 1 Positions % iEX(:) = 0 ! Get Indices for Non-Ghosted Vectors: SELECT CASE (nDimensions) CASE (1) CALL DMDAGetCorners( & CoordinateDA, & Positions % iBX(1), PETSC_NULL_INTEGER, PETSC_NULL_INTEGER, & Positions % iWX(1), PETSC_NULL_INTEGER, PETSC_NULL_INTEGER, & Error) CASE (2) CALL DMDAGetCorners( & CoordinateDA, & Positions % iBX(1), Positions % iBX(2), PETSC_NULL_INTEGER, & Positions % iWX(1), Positions % iWX(2), PETSC_NULL_INTEGER, & Error) CASE(3) CALL DMDAGetCorners( & CoordinateDA, & Positions % iBX(1), Positions % iBX(2), Positions % iBX(3), & Positions % iWX(1), Positions % iWX(2), Positions % iWX(3), & Error) END SELECT DO iDim = 1, nDimensions Positions % iEX(iDim) & = Positions % iBX(iDim) & + Positions % iWX(iDim) - 1 END DO ! Local (Ghosted) Position Vectors: CALL DMCreateLocalVector( & CoordinateDA, Positions % InnerEdgeLocal, Error) CALL VecDuplicate( & Positions % InnerEdgeLocal, Positions % CenterLocal, Error) CALL VecDuplicate( & Positions % InnerEdgeLocal, Positions % OuterEdgeLocal, Error) Positions % iBXG(:) = 0 Positions % iWXG(:) = 1 Positions % iEXG(:) = 0 ! Get Indices for Ghosted Vectors: SELECT CASE (nDimensions) CASE (1) CALL DMDAGetGhostCorners( & CoordinateDA, & Positions % iBXG(1), PETSC_NULL_INTEGER, PETSC_NULL_INTEGER, & Positions % iWXG(1), PETSC_NULL_INTEGER, PETSC_NULL_INTEGER, & Error) CASE (2) CALL DMDAGetGhostCorners( & CoordinateDA, & Positions % iBXG(1), Positions % iBXG(2), PETSC_NULL_INTEGER, & Positions % iWXG(1), Positions % iWXG(2), PETSC_NULL_INTEGER, & Error) CASE (3) CALL DMDAGetGhostCorners( & CoordinateDA, & Positions % iBXG(1), Positions % iBXG(2), Positions % iBXG(3), & Positions % iWXG(1), Positions % iWXG(2), Positions % iWXG(3), & Error) END SELECT DO iDim = 1, nDimensions Positions % iEXG(iDim) & = Positions % iBXG(iDim) & + Positions % iWXG(iDim) - 1 END DO END SUBROUTINE CreatePositions
DestroyMesh
DestroyMesh frees memory allocated by CreateMesh
SUBROUTINE DestroyMesh(M) TYPE(MeshType), POINTER :: & M PetscInt :: & Error DEALLOCATE( M % nZones ) DEALLOCATE( M % InnerBoundaries ) DEALLOCATE( M % OuterBoundaries ) DEALLOCATE( M % BoundaryType ) CALL DestroyPositions(M % Positions) DEALLOCATE(M % Positions) CALL DMDestroy(M % MeshDA, Error) END SUBROUTINE DestroyMesh
DestroyPositions
DestroyPositions destroys the vectors created in CreatePositions
SUBROUTINE DestroyPositions(Positions) TYPE(PositionsType), POINTER :: & Positions PetscInt :: & Error CALL VecDestroy(Positions % InnerEdgeGlobal, Error) CALL VecDestroy(Positions % CenterGlobal, Error) CALL VecDestroy(Positions % OuterEdgeGlobal, Error) CALL VecDestroy(Positions % InnerEdgeLocal, Error) CALL VecDestroy(Positions % CenterLocal, Error) CALL VecDestroy(Positions % OuterEdgeLocal, Error) END SUBROUTINE DestroyPositions
FieldsModule.F90
The following PETSc routines are called
• DMDACreate3D
• DMCreateGlobalVector
• DMCreateLocalVector
• VecDestroy
• DMDestroy
D, V1, V2, V3 are integers used to access physical variables/fields (here representing the density and three components of velocity)
! Field Names
PetscInt, PUBLIC, PARAMETER :: &
  D = 0, &
  V1 = 1, &
  V2 = 2, &
  V3 = 3, &
  N_FIELDS = 4
FieldsType
FieldsType contains vectors to hold the physical fields: FieldsGlobal (non-ghosted) and FieldsLocal (ghosted). FieldsDA (of type DM) is the object used to manage data for a structured grid (e.g., exchange of data between processes). The integer DegreesOfFreedom holds the number of physical fields. It is used to create the FieldsDA object.
TYPE, PUBLIC :: FieldsType PetscInt :: & DegreesOfFreedom Vec :: & FieldsGlobal, & FieldsLocal DM :: & FieldsDA END TYPE FieldsType
CreateFields
CreateFields initializes and creates data structures defined in FieldsType. Depending on the dimensionality of the problem, it creates the FieldsDA object by calling either DMDACreate1D, DMDACreate2D, or DMDACreate3D. It also creates the global and local vectors to hold the physical fields. Note that DegreesOfFreedom > 1 in this case, which has consequences for the shape of the array returned by DMDAVecGetArrayF90 (see the initialization of FieldsGlobal in Advection3D.F90). For DegreesOfFreedom > 1 the first index in the array returned by DMDAVecGetArrayF90 is used to access specific fields. The remaining indices refer to the field’s logical position in the mesh.
SUBROUTINE CreateFields(F, M, Comm, Rank) TYPE(FieldsType), POINTER :: & F TYPE(MeshType), POINTER :: & M PetscInt, INTENT(in) :: & Comm, & Rank PetscInt :: & Error F % DegreesOfFreedom = N_FIELDS IF(Rank == 0)THEN PRINT* PRINT*, " INFO: Creating Fields" PRINT*, " Degrees of Freedom = ", F % DegreesOfFreedom END IF SELECT CASE (M % nDimensions) CASE (1) CALL DMDACreate1D( & Comm, & M % BoundaryType(1), & M % nZones(1), & F % DegreesOfFreedom, & M % StencilWidth, & PETSC_NULL_INTEGER, & F % FieldsDA, & Error) CASE (2) CALL DMDACreate2D( & Comm, & M % BoundaryType(1), & M % BoundaryType(2), & M % StencilType, & M % nZones(1), & M % nZones(2), & PETSC_DECIDE, & PETSC_DECIDE, & F % DegreesOfFreedom, & M % StencilWidth, & PETSC_NULL_INTEGER, & PETSC_NULL_INTEGER, & F % FieldsDA, & Error) CASE (3) CALL DMDACreate3D( & Comm, & M % BoundaryType(1), & M % BoundaryType(2), & M % BoundaryType(3), & M % StencilType, & M % nZones(1), & M % nZones(2), & M % nZones(3), & PETSC_DECIDE, & PETSC_DECIDE, & PETSC_DECIDE, & F % DegreesOfFreedom, & M % StencilWidth, & PETSC_NULL_INTEGER, & PETSC_NULL_INTEGER, & PETSC_NULL_INTEGER, & F % FieldsDA, & Error) END SELECT CALL DMCreateGlobalVector( & F % FieldsDA, F % FieldsGlobal, Error) CALL DMCreateLocalVector( & F % FieldsDA, F % FieldsLocal, Error) END SUBROUTINE CreateFields
DestroyFields
DestroyFields frees memory allocated by CreateFields
SUBROUTINE DestroyFields(F) TYPE(FieldsType), POINTER :: & F PetscInt :: & Error CALL VecDestroy(F % FieldsGlobal, Error) CALL VecDestroy(F % FieldsLocal, Error) CALL DMDestroy(F % FieldsDA, Error) END SUBROUTINE DestroyFields
Advection3D.F90 (continued)
The following PETSc routines are called
• VecDuplicate
• DMDAVecGetArrayF90
• DMDAVecRestoreArrayF90
• VecCopy
• DMGlobalToLocalBegin
• DMGlobalToLocalEnd
• VecDestroy
! Get Access to Mesh Coordinates: CALL DMDAVecGetArrayF90( & M % CoordinateDA, M % Positions % InnerEdgeGlobal, InnerEdge, Error) CALL DMDAVecGetArrayF90( & M % CoordinateDA, M % Positions % CenterGlobal, Center, Error) CALL DMDAVecGetArrayF90( & M % CoordinateDA, M % Positions % OuterEdgeGlobal, OuterEdge, Error)
! Release Access to Coordinates: CALL DMDAVecRestoreArrayF90( & M % CoordinateDA, M % Positions % InnerEdgeGlobal, InnerEdge, Error) CALL DMDAVecRestoreArrayF90( & M % CoordinateDA, M % Positions % CenterGlobal, Center, Error) CALL DMDAVecRestoreArrayF90( & M % CoordinateDA, M % Positions % OuterEdgeGlobal, OuterEdge, Error)
Set initial condition
To initialize FieldsGlobal, we first call DMDAVecGetArrayF90, which returns FieldsOld, a rank-four array (DegreesOfFreedom > 1) where the first dimension holds the individual fields and the next three dimensions hold the logical position in the mesh. After initialization of the data we call DMDAVecRestoreArrayF90.
! Get Access to Non Ghosted Fields Data: CALL DMDAVecGetArrayF90( & F % FieldsDA, F % FieldsGlobal, FieldsOld, Error) ! Set Initial Condition and Compute Time Step: MyTimeStep = Huge DO iX3 = iBX3, iEX3 DO iX2 = iBX2, iEX2 DO iX1 = iBX1, iEX1 FieldsOld(D, iX1, iX2, iX3) & = SIN( Two * Pi * Center(X1, iX1, iX2, iX3) ) FieldsOld(V1, iX1, iX2, iX3) & = One FieldsOld(V2, iX1, iX2, iX3) & = Zero FieldsOld(V3, iX1, iX2, iX3) & = Zero MyTimeStep & = MIN(MyTimeStep, & ( OuterEdge(X1, iX1, iX2, iX3) & - InnerEdge(X1, iX1, iX2, iX3) ) & / ( Tiny + FieldsOld(V1, iX1, iX2, iX3) ) ) MyTimeStep & = MIN(MyTimeStep, & ( OuterEdge(X2, iX1, iX2, iX3) & - InnerEdge(X2, iX1, iX2, iX3) ) & / ( Tiny + FieldsOld(V2, iX1, iX2, iX3) ) ) MyTimeStep & = MIN(MyTimeStep, & ( OuterEdge(X3, iX1, iX2, iX3) & - InnerEdge(X3, iX1, iX2, iX3) ) & / ( Tiny + FieldsOld(V3, iX1, iX2, iX3) ) ) END DO END DO END DO ! Release Access to Non-Ghosted Fields: CALL DMDAVecRestoreArrayF90( & F % FieldsDA, F % FieldsGlobal, FieldsOld, Error) ! Reduce Global Time Step: CALL MPI_ALLREDUCE( & MyTimeStep, TimeStep, 1, MPI_DOUBLE_PRECISION, MPI_MIN, & U % Communicator, Error) ! Populate Ghosted Fields with Initial Condition: CALL DMGlobalToLocalBegin( & F % FieldsDA, F % FieldsGlobal, INSERT_VALUES, F % FieldsLocal, Error) CALL DMGlobalToLocalEnd( & F % FieldsDA, F % FieldsGlobal, INSERT_VALUES, F % FieldsLocal, Error) ! Keep Copy of Initial Condition: CALL VecDuplicate(F % FieldsLocal, InitialCondition, Error) CALL VecCopy( F % FieldsLocal, InitialCondition, Error)
Evolve initial condition
The time step loop is organized in the following way to overlap communications and computations:
- Update cells whose data is communicated with neighboring processes (AdvectExchangeCells3D; see AdvectionModule.F90); red cells in the above figure
- Begin ghost exchange by calling DMGlobalToLocalBegin
- Update cells whose data is not communicated with neighboring processes (AdvectNonExchangeCells3D; see AdvectionModule.F90); green cells in the above figure
- Finish ghost exchange by calling DMGlobalToLocalEnd. Ghost cells are now populated with updated values, and ready for the next time step
! Begin Time Step Loop: Done = .FALSE. iCycle = 0 MaxCycles = 100000 Time = Zero EndTime = One DO WHILE (.NOT.Done) iCycle = iCycle + 1 IF(U % MPI_Rank == 0)THEN PRINT* PRINT*, " Cycle = ", iCycle PRINT*, " Time Step = ", TimeStep PRINT*, " Time = ", Time END IF ! Copy Ghosted Data to Local Work Vector: CALL VecCopy(F % FieldsLocal, FieldsScratch, Error) ! Get Access to Fields Data: CALL DMDAVecGetArrayF90( & F % FieldsDA, FieldsScratch, FieldsOld, Error) CALL DMDAVecGetArrayF90( & F % FieldsDA, F % FieldsGlobal, FieldsNew, Error) ! Update Cells Whose Data is Communicated: CALL AdvectExchangeCells3D(FieldsNew, FieldsOld, InnerEdge, OuterEdge, & TimeStep, iBX1, iBX2, iBX3, iEX1, iEX2, iEX3, SW) ! Begin Ghost Exchange: CALL DMGlobalToLocalBegin( & F % FieldsDA, F % FieldsGlobal, INSERT_VALUES, F % FieldsLocal, & Error) ! Update Cells Whose Data is Not Comminucated: CALL AdvectNonExchangeCells3D(FieldsNew, FieldsOld, InnerEdge, OuterEdge, & TimeStep, iBX1, iBX2, iBX3, iEX1, iEX2, iEX3, SW) ! Release Access to Fields Data: CALL DMDAVecRestoreArrayF90( & F % FieldsDA, FieldsScratch, FieldsOld, Error) CALL DMDAVecRestoreArrayF90( & F % FieldsDA, F % FieldsGlobal, FieldsNew, Error) ! End Ghost Exchange: CALL DMGlobalToLocalEnd( & F % FieldsDA, F % FieldsGlobal, INSERT_VALUES, F % FieldsLocal, & Error) ! Copy New to Old: CALL DMDAVecGetArrayF90( & F % FieldsDA, F % FieldsLocal, FieldsOld, Error) CALL DMDAVecGetArrayF90( & F % FieldsDA, F % FieldsGlobal, FieldsNew, Error) DO iX3 = iBX3, iEX3 DO iX2 = iBX2, iEX2 DO iX1 = iBX1, iEX1 DO iField = 0, DOF - 1 FieldsOld(iField, iX1, iX2, iX3) & = FieldsNew(iField, iX1, iX2, iX3) END DO END DO END DO END DO CALL DMDAVecRestoreArrayF90( & F % FieldsDA, F % FieldsLocal, FieldsOld, Error) CALL DMDAVecRestoreArrayF90( & F % FieldsDA, F % FieldsGlobal, FieldsNew, Error) Time = Time + TimeStep IF(iCycle >= MaxCycles .OR. Time >= EndTime) Done = .TRUE. END DO ! Destroy Local Scratch Vector: CALL VecDestroy(FieldsScratch, Error)
Compare result with initial condition
! Compare Evolved Solution with Initial Condition in the L1 Norm: CALL DMDAVecGetArrayF90( & F % FieldsDA, InitialCondition, FieldsOld, Error) CALL DMDAVecGetArrayF90( & F % FieldsDA, F % FieldsLocal, FieldsNew, Error) MyL1Norm = Zero DO iX3 = iBX3, iEX3 DO iX2 = iBX2, iEX2 DO iX1 = iBX1, iEX1 MyL1Norm & = MyL1Norm & + ABS(FieldsNew(D, iX1, iX2, iX3) - FieldsOld(D, iX1, iX2, iX3)) END DO END DO END DO CALL MPI_REDUCE( & MyL1Norm, L1Norm, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, & U % Communicator, Error) IF(U % MPI_Rank == 0)THEN PRINT* PRINT*, " INFO: L1 Norm = ", L1Norm END IF CALL DMDAVecRestoreArrayF90( & F % FieldsDA, InitialCondition, FieldsOld, Error) CALL DMDAVecRestoreArrayF90( & F % FieldsDA, F % FieldsGlobal, FieldsNew, Error) CALL VecDestroy(InitialCondition, Error)
AdvectionModule.F90
AdvectionModule contains AdvectExchangeCells3D and AdvectNonExchangeCells3D, which apply the kernel Advect3D to exchange cells and non-exchange cells, respectively.
AdvectExchangeCells3D
SUBROUTINE AdvectExchangeCells3D( & FieldsNew, FieldsOld, InnerEdge, OuterEdge, TimeStep, & iBX1, iBX2, iBX3, iEX1, iEX2, iEX3, SW) PetscReal, DIMENSION(:,:,:,:), POINTER :: & FieldsNew, & FieldsOld, & InnerEdge, & OuterEdge PetscReal, INTENT(in) :: & TimeStep PetscInt, INTENT(in) :: & iBX1, iBX2, iBX3, & iEX1, iEX2, iEX3, SW PetscInt :: & X1B, X1E, & X2B, X2E, & X3B, X3E ! 1: X1B = iBX1 X1E = iBX1 + SW - 1 X2B = iBX2 X2E = iEX2 X3B = iBX3 X3E = iEX3 CALL Advect3D( & FieldsNew(D, & X1B : X1E, X2B : X2E, X3B : X3E), & FieldsOld(D, & X1B - SW : X1E + SW, & X2B - SW : X2E + SW, & X3B - SW : X3E + SW), & FieldsNew(V1, & X1B : X1E, X2B : X2E, X3B : X3E), & OuterEdge(X1, & X1B : X1E, X2B : X2E, X3B : X3E) & - InnerEdge(X1, & X1B : X1E, X2B : X2E, X3B : X3E), & TimeStep, & X1B = X1B, X1E = X1E, & X2B = X2B, X2E = X2E, & X3B = X3B, X3E = X3E, SW = SW) ! 2: X1B = iEX1 - SW + 1 X1E = iEX1 X2B = iBX2 X2E = iEX2 X3B = iBX3 X3E = iEX3 CALL Advect3D( & FieldsNew(D, & X1B : X1E, X2B : X2E, X3B : X3E), & FieldsOld(D, & X1B - SW : X1E + SW, & X2B - SW : X2E + SW, & X3B - SW : X3E + SW), & FieldsNew(V1, & X1B : X1E, X2B : X2E, X3B : X3E), & OuterEdge(X1, & X1B : X1E, X2B : X2E, X3B : X3E) & - InnerEdge(X1, & X1B : X1E, X2B : X2E, X3B : X3E), & TimeStep, & X1B = X1B, X1E = X1E, & X2B = X2B, X2E = X2E, & X3B = X3B, X3E = X3E, SW = SW) ! 3: X1B = iBX1 + SW X1E = iEX1 - SW X2B = iBX2 X2E = iBX2 + SW - 1 X3B = iBX3 X3E = iEX3 CALL Advect3D( & FieldsNew(D, & X1B : X1E, X2B : X2E, X3B : X3E), & FieldsOld(D, & X1B - SW : X1E + SW, & X2B - SW : X2E + SW, & X3B - SW : X3E + SW), & FieldsNew(V1, & X1B : X1E, X2B : X2E, X3B : X3E), & OuterEdge(X1, & X1B : X1E, X2B : X2E, X3B : X3E) & - InnerEdge(X1, & X1B : X1E, X2B : X2E, X3B : X3E), & TimeStep, & X1B = X1B, X1E = X1E, & X2B = X2B, X2E = X2E, & X3B = X3B, X3E = X3E, SW = SW) ! 4: IF(iEX2 > iBX2)THEN X1B = iBX1 + SW X1E = iEX1 - SW X2B = iEX2 - SW + 1 X2E = iEX2 X3B = iBX3 X3E = iEX3 CALL Advect3D( & FieldsNew(D, & X1B : X1E, X2B : X2E, X3B : X3E), & FieldsOld(D, & X1B - SW : X1E + SW, & X2B - SW : X2E + SW, & X3B - SW : X3E + SW), & FieldsNew(V1, & X1B : X1E, X2B : X2E, X3B : X3E), & OuterEdge(X1, & X1B : X1E, X2B : X2E, X3B : X3E) & - InnerEdge(X1, & X1B : X1E, X2B : X2E, X3B : X3E), & TimeStep, & X1B = X1B, X1E = X1E, & X2B = X2B, X2E = X2E, & X3B = X3B, X3E = X3E, SW = SW) END IF ! 5: X1B = iBX1 + SW X1E = iEX1 - SW X2B = iBX2 + SW X2E = iEX2 - SW X3B = iBX3 X3E = iBX3 + SW - 1 CALL Advect3D( & FieldsNew(D, & X1B : X1E, X2B : X2E, X3B : X3E), & FieldsOld(D, & X1B - SW : X1E + SW, & X2B - SW : X2E + SW, & X3B - SW : X3E + SW), & FieldsNew(V1, & X1B : X1E, X2B : X2E, X3B : X3E), & OuterEdge(X1, & X1B : X1E, X2B : X2E, X3B : X3E) & - InnerEdge(X1, & X1B : X1E, X2B : X2E, X3B : X3E), & TimeStep, & X1B = X1B, X1E = X1E, & X2B = X2B, X2E = X2E, & X3B = X3B, X3E = X3E, SW = SW) ! 6: IF(iEX3 > iBX3)THEN X1B = iBX1 + SW X1E = iEX1 - SW X2B = iBX2 + SW X2E = iEX2 - SW X3B = iEX3 - SW + 1 X3E = iEX3 CALL Advect3D( & FieldsNew(D, & X1B : X1E, X2B : X2E, X3B : X3E), & FieldsOld(D, & X1B - SW : X1E + SW, & X2B - SW : X2E + SW, & X3B - SW : X3E + SW), & FieldsNew(V1, & X1B : X1E, X2B : X2E, X3B : X3E), & OuterEdge(X1, & X1B : X1E, X2B : X2E, X3B : X3E) & - InnerEdge(X1, & X1B : X1E, X2B : X2E, X3B : X3E), & TimeStep, & X1B = X1B, X1E = X1E, & X2B = X2B, X2E = X2E, & X3B = X3B, X3E = X3E, SW = SW) END IF END SUBROUTINE AdvectExchangeCells3D
AdvectNonExchangeCells3D
SUBROUTINE AdvectNonExchangeCells3D( & FieldsNew, FieldsOld, InnerEdge, OuterEdge, TimeStep, & iBX1, iBX2, iBX3, iEX1, iEX2, iEX3, SW) PetscReal, DIMENSION(:,:,:,:), POINTER :: & FieldsNew, & FieldsOld, & InnerEdge, & OuterEdge PetscReal, INTENT(in) :: & TimeStep PetscInt, INTENT(in) :: & iBX1, iBX2, iBX3, & iEX1, iEX2, iEX3, SW PetscInt :: & X1B, X1E, & X2B, X2E, & X3B, X3E X1B = iBX1 + SW X1E = iEX1 - SW X2B = iBX2 + SW X2E = iEX2 - SW X3B = iBX3 + SW X3E = iEX3 - SW CALL Advect3D( & FieldsNew(D, & X1B : X1E, X2B : X2E, X3B : X3E), & FieldsOld(D, & X1B - SW : X1E + SW, & X2B - SW : X2E + SW, & X3B - SW : X3E + SW), & FieldsNew(V1, & X1B : X1E, X2B : X2E, X3B : X3E), & OuterEdge(X1, & X1B : X1E, X2B : X2E, X3B : X3E) & - InnerEdge(X1, & X1B : X1E, X2B : X2E, X3B : X3E), & TimeStep, & X1B = X1B, X1E = X1E, & X2B = X2B, X2E = X2E, & X3B = X3B, X3E = X3E, SW = SW) END SUBROUTINE AdvectNonExchangeCells3D
Advect3D
SUBROUTINE Advect3D(D_NEW, D_OLD, V1, dX1, dt, & X1B, X1E, X2B, X2E, X3B, X3E, SW) PetscInt, INTENT(in) :: & X1B, X1E, X2B, X2E, X3B, X3E, SW PetscReal, DIMENSION(X1B : X1E, X2B : X2E, X3B : X3E), INTENT(inout) :: & D_NEW PetscReal, DIMENSION(X1B - SW : X1E + SW, & X2B - SW : X2E + SW, & X3B - SW : X3E + SW), INTENT(in) :: & D_OLD PetscReal, DIMENSION(X1B : X1E, X2B : X2E, X3B : X3E), INTENT(in) :: & V1, & dX1 PetscReal, INTENT(in) :: & dt PetscReal :: & FluxOuterX1, & FluxInnerX1 PetscInt :: & iX1, iX2, iX3 DO iX3 = X3B, X3E DO iX2 = X2B, X2E DO iX1 = X1B, X1E FluxInnerX1 & = Half * ( ( V1(iX1, iX2, iX3) & + ABS(V1(iX1, iX2, iX3)) ) * D_OLD(iX1-1, iX2, iX3) & + ( V1(iX1, iX2, iX3) & - ABS(V1(iX1, iX2, iX3)) ) * D_OLD(iX1, iX2, iX3) ) FluxOuterX1 & = Half * ( ( V1(iX1, iX2, iX3) & + ABS(V1(iX1, iX2, iX3)) ) * D_OLD(iX1, iX2, iX3) & + ( V1(iX1, iX2, iX3) & - ABS(V1(iX1, iX2, iX3)) ) * D_OLD(iX1+1, iX2, iX3) ) D_NEW(iX1, iX2, iX3) & = D_OLD(iX1, iX2, iX3) & - dt * ( FluxOuterX1 - FluxInnerX1 ) / dX1(iX1, iX2, iX3) END DO END DO END DO END SUBROUTINE Advect3D
Compiling
$ module load cray-petsc
$ module load git
$ git clone https://github.com/olcf/PETSc-tutorial.git
$ cd PETSc-tutorial
$ mkdir Executables
$ cd Executables
$ ftn -c ../RealKindModule.F90 ../MeshModule.F90 ../FieldsModule.F90 ../UniverseModule.F90 ../AdvectionModule.F90 ../Advection3D.F90
$ ftn -o Advection3D RealKindModule.o MeshModule.o FieldsModule.o UniverseModule.o AdvectionModule.o Advection3D.o
Running
$ aprun -n16 ./Advection3D INFO: Creating Universe Advection3D INFO: Number of MPI Ranks = 16 INFO: Creating Mesh Dimensionality = 3 Stencil Width = 1 iDim, nZones = 1 256 Inner Boundary = 0.000000000000000 Outer Boundary = 1.000000000000000 iDim, nZones = 2 256 Inner Boundary = 0.000000000000000 Outer Boundary = 1.000000000000000 iDim, nZones = 3 256 Inner Boundary = 0.000000000000000 Outer Boundary = 1.000000000000000 INFO: Creating Fields Degrees of Freedom = 4 Cycle = 1 Time Step = 3.9062500000000000E-003 Time = 0.000000000000000 Cycle = 2 Time Step = 3.9062500000000000E-003 Time = 3.9062500000000000E-003 Cycle = 3 Time Step = 3.9062500000000000E-003 Time = 7.8125000000000000E-003 ... Cycle = 256 Time Step = 3.9062500000000000E-003 Time = 0.9960937500000000 INFO: Finished Evolution Cycles Evolved = 256 Time = 1.000000000000000 INFO: L1 Norm = 2.2737367544323206E-013 INFO: Destroying Universe Advection3D INFO: 16 MPI Ranks Finished in 109.6593058109283 Seconds
This tutorial covers how to write a parallel program to calculate π using the Monte Carlo method. The first code is a simple serial implementation. The next codes are parallelized with MPI and with OpenMP, and the last code sample combines both techniques.
Serial Implementation
The codes use a Monte Carlo method to estimate π. To set up the estimate, points are generated at random locations within a 2×2 square that has a circle inscribed within it (think of a game of darts). The algorithm generates a large number of points and checks whether the coordinates x and y of each point lie inside the circle, i.e., x² + y² ≤ 1. It then computes the ratio, P, of points inside the circle to the total number of points tried. When the number of points, niter, is large, P approaches the ratio of the area of the circle to the area of the square, so P = πr²/(4r²) = π/4. Multiplying P by 4 therefore gives an approximation of π. (Note: the more points generated, i.e., the larger niter, the more accurate the approximation will be.)
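For example, if 785,400 of 1,000,000 sampled points were to land inside the circle, then P = 0.7854 and the estimate would be 4 × 0.7854 = 3.1416; these counts are illustrative only, not output from the codes below.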
serialpi.c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

int main(int argc, char* argv[])
{
    long niter = 10000000; //number of random points to generate
    double x,y;            //coordinates of the random point
    int i;
    int count=0;           //number of points inside the unit circle
    double z;
    double pi;

    srandom(time(NULL));   //seed the random() generator

    //main loop
    for (i=0; i<niter; ++i)
    {
        //get random points
        x = (double)random()/RAND_MAX;
        y = (double)random()/RAND_MAX;
        z = sqrt((x*x)+(y*y));
        //check to see if point is in unit circle
        if (z<=1)
        {
            ++count;
        }
    }

    pi = ((double)count/(double)niter)*4.0; //p = 4(m/n)
    printf("Pi: %f\n", pi);

    return 0;
}
Next, compile the code using the compiler wrappers on the supercomputer:
cc serialpi.c -o a.out
To run this program on Titan, or any of the other clusters, a job needs to be submitted in the form of a PBS batch script. Here is an example batch script, titan_pi.pbs:
#!/bin/bash
# Begin PBS directives
#PBS -A AAA###
#PBS -N pi
#PBS -j oe
#PBS -l walltime=0:01:00,nodes=1
#PBS -V
#PBS -q debug
# End PBS directives and begin shell commands
cd $PBS_O_WORKDIR
date
aprun -n 1 ./a.out
Replace AAA### with the appropriate project ID that the time will be charged to. See the Titan System User Guide for a more in depth look at writing batch scripts.
To submit this batch script, run the following:
qsub titan_pi.pbs
When executed, the result will be something like:
Pi: 3.141664
Parallelize with MPI
MPI Send/Recv
In this next version of the program, the code has been parallelized with MPI. In order to do this, a few things need to be set up in the code. First, the mpi.h header file needs to be included.
#include "mpi.h"
Next, an MPI instance needs to be created. To do this, add this:
MPI_Init(&argc, &argv);
right before the code to be parallelized with MPI. To make use of the MPI instance, the code needs MPI sends and receives. Message sending and receiving is the basic concept behind MPI; many MPI functions can be implemented with combinations of basic send and receive calls. Here, the code is set up to do computations on all but the master node (MPI rank 0). The only thing rank 0 does is receive the messages (the calculated counts of points inside the circle and of points tried) sent by the other nodes and calculate the ratio from that data. The MPI_Recv() function needs to be added so that the master node can receive these messages. To set up the receive, it needs the following information:
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status);
- A buffer to store the received data
- A number of maximum elements in the receive buffer
- An MPI_Datatype of the appropriate type for the data to be received
- The rank number that the message will be sent from. This can be the wildcard MPI_ANY_SOURCE
- A message tag, an application-defined label that identifies a particular message. It can be the wildcard MPI_ANY_TAG
- The MPI communicator, e.g., MPI_COMM_WORLD
- A status variable
So the first MPI_Recv line should look like this:
MPI_Recv(&recieved[i], nodenum, MPI_INT, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
Next, an MPI_Send() needs to be set up so that the master rank has data to receive. MPI_Send() only needs to know:
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm);
- A pointer to the send buffer (data to send)
- Number of elements in the send buffer
- An MPI_Datatype for the kind of data that's being sent
- The destination rank
- A message tag to help identify a specific message or type of message, as specified/defined by the application
- The MPI communicator
So with this in mind, the first MPI_Send needs to look like this:
MPI_Send(&count, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
Finally, the MPI instance should be closed out before the program exits. To do this simply add this line after the parallel code:
MPI_Finalize();
This code is written so that each node runs the same algorithm to place the random points, categorize them, and count them. The master node collects the total number of points tried and the total number of points inside the circle, takes the ratio of the two, and multiplies it by 4 to get a more accurate estimate of π. The result is more accurate because each node runs as many iterations as the single process in the serial code, and the master node treats the samples as if they all came from the same circle. The same accuracy can be achieved with the serial code by multiplying the number of iterations, niter, by the total number of nodes used in the parallelized version, but that is much less efficient than the MPI version because the work is not divided over several nodes running simultaneously.
mpiPi.c
#include <stdio.h> #include <stdlib.h> #include "mpi.h" #include <math.h> #define SEED 35791246 int main(int argc, char* argv[]) { long niter = 1000000; int myid; //holds process's rank id double x,y; //x,y value for the random coordinate int i, count=0; //Count holds all the number of how many good coordinates double z; //Used to check if x^2+y^2<=1 double pi; //holds approx value of pi int nodenum; MPI_Init(&argc, &argv); //Start MPI MPI_Comm_rank(MPI_COMM_WORLD, &myid); //get rank of node's process MPI_Comm_size(MPI_COMM_WORLD, &nodenum); int recieved[nodenum]; long recvniter[nodenum]; srand(SEED+myid); //Give rand() a seed value. Needs to be different on each node if(myid != 0) { for (i=0; i<niter; ++i) //main loop { x= ((double)rand())/RAND_MAX; //gets a random x coordinate y =((double)rand())/RAND_MAX; //gets a random y coordinate z = sqrt(x*x+y*y); //Checks to see if number in inside unit circle if (z<=1) { count++; //if it is, consider it a valid random point } } for(i=0; i<nodenum; ++i) { MPI_Send(&count, 1, MPI_INT, 0, 1, MPI_COMM_WORLD); MPI_Send(&niter, 1, MPI_LONG, 0, 2, MPI_COMM_WORLD); } } else if (myid == 0) { for(i=0; i<nodenum; ++i) { MPI_Recv(&recieved[i], nodenum, MPI_INT, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE); MPI_Recv(&recvniter[i], nodenum, MPI_LONG, MPI_ANY_SOURCE, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE); } } if (myid == 0) //if root process { int finalcount = 0; long finalniter = 0; for(i = 0; i<nodenum; ++i) { finalcount += recieved[i]; finalniter += recvniter[i]; } pi = ((double)finalcount/(double)finalniter)*4.0; //p = 4(m/n) printf("Pi: %f\n", pi); //Print the calculated value of pi } MPI_Finalize(); //Close the MPI instance return 0; }
Compile this code by running:
cc mpiPi.c -o a.out
The job submission is the same as the previous program but two lines in the batch script need to be edited:
#PBS -l walltime=0:01:00,nodes=1
the “nodes=1” needs to be changed to the number of nodes needed to run this code,
and
aprun -n 1 ./a.out
The “-n 1” needs to be changed to reflect the change made in the previous line (e.g., if the previous line now reads “nodes=100”, then the aprun line needs to read “-n 100”).
Now the job can be submitted to the batch system to be run when the number of requested nodes becomes available.
qsub titan_pi.pbs
MPI Reduce
In the following version of the code, the serial code has been parallelized with MPI again, but this time the collective MPI_Reduce() operation is used. MPI_Reduce() combines the specified value from every node into a single reduced value on the master node according to the specified operation (e.g., MPI_SUM). This has the same effect as the previous code but makes for cleaner, more readable code.
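For reference, the two reduction calls used in mpirPi.c below have the following form; the argument comments are added here for clarity and are not part of the original source.

MPI_Reduce(&count,        /* send buffer: this node's count of points inside the circle */
           &reducedcount, /* receive buffer, significant on the root only */
           1,             /* number of elements */
           MPI_INT,       /* datatype of each element */
           MPI_SUM,       /* reduction operation */
           0,             /* root rank that receives the result */
           MPI_COMM_WORLD);
MPI_Reduce(&niter, &reducedniter, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);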
mpirPi.c
#include <stdio.h> #include <stdlib.h> #include "mpi.h" #include <math.h> int main(int argc, char* argv[]) { int niter = 100000; int myid; //holds process's rank id double x,y; //x,y value for the random coordinate int i; int count=0; //Count holds all the number of how many good coordinates double z; //Used to check if x^2+y^2<=1 double pi; //holds approx value of pi int reducedcount; //total number of "good" points from all nodes int reducedniter; //total number of ALL points from all nodes MPI_Init(&argc, &argv); //Start MPI MPI_Comm_rank(MPI_COMM_WORLD, &myid); //get rank of node's process /* Everyone can do the work */ for (i=0; i<niter; ++i) //main loop { srand48(time(NULL)+myid); //Give rand() a seed value unique on each node (times are synced) x = (double)random()/RAND_MAX; //gets a random x coordinate y = (double)random()/RAND_MAX; //gets a random y coordinate z = sqrt((x*x)+(y*y)); //Checks to see if number in inside unit circle if (z<=1) { ++count; //if it is, consider it a valid random point } } MPI_Reduce(&count, &reducedcount, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD); MPI_Reduce(&niter, &reducedniter, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD); if (myid == 0) //if root process { pi = ((double)reducedcount/(double)reducedniter)*4.0; //p = 4(m/n) printf("Pi: %f\n%i\n%d\n", pi, reducedcount, reducedniter); //Print the calculated value of pi } MPI_Finalize(); //Close the MPI instance return 0; }
Compile this code by running
cc mpirPi.c -o a.out
Again, the batch script needs to be edited to reflect the number of nodes requested for this code. Once that is done, submit the job by once again running
qsub titan_pi.pbs
Parallelize with OpenMP
Here the original serial code has been parallelized with OpenMP instead of MPI. OpenMP is used to create separate threads that run on the multicore CPUs of Titan. To use OpenMP, the following line needs to be added:
#include <omp.h>
Also, the #pragma omp parallel
compiler directive needs to be added. This tells the compiler that the following block will be OpenMP parallel code. The code creates 16 threads, one for each core of the CPU. Since OpenMP is a shared-memory API, keep in mind which variables need to be accessible by which threads. Here, each thread needs its own copy of x, y, z, and i, while the variable count needs to be accessible by all threads so that it can be incremented when needed. To accomplish this, OpenMP is told that x, y, z, and i are private; the firstprivate() clause in the omp pragma declares them as thread private and also initializes the variables inside its parentheses. The variable count needs to be accessible to all threads (i.e., shared), so the shared() clause is used. (Note that unsynchronized increments of a shared variable are a data race; see the sketch after the listing below.) Finally, OpenMP is told to spawn 16 threads, one for each core of the processor, using the num_threads() clause.
omppi.c
#include <stdio.h> #include <stdlib.h> #include <omp.h> #include <math.h> int main(int argc, char* argv[]) { int niter = 1000000; //number of iterations per FOR loop double x,y; //x,y value for the random coordinate int i; //loop counter int count=0; //Count holds all the number of how many good coordinates double z; //Used to check if x^2+y^2<=1 double pi; //holds approx value of pi int numthreads = 16; #pragma omp parallel firstprivate(x, y, z, i) shared(count) num_threads(numthreads) { srandom((int)time(NULL) ^ omp_get_thread_num()); //Give random() a seed value for (i=0; i<niter; ++i) //main loop { x = (double)random()/RAND_MAX; //gets a random x coordinate y = (double)random()/RAND_MAX; //gets a random y coordinate z = sqrt((x*x)+(y*y)); //Checks to see if number is inside unit circle if (z<=1) { ++count; //if it is, consider it a valid random point } } //print the value of each thread/rank } pi = ((double)count/(double)(niter*numthreads))*4.0; printf("Pi: %f\n", pi); return 0; }
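As noted above, having all 16 threads increment the shared variable count without synchronization is technically a data race. A minimal sketch of one way to protect the increment, keeping the rest of omppi.c unchanged, is shown below; the alternative is the reduction(+:count) clause used in the hybrid example later in this tutorial.

#pragma omp parallel firstprivate(x, y, z, i) shared(count) num_threads(numthreads)
{
    srandom((int)time(NULL) ^ omp_get_thread_num()); //per-thread seed, as in omppi.c
    for (i=0; i<niter; ++i)
    {
        x = (double)random()/RAND_MAX;
        y = (double)random()/RAND_MAX;
        z = sqrt((x*x)+(y*y));
        if (z<=1)
        {
            #pragma omp atomic //serialize only this update of the shared counter
            ++count;
        }
    }
}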
Compiling it is similar to the previous examples, with one exception: when the compiler is called, the correct OpenMP flag needs to be added. This flag depends on which programming environment module is loaded (run module list and look for a module named PrgEnv-*****):
- PrgEnv-pgi: -mp
- PrgEnv-intel: -openmp
- PrgEnv-cray: -openmp
- PrgEnv-gnu: -fopenmp
Now the code can be compiled using this format:
cc -(appropriate flag here) omppi.c -o a.out
The batch script needs to be edited so that only 1 node is requested (there is no MPI code in this version for multiple nodes yet).
Submit the script:
qsub titan_pi.pbs
OpenMP + MPI Hybrid
This next code sample combines both MPI and OpenMP. This allows it to fully exploit the parallel architecture of Titan’s multicore CPUs as well as add a layer of parallelism from MPI. In this version, 16 OpenMP threads are launched on each MPI rank. Here, a different approach is taken in the OpenMP section: instead of declaring count as shared, the reduction(+:count) clause is used. This clause gives each thread a private copy of count to increment; once all threads complete, OpenMP reduces all instances of count into the original count variable according to the given mathematical operator. In this case the total from all threads is needed, so the copies are combined by summation. MPI_Reduce is then used to gather and reduce count from each node into one variable on the master node.
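Stripped of the sampling loop, the two-level reduction pattern used in mpiOMPpi.c below looks like the following sketch; the per-thread contribution is replaced by a dummy increment of 1.

#include <stdio.h>
#include "mpi.h"
#include <omp.h>

int main(int argc, char* argv[])
{
    int myid;
    int count = 0;        //per-rank total, built from the per-thread copies
    int reducedcount = 0; //grand total on the master node

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    //Level 1: OpenMP sums each thread's private copy of count into this rank's count
    #pragma omp parallel reduction(+:count) num_threads(16)
    {
        count += 1; //stand-in for the number of "good" points found by this thread
    }

    //Level 2: MPI sums each rank's count into reducedcount on rank 0
    MPI_Reduce(&count, &reducedcount, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (myid == 0)
        printf("Total: %d\n", reducedcount); //16 * number of ranks

    MPI_Finalize();
    return 0;
}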
mpiOMPpi.c
#include <stdio.h> #include <stdlib.h> #include "mpi.h" #include <omp.h> #include <math.h> int main(int argc, char* argv[]) { int niter = 1000000; //number of iterations per FOR loop int myid; //holds process's rank id double x,y; //x,y value for the random coordinate int i; //loop counter int count=0; //Count holds all the number of how many good coordinates double z; //Used to check if x^2+y^2<=1 double pi; //holds approx value of pi int reducedcount; //total number of "good" points from all nodes int reducedniter; //total number of ALL points from all nodes int ranknum = 0; //total number of nodes available int numthreads = 16; MPI_Init(&argc, &argv); //Start MPI MPI_Comm_rank(MPI_COMM_WORLD, &myid); //get rank of node's process MPI_Comm_size(MPI_COMM_WORLD, &ranknum); //Gets number of nodes availible to process if(myid != 0) //Do the following on all except the master node { //Start OpenMP code: 16 threads/node #pragma omp parallel firstprivate(x, y, z, i) reduction(+:count) num_threads(numthreads) { srandom((int)time(NULL) ^ omp_get_thread_num()); //Give random() a seed value for (i=0; i<niter; ++i) //main loop { x = (double)random()/RAND_MAX; //gets a random x coordinate y = (double)random()/RAND_MAX; //gets a random y coordinate z = sqrt((x*x)+(y*y)); //Checks to see if number is inside unit circle if (z<=1) { ++count; //if it is, consider it a valid random point } } } } MPI_Barrier(MPI_COMM_WORLD); MPI_Reduce(&count, &reducedcount, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD); reducedniter = numthreads*niter*(ranknum-1); if (myid == 0) //if root process/master node { //p = 4(m/n) pi = ((double)reducedcount/(double)reducedniter)*4.0; //Print the calculated value of pi printf("Pi: %f\n%i\n%d\n", pi, reducedcount, reducedniter); } MPI_Finalize(); //Close the MPI instance return 0; }
Compiling this code is similar to the past examples (don’t forget to use the correct OpenMP flag according to the programming environment used). To compile run:
cc -(OpenMP flag) mpiOMPpi.c -o a.out
The batch script can be edited now so that it can request the appropriate number of nodes (see above). After the batch script is edited, submit the job like before:
qsub titan_pi.pbs
If you have any questions about these examples or how to run them please email help@nccs.gov.
All code from this tutorial, including a Makefile and batch scripts, can be found in a git repository at https://github.com/olcf/Serial-to-Parallel–Monte-Carlo-Pi