To fully exploit the computational power of the GPU, a large amount of data parallelism must be expressed. If your problem does not possess a sufficient amount of data parallelism, a second option is to combine data parallelism with task parallelism on the GPU through the use of concurrent kernels. To facilitate task parallelism, the NVIDIA Kepler K20X features Hyper-Q, a set of 32 hardware-managed work queues. When using CUDA streams, each stream is automatically mapped onto Hyper-Q, allowing up to 32 streams to execute concurrently. The NVIDIA Multi-Process Service allows multiple processes, such as intra-node MPI ranks, to be mapped onto Hyper-Q. This tutorial will demonstrate how to take advantage of GPU concurrency on Titan through the use of Hyper-Q.
The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.
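As a minimal sketch of the stream-to-Hyper-Q mapping described above (this snippet is not part of the tutorial source; the do_work kernel and the launch configuration are placeholders), the following CUDA C code launches one trivial kernel into each of several non-default streams so that they may execute concurrently on the device:

#include <cuda_runtime.h>

// Placeholder kernel: stands in for any independent piece of work
__global__ void do_work(void) { }

int main(void)
{
    const int num_streams = 4;          // up to 32 streams map onto Hyper-Q
    cudaStream_t streams[num_streams];

    // Create one non-default stream per kernel
    for (int i = 0; i < num_streams; i++)
        cudaStreamCreate(&streams[i]);

    // Kernel launches are asynchronous; each lands in its own stream and
    // may therefore execute concurrently with the others
    for (int i = 0; i < num_streams; i++)
        do_work<<<1, 1, 0, streams[i]>>>();

    // Wait for all device work to finish, then clean up
    cudaDeviceSynchronize();
    for (int i = 0; i < num_streams; i++)
        cudaStreamDestroy(streams[i]);

    return 0;
}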
C
The Serial, OpenMP, and MPI C samples make use of the following GPU wrapper code. The wrapper provides the following functions:
sleep(): GPU kernel to sleep for a given number of clock cycles
get_cycles(): Return the number of cycles required to sleep for the requested number of seconds
create_streams(): Create a given number of non-default streams
sleep_kernel(): Launch a single GPU thread that sleeps for a given number of cycles
wait_for_streams(): Wait for streams 1 through the number of streams specified to complete any work
destroy_streams(): Destroy non-default streams
sleep.cu
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

static cudaStream_t *streams;

// CUDA kernel to pause for at least num_cycles cycles
__global__ void sleep(int64_t num_cycles)
{
    int64_t cycles = 0;
    int64_t start = clock64();
    while(cycles < num_cycles) {
        cycles = clock64() - start;
    }
}

// Returns number of cycles required for requested seconds
extern "C" int64_t get_cycles(float seconds)
{
    // Get device frequency in KHz
    int64_t Hz;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    Hz = int64_t(prop.clockRate) * 1000;

    // Calculate number of cycles to wait
    int64_t num_cycles;
    num_cycles = (int64_t)(seconds * Hz);

    return num_cycles;
}

// Create streams
extern "C" void create_streams(int num_streams)
{
    // Allocate streams
    streams = (cudaStream_t *) malloc((num_streams+1)*sizeof(cudaStream_t));

    // Default stream
    streams[0] = NULL;

    // Primer kernel launch
    sleep<<< 1, 1 >>>(1);

    // Create streams
    for(int i = 1; i <= num_streams; i++)
        cudaStreamCreate(&streams[i]);
}

// Launches a kernel that sleeps for num_cycles
extern "C" void sleep_kernel(int64_t num_cycles, int stream_id)
{
    // Launch a single GPU thread to sleep
    int blockSize, gridSize;
    blockSize = 1;
    gridSize = 1;

    // Execute the kernel
    sleep<<< gridSize, blockSize, 0, streams[stream_id] >>>(num_cycles);
}

// Wait for stream to complete
extern "C" void wait_for_stream(int stream_id)
{
    cudaStreamSynchronize(streams[stream_id]);
}

// Wait for streams to complete
extern "C" void wait_for_streams(int num_streams)
{
    for(int i = 1; i <= num_streams; i++)
        cudaStreamSynchronize(streams[i]);
}

// Destroy stream objects
extern "C" void destroy_streams(int num_streams)
{
    // Clean up streams
    for(int i = 1; i <= num_streams; i++)
        cudaStreamDestroy(streams[i]);
    free(streams);
}
Compile:
The following will compile the wrapper functions into object code to be linked in with our examples.
$ module load cudatoolkit
$ nvcc -arch=sm_35 -c sleep.cu
Serially Launched Concurrent Kernels in C
The following sample launches multiple GPU kernels that sleep a single GPU thread for one second. GPU kernel launches are asynchronous with respect to the host CPU, which allows a simple for loop to be used to launch concurrent kernels. Each kernel is launched into its own stream, allowing up to 32 kernels to execute concurrently.
launcher.c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/time.h>

int64_t get_cycles(float seconds);
void sleep_kernel(int64_t num_cycles, int stream_id);
void create_streams(int num_streams);
void wait_for_streams(int num_streams);
void destroy_streams(int num_streams);

int main(int argc, char *argv[])
{
    uint64_t cycles;
    struct timeval start, stop;
    int i, num_kernels;

    // Get number of cycles to sleep for 1 second
    cycles = get_cycles(1.0);

    // Max number of kernels to launch
    int max_kernels = 33;

    // Loop through number of kernels to launch, from 1 to max_kernels
    for(num_kernels=1; num_kernels<=max_kernels; num_kernels++)
    {
        // Start timer
        gettimeofday(&start, NULL);

        // Create streams
        create_streams(num_kernels);

        // Launch num_kernels kernels asynchronously
        for(i=1; i<=num_kernels; i++){
            sleep_kernel(cycles, i);
        }

        // Wait for kernels to complete
        wait_for_streams(num_kernels);

        // Clean up streams
        destroy_streams(num_kernels);

        // Print seconds elapsed
        gettimeofday(&stop, NULL);
        double seconds;
        seconds = (stop.tv_sec - start.tv_sec);
        seconds += (stop.tv_usec - start.tv_usec) / 1000000.0;
        printf("Total time for %d kernels: %f s\n", num_kernels, seconds);
    }

    return 0;
}
Compile:
$ module load cudatoolkit
$ cc sleep.o launcher.c -o serial.out
Run:
$ module load cudatoolkit
$ aprun ./serial.out
OpenMP Launched Concurrent Kernels in C
The following example uses OpenMP to launch multiple sleep kernels in a parallel region, using the OMP thread number to specify which stream to use. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.
launcher.c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <omp.h>

int64_t get_cycles(float seconds);
void sleep_kernel(int64_t num_cycles, int stream_id);
void create_streams(int num_streams);
void wait_for_streams(int num_streams);
void destroy_streams(int num_streams);

int main(int argc, char *argv[])
{
    uint64_t cycles;
    double start, stop;
    int i, num_kernels;

    // Get number of cycles to sleep for 1 second
    cycles = get_cycles(1.0);

    // Max number of kernels to launch
    int max_kernels = 33;

    // Loop through number of kernels to launch, from 1 to max_kernels
    for(num_kernels=1; num_kernels<=max_kernels; num_kernels++)
    {
        // Set number of OMP threads
        omp_set_num_threads(num_kernels);

        // Start timer
        start = omp_get_wtime();

        // Create streams
        create_streams(num_kernels);

        // Launch num_kernels kernels asynchronously, one per OMP thread
        #pragma omp parallel firstprivate(cycles)
        {
            int stream_id = omp_get_thread_num()+1;
            sleep_kernel(cycles, stream_id);
        }

        // Wait for kernels to complete
        wait_for_streams(num_kernels);

        // Clean up streams
        destroy_streams(num_kernels);

        // Stop timer
        stop = omp_get_wtime();
        printf("Total time for %d kernels: %f s\n", num_kernels, stop-start);
    }

    return 0;
}
Compile:
PGI
$ module load cudatoolkit
$ cc -mp sleep.o launcher.c -o openmp.out
GNU
$ module load cudatoolkit
$ cc -fopenmp sleep.o launcher.c -o openmp.out
Intel
$ module load cudatoolkit
$ cc -openmp sleep.o launcher.c -o openmp.out
Cray
$ module load cudatoolkit
$ cc sleep.o launcher.c -o openmp.out
Run:
$ module load cudatoolkit
$ aprun -d16 ./openmp.out
MPI Launched Concurrent Kernels in C
The following example uses MPI to launch a single GPU kernel per MPI process into the default stream. Titan's GPU compute mode does not allow multiple processes to access the GPU simultaneously, so the Cray CUDA proxy (CRAY_CUDA_PROXY) must be enabled. When enabled, the Cray CUDA proxy works in conjunction with Hyper-Q to allow kernels from up to 32 different MPI processes to run concurrently. Note, however, that on Titan the maximum number of MPI processes per node is currently 16.
launcher.c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "mpi.h"

int64_t get_cycles(float seconds);
void sleep_kernel(int64_t num_cycles, int stream_id);
void create_streams(int num_streams);
void wait_for_stream(int stream_id);
void destroy_streams(int num_streams);

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    uint64_t cycles;
    double start, stop;
    int num_kernels;

    // Get number of cycles to sleep for 1 second
    cycles = get_cycles(1.0);

    // Number of kernels to launch
    int max_kernels = size;

    // Setup default stream in sleep.cu wrapper
    create_streams(0);

    // Loop through number of kernels to launch, from 1 to max_kernels
    for(num_kernels=1; num_kernels<=max_kernels; num_kernels++)
    {
        // Start timer
        MPI_Barrier(MPI_COMM_WORLD);
        if(rank == 0)
            start = MPI_Wtime();

        // Launch kernel into default stream
        if(rank < num_kernels)
            sleep_kernel(cycles, 0);

        // Wait for all ranks to submit kernel
        MPI_Barrier(MPI_COMM_WORLD);

        // Wait for default stream
        if(rank < num_kernels)
            wait_for_stream(0);

        // Wait for all ranks to complete
        MPI_Barrier(MPI_COMM_WORLD);

        // Print seconds elapsed
        if(rank == 0) {
            stop = MPI_Wtime();
            printf("Total time for %d kernels: %f s\n", num_kernels, stop-start);
        }
    }

    destroy_streams(0);

    MPI_Finalize();
    return 0;
}
Compile:
$ module load cudatoolkit
$ cc sleep.o launcher.c -o proxy.out
Run:
$ module load cudatoolkit
$ export CRAY_CUDA_PROXY=1
$ aprun -n16 ./proxy.out
OpenACC Launched Concurrent Kernels in C
By default, OpenACC will block on the CPU when launching kernels or performing data movement. Using the async(i) clause allows asynchronous kernel launches and data transfers. Although it is implementation dependent, the integer argument i will generally determine which stream the kernel or data operation is launched into. At the time of this writing, each implementation handles the integer argument slightly differently.
launcher.c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <sys/time.h>

int main(int argc, char *argv[])
{
    uint64_t num_cycles;
    struct timeval start, stop;
    int i, num_kernels;

    // Set number of cycles to sleep for 1 second
    // We'll use frequency/instruction latency to estimate this
    num_cycles = 730000000/15;

    // Number of kernels to launch
    int max_kernels = 33;

    // Loop through number of kernels to launch, from 1 to max_kernels
    for(num_kernels=1; num_kernels<=max_kernels; num_kernels++)
    {
        // Start timer
        gettimeofday(&start, NULL);

        // Launch num_kernels kernels asynchronously
        for(i=1; i<=num_kernels; i++)
        {
            #pragma acc parallel async(i) vector_length(1) num_gangs(1)
            {
                uint64_t cycles;
                #pragma acc loop seq
                for(cycles=0; cycles<num_cycles; cycles++) {
                    cycles = cycles;
                }
            }
        }

        // Wait for all async streams to complete
        #pragma acc wait

        // Print seconds elapsed
        gettimeofday(&stop, NULL);
        double seconds;
        seconds = (stop.tv_sec - start.tv_sec);
        seconds += (stop.tv_usec - start.tv_usec) / 1000000.0;
        printf("Total time for %d kernels: %f s\n", num_kernels, seconds);
    }

    return 0;
}
Note: Because the loop body performs no useful work, the compiler may optimize it away entirely. Adding the volatile qualifier to the declaration of cycles should force the loop to be kept. The volatile qualifier is currently not supported in PGI accelerator regions.
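As a minimal sketch of that workaround (assuming a compiler that accepts volatile inside an accelerator region, such as the Cray compiler), the parallel region above could be written as follows:

#pragma acc parallel async(i) vector_length(1) num_gangs(1)
{
    // volatile discourages the compiler from eliminating the busy-wait loop
    volatile uint64_t cycles;
    #pragma acc loop seq
    for(cycles=0; cycles<num_cycles; cycles++) {
        cycles = cycles;
    }
}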
Compile:
PGI
$ module load cudatoolkit
$ cc -acc launcher.c -o acc.out
Cray
$ module load cudatoolkit
$ cc -hpragma=acc launcher.c -o acc.out
Run:
$ module load cudatoolkit
$ aprun ./acc.out
Fortran
The Serial, OpenMP, and MPI Fortran samples make use of the following CUDA Fortran GPU wrapper code. The wrapper provides the following functions:
sleep(): GPU kernel to sleep for a given number of clock cycles
get_cycles(): Return the number of cycles required to sleep for the requested number of seconds
create_streams(): Create a given number of non-default streams
sleep_kernel(): Launch a single GPU thread that sleeps for a given number of cycles
wait_for_streams(): Wait for streams 1 through the number of streams specified to complete any work
destroy_streams(): Destroy non-default streams
sleep.cuf
module sleep
  use cudadevice
  use cudafor
  implicit none

  integer, dimension(:), allocatable :: streams

contains

  !CUDA kernel to pause for at least num_cycles cycles
  attributes(global) subroutine sleep(num_cycles)
    integer(8), value :: num_cycles
    integer(8) :: cycles
    integer(8) :: start

    cycles = 0
    start = clock64
    do while (cycles < num_cycles)
      cycles = clock64 - start
    enddo
  end subroutine sleep

  !Returns number of cycles required for requested seconds
  integer(8) function get_cycles(seconds) result(num_cycles)
    real(8), intent(in) :: seconds
    integer(8) :: istat, Hz
    type(cudadeviceprop) :: prop

    istat = cudaGetDeviceProperties(prop, 0)
    Hz = prop%clockRate * 1000
    num_cycles = seconds * Hz
  end function get_cycles

  !Create streams
  subroutine create_streams(num_streams)
    integer :: num_streams, istat, i

    allocate(streams(num_streams+1))

    ! Default stream
    streams(1) = 0

    ! Primer kernel launch
    call sleep<<< 1, 1 >>>(int8(1))

    ! Create streams
    do i=2,num_streams+1
      istat = cudaStreamCreate(streams(i))
    enddo
  end subroutine create_streams

  !Launches a kernel that sleeps for num_cycles
  subroutine sleep_kernel(num_cycles, stream_id)
    integer(8) :: num_cycles
    integer :: stream_id
    type(dim3) :: blockSize, gridSize

    blockSize = dim3(1,1,1)
    gridSize = dim3(1,1,1)

    call sleep<<<gridSize, blockSize, 0, streams(stream_id)>>>(num_cycles)
  end subroutine sleep_kernel

  ! Wait for stream to complete
  subroutine wait_for_stream(stream_id)
    integer :: stream_id, istat

    istat = cudaStreamSynchronize(streams(stream_id))
  end subroutine wait_for_stream

  ! Wait for streams to complete
  subroutine wait_for_streams(num_streams)
    integer :: num_streams, istat, i

    do i=2,num_streams+1
      istat = cudaStreamSynchronize(streams(i))
    enddo
  end subroutine wait_for_streams

  ! Destroy streams
  subroutine destroy_streams(num_streams)
    integer :: num_streams, i, istat

    do i=2,num_streams+1
      istat = cudaStreamDestroy(streams(i))
    enddo
    deallocate(streams)
  end subroutine destroy_streams

end module sleep
Serially Launched Concurrent Kernels in Fortran
The following sample launches multiple GPU kernels that sleep a single GPU thread for one second. GPU kernel launches are asynchronous with respect to the host CPU, which allows a simple loop to be used to launch concurrent kernels. Each kernel is launched into its own stream, allowing up to 32 kernels to execute concurrently.
launcher.f90
program main
  use sleep

  integer(8) :: cycles
  integer :: max_kernels, num_kernels, i
  real(8) :: start, stop, seconds

  ! Get number of cycles to sleep for 1 second
  seconds = 1.0
  cycles = get_cycles(seconds)

  ! Maximum number of kernels to launch
  max_kernels = 32

  ! Loop through number of kernels to launch, from 1 to max_kernels
  do num_kernels = 1, max_kernels

    ! Start timer
    call cpu_time(start)

    ! Create streams
    call create_streams(num_kernels)

    ! Launch num_kernels kernels asynchronously
    do i = 2, num_kernels+1
      call sleep_kernel(cycles, i)
    enddo

    ! Wait for kernels to complete
    call wait_for_streams(num_kernels)

    ! Clean up streams
    call destroy_streams(num_kernels)

    ! Stop timer
    call cpu_time(stop)

    print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds'

  enddo

end program main
Compile:
PGI
$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o serial.out
OpenMP Launched Concurrent Kernels in Fortran
The following example uses OpenMP to launch multiple sleep kernels in a parallel region, using the OMP thread number to specify which stream to use. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.
launcher.f90
program main
  use sleep
  use omp_lib

  implicit none

  integer(8) :: cycles
  integer :: max_kernels, num_kernels, stream_id
  real(8) :: start, stop, seconds

  ! Get number of cycles to sleep for 1 second
  seconds = 1.0
  cycles = get_cycles(seconds)

  ! Maximum number of kernels to launch
  max_kernels = 33

  ! Loop through number of kernels to launch, from 1 to max_kernels
  do num_kernels = 1, max_kernels

    ! Set number of OMP threads
    call omp_set_num_threads(num_kernels)

    ! Start timer
    start = omp_get_wtime()

    ! Create streams
    call create_streams(num_kernels)

    ! Launch num_kernels kernels asynchronously, one per OMP thread
    !$omp parallel private(stream_id) firstprivate(cycles)
    stream_id = omp_get_thread_num()+2
    call sleep_kernel(cycles, stream_id)
    !$omp end parallel

    ! Wait for kernels to complete
    call wait_for_streams(num_kernels)

    ! Clean up streams
    call destroy_streams(num_kernels)

    ! Stop timer
    stop = omp_get_wtime()

    print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds'

  enddo

end program main
Compile:
PGI
$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o openmp.out
MPI Launched Concurrent Kernels in Fortran
The following example uses MPI to launch a single GPU kernel per MPI process into the default stream. Titan's GPU compute mode does not allow multiple processes to access the GPU simultaneously, so the Cray CUDA proxy (CRAY_CUDA_PROXY) must be enabled. When enabled, the Cray CUDA proxy works in conjunction with Hyper-Q to allow kernels from up to 32 different MPI processes to run concurrently. Note, however, that on Titan the maximum number of MPI processes per node is currently 16.
launcher.f90
program main
  use sleep

  implicit none
  include 'mpif.h'

  integer :: max_kernels, num_kernels, i, ierr, rank, size
  integer(8) :: cycles
  real(8) :: start, stop, seconds

  call MPI_Init(ierr)

  ! Get number of cycles to sleep for 1 second
  seconds = 1.0
  cycles = get_cycles(seconds)

  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, size, ierr)

  ! Number of kernels to launch
  max_kernels = size

  ! Setup default stream in sleep.cuf wrapper
  call create_streams(0)

  ! Loop through number of kernels to launch, from 1 to max_kernels
  do num_kernels = 1, max_kernels

    ! Start timer
    call MPI_Barrier(MPI_COMM_WORLD, ierr)
    if (rank == 0) then
      start = MPI_Wtime()
    endif

    ! Launch kernel into default stream
    if (rank < num_kernels) then
      call sleep_kernel(cycles, 1)
    endif

    ! Wait for all ranks to submit kernel
    call MPI_Barrier(MPI_COMM_WORLD, ierr)

    ! Wait for default stream to complete
    if (rank < num_kernels) then
      call wait_for_stream(1)
    endif

    ! Wait for all ranks to complete
    call MPI_Barrier(MPI_COMM_WORLD, ierr)

    ! Print seconds elapsed
    if (rank == 0) then
      stop = MPI_Wtime()
      print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds'
    endif

  enddo

  ! Clean up array in wrapper; no stream is actually destroyed
  call destroy_streams(0)

  call MPI_Finalize(ierr)

end program main
Compile:
PGI
$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o proxy.out
Run:
$ module load cudatoolkit
$ export CRAY_CUDA_PROXY=1
$ aprun -n16 ./proxy.out
OpenACC Launched Concurrent Kernels in Fortran
By default, OpenACC will block on the CPU when launching kernels or performing data movement. Using the async(i) clause allows asynchronous kernel launches and data transfers. Although it is implementation dependent, the integer argument i will generally determine which stream the kernel or data operation is launched into. At the time of this writing, each implementation handles the integer argument slightly differently.
launcher.f90
program main
  implicit none

  integer(8) :: num_cycles, cycles, i
  integer :: max_kernels, num_kernels, stream_id
  real(8) :: start, stop, seconds
  real(4) :: foo(33)

  ! Set number of cycles to sleep for 1 second
  ! We'll use frequency/instruction latency to estimate this
  num_cycles = 730000000/(15*5)

  ! Maximum number of kernels to launch
  max_kernels = 33

  ! Loop through number of kernels to launch, from 1 to max_kernels
  do num_kernels = 1, max_kernels

    ! Start timer
    call cpu_time(start)

    ! Launch num_kernels kernels asynchronously
    do i = 1, num_kernels
      !$acc parallel async(i) vector_length(1) num_gangs(1) copy(foo)
      !$acc loop seq
      do cycles = 1, num_cycles
        foo(i) = sin(1.5708)
      enddo
      !$acc end loop
      !$acc end parallel
    enddo

    ! Wait for all async streams to complete
    !$acc wait

    ! Stop timer
    call cpu_time(stop)

    print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds'

  enddo

end program main
Note: Because the loop body repeatedly assigns the same value, the compiler may optimize the loop away entirely. Adding the volatile qualifier to the declaration of foo should force the loop to be kept. The volatile qualifier is currently not supported in PGI accelerator regions.
Compile:
PGI
$ module load cudatoolkit
$ ftn -acc launcher.f90 -o acc.out
Cray
$ module load cudatoolkit
$ ftn -hacc launcher.f90 -o acc.out
Run:
$ module load cudatoolkit
$ aprun ./acc.out