To fully exploit the computational power of the GPU, a large amount of data parallelism must be expressed. If your problem does not possess sufficient data parallelism, a second option is to combine data parallelism with task parallelism on the GPU through the use of concurrent kernels. To facilitate task parallelism, the NVIDIA Kepler K20X features Hyper-Q, a set of 32 hardware-managed work queues. When using CUDA streams, each stream is automatically mapped onto Hyper-Q, allowing up to 32 streams to execute concurrently. The NVIDIA Multi-Process Service (MPS) allows multiple processes, such as intra-node MPI ranks, to be mapped onto Hyper-Q as well. This tutorial demonstrates how to take advantage of GPU concurrency on Titan through the use of Hyper-Q. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.


C

The Serial, OpenMP, and MPI C samples make use of the following GPU wrapper code. The wrapper provides the following functions (a minimal usage sketch follows the list):

  • sleep(): GPU kernel that spins for a given number of clock cycles
  • get_cycles(): Return the number of cycles required to sleep for the requested number of seconds
  • create_streams(): Create a given number of non-default streams
  • sleep_kernel(): Launch a single GPU thread that sleeps for a given number of cycles
  • wait_for_streams(): Wait for streams 1 through the specified number of streams to complete any work
  • destroy_streams(): Destroy the non-default streams
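
A typical call sequence looks like the following minimal sketch (the one-second duration and the four-stream count are arbitrary choices for illustration; link against the compiled wrapper object as shown below):

#include <stdint.h>

int64_t get_cycles(float seconds);
void create_streams(int num_streams);
void sleep_kernel(int64_t num_cycles, int stream_id);
void wait_for_streams(int num_streams);
void destroy_streams(int num_streams);

int main()
{
    int64_t cycles = get_cycles(1.0f);  // cycles needed to sleep for 1 second

    create_streams(4);               // creates streams 1..4; stream 0 is the default stream
    for (int i = 1; i <= 4; i++)
        sleep_kernel(cycles, i);     // one concurrent kernel per non-default stream
    wait_for_streams(4);             // block until all four kernels complete
    destroy_streams(4);

    return 0;
}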

sleep.cu

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>   // int64_t
#include <math.h>
#include <time.h>
 
static cudaStream_t *streams;

// CUDA kernel to pause for at least num_cycle cycles
__global__ void sleep(int64_t num_cycles)
{
    int64_t cycles = 0;
    int64_t start = clock64();
    while(cycles < num_cycles) {
        cycles = clock64() - start;
    }
}
 
// Returns number of cycles required for requested seconds
extern "C" int64_t get_cycles(float seconds)
{
    // Get device clock rate; prop.clockRate is reported in kHz, so convert to Hz
    int64_t Hz;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    Hz = int64_t(prop.clockRate) * 1000;
 
    // Calculate number of cycles to wait
    int64_t num_cycles;
    num_cycles = (int64_t)(seconds * Hz);
   
    return num_cycles;
}
 
// Create streams
extern "C" void create_streams(int num_streams)
{
    // Allocate streams
    streams = (cudaStream_t *) malloc((num_streams+1)*sizeof(cudaStream_t));    
 
    // Default stream
    streams[0] = NULL;

    // Primer kernel launch: absorbs one-time CUDA startup overhead so it does not skew timings
    sleep<<< 1, 1 >>>(1); 

    // Create streams
    for(int i = 1; i <= num_streams; i++)
        cudaStreamCreate(&streams[i]);
}
 
// Launches a kernel that sleeps for num_cycles
extern "C" void sleep_kernel(int64_t num_cycles, int stream_id)
{
    // Launch a single GPU thread to sleep
    int blockSize, gridSize;
    blockSize = 1;
    gridSize = 1;
 
    // Execute the kernel
    sleep<<< gridSize, blockSize, 0, streams[stream_id] >>>(num_cycles);
}
 
// Wait for stream to complete
extern "C" void wait_for_stream(int stream_id)
{
    cudaStreamSynchronize(streams[stream_id]);
}
 
// Wait for streams to complete
extern "C" void wait_for_streams(int num_streams)
{
    for(int i = 1; i <= num_streams; i++)
        cudaStreamSynchronize(streams[i]);
}
 
// Destroy stream objects
extern "C" void destroy_streams(int num_streams)
{
    // Clean up stream
    for(int i = 1; i <= num_streams; i++)
        cudaStreamDestroy(streams[i]);
    free(streams);
}
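
As a concrete check of the arithmetic in get_cycles(): cudaDeviceProp.clockRate is reported in kHz, and the K20X clock runs at roughly 732 MHz, so one second corresponds to roughly 732 million cycles. A standalone sketch, assuming that representative clock rate:

#include <stdio.h>
#include <stdint.h>

int main()
{
    // 732000 (kHz) is assumed here as a representative K20X value of
    // prop.clockRate; the wrapper queries the actual value at run time.
    int64_t Hz = (int64_t)732000 * 1000;          // 732,000,000 Hz
    float seconds = 1.0f;
    int64_t num_cycles = (int64_t)(seconds * Hz); // ~7.32e8 cycles
    printf("%lld cycles for %.1f s\n", (long long)num_cycles, seconds);
    return 0;
}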

Compile:
The following will compile the wrapper functions into object code to be linked in with our examples.

$ module load cudatoolkit
$ nvcc -arch=sm_35 -c sleep.cu

Serially Launched Concurrent Kernels in C

The following sample launches multiple GPU kernels, each of which sleeps a single GPU thread for one second. GPU kernel launches are asynchronous with respect to the CPU, which allows a simple for loop to be used to launch concurrent kernels. Each kernel is launched into its own stream, allowing up to 32 kernels to execute concurrently. The loop goes up to 33 kernels to show what happens when the 32 hardware queues are oversubscribed: the 33rd kernel must wait for a queue to free up, roughly doubling the elapsed time.

launcher.c

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/time.h>

int64_t get_cycles(float seconds);
void sleep_kernel(int64_t num_cycles, int stream_id);
void create_streams(int num_streams);
void wait_for_streams(int num_streams);
void destroy_streams(int num_streams);

int main(int argc, char *argv[])
{
    int64_t cycles;
    struct timeval start, stop;
    int i, num_kernels;

    // Get number of cycles to sleep for 1 second
    cycles = get_cycles(1.0);

    // Maximum number of kernels to launch (one more than the 32 queues)
    int max_kernels = 33;

    // Loop over the number of kernels to launch, from 1 to max_kernels
    for(num_kernels=1; num_kernels<=max_kernels; num_kernels++)
    {
        // Start timer
        gettimeofday(&start, NULL);

        // Create streams
        create_streams(num_kernels);

        // Launch num_kernels kernels asynchronously
        for(i=1; i<=num_kernels; i++){
            sleep_kernel(cycles, i);
        }

        // Wait for kernels to complete
        wait_for_streams(num_kernels);

        // Clean up streams
        destroy_streams(num_kernels);

        // Print seconds elapsed
        gettimeofday(&stop, NULL);
        double seconds;
        seconds = (stop.tv_sec - start.tv_sec);
        seconds += (stop.tv_usec - start.tv_usec) / 1000000.0;
        printf("Total time for %d kernels: %f s\n", num_kernels, seconds);
    }

    return 0;
}

Compile:

$ module load cudatoolkit
$ cc sleep.o launcher.c -o serial.out

Run:

$ module load cudatoolkit
$ aprun ./serial.out

OpenMP Launched Concurrent Kernels in C

The following example uses OpenMP to launch multiple sleep kernels inside a parallel region, using each OpenMP thread's number to select a stream. Each kernel is launched into its own stream, allowing up to 32 kernels to execute concurrently.

launcher.c

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <omp.h>

int64_t get_cycles(float seconds);
void sleep_kernel(int64_t num_cycles, int stream_id);
void create_streams(int num_streams);
void wait_for_streams(int num_streams);
void destroy_streams(int num_streams);

int main(int argc, char *argv[])
{
    int64_t cycles;
    double start, stop;
    int i, num_kernels;

    // Get number of cycles to sleep for 1 second
    cycles = get_cycles(1.0);

    // Maximum number of kernels to launch
    int max_kernels = 33;

    // Loop over the number of kernels to launch, from 1 to max_kernels
    for(num_kernels=1; num_kernels<=max_kernels; num_kernels++)
    {

        // Set number of OMP threads
        omp_set_num_threads(num_kernels);

        // Start timer
        start = omp_get_wtime();

        // Create streams
        create_streams(num_kernels);

        // Launch num_kernels kernels asynchronously
        #pragma omp parallel firstprivate(cycles)
        {
            int stream_id = omp_get_thread_num()+1;
            sleep_kernel(cycles, stream_id);
        }

        // Wait for kernels to complete
        wait_for_streams(num_kernels);
   
        // Clean up streams
        destroy_streams(num_kernels);

        // Stop timer
        stop = omp_get_wtime();
        printf("Total time for %d kernels: %f s\n", num_kernels, stop-start);
    }

    return 0;
}

Compile:
PGI

$ module load cudatoolkit
$ cc -mp sleep.o launcher.c -o openmp.out

GNU

$ module load cudatoolkit
$ cc -fopenmp sleep.o launcher.c -o openmp.out

Intel

$ module load cudatoolkit
$ cc -openmp sleep.o launcher.c -o openmp.out

Cray

$ module load cudatoolkit
$ cc sleep.o launcher.c -o openmp.out

Run:

$ module load cudatoolkit
$ aprun -d16 ./openmp.out

MPI Launched Concurrent Kernels in C

The following example uses MPI to launch one GPU kernel per MPI process into the default stream. Titan's GPU compute mode does not allow multiple processes to access the GPU simultaneously, so the Cray CUDA proxy must be enabled by setting CRAY_CUDA_PROXY. When enabled, the proxy works in conjunction with Hyper-Q to allow kernels from up to 32 different MPI processes to run concurrently. Note, however, that Titan currently allows at most 16 MPI processes per node.
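
The compute mode can be confirmed at run time by querying the device properties. A minimal standalone sketch, not part of the tutorial source:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // In exclusive-process mode only one process may hold a CUDA context
    // on the device at a time, which is why the proxy is required for
    // multiple MPI ranks to share the GPU.
    if (prop.computeMode == cudaComputeModeExclusiveProcess)
        printf("Device 0 is in exclusive-process compute mode\n");
    else
        printf("Device 0 compute mode: %d\n", prop.computeMode);

    return 0;
}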

launcher.c

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "mpi.h"

int64_t get_cycles(float seconds);
void sleep_kernel(int64_t num_cycles, int stream_id);
void create_streams(int num_streams);
void wait_for_stream(int stream_id);
void destroy_streams(int num_streams);

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int64_t cycles;
    double start, stop;
    int num_kernels;

    // Get number of cycles to sleep for 1 second
    cycles = get_cycles(1.0);

    // Maximum number of kernels to launch
    int max_kernels = size;

    // Setup default stream in sleep.cu wrapper
    create_streams(0);

    // Loop through number of kernels to launch, from 1 to max_kernels
    for(num_kernels=1; num_kernels<=max_kernels; num_kernels++)
    {
        // Start timer
        MPI_Barrier(MPI_COMM_WORLD);
        if(rank == 0)
            start = MPI_Wtime();

        // Launch kernel into default stream
        if(rank < num_kernels)
            sleep_kernel(cycles, 0);

        // Wait for all ranks to submit kernel
        MPI_Barrier(MPI_COMM_WORLD);

        // Wait for default stream
        if(rank < num_kernels)
            wait_for_stream(0);

        // Wait for all ranks to complete
        MPI_Barrier(MPI_COMM_WORLD);

        // Print seconds elapsed
        if(rank == 0) {
            stop = MPI_Wtime();
            printf("Total time for %d kernels: %f s\n", num_kernels, stop-start);
        }
    }

    destroy_streams(0);
    MPI_Finalize();

    return 0;
}

Compile:

$ module load cudatoolkit
$ cc sleep.o launcher.c -o proxy.out
Note that for simplicity the sleep wrapper does not do proper error checking; running this example without CRAY_CUDA_PROXY enabled will fail silently.
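
If you adapt these examples, an error-checking macro of the following form will surface such failures. This is a minimal sketch, not part of the tutorial source:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Wrap CUDA runtime calls and abort with a readable message
// instead of failing silently.
#define CUDA_CHECK(call)                                          \
do {                                                              \
    cudaError_t err = (call);                                     \
    if (err != cudaSuccess) {                                     \
        fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                cudaGetErrorString(err), __FILE__, __LINE__);     \
        exit(EXIT_FAILURE);                                       \
    }                                                             \
} while (0)

int main()
{
    // Example: checked versions of the stream calls used by the wrapper
    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreate(&stream));
    CUDA_CHECK(cudaStreamSynchronize(stream));
    CUDA_CHECK(cudaStreamDestroy(stream));
    return 0;
}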

Run:

$ module load cudatoolkit
$ export CRAY_CUDA_PROXY=1
$ aprun -n16 ./proxy.out

OpenACC Launched Concurrent Kernels in C

OpenACC by default blocks on the CPU when launching kernels or performing data movement. The async(i) clause allows asynchronous kernel launches and data transfers. Although the behavior is implementation dependent, the integer argument, i, generally determines which stream the kernel or data operation is launched into. At the time of this writing each implementation handles the integer argument slightly differently.

launcher.c

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "math.h"
#include <sys/time.h>

int main(int argc, char *argv[])
{

    uint64_t num_cycles;
    struct timeval start, stop;
    int i,num_kernels;

    // Set number of iterations needed to busy-wait for roughly 1 second,
    // estimated as clock frequency (~730 MHz) / instruction latency (~15 cycles)
    num_cycles = 730000000/15;

    // Maximum number of kernels to launch
    int max_kernels = 33;

    // Loop over the number of kernels to launch, from 1 to max_kernels
    for(num_kernels=1; num_kernels<=max_kernels; num_kernels++)
    {
        // Start timer
        gettimeofday(&start, NULL);

        // Launch num_kernels kernels asynchronously, one per async queue
        for(i=1; i<=num_kernels; i++)
        {
            #pragma acc parallel async(i) vector_length(1) num_gangs(1)
            {
                uint64_t cycles;
                #pragma acc loop seq
                for(cycles=0; cycles<num_cycles; cycles++) {
                    cycles = cycles; // self-assignment keeps the loop busy
                }
            }
        }

        // Wait for all async streams to complete
        #pragma acc wait

        // Print seconds elapsed
        gettimeofday(&stop, NULL);
        double seconds;
        seconds = (stop.tv_sec - start.tv_sec);
        seconds += (stop.tv_usec - start.tv_usec) / 1000000.0;
        printf("Total time for %d kernels: %f s\n", num_kernels, seconds);
    }

    return 0;
}
Please note that the Cray compiler will optimize the busy-wait loop away by default. Adding the volatile qualifier to the declaration of cycles should force the loop to be kept. The volatile qualifier is currently not supported in PGI accelerator regions.
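
For example, the busy-wait region with a volatile counter might look like the following sketch for the Cray compiler:

#pragma acc parallel async(i) vector_length(1) num_gangs(1)
{
    // volatile forces the compiler to keep the loads and stores,
    // so the busy-wait loop is not optimized away
    volatile uint64_t cycles;
    #pragma acc loop seq
    for(cycles=0; cycles<num_cycles; cycles++) {
        cycles = cycles;
    }
}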

Compile:
PGI

$ module load cudatoolkit
$ cc -acc launcher.c -o acc.out

Cray

$ module load cudatoolkit
$ cc -hpragma=acc launcher.c -o acc.out

Run:

$ module load cudatoolkit
$ aprun ./acc.out

Fortran

The Serial, OpenMP, and MPI Fortran samples make use of the following CUDA Fortran GPU wrapper code. The wrapper provides the following functions:

  • sleep(): GPU kernel that spins for a given number of clock cycles
  • get_cycles(): Return the number of cycles required to sleep for the requested number of seconds
  • create_streams(): Create a given number of non-default streams
  • sleep_kernel(): Launch a single GPU thread that sleeps for a given number of cycles
  • wait_for_streams(): Wait for streams 1 through the specified number of streams to complete any work
  • destroy_streams(): Destroy the non-default streams
For simplicity the following will use the PGI programming environment, as it provides CUDA Fortran. For additional information on compiler interoperability please see the tutorial Compiling mixed GPU and CPU code.

sleep.cuf

module sleep
    use cudadevice
    use cudafor
    implicit none
 
    integer, dimension(:), allocatable :: streams
 
    contains
 
    !CUDA kernel to pause for at least num_cycle cycles
    attributes(global) subroutine sleep(num_cycles)
        integer(8), value :: num_cycles
        integer(8) :: cycles
        integer(8) :: start
 
        cycles = 0
        start = clock64()
        do while (cycles < num_cycles)
            cycles = clock64() - start
        enddo
    end subroutine sleep
 
    !Returns number of cycles required for requested seconds
    integer(8) function get_cycles(seconds) result(num_cycles)
        real(8), intent(in) :: seconds
        integer(8) :: istat, Hz
        type(cudadeviceprop) :: prop
 
        istat = cudaGetDeviceProperties(prop, 0)
        Hz = prop%clockRate * 1000
        num_cycles = seconds * Hz
    end function get_cycles
 
    !Create streams
    subroutine create_streams(num_streams)
        integer :: num_streams, istat, i
 
        allocate(streams(num_streams+1))

        ! Default stream
        streams(1) = 0

        ! Primer kernel launch: absorbs one-time startup overhead
        call sleep<<< 1, 1 >>>(int8(1))

        do i=2,num_streams+1
            istat = cudaStreamCreate(streams(i))
        enddo
    end subroutine create_streams
 
    !Launches a kernel that sleeps for num_cycles
    subroutine sleep_kernel(num_cycles, stream_id)
        integer(8) :: num_cycles
        integer    ::  stream_id
        type(dim3) :: blockSize, gridSize
 
        blockSize = dim3(1,1,1)
        gridSize = dim3(1,1,1)
 
        call sleep<<<gridSize, blockSize, 0, streams(stream_id)>>>(num_cycles)
    end subroutine sleep_kernel
 
    ! Wait for stream to complete
    subroutine wait_for_stream(stream_id)
        integer :: stream_id, istat
 
        istat = cudaStreamSynchronize(streams(stream_id))
    end subroutine wait_for_stream
 
    ! Wait for streams to complete
    subroutine wait_for_streams(num_streams)
        integer :: num_streams, istat, i
 
        do i=2,num_streams+1
            istat = cudaStreamSynchronize(streams(i))
        enddo
    end subroutine wait_for_streams
 
    ! Destroy streams
    subroutine destroy_streams(num_streams)
        integer :: num_streams, i, istat
 
        do i=2,num_streams+1
            istat = cudaStreamDestroy(streams(i))
        enddo
        deallocate(streams)
    end subroutine destroy_streams
 
end module sleep

Serially Launched Concurrent Kernels in Fortran

The following sample launches multiple GPU kernels that each sleep a single GPU thread for one second. GPU kernel launches are asynchronous with respect to the CPU, which allows a simple do loop to be used to launch concurrent kernels. Each kernel is launched into its own stream, allowing up to 32 kernels to execute concurrently.

launcher.f90

program main
    use sleep

    integer(8) :: cycles
    integer    ::  max_kernels, num_kernels, i
    real(8)    :: start, stop, seconds

    ! Get number of cycles to sleep for 1 second
    seconds = 1.0
    cycles = get_cycles(seconds)

    ! Maximum number of kernels to launch
    max_kernels = 32

    ! Loop over the number of kernels to launch, from 1 to max_kernels
    do num_kernels = 1, max_kernels

        ! Start timer
        call cpu_time(start)

        ! Create streams
        call create_streams(num_kernels)

        ! Launch num_kernels kernels asynchronously
        do i = 2, num_kernels+1
            call sleep_kernel(cycles, i)
        enddo

        ! Wait for kernels to complete
        call wait_for_streams(num_kernels)

        ! Clean up streams
        call destroy_streams(num_kernels)

        ! Stop timer
        call cpu_time(stop)

        print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds'

    enddo

end program main

Compile:
PGI

$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o serial.out

OpenMP Launched Concurrent Kernels in Fortran

The following example uses OpenMP to launch multiple sleep kernels inside a parallel region, using each OpenMP thread's number to select a stream. Each kernel is launched into its own stream, allowing up to 32 kernels to execute concurrently.

launcher.f90

program main
    use sleep
    use omp_lib

    implicit none

    integer(8) :: cycles
    integer    ::  max_kernels, num_kernels, stream_id
    real(8)    :: start, stop, seconds

    ! Get number of cycles to sleep for 1 second
    seconds = 1.0
    cycles = get_cycles(seconds)

    ! Maximum number of kernels to launch
    max_kernels = 33

    ! Loop over the number of kernels to launch, from 1 to max_kernels
    do num_kernels = 1, max_kernels

        ! Set number of OMP threads
        call omp_set_num_threads(num_kernels)

        ! Start timer
        start = omp_get_wtime()

        ! Create streams
        call create_streams(num_kernels)

        ! Launch num_kernels kernels asynchronously
        !$omp parallel private(stream_id) firstprivate(cycles)
        stream_id = omp_get_thread_num()+2
        call sleep_kernel(cycles, stream_id)
        !$omp end parallel

        ! Wait for kernels to complete
        call wait_for_streams(num_kernels)

        ! Clean up streams
        call destroy_streams(num_kernels)

        ! Stop timer
        stop = omp_get_wtime()

        print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds'

    enddo

end program main

Compile:
PGI

$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o openmp.out

MPI Launched Concurrent Kernels in Fortran

The following example uses MPI to launch one GPU kernel per MPI process into the default stream. Titan's GPU compute mode does not allow multiple processes to access the GPU simultaneously, so the Cray CUDA proxy must be enabled by setting CRAY_CUDA_PROXY. When enabled, the proxy works in conjunction with Hyper-Q to allow kernels from up to 32 different MPI processes to run concurrently. Note, however, that Titan currently allows at most 16 MPI processes per node.

launcher.f90

program main
    use sleep
    implicit none
    include 'mpif.h'

    integer    ::  max_kernels, num_kernels, i, ierr, rank, size
    integer(8) :: cycles
    real(8)    :: start, stop, seconds

    call MPI_Init(ierr)

    ! Get number of cycles to sleep for 1 second
    seconds = 1.0
    cycles = get_cycles(seconds)

    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, size, ierr)

    ! Maximum number of kernels to launch
    max_kernels = size

    ! Set up the default stream in the sleep.cuf wrapper
    call create_streams(0)

    ! Loop through number of kernels to launch, from 1 to max_kernels
    do num_kernels = 1, max_kernels

        ! Start timer
        call MPI_Barrier(MPI_COMM_WORLD, ierr)
        if (rank == 0) then
            start = MPI_Wtime()
        endif

        ! Launch a kernel into the default stream
        if (rank < num_kernels) then
            call sleep_kernel(cycles, 1)
        endif

        ! Wait for all ranks to submit kernel
        call MPI_Barrier(MPI_COMM_WORLD, ierr)

        ! Wait for the default stream (streams(1)) to complete
        if(rank < num_kernels) then
            call wait_for_stream(1)
        endif

        ! Wait for all ranks to complete
        call MPI_Barrier(MPI_COMM_WORLD, ierr)

        ! Print seconds elapsed
        if (rank == 0) then
            stop = MPI_Wtime()
            print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds'
        endif

    enddo

    ! Clean up the array in the wrapper; no stream is actually destroyed
    call destroy_streams(0)

    call MPI_Finalize(ierr)

end program main

Compile:
PGI

$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o proxy.out
Note that for simplicity the sleep wrapper does not do proper error checking; running this example without CRAY_CUDA_PROXY enabled will fail silently.

Run:

$ module load cudatoolkit
$ export CRAY_CUDA_PROXY=1
$ aprun -n16 ./proxy.out

OpenACC Launched Concurrent Kernels in Fortran

OpenACC by default blocks on the CPU when launching kernels or performing data movement. The async(i) clause allows asynchronous kernel launches and data transfers. Although the behavior is implementation dependent, the integer argument, i, generally determines which stream the kernel or data operation is launched into. At the time of this writing each implementation handles the integer argument slightly differently.

launcher.f90

program main
    implicit none

    integer(8) :: num_cycles, cycles, i
    integer    :: max_kernels, num_kernels, stream_id
    real(8)    :: start, stop, seconds
    real(4)    :: foo(33)

    ! Set number of iterations needed to busy-wait for roughly 1 second,
    ! estimated from the clock frequency (~730 MHz) and the approximate
    ! cost of each loop iteration
    num_cycles = 730000000/(15*5)

    ! Maximum number of kernels to launch
    max_kernels = 33

    ! Loop over the number of kernels to launch, from 1 to max_kernels
    do num_kernels = 1, max_kernels

        ! Start timer
        call cpu_time(start)

        ! Launch num_kernels kernels asynchronously
        do i = 1, num_kernels
            !$acc parallel async(i) vector_length(1) num_gangs(1) copy(foo)
            !$acc loop seq
            do cycles = 1, num_cycles
                foo(i) = sin(1.5708)
            enddo
            !$acc end loop
            !$acc end parallel
        enddo

        ! Wait for all async streams to complete
        !$acc wait

        ! Stop timer
        call cpu_time(stop)

        print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds'

    enddo

end program main
Please note that the Cray compiler will optimize the busy-wait loop away by default. Adding the volatile attribute to the declaration of foo (for example, real(4), volatile :: foo(33)) should force the loop to be kept. The volatile attribute is currently not supported in PGI accelerator regions.

Compile:
PGI

$ module load cudatoolkit
$ ftn -acc launcher.f90 -o acc.out

Cray

$ module load cudatoolkit
$ ftn -hacc launcher.f90 -o acc.out

Run:

$ module load cudatoolkit
$ aprun ./acc.out