Several GPU accelerated libraries are provided on OLCF systems. Usage for the most common accelerated libraries is outlined below.

MAGMA

The MAGMA project aims to develop a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures. For C and Fortran code currently using LAPACK, this should be a relatively simple port and does not require CUDA knowledge.

Using MAGMA

This module is currently only compatible with the GNU programming environment:

$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit magma

To link in the MAGMA library while on Titan:

$ cc -lcuda -lmagma source.c

Linking in MAGMA on Rhea is a bit different because, on Titan, the compiler wrapper takes care of some of the extra flags. For example:

$ nvcc $MAGMA_INC $MAGMA_LIB -L$ACML_DIR/gfortran64/lib -lmagma -lcublas -lacml source.c

This also requires that a BLAS library, such as ACML, be loaded:

$ module load acml
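
As noted above, code that already calls LAPACK maps almost directly onto MAGMA. The following is a minimal, hypothetical sketch (not taken from the MAGMA distribution) that replaces a LAPACK LU factorization, dgetrf, with its hybrid CPU/GPU counterpart magma_dgetrf, which keeps the familiar argument order; the initialization calls and exact prototypes may vary slightly between MAGMA versions, so consult the MAGMA documentation for the installed release.

#include <stdio.h>
#include <stdlib.h>
#include "magma.h"

int main()
{
    // Initialization/cleanup calls may differ between MAGMA versions
    magma_init();

    magma_int_t n = 1000, info;
    magma_int_t *ipiv;
    double *A;
    int i, j;

    // MAGMA's CPU interface takes host pointers, just as LAPACK does
    A    = (double*)malloc(n*n*sizeof(double));
    ipiv = (magma_int_t*)malloc(n*sizeof(magma_int_t));

    // Fill A with a simple diagonal matrix (column major)
    for(j=0; j<n; j++)
        for(i=0; i<n; i++)
            A[j*n + i] = (i==j) ? 2.0 : 0.0;

    // LAPACK equivalent: dgetrf(&n, &n, A, &n, ipiv, &info);
    magma_dgetrf(n, n, A, n, ipiv, &info);
    printf("magma_dgetrf returned info = %d\n", (int)info);

    free(A);
    free(ipiv);
    magma_finalize();
    return 0;
}

On Titan this sketch would be built with the same wrapper invocation shown above, e.g. cc -lcuda -lmagma source.c.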

MAGMA Resources

For a comprehensive user manual please see the MAGMA Documentation. A knowledgeable MAGMA User Forum is also available for personalized help. To see MAGMA in action, see the following two PGI articles that include full example code of MAGMA usage with PGI Fortran: Using MAGMA With PGI Fortran and Using GPU-enabled Math Libraries with PGI Fortran.

cuBLAS/cuSPARSE

cuBLAS and cuSPARSE are NVIDIA-provided GPU BLAS routines optimized for dense and sparse use, respectively. If your program currently uses BLAS routines, integration should be straightforward and minimal CUDA knowledge is needed. Although primarily designed for use in C/C++ code, Fortran bindings are available.

Using cuBLAS/cuSPARSE

cuBLAS and cuSPARSE are accessed through the cublas and cusparse headers, respectively, and need to be linked against the corresponding libraries (-lcublas or -lcusparse):

$ module load cudatoolkit
$ cc -lcublas source.c
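
cuSPARSE follows the same handle-based pattern. The following is a minimal sketch, assuming the CSR sparse matrix-vector product routine cusparseDcsrmv provided by the cuSPARSE releases of this CUDA generation; it is linked with -lcusparse rather than -lcublas.

#include <stdio.h>
#include <cuda_runtime.h>
#include "cusparse.h"

int main()
{
    // 2x2 sparse matrix in CSR format: [1 2; 0 3]
    int m = 2, n = 2, nnz = 3;
    int    h_rowPtr[3] = {0, 2, 3};
    int    h_colInd[3] = {0, 1, 1};
    double h_val[3]    = {1.0, 2.0, 3.0};
    double h_x[2]      = {1.0, 1.0};
    double h_y[2]      = {0.0, 0.0};

    // Allocate and fill device storage
    int *d_rowPtr, *d_colInd;
    double *d_val, *d_x, *d_y;
    cudaMalloc((void**)&d_rowPtr, 3*sizeof(int));
    cudaMalloc((void**)&d_colInd, nnz*sizeof(int));
    cudaMalloc((void**)&d_val,    nnz*sizeof(double));
    cudaMalloc((void**)&d_x,      n*sizeof(double));
    cudaMalloc((void**)&d_y,      m*sizeof(double));
    cudaMemcpy(d_rowPtr, h_rowPtr, 3*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_colInd, h_colInd, nnz*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_val,    h_val,    nnz*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_x,      h_x,      n*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y,      h_y,      m*sizeof(double), cudaMemcpyHostToDevice);

    // Create cusparse instance and a default (general, zero-based) matrix descriptor
    cusparseHandle_t handle;
    cusparseMatDescr_t descr;
    cusparseCreate(&handle);
    cusparseCreateMatDescr(&descr);

    // y = alpha*A*x + beta*y
    double alpha = 1.0, beta = 0.0;
    cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   m, n, nnz, &alpha, descr,
                   d_val, d_rowPtr, d_colInd,
                   d_x, &beta, d_y);

    // Retrieve and check result
    cudaMemcpy(h_y, d_y, m*sizeof(double), cudaMemcpyDeviceToHost);
    printf("y = [%f, %f], should be [3.0, 3.0]\n", h_y[0], h_y[1]);

    // Clean up resources
    cusparseDestroyMatDescr(descr);
    cusparseDestroy(handle);
    cudaFree(d_rowPtr);
    cudaFree(d_colInd);
    cudaFree(d_val);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}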

cuBLAS/cuSPARSE Resources

The cuBLAS and cuSPARSE user guides are available for download from NVIDIA; these guides provide complete function listings as well as example code. The NVIDIA SDK provides sample code and can be accessed using the instructions below. An example of using cuBLAS with PGI Fortran is available in Using GPU-enabled Math Libraries with PGI Fortran.

Running the examples:

Obtain an interactive job and load the appropriate modules:

$ qsub -I -A[projID] -lwalltime=00:30:00,nodes=1
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit nvidia-sdk

Copy the example files:

$ cd $MEMBERWORK/[projid]
$ cp -r $NVIDIA_SDK_PATH/CUDALibraries .
$ cd CUDALibraries

Now each example can be executed:

$ cd bin/linux/release
$ aprun simpleCUBLAS

LibSciACC

Cray’s LibSciACC provides GPU-enabled BLAS and LAPACK routines. LibSciACC provides two interfaces: automatic and manual. The automatic interface is largely transparent to the programmer: LibSciACC determines whether a call is likely to benefit from GPU acceleration and, if so, takes care of accelerating the routine and the associated memory management. The manual interface provides an API to manage accelerator resources, giving the programmer more control.

Using LibSciACC

It is recommended that the craype-accel-nvidia35 module be used to manage LibSciAcc. The LibSciACC Automatic interface is currently compatible with the Cray and GNU programming environments:

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35

LibSciACC will automatically be linked in when using the Cray-provided compiler wrappers:

$ cc source.c
$ ftn source.f90
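
With the automatic interface a code does not need to call anything LibSciACC-specific; an ordinary BLAS call can be intercepted and, when profitable, offloaded by the library. The following is a hypothetical sketch that calls the standard Fortran BLAS symbol dgemm_ from C through the wrappers shown above; whether the multiplication actually runs on the accelerator is left to LibSciACC.

#include <stdio.h>
#include <stdlib.h>

// Standard Fortran BLAS prototype; the Cray wrappers resolve this to libsci_acc
void dgemm_(const char *transa, const char *transb,
            const int *m, const int *n, const int *k,
            const double *alpha, const double *a, const int *lda,
            const double *b, const int *ldb,
            const double *beta, double *c, const int *ldc);

int main()
{
    int n = 1000, i;
    double alpha = 1.0, beta = 0.0;
    double *A = malloc(n*n*sizeof(double));
    double *B = malloc(n*n*sizeof(double));
    double *C = malloc(n*n*sizeof(double));

    // A is the identity, B is all ones, so C = A*B should be all ones
    for(i=0; i<n*n; i++) { A[i] = 0.0; B[i] = 1.0; }
    for(i=0; i<n; i++) A[i*n+i] = 1.0;

    // C = alpha*A*B + beta*C; LibSciACC may run this on the GPU
    dgemm_("N", "N", &n, &n, &n, &alpha, A, &n, B, &n, &beta, C, &n);

    printf("C[0] is: %f, should be: 1.0\n", C[0]);

    free(A);
    free(B);
    free(C);
    return 0;
}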

LibSciACC Resources

The man page intro_libsci_acc provides detailed usage information. The environment variable $LIBSCI_ACC_EXAMPLES_DIR specifies a directory containing several C and Fortran example codes.

cuFFT

cuFFT is a set of optimized GPU fast Fourier transform routines provided by NVIDIA as part of the CUDA toolkit. The cuFFT library provides an API similar to FFTW for managing accelerated FFTs. The cuFFTW interface provides an FFTW3 interface to cuFFT to aid in porting existing applications.

Using cuFFT

The cudatoolkit module will append the include and library directories required by cuFFT. When using NVCC or the GNU programming environment the library can then be linked in:

$ module load cudatoolkit
$ cc -lcufft source.c
$ nvcc -lcufft source.c
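
The following is a minimal sketch of the FFTW-like plan/execute/destroy pattern, using a double-precision complex-to-complex transform (cufftPlan1d and cufftExecZ2Z); a production code would also check the cufftResult return values, which are omitted here for brevity.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include "cufft.h"

int main()
{
    int i, nx = 256;

    // Fill a host signal with a constant so the transform is easy to check
    cufftDoubleComplex *h_data = (cufftDoubleComplex*)malloc(nx*sizeof(cufftDoubleComplex));
    for(i=0; i<nx; i++) {
        h_data[i].x = 1.0;   // real part
        h_data[i].y = 0.0;   // imaginary part
    }

    // Copy the signal to the device
    cufftDoubleComplex *d_data;
    cudaMalloc((void**)&d_data, nx*sizeof(cufftDoubleComplex));
    cudaMemcpy(d_data, h_data, nx*sizeof(cufftDoubleComplex), cudaMemcpyHostToDevice);

    // Create a 1D double-complex plan and execute it in place
    cufftHandle plan;
    cufftPlan1d(&plan, nx, CUFFT_Z2Z, 1);
    cufftExecZ2Z(plan, d_data, d_data, CUFFT_FORWARD);

    // Retrieve the transform; for a constant input only bin 0 should be non-zero
    cudaMemcpy(h_data, d_data, nx*sizeof(cufftDoubleComplex), cudaMemcpyDeviceToHost);
    printf("Bin 0 is: %f, should be: %d\n", h_data[0].x, nx);

    // Clean up resources
    cufftDestroy(plan);
    cudaFree(d_data);
    free(h_data);
    return 0;
}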

cuFFT Resources

NVIDIA provides comprehensive documentation, including example code, available here. For an example of using cuFFT with Fortran through the ISO_C_BINDING interface please see the following example. The OLCF also provides an OpenACC and cuFFT interoperability tutorial.

cuRAND

cuRAND is an NVIDIA-provided random number generator library. cuRAND provides both a host-launched interface and a device-side interface callable from within CUDA kernels. Multiple pseudorandom and quasirandom algorithms are supported.

Using cuRAND

The cudatoolkit module will append the include and library directories required by cuRAND. When using NVCC or the GNU programming environment the library can then be linked in:

$ module load cudatoolkit
$ module switch PrgEnv-pgi PrgEnv-gnu
$ cc -lcurand source.c
$ nvcc -lcurand source.cu
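
The following is a minimal sketch of the host-launched interface: a generator is created on the host, a device array is filled with uniformly distributed doubles, and the results are copied back. The generator type and seed are arbitrary choices for the example.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include "curand.h"

int main()
{
    int i, n = 10;

    // Allocate host and device storage for n random numbers
    double *h_rand, *d_rand;
    h_rand = (double*)malloc(n*sizeof(double));
    cudaMalloc((void**)&d_rand, n*sizeof(double));

    // Create the default pseudorandom generator and seed it
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

    // Generate n uniform doubles in (0,1] on the device
    curandGenerateUniformDouble(gen, d_rand, n);

    // Copy the results back and print them
    cudaMemcpy(h_rand, d_rand, n*sizeof(double), cudaMemcpyDeviceToHost);
    for(i=0; i<n; i++)
        printf("%f\n", h_rand[i]);

    // Clean up resources
    curandDestroyGenerator(gen);
    cudaFree(d_rand);
    free(h_rand);
    return 0;
}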

cuRAND Resources

NVIDIA provides comprehensive documentation, including example code, available here. For an example of using the cuRAND host library please see the following Accelerator Interoperability Tutorial.

Thrust

Thrust is a CUDA-accelerated C++ template library modeled after the Standard Template Library (STL). Thrust provides a high-level host interface for GPU data management as well as an assortment of accelerated algorithms. Even if your application does not currently use the STL, the easy access Thrust provides to many optimized accelerated algorithms makes it worth a look.

Using Thrust

The cudatoolkit module will append the include directories required by Thrust, which is a header-only library. When using NVCC or the GNU programming environment no additional libraries need to be linked:

$ module load cudatoolkit
$ module switch PrgEnv-pgi PrgEnv-gnu
$ CC source.cpp
$ nvcc source.cu
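
The following is a minimal sketch of Thrust's STL-like containers and algorithms: a host_vector is filled on the CPU, assigned to a device_vector (which moves the data to the GPU), then sorted and reduced on the device without any explicit CUDA memory management. It would typically be compiled as a .cu file with nvcc, or as C++ through the wrappers as shown above.

#include <cstdio>
#include <cstdlib>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>

int main()
{
    int i, n = 1000;

    // Fill a host vector with random integers
    thrust::host_vector<int> h_vec(n);
    for(i=0; i<n; i++)
        h_vec[i] = rand() % 1000;

    // Assigning to a device_vector moves the data to the GPU
    thrust::device_vector<int> d_vec = h_vec;

    // Sort and reduce on the device
    thrust::sort(d_vec.begin(), d_vec.end());
    int sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0);

    // Copy the sorted data back to the host
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

    printf("min: %d  max: %d  sum: %d\n", h_vec[0], h_vec[n-1], sum);

    return 0;
}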

Thrust Resources

NVIDIA provides comprehensive documentation, including example code, available here. For an example of using Thrust please see the following Accelerator Interoperability II Tutorial. The GitHub page provides access to the Thrust source code, examples, and information on how to obtain help.

Experience in porting Fortran code to GPUs with OpenACC

Speakers: Mike Zingale and Adam Jacobs
This talk was given at the 9-23-15 OLCF User Group Conference Call.

In the slides and video below, we describe our efforts to put large physics modules written in Fortran onto the GPUs. Our main target is integrating stiff systems of ODEs describing nuclear reactions. Our strategy is to put the entire integration (right-hand side evaluations, Jacobian construction, and the actual timestepping) onto the GPUs. This minimizes the communication costs — only the initial state vector is passed in, and the new state vector comes out at the end. We will discuss the challenges we had in getting the compilers to accept modern Fortran constructions with OpenACC and discuss preliminary results.

Slides: openacc-networks.pdf

Video:

Experience in porting Fortran code to GPUs with OpenACC from OLCF on Vimeo.

To fully exploit the computational power of the GPU, a large amount of data parallelism must generally be expressed. In the specific case of accelerated libraries such as cuBLAS, cuFFT, and cuSPARSE, if each operation does not possess a sufficient amount of data parallelism, another option is to batch many smaller operations into a single large operation. This tutorial demonstrates how to take advantage of batched cuBLAS operations to improve GPU utilization. Additionally, this tutorial expands on the GPU concurrency topics from the first tutorial through the use of streams and Hyper-Q. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.


C

Note: cuBLAS uses column major ordering and 1-based indexing. The following examples utilize square diagonal matrices for simplicity and do not directly address this issue.

A Serial Example in C

The following example performs a single DGEMM operation using the cuBLAS version 2 library. This example will be expanded to show the use of batched cuBLAS calls under a variety of situations. The form of our sample DGEMM operation will be as follows:

C = alpha * A * B + beta * C

cublas.c

#include <stdio.h>
#include <stdlib.h>
#include "math.h"
#include "cublas_v2.h"

int main(int argc, char* argv[])
{
    // Linear dimension of matrices
    int dim = 100;

    // Allocate host storage for A,B,C square matrices
    double *A, *B, *C;
    A = (double*)malloc(dim*dim*sizeof(double));
    B = (double*)malloc(dim*dim*sizeof(double));
    C = (double*)malloc(dim*dim*sizeof(double));

    // Allocate device storage for A,B,C
    double *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, dim*dim*sizeof(double));
    cudaMalloc((void**)&d_B, dim*dim*sizeof(double));
    cudaMalloc((void**)&d_C, dim*dim*sizeof(double));

    // Fill A,B diagonals with sin(i) data, C diagonal with cos(i)^2
    // Matrices are arranged column major
    int i,j,index;
    for(j=0; j<dim; j++) {
        for(i=0; i<dim; i++) {
            index = j*dim + i;
            if(i==j) {
                A[index] = sin(index);
                B[index] = sin(index);
                C[index] = cos(index)*cos(index);
            }
	    else {
                A[index] = 0.0;
                B[index] = 0.0;
                C[index] = 0.0;
            }
        }    
    }

    // Create cublas instance
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Set input matrices on device
    cublasSetMatrix(dim, dim, sizeof(double), A, dim, d_A, dim);
    cublasSetMatrix(dim, dim, sizeof(double), B, dim, d_B, dim);
    cublasSetMatrix(dim, dim, sizeof(double), C, dim, d_C, dim);

    // DGEMM: C = alpha*A*B + beta*C
    double alpha = 1.0;
    double beta  = 1.0;

    cublasDgemm(handle,
                CUBLAS_OP_N, CUBLAS_OP_N,
                dim, dim, dim,
                &alpha,
                d_A, dim,
                d_B, dim,
                &beta,
                d_C, dim);

    // Retrieve result matrix from device
    cublasGetMatrix(dim, dim, sizeof(double), d_C, dim, C, dim);

    // Simple sanity test, sum up all elements
    double sum = 0;
    for(j=0; j<dim; j++) {
        for(i=0; i<dim; i++) {
            index = j*dim + i;
            sum += C[index];
        }
    }
    printf("Sum is: %f, should be: %d\n", sum, dim);   

    // Clean up resources
    free(A);
    free(B);
    free(C);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    cublasDestroy(handle);

    return 0;
}

For simplicity the following compiles will use the GNU programming environment. For more information on using non-GNU compilers please see the Compiling Mixed GPU and CPU code tutorial.

Compile:

$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit
$ cc -lcublas cublas.c -o cublas.out

Run:

$ module load cudatoolkit
$ aprun ./cublas.out

Library-provided Batching in C

The most efficient way to increase GPU utilization for small library operations is to use library-provided batch capabilities. In the code sample below we use cublasDgemmBatched, which takes as an argument an array of pointers to matrices and the number of DGEMM operations to perform in total. By allowing the library to handle the batching, high performance can be achieved with minimal additional effort on the part of the programmer.

cublas-batch.c

#include <stdio.h>
#include <stdlib.h>
#include "math.h"
#include "cublas_v2.h"

int main(int argc, char* argv[])
{
    int i,j,k,index;

    // Linear dimension of matrices
    int dim = 100;

    // Number of A,B,C matrix sets
    int batch_count = 1000;

    // Allocate host storage for batch_count A,B,C square matrices
    double **A, **B, **C;
    A = (double**)malloc(batch_count*sizeof(double*));
    B = (double**)malloc(batch_count*sizeof(double*));
    C = (double**)malloc(batch_count*sizeof(double*));
    for(i=0; i<batch_count; i++) {
        A[i] = (double*)malloc(dim*dim*sizeof(double));
        B[i] = (double*)malloc(dim*dim*sizeof(double));
        C[i] = (double*)malloc(dim*dim*sizeof(double));
    }

    // Create host pointer array to device matrix storage
    double **d_A, **d_B, **d_C, **h_d_A, **h_d_B, **h_d_C;
    h_d_A = (double**)malloc(batch_count*sizeof(double*));
    h_d_B = (double**)malloc(batch_count*sizeof(double*));
    h_d_C = (double**)malloc(batch_count*sizeof(double*));

    for(i=0; i<batch_count; i++) {
        cudaMalloc((void**)&h_d_A[i], dim*dim*sizeof(double));
        cudaMalloc((void**)&h_d_B[i], dim*dim*sizeof(double));
        cudaMalloc((void**)&h_d_C[i], dim*dim*sizeof(double));
    }
    // Copy the host array of device pointers to the device
    cudaMalloc((void**)&d_A, batch_count*sizeof(double*));
    cudaMalloc((void**)&d_B, batch_count*sizeof(double*));
    cudaMalloc((void**)&d_C, batch_count*sizeof(double*));
    cudaMemcpy(d_A, h_d_A, batch_count*sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_d_B, batch_count*sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, h_d_C, batch_count*sizeof(double*), cudaMemcpyHostToDevice);

    // Fill A,B diagonals with k*sin(i) data, C diagonal with k*cos(i)^2
    // Matrices are arranged column major
    for(k=0; k<batch_count; k++) {
        for(j=0; j<dim; j++) {
            for(i=0; i<dim; i++) {
                index = j*dim + i;
                if(i==j) {
                    (A[k])[index] = k*sin(index);
                    (B[k])[index] = sin(index);
                    (C[k])[index] = k*cos(index)*cos(index);
                }
		else {
                    (A[k])[index] = 0.0;
                    (B[k])[index] = 0.0;
                    (C[k])[index] = 0.0;
                }
            } // i   
        } // j
    } // k

    // Create cublas instance
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Set input matrices on device
    for(i=0; i<batch_count; i++) {
        cublasSetMatrix(dim, dim, sizeof(double), A[i], dim, h_d_A[i], dim);
        cublasSetMatrix(dim, dim, sizeof(double), B[i], dim, h_d_B[i], dim);
        cublasSetMatrix(dim, dim, sizeof(double), C[i], dim, h_d_C[i], dim);
    }

    // Set matrix coefficients
    double alpha = 1.0;
    double beta  = 1.0;

    // DGEMM: C = alpha*A*B + beta*C
    cublasDgemmBatched(handle,
                       CUBLAS_OP_N, CUBLAS_OP_N,
                       dim, dim, dim,
                       &alpha,
                       (const double**)d_A, dim,
                       (const double**)d_B, dim,
                       &beta,
                       d_C, dim,
                       batch_count);

    // Retrieve result matrix from device
    for(i=0; i<batch_count; i++)
        cublasGetMatrix(dim, dim, sizeof(double), h_d_C[i], dim, C[i], dim);

    // Simple sanity test, sum up all elements
    double sum = 0;
    for(k=0; k<batch_count; k++) {
        for(j=0; j<dim; j++) {
            for(i=0; i<dim; i++) {
                index = j*dim + i;
                sum += (C[k])[index];
            }
        }
    }
    printf("Element sum is: %f, should be: %d\n", sum, dim*(batch_count-1)*(batch_count)/2);   

    // Clean up resources

    for(i=0; i<batch_count; i++) {
        free(A[i]);
        free(B[i]);
        free(C[i]);
        cudaFree(h_d_A[i]);
        cudaFree(h_d_B[i]);
        cudaFree(h_d_C[i]);
    }

    free(A);
    free(B);
    free(C);
    free(h_d_A);
    free(h_d_B);
    free(h_d_C);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    cublasDestroy(handle);

    return 0;
}

Compile:

$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit
$ cc -lcublas cublas-batch.c -o cublas-batch.out

Run:

$ module load cudatoolkit
$ aprun ./cublas-batch.out

CUDA Streams in C

Not all library functions have a batch equivalent; in this case, using CUDA streams may be appropriate. Each stream can be considered a separate work queue, allowing kernels from separate streams to be executed concurrently if hardware resources allow. In the code sample below we place each DGEMM operation in its own stream, allowing up to 32 concurrent operations through the use of Hyper-Q. The cuBLAS library provides functionality to create, set, and destroy CUDA streams.

cublas-stream.c

#include <stdio.h>
#include <stdlib.h>
#include "math.h"
#include "cublas_v2.h"

int main(int argc, char* argv[])
{
    int i,j,k,index;

    // Linear dimension of matrices
    int dim = 100;

    // Number of A,B,C matrix sets
    int batch_count = 1000;

    // Allocate host storage for batch_count A,B,C square matrices
    double **A, **B, **C;
    A = (double**)malloc(batch_count*sizeof(double*));
    B = (double**)malloc(batch_count*sizeof(double*));
    C = (double**)malloc(batch_count*sizeof(double*));
    for(i=0; i<batch_count; i++) {
        A[i] = (double*)malloc(dim*dim*sizeof(double));
        B[i] = (double*)malloc(dim*dim*sizeof(double));
        C[i] = (double*)malloc(dim*dim*sizeof(double));
    }

    // Allocate device storage for batch_count A,B,C
    double **d_A, **d_B, **d_C;
    d_A = (double**)malloc(batch_count*sizeof(double*));
    d_B = (double**)malloc(batch_count*sizeof(double*));
    d_C = (double**)malloc(batch_count*sizeof(double*));
    for(i=0; i<batch_count; i++) {
        cudaMalloc((void**)&d_A[i], dim*dim*sizeof(double));
        cudaMalloc((void**)&d_B[i], dim*dim*sizeof(double));
        cudaMalloc((void**)&d_C[i], dim*dim*sizeof(double));
    }
    // Fill A,B diagonals with k*sin(i) data, C diagonal with k*cos(i)^2
    // Matrices are arranged column major
    for(k=0; k<batch_count; k++) {
        for(j=0; j<dim; j++) {
            for(i=0; i<dim; i++) {
                index = j*dim + i;
                if(i==j) {
                    (A[k])[index] = k*sin(index);
                    (B[k])[index] = sin(index);
                    (C[k])[index] = k*cos(index)*cos(index);
                }
		else {
                    (A[k])[index] = 0.0;
                    (B[k])[index] = 0.0;
                    (C[k])[index] = 0.0;
                }
            } // i   
        } // j
    } // k

    // Create cublas instance
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Set input matrices on device
    for(i=0; i<batch_count; i++) {
        cublasSetMatrix(dim, dim, sizeof(double), A[i], dim, d_A[i], dim);
        cublasSetMatrix(dim, dim, sizeof(double), B[i], dim, d_B[i], dim);
        cublasSetMatrix(dim, dim, sizeof(double), C[i], dim, d_C[i], dim);
    }

    // Create a stream for every DGEMM operation
    cudaStream_t *streams = (cudaStream_t *) malloc(batch_count*sizeof(cudaStream_t));
    for(i=0; i<batch_count; i++)
        cudaStreamCreate(&streams[i]);

    // Set matrix coefficients
    double alpha = 1.0;
    double beta  = 1.0;

    // Launch each DGEMM operation in own CUDA stream
    for(i=0; i<batch_count; i++){
        // Set CUDA stream
        cublasSetStream(handle, streams[i]);

        // DGEMM: C = alpha*A*B + beta*C
        cublasDgemm(handle,
                    CUBLAS_OP_N, CUBLAS_OP_N,
                    dim, dim, dim,
                    &alpha,
                    d_A[i], dim,
                    d_B[i], dim,
                    &beta,
                    d_C[i], dim);
    }

    // Retrieve result matrix from device
    for(i=0; i<batch_count; i++) {
        cublasGetMatrix(dim, dim, sizeof(double), d_C[i], dim, C[i], dim);
    }

    // Simple sanity test, sum up all elements
    double sum = 0;
    for(k=0; k<batch_count; k++) {
        for(j=0; j<dim; j++) {
            for(i=0; i<dim; i++) {
                index = j*dim + i;
                sum += (C[k])[index];
            }
        }
    }
    printf("Element sum is: %f, should be: %d\n", sum, dim*(batch_count-1)*(batch_count)/2);   

    // Clean up resources
    for(i=0; i<batch_count; i++) {
        free(A[i]);
        free(B[i]);
        free(C[i]);
        cudaFree(d_A[i]);
        cudaFree(d_B[i]);
        cudaFree(d_C[i]);
        cudaStreamDestroy(streams[i]);
    }
    free(A);
    free(B);
    free(C);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    cublasDestroy(handle);

    return 0;
}

Compile:

$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit
$ cc -lcublas cublas-stream.c -o cublas-stream.out

Run:

$ module load cudatoolkit
$ aprun ./cublas-stream.out

Fortran

Wrappers

NVIDIA provides a Fortran interface for the legacy cuBLAS API but does not provide a Fortran interface to the version 2 API, which adds batched operations. One of the changes introduced with the new API is that an opaque cuBLAS handle must be passed into most function calls; while this provides greater control over operations, it is a hindrance to interoperability. The wrapper is split into two pieces: cublas_fortran.cu and cublas_fortran_iso.f90. cublas_fortran.cu provides a lightweight C wrapper around cuBLAS calls, with most calls taking a pointer to the cuBLAS handle that is dereferenced to make the actual call. Since the wrapper functions require only a pointer to the cuBLAS handle, we may interface them with the ISO_C_BINDING module. This ISO_C_BINDING interface is the second piece of our wrapper, cublas_fortran_iso.f90, and allows for robust interoperability between C and Fortran types.

cublas_fortran.cu

#include <stdio.h>
#include "cublas_v2.h"

extern "C" int f_cublasCreate(cublasHandle_t **handle)
{
    *handle = (cublasHandle_t*)malloc(sizeof(cublasHandle_t));
    return cublasCreate(*handle);
}

extern "C" int f_cublasDgemm(cublasHandle_t *handle,
               cublasOperation_t transa, cublasOperation_t transb, 
              int m, int n, int k, 
              const double *alpha,
              const double *A, int lda, 
              const double *B, int ldb,
              const double *beta, 
              double *C, int ldc)
{
    return cublasDgemm(*handle,transa,transb,m,n,k,alpha,A,lda,B,ldb,beta,C,ldc);
}

extern "C" int f_cublasDgemmBatched(cublasHandle_t *handle,
               cublasOperation_t transa, cublasOperation_t transb,
              int m, int n, int k,
              const double *alpha,
              const double **A, int lda,
              const double **B, int ldb,
              const double *beta,
              double **C, int ldc,
              int batch_count)
{
    return cublasDgemmBatched(*handle,transa,transb,m,n,k,alpha,A,lda,B,ldb,beta,C,ldc,batch_count);
}

extern "C" void f_cublasDestroy(cublasHandle_t *handle)
{
    cublasDestroy(*handle);
    free(handle);
}

extern "C" int f_cudaStreamCreate(cudaStream_t **stream)
{
    *stream = (cudaStream_t *) malloc(sizeof(cudaStream_t));
    return cudaStreamCreate(*stream);
}

extern "C" int f_cublasSetStream(cublasHandle_t *handle, cudaStream_t *streamid)
{
    return cublasSetStream(*handle, *streamid);
}

extern "C" void f_cudaStreamDestroy(cudaStream_t *stream)
{
    cudaStreamDestroy(*stream);
}

cublas_fortran_iso.f90

module cublas_f
  use ISO_C_BINDING

    enum, BIND(C)
        enumerator :: CUBLAS_OP_N, CUBLAS_OP_T, CUBLAS_OP_C
    end enum
   
    enum, BIND(C)
        enumerator :: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, &
                      cudaMemcpyDeviceToDevice, cudaMemcpyDefault
    end enum

  INTERFACE
    integer(C_INT) function cudaMalloc(ptr, bytes) BIND(C, NAME='cudaMalloc')
        use ISO_C_BINDING
        type(C_PTR) :: ptr
        integer(C_SIZE_T), value :: bytes
    end function

    integer(C_INT) function cudaMemcpy(dst, src, count, kind) BIND(C, NAME='cudaMemcpy')
        use ISO_C_BINDING
        type(C_PTR), value :: dst
        type(C_PTR), value :: src
        integer(C_SIZE_T), value :: count
        integer(C_INT), value :: kind
    end function

    integer(C_INT) function cublasCreate(handle_ptr) BIND(C, NAME='f_cublasCreate')
        use ISO_C_BINDING
        type(C_PTR) :: handle_ptr
    end function

    subroutine  cublasDestroy(handle_ptr) BIND(C, NAME='f_cublasDestroy')
        use ISO_C_BINDING
        type(C_PTR), value :: handle_ptr
    end subroutine

    subroutine cudaFree(ptr) BIND(C, NAME='cudaFree')
        use ISO_C_BINDING
        type(C_PTR), value :: ptr
    end subroutine

    integer(C_INT) function cublasSetMatrix(rows, cols, elemSize, a_ptr, &
                                           lda, b_ptr, ldb) &
                                           BIND(C, NAME='cublasSetMatrix')
        use ISO_C_BINDING
        integer(C_INT), value :: rows
        integer(C_INT), value :: cols
        integer(C_INT), value :: elemSize
        type(C_PTR),    value :: a_ptr
        integer(C_INT), value :: lda
        type(C_PTR),    value :: b_ptr
        integer(C_INT), value :: ldb
    end function

    integer(C_INT) function cublasGetMatrix(rows, cols, elemSize, a_ptr, &
                                            lda, b_ptr, ldb ) &
                                            BIND(C, NAME='cublasGetMatrix')
        use ISO_C_BINDING
        integer(C_INT), value :: rows
        integer(C_INT), value :: cols
        integer(C_INT), value :: elemSize
        type(C_PTR),    value :: a_ptr
        integer(C_INT), value :: lda
        type(C_PTR),    value :: b_ptr
        integer(C_INT), value :: ldb

    end function

    integer(C_INT) function cublasDgemm(handle, transa, transb, m, n, k, alpha, &
                                        A, lda, B, ldb, beta, C, ldc) &
                                        BIND(C, NAME='f_cublasDgemm')
        use ISO_C_BINDING
        type(C_PTR), value    :: handle
        integer(C_INT), value :: transa
        integer(C_INT), value :: transb
        integer(C_INT), value :: m
        integer(C_INT), value :: n
        integer(C_INT), value :: k
        real(C_DOUBLE)        :: alpha
        type(C_PTR), value    :: A
        integer(C_INT), value :: lda
        type(C_PTR), value    :: B
        integer(C_INT), value :: ldb
        real(C_DOUBLE)        :: beta
        type(C_PTR), value    :: C
        integer(C_INT), value :: ldc
    end function

    integer(C_INT) function cublasDgemmBatched(handle, transa, transb, m, n, k, alpha, &
                                        A, lda, B, ldb, beta, C, ldc, batch_count) &
                                        BIND(C, NAME='f_cublasDgemmBatched')
        use ISO_C_BINDING
        type(C_PTR), value    :: handle
        integer(C_INT), value :: transa
        integer(C_INT), value :: transb
        integer(C_INT), value :: m
        integer(C_INT), value :: n
        integer(C_INT), value :: k
        real(C_DOUBLE)        :: alpha
        type(C_PTR), value    :: A
        integer(C_INT), value :: lda
        type(C_PTR), value    :: B
        integer(C_INT), value :: ldb
        real(C_DOUBLE)        :: beta
        type(C_PTR), value    :: C
        integer(C_INT), value :: ldc
        integer(C_INT), value :: batch_count
    end function

    integer(C_INT) function cudaStreamCreate(stream_ptr) BIND(C, NAME='f_cudaStreamCreate')
        use ISO_C_BINDING
        type(C_PTR) :: stream_ptr
    end function

    integer(C_INT) function cublasSetStream(handle, stream) BIND(C, NAME='f_cublasSetStream')
        use ISO_C_BINDING
        type(C_PTR), value :: handle
        type(C_PTR), value :: stream
    end function

    integer(C_INT) function cudaStreamDestroy(stream) BIND(C, NAME='f_cudaStreamDestroy')
        use ISO_C_BINDING
        type(C_PTR), value :: stream
    end function

    subroutine cudaDeviceSynchronize() BIND(C, NAME='cudaDeviceSynchronize')
    end subroutine

  END INTERFACE

end module cublas_f

A Serial Example in Fortran

The following example performs a single DGEMM operation using the cuBLAS version 2 library. This example will be expanded to show the use of batched cuBLAS calls under a variety of situations. The form of our sample DGEMM operation will be as follows:

C = alpha * A * B + beta * C

Note: cuBLAS uses column major ordering and 1-based indexing to maintain legacy Fortran support.

cublas.f90

program main
    use ISO_C_BINDING
    use cublas_f
    implicit none

    integer :: dim, stat, i, j
    integer(8) :: bytes
    real(8),dimension(:,:), pointer  :: A, B, C
    real(8) :: alpha, beta, index, sum
    type(C_PTR) :: d_A, d_B, d_C
    type(C_PTR) :: handle

    integer :: sizeof_double
    parameter (sizeof_double=8)

    !Linear dimension of matrices
    dim = 100

    ! Allocate host storage for A,B,C square matrices
    allocate(A(dim,dim))
    allocate(B(dim,dim))
    allocate(C(dim,dim))

    ! Allocate device storage for A,B,C
    bytes = int(dim*dim*sizeof_double, 8) 
    stat = cudaMalloc(d_A, bytes)
    stat = cudaMalloc(d_B, bytes)
    stat = cudaMalloc(d_C, bytes)

    ! Fill A,B diagonals with sin(i) data, C diagonal with cos(i)^2
    ! Matrices are arranged column major
    do j=1,dim
        do i=1,dim
            if (i==j) then
                index = real(j*dim + i)
                A(i,j) = sin(index)
                B(i,j) = sin(index)
                C(i,j) = cos(index) * cos(index)
            else
                A(i,j) = 0.0
                B(i,j) = 0.0
                C(i,j) = 0.0
            endif
        enddo
    enddo

    ! Create cublas instance
    stat = cublasCreate(handle)

    ! Set input matrices on device
    stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(A(1,1)), dim, d_A, dim)
    stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(B(1,1)), dim, d_B, dim)
    stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(C(1,1)), dim, d_C, dim)

    ! DGEMM: C = alpha*A*B + beta*C
    alpha = 1.0
    beta = 1.0
    stat = cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, &
                       dim, dim, dim, &
                       alpha,         &
                       d_A, dim,      &
                       d_B, dim,      &
                       beta,          &
                       d_C, dim)

    ! Retrieve result matrix from device
    stat = cublasGetMatrix(dim, dim, sizeof_double, d_C, dim, C_LOC(C(1,1)), dim)

    ! Simple sanity test, sum up all elements
    sum = 0.0
    do j=1,dim
        do i=1,dim
            sum = sum + C(i,j)
        enddo
    enddo
    print *, "Sum is:", sum, "should be: ", dim

    deallocate(A)
    deallocate(B)
    deallocate(C)
    call cudaFree(d_A)
    call cudaFree(d_B)
    call cudaFree(d_C)
    call cublasDestroy(handle)

end program

For simplicity the following compiles will use the GNU programming environment. For more information on using non-GNU compilers please see the Compiling Mixed GPU and CPU code tutorial.

Compile:

$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit
$ nvcc -c cublas_fortran.cu
$ ftn -lcublas cublas_fortran_iso.f90 cublas.f90 cublas_fortran.o -o cublas.out

Run:

$ module load cudatoolkit
$ aprun ./cublas.out

Library-provided Batching in Fortran

The most efficient way to increase GPU utilization for small library operations is to use library-provided batch capabilities. In the code sample below we use cublasDgemmBatched, which takes as an argument an array of pointers to matrices and the number of DGEMM operations to perform in total. By allowing the library to handle the batching, high performance can be achieved with minimal additional effort on the part of the programmer.

cublas-batch.f90

program main
    use ISO_C_BINDING
    use cublas_f
    implicit none

    integer :: dim, stat, i, j, k, batch_count
    integer(8) :: bytes
    real(8),dimension(:,:,:), pointer:: A, B, C
    real(8) :: alpha, beta, index, sum
    type(C_PTR) :: d_A, d_B, d_C
    type(C_PTR), dimension(:), pointer :: h_d_A, h_d_B, h_d_C, streams
    type(C_PTR) :: handle

    integer :: sizeof_double
    parameter (sizeof_double=8)

    !Linear dimension of matrices
    dim = 100

    ! Number of A,B,C matrix sets
    batch_count = 1000

    ! Allocate host storage for A,B,C square matrices
    allocate(A(dim,dim,batch_count))
    allocate(B(dim,dim,batch_count))
    allocate(C(dim,dim,batch_count))

    ! Create host pointer array to device matrix storage
    allocate(h_d_A(batch_count))
    allocate(h_d_B(batch_count))
    allocate(h_d_C(batch_count))
    bytes = dim*dim*sizeof_double

    do i=1,batch_count
        stat = cudaMalloc(h_d_A(i), bytes)
        stat = cudaMalloc(h_d_B(i), bytes)
        stat = cudaMalloc(h_d_C(i), bytes)
    enddo

    ! Copy the host array of device pointers to the device
    bytes = batch_count*sizeof_double ! 8 byte pointers
    stat = cudaMalloc(d_A, bytes)
    stat = cudaMalloc(d_B, bytes)
    stat = cudaMalloc(d_C, bytes)

    stat = cudaMemcpy(d_A, C_LOC(h_d_A(1)), bytes, cudaMemcpyHostToDevice);
    stat = cudaMemcpy(d_B, C_LOC(h_d_B(1)), bytes, cudaMemcpyHostToDevice);
    stat = cudaMemcpy(d_C, C_LOC(h_d_C(1)), bytes, cudaMemcpyHostToDevice);

    ! Fill A,B diagonals with sin(i) data, C diagonal with cos(i)^2
    ! Matrices are arranged column major
    do k=1,batch_count
        do j=1,dim
            do i=1,dim
                if (i==j) then
                    index = real(j*dim + i)
                    A(i,j,k) = k*sin(index)
                    B(i,j,k) = sin(index)
                    C(i,j,k) = k*cos(index) * cos(index)
                else
                    A(i,j,k) = 0.0
                    B(i,j,k) = 0.0
                    C(i,j,k) = 0.0
                endif
            enddo ! i
        enddo ! j
    enddo ! k

    ! Create cublas instance
    stat = cublasCreate(handle)

    ! Set input matrices on device
    do i=1,batch_count
        stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(A(1,1,i)), dim, h_d_A(i), dim)
        stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(B(1,1,i)), dim, h_d_B(i), dim)
        stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(C(1,1,i)), dim, h_d_C(i), dim)
    enddo

    ! Set matrix coefficients
    alpha = 1.0
    beta = 1.0

    ! batched DGEMM: C = alpha*A*B + beta*C
    stat = cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, &
                       dim, dim, dim, &
                       alpha,         &
                       d_A, dim,      &
                       d_B, dim,      &
                       beta,          &
                       d_C, dim,      &
                       batch_count)

    ! Retrieve result matrix from device
    do i=1,batch_count
        stat = cublasGetMatrix(dim, dim, sizeof_double, h_d_C(i), dim, C_LOC(C(1,1,i)), dim)
    enddo

    ! Simple sanity test, sum up all elements
    sum = 0.0
    do k=1,batch_count
        do j=1,dim
            do i=1,dim
                sum = sum + C(i,j,k)
            enddo
        enddo
    enddo
    print *, "Sum is:", sum, "should be: ", dim*(batch_count)*(batch_count+1)/2

    do i=1,batch_count
        call cudaFree(h_d_A(i))
        call cudaFree(h_d_B(i))
        call cudaFree(h_d_C(i))
    enddo

    deallocate(A)
    deallocate(B)
    deallocate(C)
    call cublasDestroy(handle)

end program

Compile:

$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit
$ nvcc -c cublas_fortran.cu
$ ftn -lcublas cublas_fortran_iso.f90 cublas-batch.f90 cublas_fortran.o -o cublas-batch.out

Run:

$ module load cudatoolkit
$ aprun ./cublas-batch.out

CUDA Streams in Fortran

Not all library functions have a batch equivalent; in this case, using CUDA streams may be appropriate. Each stream can be considered a separate work queue, allowing kernels from separate streams to be executed concurrently if hardware resources allow. In the code sample below we place each DGEMM operation in its own stream, allowing up to 32 concurrent operations through the use of Hyper-Q. The cuBLAS library provides functionality to create, set, and destroy CUDA streams.

cublas-stream.f90

program main
    use ISO_C_BINDING
    use cublas_f
    implicit none

    integer :: dim, stat, i, j, k, batch_count
    integer(8) :: bytes
    real(8),dimension(:,:,:), pointer:: A, B, C
    real(8) :: alpha, beta, index, sum
    type(C_PTR), dimension(:), pointer :: d_A, d_B, d_C, streams
    type(C_PTR) :: handle

    integer :: sizeof_double
    parameter (sizeof_double=8)

    !Linear dimension of matrices
    dim = 100

    ! Number of A,B,C matrix sets
    batch_count = 1000

    ! Allocate host storage for A,B,C square matrices
    allocate(A(dim,dim,batch_count))
    allocate(B(dim,dim,batch_count))
    allocate(C(dim,dim,batch_count))

    ! Allocate device storage for A,B,C
    allocate(d_A(batch_count))
    allocate(d_B(batch_count))
    allocate(d_C(batch_count))
    bytes = int(dim*dim*sizeof_double, 8)
    do i=1,batch_count
        stat = cudaMalloc(d_A(i), bytes)
        stat = cudaMalloc(d_B(i), bytes)
        stat = cudaMalloc(d_C(i), bytes)
    enddo

    ! Fill A,B diagonals with sin(i) data, C diagonal with cos(i)^2
    ! Matrices are arranged column major
    do k=1,batch_count
        do j=1,dim
            do i=1,dim
                if (i==j) then
                    index = real(j*dim + i)
                    A(i,j,k) = k*sin(index)
                    B(i,j,k) = sin(index)
                    C(i,j,k) = k*cos(index) * cos(index)
                else
                    A(i,j,k) = 0.0
                    B(i,j,k) = 0.0
                    C(i,j,k) = 0.0
                endif
            enddo ! i
        enddo ! j
    enddo ! k

    ! Create cublas instance
    stat = cublasCreate(handle)

    ! Set input matrices on device
    do i=1,batch_count
        stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(A(1,1,i)), dim, d_A(i), dim)
        stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(B(1,1,i)), dim, d_B(i), dim)
        stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(C(1,1,i)), dim, d_C(i), dim)
    enddo

    ! Create a stream for every DGEMM operation
    allocate(streams(batch_count))
    do i=1,batch_count
        stat = cudaStreamCreate(streams(i))
    enddo

    ! Set matrix coefficients
    alpha = 1.0
    beta = 1.0

    ! Launch each DGEMM operation in own CUDA stream
    do i=1,batch_count
        ! Set CUDA stream
        stat = cublasSetStream(handle, streams(i))

        ! DGEMM: C = alpha*A*B + beta*C
        stat = cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, &
                           dim, dim, dim, &
                           alpha,         &
                           d_A(i), dim,   &
                           d_B(i), dim,   &
                           beta,          &
                           d_C(i), dim)
    enddo

    ! Retrieve result matrix from device
    do i=1,batch_count
        stat = cublasGetMatrix(dim, dim, sizeof_double, d_C(i), dim, C_LOC(C(1,1,i)), dim)
    enddo

    ! Simple sanity test, sum up all elements
    sum = 0.0
    do k=1,batch_count
        do j=1,dim
            do i=1,dim
                sum = sum + C(i,j,k)
            enddo
        enddo
    enddo
    print *, "Sum is:", sum, "should be: ", dim*(batch_count)*(batch_count+1)/2

    do i=1,batch_count
        stat = cudaStreamDestroy(streams(i))
        call cudaFree(d_A(i))
        call cudaFree(d_B(i))
        call cudaFree(d_C(i))
    enddo

    deallocate(A)
    deallocate(B)
    deallocate(C)
    call cublasDestroy(handle)

end program

Compile:

$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit
$ nvcc -c cublas_fortran.cu
$ ftn -lcublas cublas_fortran_iso.f90 cublas-stream.f90 cublas_fortran.o -o cublas-stream.out

Run:

$ module load cudatoolkit
$ aprun ./cublas-stream.out

To fully exploit the computational power of the GPU, a large amount of data parallelism must be expressed. If your problem does not possess a sufficient amount of data parallelism, a second option is to combine data parallelism with task parallelism on the GPU through the use of concurrent kernels. To facilitate task parallelism the NVIDIA Kepler K20X features Hyper-Q, a set of 32 hardware-managed work queues. When using CUDA streams, each stream will be automatically mapped onto Hyper-Q, allowing up to 32 streams to execute concurrently. The NVIDIA Multi-Process Service allows multiple processes, such as intra-node MPI ranks, to be mapped onto Hyper-Q. This tutorial will demonstrate how to take advantage of GPU concurrency on Titan through the use of Hyper-Q. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.


C

The Serial, OpenMP, and MPI C samples make use of the following GPU wrapper code. The wrapper is used for the following:

  • sleep(): GPU kernel to sleep for a given number of clock cycles
  • get_cycles(): Return the number of cycles required to sleep for the requested number of seconds
  • create_streams(): Create a given number of non default streams
  • sleep_kernel(): Launch a single GPU thread that sleeps for a given number of cycles
  • wait_for_streams(): Wait for streams 1 through the number of streams specified to complete any work
  • destroy_streams(): Destroy non default streams

sleep.cu

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
 
static cudaStream_t *streams;

// CUDA kernel to pause for at least num_cycle cycles
__global__ void sleep(int64_t num_cycles)
{
    int64_t cycles = 0;
    int64_t start = clock64();
    while(cycles < num_cycles) {
        cycles = clock64() - start;
    }
}
 
// Returns number of cycles required for requested seconds
extern "C" int64_t get_cycles(float seconds)
{
    // Get device frequency in KHz
    int64_t Hz;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    Hz = int64_t(prop.clockRate) * 1000;
 
    // Calculate number of cycles to wait
    int64_t num_cycles;
    num_cycles = (int64_t)(seconds * Hz);
   
    return num_cycles;
}
 
// Create streams
extern "C" void create_streams(int num_streams)
{
    // Allocate streams
    streams = (cudaStream_t *) malloc((num_streams+1)*sizeof(cudaStream_t));    
 
    // Default stream
    streams[0] = NULL;

    // Primer kernel launch
    sleep<<< 1, 1 >>>(1); 

    // Create streams
    for(int i = 1; i <= num_streams; i++)
        cudaStreamCreate(&streams[i]);
}
 
// Launches a kernel that sleeps for num_cycles
extern "C" void sleep_kernel(int64_t num_cycles, int stream_id)
{
    // Launch a single GPU thread to sleep
    int blockSize, gridSize;
    blockSize = 1;
    gridSize = 1;
 
    // Execute the kernel
    sleep<<< gridSize, blockSize, 0, streams[stream_id] >>>(num_cycles);
}
 
// Wait for stream to complete
extern "C" void wait_for_stream(int stream_id)
{
    cudaStreamSynchronize(streams[stream_id]);
}
 
// Wait for streams to complete
extern "C" void wait_for_streams(int num_streams)
{
    for(int i = 1; i <= num_streams; i++)
        cudaStreamSynchronize(streams[i]);
}
 
// Destroy stream objects
extern "C" void destroy_streams(int num_streams)
{
    // Clean up stream
    for(int i = 1; i <= num_streams; i++)
        cudaStreamDestroy(streams[i]);
    free(streams);
}

Compile:
The following will compile the wrapper functions into object code to be linked in with our examples.

$ module load cudatoolkit
$ nvcc -arch=sm_35 -c sleep.cu

Serially Launched Concurrent Kernels in C

The following sample launches multiple GPU kernels that sleep a single GPU thread for one second. GPU kernel launches are asynchronous on the CPU, which allows a for loop to be used to launch concurrent kernels. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.

launcher.c

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/time.h>

int64_t get_cycles(float seconds);
void sleep_kernel(int64_t num_cycles, int stream_id);
void create_streams(int num_streams);
void wait_for_streams(int num_streams);
void destroy_streams(int num_streams);

int main(int argc, char *argv[])
{
    uint64_t cycles;
    struct timeval start, stop;
    int i, num_kernels;

    // Get number of cycles to sleep for 1 second
    cycles = get_cycles(1.0);

    // Max number of kernels to launch
    int max_kernels = 33;

    // Loop through number of kernels to launch, from 1 to num_kernels
    for(num_kernels=1; num_kernels<=max_kernels; num_kernels++)
    {
        // Start timer
        gettimeofday(&start, NULL);

        // Create streams
        create_streams(num_kernels);

        // Launch num_kernels kernels asynchronously
        for(i=1; i<=num_kernels; i++){
            sleep_kernel(cycles, i);
        }

        // Wait for kernels to complete
        wait_for_streams(num_kernels);

        // Clean up streams
        destroy_streams(num_kernels);

        // Print seconds elapsed
        gettimeofday(&stop, NULL);
        double seconds;
        seconds = (stop.tv_sec - start.tv_sec);
        seconds += (stop.tv_usec - start.tv_usec) / 1000000.0;
        printf("Total time for %d kernels: %f s\n", num_kernels, seconds);
    }

    return 0;
}

Compile:

$ module load cudatoolkit
$ cc sleep.o launcher.c -o serial.out

Run:

$ module load cudatoolkit
$ aprun ./serial.out

OpenMP Launched Concurrent Kernels in C

The following example uses OpenMP to launch multiple sleep kernels in a parallel region, using the OMP thread number to specify which stream to use. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.

launcher.c

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <omp.h>

int64_t get_cycles(float seconds);
void sleep_kernel(int64_t num_cycles, int stream_id);
void create_streams(int num_streams);
void wait_for_streams(int num_streams);
void destroy_streams(int num_streams);

int main(int argc, char *argv[])
{
    uint64_t cycles;
    double start, stop;
    int i, num_kernels;

    // Get number of cycles to sleep for 1 second
    cycles = get_cycles(1.0);

    // Number of kernels to launch
    int max_kernels = 33;

    // Loop through number of kernels to launch, from 1 to num_kernels
    for(num_kernels=1; num_kernels<=max_kernels; num_kernels++)
    {

        // Set number of OMP threads
        omp_set_num_threads(num_kernels);

        // Start timer
        start = omp_get_wtime();

        // Create streams
        create_streams(num_kernels);

        // Launch num_kernels kernels asynchronously
        #pragma omp parallel firstprivate(cycles)
        {
            int stream_id = omp_get_thread_num()+1;
            sleep_kernel(cycles, stream_id);
        }

        // Wait for kernels to complete
        wait_for_streams(num_kernels);
   
        // Wait for kernels to complete and clean up streams
        destroy_streams(num_kernels);

        // Stop timer
        stop = omp_get_wtime();
        printf("Total time for %d kernels: %f s\n", num_kernels, stop-start);
    }

    return 0;
}

Compile:
PGI

$ module load cudatoolkit
$ cc -mp sleep.o launcher.c -o openmp.out

GNU

$ module load cudatoolkit
$ cc -fopenmp sleep.o launcher.c -o openmp.out

Intel

$ module load cudatoolkit
$ cc -openmp sleep.o launcher.c -o openmp.out

Cray

$ module load cudatoolkit
$ cc sleep.o launcher.c -o openmp.out

Run:

$ module load cudatoolkit
$ aprun -d16 ./openmp.out

MPI Launched Concurrent Kernels in C

The following example uses MPI to launch a single GPU kernel per MPI process into the default stream. Titan’s GPU compute mode does not allow multiple processes to access the GPU simultaneously, so the Cray CUDA proxy (CRAY_CUDA_PROXY) must be enabled. When enabled, the Cray CUDA proxy works in conjunction with Hyper-Q to allow kernels from up to 32 different MPI processes to run concurrently. Note, however, that the maximum number of MPI processes per node on Titan is currently 16.

launcher.c

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "mpi.h"

int64_t get_cycles(float seconds);
void sleep_kernel(int64_t num_cycles, int stream_id);
void create_streams(int num_streams);
void wait_for_stream(int stream_id);
void destroy_streams(int num_streams);

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    uint64_t cycles;
    double start, stop;
    int num_kernels;

    // Get number of cycles to sleep for 1 second
    cycles = get_cycles(1.0);

    // Number of kernels to launch
    int max_kernels = size;

    // Setup default stream in sleep.cu wrapper
    create_streams(0);

    // Loop through number of kernels to launch, from 1 to max_kernels
    for(num_kernels=1; num_kernels<=max_kernels; num_kernels++)
    {
        // Start timer
        MPI_Barrier(MPI_COMM_WORLD);
        if(rank == 0)
            start = MPI_Wtime();

        // Launch kernel into default stream
        if(rank < num_kernels)
            sleep_kernel(cycles, 0);

        // Wait for all ranks to submit kernel
        MPI_Barrier(MPI_COMM_WORLD);

        // Wait for default stream
        if(rank < num_kernels)
            wait_for_stream(0);

        // Wait for all ranks to complete
        MPI_Barrier(MPI_COMM_WORLD);

        // Print seconds elapsed
        if(rank == 0) {
            stop = MPI_Wtime();
            printf("Total time for %d kernels: %f s\n", num_kernels, stop-start);
        }
    }

    destroy_streams(0);
    MPI_Finalize();

    return 0;
}

Compile:

$ module load cudatoolkit
$ cc sleep.o launcher.c -o proxy.out

Note that for simplicity the sleep wrapper does not do proper error checking; running this example without CRAY_CUDA_PROXY enabled will fail silently.

Run:

$ module load cudatoolkit
$ export CRAY_CUDA_PROXY=1
$ aprun -n16 ./proxy.out

OpenACC Launched Concurrent Kernels in C

OpenACC will, by default, block on the CPU when launching kernels or performing data movement. Using the async(i) clause allows asynchronous kernel launches and data transfers. Although it is implementation dependent, the integer argument i will generally determine which stream the kernel or data operation is launched in. At the time of this writing each implementation handles the integer argument slightly differently.

launcher.c

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "math.h"
#include <sys/time.h>

int main(int argc, char *argv[])
{

    uint64_t num_cycles;
    struct timeval start, stop;
    int i,num_kernels;

    // Set number of cycles to sleep for 1 second
    // We'll use frequency/instruction latency to estimate this
    num_cycles = 730000000/15;

    // Number of kernels to launch
    int max_kernels = 33;

    // Loop through number of kernels to launch, from 1 to num_kernels
    for(num_kernels=1; num_kernels<max_kernels; num_kernels++)
    {
        // Start timer
        gettimeofday(&start, NULL);

        for(i=0; i<=num_kernels; i++)
        {
            #pragma acc parallel async(i) vector_length(1) num_gangs(1)
            {
                uint64_t cycles;
                #pragma acc loop seq
                for(cycles=0; cycles<num_cycles; cycles++) {
                    cycles = cycles;
                }
            }
        }

        // Wait for all async streams to complete
        #pragma acc wait

        // Print seconds elapsed
        gettimeofday(&stop, NULL);
        double seconds;
        seconds = (stop.tv_sec - start.tv_sec);
        seconds += (stop.tv_usec - start.tv_usec) / 1000000.0;
        printf("Total time for %d kernels: %f s\n", num_kernels, seconds);
    }

    return 0;
}

Please note that for Cray the for loop will be optimized out by default. Adding the volatile qualifier to the declaration of cycles should force the loop to be kept. The volatile qualifier is currently not supported in PGI accelerator regions.

Compile:
PGI

$ module load cudatoolkit
$ cc -acc launcher.c -o acc.out

Cray

$ module load cudatoolkit
$ cc -hpragma=acc launcher.c -o acc.out

Run:

$ module load cudatoolkit
$ aprun ./acc.out

Fortran

The Serial, OpenMP, and MPI Fortran samples make use of the following CUDA Fortran GPU wrapper code. The wrapper is used for the following:

  • sleep(): GPU kernel to sleep for a given number of clock cycles
  • get_cycles(): Return the number of cycles required to sleep for the requested number of seconds
  • create_streams(): Create a given number of non default streams
  • sleep_kernel(): Launch a single GPU thread that sleeps for a given number of cycles
  • wait_for_streams(): Wait for streams 1 through the number of streams specified to complete any work
  • destroy_streams(): Destroy non default streams

For simplicity the following will use the PGI programming environment, as it provides CUDA Fortran. For additional information on compiler interoperability please see the tutorial Compiling mixed GPU and CPU code.

sleep.cuf

module sleep
    use cudadevice
    use cudafor
    implicit none
 
    integer, dimension(:), allocatable :: streams
 
    contains
 
    !CUDA kernel to pause for at least num_cycle cycles
    attributes(global) subroutine sleep(num_cycles)
        integer(8), value :: num_cycles
        integer(8) :: cycles
        integer(8) :: start
 
        cycles = 0
        start = clock64
        do while (cycles < num_cycles)
            cycles = clock64 - start
        enddo
    end subroutine sleep
 
    !Returns number of cycles required for requested seconds
    integer(8) function get_cycles(seconds) result(num_cycles)
        real(8), intent(in) :: seconds
        integer(8) :: istat, Hz
        type(cudadeviceprop) :: prop
 
        istat = cudaGetDeviceProperties(prop, 0)
        Hz = prop%clockRate * 1000
        num_cycles = seconds * Hz
    end function get_cycles
 
    !Create streams
    subroutine create_streams(num_streams)
        integer :: num_streams, istat, i
 
        allocate(streams(num_streams+1))

        streams(1) = 0 

        ! Primer kernel launch
        call sleep<<< 1, 1 >>>(int8(1));

        do i=2,num_streams+1
            istat = cudaStreamCreate(streams(i))
        enddo
    end subroutine create_streams
 
    !Launches a kernel that sleeps for num_cycles
    subroutine sleep_kernel(num_cycles, stream_id)
        integer(8) :: num_cycles
        integer    ::  stream_id
        type(dim3) :: blockSize, gridSize
 
        blockSize = dim3(1,1,1)
        gridSize = dim3(1,1,1)
 
        call sleep<<<gridSize, blockSize, 0, streams(stream_id)>>>(num_cycles)
    end subroutine sleep_kernel
 
    ! Wait for stream to complete
    subroutine wait_for_stream(stream_id)
        integer :: stream_id, istat
 
        istat = cudaStreamSynchronize(streams(stream_id))
    end subroutine wait_for_stream
 
    ! Wait for streams to complete
    subroutine wait_for_streams(num_streams)
        integer :: num_streams, istat, i
 
        do i=2,num_streams+1
            istat = cudaStreamSynchronize(streams(i))
        enddo
    end subroutine wait_for_streams
 
    ! Destroy streams
    subroutine destroy_streams(num_streams)
        integer :: num_streams, i, istat
 
        do i=2,num_streams+1
            istat = cudaStreamDestroy(streams(i))
        enddo
        deallocate(streams)
    end subroutine destroy_streams
 
end module sleep

Serially Launched Concurrent Kernels in Fortran

The following sample launches multiple GPU kernels that sleep a single GPU thread for one second. GPU kernel launches are asynchronous on the CPU, which allows a loop to be used to launch concurrent kernels. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.

launcher.f90

program main
    use sleep

    integer(8) :: cycles
    integer    ::  max_kernels, num_kernels, i
    real(8)    :: start, stop, seconds

    ! Get number of cycles to sleep for 1 second
    seconds = 1.0
    cycles = get_cycles(seconds)

    ! Maximum number of kernels to launch
    max_kernels = 32

    ! Loop through number of kernels to launch, from 1 to num_kernels
    do num_kernels = 1, max_kernels

        ! Start timer
        call cpu_time(start)

        ! Create streams
        call create_streams(num_kernels)

        ! Launch num_kernels kernels asynchronously
        do i = 2, num_kernels+1
            call sleep_kernel(cycles, i)
        enddo

        ! Wait for kernels to complete and clean up streams
        call destroy_streams(num_kernels)

        ! Stop timer
        call cpu_time(stop)

        print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds'

    enddo

end program main

Compile:
PGI

$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o serial.out

OpenMP Launched Concurrent Kernels in Fortran

The following example uses OpenMP to launch multiple sleep kernels in a parallel region, using the OMP thread number to specify which stream to use. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.

launcher.f90

program main
    use sleep
    use omp_lib

    implicit none

    integer(8) :: cycles
    integer    ::  max_kernels, num_kernels, stream_id
    real(8)    :: start, stop, seconds

    ! Get number of cycles to sleep for 1 second
    seconds = 1.0
    cycles = get_cycles(seconds)

    ! Maximum number of kernels to launch
    max_kernels = 33

    ! Loop through number of kernels to launch, from 1 to max_kernels
    do num_kernels = 1, max_kernels

        ! Set number of OMP threads
        call omp_set_num_threads(num_kernels)

        ! Start timer
        start = omp_get_wtime()

        ! Create streams
        call create_streams(num_kernels)

        ! Launch num_kernel kernels asynchronously
        !$omp parallel private(stream_id) firstprivate(cycles)
        stream_id = omp_get_thread_num()+2
        call sleep_kernel(cycles, stream_id)
        !$omp end parallel

        ! Wait for kernels to complete and clean up streams
        call destroy_streams(num_kernels)

        ! Stop timer
        stop = omp_get_wtime()

        print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds'

    enddo

end program main

Compile:
PGI

$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o openmp.out

MPI Launched Concurrent Kernels in Fortran

The following example uses MPI to launch a single GPU kernel per MPI process into the default stream. Titan’s GPU compute mode does not allow multiple processes to access the GPU simultaneously, so the CRAY_CUDA_PROXY must be enabled. When enabled, the Cray CUDA proxy works in conjunction with Hyper-Q to allow kernels from up to 32 different MPI processes to run concurrently. Note, however, that the maximum number of MPI processes per node on Titan is currently 16.

launcher.f90

program main
    use sleep
    implicit none
    include 'mpif.h'

    integer    ::  max_kernels, num_kernels, i, ierr, rank, size
    integer(8) :: cycles
    real(8)    :: start, stop, seconds

    call MPI_Init(ierr)

    ! Get number of cycles to sleep for 1 second
    seconds = 1.0
    cycles = get_cycles(seconds)

    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, size, ierr)

    ! Number of kernels to launch
    max_kernels = size

    ! Setup default stream in the sleep.cuf wrapper
    call create_streams(0);

    ! Loop through number of kernels to launch, from 1 to max_kernels
    do num_kernels = 1, max_kernels

        ! Start timer
        call MPI_Barrier(MPI_COMM_WORLD, ierr)
        if (rank == 0) then
            start = MPI_Wtime()
        endif

        ! Launch num_kernel kernels asynchronously
        if (rank < num_kernels) then
            call sleep_kernel(cycles, 1)
        endif

        ! Wait for all ranks to submit kernel
        call MPI_Barrier(MPI_COMM_WORLD, ierr)

        ! Wait for kernel to complete
        if(rank < num_kernels) then
            call wait_for_stream(1)
        endif

        ! Wait for all ranks to complete
        call MPI_Barrier(MPI_COMM_WORLD, ierr)

        ! Print seconds elapsed
        if (rank == 0) then
            stop = MPI_Wtime()
            print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds'
        endif

    enddo

    ! clean up array in wrapper, no stream actually destroyed
    call destroy_streams(0)

    call MPI_Finalize(ierr)

end program main

Compile:
PGI

$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o proxy.out
Note that, for simplicity, the sleep wrapper does not do proper error checking; running this example without CRAY_CUDA_PROXY enabled will fail silently.

Run:

$ module load cudatoolkit
$ export CRAY_CUDA_PROXY=1
$ aprun -n16 ./proxy.out

OpenACC Launched Concurrent Kernels in Fortran

By default OpenACC will block on the CPU when launching kernels or performing data movement. The async(i) clause allows asynchronous kernel launches and data transfers. Although it is implementation dependent, the integer argument, i, will generally determine which stream the kernel or data operation is launched into. At the time of this writing each implementation handles the integer argument slightly differently.

launcher.f90

program main
    implicit none

    integer(8) :: num_cycles, cycles, i
    integer    :: max_kernels, num_kernels, stream_id
    real(8)    :: start, stop, seconds
    real(4)    :: foo(33)

    ! Set number of cycles to sleep for 1 second
    ! We'll use frequency/instruction latency to estimate this
    num_cycles = 730000000/(15*5)

    ! Maximum number of kernels to launch
    max_kernels = 33

    ! Loop through number of kernels to launch, from 1 to max_kernels
    do num_kernels = 1, max_kernels

        ! Start timer
        call cpu_time(start)

        ! Launch num_kernel kernels asynchronously
        do i = 1, num_kernels
            !$acc parallel async(i) vector_length(1) num_gangs(1) copy(foo)
            !$acc loop seq
            do cycles = 1, num_cycles
                foo(i) = sin(1.5708)
            enddo
            !$acc end loop
            !$acc end parallel
        enddo

        ! Wait for all async streams to complete
        !$acc wait

        ! Stop timer
        call cpu_time(stop)

        print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds'

    enddo

end program main
Please note that with the Cray compiler the do loop will be optimized out by default. Adding the volatile qualifier to the declaration of foo should force the loop to be kept; the volatile qualifier is currently not supported in PGI accelerator regions.

Compile:
PGI

$ module load cudatoolkit
$ ftn -acc launcher.f90 -o acc.out

Cray

$ module load cudatoolkit
$ ftn -hacc launcher.f90 -o acc.out

Run:

$ module load cudatoolkit
$ aprun ./acc.out

This sample shows a minimal conversion of our Vector Addition CPU code to an OpenACC accelerator directive version. Consider this an OpenACC ‘Hello World’. Modifications from the CPU version will be highlighted and briefly discussed. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov

vecAdd.c

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main( int argc, char* argv[] )
{

    // Size of vectors
    int n = 10000;

    // Input vectors
    double *restrict a;
    double *restrict b;
    // Output vector
    double *restrict c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector
    a = (double*)malloc(bytes);
    b = (double*)malloc(bytes);
    c = (double*)malloc(bytes);

    // Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    int i;
    for(i=0; i<n; i++) {
        a[i] = sin(i)*sin(i);
        b[i] = cos(i)*cos(i);
    }   

    // sum component wise and save result into vector c
    #pragma acc kernels copyin(a[0:n],b[0:n]), copyout(c[0:n])
    for(i=0; i<n; i++) {
        c[i] = a[i] + b[i];
    }

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0.0;
    for(i=0; i<n; i++) {
        sum += c[i];
    }
    sum = sum/n;
    printf("final result: %f\n", sum);

    // Release memory
    free(a);
    free(b);
    free(c);

    return 0;
}

Changes to vecAdd.c

// Input vectors
double *restrict a;
double *restrict b;
// Output vector
double *restrict c;

The restrict keyword is a non-enforced guarantee to the compiler that the pointers are not aliased. It is not required by the OpenACC standard, but in some cases it allows the compiler to better vectorize the code.
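
As a minimal sketch (not part of the tutorial source), consider a function that receives its vectors as pointer arguments. Without restrict the compiler must assume that the output pointer could alias an input and may generate more conservative code; with restrict that assumption is removed:

// Hypothetical helper for illustration only
void vec_add(double *restrict a, double *restrict b, double *restrict c, int n)
{
    int i;
    // Because a, b, and c are declared restrict, the compiler may assume
    // the stores to c do not modify a or b and can vectorize freely.
    for(i=0; i<n; i++) {
        c[i] = a[i] + b[i];
    }
}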

// sum component wise and save result into vector c
#pragma acc kernels copyin(a[0:n],b[0:n]), copyout(c[0:n])
for(i=0; i<n; i++) {
    c[i] = a[i] + b[i];
}

The compiler will analyze the loops following the acc kernels directive and if possible compute the loops on the GPU. Memory is copied from the CPU to the GPU at the start of the loop and back from the GPU to the CPU at the end of the loop.
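
For comparison, the same loop can be written with the more prescriptive parallel loop directive, which asserts that the loop is safe to parallelize rather than leaving that analysis entirely to the compiler. A rough sketch of the equivalent loop:

// Equivalent loop using the parallel loop directive; data clauses unchanged
#pragma acc parallel loop copyin(a[0:n],b[0:n]) copyout(c[0:n])
for(i=0; i<n; i++) {
    c[i] = a[i] + b[i];
}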

Compiling vecAdd.c

PGI

$ module load cudatoolkit
$ cc -acc vecAdd.c -o vecAdd.out

Cray

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ cc -h pragma=acc vecAdd.c -o vecAdd.out

CapsMC

CapsMC can be used in conjunction with any of the programming environments available on Titan. If using the GNU programming environment, the compiler flag -std=c99 must be added to cc so that the restrict qualifier is handled properly. If using the Intel programming environment, the compiler flag -restrict must be added to cc for the same reason.

$ module load cudatoolkit capsmc
$ cc vecAdd.c -o vecAdd.out

Running

$ aprun ./vecAdd.out
final result: 1.000000

vecAdd.f90

program main

    ! Size of vectors
    integer :: n = 100000

    ! Input vectors
    real(8),dimension(:),allocatable :: a
    real(8),dimension(:),allocatable :: b  
    ! Output vector
    real(8),dimension(:),allocatable :: c

    integer :: i
    real(8) :: sum = 0.0

    ! Allocate memory for each vector
    allocate(a(n))
    allocate(b(n))
    allocate(c(n))

    ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    do i=1,n
        a(i) = sin(i*1D0)*sin(i*1D0)
        b(i) = cos(i*1D0)*cos(i*1D0)  
    enddo

    ! Sum component wise and save result into vector c

    !$acc kernels copyin(a(1:n),b(1:n)), copyout(c(1:n))
    do i=1,n
        c(i) = a(i) + b(i)
    enddo
    !$acc end kernels

    ! Sum up vector c and print result divided by n, this should equal 1 within error
    do i=1,n
        sum = sum +  c(i)
    enddo
    sum = sum/n
    print *, 'final result: ', sum

    ! Release memory
    deallocate(a)
    deallocate(b)
    deallocate(c)

end program

Changes to vecAdd.f90

!$acc kernels copyin(a(1:n),b(1:n)), copyout(c(1:n))
do i=1,n
    c(i) = a(i) + b(i)
enddo
!$acc end kernels

The loop directly following the acc kernels directive will be computed on the GPU. Memory is copied from the CPU to the GPU at the start of the loop and back from the GPU to the CPU at the end of the loop, as specified by the copyin and copyout clauses.

Compiling vecAdd.f90

PGI

$ module load cudatoolkit
$ ftn -acc vecAdd.f90 -o vecAdd.out

Cray

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ ftn -h acc vecAdd.f90 -o vecAdd.out

CapsMC

CapsMC can be used in conjunction with any of the programming environments available on Titan.

$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit capsmc
$ ftn vecAdd.f90 -o vecAdd.out

Running

$ aprun ./vecAdd.out
final result: 1.000000

Reduction

OpenACC includes a parallel reduction operation. When a reduction operation is encountered in a kernels region the compiler should automatically detect it and run the reduction on the accelerator. When relying on the compiler to implicitly parallelize your code it is important to check that it is doing so correctly as discussed below in the Additional Information section.
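
If the compiler does not detect the reduction on its own, it can be requested explicitly with a reduction clause. A minimal sketch, with the loop body condensed from the example below:

// Explicitly request a sum reduction on the scalar sum
double sum = 0.0;
#pragma acc parallel loop reduction(+:sum) copyin(a[0:n],b[0:n])
for(i=0; i<n; i++) {
    sum += a[i] + b[i];
}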

vecAdd-reduction.c

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main( int argc, char* argv[] )
{

    // Size of vectors
    int n = 10000;

    // Input vectors
    double *restrict a;
    double *restrict b;
    // Output vector
    double *restrict c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector
    a = (double*)malloc(bytes);
    b = (double*)malloc(bytes);
    c = (double*)malloc(bytes);

    // Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    int i;
    for(i=0; i<n; i++) {
        a[i] = sin(i)*sin(i);
        b[i] = cos(i)*cos(i);
    }

    // sum component wise and save result into vector c and do reduction
    float sum = 0.0;
    #pragma acc kernels copyin(a[0:n],b[0:n]), create(c[0:n])
    {
        for(i=0; i<n; i++) {
            c[i] = a[i] + b[i];
            sum += c[i];
        }
    }

    sum = sum/n;
    printf("final result: %f\n", sum);

    // Release memory
    free(a);
    free(b);
    free(c);

    return 0;
}

vecAdd-reduction.f90

program main

    ! Size of vectors
    integer :: n = 100000

    ! Input vectors
    real(8),dimension(:),allocatable :: a
    real(8),dimension(:),allocatable :: b
    ! Output vector
    real(8),dimension(:),allocatable :: c

    integer :: i
    real(8) :: sum = 0.0

    ! Allocate memory for each vector
    allocate(a(n))
    allocate(b(n))
    allocate(c(n))

    ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    do i=1,n
        a(i) = sin(i*1D0)*sin(i*1D0)
        b(i) = cos(i*1D0)*cos(i*1D0)
    enddo

    ! Sum component wise and save result into vector c, perform reduction sum on c

    !$acc kernels copyin(a(1:n),b(1:n)), create(c(1:n))
    do i=1,n
        c(i) = a(i) + b(i)
        sum = sum +  c(i)
    enddo
    !$acc end kernels

    ! Divide the sum by n, the final result should be 1.0 within error
    sum = sum/n
    print *, 'final result: ', sum

    ! Release memory
    deallocate(a)
    deallocate(b)
    deallocate(c)

end program

Additional Information

With directive use, particularly when the compiler is relied upon to implicitly identify and parallelize code, it is important to check that it is doing what you expect. The way to extract more information from the compilation process differs between compilers.

PGI

For PGI the -Minfo compiler flag will print out detailed optimization information. With the -Minfo flag we will see memory transfer and thread placement information.

Compiling:
Compile C code with the PGI compiler and -Minfo flag as follows:

$ cc -acc -Minfo vecAdd-reduction.c

Compile Fortran code with the PGI compiler and -Minfo flag as follows:

$ ftn -acc -Minfo vecAdd-reduction.f90

Output:

main:
     34, Generating local(c[0:n])
         Generating copyin(b[0:n])
         Generating copyin(a[0:n])
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     35, Loop is parallelizable
         Accelerator kernel generated
         35, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             CC 2.0 : 18 registers; 0 shared, 88 constant, 0 local memory bytes
         37, Sum reduction generated for sum

Breaking this output down:

main:
     34, Generating local(c[0:n])
         Generating copyin(b[0:n])
         Generating copyin(a[0:n])

We see that at line 34, the start of our acc region, elements 0 to n of the vectors a and b will be copied to the GPU. Vector c does not need to be copied to the GPU, so it is allocated locally on the device.

35, Loop is parallelizable
    Accelerator kernel generated

Starting at line 35, the line containing the for/do loop statement, we see that the compiler has found the loop parallelizable and generated a GPU kernel. Let’s break down the provided information.

35, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    CC 2.0 : 18 registers; 0 shared, 88 constant, 0 local memory bytes

In CUDA terminology this translates to a kernel that has a block size of 128, that is, 128 threads will be in each logical thread block. Each thread will use 18 registers and 88 bytes of constant memory will be used.

37, Sum reduction generated for sum

The compiler has correctly identified the sum reduction and created an optimized kernel for it.

Additionally if the environment variable ACC_NOTIFY is set run time information is printed out.

$ export ACC_NOTIFY=1
$ aprun ./vecAdd.out
launch kernel  file=/autofs/na1_home/atj/OLCF3/OpenACC/VecAdd/vecAdd-reduction.c function=main line=35 device=0 grid=79 block=128 queue=0
launch kernel  file=/autofs/na1_home/atj/OLCF3/OpenACC/VecAdd/vecAdd-reduction.c function=main line=37 device=0 grid=1 block=256 queue=0

Cray

Compiling:
C

$ cc -h pragma=acc,msgs vecAdd.c -o vecAdd.out

Fortran

$ ftn -h acc,msgs vecAdd.f90 -o vecAdd.out

Output:

    !$acc kernels copyin(a(1:n),b(1:n)), create(c(1:n))
ftn-6413 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 
  A data region was created at line 28 and ending at line 38.
ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 
  Allocate memory and copy user shaped variable "b" to accelerator, free at line 38 (acc_copyin).
ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 
  Allocate memory and copy user shaped variable "a" to accelerator, free at line 38 (acc_copyin).
ftn-6421 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 
  Allocate memory for user shaped variable "c" on accelerator, free at line 38 (acc_share).

      do i=1,n
ftn-6401 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 
  A loop starting at line 29 was placed on the accelerator.
ftn-6430 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 
  A loop starting at line 29 was partitioned across the threadblocks and the 128 threads within a threadblock.
ftn-6415 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 
  Allocate memory and copy variable "sum" to accelerator, copy back at line 32 (acc_copy).

Breaking this output down:

    !$acc kernels copyin(a(1:n),b(1:n)), create(c(1:n))
ftn-6413 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 
  A data region was created at line 28 and ending at line 38.
ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 
  Allocate memory and copy user shaped variable "b" to accelerator, free at line 38 (acc_copyin).
ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 
  Allocate memory and copy user shaped variable "a" to accelerator, free at line 38 (acc_copyin).
ftn-6421 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 
  Allocate memory for user shaped variable "c" on accelerator, free at line 38 (acc_share).

We see that at line 28, the start of our acc region, elements 1 to n of the vectors a and b will be copied to the GPU. Vector c does not need to be copied to the GPU, so it is allocated locally on the device.

      do i=1,n
ftn-6401 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 
  A loop starting at line 29 was placed on the accelerator.
ftn-6430 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 
  A loop starting at line 29 was partitioned across the threadblocks and the 128 threads within a threadblock.
ftn-6415 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 
  Allocate memory and copy variable "sum" to accelerator, copy back at line 32 (acc_copy).

At line 29, the start of our do loop, we see that a kernel was generated with 128 threads per threadblock. At the end of the loop sum is copied from the accelerator back to the host.

Additionally if the environment variable CRAY_ACC_DEBUG is set run time information is printed out.

$ export CRAY_ACC_DEBUG=1
$ aprun ./vecAdd.out
ACC: Transfer 3 items (to acc 1600000 bytes, to host 0 bytes) from vecAdd-reduction.f90:28
ACC: Transfer 3 items (to acc 12 bytes, to host 0 bytes) from vecAdd-reduction.f90:29
ACC: Execute kernel main_$ck_L29_2 async(auto) from vecAdd-reduction.f90:29
ACC: Wait async(auto) from vecAdd-reduction.f90:32
ACC: Transfer 3 items (to acc 0 bytes, to host 8 bytes) from vecAdd-reduction.f90:32
ACC: Wait async(auto) from vecAdd-reduction.f90:38
ACC: Transfer 3 items (to acc 0 bytes, to host 0 bytes) from vecAdd-reduction.f90:38

CapsMC

Currently CapsMC will not compile the provided code due to perceived inter-iteration dependencies. To force the compile either of the following two loop level directives can be added directly above the for/do loop:

#pragma acc loop independent
#pragma acc loop reduction(+:sum)

!$acc loop independent
!$acc loop reduction(+:sum)

Compiling:
C

$ capsmc --io-report cc vecAdd-reduction.c

Fortran

$ capsmc --io-report ftn vecAdd-reduction.f90

Output:

hmpp: [Info] IO report :
hmppcg: [Message DPL0099] vecAdd-reductionHMPP.c:38: Loop 'i' was shared among workers(8)
hmpp: [Info] Generated codelet filename is "__hmpp_acc_region__67z3pizb_cuda.hmf.cu".

Although not much information is provided at compile time, much more can be obtained at runtime by setting the environment variable HMPPRT_LOG_LEVEL="info".

$ export HMPPRT_LOG_LEVEL="info"
$ aprun ./a.out
[     0.029828] ( 0) INFO : Enter    kernels (queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.030021] ( 0) INFO : Acquire  (target=cuda)
[     0.031667] ( 0) INFO : Allocate a[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451419] ( 0) INFO : Upload   a[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451555] ( 0) INFO : Allocate b[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451610] ( 0) INFO : Upload   b[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451721] ( 0) INFO : Allocate c[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451775] ( 0) INFO : Allocate i[0:1] (element_size=4, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451826] ( 0) INFO : Upload   i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451866] ( 0) INFO : Allocate sum[0:1] (element_size=8, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451905] ( 0) INFO : Upload   sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451940] ( 0) INFO : Call     __hmpp_acc_region__67z3pizb (queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470447] ( 0) INFO : Download sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470506] ( 0) INFO : Free     sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470555] ( 0) INFO : Download i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470590] ( 0) INFO : Free     i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470626] ( 0) INFO : Free     c[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470677] ( 0) INFO : Free     b[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470727] ( 0) INFO : Free     a[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470863] ( 0) INFO : Leave    kernels (queue=none, location=vecAdd-reductionHMPP.c:34)

Breaking this output down:

[     0.029828] ( 0) INFO : Enter    kernels (queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.030021] ( 0) INFO : Acquire  (target=cuda)
[     0.031667] ( 0) INFO : Allocate a[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451419] ( 0) INFO : Upload   a[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451555] ( 0) INFO : Allocate b[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451610] ( 0) INFO : Upload   b[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451721] ( 0) INFO : Allocate c[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451775] ( 0) INFO : Allocate i[0:1] (element_size=4, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451826] ( 0) INFO : Upload   i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451866] ( 0) INFO : Allocate sum[0:1] (element_size=8, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451905] ( 0) INFO : Upload   sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)

We see that at line 34, the start of our acc region, elements 0 to n of the vectors a and b will be copied to the GPU. Vector c does not need to be copied to the GPU, so it is allocated locally on the device.

[     0.451940] ( 0) INFO : Call     __hmpp_acc_region__67z3pizb (queue=none, location=vecAdd-reductionHMPP.c:34)

The kernel is called.

[     0.470447] ( 0) INFO : Download sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470506] ( 0) INFO : Free     sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470555] ( 0) INFO : Download i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470590] ( 0) INFO : Free     i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470626] ( 0) INFO : Free     c[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470677] ( 0) INFO : Free     b[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470727] ( 0) INFO : Free     a[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470863] ( 0) INFO : Leave    kernels (queue=none, location=vecAdd-reductionHMPP.c:34)

At the end of the kernel, sum is copied from the accelerator back to the host and the memory is released.

This sample shows the Game of Life CPU code converted to an OpenACC accelerator directive version. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov

GOL.c

#include <stdio.h>
#include <stdlib.h>

#define SRAND_VALUE 1985

#define dim 1024 // Linear game grid dimension excluding ghost cells
#define idx(i,j) ((i)*(dim+2)+(j))

// Add up all neighbors
inline int getNeighbors(int *grid, int i, int j)
{
    int num = grid[idx(i+1,j)] + grid[idx(i-1,j)]     //upper lower
            + grid[idx(i,j+1)] + grid[idx(i,j-1)]     //right left
            + grid[idx(i+1,j+1)] + grid[idx(i-1,j-1)] //diagonals
            + grid[idx(i-1,j+1)] + grid[idx(i+1,j-1)];

    return num;
}

int main(int argc, char* argv[])
{
    int i,j,iter;
    // Number of game iterations
    int maxIter = 1<<10;

    // Total number of alive cells
    int total = 0;

    // Total size of grid in bytes
    size_t bytes = sizeof(int)*(dim+2)*(dim+2);

    // Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
    int *grid = (int*) malloc(bytes);

    // Allocate newGrid
    int *restrict newGrid = (int*) malloc(bytes);

    // Assign initial population randomly
    srand(SRAND_VALUE);
    for(i = 1; i <= dim; i++) {
        for(j = 1; j <= dim; j++) {
            grid[idx(i,j)] = rand() % 2;
        }
    }

  int fullSize = (dim+2)*(dim+2);
  #pragma acc data copyin(grid[0:fullSize]) create(newGrid[0:fullSize])
  {
    // Main game loop
    for (iter = 0; iter<maxIter; iter++) {
    #pragma acc kernels
    {
        // Left-Right columns
        #pragma acc loop independent
        for (i = 1; i <= dim; i++) {
            grid[idx(i,0)] = grid[idx(i,dim)];   //Copy last real column to left ghost column
            grid[idx(i,dim+1)] = grid[idx(i,1)]; //Copy first real column to right ghost column
        }
        // Top-Bottom rows
        #pragma acc loop independent
        for (j = 0; j <= dim+1; j++) {
            grid[idx(0,j)] = grid[idx(dim,j)];   //Copy last real row to top ghost row
            grid[idx(dim+1,j)] = grid[idx(1,j)]; //Copy first real row to bottom ghost row
        }

        // Now we loop over all cells and determine their fate
        #pragma acc loop independent
        for (i = 1; i <= dim; i++) {
            for (j = 1; j <= dim; j++) {
               // Get the number of neighbors for a given grid point
                int numNeighbors = getNeighbors(grid,i,j);

                // Here we have explicitly all of the game rules
                if (grid[idx(i,j)] == 1 && numNeighbors < 2)
                    newGrid[idx(i,j)] = 0;
                else if (grid[idx(i,j)] == 1 && (numNeighbors == 2 || numNeighbors == 3))
                    newGrid[idx(i,j)] = 1;
                else if (grid[idx(i,j)] == 1 && numNeighbors > 3)
                    newGrid[idx(i,j)] = 0;
                else if (grid[idx(i,j)] == 0 && numNeighbors == 3)
                    newGrid[idx(i,j)] = 1;
                else
                    newGrid[idx(i,j)] = grid[idx(i,j)];
            }
        }

       // Can't switch pointers so we manually have to copy array over
       #pragma acc loop independent
       for(i = 1; i <= dim; i++) {
           for(j = 1; j <= dim; j++) {
               grid[idx(i,j)] = newGrid[idx(i,j)];
           }
       }
    } // End ACC kernels region
    } // End main game loop

    // Sum up alive cells
    #pragma acc parallel
    {
        #pragma acc loop reduction(+:total)
        for (i = 1; i <= dim; i++) {
            #pragma acc loop reduction(+:total)
            for (j = 1; j <= dim; j++) {
                total += grid[idx(i,j)];
            }
        }
    }
  } // End ACC Data region

    // Print number of live cells
    printf("Total Alive: %d\n", total);

    // Release memory
    free(grid);
    free(newGrid);

    return 0;
}

Changes to GOL.c

#define dim 1024 // Linear game grid dimension excluding ghost cells
#define idx(i,j) ((i)*(dim+2)+(j))

At the time of this writing, multidimensional C arrays are not fully supported in all OpenACC implementations. A convenience macro is defined and used whenever grid or newGrid is accessed; grid[i][j] will now be accessed as grid[idx(i,j)]. These changes have not all been highlighted.

inline int getNeighbors(int *grid, int i, int j)

OpenACC 1.0 does not support function calls within acc regions; all function calls must be inlined. The inline qualifier can be used as a hint to the compiler that we wish to have the function inlined.

// Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
int *grid = (int*) malloc(bytes);

// Allocate newGrid
int *restrict newGrid = (int*) malloc(bytes);

The grid and newGrid arrays are now linear. As discussed in the vector addition example we use the restrict qualifier to help the compiler resolve loop dependencies.

#pragma acc data copyin(grid[0:fullSize]) create(newGrid[0:fullSize])
{
}

Memory transfers between host and device can be very costly and so optimizing memory transfers in any way possible is highly recommended. Here an ACC data region starts on line 49 and ends at line 101. grid will be copied to the accelerator at the start of the region. newGrid will be allocated on the device but no copies from host to device or device to host are needed.

#pragma acc kernels
{
}

An acc kernels region will attempt to parallelize any enclosed loops across the accelerator. The kernels directive gives the compiler complete control, if not guided by the programmer, in deciding how to map the loops onto the target hardware. This directive does not guarantee that the compiler will be able to parallelize the loop across the device; if the compiler is not able to determine that the loop iterations are independent, the loop may be run sequentially on the accelerator. It is imperative that compiler information be checked to ensure that the loops are parallelized.

// Left-Right columns
#pragma acc loop independent
// Top-Bottom rows
#pragma acc loop independent
// Now we loop over all cells and determine their fate
#pragma acc loop independent
// Can't switch pointers so we manually have to copy array over
#pragma acc loop independent

The compiler is not always able to determine that loop iterations can be done independently; when this is encountered the loop will be scheduled sequentially on the accelerator. The loop independent clause lets the compiler know that it is safe to parallelize the immediately following loop.
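
For instance, when an array is written through a computed index the compiler usually cannot prove that two iterations never touch the same element and will serialize the loop unless the programmer asserts independence. A hypothetical sketch (out, in, and map are illustration-only names):

// Inside an acc kernels region: the programmer guarantees map[] holds
// no duplicate indices, so it is safe to assert the iterations are independent.
#pragma acc loop independent
for(i = 0; i < n; i++) {
    out[map[i]] = in[i];
}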

// Can't switch pointers so we manually have to copy array over
#pragma acc loop independent
for(i = 1; i <= dim; i++) {
   for(j = 1; j <= dim; j++) {
       grid[idx(i,j)] = newGrid[idx(i,j)];
   }
}

Device pointer manipulation is not currently supported by OpenACC so the pointer swap must be removed. The easiest way to fix this is to replace the pointer swap with a copy operation as we have done.
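
For reference, the CPU version of the code would typically end each iteration with a pointer swap similar to the sketch below; this is what the copy loop above replaces:

// CPU-style pointer swap (not usable here: OpenACC provides no way to
// swap the underlying device pointers inside the data region)
int *tmp = grid;
grid = newGrid;
newGrid = tmp;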

// Sum up alive cells
#pragma acc parallel
{
    #pragma acc loop reduction(+:total)
    for (i = 1; i <= dim; i++) {
        #pragma acc loop reduction(+:total)
        for (j = 1; j <= dim; j++) {
            total += grid[idx(i,j)];
        }
    }
}

At the start of this acc parallel region a number of compiler-determined gangs will be launched on the GPU. The acc loop directive indicates that the immediately following loop will be parallelized across the accelerator. The reduction clause on both loops indicates that a reduction is required on the scalar variable total.

Compiling GOL.c

PGI

$ module load cudatoolkit
$ cc -acc gol.c -Minline -Minfo -ta=nvidia:kepler

CRAY
Cray will automatically attempt to inline the getNeighbors function

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ cc -hpragma=acc,msgs gol.c

Compiling GOL.c with CapsMC

CapsMC can be used in conjunction with any of the programming environments available on Titan. If using the GNU programming environment, the compiler flag -std=c99 must be added to cc so that the restrict qualifier is handled properly. If using the Intel programming environment, the compiler flag -restrict must be added to cc for the same reason.

CapsMC will automatically attempt to inline the getNeighbors function

$ module load capsmc
$ module load cudatoolkit
$ cc gol.c

GOL.f90

program main
    implicit none
    integer :: i,j,iter,seed(8),numNeighbors,total
    real :: randm
    ! Linear game grid dimension
    integer :: dim = 1024
    ! Number of game iterations
    integer :: maxIter = 2**10

    ! Game grid pointers
    integer,dimension(:,:),allocatable :: grid, newGrid
 
    ! Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
    allocate(grid(dim+2,dim+2))
    allocate(newGrid(dim+2,dim+2))

    ! Assign initial population randomly
    seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/)
    call random_seed(PUT=seed)
    do j=1,dim
        do i=1,dim
            call random_number(randm)
            grid(i,j) = nint(randm)
        enddo
    enddo

    !$acc data copyin(grid), create(newGrid)

    ! Main game loop
    do iter=1,maxIter

    !$acc kernels

        ! Top-Bottom ghost rows
        !$acc loop independent
        do j=2,dim+1
            grid(1,j) = grid(dim+1,j) !Copy last real row to top ghost row
            grid(dim+2,j) = grid(2,j) !Copy first real row to bottom ghost row
        enddo

        ! Left-Right ghost columns
        !$acc loop independent
        do i=1,dim+2
            grid(i,1) = grid(i, dim+1) !Copy last real column to left ghost column
            grid(i,dim+2) = grid(i,2)  !Copy first real column to right ghost column
        enddo

        ! Now we loop over all cells and determine their fate
        do j=2,dim+1
            do i=2,dim+1
                ! Get the number of neighbors for a given grid point
                numNeighbors = grid(i,j+1) + grid(i,j-1)&     !left and right
                             + grid(i+1,j) + grid(i-1,j)&     !upper and lower
                             + grid(i+1,j+1) + grid(i-1,j-1)& !diagonals
                             + grid(i-1,j+1) + grid(i+1,j-1)

                ! Here we have explicitly all of the game rules
                if(grid(i,j) == 1 .AND. numNeighbors < 2) then
                    newGrid(i,j) = 0
                elseif(grid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
                    newGrid(i,j) = 1
                elseif(grid(i,j) == 1 .AND. numNeighbors > 3) then
                    newGrid(i,j) = 0
                elseif(grid(i,j) == 0 .AND. numNeighbors == 3) then
                    newGrid(i,j) = 1
                else
                    newGrid(i,j) = grid(i,j)
                endif
            enddo
        enddo

        ! Can't switch pointers so we manually have to copy array over
        do j=2,dim+1
            do i=2,dim+1
                grid(i,j) = newGrid(i,j)
            enddo
        enddo

    !$acc end kernels
    enddo!end main game loop  

    total = 0
    ! Sum up alive cells and print results
    !$acc parallel
    !$acc loop reduction(+:total)
    do j=2,dim+1
        !$acc loop reduction(+:total)
        do i=2,dim+1
            total = total + grid(i,j)
        enddo
    enddo
    !$acc end parallel

    !$acc end data

    print *, "Total Alive", total

end program

Changes to GOL.f90

!$acc data copyin(grid), create(newGrid)
!$acc end data

Memory transfers between host and device can be very costly and so optimizing memory transfers in any way possible is highly recommended. Here an ACC data region starts on line 27 and ends at line 90. grid will be copied to the accelerator at the start of the region. newGrid will be allocated on the device but no copies from host to device or device to host are needed.

!$acc kernels
!$acc end kernels

An acc kernels region will attempt to parallelize any enclosed loops across the accelerator. The kernels directive gives the compiler complete control, if not guided by the programmer, in deciding how to map the loops onto the target hardware. This directive does not guarantee that the compiler will be able to parallelize the loop across the device; if the compiler is not able to determine that the loop iterations are independent, the loop may be run sequentially on the accelerator. It is imperative that compiler information be checked to ensure that the loops are parallelized.

!$acc loop independent
!$acc loop independent

The compiler is not always able to determine that loop iterations can be done independently; when this is encountered the loop will be scheduled sequentially on the accelerator. The loop independent clause lets the compiler know that it is safe to parallelize the immediately following loop.

numNeighbors = grid(i,j+1) + grid(i,j-1)&     !left and right
             + grid(i+1,j) + grid(i-1,j)&     !upper and lower
             + grid(i+1,j+1) + grid(i-1,j-1)& !diagonals
             + grid(i-1,j+1) + grid(i+1,j-1)

OpenACC 1.0 does not support function calls within acc regions; all function calls must be inlined. For this example, using the current set of compilers, it was best to inline the neighbors function by hand.

! Can't switch pointers so we manually have to copy array over
do j=2,dim+1
    do i=2,dim+1
        grid(i,j) = newGrid(i,j)
    enddo
enddo

Device pointer manipulation is not currently supported by OpenACC so the pointer swap must be removed. The easiest way to fix this is to replace the pointer swap with a copy operation as we have done.

! Sum up alive cells and print results
!$acc parallel
!$acc loop reduction(+:total)
do j=2,dim+1
    !$acc loop reduction(+:total)
    do i=2,dim+1
        total = total + grid(i,j)
    enddo 
enddo
!$acc end parallel

At the start of this acc parallel region a number of compiler-determined gangs will be launched on the GPU. The acc loop directive indicates that the immediately following loop will be parallelized across the accelerator. The reduction clause on both loops indicates that a reduction is required on the scalar variable total.

Compiling GOL.f90

OpenACC 1.0 does not support function calls within acc regions; all function calls must be inlined using the appropriate compiler flag.
PGI

$ module load cudatoolkit
$ ftn -acc gol.f90 -Minline -Minfo -ta=nvidia:kepler

CRAY

Cray will automatically try to inline the getNeighbors function

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ ftn -hacc,msgs gol.f90

Compiling GOL.f90 with CapsMC

CapsMC can be used in conjunction with any of the programming environments available on Titan

CapsMC will automatically try to inline the getNeighbors function

$ module load capsmc
$ module load cudatoolkit
$ ftn gol.f90

The following C and Fortran examples show how to combine OpenACC data regions with GPU libraries such as CUFFT. Since files containing CUFFT function calls must be compiled with the nvcc compiler and OpenACC-containing files must be compiled with cc/CC/ftn, it is necessary to create wrapper functions as shown below. The acc host_data use_device directive will be used, which causes the device-allocated memory address of the specified variables to be used in host code within the directive’s scope. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov

Care must be taken to ensure that all dependent GPU operations occur on the same CUDA stream. The example below queries OpenACC for the stream in use and executes all library calls within this stream. It is an error to assume OpenACC is using the default stream; doing so may very well produce difficult-to-detect race conditions.

C/C++

fft.c

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "openacc.h"

#ifndef M_PI
#define M_PI           3.14159265358979323846
#endif

// Forward declaration of wrapper function that will call CUFFT
extern void launchCUFFT(float *d_data, int n, void *stream);

int main(int argc, char *argv[])
{
    int n = 256;
    float *data = malloc(2*n*sizeof(float));
    int i;

    // Initialize interleaved input data on host
    float w = 7.0;
    float x;
    for(i=0; i<2*n; i+=2)  {
        x = (float)i/2.0/(n-1);
        data[i] = cos(2*M_PI*w*x);
        data[i+1] = 0.0;
    }

    // Copy data to device at start of region and back to host at end of region
    #pragma acc data copy(data[0:2*n])
    {
        // Inside this region the device data pointer will be used
        #pragma acc host_data use_device(data)
        {
           // Query OpenACC for CUDA stream
           void *stream = acc_get_cuda_stream(acc_async_sync);

           // Launch FFT on the GPU
           launchCUFFT(data, n, stream);
        }
    }

    // Find the frequency
    int max_id = 0;
    for(i=0; i<n; i+=2) {
        if( data[i] > data[max_id] )
            max_id = i;
    }
    printf("frequency = %d\n", max_id/2);

    return 0;
}

fft.c + cufft.cu

#include <cufft.h>

// Declared extern "C" to disable C++ name mangling
extern "C" void launchCUFFT(float *d_data, int n, void *stream)
{
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);
    cufftSetStream(plan, (cudaStream_t)stream);
    cufftExecC2C(plan, (cufftComplex*)d_data, (cufftComplex*)d_data,CUFFT_FORWARD);
    cufftDestroy(plan);
}

Compiling C/C++

Cray

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ nvcc cufft.cu -c
$ cc -h pragma=acc -lcufft fft.c cufft.o

PGI

$ module load cudatoolkit
$ nvcc cufft.cu -c
$ cc -acc -lcufft fft.c cufft.o

Fortran

fft.f90

module cufft
  INTERFACE
    subroutine launchcufft(data, n, stream) BIND (C, NAME='launchCUFFT')
      USE ISO_C_BINDING
      implicit none
      type (C_PTR), value :: data
      integer (C_INT), value :: n
      type (C_PTR), value :: stream
    end subroutine
  END INTERFACE
end module cufft

program fft
    USE ISO_C_BINDING
    USE cufft
    USE openacc
    IMPLICIT NONE

    INTEGER, PARAMETER :: n = 256
    COMPLEX (C_FLOAT_COMPLEX) :: data(n)
    INTEGER (C_INT):: i
    INTEGER :: max_id,istat
    type (C_PTR) :: stream

    ! Initialize interleaved input data on host
    REAL :: w = 7.0
    REAL :: x
    REAL, PARAMETER :: PI = 3.1415927
    do i=1,n
        x = (i-1.0)/(n-1.0);
        data(i) = CMPLX(COS(2.0*PI*w*x),0.0)
    enddo

    ! Copy data to device at start of region and back to host at end of region
    !$acc data copy(data)

        ! Inside this region the device data pointer will be used
        !$acc host_data use_device(data)

        ! Query OpenACC for CUDA stream
        stream = acc_get_cuda_stream(acc_async_sync)

        ! Launch FFT on the GPU
        call launchcufft(C_LOC(data), n, stream)
        !$acc end host_data

    !$acc end data

    ! Find the frequency
    max_id = 1
    do i=1,n/2
        if (REAL(data(i)) .gt. REAL(data(max_id))) then
            max_id = i-1
        endif
    enddo
    print *, "frequency:", max_id

end program fft

fft.f90 + cufft.cu

#include <cufft.h>

// Declared extern "C" to disable C++ name mangling
extern "C" void launchCUFFT(float *d_data, int n, void *stream)
{
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);
    cufftSetStream(plan, (cudaStream_t)stream);
    cufftExecC2C(plan, (cufftComplex*)d_data, (cufftComplex*)d_data,CUFFT_FORWARD);
    cufftDestroy(plan);
}

Compiling Fortran

Cray

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ nvcc -c cufft.cu
$ ftn -c -hacc fft.f90
$ ftn fft.o cufft.o -lcufft

PGI

$ module load cudatoolkit
$ nvcc -c cufft.cu
$ ftn -c -acc fft.f90
$ ftn -acc fft.o cufft.o -lcufft

The following C and Fortran examples show how to leverage OpenACC, cuRand, and Thrust in a single application. This example will use cuRand to generate random data, OpenACC to hash the data, and finally Thrust to sort the data into canonical order. Although this example is provided only to illustrate OpenACC interoperability, it was written with a simple nearest-neighbor hash of interleaved 3D positions (x1,y1,z1,x2,y2,z2…) in mind. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov

Care must be taken to ensure that all dependent GPU operations occur on the same CUDA stream. The example below queries OpenACC for the stream in use and executes all library calls within this stream. It is an error to assume OpenACC is using the default stream; doing so may very well produce difficult-to-detect race conditions.

C/C++

sort.c

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "openacc.h"

// Forward declaration of nvcc compiled functions
void fill_rand(float *d_buffer, int num, void *stream);
inline int hash_val(float x, float y, float z);
void sort(int *d_key, int *d_values, int num, void *stream);

int main(int argc, char* argv[])
{
    int i,index;
    float x,y,z;
    int dim = 3;
    int num = 1000000;
    int length = dim*num;
    float *restrict positions = (float*)malloc(length*sizeof(float));
    int *restrict keys   = (int*)malloc(num*sizeof(int));
    int *restrict values = (int*)malloc(num*sizeof(int));

    // Query OpenACC for CUDA stream
    void *stream = acc_get_cuda_stream(acc_async_sync);

    // OpenACC will create positions, keys, and values arrays on the device
    #pragma acc data create(positions[0:length], keys[0:num], values[0:num])
    {
        // NVIDIA cuRandom will create our initial random data
        #pragma acc host_data use_device(positions)
        {
             fill_rand(positions, length, stream);
        }

        // OpenACC will calculate the hash value for each particle
        #pragma acc parallel loop
        for(i=0; i<num; i++) {
            index = i*3;
            x = positions[index];
            y = positions[index+1];
            z = positions[index+2];

            // Key is hash value and value is 'particle id'
            keys[i] = hash_val(x,y,z);
            values[i] = i;
        }

        // Thrust will be used to sort our key value pairs
        #pragma acc host_data use_device(keys, values)
        {
            sort(keys, values, num, stream);
        }
    }

    return 0;
}

// Uniform grid hash
inline int hash_val(float x, float y, float z)
{
    double spacing = 0.01;

    // Calculate grid coordinates
    int grid_x,grid_y,grid_z;
    grid_x = abs(floor(x/spacing));
    grid_y = abs(floor(y/spacing));
    grid_z = abs(floor(z/spacing));

    int num_x = 1.0/spacing;
    int num_y = 1.0/spacing;
    int num_z = 1.0/spacing;

    int grid_position = (num_x * num_y * grid_z) + (grid_y * num_x + grid_x);
    return grid_position;
}

sortGPU.cu

#include <stdio.h>
#include <curand.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <thrust/binary_search.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/system_error.h>

// Fill d_buffer with num random numbers
extern "C" void fill_rand(float *d_buffer, int num, void *stream)
{
  curandGenerator_t gen;
  int status;

  // Create generator
  status = curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);

  // Set CUDA stream
  status |= curandSetStream(gen, (cudaStream_t)stream);

  // Set seed
  status |= curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

  // Generate num random numbers
  status |= curandGenerateUniform(gen, d_buffer, num);

  // Cleanup generator
  status |= curandDestroyGenerator(gen);

  if (status != CURAND_STATUS_SUCCESS) {
      printf ("curand failure!\n");
      exit (EXIT_FAILURE);
  }
}

// Sort key value pairs
extern "C" void sort(int *keys, int *values, int num, void *stream)
{
    try {
        // Sort keys AND values array by key
        thrust::sort_by_key(thrust::cuda::par.on((cudaStream_t)stream),
                            keys, keys + num, values);
    }
    catch(thrust::system_error &e) {
        std::cerr << "Error sorting with Thrust: " << e.what() << std::endl;
        exit (EXIT_FAILURE);
    }
}

Compiling C/C++

Cray

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ nvcc sortGPU.cu -c
$ cc -hpragma=acc -lcurand sort.c sortGPU.o

PGI

$ module load cudatoolkit

$ nvcc sortGPU.cu -c
$ cc -acc -lcurand sort.c sortGPU.o -Minline

Fortran

sort.f90

! Module containing nvcc compiled function declarations and hash function
module sortgpu
  INTERFACE
    subroutine fill_rand(positions, length, stream) BIND(C,NAME='fill_rand')
      USE ISO_C_BINDING
      implicit none
      type (C_PTR), value :: positions
      integer (C_INT), value :: length
      type (C_PTR), value :: stream
    end subroutine fill_rand
 
    subroutine sort(keys, values, num, stream) BIND(C,NAME='sort')
      USE ISO_C_BINDING
      implicit none
      type (C_PTR), value :: keys
      type (C_PTR), value :: values
      integer (C_INT), value :: num
      type (C_PTR), value :: stream
    end subroutine sort
  END INTERFACE
 
  contains
    integer function hash_val(x,y,z)
      real :: x,y,z,spacing
      integer :: grid_x, grid_y, grid_z
      integer :: num_x, num_y, num_z
 
      spacing =  0.01;
 
      grid_x = abs(floor(x/spacing))
      grid_y = abs(floor(y/spacing))
      grid_z = abs(floor(z/spacing))
 
      num_x = 1.0/spacing
      num_y = 1.0/spacing
      num_z = 1.0/spacing
 
      hash_val =  (num_x * num_y * grid_z) + (grid_y * num_x + grid_x) + 1
      return
    end function hash_val
end module sortgpu
 
program interop
    use ISO_C_BINDING
    use sortgpu
    use openacc
    implicit none
 
    integer :: i,indx
    real :: x,y,z
    integer, parameter :: dims = 3
    integer(C_INT), parameter :: num = 1000000
    integer(C_INT), parameter :: length = dims*num
    real (C_FLOAT) :: positions(length)
    integer(C_INT) :: keys(num)
    integer(C_INT) :: values(num)
    type (C_PTR) :: stream

    ! Query OpenACC for CUDA stream
    stream = acc_get_cuda_stream(acc_async_sync)

    ! OpenACC will create positions, keys, and values arrays on the device
    !$acc data create(positions, keys, values)
 
        ! NVIDIA cuRandom will create our initial random data
        !$acc host_data use_device(positions)
        call fill_rand(C_LOC(positions), length, stream)
        !$acc end host_data
 
        ! OpenACC will calculate the hash value for each particle
        !$acc parallel loop
        do i=0,num-1
            indx = i*3;
            x = positions(indx+1)
            y = positions(indx+2)
            z = positions(indx+3)
 
            ! Key is hash value and value is 'particle id'
            keys(i+1) = hash_val(x,y,z);
            values(i+1) = i+1;
        enddo
        !$acc end parallel loop
 
        ! Thrust will be used to sort our key value pairs
        !$acc host_data use_device(keys, values)
        call sort(C_LOC(keys), C_LOC(values), num, stream);
        !$acc end host_data
 
    !$acc end data    
 
end program interop

sortGPU.cu

#include <stdio.h>
#include <curand.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <thrust/binary_search.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/system_error.h>

// Fill d_buffer with num random numbers
extern "C" void fill_rand(float *d_buffer, int num, void *stream)
{
  curandGenerator_t gen;
  int status;

  // Create generator
  status = curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);

  // Set CUDA stream
  status |= curandSetStream(gen, (cudaStream_t)stream);

  // Set seed
  status |= curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

  // Generate num random numbers
  status |= curandGenerateUniform(gen, d_buffer, num);

  // Cleanup generator
  status |= curandDestroyGenerator(gen);

  if (status != CURAND_STATUS_SUCCESS) {
      printf ("curand failure!\n");
      exit (EXIT_FAILURE);
  }
}

// Sort key value pairs
extern "C" void sort(int *keys, int *values, int num, void *stream)
{
    try {
        // Sort keys AND values array by key
        thrust::sort_by_key(thrust::cuda::par.on((cudaStream_t)stream),
                            keys, keys + num, values);
    }
    catch(thrust::system_error &e) {
        std::cerr << "Error sorting with Thrust: " << e.what() << std::endl;
        exit (EXIT_FAILURE);
    }
}

Compiling Fortran

Cray

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ nvcc sortGPU.cu -c
$ ftn -hacc -lcurand sort.f90 sortGPU.o

PGI

$ module load cudatoolkit
$ module switch pgi pgi/13.6.0

$ nvcc sortGPU.cu -c
$ ftn -acc -lcurand sort.f90 sortGPU.o -Minline=name:hash_val

This tutorial will cover linking GPU code compiled with NVCC and CPU code compiled with the Cray compiler wrappers cc/CC/ftn. The Cray compiler wrappers work in conjunction with the modules system to link in needed libraries such as MPI; it is therefore recommended that the wrappers be used to compile the CPU portions of your code. To generate compiler-portable code it is necessary to compile CUDA C and code containing CUDA runtime calls with NVCC. The resulting NVCC-compiled object code must then be linked in with object code compiled with the Cray wrappers. NVCC performs GNU-style C++ name mangling on compiled functions, so care must be taken in compiling and linking codes. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov

C++

When linking C++ host code with NVCC compiled code the C++ code must use GNU compatible name mangling. This is controlled through compiler wrapper flags.

vecAddWrapperCXX.cu

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(float *a, float *b, float *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;

    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

void vecAdd_wrapper()
{
    // Size of vectors
    int n = 100000;

    // Host input vectors
    float *h_a;
    float *h_b;
    //Host output vector
    float *h_c;

    // Device input vectors
    float *d_a;
    float *d_b;
    //Device output vector
    float *d_c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(float);

    // Allocate memory for each vector on host
    h_a = (float*)malloc(bytes);
    h_b = (float*)malloc(bytes);
    h_c = (float*)malloc(bytes);

    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    int i;
    // Initialize vectors on host
    for( i = 0; i < n; i++ ) {
        h_a[i] = sinf(i)*sinf(i);
        h_b[i] = cosf(i)*cosf(i);
    }

    // Copy host vectors to device
    cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int blockSize, gridSize;

    // Number of threads in each thread block
    blockSize = 1024;

    // Number of thread blocks in grid
    gridSize = (int)ceil((float)n/blockSize);

    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Copy array back to host
    cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

    // Sum up vector c and print result divided by n, this should equal 1 within error
    float sum = 0;
    for(i=0; i<n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);

    // Release device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);
}

vecAdd.cxx

#include <stdio.h>

void vecAdd_wrapper(void);

int main( int argc, char* argv[] )
{
    vecAdd_wrapper();

    return 0;
}

Compiling C++

PGI

$ module load cudatoolkit
$ nvcc -c vecAddWrapperCXX.cu
$ CC --gnu vecAdd.cxx vecAddWrapperCXX.o

Cray

$ module load cudatoolkit
$ nvcc -c vecAddWrapperCXX.cu
$ CC vecAdd.cxx vecAddWrapperCXX.o

GNU

$ module load cudatoolkit
$ nvcc -c vecAddWrapperCXX.cu
$ CC vecAdd.cxx vecAddWrapperCXX.o

Intel

$ module load cudatoolkit
$ nvcc -c vecAddWrapperCXX.cu
$ CC vecAdd.cxx vecAddWrapperCXX.o

C

NVCC name mangling must be disabled if it is to be linked with C code. This requires the extern “C” function qualifier.

vecAddWrapperC.cu

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(float *a, float *b, float *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;

    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

extern "C" void vecAdd_wrapper()
{
    // Size of vectors
    int n = 100000;

    // Host input vectors
    float *h_a;
    float *h_b;
    //Host output vector
    float *h_c;

    // Device input vectors
    float *d_a;
    float *d_b;
    //Device output vector
    float *d_c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(float);

    // Allocate memory for each vector on host
    h_a = (float*)malloc(bytes);
    h_b = (float*)malloc(bytes);
    h_c = (float*)malloc(bytes);

    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    int i;
    // Initialize vectors on host
    for( i = 0; i < n; i++ ) {
        h_a[i] = sinf(i)*sinf(i);
        h_b[i] = cosf(i)*cosf(i);
    }

    // Copy host vectors to device
    cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int blockSize, gridSize;

    // Number of threads in each thread block
    blockSize = 1024;

    // Number of thread blocks in grid
    gridSize = (int)ceil((float)n/blockSize);

    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Copy array back to host
    cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

    // Sum up vector c and print result divided by n, this should equal 1 within error
    float sum = 0;
    for(i=0; i<n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);

    // Release device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);
}

vecAdd.c

#include <stdio.h>

void vecAdd_wrapper(void);

int main( int argc, char* argv[] )
{
    vecAdd_wrapper();

    return 0;
}

Compiling C

There should be no compiler-dependent variations for C.

$ module load cudatoolkit
$ nvcc -c vecAddWrapperC.cu
$ cc vecAdd.c vecAddWrapperC.o

Fortran: Simple

This simple method allows C code to be called from Fortran but requires that the C function name be modified. NVCC name mangling must be disabled if the code is to be linked with Fortran; this requires the extern “C” function qualifier. Additionally, function names must be lowercase and end in an underscore, _ . The ISO_C_BINDING method following this example allows you to call C functions from Fortran without modifying the C function name.

vecAddWrapperF.cu

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(float *a, float *b, float *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;

    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

extern "C" void vecadd_wrapper_()
{
    // Size of vectors
    int n = 100000;

    // Host input vectors
    float *h_a;
    float *h_b;
    //Host output vector
    float *h_c;

    // Device input vectors
    float *d_a;
    float *d_b;
    //Device output vector
    float *d_c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(float);

    // Allocate memory for each vector on host
    h_a = (float*)malloc(bytes);
    h_b = (float*)malloc(bytes);
    h_c = (float*)malloc(bytes);

    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    int i;
    // Initialize vectors on host
    for( i = 0; i < n; i++ ) {
        h_a[i] = sinf(i)*sinf(i);
        h_b[i] = cosf(i)*cosf(i);
    }

    // Copy host vectors to device
    cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int blockSize, gridSize;

    // Number of threads in each thread block
    blockSize = 1024;

    // Number of thread blocks in grid
    gridSize = (int)ceil((float)n/blockSize);

    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Copy array back to host
    cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

    // Sum up vector c and print result divided by n, this should equal 1 within error
    float sum = 0;
    for(i=0; i<n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);

    // Release device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);
}

vecAdd.f90

program vecAdd
    call vecadd_wrapper()
end program vecAdd

Compiling Fortran: Simple

There should be no compiler-dependent variations for Fortran.

$ module load cudatoolkit
$ nvcc -c vecAddWrapperF.cu
$ ftn vecAddWrapperF.o vecAdd.f90

Fortran: ISO_C_BINDING

The ISO_C_BINDING module provides Fortran with greater interoperability with C and removes the need to modify the C function name.

vecAddWrapperISO.cu

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(float *a, float *b, float *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;

    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

extern "C" void vecAdd_wrapper()
{
    // Size of vectors
    int n = 100000;

    // Host input vectors
    float *h_a;
    float *h_b;
    //Host output vector
    float *h_c;

    // Device input vectors
    float *d_a;
    float *d_b;
    //Device output vector
    float *d_c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(float);

    // Allocate memory for each vector on host
    h_a = (float*)malloc(bytes);
    h_b = (float*)malloc(bytes);
    h_c = (float*)malloc(bytes);

    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    int i;
    // Initialize vectors on host
    for( i = 0; i < n; i++ ) {
        h_a[i] = sinf(i)*sinf(i);
        h_b[i] = cosf(i)*cosf(i);
    }

    // Copy host vectors to device
    cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int blockSize, gridSize;

    // Number of threads in each thread block
    blockSize = 1024;

    // Number of thread blocks in grid
    gridSize = (int)ceil((float)n/blockSize);

    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Copy array back to host
    cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

    // Sum up vector c and print result divided by n, this should equal 1 within error
    float sum = 0;
    for(i=0; i<n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);

    // Release device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);
}

vecAddISO.f90

module vector
  INTERFACE
    subroutine vecadd_wrapper() BIND (C, NAME='vecAdd_wrapper')
      USE ISO_C_BINDING
      implicit none
    end subroutine vecadd_wrapper
  END INTERFACE
end module vector

program vecAdd
    use ISO_C_BINDING
    use vector

    call vecadd_wrapper()

end program vecAdd

Compiling Fortran: ISO_C_BINDING

There should be no compiler-dependent variations for Fortran.

$ module load cudatoolkit
$ nvcc -c vecAddWrapperISO.cu 
$ ftn vecAddWrapperISO.o vecAddISO.f90

Cray’s implementation of MPICH2 allows GPU memory buffers to be passed directly to MPI function calls, eliminating the need to manually copy GPU data to the host before passing data through MPI. Several examples of using this feature are given below. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.

How to Enable

To enable GPUDirect the following steps must be taken before invoking aprun:

  $ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
  $ export MPICH_RDMA_ENABLED_CUDA=1

Examples

The following examples initialize memory on the GPU and then perform an MPI_Allgather operation between GPUs.

For these examples, request an interactive job with two nodes.

$ qsub -I -A PROJ### -lnodes=2,walltime=00:30:00

The executables must be placed in /tmp/work/$USER to run properly on the GPU.
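
For example, once an executable has been built (direct.out in the examples below):

$ cp direct.out /tmp/work/$USER
$ cd /tmp/work/$USER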

CUDA C

direct.cpp

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <mpi.h>

int main( int argc, char** argv )
{
    MPI_Init (&argc, &argv);

    int direct;
    int rank, size;
    int *h_buff = NULL;
    int *d_rank = NULL;
    int *d_buff = NULL;
    size_t bytes;
    int i;

    // Ensure that RDMA ENABLED CUDA is set correctly
    direct = getenv("MPICH_RDMA_ENABLED_CUDA")==NULL?0:atoi(getenv ("MPICH_RDMA_ENABLED_CUDA"));
    if(direct != 1){
        printf ("MPICH_RDMA_ENABLED_CUDA not enabled!\n");
        exit (EXIT_FAILURE);
    }

    // Get MPI rank and size
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Comm_size (MPI_COMM_WORLD, &size);

    // Allocate host and device buffers and copy rank value to GPU
    bytes = size*sizeof(int);
    h_buff = (int*)malloc(bytes);
    cudaMalloc(&d_buff, bytes);
    cudaMalloc(&d_rank, sizeof(int));
    cudaMemcpy(d_rank, &rank, sizeof(int), cudaMemcpyHostToDevice);

    // Perform Allgather using device buffer
    MPI_Allgather(d_rank, 1, MPI_INT, d_buff, 1, MPI_INT, MPI_COMM_WORLD);

    // Check that the GPU buffer is correct
    cudaMemcpy(h_buff, d_buff, bytes, cudaMemcpyDeviceToHost);
    for(i=0; i<size; i++){
        if(h_buff[i] != i) {
            printf ("Alltoall Failed!\n");
            exit (EXIT_FAILURE);
        }
    }
    if(rank==0)
        printf("Success!\n");

    // Clean up
    free(h_buff);
    cudaFree(d_buff);
    cudaFree(d_rank);
    MPI_Finalize();

    return 0;
}

Compiling CUDA C

For ease of compiling the GNU environment will be used.

  $ module load cudatoolkit
  $ module switch PrgEnv-pgi PrgEnv-gnu
  $ CC -lcudart direct.cpp -o direct.out

Running CUDA C

  $ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
  $ export MPICH_RDMA_ENABLED_CUDA=1
  $ aprun -n2 -N1 ./direct.out

CUDA Fortran

direct.cuf

program GPUdirect
    use cudafor
    implicit none
    include 'mpif.h'

    integer :: direct
    character(len=255) :: env_var
    integer :: rank, size, ierror
    integer,dimension(:),allocatable :: h_buff
    integer,device :: d_rank
    integer,dimension(:),allocatable,device :: d_buff
    integer :: i

    call getenv("MPICH_RDMA_ENABLED_CUDA", env_var)
    read( env_var, '(i10)' ) direct
    if (direct .NE. 1) then
      print *, 'MPICH_RDMA_ENABLED_CUDA not enabled!'
      call exit(1)
    endif

    call MPI_INIT(ierror)

    ! Get MPI rank and size
    call MPI_COMM_RANK (MPI_COMM_WORLD, rank, ierror)
    call MPI_COMM_SIZE (MPI_COMM_WORLD, size, ierror)

    ! Initialize host and device buffers
    allocate(h_buff(size))
    allocate(d_buff(size))
    ! Implicitly copy rank to device
    d_rank = rank

    ! Perform allgather using device buffers
    call MPI_ALLGATHER(d_rank, 1, MPI_INTEGER, d_buff, 1, MPI_INTEGER, MPI_COMM_WORLD, ierror);

    ! Check that buffer is correct
    h_buff = d_buff(1:size)
    do i=1,size
        if (h_buff(i) .NE. i-1) then
            print *, 'Alltoall Failed!'
            call exit(1)
        endif
    enddo
    if (rank .EQ. 0) then
        print *, 'Success!'
    endif

    ! Clean up
    deallocate(h_buff)
    deallocate(d_buff)
    call MPI_FINALIZE(ierror)

end program GPUdirect

Compiling CUDA Fortran

  $ module load cudatoolkit
  $ ftn direct.cuf -o direct.out

Running CUDA Fortran

  $ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
  $ export MPICH_RDMA_ENABLED_CUDA=1
  $ aprun -n2 -N1 ./direct.out

OpenACC C

direct.c

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main( int argc, char** argv )
{
    MPI_Init (&argc, &argv);

    int direct;
    int rank, size;
    int *restrict buff = NULL;
    size_t bytes;
    int i;

    // Ensure that RDMA ENABLED CUDA is set correctly
    direct = getenv("MPICH_RDMA_ENABLED_CUDA")==NULL?0:atoi(getenv ("MPICH_RDMA_ENABLED_CUDA"));
    if(direct != 1){
        printf ("MPICH_RDMA_ENABLED_CUDA not enabled!\n");
        exit (EXIT_FAILURE);
    }

    // Get MPI rank and size
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Comm_size (MPI_COMM_WORLD, &size);

    // Initialize buffer
    bytes = size*sizeof(int);
    buff = (int*)malloc(bytes);

    // Copy buff to device at start of region and back to host at end of region
    #pragma acc data copy(rank, buff[0:size])
    {
        // Inside this region the device data pointer will be used
        #pragma acc host_data use_device(rank, buff)
        {
            MPI_Allgather(&rank, 1, MPI_INT, buff, 1, MPI_INT, MPI_COMM_WORLD);
        }
    }

    // Check that buffer is correct
    for(i=0; i<size; i++){
        if(buff[i] != i) {
            printf ("Alltoall Failed!\n");
            exit (EXIT_FAILURE);
        }
    }
    if(rank==0)
        printf("Success!\n");

    // Clean up
    free(buff);

    MPI_Finalize();

    return 0;
}

Compiling OpenACC C

CRAY

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ cc -hpragma=acc direct.c -o direct.out

PGI

$ module load cudatoolkit
$ cc -acc -lcudart direct.c -o direct.out

Running OpenACC C

CRAY

$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ aprun -n2 -N1 ./direct.out

PGI

$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ aprun -n2 -N1 ./direct.out

OpenACC Fortran

direct.f90

program GPUdirect
    include 'mpif.h'

    integer :: direct
    character(len=255) :: env_var
    integer :: rank, size, ierror
    integer,dimension(:),allocatable :: buff
    integer :: i

    call getenv("MPICH_RDMA_ENABLED_CUDA", env_var)
    read( env_var, '(i10)' ) direct
    if (direct .NE. 1) then
      print *, 'MPICH_RDMA_ENABLED_CUDA not enabled!'
      call exit(1)
    endif

    call MPI_INIT(ierror)

    ! Get MPI rank and size
    call MPI_COMM_RANK (MPI_COMM_WORLD, rank, ierror)
    call MPI_COMM_SIZE (MPI_COMM_WORLD, size, ierror)

    ! Initialize buffer
    allocate(buff(size))

    ! Copy buff to device at start of region and back to host at end of region
    !$acc data copy(rank, buff(1:size))
        ! Inside this region the device data pointer will be used
        !$acc host_data use_device(rank, buff)
            ! Perform allgather using device buffer
            call MPI_ALLGATHER(rank, 1, MPI_INT, buff, 1, MPI_INT, MPI_COMM_WORLD, ierror);
        !$acc end host_data
    !$acc end data

    ! Check that buffer is correct
    do i=1,size
        if (buff(i) .NE. i-1) then
            print *, 'Alltoall Failed!'
            call exit(1)
        endif
    enddo
    if (rank .EQ. 0) then
        print *, 'Success!'
    endif

    ! Clean up
    deallocate(buff)
    call MPI_FINALIZE(ierror)
end program GPUdirect

Compiling OpenACC Fortran

CRAY

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ ftn -hacc direct.f90 -o direct.out

PGI

$ module load cudatoolkit
$ ftn -acc -lcudart direct.f90 -o direct.out

Running OpenACC Fortran

CRAY

$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ aprun -n2 -N1 ./direct.out

PGI

$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ aprun -n2 -N1 ./direct.out

Optimizations

Several optimizations for improving performance are given below. These optimizations are highly application dependent and may require some trial and error tuning to achieve best results.

Pipelining

Pipelining allows for overlapping of GPU to GPU MPI messages and may improve message passing performance for large, bandwidth-bound messages. Setting the environment variable MPICH_G2G_PIPELINE=N allows a maximum of N GPU to GPU messages to be in flight at any given time. The default value of MPICH_G2G_PIPELINE is 16, and messages under 8 kilobytes in size are never pipelined.
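
For example, to allow up to 64 GPU to GPU messages in flight when running one of the GPUDirect examples above (the value 64 is only illustrative and should be tuned per application):

$ export MPICH_G2G_PIPELINE=64
$ aprun -n2 -N1 ./direct.out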

Nemesis

Applications using asynchronous MPI calls may benefit from enabling the MPICH asynchronous progress feature. Setting the MPICH_NEMESIS_ASYNC_PROGRESS=1 environment variable enables additional threads to be spawned to progress the MPI state.

This feature requires that the thread level be set to multiple: MPICH_MAX_THREAD_SAFETY=multiple.

This feature works best when used in conjunction with core specialization: aprun -r N, which allows for N CPU cores to be reserved for system services.
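
Putting these pieces together, a hypothetical launch enabling asynchronous progress with one core per node reserved for system services might look like:

$ export MPICH_NEMESIS_ASYNC_PROGRESS=1
$ export MPICH_MAX_THREAD_SAFETY=multiple
$ aprun -r 1 -n2 -N1 ./direct.out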

The default GPU compute mode for Titan is Exclusive Process. In this mode, many threads within a process may access the GPU context. To allow multiple processes access to the GPU context, such as multiple MPI tasks on a single node accessing the GPU, the CUDA proxy server was developed. Once enabled, the proxy server transparently manages work issued to the GPU context from multiple processes. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov

Currently GPU memory between processes accessing the proxy is not guarded, meaning process i can access memory allocated by process j. This SHOULD NOT be used to share memory between processes, and care should be taken to ensure processes only access GPU memory they have allocated themselves.

How to Enable

To enable the proxy server the following steps must be taken before invoking aprun:

$ export CRAY_CUDA_MPS=1

Issues

Currently, GPU debugging and profiling are not supported when the proxy is enabled. Specifying the qsub flag -lfeature=gpudefault will switch the compute mode from exclusive process to the CUDA default mode. In the default mode debugging and profiling are available and multiple MPI ranks will be able to access the GPU. The default compute mode is not recommended on Titan: in the default compute mode approximately 120 MB of device memory is used per process accessing the GPU, and inconsistent behavior may be encountered under certain conditions.
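
For example, a hypothetical interactive job requesting the default compute mode could be submitted as:

$ qsub -I -A PROJ### -lnodes=1,walltime=00:30:00 -lfeature=gpudefault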

Examples

The following examples demonstrate when and how to use the CUDA proxy. In each example several kernels are launched; these examples will be used in a future tutorial to investigate how multiple kernels are scheduled on the GPU.

For these examples, request an interactive job with a single node.

$ qsub -I -A PROJ### -lnodes=1,walltime=00:30:00

C Wrapper

The following C wrapper functions will be used to set up and launch a kernel that sleeps a single GPU thread for a specified number of seconds. This file will be compiled with NVCC and then linked into our examples.

pauseWrapper.cu

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// CUDA kernel to pause for at least num_cycle cycles
__global__ void sleep(int64_t num_cycles)
{
    int64_t cycles = 0;
    int64_t start = clock64();
    while(cycles < num_cycles) {
        cycles = clock64() - start;
    }
}

// Returns number of cycles required for requested seconds
extern "C" int64_t get_cycles(int seconds)
{
    // Get device frequency in Hz
    int64_t Hz;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    Hz = int64_t(prop.clockRate) * 1000;

    // Calculate number of cycles to wait
    int64_t num_cycles;
    num_cycles = seconds * Hz;

    return num_cycles;
}

// Launches a kernel that sleeps for at least num_cycles
extern "C" void sleep_kernel(int64_t num_cycles)
{
    // Our kernel will launch a single thread to sleep the kernel
    int blockSize, gridSize;
    blockSize = 1;
    gridSize = 1;

    // Execute the kernel
    sleep<<<gridSize, blockSize>>>(num_cycles);
}

// Wait for all pending GPU transactions to end
extern "C" void wait_for_gpu()
{
    cudaDeviceSynchronize();
}

Compiling

The following will compile the wrapper functions into object code to be linked in with our examples.

$ module load cudatoolkit
$ nvcc -c -arch=sm_35 pauseWrapper.cu

Multiple kernels single process

Launching multiple kernels from a single process does not require the CUDA proxy.

launcher.cpp

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

extern "C" int64_t get_cycles(int seconds);
extern "C" void sleep_kernel(int64_t num_cycles);
extern "C" void wait_for_gpu();

int main(int argc, char *argv[])
{
    // Get number of cycles to sleep for 1 second
    uint64_t cycles;
    cycles = get_cycles(1);

    // Number of kernels to launch
    int num_kernels = 14;

    // Launch num_kernels kernels asynchronously
    for(int i=0; i<num_kernels; i++){
        sleep_kernel(cycles);
    }

    // Wait for the kernel to complete
    wait_for_gpu();

    return 0;
}

Compiling and running multiple kernels from a single process

$ CC pauseWrapper.o launcher.cpp -o sleep.out
$ aprun -n1 ./sleep.out

Multiple kernels launched from multiple threads

Launching multiple kernels from multiple threads spawned from a single process does not require the CUDA proxy.

launcherOMP.cpp

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <omp.h>

extern "C" int64_t get_cycles(int seconds);
extern "C" void sleep_kernel(int64_t num_cycles);
extern "C" void wait_for_gpu();

int main(int argc, char *argv[])
{
    // Get number of cycles to sleep for 1 second
    uint64_t cycles;
    cycles = get_cycles(1);

    // Number of kernels to launch
    int num_kernels = 14;

    // Launch kernel
    omp_set_num_threads(num_kernels);

    #pragma omp parallel for shared(cycles) num_threads(num_kernels)
    for(int i=0; i<num_kernels; i++){
        sleep_kernel(cycles);
    }

    // Wait for the kernel to complete
    wait_for_gpu();
}

Compiling and running multiple kernels launched from multiple threads

$ CC pauseWrapper.o launcherOMP.cpp -o sleep.out
$ aprun -n1 -d14 ./sleep.out

Multiple kernels launched from multiple MPI tasks

Launching multiple kernels from multiple MPI tasks on a single node does require the CUDA proxy.

launcherMPI.cpp

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <mpi.h>

extern "C" int64_t get_cycles(int seconds);
extern "C" void sleep_kernel(int64_t num_cycles);
extern "C" void wait_for_gpu();

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Get number of cycles to sleep for 1 second
    uint64_t cycles;
    cycles = get_cycles(1);

    // Sleep kernel for 1 second
    sleep_kernel(cycles);

    // Wait for the kernel to complete
    wait_for_gpu();

    MPI_Finalize();
    return 0;
}

Compiling and running multiple kernels launched from multiple MPI tasks

$ CC pauseWrapper.o launcherMPI.cpp -o sleep.out
$ export CRAY_CUDA_MPS=1
$ aprun -n14 -N14 ./sleep.out

This sample shows a minimal conversion from our vector addition CPU code to C for CUDA; consider this a CUDA C ‘Hello World’. If you are not yet familiar with basic CUDA concepts please see the Accelerated Computing Guide. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov

vecAdd.cu

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;

    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

int main( int argc, char* argv[] )
{
    // Size of vectors
    int n = 100000;

    // Host input vectors
    double *h_a;
    double *h_b;
    //Host output vector
    double *h_c;

    // Device input vectors
    double *d_a;
    double *d_b;
    //Device output vector
    double *d_c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector on host
    h_a = (double*)malloc(bytes);
    h_b = (double*)malloc(bytes);
    h_c = (double*)malloc(bytes);

    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    int i;
    // Initialize vectors on host
    for( i = 0; i < n; i++ ) {
        h_a[i] = sin(i)*sin(i);
        h_b[i] = cos(i)*cos(i);
    }

    // Copy host vectors to device
    cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int blockSize, gridSize;

    // Number of threads in each thread block
    blockSize = 1024;

    // Number of thread blocks in grid
    gridSize = (int)ceil((float)n/blockSize);

    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Copy array back to host
    cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);

    // Release device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}

Changes

Kernel:
The kernel is the heart of our CUDA code. When a kernel is launched the number of threads per block (blockDim) and the number of blocks per grid (gridDim) are specified. The total number of threads is (blockDim) * (gridDim). Each thread evaluates one copy of the kernel.

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;

    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

Let’s take a look at what makes up this simple kernel.

__global__ void vecAdd(double *a, double *b, double *c, int n)

The __global__ decorator specifies this is a CUDA kernel, otherwise normal C function syntax is used. The kernel must have return type void.

int id = blockIdx.x*blockDim.x+threadIdx.x;

We calculate the thread's global id using CUDA supplied structs. blockIdx contains the block's position in the grid, ranging from 0 to gridDim-1. threadIdx is the thread's index inside of its associated block, ranging from 0 to blockDim-1. For convenience blocks and grids can be multi-dimensional; the associated structs contain x, y, and z members.
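
As a hypothetical illustration (not part of vecAdd.cu), a kernel operating on a 2D array might combine the x and y components to form a two-dimensional global index:

__global__ void scale2D(float *data, int width, int height)
{
    // Global column and row of this thread
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    int row = blockIdx.y*blockDim.y + threadIdx.y;

    // Guard against threads that fall outside the array
    if (row < height && col < width)
        data[row*width + col] *= 2.0f;
}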

if (id < n)

Unless we have an array length divisible by our blockDim we will not have the same number of threads launched as valid array components. To avoid overstepping our array we simply test to make sure our global thread ID is less than the length of our array.
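
For example, with n = 100000 and blockSize = 1024 the grid size is ceil(100000/1024) = 98 blocks, so 98 * 1024 = 100352 threads are launched; the final 352 threads fail the id < n test and simply do nothing.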

c[id] = a[id] + b[id]

The thread ID is used to index the arrays that reside in global device memory. Each thread will load a value from a and b and write the sum to c.

Array Pointers:

// Host input vectors
double *h_a;
double *h_b;
// Host output vector
double *h_c;

// Device input vectors
double *d_a;
double *d_b;
// Device output vector
double *d_c;

With the host CPU and GPU having separate memory spaces we must maintain two sets of pointers, one set for our host arrays and one set for our device arrays. Here we use the h_ and d_ prefix to differentiate them.

cudaMalloc:

// Allocate memory for each vector on GPU
cudaMalloc(&d_a, bytes);
cudaMalloc(&d_b, bytes);
cudaMalloc(&d_c, bytes);

Given a pointer, cudaMalloc will allocate memory in the device's global memory space and set the passed pointer to point to this memory.

cudaMemcpy to device:

// Copy host vectors to device
cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

After we initialize the data on the host we must copy it over to the device; to do so we use cudaMemcpy. Calling cudaMemcpy initiates a DMA transfer between host and device.

Thread mapping:

int blockSize, gridSize;
 
// Number of threads in each thread block
blockSize = 1024;
 
// Number of thread blocks in grid
gridSize = (int)ceil((float)n/blockSize);

To launch our kernel we must specify the number of threads per block and the number of blocks in our grid. The maximum value of both is dependent on the device's compute capability. Normally the blockSize will be chosen based upon kernel memory requirements and the gridSize calculated such that enough threads are launched to cover the kernel domain.

Launch kernel:

// Execute the kernel
vecAdd<<< gridSize, blockSize >>>(d_a, d_b, d_c, n);

We launch our kernel with a modified C function syntax that lets us specify the grid and block sizes. Kernel calls are non-blocking.
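
Because the launch returns immediately, launch errors are also reported asynchronously. A sketch of explicit checking and synchronization (not part of vecAdd.cu) might look like:

    // Execute the kernel; the call returns before the kernel finishes
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Check for launch configuration errors
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("kernel launch failed: %s\n", cudaGetErrorString(err));

    // Block the host until the kernel has completed
    cudaDeviceSynchronize();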

cudaMemcpy to host:

// Copy array back to host
cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

cudaMemcpy will block until the kernel is complete and then copy the results from the device back to the host.

Release device memory:

// Release device memory
 cudaFree(d_a);
 cudaFree(d_b);
 cudaFree(d_c);

As with C's malloc, we need to release our device arrays when we are finished; to do so we use cudaFree.

Compiling

The CUDA environment must be set up before compiling, so make sure the cudatoolkit module is loaded:

$ module load cudatoolkit
$ nvcc vecAdd.cu -o vecAdd.out

Running

$ aprun ./vecAdd.out
final result: 1.000000

In this tutorial we will cover three different CUDA versions of our GOL program, each highlighting a different CUDA memory area. The first strictly uses global memory and is a straightforward GPU port of our program, the second introduces shared memory, and the last covers the use of texture memory. A description of the Game of Life and the original CPU code can be found here. The full source can be viewed or downloaded at the OLCF GitHub. Please direct any questions or comments to help@nccs.gov

GOL-CUDA.cu

#include <stdio.h>
#include <stdlib.h>

#define SRAND_VALUE 1985
#define BLOCK_SIZE 16

__global__ void ghostRows(int dim, int *grid)
{
    // We want id ∈ [1,dim]
    int id = blockDim.x * blockIdx.x + threadIdx.x + 1;

    if (id <= dim)
    {
        //Copy first real row to bottom ghost row
        grid[(dim+2)*(dim+1)+id] = grid[(dim+2)+id];
        //Copy last real row to top ghost row
        grid[id] = grid[(dim+2)*dim + id];
    }
}

__global__ void ghostCols(int dim, int *grid)
{
    // We want id ∈ [0,dim+1]
    int id = blockDim.x * blockIdx.x + threadIdx.x;

    if (id <= dim+1)
    {
        //Copy first real column to right most ghost column
        grid[id*(dim+2)+dim+1] = grid[id*(dim+2)+1];
        //Copy last real column to left most ghost column 
        grid[id*(dim+2)] = grid[id*(dim+2) + dim];    
    }
}

__global__ void GOL(int dim, int *grid, int *newGrid)
{
    // We want id ∈ [1,dim]
    int iy = blockDim.y * blockIdx.y + threadIdx.y + 1;
    int ix = blockDim.x * blockIdx.x + threadIdx.x + 1;
    int id = iy * (dim+2) + ix;

    int numNeighbors;

    if (iy <= dim && ix <= dim) {

        // Get the number of neighbors for a given grid point
        numNeighbors = grid[id+(dim+2)] + grid[id-(dim+2)] //upper lower
                     + grid[id+1] + grid[id-1]             //right left
                     + grid[id+(dim+3)] + grid[id-(dim+3)] //diagonals
                     + grid[id-(dim+1)] + grid[id+(dim+1)];

        int cell = grid[id];
        // Here we have explicitly all of the game rules
        if (cell == 1 && numNeighbors < 2)
            newGrid[id] = 0;
        else if (cell == 1 && (numNeighbors == 2 || numNeighbors == 3))
            newGrid[id] = 1;
        else if (cell == 1 && numNeighbors > 3)
            newGrid[id] = 0;
        else if (cell == 0 && numNeighbors == 3)
            newGrid[id] = 1;
        else
            newGrid[id] = cell;
    }
}

int main(int argc, char* argv[])
{
    int i,j,iter;
    int* h_grid; //Grid on host
    int* d_grid; //Grid on device
    int* d_newGrid; //Second grid used on device only
    int* d_tmpGrid; //tmp grid pointer used to switch between grid and newGrid

    int dim = 1024; //Linear dimension of our grid - not counting ghost cells
    int maxIter = 1<<10; //Number of game steps

    size_t bytes = sizeof(int)*(dim+2)*(dim+2);//2 added for periodic boundary condition ghost cells
    // Allocate host Grid used for initial setup and read back from device
    h_grid = (int*)malloc(bytes);

    // Allocate device grids
    cudaMalloc(&d_grid, bytes);
    cudaMalloc(&d_newGrid, bytes);

    // Assign initial population randomly
    srand(SRAND_VALUE);
    for(i = 1; i<=dim; i++) {
        for(j = 1; j<=dim; j++) {
            h_grid[i*(dim+2)+j] = rand() % 2;
        }
    }

    // Copy over initial game grid (Dim-1 threads)
    cudaMemcpy(d_grid, h_grid, bytes, cudaMemcpyHostToDevice);

    dim3 blockSize(BLOCK_SIZE, BLOCK_SIZE,1);
    int linGrid = (int)ceil(dim/(float)BLOCK_SIZE);
    dim3 gridSize(linGrid,linGrid,1);

    dim3 cpyBlockSize(BLOCK_SIZE,1,1);
    dim3 cpyGridRowsGridSize((int)ceil(dim/(float)cpyBlockSize.x),1,1);
    dim3 cpyGridColsGridSize((int)ceil((dim+2)/(float)cpyBlockSize.x),1,1);

    // Main game loop
    for (iter = 0; iter<maxIter; iter++) {

        ghostRows<<<cpyGridRowsGridSize, cpyBlockSize>>>(dim, d_grid);
        ghostCols<<<cpyGridColsGridSize, cpyBlockSize>>>(dim, d_grid);
        GOL<<<gridSize, blockSize>>>(dim, d_grid, d_newGrid);

        // Swap our grids and iterate again
        d_tmpGrid = d_grid;
        d_grid = d_newGrid;
        d_newGrid = d_tmpGrid;
    }//iter loop

    // Copy back results and sum
    cudaMemcpy(h_grid, d_grid, bytes, cudaMemcpyDeviceToHost);

    // Sum up alive cells and print results
    int total = 0;
    for (i = 1; i<=dim; i++) {
        for (j = 1; j<=dim; j++) {
            total += h_grid[i*(dim+2)+j];
        }
    }
    printf("Total Alive: %d\n", total);

    // Release memory
    cudaFree(d_grid);
    cudaFree(d_newGrid);
    free(h_grid);

    return 0;
}

Changes

#define BLOCK_SIZE 16

The number of threads per block is a common performance tuning parameter and is often set with a preprocessor define statement.
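
If you want to experiment with the block size without editing the source, one hypothetical approach is to guard the define and override it on the nvcc command line:

#ifndef BLOCK_SIZE
#define BLOCK_SIZE 16
#endif

$ nvcc -DBLOCK_SIZE=32 GOL-CUDA.cu -o GOL-CUDA.out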

__global__ void ghostRows(int dim, int *grid)
{
    // We want id ∈ [1,dim]
    int id = blockDim.x * blockIdx.x + threadIdx.x + 1;
 
    if (id <= dim)
    {
        //Copy first real row to bottom ghost row
        grid[(dim+2)*(dim+1)+id] = grid[(dim+2)+id];
        //Copy last real row to top ghost row
        grid[id] = grid[(dim+2)*dim + id];
    }
}
__global__ void ghostCols(int dim, int *grid)
{
    // We want id ∈ [0,dim+1]
    int id = blockDim.x * blockIdx.x + threadIdx.x;
 
    if (id <= dim+1)
    {
        //Copy first real column to right most ghost column
        grid[id*(dim+2)+dim+1] = grid[id*(dim+2)+1];
       //Copy last real column to left most ghost column
        grid[id*(dim+2)] = grid[id*(dim+2) + dim];
    }
}

The ghostRows and ghostCols kernels are used to fill in our ghost rows and columns in device global memory. Each thread is responsible for copying two integers.

__global__ void GOL(int dim, int *grid, int *newGrid)
{
    // We want id ∈ [1,dim]
    int iy = blockDim.y * blockIdx.y + threadIdx.y + 1;
    int ix = blockDim.x * blockIdx.x + threadIdx.x + 1;
    int id = iy * (dim+2) + ix;
 
    int numNeighbors;
 
    if (iy <= dim && ix <= dim) {
 
        // Get the number of neighbors for a given grid point
        numNeighbors = grid[id+(dim+2)] + grid[id-(dim+2)] //upper lower
                     + grid[id+1] + grid[id-1]             //right left
                     + grid[id+(dim+3)] + grid[id-(dim+3)] //diagonals
                     + grid[id-(dim+1)] + grid[id+(dim+1)];
 
        int cell = grid[id];
        // Here we have explicitly all of the game rules
        if (cell == 1 && numNeighbors < 2)
            newGrid[id] = 0;
        else if (cell == 1 && (numNeighbors == 2 || numNeighbors == 3))
            newGrid[id] = 1;
        else if (cell == 1 && numNeighbors > 3)
            newGrid[id] = 0;
        else if (cell == 0 && numNeighbors == 3)
            newGrid[id] = 1;
        else
            newGrid[id] = cell;
    }
}

The GOL kernel performs the majority of our work. We use a two-dimensional block and grid size to better fit our problem's geometry. Once a thread calculates its global id and tests that it is not out of bounds, it reads its cell value and the surrounding neighbor values from global memory, calculates the new value, and writes the result back to global memory.

int* d_grid; //Grid on device
int* d_newGrid; //Second grid used on device only
int* d_tmpGrid; //tmp grid pointer used to switch between grid and newGrid

We declare our device pointers on the host as we normally would; we prefix the device pointer names with d_ to help separate them from host pointers, which we prefix with h_.

// Allocate device grids
cudaMalloc(&d_grid, bytes);
cudaMalloc(&d_newGrid, bytes);

Both of our device arrays are allocated in device global memory. A one-dimensional array is used for each.

// Copy over initial game grid (Dim-1 threads)
cudaMemcpy(d_grid, h_grid, bytes, cudaMemcpyHostToDevice);

After we have filled h_grid on the host we must copy it over to d_grid which resides in the device global memory.

dim3 blockSize(BLOCK_SIZE, BLOCK_SIZE,1);
dim3 cpyBlockSize(BLOCK_SIZE,1,1);

The CUDA data type dim3 is used to define the number of threads in our block. For the GOL kernel we specify a two-dimensional block size to better suit our problem's geometry; for the copy kernels a simple one-dimensional size is best. Note that any missing dimensions in the constructor are assumed to be 1.

int linGrid = (int)ceil(dim/(float)BLOCK_SIZE);
dim3 gridSize(linGrid,linGrid,1);
dim3 cpyGridRowsGridSize((int)ceil(dim/(float)cpyBlockSize.x),1,1);
dim3 cpyGridColsGridSize((int)ceil((dim+2)/(float)cpyBlockSize.x),1,1);

With our block size specified, we must calculate the smallest integer number of blocks necessary to completely cover our game board; this is also of type dim3.

ghostRows<<<cpyGridRowsGridSize, cpyBlockSize>>>(dim, d_grid);
ghostCols<<<cpyGridColsGridSize, cpyBlockSize>>>(dim, d_grid);
GOL<<<gridSize, blockSize>>>(dim, d_grid, d_newGrid);

Each iteration of the game loop we launch our three kernels, first copying the ghost cells and then performing one game iteration.

// Swap our grids and iterate again
d_tmpGrid = d_grid;
d_grid = d_newGrid;
d_newGrid = d_tmpGrid;

The grid and newGrid device pointers can be switched just as we did with the host pointers in the CPU version.

// Copy back results and sum
cudaMemcpy(h_grid, d_grid, bytes, cudaMemcpyDeviceToHost);

Once our main game loop is complete we must copy the grid from device memory to host memory.

// Release memory
cudaFree(d_grid);
cudaFree(d_newGrid);

Having used cudaMalloc to allocate our device memory we release it with cudaFree.

Compiling

module load cudatoolkit
nvcc GOL-CUDA.cu -o GOL-CUDA.out

GOL-CUDA-Shared.cu

In the first example each thread read nine values from global memory and wrote back one value. The inefficiency of this approach is that each value read from global memory is used not only in calculating a cell's own new value but also by its eight closest neighbors in calculating theirs. Although some global memory access penalties have been alleviated in the latest generation of cards with the addition of an L2 cache, shared memory remains an incredibly important aspect of GPU programming. Shared memory provides a very fast memory space (comparable to register access) that is shared between threads of the same thread block. When multiple threads in a particular thread block access the same global memory item, shared memory may be used to reduce the number of global memory calls. A typical use for shared memory is as follows:

  1. Have each thread read from global memory into shared memory
  2. Synchronize threads to ensure all threads in a given block have finished reading from global memory
  3. Have each thread perform calculations using shared memory values
  4. Have each thread write its result from shared memory back to device global memory

For detailed coverage of shared memory please see the CUDA C Programming guide.
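
Before looking at the full GOL kernel, the following minimal sketch (hypothetical, not part of the GOL code) illustrates the four-step pattern with a simple 1D three-point average; it assumes the kernel is launched with BLOCK threads per block:

#define BLOCK 256

__global__ void smooth(const float *in, float *out, int n)
{
    // Shared memory staging area, one element per thread in the block
    __shared__ float s_in[BLOCK];

    int gid = blockIdx.x*blockDim.x + threadIdx.x;
    int tid = threadIdx.x;

    // 1. Each thread reads one value from global memory into shared memory
    if (gid < n)
        s_in[tid] = in[gid];

    // 2. Wait until every thread in the block has finished loading
    __syncthreads();

    // 3. and 4. Interior threads compute from shared memory values and write
    //    the result back to global memory (block-edge elements are skipped)
    if (tid > 0 && tid < blockDim.x-1 && gid < n-1)
        out[gid] = (s_in[tid-1] + s_in[tid] + s_in[tid+1]) / 3.0f;
}

With this pattern in mind, the full shared-memory GOL code follows.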

#include <stdio.h>
#include <stdlib.h>

#define SRAND_VALUE 1985
#define BLOCK_SIZE_x 32
#define BLOCK_SIZE_y 16

__global__ void ghostRows(int dim, int* grid)
{
    // We want id ∈ [1,dim]
    int id = blockDim.x * blockIdx.x + threadIdx.x + 1;

    if (id <= dim)
    {
        //Copy first real row to bottom ghost row
        grid[(dim+2)*(dim+1)+id] = grid[(dim+2)+id]; 
        //Copy last real row to top ghost row
        grid[id] = grid[(dim+2)*dim + id]; 
    }
}
__global__ void ghostCols(int dim, int* grid)
{
    // We want id ∈ [0,dim+1]
    int id = blockDim.x * blockIdx.x + threadIdx.x;

    if (id <= dim+1)
    {
        //Copy first real column to right most ghost column
        grid[id*(dim+2)+dim+1] = grid[id*(dim+2)+1]; 
        //Copy last real column to left most ghost column
        grid[id*(dim+2)] = grid[id*(dim+2) + dim];                                                  
    }
}

__global__ void GOL(int dim, int *grid, int *newGrid)
{
        int iy = (blockDim.y -2) * blockIdx.y + threadIdx.y;
        int ix = (blockDim.x -2) * blockIdx.x + threadIdx.x;
        int id = iy * (dim+2) + ix;

        int i = threadIdx.y;
        int j = threadIdx.x;
        int numNeighbors;

        // Declare the shared memory on a per block level
        __shared__ int s_grid[BLOCK_SIZE_y][BLOCK_SIZE_x];

       // Copy cells into shared memory
       if (ix <= dim+1 && iy <= dim+1)
           s_grid[i][j] = grid[id];

       //Sync all threads in block
        __syncthreads();

       if (iy <= dim && ix <= dim) {
           if(i != 0 && i !=blockDim.y-1 && j != 0 && j !=blockDim.x-1) {

               // Get the number of neighbors for a given grid point
               numNeighbors = s_grid[i+1][j] + s_grid[i-1][j] //upper lower
                            + s_grid[i][j+1] + s_grid[i][j-1] //right left
                            + s_grid[i+1][j+1] + s_grid[i-1][j-1] //diagonals
                            + s_grid[i-1][j+1] + s_grid[i+1][j-1];

                int cell = s_grid[i][j];

                // Here we have explicitly all of the game rules
                if (cell == 1 && numNeighbors < 2)
                    newGrid[id] = 0;
                else if (cell == 1 && (numNeighbors == 2 || numNeighbors == 3))
                    newGrid[id] = 1;
                else if (cell == 1 && numNeighbors > 3)
                    newGrid[id] = 0;
                else if (cell == 0 && numNeighbors == 3)
                    newGrid[id] = 1;
                else
                    newGrid[id] = cell;
           }
       }

}

int main(int argc, char* argv[])
{
    int i,j,iter;
    int *h_grid; //Grid on host
    int *d_grid; //Grid on device
    int *d_newGrid; //Second grid used on device only
    int *d_tmpGrid; //tmp grid pointer used to switch between grid and newGrid

    int dim = 1024; //Linear dimension of our grid - not counting ghost cells
    int maxIter = 1<<10; //Number of game steps

    size_t bytes = sizeof(int)*(dim+2)*(dim+2);
    // Allocate host Grid used for initial setup and read back from device
    h_grid = (int*)malloc(bytes);

    // Allocate device grids
    cudaMalloc(&d_grid, bytes);
    cudaMalloc(&d_newGrid, bytes);

    // Assign initial population randomly
    srand(SRAND_VALUE);
    for(i = 1; i<=dim; i++) {
        for(j = 1; j<=dim; j++) {
            h_grid[i*(dim+2)+j] = rand() % 2;
        }
    }

    cudaFuncSetCacheConfig(GOL, cudaFuncCachePreferShared);

    // Copy over initial game grid (Dim-1 threads)
    cudaMemcpy(d_grid, h_grid, bytes, cudaMemcpyHostToDevice);

    dim3 blockSize(BLOCK_SIZE_x, BLOCK_SIZE_y,1);
    int linGrid_x = (int)ceil(dim/(float)(BLOCK_SIZE_x-2));
    int linGrid_y = (int)ceil(dim/(float)(BLOCK_SIZE_y-2));
    dim3 gridSize(linGrid_x,linGrid_y,1);

    dim3 cpyBlockSize(BLOCK_SIZE_x,1,1);
    dim3 cpyGridRowsGridSize((int)ceil(dim/(float)cpyBlockSize.x),1,1);
    dim3 cpyGridColsGridSize((int)ceil((dim+2)/(float)cpyBlockSize.x),1,1);

    // Main game loop
    for (iter = 0; iter<maxIter; iter++) {

        ghostRows<<<cpyGridRowsGridSize, cpyBlockSize>>>(dim, d_grid);
        ghostCols<<<cpyGridColsGridSize, cpyBlockSize>>>(dim, d_grid);
        GOL<<<gridSize, blockSize>>>(dim, d_grid, d_newGrid);

        // Swap our grids and iterate again
        d_tmpGrid = d_grid;
        d_grid = d_newGrid;
        d_newGrid = d_tmpGrid;
    }//iter loop

    cudaError_t error = cudaGetLastError();
    if(error != cudaSuccess)
        printf("CUDA error %s\n",cudaGetErrorString(error));

    // Copy back results and sum
    cudaMemcpy(h_grid, d_grid, bytes, cudaMemcpyDeviceToHost);

    // Sum up alive cells and print results
    int total = 0;
    for (i = 1; i<=dim; i++) {
        for (j = 1; j<=dim; j++) {
            total += h_grid[i*(dim+2)+j];
        }
    }
    printf("Total Alive: %d\n", total);

    cudaFree(d_grid);
    cudaFree(d_newGrid);
    free(h_grid);

    return 0;
}

Changes :

#define BLOCK_SIZE_x 32
#define BLOCK_SIZE_y 16

A square block size is not always best performance-wise; for greater flexibility we switch to separate x and y dimensions.

int iy = (blockDim.y -2) * blockIdx.y + threadIdx.y;
int ix = (blockDim.x -2) * blockIdx.x + threadIdx.x;
int id = iy * (dim+2) + ix;

Extra care has to be taken when determining our global thread id for the game kernel. Each thread block contains two ghost rows and two ghost columns, which must be accounted for when calculating the global indexes ix and iy.

int i = threadIdx.y;
int j = threadIdx.x;

We will also define some convenience variables for the block thread indices.

// Declare the shared memory on a per block level
__shared__ int s_grid[BLOCK_SIZE_y][BLOCK_SIZE_x];

The __shared__ qualifier is used in the device kernel to allocate shared memory arrays. The specified size is on a per block basis.

// Copy cells into shared memory
if (ix <= dim+1 && iy <= dim+1)
s_grid[i][j] = grid[id];

Each thread that is within the bounds of our grid, including ghost cells, will read one value from global memory into shared memory.

//Sync all threads in block
__syncthreads();

Before we use any shared memory we need to ensure that each thread in the block has finished reading its value from global memory into shared memory by calling __syncthreads, a thread-block-level barrier. When all threads in a given block have reached the barrier, execution will continue.

if (iy <= dim && ix <= dim) {
    if(i != 0 && i !=blockDim.y-1 && j != 0 && j !=blockDim.x-1) {

Before calculating our new values we must check that our global thread ids, iy and ix, do not extend into, or past, the ghost cells. The local thread ids, i and j, must also be checked so that no calculations are performed in the per-block ghost cells.

// Get the number of neighbors for a given grid point
numNeighbors = s_grid[i+1][j] + s_grid[i-1][j]     //upper lower
            + s_grid[i][j+1] + s_grid[i][j-1]      //right left
            + s_grid[i+1][j+1] + s_grid[i-1][j-1] //diagonals
            + s_grid[i-1][j+1] + s_grid[i+1][j-1];

int cell = s_grid[i][j];

The standard GOL calculations are performed, only now we read all cell values from shared memory instead of global memory.

dim3 blockSize(BLOCK_SIZE_x, BLOCK_SIZE_y,1);
int linGrid_x = (int)ceil(dim/(float)(BLOCK_SIZE_x-2));
int linGrid_y = (int)ceil(dim/(float)(BLOCK_SIZE_y-2));
dim3 gridSize(linGrid_x,linGrid_y,1);

Slight modification is needed to set the correct block and grid size since we are now using two independent dimensions.

GOL-CUDA-Texture.cu

The texture cache example is provided for completeness but is not fully documented. The texture cache is optimized for 2D spatial locality and does not pollute the L1 cache. Some programs may benefit from its smaller cache line compared to the L1 cache.

#include <stdio.h>
#include <stdlib.h>

#define SRAND_VALUE 1985

texture<int,2> gridTex;

__global__ void GOL(int dim, int *newGrid)
{
    int iy = blockDim.y * blockIdx.y + threadIdx.y;
    int ix = blockDim.x * blockIdx.x + threadIdx.x;
    int id = iy * dim + ix;

    int numNeighbors;

    float iyTex = (iy + 0.5f)/dim;
    float ixTex = (ix + 0.5f)/dim;
    float oneTex = 1.0f/dim;

    if(iy < dim && ix < dim)
{
    //Get the number of neighbors for a given grid point
    numNeighbors = tex2D(gridTex, iyTex+oneTex, ixTex) //upper/lower
                 + tex2D(gridTex, iyTex-oneTex, ixTex)
                 + tex2D(gridTex, iyTex, ixTex+oneTex) //right/left
                 + tex2D(gridTex, iyTex, ixTex-oneTex)
                 + tex2D(gridTex, iyTex-oneTex, ixTex-oneTex) //diagonals
                 + tex2D(gridTex, iyTex-oneTex, ixTex+oneTex)
                 + tex2D(gridTex, iyTex+oneTex, ixTex-oneTex) 
                 + tex2D(gridTex, iyTex+oneTex, ixTex+oneTex);

    int cell = tex2D(gridTex, iyTex, ixTex);

    //Here we have explicitly all of the game rules
    if (cell == 1 && numNeighbors < 2)
        newGrid[id] = 0;
    else if (cell == 1 && (numNeighbors == 2 || numNeighbors == 3))
        newGrid[id] = 1;
    else if (cell == 1 && numNeighbors > 3)
        newGrid[id] = 0;
    else if (cell == 0 && numNeighbors == 3)
         newGrid[id] = 1;
    else
       newGrid[id] = cell;

}
}

int main(int argc, char* argv[])
{
    int i,j,iter;
    int* h_grid; //Grid on host
    cudaArray* d_grid; //Grid on device
    int* d_newGrid; //Second grid used on device only

    int dim = 1024; //Linear dimension of our grid - not counting ghost cells
    int maxIter = 1<<10; //Number of game steps

    size_t bytes = sizeof(int)*dim*dim;
    //Allocate host Grid used for initial setup and read back from device
    h_grid = (int*)malloc(bytes);

    //Allocate device grids
    cudaMallocArray(&d_grid, &gridTex.channelDesc, dim, dim);
    cudaMalloc(&d_newGrid, bytes);

    //Assign initial population randomly
    srand(SRAND_VALUE);
    for(i = 0; i<dim; i++) {
        for(j = 0; j<dim; j++) {
            h_grid[i*dim+j] = rand() % 2;
        }
    }

    //Copy over initial game grid (Dim-1 threads)
    cudaMemcpyToArray (d_grid, 0, 0, h_grid, bytes, cudaMemcpyHostToDevice);
    cudaBindTextureToArray(gridTex, d_grid);

    gridTex.normalized = true;
    gridTex.addressMode[0] = cudaAddressModeWrap;
    gridTex.addressMode[1] = cudaAddressModeWrap;

    dim3 dimBlock(8,8);
    int linGrid = (int)ceil(dim/(float)dimBlock.x);
    dim3 dimGrid(linGrid,linGrid);

    //Main game loop
    for (iter = 0; iter<maxIter; iter++) {
        GOL<<<dimGrid,dimBlock>>>(dim, d_newGrid);

        //Swap our grids and iterate again
        cudaMemcpyToArray (d_grid, 0, 0, d_newGrid, bytes, cudaMemcpyDeviceToDevice);
    }//iter loop

    //Copy back results and sum
    cudaMemcpy(h_grid, d_newGrid, bytes, cudaMemcpyDeviceToHost);

    //Sum up alive cells and print results
    int total = 0;
    for (i = 0; i<dim; i++) {
        for (j = 0; j<dim; j++) {
            total += h_grid[i*dim+j];
        }
    }
    printf("Total Alive: %d\n", total);

    cudaFreeArray(d_grid);
    cudaFree(d_newGrid);
    free(h_grid);

    return 0;
}

This tutorial shows a minimal conversion from our vector addition CPU code to PGI CUDA Fortran. Consider this a CUDA Fortran ‘Hello World’. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov

vecAdd.cuf

module kernel
    contains
    ! CUDA kernel. Each thread takes care of one element of c
    attributes(global) subroutine vecAdd_kernel(n, a, b, c)
        integer, value :: n
        real(8), device :: a(n), b(n), c(n)
        integer :: id

        ! Get our global thread ID
        id = (blockidx%x-1)*blockdim%x + threadidx%x

        ! Make sure we do not go out of bounds
        if (id <= n) then
            c(id) = a(id) + b(id)
        endif
    end subroutine vecAdd_kernel
end module kernel

program main
    use cudafor
    use kernel

    type(dim3) :: blockSize, gridSize
    real(8) :: sum
    integer :: i

    ! Size of vectors
    integer :: n = 100000

    ! Host input vectors
    real(8),dimension(:),allocatable :: h_a
    real(8),dimension(:),allocatable :: h_b
    !Host output vector
    real(8),dimension(:),allocatable :: h_c

    ! Device input vectors
    real(8),device,dimension(:),allocatable :: d_a
    real(8),device,dimension(:),allocatable :: d_b
    !Device output vector
    real(8),device,dimension(:),allocatable :: d_c

    ! Allocate memory for each vector on host
    allocate(h_a(n))
    allocate(h_b(n))
    allocate(h_c(n))

    ! Allocate memory for each vector on GPU
    allocate(d_a(n))
    allocate(d_b(n))
    allocate(d_c(n))

    ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    do i=1,n
        h_a(i) = sin(i*1D0)*sin(i*1D0)
        h_b(i) = cos(i*1D0)*cos(i*1D0)
    enddo

    ! Implicit copy of host vectors to device
    d_a = h_a(1:n)
    d_b = h_b(1:n)

    ! Number of threads in each thread block
    blockSize = dim3(1024,1,1)

    ! Number of thread blocks in grid
    gridSize = dim3(ceiling(real(n)/real(blockSize%x)) ,1,1)

    ! Execute the kernel
    call vecAdd_kernel<<<gridSize, blockSize>>>(n, d_a, d_b, d_c)

    ! Implicit copy of device array to host
    h_c = d_c(1:n)

    ! Sum up vector c and print result divided by n, this should equal 1 within error
    sum = 0.0;
    do i=1,n
        sum = sum +  h_c(i)
    enddo
    sum = sum/real(n)
    print *, 'final result: ', sum

    ! Release device memory
    deallocate(d_a)
    deallocate(d_b)
    deallocate(d_c)

    ! Release host memory
    deallocate(h_a)
    deallocate(h_b)
    deallocate(h_c)

end program main

Changes

Kernel

The kernel is the heart of our CUDA code. When we launch a kernel we specify the number of threads per block (blockdim) and the number of blocks per grid (griddim). The total number of threads is (blockdim) * (griddim). Each thread evaluates one copy of the kernel.

module kernel
    contains
    !CUDA kernel. Each thread takes care of one element of c
    attributes(global) subroutine vecAdd_kernel(n, a, b, c)
        integer, value :: n
        real(8), device :: a(n), b(n), c(n)
        integer :: id

        !Get our global thread ID
        id = (blockidx%x-1)*blockdim%x + threadidx%x

        !Make sure we do not go out of bounds
        if (id <= n) then
            c(id) = a(id) + b(id)
        endif
    end subroutine vecAdd_kernel
end module kernel

Let’s take a look at what makes up this simple kernel.

module kernel
    contains
end module kernel

The interface for the kernel subroutine must be explicitly known to the calling program; one way of achieving this is placing the subroutine in a module.

attributes(global) subroutine vecAdd_kernel(n, a, b, c)

The global attribute specifies this is a CUDA kernel, otherwise normal Fortran subroutine syntax is used.

!Get our global thread ID
id = (blockidx%x-1)*blockdim%x + threadidx%x

We calculate the thread's global id using CUDA Fortran provided derived data types. blockidx contains the block's position in the grid, ranging from 1 to griddim. threadidx is the thread's index inside of its associated block, ranging from 1 to blockdim.

if (id <= n) then

Unless we have an array length divisible by our blockdim we will not have the same number of threads launched as valid array components. To avoid overstepping our array we simply test to make sure our global thread ID is less than the length of our array.

c(id) = a(id) + b(id)

The thread ID is used to index the arrays that reside in global device memory. Each thread will load a value from a and b and write the sum to c.

Modules

use cudafor
use kernel

cudafor is a PGI-provided module that supplies some of the CUDA-specific derived types, such as dim3. The kernel module we created must also be included.

Array Pointers

!Host input vectors
real(8),dimension(:),allocatable :: h_a
real(8),dimension(:),allocatable :: h_b
!Host output vector
real(8),dimension(:),allocatable :: h_c

!Device input vectors
real(8),device,dimension(:),allocatable :: d_a
real(8),device,dimension(:),allocatable :: d_b
!Device output vector
real(8),device,dimension(:),allocatable :: d_c

Because the host CPU and GPU have separate memory spaces we must declare two sets of arrays, one for the host and one for the device. Here we use the h_ and d_ prefixes to differentiate them. Notice that the device arrays carry the device attribute.

Allocate device memory

!Allocate memory for each vector on GPU
allocate(d_a(n))
allocate(d_b(n))
allocate(d_c(n))

Given an array with the device attribute, memory will be allocated in the GPU's global memory space.

Copy to device

!Implicit copy of host vectors to device
d_a = h_a(1:n)
d_b = h_b(1:n)

After we initialize the data on the host we must copy it to the device; to do so we can use the assignment syntax above. This is a convenient way of doing things, but care must be taken: the operation can be very time consuming, as it initiates a DMA transfer of data from host memory, over the PCI bus, to device memory.

Thread mapping

!Number of threads in each thread block
blockSize = dim3(1024,1,1)

!number of thread blocks in grid
gridSize = dim3(ceiling(real(n)/real(blockSize%x)) ,1,1)

To launch our kernel we must specify the number of threads per block and the number of blocks in our grid. The maximum value of both depends on the device's compute capability. Normally the block size is chosen based upon kernel memory requirements and the grid size is calculated such that enough threads are launched to cover the kernel domain. dim3 is a CUDA Fortran provided data type that has 3 dimensions; in this case we are dealing with a one dimensional block and grid, so we specify a dimensionality of 1 for the other two dimensions.

Launch kernel

call vecAdd_kernel<<< gridSize, blockSize >>>(n, d_a, d_b, d_c)

We launch our kernel with a modified Fortran subroutine syntax that lets us specify the grid and block sizes. Kernel launches are non-blocking: control returns to the host immediately after the launch.

Copy to host

!Implicit copy of device array to host
h_c = d_c(1:n)

The implicit copy from the device to the host will block until the kernel is completed.

Release device memory

deallocate(d_a)
deallocate(d_b)
deallocate(d_c)

Device memory is deallocated in the same programmatic manner as host allocated arrays.

Compiling

The PGI environment must be set up before compiling:

$ module load cudatoolkit
$ ftn vecAdd.cuf -o vecAdd.out

Running

$ aprun ./vecAdd.out
final result: 1.000000

For this tutorial we will cover two different CUDA Fortran versions of our Game of Life program. The first strictly uses global memory and is a straightforward GPU port of our program; the second introduces the use of shared memory. For a description of the Game of Life and the original CPU code please see the tutorial here. The full source can be viewed or downloaded at the OLCF GitHub. Please direct any questions or comments to help@nccs.gov

GOL.cuf

module kernels
    contains

    attributes(global) subroutine ghostCols_kernel(dim, grid)
        integer,value :: dim
        integer,device,dimension(dim+2,dim+2) :: grid
        integer :: id

        !We want id ∈ [2,dim+1]
        !Get Global thread ID
        id = (blockIdx%x-1) * blockDim%x + threadIdx%x + 1

        if(id < dim+2) then
            grid(id, dim+2) = grid(id, 2) !Copy first real column to right most ghost column
            grid(id, 1) = grid(id, dim+1) !Copy last real column to left most ghost column
        endif

    end subroutine ghostCols_kernel

    attributes(global) subroutine ghostRows_kernel(dim, grid)
        integer,value :: dim
        integer,device,dimension(dim+2,dim+2) :: grid
        integer :: id

        !We want id ∈ [1,dim+2]
        !Get Global thread ID
        id = (blockIdx%x-1) * blockDim%x + threadIdx%x

        if(id <= dim+2) then
            grid(dim+2, id) = grid(2, id) !Copy first real row to bottom ghost row
            grid(1, id) = grid(dim+1 ,id) !Copy last real row to top ghost row
        endif
    end subroutine ghostRows_kernel

    attributes(global) subroutine GOL_kernel(dim, grid, newGrid)
        integer,value :: dim
        integer,device,dimension(dim+2,dim+2) :: grid, newGrid
        integer :: idx, idy, id, numNeighbors, cell

        !Get global thread IDs
        idx = (blockIdx%x-1) * blockDim%x + threadIdx%x + 1
        idy = (blockIdx%y-1) * blockDim%y + threadIdx%y + 1

        if(idx < dim+2 .AND. idy < dim+2) then
            numNeighbors = grid(idy,idx+1) + grid(idy,idx-1) &     !right & left
                         + grid(idy+1,idx) + grid(idy-1,idx) &     !upper & lower
                         + grid(idy+1,idx+1) + grid(idy-1,idx-1) & !diagonals
                         + grid(idy-1,idx+1) + grid(idy+1,idx-1)

            cell = grid(idy,idx)
            !Here we have explicitly all of the game rules
            if(cell == 1 .AND. numNeighbors < 2) then
                newGrid(idy,idx) = 0
            elseif(cell == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
                newGrid(idy,idx) = 1
            elseif(cell == 1 .AND. numNeighbors > 3) then
                newGrid(idy,idx) = 0
            elseif(cell == 0 .AND. numNeighbors == 3) then
                newGrid(idy,idx) = 1
            else
                newGrid(idy,idx) = cell
            endif
        endif

    end subroutine GOL_kernel
end module kernels

program main
    use cudafor
    use kernels

    implicit none

    integer :: i,j,iter,seed(8),numNeighbors,total
    real :: randm
    !Linear game grid dimension
    integer :: dim = 1024
    !Number of game iterations
    integer :: maxIter = 2**10
    integer :: blockDim = 16
    type(dim3) :: cpyBlockSize, cpyColsGridSize, cpyRowsGridSize, blockSize, gridSize

    !Host game grid array
    integer,dimension(:,:),allocatable :: h_grid

    !Device game grid arrays
    integer,device,dimension(:,:),allocatable :: d_grid, d_newGrid

    !Allocate square grid of (dim+2)^2 elements on host, 2 added for ghost cells
    allocate(h_grid(dim+2,dim+2))

    !Allocate grid on device
    allocate(d_grid(dim+2,dim+2))
    allocate(d_newGrid(dim+2,dim+2))

    !Assign initial population randomly
    seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/)
    call random_seed(PUT=seed)
    do j=1,dim
        do i=1,dim
            call random_number(randm)
            h_grid(i,j) = nint(randm)
        enddo
    enddo

    !Implicit copy of host vectors to device
    d_grid = h_grid

    !Set the number of threads per block and blocks per grid
    cpyBlockSize    = dim3(blockDim, 1, 1)
    cpyColsGridSize = dim3(ceiling(real(dim)/real(cpyBlockSize%x)), 1, 1)
    cpyRowsGridSize = dim3(ceiling(real(dim+2)/real(cpyBlockSize%x)), 1, 1)

    blockSize   = dim3(blockDim, blockDim, 1)
    gridSize    = dim3(ceiling(real(dim)/real(blockSize%x)), &
                         ceiling(real(dim)/real(blockSize%y)), 1)

    !Main game loop
    do iter=1,maxIter
        !device pointers are not supported currently so manually switch grid/newGrid
        if(MOD(iter,2) == 1) then
            call ghostCols_kernel<<<cpyColsGridSize, cpyBlockSize>>>(dim, d_grid)
            call ghostRows_kernel<<<cpyRowsGridSize, cpyBlockSize>>>(dim, d_grid)
            call GOL_kernel<<<gridSize,blockSize>>>(dim, d_grid, d_newGrid)
        else
            call ghostCols_kernel<<<cpyColsGridSize, cpyBlockSize>>>(dim, d_newGrid)
            call ghostRows_kernel<<<cpyRowsGridSize, cpyBlockSize>>>(dim, d_newGrid)
            call GOL_kernel<<<gridSize,blockSize>>>(dim, d_newGrid, d_grid)
        endif
    enddo

    !Implicit copy of device vector back to host
    h_grid = d_grid

    !Sum up alive cells and print results
    total = 0
    do j=2,dim+1
        do i=2,dim+1
            total = total + h_grid(i,j)
        enddo
    enddo
    print *, "Total Alive", total

    !Release memory
    deallocate(h_grid)
    deallocate(d_grid)
    deallocate(d_newGrid)

end program

Changes to GOL.cuf

    attributes(global) subroutine ghostCols_kernel(dim, grid)
        integer,value :: dim
        integer,device,dimension(dim+2,dim+2) :: grid
        integer :: id

        !We want id ∈ [2,dim+1]
        !Get Global thread ID
        id = (blockIdx%x-1) * blockDim%x + threadIdx%x + 1

        if(id < dim+2) then
            grid(id, dim+2) = grid(id, 2) !Copy first real column to right most ghost column
            grid(id, 1) = grid(id, dim+1) !Copy last real column to left most ghost column
        endif

    end subroutine ghostCols_kernel

    attributes(global) subroutine ghostRows_kernel(dim, grid)
        integer,value :: dim
        integer,device,dimension(dim+2,dim+2) :: grid
        integer :: id

        !We want id ∈ [1,dim+2]
        !Get Global thread ID
        id = (blockIdx%x-1) * blockDim%x + threadIdx%x

        if(id <= dim+2) then
            grid(dim+2, id) = grid(2, id) !Copy first real row to bottom ghost row
            grid(1, id) = grid(dim+1 ,id) !Copy last real row to top ghost row
        endif
    end subroutine ghostRows_kernel

The ghostRows_kernel and ghostCols_kernel device subroutines are used to fill in our ghost rows and columns in device global memory. Each thread is responsible for copying two integers.

    attributes(global) subroutine GOL_kernel(dim, grid, newGrid)
        integer,value :: dim
        integer,device,dimension(dim+2,dim+2) :: grid, newGrid
        integer :: idx, idy, id, numNeighbors, cell

        !Get global thread IDs
        idx = (blockIdx%x-1) * blockDim%x + threadIdx%x + 1
        idy = (blockIdx%y-1) * blockDim%y + threadIdx%y + 1

        if(idx < dim+2 .AND. idy < dim+2) then
            numNeighbors = grid(idy,idx+1) + grid(idy,idx-1) &     !right & left
                         + grid(idy+1,idx) + grid(idy-1,idx) &     !upper & lower
                         + grid(idy+1,idx+1) + grid(idy-1,idx-1) & !diagonals
                         + grid(idy-1,idx+1) + grid(idy+1,idx-1)

            cell = grid(idy,idx)
            !Here we have explicitly all of the game rules
            if(cell == 1 .AND. numNeighbors < 2) then
                newGrid(idy,idx) = 0
            elseif(cell == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
                newGrid(idy,idx) = 1
            elseif(cell == 1 .AND. numNeighbors > 3) then
                newGrid(idy,idx) = 0
            elseif(cell == 0 .AND. numNeighbors == 3) then
                newGrid(idy,idx) = 1
            else
                newGrid(idy,idx) = cell
            endif
        endif

    end subroutine GOL_kernel

The GOL kernel performs the majority of our work. We use a two dimensional block and grid size to better fit our problem's geometry. Once the thread calculates its global ID and tests that it is not out of bounds, it reads the cell and its eight neighbors from global memory, calculates the new value, and writes the result back to global memory.

    !Host game grid array
    integer,dimension(:,:),allocatable :: h_grid
 
    !Device game grid arrays
    integer,device,dimension(:,:),allocatable :: d_grid, d_newGrid

Currently CUDA Fortran does not support device pointers and so we change our array attribute from pointer to allocatable.

    !Allocate grid on device
    allocate(d_grid(dim+2,dim+2))
    allocate(d_newGrid(dim+2,dim+2))

Given an array with the device attribute, memory will be allocated in the GPU's global memory space.

    !Implicit copy of host vectors to device
    d_grid = h_grid

By setting the device array equal to a host array, a host-to-device DMA copy is implicitly initiated.

    !Set the number of threads per block and blocks per grid
    cpyBlockSize    = dim3(blockDim, 1, 1)
    cpyColsGridSize = dim3(ceiling(real(dim)/real(cpyBlockSize%x)), 1, 1)
    cpyRowsGridSize = dim3(ceiling(real(dim+2)/real(cpyBlockSize%x)), 1, 1)
 
    blockSize   = dim3(blockDim, blockDim, 1)
    gridSize    = dim3(ceiling(real(dim)/real(blockSize%x)), &
                         ceiling(real(dim)/real(blockSize%y)), 1)

With our block size specified we must calculate the smallest integer number of blocks necessary to completely cover our game board; the grid size is also of the CUDA Fortran data type dim3.

    !Main game loop
    do iter=1,maxIter
        !device pointers are not supported currently so manually switch grid/newGrid
        if(MOD(iter,2) == 1) then
            call ghostCols_kernel<<<cpyColsGridSize, cpyBlockSize>>>(dim, d_grid)
            call ghostRows_kernel<<<cpyRowsGridSize, cpyBlockSize>>>(dim, d_grid)
            call GOL_kernel<<<gridSize,blockSize>>>(dim, d_grid, d_newGrid)
        else
            call ghostCols_kernel<<<cpyColsGridSize, cpyBlockSize>>>(dim, d_newGrid)
            call ghostRows_kernel<<<cpyRowsGridSize, cpyBlockSize>>>(dim, d_newGrid)
            call GOL_kernel<<<gridSize,blockSize>>>(dim, d_newGrid, d_grid)
        endif
    enddo

Because CUDA Fortran does not currently support device pointers we cannot simply swap our device grid pointers at the end of each game iteration; instead we switch which arguments we pass to our kernels. This achieves the same effect with a minimal increase in code size.

    !Implicit copy of device vector back to host
    h_grid = d_grid

After our kernel has completed we can implicitly copy the grid array from the device back to the host.

GOL-Shared.cuf

In the first example each thread read nine values from global memory and wrote back one value. The inefficiency of this approach is that each value read from global memory is used not only in calculating that cell's new value but also by its eight closest neighbors in calculating theirs. Although some global memory access penalties have been alleviated in the latest generation of cards with the addition of an L2 cache, you will find shared memory is still an incredibly important aspect of GPU programming. Shared memory provides a very fast memory space (comparable to register access) that is shared between threads of the same thread block. Where multiple threads in a particular thread block access the same global memory item, shared memory may be used to reduce the number of global memory calls. A typical use for shared memory is as follows:

  1. Have each thread read from global memory into shared memory
  2. Synchronize threads to ensure all threads in a given block have finished reading from global memory
  3. Have each thread perform calculations using shared memory values
  4. Have each thread write result from shared memory back to device global memory

For detailed coverage of shared memory please see the CUDA C Programming Guide and CUDA Fortran Programming Guide.

module kernel
    contains

    attributes(global) subroutine ghostCols_kernel(dim, grid)
        integer,value :: dim
        integer,device,dimension(dim+2,dim+2) :: grid
        integer :: id

        !We want id ∈ [2,dim+1]
        !Get Global thread ID
        id = (blockIdx%x-1) * blockDim%x + threadIdx%x + 1

        if(id < dim+2) then
            grid(id, dim+2) = grid(id, 2) !Copy first real column to right most ghost column
            grid(id, 1) = grid(id, dim+1) !Copy last real column to left most ghost column
        endif

    end subroutine ghostCols_kernel

    attributes(global) subroutine ghostRows_kernel(dim, grid)
        integer,value :: dim
        integer,device,dimension(dim+2,dim+2) :: grid
        integer :: id

        !We want id ∈ [1,dim+2]
        !Get Global thread ID
        id = (blockIdx%x-1) * blockDim%x + threadIdx%x

        if(id <= dim+2) then
            grid(dim+2, id) = grid(2, id) !Copy first real row to bottom ghost row
            grid(1, id) = grid(dim+1 ,id) !Copy last real row to top ghost row
        endif
    end subroutine ghostRows_kernel

    attributes(global) subroutine GOL_kernel(dim, grid, newGrid)
        integer,value :: dim
        integer,device,dimension(dim+2,dim+2) :: grid, newGrid
        integer :: idx, idy, id, i, j, numNeighbors, cell

        !Declare the shared memory on a per block level
        integer,shared :: s_grid(16, 16)

        !Get global thread IDs
        idx = (blockIdx%x-1) * (blockDim%x-2) + threadIdx%x
        idy = (blockIdx%y-1) * (blockDim%y-2) + threadIdx%y

        !Get local thread IDs
        i = threadIdx%y
        j = threadIdx%x

        !Copy cells into shared memory
        if (idx <= dim+2 .AND. idy <= dim+2) then
            s_grid(i,j) = grid(idy,idx)
        endif

        !Sync all threads in block
        call syncthreads()

        if (idx < dim+2 .AND. idy < dim+2) then
            if (i /= 1 .AND. i /=blockDim%y .AND. j /=1 .AND. j /=blockDim%x) then

                numNeighbors = s_grid(i,j+1) + s_grid(i,j-1) &     !right & left
                             + s_grid(i+1,j) + s_grid(i-1,j) &     !upper & lower
                             + s_grid(i+1,j+1) + s_grid(i-1,j-1) & !diagonals
                             + s_grid(i-1,j+1) + s_grid(i+1,j-1)

                cell = s_grid(i,j)

                !Here we have explicitly all of the game rules
                if(cell == 1 .AND. numNeighbors < 2) then
                    newGrid(idy,idx) = 0
                elseif(cell == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
                    newGrid(idy,idx) = 1
                elseif(cell == 1 .AND. numNeighbors > 3) then
                    newGrid(idy,idx) = 0
                elseif(cell == 0 .AND. numNeighbors == 3) then
                    newGrid(idy,idx) = 1
                else
                    newGrid(idy,idx) = cell
                endif

            endif
        endif

    end subroutine GOL_kernel
end module kernel

program main
    use cudafor
    use kernel

    implicit none

    integer :: i,j,iter,seed(8),numNeighbors,total
    real :: randm
    !Linear game grid dimension
    integer :: dim = 1024
    !Number of game iterations
    integer :: maxIter = 2**10
    integer :: blockDim = 16
    type(dim3) :: cpyBlockSize, cpyColsGridSize, cpyRowsGridSize, blockSize, gridSize

    !Host game grid array
    integer,dimension(:,:),allocatable :: h_grid

    !Device game grid arrays
    integer,device,dimension(:,:),allocatable :: d_grid, d_newGrid

    !Allocate square grid of (dim+2)^2 elements on host, 2 added for ghost cells
    allocate(h_grid(dim+2,dim+2))

    !Allocate grid on device
    allocate(d_grid(dim+2,dim+2))
    allocate(d_newGrid(dim+2,dim+2))

    !Assign initial population randomly
    seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/)
    call random_seed(PUT=seed)
    do j=1,dim
        do i=1,dim
            call random_number(randm)
            h_grid(i,j) = nint(randm)
        enddo
    enddo

    !Implicit copy of host vectors to device
    d_grid = h_grid

    !Set the number of threads per block and blocks per grid
    cpyBlockSize    = dim3(blockDim, 1, 1)
    cpyColsGridSize = dim3(ceiling(real(dim)/real(cpyBlockSize%x)), 1, 1)
    cpyRowsGridSize = dim3(ceiling(real(dim+2)/real(cpyBlockSize%x)), 1, 1)

    blockSize   = dim3(blockDim-2, blockDim-2, 1)
    gridSize    = dim3(ceiling(real(dim)/real(blockSize%x-2)), &
                    ceiling(real(dim)/real(blockSize%y-2)), 1)

    !Main game loop
    do iter=1,maxIter
        !device pointers are not supported currently, so manually switch grid/newGrid
        if(MOD(iter,2) == 1) then
            call ghostCols_kernel<<<cpyColsGridSize, cpyBlockSize>>>(dim, d_grid)
            call ghostRows_kernel<<<cpyRowsGridSize, cpyBlockSize>>>(dim, d_grid)
            call GOL_kernel<<<gridSize,blockSize>>>(dim, d_grid, d_newGrid)
        else
            call ghostCols_kernel<<<cpyColsGridSize, cpyBlockSize>>>(dim, d_newGrid)
            call ghostRows_kernel<<<cpyRowsGridSize, cpyBlockSize>>>(dim, d_newGrid)
            call GOL_kernel<<<gridSize,blockSize>>>(dim, d_newGrid, d_grid)
        endif
    enddo

    !Implicit copy of device vector back to host
    h_grid = d_grid

    !Sum up alive cells and print results
    total = 0
    do j=2,dim+1
        do i=2,dim+1
            total = total + h_grid(i,j)
        enddo
    enddo
    print *, "Total Alive", total

    !Release memory
    deallocate(h_grid)
    deallocate(d_grid)
    deallocate(d_newGrid)

end program

Changes to GOL-Shared.cuf

!Declare the shared memory on a per block level
integer,shared :: s_grid(16, 16)

The shared attribute is used in the device kernel to allocate shared memory arrays. The specified size is on a per-block basis.

!Get global thread IDs
idx = (blockIdx%x-1) * (blockDim%x-2) + threadIdx%x
idy = (blockIdx%y-1) * (blockDim%y-2) + threadIdx%y

Extra care has to be taken when determining our global thread IDs for the game kernel. Each thread block contains two ghost rows and two ghost columns, which must be accounted for when calculating the global indexes idx and idy.

!Get local thread IDs
i = threadIdx%y
j = threadIdx%x

We define some convenience variables for the block thread indices.

!Copy cells into shared memory
if (idx <= dim+2 .AND. idy <= dim+2) then
    s_grid(i,j) = grid(idy,idx)
endif

Each thread that is within the bounds of our grid, including ghost cells, will read one value from global memory into shared memory.

!Sync all threads in block
call syncthreads()

Before we use any shared memory we need to ensure that each thread in the block has finished reading its value from global memory into shared memory; we do so by calling syncthreads, a thread block level barrier. Once all threads in a given block have reached the barrier, execution resumes.

if (idx < dim+2 .AND. idy < dim+2) then
    if (i /= 1 .AND. i /=blockDim%y .AND. j /=1 .AND. j /=blockDim%x) then

Before calculating our new values we must check that our global thread IDs, idy and idx, do not extend into, or past, the ghost cells. The local thread IDs, i and j, must also be checked so that no calculations are performed in the per-block ghost cells.

numNeighbors = s_grid(i,j+1) + s_grid(i,j-1) &     !right & left
             + s_grid(i+1,j) + s_grid(i-1,j) &     !upper & lower
             + s_grid(i+1,j+1) + s_grid(i-1,j-1) & !diagonals
             + s_grid(i-1,j+1) + s_grid(i+1,j-1)
 
cell = s_grid(i,j)

The standard GOL calculations are performed as before, only now all cell values are read from shared memory instead of global memory.

    blockSize   = dim3(blockDim-2, blockDim-2, 1)
    gridSize    = dim3(ceiling(real(dim)/real(blockSize%x-2)), &
                    ceiling(real(dim)/real(blockSize%y-2)), 1)

A slight modification is needed when setting the block and grid sizes: because each block also loads ghost cells into shared memory, each block computes fewer new cells than it has threads, and the grid size must be calculated from this smaller effective tile size.

This tutorial shows a minimal conversion from our vector addition CPU code to an OpenCL version. Consider this an OpenCL ‘Hello World’. Error handling is not included so that the structure of the code is more digestible. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov

vecAdd.c

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <CL/opencl.h>

// OpenCL kernel. Each work item takes care of one element of c
const char *kernelSource =                                       "\n" \
"#pragma OPENCL EXTENSION cl_khr_fp64 : enable                    \n" \
"__kernel void vecAdd(  __global double *a,                       \n" \
"                       __global double *b,                       \n" \
"                       __global double *c,                       \n" \
"                       const unsigned int n)                    \n" \
"{                                                               \n" \
"    //Get our global thread ID                                  \n" \
"    int id = get_global_id(0);                                  \n" \
"                                                                \n" \
"    //Make sure we do not go out of bounds                      \n" \
"    if (id < n)                                                 \n" \
"        c[id] = a[id] + b[id];                                  \n" \
"}                                                               \n" \
                                                                "\n" ;

int main( int argc, char* argv[] )
{
    // Length of vectors
    unsigned int n = 100000;

    // Host input vectors
    double *h_a;
    double *h_b;
    // Host output vector
    double *h_c;

    // Device input buffers
    cl_mem d_a;
    cl_mem d_b;
    // Device output buffer
    cl_mem d_c;

    cl_platform_id cpPlatform;        // OpenCL platform
    cl_device_id device_id;           // device ID
    cl_context context;               // context
    cl_command_queue queue;           // command queue
    cl_program program;               // program
    cl_kernel kernel;                 // kernel

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector on host
    h_a = (double*)malloc(bytes);
    h_b = (double*)malloc(bytes);
    h_c = (double*)malloc(bytes);

    // Initialize vectors on host
    int i;
    for( i = 0; i < n; i++ )
    {
        h_a[i] = sinf(i)*sinf(i);
        h_b[i] = cosf(i)*cosf(i);
    }

    size_t globalSize, localSize;
    cl_int err;

    // Number of work items in each local work group
    localSize = 64;

    // Number of total work items - localSize must be a divisor
    globalSize = ceil(n/(float)localSize)*localSize;

    // Bind to platform
    err = clGetPlatformIDs(1, &cpPlatform, NULL);

    // Get ID for the device
    err = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);

    // Create a context  
    context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);

    // Create a command queue 
    queue = clCreateCommandQueue(context, device_id, 0, &err);

    // Create the compute program from the source buffer
    program = clCreateProgramWithSource(context, 1,
                            (const char **) & kernelSource, NULL, &err);

    // Build the program executable 
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    // Create the compute kernel in the program we wish to run
    kernel = clCreateKernel(program, "vecAdd", &err);

    // Create the input and output arrays in device memory for our calculation
    d_a = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
    d_b = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
    d_c = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

    // Write our data set into the input array in device memory
    err = clEnqueueWriteBuffer(queue, d_a, CL_TRUE, 0,
                                   bytes, h_a, 0, NULL, NULL);
    err |= clEnqueueWriteBuffer(queue, d_b, CL_TRUE, 0,
                                   bytes, h_b, 0, NULL, NULL);

    // Set the arguments to our compute kernel
    err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_b);
    err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_c);
    err |= clSetKernelArg(kernel, 3, sizeof(unsigned int), &n);

    // Execute the kernel over the entire range of the data set  
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize,
                                                              0, NULL, NULL);

    // Wait for the command queue to get serviced before reading back results
    clFinish(queue);

    // Read the results from the device
    clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0,
                                bytes, h_c, 0, NULL, NULL );

    //Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);

    // release OpenCL resources
    clReleaseMemObject(d_a);
    clReleaseMemObject(d_b);
    clReleaseMemObject(d_c);
    clReleaseProgram(program);
    clReleaseKernel(kernel);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);

    //release host memory
    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}

Changes

Kernel:
The kernel is the heart of our OpenCL code. The entire kernel must eventually be read in as a C string; the easiest way for a small program like this is to wrap quotes and line returns around the entire kernel. In a real program you would more than likely read the kernel in from a separate file, for example with a small loader like the one sketched below.
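
A minimal, illustrative sketch of such a loader follows; the file name vecAdd.cl and the helper name loadKernelSource are assumptions made for this example, and error checking is kept to a minimum as in the rest of this tutorial.

#include <stdio.h>
#include <stdlib.h>

// Illustrative helper: read an entire kernel source file (e.g. vecAdd.cl)
// into a null-terminated string suitable for clCreateProgramWithSource
char* loadKernelSource(const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (!fp) return NULL;

    fseek(fp, 0, SEEK_END);
    long size = ftell(fp);
    rewind(fp);

    char *source = (char*)malloc(size + 1);
    fread(source, 1, size, fp);
    source[size] = '\0';   // clCreateProgramWithSource expects a C string

    fclose(fp);
    return source;
}

The returned string would then be passed to clCreateProgramWithSource in place of kernelSource.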

// OpenCL kernel. Each work item takes care of one element of c
const char *kernelSource =                                      "\n" \
"#pragma OPENCL EXTENSION cl_khr_fp64 : enable                    \n" \
"__kernel void vecAdd(  __global double *a,                       \n" \
"                       __global double *b,                       \n" \
"                       __global double *c,                       \n" \
"                       const unsigned int n)                    \n" \
"{                                                               \n" \
"    //Get our global thread ID                                  \n" \
"    int id = get_global_id(0);                                  \n" \
"                                                                \n" \
"    //Make sure we do not go out of bounds                      \n" \
"    if (id < n)                                                 \n" \
"        c[id] = a[id] + b[id];                                  \n" \
"}                                                               \n" \
                                                                "\n" ;

Let’s take a look at what makes up this simple kernel.

__kernel void vecAdd(  __global double *a, __global double *b,
                       __global double *c, const unsigned int n)

The __kernel decorator specifies that this is an OpenCL kernel and the __global decorator specifies that the pointer refers to the global device memory space; otherwise normal C function syntax is used. The kernel must have return type void.

int id = get_global_id(0); 

Here we grab the global work item id for dimension 0.

if (id < n)
    c[id] = a[id] + b[id];

There must be an integer number of work groups; put another way, the number of work items in each work group must be a divisor of the global number of work items. Since the work group size is used to tune performance and will not necessarily be a divisor of the total number of threads needed, it is common to launch more threads than are needed and ignore the extras. In this example, with n = 100000 and a work group size of 64, globalSize is rounded up to 100032, so the last 32 work items fail the bounds check and do nothing. After we check that we are inside of the problem domain we can access and manipulate the device memory.

Memory:

// Host input vectors
double *h_a;
double *h_b;
// Host output vector
double *h_c;
 
// Device input buffers
cl_mem d_a;
cl_mem d_b;
// Device output buffer
cl_mem d_c;

Because the host CPU and GPU have separate memory spaces we must maintain two separate references to our memory: one set of host array pointers and one set of device array memory handles. Here we use the h_ and d_ prefixes to differentiate them.

Thread Mapping:

// Number of work items in each local work group
localSize = 64;
 
// Number of total work items - localSize must be a divisor
globalSize = ceil(n/(float)localSize)*localSize;

To map our problem onto the underlying hardware we must specify a local and a global integer size. The local size defines the number of work items in a work group; on an NVIDIA GPU this is equivalent to the number of threads in a thread block. The global size is the total number of work items launched. localSize must be a divisor of globalSize, so we calculate the smallest integer that covers our problem domain and is divisible by localSize.
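
The same rounding can be done in pure integer arithmetic, avoiding the float conversion and ceil call used in the example; the small stand-alone sketch below simply mirrors the values used here and prints the resulting globalSize.

#include <stdio.h>

int main(void)
{
    // Values mirroring this example: 100000 elements, work groups of 64
    size_t n = 100000;
    size_t localSize = 64;

    // Round n up to the nearest multiple of localSize using integer arithmetic
    size_t globalSize = ((n + localSize - 1) / localSize) * localSize;

    printf("globalSize = %zu\n", globalSize);   // prints 100032
    return 0;
}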

OpenCL Setup:

// Bind to platform
err = clGetPlatformIDs(1, &cpPlatform, NULL);

Each hardware vendor provides its own platform, which must be bound to before use; here clGetPlatformIDs is asked for a single platform ID, which is stored in cpPlatform. For example, if a system contains an AMD CPU and an NVIDIA GPU, each with the appropriate OpenCL drivers installed, two OpenCL platforms will be available.
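
If you want to see every platform available on a node rather than just the first, the usual pattern is to query the count and then retrieve all of the IDs. A minimal, illustrative sketch follows; as elsewhere in this tutorial, error handling is omitted.

#include <stdio.h>
#include <stdlib.h>
#include <CL/opencl.h>

int main(void)
{
    cl_uint numPlatforms;
    clGetPlatformIDs(0, NULL, &numPlatforms);          // ask how many platforms exist

    cl_platform_id *platforms = (cl_platform_id*)malloc(numPlatforms * sizeof(cl_platform_id));
    clGetPlatformIDs(numPlatforms, platforms, NULL);   // retrieve all platform IDs

    cl_uint i;
    for (i = 0; i < numPlatforms; i++) {
        char name[256];
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(name), name, NULL);
        printf("Platform %u: %s\n", i, name);
    }

    free(platforms);
    return 0;
}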

// Get ID for the device
err = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);

The platform can be queried to find what specific devices it contains. In this case we query the platform for GPU devices by using the CL_DEVICE_TYPE_GPU enumerated value.

// Create a context
context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);

Before using an OpenCL device a context must be set up. The context will be used to manage command queues, memory, and kernel activity. A context can contain more than one device.

// Create a command queue
queue = clCreateCommandQueue(context, device_id, 0, &err);

The command queue is used to stream commands from the host to the specified device. Memory transfers and kernel launches can be pushed onto the command queue, where they will be executed on the device when possible.

Compile Kernel:

// Create the compute program from the source buffer
program = clCreateProgramWithSource(context, 1,
                        (const char **) & kernelSource, NULL, &err);
 
// Build the program executable
clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
 
// Create the compute kernel in the program we wish to run
kernel = clCreateKernel(program, "vecAdd", &err);

To ensure that OpenCL code is portable to many devices, the default way to run kernels is with just-in-time, or JIT, compilation: we must prepare the source code for the device(s) in a given context. First we create our program, which is a set of kernel code, and then from that program we create the individual kernels. Because the kernel source is compiled at run time, any compile error only surfaces when clBuildProgram executes; the build log can then be retrieved as sketched below.
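
A minimal, illustrative helper for printing that build log is sketched here; the function name printBuildLog is an assumption made for this example, while clGetProgramBuildInfo and CL_PROGRAM_BUILD_LOG are standard OpenCL.

#include <stdio.h>
#include <stdlib.h>
#include <CL/opencl.h>

// Illustrative helper: print the compiler output for a program/device pair.
// Useful whenever clBuildProgram does not return CL_SUCCESS.
void printBuildLog(cl_program program, cl_device_id device)
{
    size_t logSize;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &logSize);

    char *log = (char*)malloc(logSize);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, logSize, log, NULL);

    fprintf(stderr, "OpenCL build log:\n%s\n", log);
    free(log);
}

In the example above it would be called with the existing program and device_id handles immediately after clBuildProgram.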

Prepare data on device:

// Create the input and output arrays in device memory for our calculation
d_a = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
d_b = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
d_c = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);
 
// Write our data set into the input array in device memory
err = clEnqueueWriteBuffer(queue, d_a, CL_TRUE, 0,
                               bytes, h_a, 0, NULL, NULL);
err |= clEnqueueWriteBuffer(queue, d_b, CL_TRUE, 0,
                               bytes, h_b, 0, NULL, NULL);
 
// Set the arguments to our compute kernel
err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_b);
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_c);
err |= clSetKernelArg(kernel, 3, sizeof(unsigned int), &n);

Before launching our kernel we must create the device buffers, write the host data into them, and finally set the kernel arguments.

Launch Kernel:

// Execute the kernel over the entire range of the data set
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize,
                                                          0, NULL, NULL);

Once all of the memory resides on the device the kernel can be enqueued to be launched.

Copy results to host:

// Wait for the command queue to get serviced before reading back results
clFinish(queue);
 
// Read the results from the device
clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0,
                            bytes, h_c, 0, NULL, NULL );

We can block until the command queue is cleared and then read back the device results to the host.

Compiling

$ module load cudatoolkit
$ cc -lOpenCL vecAdd.c -o vecAdd.out

Running

$ aprun ./vecAdd.out
final result: 1.000000

vecAdd.cc

C++ bindings are commonly used for OpenCL development and offer an interface that is somewhat more streamlined than the standard C interface. An example of the vector addition code written using these bindings is given below.

#define __CL_ENABLE_EXCEPTIONS

#include "cl.hpp"
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <math.h>

// OpenCL kernel. Each work item takes care of one element of c
const char *kernelSource =                                      "\n" \
"#pragma OPENCL EXTENSION cl_khr_fp64 : enable                    \n" \
"__kernel void vecAdd(  __global double *a,                       \n" \
"                       __global double *b,                       \n" \
"                       __global double *c,                       \n" \
"                       const unsigned int n)                    \n" \
"{                                                               \n" \
"    //Get our global thread ID                                  \n" \
"    int id = get_global_id(0);                                  \n" \
"                                                                \n" \
"    //Make sure we do not go out of bounds                      \n" \
"    if (id < n)                                                 \n" \
"        c[id] = a[id] + b[id];                                  \n" \
"}                                                               \n" \
                                                                "\n" ;


int main(int argc, char *argv[])
{

    // Length of vectors
    unsigned int n = 1000;

    // Host input vectors
    double *h_a;
    double *h_b;
    // Host output vector
    double *h_c;

    // Device input buffers
    cl::Buffer d_a;
    cl::Buffer d_b;
    // Device output buffer
    cl::Buffer d_c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector on host
    h_a = new double[n];
    h_b = new double[n];
    h_c = new double[n];

    // Initialize vectors on host
    for(int i = 0; i < n; i++ )
    {
        h_a[i] = sinf(i)*sinf(i);
        h_b[i] = cosf(i)*cosf(i);
    }

    cl_int err = CL_SUCCESS;
    try {

        // Query platforms
        std::vector<cl::Platform> platforms;
        cl::Platform::get(&platforms);
        if (platforms.size() == 0) {
            std::cout << "Platform size 0\n";
            return -1;
         }

        // Get list of devices on default platform and create context
        cl_context_properties properties[] =
           { CL_CONTEXT_PLATFORM, (cl_context_properties)(platforms[0])(), 0};
        cl::Context context(CL_DEVICE_TYPE_GPU, properties);
        std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

        // Create command queue for first device
        cl::CommandQueue queue(context, devices[0], 0, &err);

        // Create device memory buffers
        d_a = cl::Buffer(context, CL_MEM_READ_ONLY, bytes);
        d_b = cl::Buffer(context, CL_MEM_READ_ONLY, bytes);
        d_c = cl::Buffer(context, CL_MEM_WRITE_ONLY, bytes);

        // Bind memory buffers
        queue.enqueueWriteBuffer(d_a, CL_TRUE, 0, bytes, h_a);
        queue.enqueueWriteBuffer(d_b, CL_TRUE, 0, bytes, h_b);

        //Build kernel from source string
        cl::Program::Sources source(1,
            std::make_pair(kernelSource,strlen(kernelSource)));
        cl::Program program_ = cl::Program(context, source);
        program_.build(devices);

        // Create kernel object
        cl::Kernel kernel(program_, "vecAdd", &err);

        // Bind kernel arguments to kernel
        kernel.setArg(0, d_a);
        kernel.setArg(1, d_b);
        kernel.setArg(2, d_c);
        kernel.setArg(3, n);

        // Number of work items in each local work group
        cl::NDRange localSize(64);
        // Number of total work items - localSize must be a divisor
        cl::NDRange globalSize((int)(ceil(n/(float)64)*64));

        // Enqueue kernel
        cl::Event event;
        queue.enqueueNDRangeKernel(
            kernel,
            cl::NullRange,
            globalSize,
            localSize,
            NULL,
            &event);

        // Block until kernel completion
        event.wait();

        // Read back d_c
        queue.enqueueReadBuffer(d_c, CL_TRUE, 0, bytes, h_c);
        }
    catch (cl::Error err) {
         std::cerr
            << "ERROR: "<<err.what()<<"("<<err.err()<<")"<<std::endl;
    }

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0;
    for(int i=0; i<n; i++)
        sum += h_c[i];
    std::cout<<"final result: "<<sum/n<<std::endl;

    // Release host memory
    delete [] h_a;
    delete [] h_b;
    delete [] h_c;

    return 0;
}

Compiling and Running, C++

To compile you will first need to download the OpenCL C++ header file cl.hpp.

$ module load cudatoolkit
$ CC vecAdd.cc -lOpenCL -o vecAdd.out
$ aprun ./vecAdd.out
final result: 1.000000