Several GPU-accelerated libraries are provided on OLCF systems. Usage of the most common accelerated libraries is outlined below.

MAGMA

The MAGMA project aims to develop a dense linear algebra library similar to LAPACK, but for heterogeneous/hybrid architectures. For C and Fortran codes currently using LAPACK, porting should be relatively simple and requires no CUDA knowledge.

Using MAGMA

This module is currently only compatible with the GNU programming environment:

$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit magma

To link in the MAGMA library while on Titan:

$ cc -lcuda -lmagma source.c

Linking in MAGMA on Rhea is a bit different because the Titan compiler wrapper takes care of some of the extra flags. For example:

$ nvcc $MAGMA_INC $MAGMA_LIB -L$ACML_DIR/gfortran64/lib -lmagma -lcublas -lacml source.c

This also requires that a BLAS library, such as ACML, be loaded:

$ module load acml
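
As a minimal sketch of how closely MAGMA mirrors LAPACK (assuming a MAGMA 1.x installation; the file name and test matrix are illustrative), the following replaces a LAPACK dgetrf call with magma_dgetrf and contains no CUDA code:

magma_getrf.c

#include <stdio.h>
#include <stdlib.h>
#include "magma.h"

int main(int argc, char* argv[])
{
    magma_init();

    // Factor a small, diagonally dominant test matrix
    magma_int_t i, j, info;
    magma_int_t n = 1024, lda = 1024;
    magma_int_t *ipiv = (magma_int_t*)malloc(n*sizeof(magma_int_t));
    double *A = (double*)malloc(lda*n*sizeof(double));

    for(j=0; j<n; j++)
        for(i=0; i<lda; i++)
            A[j*lda + i] = (i==j) ? n : 1.0;

    // Same argument list as LAPACK dgetrf, but MAGMA performs the
    // factorization on the GPU and manages the data movement
    magma_dgetrf(n, n, A, lda, ipiv, &info);
    printf("magma_dgetrf info: %d\n", (int)info);

    free(ipiv);
    free(A);
    magma_finalize();
    return 0;
}

This compiles and links with the cc line shown above.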

MAGMA Resources

For a comprehensive user manual, please see the MAGMA Documentation. A knowledgeable MAGMA User Forum is also available for personalized help. To see MAGMA in action, see the following two PGI articles that include full example code of MAGMA usage with PGI Fortran: Using MAGMA With PGI Fortran and Using GPU-enabled Math Libraries with PGI Fortran.

cuBLAS/cuSPARSE

cuBLAS and cuSPARSE are NVIDIA-provided GPU BLAS routines optimized for dense and sparse use, respectively. If your program currently uses BLAS routines, integration should be straightforward and minimal CUDA knowledge is needed. Although primarily designed for use in C/C++ code, Fortran bindings are available.

Using cuBLAS/cuSPARSE

cuBLAS is accessed through the cublas header and linked against the cublas library; cuSPARSE similarly uses the cusparse header and library:

$ module load cudatoolkit
$ cc -lcublas source.c
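
As a minimal sketch of moving a host BLAS call to the GPU (the file name is illustrative), the example below performs a DAXPY with the cuBLAS version 2 API; the only CUDA calls needed are for device memory management:

cublas_axpy.c

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"

int main(int argc, char* argv[])
{
    int i, n = 1000;
    double alpha = 2.0;

    // Host vectors
    double *x = (double*)malloc(n*sizeof(double));
    double *y = (double*)malloc(n*sizeof(double));
    for(i=0; i<n; i++) {
        x[i] = 1.0;
        y[i] = 2.0;
    }

    // Device vectors
    double *d_x, *d_y;
    cudaMalloc((void**)&d_x, n*sizeof(double));
    cudaMalloc((void**)&d_y, n*sizeof(double));

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Copy operands to the device
    cublasSetVector(n, sizeof(double), x, 1, d_x, 1);
    cublasSetVector(n, sizeof(double), y, 1, d_y, 1);

    // y = alpha*x + y, computed on the GPU
    cublasDaxpy(handle, n, &alpha, d_x, 1, d_y, 1);

    // Retrieve the result
    cublasGetVector(n, sizeof(double), d_y, 1, y, 1);
    printf("y[0] is: %f, should be: %f\n", y[0], 4.0);

    free(x);
    free(y);
    cudaFree(d_x);
    cudaFree(d_y);
    cublasDestroy(handle);
    return 0;
}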

cuBLAS/cuSPARSE Resources

The CUBLAS and CUSPARSE user guides are available to download from NVIDIA; these guides provide complete function listings as well as example code. The NVIDIA SDK provides sample code and can be accessed using the instructions below. An example of using CUBLAS with PGI Fortran is available in Using GPU-enabled Math Libraries with PGI Fortran.

Running the examples:

Obtain an interactive job and load the appropriate modules:

$ qsub -I -A[projID] -lwalltime=00:30:00,nodes=1
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit nvidia-sdk

Copy the example files:

$ cd $MEMBERWORK/[projID]
$ cp -r $NVIDIA_SDK_PATH/CUDALibraries .
$ cd CUDALibraries

Now each example can be executed:

$ cd bin/linux/release
$ aprun simpleCUBLAS

LibSciACC

Cray’s LibSciACC provides GPU-enabled BLAS and LAPACK routines. LibSciACC provides two interfaces, automatic and manual. The automatic interface is largely transparent to the programmer: LibSciACC determines whether a call is likely to benefit from GPU acceleration and, if so, takes care of accelerating the routine and the associated memory management. The manual interface provides an API to manage accelerator resources, giving more control to the programmer.

Using LibSciACC

It is recommended that the craype-accel-nvidia35 module be used to manage LibSciACC. The LibSciACC automatic interface is currently compatible with the Cray and GNU programming environments:

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35

LibSciACC will automatically be linked in when using the Cray-provided compiler wrappers:

$ cc source.c
$ ftn source.f90
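
As a sketch of the automatic interface (the file name and matrix contents are illustrative), the standard BLAS call below requires no LibSciACC-specific code; when compiled as above, LibSciACC decides at run time whether to execute it on the GPU:

libsci_acc_example.c

#include <stdio.h>
#include <stdlib.h>

// Standard Fortran BLAS prototype; the Cray compiler wrappers
// resolve this symbol from LibSci/LibSciACC
void dgemm_(const char *transa, const char *transb,
            const int *m, const int *n, const int *k,
            const double *alpha, const double *A, const int *lda,
            const double *B, const int *ldb,
            const double *beta, double *C, const int *ldc);

int main(int argc, char* argv[])
{
    int i, n = 1000;
    double alpha = 1.0, beta = 0.0;

    double *A = (double*)malloc(n*n*sizeof(double));
    double *B = (double*)malloc(n*n*sizeof(double));
    double *C = (double*)malloc(n*n*sizeof(double));
    for(i=0; i<n*n; i++) {
        A[i] = 1.0;
        B[i] = 2.0;
        C[i] = 0.0;
    }

    // An ordinary DGEMM call: the automatic interface determines
    // whether to accelerate it and handles the memory management
    dgemm_("N", "N", &n, &n, &n, &alpha, A, &n, B, &n, &beta, C, &n);

    printf("C[0] is: %f, should be: %f\n", C[0], 2.0*n);

    free(A);
    free(B);
    free(C);
    return 0;
}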

LibSciACC Resources

The man page intro_libsci_acc provides detailed usage information. The environment variable $LIBSCI_ACC_EXAMPLES_DIR specifies a directory containing several C and Fortran example codes.

cuFFT

CUFFT is a set of optimized GPU fast Fourier transform routines provided by NVIDIA as part of the CUDA toolkit. The CUFFT library provides an API similar to FFTW for managing accelerated FFTs. The CUFFTW interface provides an FFTW3-compatible interface to CUFFT to aid in porting existing applications.

Using cuFFT

The cudatoolkit module will append the include and library directories required by CUFFT. When using NVCC or the GNU programming environment, the library can then be linked in:

$ module load cudatoolkit
$ cc -lcufft source.c
$ nvcc -lcufft source.c
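
The sketch below (file name illustrative) shows the basic CUFFT plan/execute/destroy pattern for a 1D double-complex transform; as with FFTW, the inverse transform is unnormalized:

cufft_example.c

#include <stdio.h>
#include <cuda_runtime.h>
#include <cufft.h>

int main(int argc, char* argv[])
{
    int nx = 256;

    // Device buffer transformed in place; a real code would fill
    // it with cudaMemcpy from host data
    cufftDoubleComplex *d_data;
    cudaMalloc((void**)&d_data, nx*sizeof(cufftDoubleComplex));
    cudaMemset(d_data, 0, nx*sizeof(cufftDoubleComplex));

    // Create a 1D double-complex plan, analogous to an FFTW plan
    cufftHandle plan;
    cufftPlan1d(&plan, nx, CUFFT_Z2Z, 1);

    // Forward transform followed by the (unnormalized) inverse
    cufftExecZ2Z(plan, d_data, d_data, CUFFT_FORWARD);
    cufftExecZ2Z(plan, d_data, d_data, CUFFT_INVERSE);

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}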

cuFFT Resources

NVIDIA provides comprehensive documentation, including example code, available here. For an example of using CUFFT with Fortran through the ISO_C_BINDING interface, please see the following example. The OLCF provides an OpenACC and CUFFT interoperability tutorial.

cuRAND

CURAND is an NVIDIA-provided random number generation library. CURAND provides both a host-launched interface and a device interface that is inlined into user kernels. Multiple pseudorandom and quasirandom algorithms are supported.

Using cuRAND

The cudatoolkit module will append the include and library directories required by CURAND. When using NVCC or the GNU programming environment, the library can then be linked in:

$ module load cudatoolkit
$ module switch PrgEnv-pgi PrgEnv-gnu
$ cc -lcurand source.c
$ nvcc -lcurand source.cu
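
The sketch below (file name illustrative) uses the host-launched interface to fill a device buffer with uniformly distributed doubles:

curand_example.c

#include <cuda_runtime.h>
#include <curand.h>

int main(int argc, char* argv[])
{
    size_t n = 1000000;

    // Device buffer to hold the random numbers
    double *d_data;
    cudaMalloc((void**)&d_data, n*sizeof(double));

    // Create and seed a host-launched pseudorandom generator
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

    // Generate n doubles on the device, uniform on (0,1]
    curandGenerateUniformDouble(gen, d_data, n);

    curandDestroyGenerator(gen);
    cudaFree(d_data);
    return 0;
}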

cuRAND Resources

NVIDIA provides comprehensive documentation, including example code, available here. For an example of using the CURAND host library, please see the following Accelerator Interoperability Tutorial.

Thrust

Thrust is a CUDA-accelerated C++ template library modeled after the Standard Template Library (STL). Thrust provides a high-level host interface for GPU data management as well as an assortment of accelerated algorithms. Even if your application does not currently use the STL, the easy access to many optimized accelerated algorithms that Thrust provides is worth a look.

Using Thrust

The cudatoolkit module will append the include directories required by Thrust; since Thrust is a header-only template library, no additional library needs to be linked in:

$ module load cudatoolkit
$ module switch PrgEnv-pgi PrgEnv-gnu
$ CC source.cpp
$ nvcc source.cu
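
The sketch below (file name illustrative, compiled with nvcc as above) fills a device vector and reduces it on the GPU; no explicit memory management or kernel code is required:

thrust_example.cu

#include <stdio.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/reduce.h>

int main(int argc, char* argv[])
{
    int n = 1000;

    // device_vector manages the GPU allocation and transfers
    thrust::device_vector<int> d_vec(n);

    // Fill with 0,1,...,n-1 and sum on the GPU
    thrust::sequence(d_vec.begin(), d_vec.end());
    int sum = thrust::reduce(d_vec.begin(), d_vec.end());

    printf("Sum is: %d, should be: %d\n", sum, n*(n-1)/2);
    return 0;
}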

Thrust Resources

NVIDIA provides comprehensive documentation, including example code, available here. For an example of using Thrust, please see the following Accelerator Interoperability II Tutorial. The GitHub page allows access to the Thrust source code, examples, and information on how to obtain help.

To fully exploit the computational power of the GPU, a large amount of data parallelism must generally be expressed. In the specific case of accelerated libraries such as cuBLAS, cuFFT, and cuSPARSE, if each operation does not possess a sufficient amount of data parallelism, another option is to batch many smaller operations into a single large operation. This tutorial demonstrates how to take advantage of batched cuBLAS operations to improve GPU utilization. Additionally, this tutorial expands on the GPU concurrency topics from the first tutorial through the use of streams and Hyper-Q. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.


C

Note: cuBLAS uses column major ordering and 1-based indexing. The following examples utilize square diagonal matrices for simplicity and do not directly address this issue.

A Serial Example in C

The following example performs a single DGEMM operation using the cuBLAS version 2 library. This example will be expanded to show the use of batched cuBLAS calls under a variety of situations. The form of our sample DGEMM operation is as follows:

C = alpha*A*B + beta*C

cublas.c

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"

int main(int argc, char* argv[])
{
    // Linear dimension of matrices
    int dim = 100;

    // Allocate host storage for A,B,C square matrices
    double *A, *B, *C;
    A = (double*)malloc(dim*dim*sizeof(double));
    B = (double*)malloc(dim*dim*sizeof(double));
    C = (double*)malloc(dim*dim*sizeof(double));

    // Allocate device storage for A,B,C
    double *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, dim*dim*sizeof(double));
    cudaMalloc((void**)&d_B, dim*dim*sizeof(double));
    cudaMalloc((void**)&d_C, dim*dim*sizeof(double));

    // Fill A,B diagonals with sin(i) data, C diagonal with cos(i)^2
    // Matrices are arranged column major
    int i,j,index;
    for(j=0; j<dim; j++) {
        for(i=0; i<dim; i++) {
            index = j*dim + i;
            if(i==j) {
                A[index] = sin(index);
                B[index] = sin(index);
                C[index] = cos(index)*cos(index);
            }
            else {
                A[index] = 0.0;
                B[index] = 0.0;
                C[index] = 0.0;
            }
        }    
    }

    // Create cublas instance
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Set input matrices on device
    cublasSetMatrix(dim, dim, sizeof(double), A, dim, d_A, dim);
    cublasSetMatrix(dim, dim, sizeof(double), B, dim, d_B, dim);
    cublasSetMatrix(dim, dim, sizeof(double), C, dim, d_C, dim);

    // DGEMM: C = alpha*A*B + beta*C
    double alpha = 1.0;
    double beta  = 1.0;

    cublasDgemm(handle,
                CUBLAS_OP_N, CUBLAS_OP_N,
                dim, dim, dim,
                &alpha,
                d_A, dim,
                d_B, dim,
                &beta,
                d_C, dim);

    // Retrieve result matrix from device
    cublasGetMatrix(dim, dim, sizeof(double), d_C, dim, C, dim);

    // Simple sanity test, sum up all elements
    double sum = 0;
    for(j=0; j<dim; j++) {
        for(i=0; i<dim; i++) {
            index = j*dim + i;
            sum += C[index];
        }
    }
    printf("Sum is: %f, should be: %d\n", sum, dim);   

    // Clean up resources
    free(A);
    free(B);
    free(C);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    cublasDestroy(handle);

    return 0;
}

For simplicity, the following compilations will use the GNU programming environment. For more information on using non-GNU compilers, please see the Compiling Mixed GPU and CPU Code tutorial.

Compile:

$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit
$ cc -lcublas cublas.c -o cublas.out

Run:

$ module load cudatoolkit
$ aprun ./cublas.out

Library-provided Batching in C

The most efficient way to increase GPU utilization for small library operations is to use library-provided batch capabilities. In the code sample below we use cublasDgemmBatched, which takes as an argument an array of pointers to matrices and the number of DGEMM operations to perform in total. By allowing the library to handle the batching, high performance can be achieved with minimal additional effort on the part of the programmer.

cublas-batch.c

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"

int main(int argc, char* argv[])
{
    int i,j,k,index;

    // Linear dimension of matrices
    int dim = 100;

    // Number of A,B,C matrix sets
    int batch_count = 1000;

    // Allocate host storage for batch_count A,B,C square matrices
    double **A, **B, **C;
    A = (double**)malloc(batch_count*sizeof(double*));
    B = (double**)malloc(batch_count*sizeof(double*));
    C = (double**)malloc(batch_count*sizeof(double*));
    for(i=0; i<batch_count; i++) {
        A[i] = (double*)malloc(dim*dim*sizeof(double));
        B[i] = (double*)malloc(dim*dim*sizeof(double));
        C[i] = (double*)malloc(dim*dim*sizeof(double));
    }

    // Create host pointer array to device matrix storage
    double **d_A, **d_B, **d_C, **h_d_A, **h_d_B, **h_d_C;
    h_d_A = (double**)malloc(batch_count*sizeof(double*));
    h_d_B = (double**)malloc(batch_count*sizeof(double*));
    h_d_C = (double**)malloc(batch_count*sizeof(double*));

    for(i=0; i<batch_count; i++) {
        cudaMalloc((void**)&h_d_A[i], dim*dim*sizeof(double));
        cudaMalloc((void**)&h_d_B[i], dim*dim*sizeof(double));
        cudaMalloc((void**)&h_d_C[i], dim*dim*sizeof(double));
    }
    // Copy the host array of device pointers to the device
    cudaMalloc((void**)&d_A, batch_count*sizeof(double*));
    cudaMalloc((void**)&d_B, batch_count*sizeof(double*));
    cudaMalloc((void**)&d_C, batch_count*sizeof(double*));
    cudaMemcpy(d_A, h_d_A, batch_count*sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_d_B, batch_count*sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(d_C, h_d_C, batch_count*sizeof(double*), cudaMemcpyHostToDevice);

    // Fill A,B diagonals with k*sin(i) data, C diagonal with k*cos(i)^2
    // Matrices are arranged column major
    for(k=0; k<batch_count; k++) {
        for(j=0; j<dim; j++) {
            for(i=0; i<dim; i++) {
                index = j*dim + i;
                if(i==j) {
                    (A[k])[index] = k*sin(index);
                    (B[k])[index] = sin(index);
                    (C[k])[index] = k*cos(index)*cos(index);
                }
                else {
                    (A[k])[index] = 0.0;
                    (B[k])[index] = 0.0;
                    (C[k])[index] = 0.0;
                }
            } // i   
        } // j
    } // k

    // Create cublas instance
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Set input matrices on device
    for(i=0; i<batch_count; i++) {
        cublasSetMatrix(dim, dim, sizeof(double), A[i], dim, h_d_A[i], dim);
        cublasSetMatrix(dim, dim, sizeof(double), B[i], dim, h_d_B[i], dim);
        cublasSetMatrix(dim, dim, sizeof(double), C[i], dim, h_d_C[i], dim);
    }

    // Set matrix coefficients
    double alpha = 1.0;
    double beta  = 1.0;

    // DGEMM: C = alpha*A*B + beta*C
    cublasDgemmBatched(handle,
                       CUBLAS_OP_N, CUBLAS_OP_N,
                       dim, dim, dim,
                       &alpha,
                       (const double**)d_A, dim,
                       (const double**)d_B, dim,
                       &beta,
                       d_C, dim,
                       batch_count);

    // Retrieve result matrix from device
    for(i=0; i<batch_count; i++)
        cublasGetMatrix(dim, dim, sizeof(double), h_d_C[i], dim, C[i], dim);

    // Simple sanity test, sum up all elements
    double sum = 0;
    for(k=0; k<batch_count; k++) {
        for(j=0; j<dim; j++) {
            for(i=0; i<dim; i++) {
                index = j*dim + i;
                sum += (C[k])[index];
            }
        }
    }
    printf("Element sum is: %f, should be: %d\n", sum, dim*(batch_count-1)*(batch_count)/2);   

    // Clean up resources

    for(i=0; i<batch_count; i++) {
        free(A[i]);
        free(B[i]);
        free(C[i]);
        cudaFree(h_d_A[i]);
        cudaFree(h_d_B[i]);
        cudaFree(h_d_C[i]);
    }

    free(A);
    free(B);
    free(C);
    free(h_d_A);
    free(h_d_B);
    free(h_d_C);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    cublasDestroy(handle);

    return 0;
}

Compile:

$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit
$ cc -lcublas cublas-batch.c -o cublas-batch.out

Run:

$ module load cudatoolkit
$ aprun ./cublas-batch.out

CUDA Streams in C

Not all library functions have a batch equivalent; in this case, using CUDA streams may be appropriate. Each stream can be considered a separate work queue, allowing kernels from separate streams to be executed concurrently if hardware resources allow. In the code sample below we place each DGEMM operation in its own stream, allowing up to 32 concurrent operations through the use of Hyper-Q. The cuBLAS library provides functionality to create, set, and destroy CUDA streams.

cublas-stream.c

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"

int main(int argc, char* argv[])
{
    int i,j,k,index;

    // Linear dimension of matrices
    int dim = 100;

    // Number of A,B,C matrix sets
    int batch_count = 1000;

    // Allocate host storage for batch_count A,B,C square matrices
    double **A, **B, **C;
    A = (double**)malloc(batch_count*sizeof(double*));
    B = (double**)malloc(batch_count*sizeof(double*));
    C = (double**)malloc(batch_count*sizeof(double*));
    for(i=0; i<batch_count; i++) {
        A[i] = (double*)malloc(dim*dim*sizeof(double));
        B[i] = (double*)malloc(dim*dim*sizeof(double));
        C[i] = (double*)malloc(dim*dim*sizeof(double));
    }

    // Allocate device storage for batch_count A,B,C
    double **d_A, **d_B, **d_C;
    d_A = (double**)malloc(batch_count*sizeof(double*));
    d_B = (double**)malloc(batch_count*sizeof(double*));
    d_C = (double**)malloc(batch_count*sizeof(double*));
    for(i=0; i<batch_count; i++) {
        cudaMalloc((void**)&d_A[i], dim*dim*sizeof(double));
        cudaMalloc((void**)&d_B[i], dim*dim*sizeof(double));
        cudaMalloc((void**)&d_C[i], dim*dim*sizeof(double));
    }
    // Fill A,B diagonals with k*sin(i) data, C diagonal with k*cos(i)^2
    // Matrices are arranged column major
    for(k=0; k<batch_count; k++) {
        for(j=0; j<dim; j++) {
            for(i=0; i<dim; i++) {
                index = j*dim + i;
                if(i==j) {
                    (A[k])[index] = k*sin(index);
                    (B[k])[index] = sin(index);
                    (C[k])[index] = k*cos(index)*cos(index);
                }
                else {
                    (A[k])[index] = 0.0;
                    (B[k])[index] = 0.0;
                    (C[k])[index] = 0.0;
                }
            } // i   
        } // j
    } // k

    // Create cublas instance
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Set input matrices on device
    for(i=0; i<batch_count; i++) {
        cublasSetMatrix(dim, dim, sizeof(double), A[i], dim, d_A[i], dim);
        cublasSetMatrix(dim, dim, sizeof(double), B[i], dim, d_B[i], dim);
        cublasSetMatrix(dim, dim, sizeof(double), C[i], dim, d_C[i], dim);
    }

    // Create a stream for every DGEMM operation
    cudaStream_t *streams = (cudaStream_t *) malloc(batch_count*sizeof(cudaStream_t));
    for(i=0; i<batch_count; i++)
        cudaStreamCreate(&streams[i]);

    // Set matrix coefficients
    double alpha = 1.0;
    double beta  = 1.0;

    // Launch each DGEMM operation in own CUDA stream
    for(i=0; i<batch_count; i++){
        // Set CUDA stream
        cublasSetStream(handle, streams[i]);

        // DGEMM: C = alpha*A*B + beta*C
        cublasDgemm(handle,
                    CUBLAS_OP_N, CUBLAS_OP_N,
                    dim, dim, dim,
                    &alpha,
                    d_A[i], dim,
                    d_B[i], dim,
                    &beta,
                    d_C[i], dim);
    }

    // Retrieve result matrix from device
    for(i=0; i<batch_count; i++) {
        cublasGetMatrix(dim, dim, sizeof(double), d_C[i], dim, C[i], dim);
    }

    // Simple sanity test, sum up all elements
    double sum = 0;
    for(k=0; k<batch_count; k++) {
        for(j=0; j<dim; j++) {
            for(i=0; i<dim; i++) {
                index = j*dim + i;
                sum += (C[k])[index];
            }
        }
    }
    printf("Element sum is: %f, should be: %d\n", sum, dim*(batch_count-1)*(batch_count)/2);   

    // Clean up resources
    for(i=0; i<batch_count; i++) {
        free(A[i]);
        free(B[i]);
        free(C[i]);
        cudaFree(d_A[i]);
        cudaFree(d_B[i]);
        cudaFree(d_C[i]);
        cudaStreamDestroy(streams[i]);
    }
    free(A);
    free(B);
    free(C);
    // d_A, d_B, and d_C are host arrays of device pointers,
    // so they are released with free() rather than cudaFree()
    free(d_A);
    free(d_B);
    free(d_C);
    free(streams);
    cublasDestroy(handle);

    return 0;
}

Compile:

$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit
$ cc -lcublas cublas-stream.c -o cublas-stream.out

Run:

$ module load cudatoolkit
$ aprun ./cublas-stream.out

Fortran

Wrappers

NVIDIA provides a Fortran interface for the legacy cuBLAS API, although they do not provide a Fortran interface to the version 2 API, which adds batched operations. One of the changes introduced with the new API is that an opaque cuBLAS handle must be passed into most function calls; while this provides greater control over operations, it is a hindrance to interoperability. The wrapper is split into two pieces: cublas_fortran.cu and cublas_fortran_iso.f90. cublas_fortran.cu provides a lightweight C wrapper around cuBLAS calls, with most calls taking a pointer to the cuBLAS handle that can be dereferenced to make the actual call. Since the wrapper functions require only a pointer to the cuBLAS handle, we may interface them with the ISO_C_BINDING module. This ISO_C_BINDING interface is the second piece of our wrapper, cublas_fortran_iso.f90, and allows for robust interoperability between C and Fortran types.

cublas_fortran.cu

#include <stdio.h>
#include "cublas_v2.h"

extern "C" int f_cublasCreate(cublasHandle_t **handle)
{
    *handle = (cublasHandle_t*)malloc(sizeof(cublasHandle_t));
    return cublasCreate(*handle);
}

extern "C" int f_cublasDgemm(cublasHandle_t *handle,
               cublasOperation_t transa, cublasOperation_t transb, 
              int m, int n, int k, 
              const double *alpha,
              const double *A, int lda, 
              const double *B, int ldb,
              const double *beta, 
              double *C, int ldc)
{
    return cublasDgemm(*handle,transa,transb,m,n,k,alpha,A,lda,B,ldb,beta,C,ldc);
}

extern "C" int f_cublasDgemmBatched(cublasHandle_t *handle,
               cublasOperation_t transa, cublasOperation_t transb,
              int m, int n, int k,
              const double *alpha,
              const double **A, int lda,
              const double **B, int ldb,
              const double *beta,
              double **C, int ldc,
              int batch_count)
{
    return cublasDgemmBatched(*handle,transa,transb,m,n,k,alpha,A,lda,B,ldb,beta,C,ldc,batch_count);
}

extern "C" void f_cublasDestroy(cublasHandle_t *handle)
{
    cublasDestroy(*handle);
    free(handle);
}

extern "C" int f_cudaStreamCreate(cudaStream_t **stream)
{
    *stream = (cudaStream_t *) malloc(sizeof(cudaStream_t));
    return cudaStreamCreate(*stream);
}

extern "C" int f_cublasSetStream(cublasHandle_t *handle, cudaStream_t *streamid)
{
    return cublasSetStream(*handle, *streamid);
}

extern "C" void f_cudaStreamDestroy(cudaStream_t *stream)
{
    cudaStreamDestroy(*stream);
}

cublas_fortran_iso.f90

module cublas_f
  use ISO_C_BINDING

    enum, BIND(C)
        enumerator :: CUBLAS_OP_N, CUBLAS_OP_T, CUBLAS_OP_C
    end enum
   
    enum, BIND(C)
        enumerator :: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, &
                      cudaMemcpyDeviceToDevice, cudaMemcpyDefault
    end enum

  INTERFACE
    integer(C_INT) function cudaMalloc(ptr, bytes) BIND(C, NAME='cudaMalloc')
        use ISO_C_BINDING
        type(C_PTR) :: ptr
        integer(C_SIZE_T), value :: bytes
    end function

    integer(C_INT) function cudaMemcpy(dst, src, count, kind) BIND(C, NAME='cudaMemcpy')
        use ISO_C_BINDING
        type(C_PTR), value :: dst
        type(C_PTR), value :: src
        integer(C_SIZE_T), value :: count
        integer(C_INT), value :: kind
    end function

    integer(C_INT) function cublasCreate(handle_ptr) BIND(C, NAME='f_cublasCreate')
        use ISO_C_BINDING
        type(C_PTR) :: handle_ptr
    end function

    subroutine  cublasDestroy(handle_ptr) BIND(C, NAME='f_cublasDestroy')
        use ISO_C_BINDING
        type(C_PTR), value :: handle_ptr
    end subroutine

    subroutine cudaFree(ptr) BIND(C, NAME='cudaFree')
        use ISO_C_BINDING
        type(C_PTR), value :: ptr
    end subroutine

    integer(C_INT) function cublasSetMatrix(rows, cols, elemSize, a_ptr, &
                                           lda, b_ptr, ldb) &
                                           BIND(C, NAME='cublasSetMatrix')
        use ISO_C_BINDING
        integer(C_INT), value :: rows
        integer(C_INT), value :: cols
        integer(C_INT), value :: elemSize
        type(C_PTR),    value :: a_ptr
        integer(C_INT), value :: lda
        type(C_PTR),    value :: b_ptr
        integer(C_INT), value :: ldb
    end function

    integer(C_INT) function cublasGetMatrix(rows, cols, elemSize, a_ptr, &
                                            lda, b_ptr, ldb ) &
                                            BIND(C, NAME='cublasGetMatrix')
        use ISO_C_BINDING
        integer(C_INT), value :: rows
        integer(C_INT), value :: cols
        integer(C_INT), value :: elemSize
        type(C_PTR),    value :: a_ptr
        integer(C_INT), value :: lda
        type(C_PTR),    value :: b_ptr
        integer(C_INT), value :: ldb

    end function

    integer(C_INT) function cublasDgemm(handle, transa, transb, m, n, k, alpha, &
                                        A, lda, B, ldb, beta, C, ldc) &
                                        BIND(C, NAME='f_cublasDgemm')
        use ISO_C_BINDING
        type(C_PTR), value    :: handle
        integer(C_INT), value :: transa
        integer(C_INT), value :: transb
        integer(C_INT), value :: m
        integer(C_INT), value :: n
        integer(C_INT), value :: k
        real(C_DOUBLE)        :: alpha
        type(C_PTR), value    :: A
        integer(C_INT), value :: lda
        type(C_PTR), value    :: B
        integer(C_INT), value :: ldb
        real(C_DOUBLE)        :: beta
        type(C_PTR), value    :: C
        integer(C_INT), value :: ldc
    end function

    integer(C_INT) function cublasDgemmBatched(handle, transa, transb, m, n, k, alpha, &
                                        A, lda, B, ldb, beta, C, ldc, batch_count) &
                                        BIND(C, NAME='f_cublasDgemmBatched')
        use ISO_C_BINDING
        type(C_PTR), value    :: handle
        integer(C_INT), value :: transa
        integer(C_INT), value :: transb
        integer(C_INT), value :: m
        integer(C_INT), value :: n
        integer(C_INT), value :: k
        real(C_DOUBLE)        :: alpha
        type(C_PTR), value    :: A
        integer(C_INT), value :: lda
        type(C_PTR), value    :: B
        integer(C_INT), value :: ldb
        real(C_DOUBLE)        :: beta
        type(C_PTR), value    :: C
        integer(C_INT), value :: ldc
        integer(C_INT), value :: batch_count
    end function

    integer(C_INT) function cudaStreamCreate(stream_ptr) BIND(C, NAME='f_cudaStreamCreate')
        use ISO_C_BINDING
        type(C_PTR) :: stream_ptr
    end function

    integer(C_INT) function cublasSetStream(handle, stream) BIND(C, NAME='f_cublasSetStream')
        use ISO_C_BINDING
        type(C_PTR), value :: handle
        type(C_PTR), value :: stream
    end function

    integer(C_INT) function cudaStreamDestroy(stream) BIND(C, NAME='f_cudaStreamDestroy')
        use ISO_C_BINDING
        type(C_PTR), value :: stream
    end function

    subroutine cudaDeviceSynchronize() BIND(C, NAME='cudaDeviceSynchronize')
    end subroutine

  END INTERFACE

end module cublas_f

A Serial Example in Fortran

The following example performs a single DGEMM operation using the cuBLAS version 2 library. This example will be expanded to show the use of batched cuBLAS calls under a variety of situations. The form of our sample DGEMM operation is as follows:

C = alpha*A*B + beta*C

Note: cuBLAS uses column major ordering and 1-based indexing to maintain legacy Fortran support.

cublas.f90

program main
    use ISO_C_BINDING
    use cublas_f
    implicit none

    integer :: dim, stat, i, j
    integer(8) :: bytes
    real(8),dimension(:,:), pointer  :: A, B, C
    real(8) :: alpha, beta, index, sum
    type(C_PTR) :: d_A, d_B, d_C
    type(C_PTR) :: handle

    integer :: sizeof_double
    parameter (sizeof_double=8)

    !Linear dimension of matrices
    dim = 100

    ! Allocate host storage for A,B,C square matrices
    allocate(A(dim,dim))
    allocate(B(dim,dim))
    allocate(C(dim,dim))

    ! Allocate device storage for A,B,C
    bytes = int(dim*dim*sizeof_double, 8) 
    stat = cudaMalloc(d_A, bytes)
    stat = cudaMalloc(d_B, bytes)
    stat = cudaMalloc(d_C, bytes)

    ! Fill A,B diagonals with sin(i) data, C diagonal with cos(i)^2
    ! Matrices are arranged column major
    do j=1,dim
        do i=1,dim
            if (i==j) then
                index = real(j*dim + i)
                A(i,j) = sin(index)
                B(i,j) = sin(index)
                C(i,j) = cos(index) * cos(index)
            else
                A(i,j) = 0.0
                B(i,j) = 0.0
                C(i,j) = 0.0
            endif
        enddo
    enddo

    ! Create cublas instance
    stat = cublasCreate(handle)

    ! Set input matrices on device
    stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(A(1,1)), dim, d_A, dim)
    stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(B(1,1)), dim, d_B, dim)
    stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(C(1,1)), dim, d_C, dim)

    ! DGEMM: C = alpha*A*B + beta*C
    alpha = 1.0
    beta = 1.0
    stat = cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, &
                       dim, dim, dim, &
                       alpha,         &
                       d_A, dim,      &
                       d_B, dim,      &
                       beta,          &
                       d_C, dim)

    ! Retrieve result matrix from device
    stat = cublasGetMatrix(dim, dim, sizeof_double, d_C, dim, C_LOC(C(1,1)), dim)

    ! Simple sanity test, sum up all elements
    sum = 0.0
    do j=1,dim
        do i=1,dim
            sum = sum + C(i,j)
        enddo
    enddo
    print *, "Sum is:", sum, "should be: ", dim

    deallocate(A)
    deallocate(B)
    deallocate(C)
    call cudaFree(d_A)
    call cudaFree(d_B)
    call cudaFree(d_C)
    call cublasDestroy(handle)

end program

For simplicity, the following compilations will use the GNU programming environment. For more information on using non-GNU compilers, please see the Compiling Mixed GPU and CPU Code tutorial.

Compile:

$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit
$ nvcc -c cublas_fortran.cu
$ ftn -lcublas cublas_fortran_iso.f90 cublas.f90 cublas_fortran.o -o cublas.out

Run:

$ module load cudatoolkit
$ aprun ./cublas.out

Library-provided Batching in Fortran

The most efficient way to increase GPU utilization for small library operations is to use library-provided batch capabilities. In the code sample below we use cublasDgemmBatched, which takes as an argument an array of pointers to matrices and the number of DGEMM operations to perform in total. By allowing the library to handle the batching, high performance can be achieved with minimal additional effort on the part of the programmer.

cublas-batch.f90

program main
    use ISO_C_BINDING
    use cublas_f
    implicit none

    integer :: dim, stat, i, j, k, batch_count
    integer(8) :: bytes
    real(8),dimension(:,:,:), pointer:: A, B, C
    real(8) :: alpha, beta, index, sum
    type(C_PTR) :: d_A, d_B, d_C
    type(C_PTR), dimension(:), pointer :: h_d_A, h_d_B, h_d_C
    type(C_PTR) :: handle

    integer :: sizeof_double
    parameter (sizeof_double=8)

    !Linear dimension of matrices
    dim = 100

    ! Number of A,B,C matrix sets
    batch_count = 1000

    ! Allocate host storage for A,B,C square matrices
    allocate(A(dim,dim,batch_count))
    allocate(B(dim,dim,batch_count))
    allocate(C(dim,dim,batch_count))

    ! Create host pointer array to device matrix storage
    allocate(h_d_A(batch_count))
    allocate(h_d_B(batch_count))
    allocate(h_d_C(batch_count))
    bytes = dim*dim*sizeof_double

    do i=1,batch_count
        stat = cudaMalloc(h_d_A(i), bytes)
        stat = cudaMalloc(h_d_B(i), bytes)
        stat = cudaMalloc(h_d_C(i), bytes)
    enddo

    ! Copy the host array of device pointers to the device
    bytes = batch_count*sizeof_double ! 8 byte pointers
    stat = cudaMalloc(d_A, bytes)
    stat = cudaMalloc(d_B, bytes)
    stat = cudaMalloc(d_C, bytes)

    stat = cudaMemcpy(d_A, C_LOC(h_d_A(1)), bytes, cudaMemcpyHostToDevice)
    stat = cudaMemcpy(d_B, C_LOC(h_d_B(1)), bytes, cudaMemcpyHostToDevice)
    stat = cudaMemcpy(d_C, C_LOC(h_d_C(1)), bytes, cudaMemcpyHostToDevice)

    ! Fill A,B diagonals with sin(i) data, C diagonal with cos(i)^2
    ! Matrices are arranged column major
    do k=1,batch_count
        do j=1,dim
            do i=1,dim
                if (i==j) then
                    index = real(j*dim + i)
                    A(i,j,k) = k*sin(index)
                    B(i,j,k) = sin(index)
                    C(i,j,k) = k*cos(index) * cos(index)
                else
                    A(i,j,k) = 0.0
                    B(i,j,k) = 0.0
                    C(i,j,k) = 0.0
                endif
            enddo ! i
        enddo ! j
    enddo ! k

    ! Create cublas instance
    stat = cublasCreate(handle)

    ! Set input matrices on device
    do i=1,batch_count
        stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(A(1,1,i)), dim, h_d_A(i), dim)
        stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(B(1,1,i)), dim, h_d_B(i), dim)
        stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(C(1,1,i)), dim, h_d_C(i), dim)
    enddo

    ! Set matrix coefficients
    alpha = 1.0
    beta = 1.0

    ! batched DGEMM: C = alpha*A*B + beta*C
    stat = cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, &
                       dim, dim, dim, &
                       alpha,         &
                       d_A, dim,      &
                       d_B, dim,      &
                       beta,          &
                       d_C, dim,      &
                       batch_count)

    ! Retrieve result matrix from device
    do i=1,batch_count
        stat = cublasGetMatrix(dim, dim, sizeof_double, h_d_C(i), dim, C_LOC(C(1,1,i)), dim)
    enddo

    ! Simple sanity test, sum up all elements
    sum = 0.0
    do k=1,batch_count
        do j=1,dim
            do i=1,dim
                sum = sum + C(i,j,k)
            enddo
        enddo
    enddo
    print *, "Sum is:", sum, "should be: ", dim*(batch_count)*(batch_count+1)/2

    do i=1,batch_count
        call cudaFree(h_d_A(i))
        call cudaFree(h_d_B(i))
        call cudaFree(h_d_C(i))
    enddo

    ! Release the device arrays of pointers and the host storage
    call cudaFree(d_A)
    call cudaFree(d_B)
    call cudaFree(d_C)

    deallocate(A)
    deallocate(B)
    deallocate(C)
    deallocate(h_d_A)
    deallocate(h_d_B)
    deallocate(h_d_C)
    call cublasDestroy(handle)

end program

Compile:

$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit
$ nvcc -c cublas_fortran.cu
$ ftn -lcublas cublas_fortran_iso.f90 cublas-batch.f90 cublas_fortran.o -o cublas-batch.out

Run:

$ module load cudatoolkit
$ aprun ./cublas-batch.out

CUDA Streams in Fortran

Not all library functions have a batch equivalent; in this case, using CUDA streams may be appropriate. Each stream can be considered a separate work queue, allowing kernels from separate streams to be executed concurrently if hardware resources allow. In the code sample below we place each DGEMM operation in its own stream, allowing up to 32 concurrent operations through the use of Hyper-Q. The cuBLAS library provides functionality to create, set, and destroy CUDA streams.

cublas-stream.f90

program main
    use ISO_C_BINDING
    use cublas_f
    implicit none

    integer :: dim, stat, i, j, k, batch_count
    integer(8) :: bytes
    real(8),dimension(:,:,:), pointer:: A, B, C
    real(8) :: alpha, beta, index, sum
    type(C_PTR), dimension(:), pointer :: d_A, d_B, d_C, streams
    type(C_PTR) :: handle

    integer :: sizeof_double
    parameter (sizeof_double=8)

    !Linear dimension of matrices
    dim = 100

    ! Number of A,B,C matrix sets
    batch_count = 1000

    ! Allocate host storage for A,B,C square matrices
    allocate(A(dim,dim,batch_count))
    allocate(B(dim,dim,batch_count))
    allocate(C(dim,dim,batch_count))

    ! Allocate device storage for A,B,C
    allocate(d_A(batch_count))
    allocate(d_B(batch_count))
    allocate(d_C(batch_count))
    bytes = int(dim*dim*sizeof_double, 8)
    do i=1,batch_count
        stat = cudaMalloc(d_A(i), bytes)
        stat = cudaMalloc(d_B(i), bytes)
        stat = cudaMalloc(d_C(i), bytes)
    enddo

    ! Fill A,B diagonals with sin(i) data, C diagonal with cos(i)^2
    ! Matrices are arranged column major
    do k=1,batch_count
        do j=1,dim
            do i=1,dim
                if (i==j) then
                    index = real(j*dim + i)
                    A(i,j,k) = k*sin(index)
                    B(i,j,k) = sin(index)
                    C(i,j,k) = k*cos(index) * cos(index)
                else
                    A(i,j,k) = 0.0
                    B(i,j,k) = 0.0
                    C(i,j,k) = 0.0
                endif
            enddo ! i
        enddo ! j
    enddo ! k

    ! Create cublas instance
    stat = cublasCreate(handle)

    ! Set input matrices on device
    do i=1,batch_count
        stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(A(1,1,i)), dim, d_A(i), dim)
        stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(B(1,1,i)), dim, d_B(i), dim)
        stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(C(1,1,i)), dim, d_C(i), dim)
    enddo

    ! Create a stream for every DGEMM operation
    allocate(streams(batch_count))
    do i=1,batch_count
        stat = cudaStreamCreate(streams(i))
    enddo

    ! Set matrix coefficients
    alpha = 1.0
    beta = 1.0

    ! Launch each DGEMM operation in own CUDA stream
    do i=1,batch_count
        ! Set CUDA stream
        stat = cublasSetStream(handle, streams(i))

        ! DGEMM: C = alpha*A*B + beta*C
        stat = cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, &
                           dim, dim, dim, &
                           alpha,         &
                           d_A(i), dim,   &
                           d_B(i), dim,   &
                           beta,          &
                           d_C(i), dim)
    enddo

    ! Retrieve result matrix from device
    do i=1,batch_count
        stat = cublasGetMatrix(dim, dim, sizeof_double, d_C(i), dim, C_LOC(C(1,1,i)), dim)
    enddo

    ! Simple sanity test, sum up all elements
    sum = 0.0
    do k=1,batch_count
        do j=1,dim
            do i=1,dim
                sum = sum + C(i,j,k)
            enddo
        enddo
    enddo
    print *, "Sum is:", sum, "should be: ", dim*(batch_count)*(batch_count+1)/2

    do i=1,batch_count
        stat = cudaStreamDestroy(streams(i))
        call cudaFree(d_A(i))
        call cudaFree(d_B(i))
        call cudaFree(d_C(i))
    enddo

    deallocate(A)
    deallocate(B)
    deallocate(C)
    deallocate(d_A)
    deallocate(d_B)
    deallocate(d_C)
    deallocate(streams)
    call cublasDestroy(handle)

end program

Compile:

$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit
$ nvcc -c cublas_fortran.cu
$ ftn -lcublas cublas_fortran_iso.f90 cublas-stream.f90 cublas_fortran.o -o cublas-stream.out

Run:

$ module load cudatoolkit
$ aprun ./cublas-stream.out

This tutorial will cover linking GPU code compiled with NVCC and CPU code compiled with the Cray compiler wrappers cc/CC/ftn. The Cray compiler wrappers work in conjunction with the modules system to link in needed libraries such as MPI; it is therefore recommended that the wrappers be used to compile the CPU portions of your code. To generate compiler-portable code it is necessary to compile CUDA C, and any code containing CUDA runtime calls, with NVCC. The resulting NVCC-compiled object code must then be linked in with object code compiled with the Cray wrappers. NVCC performs GNU-style C++ name mangling on compiled functions, so care must be taken in compiling and linking codes. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.

C++

When linking C++ host code with NVCC-compiled code, the C++ code must use GNU-compatible name mangling. This is controlled through compiler wrapper flags.

vecAddWrapperCXX.cu

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(float *a, float *b, float *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;

    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

void vecAdd_wrapper()
{
    // Size of vectors
    int n = 100000;

    // Host input vectors
    float *h_a;
    float *h_b;
    //Host output vector
    float *h_c;

    // Device input vectors
    float *d_a;
    float *d_b;
    //Device output vector
    float *d_c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(float);

    // Allocate memory for each vector on host
    h_a = (float*)malloc(bytes);
    h_b = (float*)malloc(bytes);
    h_c = (float*)malloc(bytes);

    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    int i;
    // Initialize vectors on host
    for( i = 0; i < n; i++ ) {
        h_a[i] = sinf(i)*sinf(i);
        h_b[i] = cosf(i)*cosf(i);
    }

    // Copy host vectors to device
    cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int blockSize, gridSize;

    // Number of threads in each thread block
    blockSize = 1024;

    // Number of thread blocks in grid
    gridSize = (int)ceil((float)n/blockSize);

    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Copy array back to host
    cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

    // Sum up vector c and print result divided by n, this should equal 1 within error
    float sum = 0;
    for(i=0; i<n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);

    // Release device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);
}

vecAdd.cxx

#include <stdio.h>

void vecAdd_wrapper(void);

int main( int argc, char* argv[] )
{
    vecAdd_wrapper();

    return 0;
}

Compiling C++

PGI

$ module load cudatoolkit
$ nvcc -c vecAddWrapperCXX.cu
$ CC --gnu vecAdd.cxx vecAddWrapperCXX.o

Cray

$ module load cudatoolkit
$ nvcc -c vecAddWrapperCXX.cu
$ CC vecAdd.cxx vecAddWrapperCXX.o

GNU

$ module load cudatoolkit
$ nvcc -c vecAddWrapperCXX.cu
$ CC vecAdd.cxx vecAddWrapperCXX.o

Intel

$ module load cudatoolkit
$ nvcc -c vecAddWrapperCXX.cu
$ CC vecAdd.cxx vecAddWrapperCXX.o

C

NVCC name mangling must be disabled if it is to be linked with C code. This requires the extern "C" function qualifier.

vecAddWrapperC.cu

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(float *a, float *b, float *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;

    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

extern "C" void vecAdd_wrapper()
{
    // Size of vectors
    int n = 100000;

    // Host input vectors
    float *h_a;
    float *h_b;
    //Host output vector
    float *h_c;

    // Device input vectors
    float *d_a;
    float *d_b;
    //Device output vector
    float *d_c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(float);

    // Allocate memory for each vector on host
    h_a = (float*)malloc(bytes);
    h_b = (float*)malloc(bytes);
    h_c = (float*)malloc(bytes);

    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    int i;
    // Initialize vectors on host
    for( i = 0; i < n; i++ ) {
        h_a[i] = sinf(i)*sinf(i);
        h_b[i] = cosf(i)*cosf(i);
    }

    // Copy host vectors to device
    cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int blockSize, gridSize;

    // Number of threads in each thread block
    blockSize = 1024;

    // Number of thread blocks in grid
    gridSize = (int)ceil((float)n/blockSize);

    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Copy array back to host
    cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

    // Sum up vector c and print result divided by n, this should equal 1 within error
    float sum = 0;
    for(i=0; i<n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);

    // Release device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);
}

vecAdd.c

#include <stdio.h>

void vecAdd_wrapper(void);

int main( int argc, char* argv[] )
{
    vecAdd_wrapper();

    return 0;
}

Compiling C

There should be no compiler-dependent variations for C.

$ module load cudatoolkit
$ nvcc -c vecAddWrapperC.cu
$ cc vecAdd.c vecAddWrapperC.o

Fortran: Simple

This simple method allows C code to be called from Fortran but requires that the C function name be modified. NVCC name mangling must be disabled if it is to be linked with Fortran code. This requires the extern "C" function qualifier. Additionally, function names must be lowercase and end in an underscore (_). The ISO_C_BINDING method following this example allows you to call C functions from Fortran without modifying the C function name.

vecAddWrapperF.cu

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(float *a, float *b, float *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;

    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

extern "C" void vecadd_wrapper_()
{
    // Size of vectors
    int n = 100000;

    // Host input vectors
    float *h_a;
    float *h_b;
    //Host output vector
    float *h_c;

    // Device input vectors
    float *d_a;
    float *d_b;
    //Device output vector
    float *d_c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(float);

    // Allocate memory for each vector on host
    h_a = (float*)malloc(bytes);
    h_b = (float*)malloc(bytes);
    h_c = (float*)malloc(bytes);

    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    int i;
    // Initialize vectors on host
    for( i = 0; i < n; i++ ) {
        h_a[i] = sinf(i)*sinf(i);
        h_b[i] = cosf(i)*cosf(i);
    }

    // Copy host vectors to device
    cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int blockSize, gridSize;

    // Number of threads in each thread block
    blockSize = 1024;

    // Number of thread blocks in grid
    gridSize = (int)ceil((float)n/blockSize);

    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Copy array back to host
    cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

    // Sum up vector c and print result divided by n, this should equal 1 within error
    float sum = 0;
    for(i=0; i<n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);

    // Release device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);
}

vecAdd.f90

program vecAdd
    call vecadd_wrapper()
end program vecAdd

Compiling Fortran: Simple

There should be no compiler-dependent variations for Fortran.

$ module load cudatoolkit
$ nvcc -c vecAddWrapperF.cu
$ ftn vecAddWrapperF.o vecAdd.f90

Fortran: ISO_C_BINDING

The ISO_C_BINDING module provides Fortran greater interoperability with C and removes the need to modify the C function name.

vecAddWrapperISO.cu

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(float *a, float *b, float *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;

    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

extern "C" void vecAdd_wrapper()
{
    // Size of vectors
    int n = 100000;

    // Host input vectors
    float *h_a;
    float *h_b;
    //Host output vector
    float *h_c;

    // Device input vectors
    float *d_a;
    float *d_b;
    //Device output vector
    float *d_c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(float);

    // Allocate memory for each vector on host
    h_a = (float*)malloc(bytes);
    h_b = (float*)malloc(bytes);
    h_c = (float*)malloc(bytes);

    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    int i;
    // Initialize vectors on host
    for( i = 0; i < n; i++ ) {
        h_a[i] = sinf(i)*sinf(i);
        h_b[i] = cosf(i)*cosf(i);
    }

    // Copy host vectors to device
    cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int blockSize, gridSize;

    // Number of threads in each thread block
    blockSize = 1024;

    // Number of thread blocks in grid
    gridSize = (int)ceil((float)n/blockSize);

    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Copy array back to host
    cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

    // Sum up vector c and print result divided by n, this should equal 1 within error
    float sum = 0;
    for(i=0; i<n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);

    // Release device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);
}

vecAddISO.f90

module vector
  INTERFACE
    subroutine vecadd_wrapper() BIND (C, NAME='vecAdd_wrapper')
      USE ISO_C_BINDING
      implicit none
    end subroutine vecadd_wrapper
  END INTERFACE
end module vector

program vecAdd
    use ISO_C_BINDING
    use vector

    call vecadd_wrapper()

end program vecAdd

Compiling Fortran: ISO_C_BINDING

There should be no compiler-dependent variations for Fortran.

$ module load cudatoolkit
$ nvcc -c vecAddWrapperISO.cu 
$ ftn vecAddWrapperISO.o vecAddISO.f90

Cray’s implementation of MPICH2 allows GPU memory buffers to be passed directly to MPI function calls, eliminating the need to manually copy GPU data to the host before passing data through MPI. Several examples of using this feature are given below. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.

How to Enable

To enable GPUDirect the following steps must be taken before invoking aprun:

  $ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
  $ export MPICH_RDMA_ENABLED_CUDA=1

Examples

The following examples will initialize memory on the GPU and then perform an MPI_Allgather operation between GPUs.

For these examples, request an interactive job with two nodes:

$ qsub -I -A PROJ### -lnodes=2,walltime=00:30:00

The executables must be placed in /tmp/work/$USER to run properly on the GPU.

CUDA C

direct.cpp

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <mpi.h>

int main( int argc, char** argv )
{
    MPI_Init (&argc, &argv);

    int direct;
    int rank, size;
    int *h_buff = NULL;
    int *d_rank = NULL;
    int *d_buff = NULL;
    size_t bytes;
    int i;

    // Ensure that RDMA ENABLED CUDA is set correctly
    direct = getenv("MPICH_RDMA_ENABLED_CUDA")==NULL?0:atoi(getenv ("MPICH_RDMA_ENABLED_CUDA"));
    if(direct != 1){
        printf ("MPICH_RDMA_ENABLED_CUDA not enabled!\n");
        exit (EXIT_FAILURE);
    }

    // Get MPI rank and size
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Comm_size (MPI_COMM_WORLD, &size);

    // Allocate host and device buffers and copy rank value to GPU
    bytes = size*sizeof(int);
    h_buff = (int*)malloc(bytes);
    cudaMalloc(&d_buff, bytes);
    cudaMalloc(&d_rank, sizeof(int));
    cudaMemcpy(d_rank, &rank, sizeof(int), cudaMemcpyHostToDevice);

    // Perform Allgather using device buffer
    MPI_Allgather(d_rank, 1, MPI_INT, d_buff, 1, MPI_INT, MPI_COMM_WORLD);

    // Check that the GPU buffer is correct
    cudaMemcpy(h_buff, d_buff, bytes, cudaMemcpyDeviceToHost);
    for(i=0; i<size; i++){
        if(h_buff[i] != i) {
            printf ("Alltoall Failed!\n");
            exit (EXIT_FAILURE);
        }
    }
    if(rank==0)
        printf("Success!\n");

    // Clean up
    free(h_buff);
    cudaFree(d_buff);
    cudaFree(d_rank);
    MPI_Finalize();

    return 0;
}

Compiling CUDA C

For ease of compiling, the GNU environment will be used.

  $ module load cudatoolkit
  $ module switch PrgEnv-pgi PrgEnv-gnu
  $ CC -lcudart direct.cpp -o direct.out

Running CUDA C

  $ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
  $ export MPICH_RDMA_ENABLED_CUDA=1
  $ aprun -n2 -N1 ./direct.out

CUDA Fortran

direct.cuf

program GPUdirect
    use cudafor
    implicit none
    include 'mpif.h'

    integer :: direct
    character(len=255) :: env_var
    integer :: rank, size, ierror
    integer,dimension(:),allocatable :: h_buff
    integer,device :: d_rank
    integer,dimension(:),allocatable,device :: d_buff
    integer :: i

    call getenv("MPICH_RDMA_ENABLED_CUDA", env_var)
    read( env_var, '(i10)' ) direct
    if (direct .NE. 1) then
      print *, 'MPICH_RDMA_ENABLED_CUDA not enabled!'
      call exit(1)
    endif

    call MPI_INIT(ierror)

    ! Get MPI rank and size
    call MPI_COMM_RANK (MPI_COMM_WORLD, rank, ierror)
    call MPI_COMM_SIZE (MPI_COMM_WORLD, size, ierror)

    ! Initialize host and device buffers
    allocate(h_buff(size))
    allocate(d_buff(size))
    ! Implicitly copy rank to device
    d_rank = rank

    ! Perform allgather using device buffers
    call MPI_ALLGATHER(d_rank, 1, MPI_INTEGER, d_buff, 1, MPI_INTEGER, MPI_COMM_WORLD, ierror)

    ! Check that buffer is correct
    h_buff = d_buff(1:size)
    do i=1,size
        if (h_buff(i) .NE. i-1) then
            print *, 'Allgather Failed!'
            call exit(1)
        endif
    enddo
    if (rank .EQ. 0) then
        print *, 'Success!'
    endif

    ! Clean up
    deallocate(h_buff)
    deallocate(d_buff)
    call MPI_FINALIZE(ierror)

end program GPUdirect

Compiling CUDA Fortran

  $ module load cudatoolkit
  $ ftn direct.cuf -o direct.out

Running CUDA Fortran

  $ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
  $ export MPICH_RDMA_ENABLED_CUDA=1
  $ aprun -n2 -N1 ./direct.out

OpenACC C

direct.c

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main( int argc, char** argv )
{
    MPI_Init (&argc, &argv);

    int direct;
    int rank, size;
    int *restrict buff = NULL;
    size_t bytes;
    int i;

    // Ensure that RDMA ENABLED CUDA is set correctly
    direct = getenv("MPICH_RDMA_ENABLED_CUDA")==NULL?0:atoi(getenv ("MPICH_RDMA_ENABLED_CUDA"));
    if(direct != 1){
        printf ("MPICH_RDMA_ENABLED_CUDA not enabled!\n");
        exit (EXIT_FAILURE);
    }

    // Get MPI rank and size
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Comm_size (MPI_COMM_WORLD, &size);

    // Initialize buffer
    bytes = size*sizeof(int);
    buff = (int*)malloc(bytes);

    // Copy buff to the device at the start of the region and back to the host at the end of the region
    #pragma acc data copy(rank, buff[0:size])
    {
        // Inside this region the device data pointer will be used
        #pragma acc host_data use_device(rank, buff)
        {
            MPI_Allgather(&rank, 1, MPI_INT, buff, 1, MPI_INT, MPI_COMM_WORLD);
        }
    }

    // Check that buffer is correct
    for(i=0; i<size; i++){
        if(buff[i] != i) {
            printf ("Alltoall Failed!\n");
            exit (EXIT_FAILURE);
        }
    }
    if(rank==0)
        printf("Success!\n");

    // Clean up
    free(buff);

    MPI_Finalize();

    return 0;
}

Compiling OpenACC C

CRAY

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ cc -hpragma=acc direct.c -o direct.out

PGI

$ module load cudatoolkit
$ cc -acc -lcudart direct.c -o direct.out

Running OpenACC C

CRAY

$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ aprun -n2 -N1 ./direct.out

PGI

$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ aprun -n2 -N1 ./direct.out

OpenACC Fortran

direct.f90

program GPUdirect
    include 'mpif.h'

    integer :: direct
    character(len=255) :: env_var
    integer :: rank, size, ierror
    integer,dimension(:),allocatable :: buff
    integer :: i

    call getenv("MPICH_RDMA_ENABLED_CUDA", env_var)
    read( env_var, '(i10)' ) direct
    if (direct .NE. 1) then
      print *, 'MPICH_RDMA_ENABLED_CUDA not enabled!'
      call exit(1)
    endif

    call MPI_INIT(ierror)

    ! Get MPI rank and size
    call MPI_COMM_RANK (MPI_COMM_WORLD, rank, ierror)
    call MPI_COMM_SIZE (MPI_COMM_WORLD, size, ierror)

    ! Initialize buffer
    allocate(buff(size))

    ! Copy buff to device at start of region and back to host at end of region
    !$acc data copy(rank, buff(1:size))
        ! Inside this region the device data pointer will be used
        !$acc host_data use_device(rank, buff)
            ! Perform allgather using device buffer
            call MPI_ALLGATHER(rank, 1, MPI_INTEGER, buff, 1, MPI_INTEGER, MPI_COMM_WORLD, ierror)
        !$acc end host_data
    !$acc end data

    ! Check that buffer is correct
    do i=1,size
        if (buff(i) .NE. i-1) then
            print *, 'Allgather Failed!'
            call exit(1)
        endif
    enddo
    if (rank .EQ. 0) then
        print *, 'Success!'
    endif

    ! Clean up
    deallocate(buff)
    call MPI_FINALIZE(ierror)
end program GPUdirect

Compiling OpenACC Fortran

CRAY

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ ftn -hacc direct.f90 -o direct.out

PGI

$ module load cudatoolkit
$ ftn -acc -lcudart direct.f90 -o direct.out

Running OpenACC Fortran

CRAY

$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ aprun -n2 -N1 ./direct.out

PGI

$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ aprun -n2 -N1 ./direct.out

Optimizations

Several optimizations for improving performance are given below. These optimizations are highly application dependent and may require some trial and error tuning to achieve best results.

Pipelining

Pipelining allows for overlapping of GPU to GPU MPI messages and may improve message passing performance for large, bandwidth-bound messages. Setting the environment variable MPICH_G2G_PIPELINE=N allows a maximum of N GPU to GPU messages to be in flight at any given time. The default value of MPICH_G2G_PIPELINE is 16, and messages under 8 kilobytes in size are never pipelined.
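
For example, to allow up to 64 in-flight messages when running the GPUDirect example above (64 is illustrative, not a recommended value):

$ export MPICH_G2G_PIPELINE=64
$ aprun -n2 -N1 ./direct.out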

Nemesis

Applications using asynchronous MPI calls may benefit from enabling the MPICH asynchronous progress feature. Setting the MPICH_NEMESIS_ASYNC_PROGRESS=1 environment variable enables additional threads to be spawned to progress the MPI state.

This feature requires that the thread level be set to multiple: MPICH_MAX_THREAD_SAFETY=multiple.

This feature works best when used in conjunction with core specialization: aprun -r N, which reserves N CPU cores for system services.
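
Putting these together, a sketch of a launch with asynchronous progress enabled and one core per node reserved might look as follows (the executable name and task count are illustrative):

$ export MPICH_NEMESIS_ASYNC_PROGRESS=1
$ export MPICH_MAX_THREAD_SAFETY=multiple
$ aprun -n16 -r1 ./a.out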

CUDA Proxy

The default GPU compute mode for Titan is Exclusive Process. In this mode, many threads within a process may access the GPU context. To allow multiple processes access to the GPU context, such as multiple MPI tasks on a single node accessing the GPU, the CUDA proxy server was developed. Once enabled, the proxy server transparently manages work issued to the GPU context from multiple processes. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.

Currently, GPU memory between processes accessing the proxy is not guarded, meaning process i can access memory allocated by process j. This SHOULD NOT be used to share memory between processes, and care should be taken to ensure processes only access GPU memory they have allocated themselves.

How to Enable

To enable the proxy server the following environment variable must be set before invoking aprun:

$ export CRAY_CUDA_MPS=1

Issues

Currently, GPU debugging and profiling are not supported when the proxy is enabled. Specifying the qsub flag -lfeature=gpudefault will switch the compute mode from exclusive process to the CUDA default mode. In the default mode debugging and profiling are available and multiple MPI ranks are able to access the GPU. The default compute mode is not recommended on Titan, however: in this mode approximately 120 MB of device memory is consumed per process accessing the GPU, and inconsistent behavior may be encountered under certain conditions.
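
For example, a sketch of requesting the default compute mode for an interactive job (the project ID is a placeholder):

$ qsub -I -A PROJ### -lnodes=1,walltime=00:30:00 -lfeature=gpudefault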

Examples

The following examples will demonstrate when and how to use the CUDA proxy. In each example several kernels will be launched to demonstrate use of the CUDA proxy. These examples will be used in a future tutorial to investigate how multiple kernels are scheduled on the GPU.

For these examples use a single node and request an interactive job.

$ qsub -I -A PROJ### -lnodes=1,walltime=00:30:00

C Wrapper

The following C wrapper functions will be used to set up and launch a kernel that sleeps a single GPU thread for a specified number of seconds. This file will be compiled with NVCC and then linked into our examples.

pauseWrapper.cu

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// CUDA kernel to pause for at least num_cycle cycles
__global__ void sleep(int64_t num_cycles)
{
    int64_t cycles = 0;
    int64_t start = clock64();
    while(cycles < num_cycles) {
        cycles = clock64() - start;
    }
}

// Returns number of cycles required for requested seconds
extern "C" int64_t get_cycles(int seconds)
{
    // Get device frequency in Hz
    int64_t Hz;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    Hz = int64_t(prop.clockRate) * 1000;

    // Calculate number of cycles to wait
    int64_t num_cycles;
    num_cycles = seconds * Hz;

    return num_cycles;
}

// Launches a kernel that sleeps for at least num_cycles
extern "C" void sleep_kernel(int64_t num_cycles)
{
    // Our kernel will use a single GPU thread to run the sleep kernel
    int blockSize, gridSize;
    blockSize = 1;
    gridSize = 1;

    // Execute the kernel
    sleep<<<gridSize, blockSize>>>(num_cycles);
}

// Wait for all pending GPU transactions to end
extern "C" void wait_for_gpu()
{
    cudaDeviceSynchronize();
}

Compiling

The following will compile the wrapper functions into object code to be linked in with our examples.

$ module load cudatoolkit
$ nvcc -c -arch=sm_35 pauseWrapper.cu

Multiple kernels single process

Launching multiple kernels from a single process does not require the CUDA proxy.

launcher.cpp

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

extern "C" int64_t get_cycles(int seconds);
extern "C" void sleep_kernel(int64_t num_cycles);
extern "C" void wait_for_gpu();

int main(int argc, char *argv[])
{
    // Get number of cycles to sleep for 1 second
    uint64_t cycles;
    cycles = get_cycles(1);

    // Number of kernels to launch
    int num_kernels = 14;

    // Launch num_kernels kernels asynchronously
    for(int i=0; i<num_kernels; i++){
        sleep_kernel(cycles);
    }

    // Wait for all kernels to complete
    wait_for_gpu();

    return 0;
}

Compiling and running multiple kernels from a single process

$ CC pauseWrapper.o launcher.cpp -o sleep.out
$ aprun -n1 ./sleep.out

Multiple kernels launched from multiple threads

Launching multiple kernels from multiple threads spawned from a single process does not require the CUDA proxy.

launcherOMP.cpp

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <omp.h>

extern "C" int64_t get_cycles(int seconds);
extern "C" void sleep_kernel(int64_t num_cycles);
extern "C" void wait_for_gpu();

int main(int argc, char *argv[])
{
    // Get number of cycles to sleep for 1 second
    uint64_t cycles;
    cycles = get_cycles(1);

    // Number of kernels to launch
    int num_kernels = 14;

    // Launch kernel
    omp_set_num_threads(num_kernels);

    #pragma omp parallel for shared(cycles) num_threads(num_kernels)
    for(int i=0; i<num_kernels; i++){
        sleep_kernel(cycles);
    }

    // Wait for all kernels to complete
    wait_for_gpu();
}

Compiling and running multiple kernels launched from multiple threads

$ CC pauseWrapper.o launcherOMP.cpp -o sleep.out
$ aprun -n1 -d14 ./sleep.out

Multiple kernels launched from multiple MPI tasks

Launching multiple kernels from multiple MPI tasks on a single node does require the CUDA proxy.

launcherMPI.cpp

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <mpi.h>

extern "C" int64_t get_cycles(int seconds);
extern "C" void sleep_kernel(int64_t num_cycles);
extern "C" void wait_for_gpu();

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Get number of cycles to sleep for 1 second
    uint64_t cycles;
    cycles = get_cycles(1);

    // Sleep kernel for 1 second
    sleep_kernel(cycles);

    // Wait for the kernel to complete
    wait_for_gpu();

    MPI_Finalize();
    return 0;
}

Compiling and running multiple kernels launched from multiple MPI tasks

$ CC pauseWrapper.o launcherMPI.cpp -o sleep.out
$ export CRAY_CUDA_MPS=1
$ aprun -n14 -N14 ./sleep.out

CUDA Vector Addition

This sample shows a minimal conversion of our vector addition CPU code to C for CUDA; consider this a CUDA C ‘Hello World’. If you are not yet familiar with basic CUDA concepts please see the Accelerated Computing Guide. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.

vecAdd.cu

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;

    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

int main( int argc, char* argv[] )
{
    // Size of vectors
    int n = 100000;

    // Host input vectors
    double *h_a;
    double *h_b;
    //Host output vector
    double *h_c;

    // Device input vectors
    double *d_a;
    double *d_b;
    //Device output vector
    double *d_c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector on host
    h_a = (double*)malloc(bytes);
    h_b = (double*)malloc(bytes);
    h_c = (double*)malloc(bytes);

    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    int i;
    // Initialize vectors on host
    for( i = 0; i < n; i++ ) {
        h_a[i] = sin(i)*sin(i);
        h_b[i] = cos(i)*cos(i);
    }

    // Copy host vectors to device
    cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int blockSize, gridSize;

    // Number of threads in each thread block
    blockSize = 1024;

    // Number of thread blocks in grid
    gridSize = (int)ceil((float)n/blockSize);

    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Copy array back to host
    cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);

    // Release device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}

Changes

Kernel:
The kernel is the heart of our CUDA code. When a kernel is launched the number of threads per block (blockDim) and number of blocks per grid (gridDim) are specified. The total number of threads is (blockDim) * (gridDim). Each thread evaluates one copy of the kernel.

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;

    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

Let’s take a look at what makes up this simple kernel.

__global__ void vecAdd(double *a, double *b, double *c, int n)

The __global__ decorator specifies this is a CUDA kernel; otherwise normal C function syntax is used. The kernel must have return type void.

int id = blockIdx.x*blockDim.x+threadIdx.x;

We calculate the thread's global ID using CUDA supplied structs. blockIdx contains the block's position in the grid, ranging from 0 to gridDim-1. threadIdx is the thread's index inside of its associated block, ranging from 0 to blockDim-1. For convenience blocks and grids can be multidimensional; the associated structs contain x, y, and z members.

if (id < n)

Unless we have an array length divisible by our blockDim we will not have the same number of threads launched as valid array components. To avoid overstepping our array we simply test to make sure our global thread ID is less than the length of our array.

c[id] = a[id] + b[id]

The thread ID is used to index the arrays that reside in global device memory. Each thread will load a value from a and b and write the sum to c.

Array Pointers:

// Host input vectors
double *h_a;
double *h_b;
// Host output vector
double *h_c;

// Device input vectors
double *d_a;
double *d_b;
// Device output vector
double *d_c;

With the host CPU and GPU having separate memory spaces we must maintain two sets of pointers, one set for our host arrays and one set for our device arrays. Here we use the h_ and d_ prefix to differentiate them.

cudaMalloc:

// Allocate memory for each vector on GPU
cudaMalloc(&d_a, bytes);
cudaMalloc(&d_b, bytes);
cudaMalloc(&d_c, bytes);

Given a pointer, cudaMalloc will allocate memory in the device's global memory space and set the passed pointer to point to this memory.

cudaMemcpy to device:

// Copy host vectors to device
cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

After we initialize the data on the host we must copy it over to the device; to do so we use cudaMemcpy. By calling cudaMemcpy we initiate a DMA transfer between host and device.

Thread mapping:

int blockSize, gridSize;

// Number of threads in each thread block
blockSize = 1024;

// Number of thread blocks in grid
gridSize = (int)ceil((float)n/blockSize);

To launch our kernel we must specify the number of threads per block and the number of blocks in our grid. The maximum value of both is dependent on the device's compute capability. Normally the blockSize is chosen based upon kernel memory requirements and the gridSize is calculated such that enough threads are launched to cover the kernel domain. For example, with n = 100000 and a blockSize of 1024, gridSize = ceil(100000/1024) = 98 blocks, or 100352 threads in total; the final 352 threads fail the id < n bounds check and simply do no work.

Launch kernel:

// Execute the kernel
vecAdd<<< gridSize, blockSize >>>(d_a, d_b, d_c, n);

We launch our kernel with a modified C function syntax that lets us specify the grid and block sizes. Kernel calls are non-blocking.
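
Because the launch is non-blocking, launch failures are easy to miss. The following sketch is not part of the original example, but uses standard CUDA runtime calls to check the launch explicitly:

// Execute the kernel; the call returns to the host immediately
vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

// cudaGetLastError reports invalid launch configurations
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("Launch failed: %s\n", cudaGetErrorString(err));

// Block the host until the kernel has finished
cudaDeviceSynchronize();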

cudaMemcpy to host:

// Copy array back to host
cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

cudaMemcpy will block until the kernel is complete and then copy the results from the device back to the host.

Release device memory:

// Release device memory
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);

As with C malloc we need to release our device arrays when we are finished; to do so we use cudaFree.

Compiling

The CUDA environment must be set up before compiling, so make sure the module is loaded:

$ module load cudatoolkit
$ nvcc vecAdd.cu -o vecAdd.out

Running

$ aprun ./vecAdd.out
final result: 1.000000

CUDA Game of Life

In this tutorial we will cover three different CUDA versions of our GOL program, each highlighting a different CUDA memory area. The first will strictly use global memory and is a straightforward GPU port of our program; in the second we introduce shared memory; and lastly we cover the use of texture memory. A description of the Game of Life and the original CPU code can be found here. The full source can be viewed or downloaded at the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.

GOL-CUDA.cu

#include <stdio.h>
#include <stdlib.h>

#define SRAND_VALUE 1985
#define BLOCK_SIZE 16

__global__ void ghostRows(int dim, int *grid)
{
    // We want id ∈ [1,dim]
    int id = blockDim.x * blockIdx.x + threadIdx.x + 1;

    if (id <= dim)
    {
        //Copy first real row to bottom ghost row
        grid[(dim+2)*(dim+1)+id] = grid[(dim+2)+id];
        //Copy last real row to top ghost row
        grid[id] = grid[(dim+2)*dim + id];
    }
}

__global__ void ghostCols(int dim, int *grid)
{
    // We want id ∈ [0,dim+1]
    int id = blockDim.x * blockIdx.x + threadIdx.x;

    if (id <= dim+1)
    {
        //Copy first real column to right most ghost column
        grid[id*(dim+2)+dim+1] = grid[id*(dim+2)+1];
        //Copy last real column to left most ghost column 
        grid[id*(dim+2)] = grid[id*(dim+2) + dim];    
    }
}

__global__ void GOL(int dim, int *grid, int *newGrid)
{
    // We want id ∈ [1,dim]
    int iy = blockDim.y * blockIdx.y + threadIdx.y + 1;
    int ix = blockDim.x * blockIdx.x + threadIdx.x + 1;
    int id = iy * (dim+2) + ix;

    int numNeighbors;

    if (iy <= dim && ix <= dim) {

        // Get the number of neighbors for a given grid point
        numNeighbors = grid[id+(dim+2)] + grid[id-(dim+2)] //upper lower
                     + grid[id+1] + grid[id-1]             //right left
                     + grid[id+(dim+3)] + grid[id-(dim+3)] //diagonals
                     + grid[id-(dim+1)] + grid[id+(dim+1)];

        int cell = grid[id];
        // Here we have explicitly all of the game rules
        if (cell == 1 && numNeighbors < 2)
            newGrid[id] = 0;
        else if (cell == 1 && (numNeighbors == 2 || numNeighbors == 3))
            newGrid[id] = 1;
        else if (cell == 1 && numNeighbors > 3)
            newGrid[id] = 0;
        else if (cell == 0 && numNeighbors == 3)
            newGrid[id] = 1;
        else
            newGrid[id] = cell;
    }
}

int main(int argc, char* argv[])
{
    int i,j,iter;
    int* h_grid; //Grid on host
    int* d_grid; //Grid on device
    int* d_newGrid; //Second grid used on device only
    int* d_tmpGrid; //tmp grid pointer used to switch between grid and newGrid

    int dim = 1024; //Linear dimension of our grid - not counting ghost cells
    int maxIter = 1<<10; //Number of game steps

    size_t bytes = sizeof(int)*(dim+2)*(dim+2);//2 added for periodic boundary condition ghost cells
    // Allocate host Grid used for initial setup and read back from device
    h_grid = (int*)malloc(bytes);

    // Allocate device grids
    cudaMalloc(&d_grid, bytes);
    cudaMalloc(&d_newGrid, bytes);

    // Assign initial population randomly
    srand(SRAND_VALUE);
    for(i = 1; i<=dim; i++) {
        for(j = 1; j<=dim; j++) {
            h_grid[i*(dim+2)+j] = rand() % 2;
        }
    }

    // Copy over initial game grid (Dim-1 threads)
    cudaMemcpy(d_grid, h_grid, bytes, cudaMemcpyHostToDevice);

    dim3 blockSize(BLOCK_SIZE, BLOCK_SIZE,1);
    int linGrid = (int)ceil(dim/(float)BLOCK_SIZE);
    dim3 gridSize(linGrid,linGrid,1);

    dim3 cpyBlockSize(BLOCK_SIZE,1,1);
    dim3 cpyGridRowsGridSize((int)ceil(dim/(float)cpyBlockSize.x),1,1);
    dim3 cpyGridColsGridSize((int)ceil((dim+2)/(float)cpyBlockSize.x),1,1);

    // Main game loop
    for (iter = 0; iter<maxIter; iter++) {

        ghostRows<<<cpyGridRowsGridSize, cpyBlockSize>>>(dim, d_grid);
        ghostCols<<<cpyGridColsGridSize, cpyBlockSize>>>(dim, d_grid);
        GOL<<<gridSize, blockSize>>>(dim, d_grid, d_newGrid);

        // Swap our grids and iterate again
        d_tmpGrid = d_grid;
        d_grid = d_newGrid;
        d_newGrid = d_tmpGrid;
    }//iter loop

    // Copy back results and sum
    cudaMemcpy(h_grid, d_grid, bytes, cudaMemcpyDeviceToHost);

    // Sum up alive cells and print results
    int total = 0;
    for (i = 1; i<=dim; i++) {
        for (j = 1; j<=dim; j++) {
            total += h_grid[i*(dim+2)+j];
        }
    }
    printf("Total Alive: %d\n", total);

    // Release memory
    cudaFree(d_grid);
    cudaFree(d_newGrid);
    free(h_grid);

    return 0;
}

Changes

#define BLOCK_SIZE 16

The number of threads per block is a common performance tuning parameter and it is often set as a preprocessor define statement.
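
As an illustrative sketch (not in the original source), guarding the define with #ifndef would let the block size be overridden from the nvcc command line:

#ifndef BLOCK_SIZE
#define BLOCK_SIZE 16
#endif

$ nvcc -DBLOCK_SIZE=32 GOL-CUDA.cu -o GOL-CUDA.out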

__global__ void ghostRows(int dim, int *grid)
{
    // We want id ∈ [1,dim]
    int id = blockDim.x * blockIdx.x + threadIdx.x + 1;
 
    if (id <= dim)
    {
        //Copy first real row to bottom ghost row
        grid[(dim+2)*(dim+1)+id] = grid[(dim+2)+id];
        //Copy last real row to top ghost row
        grid[id] = grid[(dim+2)*dim + id];
    }
}
__global__ void ghostCols(int dim, int *grid)
{
    // We want id ∈ [0,dim+1]
    int id = blockDim.x * blockIdx.x + threadIdx.x;
 
    if (id <= dim+1)
    {
        //Copy first real column to right most ghost column
        grid[id*(dim+2)+dim+1] = grid[id*(dim+2)+1];
       //Copy last real column to left most ghost column
        grid[id*(dim+2)] = grid[id*(dim+2) + dim];
    }
}

The ghostRows and ghostCols kernels are used to fill in our ghost rows and columns in device global memory. Each thread is responsible for copying two integers.

__global__ void GOL(int dim, int *grid, int *newGrid)
{
    // We want id ∈ [1,dim]
    int iy = blockDim.y * blockIdx.y + threadIdx.y + 1;
    int ix = blockDim.x * blockIdx.x + threadIdx.x + 1;
    int id = iy * (dim+2) + ix;
 
    int numNeighbors;
 
    if (iy <= dim && ix <= dim) {
 
        // Get the number of neighbors for a given grid point
        numNeighbors = grid[id+(dim+2)] + grid[id-(dim+2)] //upper lower
                     + grid[id+1] + grid[id-1]             //right left
                     + grid[id+(dim+3)] + grid[id-(dim+3)] //diagonals
                     + grid[id-(dim+1)] + grid[id+(dim+1)];
 
        int cell = grid[id];
        // Here we have explicitly all of the game rules
        if (cell == 1 && numNeighbors < 2)
            newGrid[id] = 0;
        else if (cell == 1 && (numNeighbors == 2 || numNeighbors == 3))
            newGrid[id] = 1;
        else if (cell == 1 && numNeighbors > 3)
            newGrid[id] = 0;
        else if (cell == 0 && numNeighbors == 3)
            newGrid[id] = 1;
        else
            newGrid[id] = cell;
    }
}

The GOL kernel will perform the majority of our work. We use a two dimensional block and grid size to better fit our problem's geometry. Once the thread calculates its global ID and tests that it is not out of bounds, it will read one cell value from global memory, calculate the new value, and write the result back to global memory.

int* d_grid; //Grid on device
int* d_newGrid; //Second grid used on device only
int* d_tmpGrid; //tmp grid pointer used to switch between grid and newGrid

We declare our device pointers on the host as we would normally; we prefix our pointer names with d_ to help separate them from host pointers, which we prefix with h_.

// Allocate device grids
cudaMalloc(&d_grid, bytes);
cudaMalloc(&d_newGrid, bytes);

Both of our device arrays are allocated in device global memory. A one dimensional array is used.

// Copy over initial game grid (Dim-1 threads)
cudaMemcpy(d_grid, h_grid, bytes, cudaMemcpyHostToDevice);

After we have filled h_grid on the host we must copy it over to d_grid which resides in the device global memory.

dim3 blockSize(BLOCK_SIZE, BLOCK_SIZE,1);
dim3 cpyBlockSize(BLOCK_SIZE,1,1);

The CUDA data type dim3 is used to define the number of threads in our block. For the GOL kernel we specify a two dimensional block size to better suit our problem's geometry; for the copy kernels a simple one dimensional size is best. Note that any missing dimensions in the constructor are assumed to be 1.

int linGrid = (int)ceil(dim/(float)BLOCK_SIZE);
dim3 gridSize(linGrid,linGrid,1);
dim3 cpyGridRowsGridSize((int)ceil(dim/(float)cpyBlockSize.x),1,1);
dim3 cpyGridColsGridSize((int)ceil((dim+2)/(float)cpyBlockSize.x),1,1);

With our block size specified we must calculate the smallest integer number of blocks necessary to completely cover our game board; this is also of type dim3. For example, with dim = 1024 and BLOCK_SIZE = 16, linGrid = ceil(1024/16) = 64, giving a 64x64 grid of blocks.

ghostRows<<<cpyGridRowsGridSize, cpyBlockSize>>>(dim, d_grid);
ghostCols<<<cpyGridColsGridSize, cpyBlockSize>>>(dim, d_grid);
GOL<<<gridSize, blockSize>>>(dim, d_grid, d_newGrid);

On each iteration of the game loop we launch our three kernels, first copying the ghost cells and then performing one game iteration.

// Swap our grids and iterate again
d_tmpGrid = d_grid;
d_grid = d_newGrid;
d_newGrid = d_tmpGrid;

The grid and newGrid device pointers can be switched just as we did with the host pointers in the CPU version.

// Copy back results and sum
cudaMemcpy(h_grid, d_grid, bytes, cudaMemcpyDeviceToHost);

Once our main game loop is complete we must copy the grid from device memory to host memory.

// Release memory
cudaFree(d_grid);
cudaFree(d_newGrid);

Having used cudaMalloc to allocate our device memory we release it with cudaFree.

Compiling

$ module load cudatoolkit
$ nvcc GOL-CUDA.cu -o GOL-CUDA.out

GOL-CUDA-Shared.cu

In the first example each thread read nine values from global memory and wrote back one value. The inefficiency of this approach is that each value read from global memory is used not only in calculating its own new value but also by its eight closest neighbors in calculating their new values. Although some global memory access penalties have been alleviated in the latest generation of cards with the addition of an L2 cache, you will find shared memory is still an incredibly important aspect of GPU programming. Shared memory provides a very fast memory space (comparable to register access) that is shared between threads of the same thread block. In the case where multiple threads in a particular thread block are accessing the same global memory item, shared memory may be used to reduce the number of global memory calls. A typical use for shared memory is as follows, with a minimal sketch given after the list:

  1. Have each thread read from global memory into shared memory
  2. Synchronize threads to ensure all threads in a given block have finished reading from global memory
  3. Have each thread perform calculations using shared memory values
  4. Have each thread write result from shared memory back to device global memory

For detailed coverage of shared memory please see the CUDA C Programming Guide.
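
A minimal sketch of the four-step pattern above, independent of the GOL code (the smooth kernel, its arguments, and the 256-thread block size are assumptions for illustration):

// Assumes a launch with exactly 256 threads per block
__global__ void smooth(const float *in, float *out, int n)
{
    __shared__ float s[256];
    int tid = threadIdx.x;
    int gid = blockIdx.x*blockDim.x + tid;

    // 1. Each thread reads one value from global into shared memory
    if (gid < n)
        s[tid] = in[gid];

    // 2. Wait until the whole block has finished reading
    __syncthreads();

    // 3./4. Compute from shared values and write the result back to
    // global memory; for simplicity threads on the block edges skip this
    if (gid < n && tid > 0 && tid < blockDim.x-1)
        out[gid] = (s[tid-1] + s[tid] + s[tid+1]) / 3.0f;
}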

#include <stdio.h>
#include <stdlib.h>

#define SRAND_VALUE 1985
#define BLOCK_SIZE_x 32
#define BLOCK_SIZE_y 16

__global__ void ghostRows(int dim, int* grid)
{
    // We want id ∈ [1,dim]
    int id = blockDim.x * blockIdx.x + threadIdx.x + 1;

    if (id <= dim)
    {
        //Copy first real row to bottom ghost row
        grid[(dim+2)*(dim+1)+id] = grid[(dim+2)+id]; 
        //Copy last real row to top ghost row
        grid[id] = grid[(dim+2)*dim + id]; 
    }
}
__global__ void ghostCols(int dim, int* grid)
{
    // We want id ∈ [0,dim+1]
    int id = blockDim.x * blockIdx.x + threadIdx.x;

    if (id <= dim+1)
    {
        //Copy first real column to right most ghost column
        grid[id*(dim+2)+dim+1] = grid[id*(dim+2)+1]; 
        //Copy last real column to left most ghost column
        grid[id*(dim+2)] = grid[id*(dim+2) + dim];                                                  
    }
}

__global__ void GOL(int dim, int *grid, int *newGrid)
{
    int iy = (blockDim.y - 2) * blockIdx.y + threadIdx.y;
    int ix = (blockDim.x - 2) * blockIdx.x + threadIdx.x;
    int id = iy * (dim+2) + ix;

    int i = threadIdx.y;
    int j = threadIdx.x;
    int numNeighbors;

    // Declare the shared memory on a per block level
    __shared__ int s_grid[BLOCK_SIZE_y][BLOCK_SIZE_x];

    // Copy cells into shared memory
    if (ix <= dim+1 && iy <= dim+1)
        s_grid[i][j] = grid[id];

    // Sync all threads in block
    __syncthreads();

    if (iy <= dim && ix <= dim) {
        if (i != 0 && i != blockDim.y-1 && j != 0 && j != blockDim.x-1) {

            // Get the number of neighbors for a given grid point
            numNeighbors = s_grid[i+1][j] + s_grid[i-1][j]     //upper lower
                         + s_grid[i][j+1] + s_grid[i][j-1]     //right left
                         + s_grid[i+1][j+1] + s_grid[i-1][j-1] //diagonals
                         + s_grid[i-1][j+1] + s_grid[i+1][j-1];

            int cell = s_grid[i][j];

            // Here we have explicitly all of the game rules
            if (cell == 1 && numNeighbors < 2)
                newGrid[id] = 0;
            else if (cell == 1 && (numNeighbors == 2 || numNeighbors == 3))
                newGrid[id] = 1;
            else if (cell == 1 && numNeighbors > 3)
                newGrid[id] = 0;
            else if (cell == 0 && numNeighbors == 3)
                newGrid[id] = 1;
            else
                newGrid[id] = cell;
        }
    }
}

int main(int argc, char* argv[])
{
    int i,j,iter;
    int *h_grid; //Grid on host
    int *d_grid; //Grid on device
    int *d_newGrid; //Second grid used on device only
    int *d_tmpGrid; //tmp grid pointer used to switch between grid and newGrid

    int dim = 1024; //Linear dimension of our grid - not counting ghost cells
    int maxIter = 1<<10; //Number of game steps

    size_t bytes = sizeof(int)*(dim+2)*(dim+2);
    // Allocate host Grid used for initial setup and read back from device
    h_grid = (int*)malloc(bytes);

    // Allocate device grids
    cudaMalloc(&d_grid, bytes);
    cudaMalloc(&d_newGrid, bytes);

    // Assign initial population randomly
    srand(SRAND_VALUE);
    for(i = 1; i<=dim; i++) {
        for(j = 1; j<=dim; j++) {
            h_grid[i*(dim+2)+j] = rand() % 2;
        }
    }

    cudaFuncSetCacheConfig(GOL, cudaFuncCachePreferShared);

    // Copy over initial game grid (Dim-1 threads)
    cudaMemcpy(d_grid, h_grid, bytes, cudaMemcpyHostToDevice);

    dim3 blockSize(BLOCK_SIZE_x, BLOCK_SIZE_y,1);
    int linGrid_x = (int)ceil(dim/(float)(BLOCK_SIZE_x-2));
    int linGrid_y = (int)ceil(dim/(float)(BLOCK_SIZE_y-2));
    dim3 gridSize(linGrid_x,linGrid_y,1);

    dim3 cpyBlockSize(BLOCK_SIZE_x,1,1);
    dim3 cpyGridRowsGridSize((int)ceil(dim/(float)cpyBlockSize.x),1,1);
    dim3 cpyGridColsGridSize((int)ceil((dim+2)/(float)cpyBlockSize.x),1,1);

    // Main game loop
    for (iter = 0; iter<maxIter; iter++) {

        ghostRows<<<cpyGridRowsGridSize, cpyBlockSize>>>(dim, d_grid);
        ghostCols<<<cpyGridColsGridSize, cpyBlockSize>>>(dim, d_grid);
        GOL<<<gridSize, blockSize>>>(dim, d_grid, d_newGrid);

        // Swap our grids and iterate again
        d_tmpGrid = d_grid;
        d_grid = d_newGrid;
        d_newGrid = d_tmpGrid;
    }//iter loop

    cudaError_t error = cudaGetLastError();
    if(error != cudaSuccess)
        printf("CUDA error %s\n",cudaGetErrorString(error));

    // Copy back results and sum
    cudaMemcpy(h_grid, d_grid, bytes, cudaMemcpyDeviceToHost);

    // Sum up alive cells and print results
    int total = 0;
    for (i = 1; i<=dim; i++) {
        for (j = 1; j<=dim; j++) {
            total += h_grid[i*(dim+2)+j];
        }
    }
    printf("Total Alive: %d\n", total);

    cudaFree(d_grid);
    cudaFree(d_newGrid);
    free(h_grid);

    return 0;
}

Changes

#define BLOCK_SIZE_x 32
#define BLOCK_SIZE_y 16

A square block size is not always best performance-wise; for greater flexibility we switch to separate x and y dimensions.

int iy = (blockDim.y -2) * blockIdx.y + threadIdx.y;
int ix = (blockDim.x -2) * blockIdx.x + threadIdx.x;
int id = iy * (dim+2) + ix;

Extra care has to be taken when determining the global thread IDs for the game kernel. Each thread block will contain two ghost rows and two ghost columns, which must be accounted for when calculating the global indexes ix and iy.

int i = threadIdx.y;
int j = threadIdx.x;

We will also define some convenience variables for the block thread indices.

// Declare the shared memory on a per block level
__shared__ int s_grid[BLOCK_SIZE_y][BLOCK_SIZE_x];

The __shared__ qualifier is used in the device kernel to allocate shared memory arrays; the specified size is on a per block basis. Because the size must be known at compile time, the BLOCK_SIZE_y and BLOCK_SIZE_x defines are used rather than blockDim.

// Copy cells into shared memory
if (ix <= dim+1 && iy <= dim+1)
    s_grid[i][j] = grid[id];

Each thread that is within the bounds of our grid, including ghost cells, will read one value from global memory into shared memory.

//Sync all threads in block
__syncthreads();

Before we use any shared memory we need to ensure that each thread in the block has finished reading its value from global memory into shared memory by calling __syncthreads(), a thread block level barrier. Once all threads in a given block have reached the barrier, execution resumes.

if (iy <= dim && ix <= dim) {
    if(i != 0 && i !=blockDim.y-1 && j != 0 && j !=blockDim.x-1) {

Before calculating our new values we must check that the global thread IDs, iy and ix, do not extend into, or past, the ghost cells. The local thread IDs, i and j, must also be checked so that no calculations are performed in the per block ghost cells.

// Get the number of neighbors for a given grid point
numNeighbors = s_grid[i+1][j] + s_grid[i-1][j]     //upper lower
            + s_grid[i][j+1] + s_grid[i][j-1]      //right left
            + s_grid[i+1][j+1] + s_grid[i-1][j-1] //diagonals
            + s_grid[i-1][j+1] + s_grid[i+1][j-1];

int cell = s_grid[i][j];

The standard GOL calculations are performed as before; the only difference is that all cell values are read from shared memory instead of global memory.

dim3 blockSize(BLOCK_SIZE_x, BLOCK_SIZE_y,1);
int linGrid_x = (int)ceil(dim/(float)(BLOCK_SIZE_x-2));
int linGrid_y = (int)ceil(dim/(float)(BLOCK_SIZE_y-2));
dim3 gridSize(linGrid_x,linGrid_y,1);

Slight modification is needed to set the correct block and grid size since we are now using two independent dimensions. Because each block computes only its interior cells, a block covers (BLOCK_SIZE_x-2) x (BLOCK_SIZE_y-2) = 30 x 14 cells of the game board; for dim = 1024 this gives linGrid_x = ceil(1024/30) = 35 and linGrid_y = ceil(1024/14) = 74.

GOL-CUDA-Texture.cu

The texture cache example is provided for completeness but is not fully documented. The texture cache is optimized for 2D spatial locality and does not pollute the L1 cache. Some programs may benefit from the smaller cache line compared to the L1 cache.

#include <stdio.h>
#include <stdlib.h>

#define SRAND_VALUE 1985

texture<int,2> gridTex;

__global__ void GOL(int dim, int *newGrid)
{
    int iy = blockDim.y * blockIdx.y + threadIdx.y;
    int ix = blockDim.x * blockIdx.x + threadIdx.x;
    int id = iy * dim + ix;

    int numNeighbors;

    float iyTex = (iy + 0.5f)/dim; //Normalized, texel centered coordinates
    float ixTex = (ix + 0.5f)/dim;
    float oneTex = 1.0f/dim;       //Normalized width of one texel

    if(iy < dim && ix < dim)
    {
        //Get the number of neighbors for a given grid point
        numNeighbors = tex2D(gridTex, iyTex+oneTex, ixTex) //upper/lower
                     + tex2D(gridTex, iyTex-oneTex, ixTex)
                     + tex2D(gridTex, iyTex, ixTex+oneTex) //right/left
                     + tex2D(gridTex, iyTex, ixTex-oneTex)
                     + tex2D(gridTex, iyTex-oneTex, ixTex-oneTex) //diagonals
                     + tex2D(gridTex, iyTex-oneTex, ixTex+oneTex)
                     + tex2D(gridTex, iyTex+oneTex, ixTex-oneTex)
                     + tex2D(gridTex, iyTex+oneTex, ixTex+oneTex);

        int cell = tex2D(gridTex, iyTex, ixTex);

        //Here we have explicitly all of the game rules
        if (cell == 1 && numNeighbors < 2)
            newGrid[id] = 0;
        else if (cell == 1 && (numNeighbors == 2 || numNeighbors == 3))
            newGrid[id] = 1;
        else if (cell == 1 && numNeighbors > 3)
            newGrid[id] = 0;
        else if (cell == 0 && numNeighbors == 3)
            newGrid[id] = 1;
        else
            newGrid[id] = cell;
    }
}

int main(int argc, char* argv[])
{
    int i,j,iter;
    int* h_grid; //Grid on host
    cudaArray* d_grid; //Grid on device
    int* d_newGrid; //Second grid used on device only

    int dim = 1024; //Linear dimension of our grid - not counting ghost cells
    int maxIter = 1<<10; //Number of game steps

    size_t bytes = sizeof(int)*dim*dim;
    //Allocate host Grid used for initial setup and read back from device
    h_grid = (int*)malloc(bytes);

    //Allocate device grids
    cudaMallocArray(&d_grid, &gridTex.channelDesc, dim, dim);
    cudaMalloc(&d_newGrid, bytes);

    //Assign initial population randomly
    srand(SRAND_VALUE);
    for(i = 0; i<dim; i++) {
        for(j = 0; j<dim; j++) {
            h_grid[i*dim+j] = rand() % 2;
        }
    }

    //Copy over initial game grid (Dim-1 threads)
    cudaMemcpyToArray (d_grid, 0, 0, h_grid, bytes, cudaMemcpyHostToDevice);
    cudaBindTextureToArray(gridTex, d_grid);

    gridTex.normalized = true;                    //Use normalized coordinates in [0,1)
    gridTex.addressMode[0] = cudaAddressModeWrap; //Out of bounds reads wrap around,
    gridTex.addressMode[1] = cudaAddressModeWrap; //giving periodic boundaries without ghost cells

    dim3 dimBlock(8,8);
    int linGrid = (int)ceil(dim/(float)dimBlock.x);
    dim3 dimGrid(linGrid,linGrid);

    //Main game loop
    for (iter = 0; iter<maxIter; iter++) {
        GOL<<<dimGrid,dimBlock>>>(dim, d_newGrid);

        //Swap our grids and iterate again
        cudaMemcpyToArray (d_grid, 0, 0, d_newGrid, bytes, cudaMemcpyDeviceToDevice);
    }//iter loop

    //Copy back results and sum
    cudaMemcpy(h_grid, d_newGrid, bytes, cudaMemcpyDeviceToHost);

    //Sum up alive cells and print results
    int total = 0;
    for (i = 0; i<dim; i++) {
        for (j = 0; j<dim; j++) {
            total += h_grid[i*dim+j];
        }
    }
    printf("Total Alive: %d\n", total);

    cudaFreeArray(d_grid); //d_grid was allocated with cudaMallocArray
    cudaFree(d_newGrid);
    free(h_grid);

    return 0;
}

CUDA Fortran Vector Addition

This tutorial shows a minimal conversion of our vector addition CPU code to PGI CUDA Fortran. Consider this a CUDA Fortran ‘Hello World’. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.

vecAdd.cuf

module kernel
    contains
    ! CUDA kernel. Each thread takes care of one element of c
    attributes(global) subroutine vecAdd_kernel(n, a, b, c)
        integer, value :: n
        real(8), device :: a(n), b(n), c(n)
        integer :: id

        ! Get our global thread ID
        id = (blockidx%x-1)*blockdim%x + threadidx%x

        ! Make sure we do not go out of bounds
        if (id <= n) then
            c(id) = a(id) + b(id)
        endif
    end subroutine vecAdd_kernel
end module kernel

program main
    use cudafor
    use kernel

    type(dim3) :: blockSize, gridSize
    real(8) :: sum
    integer :: i

    ! Size of vectors
    integer :: n = 100000

    ! Host input vectors
    real(8),dimension(:),allocatable :: h_a
    real(8),dimension(:),allocatable :: h_b
    !Host output vector
    real(8),dimension(:),allocatable :: h_c

    ! Device input vectors
    real(8),device,dimension(:),allocatable :: d_a
    real(8),device,dimension(:),allocatable :: d_b
    !Device output vector
    real(8),device,dimension(:),allocatable :: d_c

    ! Allocate memory for each vector on host
    allocate(h_a(n))
    allocate(h_b(n))
    allocate(h_c(n))

    ! Allocate memory for each vector on GPU
    allocate(d_a(n))
    allocate(d_b(n))
    allocate(d_c(n))

    ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    do i=1,n
        h_a(i) = sin(i*1D0)*sin(i*1D0)
        h_b(i) = cos(i*1D0)*cos(i*1D0)
    enddo

    ! Implicit copy of host vectors to device
    d_a = h_a(1:n)
    d_b = h_b(1:n)

    ! Number of threads in each thread block
    blockSize = dim3(1024,1,1)

    ! Number of thread blocks in grid
    gridSize = dim3(ceiling(real(n)/real(blockSize%x)) ,1,1)

    ! Execute the kernel
    call vecAdd_kernel<<<gridSize, blockSize>>>(n, d_a, d_b, d_c)

    ! Implicit copy of device array to host
    h_c = d_c(1:n)

    ! Sum up vector c and print result divided by n, this should equal 1 within error
    sum = 0.0
    do i=1,n
        sum = sum +  h_c(i)
    enddo
    sum = sum/real(n)
    print *, 'final result: ', sum

    ! Release device memory
    deallocate(d_a)
    deallocate(d_b)
    deallocate(d_c)

    ! Release host memory
    deallocate(h_a)
    deallocate(h_b)
    deallocate(h_c)

end program main

Changes

Kernel

The kernel is the heart of our CUDA code. When we launch a kernel we specify the number of threads per block (blockdim) and the number of blocks per grid (griddim). The total number of threads is (blockdim) * (griddim). Each thread evaluates one copy of the kernel.

module kernel
    contains
    !CUDA kernel. Each thread takes care of one element of c
    attributes(global) subroutine vecAdd_kernel(n, a, b, c)
        integer, value :: n
        real(8), device :: a(n), b(n), c(n)
        integer :: id

        !Get our global thread ID
        id = (blockidx%x-1)*blockdim%x + threadidx%x

        !Make sure we do not go out of bounds
        if (id <= n) then
            c(id) = a(id) + b(id)
        endif
    end subroutine vecAdd_kernel
end module kernel

Let’s take a look at what makes up this simple kernel.

module kernel
    contains
end module kernel

The interface for the kernel subroutine must be explicitly known to the calling program; one way of achieving this is to place the subroutine in a module.

attributes(global) subroutine vecAdd_kernel(n, a, b, c)

The global attribute specifies this is a CUDA kernel, otherwise normal Fortran subroutine syntax is used.

!Get our global thread ID
id = (blockidx%x-1)*blockdim%x + threadidx%x

We calculate the thread's global ID using CUDA Fortran provided derived data types. blockidx contains the block's position in the grid, ranging from 1 to griddim. threadidx is the thread's index inside of its associated block, ranging from 1 to blockdim.

if (id <= n) then

Unless we have an array length divisible by our blockdim we will not have the same number of threads launched as valid array components. To avoid overstepping our array we simply test to make sure our global thread ID is less than the length of our array.

c(id) = a(id) + b(id)

The thread ID is used to index the arrays that reside in global device memory. Each thread will load a value from a and b and write the sum to c.

Modules

use cudafor
use kernel

cudafor is a PGI provided module that supplies some of the CUDA specific derived types such as dim3. The kernel module we created must also be included.

Array Pointers

!Host input vectors
real(8),dimension(:),allocatable :: h_a
real(8),dimension(:),allocatable :: h_b
!Host output vector
real(8),dimension(:),allocatable :: h_c

!Device input vectors
real(8),device,dimension(:),allocatable :: d_a
real(8),device,dimension(:),allocatable :: d_b
!Device output vector
real(8),device,dimension(:),allocatable :: d_c

With the host CPU and GPU having separate memory spaces we must have two sets of pointers, one set for our host arrays and one set for our device arrays. Here we use the h_ and d_ prefix to differentiate them. Notice that the GPU pointers have the device attribute.

Allocate device memory

!Allocate memory for each vector on GPU
allocate(d_a(n))
allocate(d_b(n))
allocate(d_c(n))

Given a pointer with the device attribute, memory will be allocated in the GPU's global memory space.

Copy to device

!Implicit copy of host vectors to device
d_a = h_a(1:n)
d_b = h_b(1:n)

After we initialize the data on the host we must copy it over to the device; to do so we can use the syntax above. This is a convenient way of doing things, but care must be taken: the operation can be very time consuming, as we are initiating a DMA transfer of data from the host memory, over the PCI bus, to the device memory.

Thread mapping

!Number of threads in each thread block
blockSize = dim3(1024,1,1)

!number of thread blocks in grid
gridSize = dim3(ceiling(real(n)/real(blockSize%x)) ,1,1)

To launch our kernel we must specify the number of threads per block and the number of blocks in our grid. The maximum value of both is dependent on the device's compute capability. Normally the block size is chosen based upon kernel memory requirements and the grid size calculated such that enough threads are launched to cover the kernel domain. dim3 is a CUDA Fortran provided data type that has 3 dimensions; in this case we are dealing with a one dimensional block and grid, so we specify a dimensionality of 1 for the other two dimensions.

Launch kernel

call vecAdd_kernel<<< gridSize, blockSize >>>(n, d_a, d_b, d_c)

We launch our kernel with a modified Fortran subroutine syntax that lets us specify the grid and block sizes. Kernel calls are non-blocking.

Copy to host

!Implicit copy of device array to host
h_c = d_c(1:n)

The implicit copy from the device to the host will block until the kernel is completed.

Release device memory

deallocate(d_a)
deallocate(d_b)
deallocate(d_c)

Device memory is deallocated in the same programmatic manner as host allocated arrays.

Compiling

The PGI environment must be set up before compiling, so make sure the module is loaded:

$ module load cudatoolkit
$ ftn vecAdd.cuf -o vecAdd.out

Running

$ aprun ./vecAdd.out
final result: 1.000000

CUDA Fortran Game of Life

For this tutorial we will cover two different CUDA Fortran versions of our Game of Life program. The first will strictly use global memory and is a straightforward GPU port of our program; the second will introduce the use of shared memory. For a description of the Game of Life and the original CPU code please see the tutorial here. The full source can be viewed or downloaded at the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.

GOL.cuf

module kernels
    contains

    attributes(global) subroutine ghostCols_kernel(dim, grid)
        integer,value :: dim
        integer,device,dimension(dim+2,dim+2) :: grid
        integer :: id

        !We want id ∈ [2,dim+1]
        !Get Global thread ID
        id = (blockIdx%x-1) * blockDim%x + threadIdx%x + 1

        if(id < dim+2) then
            grid(id, dim+2) = grid(id, 2) !Copy first real column to right most ghost column
            grid(id, 1) = grid(id, dim+1) !Copy last real column to left most ghost column
        endif

    end subroutine ghostCols_kernel

    attributes(global) subroutine ghostRows_kernel(dim, grid)
        integer,value :: dim
        integer,device,dimension(dim+2,dim+2) :: grid
        integer :: id

        !We want id ∈ [1,dim+2]
        !Get Global thread ID
        id = (blockIdx%x-1) * blockDim%x + threadIdx%x

        if(id <= dim+2) then
            grid(dim+2, id) = grid(2, id) !Copy first real row to bottom ghost row
            grid(1, id) = grid(dim+1 ,id) !Copy last real row to top ghost row
        endif
    end subroutine ghostRows_kernel

    attributes(global) subroutine GOL_kernel(dim, grid, newGrid)
        integer,value :: dim
        integer,device,dimension(dim+2,dim+2) :: grid, newGrid
        integer :: idx, idy, id, numNeighbors, cell

        !Get global thread IDs
        idx = (blockIdx%x-1) * blockDim%x + threadIdx%x + 1
        idy = (blockIdx%y-1) * blockDim%y + threadIdx%y + 1

        if(idx < dim+2 .AND. idy < dim+2) then
            numNeighbors = grid(idy,idx+1) + grid(idy,idx-1) &     !right & left
                         + grid(idy+1,idx) + grid(idy-1,idx) &     !upper & lower
                         + grid(idy+1,idx+1) + grid(idy-1,idx-1) & !diagonals
                         + grid(idy-1,idx+1) + grid(idy+1,idx-1)

            cell = grid(idy,idx)
            !Here we have explicitly all of the game rules
            if(cell == 1 .AND. numNeighbors < 2) then
                newGrid(idy,idx) = 0
            elseif(cell == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
                newGrid(idy,idx) = 1
            elseif(cell == 1 .AND. numNeighbors > 3) then
                newGrid(idy,idx) = 0
            elseif(cell == 0 .AND. numNeighbors == 3) then
                newGrid(idy,idx) = 1
            else
                newGrid(idy,idx) = cell
            endif
        endif

    end subroutine GOL_kernel
end module kernels

program main
    use cudafor
    use kernels

    implicit none

    integer :: i,j,iter,seed(8),numNeighbors,total
    real :: randm
    !Linear game grid dimension
    integer :: dim = 1024
    !Number of game iterations
    integer :: maxIter = 2**10
    integer :: blockDim = 16
    type(dim3) :: cpyBlockSize, cpyColsGridSize, cpyRowsGridSize, blockSize, gridSize

    !Host game grid array
    integer,dimension(:,:),allocatable :: h_grid

    !Device game grid arrays
    integer,device,dimension(:,:),allocatable :: d_grid, d_newGrid

    !Allocate square grid of (dim+2)^2 elements on host, 2 added for ghost cells
    allocate(h_grid(dim+2,dim+2))

    !Allocate grid on device
    allocate(d_grid(dim+2,dim+2))
    allocate(d_newGrid(dim+2,dim+2))

    !Assign initial population randomly
    seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/)
    call random_seed(PUT=seed)
    do j=1,dim
        do i=1,dim
            call random_number(randm)
            h_grid(i,j) = nint(randm)
        enddo
    enddo

    !Implicit copy of host vectors to device
    d_grid = h_grid

    !Set the number of threads per block and blocks per grid
    cpyBlockSize    = dim3(blockDim, 1, 1)
    cpyColsGridSize = dim3(ceiling(real(dim)/real(cpyBlockSize%x)), 1, 1)
    cpyRowsGridSize = dim3(ceiling(real(dim+2)/real(cpyBlockSize%x)), 1, 1)

    blockSize   = dim3(blockDim, blockDim, 1)
    gridSize    = dim3(ceiling(real(dim)/real(blockSize%x)), &
                         ceiling(real(dim)/real(blockSize%y)), 1)

    !Main game loop
    do iter=1,maxIter
        !device pointers are not supported currently so manually switch grid/newGrid
        if(MOD(iter,2) == 1) then
            call ghostCols_kernel<<<cpyColsGridSize, cpyBlockSize>>>(dim, d_grid)
            call ghostRows_kernel<<<cpyRowsGridSize, cpyBlockSize>>>(dim, d_grid)
            call GOL_kernel<<<gridSize,blockSize>>>(dim, d_grid, d_newGrid)
        else
            call ghostCols_kernel<<<cpyColsGridSize, cpyBlockSize>>>(dim, d_newGrid)
            call ghostRows_kernel<<<cpyRowsGridSize, cpyBlockSize>>>(dim, d_newGrid)
            call GOL_kernel<<<gridSize,blockSize>>>(dim, d_newGrid, d_grid)
        endif
    enddo

    !Implicit copy of device vector back to host
    h_grid = d_grid

    !Sum up alive cells and print results
    total = 0
    do j=2,dim+1
        do i=2,dim+1
            total = total + h_grid(i,j)
        enddo
    enddo
    print *, "Total Alive", total

    !Release memory
    deallocate(h_grid)
    deallocate(d_grid)
    deallocate(d_newGrid)

end program

Changes to GOL.cuf

    attributes(global) subroutine ghostCols_kernel(dim, grid)
        integer,value :: dim
        integer,device,dimension(dim+2,dim+2) :: grid
        integer :: id

        !We want id ∈ [2,dim+1]
        !Get Global thread ID
        id = (blockIdx%x-1) * blockDim%x + threadIdx%x + 1

        if(id < dim+2) then
            grid(id, dim+2) = grid(id, 2) !Copy first real column to right most ghost column
            grid(id, 1) = grid(id, dim+1) !Copy last real column to left most ghost column
        endif

    end subroutine ghostCols_kernel

    attributes(global) subroutine ghostRows_kernel(dim, grid)
        integer,value :: dim
        integer,device,dimension(dim+2,dim+2) :: grid
        integer :: id

        !We want id ∈ [1,dim+2]
        !Get Global thread ID
        id = (blockIdx%x-1) * blockDim%x + threadIdx%x

        if(id <= dim+2) then
            grid(dim+2, id) = grid(2, id) !Copy first real row to bottom ghost row
            grid(1, id) = grid(dim+1 ,id) !Copy last real row to top ghost row
        endif
    end subroutine ghostRows_kernel

The ghostRows_kernel and ghostCols_kernel device subroutines are used to fill in our ghost rows and columns in device global memory. Each thread is responsible for copying two integers.

    attributes(global) subroutine GOL_kernel(dim, grid, newGrid)
        integer,value :: dim
        integer,device,dimension(dim+2,dim+2) :: grid, newGrid
        integer :: idx, idy, id, numNeighbors, cell

        !Get global thread IDs
        idx = (blockIdx%x-1) * blockDim%x + threadIdx%x + 1
        idy = (blockIdx%y-1) * blockDim%y + threadIdx%y + 1

        if(idx < dim+2 .AND. idy < dim+2) then
            numNeighbors = grid(idy,idx+1) + grid(idy,idx-1) &     !right & left
                         + grid(idy+1,idx) + grid(idy-1,idx) &     !upper & lower
                         + grid(idy+1,idx+1) + grid(idy-1,idx-1) & !diagonals
                         + grid(idy-1,idx+1) + grid(idy+1,idx-1)

            cell = grid(idy,idx)
            !Here we have explicitly all of the game rules
            if(cell == 1 .AND. numNeighbors < 2) then
                newGrid(idy,idx) = 0
            elseif(cell == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
                newGrid(idy,idx) = 1
            elseif(cell == 1 .AND. numNeighbors > 3) then
                newGrid(idy,idx) = 0
            elseif(cell == 0 .AND. numNeighbors == 3) then
                newGrid(idy,idx) = 1
            else
                newGrid(idy,idx) = cell
            endif
        endif

    end subroutine GOL_kernel

The GOL kernel will perform the majority of our work. We use a two dimensional block and grid size to better fit our problem's geometry. Once the thread calculates its global ID and tests that it is not out of bounds, it will read one cell value from global memory, calculate the new value, and write the result back to global memory.

    !Host game grid array
    integer,dimension(:,:),allocatable :: h_grid
 
    !Device game grid arrays
    integer,device,dimension(:,:),allocatable :: d_grid, d_newGrid

CUDA Fortran does not currently support device pointers, so we change our array attribute from pointer to allocatable.

    !Allocate grid on device
    allocate(d_grid(dim+2,dim+2))
    allocate(d_newGrid(dim+2,dim+2))

Given an array with the device attribute, memory will be allocated in the GPU's global memory space.

    !Implicit copy of host vectors to device
    d_grid = h_grid

By setting the device array equal to a host array a host to device DMA copy is implicitly initiated.

    !Set the number of threads per block and blocks per grid
    cpyBlockSize    = dim3(blockDim, 1, 1)
    cpyColsGridSize = dim3(ceiling(real(dim)/real(cpyBlockSize%x)), 1, 1)
    cpyRowsGridSize = dim3(ceiling(real(dim+2)/real(cpyBlockSize%x)), 1, 1)
 
    blockSize   = dim3(blockDim, blockDim, 1)
    gridSize    = dim3(ceiling(real(dim)/real(blockSize%x)), &
                         ceiling(real(dim)/real(blockSize%y)), 1)

With our block size specified we must calculate the smallest integer number of blocks necessary to completely cover our game board; this is also of the CUDA Fortran data type dim3.

    !Main game loop
    do iter=1,maxIter
        !device pointers are not supported currently so manually switch grid/newGrid
        if(MOD(iter,2) == 1) then
            call ghostCols_kernel<<<cpyColsGridSize, cpyBlockSize>>>(dim, d_grid)
            call ghostRows_kernel<<<cpyRowsGridSize, cpyBlockSize>>>(dim, d_grid)
            call GOL_kernel<<<gridSize,blockSize>>>(dim, d_grid, d_newGrid)
        else
            call ghostCols_kernel<<<cpyColsGridSize, cpyBlockSize>>>(dim, d_newGrid)
            call ghostRows_kernel<<<cpyRowsGridSize, cpyBlockSize>>>(dim, d_newGrid)
            call GOL_kernel<<<gridSize,blockSize>>>(dim, d_newGrid, d_grid)
        endif
    enddo

Because CUDA Fortran does not currently support device pointers, we cannot simply switch our device grid pointers at the end of each game iteration; instead we switch the arguments our kernels are called with. This achieves the same effect with a minimal increase in code size.

    !Implicit copy of device vector back to host
    h_grid = d_grid

After our kernel has completed we can implicitly copy the grid array from the device back to the host.

GOL-Shared.cuf

In the first example each thread read nine values from global memory and wrote back one value. The inefficiency of this approach is that each value read from global memory is used not only in calculating its own new value but also by its eight closest neighbors in calculating their new values. Although some global memory access penalties have been alleviated in the latest generation of cards with the addition of an L2 cache, you will find shared memory is still an incredibly important aspect of GPU programming. Shared memory provides a very fast memory space (comparable to register access) that is shared between threads of the same thread block. In the case where multiple threads in a particular thread block are accessing the same global memory item, shared memory may be used to reduce the number of global memory calls. A typical use for shared memory is as follows:

  1. Have each thread read from global memory into shared memory
  2. Synchronize threads to ensure all threads in a given block have finished reading from global memory
  3. Have each thread perform calculations using shared memory values
  4. Have each thread write result from shared memory back to device global memory

For detailed coverage of shared memory please see the CUDA C Programming Guide and CUDA Fortran Programming Guide.

module kernel
    contains

    attributes(global) subroutine ghostCols_kernel(dim, grid)
        integer,value :: dim
        integer,device,dimension(dim+2,dim+2) :: grid
        integer :: id

        !We want id ∈ [2,dim+1]
        !Get Global thread ID
        id = (blockIdx%x-1) * blockDim%x + threadIdx%x + 1

        if(id < dim+2) then
            grid(id, dim+2) = grid(id, 2) !Copy first real column to right most ghost column
            grid(id, 1) = grid(id, dim+1) !Copy last real column to left most ghost column
        endif

    end subroutine ghostCols_kernel

    attributes(global) subroutine ghostRows_kernel(dim, grid)
        integer,value :: dim
        integer,device,dimension(dim+2,dim+2) :: grid
        integer :: id

        !We want id ∈ [1,dim+2]
        !Get Global thread ID
        id = (blockIdx%x-1) * blockDim%x + threadIdx%x

        if(id <= dim+2) then
            grid(dim+2, id) = grid(2, id) !Copy first real row to bottom ghost row
            grid(1, id) = grid(dim+1 ,id) !Copy last real row to top ghost row
        endif
    end subroutine ghostRows_kernel

    attributes(global) subroutine GOL_kernel(dim, grid, newGrid)
        integer,value :: dim
        integer,device,dimension(dim+2,dim+2) :: grid, newGrid
        integer :: idx, idy, id, i, j, numNeighbors, cell

        !Declare the shared memory on a per block level
        integer,shared :: s_grid(16, 16)

        !Get global thread IDs
        idx = (blockIdx%x-1) * (blockDim%x-2) + threadIdx%x
        idy = (blockIdx%y-1) * (blockDim%y-2) + threadIdx%y

        !Get local thread IDs
        i = threadIdx%y
        j = threadIdx%x

        !Copy cells into shared memory
        if (idx <= dim+2 .AND. idy <= dim+2) then
            s_grid(i,j) = grid(idy,idx)
        endif

        !Sync all threads in block
        call syncthreads()

        if (idx < dim+2 .AND. idy < dim+2) then
            if (i /= 1 .AND. i /= blockDim%y .AND. j /= 1 .AND. j /= blockDim%x) then

                numNeighbors = s_grid(i,j+1) + s_grid(i,j-1) &     !right & left
                             + s_grid(i+1,j) + s_grid(i-1,j) &     !upper & lower
                             + s_grid(i+1,j+1) + s_grid(i-1,j-1) & !diagonals
                             + s_grid(i-1,j+1) + s_grid(i+1,j-1)

                cell = s_grid(i,j)

                !Here we have explicitly all of the game rules
                if(cell == 1 .AND. numNeighbors < 2) then
                    newGrid(idy,idx) = 0
                elseif(cell == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
                    newGrid(idy,idx) = 1
                elseif(cell == 1 .AND. numNeighbors > 3) then
                    newGrid(idy,idx) = 0
                elseif(cell == 0 .AND. numNeighbors == 3) then
                    newGrid(idy,idx) = 1
                else
                    newGrid(idy,idx) = cell
                endif

            endif
        endif

    end subroutine GOL_kernel
end module kernel

program main
    use cudafor
    use kernel

    implicit none

    integer :: i,j,iter,seed(8),numNeighbors,total
    real :: randm
    !Linear game grid dimension
    integer :: dim = 1024
    !Number of game iterations
    integer :: maxIter = 2**10
    integer :: blockDim = 16
    type(dim3) :: cpyBlockSize, cpyColsGridSize, cpyRowsGridSize, blockSize, gridSize

    !Host game grid array
    integer,dimension(:,:),allocatable :: h_grid

    !Device game grid array
    integer,device,dimension(:,:),allocatable :: d_grid, d_newGrid

    !Allocate square grid of (dim+2)^2 elements on host, 2 added for ghost cells
    allocate(h_grid(dim+2,dim+2))

    !Allocate grid on device
    allocate(d_grid(dim+2,dim+2))
    allocate(d_newGrid(dim+2,dim+2))

    !Assign initial population randomly
    seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/)
    call random_seed(PUT=seed)
    do j=1,dim
        do i=1,dim
            call random_number(randm)
            h_grid(i,j) = nint(randm)
        enddo
    enddo

    !Implicit copy of host vectors to device
    d_grid = h_grid

    !Set the number of threads per block and blocks per grid
    cpyBlockSize    = dim3(blockDim, 1, 1)
    cpyColsGridSize = dim3(ceiling(real(dim)/real(cpyBlockSize%x)), 1, 1)
    cpyRowsGridSize = dim3(ceiling(real(dim+2)/real(cpyBlockSize%x)), 1, 1)

    blockSize   = dim3(blockDim-2, blockDim-2, 1)
    gridSize    = dim3(ceiling(real(dim)/real(blockSize%x-2)), &
                    ceiling(real(dim)/real(blockSize%y-2)), 1)

    !Main game loop
    do iter=1,maxIter
        !device pointers are not supported currently, so manually switch grid/newGrid
        if(MOD(iter,2) == 1) then
            call ghostCols_kernel<<<cpyColsGridSize, cpyBlockSize>>>(dim, d_grid)
            call ghostRows_kernel<<<cpyRowsGridSize, cpyBlockSize>>>(dim, d_grid)
            call GOL_kernel<<<gridSize,blockSize>>>(dim, d_grid, d_newGrid)
        else
            call ghostCols_kernel<<<cpyColsGridSize, cpyBlockSize>>>(dim, d_newGrid)
            call ghostRows_kernel<<<cpyRowsGridSize, cpyBlockSize>>>(dim, d_newGrid)
            call GOL_kernel<<<gridSize,blockSize>>>(dim, d_newGrid, d_grid)
        endif
    enddo

    !Implicit copy of device vector back to host
    h_grid = d_grid

    !Sum up alive cells and print results
    total = 0
    do j=2,dim+1
        do i=2,dim+1
            total = total + h_grid(i,j)
        enddo
    enddo
    print *, "Total Alive", total

    !Release memory
    deallocate(h_grid)
    deallocate(d_grid)
    deallocate(d_newGrid)

end program

Changes to GOL-Shared.cuf

!Declare the shared memory on a per block level
integer,shared :: s_grid(16, 16)

The shared attribute is used in the device kernel to allocate shared memory arrays. The specified size is on a per-block basis.

!Get global thread IDs
idx = (blockIdx%x-1) * (blockDim%x-2) + threadIdx%x
idy = (blockIdx%y-1) * (blockDim%y-2) + threadIdx%y

Extra care has to be taken when determining our global thread ID for the game kernel. Each thread block will contain two ghost rows and two ghost columns which must be accounted for when calculating the global indexes idx and idy.

!Get local thread IDs
i = threadIdx%y
j = threadIdx%x

We define some convenience variables for the block thread indices.

!Copy cells into shared memory
if (idx <= dim+2 .AND. idy <= dim+2) then
    s_grid(i,j) = grid(idy,idx)
endif

Each thread that is within the bounds of our grid, including ghost cells, will read one value from global memory into shared memory.

!Sync all threads in block
call syncthreads()

Before we use any shared memory we need to ensure that each thread in the block has finished reading its value from global memory into shared memory by calling syncthreads, a thread-block-level barrier. Once all threads in a given block have reached the barrier, execution resumes.

if (idx < dim+2 .AND. idy < dim+2) then
    if (i /= 1 .AND. i /= blockDim%y .AND. j /= 1 .AND. j /= blockDim%x) then

Before calculating new values we must check that our global thread IDs, idy and idx, do not extend into, or past, the ghost cells. The local thread IDs, i and j, must also be checked so that no calculations are performed in the per-block ghost cells.

numNeighbors = s_grid(i,j+1) + s_grid(i,j-1) &     !right & left
             + s_grid(i+1,j) + s_grid(i-1,j) &     !upper & lower
             + s_grid(i+1,j+1) + s_grid(i-1,j-1) & !diagonals
             + s_grid(i-1,j+1) + s_grid(i+1,j-1)
 
cell = s_grid(i,j)

The standard GOL calculations are performed as before, only now all cell values are read from shared memory instead of global memory.

    blockSize   = dim3(blockDim-2, blockDim-2, 1)
    gridSize    = dim3(ceiling(real(dim)/real(blockSize%x-2)), &
                    ceiling(real(dim)/real(blockSize%y-2)), 1)

A slight modification is needed to set the correct block and grid sizes: each block now carries its own ghost rows and columns, so the number of blocks must be calculated from the interior cells each block actually updates rather than from the full block size.
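
For the values used here (dim = 1024, blockDim = 16), each block launches 14×14 threads, of which the inner 12×12 compute new cell values, so ceiling(1024/12) = 86 blocks are needed in each direction.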

The use of GPU acceleration can add a useful level of parallelization to your code. This tutorial will show how to parallelize a simple Monte Carlo operation using a GPU accelerator. The first code will be a simple serial implementation of the Monte Carlo simulation used to compute π. The next samples will be the same code but accelerated using CUDA, OpenACC, and then Thrust.

Serial

The codes use Monte Carlo methods to estimate π. To set up the estimate, randomly located points are generated within a 2×2 square which has a circle inscribed within it; think of a game of darts. The algorithm generates a large number of random points and checks to see if the coordinates, x and y, of each point lie inside the circle: x² + y² ≤ 1. The ratio, P, of points inside the circle to the total number of points tried is then calculated. Using this ratio, the approximation of π can be computed by noting that P approaches the ratio of the area of the circle to the area of the square when the number of points, niter, is large. Thus we can say P = (πr²)/(4r²) = π/4 and solve for π by multiplying P by 4. (Note: the more points generated (larger niter), the more accurate the approximation will be.)

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

int main(int argc, char* argv[])
{
    int niter = 100000;
    double x,y;
    int i;
    int count=0;
    double z;
    double pi;
    //seed the random number generator
    srand(time(NULL));
    //main loop
    for (i=0; i<niter; ++i)
    {
        //get random points in [0,1]
        x = (double)rand()/RAND_MAX;
        y = (double)rand()/RAND_MAX;
        z = sqrt((x*x)+(y*y));
        //check to see if point is in unit circle
        if (z<=1)
        {
            ++count;
        }
    }
    //pi = 4 * (points in circle / total points)
    pi = ((double)count/(double)niter)*4.0;
    printf("Pi: %f\n", pi);

    return 0;
}

To compile this code, run:

$ cc serialpi.c -o a.out

The resulting executable can either be run in an interactive session or via batch script. To launch an interactive session run:

$ qsub -I -A [PROJID] -l walltime=1:00,nodes=1

This executable can also be run via batch script. A sample batch script could look like this:

#!/bin/bash
#    Begin PBS directives
#PBS -A <PROJID>
#PBS -N pi
#PBS -j oe
#PBS -l walltime=0:01:00,nodes=1
#PBS -V
#    End PBS directives and begin shell commands

cd $PBS_O_WORKDIR
date
aprun -n1 ./a.out

Save this in your project’s scratch directory as something like titan.pbs and copy your executable to $MEMBERWORK/<projid>.

To submit the script to the batch node, run

$ qsub titan.pbs

CUDA

Compute Unified Device Architecture (CUDA) is the parallel programming model developed by NVIDIA that extends C or Fortran code with features that target the GPU device. In CUDA, the programmer defines functions to be launched in many thousands or millions of parallel threads on the GPU. These functions are generally referred to as CUDA kernels or simply “kernels”. When writing these kernels, the programmer needs to keep in mind how many threads and blocks of threads they want launched on the GPU at runtime. A possible kernel for this code could look something like this:

__global__ void kernel(int* count_d, float* randomnums)
{
        int i;
        double x,y,z;
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        i = tid;
        int xidx = 0, yidx = 0;

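        //each thread consumes two consecutive random numbers:
        //x comes from index 2*tid and y from index 2*tid+1 (here i = tid)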
        xidx = (i+i);
        yidx = (xidx+1);

        x = randomnums[xidx];
        y = randomnums[yidx];
        z = ((x*x)+(y*y));

        if (z<=1)
                count_d[tid] = 1;
        else
                count_d[tid] = 0;
}

Take note of the __global__ decorator. This tells the nvcc compiler that this function is a CUDA kernel that will be called by host code and run on the device. Also, note the variables blockDim.x, blockIdx.x, and threadIdx.x. These are built-in variables that identify a thread's place in the overall three-dimensional grid, similar in concept to an MPI rank or OpenMP thread number.
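
As a minimal illustration (separate from the Monte Carlo code; the kernel name index_demo and the array out are chosen just for this sketch), the following kernel writes each thread's global ID into an array. Launched as index_demo<<<4, 256>>>(out), block 0 covers IDs 0 through 255, block 1 covers 256 through 511, and so on:

//Illustrative sketch only: each launched thread computes its global ID
//from the built-in variables and records it in its own array slot
__global__ void index_demo(int* out)
{
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        out[tid] = tid;
}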

Another thing to keep in mind is the fact that the CPU (host) and GPU (device) have separate memory. This means that the host can't directly access memory on the device and vice versa. To overcome this, CUDA gives us a set of functions to transfer and allocate data in the GPU device's RAM. For example, in the Monte Carlo code there needs to be a set of random numbers available on the device for the kernel to use, as well as an empty array to hold a 1 or 0, depending on whether or not the point is in the circle (see the Serial section for more information on the Monte Carlo specifics). So after the random numbers are generated and stored into an array, that array needs to be copied to the device, and the empty array needs to be allocated on the device. To do this we use two different CUDA functions. First, the cudaMalloc() function gives us the ability to allocate memory on the GPU:

cudaError_t cudaMalloc(void** devPtr,
                       size_t size)

Before this function is called, an int* named count_d needs to be declared (note that device pointer variables generally end with "_d"). Also, the programmer needs to determine the size of the array. In this case, since the kernel processes only one point per thread, the total number of threads can be used as the size (thread blocks × threads per thread block = total number of threads). Once this is done, the cudaMalloc() function can be called. For this code, it could look like this:

cudaMalloc((void**)&count_d, (blocks*threads)*sizeof(int));

Also, an array to house the random numbers (2 per thread this time) needs to be allocated on the device using cudaMalloc. This is done in the same fashion as before:

cudaMalloc((void**)&randomnums, (2*niter)*sizeof(float));

Once there is a place to contain the required random numbers, they can be generated. The CURAND library contains kernels that generate high quality random numbers using the GPU. CURAND needs a bit of setting up before it can generate the random numbers. First, a CURAND generator needs to be declared:

curandGenerator_t gen;

The next function creates the generator and declares its type.

curandStatus_t curandCreateGenerator(curandGenerator_t* generator,
                                     curandRngType_t rng_type);

The next one sets the seed.

curandStatus_t curandSetPseudoRandomGeneratorSeed(curandGenerator_t generator,
                                                  unsigned long long seed);

Next, curandGenerateUniform() is the actual function that generates the random numbers.

curandStatus_t curandGenerateUniform(curandGenerator_t generator,
                                     float* outputPtr,
                                     size_t num);

Finally, the random number generator can be destroyed because there's no further need for it.

curandStatus_t curandDestroyGenerator(curandGenerator_t generator);

Using these definitions, create the random numbers as follows:

curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MRG32K3A);
curandSetPseudoRandomGeneratorSeed(gen, 4294967296ULL^time(NULL));
curandGenerateUniform(gen, randomnums, (2*niter));
curandDestroyGenerator(gen);

Now, the CUDA kernel needs to be called and launched on the device. Before this can be done, the programmer needs to know how many threads per block and how many blocks need to be launched. In this case, 1000 threads/block and 100 blocks should work fine (these numbers were picked arbitrarily; the programmer can set their own). Once this is determined, the kernel can be launched and passed the correct device pointers.
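
With these values, 100 blocks × 1000 threads/block = 100,000 total threads, one thread per Monte Carlo point; this is why the full code below notes that niter must equal threads*blocks.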

kernel <<<blocks, threads>>> (count_d, randomnums);

Once the kernel has finished, the count_d array needs to be copied back to the host. Similar to the standard C function memcpy(), we use the CUDA function cudaMemcpy() which looks like this:

cudaError_t cudaMemcpy(void * dst,
                       const void * src,
                       size_t count,
                       enum cudaMemcpyKind kind)

Before this can be done, the programmer needs to declare an empty array on the host that will contain the data that is copied back with cudaMemcpy. Also, make sure the amount of data copied back (size_t count) is the right size (the same size as the cudaMalloc'd array). Finally, cudaMemcpy needs to know in which direction the copy is to be performed. In this particular instance, the array is on the device and needs to be copied to the host, so the final parameter for the cudaMemcpy needs to be the built-in "cudaMemcpyDeviceToHost". The completed cudaMemcpy line looks like this:

cudaMemcpy(count, count_d, blocks*threads*sizeof(int),
           cudaMemcpyDeviceToHost);

Now that all the CUDA code is set up, the rest of the code looks like this:

cpi.cu:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand.h>
#include <time.h>

//Declare the CUDA kernel
__global__ void kernel(int* count_d, float* randomnums)
{
        int i;
        double x,y,z;
        //Find the overall ID of the thread
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        i = tid;
        int xidx = 0, yidx = 0;

        //Do the MonteCarlo!
        xidx = (i+i);
        yidx = (xidx+1);

        //Get the random x,y points
        x = randomnums[xidx];
        y = randomnums[yidx];
        z = ((x*x)+(y*y));

        if (z<=1)
                count_d[tid] = 1;
        else
                count_d[tid] = 0;
}

//Used to check if there are any errors launching the kernel
void CUDAErrorCheck()
{
        cudaError_t error = cudaGetLastError();
        if (error != cudaSuccess)
        {
                printf("CUDA error : %s (%d)\n", cudaGetErrorString(error), error);
                exit(0);
        }
}

int main(int argc,char* argv[])
{
        //NOTE: if threads and/or blocks is changed, niter needs to be changed to reflect
        //that change (niter=threads*blocks)
        int niter = 100000;
        float *randomnums;
        double pi;
        //Allocate the array for the random numbers
        cudaMalloc((void**)&randomnums, (2*niter)*sizeof(float));
        //Use CuRand to generate an array of random numbers on the device
        int status;
        curandGenerator_t gen;
        status = curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MRG32K3A);
        status |= curandSetPseudoRandomGeneratorSeed(gen, 4294967296ULL^time(NULL));
        status |= curandGenerateUniform(gen, randomnums, (2*niter));
        status |= curandDestroyGenerator(gen);
        //Check to see if there was any problem launching the CURAND kernels and generating
        //the random numbers on the device
        if (status != CURAND_STATUS_SUCCESS)
        {
                printf("CuRand Failure\n");
                exit(EXIT_FAILURE);
        }

        //Threads per thread block to be launched
        int threads = 1000;
        //Number of thread blocks launched
        int blocks = 100;
        int* count_d;
        int *count = (int*)malloc(blocks*threads*sizeof(int));
        unsigned int reducedcount = 0;
        //Allocate the array to hold a value (1,0) whether the point in is the circle (1) or not (0)
        cudaMalloc((void**)&count_d, (blocks*threads)*sizeof(int));
        CUDAErrorCheck();
        //Launch the kernel
        kernel <<<blocks, threads>>> (count_d, randomnums);
        //Acts as a barrier until the kernel is finished. Kernel launches are nonblocking, so
        //without the sync the host code would continue regardless of whether the kernel had finished
        cudaDeviceSynchronize();
        CUDAErrorCheck();
        //Copy the resulting array back
        cudaMemcpy(count, count_d, blocks*threads*sizeof(int), cudaMemcpyDeviceToHost);
        int i = 0;

        //Reduce array into int
        for(i = 0; i<niter; i++)
                reducedcount += count[i];

        //Free the cudaMalloc()'d arrays
        cudaFree(randomnums);
        cudaFree(count_d);
        free(count);

        //Find the ratio
        pi = ((double)reducedcount/niter)*4.0;
        printf("Pi: %f\n", pi);

        return 0;
}

Compiling

Before this can be compiled, the CUDA toolkit needs to be loaded. To do this, run:

$ module load cudatoolkit

Now that the module is loaded, the nvcc compiler wrapper is available. To compile this run:

$ nvcc -arch=sm_35 -lcurand cpi.cu -o a.out

Here, the -arch=sm_35 switch is used to tell the compiler that the target device Compute Capability is 3.5. This allows the programmer to do more things in the CUDA kernel, for instance use double precision. The -lcurand switch tells the compiler to link in the CURAND library and give the code access to the CURAND functions.

Then copy the executable (in this case, a.out) to the scratch directory (see the serial section for more info on finding the scratch directory). This can be run either from an interactive job or it can be submitted to the queue to be run. To run this executable job inside an interactive job, use this:

$ qsub -I -A [PROJID] -l walltime=1:00,nodes=1

Before you run this, change [PROJID] to the appropriate project ID. Once the job starts, change to the scratch directory and execute the command

$ aprun -n1 ./a.out

Alternately, this can be submitted as a batch job to the queues. To do this, a batch script needs to be written and submitted. A sample batch script could look like this:

#!/bin/bash
#    Begin PBS directives
#PBS -A <PROJID>
#PBS -N pi
#PBS -j oe
#PBS -l walltime=0:01:00,nodes=1
#PBS -V
#    End PBS directives and begin shell commands

cd $PBS_O_WORKDIR
date
aprun -n 1 ./a.out

Again, before this script can be submitted, the appropriate project ID needs to be added instead of [PROJID]. Save this file as something unique, like titan.pbs, and submit it by running:

$ qsub titan.pbs

OpenACC

OpenACC provides an alternative pragma-based accelerator programming model that can be easier to use than CUDA. These pragmas are similar to OpenMP's but launch threads on the GPU instead of the CPU. Parallelizing the Monte Carlo code with OpenACC is a fairly straightforward procedure. The programmer must first identify the best region to parallelize. In this code, the most parallelizable region is the main for loop which actually performs the Monte Carlo simulation. The main loop can be broken up into two for loops: one loop that computes the distance of each point from the origin and stores it in an array, and a second loop that reduces that array into a single integer count of the points inside the circle, as in the CUDA version. The OpenACC pragma should look something like this:

#pragma acc parallel copy(randomnums[0:200000]) copyout(output[0:100000]) create(x,y,z)

Here, "acc" tells the compiler that this is an OpenACC pragma, and the parallel construct is used to launch multiple threads on the GPU. The copy(randomnums[0:200000]) clause copies the randomnums array onto the device. The copyout(output[0:100000]) clause allocates the output array on the device and copies it back to the host once the region is finished. Finally, the create(x,y,z) clause creates the three variables, x, y, and z, on the device without any transfer to or from the host. The parallel region looks something like this:

        #pragma acc parallel copy(randomnums[0:200000]) copyout(output[0:100000]) create(x,y,z)
        {
                for (i = 0; i<100000; i++)
                {
                        int temp = i+i;
                        x = randomnums[temp];
                        y = randomnums[temp+1];
                        z = ((x*x)+(y*y));
                        output[i] = z;
                }
                for (i = 0; i<100000; i++)
                {
                        if (output[i]<=1)
                        {
                                ++count;
                        }
                        else
                                continue;
                }
        }

The complete code looks like this:

accpi.cpp

#include <stdio.h>
#include <time.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
        double niter = 100000;
        double x,y;
        int count=0;
        double z;
        double pi;
        int a = 0;
        time_t t;
        double randomnums[200000];
        double output[100000];
        srand48(((unsigned)time(&t)));
        //fill 2*niter random numbers (an x and a y for each point)
        for (a = 0; a<2*niter; a++)
                randomnums[a] = (double)drand48();
        int i = 0;
        #pragma acc parallel copy(randomnums[0:200000]) copyout(output[0:100000]) create(x,y,z)
        {
                for (i = 0; i<100000; i++)
                {
                        int temp = i+i;
                        x = randomnums[temp];
                        y = randomnums[temp+1];
                        z = ((x*x)+(y*y));
                        output[i] = z;
                }
                for (i = 0; i<100000; i++)
                {
                        if (output[i]<=1)
                        {
                                ++count;
                        }
                        else
                                continue;
                }
        }
        pi = ((double)count/(double)niter)*4.0;
        printf("Pi: %f\n", pi);
}

To compile this, the cudatoolkit module needs to be loaded. To load it, run module load cudatoolkit. Next, run

$ CC -acc -Minfo=all -ta=nvidia:5.0,cc3x accpi.cpp -o a.out

to compile the code into an executable. The -Minfo=all switch tells the PGI compiler to print out various information regarding how it is compiling the OpenACC code; this can be a very useful tool for debugging OpenACC code. The -ta=nvidia:5.0,cc3x switch specifies which CUDA version to use and which compute architecture to target. Some sample output for the -Minfo=all switch is:

main:
21, Generating copy(randomnums[0:])
Generating copyout(output[0:])
Generating create(x)
Generating create(y)
Generating create(z)
Accelerator kernel generated
22, #pragma acc loop vector(256) /* threadIdx.x */
30, #pragma acc loop vector(256) /* threadIdx.x */
21, Generating NVIDIA code
Generating compute capability 3.0 binary
22, Loop is parallelizable
30, Loop is parallelizable
34, Sum reduction generated for count

To compile this code with the Cray compiler, load the PrgEnv-cray module as well as the craype-accel-nvidia35 module:

$ module swap PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35

To compile to an executable, run:

$ CC -hpragma=acc accpi.cpp -o a.out

Running:

Then copy the executable (in this case, a.out) to the scratch directory. Like before, this can be run either from an interactive job or it can be submitted to the queue to be run. To run this executable job inside an interactive job, run:

$ qsub -I -A [PROJID] -l walltime=1:00,nodes=1

Before you run this, change [PROJID] to the appropriate project ID. Once the job starts, change to the scratch directory and execute the command

$ aprun -n1 ./a.out

Alternately, this can be submitted as a batch job to the queues. To do this, a batch script needs to be written and submitted. A sample batch script could look like this:

#!/bin/bash
#    Begin PBS directives
#PBS -A [PROJID]
#PBS -N pi
#PBS -j oe
#PBS -l walltime=0:01:00,nodes=1
#PBS -V
#    End PBS directives and begin shell commands

cd $PBS_O_WORKDIR
date
aprun -n 1 -N 1 ./a.out

Again, before this script can be submitted, the appropriate project ID needs to be added instead of [PROJID]. Save this file as something unique, like titan.pbs, and submit it by running:

$ qsub titan.pbs

Thrust

Finally, this operation can be accelerated with the Thrust library. Thrust is a template library that allows programmers to write parallel code in the same way that they would use the C++ Standard Library for serial code. Because of this, the Thrust parallelized code is structured quite differently. For instance, the Monte Carlo kernel from the previous CUDA example is replaced by an equivalent function inside a struct.

struct montecarlo :
       public thrust::unary_function<unsigned int, float>
{
    //use the __host__ __device__ decorators so that
    //thrust knows that this is a gpu function.
    //think of operator() as similar to a CUDA kernel
	__host__ __device__
	float operator()(unsigned int thread_id)
	{
                unsigned int seed;
                //set the seed
                seed = 49868^thread_id;
                int i;
                float x,y,z,sum=0.0;

                //define the random number generator
                thrust::default_random_engine rng(seed);
                thrust::uniform_real_distribution<float> u01(0,1);
                for (i = 0; i < 1000; i++)
                {
                        //get the random numbers
                        x = u01(rng);
                        y = u01(rng);
                        z = ((x*x)+(y*y));
                        if (z<=1)
                                sum += 1;
                }
                return sum;
        }
};

First, the montecarlo struct is defined and given a Thrust function template to adhere to.

struct montecarlo : public thrust::unary_function<unsigned int, float>

This Thrust template says that the function in the struct will be a standard Thrust unary function taking an unsigned int as input (the thread number, which Thrust supplies) and returning a floating point number. Next, the programmer defines the operator function

__host__ __device__
float operator()(unsigned int thread_id)

and gives it the __host__ and __device__ decorators so that the compiler knows that it is a GPU function. The next thing that the programmer should take note of is the use of the Thrust random number generator.

thrust::default_random_engine rng(seed);
thrust::uniform_real_distribution<float> u01(0,1);

Since it is not possible to call host functions from a GPU kernel, this step is necessary. Using Thrust functions like these allows the programmer to accelerate the serial code more easily and quickly, without having to use more complex and specialized APIs like CURAND.

Finally, the functor is launched through the use of another Thrust function, transform_reduce(), which is defined as:

OutputType thrust::transform_reduce(InputIterator first,
                                    InputIterator last,
                                    UnaryFunction unary_op,
                                    OutputType init,
                                    BinaryFunction binary_op)

Here, the thrust::transform_reduce() function is used because it will run every number between first and last through our predefined functor (unary_op) and then reduce the resulting transformed answers into one final float according to binary_op, starting from the initial value init. first and last need to be defined as thrust::counting_iterator<unsigned int>(first) and thrust::counting_iterator<unsigned int>(last). The full function call looks like this:

count = thrust::transform_reduce(thrust::counting_iterator<unsigned int>(0),
                                 thrust::counting_iterator<unsigned int>(niter),
                                 montecarlo(),
                                 0.0f,
                                 thrust::plus<float>());

So with that in mind, the full code can look something like this:

thrust.cu

#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <thrust/generate.h>
#include <thrust/random.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>
#include <thrust/transform_reduce.h>
#include <cmath>

struct montecarlo :
       public thrust::unary_function<unsigned int, float>
{
    //use the __host__ __device__ decorators so that
    //thrust knows that this is a gpu function.
    //think of operator() as similar to a CUDA kernel
	__host__ __device__
	float operator()(unsigned int thread_id)
	{
                unsigned int seed;
                //set the seed
                seed = 49868^thread_id;
                int i;
                float x,y,z,sum=0.0;

                //define the random number generator
                thrust::default_random_engine rng(seed);
                thrust::uniform_real_distribution<float> u01(0,1);
                for (i = 0; i < 1000; i++)
                {
                        //get the random numbers
                        x = u01(rng);
                        y = u01(rng);
                        z = ((x*x)+(y*y));
                        if (z<=1)
                                sum += 1;
                }
                return sum;
        }
};

int main(int argc, char* argv[])
{
        float pi;
        float count = 0.0;
        int niter = 1000;
        count = thrust::transform_reduce(thrust::counting_iterator<unsigned int>(0),
                                     thrust::counting_iterator<unsigned int>(niter),
                                     montecarlo(),
                                     0.0f,
                                     thrust::plus<float>());
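        //each thread generates 1000 points and niter = 1000 threads are run,
        //so the total number of points tried is niter*1000 = niter*niter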
        pi = (count/(niter*niter))*4.0;
        printf("Pi: %f\n", pi);
}

To compile this code, the cudatoolkit needs to be loaded

$ module load cudatoolkit

Once the code is ready to be compiled, it should be compiled using NVIDIA’s nvcc compiler like so:

$ nvcc -arch=sm_35 thrust.cu -o a.out

It can be run either by submitting it in a batch script or by running it from an interactive job, as described in the previous examples.

The code in this tutorial along with a makefile can be found at https://github.com/olcf/Accelerating-Serial-Code

If you have any questions about these examples or how to run them please email help@nccs.gov.

This sample shows the Game of Life CPU code converted to an HMPP accelerator directive version. Both the codelet and region methods are presented, in both C and Fortran. A second version is presented that fixes a performance issue incurred in the initial version. Please direct any questions or comments to help@nccs.gov.

This tutorial covers CAPS HMPP accelerator directives. If you are interested in CAPS OpenACC support, please see the OpenACC Vector Addition Tutorial.

GOL-codelet.c

#include <stdio.h>
#include <stdlib.h>

#define SRAND_VALUE 1985

#pragma hmpp GOL codelet, target=CUDA, args[grid].io=inout, args[*].transfer=atcall, args[newGrid].io=none
void GOL(int dim, int grid[dim+2][dim+2], int newGrid[dim+2][dim+2], int maxIter)
{
    int i, j, iter;

    // Main game loop
    for (iter = 0; iter<maxIter; iter++) {

        // Left-Right columns
        for (i = 1; i<=dim; i++) {
            grid[i][0] = grid[i][dim]; //Copy last real column to left ghost column
            grid[i][dim+1] = grid[i][1]; //Copy first real column to right ghost column
        }

        // Top-Bottom rows
        for (j = 0; j<=dim+1; j++) {
            grid[0][j] = grid[dim][j]; //Copy last real row to top ghost row
            grid[dim+1][j] = grid[1][j]; //Copy first real row to bottom ghost row
        }

        int numNeighbors;
        // Now we loop over all cells and determine their fate
        for (i = 1; i<=dim; i++) {
            for (j = 1; j<=dim; j++) {
               // Get the number of neighbors for a given grid point
                numNeighbors = grid[i+1][j] + grid[i-1][j]      //upper lower
                             + grid[i][j+1] + grid[i][j-1]      //right left
                             + grid[i+1][j+1] + grid[i-1][j-1]  //diagonals
                             + grid[i-1][j+1] + grid[i+1][j-1];


                // Here we have explicitly all of the game rules
                if (grid[i][j] == 1 && numNeighbors < 2)
                    newGrid[i][j] = 0;
                else if (grid[i][j] == 1 && (numNeighbors == 2 || numNeighbors == 3))
                    newGrid[i][j] = 1;
                else if (grid[i][j] == 1 && numNeighbors > 3)
                    newGrid[i][j] = 0;
                else if (grid[i][j] == 0 && numNeighbors == 3)
                    newGrid[i][j] = 1;
                else
                    newGrid[i][j] = grid[i][j];
            }
        }

        // Can't switch pointers so we manually have to copy the array over
        for(i = 1; i<dim+2; i++) {
            for(j = 1; j<dim+2; j++) {
                grid[i][j] = newGrid[i][j];
            }
        }
    }

}

int main(int argc, char* argv[])
{
    int i,j,iter,numNeighbors;
    int dim = 1024;
    // Number of game iterations
    int maxIter = 1<<15;

    // Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
    int *grid = (int*) malloc( sizeof(int) * (dim+2) *(dim+2));

    // Allocate newGrid
    int *newGrid = (int*) malloc( sizeof(int) * (dim+2) *(dim+2));

    // Assign initial population randomly
    srand(SRAND_VALUE);
    for(i = 1; i<=dim; i++) {
        for(j = 1; j<=dim; j++) {
            grid[i*(dim+2)+j] = rand() % 2;
        }
    }

    // Call the GPU to execute codelet
    #pragma hmpp GOL callsite
    GOL(dim, grid, newGrid, maxIter);

    // Sum up alive cells and print results
    int total = 0;
    for (i = 1; i<=dim; i++) {
        for (j = 1; j<=dim; j++) {
            total += grid[i*(dim+2)+j];
        }
    }

    printf("Total Alive: %d\n", total);

    free(grid);
    free(newGrid);

    return 0;
}

Changes
Lines 6-58

#pragma hmpp GOL codelet, target=CUDA, args[grid].io=inout, args[*].transfer=atcall, args[newGrid].io=none
void GOL(int dim, int grid[dim+2][dim+2], int newGrid[dim+2][dim+2], int maxIter)
{
    int i, j, iter;
 
    // Main game loop
    for (iter = 0; iter<maxIter; iter++) {
 
        // Left-Right columns
        for (i = 1; i<=dim; i++) {
            grid[i][0] = grid[i][dim]; //Copy last real column to left ghost column
            grid[i][dim+1] = grid[i][1]; //Copy first real column to right ghost column
        }
 
        // Top-Bottom rows
        for (j = 0; j<=dim+1; j++) {
            grid[0][j] = grid[dim][j]; //Copy last real row to top ghost row
            grid[dim+1][j] = grid[1][j]; //Copy first real row to bottom ghost row
        }
 
        int numNeighbors;
        // Now we loop over all cells and determine their fate
        for (i = 1; i<=dim; i++) {
            for (j = 1; j<=dim; j++) {
               // Get the number of neighbors for a given grid point
                numNeighbors = grid[i+1][j] + grid[i-1][j]      //upper lower
                             + grid[i][j+1] + grid[i][j-1]      //right left
                             + grid[i+1][j+1] + grid[i-1][j-1]  //diagonals
                             + grid[i-1][j+1] + grid[i+1][j-1];
 
                // Here we have explicitly all of the game rules
                if (grid[i][j] == 1 && numNeighbors < 2)
                    newGrid[i][j] = 0;
                else if (grid[i][j] == 1 && (numNeighbors == 2 || numNeighbors == 3))
                    newGrid[i][j] = 1;
                else if (grid[i][j] == 1 && numNeighbors > 3)
                    newGrid[i][j] = 0;
                else if (grid[i][j] == 0 && numNeighbors == 3)
                    newGrid[i][j] = 1;
                else
                    newGrid[i][j] = grid[i][j];
            }
        }
 
        // Can't switch pointers so we manually have to copy the array over
        for(i = 1; i<dim+2; i++) {
            for(j = 1; j<dim+2; j++) {
                grid[i][j] = newGrid[i][j];
            }
        }
    }
 
}

To use the codelet method all of the game logic was moved into its own function. With the directive option args[grid].io=inout we specify that grid is to be copied onto the GPU at the start of the codelet and copied back to the host at the end. newGrid is specified to have no transfer between the host and device. The main game loop is included inside of the codelet, even though it is necessarily serial, to ensure that the game grids are not copied to and from the host after each game iteration. We will look at a more explicit way to manage data transfer below. grid and newGrid are passed into our codelet with their full dimensions stated; this is necessary for HMPP to correctly copy the data over.

// Get the number of neighbors for a given grid point
numNeighbors = grid[i+1][j] + grid[i-1][j]          //upper lower
                 + grid[i][j+1] + grid[i][j-1]      //right left
                 + grid[i+1][j+1] + grid[i-1][j-1]  //diagonals
                 + grid[i-1][j+1] + grid[i+1][j-1];

In the CUDA programming model all functions must be effectively inlined. Here we manually inline our function to calculate the number of alive neighbors a cell has.
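
For reference, the inlined expression is equivalent to a standalone helper a CPU-only version might use (the name getNeighbors below is illustrative, not taken from this tutorial):

//Hypothetical CPU-side helper equivalent to the inlined expression above;
//in the HMPP codelet the body is written directly at the point of use
int getNeighbors(int dim, int grid[dim+2][dim+2], int i, int j)
{
    return grid[i+1][j] + grid[i-1][j]      //upper lower
         + grid[i][j+1] + grid[i][j-1]      //right left
         + grid[i+1][j+1] + grid[i-1][j-1]  //diagonals
         + grid[i-1][j+1] + grid[i+1][j-1];
}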

// Can't switch pointers so we manually have to copy the array over
for(i = 1; i<dim+2; i++) {
    for(j = 1; j<dim+2; j++) {
        grid[i][j] = newGrid[i][j];
    }
}

This modification is very important: device pointers are not accessible inside of the accelerated region. If you leave the pointer swap in place from the CPU code the compiler will not complain, the code will run, and you will get incorrect results. This is not a limitation of GPU programming, as you will see in the CUDA GOL example, but of the HMPP directive framework. The easiest way to fix this is to replace the pointer swap with a copy operation, as we have done. A possible workaround for this operation is investigated below.

// Allocate square grid of (dim+2)^2 elements, 2 added for ghost cell
int *grid = (int*) malloc( sizeof(int) * (dim+2) *(dim+2));
 
// Allocate newGrid
int *newGrid = (int*) malloc( sizeof(int) * (dim+2) *(dim+2));
grid[i*(dim+2)+j] = rand() % 2;
total += grid[i*(dim+2)+j];

HMPP currently seems to be limited to copying single-dimensional arrays from the host to the device. We are able to maintain our multidimensional array indexing in the codelet, while on the host we must allocate and address linear arrays.


GOL-region.c

#include <stdio.h>
#include <stdlib.h>

#define SRAND_VALUE 1985
#define idx(i,j) ((i)*(dim+2)+(j))

int main(int argc, char* argv[])
{
    int i,j,numNeighbors;
    int dim = 1024;
    // Number of game iterations
    int maxIter = 1<<15;

    // Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
    int *grid = (int*) malloc( sizeof(int) * (dim+2) *(dim+2));

    // Allocate newGrid
    int *newGrid = (int*) malloc( sizeof(int) * (dim+2) *(dim+2));

    // Assign initial population randomly
    srand(SRAND_VALUE);
    for(i = 1; i<=dim; i++) {
        for(j = 1; j<=dim; j++) {
            grid[i*(dim+2)+j] = rand() % 2;
        }
    }

#pragma hmpp GOL region, target=CUDA, args[*].transfer=atcall, args[grid].io=inout &
#pragma hmpp & args[newGrid].io=none
{
    int i,j,iter;
    // Main game loop
    for (iter = 0; iter<maxIter; iter++) {

        #pragma hmppcg gridify
        // Left-Right columns
        for (i = 1; i<=dim; i++) {
            grid[idx(i,0)] = grid[idx(i,dim)]; //Copy last real column to left ghost column
            grid[idx(i,dim+1)] = grid[idx(i,1)]; //Copy first real column to right ghost column
        }

        #pragma hmppcg gridify
        // Top-Bottom rows
        for (j = 0; j<=dim+1; j++) {
            grid[idx(0,j)] = grid[idx(dim,j)]; //Copy last real row to top ghost row
            grid[idx(dim+1,j)] = grid[idx(1,j)]; //Copy first real row to bottom ghost row
        }

        int numNeighbors;
        // Now we loop over all cells and determine their fate
        #pragma hmppcg gridify (i,j)
        for (i = 1; i<=dim; i++) {
            for (j = 1; j<=dim; j++) {
               // Get the number of neighbors for a given grid point
                numNeighbors = grid[idx(i+1,j)] + grid[idx(i-1,j)]      //upper lower
                             + grid[idx(i,j+1)] + grid[idx(i,j-1)]      //right left
                             + grid[idx(i+1,j+1)] + grid[idx(i-1,j-1)]  //diagonals
                             + grid[idx(i-1,j+1)] + grid[idx(i+1,j-1)];

                // Here we have explicitly all of the game rules
                if (grid[idx(i,j)] == 1 && numNeighbors < 2)
                    newGrid[idx(i,j)] = 0;
                else if (grid[idx(i,j)] == 1 && (numNeighbors == 2 || numNeighbors == 3))
                    newGrid[idx(i,j)] = 1;
                else if (grid[idx(i,j)] == 1 && numNeighbors > 3)
                    newGrid[idx(i,j)] = 0;
                else if (grid[idx(i,j)] == 0 && numNeighbors == 3)
                    newGrid[idx(i,j)] = 1;
                else
                    newGrid[idx(i,j)] = grid[idx(i,j)];
            }
        }

        #pragma hmppcg gridify
        // Can't switch pointers so we manually have to copy the array over
        for(i = 1; i<dim+2; i++) {
            for(j = 1; j<dim+2; j++) {
                grid[idx(i,j)] = newGrid[idx(i,j)];
            }
        }
    }

}
    // Sum up alive cells and print results
    int total = 0;
    for (i = 1; i<=dim; i++) {
        for (j = 1; j<=dim; j++) {
            total += grid[idx(i,j)];
        }
    }

    printf("Total Alive: %d\n", total);

    free(grid);
    free(newGrid);

    return 0;
}

Changes

#define idx(i,j) ((i)*(dim+2)+(j))

In the codelet version we were able to stretch the C language a bit and use multidimensional arrays in our codelet and a single-dimensional array on our host; this is not possible with the region method. A convenience macro is defined and used whenever grid or newGrid is accessed.
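
For example, with dim = 1024 the macro maps the two-dimensional coordinate (i,j) to the linear offset i*1026 + j, so idx(1,1) refers to element 1027 of the flattened grid.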

// Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
int *grid = (int*) malloc( sizeof(int) * (dim+2) *(dim+2));
 
// Allocate newGrid
int *newGrid = (int*) malloc( sizeof(int) * (dim+2) *(dim+2));

As in the codelet version, HMPP does not currently handle multidimensional host arrays properly, so we revert to a single-dimensional array.

#pragma hmpp GOL region, target=CUDA, args[*].transfer=atcall, args[grid].io=inout &
#pragma hmpp & args[newGrid].io=none

With the directive option args[grid].io=inout we specify that grid is to be copied onto the GPU at the start of the region and copied back to the host at the end. newGrid is specified to have no transfer between the host and device. The main game loop is included inside of the region, even though it is necessarily serial, to ensure that the game grids are not copied to and from the host after each game iteration. We will look at a more explicit way to manage data transfer below.

 int i,j,iter;

At the time of this writing HMPP does not correctly identify induction variables used in an hmpp region when they are declared outside of the region. The solution is simply to declare our induction variables inside of the region.

#pragma hmppcg gridify
// Left-Right columns
for (i = 1; i<=dim; i++) {
#pragma hmppcg gridify
// Top-Bottom rows
for (j = 0; j<=dim+1; j++) {
// Now we loop over all cells and determine their fate
#pragma hmppcg gridify (i,j)
for (i = 1; i<=dim; i++) {
    for (j = 1; j<=dim; j++) {
#pragma hmppcg gridify
// Can't switch pointers so we manually have to copy the array over
for(i = 1; i<dim+2; i++) {
    for(j = 1; j<dim+2; j++) {

The hmppcg directives are used to guide the compiler at the loop level. In this particular case the compiler would not parallelize the loops because the outermost game loop is sequential. Adding hmppcg gridify tells the compiler that the loop immediately following the directive should be executed in parallel on the GPU.

// Get the number of neighbors for a given grid point
numNeighbors = grid[idx(i+1,j)] + grid[idx(i-1,j)]          //upper lower
                 + grid[idx(i,j+1)] + grid[idx(i,j-1)]      //right left
                 + grid[idx(i+1,j+1)] + grid[idx(i-1,j-1)]  //diagonals
                 + grid[idx(i-1,j+1)] + grid[idx(i+1,j-1)];

In the CUDA programming model all functions must be effectively inlined. Here we manually inline our function to calculate the number of alive neighbors a cell has.

#pragma hmppcg gridify
// Can't switch pointers so we manually have to copy the array over
for(i = 1; i<dim+2; i++) {
    for(j = 1; j<dim+2; j++) {
        grid[idx(i,j)] = newGrid[idx(i,j)];
    }
}

This modification is very important: device pointers are not accessible inside of the accelerated region. If you leave the pointer swap in place from the CPU code the compiler will not complain, the code will run, and you will get incorrect results. This is not a limitation of GPU programming, as you will see in the CUDA GOL example, but of the HMPP directive framework. The easiest way to fix this is to replace the pointer swap with a copy operation, as we have done. A possible workaround for this operation is investigated below.

Compiling
We add the --io-report option to hmpp to see compile-time host/device memory transfer information.

$ hmpp --io-report cc GOL-codelet.c -o gol.out

or

$ hmpp --io-report cc GOL-region.c -o gol.out

Output:
The output will be similar for both the codelet and region methods.

hmpp: [Info] IO report :
In GROUP 'gol'
 REGION 'gol' at GOL-region.c:29, function '__hmpp_region__gol'
    Parameter  'dim' has intent IN
    Parameter  'maxIter' has intent IN
    Parameter  'grid' has intent INOUT
    Parameter  'newGrid' has intent INOUT

Here we see the intent, the transfer direction with respect to the GPU, of each of our variables. We will see more explicit methods of dealing with data transfer below.

hmpp: [Info] Generated codelet filename is "gol_cuda.cu".
hmppcg: [Message DPL3000] GOLHMPP-TRY0-REGION.c:37: Loop 'i' was gridified (1D)
hmppcg: [Message DPL3000] GOLHMPP-TRY0-REGION.c:44: Loop 'j' was gridified (1D)
hmppcg: [Message DPL3001] GOLHMPP-TRY0-REGION.c:52: Loops 'j' and 'i' were gridified (2D)
hmppcg: [Message DPL3000] GOLHMPP-TRY0-REGION.c:77: Loop 'i' was gridified (1D)
hmppcg: [Warning DPL3002] GOLHMPP-TRY0-REGION.c:33: Loop not gridified: will be executed on CPU

All of the loops have been gridified, indicating a kernel has been generated for each loop and will be executed in parallel on the GPU. A warning message tells us that the main game loop has not been gridified and will be executed serially on the CPU; this is the expected and correct behavior.


GOL-codelet.f90

!$hmpp GOL codelet, target=CUDA, args[grid].io=inout, args[*].transfer=atcall, args[newGrid].io=none
subroutine GOL(dim, grid, newGrid, maxIter)
    implicit none
    integer, intent(in) :: dim, maxIter
    integer, intent(inout) :: grid(dim+2,dim+2), newGrid(dim+2,dim+2)
    integer :: i,j,iter,numNeighbors

    ! Main game loop
    do iter=1,maxIter
        ! Top-Bottom ghost rows
        do j=2,dim+1
            grid(1,j) = grid(dim+1,j) !Copy last game grid row to top ghost row
            grid(dim+2,j) = grid(2,j) !Copy first game grid row to bottom ghost row
        enddo

        ! Left-Right ghost columns
        do i=1,dim+2
            grid(i,1) = grid(i, dim+1) !Copy last game grid column to left ghost column
            grid(i,dim+2) = grid(i,2)  !Copy first game grid column to right ghost column
        enddo

        ! Now we loop over all cells and determine their fate
        do j=2,dim+1
            do i=2,dim+1
                ! Get the number of neighbors for a given grid point
                numNeighbors = grid(i,j+1) + grid(i,j-1)& !left & right
                             + grid(i+1,j) + grid(i-1,j)& !upper & lower
                             + grid(i+1,j+1) + grid(i-1,j-1)& !diagonals
                             + grid(i-1,j+1) + grid(i+1,j-1)


                ! Here we have explicitly all of the game rules
                if(grid(i,j) == 1 .AND. numNeighbors < 2) then
                    newGrid(i,j) = 0
                elseif(grid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
                    newGrid(i,j) = 1
                elseif(grid(i,j) == 1 .AND. numNeighbors > 3) then
                    newGrid(i,j) = 0
                elseif(grid(i,j) == 0 .AND. numNeighbors == 3) then
                    newGrid(i,j) = 1
                else
                    newGrid(i,j) = grid(i,j)
                endif
            enddo
        enddo

        ! Can't switch pointers so we manually have to copy the array over
        do j=2,dim+1
            do i=2,dim+1
                grid(i,j) = newGrid(i,j)
            enddo
        enddo

    enddo!end main game loop  

end subroutine GOL

program main
    implicit none

    integer :: i,j,seed(8),numNeighbors,total
    real :: randm
    ! Linear game grid dimension
    integer :: dim = 1024
    ! Number of game iterations
    integer :: maxIter = 2**10

    ! Game grid pointers
    integer,dimension(:,:),allocatable :: grid, newGrid, tmpGrid

    ! Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
    allocate(grid(dim+2,dim+2))
    allocate(newGrid(dim+2,dim+2))

    ! Assign initial population randomly
    seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/)
    call random_seed(PUT=seed)
    do j=1,dim
        do i=1,dim
            call random_number(randm)
            grid(i,j) = nint(randm)
        enddo
    enddo

    !$hmpp GOL callsite
    call GOL(dim, grid, newGrid, maxIter)

    ! Sum up alive cells and print results
    total = 0
    do j=2,dim+1
        do i=2,dim+1
            total = total + grid(i,j)
        enddo
    enddo
    print *, "Total Alive", total

    ! Release memory
    deallocate(grid)
    deallocate(newGrid)

end program

Changes
Lines 1-55:

!$hmpp GOL codelet, target=CUDA, args[grid].io=inout, args[*].transfer=atcall, args[newGrid].io=none
subroutine GOL(dim, grid, newGrid, maxIter)
    implicit none
    integer, intent(in) :: dim, maxIter
    integer, intent(inout) :: grid(dim+2,dim+2), newGrid(dim+2,dim+2)
    integer :: i,j,iter,numNeighbors
 
    ! Main game loop
    do iter=1,maxIter
        ! Top-Bottom ghost rows
        do j=2,dim+1
            grid(1,j) = grid(dim+1,j) !Copy last game grid row to top ghost row
            grid(dim+2,j) = grid(2,j) !Copy first game grid row to bottom ghost row
        enddo
 
        ! Left-Right ghost columns
        do i=1,dim+2
            grid(i,1) = grid(i, dim+1) !Copy last game grid column to left ghost column
            grid(i,dim+2) = grid(i,2)  !Copy first game grid column to right ghost column
        enddo
 
        ! Now we loop over all cells and determine their fate
        do j=2,dim+1
            do i=2,dim+1
                ! Get the number of neighbors for a given grid point
                numNeighbors = grid(i,j+1) + grid(i,j-1)& !left & right
                             + grid(i+1,j) + grid(i-1,j)& !upper & lower
                             + grid(i+1,j+1) + grid(i-1,j-1)& !diagonals
                             + grid(i-1,j+1) + grid(i+1,j-1)
 
                ! Here we have explicitly all of the game rules
                if(grid(i,j) == 1 .AND. numNeighbors < 2) then
                    newGrid(i,j) = 0
                elseif(grid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
                    newGrid(i,j) = 1
                elseif(grid(i,j) == 1 .AND. numNeighbors > 3) then
                    newGrid(i,j) = 0
                elseif(grid(i,j) == 0 .AND. numNeighbors == 3) then
                    newGrid(i,j) = 1
                else
                    newGrid(i,j) = grid(i,j)
                endif
            enddo
        enddo
 
        ! Can't switch pointers so we manually have to copy the array over
        do j=2,dim+1
            do i=2,dim+1
                grid(i,j) = newGrid(i,j)
            enddo
        enddo
 
    enddo!end main game loop  
 
end subroutine GOL

To use the codelet method all of the game logic was moved into its own subroutine. With the directive option args[grid].io=inout we specify that grid is to be copied onto the GPU at the start of the codelet and copied back to the host at the end. newGrid is specified to have no transfer between the host and device. The main game loop is included inside of the codelet, even though it is necessarily serial, to ensure that the game grids are not copied to and from the host after each game iteration. We will look at a more explicit way to manage data transfer below.

! Get the number of neighbors for a given grid point
numNeighbors = grid(i,j+1) + grid(i,j-1)& !left & right
             + grid(i+1,j) + grid(i-1,j)& !upper & lower
             + grid(i+1,j+1) + grid(i-1,j-1)& !diagonals
             + grid(i-1,j+1) + grid(i+1,j-1)

In the CUDA programming model all functions must be effectively inlined. Here we manually inline our subroutine to calculate the number of alive neighbors a cell has.

! Can't switch pointers so we manually have to copy the array over
do j=2,dim+1
    do i=2,dim+1
        grid(i,j) = newGrid(i,j)
    enddo
enddo

This modification is very important: device pointers are not accessible inside of the accelerated region. If you leave the pointer swap in place from the CPU code the compiler will not complain, the code will run, and you will get incorrect results. This is not a limitation of GPU programming, as you will see in the CUDA GOL example, but of the HMPP directive framework. The easiest way to fix this is to replace the pointer swap with a copy operation, as we have done. A possible workaround for this operation is investigated below.

GOL-region.f90

program main
    implicit none

    integer :: i,j,iter,seed(8),numNeighbors,total
    real :: randm
    ! Linear game grid dimension
    integer :: dim = 1024
    ! Number of game iterations
    integer :: maxIter = 2**10

    ! Game grid pointers
    integer,dimension(:,:),allocatable :: grid, newGrid, tmpGrid

    ! Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
    allocate(grid(dim+2,dim+2))
    allocate(newGrid(dim+2,dim+2))

    ! Assign initial population randomly
    seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/)
    call random_seed(PUT=seed)
    do j=1,dim
        do i=1,dim
            call random_number(randm)
            grid(i,j) = nint(randm)
        enddo
    enddo

    !$hmpp GOL region, target=CUDA, args[grid].io=inout, args[*].transfer=atcall & 
    !$hmpp & args[newGrid].io=none

    ! Main game loop
    do iter=1,maxIter
        ! Top-Bottom ghost rows
        do j=2,dim+1
            grid(1,j) = grid(dim+1,j) !Copy last game grid row to bottom ghost row
            grid(dim+2,j) = grid(2,j) !Copy first game grid row to top ghost row
        enddo

        ! Left-Right ghost columns
        do i=1,dim+2
            grid(i,1) = grid(i, dim+1) !Copy last game grid column to left ghost column
            grid(i,dim+2) = grid(i,2)  !Copy first game grid column to right ghost column
        enddo

        ! Now we loop over all cells and determine their fate
        !$hmppcg gridify(j, i)
        do j=2,dim+1
            do i=2,dim+1
                ! Get the number of neighbors for a given grid point
                numNeighbors = grid(i,j+1) + grid(i,j-1)&    !left & right
                             + grid(i+1,j) + grid(i-1,j)&    !upper & lower
                             + grid(i+1,j+1) + grid(i-1,j-1)& !diagonals
                             + grid(i-1,j+1) + grid(i+1,j-1)


                !Here we have explicitly all of the game rules
                if(grid(i,j) == 1 .AND. numNeighbors < 2) then
                    newGrid(i,j) = 0
                elseif(grid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
                    newGrid(i,j) = 1
                elseif(grid(i,j) == 1 .AND. numNeighbors > 3) then
                    newGrid(i,j) = 0
                elseif(grid(i,j) == 0 .AND. numNeighbors == 3) then
                    newGrid(i,j) = 1
                else
                    newGrid(i,j) = grid(i,j)
                endif
            enddo
        enddo

        ! Can't switch pointers so we manually have to copy the array over
        do j=2,dim+1
            do i=2,dim+1
                grid(i,j) = newGrid(i,j)
            enddo
        enddo

    enddo!end main game loop  

    !$hmpp GOL endregion

    ! Sum up alive cells and print results
    total = 0
    do j=2,dim+1
        do i=2,dim+1
            total = total + grid(i,j)
        enddo
    enddo
    print *, "Total Alive", total

    ! Release memory
    deallocate(grid)
    deallocate(newGrid)

end program

Changes

! Now we loop over all cells and determine their fate
!$hmppcg gridify(j, i)
    do j=2,dim+1
        do i=2,dim+1

By default the hmpp compiler did not want to gridify the loop over each cell. We can attempt to force the compiler to do so by adding the hmppcg directive with the gridify(j, i) option, which specifies that we want to gridify with respect to the induction variables j and i.

! Get the number of neighbors for a given grid point
numNeighbors = grid(i,j+1) + grid(i,j-1)&    !left & right
             + grid(i+1,j) + grid(i-1,j)&    !upper & lower
             + grid(i+1,j+1) + grid(i-1,j-1)& !diagonals
             + grid(i-1,j+1) + grid(i+1,j-1)

In the CUDA programming model all functions must be effectively inlined. Here we manually inline our subroutine to calculate the number of alive neighbors a cell has.

! Can't switch pointers so we manually have to copy the array over
do j=2,dim+1
    do i=2,dim+1
        grid(i,j) = newGrid(i,j)
    enddo
enddo

This modification is very important: device pointers are not accessible inside of the acc region. If you leave the pointer swap in place from the CPU code the compiler will not complain, the code will run, and you will get incorrect results. This is not a limitation of GPU programming, as you will see in the CUDA GOL example, but of the HMPP directive framework. The easiest way to fix this is to replace the pointer swap with a copy operation, as we have done. A possible workaround for this operation is investigated below.

Compiling
We add the --io-report option to hmpp to see compile-time host/device memory transfer information.

$ hmpp --io-report ftn GOL-codelet.f90 -o gol.out

or

$ hmpp --io-report ftn GOL-region.f90 -o gol.out

Output:
The output will be similar for both the codelet and region methods. In the most recent release as of this writing you may get some warning messages from the compiler about unused variables when using the region method.

hmpp: [Info] IO report :
In GROUP 'gol'
 CODELET 'gol' at GOLHMPP-TRY0-CODELET.f90:1, function 'gol'
    Parameter 'dim' has intent IN
    Parameter 'grid' has intent INOUT
    Parameter 'newgrid' has intent INOUT
    Parameter 'maxiter' has intent IN

Here we see the intent, that is, the transfer direction with respect to the GPU, of each of our variables. We will see more explicit methods of dealing with data transfer below.

hmpp: [Info] Generated codelet filename is "gol_cuda.cu".
hmppcg: [Message DPL3000] GOLHMPP-TRY0-CODELET.f90:11: Loop 'j' was gridified (1D)
hmppcg: [Message DPL3000] GOLHMPP-TRY0-CODELET.f90:17: Loop 'i' was gridified (1D)
hmppcg: [Message DPL3001] GOLHMPP-TRY0-CODELET.f90:23: Loops 'i' and 'j' were gridified (2D)
hmppcg: [Message DPL3001] GOLHMPP-TRY0-CODELET.f90:47: Loops 'i' and 'j' were gridified (2D)
hmppcg: [Warning DPL3002] GOLHMPP-TRY0-CODELET.f90:9: Loop not gridified: will be executed on CPU

All of the loops have been gridified, indicating that a kernel has been generated for each loop and will be executed in parallel on the GPU. A warning message tells us that the main game loop has not been gridified and will be executed serially on the CPU; this is the expected and correct behavior.

GOL-codelet-Try2.c

In this version we will remove the memory copy kernel that we needed since HMPP does not currently support device pointers. As an added bonus we will also take a look at more advanced memory management techniques. There are numerous ways this could be approached, but the methods shown highlight many useful HMPP features.

#include <stdio.h>
#include <stdlib.h>

#define SRAND_VALUE 1985

#pragma hmpp <GOL> group target=CUDA
#pragma hmpp <GOL> mapbyname, dim
#pragma hmpp <GOL> map, args[even::grid;odd::newGrid]
#pragma hmpp <GOL> map, args[even::newGrid;odd::grid]
#pragma hmpp <GOL> even codelet, args[dim,grid,newGrid].transfer=manual
#pragma hmpp <GOL> odd  codelet, args[dim,grid,newGrid].transfer=manual
void GOL(int dim, int grid[dim+2][dim+2], int newGrid[dim+2][dim+2])
{
    int i, j, numNeighbors;

    // Left-Right columns
    for (i = 1; i<=dim; i++) {
        grid[i][0] = grid[i][dim]; //Copy last real column to left ghost column
        grid[i][dim+1] = grid[i][1]; //Copy first real column to right ghost column
    }
    // Top-Bottom rows
    for (j = 0; j<=dim+1; j++) {
        grid[0][j] = grid[dim][j]; //Copy last real row to bottom ghost row
        grid[dim+1][j] = grid[1][j]; //Copy first real row to top ghost row
    }

    // Now we loop over all cells and determine their fate
    for (i = 1; i<=dim; i++) {
        for (j = 1; j<=dim; j++) {
           // Get the number of neighbors for a given grid point
            numNeighbors = grid[i+1][j] + grid[i-1][j]      //upper lower
                         + grid[i][j+1] + grid[i][j-1]      //right left
                         + grid[i+1][j+1] + grid[i-1][j-1]  //Diagonals
                         + grid[i-1][j+1] + grid[i+1][j-1];

            // Here we have explicitly all of the game rules
            if (grid[i][j] == 1 && numNeighbors < 2)
                newGrid[i][j] = 0;
            else if (grid[i][j] == 1 && (numNeighbors == 2 || numNeighbors == 3))
                newGrid[i][j] = 1;
            else if (grid[i][j] == 1 && numNeighbors > 3)
                newGrid[i][j] = 0;
            else if (grid[i][j] == 0 && numNeighbors == 3)
                newGrid[i][j] = 1;
            else
                newGrid[i][j] = grid[i][j];
        }
    }
}

int main(int argc, char* argv[])
{
    int i,j,iter;
    // Linear grid dimension
    int dim = 1024;
    // Number of game iterations
    int maxIter = 1<<10;

    #pragma hmpp <GOL> allocate    
    // Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
    int *grid = (int*) malloc( sizeof(int) * (dim+2) *(dim+2));

    // Allocate newGrid
    int *newGrid = (int*) malloc( sizeof(int) * (dim+2) *(dim+2));

    // Assign initial population randomly
    srand(SRAND_VALUE);
    for(i = 1; i<=dim; i++) {
        for(j = 1; j<=dim; j++) {
            grid[i*(dim+2)+j] = rand() % 2;
        }
    }

    // Copy memory to the GPU
    #pragma hmpp <GOL> advancedload, args[even::dim;even::grid;even::newGrid]

    // Main game loop
    for (iter = 0; iter<maxIter; iter++) {
        if(iter%2 == 0) {
            #pragma hmpp <GOL> even callsite
            GOL(dim, grid, newGrid);
        }
        else {
            #pragma hmpp <GOL> odd callsite
            GOL(dim, grid, newGrid);
         }

    }//iter loop

    // Read back final configuration to confirm it worked
    #pragma hmpp <GOL> delegatedstore, args[odd::newGrid]

    // Sum up alive cells and print results
    int total = 0;
    for (i = 1; i<=dim; i++) {
        for (j = 1; j<=dim; j++) {
            total += newGrid[i*(dim+2)+j];
        }
    }

    printf("Total Alive: %d\n", total);

    #pragma hmpp <GOL> release 
    free(grid);
    free(newGrid);

    return 0;
}

Changes

#pragma hmpp <GOL> group target=CUDA
#pragma hmpp <GOL> mapbyname, dim
#pragma hmpp <GOL> map, args[even::grid;odd::newGrid]
#pragma hmpp <GOL> map, args[even::newGrid;odd::grid]
#pragma hmpp <GOL> even codelet, args[dim,grid,newGrid].transfer=manual
#pragma hmpp <GOL> odd  codelet, args[dim,grid,newGrid].transfer=manual

A major change to our code has been to get rid of the newGrid to grid memory copy. Instead we use some of the built-in HMPP directive functionality.

#pragma hmpp <GOL> group target=CUDA

Here we create what is known as a codelet group, named GOL. This allows us to keep track of similar kernels and set common attributes.

#pragma hmpp <GOL> mapbyname, dim

This line will map the variable dim across all kernels. This means once dim is allocated and copied to the device for one kernel it will be available for use by all kernels in the GOL group.

#pragma hmpp <GOL> map, args[even::grid;odd::newGrid]
#pragma hmpp <GOL> map, args[even::newGrid;odd::grid]

The map directive is used to map arguments between different kernels; this will effectively switch our pointers for us. The notation args[CodeletName1::Argument1;CodeletName2::Argument2] is used such that Argument1 in CodeletName1 will refer to the same device memory as Argument2 in CodeletName2. In our case we will soon be defining two kernels, one named even and one named odd.

#pragma hmpp <GOL> even codelet, args[dim,grid,newGrid].transfer=manual
#pragma hmpp <GOL> odd  codelet, args[dim,grid,newGrid].transfer=manual

The GOL group will contain two codelets, even and odd. Our game will oscillate between calling these codelets in the main game loop. The only difference between these codelets is caused by the map directive used above which effectively switches the device pointers to grid and newGrid.
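On the host side, the effect of the even/odd pair is equivalent to the classic buffer ping-pong shown in this stand-alone C sketch (step() is a toy stand-in for the GOL codelet body):

#include <stdio.h>
#include <stdlib.h>

/* Toy update rule standing in for one Game of Life step */
static void step(int n, const int *in, int *out)
{
    for (int i = 0; i < n; i++)
        out[i] = 1 - in[i];
}

int main(void)
{
    int n = 8, maxIter = 4;
    int *grid    = calloc(n, sizeof(int));
    int *newGrid = calloc(n, sizeof(int));

    for (int iter = 0; iter < maxIter; iter++) {
        if (iter % 2 == 0)
            step(n, grid, newGrid); /* 'even': read grid, write newGrid */
        else
            step(n, newGrid, grid); /* 'odd':  read newGrid, write grid */
    }
    /* With an even maxIter the last write landed in grid */
    printf("grid[0] = %d\n", grid[0]);

    free(grid);
    free(newGrid);
    return 0;
}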

#pragma hmpp <GOL> allocate

The allocate directive will allocate space on the GPU for all memory needed by the named group. Any memory objects shared by more than one codelet through the map or mapbyname directives will only be allocated once.

// Copy memory to the GPU
#pragma hmpp <GOL> advancedload, args[even::dim;even::grid;even::newGrid]

Once we have allocated space on the GPU we can copy the initial data from the host to the GPU with the advancedload directive. Notice that when referencing which arguments will be copied we qualify them with a single codelet name; either codelet name could be specified.

// Main game loop
for (iter = 0; iter<maxIter; iter++) {
    if(iter%2 == 0) {
        #pragma hmpp <GOL> even callsite
        GOL(dim, grid, newGrid);
    }
    else {
        #pragma hmpp <GOL> odd callsite
        GOL(dim, grid, newGrid);
    }
}//iter loop

The main game loop induction variable is used to switch between calling our even codelet and our odd codelet. Although it looks as though the same GOL function is being called in both cases, the map directive defined with the codelets ensures that the grid and newGrid GPU pointers are switched.

// Read back final configuration to confirm it worked
#pragma hmpp <GOL> delegatedstore, args[odd::newGrid]

After the kernel is finished we copy the grid from the device to the host.

#pragma hmpp <GOL> release

Finally, since we allocated the device memory we must free it with the release directive.

GOL-region-Try2.c

#include <stdio.h>
#include <stdlib.h>

#define SRAND_VALUE 1985
#define idx(i,j) ((i)*(dim+2)+(j))

int main(int argc, char* argv[])
{
    int i,j,iter;
    int dim = 1024;
    //Number of game iterations
    int maxIter = 1<<10;

    #pragma hmpp GOL allocate, args[grid,newGrid].size={(dim+2)*(dim+2)} 
    // Allocate square grid of (dim+2)^2 elements, 2 added for periodic boundary condition ghost cells
    int *grid = (int*) malloc( sizeof(int) * (dim+2) *(dim+2));

    // Allocate newGrid
    int *newGrid = (int*) malloc( sizeof(int) * (dim+2) *(dim+2));

    // Assign initial population randomly
    srand(SRAND_VALUE);
    for(i = 1; i<=dim; i++) {
        for(j = 1; j<=dim; j++) {
            grid[i*(dim+2)+j] = rand() % 2;
        }
    }

    // Copy memory to the GPU
    #pragma hmpp GOL advancedload, args[dim;grid;newGrid]

    // Main game loop
    for (iter = 0; iter<maxIter; iter++) {
        #pragma hmpp GOL region, target=CUDA, args[dim,grid,newGrid].transfer=manual, args[iter].transfer=atcall
        {
            int i,j;
            if(iter%2 == 0)
            {
                #pragma hmppcg gridify
                // Left-Right columns
                for (i = 1; i<=dim; i++) {
                    grid[idx(i,0)] = grid[idx(i,dim)]; //Copy last real column to left ghost column
                    grid[idx(i,dim+1)] = grid[idx(i,1)]; //Copy first real column to right ghost column
                }

                #pragma hmppcg gridify
                // Top-Bottom rows
                for (j = 0; j<=dim+1; j++) {
                    grid[idx(0,j)] = grid[idx(dim,j)]; //Copy last real row to bottom ghost row
                    grid[idx(dim+1,j)] = grid[idx(1,j)]; //Copy first real row to top ghost row
                }

                int numNeighbors;
                // Now we loop over all cells and determine their fate
                #pragma hmppcg gridify (i,j)
                for (i = 1; i<=dim; i++) {
                    for (j = 1; j<=dim; j++) {
                        // Get the number of neighbors for a given grid point
                        numNeighbors = grid[idx(i+1,j)] + grid[idx(i-1,j)]      //upper lower
                                     + grid[idx(i,j+1)] + grid[idx(i,j-1)]      //right left
                                     + grid[idx(i+1,j+1)] + grid[idx(i-1,j-1)]  //diagonals
                                     + grid[idx(i-1,j+1)] + grid[idx(i+1,j-1)];


                        // Here we have explicitly all of the game rules
                        if (grid[idx(i,j)] == 1 && numNeighbors < 2)
                            newGrid[idx(i,j)] = 0;
                        else if (grid[idx(i,j)] == 1 && (numNeighbors == 2 || numNeighbors == 3))
                            newGrid[idx(i,j)] = 1;
                        else if (grid[idx(i,j)] == 1 && numNeighbors > 3)
                            newGrid[idx(i,j)] = 0;
                        else if (grid[idx(i,j)] == 0 && numNeighbors == 3)
                            newGrid[idx(i,j)] = 1;
                        else
                            newGrid[idx(i,j)] = grid[idx(i,j)];
                    }
                }

            } else {
            #pragma hmppcg gridify
            // Left-Right columns
            for (i = 1; i<=dim; i++) {
                newGrid[idx(i,0)] = newGrid[idx(i,dim)]; //Copy last real column to left ghost column
                newGrid[idx(i,dim+1)] = newGrid[idx(i,1)]; //Copy first real column to right ghost column
            }

            #pragma hmppcg gridify
            // Top-Bottom rows
            for (j = 0; j<=dim+1; j++) {
                newGrid[idx(0,j)] = newGrid[idx(dim,j)]; //Copy last real row to bottom ghost row
                newGrid[idx(dim+1,j)] = newGrid[idx(1,j)]; //Copy first real row to top ghost row
            }

            int numNeighbors;
            // Now we loop over all cells and determine their fate
            #pragma hmppcg gridify (i,j)
            for (i = 1; i<=dim; i++) {
                for (j = 1; j<=dim; j++) {
                    // Get the number of neighbors for a given grid point
                    numNeighbors = newGrid[idx(i+1,j)] + newGrid[idx(i-1,j)]      //upper lower
                                 + newGrid[idx(i,j+1)] + newGrid[idx(i,j-1)]      //right left
                                 + newGrid[idx(i+1,j+1)] + newGrid[idx(i-1,j-1)]  //diagonals
                                 + newGrid[idx(i-1,j+1)] + newGrid[idx(i+1,j-1)];


                    // Here we have explicitly all of the game rules
                    if (newGrid[idx(i,j)] == 1 && numNeighbors < 2)
                        grid[idx(i,j)] = 0;
                    else if (newGrid[idx(i,j)] == 1 && (numNeighbors == 2 || numNeighbors == 3))
                        grid[idx(i,j)] = 1;
                    else if (newGrid[idx(i,j)] == 1 && numNeighbors > 3)
                        grid[idx(i,j)] = 0;
                    else if (newGrid[idx(i,j)] == 0 && numNeighbors == 3)
                        grid[idx(i,j)] = 1;
                    else
                        grid[idx(i,j)] = newGrid[idx(i,j)];
                }
            }

          }
        }
    }//iter loop

    // Read back final configuration to confirm it worked
    #pragma hmpp GOL delegatedstore, args[grid]

    // Sum up alive cells and print results
    int total = 0;
    for (i = 1; i<=dim; i++) {
        for (j = 1; j<=dim; j++) {
            total += grid[i*(dim+2)+j];
        }
    }

    printf("Total Alive: %d\n", total);

    #pragma hmpp GOL release 
    free(grid);
    free(newGrid);

    return 0;
}

Changes

#pragma hmpp GOL allocate, args[grid,newGrid].size={(dim+2)*(dim+2)}

The hmpp GOL allocate directive will allocate space on the GPU for the named region. The compiler requires us to specify the size of the passed-in arrays at allocation time.

#pragma hmpp GOL advancedload, args[dim;grid;newGrid]

The advancedload directive copies the named arguments of the region from the host to the GPU.

#pragma hmpp GOL region, target=CUDA, args[dim,grid,newGrid].transfer=manual, args[iter].transfer=atcall

In this version of the code we have our region inside of our main game loop but no longer have to worry about unnecessary memory copies.

           if(iter%2 == 0)
            {
                #pragma hmppcg gridify
                // Left-Right columns
                for (i = 1; i<=dim; i++) {
                    grid[idx(i,0)] = grid[idx(i,dim)]; //Copy last real column to left ghost column
                    grid[idx(i,dim+1)] = grid[idx(i,1)]; //Copy first real column to right ghost column
                }
 
                #pragma hmppcg gridify
                // Top-Bottom rows
                for (j = 0; j<=dim+1; j++) {
                    grid[idx(0,j)] = grid[idx(dim,j)]; //Copy last real row to bottom ghost row
                    grid[idx(dim+1,j)] = grid[idx(1,j)]; //Copy first real row to top ghost row
                }
 
                int numNeighbors;
                // Now we loop over all cells and determine their fate
                #pragma hmppcg gridify (i,j)
                for (i = 1; i<=dim; i++) {
                    for (j = 1; j<=dim; j++) {
                        // Get the number of neighbors for a given grid point
                        numNeighbors = grid[idx(i+1,j)] + grid[idx(i-1,j)]      //upper lower
                                     + grid[idx(i,j+1)] + grid[idx(i,j-1)]      //right left
                                     + grid[idx(i+1,j+1)] + grid[idx(i-1,j-1)]  //diagonals
                                     + grid[idx(i-1,j+1)] + grid[idx(i+1,j-1)];
 
                        // Here we have explicitly all of the game rules
                        if (grid[idx(i,j)] == 1 && numNeighbors < 2)
                            newGrid[idx(i,j)] = 0;
                        else if (grid[idx(i,j)] == 1 && (numNeighbors == 2 || numNeighbors == 3))
                            newGrid[idx(i,j)] = 1;
                        else if (grid[idx(i,j)] == 1 && numNeighbors > 3)
                            newGrid[idx(i,j)] = 0;
                        else if (grid[idx(i,j)] == 0 && numNeighbors == 3)
                            newGrid[idx(i,j)] = 1;
                        else
                            newGrid[idx(i,j)] = grid[idx(i,j)];
                    }
                }
 
            } else {
            #pragma hmppcg gridify
            // Left-Right columns
            for (i = 1; i<=dim; i++) {
                newGrid[idx(i,0)] = newGrid[idx(i,dim)]; //Copy last real column to left ghost column
                newGrid[idx(i,dim+1)] = newGrid[idx(i,1)]; //Copy first real column to right ghost column
            }
 
            #pragma hmppcg gridify
            // Top-Bottom rows
            for (j = 0; j<=dim+1; j++) {
                newGrid[idx(0,j)] = newGrid[idx(dim,j)]; //Copy last real row to bottom ghost row
                newGrid[idx(dim+1,j)] = newGrid[idx(1,j)]; //Copy first real row to top ghost row
            }
 
            int numNeighbors;
            // Now we loop over all cells and determine their fate
            #pragma hmppcg gridify (i,j)
            for (i = 1; i<=dim; i++) {
                for (j = 1; j<=dim; j++) {
                    // Get the number of neighbors for a given grid point
                    numNeighbors = newGrid[idx(i+1,j)] + newGrid[idx(i-1,j)]      //upper lower
                                 + newGrid[idx(i,j+1)] + newGrid[idx(i,j-1)]      //right left
                                 + newGrid[idx(i+1,j+1)] + newGrid[idx(i-1,j-1)]  //diagonals
                                 + newGrid[idx(i-1,j+1)] + newGrid[idx(i+1,j-1)];
 
                    // Here we have explicitly all of the game rules
                    if (newGrid[idx(i,j)] == 1 && numNeighbors < 2)
                        grid[idx(i,j)] = 0;
                    else if (newGrid[idx(i,j)] == 1 && (numNeighbors == 2 || numNeighbors == 3))
                        grid[idx(i,j)] = 1;
                    else if (newGrid[idx(i,j)] == 1 && numNeighbors > 3)
                        grid[idx(i,j)] = 0;
                    else if (newGrid[idx(i,j)] == 0 && numNeighbors == 3)
                        grid[idx(i,j)] = 1;
                    else
                        grid[idx(i,j)] = newGrid[idx(i,j)];
                }
            }
 
          }

A major change to our code has been to get rid of the grid/newGrid switching kernel. We have explicitly coded the two cases and use the modulus of the main iteration variable to switch between the two pieces of code. This increases the code complexity but does get around a costly memory transfer.

// Read back final configuration to confirm it worked
#pragma hmpp GOL delegatedstore, args[grid]

After the main game loop finishes we copy the final grid from the device to the host with hmpp GOL delegatedstore, args[grid].

#pragma hmpp GOL release

Finally, since we allocated the device memory we must free it with the release directive.

GOL-codelet-Try2.f90

!$hmpp <GOL> group, target=CUDA
!$hmpp <GOL> mapbyname, dim
!$hmpp <GOL> map, args[even::grid;odd::newGrid]
!$hmpp <GOL> map, args[even::newGrid;odd::grid]
!$hmpp <GOL> even codelet, args[dim,grid,newGrid].transfer=manual
!$hmpp <GOL> odd  codelet, args[dim,grid,newGrid].transfer=manual
subroutine GOL(dim, grid, newGrid)
    implicit none
    integer, intent(in) :: dim
    integer, intent(inout) :: grid(dim+2,dim+2), newGrid(dim+2,dim+2)
    integer :: i,j,numNeighbors

    ! Top-Bottom ghost rows
    do j=2,dim+1
        grid(1,j) = grid(dim+1,j) !Copy last game grid row to bottom ghost row
        grid(dim+2,j) = grid(2,j) !Copy first game grid row to top ghost row
    enddo

    ! Left-Right ghost columns
    do i=1,dim+2
        grid(i,1) = grid(i, dim+1) !Copy last game grid column to left ghost column
        grid(i,dim+2) = grid(i,2)  !Copy first game grid column to right ghost column
    enddo


    ! Now we loop over all cells and determine their fate
    do j=2,dim+1
       do i=2,dim+1
           ! Get the number of neighbors for a given grid point
            numNeighbors = grid(i,j+1) + grid(i,j-1)&     !right & left
                         + grid(i+1,j) + grid(i-1,j)&     !upper & lower
                         + grid(i+1,j+1) + grid(i-1,j-1)& !diagonals
                         + grid(i-1,j+1) + grid(i+1,j-1)

           ! Here we have explicitly all of the game rules
           if(grid(i,j) == 1 .AND. numNeighbors < 2) then
                newGrid(i,j) = 0
            elseif(grid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
                newGrid(i,j) = 1
            elseif(grid(i,j) == 1 .AND. numNeighbors > 3) then
                newGrid(i,j) = 0
            elseif(grid(i,j) == 0 .AND. numNeighbors == 3) then
                newGrid(i,j) = 1
            else
                newGrid(i,j) = grid(i,j)
            endif
        enddo
    enddo
end subroutine GOL

program main
    implicit none

    integer :: i,j,seed(8),iter,total
    real :: randm
    ! Linear game grid dimension
    integer :: dim = 1024
    ! Number of game iterations
    integer :: maxIter = 2**10

    ! Game grid pointers
    integer,dimension(:,:),allocatable :: grid, newGrid, tmpGrid

    !$hmpp <GOL> allocate 
    ! Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
    allocate(grid(dim+2,dim+2))
    allocate(newGrid(dim+2,dim+2))

    ! Assign initial population randomly
    seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/)
    call random_seed(PUT=seed)
    do j=1,dim
        do i=1,dim
            call random_number(randm)
            grid(i,j) = nint(randm)
        enddo
    enddo

    ! Copy memory to the GPU
    !$hmpp <GOL> advancedload, args[even::dim;even::grid;even::newGrid]

    ! Main game loop
    do iter=1,maxIter
        if(MOD(iter,2) == 1) then
            !$hmpp <GOL> even callsite
             call GOL(dim, grid, newGrid)
         else
             !$hmpp <GOL> odd callsite
             call GOL(dim, grid, newGrid)
         endif
    enddo!end main game loop 

    ! Read back final configuration to confirm it worked
    !$hmpp <GOL> delegatedstore, args[odd::newGrid]

    ! Sum up alive cells and print results
    total = 0
    do j=2,dim+1
        do i=2,dim+1
            total = total + newGrid(i,j)
        enddo
    enddo
    print *, "Total Alive", total

    ! Release memory
    !$hmpp <GOL> release
    deallocate(grid)
    deallocate(newGrid)

end program

Changes

!$hmpp <GOL> group, target=CUDA
!$hmpp <GOL> mapbyname, dim
!$hmpp <GOL> map, args[even::grid;odd::newGrid]
!$hmpp <GOL> map, args[even::newGrid;odd::grid]
!$hmpp <GOL> even codelet, args[dim,grid,newGrid].transfer=manual
!$hmpp <GOL> odd  codelet, args[dim,grid,newGrid].transfer=manual

A major change to our code has been to get rid of the newGrid to grid memory copy. Instead we use some of the built-in HMPP directive functionality.

!$hmpp <GOL> group, target=CUDA

Here we create what is known as a codelet group, named GOL. This allows us to keep track of similar kernels and set common attributes.

!$hmpp <GOL> mapbyname, dim

This line will map the variable dim across all kernels. This means once dim is allocated and copied to the device for one kernel it will be available for use by all kernels in the GOL group.

!$hmpp <GOL> map, args[even::grid;odd::newGrid]
!$hmpp <GOL> map, args[even::newGrid;odd::grid]

The map directive is used to map arguments between different kernels; this will effectively switch our pointers for us. The notation args[CodeletName1::Argument1;CodeletName2::Argument2] is used such that Argument1 in CodeletName1 will refer to the same device memory as Argument2 in CodeletName2. In our case we will soon be defining two kernels, one named even and one named odd.

!$hmpp <GOL> even codelet, args[dim,grid,newGrid].transfer=manual
!$hmpp <GOL> odd  codelet, args[dim,grid,newGrid].transfer=manual

The GOL group will contain two codelets, even and odd. Our game will oscillate between calling these codelets in the main game loop. The only difference between these codelets is caused by the map directive used above which effectively switches the device pointers to grid and newGrid.

!$hmpp <GOL> allocate

The allocate directive will allocate space on the GPU for all memory needed by the named group. Any memory objects shared by more than one codelet through the map or mapbyname directives will only be allocated once.

! Copy memory to the GPU
!$hmpp <GOL> advancedload, args[even::dim;even::grid;even::newGrid]

Once we have allocated space on the GPU we can copy the initial data from the host to the GPU with the advancedload directive. Notice that when referencing which arguments will be copied we qualify them with a single codelet name; either codelet name could be specified.

if(MOD(iter,2) == 1) then
    !$hmpp <GOL> even callsite, args[grid].noupdate=true
    call GOL(dim, grid, newGrid)
else
    !$hmpp <GOL> odd callsite, args[grid].noupdate=true
    call GOL(dim, grid, newGrid)
endif

We use the main game loop induction variable to switch between calling our even codelet and our odd codelet. Although it looks as though the same GOL function is being called in both cases, the map directive defined with the codelets ensures that the grid and newGrid GPU pointers are switched. args[grid].noupdate=true indicates that we do not want any implicit data movement between the host and device.

! Read back final configuration to confirm it worked
!$hmpp <GOL> delegatedstore, args[odd::newGrid]

After the main game loop finishes we copy the final game grid from the device to the host with the delegatedstore directive.

!$hmpp <GOL> release

Finally, since we allocated the device memory we must free it with the release directive.

GOL-region-Try2.f90

program main
    implicit none

    integer :: i,j,seed(8),iter,numNeighbors,total
    real :: randm
    ! Linear game grid dimension
    integer :: dim = 1024
    ! Number of game iterations
    integer :: maxIter = 2**10

    ! Game grid pointers
    integer,dimension(:,:),allocatable :: grid, newGrid, tmpGrid

    !$hmpp GOL allocate, args[grid,newGrid].size={(dim+2),(dim+2)}
    ! Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
    allocate(grid(dim+2,dim+2))
    allocate(newGrid(dim+2,dim+2))

    ! Assign initial population randomly
    seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/)
    call random_seed(PUT=seed)
    do j=1,dim
        do i=1,dim
            call random_number(randm)
            grid(i,j) = nint(randm)
        enddo
    enddo

    ! Copy memory to the GPU
    !$hmpp GOL advancedload, args[dim;grid;newGrid]

    ! Main game loop
    do iter=1,maxIter
    !$hmpp GOL region, target=CUDA, args[dim,grid,newGrid].transfer=manual, args[iter].transfer=atcall
        if(MOD(iter,2) == 1) then
            ! Top-Bottom ghost rows
            do j=2,dim+1
                grid(1,j) = grid(dim+1,j) !Copy last game grid row to bottom ghost row
                grid(dim+2,j) = grid(2,j) !Copy first game grid row to top ghost row
            enddo

            ! Left-Right ghost columns
            do i=1,dim+2
                grid(i,1) = grid(i, dim+1) !Copy last game grid column to left ghost column
                grid(i,dim+2) = grid(i,2)  !Copy first game grid column to right ghost column
            enddo

            ! Now we loop over all cells and determine their fate
            !$hmppcg gridify(j,i)
            do j=2,dim+1
                do i=2,dim+1
                    ! Get the number of neighbors for a given grid point
                    numNeighbors = grid(i,j+1) + grid(i,j-1)&     !right & left
                                 + grid(i+1,j) + grid(i-1,j)&     !upper & lower
                                 + grid(i+1,j+1) + grid(i-1,j-1)& !diagonals
                                 + grid(i-1,j+1) + grid(i+1,j-1)

                    ! Here we have explicitly all of the game rules
                    if(grid(i,j) == 1 .AND. numNeighbors < 2) then
                        newGrid(i,j) = 0
                    elseif(grid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
                        newGrid(i,j) = 1
                    elseif(grid(i,j) == 1 .AND. numNeighbors > 3) then
                        newGrid(i,j) = 0
                    elseif(grid(i,j) == 0 .AND. numNeighbors == 3) then
                        newGrid(i,j) = 1
                    else
                        newGrid(i,j) = grid(i,j)
                    endif
                enddo
            enddo
        else
            ! Top-Bottom ghost rows
            do j=2,dim+1
                newGrid(1,j) = newGrid(dim+1,j) !Copy last game grid row to bottom ghost row
                newGrid(dim+2,j) = newGrid(2,j) !Copy first game grid row to top ghost row
            enddo

            ! Left-Right ghost columns
            do i=1,dim+2
                newGrid(i,1) = newGrid(i, dim+1) !Copy last game grid column to left ghost column
                newGrid(i,dim+2) = newGrid(i,2)  !Copy first game grid column to right ghost column
            enddo

            ! Now we loop over all cells and determine their fate
            !$hmppcg gridify(j,i)
            do j=2,dim+1
                do i=2,dim+1
                    ! Get the number of neighbors for a given grid point
                    numNeighbors = newGrid(i,j+1) + newGrid(i,j-1)&     !right & left
                                 + newGrid(i+1,j) + newGrid(i-1,j)&     !upper & lower
                                 + newGrid(i+1,j+1) + newGrid(i-1,j-1)& !diagonals
                                 + newGrid(i-1,j+1) + newGrid(i+1,j-1)

                    ! Here we have explicitly all of the game rules
                    if(newGrid(i,j) == 1 .AND. numNeighbors < 2) then
                        grid(i,j) = 0
                    elseif(newGrid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
                        grid(i,j) = 1
                    elseif(newGrid(i,j) == 1 .AND. numNeighbors > 3) then
                        grid(i,j) = 0
                    elseif(newGrid(i,j) == 0 .AND. numNeighbors == 3) then
                        grid(i,j) = 1
                    else
                        grid(i,j) = newGrid(i,j)
                    endif
                enddo
            enddo
        endif !End grid newGrid switch
       !$hmpp GOL endregion

    enddo!end main game loop 

    ! Read back final configuration to confirm it worked
    !$hmpp GOL delegatedstore, args[grid]

    ! Sum up alive cells and print results
    total = 0
    do j=2,dim+1
        do i=2,dim+1
            total = total + grid(i,j)
        enddo
    enddo
    print *, "Total Alive", total

    ! Release memory
    !$hmpp GOL release
    deallocate(grid)
    deallocate(newGrid)

end program

Changes

!$hmpp GOL allocate, args[grid,newGrid].size={(dim+2),(dim+2)}

The hmpp GOL allocate directive will allocate space on the GPU for the named region. The compiler requires us to specify the size of the passed-in arrays at allocation time with args[grid,newGrid].size={(dim+2),(dim+2)}.

!$hmpp GOL advancedload, args[dim;grid;newGrid]

This directive will copy the named arguments of the region from the host to the GPU.

!$hmpp GOL region, target=CUDA, args[dim,grid,newGrid].transfer=manual, args[iter].transfer=atcall

In this version of the code we have our region inside of our main game loop but no longer have to worry about unnecessary memory copies.

The body of the main game loop:

        if(MOD(iter,2) == 1) then
            ! Top-Bottom ghost rows
            do j=2,dim+1
                grid(1,j) = grid(dim+1,j) !Copy last game grid row to bottom ghost row
                grid(dim+2,j) = grid(2,j) !Copy first game grid row to top ghost row
            enddo
 
            ! Left-Right ghost columns
            do i=1,dim+2
                grid(i,1) = grid(i, dim+1) !Copy last game grid column to left ghost column
                grid(i,dim+2) = grid(i,2)  !Copy first game grid column to right ghost column
            enddo
 
            ! Now we loop over all cells and determine their fate
            !$hmppcg gridify(j,i)
            do j=2,dim+1
                do i=2,dim+1
                    ! Get the number of neighbors for a given grid point
                    numNeighbors = grid(i,j+1) + grid(i,j-1)&     !right & left
                                 + grid(i+1,j) + grid(i-1,j)&     !upper & lower
                                 + grid(i+1,j+1) + grid(i-1,j-1)& !diagonals
                                 + grid(i-1,j+1) + grid(i+1,j-1)
 
                    ! Here we have explicitly all of the game rules
                    if(grid(i,j) == 1 .AND. numNeighbors < 2) then
                        newGrid(i,j) = 0
                    elseif(grid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
                        newGrid(i,j) = 1
                    elseif(grid(i,j) == 1 .AND. numNeighbors > 3) then
                        newGrid(i,j) = 0
                    elseif(grid(i,j) == 0 .AND. numNeighbors == 3) then
                        newGrid(i,j) = 1
                    else
                        newGrid(i,j) = grid(i,j)
                    endif
                enddo
            enddo
        else
            ! Top-Bottom ghost rows
            do j=2,dim+1
                newGrid(1,j) = newGrid(dim+1,j) !Copy last game grid row to bottom ghost row
                newGrid(dim+2,j) = newGrid(2,j) !Copy first game grid row to top ghost row
            enddo
 
            ! Left-Right ghost columns
            do i=1,dim+2
                newGrid(i,1) = newGrid(i, dim+1) !Copy last game grid column to left ghost column
                newGrid(i,dim+2) = newGrid(i,2)  !Copy first game grid column to right ghost column
            enddo
 
            ! Now we loop over all cells and determine their fate
            !$hmppcg gridify(j,i)
            do j=2,dim+1
                do i=2,dim+1
                    ! Get the number of neighbors for a given grid point
                    numNeighbors = newGrid(i,j+1) + newGrid(i,j-1)&     !right & left
                                 + newGrid(i+1,j) + newGrid(i-1,j)&     !upper & lower
                                 + newGrid(i+1,j+1) + newGrid(i-1,j-1)& !diagonals
                                 + newGrid(i-1,j+1) + newGrid(i+1,j-1)
 
                    ! Here we have explicitly all of the game rules
                    if(newGrid(i,j) == 1 .AND. numNeighbors < 2) then
                        grid(i,j) = 0
                    elseif(newGrid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
                        grid(i,j) = 1
                    elseif(newGrid(i,j) == 1 .AND. numNeighbors > 3) then
                        grid(i,j) = 0
                    elseif(newGrid(i,j) == 0 .AND. numNeighbors == 3) then
                        grid(i,j) = 1
                    else
                        grid(i,j) = newGrid(i,j)
                    endif
                enddo
            enddo
        endif !End grid newGrid switch

A major change to our code has been to get rid of the grid/newGrid switching kernel. We have explicitly coded the two cases and use the modulus of the main iteration variable to switch between the two pieces of code. This increases the code complexity but does get around a costly memory transfer.

! Now we loop over all cells and determine their fate
!$hmppcg gridify(j,i)
! Now we loop over all cells and determine their fate
!$hmppcg gridify(j,i)

By default the hmpp compiler did not want to gridify the loop over each cell in either branch. We can attempt to force the compiler to do so by adding the hmppcg gridify(j,i) directive, which specifies that we want to gridify with respect to the induction variables j and i.

! Read back final configuration to confirm it worked
!$hmpp GOL delegatedstore, args[grid]

After the main game loop finishes we copy the final game grid from the device to the host with hmpp GOL delegatedstore, args[grid].

! Release memory
!$hmpp GOL release

Finally since we allocated the device memory we must free it with the directive hmpp GOL release.

HMPP Vector Addition

This sample shows a minimal conversion of our vector addition CPU code to an HMPP accelerator directives version; consider this an HMPP ‘Hello World’. Modifications from the CPU version will be highlighted and briefly discussed. Please direct any questions or comments to help@nccs.gov.

This tutorial covers CAPS HMPP accelerator directives. If you are interested in CAPS OpenACC support, please see: OpenACC Vector Addition

HMPP allows code to be offloaded onto the GPU using two different methods, both of which are covered. The codelet method allows an entire C function or Fortran subroutine to be executed on the GPU. The region method allows a contiguous block of code, not necessarily residing in a function or subroutine, to be executed on the GPU.

vecAdd-codelet.c

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#pragma hmpp vecAdd codelet, target=CUDA, args[*].transfer=atcall, args[ c ].io=out 
void vecAdd(int n, double a[n], double b[n], double c[n])
{
    int j;
    for(j=0; j<n; j++) {
        c[j] = a[j] + b[j];
    }
}

int main( int argc, char* argv[] )
{
    // Size of vectors
    int n = 100000;

    // Input vectors
    double *a;
    double *b;
    // Output vector
    double *c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector
    a = (double*)malloc(bytes);
    b = (double*)malloc(bytes);
    c = (double*)malloc(bytes);

    // Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    int i;
    for(i=0; i<n; i++) {
        a[i] = sin(i)*sin(i);
        b[i] = cos(i)*cos(i);
    }

    // Sum component wise and save result into vector c
    #pragma hmpp vecAdd callsite
    vecAdd(n, a, b, c);

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++) {
        sum += c[i];
    }
    sum = sum/n;
    printf("final result: %f\n", sum);

    // Release memory
    free(a);
    free(b);
    free(c);

    return 0;
}

Changes to vecAdd-codelet.c

#pragma hmpp vecAdd codelet, target=CUDA, args[*].transfer=atcall, args[ c ].io=out 
void vecAdd(int n, double a[n], double b[n], double c[n])
{
    int j;
    for(j=0; j<n; j++) {
        c[j] = a[j] + b[j];
    }
}

The combined #pragma hmpp directive and C function vecAdd form what is referred to as the codelet. This codelet, given the name vecAdd, will be computed on the GPU when matched with an HMPP callsite. Memory is copied from the CPU to the GPU at the start of the codelet and back from the GPU to the CPU at the end of the codelet. It must be noted that the current compiler, version 2.4.1, does not correctly copy the vector c from the GPU to the host at the end of the codelet call and so we must specify it explicitly with args[c].io=out. This will be explored in more detail later.

#pragma hmpp vecAdd callsite
vecAdd(n, a, b, c);

The combined #pragma hmpp directive and C function call form what is referred to as the callsite. The callsite will trigger the specified codelet to be run on the GPU.

Compiling and Running vecAdd-codelet.c

Before compiling, the required modules must be loaded:

$ module load PrgEnv-pgi capsmc cudatoolkit
$ hmpp cc vecAdd-codelet.c -o vecAdd.out

Output:
The compiler will output the following:

hmpp: [Info] Generated codelet filename is "vecadd_cuda.cu".
hmppcg: [Message DPL3000] vecAdd-codelet.c:9: Loop 'j' was gridified (1D)

The compiler tells us that it has created the CUDA file vecadd_cuda.cu for the codelet. The second line tells us that the loop starting on line 9 with induction variable ‘j’ will be parallelized on the GPU and that the kernel will launch with a 1 dimensional grid of thread blocks.
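For intuition, "gridified (1D)" means each value of j is handed to its own GPU thread, identified by a block index and a thread index within the block. A serial C sketch of that decomposition (the blockDim/blockIdx/threadIdx names are borrowed from CUDA purely for illustration):

#include <stdio.h>

int main(void)
{
    int n = 10, blockDim = 4;
    int gridDim = (n + blockDim - 1) / blockDim; /* number of thread blocks */

    for (int blockIdx = 0; blockIdx < gridDim; blockIdx++) {
        for (int threadIdx = 0; threadIdx < blockDim; threadIdx++) {
            int j = blockIdx * blockDim + threadIdx; /* one 'thread' per j */
            if (j < n) /* guard the ragged last block */
                printf("block %d, thread %d -> j = %d\n",
                       blockIdx, threadIdx, j);
        }
    }
    return 0;
}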

$ aprun ./vecAdd.out
final result: 1.000000

vecAdd-codelet.f90

!$hmpp vecAdd codelet, target=CUDA, args[*].transfer=atcall, args[ c ].io=out
subroutine vecAdd(n, a, b, c)
    implicit none
    integer, intent(in) :: n
    real(8), intent(in) :: a(n), b(n)
    real(8), intent(out) :: c(n)

    integer :: j
    do j=1,n
        c(j) = a(j) + b(j)
    enddo
end subroutine vecAdd

program main

    ! Size of vectors
    integer :: n = 100000

    ! Input vectors
    real(8),dimension(:),allocatable :: a
    real(8),dimension(:),allocatable :: b
    ! Output vector
    real(8),dimension(:),allocatable :: c

    integer :: i
    real(8) :: sum

    ! Allocate memory for each vector
    allocate(a(n))
    allocate(b(n))
    allocate(c(n))

    ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    do i=1,n
        a(i) = sin(i*1D0)*sin(i*1D0)
        b(i) = cos(i*1D0)*cos(i*1D0)
    enddo

    ! Sum component wise and save result into vector c

    !$hmpp vecAdd callsite
    call vecAdd(n, a, b, c)

    ! Sum up vector c and print result divided by n, this should equal 1 within error
    sum = 0d0
    do i=1,n
        sum = sum + c(i)
    enddo
    sum = sum/n
    print *, 'final result: ', sum

    ! Release memory
    deallocate(a)
    deallocate(b)
    deallocate(c)

end program main

Changes to vecAdd-codelet.f90

!$hmpp vecAdd codelet, target=CUDA, args[*].transfer=atcall, args[ c ].io=out
subroutine vecAdd(n, a, b, c)
    implicit none
    integer, intent(in) :: n
    real(8), intent(in) :: a(n), b(n)
    real(8), intent(out) :: c(n)

    integer :: j
    do j=1,n
        c(j) = a(j) + b(j)
    enddo
end subroutine vecAdd

The combined !$hmpp directive and Fortran subroutine vecAdd form what is referred to as the codelet. This codelet, given the name vecAdd, will be computed on the GPU when matched with an HMPP callsite. Memory is copied from the CPU to the GPU at the start of the codelet and back from the GPU to the CPU at the end of the codelet. It must be noted that the current compiler, version 2.4.1, does not correctly copy the vector c from the GPU to the host at the end of the codelet call and so we must specify it explicitly with args[c].io=out. This will be explored in more detail later.

!$hmpp vecAdd callsite
call vecAdd(n, a, b, c)

The combined !$hmpp directive and Fortran subroutine call form what is referred to as the callsite. The callsite will trigger the specified codelet to be run on the GPU.

Compiling vecAdd-codelet.f90

$ module load PrgEnv-pgi cudatoolkit capsmc
$ hmpp ftn vecAdd-codelet.f90 -o vecAdd.out

Output:
The compiler will output the following:

hmpp: [Info] Generated codelet filename is "vecadd_cuda.cu".
hmppcg: [Message DPL3000] vecAdd-codelet.f90:9: Loop 'j' was gridified (1D)

The compiler tells us that it has created the CUDA file vecadd_cuda.cu for the codelet. The second line tells us that the loop starting on line 9 with induction variable ‘j’ will be parallelized on the GPU and that the kernel will launch with a 1 dimensional grid of thread blocks.

Additional Codelet Information

Much information is obscured from the programmer, so let's add the --io-report hmpp flag to see what memory transfers will take place between the GPU and host.

C

$ hmpp --io-report cc vecAdd-codelet.c -o vecAdd.out

Fortran

$ hmpp --io-report ftn vecAdd-codelet.f90 -o vecAdd.out

Output

In GROUP 'vecadd'
 CODELET 'vecadd' at vecAdd-codelet.c:5, function 'vecAdd'
    Parameter 'n' has intent IN
    Parameter 'a' has intent IN
    Parameter 'b' has intent IN
    Parameter 'c' has intent OUT

We see that n, a, and b will be copied into the GPU while c will be copied out.
What if we were to omit the intent for the vector c to be copied back to the host in our codelet declaration?
C

#pragma hmpp vecAdd codelet, target=CUDA

Fortran

!$hmpp vecAdd codelet, target=CUDA

Output

In GROUP 'vecadd'
 CODELET 'vecadd' at vecAdd-codelet.c:5, function 'vecAdd'
    Parameter 'n' has intent IN
    Parameter 'a' has intent IN
    Parameter 'b' has intent IN
    Parameter 'c' has intent IN

We see that the compiler does not correctly copy the vector c back to the host. This will cause erroneous results that do not produce any warning or error message. It is vitally important to always check that memory transfers are correct.
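One inexpensive safeguard is a host-side sanity check on the result vector after the codelet returns. A stand-alone C sketch (toy code, not part of the tutorial sources; c is deliberately left zeroed to mimic a vector that never came back from the GPU):

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n = 1000;
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    double *c = calloc(n, sizeof(double)); /* stale result: all zeros */

    for (int i = 0; i < n; i++) {
        a[i] = sin(i) * sin(i);
        b[i] = cos(i) * cos(i);
    }

    /* Check a known identity of the expected result: a[i] + b[i] == 1 */
    int ok = 1;
    for (int i = 0; i < n; i++) {
        if (fabs(a[i] + b[i] - c[i]) > 1e-12) { ok = 0; break; }
    }
    printf("%s\n", ok ? "transfer check passed" : "transfer check FAILED");

    free(a); free(b); free(c);
    return 0;
}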

vecAdd-region.c

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main( int argc, char* argv[] )
{

    // Size of vectors
    int n = 100000;

    // Input vectors
    double *a;
    double *b;
    // Output vector
    double *c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector
    a = (double*)malloc(bytes);
    b = (double*)malloc(bytes);
    c = (double*)malloc(bytes);
    
    // Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    int i;
    for(i=0; i<n; i++) {
        a[i] = sin(i)*sin(i);
        b[i] = cos(i)*cos(i);
    }

    // Sum component wise and save result into vector c
    #pragma hmpp vecAdd region, target=CUDA,  args[*].transfer=atcall
    {
        int j;
        for(j=0; j<n; j++) {
            c[j] = a[j] + b[j];
        }
    }

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++) {
        sum += c[i];
    }
    sum = sum/n;
    printf("final result: %f\n", sum);

    // Release memory
    free(a);
    free(b);
    free(c);

    return 0;
}

Changes to vecAdd-region.c

#pragma hmpp vecAdd region, target=CUDA, args[*].transfer=atcall
{
    int j;
    for(j=0; j<n; j++) {
        c[j] = a[j] + b[j];
    }
}

The code inside of the hmpp region is computed on the GPU. The region begins with the #pragma hmpp region directive and is enclosed in curly brackets. Memory is copied from the CPU to the GPU at the start of the region and back from the GPU to the CPU at the end of the region.

Compiling vecAdd-region.c

$ hmpp --io-report cc vecAdd-region.c -o vecAdd.out

Output:
The compiler will output the following:

In GROUP 'vecadd'
 REGION 'vecadd' at vecAdd-region.c:34, function '__hmpp_region__vecadd'
    Parameter  'n' has intent IN
    Parameter  'a' has intent IN
    Parameter  'b' has intent IN
    Parameter  'c' has intent INOUT

We see that n, a, b, and c will be copied onto the GPU, while c will also be copied back out. This will produce the correct output, although the GPU is doing the unnecessary extra work of copying the contents of c to the GPU. Memory management will be looked at further in the Game of Life tutorial.

hmpp: [Info] Generated codelet filename is "vecadd_cuda.cu".
hmppcg: [Message DPL3000] vecAdd-region.c:37: Loop 'j' was gridified (1D)

The compiler tells us that it has created the CUDA file vecadd_cuda.cu for the region. The second line tells us that the loop starting on line 37 with induction variable ‘j’ will be parallelized on the GPU and that the kernel will launch with a 1 dimensional grid of thread blocks.


vecAdd-region.f90


program main

    ! Size of vectors
    integer :: n = 100000

    ! Input vectors
    real(8),dimension(:),allocatable :: a
    real(8),dimension(:),allocatable :: b
    !Output vector
    real(8),dimension(:),allocatable :: c

    integer :: i
    real(8) :: sum

    ! Allocate memory for each vector
    allocate(a(n))
    allocate(b(n))
    allocate(c(n))

    ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    do i=1,n
        a(i) = sin(i*1D0)*sin(i*1D0)
        b(i) = cos(i*1D0)*cos(i*1D0)
    enddo

    ! Sum component wise and save result into vector c

    !$hmpp vecAdd region, target=CUDA, args[*].transfer=atcall
    do i=1,n
        c(i) = a(i) + b(i)
    enddo
    !$hmpp vecAdd endregion

    ! Sum up vector c and print result divided by n, this should equal 1 within error
    sum = 0d0
    do i=1,n
        sum = sum + c(i)
    enddo
    sum = sum/n
    print *, 'final result: ', sum

    ! Release memory
    deallocate(a)
    deallocate(b)
    deallocate(c)

end program

Changes to vecAdd-region.f90

!$hmpp vecAdd region, target=CUDA,  args[*].transfer=atcall
do i=1,n
    c(i) = a(i) + b(i)
enddo
!$hmpp vecAdd endregion

The code inside of the hmpp region is computed on the GPU. The region begins with the !$hmpp region directive and ends with the !$hmpp endregion directive. Memory is copied from the CPU to the GPU at the start of the region and back from the GPU to the CPU at the end of the region.

Compiling vecAdd-region.f90

$ hmpp --io-report ftn vecAdd-region.f90 -o vecAdd.out

Output
The compiler will output the following:

In GROUP 'vecadd'
 REGION 'vecadd' at vecAdd-region.f90:28, function 'hmpp_region__vecadd'
    Parameter  'n' has intent IN
    Parameter  'n_1' has intent IN
    Parameter  'n_2' has intent IN
    Parameter  'a' has intent IN
    Parameter  'n_4' has intent IN
    Parameter  'n_5' has intent IN
    Parameter  'b' has intent IN
    Parameter  'n_7' has intent IN
    Parameter  'n_8' has intent IN
    Parameter  'c' has intent INOUT
    Parameter  'i' has intent INOUT

The current HMPP compiler doesn’t do too well with the Fortran version, copying in several variables that are not used. For now we will need to ignore these erroneous variable copies and their associated warning messages. We do see that n, a, b, and c will be copied into the GPU while c will be copied out. This will produce the correct output although the GPU is doing the extra work of copying over the content of c to the GPU when it is unnecessary. Memory management will be looked at further in the Game of Life tutorial.

hmpp: [Info] Generated codelet filename is "vecadd_cuda.cu".
hmppcg: [Message DPL3000] vecAdd-region.f90:29: Loop 'i' was gridified (1D)

The compiler tells us that it has created the CUDA file vecadd_cuda.cu for the region. The second line tells us that the loop starting on line 29 with induction variable ‘i’ will be parallelized on the GPU and that the kernel will launch with a 1 dimensional grid of thread blocks.