Several GPU accelerated libraries are provided on OLCF systems. Usage for the most common accelerated libraries is outlined below.
MAGMA
The MAGMA project aims to develop a dense linear algebra library similar to LAPACK, but for heterogeneous/hybrid architectures. For C and Fortran code currently using LAPACK, porting should be relatively simple and does not require CUDA knowledge (a sketch of such a port follows the build instructions below).
Using MAGMA
This module is currently only compatible with the GNU programming environment:
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit magma
To link in the MAGMA library while on Titan:
$ cc -lcuda -lmagma source.c
Linking MAGMA on Rhea is a bit different because, on Titan, the compiler wrapper takes care of some of the extra flags. For example:
$ nvcc $MAGMA_INC $MAGMA_LIB -L$ACML_DIR/gfortran64/lib -lmagma -lcublas -lacml source.c
This also requires that a BLAS library, such as ACML, be loaded:
$ module load acml
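As an illustration of how small the port from LAPACK can be, the sketch below factors a matrix with magma_dgetrf, the MAGMA counterpart of LAPACK's dgetrf. This is a minimal sketch only, assuming MAGMA's host-pointer (LAPACK-style) interface; consult the MAGMA documentation for the exact interface of the installed version.

magma_getrf.c

#include <stdlib.h>
#include <magma.h>

int main()
{
    magma_int_t n = 1024, info;
    magma_int_t *ipiv = (magma_int_t*)malloc(n*sizeof(magma_int_t));
    double *A = (double*)malloc(n*n*sizeof(double));
    int i;

    // Fill A with random values
    for(i=0; i<n*n; i++)
        A[i] = (double)rand()/RAND_MAX;

    magma_init();

    // Same argument list as LAPACK's dgetrf; MAGMA performs the
    // factorization on the hybrid CPU/GPU system
    magma_dgetrf(n, n, A, n, ipiv, &info);

    magma_finalize();

    free(A);
    free(ipiv);
    return 0;
}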
MAGMA Resources
For a comprehensive user manual, please see the MAGMA Documentation. A knowledgeable MAGMA User Forum is also available for personalized help. To see MAGMA in action, see the following two PGI articles that include full example code of MAGMA usage with PGI Fortran: Using MAGMA With PGI Fortran and Using GPU-enabled Math Libraries with PGI Fortran.
cuBLAS/cuSPARSE
cuBLAS and cuSPARSE are NVIDIA-provided GPU BLAS libraries, optimized for dense and sparse matrices respectively. If your program currently uses BLAS routines, integration should be straightforward and requires minimal CUDA knowledge. Although primarily designed for use in C/C++ code, Fortran bindings are available.
Using cuBLAS/cuSPARSE
cuBLAS is accessed through the cublas header and linked against the cublas library; cuSPARSE is used analogously with the cusparse header and library. For example:
$ module load cudatoolkit
$ cc -lcublas source.c
cuBLAS/cuSPARSE Resources
The cuBLAS and cuSPARSE user guides are available to download from NVIDIA; these guides provide complete function listings as well as example code. The NVIDIA SDK provides sample code and can be accessed using the instructions below. An example of using cuBLAS with PGI Fortran is available in Using GPU-enabled Math Libraries with PGI Fortran.
Running the examples:
Obtain an interactive job and load the appropriate modules:
$ qsub -I -A[projID] -lwalltime=00:30:00,nodes=1
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit nvidia-sdk
Copy the example files:
$ cd $MEMBERWORK/[projid]
$ cp -r $NVIDIA_SDK_PATH/CUDALibraries .
$ cd CUDALibraries
Now each example can be executed:
$ cd bin/linux/release
$ aprun simpleCUBLAS
LibSciACC
Cray’s LibSciACC provides GPU-enabled BLAS and LAPACK routines. LibSciACC provides two interfaces: automatic and manual. The automatic interface is largely transparent to the programmer: LibSciACC determines whether a call is likely to benefit from GPU acceleration and, if so, takes care of accelerating the routine and the associated memory management, so existing BLAS calls need no modification (see the sketch below). The manual interface provides an API to manage accelerator resources, giving more control to the programmer.
Using LibSciACC
It is recommended that the craype-accel-nvidia35 module be used to manage LibSciACC. The LibSciACC automatic interface is currently compatible with the Cray and GNU programming environments:
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
LibSciACC will automatically be linked in when using the Cray-provided compiler wrappers:
$ cc source.c
$ ftn source.f90
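With the automatic interface no source changes are required: an ordinary BLAS call is intercepted by LibSciACC, which decides whether to run it on the CPU or offload it to the GPU, and handles the associated memory movement. The sketch below illustrates this with an unchanged DGEMM call; it assumes the Fortran BLAS symbol dgemm_ provided by libsci is callable from C, and the declaration shown is ours, not part of the library headers.

libsci_auto.c

#include <stdlib.h>

// Fortran BLAS symbol provided by libsci/libsci_acc (declaration assumed here)
void dgemm_(const char *transa, const char *transb,
            const int *m, const int *n, const int *k,
            const double *alpha, const double *A, const int *lda,
            const double *B, const int *ldb,
            const double *beta, double *C, const int *ldc);

int main()
{
    int n = 4096;
    double alpha = 1.0, beta = 0.0;
    double *A = (double*)calloc(n*n, sizeof(double));
    double *B = (double*)calloc(n*n, sizeof(double));
    double *C = (double*)calloc(n*n, sizeof(double));

    // A standard host-side DGEMM call; under the automatic interface
    // LibSciACC decides whether to accelerate it on the GPU
    dgemm_("N", "N", &n, &n, &n, &alpha, A, &n, B, &n, &beta, C, &n);

    free(A);
    free(B);
    free(C);
    return 0;
}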
LibSciACC Resources
The man page intro_libsci_acc provides detailed usage information. The environment variable $LIBSCI_ACC_EXAMPLES_DIR specifies a directory containing several C and Fortran example codes.
cuFFT
cuFFT is a set of optimized GPU fast Fourier transform routines provided by NVIDIA as part of the CUDA Toolkit. The cuFFT library provides an API similar to FFTW for managing accelerated FFTs. The cuFFTW interface provides an FFTW3-compatible interface to cuFFT to aid in porting existing applications.
Using cuFFT
The cudatoolkit module will append the include and library directories required by cuFFT. When using NVCC or the GNU programming environment, the library can then be linked in:
$ module load cudatoolkit
$ cc -lcufft source.c
$ nvcc -lcufft source.c
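A minimal host-API sketch follows: it plans and executes an in-place 1D double-complex forward transform. The file and variable names are illustrative only.

cufft-example.c

#include <stdlib.h>
#include <cuda_runtime.h>
#include <cufft.h>

int main()
{
    int n = 1024;
    int i;
    size_t bytes = n*sizeof(cufftDoubleComplex);

    // Host signal: a constant, so the transform is easy to check
    cufftDoubleComplex *h = (cufftDoubleComplex*)malloc(bytes);
    for(i=0; i<n; i++) { h[i].x = 1.0; h[i].y = 0.0; }

    cufftDoubleComplex *d;
    cudaMalloc((void**)&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);

    // Create a 1D double-complex plan and execute a forward FFT in place
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_Z2Z, 1);
    cufftExecZ2Z(plan, d, d, CUFFT_FORWARD);

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);

    cufftDestroy(plan);
    cudaFree(d);
    free(h);
    return 0;
}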
cuFFT Resources
NVIDIA provides comprehensive documentation, including example code, available here. For an example of using cuFFT from Fortran through the ISO_C_BINDING interface, please see the following example. The OLCF provides an OpenACC and cuFFT interoperability tutorial.
cuRAND
cuRAND is an NVIDIA-provided random number generator library. cuRAND provides both a host-launched and a device-callable interface. Multiple pseudorandom and quasirandom algorithms are supported.
Using cuRAND
The cudatoolkit module will append the include and library directories required by cuRAND. When using NVCC or the GNU programming environment, the library can then be linked in:
$ module load cudatoolkit
$ module switch PrgEnv-pgi PrgEnv-gnu
$ cc -lcurand source.c
$ nvcc -lcurand source.cu
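The host-launched interface is sketched below: a generator is created on the host and fills a device buffer with uniformly distributed doubles. The file and variable names are illustrative only.

curand-example.c

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <curand.h>

int main()
{
    int n = 1000000;
    double *d_rand, *h_rand;

    h_rand = (double*)malloc(n*sizeof(double));
    cudaMalloc((void**)&d_rand, n*sizeof(double));

    // Create and seed a default pseudorandom generator
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

    // Fill the device buffer with uniform doubles
    curandGenerateUniformDouble(gen, d_rand, n);

    cudaMemcpy(h_rand, d_rand, n*sizeof(double), cudaMemcpyDeviceToHost);
    printf("First sample: %f\n", h_rand[0]);

    curandDestroyGenerator(gen);
    cudaFree(d_rand);
    free(h_rand);
    return 0;
}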
cuRAND Resources
NVIDIA provides comprehensive documentation, including example code, available here. For an example of using the cuRAND host library, please see the Accelerator Interoperability Tutorial.
Thrust
Thrust is a CUDA-accelerated C++ template library modeled after the Standard Template Library (STL). Thrust provides a high-level host interface for GPU data management as well as an assortment of accelerated algorithms. Even if your application does not currently use the STL, the easy access Thrust provides to many optimized accelerated algorithms is worth a look.
Using Thrust
The cudatoolkit module will append the include and library directories required by Thrust. When using NVCC or the GNU programming environment, Thrust can then be used:
$ module load cudatoolkit
$ module switch PrgEnv-pgi PrgEnv-gnu
$ CC source.cpp
$ nvcc source.cu
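The sketch below shows the typical Thrust pattern: data is placed in a device_vector and then sorted and reduced on the GPU with STL-style calls. The file and variable names are illustrative only.

thrust-example.cu

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdio>
#include <cstdlib>

int main()
{
    // Fill a host vector with random integers
    thrust::host_vector<int> h(1 << 20);
    for(size_t i = 0; i < h.size(); i++)
        h[i] = rand() % 1000;

    // Copy to the device, then sort and reduce entirely on the GPU
    thrust::device_vector<int> d = h;
    thrust::sort(d.begin(), d.end());
    long long sum = thrust::reduce(d.begin(), d.end(), 0LL);

    printf("Sum of %d sorted elements: %lld\n", (int)d.size(), sum);
    return 0;
}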
Thrust Resources
NVIDIA provides comprehensive documentation, including example code, available here. For an example of using Thrust, please see the Accelerator Interoperability II Tutorial. The Thrust GitHub page provides access to the source code, examples, and information on how to obtain help.
Experience in porting Fortran code to GPUs with OpenACC
Speakers: Mike Zingale and Adam Jacobs
This talk was given at the 9-23-15 OLCF User Group Conference Call.
In the slides and video below, we describe our efforts to put large physics modules written in Fortran onto the GPUs. Our main target is integrating stiff systems of ODEs describing nuclear reactions. Our strategy is to put the entire integration (right-hand-side evaluations, Jacobian construction, and the actual timestepping) onto the GPUs. This minimizes the communication costs: only the initial state vector is passed in, and the new state vector comes out at the end. We will discuss the challenges we had in getting the compilers to accept modern Fortran constructs with OpenACC and present preliminary results.
Slides: openacc-networks.pdf
Video:
Experience in porting Fortran code to GPUs with OpenACC from OLCF on Vimeo.
To fully exploit the computational power of the GPU, a large amount of data parallelism must generally be expressed. In the specific case of accelerated libraries such as cuBLAS, cuFFT, and cuSPARSE, if each individual operation does not possess a sufficient amount of data parallelism, another option is to batch many smaller operations into a single large operation. This tutorial will demonstrate how to take advantage of batched cuBLAS operations to improve GPU utilization. Additionally, this tutorial will expand on the GPU concurrency topics from the first tutorial through the use of streams and Hyper-Q. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.
C
A Serial Example in C
The following example performs a single DGEMM operation using the cuBLAS version 2 library. This example will be expanded to show the use of batched cuBLAS calls under a variety of situations. The form of our sample DGEMM operation is C = alpha*A*B + beta*C.
cublas.c
#include <stdio.h> #include <stdlib.h> #include "math.h" #include "cublas_v2.h" int main(int argc, char* argv[]) { // Linear dimension of matrices int dim = 100; // Allocate host storage for A,B,C square matrices double *A, *B, *C; A = (double*)malloc(dim*dim*sizeof(double)); B = (double*)malloc(dim*dim*sizeof(double)); C = (double*)malloc(dim*dim*sizeof(double)); // Allocate device storage for A,B,C double *d_A, *d_B, *d_C; cudaMalloc((void**)&d_A, dim*dim*sizeof(double)); cudaMalloc((void**)&d_B, dim*dim*sizeof(double)); cudaMalloc((void**)&d_C, dim*dim*sizeof(double)); // Fill A,B diagonals with sin(i) data, C diagonal with cos(i)^2 // Matrices are arranged column major int i,j,index; for(j=0; j<dim; j++) { for(i=0; i<dim; i++) { index = j*dim + i; if(i==j) { A[index] = sin(index); B[index] = sin(index); C[index] = cos(index)*cos(index); } else { A[index] = 0.0; B[index] = 0.0; C[index] = 0.0; } } } // Create cublas instance cublasHandle_t handle; cublasCreate(&handle); // Set input matrices on device cublasSetMatrix(dim, dim, sizeof(double), A, dim, d_A, dim); cublasSetMatrix(dim, dim, sizeof(double), B, dim, d_B, dim); cublasSetMatrix(dim, dim, sizeof(double), C, dim, d_C, dim); // DGEMM: C = alpha*A*B + beta*C double alpha = 1.0; double beta = 1.0; cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, dim, dim, dim, &alpha, d_A, dim, d_B, dim, &beta, d_C, dim); // Retrieve result matrix from device cublasGetMatrix(dim, dim, sizeof(double), d_C, dim, C, dim); // Simple sanity test, sum up all elements double sum = 0; for(j=0; j<dim; j++) { for(i=0; i<dim; i++) { index = j*dim + i; sum += C[index]; } } printf("Sum is: %f, should be: %d\n", sum, dim); // Clean up resources free(A); free(B); free(C); cudaFree(d_A); cudaFree(d_B); cudaFree(d_C); cublasDestroy(handle); return 0; }
Compile:
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit
$ cc -lcublas cublas.c -o cublas.out
Run:
$ module load cudatoolkit
$ aprun ./cublas.out
Library-provided Batching in C
The most efficient way to increase GPU utilization for small library operations is to use library-provided batch capabilities. In the code sample below we use cublasDgemmBatched, which takes as arguments an array of pointers to matrices and the total number of DGEMM operations to perform. By allowing the library to handle the batching, high performance can be achieved with minimal additional effort on the part of the programmer.
cublas-batch.c
#include <stdio.h> #include <stdlib.h> #include "math.h" #include "cublas_v2.h" int main(int argc, char* argv[]) { int i,j,k,index; // Linear dimension of matrices int dim = 100; // Number of A,B,C matrix sets int batch_count = 1000; // Allocate host storage for batch_count A,B,C square matrices double **A, **B, **C; A = (double**)malloc(batch_count*sizeof(double*)); B = (double**)malloc(batch_count*sizeof(double*)); C = (double**)malloc(batch_count*sizeof(double*)); for(i=0; i<batch_count; i++) { A[i] = (double*)malloc(dim*dim*sizeof(double)); B[i] = (double*)malloc(dim*dim*sizeof(double)); C[i] = (double*)malloc(dim*dim*sizeof(double)); } // Create host pointer array to device matrix storage double **d_A, **d_B, **d_C, **h_d_A, **h_d_B, **h_d_C; h_d_A = (double**)malloc(batch_count*sizeof(double*)); h_d_B = (double**)malloc(batch_count*sizeof(double*)); h_d_C = (double**)malloc(batch_count*sizeof(double*)); for(i=0; i<batch_count; i++) { cudaMalloc((void**)&h_d_A[i], dim*dim*sizeof(double)); cudaMalloc((void**)&h_d_B[i], dim*dim*sizeof(double)); cudaMalloc((void**)&h_d_C[i], dim*dim*sizeof(double)); } // Copy the host array of device pointers to the device cudaMalloc((void**)&d_A, batch_count*sizeof(double*)); cudaMalloc((void**)&d_B, batch_count*sizeof(double*)); cudaMalloc((void**)&d_C, batch_count*sizeof(double*)); cudaMemcpy(d_A, h_d_A, batch_count*sizeof(double*), cudaMemcpyHostToDevice); cudaMemcpy(d_B, h_d_B, batch_count*sizeof(double*), cudaMemcpyHostToDevice); cudaMemcpy(d_C, h_d_C, batch_count*sizeof(double*), cudaMemcpyHostToDevice); // Fill A,B diagonals with k*sin(i) data, C diagonal with k*cos(i)^2 // Matrices are arranged column major for(k=0; k<batch_count; k++) { for(j=0; j<dim; j++) { for(i=0; i<dim; i++) { index = j*dim + i; if(i==j) { (A[k])[index] = k*sin(index); (B[k])[index] = sin(index); (C[k])[index] = k*cos(index)*cos(index); } else { (A[k])[index] = 0.0; (B[k])[index] = 0.0; (C[k])[index] = 0.0; } } // i } // j } // k // Create cublas instance cublasHandle_t handle; cublasCreate(&handle); // Set input matrices on device for(i=0; i<batch_count; i++) { cublasSetMatrix(dim, dim, sizeof(double), A[i], dim, h_d_A[i], dim); cublasSetMatrix(dim, dim, sizeof(double), B[i], dim, h_d_B[i], dim); cublasSetMatrix(dim, dim, sizeof(double), C[i], dim, h_d_C[i], dim); } // Set matrix coefficients double alpha = 1.0; double beta = 1.0; // DGEMM: C = alpha*A*B + beta*C cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, dim, dim, dim, &alpha, (const double**)d_A, dim, (const double**)d_B, dim, &beta, d_C, dim, batch_count); // Retrieve result matrix from device for(i=0; i<batch_count; i++) cublasGetMatrix(dim, dim, sizeof(double), h_d_C[i], dim, C[i], dim); // Simple sanity test, sum up all elements double sum = 0; for(k=0; k<batch_count; k++) { for(j=0; j<dim; j++) { for(i=0; i<dim; i++) { index = j*dim + i; sum += (C[k])[index]; } } } printf("Element sum is: %f, should be: %d\n", sum, dim*(batch_count-1)*(batch_count)/2); // Clean up resources for(i=0; i<batch_count; i++) { free(A[i]); free(B[i]); free(C[i]); cudaFree(h_d_A[i]); cudaFree(h_d_B[i]); cudaFree(h_d_C[i]); } free(A); free(B); free(C); free(h_d_A); free(h_d_B); free(h_d_C); cudaFree(d_A); cudaFree(d_B); cudaFree(d_C); cublasDestroy(handle); return 0; }
Compile:
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit
$ cc -lcublas cublas-batch.c -o cublas-batch.out
Run:
$ module load cudatoolkit
$ aprun ./cublas-batch.out
CUDA Streams in C
Not all library functions have a batch equivalent; in this case, using CUDA streams may be appropriate. Each stream can be considered a separate work queue, allowing kernels from separate streams to execute concurrently if hardware resources allow. In the code sample below we place each DGEMM operation in its own stream, allowing up to 32 concurrent operations through the use of Hyper-Q. The cuBLAS library provides functionality to create, set, and destroy CUDA streams.
cublas-stream.c
#include <stdio.h> #include <stdlib.h> #include "math.h" #include "cublas_v2.h" int main(int argc, char* argv[]) { int i,j,k,index; // Linear dimension of matrices int dim = 100; // Number of A,B,C matrix sets int batch_count = 1000; // Allocate host storage for batch_count A,B,C square matrices double **A, **B, **C; A = (double**)malloc(batch_count*sizeof(double*)); B = (double**)malloc(batch_count*sizeof(double*)); C = (double**)malloc(batch_count*sizeof(double*)); for(i=0; i<batch_count; i++) { A[i] = (double*)malloc(dim*dim*sizeof(double)); B[i] = (double*)malloc(dim*dim*sizeof(double)); C[i] = (double*)malloc(dim*dim*sizeof(double)); } // Allocate device storage for batch_count A,B,C double **d_A, **d_B, **d_C; d_A = (double**)malloc(batch_count*sizeof(double*)); d_B = (double**)malloc(batch_count*sizeof(double*)); d_C = (double**)malloc(batch_count*sizeof(double*)); for(i=0; i<batch_count; i++) { cudaMalloc((void**)&d_A[i], dim*dim*sizeof(double)); cudaMalloc((void**)&d_B[i], dim*dim*sizeof(double)); cudaMalloc((void**)&d_C[i], dim*dim*sizeof(double)); } // Fill A,B diagonals with k*sin(i) data, C diagonal with k*cos(i)^2 // Matrices are arranged column major for(k=0; k<batch_count; k++) { for(j=0; j<dim; j++) { for(i=0; i<dim; i++) { index = j*dim + i; if(i==j) { (A[k])[index] = k*sin(index); (B[k])[index] = sin(index); (C[k])[index] = k*cos(index)*cos(index); } else { (A[k])[index] = 0.0; (B[k])[index] = 0.0; (C[k])[index] = 0.0; } } // i } // j } // k // Create cublas instance cublasHandle_t handle; cublasCreate(&handle); // Set input matrices on device for(i=0; i<batch_count; i++) { cublasSetMatrix(dim, dim, sizeof(double), A[i], dim, d_A[i], dim); cublasSetMatrix(dim, dim, sizeof(double), B[i], dim, d_B[i], dim); cublasSetMatrix(dim, dim, sizeof(double), C[i], dim, d_C[i], dim); } // Create a stream for every DGEMM operation cudaStream_t *streams = (cudaStream_t *) malloc(batch_count*sizeof(cudaStream_t)); for(i=0; i<batch_count; i++) cudaStreamCreate(&streams[i]); // Set matrix coefficients double alpha = 1.0; double beta = 1.0; // Launch each DGEMM operation in own CUDA stream for(i=0; i<batch_count; i++){ // Set CUDA stream cublasSetStream(handle, streams[i]); // DGEMM: C = alpha*A*B + beta*C cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, dim, dim, dim, &alpha, d_A[i], dim, d_B[i], dim, &beta, d_C[i], dim); } // Retrieve result matrix from device for(i=0; i<batch_count; i++) { cublasGetMatrix(dim, dim, sizeof(double), d_C[i], dim, C[i], dim); } // Simple sanity test, sum up all elements double sum = 0; for(k=0; k<batch_count; k++) { for(j=0; j<dim; j++) { for(i=0; i<dim; i++) { index = j*dim + i; sum += (C[k])[index]; } } } printf("Element sum is: %f, should be: %d\n", sum, dim*(batch_count-1)*(batch_count)/2); // Clean up resources for(i=0; i<batch_count; i++) { free(A[i]); free(B[i]); free(C[i]); cudaFree(d_A[i]); cudaFree(d_B[i]); cudaFree(d_C[i]); cudaStreamDestroy(streams[i]); } free(A); free(B); free(C); cudaFree(d_A); cudaFree(d_B); cudaFree(d_C); cublasDestroy(handle); return 0; }
Compile:
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit
$ cc -lcublas cublas-stream.c -o cublas-stream.out
Run:
$ module load cudatoolkit
$ aprun ./cublas-stream.out
Fortran
Wrappers
NVIDIA provides a Fortran interface for the legacy cuBLAS API, but not for the version 2 API, which adds batched operations. One of the changes introduced with the new API is that an opaque cuBLAS handle must be passed into most function calls; while this provides greater control over operations, it is a hindrance to interoperability. The wrapper is split into two pieces: cublas_fortran.cu and cublas_fortran_iso.f90. cublas_fortran.cu provides a lightweight C wrapper around cuBLAS calls, with most wrapper functions taking a pointer to the cuBLAS handle that is dereferenced to make the actual call. Since the wrapper functions require only a pointer to the cuBLAS handle, we may interface them with the ISO_C_BINDING module. This ISO_C_BINDING interface is the second piece of our wrapper, cublas_fortran_iso.f90, and allows for robust interoperability between C and Fortran types.
cublas_fortran.cu
#include <stdio.h> #include "cublas_v2.h" extern "C" int f_cublasCreate(cublasHandle_t **handle) { *handle = (cublasHandle_t*)malloc(sizeof(cublasHandle_t)); return cublasCreate(*handle); } extern "C" int f_cublasDgemm(cublasHandle_t *handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const double *alpha, const double *A, int lda, const double *B, int ldb, const double *beta, double *C, int ldc) { return cublasDgemm(*handle,transa,transb,m,n,k,alpha,A,lda,B,ldb,beta,C,ldc); } extern "C" int f_cublasDgemmBatched(cublasHandle_t *handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const double *alpha, const double **A, int lda, const double **B, int ldb, const double *beta, double **C, int ldc, int batch_count) { return cublasDgemmBatched(*handle,transa,transb,m,n,k,alpha,A,lda,B,ldb,beta,C,ldc,batch_count); } extern "C" void f_cublasDestroy(cublasHandle_t *handle) { cublasDestroy(*handle); free(handle); } extern "C" int f_cudaStreamCreate(cudaStream_t **stream) { *stream = (cudaStream_t *) malloc(sizeof(cudaStream_t)); return cudaStreamCreate(*stream); } extern "C" int f_cublasSetStream(cublasHandle_t *handle, cudaStream_t *streamid) { return cublasSetStream(*handle, *streamid); } extern "C" void f_cudaStreamDestroy(cudaStream_t *stream) { cudaStreamDestroy(*stream); }
cublas_fortran_iso.f90
module cublas_f use ISO_C_BINDING enum, BIND(C) enumerator :: CUBLAS_OP_N, CUBLAS_OP_T, CUBLAS_OP_C end enum enum, BIND(C) enumerator :: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, & cudaMemcpyDeviceToDevice, cudaMemcpyDefault end enum INTERFACE integer(C_INT) function cudaMalloc(ptr, bytes) BIND(C, NAME='cudaMalloc') use ISO_C_BINDING type(C_PTR) :: ptr integer(C_SIZE_T), value :: bytes end function integer(C_INT) function cudaMemcpy(dst, src, count, kind) BIND(C, NAME='cudaMemcpy') use ISO_C_BINDING type(C_PTR), value :: dst type(C_PTR), value :: src integer(C_SIZE_T), value :: count integer(C_INT), value :: kind end function integer(C_INT) function cublasCreate(handle_ptr) BIND(C, NAME='f_cublasCreate') use ISO_C_BINDING type(C_PTR) :: handle_ptr end function subroutine cublasDestroy(handle_ptr) BIND(C, NAME='f_cublasDestroy') use ISO_C_BINDING type(C_PTR), value :: handle_ptr end subroutine subroutine cudaFree(ptr) BIND(C, NAME='cudaFree') use ISO_C_BINDING type(C_PTR), value :: ptr end subroutine integer(C_INT) function cublasSetMatrix(rows, cols, elemSize, a_ptr, & lda, b_ptr, ldb) & BIND(C, NAME='cublasSetMatrix') use ISO_C_BINDING integer(C_INT), value :: rows integer(C_INT), value :: cols integer(C_INT), value :: elemSize type(C_PTR), value :: a_ptr integer(C_INT), value :: lda type(C_PTR), value :: b_ptr integer(C_INT), value :: ldb end function integer(C_INT) function cublasGetMatrix(rows, cols, elemSize, a_ptr, & lda, b_ptr, ldb ) & BIND(C, NAME='cublasGetMatrix') use ISO_C_BINDING integer(C_INT), value :: rows integer(C_INT), value :: cols integer(C_INT), value :: elemSize type(C_PTR), value :: a_ptr integer(C_INT), value :: lda type(C_PTR), value :: b_ptr integer(C_INT), value :: ldb end function integer(C_INT) function cublasDgemm(handle, transa, transb, m, n, k, alpha, & A, lda, B, ldb, beta, C, ldc) & BIND(C, NAME='f_cublasDgemm') use ISO_C_BINDING type(C_PTR), value :: handle integer(C_INT), value :: transa integer(C_INT), value :: transb integer(C_INT), value :: m integer(C_INT), value :: n integer(C_INT), value :: k real(C_DOUBLE) :: alpha type(C_PTR), value :: A integer(C_INT), value :: lda type(C_PTR), value :: B integer(C_INT), value :: ldb real(C_DOUBLE) :: beta type(C_PTR), value :: C integer(C_INT), value :: ldc end function integer(C_INT) function cublasDgemmBatched(handle, transa, transb, m, n, k, alpha, & A, lda, B, ldb, beta, C, ldc, batch_count) & BIND(C, NAME='f_cublasDgemmBatched') use ISO_C_BINDING type(C_PTR), value :: handle integer(C_INT), value :: transa integer(C_INT), value :: transb integer(C_INT), value :: m integer(C_INT), value :: n integer(C_INT), value :: k real(C_DOUBLE) :: alpha type(C_PTR), value :: A integer(C_INT), value :: lda type(C_PTR), value :: B integer(C_INT), value :: ldb real(C_DOUBLE) :: beta type(C_PTR), value :: C integer(C_INT), value :: ldc integer(C_INT), value :: batch_count end function integer(C_INT) function cudaStreamCreate(stream_ptr) BIND(C, NAME='f_cudaStreamCreate') use ISO_C_BINDING type(C_PTR) :: stream_ptr end function integer(C_INT) function cublasSetStream(handle, stream) BIND(C, NAME='f_cublasSetStream') use ISO_C_BINDING type(C_PTR), value :: handle type(C_PTR), value :: stream end function integer(C_INT) function cudaStreamDestroy(stream) BIND(C, NAME='f_cudaStreamDestroy') use ISO_C_BINDING type(C_PTR), value :: stream end function subroutine cudaDeviceSynchronize() BIND(C, NAME='cudaDeviceSynchronize') end subroutine END INTERFACE a end module cublas_f
A Serial Example in Fortran
The following example performs a single DGEMM operation using the cuBLAS version 2 library. This example will be expanded to show the use of batched cuBLAS calls under a variety of situations. The form of our sample DGEMM operation is C = alpha*A*B + beta*C.
cublas.f90
program main use ISO_C_BINDING use cublas_f implicit none integer :: dim, stat, i, j integer(8) :: bytes real(8),dimension(:,:), pointer :: A, B, C real(8) :: alpha, beta, index, sum type(C_PTR) :: d_A, d_B, d_C type(C_PTR) :: handle integer :: sizeof_double parameter (sizeof_double=8) !Linear dimension of matrices dim = 100 ! Allocate host storage for A,B,C square matrices allocate(A(dim,dim)) allocate(B(dim,dim)) allocate(C(dim,dim)) ! Allocate device storage for A,B,C bytes = int(dim*dim*sizeof_double, 8) stat = cudaMalloc(d_A, bytes) stat = cudaMalloc(d_B, bytes) stat = cudaMalloc(d_C, bytes) ! Fill A,B diagonals with sin(i) data, C diagonal with cos(i)^2 ! Matrices are arranged column major do j=1,dim do i=1,dim if (i==j) then index = real(j*dim + i) A(i,j) = sin(index) B(i,j) = sin(index) C(i,j) = cos(index) * cos(index) else A(i,j) = 0.0 B(i,j) = 0.0 C(i,j) = 0.0 endif enddo enddo ! Create cublas instance stat = cublasCreate(handle) ! Set input matrices on device stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(A(1,1)), dim, d_A, dim) stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(B(1,1)), dim, d_B, dim) stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(C(1,1)), dim, d_C, dim) ! DGEMM: C = alpha*A*B + beta*C alpha = 1.0 beta = 1.0 stat = cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, & dim, dim, dim, & alpha, & d_A, dim, & d_B, dim, & beta, & d_C, dim) ! Retrieve result matrix from device stat = cublasGetMatrix(dim, dim, sizeof_double, d_C, dim, C_LOC(C(1,1)), dim) ! Simple sanity test, sum up all elements sum = 0.0 do j=1,dim do i=1,dim sum = sum + C(i,j) enddo enddo print *, "Sum is:", sum, "should be: ", dim deallocate(A) deallocate(B) deallocate(C) call cudaFree(d_A) call cudaFree(d_B) call cudaFree(d_C) call cublasDestroy(handle) end program
Compile:
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit
$ nvcc -c cublas_fortran.cu
$ ftn -lcublas cublas_fortran_iso.f90 cublas.f90 cublas_fortran.o -o cublas.out
Run:
$ module load cudatoolkit
$ aprun ./cublas.out
Library-provided Batching in Fortran
The most efficient way to increase GPU utilization for small library operations is to use library-provided batch capabilities. In the code sample below we use cublasDgemmBatched, which takes as arguments an array of pointers to matrices and the total number of DGEMM operations to perform. By allowing the library to handle the batching, high performance can be achieved with minimal additional effort on the part of the programmer.
cublas-batch.f90
program main use ISO_C_BINDING use cublas_f implicit none integer :: dim, stat, i, j, k, batch_count integer(8) :: bytes real(8),dimension(:,:,:), pointer:: A, B, C real(8) :: alpha, beta, index, sum type(C_PTR) :: d_A, d_B, d_C type(C_PTR), dimension(:), pointer :: h_d_A, h_d_B, h_d_C, streams type(C_PTR) :: handle integer :: sizeof_double parameter (sizeof_double=8) !Linear dimension of matrices dim = 100 ! Number of A,B,C matrix sets batch_count = 1000 ! Allocate host storage for A,B,C square matrices allocate(A(dim,dim,batch_count)) allocate(B(dim,dim,batch_count)) allocate(C(dim,dim,batch_count)) ! Create host pointer array to device matrix storage allocate(h_d_A(batch_count)) allocate(h_d_B(batch_count)) allocate(h_d_C(batch_count)) bytes = dim*dim*sizeof_double do i=1,batch_count stat = cudaMalloc(h_d_A(i), bytes) stat = cudaMalloc(h_d_B(i), bytes) stat = cudaMalloc(h_d_C(i), bytes) enddo ! Copy the host array of device pointers to the device bytes = batch_count*sizeof_double ! 8 byte pointers stat = cudaMalloc(d_A, bytes) stat = cudaMalloc(d_B, bytes) stat = cudaMalloc(d_C, bytes) stat = cudaMemcpy(d_A, C_LOC(h_d_A(1)), bytes, cudaMemcpyHostToDevice); stat = cudaMemcpy(d_B, C_LOC(h_d_B(1)), bytes, cudaMemcpyHostToDevice); stat = cudaMemcpy(d_C, C_LOC(h_d_C(1)), bytes, cudaMemcpyHostToDevice); ! Fill A,B diagonals with sin(i) data, C diagonal with cos(i)^2 ! Matrices are arranged column major do k=1,batch_count do j=1,dim do i=1,dim if (i==j) then index = real(j*dim + i) A(i,j,k) = k*sin(index) B(i,j,k) = sin(index) C(i,j,k) = k*cos(index) * cos(index) else A(i,j,k) = 0.0 B(i,j,k) = 0.0 C(i,j,k) = 0.0 endif enddo ! i enddo ! j enddo ! k ! Create cublas instance stat = cublasCreate(handle) ! Set input matrices on device do i=1,batch_count stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(A(1,1,i)), dim, h_d_A(i), dim) stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(B(1,1,i)), dim, h_d_B(i), dim) stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(C(1,1,i)), dim, h_d_C(i), dim) enddo ! Set matrix coefficients alpha = 1.0 beta = 1.0 ! batched DGEMM: C = alpha*A*B + beta*C stat = cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, & dim, dim, dim, & alpha, & d_A, dim, & d_B, dim, & beta, & d_C, dim, & batch_count) ! Retrieve result matrix from device do i=1,batch_count stat = cublasGetMatrix(dim, dim, sizeof_double, h_d_C(i), dim, C_LOC(C(1,1,i)), dim) enddo ! Simple sanity test, sum up all elements sum = 0.0 do k=1,batch_count do j=1,dim do i=1,dim sum = sum + C(i,j,k) enddo enddo enddo print *, "Sum is:", sum, "should be: ", dim*(batch_count)*(batch_count+1)/2 do i=1,batch_count call cudaFree(h_d_A(i)) call cudaFree(h_d_B(i)) call cudaFree(h_d_C(i)) enddo deallocate(A) deallocate(B) deallocate(C) call cublasDestroy(handle) end program
Compile:
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit
$ nvcc -c cublas_fortran.cu
$ ftn -lcublas cublas_fortran_iso.f90 cublas-batch.f90 cublas_fortran.o -o cublas-batch.out
Run:
$ module load cudatoolkit
$ aprun ./cublas-batch.out
CUDA Streams in Fortran
Not all library functions have a batch equivalent; in this case, using CUDA streams may be appropriate. Each stream can be considered a separate work queue, allowing kernels from separate streams to execute concurrently if hardware resources allow. In the code sample below we place each DGEMM operation in its own stream, allowing up to 32 concurrent operations through the use of Hyper-Q. The cuBLAS library provides functionality to create, set, and destroy CUDA streams.
cublas-stream.f90
program main use ISO_C_BINDING use cublas_f implicit none integer :: dim, stat, i, j, k, batch_count integer(8) :: bytes real(8),dimension(:,:,:), pointer:: A, B, C real(8) :: alpha, beta, index, sum type(C_PTR), dimension(:), pointer :: d_A, d_B, d_C, streams type(C_PTR) :: handle integer :: sizeof_double parameter (sizeof_double=8) !Linear dimension of matrices dim = 100 ! Number of A,B,C matrix sets batch_count = 1000 ! Allocate host storage for A,B,C square matrices allocate(A(dim,dim,batch_count)) allocate(B(dim,dim,batch_count)) allocate(C(dim,dim,batch_count)) ! Allocate device storage for A,B,C allocate(d_A(batch_count)) allocate(d_B(batch_count)) allocate(d_C(batch_count)) bytes = int(dim*dim*sizeof_double, 8) do i=1,batch_count stat = cudaMalloc(d_A(i), bytes) stat = cudaMalloc(d_B(i), bytes) stat = cudaMalloc(d_C(i), bytes) enddo ! Fill A,B diagonals with sin(i) data, C diagonal with cos(i)^2 ! Matrices are arranged column major do k=1,batch_count do j=1,dim do i=1,dim if (i==j) then index = real(j*dim + i) A(i,j,k) = k*sin(index) B(i,j,k) = sin(index) C(i,j,k) = k*cos(index) * cos(index) else A(i,j,k) = 0.0 B(i,j,k) = 0.0 C(i,j,k) = 0.0 endif enddo ! i enddo ! j enddo ! k ! Create cublas instance stat = cublasCreate(handle) ! Set input matrices on device do i=1,batch_count stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(A(1,1,i)), dim, d_A(i), dim) stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(B(1,1,i)), dim, d_B(i), dim) stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(C(1,1,i)), dim, d_C(i), dim) enddo ! Create a stream for every DGEMM operation allocate(streams(batch_count)) do i=1,batch_count stat = cudaStreamCreate(streams(i)) enddo ! Set matrix coefficients alpha = 1.0 beta = 1.0 ! Launch each DGEMM operation in own CUDA stream do i=1,batch_count ! Set CUDA stream stat = cublasSetStream(handle, streams(i)) ! DGEMM: C = alpha*A*B + beta*C stat = cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, & dim, dim, dim, & alpha, & d_A(i), dim, & d_B(i), dim, & beta, & d_C(i), dim) enddo ! Retrieve result matrix from device do i=1,batch_count stat = cublasGetMatrix(dim, dim, sizeof_double, d_C(i), dim, C_LOC(C(1,1,i)), dim) enddo ! Simple sanity test, sum up all elements sum = 0.0 do k=1,batch_count do j=1,dim do i=1,dim sum = sum + C(i,j,k) enddo enddo enddo print *, "Sum is:", sum, "should be: ", dim*(batch_count)*(batch_count+1)/2 do i=1,batch_count stat = cudaStreamDestroy(streams(i)) call cudaFree(d_A(i)) call cudaFree(d_B(i)) call cudaFree(d_C(i)) enddo deallocate(A) deallocate(B) deallocate(C) call cublasDestroy(handle) end program
Compile:
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit
$ nvcc -c cublas_fortran.cu
$ ftn -lcublas cublas_fortran_iso.f90 cublas-stream.f90 cublas_fortran.o -o cublas-stream.out
Run:
$ module load cudatoolkit
$ aprun ./cublas-stream.out
To fully exploit the computational power of the GPU, a large amount of data parallelism must be expressed. If your problem does not possess a sufficient amount of data parallelism, a second option is to combine data parallelism with task parallelism on the GPU through the use of concurrent kernels. To facilitate task parallelism, the NVIDIA Kepler K20X features Hyper-Q, a set of 32 hardware-managed work queues. When using CUDA streams, each stream is automatically mapped onto Hyper-Q, allowing up to 32 streams to execute concurrently. The NVIDIA Multi-Process Service allows multiple processes, such as intra-node MPI ranks, to be mapped onto Hyper-Q. This tutorial will demonstrate how to take advantage of GPU concurrency on Titan through the use of Hyper-Q. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.
C
The Serial, OpenMP, and MPI C samples make use of the following GPU wrapper code. The wrapper provides the following functions:

sleep(): GPU kernel to sleep for a given number of clock cycles
get_cycles(): Return the number of cycles required to sleep for the requested number of seconds
create_streams(): Create a given number of non-default streams
sleep_kernel(): Launch a single GPU thread that sleeps for a given number of cycles
wait_for_streams(): Wait for streams 1 through the number of streams specified to complete any work
destroy_streams(): Destroy the non-default streams
sleep.cu
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <time.h> static cudaStream_t *streams; // CUDA kernel to pause for at least num_cycle cycles __global__ void sleep(int64_t num_cycles) { int64_t cycles = 0; int64_t start = clock64(); while(cycles < num_cycles) { cycles = clock64() - start; } } // Returns number of cycles required for requested seconds extern "C" int64_t get_cycles(float seconds) { // Get device frequency in KHz int64_t Hz; cudaDeviceProp prop; cudaGetDeviceProperties(&prop, 0); Hz = int64_t(prop.clockRate) * 1000; // Calculate number of cycles to wait int64_t num_cycles; num_cycles = (int64_t)(seconds * Hz); return num_cycles; } // Create streams extern "C" void create_streams(int num_streams) { // Allocate streams streams = (cudaStream_t *) malloc((num_streams+1)*sizeof(cudaStream_t)); // Default stream streams[0] = NULL; // Primer kernel launch sleep<<< 1, 1 >>>(1); // Create streams for(int i = 1; i <= num_streams; i++) cudaStreamCreate(&streams[i]); } // Launches a kernel that sleeps for num_cycles extern "C" void sleep_kernel(int64_t num_cycles, int stream_id) { // Launch a single GPU thread to sleep int blockSize, gridSize; blockSize = 1; gridSize = 1; // Execute the kernel sleep<<< gridSize, blockSize, 0, streams[stream_id] >>>(num_cycles); } // Wait for stream to complete extern "C" void wait_for_stream(int stream_id) { cudaStreamSynchronize(streams[stream_id]); } // Wait for streams to complete extern "C" void wait_for_streams(int num_streams) { for(int i = 1; i <= num_streams; i++) cudaStreamSynchronize(streams[i]); } // Destroy stream objects extern "C" void destroy_streams(int num_streams) { // Clean up stream for(int i = 1; i <= num_streams; i++) cudaStreamDestroy(streams[i]); free(streams); }
Compile:
The following will compile the wrapper functions into object code to be linked in with our examples.
$ module load cudatoolkit
$ nvcc -arch=sm_35 -c sleep.cu
Serially Launched Concurrent Kernels in C
The following sample launches multiple GPU kernels that each sleep a single GPU thread for one second. GPU kernel launches are asynchronous with respect to the CPU, which allows a for loop to be used to launch concurrent kernels. Each kernel is launched into its own stream, allowing up to 32 kernels to execute concurrently.
launcher.c
#include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <sys/time.h> int64_t get_cycles(float seconds); void sleep_kernel(int64_t num_cycles, int stream_id); void create_streams(int num_streams); void wait_for_streams(int num_streams); void destroy_streams(int num_streams); int main(int argc, char *argv[]) { uint64_t cycles; struct timeval start, stop; int i, num_kernels; // Get number of cycles to sleep for 1 second cycles = get_cycles(1.0); // Max number of kernels to launch int max_kernels = 33; // Loop through number of kernels to launch, from 1 to num_kernels for(num_kernels=1; num_kernels<=max_kernels; num_kernels++) { // Start timer gettimeofday(&start, NULL); // Create streams create_streams(num_kernels); // Launch num_kernel kernels asynchrnously for(i=1; i<=num_kernels; i++){ sleep_kernel(cycles, i); } // Wait for kernels to complete wait_for_streams(num_kernels); // Clean up streams destroy_streams(num_kernels); // Print seconds ellapsed gettimeofday(&stop, NULL); double seconds; seconds = (stop.tv_sec - start.tv_sec); seconds += (stop.tv_usec - start.tv_usec) / 1000000.0; printf("Total time for %d kernels: %f s\n", num_kernels, seconds); } return 0; }
Compile:
$ module load cudatoolkit
$ cc sleep.o launcher.c -o serial.out
Run:
$ module load cudatoolkit
$ aprun ./serial.out
OpenMP Launched Concurrent Kernels in C
The following example uses OpenMP to launch multiple sleep kernels in a parallel region, using the OMP thread number to specify which stream to use. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.
launcher.c
#include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <omp.h> int64_t get_cycles(float seconds); void sleep_kernel(int64_t num_cycles, int stream_id); void create_streams(int num_streams); void wait_for_streams(int num_streams); void destroy_streams(int num_streams); int main(int argc, char *argv[]) { uint64_t cycles; double start, stop; int i, num_kernels; // Get number of cycles to sleep for 1 second cycles = get_cycles(1.0); // Number of kernels to launch int max_kernels = 33; // Loop through number of kernels to launch, from 1 to num_kernels for(num_kernels=1; num_kernels<=max_kernels; num_kernels++) { // Set number of OMP threads omp_set_num_threads(num_kernels); // Start timer start = omp_get_wtime(); // Create streams create_streams(num_kernels); // Launch num_kernel kernels asynchrnously #pragma omp parallel firstprivate(cycles) { int stream_id = omp_get_thread_num()+1; sleep_kernel(cycles, stream_id); } // Wait for kernels to complete wait_for_streams(num_kernels); // Wait for kernels to complete and clean up streams destroy_streams(num_kernels); // Stop timer stop = omp_get_wtime(); printf("Total time for %d kernels: %f s\n", num_kernels, stop-start); } return 0; }
Compile:
PGI
$ module load cudatoolkit
$ cc -mp sleep.o launcher.c -o openmp.out
GNU
$ module load cudatoolkit
$ cc -fopenmp sleep.o launcher.c -o openmp.out
Intel
$ module load cudatoolkit
$ cc -openmp sleep.o launcher.c -o openmp.out
Cray
$ module load cudatoolkit
$ cc sleep.o launcher.c -o openmp.out
Run:
$ module load cudatoolkit
$ aprun -d16 ./openmp.out
MPI Launched Concurrent Kernels in C
The following example uses MPI to launch a single GPU kernel per MPI process into the default stream. Titan’s GPU compute mode does not allow multiple processes to access the GPU simultaneously, so CRAY_CUDA_PROXY must be enabled. When enabled, the Cray CUDA proxy works in conjunction with Hyper-Q to allow kernels from up to 32 different MPI processes to run concurrently. Note, however, that on Titan the maximum number of MPI processes per node is currently 16.
launcher.c
#include <stdio.h> #include <stdlib.h> #include <stdint.h> #include "mpi.h" int64_t get_cycles(float seconds); void sleep_kernel(int64_t num_cycles, int stream_id); void create_streams(int num_streams); void wait_for_stream(int stream_id); void destroy_streams(int num_streams); int main(int argc, char *argv[]) { MPI_Init(&argc, &argv); int rank, size; MPI_Comm_rank(MPI_COMM_WORLD, &rank); MPI_Comm_size(MPI_COMM_WORLD, &size); uint64_t cycles; double start, stop; int num_kernels; // Get number of cycles to sleep for 1 second cycles = get_cycles(1.0); // Number of kernels to launch int max_kernels = size; // Setup default stream in sleep.cu wrapper create_streams(0); // Loop through number of kernels to launch, from 1 to max_kernels for(num_kernels=1; num_kernels<=max_kernels; num_kernels++) { // Start timer MPI_Barrier(MPI_COMM_WORLD); if(rank == 0) start = MPI_Wtime(); // Launch kernel into default stream if(rank < num_kernels) sleep_kernel(cycles, 0); // Wait for all ranks to submit kernel MPI_Barrier(MPI_COMM_WORLD); // Wait for default stream if(rank < num_kernels) wait_for_stream(0); // Wait for all ranks to complete MPI_Barrier(MPI_COMM_WORLD); // Print seconds ellapsed if(rank == 0) { stop = MPI_Wtime(); printf("Total time for %d kernels: %f s\n", num_kernels, stop-start); } } destroy_streams(0); MPI_Finalize(); return 0; }
Compile:
$ module load cudatoolkit
$ cc sleep.o launcher.c -o proxy.out
Run:
$ module load cudatoolkit
$ export CRAY_CUDA_PROXY=1
$ aprun -n16 ./proxy.out
OpenACC Launched Concurrent Kernels in C
OpenACC will, by default, block on the CPU when launching kernels or performing data movement. The async(i) clause allows asynchronous kernel launches and data transfers. Although it is implementation dependent, the integer argument i will generally determine which stream the kernel or data operation is launched in. At the time of this writing, each implementation handles the integer argument slightly differently.
launcher.c
#include <stdio.h> #include <stdlib.h> #include <stdint.h> #include "math.h" #include <sys/time.h> int main(int argc, char *argv[]) { uint64_t num_cycles; struct timeval start, stop; int i,num_kernels; // Set number of cycles to sleep for 1 second // We'll use frequency/instruction latency to estimate this num_cycles = 730000000/15; // Number of kernels to launch int max_kernels = 33; // Loop through number of kernels to launch, from 1 to num_kernels for(num_kernels=1; num_kernels<max_kernels; num_kernels++) { // Start timer gettimeofday(&start, NULL); for(i=0; i<=num_kernels; i++) { #pragma acc parallel async(i) vector_length(1) num_gangs(1) { uint64_t cycles; #pragma acc loop seq for(cycles=0; cycles<num_cycles; cycles++) { cycles = cycles; } } } // Wait for all async streams to complete #pragma acc wait // Print seconds ellapsed gettimeofday(&stop, NULL); double seconds; seconds = (stop.tv_sec - start.tv_sec); seconds += (stop.tv_usec - start.tv_usec) / 1000000.0; printf("Total time for %d kernels: %f s\n", num_kernels, seconds); } return 0; }
Note: Some compilers may optimize away the spin loop above; adding the volatile qualifier to the declaration of cycles should force the loop to be kept. The volatile qualifier is currently not supported in PGI accelerator regions.
Compile:
PGI
$ module load cudatoolkit
$ cc -acc launcher.c -o acc.out
Cray
$ module load cudatoolkit
$ cc -hpragma=acc launcher.c -o acc.out
Run:
$ module load cudatoolkit
$ aprun ./acc.out
Fortran
The Serial, OpenMP, and MPI Fortran samples make use of the following CUDA Fortran GPU wrapper code. The wrapper provides the following functions:

sleep(): GPU kernel to sleep for a given number of clock cycles
get_cycles(): Return the number of cycles required to sleep for the requested number of seconds
create_streams(): Create a given number of non-default streams
sleep_kernel(): Launch a single GPU thread that sleeps for a given number of cycles
wait_for_streams(): Wait for streams 1 through the number of streams specified to complete any work
destroy_streams(): Destroy the non-default streams

sleep.cuf
module sleep use cudadevice use cudafor implicit none integer, dimension(:), allocatable :: streams contains !CUDA kernel to pause for at least num_cycle cycles attributes(global) subroutine sleep(num_cycles) integer(8), value :: num_cycles integer(8) :: cycles integer(8) :: start cycles = 0 start = clock64 do while (cycles < num_cycles) cycles = clock64 - start enddo end subroutine sleep !Returns number of cycles required for requested seconds integer(8) function get_cycles(seconds) result(num_cycles) real(8), intent(in) :: seconds integer(8) :: istat, Hz type(cudadeviceprop) :: prop istat = cudaGetDeviceProperties(prop, 0) Hz = prop%clockRate * 1000 num_cycles = seconds * Hz end function get_cycles !Create streams subroutine create_streams(num_streams) integer :: num_streams, istat, i allocate(streams(num_streams+1)) streams(1) = 0 ! Primer kernel launch call sleep<<< 1, 1 >>>(int8(1)); do i=2,num_streams+1 istat = cudaStreamCreate(streams(i)) enddo end subroutine create_streams !Launches a kernel that sleeps for num_cycles subroutine sleep_kernel(num_cycles, stream_id) integer(8) :: num_cycles integer :: stream_id type(dim3) :: blockSize, gridSize blockSize = dim3(1,1,1) gridSize = dim3(1,1,1) call sleep<<<gridSize, blockSize, 0, streams(stream_id)>>>(num_cycles) end subroutine sleep_kernel ! Wait for stream to complete subroutine wait_for_stream(stream_id) integer :: stream_id, istat istat = cudaStreamSynchronize(streams(stream_id)) end subroutine wait_for_stream ! Wait for streams to complete subroutine wait_for_streams(num_streams) integer :: num_streams, istat, i do i=2,num_streams+1 istat = cudaStreamSynchronize(streams(i)) enddo end subroutine wait_for_streams ! Destroy streams subroutine destroy_streams(num_streams) integer :: num_streams, i, istat do i=2,num_streams+1 istat = cudaStreamDestroy(streams(i)) enddo deallocate(streams) end subroutine destroy_streams end module sleep
Serially Launched Concurrent Kernels in Fortran
The following sample launches multiple GPU kernels that each sleep a single GPU thread for one second. GPU kernel launches are asynchronous with respect to the CPU, which allows a do loop to be used to launch concurrent kernels. Each kernel is launched into its own stream, allowing up to 32 kernels to execute concurrently.
launcher.f90
program main use sleep integer(8) :: cycles integer :: max_kernels, num_kernels, i real(8) :: start, stop, seconds ! Get number of cycles to sleep for 1 second seconds = 1.0 cycles = get_cycles(seconds) ! Maximum number of kernels to launch max_kernels = 32 ! Loop through number of kernels to launch, from 1 to num_kernels do num_kernels = 1, max_kernels ! Start timer call cpu_time(start) ! Create streams call create_streams(num_kernels) ! Launch num_kernel kernels asynchrnously do i = 2, num_kernels+1 call sleep_kernel(cycles, i) enddo ! Wait for kernels to complete and clean up streams call destroy_streams(num_kernels) ! Stop timer call cpu_time(stop) print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds' enddo end program main
Compile:
PGI
$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o serial.out
OpenMP Launched Concurrent Kernels in Fortran
The following example uses OpenMP to launch multiple sleep kernels in a parallel region, using the OMP thread number to specify which stream to use. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.
launcher.f90
program main use sleep use omp_lib implicit none integer(8) :: cycles integer :: max_kernels, num_kernels, stream_id real(8) :: start, stop, seconds ! Get number of cycles to sleep for 1 second seconds = 1.0 cycles = get_cycles(seconds) ! Maximum number of kernels to launch max_kernels = 33 ! Loop through number of kernels to launch, from 1 to num_kernels do num_kernels = 1, max_kernels ! Set number of OMP threads call omp_set_num_threads(num_kernels) ! Start timer start = omp_get_wtime() ! Create streams call create_streams(num_kernels) ! Launch num_kernel kernels asynchrnously !$omp parallel private(stream_id) firstprivate(cycles) stream_id = omp_get_thread_num()+2 call sleep_kernel(cycles, stream_id) !$omp end parallel ! Wait for kernels to complete and clean up streams call destroy_streams(num_kernels) ! Stop timer stop = omp_get_wtime() print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds' enddo end program main
Compile:
PGI
$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o openmp.out
MPI Launched Concurrent Kernels in Fortran
The following example uses MPI to launch a single GPU kernel per MPI process into the default stream. Titan’s GPU compute mode does not allow multiple processes to access the GPU simultaneously, so CRAY_CUDA_PROXY must be enabled. When enabled, the Cray CUDA proxy works in conjunction with Hyper-Q to allow kernels from up to 32 different MPI processes to run concurrently. Note, however, that on Titan the maximum number of MPI processes per node is currently 16.
launcher.f90
program main use sleep implicit none include 'mpif.h' integer :: max_kernels, num_kernels, i, ierr, rank, size integer(8) :: cycles real(8) :: start, stop, seconds call MPI_Init(ierr) ! Get number of cycles to sleep for 1 second seconds = 1.0 cycles = get_cycles(seconds) call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr) call MPI_Comm_size(MPI_COMM_WORLD, size, ierr) ! Number of kernels to launch max_kernels = size ! Setup default stream in sleep.cu wrapper call create_streams(0); ! Loop through number of kernels to launch, from 1 to max_kernels do num_kernels = 1, max_kernels ! Start timer call MPI_Barrier(MPI_COMM_WORLD, ierr) if (rank == 0) then start = MPI_Wtime() endif ! Launch num_kernel kernels asynchrnously if (rank < num_kernels) then call sleep_kernel(cycles, 1) endif ! Wait for all ranks to submit kernel call MPI_Barrier(MPI_COMM_WORLD, ierr) ! Wait for kernel to complete if(rank < num_kernels) then call wait_for_stream(0) endif ! Wait for all ranks to complete call MPI_Barrier(MPI_COMM_WORLD, ierr) ! Print seconds ellapsed if (rank == 0) then stop = MPI_Wtime() print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds' endif enddo ! clean up array in wrapper, no stream actually destroyed call destroy_streams(0) call MPI_Finalize(ierr) end program main
Compile:
PGI
$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o proxy.out
Run:
$ module load cudatoolkit
$ export CRAY_CUDA_PROXY=1
$ aprun -n16 ./proxy.out
OpenACC Launched Concurrent Kernels in Fortran
OpenACC will, by default, block on the CPU when launching kernels or performing data movement. The async(i) clause allows asynchronous kernel launches and data transfers. Although it is implementation dependent, the integer argument i will generally determine which stream the kernel or data operation is launched in. At the time of this writing, each implementation handles the integer argument slightly differently.
launcher.f90
program main implicit none integer(8) :: num_cycles, cycles, i integer :: max_kernels, num_kernels, stream_id real(8) :: start, stop, seconds real(4) :: foo(33) ! Set number of cycles to sleep for 1 second ! We'll use frequency/instruction latency to estimate this num_cycles = 730000000/(15*5) ! Maximum number of kernels to launch max_kernels = 33 ! Loop through number of kernels to launch, from 1 to num_kernels do num_kernels = 1, max_kernels ! Start timer call cpu_time(start) ! Launch num_kernel kernels asynchrnously do i = 1, num_kernels !$acc parallel async(i) vector_length(1) num_gangs(1) copy(foo) !$acc loop seq do cycles = 1, num_cycles foo(i) = sin(1.5708) enddo !$acc end loop !$acc end parallel enddo ! Wait for all async streams to complete !$acc wait ! Stop timer call cpu_time(stop) print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds' enddo end program main
Note: Some compilers may optimize away the loop above; adding the volatile qualifier to the declaration of foo should force the loop to be kept. The volatile qualifier is currently not supported in PGI accelerator regions.
Compile:
PGI
$ module load cudatoolkit
$ ftn -acc launcher.f90 -o acc.out
Cray
$ module load cudatoolkit
$ ftn -hacc launcher.f90 -o acc.out
Run:
$ module load cudatoolkit
$ aprun ./acc.out
This sample shows a minimal conversion of our Vector Addition CPU code to an OpenACC accelerator directives version. Consider this an OpenACC ‘Hello World’. Modifications from the CPU version will be highlighted and briefly discussed. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov
vecAdd.c
#include <stdio.h> #include <stdlib.h> #include <math.h> int main( int argc, char* argv[] ) { // Size of vectors int n = 10000; // Input vectors double *restrict a; double *restrict b; // Output vector double *restrict c; // Size, in bytes, of each vector size_t bytes = n*sizeof(double); // Allocate memory for each vector a = (double*)malloc(bytes); b = (double*)malloc(bytes); c = (double*)malloc(bytes); // Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2 int i; for(i=0; i<n; i++) { a[i] = sin(i)*sin(i); b[i] = cos(i)*cos(i); } // sum component wise and save result into vector c #pragma acc kernels copyin(a[0:n],b[0:n]), copyout(c[0:n]) for(i=0; i<n; i++) { c[i] = a[i] + b[i]; } // Sum up vector c and print result divided by n, this should equal 1 within error double sum = 0.0; for(i=0; i<n; i++) { sum += c[i]; } sum = sum/n; printf("final result: %f\n", sum); // Release memory free(a); free(b); free(c); return 0; }
Changes to vecAdd.c
// Input vectors
double *restrict a;
double *restrict b;
// Output vector
double *restrict c;
The restrict keyword is a non-enforced guarantee to the compiler that the pointers are not aliased. It is not required by the OpenACC standard, but in some cases it allows the compiler to better vectorize the code.
// sum component wise and save result into vector c
#pragma acc kernels copyin(a[0:n],b[0:n]), copyout(c[0:n])
for(i=0; i<n; i++) {
    c[i] = a[i] + b[i];
}
The compiler will analyze the loops following the acc kernels directive and if possible compute the loops on the GPU. Memory is copied from the CPU to the GPU at the start of the loop and back from the GPU to the CPU at the end of the loop.
Compiling vecAdd.c
PGI
$ module load cudatoolkit
$ cc -acc vecAdd.c -o vecAdd.out
Cray
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ cc -h pragma=acc vecAdd.c -o vecAdd.out
CapsMC
$ module load cudatoolkit capsmc
$ cc vecAdd.c -o vecAdd.out
Running
$ aprun ./vecAdd.out
final result: 1.000000
vecAdd.f90
program main ! Size of vectors integer :: n = 100000 ! Input vectors real(8),dimension(:),allocatable :: a real(8),dimension(:),allocatable :: b ! Output vector real(8),dimension(:),allocatable :: c integer :: i real(8) :: sum ! Allocate memory for each vector allocate(a(n)) allocate(b(n)) allocate(c(n)) ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2 do i=1,n a(i) = sin(i*1D0)*sin(i*1D0) b(i) = cos(i*1D0)*cos(i*1D0) enddo ! Sum component wise and save result into vector c !$acc kernels copyin(a(1:n),b(1:n)), copyout(c(1:n)) do i=1,n c(i) = a(i) + b(i) enddo !$acc end kernels ! Sum up vector c and print result divided by n, this should equal 1 within error do i=1,n sum = sum + c(i) enddo sum = sum/n print *, 'final result: ', sum ! Release memory deallocate(a) deallocate(b) deallocate(c) end program
Changes to vecAdd.f90
!$acc kernels copyin(a(1:n),b(1:n)), copyout(c(1:n))
do i=1,n
    c(i) = a(i) + b(i)
enddo
!$acc end kernels
The loop directly following the acc kernels loop directive will be computed on the GPU. Memory is copied from the CPU to the GPU at the start of the loop and back from the GPU to the CPU at the end of the loop as specified by the copyin and copyout modifiers.
Compiling vecAdd.f90
PGI
$ module load cudatoolkit
$ ftn -acc vecAdd.f90 -o vecAdd.out
Cray
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ ftn -h acc vecAdd.f90 -o vecAdd.out
CapsMC
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit capsmc
$ ftn vecAdd.f90 -o vecAdd.out
Running
$ aprun ./vecAdd.out
final result: 1.000000
Reduction
OpenACC includes a parallel reduction operation. When a reduction operation is encountered in a kernels region, the compiler should automatically detect it and run the reduction on the accelerator. When relying on the compiler to implicitly parallelize your code, it is important to check that it is doing so correctly, as discussed below in the Additional Information section. The reduction can also be requested explicitly, as shown in the sketch after the examples below.
vecAdd-reduction.c
#include <stdio.h> #include <stdlib.h> #include <math.h> int main( int argc, char* argv[] ) { // Size of vectors int n = 10000; // Input vectors double *restrict a; double *restrict b; // Output vector double *restrict c; // Size, in bytes, of each vector size_t bytes = n*sizeof(double); // Allocate memory for each vector a = (double*)malloc(bytes); b = (double*)malloc(bytes); c = (double*)malloc(bytes); // Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2 int i; for(i=0; i<n; i++) { a[i] = sin(i)*sin(i); b[i] = cos(i)*cos(i); } // sum component wise and save result into vector c and do reduction float sum = 0.0; #pragma acc kernels copyin(a[0:n],b[0:n]), create(c[0:n]) { for(i=0; i<n; i++) { c[i] = a[i] + b[i]; sum += c[i]; } } sum = sum/n; printf("final result: %f\n", sum); // Release memory free(a); free(b); free(c); return 0; }
vecAdd-reduction.f90
program main ! Size of vectors integer :: n = 100000 ! Input vectors real(8),dimension(:),allocatable :: a real(8),dimension(:),allocatable :: b ! Output vector real(8),dimension(:),allocatable :: c integer :: i real(8) :: sum ! Allocate memory for each vector allocate(a(n)) allocate(b(n)) allocate(c(n)) ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2 do i=1,n a(i) = sin(i*1D0)*sin(i*1D0) b(i) = cos(i*1D0)*cos(i*1D0) enddo ! Sum component wise and save result into vector c, perform reduction sum on c !$acc kernels !copyin(a(1:n),b(1:n)), copyout(c(1:n)) do i=1,n c(i) = a(i) + b(i) sum = sum + c(i) enddo !$acc end kernels ! Divide the sum by n, the final result should be 1.0 within error sum = sum/n print *, 'final result: ', sum ! Release memory deallocate(a) deallocate(b) deallocate(c) end program
Additional Information
With directive use, particularly when the compiler is relied upon to implicitly identify and parallelize code, it is important to check that it is doing what you expect. The way to extract more information from the compilation process differs between compilers.
PGI
For PGI the -Minfo compiler flag will print out detailed optimization information. With the -Minfo flag we will see memory transfer and thread placement information.
Compiling:
Compile C code with the PGI compiler and -Minfo flag as follows:
$ cc -acc -Minfo vecAdd-reduction.c
Compile Fortran code with the PGI compiler and -Minfo flag as follows:
$ ftn -acc -Minfo vecAdd-reduction.f90
Output:
main: 34, Generating local(c[0:n]) Generating copyin(b[0:n]) Generating copyin(a[0:n]) Generating compute capability 1.3 binary Generating compute capability 2.0 binary 35, Loop is parallelizable Accelerator kernel generated 35, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */ CC 2.0 : 18 registers; 0 shared, 88 constant, 0 local memory bytes 37, Sum reduction generated for sum
Breaking this output down:
main: 34, Generating local(c[0:n]) Generating copyin(b[0:n]) Generating copyin(a[0:n])
We see that at line 34, the start of our acc region, elements 0 to n of the vectors a and b will be copied to the GPU. Vector c does not need to be copied to the GPU, so it is allocated locally on the device.
35, Loop is parallelizable Accelerator kernel generated
Starting at line 35, the line containing the for/do loop statement, we see that the compiler has found the loop parallelizable and generated a GPU kernel. Let’s break down the provided information.
35, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */ CC 2.0 : 18 registers; 0 shared, 88 constant, 0 local memory bytes
In CUDA terminology this translates to a kernel with a block size of 128, that is, 128 threads in each logical thread block. Each thread will use 18 registers, and 88 values will be stored in the constant cache.
37, Sum reduction generated for sum
The compiler has correctly identified the sum reduction and created an optimized kernel for it.
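If the compiler-chosen gang/vector mapping is not what you want, it can be requested explicitly on the loop directive. The fragment below is a sketch only, not taken from the compiled output above; it reuses the vector addition loop and asks for 128-wide vectors, which corresponds to 128 threads per block in CUDA terms.

/* Sketch: explicitly requesting 128-wide vectors (threads per block);
   the number of gangs (thread blocks) is left to the compiler. */
#pragma acc kernels loop gang vector(128)
for(i=0; i<n; i++) {
    c[i] = a[i] + b[i];
}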
Additionally, if the environment variable ACC_NOTIFY is set, runtime information is printed out.
$ export ACC_NOTIFY=1 $ aprun ./vecAdd.out launch kernel file=/autofs/na1_home/atj/OLCF3/OpenACC/VecAdd/vecAdd-reduction.c function=main line=35 device=0 grid=79 block=128 queue=0 launch kernel file=/autofs/na1_home/atj/OLCF3/OpenACC/VecAdd/vecAdd-reduction.c function=main line=37 device=0 grid=1 block=256 queue=0
Cray
Compiling:
C
$ cc -h pragma=acc,msgs vecAdd.c -o vecAdd.out
Fortran
$ ftn -h acc,msgs vecAdd.f90 -o vecAdd.out
Output:
!$acc kernels copyin(a(1:n),b(1:n)), create(c(1:n)) ftn-6413 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 A data region was created at line 28 and ending at line 38. ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 Allocate memory and copy user shaped variable "b" to accelerator, free at line 38 (acc_copyin). ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 Allocate memory and copy user shaped variable "a" to accelerator, free at line 38 (acc_copyin). ftn-6421 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 Allocate memory for user shaped variable "c" on accelerator, free at line 38 (acc_share). do i=1,n ftn-6401 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 A loop starting at line 29 was placed on the accelerator. ftn-6430 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 A loop starting at line 29 was partitioned across the threadblocks and the 128 threads within a threadblock. ftn-6415 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 Allocate memory and copy variable "sum" to accelerator, copy back at line 32 (acc_copy).
Breaking this output down:
!$acc kernels copyin(a(1:n),b(1:n)), create(c(1:n)) ftn-6413 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 A data region was created at line 28 and ending at line 38. ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 Allocate memory and copy user shaped variable "b" to accelerator, free at line 38 (acc_copyin). ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 Allocate memory and copy user shaped variable "a" to accelerator, free at line 38 (acc_copyin). ftn-6421 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 Allocate memory for user shaped variable "c" on accelerator, free at line 38 (acc_share).
We see that at line 28, the start of our acc region, elements 1 to n of the vectors a and b will be copied to the GPU. Vector c does not need to be copied to the GPU, so it is allocated locally on the device.
do i=1,n ftn-6401 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 A loop starting at line 29 was placed on the accelerator. ftn-6430 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 A loop starting at line 29 was partitioned across the threadblocks and the 128 threads within a threadblock. ftn-6415 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 Allocate memory and copy variable "sum" to accelerator, copy back at line 32 (acc_copy).
At line 29, the start of our do loop, we see a kernel was generated with 128 threads per threadblock. At the end of the loop sum is copied from the accelerator back to the host.
Additionally, if the environment variable CRAY_ACC_DEBUG is set, runtime information is printed out.
$ export CRAY_ACC_DEBUG=1 $ aprun ./vecAdd.out ACC: Transfer 3 items (to acc 1600000 bytes, to host 0 bytes) from vecAdd-reduction.f90:28 ACC: Transfer 3 items (to acc 12 bytes, to host 0 bytes) from vecAdd-reduction.f90:29 ACC: Execute kernel main_$ck_L29_2 async(auto) from vecAdd-reduction.f90:29 ACC: Wait async(auto) from vecAdd-reduction.f90:32 ACC: Transfer 3 items (to acc 0 bytes, to host 8 bytes) from vecAdd-reduction.f90:32 ACC: Wait async(auto) from vecAdd-reduction.f90:38 ACC: Transfer 3 items (to acc 0 bytes, to host 0 bytes) from vecAdd-reduction.f90:38
CapsMC
#pragma acc loop independent
#pragma acc loop reduction(+:sum)
!$acc loop independent
!$acc loop reduction(+:sum)
Compiling:
C
$ capsmc --io-report cc vecAdd-reduction.c
Fortran
$ capsmc --io-report ftn vecAdd-reduction.f90
Output:
hmpp: [Info] IO report : hmppcg: [Message DPL0099] vecAdd-reductionHMPP.c:38: Loop 'i' was shared among workers(8) hmpp: [Info] Generated codelet filename is "__hmpp_acc_region__67z3pizb_cuda.hmf.cu".
Although not much information is provided at compile time, much more can be obtained at runtime by setting the environment variable HMPPRT_LOG_LEVEL="info".
$ export HMPPRT_LOG_LEVEL="info" $ aprun ./a.out [ 0.029828] ( 0) INFO : Enter kernels (queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.030021] ( 0) INFO : Acquire (target=cuda) [ 0.031667] ( 0) INFO : Allocate a[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451419] ( 0) INFO : Upload a[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451555] ( 0) INFO : Allocate b[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451610] ( 0) INFO : Upload b[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451721] ( 0) INFO : Allocate c[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451775] ( 0) INFO : Allocate i[0:1] (element_size=4, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451826] ( 0) INFO : Upload i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451866] ( 0) INFO : Allocate sum[0:1] (element_size=8, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451905] ( 0) INFO : Upload sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451940] ( 0) INFO : Call __hmpp_acc_region__67z3pizb (queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470447] ( 0) INFO : Download sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470506] ( 0) INFO : Free sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470555] ( 0) INFO : Download i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470590] ( 0) INFO : Free i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470626] ( 0) INFO : Free c[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470677] ( 0) INFO : Free b[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470727] ( 0) INFO : Free a[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470863] ( 0) INFO : Leave kernels (queue=none, location=vecAdd-reductionHMPP.c:34)
Breaking this output down:
[ 0.029828] ( 0) INFO : Enter kernels (queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.030021] ( 0) INFO : Acquire (target=cuda) [ 0.031667] ( 0) INFO : Allocate a[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451419] ( 0) INFO : Upload a[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451555] ( 0) INFO : Allocate b[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451610] ( 0) INFO : Upload b[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451721] ( 0) INFO : Allocate c[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451775] ( 0) INFO : Allocate i[0:1] (element_size=4, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451826] ( 0) INFO : Upload i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451866] ( 0) INFO : Allocate sum[0:1] (element_size=8, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451905] ( 0) INFO : Upload sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
We see that at line 34, the start of our acc region, elements 0 to n of the vectors a and b will be copied to the GPU. Vector c does not need to be copied to the GPU, so it is allocated locally on the device.
[ 0.451940] ( 0) INFO : Call __hmpp_acc_region__67z3pizb (queue=none, location=vecAdd-reductionHMPP.c:34)
The kernel is called.
[ 0.470447] ( 0) INFO : Download sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470506] ( 0) INFO : Free sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470555] ( 0) INFO : Download i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470590] ( 0) INFO : Free i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470626] ( 0) INFO : Free c[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470677] ( 0) INFO : Free b[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470727] ( 0) INFO : Free a[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470863] ( 0) INFO : Leave kernels (queue=none, location=vecAdd-reductionHMPP.c:34)
At the end of the kernel sum is copied from the accelerator back to the host and the device memory is released.
This sample shows the Game of Life CPU code converted to an OpenACC accelerator directive version. The full source can be viewed or downloaded at the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.
GOL.c
#include <stdio.h> #include <stdlib.h> #define SRAND_VALUE 1985 #define dim 1024 // Linear game grid dimension excluding ghost cells #define idx(i,j) ((i)*(dim+2)+(j)) // Add up all neighbors inline int getNeighbors(int *grid, int i, int j) { int num = grid[idx(i+1,j)] + grid[idx(i-1,j)] //upper lower + grid[idx(i,j+1)] + grid[idx(i,j-1)] //right left + grid[idx(i+1,j+1)] + grid[idx(i-1,j-1)] //diagonals + grid[idx(i-1,j+1)] + grid[idx(i+1,j-1)]; return num; } int main(int argc, char* argv[]) { int i,j,iter; // Number of game iterations int maxIter = 1<<10; // Total number of alive cells int total = 0; // Total size of grid in bytes size_t bytes = sizeof(int)*(dim+2)*(dim+2); // Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells int *grid = (int*) malloc(bytes); // Allocate newGrid int *restrict newGrid = (int*) malloc(bytes); // Assign initial population randomly srand(SRAND_VALUE); for(i = 1; i <= dim; i++) { for(j = 1; j <= dim; j++) { grid[idx(i,j)] = rand() % 2; } } int fullSize = (dim+2)*(dim+2); #pragma acc data copyin(grid[0:fullSize]) create(newGrid[0:fullSize]) { // Main game loop for (iter = 0; iter<maxIter; iter++) { #pragma acc kernels { // Left-Right columns #pragma acc loop independent for (i = 1; i <= dim; i++) { grid[idx(i,0)] = grid[idx(i,dim)]; //Copy first real column to right ghost column grid[idx(i,dim+1)] = grid[idx(i,1)]; //Copy last real column to left ghost column } // Top-Bottom rows #pragma acc loop independent for (j = 0; j <= dim+1; j++) { grid[idx(0,j)] = grid[idx(dim,j)]; //Copy first real row to bottom ghost row grid[idx(dim+1,j)] = grid[idx(1,j)]; //Copy last real row to top ghost row } // Now we loop over all cells and determine their fate #pragma acc loop independent for (i = 1; i <= dim; i++) { for (j = 1; j <= dim; j++) { // Get the number of neighbors for a given grid point int numNeighbors = getNeighbors(grid,i,j); // Here we have explicitly all of the game rules if (grid[idx(i,j)] == 1 && numNeighbors < 2) newGrid[idx(i,j)] = 0; else if (grid[idx(i,j)] == 1 && (numNeighbors == 2 || numNeighbors == 3)) newGrid[idx(i,j)] = 1; else if (grid[idx(i,j)] == 1 && numNeighbors > 3) newGrid[idx(i,j)] = 0; else if (grid[idx(i,j)] == 0 && numNeighbors == 3) newGrid[idx(i,j)] = 1; else newGrid[idx(i,j)] = grid[idx(i,j)]; } } // Can't switch pointers so we mannually have to copy array over #pragma acc loop independent for(i = 1; i <= dim; i++) { for(j = 1; j <= dim; j++) { grid[idx(i,j)] = newGrid[idx(i,j)]; } } } // End ACC kernels region } // End main game loop // Sum up alive cells #pragma acc parallel { #pragma acc loop reduction(+:total) for (i = 1; i <= dim; i++) { #pragma acc loop reduction(+:total) for (j = 1; j <= dim; j++) { total += grid[idx(i,j)]; } } } } // End ACC Data region // Print number of live cells printf("Total Alive: %d\n", total); // Release memory free(grid); free(newGrid); return 0; }
Changes to GOL.c
#define dim 1024 // Linear game grid dimension excluding ghost cells #define idx(i,j) ((i)*(dim+2)+(j))
At the time of this writing, multidimensional C arrays are not fully supported in all OpenACC implementations. A convenience macro is therefore defined and used whenever grid or newGrid is accessed; these changes have not all been highlighted. grid[i][j] will now be accessed as grid[idx(i,j)].
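As a small stand-alone check of the indexing (the specific cell chosen here is only for illustration), the macro maps a 2D coordinate into the padded, row-major linear array:

#include <stdio.h>
#define dim 1024                      /* interior dimension, as in GOL.c */
#define idx(i,j) ((i)*(dim+2)+(j))    /* row-major offset into the padded grid */

int main(void)
{
    /* Interior cell (1,1) lies one padded row plus one ghost column into
       the linear array: 1*(dim+2)+1 = 1027. grid[1][1] becomes grid[idx(1,1)]. */
    printf("idx(1,1) = %d\n", idx(1,1));
    return 0;
}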
inline int getNeighbors(int *grid, int i, int j)
OpenACC 1.0 does not support function calls within acc regions, so all function calls must be inlined. The inline qualifier can be used as a hint to the compiler that we wish to have the function inlined.
// Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells int *grid = (int*) malloc(bytes); // Allocate newGrid int *restrict newGrid = (int*) malloc(bytes);
The grid and newGrid arrays are now linear. As discussed in the vector addition example we use the restrict qualifier to help the compiler resolve loop dependencies.
#pragma acc data copyin(grid[0:fullSize]) create(newGrid[0:fullSize]) {
}
Memory transfers between host and device can be very costly and so optimizing memory transfers in any way possible is highly recommended. Here an ACC data region starts on line 49 and ends at line 101. grid will be copied to the accelerator at the start of the region. newGrid will be allocated on the device but no copies from host to device or device to host are needed.
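The general pattern is sketched below as a small stand-alone program (the array and iteration counts are placeholders, not the Game of Life code): one data region wraps the entire iteration loop so the array is transferred once, stays resident on the device for every iteration, and is copied back only when the region ends.

#include <stdlib.h>

int main(void)
{
    int n = 1000, maxIter = 100, iter, i;
    double *a = (double*)malloc(n * sizeof(double));
    for (i = 0; i < n; i++) a[i] = 0.0;

    /* a is copied to the device once at region entry and back to the
       host once at region exit, rather than once per iteration. */
    #pragma acc data copy(a[0:n])
    {
        for (iter = 0; iter < maxIter; iter++) {
            #pragma acc kernels
            for (i = 0; i < n; i++) {
                a[i] += 1.0;
            }
        }
    }
    free(a);
    return 0;
}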
#pragma acc kernels {
}
An acc kernels region will attempt to parallelize any enclosed loops across the accelerator. The kernels directive gives the compiler complete control, if not guided by the programmer, in deciding how to map the loops onto the target hardware. This directive does not guarantee that the compiler will be able to parallelize the loops across the device; if the compiler cannot determine that the loop iterations are independent, the loop may be run sequentially on the accelerator. It is imperative that the compiler feedback be checked to ensure that the loops are parallelized.
// Left-Right columns #pragma acc loop independent
// Top-Bottom rows #pragma acc loop independent
// Now we loop over all cells and determine their fate #pragma acc loop independent
// Can't switch pointers so we mannually have to copy array over #pragma acc loop independent
The compiler is not always able to determine that loop iterations can be done independently; when this happens, the loop will be scheduled sequentially on the accelerator. The loop independent clause tells the compiler that it is safe to parallelize the immediately following loop.
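A minimal, self-contained illustration of when the clause is needed is an indexed store (the permutation array here is hypothetical, not part of GOL.c): the compiler cannot prove the index values never repeat, so without help it may schedule the loop sequentially.

#include <stdlib.h>

int main(void)
{
    int n = 1000, i;
    int *perm = (int*)malloc(n * sizeof(int));
    double *in = (double*)malloc(n * sizeof(double));
    double *restrict out = (double*)malloc(n * sizeof(double));
    for (i = 0; i < n; i++) { perm[i] = n - 1 - i; in[i] = (double)i; }

    /* The compiler cannot prove perm[] never repeats a value, so it may
       refuse to parallelize this loop on its own. Knowing perm is a
       permutation, the programmer asserts the iterations are independent. */
    #pragma acc kernels copyin(in[0:n], perm[0:n]) copyout(out[0:n])
    {
        #pragma acc loop independent
        for (i = 0; i < n; i++) {
            out[perm[i]] = in[i];
        }
    }
    free(perm); free(in); free(out);
    return 0;
}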
// Can't switch pointers so we mannually have to copy array over #pragma acc loop independent for(i = 1; i <= dim; i++) { for(j = 1; j <= dim; j++) { grid[idx(i,j)] = newGrid[idx(i,j)]; } }
Device pointer manipulation is not currently supported by OpenACC so the pointer swap must be removed. The easiest way to fix this is to replace the pointer swap with a copy operation as we have done.
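For reference, the host-only idiom that had to be removed looks roughly like the fragment below (a sketch of typical CPU Game of Life code, not taken from this example): the pointers are simply exchanged so newGrid becomes the current grid for the next generation without copying any data.

/* Sketch of the CPU-only pointer swap that OpenACC cannot express:
   after the swap the device copies of grid and newGrid would no longer
   correspond to the host pointers, so a copy loop is used instead. */
int *tmp = grid;
grid = newGrid;
newGrid = tmp;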
// Sum up alive cells #pragma acc parallel { #pragma acc loop reduction(+:total) for (i = 1; i <= dim; i++) { #pragma acc loop reduction(+:total) for (j = 1; j <= dim; j++) { total += grid[idx(i,j)]; } } }
At the start of this acc parallel region a number of compiler-determined gangs will be launched on the GPU. The acc loop directive indicates that the immediately following loop will be parallelized across the accelerator. The reduction clause on both loops indicates that a reduction is required on the scalar variable total.
Compiling GOL.c
PGI
$ module load cudatoolkit $ cc -acc gol.c -Minline -Minfo -ta=nvidia:kepler
CRAY
Cray will automatically attempt to inline the getNeighbors function
$ module switch PrgEnv-pgi PrgEnv-cray $ module load craype-accel-nvidia35 $ cc -hpragma=acc,msgs gol.c
Compiling GOL.c with CapsMC
CapsMC will automatically attempt to inline the getNeighbors function
$ module load capsmc $ module load cudatoolkit $ cc gol.c
GOL.f90
program main implicit none integer :: i,j,iter,seed(8),numNeighbors,total real :: randm ! Linear game grid dimension integer :: dim = 1024 ! Number of game iterations integer :: maxIter = 2**10 ! Game grid pointers integer,dimension(:,:),allocatable :: grid, newGrid ! Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells allocate(grid(dim+2,dim+2)) allocate(newGrid(dim+2,dim+2)) ! Assign initial population randomly seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/) call random_seed(PUT=seed) do j=1,dim do i=1,dim call random_number(randm) grid(i,j) = nint(randm) enddo enddo !$acc data copyin(grid), create(newGrid) ! Main game loop do iter=1,maxIter !$acc kernels ! Top-Bottom ghost rows !$acc loop independent do j=2,dim+1 grid(1,j) = grid(dim+1,j) !Copy first game grid row to bottom ghost row grid(dim+2,j) = grid(2,j) !Copy first game grid row to top ghost row enddo ! Left-Right ghost columns !$acc loop independent do i=1,dim+2 grid(i,1) = grid(i, dim+1) !Copy first game grid column to right ghost column grid(i,dim+2) = grid(i,2) !Copy last game grid column to left ghost column enddo ! Now we loop over all cells and determine their fate do j=2,dim+1 do i=2,dim+1 ! Get the number of neighbors for a given grid point numNeighbors = grid(i,j+1) + grid(i,j-1)& !left and right + grid(i+1,j) + grid(i-1,j)& !upper and lower + grid(i+1,j+1) + grid(i-1,j-1)& !diagonals + grid(i-1,j+1) + grid(i+1,j-1) ! Here we have explicitly all of the game rules if(grid(i,j) == 1 .AND. numNeighbors < 2) then newGrid(i,j) = 0 elseif(grid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then newGrid(i,j) = 1 elseif(grid(i,j) == 1 .AND. numNeighbors > 3) then newGrid(i,j) = 0 elseif(grid(i,j) == 0 .AND. numNeighbors == 3) then newGrid(i,j) = 1 else newGrid(i,j) = grid(i,j) endif enddo enddo ! Can't switch pointers so we mannually have to copy array over do j=2,dim+1 do i=2,dim+1 grid(i,j) = newGrid(i,j) enddo enddo !$acc end kernels enddo!end main game loop total = 0 ! Sum up alive cells and print results !$acc parallel !$acc loop reduction(+:total) do j=2,dim+1 !$acc loop reduction(+:total) do i=2,dim+1 total = total + grid(i,j) enddo enddo !$acc end parallel !$acc end data print *, "Total Alive", total end program
Changes to GOL.f90
!$acc data copyin(grid), create(newGrid)
!$acc end data
Memory transfers between host and device can be very costly and so optimizing memory transfers in any way possible is highly recommended. Here an ACC data region starts on line 27 and ends at line 90. grid will be copied to the accelerator at the start of the region. newGrid will be allocated on the device but no copies from host to device or device to host are needed.
!$acc kernels
!$acc end kernels
An acc kernels region will attempt to parallelize any enclosed loops across the accelerator. The kernels directive gives the compiler complete control, if not guided by the programmer, in deciding how to map the loops onto the target hardware. This directive does not guarantee that the compiler will be able to parallelize the loops across the device; if the compiler cannot determine that the loop iterations are independent, the loop may be run sequentially on the accelerator. It is imperative that the compiler feedback be checked to ensure that the loops are parallelized.
!$acc loop independent
!$acc loop independent
The compiler is not always able to determine that loop iterations can be done independently; when this happens, the loop will be scheduled sequentially on the accelerator. The loop independent clause tells the compiler that it is safe to parallelize the immediately following loop.
numNeighbors = grid(i,j+1) + grid(i,j-1)& !left and right + grid(i+1,j) + grid(i-1,j)& !upper and lower + grid(i+1,j+1) + grid(i-1,j-1)& !diagonals + grid(i-1,j+1) + grid(i+1,j-1)
OpenACC 1.0 does not support function calls within acc regions, so all function calls must be inlined. For this example, using the current set of compilers, it was best to inline the neighbors function by hand.
! Can't switch pointers so we mannually have to copy array over do j=2,dim+1 do i=2,dim+1 grid(i,j) = newGrid(i,j) enddo enddo
Device pointer manipulation is not currently supported by OpenACC so the pointer swap must be removed. The easiest way to fix this is to replace the pointer swap with a copy operation as we have done.
! Sum up alive cells and print results !$acc parallel !$acc loop reduction(+:total) do j=2,dim+1 !$acc loop reduction(+:total) do i=2,dim+1 total = total + grid(i,j) enddo enddo !$acc end parallel
At the start of this acc parallel region a number of compiler-determined gangs will be launched on the GPU. The acc loop directive indicates that the immediately following loop will be parallelized across the accelerator. The reduction clause on both loops indicates that a reduction is required on the scalar variable total.
Compiling GOL.f90
OpenACC 1.0 does not support function calls within acc regions, so all function calls must be inlined using the appropriate compiler flag.
PGI
$ module load cudatoolkit $ ftn -acc gol.f90 -Minline -Minfo -ta=nvidia:kepler
CRAY
Cray will automatically try to inline the getNeighbors function
$ module switch PrgEnv-pgi PrgEnv-cray $ module load craype-accel-nvidia35 $ ftn -hacc,msgs gol.f90
Compiling GOL.f90 with CapsMC
CapsMC will automatically try to inline the getNeighbors function
$ module load capsmc $ module load cudatoolkit $ ftn gol.f90
The following C and Fortran examples show how to combine OpenACC data regions with GPU libraries such as CUFFT. Since files containing CUFFT function calls must be compiled with the nvcc compiler, and files containing OpenACC directives must be compiled with cc/CC/ftn, it is necessary to create wrapper functions as shown below. The acc host_data use_device directive will be used, which causes the device-resident addresses of the specified variables to be used in host code within the directive's scope. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.
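The essential pattern is small, so a stripped-down fragment of fft.c may help before the full listings: inside a data region, host_data use_device substitutes the device address of the listed array for the duration of the enclosed host code, which is exactly what the nvcc-compiled wrapper expects.

/* Fragment of the interoperability pattern used in fft.c below: data is
   resident on the device, and within the host_data region the name
   "data" resolves to its device address. */
#pragma acc data copy(data[0:2*n])
{
    #pragma acc host_data use_device(data)
    {
        launchCUFFT(data, n, stream);   /* receives a device pointer */
    }
}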
C/C++
fft.c
#include <stdio.h> #include <stdlib.h> #include <math.h> #include "openacc.h" #ifndef M_PI #define M_PI 3.14159265358979323846 #endif // Forward declaration of wrapper function that will call CUFFT extern void launchCUFFT(float *d_data, int n, void *stream); int main(int argc, char *argv[]) { int n = 256; float *data = malloc(2*n*sizeof(float)); int i; // Initialize interleaved input data on host float w = 7.0; float x; for(i=0; i<2*n; i+=2) { x = (float)i/2.0/(n-1); data[i] = cos(2*M_PI*w*x); data[i+1] = 0.0; } // Copy data to device at start of region and back to host and end of region #pragma acc data copy(data[0:2*n]) { // Inside this region the device data pointer will be used #pragma acc host_data use_device(data) { // Query OpenACC for CUDA stream void *stream = acc_get_cuda_stream(acc_async_sync); // Launch FFT on the GPU launchCUFFT(data, n, stream); } } // Find the frequency int max_id = 0; for(i=0; i<n; i+=2) { if( data[i] > data[max_id] ) max_id = i; } printf("frequency = %d\n", max_id/2); return 0; }
fft.c + cufft.cu
#include <cufft.h> // Declared extern "C" to disable C++ name mangling extern "C" void launchCUFFT(float *d_data, int n, void *stream) { cufftHandle plan; cufftPlan1d(&plan, n, CUFFT_C2C, 1); cufftSetStream(plan, (cudaStream_t)stream); cufftExecC2C(plan, (cufftComplex*)d_data, (cufftComplex*)d_data,CUFFT_FORWARD); cufftDestroy(plan); }
Compiling C/C++
Cray
$ module switch PrgEnv-pgi PrgEnv-cray $ module load craype-accel-nvidia35 $ nvcc cufft.cu -c $ cc -h pragma=acc -lcufft fft.c cufft.o
PGI
$ module load cudatoolkit $ nvcc cufft.cu -c $ cc -acc -lcufft fft.c cufft.o
Fortran
fft.f90
module cufft INTERFACE subroutine launchcufft(data, n, stream) BIND (C, NAME='launchCUFFT') USE ISO_C_BINDING implicit none type (C_PTR), value :: data integer (C_INT), value :: n type (C_PTR), value :: stream end subroutine END INTERFACE end module cufft program fft USE ISO_C_BINDING USE cufft USE openacc IMPLICIT NONE INTEGER, PARAMETER :: n = 256 COMPLEX (C_FLOAT_COMPLEX) :: data(n) INTEGER (C_INT):: i INTEGER :: max_id,istat type (C_PTR) :: stream ! Initialize interleaved input data on host REAL :: w = 7.0 REAL :: x REAL, PARAMETER :: PI = 3.1415927 do i=1,n x = (i-1.0)/(n-1.0); data(i) = CMPLX(COS(2.0*PI*w*x),0.0) enddo ! Copy data to device at start of region and back to host and end of region !$acc data copy(data) ! Inside this region the device data pointer will be used !$acc host_data use_device(data) ! Query OpenACC for CUDA stream stream = acc_get_cuda_stream(acc_async_sync) ! Launch FFT on the GPU call launchcufft(C_LOC(data), n, stream) !$acc end host_data !$acc end data ! Find the frequency max_id = 1 do i=1,n/2 if (REAL(data(i)) .gt. REAL(data(max_id))) then max_id = i-1 endif enddo print *, "frequency:", max_id end program fft
fft.f90 + cufft.cu
#include <cufft.h> // Declared extern "C" to disable C++ name mangling extern "C" void launchCUFFT(float *d_data, int n, void *stream) { cufftHandle plan; cufftPlan1d(&plan, n, CUFFT_C2C, 1); cufftSetStream(plan, (cudaStream_t)stream); cufftExecC2C(plan, (cufftComplex*)d_data, (cufftComplex*)d_data,CUFFT_FORWARD); cufftDestroy(plan); }
Compiling Fortran
Cray
$ module switch PrgEnv-pgi PrgEnv-cray $ module load craype-accel-nvidia35 $ nvcc -c cufft.cu $ ftn -c -hacc fft.f90 $ ftn fft.o cufft.o -lcufft
PGI
$ module load cudatoolkit $ nvcc -c cufft.cu $ ftn -c -acc fft.f90 $ ftn -acc fft.o cufft.o -lcufft
The following C and Fortran examples show how to leverage OpenACC, cuRand, and Thrust in a single application. This example will use cuRand to generate random data, OpenACC to hash the data, and finally Thrust to sort the data into canonical order. Although this example is provided only to illustrate OpenACC interoperability, it was written with a simple nearest neighbor hash in mind, operating on interleaved 3D positions (x1,y1,z1,x2,y2,z2,…). The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.
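To make the hashing step concrete, the short stand-alone calculation below applies the same uniform-grid formula as hash_val in sort.c to one sample point; the point itself and the assumption that coordinates lie in the unit cube are illustrative only, and hash_val additionally takes the absolute value of the grid coordinates to guard against negative values.

#include <stdio.h>
#include <math.h>

int main(void)
{
    double spacing = 0.01;                     /* same cell size as sort.c */
    float x = 0.505f, y = 0.255f, z = 0.105f;  /* an arbitrary sample point */
    int grid_x = (int)floor(x / spacing);      /* 50 */
    int grid_y = (int)floor(y / spacing);      /* 25 */
    int grid_z = (int)floor(z / spacing);      /* 10 */
    int num_x = 1.0 / spacing;                 /* 100 cells per axis */
    int num_y = 1.0 / spacing;
    /* Cells are numbered plane by plane, then row by row within a plane. */
    int hash = (num_x * num_y * grid_z) + (grid_y * num_x + grid_x);
    printf("cell index = %d\n", hash);         /* 100*100*10 + 25*100 + 50 = 102550 */
    return 0;
}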
C/C++
sort.c
#include <stdio.h> #include <stdlib.h> #include <math.h> #include "openacc.h" // Forward declaration of nvcc compiled functions void fill_rand(float *d_buffer, int num, void *stream); inline int hash_val(float x, float y, float z); void sort(int *d_key, int *d_values, int num, void *stream); int main(int argc, char* argv[]) { int i,index; float x,y,z; int dim = 3; int num = 1000000; int length = dim*num; float *restrict positions = (float*)malloc(length*sizeof(float)); int *restrict keys = (int*)malloc(num*sizeof(int)); int *restrict values = (int*)malloc(num*sizeof(int)); // Query OpenACC for CUDA stream void *stream = acc_get_cuda_stream(acc_async_sync); // OpenACC will create positions, keys, and values arrays on the device #pragma acc data create(positions[0:length], keys[0:num], values[0:num]) { // NVIDIA cuRandom will create our initial random data #pragma acc host_data use_device(positions) { fill_rand(positions, length, stream); } // OpenACC will calculate the hash value for each particle #pragma acc parallel loop for(i=0; i<num; i++) { index = i*3; x = positions[index]; y = positions[index+1]; z = positions[index+2]; // Key is hash value and value is 'particle id' keys[i] = hash_val(x,y,z); values[i] = i; } // Thrust will be used to sort our key value pairs #pragma acc host_data use_device(keys, values) { sort(keys, values, num, stream); } } return 0; } // Uniform grid hash inline int hash_val(float x, float y, float z) { double spacing = 0.01; // Calculate grid coordinates int grid_x,grid_y,grid_z; grid_x = abs(floor(x/spacing)); grid_y = abs(floor(y/spacing)); grid_z = abs(floor(z/spacing)); int num_x = 1.0/spacing; int num_y = 1.0/spacing; int num_z = 1.0/spacing; int grid_position = (num_x * num_y * grid_z) + (grid_y * num_x + grid_x); return grid_position; }
sortGPU.cu
#include <stdio.h> #include <curand.h> #include <thrust/sort.h> #include <thrust/execution_policy.h> #include <thrust/binary_search.h> #include <thrust/iterator/counting_iterator.h> #include <thrust/system_error.h> // Fill d_buffer with num random numbers extern "C" void fill_rand(float *d_buffer, int num, void *stream) { curandGenerator_t gen; int status; // Create generator status = curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT); // Set CUDA stream status |= curandSetStream(gen, (cudaStream_t)stream); // Set seed status |= curandSetPseudoRandomGeneratorSeed(gen, 1234ULL); // Generate num random numbers status |= curandGenerateUniform(gen, d_buffer, num); // Cleanup generator status |= curandDestroyGenerator(gen); if (status != CURAND_STATUS_SUCCESS) { printf ("curand failure!\n"); exit (EXIT_FAILURE); } } // Sort key value pairs extern "C" void sort(int *keys, int *values, int num, void *stream) { try { // Sort keys AND values array by key thrust::sort_by_key(thrust::cuda::par.on((cudaStream_t)stream), keys, keys + num, values); } catch(thrust::system_error &e) { std::cerr << "Error sorting with Thrust: " << e.what() << std::endl; exit (EXIT_FAILURE); } }
Compiling C/C++
Cray
$ module switch PrgEnv-pgi PrgEnv-cray $ module load craype-accel-nvidia35 $ nvcc sortGPU.cu -c $ cc -hpragma=acc -lcurand sort.c sortGPU.o
PGI
$ module load cudatoolkit $ nvcc sortGPU.cu -c $ cc -acc -lcurand sort.c sortGPU.o -Minline
Fortran
sort.f90
! Module containing nvcc compiled function declarations and hash function module sortgpu INTERFACE subroutine fill_rand(positions, length, stream) BIND(C,NAME='fill_rand') USE ISO_C_BINDING implicit none type (C_PTR), value :: positions integer (C_INT), value :: length type (C_PTR), value :: stream end subroutine fill_rand subroutine sort(keys, values, num, stream) BIND(C,NAME='sort') USE ISO_C_BINDING implicit none type (C_PTR), value :: keys type (C_PTR), value :: values integer (C_INT), value :: num type (C_PTR), value :: stream end subroutine sort END INTERFACE contains integer function hash_val(x,y,z) real :: x,y,z,spacing integer :: grid_x, grid_y, grid_z integer :: num_x, num_y, num_z spacing = 0.01; grid_x = abs(floor(x/spacing)) grid_y = abs(floor(y/spacing)) grid_z = abs(floor(z/spacing)) num_x = 1.0/spacing num_y = 1.0/spacing num_z = 1.0/spacing hash_val = (num_x * num_y * grid_z) + (grid_y * num_x + grid_x) + 1 return end function hash_val end module sortgpu program interop use ISO_C_BINDING use sortgpu use openacc implicit none integer :: i,indx real :: x,y,z integer, parameter :: dims = 3 integer(C_INT), parameter :: num = 1000000 integer(C_INT), parameter :: length = dims*num real (C_FLOAT) :: positions(length) integer(C_INT) :: keys(num) integer(C_INT) :: values(num) type (C_PTR) :: stream ! Query OpenACC for CUDA stream stream = acc_get_cuda_stream(acc_async_sync) ! OpenACC will create positions, keys, and values arrays on the device !$acc data create(positions, keys, values) ! NVIDIA cuRandom will create our initial random data !$acc host_data use_device(positions) call fill_rand(C_LOC(positions), length, stream) !$acc end host_data ! OpenACC will calculate the hash value for each particle !$acc parallel loop do i=0,num-1 indx = i*3; x = positions(indx+1) y = positions(indx+2) z = positions(indx+3) ! Key is 'particle' id and value is hashed position keys(i+1) = hash_val(x,y,z); values(i+1) = i+1; enddo !$acc end parallel loop ! Thrust will be used to sort our key value pairs !$acc host_data use_device(keys, values) call sort(C_LOC(keys), C_LOC(values), num, stream); !$acc end host_data !$acc end data end program interop
sortGPU.cu
#include <stdio.h> #include <curand.h> #include <thrust/sort.h> #include <thrust/execution_policy.h> #include <thrust/binary_search.h> #include <thrust/iterator/counting_iterator.h> #include <thrust/system_error.h> // Fill d_buffer with num random numbers extern "C" void fill_rand(float *d_buffer, int num, void *stream) { curandGenerator_t gen; int status; // Create generator status = curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT); // Set CUDA stream status |= curandSetStream(gen, (cudaStream_t)stream); // Set seed status |= curandSetPseudoRandomGeneratorSeed(gen, 1234ULL); // Generate num random numbers status |= curandGenerateUniform(gen, d_buffer, num); // Cleanup generator status |= curandDestroyGenerator(gen); if (status != CURAND_STATUS_SUCCESS) { printf ("curand failure!\n"); exit (EXIT_FAILURE); } } // Sort key value pairs extern "C" void sort(int *keys, int *values, int num, void *stream) { try { // Sort keys AND values array by key thrust::sort_by_key(thrust::cuda::par.on((cudaStream_t)stream), keys, keys + num, values); } catch(thrust::system_error &e) { std::cerr << "Error sorting with Thrust: " << e.what() << std::endl; exit (EXIT_FAILURE); } }
Compiling Fortran
Cray
$ module switch PrgEnv-pgi PrgEnv-cray $ module load craype-accel-nvidia35 $ nvcc sortGPU.cu -c $ ftn -hacc -lcurand sort.f90 sortGPU.o
PGI
$ module load cudatoolkit $ module switch pgi pgi/13.6.0 $ nvcc sortGPU.cu -c $ ftn -acc -lcurand sort.f90 sortGPU.o -Minline=name:hash_val
This tutorial will cover linking GPU code compiled with NVCC and CPU code compiled with the Cray compiler wrappers cc/CC/ftn. The Cray compiler wrappers work in conjunction with the modules system to link in needed libraries such as MPI; it is therefore recommended that the wrappers be used to compile the CPU portions of your code. To generate compiler-portable code it is necessary to compile CUDA C, and any code containing CUDA runtime calls, with NVCC. The resulting NVCC-compiled object code must then be linked in with object code compiled with the Cray wrappers. NVCC performs GNU-style C++ name mangling on compiled functions, so care must be taken in compiling and linking codes. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.
C++
When linking C++ host code with NVCC compiled code the C++ code must use GNU compatible name mangling. This is controlled through compiler wrapper flags.
vecAddWrapperCXX.cu
#include <stdio.h> #include <stdlib.h> #include <math.h> // CUDA kernel. Each thread takes care of one element of c __global__ void vecAdd(float *a, float *b, float *c, int n) { // Get our global thread ID int id = blockIdx.x*blockDim.x+threadIdx.x; // Make sure we do not go out of bounds if (id < n) c[id] = a[id] + b[id]; } void vecAdd_wrapper() { // Size of vectors int n = 100000; // Host input vectors float *h_a; float *h_b; //Host output vector float *h_c; // Device input vectors float *d_a; float *d_b; //Device output vector float *d_c; // Size, in bytes, of each vector size_t bytes = n*sizeof(float); // Allocate memory for each vector on host h_a = (float*)malloc(bytes); h_b = (float*)malloc(bytes); h_c = (float*)malloc(bytes); // Allocate memory for each vector on GPU cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes); int i; // Initialize vectors on host for( i = 0; i < n; i++ ) { h_a[i] = sinf(i)*sinf(i); h_b[i] = cosf(i)*cosf(i); } // Copy host vectors to device cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice); cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice); int blockSize, gridSize; // Number of threads in each thread block blockSize = 1024; // Number of thread blocks in grid gridSize = (int)ceil((float)n/blockSize); // Execute the kernel vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n); // Copy array back to host cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost ); // Sum up vector c and print result divided by n, this should equal 1 within error float sum = 0; for(i=0; i<n; i++) sum += h_c[i]; printf("final result: %f\n", sum/n); // Release device memory cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); // Release host memory free(h_a); free(h_b); free(h_c); }
vecAdd.cxx
#include <stdio.h> void vecAdd_wrapper(void); int main( int argc, char* argv[] ) { vecAdd_wrapper(); return 0; }
Compiling C++
PGI
$ module load cudatoolkit $ nvcc -c vecAddWrapperCXX.cu $ CC --gnu vecAdd.cxx vecAddWrapperCXX.o
Cray
$ module load cudatoolkit $ nvcc -c vecAddWrapperCXX.cu $ CC vecAdd.cxx vecAddWrapperCXX.o
GNU
$ module load cudatoolkit $ nvcc -c vecAddWrapperCXX.cu $ CC vecAdd.cxx vecAddWrapperCXX.o
Intel
$ module load cudatoolkit $ nvcc -c vecAddWrapperCXX.cu $ CC vecAdd.cxx vecAddWrapperCXX.o
C
NVCC name mangling must be disabled if it is to be linked with C code. This requires the extern “C” function qualifier.
vecAddWrapperC.cu
#include <stdio.h> #include <stdlib.h> #include <math.h> // CUDA kernel. Each thread takes care of one element of c __global__ void vecAdd(float *a, float *b, float *c, int n) { // Get our global thread ID int id = blockIdx.x*blockDim.x+threadIdx.x; // Make sure we do not go out of bounds if (id < n) c[id] = a[id] + b[id]; } extern "C" void vecAdd_wrapper() { // Size of vectors int n = 100000; // Host input vectors float *h_a; float *h_b; //Host output vector float *h_c; // Device input vectors float *d_a; float *d_b; //Device output vector float *d_c; // Size, in bytes, of each vector size_t bytes = n*sizeof(float); // Allocate memory for each vector on host h_a = (float*)malloc(bytes); h_b = (float*)malloc(bytes); h_c = (float*)malloc(bytes); // Allocate memory for each vector on GPU cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes); int i; // Initialize vectors on host for( i = 0; i < n; i++ ) { h_a[i] = sinf(i)*sinf(i); h_b[i] = cosf(i)*cosf(i); } // Copy host vectors to device cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice); cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice); int blockSize, gridSize; // Number of threads in each thread block blockSize = 1024; // Number of thread blocks in grid gridSize = (int)ceil((float)n/blockSize); // Execute the kernel vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n); // Copy array back to host cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost ); // Sum up vector c and print result divided by n, this should equal 1 within error float sum = 0; for(i=0; i<n; i++) sum += h_c[i]; printf("final result: %f\n", sum/n); // Release device memory cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); // Release host memory free(h_a); free(h_b); free(h_c); }
vecAdd.c
#include <stdio.h> void vecAdd_wrapper(void); int main( int argc, char* argv[] ) { vecAdd_wrapper(); return 0; }
Compiling C
There should be no compiler dependent variations for C.
$ module load cudatoolkit $ nvcc -c vecAddWrapperC.cu $ cc vecAdd.c vecAddWrapperC.o
Fortran: Simple
This simple method allows C code to be called from Fortran but requires that the C function name be modified. NVCC name mangling must be disabled if it is to be linked with Fortran code; this requires the extern "C" function qualifier. Additionally, the function name must be lowercase and end in an underscore ( _ ). The ISO_C_BINDING method following this example allows you to call C functions from Fortran without modifying the C function name.
vecAddWrapperF.cu
#include <stdio.h> #include <stdlib.h> #include <math.h> // CUDA kernel. Each thread takes care of one element of c __global__ void vecAdd(float *a, float *b, float *c, int n) { // Get our global thread ID int id = blockIdx.x*blockDim.x+threadIdx.x; // Make sure we do not go out of bounds if (id < n) c[id] = a[id] + b[id]; } extern "C" void vecadd_wrapper_() { // Size of vectors int n = 100000; // Host input vectors float *h_a; float *h_b; //Host output vector float *h_c; // Device input vectors float *d_a; float *d_b; //Device output vector float *d_c; // Size, in bytes, of each vector size_t bytes = n*sizeof(float); // Allocate memory for each vector on host h_a = (float*)malloc(bytes); h_b = (float*)malloc(bytes); h_c = (float*)malloc(bytes); // Allocate memory for each vector on GPU cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes); int i; // Initialize vectors on host for( i = 0; i < n; i++ ) { h_a[i] = sinf(i)*sinf(i); h_b[i] = cosf(i)*cosf(i); } // Copy host vectors to device cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice); cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice); int blockSize, gridSize; // Number of threads in each thread block blockSize = 1024; // Number of thread blocks in grid gridSize = (int)ceil((float)n/blockSize); // Execute the kernel vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n); // Copy array back to host cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost ); // Sum up vector c and print result divided by n, this should equal 1 within error float sum = 0; for(i=0; i<n; i++) sum += h_c[i]; printf("final result: %f\n", sum/n); // Release device memory cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); // Release host memory free(h_a); free(h_b); free(h_c); }
vecAdd.f90
program vecAdd call vecadd_wrapper() end program vecAdd
Compiling Fortran: Simple
There should be no compiler dependent variations for Fortran.
$ module load cudatoolkit $ nvcc -c vecAddWrapperF.cu $ ftn vecAddWrapperF.o vecAdd.f90
Fortran: ISO_C_BINDING
The ISO_C_BINDING module provides Fortran with greater interoperability with C and reduces the need to modify the C function name.
vecAddWrapperISO.cu
#include <stdio.h> #include <stdlib.h> #include <math.h> // CUDA kernel. Each thread takes care of one element of c __global__ void vecAdd(float *a, float *b, float *c, int n) { // Get our global thread ID int id = blockIdx.x*blockDim.x+threadIdx.x; // Make sure we do not go out of bounds if (id < n) c[id] = a[id] + b[id]; } extern "C" void vecAdd_wrapper() { // Size of vectors int n = 100000; // Host input vectors float *h_a; float *h_b; //Host output vector float *h_c; // Device input vectors float *d_a; float *d_b; //Device output vector float *d_c; // Size, in bytes, of each vector size_t bytes = n*sizeof(float); // Allocate memory for each vector on host h_a = (float*)malloc(bytes); h_b = (float*)malloc(bytes); h_c = (float*)malloc(bytes); // Allocate memory for each vector on GPU cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes); int i; // Initialize vectors on host for( i = 0; i < n; i++ ) { h_a[i] = sinf(i)*sinf(i); h_b[i] = cosf(i)*cosf(i); } // Copy host vectors to device cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice); cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice); int blockSize, gridSize; // Number of threads in each thread block blockSize = 1024; // Number of thread blocks in grid gridSize = (int)ceil((float)n/blockSize); // Execute the kernel vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n); // Copy array back to host cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost ); // Sum up vector c and print result divided by n, this should equal 1 within error float sum = 0; for(i=0; i<n; i++) sum += h_c[i]; printf("final result: %f\n", sum/n); // Release device memory cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); // Release host memory free(h_a); free(h_b); free(h_c); }
vecAddISO.f90
module vector INTERFACE subroutine vecadd_wrapper() BIND (C, NAME='vecAdd_wrapper') USE ISO_C_BINDING implicit none end subroutine vecadd_wrapper END INTERFACE end module vector program vecAdd use ISO_C_BINDING use vector call vecadd_wrapper() end program vecAdd
Compiling Fortran: ISO_C_BINDING
There should be no compiler dependent variations for Fortran.
$ module load cudatoolkit $ nvcc -c vecAddWrapperISO.cu $ ftn vecAddWrapperISO.o vecAddISO.f90
Cray’s implementation of MPICH2 allows GPU memory buffers to be passed directly to MPI function calls, eliminating the need to manually copy GPU data to the host before passing data through MPI. Several examples of using this feature are given below. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.
How to Enable
To enable GPUDirect the following steps must be taken before invoking aprun:
$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH $ export MPICH_RDMA_ENABLED_CUDA=1
Examples
The following examples will initialize memory on the GPU and then perform an MPI_Allgather operation between GPUs.
For these examples use two nodes and request an interactive job.
$ qsub -I -A PROJ### -lnodes=2,walltime=00:30:00
The executables must be placed in /tmp/work/$USER to run properly on the GPU.
CUDA C
direct.cpp
#include <stdio.h> #include <stdlib.h> #include <cuda_runtime.h> #include <mpi.h> int main( int argc, char** argv ) { MPI_Init (&argc, &argv); int direct; int rank, size; int *h_buff = NULL; int *d_rank = NULL; int *d_buff = NULL; size_t bytes; int i; // Ensure that RDMA ENABLED CUDA is set correctly direct = getenv("MPICH_RDMA_ENABLED_CUDA")==NULL?0:atoi(getenv ("MPICH_RDMA_ENABLED_CUDA")); if(direct != 1){ printf ("MPICH_RDMA_ENABLED_CUDA not enabled!\n"); exit (EXIT_FAILURE); } // Get MPI rank and size MPI_Comm_rank (MPI_COMM_WORLD, &rank); MPI_Comm_size (MPI_COMM_WORLD, &size); // Allocate host and device buffers and copy rank value to GPU bytes = size*sizeof(int); h_buff = (int*)malloc(bytes); cudaMalloc(&d_buff, bytes); cudaMalloc(&d_rank, sizeof(int)); cudaMemcpy(d_rank, &rank, sizeof(int), cudaMemcpyHostToDevice); // Preform Allgather using device buffer MPI_Allgather(d_rank, 1, MPI_INT, d_buff, 1, MPI_INT, MPI_COMM_WORLD); // Check that the GPU buffer is correct cudaMemcpy(h_buff, d_buff, bytes, cudaMemcpyDeviceToHost); for(i=0; i<size; i++){ if(h_buff[i] != i) { printf ("Alltoall Failed!\n"); exit (EXIT_FAILURE); } } if(rank==0) printf("Success!\n"); // Clean up free(h_buff); cudaFree(d_buff); cudaFree(d_rank); MPI_Finalize(); return 0; }
Compiling CUDA C
For ease of compiling, the GNU programming environment will be used.
$ module load cudatoolkit $ module switch PrgEnv-pgi PrgEnv-gnu $ CC -lcudart direct.cpp -o direct.out
Running CUDA C
$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH $ export MPICH_RDMA_ENABLED_CUDA=1 $ aprun -n2 -N1 ./direct.out
CUDA Fortran
direct.cuf
program GPUdirect use cudafor implicit none include 'mpif.h' integer :: direct character(len=255) :: env_var integer :: rank, size, ierror integer,dimension(:),allocatable :: h_buff integer,device :: d_rank integer,dimension(:),allocatable,device :: d_buff integer :: i call getenv("MPICH_RDMA_ENABLED_CUDA", env_var) read( env_var, '(i10)' ) direct if (direct .NE. 1) then print *, 'MPICH_RDMA_ENABLED_CUDA not enabled!' call exit(1) endif call MPI_INIT(ierror) ! Get MPI rank and size call MPI_COMM_RANK (MPI_COMM_WORLD, rank, ierror) call MPI_COMM_SIZE (MPI_COMM_WORLD, size, ierror) ! Initialize host and device buffers allocate(h_buff(size)) allocate(d_buff(size)) ! Implicity copy rank to device d_rank = rank ! Preform allgather using device buffers call MPI_ALLGATHER(d_rank, 1, MPI_INTEGER, d_buff, 1, MPI_INTEGER, MPI_COMM_WORLD, ierror); ! Check that buffer is correct h_buff = d_buff(1:size) do i=1,size if (h_buff(i) .NE. i-1) then print *, 'Alltoall Failed!' call exit(1) endif enddo if (rank .EQ. 0) then print *, 'Success!' endif ! Clean up deallocate(h_buff) deallocate(d_buff) call MPI_FINALIZE(ierror) end program GPUdirect
Compiling CUDA Fortran
$ module load cudatoolkit $ ftn direct.cuf -o direct.out
Running CUDA Fortran
$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH $ export MPICH_RDMA_ENABLED_CUDA=1 $ aprun -n2 -N1 ./direct.out
OpenACC C
direct.c
#include <stdio.h> #include <stdlib.h> #include <mpi.h> int main( int argc, char** argv ) { MPI_Init (&argc, &argv); int direct; int rank, size; int *restrict buff = NULL; size_t bytes; int i; // Ensure that RDMA ENABLED CUDA is set correctly direct = getenv("MPICH_RDMA_ENABLED_CUDA")==NULL?0:atoi(getenv ("MPICH_RDMA_ENABLED_CUDA")); if(direct != 1){ printf ("MPICH_RDMA_ENABLED_CUDA not enabled!\n"); exit (EXIT_FAILURE); } // Get MPI rank and size MPI_Comm_rank (MPI_COMM_WORLD, &rank); MPI_Comm_size (MPI_COMM_WORLD, &size); // Initialize buffer bytes = size*sizeof(int); buff = (int*)malloc(bytes); // Copy buff to device at start of region and back to host and end of region #pragma acc data copy(rank, buff[0:size]) { // Inside this region the device data pointer will be used #pragma acc host_data use_device(rank, buff) { MPI_Allgather(&rank, 1, MPI_INT, buff, 1, MPI_INT, MPI_COMM_WORLD); } } // Check that buffer is correct for(i=0; i<size; i++){ if(buff[i] != i) { printf ("Alltoall Failed!\n"); exit (EXIT_FAILURE); } } if(rank==0) printf("Success!\n"); // Clean up free(buff); MPI_Finalize(); return 0; }
Compiling OpenACC C
CRAY
$ module switch PrgEnv-pgi PrgEnv-cray $ module load craype-accel-nvidia35 $ cc -hpragma=acc direct.c -o direct.out
PGI
$ module load cudatoolkit $ cc -acc -lcudart direct.c -o direct.out
Running OpenACC C
CRAY
$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH $ export MPICH_RDMA_ENABLED_CUDA=1 $ aprun -n2 -N1 ./direct.out
PGI
$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH $ export MPICH_RDMA_ENABLED_CUDA=1 $ aprun -n2 -N1 ./direct.out
OpenACC Fortran
direct.f90
program GPUdirect include 'mpif.h' integer :: direct character(len=255) :: env_var integer :: rank, size, ierror integer,dimension(:),allocatable :: buff integer :: i call getenv("MPICH_RDMA_ENABLED_CUDA", env_var) read( env_var, '(i10)' ) direct if (direct .NE. 1) then print *, 'MPICH_RDMA_ENABLED_CUDA not enabled!' call exit(1) endif call MPI_INIT(ierror) ! Get MPI rank and size call MPI_COMM_RANK (MPI_COMM_WORLD, rank, ierror) call MPI_COMM_SIZE (MPI_COMM_WORLD, size, ierror) ! Initialize buffer allocate(buff(size)) ! Copy buff to device at start of region and back to host and end of region !$acc data copy(rank, buff(1:size)) ! Inside this region the device data pointer will be used !$acc host_data use_device(rank, buff) ! Preform all to all using device buffer call MPI_ALLGATHER(rank, 1, MPI_INT, buff, 1, MPI_INT, MPI_COMM_WORLD, ierror); !$acc end host_data !$acc end data ! Check that buffer is correct do i=1,size if (buff(i) .NE. i-1) then print *, 'Alltoall Failed!' call exit(1) endif enddo if (rank .EQ. 0) then print *, 'Success!' endif ! Clean up deallocate(buff) call MPI_FINALIZE(ierror) end program GPUdirect
Compiling OpenACC Fortran
CRAY
$ module switch PrgEnv-pgi PrgEnv-cray $ module load craype-accel-nvidia35 $ ftn -hacc direct.f90 -o direct.out
PGI
$ module load cudatoolkit $ ftn -acc -lcudart direct.f90 -o direct.out
Running OpenACC Fortran
CRAY
$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH $ export MPICH_RDMA_ENABLED_CUDA=1 $ aprun -n2 -N1 ./direct.out
PGI
$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH $ export MPICH_RDMA_ENABLED_CUDA=1 $ aprun -n2 -N1 ./direct.out
Optimizations
Several optimizations for improving performance are given below. These optimizations are highly application dependent and may require some trial and error tuning to achieve best results.
Pipelining
Pipelining allows for overlapping of GPU to GPU MPI messages and may improve message passing performance for large bandwidth bound messages. Setting the environment variable MPICH_G2G_PIPELINE=N allows a maximum of N GPU to GPU messages to be in flight at any given time. The default value of MPICH_G2G_PIPELINE is 16 and messages under 8 Kilobytes in size are never pipelined.
Nemesis
Applications using asynchronous MPI calls may benefit from enabling the MPICH asynchronous progress feature. Setting the MPICH_NEMESIS_ASYNC_PROGRESS=1 environment variable enables additional threads to be spawned to progress the MPI state.
This feature requires that the thread level be set to multiple: MPICH_MAX_THREAD_SAFETY=multiple.
This feature works best when used in conjunction with core specialization: aprun -r N, which allows for N CPU cores to be reserved for system services.
The default GPU compute mode for Titan is Exclusive Process. In this mode, many threads within a process may access the GPU context. To allow multiple processes access to the GPU context, such as multiple MPI tasks on a single node accessing the GPU, the CUDA proxy server was developed. Once enabled, the proxy server transparently manages work issued to the GPU context from multiple processes. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov
How to Enable
To enable the proxy server the following steps must be taken before invoking aprun:
$ export CRAY_CUDA_MPS=1
Issues
Currently, GPU debugging and profiling are not supported when the proxy is enabled. Specifying the qsub flag -lfeature=gpudefault will switch the compute mode from exclusive process to the CUDA default mode. In the default mode debugging and profiling are available, and multiple MPI ranks will be able to access the GPU. The default compute mode is not recommended on Titan: in the default compute mode approximately 120 MB of device memory is used per process accessing the GPU, and inconsistent behavior may be encountered under certain conditions.
Examples
The following examples demonstrate when and how to use the CUDA proxy. In each example several kernels are launched on the GPU. These examples will be used in a future tutorial to investigate how multiple kernels are scheduled on the GPU.
For these examples use a single node and request an interactive job.
$ qsub -I -A PROJ### -lnodes=1,walltime=00:30:00
C Wrapper
The following C wrapper functions will be used to set up and launch a kernel that sleeps a single GPU thread for a specified number of seconds. This file will be compiled with NVCC and then linked into our examples.
pauseWrapper.cu
#include <stdio.h> #include <stdlib.h> #include <time.h> // CUDA kernel to pause for at least num_cycle cycles __global__ void sleep(int64_t num_cycles) { int64_t cycles = 0; int64_t start = clock64(); while(cycles < num_cycles) { cycles = clock64() - start; } } // Returns number of cycles required for requested seconds extern "C" int64_t get_cycles(int seconds) { // Get device frequency in Hz int64_t Hz; cudaDeviceProp prop; cudaGetDeviceProperties(&prop, 0); Hz = int64_t(prop.clockRate) * 1000; // Calculate number of cycles to wait int64_t num_cycles; num_cycles = seconds * Hz; return num_cycles; } // Launches a kernel that sleeps for at least num_cycles extern "C" void sleep_kernel(int64_t num_cycles) { // Our kernel will launch a single thread to sleep the kernel int blockSize, gridSize; blockSize = 1; gridSize = 1; // Execute the kernel sleep<<<gridSize, blockSize>>>(num_cycles); } // Wait for all pending GPU transactions to end extern "C" void wait_for_gpu() { cudaDeviceSynchronize(); }
Compiling
The following will compile the wrapper functions into object code to be linked in with our examples.
$ module load cudatoolkit $ nvcc -c -arch=sm_35 pauseWrapper.cu
Multiple kernels single process
Launching multiple kernels from a single process does not require the CUDA proxy.
launcher.cpp
#include <stdio.h> #include <stdlib.h> #include <stdint.h> extern "C" int64_t get_cycles(int seconds); extern "C" void sleep_kernel(int64_t num_cycles); extern "C" void wait_for_gpu(); int main(int argc, char *argv[]) { // Get number of cycles to sleep for 1 second uint64_t cycles; cycles = get_cycles(1); // Number of kernels to launch int num_kernels = 14; // Launch num_kernel kernels asynchrnously for(int i=0; i<num_kernels; i++){ sleep_kernel(cycles); } // Wait for the kernel to complete wait_for_gpu(); return 0; }
Compiling and running multiple kernels from a single process
$ CC pauseWrapper.o launcher.cpp -o sleep.out $ aprun -n1 ./sleep.out
Multiple kernels launched from multiple threads
Launching multiple kernels from multiple threads spawned from a single process does not require the CUDA proxy.
launcherOMP.cpp
#include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <omp.h> extern "C" int64_t get_cycles(int seconds); extern "C" void sleep_kernel(int64_t num_cycles); extern "C" void wait_for_gpu(); int main(int argc, char *argv[]) { // Get number of cycles to sleep for 1 second uint64_t cycles; cycles = get_cycles(1); // Number of kernels to launch int num_kernels = 14; // Launch kernel omp_set_num_threads(num_kernels); #pragma omp parallel for shared(cycles) num_threads(num_kernels) for(int i=0; i<num_kernels; i++){ sleep_kernel(cycles); } // Wait for the kernel to complete wait_for_gpu(); }
Compiling and running multiple kernels launched from multiple threads
$ CC pauseWrapper.o launcherOMP.cpp -o sleep.out $ aprun -n1 -d14 ./sleep.out
Multiple kernels launched from multiple MPI tasks
Launching multiple kernels from multiple MPI tasks on a single node does require the CUDA proxy.
launcherMPI.cpp
#include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <mpi.h> extern "C" int64_t get_cycles(int seconds); extern "C" void sleep_kernel(int64_t num_cycles); extern "C" void wait_for_gpu(); int main(int argc, char *argv[]) { MPI_Init(&argc, &argv); int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank); // Get number of cycles to sleep for 1 second uint64_t cycles; cycles = get_cycles(1); // Sleep kernel for 1 second sleep_kernel(cycles); // Wait for the kernel to complete wait_for_gpu(); MPI_Finalize(); return 0; }
Compiling and running multiple kernels launched from multiple MPI tasks
$ CC pauseWrapper.o launcherMPI.cpp -o sleep.out $ export CRAY_CUDA_MPS=1 $ aprun -n14 -N14 ./sleep.out
This sample shows a minimal conversion from our vector addition CPU code to C for CUDA; consider this a CUDA C ‘Hello World’. If you are not yet familiar with basic CUDA concepts please see the Accelerated Computing Guide. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov
vecAdd.cu
#include <stdio.h> #include <stdlib.h> #include <math.h> // CUDA kernel. Each thread takes care of one element of c __global__ void vecAdd(double *a, double *b, double *c, int n) { // Get our global thread ID int id = blockIdx.x*blockDim.x+threadIdx.x; // Make sure we do not go out of bounds if (id < n) c[id] = a[id] + b[id]; } int main( int argc, char* argv[] ) { // Size of vectors int n = 100000; // Host input vectors double *h_a; double *h_b; //Host output vector double *h_c; // Device input vectors double *d_a; double *d_b; //Device output vector double *d_c; // Size, in bytes, of each vector size_t bytes = n*sizeof(double); // Allocate memory for each vector on host h_a = (double*)malloc(bytes); h_b = (double*)malloc(bytes); h_c = (double*)malloc(bytes); // Allocate memory for each vector on GPU cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes); int i; // Initialize vectors on host for( i = 0; i < n; i++ ) { h_a[i] = sin(i)*sin(i); h_b[i] = cos(i)*cos(i); } // Copy host vectors to device cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice); cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice); int blockSize, gridSize; // Number of threads in each thread block blockSize = 1024; // Number of thread blocks in grid gridSize = (int)ceil((float)n/blockSize); // Execute the kernel vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n); // Copy array back to host cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost ); // Sum up vector c and print result divided by n, this should equal 1 within error double sum = 0; for(i=0; i<n; i++) sum += h_c[i]; printf("final result: %f\n", sum/n); // Release device memory cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); // Release host memory free(h_a); free(h_b); free(h_c); return 0; }
Changes
Kernel:
The kernel is the heart of our CUDA code. When a kernel is launched, the number of threads per block (blockDim) and the number of blocks per grid (gridDim) are specified. The total number of threads is (blockDim) * (gridDim). Each thread evaluates one copy of the kernel.
// CUDA kernel. Each thread takes care of one element of c __global__ void vecAdd(double *a, double *b, double *c, int n) { // Get our global thread ID int id = blockIdx.x*blockDim.x+threadIdx.x; // Make sure we do not go out of bounds if (id < n) c[id] = a[id] + b[id]; }
Let’s take a look at what makes up this simple kernel.
__global__ void vecAdd(double *a, double *b, double *c, int n)
The __global__ decorator specifies this is a CUDA kernel, otherwise normal C function syntax is used. The kernel must have return type void.
int id = blockIdx.x*blockDim.x+threadIdx.x;
We calculate the thread's global ID using CUDA supplied structs. blockIdx contains the block's position in the grid, ranging from 0 to gridDim-1. threadIdx is the thread's index inside of its associated block, ranging from 0 to blockDim-1. For convenience blocks and grids can be multidimensional; the associated structs contain x, y, and z members.
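Although this example only uses the x dimension, a two-dimensional kernel would combine both sets of members. The sketch below is illustrative only; width, height, in, and out are hypothetical names and are not part of this example:
// Hypothetical 2D indexing: each thread maps to one (row, col) element
int row = blockIdx.y*blockDim.y + threadIdx.y;
int col = blockIdx.x*blockDim.x + threadIdx.x;
if (row < height && col < width)
    out[row*width + col] = in[row*width + col];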
if (id < n)
Unless the array length is evenly divisible by our blockDim, the number of threads launched will not match the number of valid array elements. To avoid overstepping our array, we simply test that our global thread ID is less than the length of our array.
c[id] = a[id] + b[id]
The thread ID is used to index the arrays that reside in global device memory. Each thread will load a value from a and b and write the sum to c.
Array Pointers:
// Host input vectors double *h_a; double *h_b; // Host output vector double *h_c; // Device input vectors double *d_a; double *d_b; // Device output vector double *d_c;
With the host CPU and GPU having separate memory spaces we must maintain two sets of pointers, one set for our host arrays and one set for our device arrays. Here we use the h_ and d_ prefix to differentiate them.
cudaMalloc:
// Allocate memory for each vector on GPU cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
Given a pointer, cudaMalloc will allocate memory in the device's global memory space and set the passed pointer to point to this memory.
cudaMemcpy to device:
// Copy host vectors to device cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice); cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);
After we initialize the data on the host we must copy it over to the device; to do so we use cudaMemcpy. Calling cudaMemcpy initiates a DMA transfer between the host and the device.
Thread mapping:
int blockSize, gridSize; // Number of threads in each thread block blockSize = 1024; // Number of thread blocks in grid gridSize = (int)ceil((float)n/blockSize);
To launch our kernel we must specify the number of threads per block and the number of blocks in our grid. The maximum value of both is dependent on the device's compute capability. Normally the blockSize is chosen based upon kernel memory requirements, and the gridSize is calculated such that enough threads are launched to cover the kernel domain.
Launch kernel:
// Execute the kernel vecAdd<<< gridSize, blockSize >>>(d_a, d_b, d_c, n);
We launch our kernel with a modified C function syntax that lets us specify the grid and block sizes. Kernel calls are non-blocking.
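Because the launch is non-blocking, errors from the launch itself are not reported at the call site. A minimal sketch of checking for them, using the standard cudaGetLastError and cudaDeviceSynchronize runtime calls (a similar check appears later in the Game of Life shared memory example), is:
// Check for kernel launch errors and wait for the kernel to finish
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("Kernel launch failed: %s\n", cudaGetErrorString(err));
cudaDeviceSynchronize();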
cudaMemcpy to host:
// Copy array back to host cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );
cudaMemcpy will block until the kernel is complete and then copy the results from the device back to the host.
Release device memory:
// Release device memory cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
As with C's malloc, we need to release our device arrays when we are finished; to do so we use cudaFree.
Compiling
The CUDA environment must be set up, so before compiling make sure the cudatoolkit module is loaded:
$ module load cudatoolkit $ nvcc vecAdd.cu -o vecAdd.out
Running
$ aprun ./vecAdd.out final result: 1.000000
In this tutorial we will cover three different CUDA versions of our GOL program, each highlighting a different CUDA memory area. The first strictly uses global memory and is a straightforward GPU port of our program, in the second we introduce shared memory, and in the last we cover the use of texture memory. A description of the Game of Life and the original CPU code can be found here. The full source can be viewed or downloaded at the OLCF GitHub. Please direct any questions or comments to help@nccs.gov
GOL-CUDA.cu
#include <stdio.h> #include <stdlib.h> #define SRAND_VALUE 1985 #define BLOCK_SIZE 16 __global__ void ghostRows(int dim, int *grid) { // We want id ∈ [1,dim] int id = blockDim.x * blockIdx.x + threadIdx.x + 1; if (id <= dim) { //Copy first real row to bottom ghost row grid[(dim+2)*(dim+1)+id] = grid[(dim+2)+id]; //Copy last real row to top ghost row grid[id] = grid[(dim+2)*dim + id]; } } __global__ void ghostCols(int dim, int *grid) { // We want id ∈ [0,dim+1] int id = blockDim.x * blockIdx.x + threadIdx.x; if (id <= dim+1) { //Copy first real column to right most ghost column grid[id*(dim+2)+dim+1] = grid[id*(dim+2)+1]; //Copy last real column to left most ghost column grid[id*(dim+2)] = grid[id*(dim+2) + dim]; } } __global__ void GOL(int dim, int *grid, int *newGrid) { // We want id ∈ [1,dim] int iy = blockDim.y * blockIdx.y + threadIdx.y + 1; int ix = blockDim.x * blockIdx.x + threadIdx.x + 1; int id = iy * (dim+2) + ix; int numNeighbors; if (iy <= dim && ix <= dim) { // Get the number of neighbors for a given grid point numNeighbors = grid[id+(dim+2)] + grid[id-(dim+2)] //upper lower + grid[id+1] + grid[id-1] //right left + grid[id+(dim+3)] + grid[id-(dim+3)] //diagonals + grid[id-(dim+1)] + grid[id+(dim+1)]; int cell = grid[id]; // Here we have explicitly all of the game rules if (cell == 1 && numNeighbors < 2) newGrid[id] = 0; else if (cell == 1 && (numNeighbors == 2 || numNeighbors == 3)) newGrid[id] = 1; else if (cell == 1 && numNeighbors > 3) newGrid[id] = 0; else if (cell == 0 && numNeighbors == 3) newGrid[id] = 1; else newGrid[id] = cell; } } int main(int argc, char* argv[]) { int i,j,iter; int* h_grid; //Grid on host int* d_grid; //Grid on device int* d_newGrid; //Second grid used on device only int* d_tmpGrid; //tmp grid pointer used to switch between grid and newGrid int dim = 1024; //Linear dimension of our grid - not counting ghost cells int maxIter = 1<<10; //Number of game steps size_t bytes = sizeof(int)*(dim+2)*(dim+2);//2 added for periodic boundary condition ghost cells // Allocate host Grid used for initial setup and read back from device h_grid = (int*)malloc(bytes); // Allocate device grids cudaMalloc(&d_grid, bytes); cudaMalloc(&d_newGrid, bytes); // Assign initial population randomly srand(SRAND_VALUE); for(i = 1; i<=dim; i++) { for(j = 1; j<=dim; j++) { h_grid[i*(dim+2)+j] = rand() % 2; } } // Copy over initial game grid (Dim-1 threads) cudaMemcpy(d_grid, h_grid, bytes, cudaMemcpyHostToDevice); dim3 blockSize(BLOCK_SIZE, BLOCK_SIZE,1); int linGrid = (int)ceil(dim/(float)BLOCK_SIZE); dim3 gridSize(linGrid,linGrid,1); dim3 cpyBlockSize(BLOCK_SIZE,1,1); dim3 cpyGridRowsGridSize((int)ceil(dim/(float)cpyBlockSize.x),1,1); dim3 cpyGridColsGridSize((int)ceil((dim+2)/(float)cpyBlockSize.x),1,1); // Main game loop for (iter = 0; iter<maxIter; iter++) { ghostRows<<<cpyGridRowsGridSize, cpyBlockSize>>>(dim, d_grid); ghostCols<<<cpyGridColsGridSize, cpyBlockSize>>>(dim, d_grid); GOL<<<gridSize, blockSize>>>(dim, d_grid, d_newGrid); // Swap our grids and iterate again d_tmpGrid = d_grid; d_grid = d_newGrid; d_newGrid = d_tmpGrid; }//iter loop // Copy back results and sum cudaMemcpy(h_grid, d_grid, bytes, cudaMemcpyDeviceToHost); // Sum up alive cells and print results int total = 0; for (i = 1; i<=dim; i++) { for (j = 1; j<=dim; j++) { total += h_grid[i*(dim+2)+j]; } } printf("Total Alive: %d\n", total); // Release memory cudaFree(d_grid); cudaFree(d_newGrid); free(h_grid); return 0; }
Changes
#define BLOCK_SIZE 16
The number of threads per block is a common performance-tuning parameter, and it is often set with a preprocessor define.
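One common variation, shown below only as a sketch and not part of this example, is to guard the define so the value can be overridden at compile time:
// Allow the block size to be overridden with nvcc -DBLOCK_SIZE=N
#ifndef BLOCK_SIZE
#define BLOCK_SIZE 16
#endif
With this guard in place the block size could be changed without editing the source, for example: $ nvcc -DBLOCK_SIZE=32 GOL-CUDA.cu -o GOL-CUDA.out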
__global__ void ghostRows(int dim, int *grid) { // We want id ∈ [1,dim] int id = blockDim.x * blockIdx.x + threadIdx.x + 1; if (id <= dim) { //Copy first real row to bottom ghost row grid[(dim+2)*(dim+1)+id] = grid[(dim+2)+id]; //Copy last real row to top ghost row grid[id] = grid[(dim+2)*dim + id]; } }
__global__ void ghostCols(int dim, int *grid) { // We want id ∈ [0,dim+1] int id = blockDim.x * blockIdx.x + threadIdx.x; if (id <= dim+1) { //Copy first real column to right most ghost column grid[id*(dim+2)+dim+1] = grid[id*(dim+2)+1]; //Copy last real column to left most ghost column grid[id*(dim+2)] = grid[id*(dim+2) + dim]; } }
The ghostRows and ghostCols kernels are used to fill in our ghost rows and columns in device global memory. Each thread is responsible for copying two integers.
__global__ void GOL(int dim, int *grid, int *newGrid) { // We want id ∈ [1,dim] int iy = blockDim.y * blockIdx.y + threadIdx.y + 1; int ix = blockDim.x * blockIdx.x + threadIdx.x + 1; int id = iy * (dim+2) + ix; int numNeighbors; if (iy <= dim && ix <= dim) { // Get the number of neighbors for a given grid point numNeighbors = grid[id+(dim+2)] + grid[id-(dim+2)] //upper lower + grid[id+1] + grid[id-1] //right left + grid[id+(dim+3)] + grid[id-(dim+3)] //diagonals + grid[id-(dim+1)] + grid[id+(dim+1)]; int cell = grid[id]; // Here we have explicitly all of the game rules if (cell == 1 && numNeighbors < 2) newGrid[id] = 0; else if (cell == 1 && (numNeighbors == 2 || numNeighbors == 3)) newGrid[id] = 1; else if (cell == 1 && numNeighbors > 3) newGrid[id] = 0; else if (cell == 0 && numNeighbors == 3) newGrid[id] = 1; else newGrid[id] = cell; } }
The GOL kernel will perform the majority of our work. We use a two-dimensional block and grid size to better fit our problem's geometry. Once a thread calculates its global ID and tests that it is not out of bounds, it will read one cell value from global memory, calculate the new value, and write the result back to global memory.
int* d_grid; //Grid on device int* d_newGrid; //Second grid used on device only int* d_tmpGrid; //tmp grid pointer used to switch between grid and newGrid
We declare our device pointers on the host as we normally would; we prefix each pointer name with d_ to help separate them from the host pointers, which we prefix with h_.
// Allocate device grids cudaMalloc(&d_grid, bytes); cudaMalloc(&d_newGrid, bytes);
Both of our device arrays are allocated in device global memory. A one-dimensional array is used.
// Copy over initial game grid (Dim-1 threads) cudaMemcpy(d_grid, h_grid, bytes, cudaMemcpyHostToDevice);
After we have filled h_grid on the host we must copy it over to d_grid which resides in the device global memory.
dim3 blockSize(BLOCK_SIZE, BLOCK_SIZE,1);
dim3 cpyBlockSize(BLOCK_SIZE,1,1);
The CUDA data type dim3 is used to define the number of threads in our block. For the GOL kernel we specify a two-dimensional block size to better suit our problem's geometry; for copying, a simple one-dimensional size is best. Note that any missing dimensions in the constructor are assumed to be 1.
int linGrid = (int)ceil(dim/(float)BLOCK_SIZE); dim3 gridSize(linGrid,linGrid,1);
dim3 cpyGridRowsGridSize((int)ceil(dim/(float)cpyBlockSize.x),1,1); dim3 cpyGridColsGridSize((int)ceil((dim+2)/(float)cpyBlockSize.x),1,1);
With our block size specified we must calculate the smallest integer number of blocks necessary to completely cover our game board; this is also of type dim3.
ghostRows<<<cpyGridRowsGridSize, cpyBlockSize>>>(dim, d_grid); ghostCols<<<cpyGridColsGridSize, cpyBlockSize>>>(dim, d_grid); GOL<<<gridSize, blockSize>>>(dim, d_grid, d_newGrid);
Each iteration of the game loop we launch our three kernels, first copying the ghost cells and then performing one game iteration.
// Swap our grids and iterate again d_tmpGrid = d_grid; d_grid = d_newGrid; d_newGrid = d_tmpGrid;
The grid and newGrid device pointers can be switched just as we did with the host pointers in the CPU version.
// Copy back results and sum cudaMemcpy(h_grid, d_grid, bytes, cudaMemcpyDeviceToHost);
Once our main game loop is complete we must copy the grid from device memory to host memory.
// Release memory cudaFree(d_grid); cudaFree(d_newGrid);
Having used cudaMalloc to allocate our device memory we release it with cudaFree.
Compiling
$ module load cudatoolkit $ nvcc GOL-CUDA.cu -o GOL-CUDA.out
GOL-CUDA-Shared.cu
In the first example each thread read nine values from global memory and wrote back one value. The inefficiency of this approach is that each value read from global memory is used not only in calculating its own new value but also by its eight closest neighbors in calculating their new values. Although some global memory access penalties have been alleviated in the latest generation of cards with the addition of an L2 cache, you will find shared memory is still an incredibly important aspect of GPU programming. Shared memory provides a very fast memory space (comparable to register access) that is shared between threads of the same thread block. In cases where multiple threads in a particular thread block access the same global memory item, shared memory may be used to reduce the number of global memory calls. A typical use for shared memory is as follows:
- Have each thread read from global memory into shared memory
- Synchronize threads to ensure all threads in a given block have finished reading from global memory
- Have each thread perform calculations using shared memory values
- Have each thread write result from shared memory back to device global memory
For detailed coverage of shared memory please see the CUDA C Programming guide.
#include <stdio.h> #include <stdlib.h> #define SRAND_VALUE 1985 #define BLOCK_SIZE_x 32 #define BLOCK_SIZE_y 16 __global__ void ghostRows(int dim, int* grid) { // We want id ∈ [1,dim] int id = blockDim.x * blockIdx.x + threadIdx.x + 1; if (id <= dim) { //Copy first real row to bottom ghost row grid[(dim+2)*(dim+1)+id] = grid[(dim+2)+id]; //Copy last real row to top ghost row grid[id] = grid[(dim+2)*dim + id]; } } __global__ void ghostCols(int dim, int* grid) { // We want id ∈ [0,dim+1] int id = blockDim.x * blockIdx.x + threadIdx.x; if (id <= dim+1) { //Copy first real column to right most ghost column grid[id*(dim+2)+dim+1] = grid[id*(dim+2)+1]; //Copy last real column to left most ghost column grid[id*(dim+2)] = grid[id*(dim+2) + dim]; } } __global__ void GOL(int dim, int *grid, int *newGrid) { int iy = (blockDim.y -2) * blockIdx.y + threadIdx.y; int ix = (blockDim.x -2) * blockIdx.x + threadIdx.x; int id = iy * (dim+2) + ix; int i = threadIdx.y; int j = threadIdx.x; int numNeighbors; // Declare the shared memory on a per block level __shared__ int s_grid[BLOCK_SIZE_y][BLOCK_SIZE_x]; // Copy cells into shared memory if (ix <= dim+1 && iy <= dim+1) s_grid[i][j] = grid[id]; //Sync all threads in block __syncthreads(); if (iy <= dim && ix <= dim) { if(i != 0 && i !=blockDim.y-1 && j != 0 && j !=blockDim.x-1) { // Get the number of neighbors for a given grid point numNeighbors = s_grid[i+1][j] + s_grid[i-1][j] //upper lower + s_grid[i][j+1] + s_grid[i][j-1] //right left + s_grid[i+1][j+1] + s_grid[i-1][j-1] //diagonals + s_grid[i-1][j+1] + s_grid[i+1][j-1]; int cell = s_grid[i][j]; // Here we have explicitly all of the game rules if (cell == 1 && numNeighbors < 2) newGrid[id] = 0; else if (cell == 1 && (numNeighbors == 2 || numNeighbors == 3)) newGrid[id] = 1; else if (cell == 1 && numNeighbors > 3) newGrid[id] = 0; else if (cell == 0 && numNeighbors == 3) newGrid[id] = 1; else newGrid[id] = cell; } } } int main(int argc, char* argv[]) { int i,j,iter; int *h_grid; //Grid on host int *d_grid; //Grid on device int *d_newGrid; //Second grid used on device only int *d_tmpGrid; //tmp grid pointer used to switch between grid and newGrid int dim = 1024; //Linear dimension of our grid - not counting ghost cells int maxIter = 1<<10; //Number of game steps size_t bytes = sizeof(int)*(dim+2)*(dim+2); // Allocate host Grid used for initial setup and read back from device h_grid = (int*)malloc(bytes); // Allocate device grids cudaMalloc(&d_grid, bytes); cudaMalloc(&d_newGrid, bytes); // Assign initial population randomly srand(SRAND_VALUE); for(i = 1; i<=dim; i++) { for(j = 1; j<=dim; j++) { h_grid[i*(dim+2)+j] = rand() % 2; } } cudaFuncSetCacheConfig(GOL, cudaFuncCachePreferShared); // Copy over initial game grid (Dim-1 threads) cudaMemcpy(d_grid, h_grid, bytes, cudaMemcpyHostToDevice); dim3 blockSize(BLOCK_SIZE_x, BLOCK_SIZE_y,1); int linGrid_x = (int)ceil(dim/(float)(BLOCK_SIZE_x-2)); int linGrid_y = (int)ceil(dim/(float)(BLOCK_SIZE_y-2)); dim3 gridSize(linGrid_x,linGrid_y,1); dim3 cpyBlockSize(BLOCK_SIZE_x,1,1); dim3 cpyGridRowsGridSize((int)ceil(dim/(float)cpyBlockSize.x),1,1); dim3 cpyGridColsGridSize((int)ceil((dim+2)/(float)cpyBlockSize.x),1,1); // Main game loop for (iter = 0; iter<maxIter; iter++) { ghostRows<<<cpyGridRowsGridSize, cpyBlockSize>>>(dim, d_grid); ghostCols<<<cpyGridColsGridSize, cpyBlockSize>>>(dim, d_grid); GOL<<<gridSize, blockSize>>>(dim, d_grid, d_newGrid); // Swap our grids and iterate again d_tmpGrid = d_grid; d_grid = d_newGrid; d_newGrid = d_tmpGrid; }//iter loop 
cudaError_t error = cudaGetLastError(); if(error != cudaSuccess) printf("CUDA error %s\n",cudaGetErrorString(error)); // Copy back results and sum cudaMemcpy(h_grid, d_grid, bytes, cudaMemcpyDeviceToHost); // Sum up alive cells and print results int total = 0; for (i = 1; i<=dim; i++) { for (j = 1; j<=dim; j++) { total += h_grid[i*(dim+2)+j]; } } printf("Total Alive: %d\n", total); cudaFree(d_grid); cudaFree(d_newGrid); free(h_grid); return 0; }
Changes :
#define BLOCK_SIZE_x 32 #define BLOCK_SIZE_y 16
A square block size is not always best performance-wise; for greater flexibility we switch to separate x and y dimensions.
int iy = (blockDim.y -2) * blockIdx.y + threadIdx.y; int ix = (blockDim.x -2) * blockIdx.x + threadIdx.x; int id = iy * (dim+2) + ix;
Extra care has to be taken when determining our global thread ID for the game kernel. Each thread block will contain two ghost rows and two ghost columns, which must be accounted for when calculating the global indexes ix and iy.
int i = threadIdx.y; int j = threadIdx.x;
We will also define some convenience variables for the block thread indices.
// Declare the shared memory on a per block level __shared__ int s_grid[BLOCK_SIZE_y][BLOCK_SIZE_x];
The __shared__ qualifier is used in the device kernel to allocate shared memory arrays. The specified size is on a per block basis.
// Copy cells into shared memory if (ix <= dim+1 && iy <= dim+1) s_grid[i][j] = grid[id];
Each thread that is within the bounds of our grid, including ghost cells, will read one value from global memory into shared memory.
//Sync all threads in block __syncthreads();
Before we use any shared memory we need to ensure that each thread in the block has finished reading its value from global memory into shared memory by calling __syncthreads(); this is a thread block level barrier. Once all threads in a given block have reached the barrier, execution will continue.
if (iy <= dim && ix <= dim) { if(i != 0 && i !=blockDim.y-1 && j != 0 && j !=blockDim.x-1) {
Before calculating our new values we must check that our global thread IDs, iy and ix, do not extend into, or past, the ghost cells. Local thread IDs, i and j, must be checked so that no calculations are performed in the per-block ghost cells.
// Get the number of neighbors for a given grid point numNeighbors = s_grid[i+1][j] + s_grid[i-1][j] //upper lower + s_grid[i][j+1] + s_grid[i][j-1] //right left + s_grid[i+1][j+1] + s_grid[i-1][j-1] //diagonals + s_grid[i-1][j+1] + s_grid[i+1][j-1]; int cell = s_grid[i][j];
The standard GOL calculations are performed, only now we read all cell values from shared memory instead of global memory.
dim3 blockSize(BLOCK_SIZE_x, BLOCK_SIZE_y,1); int linGrid_x = (int)ceil(dim/(float)(BLOCK_SIZE_x-2)); int linGrid_y = (int)ceil(dim/(float)(BLOCK_SIZE_y-2)); dim3 gridSize(linGrid_x,linGrid_y,1);
Slight modification is needed to set the correct block and grid size since we are now using two independent dimensions.
GOL-CUDA-Texture.cu
The texture cache example is provided for completeness but is not fully documented. The texture cache is optimized for 2D spatial locality and does not pollute the L1 cache. Some programs may benefit from the smaller cache line compared to the L1 cache.
#include <stdio.h> #include <stdlib.h> #define SRAND_VALUE 1985 texture<int,2> gridTex; __global__ void GOL(int dim, int *newGrid) { int iy = blockDim.y * blockIdx.y + threadIdx.y; int ix = blockDim.x * blockIdx.x + threadIdx.x; int id = iy * dim + ix; int numNeighbors; float iyTex = (iy + 0.5f)/dim; float ixTex = (ix + 0.5f)/dim; float oneTex = 1.0f/dim; if(iy < dim && ix < dim) { //Get the number of neighbors for a given grid point numNeighbors = tex2D(gridTex, iyTex+oneTex, ixTex) //upper/lower + tex2D(gridTex, iyTex-oneTex, ixTex) + tex2D(gridTex, iyTex, ixTex+oneTex) //right/left + tex2D(gridTex, iyTex, ixTex-oneTex) + tex2D(gridTex, iyTex-oneTex, ixTex-oneTex) //diagonals + tex2D(gridTex, iyTex-oneTex, ixTex+oneTex) + tex2D(gridTex, iyTex+oneTex, ixTex-oneTex) + tex2D(gridTex, iyTex+oneTex, ixTex+oneTex); int cell = tex2D(gridTex, iyTex, ixTex); //Here we have explicitly all of the game rules if (cell == 1 && numNeighbors < 2) newGrid[id] = 0; else if (cell == 1 && (numNeighbors == 2 || numNeighbors == 3)) newGrid[id] = 1; else if (cell == 1 && numNeighbors > 3) newGrid[id] = 0; else if (cell == 0 && numNeighbors == 3) newGrid[id] = 1; else newGrid[id] = cell; } } int main(int argc, char* argv[]) { int i,j,iter; int* h_grid; //Grid on host cudaArray* d_grid; //Grid on device int* d_newGrid; //Second grid used on device only int dim = 1024; //Linear dimension of our grid - not counting ghost cells int maxIter = 1<<10; //Number of game steps size_t bytes = sizeof(int)*dim*dim; //Allocate host Grid used for initial setup and read back from device h_grid = (int*)malloc(bytes); //Allocate device grids cudaMallocArray(&d_grid, &gridTex.channelDesc, dim, dim); cudaMalloc(&d_newGrid, bytes); //Assign initial population randomly srand(SRAND_VALUE); for(i = 0; i<dim; i++) { for(j = 0; j<dim; j++) { h_grid[i*dim+j] = rand() % 2; } } //Copy over initial game grid (Dim-1 threads) cudaMemcpyToArray (d_grid, 0, 0, h_grid, bytes, cudaMemcpyHostToDevice); cudaBindTextureToArray(gridTex, d_grid); gridTex.normalized = true; gridTex.addressMode[0] = cudaAddressModeWrap; gridTex.addressMode[1] = cudaAddressModeWrap; dim3 dimBlock(8,8); int linGrid = (int)ceil(dim/(float)dimBlock.x); dim3 dimGrid(linGrid,linGrid); //Main game loop for (iter = 0; iter<maxIter; iter++) { GOL<<<dimGrid,dimBlock>>>(dim, d_newGrid); //Swap our grids and iterate again cudaMemcpyToArray (d_grid, 0, 0, d_newGrid, bytes, cudaMemcpyDeviceToDevice); }//iter loop //Copy back results and sum cudaMemcpy(h_grid, d_newGrid, bytes, cudaMemcpyDeviceToHost); //Sum up alive cells and print results int total = 0; for (i = 0; i<dim; i++) { for (j = 0; j<dim; j++) { total += h_grid[i*dim+j]; } } printf("Total Alive: %d\n", total); cudaFree(d_grid); cudaFree(d_newGrid); free(h_grid); return 0; }
This tutorial shows a minimal conversion from our vector addition CPU code to PGI CUDA Fortran. Consider this a CUDA Fortran ‘Hello World’. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov
vecAdd.cuf
module kernel contains ! CUDA kernel. Each thread takes care of one element of c attributes(global) subroutine vecAdd_kernel(n, a, b, c) integer, value :: n real(8), device :: a(n), b(n), c(n) integer :: id ! Get our global thread ID id = (blockidx%x-1)*blockdim%x + threadidx%x ! Make sure we do not go out of bounds if (id <= n) then c(id) = a(id) + b(id) endif end subroutine vecAdd_kernel end module kernel program main use cudafor use kernel type(dim3) :: blockSize, gridSize real(8) :: sum integer :: i ! Size of vectors integer :: n = 100000 ! Host input vectors real(8),dimension(:),allocatable :: h_a real(8),dimension(:),allocatable :: h_b !Host output vector real(8),dimension(:),allocatable :: h_c ! Device input vectors real(8),device,dimension(:),allocatable :: d_a real(8),device,dimension(:),allocatable :: d_b !Host output vector real(8),device,dimension(:),allocatable :: d_c ! Allocate memory for each vector on host allocate(h_a(n)) allocate(h_b(n)) allocate(h_c(n)) ! Allocate memory for each vector on GPU allocate(d_a(n)) allocate(d_b(n)) allocate(d_c(n)) ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2 do i=1,n h_a(i) = sin(i*1D0)*sin(i*1D0) h_b(i) = cos(i*1D0)*cos(i*1D0) enddo ! Implicit copy of host vectors to device d_a = h_a(1:n) d_b = h_b(1:n) ! Number of threads in each thread block blockSize = dim3(1024,1,1) ! Number of thread blocks in grid gridSize = dim3(ceiling(real(n)/real(blockSize%x)) ,1,1) ! Execute the kernel call vecAdd_kernel<<<gridSize, blockSize>>>(n, d_a, d_b, d_c) ! Implicit copy of device array to host h_c = d_c(1:n) ! Sum up vector c and print result divided by n, this should equal 1 within error sum = 0.0; do i=1,n sum = sum + h_c(i) enddo sum = sum/real(n) print *, 'final result: ', sum ! Release device memory deallocate(d_a) deallocate(d_b) deallocate(d_c) ! Release host memory deallocate(h_a) deallocate(h_b) deallocate(h_c) end program main
Changes
Kernel
The kernel is the heart of our CUDA code. When we launch a kernel we specify the number of threads per block (blockdim) and the number of blocks per grid (griddim). The total number of threads is (blockdim) * (griddim). Each thread evaluates one copy of the kernel.
module kernel contains !CUDA kernel. Each thread takes care of one element of c attributes(global) subroutine vecAdd_kernel(n, a, b, c) integer, value :: n real(8), device :: a(n), b(n), c(n) integer :: id !Get our global thread ID id = (blockidx%x-1)*blockdim%x + threadidx%x !Make sure we do not go out of bounds if (id <= n) then c(id) = a(id) + b(id) endif end subroutine vecAdd_kernel end module kernel
Let’s take a look at what makes up this simple kernel.
module kernel contains
end module kernel
The interface for the kernel subroutine must be explicitly known to the calling program; one way of achieving this is to place the subroutine in a module.
attributes(global) subroutine vecAdd_kernel(n, a, b, c)
The global attribute specifies this is a CUDA kernel, otherwise normal Fortran subroutine syntax is used.
!Get our global thread ID id = (blockidx%x-1)*blockdim%x + threadidx%x
We calculate the thread's global ID using CUDA Fortran provided derived data types. blockidx contains the block's position in the grid, ranging from 1 to griddim. threadidx is the thread's index inside of its associated block, ranging from 1 to blockdim.
if (id <= n) then
Unless the array length is evenly divisible by our blockdim, the number of threads launched will not match the number of valid array elements. To avoid overstepping our array, we simply test that our global thread ID is less than the length of our array.
c(id) = a(id) + b(id)
The thread ID is used to index the arrays that reside in global device memory. Each thread will load a value from a and b and write the sum to c.
Modules
use cudafor use kernel
cudafor is a PGI-provided module that supplies some of the CUDA specific derived types, such as dim3. The kernel module we created must also be included.
Array Pointers
!Host input vectors real(8),dimension(:),allocatable :: h_a real(8),dimension(:),allocatable :: h_b !Host output vector real(8),dimension(:),allocatable :: h_c !Device input vectors real(8),device,dimension(:),allocatable :: d_a real(8),device,dimension(:),allocatable :: d_b !Host output vector real(8),device,dimension(:),allocatable :: d_c
With the host CPU and GPU having separate memory spaces we must have two sets of pointers, one set for our host arrays and one set for our device arrays. Here we use the h_ and d_ prefix to differentiate them. Notice that the GPU pointers have the device attribute.
Allocate device memory
!Allocate memory for each vector on GPU allocate(d_a(n)) allocate(d_b(n)) allocate(d_c(n))
Given a pointer with the device attribute, memory will be allocated in the GPU's global memory space.
Copy to device
!Implicit copy of host vectors to device d_a = h_a(1:n) d_b = h_b(1:n)
After we initialize the data on the host we must copy it over to the device; to do so we can use the syntax above. This is a convenient way of doing things, but care must be taken: this operation can be very time consuming, as we are initiating a DMA transfer of data from the host memory, over the PCI bus, to the device memory.
Thread mapping
!Number of threads in each thread block blockSize = dim3(1024,1,1) !number of thread blocks in grid gridSize = dim3(ceiling(real(n)/real(blockSize%x)) ,1,1)
To launch our kernel we must specify the number of threads per block and the number of blocks in our grid. The maximum value of both is dependent on the device's compute capability. Normally the block size is chosen based upon kernel memory requirements, and the grid size is calculated such that enough threads are launched to cover the kernel domain. dim3 is a CUDA Fortran provided data type that has 3 dimensions; in this case we are dealing with a one-dimensional block and grid, so we specify a dimensionality of 1 for the other two dimensions.
Launch kernel
call vecAdd_kernel<<< gridSize, blockSize >>>(n, d_a, d_b, d_c)
We launch our kernel with a modified Fortran subroutine syntax that lets us specify the grid and block sizes. Kernel calls are non-blocking.
Copy to host
!Implicit copy of device array to host h_c = d_c(1:n)
The implicit copy from the device to the host will block until the kernel is completed.
Release device memory
deallocate(d_a) deallocate(d_b) deallocate(d_c)
Device memory is deallocated in the same programmatic manner as host allocated arrays.
Compiling
$ module load cudatoolkit $ ftn vecAdd.cuf -o vecAdd.out
Running
$ aprun ./vecAdd.out final result: 1.000000
For this tutorial we will cover two different CUDA Fortran versions of our Game of Life program. The first strictly uses global memory and is a straightforward GPU port of our program; the second introduces the use of shared memory. For a description of the Game of Life and the original CPU code please see the tutorial here. The full source can be viewed or downloaded at the OLCF GitHub. Please direct any questions or comments to help@nccs.gov
GOL.cuf
module kernels contains attributes(global) subroutine ghostCols_kernel(dim, grid) integer,value :: dim integer,device,dimension(dim+2,dim+2) :: grid integer :: id !We want id ∈ [2,dim+1] !Get Global thread ID id = (blockIdx%x-1) * blockDim%x + threadIdx%x + 1 if(id < dim+2) then grid(id, dim+2) = grid(id, 2) !Copy first real column to right most ghost column grid(id, 1) = grid(id, dim+1) !Copy last real column to left most ghost column endif end subroutine ghostCols_kernel attributes(global) subroutine ghostRows_kernel(dim, grid) integer,value :: dim integer,device,dimension(dim+2,dim+2) :: grid integer :: id !We want id ∈ [1,dim+2] !Get Global thread ID id = (blockIdx%x-1) * blockDim%x + threadIdx%x if(id <= dim+2) then grid(dim+2, id) = grid(2, id) !Copy first real row to bottom ghost row grid(1, id) = grid(dim+1 ,id) !Copy last real row to top ghost row endif end subroutine ghostRows_kernel attributes(global) subroutine GOL_kernel(dim, grid, newGrid) integer,value :: dim integer,device,dimension(dim+2,dim+2) :: grid, newGrid integer :: idx, idy, id, numNeighbors, cell !Get global thread IDs idx = (blockIdx%x-1) * blockDim%x + threadIdx%x + 1 idy = (blockIdx%y-1) * blockDim%y + threadIdx%y + 1 if(idx < dim+2 .AND. idy < dim+2) then numNeighbors = grid(idy,idx+1) + grid(idy,idx-1) & !right & left + grid(idy+1,idx) + grid(idy-1,idx) & !upper & lower + grid(idy+1,idx+1) + grid(idy-1,idx-1) & !diagonals + grid(idy-1,idx+1) + grid(idy+1,idx-1) cell = grid(idy,idx) !Here we have explicitly all of the game rules if(cell == 1 .AND. numNeighbors < 2) then newGrid(idy,idx) = 0 elseif(cell == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then newGrid(idy,idx) = 1 elseif(cell == 1 .AND. numNeighbors > 3) then newGrid(idy,idx) = 0 elseif(cell == 0 .AND. 
numNeighbors == 3) then newGrid(idy,idx) = 1 else newGrid(idy,idx) = cell endif endif end subroutine GOL_kernel end module kernels program main use cudafor use kernels implicit none integer :: i,j,iter,seed(8),numNeighbors,total real :: randm !Linear game grid dimension integer :: dim = 1024 !Number of game iterations integer :: maxIter = 2**10 integer :: blockDim = 16 type(dim3) :: cpyBlockSize, cpyColsGridSize, cpyRowsGridSize, blockSize, gridSize !Host game grid array integer,dimension(:,:),allocatable :: h_grid !Device game grid arrays integer,device,dimension(:,:),allocatable :: d_grid, d_newGrid !Allocate square grid of (dim+2)^2 elements on host, 2 added for ghost cells allocate(h_grid(dim+2,dim+2)) !Allocate grid on device allocate(d_grid(dim+2,dim+2)) allocate(d_newGrid(dim+2,dim+2)) !Assign initial population randomly seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/) call random_seed(PUT=seed) do j=1,dim do i=1,dim call random_number(randm) h_grid(i,j) = nint(randm) enddo enddo !Implicit copy of host vectors to device d_grid = h_grid !Set the number of threads per block and blocks per grid cpyBlockSize = dim3(blockDim, 1, 1) cpyColsGridSize = dim3(ceiling(real(dim)/real(cpyBlockSize%x)), 1, 1) cpyRowsGridSize = dim3(ceiling(real(dim+2)/real(cpyBlockSize%x)), 1, 1) blockSize = dim3(blockDim, blockDim, 1) gridSize = dim3(ceiling(real(dim)/real(blockSize%x)), & ceiling(real(dim)/real(blockSize%y)), 1) !Main game loop do iter=1,maxIter !device pointers are not supported currently so manually switch grid/newGrid if(MOD(iter,2) == 1) then call ghostCols_kernel<<<cpyColsGridSize, cpyBlockSize>>>(dim, d_grid) call ghostRows_kernel<<<cpyRowsGridSize, cpyBlockSize>>>(dim, d_grid) call GOL_kernel<<<gridSize,blockSize>>>(dim, d_grid, d_newGrid) else call ghostCols_kernel<<<cpyColsGridSize, cpyBlockSize>>>(dim, d_newGrid) call ghostRows_kernel<<<cpyRowsGridSize, cpyBlockSize>>>(dim, d_newGrid) call GOL_kernel<<<gridSize,blockSize>>>(dim, d_newGrid, d_grid) endif enddo !Implicit copy of device vector back to host h_grid = d_grid !Sum up alive cells and print results total = 0 do j=2,dim+1 do i=2,dim+1 total = total + h_grid(i,j) enddo enddo print *, "Total Alive", total !Release memory deallocate(h_grid) deallocate(d_grid) deallocate(d_newGrid) end program
Changes to GOL.cuf
attributes(global) subroutine ghostCols_kernel(dim, grid) integer,value :: dim integer,device,dimension(dim+2,dim+2) :: grid integer :: id !We want id ∈ [2,dim+1] !Get Global thread ID id = (blockIdx%x-1) * blockDim%x + threadIdx%x + 1 if(id < dim+2) then grid(id, dim+2) = grid(id, 2) !Copy first real column to right most ghost column grid(id, 1) = grid(id, dim+1) !Copy last real column to left most ghost column endif end subroutine ghostCols_kernel attributes(global) subroutine ghostRows_kernel(dim, grid) integer,value :: dim integer,device,dimension(dim+2,dim+2) :: grid integer :: id !We want id ∈ [1,dim+2] !Get Global thread ID id = (blockIdx%x-1) * blockDim%x + threadIdx%x if(id <= dim+2) then grid(dim+2, id) = grid(2, id) !Copy first real row to bottom ghost row grid(1, id) = grid(dim+1 ,id) !Copy last real row to top ghost row endif end subroutine ghostRows_kernel
The ghostRows_kernel and ghostCols_kernel device subroutines are used to fill in our ghost rows and columns in device global memory. Each thread is responsible for copying two integers.
attributes(global) subroutine GOL_kernel(dim, grid, newGrid) integer,value :: dim integer,device,dimension(dim+2,dim+2) :: grid, newGrid integer :: idx, idy, id, numNeighbors, cell !Get global thread IDs idx = (blockIdx%x-1) * blockDim%x + threadIdx%x + 1 idy = (blockIdx%y-1) * blockDim%y + threadIdx%y + 1 if(idx < dim+2 .AND. idy < dim+2) then numNeighbors = grid(idy,idx+1) + grid(idy,idx-1) & !right & left + grid(idy+1,idx) + grid(idy-1,idx) & !upper & lower + grid(idy+1,idx+1) + grid(idy-1,idx-1) & !diagonals + grid(idy-1,idx+1) + grid(idy+1,idx-1) cell = grid(idy,idx) !Here we have explicitly all of the game rules if(cell == 1 .AND. numNeighbors < 2) then newGrid(idy,idx) = 0 elseif(cell == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then newGrid(idy,idx) = 1 elseif(cell == 1 .AND. numNeighbors > 3) then newGrid(idy,idx) = 0 elseif(cell == 0 .AND. numNeighbors == 3) then newGrid(idy,idx) = 1 else newGrid(idy,idx) = cell endif endif end subroutine GOL_kernel
The GOL kernel will perform the majority of our work. We use a two-dimensional block and grid size to better fit our problem's geometry. Once a thread calculates its global ID and tests that it is not out of bounds, it will read one cell value from global memory, calculate the new value, and write the result back to global memory.
!Host game grid array integer,dimension(:,:),allocatable :: h_grid !Device game grid arrays integer,device,dimension(:,:),allocatable :: d_grid, d_newGrid
Currently CUDA Fortran does not support device pointers and so we change our array attribute from pointer to allocatable.
!Allocate grid on device allocate(d_grid(dim+2,dim+2)) allocate(d_newGrid(dim+2,dim+2))
Given an array with the device attribute, memory will be allocated in the GPU's global memory space.
!Implicit copy of host vectors to device d_grid = h_grid
By setting the device array equal to a host array a host to device DMA copy is implicitly initiated.
!Set the number of threads per block and blocks per grid cpyBlockSize = dim3(blockDim, 1, 1) cpyColsGridSize = dim3(ceiling(real(dim)/real(cpyBlockSize%x)), 1, 1) cpyRowsGridSize = dim3(ceiling(real(dim+2)/real(cpyBlockSize%x)), 1, 1) blockSize = dim3(blockDim, blockDim, 1) gridSize = dim3(ceiling(real(dim)/real(blockSize%x)), & ceiling(real(dim)/real(blockSize%y)), 1)
With our block size specified we must calculate the smallest integer number of blocks necessary to completely cover our game board, this is also of the CUDA Fortran data type dim3.
!Main game loop do iter=1,maxIter !device pointers are not supported currently so manually switch grid/newGrid if(MOD(iter,2) == 1) then call ghostCols_kernel<<<cpyColsGridSize, cpyBlockSize>>>(dim, d_grid) call ghostRows_kernel<<<cpyRowsGridSize, cpyBlockSize>>>(dim, d_grid) call GOL_kernel<<<gridSize,blockSize>>>(dim, d_grid, d_newGrid) else call ghostCols_kernel<<<cpyColsGridSize, cpyBlockSize>>>(dim, d_newGrid) call ghostRows_kernel<<<cpyRowsGridSize, cpyBlockSize>>>(dim, d_newGrid) call GOL_kernel<<<gridSize,blockSize>>>(dim, d_newGrid, d_grid) endif enddo
Although we cannot simply switch our device grid pointers at the end of each game iteration, because CUDA Fortran does not support device pointers, we may switch the arguments we call our kernels with. This achieves the same effect with a minimal increase in code size.
!Implicit copy of device vector back to host h_grid = d_grid
After our kernel has completed we can implicitly copy the grid array from the device back to the host.
GOL-Shared.cuf
In the first example each thread read nine values from global memory and wrote back one value. The inefficiency of this approach is that each value read from global memory is used not only in calculating its own new value but also by its eight closest neighbors in calculating their new values. Although some global memory access penalties have been alleviated in the latest generation of cards with the addition of an L2 cache, you will find shared memory is still an incredibly important aspect of GPU programming. Shared memory provides a very fast memory space (comparable to register access) that is shared between threads of the same thread block. In cases where multiple threads in a particular thread block access the same global memory item, shared memory may be used to reduce the number of global memory calls. A typical use for shared memory is as follows:
- Have each thread read from global memory into shared memory
- Synchronize threads to ensure all threads in a given block have finished reading from global memory
- Have each thread perform calculations using shared memory values
- Have each thread write result from shared memory back to device global memory
For detailed coverage of shared memory please see the CUDA C Programming Guide and CUDA Fortran Programming Guide.
module kernel contains attributes(global) subroutine ghostCols_kernel(dim, grid) integer,value :: dim integer,device,dimension(dim+2,dim+2) :: grid integer :: id !We want id ?~H~H [2,dim+1] !Get Global thread ID id = (blockIdx%x-1) * blockDim%x + threadIdx%x + 1 if(id < dim+2) then grid(id, dim+2) = grid(id, 2) !Copy first real column to right most ghost column grid(id, 1) = grid(id, dim+1) !Copy last real column to left most ghost column endif end subroutine ghostCols_kernel attributes(global) subroutine ghostRows_kernel(dim, grid) integer,value :: dim integer,device,dimension(dim+2,dim+2) :: grid integer :: id !We want id ?~H~H [1,dim+2] !Get Global thread ID id = (blockIdx%x-1) * blockDim%x + threadIdx%x if(id <= dim+2) then grid(dim+2, id) = grid(2, id) !Copy first real row to bottom ghost row grid(1, id) = grid(dim+1 ,id) !Copy last real row to top ghost row endif end subroutine ghostRows_kernel attributes(global) subroutine GOL_kernel(dim, grid, newGrid) integer,value :: dim integer,device,dimension(dim+2,dim+2) :: grid, newGrid integer :: idx, idy, id, i, j, numNeighbors, cell !Declare the shared memory on a per block level integer,shared :: s_grid(16, 16) !Get global thread IDs idx = (blockIdx%x-1) * (blockDim%x-2) + threadIdx%x idy = (blockIdx%y-1) * (blockDim%y-2) + threadIdx%y !Get local thread IDs i = threadIdx%y j = threadIdx%x !Copy cells into shared memory if (idx <= dim+2 .AND. idy <= dim+2) then s_grid(i,j) = grid(idy,idx) endif !Sync all threads in block call syncthreads() if (idx < dim+2 .AND. idy < dim+2) then if (i /= 1 .AND. i /=blockDim%y .AND. j /=1 .AND. j /=blockDim%x) then numNeighbors = s_grid(i,j+1) + s_grid(i,j-1) & !right & left + s_grid(i+1,j) + s_grid(i-1,j) & !upper & lower + s_grid(i+1,j+1) + s_grid(i-1,j-1) & !diagonals + s_grid(i-1,j+1) + s_grid(i+1,j-1) cell = s_grid(i,j) !Here we have explicitly all of the game rules if(cell == 1 .AND. numNeighbors < 2) then newGrid(idy,idx) = 0 elseif(cell == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then newGrid(idy,idx) = 1 elseif(cell == 1 .AND. numNeighbors > 3) then newGrid(idy,idx) = 0 elseif(cell == 0 .AND. 
numNeighbors == 3) then newGrid(idy,idx) = 1 else newGrid(idy,idx) = cell endif endif endif end subroutine GOL_kernel end module kernel program main use cudafor use kernel implicit none integer :: i,j,iter,seed(8),numNeighbors,total real :: randm !Linear game grid dimension integer :: dim = 1024 !Number of game iterations integer :: maxIter = 2**10 integer :: blockDim = 16 type(dim3) :: cpyBlockSize, cpyColsGridSize, cpyRowsGridSize, blockSize, gridSize !Host game grid array integer,dimension(:,:),allocatable :: h_grid !Device game grid array integer,device,dimension(:,:),allocatable :: d_grid, d_newGrid !Allocate square grid of (dim+2)^2 elements on host, 2 added for ghost cells allocate(h_grid(dim+2,dim+2)) !Allocate grid on device allocate(d_grid(dim+2,dim+2)) allocate(d_newGrid(dim+2,dim+2)) !Assign initial population randomly seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/) call random_seed(PUT=seed) do j=1,dim do i=1,dim call random_number(randm) h_grid(i,j) = nint(randm) enddo enddo !Implicit copy of host vectors to device d_grid = h_grid !Set the number of threads per block and blocks per grid cpyBlockSize = dim3(blockDim, 1, 1) cpyColsGridSize = dim3(ceiling(real(dim)/real(cpyBlockSize%x)), 1, 1) cpyRowsGridSize = dim3(ceiling(real(dim+2)/real(cpyBlockSize%x)), 1, 1) blockSize = dim3(blockDim-2, blockDim-2, 1) gridSize = dim3(ceiling(real(dim)/real(blockSize%x-2)), & ceiling(real(dim)/real(blockSize%y-2)), 1) !Main game loop do iter=1,maxIter !device pointers are not supported currently, so manually switch grid/newGrid if(MOD(iter,2) == 1) then call ghostCols_kernel<<<cpyColsGridSize, cpyBlockSize>>>(dim, d_grid) call ghostRows_kernel<<<cpyRowsGridSize, cpyBlockSize>>>(dim, d_grid) call GOL_kernel<<<gridSize,blockSize>>>(dim, d_grid, d_newGrid) else call ghostCols_kernel<<<cpyColsGridSize, cpyBlockSize>>>(dim, d_newGrid) call ghostRows_kernel<<<cpyRowsGridSize, cpyBlockSize>>>(dim, d_newGrid) call GOL_kernel<<<gridSize,blockSize>>>(dim, d_newGrid, d_grid) endif enddo !Implicit copy of device vector back to host h_grid = d_grid !Sum up alive cells and print results total = 0 do j=2,dim+1 do i=2,dim+1 total = total + h_grid(i,j) enddo enddo print *, "Total Alive", total !Release memory deallocate(h_grid) deallocate(d_grid) deallocate(d_newGrid) end program
Changes to GOL-Shared.cuf
!Declare the shared memory on a per block level integer,shared :: s_grid(16, 16)
The shared attribute is used in the device kernel to allocate shared memory arrays. The specified size is on a per-block basis.
!Get global thread IDs idx = (blockIdx%x-1) * (blockDim%x-2) + threadIdx%x idy = (blockIdx%y-1) * (blockDim%y-2) + threadIdx%y
Extra care has to be taken when determining our global thread ID for the game kernel. Each thread block will contain two ghost rows and two ghost columns, which must be accounted for when calculating the global indexes idx and idy.
!Get local thread IDs i = threadIdx%y j = threadIdx%x
We define some convenience variables for the block thread indices.
!Copy cells into shared memory if (idx <= dim+2 .AND. idy <= dim+2) then s_grid(i,j) = grid(idy,idx) endif
Each thread that is within the bounds of our grid, including ghost cells, will read one value from global memory into shared memory.
!Sync all threads in block call syncthreads()
Before we use any shared memory we need to ensure that each thread in the block has finished reading its value from global memory into shared memory by calling syncthreads(); this is a thread block level barrier. Once all threads in a given block have reached the barrier, execution will continue.
if (idx < dim+2 .AND. idy < dim+2) then if (i /= 1 .AND. i /=blockDim%y .AND. j /=1 .AND. j /=blockDim%x) then
Before calculating our new values we must check that our global thread IDs, idx and idy, do not extend into, or past, the ghost cells. Local thread IDs, i and j, must be checked so that no calculations are performed in the per-block ghost cells.
numNeighbors = s_grid(i,j+1) + s_grid(i,j-1) & !right & left + s_grid(i+1,j) + s_grid(i-1,j) & !upper & lower + s_grid(i+1,j+1) + s_grid(i-1,j-1) & !diagonals + s_grid(i-1,j+1) + s_grid(i+1,j-1) cell = s_grid(i,j)
The standard GOL calculations are performed, only now we read all cell values from shared memory instead of global memory.
blockSize = dim3(blockDim-2, blockDim-2, 1) gridSize = dim3(ceiling(real(dim)/real(blockSize%x-2)), & ceiling(real(dim)/real(blockSize%y-2)), 1)
Slight modification is needed to set the correct block and grid size since we are now using two independent dimensions.
This tutorial shows a minimal conversion from our vector addition CPU code to an OpenCL version. Consider this an OpenCL ‘Hello World’. Error handling is not included so that the structure of the code is more digestible. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov
vecAdd.c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <CL/opencl.h>

// OpenCL kernel. Each work item takes care of one element of c
const char *kernelSource =                                       "\n" \
"#pragma OPENCL EXTENSION cl_khr_fp64 : enable                    \n" \
"__kernel void vecAdd(  __global double *a,                       \n" \
"                       __global double *b,                       \n" \
"                       __global double *c,                       \n" \
"                       const unsigned int n)                     \n" \
"{                                                                \n" \
"    //Get our global thread ID                                   \n" \
"    int id = get_global_id(0);                                   \n" \
"                                                                 \n" \
"    //Make sure we do not go out of bounds                       \n" \
"    if (id < n)                                                  \n" \
"        c[id] = a[id] + b[id];                                   \n" \
"}                                                                \n" \
                                                                  "\n" ;

int main( int argc, char* argv[] )
{
    // Length of vectors
    unsigned int n = 100000;

    // Host input vectors
    double *h_a;
    double *h_b;
    // Host output vector
    double *h_c;

    // Device input buffers
    cl_mem d_a;
    cl_mem d_b;
    // Device output buffer
    cl_mem d_c;

    cl_platform_id cpPlatform;        // OpenCL platform
    cl_device_id device_id;           // device ID
    cl_context context;               // context
    cl_command_queue queue;           // command queue
    cl_program program;               // program
    cl_kernel kernel;                 // kernel

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector on host
    h_a = (double*)malloc(bytes);
    h_b = (double*)malloc(bytes);
    h_c = (double*)malloc(bytes);

    // Initialize vectors on host
    int i;
    for( i = 0; i < n; i++ )
    {
        h_a[i] = sinf(i)*sinf(i);
        h_b[i] = cosf(i)*cosf(i);
    }

    size_t globalSize, localSize;
    cl_int err;

    // Number of work items in each local work group
    localSize = 64;

    // Number of total work items - localSize must be a divisor
    globalSize = ceil(n/(float)localSize)*localSize;

    // Bind to platform
    err = clGetPlatformIDs(1, &cpPlatform, NULL);

    // Get ID for the device
    err = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);

    // Create a context
    context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);

    // Create a command queue
    queue = clCreateCommandQueue(context, device_id, 0, &err);

    // Create the compute program from the source buffer
    program = clCreateProgramWithSource(context, 1,
                            (const char **) & kernelSource, NULL, &err);

    // Build the program executable
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    // Create the compute kernel in the program we wish to run
    kernel = clCreateKernel(program, "vecAdd", &err);

    // Create the input and output arrays in device memory for our calculation
    d_a = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
    d_b = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
    d_c = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

    // Write our data set into the input array in device memory
    err = clEnqueueWriteBuffer(queue, d_a, CL_TRUE, 0,
                                   bytes, h_a, 0, NULL, NULL);
    err |= clEnqueueWriteBuffer(queue, d_b, CL_TRUE, 0,
                                   bytes, h_b, 0, NULL, NULL);

    // Set the arguments to our compute kernel
    err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_b);
    err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_c);
    err |= clSetKernelArg(kernel, 3, sizeof(unsigned int), &n);

    // Execute the kernel over the entire range of the data set
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize,
                                                              0, NULL, NULL);

    // Wait for the command queue to get serviced before reading back results
    clFinish(queue);

    // Read the results from the device
    clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0,
                                bytes, h_c, 0, NULL, NULL );

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);

    // Release OpenCL resources
    clReleaseMemObject(d_a);
    clReleaseMemObject(d_b);
    clReleaseMemObject(d_c);
    clReleaseProgram(program);
    clReleaseKernel(kernel);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);

    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}
Changes
Kernel:
The kernel is the heart of our OpenCL code. The entire kernel must eventually be passed to the OpenCL runtime as a C string; for a small program like this the easiest way is to wrap each kernel line in quotes with an explicit line return. In a real program you would more than likely read the kernel in from a separate file (a sketch of that approach follows the kernel walkthrough below).
// OpenCL kernel. Each work item takes care of one element of c
const char *kernelSource =                                       "\n" \
"#pragma OPENCL EXTENSION cl_khr_fp64 : enable                    \n" \
"__kernel void vecAdd(  __global double *a,                       \n" \
"                       __global double *b,                       \n" \
"                       __global double *c,                       \n" \
"                       const unsigned int n)                     \n" \
"{                                                                \n" \
"    //Get our global thread ID                                   \n" \
"    int id = get_global_id(0);                                   \n" \
"                                                                 \n" \
"    //Make sure we do not go out of bounds                       \n" \
"    if (id < n)                                                  \n" \
"        c[id] = a[id] + b[id];                                   \n" \
"}                                                                \n" \
                                                                  "\n" ;
Let’s take a look at what makes up this simple kernel.
__kernel void vecAdd( __global double *a, __global double *b, __global double *c, const unsigned int n)
The __kernel qualifier specifies that this is an OpenCL kernel, and the __global qualifier specifies that the pointer refers to the global device memory space; otherwise normal C function syntax is used. The kernel must have return type void.
int id = get_global_id(0);
Here we grab the global work item id for dimension 0.
if (id < n)
    c[id] = a[id] + b[id];
There must be an integer number of work groups; put another way, the number of work items in each work group must be a divisor of the global number of work items. Since the work group size is used to tune performance and will not necessarily be a divisor of the total number of threads needed, it is common to launch more threads than are needed and ignore the extras. After we check that we are inside the problem domain we can access and manipulate the device memory.
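As mentioned above, a production code would normally keep the kernel in its own file and read it into a string at run time. Below is a minimal sketch of that approach; the file name vecAdd.cl and the helper loadKernelSource are illustrative only and are not part of the example above.

// Sketch only: read kernel source from a separate file into a NUL-terminated string.
#include <stdio.h>
#include <stdlib.h>

char *loadKernelSource(const char *path)      // e.g. "vecAdd.cl" (hypothetical file name)
{
    FILE *fp = fopen(path, "rb");
    if (!fp) { perror("fopen"); return NULL; }

    fseek(fp, 0, SEEK_END);                   // find the file size
    long size = ftell(fp);
    rewind(fp);

    char *source = (char*)malloc(size + 1);
    size_t nread = fread(source, 1, size, fp);
    source[nread] = '\0';                     // clCreateProgramWithSource expects a NUL-terminated string when lengths is NULL
    fclose(fp);
    return source;
}

The returned string can then be handed to clCreateProgramWithSource in place of the hard-coded kernelSource.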
Memory:
// Host input vectors
double *h_a;
double *h_b;
// Host output vector
double *h_c;

// Device input buffers
cl_mem d_a;
cl_mem d_b;
// Device output buffer
cl_mem d_c;
Because the host CPU and the GPU have separate memory spaces, we must maintain two separate sets of references to our memory: one set of host array pointers and one set of device memory handles. Here we use the h_ and d_ prefixes to differentiate them.
Thread Mapping:
// Number of work items in each local work group
localSize = 64;

// Number of total work items - localSize must be a divisor
globalSize = ceil(n/(float)localSize)*localSize;
To map our problem onto the underlying hardware we must specify a local and a global work size. The local size defines the number of work items in a work group; on an NVIDIA GPU this is equivalent to the number of threads in a thread block. The global size is the total number of work items launched. Because localSize must be a divisor of globalSize, we calculate the smallest multiple of localSize that covers the problem domain. For example, with n = 100000 and localSize = 64, ceil(100000/64) = 1563 work groups are needed, giving globalSize = 1563 * 64 = 100032; the 32 extra work items are masked off by the bounds check in the kernel.
OpenCL Setup:
// Bind to platform
err = clGetPlatformIDs(1, &cpPlatform, NULL);
Each hardware vendor provides a different platform, which must be bound to before use. Here clGetPlatformIDs sets cpPlatform to one of the system's available platforms. For example, if a system contains an AMD CPU and an NVIDIA GPU, each with the appropriate OpenCL drivers installed, two OpenCL platforms will be available.
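On a machine that exposes more than one platform, a slightly more defensive pattern is to query the platform count first and then choose from the full list. The following is a sketch only, not part of the example above:

// Sketch: enumerate every available OpenCL platform and print its name.
cl_uint numPlatforms;
err = clGetPlatformIDs(0, NULL, &numPlatforms);                // ask only for the count

cl_platform_id *platforms = (cl_platform_id*)malloc(numPlatforms * sizeof(cl_platform_id));
err = clGetPlatformIDs(numPlatforms, platforms, NULL);         // retrieve all platform IDs

char name[128];
for (cl_uint p = 0; p < numPlatforms; p++) {
    clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof(name), name, NULL);
    printf("Platform %u: %s\n", p, name);
}
free(platforms);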
// Get ID for the device
err = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);
The platform can be queried to find what specific devices it contains. In this case we query the platform for GPU devices by using the CL_DEVICE_TYPE_GPU enumerated value.
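The same counting pattern lists every device a platform exposes. The sketch below (again, not part of the example) prints each device name using CL_DEVICE_NAME:

// Sketch: list all devices available on the chosen platform.
cl_uint numDevices;
clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_ALL, 0, NULL, &numDevices);

cl_device_id *devices = (cl_device_id*)malloc(numDevices * sizeof(cl_device_id));
clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_ALL, numDevices, devices, NULL);

char name[128];
for (cl_uint d = 0; d < numDevices; d++) {
    clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
    printf("Device %u: %s\n", d, name);
}
free(devices);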
// Create a context
context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
Before using an OpenCL device a context must be set up. The context will be used to manage command queues, memory, and kernel activity. A context can contain more than one device.
// Create a command queue
queue = clCreateCommandQueue(context, device_id, 0, &err);
The command queue is used to stream commands from the host to the specified device. Memory transfers and kernel launches can be pushed onto the command queue, where they will be executed on the device when possible.
Compile Kernel:
// Create the compute program from the source buffer
program = clCreateProgramWithSource(context, 1,
                        (const char **) & kernelSource, NULL, &err);

// Build the program executable
clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

// Create the compute kernel in the program we wish to run
kernel = clCreateKernel(program, "vecAdd", &err);
To keep OpenCL code portable across many devices, kernels are by default compiled just in time (JIT) for the device(s) in a given context. First we create the program, which is a set of kernel source code, and then from that program we create the individual kernels.
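The example ignores the value returned by clBuildProgram; when a kernel fails to JIT-compile, the compiler output can be retrieved with clGetProgramBuildInfo. A sketch of that check:

// Sketch: check the build result and print the build log on failure.
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
if (err != CL_SUCCESS) {
    size_t logSize;
    clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG,
                          0, NULL, &logSize);                // query log length

    char *log = (char*)malloc(logSize);
    clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG,
                          logSize, log, NULL);               // fetch the log text
    fprintf(stderr, "OpenCL build failed:\n%s\n", log);
    free(log);
}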
Prepare data on device:
// Create the input and output arrays in device memory for our calculation
d_a = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
d_b = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
d_c = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

// Write our data set into the input array in device memory
err = clEnqueueWriteBuffer(queue, d_a, CL_TRUE, 0,
                               bytes, h_a, 0, NULL, NULL);
err |= clEnqueueWriteBuffer(queue, d_b, CL_TRUE, 0,
                               bytes, h_b, 0, NULL, NULL);

// Set the arguments to our compute kernel
err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_b);
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_c);
err |= clSetKernelArg(kernel, 3, sizeof(unsigned int), &n);
Before launching our kernel we must create the device buffers, copy the host data into those newly created buffers, and finally set the kernel arguments.
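The example accumulates return codes in err but never inspects them; in real code each stage should be verified before moving on. A minimal sketch:

// Sketch: abort if any of the preceding OpenCL calls reported an error.
if (err != CL_SUCCESS) {
    fprintf(stderr, "OpenCL setup failed with error code %d\n", err);
    exit(EXIT_FAILURE);
}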
Launch Kernel:
// Execute the kernel over the entire range of the data set
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize,
                                                          0, NULL, NULL);
Once all of the memory resides on the device the kernel can be enqueued to be launched.
Copy results to host:
// Wait for the command queue to get serviced before reading back results
clFinish(queue);

// Read the results from the device
clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0,
                            bytes, h_c, 0, NULL, NULL );
We can block until the command queue is cleared and then read back the device results to the host.
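clFinish drains the entire queue; if only this transfer needs to be waited on, an event can be attached to the read instead. A sketch of that alternative (the default queue is in-order, so the read still waits for the kernel to complete):

// Sketch: wait on a single event rather than the whole command queue.
cl_event readDone;
clEnqueueReadBuffer(queue, d_c, CL_FALSE, 0, bytes, h_c, 0, NULL, &readDone);
clWaitForEvents(1, &readDone);
clReleaseEvent(readDone);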
Compiling
$ module load cudatoolkit
$ cc -lOpenCL vecAdd.c -o vecAdd.out
Running
$ aprun ./vecAdd.out
final result: 1.000000
vecAdd.cc
C++ bindings are commonly used for OpenCL development and offer an interface that is somewhat more streamlined than the standard C interface. An example of the vector addition code written using these bindings is given below.
#define __CL_ENABLE_EXCEPTIONS

#include "cl.hpp"
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <math.h>

// OpenCL kernel. Each work item takes care of one element of c
const char *kernelSource =                                       "\n" \
"#pragma OPENCL EXTENSION cl_khr_fp64 : enable                    \n" \
"__kernel void vecAdd(  __global double *a,                       \n" \
"                       __global double *b,                       \n" \
"                       __global double *c,                       \n" \
"                       const unsigned int n)                     \n" \
"{                                                                \n" \
"    //Get our global thread ID                                   \n" \
"    int id = get_global_id(0);                                   \n" \
"                                                                 \n" \
"    //Make sure we do not go out of bounds                       \n" \
"    if (id < n)                                                  \n" \
"        c[id] = a[id] + b[id];                                   \n" \
"}                                                                \n" \
                                                                  "\n" ;

int main(int argc, char *argv[])
{
    // Length of vectors
    unsigned int n = 1000;

    // Host input vectors
    double *h_a;
    double *h_b;
    // Host output vector
    double *h_c;

    // Device input buffers
    cl::Buffer d_a;
    cl::Buffer d_b;
    // Device output buffer
    cl::Buffer d_c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector on host
    h_a = new double[n];
    h_b = new double[n];
    h_c = new double[n];

    // Initialize vectors on host
    for(int i = 0; i < n; i++ )
    {
        h_a[i] = sinf(i)*sinf(i);
        h_b[i] = cosf(i)*cosf(i);
    }

    cl_int err = CL_SUCCESS;
    try {

        // Query platforms
        std::vector<cl::Platform> platforms;
        cl::Platform::get(&platforms);
        if (platforms.size() == 0) {
            std::cout << "Platform size 0\n";
            return -1;
        }

        // Get list of devices on default platform and create context
        cl_context_properties properties[] =
           { CL_CONTEXT_PLATFORM, (cl_context_properties)(platforms[0])(), 0};
        cl::Context context(CL_DEVICE_TYPE_GPU, properties);
        std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

        // Create command queue for first device
        cl::CommandQueue queue(context, devices[0], 0, &err);

        // Create device memory buffers
        d_a = cl::Buffer(context, CL_MEM_READ_ONLY, bytes);
        d_b = cl::Buffer(context, CL_MEM_READ_ONLY, bytes);
        d_c = cl::Buffer(context, CL_MEM_WRITE_ONLY, bytes);

        // Bind memory buffers
        queue.enqueueWriteBuffer(d_a, CL_TRUE, 0, bytes, h_a);
        queue.enqueueWriteBuffer(d_b, CL_TRUE, 0, bytes, h_b);

        // Build kernel from source string
        cl::Program::Sources source(1,
            std::make_pair(kernelSource,strlen(kernelSource)));
        cl::Program program_ = cl::Program(context, source);
        program_.build(devices);

        // Create kernel object
        cl::Kernel kernel(program_, "vecAdd", &err);

        // Bind kernel arguments to kernel
        kernel.setArg(0, d_a);
        kernel.setArg(1, d_b);
        kernel.setArg(2, d_c);
        kernel.setArg(3, n);

        // Number of work items in each local work group
        cl::NDRange localSize(64);
        // Number of total work items - localSize must be a divisor
        cl::NDRange globalSize((int)(ceil(n/(float)64)*64));

        // Enqueue kernel
        cl::Event event;
        queue.enqueueNDRangeKernel(
            kernel,
            cl::NullRange,
            globalSize,
            localSize,
            NULL,
            &event);

        // Block until kernel completion
        event.wait();

        // Read back d_c
        queue.enqueueReadBuffer(d_c, CL_TRUE, 0, bytes, h_c);
    }
    catch (cl::Error err) {
        std::cerr << "ERROR: "<<err.what()<<"("<<err.err()<<")"<<std::endl;
    }

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0;
    for(int i=0; i<n; i++)
        sum += h_c[i];
    std::cout<<"final result: "<<sum/n<<std::endl;

    // Release host memory
    delete[] h_a;
    delete[] h_b;
    delete[] h_c;

    return 0;
}
Compiling and Running, C++
To compile you will first need to download the OpenCL C++ header file cl.hpp.
$ module load cudatoolkit
$ CC vecAdd.cc -lOpenCL -o vecAdd.out
$ aprun ./vecAdd.out
final result: 1.000000