Several GPU-accelerated libraries are provided on OLCF systems. Usage of the most common accelerated libraries is outlined below.
MAGMA
The MAGMA project aims to develop a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures. For C and Fortran codes currently using LAPACK, this should be a relatively simple port that does not require CUDA knowledge.
Using MAGMA
This module is currently only compatible with the GNU programming environment:
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit magma
To link in the MAGMA library while on Titan:
$ cc -lcuda -lmagma source.c
Linking in MAGMA on Rhea is a bit different because the Titan compiler wrapper takes care of some of the extra flags. For example:
$ nvcc $MAGMA_INC $MAGMA_LIB -L$ACML_DIR/gfortran64/lib -lmagma -lcublas -lacml source.c
This also requires that a BLAS library, such as ACML, be loaded:
$ module load acml
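As a hedged sketch of what a LAPACK-to-MAGMA port can look like (the header name, routine names, and problem setup below are assumptions based on typical MAGMA CPU-interface usage, not an OLCF-provided example), the code below replaces a LAPACK dgetrf_ call with magma_dgetrf and would be linked as shown above; consult the MAGMA documentation for the exact API of the installed version.

#include <stdio.h>
#include <stdlib.h>
#include "magma.h"

int main()
{
    magma_int_t i, n = 1000, lda = 1000, info;
    double *A = (double*)malloc(n*n*sizeof(double));
    magma_int_t *ipiv = (magma_int_t*)malloc(n*sizeof(magma_int_t));

    /* Simple diagonally dominant test matrix */
    for (i = 0; i < n*n; i++) A[i] = 0.0;
    for (i = 0; i < n; i++) A[i*lda + i] = 2.0;

    magma_init();
    /* LAPACK equivalent: dgetrf_(&n, &n, A, &lda, ipiv, &info);
       MAGMA takes the same arguments (by value) and offloads the factorization to the GPU */
    magma_dgetrf(n, n, A, lda, ipiv, &info);
    magma_finalize();

    printf("magma_dgetrf returned info = %d\n", (int)info);
    free(A);
    free(ipiv);
    return 0;
}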
MAGMA Resources
For a comprehensive user manual, please see the MAGMA Documentation. A knowledgeable MAGMA User Forum is also available for personalized help. To see MAGMA in action, see the following two PGI articles that include full example code of MAGMA usage with PGI Fortran: Using MAGMA With PGI Fortran and Using GPU-enabled Math Libraries with PGI Fortran.
cuBLAS/cuSPARSE
cuBLAS and cuSPARSE are NVIDIA-provided GPU BLAS routines optimized for dense and sparse use, respectively. If your program currently uses BLAS routines, integration should be straightforward and minimal CUDA knowledge is needed. Although primarily designed for use in C/C++ code, Fortran bindings are available.
Using cuBLAS/cuSPARSE
cuBLAS and cuSPARSE are accessed through the cublas header and need to be linked against the cublas library:
$ module load cudatoolkit
$ cc -lcublas source.c
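As a minimal, hedged sketch of the porting pattern (the file contents and initialization values are illustrative, not an OLCF-provided example), the code below moves a BLAS daxpy onto the GPU with cublasDaxpy; relative to a CPU BLAS code, only the handle creation, device allocation, and host/device transfers are new.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"

int main()
{
    int i, n = 1000;
    double alpha = 2.0;
    double *h_x, *h_y, *d_x, *d_y;

    h_x = (double*)malloc(n*sizeof(double));
    h_y = (double*)malloc(n*sizeof(double));
    for (i = 0; i < n; i++) { h_x[i] = 1.0; h_y[i] = 2.0; }

    cudaMalloc((void**)&d_x, n*sizeof(double));
    cudaMalloc((void**)&d_y, n*sizeof(double));

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Copy input vectors to the device
    cublasSetVector(n, sizeof(double), h_x, 1, d_x, 1);
    cublasSetVector(n, sizeof(double), h_y, 1, d_y, 1);

    // y = alpha*x + y, computed on the GPU
    cublasDaxpy(handle, n, &alpha, d_x, 1, d_y, 1);

    // Copy the result back to the host
    cublasGetVector(n, sizeof(double), d_y, 1, h_y, 1);
    printf("y[0] = %f, expected 4.0\n", h_y[0]);

    cublasDestroy(handle);
    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_y);
    return 0;
}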
cuBLAS/cuSPARSE Resources
The cuBLAS and cuSPARSE user guides are available for download from NVIDIA; these guides provide complete function listings as well as example code. The NVIDIA SDK provides sample code and can be accessed using the instructions below. An example of using cuBLAS with PGI Fortran is available in Using GPU-enabled Math Libraries with PGI Fortran.
Running the examples:
Obtain an interactive job and load the appropriate modules:
$ qsub -I -A[projID] -lwalltime=00:30:00,nodes=1
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit nvidia-sdk
Copy the example files:
$ cd $MEMBERWORK/[projid]
$ cp -r $NVIDIA_SDK_PATH/CUDALibraries .
$ cd CUDALibraries
Now each example can be executed:
$ cd bin/linux/release
$ aprun simpleCUBLAS
LibSciACC
Cray's LibSciACC provides GPU-enabled BLAS and LAPACK routines. LibSciACC provides two interfaces: automatic and manual. The automatic interface is largely transparent to the programmer: LibSciACC determines whether a call is likely to benefit from GPU acceleration and, if so, takes care of accelerating the routine and the associated memory management. The manual interface provides an API to manage accelerator resources, giving more control to the programmer.
Using LibSciACC
It is recommended that the craype-accel-nvidia35 module be used to manage LibSciACC. The LibSciACC automatic interface is currently compatible with the Cray and GNU programming environments:
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
LibSciACC will automatically be linked in when using the Cray-provided compiler wrappers:
$ cc source.c
$ ftn source.f90
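As an illustrative sketch of the automatic interface (the dgemm_ prototype, matrix size, and initialization below are assumptions for demonstration, not an OLCF-provided example), the following code makes an ordinary BLAS call; when built with the compiler wrappers and craype-accel-nvidia35 as shown above, LibSciACC may run the call on the GPU without any source changes.

#include <stdio.h>
#include <stdlib.h>

/* Assumed prototype for the Fortran-style BLAS symbol resolved by LibSci/LibSciACC */
void dgemm_(const char *transa, const char *transb, const int *m, const int *n,
            const int *k, const double *alpha, const double *A, const int *lda,
            const double *B, const int *ldb, const double *beta, double *C,
            const int *ldc);

int main()
{
    int i;
    int n = 512;
    double alpha = 1.0, beta = 0.0;
    double *A = (double*)calloc(n*n, sizeof(double));
    double *B = (double*)calloc(n*n, sizeof(double));
    double *C = (double*)calloc(n*n, sizeof(double));

    /* Diagonal test matrices: A = I, B = 2I */
    for (i = 0; i < n; i++) {
        A[i*n + i] = 1.0;
        B[i*n + i] = 2.0;
    }

    /* LibSciACC's automatic interface decides whether this DGEMM runs on the GPU or CPU */
    dgemm_("N", "N", &n, &n, &n, &alpha, A, &n, B, &n, &beta, C, &n);

    printf("C[0] = %f, expected 2.0\n", C[0]);
    free(A); free(B); free(C);
    return 0;
}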
LibSciACC Resources
The man page intro_libsci_acc provides detailed usage information. The environment variable $LIBSCI_ACC_EXAMPLES_DIR specifies a directory containing several C and Fortran example codes.
cuFFT
cuFFT is a set of optimized GPU fast Fourier transform routines provided by NVIDIA as part of the CUDA Toolkit. The cuFFT library provides an API similar to FFTW for managing accelerated FFTs. The cuFFTW interface provides an FFTW3 interface to cuFFT to aid in porting existing applications.
Using cuFFT
The cudatoolkit module will append the include and library directories required by cuFFT. When using NVCC or the GNU programming environment, the library can then be linked in:
$ module load cudatoolkit
$ cc -lcufft source.c
$ nvcc -lcufft source.c
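The following is a minimal, hedged sketch of the cuFFT API (the transform length and zero-initialized input are illustrative only): it plans and executes a single in-place 1D complex-to-complex forward transform, and would be linked against cuFFT as shown above.

#include <cufft.h>
#include <cuda_runtime.h>

int main()
{
    int n = 1024;
    cufftComplex *d_data;

    /* Allocate device data; a real application would copy input samples here */
    cudaMalloc((void**)&d_data, n*sizeof(cufftComplex));
    cudaMemset(d_data, 0, n*sizeof(cufftComplex));

    /* Plan and execute one in-place forward FFT of length n */
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(d_data);
    return 0;
}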
cuFFT Resources
NVIDIA provides comprehensive documentation, including example code, available here. For an example of using cuFFT with Fortran through the ISO_C_BINDING interface, please see the following example. The OLCF provides an OpenACC and cuFFT interoperability tutorial.
cuRAND
cuRAND is an NVIDIA-provided random number generator library. cuRAND provides both a host-launched interface and a device-callable interface. Multiple pseudorandom and quasirandom algorithms are supported.
Using cuRAND
The cudatoolkit module will append the include and library directories required by cuRAND. When using NVCC or the GNU programming environment, the library can then be linked in:
$ module load cudatoolkit
$ module switch PrgEnv-pgi PrgEnv-gnu
$ cc -lcurand source.c
$ nvcc -lcurand source.cu
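The following is a minimal, hedged sketch of the cuRAND host API (the buffer size and seed are illustrative): a generator is created on the host and fills a device buffer with uniformly distributed doubles. It would be linked against cuRAND as shown above.

#include <curand.h>
#include <cuda_runtime.h>

int main()
{
    int n = 100000;
    double *d_rand;
    cudaMalloc((void**)&d_rand, n*sizeof(double));

    /* Create a pseudorandom generator on the host and seed it */
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

    /* Fill the device buffer with uniform random doubles */
    curandGenerateUniformDouble(gen, d_rand, n);
    cudaDeviceSynchronize();

    curandDestroyGenerator(gen);
    cudaFree(d_rand);
    return 0;
}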
cuRAND Resources
NVIDIA provides comprehensive documentation, including example code, available here. For an example of using the cuRAND host library, please see the following Accelerator Interoperability Tutorial.
Thrust
Thrust is a CUDA-accelerated C++ template library modeled after the Standard Template Library (STL). Thrust provides a high-level host interface for GPU data management as well as an assortment of accelerated algorithms. Even if your application does not currently use the STL, the easy access to the many optimized, accelerated algorithms Thrust provides is worth a look.
Using Thrust
The cudatoolkit module will append the include and library directories required by Thrust. When using NVCC or the GNU programming environment, Thrust can then be used:
$ module load cudatoolkit
$ module switch PrgEnv-pgi PrgEnv-gnu
$ CC source.cpp
$ nvcc source.cu
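The following is a minimal, hedged sketch of Thrust usage (the contents are illustrative, not an OLCF-provided example): a device vector is filled and reduced entirely on the GPU. It would be compiled with nvcc as shown above.

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/reduce.h>
#include <cstdio>

int main()
{
    int n = 1000;

    // Vector allocated in GPU memory
    thrust::device_vector<double> d_x(n);

    // Fill with 0,1,2,...,n-1 on the device
    thrust::sequence(d_x.begin(), d_x.end());

    // Parallel reduction on the GPU
    double sum = thrust::reduce(d_x.begin(), d_x.end(), 0.0);

    printf("sum = %f, expected %f\n", sum, (double)n*(n-1)/2.0);
    return 0;
}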
Thrust Resources
NVIDIA provides comprehensive documentation, including example code, available here. For an example of using Thrust, please see the following Accelerator Interoperability II Tutorial. The GitHub page provides access to the Thrust source code, examples, and information on how to obtain help.
To fully exploit the computational power of the GPU, a large amount of data parallelism must generally be expressed. In the specific case of accelerated libraries such as cuBLAS, cuFFT, and cuSPARSE, if each operation does not possess a sufficient amount of data parallelism, another option is to batch many smaller operations into a single large operation. This tutorial will demonstrate how to take advantage of batched cuBLAS operations to improve GPU utilization. Additionally, this tutorial will expand on the GPU concurrency topics from the first tutorial through the use of streams and Hyper-Q. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov
C
A Serial Example in C
The following example performs a single DGEMM operation using the cuBLAS version 2 library. This example will be expanded to show the use of batched cuBLAS calls under a variety of situations. The form of our sample DGEMM operation will be as follows:
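C = alpha*A*B + beta*C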
cublas.c
#include <stdio.h> #include <stdlib.h> #include "math.h" #include "cublas_v2.h" int main(int argc, char* argv[]) { // Linear dimension of matrices int dim = 100; // Allocate host storage for A,B,C square matrices double *A, *B, *C; A = (double*)malloc(dim*dim*sizeof(double)); B = (double*)malloc(dim*dim*sizeof(double)); C = (double*)malloc(dim*dim*sizeof(double)); // Allocate device storage for A,B,C double *d_A, *d_B, *d_C; cudaMalloc((void**)&d_A, dim*dim*sizeof(double)); cudaMalloc((void**)&d_B, dim*dim*sizeof(double)); cudaMalloc((void**)&d_C, dim*dim*sizeof(double)); // Fill A,B diagonals with sin(i) data, C diagonal with cos(i)^2 // Matrices are arranged column major int i,j,index; for(j=0; j<dim; j++) { for(i=0; i<dim; i++) { index = j*dim + i; if(i==j) { A[index] = sin(index); B[index] = sin(index); C[index] = cos(index)*cos(index); } else { A[index] = 0.0; B[index] = 0.0; C[index] = 0.0; } } } // Create cublas instance cublasHandle_t handle; cublasCreate(&handle); // Set input matrices on device cublasSetMatrix(dim, dim, sizeof(double), A, dim, d_A, dim); cublasSetMatrix(dim, dim, sizeof(double), B, dim, d_B, dim); cublasSetMatrix(dim, dim, sizeof(double), C, dim, d_C, dim); // DGEMM: C = alpha*A*B + beta*C double alpha = 1.0; double beta = 1.0; cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, dim, dim, dim, &alpha, d_A, dim, d_B, dim, &beta, d_C, dim); // Retrieve result matrix from device cublasGetMatrix(dim, dim, sizeof(double), d_C, dim, C, dim); // Simple sanity test, sum up all elements double sum = 0; for(j=0; j<dim; j++) { for(i=0; i<dim; i++) { index = j*dim + i; sum += C[index]; } } printf("Sum is: %f, should be: %d\n", sum, dim); // Clean up resources free(A); free(B); free(C); cudaFree(d_A); cudaFree(d_B); cudaFree(d_C); cublasDestroy(handle); return 0; }
Compile:
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit
$ cc -lcublas cublas.c -o cublas.out
Run:
$ module load cudatoolkit
$ aprun ./cublas.out
Library-provided Batching in C
The most efficient way to increase GPU utilization for small library operations is to use library-provided batch capabilities. In the code sample below we use cublasDgemmBatched, which takes as an argument an array of pointers to matrices and the total number of DGEMM operations to perform. By allowing the library to handle the batching, high performance can be achieved with minimal additional effort on the part of the programmer.
cublas-batch.c
#include <stdio.h> #include <stdlib.h> #include "math.h" #include "cublas_v2.h" int main(int argc, char* argv[]) { int i,j,k,index; // Linear dimension of matrices int dim = 100; // Number of A,B,C matrix sets int batch_count = 1000; // Allocate host storage for batch_count A,B,C square matrices double **A, **B, **C; A = (double**)malloc(batch_count*sizeof(double*)); B = (double**)malloc(batch_count*sizeof(double*)); C = (double**)malloc(batch_count*sizeof(double*)); for(i=0; i<batch_count; i++) { A[i] = (double*)malloc(dim*dim*sizeof(double)); B[i] = (double*)malloc(dim*dim*sizeof(double)); C[i] = (double*)malloc(dim*dim*sizeof(double)); } // Create host pointer array to device matrix storage double **d_A, **d_B, **d_C, **h_d_A, **h_d_B, **h_d_C; h_d_A = (double**)malloc(batch_count*sizeof(double*)); h_d_B = (double**)malloc(batch_count*sizeof(double*)); h_d_C = (double**)malloc(batch_count*sizeof(double*)); for(i=0; i<batch_count; i++) { cudaMalloc((void**)&h_d_A[i], dim*dim*sizeof(double)); cudaMalloc((void**)&h_d_B[i], dim*dim*sizeof(double)); cudaMalloc((void**)&h_d_C[i], dim*dim*sizeof(double)); } // Copy the host array of device pointers to the device cudaMalloc((void**)&d_A, batch_count*sizeof(double*)); cudaMalloc((void**)&d_B, batch_count*sizeof(double*)); cudaMalloc((void**)&d_C, batch_count*sizeof(double*)); cudaMemcpy(d_A, h_d_A, batch_count*sizeof(double*), cudaMemcpyHostToDevice); cudaMemcpy(d_B, h_d_B, batch_count*sizeof(double*), cudaMemcpyHostToDevice); cudaMemcpy(d_C, h_d_C, batch_count*sizeof(double*), cudaMemcpyHostToDevice); // Fill A,B diagonals with k*sin(i) data, C diagonal with k*cos(i)^2 // Matrices are arranged column major for(k=0; k<batch_count; k++) { for(j=0; j<dim; j++) { for(i=0; i<dim; i++) { index = j*dim + i; if(i==j) { (A[k])[index] = k*sin(index); (B[k])[index] = sin(index); (C[k])[index] = k*cos(index)*cos(index); } else { (A[k])[index] = 0.0; (B[k])[index] = 0.0; (C[k])[index] = 0.0; } } // i } // j } // k // Create cublas instance cublasHandle_t handle; cublasCreate(&handle); // Set input matrices on device for(i=0; i<batch_count; i++) { cublasSetMatrix(dim, dim, sizeof(double), A[i], dim, h_d_A[i], dim); cublasSetMatrix(dim, dim, sizeof(double), B[i], dim, h_d_B[i], dim); cublasSetMatrix(dim, dim, sizeof(double), C[i], dim, h_d_C[i], dim); } // Set matrix coefficients double alpha = 1.0; double beta = 1.0; // DGEMM: C = alpha*A*B + beta*C cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, dim, dim, dim, &alpha, (const double**)d_A, dim, (const double**)d_B, dim, &beta, d_C, dim, batch_count); // Retrieve result matrix from device for(i=0; i<batch_count; i++) cublasGetMatrix(dim, dim, sizeof(double), h_d_C[i], dim, C[i], dim); // Simple sanity test, sum up all elements double sum = 0; for(k=0; k<batch_count; k++) { for(j=0; j<dim; j++) { for(i=0; i<dim; i++) { index = j*dim + i; sum += (C[k])[index]; } } } printf("Element sum is: %f, should be: %d\n", sum, dim*(batch_count-1)*(batch_count)/2); // Clean up resources for(i=0; i<batch_count; i++) { free(A[i]); free(B[i]); free(C[i]); cudaFree(h_d_A[i]); cudaFree(h_d_B[i]); cudaFree(h_d_C[i]); } free(A); free(B); free(C); free(h_d_A); free(h_d_B); free(h_d_C); cudaFree(d_A); cudaFree(d_B); cudaFree(d_C); cublasDestroy(handle); return 0; }
Compile:
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit
$ cc -lcublas cublas-batch.c -o cublas-batch.out
Run:
$ module load cudatoolkit
$ aprun ./cublas-batch.out
CUDA Streams in C
Not all library functions have a batch equivalent; in this case, using CUDA streams may be appropriate. Each stream can be considered a separate work queue, allowing kernels from separate streams to be executed concurrently if hardware resources allow. In the code sample below we place each DGEMM operation in its own stream, allowing up to 32 concurrent operations through the use of Hyper-Q. The cuBLAS library provides functionality to create, set, and destroy CUDA streams.
cublas-stream.c
#include <stdio.h> #include <stdlib.h> #include "math.h" #include "cublas_v2.h" int main(int argc, char* argv[]) { int i,j,k,index; // Linear dimension of matrices int dim = 100; // Number of A,B,C matrix sets int batch_count = 1000; // Allocate host storage for batch_count A,B,C square matrices double **A, **B, **C; A = (double**)malloc(batch_count*sizeof(double*)); B = (double**)malloc(batch_count*sizeof(double*)); C = (double**)malloc(batch_count*sizeof(double*)); for(i=0; i<batch_count; i++) { A[i] = (double*)malloc(dim*dim*sizeof(double)); B[i] = (double*)malloc(dim*dim*sizeof(double)); C[i] = (double*)malloc(dim*dim*sizeof(double)); } // Allocate device storage for batch_count A,B,C double **d_A, **d_B, **d_C; d_A = (double**)malloc(batch_count*sizeof(double*)); d_B = (double**)malloc(batch_count*sizeof(double*)); d_C = (double**)malloc(batch_count*sizeof(double*)); for(i=0; i<batch_count; i++) { cudaMalloc((void**)&d_A[i], dim*dim*sizeof(double)); cudaMalloc((void**)&d_B[i], dim*dim*sizeof(double)); cudaMalloc((void**)&d_C[i], dim*dim*sizeof(double)); } // Fill A,B diagonals with k*sin(i) data, C diagonal with k*cos(i)^2 // Matrices are arranged column major for(k=0; k<batch_count; k++) { for(j=0; j<dim; j++) { for(i=0; i<dim; i++) { index = j*dim + i; if(i==j) { (A[k])[index] = k*sin(index); (B[k])[index] = sin(index); (C[k])[index] = k*cos(index)*cos(index); } else { (A[k])[index] = 0.0; (B[k])[index] = 0.0; (C[k])[index] = 0.0; } } // i } // j } // k // Create cublas instance cublasHandle_t handle; cublasCreate(&handle); // Set input matrices on device for(i=0; i<batch_count; i++) { cublasSetMatrix(dim, dim, sizeof(double), A[i], dim, d_A[i], dim); cublasSetMatrix(dim, dim, sizeof(double), B[i], dim, d_B[i], dim); cublasSetMatrix(dim, dim, sizeof(double), C[i], dim, d_C[i], dim); } // Create a stream for every DGEMM operation cudaStream_t *streams = (cudaStream_t *) malloc(batch_count*sizeof(cudaStream_t)); for(i=0; i<batch_count; i++) cudaStreamCreate(&streams[i]); // Set matrix coefficients double alpha = 1.0; double beta = 1.0; // Launch each DGEMM operation in own CUDA stream for(i=0; i<batch_count; i++){ // Set CUDA stream cublasSetStream(handle, streams[i]); // DGEMM: C = alpha*A*B + beta*C cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, dim, dim, dim, &alpha, d_A[i], dim, d_B[i], dim, &beta, d_C[i], dim); } // Retrieve result matrix from device for(i=0; i<batch_count; i++) { cublasGetMatrix(dim, dim, sizeof(double), d_C[i], dim, C[i], dim); } // Simple sanity test, sum up all elements double sum = 0; for(k=0; k<batch_count; k++) { for(j=0; j<dim; j++) { for(i=0; i<dim; i++) { index = j*dim + i; sum += (C[k])[index]; } } } printf("Element sum is: %f, should be: %d\n", sum, dim*(batch_count-1)*(batch_count)/2); // Clean up resources for(i=0; i<batch_count; i++) { free(A[i]); free(B[i]); free(C[i]); cudaFree(d_A[i]); cudaFree(d_B[i]); cudaFree(d_C[i]); cudaStreamDestroy(streams[i]); } free(A); free(B); free(C); cudaFree(d_A); cudaFree(d_B); cudaFree(d_C); cublasDestroy(handle); return 0; }
Compile:
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit
$ cc -lcublas cublas-stream.c -o cublas-stream.out
Run:
$ module load cudatoolkit
$ aprun ./cublas-stream.out
Fortran
Wrappers
NVIDIA provides a Fortran interface for the legacy cuBLAS API, although it does not provide a Fortran interface to the version 2 API, which adds batched operations. One of the changes introduced with the new API is that an opaque cuBLAS handle must be passed into most function calls; while this provides greater control over operations, it is a hindrance to interoperability. The wrapper is split into two pieces: cublas_fortran.cu and cublas_fortran_iso.f90. cublas_fortran.cu provides a lightweight C wrapper around cuBLAS calls, with most calls taking a pointer to the cuBLAS handle that can be dereferenced to make the actual call. Since the wrapper functions require only a pointer to the cuBLAS handle, we may interface them with the ISO_C_BINDING module. This ISO_C_BINDING interface is the second piece of our wrapper, cublas_fortran_iso.f90, and allows for robust interoperability between C and Fortran types.
cublas_fortran.cu
#include <stdio.h> #include "cublas_v2.h" extern "C" int f_cublasCreate(cublasHandle_t **handle) { *handle = (cublasHandle_t*)malloc(sizeof(cublasHandle_t)); return cublasCreate(*handle); } extern "C" int f_cublasDgemm(cublasHandle_t *handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const double *alpha, const double *A, int lda, const double *B, int ldb, const double *beta, double *C, int ldc) { return cublasDgemm(*handle,transa,transb,m,n,k,alpha,A,lda,B,ldb,beta,C,ldc); } extern "C" int f_cublasDgemmBatched(cublasHandle_t *handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const double *alpha, const double **A, int lda, const double **B, int ldb, const double *beta, double **C, int ldc, int batch_count) { return cublasDgemmBatched(*handle,transa,transb,m,n,k,alpha,A,lda,B,ldb,beta,C,ldc,batch_count); } extern "C" void f_cublasDestroy(cublasHandle_t *handle) { cublasDestroy(*handle); free(handle); } extern "C" int f_cudaStreamCreate(cudaStream_t **stream) { *stream = (cudaStream_t *) malloc(sizeof(cudaStream_t)); return cudaStreamCreate(*stream); } extern "C" int f_cublasSetStream(cublasHandle_t *handle, cudaStream_t *streamid) { return cublasSetStream(*handle, *streamid); } extern "C" void f_cudaStreamDestroy(cudaStream_t *stream) { cudaStreamDestroy(*stream); }
cublas_fortran_iso.f90
module cublas_f use ISO_C_BINDING enum, BIND(C) enumerator :: CUBLAS_OP_N, CUBLAS_OP_T, CUBLAS_OP_C end enum enum, BIND(C) enumerator :: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, & cudaMemcpyDeviceToDevice, cudaMemcpyDefault end enum INTERFACE integer(C_INT) function cudaMalloc(ptr, bytes) BIND(C, NAME='cudaMalloc') use ISO_C_BINDING type(C_PTR) :: ptr integer(C_SIZE_T), value :: bytes end function integer(C_INT) function cudaMemcpy(dst, src, count, kind) BIND(C, NAME='cudaMemcpy') use ISO_C_BINDING type(C_PTR), value :: dst type(C_PTR), value :: src integer(C_SIZE_T), value :: count integer(C_INT), value :: kind end function integer(C_INT) function cublasCreate(handle_ptr) BIND(C, NAME='f_cublasCreate') use ISO_C_BINDING type(C_PTR) :: handle_ptr end function subroutine cublasDestroy(handle_ptr) BIND(C, NAME='f_cublasDestroy') use ISO_C_BINDING type(C_PTR), value :: handle_ptr end subroutine subroutine cudaFree(ptr) BIND(C, NAME='cudaFree') use ISO_C_BINDING type(C_PTR), value :: ptr end subroutine integer(C_INT) function cublasSetMatrix(rows, cols, elemSize, a_ptr, & lda, b_ptr, ldb) & BIND(C, NAME='cublasSetMatrix') use ISO_C_BINDING integer(C_INT), value :: rows integer(C_INT), value :: cols integer(C_INT), value :: elemSize type(C_PTR), value :: a_ptr integer(C_INT), value :: lda type(C_PTR), value :: b_ptr integer(C_INT), value :: ldb end function integer(C_INT) function cublasGetMatrix(rows, cols, elemSize, a_ptr, & lda, b_ptr, ldb ) & BIND(C, NAME='cublasGetMatrix') use ISO_C_BINDING integer(C_INT), value :: rows integer(C_INT), value :: cols integer(C_INT), value :: elemSize type(C_PTR), value :: a_ptr integer(C_INT), value :: lda type(C_PTR), value :: b_ptr integer(C_INT), value :: ldb end function integer(C_INT) function cublasDgemm(handle, transa, transb, m, n, k, alpha, & A, lda, B, ldb, beta, C, ldc) & BIND(C, NAME='f_cublasDgemm') use ISO_C_BINDING type(C_PTR), value :: handle integer(C_INT), value :: transa integer(C_INT), value :: transb integer(C_INT), value :: m integer(C_INT), value :: n integer(C_INT), value :: k real(C_DOUBLE) :: alpha type(C_PTR), value :: A integer(C_INT), value :: lda type(C_PTR), value :: B integer(C_INT), value :: ldb real(C_DOUBLE) :: beta type(C_PTR), value :: C integer(C_INT), value :: ldc end function integer(C_INT) function cublasDgemmBatched(handle, transa, transb, m, n, k, alpha, & A, lda, B, ldb, beta, C, ldc, batch_count) & BIND(C, NAME='f_cublasDgemmBatched') use ISO_C_BINDING type(C_PTR), value :: handle integer(C_INT), value :: transa integer(C_INT), value :: transb integer(C_INT), value :: m integer(C_INT), value :: n integer(C_INT), value :: k real(C_DOUBLE) :: alpha type(C_PTR), value :: A integer(C_INT), value :: lda type(C_PTR), value :: B integer(C_INT), value :: ldb real(C_DOUBLE) :: beta type(C_PTR), value :: C integer(C_INT), value :: ldc integer(C_INT), value :: batch_count end function integer(C_INT) function cudaStreamCreate(stream_ptr) BIND(C, NAME='f_cudaStreamCreate') use ISO_C_BINDING type(C_PTR) :: stream_ptr end function integer(C_INT) function cublasSetStream(handle, stream) BIND(C, NAME='f_cublasSetStream') use ISO_C_BINDING type(C_PTR), value :: handle type(C_PTR), value :: stream end function integer(C_INT) function cudaStreamDestroy(stream) BIND(C, NAME='f_cudaStreamDestroy') use ISO_C_BINDING type(C_PTR), value :: stream end function subroutine cudaDeviceSynchronize() BIND(C, NAME='cudaDeviceSynchronize') end subroutine END INTERFACE a end module cublas_f
A Serial Example in Fortran
The following example performs a single DGEMM operation using the cuBLAS version 2 library. This example will be expanded to show the use of batched cuBLAS calls under a variety of situations. The form of our sample DGEMM operation will be as follows:
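C = alpha*A*B + beta*C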
cublas.f90
program main use ISO_C_BINDING use cublas_f implicit none integer :: dim, stat, i, j integer(8) :: bytes real(8),dimension(:,:), pointer :: A, B, C real(8) :: alpha, beta, index, sum type(C_PTR) :: d_A, d_B, d_C type(C_PTR) :: handle integer :: sizeof_double parameter (sizeof_double=8) !Linear dimension of matrices dim = 100 ! Allocate host storage for A,B,C square matrices allocate(A(dim,dim)) allocate(B(dim,dim)) allocate(C(dim,dim)) ! Allocate device storage for A,B,C bytes = int(dim*dim*sizeof_double, 8) stat = cudaMalloc(d_A, bytes) stat = cudaMalloc(d_B, bytes) stat = cudaMalloc(d_C, bytes) ! Fill A,B diagonals with sin(i) data, C diagonal with cos(i)^2 ! Matrices are arranged column major do j=1,dim do i=1,dim if (i==j) then index = real(j*dim + i) A(i,j) = sin(index) B(i,j) = sin(index) C(i,j) = cos(index) * cos(index) else A(i,j) = 0.0 B(i,j) = 0.0 C(i,j) = 0.0 endif enddo enddo ! Create cublas instance stat = cublasCreate(handle) ! Set input matrices on device stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(A(1,1)), dim, d_A, dim) stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(B(1,1)), dim, d_B, dim) stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(C(1,1)), dim, d_C, dim) ! DGEMM: C = alpha*A*B + beta*C alpha = 1.0 beta = 1.0 stat = cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, & dim, dim, dim, & alpha, & d_A, dim, & d_B, dim, & beta, & d_C, dim) ! Retrieve result matrix from device stat = cublasGetMatrix(dim, dim, sizeof_double, d_C, dim, C_LOC(C(1,1)), dim) ! Simple sanity test, sum up all elements sum = 0.0 do j=1,dim do i=1,dim sum = sum + C(i,j) enddo enddo print *, "Sum is:", sum, "should be: ", dim deallocate(A) deallocate(B) deallocate(C) call cudaFree(d_A) call cudaFree(d_B) call cudaFree(d_C) call cublasDestroy(handle) end program
Compile:
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit
$ nvcc -c cublas_fortran.cu
$ ftn -lcublas cublas_fortran_iso.f90 cublas.f90 cublas_fortran.o -o cublas.out
Run:
$ module load cudatoolkit
$ aprun ./cublas.out
Library-provided Batching in Fortran
The most efficient way to increase GPU utilization for small library operations is to use library-provided batch capabilities. In the code sample below we use cublasDgemmBatched, which takes as an argument an array of pointers to matrices and the total number of DGEMM operations to perform. By allowing the library to handle the batching, high performance can be achieved with minimal additional effort on the part of the programmer.
cublas-batch.f90
program main use ISO_C_BINDING use cublas_f implicit none integer :: dim, stat, i, j, k, batch_count integer(8) :: bytes real(8),dimension(:,:,:), pointer:: A, B, C real(8) :: alpha, beta, index, sum type(C_PTR) :: d_A, d_B, d_C type(C_PTR), dimension(:), pointer :: h_d_A, h_d_B, h_d_C, streams type(C_PTR) :: handle integer :: sizeof_double parameter (sizeof_double=8) !Linear dimension of matrices dim = 100 ! Number of A,B,C matrix sets batch_count = 1000 ! Allocate host storage for A,B,C square matrices allocate(A(dim,dim,batch_count)) allocate(B(dim,dim,batch_count)) allocate(C(dim,dim,batch_count)) ! Create host pointer array to device matrix storage allocate(h_d_A(batch_count)) allocate(h_d_B(batch_count)) allocate(h_d_C(batch_count)) bytes = dim*dim*sizeof_double do i=1,batch_count stat = cudaMalloc(h_d_A(i), bytes) stat = cudaMalloc(h_d_B(i), bytes) stat = cudaMalloc(h_d_C(i), bytes) enddo ! Copy the host array of device pointers to the device bytes = batch_count*sizeof_double ! 8 byte pointers stat = cudaMalloc(d_A, bytes) stat = cudaMalloc(d_B, bytes) stat = cudaMalloc(d_C, bytes) stat = cudaMemcpy(d_A, C_LOC(h_d_A(1)), bytes, cudaMemcpyHostToDevice); stat = cudaMemcpy(d_B, C_LOC(h_d_B(1)), bytes, cudaMemcpyHostToDevice); stat = cudaMemcpy(d_C, C_LOC(h_d_C(1)), bytes, cudaMemcpyHostToDevice); ! Fill A,B diagonals with sin(i) data, C diagonal with cos(i)^2 ! Matrices are arranged column major do k=1,batch_count do j=1,dim do i=1,dim if (i==j) then index = real(j*dim + i) A(i,j,k) = k*sin(index) B(i,j,k) = sin(index) C(i,j,k) = k*cos(index) * cos(index) else A(i,j,k) = 0.0 B(i,j,k) = 0.0 C(i,j,k) = 0.0 endif enddo ! i enddo ! j enddo ! k ! Create cublas instance stat = cublasCreate(handle) ! Set input matrices on device do i=1,batch_count stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(A(1,1,i)), dim, h_d_A(i), dim) stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(B(1,1,i)), dim, h_d_B(i), dim) stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(C(1,1,i)), dim, h_d_C(i), dim) enddo ! Set matrix coefficients alpha = 1.0 beta = 1.0 ! batched DGEMM: C = alpha*A*B + beta*C stat = cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N, & dim, dim, dim, & alpha, & d_A, dim, & d_B, dim, & beta, & d_C, dim, & batch_count) ! Retrieve result matrix from device do i=1,batch_count stat = cublasGetMatrix(dim, dim, sizeof_double, h_d_C(i), dim, C_LOC(C(1,1,i)), dim) enddo ! Simple sanity test, sum up all elements sum = 0.0 do k=1,batch_count do j=1,dim do i=1,dim sum = sum + C(i,j,k) enddo enddo enddo print *, "Sum is:", sum, "should be: ", dim*(batch_count)*(batch_count+1)/2 do i=1,batch_count call cudaFree(h_d_A(i)) call cudaFree(h_d_B(i)) call cudaFree(h_d_C(i)) enddo deallocate(A) deallocate(B) deallocate(C) call cublasDestroy(handle) end program
Compile:
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit
$ nvcc -c cublas_fortran.cu
$ ftn -lcublas cublas_fortran_iso.f90 cublas-batch.f90 cublas_fortran.o -o cublas-batch.out
Run:
$ module load cudatoolkit
$ aprun ./cublas-batch.out
CUDA Streams in Fortran
Not all library functions have a batch equivalent; in this case, using CUDA streams may be appropriate. Each stream can be considered a separate work queue, allowing kernels from separate streams to be executed concurrently if hardware resources allow. In the code sample below we place each DGEMM operation in its own stream, allowing up to 32 concurrent operations through the use of Hyper-Q. The cuBLAS library provides functionality to create, set, and destroy CUDA streams.
cublas-stream.f90
program main use ISO_C_BINDING use cublas_f implicit none integer :: dim, stat, i, j, k, batch_count integer(8) :: bytes real(8),dimension(:,:,:), pointer:: A, B, C real(8) :: alpha, beta, index, sum type(C_PTR), dimension(:), pointer :: d_A, d_B, d_C, streams type(C_PTR) :: handle integer :: sizeof_double parameter (sizeof_double=8) !Linear dimension of matrices dim = 100 ! Number of A,B,C matrix sets batch_count = 1000 ! Allocate host storage for A,B,C square matrices allocate(A(dim,dim,batch_count)) allocate(B(dim,dim,batch_count)) allocate(C(dim,dim,batch_count)) ! Allocate device storage for A,B,C allocate(d_A(batch_count)) allocate(d_B(batch_count)) allocate(d_C(batch_count)) bytes = int(dim*dim*sizeof_double, 8) do i=1,batch_count stat = cudaMalloc(d_A(i), bytes) stat = cudaMalloc(d_B(i), bytes) stat = cudaMalloc(d_C(i), bytes) enddo ! Fill A,B diagonals with sin(i) data, C diagonal with cos(i)^2 ! Matrices are arranged column major do k=1,batch_count do j=1,dim do i=1,dim if (i==j) then index = real(j*dim + i) A(i,j,k) = k*sin(index) B(i,j,k) = sin(index) C(i,j,k) = k*cos(index) * cos(index) else A(i,j,k) = 0.0 B(i,j,k) = 0.0 C(i,j,k) = 0.0 endif enddo ! i enddo ! j enddo ! k ! Create cublas instance stat = cublasCreate(handle) ! Set input matrices on device do i=1,batch_count stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(A(1,1,i)), dim, d_A(i), dim) stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(B(1,1,i)), dim, d_B(i), dim) stat = cublasSetMatrix(dim, dim, sizeof_double, C_LOC(C(1,1,i)), dim, d_C(i), dim) enddo ! Create a stream for every DGEMM operation allocate(streams(batch_count)) do i=1,batch_count stat = cudaStreamCreate(streams(i)) enddo ! Set matrix coefficients alpha = 1.0 beta = 1.0 ! Launch each DGEMM operation in own CUDA stream do i=1,batch_count ! Set CUDA stream stat = cublasSetStream(handle, streams(i)) ! DGEMM: C = alpha*A*B + beta*C stat = cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, & dim, dim, dim, & alpha, & d_A(i), dim, & d_B(i), dim, & beta, & d_C(i), dim) enddo ! Retrieve result matrix from device do i=1,batch_count stat = cublasGetMatrix(dim, dim, sizeof_double, d_C(i), dim, C_LOC(C(1,1,i)), dim) enddo ! Simple sanity test, sum up all elements sum = 0.0 do k=1,batch_count do j=1,dim do i=1,dim sum = sum + C(i,j,k) enddo enddo enddo print *, "Sum is:", sum, "should be: ", dim*(batch_count)*(batch_count+1)/2 do i=1,batch_count stat = cudaStreamDestroy(streams(i)) call cudaFree(d_A(i)) call cudaFree(d_B(i)) call cudaFree(d_C(i)) enddo deallocate(A) deallocate(B) deallocate(C) call cublasDestroy(handle) end program
Compile:
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit
$ nvcc -c cublas_fortran.cu
$ ftn -lcublas cublas_fortran_iso.f90 cublas-stream.f90 cublas_fortran.o -o cublas-stream.out
Run:
$ module load cudatoolkit
$ aprun ./cublas-stream.out
This tutorial will cover linking GPU code compiled with NVCC and CPU code compiled with the Cray compiler wrappers cc/CC/ftn. The Cray compiler wrappers work in conjunction with the modules system to link in needed libraries such as MPI; it is therefore recommended that the wrappers be used to compile the CPU portions of your code. To generate compiler-portable code, it is necessary to compile CUDA C, and code containing CUDA runtime calls, with NVCC. The resulting NVCC-compiled object code must then be linked in with object code compiled with the Cray wrappers. NVCC performs GNU-style C++ name mangling on compiled functions, so care must be taken in compiling and linking codes. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov
C++
When linking C++ host code with NVCC-compiled code, the C++ code must use GNU-compatible name mangling. This is controlled through compiler wrapper flags.
vecAddWrapperCXX.cu
#include <stdio.h> #include <stdlib.h> #include <math.h> // CUDA kernel. Each thread takes care of one element of c __global__ void vecAdd(float *a, float *b, float *c, int n) { // Get our global thread ID int id = blockIdx.x*blockDim.x+threadIdx.x; // Make sure we do not go out of bounds if (id < n) c[id] = a[id] + b[id]; } void vecAdd_wrapper() { // Size of vectors int n = 100000; // Host input vectors float *h_a; float *h_b; //Host output vector float *h_c; // Device input vectors float *d_a; float *d_b; //Device output vector float *d_c; // Size, in bytes, of each vector size_t bytes = n*sizeof(float); // Allocate memory for each vector on host h_a = (float*)malloc(bytes); h_b = (float*)malloc(bytes); h_c = (float*)malloc(bytes); // Allocate memory for each vector on GPU cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes); int i; // Initialize vectors on host for( i = 0; i < n; i++ ) { h_a[i] = sinf(i)*sinf(i); h_b[i] = cosf(i)*cosf(i); } // Copy host vectors to device cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice); cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice); int blockSize, gridSize; // Number of threads in each thread block blockSize = 1024; // Number of thread blocks in grid gridSize = (int)ceil((float)n/blockSize); // Execute the kernel vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n); // Copy array back to host cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost ); // Sum up vector c and print result divided by n, this should equal 1 within error float sum = 0; for(i=0; i<n; i++) sum += h_c[i]; printf("final result: %f\n", sum/n); // Release device memory cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); // Release host memory free(h_a); free(h_b); free(h_c); }
vecAdd.cxx
#include <stdio.h>

void vecAdd_wrapper(void);

int main( int argc, char* argv[] )
{
    vecAdd_wrapper();

    return 0;
}
Compiling C++
PGI
$ module load cudatoolkit
$ nvcc -c vecAddWrapperCXX.cu
$ CC --gnu vecAdd.cxx vecAddWrapperCXX.o
Cray
$ module load cudatoolkit
$ nvcc -c vecAddWrapperCXX.cu
$ CC vecAdd.cxx vecAddWrapperCXX.o
GNU
$ module load cudatoolkit
$ nvcc -c vecAddWrapperCXX.cu
$ CC vecAdd.cxx vecAddWrapperCXX.o
Intel
$ module load cudatoolkit
$ nvcc -c vecAddWrapperCXX.cu
$ CC vecAdd.cxx vecAddWrapperCXX.o
C
NVCC name mangling must be disabled if the resulting object code is to be linked with C code. This requires the extern "C" function qualifier.
vecAddWrapperC.cu
#include <stdio.h> #include <stdlib.h> #include <math.h> // CUDA kernel. Each thread takes care of one element of c __global__ void vecAdd(float *a, float *b, float *c, int n) { // Get our global thread ID int id = blockIdx.x*blockDim.x+threadIdx.x; // Make sure we do not go out of bounds if (id < n) c[id] = a[id] + b[id]; } extern "C" void vecAdd_wrapper() { // Size of vectors int n = 100000; // Host input vectors float *h_a; float *h_b; //Host output vector float *h_c; // Device input vectors float *d_a; float *d_b; //Device output vector float *d_c; // Size, in bytes, of each vector size_t bytes = n*sizeof(float); // Allocate memory for each vector on host h_a = (float*)malloc(bytes); h_b = (float*)malloc(bytes); h_c = (float*)malloc(bytes); // Allocate memory for each vector on GPU cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes); int i; // Initialize vectors on host for( i = 0; i < n; i++ ) { h_a[i] = sinf(i)*sinf(i); h_b[i] = cosf(i)*cosf(i); } // Copy host vectors to device cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice); cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice); int blockSize, gridSize; // Number of threads in each thread block blockSize = 1024; // Number of thread blocks in grid gridSize = (int)ceil((float)n/blockSize); // Execute the kernel vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n); // Copy array back to host cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost ); // Sum up vector c and print result divided by n, this should equal 1 within error float sum = 0; for(i=0; i<n; i++) sum += h_c[i]; printf("final result: %f\n", sum/n); // Release device memory cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); // Release host memory free(h_a); free(h_b); free(h_c); }
vecAdd.c
#include <stdio.h>

void vecAdd_wrapper(void);

int main( int argc, char* argv[] )
{
    vecAdd_wrapper();

    return 0;
}
Compiling C
There should be no compiler-dependent variations for C.
$ module load cudatoolkit
$ nvcc -c vecAddWrapperC.cu
$ cc vecAdd.c vecAddWrapperC.o
Fortran: Simple
This simple method allows C code to be called from Fortran but requires that the C function name be modified. NVCC name mangling must be disabled if the resulting object code is to be linked with Fortran code. This requires the extern "C" function qualifier. Additionally, function names must be lowercase and end in an underscore (_). The ISO_C_BINDING method following this example allows you to call C functions from Fortran without modifying the C function call.
vecAddWrapperF.cu
#include <stdio.h> #include <stdlib.h> #include <math.h> // CUDA kernel. Each thread takes care of one element of c __global__ void vecAdd(float *a, float *b, float *c, int n) { // Get our global thread ID int id = blockIdx.x*blockDim.x+threadIdx.x; // Make sure we do not go out of bounds if (id < n) c[id] = a[id] + b[id]; } extern "C" void vecadd_wrapper_() { // Size of vectors int n = 100000; // Host input vectors float *h_a; float *h_b; //Host output vector float *h_c; // Device input vectors float *d_a; float *d_b; //Device output vector float *d_c; // Size, in bytes, of each vector size_t bytes = n*sizeof(float); // Allocate memory for each vector on host h_a = (float*)malloc(bytes); h_b = (float*)malloc(bytes); h_c = (float*)malloc(bytes); // Allocate memory for each vector on GPU cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes); int i; // Initialize vectors on host for( i = 0; i < n; i++ ) { h_a[i] = sinf(i)*sinf(i); h_b[i] = cosf(i)*cosf(i); } // Copy host vectors to device cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice); cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice); int blockSize, gridSize; // Number of threads in each thread block blockSize = 1024; // Number of thread blocks in grid gridSize = (int)ceil((float)n/blockSize); // Execute the kernel vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n); // Copy array back to host cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost ); // Sum up vector c and print result divided by n, this should equal 1 within error float sum = 0; for(i=0; i<n; i++) sum += h_c[i]; printf("final result: %f\n", sum/n); // Release device memory cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); // Release host memory free(h_a); free(h_b); free(h_c); }
vecAdd.f90
program vecAdd
    call vecadd_wrapper()
end program vecAdd
Compiling Fortran: Simple
There should be no compiler-dependent variations for Fortran.
$ module load cudatoolkit
$ nvcc -c vecAddWrapperF.cu
$ ftn vecAddWrapperF.o vecAdd.f90
Fortran: ISO_C_BINDING
The ISO_C_BINDING module provides Fortran with greater interoperability with C and reduces the need to modify the C function name.
vecAddWrapperISO.cu
#include <stdio.h> #include <stdlib.h> #include <math.h> // CUDA kernel. Each thread takes care of one element of c __global__ void vecAdd(float *a, float *b, float *c, int n) { // Get our global thread ID int id = blockIdx.x*blockDim.x+threadIdx.x; // Make sure we do not go out of bounds if (id < n) c[id] = a[id] + b[id]; } extern "C" void vecAdd_wrapper() { // Size of vectors int n = 100000; // Host input vectors float *h_a; float *h_b; //Host output vector float *h_c; // Device input vectors float *d_a; float *d_b; //Device output vector float *d_c; // Size, in bytes, of each vector size_t bytes = n*sizeof(float); // Allocate memory for each vector on host h_a = (float*)malloc(bytes); h_b = (float*)malloc(bytes); h_c = (float*)malloc(bytes); // Allocate memory for each vector on GPU cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes); int i; // Initialize vectors on host for( i = 0; i < n; i++ ) { h_a[i] = sinf(i)*sinf(i); h_b[i] = cosf(i)*cosf(i); } // Copy host vectors to device cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice); cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice); int blockSize, gridSize; // Number of threads in each thread block blockSize = 1024; // Number of thread blocks in grid gridSize = (int)ceil((float)n/blockSize); // Execute the kernel vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n); // Copy array back to host cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost ); // Sum up vector c and print result divided by n, this should equal 1 within error float sum = 0; for(i=0; i<n; i++) sum += h_c[i]; printf("final result: %f\n", sum/n); // Release device memory cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); // Release host memory free(h_a); free(h_b); free(h_c); }
vecAddISO.f90
module vector
  INTERFACE
    subroutine vecadd_wrapper() BIND (C, NAME='vecAdd_wrapper')
      USE ISO_C_BINDING
      implicit none
    end subroutine vecadd_wrapper
  END INTERFACE
end module vector

program vecAdd
    use ISO_C_BINDING
    use vector

    call vecadd_wrapper()
end program vecAdd
Compiling Fortran: ISO_C_BINDING
There should be no compiler-dependent variations for Fortran.
$ module load cudatoolkit
$ nvcc -c vecAddWrapperISO.cu
$ ftn vecAddWrapperISO.o vecAddISO.f90
Cray’s implementation of MPICH2 allows GPU memory buffers to be passed directly to MPI function calls, eliminating the need to manually copy GPU data to the host before passing data through MPI. Several examples of using this feature are given below. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.
How to Enable
To enable GPUDirect the following steps must be taken before invoking aprun:
$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
Examples
The following examples will initialize memory on the GPU and then perform an MPI_Allgather operation between GPUs.
For these examples use two nodes and request an interactive job.
$ qsub -I -A PROJ### -lnodes=2,walltime=00:30:00
The executables must be placed in /tmp/work/$USER to run properly on the GPU.
CUDA C
direct.cpp
#include <stdio.h> #include <stdlib.h> #include <cuda_runtime.h> #include <mpi.h> int main( int argc, char** argv ) { MPI_Init (&argc, &argv); int direct; int rank, size; int *h_buff = NULL; int *d_rank = NULL; int *d_buff = NULL; size_t bytes; int i; // Ensure that RDMA ENABLED CUDA is set correctly direct = getenv("MPICH_RDMA_ENABLED_CUDA")==NULL?0:atoi(getenv ("MPICH_RDMA_ENABLED_CUDA")); if(direct != 1){ printf ("MPICH_RDMA_ENABLED_CUDA not enabled!\n"); exit (EXIT_FAILURE); } // Get MPI rank and size MPI_Comm_rank (MPI_COMM_WORLD, &rank); MPI_Comm_size (MPI_COMM_WORLD, &size); // Allocate host and device buffers and copy rank value to GPU bytes = size*sizeof(int); h_buff = (int*)malloc(bytes); cudaMalloc(&d_buff, bytes); cudaMalloc(&d_rank, sizeof(int)); cudaMemcpy(d_rank, &rank, sizeof(int), cudaMemcpyHostToDevice); // Preform Allgather using device buffer MPI_Allgather(d_rank, 1, MPI_INT, d_buff, 1, MPI_INT, MPI_COMM_WORLD); // Check that the GPU buffer is correct cudaMemcpy(h_buff, d_buff, bytes, cudaMemcpyDeviceToHost); for(i=0; i<size; i++){ if(h_buff[i] != i) { printf ("Alltoall Failed!\n"); exit (EXIT_FAILURE); } } if(rank==0) printf("Success!\n"); // Clean up free(h_buff); cudaFree(d_buff); cudaFree(d_rank); MPI_Finalize(); return 0; }
Compiling CUDA C
For ease of compiling the GNU environment will be used.
$ module load cudatoolkit
$ module switch PrgEnv-pgi PrgEnv-gnu
$ CC -lcudart direct.cpp -o direct.out
Running CUDA C
$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ aprun -n2 -N1 ./direct.out
CUDA Fortran
direct.cuf
program GPUdirect use cudafor implicit none include 'mpif.h' integer :: direct character(len=255) :: env_var integer :: rank, size, ierror integer,dimension(:),allocatable :: h_buff integer,device :: d_rank integer,dimension(:),allocatable,device :: d_buff integer :: i call getenv("MPICH_RDMA_ENABLED_CUDA", env_var) read( env_var, '(i10)' ) direct if (direct .NE. 1) then print *, 'MPICH_RDMA_ENABLED_CUDA not enabled!' call exit(1) endif call MPI_INIT(ierror) ! Get MPI rank and size call MPI_COMM_RANK (MPI_COMM_WORLD, rank, ierror) call MPI_COMM_SIZE (MPI_COMM_WORLD, size, ierror) ! Initialize host and device buffers allocate(h_buff(size)) allocate(d_buff(size)) ! Implicity copy rank to device d_rank = rank ! Preform allgather using device buffers call MPI_ALLGATHER(d_rank, 1, MPI_INTEGER, d_buff, 1, MPI_INTEGER, MPI_COMM_WORLD, ierror); ! Check that buffer is correct h_buff = d_buff(1:size) do i=1,size if (h_buff(i) .NE. i-1) then print *, 'Alltoall Failed!' call exit(1) endif enddo if (rank .EQ. 0) then print *, 'Success!' endif ! Clean up deallocate(h_buff) deallocate(d_buff) call MPI_FINALIZE(ierror) end program GPUdirect
Compiling CUDA Fortran
$ module load cudatoolkit
$ ftn direct.cuf -o direct.out
Running CUDA Fortran
$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ aprun -n2 -N1 ./direct.out
OpenACC C
direct.c
#include <stdio.h> #include <stdlib.h> #include <mpi.h> int main( int argc, char** argv ) { MPI_Init (&argc, &argv); int direct; int rank, size; int *restrict buff = NULL; size_t bytes; int i; // Ensure that RDMA ENABLED CUDA is set correctly direct = getenv("MPICH_RDMA_ENABLED_CUDA")==NULL?0:atoi(getenv ("MPICH_RDMA_ENABLED_CUDA")); if(direct != 1){ printf ("MPICH_RDMA_ENABLED_CUDA not enabled!\n"); exit (EXIT_FAILURE); } // Get MPI rank and size MPI_Comm_rank (MPI_COMM_WORLD, &rank); MPI_Comm_size (MPI_COMM_WORLD, &size); // Initialize buffer bytes = size*sizeof(int); buff = (int*)malloc(bytes); // Copy buff to device at start of region and back to host and end of region #pragma acc data copy(rank, buff[0:size]) { // Inside this region the device data pointer will be used #pragma acc host_data use_device(rank, buff) { MPI_Allgather(&rank, 1, MPI_INT, buff, 1, MPI_INT, MPI_COMM_WORLD); } } // Check that buffer is correct for(i=0; i<size; i++){ if(buff[i] != i) { printf ("Alltoall Failed!\n"); exit (EXIT_FAILURE); } } if(rank==0) printf("Success!\n"); // Clean up free(buff); MPI_Finalize(); return 0; }
Compiling OpenACC C
CRAY
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ cc -hpragma=acc direct.c -o direct.out
PGI
$ module load cudatoolkit
$ cc -acc -lcudart direct.c -o direct.out
Running OpenACC C
CRAY
$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ aprun -n2 -N1 ./direct.out
PGI
$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ aprun -n2 -N1 ./direct.out
OpenACC Fortran
direct.f90
program GPUdirect include 'mpif.h' integer :: direct character(len=255) :: env_var integer :: rank, size, ierror integer,dimension(:),allocatable :: buff integer :: i call getenv("MPICH_RDMA_ENABLED_CUDA", env_var) read( env_var, '(i10)' ) direct if (direct .NE. 1) then print *, 'MPICH_RDMA_ENABLED_CUDA not enabled!' call exit(1) endif call MPI_INIT(ierror) ! Get MPI rank and size call MPI_COMM_RANK (MPI_COMM_WORLD, rank, ierror) call MPI_COMM_SIZE (MPI_COMM_WORLD, size, ierror) ! Initialize buffer allocate(buff(size)) ! Copy buff to device at start of region and back to host and end of region !$acc data copy(rank, buff(1:size)) ! Inside this region the device data pointer will be used !$acc host_data use_device(rank, buff) ! Preform all to all using device buffer call MPI_ALLGATHER(rank, 1, MPI_INT, buff, 1, MPI_INT, MPI_COMM_WORLD, ierror); !$acc end host_data !$acc end data ! Check that buffer is correct do i=1,size if (buff(i) .NE. i-1) then print *, 'Alltoall Failed!' call exit(1) endif enddo if (rank .EQ. 0) then print *, 'Success!' endif ! Clean up deallocate(buff) call MPI_FINALIZE(ierror) end program GPUdirect
Compiling OpenACC Fortran
CRAY
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ ftn -hacc direct.f90 -o direct.out
PGI
$ module load cudatoolkit
$ ftn -acc -lcudart direct.f90 -o direct.out
Running OpenACC Fortran
CRAY
$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ aprun -n2 -N1 ./direct.out
PGI
$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ aprun -n2 -N1 ./direct.out
Optimizations
Several optimizations for improving performance are given below. These optimizations are highly application dependent and may require some trial and error tuning to achieve best results.
Pipelining
Pipelining allows for overlapping of GPU-to-GPU MPI messages and may improve message-passing performance for large, bandwidth-bound messages. Setting the environment variable MPICH_G2G_PIPELINE=N allows a maximum of N GPU-to-GPU messages to be in flight at any given time. The default value of MPICH_G2G_PIPELINE is 16, and messages under 8 kilobytes in size are never pipelined.
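For example, to allow up to 64 GPU-to-GPU messages in flight (the value 64 is illustrative and should be tuned for your application):

$ export MPICH_G2G_PIPELINE=64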
Nemesis
Applications using asynchronous MPI calls may benefit from enabling the MPICH asynchronous progress feature. Setting the MPICH_NEMESIS_ASYNC_PROGRESS=1 environment variable enables additional threads to be spawned to progress the MPI state.
This feature requires that the thread level be set to multiple: MPICH_MAX_THREAD_SAFETY=multiple.
This feature works best when used in conjunction with core specialization: aprun -r N, which allows for N CPU cores to be reserved for system services.
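A sketch of combining these settings before launch (the task count, core-specialization count, and executable name are placeholders):

$ export MPICH_NEMESIS_ASYNC_PROGRESS=1
$ export MPICH_MAX_THREAD_SAFETY=multiple
$ aprun -r 1 -n 16 ./a.out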
The default GPU compute mode for Titan is Exclusive Process. In this mode, many threads within a process may access the GPU context. To allow multiple processes access to the GPU context, such as multiple MPI tasks on a single node accessing the GPU, the CUDA proxy server was developed. Once enabled, the proxy server transparently manages work issued to the GPU context from multiple processes. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov
How to Enable
To enable the proxy server the following steps must be taken before invoking aprun:
$ export CRAY_CUDA_MPS=1
Issues
Currently, GPU debugging and profiling are not supported when the proxy is enabled. Specifying the qsub flag -lfeature=gpudefault will switch the compute mode from exclusive process to the CUDA default mode. In the default mode, debugging and profiling are available and multiple MPI ranks will be able to access the GPU. The default compute mode is not recommended on Titan: in the default compute mode approximately 120 MB of device memory is used per process accessing the GPU, and inconsistent behavior may be encountered under certain conditions.
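For example, an interactive job requesting the default compute mode might be submitted as follows (the project ID is a placeholder):

$ qsub -I -A PROJ### -lnodes=1,walltime=00:30:00 -lfeature=gpudefault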
Examples
The following examples will demonstrate when and how to use the CUDA proxy. In each example several kernels will be launched to demonstrate use of the CUDA proxy. These examples will be used in a future tutorial to investigate how multiple kernels are scheduled on the GPU.
For these examples use a single node and request an interactive job.
$ qsub -I -A PROJ### -lnodes=1,walltime=00:30:00
C Wrapper
The following C wrapper functions will be used to set up and launch a kernel that sleeps a single GPU thread for a specified number of seconds. This file will be compiled with NVCC and then linked into our examples.
pauseWrapper.cu
#include <stdio.h> #include <stdlib.h> #include <time.h> // CUDA kernel to pause for at least num_cycle cycles __global__ void sleep(int64_t num_cycles) { int64_t cycles = 0; int64_t start = clock64(); while(cycles < num_cycles) { cycles = clock64() - start; } } // Returns number of cycles required for requested seconds extern "C" int64_t get_cycles(int seconds) { // Get device frequency in Hz int64_t Hz; cudaDeviceProp prop; cudaGetDeviceProperties(&prop, 0); Hz = int64_t(prop.clockRate) * 1000; // Calculate number of cycles to wait int64_t num_cycles; num_cycles = seconds * Hz; return num_cycles; } // Launches a kernel that sleeps for at least num_cycles extern "C" void sleep_kernel(int64_t num_cycles) { // Our kernel will launch a single thread to sleep the kernel int blockSize, gridSize; blockSize = 1; gridSize = 1; // Execute the kernel sleep<<<gridSize, blockSize>>>(num_cycles); } // Wait for all pending GPU transactions to end extern "C" void wait_for_gpu() { cudaDeviceSynchronize(); }
Compiling
The following will compile the wrapper functions into object code to be linked in with our examples.
$ module load cudatoolkit
$ nvcc -c -arch=sm_35 pauseWrapper.cu
Multiple kernels single process
Launching multiple kernels from a single process does not require the CUDA proxy.
launcher.cpp
#include <stdio.h> #include <stdlib.h> #include <stdint.h> extern "C" int64_t get_cycles(int seconds); extern "C" void sleep_kernel(int64_t num_cycles); extern "C" void wait_for_gpu(); int main(int argc, char *argv[]) { // Get number of cycles to sleep for 1 second uint64_t cycles; cycles = get_cycles(1); // Number of kernels to launch int num_kernels = 14; // Launch num_kernel kernels asynchrnously for(int i=0; i<num_kernels; i++){ sleep_kernel(cycles); } // Wait for the kernel to complete wait_for_gpu(); return 0; }
Compiling and running multiple kernels from a single process
$ CC pauseWrapper.o launcher.cpp -o sleep.out
$ aprun -n1 ./sleep.out
Multiple kernels launched from multiple threads
Launching multiple kernels from multiple threads spawned from a single process does not require the CUDA proxy.
launcherOMP.cpp
#include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <omp.h> extern "C" int64_t get_cycles(int seconds); extern "C" void sleep_kernel(int64_t num_cycles); extern "C" void wait_for_gpu(); int main(int argc, char *argv[]) { // Get number of cycles to sleep for 1 second uint64_t cycles; cycles = get_cycles(1); // Number of kernels to launch int num_kernels = 14; // Launch kernel omp_set_num_threads(num_kernels); #pragma omp parallel for shared(cycles) num_threads(num_kernels) for(int i=0; i<num_kernels; i++){ sleep_kernel(cycles); } // Wait for the kernel to complete wait_for_gpu(); }
Compiling and running multiple kernels launched from multiple threads
$ CC pauseWrapper.o launcherOMP.cpp -o sleep.out
$ aprun -n1 -d14 ./sleep.out
Multiple kernels launched from multiple MPI tasks
Launching multiple kernels from multiple MPI tasks on a single node does require the CUDA proxy.
launcherMPI.cpp
#include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <mpi.h> extern "C" int64_t get_cycles(int seconds); extern "C" void sleep_kernel(int64_t num_cycles); extern "C" void wait_for_gpu(); int main(int argc, char *argv[]) { MPI_Init(&argc, &argv); int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank); // Get number of cycles to sleep for 1 second uint64_t cycles; cycles = get_cycles(1); // Sleep kernel for 1 second sleep_kernel(cycles); // Wait for the kernel to complete wait_for_gpu(); MPI_Finalize(); return 0; }
Compiling and running multiple kernels launched from multiple MPI tasks
$ CC pauseWrapper.o launcherMPI.cpp -o sleep.out
$ export CRAY_CUDA_MPS=1
$ aprun -n14 -N14 ./sleep.out
This sample shows a minimal conversion of our vector addition CPU code to C for CUDA; consider this a CUDA C 'Hello World'. If you are not yet familiar with basic CUDA concepts, please see the Accelerated Computing Guide. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov
vecAdd.cu
#include <stdio.h> #include <stdlib.h> #include <math.h> // CUDA kernel. Each thread takes care of one element of c __global__ void vecAdd(double *a, double *b, double *c, int n) { // Get our global thread ID int id = blockIdx.x*blockDim.x+threadIdx.x; // Make sure we do not go out of bounds if (id < n) c[id] = a[id] + b[id]; } int main( int argc, char* argv[] ) { // Size of vectors int n = 100000; // Host input vectors double *h_a; double *h_b; //Host output vector double *h_c; // Device input vectors double *d_a; double *d_b; //Device output vector double *d_c; // Size, in bytes, of each vector size_t bytes = n*sizeof(double); // Allocate memory for each vector on host h_a = (double*)malloc(bytes); h_b = (double*)malloc(bytes); h_c = (double*)malloc(bytes); // Allocate memory for each vector on GPU cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes); int i; // Initialize vectors on host for( i = 0; i < n; i++ ) { h_a[i] = sin(i)*sin(i); h_b[i] = cos(i)*cos(i); } // Copy host vectors to device cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice); cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice); int blockSize, gridSize; // Number of threads in each thread block blockSize = 1024; // Number of thread blocks in grid gridSize = (int)ceil((float)n/blockSize); // Execute the kernel vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n); // Copy array back to host cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost ); // Sum up vector c and print result divided by n, this should equal 1 within error double sum = 0; for(i=0; i<n; i++) sum += h_c[i]; printf("final result: %f\n", sum/n); // Release device memory cudaFree(d_a); cudaFree(d_b); cudaFree(d_c); // Release host memory free(h_a); free(h_b); free(h_c); return 0; }
Changes
Kernel:
The kernel is the heart of our CUDA code. When a kernel is launched the number of threads per block (blockDim) and the number of blocks per grid (gridDim) are specified. The total number of threads is (blockDim) * (gridDim). Each thread evaluates one copy of the kernel.
// CUDA kernel. Each thread takes care of one element of c __global__ void vecAdd(double *a, double *b, double *c, int n) { // Get our global thread ID int id = blockIdx.x*blockDim.x+threadIdx.x; // Make sure we do not go out of bounds if (id < n) c[id] = a[id] + b[id]; }
Let’s take a look at what makes up this simple kernel.
__global__ void vecAdd(double *a, double *b, double *c, int n)
The __global__ decorator specifies that this is a CUDA kernel; otherwise, normal C function syntax is used. The kernel must have return type void.
int id = blockIdx.x*blockDim.x+threadIdx.x;
We calculate the thread's global ID using CUDA-supplied structs. blockIdx contains the block's position in the grid, ranging from 0 to gridDim-1. threadIdx is the thread's index inside of its associated block, ranging from 0 to blockDim-1. For convenience blocks and grids can be multi-dimensional; the associated structs contain x, y, and z members.
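For instance, if a kernel were launched with two-dimensional blocks and grid, the global x and y indices could be computed as in the short sketch below (illustrative only; vecAdd.cu itself is one dimensional):

// Illustrative only: 2D global indices built from the same CUDA structs
int ix = blockIdx.x*blockDim.x + threadIdx.x;  // global column index
int iy = blockIdx.y*blockDim.y + threadIdx.y;  // global row index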
if (id < n)
Unless we have an array length divisible by our blockDim, the number of threads launched will not match the number of valid array elements. To avoid overstepping our array we simply test that our global thread ID is less than the length of our array.
c[id] = a[id] + b[id]
The thread ID is used to index the arrays that reside in global device memory. Each thread will load a value from a and b and write the sum to c.
Array Pointers:
// Host input vectors double *h_a; double *h_b; // Host output vector double *h_c; // Device input vectors double *d_a; double *d_b; // Device output vector double *d_c;
With the host CPU and GPU having separate memory spaces we must maintain two sets of pointers, one set for our host arrays and one set for our device arrays. Here we use the h_ and d_ prefix to differentiate them.
cudaMalloc:
// Allocate memory for each vector on GPU cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
Given a pointer, cudaMalloc will allocate memory in the device's global memory space and set the passed pointer to point to this memory.
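cudaMalloc also returns a cudaError_t, which the example above ignores for brevity. A hedged sketch of checking it (not part of vecAdd.cu, but it reuses d_a and bytes from the listing) could look like:

// Illustrative only: check the status returned by the allocation
cudaError_t err = cudaMalloc(&d_a, bytes);
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(1);
}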
cudaMemcpy to device:
// Copy host vectors to device cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice); cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);
After we initialize the data on the host we must copy it over to the device; to do so we use cudaMemcpy. Calling cudaMemcpy initiates a DMA transfer between host and device.
Thread mapping:
int blockSize, gridSize; // Number of threads in each thread block blockSize = 1024; // Number of thread blocks in grid gridSize = (int)ceil((float)n/blockSize);
To launch our kernel we must specify the number of threads per block and the number of blocks in our grid. The maximum value of both depends on the device's compute capability. Normally the blockSize is chosen based upon kernel memory requirements and the gridSize is calculated so that enough threads are launched to cover the kernel domain.
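For this example, n = 100000 and blockSize = 1024 give gridSize = 98, or 100352 threads, which is why the kernel's bounds check is needed. An equivalent, commonly used integer round-up (illustrative only, not the form used in vecAdd.cu) is:

// Illustrative only: integer arithmetic round-up avoids the float cast and ceil()
gridSize = (n + blockSize - 1) / blockSize;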
Launch kernel:
// Execute the kernel vecAdd<<< gridSize, blockSize >>>(d_a, d_b, d_c, n);
We launch our kernel with a modified C function syntax that lets us specify the grid and block sizes. Kernel calls are non-blocking.
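Because the launch returns immediately, launch failures and errors raised while the kernel runs are easy to miss. A hedged sketch of checking for them (not part of vecAdd.cu; the shared-memory Game of Life example later uses a similar check) is:

// Illustrative only: catch launch errors, then wait for the kernel and catch execution errors
cudaError_t launchErr = cudaGetLastError();
cudaError_t syncErr = cudaDeviceSynchronize();
if (launchErr != cudaSuccess || syncErr != cudaSuccess)
    printf("CUDA error: %s\n", cudaGetErrorString(launchErr != cudaSuccess ? launchErr : syncErr));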
cudaMemcpy to host:
// Copy array back to host cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );
cudaMemcpy will block until the kernel is complete and then copy the results from the device back to the host.
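The copy blocks because it is issued in the same (default) stream as the kernel. If non-blocking behavior is wanted, CUDA also offers an asynchronous copy; the hedged sketch below is illustrative only (it is not part of vecAdd.cu, and true copy/compute overlap would additionally require page-locked host memory from cudaMallocHost):

// Illustrative only: asynchronous device-to-host copy in a user-created stream
cudaStream_t stream;
cudaStreamCreate(&stream);
cudaMemcpyAsync(h_c, d_c, bytes, cudaMemcpyDeviceToHost, stream);
// ...independent host work could be done here...
cudaStreamSynchronize(stream);  // wait for the copy to finish
cudaStreamDestroy(stream);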
Release device memory:
// Release device memory cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
As with C's malloc, we need to release our device arrays when we are finished; to do so we use cudaFree.
Compiling
The CUDA environment must be set up before compiling, so make sure the cudatoolkit module is loaded:
$ module load cudatoolkit $ nvcc vecAdd.cu -o vecAdd.out
Running
$ aprun ./vecAdd.out final result: 1.000000
In this tutorial we will cover three different CUDA versions of our GOL program, each highlighting a different CUDA memory area. The first will strictly use global memory and is a straightforward GPU port of our program, the second introduces shared memory, and the last covers the use of texture memory. A description of the Game of Life and the original CPU code can be found here. The full source can be viewed or downloaded at the OLCF GitHub. Please direct any questions or comments to help@nccs.gov
GOL-CUDA.cu
#include <stdio.h> #include <stdlib.h> #define SRAND_VALUE 1985 #define BLOCK_SIZE 16 __global__ void ghostRows(int dim, int *grid) { // We want id ∈ [1,dim] int id = blockDim.x * blockIdx.x + threadIdx.x + 1; if (id <= dim) { //Copy first real row to bottom ghost row grid[(dim+2)*(dim+1)+id] = grid[(dim+2)+id]; //Copy last real row to top ghost row grid[id] = grid[(dim+2)*dim + id]; } } __global__ void ghostCols(int dim, int *grid) { // We want id ∈ [0,dim+1] int id = blockDim.x * blockIdx.x + threadIdx.x; if (id <= dim+1) { //Copy first real column to right most ghost column grid[id*(dim+2)+dim+1] = grid[id*(dim+2)+1]; //Copy last real column to left most ghost column grid[id*(dim+2)] = grid[id*(dim+2) + dim]; } } __global__ void GOL(int dim, int *grid, int *newGrid) { // We want id ∈ [1,dim] int iy = blockDim.y * blockIdx.y + threadIdx.y + 1; int ix = blockDim.x * blockIdx.x + threadIdx.x + 1; int id = iy * (dim+2) + ix; int numNeighbors; if (iy <= dim && ix <= dim) { // Get the number of neighbors for a given grid point numNeighbors = grid[id+(dim+2)] + grid[id-(dim+2)] //upper lower + grid[id+1] + grid[id-1] //right left + grid[id+(dim+3)] + grid[id-(dim+3)] //diagonals + grid[id-(dim+1)] + grid[id+(dim+1)]; int cell = grid[id]; // Here we have explicitly all of the game rules if (cell == 1 && numNeighbors < 2) newGrid[id] = 0; else if (cell == 1 && (numNeighbors == 2 || numNeighbors == 3)) newGrid[id] = 1; else if (cell == 1 && numNeighbors > 3) newGrid[id] = 0; else if (cell == 0 && numNeighbors == 3) newGrid[id] = 1; else newGrid[id] = cell; } } int main(int argc, char* argv[]) { int i,j,iter; int* h_grid; //Grid on host int* d_grid; //Grid on device int* d_newGrid; //Second grid used on device only int* d_tmpGrid; //tmp grid pointer used to switch between grid and newGrid int dim = 1024; //Linear dimension of our grid - not counting ghost cells int maxIter = 1<<10; //Number of game steps size_t bytes = sizeof(int)*(dim+2)*(dim+2);//2 added for periodic boundary condition ghost cells // Allocate host Grid used for initial setup and read back from device h_grid = (int*)malloc(bytes); // Allocate device grids cudaMalloc(&d_grid, bytes); cudaMalloc(&d_newGrid, bytes); // Assign initial population randomly srand(SRAND_VALUE); for(i = 1; i<=dim; i++) { for(j = 1; j<=dim; j++) { h_grid[i*(dim+2)+j] = rand() % 2; } } // Copy over initial game grid (Dim-1 threads) cudaMemcpy(d_grid, h_grid, bytes, cudaMemcpyHostToDevice); dim3 blockSize(BLOCK_SIZE, BLOCK_SIZE,1); int linGrid = (int)ceil(dim/(float)BLOCK_SIZE); dim3 gridSize(linGrid,linGrid,1); dim3 cpyBlockSize(BLOCK_SIZE,1,1); dim3 cpyGridRowsGridSize((int)ceil(dim/(float)cpyBlockSize.x),1,1); dim3 cpyGridColsGridSize((int)ceil((dim+2)/(float)cpyBlockSize.x),1,1); // Main game loop for (iter = 0; iter<maxIter; iter++) { ghostRows<<<cpyGridRowsGridSize, cpyBlockSize>>>(dim, d_grid); ghostCols<<<cpyGridColsGridSize, cpyBlockSize>>>(dim, d_grid); GOL<<<gridSize, blockSize>>>(dim, d_grid, d_newGrid); // Swap our grids and iterate again d_tmpGrid = d_grid; d_grid = d_newGrid; d_newGrid = d_tmpGrid; }//iter loop // Copy back results and sum cudaMemcpy(h_grid, d_grid, bytes, cudaMemcpyDeviceToHost); // Sum up alive cells and print results int total = 0; for (i = 1; i<=dim; i++) { for (j = 1; j<=dim; j++) { total += h_grid[i*(dim+2)+j]; } } printf("Total Alive: %d\n", total); // Release memory cudaFree(d_grid); cudaFree(d_newGrid); free(h_grid); return 0; }
Changes
#define BLOCK_SIZE 16
The number of threads per block is a common performance tuning parameter and it is often set as a preprocessor define.
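One hedged convenience, not used in the listing above, is to guard the define so it can be overridden from the compile line, for example with nvcc -DBLOCK_SIZE=32:

// Illustrative only: allow the block size to be set at compile time
#ifndef BLOCK_SIZE
#define BLOCK_SIZE 16
#endif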
__global__ void ghostRows(int dim, int *grid) { // We want id ∈ [1,dim] int id = blockDim.x * blockIdx.x + threadIdx.x + 1; if (id <= dim) { //Copy first real row to bottom ghost row grid[(dim+2)*(dim+1)+id] = grid[(dim+2)+id]; //Copy last real row to top ghost row grid[id] = grid[(dim+2)*dim + id]; } }
__global__ void ghostCols(int dim, int *grid) { // We want id ∈ [0,dim+1] int id = blockDim.x * blockIdx.x + threadIdx.x; if (id <= dim+1) { //Copy first real column to right most ghost column grid[id*(dim+2)+dim+1] = grid[id*(dim+2)+1]; //Copy last real column to left most ghost column grid[id*(dim+2)] = grid[id*(dim+2) + dim]; } }
The ghostRows and ghostCols kernels are used to fill in our ghost rows and columns in device global memory. Each thread is responsible for copying two integers.
__global__ void GOL(int dim, int *grid, int *newGrid) { // We want id ∈ [1,dim] int iy = blockDim.y * blockIdx.y + threadIdx.y + 1; int ix = blockDim.x * blockIdx.x + threadIdx.x + 1; int id = iy * (dim+2) + ix; int numNeighbors; if (iy <= dim && ix <= dim) { // Get the number of neighbors for a given grid point numNeighbors = grid[id+(dim+2)] + grid[id-(dim+2)] //upper lower + grid[id+1] + grid[id-1] //right left + grid[id+(dim+3)] + grid[id-(dim+3)] //diagonals + grid[id-(dim+1)] + grid[id+(dim+1)]; int cell = grid[id]; // Here we have explicitly all of the game rules if (cell == 1 && numNeighbors < 2) newGrid[id] = 0; else if (cell == 1 && (numNeighbors == 2 || numNeighbors == 3)) newGrid[id] = 1; else if (cell == 1 && numNeighbors > 3) newGrid[id] = 0; else if (cell == 0 && numNeighbors == 3) newGrid[id] = 1; else newGrid[id] = cell; } }
The GOL kernel performs the majority of our work. We use a two-dimensional block and grid size to better fit our problem's geometry. Once the thread calculates its global ID and tests that it is not out of bounds it will read one cell value from global memory, calculate the new value, and write the result back to global memory.
int* d_grid; //Grid on device int* d_newGrid; //Second grid used on device only int* d_tmpGrid; //tmp grid pointer used to switch between grid and newGrid
We declare our device pointers on the host as we normally would; we prefix the pointer names with d_ to help separate them from host pointers, which we prefix with h_.
// Allocate device grids cudaMalloc(&d_grid, bytes); cudaMalloc(&d_newGrid, bytes);
Both of our device arrays are allocated in device global memory. A single-dimensional array is used for the two-dimensional game board.
// Copy over initial game grid (Dim-1 threads) cudaMemcpy(d_grid, h_grid, bytes, cudaMemcpyHostToDevice);
After we have filled h_grid on the host we must copy it over to d_grid which resides in the device global memory.
dim3 blockSize(BLOCK_SIZE, BLOCK_SIZE,1);
dim3 cpyBlockSize(BLOCK_SIZE,1,1);
The CUDA data type dim3 is used to define the number of threads in our block. For the GOL kernel we specify a two-dimensional block size to better suit our problem's geometry; for the ghost-cell copies a simple one-dimensional size is best. Note that any missing dimensions in the constructor are assumed to be 1.
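As a small illustration of that last point (not taken from the listing), the following two declarations are equivalent:

dim3 blockSize(BLOCK_SIZE, BLOCK_SIZE);      // z dimension defaults to 1
dim3 blockSize2(BLOCK_SIZE, BLOCK_SIZE, 1);  // all three dimensions written out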
int linGrid = (int)ceil(dim/(float)BLOCK_SIZE); dim3 gridSize(linGrid,linGrid,1);
dim3 cpyGridRowsGridSize((int)ceil(dim/(float)cpyBlockSize.x),1,1); dim3 cpyGridColsGridSize((int)ceil((dim+2)/(float)cpyBlockSize.x),1,1);
With our block size specified we must calculate the smallest integer number of blocks necessary to completely cover our game board; this is also of type dim3.
ghostRows<<<cpyGridRowsGridSize, cpyBlockSize>>>(dim, d_grid); ghostCols<<<cpyGridColsGridSize, cpyBlockSize>>>(dim, d_grid); GOL<<<gridSize, blockSize>>>(dim, d_grid, d_newGrid);
Each iteration of the game loop we launch our three kernels, first copying the ghost cells and then performing one game step.
// Swap our grids and iterate again d_tmpGrid = d_grid; d_grid = d_newGrid; d_newGrid = d_tmpGrid;
The grid and newGrid device pointers can be switched just as we did with the host pointers in the CPU version.
// Copy back results and sum cudaMemcpy(h_grid, d_grid, bytes, cudaMemcpyDeviceToHost);
Once our main game loop is complete we must copy the grid from device memory to host memory.
// Release memory cudaFree(d_grid); cudaFree(d_newGrid);
Having used cudaMalloc to allocate our device memory we release it with cudaFree.
Compiling
$ module load cudatoolkit $ nvcc GOL-CUDA.cu -o GOL-CUDA.out
GOL-CUDA-Shared.cu
In the first example each thread read nine values from global memory and wrote back one value. The inefficiency of this approach is that each value that is read from global memory is used not only in calculating its own new value but also by its eight closest neighbors in calculating their new values. Although some global memory access penalties have been alleviated in the latest generation of cards with the addition of an L2 cache, you will find shared memory is still an incredibly important aspect of GPU programming. Shared memory provides a very fast memory space (comparable to register access) that is shared between threads of the same thread block. In the case where multiple threads in a particular thread block are accessing the same global memory item, shared memory may be used to reduce the number of global memory calls. A typical use for shared memory is as follows:
- Have each thread read from global memory into shared memory
- Synchronize threads to ensure all threads in a given block have finished reading from global memory
- Have each thread perform calculations using shared memory values
- Have each thread write result from shared memory back to device global memory
For detailed coverage of shared memory please see the CUDA C Programming guide.
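Before the full listing, the hedged sketch below shows that four-step pattern on a simple one-dimensional array. It is illustrative only, it is not the Game of Life code, and it assumes the kernel is launched with BLOCK threads per block:

#define BLOCK 256

// Illustrative only: each interior thread of a block sums its two neighbors using a shared tile
__global__ void neighborSum(const int *in, int *out, int n)
{
    __shared__ int tile[BLOCK];
    int gid = blockIdx.x*blockDim.x + threadIdx.x;

    // 1. Each thread reads one element from global memory into shared memory
    if (gid < n) tile[threadIdx.x] = in[gid];

    // 2. Block-level barrier: all reads finish before any thread uses the tile
    __syncthreads();

    // 3. and 4. Interior threads compute from shared memory and write the result back
    if (gid < n && threadIdx.x > 0 && threadIdx.x < blockDim.x - 1)
        out[gid] = tile[threadIdx.x - 1] + tile[threadIdx.x + 1];
}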
#include <stdio.h> #include <stdlib.h> #define SRAND_VALUE 1985 #define BLOCK_SIZE_x 32 #define BLOCK_SIZE_y 16 __global__ void ghostRows(int dim, int* grid) { // We want id ∈ [1,dim] int id = blockDim.x * blockIdx.x + threadIdx.x + 1; if (id <= dim) { //Copy first real row to bottom ghost row grid[(dim+2)*(dim+1)+id] = grid[(dim+2)+id]; //Copy last real row to top ghost row grid[id] = grid[(dim+2)*dim + id]; } } __global__ void ghostCols(int dim, int* grid) { // We want id ∈ [0,dim+1] int id = blockDim.x * blockIdx.x + threadIdx.x; if (id <= dim+1) { //Copy first real column to right most ghost column grid[id*(dim+2)+dim+1] = grid[id*(dim+2)+1]; //Copy last real column to left most ghost column grid[id*(dim+2)] = grid[id*(dim+2) + dim]; } } __global__ void GOL(int dim, int *grid, int *newGrid) { int iy = (blockDim.y -2) * blockIdx.y + threadIdx.y; int ix = (blockDim.x -2) * blockIdx.x + threadIdx.x; int id = iy * (dim+2) + ix; int i = threadIdx.y; int j = threadIdx.x; int numNeighbors; // Declare the shared memory on a per block level __shared__ int s_grid[BLOCK_SIZE_y][BLOCK_SIZE_x]; // Copy cells into shared memory if (ix <= dim+1 && iy <= dim+1) s_grid[i][j] = grid[id]; //Sync all threads in block __syncthreads(); if (iy <= dim && ix <= dim) { if(i != 0 && i !=blockDim.y-1 && j != 0 && j !=blockDim.x-1) { // Get the number of neighbors for a given grid point numNeighbors = s_grid[i+1][j] + s_grid[i-1][j] //upper lower + s_grid[i][j+1] + s_grid[i][j-1] //right left + s_grid[i+1][j+1] + s_grid[i-1][j-1] //diagonals + s_grid[i-1][j+1] + s_grid[i+1][j-1]; int cell = s_grid[i][j]; // Here we have explicitly all of the game rules if (cell == 1 && numNeighbors < 2) newGrid[id] = 0; else if (cell == 1 && (numNeighbors == 2 || numNeighbors == 3)) newGrid[id] = 1; else if (cell == 1 && numNeighbors > 3) newGrid[id] = 0; else if (cell == 0 && numNeighbors == 3) newGrid[id] = 1; else newGrid[id] = cell; } } } int main(int argc, char* argv[]) { int i,j,iter; int *h_grid; //Grid on host int *d_grid; //Grid on device int *d_newGrid; //Second grid used on device only int *d_tmpGrid; //tmp grid pointer used to switch between grid and newGrid int dim = 1024; //Linear dimension of our grid - not counting ghost cells int maxIter = 1<<10; //Number of game steps size_t bytes = sizeof(int)*(dim+2)*(dim+2); // Allocate host Grid used for initial setup and read back from device h_grid = (int*)malloc(bytes); // Allocate device grids cudaMalloc(&d_grid, bytes); cudaMalloc(&d_newGrid, bytes); // Assign initial population randomly srand(SRAND_VALUE); for(i = 1; i<=dim; i++) { for(j = 1; j<=dim; j++) { h_grid[i*(dim+2)+j] = rand() % 2; } } cudaFuncSetCacheConfig(GOL, cudaFuncCachePreferShared); // Copy over initial game grid (Dim-1 threads) cudaMemcpy(d_grid, h_grid, bytes, cudaMemcpyHostToDevice); dim3 blockSize(BLOCK_SIZE_x, BLOCK_SIZE_y,1); int linGrid_x = (int)ceil(dim/(float)(BLOCK_SIZE_x-2)); int linGrid_y = (int)ceil(dim/(float)(BLOCK_SIZE_y-2)); dim3 gridSize(linGrid_x,linGrid_y,1); dim3 cpyBlockSize(BLOCK_SIZE_x,1,1); dim3 cpyGridRowsGridSize((int)ceil(dim/(float)cpyBlockSize.x),1,1); dim3 cpyGridColsGridSize((int)ceil((dim+2)/(float)cpyBlockSize.x),1,1); // Main game loop for (iter = 0; iter<maxIter; iter++) { ghostRows<<<cpyGridRowsGridSize, cpyBlockSize>>>(dim, d_grid); ghostCols<<<cpyGridColsGridSize, cpyBlockSize>>>(dim, d_grid); GOL<<<gridSize, blockSize>>>(dim, d_grid, d_newGrid); // Swap our grids and iterate again d_tmpGrid = d_grid; d_grid = d_newGrid; d_newGrid = d_tmpGrid; }//iter loop 
cudaError_t error = cudaGetLastError(); if(error != cudaSuccess) printf("CUDA error %s\n",cudaGetErrorString(error)); // Copy back results and sum cudaMemcpy(h_grid, d_grid, bytes, cudaMemcpyDeviceToHost); // Sum up alive cells and print results int total = 0; for (i = 1; i<=dim; i++) { for (j = 1; j<=dim; j++) { total += h_grid[i*(dim+2)+j]; } } printf("Total Alive: %d\n", total); cudaFree(d_grid); cudaFree(d_newGrid); free(h_grid); return 0; }
Changes
#define BLOCK_SIZE_x 32 #define BLOCK_SIZE_y 16
A square block size is not always best performance-wise; for greater flexibility we switch to separate x and y dimensions.
int iy = (blockDim.y -2) * blockIdx.y + threadIdx.y; int ix = (blockDim.x -2) * blockIdx.x + threadIdx.x; int id = iy * (dim+2) + ix;
Extra care has to be taken when determining the global thread IDs for the game kernel. Each thread block will contain two ghost rows and two ghost columns, which must be accounted for when calculating the global indices ix and iy.
int i = threadIdx.y; int j = threadIdx.x;
We will also define some convenience variables for the block thread indices.
// Declare the shared memory on a per block level __shared__ int s_grid[BLOCK_SIZE_y][BLOCK_SIZE_x];
The __shared__ qualifier is used in the device kernel to allocate shared memory arrays. The specified size is on a per block basis.
// Copy cells into shared memory if (ix <= dim+1 && iy <= dim+1) s_grid[i][j] = grid[id];
Each thread that is within the bounds of our grid, including ghost cells, will read one value from global memory into shared memory.
//Sync all threads in block __syncthreads();
Before we use any shared memory we need to ensure that each thread in the block has finished reading its value from global memory into shared memory by calling __syncthreads(); this is a thread block level barrier. Once all threads in a given block have reached the barrier, execution will continue.
if (iy <= dim && ix <= dim) { if(i != 0 && i !=blockDim.y-1 && j != 0 && j !=blockDim.x-1) {
Before calculating our new values we must check that our global thread IDs, iy and ix, do not extend into, or past, the ghost cells. The local thread IDs, i and j, must also be checked so that no calculations are performed in the per-block ghost cells.
// Get the number of neighbors for a given grid point numNeighbors = s_grid[i+1][j] + s_grid[i-1][j] //upper lower + s_grid[i][j+1] + s_grid[i][j-1] //right left + s_grid[i+1][j+1] + s_grid[i-1][j-1] //diagonals + s_grid[i-1][j+1] + s_grid[i+1][j-1]; int cell = s_grid[i][j];
The standard GOL calculations are performed as before, only now we read all cell values from shared memory instead of global memory.
dim3 blockSize(BLOCK_SIZE_x, BLOCK_SIZE_y,1); int linGrid_x = (int)ceil(dim/(float)(BLOCK_SIZE_x-2)); int linGrid_y = (int)ceil(dim/(float)(BLOCK_SIZE_y-2)); dim3 gridSize(linGrid_x,linGrid_y,1);
Slight modification is needed to set the correct block and grid size since we are now using two independent dimensions.
GOL-CUDA-Texture.cu
The texture cache example is provided for completeness but not fully documented. The texture cache is optimized for 2D spatial locality and does not pollute the L1 cache. Some programs may benefit from the smaller cache line compared to the L1 cache.
#include <stdio.h> #include <stdlib.h> #define SRAND_VALUE 1985 texture<int,2> gridTex; __global__ void GOL(int dim, int *newGrid) { int iy = blockDim.y * blockIdx.y + threadIdx.y; int ix = blockDim.x * blockIdx.x + threadIdx.x; int id = iy * dim + ix; int numNeighbors; float iyTex = (iy + 0.5f)/dim; float ixTex = (ix + 0.5f)/dim; float oneTex = 1.0f/dim; if(iy < dim && ix < dim) { //Get the number of neighbors for a given grid point numNeighbors = tex2D(gridTex, iyTex+oneTex, ixTex) //upper/lower + tex2D(gridTex, iyTex-oneTex, ixTex) + tex2D(gridTex, iyTex, ixTex+oneTex) //right/left + tex2D(gridTex, iyTex, ixTex-oneTex) + tex2D(gridTex, iyTex-oneTex, ixTex-oneTex) //diagonals + tex2D(gridTex, iyTex-oneTex, ixTex+oneTex) + tex2D(gridTex, iyTex+oneTex, ixTex-oneTex) + tex2D(gridTex, iyTex+oneTex, ixTex+oneTex); int cell = tex2D(gridTex, iyTex, ixTex); //Here we have explicitly all of the game rules if (cell == 1 && numNeighbors < 2) newGrid[id] = 0; else if (cell == 1 && (numNeighbors == 2 || numNeighbors == 3)) newGrid[id] = 1; else if (cell == 1 && numNeighbors > 3) newGrid[id] = 0; else if (cell == 0 && numNeighbors == 3) newGrid[id] = 1; else newGrid[id] = cell; } } int main(int argc, char* argv[]) { int i,j,iter; int* h_grid; //Grid on host cudaArray* d_grid; //Grid on device int* d_newGrid; //Second grid used on device only int dim = 1024; //Linear dimension of our grid - not counting ghost cells int maxIter = 1<<10; //Number of game steps size_t bytes = sizeof(int)*dim*dim; //Allocate host Grid used for initial setup and read back from device h_grid = (int*)malloc(bytes); //Allocate device grids cudaMallocArray(&d_grid, &gridTex.channelDesc, dim, dim); cudaMalloc(&d_newGrid, bytes); //Assign initial population randomly srand(SRAND_VALUE); for(i = 0; i<dim; i++) { for(j = 0; j<dim; j++) { h_grid[i*dim+j] = rand() % 2; } } //Copy over initial game grid (Dim-1 threads) cudaMemcpyToArray (d_grid, 0, 0, h_grid, bytes, cudaMemcpyHostToDevice); cudaBindTextureToArray(gridTex, d_grid); gridTex.normalized = true; gridTex.addressMode[0] = cudaAddressModeWrap; gridTex.addressMode[1] = cudaAddressModeWrap; dim3 dimBlock(8,8); int linGrid = (int)ceil(dim/(float)dimBlock.x); dim3 dimGrid(linGrid,linGrid); //Main game loop for (iter = 0; iter<maxIter; iter++) { GOL<<<dimGrid,dimBlock>>>(dim, d_newGrid); //Swap our grids and iterate again cudaMemcpyToArray (d_grid, 0, 0, d_newGrid, bytes, cudaMemcpyDeviceToDevice); }//iter loop //Copy back results and sum cudaMemcpy(h_grid, d_newGrid, bytes, cudaMemcpyDeviceToHost); //Sum up alive cells and print results int total = 0; for (i = 0; i<dim; i++) { for (j = 0; j<dim; j++) { total += h_grid[i*dim+j]; } } printf("Total Alive: %d\n", total); cudaFree(d_grid); cudaFree(d_newGrid); free(h_grid); return 0; }
This tutorial shows a minimal conversion from our vector addition CPU code to PGI CUDA Fortran. Consider this a CUDA Fortran ‘Hello World’. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov
vecAdd.cuf
module kernel contains ! CUDA kernel. Each thread takes care of one element of c attributes(global) subroutine vecAdd_kernel(n, a, b, c) integer, value :: n real(8), device :: a(n), b(n), c(n) integer :: id ! Get our global thread ID id = (blockidx%x-1)*blockdim%x + threadidx%x ! Make sure we do not go out of bounds if (id <= n) then c(id) = a(id) + b(id) endif end subroutine vecAdd_kernel end module kernel program main use cudafor use kernel type(dim3) :: blockSize, gridSize real(8) :: sum integer :: i ! Size of vectors integer :: n = 100000 ! Host input vectors real(8),dimension(:),allocatable :: h_a real(8),dimension(:),allocatable :: h_b !Host output vector real(8),dimension(:),allocatable :: h_c ! Device input vectors real(8),device,dimension(:),allocatable :: d_a real(8),device,dimension(:),allocatable :: d_b !Host output vector real(8),device,dimension(:),allocatable :: d_c ! Allocate memory for each vector on host allocate(h_a(n)) allocate(h_b(n)) allocate(h_c(n)) ! Allocate memory for each vector on GPU allocate(d_a(n)) allocate(d_b(n)) allocate(d_c(n)) ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2 do i=1,n h_a(i) = sin(i*1D0)*sin(i*1D0) h_b(i) = cos(i*1D0)*cos(i*1D0) enddo ! Implicit copy of host vectors to device d_a = h_a(1:n) d_b = h_b(1:n) ! Number of threads in each thread block blockSize = dim3(1024,1,1) ! Number of thread blocks in grid gridSize = dim3(ceiling(real(n)/real(blockSize%x)) ,1,1) ! Execute the kernel call vecAdd_kernel<<<gridSize, blockSize>>>(n, d_a, d_b, d_c) ! Implicit copy of device array to host h_c = d_c(1:n) ! Sum up vector c and print result divided by n, this should equal 1 within error sum = 0.0; do i=1,n sum = sum + h_c(i) enddo sum = sum/real(n) print *, 'final result: ', sum ! Release device memory deallocate(d_a) deallocate(d_b) deallocate(d_c) ! Release host memory deallocate(h_a) deallocate(h_b) deallocate(h_c) end program main
Changes
Kernel
The kernel is the heart of our CUDA code. When we launch a kernel we specify the number of threads per block (blockdim) and the number of blocks per grid (griddim). The total number of threads is (blockdim) * (griddim). Each thread evaluates one copy of the kernel.
module kernel contains !CUDA kernel. Each thread takes care of one element of c attributes(global) subroutine vecAdd_kernel(n, a, b, c) integer, value :: n real(8), device :: a(n), b(n), c(n) integer :: id !Get our global thread ID id = (blockidx%x-1)*blockdim%x + threadidx%x !Make sure we do not go out of bounds if (id <= n) then c(id) = a(id) + b(id) endif end subroutine vecAdd_kernel end module kernel
Let’s take a look at what makes up this simple kernel.
module kernel contains
end module kernel
The interface for the kernel subroutine must be explicitly known to the calling program, one way of achieving this is placing the subroutine in a module.
attributes(global) subroutine vecAdd_kernel(n, a, b, c)
The global attribute specifies this is a CUDA kernel, otherwise normal Fortran subroutine syntax is used.
!Get our global thread ID id = (blockidx%x-1)*blockdim%x + threadidx%x
We calculate the thread's global ID using CUDA Fortran provided derived data types. blockidx contains the block's position in the grid, ranging from 1 to griddim. threadidx is the thread's index inside of its associated block, ranging from 1 to blockdim.
if (id <= n) then
Unless we have an array length divisible by our blockdim, the number of threads launched will not match the number of valid array elements. To avoid overstepping our array we simply test that our global thread ID does not exceed the length of our array.
c(id) = a(id) + b(id)
The thread ID is used to index the arrays that reside in global device memory. Each thread will load a value from a and b and write the sum to c.
Modules
use cudafor use kernel
cudafor is a PGI-provided module that supplies some of the CUDA-specific derived types, such as dim3. The kernel module we created must also be included.
Array Pointers
!Host input vectors real(8),dimension(:),allocatable :: h_a real(8),dimension(:),allocatable :: h_b !Host output vector real(8),dimension(:),allocatable :: h_c !Device input vectors real(8),device,dimension(:),allocatable :: d_a real(8),device,dimension(:),allocatable :: d_b !Host output vector real(8),device,dimension(:),allocatable :: d_c
With the host CPU and GPU having separate memory spaces we must have two sets of pointers, one set for our host arrays and one set for our device arrays. Here we use the h_ and d_ prefix to differentiate them. Notice that the GPU pointers have the device attribute.
Allocate device memory
!Allocate memory for each vector on GPU allocate(d_a(n)) allocate(d_b(n)) allocate(d_c(n))
Given a pointer with the device attribute, memory will be allocated in the GPU's global memory space.
Copy to device
!Implicit copy of host vectors to device d_a = h_a(1:n) d_b = h_b(1:n)
After we initialize the data on the host we must copy it over to the device; to do so we can use the syntax above. This is a convenient way of doing things, but care must be taken: the operation can be very time consuming as we are initiating a DMA transfer of data from the host memory, over the PCI bus, to the device memory.
Thread mapping
!Number of threads in each thread block blockSize = dim3(1024,1,1) !number of thread blocks in grid gridSize = dim3(ceiling(real(n)/real(blockSize%x)) ,1,1)
To launch our kernel we must specify the number of threads per block and the number of blocks in our grid. The maximum value of both depends on the device's compute capability. Normally the block size will be chosen based upon kernel memory requirements and the grid size calculated such that enough threads are launched to cover the kernel domain. dim3 is a CUDA Fortran provided data type that has 3 dimensions; in this case we are dealing with a one-dimensional block and grid, so we specify a dimensionality of 1 for the other two dimensions.
Launch kernel
call vecAdd_kernel<<< gridSize, blockSize >>>(n, d_a, d_b, d_c)
We launch our kernel with a modified Fortran subroutine syntax that lets us specify the grid and block sizes. Kernel calls are non-blocking.
Copy to host
!Implicit copy of device array to host h_c = d_c(1:n)
The implicit copy from the device to the host will block until the kernel is completed.
Release device memory
deallocate(d_a) deallocate(d_b) deallocate(d_c)
Device memory is deallocated in the same programmatic manner as host allocated arrays.
Compiling
$ module load cudatoolkit $ ftn vecAdd.cuf -o vecAdd.out
Running
$ aprun ./vecAdd.out final result: 1.000000
For this tutorial we will cover two different CUDA Fortran versions of our Game of Life program. The first will strictly use global memory and is a straightforward GPU port of our program; the second introduces the use of shared memory. For a description of the Game of Life and the original CPU code please see the tutorial here. The full source can be viewed or downloaded at the OLCF GitHub. Please direct any questions or comments to help@nccs.gov
GOL.cuf
module kernels contains attributes(global) subroutine ghostCols_kernel(dim, grid) integer,value :: dim integer,device,dimension(dim+2,dim+2) :: grid integer :: id !We want id ∈ [2,dim+1] !Get Global thread ID id = (blockIdx%x-1) * blockDim%x + threadIdx%x + 1 if(id < dim+2) then grid(id, dim+2) = grid(id, 2) !Copy first real column to right most ghost column grid(id, 1) = grid(id, dim+1) !Copy last real column to left most ghost column endif end subroutine ghostCols_kernel attributes(global) subroutine ghostRows_kernel(dim, grid) integer,value :: dim integer,device,dimension(dim+2,dim+2) :: grid integer :: id !We want id ∈ [1,dim+2] !Get Global thread ID id = (blockIdx%x-1) * blockDim%x + threadIdx%x if(id <= dim+2) then grid(dim+2, id) = grid(2, id) !Copy first real row to bottom ghost row grid(1, id) = grid(dim+1 ,id) !Copy last real row to top ghost row endif end subroutine ghostRows_kernel attributes(global) subroutine GOL_kernel(dim, grid, newGrid) integer,value :: dim integer,device,dimension(dim+2,dim+2) :: grid, newGrid integer :: idx, idy, id, numNeighbors, cell !Get global thread IDs idx = (blockIdx%x-1) * blockDim%x + threadIdx%x + 1 idy = (blockIdx%y-1) * blockDim%y + threadIdx%y + 1 if(idx < dim+2 .AND. idy < dim+2) then numNeighbors = grid(idy,idx+1) + grid(idy,idx-1) & !right & left + grid(idy+1,idx) + grid(idy-1,idx) & !upper & lower + grid(idy+1,idx+1) + grid(idy-1,idx-1) & !diagonals + grid(idy-1,idx+1) + grid(idy+1,idx-1) cell = grid(idy,idx) !Here we have explicitly all of the game rules if(cell == 1 .AND. numNeighbors < 2) then newGrid(idy,idx) = 0 elseif(cell == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then newGrid(idy,idx) = 1 elseif(cell == 1 .AND. numNeighbors > 3) then newGrid(idy,idx) = 0 elseif(cell == 0 .AND. 
numNeighbors == 3) then newGrid(idy,idx) = 1 else newGrid(idy,idx) = cell endif endif end subroutine GOL_kernel end module kernels program main use cudafor use kernels implicit none integer :: i,j,iter,seed(8),numNeighbors,total real :: randm !Linear game grid dimension integer :: dim = 1024 !Number of game iterations integer :: maxIter = 2**10 integer :: blockDim = 16 type(dim3) :: cpyBlockSize, cpyColsGridSize, cpyRowsGridSize, blockSize, gridSize !Host game grid array integer,dimension(:,:),allocatable :: h_grid !Device game grid arrays integer,device,dimension(:,:),allocatable :: d_grid, d_newGrid !Allocate square grid of (dim+2)^2 elements on host, 2 added for ghost cells allocate(h_grid(dim+2,dim+2)) !Allocate grid on device allocate(d_grid(dim+2,dim+2)) allocate(d_newGrid(dim+2,dim+2)) !Assign initial population randomly seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/) call random_seed(PUT=seed) do j=1,dim do i=1,dim call random_number(randm) h_grid(i,j) = nint(randm) enddo enddo !Implicit copy of host vectors to device d_grid = h_grid !Set the number of threads per block and blocks per grid cpyBlockSize = dim3(blockDim, 1, 1) cpyColsGridSize = dim3(ceiling(real(dim)/real(cpyBlockSize%x)), 1, 1) cpyRowsGridSize = dim3(ceiling(real(dim+2)/real(cpyBlockSize%x)), 1, 1) blockSize = dim3(blockDim, blockDim, 1) gridSize = dim3(ceiling(real(dim)/real(blockSize%x)), & ceiling(real(dim)/real(blockSize%y)), 1) !Main game loop do iter=1,maxIter !device pointers are not supported currently so manually switch grid/newGrid if(MOD(iter,2) == 1) then call ghostCols_kernel<<<cpyColsGridSize, cpyBlockSize>>>(dim, d_grid) call ghostRows_kernel<<<cpyRowsGridSize, cpyBlockSize>>>(dim, d_grid) call GOL_kernel<<<gridSize,blockSize>>>(dim, d_grid, d_newGrid) else call ghostCols_kernel<<<cpyColsGridSize, cpyBlockSize>>>(dim, d_newGrid) call ghostRows_kernel<<<cpyRowsGridSize, cpyBlockSize>>>(dim, d_newGrid) call GOL_kernel<<<gridSize,blockSize>>>(dim, d_newGrid, d_grid) endif enddo !Implicit copy of device vector back to host h_grid = d_grid !Sum up alive cells and print results total = 0 do j=2,dim+1 do i=2,dim+1 total = total + h_grid(i,j) enddo enddo print *, "Total Alive", total !Release memory deallocate(h_grid) deallocate(d_grid) deallocate(d_newGrid) end program
Changes to GOL.cuf
attributes(global) subroutine ghostCols_kernel(dim, grid) integer,value :: dim integer,device,dimension(dim+2,dim+2) :: grid integer :: id !We want id ∈ [2,dim+1] !Get Global thread ID id = (blockIdx%x-1) * blockDim%x + threadIdx%x + 1 if(id < dim+2) then grid(id, dim+2) = grid(id, 2) !Copy first real column to right most ghost column grid(id, 1) = grid(id, dim+1) !Copy last real column to left most ghost column endif end subroutine ghostCols_kernel attributes(global) subroutine ghostRows_kernel(dim, grid) integer,value :: dim integer,device,dimension(dim+2,dim+2) :: grid integer :: id !We want id ∈ [1,dim+2] !Get Global thread ID id = (blockIdx%x-1) * blockDim%x + threadIdx%x if(id <= dim+2) then grid(dim+2, id) = grid(2, id) !Copy first real row to bottom ghost row grid(1, id) = grid(dim+1 ,id) !Copy last real row to top ghost row endif end subroutine ghostRows_kernel
The ghostRows_kernel and ghostCols_kernel device subroutines are used to fill in our ghost rows and columns in device global memory. Each thread is responsible for copying two integers.
attributes(global) subroutine GOL_kernel(dim, grid, newGrid) integer,value :: dim integer,device,dimension(dim+2,dim+2) :: grid, newGrid integer :: idx, idy, id, numNeighbors, cell !Get global thread IDs idx = (blockIdx%x-1) * blockDim%x + threadIdx%x + 1 idy = (blockIdx%y-1) * blockDim%y + threadIdx%y + 1 if(idx < dim+2 .AND. idy < dim+2) then numNeighbors = grid(idy,idx+1) + grid(idy,idx-1) & !right & left + grid(idy+1,idx) + grid(idy-1,idx) & !upper & lower + grid(idy+1,idx+1) + grid(idy-1,idx-1) & !diagonals + grid(idy-1,idx+1) + grid(idy+1,idx-1) cell = grid(idy,idx) !Here we have explicitly all of the game rules if(cell == 1 .AND. numNeighbors < 2) then newGrid(idy,idx) = 0 elseif(cell == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then newGrid(idy,idx) = 1 elseif(cell == 1 .AND. numNeighbors > 3) then newGrid(idy,idx) = 0 elseif(cell == 0 .AND. numNeighbors == 3) then newGrid(idy,idx) = 1 else newGrid(idy,idx) = cell endif endif end subroutine GOL_kernel
The GOL kernel performs the majority of our work. We use a two-dimensional block and grid size to better fit our problem's geometry. Once the thread calculates its global ID and tests that it is not out of bounds it will read one cell value from global memory, calculate the new value, and write the result back to global memory.
!Host game grid array integer,dimension(:,:),allocatable :: h_grid !Device game grid arrays integer,device,dimension(:,:),allocatable :: d_grid, d_newGrid
Currently CUDA Fortran does not support device pointers and so we change our array attribute from pointer to allocatable.
!Allocate grid on device allocate(d_grid(dim+2,dim+2)) allocate(d_newGrid(dim+2,dim+2))
Given an array with the device attribute, memory will be allocated in the GPU's global memory space.
!Implicit copy of host vectors to device d_grid = h_grid
By setting the device array equal to a host array a host to device DMA copy is implicitly initiated.
!Set the number of threads per block and blocks per grid cpyBlockSize = dim3(blockDim, 1, 1) cpyColsGridSize = dim3(ceiling(real(dim)/real(cpyBlockSize%x)), 1, 1) cpyRowsGridSize = dim3(ceiling(real(dim+2)/real(cpyBlockSize%x)), 1, 1) blockSize = dim3(blockDim, blockDim, 1) gridSize = dim3(ceiling(real(dim)/real(blockSize%x)), & ceiling(real(dim)/real(blockSize%y)), 1)
With our block size specified we must calculate the smallest integer number of blocks necessary to completely cover our game board; this is also of the CUDA Fortran data type dim3.
!Main game loop do iter=1,maxIter !device pointers are not supported currently so manually switch grid/newGrid if(MOD(iter,2) == 1) then call ghostCols_kernel<<<cpyColsGridSize, cpyBlockSize>>>(dim, d_grid) call ghostRows_kernel<<<cpyRowsGridSize, cpyBlockSize>>>(dim, d_grid) call GOL_kernel<<<gridSize,blockSize>>>(dim, d_grid, d_newGrid) else call ghostCols_kernel<<<cpyColsGridSize, cpyBlockSize>>>(dim, d_newGrid) call ghostRows_kernel<<<cpyRowsGridSize, cpyBlockSize>>>(dim, d_newGrid) call GOL_kernel<<<gridSize,blockSize>>>(dim, d_newGrid, d_grid) endif enddo
Although we cannot simply switch our device grid pointers at the end of each game iteration, because CUDA Fortran does not support device pointers, we may switch which arguments we call our kernels with. This achieves the same effect with a minimal increase in code size.
!Implicit copy of device vector back to host h_grid = d_grid
After our kernel has completed we can implicitly copy the grid array from the device back to the host.
GOL-Shared.cuf
In the first example each thread read nine values from global memory and wrote back one value. The inefficiency of this approach is that each value that is read from global memory is used not only in calculating its own new value but also by its eight closest neighbors in calculating their new values. Although some global memory access penalties have been alleviated in the latest generation of cards with the addition of an L2 cache, you will find shared memory is still an incredibly important aspect of GPU programming. Shared memory provides a very fast memory space (comparable to register access) that is shared between threads of the same thread block. In the case where multiple threads in a particular thread block are accessing the same global memory item, shared memory may be used to reduce the number of global memory calls. A typical use for shared memory is as follows:
- Have each thread read from global memory into shared memory
- Synchronize threads to ensure all threads in a given block have finished reading from global memory
- Have each thread perform calculations using shared memory values
- Have each thread write result from shared memory back to device global memory
For detailed coverage of shared memory please see the CUDA C Programming Guide and CUDA Fortran Programming Guide.
module kernel contains attributes(global) subroutine ghostCols_kernel(dim, grid) integer,value :: dim integer,device,dimension(dim+2,dim+2) :: grid integer :: id !We want id ∈ [2,dim+1] !Get Global thread ID id = (blockIdx%x-1) * blockDim%x + threadIdx%x + 1 if(id < dim+2) then grid(id, dim+2) = grid(id, 2) !Copy first real column to right most ghost column grid(id, 1) = grid(id, dim+1) !Copy last real column to left most ghost column endif end subroutine ghostCols_kernel attributes(global) subroutine ghostRows_kernel(dim, grid) integer,value :: dim integer,device,dimension(dim+2,dim+2) :: grid integer :: id !We want id ∈ [1,dim+2] !Get Global thread ID id = (blockIdx%x-1) * blockDim%x + threadIdx%x if(id <= dim+2) then grid(dim+2, id) = grid(2, id) !Copy first real row to bottom ghost row grid(1, id) = grid(dim+1 ,id) !Copy last real row to top ghost row endif end subroutine ghostRows_kernel attributes(global) subroutine GOL_kernel(dim, grid, newGrid) integer,value :: dim integer,device,dimension(dim+2,dim+2) :: grid, newGrid integer :: idx, idy, id, i, j, numNeighbors, cell !Declare the shared memory on a per block level integer,shared :: s_grid(16, 16) !Get global thread IDs idx = (blockIdx%x-1) * (blockDim%x-2) + threadIdx%x idy = (blockIdx%y-1) * (blockDim%y-2) + threadIdx%y !Get local thread IDs i = threadIdx%y j = threadIdx%x !Copy cells into shared memory if (idx <= dim+2 .AND. idy <= dim+2) then s_grid(i,j) = grid(idy,idx) endif !Sync all threads in block call syncthreads() if (idx < dim+2 .AND. idy < dim+2) then if (i /= 1 .AND. i /=blockDim%y .AND. j /=1 .AND. j /=blockDim%x) then numNeighbors = s_grid(i,j+1) + s_grid(i,j-1) & !right & left + s_grid(i+1,j) + s_grid(i-1,j) & !upper & lower + s_grid(i+1,j+1) + s_grid(i-1,j-1) & !diagonals + s_grid(i-1,j+1) + s_grid(i+1,j-1) cell = s_grid(i,j) !Here we have explicitly all of the game rules if(cell == 1 .AND. numNeighbors < 2) then newGrid(idy,idx) = 0 elseif(cell == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then newGrid(idy,idx) = 1 elseif(cell == 1 .AND. numNeighbors > 3) then newGrid(idy,idx) = 0 elseif(cell == 0 .AND. 
numNeighbors == 3) then newGrid(idy,idx) = 1 else newGrid(idy,idx) = cell endif endif endif end subroutine GOL_kernel end module kernel program main use cudafor use kernel implicit none integer :: i,j,iter,seed(8),numNeighbors,total real :: randm !Linear game grid dimension integer :: dim = 1024 !Number of game iterations integer :: maxIter = 2**10 integer :: blockDim = 16 type(dim3) :: cpyBlockSize, cpyColsGridSize, cpyRowsGridSize, blockSize, gridSize !Host game grid array integer,dimension(:,:),allocatable :: h_grid !Device game grid array integer,device,dimension(:,:),allocatable :: d_grid, d_newGrid !Allocate square grid of (dim+2)^2 elements on host, 2 added for ghost cells allocate(h_grid(dim+2,dim+2)) !Allocate grid on device allocate(d_grid(dim+2,dim+2)) allocate(d_newGrid(dim+2,dim+2)) !Assign initial population randomly seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/) call random_seed(PUT=seed) do j=1,dim do i=1,dim call random_number(randm) h_grid(i,j) = nint(randm) enddo enddo !Implicit copy of host vectors to device d_grid = h_grid !Set the number of threads per block and blocks per grid cpyBlockSize = dim3(blockDim, 1, 1) cpyColsGridSize = dim3(ceiling(real(dim)/real(cpyBlockSize%x)), 1, 1) cpyRowsGridSize = dim3(ceiling(real(dim+2)/real(cpyBlockSize%x)), 1, 1) blockSize = dim3(blockDim-2, blockDim-2, 1) gridSize = dim3(ceiling(real(dim)/real(blockSize%x-2)), & ceiling(real(dim)/real(blockSize%y-2)), 1) !Main game loop do iter=1,maxIter !device pointers are not supported currently, so manually switch grid/newGrid if(MOD(iter,2) == 1) then call ghostCols_kernel<<<cpyColsGridSize, cpyBlockSize>>>(dim, d_grid) call ghostRows_kernel<<<cpyRowsGridSize, cpyBlockSize>>>(dim, d_grid) call GOL_kernel<<<gridSize,blockSize>>>(dim, d_grid, d_newGrid) else call ghostCols_kernel<<<cpyColsGridSize, cpyBlockSize>>>(dim, d_newGrid) call ghostRows_kernel<<<cpyRowsGridSize, cpyBlockSize>>>(dim, d_newGrid) call GOL_kernel<<<gridSize,blockSize>>>(dim, d_newGrid, d_grid) endif enddo !Implicit copy of device vector back to host h_grid = d_grid !Sum up alive cells and print results total = 0 do j=2,dim+1 do i=2,dim+1 total = total + h_grid(i,j) enddo enddo print *, "Total Alive", total !Release memory deallocate(h_grid) deallocate(d_grid) deallocate(d_newGrid) end program
Changes to GOL-Shared.cuf
!Declare the shared memory on a per block level integer,shared :: s_grid(16, 16)
The shared attribute is used in the device kernel to allocate shared memory arrays. The specified size is on a per-block basis.
!Get global thread IDs idx = (blockIdx%x-1) * (blockDim%x-2) + threadIdx%x idy = (blockIdx%y-1) * (blockDim%y-2) + threadIdx%y
Extra care has to be taken when determining the global thread IDs for the game kernel. Each thread block will contain two ghost rows and two ghost columns, which must be accounted for when calculating the global indices idx and idy.
!Get local thread IDs i = threadIdx%y j = threadIdx%x
We define some convenience variables for the block thread indices.
!Copy cells into shared memory if (idx <= dim+2 .AND. idy <= dim+2) then s_grid(i,j) = grid(idy,idx) endif
Each thread that is within the bounds of our grid, including ghost cells, will read one value from global memory into shared memory.
!Sync all threads in block call syncthreads()
Before we use any shared memory we need to ensure that each thread in the block has finished reading its value from global memory into shared memory by calling syncthreads; this is a thread block level barrier. Once all threads in a given block have reached the barrier, execution will continue.
if (idx < dim+2 .AND. idy < dim+2) then if (i /= 1 .AND. i /=blockDim%y .AND. j /=1 .AND. j /=blockDim%x) then
Before calculating our new values we must check that our global thread IDs, idx and idy, do not extend into, or past, the ghost cells. The local thread IDs, i and j, must also be checked so that no calculations are performed in the per-block ghost cells.
numNeighbors = s_grid(i,j+1) + s_grid(i,j-1) & !right & left + s_grid(i+1,j) + s_grid(i-1,j) & !upper & lower + s_grid(i+1,j+1) + s_grid(i-1,j-1) & !diagonals + s_grid(i-1,j+1) + s_grid(i+1,j-1) cell = s_grid(i,j)
The standard GOL calculations are performed as before, only now we read all cell values from shared memory instead of global memory.
blockSize = dim3(blockDim-2, blockDim-2, 1) gridSize = dim3(ceiling(real(dim)/real(blockSize%x-2)), & ceiling(real(dim)/real(blockSize%y-2)), 1)
Slight modification is needed to set the correct block and grid size since we are now using two independent dimensions.
The use of GPU acceleration can add a useful level of parallelization to your code. This tutorial will show how to parallelize a simple Monte Carlo operation using a GPU accelerator. The first code will be a simple serial implementation of the Monte Carlo simulation used to compute π. The next samples will be the same code but accelerated using CUDA, OpenACC, and then Thrust.
Serial
The codes use Monte Carlo methods to estimate π. To set up the estimate, randomly located points are generated within a 2×2 square which has a circle inscribed within it (think of a game of darts). The algorithm generates a large number of random points and checks whether the coordinates, x and y, of each point lie inside the circle, i.e. x²+y²≤1. The ratio, P, of points inside the circle to the total number of points tried is calculated. Using this ratio, the approximation of π can be computed by assuming that P approaches the value of the ratio of the area of the circle to the area of the square when the number of points, niter, is large. Thus we can say: P = (πr²)/(4r²) = π/4 and solve for π. To do this, P is multiplied by 4 and the result is an approximation of π. (Note: the more points generated (niter), the more accurate the approximation will be.)
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <time.h> void main(int argc, char* argv[]) { double niter = 100000; double x,y; int i; int count=0; double z; double pi; srand(time(NULL)); //main loop for (i=0; i<niter; ++i) { //get random points x = (double)random()/RAND_MAX; y = (double)random()/RAND_MAX; z = sqrt((x*x)+(y*y)); //check to see if point is in unit circle if (z<=1) { ++count; } } //p = 4(m/n) pi = ((double)count/(double)niter)*4.0; printf("Pi: %f\n", pi); }
To compile this code, run:
$ cc serialpi.c -o a.out
The resulting executable can either be run in an interactive session or via batch script. To launch an interactive session run:
$ qsub -I -A [PROJID] -l walltime=1:00,nodes=1
This executable can also be run via batch script. A sample batch script could look like this:
#!/bin/bash # Begin PBS directives #PBS -A <PROJID> #PBS -N pi #PBS -j oe #PBS -l walltime=0:01:00,nodes=1 #PBS -V # End PBS directives and begin shell commands cd $PBS_O_WORKDIR date aprun -n1 ./a.out
Save this in your project’s scratch directory as something like titan.pbs and copy your executable to $MEMBERWORK/<projid>.
To submit the script to the batch node, run
$ qsub titan.pbs
CUDA
Compute Unified Device Architecture (CUDA) is the parallel programming model developed by NVIDIA that extends C or Fortran code with features that target the GPU device. In CUDA, the programmer defines functions to be launched in many thousands or millions of parallel threads on the GPU. These functions are generally referred to as CUDA kernels or simply “kernels”. When writing these kernels, the programmer needs to keep in mind how many threads and blocks of threads they want launched on the GPU at runtime. A possible kernel for this code could look something like this:
__global__ void kernel(int* count_d, float* randomnums) { int i; double x,y,z; int tid = blockDim.x * blockIdx.x + threadIdx.x; i = tid; int xidx = 0, yidx = 0; xidx = (i+i); yidx = (xidx+1); x = randomnums[xidx]; y = randomnums[yidx]; z = ((x*x)+(y*y)); if (z<=1) count_d[tid] = 1; else count_d[tid] = 0; }
Take note of the __global__ decorator. This tells the nvcc compiler that this function is a CUDA kernel that will be called by host code and be run on the device. Also, note the variables blockDim.x, blockIdx.x, and threadIdx.x. These are built in variables that identify a thread’s place in the overall three-dimensional grid, similar to the concept of an MPI rank or OpenMP thread.
Another thing to keep in mind is the fact that the CPU (host) and GPU (device) have separate memory. This means that the host can't directly access memory on the device and vice versa. To overcome this, CUDA gives us a set of functions to transfer and allocate data on the GPU device's RAM. For example, in the Monte Carlo code, there needs to be a set of random numbers available on the device for the kernel to use, as well as an empty array to hold a 1 or 0, depending on whether or not the point is in the circle (see the Serial section for more information on the Monte Carlo specifics). So after the random numbers are generated and stored into an array, that array needs to be copied to the device. Also, the empty array needs to be allocated on the device. To do this we use two different CUDA functions. First, the empty array needs to be allocated on the device. The cudaMalloc() function gives us the ability to allocate memory on the GPU:
cudaError_t cudaMalloc(void** devPtr, size_t size)
Before this function is called, an int* named count_d needs to be created (Note that device pointer variables generally end with “_d”). Also, the programmer needs to determine the size of the array. In this case, since the kernel is only processing one point per thread, the number of threads can be used as the size (thread blocks * threads per thread block = total number of threads). Once this is done, the cudaMalloc() function can be called. For this code, it could look like this:
cudaMalloc((void**)&count_d, (blocks*threads)*sizeof(int));
Also, an array to house the random numbers (2 per thread this time) needs to be allocated on the device using cudaMalloc. This is done in the same fashion as before:
cudaMalloc((void**)&randomnums, (2*niter)*sizeof(float));
Once there is a place to contain the required random numbers, they can be generated. The CURAND library contains kernels that generate high quality random numbers using the GPU. CURAND needs a bit of setting up before it can generate the random numbers. First, a CURAND generator needs to be declared:
curandGenerator_t gen;
The next function creates the generator and declares its type.
curandStatus_t curandCreateGenerator(curandGenerator_t* generator, curandRngType_t rng_type);
The next one sets the seed.
curandStatus_t curandSetPseudoRandomGeneratorSeed(curandGenerator_t generator, unsigned long long seed);
Next, curandGenerateUniform() is the actual function that generates the random numbers.
curandStatus_t curandGenerateUniform(curandGenerator_t generator, float* outputPtr, size_t num);
Finally, the random number generator can be destroyed because there's no further need for it.
curandStatus_t curandDestroyGenerator(curandGenerator_t generator);
Using these definitions, create the random numbers as follows:
curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MRG32K3A); curandSetPseudoRandomGeneratorSeed(gen, 4294967296ULL^time(NULL)); curandGenerateUniform(gen, randomnums, (2*niter)); curandDestroyGenerator(gen);
Now the CUDA kernel needs to be launched on the device. Before this can be done, the programmer needs to decide how many threads per block and how many blocks to launch. In this case, 1000 threads/block and 100 blocks should work fine (these numbers were picked arbitrarily; the programmer can set their own). Once this is determined, the kernel can be launched and passed the correct device pointers.
kernel <<<blocks, threads>>> (count_d, randomnums);
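As the comment in the full code below notes, niter is assumed to equal threads * blocks (one random point per thread). A hedged alternative, not used in cpi.cu, is to derive the block count from niter so the two values cannot drift apart:

// Illustrative only: round up so every one of the niter points gets a thread
int threads = 1000;
int blocks = (niter + threads - 1) / threads;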
Once the kernel has finished, the count_d array needs to be copied back to the host. Similar to the standard C function memcpy(), we use the CUDA function cudaMemcpy() which looks like this:
cudaError_t cudaMemcpy(void * dst, const void * src, size_t count, enum cudaMemcpyKind kind)
Before this can be done, the programmer needs to declare an empty array on the host that will contain the data that is copied back with cudaMemcpy. Also, make sure the amount of data copied back (size_t count) is the right size (the same size as the cudaMalloc’d array). Finally, cudaMemcpy needs to know in what direction the copy is going to be performed. In this particular instance, the array is on the device and needs to be copied to the host, so the final parameter for the cudaMemcpy needs to be the built in “cudaMemcpyDeviceToHost”. The completed cudaMemcpy line looks like this:
cudaMemcpy(count, count_d, blocks*threads*sizeof(int), cudaMemcpyDeviceToHost);
Now that all the CUDA code is set up, the rest of the code looks like this:
cpi.cu:
#include <stdio.h> #include <stdlib.h> #include <cuda.h> #include <curand.h> #include <time.h> //Declare the CUDA kernel __global__ void kernel(int* count_d, float* randomnums) { int i; double x,y,z; //Find the overall ID of the thread int tid = blockDim.x * blockIdx.x + threadIdx.x; i = tid; int xidx = 0, yidx = 0; //Do the MonteCarlo! xidx = (i+i); yidx = (xidx+1); //Get the random x,y points x = randomnums[xidx]; y = randomnums[yidx]; z = ((x*x)+(y*y)); if (z<=1) count_d[tid] = 1; else count_d[tid] = 0; } //Used to check if there are any errors launching the kernel void CUDAErrorCheck() { cudaError_t error = cudaGetLastError(); if (error != cudaSuccess) { printf("CUDA error : %s (%d)\n", cudaGetErrorString(error), error); exit(0); } } int main(int argc,char* argv[]) { //NOTE: if threads and/or blocks is changed, niter needs to be changed to reflect //that change (niter=threads*blocks) int niter = 100000; float *randomnums; double pi; //Allocate the array for the random numbers cudaMalloc((void**)&randomnums, (2*niter)*sizeof(float)); //Use CuRand to generate an array of random numbers on the device int status; curandGenerator_t gen; status = curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MRG32K3A); status |= curandSetPseudoRandomGeneratorSeed(gen, 4294967296ULL^time(NULL)); status |= curandGenerateUniform(gen, randomnums, (2*niter)); status |= curandDestroyGenerator(gen); //Check to see if there was any problem launching the CURAND kernels and generating //the random numbers on the device if (status != CURAND_STATUS_SUCCESS) { printf("CuRand Failure\n"); exit(EXIT_FAILURE); } //Threads per thread block to be launched int threads = 1000; //Number of thread blocks launched int blocks = 100; int* count_d; int *count = (int*)malloc(blocks*threads*sizeof(int)); unsigned int reducedcount = 0; //Allocate the array to hold a value (1,0) whether the point in is the circle (1) or not (0) cudaMalloc((void**)&count_d, (blocks*threads)*sizeof(int)); CUDAErrorCheck(); //Launch the kernel kernel <<<blocks, threads>>> (count_d, randomnums); //Acts as a kind of code Barrier until the kernel is finished. Kernel calls are nonblocking so //the code would continue regardless of whether the kernel succeeded or not without the Sync cudaDeviceSynchronize(); CUDAErrorCheck(); //Copy the resulting array back cudaMemcpy(count, count_d, blocks*threads*sizeof(int), cudaMemcpyDeviceToHost); int i = 0; //Reduce array into int for(i = 0; i<niter; i++) reducedcount += count[i]; //Free the cudaMalloc()'d arrays cudaFree(randomnums); cudaFree(count_d); free(count); //Find the ratio pi = ((double)reducedcount/niter)*4.0; printf("Pi: %f\n", pi); return 0; }
Compiling
Before this can be compiled, the CUDA toolkit needs to be loaded. To do this, run:
$ module load cudatoolkit
Now that the module is loaded, the nvcc compiler wrapper is available. To compile this run:
$ nvcc -arch=sm_35 -lcurand cpi.cu -o a.out
Here, the -arch=sm_35 switch is used to tell the compiler that the target device Compute Capability is 3.5. This allows the programmer to do more things in the CUDA kernel, for instance use double-precision arithmetic. The -lcurand switch tells the compiler to link in the CURAND library, giving the code access to the CURAND functions.
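If you are unsure what compute capability the target device has, one way to check (a small standalone sketch, not part of the tutorial code) is to query it with cudaGetDeviceProperties; on Titan's K20X GPUs this should report 3.5:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    // Query device 0 and print its compute capability (major.minor)
    cudaGetDeviceProperties(&prop, 0);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}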
Then copy the executable (in this case, a.out) to the scratch directory (see the serial section for more information on finding the scratch directory). It can be run either from an interactive job or by submitting a batch job to the queue. To run it from an interactive job, start one with:
$ qsub -I -A [PROJID] -l walltime=1:00,nodes=1
Before you run this, change [PROJID] to the appropriate project ID. Once the job starts, change to the scratch directory and execute the command
$ aprun -n1 ./a.out
Alternatively, this can be submitted as a batch job to the queue. To do this, a batch script needs to be written and submitted. A sample batch script could look like this:
#!/bin/bash
# Begin PBS directives
#PBS -A [PROJID]
#PBS -N pi
#PBS -j oe
#PBS -l walltime=0:01:00,nodes=1
#PBS -V
# End PBS directives and begin shell commands
cd $PBS_O_WORKDIR
date
aprun -n 1 ./a.out
Again, before this script can be submitted, the appropriate project ID needs to be added instead of [PROJID]. Save this file as something unique, like titan.pbs, and submit it by running:
$ qsub titan.pbs
OpenACC
OpenACC provides an alternative, pragma-based accelerator programming model that can be easier to use than CUDA. These pragmas are similar to OpenMP's but launch threads on the GPU instead of the CPU. Parallelizing the Monte Carlo code with OpenACC is a fairly straightforward procedure. The programmer must first identify the best region to parallelize. In this code, the most parallelizable region is the main for loop, which actually performs the Monte Carlo simulation. The main loop can be broken up into two for loops: one loop that computes x*x + y*y for each point and stores the result in an array, and a second loop that reduces that array into a single integer count of the points that fall inside the circle. The OpenACC pragma should look something like this:
#pragma acc parallel copy(randomnums[0:200000]) copyout(output[0:100000]) create(x,y,z)
Here, "acc" tells the compiler that this is an OpenACC pragma, and the parallel construct is used to launch multiple threads on the GPU. The copy(randomnums[0:200000]) clause copies the randomnums array onto the device. The copyout(output[0:100000]) clause allocates the output array on the device and copies it back to the host once the kernel is finished. Finally, the create(x,y,z) clause creates the three variables x, y, and z on the device. The parallel region looks something like this:
#pragma acc parallel copy(randomnums[0:200000]) copyout(output[0:100000]) create(x,y,z) { for (i = 0; i<100000; i++) { int temp = i+i; x = randomnums[temp]; y = randomnums[temp+1]; z = ((x*x)+(y*y)); output[i] = z; } for (i = 0; i<100000; i++) { if (output[i]<=1) { ++count; } else continue; } }
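As an aside, the two loops could also be fused into a single loop with an OpenACC sum reduction, which avoids the intermediate output array entirely. The sketch below is a hypothetical variant, not part of the original tutorial, reusing the randomnums array from the code above:

int count = 0;
#pragma acc parallel loop copyin(randomnums[0:200000]) reduction(+:count)
for (int i = 0; i < 100000; i++)
{
    // Each iteration consumes one (x,y) pair and tests it against the unit circle
    double x = randomnums[2*i];
    double y = randomnums[2*i+1];
    if (x*x + y*y <= 1.0)
        count++;
}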
The complete code looks like this:
accpi.cpp
#include <stdio.h>
#include <time.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
    double niter = 100000;
    double x,y;
    int count=0;
    double z;
    double pi;
    int a = 0;
    time_t t;
    double randomnums[200000];
    double output[100000];
    srand48(((unsigned)time(&t)));
    //Fill the random number array (two numbers per sample point)
    for (a = 0; a < 2*niter; a++)
        randomnums[a] = (double)drand48();
    int i = 0;
    #pragma acc parallel copy(randomnums[0:200000]) copyout(output[0:100000]) create(x,y,z)
    {
        for (i = 0; i<100000; i++)
        {
            int temp = i+i;
            x = randomnums[temp];
            y = randomnums[temp+1];
            z = ((x*x)+(y*y));
            output[i] = z;
        }
        for (i = 0; i<100000; i++)
        {
            if (output[i]<=1)
            {
                ++count;
            }
            else
                continue;
        }
    }
    pi = ((double)count/(double)niter)*4.0;
    printf("Pi: %f\n", pi);
}
To compile this, the CUDA toolkit module needs to be loaded. To load it, run module load cudatoolkit. Then run
$ CC -acc -Minfo=all -ta=nvidia:5.0,cc3x accpi.cpp -o a.out
to compile the code into an executable. The -Minfo=all switch tells the PGI compiler to print out various information about how it is compiling the OpenACC code; this can be a very useful tool for debugging OpenACC code. The -ta=nvidia:5.0,cc3x switch specifies which CUDA version to use and which compute architecture to target. Some sample output from the -Minfo=all switch is:
main:
     21, Generating copy(randomnums[0:])
         Generating copyout(output[0:])
         Generating create(x)
         Generating create(y)
         Generating create(z)
         Accelerator kernel generated
         22, #pragma acc loop vector(256) /* threadIdx.x */
         30, #pragma acc loop vector(256) /* threadIdx.x */
     21, Generating NVIDIA code
         Generating compute capability 3.0 binary
     22, Loop is parallelizable
     30, Loop is parallelizable
     34, Sum reduction generated for count
To compile this code with the Cray compiler, load the PrgEnv-cray module as well as the craype-accel-nvidia35 module
$ module swap PrgEnv-pgi PrgEnv-cray $ module load craype-accel-nvidia35
To compile to an executable, run:
$ CC -hpragma=acc accpi.cpp -o a.out
Running:
Then copy the executable (in this case, a.out) to the scratch directory. Like before, it can be run either from an interactive job or by submitting a batch job to the queue. To run it from an interactive job, start one with:
$ qsub -I -A [PROJID] -l walltime=1:00,nodes=1
Before you run this, change [PROJID] to the appropriate project ID. Once the job starts, change to the scratch directory and execute the command
$ aprun -n1 ./a.out
Alternatively, this can be submitted as a batch job to the queue. To do this, a batch script needs to be written and submitted. A sample batch script could look like this:
#!/bin/bash
# Begin PBS directives
#PBS -A [PROJID]
#PBS -N pi
#PBS -j oe
#PBS -l walltime=0:01:00,nodes=1
#PBS -V
# End PBS directives and begin shell commands
cd $PBS_O_WORKDIR
date
aprun -n 1 -N 1 ./a.out
Again, before this script can be submitted, the appropriate project ID needs to be added instead of [PROJID]. Save this file as something unique, like titan.pbs, and submit it by running:
$ qsub titan.pbs
Thrust
Finally, this operation can be accelerated with the Thrust library. Thrust is a template library that allows programmers to write parallel code in the same way that they would use the C++ Standard Library for serial code. Because of this, the Thrust-parallelized code is structured quite differently. For instance, the Monte Carlo kernel from the previous CUDA example is replaced by an equivalent function object (functor) inside a struct.
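To give a flavor of what "in the same way that they would use the C++ Standard Library" means, here is a minimal standalone Thrust example, separate from the pi code, that fills a vector on the GPU and sums it with an STL-style algorithm call:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <cstdio>

int main(void)
{
    // A vector that lives in GPU memory, holding one thousand 1s
    thrust::device_vector<int> v(1000, 1);
    // Sum it on the GPU, much as std::accumulate would be used on the host
    int sum = thrust::reduce(v.begin(), v.end(), 0, thrust::plus<int>());
    std::printf("sum = %d\n", sum);   // prints 1000
    return 0;
}

The tutorial's Monte Carlo functor, shown next, follows the same pattern.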
struct montecarlo : public thrust::unary_function<unsigned int, float>
{
    //use the __host__ __device__ decorators so that
    //thrust knows that this is a gpu function.
    //think of operator() as similar to a CUDA kernel
    __host__ __device__
    float operator()(unsigned int thread_id)
    {
        unsigned int seed;
        //set the seed
        seed = 49868^thread_id;
        int i;
        float x,y,z,sum=0.0;
        //define the random number generator
        thrust::default_random_engine rng(seed);
        thrust::uniform_real_distribution<float> u01(0,1);
        for (i = 0; i < 1000; i++)
        {
            //get the random numbers
            x = u01(rng);
            y = u01(rng);
            z = ((x*x)+(y*y));
            if (z<=1)
                sum += 1;
        }
        return sum;
    }
};
First, the montecarlo struct is defined and given a Thrust function template to adhere to.
struct montecarlo : public thrust::unary_function<unsigned int, float>
This Thrust template says that the function in the struct will be a standard Thrust unary function that takes an unsigned int as input (the thread number, which Thrust supplies) and returns a floating point number. Next, the programmer defines the operator function
__host__ __device__ float operator()(unsigned int thread_id)
and gives it the __host__ and __device__ decorators so that the compiler knows it can be called from both host and device code. The next thing the programmer should take note of is the use of the Thrust random number generator.
thrust::default_random_engine rng(seed);
thrust::uniform_real_distribution<float> u01(0,1);
Since it is not possible to call host functions (such as the C library's rand()) from a GPU kernel, this step is necessary. Using Thrust functions like this allows the programmer to accelerate the serial code more easily and quickly, without having to use more complex and specialized APIs such as CURAND.
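As an aside, another common pattern (an assumption on our part, not used in the original tutorial) is to give every thread the same seed and instead call the engine's discard() method to jump each thread to an independent point in the random sequence; inside operator() that would replace the two generator lines with something like:

// Same seed for every thread; skip ahead past the numbers used by
// lower-numbered threads (each thread draws 2*1000 uniforms below).
thrust::default_random_engine rng(49868);
rng.discard(2000ULL * thread_id);
thrust::uniform_real_distribution<float> u01(0,1);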
Finally, this kernel is launched by way of another Thrust function, transform_reduce(), which is defined as:
OutputType thrust::transform_reduce(InputIterator first, InputIterator last, UnaryFunction unary_op, OutputType init, BinaryFunction binary_op)
Here, the thrust::transform_reduce() function is used because it runs every number between first and last through our predefined functor (unary_op) and then reduces the resulting transformed answers into one final float according to binary_op, starting from the initial value init. first and last need to be defined as counting iterators, thrust::counting_iterator<unsigned int>(first) and thrust::counting_iterator<unsigned int>(last). The full call looks like this:
count = thrust::transform_reduce(thrust::counting_iterator<unsigned int>(0),
                                 thrust::counting_iterator<unsigned int>(niter),
                                 montecarlo(), 0.0f, thrust::plus<float>());
So with that in mind, the full code can look something like this:
thrust.cu
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <thrust/generate.h>
#include <thrust/random.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>
#include <thrust/transform_reduce.h>
#include <cmath>

struct montecarlo : public thrust::unary_function<unsigned int, float>
{
    //use the __host__ __device__ decorators so that
    //thrust knows that this is a gpu function.
    //think of operator() as similar to a CUDA kernel
    __host__ __device__
    float operator()(unsigned int thread_id)
    {
        unsigned int seed;
        //set the seed
        seed = 49868^thread_id;
        int i;
        float x,y,z,sum=0.0;
        //define the random number generator
        thrust::default_random_engine rng(seed);
        thrust::uniform_real_distribution<float> u01(0,1);
        for (i = 0; i < 1000; i++)
        {
            //get the random numbers
            x = u01(rng);
            y = u01(rng);
            z = ((x*x)+(y*y));
            if (z<=1)
                sum += 1;
        }
        return sum;
    }
};

int main(int argc, char* argv[])
{
    float pi;
    float count = 0.0;
    int niter = 1000;
    //each of the niter threads samples 1000 points, so niter*1000 points in total
    count = thrust::transform_reduce(thrust::counting_iterator<unsigned int>(0),
                                     thrust::counting_iterator<unsigned int>(niter),
                                     montecarlo(), 0.0f, thrust::plus<float>());
    pi = (count/(niter*niter))*4.0;
    printf("Pi: %f\n", pi);
}
To compile this code, the cudatoolkit needs to be loaded
$ module load cudatoolkit
Once the code is ready to be compiled, it should be compiled using NVIDIA’s nvcc compiler like so:
$ nvcc -arch=sm_35 thrust.cu -o a.out
It can be run either by submitting it in a batch script or by running it from an interactive job like described in the previous examples.
The code in this tutorial along with a makefile can be found at https://github.com/olcf/Accelerating-Serial-Code
If you have any questions about these examples or how to run them please email help@nccs.gov.
This sample shows the Game of Life CPU code converted to an HMPP accelerator directive version. Both codelet and region C and Fortran methods are presented. A second version is presented that fixes a performance issue incurred in the initial version. Please direct any questions or comments to help@nccs.gov
GOL-codelet.c
#include <stdio.h> #include <stdlib.h> #define SRAND_VALUE 1985 #pragma hmpp GOL codelet, target=CUDA, args[grid].io=inout, args[*].transfer=atcall, args[newGrid].io=none void GOL(int dim, int grid[dim+2][dim+2], int newGrid[dim+2][dim+2], int maxIter) { int i, j, iter; // Main game loop for (iter = 0; iter<maxIter; iter++) { // Left-Right columns for (i = 1; i<=dim; i++) { grid[i][0] = grid[i][dim]; //Copy first real column to right ghost column grid[i][dim+1] = grid[i][1]; //Copy last real column to left ghost column } // Top-Bottom rows for (j = 0; j<=dim+1; j++) { grid[0][j] = grid[dim][j]; //Copy first real row to bottom ghost row grid[dim+1][j] = grid[1][j]; //Copy last real row to top ghost row } int numNeighbors; // Now we loop over all cells and determine their fate for (i = 1; i<=dim; i++) { for (j = 1; j<=dim; j++) { // Get the number of neighbors for a given grid point numNeighbors = grid[i+1][j] + grid[i-1][j] //upper lower + grid[i][j+1] + grid[i][j-1] //right left + grid[i+1][j+1] + grid[i-1][j-1] //diagonals + grid[i-1][j+1] + grid[i+1][j-1]; // Here we have explicitly all of the game rules if (grid[i][j] == 1 && numNeighbors < 2) newGrid[i][j] = 0; else if (grid[i][j] == 1 && (numNeighbors == 2 || numNeighbors == 3)) newGrid[i][j] = 1; else if (grid[i][j] == 1 && numNeighbors > 3) newGrid[i][j] = 0; else if (grid[i][j] == 0 && numNeighbors == 3) newGrid[i][j] = 1; else newGrid[i][j] = grid[i][j]; } } // Can't switch pointers so we mannually have to copy array over for(i = 1; i<dim+2; i++) { for(j = 1; j<dim+2; j++) { grid[i][j] = newGrid[i][j]; } } } } int main(int argc, char* argv[]) { int i,j,iter,numNeighbors; int dim = 1024; // Number of game iterations int maxIter = 1<<15; // Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells int *grid = (int*) malloc( sizeof(int) * (dim+2) *(dim+2)); // Allocate newGrid int *newGrid = (int*) malloc( sizeof(int) * (dim+2) *(dim+2)); // Assign initial population randomly srand(SRAND_VALUE); for(i = 1; i<=dim; i++) { for(j = 1; j<=dim; j++) { grid[i*(dim+2)+j] = rand() % 2; } } // Call the GPU to execute codelet #pragma hmpp GOL callsite GOL(dim, grid, newGrid, maxIter); // Sum up alive cells and print results int total = 0; for (i = 1; i<=dim; i++) { for (j = 1; j<=dim; j++) { total += grid[i*(dim+2)+j]; } } printf("Total Alive: %d\n", total); free(grid); free(newGrid); return 0; }
Changes
Lines 6-58
#pragma hmpp GOL codelet, target=CUDA, args[grid].io=inout, args[*].transfer=atcall, args[newGrid].io=none void GOL(int dim, int grid[dim+2][dim+2], int newGrid[dim+2][dim+2], int maxIter) { int i, j, iter; // Main game loop for (iter = 0; iter<maxIter; iter++) { // Left-Right columns for (i = 1; i<=dim; i++) { grid[i][0] = grid[i][dim]; //Copy first real column to right ghost column grid[i][dim+1] = grid[i][1]; //Copy last real column to left ghost column } // Top-Bottom rows for (j = 0; j<=dim+1; j++) { grid[0][j] = grid[dim][j]; //Copy first real row to bottom ghost row grid[dim+1][j] = grid[1][j]; //Copy last real row to top ghost row } int numNeighbors; // Now we loop over all cells and determine their fate for (i = 1; i<=dim; i++) { for (j = 1; j<=dim; j++) { // Get the number of neighbors for a given grid point numNeighbors = grid[i+1][j] + grid[i-1][j] //upper lower + grid[i][j+1] + grid[i][j-1] //right left + grid[i+1][j+1] + grid[i-1][j-1] //diagonals + grid[i-1][j+1] + grid[i+1][j-1]; // Here we have explicitly all of the game rules if (grid[i][j] == 1 && numNeighbors < 2) newGrid[i][j] = 0; else if (grid[i][j] == 1 && (numNeighbors == 2 || numNeighbors == 3)) newGrid[i][j] = 1; else if (grid[i][j] == 1 && numNeighbors > 3) newGrid[i][j] = 0; else if (grid[i][j] == 0 && numNeighbors == 3) newGrid[i][j] = 1; else newGrid[i][j] = grid[i][j]; } } // Can't switch pointers so we mannually have to copy array over for(i = 1; i<dim+2; i++) { for(j = 1; j<dim+2; j++) { grid[i][j] = newGrid[i][j]; } } } }
To use the codelet method all of the game logic was moved into its own function. With the directive option args[grid].io=inout we specify that grid is to be copied onto the GPU at the start of the codelet and copied back to the host at the end. newGrid is specified to have no transfer between the host and device. The main game loop is included inside of the codelet, even though it is necessarily serial, to ensure that the game grids are not copied to and from the host after each game iteration. We will look at a more explicit way to manage data transfer below. grid and newGrid are passed into our codelet with their full dimensions stated; this is necessary for HMPP to correctly copy the data over.
// Get the number of neighbors for a given grid point numNeighbors = grid[i+1][j] + grid[i-1][j] //upper lower + grid[i][j+1] + grid[i][j-1] //right left + grid[i+1][j+1] + grid[i-1][j-1] //diagonals + grid[i-1][j+1] + grid[i+1][j-1];
In the CUDA programming model all functions must be effectively inlined. Here we manually inline our function to calculate the number of alive neighbors a cell has.
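For reference, the helper that was inlined here might have looked something like the following in the original CPU code; this is a hypothetical reconstruction, shown only to make clear what "manually inline" refers to:

// Hypothetical pre-inlined helper: count the live neighbors of cell (i,j).
// In the accelerated versions its body is pasted directly into the nested loops.
int getNeighbors(int dim, int grid[dim+2][dim+2], int i, int j)
{
    return grid[i+1][j]   + grid[i-1][j]     // upper, lower
         + grid[i][j+1]   + grid[i][j-1]     // right, left
         + grid[i+1][j+1] + grid[i-1][j-1]   // diagonals
         + grid[i-1][j+1] + grid[i+1][j-1];
}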
// Can't switch pointers so we mannually have to copy array over for(i = 1; i<dim+2; i++) { for(j = 1; j<dim+2; j++) { grid[i][j] = newGrid[i][j]; } }
This modification is very important: device pointers are not accessible inside of the accelerated region. If you leave the pointer swap in place from the CPU code the compiler will not complain, the code will run, and you will get incorrect results. This is not a limitation of GPU programming, as you will see in the CUDA GOL example, but of the HMPP directive framework. The easiest way to fix this is to replace the pointer swap with a copy operation, as we have done. A possible workaround for this operation is investigated below.
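For comparison, the pointer swap being replaced presumably looked something like the snippet below in the CPU-only code (a reconstruction for illustration; note the unused tmpGrid variable declared in the Fortran listings). Under HMPP it would only exchange host pointers, leaving the device copies untouched:

// Host-side pointer swap from the serial version: cheap on the CPU, but it
// does not swap the device allocations that HMPP manages behind the scenes.
int *tmpGrid = grid;
grid    = newGrid;
newGrid = tmpGrid;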
// Allocate square grid of (dim+2)^2 elements, 2 added for ghost cell int *grid = (int*) malloc( sizeof(int) * (dim+2) *(dim+2)); // Allocate newGrid int *newGrid = (int*) malloc( sizeof(int) * (dim+2) *(dim+2));
grid[i*(dim+2)+j] = rand() % 2;
total += grid[i*(dim+2)+j];
HMPP currently appears to be limited to copying single-dimensional arrays from the host to the device. We are able to keep multidimensional array indexing in the codelet, while on the host we must allocate and address linear arrays.
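To make the indexing equivalence concrete, the small standalone C99 sketch below (not part of the tutorial) writes an element through a 2D, codelet-style view and reads the same element back through the flat, host-style index:

#include <stdio.h>
#include <stdlib.h>

// Codelet-style view: a variably-modified 2D parameter over the same memory
static void set_cell(int dim, int grid[dim+2][dim+2], int i, int j, int v)
{
    grid[i][j] = v;
}

int main(void)
{
    int dim = 4, i = 2, j = 3;
    int *grid = calloc((dim+2)*(dim+2), sizeof(int));  // host-style flat array

    set_cell(dim, (int (*)[dim+2])grid, i, j, 1);      // write via grid[i][j]
    printf("grid[i*(dim+2)+j] = %d\n", grid[i*(dim+2)+j]);  // prints 1

    free(grid);
    return 0;
}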
GOL-region.c
#include <stdio.h> #include <stdlib.h> #define SRAND_VALUE 1985 #define idx(i,j) ((i)*(dim+2)+(j)) int main(int argc, char* argv[]) { int i,j,numNeighbors; int dim = 1024; // Number of game iterations int maxIter = 1<<15; // Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells int *grid = (int*) malloc( sizeof(int) * (dim+2) *(dim+2)); // Allocate newGrid int *newGrid = (int*) malloc( sizeof(int) * (dim+2) *(dim+2)); // Assign initial population randomly srand(SRAND_VALUE); for(i = 1; i<=dim; i++) { for(j = 1; j<=dim; j++) { grid[i*(dim+2)+j] = rand() % 2; } } #pragma hmpp GOL region, target=CUDA, args[*].transfer=atcall, args[grid].io=inout & #pragma hmpp & args[newGrid].io=none { int i,j,iter; // Main game loop for (iter = 0; iter<maxIter; iter++) { #pragma hmppcg gridify // Left-Right columns for (i = 1; i<=dim; i++) { grid[idx(i,0)] = grid[idx(i,dim)]; //Copy first real column to right ghost column grid[idx(i,dim+1)] = grid[idx(i,1)]; //Copy last real column to left ghost column } #pragma hmppcg gridify // Top-Bottom rows for (j = 0; j<=dim+1; j++) { grid[idx(0,j)] = grid[idx(dim,j)]; //Copy first real row to bottom ghost row grid[idx(dim+1,j)] = grid[idx(1,j)]; //Copy last real row to top ghost row } int numNeighbors; // Now we loop over all cells and determine their fate #pragma hmppcg gridify (i,j) for (i = 1; i<=dim; i++) { for (j = 1; j<=dim; j++) { // Get the number of neighbors for a given grid point numNeighbors = grid[idx(i+1,j)] + grid[idx(i-1,j)] //upper lower + grid[idx(i,j+1)] + grid[idx(i,j-1)] //right left + grid[idx(i+1,j+1)] + grid[idx(i-1,j-1)] //diagonals + grid[idx(i-1,j+1)] + grid[idx(i+1,j-1)]; // Here we have explicitly all of the game rules if (grid[idx(i,j)] == 1 && numNeighbors < 2) newGrid[idx(i,j)] = 0; else if (grid[idx(i,j)] == 1 && (numNeighbors == 2 || numNeighbors == 3)) newGrid[idx(i,j)] = 1; else if (grid[idx(i,j)] == 1 && numNeighbors > 3) newGrid[idx(i,j)] = 0; else if (grid[idx(i,j)] == 0 && numNeighbors == 3) newGrid[idx(i,j)] = 1; else newGrid[idx(i,j)] = grid[idx(i,j)]; } } #pragma hmppcg gridify // Can't switch pointers so we mannually have to copy array over for(i = 1; i<dim+2; i++) { for(j = 1; j<dim+2; j++) { grid[idx(i,j)] = newGrid[idx(i,j)]; } } } } // Sum up alive cells and print results int total = 0; for (i = 1; i<=dim; i++) { for (j = 1; j<=dim; j++) { total += grid[idx(i,j)]; } } printf("Total Alive: %d\n", total); free(grid); free(newGrid); return 0; }
Changes
#define idx(i,j) ((i)*(dim+2)+(j))
In the codelet version we were able to stretch the C language a bit and use multidimensional arrays in the codelet while keeping a single-dimensional array on the host; this is not the case with the region method. A convenience macro is defined and used whenever grid or newGrid is accessed.
// Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells int *grid = (int*) malloc( sizeof(int) * (dim+2) *(dim+2)); // Allocate newGrid int *newGrid = (int*) malloc( sizeof(int) * (dim+2) *(dim+2));
As in the codelet version, HMPP does not currently handle multidimensional arrays properly, so we fall back to a single-dimensional array.
#pragma hmpp GOL region, target=CUDA, args[*].transfer=atcall, args[grid].io=inout &
#pragma hmpp & args[newGrid].io=none
We add the directive option args[grid].io=inout to specify that grid is to be copied onto the GPU at the start of the region and copied back to the host at the end. newGrid is specified to have no transfer between the host and device. The main game loop is included inside of the region, even though it is necessarily serial, to ensure that the game grids are not copied to and from the host after each game iteration. We will look at a more explicit way to manage data transfer below.
int i,j,iter;
At the time of this writing HMPP does not correctly identify induction variables used in an hmpp region when they are declared outside of the region. The solution is simply to declare our induction variables inside the region.
#pragma hmppcg gridify // Left-Right columns for (i = 1; i<=dim; i++) {
#pragma hmppcg gridify // Top-Bottom rows for (j = 0; j<=dim+1; j++) {
// Now we loop over all cells and determine their fate #pragma hmppcg gridify (i,j) for (i = 1; i<=dim; i++) { for (j = 1; j<=dim; j++) {
#pragma hmppcg gridify // Can't switch pointers so we mannually have to copy array over for(i = 1; i<dim+2; i++) { for(j = 1; j<dim+2; j++) {
The hmppcg directives are used to guide the compiler at the loop level. In this particular case the compiler did not want to parallelize the loops due to the outermost game loop being sequential. Adding hmppcg gridify tells the compiler the loop immediately following the directive should be executed in parallel on the GPU.
// Get the number of neighbors for a given grid point numNeighbors = grid[idx(i+1,j)] + grid[idx(i-1,j)] //upper lower + grid[idx(i,j+1)] + grid[idx(i,j-1)] //right left + grid[idx(i+1,j+1)] + grid[idx(i-1,j-1)] //diagonals + grid[idx(i-1,j+1)] + grid[idx(i+1,j-1)];
In the CUDA programming model all functions must be effectively inlined. Here we manually inline our function to calculate the number of alive neighbors a cell has.
#pragma hmppcg gridify // Can't switch pointers so we mannually have to copy array over for(i = 1; i<dim+2; i++) { for(j = 1; j<dim+2; j++) { grid[idx(i,j)] = newGrid[idx(i,j)]; } }
This modification is very important: device pointers are not accessible inside of the accelerated region. If you leave the pointer swap in place from the CPU code the compiler will not complain, the code will run, and you will get incorrect results. This is not a limitation of GPU programming, as you will see in the CUDA GOL example, but of the HMPP directive framework. The easiest way to fix this is to replace the pointer swap with a copy operation, as we have done. A possible workaround for this operation is investigated below.
Compiling
We add the --io-report option to hmpp to see compile-time host/device memory transfer information.
$ hmpp --io-report cc GOL-codelet.c -o gol.out
or
$ hmpp --io-report cc GOL-region.c -o gol.out
Output:
The output will be similar for both the codelet and region methods.
hmpp: [Info] IO report :
  In GROUP 'gol' REGION 'gol' at GOL-region.c:29, function '__hmpp_region__gol'
  Parameter 'dim' has intent IN
  Parameter 'maxIter' has intent IN
  Parameter 'grid' has intent INOUT
  Parameter 'newGrid' has intent INOUT
Here we see the intent, the transfer direction with respect to the GPU, of each of our variables. We will see more explicit methods of dealing with data transfer below.
hmpp: [Info] Generated codelet filename is "gol_cuda.cu".
hmppcg: [Message DPL3000] GOLHMPP-TRY0-REGION.c:37: Loop 'i' was gridified (1D)
hmppcg: [Message DPL3000] GOLHMPP-TRY0-REGION.c:44: Loop 'j' was gridified (1D)
hmppcg: [Message DPL3001] GOLHMPP-TRY0-REGION.c:52: Loops 'j' and 'i' were gridified (2D)
hmppcg: [Message DPL3000] GOLHMPP-TRY0-REGION.c:77: Loop 'i' was gridified (1D)
hmppcg: [Warning DPL3002] GOLHMPP-TRY0-REGION.c:33: Loop not gridified: will be executed on CPU
All of the loops have been gridified, indicating a kernel has been generated for each loop that will be executed in parallel on the GPU. A warning message tells us that the main game loop has not been gridified and will be executed serially on the CPU; this is the expected and correct behavior.
GOL-codelet.f90
!$hmpp GOL codelet, target=CUDA, args[grid].io=inout, args[*].transfer=atcall, args[newGrid].io=none subroutine GOL(dim, grid, newGrid, maxIter) implicit none integer, intent(in) :: dim, maxIter integer, intent(inout) :: grid(dim+2,dim+2), newGrid(dim+2,dim+2) integer :: i,j,iter,numNeighbors ! Main game loop do iter=1,maxIter ! Top-Bottom ghost rows do j=2,dim+1 grid(1,j) = grid(dim+1,j) !Copy first game grid row to bottom ghost row grid(dim+2,j) = grid(2,j) !Copy first game grid row to top ghost row enddo ! Left-Right ghost columns do i=1,dim+2 grid(i,1) = grid(i, dim+1) !Copy first game grid column to right ghost column grid(i,dim+2) = grid(i,2) !Copy last game grid column to left ghost column enddo ! Now we loop over all cells and determine their fate do j=2,dim+1 do i=2,dim+1 ! Get the number of neighbors for a given grid point numNeighbors = grid(i,j+1) + grid(i,j-1)& !left & right + grid(i+1,j) + grid(i-1,j)& !upper & lower + grid(i+1,j+1) + grid(i-1,j-1)& !diagonals + grid(i-1,j+1) + grid(i+1,j-1) ! Here we have explicitly all of the game rules if(grid(i,j) == 1 .AND. numNeighbors < 2) then newGrid(i,j) = 0 elseif(grid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then newGrid(i,j) = 1 elseif(grid(i,j) == 1 .AND. numNeighbors > 3) then newGrid(i,j) = 0 elseif(grid(i,j) == 0 .AND. numNeighbors == 3) then newGrid(i,j) = 1 else newGrid(i,j) = grid(i,j) endif enddo enddo ! Can't switch pointers so we mannually have to copy array over do j=2,dim+1 do i=2,dim+1 grid(i,j) = newGrid(i,j) enddo enddo enddo!end main game loop end subroutine GOL program main implicit none integer :: i,j,seed(8),numNeighbors,total real :: randm ! Linear game grid dimension integer :: dim = 1024 ! Number of game iterations integer :: maxIter = 2**10 ! Game grid pointers integer,dimension(:,:),allocatable :: grid, newGrid, tmpGrid ! Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells allocate(grid(dim+2,dim+2)) allocate(newGrid(dim+2,dim+2)) ! Assign initial population randomly seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/) call random_seed(PUT=seed) do j=1,dim do i=1,dim call random_number(randm) grid(i,j) = nint(randm) enddo enddo !$hmpp GOL callsite call GOL(dim, grid, newGrid, maxIter) ! Sum up alive cells and print results total = 0 do j=2,dim+1 do i=2,dim+1 total = total + grid(i,j) enddo enddo print *, "Total Alive", total ! Release memory deallocate(grid) deallocate(newGrid) end program
Changes
Lines 1-55:
!$hmpp GOL codelet, target=CUDA, args[grid].io=inout, args[*].transfer=atcall, args[newGrid].io=none subroutine GOL(dim, grid, newGrid, maxIter) implicit none integer, intent(in) :: dim, maxIter integer, intent(inout) :: grid(dim+2,dim+2), newGrid(dim+2,dim+2) integer :: i,j,iter,numNeighbors ! Main game loop do iter=1,maxITer ! Top-Bottom ghost rows do j=2,dim+1 grid(1,j) = grid(dim+1,j) !Copy first game grid row to bottom ghost row grid(dim+2,j) = grid(2,j) !Copy first game grid row to top ghost row enddo ! Left-Right ghost columns do i=1,dim+2 grid(i,1) = grid(i, dim+1) !Copy first game grid column to right ghost column grid(i,dim+2) = grid(i,2) !Copy last game grid column to left ghost column enddo ! Now we loop over all cells and determine their fate do j=2,dim+1 do i=2,dim+1 ! Get the number of neighbors for a given grid point numNeighbors = grid(i,j+1) + grid(i,j-1)& !left & right + grid(i+1,j) + grid(i-1,j)& !upper & lower + grid(i+1,j+1) + grid(i-1,j-1)& !diagonals + grid(i-1,j+1) + grid(i+1,j-1) ! Here we have explicitly all of the game rules if(grid(i,j) == 1 .AND. numNeighbors < 2) then newGrid(i,j) = 0 elseif(grid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then newGrid(i,j) = 1 elseif(grid(i,j) == 1 .AND. numNeighbors > 3) then newGrid(i,j) = 0 elseif(grid(i,j) == 0 .AND. numNeighbors == 3) then newGrid(i,j) = 1 else newGrid(i,j) = grid(i,j) endif enddo enddo ! Can't switch pointers so we mannually have to copy array over do j=2,dim+1 do i=2,dim+1 grid(i,j) = newGrid(i,j) enddo enddo enddo!end main game loop end subroutine GOL
To use the codelet method all of the game logic was moved into its own subroutine. With the directive option args[grid].io=inout we specify that grid is to be copied onto the GPU at the start of the codelet and copied back to the host at the end. newGrid is specified to have no transfer between the host and device. The main game loop is included inside of the codelet, even though it is necessarily serial, to ensure that the game grids are not copied to and from the host after each game iteration. We will look at a more explicit way to manage data transfer below.
! Get the number of neighbors for a given grid point numNeighbors = grid(i,j+1) + grid(i,j-1)& !left & right + grid(i+1,j) + grid(i-1,j)& !upper & lower + grid(i+1,j+1) + grid(i-1,j-1)& !diagonals + grid(i-1,j+1) + grid(i+1,j-1)
In the CUDA programming model all functions must be effectively inlined. Here we manually inline our subroutine to calculate the number of alive neighbors a cell has.
! Can't switch pointers so we mannually have to copy array over do j=2,dim+1 do i=2,dim+1 grid(i,j) = newGrid(i,j) enddo enddo
This modification is very important: device pointers are not accessible inside of the accelerated region. If you leave the pointer swap in place from the CPU code the compiler will not complain, the code will run, and you will get incorrect results. This is not a limitation of GPU programming, as you will see in the CUDA GOL example, but of the HMPP directive framework. The easiest way to fix this is to replace the pointer swap with a copy operation, as we have done. A possible workaround for this operation is investigated below.
GOL-region.f90
program main implicit none integer :: i,j,iter,seed(8),numNeighbors,total real :: randm ! Linear game grid dimension integer :: dim = 1024 ! Number of game iterations integer :: maxIter = 2**10 ! Game grid pointers integer,dimension(:,:),allocatable :: grid, newGrid, tmpGrid ! Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells allocate(grid(dim+2,dim+2)) allocate(newGrid(dim+2,dim+2)) ! Assign initial population randomly seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/) call random_seed(PUT=seed) do j=1,dim do i=1,dim call random_number(randm) grid(i,j) = nint(randm) enddo enddo !$hmpp GOL region, target=CUDA, args[grid].io=inout, args[*].transfer=atcall & !$hmpp & args[newGrid].io=none ! Main game loop do iter=1,maxITer ! Top-Bottom ghost rows do j=2,dim+1 grid(1,j) = grid(dim+1,j) !Copy first game grid row to bottom ghost row grid(dim+2,j) = grid(2,j) !Copy first game grid row to top ghost row enddo ! Left-Right ghost columns do i=1,dim+2 grid(i,1) = grid(i, dim+1) !Copy first game grid column to right ghost column grid(i,dim+2) = grid(i,2) !Copy last game grid column to left ghost column enddo ! Now we loop over all cells and determine their fate !$hmppcg gridify(j, i) do j=2,dim+1 do i=2,dim+1 ! Get the number of neighbors for a given grid point numNeighbors = grid(i,j+1) + grid(i,j-1)& !left & right + grid(i+1,j) + grid(i-1,j)& !upper & lower + grid(i+1,j+1) + grid(i-1,j-1)& !diagonals + grid(i-1,j+1) + grid(i+1,j-1) !Here we have explicitly all of the game rules if(grid(i,j) == 1 .AND. numNeighbors < 2) then newGrid(i,j) = 0 elseif(grid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then newGrid(i,j) = 1 elseif(grid(i,j) == 1 .AND. numNeighbors > 3) then newGrid(i,j) = 0 elseif(grid(i,j) == 0 .AND. numNeighbors == 3) then newGrid(i,j) = 1 else newGrid(i,j) = grid(i,j) endif enddo enddo ! Can't switch pointers so we mannually have to copy array over do j=2,dim+1 do i=2,dim+1 grid(i,j) = newGrid(i,j) enddo enddo enddo!end main game loop !$hmpp GOL endregion ! Sum up alive cells and print results total = 0 do j=2,dim+1 do i=2,dim+1 total = total + grid(i,j) enddo enddo print *, "Total Alive", total ! Release memory deallocate(grid) deallocate(newGrid) end program
Changes
! Now we loop over all cells and determine their fate !$hmppcg gridify(j, i) do j=2,dim+1 do i=2,dim+1
By default the hmpp compiler did not want to gridify the loop over each cell. We can attempt to force the compiler to do so by adding the hmppcg directive with the gridify(j, i) option, which specifies that we want to gridify with respect to the induction variables j and i.
! Get the number of neighbors for a given grid point numNeighbors = grid(i,j+1) + grid(i,j-1)& !left & right + grid(i+1,j) + grid(i-1,j)& !upper & lower + grid(i+1,j+1) + grid(i-1,j-1)& !diagonals + grid(i-1,j+1) + grid(i+1,j-1)
In the CUDA programming model all functions must be effectively inlined. Here we manually inline our subroutine to calculate the number of alive neighbors a cell has.
! Can't switch pointers so we mannually have to copy array over do j=2,dim+1 do i=2,dim+1 grid(i,j) = newGrid(i,j) enddo enddo
This modification is very important: device pointers are not accessible inside of the accelerated region. If you leave the pointer swap in place from the CPU code the compiler will not complain, the code will run, and you will get incorrect results. This is not a limitation of GPU programming, as you will see in the CUDA GOL example, but of the HMPP directive framework. The easiest way to fix this is to replace the pointer swap with a copy operation, as we have done. A possible workaround for this operation is investigated below.
Compiling
We add the --io-report option to hmpp to see compile-time host/device memory transfer information.
$ hmpp --io-report ftn GOL-codelet.f90 -o gol.out
or
$ hmpp --io-report ftn GOL-region.f90 -o gol.out
Output:
The output will be similar for both the codelet and region methods. In the most recent release as of this writing you may get some warning messages from the compiler about unused variables when using the region method.
hmpp: [Info] IO report :
  In GROUP 'gol' CODELET 'gol' at GOLHMPP-TRY0-CODELET.f90:1, function 'gol'
  Parameter 'dim' has intent IN
  Parameter 'grid' has intent INOUT
  Parameter 'newgrid' has intent INOUT
  Parameter 'maxiter' has intent IN
Here we see the intent, the transfer direction with respect to the GPU, of each of our variables. We will see more explicit methods of dealing with data transfer below.
hmpp: [Info] Generated codelet filename is "gol_cuda.cu".
hmppcg: [Message DPL3000] GOLHMPP-TRY0-CODELET.f90:11: Loop 'j' was gridified (1D)
hmppcg: [Message DPL3000] GOLHMPP-TRY0-CODELET.f90:17: Loop 'i' was gridified (1D)
hmppcg: [Message DPL3001] GOLHMPP-TRY0-CODELET.f90:23: Loops 'i' and 'j' were gridified (2D)
hmppcg: [Message DPL3001] GOLHMPP-TRY0-CODELET.f90:47: Loops 'i' and 'j' were gridified (2D)
hmppcg: [Warning DPL3002] GOLHMPP-TRY0-CODELET.f90:9: Loop not gridified: will be executed on CPU
All of the loops have been gridified, indicating a kernel has been generated for each loop that will be executed in parallel on the GPU. A warning message tells us that the main game loop has not been gridified and will be executed serially on the CPU; this is the expected and correct behavior.
GOL-codelet-Try2.c
In this version we will remove the memory-copy kernel that we needed because HMPP does not currently support device pointers. As an added bonus we will also take a look at more advanced memory management techniques. There are numerous ways this could be approached, but the methods shown highlight many good-to-know HMPP features.
#include <stdio.h> #include <stdlib.h> #define SRAND_VALUE 1985 #pragma hmpp <GOL> group target=CUDA #pragma hmpp <GOL> mapbyname, dim #pragma hmpp <GOL> map, args[even::grid;odd::newGrid] #pragma hmpp <GOL> map, args[even::newGrid;odd::grid] #pragma hmpp <GOL> even codelet, args[dim,grid,newGrid].transfer=manual #pragma hmpp <GOL> odd codelet, args[dim,grid,newGrid].transfer=manual void GOL(int dim, int grid[dim+2][dim+2], int newGrid[dim+2][dim+2]) { int i, j, numNeighbors; // Left-Right columns for (i = 1; i<=dim; i++) { grid[i][0] = grid[i][dim]; //Copy first real column to right ghost column grid[i][dim+1] = grid[i][1]; //Copy last real column to left ghost column } // Top-Bottom rows for (j = 0; j<=dim+1; j++) { grid[0][j] = grid[dim][j]; //Copy first real row to bottom ghost row grid[dim+1][j] = grid[1][j]; //Copy last real row to top ghost row } // Now we loop over all cells and determine their fate for (i = 1; i<=dim; i++) { for (j = 1; j<=dim; j++) { // Get the number of neighbors for a given grid point numNeighbors = grid[i+1][j] + grid[i-1][j] //upper lower + grid[i][j+1] + grid[i][j-1] //right left + grid[i+1][j+1] + grid[i-1][j-1] //Diagonals + grid[i-1][j+1] + grid[i+1][j-1]; // Here we have explicitly all of the game rules if (grid[i][j] == 1 && numNeighbors < 2) newGrid[i][j] = 0; else if (grid[i][j] == 1 && (numNeighbors == 2 || numNeighbors == 3)) newGrid[i][j] = 1; else if (grid[i][j] == 1 && numNeighbors > 3) newGrid[i][j] = 0; else if (grid[i][j] == 0 && numNeighbors == 3) newGrid[i][j] = 1; else newGrid[i][j] = grid[i][j]; } } } int main(int argc, char* argv[]) { int i,j,iter; // Linear grid dimension int dim = 1024; // Number of game iterations int maxIter = 1<<10; #pragma hmpp <GOL> allocate // Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells int *grid = (int*) malloc( sizeof(int) * (dim+2) *(dim+2)); // Allocate newGrid int *newGrid = (int*) malloc( sizeof(int) * (dim+2) *(dim+2)); // Assign initial population randomly srand(SRAND_VALUE); for(i = 1; i<=dim; i++) { for(j = 1; j<=dim; j++) { grid[i*(dim+2)+j] = rand() % 2; } } // Copy memory tp GPU #pragma hmpp <GOL> advancedload, args[even::dim;even::grid;even::newGrid] // Main game loop for (iter = 0; iter<maxIter; iter++) { if(iter%2 == 0) { #pragma hmpp <GOL> even callsite GOL(dim, grid, newGrid); } else { #pragma hmpp <GOL> odd callsite GOL(dim, grid, newGrid); } }//iter loop // Read back final configuration to confirm it worked #pragma hmpp <GOL> delegatedstore, args[odd::newGrid] // Sum up alive cells and print results int total = 0; for (i = 1; i<=dim; i++) { for (j = 1; j<=dim; j++) { total += newGrid[i*(dim+2)+j]; } } printf("Total Alive: %d\n", total); #pragma hmpp <GOL> release free(grid); free(newGrid); return 0; }
Changes
#pragma hmpp <GOL> group target=CUDA #pragma hmpp <GOL> mapbyname, dim #pragma hmpp <GOL> map, args[even::grid;odd::newGrid] #pragma hmpp <GOL> map, args[even::newGrid;odd::grid] #pragma hmpp <GOL> even codelet, args[dim,grid,newGrid].transfer=manual #pragma hmpp <GOL> odd codelet, args[dim,grid,newGrid].transfer=manual
A major change to our code has been to get rid of the newGrid to grid memory copy. Instead we use some of the built in HMPP directive functionality.
#pragma hmpp <GOL> group target=CUDA
Here we create what is known as a codelet group, named GOL. This allows us to keep track of similar kernels and set common attributes.
#pragma hmpp <GOL> mapbyname, dim
This line will map the variable dim across all kernels. This means once dim is allocated and copied to the device for one kernel it will be available for use by all kernels in the GOL group.
#pragma hmpp <GOL> map, args[even::grid;odd::newGrid] #pragma hmpp <GOL> map, args[even::newGrid;odd::grid]
The map directive is used to map arguments between different kernels; this will effectively switch our pointers for us. The notation args[CodeletName1::Argument1;CodeletName2::Argument2] is used such that Argument1 in CodeletName1 will refer to the same device memory as Argument2 in CodeletName2. We see in our case that we will soon be defining two codelets, one named even and one named odd.
#pragma hmpp <GOL> even codelet, args[dim,grid,newGrid].transfer=manual
#pragma hmpp <GOL> odd codelet, args[dim,grid,newGrid].transfer=manual
The GOL group will contain two codelets, even and odd. Our game will oscillate between calling these codelets in the main game loop. The only difference between these codelets is caused by the map directive used above which effectively switches the device pointers to grid and newGrid.
#pragma hmpp <GOL> allocate
The allocate directive will allocate space on the GPU for all memory needed by the named group. Any memory objects shared by more than one codelet through the map or mapbyname directives will only be allocated once.
// Copy memory to GPU
#pragma hmpp <GOL> advancedload, args[even::dim;even::grid;even::newGrid]
Once we have allocated space on the GPU we can copy the data from the host to the GPU with the advancedload directive. Notice that when referencing which arguments will be copied we prefix them with one of the codelet names; either codelet name could be specified.
// Main game loop for (iter = 0; iter<maxIter; iter++) { if(iter%2 == 0) { #pragma hmpp <GOL> even callsite GOL(dim, grid, newGrid); } else { #pragma hmpp <GOL> odd callsite GOL(dim, grid, newGrid); } }//iter loop
The main game loop induction variable is used to switch between calling our even codelet and our odd codelet. Although it looks as though the same GOL function is being called in both cases, because of the map directive defined with the codelets we know that the grid and newGrid GPU pointers will be switched.
// Read back final configuration to confirm it worked #pragma hmpp <GOL> delegatedstore, args[odd::newGrid]
After the kernel is finished we copy the grid from the device to the host.
#pragma hmpp <GOL> release
Finally, since we allocated the device memory we must free it with the hmpp <GOL> release directive.
GOL-region-Try2.c
#include <stdio.h> #include <stdlib.h> #define SRAND_VALUE 1985 #define idx(i,j) ((i)*(dim+2)+(j)) int main(int argc, char* argv[]) { int i,j,iter; int dim = 1024; //Number of game iterations int maxIter = 1<<10; #pragma hmpp GOL allocate, args[grid,newGrid].size={(dim+2)*(dim+2)} // Allocate square grid of (dim+2)^2 elements, 2 added for perioid boundary condition ghost cells int *grid = (int*) malloc( sizeof(int) * (dim+2) *(dim+2)); // Allocate newGrid int *newGrid = (int*) malloc( sizeof(int) * (dim+2) *(dim+2)); // Assign initial population randomly srand(SRAND_VALUE); for(i = 1; i<=dim; i++) { for(j = 1; j<=dim; j++) { grid[i*(dim+2)+j] = rand() % 2; } } // Copy memory tp GPU #pragma hmpp GOL advancedload, args[dim;grid;newGrid] // Main game loop for (iter = 0; iter<maxIter; iter++) { #pragma hmpp GOL region, target=CUDA, args[dim,grid,newGrid].transfer=manual, args[iter].transfer=atcall { int i,j; if(iter%2 == 0) { #pragma hmppcg gridify // Left-Right columns for (i = 1; i<=dim; i++) { grid[idx(i,0)] = grid[idx(i,dim)]; //Copy first real column to right ghost column grid[idx(i,dim+1)] = grid[idx(i,1)]; //Copy last real column to left ghost column } #pragma hmppcg gridify // Top-Bottom rows for (j = 0; j<=dim+1; j++) { grid[idx(0,j)] = grid[idx(dim,j)]; //Copy first real row to bottom ghost row grid[idx(dim+1,j)] = grid[idx(1,j)]; //Copy last real row to top ghost row } int numNeighbors; // Now we loop over all cells and determine their fate #pragma hmppcg gridify (i,j) for (i = 1; i<=dim; i++) { for (j = 1; j<=dim; j++) { // Get the number of neighbors for a given grid point numNeighbors = grid[idx(i+1,j)] + grid[idx(i-1,j)] //upper lower + grid[idx(i,j+1)] + grid[idx(i,j-1)] //right left + grid[idx(i+1,j+1)] + grid[idx(i-1,j-1)] //diagonals + grid[idx(i-1,j+1)] + grid[idx(i+1,j-1)]; // Here we have explicitly all of the game rules if (grid[idx(i,j)] == 1 && numNeighbors < 2) newGrid[idx(i,j)] = 0; else if (grid[idx(i,j)] == 1 && (numNeighbors == 2 || numNeighbors == 3)) newGrid[idx(i,j)] = 1; else if (grid[idx(i,j)] == 1 && numNeighbors > 3) newGrid[idx(i,j)] = 0; else if (grid[idx(i,j)] == 0 && numNeighbors == 3) newGrid[idx(i,j)] = 1; else newGrid[idx(i,j)] = grid[idx(i,j)]; } } } else { #pragma hmppcg gridify // Left-Right columns for (i = 1; i<=dim; i++) { newGrid[idx(i,0)] = newGrid[idx(i,dim)]; //Copy first real column to right ghost column newGrid[idx(i,dim+1)] = newGrid[idx(i,1)]; //Copy last real column to left ghost column } #pragma hmppcg gridify // Top-Bottom rows for (j = 0; j<=dim+1; j++) { newGrid[idx(0,j)] = newGrid[idx(dim,j)]; //Copy first real row to bottom ghost row newGrid[idx(dim+1,j)] = newGrid[idx(1,j)]; //Copy last real row to top ghost row } int numNeighbors; // Now we loop over all cells and determine their fate #pragma hmppcg gridify (i,j) for (i = 1; i<=dim; i++) { for (j = 1; j<=dim; j++) { // Get the number of neighbors for a given grid point numNeighbors = newGrid[idx(i+1,j)] + newGrid[idx(i-1,j)] //upper lower + newGrid[idx(i,j+1)] + newGrid[idx(i,j-1)] //right left + newGrid[idx(i+1,j+1)] + newGrid[idx(i-1,j-1)] //diagonals + newGrid[idx(i-1,j+1)] + newGrid[idx(i+1,j-1)]; // Here we have explicitly all of the game rules if (newGrid[idx(i,j)] == 1 && numNeighbors < 2) grid[idx(i,j)] = 0; else if (newGrid[idx(i,j)] == 1 && (numNeighbors == 2 || numNeighbors == 3)) grid[idx(i,j)] = 1; else if (newGrid[idx(i,j)] == 1 && numNeighbors > 3) grid[idx(i,j)] = 0; else if (newGrid[idx(i,j)] == 0 && numNeighbors == 3) grid[idx(i,j)] = 1; else grid[idx(i,j)] = 
newGrid[idx(i,j)]; } } } } }//iter loop // Read back final configuration to confirm it worked #pragma hmpp GOL delegatedstore, args[grid] // Sum up alive cells and print results int total = 0; for (i = 1; i<=dim; i++) { for (j = 1; j<=dim; j++) { total += grid[i*(dim+2)+j]; } } printf("Total Alive: %d\n", total); #pragma hmpp GOL release free(grid); free(newGrid); return 0; }
Changes
#pragma hmpp GOL allocate, args[grid,newGrid].size={(dim+2)*(dim+2)}
The hmpp GOL allocate directive will allocate space on the GPU for the named region. The compiler requires us to specify the size of the passed in arrays at allocation time.
#pragma hmpp GOL advancedload, args[dim;grid;newGrid]
hmpp GOL advancedload, args[dim;grid;newGrid] will copy the named arguments of the region from the host to the GPU.
#pragma hmpp GOL region, target=CUDA, args[dim,grid,newGrid].transfer=manual, args[iter].transfer=atcall
In this version of the code we have our region inside of our main game loop but no longer have to worry about unnecessary memory copies.
if(iter%2 == 0) { #pragma hmppcg gridify // Left-Right columns for (i = 1; i<=dim; i++) { grid[idx(i,0)] = grid[idx(i,dim)]; //Copy first real column to right ghost column grid[idx(i,dim+1)] = grid[idx(i,1)]; //Copy last real column to left ghost column } #pragma hmppcg gridify // Top-Bottom rows for (j = 0; j<=dim+1; j++) { grid[idx(0,j)] = grid[idx(dim,j)]; //Copy first real row to bottom ghost row grid[idx(dim+1,j)] = grid[idx(1,j)]; //Copy last real row to top ghost row } int numNeighbors; // Now we loop over all cells and determine their fate #pragma hmppcg gridify (i,j) for (i = 1; i<=dim; i++) { for (j = 1; j<=dim; j++) { // Get the number of neighbors for a given grid point numNeighbors = grid[idx(i+1,j)] + grid[idx(i-1,j)] //upper lower + grid[idx(i,j+1)] + grid[idx(i,j-1)] //right left + grid[idx(i+1,j+1)] + grid[idx(i-1,j-1)] //diagonals + grid[idx(i-1,j+1)] + grid[idx(i+1,j-1)]; // Here we have explicitly all of the game rules if (grid[idx(i,j)] == 1 && numNeighbors < 2) newGrid[idx(i,j)] = 0; else if (grid[idx(i,j)] == 1 && (numNeighbors == 2 || numNeighbors == 3)) newGrid[idx(i,j)] = 1; else if (grid[idx(i,j)] == 1 && numNeighbors > 3) newGrid[idx(i,j)] = 0; else if (grid[idx(i,j)] == 0 && numNeighbors == 3) newGrid[idx(i,j)] = 1; else newGrid[idx(i,j)] = grid[idx(i,j)]; } } } else { #pragma hmppcg gridify // Left-Right columns for (i = 1; i<=dim; i++) { newGrid[idx(i,0)] = newGrid[idx(i,dim)]; //Copy first real column to right ghost column newGrid[idx(i,dim+1)] = newGrid[idx(i,1)]; //Copy last real column to left ghost column } #pragma hmppcg gridify // Top-Bottom rows for (j = 0; j<=dim+1; j++) { newGrid[idx(0,j)] = newGrid[idx(dim,j)]; //Copy first real row to bottom ghost row newGrid[idx(dim+1,j)] = newGrid[idx(1,j)]; //Copy last real row to top ghost row } int numNeighbors; // Now we loop over all cells and determine their fate #pragma hmppcg gridify (i,j) for (i = 1; i<=dim; i++) { for (j = 1; j<=dim; j++) { // Get the number of neighbors for a given grid point numNeighbors = newGrid[idx(i+1,j)] + newGrid[idx(i-1,j)] //upper lower + newGrid[idx(i,j+1)] + newGrid[idx(i,j-1)] //right left + newGrid[idx(i+1,j+1)] + newGrid[idx(i-1,j-1)] //diagonals + newGrid[idx(i-1,j+1)] + newGrid[idx(i+1,j-1)]; // Here we have explicitly all of the game rules if (newGrid[idx(i,j)] == 1 && numNeighbors < 2) grid[idx(i,j)] = 0; else if (newGrid[idx(i,j)] == 1 && (numNeighbors == 2 || numNeighbors == 3)) grid[idx(i,j)] = 1; else if (newGrid[idx(i,j)] == 1 && numNeighbors > 3) grid[idx(i,j)] = 0; else if (newGrid[idx(i,j)] == 0 && numNeighbors == 3) grid[idx(i,j)] = 1; else grid[idx(i,j)] = newGrid[idx(i,j)]; } } }
A major change to our code has been to get rid of the newGrid-to-grid copy kernel. We have explicitly coded the two cases and use the parity of the main iteration variable to switch between the two pieces of code. This increases the code complexity but does get around a costly copy each iteration.
// Read back final configuration to confirm it worked #pragma hmpp GOL delegatedstore, args[grid]
After the kernel is finished we copy the grid from the device to the host with hmpp GOL delegatedstore, args[grid]
#pragma hmpp GOL release
Finally, since we allocated the device memory we must free it with the hmpp GOL release directive.
GOL-codelet-Try2.f90
!$hmpp <GOL> group, target=CUDA !$hmpp <GOL> mapbyname, dim !$hmpp <GOL> map, args[even::grid;odd::newGrid] !$hmpp <GOL> map, args[even::newGrid;odd::grid] !$hmpp <GOL> even codelet, args[dim,grid,newGrid].transfer=manual !$hmpp <GOL> odd codelet, args[dim,grid,newGrid].transfer=manual subroutine GOL(dim, grid, newGrid) implicit none integer, intent(in) :: dim integer, intent(inout) :: grid(dim+2,dim+2), newGrid(dim+2,dim+2) integer :: i,j,numNeighbors ! Top-Bottom ghost rows do j=2,dim+1 grid(1,j) = grid(dim+1,j) !Copy first game grid row to bottom ghost row grid(dim+2,j) = grid(2,j) !Copy first game grid row to top ghost row enddo ! Left-Right ghost columns do i=1,dim+2 grid(i,1) = grid(i, dim+1) !Copy first game grid column to right ghost column grid(i,dim+2) = grid(i,2) !Copy last game grid column to left ghost column enddo ! Now we loop over all cells and determine their fate do j=2,dim+1 do i=2,dim+1 ! Get the number of neighbors for a given grid point numNeighbors = grid(i,j+1) + grid(i,j-1)& !right & left + grid(i+1,j) + grid(i-1,j)& !upper & lower + grid(i+1,j+1) + grid(i-1,j-1)& !diagonals + grid(i-1,j+1) + grid(i+1,j-1) ! Here we have explicitly all of the game rules if(grid(i,j) == 1 .AND. numNeighbors < 2) then newGrid(i,j) = 0 elseif(grid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then newGrid(i,j) = 1 elseif(grid(i,j) == 1 .AND. numNeighbors > 3) then newGrid(i,j) = 0 elseif(grid(i,j) == 0 .AND. numNeighbors == 3) then newGrid(i,j) = 1 else newGrid(i,j) = grid(i,j) endif enddo enddo end subroutine GOL program main implicit none integer :: i,j,seed(8),iter,total real :: randm ! Linear game grid dimension integer :: dim = 1024 ! Number of game iterations integer :: maxIter = 2**10 ! Game grid pointers integer,dimension(:,:),allocatable :: grid, newGrid, tmpGrid !$hmpp <GOL> allocate ! Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells allocate(grid(dim+2,dim+2)) allocate(newGrid(dim+2,dim+2)) ! Assign initial population randomly seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/) call random_seed(PUT=seed) do j=1,dim do i=1,dim call random_number(randm) grid(i,j) = nint(randm) enddo enddo ! Copy memory tp GPU !$hmpp <GOL> advancedload, args[even::dim;even::grid;even::newGrid] ! Main game loop do iter=1,maxITer if(MOD(iter,2) == 1) then !$hmpp <GOL> even callsite call GOL(dim, grid, newGrid) else !$hmpp <GOL> odd callsite call GOL(dim, grid, newGrid) endif enddo!end main game loop ! Read back final configuration to confirm it worked !$hmpp <GOL> delegatedstore, args[odd::newGrid] ! Sum up alive cells and print results total = 0 do j=2,dim+1 do i=2,dim+1 total = total + newGrid(i,j) enddo enddo print *, "Total Alive", total ! Release memory !$hmpp <GOL> release deallocate(grid) deallocate(newGrid) end program
Changes
!$hmpp <GOL> group, target=CUDA
!$hmpp <GOL> mapbyname, dim
!$hmpp <GOL> map, args[even::grid;odd::newGrid]
!$hmpp <GOL> map, args[even::newGrid;odd::grid]
!$hmpp <GOL> even codelet, args[dim,grid,newGrid].transfer=manual
!$hmpp <GOL> odd codelet, args[dim,grid,newGrid].transfer=manual
A major change to our code has been to get rid of the newGrid to grid memory copy. Instead we use some of the built in HMPP directive functionality.
!$hmpp <GOL> group, target=CUDA
Here we create what is known as a codelet group, named GOL. This allows us to keep track of similar kernels and set common attributes.
!$hmpp <GOL> mapbyname, dim
This line will map the variable dim across all kernels. This means once dim is allocated and copied to the device for one kernel it will be available for use by all kernels in the GOL group.
!$hmpp <GOL> map, args[even::grid;odd::newGrid] !$hmpp <GOL> map, args[even::newGrid;odd::grid]
The map directive is used to map arguments between different kernels; this will effectively switch our pointers for us. The notation args[CodeletName1::Argument1;CodeletName2::Argument2] is used such that Argument1 in CodeletName1 will refer to the same device memory as Argument2 in CodeletName2. We see in our case that we will soon be defining two codelets, one named even and one named odd.
!$hmpp <GOL> even codelet, args[dim,grid,newGrid].transfer=manual
!$hmpp <GOL> odd codelet, args[dim,grid,newGrid].transfer=manual
The GOL group will contain two codelets, even and odd. Our game will oscillate between calling these codelets in the main game loop. The only difference between these codelets is caused by the map directive used above which effectively switches the device pointers to grid and newGrid.
!$hmpp <GOL> allocate
The allocate directive will allocate space on the GPU for all memory needed by the named group. Any memory objects shared by more than one codelet through the map or mapbyname directives will only be allocated once.
! Copy memory to GPU
!$hmpp <GOL> advancedload, args[even::dim;even::grid;even::newGrid]
Once we have allocated space on the GPU we can copy the data from the host to the GPU with the advancedload directive. Notice that when referencing which arguments will be copied we prefix them with one of the codelet names; either codelet name could be specified.
if(MOD(iter,2) == 1) then !$hmpp <GOL> even callsite, args[grid].noupdate=true call GOL(dim, grid, newGrid) else !$hmpp <GOL> odd callsite, args[grid].noupdate=true call GOL(dim, grid, newGrid) endif
We use the main game loop induction variable to switch between calling our even codelet and our odd codelet. Although it looks as though the same GOL subroutine is being called in both cases, because of the map directive defined with the codelets we know that the grid and newGrid GPU pointers will be switched. args[grid].noupdate=true indicates that we do not want any implicit data movement between the host and device.
! Read back final configuration to confirm it worked !$hmpp <GOL> delegatedstore, args[odd::newGrid]
After the kernel is finished we copy the final grid from the device back to the host with !$hmpp <GOL> delegatedstore, args[odd::newGrid].
!$hmpp <GOL> release
Finally, since we allocated the device memory we must free it with the !$hmpp <GOL> release directive.
GOL-region-Try2.f90
program main implicit none integer :: i,j,seed(8),iter,numNeighbors,total real :: randm ! Linear game grid dimension integer :: dim = 1024 ! Number of game iterations integer :: maxIter = 2**10 ! Game grid pointers integer,dimension(:,:),allocatable :: grid, newGrid, tmpGrid !$hmpp GOL allocate, args[grid,newGrid].size={(dim+2),(dim+2)} ! Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells allocate(grid(dim+2,dim+2)) allocate(newGrid(dim+2,dim+2)) ! Assign initial population randomly seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/) call random_seed(PUT=seed) do j=1,dim do i=1,dim call random_number(randm) grid(i,j) = nint(randm) enddo enddo ! Copy memory tp GPU !$hmpp GOL advancedload, args[dim;grid;newGrid] ! Main game loop do iter=1,maxITer !$hmpp GOL region, target=CUDA, args[dim,grid,newGrid].transfer=manual, args[iter].transfer=atcall if(MOD(iter,2) == 1) then ! Top-Bottom ghost rows do j=2,dim+1 grid(1,j) = grid(dim+1,j) !Copy first game grid row to bottom ghost row grid(dim+2,j) = grid(2,j) !Copy first game grid row to top ghost row enddo ! Left-Right ghost columns do i=1,dim+2 grid(i,1) = grid(i, dim+1) !Copy first game grid column to right ghost column grid(i,dim+2) = grid(i,2) !Copy last game grid column to left ghost column enddo ! Now we loop over all cells and determine their fate !$hmppcg gridify(j,i) do j=2,dim+1 do i=2,dim+1 ! Get the number of neighbors for a given grid point numNeighbors = grid(i,j+1) + grid(i,j-1)& !right & left + grid(i+1,j) + grid(i-1,j)& !upper & lower + grid(i+1,j+1) + grid(i-1,j-1)& !diagonals + grid(i-1,j+1) + grid(i+1,j-1) ! Here we have explicitly all of the game rules if(grid(i,j) == 1 .AND. numNeighbors < 2) then newGrid(i,j) = 0 elseif(grid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then newGrid(i,j) = 1 elseif(grid(i,j) == 1 .AND. numNeighbors > 3) then newGrid(i,j) = 0 elseif(grid(i,j) == 0 .AND. numNeighbors == 3) then newGrid(i,j) = 1 else newGrid(i,j) = grid(i,j) endif enddo enddo else ! Top-Bottom ghost rows do j=2,dim+1 newGrid(1,j) = newGrid(dim+1,j) !Copy first game grid row to bottom ghost row newGrid(dim+2,j) = newGrid(2,j) !Copy first game grid row to top ghost row enddo ! Left-Right ghost columns do i=1,dim+2 newGrid(i,1) = newGrid(i, dim+1) !Copy first game grid column to right ghost column newGrid(i,dim+2) = newGrid(i,2) !Copy last game grid column to left ghost column enddo ! Now we loop over all cells and determine their fate !$hmppcg gridify(j,i) do j=2,dim+1 do i=2,dim+1 ! Get the number of neighbors for a given grid point numNeighbors = newGrid(i,j+1) + newGrid(i,j-1)& !right & left + newGrid(i+1,j) + newGrid(i-1,j)& !upper & lower + newGrid(i+1,j+1) + newGrid(i-1,j-1)& !diagonals + newGrid(i-1,j+1) + newGrid(i+1,j-1) ! Here we have explicitly all of the game rules if(newGrid(i,j) == 1 .AND. numNeighbors < 2) then grid(i,j) = 0 elseif(newGrid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then grid(i,j) = 1 elseif(newGrid(i,j) == 1 .AND. numNeighbors > 3) then grid(i,j) = 0 elseif(newGrid(i,j) == 0 .AND. numNeighbors == 3) then grid(i,j) = 1 else grid(i,j) = newGrid(i,j) endif enddo enddo endif !End grid newGrid switch !$hmpp GOL endregion enddo!end main game loop ! Read back final configuration to confirm it worked !$hmpp GOL delegatedstore, args[grid] ! Sum up alive cells and print results total = 0 do j=2,dim+1 do i=2,dim+1 total = total + grid(i,j) enddo enddo print *, "Total Alive", total ! Release memory !$hmpp GOL release deallocate(grid) deallocate(newGrid) end program
Changes
!$hmpp GOL allocate, args[grid,newGrid].size={(dim+2),(dim+2)}
The hmpp GOL allocate directive allocates space on the GPU for the named arrays. The compiler requires us to specify the size of the passed-in arrays at allocation time with args[grid,newGrid].size={(dim+2),(dim+2)}.
!$hmpp GOL advancedload, args[dim;grid;newGrid]
The hmpp GOL advancedload, args[dim;grid;newGrid] directive copies the named arguments from the host to the GPU. Because this is done once, before the main game loop, the arrays are not re-copied at each iteration.
!$hmpp GOL region, target=CUDA, args[dim,grid,newGrid].transfer=manual, args[iter].transfer=atcall
In this version of the code the region sits inside the main game loop, but we no longer have to worry about unnecessary memory copies: dim, grid, and newGrid are marked transfer=manual, so they stay resident on the GPU, while iter, marked transfer=atcall, is the only value transferred at each call.
Lines 35-109:
if(MOD(iter,2) == 1) then
  ! Top-Bottom ghost rows
  do j=2,dim+1
    grid(1,j) = grid(dim+1,j)  !Copy first game grid row to bottom ghost row
    grid(dim+2,j) = grid(2,j)  !Copy first game grid row to top ghost row
  enddo
  ! Left-Right ghost columns
  do i=1,dim+2
    grid(i,1) = grid(i,dim+1)  !Copy first game grid column to right ghost column
    grid(i,dim+2) = grid(i,2)  !Copy last game grid column to left ghost column
  enddo
  ! Now we loop over all cells and determine their fate
  !$hmppcg gridify(j,i)
  do j=2,dim+1
    do i=2,dim+1
      ! Get the number of neighbors for a given grid point
      numNeighbors = grid(i,j+1) + grid(i,j-1)&      !right & left
                   + grid(i+1,j) + grid(i-1,j)&      !upper & lower
                   + grid(i+1,j+1) + grid(i-1,j-1)&  !diagonals
                   + grid(i-1,j+1) + grid(i+1,j-1)
      ! Here we have explicitly all of the game rules
      if(grid(i,j) == 1 .AND. numNeighbors < 2) then
        newGrid(i,j) = 0
      elseif(grid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
        newGrid(i,j) = 1
      elseif(grid(i,j) == 1 .AND. numNeighbors > 3) then
        newGrid(i,j) = 0
      elseif(grid(i,j) == 0 .AND. numNeighbors == 3) then
        newGrid(i,j) = 1
      else
        newGrid(i,j) = grid(i,j)
      endif
    enddo
  enddo
else
  ! Top-Bottom ghost rows
  do j=2,dim+1
    newGrid(1,j) = newGrid(dim+1,j)  !Copy first game grid row to bottom ghost row
    newGrid(dim+2,j) = newGrid(2,j)  !Copy first game grid row to top ghost row
  enddo
  ! Left-Right ghost columns
  do i=1,dim+2
    newGrid(i,1) = newGrid(i,dim+1)  !Copy first game grid column to right ghost column
    newGrid(i,dim+2) = newGrid(i,2)  !Copy last game grid column to left ghost column
  enddo
  ! Now we loop over all cells and determine their fate
  !$hmppcg gridify(j,i)
  do j=2,dim+1
    do i=2,dim+1
      ! Get the number of neighbors for a given grid point
      numNeighbors = newGrid(i,j+1) + newGrid(i,j-1)&      !right & left
                   + newGrid(i+1,j) + newGrid(i-1,j)&      !upper & lower
                   + newGrid(i+1,j+1) + newGrid(i-1,j-1)&  !diagonals
                   + newGrid(i-1,j+1) + newGrid(i+1,j-1)
      ! Here we have explicitly all of the game rules
      if(newGrid(i,j) == 1 .AND. numNeighbors < 2) then
        grid(i,j) = 0
      elseif(newGrid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
        grid(i,j) = 1
      elseif(newGrid(i,j) == 1 .AND. numNeighbors > 3) then
        grid(i,j) = 0
      elseif(newGrid(i,j) == 0 .AND. numNeighbors == 3) then
        grid(i,j) = 1
      else
        grid(i,j) = newGrid(i,j)
      endif
    enddo
  enddo
endif !End grid newGrid switch
A major change to our code is the removal of the separate grid/newGrid switching kernel. The two cases are now coded explicitly, and the parity of the main iteration variable, MOD(iter,2), selects between them. This increases the code complexity but avoids a costly memory transfer.
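The same parity trick can be written in a few lines of plain C, independent of HMPP. The following is a minimal illustration only (the buffer names and step function are hypothetical, not part of the tutorial code): on odd iterations the first buffer is read and the second is written, and on even iterations the roles are reversed, so no pointer swap or extra copy is needed.

#include <stdio.h>

#define N 8

/* Stand-in for one Game of Life update: dst is computed from src */
static void step(const int *src, int *dst)
{
    for (int i = 0; i < N; i++)
        dst[i] = src[i];
}

int main(void)
{
    int bufA[N] = {0}, bufB[N] = {0};

    for (int iter = 1; iter <= 4; iter++) {
        if (iter % 2 == 1)
            step(bufA, bufB);   /* odd iterations read A, write B */
        else
            step(bufB, bufA);   /* even iterations read B, write A */
    }

    printf("done\n");
    return 0;
}

The price of this pattern is duplicated loop bodies, exactly as seen in the Fortran code above, but the final result always lives in a known buffer once the loop count is fixed.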
! Now we loop over all cells and determine their fate
!$hmppcg gridify(j,i)
By default the hmpp compiler did not want to gridify the loop over each cell. We can attempt to force it to do so by adding the hmppcg gridify(j,i) directive, which specifies that we want to gridify with respect to the induction variables j and i.
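For reference, a hedged sketch of the same idea in C: this assumes the C spelling #pragma hmppcg gridify(j,i) mirrors the Fortran !$hmppcg gridify(j,i) used above, and the group name "smooth" and the array are illustrative only.

/* Force 2D gridification of the nested loop inside an HMPP region.
 * Everything here is a sketch; only the gridify directive itself is the
 * point of interest. */
void scale(int dim, float a[dim][dim])
{
    int i, j;
    #pragma hmpp smooth region, target=CUDA, args[*].transfer=atcall
    {
        /* Ask the code generator to map both loop levels onto the GPU grid */
        #pragma hmppcg gridify(j,i)
        for (j = 0; j < dim; j++)
            for (i = 0; i < dim; i++)
                a[j][i] = 2.0f * a[j][i];
    }
}

As in the Fortran version, the hmppcg directive is placed immediately before the loop nest it applies to.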
! Read back final configuration to confirm it worked
!$hmpp GOL delegatedstore, args[grid]
After the main game loop finishes we copy grid from the device back to the host with hmpp GOL delegatedstore, args[grid].
! Release memory
!$hmpp GOL release
Finally, since we allocated the device memory we must free it with the hmpp GOL release directive.
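To summarize the manual data-management lifecycle used in this version, here is a compact sketch written with the C directive spelling. The group name "life", the array, and the loop body are hypothetical; the directive forms simply mirror the Fortran code above, so treat it as an outline rather than a drop-in program.

#include <stdlib.h>

int main(void)
{
    int n = 1024;
    double *data;

    /* Reserve device storage; sizes must be stated at allocation time */
    #pragma hmpp life allocate, args[data].size={n}

    data = (double*)malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) data[i] = (double)i;

    /* One-time host-to-device copy before the compute loop */
    #pragma hmpp life advancedload, args[n;data]

    for (int iter = 0; iter < 10; iter++) {
        /* Arguments marked transfer=manual are not copied at each entry */
        #pragma hmpp life region, target=CUDA, args[n;data].transfer=manual
        {
            for (int i = 0; i < n; i++) data[i] = data[i] + 1.0;
        }
    }

    /* One device-to-host copy after the loop */
    #pragma hmpp life delegatedstore, args[data]

    /* Free the device storage we allocated */
    #pragma hmpp life release

    free(data);
    return 0;
}

The key point is the pairing: every allocate needs a matching release, and data marked transfer=manual only moves when an advancedload or delegatedstore says so.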
This sample shows a minimal conversion of our vector addition CPU code to an HMPP accelerator directives version; consider it an HMPP 'Hello World'. Modifications from the CPU version are shown and briefly discussed below. Please direct any questions or comments to help@nccs.gov.
HMPP allows code to be offloaded onto the GPU using two different methods, both of which are covered below. The codelet method allows an entire C function or Fortran subroutine to be executed on the GPU. The region method allows a contiguous block of code, not necessarily residing in a function or subroutine, to be executed on the GPU.
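The difference is easiest to see side by side. The following compact sketch is illustrative only (the group names dbl and twice and the doubling loop are made up; the full vector addition examples follow below), and it assumes the io property also accepts inout as reported by the --io-report output later in this tutorial.

/* Codelet method: the whole function runs on the GPU when invoked through a
 * matching callsite. */
#pragma hmpp dbl codelet, target=CUDA, args[*].transfer=atcall, args[v].io=inout
void dbl(int n, double v[n])
{
    for (int i = 0; i < n; i++) v[i] = 2.0 * v[i];
}

void run(int n, double *v)
{
    /* Region method: only the enclosed block runs on the GPU; no function
     * boundary is required. */
    #pragma hmpp twice region, target=CUDA, args[*].transfer=atcall
    {
        for (int i = 0; i < n; i++) v[i] = 2.0 * v[i];
    }

    /* Codelet invocation */
    #pragma hmpp dbl callsite
    dbl(n, v);
}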
vecAdd-codelet.c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#pragma hmpp vecAdd codelet, target=CUDA, args[*].transfer=atcall, args[ c ].io=out
void vecAdd(int n, double a[n], double b[n], double c[n])
{
    int j;
    for(j=0; j<n; j++) {
        c[j] = a[j] + b[j];
    }
}

int main( int argc, char* argv[] )
{
    // Size of vectors
    int n = 100000;

    // Input vectors
    double *a;
    double *b;
    // Output vector
    double *c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector
    a = (double*)malloc(bytes);
    b = (double*)malloc(bytes);
    c = (double*)malloc(bytes);

    // Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    int i;
    for(i=0; i<n; i++) {
        a[i] = sin(i)*sin(i);
        b[i] = cos(i)*cos(i);
    }

    // Sum component wise and save result into vector c
    #pragma hmpp vecAdd callsite
    vecAdd(n, a, b, c);

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++) {
        sum += c[i];
    }
    sum = sum/n;
    printf("final result: %f\n", sum);

    // Release memory
    free(a);
    free(b);
    free(c);

    return 0;
}
Changes to vecAdd-codelet.c
#pragma hmpp vecAdd codelet, target=CUDA, args[*].transfer=atcall, args[ c ].io=out
void vecAdd(int n, double a[n], double b[n], double c[n])
{
    int j;
    for(j=0; j<n; j++) {
        c[j] = a[j] + b[j];
    }
}
The combined #pragma hmpp directive and C function vecAdd form what is referred to as the codelet. This codelet, given the name vecAdd, will be computed on the GPU when matched with an HMPP callsite. Memory is copied from the CPU to the GPU at the start of the codelet and back from the GPU to the CPU at the end of the codelet. Note that the current compiler, version 2.4.1, does not correctly copy the vector c from the GPU back to the host at the end of the codelet call, so we must specify the transfer explicitly with args[c].io=out. This will be explored in more detail later.
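Rather than relying on the compiler's analysis for the other arguments, the intents can all be stated explicitly. The following variant is a hedged sketch: it assumes the io property accepts in as well as out, and it is not the form used in the tutorial code above.

/* Every argument intent stated explicitly, so the data movement does not
 * depend on what the compiler infers. */
#pragma hmpp vecAdd codelet, target=CUDA, args[*].transfer=atcall, args[n;a;b].io=in, args[c].io=out
void vecAdd(int n, double a[n], double b[n], double c[n])
{
    int j;
    for (j = 0; j < n; j++)
        c[j] = a[j] + b[j];
}

Being explicit here costs nothing at runtime and makes the --io-report output discussed later easy to verify by eye.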
#pragma hmpp vecAdd callsite
vecAdd(n, a, b, c);
The combined #pragma hmpp directive and C function call form what is referred to as the callsite. The callsite will trigger the specified codelet to be run on the GPU.
Compiling and Running vecAdd-codelet.c
Before compiling, the CAPS HMPP compiler (capsmc) and CUDA toolkit modules must be loaded:
$ module load PrgEnv-pgi capsmc cudatoolkit $ hmpp cc vecAdd-codelet.c -o vecAdd.out
Output:
The compiler will output the following:
hmpp: [Info] Generated codelet filename is "vecadd_cuda.cu".
hmppcg: [Message DPL3000] vecAdd-codelet.c:9: Loop 'j' was gridified (1D)
The compiler tells us that it has created the CUDA file vecadd_cuda.cu for the codelet. The second line tells us that the loop starting on line 9 with induction variable 'j' will be parallelized on the GPU and that the kernel will launch with a 1 dimensional grid of thread blocks.
$ aprun ./vecAdd.out final result: 1.000000
vecAdd-codelet.f90
!$hmpp vecAdd codelet, target=CUDA, args[*].transfer=atcall, args[ c ].io=out
subroutine vecAdd(n, a, b, c)
  implicit none
  integer, intent(in) :: n
  real(8), intent(in) :: a(n), b(n)
  real(8), intent(out) :: c(n)
  integer :: j

  do j=1,n
    c(j) = a(j) + b(j)
  enddo
end subroutine vecAdd

program main
  ! Size of vectors
  integer :: n = 100000

  ! Input vectors
  real(8),dimension(:),allocatable :: a
  real(8),dimension(:),allocatable :: b
  ! Output vector
  real(8),dimension(:),allocatable :: c

  integer :: i
  real(8) :: sum

  ! Allocate memory for each vector
  allocate(a(n))
  allocate(b(n))
  allocate(c(n))

  ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
  do i=1,n
    a(i) = sin(i*1D0)*sin(i*1D0)
    b(i) = cos(i*1D0)*cos(i*1D0)
  enddo

  ! Sum component wise and save result into vector c
  !$hmpp vecAdd callsite
  call vecAdd(n, a, b, c)

  ! Sum up vector c and print result divided by n, this should equal 1 within error
  sum = 0.0d0
  do i=1,n
    sum = sum + c(i)
  enddo
  sum = sum/n
  print *, 'final result: ', sum

  ! Release memory
  deallocate(a)
  deallocate(b)
  deallocate(c)
end program main
Changes to vecAdd-codelet.f90
!$hmpp vecAdd codelet, target=CUDA, args[*].transfer=atcall, args[ c ].io=out
subroutine vecAdd(n, a, b, c)
  implicit none
  integer, intent(in) :: n
  real(8), intent(in) :: a(n), b(n)
  real(8), intent(out) :: c(n)
  integer :: j

  do j=1,n
    c(j) = a(j) + b(j)
  enddo
end subroutine vecAdd
The combined !$hmpp directive and Fortran subroutine vecAdd form what is referred to as the codelet. This codelet, given the name vecAdd, will be computed on the GPU when matched with an HMPP callsite. Memory is copied from the CPU to the GPU at the start of the codelet and back from the GPU to the CPU at the end of the codelet. Note that the current compiler, version 2.4.1, does not correctly copy the vector c from the GPU back to the host at the end of the codelet call, so we must specify the transfer explicitly with args[c].io=out. This will be explored in more detail later.
!$hmpp vecAdd callsite
call vecAdd(n, a, b, c)
The combined !$hmpp directive and Fortran subroutine call form what is referred to as the callsite. The callsite will trigger the specified codelet to be run on the GPU.
Compiling vecAdd-codelet.f90
$ module load PrgEnv-pgi cudatoolkit capsmc $ hmpp ftn vecAdd.f90 -o vecAdd.out
Output:
The compiler will output the following:
hmpp: [Info] Generated codelet filename is "vecadd_cuda.cu".
hmppcg: [Message DPL3000] vecAdd-codelet.f90:9: Loop 'j' was gridified (1D)
The compiler tells us that it has created the CUDA file vecadd_cuda.cu for the codelet. The second line tells us that the loop starting on line 9 with induction variable 'j' will be parallelized on the GPU and that the kernel will launch with a 1 dimensional grid of thread blocks.
Additional Codelet Information
Much of the data movement is hidden from the programmer, so let's add the --io-report flag to hmpp to see what memory transfers will take place between the host and the GPU.
C
$ hmpp --io-report cc vecAdd-codelet.c -o vecAdd.out
Fortran
$ hmpp --io-report ftn vecAdd.f90 -o vecAdd.out
Output
In GROUP 'vecadd'
  CODELET 'vecadd' at vecAdd-codelet.c:5, function 'vecAdd'
  Parameter 'n' has intent IN
  Parameter 'a' has intent IN
  Parameter 'b' has intent IN
  Parameter 'c' has intent OUT
We see that n, a, and b will be copied into the GPU while c will be copied out.
What if we were to omit the explicit output intent for the vector c in our codelet declaration?
C
#pragma hmpp vecAdd codelet, target=CUDA
Fortran
!$hmpp vecAdd codelet, target=CUDA
Output
In GROUP 'vecadd'
  CODELET 'vecadd' at vecAdd-codelet.c:5, function 'vecAdd'
  Parameter 'n' has intent IN
  Parameter 'a' has intent IN
  Parameter 'b' has intent IN
  Parameter 'c' has intent IN
We see that the compiler no longer copies the vector c back to the host. This causes erroneous results without producing any warning or error message, so it is vitally important to always check that the memory transfers are correct.
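One cheap safeguard is to validate the result on the host after the callsite. The helper below is illustrative only (not part of the tutorial code); it relies on the fact that sin(i)^2 + cos(i)^2 = 1, so the mean of c should be 1 within rounding. If c is never copied back from the GPU, the check fails loudly instead of the program silently printing a wrong answer.

#include <math.h>
#include <stdio.h>

/* Returns 1 if the mean of c is 1 within a small tolerance, 0 otherwise. */
static int check_result(const double *c, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += c[i];
    return fabs(sum / n - 1.0) < 1e-6;
}

A call such as

if (!check_result(c, n)) fprintf(stderr, "GPU result was not copied back correctly\n");

placed right after the callsite makes the missing transfer obvious during testing.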
vecAdd-region.c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main( int argc, char* argv[] )
{
    // Size of vectors
    int n = 100000;

    // Input vectors
    double *a;
    double *b;
    // Output vector
    double *c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector
    a = (double*)malloc(bytes);
    b = (double*)malloc(bytes);
    c = (double*)malloc(bytes);

    // Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    int i;
    for(i=0; i<n; i++) {
        a[i] = sin(i)*sin(i);
        b[i] = cos(i)*cos(i);
    }

    // Sum component wise and save result into vector c
    #pragma hmpp vecAdd region, target=CUDA, args[*].transfer=atcall
    {
        int j;
        for(j=0; j<n; j++) {
            c[j] = a[j] + b[j];
        }
    }

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++) {
        sum += c[i];
    }
    sum = sum/n;
    printf("final result: %f\n", sum);

    // Release memory
    free(a);
    free(b);
    free(c);

    return 0;
}
Changes to vecAdd-region.c
#pragma hmpp vecAdd region, target=CUDA, args[*].transfer=atcall
{
    int j;
    for(j=0; j<n; j++) {
        c[j] = a[j] + b[j];
    }
}
The code inside of the hmpp region is computed on the GPU. The region begins with the #pragma hmpp region directive and is enclosed in curly brackets. Memory is copied from the CPU to the GPU at the start of the region and back from the GPU to the CPU at the end of the region.
Compiling vecAdd-region.c
$ hmpp --io-report cc vecAdd-region.c -o vecAdd.out
Output:
The compiler will output the following:
In GROUP 'vecadd'
  REGION 'vecadd' at vecAdd-region.c:34, function '__hmpp_region__vecadd'
  Parameter 'n' has intent IN
  Parameter 'a' has intent IN
  Parameter 'b' has intent IN
  Parameter 'c' has intent INOUT
We see that n, a, b, and c will be copied to the GPU and that c will be copied back out. This produces the correct output, although copying the contents of c to the GPU is unnecessary extra work. Memory management is looked at further in the Game of Life tutorial.
hmpp: [Info] Generated codelet filename is "vecadd_cuda.cu".
hmppcg: [Message DPL3000] vecAdd-region.c:37: Loop 'j' was gridified (1D)
The compiler tells us that it has created the CUDA file vecadd_cuda.cu for the region. The second line tells us that the loop starting on line 37 with induction variable ‘j’ will be parallelized on the GPU and that the kernel will launch with a 1 dimensional grid of thread blocks.
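As noted above, the region copies the meaningless initial contents of c to the GPU. One possible drop-in variant of the region declares c as output only so that the copy-in is skipped; this is a hedged sketch that assumes the io property is accepted on region arguments the same way it is on codelets, so check the HMPP documentation for your compiler version before relying on it.

/* Same region as in vecAdd-region.c, but with c declared output-only so its
 * host contents are not transferred to the GPU first. */
#pragma hmpp vecAdd region, target=CUDA, args[*].transfer=atcall, args[c].io=out
{
    int j;
    for (j = 0; j < n; j++) {
        c[j] = a[j] + b[j];
    }
}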
vecAdd-region.f90
program main
  ! Size of vectors
  integer :: n = 100000

  ! Input vectors
  real(8),dimension(:),allocatable :: a
  real(8),dimension(:),allocatable :: b
  ! Output vector
  real(8),dimension(:),allocatable :: c

  integer :: i
  real(8) :: sum

  ! Allocate memory for each vector
  allocate(a(n))
  allocate(b(n))
  allocate(c(n))

  ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
  do i=1,n
    a(i) = sin(i*1D0)*sin(i*1D0)
    b(i) = cos(i*1D0)*cos(i*1D0)
  enddo

  ! Sum component wise and save result into vector c
  !$hmpp vecAdd region, target=CUDA, args[*].transfer=atcall
  do i=1,n
    c(i) = a(i) + b(i)
  enddo
  !$hmpp vecAdd endregion

  ! Sum up vector c and print result divided by n, this should equal 1 within error
  sum = 0.0d0
  do i=1,n
    sum = sum + c(i)
  enddo
  sum = sum/n
  print *, 'final result: ', sum

  ! Release memory
  deallocate(a)
  deallocate(b)
  deallocate(c)
end program
Changes to vecAdd-region.f90
!$hmpp vecAdd region, target=CUDA, args[*].transfer=atcall
do i=1,n
  c(i) = a(i) + b(i)
enddo
!$hmpp vecAdd endregion
The code inside of the hmpp region is computed on the GPU. The region begins with the !$hmpp region directive and ends with the !$hmpp endregion directive. Memory is copied from the CPU to the GPU at the start of the region and back from the GPU to the CPU at the end of the region.
Compiling vecAdd-region.f90
$ hmpp --io-report ftn vecAdd.f90 -o vecAdd.out
Output
The compiler will output the following:
In GROUP 'vecadd'
  REGION 'vecadd' at vecAdd-region.f90:28, function 'hmpp_region__vecadd'
  Parameter 'n' has intent IN
  Parameter 'n_1' has intent IN
  Parameter 'n_2' has intent IN
  Parameter 'a' has intent IN
  Parameter 'n_4' has intent IN
  Parameter 'n_5' has intent IN
  Parameter 'b' has intent IN
  Parameter 'n_7' has intent IN
  Parameter 'n_8' has intent IN
  Parameter 'c' has intent INOUT
  Parameter 'i' has intent INOUT
The current HMPP compiler doesn’t do too well with the Fortran version, copying in several variables that are not used. For now we will need to ignore these erroneous variable copies and their associated warning messages. We do see that n, a, b, and c will be copied into the GPU while c will be copied out. This will produce the correct output although the GPU is doing the extra work of copying over the content of c to the GPU when it is unnecessary. Memory management will be looked at further in the Game of Life tutorial.
hmpp: [Info] Generated codelet filename is "vecadd_cuda.cu".
hmppcg: [Message DPL3000] vecAdd-region.f90:29: Loop 'i' was gridified (1D)
The compiler tells us that it has created the CUDA file vecadd_cuda.cu for the region. The second line tells us that the loop starting on line 29 with induction variable ‘i’ will be parallelized on the GPU and that the kernel will launch with a 1 dimensional grid of thread blocks.