Experience in porting Fortran code to GPUs with OpenACC
Speakers: Mike Zingale and Adam Jacobs
This talk was given at the 9-23-15 OLCF User Group Conference Call.
In the slides and video below, we describe our efforts to put large physics modules written in Fortran onto GPUs. Our main target is integrating stiff systems of ODEs describing nuclear reactions. Our strategy is to put the entire integration (right-hand side evaluations, Jacobian construction, and the actual timestepping) onto the GPUs. This minimizes communication costs: only the initial state vector is passed in, and the new state vector comes out at the end. We discuss the challenges we had in getting the compilers to accept modern Fortran constructs with OpenACC and present preliminary results.
Slides: openacc-networks.pdf
Video: "Experience in porting Fortran code to GPUs with OpenACC," from OLCF on Vimeo.
To fully exploit the computational power of the GPU, a large amount of data parallelism must be expressed. If your problem does not possess a sufficient amount of data parallelism, a second option is to combine data parallelism with task parallelism on the GPU through the use of concurrent kernels. To facilitate task parallelism, the NVIDIA Kepler K20x features Hyper-Q, a set of 32 hardware-managed work queues. When using CUDA streams, each stream is automatically mapped onto Hyper-Q, allowing up to 32 streams to execute concurrently. The NVIDIA Multi-Process Service allows multiple processes, such as intra-node MPI ranks, to be mapped onto Hyper-Q. This tutorial will demonstrate how to take advantage of GPU concurrency on Titan through the use of Hyper-Q. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.
C
The Serial, OpenMP, and MPI C samples make use of the following GPU wrapper code. The wrapper is used for the following:

sleep(): GPU kernel to sleep for a given number of clock cycles
get_cycles(): Return the number of cycles required to sleep for the requested number of seconds
create_streams(): Create a given number of non-default streams
sleep_kernel(): Launch a single GPU thread that sleeps for a given number of cycles
wait_for_streams(): Wait for streams 1 through the number of streams specified to complete any work
destroy_streams(): Destroy the non-default streams
sleep.cu
#include <stdio.h> #include <stdlib.h> #include <math.h> #include <time.h> static cudaStream_t *streams; // CUDA kernel to pause for at least num_cycle cycles __global__ void sleep(int64_t num_cycles) { int64_t cycles = 0; int64_t start = clock64(); while(cycles < num_cycles) { cycles = clock64() - start; } } // Returns number of cycles required for requested seconds extern "C" int64_t get_cycles(float seconds) { // Get device frequency in KHz int64_t Hz; cudaDeviceProp prop; cudaGetDeviceProperties(&prop, 0); Hz = int64_t(prop.clockRate) * 1000; // Calculate number of cycles to wait int64_t num_cycles; num_cycles = (int64_t)(seconds * Hz); return num_cycles; } // Create streams extern "C" void create_streams(int num_streams) { // Allocate streams streams = (cudaStream_t *) malloc((num_streams+1)*sizeof(cudaStream_t)); // Default stream streams[0] = NULL; // Primer kernel launch sleep<<< 1, 1 >>>(1); // Create streams for(int i = 1; i <= num_streams; i++) cudaStreamCreate(&streams[i]); } // Launches a kernel that sleeps for num_cycles extern "C" void sleep_kernel(int64_t num_cycles, int stream_id) { // Launch a single GPU thread to sleep int blockSize, gridSize; blockSize = 1; gridSize = 1; // Execute the kernel sleep<<< gridSize, blockSize, 0, streams[stream_id] >>>(num_cycles); } // Wait for stream to complete extern "C" void wait_for_stream(int stream_id) { cudaStreamSynchronize(streams[stream_id]); } // Wait for streams to complete extern "C" void wait_for_streams(int num_streams) { for(int i = 1; i <= num_streams; i++) cudaStreamSynchronize(streams[i]); } // Destroy stream objects extern "C" void destroy_streams(int num_streams) { // Clean up stream for(int i = 1; i <= num_streams; i++) cudaStreamDestroy(streams[i]); free(streams); }
Compile:
The following will compile the wrapper functions into object code to be linked in with our examples.
$ module load cudatoolkit
$ nvcc -arch=sm_35 -c sleep.cu
Serially Launched Concurrent Kernels in C
The following sample launches multiple GPU kernels that each sleep a single GPU thread for one second. GPU kernel launches are asynchronous with respect to the CPU, which allows a for loop to be used to launch concurrent kernels. Each kernel is launched into its own stream, allowing up to 32 kernels to execute concurrently.
launcher.c
#include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <sys/time.h> int64_t get_cycles(float seconds); void sleep_kernel(int64_t num_cycles, int stream_id); void create_streams(int num_streams); void wait_for_streams(int num_streams); void destroy_streams(int num_streams); int main(int argc, char *argv[]) { uint64_t cycles; struct timeval start, stop; int i, num_kernels; // Get number of cycles to sleep for 1 second cycles = get_cycles(1.0); // Max number of kernels to launch int max_kernels = 33; // Loop through number of kernels to launch, from 1 to num_kernels for(num_kernels=1; num_kernels<=max_kernels; num_kernels++) { // Start timer gettimeofday(&start, NULL); // Create streams create_streams(num_kernels); // Launch num_kernel kernels asynchrnously for(i=1; i<=num_kernels; i++){ sleep_kernel(cycles, i); } // Wait for kernels to complete wait_for_streams(num_kernels); // Clean up streams destroy_streams(num_kernels); // Print seconds ellapsed gettimeofday(&stop, NULL); double seconds; seconds = (stop.tv_sec - start.tv_sec); seconds += (stop.tv_usec - start.tv_usec) / 1000000.0; printf("Total time for %d kernels: %f s\n", num_kernels, seconds); } return 0; }
Compile:
$ module load cudatoolkit
$ cc sleep.o launcher.c -o serial.out
Run:
$ module load cudatoolkit
$ aprun ./serial.out
OpenMP Launched Concurrent Kernels in C
The following example uses OpenMP to launch multiple sleep kernels in a parallel region, using the OMP thread number to specify which stream to use. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.
launcher.c
#include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <omp.h> int64_t get_cycles(float seconds); void sleep_kernel(int64_t num_cycles, int stream_id); void create_streams(int num_streams); void wait_for_streams(int num_streams); void destroy_streams(int num_streams); int main(int argc, char *argv[]) { uint64_t cycles; double start, stop; int i, num_kernels; // Get number of cycles to sleep for 1 second cycles = get_cycles(1.0); // Number of kernels to launch int max_kernels = 33; // Loop through number of kernels to launch, from 1 to num_kernels for(num_kernels=1; num_kernels<=max_kernels; num_kernels++) { // Set number of OMP threads omp_set_num_threads(num_kernels); // Start timer start = omp_get_wtime(); // Create streams create_streams(num_kernels); // Launch num_kernel kernels asynchrnously #pragma omp parallel firstprivate(cycles) { int stream_id = omp_get_thread_num()+1; sleep_kernel(cycles, stream_id); } // Wait for kernels to complete wait_for_streams(num_kernels); // Wait for kernels to complete and clean up streams destroy_streams(num_kernels); // Stop timer stop = omp_get_wtime(); printf("Total time for %d kernels: %f s\n", num_kernels, stop-start); } return 0; }
Compile:
PGI
$ module load cudatoolkit
$ cc -mp sleep.o launcher.c -o openmp.out
GNU
$ module load cudatoolkit
$ cc -fopenmp sleep.o launcher.c -o openmp.out
Intel
$ module load cudatoolkit
$ cc -openmp sleep.o launcher.c -o openmp.out
Cray
$ module load cudatoolkit
$ cc sleep.o launcher.c -o openmp.out
Run:
$ module load cudatoolkit
$ aprun -d16 ./openmp.out
MPI Launched Concurrent Kernels in C
The following example uses MPI to launch a single GPU kernel per MPI process into the default stream. Titan's GPU compute mode does not allow multiple processes to access the GPU simultaneously, so the Cray CUDA proxy (CRAY_CUDA_PROXY) must be enabled. When enabled, the Cray CUDA proxy works in conjunction with Hyper-Q to allow kernels from up to 32 different MPI processes to run concurrently. Note, however, that on Titan the maximum number of MPI processes per node is currently 16.
launcher.c
#include <stdio.h> #include <stdlib.h> #include <stdint.h> #include "mpi.h" int64_t get_cycles(float seconds); void sleep_kernel(int64_t num_cycles, int stream_id); void create_streams(int num_streams); void wait_for_stream(int stream_id); void destroy_streams(int num_streams); int main(int argc, char *argv[]) { MPI_Init(&argc, &argv); int rank, size; MPI_Comm_rank(MPI_COMM_WORLD, &rank); MPI_Comm_size(MPI_COMM_WORLD, &size); uint64_t cycles; double start, stop; int num_kernels; // Get number of cycles to sleep for 1 second cycles = get_cycles(1.0); // Number of kernels to launch int max_kernels = size; // Setup default stream in sleep.cu wrapper create_streams(0); // Loop through number of kernels to launch, from 1 to max_kernels for(num_kernels=1; num_kernels<=max_kernels; num_kernels++) { // Start timer MPI_Barrier(MPI_COMM_WORLD); if(rank == 0) start = MPI_Wtime(); // Launch kernel into default stream if(rank < num_kernels) sleep_kernel(cycles, 0); // Wait for all ranks to submit kernel MPI_Barrier(MPI_COMM_WORLD); // Wait for default stream if(rank < num_kernels) wait_for_stream(0); // Wait for all ranks to complete MPI_Barrier(MPI_COMM_WORLD); // Print seconds ellapsed if(rank == 0) { stop = MPI_Wtime(); printf("Total time for %d kernels: %f s\n", num_kernels, stop-start); } } destroy_streams(0); MPI_Finalize(); return 0; }
Compile:
$ module load cudatoolkit
$ cc sleep.o launcher.c -o proxy.out
Run:
$ module load cudatoolkit
$ export CRAY_CUDA_PROXY=1
$ aprun -n16 ./proxy.out
OpenACC Launched Concurrent Kernels in C
By default, OpenACC blocks on the CPU when launching kernels or performing data movement. The async(i) clause allows asynchronous kernel launches and data transfers. Although it is implementation dependent, the integer argument i will generally determine which stream the kernel or data operation is launched into. At the time of this writing, each implementation handles the integer argument slightly differently.
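As a minimal illustration of the async/wait pairing (a hypothetical sketch, not code from the sample below), independent work can be placed in separate async queues and synchronized per queue with wait(i), or all at once with a bare wait:

// Hypothetical helper: two independent loops placed in different async queues.
void scale_and_shift(double *a, double *b, int n)
{
    int i;
    #pragma acc data copy(a[0:n], b[0:n])
    {
        // The two kernels below may execute concurrently on the device.
        #pragma acc parallel loop async(1)
        for (i = 0; i < n; i++) a[i] = 2.0 * a[i];

        #pragma acc parallel loop async(2)
        for (i = 0; i < n; i++) b[i] = b[i] + 1.0;

        // Wait for each queue individually; "#pragma acc wait" waits for all queues.
        #pragma acc wait(1)
        #pragma acc wait(2)
    }
}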
launcher.c
#include <stdio.h> #include <stdlib.h> #include <stdint.h> #include "math.h" #include <sys/time.h> int main(int argc, char *argv[]) { uint64_t num_cycles; struct timeval start, stop; int i,num_kernels; // Set number of cycles to sleep for 1 second // We'll use frequency/instruction latency to estimate this num_cycles = 730000000/15; // Number of kernels to launch int max_kernels = 33; // Loop through number of kernels to launch, from 1 to num_kernels for(num_kernels=1; num_kernels<max_kernels; num_kernels++) { // Start timer gettimeofday(&start, NULL); for(i=0; i<=num_kernels; i++) { #pragma acc parallel async(i) vector_length(1) num_gangs(1) { uint64_t cycles; #pragma acc loop seq for(cycles=0; cycles<num_cycles; cycles++) { cycles = cycles; } } } // Wait for all async streams to complete #pragma acc wait // Print seconds ellapsed gettimeofday(&stop, NULL); double seconds; seconds = (stop.tv_sec - start.tv_sec); seconds += (stop.tv_usec - start.tv_usec) / 1000000.0; printf("Total time for %d kernels: %f s\n", num_kernels, seconds); } return 0; }
Note: The compiler may optimize away the spin loop inside the kernel. Adding the volatile qualifier to the declaration of cycles should force the loop to be kept; however, the volatile qualifier is currently not supported in PGI accelerator regions.
Compile:
PGI
$ module load cudatoolkit
$ cc -acc launcher.c -o acc.out
Cray
$ module load cudatoolkit
$ cc -hpragma=acc launcher.c -o acc.out
Run:
$ module load cudatoolkit
$ aprun ./acc.out
Fortran
The Serial, OpenMP, and MPI Fortran samples make use of the following CUDA Fortran GPU wrapper code. The wrapper is used for the following:

sleep(): GPU kernel to sleep for a given number of clock cycles
get_cycles(): Return the number of cycles required to sleep for the requested number of seconds
create_streams(): Create a given number of non-default streams
sleep_kernel(): Launch a single GPU thread that sleeps for a given number of cycles
wait_for_streams(): Wait for streams 1 through the number of streams specified to complete any work
destroy_streams(): Destroy the non-default streams
module sleep use cudadevice use cudafor implicit none integer, dimension(:), allocatable :: streams contains !CUDA kernel to pause for at least num_cycle cycles attributes(global) subroutine sleep(num_cycles) integer(8), value :: num_cycles integer(8) :: cycles integer(8) :: start cycles = 0 start = clock64 do while (cycles < num_cycles) cycles = clock64 - start enddo end subroutine sleep !Returns number of cycles required for requested seconds integer(8) function get_cycles(seconds) result(num_cycles) real(8), intent(in) :: seconds integer(8) :: istat, Hz type(cudadeviceprop) :: prop istat = cudaGetDeviceProperties(prop, 0) Hz = prop%clockRate * 1000 num_cycles = seconds * Hz end function get_cycles !Create streams subroutine create_streams(num_streams) integer :: num_streams, istat, i allocate(streams(num_streams+1)) streams(1) = 0 ! Primer kernel launch call sleep<<< 1, 1 >>>(int8(1)); do i=2,num_streams+1 istat = cudaStreamCreate(streams(i)) enddo end subroutine create_streams !Launches a kernel that sleeps for num_cycles subroutine sleep_kernel(num_cycles, stream_id) integer(8) :: num_cycles integer :: stream_id type(dim3) :: blockSize, gridSize blockSize = dim3(1,1,1) gridSize = dim3(1,1,1) call sleep<<<gridSize, blockSize, 0, streams(stream_id)>>>(num_cycles) end subroutine sleep_kernel ! Wait for stream to complete subroutine wait_for_stream(stream_id) integer :: stream_id, istat istat = cudaStreamSynchronize(streams(stream_id)) end subroutine wait_for_stream ! Wait for streams to complete subroutine wait_for_streams(num_streams) integer :: num_streams, istat, i do i=2,num_streams+1 istat = cudaStreamSynchronize(streams(i)) enddo end subroutine wait_for_streams ! Destroy streams subroutine destroy_streams(num_streams) integer :: num_streams, i, istat do i=2,num_streams+1 istat = cudaStreamDestroy(streams(i)) enddo deallocate(streams) end subroutine destroy_streams end module sleep
Serially Launched Concurrent Kernels in Fortran
The following sample launches multiple GPU kernels that each sleep a single GPU thread for one second. GPU kernel launches are asynchronous with respect to the CPU, which allows a do loop to be used to launch concurrent kernels. Each kernel is launched into its own stream, allowing up to 32 kernels to execute concurrently.
launcher.f90
program main use sleep integer(8) :: cycles integer :: max_kernels, num_kernels, i real(8) :: start, stop, seconds ! Get number of cycles to sleep for 1 second seconds = 1.0 cycles = get_cycles(seconds) ! Maximum number of kernels to launch max_kernels = 32 ! Loop through number of kernels to launch, from 1 to num_kernels do num_kernels = 1, max_kernels ! Start timer call cpu_time(start) ! Create streams call create_streams(num_kernels) ! Launch num_kernel kernels asynchrnously do i = 2, num_kernels+1 call sleep_kernel(cycles, i) enddo ! Wait for kernels to complete and clean up streams call destroy_streams(num_kernels) ! Stop timer call cpu_time(stop) print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds' enddo end program main
Compile:
PGI
$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o serial.out
OpenMP Launched Concurrent Kernels in Fortran
The following example uses OpenMP to launch multiple sleep kernels in a parallel region, using the OMP thread number to specify which stream to use. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.
launcher.f90
program main use sleep use omp_lib implicit none integer(8) :: cycles integer :: max_kernels, num_kernels, stream_id real(8) :: start, stop, seconds ! Get number of cycles to sleep for 1 second seconds = 1.0 cycles = get_cycles(seconds) ! Maximum number of kernels to launch max_kernels = 33 ! Loop through number of kernels to launch, from 1 to num_kernels do num_kernels = 1, max_kernels ! Set number of OMP threads call omp_set_num_threads(num_kernels) ! Start timer start = omp_get_wtime() ! Create streams call create_streams(num_kernels) ! Launch num_kernel kernels asynchrnously !$omp parallel private(stream_id) firstprivate(cycles) stream_id = omp_get_thread_num()+2 call sleep_kernel(cycles, stream_id) !$omp end parallel ! Wait for kernels to complete and clean up streams call destroy_streams(num_kernels) ! Stop timer stop = omp_get_wtime() print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds' enddo end program main
Compile:
PGI
$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o openmp.out
MPI Launched Concurrent Kernels in Fortran
The following example uses MPI to launch a single GPU kernel per MPI process into the default stream. Titan's GPU compute mode does not allow multiple processes to access the GPU simultaneously, so the Cray CUDA proxy (CRAY_CUDA_PROXY) must be enabled. When enabled, the Cray CUDA proxy works in conjunction with Hyper-Q to allow kernels from up to 32 different MPI processes to run concurrently. Note, however, that on Titan the maximum number of MPI processes per node is currently 16.
launcher.f90
program main use sleep implicit none include 'mpif.h' integer :: max_kernels, num_kernels, i, ierr, rank, size integer(8) :: cycles real(8) :: start, stop, seconds call MPI_Init(ierr) ! Get number of cycles to sleep for 1 second seconds = 1.0 cycles = get_cycles(seconds) call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr) call MPI_Comm_size(MPI_COMM_WORLD, size, ierr) ! Number of kernels to launch max_kernels = size ! Setup default stream in sleep.cu wrapper call create_streams(0); ! Loop through number of kernels to launch, from 1 to max_kernels do num_kernels = 1, max_kernels ! Start timer call MPI_Barrier(MPI_COMM_WORLD, ierr) if (rank == 0) then start = MPI_Wtime() endif ! Launch num_kernel kernels asynchrnously if (rank < num_kernels) then call sleep_kernel(cycles, 1) endif ! Wait for all ranks to submit kernel call MPI_Barrier(MPI_COMM_WORLD, ierr) ! Wait for kernel to complete if(rank < num_kernels) then call wait_for_stream(0) endif ! Wait for all ranks to complete call MPI_Barrier(MPI_COMM_WORLD, ierr) ! Print seconds ellapsed if (rank == 0) then stop = MPI_Wtime() print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds' endif enddo ! clean up array in wrapper, no stream actually destroyed call destroy_streams(0) call MPI_Finalize(ierr) end program main
Compile:
PGI
$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o proxy.out
Run:
$ module load cudatoolkit
$ export CRAY_CUDA_PROXY=1
$ aprun -n16 ./proxy.out
OpenACC Launched Concurrent Kernels in Fortran
By default, OpenACC blocks on the CPU when launching kernels or performing data movement. The async(i) clause allows asynchronous kernel launches and data transfers. Although it is implementation dependent, the integer argument i will generally determine which stream the kernel or data operation is launched into. At the time of this writing, each implementation handles the integer argument slightly differently.
launcher.f90
program main implicit none integer(8) :: num_cycles, cycles, i integer :: max_kernels, num_kernels, stream_id real(8) :: start, stop, seconds real(4) :: foo(33) ! Set number of cycles to sleep for 1 second ! We'll use frequency/instruction latency to estimate this num_cycles = 730000000/(15*5) ! Maximum number of kernels to launch max_kernels = 33 ! Loop through number of kernels to launch, from 1 to num_kernels do num_kernels = 1, max_kernels ! Start timer call cpu_time(start) ! Launch num_kernel kernels asynchrnously do i = 1, num_kernels !$acc parallel async(i) vector_length(1) num_gangs(1) copy(foo) !$acc loop seq do cycles = 1, num_cycles foo(i) = sin(1.5708) enddo !$acc end loop !$acc end parallel enddo ! Wait for all async streams to complete !$acc wait ! Stop timer call cpu_time(stop) print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds' enddo end program main
Note: The compiler may optimize away the work loop inside the kernel. Adding the volatile qualifier to the declaration of foo should force the loop to be kept; however, the volatile qualifier is currently not supported in PGI accelerator regions.
Compile:
PGI
$ module load cudatoolkit
$ ftn -acc launcher.f90 -o acc.out
Cray
$ module load cudatoolkit
$ ftn -hacc launcher.f90 -o acc.out
Run:
$ module load cudatoolkit
$ aprun ./acc.out
This sample shows a minimal conversion of our Vector Addition CPU code to an OpenACC accelerator directives version. Consider this an OpenACC 'Hello World'. Modifications from the CPU version will be highlighted and briefly discussed. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.
vecAdd.c
#include <stdio.h> #include <stdlib.h> #include <math.h> int main( int argc, char* argv[] ) { // Size of vectors int n = 10000; // Input vectors double *restrict a; double *restrict b; // Output vector double *restrict c; // Size, in bytes, of each vector size_t bytes = n*sizeof(double); // Allocate memory for each vector a = (double*)malloc(bytes); b = (double*)malloc(bytes); c = (double*)malloc(bytes); // Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2 int i; for(i=0; i<n; i++) { a[i] = sin(i)*sin(i); b[i] = cos(i)*cos(i); } // sum component wise and save result into vector c #pragma acc kernels copyin(a[0:n],b[0:n]), copyout(c[0:n]) for(i=0; i<n; i++) { c[i] = a[i] + b[i]; } // Sum up vector c and print result divided by n, this should equal 1 within error double sum = 0.0; for(i=0; i<n; i++) { sum += c[i]; } sum = sum/n; printf("final result: %f\n", sum); // Release memory free(a); free(b); free(c); return 0; }
Changes to vecAdd.c
// Input vectors
double *restrict a;
double *restrict b;
// Output vector
double *restrict c;
The restrict keyword is a non-enforced guarantee to the compiler that the pointers are not aliased. It is not required by the OpenACC standard, but in some cases it allows the compiler to better vectorize the code.
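To see why this matters, here is a hypothetical sketch (not part of the sample) of the same loop in function form: because the three pointers are declared restrict, the compiler may assume a store to c[i] never modifies a or b, and so it can reorder, vectorize, or parallelize the loop freely.

// Hypothetical helper: restrict promises these pointers never overlap.
void vec_add(double *restrict c, const double *restrict a,
             const double *restrict b, int n)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}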
// sum component wise and save result into vector c
#pragma acc kernels copyin(a[0:n],b[0:n]), copyout(c[0:n])
for(i=0; i<n; i++) {
    c[i] = a[i] + b[i];
}
The compiler will analyze the loops following the acc kernels directive and, if possible, compute them on the GPU. Memory is copied from the CPU to the GPU at the start of the region and back from the GPU to the CPU at the end of the region.
Compiling vecAdd.c
PGI
$ module load cudatoolkit
$ cc -acc vecAdd.c -o vecAdd.out
Cray
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ cc -h pragma=acc vecAdd.c -o vecAdd.out
CapsMC
$ module load cudatoolkit capsmc
$ cc vecAdd.c -o vecAdd.out
Running
$ aprun ./vecAdd.out
final result: 1.000000
vecAdd.f90
program main ! Size of vectors integer :: n = 100000 ! Input vectors real(8),dimension(:),allocatable :: a real(8),dimension(:),allocatable :: b ! Output vector real(8),dimension(:),allocatable :: c integer :: i real(8) :: sum ! Allocate memory for each vector allocate(a(n)) allocate(b(n)) allocate(c(n)) ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2 do i=1,n a(i) = sin(i*1D0)*sin(i*1D0) b(i) = cos(i*1D0)*cos(i*1D0) enddo ! Sum component wise and save result into vector c !$acc kernels copyin(a(1:n),b(1:n)), copyout(c(1:n)) do i=1,n c(i) = a(i) + b(i) enddo !$acc end kernels ! Sum up vector c and print result divided by n, this should equal 1 within error do i=1,n sum = sum + c(i) enddo sum = sum/n print *, 'final result: ', sum ! Release memory deallocate(a) deallocate(b) deallocate(c) end program
Changes to vecAdd.f90
!$acc kernels copyin(a(1:n),b(1:n)), copyout(c(1:n))
do i=1,n
    c(i) = a(i) + b(i)
enddo
!$acc end kernels
The loop directly following the acc kernels directive will be computed on the GPU. Memory is copied from the CPU to the GPU at the start of the region and back from the GPU to the CPU at the end of the region, as specified by the copyin and copyout clauses.
Compiling vecAdd.f90
PGI
$ module load cudatoolkit
$ ftn -acc vecAdd.f90 -o vecAdd.out
Cray
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ ftn -h acc vecAdd.f90 -o vecAdd.out
CapsMC
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit capsmc
$ ftn vecAdd.f90 -o vecAdd.out
Running
$ aprun ./vecAdd.out
final result: 1.000000
Reduction
OpenACC includes a parallel reduction operation. When a reduction operation is encountered in a kernels region, the compiler should automatically detect it and run the reduction on the accelerator. When relying on the compiler to implicitly parallelize your code, it is important to check that it is doing so correctly, as discussed in the Additional Information section below.
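If you prefer not to rely on implicit detection, the reduction can also be requested explicitly with a reduction clause. A minimal sketch (a hypothetical helper, not the code from the example below, which uses the kernels directive instead):

// Hypothetical helper: explicit reduction clause rather than compiler-detected reduction.
double vec_add_sum(const double *restrict a, const double *restrict b,
                   double *restrict c, int n)
{
    double sum = 0.0;
    int i;
    #pragma acc parallel loop copyin(a[0:n],b[0:n]) copyout(c[0:n]) reduction(+:sum)
    for (i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
        sum += c[i];
    }
    return sum;
}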
vecAdd-reduction.c
#include <stdio.h> #include <stdlib.h> #include <math.h> int main( int argc, char* argv[] ) { // Size of vectors int n = 10000; // Input vectors double *restrict a; double *restrict b; // Output vector double *restrict c; // Size, in bytes, of each vector size_t bytes = n*sizeof(double); // Allocate memory for each vector a = (double*)malloc(bytes); b = (double*)malloc(bytes); c = (double*)malloc(bytes); // Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2 int i; for(i=0; i<n; i++) { a[i] = sin(i)*sin(i); b[i] = cos(i)*cos(i); } // sum component wise and save result into vector c and do reduction float sum = 0.0; #pragma acc kernels copyin(a[0:n],b[0:n]), create(c[0:n]) { for(i=0; i<n; i++) { c[i] = a[i] + b[i]; sum += c[i]; } } sum = sum/n; printf("final result: %f\n", sum); // Release memory free(a); free(b); free(c); return 0; }
vecAdd-reduction.f90
program main ! Size of vectors integer :: n = 100000 ! Input vectors real(8),dimension(:),allocatable :: a real(8),dimension(:),allocatable :: b ! Output vector real(8),dimension(:),allocatable :: c integer :: i real(8) :: sum ! Allocate memory for each vector allocate(a(n)) allocate(b(n)) allocate(c(n)) ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2 do i=1,n a(i) = sin(i*1D0)*sin(i*1D0) b(i) = cos(i*1D0)*cos(i*1D0) enddo ! Sum component wise and save result into vector c, perform reduction sum on c !$acc kernels !copyin(a(1:n),b(1:n)), copyout(c(1:n)) do i=1,n c(i) = a(i) + b(i) sum = sum + c(i) enddo !$acc end kernels ! Divide the sum by n, the final result should be 1.0 within error sum = sum/n print *, 'final result: ', sum ! Release memory deallocate(a) deallocate(b) deallocate(c) end program
Additional Information
With directive use, particularly when the compiler is relied upon to implicitly identify and parallelize code, it is important to check that it is doing what you expect. The way to extract more information from the compilation process differs between compilers.
PGI
For PGI the -Minfo compiler flag will print out detailed optimization information. With the -Minfo flag we will see memory transfer and thread placement information.
Compiling:
Compile C code with the PGI compiler and -Minfo flag as follows:
$ cc -acc -Minfo vecAdd-reduction.c
Compile Fortran code with the PGI compiler and -Minfo flag as follows:
$ ftn -acc -Minfo vecAdd-reduction.f90
Output:
main:
     34, Generating local(c[0:n])
         Generating copyin(b[0:n])
         Generating copyin(a[0:n])
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     35, Loop is parallelizable
         Accelerator kernel generated
         35, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             CC 2.0 : 18 registers; 0 shared, 88 constant, 0 local memory bytes
     37, Sum reduction generated for sum
Breaking this output down:
main:
     34, Generating local(c[0:n])
         Generating copyin(b[0:n])
         Generating copyin(a[0:n])
We see that at line 34, the start of our acc region, elements 0 to n of the vectors a and b will be copied to the GPU. Vector c does not need to be copied into the GPU, so it is allocated locally on the device.
35, Loop is parallelizable
    Accelerator kernel generated
Starting at line 35, the line containing the for/do loop statement, we see that the compiler found the loop parallelizable and generated a GPU kernel. Let's break down the provided information.
35, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    CC 2.0 : 18 registers; 0 shared, 88 constant, 0 local memory bytes
In CUDA terminology, this translates to a kernel with a block size of 128; that is, 128 threads will be in each logical thread block. Each thread uses 18 registers, and 88 bytes of constant memory are used.
37, Sum reduction generated for sum
The compiler has correctly identified the sum reduction and created an optimized kernel for it.
Additionally, if the environment variable ACC_NOTIFY is set, runtime information is printed out.
$ export ACC_NOTIFY=1
$ aprun ./vecAdd.out
launch kernel file=/autofs/na1_home/atj/OLCF3/OpenACC/VecAdd/vecAdd-reduction.c function=main line=35 device=0 grid=79 block=128 queue=0
launch kernel file=/autofs/na1_home/atj/OLCF3/OpenACC/VecAdd/vecAdd-reduction.c function=main line=37 device=0 grid=1 block=256 queue=0
Cray
Compiling:
C
$ cc -h pragma=acc,msgs vecAdd.c -o vecAdd.out
Fortran
$ ftn -h acc,msgs vecAdd.f90 -o vecAdd.out
Output:
!$acc kernels copyin(a(1:n),b(1:n)), create(c(1:n)) ftn-6413 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 A data region was created at line 28 and ending at line 38. ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 Allocate memory and copy user shaped variable "b" to accelerator, free at line 38 (acc_copyin). ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 Allocate memory and copy user shaped variable "a" to accelerator, free at line 38 (acc_copyin). ftn-6421 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 Allocate memory for user shaped variable "c" on accelerator, free at line 38 (acc_share). do i=1,n ftn-6401 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 A loop starting at line 29 was placed on the accelerator. ftn-6430 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 A loop starting at line 29 was partitioned across the threadblocks and the 128 threads within a threadblock. ftn-6415 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 Allocate memory and copy variable "sum" to accelerator, copy back at line 32 (acc_copy).
Breaking this output down:
!$acc kernels copyin(a(1:n),b(1:n)), create(c(1:n)) ftn-6413 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 A data region was created at line 28 and ending at line 38. ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 Allocate memory and copy user shaped variable "b" to accelerator, free at line 38 (acc_copyin). ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 Allocate memory and copy user shaped variable "a" to accelerator, free at line 38 (acc_copyin). ftn-6421 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 Allocate memory for user shaped variable "c" on accelerator, free at line 38 (acc_share).
We see that at line 28, the start of our acc region, elements 1 to n of the vectors a and b will be copied to the GPU. Vector c does not need to be copied into the GPU, so it is allocated locally on the device.
do i=1,n ftn-6401 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 A loop starting at line 29 was placed on the accelerator. ftn-6430 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 A loop starting at line 29 was partitioned across the threadblocks and the 128 threads within a threadblock. ftn-6415 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 Allocate memory and copy variable "sum" to accelerator, copy back at line 32 (acc_copy).
At line 29, the start of our do loop, we see that a kernel was generated with 128 threads per threadblock. At the end of the loop, sum is copied from the accelerator back to the host.
Additionally, if the environment variable CRAY_ACC_DEBUG is set, runtime information is printed out.
$ export CRAY_ACC_DEBUG=1 $ aprun ./vecAdd.out ACC: Transfer 3 items (to acc 1600000 bytes, to host 0 bytes) from vecAdd-reduction.f90:28 ACC: Transfer 3 items (to acc 12 bytes, to host 0 bytes) from vecAdd-reduction.f90:29 ACC: Execute kernel main_$ck_L29_2 async(auto) from vecAdd-reduction.f90:29 ACC: Wait async(auto) from vecAdd-reduction.f90:32 ACC: Transfer 3 items (to acc 0 bytes, to host 8 bytes) from vecAdd-reduction.f90:32 ACC: Wait async(auto) from vecAdd-reduction.f90:38 ACC: Transfer 3 items (to acc 0 bytes, to host 0 bytes) from vecAdd-reduction.f90:38
CapsMC
For CapsMC, the following directives mark the loop as independent and explicitly specify the sum reduction, in C:

#pragma acc loop independent
#pragma acc loop reduction(+:sum)

and in Fortran:

!$acc loop independent
!$acc loop reduction(+:sum)
Compiling:
C
$ capsmc --io-report cc vecAdd-reduction.c
Fortran
$ capsmc --io-report ftn vecAdd-reduction.f90
Output:
hmpp: [Info] IO report :
hmppcg: [Message DPL0099] vecAdd-reductionHMPP.c:38: Loop 'i' was shared among workers(8)
hmpp: [Info] Generated codelet filename is "__hmpp_acc_region__67z3pizb_cuda.hmf.cu".
Although not much information is provided at compile time, much more can be obtained at runtime by setting the environment variable HMPPRT_LOG_LEVEL="info".
$ export HMPPRT_LOG_LEVEL="info" $ aprun ./a.out [ 0.029828] ( 0) INFO : Enter kernels (queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.030021] ( 0) INFO : Acquire (target=cuda) [ 0.031667] ( 0) INFO : Allocate a[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451419] ( 0) INFO : Upload a[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451555] ( 0) INFO : Allocate b[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451610] ( 0) INFO : Upload b[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451721] ( 0) INFO : Allocate c[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451775] ( 0) INFO : Allocate i[0:1] (element_size=4, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451826] ( 0) INFO : Upload i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451866] ( 0) INFO : Allocate sum[0:1] (element_size=8, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451905] ( 0) INFO : Upload sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451940] ( 0) INFO : Call __hmpp_acc_region__67z3pizb (queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470447] ( 0) INFO : Download sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470506] ( 0) INFO : Free sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470555] ( 0) INFO : Download i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470590] ( 0) INFO : Free i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470626] ( 0) INFO : Free c[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470677] ( 0) INFO : Free b[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470727] ( 0) INFO : Free a[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470863] ( 0) INFO : Leave kernels (queue=none, location=vecAdd-reductionHMPP.c:34)
Breaking this output down:
[ 0.029828] ( 0) INFO : Enter kernels (queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.030021] ( 0) INFO : Acquire (target=cuda) [ 0.031667] ( 0) INFO : Allocate a[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451419] ( 0) INFO : Upload a[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451555] ( 0) INFO : Allocate b[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451610] ( 0) INFO : Upload b[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451721] ( 0) INFO : Allocate c[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451775] ( 0) INFO : Allocate i[0:1] (element_size=4, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451826] ( 0) INFO : Upload i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451866] ( 0) INFO : Allocate sum[0:1] (element_size=8, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.451905] ( 0) INFO : Upload sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
We see that at line 34, the start of our acc region, elements 0 to n of the vectors a and b will be copied to the GPU. Vector c does not need to be copied into the GPU, so it is allocated locally on the device.
[ 0.451940] ( 0) INFO : Call __hmpp_acc_region__67z3pizb (queue=none, location=vecAdd-reductionHMPP.c:34)
The kernel is called.
[ 0.470447] ( 0) INFO : Download sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470506] ( 0) INFO : Free sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470555] ( 0) INFO : Download i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470590] ( 0) INFO : Free i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470626] ( 0) INFO : Free c[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470677] ( 0) INFO : Free b[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470727] ( 0) INFO : Free a[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34) [ 0.470863] ( 0) INFO : Leave kernels (queue=none, location=vecAdd-reductionHMPP.c:34)
At the end of the kernel, sum is copied from the accelerator back to the host and the memory is released.
This sample shows the Game of Life CPU code converted to an OpenACC accelerator directive version. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.
GOL.c
#include <stdio.h> #include <stdlib.h> #define SRAND_VALUE 1985 #define dim 1024 // Linear game grid dimension excluding ghost cells #define idx(i,j) ((i)*(dim+2)+(j)) // Add up all neighbors inline int getNeighbors(int *grid, int i, int j) { int num = grid[idx(i+1,j)] + grid[idx(i-1,j)] //upper lower + grid[idx(i,j+1)] + grid[idx(i,j-1)] //right left + grid[idx(i+1,j+1)] + grid[idx(i-1,j-1)] //diagonals + grid[idx(i-1,j+1)] + grid[idx(i+1,j-1)]; return num; } int main(int argc, char* argv[]) { int i,j,iter; // Number of game iterations int maxIter = 1<<10; // Total number of alive cells int total = 0; // Total size of grid in bytes size_t bytes = sizeof(int)*(dim+2)*(dim+2); // Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells int *grid = (int*) malloc(bytes); // Allocate newGrid int *restrict newGrid = (int*) malloc(bytes); // Assign initial population randomly srand(SRAND_VALUE); for(i = 1; i <= dim; i++) { for(j = 1; j <= dim; j++) { grid[idx(i,j)] = rand() % 2; } } int fullSize = (dim+2)*(dim+2); #pragma acc data copyin(grid[0:fullSize]) create(newGrid[0:fullSize]) { // Main game loop for (iter = 0; iter<maxIter; iter++) { #pragma acc kernels { // Left-Right columns #pragma acc loop independent for (i = 1; i <= dim; i++) { grid[idx(i,0)] = grid[idx(i,dim)]; //Copy first real column to right ghost column grid[idx(i,dim+1)] = grid[idx(i,1)]; //Copy last real column to left ghost column } // Top-Bottom rows #pragma acc loop independent for (j = 0; j <= dim+1; j++) { grid[idx(0,j)] = grid[idx(dim,j)]; //Copy first real row to bottom ghost row grid[idx(dim+1,j)] = grid[idx(1,j)]; //Copy last real row to top ghost row } // Now we loop over all cells and determine their fate #pragma acc loop independent for (i = 1; i <= dim; i++) { for (j = 1; j <= dim; j++) { // Get the number of neighbors for a given grid point int numNeighbors = getNeighbors(grid,i,j); // Here we have explicitly all of the game rules if (grid[idx(i,j)] == 1 && numNeighbors < 2) newGrid[idx(i,j)] = 0; else if (grid[idx(i,j)] == 1 && (numNeighbors == 2 || numNeighbors == 3)) newGrid[idx(i,j)] = 1; else if (grid[idx(i,j)] == 1 && numNeighbors > 3) newGrid[idx(i,j)] = 0; else if (grid[idx(i,j)] == 0 && numNeighbors == 3) newGrid[idx(i,j)] = 1; else newGrid[idx(i,j)] = grid[idx(i,j)]; } } // Can't switch pointers so we mannually have to copy array over #pragma acc loop independent for(i = 1; i <= dim; i++) { for(j = 1; j <= dim; j++) { grid[idx(i,j)] = newGrid[idx(i,j)]; } } } // End ACC kernels region } // End main game loop // Sum up alive cells #pragma acc parallel { #pragma acc loop reduction(+:total) for (i = 1; i <= dim; i++) { #pragma acc loop reduction(+:total) for (j = 1; j <= dim; j++) { total += grid[idx(i,j)]; } } } } // End ACC Data region // Print number of live cells printf("Total Alive: %d\n", total); // Release memory free(grid); free(newGrid); return 0; }
Changes to GOL.c
#define dim 1024 // Linear game grid dimension excluding ghost cells
#define idx(i,j) ((i)*(dim+2)+(j))
At the time of this writing, multidimensional C arrays are not fully supported in all OpenACC implementations. A convenience macro is defined and used whenever grid or newGrid is accessed; these changes have not all been highlighted. grid[i][j] is now accessed as grid[idx(i,j)].
inline int getNeighbors(int *grid, int i, int j)
OpenACC 1.0 does not support function calls within acc regions, so all function calls must be inlined. The inline qualifier can be used as a hint to the compiler that we wish to have the function inlined.
// Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
int *grid = (int*) malloc(bytes);
// Allocate newGrid
int *restrict newGrid = (int*) malloc(bytes);
The grid and newGrid arrays are now linear. As discussed in the vector addition example, we use the restrict qualifier to help the compiler resolve loop dependencies.
#pragma acc data copyin(grid[0:fullSize]) create(newGrid[0:fullSize]) {
}
Memory transfers between host and device can be very costly, so optimizing memory transfers in any way possible is highly recommended. Here, an acc data region starts on line 49 and ends at line 101. grid will be copied to the accelerator at the start of the region. newGrid will be allocated on the device, but no copies from host to device or device to host are needed.
#pragma acc kernels {
}
An acc kernels region will attempt to parallelize any enclosed loops across the accelerator. The kernels directive gives the compiler complete control, if not guided by the programmer, in deciding how to map the loops onto the target hardware. This directive does not guarantee the compiler will be able to parallelize the loop across the device; if the compiler is not able to determine that the loop iterations are independent, the loop may be run sequentially on the accelerator. It is imperative that compiler information be checked to ensure that the loops are parallelized.
// Left-Right columns
#pragma acc loop independent

// Top-Bottom rows
#pragma acc loop independent

// Now we loop over all cells and determine their fate
#pragma acc loop independent

// Can't switch pointers so we manually have to copy array over
#pragma acc loop independent
The compiler is not always able to determine that loop iterations can be done independently; when this happens, the loop will be scheduled sequentially on the accelerator. The loop independent clause lets the compiler know that it is safe to parallelize the immediately following loop.
// Can't switch pointers so we manually have to copy array over
#pragma acc loop independent
for(i = 1; i <= dim; i++) {
    for(j = 1; j <= dim; j++) {
        grid[idx(i,j)] = newGrid[idx(i,j)];
    }
}
Device pointer manipulation is not currently supported by OpenACC, so the pointer swap must be removed. The easiest way to fix this is to replace the pointer swap with a copy operation, as we have done.
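For reference, this is the kind of host-side pointer swap (a sketch of the removed CPU pattern, not code from this sample) that had to be replaced by the element-wise copy above:

// CPU-only version: swap the roles of grid and newGrid each iteration.
// Inside the acc data region this would only swap the host pointers;
// the device copies would no longer correspond, so a copy is used instead.
int *tmp = grid;
grid = newGrid;
newGrid = tmp;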
// Sum up alive cells
#pragma acc parallel
{
    #pragma acc loop reduction(+:total)
    for (i = 1; i <= dim; i++) {
        #pragma acc loop reduction(+:total)
        for (j = 1; j <= dim; j++) {
            total += grid[idx(i,j)];
        }
    }
}
At the start of this acc parallel region, a number of compiler-determined gangs will be launched on the GPU. The acc loop directive indicates that the immediately following loop will be parallelized across the accelerator. The reduction clause on both loops indicates that a reduction is required on the scalar variable total.
Compiling GOL.c
PGI
$ module load cudatoolkit
$ cc -acc gol.c -Minline -Minfo -ta=nvidia:kepler
CRAY
Cray will automatically attempt to inline the getNeighbors function
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ cc -hpragma=acc,msgs gol.c
Compiling GOL.c with CapsMC
CapsMC will automatically attempt to inline the getNeighbors function
$ module load capsmc
$ module load cudatoolkit
$ cc gol.c
GOL.f90
program main implicit none integer :: i,j,iter,seed(8),numNeighbors,total real :: randm ! Linear game grid dimension integer :: dim = 1024 ! Number of game iterations integer :: maxIter = 2**10 ! Game grid pointers integer,dimension(:,:),allocatable :: grid, newGrid ! Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells allocate(grid(dim+2,dim+2)) allocate(newGrid(dim+2,dim+2)) ! Assign initial population randomly seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/) call random_seed(PUT=seed) do j=1,dim do i=1,dim call random_number(randm) grid(i,j) = nint(randm) enddo enddo !$acc data copyin(grid), create(newGrid) ! Main game loop do iter=1,maxIter !$acc kernels ! Top-Bottom ghost rows !$acc loop independent do j=2,dim+1 grid(1,j) = grid(dim+1,j) !Copy first game grid row to bottom ghost row grid(dim+2,j) = grid(2,j) !Copy first game grid row to top ghost row enddo ! Left-Right ghost columns !$acc loop independent do i=1,dim+2 grid(i,1) = grid(i, dim+1) !Copy first game grid column to right ghost column grid(i,dim+2) = grid(i,2) !Copy last game grid column to left ghost column enddo ! Now we loop over all cells and determine their fate do j=2,dim+1 do i=2,dim+1 ! Get the number of neighbors for a given grid point numNeighbors = grid(i,j+1) + grid(i,j-1)& !left and right + grid(i+1,j) + grid(i-1,j)& !upper and lower + grid(i+1,j+1) + grid(i-1,j-1)& !diagonals + grid(i-1,j+1) + grid(i+1,j-1) ! Here we have explicitly all of the game rules if(grid(i,j) == 1 .AND. numNeighbors < 2) then newGrid(i,j) = 0 elseif(grid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then newGrid(i,j) = 1 elseif(grid(i,j) == 1 .AND. numNeighbors > 3) then newGrid(i,j) = 0 elseif(grid(i,j) == 0 .AND. numNeighbors == 3) then newGrid(i,j) = 1 else newGrid(i,j) = grid(i,j) endif enddo enddo ! Can't switch pointers so we mannually have to copy array over do j=2,dim+1 do i=2,dim+1 grid(i,j) = newGrid(i,j) enddo enddo !$acc end kernels enddo!end main game loop total = 0 ! Sum up alive cells and print results !$acc parallel !$acc loop reduction(+:total) do j=2,dim+1 !$acc loop reduction(+:total) do i=2,dim+1 total = total + grid(i,j) enddo enddo !$acc end parallel !$acc end data print *, "Total Alive", total end program
Changes to GOL.f90
!$acc data copyin(grid), create(newGrid)
!$acc end data
Memory transfers between host and device can be very costly, so optimizing memory transfers in any way possible is highly recommended. Here, an acc data region starts on line 27 and ends at line 90. grid will be copied to the accelerator at the start of the region. newGrid will be allocated on the device, but no copies from host to device or device to host are needed.
!$acc kernels
!$acc end kernels
An acc kernels region will attempt to parallelize any enclosed loops across the accelerator. The kernels directive gives the compiler complete control, if not guided by the programmer, in deciding how to map the loops onto the target hardware. This directive does not guarantee the compiler will be able to parallelize the loop across the device; if the compiler is not able to determine that the loop iterations are independent, the loop may be run sequentially on the accelerator. It is imperative that compiler information be checked to ensure that the loops are parallelized.
!$acc loop independent
!$acc loop independent
The compiler is not always able to determine that loop iterations can be done independently; when this happens, the loop will be scheduled sequentially on the accelerator. The loop independent clause lets the compiler know that it is safe to parallelize the immediately following loop.
numNeighbors = grid(i,j+1) + grid(i,j-1)&     !left and right
             + grid(i+1,j) + grid(i-1,j)&     !upper and lower
             + grid(i+1,j+1) + grid(i-1,j-1)& !diagonals
             + grid(i-1,j+1) + grid(i+1,j-1)
OpenACC 1.0 does not support function calls within acc regions, so all function calls must be inlined. For this example, using the current set of compilers, it was best to inline the neighbors function by hand.
! Can't switch pointers so we manually have to copy array over
do j=2,dim+1
    do i=2,dim+1
        grid(i,j) = newGrid(i,j)
    enddo
enddo
Device pointer manipulation is not currently supported by OpenACC, so the pointer swap must be removed. The easiest way to fix this is to replace the pointer swap with a copy operation, as we have done.
! Sum up alive cells and print results
!$acc parallel
!$acc loop reduction(+:total)
do j=2,dim+1
    !$acc loop reduction(+:total)
    do i=2,dim+1
        total = total + grid(i,j)
    enddo
enddo
!$acc end parallel
At the start of this acc parallel region, a number of compiler-determined gangs will be launched on the GPU. The acc loop directive indicates that the immediately following loop will be parallelized across the accelerator. The reduction clause on both loops indicates that a reduction is required on the scalar variable total.
Compiling GOL.f90
OpenACC 1.0 does not support function calls within acc regions, so all function calls must be inlined using the appropriate compiler flag.
PGI
$ module load cudatoolkit
$ ftn -acc gol.f90 -Minline -Minfo -ta=nvidia:kepler
CRAY
Cray will automatically try to inline the getNeighbors function
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ ftn -hacc,msgs gol.f90
Compiling GOL.f90 with CapsMC
CapsMC will automatically try to inline the getNeighbors function
$ module load capsmc
$ module load cudatoolkit
$ ftn gol.f90
The following C and Fortran examples show how to combine OpenACC data regions with GPU libraries such as CUFFT. Since files containing CUFFT function calls must be compiled with the nvcc compiler, while OpenACC-containing files must be compiled with cc/CC/ftn, it is necessary to create wrapper functions as shown below. The acc host_data use_device directive will be used, which causes the device-allocated memory addresses of the specified variables to be used in host code within the directive's scope. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.
C/C++
fft.c
#include <stdio.h> #include <stdlib.h> #include <math.h> #include "openacc.h" #ifndef M_PI #define M_PI 3.14159265358979323846 #endif // Forward declaration of wrapper function that will call CUFFT extern void launchCUFFT(float *d_data, int n, void *stream); int main(int argc, char *argv[]) { int n = 256; float *data = malloc(2*n*sizeof(float)); int i; // Initialize interleaved input data on host float w = 7.0; float x; for(i=0; i<2*n; i+=2) { x = (float)i/2.0/(n-1); data[i] = cos(2*M_PI*w*x); data[i+1] = 0.0; } // Copy data to device at start of region and back to host and end of region #pragma acc data copy(data[0:2*n]) { // Inside this region the device data pointer will be used #pragma acc host_data use_device(data) { // Query OpenACC for CUDA stream void *stream = acc_get_cuda_stream(acc_async_sync); // Launch FFT on the GPU launchCUFFT(data, n, stream); } } // Find the frequency int max_id = 0; for(i=0; i<n; i+=2) { if( data[i] > data[max_id] ) max_id = i; } printf("frequency = %d\n", max_id/2); return 0; }
fft.c + cufft.cu
#include <cufft.h>

// Declared extern "C" to disable C++ name mangling
extern "C" void launchCUFFT(float *d_data, int n, void *stream)
{
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);
    cufftSetStream(plan, (cudaStream_t)stream);
    cufftExecC2C(plan, (cufftComplex*)d_data, (cufftComplex*)d_data, CUFFT_FORWARD);
    cufftDestroy(plan);
}
Compiling C/C++
Cray
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ nvcc cufft.cu -c
$ cc -h pragma=acc -lcufft fft.c cufft.o
PGI
$ module load cudatoolkit
$ nvcc cufft.cu -c
$ cc -acc -lcufft fft.c cufft.o
Fortran
fft.f90
module cufft INTERFACE subroutine launchcufft(data, n, stream) BIND (C, NAME='launchCUFFT') USE ISO_C_BINDING implicit none type (C_PTR), value :: data integer (C_INT), value :: n type (C_PTR), value :: stream end subroutine END INTERFACE end module cufft program fft USE ISO_C_BINDING USE cufft USE openacc IMPLICIT NONE INTEGER, PARAMETER :: n = 256 COMPLEX (C_FLOAT_COMPLEX) :: data(n) INTEGER (C_INT):: i INTEGER :: max_id,istat type (C_PTR) :: stream ! Initialize interleaved input data on host REAL :: w = 7.0 REAL :: x REAL, PARAMETER :: PI = 3.1415927 do i=1,n x = (i-1.0)/(n-1.0); data(i) = CMPLX(COS(2.0*PI*w*x),0.0) enddo ! Copy data to device at start of region and back to host and end of region !$acc data copy(data) ! Inside this region the device data pointer will be used !$acc host_data use_device(data) ! Query OpenACC for CUDA stream stream = acc_get_cuda_stream(acc_async_sync) ! Launch FFT on the GPU call launchcufft(C_LOC(data), n, stream) !$acc end host_data !$acc end data ! Find the frequency max_id = 1 do i=1,n/2 if (REAL(data(i)) .gt. REAL(data(max_id))) then max_id = i-1 endif enddo print *, "frequency:", max_id end program fft
fft.f90 + cufft.cu
#include <cufft.h>

// Declared extern "C" to disable C++ name mangling
extern "C" void launchCUFFT(float *d_data, int n, void *stream)
{
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);
    cufftSetStream(plan, (cudaStream_t)stream);
    cufftExecC2C(plan, (cufftComplex*)d_data, (cufftComplex*)d_data, CUFFT_FORWARD);
    cufftDestroy(plan);
}
Compiling Fortran
Cray
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ nvcc -c cufft.cu
$ ftn -c -hacc fft.f90
$ ftn fft.o cufft.o -lcufft
PGI
$ module load cudatoolkit
$ nvcc -c cufft.cu
$ ftn -c -acc fft.f90
$ ftn -acc fft.o cufft.o -lcufft
The following C and Fortran examples show how to leverage OpenACC, cuRand, and Thrust in a single application. This example will use cuRand to generate random data, OpenACC to hash the data, and finally Thrust to sort the data into canonical order. Although this example is provided only to illustrate OpenACC interoperability, it was written with a simple nearest-neighbor hash of interleaved 3D positions (x1,y1,z1,x2,y2,z2,...) in mind. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.
C/C++
sort.c
#include <stdio.h> #include <stdlib.h> #include <math.h> #include "openacc.h" // Forward declaration of nvcc compiled functions void fill_rand(float *d_buffer, int num, void *stream); inline int hash_val(float x, float y, float z); void sort(int *d_key, int *d_values, int num, void *stream); int main(int argc, char* argv[]) { int i,index; float x,y,z; int dim = 3; int num = 1000000; int length = dim*num; float *restrict positions = (float*)malloc(length*sizeof(float)); int *restrict keys = (int*)malloc(num*sizeof(int)); int *restrict values = (int*)malloc(num*sizeof(int)); // Query OpenACC for CUDA stream void *stream = acc_get_cuda_stream(acc_async_sync); // OpenACC will create positions, keys, and values arrays on the device #pragma acc data create(positions[0:length], keys[0:num], values[0:num]) { // NVIDIA cuRandom will create our initial random data #pragma acc host_data use_device(positions) { fill_rand(positions, length, stream); } // OpenACC will calculate the hash value for each particle #pragma acc parallel loop for(i=0; i<num; i++) { index = i*3; x = positions[index]; y = positions[index+1]; z = positions[index+2]; // Key is hash value and value is 'particle id' keys[i] = hash_val(x,y,z); values[i] = i; } // Thrust will be used to sort our key value pairs #pragma acc host_data use_device(keys, values) { sort(keys, values, num, stream); } } return 0; } // Uniform grid hash inline int hash_val(float x, float y, float z) { double spacing = 0.01; // Calculate grid coordinates int grid_x,grid_y,grid_z; grid_x = abs(floor(x/spacing)); grid_y = abs(floor(y/spacing)); grid_z = abs(floor(z/spacing)); int num_x = 1.0/spacing; int num_y = 1.0/spacing; int num_z = 1.0/spacing; int grid_position = (num_x * num_y * grid_z) + (grid_y * num_x + grid_x); return grid_position; }
sortGPU.cu
#include <stdio.h> #include <curand.h> #include <thrust/sort.h> #include <thrust/execution_policy.h> #include <thrust/binary_search.h> #include <thrust/iterator/counting_iterator.h> #include <thrust/system_error.h> // Fill d_buffer with num random numbers extern "C" void fill_rand(float *d_buffer, int num, void *stream) { curandGenerator_t gen; int status; // Create generator status = curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT); // Set CUDA stream status |= curandSetStream(gen, (cudaStream_t)stream); // Set seed status |= curandSetPseudoRandomGeneratorSeed(gen, 1234ULL); // Generate num random numbers status |= curandGenerateUniform(gen, d_buffer, num); // Cleanup generator status |= curandDestroyGenerator(gen); if (status != CURAND_STATUS_SUCCESS) { printf ("curand failure!\n"); exit (EXIT_FAILURE); } } // Sort key value pairs extern "C" void sort(int *keys, int *values, int num, void *stream) { try { // Sort keys AND values array by key thrust::sort_by_key(thrust::cuda::par.on((cudaStream_t)stream), keys, keys + num, values); } catch(thrust::system_error &e) { std::cerr << "Error sorting with Thrust: " << e.what() << std::endl; exit (EXIT_FAILURE); } }
Compiling C/C++
Cray
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ nvcc sortGPU.cu -c
$ cc -hpragma=acc -lcurand sort.c sortGPU.o
PGI
$ module load cudatoolkit
$ nvcc sortGPU.cu -c
$ cc -acc -lcurand sort.c sortGPU.o -Minline
Fortran
sort.f90
! Module containing nvcc compiled function declarations and hash function module sortgpu INTERFACE subroutine fill_rand(positions, length, stream) BIND(C,NAME='fill_rand') USE ISO_C_BINDING implicit none type (C_PTR), value :: positions integer (C_INT), value :: length type (C_PTR), value :: stream end subroutine fill_rand subroutine sort(keys, values, num, stream) BIND(C,NAME='sort') USE ISO_C_BINDING implicit none type (C_PTR), value :: keys type (C_PTR), value :: values integer (C_INT), value :: num type (C_PTR), value :: stream end subroutine sort END INTERFACE contains integer function hash_val(x,y,z) real :: x,y,z,spacing integer :: grid_x, grid_y, grid_z integer :: num_x, num_y, num_z spacing = 0.01; grid_x = abs(floor(x/spacing)) grid_y = abs(floor(y/spacing)) grid_z = abs(floor(z/spacing)) num_x = 1.0/spacing num_y = 1.0/spacing num_z = 1.0/spacing hash_val = (num_x * num_y * grid_z) + (grid_y * num_x + grid_x) + 1 return end function hash_val end module sortgpu program interop use ISO_C_BINDING use sortgpu use openacc implicit none integer :: i,indx real :: x,y,z integer, parameter :: dims = 3 integer(C_INT), parameter :: num = 1000000 integer(C_INT), parameter :: length = dims*num real (C_FLOAT) :: positions(length) integer(C_INT) :: keys(num) integer(C_INT) :: values(num) type (C_PTR) :: stream ! Query OpenACC for CUDA stream stream = acc_get_cuda_stream(acc_async_sync) ! OpenACC will create positions, keys, and values arrays on the device !$acc data create(positions, keys, values) ! NVIDIA cuRandom will create our initial random data !$acc host_data use_device(positions) call fill_rand(C_LOC(positions), length, stream) !$acc end host_data ! OpenACC will calculate the hash value for each particle !$acc parallel loop do i=0,num-1 indx = i*3; x = positions(indx+1) y = positions(indx+2) z = positions(indx+3) ! Key is 'particle' id and value is hashed position keys(i+1) = hash_val(x,y,z); values(i+1) = i+1; enddo !$acc end parallel loop ! Thrust will be used to sort our key value pairs !$acc host_data use_device(keys, values) call sort(C_LOC(keys), C_LOC(values), num, stream); !$acc end host_data !$acc end data end program interop
sortGPU.cu
#include <stdio.h> #include <curand.h> #include <thrust/sort.h> #include <thrust/execution_policy.h> #include <thrust/binary_search.h> #include <thrust/iterator/counting_iterator.h> #include <thrust/system_error.h> // Fill d_buffer with num random numbers extern "C" void fill_rand(float *d_buffer, int num, void *stream) { curandGenerator_t gen; int status; // Create generator status = curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT); // Set CUDA stream status |= curandSetStream(gen, (cudaStream_t)stream); // Set seed status |= curandSetPseudoRandomGeneratorSeed(gen, 1234ULL); // Generate num random numbers status |= curandGenerateUniform(gen, d_buffer, num); // Cleanup generator status |= curandDestroyGenerator(gen); if (status != CURAND_STATUS_SUCCESS) { printf ("curand failure!\n"); exit (EXIT_FAILURE); } } // Sort key value pairs extern "C" void sort(int *keys, int *values, int num, void *stream) { try { // Sort keys AND values array by key thrust::sort_by_key(thrust::cuda::par.on((cudaStream_t)stream), keys, keys + num, values); } catch(thrust::system_error &e) { std::cerr << "Error sorting with Thrust: " << e.what() << std::endl; exit (EXIT_FAILURE); } }
Compiling Fortran
Cray
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ nvcc sortGPU.cu -c
$ ftn -hacc -lcurand sort.f90 sortGPU.o
PGI
$ module load cudatoolkit
$ module switch pgi pgi/13.6.0
$ nvcc sortGPU.cu -c
$ ftn -acc -lcurand sort.f90 sortGPU.o -Minline=name:hash_val
The use of GPU acceleration can add a useful level of parallelization to your code. This tutorial will show how to parallelize a simple Monte Carlo operation using a GPU accelerator. The first code will be a simple serial implementation of the Monte Carlo simulation used to compute π. The next samples will be the same code but accelerated using CUDA, OpenACC, and then Thrust.
Serial
The codes use a Monte Carlo method to estimate π. To set up the estimate, randomly located points are generated within a 2×2 square that has a circle inscribed within it (think of a game of darts). The algorithm generates a large number of random points and checks whether the coordinates, x and y, of each point lie inside the circle, i.e. whether x² + y² ≤ 1. The ratio, P, of points inside the circle to the total number of points tried is then calculated. Because P approaches the ratio of the area of the circle to the area of the square as the number of points, niter, becomes large, we can write P = (πr²)/(4r²) = π/4 and solve for π. To do this, P is multiplied by 4, and the result is an approximation of π. (Note: the more points generated (niter), the more accurate the approximation will be.)
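For example, if 78,540 of the 100,000 generated points happened to land inside the circle, then P = 0.7854 and the estimate would be 4 × 0.7854 = 3.1416.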
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

int main(int argc, char* argv[])
{
    int niter = 100000;
    double x,y;
    int i;
    int count=0;
    double z;
    double pi;
    srand(time(NULL));
    //main loop
    for (i=0; i<niter; ++i)
    {
        //get random points
        x = (double)rand()/RAND_MAX;
        y = (double)rand()/RAND_MAX;
        z = sqrt((x*x)+(y*y));
        //check to see if point is in unit circle
        if (z<=1)
        {
            ++count;
        }
    }
    //p = 4(m/n)
    pi = ((double)count/(double)niter)*4.0;
    printf("Pi: %f\n", pi);
    return 0;
}
To compile this code, run:
$ cc serialpi.c -o a.out
The resulting executable can either be run in an interactive session or via batch script. To launch an interactive session run:
$ qsub -I -A [PROJID] -l walltime=1:00,nodes=1
This executable can also be run via batch script. A sample batch script could look like this:
#!/bin/bash
# Begin PBS directives
#PBS -A <PROJID>
#PBS -N pi
#PBS -j oe
#PBS -l walltime=0:01:00,nodes=1
#PBS -V
# End PBS directives and begin shell commands
cd $PBS_O_WORKDIR
date
aprun -n1 ./a.out
Save this in your project’s scratch directory as something like titan.pbs and copy your executable to $MEMBERWORK/<projid>.
To submit the script to the batch system, run
$ qsub titan.pbs
CUDA
Compute Unified Device Architecture (CUDA) is the parallel programming model developed by NVIDIA that extends C or Fortran code with features that target the GPU device. In CUDA, the programmer defines functions to be launched in many thousands or millions of parallel threads on the GPU. These functions are generally referred to as CUDA kernels or simply “kernels”. When writing these kernels, the programmer needs to keep in mind how many threads and blocks of threads they want launched on the GPU at runtime. A possible kernel for this code could look something like this:
__global__ void kernel(int* count_d, float* randomnums)
{
    int i;
    double x,y,z;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    i = tid;
    int xidx = 0, yidx = 0;
    xidx = (i+i);
    yidx = (xidx+1);
    x = randomnums[xidx];
    y = randomnums[yidx];
    z = ((x*x)+(y*y));
    if (z<=1)
        count_d[tid] = 1;
    else
        count_d[tid] = 0;
}
Take note of the __global__ decorator. This tells the nvcc compiler that this function is a CUDA kernel that will be called by host code and run on the device. Also note the variables blockDim.x, blockIdx.x, and threadIdx.x. These are built-in variables that identify a thread's place in the overall grid, similar in concept to an MPI rank or an OpenMP thread ID.
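As a general pattern (a sketch with a hypothetical kernel, not code from this example), the global thread ID is typically computed from these built-ins and guarded with a bounds check, since the grid may contain more threads than data elements:

// Hypothetical kernel: double every element of an n-element array
__global__ void scale(float* data, int n)
{
    // Global index of this thread across all blocks
    int tid = blockDim.x * blockIdx.x + threadIdx.x;

    // Guard against threads beyond the end of the array
    if (tid < n)
        data[tid] *= 2.0f;
}

// Possible launch: enough 256-thread blocks to cover n elements
// int blocks = (n + 255) / 256;
// scale<<<blocks, 256>>>(d_data, n);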
Another thing to keep in mind is that the CPU (host) and GPU (device) have separate memories: the host can't directly access memory on the device and vice versa. To overcome this, CUDA provides a set of functions to allocate memory on the device and transfer data between host and device. For example, in the Monte Carlo code the kernel needs a set of random numbers available in device memory, as well as an empty array that will hold a 1 or 0 for each point depending on whether or not it lies in the circle (see the Serial section for more information on the Monte Carlo specifics). Space for both of these arrays therefore needs to be allocated on the device, and the results array eventually needs to be copied back to the host. First, the empty results array needs to be allocated; the cudaMalloc() function allocates memory on the GPU:
cudaError_t cudaMalloc(void** devPtr, size_t size)
Before this function is called, an int* named count_d needs to be declared (by convention, device pointer names end in "_d"). The programmer also needs to determine the size of the array. In this case, since the kernel processes one point per thread, the total number of threads can be used as the size (thread blocks * threads per thread block = total number of threads). Once this is done, the cudaMalloc() function can be called. For this code, it could look like this:
cudaMalloc((void**)&count_d, (blocks*threads)*sizeof(int));
Also, an array to house the random numbers (2 per thread this time) needs to be allocated on the device using cudaMalloc. This is done in the same fashion as before:
cudaMalloc((void**)&randomnums, (2*niter)*sizeof(float));
Once there is a place to contain the required random numbers, they can be generated. The CURAND library contains kernels that generate high quality random numbers using the GPU. CURAND needs a bit of setting up before it can generate the random numbers. First, a CURAND generator needs to be declared:
curandGenerator_t gen;
The next function creates the generator and declares its type.
curandStatus_t curandCreateGenerator(curandGenerator_t* generator, curandRngType_t rng_type);
The next one sets the seed.
curandStatus_t curandSetPseudoRandomGeneratorSeed(curandGenerator_t generator, unsigned long long seed);
Next, curandGenerateUniform() is the actual function that generates the random numbers.
curandStatus_t curandGenerateUniform(curandGenerator_t generator, float* outputPtr, size_t num);
Finally, the random number generator can be destroyed because there's no further need for it.
curandStatus_t curandDestroyGenerator(curandGenerator_t generator);
Using these definitions, create the random numbers as follows:
curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MRG32K3A);
curandSetPseudoRandomGeneratorSeed(gen, 4294967296ULL^time(NULL));
curandGenerateUniform(gen, randomnums, (2*niter));
curandDestroyGenerator(gen);
Now the CUDA kernel needs to be launched on the device. Before this can be done, the programmer needs to decide how many threads per block and how many blocks to launch. In this case, 1000 threads/block and 100 blocks work fine (these numbers were picked arbitrarily; the programmer can choose their own). Once this is determined, the kernel can be launched and passed the correct device pointers.
kernel <<<blocks, threads>>> (count_d, randomnums);
Once the kernel has finished, the count_d array needs to be copied back to the host. Similar to the standard C function memcpy(), we use the CUDA function cudaMemcpy() which looks like this:
cudaError_t cudaMemcpy(void * dst, const void * src, size_t count, enum cudaMemcpyKind kind)
Before this can be done, the programmer needs to declare an empty array on the host to receive the data copied back with cudaMemcpy. Also, make sure the amount of data copied back (size_t count) is the right size (the same size as the cudaMalloc'd array). Finally, cudaMemcpy needs to know in which direction the copy is being performed. In this particular instance, the array is on the device and needs to be copied to the host, so the final parameter needs to be the built-in value cudaMemcpyDeviceToHost. The completed cudaMemcpy line looks like this:
cudaMemcpy(count, count_d, blocks*threads*sizeof(int), cudaMemcpyDeviceToHost);
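For completeness, the matching host-side destination buffer (count above) is an ordinary C allocation of the same size as the device array; this line also appears in the full listing below:

int *count = (int*)malloc(blocks*threads*sizeof(int));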
Now that all the CUDA code is set up, the rest of the code looks like this:
cpi.cu:
#include <stdio.h> #include <stdlib.h> #include <cuda.h> #include <curand.h> #include <time.h> //Declare the CUDA kernel __global__ void kernel(int* count_d, float* randomnums) { int i; double x,y,z; //Find the overall ID of the thread int tid = blockDim.x * blockIdx.x + threadIdx.x; i = tid; int xidx = 0, yidx = 0; //Do the MonteCarlo! xidx = (i+i); yidx = (xidx+1); //Get the random x,y points x = randomnums[xidx]; y = randomnums[yidx]; z = ((x*x)+(y*y)); if (z<=1) count_d[tid] = 1; else count_d[tid] = 0; } //Used to check if there are any errors launching the kernel void CUDAErrorCheck() { cudaError_t error = cudaGetLastError(); if (error != cudaSuccess) { printf("CUDA error : %s (%d)\n", cudaGetErrorString(error), error); exit(0); } } int main(int argc,char* argv[]) { //NOTE: if threads and/or blocks is changed, niter needs to be changed to reflect //that change (niter=threads*blocks) int niter = 100000; float *randomnums; double pi; //Allocate the array for the random numbers cudaMalloc((void**)&randomnums, (2*niter)*sizeof(float)); //Use CuRand to generate an array of random numbers on the device int status; curandGenerator_t gen; status = curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MRG32K3A); status |= curandSetPseudoRandomGeneratorSeed(gen, 4294967296ULL^time(NULL)); status |= curandGenerateUniform(gen, randomnums, (2*niter)); status |= curandDestroyGenerator(gen); //Check to see if there was any problem launching the CURAND kernels and generating //the random numbers on the device if (status != CURAND_STATUS_SUCCESS) { printf("CuRand Failure\n"); exit(EXIT_FAILURE); } //Threads per thread block to be launched int threads = 1000; //Number of thread blocks launched int blocks = 100; int* count_d; int *count = (int*)malloc(blocks*threads*sizeof(int)); unsigned int reducedcount = 0; //Allocate the array to hold a value (1,0) whether the point in is the circle (1) or not (0) cudaMalloc((void**)&count_d, (blocks*threads)*sizeof(int)); CUDAErrorCheck(); //Launch the kernel kernel <<<blocks, threads>>> (count_d, randomnums); //Acts as a kind of code Barrier until the kernel is finished. Kernel calls are nonblocking so //the code would continue regardless of whether the kernel succeeded or not without the Sync cudaDeviceSynchronize(); CUDAErrorCheck(); //Copy the resulting array back cudaMemcpy(count, count_d, blocks*threads*sizeof(int), cudaMemcpyDeviceToHost); int i = 0; //Reduce array into int for(i = 0; i<niter; i++) reducedcount += count[i]; //Free the cudaMalloc()'d arrays cudaFree(randomnums); cudaFree(count_d); free(count); //Find the ratio pi = ((double)reducedcount/niter)*4.0; printf("Pi: %f\n", pi); return 0; }
Compiling
Before this can be compiled, the CUDA toolkit needs to be loaded. To do this, run:
$ module load cudatoolkit
Now that the module is loaded, the nvcc compiler is available. To compile this, run:
$ nvcc -arch=sm_35 -lcurand cpi.cu -o a.out
Here, the -arch=sm_35 switch tells the compiler that the target device's Compute Capability is 3.5. This allows the programmer to use more features in the CUDA kernel, for instance double-precision arithmetic. The -lcurand switch tells the compiler to link in the CURAND library, giving the code access to the CURAND functions.
Then copy the executable (in this case, a.out) to the scratch directory (see the Serial section for more information on finding the scratch directory). It can be run either from an interactive job or by submitting it to the queue. To run it inside an interactive job, use:
$ qsub -I -A [PROJID] -l walltime=1:00,nodes=1
Before you run this, change [PROJID] to the appropriate project ID. Once the job starts, change to the scratch directory and execute the command
$ aprun -n1 ./a.out
Alternately, this can be submitted as a batch job to the queues. To do this, a batch script needs to be written and submitted. A sample batch script could look like this:
#!/bin/bash
# Begin PBS directives
#PBS -A <PROJID>
#PBS -N pi
#PBS -j oe
#PBS -l walltime=0:01:00,nodes=1
#PBS -V
# End PBS directives and begin shell commands
cd $PBS_O_WORKDIR
date
aprun -n 1 ./a.out
Again, before this script can be submitted, the appropriate project ID needs to be added instead of [PROJID]. Save this file as something unique, like titan.pbs, and submit it by running:
$ qsub titan.pbs
OpenACC
OpenACC provides an alternative, pragma-based accelerator programming model that can be easier to use than CUDA. The pragmas are similar to OpenMP's, but they launch threads on the GPU instead of the CPU. Parallelizing the Monte Carlo code with OpenACC is fairly straightforward. The programmer must first identify the best region to parallelize; in this code that is the main for loop, which performs the actual Monte Carlo simulation. The main loop can be broken up into two loops: the first computes x² + y² for each point and stores the result in an array, much as the CUDA version stored its per-thread results, and the second counts how many of those values lie inside the circle, reducing the array to a single integer. The OpenACC pragma should look something like this:
#pragma acc parallel copy(randomnums[0:200000]) copyout(output[0:100000]) create(x,y,z)
Here, acc tells the compiler that this is an OpenACC pragma, and the parallel construct launches multiple threads on the GPU. The copy(randomnums[0:200000]) clause copies the randomnums array to the device at the start of the region and back to the host at the end. The copyout(output[0:100000]) clause allocates the output array on the device and copies it back to the host once the region finishes. Finally, the create(x,y,z) clause allocates the three variables x, y, and z on the device without copying any data. The parallel region looks something like this:
#pragma acc parallel copy(randomnums[0:200000]) copyout(output[0:100000]) create(x,y,z)
{
    for (i = 0; i<100000; i++)
    {
        int temp = i+i;
        x = randomnums[temp];
        y = randomnums[temp+1];
        z = ((x*x)+(y*y));
        output[i] = z;
    }
    for (i = 0; i<100000; i++)
    {
        if (output[i]<=1)
        {
            ++count;
        }
        else
            continue;
    }
}
The complete code looks like this:
accpi.cpp
#include <stdio.h>
#include <time.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
    double niter = 100000;
    double x,y;
    int count=0;
    double z;
    double pi;
    int a = 0;
    time_t t;
    double randomnums[200000];
    double output[100000];
    srand48(((unsigned)time(&t)));
    //generate 2*niter random numbers (two per point)
    for (a = 0; a < 2*niter; a++)
        randomnums[a] = (double)drand48();
    int i = 0;
    #pragma acc parallel copy(randomnums[0:200000]) copyout(output[0:100000]) create(x,y,z)
    {
        for (i = 0; i<100000; i++)
        {
            int temp = i+i;
            x = randomnums[temp];
            y = randomnums[temp+1];
            z = ((x*x)+(y*y));
            output[i] = z;
        }
        for (i = 0; i<100000; i++)
        {
            if (output[i]<=1)
            {
                ++count;
            }
            else
                continue;
        }
    }
    pi = ((double)count/(double)niter)*4.0;
    printf("Pi: %f\n", pi);
    return 0;
}
To compile this, the cudatoolkit module needs to be loaded; run module load cudatoolkit. Next, run
$ CC -acc -Minfo=all -ta=nvidia:5.0,cc3x accpi.cpp -o a.out
to compile the code into an executable. The -Minfo=all switch tells the PGI compiler to print information about how it is compiling the OpenACC code, which can be very useful for debugging. The -ta=nvidia:5.0,cc3x switch specifies which CUDA version to target and which compute architecture to generate code for. Some sample output from the -Minfo=all switch is:
main:
     21, Generating copy(randomnums[0:])
         Generating copyout(output[0:])
         Generating create(x)
         Generating create(y)
         Generating create(z)
         Accelerator kernel generated
         22, #pragma acc loop vector(256) /* threadIdx.x */
         30, #pragma acc loop vector(256) /* threadIdx.x */
     21, Generating NVIDIA code
         Generating compute capability 3.0 binary
     22, Loop is parallelizable
     30, Loop is parallelizable
     34, Sum reduction generated for count
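Note that the compiler already generated a sum reduction for count. As an aside (a sketch, not part of the original example), the two loops and the intermediate output array can be avoided entirely by requesting the reduction explicitly, which also removes the need for the copyout and create clauses:

#pragma acc parallel loop copyin(randomnums[0:200000]) reduction(+:count)
for (i = 0; i<100000; i++)
{
    // Loop-local copies of the coordinates for this point
    double px = randomnums[i+i];
    double py = randomnums[i+i+1];
    // Count the point if it falls inside the unit circle
    if (((px*px)+(py*py)) <= 1.0)
        ++count;
}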
To compile this code with the Cray compiler, load the PrgEnv-cray module as well as the craype-accel-nvidia35 module:
$ module swap PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
To compile to an executable, run:
$ CC -hpragma=acc accpi.cpp -o a.out
Running:
Then copy the executable (in this case, a.out) to the scratch directory. Like before, this can be run either from an interactive job or it can be submitted to the queue to be run. To run this executable job inside an interactive job, run:
$ qsub -I -A [PROJID] -l walltime=1:00,nodes=1
Before you run this, change [PROJID] to the appropriate project ID. Once the job starts, change to the scratch directory and execute the command
$ aprun -n1 ./a.out
Alternately, this can be submitted as a batch job to the queues. To do this, a batch script needs to be written and submitted. A sample batch script could look like this:
#!/bin/bash
# Begin PBS directives
#PBS -A [PROJID]
#PBS -N pi
#PBS -j oe
#PBS -l walltime=0:01:00,nodes=1
#PBS -V
# End PBS directives and begin shell commands
cd $PBS_O_WORKDIR
date
aprun -n 1 -N 1 ./a.out
Again, before this script can be submitted, the appropriate project ID needs to be added instead of [PROJID]. Save this file as something unique, like titan.pbs, and submit it by running:
$ qsub titan.pbs
Thrust
Finally, this operation can be accelerated with the Thrust library. Thrust is a C++ template library that lets programmers write parallel code in much the same way they would use the C++ Standard Library for serial code. Because of this, the Thrust-parallelized code is structured quite differently. For instance, the Monte Carlo kernel from the previous CUDA example is replaced by an equivalent function inside a struct.
struct montecarlo : public thrust::unary_function<unsigned int, float>
{
    //use the __host__ __device__ decorators so that
    //thrust knows that this is a gpu function.
    //think of operator() as similar to a CUDA kernel
    __host__ __device__
    float operator()(unsigned int thread_id)
    {
        unsigned int seed;
        //set the seed
        seed = 49868^thread_id;
        int i;
        float x,y,z,sum=0.0;
        //define the random number generator
        thrust::default_random_engine rng(seed);
        thrust::uniform_real_distribution<float> u01(0,1);
        for (i = 0; i < 1000; i++)
        {
            //get the random numbers
            x = u01(rng);
            y = u01(rng);
            z = ((x*x)+(y*y));
            if (z<=1)
                sum += 1;
        }
        return sum;
    }
};
First, the montecarlo struct is defined and given a Thrust function template to adhere to.
struct montecarlo : public thrust::unary_function<unsigned int, float>
This Thrust template says that the function in the struct is a standard Thrust unary function that takes an unsigned int as input (the thread number, which Thrust supplies) and returns a floating-point number. Next, the programmer defines the operator function
__host__ __device__ float operator()(unsigned int thread_id)
and gives it the __host__ and __device__ decorators so that the compiler knows it is a GPU function. The next thing to take note of is the use of the Thrust random number generator.
thrust::default_random_engine rng(seed);
thrust::uniform_real_distribution<float> u01(0,1);
This step is necessary because host functions cannot be called from a GPU kernel. Using Thrust functions like this lets the programmer accelerate the serial code more easily and quickly, without having to use more complex and specialized APIs such as CURAND.
Finally, the functor is launched through another Thrust function, transform_reduce(), which is defined as:
OutputType thrust::transform_reduce(InputIterator first, InputIterator last, UnaryFunction unary_op, OutputType init, BinaryFunction binary_op)
Here, thrust::transform_reduce() is used because it runs every number between first and last through the predefined functor (unary_op) and then reduces the transformed results into one final value according to binary_op, starting from the initial value init. first and last are supplied as thrust::counting_iterator<unsigned int>(first) and thrust::counting_iterator<unsigned int>(last). The full call looks like this:
count = thrust::transform_reduce(thrust::counting_iterator<unsigned int>(0),
                                 thrust::counting_iterator<unsigned int>(niter),
                                 montecarlo(),
                                 0.0,
                                 thrust::plus<float>());
So with that in mind, the full code can look something like this:
thrust.cu
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <thrust/generate.h>
#include <thrust/random.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>
#include <thrust/transform_reduce.h>
#include <cmath>

struct montecarlo : public thrust::unary_function<unsigned int, float>
{
    //use the __host__ __device__ decorators so that
    //thrust knows that this is a gpu function.
    //think of operator() as similar to a CUDA kernel
    __host__ __device__
    float operator()(unsigned int thread_id)
    {
        unsigned int seed;
        //set the seed
        seed = 49868^thread_id;
        int i;
        float x,y,z,sum=0.0;
        //define the random number generator
        thrust::default_random_engine rng(seed);
        thrust::uniform_real_distribution<float> u01(0,1);
        for (i = 0; i < 1000; i++)
        {
            //get the random numbers
            x = u01(rng);
            y = u01(rng);
            z = ((x*x)+(y*y));
            if (z<=1)
                sum += 1;
        }
        return sum;
    }
};

int main(int argc, char* argv[])
{
    float pi;
    float count = 0.0;
    int niter = 1000;
    count = thrust::transform_reduce(thrust::counting_iterator<unsigned int>(0),
                                     thrust::counting_iterator<unsigned int>(niter),
                                     montecarlo(),
                                     0.0,
                                     thrust::plus<float>());
    pi = (count/(niter*niter))*4.0;
    printf("Pi: %f\n", pi);
    return 0;
}
To compile this code, the cudatoolkit needs to be loaded
$ module load cudatoolkit
Once the code is ready to be compiled, it should be compiled using NVIDIA’s nvcc compiler like so:
$ nvcc -arch=sm_35 thrust.cu -o a.out
It can be run either by submitting it in a batch script or by running it from an interactive job, as described in the previous examples.
The code in this tutorial along with a makefile can be found at https://github.com/olcf/Accelerating-Serial-Code
If you have any questions about these examples or how to run them please email help@nccs.gov.