This sample shows a minimal conversion of our Vector Addition CPU code to an OpenACC accelerator directives version. Consider this an OpenACC ‘Hello World’. Modifications from the CPU version will be highlighted and briefly discussed. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.
vecAdd.c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main( int argc, char* argv[] )
{
    // Size of vectors
    int n = 10000;

    // Input vectors
    double *restrict a;
    double *restrict b;
    // Output vector
    double *restrict c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector
    a = (double*)malloc(bytes);
    b = (double*)malloc(bytes);
    c = (double*)malloc(bytes);

    // Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    int i;
    for(i=0; i<n; i++) {
        a[i] = sin(i)*sin(i);
        b[i] = cos(i)*cos(i);
    }

    // sum component wise and save result into vector c
    #pragma acc kernels copyin(a[0:n],b[0:n]), copyout(c[0:n])
    for(i=0; i<n; i++) {
        c[i] = a[i] + b[i];
    }

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0.0;
    for(i=0; i<n; i++) {
        sum += c[i];
    }
    sum = sum/n;
    printf("final result: %f\n", sum);

    // Release memory
    free(a);
    free(b);
    free(c);

    return 0;
}
Changes to vecAdd.c
// Input vectors
double *restrict a;
double *restrict b;
// Output vector
double *restrict c;
The restrict keyword is a non-enforced guarantee to the compiler that the pointers are not aliased. It is not required by the OpenACC standard, but in some cases it allows the compiler to better vectorize the code.
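As a small illustration (hypothetical code, not part of the sample), the same loop written with and without restrict shows the difference in what the compiler may assume:

// Hypothetical illustration: without restrict the compiler must assume
// out may alias x or y, so every store to out[i] could change the inputs,
// which makes the loop harder to vectorize.
void add_may_alias(double *out, const double *x, const double *y, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = x[i] + y[i];
}

// With restrict the programmer promises the arrays do not overlap,
// letting the compiler vectorize or offload the loop more freely.
void add_no_alias(double *restrict out, const double *restrict x,
                  const double *restrict y, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = x[i] + y[i];
}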
// sum component wise and save result into vector c
#pragma acc kernels copyin(a[0:n],b[0:n]), copyout(c[0:n])
for(i=0; i<n; i++) {
    c[i] = a[i] + b[i];
}
The compiler will analyze the loops following the acc kernels directive and, if possible, compute them on the GPU. Memory is copied from the CPU to the GPU at the start of the region and back from the GPU to the CPU at the end, as specified by the copyin and copyout clauses.
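If several accelerator regions reuse the same arrays, these implied transfers can be hoisted into an explicit data region so each vector moves only once. A minimal sketch (not part of the sample) might look like this:

// Sketch only: an explicit data region keeps a, b, and c resident on the GPU
// across multiple kernels regions, so each vector is transferred once
// instead of once per region.
#pragma acc data copyin(a[0:n], b[0:n]) copyout(c[0:n])
{
    #pragma acc kernels
    for (i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
    // ...additional kernels regions reusing a, b, and c could go here...
}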
Compiling vecAdd.c
PGI
$ module load cudatoolkit
$ cc -acc vecAdd.c -o vecAdd.out
Cray
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ cc -h pragma=acc vecAdd.c -o vecAdd.out
CapsMC
$ module load cudatoolkit capsmc
$ cc vecAdd.c -o vecAdd.out
Running
$ aprun ./vecAdd.out
final result: 1.000000
vecAdd.f90
program main

    ! Size of vectors
    integer :: n = 100000

    ! Input vectors
    real(8),dimension(:),allocatable :: a
    real(8),dimension(:),allocatable :: b
    ! Output vector
    real(8),dimension(:),allocatable :: c

    integer :: i
    real(8) :: sum

    ! Allocate memory for each vector
    allocate(a(n))
    allocate(b(n))
    allocate(c(n))

    ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    do i=1,n
        a(i) = sin(i*1D0)*sin(i*1D0)
        b(i) = cos(i*1D0)*cos(i*1D0)
    enddo

    ! Sum component wise and save result into vector c
    !$acc kernels copyin(a(1:n),b(1:n)), copyout(c(1:n))
    do i=1,n
        c(i) = a(i) + b(i)
    enddo
    !$acc end kernels

    ! Sum up vector c and print result divided by n, this should equal 1 within error
    sum = 0.0
    do i=1,n
        sum = sum + c(i)
    enddo
    sum = sum/n
    print *, 'final result: ', sum

    ! Release memory
    deallocate(a)
    deallocate(b)
    deallocate(c)

end program
Changes to vecAdd.f90
!$acc kernels copyin(a(1:n),b(1:n)), copyout(c(1:n))
do i=1,n
    c(i) = a(i) + b(i)
enddo
!$acc end kernels
The loop directly following the acc kernels directive will be computed on the GPU. Memory is copied from the CPU to the GPU at the start of the loop and back from the GPU to the CPU at the end of the loop, as specified by the copyin and copyout clauses.
Compiling vecAdd.f90
PGI
$ module load cudatoolkit
$ ftn -acc vecAdd.f90 -o vecAdd.out
Cray
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ ftn -h acc vecAdd.f90 -o vecAdd.out
CapsMC
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit capsmc
$ ftn vecAdd.f90 -o vecAdd.out
Running
$ aprun ./vecAdd.out
final result: 1.000000
Reduction
OpenACC includes a parallel reduction operation. When a reduction operation is encountered in a kernels region, the compiler should automatically detect it and run the reduction on the accelerator. When relying on the compiler to implicitly parallelize your code, it is important to check that it is doing so correctly, as discussed in the Additional Information section below.
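If you prefer not to rely on automatic detection, the reduction can also be stated explicitly with a reduction clause on a loop directive. A minimal sketch, assuming the same vectors a, b, c, and size n used in the samples below:

// Sketch: declaring the reduction explicitly removes any ambiguity about how
// sum is combined across threads; each thread accumulates a private partial
// sum and the partial sums are added together at the end of the loop.
double sum = 0.0;
#pragma acc parallel loop copyin(a[0:n], b[0:n]) create(c[0:n]) reduction(+:sum)
for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
    sum += c[i];
}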
vecAdd-reduction.c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main( int argc, char* argv[] )
{
    // Size of vectors
    int n = 10000;

    // Input vectors
    double *restrict a;
    double *restrict b;
    // Output vector
    double *restrict c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector
    a = (double*)malloc(bytes);
    b = (double*)malloc(bytes);
    c = (double*)malloc(bytes);

    // Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    int i;
    for(i=0; i<n; i++) {
        a[i] = sin(i)*sin(i);
        b[i] = cos(i)*cos(i);
    }

    // sum component wise and save result into vector c and do reduction
    float sum = 0.0;
    #pragma acc kernels copyin(a[0:n],b[0:n]), create(c[0:n])
    {
        for(i=0; i<n; i++) {
            c[i] = a[i] + b[i];
            sum += c[i];
        }
    }

    sum = sum/n;
    printf("final result: %f\n", sum);

    // Release memory
    free(a);
    free(b);
    free(c);

    return 0;
}
vecAdd-reduction.f90
program main

    ! Size of vectors
    integer :: n = 100000

    ! Input vectors
    real(8),dimension(:),allocatable :: a
    real(8),dimension(:),allocatable :: b
    ! Output vector
    real(8),dimension(:),allocatable :: c

    integer :: i
    real(8) :: sum

    ! Allocate memory for each vector
    allocate(a(n))
    allocate(b(n))
    allocate(c(n))

    ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    do i=1,n
        a(i) = sin(i*1D0)*sin(i*1D0)
        b(i) = cos(i*1D0)*cos(i*1D0)
    enddo

    ! Sum component wise and save result into vector c, perform reduction sum on c
    sum = 0.0
    !$acc kernels copyin(a(1:n),b(1:n)), create(c(1:n))
    do i=1,n
        c(i) = a(i) + b(i)
        sum = sum + c(i)
    enddo
    !$acc end kernels

    ! Divide the sum by n, the final result should be 1.0 within error
    sum = sum/n
    print *, 'final result: ', sum

    ! Release memory
    deallocate(a)
    deallocate(b)
    deallocate(c)

end program
Additional Information
With directive use, particularly when the compiler is relied upon to implicitly identify and parallelize code, it is important to check that it is doing what you expect. The way to extract more information from the compilation process differs between compilers.
PGI
For PGI the -Minfo compiler flag prints detailed optimization information, including memory transfer and thread placement details.
Compiling:
Compile C code with the PGI compiler and -Minfo flag as follows:
$ cc -acc -Minfo vecAdd-reduction.c
Compile Fortran code with the PGI compiler and -Minfo flag as follows:
$ ftn -acc -Minfo vecAdd-reduction.f90
Output:
main:
     34, Generating local(c[0:n])
         Generating copyin(b[0:n])
         Generating copyin(a[0:n])
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     35, Loop is parallelizable
         Accelerator kernel generated
         35, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             CC 2.0 : 18 registers; 0 shared, 88 constant, 0 local memory bytes
     37, Sum reduction generated for sum
Breaking this output down:
main:
     34, Generating local(c[0:n])
         Generating copyin(b[0:n])
         Generating copyin(a[0:n])
We see that at line 34, the start of our acc region, elements 0 to n of the vectors a and b will be copied to the GPU. Vector c does not need to be copied into the GPU, so it is allocated locally on the device.
     35, Loop is parallelizable
         Accelerator kernel generated
Starting at line 35, the line containing the for loop statement, we see that the compiler found the loop parallelizable and generated a GPU kernel. Let's break down the provided information.
         35, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             CC 2.0 : 18 registers; 0 shared, 88 constant, 0 local memory bytes
In CUDA terminology this translates to a kernel with a block size of 128; that is, each logical thread block contains 128 threads. Each thread uses 18 registers, and 88 bytes are stored in constant memory.
37, Sum reduction generated for sum
The compiler has correctly identified the sum reduction and created an optimized kernel for it.
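The vector(128) reported above is a compiler-chosen default; it can be overridden with an explicit loop directive if a different block size is desired. A minimal sketch, assuming the same vector-addition loop (whether this helps is kernel dependent):

// Sketch: request 256 threads per block (vector length) instead of the
// compiler default of 128 for the loop inside the kernels region.
#pragma acc kernels copyin(a[0:n], b[0:n]) copyout(c[0:n])
{
    #pragma acc loop gang vector(256)
    for (i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}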
Additionally, if the environment variable ACC_NOTIFY is set, runtime information is printed out.
$ export ACC_NOTIFY=1
$ aprun ./vecAdd.out
launch kernel file=/autofs/na1_home/atj/OLCF3/OpenACC/VecAdd/vecAdd-reduction.c function=main line=35 device=0 grid=79 block=128 queue=0
launch kernel file=/autofs/na1_home/atj/OLCF3/OpenACC/VecAdd/vecAdd-reduction.c function=main line=37 device=0 grid=1 block=256 queue=0
Cray
Compiling:
C
$ cc -h pragma=acc,msgs vecAdd.c -o vecAdd.out
Fortran
$ ftn -h acc,msgs vecAdd.f90 -o vecAdd.out
Output:
!$acc kernels copyin(a(1:n),b(1:n)), create(c(1:n))
ftn-6413 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28
  A data region was created at line 28 and ending at line 38.
ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28
  Allocate memory and copy user shaped variable "b" to accelerator, free at line 38 (acc_copyin).
ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28
  Allocate memory and copy user shaped variable "a" to accelerator, free at line 38 (acc_copyin).
ftn-6421 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28
  Allocate memory for user shaped variable "c" on accelerator, free at line 38 (acc_share).
do i=1,n
ftn-6401 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29
  A loop starting at line 29 was placed on the accelerator.
ftn-6430 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29
  A loop starting at line 29 was partitioned across the threadblocks and the 128 threads within a threadblock.
ftn-6415 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29
  Allocate memory and copy variable "sum" to accelerator, copy back at line 32 (acc_copy).
Breaking this output down:
!$acc kernels copyin(a(1:n),b(1:n)), create(c(1:n))
ftn-6413 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28
  A data region was created at line 28 and ending at line 38.
ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28
  Allocate memory and copy user shaped variable "b" to accelerator, free at line 38 (acc_copyin).
ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28
  Allocate memory and copy user shaped variable "a" to accelerator, free at line 38 (acc_copyin).
ftn-6421 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28
  Allocate memory for user shaped variable "c" on accelerator, free at line 38 (acc_share).
We see that at line 28, the start of our acc region, elements 1 to n of the vectors a and b will be copied to the GPU. Vector c does not need to be copied into the GPU, so it is allocated directly on the accelerator.
do i=1,n
ftn-6401 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29
  A loop starting at line 29 was placed on the accelerator.
ftn-6430 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29
  A loop starting at line 29 was partitioned across the threadblocks and the 128 threads within a threadblock.
ftn-6415 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29
  Allocate memory and copy variable "sum" to accelerator, copy back at line 32 (acc_copy).
At line 29, the start of our do loop, we see that a kernel was generated with 128 threads per threadblock. At the end of the loop, sum is copied from the accelerator back to the host.
Additionally, if the environment variable CRAY_ACC_DEBUG is set, runtime information is printed out.
$ export CRAY_ACC_DEBUG=1
$ aprun ./vecAdd.out
ACC: Transfer 3 items (to acc 1600000 bytes, to host 0 bytes) from vecAdd-reduction.f90:28
ACC: Transfer 3 items (to acc 12 bytes, to host 0 bytes) from vecAdd-reduction.f90:29
ACC: Execute kernel main_$ck_L29_2 async(auto) from vecAdd-reduction.f90:29
ACC: Wait async(auto) from vecAdd-reduction.f90:32
ACC: Transfer 3 items (to acc 0 bytes, to host 8 bytes) from vecAdd-reduction.f90:32
ACC: Wait async(auto) from vecAdd-reduction.f90:38
ACC: Transfer 3 items (to acc 0 bytes, to host 0 bytes) from vecAdd-reduction.f90:38
CapsMC
For CapsMC, the loop can be explicitly marked independent or the reduction explicitly declared using loop directives. In C:

#pragma acc loop independent
#pragma acc loop reduction(+:sum)

and in Fortran:

!$acc loop independent
!$acc loop reduction(+:sum)
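As a rough sketch of how the reduction directive might be applied to the loop in vecAdd-reduction.c (this is not the distributed source, and the exact placement may vary):

// Sketch: the reduction is declared on the loop inside the kernels region
// so the per-thread partial sums are combined correctly.
float sum = 0.0;
#pragma acc kernels copyin(a[0:n], b[0:n]) create(c[0:n])
{
    #pragma acc loop reduction(+:sum)
    for (i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
        sum += c[i];
    }
}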
Compiling:
C
$ capsmc --io-report cc vecAdd-reduction.c
Fortran
$ capsmc --io-report ftn vecAdd-reduction.f90
Output:
hmpp: [Info] IO report :
hmppcg: [Message DPL0099] vecAdd-reductionHMPP.c:38: Loop 'i' was shared among workers(8)
hmpp: [Info] Generated codelet filename is "__hmpp_acc_region__67z3pizb_cuda.hmf.cu".
Although not much information is provided at compile time, much more can be obtained at runtime by setting the environment variable HMPPRT_LOG_LEVEL="info".
$ export HMPPRT_LOG_LEVEL="info"
$ aprun ./a.out
[ 0.029828] ( 0) INFO : Enter kernels (queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.030021] ( 0) INFO : Acquire (target=cuda)
[ 0.031667] ( 0) INFO : Allocate a[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.451419] ( 0) INFO : Upload a[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.451555] ( 0) INFO : Allocate b[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.451610] ( 0) INFO : Upload b[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.451721] ( 0) INFO : Allocate c[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.451775] ( 0) INFO : Allocate i[0:1] (element_size=4, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.451826] ( 0) INFO : Upload i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.451866] ( 0) INFO : Allocate sum[0:1] (element_size=8, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.451905] ( 0) INFO : Upload sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.451940] ( 0) INFO : Call __hmpp_acc_region__67z3pizb (queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.470447] ( 0) INFO : Download sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.470506] ( 0) INFO : Free sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.470555] ( 0) INFO : Download i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.470590] ( 0) INFO : Free i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.470626] ( 0) INFO : Free c[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.470677] ( 0) INFO : Free b[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.470727] ( 0) INFO : Free a[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.470863] ( 0) INFO : Leave kernels (queue=none, location=vecAdd-reductionHMPP.c:34)
Breaking this output down:
[ 0.029828] ( 0) INFO : Enter kernels (queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.030021] ( 0) INFO : Acquire (target=cuda)
[ 0.031667] ( 0) INFO : Allocate a[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.451419] ( 0) INFO : Upload a[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.451555] ( 0) INFO : Allocate b[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.451610] ( 0) INFO : Upload b[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.451721] ( 0) INFO : Allocate c[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.451775] ( 0) INFO : Allocate i[0:1] (element_size=4, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.451826] ( 0) INFO : Upload i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.451866] ( 0) INFO : Allocate sum[0:1] (element_size=8, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.451905] ( 0) INFO : Upload sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
We see that at line 34, the start of our acc region, elements 0 to n of the vectors a and b are allocated on the GPU and uploaded. Vector c does not need to be copied into the GPU, so it is only allocated on the device.
[ 0.451940] ( 0) INFO : Call __hmpp_acc_region__67z3pizb (queue=none, location=vecAdd-reductionHMPP.c:34)
The kernel is called.
[ 0.470447] ( 0) INFO : Download sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.470506] ( 0) INFO : Free sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.470555] ( 0) INFO : Download i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.470590] ( 0) INFO : Free i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.470626] ( 0) INFO : Free c[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.470677] ( 0) INFO : Free b[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.470727] ( 0) INFO : Free a[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[ 0.470863] ( 0) INFO : Leave kernels (queue=none, location=vecAdd-reductionHMPP.c:34)
At the end of the kernel, sum is copied from the accelerator back to the host and the device memory is released.