This sample shows a minimal conversion of our Vector Addition CPU code to an OpenACC accelerator directives version. Consider this an OpenACC ‘Hello World’. Modifications from the CPU version will be highlighted and briefly discussed. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main( int argc, char* argv[] )
{

// Size of vectors
int n = 10000;

// Input vectors
double *restrict a;
double *restrict b;
// Output vector
double *restrict c;

// Size, in bytes, of each vector
size_t bytes = n*sizeof(double);

// Allocate memory for each vector
a = (double*)malloc(bytes);
b = (double*)malloc(bytes);
c = (double*)malloc(bytes);

// Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
int i;
for(i=0; i<n; i++) {
a[i] = sin(i)*sin(i);
b[i] = cos(i)*cos(i);
}

// sum component wise and save result into vector c
#pragma acc kernels copyin(a[0:n],b[0:n]), copyout(c[0:n])
for(i=0; i<n; i++) {
c[i] = a[i] + b[i];
}

// Sum up vector c and print result divided by n, this should equal 1 within error
double sum = 0.0;
for(i=0; i<n; i++) {
sum += c[i];
}
sum = sum/n;
printf("final result: %f\n", sum);

// Release memory
free(a);
free(b);
free(c);

return 0;
}
```

```c
// Input vectors
double *restrict a;
double *restrict b;
// Output vector
double *restrict c;
```

The restrict keyword is a non-enforced guarantee to the compiler that the pointers are not aliased. It is not required by the OpenACC standard, but in some cases it allows the compiler to vectorize the code more effectively.

```c
// sum component wise and save result into vector c
#pragma acc kernels copyin(a[0:n],b[0:n]), copyout(c[0:n])
for(i=0; i<n; i++) {
c[i] = a[i] + b[i];
}
```

The compiler will analyze the loop following the acc kernels directive and, if possible, compute it on the GPU. As specified by the copyin and copyout clauses, memory is copied from the CPU to the GPU at the start of the region and back from the GPU to the CPU at the end of the region.

PGI

```
$ module load cudatoolkit
```

Cray

```
$ module switch PrgEnv-pgi PrgEnv-cray
```

CapsMC

If using the GNU programming environment with CapsMC, the compiler flag -std=c99 must be added to cc so that the restrict qualifier is handled properly.
If using the Intel programming environment with CapsMC, the compiler flag -restrict must be added to cc for the same reason.
CapsMC can be used in conjunction with any of the programming environments available on Titan.
```
$ module load cudatoolkit capsmc
```

Running

```
$ aprun ./vecAdd.out
final result: 1.000000
```

```fortran
program main

! Size of vectors
integer :: n = 100000

! Input vectors
real(8),dimension(:),allocatable :: a
real(8),dimension(:),allocatable :: b
! Output vector
real(8),dimension(:),allocatable :: c

integer :: i
real(8) :: sum

! Allocate memory for each vector
allocate(a(n))
allocate(b(n))
allocate(c(n))

! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
do i=1,n
a(i) = sin(i*1D0)*sin(i*1D0)
b(i) = cos(i*1D0)*cos(i*1D0)
enddo

! Sum component wise and save result into vector c

!$acc kernels copyin(a(1:n),b(1:n)), copyout(c(1:n))
do i=1,n
c(i) = a(i) + b(i)
enddo
!$acc end kernels

! Sum up vector c and print result divided by n, this should equal 1 within error
sum = 0.0
do i=1,n
sum = sum + c(i)
enddo
sum = sum/n
print *, 'final result: ', sum

! Release memory
deallocate(a)
deallocate(b)
deallocate(c)

end program
```

```fortran
!$acc kernels copyin(a(1:n),b(1:n)), copyout(c(1:n))
do i=1,n
c(i) = a(i) + b(i)
enddo
!$acc end kernels
```

The loop directly following the acc kernels directive will be computed on the GPU. As specified by the copyin and copyout clauses, memory is copied from the CPU to the GPU at the start of the region and back from the GPU to the CPU at the end of the region.

PGI

```
$ module load cudatoolkit
```

Cray

```
$ module switch PrgEnv-pgi PrgEnv-cray
```

CapsMC

CapsMC can be used in conjunction with any of the programming environments available on Titan.
```
$ module switch PrgEnv-pgi PrgEnv-gnu
```

Running

```
$ aprun ./vecAdd.out
final result: 1.000000
```

## Reduction

OpenACC includes a parallel reduction operation. When a reduction is encountered in a kernels region, the compiler should detect it automatically and run it on the accelerator. When relying on the compiler to implicitly parallelize your code, it is important to check that it is doing so correctly, as discussed below in the Additional Information section.

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main( int argc, char* argv[] )
{

// Size of vectors
int n = 10000;

// Input vectors
double *restrict a;
double *restrict b;
// Output vector
double *restrict c;

// Size, in bytes, of each vector
size_t bytes = n*sizeof(double);

// Allocate memory for each vector
a = (double*)malloc(bytes);
b = (double*)malloc(bytes);
c = (double*)malloc(bytes);

// Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
int i;
for(i=0; i<n; i++) {
a[i] = sin(i)*sin(i);
b[i] = cos(i)*cos(i);
}

// sum component wise and save result into vector c and do reduction
double sum = 0.0;
#pragma acc kernels copyin(a[0:n],b[0:n]), create(c[0:n])
{
for(i=0; i<n; i++) {
c[i] = a[i] + b[i];
sum += c[i];
}
}

sum = sum/n;
printf("final result: %f\n", sum);

// Release memory
free(a);
free(b);
free(c);

return 0;
}
```

```fortran
program main

! Size of vectors
integer :: n = 100000

! Input vectors
real(8),dimension(:),allocatable :: a
real(8),dimension(:),allocatable :: b
! Output vector
real(8),dimension(:),allocatable :: c

integer :: i
real(8) :: sum

! Allocate memory for each vector
allocate(a(n))
allocate(b(n))
allocate(c(n))

! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
do i=1,n
a(i) = sin(i*1D0)*sin(i*1D0)
b(i) = cos(i*1D0)*cos(i*1D0)
enddo

! Sum component wise and save result into vector c, perform reduction sum on c

sum = 0.0
!$acc kernels copyin(a(1:n),b(1:n)), create(c(1:n))
do i=1,n
c(i) = a(i) + b(i)
sum = sum +  c(i)
enddo
!$acc end kernels

! Divide the sum by n, the final result should be 1.0 within error
sum = sum/n
print *, 'final result: ', sum

! Release memory
deallocate(a)
deallocate(b)
deallocate(c)

end program
```

With directive use, particularly when the compiler is relied upon to implicitly identify and parallelize code, it is important to check that it is doing what you expect. The way to extract more information from the compilation process differs between compilers.

### PGI

For PGI the -Minfo compiler flag will print out detailed optimization information, including memory transfer and thread placement details.

Compiling:
Compile C code with the PGI compiler and -Minfo flag as follows:

```
$ cc -acc -Minfo vecAdd-reduction.c
```

Compile Fortran code with the PGI compiler and -Minfo flag as follows:

```
$ ftn -acc -Minfo vecAdd-reduction.f90
```

Output:

```
main:
34, Generating local(c[0:n])
Generating copyin(b[0:n])
Generating copyin(a[0:n])
Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
35, Loop is parallelizable
Accelerator kernel generated
35, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
CC 2.0 : 18 registers; 0 shared, 88 constant, 0 local memory bytes
37, Sum reduction generated for sum
```

Breaking this output down:

```
main:
34, Generating local(c[0:n])
Generating copyin(b[0:n])
Generating copyin(a[0:n])
```

We see that at line 34, the start of our acc region, elements 0 to n of vectors a and b will be copied to the GPU. Vector c does not need to be copied to the GPU, so it is allocated locally on the accelerator.

```
35, Loop is parallelizable
Accelerator kernel generated
```

Starting at line 35, the line containing the for/do loop statement, we see that the compiler has found the loop parallelizable and generated a GPU kernel. Let’s break down the provided information.

```
35, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
CC 2.0 : 18 registers; 0 shared, 88 constant, 0 local memory bytes
```

In CUDA terminology this translates to a kernel with a block size of 128, that is, 128 threads in each logical thread block. Each thread uses 18 registers, and 88 values are stored in the constant cache.

```
37, Sum reduction generated for sum
```

The compiler has correctly identified the sum reduction and created an optimized kernel for it.

Additionally, if the environment variable ACC_NOTIFY is set, runtime information is printed out.

```
$ export ACC_NOTIFY=1
```

### Cray

Compiling:
C

```
$ cc -h pragma=acc,msgs vecAdd.c -o vecAdd.out
```

Fortran

```
$ ftn -h acc,msgs vecAdd.f90 -o vecAdd.out
```

Output:

```
    !$acc kernels copyin(a(1:n),b(1:n)), create(c(1:n))
ftn-6413 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28
A data region was created at line 28 and ending at line 38.
ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28
Allocate memory and copy user shaped variable "b" to accelerator, free at line 38 (acc_copyin).
ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28
Allocate memory and copy user shaped variable "a" to accelerator, free at line 38 (acc_copyin).
ftn-6421 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28
Allocate memory for user shaped variable "c" on accelerator, free at line 38 (acc_share).

do i=1,n
ftn-6401 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29
A loop starting at line 29 was placed on the accelerator.
ftn-6430 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29
A loop starting at line 29 was partitioned across the threadblocks and the 128 threads within a threadblock.
ftn-6415 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29
Allocate memory and copy variable "sum" to accelerator, copy back at line 32 (acc_copy).
```

Breaking this output down:

```
    !$acc kernels copyin(a(1:n),b(1:n)), create(c(1:n))
ftn-6413 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28
A data region was created at line 28 and ending at line 38.
ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28
Allocate memory and copy user shaped variable "b" to accelerator, free at line 38 (acc_copyin).
ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28
Allocate memory and copy user shaped variable "a" to accelerator, free at line 38 (acc_copyin).
ftn-6421 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28
Allocate memory for user shaped variable "c" on accelerator, free at line 38 (acc_share).
```

We see that at line 28, the start of our acc region, elements 1 to n of vectors a and b will be copied to the GPU. Vector c does not need to be copied to the GPU, so it is allocated locally on the accelerator.

```
      do i=1,n
ftn-6401 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29
A loop starting at line 29 was placed on the accelerator.
ftn-6430 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29
A loop starting at line 29 was partitioned across the threadblocks and the 128 threads within a threadblock.
ftn-6415 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29
Allocate memory and copy variable "sum" to accelerator, copy back at line 32 (acc_copy).
```

At line 29, the start of our do loop, we see a kernel was generated with 128 threads per threadblock. At the end of the loop, sum is copied from the accelerator back to the host.

Additionally, if the environment variable CRAY_ACC_DEBUG is set, runtime information is printed out.

```
$ export CRAY_ACC_DEBUG=1
ACC: Transfer 3 items (to acc 1600000 bytes, to host 0 bytes) from vecAdd-reduction.f90:28
ACC: Transfer 3 items (to acc 12 bytes, to host 0 bytes) from vecAdd-reduction.f90:29
ACC: Execute kernel main_$ck_L29_2 async(auto) from vecAdd-reduction.f90:29
ACC: Transfer 3 items (to acc 0 bytes, to host 8 bytes) from vecAdd-reduction.f90:32
ACC: Transfer 3 items (to acc 0 bytes, to host 0 bytes) from vecAdd-reduction.f90:38
```

### CapsMC

Currently CapsMC will not compile the provided code due to perceived inter-iteration dependencies. To force the compiler to parallelize the loop, either of the following loop-level directives can be added directly above the for/do loop:

```
#pragma acc loop independent
#pragma acc loop reduction(+:sum)
```

```
!$acc loop independent
!$acc loop reduction(+:sum)
```

Compiling:
C

```
$ capsmc --io-report cc vecAdd-reduction.c
```

Fortran

```
$ capsmc --io-report ftn vecAdd-reduction.f90
```

Output:

```
hmpp: [Info] IO report :
hmppcg: [Message DPL0099] vecAdd-reductionHMPP.c:38: Loop 'i' was shared among workers(8)
hmpp: [Info] Generated codelet filename is "__hmpp_acc_region__67z3pizb_cuda.hmf.cu".
```

Although not much information is provided at compile time, much more can be obtained at runtime by setting the environment variable HMPPRT_LOG_LEVEL="info".

```
$ export HMPPRT_LOG_LEVEL="info"
$ aprun ./a.out
[     0.029828] ( 0) INFO : Enter    kernels (queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.030021] ( 0) INFO : Acquire  (target=cuda)
[     0.031667] ( 0) INFO : Allocate a[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451555] ( 0) INFO : Allocate b[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451721] ( 0) INFO : Allocate c[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451775] ( 0) INFO : Allocate i[0:1] (element_size=4, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451866] ( 0) INFO : Allocate sum[0:1] (element_size=8, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451940] ( 0) INFO : Call     __hmpp_acc_region__67z3pizb (queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470506] ( 0) INFO : Free     sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470590] ( 0) INFO : Free     i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470626] ( 0) INFO : Free     c[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470677] ( 0) INFO : Free     b[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470727] ( 0) INFO : Free     a[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470863] ( 0) INFO : Leave    kernels (queue=none, location=vecAdd-reductionHMPP.c:34)
```

Breaking this output down:

```
[     0.029828] ( 0) INFO : Enter    kernels (queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.030021] ( 0) INFO : Acquire  (target=cuda)
[     0.031667] ( 0) INFO : Allocate a[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451555] ( 0) INFO : Allocate b[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451721] ( 0) INFO : Allocate c[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451775] ( 0) INFO : Allocate i[0:1] (element_size=4, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451866] ( 0) INFO : Allocate sum[0:1] (element_size=8, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34)
```

We see that at line 34, the start of our acc region, GPU memory is allocated for vectors a, b, and c, while the scalars i and sum are placed in host memory.

```
[     0.451940] ( 0) INFO : Call     __hmpp_acc_region__67z3pizb (queue=none, location=vecAdd-reductionHMPP.c:34)
```

The kernel is called.

```
[     0.470447] ( 0) INFO : Download sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470506] ( 0) INFO : Free     sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)