This sample shows a minimal conversion of our Vector Addition CPU code to an OpenACC accelerator-directives version. Consider this an OpenACC ‘Hello World’. Modifications from the CPU version are highlighted and briefly discussed below. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov

vecAdd.c

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main( int argc, char* argv[] )
{

    // Size of vectors
    int n = 10000;

    // Input vectors
    double *restrict a;
    double *restrict b;
    // Output vector
    double *restrict c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector
    a = (double*)malloc(bytes);
    b = (double*)malloc(bytes);
    c = (double*)malloc(bytes);

    // Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    int i;
    for(i=0; i<n; i++) {
        a[i] = sin(i)*sin(i);
        b[i] = cos(i)*cos(i);
    }   

    // sum component wise and save result into vector c
    #pragma acc kernels copyin(a[0:n],b[0:n]), copyout(c[0:n])
    for(i=0; i<n; i++) {
        c[i] = a[i] + b[i];
    }

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0.0;
    for(i=0; i<n; i++) {
        sum += c[i];
    }
    sum = sum/n;
    printf("final result: %f\n", sum);

    // Release memory
    free(a);
    free(b);
    free(c);

    return 0;
}

Changes to vecAdd.c

// Input vectors
double *restrict a;
double *restrict b;
// Output vector
double *restrict c;

The restrict keyword is a non-enforced guarantee to the compiler that the pointers are not aliased. It is not required by the OpenACC standard, but in some cases it allows the compiler to better vectorize the code.
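
As an illustration of why aliasing matters, consider a hypothetical pair of scaling routines (the names and routines below are made up for this sketch and are not part of the sample): without restrict the compiler must assume the output pointer may overlap an input pointer and act conservatively, while with restrict it is free to vectorize the loop.

// Sketch only: without restrict the compiler must assume out may alias in,
// so it cannot safely reorder or vectorize the loads and stores.
void scale(double *out, const double *in, double s, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = s * in[i];
}

// With restrict we promise the two regions do not overlap, which lets the
// compiler vectorize (or offload) the loop more aggressively.
void scale_restrict(double *restrict out, const double *restrict in,
                    double s, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = s * in[i];
}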

// sum component wise and save result into vector c
#pragma acc kernels copyin(a[0:n],b[0:n]), copyout(c[0:n])
for(i=0; i<n; i++) {
    c[i] = a[i] + b[i];
}

The compiler will analyze the loops following the acc kernels directive and, if possible, compute them on the GPU. Memory is copied from the CPU to the GPU at the start of the region and back from the GPU to the CPU at the end, as specified by the copyin and copyout clauses.
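
The same transfers can also be expressed with an enclosing data region, which becomes useful once several kernels operate on the same arrays. A minimal sketch (not part of the sample code, and assuming the same declarations as above) would look like:

// Sketch only: an equivalent formulation using an explicit data region.
// The data directive scopes the device copies of a, b, and c; the kernels
// directive inside it launches the loop without any additional transfers.
#pragma acc data copyin(a[0:n], b[0:n]) copyout(c[0:n])
{
    #pragma acc kernels
    for (i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}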

Compiling vecAdd.c

PGI

$ module load cudatoolkit
$ cc -acc vecAdd.c -o vecAdd.out

Cray

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ cc -h pragma=acc vecAdd.c -o vecAdd.out

CapsMC

CapsMC can be used in conjunction with any of the programming environments available on Titan. To handle the restrict qualifier properly, the GNU programming environment requires the compiler flag -std=c99 to be added to cc, and the Intel programming environment requires the flag -restrict.
$ module load cudatoolkit capsmc
$ cc vecAdd.c -o vecAdd.out
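
For example, under the GNU programming environment the compile line would look roughly as follows (a sketch combining the module and flag notes above, not taken verbatim from the sample):

$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit capsmc
$ cc -std=c99 vecAdd.c -o vecAdd.out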

Running

$ aprun ./vecAdd.out
final result: 1.000000

vecAdd.f90

program main

    ! Size of vectors
    integer :: n = 100000

    ! Input vectors
    real(8),dimension(:),allocatable :: a
    real(8),dimension(:),allocatable :: b  
    ! Output vector
    real(8),dimension(:),allocatable :: c

    integer :: i
    real(8) :: sum

    ! Allocate memory for each vector
    allocate(a(n))
    allocate(b(n))
    allocate(c(n))

    ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    do i=1,n
        a(i) = sin(i*1D0)*sin(i*1D0)
        b(i) = cos(i*1D0)*cos(i*1D0)  
    enddo

    ! Sum component wise and save result into vector c

    !$acc kernels copyin(a(1:n),b(1:n)), copyout(c(1:n))
    do i=1,n
        c(i) = a(i) + b(i)
    enddo
    !$acc end kernels

    ! Sum up vector c and print result divided by n, this should equal 1 within error
    sum = 0.0
    do i=1,n
        sum = sum +  c(i)
    enddo
    sum = sum/n
    print *, 'final result: ', sum

    ! Release memory
    deallocate(a)
    deallocate(b)
    deallocate(c)

end program

Changes to vecAdd.f90

!$acc kernels copyin(a(1:n),b(1:n)), copyout(c(1:n))
do i=1,n
    c(i) = a(i) + b(i)
enddo
!$acc end kernels

The loop directly following the acc kernels directive will be computed on the GPU. Memory is copied from the CPU to the GPU at the start of the region and back from the GPU to the CPU at the end, as specified by the copyin and copyout clauses.

Compiling vecAdd.f90

PGI

$ module load cudatoolkit
$ ftn -acc vecAdd.f90 -o vecAdd.out

Cray

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ ftn -h acc vecAdd.f90 -o vecAdd.out

CapsMC

CapsMC can be used in conjunction with any of the programming environments available on Titan.
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit capsmc
$ ftn vecAdd.f90 -o vecAdd.out

Running

$ aprun ./vecAdd.out
final result: 1.000000

Reduction

OpenACC includes a parallel reduction operation. When a reduction operation is encountered in a kernels region, the compiler should automatically detect it and run the reduction on the accelerator. When relying on the compiler to implicitly parallelize your code, it is important to check that it is doing so correctly, as discussed in the Additional Information section below.
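
If you prefer not to rely on implicit detection, the reduction can also be requested explicitly with OpenACC's reduction clause. A minimal C sketch (not the form used in the samples below, and assuming the same declarations as the sample) looks like this:

// Sketch only: an explicit reduction clause on a parallel loop construct,
// telling the compiler outright that sum is a summation reduction variable.
double sum = 0.0;
#pragma acc parallel loop reduction(+:sum) copyin(a[0:n], b[0:n]) create(c[0:n])
for (i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
    sum += c[i];
}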

vecAdd-reduction.c

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main( int argc, char* argv[] )
{

    // Size of vectors
    int n = 10000;

    // Input vectors
    double *restrict a;
    double *restrict b;
    // Output vector
    double *restrict c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector
    a = (double*)malloc(bytes);
    b = (double*)malloc(bytes);
    c = (double*)malloc(bytes);

    // Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    int i;
    for(i=0; i<n; i++) {
        a[i] = sin(i)*sin(i);
        b[i] = cos(i)*cos(i);
    }

    // sum component wise and save result into vector c and do reduction
    double sum = 0.0;
    #pragma acc kernels copyin(a[0:n],b[0:n]), create(c[0:n])
    {
        for(i=0; i<n; i++) {
            c[i] = a[i] + b[i];
            sum += c[i];
        }
    }

    sum = sum/n;
    printf("final result: %f\n", sum);

    // Release memory
    free(a);
    free(b);
    free(c);

    return 0;
}

vecAdd-reduction.f90

program main

    ! Size of vectors
    integer :: n = 100000

    ! Input vectors
    real(8),dimension(:),allocatable :: a
    real(8),dimension(:),allocatable :: b
    ! Output vector
    real(8),dimension(:),allocatable :: c

    integer :: i
    real(8) :: sum

    ! Allocate memory for each vector
    allocate(a(n))
    allocate(b(n))
    allocate(c(n))

    ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    do i=1,n
        a(i) = sin(i*1D0)*sin(i*1D0)
        b(i) = cos(i*1D0)*cos(i*1D0)
    enddo

    ! Sum component wise and save result into vector c, perform reduction sum on c
    sum = 0.0

    !$acc kernels copyin(a(1:n),b(1:n)), create(c(1:n))
    do i=1,n
        c(i) = a(i) + b(i)
        sum = sum +  c(i)
    enddo
    !$acc end kernels

    ! Divide the sum by n, the final result should be 1.0 within error
    sum = sum/n
    print *, 'final result: ', sum

    ! Release memory
    deallocate(a)
    deallocate(b)
    deallocate(c)

end program

Additional Information

With directive use, particularly when the compiler is relied upon to implicitly identify and parallelize code, it is important to check that it is doing what you expect. The way to extract more information from the compilation process differs between compilers.

PGI

For PGI the -Minfo compiler flag will print out detailed optimization information, including memory-transfer and thread-placement details.

Compiling:
Compile C code with the PGI compiler and -Minfo flag as follows:

$ cc -acc -Minfo vecAdd-reduction.c

Compile Fortran code with the PGI compiler and -Minfo flag as follows:

$ ftn -acc -Minfo vecAdd-reduction.f90

Output:

main:
     34, Generating local(c[0:n])
         Generating copyin(b[0:n])
         Generating copyin(a[0:n])
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     35, Loop is parallelizable
         Accelerator kernel generated
         35, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             CC 2.0 : 18 registers; 0 shared, 88 constant, 0 local memory bytes
         37, Sum reduction generated for sum

Breaking this output down:

main:
     34, Generating local(c[0:n])
         Generating copyin(b[0:n])
         Generating copyin(a[0:n])

We see that at line 34, the start of our acc region, the first n elements of vectors a and b will be copied to the GPU. Vector c does not need to be copied to the GPU, so memory is simply allocated for it on the accelerator.

35, Loop is parallelizable
    Accelerator kernel generated

Starting with line 35, the line containing the for/do loop statement, we see that the compiler has found the loop parallelizable and generated a GPU kernel. Let's break down the provided information.

35, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    CC 2.0 : 18 registers; 0 shared, 88 constant, 0 local memory bytes

In CUDA terminology this translates to a kernel with a block size of 128, that is, 128 threads in each logical thread block. Each thread uses 18 registers, and 88 values are stored in the constant cache.
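
If a different launch configuration is desired, it can also be requested explicitly rather than left to the compiler. A minimal sketch (not part of the sample code, assuming the sample's declarations) asking for 256 threads per block might look like:

// Sketch only: explicitly request gang/vector partitioning with a
// 256-wide vector, i.e. 256 threads per thread block in CUDA terms.
#pragma acc kernels copyin(a[0:n], b[0:n]) copyout(c[0:n])
{
    #pragma acc loop gang vector(256)
    for (i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}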

37, Sum reduction generated for sum

The compiler has correctly identified the sum reduction and created an optimized kernel for it.

Additionally, if the environment variable ACC_NOTIFY is set, runtime information is printed out.

$ export ACC_NOTIFY=1
$ aprun ./vecAdd.out
launch kernel  file=/autofs/na1_home/atj/OLCF3/OpenACC/VecAdd/vecAdd-reduction.c function=main line=35 device=0 grid=79 block=128 queue=0
launch kernel  file=/autofs/na1_home/atj/OLCF3/OpenACC/VecAdd/vecAdd-reduction.c function=main line=37 device=0 grid=1 block=256 queue=0

Cray

Compiling:
C

$ cc -h pragma=acc,msgs vecAdd.c -o vecAdd.out

Fortran

$ ftn -h acc,msgs vecAdd.f90 -o vecAdd.out

Output:

    !$acc kernels copyin(a(1:n),b(1:n)), create(c(1:n))
ftn-6413 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 
  A data region was created at line 28 and ending at line 38.
ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 
  Allocate memory and copy user shaped variable "b" to accelerator, free at line 38 (acc_copyin).
ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 
  Allocate memory and copy user shaped variable "a" to accelerator, free at line 38 (acc_copyin).
ftn-6421 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 
  Allocate memory for user shaped variable "c" on accelerator, free at line 38 (acc_share).

      do i=1,n
ftn-6401 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 
  A loop starting at line 29 was placed on the accelerator.
ftn-6430 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 
  A loop starting at line 29 was partitioned across the threadblocks and the 128 threads within a threadblock.
ftn-6415 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 
  Allocate memory and copy variable "sum" to accelerator, copy back at line 32 (acc_copy).

Breaking this output down:

    !$acc kernels copyin(a(1:n),b(1:n)), create(c(1:n))
ftn-6413 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 
  A data region was created at line 28 and ending at line 38.
ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 
  Allocate memory and copy user shaped variable "b" to accelerator, free at line 38 (acc_copyin).
ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 
  Allocate memory and copy user shaped variable "a" to accelerator, free at line 38 (acc_copyin).
ftn-6421 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 
  Allocate memory for user shaped variable "c" on accelerator, free at line 38 (acc_share).

We see that at line 28, the start of our acc region, elements 1 to n of the vectors a and b will be copied to the GPU. Vector c does not need to be copied to the GPU, so memory is simply allocated for it on the accelerator.

      do i=1,n
ftn-6401 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 
  A loop starting at line 29 was placed on the accelerator.
ftn-6430 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 
  A loop starting at line 29 was partitioned across the threadblocks and the 128 threads within a threadblock.
ftn-6415 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 
  Allocate memory and copy variable "sum" to accelerator, copy back at line 32 (acc_copy).

At line 29, the start of our do loop, we see that a kernel was generated with 128 threads per threadblock. At the end of the loop, sum is copied from the accelerator back to the host.

Additionally, if the environment variable CRAY_ACC_DEBUG is set, runtime information is printed out.

$ export CRAY_ACC_DEBUG=1
$ aprun ./vecAdd.out
ACC: Transfer 3 items (to acc 1600000 bytes, to host 0 bytes) from vecAdd-reduction.f90:28
ACC: Transfer 3 items (to acc 12 bytes, to host 0 bytes) from vecAdd-reduction.f90:29
ACC: Execute kernel main_$ck_L29_2 async(auto) from vecAdd-reduction.f90:29
ACC: Wait async(auto) from vecAdd-reduction.f90:32
ACC: Transfer 3 items (to acc 0 bytes, to host 8 bytes) from vecAdd-reduction.f90:32
ACC: Wait async(auto) from vecAdd-reduction.f90:38
ACC: Transfer 3 items (to acc 0 bytes, to host 0 bytes) from vecAdd-reduction.f90:38

CapsMC

Currently CapsMC will not compile the provided code due to perceived inter-iteration dependencies. To force the compiler to generate an accelerator kernel, either of the following loop-level directives (C and Fortran forms shown) can be added directly above the for/do loop; see the placement sketch after this list:

#pragma acc loop independent
#pragma acc loop reduction(+:sum)

!$acc loop independent
!$acc loop reduction(+:sum)
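
For the C version, the chosen directive sits directly above the for loop inside the kernels region, roughly as follows (a sketch based on vecAdd-reduction.c, assuming its declarations):

// Sketch only: loop-level reduction directive placed inside the kernels region.
#pragma acc kernels copyin(a[0:n], b[0:n]) create(c[0:n])
{
    #pragma acc loop reduction(+:sum)
    for(i=0; i<n; i++) {
        c[i] = a[i] + b[i];
        sum += c[i];
    }
}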

Compiling:
C

$ capsmc --io-report cc vecAdd-reduction.c

Fortran

$ capsmc --io-report ftn vecAdd-reduction.f90

Output:

hmpp: [Info] IO report :
hmppcg: [Message DPL0099] vecAdd-reductionHMPP.c:38: Loop 'i' was shared among workers(8)
hmpp: [Info] Generated codelet filename is "__hmpp_acc_region__67z3pizb_cuda.hmf.cu".

Although not much information is provided at compile time, much more can be obtained at runtime by setting the environment variable HMPPRT_LOG_LEVEL="info".

$ export HMPPRT_LOG_LEVEL="info"
$ aprun ./a.out
[     0.029828] ( 0) INFO : Enter    kernels (queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.030021] ( 0) INFO : Acquire  (target=cuda)
[     0.031667] ( 0) INFO : Allocate a[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451419] ( 0) INFO : Upload   a[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451555] ( 0) INFO : Allocate b[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451610] ( 0) INFO : Upload   b[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451721] ( 0) INFO : Allocate c[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451775] ( 0) INFO : Allocate i[0:1] (element_size=4, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451826] ( 0) INFO : Upload   i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451866] ( 0) INFO : Allocate sum[0:1] (element_size=8, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451905] ( 0) INFO : Upload   sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451940] ( 0) INFO : Call     __hmpp_acc_region__67z3pizb (queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470447] ( 0) INFO : Download sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470506] ( 0) INFO : Free     sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470555] ( 0) INFO : Download i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470590] ( 0) INFO : Free     i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470626] ( 0) INFO : Free     c[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470677] ( 0) INFO : Free     b[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470727] ( 0) INFO : Free     a[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470863] ( 0) INFO : Leave    kernels (queue=none, location=vecAdd-reductionHMPP.c:34)

Breaking this output down:

[     0.029828] ( 0) INFO : Enter    kernels (queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.030021] ( 0) INFO : Acquire  (target=cuda)
[     0.031667] ( 0) INFO : Allocate a[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451419] ( 0) INFO : Upload   a[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451555] ( 0) INFO : Allocate b[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451610] ( 0) INFO : Upload   b[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451721] ( 0) INFO : Allocate c[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451775] ( 0) INFO : Allocate i[0:1] (element_size=4, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451826] ( 0) INFO : Upload   i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451866] ( 0) INFO : Allocate sum[0:1] (element_size=8, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451905] ( 0) INFO : Upload   sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)

We see that at line 34, the start of our acc region, the first n elements of the vectors a and b are allocated on the GPU and copied to it. Vector c does not need to be copied to the GPU, so memory is simply allocated for it. The loop index i and the scalar sum are also uploaded.

[     0.451940] ( 0) INFO : Call     __hmpp_acc_region__67z3pizb (queue=none, location=vecAdd-reductionHMPP.c:34)

The kernel is called.

[     0.470447] ( 0) INFO : Download sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470506] ( 0) INFO : Free     sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470555] ( 0) INFO : Download i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470590] ( 0) INFO : Free     i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470626] ( 0) INFO : Free     c[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470677] ( 0) INFO : Free     b[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470727] ( 0) INFO : Free     a[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470863] ( 0) INFO : Leave    kernels (queue=none, location=vecAdd-reductionHMPP.c:34)

At the end of the kernel, sum is copied from the accelerator back to the host and the device memory is released.