This sample shows a minimal conversion of our vector addition CPU code to a PGI Accelerator directives version. Consider this a PGI Accelerator ‘Hello World.’ Modifications from the CPU version will be highlighted and briefly discussed. Please direct any questions or comments to help@nccs.gov.

This tutorial covers PGI Accelerator directives. If you are interested in PGI's OpenACC support, please see: OpenACC Vector Addition

vecAdd.c

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main( int argc, char* argv[] )
{

    // Size of vectors
    int n = 100000;

    // Input vectors
    double *restrict a;
    double *restrict b;
    // Output vector
    double *restrict c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector
    a = (double*)malloc(bytes);
    b = (double*)malloc(bytes);
    c = (double*)malloc(bytes);
    
    // Initialize input vectors: a[i] = sin(i)^2, b[i] = cos(i)^2
    int i;
    for(i=0; i<n; i++) {
        a[i] = sin(i)*sin(i);
        b[i] = cos(i)*cos(i);
    }

    // Sum component-wise and save result into vector c
    #pragma acc region
    {
        for(i=0; i<n; i++) {
            c[i] = a[i] + b[i];
        }
    }

    // Sum up vector c and print the result divided by n; this should equal 1 within error
    double sum = 0;
    for(i=0; i&lt;n; i++) {
        sum += c[i];
    }
    sum = sum/n;
    printf("final result: %f\n", sum);

    // Release memory
    free(a);
    free(b);
    free(c);

    return 0;
}

Changes to vecAdd.c

// Input vectors
double *restrict a;
double *restrict b;
// Output vector
double *restrict c;

The restrict keyword is required for all vectors that will be sent to, or retrieved from, the GPU. It is a non-enforced guarantee to the compiler that the pointers are not aliased.
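As a quick illustration of why aliasing matters, consider the following sketch. The add function and the offending call are hypothetical, not part of the tutorial code:

// With restrict, the compiler may assume writes through c never
// touch a or b, so the loop iterations are independent and safe
// to run in parallel on the GPU.
void add(double *restrict c, double *restrict a,
         double *restrict b, int n)
{
    int i;
    for(i=0; i<n; i++) {
        c[i] = a[i] + b[i];
    }
}

// A call such as add(v+1, v, v, n) breaks that guarantee: c and a
// then overlap, iteration i overwrites the a[i+1] read by iteration
// i+1, and the behavior is undefined.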

// Sum component-wise and save result into vector c
#pragma acc region
{
    for(i=0; i<n; i++) {
        c[i] = a[i] + b[i];
    }
}

The code inside the acc region is computed on the GPU. The region begins with the #pragma acc region directive and is enclosed in curly brackets. Memory is copied from the CPU to the GPU at the start of the region, and back from the GPU to the CPU at the end, as deemed necessary by the compiler.
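If you want to state the data movement explicitly rather than leave it to the compiler, the PGI Accelerator model also accepts data clauses on the region directive. A minimal sketch, assuming the inclusive [first:last] bounds that the compiler itself reports in the -Minfo output shown later:

#pragma acc region copyin(a[0:n-1], b[0:n-1]) copyout(c[0:n-1])
{
    for(i=0; i<n; i++) {
        c[i] = a[i] + b[i];
    }
}

Here a and b are only copied in and c is only copied out, which is exactly what the compiler deduces on its own for this loop.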

Compiling vecAdd.c

We add the target accelerator flag to specify that we want to compile for NVIDIA accelerators:

$ module load cudatoolkit
$ module load PrgEnv-pgi
$ cc -ta=nvidia vecAdd.c -o vecAdd.out

Running vecAdd.c

$ aprun ./vecAdd.out
final result: 1.000000

vecAdd.f90

program main

    ! Size of vectors
    integer :: n = 100000

    ! Input vectors
    real,dimension(:),allocatable :: a
    real,dimension(:),allocatable :: b
    ! Output vector
    real,dimension(:),allocatable :: c

    integer :: i
    real :: sum = 0.0

    ! Allocate memory for each vector
    allocate(a(n))
    allocate(b(n))
    allocate(c(n))

    ! Initialize input vectors: a(i) = sin(i)^2, b(i) = cos(i)^2
    do i=1,n
        a(i) = sin(i*1.0)*sin(i*1.0)
        b(i) = cos(i*1.0)*cos(i*1.0)
    enddo

    ! Sum component-wise and save result into vector c

    !$acc region
    do i=1,n
        c(i) = a(i) + b(i)
    enddo
    !$acc end region

    ! Sum up vector c and print the result divided by n; this should equal 1 within error
    do i=1,n
        sum = sum +  c(i)
    enddo
    sum = sum/n
    print *, 'final result: ', sum

    ! Release memory
    deallocate(a)
    deallocate(b)
    deallocate(c)

end program

Changes to vecAdd.f90

    !$acc region
    do i=1,n
        c(i) = a(i) + b(i)
    enddo
    !$acc end region

The code inside the acc region is computed on the GPU. The region begins with the !$acc region directive and ends with the !$acc end region directive. Memory is copied from the CPU to the GPU at the start of the region and back from the GPU to the CPU at the end of the region.

Compiling vecAdd.f90

We add the target accelerator flag to specify that we want to compile for NVIDIA accelerators:

$ module load cudatoolkit
$ module load PrgEnv-pgi
$ ftn -ta=nvidia vecAdd.f90 -o vecAdd.out

Running vecAdd.f90

$ aprun ./vecAdd.out
final result: 1.000000

Additional Information

Much information is hidden from the programmer, so let's add the -Minfo compiler flag to see what the compiler is doing. With -Minfo we will see memory transfer and thread placement information.

Compiling
C

$ cc -ta=nvidia -Minfo vecAdd.c -o vecAdd.out

Fortran

$ ftn -ta=nvidia -Minfo vecAdd.f90 -o vecAdd.out

Output

main:
     33, Generating copyin(b[0:99999])
         Generating copyin(a[0:99999])
         Generating copyout(c[0:99999])
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     35, Loop is parallelizable
         Accelerator kernel generated
         35, #pragma acc for parallel, vector(256)
             CC 1.0 : 5 registers; 20 shared, 36 constant, 0 local memory bytes; 100 occupancy
             CC 1.3 : 5 registers; 20 shared, 36 constant, 0 local memory bytes; 100 occupancy

Breaking this output down:

33, Generating copyin(b[0:99999])
    Generating copyin(a[0:99999])
    Generating copyout(c[0:99999])

We see that at line 33, the start of our acc region, elements 0 through 99999 of vectors a and b will be copied to the GPU. Vector c does not need to be copied into the GPU, but does need to come out, and we see that it has been correctly handled by the compiler.

Generating compute capability 1.0 binary
Generating compute capability 1.3 binary

Next the compiler tells us it has generated binaries for both compute capability 1.0 and compute capability 1.3 devices. The binary with the highest compute capability less than or equal to that of the GPU it is run on will be used, allowing the executable to be portable yet highly tuned.
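If you know which device you will be running on, you can ask for a single compute capability instead. A sketch using PGI's -ta suboptions (check your compiler release for the exact spellings supported):

$ cc -ta=nvidia,cc13 vecAdd.c -o vecAdd.out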

35, Loop is parallelizable
    Accelerator kernel generated

At line 35, the line containing the for/do loop statement, the compiler tells us it has found the loop parallelizable and generated an accelerator kernel. Let's break down the information provided.

35, #pragma acc for parallel, vector(256)

In CUDA terminology this translates to a kernel with a block size of 256; that is, each logical thread block will contain 256 threads.
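The compiler chose vector(256) on its own; the PGI Accelerator model also lets you suggest the mapping yourself with a loop directive inside the region, using the same clause spelling shown in the -Minfo output. A sketch, with 128 as an arbitrary example width:

#pragma acc region
{
    #pragma acc for parallel, vector(128)
    for(i=0; i<n; i++) {
        c[i] = a[i] + b[i];
    }
}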

CC 1.0 : 5 registers; 20 shared, 36 constant, 0 local memory bytes; 100 occupancy
  • CC 1.0: Compute capability the following information applies to
  • 5 registers: Number of registers used per thread
  • 20 shared: Amount of shared memory each thread block will use
  • 36 constant: Amount of cached constant memory the kernel will use
  • 0 local memory bytes: Bytes stored in local memory. Local memory is located off the SM and is used by the compiler to alleviate register pressure.
  • 100 occupancy: Percentage of possible warps that are active. High occupancy is a necessary but not sufficient condition for high GPU performance.
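For runtime rather than compile-time information, PGI also provides a time suboption to -ta that links in a simple profiler and prints data transfer and kernel execution times when the program exits. A sketch, assuming the same environment as above:

$ cc -ta=nvidia,time vecAdd.c -o vecAdd.out
$ aprun ./vecAdd.out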