Experience in porting Fortran code to GPUs with OpenACC

Speakers: Mike Zingale and Adam Jacobs
This talk was given at the 9-23-15 OLCF User Group Conference Call.

In the slides and video below, we describe our efforts to put large physics modules written in Fortran onto the GPUs. Our main target is integrating stiff systems of ODEs describing nuclear reactions. Our strategy is to put the entire integration: righthand side evaluations, Jacobian construction, and the actual timestepping onto the GPUs. This minimizes the communication costs — only the initial state vector is passed in, and the new state vector comes out at the end. We will discuss the challenges we had in getting the compilers to accept modern Fortran constructions with OpenACC and discuss preliminary results.

Slides: openacc-networks.pdf

Video:

Experience in porting Fortran code to GPUs with OpenACC from OLCF on Vimeo.

To fully exploit the computational power of the GPU, a large amount of data parallelism must be expressed. If your problem does not possess a sufficient amount of data parallelism, a second option is to combine data parallelism with task parallelism on the GPU through the use of concurrent kernels. To facilitate task parallelism the NVIDIA Kepler K20x features Hyper-Q, a set of 32 hardware managed work queues. When using CUDA streams, each stream will be automatically mapped onto Hyper-Q, allowing up to 32 streams to execute concurrency. The NVIDIA Multi-Process Service allows multiple processes, such as intra-node MPI ranks, to be mapped onto Hyper-Q. This tutorial will demonstrate how to take advantage of GPU concurrency on Titan through the use of Hyper-Q. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov


C

The Serial, OpenMP, and MPI C samples make use of the following gpu wrapper code. The wrapper is used for the following:

  • sleep(): GPU kernel to sleep for a given number of clock cycles
  • get_cycles(): Return the number of cycles required to sleep for the requested number of seconds
  • create_streams(): Create a given number of non default streams
  • sleep_kernel(): Launch a single GPU thread that sleeps for a given number of cycles
  • wait_for_streams(): Wait for streams 1 through the number of streams specified to complete any work
  • destroy_streams(): Destroy non default streams

sleep.cu

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
 
static cudaStream_t *streams;

// CUDA kernel to pause for at least num_cycle cycles
__global__ void sleep(int64_t num_cycles)
{
    int64_t cycles = 0;
    int64_t start = clock64();
    while(cycles < num_cycles) {
        cycles = clock64() - start;
    }
}
 
// Returns number of cycles required for requested seconds
extern "C" int64_t get_cycles(float seconds)
{
    // Get device frequency in KHz
    int64_t Hz;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    Hz = int64_t(prop.clockRate) * 1000;
 
    // Calculate number of cycles to wait
    int64_t num_cycles;
    num_cycles = (int64_t)(seconds * Hz);
   
    return num_cycles;
}
 
// Create streams
extern "C" void create_streams(int num_streams)
{
    // Allocate streams
    streams = (cudaStream_t *) malloc((num_streams+1)*sizeof(cudaStream_t));    
 
    // Default stream
    streams[0] = NULL;

    // Primer kernel launch
    sleep<<< 1, 1 >>>(1); 

    // Create streams
    for(int i = 1; i <= num_streams; i++)
        cudaStreamCreate(&streams[i]);
}
 
// Launches a kernel that sleeps for num_cycles
extern "C" void sleep_kernel(int64_t num_cycles, int stream_id)
{
    // Launch a single GPU thread to sleep
    int blockSize, gridSize;
    blockSize = 1;
    gridSize = 1;
 
    // Execute the kernel
    sleep<<< gridSize, blockSize, 0, streams[stream_id] >>>(num_cycles);
}
 
// Wait for stream to complete
extern "C" void wait_for_stream(int stream_id)
{
    cudaStreamSynchronize(streams[stream_id]);
}
 
// Wait for streams to complete
extern "C" void wait_for_streams(int num_streams)
{
    for(int i = 1; i <= num_streams; i++)
        cudaStreamSynchronize(streams[i]);
}
 
// Destroy stream objects
extern "C" void destroy_streams(int num_streams)
{
    // Clean up stream
    for(int i = 1; i <= num_streams; i++)
        cudaStreamDestroy(streams[i]);
    free(streams);
}

Compile:
The following will compile the wrapper functions into object code to be linked in with our examples.

$ module load cudatoolkit
$ nvcc -arch=sm_35 -c sleep.cu

Serially Launched Concurrent Kernels in C

The following sample launches multiple GPU kernels that sleep a single GPU thread for one second. GPU kernel launches are asynchronous on the CPU which allows a for loop to be used to launch concurrent kernels. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.

launcher.c

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/time.h>

int64_t get_cycles(float seconds);
void sleep_kernel(int64_t num_cycles, int stream_id);
void create_streams(int num_streams);
void wait_for_streams(int num_streams);
void destroy_streams(int num_streams);

int main(int argc, char *argv[])
{
    uint64_t cycles;
    struct timeval start, stop;
    int i, num_kernels;

    // Get number of cycles to sleep for 1 second
    cycles = get_cycles(1.0);

    // Max number of kernels to launch
    int max_kernels = 33;

    // Loop through number of kernels to launch, from 1 to num_kernels
    for(num_kernels=1; num_kernels<=max_kernels; num_kernels++)
    {
        // Start timer
        gettimeofday(&start, NULL);

        // Create streams
        create_streams(num_kernels);

        // Launch num_kernel kernels asynchrnously
        for(i=1; i<=num_kernels; i++){
            sleep_kernel(cycles, i);
        }

        // Wait for kernels to complete
        wait_for_streams(num_kernels);

        // Clean up streams
        destroy_streams(num_kernels);

        // Print seconds ellapsed
        gettimeofday(&stop, NULL);
        double seconds;
        seconds = (stop.tv_sec - start.tv_sec);
        seconds += (stop.tv_usec - start.tv_usec) / 1000000.0;
        printf("Total time for %d kernels: %f s\n", num_kernels, seconds);
    }

    return 0;
}

Compile:

$ module load cudatoolkit
$ cc sleep.o launcher.c -o serial.out

Run:

$ module load cudatoolkit
$ aprun ./serial.out

OpenMP Launched Concurrent Kernels in C

The following example uses OpenMP to launch multiple sleep kernels in a parallel region, using the OMP thread number to specify which stream to use. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.

launcher.c

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <omp.h>

int64_t get_cycles(float seconds);
void sleep_kernel(int64_t num_cycles, int stream_id);
void create_streams(int num_streams);
void wait_for_streams(int num_streams);
void destroy_streams(int num_streams);

int main(int argc, char *argv[])
{
    uint64_t cycles;
    double start, stop;
    int i, num_kernels;

    // Get number of cycles to sleep for 1 second
    cycles = get_cycles(1.0);

    // Number of kernels to launch
    int max_kernels = 33;

    // Loop through number of kernels to launch, from 1 to num_kernels
    for(num_kernels=1; num_kernels<=max_kernels; num_kernels++)
    {

        // Set number of OMP threads
        omp_set_num_threads(num_kernels);

        // Start timer
        start = omp_get_wtime();

        // Create streams
        create_streams(num_kernels);

        // Launch num_kernel kernels asynchrnously
        #pragma omp parallel firstprivate(cycles)
        {
            int stream_id = omp_get_thread_num()+1;
            sleep_kernel(cycles, stream_id);
        }

        // Wait for kernels to complete
        wait_for_streams(num_kernels);
   
        // Wait for kernels to complete and clean up streams
        destroy_streams(num_kernels);

        // Stop timer
        stop = omp_get_wtime();
        printf("Total time for %d kernels: %f s\n", num_kernels, stop-start);
    }

    return 0;
}

Compile:
PGI

$ module load cudatoolkit
$ cc -mp sleep.o launcher.c -o openmp.out

GNU

$ module load cudatoolkit
$ cc -fopenmp sleep.o launcher.c -o openmp.out

Intel

$ module load cudatoolkit
$ cc -openmp sleep.o launcher.c -o openmp.out

Cray

$ module load cudatoolkit
$ cc sleep.o launcher.c -o openmp.out

Run:

$ module load cudatoolkit
$ aprun -d16 ./openmp.out

MPI Launched Concurrent Kernels in C

The following example uses MPI to launch a single GPU kernel per MPI process into the default stream. Titan’s GPU compute mode does not allow multiple processes to access the GPU simultaneously, so the CRAY_CUDA_PROXY must be enabled. When enabled the Cray CUDA proxy works in conjunction with HyperQ to allow kernels from up to 32 different MPI processes to run concurrently. Note however that on Titan currently the max number of MPI processes per node is 16.

launcher.c

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "mpi.h"

int64_t get_cycles(float seconds);
void sleep_kernel(int64_t num_cycles, int stream_id);
void create_streams(int num_streams);
void wait_for_stream(int stream_id);
void destroy_streams(int num_streams);

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    uint64_t cycles;
    double start, stop;
    int num_kernels;

    // Get number of cycles to sleep for 1 second
    cycles = get_cycles(1.0);

    // Number of kernels to launch
    int max_kernels = size;

    // Setup default stream in sleep.cu wrapper
    create_streams(0);

    // Loop through number of kernels to launch, from 1 to max_kernels
    for(num_kernels=1; num_kernels<=max_kernels; num_kernels++)
    {
        // Start timer
        MPI_Barrier(MPI_COMM_WORLD);
        if(rank == 0)
            start = MPI_Wtime();

        // Launch kernel into default stream
        if(rank < num_kernels)
            sleep_kernel(cycles, 0);

        // Wait for all ranks to submit kernel
        MPI_Barrier(MPI_COMM_WORLD);

        // Wait for default stream
        if(rank < num_kernels)
            wait_for_stream(0);

        // Wait for all ranks to complete
        MPI_Barrier(MPI_COMM_WORLD);

        // Print seconds ellapsed
        if(rank == 0) {
            stop = MPI_Wtime();
            printf("Total time for %d kernels: %f s\n", num_kernels, stop-start);
        }
    }

    destroy_streams(0);
    MPI_Finalize();

    return 0;
}

Compile:

$ module load cudatoolkit
$ cc sleep.o launcher.c -o proxy.out
Note that the sleep wrapper does not do proper error checking for simplicity, running this example without the CRAY_CUDA_PROXY enabled will fail silently.

Run:

$ module load cudatoolkit
$ export CRAY_CUDA_PROXY=1
$ aprun -n16 ./proxy.out

OpenACC Launched Concurrent Kernels in C

OpenACC by default will block on the CPU when launching kernels or performing data movement. Using the async(i) clause will allow asynchronous kernel launches, or data transfer. Although it is implementation dependent the integer argument, i, will generally determine which stream the kernel or data operation is launched in. At the time of this writing each implementation handles the integer argument slightly differently.

launcher.c

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "math.h"
#include <sys/time.h>

int main(int argc, char *argv[])
{

    uint64_t num_cycles;
    struct timeval start, stop;
    int i,num_kernels;

    // Set number of cycles to sleep for 1 second
    // We'll use frequency/instruction latency to estimate this
    num_cycles = 730000000/15;

    // Number of kernels to launch
    int max_kernels = 33;

    // Loop through number of kernels to launch, from 1 to num_kernels
    for(num_kernels=1; num_kernels<max_kernels; num_kernels++)
    {
        // Start timer
        gettimeofday(&start, NULL);

        for(i=0; i<=num_kernels; i++)
        {
            #pragma acc parallel async(i) vector_length(1) num_gangs(1)
            {
                uint64_t cycles;
                #pragma acc loop seq
                for(cycles=0; cycles<num_cycles; cycles++) {
                    cycles = cycles;
                }
            }
        }

        // Wait for all async streams to complete
        #pragma acc wait

        // Print seconds ellapsed
        gettimeofday(&stop, NULL);
        double seconds;
        seconds = (stop.tv_sec - start.tv_sec);
        seconds += (stop.tv_usec - start.tv_usec) / 1000000.0;
        printf("Total time for %d kernels: %f s\n", num_kernels, seconds);
    }

    return 0;
}
Please note that for Cray the for loop will be optimized out by default. Adding the volatile qualifier to the declaration of cycles should force the loop to be kept. the volatile qualifier is currently not supported in PGI accelerator regions.

Compile:
PGI

$ module load cudatoolkit
$ cc -acc launcher.c -o acc.out

Cray

$ module load cudatoolkit
$ cc -hpragma=acc launcher.c -o acc.out

Run:

$ module load cudatoolkit
$ aprun ./acc.out

Fortran

The Serial, OpenMP, and MPI Fortran samples make use of the following CUDA Fortran gpu wrapper code. The wrapper is used for the following:

  • sleep(): GPU kernel to sleep for a given number of clock cycles
  • get_cycles(): Return the number of cycles required to sleep for the requested number of seconds
  • create_streams(): Create a given number of non default streams
  • sleep_kernel(): Launch a single GPU thread that sleeps for a given number of cycles
  • wait_for_streams(): Wait for streams 1 through the number of streams specified to complete any work
  • destroy_streams(): Destroy non default streams
For simplicity the following will use the PGI programming environment as it provides CUDA Fortran. For additional information on compiler interoperability please see the tutorial Compiling mixed GPU and CPU code.
module sleep
    use cudadevice
    use cudafor
    implicit none
 
    integer, dimension(:), allocatable :: streams
 
    contains
 
    !CUDA kernel to pause for at least num_cycle cycles
    attributes(global) subroutine sleep(num_cycles)
        integer(8), value :: num_cycles
        integer(8) :: cycles
        integer(8) :: start
 
        cycles = 0
        start = clock64
        do while (cycles < num_cycles)
            cycles = clock64 - start
        enddo
    end subroutine sleep
 
    !Returns number of cycles required for requested seconds
    integer(8) function get_cycles(seconds) result(num_cycles)
        real(8), intent(in) :: seconds
        integer(8) :: istat, Hz
        type(cudadeviceprop) :: prop
 
        istat = cudaGetDeviceProperties(prop, 0)
        Hz = prop%clockRate * 1000
        num_cycles = seconds * Hz
    end function get_cycles
 
    !Create streams
    subroutine create_streams(num_streams)
        integer :: num_streams, istat, i
 
        allocate(streams(num_streams+1))

        streams(1) = 0 

        ! Primer kernel launch
        call sleep<<< 1, 1 >>>(int8(1));

        do i=2,num_streams+1
            istat = cudaStreamCreate(streams(i))
        enddo
    end subroutine create_streams
 
    !Launches a kernel that sleeps for num_cycles
    subroutine sleep_kernel(num_cycles, stream_id)
        integer(8) :: num_cycles
        integer    ::  stream_id
        type(dim3) :: blockSize, gridSize
 
        blockSize = dim3(1,1,1)
        gridSize = dim3(1,1,1)
 
        call sleep<<<gridSize, blockSize, 0, streams(stream_id)>>>(num_cycles)
    end subroutine sleep_kernel
 
    ! Wait for stream to complete
    subroutine wait_for_stream(stream_id)
        integer :: stream_id, istat
 
        istat = cudaStreamSynchronize(streams(stream_id))
    end subroutine wait_for_stream
 
    ! Wait for streams to complete
    subroutine wait_for_streams(num_streams)
        integer :: num_streams, istat, i
 
        do i=2,num_streams+1
            istat = cudaStreamSynchronize(streams(i))
        enddo
    end subroutine wait_for_streams
 
    ! Destroy streams
    subroutine destroy_streams(num_streams)
        integer :: num_streams, i, istat
 
        do i=2,num_streams+1
            istat = cudaStreamDestroy(streams(i))
        enddo
        deallocate(streams)
    end subroutine destroy_streams
 
end module sleep

Serially Launched Concurrent Kernels in Fortran

The following sample launches multiple GPU kernels that sleep a single GPU thread for one second. GPU kernel launches are asynchronous on the CPU which allows a for loop to be used to launch concurrent kernels. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.

launcher.f90

program main
    use sleep

    integer(8) :: cycles
    integer    ::  max_kernels, num_kernels, i
    real(8)    :: start, stop, seconds

    ! Get number of cycles to sleep for 1 second
    seconds = 1.0
    cycles = get_cycles(seconds)

    ! Maximum number of kernels to launch
    max_kernels = 32

    ! Loop through number of kernels to launch, from 1 to num_kernels
    do num_kernels = 1, max_kernels

        ! Start timer
        call cpu_time(start)

        ! Create streams
        call create_streams(num_kernels)

        ! Launch num_kernel kernels asynchrnously
        do i = 2, num_kernels+1
            call sleep_kernel(cycles, i)
        enddo

        ! Wait for kernels to complete and clean up streams
        call destroy_streams(num_kernels)

        ! Stop timer
        call cpu_time(stop)

        print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds'

    enddo

end program main

Compile:
PGI

$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o serial.out

OpenMP Launched Concurrent Kernels in Fortran

The following example uses OpenMP to launch multiple sleep kernels in a parallel region, using the OMP thread number to specify which stream to use. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.

launcher.f90

program main
    use sleep
    use omp_lib

    implicit none

    integer(8) :: cycles
    integer    ::  max_kernels, num_kernels, stream_id
    real(8)    :: start, stop, seconds

    ! Get number of cycles to sleep for 1 second
    seconds = 1.0
    cycles = get_cycles(seconds)

    ! Maximum number of kernels to launch
    max_kernels = 33

    ! Loop through number of kernels to launch, from 1 to num_kernels
    do num_kernels = 1, max_kernels

        ! Set number of OMP threads
        call omp_set_num_threads(num_kernels)

        ! Start timer
        start = omp_get_wtime()

        ! Create streams
        call create_streams(num_kernels)

        ! Launch num_kernel kernels asynchrnously
        !$omp parallel private(stream_id) firstprivate(cycles)
        stream_id = omp_get_thread_num()+2
        call sleep_kernel(cycles, stream_id)
        !$omp end parallel

        ! Wait for kernels to complete and clean up streams
        call destroy_streams(num_kernels)

        ! Stop timer
        stop = omp_get_wtime()

        print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds'

    enddo

end program main

Compile:
PGI

$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o openmp.out

MPI Launched Concurrent Kernels in Fortran

The following example uses MPI to launch a single GPU kernel per MPI process into the default stream. Titan’s GPU compute mode does not allow multiple processes to access the GPU simultaneously, so the CRAY_CUDA_PROXY must be enabled. When enabled the Cray CUDA proxy works in conjunction with HyperQ to allow kernels from up to 32 different MPI processes to run concurrently. Note however that on Titan currently the max number of MPI processes per node is 16.

launcher.f90

program main
    use sleep
    implicit none
    include 'mpif.h'

    integer    ::  max_kernels, num_kernels, i, ierr, rank, size
    integer(8) :: cycles
    real(8)    :: start, stop, seconds

    call MPI_Init(ierr)

    ! Get number of cycles to sleep for 1 second
    seconds = 1.0
    cycles = get_cycles(seconds)

    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, size, ierr)

    ! Number of kernels to launch
    max_kernels = size

    ! Setup default stream in sleep.cu wrapper
    call create_streams(0);

    ! Loop through number of kernels to launch, from 1 to max_kernels
    do num_kernels = 1, max_kernels

        ! Start timer
        call MPI_Barrier(MPI_COMM_WORLD, ierr)
        if (rank == 0) then
            start = MPI_Wtime()
        endif

        ! Launch num_kernel kernels asynchrnously
        if (rank < num_kernels) then
            call sleep_kernel(cycles, 1)
        endif

        ! Wait for all ranks to submit kernel
        call MPI_Barrier(MPI_COMM_WORLD, ierr)

        ! Wait for kernel to complete
        if(rank < num_kernels) then
            call wait_for_stream(0)
        endif

        ! Wait for all ranks to complete
        call MPI_Barrier(MPI_COMM_WORLD, ierr)

        ! Print seconds ellapsed
        if (rank == 0) then
            stop = MPI_Wtime()
            print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds'
        endif

    enddo

    ! clean up array in wrapper, no stream actually destroyed
    call destroy_streams(0)

    call MPI_Finalize(ierr)

end program main

Compile:
PGI

$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o proxy.out
Note that the sleep wrapper does not do proper error checking for simplicity, running this example without the CRAY_CUDA_PROXY enabled will fail silently.

Run:

$ module load cudatoolkit
$ export CRAY_CUDA_PROXY=1
$ aprun -n16 ./proxy.out

OpenACC Launched Concurrent Kernels in Fortran

OpenACC by default will block on the CPU when launching kernels or performing data movement. Using the async(i) clause will allow asynchronous kernel launches, or data transfer. Although it is implementation dependent the integer argument, i, will generally determine which stream the kernel or data operation is launched in. At the time of this writing each implementation handles the integer argument slightly differently.

launcher.f90

program main
    implicit none

    integer(8) :: num_cycles, cycles, i
    integer    :: max_kernels, num_kernels, stream_id
    real(8)    :: start, stop, seconds
    real(4)    :: foo(33)

    ! Set number of cycles to sleep for 1 second
    ! We'll use frequency/instruction latency to estimate this
    num_cycles = 730000000/(15*5)

    ! Maximum number of kernels to launch
    max_kernels = 33

    ! Loop through number of kernels to launch, from 1 to num_kernels
    do num_kernels = 1, max_kernels

        ! Start timer
        call cpu_time(start)

        ! Launch num_kernel kernels asynchrnously
        do i = 1, num_kernels
            !$acc parallel async(i) vector_length(1) num_gangs(1) copy(foo)
            !$acc loop seq
            do cycles = 1, num_cycles
                foo(i) = sin(1.5708)
            enddo
            !$acc end loop
            !$acc end parallel
        enddo

        ! Wait for all async streams to complete
        !$acc wait

        ! Stop timer
        call cpu_time(stop)

        print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds'

    enddo

end program main
Please note that for Cray the for loop will be optimized out by default. Adding the volatile qualifier to the declaration of foo should force the loop to be kept. the volatile qualifier is currently not supported in PGI accelerator regions.

Compile:
PGI

$ module load cudatoolkit
$ ftn -acc launcher.f90 -o acc.out

Cray

$ module load cudatoolkit
$ ftn -hacc launcher.f90 -o acc.out

Run:

$ module load cudatoolkit
$ aprun ./acc.out

This sample shows a minimal conversion of our Vector Addition CPU code to an OpenACC accelerator directives version. Consider this an OpenACC ‘Hello World’. Modifications from the CPU version will be highlighted and briefly discussed. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov

vecAdd.c

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main( int argc, char* argv[] )
{

    // Size of vectors
    int n = 10000;

    // Input vectors
    double *restrict a;
    double *restrict b;
    // Output vector
    double *restrict c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector
    a = (double*)malloc(bytes);
    b = (double*)malloc(bytes);
    c = (double*)malloc(bytes);

    // Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    int i;
    for(i=0; i<n; i++) {
        a[i] = sin(i)*sin(i);
        b[i] = cos(i)*cos(i);
    }   

    // sum component wise and save result into vector c
    #pragma acc kernels copyin(a[0:n],b[0:n]), copyout(c[0:n])
    for(i=0; i<n; i++) {
        c[i] = a[i] + b[i];
    }

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0.0;
    for(i=0; i<n; i++) {
        sum += c[i];
    }
    sum = sum/n;
    printf("final result: %f\n", sum);

    // Release memory
    free(a);
    free(b);
    free(c);

    return 0;
}

Changes to vecAdd.c

// Input vectors
double *restrict a;
double *restrict b;
// Output vector
double *restrict c;

The restrict keyword is a non enforced guarantee to the compiler that the pointers are not aliased. This is not required by the OpenACC standard but in some cases allows the compiler to better vectorize the code.

// sum component wise and save result into vector c
#pragma acc kernels copyin(a[0:n],b[0:n]), copyout(c[0:n])
for(i=0; i<n; i++) {
    c[i] = a[i] + b[i];
}

The compiler will analyze the loops following the acc kernels directive and if possible compute the loops on the GPU. Memory is copied from the CPU to the GPU at the start of the loop and back from the GPU to the CPU at the end of the loop.

Compiling vecAdd.c

PGI

$ module load cudatoolkit
$ cc -acc vecAdd.c -o vecAdd.out

Cray

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ cc -h pragma=acc vecAdd.c -o vecAdd.out

CapsMC

If using the GNU programming environment with CapsMC to properly handle the restrict qualifier the compiler flag -std=c99 will need to be added to cc.
If using the Intel programming environment with CapsMC to properly handle the restrict qualifier the compiler flag -restrict will need to be added to cc.
CapsMC can be used in conjunction with any of the programming environments available on Titan.
$ module load cudatoolkit capsmc
$ cc vecAdd.c -o vecAdd.out

Running

$ aprun ./vecAdd.out
final result: 1.000000

vecAdd.f90

program main

    ! Size of vectors
    integer :: n = 100000

    ! Input vectors
    real(8),dimension(:),allocatable :: a
    real(8),dimension(:),allocatable :: b  
    ! Output vector
    real(8),dimension(:),allocatable :: c

    integer :: i
    real(8) :: sum

    ! Allocate memory for each vector
    allocate(a(n))
    allocate(b(n))
    allocate(c(n))

    ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    do i=1,n
        a(i) = sin(i*1D0)*sin(i*1D0)
        b(i) = cos(i*1D0)*cos(i*1D0)  
    enddo

    ! Sum component wise and save result into vector c

    !$acc kernels copyin(a(1:n),b(1:n)), copyout(c(1:n))
    do i=1,n
        c(i) = a(i) + b(i)
    enddo
    !$acc end kernels

    ! Sum up vector c and print result divided by n, this should equal 1 within error
    do i=1,n
        sum = sum +  c(i)
    enddo
    sum = sum/n
    print *, 'final result: ', sum

    ! Release memory
    deallocate(a)
    deallocate(b)
    deallocate(c)

end program

Changes to vecAdd.f90

!$acc kernels copyin(a(0:n),b(0:n)), copyout(c(0:n))
do i=1,n
    c(i) = a(i) + b(i)
enddo
!$acc end kernels

The loop directly following the acc kernels loop directive will be computed on the GPU. Memory is copied from the CPU to the GPU at the start of the loop and back from the GPU to the CPU at the end of the loop as specified by the copyin and copyout modifiers.

Compiling vecAdd.f90

PGI

$ module load cudatoolkit
$ ftn -acc vecAdd.f90 -o vecAdd.out

Cray

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ ftn -h acc vecAdd.f90 -o vecAdd.out

CapsMC

CapsMC can be used in conjunction with any of the programming environments available on Titan.
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit capsmc
$ ftn vecAdd.f90 -o vecAdd.out

Running

$ aprun ./vecAdd.out
final result: 1.000000

Reduction

OpenACC includes a parallel reduction operation. When a reduction operation is encountered in a kernels region the compiler should automatically detect it and run the reduction on the accelerator. When relying on the compiler to implicitly parallelize your code it is important to check that it is doing so correctly as discussed below in the Additional Information section.

vecAdd-reduction.c

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main( int argc, char* argv[] )
{

    // Size of vectors
    int n = 10000;

    // Input vectors
    double *restrict a;
    double *restrict b;
    // Output vector
    double *restrict c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector
    a = (double*)malloc(bytes);
    b = (double*)malloc(bytes);
    c = (double*)malloc(bytes);

    // Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    int i;
    for(i=0; i<n; i++) {
        a[i] = sin(i)*sin(i);
        b[i] = cos(i)*cos(i);
    }

    // sum component wise and save result into vector c and do reduction
    float sum = 0.0;
    #pragma acc kernels copyin(a[0:n],b[0:n]), create(c[0:n])
    {
        for(i=0; i<n; i++) {
            c[i] = a[i] + b[i];
            sum += c[i];
        }
    }

    sum = sum/n;
    printf("final result: %f\n", sum);

    // Release memory
    free(a);
    free(b);
    free(c);

    return 0;
}

vecAdd-reduction.f90

program main

    ! Size of vectors
    integer :: n = 100000

    ! Input vectors
    real(8),dimension(:),allocatable :: a
    real(8),dimension(:),allocatable :: b
    ! Output vector
    real(8),dimension(:),allocatable :: c

    integer :: i
    real(8) :: sum

    ! Allocate memory for each vector
    allocate(a(n))
    allocate(b(n))
    allocate(c(n))

    ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    do i=1,n
        a(i) = sin(i*1D0)*sin(i*1D0)
        b(i) = cos(i*1D0)*cos(i*1D0)
    enddo

    ! Sum component wise and save result into vector c, perform reduction sum on c

    !$acc kernels !copyin(a(1:n),b(1:n)), copyout(c(1:n))
    do i=1,n
        c(i) = a(i) + b(i)
        sum = sum +  c(i)
    enddo
    !$acc end kernels

    ! Divide the sum by n, the final result should be 1.0 within error
    sum = sum/n
    print *, 'final result: ', sum

    ! Release memory
    deallocate(a)
    deallocate(b)
    deallocate(c)

end program

Additional Information

With directive use, particularly when the compiler is relied upon to implicitly identify and parallelize code, it is important to check that it is doing what you expect.The way to extract more information from the compilation processes differs between compilers.

PGI

For PGI the -Minfo compiler flag will print out detailed optimization information. With the -Minfo flag we will see memory transfer and thread placement information.

Compiling:
Compile C code with the PGI compiler and -Minfo flag as follows:

$ cc -acc -Minfo vecAdd-reduction.c

Compile Fortran code with the PGI compiler and -Minfo flag as follows:

$ ftn -acc -Minfo vecAdd-reduction.f90

Output:

main:
     34, Generating local(c[0:n])
         Generating copyin(b[0:n])
         Generating copyin(a[0:n])
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     35, Loop is parallelizable
         Accelerator kernel generated
         35, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             CC 2.0 : 18 registers; 0 shared, 88 constant, 0 local memory bytes
         37, Sum reduction generated for sum

Breaking this output down:

main:
     34, Generating local(c[0:n])
         Generating copyin(b[0:n])
         Generating copyin(a[0:n])

We see that at line 34, the start of our acc region, that elements 0 to n of the vectors a and b will be copied to the GPU. Vector c does not need to be copied into the GPU so it is allocated locally.

35, Loop is parallelizable
    Accelerator kernel generated

Starting with line 35, the line containing the for/do loop statement, that the compiler has found the loop parallelizable and generated a GPU kernel. Let’s break down the provided information.

35, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
    CC 2.0 : 18 registers; 0 shared, 88 constant, 0 local memory bytes

In CUDA terminology this translates to a kernel that has a block size of 128, that is, 128 threads will be in each logical thread block. Each thread block will use 18 registers and 88 values will be stored in the constant cache.

37, Sum reduction generated for sum

The compiler has correctly identified the sum reduction and created an optimized kernel for it.

Additionally if the environment variable ACC_NOTIFY is set run time information is printed out.

$ export ACC_NOTIFY=1
$ aprun ./vecAdd.out
launch kernel  file=/autofs/na1_home/atj/OLCF3/OpenACC/VecAdd/vecAdd-reduction.c function=main line=35 device=0 grid=79 block=128 queue=0
launch kernel  file=/autofs/na1_home/atj/OLCF3/OpenACC/VecAdd/vecAdd-reduction.c function=main line=37 device=0 grid=1 block=256 queue=0

Cray

Compiling:
C

$ cc -h pragma=acc,msgs vecAdd.c -o vecAdd.out

Fortran

$ ftn -h acc,msgs vecAdd.f90 -o vecAdd.out

Output:

    !$acc kernels copyin(a(1:n),b(1:n)), create(c(1:n))
ftn-6413 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 
  A data region was created at line 28 and ending at line 38.
ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 
  Allocate memory and copy user shaped variable "b" to accelerator, free at line 38 (acc_copyin).
ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 
  Allocate memory and copy user shaped variable "a" to accelerator, free at line 38 (acc_copyin).
ftn-6421 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 
  Allocate memory for user shaped variable "c" on accelerator, free at line 38 (acc_share).

      do i=1,n
ftn-6401 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 
  A loop starting at line 29 was placed on the accelerator.
ftn-6430 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 
  A loop starting at line 29 was partitioned across the threadblocks and the 128 threads within a threadblock.
ftn-6415 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 
  Allocate memory and copy variable "sum" to accelerator, copy back at line 32 (acc_copy).

Breaking this output down:

    !$acc kernels copyin(a(1:n),b(1:n)), create(c(1:n))
ftn-6413 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 
  A data region was created at line 28 and ending at line 38.
ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 
  Allocate memory and copy user shaped variable "b" to accelerator, free at line 38 (acc_copyin).
ftn-6417 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 
  Allocate memory and copy user shaped variable "a" to accelerator, free at line 38 (acc_copyin).
ftn-6421 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 28 
  Allocate memory for user shaped variable "c" on accelerator, free at line 38 (acc_share).

We see that at line 28, the start of our acc region, that elements 1 to n of the vectors a and b will be copied to the GPU. Vector c does not need to be copied into the GPU so it is allocated locally.

      do i=1,n
ftn-6401 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 
  A loop starting at line 29 was placed on the accelerator.
ftn-6430 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 
  A loop starting at line 29 was partitioned across the threadblocks and the 128 threads within a threadblock.
ftn-6415 crayftn: ACCEL MAIN, File = vecAdd-reduction.f90, Line = 29 
  Allocate memory and copy variable "sum" to accelerator, copy back at line 32 (acc_copy).

At line 29, the start of our do loop, we see a kernel was generated with 128 threads per threadblock. At the end of the loop sum is copied from the accelerator back to the device.

Additionally if the environment variable CRAY_ACC_DEBUG is set run time information is printed out.

$ export CRAY_ACC_DEBUG=1
$ aprun ./vecAdd.out
ACC: Transfer 3 items (to acc 1600000 bytes, to host 0 bytes) from vecAdd-reduction.f90:28
ACC: Transfer 3 items (to acc 12 bytes, to host 0 bytes) from vecAdd-reduction.f90:29
ACC: Execute kernel main_$ck_L29_2 async(auto) from vecAdd-reduction.f90:29
ACC: Wait async(auto) from vecAdd-reduction.f90:32
ACC: Transfer 3 items (to acc 0 bytes, to host 8 bytes) from vecAdd-reduction.f90:32
ACC: Wait async(auto) from vecAdd-reduction.f90:38
ACC: Transfer 3 items (to acc 0 bytes, to host 0 bytes) from vecAdd-reduction.f90:38

CapsMC

Currently CapsMC will not compile the provided code due to perceived inter-iteration dependencies. To force the compile either of the following two loop level directives can be added directly above the for/do loop:

#pragma acc loop independent
#pragma acc loop reduction(+:sum)

!$acc loop independent
!$acc loop reduction(+:sum)

Compiling:
C

$ capsmc --io-report cc vecAdd-reduction.c

Fortran

$ capsmc --io-report ftn vecAdd-reduction.f90

Output:

hmpp: [Info] IO report :
hmppcg: [Message DPL0099] vecAdd-reductionHMPP.c:38: Loop 'i' was shared among workers(8)
hmpp: [Info] Generated codelet filename is "__hmpp_acc_region__67z3pizb_cuda.hmf.cu".

Although not much information is provided much more can be provided at runtime. To do so the environment variable HMPPRT_LOG_LEVEL=”info” is set.

$ export HMPPRT_LOG_LEVEL="info"
$ aprun ./a.out
[     0.029828] ( 0) INFO : Enter    kernels (queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.030021] ( 0) INFO : Acquire  (target=cuda)
[     0.031667] ( 0) INFO : Allocate a[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451419] ( 0) INFO : Upload   a[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451555] ( 0) INFO : Allocate b[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451610] ( 0) INFO : Upload   b[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451721] ( 0) INFO : Allocate c[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451775] ( 0) INFO : Allocate i[0:1] (element_size=4, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451826] ( 0) INFO : Upload   i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451866] ( 0) INFO : Allocate sum[0:1] (element_size=8, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451905] ( 0) INFO : Upload   sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451940] ( 0) INFO : Call     __hmpp_acc_region__67z3pizb (queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470447] ( 0) INFO : Download sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470506] ( 0) INFO : Free     sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470555] ( 0) INFO : Download i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470590] ( 0) INFO : Free     i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470626] ( 0) INFO : Free     c[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470677] ( 0) INFO : Free     b[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470727] ( 0) INFO : Free     a[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470863] ( 0) INFO : Leave    kernels (queue=none, location=vecAdd-reductionHMPP.c:34)

Breaking this output down:

[     0.029828] ( 0) INFO : Enter    kernels (queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.030021] ( 0) INFO : Acquire  (target=cuda)
[     0.031667] ( 0) INFO : Allocate a[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451419] ( 0) INFO : Upload   a[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451555] ( 0) INFO : Allocate b[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451610] ( 0) INFO : Upload   b[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451721] ( 0) INFO : Allocate c[0:10000] (element_size=8, memory_space=cudaglob, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451775] ( 0) INFO : Allocate i[0:1] (element_size=4, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451826] ( 0) INFO : Upload   i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451866] ( 0) INFO : Allocate sum[0:1] (element_size=8, memory_space=host, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.451905] ( 0) INFO : Upload   sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)

We see that at line 34, the start of our acc region, that elements 0 to n of the vectors a and b will be copied to the GPU. Vector c does not need to be copied into the GPU so it is allocated locally.

[     0.451940] ( 0) INFO : Call     __hmpp_acc_region__67z3pizb (queue=none, location=vecAdd-reductionHMPP.c:34)

The kernel is called.

[     0.470447] ( 0) INFO : Download sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470506] ( 0) INFO : Free     sum[0:1] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470555] ( 0) INFO : Download i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470590] ( 0) INFO : Free     i[0:1] (element_size=4, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470626] ( 0) INFO : Free     c[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470677] ( 0) INFO : Free     b[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470727] ( 0) INFO : Free     a[0:10000] (element_size=8, queue=none, location=vecAdd-reductionHMPP.c:34)
[     0.470863] ( 0) INFO : Leave    kernels (queue=none, location=vecAdd-reductionHMPP.c:34)

At the end of the kernel sum is copied from the accelerator back to the device and the memory released.

This sample shows the Game of Life CPU code converted to an OpenACC accelerator directive version. The full source can be viewed or download at the OLCF GitHub. Please direct any questions or comments to help@nccs.gov

GOL.c

#include <stdio.h>
#include <stdlib.h>

#define SRAND_VALUE 1985

#define dim 1024 // Linear game grid dimension excluding ghost cells
#define idx(i,j) ((i)*(dim+2)+(j))

// Add up all neighbors
inline int getNeighbors(int *grid, int i, int j)
{
    int num = grid[idx(i+1,j)] + grid[idx(i-1,j)]     //upper lower
            + grid[idx(i,j+1)] + grid[idx(i,j-1)]     //right left
            + grid[idx(i+1,j+1)] + grid[idx(i-1,j-1)] //diagonals
            + grid[idx(i-1,j+1)] + grid[idx(i+1,j-1)];

    return num;
}

int main(int argc, char* argv[])
{
    int i,j,iter;
    // Number of game iterations
    int maxIter = 1<<10;

    // Total number of alive cells
    int total = 0;

    // Total size of grid in bytes
    size_t bytes = sizeof(int)*(dim+2)*(dim+2);

    // Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
    int *grid = (int*) malloc(bytes);

    // Allocate newGrid
    int *restrict newGrid = (int*) malloc(bytes);

    // Assign initial population randomly
    srand(SRAND_VALUE);
    for(i = 1; i <= dim; i++) {
        for(j = 1; j <= dim; j++) {
            grid[idx(i,j)] = rand() % 2;
        }
    }

  int fullSize = (dim+2)*(dim+2);
  #pragma acc data copyin(grid[0:fullSize]) create(newGrid[0:fullSize])
  {
    // Main game loop
    for (iter = 0; iter<maxIter; iter++) {
    #pragma acc kernels
    {
        // Left-Right columns
        #pragma acc loop independent
        for (i = 1; i <= dim; i++) {
            grid[idx(i,0)] = grid[idx(i,dim)];   //Copy first real column to right ghost column
            grid[idx(i,dim+1)] = grid[idx(i,1)]; //Copy last real column to left ghost column
        }
        // Top-Bottom rows
        #pragma acc loop independent
        for (j = 0; j <= dim+1; j++) {
            grid[idx(0,j)] = grid[idx(dim,j)];   //Copy first real row to bottom ghost row
            grid[idx(dim+1,j)] = grid[idx(1,j)]; //Copy last real row to top ghost row
        }

        // Now we loop over all cells and determine their fate
        #pragma acc loop independent
        for (i = 1; i <= dim; i++) {
            for (j = 1; j <= dim; j++) {
               // Get the number of neighbors for a given grid point
                int numNeighbors = getNeighbors(grid,i,j);

                // Here we have explicitly all of the game rules
                if (grid[idx(i,j)] == 1 && numNeighbors < 2)
                    newGrid[idx(i,j)] = 0;
                else if (grid[idx(i,j)] == 1 && (numNeighbors == 2 || numNeighbors == 3))
                    newGrid[idx(i,j)] = 1;
                else if (grid[idx(i,j)] == 1 && numNeighbors > 3)
                    newGrid[idx(i,j)] = 0;
                else if (grid[idx(i,j)] == 0 && numNeighbors == 3)
                    newGrid[idx(i,j)] = 1;
                else
                    newGrid[idx(i,j)] = grid[idx(i,j)];
            }
        }

       // Can't switch pointers so we mannually have to copy array over
       #pragma acc loop independent
       for(i = 1; i <= dim; i++) {
           for(j = 1; j <= dim; j++) {
               grid[idx(i,j)] = newGrid[idx(i,j)];
           }
       }
    } // End ACC kernels region
    } // End main game loop

    // Sum up alive cells
    #pragma acc parallel
    {
        #pragma acc loop reduction(+:total)
        for (i = 1; i <= dim; i++) {
            #pragma acc loop reduction(+:total)
            for (j = 1; j <= dim; j++) {
                total += grid[idx(i,j)];
            }
        }
    }
  } // End ACC Data region

    // Print number of live cells
    printf("Total Alive: %d\n", total);

    // Release memory
    free(grid);
    free(newGrid);

    return 0;
}

Changes to GOL.c

#define dim 1024 // Linear game grid dimension excluding ghost cells
#define idx(i,j) ((i)*(dim+2)+(j))

At the time of this writing, multidimensional C arrays are not fully supported in all OpenACC implementations. A convenience macro is defined and used whenever grid or newGrid is accessed. These changes have not all been highlighted. grid[i][j] will now be accessed as grid[idx(i,j)]

inline int getNeighbors(int *grid, int i, int j)

OpenACC 1.0 does not support function calls within acc regions, all function calls must be inlined. The inline qualifier can be used as a hint to the compiler we wish to have the function inlined.

// Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
int *grid = (int*) malloc(bytes);

// Allocate newGrid
int *restrict newGrid = (int*) malloc(bytes);

The grid and newGrid arrays are now linear. As discussed in the vector addition example we use the restrict qualifier to help the compiler resolve loop dependencies.

#pragma acc data copyin(grid[0:fullSize]) create(newGrid[0:fullSize])
{
}

Memory transfers between host and device can be very costly and so optimizing memory transfers in any way possible is highly recommended. Here an ACC data region starts on line 49 and ends at line 101. grid will be copied to the accelerator at the start of the region. newGrid will be allocated on the device but no copies from host to device or device to host are needed.

#pragma acc kernels
{
}

An acc kernels region will attempt to parallelize any enclosed loops across the accelerator. The kernels directive gives the compiler complete control, if not guided by the programmer, in deciding how to map the loops onto the target hardware. This clause does not guarantee the compiler will be able to parallelize the loop across the device, if the compiler is not able to determine the loop iterations are independent the loop may be run sequentially on the accelerator. It is imperative that compiler information be checked to ensure that the loops are parallelized.

// Left-Right columns
#pragma acc loop independent
// Top-Bottom rows
#pragma acc loop independent
// Now we loop over all cells and determine their fate
#pragma acc loop independent
// Can't switch pointers so we mannually have to copy array over
#pragma acc loop independent

The compiler is not always able to determine that loop iterations can be done independently, when this is encountered the loop will be scheduled sequentially on the accelerator. The loop independent clause lets the compiler know that it is safe to parallelize the immediately following loop.

// Can't switch pointers so we mannually have to copy array over
#pragma acc loop independent
for(i = 1; i <= dim; i++) {
   for(j = 1; j <= dim; j++) {
       grid[idx(i,j)] = newGrid[idx(i,j)];
   }
}

Device pointer manipulation is not currently supported by OpenACC so the pointer swap must be removed. The easiest way to fix this is to replace the pointer swap with a copy operation as we have done.

// Sum up alive cells
#pragma acc parallel
{
    #pragma acc loop reduction(+:total)
    for (i = 1; i <= dim; i++) {
        #pragma acc loop reduction(+:total)
        for (j = 1; j <= dim; j++) {
            total += grid[idx(i,j)];
        }
    }
}

At the start of this acc parallel region a number of compiler determined gangs will be launched on the GPU. The acc loop directive indicates that the immediately following loop will be parallelized across the accelerator. The reduction clause on both loops indicate that a reduction is required on the scalar variable total.

Compiling GOL.c

PGI

$ module load cudatoolkit
$ cc -acc gol.c -Minline -Minfo -ta=nvidia:kepler

CRAY
Cray will automatically attempt to inline the getNeighbors function

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ cc -hpragma=acc,msgs gol.c

Compiling GOL.c with CapsMC

If using the GNU programming environment with capsmc to properly handle the restrict qualifier the compiler flag -std=c99 will need to be added to cc.
If using the Intel programming environment with capsmc to properly handle the restrict qualifier the compiler flag -restrict will need to be added to cc.
CapsMC can be used in conjunction with any of the programming environments available on Titan

CapsMC will automatically attempt to inline the getNeighbors function

$ module load capsmc
$ module load cudatoolkit
$ cc gol.c

GOL.f90

program main
    implicit none
    integer :: i,j,iter,seed(8),numNeighbors,total
    real :: randm
    ! Linear game grid dimension
    integer :: dim = 1024
    ! Number of game iterations
    integer :: maxIter = 2**10

    ! Game grid pointers
    integer,dimension(:,:),allocatable :: grid, newGrid
 
    ! Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
    allocate(grid(dim+2,dim+2))
    allocate(newGrid(dim+2,dim+2))

    ! Assign initial population randomly
    seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/)
    call random_seed(PUT=seed)
    do j=1,dim
        do i=1,dim
            call random_number(randm)
            grid(i,j) = nint(randm)
        enddo
    enddo

    !$acc data copyin(grid), create(newGrid)

    ! Main game loop
    do iter=1,maxIter

    !$acc kernels

        ! Top-Bottom ghost rows
        !$acc loop independent
        do j=2,dim+1
            grid(1,j) = grid(dim+1,j) !Copy first game grid row to bottom ghost row
            grid(dim+2,j) = grid(2,j) !Copy first game grid row to top ghost row
        enddo

        ! Left-Right ghost columns
        !$acc loop independent
        do i=1,dim+2
            grid(i,1) = grid(i, dim+1) !Copy first game grid column to right ghost column
            grid(i,dim+2) = grid(i,2)  !Copy last game grid column to left ghost column
        enddo

        ! Now we loop over all cells and determine their fate
        do j=2,dim+1
            do i=2,dim+1
                ! Get the number of neighbors for a given grid point
                numNeighbors = grid(i,j+1) + grid(i,j-1)&     !left and right
                             + grid(i+1,j) + grid(i-1,j)&     !upper and lower
                             + grid(i+1,j+1) + grid(i-1,j-1)& !diagonals
                             + grid(i-1,j+1) + grid(i+1,j-1)

                ! Here we have explicitly all of the game rules
                if(grid(i,j) == 1 .AND. numNeighbors < 2) then
                    newGrid(i,j) = 0
                elseif(grid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
                    newGrid(i,j) = 1
                elseif(grid(i,j) == 1 .AND. numNeighbors > 3) then
                    newGrid(i,j) = 0
                elseif(grid(i,j) == 0 .AND. numNeighbors == 3) then
                    newGrid(i,j) = 1
                else
                    newGrid(i,j) = grid(i,j)
                endif
            enddo
        enddo

        ! Can't switch pointers so we mannually have to copy array over
        do j=2,dim+1
            do i=2,dim+1
                grid(i,j) = newGrid(i,j)
            enddo
        enddo

    !$acc end kernels
    enddo!end main game loop  

    total = 0
    ! Sum up alive cells and print results
    !$acc parallel
    !$acc loop reduction(+:total)
    do j=2,dim+1
        !$acc loop reduction(+:total)
        do i=2,dim+1
            total = total + grid(i,j)
        enddo
    enddo
    !$acc end parallel

    !$acc end data

    print *, "Total Alive", total

end program

Changes to GOL.f90

!$acc data copyin(grid), create(newGrid)
!$acc end data

Memory transfers between host and device can be very costly and so optimizing memory transfers in any way possible is highly recommended. Here an ACC data region starts on line 27 and ends at line 90. grid will be copied to the accelerator at the start of the region. newGrid will be allocated on the device but no copies from host to device or device to host are needed.

!$acc kernels
!$acc end kernels

An acc kernels region will attempt to parallelize any enclosed loops across the accelerator. The kernels directive gives the compiler complete control, if not guided by the programmer, in deciding how to map the loops onto the target hardware. This clause does not guarantee the compiler will be able to parallelize the loop across the device, if the compiler is not able to determine the loop iterations are independent the loop may be run sequentially on the accelerator. It is imperative that compiler information be checked to ensure that the loops are parallelized.

!$acc loop independent
!$acc loop independent

The compiler is not always able to determine that loop iterations can be done independently, when this is encountered the loop will be scheduled sequentially on the accelerator. The loop independent clause lets the compiler know that it is safe to parallelize the immediately following loop.

numNeighbors = grid(i,j+1) + grid(i,j-1)&     !left and right
             + grid(i+1,j) + grid(i-1,j)&     !upper and lower
             + grid(i+1,j+1) + grid(i-1,j-1)& !diagonals
             + grid(i-1,j+1) + grid(i+1,j-1)

OpenACC 1.0 does not support function calls within acc regions, all function calls must be inlined. For this example, using the current set of compilers, it was best to inline the neighbors function by hand.

! Can't switch pointers so we mannually have to copy array over
do j=2,dim+1
    do i=2,dim+1
        grid(i,j) = newGrid(i,j)
    enddo
enddo

Device pointer manipulation is not currently supported by OpenACC so the pointer swap must be removed. The easiest way to fix this is to replace the pointer swap with a copy operation as we have done.

! Sum up alive cells and print results
!$acc parallel
!$acc loop reduction(+:total)
do j=2,dim+1
    !$acc loop reduction(+:total)
    do i=2,dim+1
        total = total + grid(i,j)
    enddo 
enddo
!$acc end parallel

At the start of this acc parallel region a number of compiler determined gangs will be launched on the GPU. The acc loop directive indicates that the immediately following loop will be parallelized across the accelerator. The reduction clause on both loops indicate that a reduction is required on the scalar variable total.

Compiling GOL.f90

OpenACC 1.0 does not support function calls within acc regions, all function calls must be inlined using the appropriate compiler flag.
PGI

$ module load cudatoolkit
$ ftn -acc gol.f90 -Minline -Minfo -ta=nvidia:kepler

CRAY

Cray will automatically try to inline the getNeighbors function

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ ftn -hacc,msgs gol.f90

Compiling GOL.f90 with CapsMC

CapsMC can be used in conjunction with any of the programming environments available on Titan

CapsMC will automatically try to inline the getNeighbors function

$ module load capsmc
$ module load cudatoolkit
$ ftn gol.f90

The following C and Fortran examples show how to combine OpenACC data regions with GPU libraries such as CUFFT. Since files containing CUFFT function calls must be compiled with the nvcc compiler and OpenACC containing files must be compiled with cc/CC/ftn it is necessary to create wrapper functions as shown below. The acc host_data use_device directive will be used which causes the device allocated memory address of specified variables to be used in host code within the directives scope. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov

Care must be taken to ensure that all dependent GPU operations occur on the same CUDA stream. The example below queries OpenACC for the stream in use and executes all library calls within this stream. It is an error to assume OpenACC is using the default stream and doing so may very well produce difficult to detect race conditions.

C/C++

fft.c

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "openacc.h"

#ifndef M_PI
#define M_PI           3.14159265358979323846
#endif

// Forward declaration of wrapper function that will call CUFFT
extern void launchCUFFT(float *d_data, int n, void *stream);

int main(int argc, char *argv[])
{
    int n = 256;
    float *data = malloc(2*n*sizeof(float));
    int i;

    // Initialize interleaved input data on host
    float w = 7.0;
    float x;
    for(i=0; i<2*n; i+=2)  {
        x = (float)i/2.0/(n-1);
        data[i] = cos(2*M_PI*w*x);
        data[i+1] = 0.0;
    }

    // Copy data to device at start of region and back to host and end of region
    #pragma acc data copy(data[0:2*n])
    {
        // Inside this region the device data pointer will be used
        #pragma acc host_data use_device(data)
        {
           // Query OpenACC for CUDA stream
           void *stream = acc_get_cuda_stream(acc_async_sync);

           // Launch FFT on the GPU
           launchCUFFT(data, n, stream);
        }
    }

    // Find the frequency
    int max_id = 0;
    for(i=0; i<n; i+=2) {
        if( data[i] > data[max_id] )
            max_id = i;
    }
    printf("frequency = %d\n", max_id/2);

    return 0;
}

fft.c + cufft.cu

#include <cufft.h>

// Declared extern "C" to disable C++ name mangling
extern "C" void launchCUFFT(float *d_data, int n, void *stream)
{
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);
    cufftSetStream(plan, (cudaStream_t)stream);
    cufftExecC2C(plan, (cufftComplex*)d_data, (cufftComplex*)d_data,CUFFT_FORWARD);
    cufftDestroy(plan);
}

Compiling C/C++

Cray

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ nvcc cufft.cu -c
$ cc -h pragma=acc -lcufft fft.c cufft.o

PGI

$ module load cudatoolkit
$ nvcc cufft.cu -c
$ cc -acc -lcufft fft.c cufft.o

Fortran

fft.f90

module cufft
  INTERFACE
    subroutine launchcufft(data, n, stream) BIND (C, NAME='launchCUFFT')
      USE ISO_C_BINDING
      implicit none
      type (C_PTR), value :: data
      integer (C_INT), value :: n
      type (C_PTR), value :: stream
    end subroutine
  END INTERFACE
end module cufft

program fft
    USE ISO_C_BINDING
    USE cufft
    USE openacc
    IMPLICIT NONE

    INTEGER, PARAMETER :: n = 256
    COMPLEX (C_FLOAT_COMPLEX) :: data(n)
    INTEGER (C_INT):: i
    INTEGER :: max_id,istat
    type (C_PTR) :: stream

    ! Initialize interleaved input data on host
    REAL :: w = 7.0
    REAL :: x
    REAL, PARAMETER :: PI = 3.1415927
    do i=1,n
        x = (i-1.0)/(n-1.0);
        data(i) = CMPLX(COS(2.0*PI*w*x),0.0)
    enddo

    ! Copy data to device at start of region and back to host and end of region
    !$acc data copy(data)

        ! Inside this region the device data pointer will be used
        !$acc host_data use_device(data)

        ! Query OpenACC for CUDA stream
        stream = acc_get_cuda_stream(acc_async_sync)

        ! Launch FFT on the GPU
        call launchcufft(C_LOC(data), n, stream)
        !$acc end host_data

    !$acc end data

    ! Find the frequency
    max_id = 1
    do i=1,n/2
        if (REAL(data(i)) .gt. REAL(data(max_id))) then
            max_id = i-1
        endif
    enddo
    print *, "frequency:", max_id

end program fft

fft.f90 + cufft.cu

#include <cufft.h>

// Declared extern "C" to disable C++ name mangling
extern "C" void launchCUFFT(float *d_data, int n, void *stream)
{
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);
    cufftSetStream(plan, (cudaStream_t)stream);
    cufftExecC2C(plan, (cufftComplex*)d_data, (cufftComplex*)d_data,CUFFT_FORWARD);
    cufftDestroy(plan);
}

Compiling Fortran

Cray

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ nvcc -c cufft.cu
$ ftn -c -hacc fft.f90
$ ftn fft.o cufft.o -lcufft

PGI

$ module load cudatoolkit
$ nvcc -c cufft.cu
$ ftn -c -acc fft.f90
$ ftn -acc fft.o cufft.o -lcufft

The following C and Fortran examples show how to leverage OpenACC, cuRand, and Thrust in a single application. This example will use cuRand to generate random data, OpenACC to hash the data, and finally Thrust to sort the data into canonical order. Although this example is provided to illustrate OpenACC interoperability only it was written with a simple nearest neighbor hash implementation given interleaved 3D positions(x1,y1,z1,x2,y2,z2…) in mind. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov

Care must be taken to ensure that all dependent GPU operations occur on the same CUDA stream. The example below queries OpenACC for the stream in use and executes all library calls within this stream. It is an error to assume OpenACC is using the default stream and doing so may very well produce difficult to detect race conditions.

C/C++

sort.c

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "openacc.h"

// Forward declaration of nvcc compiled functions
void fill_rand(float *d_buffer, int num, void *stream);
inline int hash_val(float x, float y, float z);
void sort(int *d_key, int *d_values, int num, void *stream);

int main(int argc, char* argv[])
{
    int i,index;
    float x,y,z;
    int dim = 3;
    int num = 1000000;
    int length = dim*num;
    float *restrict positions = (float*)malloc(length*sizeof(float));
    int *restrict keys   = (int*)malloc(num*sizeof(int));
    int *restrict values = (int*)malloc(num*sizeof(int));

    // Query OpenACC for CUDA stream
    void *stream = acc_get_cuda_stream(acc_async_sync);

    // OpenACC will create positions, keys, and values arrays on the device
    #pragma acc data create(positions[0:length], keys[0:num], values[0:num])
    {
        // NVIDIA cuRandom will create our initial random data
        #pragma acc host_data use_device(positions)
        {
             fill_rand(positions, length, stream);
        }

        // OpenACC will calculate the hash value for each particle
        #pragma acc parallel loop
        for(i=0; i<num; i++) {
            index = i*3;
            x = positions[index];
            y = positions[index+1];
            z = positions[index+2];

            // Key is hash value and value is 'particle id'
            keys[i] = hash_val(x,y,z);
            values[i] = i;
        }

        // Thrust will be used to sort our key value pairs
        #pragma acc host_data use_device(keys, values)
        {
            sort(keys, values, num, stream);
        }
    }

    return 0;
}

// Uniform grid hash
inline int hash_val(float x, float y, float z)
{
    double spacing = 0.01;

    // Calculate grid coordinates
    int grid_x,grid_y,grid_z;
    grid_x = abs(floor(x/spacing));
    grid_y = abs(floor(y/spacing));
    grid_z = abs(floor(z/spacing));

    int num_x = 1.0/spacing;
    int num_y = 1.0/spacing;
    int num_z = 1.0/spacing;

    int grid_position = (num_x * num_y * grid_z) + (grid_y * num_x + grid_x);
    return grid_position;
}

sortGPU.cu

#include <stdio.h>
#include <curand.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <thrust/binary_search.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/system_error.h>

// Fill d_buffer with num random numbers
extern "C" void fill_rand(float *d_buffer, int num, void *stream)
{
  curandGenerator_t gen;
  int status;

  // Create generator
  status = curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);

  // Set CUDA stream
  status |= curandSetStream(gen, (cudaStream_t)stream);

  // Set seed
  status |= curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

  // Generate num random numbers
  status |= curandGenerateUniform(gen, d_buffer, num);

  // Cleanup generator
  status |= curandDestroyGenerator(gen);

  if (status != CURAND_STATUS_SUCCESS) {
      printf ("curand failure!\n");
      exit (EXIT_FAILURE);
  }
}

// Sort key value pairs
extern "C" void sort(int *keys, int *values, int num, void *stream)
{
    try {
        // Sort keys AND values array by key
        thrust::sort_by_key(thrust::cuda::par.on((cudaStream_t)stream),
                            keys, keys + num, values);
    }
    catch(thrust::system_error &e) {
        std::cerr << "Error sorting with Thrust: " << e.what() << std::endl;
        exit (EXIT_FAILURE);
    }
}

Compiling C/C++

Cray

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ nvcc sortGPU.cu -c
$ cc -hpragma=acc -lcurand sort.c sortGPU.o

PGI

$ module load cudatoolkit

$ nvcc sortGPU.cu -c
$ cc -acc -lcurand sort.c sortGPU.o -Minline

Fortran

sort.f90

! Module containing nvcc compiled function declarations and hash function
module sortgpu
  INTERFACE
    subroutine fill_rand(positions, length, stream) BIND(C,NAME='fill_rand')
      USE ISO_C_BINDING
      implicit none
      type (C_PTR), value :: positions
      integer (C_INT), value :: length
      type (C_PTR), value :: stream
    end subroutine fill_rand
 
    subroutine sort(keys, values, num, stream) BIND(C,NAME='sort')
      USE ISO_C_BINDING
      implicit none
      type (C_PTR), value :: keys
      type (C_PTR), value :: values
      integer (C_INT), value :: num
      type (C_PTR), value :: stream
    end subroutine sort
  END INTERFACE
 
  contains
    integer function hash_val(x,y,z)
      real :: x,y,z,spacing
      integer :: grid_x, grid_y, grid_z
      integer :: num_x, num_y, num_z
 
      spacing =  0.01;
 
      grid_x = abs(floor(x/spacing))
      grid_y = abs(floor(y/spacing))
      grid_z = abs(floor(z/spacing))
 
      num_x = 1.0/spacing
      num_y = 1.0/spacing
      num_z = 1.0/spacing
 
      hash_val =  (num_x * num_y * grid_z) + (grid_y * num_x + grid_x) + 1
      return
    end function hash_val
end module sortgpu
 
program interop
    use ISO_C_BINDING
    use sortgpu
    use openacc
    implicit none
 
    integer :: i,indx
    real :: x,y,z
    integer, parameter :: dims = 3
    integer(C_INT), parameter :: num = 1000000
    integer(C_INT), parameter :: length = dims*num
    real (C_FLOAT) :: positions(length)
    integer(C_INT) :: keys(num)
    integer(C_INT) :: values(num)
    type (C_PTR) :: stream

    ! Query OpenACC for CUDA stream
    stream = acc_get_cuda_stream(acc_async_sync)

    ! OpenACC will create positions, keys, and values arrays on the device
    !$acc data create(positions, keys, values)
 
        ! NVIDIA cuRandom will create our initial random data
        !$acc host_data use_device(positions)
        call fill_rand(C_LOC(positions), length, stream)
        !$acc end host_data
 
        ! OpenACC will calculate the hash value for each particle
        !$acc parallel loop
        do i=0,num-1
            indx = i*3;
            x = positions(indx+1)
            y = positions(indx+2)
            z = positions(indx+3)
 
            ! Key is 'particle' id and value is hashed position
            keys(i+1) = hash_val(x,y,z);
            values(i+1) = i+1;
        enddo
        !$acc end parallel loop
 
        ! Thrust will be used to sort our key value pairs
        !$acc host_data use_device(keys, values)
        call sort(C_LOC(keys), C_LOC(values), num, stream);
        !$acc end host_data
 
    !$acc end data    
 
end program interop

sortGPU.cu

#include <stdio.h>
#include <curand.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <thrust/binary_search.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/system_error.h>

// Fill d_buffer with num random numbers
extern "C" void fill_rand(float *d_buffer, int num, void *stream)
{
  curandGenerator_t gen;
  int status;

  // Create generator
  status = curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);

  // Set CUDA stream
  status |= curandSetStream(gen, (cudaStream_t)stream);

  // Set seed
  status |= curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

  // Generate num random numbers
  status |= curandGenerateUniform(gen, d_buffer, num);

  // Cleanup generator
  status |= curandDestroyGenerator(gen);

  if (status != CURAND_STATUS_SUCCESS) {
      printf ("curand failure!\n");
      exit (EXIT_FAILURE);
  }
}

// Sort key value pairs
extern "C" void sort(int *keys, int *values, int num, void *stream)
{
    try {
        // Sort keys AND values array by key
        thrust::sort_by_key(thrust::cuda::par.on((cudaStream_t)stream),
                            keys, keys + num, values);
    }
    catch(thrust::system_error &e) {
        std::cerr << "Error sorting with Thrust: " << e.what() << std::endl;
        exit (EXIT_FAILURE);
    }
}

Compiling Fortran

Cray

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ nvcc sortGPU.cu -c
$ ftn -hacc -lcurand sort.f90 sortGPU.o

PGI

$ module load cudatoolkit
$ module switch pgi pgi/13.6.0

$ nvcc sortGPU.cu -c
$ ftn -acc -lcurand sort.f90 sortGPU.o -Minline=name:hash_val

The use of GPU acceleration can add a useful level of parallelization to your code. This tutorial will show how to parallelize a simple Monte Carlo operation using a GPU accelerator. The first code will be a simple serial implementation of the Monte Carlo simulation used to compute π. The next samples will be the same code but accelerated using CUDA, OpenACC, and then Thrust.

Serial

The codes use Monte Carlo methods to estimate π. To set up the estimate, randomly located points are generated within a 2×2 square which has a circle inscribed within it– think of a game of darts. The algorithm generates a large number of random points and checks to see if the coordinates, x and y, of each point are inside the circle- x2+y2≤1. The ratio, P, of points inside the circle to the total number of points tried is calculated. Using this ratio, the approximation of π can be computed by assuming that P approaches the value of ratio of the area of the circle to the area of the square when the number of points, niter, is large. Thus we can say: P= (πr²)/4r² = π/4 and solve for π. To do this, P is multiplied by 4 and the result is an approximation of π. (Note: The more iterations of points generated (niter), the more accurate the approximation will be).

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

void main(int argc, char* argv[])
{
    double niter = 100000;
    double x,y;
    int i;
    int count=0;
    double z;
    double pi;
    srand(time(NULL));
    //main loop
    for (i=0; i<niter; ++i)
    {
        //get random points
        x = (double)random()/RAND_MAX;
        y = (double)random()/RAND_MAX;
        z = sqrt((x*x)+(y*y));
        //check to see if point is in unit circle
        if (z<=1)
        {
            ++count;
        }
    }
    //p = 4(m/n)
    pi = ((double)count/(double)niter)*4.0;
    printf("Pi: %f\n", pi);

To compile this code, run:

$ cc serialpi.c -o a.out

The resulting executable can either be run in an interactive session or via batch script. To launch an interactive session run:

$ qsub -I -A -l walltime=1:00,nodes=1

This executable can also be run via batch script. A sample batch script could look like this:

#!/bin/bash
#    Begin PBS directives
#PBS -A <PROJID>
#PBS -N pi
#PBS -j oe
#PBS -l walltime=0:01:00,nodes=1
#PBS -V
#    End PBS directives and begin shell commands

cd $PBS_O_WORKDIR
date
aprun -n1 ./a.out

Save this in your project’s scratch directory as something like titan.pbs and copy your executable to $MEMBERWORK/<projid>.

To submit the script to the batch node, run

$ qsub titan.pbs

CUDA

Compute Unified Device Architecture (CUDA) is the parallel programming model developed by NVIDIA that extends C or Fortran code with features that target the GPU device. In CUDA, the programmer defines functions to be launched in many thousands or millions of parallel threads on the GPU. These functions are generally referred to as CUDA kernels or simply “kernels”. When writing these kernels, the programmer needs to keep in mind how many threads and blocks of threads they want launched on the GPU at runtime. A possible kernel for this code could look something like this:

__global__ void kernel(int* count_d, float* randomnums)
{
        int i;
        double x,y,z;
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        i = tid;
        int xidx = 0, yidx = 0;

        xidx = (i+i);
        yidx = (xidx+1);

        x = randomnums[xidx];
        y = randomnums[yidx];
        z = ((x*x)+(y*y));

        if (z<=1)
                count_d[tid] = 1;
        else
                count_d[tid] = 0;
}

Take note of the __global__ decorator. This tells the nvcc compiler that this function is a CUDA kernel that will be called by host code and be run on the device. Also, note the variables blockDim.x, blockIdx.x, and threadIdx.x. These are built in variables that identify a thread’s place in the overall three-dimensional grid, similar to the concept of an MPI rank or OpenMP thread.

Another thing to keep in mind is the fact that the CPU (host) and GPU (device) have separate memory. This means that the host can’t directly access memory on the device and vice versa. To overcome this, CUDA gives us a set of functions to transfer and allocate data on the GPU device’s RAM. For example, in the Monte Carlo code, there needs to be a set of random numbers available on the device for the kernel to use, as well as an empty array to hold a 1 or 0, depending on whether or not the point is in the circle or not (See the Serial section for more information on the Monte Carlo specifics). So after the random numbers are generated and stored into an array, that array needs to be copied to the device. Also the empty array needs to be allocated on the device. To do this we use two different CUDA functions. Firstly, the empty array needs to be allocated on the device. The cudaMalloc() function gives us the ability to allocate memory on the GPU:

cudaError_t cudaMalloc(void** devPtr,
                       size_t size)

Before this function is called, an int* named count_d needs to be created (Note that device pointer variables generally end with “_d”). Also, the programmer needs to determine the size of the array. In this case, since the kernel is only processing one point per thread, the number of threads can be used as the size (thread blocks * threads per thread block = total number of threads) Once this is done, the cudaMalloc() function can be called. For this code, it could look like this:

cudaMalloc((void**)&count_d, (blocks*threads)*sizeof(int));

Also, an array to house the random numbers (2 per thread this time) needs to be allocated on the device using cudaMalloc. This is done in the same fashion as before:

cudaMalloc((void**)&randomnums, (2*niter)*sizeof(float));

Once there is a place to contain the required random numbers, they can be generated. The CURAND library contains kernels that generate high quality random numbers using the GPU. CURAND needs a bit of setting up before it can generate the random numbers. First, a CURAND generator needs to be declared:

curandGenerator_t gen;

The next function creates the generator and declares its type.

curandStatus_t curandCreateGenerator(curandGenerator_t* generator,
                                     curandRngType_t rng_type);

The next one sets the seed.

curandStatus_t curandSetPseudoRandomGeneratorSeed(curandGenerator_t generator,
                                                  unsigned long long seed);

Next, curandGenerateUniform() is the actual function that generates the random numbers.

curandStatus_t curandGenerateUniform(curandGenerator_t generator,
                                     float* outputPtr,
                                     size_t num);

Finally, the random number generator can be destroyed because theres no further need for it.

curandStatus_t curandDestroyGenerator(curandGenerator_t generator);

Using these definitions, create the random numbers as follows:

curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MRG32K3A);
curandSetPseudoRandomGeneratorSeed(gen, 4294967296ULL^time(NULL));
curandGenerateUniform(gen, randomnums, (2*niter));
curandDestroyGeneraor(gen);

Now, the CUDA kernel needs to be called and launched on the device. Before this can be done, the programmer needs to know how many threads per block, and how many blocks that need to be launched. In this case, 1000 threads/block and 100 blocks should work fine (these numbers we picked arbitrarily, the programmer can set their own). Once this is determined, the kernel can be launched and passed the correct device pointers.

kernel <<<blocks, threads>>> (count_d, randomnums);

Once the kernel has finished, the count_d array needs to be copied back to the host. Similar to the standard C function memcpy(), we use the CUDA function cudaMemcpy() which looks like this:

cudaError_t cudaMemcpy(void * dst,
                       const void * src,
                       size_t count,
                       enum cudaMemcpyKind kind)

Before this can be done, the programmer needs to declare an empty array on the host that will contain the data that is copied back with cudaMemcpy. Also, make sure the amount of data copied back (size_t count) is the right size (the same size as the cudaMalloc’d array). Finally, cudaMemcpy needs to know in what direction the copy is going to be performed. In this particular instance, the array is on the device and needs to be copied to the host, so the final parameter for the cudaMemcpy needs to be the built in “cudaMemcpyDeviceToHost”. The completed cudaMemcpy line looks like this:

cudaMemcpy(count, count_d, blocks*threads*sizeof(int),
           cudaMemcpyDeviceToHost);

Now that all the CUDA code is set up, the rest of the code looks like this:

cpi.cu:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand.h>
#include <time.h>

//Declare the CUDA kernel
__global__ void kernel(int* count_d, float* randomnums)
{
        int i;
        double x,y,z;
        //Find the overall ID of the thread
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        i = tid;
        int xidx = 0, yidx = 0;

        //Do the MonteCarlo!
        xidx = (i+i);
        yidx = (xidx+1);

        //Get the random x,y points
        x = randomnums[xidx];
        y = randomnums[yidx];
        z = ((x*x)+(y*y));

        if (z<=1)
                count_d[tid] = 1;
        else
                count_d[tid] = 0;
}

//Used to check if there are any errors launching the kernel
void CUDAErrorCheck()
{
        cudaError_t error = cudaGetLastError();
        if (error != cudaSuccess)
        {
                printf("CUDA error : %s (%d)\n", cudaGetErrorString(error), error);
                exit(0);
        }
}

int main(int argc,char* argv[])
{
        //NOTE: if threads and/or blocks is changed, niter needs to be changed to reflect
        //that change (niter=threads*blocks)
        int niter = 100000;
        float *randomnums;
        double pi;
        //Allocate the array for the random numbers
        cudaMalloc((void**)&randomnums, (2*niter)*sizeof(float));
        //Use CuRand to generate an array of random numbers on the device
        int status;
        curandGenerator_t gen;
        status = curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MRG32K3A);
        status |= curandSetPseudoRandomGeneratorSeed(gen, 4294967296ULL^time(NULL));
        status |= curandGenerateUniform(gen, randomnums, (2*niter));
        status |= curandDestroyGenerator(gen);
        //Check to see if there was any problem launching the CURAND kernels and generating
        //the random numbers on the device
        if (status != CURAND_STATUS_SUCCESS)
        {
                printf("CuRand Failure\n");
                exit(EXIT_FAILURE);
        }

        //Threads per thread block to be launched
        int threads = 1000;
        //Number of thread blocks launched
        int blocks = 100;
        int* count_d;
        int *count = (int*)malloc(blocks*threads*sizeof(int));
        unsigned int reducedcount = 0;
        //Allocate the array to hold a value (1,0) whether the point in is the circle (1) or not (0)
        cudaMalloc((void**)&count_d, (blocks*threads)*sizeof(int));
        CUDAErrorCheck();
        //Launch the kernel
        kernel <<<blocks, threads>>> (count_d, randomnums);
        //Acts as a kind of code Barrier until the kernel is finished. Kernel calls are nonblocking so
        //the code would continue regardless of whether the kernel succeeded or not without the Sync
        cudaDeviceSynchronize();
        CUDAErrorCheck();
        //Copy the resulting array back
        cudaMemcpy(count, count_d, blocks*threads*sizeof(int), cudaMemcpyDeviceToHost);
        int i = 0;

        //Reduce array into int
        for(i = 0; i<niter; i++)
                reducedcount += count[i];

        //Free the cudaMalloc()'d arrays
        cudaFree(randomnums);
        cudaFree(count_d);
        free(count);

        //Find the ratio
        pi = ((double)reducedcount/niter)*4.0;
        printf("Pi: %f\n", pi);

        return 0;
}

Compiling

Before this can be compiled, the CUDA toolkit needs to be loaded. To do this, run:

$ module load cudatoolkit

Now that the module is loaded, the nvcc compiler wrapper is available. To compile this run:

$ nvcc -arch=sm_35 -lcurand cpi.cu -o a.out

Here, the -arch=sm_35 switch is used to tell the compiler that the target device Compute Capabilty is 3.5. This allows the programmer to do more things in the CUDA kernel, for instance use double. The -lcurand switch tells the compiler to link in the CURAND library and give the code access to the CURAND functions.

Then copy the executable (in this case, a.out) to the scratch directory (see the serial section for more info on finding the scratch directory). This can be run either from an interactive job or it can be submitted to the queue to be run. To run this executable job inside an interactive job, use this:

$ qsub -I -A [PROJID] -l walltime=1:00,nodes=1

Before you run this, change [PROJID] to the appropriate project ID. Once the job starts, change to the scratch directory and execute the command

$ aprun -n1 ./a.out

Alternately, this can be submitted as a batch job to the queues. To do this, a batch script needs to be written and submitted. A sample batch script could look like this:

#!/bin/bash
#    Begin PBS directives
#PBS -A <PROJID>
#PBS -N pi
#PBS -j oe
#PBS -l walltime=0:01:00,nodes=1
#PBS -V
#    End PBS directives and begin shell commands

cd $PBS_O_WORKDIR
date
aprun -n 1 ./a.out

Again, before this script can be submitted, the appropriate project ID needs to be added instead of [PROJID]. Save this file as something unique, like titan.pbs, and submit it by running:

$ qsub titan.pbs

OpenACC

OpenACC provides an alternative pragma based accelerator programming model that can be easier to use than CUDA. These pragmas are similar to OpenMP’s but launch threads on the GPU instead of the CPU. Parallelizing the Monte Carlo code with OpenACC is a fairly straight-forward procedure. The programmer must first identify the best region to parallelize. In this code, the most parallelizable region is the main for loop which actually performs the Monte Carlo simulation. The main loop can be broken up into two for loops. One loop that will do the operation and store a 1 or 0 in an array depending on if the point is in the circle or not, like in the CUDA version. The second loop will become a separate kernel that will reduce the output array of the first kernel into 1 integer on the host. The OpenACC pragma should look something like this:

#pragma acc parallel copy(randomnums[0:200000]) copyout(output[0:100000]) create(x,y,z)

Here,”acc” tells the compiler that this is an OpenACC pragma. Next, the parallel construct is used to launch multiple threads on the GPU. Next, the copy(randomnums[0:200000]) directive copies the randomnums array onto the device. The copyout(output[0:100000]) directive copies the array onto the device then back off onto the host once the kernel is finished. Finally the create(x,y,z) directive  created the three variables, x,y,z, on the device. The parallel region looks something like this:

        #pragma acc parallel copy(randomnums[0:200000]) copyout(output[0:100000]) create(x,y,z)
        {
                for (i = 0; i<100000; i++)
                {
                        int temp = i+i;
                        x = randomnums[temp];
                        y = randomnums[temp+1];
                        z = ((x*x)+(y*y));
                        output[i] = z;
                }
                for (i = 0; i<100000; i++)
                {
                        if (output[i]<=1)
                        {
                                ++count;
                        }
                        else
                                continue;
                }
        }

The complete code looks like this:

accpi.cpp

#include <stdio.h>
#include <time.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
        double niter = 100000;
        double x,y;
        int count=0;
        double z;
        double pi;
        int a = 0;
        time_t t;
        double randomnums[200000];
        double output[100000];
        srand48(((unsigned)time(&t)));
        for (a = 0; a<=2*niter; a++)
                randomnums[a] = (double)drand48();
        int i = 0;
        #pragma acc parallel copy(randomnums[0:200000]) copyout(output[0:100000]) create(x,y,z)
        {
                for (i = 0; i<100000; i++)
                {
                        int temp = i+i;
                        x = randomnums[temp];
                        y = randomnums[temp+1];
                        z = ((x*x)+(y*y));
                        output[i] = z;
                }
                for (i = 0; i<100000; i++)
                {
                        if (output[i]<=1)
                        {
                                ++count;
                        }
                        else
                                continue;
                }
        }
        pi = ((double)count/(double)niter)*4.0;
        printf("Pi: %f\n", pi);
}

To compile this the cudatoolkit needs to be loaded. To load it run module load cudatoolkit. Next, run

$ CC -acc -Minfo=all -ta=nvidia:5.0,cc3x accpi.cpp -o a.out

to compile the code into an executable. The -Minfo=all switch will tell pgicc to print out various information regarding how its compiling the OpenACC code. This can be a very useful tool for debugging OpenACC code. The -ta=nvidia:5.0,cc3x switch tells what version of the CUDA to use and what compute architecture to use as the target. Some sample output for the -Minfo=all switch is:

main:
21, Generating copy(randomnums[0:])
Generating copyout(output[0:])
Generating create(x)
Generating create(y)
Generating create(z)
Accelerator kernel generated
22, #pragma acc loop vector(256) /* threadIdx.x */
30, #pragma acc loop vector(256) /* threadIdx.x */
21, Generating NVIDIA code
Generating compute capability 3.0 binary
22, Loop is parallelizable
30, Loop is parallelizable
34, Sum reduction generated for count

To compile this code with the Cray compiler, load the PrgEnv-cray module as well as the craype-accel-nvidia35 module

$ module swap PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35

To compile to an exectuable, run:

$ CC -hpragma=acc accpi.cpp -o a.out
Running:

Then copy the executable (in this case, a.out) to the scratch directory. Like before, this can be run either from an interactive job or it can be submitted to the queue to be run. To run this executable job inside an interactive job, run:

$ qsub -I -A [PROJID] -l walltime=1:00,nodes=1

Before you run this, change [PROJID] to the appropriate project ID. Once the job starts, change to the scratch directory and execute the command

$ aprun -n1 ./a.out

Alternately, this can be submitted as a batch job to the queues. To do this, a batch script needs to be written and submitted. A sample batch script could look like this:

#!/bin/bash
#    Begin PBS directives
#PBS -A [PROJID]
#PBS -N pi
#PBS -j oe
#PBS -l walltime=0:01:00,nodes=1
#PBS -V
#    End PBS directives and begin shell commands

cd $PBS_O_WORKDIR
date
aprun -n 1 -N 1 ./a.out

Again, before this script can be submitted, the appropriate project ID needs to be added instead of [PROJID]. Save this file as something unique, like titan.pbs, and submit it by running:

$ qsub titan.pbs

Thrust

Finally, this operation can be accelerated with the unique Thrust library. Thrust is a template library that allows programmers to write parallel code in the same way that they would use the C++ Standard Library for serial code. Because of this, the Thrust parallelized code is structured quite differently. For instance, the Monte Carlo kernel from the previous CUDA example is replaced by an equivalent function inside a struct.

struct montecarlo :
       public thrust::unary_function<unsigned int, float>
{
    //use the __host__ __device__ decorators so that
    //thrust knows that this is a gpu function.
    //think of operator() as similar to a CUDA kernel
	__host__ __device__
	float operator()(unsigned int thread_id)
	{
                unsigned int seed;
                //set the seed
                seed = 49868^thread_id;
                int i;
                float x,y,z,sum=0.0;

                //define the random number generator
                thrust::default_random_engine rng(seed);
                thrust::uniform_real_distribution u01(0,1);
                for (i = 0; i < 1000; i++)
                {
                        //get the random numbers
                        x = u01(rng);
                        y = u01(rng);
                        z = ((x*x)+(y*y));
                        if (z<=1)
                                sum += 1;
                }
                return sum;
        }
};

First, the montecarlo struct is defined and given a Thrust function template to adhere to.

struct montecarlo : public thrust::unary_function

This Thrust template says that the function in the struct will be a standard thrust unary function with an unsigned int as an input (this is the thread number. This is handled by Thrust) and will return a floating point number. Next, the programmer defines the operator function

__host__ __device__
float operator()(unsigned int thread_id)

and gives it the __host__ and __device__ decorator so that the compiler knows that it is a GPU function. The next thing that the programmer should take note of is the use of the Thrust random number generator.

thrust::default_random_engine rng(seed);
thrust::uniform_real_distribution u01(0,1);

Since it is not possible to call host functions from a GPU kernel, this step is necessary. By using Thrust functions like this, it allows the programmer to more easily and quickly accelerate the serial code without having to use more complex and specialized APIs, like CURAND.

Finally the way this kernel can be launched is by the use of another Thrust function. This transform_reduce() function is defined as:

OutputType thrust::transform_reduce(InputIterator first,
                                    InputIterator last,
                                    UnaryFunction unary_op,
                                    OutputType init,
                                    BinaryFunction binary_op)

Here, the thrust::transform_reduce() function is used because it will run every number between first and last through our predefined functor (unary_op) and then it will reduce the resulting transformed answers into one final float according to binary_op with the initial value of init. first and last need to be defined as thrust::counting_iterator(first) and thrust::counting_iterator(last) The full function will look like this.

count = thrust::transform_reduce(thrust::counting_iterator(0),
                                 thrust::counting_iterator(niter),
                                 montecarlo(),
                                 0.0,
                                 thrust::plus());

So with that in mind, the full code can look something like this:

thrust.cu

#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <thrust/generate.h>
#include <thrust/random.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>
#include <thrust/transform_reduce.h>
#include <cmath>

struct montecarlo :
       public thrust::unary_function<unsigned int, float>
{
    //use the __host__ __device__ decorators so that
    //thrust knows that this is a gpu function.
    //think of operator() as similar to a CUDA kernel
	__host__ __device__
	float operator()(unsigned int thread_id)
	{
                unsigned int seed;
                //set the seed
                seed = 49868^thread_id;
                int i;
                float x,y,z,sum=0.0;

                //define the random number generator
                thrust::default_random_engine rng(seed);
                thrust::uniform_real_distribution u01(0,1);
                for (i = 0; i < 1000; i++)
                {
                        //get the random numbers
                        x = u01(rng);
                        y = u01(rng);
                        z = ((x*x)+(y*y));
                        if (z<=1)
                                sum += 1;
                }
                return sum;
        }
};

int main(int argc, char* argv[])
{
        float pi;
        float count = 0.0;
        int niter = 1000;
        count = thrust::transform_reduce(thrust::counting_iterator(0),
                                     thrust::counting_iterator(niter),
                                     montecarlo(),
                                     0.0,
                                     thrust::plus());
        pi = (count/(niter*niter))*4.0;
        printf("Pi: %f\n", pi);
}

To compile this code, the cudatoolkit needs to be loaded

$ module load cudatoolkit

Once the code is ready to be compiled, it should be compiled using NVIDIA’s nvcc compiler like so:

$ nvcc -arch=sm_35 thrust.cu -o a.out

It can be run either by submitting it in a batch script or by running it from an interactive job like described in the previous examples.

The code in this tutorial along with a makefile can be found at https://github.com/olcf/Accelerating-Serial-Code

If you have any questions about these examples or how to run them please email help@nccs.gov.