
HMPP Vector Addition

This tutorial covers the CAPS HMPP accelerator directives. If you are interested in CAPS OpenACC support, please see: OpenACC Vector Addition

Introduction

This sample shows a minimal conversion from our vector addition CPU code to an HMPP accelerator directives version; consider it an HMPP ‘Hello World’. Modifications from the CPU version will be highlighted and briefly discussed. Please direct any questions or comments to help@nccs.gov.

HMPP allows code to be offloaded onto the GPU using two different methods, both of which are covered. The codelet method allows an entire C function or Fortran subroutine to be executed on the GPU. The region method allows a contiguous block of code, not necessarily residing in a function or subroutine, to be executed on the GPU.

vecAdd-codelet.c

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#pragma hmpp vecAdd codelet, target=CUDA, args[*].transfer=atcall, args[ c ].io=out 
void vecAdd(int n, double a[n], double b[n], double c[n])
{
    int j;
    for(j=0; j<n; j++) {
        c[j] = a[j] + b[j];
    }
}

int main( int argc, char* argv[] )
{
    // Size of vectors
    int n = 100000;

    // Input vectors
    double *a;
    double *b;
    // Output vector
    double *c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector
    a = (double*)malloc(bytes);
    b = (double*)malloc(bytes);
    c = (double*)malloc(bytes);

    // Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    int i;
    for(i=0; i<n; i++) {
        a[i] = sin(i)*sin(i);
        b[i] = cos(i)*cos(i);
    }

    // Sum component wise and save result into vector c
    #pragma hmpp vecAdd callsite
    vecAdd(n, a, b, c);

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++) {
        sum += c[i];
    }
    sum = sum/n;
    printf("final result: %f\n", sum);

    // Release memory
    free(a);
    free(b);
    free(c);

    return 0;
}

Changes:

#pragma hmpp vecAdd codelet, target=CUDA, args[*].transfer=atcall, args[ c ].io=out 
void vecAdd(int n, double a[n], double b[n], double c[n])
{
    int j;
    for(j=0; j<n; j++) {
        c[j] = a[j] + b[j];
    }
}

The combined #pragma hmpp directive and C function vecAdd form what is referred to as the codelet. This codelet, given the name vecAdd, will be computed on the GPU when matched with an HMPP callsite. Memory is copied from the CPU to the GPU at the start of the codelet and back from the GPU to the CPU at the end of the codelet. It must be noted that the current compiler, version 2.4.1, does not correctly copy the vector c from the GPU to the host at the end of the codelet call and so we must specify it explicitly with args[c].io=out. This will be explored in more detail later.

#pragma hmpp vecAdd callsite
vecAdd(n, a, b, c);

The combined #pragma hmpp directive and C function call form what is referred to as the callsite. The callsite will trigger the specified codelet to be run on the GPU.

Compiling:

Before compiling, the required modules must be loaded:

$ module load PrgEnv-pgi capsmc cudatoolkit
$ hmpp cc vecAdd-codelet.c -o vecAdd.out

Output:
The compiler will output the following:

hmpp: [Info] Generated codelet filename is "vecadd_cuda.cu".
hmppcg: [Message DPL3000] vecAdd-codelet.c:9: Loop 'j' was gridified (1D)

The compiler tells us that it has created the CUDA file vecadd_cuda.cu for the codelet. The second line tells us that the loop starting on line 9 with induction variable ‘j’ will be parallelized on the GPU and that the kernel will launch with a 1-dimensional grid of thread blocks.

Running:

$ aprun ./vecAdd.out
final result: 1.000000

vecAdd-codelet.f90

!$hmpp vecAdd codelet, target=CUDA, args[*].transfer=atcall, args[ c ].io=out
subroutine vecAdd(n, a, b, c)
    implicit none
    integer, intent(in) :: n
    real(8), intent(in) :: a(n), b(n)
    real(8), intent(out) :: c(n)

    integer :: j
    do j=1,n
        c(j) = a(j) + b(j)
    enddo
end subroutine vecAdd

program main

    ! Size of vectors
    integer :: n = 100000

    ! Input vectors
    real(8),dimension(:),allocatable :: a
    real(8),dimension(:),allocatable :: b
    ! Output vector
    real(8),dimension(:),allocatable :: c

    integer :: i
    real(8) :: sum

    ! Allocate memory for each vector
    allocate(a(n))
    allocate(b(n))
    allocate(c(n))

    ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    do i=1,n
        a(i) = sin(i*1D0)*sin(i*1D0)
        b(i) = cos(i*1D0)*cos(i*1D0)
    enddo

    ! Sum component wise and save result into vector c

    !$hmpp vecAdd callsite
    call vecAdd(n, a, b, c)

    ! Sum up vector c and print result divided by n, this should equal 1 within error
    sum = 0d0
    do i=1,n
        sum = sum + c(i)
    enddo
    sum = sum/n
    print *, 'final result: ', sum

    ! Release memory
    deallocate(a)
    deallocate(b)
    deallocate(c)

end program main

Changes:

!$hmpp vecAdd codelet, target=CUDA, args[*].transfer=atcall, args[ c ].io=out
subroutine vecAdd(n, a, b, c)
    implicit none
    integer, intent(in) :: n
    real(8), intent(in) :: a(n), b(n)
    real(8), intent(out) :: c(n)

    integer :: j
    do j=1,n
        c(j) = a(j) + b(j)
    enddo
end subroutine vecAdd

The combined !$hmpp directive and Fortran subroutine vecAdd form what is referred to as the codelet. This codelet, given the name vecAdd, will be computed on the GPU when matched with an HMPP callsite. Memory is copied from the CPU to the GPU at the start of the codelet and back from the GPU to the CPU at the end of the codelet. It must be noted that the current compiler, version 2.4.1, does not correctly copy the vector c from the GPU to the host at the end of the codelet call and so we must specify it explicitly with args[c].io=out. This will be explored in more detail later.

!$hmpp vecAdd callsite
call vecAdd(n, a, b, c)

The combined !$hmpp directive and Fortran subroutine call form what is referred to as the callsite. The callsite will trigger the specified codelet to be run on the GPU.

Compiling:

$ module load PrgEnv-pgi cudatoolkit capsmc
$ hmpp ftn vecAdd-codelet.f90 -o vecAdd.out

Output:
The compiler will output the following:

hmpp: [Info] Generated codelet filename is "vecadd_cuda.cu".
hmppcg: [Message DPL3000] vecAdd-codelet.f90:9: Loop 'j' was gridified (1D)

The compiler tells us that it has created the CUDA file vecadd_cuda.cu for the codelet. The second line tells us that the loop starting on line 9 with induction variable ‘j’ will be parallelized on the GPU and that the kernel will launch with a 1-dimensional grid of thread blocks.


Additional Information

Much information is obscured from the programmer, so let’s add the --io-report hmpp flag to see what memory transfers will take place between the GPU and host.

C

$ hmpp --io-report cc vecAdd-codelet.c -o vecAdd.out

Fortran

$ hmpp --io-report ftn vecAdd-codelet.f90 -o vecAdd.out

Output

In GROUP 'vecadd'
 CODELET 'vecadd' at vecAdd-codelet.c:5, function 'vecAdd'
    Parameter 'n' has intent IN
    Parameter 'a' has intent IN
    Parameter 'b' has intent IN
    Parameter 'c' has intent OUT

We see that n, a, and b will be copied into the GPU while c will be copied out.
What if we were to omit the intent for the vector c to be copied back to the host in our codelet declaration?
C

#pragma hmpp vecAdd codelet, target=CUDA

Fortran

!$hmpp vecAdd codelet, target=CUDA

Output

In GROUP 'vecadd'
 CODELET 'vecadd' at vecAdd-codelet.c:5, function 'vecAdd'
    Parameter 'n' has intent IN
    Parameter 'a' has intent IN
    Parameter 'b' has intent IN
    Parameter 'c' has intent IN

We see that the compiler does not correctly copy the vector c back to the host. This will cause erroneous results that do not produce any warning or error message. It is vitally important to always check that memory transfers are correct.


vecAdd-region.c

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main( int argc, char* argv[] )
{

    // Size of vectors
    int n = 100000;

    // Input vectors
    double *a;
    double *b;
    // Output vector
    double *c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector
    a = (double*)malloc(bytes);
    b = (double*)malloc(bytes);
    c = (double*)malloc(bytes);
    
    // Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    int i;
    for(i=0; i<n; i++) {
        a[i] = sin(i)*sin(i);
        b[i] = cos(i)*cos(i);
    }

    // Sum component wise and save result into vector c
    #pragma hmpp vecAdd region, target=CUDA,  args[*].transfer=atcall
    {
        int j;
        for(j=0; j<n; j++) {
            c[j] = a[j] + b[j];
        }
    }

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++) {
        sum += c[i];
    }
    sum = sum/n;
    printf("final result: %f\n", sum);

    // Release memory
    free(a);
    free(b);
    free(c);

    return 0;
}

Changes:

#pragma hmpp vecAdd region, target=CUDA, args[*].transfer=atcall
{
    int j;
    for(j=0; j<n; j++) {
        c[j] = a[j] + b[j];
    }
}

The code inside of the hmpp region is computed on the GPU. The region begins with the #pragma hmpp region directive and is enclosed in curly brackets. Memory is copied from the CPU to the GPU at the start of the region and back from the GPU to the CPU at the end of the region.

Compiling:

$ hmpp --io-report cc vecAdd-region.c -o vecAdd.out

Output:
The compiler will output the following:

In GROUP 'vecadd'
 REGION 'vecadd' at vecAdd-region.c:34, function '__hmpp_region__vecadd'
    Parameter  'n' has intent IN
    Parameter  'a' has intent IN
    Parameter  'b' has intent IN
    Parameter  'c' has intent INOUT

We see that n, a, b, and c will be copied to the GPU, while c will also be copied back out. This produces the correct output, although copying the contents of c to the GPU is unnecessary extra work. Memory management will be looked at further in the Game of Life tutorial.
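If only the copy-out of c is actually needed, the args[...].io clause used with the codelet can in principle be attached to the region as well to trim the unneeded copy-in; treat this as a hedged sketch only (clause behavior for regions is explored properly in the Game of Life tutorial):

```c
#pragma hmpp vecAdd region, target=CUDA, args[c].io=out
```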

hmpp: [Info] Generated codelet filename is "vecadd_cuda.cu".
hmppcg: [Message DPL3000] vecAdd-region.c:37: Loop 'j' was gridified (1D)

The compiler tells us that it has created the CUDA file vecadd_cuda.cu for the region. The second line tells us that the loop starting on line 37 with induction variable ‘j’ will be parallelized on the GPU and that the kernel will launch with a 1 dimensional grid of thread blocks.


vecAdd-region.f90


program main

    ! Size of vectors
    integer :: n = 100000

    ! Input vectors
    real(8),dimension(:),allocatable :: a
    real(8),dimension(:),allocatable :: b
    !Output vector
    real(8),dimension(:),allocatable :: c

    integer :: i
    real(8) :: sum

    ! Allocate memory for each vector
    allocate(a(n))
    allocate(b(n))
    allocate(c(n))

    ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    do i=1,n
        a(i) = sin(i*1D0)*sin(i*1D0)
        b(i) = cos(i*1D0)*cos(i*1D0)
    enddo

    ! Sum component wise and save result into vector c

    !$hmpp vecAdd region, target=CUDA, args[*].transfer=atcall
    do i=1,n
        c(i) = a(i) + b(i)
    enddo
    !$hmpp vecAdd endregion

    ! Sum up vector c and print result divided by n, this should equal 1 within error
    sum = 0d0
    do i=1,n
        sum = sum + c(i)
    enddo
    sum = sum/n
    print *, 'final result: ', sum

    ! Release memory
    deallocate(a)
    deallocate(b)
    deallocate(c)

end program

Changes:

!$hmpp vecAdd region, target=CUDA,  args[*].transfer=atcall
do i=1,n
    c(i) = a(i) + b(i)
enddo
!$hmpp vecAdd endregion

The code inside of the hmpp region is computed on the GPU. The region begins with the !$hmpp region directive and ends with the !$hmpp endregion directive. Memory is copied from the CPU to the GPU at the start of the region and back from the GPU to the CPU at the end of the region.

Compile:

$ hmpp --io-report ftn vecAdd-region.f90 -o vecAdd.out

Output:

The compiler will output the following:

In GROUP 'vecadd'
 REGION 'vecadd' at vecAdd-region.f90:28, function 'hmpp_region__vecadd'
    Parameter  'n' has intent IN
    Parameter  'n_1' has intent IN
    Parameter  'n_2' has intent IN
    Parameter  'a' has intent IN
    Parameter  'n_4' has intent IN
    Parameter  'n_5' has intent IN
    Parameter  'b' has intent IN
    Parameter  'n_7' has intent IN
    Parameter  'n_8' has intent IN
    Parameter  'c' has intent INOUT
    Parameter  'i' has intent INOUT

The current HMPP compiler does not handle the Fortran version as well, copying in several generated variables that are not used. For now we will need to ignore these erroneous variable copies and their associated warning messages. We do see that n, a, b, and c will be copied to the GPU, while c will also be copied back out. This produces the correct output, although copying the contents of c to the GPU is unnecessary extra work. Memory management will be looked at further in the Game of Life tutorial.

hmpp: [Info] Generated codelet filename is "vecadd_cuda.cu".
hmppcg: [Message DPL3000] vecAdd-region.f90:29: Loop 'i' was gridified (1D)

The compiler tells us that it has created the CUDA file vecadd_cuda.cu for the region. The second line tells us that the loop starting on line 29 with induction variable ‘i’ will be parallelized on the GPU and that the kernel will launch with a 1 dimensional grid of thread blocks.