This sample shows a minimal conversion of our vector addition CPU code to a version using HMPP accelerator directives; consider it an HMPP ‘Hello World’. Modifications from the CPU version are highlighted and briefly discussed. Please direct any questions or comments to help@nccs.gov

This tutorial covers the CAPS HMPP accelerator directives. If you are interested in CAPS OpenACC support, please see: OpenACC Vector Addition

HMPP allows code to be offloaded onto the GPU using two different methods, both of which are covered. The codelet method allows an entire C function or Fortran subroutine to be executed on the GPU. The region method allows a contiguous block of code, not necessarily residing in a function or subroutine, to be executed on the GPU.

```
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#pragma hmpp vecAdd codelet, target=CUDA, args[*].transfer=atcall, args[c].io=out
void vecAdd(int n, double a[n], double b[n], double c[n])
{
    int j;
    for(j=0; j<n; j++) {
        c[j] = a[j] + b[j];
    }
}

int main( int argc, char* argv[] )
{
    // Size of vectors
    int n = 100000;

    // Input vectors
    double *a;
    double *b;
    // Output vector
    double *c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector
    a = (double*)malloc(bytes);
    b = (double*)malloc(bytes);
    c = (double*)malloc(bytes);

    // Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    int i;
    for(i=0; i<n; i++) {
        a[i] = sin(i)*sin(i);
        b[i] = cos(i)*cos(i);
    }

    // Sum component wise and save result into vector c
    #pragma hmpp vecAdd callsite
    vecAdd(n, a, b, c);

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++) {
        sum += c[i];
    }
    sum = sum/n;
    printf("final result: %f\n", sum);

    // Release memory
    free(a);
    free(b);
    free(c);

    return 0;
}
```

```
#pragma hmpp vecAdd codelet, target=CUDA, args[*].transfer=atcall, args[c].io=out
void vecAdd(int n, double a[n], double b[n], double c[n])
{
    int j;
    for(j=0; j<n; j++) {
        c[j] = a[j] + b[j];
    }
}
```

The combined #pragma hmpp directive and C function vecAdd form what is referred to as the codelet. This codelet, given the name vecAdd, will be computed on the GPU when matched with an HMPP callsite. Memory is copied from the CPU to the GPU at the start of the codelet and back from the GPU to the CPU at the end of the codelet. It must be noted that the current compiler, version 2.4.1, does not correctly copy the vector c from the GPU to the host at the end of the codelet call and so we must specify it explicitly with args[c].io=out. This will be explored in more detail later.

```
#pragma hmpp vecAdd callsite
vecAdd(n, a, b, c);
```

The combined #pragma hmpp directive and C function call form what is referred to as the callsite. The callsite will trigger the specified codelet to be run on the GPU.

Before compiling, the required modules must be loaded:

```
$ module load PrgEnv-pgi capsmc cudatoolkit
```

Output:
Compiling with `hmpp cc vecAdd-codelet.c -o vecAdd.out` produces the following:

```
hmpp: [Info] Generated codelet filename is "vecadd_cuda.cu".
hmppcg: [Message DPL3000] vecAdd-codelet.c:9: Loop 'j' was gridified (1D)
```

The compiler tells us that it has created the CUDA file vecadd_cuda.cu for the codelet. The second line tells us that the loop starting on line 9 of vecAdd-codelet.c with induction variable ‘j’ will be parallelized on the GPU and that the kernel will launch with a 1 dimensional grid of thread blocks.

```
$ aprun ./vecAdd.out
final result: 1.000000
```

```
!$hmpp vecAdd codelet, target=CUDA, args[*].transfer=atcall, args[c].io=out
subroutine vecAdd(n, a, b, c)
    implicit none
    integer, intent(in) :: n
    real(8), intent(in) :: a(n), b(n)
    real(8), intent(out) :: c(n)

    integer :: j
    do j=1,n
        c(j) = a(j) + b(j)
    enddo
end subroutine vecAdd

program main

    ! Size of vectors
    integer :: n = 100000

    ! Input vectors
    real(8),dimension(:),allocatable :: a
    real(8),dimension(:),allocatable :: b
    ! Output vector
    real(8),dimension(:),allocatable :: c

    integer :: i
    real(8) :: sum

    ! Allocate memory for each vector
    allocate(a(n))
    allocate(b(n))
    allocate(c(n))

    ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    do i=1,n
        a(i) = sin(i*1D0)*sin(i*1D0)
        b(i) = cos(i*1D0)*cos(i*1D0)
    enddo

    ! Sum component wise and save result into vector c
    !$hmpp vecAdd callsite
    call vecAdd(n, a, b, c)

    ! Sum up vector c and print result divided by n, this should equal 1 within error
    sum = 0.0d0
    do i=1,n
        sum = sum + c(i)
    enddo
    sum = sum/n
    print *, 'final result: ', sum

    ! Release memory
    deallocate(a)
    deallocate(b)
    deallocate(c)

end program main
```

```
!$hmpp vecAdd codelet, target=CUDA, args[*].transfer=atcall, args[c].io=out
subroutine vecAdd(n, a, b, c)
    implicit none
    integer, intent(in) :: n
    real(8), intent(in) :: a(n), b(n)
    real(8), intent(out) :: c(n)

    integer :: j
    do j=1,n
        c(j) = a(j) + b(j)
    enddo
end subroutine vecAdd
```

The combined !\$hmpp directive and Fortran subroutine vecAdd form what is referred to as the codelet. This codelet, given the name vecAdd, will be computed on the GPU when matched with an HMPP callsite. Memory is copied from the CPU to the GPU at the start of the codelet and back from the GPU to the CPU at the end of the codelet. It must be noted that the current compiler, version 2.4.1, does not correctly copy the vector c from the GPU to the host at the end of the codelet call and so we must specify it explicitly with args[c].io=out. This will be explored in more detail later.

```
!$hmpp vecAdd callsite
call vecAdd(n, a, b, c)
```

The combined !\$hmpp directive and Fortran subroutine call form what is referred to as the callsite. The callsite will trigger the specified codelet to be run on the GPU.

```
$ module load PrgEnv-pgi cudatoolkit capsmc
```

Output:
Compiling with `hmpp ftn vecAdd.f90 -o vecAdd.out` produces the following:

```
hmpp: [Info] Generated codelet filename is "vecadd_cuda.cu".
hmppcg: [Message DPL3000] vecAdd-codelet.f90:9: Loop 'j' was gridified (1D)
```

The compiler tells us that it has created the CUDA file vecadd_cuda.cu for the codelet. The second line tells us that the loop starting on line 9 of vecAdd-codelet.f90 with induction variable ‘j’ will be parallelized on the GPU and that the kernel will launch with a 1 dimensional grid of thread blocks.

Much information is obscured from the programmer, so let’s add the hmpp flag --io-report to see what memory transfers will take place between the GPU and host.

C

```
$ hmpp --io-report cc vecAdd-codelet.c -o vecAdd.out
```

Fortran

```
$ hmpp --io-report ftn vecAdd.f90 -o vecAdd.out
```

Output

```
In GROUP 'vecadd'
Parameter 'n' has intent IN
Parameter 'a' has intent IN
Parameter 'b' has intent IN
Parameter 'c' has intent OUT
```

We see that n, a, and b will be copied into the GPU while c will be copied out.
What if we were to omit the intent for the vector c to be copied back to the host in our codelet declaration?
C

```
#pragma hmpp vecAdd codelet, target=CUDA
```

Fortran

```
!$hmpp vecAdd codelet, target=CUDA
```

Output

```
In GROUP 'vecadd'
Parameter 'n' has intent IN
Parameter 'a' has intent IN
Parameter 'b' has intent IN
Parameter 'c' has intent IN
```

We see that, without the explicit io clause, the compiler does not copy the vector c back to the host. This will cause erroneous results without producing any warning or error message. It is vitally important to always check that memory transfers are correct.

```
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main( int argc, char* argv[] )
{
    // Size of vectors
    int n = 100000;

    // Input vectors
    double *a;
    double *b;
    // Output vector
    double *c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector
    a = (double*)malloc(bytes);
    b = (double*)malloc(bytes);
    c = (double*)malloc(bytes);

    // Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    int i;
    for(i=0; i<n; i++) {
        a[i] = sin(i)*sin(i);
        b[i] = cos(i)*cos(i);
    }

    // Sum component wise and save result into vector c
    #pragma hmpp vecAdd region, target=CUDA, args[*].transfer=atcall
    {
        int j;
        for(j=0; j<n; j++) {
            c[j] = a[j] + b[j];
        }
    }

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++) {
        sum += c[i];
    }
    sum = sum/n;
    printf("final result: %f\n", sum);

    // Release memory
    free(a);
    free(b);
    free(c);

    return 0;
}
```

```
#pragma hmpp vecAdd region, target=CUDA, args[*].transfer=atcall
{
    int j;
    for(j=0; j<n; j++) {
        c[j] = a[j] + b[j];
    }
}
```

The code inside of the hmpp region is computed on the GPU. The region begins with the #pragma hmpp region directive and is enclosed in curly brackets. Memory is copied from the CPU to the GPU at the start of the region and back from the GPU to the CPU at the end of the region.

```
$ hmpp --io-report cc vecAdd-region.c -o vecAdd.out
```

Output:
The compiler will output the following:

```
In GROUP 'vecadd'
Parameter 'n' has intent IN
Parameter 'a' has intent IN
Parameter 'b' has intent IN
Parameter 'c' has intent INOUT
```

We see that n, a, b, and c will be copied into the GPU while c will be copied out. This produces the correct output, although copying the contents of c to the GPU is unnecessary extra work. Memory management will be looked at further in the Game of Life tutorial.

```
hmpp: [Info] Generated codelet filename is "vecadd_cuda.cu".
hmppcg: [Message DPL3000] vecAdd-region.c:37: Loop 'j' was gridified (1D)
```

The compiler tells us that it has created the CUDA file vecadd_cuda.cu for the region. The second line tells us that the loop starting on line 37 with induction variable ‘j’ will be parallelized on the GPU and that the kernel will launch with a 1 dimensional grid of thread blocks.

The only modifications from the CPU code are the region directives around the vector addition loop, discussed below.

```
program main

    ! Size of vectors
    integer :: n = 100000

    ! Input vectors
    real(8),dimension(:),allocatable :: a
    real(8),dimension(:),allocatable :: b
    ! Output vector
    real(8),dimension(:),allocatable :: c

    integer :: i
    real(8) :: sum

    ! Allocate memory for each vector
    allocate(a(n))
    allocate(b(n))
    allocate(c(n))

    ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    do i=1,n
        a(i) = sin(i*1D0)*sin(i*1D0)
        b(i) = cos(i*1D0)*cos(i*1D0)
    enddo

    ! Sum component wise and save result into vector c
    !$hmpp vecAdd region, target=CUDA, args[*].transfer=atcall
    do i=1,n
        c(i) = a(i) + b(i)
    enddo
    !$hmpp endregion

    ! Sum up vector c and print result divided by n, this should equal 1 within error
    sum = 0.0d0
    do i=1,n
        sum = sum + c(i)
    enddo
    sum = sum/n
    print *, 'final result: ', sum

    ! Release memory
    deallocate(a)
    deallocate(b)
    deallocate(c)

end program main
```

```
!$hmpp vecAdd region, target=CUDA, args[*].transfer=atcall
do i=1,n
    c(i) = a(i) + b(i)
enddo
!$hmpp endregion
```

The code inside of the hmpp region is computed on the GPU. The region begins with the !\$hmpp region directive and ends with the !\$hmpp endregion directive. Memory is copied from the CPU to the GPU at the start of the region and back from the GPU to the CPU at the end of the region.

```
$ hmpp --io-report ftn vecAdd.f90 -o vecAdd.out
```

Output
The compiler will output the following:

```
In GROUP 'vecadd'
Parameter 'n' has intent IN
Parameter 'n_1' has intent IN
Parameter 'n_2' has intent IN
Parameter 'a' has intent IN
Parameter 'n_4' has intent IN
Parameter 'n_5' has intent IN
Parameter 'b' has intent IN
Parameter 'n_7' has intent IN
Parameter 'n_8' has intent IN
Parameter 'c' has intent INOUT
Parameter 'i' has intent INOUT
```

The current HMPP compiler doesn’t do too well with the Fortran version, copying in several variables that are not used. For now we will need to ignore these erroneous variable copies and their associated warning messages. We do see that n, a, b, and c will be copied into the GPU while c will be copied out. This will produce the correct output although the GPU is doing the extra work of copying over the content of c to the GPU when it is unnecessary. Memory management will be looked at further in the Game of Life tutorial.

```
hmpp: [Info] Generated codelet filename is "vecadd_cuda.cu".
hmppcg: [Message DPL3000] vecAdd-region.f90:29: Loop 'i' was gridified (1D)
```

The compiler tells us that it has created the CUDA file vecadd_cuda.cu for the region. The second line tells us that the loop starting on line 29 with induction variable ‘i’ will be parallelized on the GPU and that the kernel will launch with a 1 dimensional grid of thread blocks.