This sample shows a minimal conversion of our vector addition CPU code to a version using HMPP accelerator directives; consider it an HMPP ‘Hello World’. Modifications from the CPU version are highlighted and briefly discussed. Please direct any questions or comments to help@nccs.gov

This tutorial covers the CAPS HMPP accelerator directives. If you are interested in CAPS OpenACC support, please see: OpenACC Vector Addition

HMPP allows code to be offloaded onto the GPU using two different methods, both of which are covered. The codelet method allows an entire C function or Fortran subroutine to be executed on the GPU. The region method allows a contiguous block of code, not necessarily residing in a function or subroutine, to be executed on the GPU.

```
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#pragma hmpp vecAdd codelet, target=CUDA, args[*].transfer=atcall, args[c].io=out
void vecAdd(int n, double a[n], double b[n], double c[n])
{
    int j;
    for(j=0; j<n; j++) {
        c[j] = a[j] + b[j];
    }
}

int main( int argc, char* argv[] )
{
    // Size of vectors
    int n = 100000;

    // Input vectors
    double *a;
    double *b;
    // Output vector
    double *c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector
    a = (double*)malloc(bytes);
    b = (double*)malloc(bytes);
    c = (double*)malloc(bytes);

    // Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    int i;
    for(i=0; i<n; i++) {
        a[i] = sin(i)*sin(i);
        b[i] = cos(i)*cos(i);
    }

    // Sum component wise and save result into vector c
    #pragma hmpp vecAdd callsite
    vecAdd(n, a, b, c);

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++) {
        sum += c[i];
    }
    sum = sum/n;
    printf("final result: %f\n", sum);

    // Release memory
    free(a);
    free(b);
    free(c);

    return 0;
}
```

```
#pragma hmpp vecAdd codelet, target=CUDA, args[*].transfer=atcall, args[c].io=out
void vecAdd(int n, double a[n], double b[n], double c[n])
{
    int j;
    for(j=0; j<n; j++) {
        c[j] = a[j] + b[j];
    }
}
```

The combined #pragma hmpp directive and C function vecAdd form what is referred to as the codelet. This codelet, given the name vecAdd, will be computed on the GPU when matched with an HMPP callsite. Memory is copied from the CPU to the GPU at the start of the codelet and back from the GPU to the CPU at the end of the codelet. It must be noted that the current compiler, version 2.4.1, does not correctly copy the vector c from the GPU to the host at the end of the codelet call and so we must specify it explicitly with args[c].io=out. This will be explored in more detail later.

```
#pragma hmpp vecAdd callsite
vecAdd(n, a, b, c);
```

The combined #pragma hmpp directive and C function call form what is referred to as the callsite. The callsite will trigger the specified codelet to be run on the GPU.

Before compiling, the required modules must be loaded:

```
$ module load PrgEnv-pgi capsmc cudatoolkit
```

Output:
Compiling with `hmpp cc vecAdd-codelet.c -o vecAdd.out` produces the following:

```
hmpp: [Info] Generated codelet filename is "vecadd_cuda.cu".
hmppcg: [Message DPL3000] vecAdd-codelet.c:9: Loop 'j' was gridified (1D)
```

The compiler tells us that it has created the CUDA file vecadd_cuda.cu for the codelet. The second line tells us that the loop starting on line 9 of vecAdd-codelet.c with induction variable ‘j’ will be parallelized on the GPU and that the kernel will launch with a 1 dimensional grid of thread blocks.

```
$ aprun ./vecAdd.out
final result: 1.000000
```

```
!$hmpp vecAdd codelet, target=CUDA, args[*].transfer=atcall, args[c].io=out
subroutine vecAdd(n, a, b, c)
    implicit none
    integer, intent(in) :: n
    real(8), intent(in) :: a(n), b(n)
    real(8), intent(out) :: c(n)

    integer :: j
    do j=1,n
        c(j) = a(j) + b(j)
    enddo
end subroutine vecAdd

program main

    ! Size of vectors
    integer :: n = 100000

    ! Input vectors
    real(8),dimension(:),allocatable :: a
    real(8),dimension(:),allocatable :: b
    ! Output vector
    real(8),dimension(:),allocatable :: c

    integer :: i
    real(8) :: sum

    ! Allocate memory for each vector
    allocate(a(n))
    allocate(b(n))
    allocate(c(n))

    ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    do i=1,n
        a(i) = sin(i*1D0)*sin(i*1D0)
        b(i) = cos(i*1D0)*cos(i*1D0)
    enddo

    ! Sum component wise and save result into vector c
    !$hmpp vecAdd callsite
    call vecAdd(n, a, b, c)

    ! Sum up vector c and print result divided by n, this should equal 1 within error
    sum = 0.0d0
    do i=1,n
        sum = sum + c(i)
    enddo
    sum = sum/n
    print *, 'final result: ', sum

    ! Release memory
    deallocate(a)
    deallocate(b)
    deallocate(c)

end program main
```

```
!$hmpp vecAdd codelet, target=CUDA, args[*].transfer=atcall, args[c].io=out
subroutine vecAdd(n, a, b, c)
    implicit none
    integer, intent(in) :: n
    real(8), intent(in) :: a(n), b(n)
    real(8), intent(out) :: c(n)

    integer :: j
    do j=1,n
        c(j) = a(j) + b(j)
    enddo
end subroutine vecAdd
```

The combined !\$hmpp directive and Fortran subroutine vecAdd form what is referred to as the codelet. This codelet, given the name vecAdd, will be computed on the GPU when matched with an HMPP callsite. Memory is copied from the CPU to the GPU at the start of the codelet and back from the GPU to the CPU at the end of the codelet. It must be noted that the current compiler, version 2.4.1, does not correctly copy the vector c from the GPU to the host at the end of the codelet call and so we must specify it explicitly with args[c].io=out. This will be explored in more detail later.

```
!$hmpp vecAdd callsite
call vecAdd(n, a, b, c)
```

The combined !\$hmpp directive and Fortran subroutine call form what is referred to as the callsite. The callsite will trigger the specified codelet to be run on the GPU.

```
$ module load PrgEnv-pgi cudatoolkit capsmc
```

Output:
Compiling with `hmpp ftn vecAdd.f90 -o vecAdd.out` produces the following:

```
hmpp: [Info] Generated codelet filename is "vecadd_cuda.cu".
hmppcg: [Message DPL3000] vecAdd-codelet.f90:9: Loop 'j' was gridified (1D)
```

The compiler tells us that it has created the CUDA file vecadd_cuda.cu for the codelet. The second line tells us that the loop starting on line 9 of vecAdd-codelet.f90 with induction variable ‘j’ will be parallelized on the GPU and that the kernel will launch with a 1 dimensional grid of thread blocks.

Much information is obscured from the programmer, so let’s add the hmpp flag --io-report to see what memory transfers will take place between the GPU and host.

C

```
$ hmpp --io-report cc vecAdd-codelet.c -o vecAdd.out
```

Fortran

```
$ hmpp --io-report ftn vecAdd.f90 -o vecAdd.out
```

Output

```
In GROUP 'vecadd'
Parameter 'n' has intent IN
Parameter 'a' has intent IN
Parameter 'b' has intent IN
Parameter 'c' has intent OUT
```

We see that n, a, and b will be copied into the GPU while c will be copied out.
What if we were to omit the intent for the vector c to be copied back to the host in our codelet declaration?
C

```
#pragma hmpp vecAdd codelet, target=CUDA
```

Fortran

```
!$hmpp vecAdd codelet, target=CUDA
```

Output

```
In GROUP 'vecadd'
Parameter 'n' has intent IN
Parameter 'a' has intent IN
Parameter 'b' has intent IN
Parameter 'c' has intent IN
```

We see that, without the explicit io clause, the compiler does not copy the vector c back to the host. This will cause erroneous results without producing any warning or error message. It is vitally important to always check that memory transfers are correct.

```
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main( int argc, char* argv[] )
{
    // Size of vectors
    int n = 100000;

    // Input vectors
    double *a;
    double *b;
    // Output vector
    double *c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector
    a = (double*)malloc(bytes);
    b = (double*)malloc(bytes);
    c = (double*)malloc(bytes);

    // Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    int i;
    for(i=0; i<n; i++) {
        a[i] = sin(i)*sin(i);
        b[i] = cos(i)*cos(i);
    }

    // Sum component wise and save result into vector c
    #pragma hmpp vecAdd region, target=CUDA, args[*].transfer=atcall
    {
        int j;
        for(j=0; j<n; j++) {
            c[j] = a[j] + b[j];
        }
    }

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++) {
        sum += c[i];
    }
    sum = sum/n;
    printf("final result: %f\n", sum);

    // Release memory
    free(a);
    free(b);
    free(c);

    return 0;
}
```

```
#pragma hmpp vecAdd region, target=CUDA, args[*].transfer=atcall
{
    int j;
    for(j=0; j<n; j++) {
        c[j] = a[j] + b[j];
    }
}
```

The code inside of the hmpp region is computed on the GPU. The region begins with the #pragma hmpp region directive and is enclosed in curly brackets. Memory is copied from the CPU to the GPU at the start of the region and back from the GPU to the CPU at the end of the region.

```
$ hmpp --io-report cc vecAdd-region.c -o vecAdd.out
```

Output:
The compiler will output the following:

```
In GROUP 'vecadd'
Parameter 'n' has intent IN
Parameter 'a' has intent IN
Parameter 'b' has intent IN
Parameter 'c' has intent INOUT
```

We see that n, a, b, and c will be copied into the GPU while c will be copied out. This produces the correct output, although copying the contents of c to the GPU is unnecessary extra work. Memory management will be looked at further in the Game of Life tutorial.

```
hmpp: [Info] Generated codelet filename is "vecadd_cuda.cu".
hmppcg: [Message DPL3000] vecAdd-region.c:37: Loop 'j' was gridified (1D)
```

The compiler tells us that it has created the CUDA file vecadd_cuda.cu for the region. The second line tells us that the loop starting on line 37 with induction variable ‘j’ will be parallelized on the GPU and that the kernel will launch with a 1 dimensional grid of thread blocks.

The only modifications from the CPU code are the region directives around the vector addition loop, discussed below.

```
program main

    ! Size of vectors
    integer :: n = 100000

    ! Input vectors
    real(8),dimension(:),allocatable :: a
    real(8),dimension(:),allocatable :: b
    ! Output vector
    real(8),dimension(:),allocatable :: c

    integer :: i
    real(8) :: sum

    ! Allocate memory for each vector
    allocate(a(n))
    allocate(b(n))
    allocate(c(n))

    ! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
    do i=1,n
        a(i) = sin(i*1D0)*sin(i*1D0)
        b(i) = cos(i*1D0)*cos(i*1D0)
    enddo

    ! Sum component wise and save result into vector c
    !$hmpp vecAdd region, target=CUDA, args[*].transfer=atcall
    do i=1,n
        c(i) = a(i) + b(i)
    enddo
    !$hmpp endregion

    ! Sum up vector c and print result divided by n, this should equal 1 within error
    sum = 0.0d0
    do i=1,n
        sum = sum + c(i)
    enddo
    sum = sum/n
    print *, 'final result: ', sum

    ! Release memory
    deallocate(a)
    deallocate(b)
    deallocate(c)

end program main
```

```
!$hmpp vecAdd region, target=CUDA, args[*].transfer=atcall
do i=1,n
    c(i) = a(i) + b(i)
enddo
!$hmpp endregion
```

The code inside of the hmpp region is computed on the GPU. The region begins with the !\$hmpp region directive and ends with the !\$hmpp endregion directive. Memory is copied from the CPU to the GPU at the start of the region and back from the GPU to the CPU at the end of the region.

```
$ hmpp --io-report ftn vecAdd.f90 -o vecAdd.out
```

Output
The compiler will output the following:

```
In GROUP 'vecadd'
Parameter 'n' has intent IN
Parameter 'n_1' has intent IN
Parameter 'n_2' has intent IN
Parameter 'a' has intent IN
Parameter 'n_4' has intent IN
Parameter 'n_5' has intent IN
Parameter 'b' has intent IN
Parameter 'n_7' has intent IN
Parameter 'n_8' has intent IN
Parameter 'c' has intent INOUT
Parameter 'i' has intent INOUT
```

The current HMPP compiler doesn’t do too well with the Fortran version, copying in several variables that are not used. For now we will need to ignore these erroneous variable copies and their associated warning messages. We do see that n, a, b, and c will be copied into the GPU while c will be copied out. This will produce the correct output although the GPU is doing the extra work of copying over the content of c to the GPU when it is unnecessary. Memory management will be looked at further in the Game of Life tutorial.

```
hmpp: [Info] Generated codelet filename is "vecadd_cuda.cu".
hmppcg: [Message DPL3000] vecAdd-region.f90:29: Loop 'i' was gridified (1D)
```

The compiler tells us that it has created the CUDA file vecadd_cuda.cu for the region. The second line tells us that the loop starting on line 29 with induction variable ‘i’ will be parallelized on the GPU and that the kernel will launch with a 1 dimensional grid of thread blocks.