This sample shows the Game of Life CPU code converted to an HMPP accelerator directive version. Both **codelet** and **region** methods are presented, in C and Fortran. A second version is then presented that fixes a performance issue incurred in the initial version. Please direct any questions or comments to help@nccs.gov.

## GOL-codelet.c

```c
#include <stdio.h>
#include <stdlib.h>

#define SRAND_VALUE 1985

#pragma hmpp GOL codelet, target=CUDA, args[grid].io=inout, args[*].transfer=atcall, args[newGrid].io=none
void GOL(int dim, int grid[dim+2][dim+2], int newGrid[dim+2][dim+2], int maxIter)
{
    int i, j, iter;
    // Main game loop
    for (iter = 0; iter<maxIter; iter++) {
        // Left-Right columns
        for (i = 1; i<=dim; i++) {
            grid[i][0] = grid[i][dim];   // Copy first real column to right ghost column
            grid[i][dim+1] = grid[i][1]; // Copy last real column to left ghost column
        }
        // Top-Bottom rows
        for (j = 0; j<=dim+1; j++) {
            grid[0][j] = grid[dim][j];   // Copy first real row to bottom ghost row
            grid[dim+1][j] = grid[1][j]; // Copy last real row to top ghost row
        }

        int numNeighbors;
        // Now we loop over all cells and determine their fate
        for (i = 1; i<=dim; i++) {
            for (j = 1; j<=dim; j++) {
                // Get the number of neighbors for a given grid point
                numNeighbors = grid[i+1][j] + grid[i-1][j]      // upper lower
                             + grid[i][j+1] + grid[i][j-1]      // right left
                             + grid[i+1][j+1] + grid[i-1][j-1]  // diagonals
                             + grid[i-1][j+1] + grid[i+1][j-1];

                // Here we have explicitly all of the game rules
                if (grid[i][j] == 1 && numNeighbors < 2)
                    newGrid[i][j] = 0;
                else if (grid[i][j] == 1 && (numNeighbors == 2 || numNeighbors == 3))
                    newGrid[i][j] = 1;
                else if (grid[i][j] == 1 && numNeighbors > 3)
                    newGrid[i][j] = 0;
                else if (grid[i][j] == 0 && numNeighbors == 3)
                    newGrid[i][j] = 1;
                else
                    newGrid[i][j] = grid[i][j];
            }
        }

        // Can't switch pointers so we manually have to copy the array over
        for (i = 1; i<dim+2; i++) {
            for (j = 1; j<dim+2; j++) {
                grid[i][j] = newGrid[i][j];
            }
        }
    }
}

int main(int argc, char* argv[])
{
    int i, j, iter, numNeighbors;
    int dim = 1024;
    // Number of game iterations
    int maxIter = 1<<15;

    // Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
    int *grid = (int*) malloc( sizeof(int) * (dim+2) * (dim+2) );
    // Allocate newGrid
    int *newGrid = (int*) malloc( sizeof(int) * (dim+2) * (dim+2) );

    // Assign initial population randomly
    srand(SRAND_VALUE);
    for (i = 1; i<=dim; i++) {
        for (j = 1; j<=dim; j++) {
            grid[i*(dim+2)+j] = rand() % 2;
        }
    }

    // Call the GPU to execute the codelet
    #pragma hmpp GOL callsite
    GOL(dim, grid, newGrid, maxIter);

    // Sum up alive cells and print results
    int total = 0;
    for (i = 1; i<=dim; i++) {
        for (j = 1; j<=dim; j++) {
            total += grid[i*(dim+2)+j];
        }
    }
    printf("Total Alive: %d\n", total);

    free(grid);
    free(newGrid);
    return 0;
}
```

**Changes**

Lines 6-58

```c
#pragma hmpp GOL codelet, target=CUDA, args[grid].io=inout, args[*].transfer=atcall, args[newGrid].io=none
void GOL(int dim, int grid[dim+2][dim+2], int newGrid[dim+2][dim+2], int maxIter)
{
    int i, j, iter;
    // Main game loop
    for (iter = 0; iter<maxIter; iter++) {
        // Left-Right columns
        for (i = 1; i<=dim; i++) {
            grid[i][0] = grid[i][dim];   // Copy first real column to right ghost column
            grid[i][dim+1] = grid[i][1]; // Copy last real column to left ghost column
        }
        // Top-Bottom rows
        for (j = 0; j<=dim+1; j++) {
            grid[0][j] = grid[dim][j];   // Copy first real row to bottom ghost row
            grid[dim+1][j] = grid[1][j]; // Copy last real row to top ghost row
        }

        int numNeighbors;
        // Now we loop over all cells and determine their fate
        for (i = 1; i<=dim; i++) {
            for (j = 1; j<=dim; j++) {
                // Get the number of neighbors for a given grid point
                numNeighbors = grid[i+1][j] + grid[i-1][j]      // upper lower
                             + grid[i][j+1] + grid[i][j-1]      // right left
                             + grid[i+1][j+1] + grid[i-1][j-1]  // diagonals
                             + grid[i-1][j+1] + grid[i+1][j-1];

                // Here we have explicitly all of the game rules
                if (grid[i][j] == 1 && numNeighbors < 2)
                    newGrid[i][j] = 0;
                else if (grid[i][j] == 1 && (numNeighbors == 2 || numNeighbors == 3))
                    newGrid[i][j] = 1;
                else if (grid[i][j] == 1 && numNeighbors > 3)
                    newGrid[i][j] = 0;
                else if (grid[i][j] == 0 && numNeighbors == 3)
                    newGrid[i][j] = 1;
                else
                    newGrid[i][j] = grid[i][j];
            }
        }

        // Can't switch pointers so we manually have to copy the array over
        for (i = 1; i<dim+2; i++) {
            for (j = 1; j<dim+2; j++) {
                grid[i][j] = newGrid[i][j];
            }
        }
    }
}
```

To use the codelet method all of the game logic was moved into its own function. With the directive option **args[grid].io=inout** we specify that **grid** is to be copied onto the GPU at the start of the codelet and copied back to the host at the end. **newGrid** is specified to have no transfer between the host and device. The main game loop is included inside of the codelet, even though it is necessarily serial, to ensure that the game grids are not copied to and from the host after each game iteration. We will look at a more explicit way to manage data transfer below. **grid** and **newGrid** are passed into our codelet with their full dimensions stated; this is necessary for HMPP to correctly copy the data over.

```c
// Get the number of neighbors for a given grid point
numNeighbors = grid[i+1][j] + grid[i-1][j]      // upper lower
             + grid[i][j+1] + grid[i][j-1]      // right left
             + grid[i+1][j+1] + grid[i-1][j-1]  // diagonals
             + grid[i-1][j+1] + grid[i+1][j-1];
```

In the CUDA programming model all functions must be effectively inlined. Here we manually inline our function to calculate the number of alive neighbors a cell has.

```c
// Can't switch pointers so we manually have to copy the array over
for (i = 1; i<dim+2; i++) {
    for (j = 1; j<dim+2; j++) {
        grid[i][j] = newGrid[i][j];
    }
}
```

This modification is **very** important: device pointers are not accessible inside of the codelet. If you leave the pointer swap in place from the CPU code the compiler will not complain, the code will run, and you will get incorrect results. This is not a limitation of GPU programming, as you will see in the CUDA GOL example, but of the HMPP directive framework. The easiest way to fix this is to replace the pointer swap with a copy operation, as we have done. A possible workaround for this operation is investigated below.

```c
// Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
int *grid = (int*) malloc( sizeof(int) * (dim+2) * (dim+2) );
// Allocate newGrid
int *newGrid = (int*) malloc( sizeof(int) * (dim+2) * (dim+2) );
```

```c
grid[i*(dim+2)+j] = rand() % 2;
```

```c
total += grid[i*(dim+2)+j];
```

HMPP currently appears to be limited to copying single dimensional arrays from the host to the device. We are able to keep our multidimensional array indexing in the codelet, while on the host we must allocate and address linear arrays.

## GOL-region.c

```c
#include <stdio.h>
#include <stdlib.h>

#define SRAND_VALUE 1985
#define idx(i,j) ((i)*(dim+2)+(j))

int main(int argc, char* argv[])
{
    int i, j, numNeighbors;
    int dim = 1024;
    // Number of game iterations
    int maxIter = 1<<15;

    // Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
    int *grid = (int*) malloc( sizeof(int) * (dim+2) * (dim+2) );
    // Allocate newGrid
    int *newGrid = (int*) malloc( sizeof(int) * (dim+2) * (dim+2) );

    // Assign initial population randomly
    srand(SRAND_VALUE);
    for (i = 1; i<=dim; i++) {
        for (j = 1; j<=dim; j++) {
            grid[i*(dim+2)+j] = rand() % 2;
        }
    }

    #pragma hmpp GOL region, target=CUDA, args[*].transfer=atcall, args[grid].io=inout &
    #pragma hmpp & args[newGrid].io=none
    {
        int i, j, iter;
        // Main game loop
        for (iter = 0; iter<maxIter; iter++) {
            #pragma hmppcg gridify
            // Left-Right columns
            for (i = 1; i<=dim; i++) {
                grid[idx(i,0)] = grid[idx(i,dim)];   // Copy first real column to right ghost column
                grid[idx(i,dim+1)] = grid[idx(i,1)]; // Copy last real column to left ghost column
            }

            #pragma hmppcg gridify
            // Top-Bottom rows
            for (j = 0; j<=dim+1; j++) {
                grid[idx(0,j)] = grid[idx(dim,j)];   // Copy first real row to bottom ghost row
                grid[idx(dim+1,j)] = grid[idx(1,j)]; // Copy last real row to top ghost row
            }

            int numNeighbors;
            // Now we loop over all cells and determine their fate
            #pragma hmppcg gridify (i,j)
            for (i = 1; i<=dim; i++) {
                for (j = 1; j<=dim; j++) {
                    // Get the number of neighbors for a given grid point
                    numNeighbors = grid[idx(i+1,j)] + grid[idx(i-1,j)]      // upper lower
                                 + grid[idx(i,j+1)] + grid[idx(i,j-1)]      // right left
                                 + grid[idx(i+1,j+1)] + grid[idx(i-1,j-1)]  // diagonals
                                 + grid[idx(i-1,j+1)] + grid[idx(i+1,j-1)];

                    // Here we have explicitly all of the game rules
                    if (grid[idx(i,j)] == 1 && numNeighbors < 2)
                        newGrid[idx(i,j)] = 0;
                    else if (grid[idx(i,j)] == 1 && (numNeighbors == 2 || numNeighbors == 3))
                        newGrid[idx(i,j)] = 1;
                    else if (grid[idx(i,j)] == 1 && numNeighbors > 3)
                        newGrid[idx(i,j)] = 0;
                    else if (grid[idx(i,j)] == 0 && numNeighbors == 3)
                        newGrid[idx(i,j)] = 1;
                    else
                        newGrid[idx(i,j)] = grid[idx(i,j)];
                }
            }

            #pragma hmppcg gridify
            // Can't switch pointers so we manually have to copy the array over
            for (i = 1; i<dim+2; i++) {
                for (j = 1; j<dim+2; j++) {
                    grid[idx(i,j)] = newGrid[idx(i,j)];
                }
            }
        }
    }

    // Sum up alive cells and print results
    int total = 0;
    for (i = 1; i<=dim; i++) {
        for (j = 1; j<=dim; j++) {
            total += grid[idx(i,j)];
        }
    }
    printf("Total Alive: %d\n", total);

    free(grid);
    free(newGrid);
    return 0;
}
```

**Changes**

```c
#define idx(i,j) ((i)*(dim+2)+(j))
```

In the codelet version we were able to stretch the C language a bit and use multidimensional arrays in our codelet and a single dimensional array on our host; this is not the case with the region method. A convenience macro is defined and used whenever **grid** or **newGrid** is accessed.

```c
// Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
int *grid = (int*) malloc( sizeof(int) * (dim+2) * (dim+2) );
// Allocate newGrid
int *newGrid = (int*) malloc( sizeof(int) * (dim+2) * (dim+2) );
```

As in the codelet version, HMPP does not currently handle multidimensional arrays properly, so we revert to a single dimensional array.

```c
#pragma hmpp GOL region, target=CUDA, args[*].transfer=atcall, args[grid].io=inout &
#pragma hmpp & args[newGrid].io=none
```

With the directive option **args[grid].io=inout** we specify that **grid** is to be copied onto the GPU at the start of the region and copied back to the host at the end. **newGrid** is specified to have no transfer between the host and device. The main game loop is included inside of the region, even though it is necessarily serial, to ensure that the game grids are not copied to and from the host after each game iteration. We will look at a more explicit way to manage data transfer below.

```c
int i, j, iter;
```

At the time of this writing HMPP does not correctly identify induction variables used in an **hmpp region** when they are declared outside of the region. The solution is simply to declare our induction variables inside of the region.

```c
#pragma hmppcg gridify
// Left-Right columns
for (i = 1; i<=dim; i++) {
```

```c
#pragma hmppcg gridify
// Top-Bottom rows
for (j = 0; j<=dim+1; j++) {
```

```c
// Now we loop over all cells and determine their fate
#pragma hmppcg gridify (i,j)
for (i = 1; i<=dim; i++) {
    for (j = 1; j<=dim; j++) {
```

```c
#pragma hmppcg gridify
// Can't switch pointers so we manually have to copy the array over
for (i = 1; i<dim+2; i++) {
    for (j = 1; j<dim+2; j++) {
```

The **hmppcg** directives are used to guide the compiler at the loop level. In this particular case the compiler did not want to parallelize the loops due to the outermost game loop being sequential. Adding **hmppcg gridify** tells the compiler the loop immediately following the directive should be executed in parallel on the GPU.

```c
// Get the number of neighbors for a given grid point
numNeighbors = grid[idx(i+1,j)] + grid[idx(i-1,j)]      // upper lower
             + grid[idx(i,j+1)] + grid[idx(i,j-1)]      // right left
             + grid[idx(i+1,j+1)] + grid[idx(i-1,j-1)]  // diagonals
             + grid[idx(i-1,j+1)] + grid[idx(i+1,j-1)];
```

In the CUDA programming model all functions must be effectively inlined. Here we manually inline our function to calculate the number of alive neighbors a cell has.

```c
#pragma hmppcg gridify
// Can't switch pointers so we manually have to copy the array over
for (i = 1; i<dim+2; i++) {
    for (j = 1; j<dim+2; j++) {
        grid[idx(i,j)] = newGrid[idx(i,j)];
    }
}
```

This modification is **very** important: device pointers are not accessible inside of the region. If you leave the pointer swap in place from the CPU code the compiler will not complain, the code will run, and you will get incorrect results. This is not a limitation of GPU programming, as you will see in the CUDA GOL example, but of the HMPP directive framework. The easiest way to fix this is to replace the pointer swap with a copy operation, as we have done. A possible workaround for this operation is investigated below.

**Compiling**

We add the **--io-report** option to hmpp to see compile-time host/device memory transfer information.

```
$ hmpp --io-report cc GOL-codelet.c -o gol.out
```

or

```
$ hmpp --io-report cc GOL-region.c -o gol.out
```

**Output**:

The output will be similar for both the codelet and region methods.

```
hmpp: [Info] IO report :
In GROUP 'gol' REGION 'gol' at GOL-region.c:29, function '__hmpp_region__gol'
Parameter 'dim' has intent IN
Parameter 'maxIter' has intent IN
Parameter 'grid' has intent INOUT
Parameter 'newGrid' has intent INOUT
```

Here we see the **intent**, the transfer direction with respect to the GPU, of each of our variables. We will see more explicit methods of dealing with data transfer below.

```
hmpp: [Info] Generated codelet filename is "gol_cuda.cu".
hmppcg: [Message DPL3000] GOLHMPP-TRY0-REGION.c:37: Loop 'i' was gridified (1D)
hmppcg: [Message DPL3000] GOLHMPP-TRY0-REGION.c:44: Loop 'j' was gridified (1D)
hmppcg: [Message DPL3001] GOLHMPP-TRY0-REGION.c:52: Loops 'j' and 'i' were gridified (2D)
hmppcg: [Message DPL3000] GOLHMPP-TRY0-REGION.c:77: Loop 'i' was gridified (1D)
hmppcg: [Warning DPL3002] GOLHMPP-TRY0-REGION.c:33: Loop not gridified: will be executed on CPU
```

All of the compute loops have been **gridified**, indicating a kernel has been generated for each loop that will be executed in parallel on the GPU. A warning message tells us that the main game loop has not been gridified and will be executed serially on the CPU; this is the expected and correct behavior.

## GOL-codelet.f90

```fortran
!$hmpp GOL codelet, target=CUDA, args[grid].io=inout, args[*].transfer=atcall, args[newGrid].io=none
subroutine GOL(dim, grid, newGrid, maxIter)
    implicit none
    integer, intent(in) :: dim, maxIter
    integer, intent(inout) :: grid(dim+2,dim+2), newGrid(dim+2,dim+2)
    integer :: i,j,iter,numNeighbors

    ! Main game loop
    do iter=1,maxIter
        ! Top-Bottom ghost rows
        do j=2,dim+1
            grid(1,j) = grid(dim+1,j)  ! Copy last game grid row to bottom ghost row
            grid(dim+2,j) = grid(2,j)  ! Copy first game grid row to top ghost row
        enddo
        ! Left-Right ghost columns
        do i=1,dim+2
            grid(i,1) = grid(i,dim+1)  ! Copy last game grid column to left ghost column
            grid(i,dim+2) = grid(i,2)  ! Copy first game grid column to right ghost column
        enddo

        ! Now we loop over all cells and determine their fate
        do j=2,dim+1
            do i=2,dim+1
                ! Get the number of neighbors for a given grid point
                numNeighbors = grid(i,j+1) + grid(i,j-1)&     ! left & right
                             + grid(i+1,j) + grid(i-1,j)&     ! upper & lower
                             + grid(i+1,j+1) + grid(i-1,j-1)& ! diagonals
                             + grid(i-1,j+1) + grid(i+1,j-1)

                ! Here we have explicitly all of the game rules
                if(grid(i,j) == 1 .AND. numNeighbors < 2) then
                    newGrid(i,j) = 0
                elseif(grid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
                    newGrid(i,j) = 1
                elseif(grid(i,j) == 1 .AND. numNeighbors > 3) then
                    newGrid(i,j) = 0
                elseif(grid(i,j) == 0 .AND. numNeighbors == 3) then
                    newGrid(i,j) = 1
                else
                    newGrid(i,j) = grid(i,j)
                endif
            enddo
        enddo

        ! Can't switch pointers so we manually have to copy the array over
        do j=2,dim+1
            do i=2,dim+1
                grid(i,j) = newGrid(i,j)
            enddo
        enddo
    enddo ! end main game loop
end subroutine GOL

program main
    implicit none
    integer :: i,j,seed(8),numNeighbors,total
    real :: randm
    ! Linear game grid dimension
    integer :: dim = 1024
    ! Number of game iterations
    integer :: maxIter = 2**10
    ! Game grid pointers
    integer,dimension(:,:),allocatable :: grid, newGrid, tmpGrid

    ! Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
    allocate(grid(dim+2,dim+2))
    allocate(newGrid(dim+2,dim+2))

    ! Assign initial population randomly
    seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/)
    call random_seed(PUT=seed)
    do j=1,dim
        do i=1,dim
            call random_number(randm)
            grid(i,j) = nint(randm)
        enddo
    enddo

    !$hmpp GOL callsite
    call GOL(dim, grid, newGrid, maxIter)

    ! Sum up alive cells and print results
    total = 0
    do j=2,dim+1
        do i=2,dim+1
            total = total + grid(i,j)
        enddo
    enddo
    print *, "Total Alive", total

    ! Release memory
    deallocate(grid)
    deallocate(newGrid)
end program
```

**Changes**

Lines 1-55:

```fortran
!$hmpp GOL codelet, target=CUDA, args[grid].io=inout, args[*].transfer=atcall, args[newGrid].io=none
subroutine GOL(dim, grid, newGrid, maxIter)
    implicit none
    integer, intent(in) :: dim, maxIter
    integer, intent(inout) :: grid(dim+2,dim+2), newGrid(dim+2,dim+2)
    integer :: i,j,iter,numNeighbors

    ! Main game loop
    do iter=1,maxIter
        ! Top-Bottom ghost rows
        do j=2,dim+1
            grid(1,j) = grid(dim+1,j)  ! Copy last game grid row to bottom ghost row
            grid(dim+2,j) = grid(2,j)  ! Copy first game grid row to top ghost row
        enddo
        ! Left-Right ghost columns
        do i=1,dim+2
            grid(i,1) = grid(i,dim+1)  ! Copy last game grid column to left ghost column
            grid(i,dim+2) = grid(i,2)  ! Copy first game grid column to right ghost column
        enddo

        ! Now we loop over all cells and determine their fate
        do j=2,dim+1
            do i=2,dim+1
                ! Get the number of neighbors for a given grid point
                numNeighbors = grid(i,j+1) + grid(i,j-1)&     ! left & right
                             + grid(i+1,j) + grid(i-1,j)&     ! upper & lower
                             + grid(i+1,j+1) + grid(i-1,j-1)& ! diagonals
                             + grid(i-1,j+1) + grid(i+1,j-1)

                ! Here we have explicitly all of the game rules
                if(grid(i,j) == 1 .AND. numNeighbors < 2) then
                    newGrid(i,j) = 0
                elseif(grid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
                    newGrid(i,j) = 1
                elseif(grid(i,j) == 1 .AND. numNeighbors > 3) then
                    newGrid(i,j) = 0
                elseif(grid(i,j) == 0 .AND. numNeighbors == 3) then
                    newGrid(i,j) = 1
                else
                    newGrid(i,j) = grid(i,j)
                endif
            enddo
        enddo

        ! Can't switch pointers so we manually have to copy the array over
        do j=2,dim+1
            do i=2,dim+1
                grid(i,j) = newGrid(i,j)
            enddo
        enddo
    enddo ! end main game loop
end subroutine GOL
```

To use the codelet method all of the game logic was moved into its own subroutine. With the directive option **args[grid].io=inout** we specify that **grid** is to be copied onto the GPU at the start of the codelet and copied back to the host at the end. **newGrid** is specified to have no transfer between the host and device. The main game loop is included inside of the codelet, even though it is necessarily serial, to ensure that the game grids are not copied to and from the host after each game iteration. We will look at a more explicit way to manage data transfer below.

```fortran
! Get the number of neighbors for a given grid point
numNeighbors = grid(i,j+1) + grid(i,j-1)&     ! left & right
             + grid(i+1,j) + grid(i-1,j)&     ! upper & lower
             + grid(i+1,j+1) + grid(i-1,j-1)& ! diagonals
             + grid(i-1,j+1) + grid(i+1,j-1)
```

In the CUDA programming model all functions must be effectively inlined. Here we manually inline our subroutine to calculate the number of alive neighbors a cell has.

```fortran
! Can't switch pointers so we manually have to copy the array over
do j=2,dim+1
    do i=2,dim+1
        grid(i,j) = newGrid(i,j)
    enddo
enddo
```

This modification is **very** important: device pointers are not accessible inside of the codelet. If you leave the pointer swap in place from the CPU code the compiler will not complain, the code will run, and you will get incorrect results. This is not a limitation of GPU programming, as you will see in the CUDA GOL example, but of the HMPP directive framework. The easiest way to fix this is to replace the pointer swap with a copy operation, as we have done. A possible workaround for this operation is investigated below.

## GOL-region.f90

```fortran
program main
    implicit none
    integer :: i,j,iter,seed(8),numNeighbors,total
    real :: randm
    ! Linear game grid dimension
    integer :: dim = 1024
    ! Number of game iterations
    integer :: maxIter = 2**10
    ! Game grid pointers
    integer,dimension(:,:),allocatable :: grid, newGrid, tmpGrid

    ! Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
    allocate(grid(dim+2,dim+2))
    allocate(newGrid(dim+2,dim+2))

    ! Assign initial population randomly
    seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/)
    call random_seed(PUT=seed)
    do j=1,dim
        do i=1,dim
            call random_number(randm)
            grid(i,j) = nint(randm)
        enddo
    enddo

    !$hmpp GOL region, target=CUDA, args[grid].io=inout, args[*].transfer=atcall &
    !$hmpp & args[newGrid].io=none
    ! Main game loop
    do iter=1,maxIter
        ! Top-Bottom ghost rows
        do j=2,dim+1
            grid(1,j) = grid(dim+1,j)  ! Copy last game grid row to bottom ghost row
            grid(dim+2,j) = grid(2,j)  ! Copy first game grid row to top ghost row
        enddo
        ! Left-Right ghost columns
        do i=1,dim+2
            grid(i,1) = grid(i,dim+1)  ! Copy last game grid column to left ghost column
            grid(i,dim+2) = grid(i,2)  ! Copy first game grid column to right ghost column
        enddo

        ! Now we loop over all cells and determine their fate
        !$hmppcg gridify(j, i)
        do j=2,dim+1
            do i=2,dim+1
                ! Get the number of neighbors for a given grid point
                numNeighbors = grid(i,j+1) + grid(i,j-1)&     ! left & right
                             + grid(i+1,j) + grid(i-1,j)&     ! upper & lower
                             + grid(i+1,j+1) + grid(i-1,j-1)& ! diagonals
                             + grid(i-1,j+1) + grid(i+1,j-1)

                ! Here we have explicitly all of the game rules
                if(grid(i,j) == 1 .AND. numNeighbors < 2) then
                    newGrid(i,j) = 0
                elseif(grid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
                    newGrid(i,j) = 1
                elseif(grid(i,j) == 1 .AND. numNeighbors > 3) then
                    newGrid(i,j) = 0
                elseif(grid(i,j) == 0 .AND. numNeighbors == 3) then
                    newGrid(i,j) = 1
                else
                    newGrid(i,j) = grid(i,j)
                endif
            enddo
        enddo

        ! Can't switch pointers so we manually have to copy the array over
        do j=2,dim+1
            do i=2,dim+1
                grid(i,j) = newGrid(i,j)
            enddo
        enddo
    enddo ! end main game loop
    !$hmpp GOL endregion

    ! Sum up alive cells and print results
    total = 0
    do j=2,dim+1
        do i=2,dim+1
            total = total + grid(i,j)
        enddo
    enddo
    print *, "Total Alive", total

    ! Release memory
    deallocate(grid)
    deallocate(newGrid)
end program
```

**Changes**

```fortran
! Now we loop over all cells and determine their fate
!$hmppcg gridify(j, i)
do j=2,dim+1
    do i=2,dim+1
```

By default the HMPP compiler did not want to gridify the loop over each cell. We can attempt to force it to do so by adding the **hmppcg** directive with the **gridify(j, i)** option, which specifies that we want to gridify with respect to the induction variables **j** and **i**.

```fortran
! Get the number of neighbors for a given grid point
numNeighbors = grid(i,j+1) + grid(i,j-1)&     ! left & right
             + grid(i+1,j) + grid(i-1,j)&     ! upper & lower
             + grid(i+1,j+1) + grid(i-1,j-1)& ! diagonals
             + grid(i-1,j+1) + grid(i+1,j-1)
```

In the CUDA programming model all functions must be effectively inlined. Here we manually inline our subroutine to calculate the number of alive neighbors a cell has.

```fortran
! Can't switch pointers so we manually have to copy the array over
do j=2,dim+1
    do i=2,dim+1
        grid(i,j) = newGrid(i,j)
    enddo
enddo
```

This modification is **very** important: device pointers are not accessible inside of the region. If you leave the pointer swap in place from the CPU code the compiler will not complain, the code will run, and you will get incorrect results. This is not a limitation of GPU programming, as you will see in the CUDA GOL example, but of the HMPP directive framework. The easiest way to fix this is to replace the pointer swap with a copy operation, as we have done. A possible workaround for this operation is investigated below.

**Compiling**

We add the **--io-report** option to hmpp to see compile-time host/device memory transfer information.

```
$ hmpp --io-report ftn GOL-codelet.f90 -o gol.out
```

or

```
$ hmpp --io-report ftn GOL-region.f90 -o gol.out
```

**Output**:

The output will be similar for both the codelet and region methods. In the most recent release as of this writing you may get some warning messages from the compiler about unused variables when using the region method.

```
hmpp: [Info] IO report :
In GROUP 'gol' CODELET 'gol' at GOLHMPP-TRY0-CODELET.f90:1, function 'gol'
Parameter 'dim' has intent IN
Parameter 'grid' has intent INOUT
Parameter 'newgrid' has intent INOUT
Parameter 'maxiter' has intent IN
```

Here we see the **intent**, the transfer direction with respect to the GPU, of each of our variables. We will see more explicit methods of dealing with data transfer below.

```
hmpp: [Info] Generated codelet filename is "gol_cuda.cu".
hmppcg: [Message DPL3000] GOLHMPP-TRY0-CODELET.f90:11: Loop 'j' was gridified (1D)
hmppcg: [Message DPL3000] GOLHMPP-TRY0-CODELET.f90:17: Loop 'i' was gridified (1D)
hmppcg: [Message DPL3001] GOLHMPP-TRY0-CODELET.f90:23: Loops 'i' and 'j' were gridified (2D)
hmppcg: [Message DPL3001] GOLHMPP-TRY0-CODELET.f90:47: Loops 'i' and 'j' were gridified (2D)
hmppcg: [Warning DPL3002] GOLHMPP-TRY0-CODELET.f90:9: Loop not gridified: will be executed on CPU
```

All of the compute loops have been **gridified**, indicating a kernel has been generated for each loop that will be executed in parallel on the GPU. A warning message tells us that the main game loop has not been gridified and will be executed serially on the CPU; this is the expected and correct behavior.

## GOL-codelet-Try2.c

In this version we remove the memory-copy kernel that we needed to use since HMPP does not currently support device pointers. As an added bonus we also take a look at more advanced memory management techniques. There are numerous ways this could be approached, but the methods shown highlight many good-to-know HMPP features.

```c
#include <stdio.h>
#include <stdlib.h>

#define SRAND_VALUE 1985

#pragma hmpp <GOL> group target=CUDA
#pragma hmpp <GOL> mapbyname, dim
#pragma hmpp <GOL> map, args[even::grid;odd::newGrid]
#pragma hmpp <GOL> map, args[even::newGrid;odd::grid]
#pragma hmpp <GOL> even codelet, args[dim,grid,newGrid].transfer=manual
#pragma hmpp <GOL> odd codelet, args[dim,grid,newGrid].transfer=manual
void GOL(int dim, int grid[dim+2][dim+2], int newGrid[dim+2][dim+2])
{
    int i, j, numNeighbors;

    // Left-Right columns
    for (i = 1; i<=dim; i++) {
        grid[i][0] = grid[i][dim];   // Copy first real column to right ghost column
        grid[i][dim+1] = grid[i][1]; // Copy last real column to left ghost column
    }
    // Top-Bottom rows
    for (j = 0; j<=dim+1; j++) {
        grid[0][j] = grid[dim][j];   // Copy first real row to bottom ghost row
        grid[dim+1][j] = grid[1][j]; // Copy last real row to top ghost row
    }

    // Now we loop over all cells and determine their fate
    for (i = 1; i<=dim; i++) {
        for (j = 1; j<=dim; j++) {
            // Get the number of neighbors for a given grid point
            numNeighbors = grid[i+1][j] + grid[i-1][j]      // upper lower
                         + grid[i][j+1] + grid[i][j-1]      // right left
                         + grid[i+1][j+1] + grid[i-1][j-1]  // diagonals
                         + grid[i-1][j+1] + grid[i+1][j-1];

            // Here we have explicitly all of the game rules
            if (grid[i][j] == 1 && numNeighbors < 2)
                newGrid[i][j] = 0;
            else if (grid[i][j] == 1 && (numNeighbors == 2 || numNeighbors == 3))
                newGrid[i][j] = 1;
            else if (grid[i][j] == 1 && numNeighbors > 3)
                newGrid[i][j] = 0;
            else if (grid[i][j] == 0 && numNeighbors == 3)
                newGrid[i][j] = 1;
            else
                newGrid[i][j] = grid[i][j];
        }
    }
}

int main(int argc, char* argv[])
{
    int i, j, iter;
    // Linear grid dimension
    int dim = 1024;
    // Number of game iterations
    int maxIter = 1<<10;

    #pragma hmpp <GOL> allocate

    // Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
    int *grid = (int*) malloc( sizeof(int) * (dim+2) * (dim+2) );
    // Allocate newGrid
    int *newGrid = (int*) malloc( sizeof(int) * (dim+2) * (dim+2) );

    // Assign initial population randomly
    srand(SRAND_VALUE);
    for (i = 1; i<=dim; i++) {
        for (j = 1; j<=dim; j++) {
            grid[i*(dim+2)+j] = rand() % 2;
        }
    }

    // Copy memory to GPU
    #pragma hmpp <GOL> advancedload, args[even::dim;even::grid;even::newGrid]

    // Main game loop
    for (iter = 0; iter<maxIter; iter++) {
        if (iter%2 == 0) {
            #pragma hmpp <GOL> even callsite
            GOL(dim, grid, newGrid);
        } else {
            #pragma hmpp <GOL> odd callsite
            GOL(dim, grid, newGrid);
        }
    } // iter loop

    // Read back final configuration to confirm it worked
    #pragma hmpp <GOL> delegatedstore, args[odd::newGrid]

    // Sum up alive cells and print results
    int total = 0;
    for (i = 1; i<=dim; i++) {
        for (j = 1; j<=dim; j++) {
            total += newGrid[i*(dim+2)+j];
        }
    }
    printf("Total Alive: %d\n", total);

    #pragma hmpp <GOL> release

    free(grid);
    free(newGrid);
    return 0;
}
```

**Changes**

```c
#pragma hmpp <GOL> group target=CUDA
#pragma hmpp <GOL> mapbyname, dim
#pragma hmpp <GOL> map, args[even::grid;odd::newGrid]
#pragma hmpp <GOL> map, args[even::newGrid;odd::grid]
#pragma hmpp <GOL> even codelet, args[dim,grid,newGrid].transfer=manual
#pragma hmpp <GOL> odd codelet, args[dim,grid,newGrid].transfer=manual
```

A major change to our code has been to get rid of the **newGrid** to **grid** memory copy. Instead we use some of the built-in HMPP directive functionality.

```c
#pragma hmpp <GOL> group target=CUDA
```

Here we create what is known as a codelet group, named GOL. This allows us to keep track of similar kernels and set common attributes.

```c
#pragma hmpp <GOL> mapbyname, dim
```

This line will map the variable dim across all kernels. This means once dim is allocated and copied to the device for one kernel it will be available for use by all kernels in the GOL group.

```c
#pragma hmpp <GOL> map, args[even::grid;odd::newGrid]
#pragma hmpp <GOL> map, args[even::newGrid;odd::grid]
```

The map directive is used to map arguments between different kernels; this will effectively switch our pointers for us. The notation **args[CodeletName1::Argument1;CodeletName2::Argument2]** is used such that Argument1 in CodeletName1 refers to the same device memory as Argument2 in CodeletName2. In our case we will soon be defining two kernels, one named even and one named odd.

```c
#pragma hmpp <GOL> even codelet, args[dim,grid,newGrid].transfer=manual
#pragma hmpp <GOL> odd codelet, args[dim,grid,newGrid].transfer=manual
```

The GOL group will contain two codelets, even and odd. Our game will oscillate between calling these codelets in the main game loop. The only difference between these codelets is caused by the map directive used above which effectively switches the device pointers to grid and newGrid.

```c
#pragma hmpp <GOL> allocate
```

The allocate directive will allocate space on the GPU for all memory needed by the named group. Any memory objects shared by more than one codelet through the **map** or **mapbyname** directives will only be allocated once.

```c
// Copy memory to GPU
#pragma hmpp <GOL> advancedload, args[even::dim;even::grid;even::newGrid]
```

Once we have allocated space on the GPU we can copy data from the host to the GPU with the **advancedload** directive. Notice that when referencing which arguments will be copied we qualify them with one of the codelet names; either codelet name could be specified.

```c
// Main game loop
for (iter = 0; iter<maxIter; iter++) {
    if (iter%2 == 0) {
        #pragma hmpp <GOL> even callsite
        GOL(dim, grid, newGrid);
    } else {
        #pragma hmpp <GOL> odd callsite
        GOL(dim, grid, newGrid);
    }
} // iter loop
```

The main game loop induction variable is used to switch between calling our **even** codelet and our **odd** codelet. Although it looks as though the same GOL function is being called in both cases, because of the **map** directives defined with the codelets we know the grid and newGrid GPU pointers are switched between the two.

```c
// Read back final configuration to confirm it worked
#pragma hmpp <GOL> delegatedstore, args[odd::newGrid]
```

After the final kernel is finished we copy the latest game grid, held in **newGrid**, from the device back to the host.

```c
#pragma hmpp <GOL> release
```

Finally, since we allocated the device memory we must free it, which is done with the **hmpp <GOL> release** directive.

## GOL-region-Try2.c

```c
#include <stdio.h>
#include <stdlib.h>

#define SRAND_VALUE 1985
#define idx(i,j) ((i)*(dim+2)+(j))

int main(int argc, char* argv[])
{
    int i, j, iter;
    int dim = 1024;
    // Number of game iterations
    int maxIter = 1<<10;

    #pragma hmpp GOL allocate, args[grid,newGrid].size={(dim+2)*(dim+2)}

    // Allocate square grid of (dim+2)^2 elements, 2 added for periodic boundary condition ghost cells
    int *grid = (int*) malloc( sizeof(int) * (dim+2) * (dim+2) );
    // Allocate newGrid
    int *newGrid = (int*) malloc( sizeof(int) * (dim+2) * (dim+2) );

    // Assign initial population randomly
    srand(SRAND_VALUE);
    for (i = 1; i<=dim; i++) {
        for (j = 1; j<=dim; j++) {
            grid[i*(dim+2)+j] = rand() % 2;
        }
    }

    // Copy memory to GPU
    #pragma hmpp GOL advancedload, args[dim;grid;newGrid]

    // Main game loop
    for (iter = 0; iter<maxIter; iter++) {
        #pragma hmpp GOL region, target=CUDA, args[dim,grid,newGrid].transfer=manual, args[iter].transfer=atcall
        {
            int i, j;
            if (iter%2 == 0) {
                #pragma hmppcg gridify
                // Left-Right columns
                for (i = 1; i<=dim; i++) {
                    grid[idx(i,0)] = grid[idx(i,dim)];   // Copy first real column to right ghost column
                    grid[idx(i,dim+1)] = grid[idx(i,1)]; // Copy last real column to left ghost column
                }
                #pragma hmppcg gridify
                // Top-Bottom rows
                for (j = 0; j<=dim+1; j++) {
                    grid[idx(0,j)] = grid[idx(dim,j)];   // Copy first real row to bottom ghost row
                    grid[idx(dim+1,j)] = grid[idx(1,j)]; // Copy last real row to top ghost row
                }

                int numNeighbors;
                // Now we loop over all cells and determine their fate
                #pragma hmppcg gridify (i,j)
                for (i = 1; i<=dim; i++) {
                    for (j = 1; j<=dim; j++) {
                        // Get the number of neighbors for a given grid point
                        numNeighbors = grid[idx(i+1,j)] + grid[idx(i-1,j)]      // upper lower
                                     + grid[idx(i,j+1)] + grid[idx(i,j-1)]      // right left
                                     + grid[idx(i+1,j+1)] + grid[idx(i-1,j-1)]  // diagonals
                                     + grid[idx(i-1,j+1)] + grid[idx(i+1,j-1)];

                        // Here we have explicitly all of the game rules
                        if (grid[idx(i,j)] == 1 && numNeighbors < 2)
                            newGrid[idx(i,j)] = 0;
                        else if (grid[idx(i,j)] == 1 && (numNeighbors == 2 || numNeighbors == 3))
                            newGrid[idx(i,j)] = 1;
                        else if (grid[idx(i,j)] == 1 && numNeighbors > 3)
                            newGrid[idx(i,j)] = 0;
                        else if (grid[idx(i,j)] == 0 && numNeighbors == 3)
                            newGrid[idx(i,j)] = 1;
                        else
                            newGrid[idx(i,j)] = grid[idx(i,j)];
                    }
                }
            } else {
                #pragma hmppcg gridify
                // Left-Right columns
                for (i = 1; i<=dim; i++) {
                    newGrid[idx(i,0)] = newGrid[idx(i,dim)];   // Copy first real column to right ghost column
                    newGrid[idx(i,dim+1)] = newGrid[idx(i,1)]; // Copy last real column to left ghost column
                }
                #pragma hmppcg gridify
                // Top-Bottom rows
                for (j = 0; j<=dim+1; j++) {
                    newGrid[idx(0,j)] = newGrid[idx(dim,j)];   // Copy first real row to bottom ghost row
                    newGrid[idx(dim+1,j)] = newGrid[idx(1,j)]; // Copy last real row to top ghost row
                }

                int numNeighbors;
                // Now we loop over all cells and determine their fate
                #pragma hmppcg gridify (i,j)
                for (i = 1; i<=dim; i++) {
                    for (j = 1; j<=dim; j++) {
                        // Get the number of neighbors for a given grid point
                        numNeighbors = newGrid[idx(i+1,j)] + newGrid[idx(i-1,j)]      // upper lower
                                     + newGrid[idx(i,j+1)] + newGrid[idx(i,j-1)]      // right left
                                     + newGrid[idx(i+1,j+1)] + newGrid[idx(i-1,j-1)]  // diagonals
                                     + newGrid[idx(i-1,j+1)] + newGrid[idx(i+1,j-1)];

                        // Here we have explicitly all of the game rules
                        if (newGrid[idx(i,j)] == 1 && numNeighbors < 2)
                            grid[idx(i,j)] = 0;
                        else if (newGrid[idx(i,j)] == 1 && (numNeighbors == 2 || numNeighbors == 3))
                            grid[idx(i,j)] = 1;
                        else if (newGrid[idx(i,j)] == 1 && numNeighbors > 3)
                            grid[idx(i,j)] = 0;
                        else if (newGrid[idx(i,j)] == 0 && numNeighbors == 3)
                            grid[idx(i,j)] = 1;
                        else
                            grid[idx(i,j)] = newGrid[idx(i,j)];
                    }
                }
            }
        }
    } // iter loop

    // Read back final configuration to confirm it worked
    #pragma hmpp GOL delegatedstore, args[grid]

    // Sum up alive cells and print results
    int total = 0;
    for (i = 1; i<=dim; i++) {
        for (j = 1; j<=dim; j++) {
            total += grid[i*(dim+2)+j];
        }
    }
    printf("Total Alive: %d\n", total);

    #pragma hmpp GOL release

    free(grid);
    free(newGrid);
    return 0;
}
```

**Changes**

```c
#pragma hmpp GOL allocate, args[grid,newGrid].size={(dim+2)*(dim+2)}
```

The **hmpp GOL allocate** directive allocates space on the GPU for the named region. The compiler requires us to specify the size of the passed-in arrays at allocation time.

```c
#pragma hmpp GOL advancedload, args[dim;grid;newGrid]
```

**hmpp GOL advancedload, args[dim;grid;newGrid]** copies the named arguments of the region from the host to the GPU.

```c
#pragma hmpp GOL region, target=CUDA, args[dim,grid,newGrid].transfer=manual, args[iter].transfer=atcall
```

In this version of the code we have our region inside of our main game loop but no longer have to worry about unnecessary memory copies.

```c
if (iter%2 == 0) {
    #pragma hmppcg gridify
    // Left-Right columns
    for (i = 1; i <= dim; i++) {
        grid[idx(i,0)]     = grid[idx(i,dim)]; // Copy last real column to left ghost column
        grid[idx(i,dim+1)] = grid[idx(i,1)];   // Copy first real column to right ghost column
    }
    #pragma hmppcg gridify
    // Top-Bottom rows
    for (j = 0; j <= dim+1; j++) {
        grid[idx(0,j)]     = grid[idx(dim,j)]; // Copy last real row to top ghost row
        grid[idx(dim+1,j)] = grid[idx(1,j)];   // Copy first real row to bottom ghost row
    }

    int numNeighbors;
    // Now we loop over all cells and determine their fate
    #pragma hmppcg gridify (i,j)
    for (i = 1; i <= dim; i++) {
        for (j = 1; j <= dim; j++) {
            // Get the number of neighbors for a given grid point
            numNeighbors = grid[idx(i+1,j)]   + grid[idx(i-1,j)]    // upper lower
                         + grid[idx(i,j+1)]   + grid[idx(i,j-1)]    // right left
                         + grid[idx(i+1,j+1)] + grid[idx(i-1,j-1)]  // diagonals
                         + grid[idx(i-1,j+1)] + grid[idx(i+1,j-1)];

            // Here we have explicitly all of the game rules
            if (grid[idx(i,j)] == 1 && numNeighbors < 2)
                newGrid[idx(i,j)] = 0;
            else if (grid[idx(i,j)] == 1 && (numNeighbors == 2 || numNeighbors == 3))
                newGrid[idx(i,j)] = 1;
            else if (grid[idx(i,j)] == 1 && numNeighbors > 3)
                newGrid[idx(i,j)] = 0;
            else if (grid[idx(i,j)] == 0 && numNeighbors == 3)
                newGrid[idx(i,j)] = 1;
            else
                newGrid[idx(i,j)] = grid[idx(i,j)];
        }
    }
} else {
    #pragma hmppcg gridify
    // Left-Right columns
    for (i = 1; i <= dim; i++) {
        newGrid[idx(i,0)]     = newGrid[idx(i,dim)]; // Copy last real column to left ghost column
        newGrid[idx(i,dim+1)] = newGrid[idx(i,1)];   // Copy first real column to right ghost column
    }
    #pragma hmppcg gridify
    // Top-Bottom rows
    for (j = 0; j <= dim+1; j++) {
        newGrid[idx(0,j)]     = newGrid[idx(dim,j)]; // Copy last real row to top ghost row
        newGrid[idx(dim+1,j)] = newGrid[idx(1,j)];   // Copy first real row to bottom ghost row
    }

    int numNeighbors;
    // Now we loop over all cells and determine their fate
    #pragma hmppcg gridify (i,j)
    for (i = 1; i <= dim; i++) {
        for (j = 1; j <= dim; j++) {
            // Get the number of neighbors for a given grid point
            numNeighbors = newGrid[idx(i+1,j)]   + newGrid[idx(i-1,j)]    // upper lower
                         + newGrid[idx(i,j+1)]   + newGrid[idx(i,j-1)]    // right left
                         + newGrid[idx(i+1,j+1)] + newGrid[idx(i-1,j-1)]  // diagonals
                         + newGrid[idx(i-1,j+1)] + newGrid[idx(i+1,j-1)];

            // Here we have explicitly all of the game rules
            if (newGrid[idx(i,j)] == 1 && numNeighbors < 2)
                grid[idx(i,j)] = 0;
            else if (newGrid[idx(i,j)] == 1 && (numNeighbors == 2 || numNeighbors == 3))
                grid[idx(i,j)] = 1;
            else if (newGrid[idx(i,j)] == 1 && numNeighbors > 3)
                grid[idx(i,j)] = 0;
            else if (newGrid[idx(i,j)] == 0 && numNeighbors == 3)
                grid[idx(i,j)] = 1;
            else
                grid[idx(i,j)] = newGrid[idx(i,j)];
        }
    }
}
```

A major change to our code has been to get rid of the grid/newGrid switching kernel. We have explicitly coded the two cases and use the modulus of the main iteration variable to switch between the two pieces of code. This increases the code complexity but does get around a costly memory transfer.

```c
// Read back final configuration to confirm it worked
#pragma hmpp GOL delegatedstore, args[grid]
```

After the kernel finishes we copy the grid from the device to the host with **hmpp GOL delegatedstore, args[grid]**.

```c
#pragma hmpp GOL release
```

Finally, since we allocated the device memory we must free it with the directive **hmpp GOL release**.

## GOL-codelet-Try2.f90

```fortran
!$hmpp <GOL> group, target=CUDA
!$hmpp <GOL> mapbyname, dim
!$hmpp <GOL> map, args[even::grid;odd::newGrid]
!$hmpp <GOL> map, args[even::newGrid;odd::grid]
!$hmpp <GOL> even codelet, args[dim,grid,newGrid].transfer=manual
!$hmpp <GOL> odd codelet, args[dim,grid,newGrid].transfer=manual
subroutine GOL(dim, grid, newGrid)
  implicit none
  integer, intent(in) :: dim
  integer, intent(inout) :: grid(dim+2,dim+2), newGrid(dim+2,dim+2)
  integer :: i,j,numNeighbors

  ! Top-Bottom ghost rows
  do j=2,dim+1
    grid(1,j) = grid(dim+1,j)   ! Copy last real row to top ghost row
    grid(dim+2,j) = grid(2,j)   ! Copy first real row to bottom ghost row
  enddo

  ! Left-Right ghost columns
  do i=1,dim+2
    grid(i,1) = grid(i,dim+1)   ! Copy last real column to left ghost column
    grid(i,dim+2) = grid(i,2)   ! Copy first real column to right ghost column
  enddo

  ! Now we loop over all cells and determine their fate
  do j=2,dim+1
    do i=2,dim+1
      ! Get the number of neighbors for a given grid point
      numNeighbors = grid(i,j+1) + grid(i,j-1)     & ! right & left
                   + grid(i+1,j) + grid(i-1,j)     & ! upper & lower
                   + grid(i+1,j+1) + grid(i-1,j-1) & ! diagonals
                   + grid(i-1,j+1) + grid(i+1,j-1)

      ! Here we have explicitly all of the game rules
      if(grid(i,j) == 1 .AND. numNeighbors < 2) then
        newGrid(i,j) = 0
      elseif(grid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
        newGrid(i,j) = 1
      elseif(grid(i,j) == 1 .AND. numNeighbors > 3) then
        newGrid(i,j) = 0
      elseif(grid(i,j) == 0 .AND. numNeighbors == 3) then
        newGrid(i,j) = 1
      else
        newGrid(i,j) = grid(i,j)
      endif
    enddo
  enddo
end subroutine GOL

program main
  implicit none
  integer :: i,j,seed(8),iter,total
  real :: randm
  ! Linear game grid dimension
  integer :: dim = 1024
  ! Number of game iterations
  integer :: maxIter = 2**10
  ! Game grids
  integer,dimension(:,:),allocatable :: grid, newGrid

  !$hmpp <GOL> allocate

  ! Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
  allocate(grid(dim+2,dim+2))
  allocate(newGrid(dim+2,dim+2))

  ! Assign initial population randomly
  seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/)
  call random_seed(PUT=seed)
  do j=1,dim
    do i=1,dim
      call random_number(randm)
      grid(i,j) = nint(randm)
    enddo
  enddo

  ! Copy memory to GPU
  !$hmpp <GOL> advancedload, args[even::dim;even::grid;even::newGrid]

  ! Main game loop
  do iter=1,maxIter
    if(MOD(iter,2) == 1) then
      !$hmpp <GOL> even callsite
      call GOL(dim, grid, newGrid)
    else
      !$hmpp <GOL> odd callsite
      call GOL(dim, grid, newGrid)
    endif
  enddo ! end main game loop

  ! Read back final configuration to confirm it worked
  !$hmpp <GOL> delegatedstore, args[odd::newGrid]

  ! Sum up alive cells and print results
  total = 0
  do j=2,dim+1
    do i=2,dim+1
      total = total + newGrid(i,j)
    enddo
  enddo
  print *, "Total Alive", total

  ! Release memory
  !$hmpp <GOL> release
  deallocate(grid)
  deallocate(newGrid)
end program
```

**Changes**

```fortran
!$hmpp <GOL> group, target=CUDA
!$hmpp <GOL> mapbyname, dim
!$hmpp <GOL> map, args[even::grid;odd::newGrid]
!$hmpp <GOL> map, args[even::newGrid;odd::grid]
!$hmpp <GOL> even codelet
!$hmpp <GOL> odd codelet
```

A major change to our code has been to get rid of the **newGrid** to **grid** memory copy. Instead we use some of the built in HMPP directive functionality.

```fortran
!$hmpp <GOL> group, target=CUDA
```

Here we create what is known as a codelet group, named GOL. This allows us to keep track of similar kernels and set common attributes.

```fortran
!$hmpp <GOL> mapbyname, dim
```

This line will map the variable dim across all kernels. This means once dim is allocated and copied to the device for one kernel it will be available for use by all kernels in the GOL group.

```fortran
!$hmpp <GOL> map, args[even::grid;odd::newGrid]
!$hmpp <GOL> map, args[even::newGrid;odd::grid]
```

The map directive is used to map arguments between different kernels; this effectively switches our pointers for us. The notation `args[CodeletName1::Argument1;CodeletName2::Argument2]` means that Argument1 in CodeletName1 will refer to the same device memory as Argument2 in CodeletName2. In our case we will soon be defining two kernels, one named even and one named odd.

```fortran
!$hmpp <GOL> even codelet
!$hmpp <GOL> odd codelet
```

The GOL group will contain two codelets, even and odd. Our game will oscillate between calling these codelets in the main game loop. The only difference between these codelets is caused by the map directive used above which effectively switches the device pointers to grid and newGrid.

```fortran
!$hmpp <GOL> allocate
```

The allocate directive will allocate space on the GPU for all memory needed by the named group. Any memory objects shared by more than one codelet through the map or mapbyname directives will only be allocated once.

```fortran
! Copy memory to GPU
!$hmpp <GOL> advancedload, args[even::dim;even::grid;even::newGrid]
```

Once we have allocated space on the GPU we can copy the initial data from the host to the GPU with the advancedload directive. Notice that when referencing which arguments will be copied we qualify them with one of the codelet names; either codelet name could be used.

```fortran
if(MOD(iter,2) == 1) then
  !$hmpp <GOL> even callsite, args[grid].noupdate=true
  call GOL(dim, grid, newGrid)
else
  !$hmpp <GOL> odd callsite, args[grid].noupdate=true
  call GOL(dim, grid, newGrid)
endif
```

We use the main game loop induction variable to switch between calling our even codelet and our odd codelet. Although it looks as though the same GOL subroutine is called in both cases, the map directive defined with the codelets ensures that the grid and newGrid GPU pointers are switched. **args[grid].noupdate=true** indicates that we do not want any implicit data movement between the host and device.

```fortran
! Read back final configuration to confirm it worked
!$hmpp <GOL> delegatedstore, args[odd::newGrid]
```

After the kernel finishes we copy the final grid from the device to the host with **hmpp <GOL> delegatedstore, args[odd::newGrid]**.

```fortran
!$hmpp <GOL> release
```

Finally, since we allocated the device memory we must free it with the directive **hmpp <GOL> release**.

## GOL-region-Try2.f90

```fortran
program main
  implicit none
  integer :: i,j,seed(8),iter,numNeighbors,total
  real :: randm
  ! Linear game grid dimension
  integer :: dim = 1024
  ! Number of game iterations
  integer :: maxIter = 2**10
  ! Game grids
  integer,dimension(:,:),allocatable :: grid, newGrid

  !$hmpp GOL allocate, args[grid,newGrid].size={(dim+2),(dim+2)}

  ! Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
  allocate(grid(dim+2,dim+2))
  allocate(newGrid(dim+2,dim+2))

  ! Assign initial population randomly
  seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/)
  call random_seed(PUT=seed)
  do j=1,dim
    do i=1,dim
      call random_number(randm)
      grid(i,j) = nint(randm)
    enddo
  enddo

  ! Copy memory to GPU
  !$hmpp GOL advancedload, args[dim;grid;newGrid]

  ! Main game loop
  do iter=1,maxIter
    !$hmpp GOL region, target=CUDA, args[dim,grid,newGrid].transfer=manual, args[iter].transfer=atcall
    if(MOD(iter,2) == 1) then
      ! Top-Bottom ghost rows
      do j=2,dim+1
        grid(1,j) = grid(dim+1,j)   ! Copy last real row to top ghost row
        grid(dim+2,j) = grid(2,j)   ! Copy first real row to bottom ghost row
      enddo
      ! Left-Right ghost columns
      do i=1,dim+2
        grid(i,1) = grid(i,dim+1)   ! Copy last real column to left ghost column
        grid(i,dim+2) = grid(i,2)   ! Copy first real column to right ghost column
      enddo
      ! Now we loop over all cells and determine their fate
      !$hmppcg gridify(j,i)
      do j=2,dim+1
        do i=2,dim+1
          ! Get the number of neighbors for a given grid point
          numNeighbors = grid(i,j+1) + grid(i,j-1)     & ! right & left
                       + grid(i+1,j) + grid(i-1,j)     & ! upper & lower
                       + grid(i+1,j+1) + grid(i-1,j-1) & ! diagonals
                       + grid(i-1,j+1) + grid(i+1,j-1)
          ! Here we have explicitly all of the game rules
          if(grid(i,j) == 1 .AND. numNeighbors < 2) then
            newGrid(i,j) = 0
          elseif(grid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
            newGrid(i,j) = 1
          elseif(grid(i,j) == 1 .AND. numNeighbors > 3) then
            newGrid(i,j) = 0
          elseif(grid(i,j) == 0 .AND. numNeighbors == 3) then
            newGrid(i,j) = 1
          else
            newGrid(i,j) = grid(i,j)
          endif
        enddo
      enddo
    else
      ! Top-Bottom ghost rows
      do j=2,dim+1
        newGrid(1,j) = newGrid(dim+1,j)   ! Copy last real row to top ghost row
        newGrid(dim+2,j) = newGrid(2,j)   ! Copy first real row to bottom ghost row
      enddo
      ! Left-Right ghost columns
      do i=1,dim+2
        newGrid(i,1) = newGrid(i,dim+1)   ! Copy last real column to left ghost column
        newGrid(i,dim+2) = newGrid(i,2)   ! Copy first real column to right ghost column
      enddo
      ! Now we loop over all cells and determine their fate
      !$hmppcg gridify(j,i)
      do j=2,dim+1
        do i=2,dim+1
          ! Get the number of neighbors for a given grid point
          numNeighbors = newGrid(i,j+1) + newGrid(i,j-1)     & ! right & left
                       + newGrid(i+1,j) + newGrid(i-1,j)     & ! upper & lower
                       + newGrid(i+1,j+1) + newGrid(i-1,j-1) & ! diagonals
                       + newGrid(i-1,j+1) + newGrid(i+1,j-1)
          ! Here we have explicitly all of the game rules
          if(newGrid(i,j) == 1 .AND. numNeighbors < 2) then
            grid(i,j) = 0
          elseif(newGrid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
            grid(i,j) = 1
          elseif(newGrid(i,j) == 1 .AND. numNeighbors > 3) then
            grid(i,j) = 0
          elseif(newGrid(i,j) == 0 .AND. numNeighbors == 3) then
            grid(i,j) = 1
          else
            grid(i,j) = newGrid(i,j)
          endif
        enddo
      enddo
    endif ! End grid/newGrid switch
    !$hmpp GOL endregion
  enddo ! end main game loop

  ! Read back final configuration to confirm it worked
  !$hmpp GOL delegatedstore, args[grid]

  ! Sum up alive cells and print results
  total = 0
  do j=2,dim+1
    do i=2,dim+1
      total = total + grid(i,j)
    enddo
  enddo
  print *, "Total Alive", total

  ! Release memory
  !$hmpp GOL release
  deallocate(grid)
  deallocate(newGrid)
end program
```

**Changes**

```fortran
!$hmpp GOL allocate, args[grid,newGrid].size={(dim+2),(dim+2)}
```

The **hmpp GOL allocate** directive allocates space on the GPU for the named region. The compiler requires us to specify the size of the passed-in arrays at allocation time with **args[grid,newGrid].size={(dim+2),(dim+2)}**.

```fortran
!$hmpp GOL advancedload, args[dim;grid;newGrid]
```

**hmpp GOL advancedload, args[dim;grid;newGrid]** copies the named arguments of the region from the host to the GPU.

```fortran
!$hmpp GOL region, target=CUDA, args[dim,grid,newGrid].transfer=manual, args[iter].transfer=atcall
```

In this version of the code we have our region inside of our main game loop but no longer have to worry about unnecessary memory copies.


```fortran
if(MOD(iter,2) == 1) then
  ! Top-Bottom ghost rows
  do j=2,dim+1
    grid(1,j) = grid(dim+1,j)   ! Copy last real row to top ghost row
    grid(dim+2,j) = grid(2,j)   ! Copy first real row to bottom ghost row
  enddo
  ! Left-Right ghost columns
  do i=1,dim+2
    grid(i,1) = grid(i,dim+1)   ! Copy last real column to left ghost column
    grid(i,dim+2) = grid(i,2)   ! Copy first real column to right ghost column
  enddo
  ! Now we loop over all cells and determine their fate
  !$hmppcg gridify(j,i)
  do j=2,dim+1
    do i=2,dim+1
      ! Get the number of neighbors for a given grid point
      numNeighbors = grid(i,j+1) + grid(i,j-1)     & ! right & left
                   + grid(i+1,j) + grid(i-1,j)     & ! upper & lower
                   + grid(i+1,j+1) + grid(i-1,j-1) & ! diagonals
                   + grid(i-1,j+1) + grid(i+1,j-1)
      ! Here we have explicitly all of the game rules
      if(grid(i,j) == 1 .AND. numNeighbors < 2) then
        newGrid(i,j) = 0
      elseif(grid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
        newGrid(i,j) = 1
      elseif(grid(i,j) == 1 .AND. numNeighbors > 3) then
        newGrid(i,j) = 0
      elseif(grid(i,j) == 0 .AND. numNeighbors == 3) then
        newGrid(i,j) = 1
      else
        newGrid(i,j) = grid(i,j)
      endif
    enddo
  enddo
else
  ! Top-Bottom ghost rows
  do j=2,dim+1
    newGrid(1,j) = newGrid(dim+1,j)   ! Copy last real row to top ghost row
    newGrid(dim+2,j) = newGrid(2,j)   ! Copy first real row to bottom ghost row
  enddo
  ! Left-Right ghost columns
  do i=1,dim+2
    newGrid(i,1) = newGrid(i,dim+1)   ! Copy last real column to left ghost column
    newGrid(i,dim+2) = newGrid(i,2)   ! Copy first real column to right ghost column
  enddo
  ! Now we loop over all cells and determine their fate
  !$hmppcg gridify(j,i)
  do j=2,dim+1
    do i=2,dim+1
      ! Get the number of neighbors for a given grid point
      numNeighbors = newGrid(i,j+1) + newGrid(i,j-1)     & ! right & left
                   + newGrid(i+1,j) + newGrid(i-1,j)     & ! upper & lower
                   + newGrid(i+1,j+1) + newGrid(i-1,j-1) & ! diagonals
                   + newGrid(i-1,j+1) + newGrid(i+1,j-1)
      ! Here we have explicitly all of the game rules
      if(newGrid(i,j) == 1 .AND. numNeighbors < 2) then
        grid(i,j) = 0
      elseif(newGrid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
        grid(i,j) = 1
      elseif(newGrid(i,j) == 1 .AND. numNeighbors > 3) then
        grid(i,j) = 0
      elseif(newGrid(i,j) == 0 .AND. numNeighbors == 3) then
        grid(i,j) = 1
      else
        grid(i,j) = newGrid(i,j)
      endif
    enddo
  enddo
endif ! End grid/newGrid switch
```

A major change to our code has been to get rid of the grid/newGrid switching kernel. We have explicitly coded the two cases and use the modulus of the main iteration variable to switch between the two pieces of code. This increases the code complexity but does get around a costly memory transfer.

```fortran
! Now we loop over all cells and determine their fate
!$hmppcg gridify(j,i)
```

By default the HMPP compiler declined to gridify the loop over each cell. We can force it to do so by adding the **hmppcg gridify(j,i)** directive, which specifies that we want to gridify with respect to the induction variables **j** and **i**.

```fortran
! Read back final configuration to confirm it worked
!$hmpp GOL delegatedstore, args[grid]
```

After the kernel finishes we copy the grid from the device to the host with **hmpp GOL delegatedstore, args[grid]**.

```fortran
! Release memory
!$hmpp GOL release
```

Finally since we allocated the device memory we must free it with the directive **hmpp GOL release**.

This sample shows a minimal conversion from our vector addition CPU code to an HMPP accelerator directives version, consider this an HMPP ‘Hello World’. Modifications from the CPU version will be highlighted and briefly discussed. Please direct any questions or comments to help@nccs.gov

HMPP allows code to be offloaded onto the GPU using two different methods, both of which are covered. The **codelet** method allows an entire C function or Fortran subroutine to be executed on the GPU. The **region** method allows a contiguous block of code, not necessarily residing in a function or subroutine, to be executed on the GPU.

## vecAdd-codelet.c

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#pragma hmpp vecAdd codelet, target=CUDA, args[*].transfer=atcall, args[c].io=out
void vecAdd(int n, double a[n], double b[n], double c[n])
{
    int j;
    for (j = 0; j < n; j++) {
        c[j] = a[j] + b[j];
    }
}

int main(int argc, char* argv[])
{
    // Size of vectors
    int n = 100000;

    // Input vectors
    double *a;
    double *b;
    // Output vector
    double *c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector
    a = (double*)malloc(bytes);
    b = (double*)malloc(bytes);
    c = (double*)malloc(bytes);

    // Initialize content of input vectors, vector a[i] = sin(i)^2, vector b[i] = cos(i)^2
    int i;
    for (i = 0; i < n; i++) {
        a[i] = sin(i)*sin(i);
        b[i] = cos(i)*cos(i);
    }

    // Sum component wise and save result into vector c
    #pragma hmpp vecAdd callsite
    vecAdd(n, a, b, c);

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0;
    for (i = 0; i < n; i++) {
        sum += c[i];
    }
    sum = sum/n;
    printf("final result: %f\n", sum);

    // Release memory
    free(a);
    free(b);
    free(c);

    return 0;
}
```

### Changes to vecAdd-codelet.c

```c
#pragma hmpp vecAdd codelet, target=CUDA, args[*].transfer=atcall, args[c].io=out
void vecAdd(int n, double a[n], double b[n], double c[n])
{
    int j;
    for (j = 0; j < n; j++) {
        c[j] = a[j] + b[j];
    }
}
```

The combined **#pragma hmpp** directive and C function **vecAdd** form what is referred to as the codelet. This codelet, given the name vecAdd, will be computed on the GPU when matched with an HMPP callsite. Memory is copied from the CPU to the GPU at the start of the codelet and back from the GPU to the CPU at the end. It must be noted that the current compiler, version 2.4.1, does not correctly copy the vector **c** from the GPU to the host at the end of the codelet call, so we must specify the transfer explicitly with **args[c].io=out**. This will be explored in more detail later.

```c
#pragma hmpp vecAdd callsite
vecAdd(n, a, b, c);
```

The combined **#pragma hmpp** directive and C function call form what is referred to as the callsite. The callsite will trigger the specified codelet to be run on the GPU.

### Compiling and Running vecAdd-codelet.c

Before compiling the hmpp module must be loaded:

```shell
$ module load PrgEnv-pgi capsmc cudatoolkit
$ hmpp cc vecAdd-codelet.c -o vecAdd.out
```

**Output**:

The compiler will output the following:

```
hmpp: [Info] Generated codelet filename is "vecadd_cuda.cu".
hmppcg: [Message DPL3000] vecAdd-codelet.c:9: Loop 'j' was gridified (1D)
```

The compiler tells us that it has created the CUDA file *vecadd_cuda.cu* for the codelet. The second line tells us that the loop starting on line 9 with induction variable 'j' will be parallelized on the GPU and that the kernel will launch with a 1 dimensional grid of thread blocks.

```shell
$ aprun ./vecAdd.out
final result: 1.000000
```

## vecAdd-codelet.f90

```fortran
!$hmpp vecAdd codelet, target=CUDA, args[*].transfer=atcall, args[c].io=out
subroutine vecAdd(n, a, b, c)
  implicit none
  integer, intent(in) :: n
  real(8), intent(in) :: a(n), b(n)
  real(8), intent(out) :: c(n)
  integer :: j

  do j=1,n
    c(j) = a(j) + b(j)
  enddo
end subroutine vecAdd

program main
  implicit none
  ! Size of vectors
  integer :: n = 100000
  ! Input vectors
  real(8),dimension(:),allocatable :: a
  real(8),dimension(:),allocatable :: b
  ! Output vector
  real(8),dimension(:),allocatable :: c

  integer :: i
  real(8) :: sum

  ! Allocate memory for each vector
  allocate(a(n))
  allocate(b(n))
  allocate(c(n))

  ! Initialize content of input vectors, vector a(i) = sin(i)^2, vector b(i) = cos(i)^2
  do i=1,n
    a(i) = sin(i*1D0)*sin(i*1D0)
    b(i) = cos(i*1D0)*cos(i*1D0)
  enddo

  ! Sum component wise and save result into vector c
  !$hmpp vecAdd callsite
  call vecAdd(n, a, b, c)

  ! Sum up vector c and print result divided by n, this should equal 1 within error
  sum = 0d0
  do i=1,n
    sum = sum + c(i)
  enddo
  sum = sum/n
  print *, 'final result: ', sum

  ! Release memory
  deallocate(a)
  deallocate(b)
  deallocate(c)
end program main
```

### Changes to vecAdd-codelet.f90

```fortran
!$hmpp vecAdd codelet, target=CUDA, args[*].transfer=atcall, args[c].io=out
subroutine vecAdd(n, a, b, c)
  implicit none
  integer, intent(in) :: n
  real(8), intent(in) :: a(n), b(n)
  real(8), intent(out) :: c(n)
  integer :: j

  do j=1,n
    c(j) = a(j) + b(j)
  enddo
end subroutine vecAdd
```

The combined **!$hmpp** directive and Fortran subroutine **vecAdd** form what is referred to as the codelet. This codelet, given the name vecAdd, will be computed on the GPU when matched with an HMPP callsite. Memory is copied from the CPU to the GPU at the start of the codelet and back from the GPU to the CPU at the end. It must be noted that the current compiler, version 2.4.1, does not correctly copy the vector **c** from the GPU to the host at the end of the codelet call, so we must specify the transfer explicitly with **args[c].io=out**. This will be explored in more detail later.

```fortran
!$hmpp vecAdd callsite
call vecAdd(n, a, b, c)
```

The combined **!$hmpp** directive and Fortran subroutine call form what is referred to as the callsite. The callsite will trigger the specified codelet to be run on the GPU.

### Compiling vecAdd-codelet.f90

```shell
$ module load PrgEnv-pgi cudatoolkit capsmc
$ hmpp ftn vecAdd-codelet.f90 -o vecAdd.out
```

**Output:**

The compiler will output the following:

```
hmpp: [Info] Generated codelet filename is "vecadd_cuda.cu".
hmppcg: [Message DPL3000] vecAdd-codelet.f90:9: Loop 'j' was gridified (1D)
```

The compiler tells us that it has created the CUDA file *vecadd_cuda.cu* for the codelet. The second line tells us that the loop starting on line 9 with induction variable 'j' will be parallelized on the GPU and that the kernel will launch with a 1 dimensional grid of thread blocks.

## Additional Codelet Information

Much information is obscured from the programmer, so let's add the **--io-report** hmpp flag to see what memory transfers will take place between the GPU and host.

*C*

```shell
$ hmpp --io-report cc vecAdd-codelet.c -o vecAdd.out
```

*Fortran*

```shell
$ hmpp --io-report ftn vecAdd-codelet.f90 -o vecAdd.out
```

*Output*

```
In GROUP 'vecadd' CODELET 'vecadd' at vecAdd-codelet.c:5, function 'vecAdd'
Parameter 'n' has intent IN
Parameter 'a' has intent IN
Parameter 'b' has intent IN
Parameter 'c' has intent OUT
```

We see that n, **a**, and **b** will be copied into the GPU while **c** will be copied out.

What if we were to omit the intent clause that copies the vector c back to the host in our codelet declaration?

*C*

```c
#pragma hmpp vecAdd codelet, target=CUDA
```

*Fortran*

```fortran
!$hmpp vecAdd codelet, target=CUDA
```

*Output*

```
In GROUP 'vecadd' CODELET 'vecadd' at vecAdd-codelet.c:5, function 'vecAdd'
Parameter 'n' has intent IN
Parameter 'a' has intent IN
Parameter 'b' has intent IN
Parameter 'c' has intent IN
```

We see that the compiler does not correctly copy the vector **c** back to the host. This will cause erroneous results that do not produce any warning or error message. It is vitally important to always check that memory transfers are correct.

## vecAdd-region.c

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char* argv[])
{
    // Size of vectors
    int n = 100000;

    // Input vectors
    double *a;
    double *b;
    // Output vector
    double *c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector
    a = (double*)malloc(bytes);
    b = (double*)malloc(bytes);
    c = (double*)malloc(bytes);

    // Initialize content of input vectors, vector a[i] = sin(i)^2, vector b[i] = cos(i)^2
    int i;
    for (i = 0; i < n; i++) {
        a[i] = sin(i)*sin(i);
        b[i] = cos(i)*cos(i);
    }

    // Sum component wise and save result into vector c
    #pragma hmpp vecAdd region, target=CUDA, args[*].transfer=atcall
    {
        int j;
        for (j = 0; j < n; j++) {
            c[j] = a[j] + b[j];
        }
    }

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0;
    for (i = 0; i < n; i++) {
        sum += c[i];
    }
    sum = sum/n;
    printf("final result: %f\n", sum);

    // Release memory
    free(a);
    free(b);
    free(c);

    return 0;
}
```

### Changes to vecAdd-region.c

```c
#pragma hmpp vecAdd region, target=CUDA, args[*].transfer=atcall
{
    int j;
    for (j = 0; j < n; j++) {
        c[j] = a[j] + b[j];
    }
}
```

The code inside of the hmpp region is computed on the GPU. The region begins with the #pragma hmpp region directive and is enclosed in curly brackets. Memory is copied from the CPU to the GPU at the start of the region and back from the GPU to the CPU at the end of the region.

### Compiling vecAdd-region.c

```shell
$ hmpp --io-report cc vecAdd-region.c -o vecAdd.out
```

**Output:**

The compiler will output the following:

```
In GROUP 'vecadd' REGION 'vecadd' at vecAdd-region.c:34, function '__hmpp_region__vecadd'
Parameter 'n' has intent IN
Parameter 'a' has intent IN
Parameter 'b' has intent IN
Parameter 'c' has intent INOUT
```

We see that n, **a**, **b**, and **c** will be copied into the GPU while **c** will be copied out. This will produce the correct output although the GPU is doing the extra work of copying over the content of **c** to the GPU when it is unnecessary. Memory management will be looked at further in the Game of Life tutorial.

```
hmpp: [Info] Generated codelet filename is "vecadd_cuda.cu".
hmppcg: [Message DPL3000] vecAdd-region.c:37: Loop 'j' was gridified (1D)
```

The compiler tells us that it has created the CUDA file *vecadd_cuda.cu* for the region. The second line tells us that the loop starting on line 37 with induction variable ‘j’ will be parallelized on the GPU and that the kernel will launch with a 1 dimensional grid of thread blocks.

## vecAdd-region.f90

*Modifications from the CPU version are discussed below*

```fortran
program main
  implicit none
  ! Size of vectors
  integer :: n = 100000
  ! Input vectors
  real(8),dimension(:),allocatable :: a
  real(8),dimension(:),allocatable :: b
  ! Output vector
  real(8),dimension(:),allocatable :: c

  integer :: i
  real(8) :: sum

  ! Allocate memory for each vector
  allocate(a(n))
  allocate(b(n))
  allocate(c(n))

  ! Initialize content of input vectors, vector a(i) = sin(i)^2, vector b(i) = cos(i)^2
  do i=1,n
    a(i) = sin(i*1D0)*sin(i*1D0)
    b(i) = cos(i*1D0)*cos(i*1D0)
  enddo

  ! Sum component wise and save result into vector c
  !$hmpp vecAdd region, target=CUDA, args[*].transfer=atcall
  do i=1,n
    c(i) = a(i) + b(i)
  enddo
  !$hmpp vecAdd endregion

  ! Sum up vector c and print result divided by n, this should equal 1 within error
  sum = 0d0
  do i=1,n
    sum = sum + c(i)
  enddo
  sum = sum/n
  print *, 'final result: ', sum

  ! Release memory
  deallocate(a)
  deallocate(b)
  deallocate(c)
end program
```

### Changes to vecAdd-region.f90

```fortran
!$hmpp vecAdd region, target=CUDA, args[*].transfer=atcall
do i=1,n
  c(i) = a(i) + b(i)
enddo
!$hmpp vecAdd endregion
```

The code inside of the hmpp region is computed on the GPU. The region begins with the **!$hmpp region** directive and ends with the **!$hmpp endregion** directive. Memory is copied from the CPU to the GPU at the start of the region and back from the GPU to the CPU at the end of the region.

### Compiling vecAdd-region.f90

```shell
$ hmpp --io-report ftn vecAdd-region.f90 -o vecAdd.out
```

**Output**

The compiler will output the following:

```
In GROUP 'vecadd' REGION 'vecadd' at vecAdd-region.f90:28, function 'hmpp_region__vecadd'
Parameter 'n' has intent IN
Parameter 'n_1' has intent IN
Parameter 'n_2' has intent IN
Parameter 'a' has intent IN
Parameter 'n_4' has intent IN
Parameter 'n_5' has intent IN
Parameter 'b' has intent IN
Parameter 'n_7' has intent IN
Parameter 'n_8' has intent IN
Parameter 'c' has intent INOUT
Parameter 'i' has intent INOUT
```

The current HMPP compiler doesn’t do too well with the Fortran version, copying in several variables that are not used. For now we will need to ignore these erroneous variable copies and their associated warning messages. We do see that n, **a**, **b**, and **c** will be copied into the GPU while **c** will be copied out. This will produce the correct output although the GPU is doing the extra work of copying over the content of **c** to the GPU when it is unnecessary. Memory management will be looked at further in the Game of Life tutorial.

```
hmpp: [Info] Generated codelet filename is "vecadd_cuda.cu".
hmppcg: [Message DPL3000] vecAdd-region.f90:29: Loop 'i' was gridified (1D)
```

The compiler tells us that it has created the CUDA file *vecadd_cuda.cu* for the region. The second line tells us that the loop starting on line 29 with induction variable ‘i’ will be parallelized on the GPU and that the kernel will launch with a 1 dimensional grid of thread blocks.