This sample shows a minimal conversion of our vector addition CPU code to a PGI accelerator directives version. Consider this a PGI Accelerator ‘Hello World.’ Modifications from the CPU version will be highlighted and briefly discussed. Please direct any questions or comments to help@nccs.gov

This tutorial covers PGI Accelerator directives, If you are interested in PGI OpenACC support please see: OpenACC Vector Addition

```#include <stdio.h>;
#include <stdlib.h>;
#include <math.h>;

int main( int argc, char* argv[] )
{

// Size of vectors
int n = 100000;

// Input vectors
double *restrict a;
double *restrict b;
// Output vector
double *restrict c;

// Size, in bytes, of each vector
size_t bytes = n*sizeof(double);

// Allocate memory for each vector
a = (double*)malloc(bytes);
b = (double*)malloc(bytes);
c = (double*)malloc(bytes);

// Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
int i;
for(i=0; i&lt;n; i++) {
a[i] = sin(i)*sin(i);
b[i] = cos(i)*cos(i);
}

// sum component wise and save result into vector c
#pragma acc region
{
for(i=0; i&lt;n; i++) {
c[i] = a[i] + b[i];
}
}

// Sum up vector c and print result divided by n, this should equal 1 within error
double sum = 0;
for(i=0; i&lt;n; i++) {
sum += c[i];
}
sum = sum/n;
printf(&quot;final result: %f\n&quot;, sum);

// Release memory
free(a);
free(b);
free(c);

return 0;
}
```

```// Input vectors
double *restrict a;
double *restrict b;
// Output vector
double *restrict c;
```

The restrict keyword is necessary for all vectors that will be sent to, or retrieved from, the GPU. This is a non enforced guarantee to the compiler that the pointers are not aliased.

```// sum component wise and save result into vector c
#pragma acc region
{
for(i=0; i&amp;lt;n; i++) {
c[i] = a[i] + b[i];
}
}
```

The code inside of the acc region is computed on the GPU. The region begins with the #pragma acc region directive and is enclosed in curly brackets. Memory is copied from the CPU to the GPU at the start of the region and back from the GPU to the CPU at the end of the region as deemed necessary by the compiler.

We add the target accelerator flag to specify we want to compile for NVIDIA accelerators

```\$ module load cudatoolkit
```

```\$ aprun ./vecAdd.out
final result: 1.000000
```

```program main

! Size of vectors
integer :: n = 100000

! Input vectors
real,dimension(:),allocatable :: a
real,dimension(:),allocatable :: b
! Output vector
real,dimension(:),allocatable :: c

integer :: i
real :: sum

! Allocate memory for each vector
allocate(a(n))
allocate(b(n))
allocate(c(n))

! Initialize content of input vectors, vector a[i] = sin(i)^2 vector b[i] = cos(i)^2
do i=1,n
a(i) = sin(i*1.0)*sin(i*1.0)
b(i) = cos(i*1.0)*cos(i*1.0)
enddo

! Sum component wise and save result into vector c

!\$acc region
do i=1,n
c(i) = a(i) + b(i)
enddo
!\$acc end region

! Sum up vector c and print result divided by n, this should equal 1 within error
do i=1,n
sum = sum +  c(i)
enddo
sum = sum/n
print *, 'final result: ', sum

! Release memory
deallocate(a)
deallocate(b)
deallocate(c)

end program
```

```    !\$acc region
do i=1,n
c(i) = a(i) + b(i)
enddo
!\$acc end region
```

The code inside of the acc region is computed on the GPU. The region begins with the !acc region directive and ends with the !acc end region directive. Memory is copied from the CPU to the GPU at the start of the region and back from the GPU to the CPU at the end of the region.

We add the target accelerator flag to specify we want to compile for NVIDIA accelerators

```\$ module load cudatoolkit
```

```\$ aprun ./vecAdd.out
final result: 1.000000
```

Much information is obscured from the programmer so let’s add the Minfo compiler flag to see what the compiler is doing. With the Minfo flag we will see memory transfer and thread placement information.

Compiling
C

```\$ cc -ta=nvidia -Minfo vecAdd.c -o vecAdd.out
```

Fortran

```\$ ftn -ta=nvidia -Minfo vecAdd.f90 -o vecAdd.out
```

Output

```main:
33, Generating copyin(b[0:99999])
Generating copyin(a[0:99999])
Generating copyout(c[0:99999])
Generating compute capability 1.0 binary
Generating compute capability 1.3 binary
35, Loop is parallelizable
Accelerator kernel generated
35, #pragma acc for parallel, vector(256)
CC 1.0 : 5 registers; 20 shared, 36 constant, 0 local memory bytes; 100 occupancy
CC 1.3 : 5 registers; 20 shared, 36 constant, 0 local memory bytes; 100 occupancy
```

Breaking this output down:

```33, Generating copyin(b[0:99999])
Generating copyin(a[0:99999])
Generating copyout(c[0:99999])
```

We see that at line 33, the start of our acc region, that elements 0 to 99999 of the vectors a and b will be copied to the GPU. Vector c does not need to be copied into the GPU but does need to come out and we see it has been correctly handled by the compiler.

```Generating compute capability 1.0 binary
Generating compute capability 1.3 binary
```

Next the compiler tells us it has generated binaries for both compute capability 1.0 and compute capability 1.3 devices. The binary with the highest compute capability less than or equal to the GPU it is being run on will be used, allowing the executable to be portable yet highly tuned.

```35, Loop is parallelizable
Accelerator kernel generated
```

Starting with line 35, the line containing the for/do loop statement, that the compiler has found the loop parallelizable and generated a GPU kernel. Let’s break down the provided information.

```35, #pragma acc for parallel, vector(256)
```

In CUDA terminology this translates to a kernel that has a block size of 256, that is, 256 threads will be in each logical thread block.

```CC 1.0 : 5 registers; 20 shared, 36 constant, 0 local memory bytes; 100 occupancy
```
• CC 1.0: Compute capability the following information applies to
• 5 registers: Number of registers per thread
• 20 shared: Number of shared memory variables each thread block will use.
• 36 constant Number of cached constants the kernel will use.
• 0 local memory bytes: Bytes stored in local memory. Local memory is located off SM and is used by the compiler to alleviate register pressure.
• 100 occupancy: Percentage of possible warps that are active. High occupancy is a necessary but not sufficient condition for high GPU performance.

This sample shows the Game of Life CPU code converted to a PGI accelerator directive version. Please direct any questions or comments to help@nccs.gov

This tutorial covers PGI Accelerator directives, If you are interested in PGI OpenACC support please see: OpenACC Vector Addition

## GOL.c

```#include <stdio.h>
#include <stdlib.h>

#define SRAND_VALUE 1985

int main(int argc, char* argv[])
{
int i,j,iter;
// Linear game grid dimension excluding ghost cells
int dim = 1024;
// Number of game iterations
int maxIter = 1<<10;

// Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
int **restrict grid = (int**) malloc( sizeof(int*) * (dim+2) );
grid = (int*) malloc(sizeof(int) * (dim+2)*(dim+2));
for(i = 1; i < dim+2; i++) grid[i] = grid[i-1]+(dim+2);

// Allocate newGrid
int **restrict newGrid = (int**) malloc( sizeof(int*) * (dim+2) );
newGrid = (int*) malloc(sizeof(int) * (dim+2)*(dim+2));
for(i = 1; i < dim+2; i++) newGrid[i] = newGrid[i-1]+(dim+2);

srand(SRAND_VALUE);
for(i = 1; i <= dim; i++) {
for(j = 1; j <= dim; j++) {
grid[i][j] = rand() % 2;
}
}

#pragma acc region copy(grid[0:dim+1][0:dim+1]) local(newGrid[0:dim+1][0:dim+1])
{
// Main game loop
for (iter = 0; iter<maxIter; iter++) {
// Left-Right columns
for (i = 1; i <= dim; i++) {
grid[i] = grid[i][dim];   //Copy first real column to right ghost column
grid[i][dim+1] = grid[i]; //Copy last real column to left ghost column
}
// Top-Bottom rows
for (j = 0; j <= dim+1; j++) {
grid[j] = grid[dim][j];   //Copy first real row to bottom ghost row
grid[dim+1][j] = grid[j]; //Copy last real row to top ghost row
}

// Now we loop over all cells and determine their fate
for (i = 1; i <= dim; i++) {
for (j = 1; j <= dim; j++) {
// Get the number of neighbors for a given grid point
int numNeighbors = grid[i+1][j] + grid[i-1][j] //upper lower
+ grid[i][j+1] + grid[i][j-1] //right left
+ grid[i+1][j+1] + grid[i-1][j-1] //diagonals
+ grid[i-1][j+1] + grid[i+1][j-1];

// Here we have explicitly all of the game rules
if (grid[i][j] == 1 && numNeighbors < 2)
newGrid[i][j] = 0;
else if (grid[i][j] == 1 && (numNeighbors == 2 || numNeighbors == 3))
newGrid[i][j] = 1;
else if (grid[i][j] == 1 && numNeighbors > 3)
newGrid[i][j] = 0;
else if (grid[i][j] == 0 && numNeighbors == 3)
newGrid[i][j] = 1;
else
newGrid[i][j] = grid[i][j];
}
}
// Can't switch pointers so we mannually have to copy array over
for(i = 1; i <= dim; i++) {
for(j = 1; j <= dim; j++) {
grid[i][j] = newGrid[i][j];
}
}
}
}//end Acc region

// Sum up alive cells and print results
int total = 0;
for (i = 1; i <= dim; i++) {
for (j = 1; j <= dim; j++) {
total += grid[i][j];
}
}
printf("Total Alive: %d\n", total);

// Release memory
free(grid);
free(newGrid);

return 0;
}
```

### Changes to GOL.c

```// Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
int **restrict grid = (int**) malloc( sizeof(int*) * (dim+2) );
grid = (int*) malloc(sizeof(int) * (dim+2)*(dim+2));
for(i = 1; i < dim+2; i++) grid[i] = grid[i-1]+(dim+2);

// Allocate newGrid
int **restrict newGrid = (int**) malloc( sizeof(int*) * (dim+2) );
newGrid = (int*) malloc(sizeof(int) * (dim+2)*(dim+2));
for(i = 1; i < dim+2; i++) newGrid[i] = newGrid[i-1]+(dim+2);
```

Memory transfers between host and device can be very costly and so optimizing memory transfers in any way possible is highly recommended. One way to possibly speed this transfer up is to allocate multidimensional arrays in a contiguous chunk of memory, as shown here. As discussed in the vector addition example our pointers must have restrict qualifier as a guarantee to the compiler that we are not aliasing our pointers.

```#pragma acc region copy(grid[0:dim+1][0:dim+1]) local(newGrid[0:dim+1][0:dim+1])
```

We start the acc region before our main game loop starts even though we can not parallelize this loop. If we were to place the acc region inside of the main game loop the game grids would be copied back and forth between the host and device each iteration of the loop. We let the compiler know that grid is to be copied to the device at the start of the directive and from the device at the end of the directive. newGrid can reside solely on the device and so we specify it to be local. Another way to approach this is the data region approach looked at below.

```// Get the number of neighbors for a given grid point
int numNeighbors = grid[i+1][j] + grid[i-1][j] //upper lower
+ grid[i][j+1] + grid[i][j-1] //right left
+ grid[i+1][j+1] + grid[i-1][j-1] //diagonals
+ grid[i-1][j+1] + grid[i+1][j-1];
```

In the CUDA programming model all functions must be effectively inlined. Here we manually inline our function to calculate the number of alive neighbors a cell has.

```// Can't switch pointers so we mannually have to copy array over
for(i = 1; i <= dim; i++) {
for(j = 1; j <= dim; j++) {
grid[i][j] = newGrid[i][j];
}
}
```

This modification is very important, device pointers are not accessible inside of the acc region. If you leave the pointer swap in place from the cpu code the compiler will not complain, the code will run, and you will get incorrect results. This is not a limitation of GPU programming as you will see in the CUDA GOL example but of the PGI directive framework. The easiest way to fix this is to replace the pointer swap with a copy operation as we have done. The cost of this operation and possible workarounds are investigated below.

```}//end Acc region
```

At the end of our acc region the game grid is copied back to the host via DMA.

### Compiling GOL.c

We add the time option to the target accelerator flag. This flag will print out GPU timing information at runtime. We will also specify the option cc13 to tell the compiler to only generate code for compute capability 1.3, this will make the output slightly more digestible.

```\$ module load cudatoolkit
\$ cc -ta=nvidia,time,cc20 -Minfo GOL.c -o GOL.out
```

Output:

```main:
31, Generating local(newGrid[:dim+1][:dim+1])
Generating copy(grid[:dim+1][:dim+1])
Generating compute capability 1.3 binary
```

The memory transfer between the host and accelerator, as well as the compute capability, has been explicitly stated by us.

```     34, Loop carried dependence due to exposed use of 'grid[0:1023][1:1025]' prevents parallelization
...
Loop carried dependence due to exposed use of 'grid[1:1024][1:1024]' prevents parallelization
Loop carried dependence due to exposed use of 'newGrid[1:1024][1:1024]' prevents parallelization
Sequential loop scheduled on host
```

In the current version we wrap the acc region directive around the entire game loop even though the game loop must be executed serially. The compiler correctly identifies it can not parallelize this loop.

```     36, Loop is parallelizable
Accelerator kernel generated
36, #pragma acc for parallel, vector(32)
Non-stride-1 accesses for array 'grid'
CC 1.3 : 4 registers; 24 shared, 116 constant, 0 local memory bytes; 25 occupancy
41, Loop is parallelizable
Accelerator kernel generated
41, #pragma acc for parallel, vector(256)
CC 1.3 : 6 registers; 24 shared, 116 constant, 0 local memory bytes; 100 occupancy
```

To copy over the ghost cells our two loops get converted into two accelerator kernels. We notice that for the first loop we get the line Non-stride-1 accesses for array ‘grid’, indicating that our memory access is not accessed in the best possible way. By starting our loop index at 1 instead of 0 we get non coalesced memory access which degrades performance. Coalescing requirements are dependent on compute capability and can be found in Appendix G of the CUDA_C_Programming_guide.

```     47, Loop is parallelizable
48, Loop is parallelizable
Accelerator kernel generated
47, #pragma acc for parallel, vector(16)
48, #pragma acc for parallel, vector(16)
Cached references to size [18x18] block of 'grid'
Using register for 'newGrid'
CC 1.3 : 20 registers; 1328 shared, 140 constant, 0 local memory bytes; 75 occupancy
```

The main computation takes place here. The nested loops are mapped onto 16×16 thread blocks. A software managed cache is used to minimize global memory access of grid, this is particularly important in devices with compute capability less than 2.0, as they do not have a global memory cache.

```     68, Loop is parallelizable
69, Loop is parallelizable
Accelerator kernel generated
68, #pragma acc for parallel, vector(16)
69, #pragma acc for parallel, vector(16)
CC 1.3 : 7 registers; 28 shared, 120 constant, 0 local memory bytes; 100 occupancy
```

Lastly we see the kernel is generated for the grid/newGrid switch. Not being able to switch pointers is a limitation of the current PGI accelerate implimentation and not inherent of CUDA, we will measure the impact this extra memory transfer has below.

## GOL.f90

```program main
implicit none

integer :: i,j,iter,seed(8),numNeighbors,total
real :: randm
! Linear game grid dimension
integer :: dim = 1024
! Number of game iterations
integer :: maxIter = 2**10

! Game grid pointers
integer,dimension(:,:),allocatable :: grid, newGrid

! Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
allocate(grid(dim+2,dim+2))
allocate(newGrid(dim+2,dim+2))

seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/)
call random_seed(PUT=seed)
do j=1,dim
do i=1,dim
call random_number(randm)
grid(i,j) = nint(randm)
enddo
enddo

! Main game loop

!\$acc region copy(grid(1:dim+2,1:dim+2)) local(newGrid(1:dim+2,1:dim+2))
do iter=1,maxIter
! Top-Bottom ghost rows
do j=2,dim+1
grid(1,j) = grid(dim+1,j) !Copy first game grid row to bottom ghost row
grid(dim+2,j) = grid(2,j) !Copy first game grid row to top ghost row
enddo

! Left-Right ghost columns
do i=1,dim+2
grid(i,1) = grid(i, dim+1) !Copy first game grid column to right ghost column
grid(i,dim+2) = grid(i,2)  !Copy last game grid column to left ghost column
enddo

! Now we loop over all cells and determine their fate

!\$acc do independent
do j=2,dim+1
!\$acc do independent
do i=2,dim+1
! Get the number of neighbors for a given grid point
numNeighbors = grid(i,j+1) + grid(i,j-1)&     !left & right
+ grid(i+1,j) + grid(i-1,j)&     !upper & lower
+ grid(i+1,j+1) + grid(i-1,j-1)& !diagonals
+ grid(i-1,j+1) + grid(i+1,j-1)

! Here we have explicitly all of the game rules
if(grid(i,j) == 1 .AND. numNeighbors < 2) then
newGrid(i,j) = 0
elseif(grid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
newGrid(i,j) = 1
elseif(grid(i,j) == 1 .AND. numNeighbors > 3) then
newGrid(i,j) = 0
elseif(grid(i,j) == 0 .AND. numNeighbors == 3) then
newGrid(i,j) = 1
else
newGrid(i,j) = grid(i,j)
endif
enddo
enddo

! Can't switch pointers so we mannually have to copy array over
!\$acc do independent
do j=2,dim+1
!\$acc do independent
do i=2,dim+1
grid(i,j) = newGrid(i,j)
enddo
enddo

enddo!end main game loop
!\$acc end region

! Sum up alive cells and print results
total = 0
do j=2,dim+1
do i=2,dim+1
total = total + grid(i,j)
enddo
enddo
print *, "Total Alive", total

! Release memory
deallocate(grid)
deallocate(newGrid)

end program
```

### Changes to GOL.f90

```! Game grid pointers
integer,dimension(:,:),allocatable :: grid, newGrid
```

The pointer attribute found in the CPU version is changed to allocatable, we will see that we can not use pointers as we desire on the device.

```!\$acc region copy(grid(1:dim+2,1:dim+2)) local(newGrid(1:dim+2,1:dim+2))
```

We start the acc region before our main game loop starts even though we can not parallelize this loop. If we were to place the acc region inside of the main game loop the game grids would be copied back and forth between the host and device each iteration of the loop. We let the compiler know that grid is to be copied to the device at the start of the directive and from the device at the end of the directive. newGrid can reside solely on the device and so we specify it to be local. Another way to approach this is the data region approach looked at below.

```!\$acc do independent
do j=2,dim+1
!\$acc do independent
do i=2,dim+1
```
```!\$acc do independent
do j=2,dim+1
!\$acc do independent
do i=2,dim+1
```

The acc do directive applies only to the loop immediately following it and is used, with additional instructions, to guide the compiler in mapping the code to the target accelerator. In our case we specify that each iteration of the do loop is independent of one another. Without these guiding directives the compiler will complain that grid and newGrid carry a dependency that prevents parallelization, we the programmer know this is not true however. The game grid is only updated at the end of each game iteration and so the specified loop iterations are independent.

```! Get the number of neighbors for a given grid point
numNeighbors = grid(i,j+1) + grid(i,j-1)&     !left & right
+ grid(i+1,j) + grid(i-1,j)&     !upper & lower
+ grid(i+1,j+1) + grid(i-1,j-1)& !diagonals
+ grid(i-1,j+1) + grid(i+1,j-1)
```

In the CUDA programming model all functions must be effectively inlined. Here we manually inline our function to calculate the number of alive neighbors a cell has.

```!\$acc end region
```

At the end of our acc region the game grid is copied back to the host via DMA.

### Compiling GOL.f90

We add the time option to the target accelerator flag. This flag will print out GPU timing information at runtime. We will also specify the option cc13 to tell the compiler to only generate code for compute capability 1.3, this will make the output slightly more digestible.

```\$ module load cudatoolkit
\$ ftn -ta=nvidia,time,cc20 -Minfo GOL.f90 -o GOL.out
```

Output:

```main:
30, Generating local(newgrid(1:dim+2,1:dim+2))
Generating copy(grid(1:dim+2,1:dim+2))
Generating compute capability 1.3 binary
```

The memory transfer between the host and accelerator, as well as the compute capability, has been explicitly stated by us.

```     31, Loop carried dependence due to exposed use of 'grid(1:dim+2,2)' prevents parallelization
...
Loop carried dependence due to exposed use of 'grid(1:dim+2,dim+1)' prevents parallelization
Loop carried dependence due to exposed use of 'newgrid(1:dim,1:dim)' prevents parallelization
```

In the current version we wrap the acc region directive around the entire game loop even though the game loop must be executed serially. The compiler correctly identifies it can not parallelize this loop.

```     33, Loop is parallelizable
Accelerator kernel generated
33, !\$acc do parallel, vector(32)
Non-stride-1 accesses for array 'grid'
CC 1.3 : 5 registers; 24 shared, 148 constant, 0 local memory bytes; 25 occupancy
39, Loop is parallelizable
Accelerator kernel generated
39, !\$acc do parallel, vector(256)
CC 1.3 : 8 registers; 24 shared, 148 constant, 0 local memory bytes; 100 occupancy
```

To copy over the ghost cells our two loops get converted into two accelerator kernels. We notice that for the first loop we get the line Non-stride-1 accesses for array ‘grid’, indicating that our memory access is not accessed in the best possible way. By starting our loop index at 1 instead of 0 we get non coalesced memory access which degrades performance. Coalescing requirements are dependent on compute capability and can be found in Appendix G of the CUDA_C_Programming_guide.

```     47, Loop is parallelizable
49, Loop is parallelizable
Accelerator kernel generated
47, !\$acc do parallel, vector(16)
49, !\$acc do parallel, vector(16)
Cached references to size [18x18] block of 'grid'
Using register for 'newgrid'
CC 1.3 : 20 registers; 1328 shared, 140 constant, 0 local memory bytes; 75 occupancy
```

The main computation takes place here. The nested loops are mapped onto 16×16 thread blocks via the vector(16) directive and done in parallel on the GPU. We could specify our own block size in the same way if we chose to change it. A software managed cache is used to minimize global memory access of grid, this is particularly important in devices with compute capability less than 2.0, as they do not have a global memory cache.

```     72, Loop is parallelizable
74, Loop is parallelizable
Accelerator kernel generated
72, !\$acc do parallel, vector(16)
74, !\$acc do parallel, vector(16)
CC 1.3 : 7 registers; 28 shared, 120 constant, 0 local memory bytes; 100 occupancy
```

Lastly we see the kernel is generated for the grid/newGrid switch. Not being able to switch pointers is a limitation of the current PGI accelerate implimentation and not inherent of CUDA, we will measure the impact this extra memory transfer has below.

### Running GOL.f90

```\$ aprun gol.out
```

Output
The following timings were computed on a NVIDIA Tesla C2070.

```Total Alive: 45224

Accelerator Kernel Timing data
GOL.c
main
32: region entered 1 time
time(us): total=4460133 init=3498235 region=961898
kernels=752537 data=4349
w/o init: total=961898 max=961898 min=961898 avg=961898
```

The overall timing data is listed first, giving various timings in microseconds. The time is the wallclock time that the accelerator region took including any initialization overhead. time is the initialization time required by the GPU, this varies depending on OS, driver, and GPU. Furthermore the is the raw computational time of all kernels in the region and data is the amount of time spent transfering data from the host to the device and from the device to the host. In our case we have to be careful since some of our kernels souly transfer data, which this overall time does not account for. In addition we have the total time taking into account the init time as well as min, max, and average timings for the region. Since our region is only entered once min, max and avg are all the same.

```        37: kernel launched 1024 times
grid:   block: 
time(us): total=12357 max=34 min=10 avg=12
42: kernel launched 1024 times
grid:   block: 
time(us): total=5579 max=7 min=5 avg=5
```

The kernels generated to copy our ghost cells do not require much time. this is because there is not much to copy and that it is a device to device trasfer. We do notice that the uncoelesced kernel takes approximately twice as long to run as the coelesced. This is expected as Appendix G of the CUDA_C_Programming_guide states for compute capability 2.0 GPUs that for a sequential but misaligned request to global memory, as we have in the non coalesced kernel, each memory transaction will be broken into two seperate requests.

```           49: kernel launched 1024 times
grid: [64x64]  block: [16x16]
time(us): total=523002 max=604 min=496 avg=510
```

Our main computation takes place here and we see it takes the majority of the time.

```        71: kernel launched 1024 times
grid: [64x64]  block: [16x16]
time(us): total=211599 max=210 min=200 avg=206
```

Nearly half of the GPUs computational time is spent doing this memory transfer that replaced our pointer switch. Currently the only way to get rid of this kernel and still use the PGI accelerate directives is to slightly rewrite our algorithm. See if you can figure out how to rewrite the above using minimal modifcations so that we can remove the memory copy kernel, One solution is given below.

## GOL-Try2.c

In this version we will remove the memory copy kernel that we needed to use since PGI accelerate does not currently suppoort device pointers. As an added bonus we will also take a look at the acc data region directive.

```#include <stdio.h>
#include <stdlib.h>

#define SRAND_VALUE 1985

int main(int argc, char* argv[])
{
int i,j,iter,numNeighbors;
int dim = 1024;
// Number of game iterations
int maxIter = 1<<15;

// Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
int **restrict grid = (int**) malloc( sizeof(int*) * (dim+2) );
grid = (int*) malloc(sizeof(int*) * (dim+2)*(dim+2));
for(i = 1; i<dim+2; i++) grid[i] = grid[i-1]+(dim+2);

// Allocate newGrid
int **restrict newGrid = (int**) malloc( sizeof(int*) * (dim+2) );
newGrid = (int*) malloc(sizeof(int*) * (dim+2)*(dim+2));
for(i = 1; i<dim+2; i++) newGrid[i] = newGrid[i-1]+(dim+2);

srand(SRAND_VALUE);

for(i = 1; i<=dim; i++) {
for(j = 1; j<=dim; j++) {
grid[i][j] = (int)rand() % 2;
}
}

#pragma acc data region copy(grid[0:dim+1][0:dim+1]) local(newGrid[0:dim+1][0:dim+1])
{
// Main game loop
for (iter = 0; iter<maxIter; iter++) {

if(iter%2==0) {
#pragma acc region
{
// Left-Right columns
for (i = 1; i<=dim; i++) {
grid[i] = grid[i][dim]; //Copy first real column to right ghost column
grid[i][dim+1] = grid[i]; //Copy last real column to left ghost column
}

// Note the o to dim+1 to accounts for the "corners"
// Top-Bottom rows
for (j = 0; j<=dim+1; j++) {
grid[j] = grid[dim][j]; //Copy first real row to bottom ghost row
grid[dim+1][j] = grid[j]; //Copy last real row to top ghost row
}

// Now we loop over all cells and determine their fate
for (i = 1; i<=dim; i++) {
for (j = 1; j<=dim; j++) {

// Get the number of neighbors for a given grid point
numNeighbors = grid[i+1][j] + grid[i-1][j] //upper lower
+ grid[i][j+1] + grid[i][j-1] //right left
+ grid[i+1][j+1] + grid[i-1][j-1]  //diagonals
+ grid[i-1][j+1] + grid[i+1][j-1];

// Here we have explicitly all of the game rules
if (grid[i][j] == 1 && numNeighbors < 2)
newGrid[i][j] = 0;
else if (grid[i][j] == 1 && (numNeighbors == 2 || numNeighbors == 3))
newGrid[i][j] = 1;
else if (grid[i][j] == 1 && numNeighbors > 3)
newGrid[i][j] = 0;
else if (grid[i][j] == 0 && numNeighbors == 3)
newGrid[i][j] = 1;
else
newGrid[i][j] = grid[i][j];
}
}

}//end acc region

}//end if
else {
#pragma acc region
{
// Left-Right columns
for (i = 1; i<=dim; i++) {
newGrid[i] = newGrid[i][dim]; //Copy first real column to right ghost column
newGrid[i][dim+1] = newGrid[i]; //Copy last real column to left ghost column
}

// Note the o to dim+1 to accounts for the "corners"
// Top-Bottom rows
for (j = 0; j<=dim+1; j++) {
newGrid[j] = newGrid[dim][j]; //Copy first real row to bottom ghost row
newGrid[dim+1][j] = newGrid[j]; //Copy last real row to top ghost row
}

// Now we loop over all cells and determine their fate
for (i = 1; i<=dim; i++) {
for (j = 1; j<=dim; j++) {

// Get the number of neighbors for a given grid point
numNeighbors = newGrid[i+1][j] + newGrid[i-1][j] //upper lower
+ newGrid[i][j+1] + newGrid[i][j-1] //right left
+ newGrid[i+1][j+1] + newGrid[i-1][j-1]  //diagonals
+ newGrid[i-1][j+1] + newGrid[i+1][j-1];

// Here we have explicitly all of the game rules
if (newGrid[i][j] == 1 && numNeighbors < 2)
grid[i][j] = 0;
else if (newGrid[i][j] == 1 && (numNeighbors == 2 || numNeighbors == 3))
grid[i][j] = 1;
else if (newGrid[i][j] == 1 && numNeighbors > 3)
grid[i][j] = 0;
else if (newGrid[i][j] == 0 && numNeighbors == 3)
grid[i][j] = 1;
else
grid[i][j] = newGrid[i][j];
}
}

}//end acc region
}//end else

}//end iter loop
}//end data region

// Sum up alive cells and print results
int total = 0;
for (i = 1; i<=dim; i++) {
for (j = 1; j<=dim; j++) {
total += grid[i][j];
}
}

printf("Total Alive: %d\n", total);

free(grid);
free(newGrid);

return 0;
}
```

### Changes in GOL-Try2.c

```#pragma acc data region copy(grid[0:dim+1][0:dim+1]) local(newGrid[0:dim+1][0:dim+1])
{
```
```}//end data region
```

Although it is not neccesary to use in this program the acc data region is an important aspect of the PGI accelerate directives. When such a directive is encountered data is copied over to the device where it persists until the data region is exited. This directive can take the same options as the acc region in terms of tuning data transfers. While inside of a data region any number of acc regions can be encountered that make use of the data already on the device. This allows the programmer greater control over host to device and device to host data transfers.

Lines 37-121

```    if(iter%2==0) {
#pragma acc region
{
// Left-Right columns
for (i = 1; i<=dim; i++) {
grid[i] = grid[i][dim]; //Copy first real column to right ghost column
grid[i][dim+1] = grid[i]; //Copy last real column to left ghost column
}

// Note the o to dim+1 to accounts for the "corners"
// Top-Bottom rows
for (j = 0; j<=dim+1; j++) {
grid[j] = grid[dim][j]; //Copy first real row to bottom ghost row
grid[dim+1][j] = grid[j]; //Copy last real row to top ghost row
}

// Now we loop over all cells and determine their fate
for (i = 1; i<=dim; i++) {
for (j = 1; j<=dim; j++) {

// Get the number of neighbors for a given grid point
numNeighbors = grid[i+1][j] + grid[i-1][j] //upper lower
+ grid[i][j+1] + grid[i][j-1] //right left
+ grid[i+1][j+1] + grid[i-1][j-1]  //diagonals
+ grid[i-1][j+1] + grid[i+1][j-1];

// Here we have explicitly all of the game rules
if (grid[i][j] == 1 && numNeighbors < 2)
newGrid[i][j] = 0;
else if (grid[i][j] == 1 && (numNeighbors == 2 || numNeighbors == 3))
newGrid[i][j] = 1;
else if (grid[i][j] == 1 && numNeighbors > 3)
newGrid[i][j] = 0;
else if (grid[i][j] == 0 && numNeighbors == 3)
newGrid[i][j] = 1;
else
newGrid[i][j] = grid[i][j];
}
}

}//end acc region

}//end if
else {
#pragma acc region
{
// Left-Right columns
for (i = 1; i<=dim; i++) {
newGrid[i] = newGrid[i][dim]; //Copy first real column to right ghost column
newGrid[i][dim+1] = newGrid[i]; //Copy last real column to left ghost column
}

// Note the o to dim+1 to accounts for the "corners"
// Top-Bottom rows
for (j = 0; j<=dim+1; j++) {
newGrid[j] = newGrid[dim][j]; //Copy first real row to bottom ghost row
newGrid[dim+1][j] = newGrid[j]; //Copy last real row to top ghost row
}

// Now we loop over all cells and determine their fate
for (i = 1; i<=dim; i++) {
for (j = 1; j<=dim; j++) {

// Get the number of neighbors for a given grid point
numNeighbors = newGrid[i+1][j] + newGrid[i-1][j] //upper lower
+ newGrid[i][j+1] + newGrid[i][j-1] //right left
+ newGrid[i+1][j+1] + newGrid[i-1][j-1]  //diagonals
+ newGrid[i-1][j+1] + newGrid[i+1][j-1];

// Here we have explicitly all of the game rules
if (newGrid[i][j] == 1 && numNeighbors < 2)
grid[i][j] = 0;
else if (newGrid[i][j] == 1 && (numNeighbors == 2 || numNeighbors == 3))
grid[i][j] = 1;
else if (newGrid[i][j] == 1 && numNeighbors > 3)
grid[i][j] = 0;
else if (newGrid[i][j] == 0 && numNeighbors == 3)
grid[i][j] = 1;
else
grid[i][j] = newGrid[i][j];
}
}

}//end acc region
}//end else
```

The major change to our code has been to get rid of the grid/newGrid switching kernel. We have explicitly coded the two cases and use the modulus of the main iteration variable to switch between the two pieces of code. This increases the code complexity but does get around the costly memory transfer.

## GOL-Try2.f90

```program main
implicit none

integer :: i,j,iter,seed(8),numNeighbors,total
real :: randm
! Linear game grid dimension
integer :: dim = 1024
! Number of game iterations
integer :: maxIter = 2**10

! Game grid pointers
integer,dimension(:,:),allocatable :: grid, newGrid, tmpGrid

! Allocate square grid of (dim+2)^2 elements, 2 added for ghost cells
allocate(grid(dim+2,dim+2))
allocate(newGrid(dim+2,dim+2))

seed = (/1985, 2011, 2012, 500, 24, 15, 99, 8/)
call random_seed(PUT=seed)
do j=1,dim
do i=1,dim
call random_number(randm)
grid(i,j) = nint(randm)
enddo
enddo

! Main game loop

!\$acc data region copy(grid(1:dim+2,1:dim+2)) local(newGrid(1:dim+2,1:dim+2))
do iter=1,maxITer

if(MOD(iter,2) == 1) then

!\$acc region
! Top-Bottom ghost rows
do j=2,dim+1
grid(1,j) = grid(dim+1,j) !Copy first game grid row to bottom ghost row
grid(dim+2,j) = grid(2,j) !Copy first game grid row to top ghost row
enddo

! Left-Right ghost columns
do i=1,dim+2
grid(i,1) = grid(i, dim+1) !Copy first game grid column to right ghost column
grid(i,dim+2) = grid(i,2)  !Copy last game grid column to left ghost column
enddo

! Now we loop over all cells and determine their fate

!\$acc do independent
do j=2,dim+1
!\$acc do independent
do i=2,dim+1
! Get the number of neighbors for a given grid point
numNeighbors = grid(i,j+1) + grid(i,j-1)& !left & right
+ grid(i+1,j) + grid(i-1,j)& !upper & lower
+ grid(i+1,j+1) + grid(i-1,j-1)& !diagonals
+ grid(i-1,j+1) + grid(i+1,j-1)

! Here we have explicitly all of the game rules
if(grid(i,j) == 1 .AND. numNeighbors < 2) then
newGrid(i,j) = 0
elseif(grid(i,j) == 1 .AND. (numNeighbors == 2 .OR. numNeighbors == 3)) then
newGrid(i,j) = 1
elseif(grid(i,j) == 1 .AND. numNeighbors > 3) then
newGrid(i,j) = 0
elseif(grid(i,j) == 0 .AND. numNeighbors == 3) then
newGrid(i,j) = 1
else
newGrid(i,j) = grid(i,j)
endif
enddo
enddo
!\$acc end region

else

!\$acc region
! Top-Bottom ghost rows
do j=2,dim+1
newGrid(1,j) = newGrid(dim+1,j) !Copy first game grid row to bottom ghost row
newGrid(dim+2,j) = newGrid(2,j) !Copy first game grid row to top ghost row
enddo

! Left-Right ghost columns
do i=1,dim+2
newGrid(i,1) = newGrid(i,dim+1) !Copy first game grid column to right ghost col
newGrid(i,dim+2) = newGrid(i,2) !Copy last game grid column to left ghost col
enddo

! Now we loop over all cells and determine their fate

!\$acc do independent
do j=2,dim+1
!\$acc do independent
do i=2,dim+1
!Get the number of neighbors for a given grid point
numNeighbors = newGrid(i,j+1) + newGrid(i,j-1)& !left and right
+ newGrid(i+1,j) + newGrid(i-1,j)& !upper and lower
+ newGrid(i+1,j+1) + newGrid(i-1,j-1)& !diagonals
+ newGrid(i-1,j+1) + newGrid(i+1,j-1)

!Here we have explicitly all of the game rules
if(newGrid(i,j) == 1 .AND. numNeighbors < 2) then
grid(i,j) = 0
elseif(newGrid(i,j)==1 .AND. (numNeighbors==2 .OR. numNeighbors==3)) then
grid(i,j) = 1
elseif(newGrid(i,j) == 1 .AND. numNeighbors > 3) then
grid(i,j) = 0
elseif(newGrid(i,j) == 0 .AND. numNeighbors == 3) then
grid(i,j) = 1
else
grid(i,j) = newGrid(i,j)
endif
enddo
enddo
!\$acc end region

endif !end grid newGrid switch

enddo!end main game loop
!\$acc end data region

! Sum up alive cells and print results
total = 0
do j=2,dim+1
do i=2,dim+1
total = total + grid(i,j)
enddo
enddo
print *, "Total Alive", total

! Release memory
deallocate(grid)
deallocate(newGrid)

end program
```