This tutorial shows a minimal conversion from our vector addition CPU code to PGI CUDA Fortran. Consider this a CUDA Fortran ‘Hello World’. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov
vecAdd.cuf
module kernel
contains
    ! CUDA kernel. Each thread takes care of one element of c
    attributes(global) subroutine vecAdd_kernel(n, a, b, c)
        integer, value :: n
        real(8), device :: a(n), b(n), c(n)
        integer :: id

        ! Get our global thread ID
        id = (blockidx%x-1)*blockdim%x + threadidx%x

        ! Make sure we do not go out of bounds
        if (id <= n) then
            c(id) = a(id) + b(id)
        endif
    end subroutine vecAdd_kernel
end module kernel

program main
    use cudafor
    use kernel

    type(dim3) :: blockSize, gridSize
    real(8) :: sum
    integer :: i

    ! Size of vectors
    integer :: n = 100000

    ! Host input vectors
    real(8),dimension(:),allocatable :: h_a
    real(8),dimension(:),allocatable :: h_b
    ! Host output vector
    real(8),dimension(:),allocatable :: h_c

    ! Device input vectors
    real(8),device,dimension(:),allocatable :: d_a
    real(8),device,dimension(:),allocatable :: d_b
    ! Device output vector
    real(8),device,dimension(:),allocatable :: d_c

    ! Allocate memory for each vector on host
    allocate(h_a(n))
    allocate(h_b(n))
    allocate(h_c(n))

    ! Allocate memory for each vector on GPU
    allocate(d_a(n))
    allocate(d_b(n))
    allocate(d_c(n))

    ! Initialize content of input vectors, vector a[i] = sin(i)^2, vector b[i] = cos(i)^2
    do i=1,n
        h_a(i) = sin(i*1D0)*sin(i*1D0)
        h_b(i) = cos(i*1D0)*cos(i*1D0)
    enddo

    ! Implicit copy of host vectors to device
    d_a = h_a(1:n)
    d_b = h_b(1:n)

    ! Number of threads in each thread block
    blockSize = dim3(1024,1,1)

    ! Number of thread blocks in grid
    gridSize = dim3(ceiling(real(n)/real(blockSize%x)),1,1)

    ! Execute the kernel
    call vecAdd_kernel<<<gridSize, blockSize>>>(n, d_a, d_b, d_c)

    ! Implicit copy of device array to host
    h_c = d_c(1:n)

    ! Sum up vector c and print result divided by n; this should equal 1 within error
    sum = 0.0
    do i=1,n
        sum = sum + h_c(i)
    enddo
    sum = sum/real(n)
    print *, 'final result: ', sum

    ! Release device memory
    deallocate(d_a)
    deallocate(d_b)
    deallocate(d_c)

    ! Release host memory
    deallocate(h_a)
    deallocate(h_b)
    deallocate(h_c)
end program main
Changes
Kernel
The kernel is the heart of our CUDA code. When we launch a kernel we specify the number of threads per block (blockdim) and the number of blocks per grid (griddim). The total number of threads launched is blockdim * griddim. Each thread executes one copy of the kernel.
module kernel
contains
    ! CUDA kernel. Each thread takes care of one element of c
    attributes(global) subroutine vecAdd_kernel(n, a, b, c)
        integer, value :: n
        real(8), device :: a(n), b(n), c(n)
        integer :: id

        ! Get our global thread ID
        id = (blockidx%x-1)*blockdim%x + threadidx%x

        ! Make sure we do not go out of bounds
        if (id <= n) then
            c(id) = a(id) + b(id)
        endif
    end subroutine vecAdd_kernel
end module kernel
Let’s take a look at what makes up this simple kernel.
module kernel
contains
end module kernel
The interface for the kernel subroutine must be explicitly known to the calling program; one way of achieving this is to place the subroutine in a module.
attributes(global) subroutine vecAdd_kernel(n, a, b, c)
The global attribute specifies that this is a CUDA kernel; otherwise, normal Fortran subroutine syntax is used.
! Get our global thread ID
id = (blockidx%x-1)*blockdim%x + threadidx%x
We calculate the thread's global ID using CUDA Fortran's built-in derived types. blockidx holds the block's position in the grid, ranging from 1 to griddim; threadidx is the thread's index within its associated block, ranging from 1 to blockdim.
if (id <= n) then
Unless the array length is evenly divisible by blockdim, we will launch more threads than there are valid array elements. To avoid stepping past the end of the array, we simply test that the global thread ID does not exceed the length of the array.
c(id) = a(id) + b(id)
The thread ID is used to index the arrays that reside in global device memory. Each thread loads one value from a and b and writes their sum to c.
Modules
use cudafor
use kernel
cudafor is a PGI-provided module that supplies CUDA-specific definitions such as the dim3 derived type. The kernel module we created must also be used.
Array Pointers
! Host input vectors
real(8),dimension(:),allocatable :: h_a
real(8),dimension(:),allocatable :: h_b
! Host output vector
real(8),dimension(:),allocatable :: h_c

! Device input vectors
real(8),device,dimension(:),allocatable :: d_a
real(8),device,dimension(:),allocatable :: d_b
! Device output vector
real(8),device,dimension(:),allocatable :: d_c
Because the host CPU and the GPU have separate memory spaces, we need two sets of arrays: one for the host and one for the device. Here the h_ and d_ prefixes differentiate them. Notice that the device arrays carry the device attribute.
Allocate device memory
! Allocate memory for each vector on GPU
allocate(d_a(n))
allocate(d_b(n))
allocate(d_c(n))
Because these arrays were declared with the device attribute, the allocate statements reserve memory in the GPU's global memory space.
Copy to device
! Implicit copy of host vectors to device
d_a = h_a(1:n)
d_b = h_b(1:n)
After we initialize the data on the host, we must copy it to the device; the assignment syntax above does this implicitly. This is convenient, but care must be taken: the operation can be very time consuming, as it initiates a DMA transfer of the data from host memory, over the PCI bus, to device memory.
Thread mapping
! Number of threads in each thread block
blockSize = dim3(1024,1,1)

! Number of thread blocks in grid
gridSize = dim3(ceiling(real(n)/real(blockSize%x)),1,1)
To launch our kernel we must specify the number of threads per block and the number of blocks in the grid. The maximum value of each depends on the device's compute capability. Normally the block size is chosen based on kernel memory requirements, and the grid size is calculated so that enough threads are launched to cover the kernel domain. dim3 is a CUDA Fortran derived type with three dimensions; since we are dealing with a one-dimensional block and grid, we set the other two dimensions to 1.
Launch kernel
call vecAdd_kernel<<< gridSize, blockSize >>>(n, d_a, d_b, d_c)
We launch our kernel with a modified Fortran subroutine-call syntax that lets us specify the grid and block sizes. Kernel calls are non-blocking: control returns to the host immediately after the launch.
Copy to host
! Implicit copy of device array to host
h_c = d_c(1:n)
The implicit copy from the device to the host blocks until the kernel has completed.
Release device memory
deallocate(d_a)
deallocate(d_b)
deallocate(d_c)
Device memory is deallocated in the same programmatic manner as host-allocated arrays.
Compiling
$ module load cudatoolkit
$ ftn vecAdd.cuf -o vecAdd.out
Running
$ aprun ./vecAdd.out
 final result:    1.000000