This sample shows a minimal conversion of our vector addition CPU code to CUDA C; consider it a CUDA C ‘Hello World’. If you are not yet familiar with basic CUDA concepts please see the Accelerated Computing Guide. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.
vecAdd.cu
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;

    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

int main( int argc, char* argv[] )
{
    // Size of vectors
    int n = 100000;

    // Host input vectors
    double *h_a;
    double *h_b;
    // Host output vector
    double *h_c;

    // Device input vectors
    double *d_a;
    double *d_b;
    // Device output vector
    double *d_c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector on host
    h_a = (double*)malloc(bytes);
    h_b = (double*)malloc(bytes);
    h_c = (double*)malloc(bytes);

    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    int i;
    // Initialize vectors on host
    for( i = 0; i < n; i++ ) {
        h_a[i] = sin(i)*sin(i);
        h_b[i] = cos(i)*cos(i);
    }

    // Copy host vectors to device
    cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int blockSize, gridSize;

    // Number of threads in each thread block
    blockSize = 1024;

    // Number of thread blocks in grid
    gridSize = (int)ceil((float)n/blockSize);

    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Copy array back to host
    cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);

    // Release device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}
Changes
Kernel:
The kernel is the heart of our CUDA code. When a kernel is launched the number of threads per block (blockDim) and the number of blocks per grid (gridDim) are specified. The total number of threads is (blockDim) * (gridDim). Each thread evaluates one copy of the kernel.
// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;

    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}
Let’s take a look at what makes up this simple kernel.
__global__ void vecAdd(double *a, double *b, double *c, int n)
The __global__ qualifier specifies that this is a CUDA kernel; otherwise normal C function syntax is used. The kernel must have return type void.
int id = blockIdx.x*blockDim.x+threadIdx.x;
We calculate the thread's global ID using CUDA-supplied structs. blockIdx contains the block's position in the grid, ranging from 0 to gridDim-1. threadIdx is the thread's index inside its associated block, ranging from 0 to blockDim-1. For convenience blocks and grids can be multidimensional; the associated structs contain x, y, and z members.
if (id < n)
Unless the array length is evenly divisible by blockDim, more threads will be launched than there are valid array elements. For example, with n = 100000 and blockSize = 1024 we launch 98 blocks of 1024 threads (100352 threads in total), so the last 352 threads have no element to work on. To avoid overstepping our array we simply test that our global thread ID is less than the length of our array.
c[id] = a[id] + b[id];
The thread ID is used to index the arrays that reside in global device memory. Each thread loads one value from a and one from b and writes their sum to c.
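Since blocks and grids can be multidimensional, a kernel launched with a 2D configuration computes its global position from the x and y members. The sketch below is illustrative only and is not part of the sample above; the kernel name scale2D and the dimensions nx and ny are made up for the example.

// Illustrative sketch only (not part of vecAdd.cu): 2D thread indexing
__global__ void scale2D(double *data, double factor, int nx, int ny)
{
    // Global x and y positions of this thread
    int ix = blockIdx.x*blockDim.x + threadIdx.x;
    int iy = blockIdx.y*blockDim.y + threadIdx.y;

    // Bounds check in both dimensions, then flatten to a 1D offset
    if (ix < nx && iy < ny)
        data[iy*nx + ix] *= factor;
}

// Launched from the host with dim3 block and grid dimensions, e.g.:
// dim3 block(16, 16);
// dim3 grid((nx + block.x - 1)/block.x, (ny + block.y - 1)/block.y);
// scale2D<<<grid, block>>>(d_data, 2.0, nx, ny);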
Array Pointers:
// Host input vectors
double *h_a;
double *h_b;
// Host output vector
double *h_c;

// Device input vectors
double *d_a;
double *d_b;
// Device output vector
double *d_c;
Because the host CPU and the GPU have separate memory spaces, we must maintain two sets of pointers: one set for our host arrays and one set for our device arrays. Here we use the h_ and d_ prefixes to differentiate them.
cudaMalloc:
// Allocate memory for each vector on GPU
cudaMalloc(&d_a, bytes);
cudaMalloc(&d_b, bytes);
cudaMalloc(&d_c, bytes);
Given a pointer, cudaMalloc allocates memory in the device's global memory space and sets the passed pointer to point to this memory.
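The sample above ignores return codes, but every CUDA runtime call, including cudaMalloc, returns a cudaError_t. A minimal sketch of checking one allocation (not part of the sample) might look like:

// Sketch only: checking the return value of a device allocation
cudaError_t err = cudaMalloc(&d_a, bytes);
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;
}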
cudaMemcpy to device:
// Copy host vectors to device
cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);
After we initialize the data on the host we must copy it over to the device; to do so we use cudaMemcpy. Calling cudaMemcpy initiates a DMA transfer between host and device.
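A common variation, not used in this sample, is to allocate the host buffers as page-locked (pinned) memory with cudaMallocHost; transfers from pageable memory are typically staged through an internal pinned buffer, so pinning the host arrays can speed up cudaMemcpy at the cost of tying up physical memory. The variable name h_a_pinned below is only for illustration.

// Sketch only: pinned host allocation as an alternative to malloc
double *h_a_pinned;
cudaMallocHost(&h_a_pinned, bytes);   // page-locked host memory, faster host<->device copies
// ... initialize h_a_pinned and cudaMemcpy to the device as before ...
cudaFreeHost(h_a_pinned);             // pinned memory is released with cudaFreeHost, not free()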
Thread mapping:
int blockSize, gridSize;

// Number of threads in each thread block
blockSize = 1024;

// Number of thread blocks in grid
gridSize = (int)ceil((float)n/blockSize);
To launch our kernel we must specify the number of threads per block and the number of blocks in our grid. The maximum value of both depends on the device's compute capability. Normally the blockSize is chosen based upon kernel memory requirements and the gridSize is calculated such that enough threads are launched to cover the kernel domain.
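An equivalent way to compute the grid size, shown here only as an aside, avoids the float conversion by using round-up integer division:

// Sketch only: round-up integer division, equivalent to ceil((float)n/blockSize) here
gridSize = (n + blockSize - 1) / blockSize;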
Launch kernel:
// Execute the kernel
vecAdd<<< gridSize, blockSize >>>(d_a, d_b, d_c, n);
We launch our kernel with a modified C function syntax that lets us specify the grid and block sizes. Kernel calls are non-blocking: control returns to the host immediately after the launch.
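Because the launch is non-blocking, launch failures are not reported at the call site. One common pattern, sketched below and not part of the sample, is to query the error state and synchronize before relying on the results:

// Sketch only: checking for launch and execution errors after a kernel call
vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);
cudaError_t launchErr = cudaGetLastError();      // error from the launch itself, if any
cudaError_t syncErr   = cudaDeviceSynchronize(); // wait for the kernel; catches runtime errors
if (launchErr != cudaSuccess || syncErr != cudaSuccess)
    fprintf(stderr, "vecAdd failed: %s\n",
            cudaGetErrorString(launchErr != cudaSuccess ? launchErr : syncErr));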
cudaMemcpy to host:
// Copy array back to host
cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );
cudaMemcpy will block until the kernel is complete and then copy the results from the device back to the host.
Release device memory:
// Release device memory
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);
As with C's malloc, we need to release our device arrays when we are finished; to do so we use cudaFree.
Compiling
The CUDA environment must be set up before compiling, so make sure the CUDA module is loaded:
$ module load cudatoolkit
$ nvcc vecAdd.cu -o vecAdd.out
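Depending on the GPU you are compiling for, you may also need to tell nvcc which compute architecture to target; the value below (sm_35, a Kepler-generation GPU) is only an example and should be matched to your device.

$ nvcc -arch=sm_35 vecAdd.cu -o vecAdd.out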
Running
$ aprun ./vecAdd.out
final result: 1.000000