This sample shows a minimal conversion of our vector addition CPU code to CUDA C; consider it a CUDA C 'Hello World'. If you are not yet familiar with basic CUDA concepts, please see the Accelerated Computing Guide. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov

vecAdd.cu

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;

    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

int main( int argc, char* argv[] )
{
    // Size of vectors
    int n = 100000;

    // Host input vectors
    double *h_a;
    double *h_b;
    //Host output vector
    double *h_c;

    // Device input vectors
    double *d_a;
    double *d_b;
    //Device output vector
    double *d_c;

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector on host
    h_a = (double*)malloc(bytes);
    h_b = (double*)malloc(bytes);
    h_c = (double*)malloc(bytes);

    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    int i;
    // Initialize vectors on host
    for( i = 0; i < n; i++ ) {
        h_a[i] = sin(i)*sin(i);
        h_b[i] = cos(i)*cos(i);
    }

    // Copy host vectors to device
    cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int blockSize, gridSize;

    // Number of threads in each thread block
    blockSize = 1024;

    // Number of thread blocks in grid
    gridSize = (int)ceil((float)n/blockSize);

    // Execute the kernel
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, n);

    // Copy array back to host
    cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

    // Sum up vector c and print the result divided by n; this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);

    // Release device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}

Changes

Kernel:
The kernel is the heart of our CUDA code. When a kernel is launched, the number of threads per block (blockDim) and the number of blocks per grid (gridDim) are specified. The total number of threads is blockDim * gridDim, and each thread evaluates one copy of the kernel.

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;

    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

Let’s take a look at what makes up this simple kernel.

__global__ void vecAdd(double *a, double *b, double *c, int n)

The __global__ qualifier specifies that this is a CUDA kernel; otherwise, normal C function syntax is used. The kernel must have return type void.

int id = blockIdx.x*blockDim.x+threadIdx.x;

We calculate the thread's global ID using CUDA-supplied structs. blockIdx contains the block's position in the grid, ranging from 0 to gridDim-1. threadIdx is the thread's index inside its associated block, ranging from 0 to blockDim-1. For convenience, blocks and grids can be multidimensional; the associated structs contain x, y, and z members.
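For example, a kernel launched with two-dimensional blocks and grids could compute a row and a column index the same way. This is only a sketch to illustrate the multidimensional members; it is not part of this sample:

// Hypothetical 2D indexing, e.g. for a kernel operating on a matrix
int col = blockIdx.x*blockDim.x + threadIdx.x;   // 0 .. gridDim.x*blockDim.x - 1
int row = blockIdx.y*blockDim.y + threadIdx.y;   // 0 .. gridDim.y*blockDim.y - 1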

if (id < n)

Unless the array length is evenly divisible by the block size, we will launch more threads than there are valid array elements. To avoid stepping outside the array, we simply test that our global thread ID is less than the array length.

c[id] = a[id] + b[id];

The thread ID is used to index the arrays that reside in global device memory. Each thread loads a value from a and b and writes the sum to c.

Array Pointers:

// Host input vectors
double *h_a;
double *h_b;
// Host output vector
double *h_c;

// Device input vectors
double *d_a;
double *d_b;
// Device output vector
double *d_c;

Because the host CPU and the GPU have separate memory spaces, we must maintain two sets of pointers: one for our host arrays and one for our device arrays. Here we use the h_ and d_ prefixes to differentiate them.

cudaMalloc:

// Allocate memory for each vector on GPU
cudaMalloc(&d_a, bytes);
cudaMalloc(&d_b, bytes);
cudaMalloc(&d_c, bytes);

Given a pointer, cudaMalloc allocates memory in the device's global memory space and sets the passed pointer to point to this memory.
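cudaMalloc returns a cudaError_t, so in practice it is worth checking the result. A minimal sketch for one of the allocations (error handling is not part of the original listing):

cudaError_t err = cudaMalloc(&d_a, bytes);
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;
}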

cudaMemcpy to device:

// Copy host vectors to device
cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);

After we initialize the data on the host we must copy it to the device, which we do with cudaMemcpy. The arguments are the destination pointer, the source pointer, the number of bytes, and the direction of the transfer; calling cudaMemcpy initiates a DMA transfer between host and device.

Thread mapping:

int blockSize, gridSize;
 
// Number of threads in each thread block
blockSize = 1024;
 
// Number of thread blocks in grid
gridSize = (int)ceil((float)n/blockSize);

To launch our kernel we must specify the number of threads per block and the number of blocks in our grid. The maximum value of each depends on the device's compute capability. Normally blockSize is chosen based on the kernel's memory requirements, and gridSize is calculated so that enough threads are launched to cover the kernel's domain.
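With the values used in this sample, the arithmetic works out as follows:

// n = 100000, blockSize = 1024
// gridSize = ceil(100000 / 1024.0) = ceil(97.65...) = 98
// total threads launched = 98 * 1024 = 100352, so 352 threads fail the id < n test and do nothing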

Launch kernel:

// Execute the kernel
vecAdd<<< gridSize, blockSize >>>(d_a, d_b, d_c, n);

We launch our kernel with a modified C function syntax that lets us specify the grid and block sizes. Kernel launches are non-blocking: control returns to the host immediately while the kernel runs asynchronously on the device.
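Because the launch is asynchronous, launch errors are not reported through a return value. A minimal sketch of one common way to check for them (not part of the original listing):

// Check for kernel errors after the launch
cudaError_t launchErr = cudaGetLastError();      // reports invalid launch configurations
cudaError_t syncErr   = cudaDeviceSynchronize(); // waits for the kernel, reports execution errors
if (launchErr != cudaSuccess || syncErr != cudaSuccess)
    fprintf(stderr, "kernel failed: %s\n",
            cudaGetErrorString(launchErr != cudaSuccess ? launchErr : syncErr));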

cudaMemcpy to host:

// Copy array back to host
cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );

Because cudaMemcpy in the default stream waits for all previously launched work to finish, this call blocks until the kernel is complete and then copies the results from the device back to the host.

Release device memory:

// Release device memory
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);

As with C's malloc, we need to release our device arrays when we are finished; to do so we use cudaFree.

Compiling

The CUDA environment must be set up before compiling, so make sure the cudatoolkit module is loaded:

$ module load cudatoolkit
$ nvcc vecAdd.cu -o vecAdd.out

Running

$ aprun ./vecAdd.out
final result: 1.000000