The use of GPU acceleration can add a useful level of parallelization to your code. This tutorial will show how to parallelize a simple Monte Carlo operation using a GPU accelerator. The first code will be a simple serial implementation of the Monte Carlo simulation used to compute π. The next samples will be the same code but accelerated using CUDA, OpenACC, and then Thrust.

Serial

The codes use a Monte Carlo method to estimate π. To set up the estimate, randomly located points are generated within a 2×2 square that has a circle inscribed in it – think of a game of darts. The algorithm generates a large number of random points and checks whether the coordinates, x and y, of each point fall inside the circle, i.e. x² + y² ≤ 1. The ratio, P, of points inside the circle to the total number of points tried is then calculated. Using this ratio, π can be approximated by noting that P approaches the ratio of the area of the circle to the area of the square as the number of points, niter, grows large. Thus we can say P = πr²/(2r)² = π/4 and solve for π: multiplying P by 4 gives the approximation. (Note: the more points generated (niter), the more accurate the approximation will be.)
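
Written out as a worked equation, with r the radius of the inscribed circle (so the square has side 2r and area 4r²):

\[
P \;\approx\; \frac{\text{area of circle}}{\text{area of square}} \;=\; \frac{\pi r^2}{(2r)^2} \;=\; \frac{\pi}{4}
\qquad\Longrightarrow\qquad \pi \approx 4P
\]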

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

int main(int argc, char* argv[])
{
    double niter = 100000;
    double x,y;
    int i;
    int count=0;
    double z;
    double pi;
    srandom(time(NULL));
    //main loop
    for (i=0; i<niter; ++i)
    {
        //get random points
        x = (double)random()/RAND_MAX;
        y = (double)random()/RAND_MAX;
        z = sqrt((x*x)+(y*y));
        //check to see if point is in unit circle
        if (z<=1)
        {
            ++count;
        }
    }
    //pi = 4 * (points in circle / total points)
    pi = ((double)count/(double)niter)*4.0;
    printf("Pi: %f\n", pi);

To compile this code, run:

$ cc serialpi.c -o a.out

The resulting executable can be run either in an interactive session or via a batch script. To launch an interactive session, run the following, replacing [PROJID] with your project ID:

$ qsub -I -A [PROJID] -l walltime=1:00,nodes=1

Alternately, the executable can be run via a batch script. A sample batch script could look like this:

#!/bin/bash
#    Begin PBS directives
#PBS -A <PROJID>
#PBS -N pi
#PBS -j oe
#PBS -l walltime=0:01:00,nodes=1
#PBS -V
#    End PBS directives and begin shell commands

cd $PBS_O_WORKDIR
date
aprun -n1 ./a.out

Save this in your project’s scratch directory as something like titan.pbs and copy your executable to $MEMBERWORK/<projid>.

To submit the script to the batch system, run:

$ qsub titan.pbs

CUDA

Compute Unified Device Architecture (CUDA) is the parallel programming model developed by NVIDIA that extends C or Fortran code with features that target the GPU device. In CUDA, the programmer defines functions to be launched in many thousands or millions of parallel threads on the GPU. These functions are generally referred to as CUDA kernels or simply “kernels”. When writing these kernels, the programmer needs to keep in mind how many threads and blocks of threads they want launched on the GPU at runtime. A possible kernel for this code could look something like this:

__global__ void kernel(int* count_d, float* randomnums)
{
        int i;
        double x,y,z;
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        i = tid;
        int xidx = 0, yidx = 0;

        xidx = (i+i);
        yidx = (xidx+1);

        x = randomnums[xidx];
        y = randomnums[yidx];
        z = ((x*x)+(y*y));

        if (z<=1)
                count_d[tid] = 1;
        else
                count_d[tid] = 0;
}

Take note of the __global__ decorator. This tells the nvcc compiler that this function is a CUDA kernel that will be called by host code and run on the device. Also note the variables blockDim.x, blockIdx.x, and threadIdx.x. These are built-in variables that identify a thread's place in the overall three-dimensional grid, similar in concept to an MPI rank or OpenMP thread number.
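
As a standalone illustration (separate from the tutorial code), the toy kernel below prints where each thread sits in a launch of 2 blocks of 4 threads; the global index is computed with the same blockDim.x * blockIdx.x + threadIdx.x expression used above. It relies on device-side printf, which the -arch=sm_35 flag used later in this section supports.

#include <stdio.h>

//Each thread prints its position in the launch configuration
__global__ void whoami(void)
{
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        printf("block %d, thread %d -> global id %d\n",
               blockIdx.x, threadIdx.x, tid);
}

int main(void)
{
        //2 blocks of 4 threads: global ids 0 through 7
        whoami<<<2, 4>>>();
        //Kernel launches are asynchronous; wait for the kernel (and its output) to finish
        cudaDeviceSynchronize();
        return 0;
}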

Another thing to keep in mind is that the CPU (host) and GPU (device) have separate memory. This means that the host can't directly access memory on the device and vice versa. To overcome this, CUDA provides functions to allocate memory on the GPU device and to transfer data to and from it. For example, in the Monte Carlo code there needs to be a set of random numbers available on the device for the kernel to use, as well as an empty array to hold a 1 or 0 for each point, depending on whether or not the point is in the circle (see the Serial section for more information on the Monte Carlo specifics). So, after the random numbers are generated and stored in an array on the host, that array needs to be copied to the device, and the empty array needs to be allocated on the device. Two different CUDA functions are used for this. First, the cudaMalloc() function allocates memory on the GPU:

cudaError_t cudaMalloc(void** devPtr,
                       size_t size)

Before this function is called, an int* named count_d needs to be created (note that device pointer variables in this code end with “_d”). The programmer also needs to determine the size of the array. In this case, since the kernel processes one point per thread, the total number of threads can be used as the size (thread blocks × threads per thread block = total number of threads). Once this is done, the cudaMalloc() function can be called. For this code, it could look like this:

cudaMalloc((void**)&count_d, (blocks*threads)*sizeof(int));

Also, an array to house the random numbers (2 per thread this time) needs to be allocated on the device using cudaMalloc. This is done in the same fashion as before:

cudaMalloc((void**)&randomnums, (2*niter)*sizeof(float));

Once there is a place to contain the required random numbers, they can be generated. The CURAND library contains kernels that generate high quality random numbers using the GPU. CURAND needs a bit of setting up before it can generate the random numbers. First, a CURAND generator needs to be declared:

curandGenerator_t gen;

The next function creates the generator and declares its type.

curandStatus_t curandCreateGenerator(curandGenerator_t* generator,
                                     curandRngType_t rng_type);

The next one sets the seed.

curandStatus_t curandSetPseudoRandomGeneratorSeed(curandGenerator_t generator,
                                                  unsigned long long seed);

Next, curandGenerateUniform() is the actual function that generates the random numbers.

curandStatus_t curandGenerateUniform(curandGenerator_t generator,
                                     float* outputPtr,
                                     size_t num);

Finally, the random number generator can be destroyed, since there's no further need for it.

curandStatus_t curandDestroyGenerator(curandGenerator_t generator);

Using these definitions, create the random numbers as follows:

curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MRG32K3A);
curandSetPseudoRandomGeneratorSeed(gen, 4294967296ULL^time(NULL));
curandGenerateUniform(gen, randomnums, (2*niter));
curandDestroyGenerator(gen);

Now the CUDA kernel needs to be launched on the device. Before this can be done, the programmer needs to decide how many threads per block and how many blocks to launch. In this case, 1000 threads per block and 100 blocks work well (these numbers were chosen arbitrarily; the programmer can pick others, as long as blocks × threads matches the number of points, niter). Once this is determined, the kernel can be launched and passed the correct device pointers:

kernel <<<blocks, threads>>> (count_d, randomnums);

Once the kernel has finished, the count_d array needs to be copied back to the host. Similar to the standard C function memcpy(), CUDA provides cudaMemcpy(), which looks like this:

cudaError_t cudaMemcpy(void * dst,
                       const void * src,
                       size_t count,
                       enum cudaMemcpyKind kind)

Before this can be done, the programmer needs to declare an empty array on the host to hold the data copied back with cudaMemcpy(). Also, make sure the amount of data copied back (size_t count) is the right size (the same size as the cudaMalloc’d array). Finally, cudaMemcpy() needs to know in which direction the copy is performed. In this particular instance, the array is on the device and needs to be copied to the host, so the final parameter needs to be the built-in constant cudaMemcpyDeviceToHost. The completed cudaMemcpy line looks like this:

cudaMemcpy(count, count_d, blocks*threads*sizeof(int),
           cudaMemcpyDeviceToHost);

Now that all the CUDA code is set up, the rest of the code looks like this:

cpi.cu:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand.h>
#include <time.h>

//Declare the CUDA kernel
__global__ void kernel(int* count_d, float* randomnums)
{
        int i;
        double x,y,z;
        //Find the overall ID of the thread
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        i = tid;
        int xidx = 0, yidx = 0;

        //Do the MonteCarlo!
        xidx = (i+i);
        yidx = (xidx+1);

        //Get the random x,y points
        x = randomnums[xidx];
        y = randomnums[yidx];
        z = ((x*x)+(y*y));

        if (z<=1)
                count_d[tid] = 1;
        else
                count_d[tid] = 0;
}

//Used to check if there are any errors launching the kernel
void CUDAErrorCheck()
{
        cudaError_t error = cudaGetLastError();
        if (error != cudaSuccess)
        {
                printf("CUDA error : %s (%d)\n", cudaGetErrorString(error), error);
                exit(0);
        }
}

int main(int argc,char* argv[])
{
        //NOTE: if threads and/or blocks is changed, niter needs to be changed to reflect
        //that change (niter=threads*blocks)
        int niter = 100000;
        float *randomnums;
        double pi;
        //Allocate the array for the random numbers
        cudaMalloc((void**)&randomnums, (2*niter)*sizeof(float));
        //Use CuRand to generate an array of random numbers on the device
        int status;
        curandGenerator_t gen;
        status = curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MRG32K3A);
        status |= curandSetPseudoRandomGeneratorSeed(gen, 4294967296ULL^time(NULL));
        status |= curandGenerateUniform(gen, randomnums, (2*niter));
        status |= curandDestroyGenerator(gen);
        //Check to see if there was any problem launching the CURAND kernels and generating
        //the random numbers on the device
        if (status != CURAND_STATUS_SUCCESS)
        {
                printf("CuRand Failure\n");
                exit(EXIT_FAILURE);
        }

        //Threads per thread block to be launched
        int threads = 1000;
        //Number of thread blocks launched
        int blocks = 100;
        int* count_d;
        int *count = (int*)malloc(blocks*threads*sizeof(int));
        unsigned int reducedcount = 0;
        //Allocate the array to hold a value (1 or 0) for whether the point is in the circle (1) or not (0)
        cudaMalloc((void**)&count_d, (blocks*threads)*sizeof(int));
        CUDAErrorCheck();
        //Launch the kernel
        kernel <<<blocks, threads>>> (count_d, randomnums);
        //Acts as a barrier until the kernel is finished. Kernel launches are nonblocking,
        //so without the synchronization the host code would continue whether or not the kernel had finished
        cudaDeviceSynchronize();
        CUDAErrorCheck();
        //Copy the resulting array back
        cudaMemcpy(count, count_d, blocks*threads*sizeof(int), cudaMemcpyDeviceToHost);
        int i = 0;

        //Reduce array into int
        for(i = 0; i<niter; i++)
                reducedcount += count[i];

        //Free the cudaMalloc()'d arrays
        cudaFree(randomnums);
        cudaFree(count_d);
        free(count);

        //Find the ratio
        pi = ((double)reducedcount/niter)*4.0;
        printf("Pi: %f\n", pi);

        return 0;
}

Compiling

Before this can be compiled, the CUDA toolkit needs to be loaded. To do this, run:

$ module load cudatoolkit

Now that the module is loaded, the nvcc compiler wrapper is available. To compile this run:

$ nvcc -arch=sm_35 -lcurand cpi.cu -o a.out

Here, the -arch=sm_35 switch tells the compiler that the target device's compute capability is 3.5. Targeting a higher compute capability allows the programmer to do more in the CUDA kernel, for instance use double-precision arithmetic. The -lcurand switch tells the compiler to link in the CURAND library, giving the code access to the CURAND functions.
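
If you are unsure which compute capability the target GPU supports, a small query program (a sketch, separate from the tutorial code) can report it through the CUDA runtime API. Compile it with nvcc like the example above:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
        cudaDeviceProp prop;
        //Query the properties of device 0
        cudaGetDeviceProperties(&prop, 0);
        //prop.major and prop.minor give the compute capability, e.g. 3.5 -> -arch=sm_35
        printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
        return 0;
}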

Then copy the executable (in this case, a.out) to the scratch directory (see the Serial section for more information on finding the scratch directory). It can be run either from an interactive job or by submitting it to the batch queue. To run it inside an interactive job, use:

$ qsub -I -A [PROJID] -l walltime=1:00,nodes=1

Before you run this, change [PROJID] to the appropriate project ID. Once the job starts, change to the scratch directory and execute the command

$ aprun -n1 ./a.out

Alternately, this can be submitted as a batch job to the queues. To do this, a batch script needs to be written and submitted. A sample batch script could look like this:

#!/bin/bash
#    Begin PBS directives
#PBS -A <PROJID>
#PBS -N pi
#PBS -j oe
#PBS -l walltime=0:01:00,nodes=1
#PBS -V
#    End PBS directives and begin shell commands

cd $PBS_O_WORKDIR
date
aprun -n 1 ./a.out

Again, before this script can be submitted, the appropriate project ID needs to be added instead of [PROJID]. Save this file as something unique, like titan.pbs, and submit it by running:

$ qsub titan.pbs

OpenACC

OpenACC provides an alternative, pragma-based accelerator programming model that can be easier to use than CUDA. These pragmas are similar to OpenMP's, but they launch threads on the GPU instead of the CPU. Parallelizing the Monte Carlo code with OpenACC is a fairly straightforward procedure. The programmer must first identify the best region to parallelize. In this code, the most parallelizable region is the main loop that actually performs the Monte Carlo simulation. It can be broken into two for loops: the first computes x² + y² for each point and stores the result in an output array, and the second sweeps through that array and counts how many of the values are less than or equal to 1, reducing the array to a single integer count. The OpenACC pragma should look something like this:

#pragma acc parallel copy(randomnums[0:200000]) copyout(output[0:100000]) create(x,y,z)

Here, “acc” tells the compiler that this is an OpenACC pragma, and the parallel construct launches multiple threads on the GPU. The copy(randomnums[0:200000]) clause copies the randomnums array onto the device (and back off when the region ends), the copyout(output[0:100000]) clause allocates the output array on the device and copies it back to the host once the region is finished, and the create(x,y,z) clause creates the three variables x, y, and z on the device. The parallel region looks something like this:

        #pragma acc parallel copy(randomnums[0:200000]) copyout(output[0:100000]) create(x,y,z)
        {
                for (i = 0; i<100000; i++)
                {
                        int temp = i+i;
                        x = randomnums[temp];
                        y = randomnums[temp+1];
                        z = ((x*x)+(y*y));
                        output[i] = z;
                }
                for (i = 0; i<100000; i++)
                {
                        if (output[i]<=1)
                        {
                                ++count;
                        }
                        else
                                continue;
                }
        }
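
As a side note, the same work can be expressed more compactly with OpenACC's reduction clause. The sketch below is not the version used in the rest of this tutorial; it folds the two loops into one and requests the sum reduction on count explicitly (the compiler output shown later generates such a reduction on its own):

        //A sketch using a reduction clause instead of the intermediate output[] array;
        //count must be initialized to 0 before the loop, as in the full code below
        #pragma acc parallel loop reduction(+:count) copyin(randomnums[0:200000])
        for (i = 0; i < 100000; i++)
        {
                double x = randomnums[2*i];
                double y = randomnums[2*i+1];
                if (((x*x)+(y*y)) <= 1.0)
                        ++count;
        }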

The complete code looks like this:

accpi.cpp

#include <stdio.h>
#include <time.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
        double niter = 100000;
        double x,y;
        int count=0;
        double z;
        double pi;
        int a = 0;
        time_t t;
        double randomnums[200000];
        double output[100000];
        srand48(((unsigned)time(&t)));
        for (a = 0; a<2*niter; a++)
                randomnums[a] = (double)drand48();
        int i = 0;
        #pragma acc parallel copy(randomnums[0:200000]) copyout(output[0:100000]) create(x,y,z)
        {
                for (i = 0; i<100000; i++)
                {
                        int temp = i+i;
                        x = randomnums[temp];
                        y = randomnums[temp+1];
                        z = ((x*x)+(y*y));
                        output[i] = z;
                }
                for (i = 0; i<100000; i++)
                {
                        if (output[i]<=1)
                        {
                                ++count;
                        }
                        else
                                continue;
                }
        }
        pi = ((double)count/(double)niter)*4.0;
        printf("Pi: %f\n", pi);
}

To compile this, the cudatoolkit module needs to be loaded. To load it, run module load cudatoolkit. Next, run

$ CC -acc -Minfo=all -ta=nvidia:5.0,cc3x accpi.cpp -o a.out

to compile the code into an executable. The -Minfo=all switch tells the PGI compiler to print out various information about how it is compiling the OpenACC code, which can be a very useful tool for debugging. The -ta=nvidia:5.0,cc3x switch tells the compiler which CUDA version to use and which compute architecture to target. Some sample output from the -Minfo=all switch is:

main:
21, Generating copy(randomnums[0:])
Generating copyout(output[0:])
Generating create(x)
Generating create(y)
Generating create(z)
Accelerator kernel generated
22, #pragma acc loop vector(256) /* threadIdx.x */
30, #pragma acc loop vector(256) /* threadIdx.x */
21, Generating NVIDIA code
Generating compute capability 3.0 binary
22, Loop is parallelizable
30, Loop is parallelizable
34, Sum reduction generated for count

To compile this code with the Cray compiler, load the PrgEnv-cray module as well as the craype-accel-nvidia35 module:

$ module swap PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35

To compile to an executable, run:

$ CC -hpragma=acc accpi.cpp -o a.out

Running

Then copy the executable (in this case, a.out) to the scratch directory. As before, it can be run either from an interactive job or by submitting it to the batch queue. To run it inside an interactive job, run:

$ qsub -I -A [PROJID] -l walltime=1:00,nodes=1

Before you run this, change [PROJID] to the appropriate project ID. Once the job starts, change to the scratch directory and execute the command

$ aprun -n1 ./a.out

Alternately, this can be submitted as a batch job to the queues. To do this, a batch script needs to be written and submitted. A sample batch script could look like this:

#!/bin/bash
#    Begin PBS directives
#PBS -A [PROJID]
#PBS -N pi
#PBS -j oe
#PBS -l walltime=0:01:00,nodes=1
#PBS -V
#    End PBS directives and begin shell commands

cd $PBS_O_WORKDIR
date
aprun -n 1 -N 1 ./a.out

Again, before this script can be submitted, the appropriate project ID needs to be added instead of [PROJID]. Save this file as something unique, like titan.pbs, and submit it by running:

$ qsub titan.pbs

Thrust

Finally, this operation can be accelerated with the Thrust library. Thrust is a template library that allows programmers to write parallel code in the same way they would use the C++ Standard Library for serial code. Because of this, the Thrust-parallelized code is structured quite differently. For instance, the Monte Carlo kernel from the previous CUDA example is replaced by an equivalent function inside a struct (a functor).

struct montecarlo :
       public thrust::unary_function<unsigned int, float>
{
    //use the __host__ __device__ decorators so that
    //thrust knows that this is a gpu function.
    //think of operator() as similar to a CUDA kernel
	__host__ __device__
	float operator()(unsigned int thread_id)
	{
                unsigned int seed;
                //set the seed
                seed = 49868^thread_id;
                int i;
                float x,y,z,sum=0.0;

                //define the random number generator
                thrust::default_random_engine rng(seed);
                thrust::uniform_real_distribution<float> u01(0,1);
                for (i = 0; i < 1000; i++)
                {
                        //get the random numbers
                        x = u01(rng);
                        y = u01(rng);
                        z = ((x*x)+(y*y));
                        if (z<=1)
                                sum += 1;
                }
                return sum;
        }
};

First, the montecarlo struct is defined and given a Thrust function template to adhere to.

struct montecarlo : public thrust::unary_function<unsigned int, float>

This Thrust template says that the function in the struct will be a standard Thrust unary function taking an unsigned int as input (the thread number, which Thrust supplies) and returning a floating point number. Next, the programmer defines the operator function

__host__ __device__
float operator()(unsigned int thread_id)

and gives it the __host__ and __device__ decorators so that the compiler knows it can be run on the device as well as the host. The next thing to take note of is the use of the Thrust random number generator.

thrust::default_random_engine rng(seed);
thrust::uniform_real_distribution<float> u01(0,1);

Since it is not possible to call host functions, such as the C library's rand(), from GPU code, this step is necessary. Using Thrust's random number facilities like this lets the programmer accelerate the serial code more easily and quickly, without having to use a more complex and specialized API such as CURAND.
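
Because the same engine and distribution types also work in ordinary host code, they are easy to try out on the CPU before placing them inside a functor. A minimal standalone sketch (separate from the tutorial code, compiled with nvcc since the Thrust headers ship with the CUDA toolkit):

#include <stdio.h>
#include <thrust/random.h>

int main(void)
{
        //Same engine and distribution the functor above uses, here running on the host
        thrust::default_random_engine rng(49868);
        thrust::uniform_real_distribution<float> u01(0.0f, 1.0f);
        for (int i = 0; i < 5; i++)
        {
                float x = u01(rng);
                float y = u01(rng);
                printf("x = %f, y = %f\n", x, y);
        }
        return 0;
}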

Finally, the functor is launched through another Thrust function, transform_reduce(), which is defined as:

OutputType thrust::transform_reduce(InputIterator first,
                                    InputIterator last,
                                    UnaryFunction unary_op,
                                    OutputType init,
                                    BinaryFunction binary_op)

Here, thrust::transform_reduce() is used because it runs every number between first and last through the predefined functor (unary_op) and then reduces the transformed results into one final value according to binary_op, starting from the initial value init. first and last are supplied as counting iterators, thrust::counting_iterator<int>(first) and thrust::counting_iterator<int>(last). The full call looks like this:

count = thrust::transform_reduce(thrust::counting_iterator<int>(0),
                                 thrust::counting_iterator<int>(niter),
                                 montecarlo(),
                                 0.0f,
                                 thrust::plus<float>());

So with that in mind, the full code can look something like this:

thrust.cu

#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <thrust/generate.h>
#include <thrust/random.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>
#include <thrust/transform_reduce.h>
#include <cmath>

struct montecarlo :
       public thrust::unary_function<unsigned int, float>
{
    //use the __host__ __device__ decorators so that
    //thrust knows that this is a gpu function.
    //think of operator() as similar to a CUDA kernel
	__host__ __device__
	float operator()(unsigned int thread_id)
	{
                unsigned int seed;
                //set the seed
                seed = 49868^thread_id;
                int i;
                float x,y,z,sum=0.0;

                //define the random number generator
                thrust::default_random_engine rng(seed);
                thrust::uniform_real_distribution<float> u01(0,1);
                for (i = 0; i < 1000; i++)
                {
                        //get the random numbers
                        x = u01(rng);
                        y = u01(rng);
                        z = ((x*x)+(y*y));
                        if (z<=1)
                                sum += 1;
                }
                return sum;
        }
};

int main(int argc, char* argv[])
{
        float pi;
        float count = 0.0;
        //Each call to the functor above tests 1000 points, so the total number of
        //points sampled is niter*1000 (with niter = 1000, that equals niter*niter)
        int niter = 1000;
        count = thrust::transform_reduce(thrust::counting_iterator<int>(0),
                                     thrust::counting_iterator<int>(niter),
                                     montecarlo(),
                                     0.0f,
                                     thrust::plus<float>());
        pi = (count/(niter*niter))*4.0;
        printf("Pi: %f\n", pi);
}

To compile this code, the cudatoolkit module needs to be loaded:

$ module load cudatoolkit

Once the code is ready to be compiled, it should be compiled using NVIDIA’s nvcc compiler like so:

$ nvcc -arch=sm_35 thrust.cu -o a.out

It can be run either by submitting it in a batch script or by running it from an interactive job, as described in the previous examples.

The code in this tutorial along with a makefile can be found at https://github.com/olcf/Accelerating-Serial-Code

If you have any questions about these examples or how to run them please email help@nccs.gov.