
Accelerated Computing Guide


1. Accelerated Computing Overview


Accelerated Computing is a computing model used in scientific and engineering applications whereby calculations are carried out on specialized processors (known as accelerators) in tandem with traditional CPUs to achieve faster real-world execution times. Accelerators are highly specialized microprocessors designed with data parallelism in mind. In the most practical terms, execution times are reduced by offloading parallelizable computationally-intensive portions of an application to the accelerators while the remainder of the code continues to run on the CPU.

The flagship supercomputer at the OLCF, Titan, employs 18,688 NVIDIA Tesla K20X GPU accelerators. This guide is provided to aid researchers in the process of transforming traditional CPU code to accelerated code that is well-suited for execution on Titan.

1.1. Accelerated Computing Basics


It is difficult to discuss how to accelerate code on Titan without first introducing some basic background information about the history of accelerator technology as well as some fundamental concepts pertaining to accelerators. The following sections aim to provide a narrow introduction to these fundamental concepts, with a special emphasis on the technologies employed on Titan, i.e., NVIDIA GPUs and CUDA.

1.1.1. History of CUDA, OpenCL, and the GPGPU


While research into the use of specialized processors for general computation has been ongoing for many decades, the practical realization of modern accelerators for use in high-performance computing (HPC) did not happen until the late 2000s. Around that time, specialized processors called graphics processing units (GPUs) produced by NVIDIA and ATI/AMD were being manufactured with capabilities sufficient to support HPC workloads.

During this time, NVIDIA released the initial public beta of its proprietary CUDA development framework (2007), as well as the first generation of its Tesla GPUs marketed to the HPC sector. ATI/AMD also released GPUs implementing a non-proprietary, committee-backed specification called OpenCL, the first version of which was approved in 2008.

The practice of using these general-purpose graphics processing units (GPGPUs) for HPC workloads is now common. The CUDA framework is utilized on NVIDIA GPUs, and the OpenCL framework is utilized on both AMD and NVIDIA GPUs. Other, higher-level accelerated computing techniques and tools have been built upon these two low-level GPU frameworks to assist programmers in accelerating applications with GPU technologies. While other accelerator technologies are emerging, GPUs are currently the most widely adopted accelerator technology in high-performance computing.

Titan contains 18,688 NVIDIA Tesla K20X GPU accelerators. As such, most of the acceleration techniques and tools employed on Titan make use of the CUDA framework either directly or indirectly.

1.1.2. CUDA Thread Model


The K20X architecture is closely coupled to the CUDA thread model. Due to the accelerator's large-scale threading capabilities, global thread communication and synchronization are not available. The CUDA thread model is an abstraction that lets the programmer or compiler more easily utilize the available levels of thread cooperation. Work is issued to the GPU in the form of a function, referred to as a kernel, that is executed N times in parallel by N CUDA threads.

CUDA threads are logically divided into 1-, 2-, or 3-dimensional groups referred to as thread blocks. Threads within a block can cooperate through access to low latency shared memory as well as through synchronization capabilities. All threads in a block must reside on the same streaming multiprocessor and share its limited resources; additionally, an upper limit of 1024 threads per block is imposed.

The thread blocks of a given kernel are partitioned into a 1-, 2-, or 3-dimensional logical grouping referred to as a grid. The grid encompasses all threads required to execute a given kernel. There is no cooperation between blocks in a grid, so blocks must be able to execute independently. When a kernel is launched, the number of threads per thread block and the number of thread blocks are specified; this in turn defines the total number of CUDA threads launched.

Each thread has access to its integer position within its own block, as well as the integer position of the thread's enclosing block within the grid, as displayed below. In general, a thread uses this position information to read from and write to device global memory. In this fashion, although each thread executes the same kernel code, each thread has its own data to operate on.

Note: The dimensionality of the grid and thread blocks is purely a convenience to the programmer, used to better fit the problem domain.

Example block and grid layouts:

  • A 1D grid of 1D blocks.
  • A 2D grid of 1D blocks.
  • A 2D grid of 2D blocks.
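To make the thread model concrete, the following minimal CUDA C sketch shows a vector addition kernel using a 1D grid of 1D blocks; each thread combines threadIdx, blockIdx, and blockDim to find the one element it operates on. The names (vecAdd, the 256-thread block size) are choices made for this example only.

```cuda
#include <stdio.h>

// Kernel: each of the N CUDA threads computes one element of c.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // Global index: the thread's position in its block plus the
    // block's offset within the grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                // guard: the grid may be larger than n
        c[i] = a[i] + b[i];
}

int main(void)
{
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMalloc(&a, bytes);
    cudaMalloc(&b, bytes);
    cudaMalloc(&c, bytes);

    // 1D grid of 1D blocks: 256 threads per block, enough blocks to cover n.
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Because n need not be a multiple of the block size, the grid is rounded up and the in-kernel bounds check keeps the extra threads from writing out of range.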

1.1.3. CUDA Memory Model


The CUDA memory model specifies several memory spaces that a thread may access. How these memory spaces are implemented varies depending on the specific accelerator model. For K20X memory specifics, please see the K20X Memory Hierarchy section.

Device Global Memory

Global memory is persistent over multiple kernel invocations and is readable and writable from all threads. Global memory is the largest memory area but suffers from the highest latency and lowest bandwidth.


Shared Memory

A low latency, high bandwidth memory available to all threads within a thread block. The lifetime of shared memory is that of the thread block. Shared memory promotes data reuse and allows communication between threads within a thread block.


Thread Private Memory

Each thread has access to private local memory with a lifetime of the thread. Thread private memory is generally contained in the streaming multiprocessor's register file, although it may spill to global memory in certain circumstances.
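The following CUDA C sketch illustrates these memory spaces together: data is staged from global memory into block-scoped shared memory, reduced cooperatively, and one result per block is written back to global memory. The kernel name, the fixed 256-thread block size, and the omission of the host-side launch are simplifications for this example.

```cuda
// Each block reduces a 256-element slice of `in` into one partial sum.
__global__ void blockSum(const float *in, float *partial, int n)
{
    __shared__ float s[256];          // shared memory: lifetime of the block
    int tid = threadIdx.x;            // thread-private (register) storage
    int i   = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;  // stage from global into shared memory
    __syncthreads();                  // all threads in the block must arrive

    // Tree reduction within the block using low latency shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = s[0];   // one write to global memory per block
}
```

Note how cooperation stays within the block, consistent with the thread model above: blocks never synchronize with each other, so the per-block partial sums must be combined by a second kernel or on the host.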

1.2. XK7 (Titan) Node Description


Each Titan compute node contains (1) AMD Opteron™ 6274 (Interlagos) CPU and (1) NVIDIA Tesla™ K20X (Kepler) GPU connected through a PCI Express 2.0 interface. Each CPU is connected to the Cray Gemini interconnect through a HyperTransport 3 interface. The CPU has access to 32 gigabytes of DDR3 memory, while the GPU has access to 6 gigabytes of GDDR5 memory. This configuration is shown graphically below.

1.3. Accelerator Basics


Accelerators, loosely defined, are modern microprocessors that are capable of performing the same operation on multiple data points simultaneously. Similar in theory to the vector processors of the 1970s and 1980s, accelerators fall under the single instruction, multiple data classification of Flynn's taxonomy. Titan employs 18,688 NVIDIA Tesla K20X GPUs as accelerators.

1.3.1. K20X Accelerator Description


Each Kepler K20X accelerator contains (14) SMX streaming multiprocessors; the SMX architecture is discussed in more detail below. All work and data transfer between the CPU and GPU is managed through the PCI Express (PCIe) interface. The GigaThread engine is responsible for distributing work between the SMXs, and (6) 64-bit memory controllers control access to 6 gigabytes of GDDR5 device (global) memory. A shared 1.5 megabyte L2 cache sits between global memory and each SMX.

NVIDIA K20X (Kepler) accelerator

For arithmetic operations, each SMX contains (192) CUDA cores for handling single precision floating point and simple integer arithmetic, (64) double precision floating point cores, and (32) special function units capable of handling transcendental instructions. Servicing these arithmetic/logic units are (32) memory load/store units. In addition to (65,536) 32-bit registers, each SMX has access to 64 kilobytes of on-chip memory that is configurable between user controlled shared memory and L1 cache. Additionally, each SMX has 48 kilobytes of on-chip read-only cache, manageable by the user or the compiler.

NVIDIA K20X (Kepler) SMX


1.3.2. K20X Memory Hierarchy



  • Registers

    Each thread has access to private register storage. Registers are the lowest latency and highest bandwidth memory available to each thread, but are a scarce resource. Each SMX has (65,536) 32-bit registers that must be shared by all resident threads. An upper limit of 255 registers per thread is imposed by the hardware; if more thread private storage is required, an area of device memory known as local memory will be used.

  • Shared memory/ L1 cache

    Each SMX has a 64 kilobyte area of memory that is split between shared memory and L1 cache. The programmer specifies the ratio between shared memory and L1 cache.

    Shared memory is a programmer managed, low latency, high bandwidth memory. Variables declared in shared memory are available to all threads within the same thread block.

    L1 cache on the K20X handles local memory (register spillage, stack data, etc.). Global memory loads are not cached in L1.

  • Read Only cache

    Each SMX has 48 kilobytes of read-only cache populated from global device memory. Eligible variables are determined by the compiler, along with guidance from the programmer.

  • Constant Cache

    Although not pictured, each SMX has a small 8 kilobyte constant cache optimized for warp-level broadcast. Constants reside in device memory and have an aggregate maximum size of 64 kilobytes.

  • Texture

    Texture memory provides multiple hardware accelerated boundary and filtering modes for 1D, 2D, and 3D data. Textures reside in global memory but are cached in each SMX's 48 kilobyte read-only cache.

  • L2 cache

    Each K20X contains 1.5 megabytes of L2 cache. All access to DRAM, including transfers to and from the host, go through this cache.

  • DRAM (Global memory)

    Each K20X contains 6 gigabytes of GDDR5 DRAM. On Titan, ECC is enabled, which reduces the available global memory by 12.5%; approximately 5.25 gigabytes are available to the user. In addition to global memory, DRAM is also used for local storage (in case of register spillage), constant memory, and texture memory.

1.3.3. K20X Instruction Issuing


When a kernel is launched on the K20X, the GigaThread engine dispatches the enclosed thread blocks to available SMXs. Each SMX is capable of handling up to 2048 threads or 64 thread blocks, including blocks from concurrent kernels, if resources allow. All threads within a particular thread block must reside on a single SMX.

Once a thread block is assigned to a particular SMX, all threads contained in the block execute entirely on that SMX. At the hardware level, each thread block is broken down into chunks of 32 consecutive threads; each such chunk is referred to as a warp. The K20X issues instructions at the warp level; that is to say, an instruction is issued in vector-like fashion to 32 consecutive threads at a time. This execution model is referred to as Single Instruction Multiple Thread, or SIMT.


Each Kepler SMX has (4) warp schedulers. When a block is divided up into warps, the warps are assigned to a warp scheduler. Warps stay on the assigned scheduler for the lifetime of the warp. The scheduler is able to switch between concurrent warps, originating from any block of any kernel, without overhead. When one warp stalls -- that is, its next instruction can not be executed in the next cycle -- the scheduler switches to a warp that is able to execute an instruction. This low overhead warp swapping allows instruction latency to be effectively hidden, assuming enough warps with issuable instructions reside on the SMX.

Each warp scheduler has (2) instruction dispatch units. Each cycle, the scheduler selects a warp and, if possible, two independent instructions are issued to that warp. Two clock cycles are required to issue a double precision floating point instruction to a full warp.


1.3.4. K20X by the Numbers


Model K20X
Compute Capability 3.5
Peak double precision floating point performance 1.31 teraflops
Peak single precision floating point performance 3.95 teraflops
Single precision CUDA cores 2688
Double precision CUDA cores 896
CUDA core clock frequency 732 MHz
Memory Bandwidth (ECC off) 250 GB/s*
Memory size GDDR5 (ECC on) 5.25 GB
L2 cache 1.5 MB
Shared memory/L1 configurable 64 KB per SMX
Read-only cache 48 KB per SMX
Constant memory 64 KB
32-bit Registers 65,536 per SMX
Max registers per thread 255
Number of multiprocessors (SMX) 14
Warp size 32 threads
Maximum resident warps per SMX 64
Maximum resident blocks per SMX 16
Maximum resident threads per SMX 2048
Maximum threads per block 1024 threads
Maximum block dimensions 1024, 1024, 64
Maximum grid dimensions 2147483647, 65535, 65535
* ECC will reduce achievable bandwidth by 15+%

2. The Code Acceleration Process


The task of accelerating an application is an iterative process that, in general, comprises these steps:

  1. Identify a section of unexploited parallelism in your application's source code.
  2. Choose an appropriate programming technique for code acceleration.
  3. Implement the parallelization within your source code.
  4. Verify the accelerated application's performance and output.

This process may be repeated many times, with a new section of the application being accelerated in each iteration.

Each of the following chapters in this guide covers an aspect of the code acceleration process in more detail with special emphasis on the techniques and tools that are available to end-users on Titan.

3. Identifying Parallelism


The first step in accelerating an existing code is identifying the regions of code that may benefit from acceleration. This is typically accomplished by profiling the code to determine where most of the time is spent. A portion of code that will benefit from acceleration must have sufficient parallelism exposed, as discussed below. If a section of code does not have sufficient parallelism, the algorithm may need to be changed, or in some cases the code may be better suited for the CPU. As the code is being accelerated, this profile can be used as a baseline to compare performance against.

3.1. Code Analysis Tools


The following code analysis tools can be used to obtain baseline CPU performance metrics on Titan.

CrayPAT

CrayPAT is a profiling tool that provides information on application performance. CrayPAT is used for basic profiling of serial, multiprocessor, multithreaded, and accelerated programs. More information can be found on the CrayPAT software page.


TAU

TAU Performance System is a portable profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, Java, and Python. More information can be found on the TAU software page.


Score-P

The Score-P measurement infrastructure allows users to instrument and record the behavior of sequential, MPI-parallel, and hybrid parallel applications as well as CUDA-enabled applications. The resulting trace files in OTF2 format can be visualized using Vampir. More information can be found on the Score-P page.

Allinea MAP

Allinea MAP (part of the Allinea Forge suite, along with DDT) is a profiler for parallel, multithreaded, or single threaded C, C++, Fortran, and F90 codes. It provides in-depth analysis and bottleneck pinpointing down to the source line. Unlike most profilers, it is designed to profile pthreads, OpenMP, or MPI for parallel and threaded code. More information can be found on the Allinea MAP page.

Additional Tools

Additional tools are available and can be found on the debugging and profiling software page.

3.2. Accelerator Occupancy Considerations


Sufficient parallelism must be exposed in order to make efficient use of the accelerator. For most codes, the following guidelines are useful to keep in mind. To hide instruction latency on the K20X, you will generally want each SMX to have at least 32 active warps (1024 threads). The exact number of warps needed to saturate instruction throughput depends on the amount of instruction level parallelism (ILP) and the throughput of the specific instructions used. The block size should be a multiple of 32 to ensure that it can be evenly divided into warps, with 128-256 threads being a good starting point. To keep every SMX busy throughout the kernel's execution, it is best to launch 1,000+ thread blocks if possible.
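A common way to apply these guidelines is to fix a warp-multiple block size and derive the block count from the problem size. In this CUDA C sketch, myKernel and d_data are hypothetical names chosen for the example:

```cuda
#include <stdio.h>

// Hypothetical kernel: doubles each element in place.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main(void)
{
    int n = 10 * 1000 * 1000;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Block size: a multiple of the 32-thread warp size; 256 threads = 8 warps.
    int threadsPerBlock = 256;
    // Enough blocks to cover n (tens of thousands here, far more than the
    // 14 SMXs), so the GigaThread engine can keep every SMX supplied with work.
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    myKernel<<<blocks, threadsPerBlock>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```

With 8 warps per block and up to 16 resident blocks per SMX, a launch of this shape can reach the 32+ active warps per SMX suggested above, resources permitting.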

4. Choosing and Implementing Acceleration Techniques


Titan is configured with a broad set of tools to facilitate acceleration of both new and existing application codes. These tools can be broken down into (3) categories based on their implementation methodology:

Development Tool Hierarchy
  • Accelerated Libraries
  • Accelerator Compiler Directives
  • Accelerator Languages/Frameworks

Each of these three methods has pros and cons that must be weighed for each program, and they are not mutually exclusive. In addition to these tools, Titan supports a wide variety of performance and debugging tools to ensure that, however you choose to implement acceleration in your program, it is done efficiently.

4.1. Interoperability Considerations


Often a single acceleration technology is not the most efficient way to accelerate a given code. Most acceleration techniques provide mechanisms for GPU memory pointers to be input or output, allowing interoperability between multiple technologies. This is not universally true; for example, the OpenCL specification does not currently include interoperability with CUDA-based technologies. The Accelerator Interoperability and Accelerator Interoperability II tutorials provide several interoperability example codes. For additional details, please see the specific technology's documentation.

4.2. GPU Accelerated Libraries


Due to the performance benefits that come with GPU computing, many scientific libraries are now offering accelerated versions. If your program contains BLAS or LAPACK function calls, GPU-accelerated versions may be available. The MAGMA, CULA, cuBLAS, cuSPARSE, and LibSciACC libraries provide optimized GPU linear algebra routines that require only minor changes to existing code. These libraries require little understanding of the underlying GPU hardware, and performance enhancements are transparent to the end developer. For more general libraries, such as Trilinos and PETSc, you will want to visit the appropriate software development site to examine the current status of GPU integration. For Trilinos, please see the latest documentation at the Sandia Trilinos page. Similarly, Argonne's PETSc Documentation has a page containing the latest GPU integration information.

MAGMA

The MAGMA project aims to develop a dense linear algebra library similar to LAPACK, but for heterogeneous/hybrid architectures. For C and Fortran code currently using LAPACK, this should be a relatively simple port that does not require CUDA knowledge.
This module is currently only compatible with the GNU programming environment:
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit magma
To link in the MAGMA library while on Titan:
$ cc -lcuda -lmagma source.c
Linking in MAGMA on Rhea is a bit different because the Titan compiler wrapper takes care of some of the extra flags. For example:
$ nvcc $MAGMA_INC $MAGMA_LIB -L$ACML_DIR/gfortran64/lib -lmagma -lcublas -lacml source.c
This also requires that a BLAS library, such as ACML, be loaded:
$ module load acml
For a comprehensive user manual, please see the MAGMA Documentation. A knowledgeable MAGMA User Forum is also available for personalized help. To see MAGMA in action, see the following two PGI articles that include full example code of MAGMA usage with PGI Fortran: Using MAGMA With PGI Fortran and Using GPU-enabled Math Libraries with PGI Fortran.
CULA

CULA is a GPU accelerated linear algebra library, mimicking LAPACK, that utilizes NVIDIA CUDA. For C and Fortran code currently using LAPACK, this should be a relatively simple port that does not require CUDA knowledge.
CULA is accessed through the cula module; for linking, it is convenient to load the cudatoolkit module as well:
$ module load cula-dense cudatoolkit
To link in the CULA library:
$ cc -lcula_core -lcula_lapack source.c
A comprehensive CULA Programmers Guide covers everything you need to know to use the library. Once the module is loaded, up-to-date documentation can be found in the $CULA_ROOT/doc directory and examples in $CULA_ROOT/examples. An example of using CULA with PGI Fortran is available in Using GPU-enabled Math Libraries with PGI Fortran.

Running the examples: Obtain an interactive job and load the appropriate modules:
$ qsub -I -A[projID] -lwalltime=00:30:00,nodes=1
$ module load cuda cula-dense
Copy the example files:
$ cd $MEMBERWORK/[projid]
$ cp -r $CULA_ROOT/examples .
$ cd examples
Now each example can be built and executed:
$ cd basicUsage
$ make build64
$ aprun basicUsage
cuBLAS and cuSPARSE

cuBLAS and cuSPARSE are NVIDIA provided BLAS GPU routines optimized for dense and sparse use, respectively. If your program currently uses BLAS routines, integration should be straightforward, and minimal CUDA knowledge is needed. Although primarily designed for use in C/C++ code, Fortran bindings are available.
cuBLAS and cuSPARSE are accessed through the cublas header and need to be linked against the cublas library:
$ module load cudatoolkit
$ cc -lcublas source.c
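As a sketch of what a cuBLAS call looks like once the library is linked in, the following computes a single precision matrix-matrix product; d_A, d_B, d_C and the sizes m, n, k are assumed to be set up elsewhere, and the helper function name is invented for this example:

```cuda
#include <cublas_v2.h>

// Sketch: C = alpha*A*B + beta*C in single precision on device pointers.
// d_A (m x k), d_B (k x n), d_C (m x n) are assumed to be column-major
// device allocations, following the usual BLAS conventions.
void gemm_sketch(const float *d_A, const float *d_B, float *d_C,
                 int m, int n, int k)
{
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, d_A, m,   /* lda = m */
                        d_B, k,   /* ldb = k */
                &beta,  d_C, m);  /* ldc = m */
    cublasDestroy(handle);
}
```

Note that the operands live in GPU memory; host arrays must be copied to the device (e.g., with cudaMemcpy or cublasSetMatrix) before the call.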
The cuBLAS and cuSPARSE user guides are available for download from NVIDIA; these guides provide complete function listings as well as example code. The NVIDIA SDK provides sample code that can be accessed using the instructions below. An example of using cuBLAS with PGI Fortran is available in Using GPU-enabled Math Libraries with PGI Fortran.

Running the examples: Obtain an interactive job and load the appropriate modules:
$ qsub -I -A[projID] -lwalltime=00:30:00,nodes=1
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit nvidia-sdk
Copy the example files:
$ cd $MEMBERWORK/[projid]
$ cp -r $NVIDIA_SDK_PATH/CUDALibraries .
$ cd CUDALibraries
Now each example can be executed:
$ cd bin/linux/release
$ aprun simpleCUBLAS
LibSciACC

Cray's LibSciACC provides GPU enabled BLAS and LAPACK routines. LibSciACC provides two interfaces: automatic and manual. The automatic interface is largely transparent to the programmer: LibSciACC determines whether a call is likely to benefit from GPU acceleration and, if so, takes care of accelerating the routine and the associated memory management. The manual interface provides an API to manage accelerator resources, giving more control to the programmer.
It is recommended that the craype-accel-nvidia35 module be used to manage LibSciACC. The LibSciACC automatic interface is currently compatible with the Cray and GNU programming environments:
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
LibSciACC will automatically be linked in when using the Cray provided compiler wrappers:
$ cc source.c
$ ftn source.f90
The man page intro_libsci_acc provides detailed usage information. The environment variable $LIBSCI_ACC_EXAMPLES_DIR specifies a directory containing several C and Fortran example codes.
CUFFT

CUFFT provides a set of optimized GPU fast Fourier transform routines, provided by NVIDIA as part of the CUDA toolkit. The CUFFT library provides an API similar to FFTW for managing accelerated FFTs. The CUFFTW interface provides an FFTW3 interface to CUFFT to aid in porting existing applications.
The cudatoolkit module will append the include and library directories required by CUFFT. When using NVCC or the GNU programming environment the library can then be added.
$ module load cudatoolkit
$ cc -lcufft source.c
$ nvcc -lcufft source.c
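The plan-then-execute pattern CUFFT shares with FFTW looks roughly like this sketch; the device array d_data is assumed to be allocated and populated elsewhere, and the function name is invented for the example:

```cuda
#include <cufft.h>

// Sketch: in-place 1D complex-to-complex forward FFT of nx points on the
// device array d_data.
void fft_sketch(cufftComplex *d_data, int nx)
{
    cufftHandle plan;
    cufftPlan1d(&plan, nx, CUFFT_C2C, 1);   // one transform of length nx
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cufftDestroy(plan);
}
```

As with FFTW, a plan can be reused across many executions, so plan creation should be hoisted out of hot loops.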
NVIDIA provides comprehensive documentation, including example code, available here. For an example of using CUFFT with Fortran through the ISO_C_BINDING interface, please see the following example. The OLCF provides an OpenACC and CUFFT interoperability tutorial.
CURAND

CURAND is an NVIDIA provided random number generator library. CURAND provides both a host launched and a device callable interface. Multiple pseudorandom and quasirandom algorithms are supported.
The cudatoolkit module will append the include and library directories required by CURAND. When using NVCC or the GNU programming environment the library can then be added.
$ module load cudatoolkit
$ module switch PrgEnv-pgi PrgEnv-gnu
$ cc -lcurand source.c
$ nvcc -lcurand source.cu
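A minimal sketch of the host-launched interface follows; the device array d_data is assumed to be allocated elsewhere, and the seed and function name are arbitrary choices for this example:

```cuda
#include <curand.h>

// Sketch: fill the device array d_data with n uniformly distributed
// single precision values using the host-launched CURAND interface.
void rng_sketch(float *d_data, size_t n)
{
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
    curandGenerateUniform(gen, d_data, n);   // results land in device memory
    curandDestroyGenerator(gen);
}
```

The generated numbers stay in GPU memory, which makes this interface convenient to combine with kernels or the accelerated libraries above without host round trips.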
NVIDIA provides comprehensive documentation, including example code, available here. For an example of using the CURAND host library, please see the following OLCF tutorial.
Thrust

Thrust is a CUDA accelerated C++ template library modeled after the Standard Template Library (STL). Thrust provides a high level host interface for GPU data management as well as an assortment of accelerated algorithms. Even if your application does not currently use the STL, the easy access to the many optimized accelerated algorithms Thrust provides is worth a look.
The cudatoolkit module will append the include and library directories required by Thrust. When using NVCC or the GNU programming environment the library can then be added.
$ module load cudatoolkit
$ module switch PrgEnv-pgi PrgEnv-gnu
$ CC source.cpp
$ nvcc source.cu
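The STL-like flavor of Thrust can be seen in this sketch, where a device_vector handles the GPU allocation and host-to-device copy, and sort/reduce run on the GPU; h_data and the function name are hypothetical:

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

// Sketch: copy a host array to the GPU, sort it there, and return its sum.
int sort_and_sum(const int *h_data, int n)
{
    thrust::device_vector<int> d(h_data, h_data + n);  // host-to-device copy
    thrust::sort(d.begin(), d.end());                  // accelerated sort
    return thrust::reduce(d.begin(), d.end(), 0);      // accelerated sum
}
```

The device_vector frees its GPU memory automatically when it goes out of scope, so no explicit cudaMalloc/cudaFree is needed.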
NVIDIA provides comprehensive documentation, including example code, available here. For an example of using Thrust, please see the following OLCF tutorial. The GitHub page provides access to the Thrust source code, examples, and information on how to obtain help.

4.3. Accelerator Compiler Directives


Accelerator compiler directives allow the compiler, guided by the programmer, to take care of low-level accelerator work. One of the main benefits of a directives-based approach is an easier and faster transition of existing code compared to low-level GPU languages. Additional benefits include performance enhancements that are transparent to the end developer and greater portability between current and future many-core architectures. Although initially several vendors provided their own sets of proprietary directives, OpenACC now provides a unified specification for accelerator directives.


OpenACC

OpenACC aims to provide an open accelerator interface consisting primarily of compiler directives, offering a portable cross-platform solution for accelerator programming. Currently PGI, Cray, and CapsMC provide OpenACC implementations for C/C++ and Fortran.

Using C/C++
PGI Compiler
$ module load cudatoolkit
$ cc -acc vecAdd.c -o vecAdd.out
Cray Compiler
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ cc -h pragma=acc vecAdd.c -o vecAdd.out
CapsMC Compiler
$ module load cudatoolkit capsmc
$ cc vecAdd.c -o vecAdd.out
Using Fortran
PGI Compiler
$ module load cudatoolkit
$ ftn -acc vecAdd.f90 -o vecAdd.out
Cray Compiler
$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ ftn -h acc vecAdd.f90 -o vecAdd.out
CapsMC Compiler
$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load cudatoolkit capsmc
$ ftn vecAdd.f90 -o vecAdd.out

The OpenACC specification provides the basis for all OpenACC implementations and is available from the OpenACC website. In addition, the implementation-specific documentation may be of use. PGI has a site dedicated to collecting OpenACC resources. Chapter 5 of the Cray C and C++ Reference Manual provides details on Cray's implementation. CapsMC has provided an OpenACC Reference Manual.


The OLCF provides Vector Addition and Game of Life example codes demonstrating the OpenACC accelerator directives.

Note: This section covers the PGI Accelerator directives, not PGI's OpenACC implementation.
PGI Accelerator

The Portland Group provides accelerator directive support with its latest C and Fortran compilers. Performance and feature additions are still taking place at a rapid pace, but the framework is currently stable and full featured enough to use in production code.


To make use of the PGI accelerator directives, the cudatoolkit module and the PGI programming environment must be loaded:

$ module load cudatoolkit
$ module load PrgEnv-pgi

To specify the platform that the compiler directives should be applied to, the target accelerator flag is used:

$ cc -ta=nvidia source.c
$ ftn -ta=nvidia source.f90

PGI provides a useful web portal for Accelerator resources. The portal links to the PGI Fortran & C Accelerator Programming Model, which provides a comprehensive overview of the framework and is an excellent starting point. In addition, the PGI Accelerator Programming Model on NVIDIA GPUs article series by Michael Wolfe walks you through basic and advanced programming using the framework, providing very helpful tips along the way. If you run into trouble, PGI has a user forum where PGI staff regularly answer questions.


The OLCF provides Vector Addition and Game of Life example codes demonstrating the PGI accelerator directives.

Note: This section covers the CapsMC directives, not CAPS' OpenACC implementation.

CapsMC

The core of CAPS Enterprise's GPU directive framework is CapsMC. CapsMC is a compiler and runtime environment that interprets OpenHMPP and OpenACC directives and, in conjunction with your traditional compiler (PGI, GNU, Cray, or Intel C or Fortran compiler), creates GPU accelerated executables.


To use the CAPS accelerator framework you will need the cudatoolkit and capsmc modules loaded. Additionally, a PGI, GNU, or Intel programming environment must be enabled.

$ module load cudatoolkit
$ module load capsmc
$ module load PrgEnv-gnu

CapsMC modifies the Cray compiler wrappers, generating accelerator code and then linking it in without any additional flags.

$ cc source.c
$ ftn source.f90

CAPS provides several documents and code snippets to get you started with HMPP Workbench. It is recommended to start with the HMPP directives reference manual and the HMPPCG reference manual.


The OLCF provides Vector Addition and Game of Life example codes demonstrating the HMPP accelerator directives.

4.4. GPU Languages/Frameworks


For complete control over the GPU, Titan supports CUDA C, CUDA Fortran, and OpenCL. These languages and language extensions, while allowing explicit control, are generally more cumbersome than directive-based approaches and must be maintained to stay up-to-date with the latest performance guidelines. Substantial code structure changes may be needed and an in-depth knowledge of the underlying hardware is often necessary for best performance.


CUDA C

NVIDIA's CUDA C is largely responsible for launching GPU computing to the forefront of HPC. With a few minimal additions to the C programming language, NVIDIA has allowed low-level control of the GPU without having to deal directly with a driver-level API.


To set up the CUDA environment, the cudatoolkit module must be loaded:

$ module load cudatoolkit

This module will provide access to NVIDIA supplied utilities such as the nvcc compiler, the CUDA visual profiler (computeprof), cuda-gdb, and cuda-memcheck. The environment variable CUDAROOT will also be set to provide easy access to NVIDIA GPU libraries such as cuBLAS and cuFFT.

To compile we use the NVIDIA CUDA compiler, nvcc.

$ nvcc source.cu

For a full usage walkthrough please see the supplied tutorials.


NVIDIA provides a comprehensive web portal for CUDA developer resources. The developer documentation center contains the CUDA C programming guide, which very thoroughly covers the CUDA architecture. The programming guide covers everything from the underlying hardware to performance tuning and is a must-read for those interested in CUDA programming. Also available on the same downloads page are whitepapers covering topics such as Fermi tuning and CUDA C best practices. The CUDA SDK is available for download as well and provides many samples to help illustrate C for CUDA programming techniques. For personalized assistance, NVIDIA has a very knowledgeable and active developer forum.


The OLCF provides both a Vector Addition and Game of Life example code tutorial demonstrating CUDA C usage.

PGI CUDA Fortran

PGI's CUDA Fortran provides a well-integrated Fortran interface for low-level GPU programming, doing for Fortran what NVIDIA did for C. PGI worked closely with NVIDIA to ensure that the Fortran interface provides nearly all of the low-level capabilities of the CUDA C framework.


CUDA Fortran will be properly configured by loading the PGI programming environment:

$ module load PrgEnv-pgi

To compile a file with the .cuf extension, we use the PGI Fortran compiler as usual:

$ ftn source.cuf

For a full usage walkthrough please see the supplied tutorials.


PGI provides a comprehensive web portal for CUDA Fortran resources. The portal links to the PGI Fortran & C Accelerator Programming Model, which provides a comprehensive overview of the framework and is an excellent starting point. The web portal also features a set of articles covering introductory material, device kernels, and memory management. If you run into trouble, PGI has a user forum where PGI staff regularly answer questions.


The OLCF provides both a Vector Addition and Game of Life example code tutorial demonstrating CUDA Fortran usage.


The Khronos group, a non-profit industry consortium, currently maintains the OpenCL (Open Compute Language) standard. The OpenCL standard provides a common low-level interface for heterogeneous computing. At its core, OpenCL is composed of a kernel language extension to C (similar to CUDA C) and a C API to control data management and code execution.


The cudatoolkit module must be loaded for the OpenCL header files to be found, and a PGI or GNU programming environment enabled:

$ module load PrgEnv-pgi
$ module load cudatoolkit

To use OpenCL you must link against the OpenCL library:

$ gcc source.c -lOpenCL

Khronos provides a web portal for OpenCL. From there you can view the specification, browse the reference pages, and get individual help from the OpenCL forums. A developers page is also of great use and includes tutorials and example code to get you started.

In addition to the general Khronos-provided material, users will want to check the vendor-specific information available for capability and optimization details. Of main interest to OLCF users will be the AMD and NVIDIA OpenCL developer zones.


The OLCF provides both a Vector Addition and Game of Life example code tutorial demonstrating OpenCL usage.

5. Debugging Accelerated Code

(Back to Top)

Although debugging the GPU has historically been difficult due to a lack of debugging tools, great strides have been made within the past few years. NVIDIA and Allinea both provide tools with interfaces similar to existing CPU debuggers to aid in debugging accelerated applications.

5.1. Accelerator Debugging Tools

(Back to Top)

The following accelerator enabled debugging tools are currently available on Titan.


DDT is the primary debugging tool available to users on Titan and supports both CPU and GPU debugging. DDT is a commercial debugger sold by Allinea Software, a leading provider of parallel software development tools for High Performance Computing.

For more information on DDT, see the DDT software page, or Allinea's DDT support page.

Warning: DDT does not support non-MPI applications on Titan.

cuda-gdb and cuda-memcheck may be run through aprun, although MPI-enabled applications are not supported. It is recommended to use Allinea DDT for MPI debugging.

6. Verifying Accelerated Application Performance and Output

(Back to Top)

The goal of writing accelerated code is to increase program performance while maintaining program correctness. To ensure that both of these goals are reached, it is necessary to verify application results as well as profile accelerator performance during each step of the acceleration process. How results are verified will depend on the application; many applications will not be able to do a simple bit-wise comparison of final output, for reasons discussed in K20X Floating Point Considerations. Once the results have been verified, the developer is encouraged to make use of performance tools to verify that the accelerator is used efficiently.

6.1. Accelerator Performance Tools

(Back to Top)

To ensure efficient use of Titan, a full suite of performance and analysis tools is offered. These tools provide a wide variety of support, from static code analysis to full runtime heterogeneous tracing.


NVIDIA's command line profiler, NVPROF, provides profiling for CUDA codes. No special steps are needed when compiling your code. The profiler includes tracing capability as well as the ability to provide many performance metrics, including FLOPS. The profiler data can be saved and imported into the NVIDIA visual profiler for easier analysis.


To use nvprof, the cudatoolkit module must be loaded and PMI daemon forking disabled. To view the output in the NVIDIA Visual Profiler, X11 forwarding must be enabled.

The aprun -b flag is currently required to use nvprof; this requires that your executable reside on a filesystem visible to the compute nodes.

$ module load cudatoolkit
$ export PMI_NO_FORK=1

Although nvprof doesn't provide MPI-aggregated data, the %h and %p output file modifiers can be used to create separate output files for each host and process.

$ aprun -b -n16 nvprof -o output.%h.%p ./gpu.out 

A variety of metrics and events can be captured by the profiler. For example, to output the number of double precision flops, you may use the following:

$ aprun -b -n16 nvprof --metrics flops_dp -o output.%h.%p ./gpu.out 

To see a list of all available metrics and events the following can be used:

$ aprun -b nvprof --query-metrics
$ aprun -b nvprof --query-events 

To view the output in the NVIDIA Visual Profiler, please see the NVIDIA documentation.


The nvprof user guide is available on the NVIDIA Developer Documentation Site and provides comprehensive coverage of the profiler's usage and features.

NVIDIA Command Line Profiler

NVIDIA's Command Line Profiler provides run-time profiling of CUDA and OpenCL code. No special steps are needed when compiling your code, and anything that utilizes CUDA or OpenCL, including compiler directives and accelerator libraries, can be profiled. The profile data can be collected in .txt or .csv format to be viewed with the NVIDIA Visual Profiler.


To use the NVIDIA Command Line Profiler, the cudatoolkit module must be loaded. To view the output in the NVIDIA Compute Visual Profiler, X11 forwarding must be enabled:

$ module load cudatoolkit

Although the Command Line Profiler doesn't provide MPI-aggregated data, the compute node hostname and the %p output file modifier can be used to create separate output files for each host and process. Note that the %h modifier is not available in the Command Line Profiler.

$ aprun gpu.out

To view the output in the NVIDIA Visual Profiler please see the NVIDIA Visual Profiler Users Guide.


The NVIDIA Command Line Profiler User Guide is available on NVIDIA's Developer Documentation Site and provides comprehensive coverage of the profiler's usage and features.


Score-P is a software system that provides a measurement infrastructure for profiling and event trace recording of MPI, SHMEM, OpenMP/Pthreads, and hybrid parallel applications as well as CUDA-enabled applications. Trace files, in OTF2 format, can be visualized using Vampir.


To run, the scorep module must be loaded; the cudatoolkit module must be loaded as well for GPU traces:

$ module load cudatoolkit
$ module load scorep

Once loaded, the program must be recompiled by prefixing scorep to the original compiler:

$ scorep cc source.c
$ scorep --cuda nvcc source.cu

To see all available options for instrumentation:

$ scorep --help

Example for generating trace files:

$ aprun instrumented_binary.out

Additional information can be found on the OLCF Score-P software page. The Score-P User Manual provides comprehensive coverage of Score-P usage.


Vampir is a graphical user interface for analyzing OTF trace data. For small traces, all analysis may be done on the user's local machine running a local Vampir copy. For larger traces, the GUI can be run from the user's local machine while the analysis is done using VampirServer, running on the parallel machine.


The easiest way to get started is to launch the Vampir GUI from an OLCF compute resource; however, a slow network connection may limit usability.

$ module load vampir
$ vampir

Additional information may be found on the OLCF Vampir software page. The Vampir User Manual provides comprehensive coverage of Vampir usage.


TAU provides profiling and tracing tools for C, C++, Fortran, and GPU hybrid programs. Generated traces can be viewed in the included Paraprof GUI or displayed in Vampir.


A simple GPU profiling example could be performed as follows:

$ module switch PrgEnv-pgi PrgEnv-gnu
$ module load tau cudatoolkit
$ nvcc source.cu -o gpu.out

Once the CUDA code has been compiled, tau_exec -cuda can be used to profile the code at runtime:

$ aprun tau_exec -cuda ./gpu.out

The resulting trace file can then be viewed using ParaProf:

$ paraprof

Additional information may be found on the OLCF TAU software page. The TAU documentation website contains a complete User Guide, Reference Guide, and even video tutorials.

CrayPAT is a profiling tool that provides information on application performance. CrayPAT is used for basic profiling of serial, multiprocessor, multithreaded, and GPU accelerated programs.

A simple GPU profiling example could be performed as follows:

With PrgEnv-cray:

$ module load craype-accel-nvidia35
$ module load perftools

With a PrgEnv other than Cray:

$ module load cudatoolkit
$ module load perftools

Compile, instrument, run, and generate a report:

$ nvcc -g -c cuda.cu
$ cc launcher.c cuda.o -o gpu.out
$ pat_build -u gpu.out
$ export PAT_RT_ACC_STATS=all
$ aprun ./gpu.out+pat
$ pat_report gpu.out+pat+*.xf

More information can be found on the CrayPAT software page. For more details on linking nvcc and compiler wrapper compiled code please see our tutorial on Compiling Mixed GPU and CPU Code.

6.2. K20X Floating Point Considerations

(Back to Top)

When porting existing applications to make use of the GPU, it is natural to check the ported application's results against those of the original application. Much care must be taken when comparing results between a particular algorithm run on the CPU and GPU since, in general, bit-for-bit agreement will not be possible.

The NVIDIA K20X provides full IEEE 754-2008 support for single and double precision floating point arithmetic. For simple floating point arithmetic (addition, subtraction, multiplication, division, square root, FMA) individual operations on the AMD Opteron and NVIDIA K20X should produce bit-for-bit identical values to the precision specified by the standard.

Transcendental operations (trigonometric, logarithmic, exponential, etc.) should not be expected to produce bit-for-bit identical values when compared against the CPU. Please refer to the CUDA programming guide for the ULP error bounds of the single and double precision functions.

Floating point operations are, in general, neither associative nor distributive. For a given algorithm, the number of threads, thread ordering, and instruction execution may differ between the CPU and GPU, resulting in potential discrepancies; as such, a series of arithmetic operations should not be expected to produce bit-for-bit identical results.

7. Running Accelerated Applications on Titan

(Back to Top)

Each of Titan's 18,688 compute nodes contains an NVIDIA K20X accelerator; the login and service nodes do not. As such, the only way to reach a node containing an accelerator is through aprun. For more details on the types of nodes that constitute Titan, please see login vs service vs compute nodes. No additional steps are required to access the GPU beyond what is required by the acceleration technique used. Titan does possess a few unique accelerator characteristics, which are discussed below.

7.1. Accelerator Modules

(Back to Top)

Access to the CUDA framework is provided through the cudatoolkit module. This module provides access to NVIDIA-supplied tools such as nvcc as well as libraries such as the CUDA runtime. When the cudatoolkit module is loaded, shared linking is enabled in the Cray compiler wrappers (CC, cc, and ftn). The craype-accel-nvidia35 module will load the cudatoolkit module as well as set several accelerator options used by the Cray toolchain.

7.2. Compiling and Linking CUDA

(Back to Top)

When compiling on Cray machines, the Cray compiler wrappers (e.g. cc, CC, and ftn) work in conjunction with the modules system to link in needed libraries such as MPI; it is therefore recommended that the Cray compiler wrappers be used to compile CPU portions of your code.

To generate compiler-portable code, CUDA C code and code containing CUDA runtime calls must be compiled with nvcc. The resulting nvcc-compiled object code must then be linked with object code compiled with the Cray wrappers. nvcc performs GNU-style C++ name mangling on compiled functions, so care must be taken in compiling and linking codes.

The section below briefly covers this technique. For complete examples, please see our tutorial on Compiling Mixed GPU and CPU Code.

C++ Host Code

When linking C++ host code with NVCC compiled code, the C++ code must use GNU-compatible name mangling. This is controlled through compiler specific wrapper flags.

PGI Compiler
$ nvcc -c GPU.cu
$ CC --gnu CPU.cxx GPU.o
Cray, GNU, and Intel Compilers
$ nvcc -c GPU.cu
$ CC CPU.cxx GPU.o
C Host Code

NVCC name mangling must be disabled if the code is to be linked with C code. This requires that the extern "C" function qualifier be used on functions compiled with NVCC but called from cc-compiled code.

extern "C" void GPU_function()
Fortran: Simple

To allow C code to be called from Fortran, one method requires that the C function name be modified. NVCC name mangling must be disabled if the code is to be linked with Fortran; this requires the extern "C" function qualifier. Additionally, function names must be lowercase and end in an underscore character (i.e., _ ).

NVCC Compiled
extern "C" void gpu_function_()
ftn Compiled
call gpu_function()

ISO_C_BINDING provides Fortran with greater interoperability with C and removes the need to modify the C function name. Additionally, ISO_C_BINDING guarantees data type compatibility.

NVCC Compiled
extern "C" void gpu_function()
ftn Compiled
module gpu
  interface
    subroutine gpu_function() BIND (C, NAME='gpu_function')
      implicit none
    end subroutine gpu_function
  end interface
end module gpu


call gpu_function()

7.3. CUDA Proxy

(Back to Top)

The default GPU compute mode for Titan is exclusive process. In this mode, many threads within a process may access the GPU context. To allow multiple processes access to the GPU context, such as multiple MPI tasks on a single node accessing the GPU, the CUDA proxy server was developed. Once enabled, the CUDA proxy server transparently manages work issued to the GPU context from multiple processes.

Warning: Currently, GPU memory between processes accessing the proxy is not guarded, meaning process i can access memory allocated by process j. This SHOULD NOT be used to share memory between processes, and care should be taken to ensure processes access only GPU memory they have allocated themselves.
How to Enable

To enable the proxy server, the following step must be taken before invoking aprun:

$ export CRAY_CUDA_PROXY=1

Currently, GPU debugging and profiling are not supported when the proxy is enabled. On Titan, specifying the qsub flag -l feature=gpudefault will switch the compute mode from exclusive process to the CUDA default mode. In the default mode, debugging and profiling are available, and multiple MPI ranks will be able to access the GPU. The default compute mode is not recommended on Titan: approximately 120 MB of device memory is used per process accessing the GPU, and inconsistent behavior may be encountered under certain conditions.

7.4. GPUDirect: CUDA-enabled MPICH

(Back to Top)

Cray’s implementation of MPICH2 allows GPU memory buffers to be passed directly to MPI function calls, eliminating the need to manually copy GPU data to the host before passing data through MPI. Several examples of using this feature are given below.

How to Enable

To enable GPUDirect, the following step must be taken before invoking aprun:

  $ module switch cray-mpich2 cray-mpich2/5.6.4

Several optimizations for improving performance are given below. These optimizations are highly application dependent and may require some trial-and-error tuning to achieve best results.


Pipelining allows for overlapping of GPU-to-GPU MPI messages and may improve message passing performance for large bandwidth-bound messages. Setting the environment variable MPICH_G2G_PIPELINE=N allows a maximum of N GPU-to-GPU messages to be in flight at any given time. The default value of MPICH_G2G_PIPELINE is 16, and messages under 8 kilobytes in size are never pipelined.
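As a sketch, a job script might widen the pipeline before launching; the executable name and rank count below are placeholders:

```shell
# Allow up to 64 in-flight GPU-to-GPU messages (the default is 16).
export MPICH_G2G_PIPELINE=64
aprun -n 16 ./gpu.out
```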


Applications using asynchronous MPI calls may benefit from enabling the MPICH asynchronous progress feature. Setting the MPICH_NEMESIS_ASYNC_PROGRESS=1 environment variable enables additional threads to be spawned to progress the MPI state.

This feature requires that the thread level be set to multiple: MPICH_MAX_THREAD_SAFETY=multiple.

This feature works best when used in conjunction with core specialization: aprun -r N, which allows for N CPU cores to be reserved for system services.
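Combining these settings, a launch with asynchronous progress enabled might look as follows; the executable name, rank count, and reserved-core count are placeholders to be tuned per application:

```shell
export MPICH_NEMESIS_ASYNC_PROGRESS=1
export MPICH_MAX_THREAD_SAFETY=multiple
# Reserve one core per node for system services, including the progress threads.
aprun -r 1 -n 16 ./gpu.out
```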


Several examples are provided in our GPU Direct Tutorial.

The benchmarks referenced in this section were performed with cray-mpich2/6.1.1.

7.5. Pinned Memory

(Back to Top)

Memory bandwidth between the CPU (host) and GPU (device) can be increased through the use of pinned, or page-locked, host memory. Additionally, pinned memory allows for asynchronous memory copies.

To transfer memory between the host and device, the device driver must know the host memory is pinned. If it is not, the memory will first be copied into a pinned buffer and then transferred, effectively lowering copy bandwidth. For this reason, pinned memory usage is recommended on Titan.
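A minimal sketch using the CUDA runtime API illustrates the idea: cudaMallocHost allocates page-locked host memory, which allows cudaMemcpyAsync to proceed asynchronously; buffer size and error handling are simplified here:

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 1 << 26; /* 64 MB */
    float *host, *device;

    /* Page-locked host allocation: the driver can DMA from it directly. */
    cudaMallocHost((void **)&host, bytes);
    cudaMalloc((void **)&device, bytes);

    /* Asynchronous copy; only truly asynchronous with pinned host memory. */
    cudaMemcpyAsync(device, host, bytes, cudaMemcpyHostToDevice, 0);
    cudaDeviceSynchronize();

    cudaFree(device);
    cudaFreeHost(host);
    return 0;
}
```

Had the host buffer come from malloc instead, the same cudaMemcpyAsync call would silently fall back to a staged, synchronous copy.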

K20X Bandwidth

8. Further Reading on Accelerated Computing

(Back to Top)

This user guide has covered merely a few of the accelerated computing topics that a user of Titan may encounter. For further information, a vast amount of online resources is available; some links that may be of interest are listed below. Targeted links are provided in OLCF articles covering specific topics. As always, if you have any questions, please contact the OLCF helpdesk at help@olcf.ornl.gov.