Accelerated Computing is a computing model used in scientific and engineering applications whereby calculations are carried out on specialized processors (known as accelerators) in tandem with traditional CPUs to achieve faster real-world execution times. Accelerators are highly specialized microprocessors designed with data parallelism in mind. In practical terms, execution times are reduced by offloading parallelizable, computationally intensive portions of an application to the accelerators while the remainder of the code continues to run on the CPU.

The flagship supercomputer at the OLCF, Titan, employs 18,688 NVIDIA Tesla K20X GPU accelerators. This guide is meant to aid researchers in transforming traditional CPU code into accelerated code that is well suited for execution on machines such as Titan. The sections below provide a basic overview of the fundamental concepts of GPU programming, while the links within each section point to more detailed information (e.g., usage, tutorials, etc.).

Accelerated Computing Basics

In order to discuss how to accelerate code using GPUs, it is important to first introduce the reader to some basic background information about the CUDA programming model. The following subsections aim to provide this basic information along with some terminology. For more detailed documentation, please see NVIDIA’s CUDA Programming Guide.

CUDA Thread Model

The CUDA thread model is closely coupled to the GPU architecture. When work is issued to the GPU, it is in the form of a function (referred to as the kernel) that is to be executed N times in parallel by N CUDA threads. Due to the large-scale threading capabilities of the accelerator, global thread communication and synchronization are not available. The CUDA thread model is an abstraction that allows the programmer or compiler to more easily utilize the various levels of thread cooperation that are available during kernel execution.

CUDA threads are logically divided into 1-, 2-, or 3-dimensional groups referred to as thread blocks. Threads within a block can cooperate through access to low-latency shared memory as well as through synchronization. All threads in a block must reside on the same streaming multiprocessor (SM), a compartmentalized subset of the hardware resources on a GPU (e.g., CUDA cores, control units, caches, and registers), and must share that SM's limited resources. Additionally, an upper limit of 1,024 threads per block is imposed.

The thread blocks of a given kernel are partitioned into a 1-, 2-, or 3-dimensional logical grouping referred to as a grid. The grid encompasses all threads required to execute a given kernel. There is no cooperation between blocks in a grid; as such, blocks must be able to execute independently. When a kernel is launched, the number of threads per thread block and the number of thread blocks are specified, which in turn defines the total number of CUDA threads launched.

Each thread has access to its integer position within its own block, as well as the integer position of its enclosing block within the grid. In general, a thread uses this position information to read from and write to device global memory. In this fashion, although every thread executes the same kernel code, each thread has its own data to operate on, as sketched below.
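
For example, the following sketch (illustrative names only; not taken from any OLCF code) shows a simple vector-addition kernel launched as a 1D grid of 1D blocks. Each thread combines its block's position with its own position inside that block to form a unique global index into the arrays it operates on.

    // Each CUDA thread computes one element of c = a + b.
    __global__ void vecAdd(const double *a, const double *b, double *c, int n)
    {
        // Global position: the enclosing block's offset plus this thread's
        // position within that block.
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // Guard threads that fall beyond the end of the arrays when n is not
        // a multiple of the block size.
        if (i < n)
            c[i] = a[i] + b[i];
    }

    // Launch as a 1D grid of 1D blocks: 256 threads per block, enough blocks
    // to cover all n elements.
    // int threadsPerBlock = 256;
    // int blocksPerGrid   = (n + threadsPerBlock - 1) / threadsPerBlock;
    // vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);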

Note: The dimensionality of the grid and thread blocks is purely a convenience to the programmer, allowing the problem domain to be mapped more naturally onto the threading model.

Example block and grid layouts:

  • A 1D grid of 1D blocks.
  • A 2D grid of 1D blocks.
  • A 2D grid of 2D blocks.

CUDA Memory Model

The CUDA memory model specifies several memory spaces to which a thread has access. How these memory spaces are implemented varies depending on the specific accelerator model. For K20X memory specifics, please see the K20X Memory Hierarchy section of the Titan User Guide.


Device Global

Global memory is persistent over multiple kernel invocations and is readable and writable from all threads. Global memory is the largest memory area but suffers from the highest latency and lowest bandwidth.
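
As a sketch of the typical usage pattern (variable names are illustrative, error checking omitted), global memory is allocated and populated from the host before a kernel runs and is read back afterward:

    double *d_a;
    size_t bytes = n * sizeof(double);

    cudaMalloc(&d_a, bytes);                              // allocate device global memory
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // copy host data to the device
    // ... launch one or more kernels that read and write d_a ...
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);  // copy results back to the host
    cudaFree(d_a);                                        // release the device memory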

Shared

A low-latency, high-bandwidth memory available to all threads within a thread block. The lifetime of shared memory is that of the thread block. Shared memory promotes data reuse and allows communication between threads within a block.
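
The sketch below (illustrative names; a block size of 256 is assumed) stages data from global memory into shared memory, synchronizes the block, and then lets each thread read a value written by a neighboring thread in the same block:

    __global__ void shiftWithinBlock(const double *in, double *out, int n)
    {
        __shared__ double tile[256];        // lifetime: the enclosing thread block

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            tile[threadIdx.x] = in[i];

        __syncthreads();                    // every thread in the block reaches this point

        // Read a value written by a different thread in the same block.
        int j = (threadIdx.x + 1) % blockDim.x;
        if (i < n && blockIdx.x * blockDim.x + j < n)
            out[i] = tile[j];
    }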

Local

Each thread has access to private local memory with a lifetime equal to that of the thread. Thread-private memory is generally contained in the SM's register file, although it may spill to global memory in certain circumstances.


The Code Acceleration Process

The task of accelerating an application is an iterative process that, in general, comprises these steps:

  1. Identify a section of unexploited parallelism in your application’s source code.
  2. Choose an appropriate programming technique for code acceleration.
  3. Implement the parallelization within your source code.
  4. Verify the accelerated application’s performance and output.

This process may be repeated many times, with a new section of the application being accelerated in each iteration.

Each of the following subsections covers an aspect of the code acceleration process in more detail with special emphasis on the techniques and tools that are available to end-users on Titan.

Identifying Parallelism

The first step in accelerating an existing code is identifying the regions that may benefit from acceleration. This is typically accomplished by profiling the code to determine where most of the time is spent. A portion of code that will benefit from acceleration must have sufficient parallelism exposed, as discussed below. If a section of code does not have sufficient parallelism, the algorithm may need to be changed, or in some cases the code may be better suited to the CPU. As the code is accelerated, this initial profile can be used as a baseline against which to compare performance.
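
Profilers such as those covered in the Optimization & Profiling section of the Titan user guide are the usual route to this information; a coarse wall-clock timing of a candidate region, as in the hypothetical C sketch below, can also serve as a simple CPU baseline:

    /* Hypothetical baseline timing of a candidate loop: the iterations are
       independent, making the loop a good candidate for acceleration. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void)
    {
        const int n = 1 << 24;
        double *a = malloc(n * sizeof(double));
        double *b = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) { a[i] = 1.0; b[i] = 2.0; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        for (int i = 0; i < n; i++)          /* candidate hotspot */
            a[i] = a[i] + 3.0 * b[i];

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("candidate loop: %.3f s\n", s);

        free(a); free(b);
        return 0;
    }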

Accelerator Occupancy Considerations:
Sufficient parallelism must be exposed in order to make efficient use of the accelerator. For most codes, the following guidelines are useful to keep in mind. To ensure enough parallelism to hide instruction latency on the K20X, you will generally want each SM to have at least 32 active warps (1,024 threads). The exact number of warps needed to saturate the instruction throughput will depend on the amount of instruction-level parallelism and the throughput of the specific instructions used. The block size should be a multiple of 32 so that it divides evenly into warps, with 128-256 threads being a good starting point. To ensure that every SM is kept busy throughout the kernel's execution, it is best to launch 1,000+ thread blocks if possible.
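
Applied to a hypothetical 1D problem of size n, these guidelines might translate into a launch configuration like the following sketch (kernel and variable names are illustrative):

    int threadsPerBlock = 256;   // a multiple of 32: each block divides evenly into 8 warps
    int blocksPerGrid   = (n + threadsPerBlock - 1) / threadsPerBlock;

    // For n = 1,000,000 this gives 3,907 blocks, comfortably above the ~1,000
    // blocks suggested to keep every SM busy for the life of the kernel.
    myKernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);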

Choosing and Implementing Acceleration Techniques

Titan is configured with a broad set of tools to facilitate acceleration of both new and existing application codes. These tools can be broken down into three categories based on their implementation methodology:

Development Tool Hierarchy

  • Accelerated Libraries
  • Accelerator Compiler Directives
  • Accelerator Languages/Frameworks

Each of these three methods has pros and cons that must be weighed for each program, and the methods are not mutually exclusive. In addition to these tools, Titan supports a wide variety of performance and debugging tools to ensure that, however you choose to implement acceleration in your program, it is done efficiently.

GPU Accelerated Libraries

Due to the performance benefits that come with GPU computing, many scientific libraries are now offering accelerated versions. If your program contains BLAS or LAPACK function calls, GPU-accelerated versions may be available. The Magma, CULA, cuBLAS, cuSPARSE, and LibSciACC libraries provide optimized GPU linear algebra routines that require only minor changes to existing code. These libraries require little understanding of the underlying GPU hardware, and performance enhancements are transparent to the end developer.
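
As a hedged sketch of what such a drop-in replacement can look like, the following uses cuBLAS to perform a double precision matrix multiply (C = A*B, column-major, N x N matrices) that might otherwise be a CPU DGEMM call; error checking is omitted for brevity:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    void gpu_dgemm(const double *A, const double *B, double *C, int N)
    {
        double *dA, *dB, *dC;
        size_t bytes = (size_t)N * N * sizeof(double);
        cudaMalloc(&dA, bytes);
        cudaMalloc(&dB, bytes);
        cudaMalloc(&dC, bytes);

        cublasHandle_t handle;
        cublasCreate(&handle);

        // Copy the column-major input matrices to the device.
        cublasSetMatrix(N, N, sizeof(double), A, N, dA, N);
        cublasSetMatrix(N, N, sizeof(double), B, N, dB, N);

        // C = 1.0 * A * B + 0.0 * C
        const double alpha = 1.0, beta = 0.0;
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    N, N, N, &alpha, dA, N, dB, N, &beta, dC, N);

        // Copy the result back to the host.
        cublasGetMatrix(N, N, sizeof(double), dC, N, C, N);

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }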

For more information on accelerated libraries available on OLCF systems, please see the Software page. For examples of using accelerated libraries, please see the GPU Accelerated Libraries Tutorial.

For more general libraries, such as Trilinos and PETSc, you will want to visit the appropriate software development site to examine the current status of GPU integration. For Trilinos, please see the latest documentation at the Sandia Trilinos page. Similarly, Argonne's PETSc documentation has a page containing the latest GPU integration information.

Accelerator Compiler Directives

Accelerator compiler directives allow the compiler, guided by the programmer, to take care of low-level accelerator work. One of the main benefits of a directive-based approach is an easier and faster transition of existing code compared to low-level GPU languages. Additional benefits include performance enhancements that are transparent to the end developer and greater portability between current and future many-core architectures. Although several vendors initially provided their own sets of proprietary directives, OpenACC now provides a unified specification for accelerator directives.
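
A minimal OpenACC-style sketch (illustrative names; built with an OpenACC-capable compiler) looks like ordinary C with a directive added; the compiler generates the accelerator kernel and the associated data movement:

    void scale_add(int n, double a, const double *restrict x, double *restrict y)
    {
        /* The directive asks the compiler to offload the loop: x is copied to the
           device, y is copied in and back out, and the iterations run in parallel. */
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }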

For information on compiling codes with directive-based GPU acceleration, please see the Accelerated Compiler Directives section in the Titan user guide. For examples of using directive-based approaches to GPU acceleration, please see the OLCF Tutorials page.

GPU Languages/Frameworks

For lower-level control over the GPU, Titan supports CUDA C, CUDA Fortran, and OpenCL. These languages and language extensions, while allowing explicit control, are generally more cumbersome than directive-based approaches and must be maintained to stay up-to-date with the latest performance guidelines. Substantial code structure changes may be needed and an in-depth knowledge of the underlying hardware is often necessary for best performance.

For more information, please see the GPU Languages/Frameworks section of the Titan user guide. For examples of using CUDA or OpenCL, please see the OLCF Tutorials page.

Interoperability Considerations:
Often a single acceleration technology is not the most efficient way to accelerate a given code. Most acceleration techniques provide mechanisms for GPU memory pointers to be passed in or out, allowing interoperability between multiple technologies. This is not universally true, however; for example, the OpenCL specification does not currently include interoperability with CUDA-based technologies. The Accelerator interoperability and Accelerator interoperability II tutorials provide several interoperability example codes. For additional details, please see the documentation of the specific technologies involved.
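
As a sketch of this kind of interoperability (illustrative names, error checking omitted), the same device pointer allocated with the CUDA runtime can be written by a hand-written kernel and then handed directly to a cuBLAS routine:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    __global__ void fill(double *x, int n, double v)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = v;
    }

    double device_sum(int n)
    {
        double *d_x;
        cudaMalloc(&d_x, n * sizeof(double));

        fill<<<(n + 255) / 256, 256>>>(d_x, n, 1.0);   // custom CUDA kernel writes d_x

        cublasHandle_t handle;
        cublasCreate(&handle);
        double result = 0.0;
        cublasDasum(handle, n, d_x, 1, &result);       // cuBLAS reads the same pointer
        cublasDestroy(handle);

        cudaFree(d_x);
        return result;
    }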

Debugging

Although debugging on the GPU has historically been difficult due to a lack of debugging tools, great strides have been made within the past several years. NVIDIA and Allinea both provide tools with interfaces similar to existing CPU debuggers to aid in debugging accelerated applications. Please see the Debugging section in the Titan user guide or the Software page for more information on the GPU debugging tools available on OLCF systems.


Verifying Accelerated Application Performance and Output

The goal of writing accelerated code is to increase program performance while maintaining program correctness. To ensure that both of these goals are reached, it is necessary to verify application results as well as to profile accelerator performance during each step of the acceleration process. How results are verified will depend on the application, since many applications will not be able to simply perform a bit-wise comparison of the final output, for the reasons discussed in K20X Floating Point Considerations below. Once the results have been verified, the developer is encouraged to make use of performance tools to verify that the accelerator is being used efficiently.

Please see the Optimization & Profiling section in the Titan user guide or the Software page for more information on the profiling tools available on OLCF systems.

K20X Floating Point Considerations:
When porting existing applications to make use of the GPU, it is natural to check the ported application's results against those of the original application. Much care must be taken when comparing results between a particular algorithm run on the CPU and the GPU since, in general, bit-for-bit agreement will not be possible.

The NVIDIA K20X provides full IEEE 754-2008 support for single and double precision floating point arithmetic. For simple floating point arithmetic (addition, subtraction, multiplication, division, square root, FMA) individual operations on the AMD Opteron and NVIDIA K20X should produce bit-for-bit identical values to the precision specified by the standard.

Transcendental operations (trigonometric, logarithmic, exponential, etc.) should not be expected to produce bit-for-bit identical values when compared against the CPU. Please refer to the CUDA Programming Guide for the ULP error bounds of the single and double precision functions.

Floating point operations are not necessarily associative or distributive. For a given algorithm, the number of threads, thread ordering, and instruction execution may differ between the CPU and GPU, resulting in potential discrepancies; as such, a series of arithmetic operations should not be expected to produce bit-for-bit identical results.
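
A common approach is therefore to compare CPU and GPU results element by element against a tolerance appropriate to the application's numerics, as in the hypothetical C sketch below:

    #include <math.h>
    #include <stdio.h>

    /* Returns 1 if every GPU value is within rel_tol (relative) of the CPU value.
       The tolerance is illustrative; choose it based on the application's numerics. */
    int results_match(const double *cpu, const double *gpu, int n, double rel_tol)
    {
        for (int i = 0; i < n; i++) {
            double denom = fabs(cpu[i]) > 0.0 ? fabs(cpu[i]) : 1.0;
            if (fabs(cpu[i] - gpu[i]) / denom > rel_tol) {
                printf("Mismatch at %d: cpu=%.17g gpu=%.17g\n", i, cpu[i], gpu[i]);
                return 0;
            }
        }
        return 1;
    }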


Running Accelerated Applications on Titan

Each of Titan's 18,688 compute nodes contains an NVIDIA K20X accelerator; the login and service nodes, however, do not. As such, the only way to reach a node containing an accelerator is through the aprun command. For more details on how to run jobs on Titan, please see the Running Jobs on Titan page in the Titan user guide.


Further Reading on Accelerated Computing

This user guide has covered only a few of the accelerated computing topics that a Titan user may encounter. For further information, a vast number of online resources are available; some links that may be of interest are listed below. Targeted links are provided in OLCF articles covering specific topics. As always, if you have any questions, please contact the OLCF helpdesk at help@olcf.ornl.gov.