K20X Accelerator Description

Each Kepler K20X accelerator contains (14) streaming multiprocessors (SMs); the SM architecture is discussed in more detail below. All work and data transfer between the CPU and GPU is managed through the PCI Express (PCIe) interface. The GigaThread engine is responsible for distributing work among the SMs, and (6) 64-bit memory controllers control access to 6 gigabytes of GDDR5 device (global) memory. A shared 1.5 megabyte L2 cache sits between global memory and each SM.

Figure: NVIDIA K20X accelerator (XK7 node)

For arithmetic operations each SM contains (192) CUDA cores that handle single precision floating point and simple integer arithmetic, (64) double precision floating point cores, and (32) special function units that handle transcendental instructions. Servicing these arithmetic/logic units are (32) memory load/store units. In addition to (65,536) 32-bit registers, each SM has 64 kilobytes of on-chip memory that is configurable between user-controlled shared memory and L1 cache. Additionally, each SM has 48 kilobytes of on-chip read-only cache that can be managed by the user or the compiler.

Figure: NVIDIA K20X SM (XK7 node) functional units

K20X Memory Hierarchy

Figure: K20X memory hierarchy

Registers:
Each thread has access to private register storage. Registers are the lowest latency and highest bandwidth memory available to each thread, but they are a scarce resource: each SM has (65,536) 32-bit registers that must be shared by all resident threads. The hardware imposes an upper limit of 255 registers per thread; if more thread-private storage is required, an area of device memory known as local memory is used.
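
The sketch below shows two common ways to keep register pressure in check; the kernel and the launch-bounds values are purely illustrative, not recommendations. Compiling with nvcc -Xptxas -v reports the registers used per thread and any spills to local memory.

    // Illustrative kernel only; 256 threads/block and 4 resident blocks/SM
    // are example targets.
    __global__ void
    __launch_bounds__(256, 4)   // caps register use so >= 4 blocks can be resident per SM
    axpy_capped(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Compile with: nvcc -arch=sm_35 -Xptxas -v file.cu
    //   (prints registers/thread and local-memory spill counts)
    // A global cap can also be set with: nvcc -maxrregcount=N file.cu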

Shared memory/L1 cache:
Each SM has a 64 kilobyte area of memory that is split between shared memory and L1 cache. The programmer specifies the split between the two.
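
As a minimal sketch of how the split is requested, the CUDA runtime exposes cudaDeviceSetCacheConfig and cudaFuncSetCacheConfig; the kernel name and body below are placeholders.

    #include <cuda_runtime.h>

    // Placeholder kernel; only the cache-configuration calls are the point here.
    __global__ void stencil_kernel(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    int main(void)
    {
        // Device-wide preference: 48 KB shared memory / 16 KB L1.
        // Kepler also accepts cudaFuncCachePreferL1 (16/48) and
        // cudaFuncCachePreferEqual (32/32).
        cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

        // The preference can also be set for an individual kernel.
        cudaFuncSetCacheConfig(stencil_kernel, cudaFuncCachePreferL1);
        return 0;
    }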

Shared memory is a programmer-managed, low latency, high bandwidth memory. Variables declared in shared memory are available to all threads within the same thread block.
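
A minimal sketch of programmer-managed shared memory, assuming a block size of 256 threads (names are illustrative): each block stages a tile of global data once, and all of its threads can then reuse the tile without further trips to DRAM.

    __global__ void shift_sum(const float *in, float *out, int n)
    {
        __shared__ float tile[256];              // visible to every thread in this block

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                         // wait until the whole tile is loaded

        if (i < n && threadIdx.x > 0)
            out[i] = tile[threadIdx.x] + tile[threadIdx.x - 1];
    }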

L1 cache on the K20X handles local memory (register spillage, stack data, etc.). Global memory loads are not cached in L1.

Read Only cache:
Each SM has 48 kilobytes of read-only cache populated from global device memory. Eligible variables are determined by the compiler, with guidance from the programmer.
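
Two common ways to give the compiler that guidance on compute capability 3.5 are sketched below with an illustrative kernel: qualify input pointers as const __restrict__ so the compiler can prove the data is read-only, or load through the __ldg() intrinsic explicitly.

    __global__ void scale_readonly(float *out, const float * __restrict__ in,
                                   float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = in[i];             // compiler may route this through the read-only cache
            // float v = __ldg(&in[i]);  // explicit read-only load (sm_35 and later)
            out[i] = a * v;
        }
    }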

Constant Cache:
Although not pictured, each SM has a small 8 kilobyte constant cache optimized for warp-level broadcast. Constants reside in device memory and have an aggregate maximum size of 64 kilobytes.
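
A minimal sketch of constant memory use (array name and size are illustrative): data is copied into the 64 kilobyte constant space with cudaMemcpyToSymbol, and the broadcast path is fastest when all threads of a warp read the same address.

    #include <cuda_runtime.h>

    __constant__ float coeff[16];                 // lives in the 64 KB constant space

    __global__ void apply_coeff(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = coeff[0] * in[i];            // whole warp reads one address: broadcast
    }

    int main(void)
    {
        float h_coeff[16] = { 2.0f };
        cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));   // host -> constant memory
        return 0;
    }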

Texture:
Texture memory provides multiple hardware-accelerated boundary and filtering modes for 1D, 2D, and 3D data. Textures reside in global memory but are cached in the 48 kilobyte read-only cache on each SM.
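
The sketch below uses the texture-object API (CUDA 5.0 and later, supported on the K20X) to read a plain linear buffer through the texture path; the richer addressing and filtering modes require cudaArray-backed textures, which are omitted here, and all names are illustrative.

    #include <cuda_runtime.h>
    #include <cstring>

    __global__ void copy_through_tex(cudaTextureObject_t tex, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = tex1Dfetch<float>(tex, i);   // cached, read-only fetch
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *d_in, *d_out;
        cudaMalloc(&d_in, n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));

        // Describe the linear device buffer backing the texture.
        cudaResourceDesc resDesc;
        std::memset(&resDesc, 0, sizeof(resDesc));
        resDesc.resType = cudaResourceTypeLinear;
        resDesc.res.linear.devPtr = d_in;
        resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
        resDesc.res.linear.sizeInBytes = n * sizeof(float);

        cudaTextureDesc texDesc;
        std::memset(&texDesc, 0, sizeof(texDesc));
        texDesc.readMode = cudaReadModeElementType;

        cudaTextureObject_t tex = 0;
        cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

        copy_through_tex<<<(n + 255) / 256, 256>>>(tex, d_out, n);
        cudaDeviceSynchronize();

        cudaDestroyTextureObject(tex);
        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }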

L2 cache:
Each K20X contains 1.5 megabytes of L2 cache. All accesses to DRAM, including transfers to and from the host, go through this cache.

DRAM (Global memory):
Each K20X contains 6 gigabytes of GDDR5 DRAM. On Titan, ECC is enabled, which reduces the available global memory by 12.5%; approximately 5.25 gigabytes are available to the user. In addition to global memory, DRAM also holds local memory (used in case of register spillage), constant memory, and texture memory.
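
A quick way to confirm what ECC leaves available on a given device is to query the runtime, as in this small sketch:

    #include <cuda_runtime.h>
    #include <cstdio>

    int main(void)
    {
        size_t free_bytes = 0, total_bytes = 0;
        cudaMemGetInfo(&free_bytes, &total_bytes);   // bytes free / total on the current device
        printf("total: %.2f GB, free: %.2f GB\n",
               total_bytes / 1073741824.0, free_bytes / 1073741824.0);
        return 0;
    }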

K20X Instruction Issuing

When a kernel is launched on the K20X, the GigaThread engine dispatches the kernel's thread blocks to available SMs. Each SM can host up to 2048 resident threads or 16 resident thread blocks, including blocks from concurrent kernels, if resources allow. All threads within a particular thread block must reside on a single SM.

Once a thread block is assigned to a particular SM, all threads contained in the block execute entirely on that SM. At the hardware level, each thread block is broken down into chunks of 32 consecutive threads, referred to as warps. The K20X issues instructions at the warp level; that is, an instruction is issued in vector-like fashion to 32 consecutive threads at a time. This execution model is referred to as Single Instruction Multiple Thread, or SIMT.
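
To make the mapping concrete, the sketch below (sizes are illustrative) launches 4096 blocks of 256 threads; the GigaThread engine distributes the blocks across SMs, and each block executes on its SM as 256 / 32 = 8 warps.

    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;     // global thread index
        // threadIdx.x / warpSize gives this thread's warp within the block (0..7 here)
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Host side: <<<blocks, threads per block>>>; all 256 threads of a given
    // block run on a single SM.
    // saxpy<<<4096, 256>>>(n, 2.0f, d_x, d_y);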

Figure: warp dispatch

Each Kepler SM has (4) warp schedulers. When a block is divided into warps, each warp is assigned to a warp scheduler and stays on that scheduler for its lifetime. The scheduler is able to switch between concurrent warps, originating from any block of any kernel, without overhead. When one warp stalls (that is, its next instruction cannot be executed in the next cycle), the scheduler switches to a warp that is able to execute an instruction. This low-overhead warp swapping allows instruction latency to be effectively hidden, assuming enough warps with issuable instructions reside on the SM.
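
How many warps are actually resident depends on a kernel's resource use; a rough way to check, sketched below with a placeholder kernel, is the occupancy query available in the CUDA 6.5 and later runtime.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Placeholder kernel used only to have something to query.
    __global__ void dummy_kernel(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main(void)
    {
        int blocks_per_sm = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm,
                                                      dummy_kernel, 256, 0);
        // 256 threads/block = 8 warps/block; the K20X caps residency at
        // 64 warps (2048 threads) and 16 blocks per SM.
        printf("resident blocks/SM: %d (%d warps)\n",
               blocks_per_sm, blocks_per_sm * 256 / 32);
        return 0;
    }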

Each warp scheduler has (2) instruction dispatch units. Each cycle the scheduler selects a warp and, if possible, issues two independent instructions to that warp. Two clock cycles are required to issue a double precision floating point instruction to a full warp.

Figure: warp scheduling

K20X by the numbers

Model K20X
Compute Capability 3.5
Peak double precision floating point performance 1.31 teraflops
Peak single precision floating point performance 3.95 teraflops
Single precision CUDA cores 2688
Double precision CUDA cores 896
CUDA core clock frequency 732 MHz
Memory bandwidth (ECC off) 250 GB/s*
Memory size GDDR5 (ECC on) 5.25 GB
L2 cache 1.5 MB
Shared memory/L1 configurable 64 KB per SM
Read-only cache 48 KB per SM
Constant memory 64 KB
32-bit Registers 65,536 per SM
Max registers per thread 255
Number of multiprocessors (SM) 14
Warp size 32 threads
Maximum resident warps per SM 64
Maximum resident blocks per SM 16
Maximum resident threads per SM 2048
Maximum threads per block 1024 threads
Maximum block dimensions 1024, 1024, 64
Maximum grid dimensions 2147483647, 65535, 65535

* ECC will reduce achievable bandwidth by 15% or more

K20X Floating Point Considerations

When porting existing applications to make use of the GPU, it is natural to check the ported application's results against those of the original application. Much care must be taken when comparing results between a particular algorithm run on the CPU and GPU since, in general, bit-for-bit agreement will not be possible.

The NVIDIA K20X provides full IEEE 754-2008 support for single and double precision floating point arithmetic. For simple floating point arithmetic (addition, subtraction, multiplication, division, square root, FMA), individual operations on the AMD Opteron and NVIDIA K20X should produce bit-for-bit identical values to the precision specified by the standard.

Transcendental operations (trigonometric, logarithmic, exponential, etc.) should not be expected to produce bit-for-bit identical values when compared against the CPU. Please refer to the CUDA Programming Guide for the ULP error bounds of the single and double precision functions.

Floating point operations are not necessarily associative or distributive. For a given algorithm, the number of threads, thread ordering, and instruction execution may differ between the CPU and GPU, resulting in potential discrepancies; as such, it should not be expected that a series of arithmetic operations will produce bit-for-bit identical results.
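
Because of this, ported codes are usually validated with a tolerance-based comparison rather than an exact one; the sketch below is one such check, with illustrative tolerance values that must be chosen per application.

    #include <math.h>
    #include <stdio.h>

    // Returns 1 if cpu and gpu agree to within the given relative or
    // absolute tolerance, whichever is looser at that magnitude.
    static int nearly_equal(double cpu, double gpu, double rel_tol, double abs_tol)
    {
        double diff  = fabs(cpu - gpu);
        double scale = fmax(fabs(cpu), fabs(gpu));
        return diff <= fmax(abs_tol, rel_tol * scale);
    }

    int main(void)
    {
        double cpu = 1.0000000000000002, gpu = 1.0;   // illustrative values
        printf("%s\n", nearly_equal(cpu, gpu, 1e-12, 1e-15) ? "match" : "mismatch");
        return 0;
    }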