Publishing Notice

Prior to publishing work performed on Summit, please contact help@olcf.ornl.gov.

The NVIDIA Tesla V100 accelerator has a peak performance of 7.8 TFLOP/s (double-precision) and performs the majority of the computational work on Summit. Each V100 contains 80 streaming multiprocessors (SMs), 16 GB of high-bandwidth memory (HBM2), and a 6 MB L2 cache that is available to the SMs. The GigaThread Engine is responsible for distributing work among the SMs, and eight 512-bit memory controllers control access to the 16 GB of HBM2 memory. The V100 uses NVIDIA's NVLink interconnect to pass data between GPUs as well as between the CPUs and GPUs.

[Figure: GV100]
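Many of the characteristics described above can be confirmed at run time through the CUDA runtime API. The following is a minimal, illustrative sketch that queries the properties of GPU 0 with cudaGetDeviceProperties:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        printf("Device            : %s\n", prop.name);
        printf("SMs               : %d\n", prop.multiProcessorCount);       // 80 on the V100
        printf("Global memory     : %.1f GB\n", prop.totalGlobalMem / 1e9); // 16 GB HBM2
        printf("L2 cache          : %.1f MB\n", prop.l2CacheSize / 1e6);    // 6 MB
        printf("Compute capability: %d.%d\n", prop.major, prop.minor);      // 7.0
        return 0;
    }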

NVIDIA V100 SM

Each SM on the V100 contains 32 FP64 (double-precision) cores, 64 FP32 (single-precision) cores, 64 INT32 cores, and 8 tensor cores. A 128 KB memory block, shared between the L1 cache and shared memory, can be configured to provide up to 96 KB of shared memory. In addition, each SM has 4 texture units that use the (configured portion of the) L1 cache.

[Figure: GV100_SM]
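Kernels that want more than the default 48 KB of dynamic shared memory must opt in explicitly. The sketch below (assuming CUDA 9 or later; the kernel name is hypothetical) raises the per-kernel limit toward the 96 KB maximum available on the V100:

    #include <cuda_runtime.h>

    extern __shared__ float tile[];          // dynamically sized shared memory

    __global__ void uses_big_tile(float *out) {
        tile[threadIdx.x] = (float)threadIdx.x;
        __syncthreads();
        out[threadIdx.x] = tile[threadIdx.x];
    }

    int main() {
        // Kernels must opt in to more than 48 KB of dynamic shared memory;
        // on the V100 the per-block limit can be raised to 96 KB.
        int bytes = 96 * 1024;
        cudaFuncSetAttribute(uses_big_tile,
                             cudaFuncAttributeMaxDynamicSharedMemorySize, bytes);

        float *out;
        cudaMalloc(&out, 256 * sizeof(float));
        uses_big_tile<<<1, 256, bytes>>>(out);
        cudaDeviceSynchronize();
        cudaFree(out);
        return 0;
    }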

HBM2

Each V100 has access to 16 GB of high-bandwidth memory (HBM2), which can be accessed at speeds of up to 900 GB/s. Access to this memory is controlled by eight 512-bit memory controllers, and all accesses to the high-bandwidth memory go through the 6 MB L2 cache.

NVIDIA NVLink

The processors within a node are connected by NVIDIA’s NVLink interconnect. Each link has a peak bandwidth of 25 GB/s (in each direction), and since there are 2 links between processors, data can be transferred from GPU-to-GPU and CPU-to-GPU at a peak rate of 50 GB/s.

NOTE: The 50-GB/s peak bandwidth stated above is for data transfers in a single direction. However, this bandwidth can be achieved in both directions simultaneously, giving a peak “bi-directional” bandwidth of 100 GB/s between processors.

The figure below shows a schematic of the NVLink connections between the CPU and GPUs on a single socket of a Summit node.

[Figure: NVLink2]
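From CUDA, GPU-to-GPU transfers over NVLink can be driven with peer-to-peer copies once peer access is enabled. The following is a minimal sketch, assuming the calling process can see two NVLink-connected GPUs as devices 0 and 1:

    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 1 << 28;          // 256 MB test buffer
        void *src, *dst;

        // Allow GPU 0 and GPU 1 to read/write each other's memory
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);
        cudaMalloc(&src, bytes);

        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
        cudaMalloc(&dst, bytes);

        // Direct GPU-to-GPU copy; with peer access enabled this travels over NVLink
        cudaMemcpyPeer(dst, 1, src, 0, bytes);
        cudaDeviceSynchronize();

        cudaFree(dst);
        cudaSetDevice(0);
        cudaFree(src);
        return 0;
    }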

Tensor Cores

The Tesla V100 contains 640 tensor cores (8 per SM) intended to enable faster training of large neural networks. Each tensor core performs a D = AB + C operation on 4×4 matrices. A and B are FP16 matrices, while C and D can be either FP16 or FP32:


[Figure: nv_tensor_core_1]

Each of the 16 elements that result from the AB matrix multiplication comes from 4 floating-point fused-multiply-add (FMA) operations (essentially a dot product between a row of A and a column of B). Each FP16 multiply yields a full-precision product, which is then accumulated into an FP32 result:


[Figure: nv_tensor_core_2]

Each tensor core performs 64 of these FMA operations per clock. The 4×4 matrix operations outlined here can be combined to perform matrix operations on larger (and higher dimensional) matrices.
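In CUDA, the tensor cores are exposed through the warp matrix (WMMA) API, which operates on larger tiles (e.g. 16x16x16) built from these 4x4 operations. The kernel below is a minimal sketch (assuming CUDA 9 or later; the kernel name is hypothetical) in which one warp computes a single 16x16 tile of D = AB + C with C initialized to zero:

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // One warp computes a single 16x16 tile of D = A*B + C (here C = 0).
    __global__ void wmma_tile(const half *a, const half *b, float *d) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

        wmma::fill_fragment(acc_frag, 0.0f);                  // C = 0
        wmma::load_matrix_sync(a_frag, a, 16);                // leading dimension 16
        wmma::load_matrix_sync(b_frag, b, 16);
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);   // FP16 multiply, FP32 accumulate
        wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
    }

The kernel must be launched with at least one full warp (e.g. wmma_tile<<<1, 32>>>(a, b, d)) and compiled for compute capability 7.0 (nvcc -arch=sm_70).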

Volta Multi-Process Service

When a CUDA program begins, each MPI rank creates a separate CUDA context on the GPU, but the scheduler on the GPU only allows one CUDA context (and so one MPI rank) to launch work at a time. This means that multiple MPI ranks can share access to the same GPU, but each rank gets exclusive access while the other ranks wait (time-slicing). This can leave the GPU underutilized if the rank that currently holds exclusive access does not perform enough work to saturate its resources. The following figure depicts such time-sliced access to a pre-Volta GPU.

[Figure: nv_mps_1]

 
The Multi-Process Service (MPS) enables multiple processes (e.g. MPI ranks) to concurrently share the resources on a single GPU. This is accomplished by starting an MPS server process, which funnels the work from multiple CUDA contexts (e.g. from multiple MPI ranks) into a single CUDA context. In some cases, this can increase performance due to better utilization of the resources. The figure below illustrates MPS on a pre-Volta GPU.


[Figure: nv_mps_2]

 
Volta GPUs improve MPS with new capabilities. For instance, each Volta MPS client (MPI rank) is assigned a “subcontext” that has its own GPU address space, instead of sharing the address space with other clients. This isolation helps protect MPI ranks from out-of-range reads/writes performed by other ranks within CUDA kernels. Because each subcontext manages its own GPU resources, it can submit work directly to the GPU without the need to first pass through the MPS server. In addition, Volta GPUs support up to 48 MPS clients (up from 16 MPS clients on Pascal).


[Figure: nv_mps_3]

 

For more information, please see the following document from NVIDIA:
https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf

Unified Memory

Unified memory is a single virtual address space that is accessible to any processor in a system (within a node). This means that programmers only need to allocate a single unified-memory pointer (e.g. using cudaMallocManaged) that can be accessed by both the CPU and GPU, instead of requiring separate allocations for each processor. This “managed memory” is automatically migrated to the accessing processor, which eliminates the need for explicit data transfers.


[Figure: nv_um_1]
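A minimal sketch of this model (illustrative only): a single cudaMallocManaged allocation is initialized on the CPU, updated by a GPU kernel, and read back on the CPU, with no explicit cudaMemcpy calls:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(double *x, int n, double a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main() {
        const int n = 1 << 20;
        double *x;

        // Single allocation visible to both the CPU and the GPU
        cudaMallocManaged(&x, n * sizeof(double));

        for (int i = 0; i < n; i++) x[i] = 1.0;      // touched on the CPU

        scale<<<(n + 255) / 256, 256>>>(x, n, 2.0);  // pages migrate to the GPU as needed
        cudaDeviceSynchronize();

        printf("x[0] = %f\n", x[0]);                 // pages migrate back when the CPU reads
        cudaFree(x);
        return 0;
    }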

 

On Pascal-generation GPUs and later, this automatic migration is enhanced with hardware support. A page migration engine enables GPU page faulting, which allows the desired pages to be migrated to the GPU "on demand" instead of migrating the entire "managed" allocation. In addition, 49-bit virtual addressing allows programs using unified memory to access the full system memory size. The combination of GPU page faulting and larger virtual addressing allows programs to oversubscribe the system memory, so very large data sets can be processed. In addition, new CUDA API functions introduced in CUDA 8 allow users to fine-tune the use of unified memory.
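For example, cudaMemPrefetchAsync and cudaMemAdvise (both added in CUDA 8) let a program migrate managed pages ahead of time and give the driver hints about expected access patterns. A minimal sketch, with the kernel launches omitted:

    #include <cuda_runtime.h>

    int main() {
        const size_t n = 1 << 24;
        double *x;
        int device = 0;
        cudaGetDevice(&device);

        cudaMallocManaged(&x, n * sizeof(double));

        // Hint that x will mostly be read (allows read-only copies on CPU and GPU)
        cudaMemAdvise(x, n * sizeof(double), cudaMemAdviseSetReadMostly, device);

        // Move the pages to the GPU ahead of the kernel launches to avoid page faults
        cudaMemPrefetchAsync(x, n * sizeof(double), device, 0);

        // ... launch kernels that read x ...

        // Prefetch back to the host before CPU post-processing
        cudaMemPrefetchAsync(x, n * sizeof(double), cudaCpuDeviceId, 0);
        cudaDeviceSynchronize();

        cudaFree(x);
        return 0;
    }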

Unified memory is further improved on Volta GPUs through access counters, which allow the driver to track where a page is accessed most often and tune its placement automatically.

For more information, please see the following section of NVIDIA’s CUDA Programming Guide:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd

Independent Thread Scheduling

The V100 supports independent thread scheduling, which allows threads to synchronize and cooperate at sub-warp scales. Pre-Volta GPUs implemented warps (groups of 32 threads which execute instructions in single-instruction, multiple thread – SIMT – mode) with a single call stack and program counter for a warp as a whole.

[Figure: nv_ind_threads_1]

Within a warp, a mask is used to specify which threads are currently active when divergent branches of code are encountered. The (active) threads within each branch execute their statements serially before threads in the next branch execute theirs. This means that programs on pre-Volta GPUs should avoid sub-warp synchronization; a sync point in the branches could cause a deadlock if all threads in a warp do not reach the synchronization point.
 

[Figure: nv_ind_threads_2]

The Volta V100 enables sub-warp synchronization by implementing warps with a program counter and call stack for each individual thread (i.e. independent thread scheduling).

[Figure: nv_ind_threads_3]

This implementation allows threads to diverge and synchronize at the sub-warp level using the __syncwarp() function. The independent thread scheduling enables the thread scheduler to stall execution of any thread, allowing other threads in the warp to execute different statements. This means that threads in one branch can stall at a sync point and wait for the threads in the other branch to reach their sync point.

[Figure: nv_ind_threads_4]
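A minimal sketch of this pattern (illustrative only; assumes the block size is a multiple of 32): threads in a warp take different branches, reconverge with __syncwarp(), and then cooperate in a warp-level reduction using the synchronizing shuffle intrinsics:

    __global__ void branch_then_reduce(const float *data, float *out) {
        int lane = threadIdx.x % 32;
        float v = data[blockIdx.x * blockDim.x + threadIdx.x];

        if (lane < 16) {
            v *= 2.0f;      // work done only by the first half of the warp
        } else {
            v += 1.0f;      // work done only by the second half of the warp
        }
        __syncwarp();       // reconverge the warp before threads exchange data

        // Warp-level sum; each __shfl_down_sync also synchronizes the named lanes
        for (int offset = 16; offset > 0; offset /= 2)
            v += __shfl_down_sync(0xffffffff, v, offset);

        if (lane == 0)
            out[(blockIdx.x * blockDim.x + threadIdx.x) / 32] = v;
    }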

For more information, please see the following section of NVIDIA’s CUDA Programming Guide:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#independent-thread-scheduling-7-x

Tesla V100 Specifications

Compute capability: 7.0
Peak double-precision floating-point performance: 7.8 TFLOP/s
Peak single-precision floating-point performance: 15.7 TFLOP/s
Single-precision CUDA cores: 5120
Double-precision CUDA cores: 2560
Tensor cores: 640
Clock frequency: 1530 MHz
Memory bandwidth: 900 GB/s
Memory size (HBM2): 16 GB
L2 cache: 6 MB
Shared memory size per SM: configurable up to 96 KB
Constant memory: 64 KB
Register file size per SM: 256 KB
32-bit registers per SM: 65536
Maximum registers per thread: 255
Number of multiprocessors (SMs): 80
Warp size: 32 threads
Maximum resident warps per SM: 64
Maximum resident blocks per SM: 32
Maximum resident threads per SM: 2048
Maximum threads per block: 1024
Maximum block dimensions: 1024 x 1024 x 64
Maximum grid dimensions: 2147483647 x 65535 x 65535
Maximum number of MPS clients: 48

Further Reading

For more information on the NVIDIA Volta architecture, please visit the following (outside) links.

NVIDIA Volta Architecture White Paper:
http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf

NVIDIA Parallel Forall blog article:
https://devblogs.nvidia.com/parallelforall/inside-volta/