The default GPU compute mode for Titan is Exclusive Process. In this mode, many threads within a single process may access the GPU context, but only one process at a time may do so. To allow multiple processes, such as multiple MPI tasks on a single node, to access the GPU context, the CUDA proxy server was developed. Once enabled, the proxy server transparently manages work issued to the GPU context from multiple processes. The full source for these examples can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.
How to Enable
To enable the proxy server, set the following environment variable before invoking aprun:
$ export CRAY_CUDA_MPS=1
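The same export may be placed in a batch script before the aprun call. A minimal sketch, assuming the hypothetical project ID PROJ### and the sleep.out executable built below:

#!/bin/bash
#PBS -A PROJ###
#PBS -lnodes=1,walltime=00:30:00

cd $PBS_O_WORKDIR

# Enable the CUDA proxy server before launching the application
export CRAY_CUDA_MPS=1
aprun -n14 -N14 ./sleep.out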
Issues
Currently, GPU debugging and profiling are not supported when the proxy is enabled. Specifying the qsub flag -lfeature=gpudefault will switch the compute mode from exclusive process to the CUDA default mode. In the default mode, debugging and profiling are available and multiple MPI ranks are able to access the GPU without the proxy. However, the default compute mode is not recommended on Titan: approximately 120 MB of device memory is consumed per process accessing the GPU, and inconsistent behavior may be encountered under certain conditions.
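Which compute mode is active can be verified from within a program. A minimal standalone sketch using the CUDA runtime API (not part of the examples below), compiled with nvcc:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    // Query device 0 and report its compute mode
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    if (prop.computeMode == cudaComputeModeExclusiveProcess)
        printf("Compute mode: exclusive process\n");
    else if (prop.computeMode == cudaComputeModeDefault)
        printf("Compute mode: default\n");
    else
        printf("Compute mode: %d\n", prop.computeMode);

    return 0;
}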
Examples
The following examples demonstrate when and how to use the CUDA proxy. In each example several kernels are launched on the GPU; these examples will be used in a future tutorial to investigate how multiple kernels are scheduled on the GPU.
For these examples, request an interactive job on a single node:
$ qsub -I -A PROJ### -lnodes=1,walltime=00:30:00
C Wrapper
The following C wrapper functions will be used to set up and launch a kernel that sleeps a single GPU thread for a specified number of seconds. This file will be compiled with NVCC and then linked into our examples.
pauseWrapper.cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// CUDA kernel to pause for at least num_cycles cycles
__global__ void sleep(int64_t num_cycles)
{
    int64_t cycles = 0;
    int64_t start = clock64();
    while(cycles < num_cycles) {
        cycles = clock64() - start;
    }
}

// Returns number of cycles required for requested seconds
extern "C" int64_t get_cycles(int seconds)
{
    // Get device frequency in Hz
    int64_t Hz;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    Hz = int64_t(prop.clockRate) * 1000;

    // Calculate number of cycles to wait
    int64_t num_cycles;
    num_cycles = seconds * Hz;

    return num_cycles;
}

// Launches a kernel that sleeps for at least num_cycles
extern "C" void sleep_kernel(int64_t num_cycles)
{
    // Our kernel will launch a single thread to sleep the kernel
    int blockSize, gridSize;
    blockSize = 1;
    gridSize = 1;

    // Execute the kernel
    sleep<<<gridSize, blockSize>>>(num_cycles);
}

// Wait for all pending GPU transactions to end
extern "C" void wait_for_gpu()
{
    cudaDeviceSynchronize();
}
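Note that the wrapper above does not check CUDA return codes. A minimal sketch of how checking could be added (sleep_kernel_checked is an illustrative variant, not part of the original file):

// Illustrative variant of sleep_kernel with error checking:
// verify the launch was accepted and the kernel ran without error
extern "C" void sleep_kernel_checked(int64_t num_cycles)
{
    sleep<<<1, 1>>>(num_cycles);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "Launch failed: %s\n", cudaGetErrorString(err));

    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "Kernel failed: %s\n", cudaGetErrorString(err));
}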
Compiling
The following will compile the wrapper functions into object code to be linked in with our examples.
$ module load cudatoolkit
$ nvcc -c -arch=sm_35 pauseWrapper.cu
Multiple kernels launched from a single process
Launching multiple kernels from a single process does not require the CUDA proxy.
launcher.cpp
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

extern "C" int64_t get_cycles(int seconds);
extern "C" void sleep_kernel(int64_t num_cycles);
extern "C" void wait_for_gpu();

int main(int argc, char *argv[])
{
    // Get number of cycles to sleep for 1 second
    int64_t cycles;
    cycles = get_cycles(1);

    // Number of kernels to launch
    int num_kernels = 14;

    // Launch num_kernels kernels asynchronously
    for(int i=0; i<num_kernels; i++){
        sleep_kernel(cycles);
    }

    // Wait for all kernels to complete
    wait_for_gpu();

    return 0;
}
Compiling and running multiple kernels from a single process
$ CC pauseWrapper.o launcher.cpp -o sleep.out
$ aprun -n1 ./sleep.out
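As written, all 14 kernels enter the default stream and therefore execute one after another. To allow kernels from a single process to overlap, a stream argument could be threaded through the wrapper; a sketch of a hypothetical variant (sleep_kernel_async and the stream handling below are illustrative additions, not part of the original files):

// Hypothetical wrapper in pauseWrapper.cu accepting a stream
extern "C" void sleep_kernel_async(int64_t num_cycles, cudaStream_t stream)
{
    sleep<<<1, 1, 0, stream>>>(num_cycles);
}

// In the launcher: one stream per kernel so the launches may overlap
cudaStream_t streams[14];
for(int i=0; i<14; i++){
    cudaStreamCreate(&streams[i]);
    sleep_kernel_async(cycles, streams[i]);
}
cudaDeviceSynchronize();
for(int i=0; i<14; i++)
    cudaStreamDestroy(streams[i]);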
Multiple kernels launched from multiple threads
Launching multiple kernels from multiple threads spawned from a single process does not require the CUDA proxy.
launcherOMP.cpp
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <omp.h>

extern "C" int64_t get_cycles(int seconds);
extern "C" void sleep_kernel(int64_t num_cycles);
extern "C" void wait_for_gpu();

int main(int argc, char *argv[])
{
    // Get number of cycles to sleep for 1 second
    int64_t cycles;
    cycles = get_cycles(1);

    // Number of kernels to launch
    int num_kernels = 14;

    // Launch one kernel from each OpenMP thread
    #pragma omp parallel for shared(cycles) num_threads(num_kernels)
    for(int i=0; i<num_kernels; i++){
        sleep_kernel(cycles);
    }

    // Wait for all kernels to complete
    wait_for_gpu();

    return 0;
}
Compiling and running multiple kernels launched from multiple threads
$ CC pauseWrapper.o launcherOMP.cpp -o sleep.out
$ aprun -n1 -d14 ./sleep.out
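The OpenMP flag required by the CC wrapper depends on the programming environment that is loaded; under PrgEnv-cray OpenMP is enabled by default, while other environments need an explicit flag. A sketch assuming PrgEnv-gnu:

$ module swap PrgEnv-cray PrgEnv-gnu
$ CC -fopenmp pauseWrapper.o launcherOMP.cpp -o sleep.out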
Multiple kernels launched from multiple MPI tasks
Launching multiple kernels from multiple MPI tasks on a single node does require the CUDA proxy.
launcherMPI.cpp
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <mpi.h>

extern "C" int64_t get_cycles(int seconds);
extern "C" void sleep_kernel(int64_t num_cycles);
extern "C" void wait_for_gpu();

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Get number of cycles to sleep for 1 second
    int64_t cycles;
    cycles = get_cycles(1);

    // Each rank launches a kernel that sleeps for 1 second
    sleep_kernel(cycles);

    // Wait for the kernel to complete
    wait_for_gpu();

    MPI_Finalize();
    return 0;
}
Compiling and running multiple kernels launched from multiple MPI tasks
$ CC pauseWrapper.o launcherMPI.cpp -o sleep.out
$ export CRAY_CUDA_MPS=1
$ aprun -n14 -N14 ./sleep.out
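If the export is omitted, ranks beyond the first will generally fail to create a context, since the device remains in exclusive-process mode. To confirm that all ranks reach the GPU through the proxy, an illustrative check could be added to launcherMPI.cpp after MPI_Comm_rank (the cuda_runtime.h include and the pciBusID query are assumptions, not part of the original example):

// Illustrative check (requires #include <cuda_runtime.h>): each rank
// reports the device it sees; with the proxy enabled all 14 ranks
// should report the same GPU
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
printf("Rank %d sees %s (PCI bus ID %d)\n", rank, prop.name, prop.pciBusID);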