The default GPU compute mode on Titan is Exclusive Process. In this mode, many threads within a single process may access the GPU context. To allow multiple processes to access the GPU context, such as multiple MPI tasks on a single node sharing the GPU, the CUDA proxy server was developed. Once enabled, the proxy server transparently manages work issued to the GPU context from multiple processes. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.

Currently, GPU memory is not guarded between processes accessing the proxy, meaning process i can access memory allocated by process j. This SHOULD NOT be used to share memory between processes, and care should be taken to ensure each process accesses only GPU memory it has allocated itself.
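For illustration, the safe pattern under the proxy is for each MPI rank to allocate its own device buffer and pass only that pointer to its own kernels. The following is a minimal sketch, not part of the tutorial code; the buffer size and usage are placeholders:

#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Each rank allocates, uses, and frees only its own device buffer.
    // Device pointers are never exchanged between ranks, even though the
    // proxy would not prevent such access.
    float *d_buf;
    cudaMalloc(&d_buf, 1024 * sizeof(float));
    printf("Rank %d allocated its own device buffer\n", rank);

    // ... launch kernels that touch only d_buf ...

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}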

How to Enable

To enable the proxy server, the following environment variable must be set before invoking aprun:

$ export CRAY_CUDA_MPS=1
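For batch jobs, the same export must appear in the job script before the aprun line. A minimal sketch follows; the account, walltime, and launch line are placeholders, and the aprun flags should match the example being run:

#!/bin/bash
#PBS -A PROJ###
#PBS -l nodes=1,walltime=00:30:00

# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR

export CRAY_CUDA_MPS=1
aprun -n14 -N14 ./sleep.out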

Issues

Currently, GPU debugging and profiling are not supported when the proxy is enabled. Specifying the qsub flag -lfeature=gpudefault will switch the compute mode from exclusive process to the CUDA default mode. In the default mode, debugging and profiling are available and multiple MPI ranks are able to access the GPU. However, the default compute mode is not recommended on Titan: approximately 120 MB of device memory is used per process accessing the GPU, and inconsistent behavior may be encountered under certain conditions.
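For example, an interactive request in the default compute mode adds the feature flag to the qsub line used below (account and walltime are placeholders):

$ qsub -I -A PROJ### -lnodes=1,walltime=00:30:00 -lfeature=gpudefault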

Examples

The following examples demonstrate when and how to use the CUDA proxy. In each example, several kernels are launched against a single GPU. These examples will be used in a future tutorial to investigate how multiple kernels are scheduled on the GPU.

For these examples, use a single node and request an interactive job:

$ qsub -I -A PROJ### -lnodes=1,walltime=00:30:00

C Wrapper

The following C wrapper functions are used to set up and launch a kernel that sleeps a single GPU thread for a specified number of seconds. This file will be compiled with NVCC and then linked into our examples.

pauseWrapper.cu

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

// CUDA kernel to pause for at least num_cycle cycles
__global__ void sleep(int64_t num_cycles)
{
    int64_t cycles = 0;
    int64_t start = clock64();
    while(cycles < num_cycles) {
        cycles = clock64() - start;
    }
}

// Returns number of cycles required for requested seconds
extern "C" int64_t get_cycles(int seconds)
{
    // Get device frequency in Hz
    int64_t Hz;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    Hz = int64_t(prop.clockRate) * 1000;

    // Calculate number of cycles to wait
    int64_t num_cycles;
    num_cycles = seconds * Hz;

    return num_cycles;
}

// Launches a kernel that sleeps for at least num_cycles
extern "C" void sleep_kernel(int64_t num_cycles)
{
    // The kernel is launched with a single thread, which sleeps for num_cycles
    int blockSize, gridSize;
    blockSize = 1;
    gridSize = 1;

    // Execute the kernel
    sleep<<<gridSize, blockSize>>>(num_cycles);
}

// Wait for all pending GPU transactions to end
extern "C" void wait_for_gpu()
{
    cudaDeviceSynchronize();
}

Compiling

The following will compile the wrapper functions into object code to be linked in with our examples.

$ module load cudatoolkit
$ nvcc -c -arch=sm_35 pauseWrapper.cu

Multiple kernels launched from a single process

Launching multiple kernels from a single process does not require the CUDA proxy.

launcher.cpp

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

extern "C" int64_t get_cycles(int seconds);
extern "C" void sleep_kernel(int64_t num_cycles);
extern "C" void wait_for_gpu();

int main(int argc, char *argv[])
{
    // Get number of cycles to sleep for 1 second
    int64_t cycles;
    cycles = get_cycles(1);

    // Number of kernels to launch
    int num_kernels = 14;

    // Launch num_kernels kernels asynchronously
    for(int i=0; i<num_kernels; i++){
        sleep_kernel(cycles);
    }

    // Wait for all kernels to complete
    wait_for_gpu();

    return 0;
}

Compiling and running multiple kernels from a single process

$ CC pauseWrapper.o launcher.cpp -o sleep.out
$ aprun -n1 ./sleep.out

Multiple kernels launched from multiple threads

Launching multiple kernels from multiple threads spawned from a single process does not require the CUDA proxy.

launcherOMP.cpp

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <omp.h>

extern "C" int64_t get_cycles(int seconds);
extern "C" void sleep_kernel(int64_t num_cycles);
extern "C" void wait_for_gpu();

int main(int argc, char *argv[])
{
    // Get number of cycles to sleep for 1 second
    int64_t cycles;
    cycles = get_cycles(1);

    // Number of kernels to launch
    int num_kernels = 14;

    // Launch one kernel per OpenMP thread
    omp_set_num_threads(num_kernels);

    #pragma omp parallel for shared(cycles) num_threads(num_kernels)
    for(int i=0; i<num_kernels; i++){
        sleep_kernel(cycles);
    }

    // Wait for all kernels to complete
    wait_for_gpu();

    return 0;
}

Compiling and running multiple kernels launched from multiple threads

$ CC pauseWrapper.o launcherOMP.cpp -o sleep.out
$ aprun -n1 -d14 ./sleep.out

Multiple kernels launched from multiple MPI tasks

Launching multiple kernels from multiple MPI tasks on a single node does require the CUDA proxy.

launcherMPI.cpp

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <mpi.h>

extern "C" int64_t get_cycles(int seconds);
extern "C" void sleep_kernel(int64_t num_cycles);
extern "C" void wait_for_gpu();

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Get number of cycles to sleep for 1 second
    int64_t cycles;
    cycles = get_cycles(1);

    // Sleep kernel for 1 second
    sleep_kernel(cycles);

    // Wait for the kernel to complete
    wait_for_gpu();

    MPI_Finalize();
    return 0;
}

Compiling and running multiple kernels launched from multiple MPI tasks

$ CC pauseWrapper.o launcherMPI.cpp -o sleep.out
$ export CRAY_CUDA_MPS=1
$ aprun -n14 -N14 ./sleep.out