

# PROGRAMMING MULTI-GPU NODES

Steve Abbott, February 12, 2019

### AGENDA

Multi-GPU Programming Models Multi-GPU Programming with OpenACC and CUDA

#### SUMMIT NODE

(2) IBM POWER9 + (6) NVIDIA VOLTA V100



(50 GB/s) NVLink2

## MULTI-GPU PROGRAMMING MODELS

#### **MULTI-GPU PROGRAMMING MODELS**

Single Thread, Multiple GPUs

• A single thread will change devices as-needed to send data and kernels to different GPUs

Multiple Threads, Multiple GPUs

• Using OpenMP, Pthreads, or similar, each thread can manage its own GPU

Multiple Ranks, Single GPU

- Each rank acts as-if there's just 1 GPU, but multiple ranks per node use all GPUs Multiple Ranks, Multiple GPUs
- Each rank manages multiple GPUs, multiple ranks/node. Gets complicated quickly!

#### MULTI-GPU PROGRAMMING MODELS

#### Trade-offs Between Approaches

- Conceptually • Simple
- Requires • additional loops
- CPU can become a • bottleneck
- Remaining CPU • cores often underutilized

Conceptually Very • Simple

•

- Set and forget the device numbers
- Relies on external Threading API
- Can see improved • utilization
- Watch affinity

- Little to no code changes required
- Re-uses existing • domain decomposition
- Probably already • using MPI
- Watch affinity •

- Easily share data • between peer devices
- Coordinating • between GPUs extremely tricky

Multiple Ranks, Multiple GPUs

Single Thread, Multiple GPUs

Multiple Threads, Multiple GPUs

Multiple Ranks, Single GPU

#### **MULTI-DEVICE CUDA**

CUDA by default exposes all devices, numbered 0 - (N-1), if devices are not all the same, it will reorder the "best" to device 0.

Each device has its own pool of streams.

If you do nothing, *all* work will go to Device #0.

Developer must change the current device explicitly

#### **MULTI-DEVICE OPENACC**

OpenACC presents devices numbered 0 - (N-1) for each device type available.

The order of the devices comes from the runtime, almost certainly the same as CUDA

By default all data and work go to the *current device* 

Developers must change the current device and maybe the current device type using an API

#### **MULTI-DEVICE OPENMP**

OpenMP devices numbered 0 - (N-1) for ALL devices on the machine, including the host.

The order is determined by the runtime, but devices of the same type are contiguous.

To change the device for data and compute a clause is added to directives.

Device API routines include a devicenum

## MULTI-GPU PROGRAMMING WITH OPENACC AND CUDA

#### **MULTI-GPU W/ CUDA AND OPENACC**

The CUDA and OpenACC approaches are sufficiently similar, that I will demonstrate using OpenACC.

Decoder Ring:

| OpenACC                          | CUDA                 |
|----------------------------------|----------------------|
| <pre>acc_get_device_type()</pre> | N/A                  |
| <pre>acc_set_device_type()</pre> | N/A                  |
| <pre>acc_set_device_num()</pre>  | cudaSetDevice()      |
| <pre>acc_get_device_num()</pre>  | cudaGetDevice()      |
| <pre>acc_get_num_devices()</pre> | cudaGetDeviceCount() |

#### Multi-Device Pipeline A Case Study

We'll use a simple image filter to demonstrate these techniques.

No inter-GPU communication required

Pipelining: Breaking a large operation into smaller parts so that independent operations can overlap.

Since each part is independent, they can easily be run on different devices. We will extend the filter to run on more than one device.



### **Pipelining in a Nutshell**



#### Multi-device Pipelining in a Nutshell

| H2D | kernel | D2H | H2D | kernel | D2H |
|-----|--------|-----|-----|--------|-----|
|-----|--------|-----|-----|--------|-----|

| Device 0 | H2D | kernel | D2H    |     |  |
|----------|-----|--------|--------|-----|--|
| Dev      |     | H2D    | kernel | D2H |  |
| e 1      | H2D | kernel | D2H    |     |  |
| Device   |     | H2D    | kernel | D2H |  |

### **Pipelined Code**

```
#pragma acc data create(imgData[w*h*ch],out[w*h*ch])
                 copyin(filter)
for ( long blocky = 0; blocky < nblocks; blocky++)</pre>
  long starty = MAX(0,blocky * blocksize - filtersize/2);
 long endy = MIN(h,starty + blocksize + filtersize/2);
#pragma acc update device(imgData[starty*step:blocksize*step]) async(block%3)
  starty = blocky * blocksize;
 endy = starty + blocksize;
#pragma acc parallel loop collapse(2) gang vector async(block%3)
 for (y=starty; y<endy; y++) for ( x=0; x<w; x++ ) {
    <filter code ommitted>
   out[y * step + x * ch] = 255 - (scale * blue);
   out[y * step + x * ch + 1 ] = 255 - (scale * green);
   out[v * step + x * ch + 2] = 255 - (scale * red);
#pragma acc update self(out[starty*step:blocksize*step]) async(block%3)
#pragma acc wait
```

Cycle between 3 async queues by blocks.

### **Pipelined Code**

```
#pragma acc data create(imgData[w*h*ch],out[w*h*ch])
                 copyin(filter)
                                                                              Cycle between 3 async
for ( long blocky = 0; blocky < nblocks; blocky++)</pre>
                                                                                queues by blocks.
  long starty = MAX(0,blocky * blocksize - filtersize/2);
 long endy = MIN(h,starty + blocksize + filtersize/2);
#pragma acc update device(imgData[starty*step:blocksize*step]) async(block%3)
  starty = blocky * blocksize;
 endy = starty + blocksize;
#pragma acc parallel loop collapse(2) gang vector async(block%3)
 for (y=starty; y<endy; y++) for ( x=0; x<w; x++ ) {
   <filter code ommitted>
   out[y * step + x * ch] = 255 - (scale * blue);
   out[y * step + x * ch + 1 ] = 255 - (scale * green);
   out[v * step + x * ch + 2] = 255 - (scale * red);
                                                                               Wait for all blocks to
#pragma acc update self(out[starty*step:blocksize*step]) async(block%3)
                                                                                    complete.
#pragma acc wait
```

### **NVPROF Timeline of Pipeline**

| 💺 N          | WIDIA Visual Profiler @jlarkin-dt              |       |            |         |    |         |   |            |                 |         |               |             |              |              |          |     | x |
|--------------|------------------------------------------------|-------|------------|---------|----|---------|---|------------|-----------------|---------|---------------|-------------|--------------|--------------|----------|-----|---|
| <u>F</u> ile | e ⊻iew <u>W</u> indow <u>R</u> un <u>H</u> elp |       |            |         |    |         |   |            |                 |         |               |             |              |              |          |     |   |
| Ľ            | 📓 🖳 📑 👒 🗣 I 🕂 🗨                                | 😫   F | <b>K</b> [ | K 5     |    | δ.      |   |            |                 |         |               |             |              |              |          |     |   |
| 8            | د<br>♦ *NewSession1 %                          |       |            |         |    |         |   |            |                 |         |               |             |              |              | =        | - 0 | 6 |
|              |                                                |       |            | 189 ms  |    |         |   | 189.5 ms   |                 | 2.36 ms | s             | 190.        | 5 ms         |              | 191 ms   |     |   |
|              | Process "acc california-175                    |       |            |         |    |         |   |            |                 |         |               |             |              |              |          |     | 1 |
| Th           | Thread 2899274752                              |       |            |         |    |         |   |            |                 |         |               |             |              |              |          |     |   |
|              | └ OpenACC                                      |       |            |         |    |         |   |            |                 | acc_    | _wait@inve    | rt.c:218    |              | ac           | c_wait@i |     |   |
|              | L Runtime API                                  |       |            |         |    |         |   |            |                 |         |               |             |              |              |          |     |   |
|              | L Driver API                                   |       |            |         |    |         |   |            |                 | cuSt    | treamSync     | hronize     |              | cu           | StreamS  |     |   |
|              | Profiling Overhead                             |       |            |         |    |         |   |            |                 |         |               |             |              |              |          |     | L |
|              | 🖃 [0] Tesla K20c                               |       |            |         |    |         |   |            |                 |         |               |             |              |              |          |     | L |
|              | Context 1 (CUDA)                               |       |            |         |    |         |   |            |                 |         |               |             |              |              |          |     | L |
|              | 🗏 🍸 MemCpy (HtoD)                              |       |            |         |    |         |   |            |                 |         |               |             |              |              |          |     | L |
|              | 🗏 🍸 MemCpy (DtoH)                              |       |            |         |    |         |   |            |                 |         |               |             |              |              |          |     | L |
|              | 🖃 Compute                                      |       | blur       | 5_pipel |    | r5_pipe | _ | lur5_pipel | in<br>blur5_pi  |         | pipelin<br>bl | ur5_pipelin | lur5_pipelin | <br>blur5_pi | ipelin   |     |   |
|              | └ 🍸 41.9% blur5_pipe                           |       | blur       | 5_pipel | _  | r5_pipe | _ | lur5_pipel | lin<br>blur5_pi |         | pipelin<br>bl | ur5_pipelin | lur5_pipelin | <br>blur5_pi | ipelin   |     |   |
|              | └ 🍸 20.1% blur5_bloc                           |       |            |         |    |         |   |            |                 |         |               |             |              |              |          |     |   |
|              | └ 🍸 20.1% blur5_upd                            |       |            |         |    |         |   |            |                 |         |               |             |              |              |          |     |   |
|              | └ 🍸 17.9% blur5_34_g                           |       |            |         |    |         |   |            |                 |         |               |             |              |              |          |     |   |
|              | 🖃 Streams                                      |       |            |         |    |         |   |            |                 |         |               |             |              |              |          |     |   |
|              | - Default                                      |       |            |         |    |         |   |            |                 |         |               |             |              |              |          |     |   |
|              | L Stream 13                                    |       | blur       | 5_pipel | in |         |   |            | blur5_pi        | pelin   |               |             | lur5_pipelin |              |          |     | • |
|              |                                                | •     |            |         |    |         |   |            |                 |         |               |             |              |              |          | •   |   |

#### Extending to multiple devices

Create 1 OpenMP thread on the CPU per-device. This is not strictly necessary, but simplifies the code.

Within each thread, set the device number.

Divide the blocks as evenly as possible among the CPU threads.

#### Multi-GPU Pipelined Code (OpenMP)



Spawn 1 thread per

#### **Multi-GPU Pipelined Performance**



#### **OpenACC** with MPI

Domain decomposition is performed using MPI ranks

Each rank should set its own device

- Maybe acc\_set\_device\_num
- Maybe handled by environment variable (CUDA\_VISIBLE\_DEVICES)

GPU affinity can be handled by standard MPI task placement

Multiple MPI Ranks/GPU (using MPS) can work in place of OpenACC work queues/CUDA Streams

### Setting a device by local rank

Determine a unique ID for each rank on the same node.

Use this unique ID to select a device per rank.

You may also try using MPI\_Comm\_split\_type() using MPI\_COMM\_TYPE\_SHARED or OMPI\_COMM\_TYPE\_SOCKET.

In the end, you need to understand how jsrun/mpirun is placing your ranks.

#### MPI Image Filter (pseudocode)

```
if (rank == 0 ) read image();
                                                                      Decompose image
// Distribute the image to all ranks
                                                                       across processes
MPI_Scatterv(image);
                                                                           (ranks)
MPI_Barrier(); // Ensures all ranks line up for timing
omp get wtime();
blur_filter(); // Contains OpenACC filter
MPI_Barrier(); // Ensures all ranks complete before timing
omp get wtime();
                                                                      Receive final parts
                                                                       from all ranks.
MPI Gatherv(out);
if (rank == 0 ) write_image();
                                                                      Launch with good
$ jsrun -n 6 -a 1 -c 1 -g 1 ...
                                                                     GPU/process affinity
```

There's a variety of ways to do MPI decomposition, this is what I used for this particular example.

#### Multi-GPU Pipelined Performance (MPI)



#### Multi-GPU Pipelined Performance (MPI)



**MULTI-DEVICE CUDA** 

#### Same Pattern, Different API

```
cudaDeviceSynchronize();
```

MPI\_Comm\_rank(local\_comm, &local\_rank);

```
cudaSetDevice(local_rank);
```

cudaDeviceSynchronize();

#### **MULTI-DEVICE OPENMP 4.5**

#### Same Pattern, Different API

```
#pragma omp parallel num_threads(num_dev)
```

```
#pragma omp for
for ( int b=0; b < nblocks; b++)
  #pragma omp target update map(to:...) \
          device(dev) depend(inout:A) \
          nowait
  #pragma omp target teams distribute \
          parallel for simd device (dev) \setminus
          depend(inout:A)
  for(...) { ... }
  #pragma omp target update map(from:...) \
          device(dev) depend(inout:A) \
          nowait
#pragma omp taskwait
```

```
MPI_Comm_rank(local_comm, &local_rank);
int dev = local_rank;
```

```
for ( int b=0; b < nblocks; b++)
{
    #pragma omp target update map(to:...) \
        device(dev) depend(inout:A) \
        nowait
    #pragma omp target teams distribute \
        parallel for simd device(dev) \
        depend(inout:A)
    for(...) { ... }
    #pragma omp target update map(from:...) \</pre>
```

```
device(dev) depend(inout:A) \
nowait
```

```
#pragma omp taskwait
```

# Multi-GPU Approaches

Choosing an approach

Single-Threaded, Multiple-GPUs - Requires additional loops to manage devices, likely undesirable.

Multi-Threaded, Multiple-GPUs - Very convenient set-and-forget the device. Could possibly conflict with existing threading.

Multiple-Ranks, Single-GPU each - Probably the simplest if you already have MPI, he decomposition is done. Must get your MPI placement correct

Multiple-Ranks, Multiple-GPUs - Can allow all GPUs to share common data structures. Only do this is you absolutely need to, difficult to get right.

## **CLOSING SUMMARY**

#### **MULTI-GPU APPROACHES**

#### Choosing an approach

Single-Threaded, Multiple-GPUs - Requires additional loops to manage devices, likely undesirable.

Multi-Threaded, Multiple-GPUs - Very convenient set-and-forget the device. Could possibly conflict with existing threading.

Multiple-Ranks, Single-GPU each - Probably the simplest if you already have MPI, he decomposition is done. Must get your MPI placement correct

Multiple-Ranks, Multiple-GPUs - Can allow all GPUs to share common data structures. Only do this is you absolutely need to, difficult to get right.

#### **GPU TO GPU COMMUNICATION**

- CUDA aware MPI functionally portable
  - OpenACC/MP interoperable
  - Performance may vary between on/off node, socket, HW support for GPU Direct
  - WARNING: Unified memory support varies wildly between implementations!
- Single-process, multi-GPU
  - Enable peer access for straight forward on-node transfers
- Multi-process, single-gpu
  - Pass CUDA IPC handles for on-node copies
- Combine for more flexibility/complexity!

#### **ESSENTIAL TOOLS AND TRICK**

- Pick on-node layout with OLCF jsrun visualizer
  - https://jsrunvisualizer.olcf.ornl.gov/index.html
- Select MPI/GPU interaction with jsrun --smpiargs
  - "-gpu" for CUDA aware, "off" for pure GPU without MPI
- Profile MPI and NVLinks with nvprof
- Good performance will require experimentation!

