To fully exploit the computational power of the GPU, a large amount of data parallelism must be expressed. If your problem does not possess sufficient data parallelism, a second option is to combine data parallelism with task parallelism on the GPU through the use of concurrent kernels. To facilitate task parallelism, the NVIDIA Kepler K20X features Hyper-Q, a set of 32 hardware-managed work queues. When using CUDA streams, each stream is automatically mapped onto Hyper-Q, allowing up to 32 streams to execute concurrently. The NVIDIA Multi-Process Service allows multiple processes, such as intra-node MPI ranks, to be mapped onto Hyper-Q. This tutorial demonstrates how to take advantage of GPU concurrency on Titan through the use of Hyper-Q. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.


C

The Serial, OpenMP, and MPI C samples make use of the following GPU wrapper code. The wrapper provides the following functions:

  • sleep(): GPU kernel to sleep for a given number of clock cycles
  • get_cycles(): Return the number of cycles required to sleep for the requested number of seconds
  • create_streams(): Create a given number of non-default streams
  • sleep_kernel(): Launch a single GPU thread that sleeps for a given number of cycles
  • wait_for_streams(): Wait for streams 1 through the number of streams specified to complete any work
  • destroy_streams(): Destroy non-default streams

sleep.cu

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
 
static cudaStream_t *streams;

// CUDA kernel to pause for at least num_cycle cycles
__global__ void sleep(int64_t num_cycles)
{
    int64_t cycles = 0;
    int64_t start = clock64();
    while(cycles < num_cycles) {
        cycles = clock64() - start;
    }
}
 
// Returns number of cycles required for requested seconds
extern "C" int64_t get_cycles(float seconds)
{
    // Get device clock rate in kHz and convert to Hz
    int64_t Hz;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    Hz = int64_t(prop.clockRate) * 1000;
 
    // Calculate number of cycles to wait
    int64_t num_cycles;
    num_cycles = (int64_t)(seconds * Hz);
   
    return num_cycles;
}
 
// Create streams
extern "C" void create_streams(int num_streams)
{
    // Allocate streams
    streams = (cudaStream_t *) malloc((num_streams+1)*sizeof(cudaStream_t));    
 
    // Default stream
    streams[0] = NULL;

    // Primer kernel launch
    sleep<<< 1, 1 >>>(1); 

    // Create streams
    for(int i = 1; i <= num_streams; i++)
        cudaStreamCreate(&streams[i]);
}
 
// Launches a kernel that sleeps for num_cycles
extern "C" void sleep_kernel(int64_t num_cycles, int stream_id)
{
    // Launch a single GPU thread to sleep
    int blockSize, gridSize;
    blockSize = 1;
    gridSize = 1;
 
    // Execute the kernel
    sleep<<< gridSize, blockSize, 0, streams[stream_id] >>>(num_cycles);
}
 
// Wait for stream to complete
extern "C" void wait_for_stream(int stream_id)
{
    cudaStreamSynchronize(streams[stream_id]);
}
 
// Wait for streams to complete
extern "C" void wait_for_streams(int num_streams)
{
    for(int i = 1; i <= num_streams; i++)
        cudaStreamSynchronize(streams[i]);
}
 
// Destroy stream objects
extern "C" void destroy_streams(int num_streams)
{
    // Clean up stream
    for(int i = 1; i <= num_streams; i++)
        cudaStreamDestroy(streams[i]);
    free(streams);
}

Compile:
The following will compile the wrapper functions into object code to be linked in with our examples.

$ module load cudatoolkit
$ nvcc -arch=sm_35 -c sleep.cu

Serially Launched Concurrent Kernels in C

The following sample launches multiple GPU kernels that each sleep a single GPU thread for one second. GPU kernel launches are asynchronous with respect to the CPU, which allows a for loop to be used to launch concurrent kernels. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.

launcher.c

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/time.h>

int64_t get_cycles(float seconds);
void sleep_kernel(int64_t num_cycles, int stream_id);
void create_streams(int num_streams);
void wait_for_streams(int num_streams);
void destroy_streams(int num_streams);

int main(int argc, char *argv[])
{
    uint64_t cycles;
    struct timeval start, stop;
    int i, num_kernels;

    // Get number of cycles to sleep for 1 second
    cycles = get_cycles(1.0);

    // Max number of kernels to launch
    int max_kernels = 33;

    // Loop through number of kernels to launch, from 1 to max_kernels
    for(num_kernels=1; num_kernels<=max_kernels; num_kernels++)
    {
        // Start timer
        gettimeofday(&start, NULL);

        // Create streams
        create_streams(num_kernels);

        // Launch num_kernels kernels asynchronously
        for(i=1; i<=num_kernels; i++){
            sleep_kernel(cycles, i);
        }

        // Wait for kernels to complete
        wait_for_streams(num_kernels);

        // Clean up streams
        destroy_streams(num_kernels);

        // Print seconds elapsed
        gettimeofday(&stop, NULL);
        double seconds;
        seconds = (stop.tv_sec - start.tv_sec);
        seconds += (stop.tv_usec - start.tv_usec) / 1000000.0;
        printf("Total time for %d kernels: %f s\n", num_kernels, seconds);
    }

    return 0;
}

Compile:

$ module load cudatoolkit
$ cc sleep.o launcher.c -o serial.out

Run:

$ module load cudatoolkit
$ aprun ./serial.out

OpenMP Launched Concurrent Kernels in C

The following example uses OpenMP to launch multiple sleep kernels in a parallel region, using the OMP thread number to specify which stream to use. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.

launcher.c

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <omp.h>

int64_t get_cycles(float seconds);
void sleep_kernel(int64_t num_cycles, int stream_id);
void create_streams(int num_streams);
void wait_for_streams(int num_streams);
void destroy_streams(int num_streams);

int main(int argc, char *argv[])
{
    uint64_t cycles;
    double start, stop;
    int i, num_kernels;

    // Get number of cycles to sleep for 1 second
    cycles = get_cycles(1.0);

    // Number of kernels to launch
    int max_kernels = 33;

    // Loop through number of kernels to launch, from 1 to max_kernels
    for(num_kernels=1; num_kernels<=max_kernels; num_kernels++)
    {

        // Set number of OMP threads
        omp_set_num_threads(num_kernels);

        // Start timer
        start = omp_get_wtime();

        // Create streams
        create_streams(num_kernels);

        // Launch num_kernels kernels asynchronously
        #pragma omp parallel firstprivate(cycles)
        {
            int stream_id = omp_get_thread_num()+1;
            sleep_kernel(cycles, stream_id);
        }

        // Wait for kernels to complete
        wait_for_streams(num_kernels);
   
        // Wait for kernels to complete and clean up streams
        destroy_streams(num_kernels);

        // Stop timer
        stop = omp_get_wtime();
        printf("Total time for %d kernels: %f s\n", num_kernels, stop-start);
    }

    return 0;
}

Compile:
PGI

$ module load cudatoolkit
$ cc -mp sleep.o launcher.c -o openmp.out

GNU

$ module load cudatoolkit
$ cc -fopenmp sleep.o launcher.c -o openmp.out

Intel

$ module load cudatoolkit
$ cc -openmp sleep.o launcher.c -o openmp.out

Cray

$ module load cudatoolkit
$ cc sleep.o launcher.c -o openmp.out

Run:

$ module load cudatoolkit
$ aprun -d16 ./openmp.out

MPI Launched Concurrent Kernels in C

The following example uses MPI to launch a single GPU kernel per MPI process into the default stream. Titan’s GPU compute mode does not allow multiple processes to access the GPU simultaneously, so the CRAY_CUDA_PROXY must be enabled. When enabled, the Cray CUDA proxy works in conjunction with Hyper-Q to allow kernels from up to 32 different MPI processes to run concurrently. Note, however, that on Titan the maximum number of MPI processes per node is currently 16.

launcher.c

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "mpi.h"

int64_t get_cycles(float seconds);
void sleep_kernel(int64_t num_cycles, int stream_id);
void create_streams(int num_streams);
void wait_for_stream(int stream_id);
void destroy_streams(int num_streams);

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    uint64_t cycles;
    double start, stop;
    int num_kernels;

    // Get number of cycles to sleep for 1 second
    cycles = get_cycles(1.0);

    // Number of kernels to launch
    int max_kernels = size;

    // Setup default stream in sleep.cu wrapper
    create_streams(0);

    // Loop through number of kernels to launch, from 1 to max_kernels
    for(num_kernels=1; num_kernels<=max_kernels; num_kernels++)
    {
        // Start timer
        MPI_Barrier(MPI_COMM_WORLD);
        if(rank == 0)
            start = MPI_Wtime();

        // Launch kernel into default stream
        if(rank < num_kernels)
            sleep_kernel(cycles, 0);

        // Wait for all ranks to submit kernel
        MPI_Barrier(MPI_COMM_WORLD);

        // Wait for default stream
        if(rank < num_kernels)
            wait_for_stream(0);

        // Wait for all ranks to complete
        MPI_Barrier(MPI_COMM_WORLD);

        // Print seconds elapsed
        if(rank == 0) {
            stop = MPI_Wtime();
            printf("Total time for %d kernels: %f s\n", num_kernels, stop-start);
        }
    }

    destroy_streams(0);
    MPI_Finalize();

    return 0;
}

Compile:

$ module load cudatoolkit
$ cc sleep.o launcher.c -o proxy.out

Note that, for simplicity, the sleep wrapper does not do proper error checking; running this example without CRAY_CUDA_PROXY enabled will fail silently.

Run:

$ module load cudatoolkit
$ export CRAY_CUDA_PROXY=1
$ aprun -n16 ./proxy.out

OpenACC Launched Concurrent Kernels in C

OpenACC by default will block on the CPU when launching kernels or performing data movement. Using the async(i) clause allows asynchronous kernel launches and data transfers. Although it is implementation dependent, the integer argument, i, will generally determine which stream the kernel or data operation is launched into. At the time of this writing each implementation handles the integer argument slightly differently.

launcher.c

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "math.h"
#include <sys/time.h>

int main(int argc, char *argv[])
{

    uint64_t num_cycles;
    struct timeval start, stop;
    int i,num_kernels;

    // Set number of cycles to sleep for 1 second
    // We'll use frequency/instruction latency to estimate this
    num_cycles = 730000000/15;

    // Number of kernels to launch
    int max_kernels = 33;

    // Loop over the number of kernels to launch
    for(num_kernels=1; num_kernels<max_kernels; num_kernels++)
    {
        // Start timer
        gettimeofday(&start, NULL);

        for(i=1; i<=num_kernels; i++)
        {
            #pragma acc parallel async(i) vector_length(1) num_gangs(1)
            {
                uint64_t cycles;
                #pragma acc loop seq
                for(cycles=0; cycles<num_cycles; cycles++) {
                    cycles = cycles;
                }
            }
        }

        // Wait for all async streams to complete
        #pragma acc wait

        // Print seconds elapsed
        gettimeofday(&stop, NULL);
        double seconds;
        seconds = (stop.tv_sec - start.tv_sec);
        seconds += (stop.tv_usec - start.tv_usec) / 1000000.0;
        printf("Total time for %d kernels: %f s\n", num_kernels, seconds);
    }

    return 0;
}

Please note that for Cray the loop will be optimized out by default. Adding the volatile qualifier to the declaration of cycles should force the loop to be kept. The volatile qualifier is currently not supported in PGI accelerator regions.

Compile:
PGI

$ module load cudatoolkit
$ cc -acc launcher.c -o acc.out

Cray

$ module load cudatoolkit
$ cc -hpragma=acc launcher.c -o acc.out

Run:

$ module load cudatoolkit
$ aprun ./acc.out

Fortran

The Serial, OpenMP, and MPI Fortran samples make use of the following CUDA Fortran GPU wrapper code. The wrapper provides the following functions:

  • sleep(): GPU kernel to sleep for a given number of clock cycles
  • get_cycles(): Return the number of cycles required to sleep for the requested number of seconds
  • create_streams(): Create a given number of non-default streams
  • sleep_kernel(): Launch a single GPU thread that sleeps for a given number of cycles
  • wait_for_streams(): Wait for streams 1 through the number of streams specified to complete any work
  • destroy_streams(): Destroy non-default streams
For simplicity the following examples will use the PGI programming environment, as it provides CUDA Fortran. For additional information on compiler interoperability please see the tutorial Compiling mixed GPU and CPU code.

sleep.cuf

module sleep
    use cudadevice
    use cudafor
    implicit none
 
    integer, dimension(:), allocatable :: streams
 
    contains
 
    !CUDA kernel to pause for at least num_cycle cycles
    attributes(global) subroutine sleep(num_cycles)
        integer(8), value :: num_cycles
        integer(8) :: cycles
        integer(8) :: start
 
        cycles = 0
        start = clock64()
        do while (cycles < num_cycles)
            cycles = clock64() - start
        enddo
    end subroutine sleep
 
    !Returns number of cycles required for requested seconds
    integer(8) function get_cycles(seconds) result(num_cycles)
        real(8), intent(in) :: seconds
        integer(8) :: istat, Hz
        type(cudadeviceprop) :: prop
 
        istat = cudaGetDeviceProperties(prop, 0)
        Hz = prop%clockRate * 1000
        num_cycles = seconds * Hz
    end function get_cycles
 
    !Create streams
    subroutine create_streams(num_streams)
        integer :: num_streams, istat, i
 
        allocate(streams(num_streams+1))

        streams(1) = 0 

        ! Primer kernel launch
        call sleep<<< 1, 1 >>>(int8(1));

        do i=2,num_streams+1
            istat = cudaStreamCreate(streams(i))
        enddo
    end subroutine create_streams
 
    !Launches a kernel that sleeps for num_cycles
    subroutine sleep_kernel(num_cycles, stream_id)
        integer(8) :: num_cycles
        integer    ::  stream_id
        type(dim3) :: blockSize, gridSize
 
        blockSize = dim3(1,1,1)
        gridSize = dim3(1,1,1)
 
        call sleep<<<gridSize, blockSize, 0, streams(stream_id)>>>(num_cycles)
    end subroutine sleep_kernel
 
    ! Wait for stream to complete
    subroutine wait_for_stream(stream_id)
        integer :: stream_id, istat
 
        istat = cudaStreamSynchronize(streams(stream_id))
    end subroutine wait_for_stream
 
    ! Wait for streams to complete
    subroutine wait_for_streams(num_streams)
        integer :: num_streams, istat, i
 
        do i=2,num_streams+1
            istat = cudaStreamSynchronize(streams(i))
        enddo
    end subroutine wait_for_streams
 
    ! Destroy streams
    subroutine destroy_streams(num_streams)
        integer :: num_streams, i, istat
 
        do i=2,num_streams+1
            istat = cudaStreamDestroy(streams(i))
        enddo
        deallocate(streams)
    end subroutine destroy_streams
 
end module sleep

Serially Launched Concurrent Kernels in Fortran

The following sample launches multiple GPU kernels that each sleep a single GPU thread for one second. GPU kernel launches are asynchronous with respect to the CPU, which allows a loop to be used to launch concurrent kernels. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.

launcher.f90

program main
    use sleep

    integer(8) :: cycles
    integer    ::  max_kernels, num_kernels, i
    real(8)    :: start, stop, seconds

    ! Get number of cycles to sleep for 1 second
    seconds = 1.0
    cycles = get_cycles(seconds)

    ! Maximum number of kernels to launch
    max_kernels = 32

    ! Loop through number of kernels to launch, from 1 to max_kernels
    do num_kernels = 1, max_kernels

        ! Start timer
        call cpu_time(start)

        ! Create streams
        call create_streams(num_kernels)

        ! Launch num_kernels kernels asynchronously
        do i = 2, num_kernels+1
            call sleep_kernel(cycles, i)
        enddo

        ! Wait for kernels to complete and clean up streams
        call destroy_streams(num_kernels)

        ! Stop timer
        call cpu_time(stop)

        print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds'

    enddo

end program main

Compile:
PGI

$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o serial.out

OpenMP Launched Concurrent Kernels in Fortran

The following example uses OpenMP to launch multiple sleep kernels in a parallel region, using the OMP thread number to specify which stream to use. Each of the kernels will be launched into its own stream, allowing up to 32 kernels to execute concurrently.

launcher.f90

program main
    use sleep
    use omp_lib

    implicit none

    integer(8) :: cycles
    integer    ::  max_kernels, num_kernels, stream_id
    real(8)    :: start, stop, seconds

    ! Get number of cycles to sleep for 1 second
    seconds = 1.0
    cycles = get_cycles(seconds)

    ! Maximum number of kernels to launch
    max_kernels = 33

    ! Loop through number of kernels to launch, from 1 to max_kernels
    do num_kernels = 1, max_kernels

        ! Set number of OMP threads
        call omp_set_num_threads(num_kernels)

        ! Start timer
        start = omp_get_wtime()

        ! Create streams
        call create_streams(num_kernels)

        ! Launch num_kernels kernels asynchronously
        !$omp parallel private(stream_id) firstprivate(cycles)
        stream_id = omp_get_thread_num()+2
        call sleep_kernel(cycles, stream_id)
        !$omp end parallel

        ! Wait for kernels to complete and clean up streams
        call destroy_streams(num_kernels)

        ! Stop timer
        stop = omp_get_wtime()

        print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds'

    enddo

end program main

Compile:
PGI

$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o openmp.out

MPI Launched Concurrent Kernels in Fortran

The following example uses MPI to launch a single GPU kernel per MPI process into the default stream. Titan’s GPU compute mode does not allow multiple processes to access the GPU simultaneously, so the CRAY_CUDA_PROXY must be enabled. When enabled, the Cray CUDA proxy works in conjunction with Hyper-Q to allow kernels from up to 32 different MPI processes to run concurrently. Note, however, that on Titan the maximum number of MPI processes per node is currently 16.

launcher.f90

program main
    use sleep
    implicit none
    include 'mpif.h'

    integer    ::  max_kernels, num_kernels, i, ierr, rank, size
    integer(8) :: cycles
    real(8)    :: start, stop, seconds

    call MPI_Init(ierr)

    ! Get number of cycles to sleep for 1 second
    seconds = 1.0
    cycles = get_cycles(seconds)

    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, size, ierr)

    ! Number of kernels to launch
    max_kernels = size

    ! Setup default stream in sleep.cuf wrapper
    call create_streams(0)

    ! Loop through number of kernels to launch, from 1 to max_kernels
    do num_kernels = 1, max_kernels

        ! Start timer
        call MPI_Barrier(MPI_COMM_WORLD, ierr)
        if (rank == 0) then
            start = MPI_Wtime()
        endif

        ! Launch kernel into default stream
        if (rank < num_kernels) then
            call sleep_kernel(cycles, 1)
        endif

        ! Wait for all ranks to submit kernel
        call MPI_Barrier(MPI_COMM_WORLD, ierr)

        ! Wait for kernel to complete
        if(rank < num_kernels) then
            call wait_for_stream(1)
        endif

        ! Wait for all ranks to complete
        call MPI_Barrier(MPI_COMM_WORLD, ierr)

        ! Print seconds elapsed
        if (rank == 0) then
            stop = MPI_Wtime()
            print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds'
        endif

    enddo

    ! clean up array in wrapper, no stream actually destroyed
    call destroy_streams(0)

    call MPI_Finalize(ierr)

end program main

Compile:
PGI

$ module load cudatoolkit
$ ftn -ta=nvidia,kepler sleep.cuf launcher.f90 -o proxy.out

Note that, for simplicity, the sleep wrapper does not do proper error checking; running this example without CRAY_CUDA_PROXY enabled will fail silently.

Run:

$ module load cudatoolkit
$ export CRAY_CUDA_PROXY=1
$ aprun -n16 ./proxy.out

OpenACC Launched Concurrent Kernels in Fortran

OpenACC by default will block on the CPU when launching kernels or performing data movement. Using the async(i) clause allows asynchronous kernel launches and data transfers. Although it is implementation dependent, the integer argument, i, will generally determine which stream the kernel or data operation is launched into. At the time of this writing each implementation handles the integer argument slightly differently.

launcher.f90

program main
    implicit none

    integer(8) :: num_cycles, cycles, i
    integer    :: max_kernels, num_kernels, stream_id
    real(8)    :: start, stop, seconds
    real(4)    :: foo(33)

    ! Set number of cycles to sleep for 1 second
    ! We'll use frequency/instruction latency to estimate this
    num_cycles = 730000000/(15*5)

    ! Maximum number of kernels to launch
    max_kernels = 33

    ! Loop through number of kernels to launch, from 1 to max_kernels
    do num_kernels = 1, max_kernels

        ! Start timer
        call cpu_time(start)

        ! Launch num_kernels kernels asynchronously
        do i = 1, num_kernels
            !$acc parallel async(i) vector_length(1) num_gangs(1) copy(foo)
            !$acc loop seq
            do cycles = 1, num_cycles
                foo(i) = sin(1.5708)
            enddo
            !$acc end loop
            !$acc end parallel
        enddo

        ! Wait for all async streams to complete
        !$acc wait

        ! Stop timer
        call cpu_time(stop)

        print *, 'Total time for ', num_kernels,' kernels: ', stop-start, 'seconds'

    enddo

end program main

Please note that for Cray the loop will be optimized out by default. Adding the volatile qualifier to the declaration of foo should force the loop to be kept. The volatile qualifier is currently not supported in PGI accelerator regions.

Compile:
PGI

$ module load cudatoolkit
$ ftn -acc launcher.f90 -o acc.out

Cray

$ module load cudatoolkit
$ ftn -hacc launcher.f90 -o acc.out

Run:

$ module load cudatoolkit
$ aprun ./acc.out

Cray’s implementation of MPICH2 allows GPU memory buffers to be passed directly to MPI function calls, eliminating the need to manually copy GPU data to the host before passing data through MPI. Several examples of using this feature are given below. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.

How to Enable

To enable GPUDirect the following steps must be taken before invoking aprun:

  $ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
  $ export MPICH_RDMA_ENABLED_CUDA=1

Examples

The following examples will initialize memory on the GPU and then perform an MPI_Allgather operation between GPUs.

For these examples use two nodes and request an interactive job.

$ qsub -I -A PROJ### -lnodes=2,walltime=00:30:00

The executables must be placed in /tmp/work/$USER to run properly on the GPU.
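
For example, once built, an executable (such as direct.out from the examples below) can be copied into the work directory and launched from there:

  $ cp direct.out /tmp/work/$USER
  $ cd /tmp/work/$USER
  $ aprun -n2 -N1 ./direct.out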

CUDA C

direct.cpp

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <mpi.h>

int main( int argc, char** argv )
{
    MPI_Init (&argc, &argv);

    int direct;
    int rank, size;
    int *h_buff = NULL;
    int *d_rank = NULL;
    int *d_buff = NULL;
    size_t bytes;
    int i;

    // Ensure that RDMA ENABLED CUDA is set correctly
    direct = getenv("MPICH_RDMA_ENABLED_CUDA")==NULL?0:atoi(getenv ("MPICH_RDMA_ENABLED_CUDA"));
    if(direct != 1){
        printf ("MPICH_RDMA_ENABLED_CUDA not enabled!\n");
        exit (EXIT_FAILURE);
    }

    // Get MPI rank and size
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Comm_size (MPI_COMM_WORLD, &size);

    // Allocate host and device buffers and copy rank value to GPU
    bytes = size*sizeof(int);
    h_buff = (int*)malloc(bytes);
    cudaMalloc(&d_buff, bytes);
    cudaMalloc(&d_rank, sizeof(int));
    cudaMemcpy(d_rank, &rank, sizeof(int), cudaMemcpyHostToDevice);

    // Perform Allgather using device buffer
    MPI_Allgather(d_rank, 1, MPI_INT, d_buff, 1, MPI_INT, MPI_COMM_WORLD);

    // Check that the GPU buffer is correct
    cudaMemcpy(h_buff, d_buff, bytes, cudaMemcpyDeviceToHost);
    for(i=0; i<size; i++){
        if(h_buff[i] != i) {
            printf ("Alltoall Failed!\n");
            exit (EXIT_FAILURE);
        }
    }
    if(rank==0)
        printf("Success!\n");

    // Clean up
    free(h_buff);
    cudaFree(d_buff);
    cudaFree(d_rank);
    MPI_Finalize();

    return 0;
}

Compiling CUDA C

For ease of compiling the GNU environment will be used.

  $ module load cudatoolkit
  $ module switch PrgEnv-pgi PrgEnv-gnu
  $ CC -lcudart direct.cpp -o direct.out

Running CUDA C

  $ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
  $ export MPICH_RDMA_ENABLED_CUDA=1
  $ aprun -n2 -N1 ./direct.out

CUDA Fortran

direct.cuf

program GPUdirect
    use cudafor
    implicit none
    include 'mpif.h'

    integer :: direct
    character(len=255) :: env_var
    integer :: rank, size, ierror
    integer,dimension(:),allocatable :: h_buff
    integer,device :: d_rank
    integer,dimension(:),allocatable,device :: d_buff
    integer :: i

    call getenv("MPICH_RDMA_ENABLED_CUDA", env_var)
    read( env_var, '(i10)' ) direct
    if (direct .NE. 1) then
      print *, 'MPICH_RDMA_ENABLED_CUDA not enabled!'
      call exit(1)
    endif

    call MPI_INIT(ierror)

    ! Get MPI rank and size
    call MPI_COMM_RANK (MPI_COMM_WORLD, rank, ierror)
    call MPI_COMM_SIZE (MPI_COMM_WORLD, size, ierror)

    ! Initialize host and device buffers
    allocate(h_buff(size))
    allocate(d_buff(size))
    ! Implicitly copy rank to device
    d_rank = rank

    ! Perform allgather using device buffers
    call MPI_ALLGATHER(d_rank, 1, MPI_INTEGER, d_buff, 1, MPI_INTEGER, MPI_COMM_WORLD, ierror);

    ! Check that buffer is correct
    h_buff = d_buff(1:size)
    do i=1,size
        if (h_buff(i) .NE. i-1) then
            print *, 'Allgather Failed!'
            call exit(1)
        endif
    enddo
    if (rank .EQ. 0) then
        print *, 'Success!'
    endif

    ! Clean up
    deallocate(h_buff)
    deallocate(d_buff)
    call MPI_FINALIZE(ierror)

end program GPUdirect

Compiling CUDA Fortran

  $ module load cudatoolkit
  $ ftn direct.cuf -o direct.out

Running CUDA Fortran

  $ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
  $ export MPICH_RDMA_ENABLED_CUDA=1
  $ aprun -n2 -N1 ./direct.out

OpenACC C

direct.c

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main( int argc, char** argv )
{
    MPI_Init (&argc, &argv);

    int direct;
    int rank, size;
    int *restrict buff = NULL;
    size_t bytes;
    int i;

    // Ensure that RDMA ENABLED CUDA is set correctly
    direct = getenv("MPICH_RDMA_ENABLED_CUDA")==NULL?0:atoi(getenv ("MPICH_RDMA_ENABLED_CUDA"));
    if(direct != 1){
        printf ("MPICH_RDMA_ENABLED_CUDA not enabled!\n");
        exit (EXIT_FAILURE);
    }

    // Get MPI rank and size
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Comm_size (MPI_COMM_WORLD, &size);

    // Initialize buffer
    bytes = size*sizeof(int);
    buff = (int*)malloc(bytes);

    // Copy buff to device at start of region and back to host at end of region
    #pragma acc data copy(rank, buff[0:size])
    {
        // Inside this region the device data pointer will be used
        #pragma acc host_data use_device(rank, buff)
        {
            MPI_Allgather(&rank, 1, MPI_INT, buff, 1, MPI_INT, MPI_COMM_WORLD);
        }
    }

    // Check that buffer is correct
    for(i=0; i<size; i++){
        if(buff[i] != i) {
            printf ("Alltoall Failed!\n");
            exit (EXIT_FAILURE);
        }
    }
    if(rank==0)
        printf("Success!\n");

    // Clean up
    free(buff);

    MPI_Finalize();

    return 0;
}

Compiling OpenACC C

CRAY

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ cc -hpragma=acc direct.c -o direct.out

PGI

$ module load cudatoolkit
$ cc -acc -lcudart direct.c -o direct.out

Running OpenACC C

CRAY

$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ aprun -n2 -N1 ./direct.out

PGI

$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ aprun -n2 -N1 ./direct.out

OpenACC Fortran

direct.f90

program GPUdirect
    include 'mpif.h'

    integer :: direct
    character(len=255) :: env_var
    integer :: rank, size, ierror
    integer,dimension(:),allocatable :: buff
    integer :: i

    call getenv("MPICH_RDMA_ENABLED_CUDA", env_var)
    read( env_var, '(i10)' ) direct
    if (direct .NE. 1) then
      print *, 'MPICH_RDMA_ENABLED_CUDA not enabled!'
      call exit(1)
    endif

    call MPI_INIT(ierror)

    ! Get MPI rank and size
    call MPI_COMM_RANK (MPI_COMM_WORLD, rank, ierror)
    call MPI_COMM_SIZE (MPI_COMM_WORLD, size, ierror)

    ! Initialize buffer
    allocate(buff(size))

    ! Copy buff to device at start of region and back to host at end of region
    !$acc data copy(rank, buff(1:size))
        ! Inside this region the device data pointer will be used
        !$acc host_data use_device(rank, buff)
            ! Perform allgather using device buffer
            call MPI_ALLGATHER(rank, 1, MPI_INTEGER, buff, 1, MPI_INTEGER, MPI_COMM_WORLD, ierror);
        !$acc end host_data
    !$acc end data

    ! Check that buffer is correct
    do i=1,size
        if (buff(i) .NE. i-1) then
            print *, 'Allgather Failed!'
            call exit(1)
        endif
    enddo
    if (rank .EQ. 0) then
        print *, 'Success!'
    endif

    ! Clean up
    deallocate(buff)
    call MPI_FINALIZE(ierror)
end program GPUdirect

Compiling OpenACC Fortran

CRAY

$ module switch PrgEnv-pgi PrgEnv-cray
$ module load craype-accel-nvidia35
$ ftn -hacc direct.f90 -o direct.out

PGI

$ module load cudatoolkit
$ ftn -acc -lcudart direct.f90 -o direct.out

Running OpenACC Fortran

CRAY

$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ aprun -n2 -N1 ./direct.out

PGI

$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export MPICH_RDMA_ENABLED_CUDA=1
$ aprun -n2 -N1 ./direct.out

Optimizations

Several optimizations for improving performance are given below. These optimizations are highly application dependent and may require some trial and error tuning to achieve best results.

Pipelining

Pipelining allows overlapping of GPU to GPU MPI messages and may improve message-passing performance for large, bandwidth-bound messages. Setting the environment variable MPICH_G2G_PIPELINE=N allows a maximum of N GPU to GPU messages to be in flight at any given time. The default value of MPICH_G2G_PIPELINE is 16, and messages under 8 kilobytes in size are never pipelined.
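
For example, to allow up to 64 in-flight GPU to GPU messages when running the GPUDirect example above (the value 64 is purely illustrative and should be tuned for your application):

  $ export MPICH_RDMA_ENABLED_CUDA=1
  $ export MPICH_G2G_PIPELINE=64
  $ aprun -n2 -N1 ./direct.out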

Nemesis

Applications using asynchronous MPI calls may benefit from enabling the MPICH asynchronous progress feature. Setting the MPICH_NEMESIS_ASYNC_PROGRESS=1 environment variable enables additional threads to be spawned to progress the MPI state.

This feature requires that the thread level be set to multiple: MPICH_MAX_THREAD_SAFETY=multiple.

This feature works best when used in conjunction with core specialization: aprun -r N, which allows for N CPU cores to be reserved for system services.
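
As a sketch, running the earlier GPUDirect example with asynchronous progress enabled and one core per node reserved for system services might look like the following (the -r value is illustrative):

  $ export MPICH_RDMA_ENABLED_CUDA=1
  $ export MPICH_NEMESIS_ASYNC_PROGRESS=1
  $ export MPICH_MAX_THREAD_SAFETY=multiple
  $ aprun -n2 -N1 -r 1 ./direct.out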

The default GPU compute mode for Titan is Exclusive Process. In this mode, many threads within a process may access the GPU context. To allow multiple processes access to the GPU context, such as multiple MPI tasks on a single node accessing the GPU, the CUDA proxy server was developed. Once enabled, the proxy server transparently manages work issued to the GPU context from multiple processes. The full source can be viewed or downloaded from the OLCF GitHub. Please direct any questions or comments to help@nccs.gov.

Currently, GPU memory between processes accessing the proxy is not guarded, meaning process i can access memory allocated by process j. This SHOULD NOT be used to share memory between processes, and care should be taken to ensure processes only access GPU memory they have allocated themselves.

How to Enable

To enable the proxy server the following steps must be taken before invoking aprun:

$ export CRAY_CUDA_MPS=1

Issues

Currently, GPU debugging and profiling are not supported when the proxy is enabled. Specifying the qsub flag -lfeature=gpudefault will switch the compute mode from exclusive process to the CUDA default mode. In the default mode debugging and profiling are available, and multiple MPI ranks are able to access the GPU. The default compute mode is not recommended on Titan: approximately 120 MB of device memory is used per process accessing the GPU, and inconsistent behavior may be encountered under certain conditions.
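
For example, an interactive job requesting the default compute mode could be submitted as follows (the project name is a placeholder):

$ qsub -I -A PROJ### -lnodes=1,walltime=00:30:00 -lfeature=gpudefault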

Examples

The following examples will demonstrate when and how to use the CUDA proxy. In each example several kernels will be launched to demonstrate use of the CUDA proxy. These examples will be used in a future tutorial to investigate how multiple kernels are scheduled on the GPU.

For these examples use a single node and request an interactive job.

$ qsub -I -A PROJ### -lnodes=1,walltime=00:30:00

C Wrapper

The following C wrapper functions will be used to set up and launch a kernel that sleeps a single GPU thread for a specified number of seconds. This file will be compiled with NVCC and then linked into our examples.

pauseWrapper.cu

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// CUDA kernel to pause for at least num_cycle cycles
__global__ void sleep(int64_t num_cycles)
{
    int64_t cycles = 0;
    int64_t start = clock64();
    while(cycles < num_cycles) {
        cycles = clock64() - start;
    }
}

// Returns number of cycles required for requested seconds
extern "C" int64_t get_cycles(int seconds)
{
    // Get device frequency in Hz
    int64_t Hz;
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    Hz = int64_t(prop.clockRate) * 1000;

    // Calculate number of cycles to wait
    int64_t num_cycles;
    num_cycles = seconds * Hz;

    return num_cycles;
}

// Launches a kernel that sleeps for at least num_cycles
extern "C" void sleep_kernel(int64_t num_cycles)
{
    // Our kernel will launch a single thread to sleep the kernel
    int blockSize, gridSize;
    blockSize = 1;
    gridSize = 1;

    // Execute the kernel
    sleep<<<gridSize, blockSize>>>(num_cycles);
}

// Wait for all pending GPU transactions to end
extern "C" void wait_for_gpu()
{
    cudaDeviceSynchronize();
}

Compiling

The following will compile the wrapper functions into object code to be linked in with our examples.

$ module load cudatoolkit
$ nvcc -c -arch=sm_35 pauseWrapper.cu

Multiple kernels single process

Launching multiple kernels from a single process does not require the CUDA proxy.

launcher.cpp

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

extern "C" int64_t get_cycles(int seconds);
extern "C" void sleep_kernel(int64_t num_cycles);
extern "C" void wait_for_gpu();

int main(int argc, char *argv[])
{
    // Get number of cycles to sleep for 1 second
    uint64_t cycles;
    cycles = get_cycles(1);

    // Number of kernels to launch
    int num_kernels = 14;

    // Launch num_kernels kernels asynchronously
    for(int i=0; i<num_kernels; i++){
        sleep_kernel(cycles);
    }

    // Wait for the kernel to complete
    wait_for_gpu();

    return 0;
}

Compiling and running multiple kernels from a single process

$ CC pauseWrapper.o launcher.cpp -o sleep.out
$ aprun -n1 ./sleep.out

Multiple kernels launched from multiple threads

Launching multiple kernels from multiple threads spawned from a single process does not require the CUDA proxy.

launcherOMP.cpp

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <omp.h>

extern "C" int64_t get_cycles(int seconds);
extern "C" void sleep_kernel(int64_t num_cycles);
extern "C" void wait_for_gpu();

int main(int argc, char *argv[])
{
    // Get number of cycles to sleep for 1 second
    uint64_t cycles;
    cycles = get_cycles(1);

    // Number of kernels to launch
    int num_kernels = 14;

    // Launch kernel
    omp_set_num_threads(num_kernels);

    #pragma omp parallel for shared(cycles) num_threads(num_kernels)
    for(int i=0; i<num_kernels; i++){
        sleep_kernel(cycles);
    }

    // Wait for the kernel to complete
    wait_for_gpu();
}

Compiling and running multiple kernels launched from multiple threads

$ CC pauseWrapper.o launcherOMP.cpp -o sleep.out
$ aprun -n1 -d14 ./sleep.out

Multiple kernels launched from multiple MPI tasks

Launching multiple kernels from multiple MPI tasks on a single node does require the CUDA proxy.

launcherMPI.cpp

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <mpi.h>

extern "C" int64_t get_cycles(int seconds);
extern "C" void sleep_kernel(int64_t num_cycles);
extern "C" void wait_for_gpu();

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Get number of cycles to sleep for 1 second
    uint64_t cycles;
    cycles = get_cycles(1);

    // Sleep kernel for 1 second
    sleep_kernel(cycles);

    // Wait for the kernel to complete
    wait_for_gpu();

    MPI_Finalize();
    return 0;
}

Compiling and running multiple kernels launched from multiple MPI tasks

$ CC pauseWrapper.o launcherMPI.cpp -o sleep.out
$ export CRAY_CUDA_MPS=1
$ aprun -n14 -N14 ./sleep.out

This tutorial describes how to use PETSc for domain decomposition to parallelize (with MPI) a solver for the simple advection equation (1.1).

Equation 1.1: ∂D/∂t + Vx ∂D/∂x + Vy ∂D/∂y + Vz ∂D/∂z = 0

In equation (1.1), D is a scalar density and V = { Vx,Vy,Vz } is a velocity field. We assume only Vx is non-zero and constant. A simple time-explicit scheme for solving equation (1.1) is given by (1.2)

Equation 1.2 (shown as an image in the original tutorial)

where, for example, the “upwind flux” gives (1.3)

Equation 1.3 (shown as an image in the original tutorial)
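
Since equations (1.2) and (1.3) are rendered as images in the original tutorial, here is a rough sketch (not necessarily the tutorial's exact notation) of a first-order upwind update in the x-direction with constant Vx > 0, written in LaTeX:

  D_i^{n+1} = D_i^{n} - \frac{\Delta t}{\Delta x} \left( F_{i+1/2}^{n} - F_{i-1/2}^{n} \right),
  \qquad F_{i+1/2}^{n} = V_x \, D_i^{n} \quad (V_x > 0)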

Thus, the numerical scheme couples nearest neighbors, which means that data must be exchanged between MPI tasks during the evolution. Although the equation solved in this tutorial example is simple, the strategy outlined here can be easily generalized to more complex problems using more advanced numerical methods. The example code provided (with the driver Advection3D.F90) performs three main tasks: (1) creates a distributed logically Cartesian grid/mesh (in MeshModule.F90; Section 4), (2) creates physical fields associated with each grid cell (in FieldsModule.F90; Section 5), and (3) evolves the simple advection equation on the distributed grid (in Advection3D.F90 and AdvectionModule.F90). The main data structure used by the program (UniverseType, defined in UniverseModule.F90) holds one data structure for the mesh (MeshType, defined in MeshModule.F90) and one data structure for the physical fields (FieldsType, defined in FieldsModule.F90). The actual time-stepping loop is in the main driver (Advection3D.F90).

Note: this tutorial is intended to supplement — not replace — the existing PETSc documentation, available at http://www.mcs.anl.gov/petsc/documentation. In addition, see the useful code examples provided with the PETSc distribution:

• $PETSC_DIR/source/src/dm/examples/tutorials/
• $PETSC_DIR/source/src/ksp/ksp/examples/tutorials/
• $PETSC_DIR/source/src/mat/examples/tutorials/
• $PETSC_DIR/source/src/snes/examples/tutorials/
• $PETSC_DIR/source/src/sys/examples/tutorials/
• $PETSC_DIR/source/src/ts/examples/tutorials/
• $PETSC_DIR/source/src/vec/vec/examples/tutorials/

In the following sections we describe the individual modules of a Fortran code that uses PETSc for domain decomposition. In each section we provide links to PETSc documentation for all the PETSc routines used by the code. The user is encouraged to download, compile, and run the provided code.

Please direct any questions or comments to help@nccs.gov.

Advection3D.F90

Mesh and Data Structures

The call to CreateUniverse creates the main data structure UniverseType. PETSc (and MPI) are also initialized in CreateUniverse. It should be the first call in any program using the data structure defined by UniverseType. Notable input parameters are the number of mesh cells in each spatial dimension, the inner and outer boundaries of the computational domain, and the type of boundary conditions used. BoundaryConditions are of type DMDABoundaryType, and PETSc supports DMDA_BOUNDARY_NONE, DMDA_BOUNDARY_GHOST, and DMDA_BOUNDARY_PERIODIC.

  !  Program solves dD/dt+Vx*dD/dx+Vy*dD/dy+Vz*dD/dz=0 on periodic unit domain 
  !    with constant advection velocity (V1, V2, V3) = (1, 0, 0), 
  !    and initial condition D = Sin(2 * Pi * X1)

  !  Create Mesh and Data Structures to Hold Physical Fields:

  nZones(1:3) = 256

  CALL CreateUniverse( &
         U, Name = 'Advection3D',          &
         nZones                            &
           = nZones,                       &
         InnerBoundaries                   &
           = (/ Zero, Zero, Zero /),       &
         OuterBoundaries                   &
           = (/ One,  One,  One  /),       &
         BoundaryConditions                &
           = (/ DMDA_BOUNDARY_PERIODIC,    &
                DMDA_BOUNDARY_PERIODIC,    &
                DMDA_BOUNDARY_PERIODIC /))

The call to DestroyUniverse frees up memory allocated by CreateUniverse and finalizes PETSc (and MPI) to end cleanly.

  ! Destroy Mesh and Data Structures that Hold Physical Fields:

  CALL DestroyUniverse(U)

UniverseModule.F90

The following PETSc routines are called
PetscInitialize
PetscFinalize
PetscGetTime

UniverseType

UniverseType contains Mesh (MeshType) and Fields (FieldsType), and additional rudimentary variables associated with parallel execution (e.g., the integer Communicator).

  TYPE, PUBLIC :: UniverseType
    CHARACTER(80) :: &
      Name
    PetscInt :: &
      Communicator, &
      MPI_Rank, &
      N_MPI_Ranks
    PetscLogDouble :: &
      WallTimeInitialize, &
      WallTimeFinalize
    TYPE(MeshType), POINTER :: &
      Mesh   ! 
    TYPE(FieldsType), POINTER :: &
      Fields ! 
  END TYPE UniverseType

CreateUniverse

CreateUniverse initializes PETSc (and MPI) with the call to PetscInitialize. It also initializes the rudimentary variables and creates the data structures MeshType and FieldsType with the calls to CreateMesh and CreateFields.

  SUBROUTINE CreateUniverse(U, Name, &
               nZones, InnerBoundaries, OuterBoundaries, BoundaryConditions)

    TYPE(UniverseType), POINTER :: &
      U
    CHARACTER(len=*), INTENT(in) :: &
      Name
    PetscInt, DIMENSION(:), INTENT(in) :: &
      nZones
    PetscReal, DIMENSION(:), INTENT(in) :: &
      InnerBoundaries, &
      OuterBoundaries
    DMDABoundaryType, DIMENSION(:), INTENT(in) :: &
      BoundaryConditions

    PetscInt :: &
      Error

    CALL PetscInitialize(PETSC_NULL_CHARACTER, Error)

    ALLOCATE(U)
    U % Communicator = MPI_COMM_WORLD
    CALL MPI_Comm_Rank(U % Communicator, U % MPI_Rank,    Error)
    CALL MPI_Comm_Size(U % Communicator, U % N_MPI_Ranks, Error)

    U % Name = Name

    CALL PetscGetTime(U % WallTimeInitialize, Error)

    IF(U % MPI_RANK == 0)THEN
      PRINT*
      PRINT*, "INFO: Creating Universe ", TRIM(U % Name)
      PRINT*
      PRINT*, "  INFO: Number of MPI Ranks = ", U % N_MPI_Ranks
    END IF

    ALLOCATE(U % Mesh)
    CALL CreateMesh(U % Mesh, U % Communicator, U % MPI_Rank, &
           nZones, InnerBoundaries, OuterBoundaries, BoundaryConditions)

    ALLOCATE(U % Fields)
    CALL CreateFields(U % Fields, U % Mesh, U % Communicator, U % MPI_Rank)

  END SUBROUTINE CreateUniverse

The call to PetscGetTime is used to time the program (see DestroyUniverse).

DestroyUniverse

DestroyUniverse frees up memory allocated by CreateUniverse. In particular, Fields and Mesh are deallocated by calls to DestroyFields and DestroyMesh. DestroyUniverse also finalizes PETSc (and MPI) to exit cleanly.

 SUBROUTINE DestroyUniverse(U)

    TYPE(UniverseType), POINTER :: &
      U

    PetscInt :: &
      Error

    IF(U % MPI_Rank == 0)THEN
      PRINT*
      PRINT*, "INFO: Destroying Universe ", TRIM(U % Name)
    END IF

    CALL DestroyFields(U % Fields)
    DEALLOCATE(U % Fields)

    CALL DestroyMesh(U % Mesh)
    DEALLOCATE(U % Mesh)

    CALL PetscGetTime(U % WallTimeFinalize, Error)

    IF(U % MPI_Rank == 0)THEN
      PRINT*
      PRINT*, "  INFO: ", U % N_MPI_Ranks, " MPI Ranks Finished in ", &
        U % WallTimeFinalize - U % WallTimeInitialize, " Seconds"
      PRINT*
    END IF

    CALL PetscFinalize(Error)

    DEALLOCATE(U)

  END SUBROUTINE DestroyUniverse

MeshModule.F90

The following PETSc routines are called
DMDACreate1D
DMDACreate2D
DMDACreate3D
DMDAGetCoordinateDA
DMCreateGlobalVector
VecDuplicate
DMDAGetCorners
DMCreateLocalVector
DMDAGetGhostCorners
DMDAVecGetArrayF90
DMDAVecRestoreArrayF90
DMGlobalToLocalBegin
DMGlobalToLocalEnd
VecDestroy
DMDestroy

X1, X2, and X3 are integers used to access the mesh coordinates (see, e.g., CreateMesh3D for usage).

  ! Coordinate Names

  PetscInt, PUBLIC, PARAMETER :: &
    X1 = 0, &
    X2 = 1, &
    X3 = 2

MeshType

MeshType contains basic information about the mesh; e.g., dimensionality (nDimensions), number of zones in each spatial dimension (nZones), the coordinates of the inner and outer boundaries (InnerBoundaries and OuterBoundaries), and the type of boundary condition used in each dimension (BoundaryType). StencilWidth is the width of the computational stencil and sets the depth of the ghost zone layer to be communicated. (StencilWidth=1 in the current example, but can be larger for higher-order methods.) StencilType can be of type DMDA_STENCIL_BOX or DMDA_STENCIL_STAR to accommodate different computational stencils (see note below). The main data structures in MeshType are MeshDA and CoordinateDA, both of which are DM objects used to manage data for a structured grid in 1, 2, or 3 dimensions. MeshType also contains Positions (PositionsType), which holds mesh coordinate values.

  TYPE, PUBLIC :: MeshType
    PetscInt :: &
      nDimensions, &
      StencilWidth
    PetscInt, DIMENSION(:), ALLOCATABLE :: &
      nZones
    PetscReal, DIMENSION(:), ALLOCATABLE :: &
      InnerBoundaries, &
      OuterBoundaries
    DMDAStencilType :: &
      StencilType
    DMDABoundaryType, DIMENSION(:), ALLOCATABLE :: &
      BoundaryType
    DM :: &
      MeshDA, &
      CoordinateDA
    TYPE(PositionsType), POINTER :: &
      Positions
  END TYPE MeshType

[Figure: DMDA computational stencil types]

PETSc supports two types of computational stencils with DMDAStencilType. It determines what ghost zones will be available to a process (see Figure). With DMDA_STENCIL_STAR only the blue cells will be available. With DMDA_STENCIL_BOX both the red and blue cells are available. StencilType is set to DMDA_STENCIL_STAR in CreateMesh.

PositionsType

PositionsType contains distributed vectors (of type Vec) to hold position coordinates. InnerEdge, Center, and OuterEdge hold the inner, center, and outer coordinates of a computational mesh cell. Vectors with the “Global” suffix (e.g., InnerEdgeGlobal) are created with DMCreateGlobalVector and do not contain ghost zones. Vectors with the “Local” suffix (e.g., InnerEdgeLocal) are created with DMCreateLocalVector and contain ghost zones (see CreatePositions in Section 4e). The integers iBX(1:3) (“i Begin X”), iWX(1:3) (“i Width X”), and iEX(1:3) (“i End X”) are used to access specific cells on a mesh portion owned by a particular process. We have also defined the corresponding iBXG(1:3), iWXG(1:3), and iEXG(1:3) for the ghost domains. iBX and iWX, and iBXG and iWXG, are set with calls to DMDAGetCorners and DMDAGetGhostCorners (see Section 4e).

TYPE :: PositionsType
    PetscInt, DIMENSION(3) :: &
      iBX,  & ! Inner Indices of Local Domain
      iWX,  & ! Width of Local Domain
      iEX,  & ! Outer Indices of Local Domain
      iBXG, & ! Inner Indices of Local Ghosted Domain
      iWXG, & ! Width of Local Ghosted Domain
      iEXG    ! Outer Indices of Local Ghosted Domain
    Vec :: &
      InnerEdgeGlobal, &
      CenterGlobal,    &
      OuterEdgeGlobal, &
      InnerEdgeLocal,  &
      CenterLocal,     &
      OuterEdgeLocal
  END TYPE PositionsType

CreateMesh

CreateMesh sets basic information about the mesh and, depending on the dimensionality of the problem, calls CreateMesh1D, CreateMesh2D, or CreateMesh3D, which create the data structures MeshDA and CoordinateDA and set the position coordinates. After the global position coordinates are initialized (e.g., InnerEdgeGlobal), the calls DMGlobalToLocalBegin and DMGlobalToLocalEnd populate ghost values in the local position coordinates (e.g., InnerEdgeLocal).

  SUBROUTINE CreateMesh(M, Comm, Rank, &
               nZones, InnerBoundaries, OuterBoundaries, BoundaryConditions)

    TYPE(MeshType), POINTER :: &
      M
    PetscInt, INTENT(in) :: &
      Comm, &
      Rank
    PetscInt, DIMENSION(:), INTENT(in) :: &
      nZones
    PetscReal, DIMENSION(:), INTENT(in) :: &
      InnerBoundaries, &
      OuterBoundaries
    DMDABoundaryType, DIMENSION(:), INTENT(in) :: &
      BoundaryConditions

    PetscInt :: &
      iDim, &
      Error

    M % nDimensions = SIZE(nZones)

    ALLOCATE( M % nZones         (M % nDimensions) )
    ALLOCATE( M % InnerBoundaries(M % nDimensions) )
    ALLOCATE( M % OuterBoundaries(M % nDimensions) )
    ALLOCATE( M % BoundaryType   (M % nDimensions) )

    DO iDim = 1, M % nDimensions
      M % nZones(iDim)          = nZones(iDim)
      M % InnerBoundaries(iDim) = InnerBoundaries(iDim)
      M % OuterBoundaries(iDim) = OuterBoundaries(iDim)
      M % BoundaryType(iDim)    = BoundaryConditions(iDim)
    END DO

    M % StencilWidth = 1
    M % StencilType  = DMDA_STENCIL_STAR

    IF(Rank == 0)THEN
      PRINT*
      PRINT*, "  INFO: Creating Mesh"
      PRINT*, "    Dimensionality = ", M % nDimensions
      PRINT*, "    Stencil Width  = ", M % StencilWidth
      DO iDim = 1, M % nDimensions
        PRINT*, "      iDim, nZones = ", iDim, M % nZones(iDim)
        PRINT*, "        Inner Boundary   = ", M % InnerBoundaries(iDim)
        PRINT*, "        Outer Boundary   = ", M % OuterBoundaries(iDim)
      END DO
    END IF

    SELECT CASE (M % nDimensions)
      CASE (1)

        CALL CreateMesh1D(M, Comm)

      CASE (2)

        CALL CreateMesh2D(M, Comm)

      CASE (3)

        CALL CreateMesh3D(M, Comm)

    END SELECT

    !  Populate Ghost Zones with Coordinates:

    CALL DMGlobalToLocalBegin( &
           M % CoordinateDA, &
           M % Positions % InnerEdgeGlobal, INSERT_VALUES, &
           M % Positions % InnerEdgeLocal, Error)
    CALL DMGlobalToLocalEnd( &
           M % CoordinateDA, &
           M % Positions % InnerEdgeGlobal, INSERT_VALUES, &
           M % Positions % InnerEdgeLocal, Error)

    CALL DMGlobalToLocalBegin( &
           M % CoordinateDA, &
           M % Positions % OuterEdgeGlobal, INSERT_VALUES, &
           M % Positions % OuterEdgeLocal, Error)
    CALL DMGlobalToLocalEnd( &
           M % CoordinateDA, &
           M % Positions % OuterEdgeGlobal, INSERT_VALUES, &
           M % Positions % OuterEdgeLocal, Error)

    CALL DMGlobalToLocalBegin( &
           M % CoordinateDA, &
           M % Positions % CenterGlobal, INSERT_VALUES, &
           M % Positions % CenterLocal, Error)
    CALL DMGlobalToLocalEnd( &
           M % CoordinateDA, &
           M % Positions % CenterGlobal, INSERT_VALUES, &
           M % Positions % CenterLocal, Error)

  END SUBROUTINE CreateMesh

CreateMesh3D

CreateMesh3D creates a three-dimensional mesh. The call to the PETSc routine DMDACreate3D creates the object MeshDA (of type DM) to manage the communication of three-dimensional data that is distributed across processors (see the provided code for the corresponding 1D and 2D versions). The call to DMDAGetCoordinateDA initializes CoordinateDA (also of type DM), which is used to scatter between global and local coordinates and to access the coordinate vectors defined in PositionsType. The call to CreatePositions creates/allocates the position coordinate vectors and initializes indices used to access specific cells in the mesh owned by a particular process. To fill in values in a global vector (e.g., InnerEdgeGlobal) we must call DMDAVecGetArrayF90, which returns a multi-dimensional array (e.g., InnerEdge) that shares data with the underlying vector. In the 3D case, InnerEdge is a rank-four array, where the first index denotes the coordinate (i.e., X1, X2, or X3) and the following three indices denote the logical position in the mesh. After the position coordinates have been set and access to InnerEdge is no longer needed, one must call DMDAVecRestoreArrayF90.

 SUBROUTINE CreateMesh3D(M, Comm)

    TYPE(MeshType), POINTER :: &
      M
    PetscInt, INTENT(in) :: &
      Comm

    PetscInt :: &
      Error, &
      iX1, iX2, iX3
    PetscReal, DIMENSION(3) :: &
      dX
    PetscReal, DIMENSION(:,:,:,:), POINTER :: &
      InnerEdge, &
      OuterEdge, &
      Center

    CALL DMDACreate3D( &
           Comm, &
           M % BoundaryType(1), M % BoundaryType(2), M % BoundaryType(3), &
           M % StencilType, &
           M % nZones(1), M % nZones(2), M % nZones(3), &
           PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE, &
           1, M % StencilWidth, &
           PETSC_NULL_INTEGER, PETSC_NULL_INTEGER, PETSC_NULL_INTEGER, &
           M % MeshDA, Error)

    CALL DMDAGetCoordinateDA( &
           M % MeshDA, M % CoordinateDA, Error)

    !
    !  Create vectors to hold coordinate values:
    !

    ALLOCATE(M % Positions)
    CALL CreatePositions(M % Positions, M % nDimensions, M % CoordinateDA)

    !
    !  Fill in coordinate values:
    !

    CALL DMDAVecGetArrayF90( &
           M % CoordinateDA, M % Positions % InnerEdgeGlobal, InnerEdge, Error)
    CALL DMDAVecGetArrayF90( &
           M % CoordinateDA, M % Positions % OuterEdgeGlobal, OuterEdge, Error)
    CALL DMDAVecGetArrayF90( &
           M % CoordinateDA, M % Positions % CenterGlobal,    Center,    Error)

    !  Equidistant mesh for simplicity
    dX = ( M % OuterBoundaries - M % InnerBoundaries ) / REAL( M % nZones )

    DO iX3 = M % Positions % iBX(3), M % Positions % iEX(3)
      DO iX2 = M % Positions % iBX(2), M % Positions % iEX(2)
        DO iX1 = M % Positions % iBX(1), M % Positions % iEX(1)

          InnerEdge(X1, iX1, iX2, iX3) &
            = M % InnerBoundaries(1) + dX(1) * iX1
          OuterEdge(X1, iX1, iX2, iX3) &
            = InnerEdge(X1, iX1, iX2, iX3) + dX(1)
          Center   (X1, iX1, iX2, iX3) &
            = Half * ( InnerEdge(X1, iX1, iX2, iX3) &
                       + OuterEdge(X1, iX1, iX2, iX3) )

          InnerEdge(X2, iX1, iX2, iX3) &
            = M % InnerBoundaries(2) + dX(2) * iX2
          OuterEdge(X2, iX1, iX2, iX3) &
            = InnerEdge(X2, iX1, iX2, iX3) + dX(2)
          Center   (X2, iX1, iX2, iX3) &
            = Half * ( InnerEdge(X2, iX1, iX2, iX3) &
                       + OuterEdge(X2, iX1, iX2, iX3) )

          InnerEdge(X3, iX1, iX2, iX3) &
            = M % InnerBoundaries(3) + dX(3) * iX3
          OuterEdge(X3, iX1, iX2, iX3) &
            = InnerEdge(X3, iX1, iX2, iX3) + dX(3)
          Center   (X3, iX1, iX2, iX3) &
            = Half * ( InnerEdge(X3, iX1, iX2, iX3) &
                       + OuterEdge(X3, iX1, iX2, iX3) )

        END DO
      END DO
    END DO

    CALL DMDAVecRestoreArrayF90( &
           M % CoordinateDA, M % Positions % InnerEdgeGlobal, InnerEdge, Error)
    CALL DMDAVecRestoreArrayF90( &
           M % CoordinateDA, M % Positions % OuterEdgeGlobal, OuterEdge, Error)
    CALL DMDAVecRestoreArrayF90( &
           M % CoordinateDA, M % Positions % CenterGlobal,    Center,    Error)

  END SUBROUTINE CreateMesh3D

CreatePositions

CreatePositions creates global and local vectors containing position coordinates (e.g., InnerEdgeGlobal and InnerEdgeLocal). Vectors created with DMCreateGlobalVector do not have accessible ghost points. Vectors created with DMCreateLocalVector have accessible ghost points. Calls to DMDAGetCorners and DMDAGetGhostCorners return the indices of the lower left corner of the local region and the extent (width) of the local region.

 SUBROUTINE CreatePositions(Positions, nDimensions, CoordinateDA)

    TYPE(PositionsType), POINTER :: &
      Positions
    PetscInt :: &
      nDimensions
    DM :: &
      CoordinateDA

    PetscInt :: &
      Error, &
      iDim

    ! Global (Non-Ghosted) Position Vectors:

    CALL DMCreateGlobalVector( &
           CoordinateDA,                Positions % InnerEdgeGlobal, Error)
    CALL VecDuplicate( &
           Positions % InnerEdgeGlobal, Positions % CenterGlobal,    Error)
    CALL VecDuplicate( &
           Positions % InnerEdgeGlobal, Positions % OuterEdgeGlobal, Error)

    Positions % iBX(:) = 0
    Positions % iWX(:) = 1
    Positions % iEX(:) = 0

    ! Get Indices for Non-Ghosted Vectors:

    SELECT CASE (nDimensions)
      CASE (1)
        CALL DMDAGetCorners( &
               CoordinateDA, &
               Positions % iBX(1), PETSC_NULL_INTEGER, PETSC_NULL_INTEGER, &
               Positions % iWX(1), PETSC_NULL_INTEGER, PETSC_NULL_INTEGER, &
               Error)

      CASE (2)
        CALL DMDAGetCorners( &
               CoordinateDA, &
               Positions % iBX(1), Positions % iBX(2), PETSC_NULL_INTEGER, &
               Positions % iWX(1), Positions % iWX(2), PETSC_NULL_INTEGER, &
               Error)

      CASE(3)
        CALL DMDAGetCorners( &
               CoordinateDA, &
               Positions % iBX(1), Positions % iBX(2), Positions % iBX(3), &
               Positions % iWX(1), Positions % iWX(2), Positions % iWX(3), &
               Error)

    END SELECT

    DO iDim = 1, nDimensions
      Positions % iEX(iDim) &
        = Positions % iBX(iDim) &
            + Positions % iWX(iDim) - 1
    END DO

    !  Local (Ghosted) Position Vectors:

    CALL DMCreateLocalVector( &
           CoordinateDA,               Positions % InnerEdgeLocal, Error)
    CALL VecDuplicate( &
           Positions % InnerEdgeLocal, Positions % CenterLocal,    Error)
    CALL VecDuplicate( &
           Positions % InnerEdgeLocal, Positions % OuterEdgeLocal, Error)

    Positions % iBXG(:) = 0
    Positions % iWXG(:) = 1
    Positions % iEXG(:) = 0

    !  Get Indices for Ghosted Vectors:

    SELECT CASE (nDimensions)
      CASE (1)
        CALL DMDAGetGhostCorners( &
               CoordinateDA, &
               Positions % iBXG(1), PETSC_NULL_INTEGER, PETSC_NULL_INTEGER, &
               Positions % iWXG(1), PETSC_NULL_INTEGER, PETSC_NULL_INTEGER, &
               Error)

      CASE (2)
        CALL DMDAGetGhostCorners( &
               CoordinateDA, &
               Positions % iBXG(1), Positions % iBXG(2), PETSC_NULL_INTEGER, &
               Positions % iWXG(1), Positions % iWXG(2), PETSC_NULL_INTEGER, &
               Error)

      CASE (3)
        CALL DMDAGetGhostCorners( &
               CoordinateDA, &
               Positions % iBXG(1), Positions % iBXG(2), Positions % iBXG(3), &
               Positions % iWXG(1), Positions % iWXG(2), Positions % iWXG(3), &
               Error)

    END SELECT

    DO iDim = 1, nDimensions
      Positions % iEXG(iDim) &
        = Positions % iBXG(iDim) &
            + Positions % iWXG(iDim) - 1
    END DO

  END SUBROUTINE CreatePositions

DestroyMesh

DestroyMesh frees memory allocated by CreateMesh

  SUBROUTINE DestroyMesh(M)

    TYPE(MeshType), POINTER :: &
      M

    PetscInt :: &
      Error

    DEALLOCATE( M % nZones )
    DEALLOCATE( M % InnerBoundaries )
    DEALLOCATE( M % OuterBoundaries )
    DEALLOCATE( M % BoundaryType )

    CALL DestroyPositions(M % Positions)
    DEALLOCATE(M % Positions)

    CALL DMDestroy(M % MeshDA, Error)

  END SUBROUTINE DestroyMesh

DestroyPositions

DestroyPositions destroys the vectors created in CreatePositions

  SUBROUTINE DestroyPositions(Positions)

    TYPE(PositionsType), POINTER :: &
      Positions

    PetscInt :: &
      Error

    CALL VecDestroy(Positions % InnerEdgeGlobal, Error)
    CALL VecDestroy(Positions % CenterGlobal,    Error)
    CALL VecDestroy(Positions % OuterEdgeGlobal, Error)

    CALL VecDestroy(Positions % InnerEdgeLocal, Error)
    CALL VecDestroy(Positions % CenterLocal,    Error)
    CALL VecDestroy(Positions % OuterEdgeLocal, Error)

  END SUBROUTINE DestroyPositions

FieldsModule.F90

The following PETSc routines are called:
DMDACreate3D
DMCreateGlobalVector
DMCreateLocalVector
VecDestroy
DMDestroy

D, V1, V2, V3 are integers used to access physical variables/fields (here representing the density and three components of velocity).

  ! Field Names

  PetscInt, PUBLIC, PARAMETER :: &
    D        = 0, &
    V1       = 1, &
    V2       = 2, &
    V3       = 3, &
    N_FIELDS = 4

FieldsType

FieldsType contains vectors to hold the physical fields: FieldsGlobal (non-ghosted) and FieldsLocal (ghosted). FieldsDA (of type DM) is the object used to manage data for a structured grid (e.g., exchange of data between processes). The integer DegreesOfFreedom holds the number of physical fields. It is used to create the FieldsDA object.

  TYPE, PUBLIC :: FieldsType
    PetscInt :: &
      DegreesOfFreedom
    Vec :: &
      FieldsGlobal, &
      FieldsLocal
    DM :: &
      FieldsDA
  END TYPE FieldsType

CreateFields

CreateFields initializes and creates data structures defined in FieldsType. Depending on the dimensionality of the problem, it creates the FieldsDA object by calling either DMDACreate1D, DMDACreate2D, or DMDACreate3D. It also creates the global and local vectors to hold the physical fields. Note that DegreesOfFreedom > 1 in this case, which has consequences for the shape of the array returned by DMDAVecGetArrayF90 (see the initialization of FieldsGlobal in Advection3D.F90). For DegreesOfFreedom > 1 the first index in the array returned by DMDAVecGetArrayF90 is used to access specific fields. The remaining three indices refer to the field’s logical position in the mesh.

  SUBROUTINE CreateFields(F, M, Comm, Rank)

    TYPE(FieldsType), POINTER :: &
      F
    TYPE(MeshType), POINTER :: &
      M
    PetscInt, INTENT(in) :: &
      Comm, &
      Rank

    PetscInt :: &
      Error

    F % DegreesOfFreedom = N_FIELDS

    IF(Rank == 0)THEN
      PRINT*
      PRINT*, "  INFO: Creating Fields"
      PRINT*, "    Degrees of Freedom = ", F % DegreesOfFreedom
    END IF

    SELECT CASE (M % nDimensions)
      CASE (1)
        CALL DMDACreate1D( &
               Comm,                 &
               M % BoundaryType(1),  &
               M % nZones(1),        &
               F % DegreesOfFreedom, &
               M % StencilWidth,     &
               PETSC_NULL_INTEGER,   &
               F % FieldsDA,         &
               Error)

      CASE (2)
        CALL DMDACreate2D( &
               Comm,                 &
               M % BoundaryType(1),  &
               M % BoundaryType(2),  &
               M % StencilType,      &
               M % nZones(1),        &
               M % nZones(2),        &
               PETSC_DECIDE,         &
               PETSC_DECIDE,         &
               F % DegreesOfFreedom, &
               M % StencilWidth,     &
               PETSC_NULL_INTEGER,   &
               PETSC_NULL_INTEGER,   &
               F % FieldsDA,         &
               Error)

      CASE (3)
        CALL DMDACreate3D( &
               Comm,                 &
               M % BoundaryType(1),  &
               M % BoundaryType(2),  &
               M % BoundaryType(3),  &
               M % StencilType,      &
               M % nZones(1),        &
               M % nZones(2),        &
               M % nZones(3),        &
               PETSC_DECIDE,         &
               PETSC_DECIDE,         &
               PETSC_DECIDE,         &
               F % DegreesOfFreedom, &
               M % StencilWidth,     &
               PETSC_NULL_INTEGER,   &
               PETSC_NULL_INTEGER,   &
               PETSC_NULL_INTEGER,   &
               F % FieldsDA,         &
               Error)

    END SELECT

    CALL DMCreateGlobalVector( &
           F % FieldsDA, F % FieldsGlobal, Error)

    CALL DMCreateLocalVector( &
           F % FieldsDA, F % FieldsLocal,  Error)

  END SUBROUTINE CreateFields

DestroyFields

DestroyFields frees memory allocated by CreateFields

  SUBROUTINE DestroyFields(F)

    TYPE(FieldsType), POINTER :: &
      F

    PetscInt :: &
      Error

    CALL VecDestroy(F % FieldsGlobal, Error)

    CALL VecDestroy(F % FieldsLocal,  Error)

    CALL DMDestroy(F % FieldsDA, Error)

  END SUBROUTINE DestroyFields

Advection3D.F90 (continued)

The following PETSc routines are called:

VecDuplicate
DMDAVecGetArrayF90
DMDAVecRestoreArrayF90
VecCopy
DMGlobalToLocalBegin
DMGlobalToLocalEnd
VecDestroy

!  Get Access to Mesh Coordinates:

  CALL DMDAVecGetArrayF90( &
         M % CoordinateDA, M % Positions % InnerEdgeGlobal, InnerEdge, Error)
  CALL DMDAVecGetArrayF90( &
         M % CoordinateDA, M % Positions % CenterGlobal,    Center,    Error)
  CALL DMDAVecGetArrayF90( &
         M % CoordinateDA, M % Positions % OuterEdgeGlobal, OuterEdge, Error)

  !  Release Access to Coordinates:

  CALL DMDAVecRestoreArrayF90( &
         M % CoordinateDA, M % Positions % InnerEdgeGlobal, InnerEdge, Error)
  CALL DMDAVecRestoreArrayF90( &
         M % CoordinateDA, M % Positions % CenterGlobal,    Center,    Error)
  CALL DMDAVecRestoreArrayF90( &
         M % CoordinateDA, M % Positions % OuterEdgeGlobal, OuterEdge, Error)

Set initial condition

To initialize FieldsGlobal, we first call DMDAVecGetArrayF90, which returns FieldsOld, a rank-four array (DegreesOfFreedom > 1) where the first dimension holds the individual fields and the next three dimensions hold the logical position in the mesh. After initialization of the data we call DMDAVecRestoreArrayF90.

  !  Get Access to Non Ghosted Fields Data:

  CALL DMDAVecGetArrayF90( &
         F % FieldsDA, F % FieldsGlobal, FieldsOld, Error)

  !  Set Initial Condition and Compute Time Step: 

  MyTimeStep = Huge

  DO iX3 = iBX3, iEX3
    DO iX2 = iBX2, iEX2
      DO iX1 = iBX1, iEX1

        FieldsOld(D,  iX1, iX2, iX3) &
          = SIN( Two * Pi * Center(X1, iX1, iX2, iX3) )

        FieldsOld(V1, iX1, iX2, iX3) &
          = One
        FieldsOld(V2, iX1, iX2, iX3) &
          = Zero
        FieldsOld(V3, iX1, iX2, iX3) &
          = Zero

        MyTimeStep &
          = MIN(MyTimeStep, &
                ( OuterEdge(X1, iX1, iX2, iX3) &
                  - InnerEdge(X1, iX1, iX2, iX3) ) &
                / ( Tiny + FieldsOld(V1, iX1, iX2, iX3) ) )

        MyTimeStep &
          = MIN(MyTimeStep, &
                ( OuterEdge(X2, iX1, iX2, iX3) &
                  - InnerEdge(X2, iX1, iX2, iX3) ) &
                / ( Tiny + FieldsOld(V2, iX1, iX2, iX3) ) )

        MyTimeStep &
          = MIN(MyTimeStep, &
                ( OuterEdge(X3, iX1, iX2, iX3) &
                  - InnerEdge(X3, iX1, iX2, iX3) ) &
                / ( Tiny + FieldsOld(V3, iX1, iX2, iX3) ) )

      END DO
    END DO
  END DO

  !  Release Access to Non-Ghosted Fields:

  CALL DMDAVecRestoreArrayF90( &
         F % FieldsDA, F % FieldsGlobal, FieldsOld, Error)

  !  Reduce Global Time Step:

  CALL MPI_ALLREDUCE( &
         MyTimeStep, TimeStep, 1, MPI_DOUBLE_PRECISION, MPI_MIN, &
         U % Communicator, Error)

  !  Populate Ghosted Fields with Initial Condition:

  CALL DMGlobalToLocalBegin( &
         F % FieldsDA, F % FieldsGlobal, INSERT_VALUES, F % FieldsLocal, Error)
  CALL DMGlobalToLocalEnd( &
         F % FieldsDA, F % FieldsGlobal, INSERT_VALUES, F % FieldsLocal, Error)

  !  Keep Copy of Initial Condition:

  CALL VecDuplicate(F % FieldsLocal, InitialCondition, Error)
  CALL VecCopy(     F % FieldsLocal, InitialCondition, Error)

Evolve initial condition

[Figure: decomposition of a process's cells into exchange (red) and non-exchange (green) cells]

Mesh cells in the subdomain owned by a particular process are divided into “exchange cells” (red) and “non-exchange cells” (green). A process's exchange cells are ghost cells for neighboring processes.

The time step loop is organized in the following way to overlap communications and computations:

  1. Update cells whose data is communicated with neighboring processes (AdvectExchangeCells3D; see AdvectionModule.F90); red cells in the above figure
  2. Begin ghost exchange by calling DMGlobalToLocalBegin
  3. Update cells whose data is not communicated with neighboring processes (AdvectNonExchangeCells3D; see AdvectionModule.F90); green cells in the above figure
  4. Finish ghost exchange by calling DMGlobalToLocalEnd. Ghost cells are now populated with updated values and are ready for the next time step

  !  Begin Time Step Loop:

  Done      = .FALSE.
  iCycle    = 0
  MaxCycles = 100000
  Time      = Zero
  EndTime   = One

  DO WHILE (.NOT.Done)

    iCycle = iCycle + 1

    IF(U % MPI_Rank == 0)THEN
      PRINT*
      PRINT*, "    Cycle     = ", iCycle
      PRINT*, "    Time Step = ", TimeStep
      PRINT*, "    Time      = ", Time
    END IF

    !  Copy Ghosted Data to Local Work Vector:

    CALL VecCopy(F % FieldsLocal, FieldsScratch, Error)

    !  Get Access to Fields Data:

    CALL DMDAVecGetArrayF90( &
           F % FieldsDA, FieldsScratch,    FieldsOld, Error)
    CALL DMDAVecGetArrayF90( &
           F % FieldsDA, F % FieldsGlobal, FieldsNew, Error)

    !  Update Cells Whose Data is Communicated:

    CALL AdvectExchangeCells3D(FieldsNew, FieldsOld, InnerEdge, OuterEdge, &
           TimeStep, iBX1, iBX2, iBX3, iEX1, iEX2, iEX3, SW)

    !  Begin Ghost Exchange:

    CALL DMGlobalToLocalBegin( &
           F % FieldsDA, F % FieldsGlobal, INSERT_VALUES, F % FieldsLocal, &
           Error)

    !  Update Cells Whose Data is Not Communicated:

    CALL AdvectNonExchangeCells3D(FieldsNew, FieldsOld, InnerEdge, OuterEdge, &
           TimeStep, iBX1, iBX2, iBX3, iEX1, iEX2, iEX3, SW)

    !  Release Access to Fields Data:

    CALL DMDAVecRestoreArrayF90( &
           F % FieldsDA, FieldsScratch,    FieldsOld, Error)
    CALL DMDAVecRestoreArrayF90( &
           F % FieldsDA, F % FieldsGlobal, FieldsNew, Error)

    !  End Ghost Exchange:

    CALL DMGlobalToLocalEnd( &
           F % FieldsDA, F % FieldsGlobal, INSERT_VALUES, F % FieldsLocal, &
           Error)

    !  Copy New to Old:

    CALL DMDAVecGetArrayF90( &
           F % FieldsDA, F % FieldsLocal,  FieldsOld, Error)
    CALL DMDAVecGetArrayF90( &
           F % FieldsDA, F % FieldsGlobal, FieldsNew, Error)

    DO iX3 = iBX3, iEX3
      DO iX2 = iBX2, iEX2
        DO iX1 = iBX1, iEX1
          DO iField = 0, DOF - 1

            FieldsOld(iField, iX1, iX2, iX3) &
              = FieldsNew(iField, iX1, iX2, iX3)

          END DO
        END DO
      END DO
    END DO

    CALL DMDAVecRestoreArrayF90( &
           F % FieldsDA, F % FieldsLocal,  FieldsOld, Error)
    CALL DMDAVecRestoreArrayF90( &
           F % FieldsDA, F % FieldsGlobal, FieldsNew, Error)

    Time = Time + TimeStep

    IF(iCycle >= MaxCycles .OR. Time >= EndTime) Done = .TRUE.

  END DO

  !  Destroy Local Scratch Vector:

  CALL VecDestroy(FieldsScratch, Error)

Compare result with initial condition

  !  Compare Evolved Solution with Initial Condition in the L1 Norm:

  CALL DMDAVecGetArrayF90( &
         F % FieldsDA, InitialCondition, FieldsOld, Error)
  CALL DMDAVecGetArrayF90( &
         F % FieldsDA, F % FieldsLocal,  FieldsNew, Error)

  MyL1Norm = Zero
  DO iX3 = iBX3, iEX3
    DO iX2 = iBX2, iEX2
      DO iX1 = iBX1, iEX1

        MyL1Norm &
          = MyL1Norm &
              + ABS(FieldsNew(D, iX1, iX2, iX3) - FieldsOld(D, iX1, iX2, iX3))

      END DO
    END DO
  END DO

  CALL MPI_REDUCE( &
         MyL1Norm, L1Norm, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, &
         U % Communicator, Error)

  IF(U % MPI_Rank == 0)THEN
    PRINT*
    PRINT*, "  INFO: L1 Norm = ", L1Norm
  END IF

  CALL DMDAVecRestoreArrayF90( &
         F % FieldsDA, InitialCondition, FieldsOld, Error)
  CALL DMDAVecRestoreArrayF90( &
         F % FieldsDA, F % FieldsLocal,  FieldsNew, Error)

  CALL VecDestroy(InitialCondition, Error)

AdvectionModule.F90

AdvectionModule contains AdvectExchangeCells3D and AdvectNonExchangeCells3D which call the kernel Advect3D on exchange cells and non-exchange cells, respectively.

AdvectExchangeCells3D

  SUBROUTINE AdvectExchangeCells3D( &
               FieldsNew, FieldsOld, InnerEdge, OuterEdge, TimeStep, &
               iBX1, iBX2, iBX3, iEX1, iEX2, iEX3, SW)

    PetscReal, DIMENSION(:,:,:,:), POINTER :: &
      FieldsNew, &
      FieldsOld, &
      InnerEdge, &
      OuterEdge
    PetscReal, INTENT(in) :: &
      TimeStep
    PetscInt, INTENT(in) :: &
      iBX1, iBX2, iBX3, &
      iEX1, iEX2, iEX3, SW

    PetscInt :: &
      X1B, X1E, &
      X2B, X2E, &
      X3B, X3E

    !  1:
    X1B = iBX1
    X1E = iBX1 + SW - 1
    X2B = iBX2
    X2E = iEX2
    X3B = iBX3
    X3E = iEX3
    CALL Advect3D( &
           FieldsNew(D,                        &
             X1B : X1E, X2B : X2E, X3B : X3E), &
           FieldsOld(D,                        &
             X1B - SW : X1E + SW,              &
             X2B - SW : X2E + SW,              &
             X3B - SW : X3E + SW),             &
           FieldsNew(V1,                       &
             X1B : X1E, X2B : X2E, X3B : X3E), &
           OuterEdge(X1,                       &
             X1B : X1E, X2B : X2E, X3B : X3E)  &
           - InnerEdge(X1,                     &
             X1B : X1E, X2B : X2E, X3B : X3E), &
           TimeStep,                           &
           X1B = X1B, X1E = X1E,               &
           X2B = X2B, X2E = X2E,               &
           X3B = X3B, X3E = X3E, SW = SW)

    !  2:
    X1B = iEX1 - SW + 1
    X1E = iEX1
    X2B = iBX2
    X2E = iEX2
    X3B = iBX3
    X3E = iEX3
    CALL Advect3D( &
           FieldsNew(D,                        &
             X1B : X1E, X2B : X2E, X3B : X3E), &
           FieldsOld(D,                        &
             X1B - SW : X1E + SW,              &
             X2B - SW : X2E + SW,              &
             X3B - SW : X3E + SW),             &
           FieldsNew(V1,                       &
             X1B : X1E, X2B : X2E, X3B : X3E), &
           OuterEdge(X1,                       &
             X1B : X1E, X2B : X2E, X3B : X3E)  &
           - InnerEdge(X1,                     &
             X1B : X1E, X2B : X2E, X3B : X3E), &
           TimeStep,                           &
           X1B = X1B, X1E = X1E,               &
           X2B = X2B, X2E = X2E,               &
           X3B = X3B, X3E = X3E, SW = SW)

    !  3:
    X1B = iBX1 + SW
    X1E = iEX1 - SW
    X2B = iBX2
    X2E = iBX2 + SW - 1
    X3B = iBX3
    X3E = iEX3
    CALL Advect3D( &
           FieldsNew(D,                        &
             X1B : X1E, X2B : X2E, X3B : X3E), &
           FieldsOld(D,                        &
             X1B - SW : X1E + SW,              &
             X2B - SW : X2E + SW,              &
             X3B - SW : X3E + SW),             &
           FieldsNew(V1,                       &
             X1B : X1E, X2B : X2E, X3B : X3E), &
           OuterEdge(X1,                       &
             X1B : X1E, X2B : X2E, X3B : X3E)  &
           - InnerEdge(X1,                     &
             X1B : X1E, X2B : X2E, X3B : X3E), &
           TimeStep,                           &
           X1B = X1B, X1E = X1E,               &
           X2B = X2B, X2E = X2E,               &
           X3B = X3B, X3E = X3E, SW = SW)

    !  4:
    IF(iEX2 > iBX2)THEN
      X1B = iBX1 + SW
      X1E = iEX1 - SW
      X2B = iEX2 - SW + 1
      X2E = iEX2
      X3B = iBX3
      X3E = iEX3
      CALL Advect3D( &
             FieldsNew(D,                        &
               X1B : X1E, X2B : X2E, X3B : X3E), &
             FieldsOld(D,                        &
               X1B - SW : X1E + SW,              &
               X2B - SW : X2E + SW,              &
               X3B - SW : X3E + SW),             &
             FieldsNew(V1,                       &
               X1B : X1E, X2B : X2E, X3B : X3E), &
             OuterEdge(X1,                       &
               X1B : X1E, X2B : X2E, X3B : X3E)  &
             - InnerEdge(X1,                     &
               X1B : X1E, X2B : X2E, X3B : X3E), &
             TimeStep,                           &
             X1B = X1B, X1E = X1E,               &
             X2B = X2B, X2E = X2E,               &
             X3B = X3B, X3E = X3E, SW = SW)
    END IF

    !  5:
    X1B = iBX1 + SW
    X1E = iEX1 - SW
    X2B = iBX2 + SW
    X2E = iEX2 - SW
    X3B = iBX3
    X3E = iBX3 + SW - 1
    CALL Advect3D( &
           FieldsNew(D,                        &
             X1B : X1E, X2B : X2E, X3B : X3E), &
           FieldsOld(D,                        &
             X1B - SW : X1E + SW,              &
             X2B - SW : X2E + SW,              &
             X3B - SW : X3E + SW),             &
           FieldsNew(V1,                       &
             X1B : X1E, X2B : X2E, X3B : X3E), &
           OuterEdge(X1,                       &
             X1B : X1E, X2B : X2E, X3B : X3E)  &
           - InnerEdge(X1,                     &
             X1B : X1E, X2B : X2E, X3B : X3E), &
           TimeStep,                           &
           X1B = X1B, X1E = X1E,               &
           X2B = X2B, X2E = X2E,               &
           X3B = X3B, X3E = X3E, SW = SW)

    !  6:
    IF(iEX3 > iBX3)THEN
      X1B = iBX1 + SW
      X1E = iEX1 - SW
      X2B = iBX2 + SW
      X2E = iEX2 - SW
      X3B = iEX3 - SW + 1
      X3E = iEX3
      CALL Advect3D( &
             FieldsNew(D,                        &
               X1B : X1E, X2B : X2E, X3B : X3E), &
             FieldsOld(D,                        &
               X1B - SW : X1E + SW,              &
               X2B - SW : X2E + SW,              &
               X3B - SW : X3E + SW),             &
             FieldsNew(V1,                       &
               X1B : X1E, X2B : X2E, X3B : X3E), &
             OuterEdge(X1,                       &
               X1B : X1E, X2B : X2E, X3B : X3E)  &
             - InnerEdge(X1,                     &
               X1B : X1E, X2B : X2E, X3B : X3E), &
             TimeStep,                           &
             X1B = X1B, X1E = X1E,               &
             X2B = X2B, X2E = X2E,               &
             X3B = X3B, X3E = X3E, SW = SW)
    END IF

  END SUBROUTINE AdvectExchangeCells3D

AdvectNonExchangeCells3D

  SUBROUTINE AdvectNonExchangeCells3D( &
               FieldsNew, FieldsOld, InnerEdge, OuterEdge, TimeStep, &
               iBX1, iBX2, iBX3, iEX1, iEX2, iEX3, SW)

    PetscReal, DIMENSION(:,:,:,:), POINTER :: &
      FieldsNew, &
      FieldsOld, &
      InnerEdge, &
      OuterEdge
    PetscReal, INTENT(in) :: &
      TimeStep
    PetscInt, INTENT(in) :: &
      iBX1, iBX2, iBX3, &
      iEX1, iEX2, iEX3, SW

    PetscInt :: &
      X1B, X1E, &
      X2B, X2E, &
      X3B, X3E

    X1B = iBX1 + SW
    X1E = iEX1 - SW
    X2B = iBX2 + SW
    X2E = iEX2 - SW
    X3B = iBX3 + SW
    X3E = iEX3 - SW
    CALL Advect3D( &
           FieldsNew(D,                        &
             X1B : X1E, X2B : X2E, X3B : X3E), &
           FieldsOld(D,                        &
             X1B - SW : X1E + SW,              &
             X2B - SW : X2E + SW,              &
             X3B - SW : X3E + SW),             &
           FieldsNew(V1,                       &
             X1B : X1E, X2B : X2E, X3B : X3E), &
           OuterEdge(X1,                       &
             X1B : X1E, X2B : X2E, X3B : X3E)  &
           - InnerEdge(X1,                     &
             X1B : X1E, X2B : X2E, X3B : X3E), &
           TimeStep,                           &
           X1B = X1B, X1E = X1E,               &
           X2B = X2B, X2E = X2E,               &
           X3B = X3B, X3E = X3E, SW = SW)

  END SUBROUTINE AdvectNonExchangeCells3D

Advect3D

  SUBROUTINE Advect3D(D_NEW, D_OLD, V1, dX1, dt, &
               X1B, X1E, X2B, X2E, X3B, X3E, SW)

    PetscInt, INTENT(in) :: &
      X1B, X1E, X2B, X2E, X3B, X3E, SW
    PetscReal, DIMENSION(X1B : X1E, X2B : X2E, X3B : X3E), INTENT(inout) :: &
      D_NEW
    PetscReal, DIMENSION(X1B - SW : X1E + SW, &
                         X2B - SW : X2E + SW, &
                         X3B - SW : X3E + SW), INTENT(in) :: &
      D_OLD
    PetscReal, DIMENSION(X1B : X1E, X2B : X2E, X3B : X3E), INTENT(in) :: &
      V1, &
      dX1
    PetscReal, INTENT(in) :: &
      dt

    PetscReal :: &
      FluxOuterX1, &
      FluxInnerX1
    PetscInt :: &
      iX1, iX2, iX3

    DO iX3 = X3B, X3E
      DO iX2 = X2B, X2E
        DO iX1 = X1B, X1E

          FluxInnerX1 &
            = Half * ( ( V1(iX1, iX2, iX3) &
                         + ABS(V1(iX1, iX2, iX3)) ) * D_OLD(iX1-1, iX2, iX3) &
                       + ( V1(iX1, iX2, iX3) &
                           - ABS(V1(iX1, iX2, iX3)) ) * D_OLD(iX1, iX2, iX3) )

          FluxOuterX1 &
            = Half * ( ( V1(iX1, iX2, iX3) &
                         + ABS(V1(iX1, iX2, iX3)) ) * D_OLD(iX1, iX2, iX3) &
                       + ( V1(iX1, iX2, iX3) &
                           - ABS(V1(iX1, iX2, iX3)) ) * D_OLD(iX1+1, iX2, iX3) )

          D_NEW(iX1, iX2, iX3) &
            = D_OLD(iX1, iX2, iX3) &
                - dt * ( FluxOuterX1 - FluxInnerX1 ) / dX1(iX1, iX2, iX3)

        END DO
      END DO
    END DO

  END SUBROUTINE Advect3D

Compiling

$ module load cray-petsc
$ module load git
$ git clone https://github.com/olcf/PETSc-tutorial.git
$ cd PETSc-tutorial
$ mkdir Executables
$ cd Executables
$ ftn -c ../RealKindModule.F90 ../MeshModule.F90 ../FieldsModule.F90 ../UniverseModule.F90 ../AdvectionModule.F90 ../Advection3D.F90
$ ftn -o Advection3D RealKindModule.o MeshModule.o FieldsModule.o UniverseModule.o AdvectionModule.o Advection3D.o

Running

$ aprun -n16 ./Advection3D

 INFO: Creating Universe Advection3D

   INFO: Number of MPI Ranks =            16

   INFO: Creating Mesh
     Dimensionality =             3
     Stencil Width  =             1
       iDim, nZones =             1          256
         Inner Boundary   =     0.000000000000000
         Outer Boundary   =     1.000000000000000
       iDim, nZones =             2          256
         Inner Boundary   =     0.000000000000000
         Outer Boundary   =     1.000000000000000
       iDim, nZones =             3          256
         Inner Boundary   =     0.000000000000000
         Outer Boundary   =     1.000000000000000

   INFO: Creating Fields
     Degrees of Freedom =             4

     Cycle     =             1
     Time Step =    3.9062500000000000E-003
     Time      =     0.000000000000000

     Cycle     =             2
     Time Step =    3.9062500000000000E-003
     Time      =    3.9062500000000000E-003

     Cycle     =             3
     Time Step =    3.9062500000000000E-003
     Time      =    7.8125000000000000E-003

     ...

     Cycle     =           256
     Time Step =    3.9062500000000000E-003
     Time      =    0.9960937500000000

   INFO: Finished Evolution
     Cycles Evolved =           256
     Time           =     1.000000000000000

   INFO: L1 Norm =    2.2737367544323206E-013

 INFO: Destroying Universe Advection3D

   INFO:            16  MPI Ranks Finished in     109.6593058109283
  Seconds

This tutorial covers how to write a parallel program to calculate π using the Monte Carlo method. The first code sample is a simple serial implementation. The next samples are parallelized using MPI and OpenMP, and the final sample combines both of these parallel techniques.

Serial Implementation

The codes use a Monte Carlo method to estimate π. To set up the estimate, randomly located points are generated within a 2×2 square that has a circle of radius 1 inscribed within it (think of a game of darts). The algorithm generates a large number of points and checks whether the coordinates, x and y, of each point fall inside the circle, i.e., x²+y²≤1. The ratio, P, of points inside the circle to the total number of points tried is then calculated. Using this ratio, π can be approximated by noting that P approaches the ratio of the area of the circle to the area of the square as the number of points, niter, becomes large. Thus P = (πr²)/(4r²) = π/4, and multiplying P by 4 gives an approximation of π. (Note: the more points generated (larger niter), the more accurate the approximation will be.)
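
For example, if 7,853,982 of 10,000,000 generated points land inside the circle, the estimate is π ≈ 4 × 0.7853982 = 3.1415928.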

serialpi.c

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

int main(int argc, char* argv[])
{
	int niter = 10000000;   //number of random points to generate
	double x,y;             //x,y value for the random coordinate
	int i;
	int count=0;            //count holds the number of points inside the circle
	double z;               //distance of the point from the origin
	double pi;              //holds approx value of pi

	srandom(time(NULL));    //seed random()
	//main loop
	for (i=0; i<niter; ++i)
	{
		//get random points
		x = (double)random()/RAND_MAX;
		y = (double)random()/RAND_MAX;
		z = sqrt((x*x)+(y*y));
		//check to see if point is in unit circle
		if (z<=1)
		{
			++count;
		}
	}
	pi = ((double)count/(double)niter)*4.0;          //p = 4(m/n)
	printf("Pi: %f\n", pi);

	return 0;
}

Next, compile the code using the compiler wrappers on the supercomputer:

cc serialpi.c -o a.out

To run this program on Titan, or any of the other clusters, a job needs to be submitted in the form of a PBS batch script. Here is an example batch script, titan_pi.pbs:

#!/bin/bash
#    Begin PBS directives
#PBS -A AAA###
#PBS -N pi
#PBS -j oe
#PBS -l walltime=0:01:00,nodes=1
#PBS -V
#PBS -q debug
#    End PBS directives and begin shell commands

cd $PBS_O_WORKDIR
date
aprun -n 1 ./a.out

Replace AAA### with the appropriate project ID that the time will be charged to. See the Titan System User Guide for a more in-depth look at writing batch scripts.

To submit this batch script, run the following:

qsub titan_pi.pbs

When executed, the result will be something like:

Pi: 3.141664

Parallelize with MPI

MPI Send/Recv

In this next version of the program, the code has been parallelized with MPI. In order to do this, a few things need to be set up in the code. First, the mpi.h header file needs to be included.


#include "mpi.h"

Next, an MPI instance needs to be created. To do this, add this:

MPI_Init(&argc, &argv);

right before the code to be parallelized with MPI. To utilize the created MPI instance, the code needs to have MPI sends and receives present. This message sending and receiving is the basic concept behind MPI; many MPI functions can be implemented with combinations of basic send and receive calls. Here, the code is set up to do computations on all but the master node (MPI rank 0). The only thing that rank 0 does is receive the messages (the calculated points inside/outside the circle) sent by the other nodes and calculate the ratio based upon that data. The MPI_Recv() function needs to be added so that the master node can receive the messages. To set up the receive, it needs the following info:

int MPI_Recv(void *buf,
             int count,
             MPI_Datatype datatype,
             int source,
             int tag,
             MPI_Comm comm,
             MPI_Status *status);
  • A buffer to store the received data
  • The maximum number of elements the receive buffer can hold
  • An MPI_Datatype of the appropriate type for the data to be received
  • The rank number that the message will be sent from. This can be the wildcard MPI_ANY_SOURCE
  • A message tag. This serves as a label to identify a particular message, as defined by the application. It can be the wildcard MPI_ANY_TAG
  • The name of the MPI communicator, e.g., MPI_COMM_WORLD
  • A status variable

So the first MPI_Recv line should look like this:

MPI_Recv(&recieved[i],
         nodenum,
         MPI_INT,
         MPI_ANY_SOURCE,
         1,
         MPI_COMM_WORLD,
         MPI_STATUS_IGNORE);

Next, an MPI_Send() needs to be set up so that the master rank has data to receive. MPI_Send() only needs to know:

int MPI_Send(void *buf,
             int count,
             MPI_Datatype datatype,
             int dest,
             int tag,
             MPI_Comm comm);
  • A pointer to the send buffer (the data to send)
  • The number of elements in the send buffer
  • An MPI_Datatype describing the kind of data being sent
  • The destination rank
  • A message tag to help identify a specific message or type of message, as specified/defined by the application
  • The MPI communicator

So with this in mind, the first MPI_Send needs to look like this:

MPI_Send(&count,
         1,
         MPI_INT,
         0,
         1,
         MPI_COMM_WORLD);

Finally, the MPI instance should be closed out before the program exits. To do this simply add this line after the parallel code:

MPI_Finalize();

This code is written so that each node runs the same algorithm to place the random points, categorize them, and count them. The task assigned to the master node collects the total number of tried points and the total number of points inside the circle. It takes the ratio of the two and multiplies it by 4 to get a more accurate estimation of π. This is more accurate because each node runs as many iterations as the single node in the serial code ran, and the master node treats the results as if they all came from the same circle. The same thing can be achieved with the serial code by multiplying niter, the number of iterations, by the total number of nodes used in the parallelized version, but that is much less efficient because the work is not divided over several nodes running simultaneously.

mpiPi.c

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
#include <math.h>
#define SEED 35791246

int main(int argc, char* argv[])
{
	long niter = 1000000;
	int myid;						//holds process's rank id
	double x,y;						//x,y value for the random coordinate
	int i, count=0;                 //count holds the number of points inside the circle
	double z;						//Used to check if x^2+y^2<=1
	double pi;						//holds approx value of pi
	int nodenum;

	MPI_Init(&argc, &argv);					//Start MPI
	MPI_Comm_rank(MPI_COMM_WORLD, &myid);			//get rank of node's process
	MPI_Comm_size(MPI_COMM_WORLD, &nodenum);
	int recieved[nodenum];
	long recvniter[nodenum];
	srand(SEED+myid);						//Give rand() a seed value. Needs to be different on each node

	if(myid != 0)
	{
		for (i=0; i<niter; ++i)					//main loop
		{
			x= ((double)rand())/RAND_MAX;			//gets a random x coordinate
			y =((double)rand())/RAND_MAX;			//gets a random y coordinate
			z = sqrt(x*x+y*y);					//Checks to see if point is inside unit circle
			if (z<=1)
			{
				count++;				//if it is, consider it a valid random point
			}
		}
		//send this rank's count and its iteration total to the master rank (rank 0)
		MPI_Send(&count,
                         1,
                         MPI_INT,
                         0,
                         1,
                         MPI_COMM_WORLD);
		MPI_Send(&niter,
                         1,
                         MPI_LONG,
                         0,
                         2,
                         MPI_COMM_WORLD);
	}
	else if (myid == 0)
	{
		for(i=0; i<nodenum-1; ++i)			//receive one message pair from each worker rank
		{
			MPI_Recv(&recieved[i],
                                 nodenum,
                                 MPI_INT,
                                 MPI_ANY_SOURCE,
                                 1,
                                 MPI_COMM_WORLD,
                                 MPI_STATUS_IGNORE);
			MPI_Recv(&recvniter[i],
                                 nodenum,
                                 MPI_LONG,
                                 MPI_ANY_SOURCE,
                                 2,
                                 MPI_COMM_WORLD,
                                 MPI_STATUS_IGNORE);
		}
	}

	if (myid == 0)						//if root process
	{
		int finalcount = 0;
		long finalniter = 0;
		for(i = 0; i<nodenum-1; ++i)
		{
			finalcount += recieved[i];
			finalniter += recvniter[i];
		}

		pi = ((double)finalcount/(double)finalniter)*4.0;				//p = 4(m/n)
		printf("Pi: %f\n", pi);				//Print the calculated value of pi

	}

	MPI_Finalize();						//Close the MPI instance
	return 0;
}

Compile this code by running:

cc mpiPi.c -o a.out

The job submission is the same as the previous program but two lines in the batch script need to be edited:

#PBS -l walltime=0:01:00,nodes=1

the “nodes=1” needs to be changed to the number of nodes needed to run this code,
and

aprun -n 1 ./a.out

The “-n 1” needs to be changed to reflect the change made in the previous line (e.g. if the previous line now reads “nodes=100” then the aprun line needs to reflect that by saying “-n 100”).
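
For example, to run on 100 nodes the two lines would read:

#PBS -l walltime=0:01:00,nodes=100

aprun -n 100 ./a.out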

Now the job can be submitted to the batch system to be run when the number of requested nodes becomes available.

qsub titan_pi.pbs

MPI Reduce

In the following version of the code, the serial code has been parallelized with MPI again, but this time the collective MPI_Reduce() operation is used. MPI_Reduce() gathers the specified value from all ranks and reduces it to a single value on the master rank according to the specified operation (e.g., MPI_SUM). This has the same effect as the previous code, but makes for cleaner, easier-to-read code.
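
For reference, MPI_Reduce follows the same pattern as the MPI_Send and MPI_Recv prototypes shown above; it needs a send buffer, a receive buffer (significant on the root rank), an element count and datatype, the reduction operation, the root rank, and the communicator:

int MPI_Reduce(void *sendbuf,
               void *recvbuf,
               int count,
               MPI_Datatype datatype,
               MPI_Op op,
               int root,
               MPI_Comm comm);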

mpirPi.c

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
#include <math.h>
#include <time.h>

int main(int argc, char* argv[])
{
	int niter = 100000;
	int myid;						//holds process's rank id
	double x,y;						//x,y value for the random coordinate
	int i;
        int count=0;				//count holds the number of points inside the circle
	double z;						//Used to check if x^2+y^2<=1
	double pi;						//holds approx value of pi
	int reducedcount;					//total number of "good" points from all nodes
	int reducedniter;					//total number of ALL points from all nodes

	MPI_Init(&argc, &argv);					//Start MPI
	MPI_Comm_rank(MPI_COMM_WORLD, &myid);			//get rank of node's process

/* Everyone can do the work */

	srandom(time(NULL)+myid);				//Seed random() once, with a value unique to each rank
	for (i=0; i<niter; ++i)					//main loop
	{
		x = (double)random()/RAND_MAX;			//gets a random x coordinate
		y = (double)random()/RAND_MAX;			//gets a random y coordinate
		z = sqrt((x*x)+(y*y));				//Checks to see if point is inside unit circle
		if (z<=1)
		{
			++count;				//if it is, consider it a valid random point
		}
	}
	

	MPI_Reduce(&count,
                   &reducedcount,
                   1,
                   MPI_INT,
                   MPI_SUM,
                   0,
                   MPI_COMM_WORLD);
	MPI_Reduce(&niter,
                   &reducedniter,
                   1,
                   MPI_INT,
                   MPI_SUM,
                   0,
                   MPI_COMM_WORLD);
	
	if (myid == 0)						//if root process
	{
		pi = ((double)reducedcount/(double)reducedniter)*4.0;				//p = 4(m/n)
		printf("Pi: %f\n%i\n%d\n", pi, reducedcount, reducedniter);
		//Print the calculated value of pi

	}

	MPI_Finalize();						//Close the MPI instance
	return 0;
}

Compile this code by running:

cc mpirPi.c -o a.out

Again, the batch script needs to be edited to reflect the number of nodes requested for this code. Once that is done, submit the job by once again running:

qsub titan_pi.pbs

Parallelize with OpenMP

Here the original serial code has been parallelized with OpenMP instead of MPI. OpenMP is used to create separate threads to run on the multicore CPUs of Titan. To use OpenMP the following line needs to be added:


#include <omp.h>

Also, the #pragma omp parallel compiler directive needs to be added. This tells the compiler that the following block will be OpenMP parallel code. This code creates 16 threads, one for each core of the CPU. Since OpenMP is a shared memory API, one needs to keep in mind which variables need to be accessible by which threads. Here, each thread needs its own copy of x, y, z, and i, while the variable “count” needs to be accessible by all threads so that it can be incremented when needed. To accomplish this, the firstprivate() clause is used in the omp pragma to declare x, y, z, and i as thread private; firstprivate() also initializes each private copy with the variable’s value on entry to the parallel region. The variable “count” needs to be accessible to all threads (i.e. shared), so the shared() clause is used. Because all threads increment the same shared counter, the increment itself must be protected (an atomic directive is used in the code below) so that updates from different threads are not lost. Finally, OpenMP is told to spawn 16 threads (one for each core of the processor) by using the num_threads() clause.

omppi.c

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <math.h>
#include <time.h>

int main(int argc, char* argv[])
{
	int niter = 1000000;			//number of iterations per FOR loop
	double x,y;						//x,y value for the random coordinate
	int i;							//loop counter
        int count=0;				//count holds the number of points inside the circle
	double z;						//Used to check if x^2+y^2<=1
	double pi;						//holds approx value of pi
        int numthreads = 16;

	#pragma omp parallel firstprivate(x, y, z, i) shared(count) num_threads(numthreads)
	{
		srandom((int)time(NULL) ^ omp_get_thread_num());	//Give random() a seed value unique to each thread
		for (i=0; i<niter; ++i)				//main loop
		{
			x = (double)random()/RAND_MAX;		//gets a random x coordinate
			y = (double)random()/RAND_MAX;		//gets a random y coordinate
			z = sqrt((x*x)+(y*y));			//Checks to see if point is inside unit circle
			if (z<=1)
			{
				#pragma omp atomic
				++count;			//if it is, count it; the atomic protects the shared counter
			}
		}
	}
	pi = ((double)count/(double)(niter*numthreads))*4.0;
	printf("Pi: %f\n", pi);

	return 0;
}

Compiling it is similar to the previous examples, with one exception. When the compiler is called, the correct OpenMP flag needs to be added. This flag is different depending on which programming environment module you have loaded (run module list and look for a module that says PrgEnv-*****)

    • PrgEnv-pgi: -mp
    • PrgEnv-intel: -openmp
    • PrgEnv-cray: -openmp
    • PrgEnv-gnu: -fopenmp

Now the code can be compiled using this format:

cc -(appropriate flag here) omppi.c -o a.out
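
For example, with the PrgEnv-gnu module loaded, this would be:

cc -fopenmp omppi.c -o a.out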

The batch script needs to be edited so that only 1 node is requested (there is no MPI code in this version to use multiple nodes yet).

Submit the script:

qsub titan_pi.pbs

OpenMP + MPI Hybrid

This next code sample combines both MPI and OpenMP. This allows it to fully exploit the parallel architecture of Titan’s multicore CPUs as well as add a layer of parallelism from MPI. In this version, 16 OpenMP threads are launched on each MPI rank. Here, a different approach is taken with the OpenMP section. Instead of declaring count as shared, the reduction(+:count) clause is used. This clause effectively creates a private copy of “count” on each thread to be incremented; once all threads complete, OpenMP reduces all instances of “count” into the original count variable according to the mathematical operator given. In this case the total from all threads is needed, so the private copies are reduced by taking their sum. Then MPI_Reduce is used to gather and reduce “count” from each node into one variable on the master node.

mpiOMPpi.c

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
#include <omp.h>
#include <math.h>
#include <time.h>

int main(int argc, char* argv[])
{
	int niter = 1000000;					//number of iterations per FOR loop
	int myid;						//holds process's rank id
	double x,y;						//x,y value for the random coordinate
	int i;							//loop counter
        int count=0;				//count holds the number of points inside the circle
	double z;						//Used to check if x^2+y^2<=1
	double pi;						//holds approx value of pi
	int reducedcount;					//total number of "good" points from all nodes
	int reducedniter;					//total number of ALL points from all nodes
	int ranknum = 0;					//total number of nodes available
        int numthreads = 16;
	MPI_Init(&argc, &argv);					//Start MPI
	MPI_Comm_rank(MPI_COMM_WORLD, &myid);			//get rank of node's process
	MPI_Comm_size(MPI_COMM_WORLD, &ranknum);		//Gets the number of ranks available

	if(myid != 0)						//Do the following on all except the master node
	{
		//Start OpenMP code: 16 threads/node
		#pragma omp parallel firstprivate(x, y, z, i) reduction(+:count) num_threads(numthreads)
		{
			srandom((int)time(NULL) ^ omp_get_thread_num());	//Give random() a seed value
			for (i=0; i<niter; ++i)				//main loop
			{
				x = (double)random()/RAND_MAX;		//gets a random x coordinate
				y = (double)random()/RAND_MAX;		//gets a random y coordinate
				z = sqrt((x*x)+(y*y));			//Checks to see if number is inside unit circle
				if (z<=1)
				{
					++count;			//if it is, consider it a valid random point
				}
			}
		}
	}
	MPI_Barrier(MPI_COMM_WORLD);
	MPI_Reduce(&count,
                   &reducedcount,
                   1,
                   MPI_INT,
                   MPI_SUM,
                   0,
                   MPI_COMM_WORLD);
	reducedniter = numthreads*niter*(ranknum-1);

	if (myid == 0)						//if root process/master node
	{
	    	//p = 4(m/n)
		pi = ((double)reducedcount/(double)reducedniter)*4.0;
		//Print the calculated value of pi
		printf("Pi: %f\n%i\n%d\n", pi, reducedcount, reducedniter);
	}

	MPI_Finalize();						//Close the MPI instance
	return 0;
}

Compiling this code is similar to the past examples (don’t forget to use the correct OpenMP flag according to the programming environment used). To compile run:

cc -(OpenMP flag) mpiOMPpi.c -o a.out

The batch script can be edited now so that it can request the appropriate number of nodes (see above). After the batch script is edited, submit the job like before:

qsub titan_pi.pbs

If you have any questions about these examples or how to run them please email help@nccs.gov.

All code from this tutorial, including a Makefile and batch scripts, can be found in a git repository at https://github.com/olcf/Serial-to-Parallel–Monte-Carlo-Pi