Publishing Note:
Prior to publishing work performed on Summitdev, please contact help@olcf.ornl.gov.
Beta2 Update (October 17):



SummitDev was upgraded to the latest IBM software release (beta2) on October 17. The latest Spectrum MPI and LSF releases require all users to modify their batch scripts and job launch. Please note the following changes:

Batch node request syntax change
The new LSF -nnodes #nodes flag replaces the previous -n #slots flag. Note: -nnodes requests nodes while -n requests slots/cores.
All batch scripts should be updated to use #BSUB -nnodes #nodes
Resource Requirement Flag (-R)
The resource requirement flag (-R) is no longer allowed. It was previously used, for example, to set the number of processes per node (#BSUB -R "span[ptile=$processes]"). This should now be handled with jsrun:
For example, to run a job on 2 nodes with 10 MPI ranks per node, you could use
#BSUB -nnodes 2 (in your batch script)
jsrun -n2 -a10 -r1 ./a.out
This will request 2 resource sets (-n2), each with 10 processes (-a10), and 1 resource set per node (-r1).
Batch Script Redirection
Redirecting batch scripts into bsub will no longer be allowed. Workflows should be updated to remove redirection.
Previous submissions bsub < script.lsf will become simply bsub script.lsf.

jsrun Replaces mpirun
Jsrun will replace mpirun as the default job launcher.
More information on jsrun can be found in the running jobs section.

Batch launch nodes
As of October 17, all non-jsrun commands will be executed on the batch launch node.
When you start a batch job, your shell runs on the batch launch node. Previously shell commands ran on one of the job's allocated compute nodes.
The batch launch node concept is similar to the service node process used on Titan
Environment Module Changes
The following changes will be made to the Environment Module system (lmod). The changes should not require action and should be transparent.

  • The environment modules will be reorganized to allow MPI version dependencies.
  • The Spectrum MPI module will be renamed from spectrum_mpi to spectrum-mpi.
  • A number of older module versions will be removed.
Default Version Changes
Package Current Default Future Default
CUDA 8.0.54 9.0.69
PGI 17.3 17.9
XL 20161123 20170914
xlflang 20170131 20170627
Spectrum MPI 10.1.0.2-20161221 10.1.0.4-20170915
Beta2 Known Issues:

jsrun environment propagation
Jsrun does not currently propagate environment to compute nodes. Environment variables can be passed to the compute nodes through the jsrun -E flag. For example:

jsrun -n4 -a1 -g1 -E LD_LIBRARY_PATH ./a.out

UPDATE: jsrun has been wrapped to pass all variables set at submit time to jsrun through the -E flag. The new jsrun is added to the environment through the lsf-tools environment module. The new jsrun should prevent the need to specify the -E flag.
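For example, loading the wrapper and then launching without any -E flags (the executable name is illustrative):

module load lsf-tools
jsrun -n4 -a1 -g1 ./a.out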

jsrun OMP_NUM_THREADS
jsrun changes the value of OMP_NUM_THREADS to the value passed to the -c flag. A space is also appended to the end of the variable, which causes issues for some compilers. The workaround for this issue is to pass OMP_NUM_THREADS to jsrun using the -E flag. For example:

setenv OMP_NUM_THREADS 4
jsrun -n4 -a1 -g1 -c4 -E LD_LIBRARY_PATH -E OMP_NUM_THREADS ./a.out

UPDATE: jsrun has been wrapped to pass all variables set at submit time to jsrun through the -E flag. The new jsrun is added to the environment through the lsf-tools environment module. The new jsrun should prevent the need to specify the -E flag.

JOB_FEATURE
The previous method of setting GPU and SMT features, -env "JOB_FEATURE=gpumps", no longer works under Beta2. It has been replaced with the -alloc_flags flag. For example, replace

#BSUB -env "JOB_FEATURE=gpumps"

with

 #BSUB -alloc_flags gpumps
Changing SMT Level
In addition to the change from JOB_FEATURE to -alloc_flags, changing SMT modes is currently problematic. -alloc_flags smt1 is the correct method to set the SMT mode to 1, but it occasionally hangs. The issue is under investigation.

PMIx Errors
PMIx errors can occur as a result of a number of failures. We need to understand and reproduce the underlying issue so it can be reported. If you see PMIx errors, please contact help@olcf.ornl.gov. We would like to work with you to understand what was done to trigger the message.

Below is an example of a PMIx failure:

summitdev-login1 104> setenv TMPDIR /ccs/home
summitdev-login1 104> jsrun -n2 hostname
Connection request to PMIx server timed out.
Error: No such file or directory
Error initializing RM connection. Exiting.

The above has been reported and should be corrected in the next jsrun release. But, this is just one example. Please continue to help us find underlying reasons for PMIx failures so we can report the issues to the developers.

jsrun GPU-CPU Layout Issues
jsrun is currently not enforcing the gpu-cpu latency priority. Jobs requesting two GPUs per resource set can be assigned a task on one socket and a GPU on another socket. While this issue is being worked, the -c (cores per resource set) flag can be used to help force the layout within a node. For example:

The following places ranks 2 and 3 on the same socket:

summitdev-login1 681> jsrun -n4 -a1 -g2 -lgpu-cpu ./basic-layout-mpi.x | sort
Rank:    0  Core:   3 Host: summitdev-r0c1n12
Rank:    1  Core:  86 Host: summitdev-r0c1n12
Rank:    2  Core:  83 Host: summitdev-r0c1n13
Rank:    3  Core:  92 Host: summitdev-r0c1n13
summitdev-login1 682>

The following places one task per socket:

summitdev-login1 682> jsrun -n4 -a1 -g2 -lgpu-cpu -c10 ./basic-layout-mpi.x | sort
Rank:    0  Core:   3 Host: summitdev-r0c1n12
Rank:    1  Core:  83 Host: summitdev-r0c1n12
Rank:    2  Core:  83 Host: summitdev-r0c1n13
Rank:    3  Core:   2 Host: summitdev-r0c1n13
summitdev-login1 683>
GPU Direct Issues
To enable GPU Direct under Beta 2 when using jsrun, you need to first run:

source $OLCF_SPECTRUM_MPI_ROOT/jsm_pmix/bin/export_smpi_env -gpu
jsrun -n2 -a1 -g1 -c1 ./a.out

System Overview

The Summitdev system is an early access system that is one generation removed from OLCF's next big supercomputer, Summit. The system contains three racks, each with 18 IBM POWER8 S822LC nodes, for a total of 54 nodes. Each IBM S822LC node has 2 IBM POWER8 CPUs and 4 NVIDIA Tesla P100 GPUs. Each node has 256 GB of DDR4 memory (32 x 8 GB). Each POWER8 CPU has 10 cores, and each core supports 8 hardware threads. The GPUs are connected by NVLink 1.0 at 80 GB/s, and each GPU has 16 GB of HBM2 memory. The nodes are connected in a full fat-tree via EDR InfiniBand. The racks are liquid cooled via heat exchangers. Summitdev has access to Spider 2, the OLCF's center-wide Lustre parallel file system.






Access

Initial access to the summitdev system will be limited to the OLCF CAAR teams.

Note:
Summitdev is not accessible outside the OLCF network. To access Summitdev, you must first log into home.ccs.ornl.gov and then from home SSH into summitdev.

There is a single login node, summitdev.ccs.ornl.gov, that can be accessed via SSH from home.ccs.ornl.gov. You will be prompted for your SecurID PASSCODE to authenticate.






Hardware

Summitdev is built with IBM S822LC nodes. Nodes are configured as follows:

  • 2 10-core IBM POWER8 CPUs
    • Each core supports 8 hardware threads
    • 256 GB DDR4 memory
  • 4 NVIDIA Tesla P100 GPUs
    • Connected via NVLink 1.0 (80 GB/s)
    • 16 GB HBM2 memory
  • 2 Mellanox EDR InfiniBand adapters
  • 800 GB NVMe storage

Summitdev has three racks of compute nodes (18 nodes each, for a total of 54 compute nodes) and an additional rack of login/support nodes.  There is a single login node, summitdev.ccs.ornl.gov, that can be accessed via SSH from home.ccs.ornl.gov.  Nodes are connected in a full fat-tree via EDR InfiniBand.  Unlike Titan, login nodes and compute nodes on Summitdev have the same configuration.






Data Storage and Filesystems

Overview

The available file systems are summarized in the table below.

Filesystem Mount Points Backed Up? Purged? Quota? Comments
Home Directories (NFS) /ccs/home/$USER Yes No 50GB Shared across all OLCF resources
Project Space (NFS) /ccs/proj/$PROJECTID Yes No 50GB
Parallel Scratch (Lustre) /lustre/atlas/proj-shared/$PROJECTID, /lustre/atlas/scratch/$USER/$PROJECTID, /lustre/atlas/world-shared/$PROJECTID No Yes Shared across all OLCF resources
NVMe (XFS) /xfs/scratch No Yes 800GB Node-local high-speed scratch space. Only writeable/readable during a job. Requires setup. See section 3.5.

Home (NFS)

  • User $HOME area
  • Project Space /ccs/proj/{PROJID}
  • Center-provided software space /sw/summitdev
Note:
Errors writing to NFS asynchronously: Write operations to NFS shares are asynchronous and can cause unexpected race conditions when used by multi-process applications. Do not use your NFS home space as scratch space for parallel I/O jobs.

Lustre

Summitdev mounts the OLCF Lustre filesystem. See Lustre Basics for a basic description.

NVMe (XFS)

Each node has a Non-Volatile Memory Express (NVMe) storage device, colloquially known as a "Burst Buffer". Users have access to an 800 GB partition of each NVMe. Summit will also have node-local NVMes. The NVMes will be used to reduce the time that applications wait for I/O: with an SSD drive on each compute node, the burst buffer can be used to transfer data to or from the drive before the application reads a file or after it writes a file. The result is that the application benefits from native SSD performance for a portion of its I/O requests.

Users are not required to use the NVMes. Data can also be written directly to the parallel filesystem.

The NVMes on Summitdev are local to each node.

Current NVMe Usage

Tools for using the burst buffers are still under development. Currently, the user must create a writeable directory on each node's NVMe with a provided prolog script and then explicitly move data to and from the NVMes with POSIX commands during a job. This mode of usage only supports writing file-per-process or file-per-node; it does not support automatic "n to 1" file writing, i.e. writing from multiple nodes to a single file. After a job completes, the NVMes are trimmed, a process that irreversibly deletes data from the devices, so all desired data on the NVMes must be copied back to the parallel filesystem before the job ends. This largely manual mode of usage will not be the recommended way to use the burst buffer for most applications, because tools are actively being developed to automate and improve the NVMe transfer and data management process.

Here are the basic steps for using the BurstBuffers in their current limited mode of usage:

  1. Modify your application to write to /xfs/scratch/$USER, a directory that will be created on each NVMe.
  2. Modify either your application or your job submission script to copy the desired data from /xfs/scratch/$USER back to the parallel filesystem before the job ends.
  3. Modify your job submission script to include the -env "all,JOB_FEATURE=NVME"  bsub option. This will run a provided prolog script that creates a directory on each node's burst buffer called /xfs/scratch/$USER.
  4. Submit your bash script or run the application.
  5. Assemble the resulting data as needed.

Interactive Jobs Using the NVMe

The NVMe can be setup for test usage within an interactive job as follows:

 -bash-4.2$ bsub -W 30 -n 1 -env "all,JOB_FEATURE=NVME" -P project124 -Is bash
Beta2 Change (October 17):
-n #slots will be replaced with -nnodes #nodes in the Oct 17 update.

The  -env "all,JOB_FEATURE=NVME" option will create a directory called /xfs/scratch/$USER on each requested node's NVMe. The /xfs/scratch/$USER directories will be writeable and readable until the interactive job ends. Outside of a job, /xfs/scratch will be empty and you will not be able to write to it.

NVMe Usage Example

The following example illustrates how to use the burst buffers (NVMes) in the only mode currently available on Summitdev.  This example uses a hello_world bash script, called test_nvme.sh, and its submission script, check_nvme.lsf. It is assumed that the files are sitting in the user's Lustre scratch area, /lustre/atlas/scratch/$USER/, and that the user is operating from there as well.

Hello world bash script: test_nvme.sh

Below is a bash script that writes a hello message from each node to a file sitting on that node's NVMe and then transfers those files back to the Lustre parallel filesystem.

Note that $LSB_JOBID, used when setting OUTPUT_DIR, is an environment variable that LSF automatically sets to the job ID number.

#!/bin/bash
 
#test_nvme.sh
#create a tempfile on the local NVMe on each node.
TMPFILE="/xfs/scratch/$USER/$(hostname)"
touch "$TMPFILE"
 
#Write data to TMPFILE
echo "Hello from beautiful $HOSTNAME on $(date)" >> "$TMPFILE"
 
#List the contents of TMPFILE to standard output.
cat "$TMPFILE"
 
 
# Parallel stage-out
#List the location of the TMPFILES
ls -l "/xfs/scratch/$USER"
 
#Create an output directory named after the job ID on the parallel filesystem.
OUTPUT_DIR="/lustre/atlas/scratch/$USER//$LSB_JOBID"
mkdir -p "$OUTPUT_DIR"
  
#Copy data from the TMPFILEs to the filesystem
cp "$TMPFILE" "$OUTPUT_DIR/."

If you are copying/pasting this to try it out, remember to make test_nvme.sh executable with chmod +x test_nvme.sh.

Job submission script: check_nvme.lsf

This will submit a job to run on two nodes. Note that the #BSUB -env "all,JOB_FEATURE=NVME" line is the bsub option that creates a directory on each node's local NVMe called /xfs/scratch/$USER.

#!/bin/bash
#BSUB -P project123
#BSUB -J name_test
#BSUB -o nvme_test.o%J
#BSUB -W 5
#BSUB -nnodes 2
#BSUB -env "all,JOB_FEATURE=NVME"
 
cd /lustre/atlas/scratch/$USER/
date
 
jsrun -n 2 -r 1 -a 1 ./test_nvme.sh

To run this, issue: bsub ./check_nvme.lsf

Expected output

When the job finishes, it should create a directory named after the job ID under your Lustre scratch area (OUTPUT_DIR in the script). That directory will contain two files named after the nodes that ran the job.

For example, if the job ID was 11377 and it ran on nodes summitdev-c0r0n02 and summitdev-c1r0n16, the user should get:

-bash-4.2$ ls output
11377
-bash-4.2$ ls 11377
summitdev-c0r0n02  summitdev-c1r0n16

Looking at the contents of files summitdev-c0r0n02 and summitdev-c1r0n16 with cat, each file displays the greeting from that node.

-bash-4.2$ cat summitdev-c0r0n02
Hello from beautiful summitdev-c0r0n02 on Tue Nov 22 16:47:49 EST 2016
-bash-4.2$ cat summitdev-c1r0n16
Hello from beautiful summitdev-c1r0n16 on Tue Nov 22 16:47:49 EST 2016

Remember that these two files were created to show that, in the current mode of NVMe usage, data staged on the NVMes is held in multiple files, one per node. If users desire a single file as output from data staged on the NVMe, they will need to construct it themselves. Methods that allow automatic n-to-1 file writing with NVMe staging are under development.

Project and User Archive Directories (HPSS)

Summitdev has access to the User-centric and Project-centric archival spaces on the High Performance Storage System (HPSS).  Archive directories may be accessed only via the specialized tools HSI and HTAR. For more information on HPSS and using HSI or HTAR, see the HPSS page.

HPSS HSI path Backed Up? Purged? Quota? Comments
User Centric Archive /home/$USER No No 10TB Accessible from OLCF systems
Project Centric Archive /proj/[projid] No No 100TB Accessible from OLCF systems






Software and Development Environment

Overview

Structure

The software environment is managed through environment modules. Software provided by the OLCF is installed under the path /sw/summitdev which is available on all nodes. The environment modules are separated into several zones based on compiler and ABI compatibility. When viewed through the environment module system, these zones consist of core modules ( /sw/summitdev/modulefiles/core ) and multiple exclusive compiler families ( /sw/summitdev/modulefiles/compiler/COMPILER/VERSION ). Core modules are always available. Software and libraries intrinsically bound by ABI restrictions to specific compilers are only available after loading a specific compiler/version module.

Prefix Paths

Many OLCF-provided software packages are installed using the Spack package manager to a hidden directory. The file system tree for these packages is not intended to be navigated directly by users. For convenience, when an OLCF-provided environment module is loaded, it sets an environment variable pointing to that module's software installation prefix that can be used when constructing build scripts. These variables have the format OLCF_{MODULENAME}_ROOT, where {MODULENAME} is the base module name.
Prefer better tools for complex builds
Most OLCF-provided modules update the PKG_CONFIG_PATH and CMAKE_PREFIX_PATH allowing complex builds against OLCF-provided libraries to use pkg-config and CMake. These and similar tools should be preferred whenever possible over complex, manually-constructed site-specific compiler invocations in your build scripts to improve portability.
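For example, a minimal sketch of a build line using such a prefix variable (the essl module is used for illustration; the exact include, library, and subdirectory names depend on the package, so check the module's documentation):

module load essl
mpicc -o myapp myapp.c -I${OLCF_ESSL_ROOT}/include -L${OLCF_ESSL_ROOT}/lib64 -lessl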

Login vs Compute Nodes

Unlike other OLCF resources, there is no significant difference in the software environment available on the compute nodes as compared to the login nodes. On either node type, it is the responsibility of the user to ensure that the loaded environment modules are correct for a particular task.

Batch System Environment Inheritance

By default, the environment of a batch job inherits from the environment present at the time the job is launched via bsub as though it had been called 'bsub -env "all" ... '. Recursively self-submitted job scripts that extend environment variables must take care to avoid unintentional duplicated additions. It is possible to start a job that does not inherit the present environment using 'bsub -env "none" ...' in which case the entire run-time environment must be set in the batch script by explicitly loading all required environment modules.
Recursive job environment inheritance
When a batch job recursively submits itself, any items manually appended to environment variables in the job will be compounded for each iteration. It is not advisable for recursive batch jobs to make alterations such as export PATH=$PATH:/my/added/dir
A workaround can be achieved by including a block to make the modification only once, for example with
my_job.lsf

# Do environment initialization once in a recursive job.
if [ -z "${INITIALIZED:-}" ]; then
    export PATH=$PATH:/my/added/dir
    export INITIALIZED="True"
fi
  
# Do work. Get interrogative for controlling recursion.
RECURSE_Q=$(...)
  
# Define condition when job should recurse.
RECURSE_CONDITION="1"
  
# Relaunch self when appropriate.
if [ "${RECURSE_Q}" -eq "${RECURSE_CONDITION}" ]; then
     bsub my_job.lsf
fi

Environment Modules

Environment modules are provided through the LMOD recursive environment module system (see the LMOD documentation). Environment modules permit you to dynamically alter the software available in your shell environment among packages and versions that could not ordinarily coexist in an environment. LMOD does this by managing and modifying your current shell's environment variables, notably the values of PATH, LD_LIBRARY_PATH, PKG_CONFIG_PATH, etc., according to instructions in modular, pre-written modulefiles. Modulefiles are known to LMOD either through an internal cache or by walking the paths listed in the MODULEPATH environment variable when the cache is unavailable. In this way, various software installed under separate isolated prefix paths can be conveniently made available to each individual shell session. LMOD can interpret both rich Lua modulefiles as well as legacy Tcl modulefiles. LMOD aims to be compatible at the interface level with Tcl Modules. However, LMOD is a recursive environment module system, which means that modules will be automatically removed or reloaded when conflicts arise. For example, modules provided under a given compiler family, which should not typically be used in conjunction with other compilers, will be implicitly altered when changing the loaded compiler family. LMOD will issue a message to STDERR when automatic implicit changes are made.

General Usage

The interface to LMOD is provided by the module command

$ module [options] SUBCOMMAND [args...]

The most common operations include checking information about the available and loaded environment modules, adding or removing software modules from your environment, and resetting the environment to a known, consistent state. These basic operations are summarized in the table below.

Command Description
module -t list Shows a terse list of the currently loaded modules.
module avail Shows a table of the currently available modules
module help MODULENAME Shows help information about MODULENAME
module show MODULENAME Shows the environment changes made by the MODULENAME modulefile
module spider STRING Searches all possible modules according to STRING
module load MODULENAME [...] Loads the given MODULENAME(s) into the current environment
module use PATH Adds PATH to the modulefile search cache and MODULEPATH
module unuse PATH Removes PATH from the modulefile search cache and MODULEPATH
module purge Unloads all modules
module reset Resets loaded modules to system defaults
module update Reloads all currently loaded modules

Modules are changed recursively
Some commands, such as module swap, are available to maintain compatibility with scripts using Tcl Environment Modules, but are not strictly necessary given that LMOD recursively processes loaded modules and automatically resolves conflicts.
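A sketch of a typical session (the module names shown are illustrative):

$ module -t list
$ module load gcc cuda
$ module show cuda
$ module reset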

Searching for Modules

Modules with specific dependencies are only visible as available (module avail) when the underlying dependencies, such as compiler families, are loaded. To find modules across all possible dependencies, it is necessary to search the modulefile dependency graph using the spider sub-command. The results of a search depend on the arguments passed to the spider sub-command as seen in the following table.

Command Description
module spider Shows the entire possible graph of modules
module spider MODULENAME Searches for modules named MODULENAME in the graph of possible modules
module spider MODULENAME/VERSION Searches for a specific version of MODULENAME in the graph of possible modules
module spider STRING Searches for modulefiles containing STRING

A version number is required for deep searches
The dependency graph for a module is only visible when spider is given a specific MODULENAME/VERSION string.

User-defined Module Collections

User-defined collections override the system cache
The presence of user-defined defaults and collections overrides the system's cache of available modules. User collections must be removed and re-generated before newly installed system modules will be usable.

LMOD supports caching commonly used collections of environment modules on a per-user basis in $HOME/.lmod.d. To create a collection called NAME from the currently loaded modules, simply call
module save NAME

Omitting NAME will set the user's default collection. Saved collections can be recalled and examined with the commands summarized in the following table.

Command Description
module restore NAME Recalls a specific saved user collection
module restore Recalls the user-defined defaults
module reset, module restore system Recalls the system defaults
module savelist Shows the list of user-defined saved collections

Use unique collection names
Because the HOME filesystem is shared among multiple systems using LMOD, your collections should be given system- and application- specific names to avoid conflicts. For example, use names such as `summit-myapp-development`, `crest-myapp-development`, `summit-myapp-debugging`, `summit-myapp-production`.
User-defined collection 'default' is shared on all systems
The user-defined default collection will be applied to the environment on all systems using LMOD. Be careful to include only modules that are available on all resources you plan to use.
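For example, to save the currently loaded modules as a collection and recall them later (the collection name follows the naming convention above):

$ module save summitdev-myapp-development
$ module savelist
$ module restore summitdev-myapp-development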

X11 Issues

Applications using an X11 display server require forwarding tunnels to be opened through SSH and the LSF batch submission system.

The following quick example will open xeyes from SummitDev on a local system, going through home.ccs.ornl.gov:

From local system:
localsys> ssh -L 9001:summitdev-login1:22 userid@home.ccs.ornl.gov

Connect to summitdev from local system:
localsys> ssh -X -p 9001 userid@localhost

Display xeyes on local system from summitdev :
summitdev-login1 > xeyes

You can also simply ssh to home and then to summitdev, passing the -X flag each time:

From local system:
localsys> ssh -X userid@home.ccs.ornl.gov

Connect to summitdev from home:
home> ssh -X userid@summitdev.ccs.ornl.gov

Display xeyes on local system from summitdev:
summitdev> xeyes







Compilers

Available Compilers

  • XL: the IBM XL Compilers, loaded by default
  • LLVM: the LLVM compiler infrastructure.
  • PGI: the Portland Group compiler suite.
  • GNU: the GNU Compiler Collection.
  • NVCC: The CUDA C compiler.

C compilation

Note: type char is unsigned by default
Vendor Module compiler Version Enable C99 Enable C11 Default signed char Define macro
IBM xl xlc xlc_r 13.1.5 -std=gnu99 -std=gnu11 -qchar=signed -WF,-D
GNU system default gcc 4.8.x -std=gnu99 -std=gnu11 -fsigned-char -D
GNU gcc gcc 6.x -std=gnu99 -std=gnu11 -fsigned-char -D
LLVM clang clang 4.0.0 default -std=gnu11 -fsigned-char -D
PGI pgi pgcc 16.10 -c99 -c11 -Mschar -D

C++ compilation

Note: type char is unsigned by default
Vendor Module Compiler Version Enable C++11 Enable C++14 Default signed char Define macro
IBM xl xlc++, xlc++_r 13.1.5 -std=gnu++11 -std=gnu++1y (PARTIAL) -qchar=signed -WF,-D
GNU system default g++ 4.8.x -std=gnu++11 -std=gnu++1y -fsigned-char -D
GNU gcc g++ 6.x -std=gnu++11 -std=gnu++1y -fsigned-char -D
LLVM clang clang++ 4.0.0 -std=gnu++11 -std=gnu++1y -fsigned-char -D
PGI pgi pgc++ 16.10 -std=c++11 --gnu_extensions NOT SUPPORTED -Mschar -D

Fortran compilation

Vendor Module Compiler Version Enable F90 Enable F2003 Enable F2008 Define macro
IBM xl xlf, xlf90, xlf95, xlf2003, xlf2008 15.1.5 -qlanglvl=90std -qlanglvl=2003std -qlanglvl=2008std -WF,-D
GNU system default gfortran 4.8.x, 6.x -std=f90 -std=f2003 -std=f2008 -D
xlflang xlflang* xlflang 4.0.0 n/a n/a n/a -D
PGI pgi pgfortran 16.10 use .F90 source file suffix use .F03 source file suffix use .F08 source file suffix -D
Note: * The xlflang module currently conflicts with the clang module. This restriction is expected to be lifted in future releases.

OpenMP

Note: OpenMP offloading support is still under active development. Performance and debugging capabilities in particular are expected to improve as the implementations mature.
Vendor 3.1 Support Enable OpenMP 4.x Support Enable OpenMP 4.x Offload
IBM FULL -qsmp=omp PARTIAL -qsmp=omp -qoffload
GNU FULL -fopenmp PARTIAL -fopenmp
clang FULL -fopenmp PARTIAL -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda --cuda-path=${OLCF_CUDA_ROOT}
xlflang FULL -fopenmp PARTIAL -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda
PGI FULL -mp NONE NONE

OpenACC

Vendor Module OpenACC Support Enable OpenACC
IBM xl NONE NONE
GNU system default NONE NONE
GNU gcc 2.5 -fopenacc
LLVM clang or xlflang NONE NONE
PGI pgi 2.5 -acc, -ta=tesla:cc60

CUDA compilation

NVIDIA

CUDA C/C++ support is provided through the cuda module

nvcc : Primary CUDA C/C++ compiler

Language support

-std=c++11 : provide C++11 support

--expt-extended-lambda : provide experimental host/device lambda support

--expt-relaxed-constexpr : provide experimental host/device constexpr support

Compiler support

NVCC currently supports XL, GCC/4.8.5, and PGI C++ backends.

--ccbin : set to host compiler location
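A minimal compile line might look like the following sketch (the source file name and the choice of host compiler are illustrative):

module load cuda
nvcc -std=c++11 --expt-extended-lambda -ccbin g++ -o saxpy saxpy.cu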

CUDA Fortran compilation

IBM

The IBM compiler suite is made available through the xl module, which is loaded by default; the cuda module is also required.

xlcuf : primary CUDA Fortran compiler, thread safe

Language support flags

 -qlanglvl=90std : provide Fortran90 support

-qlanglvl=95std : provide Fortran95 support

-qlanglvl=2003std : provide Fortran2003 support

-qlanglvl=2008std : provide Fortran2008 support

PGI

The PGI compiler suite is available through the pgi module.

pgfortran : Primary fortran compiler with CUDA Fortran support

Language support:

Files with the .cuf suffix are automatically compiled with CUDA Fortran support
 
For source files with standard Fortran suffixes, the suffix determines the language standard applied; see the man page for full details

-Mcuda : Enable CUDA Fortran on provided source file

MPI

Overview

IBM Spectrum MPI is based on Open MPI. IBM's online documentation for Spectrum MPI (version 10.1.0) is available at http://www.ibm.com/support/knowledgecenter/SSZTET_10.1.0/smpi_welcome/smpi_welcome.html
To compile MPI codes, you need to use an appropriate compiler wrapper:

Language Compiler Name
C mpicc
C++ mpiCC, mpic++, mpicxx
Fortran mpif77, mpif90
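For example, a minimal sketch of compiling and launching an MPI code inside a job (the source and executable names are illustrative):

mpicc -o hello_mpi hello_mpi.c
mpif90 -o hello_mpi_f hello_mpi.f90
jsrun -n2 -a1 ./hello_mpi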

Compiler Flags

The --showme flag, which accepts several subcommands, will give more information regarding how the compiler is invoked. From IBM's online documentation, available options (and their impact) are:

Flag Description
--showme Displays all of the command line options that will be used to compile the program.
--showme:command Displays the underlying compiler command
--showme:compile Displays the compiler flags that will be passed to the underlying compiler.
--showme:help Displays a usage message.
--showme:incdirs Displays a list of directories that the wrapper script will pass to the underlying compiler. These directories indicate the location of relevant header files. It is a space-delimited list.
--showme:libdirs Displays a list of directories that the wrapper script will pass to the underlying linker. These directories indicate the location of relevant libraries. It is a space-delimited list.
--showme:libs Displays a list of library names that the wrapper script will use to link an application. For example, this might show:
mpi open-rte open-pal util
It is a space-delimited list.
--showme:link Displays the linker flags that will be passed to the underlying compiler.
--showme:version Displays the current Open MPI version number.

Environment Variables

As compared to previous IBM MPI implementations, it appears that Spectrum MPI tends to use command line options as opposed to environment variables. IBM provides a comparison of Spectrum MPI to IBM PE Runtime Edition in its Spectrum MPI documentation.

Examples

For examples of launching MPI jobs, see the Running section.







Threads

OpenMP and Pthreads Overview

Both Pthreads and OpenMP are supported on Summitdev. As noted in section 8.2 below, some compilers may need to be invoked with special arguments or in a particular way to enable threads/ensure thread safety.

Compiler Flags

Some compilers, such as the IBM compilers, are invoked differently for threaded codes vs. non-threaded codes. For example, xlc is the C compiler and xlc_r is the thread-safe C compiler. Other compiler suites use command-line flags to enable/disable threading such as the -mp option for the PGI compilers. The compilers page includes threading-specific information for the various compilers available on Summitdev.
IBM also provides threaded versions of the Engineering and Scientific Subroutine Library (ESSL), both in non-CUDA and CUDA versions:

  • Multi-threaded subroutines provided by ESSL
  • ESSL SMP CUDA Library
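As a hedged sketch, compiling a threaded code against the SMP version of ESSL might look like the following (the library and subdirectory names are assumptions; check the essl module on Summitdev):

module load essl
xlc_r -qsmp=omp -o myapp myapp.c -L${OLCF_ESSL_ROOT}/lib64 -lesslsmp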

Environment Variables

Variable Description
OMP_NUM_THREADS Number of OpenMP threads to launch per MPI process
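For example, a sketch of a hybrid MPI/OpenMP launch (note the Beta2 known issue described earlier: jsrun currently resets OMP_NUM_THREADS to the -c value, so the two are kept consistent here):

export OMP_NUM_THREADS=4
jsrun -n2 -a1 -c4 -g1 ./a.out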

Simultaneous Multi-Threading (SMT)

Summitdev supports Simultaneous Multi-Threading (SMT) up to 8 hardware threads per core. The SMT mode for a particular job can be set using a job feature. For example, to set the SMT mode to 4 hardware threads per core in a single node (total of 80 hardware threads per node):

bsub -alloc_flags smt4 -W 30 -nnodes 1 -P proj123 -Is $SHELL
<<Waiting for dispatch ...>>
<<Starting on summitdev-r0c2n01>>
summitdev-r0c2n01:~> /usr/sbin/ppc64_cpu --smt
SMT=4

After the job completes, the LSF epilogue will return the SMT mode to the default of 8.







Running

This section focuses on launching the parallel job itself. For information on batch jobs, see the 13. Batch System section.

Beta2 Change (October 17):
Please note, jsrun replaced mpirun as the default job launcher on Oct. 17. The default Spectrum MPI module contains jsrun, the csmbatch queue is no longer required, -n was replaced by -nnodes, and redirecting the batch script into bsub is no longer allowed.

Overview

The default job launcher for SummitDev is jsrun. jsrun was developed by IBM for the Oak Ridge and Livermore Power systems. The tool will execute a given program on resources allocated through the LSF batch scheduler; this is similar to mpirun and aprun functionality.

Resource Sets

While jsrun performs similar job launching functions as aprun and mpirun, its syntax is very different. A large reason for syntax differences is the introduction of the resource set concept. Through resource sets, jsrun can control how a node appears to each job. Users can, through jsrun command line flags, control which resources on a node are visible to a job. Resource sets also allow the ability to run multiple jsruns simultaneously within a node.

Jsrun will create one or more resource sets within a node. Each resource set will contain 1 or more cores and 0 or more GPUs.

At a high level, you need to decide how your code would like the node to appear. How many tasks/threads per GPU? This understanding will allow you to create a resource set. Then you simply need to decide how many resource sets are needed.

The basic form of jsrun:

 jsrun [ -n #resource sets ] [tasks, threads, and GPUs within each resource set]  program [ program args ]

The basic steps to creating resource sets:

1) Understand how your code expects to interact with the system.
How many tasks/threads per GPU?
Does each task expect to see a single GPU? Do multiple tasks expect to share a GPU? Is the code written to internally manage task to GPU workload based on the number of available cores and GPUs?
2) Create resource sets containing the needed GPU to task binding
Based on how your code expects to interact with the system, you can create resource sets containing the needed GPU and core resources.
If a code expects to utilize one GPU per task, a resource set would contain one core and one GPU. If a code expects to pass work to a single GPU from two tasks, a resource set would contain two cores and one GPU.
3) Decide on the number of resource sets needed
Once you understand tasks, threads, and GPUs in a resource set, you simply need to decide the number of resource sets needed.

As on Titan it is useful to keep the general layout of a node in mind when laying out resource sets.

Login vs. Compute

  • Login and Compute nodes have the same CPU/accelerator configuration. This makes building codes easier.

Batch launch nodes

  • As of October 17, all non-jsrun commands are executed on the batch launch node
  • When you start a batch job, your shell runs on the batch launch node, not a compute node.
  • The batch launch node concept is similar to the service node process used on Titan

Parallel codes shouldn't be launched/run on the login node

  • The jsrun command won't stop you (other than limiting you to the available number of cores) but this is a shared node, so please avoid running on the login node.

Launching a Job with jsrun

Common jsrun Options

Below are common jsrun options. More flags and details can be found in the jsrun man page.

Flags Description
-n --nrs Number of resource sets
-a --tasks_per_rs Number of tasks per resource set (available Oct 17)
-c --cpu_per_rs Number of CPUs per resource set. Threads per rs.
-g --gpu_per_rs Number of GPUs per resource set
-r --rs_per_host Number of resource sets per host
-l --latency_priority Latency Priority. Controls layout priorities. Can currently be cpu-cpu or gpu-cpu

Examples

The following 2-node layout will be used to describe jsrun resource sets and placement. Across the 2 nodes there are:

  • 2 nodes
  • 4 sockets (2 per node)
  • 40 cores (10 per socket)
  • 8 GPUs (2 per socket)
  • 4 memory blocks (1 per socket)

The examples below were launched in the following 2-node interactive batch job:

summitdev> bsub -nnodes 2 -Pprj123 -W02:00 -Is $SHELL

16 tasks, 8 GPUs total, 2 tasks per GPU. In this case, you will need to create 8 resource sets. Each resource set will contain 2 tasks and 1 GPU.

summitdev> jsrun -a2 -g1 -n8 ./basic-layout-mpi.x 

8 tasks, 1 GPU per task. A resource set is created for each task/GPU pair.

summitdev> jsrun -n8 -a1 -g1 ./basic-layout-mpi.x | sort
Rank:    0  Core:   3 Host: summitdev-r0c0n15
Rank:    1  Core:  10 Host: summitdev-r0c0n15
Rank:    2  Core:  83 Host: summitdev-r0c0n15
Rank:    3  Core:  91 Host: summitdev-r0c0n15

Rank:    4  Core:   3 Host: summitdev-r0c2n03
Rank:    5  Core:  10 Host: summitdev-r0c2n03
Rank:    6  Core:  84 Host: summitdev-r0c2n03
Rank:    7  Core:  92 Host: summitdev-r0c2n03
summitdev>

4 tasks, each with access to 2 GPUs. A resource set is created for each pair of tasks sharing 2 GPUs.

summitdev> jsrun -n2 -a2 -g2 ./basic-layout-mpi.x | sort
Rank:    0  Core:   4 Host: summitdev-r0c0n15
Rank:    1  Core:  82 Host: summitdev-r0c0n15
Rank:    2  Core:   3 Host: summitdev-r0c2n03
Rank:    3  Core:  83 Host: summitdev-r0c2n03
summitdev>

2 tasks, each with 1 GPU and 4 threads. A resource set is created for each task, containing 1 GPU and 4 cores for the threads.

setenv OMP_NUM_THREADS 4
summitdev> jsrun -n2 -a1 -g1 -c4 ./basic-layout-mpi.x | sort
Rank:    0  Thread: 0   Core:   4 Host: summitdev-r0c0n15
Rank:    0  Thread: 1   Core:   5 Host: summitdev-r0c0n15
Rank:    0  Thread: 2   Core:   6 Host: summitdev-r0c0n15
Rank:    0  Thread: 3   Core:   7 Host: summitdev-r0c0n15
Rank:    1  Thread: 0   Core:  82 Host: summitdev-r0c0n15
Rank:    1  Thread: 1   Core:  83 Host: summitdev-r0c0n15
Rank:    1  Thread: 2   Core:  84 Host: summitdev-r0c0n15
Rank:    1  Thread: 3   Core:  85 Host: summitdev-r0c0n15
summitdev>
Notice:
jsrun in the October 17 Beta2 release overwrites the OMP_NUM_THREADS variable with the value passed to -c. This will be corrected in the next software drop.
Notice:
jsrun in the October 17 Beta2 release only supports thread placement at the physical core level; it does not support placement at the hardware-thread level.

GPU Specifics

Summitdev supports the CUDA Multi-Process Service (MPS), which allows multiple processes to share the same GPU context, for example when multiple MPI processes on a single compute node access the same GPU.

Note:
Please note that CUDA MPS supports only 16 MPI ranks per node. Using more ranks will result in the job hanging.

Several compute modes are supported on the GPUs. By default all GPUs on Summitdev are set to EXCLUSIVE_PROCESS. Other available GPU compute modes are listed in the table below.

Value Name Meaning
0 DEFAULT Multiple host threads can access the GPU simultaneously
1 EXCLUSIVE THREAD Only one thread can access the GPU at a time
2 PROHIBITED No threads can access the GPU
3 EXCLUSIVE PROCESS Only one context is allowed per device, usable from multiple threads at a time

By default, all GPUs are available to an application. To set a specific set or ordering of GPU devices on a compute node, you can use the CUDA_VISIBLE_DEVICES environment variable.
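For example, either of the following (device numbering follows the CUDA runtime's ordering):

export CUDA_VISIBLE_DEVICES=0,1      # expose only devices 0 and 1
export CUDA_VISIBLE_DEVICES=3,2,1,0  # expose all four devices in reverse order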

To learn about how to request GPUs set to DEFAULT compute mode or how to enable CUDA MPS, please see the GPU Specific Jobs section.







Batch Scripts

Batch scripts, or job submission scripts, are the mechanism by which a user submits and configures a job for eventual execution. A batch script is simply a shell script which contains:

  • Commands that can be interpreted by batch scheduling software (e.g. LSF)
  • Commands that can be interpreted by a shell

The batch script is submitted to the batch scheduler where it is parsed. Based on the parsed data, the batch scheduler places the script in the scheduler queue as a batch job. Once the batch job makes its way through the queue, the resources requested in the batch job will be allocated and the script will be executed on the allocated computational resources.

Sections of a Batch Script

Batch scripts are parsed into the following three sections:

The Interpreter Line

The first line of a script can be used to specify the script’s interpreter. This line is optional. If not used, the submitter's default shell will be used. The line uses the "hash-bang-shell" syntax: #!/path/to/shell

The Scheduler Options Section

The batch scheduler options are preceded by #BSUB, making them appear as comments to a shell. LSF will look for #BSUB options in a batch script from the script’s first line through the first non-comment line. A comment line begins with #. #BSUB options entered after the first non-comment line will not be read by LSF.

Note: All batch scheduler options must appear at the beginning of the batch script.

The Executable Commands Section

The shell commands follow the last #BSUB option and represent the main content of the batch job. If any #BSUB lines follow executable statements, they will be ignored as comments.

The execution section of a script will be interpreted by a shell and can contain multiple lines of executable invocations, shell commands, and comments. When the job's queue wait time is finished, commands within this section will be executed. Under normal circumstances, the batch job will exit the queue after the last line of the script is executed.

An Example Batch Script

1 #!/bin/bash
2 # Begin LSF directives
3 #BSUB -P proj123
4 #BSUB -J test
5 #BSUB -o tst.o%J
6 #BSUB -W 1:00
7 #BSUB -nnodes 1
8 # End LSF directives and begin shell commands
9 cd $MEMBERWORK/proj123
10 date
11 jsrun -n 2 -a 1 -g 1 ./a.out

The lines of this batch script do the following:

Line Option Description
1 Optional Specifies that the script should be interpreted by the bash shell.
2 Optional Comments do nothing.
3 Required The job will be charged to the "proj123" project.
4 Optional The job will be named "test".
5 Optional The job's standard output and error will be placed in tst.o<jobID> (%J is replaced with the job ID).
6 Required The job will request a 1-hour walltime.
7 Required The job will request 1 node.
8 Optional Comments do nothing.
9 -- This shell command will change to the user's $MEMBERWORK/proj123 directory.
10 -- This shell command will run the date command.
11 -- This invocation of jsrun will run 2 MPI instances of the executable a.out on the compute nodes allocated by the batch system. The request will create 2 resource sets each containing 1 MPI task and 1 GPU.

Batch Scheduler Node Requests

A node’s resources will not be allocated to multiple batch jobs. Because the OLCF charges based upon the computational resources a job makes unavailable to others, a job is charged for an entire node even if the job uses only one processor core. To simplify the process, users are required to request an entire node through LSF.

Submitting Batch Scripts

Once written, a batch script is submitted to the batch scheduler via the bsub command.

summitdev$ cd /path/to/batch/script
summitdev$ bsub ./script.lsf

If successfully submitted, a LSF job ID will be returned. This ID is needed to monitor the job's status with various job monitoring utilities. It is also necessary information when troubleshooting a failed job, or when asking the OLCF User Assistance Center for help.

Note: Bsub Redirection
As of Oct 17, redirecting the batch script into bsub is no longer required.

Options to the bsub command allow the specification of attributes which affect the behavior of the job. In general, options to bsub on the command line can also be placed in the batch scheduler options section of the batch script via #BSUB.

Interactive Batch Jobs

Batch scripts are useful for submitting a group of commands, allowing them to run through the queue, then viewing the results at a later time. However, it is sometimes necessary to run tasks within a job interactively. Users are not permitted to access compute nodes nor run mpirun directly from login nodes. Instead, users must use an interactive batch job to allocate and gain access to compute resources interactively. This is done by using the -Is option to bsub.

Interactive Batch Example

For interactive batch jobs, LSF options are passed through bsub on the command line.

summitdev$ bsub -P proj123 -XF -nnodes 1 -W 30 -Is $SHELL

This request will:

-P Charge to the "proj123" project
-nnodes Request 1 node
-XF Enables X11 forwarding. The DISPLAY environment variable must be set.
-W Request 30 minutes walltime.
-Is Request interactive batch access

After running this command, you will have to wait until enough compute nodes are available, just as in any other batch job. However, once the job starts, you will be given an interactive prompt on the head node of your allocated resource. From here commands may be executed directly instead of through a batch script.

Common Batch Job Flags

Flag Use Required Description
-W #BSUB -W <walltime> Required Maximum wall-clock time.
-P #BSUB -P <project> Required Causes the job time to be charged to <project>. The account string, e.g. pjt000, is typically composed of three letters followed by three digits and optionally followed by a subproject identifier. This option is required by all jobs.
-nnodes #BSUB -nnodes <nodes> Required Maximum number of compute nodes. Jobs will be allocated entire nodes. #BSUB -nnodes 2 will allocate 2 nodes.
-o #BSUB -o <name>.o%J Writes standard output to <name>.o<jobID>
-e #BSUB -e <name>.e%J Writes standard error to <name>.e<jobID>
-J #BSUB -J <name> Sets the job name to <name> instead of the name of the job script.
-w #BSUB -w "ended(<jobID>)" Places a dependency on the submitted job. The job will not be eligible for execution until job <jobID> exits the queue in either an EXIT or DONE state.
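For example, to submit a job that will not start until a previously submitted job has left the queue (the job ID and script name are illustrative):

bsub -w "ended(12345)" ./my_script.lsf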

GPU Specific Jobs

The batch scheduler supports allocation flags (-alloc_flags, which replaced the JOB_FEATURE environment variable in Beta2) to control both the GPU compute mode and to enable CUDA MPS. To request GPUs to be set to DEFAULT mode instead of EXCLUSIVE_PROCESS, you must request the gpudefault job feature. Similarly, to start a CUDA MPS enabled job, you must request the gpumps job feature.

To request an interactive job with GPUs set to DEFAULT, you would use:

bsub -alloc_flags gpudefault -W 30 -nnodes 1 -P proj123 -Is $SHELL -l

To request interactive job with CUDA MPS enabled, you would use:

bsub -alloc_flags gpumps -W 30 -nnodes 1 -P proj123 -Is $SHELL -l

It is also possible to request a job that has both features enabled at the same time. To do so, submit a job using:

bsub -alloc_flags "gpudefault,gpumps" -W 30 -nnodes 1 -P proj123 -Is $SHELL -l

Please note that passing an environment variable using -env will, by default, remove all other environment settings. To ensure environment modules and other default compilers are still available when your job starts, you must start a login shell (e.g. -l in bash). Alternatively, if you want to carry over the entire environment from the login node, you can add the all keyword to your job feature request.

bsub -alloc_flags gpumps -W 30 -nnodes 1 -P proj123 -Is $SHELL

Monitoring Batch Jobs

bjobs
The LSF bjobs utility can be used to view information about current and recent batch jobs.

Command Output
bjobs Display only calling user's currently queued batch jobs.
bjobs -u all Display all users currently queued jobs.
bjobs -l Display detailed information about a single job.
bjobs -d Display your recently completed batch jobs.
bjobs -p Display your pending batch jobs and pending reason.
bjobs -u all -o "jobid: user: proj_name:16 stat: priority: slots:4 submit_time:14 nalloc_slot:4 start_time:14 time_left: delimiter=' '" Customized output

Modifying Batch Jobs

Command Action
bkill <jobID> Removes batch job <jobID> from the queue.
btop <jobID> Reorders your pending jobs. Moves pending job <jobID> to the top of your pending job list.
bbot <jobID> Reorders your pending jobs. Moves pending job <jobID> to the bottom of your pending job list.
bstop <jobID> Places <jobID> in a suspended state.
bresume <jobID> Releases suspension from <jobID>.

Batch Queue Limits

Nodes Max Walltime
4 or fewer 4 hours
greater than 4 1 hour

In addition to the walltime limits, each user is allowed to have 2 jobs in a running state.

Batch Job Environment Variables

The LSF batch job system sets a number of environment variables usable within the batch job. A subset of the more useful ones for managing workflow within a batch job are summarized in the following table.

Variable Content
LSB_SUB_HOST The hostname of the node from which bsub was called
LSB_DJOB_NUMPROC The number of requested compute slots
LSB_JOBID, LSB_BATCH_JID The batch job ID
LSB_OUTDIR The directory path in which the batch job output file is written
LSB_OUTPUTFILE The batch job output file name
LSB_PROJECT_NAME The project to which the batch job is billed
LSB_JOBNAME The batch job name
LSB_HOSTS Space-separated list of host names for each compute slot
LSB_MCPU_HOSTS Space-separated list of "host slots" pairs for each unique host in the reservation.







MPI Task Layout

Beta2 Change (October 17):
jsrun became the default job launcher on Oct. 17. The below is provided for reference.

The --map-by option gives you control over how tasks are placed on the nodes. There are two ways to invoke --map-by. One simply says how to distribute tasks (i.e. by socket, by core, by node, etc) while the other explicitly says how many tasks to launch per socket/node. The first three examples are of the former, the final 2 of the latter:

Place tasks "socket first"; fill node before moving to next node

To run 40 tasks, filling one node before moving to the next node:

mpirun -np 40 --map-by core ./a.out
Compute Node 0
SOCKET 0 SOCKET 1
C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 C0 C1 C2 C3 C4 C5 C6 C7 C8 C9
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Compute Node 1
SOCKET 0 SOCKET 1
C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 C0 C1 C2 C3 C4 C5 C6 C7 C8 C9
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39

 

 

Assign tasks round-robin between sockets; fill node before moving to next node

To run 40 tasks, assigning in a round robin fashion between sockets (but still filling a node before moving to the next node):

mpirun -np 40 --map-by socket ./a.out
Compute Node 0
SOCKET 0 SOCKET 1
C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 C0 C1 C2 C3 C4 C5 C6 C7 C8 C9
0 2 4 6 8 10 12 14 16 18 1 3 5 7 9 11 13 15 17 19
Compute Node 1
SOCKET 0 SOCKET 1
C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 C0 C1 C2 C3 C4 C5 C6 C7 C8 C9
20 22 24 26 28 30 32 34 36 38 21 23 25 27 29 31 33 35 37 39

 

 

Assign tasks round-robin between nodes

To launch 40 tasks, assigning them round-robin between nodes (NOTE: Socket 0 on each node will fill up before any tasks are placed on socket 1):

mpirun -np 40 --map-by node ./a.out
Compute Node 0
SOCKET 0 SOCKET 1
C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 C0 C1 C2 C3 C4 C5 C6 C7 C8 C9
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38
Compute Node 1
SOCKET 0 SOCKET 1
C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 C0 C1 C2 C3 C4 C5 C6 C7 C8 C9
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39

 

 

Assign a specific number of tasks per node

The following will start a specific number of tasks per node. You don't need to specify the total number of tasks because it can be calculated based on the launch command and the number of nodes you requested.

mpirun --map-by ppr:5:node ./a.out
Compute Node 0
SOCKET 0 SOCKET 1
C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 C0 C1 C2 C3 C4 C5 C6 C7 C8 C9
0 1 2 3 4
Compute Node 1
SOCKET 0 SOCKET 1
C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 C0 C1 C2 C3 C4 C5 C6 C7 C8 C9
5 6 7 8 9

 

 

Assign a specific number of tasks per socket

The following will start a specific number of tasks per socket. As with the tasks per node request, you don't need to specify the total number of tasks because it can be calculated based on the launch command and the number of sockets (2 per node) you requested.

mpirun --map-by ppr:5:socket ./a.out
Compute Node 0
SOCKET 0 SOCKET 1
C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 C0 C1 C2 C3 C4 C5 C6 C7 C8 C9
0 1 2 3 4 5 6 7 8 9
Compute Node 1
SOCKET 0 SOCKET 1
C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 C0 C1 C2 C3 C4 C5 C6 C7 C8 C9
10 11 12 13 14 15 16 17 18 19

 

 

Combining layout options

One final example: if you want to spread out your MPI tasks across nodes (instead of filling node 0 first), you can combine the --map-by and --rank-by flags and add :span to the --rank-by flag. The following command will launch 8 tasks, going round robin between sockets and nodes.

mpirun -np 8 --rank-by socket:span --map-by ppr:2:socket ./a.out
Compute Node 0
SOCKET 0 SOCKET 1
C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 C0 C1 C2 C3 C4 C5 C6 C7 C8 C9
0 4 1 5
Compute Node 1
SOCKET 0 SOCKET 1
C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 C0 C1 C2 C3 C4 C5 C6 C7 C8 C9
2 6 3 7
Note:
The -display-map (you can also use --display-map) option to mpirun will show you how tasks will be placed across the nodes/cores in your allocation.
The -report-bindings (alternatively, --report-bindings) option to mpirun will report bindings for launched processes

Controlling Thread Layout Within a Node

Per the mpirun manpage:
"If your application uses threads, then you probably want to ensure that you are either not bound at all (by specifying --bind-to none), or bound to multiple cores using an appropriate binding level or specific number of processing elements per application process."
To bind to a specified number of cores, you should use --map-by :PE=n. (This is preferred over the deprecated --cpus-per-proc X and --cpus-per-rank X syntax). In this case, a PE or "processing element" appears to refer to a core, not a number of hardware threads per core. For example, you might run mpirun -np 4 --map-by socket:PE=2. This will use a single node with two MPI tasks per socket. Tasks 0 and 2 will be assigned to socket 0, with tasks 1 and 3 assigned to socket 1. In each case, the "first" rank on a socket will have cores 0 and 1 reserved and the second rank will have cores 2 and 3 reserved as shown below:

Compute Node
SOCKET 0 SOCKET 1
C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 C0 C1 C2 C3 C4 C5 C6 C7 C8 C9
0 2 1 3

And, recall that each core supports 8 hardware threads. Thus, each MPI rank has access to 8 hardware threads.
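Putting this together, a hybrid MPI/OpenMP launch under mpirun might look like the following sketch (the thread count is illustrative; verify the placement with --report-bindings):

export OMP_NUM_THREADS=2
mpirun -np 4 --map-by socket:PE=2 --report-bindings ./a.out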