System Overview

Summit is an IBM system located at the Oak Ridge Leadership Computing Facility. With a theoretical peak double-precision performance of approximately 200 PF, it is one of the most capable systems in the world for a wide range of traditional computational science applications. It is also one of the “smartest” computers in the world for deep learning applications with a mixed-precision capability in excess of 3 EF.

Summit Nodes

The basic building block of Summit is the IBM Power System AC922 node. Each of the approximately 4,600 compute nodes on Summit contains two IBM POWER9 processors and six NVIDIA Volta V100 accelerators and provides a theoretical double-precision capability of approximately 40 TF. Each POWER9 processor is connected via dual NVLINK bricks, each capable of a 25 GB/s transfer rate in each direction. Nodes contain 512 GB of DDR4 memory for use by the POWER9 processors and 96 GB of High Bandwidth Memory (HBM2) for use by the accelerators. Additionally, each node has 1.6 TB of non-volatile memory that can be used as a burst buffer.

The POWER9 processor is built around IBM’s SIMD Multi-Core (SMC). The processor provides 22 SMCs with separate 32kB L1 data and instruction caches. Pairs of SMCs share a 512kB L2 cache and a 10MB L3 cache. SMCs support Simultaneous Multi-Threading (SMT) up to a level of 4, meaning each physical core supports up to 4 hardware threads.

The POWER9 processors and V100 accelerators are cooled with cold plate technology. The remaining components are cooled through more traditional methods, although exhaust is passed through a back-of-cabinet heat exchanger prior to being released back into the room. Both the cold plate and heat exchanger operate using medium temperature water which is more cost-effective for the center to maintain than chilled water used by older systems.

Node Types

On Summit, there are three major types of nodes you will encounter: Login, Launch, and Compute. While all of these are similar in terms of hardware, they differ considerably in their intended use.

Node Type Description
Login When you connect to Summit, you’re placed on a login node. This is the place to write/edit/compile your code, manage data, submit jobs, etc. You should never launch parallel jobs from a login node nor should you run threaded jobs on a login node. Login nodes are shared resources that are in use by many users simultaneously.
Launch When your batch script (or interactive batch job) starts running, it will execute on a Launch Node. (If you are/were a user of Titan, these are similar in function to service nodes on that system). All commands within your job script (or the commands you run in an interactive job) will run on a launch node. Like login nodes, these are shared resources so you should not run multiprocessor/threaded programs on Launch nodes. It is appropriate to launch the jsrun command from launch nodes.
Compute Most of the nodes on Summit are compute nodes. These are where your parallel job executes. They’re accessed via the jsrun command.

Although the nodes are logically organized into different types, they all contain similar hardware. As a result of this homogeneous architecture, there is no need to cross-compile when building on a login node. Because login nodes have hardware resources similar to those of compute nodes, any tests run by your build process (especially by utilities such as autoconf and cmake) will have access to the same type of hardware found on compute nodes and should not require the intervention that might be needed on non-homogeneous systems.

NOTE: Login nodes have (2) 16-core Power9 CPUs and (4) V100 GPUs. Compute nodes have (2) 22-core Power9 CPUs and (6) V100 GPUs.

System Interconnect

Summit nodes are connected to a dual-rail EDR InfiniBand network providing a node injection bandwidth of 23 GB/s. Nodes are interconnected in a Non-blocking Fat Tree topology. This interconnect is a three-level tree implemented by a switch to connect nodes within each cabinet (first level) along with Director switches (second and third level) that connect cabinets together.

File Systems

Summit is connected to an IBM Spectrum Scale™ filesystem providing 250PB of storage capacity with a peak write speed of 2.5 TB/s. Summit also has access to the center-wide NFS-based filesystem (which provides user and project home areas) and has access to the center’s High Performance Storage System (HPSS) for user and project archival storage.

Operating System

Summit is running Red Hat Enterprise Linux (RHEL) version 7.5.

Hardware Threads

The IBM POWER9 processor supports Hardware Threads. Each of the POWER9’s physical cores has 4 “slices”. These slices provide Simultaneous Multi Threading (SMT) support within the core. Three SMT modes are supported: SMT4, SMT2, and SMT1. In SMT4 mode, each of the slices operates independently of the other three. This would permit four separate streams of execution (i.e. OpenMP threads or MPI tasks) on each physical core. In SMT2 mode, pairs of slices work together to run tasks. Finally, in SMT1 mode the four slices work together to execute the task/thread assigned to the physical core. Regardless of the SMT mode used, the four slices share the physical core’s L1 instruction & data caches.
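The SMT mode is typically selected per batch job rather than per node. A minimal sketch follows; it assumes that the LSF -alloc_flags mechanism (shown later in this guide for other flags) also accepts an smt setting, so verify it against the current scheduler documentation:

  #BSUB -alloc_flags smt2    # assumption: request SMT2 mode on the job's compute nodes (smt1 and smt4 analogous)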

NVIDIA V100 GPUs

The NVIDIA Tesla V100 accelerator has a peak performance of 7.8 TFLOP/s (double-precision) and contributes to a majority of the computational work performed on Summit. Each V100 contains 80 streaming multiprocessors (SMs), 16 GB of high-bandwidth memory (HBM2), and a 6 MB L2 cache that is available to the SMs. The GigaThread Engine is responsible for distributing work among the SMs and (8) 512-bit memory controllers control access to the 16 GB of HBM2 memory. The V100 uses NVIDIA’s NVLink interconnect to pass data between GPUs as well as from CPU-to-GPU.

[Figure: GV100 GPU]

NVIDIA V100 SM

Each SM on the V100 contains 32 FP64 (double-precision) cores, 64 FP32 (single-precision) cores, 64 INT32 cores, and 8 tensor cores. A 128-KB combined memory block for shared memory and L1 cache can be configured to allow up to 96 KB of shared memory. In addition, each SM has 4 texture units which use the (configured size of the) L1 cache.

[Figure: GV100 SM]

HBM2

Each V100 has access to 16 GB of high-bandwidth memory (HBM2), which can be accessed at speeds of up to 900 GB/s. Access to this memory is controlled by (8) 512-bit memory controllers, and all accesses to the high-bandwidth memory go through the 6 MB L2 cache.

NVIDIA NVLink

The processors within a node are connected by NVIDIA’s NVLink interconnect. Each link has a peak bandwidth of 25 GB/s (in each direction), and since there are 2 links between processors, data can be transferred from GPU-to-GPU and CPU-to-GPU at a peak rate of 50 GB/s.

NOTE: The 50-GB/s peak bandwidth stated above is for data transfers in a single direction. However, this bandwidth can be achieved in both directions simultaneously, giving a peak “bi-directional” bandwidth of 100 GB/s between processors.

The figure below shows a schematic of the NVLink connections between the CPU and GPUs on a single socket of a Summit node.

[Figure: NVLink connections between the CPU and GPUs on a single socket of a Summit node]

Tensor Cores

The Tesla V100 contains 640 tensor cores (8 per SM) intended to enable faster training of large neural networks. Each tensor core performs a D = AB + C operation on 4×4 matrices. A and B are FP16 matrices, while C and D can be either FP16 or FP32:


[Figure: Tensor core operation D = AB + C on 4×4 matrices]
Each of the 16 elements that result from the AB matrix multiplication comes from 4 floating-point fused-multiply-add (FMA) operations (essentially a dot product between a row of A and a column of B). Each FP16 multiply yields a full-precision product, which is accumulated into an FP32 result:


[Figure: Tensor core FP16 multiply with FP32 accumulate]
Each tensor core performs 64 of these FMA operations per clock. The 4×4 matrix operations outlined here can be combined to perform matrix operations on larger (and higher dimensional) matrices.
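As a rough sanity check against the specifications listed later in this section (640 tensor cores at 1530 MHz), the peak mixed-precision tensor throughput works out to:

  640 tensor cores × 64 FMA/clock × 2 FLOP/FMA × 1530 MHz ≈ 125 TFLOP/s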

Volta Multi-Process Service

When a CUDA program begins, each MPI rank creates a separate CUDA context on the GPU, but the scheduler on the GPU only allows one CUDA context (and so one MPI rank) at a time to launch on the GPU. This means that multiple MPI ranks can share access to the same GPU, but each rank gets exclusive access while the other ranks wait (time-slicing). This can cause the GPU to become underutilized if a rank (that has exclusive access) does not perform enough work to saturate the resources of the GPU. The following figure depicts such time-sliced access to a pre-Volta GPU.

[Figure: Time-sliced access to a pre-Volta GPU by multiple MPI ranks]

The Multi-Process Service (MPS) enables multiple processes (e.g. MPI ranks) to concurrently share the resources on a single GPU. This is accomplished by starting an MPS server process, which funnels the work from multiple CUDA contexts (e.g. from multiple MPI ranks) into a single CUDA context. In some cases, this can increase performance due to better utilization of the resources. The figure below illustrates MPS on a pre-Volta GPU.


[Figure: MPS on a pre-Volta GPU]

Volta GPUs improve MPS with new capabilities. For instance, each Volta MPS client (MPI rank) is assigned a “subcontext” that has its own GPU address space, instead of sharing the address space with other clients. This isolation helps protect MPI ranks from out-of-range reads/writes performed by other ranks within CUDA kernels. Because each subcontext manages its own GPU resources, it can submit work directly to the GPU without the need to first pass through the MPS server. In addition, Volta GPUs support up to 48 MPS clients (up from 16 MPS clients on Pascal).


[Figure: Volta MPS with per-client subcontexts]
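A hedged sketch of using MPS in a Summit batch job follows. The gpumps allocation flag and the jsrun resource-set options are assumptions for illustration (job launch is covered in detail later in this guide) and should be checked against the current scheduler documentation:

  #BSUB -alloc_flags gpumps        # assumption: start an MPS server on each compute node for this job
  jsrun -n6 -a2 -c2 -g1 ./a.out    # e.g., two MPI ranks per resource set sharing one V100 through MPS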
 

For more information, please see the following document from NVIDIA:
https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf

Unified Memory

Unified memory is a single virtual address space that is accessible to any processor in a system (within a node). This means that programmers only need to allocate a single unified-memory pointer (e.g. using cudaMallocManaged) that can be accessed by both the CPU and GPU, instead of requiring separate allocations for each processor. This “managed memory” is automatically migrated to the accessing processor, which eliminates the need for explicit data transfers.


[Figure: Unified memory provides a single address space accessible from both CPU and GPU]
 

On Pascal-generation GPUs and later, this automatic migration is enhanced with hardware support. A page migration engine enables GPU page faulting, which allows the desired pages to be migrated to the GPU “on demand” instead of the entire “managed” allocation. In addition, 49-bit virtual addressing allows programs using unified memory to access the full system memory size. The combination of GPU page faulting and larger virtual addressing allows programs to oversubscribe the system memory, so very large data sets can be processed. In addition, new CUDA API functions introduced in CUDA 8 allow users to fine-tune the use of unified memory.

Unified memory is further improved on Volta GPUs through the use of access counters that can be used to automatically tune unified memory by determining where a page is most often accessed.

For more information, please see the following section of NVIDIA’s CUDA Programming Guide:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd

Independent Thread Scheduling

The V100 supports independent thread scheduling, which allows threads to synchronize and cooperate at sub-warp scales. Pre-Volta GPUs implemented warps (groups of 32 threads which execute instructions in single-instruction, multiple thread – SIMT – mode) with a single call stack and program counter for a warp as a whole.

[Figure: Pre-Volta warp with a single program counter and call stack shared by all 32 threads]

Within a warp, a mask is used to specify which threads are currently active when divergent branches of code are encountered. The (active) threads within each branch execute their statements serially before threads in the next branch execute theirs. This means that programs on pre-Volta GPUs should avoid sub-warp synchronization; a sync point in the branches could cause a deadlock if all threads in a warp do not reach the synchronization point.
[Figure: Divergent branches executed serially under an active-thread mask on pre-Volta GPUs]

The Volta V100 introduces warp-level synchronization by implementing warps with a program counter and call stack for each individual thread (i.e. independent thread scheduling).

[Figure: Volta warp with a per-thread program counter and call stack]

This implementation allows threads to diverge and synchronize at the sub-warp level using the __syncwarp() function. The independent thread scheduling enables the thread scheduler to stall execution of any thread, allowing other threads in the warp to execute different statements. This means that threads in one branch can stall at a sync point and wait for the threads in the other branch to reach their sync point.

[Figure: Sub-warp synchronization enabled by independent thread scheduling]

For more information, please see the following section of NVIDIA’s CUDA Programming Guide:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#independent-thread-scheduling-7-x

Tesla V100 Specifications

Compute Capability 7.0
Peak double precision floating point performance 7.8 TFLOP/s
Peak single precision floating point performance 15.7 TFLOP/s
Single precision CUDA cores 5120
Double precision CUDA cores 2560
Tensor cores 640
Clock frequency 1530 MHz
Memory Bandwidth 900 GB/s
Memory size (HBM2) 16 GB
L2 cache 6 MB
Shared memory size / SM Configurable up to 96 KB
Constant memory 64 KB
Register File Size 256 KB (per SM)
32-bit Registers 65536 (per SM)
Max registers per thread 255
Number of multiprocessors (SMs) 80
Warp size 32 threads
Maximum resident warps per SM 64
Maximum resident blocks per SM 32
Maximum resident threads per SM 2048
Maximum threads per block 1024
Maximum block dimensions 1024, 1024, 64
Maximum grid dimensions 2147483647, 65535, 65535
Maximum number of MPS clients 48

 

Further Reading

For more information on the NVIDIA Volta architecture, please visit the following (outside) links.

NVIDIA Volta Architecture White Paper:
http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf

NVIDIA PARALLEL FORALL blog article:
https://devblogs.nvidia.com/parallelforall/inside-volta/

Connecting

This section covers the basic procedures for accessing computational resources at the Oak Ridge Leadership Computing Facility.

Connect with SSH

Secure shell (SSH) clients are the only supported remote clients for use with OLCF systems. SSH encrypts the entire session between OLCF systems and the client system, and avoids risks associated with using plain-text communication.

Note: To access OLCF systems, your SSH client must support SSH protocol version 2 (this is common) and allow keyboard-interactive authentication.

For UNIX-based SSH clients, the following line should be in either the default ssh_config file or your $HOME/.ssh/config file:

PreferredAuthentications keyboard-interactive,password

The line may also contain other authentication methods, but keyboard-interactive must be included.

SSH clients are also available for Windows-based systems, such as SecureCRT published by Van Dyke Software. For recent SecureCRT versions, the preferred authentications setting shown above can be made through the “connection properties” menu.

Note: SSH multiplexing is disabled on all of the OLCF’s user-facing systems. Users will receive an error message if they attempt to connect to an OLCF resource that tries to reuse an SSH control path. To ensure SSH connections will not attempt multiplexing, you will need to modify your $HOME/.ssh/config file by adding the following:

  Host *.ccs.ornl.gov
    ControlMaster no
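Putting these settings together, a minimal $HOME/.ssh/config for OLCF systems might look like the following sketch. The *.olcf.ornl.gov host pattern is an assumption added here to cover Summit's hostname; adjust the patterns to match the systems you use:

  Host *.ccs.ornl.gov *.olcf.ornl.gov
    PreferredAuthentications keyboard-interactive,password
    ControlMaster no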

OLCF System Hostnames

Each OLCF system has a single, designated hostname for general user-initiated connections. Using these hostnames allows for automatic load-balancing that will send users to other hosts as needed. The designated OLCF hostnames for general user connections are as follows:

 

 

System Name / Hostname / RSA Key Fingerprints
Summit summit.olcf.ornl.gov
  MD5: 08:d0:fe:3f:f3:41:96:9c:ae:73:73:a8:92:6c:79:34
  SHA256: nA7X4qyPvtEpXWxG5MDeXEC8xfpmm0UMiLq/LkgM33I
Titan titan.ccs.ornl.gov
  MD5: 77:dd:c9:2c:65:2f:c3:89:d6:24:a6:57:26:b5:9b:b7
  SHA256: 6Df2kqvj26HGadu3KDegPSeE/vbLYUjSIuot2AhsqL4
Rhea rhea.ccs.ornl.gov
  MD5: 9a:72:79:cf:9e:47:33:d1:91:dd:4d:4e:e4:de:25:33
  SHA256: AJIOXipN3Pgpcp/pFp8+g1zG09v0CiFpwQlc17OJ9S8
Eos eos.ccs.ornl.gov
  MD5: e3:ae:eb:12:0d:b1:4c:0b:6e:53:40:5c:e7:8a:0d:19
  SHA256: LlznEESwYCT16sUty1ItSbO6n9FbqT0NNMVQMLQX3IY
Data Transfer Nodes dtn.ccs.ornl.gov
  MD5: b3:31:ac:44:83:2b:ce:37:cc:23:f4:be:7a:40:83:85
  SHA256: GzKEIvoBKdeEH/yekGSOO8aSKibkl/KNTp9ZfYQq7VM
Home (machine) home.ccs.ornl.gov
  MD5: ba:12:46:8d:23:e7:4d:37:92:39:94:82:91:ea:3d:e9
  SHA256: FjDs4sRAX8hglzA7TVkK22NzRKsjhDTTTdfeEAHwPEA

System Name / Hostname / ECDSA Key Fingerprints
Summit summit.olcf.ornl.gov
  MD5: cf:32:f9:35:fd:3f:2a:0f:ed:d3:84:b1:2d:f0:35:1b
  SHA256: m0iF9JJEoJu6jJGA8FFbSABlpKFYPGKbdmi25rFC1AI
Titan titan.ccs.ornl.gov
  MD5: 54:2a:81:ed:75:14:d6:ec:fc:85:b8:4f:fb:b1:11:fa
  SHA256: afnEsujjMnIvC+1HFxnbsj4WmGa/Ka7tVn0nXHp2ebw
Rhea rhea.ccs.ornl.gov
  N/A
Eos eos.ccs.ornl.gov
  MD5: d7:bb:7d:a1:73:f7:92:42:43:e6:75:d6:31:29:87:8a
  SHA256: ddtmhprIkEcTt7OChHW6ITb0EjlCOdlP5DXMYC49Vog
Data Transfer Nodes dtn.ccs.ornl.gov
  MD5: 04:35:86:9a:97:8a:19:74:32:21:29:5c:53:91:35:7a
  SHA256: owzOIoCC9VcYNXjAcOgHgIgRfmbtglfzhVtf8TM5qtQ
Home (machine) home.ccs.ornl.gov
  MD5: 8a:92:0f:31:4d:38:2d:2c:ec:7d:53:ce:8b:46:73:d6
  SHA256: 0hc6SDou8vauFWgOaeXKUmhDSmKK8roj9jWpapV4qzc

 

For example, to connect to Titan from a UNIX-based system, use the following:

$ ssh userid@titan.ccs.ornl.gov

RSA Key Fingerprints

Occasionally, you may receive an error message upon logging in to a system such as the following:

@@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.

This can be a result of normal system maintenance that results in a changed RSA public key, or could be an actual security incident. If the RSA fingerprint displayed by your SSH client does not match the OLCF-authorized RSA fingerprint for the machine you are accessing, do not continue authentication; instead, contact help@olcf.ornl.gov.


Authenticating to OLCF Systems

All OLCF systems currently employ two-factor authentication only. To login to OLCF systems, an RSA SecurID® Token (fob) is required.
[Image: RSA SecurID fob]

Activating a new SecurID® Token (fob)

Follow the steps described below to set up your new SecurID Token (fob).

  1. Initiate an SSH connection to home.ccs.ornl.gov using your OLCF username.
    (i.e., ssh userid@home.ccs.ornl.gov)
  2. When prompted for a PASSCODE, enter the 6 digits displayed on your token.
  3. When asked if you are ready to set your PIN, answer with “y”.
  4. You will then be prompted to enter a PIN. Enter a 4- to 8-character alphanumeric PIN you can remember. You will then be prompted to re-enter your PIN.
  5. A message will appear stating that your PIN has been accepted. Press enter to continue.
  6. Finally, you will be prompted again with “Enter PASSCODE”. This time enter both your PIN and the 6 digits displayed on your token before pressing enter.
  7. Your PIN is now set and you are logged into home.ccs.ornl.gov.

Using a SecurID® Token (fob)

When prompted for your PASSCODE, enter your PIN followed by the 6 digits shown on your SecurID® token before pressing enter. For example, if your pin is 1234 and the 6 digits on the token are 000987, enter 1234000987 when you are prompted for a PASSCODE.

Warning: The 6-digit code displayed on the SecurID token can only be used once. If prompted for multiple PASSCODE entries, always allow the 6-digit code to change between entries. Re-using the 6-digit code can cause your account to be automatically disabled.

PINs, Passcodes, and Tokencodes

When users connect with RSA SecurID tokens, they are most often prompted for a PASSCODE. Sometimes, they are instead prompted for a PIN (typically only on initial setup) and other times they might be prompted to wait for the tokencode to change and enter the new tokencode. What do these terms mean?

The TOKENCODE is the 6-digit number generated by the RSA token.

The PIN is a (4) to (8)-digit number selected by the user when they initially set up their RSA token.

The PASSCODE is simply the user’s PIN followed by the current tokencode.

These are relatively straightforward; however, there can be some confusion on initial setup. The first time a user connects with a new token (or, if for some reason the user requested that we clear the PIN associated with their token) they are prompted for a PASSCODE but in reality only enter a tokencode. This is because during this initial setup procedure a PIN does not exist. Since there is no PIN, the PASSCODE is the same as the tokencode in this rare case.


X11 Forwarding

Automatic forwarding of the X11 display to a remote computer is possible with the use of SSH and a local X server. To set up automatic X11 forwarding within SSH, you can do one of the following:

  • Invoke ssh on the command line with:
    $ ssh -X hostname

    Note that use of the -x option (lowercase) will disable X11 forwarding.

  • Edit (or create) your $HOME/.ssh/config file to include the following line:
    ForwardX11 yes

All X11 data will go through an encrypted channel. The $DISPLAY environment variable set by SSH will point to the remote machine with a port number greater than zero. This is normal, and happens because SSH creates a proxy X server on the remote machine for forwarding the connections over an encrypted channel. The connection to the real X server will be made from the local machine.

Warning: Users should not manually set the $DISPLAY environment variable for X11 forwarding; a non-encrypted channel may be used in this case.

Connecting to Internal OLCF Systems

Some OLCF systems are not directly accessible from outside the OLCF network. In order to access these systems, you must first log into Home.

 ssh userid@home.ccs.ornl.gov

Once logged into Home, you can ssh into the desired (internal) system. Please see the Home section on the Getting Started page for more information (e.g. appropriate uses).

File Systems: Data Storage & Transfers

Storage Overview

OLCF users have many options for data storage. Each user has multiple user-affiliated storage spaces, and each project has multiple project-affiliated storage spaces where data can be shared for collaboration. Below we give an overview and explain where each storage area is mounted.

Alpine IBM Spectrum Scale Filesystem

Summit mounts a POSIX-based IBM Spectrum Scale parallel filesystem called Alpine. Alpine's maximum capacity is 250 PB. It consists of 77 IBM Elastic Storage Server (ESS) GL4 nodes running IBM Spectrum Scale 5.x, which act as Network Shared Disk (NSD) servers. Each IBM ESS GL4 node is a scalable storage unit (SSU) built from two dual-socket IBM POWER9 storage servers, with a 4X EDR InfiniBand connection providing up to 100 Gbit/s of network bandwidth. The maximum performance of the final production system will be about 2.5 TB/s for sequential I/O and 2.2 TB/s for random I/O under file-per-process (FPP) mode, in which each process writes its own file. Metadata performance is also improved, with a minimum of roughly 50,000 file accesses per second and an aggregate of up to 2.6 million accesses per second for small (32 KB) files.

     Figure 1. An example of the NSD servers on Summit

Performance under non-ideal workloads

I/O performance can be lower than these optimal figures when a single shared file is written with a non-optimal I/O pattern. Moreover, the performance results above were achieved under ideal conditions: a dedicated system and a specific number of compute nodes. Because the file system is shared across many users, I/O performance can vary as other users running large-scale jobs perform heavy I/O and stress the interconnection network. Finally, if the I/O pattern is not aligned, performance can be significantly lower than the ideal. The same caveats apply to metadata operations: depending on the number of concurrent users, they can fall below the expected performance.

Tips

  • For best performance on the IBM Spectrum Scale filesystem, use large, page-aligned I/O and asynchronous reads and writes. The filesystem blocksize is 16 MB and the minimum fragment size is 16 KB, so a file smaller than 16 KB will still consume 16 KB of disk. Writing files of 16 MB or larger will achieve better performance. All files are striped across LUNs, which are distributed across all I/O servers.
  • If your application occupies up to two compute nodes and requires a significant number of I/O operations, you can try adding the following flag to your job script and check whether the total execution time decreases. Depending on the application, this flag can also make performance worse.

#BSUB -alloc_flags maximizegpfs

Major difference between Titan (Lustre) and Summit (IBM Spectrum Scale)

The file systems have many technical differences, but we will mention only what a user needs to be familiar with:
  • On Summit there is no concept of striping from the user's point of view; you use the Alpine storage without declaring striping for files or directories. GPFS handles the workload, and the file system was tuned during installation.

Storage Areas

The storage area to use in any given situation depends upon the activity you wish to carry out. Each user has a User Home area on a Network File System (NFS) and a User Archive area on the archival High Performance Storage System (HPSS). These user storage areas are intended to house user-specific files. Each project has a Project Home area on NFS, multiple Project Work areas on Lustre and Spectrum Scale, and a Project Archive area on HPSS. These project storage areas are intended to house project-centric files. We have defined several areas as listed below by function:
  • User Home: Long-term data for routine access that is unrelated to a project.
  • User Archive: Long-term data for archival access that is unrelated to a project.
  • Project Home: Long-term project data for routine access that's shared with other project members.
  • Member Work: Short-term user data for fast, batch-job access that is not shared with other project members. There are versions of this on both the Atlas Lustre filesystem and the Alpine Spectrum Scale filesystem.
  • Project Work: Short-term project data for fast, batch-job access that's shared with other project members. There are versions of this on both the Atlas Lustre filesystem and the Alpine Spectrum Scale filesystem.
  • World Work: Short-term project data for fast, batch-job access that's shared with OLCF users outside your project. There are versions of this on both the Atlas Lustre filesystem and the Alpine Spectrum Scale filesystem.
  • Project Archive: Long-term project data for archival access that's shared with other project members.

Storage policy

A brief description of each area and basic guidelines to follow are provided in the table below:
Name Path Type Permissions Backups Purged Quota
User Home $HOME NFS User Set yes no 50GB
User Archive /home/$USER HPSS User Set no no 2TB
Project Home /ccs/proj/[projid] NFS 770 yes no 50GB
Member Work /gpfs/alpine/scratch/[userid]/[projid]/ Spectrum Scale 700 no 90 days 50TB
Project Work /gpfs/alpine/proj-shared/[projid] Spectrum Scale 770 no 90 days 50TB
World Work /gpfs/alpine/world-shared/[projid] Spectrum Scale 775 no 90 days 50TB
Project Archive /proj/[projid] HPSS 770 no no 100TB
On Summit, paths to the various project-centric work storage areas are simplified by the use of environment variables that point to the proper directory on a per-user basis:
  • Member Work Directory: $MEMBERWORK/[projid]
  • Project Work Directory: $PROJWORK/[projid]
  • World Work Directory: $WORLDWORK/[projid]
These environment variables are not set on the data transfer nodes.
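For example (using the abc123 project ID that appears elsewhere in this guide), a batch job or interactive session can reference these areas directly:

  summit$ cd $MEMBERWORK/abc123    # your private scratch area for project abc123
  summit$ ls $PROJWORK/abc123      # data shared with other abc123 project members
  summit$ ls $WORLDWORK/abc123     # data shared with all OLCF users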

Information

  • Although there are no hard quota limits, an upper storage limit should be reported in the project request. The available space of a project can be modified upon request.
  • The user will be informed when the project reaches 90% of the requested storage utilization.

Purge

To keep the Lustre and Spectrum Scale file systems exceptionally performant, untouched files in the project and user areas are purged at the intervals shown in the table above. Please make sure that valuable data is moved off of these systems regularly. See HPSS Best Practices for information about using the HSI and HTAR utilities to archive data on HPSS.

Retention

At the completion of a project, or at the end of a member's association with the project, data will remain available for 90 days, except in areas that are purged; in those areas, data is retained according to the purge policy. After 90 days the data becomes unavailable but is not removed for another 60 days, after which it will be deleted unless retention is requested otherwise.

Other OLCF Storage Systems

The High Performance Storage System (HPSS) at the OLCF provides longer-term storage for the large amounts of data created on the OLCF compute systems. HPSS is accessible from all OLCF filesystems through the HSI and HTAR utilities. For more information on using HSI or HTAR, see the HPSS Best Practices documentation. OLCF also has a Network File System, referred to as NFS, and a Lustre filesystem called Atlas. Summit does not mount Lustre; however, during the early use of Summit, users may need to use Lustre in a multi-stage process with HPSS for larger data transfers with Alpine. To learn more, please see the Data Transfer and Summit section below. The following shows the availability of each of the filesystems on primary OLCF clusters and supercomputers.
Area Summit Summitdev Titan Data Transfer Nodes Rhea Eos
Atlas Lustre Filesystem no no yes yes yes yes
Alpine Spectrum Scale Filesystem yes yes no yes no no
NFS Network Filesystem yes yes yes yes yes yes
HPSS HSI/Htar HSI/Htar HSI/Htar HSI/Htar HSI/Htar HSI/Htar

Guidelines

A brief description of each area and basic guidelines to follow are provided in the table below:

System Name Path Type Permissions Backups Purged Quota
User Home User Home $HOME NFS User Set yes no 50GB
User Archive User Archive /home/$USER HPSS User Set no no 2TB
Project Home Project Home /ccs/proj/[projid] NFS 770 yes no 50GB
Alpine Member Work /gpfs/alpine/scratch/[userid]/[projid]/ Spectrum Scale 700 no 90 days 50TB
Project Work /gpfs/alpine/proj-shared/[projid]/ Spectrum Scale 770 no 90 days 50TB
World Work /gpfs/alpine/world-shared/[projid]/ Spectrum Scale 775 no 90 days 50TB
Atlas Member Work /lustre/atlas/scratch/[userid]/[projid] Lustre 700 no 14 days 10TB
Project Work /lustre/atlas/proj-shared/[projid] Lustre 770 no 90 days 100TB
World Work /lustre/atlas/world-shared/[projid] Lustre 775 no 90 days 10TB
Project Archive Project Archive /proj/[projid] HPSS 770 no no 100TB

 

Backups for Files on NFS

Online backups are performed at regular intervals for your files in project home and user home. Hourly backups for the past 24 hours, daily backups for the last 7 days, and 1 weekly backup are available. The backup directories are named hourly.*, daily.* , and weekly.* where * is the date/time stamp of the backup. For example, hourly.2016-12-01-0905 is an hourly backup made on December 1, 2016 at 9:05 AM.

The backups are accessed via the .snapshot subdirectory. You may list your available hourly/daily/weekly backups by doing “ls .snapshot”. The .snapshot feature is available in any subdirectory of your home directory and will show the online backup of that subdirectory. In other words, you don’t have to start at /ccs/home/$USER and navigate the full directory structure; if you’re in a /ccs/home subdirectory several “levels” deep, an “ls .snapshot” will access the available backups of that subdirectory.

To retrieve a backup, simply copy it into your desired destination with the cp command.
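For example, using the hourly backup named above (myfile is a hypothetical file name):

  $ ls .snapshot                                    # list available hourly.*, daily.*, and weekly.* backups
  $ cp .snapshot/hourly.2016-12-01-0905/myfile .    # restore one file from the backup taken Dec 1, 2016 at 9:05 AM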

Retention

At the completion of a project, or at the end of a member's association with the project, data will be retained for 90 days, except in areas that are purged; in those areas, the data will be retained according to the purge policy.

A more detailed description of each storage area is given below.

User-Centric Data Storage

The following table summarizes user-centric storage areas available on OLCF resources and lists relevant polices.

User-Centric Storage Areas
Area Path Type Permissions Quota Backups Purged Retention
User Home $HOME NFS User-controlled 10 GB Yes No 90 days
User Archive /home/$USER HPSS User-controlled 2 TB [1] No No 90 days
[1] In addition, there is a quota/limit of 2,000 files on this directory.

User Home Directories (NFS)

The environment variable $HOME will always point to your current home directory. It is recommended, where possible, that you use this variable to reference your home directory. In cases in which using $HOME is not feasible, it is recommended that you use /ccs/home/$USER.

Users should note that since this is an NFS-mounted filesystem, its performance will not be as high as other filesystems.

User Home Quotas

Quotas are enforced on user home directories. To request an increased quota, contact the OLCF User Assistance Center. To view your current quota and usage, use the quota command:
$ quota -Qs
Disk quotas for user usrid (uid 12345):
     Filesystem  blocks   quota   limit   grace   files   quota   limit   grace
nccsfiler1a.ccs.ornl.gov:/vol/home
                  4858M   5000M   5000M           29379   4295m   4295m

User Home Permissions

The default permissions for user home directories are 0750 (full access to the user, read and execute for the group). Users have the ability to change permissions on their home directories, although it is recommended that permissions be set to as restrictive as possible (without interfering with your work).

User Website Directory

Users interested in sharing files publicly via the World Wide Web can request a user website directory be created for their account. User website directories (~/www) have a 5GB storage quota and allow access to files at http://users.nccs.gov/~user (where user is your userid). If you are interested in having a user website directory created, please contact the User Assistance Center at help@olcf.ornl.gov.

User Archive Directories (HPSS)

The High Performance Storage System (HPSS) at the OLCF provides longer-term storage for the large amounts of data created on the OLCF compute systems. The mass storage facility consists of tape and disk storage components, servers, and the HPSS software. After data is uploaded, it persists on disk for some period of time. The length of its life on disk is determined by how full the disk caches become. When data is migrated to tape, it is done so in a first-in, first-out fashion.

User archive areas on HPSS are intended for storage of data not immediately needed in either User Home directories (NFS) or User Work directories (Lustre®). User Archive directories should not be used to store project-related data. Rather, Project Archive directories should be used for project data.

User archive directories are located at /home/$USER.

User Archive Access

Each OLCF user receives an HPSS account automatically. Users can transfer data to HPSS from any OLCF system using the HSI or HTAR utilities. For more information on using HSI or HTAR, see the HPSS Best Practices section.

User Archive Accounting

Each file and directory on HPSS is associated with an HPSS storage allocation. For information on HPSS storage allocations, please visit the HPSS Archive Accounting section.

For information on usage and best practices for HPSS, please see the HPSS Best Practices documentation.


Project-Centric Data Storage

Project directories provide members of a project with a common place to store code, data, and other files related to their project.

Project Home Directories (NFS)

Name Path Type Permissions Backups Purged Quota
Project Home /ccs/proj/[projid] NFS 770 yes no 50GB

Projects are provided with a Project Home storage area in the NFS-mounted filesystem. This area is intended for storage of data, code, and other files that are of interest to all members of a project. Since Project Home is an NFS-mounted filesystem, its performance will not be as high as other filesystems.

Project Home Path

Project Home area is accessible at /ccs/proj/abc123 (where abc123 is your project ID).

Project Home Quotas

To check your project’s current usage, run df -h /ccs/proj/abc123 (where abc123 is your project ID). Quotas are enforced on project home directories. The current limit is shown in the table above.

Project Home Permissions

The default permissions for project home directories are 0770 (full access to the user and group). The directory is owned by root, and the group includes the project's group members. All members of a project should also be members of that project's UNIX group. For example, all members of project “ABC123” should be members of the “abc123” UNIX group.

Three Project Work Areas to Facilitate Collaboration

To facilitate collaboration among researchers, the OLCF provides (3) distinct types of project-centric work storage areas: Member Work directories, Project Work directories, and World Work directories. Each directory should be used for storing files generated by computationally-intensive HPC jobs related to a project.

Name Path Type Permissions Backups Purged Quota
Member Work /lustre/atlas/scratch/[userid]/[projid] Lustre 700 no 14 days 10 TB
Member Work /gpfs/alpinetds/scratch/[userid]/[projid] Spectrum Scale 700 no 90 days 50 TB
Project Work /lustre/atlas/proj-shared/[projid] Lustre 770 no 90 days 100 TB
Project Work /gpfs/alpinetds/proj-shared/[projid] Spectrum Scale 770 no 90 days 50 TB
World Work /lustre/atlas/world-shared/[projid] Lustre 775 no 90 days 10 TB
World Work /gpfs/alpinetds/world-shared/[projid] Spectrum Scale 775 no 90 days 50 TB

 

The difference between the three lies in the accessibility of the data to project members and to researchers outside of the project. Member Work directories are accessible only by an individual project member by default. Project Work directories are accessible by all project members. World Work directories are readable by any user on the system.

Permissions

UNIX Permissions on each project-centric work storage area differ according to the area’s intended collaborative use. Under this setup, the process of sharing data with other researchers amounts to simply ensuring that the data resides in the proper work directory.

      • Member Work Directory: 700
      • Project Work Directory: 770
      • World Work Directory: 775

For example, if you have data that must be restricted only to yourself, keep them in your Member Work directory for that project (and leave the default permissions unchanged). If you have data that you intend to share with researchers within your project, keep them in the project’s Project Work directory. If you have data that you intend to share with researchers outside of a project, keep them in the project’s World Work directory.

Backups

Member Work, Project Work, and World Work directories are not backed up. Project members are responsible for backing up these files, either to Project Archive areas (HPSS) or to an off-site location.

Project Archive Directories

Name Path Type Permissions Backups Purged Quota
Project Archive /proj/[projid] HPSS 770 no no 100TB

Projects are also allocated project-specific archival space on the High Performance Storage System (HPSS). The default quota is shown on the table above. If a higher quota is needed, contact the User Assistance Center.

The Project Archive space on HPSS is intended for storage of data not immediately needed in either Project Home (NFS) areas or Project Work (Alpine) areas, and to serve as a location to store backup copies of project-related files.

Project Archive Path

The project archive directories are located at /proj/pjt000 (where pjt000 is your Project ID).

Project Archive Access

Project Archive directories may only be accessed via utilities called HSI and HTAR. For more information on using HSI or HTAR, see the HPSS Best Practices section.

Data Transfer and Summit

Many users will need to move or copy their data from Titan (Atlas) to Summit (Alpine). The Alpine file system is mounted on OLCF's data transfer nodes, which makes it feasible to transfer data with tools such as Globus. The following sections outline the current options for transferring data to and from Alpine.

Local Transfers

For local data transfers (e.g. from Atlas to Alpine), users have two options depending on the size of the data. For small data transfers, users can first copy data (via the cp command) from Atlas to the NFS storage area (i.e. home /ccs/home/userid/ or project space /ccs/proj/), and then from their NFS directory to Alpine (/gpfs/alpine/). Because the NFS storage areas have a 50 GB quota, such transfers must be less than 50 GB. If larger data must be moved between Atlas and Alpine, it can be passed through HPSS. For example, if your data is initially located in /lustre/atlas/proj-shared/[projid] and you want to move it to /gpfs/alpine/proj-shared/[projid]:

  1. Log in to titan.ccs.ornl.gov or dtn.ccs.ornl.gov
  2. cd /lustre/atlas/proj-shared/[projid]
  3. htar -cvf mylargefiles.tar mylargefiles
  4. Log in to summit.olcf.ornl.gov
  5. cd /gpfs/alpine/proj-shared/[projid]
  6. htar -xvf mylargefiles.tar
Below is a screencast showing the process of transferring data from Atlas to Alpine.

Remote Transfers with Alpine

scp and rsync are available on Summit for small remote transfers. For larger remote transfers to or from Alpine, we recommend using Globus through the OLCF DTN endpoint (the data transfer nodes mount Alpine). Follow these steps:
  • Visit www.globus.org and log in
  • Choose your organization and use your credentials to log in
  • Search for the endpoint OLCF DTN
  • Declare the path to your files
  • Open a second panel to declare where your files should be transferred, select whether you want an encrypted transfer, select your file/folder, and click start
  • An activity report will then appear; click on it to see the status. You will receive an email when the transfer finishes or fails.

HPSS Best Practices

Currently, HSI and HTAR are offered for archiving data into HPSS or retrieving data from the HPSS archive. For optimal transfer performance, we recommend sending a file of 768 GB or larger to HPSS. The minimum file size that we recommend sending is 512 MB. HPSS will handle files between 0K and 512 MB, but write and read performance will be negatively affected. For files smaller than 512 MB we recommend bundling them with HTAR to achieve an archive file of at least 512 MB. When retrieving data from a tar archive larger than 1 TB, we recommend that you pull only the files that you need rather than the full archive. Examples of this will be given in the htar section below.

Using HSI

Issuing the command hsi will start HSI in interactive mode. Alternatively, you can use:
  hsi [options] command(s)
...to execute a set of HSI commands and then return. To list your files on HPSS, you might use:
  hsi ls
hsi commands are similar to ftp commands. For example, hsi get and hsi put are used to retrieve and store individual files, and hsi mget and hsi mput can be used to retrieve multiple files. To send a file to HPSS, you might use:
  hsi put a.out
To put a file in a pre-existing directory on HPSS:
  hsi “cd MyHpssDir; put a.out”
To retrieve one, you might use:
  hsi get /proj/projectid/a.out
Here is a list of commonly used hsi commands.
Command Function
cd Change current directory
get, mget Copy one or more HPSS-resident files to local files
cget Conditional get - get the file only if it doesn't already exist
cp Copy a file within HPSS
rm, mdelete Remove one or more files from HPSS
ls List a directory
put, mput Copy one or more local files to HPSS
cput Conditional put - copy the file into HPSS unless it is already there
pwd Print current directory
mv Rename an HPSS file
mkdir Create an HPSS directory
rmdir Delete an HPSS directory
 
Additional HSI Documentation
There is interactive documentation on the hsi command available by running:
  hsi help
Additionally, documentation can be found at the Gleicher Enterprises website, including an HSI Reference Manual and man pages for HSI.

Using HTAR

The htar command provides an interface very similar to the traditional tar command found on UNIX systems. It is used as a command-line interface. The basic syntax of htar is:
htar -{c|K|t|x|X} -f tarfile [directories] [files]
As with the standard Unix tar utility the -c, -x, and -t options, respectively, function to create, extract, and list tar archive files. The -K option verifies an existing tarfile in HPSS and the -X option can be used to re-create the index file for an existing archive. For example, to store all files in the directory dir1 to a file named allfiles.tar on HPSS, use the command:
  htar -cvf allfiles.tar dir1/*
To retrieve these files:
  htar -xvf allfiles.tar 
htar will overwrite files of the same name in the target directory. When possible, extract only the files you need from large archives. To display the names of the files in the project1.tar archive file within the HPSS home directory:
  htar -vtf project1.tar
To extract only one file, executable.out, from the project1 directory in the Archive file called project1.tar:
  htar -xm -f project1.tar project1/executable.out 
To extract all files from the project1/src directory in the archive file called project1.tar, and use the time of extraction as the modification time, use the following command:
  htar -xm -f project1.tar project1/src
HTAR Limitations
The htar utility has several limitations.
Appending data
You cannot add or append files to an existing archive.
File Path Length
File path names within an htar archive of the form prefix/name are limited to 154 characters for the prefix and 99 characters for the file name. Link names cannot exceed 99 characters.
Size
There are limits to the size and number of files that can be placed in an HTAR archive.
Individual File Size Maximum 68GB, due to POSIX limit
Maximum Number of Files per Archive 1 million
  For example, when attempting to HTAR a directory with one member file larger than the 68 GB limit, the following error message will appear:
[titan-ext1]$htar -cvf hpss_test.tar hpss_test/

INFO: File too large for htar to handle: hpss_test/75GB.dat (75161927680 bytes)
ERROR: 1 oversize member files found - please correct and retry
ERROR: [FATAL] error(s) generating filename list 
HTAR: HTAR FAILED
Additional HTAR Documentation
The HTAR user's guide and the HTAR man page can be found at the Gleicher Enterprises website.

Software

For a full list of software available at the OLCF, please see the Software section.

Shell & Programming Environments

OLCF systems provide hundreds of software packages and scientific libraries pre-installed at the system-level for users to take advantage of. To facilitate this, environment management tools are employed to handle necessary changes to the shell. The sections below provide information about using these management tools on Summit.

Default Shell

A user’s default shell is selected when completing the User Account Request form. The chosen shell is set across all OLCF resources, and is the shell interface a user will be presented with upon login to any OLCF system. Currently, supported shells include:

  • bash
  • tcsh
  • csh
  • ksh

If you would like to have your default shell changed, please contact the OLCF User Assistance Center at help@nccs.gov.

Environment Management with Lmod

Environment modules are provided through Lmod, a Lua-based module system for dynamically altering shell environments. By managing changes to the shell’s environment variables (such as PATH, LD_LIBRARY_PATH, and PKG_CONFIG_PATH), Lmod allows you to alter the software available in your shell environment without the risk of creating package and version combinations that cannot coexist in a single environment.

Lmod is a recursive environment module system, meaning it is aware of module compatibility and actively alters the environment to protect against conflicts. Messages to stderr are issued upon Lmod implicitly altering the environment. Environment modules are structured hierarchically by compiler family such that packages built with a given compiler will only be accessible if the compiler family is first present in the environment.

Note: Lmod can interpret both Lua modulefiles and legacy Tcl modulefiles. However, long and logic-heavy Tcl modulefiles may require porting to Lua.

General Usage

Typical use of Lmod is very similar to that of interacting with modulefiles on other OLCF systems. The interface to Lmod is provided by the module command:

Command Description
module -t list Shows a terse list of the currently loaded modules.
module avail Shows a table of the currently available modules
module help <modulename> Shows help information about <modulename>
module show <modulename> Shows the environment changes made by the <modulename> modulefile
module spider <string> Searches all possible modules according to <string>
module load <modulename> […] Loads the given <modulename>(s) into the current environment
module use <path> Adds <path> to the modulefile search cache and MODULEPATH
module unuse <path> Removes <path> from the modulefile search cache and MODULEPATH
module purge Unloads all modules
module reset Resets loaded modules to system defaults
module update Reloads all currently loaded modules
Note: Modules are changed recursively. Some commands, such as module swap, are available to maintain compatibility with scripts using Tcl Environment Modules, but are not necessary since Lmod recursively processes loaded modules and automatically resolves conflicts.

Searching for modules

Modules with dependencies are only available when the underlying dependencies, such as compiler families, are loaded. Thus, module avail will only display modules that are compatible with the current state of the environment. To search the entire hierarchy across all possible dependencies, the spider sub-command can be used as summarized in the following table.

Command Description
module spider Shows the entire possible graph of modules
module spider <modulename> Searches for modules named <modulename> in the graph of possible modules
module spider <modulename>/<version> Searches for a specific version of <modulename> in the graph of possible modules
module spider <string> Searches for modulefiles containing <string>
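A sketch of a typical search-then-load workflow, using placeholder names in the same style as the table above:

  summit$ module avail <modulename>               # shows only versions compatible with the currently loaded compiler
  summit$ module spider <modulename>/<version>    # reports which modules (e.g. a compiler) must be loaded first
  summit$ module load <compiler> <modulename>     # load the dependency, then the package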

 

Defining custom module collections

Lmod supports caching commonly used collections of environment modules on a per-user basis in $HOME/.lmod.d. To create a collection called “NAME” from the currently loaded modules, simply call module save NAME. Omitting “NAME” will set the user’s default collection. Saved collections can be recalled and examined with the commands summarized in the following table.

Command Description
module restore NAME Recalls a specific saved user collection titled “NAME”
module restore Recalls the user-defined defaults
module reset Resets loaded modules to system defaults
module restore system Recalls the system defaults
module savelist Shows the list of user-defined saved collections

Note:
You should use unique names when creating collections to specify the application (and possibly branch) you are working on. For example, `app1-development`, `app1-production`, and `app2-production`.
Note:
In order to avoid conflicts between user-defined collections on multiple compute systems that share a home file system (e.g. /ccs/home/[userid]), Lmod appends the hostname of each system to the files saved in your ~/.lmod.d directory (using the environment variable LMOD_SYSTEM_NAME). This ensures that only collections appended with the name of the current system are visible.
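For example, following the naming advice above, a collection for one application branch could be saved and later restored as follows:

  summit$ module save app1-development       # save the currently loaded modules as a named collection
  summit$ module restore app1-development    # later, restore that collection in a new shell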

The following screencast shows an example of setting up user-defined module collections on Summit.

Compiling

Compilers

Available Compilers

The following compilers are available on Summit:
XL: IBM XL Compilers (loaded by default)
LLVM: LLVM compiler infrastructure
PGI: Portland Group compiler suite
GNU: GNU Compiler Collection
NVCC: CUDA C compiler

Upon login, the default versions of the XL compiler suite and Spectrum Message Passing Interface (MPI) are added to each user’s environment through the modules system. No changes to the environment are needed to make use of the defaults.

Multiple versions of each compiler family are provided, and can be inspected using the modules system:

summit$ module -t avail pgi
/sw/summit/modulefiles/site/linux-rhel7-ppc64le/Core:
pgi/17.10-patched
pgi/18.3
pgi/18.4
pgi/18.5
pgi/18.7

C compilation

Note: type char is unsigned by default
Vendor Module Compiler Version Enable C99 Enable C11 Default signed char Define macro
IBM xl xlc, xlc_r 13.1.6 -std=gnu99 -std=gnu11 -qchar=signed -WF,-D
GNU system default gcc 4.8.5 -std=gnu99 -std=gnu11 -fsigned-char -D
GNU gcc gcc 6.4.0 -std=gnu99 -std=gnu11 -fsigned-char -D
LLVM llvm clang 3.8.0 default -std=gnu11 -fsigned-char -D
PGI pgi pgcc 17.10 -c99 -c11 -Mschar -D

C++ compilations

Note: type char is unsigned by default
Vendor Module Compiler Version Enable C++11 Enable C++14 Default signed char Define macro
IBM xl xlc++, xlc++_r 13.1.6 -std=gnu++11 -std=gnu++1y (PARTIAL) -qchar=signed -WF,-D
GNU system default g++ 4.8.5 -std=gnu++11 -std=gnu++1y -fsigned-char -D
GNU gcc g++ 6.4.0 -std=gnu++11 -std=gnu++1y -fsigned-char -D
LLVM llvm clang++ 3.8.0 -std=gnu++11 -std=gnu++1y -fsigned-char -D
PGI pgi pgc++ 17.10 -std=c++11 –gnu_extensions -std=c++14 –gnu_extensions -Mschar -D

Fortran compilation

Vendor Module Compiler Version Enable F90 Enable F2003 Enable F2008 Define macro
IBM xl xlf, xlf90, xlf95, xlf2003, xlf2008 15.1.6 -qlanglvl=90std -qlanglvl=2003std -qlanglvl=2008std -WF,-D
GNU system default gfortran 4.8.5, 6.4.0 -std=f90 -std=f2003 -std=f2008 -D
LLVM llvm xlflang 3.8.0 n/a n/a n/a -D
PGI pgi pgfortran 17.10 use .F90 source file suffix use .F03 source file suffix use .F08 source file suffix -D
Note: The xlflang module currently conflicts with the clang module. This restriction is expected to be lifted in future releases.
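As an illustration of the flags in the tables above, the following hedged invocations compile a C11 source file with the IBM XL C compiler and a Fortran 2008 source file with strict standard checking (file names are placeholders):

  summit$ xlc_r -std=gnu11 -o my_app my_app.c               # IBM XL C, C11 language level, thread safe
  summit$ xlf2008 -qlanglvl=2008std -o my_app my_app.f90    # IBM XL Fortran, strict Fortran 2008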

MPI

MPI on Summit is provided by IBM Spectrum MPI. Spectrum MPI provides compiler wrappers that automatically choose the proper compiler to build your application.

The following compiler wrappers are available:
C: mpicc
C++: mpic++, mpiCC
Fortran: mpifort, mpif77, mpif90

While these wrappers conveniently abstract away linking of Spectrum MPI, it’s sometimes helpful to see exactly what’s happening when invoked. The --showme flag will display the full link lines, without actually compiling:

summit$ mpicc --showme
/sw/summit/xl/16.1.1-beta6/xlC/16.1.1/bin/xlc -I/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20171006/linux-rhel7-ppc64le/xl-16.1.1-beta6/spectrum-mpi-10.2.0.7-20180830-eyo7zxm2piusmyffr3iytmgwdacl67ju/include -pthread -L/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20171006/linux-rhel7-ppc64le/xl-16.1.1-beta6/spectrum-mpi-10.2.0.7-20180830-eyo7zxm2piusmyffr3iytmgwdacl67ju/lib -lmpiprofilesupport -lmpi_ibm

OpenMP

Note: When using OpenMP with IBM XL compilers, the thread-safe compiler variant is required; these variants have the same name as the non-thread-safe compilers with an additional _r suffix, e.g. to compile OpenMP C code one would use xlc_r
Note: OpenMP offloading support is still under active development. Performance and debugging capabilities in particular are expected to improve as the implementations mature.
Vendor OpenMP 3.1 Support Enable OpenMP 4.x Support Enable OpenMP 4.x Offload
IBM FULL -qsmp=omp PARTIAL -qsmp=omp -qoffload
GNU FULL -fopenmp PARTIAL -fopenmp
clang FULL -fopenmp PARTIAL -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda --cuda-path=${OLCF_CUDA_ROOT}
xlflang FULL -fopenmp PARTIAL -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda
PGI FULL -mp NONE NONE
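For example, the table above suggests the following hedged command lines for building an OpenMP code with GPU offload (the source file name is a placeholder):

  summit$ xlc_r -qsmp=omp -qoffload -o offload_test offload_test.c
  summit$ clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda --cuda-path=${OLCF_CUDA_ROOT} -o offload_test offload_test.c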

OpenACC

Vendor Module OpenACC Support Enable OpenACC
IBM xl NONE NONE
GNU system default NONE NONE
GNU gcc 2.5 -fopenacc
LLVM clang or xlflang NONE NONE
PGI pgi 2.5 -acc, -ta=nvidia:cc70
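A hedged example of compiling an OpenACC code with the PGI compiler, using the flags from the table above (the file name is a placeholder):

  summit$ pgcc -acc -ta=nvidia:cc70 -o acc_test acc_test.c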

CUDA compilation

NVIDIA

CUDA C/C++ support is provided through the cuda module.
nvcc : Primary CUDA C/C++ compiler

Language support

-std=c++11 : provide C++11 support
--expt-extended-lambda : provide experimental host/device lambda support
--expt-relaxed-constexpr : provide experimental host/device constexpr support

Compiler support

NVCC currently supports XL, GCC, and PGI C++ backends.
--ccbin : set to host compiler location
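Putting these options together, a sketch of a CUDA C++ compilation using the XL compiler as the host backend (the file name is a placeholder):

  summit$ module load cuda
  summit$ nvcc -std=c++11 --expt-extended-lambda -ccbin xlc++_r -o saxpy saxpy.cu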

CUDA Fortran compilation

IBM

The IBM compiler suite is made available through the xl module, which is loaded by default; the cuda module is also required.
xlcuf : primary CUDA Fortran compiler, thread safe
Language support flags:
-qlanglvl=90std : provide Fortran 90 support
-qlanglvl=95std : provide Fortran 95 support
-qlanglvl=2003std : provide Fortran 2003 support
-qlanglvl=2008std : provide Fortran 2008 support

PGI

The PGI compiler suite is available through the pgi module.
pgfortran : Primary Fortran compiler with CUDA Fortran support

Language support:

Files with the .cuf suffix are automatically compiled with CUDA Fortran support
For other Fortran source files, the suffix determines the language standard used; see the man page for full details
-Mcuda : Enable CUDA Fortran on provided source file
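
For example, a hypothetical CUDA Fortran source file named mykernel.cuf is picked up automatically, while a standard .f90 file needs the -Mcuda flag:

summit$ module load pgi
summit$ pgfortran -o mykernel mykernel.cuf
summit$ pgfortran -Mcuda -o mycode mycode.f90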

Linking in Libraries

OLCF systems provide hundreds of software packages and scientific libraries pre-installed at the system level for users to take advantage of. In order to link these libraries into an application, users must direct the compiler to their location. The module show command can be used to determine the location of a particular library. For example:

summit$ module show essl
------------------------------------------------------------------------------------
   /sw/summit/modulefiles/core/essl/6.1.0-1:
------------------------------------------------------------------------------------
whatis("ESSL 6.1.0-1 ")
prepend_path("LD_LIBRARY_PATH","/sw/summit/essl/6.1.0-1/essl/6.1/lib64")
append_path("LD_LIBRARY_PATH","/sw/summit/xl/16.1.1-beta4/lib")
prepend_path("MANPATH","/sw/summit/essl/6.1.0-1/essl/6.1/man")
setenv("OLCF_ESSL_ROOT","/sw/summit/essl/6.1.0-1/essl/6.1")
help([[ESSL 6.1.0-1

]])

When this module is loaded, the $OLCF_ESSL_ROOT environment variable holds the path to the ESSL installation, which contains the lib64/ and include/ directories:

summit$ module load essl
summit$ echo $OLCF_ESSL_ROOT
/sw/summit/essl/6.1.0-1/essl/6.1
summit$ ls $OLCF_ESSL_ROOT
FFTW3  READMES  REDIST.txt  include  iso-swid  ivps  lap  lib64  man  msg
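
As a sketch (assuming a hypothetical C source file mycode.c that calls ESSL routines), the library can then be linked by pointing the compiler at these directories:

summit$ xlc_r mycode.c -I${OLCF_ESSL_ROOT}/include -L${OLCF_ESSL_ROOT}/lib64 -lessl -o mycode

The exact library name passed to -l is an assumption here; check the contents of lib64/ for the available variants (ESSL also provides SMP-parallel builds, for example).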


Running Jobs

As is the case on other OLCF systems, computational work on Summit is performed within jobs. A typical job consists of several components:

  • A submission script
  • An executable
  • Input files needed by the executable
  • Output files created by the executable

In general, the process for running a job is to:

  1. Prepare executables and input files
  2. Write the batch script
  3. Submit the batch script
  4. Monitor the job’s progress before and during execution

The following sections will provide more information regarding running jobs on Summit. Summit uses IBM Spectrum Load Sharing Facility (LSF) as the batch scheduling system.

Login, Launch, and Compute Nodes

Recall from the System Overview section that Summit has three types of nodes: login, launch, and compute. When you log into the system, you are placed on a login node. When your batch scripts or interactive jobs run, the resulting shell will run on a launch node. Compute nodes are accessed via the jsrun command. The jsrun command should only be issued from within an LSF job (either batch or interactive) on a launch node. Otherwise, you will not have any compute nodes allocated and your parallel job will run on the login node. If this happens, your job will interfere with (and be interfered with by) other users’ login node tasks.

Batch Scripts

The most common way to interact with the batch system is via batch jobs. A batch job is simply a shell script with added directives to request various resources from or provide certain information to the batch scheduling system. Aside from the lines containing LSF options, the batch script is simply the series of commands needed to set up and run your job.

To submit a batch script, use the bsub command:
bsub myjob.lsf

If you’ve previously used LSF, you’re probably used to submitting a job with input redirection (i.e. bsub < myjob.lsf).
This is not needed (and will not work) on Summit.
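
For example, submitting a hypothetical script named myjob.lsf returns a confirmation containing the job ID assigned by LSF, similar to:

summit$ bsub myjob.lsf
Job <123456> is submitted to default queue <batch>.

The job ID (123456 in this illustration) is what you would later pass to the monitoring and signaling commands described below.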

As an example, consider the following batch script:

1.	#!/bin/bash
2.	# Begin LSF Directives
3.	#BSUB -P ABC123
4.	#BSUB -W 3:00
5.	#BSUB -nnodes 2048
6.	#BSUB -alloc_flags gpumps
7.	#BSUB -J RunSim123
8.	#BSUB -o RunSim123.%J
9.	#BSUB -e RunSim123.%J
10.	 
11.	cd $MEMBERWORK/abc123
12.	cp $PROJWORK/abc123/RunData/Input.123 ./Input.123
13.	date
14.	jsrun -n 4096 -r 2 -a 12 -g 3 ./a.out
15.	cp my_output_file /ccs/proj/abc123/Output.123
Line # Option Description
1 Shell specification. This script will run with bash as the shell
2 Comment line
3 Required The job will be charged to the ABC123 project
4 Required Maximum walltime for the job is 3 hours
5 Required The job will use 2,048 nodes
6 Optional Enable GPU Multi-Process Service
7 Optional The name of the job is RunSim123
8 Optional Write standard output to a file named RunSim123.#, where # is the job ID assigned by LSF
9 Optional Write standard error to a file named RunSim123.#, where # is the job ID assigned by LSF
10 Blank line
11 Change into one of the scratch filesystems
12 Copy input files into place
13 Run the date command to write a timestamp to the standard output file
14 Run the executable
15 Copy output files from the scratch area into a more permanent location

Interactive Jobs

Most users will find batch jobs to be the easiest way to interact with the system, since they permit you to hand off a job to the scheduler and then work on other tasks; however, it is sometimes preferable to run interactively on the system. This is especially true when developing, modifying, or debugging a code.

Since all compute resources are managed/scheduled by LSF, it is not possible to simply log into the system and begin running a parallel code interactively. You must request the appropriate resources from the system and, if necessary, wait until they are available. This is done with an “interactive batch” job. Interactive batch jobs are submitted via the command line, which supports the same options that are passed via #BSUB parameters in a batch script. The final options on the command line are what makes the job “interactive batch”: -Is followed by a shell name. For example, to request an interactive batch job (with bash as the shell) equivalent to the sample batch script above, you would use the command:
bsub -W 3:00 -nnodes 2048 -P ABC123 -Is /bin/bash

Common bsub Options

The table below summarizes options for submitted jobs. Unless otherwise noted, these can be used from batch scripts or interactive jobs. For interactive jobs, the options are simply added to the bsub command line. For batch scripts, they can either be added on the bsub command line or they can appear as a #BSUB directive in the batch script. If conflicting options are specified (i.e. different walltime specified on the command line versus in the script), the option on the command line takes precedence. Note that LSF has numerous options; only the most common ones are described here. For more in-depth information about other LSF options, see the documentation.

Option Example Usage Description
-W #BSUB -W 50 Requested maximum walltime.

NOTE: The format is [hours:]minutes, not [[hours:]minutes:]seconds like PBS/Torque/Moab

-nnodes #BSUB -nnodes 1024 Number of nodes

NOTE: This option is specified with only one hyphen (i.e. -nnodes, not --nnodes)

-P #BSUB -P ABC123 Specifies the project to which the job should be charged
-o #BSUB -o jobout.%J File into which job STDOUT should be directed (%J will be replaced with the job ID number)

If you do not also specify a STDERR file with -e or -eo, STDERR will also be written to this file.

-e #BSUB -e jobout.%J File into which job STDERR should be directed (%J will be replaced with the job ID number)
-J #BSUB -J MyRun123 Specifies the name of the job (if not present, LSF will use the name of the job script as the job’s name)
-w #BSUB -w ended(12345) Place a dependency on the job
-N #BSUB -N Send a job report via email when the job completes
-XF #BSUB -XF Use X11 forwarding
-alloc_flags #BSUB -alloc_flags "gpumps smt1" Used to request GPU Multi-Process Service (MPS) and to set SMT (Simultaneous Multithreading) levels. Only one “#BSUB -alloc_flags” directive is recognized, so multiple alloc_flags options need to be enclosed in quotes and space-separated.

Setting gpumps enables NVIDIA’s Multi-Process Service, which allows multiple MPI ranks to simultaneously access a GPU.

Setting smtn (where n is 1, 2, or 4) sets different SMT levels. To run with 2 hardware threads per physical core, you’d use smt2. The default level is smt4.

Batch Environment Variables

LSF provides a number of environment variables in your job’s shell environment. Many job parameters are stored in environment variables and can be queried within the batch job. Several of these variables are summarized in the table below. This is not an all-inclusive list of variables available to your batch job; in particular only LSF variables are discussed, not the many “standard” environment variables that will be available (such as $PATH).

Variable Description
LSB_JOBID The ID assigned to the job by LSF
LS_JOBPID The job’s process ID
LSB_JOBINDEX The job’s index (if it belongs to a job array)
LSB_HOSTS The hosts assigned to run the job
LSB_QUEUE The queue from which the job was dispatched
LSB_INTERACTIVE Set to “Y” for an interactive job; otherwise unset
LS_SUBCWD The directory from which the job was submitted

Job States

A job will progress through a number of states through its lifetime. The states you’re most likely to see are:

State Description
PEND Job is pending
RUN Job is running
DONE Job completed normally (with an exit code of 0)
EXIT Job completed abnormally
PSUSP Job was suspended (either by the user or an administrator) while pending
USUSP Job was suspended (either by the user or an administrator) after starting
SSUSP Job was suspended by the system after starting

Scheduling Policy

In a simple batch queue system, jobs run in a first-in, first-out (FIFO) order. This often does not make effective use of the system. A large job may be next in line to run. If the system is using a strict FIFO queue, many processors sit idle while the large job waits to run. Backfilling would allow smaller, shorter jobs to use those otherwise idle resources, and with the proper algorithm, the start time of the large job would not be delayed. While this does make more effective use of the system, it indirectly encourages the submission of smaller jobs.

The DOE Leadership-Class Job Mandate

As a DOE Leadership Computing Facility, the OLCF has a mandate that a large portion of Summit’s usage come from large, leadership-class (aka capability) jobs. To ensure the OLCF complies with DOE directives, we strongly encourage users to run jobs on Summit that are as large as their code will warrant. To that end, the OLCF implements queue policies that enable large jobs to run in a timely fashion.

Note: The OLCF implements queue policies that encourage the submission and timely execution of large, leadership-class jobs on Summit.

The basic priority-setting mechanism for jobs waiting in the queue is the time a job has been waiting relative to other jobs in the queue.

If your jobs require resources outside these queue policies, please complete the relevant request form on the Special Requests page. If you have any questions or comments on the queue policies below, please direct them to the User Assistance Center.

Job Priority by Processor Count

Jobs are aged according to the job’s requested processor count (older age equals higher queue priority). Each job’s requested processor count places it into a specific bin. Each bin has a different aging parameter, which all jobs in the bin receive.

Bin Min Nodes Max Nodes Max Walltime (Hours) Aging Boost (Days)
1 2,765 4,608 24.0 15
2 922 2,764 24.0 10
3 92 921 12.0 0
4 46 91 6.0 0
5 1 45 2.0 0
batch Queue Policy

The batch queue is the default queue for production work on Summit. Most work on Summit is handled through this queue. It enforces the following policies:

  • Limit of (2) eligible-to-run jobs per user.
  • Jobs in excess of the per user limit above will be placed into a held state, but will change to eligible-to-run at the appropriate time.
  • Users may have only (100) jobs queued in any state at any time. Additional jobs will be rejected at submit time.
Note: The eligible-to-run state is not the running state. Eligible-to-run jobs have not started and are waiting for resources. Running jobs are actually executing.

Allocation Overuse Policy

Projects that overrun their allocation are still allowed to run on OLCF systems, although at a reduced priority. Like the adjustment for the number of processors requested above, this is an adjustment to the apparent submit time of the job. However, this adjustment has the effect of making jobs appear much younger than jobs submitted under projects that have not exceeded their allocation. In addition to the priority change, these jobs are also limited in the amount of wall time that can be used. For example, consider that job1 is submitted at the same time as job2. The project associated with job1 is over its allocation, while the project for job2 is not. The batch system will consider job2 to have been waiting for a longer time than job1. Additionally, projects that are at 125% of their allocated time will be limited to only one running job at a time. The adjustment to the apparent submit time depends upon the percentage that the project is over its allocation, as shown in the table below:

% Of Allocation Used Priority Reduction
< 100% 0 days
100% to 125% 30 days
> 125% 365 days

System Reservation Policy

Projects may request to reserve a set of processors for a period of time through the reservation request form, which can be found on the Special Requests page. If the reservation is granted, the reserved processors will be blocked from general use for a given period of time. Only users that have been authorized to use the reservation can utilize those resources. Since no other users can access the reserved resources, it is crucial that groups given reservations take care to ensure the utilization on those resources remains high. To prevent reserved resources from remaining idle for an extended period of time, reservations are monitored for inactivity. If activity falls below 50% of the reserved resources for more than (30) minutes, the reservation will be canceled and the system will be returned to normal scheduling. A new reservation must be requested if this occurs. Since a reservation makes resources unavailable to the general user population, projects that are granted reservations will be charged (regardless of their actual utilization) a CPU-time equivalent to (# of cores reserved) * (length of reservation in hours).


Job Dependencies

As is the case with many other queuing systems, it is possible to place dependencies on jobs to prevent them from running until other jobs have started/completed/etc. Several possible dependency settings are described in the table below:

Expression Meaning
#BSUB -w started(12345) The job will not start until job 12345 starts. Job 12345 is considered to have started if it is in any of the following states: USUSP, SSUSP, DONE, EXIT or RUN (with any pre-execution command specified by bsub -E completed)
#BSUB -w done(12345)
#BSUB -w 12345
The job will not start until job 12345 has a state of DONE (i.e. completed normally). If a job ID is given with no condition, done() is assumed.
#BSUB -w exit(12345) The job will not start until job 12345 has a state of EXIT (i.e. completed abnormally)
#BSUB -w ended(12345) The job will not start until job 12345 has a state of EXIT or DONE

Dependency expressions can be combined with logical operators. For example, if you want a job held until job 12345 is DONE and job 12346 has started, you can use #BSUB -w "done(12345) && started(12346)"
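
For instance, a simple two-step workflow might submit a production run and a post-processing job that should only start after the first completes successfully. Assuming hypothetical scripts run.lsf and post.lsf, and that the first submission returns job ID 12345:

summit$ bsub run.lsf
Job <12345> is submitted to default queue <batch>.
summit$ bsub -w "done(12345)" post.lsf

The second job will remain pending until job 12345 reaches the DONE state.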

Monitoring Jobs

LSF provides several utilities with which you can monitor jobs. These include monitoring the queue, getting details about a particular job, viewing STDOUT/STDERR of running jobs, and more.

The most straightforward monitoring is with the bjobs command. This command will show the current queue, including both pending and running jobs. Running bjobs -l will provide much more detail about a job (or group of jobs). For detailed output of a single job, specify the job id after the -l. For example, for detailed output of job 12345, you can run bjobs -l 12345. Other options to bjobs are shown below. In general, if the command is specified with -u all it will show information for all users/all jobs. Without that option, it only shows your jobs. Note that this is not an exhaustive list. See man bjobs for more information.

Command Description
bjobs Show your current jobs in the queue
bjobs -u all Show currently queued jobs for all users
bjobs -P ABC123 Shows currently-queued jobs for project ABC123
bjobs -UF Don’t format output (might be useful if you’re using the output in a script)
bjobs -a Show jobs in all states, including recently finished jobs
bjobs -l Show long/detailed output
bjobs -l 12345 Show long/detailed output for job 12345
bjobs -d Show details for recently completed jobs
bjobs -s Show suspended jobs, including the reason(s) they’re suspended
bjobs -r Show running jobs
bjobs -p Show pending jobs
bjobs -w Use “wide” formatting for output

If you want to check the STDOUT/STDERR of a currently running job, you can do so with the bpeek command. The command supports several options:

Command Description
bpeek -J jobname Show STDOUT/STDERR for the job you’ve most recently submitted with the name jobname
bpeek 12345 Show STDOUT/STDERR for job 12345
bpeek -f ... Used with other options. Makes bpeek use tail -f and exit once the job completes.

Interacting With Jobs

Sometimes it’s necessary to interact with a batch job after it has been submitted. LSF provides several commands for interacting with already-submitted jobs.

Many of these commands can operate on either one job or a group of jobs. In general, they only operate on the most recently submitted job that matches other criteria provided unless “0” is specified as the job id.

Suspending and Resuming Jobs

LSF supports user-level suspension and resumption of jobs. Jobs are suspended with the bstop command and resumed with the bresume command. The simplest way to invoke these commands is to list the job id to be suspended/resumed:

bstop 12345
bresume 12345

Instead of specifying a job id, you can specify other criteria that will allow you to suspend some/all jobs that meet other criteria such as a job name, a queue name, etc. These are described in the manpages for bstop and bresume.
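
For example, since a job id of “0” applies the command to all jobs that match the other criteria, the following would suspend and later resume every one of your jobs with the (hypothetical) name sim_run:

bstop -J sim_run 0
bresume -J sim_run 0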

Signaling Jobs

You can send signals to jobs with the bkill command. While the command name suggests its only purpose is to terminate jobs, this is not the case. Similar to the kill command found in Unix-like operating systems, this command can be used to send various signals (not just SIGTERM and SIGKILL) to jobs. The command can accept both numbers and names for signals. For a list of accepted signal names, run bkill -l. Common ways to invoke the command include:

Command Description
bkill 12345 Force a job to stop by sending SIGINT, SIGTERM, and SIGKILL. These signals are sent in that order, so users can write applications such that they will trap SIGINT and/or SIGTERM and exit in a controlled manner.
bkill -s USR1 12345 Send SIGUSR1 to job 12345

NOTE: When specifying a signal by name, omit SIG from the name. Thus, you specify USR1 and not SIGUSR1 on the bkill command line.

bkill -s 9 12345 Send signal 9 to job 12345

Like bstop and bresume, the bkill command also supports identifying the job(s) to be signaled by criteria other than the job id. These include some/all jobs with a given name, in a particular queue, etc. See man bkill for more information.

Checkpointing Jobs

LSF documentation mentions the bchkpnt and brestart commands for checkpointing and restarting jobs, as well as the -k option to bsub for configuring checkpointing. Since checkpointing is very application specific and a wide range of applications run on OLCF resources, this type of checkpointing is not configured on Summit. If you wish to use checkpointing (which is highly encouraged), you’ll need to configure it within your application.

If you wish to implement some form of on-demand checkpointing, keep in mind the bkill command is really a signaling command and you can have your job script/application checkpoint as a response to certain signals (such as SIGUSR1).
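
A minimal bash sketch of that pattern is shown below. It assumes that jsrun forwards SIGUSR1 to the application ranks and that the application itself contains the logic to write a checkpoint when the signal arrives; neither is guaranteed for every code, so treat this as a starting point rather than a recipe.

# Launch the application in the background and remember the jsrun PID
jsrun -n 12 -a 1 -g 1 -c 1 ./a.out &
APP_PID=$!

# On SIGUSR1 (e.g. sent with: bkill -s USR1 <jobid>), forward the signal to jsrun
trap 'kill -USR1 ${APP_PID}' USR1

# wait returns early when a trapped signal arrives, so loop until the
# application has actually exited
while kill -0 ${APP_PID} 2>/dev/null; do
    wait ${APP_PID}
done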

Other LSF Commands

The table below summarizes some additional LSF commands that might be useful.

Command Description
bparams -a Show current parameters for LSF

The behavior/available options for some LSF commands depend on settings in various configuration files. This command shows those settings without having to search for the actual files.

bjdepinfo Show job dependency information (could be useful in determining what job is keeping another job in a pending state)

PBS/Torque/MOAB-to-LSF Translation

More details about these commands are given elsewhere in this section; the table below is simply for your convenience in looking up various LSF commands.

Users of other OLCF resources are likely familiar with PBS-like commands which are used by the Torque/Moab instances on other systems. The table below summarizes the equivalent LSF command for various PBS/Torque/Moab commands.

LSF Command PBS/Torque/Moab Command Description
bsub job.sh qsub job.sh Submit the job script job.sh to the batch system
bsub -Is /bin/bash qsub -I Submit an interactive batch job
bjobs -u all qstat
showq
Show jobs currently in the queue

NOTE: without the -u all argument, bjobs will only show your jobs

bjobs -l checkjob Get information about a specific job
bjobs -d showq -c Get information about completed jobs
bjobs -p showq -i
showq -b
checkjob
Get information about pending jobs
bjobs -r showq -r Get information about running jobs
bkill qsig Send a signal to a job
bkill qdel Terminate/Kill a job
bstop qhold Hold a job/stop a job from running
bresume qrls Release a held job
bqueues qstat -q Get information about queues
bjdepinfo checkjob Get information about job dependencies

The table below shows LSF (bsub) command-line/batch script options and the PBS/Torque/Moab (qsub) options that provide similar functionality.

LSF Option PBS/Torque/Moab Option Description
#BSUB -W 60 #PBS -l walltime=1:00:00 Request a walltime of 1 hour
#BSUB -nnodes 1024 #PBS -l nodes=1024 Request 1024 nodes
#BSUB -P ABC123 #PBS -A ABC123 Charge the job to project ABC123
#BSUB -alloc_flags gpumps No equivalent (set via environment variable) Enable multiple MPI tasks to simultaneously access a GPU

Easy Mode vs. Expert Mode

The Cluster System Management (CSM) component of the job launch environment supports two methods of job submission, termed “easy” mode and “expert” mode. The difference in the modes is where the responsibility for creating the LSF resource string is placed. In easy mode, the system software converts options such as -nnodes in a batch script into the resource string needed by the scheduling system. In expert mode, the user is responsible for creating this string and options such as -nnodes cannot be used.

In easy mode, you will not be able to use bsub -R to create resource strings. The system will automatically create the resource string based on your other bsub options. In expert mode, you will be able to use -R, but you will not be able to use the following options to bsub: -ln_slots, -ln_mem, -cn_cu, or -nnodes.

Most users will want to use easy mode. However, if you need precise control over your job’s resources, such as placement on (or avoidance of) specific nodes, you will need to use expert mode. To use expert mode, add #BSUB -csm y to your batch script (or -csm y to your bsub command line).

Hardware Threads

Hardware threads are a feature of the POWER9 processor through which individual physical cores can support multiple execution streams, essentially looking like one or more virtual cores (similar to hyperthreading on some Intel® microprocessors). This feature is often called Simultaneous Multithreading or SMT. The POWER9 processor on Summit supports SMT levels of 1, 2, or 4, meaning (respectively) each physical core looks like 1, 2, or 4 virtual cores. The SMT level is controlled by the -alloc_flags option to bsub. For example, to set the SMT level to 2, add the line
#BSUB -alloc_flags smt2 to your batch script or add the option -alloc_flags smt2 to your bsub command line.

The default SMT level is 4.

System Service Core Isolation

One core per socket is set aside for system service tasks. The cores are not available to jsrun. When listing available resources through jsrun, you will not see cores with hyperthreads 84-87 and 172-175. Isolating a socket’s system services to a single core helps to reduce jitter and improve performance of tasks performed on the socket’s remaining cores.

The isolated core always operates at SMT4 regardless of the batch job’s SMT level.

GPFS System Service Isolation

By default, GPFS system service tasks are forced onto only the isolated cores. This can be overridden at the batch job level using the maximizegpfs argument to LSF’s alloc_flags. For example:

 #BSUB -alloc_flags maximizegpfs

The maximizegpfs flag will allow GPFS tasks to utilize any core on the compute node. This may be beneficial because it provides more resources for GPFS service tasks, but it may also cause resource contention for the jsrun compute job.

MPS

The Multi-Process Service (MPS) enables multiple processes (e.g. MPI ranks) to concurrently share the resources on a single GPU. This is accomplished by starting an MPS server process, which funnels the work from multiple CUDA contexts (e.g. from multiple MPI ranks) into a single CUDA context. In some cases, this can increase performance due to better utilization of the resources. As mentioned in the Common bsub Options section above, MPS can be enabled with the -alloc_flags "gpumps" option to bsub.

Other Notes

Compute nodes are only allocated to one job at a time; they are not shared. This is why users request nodes (instead of some other resource such as cores or GPUs) in batch jobs and is why projects are charged based on the number of nodes allocated multiplied by the amount of time for which they were allocated. Thus, a job using only 1 core on each of its nodes is charged the same as a job using every core and every GPU on each of its nodes.




Job Launcher (jsrun)

The default job launcher for Summit is jsrun. jsrun was developed by IBM for the Oak Ridge and Livermore Power systems. The tool will execute a given program on resources allocated through the LSF batch scheduler, similar in functionality to mpirun and aprun.

Compute Node Description

The following compute node image will be used to discuss jsrun resource sets and layout.

  • 1 node
  • 2 sockets (grey)
  • 42 physical cores* (dark blue)
  • 168 hardware threads (light blue)
  • 6 GPUs (orange)
  • 2 Memory blocks (yellow)

*Core Isolation: 1 core on each socket has been set aside for overhead and is not available for allocation through jsrun. The core has been omitted and is not shown in the above image.

Resource Sets

While jsrun performs job launching functions similar to aprun and mpirun, its syntax is very different. A large reason for the syntax differences is the introduction of the resource set concept. Through resource sets, jsrun can control how a node appears to each job. Users can, through jsrun command line flags, control which resources on a node are visible to a job. Resource sets also make it possible to run multiple jsruns simultaneously within a node. Under the covers, a resource set is a cgroup.

At a high level, a resource set allows users to configure what a node looks like to their job.

Jsrun will create one or more resource sets within a node. Each resource set will contain 1 or more cores and 0 or more GPUs. One or more resource sets can be created on a single node, and a resource set can span sockets within a node, but it may not span nodes. While a resource set can span sockets, consideration should be given to the cost of cross-socket communication; creating resource sets that stay within a socket avoids this cost.

Subdividing a Node with Resource Sets

Resource sets provide the ability to subdivide a node’s resources into smaller groups. The following examples show how a node can be subdivided and how many resource sets could fit on a node.

Multiple Methods to Creating Resource Sets

Resource sets should be created to fit code requirements. The following examples show multiple ways to create resource sets that allow two MPI tasks access to a single GPU.

  1. 6 resource sets per node: 1 GPU, 2 cores per resource set (similar to a Titan node)

    In this case, each pair of cores can only see its single assigned GPU.
  2. 2 resource sets per node: 3 GPUs and 6 cores per socket

    In this case, all 6 cores can see 3 GPUs. Code must manage CPU -> GPU communication. Cores on socket0 cannot access GPUs or memory on socket1.
  3. Single resource set per node: 6 GPUs, 12 cores

    In this case, all 12 cores can see all 6 of the node’s GPUs. Code must manage CPU to GPU communication. Cores on socket0 can access GPUs and memory on socket1. Code must manage cross-socket communication.

Designing a Resource Set

Resource sets allow each jsrun to control how the node appears to a code. This method is unique to jsrun and requires thinking about each job launch differently than with aprun or mpirun. While the method is unique, it is not complicated and can be reasoned through in a few basic steps.

The first step in creating resource sets is understanding how a code would like the node to appear: for example, how many tasks and threads it runs per GPU. Once this is understood, the next step is simply to calculate the number of resource sets that can fit on a node. From there, the number of nodes needed can be calculated and passed to the batch job request.

The basic steps to creating resource sets:

1) Understand how your code expects to interact with the system.
How many tasks/threads per GPU?
Does each task expect to see a single GPU? Do multiple tasks expect to share a GPU? Is the code written to internally manage task to GPU workload based on the number of available cores and GPUs?
2) Create resource sets containing the needed GPU to task binding
Based on how your code expects to interact with the system, you can create resource sets containing the needed GPU and core resources.
If a code expects to utilize one GPU per task, a resource set would contain one core and one GPU. If a code expects to pass work to a single GPU from two tasks, a resource set would contain two cores and one GPU.
3) Decide on the number of resource sets needed
Once you understand tasks, threads, and GPUs in a resource set, you simply need to decide the number of resource sets needed.

As on Titan, it is useful to keep the general layout of a node in mind when laying out resource sets; a short worked example follows.
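
For example (an illustrative calculation only): suppose a hypothetical code wants 2 MPI tasks sharing each GPU and needs 600 tasks in total. Each resource set would then contain 2 tasks, 2 cores, and 1 GPU; 6 such resource sets fit on a node (12 tasks per node), so 300 resource sets and 50 nodes are required:

#BSUB -nnodes 50
jsrun -n300 -r6 -a2 -c2 -g1 ./a.out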

Launching a Job with jsrun

jsrun Format

  jsrun    [ -n #resource sets ]   [tasks, threads, and GPUs within each resource set]   program [ program args ] 

Common jsrun Options

Below are common jsrun options. More flags and details can be found in the jsrun man page.

Long Flag Short Flag Description
--nrs -n Number of resource sets
--tasks_per_rs -a Number of MPI tasks (ranks) per resource set
--cpu_per_rs -c Number of CPUs (cores) per resource set.
--gpu_per_rs -g Number of GPUs per resource set
--bind -b Binding of tasks within a resource set. Can be none, rs, or packed:#
--rs_per_host -r Number of resource sets per host
--latency_priority -l Latency Priority. Controls layout priorities. Can currently be cpu-cpu or gpu-cpu
--launch_distribution -d How tasks are started on resource sets

Aprun to jsrun

Mapping aprun commands used on Titan to Summit’s jsrun is only possible in simple single GPU cases. The following table shows some basic single GPU examples that could be executed on Titan or Summit. In the single node examples, each resource set will resemble a Titan node containing a single GPU and one or more cores. Although not required in each case, common jsrun flags (resource set count, GPUs per resource set, tasks per resource set, cores per resource set, binding) are included in each example for reference. The jsrun -n flag can be used to increase the number of resource sets needed. Multiple resource sets can be created on a single node. If each MPI task requires a single GPU, up to 6 resource sets could be created on a single node.

GPUs per Task MPI Tasks Threads per Task aprun jsrun
1 1 0 aprun -n1 jsrun -n1 -g1 -a1 -c1
1 2 0 aprun -n2 jsrun -n1 -g1 -a2 -c1
1 1 4 aprun -n1 -d4 jsrun -n1 -g1 -a1 -c4 -bpacked:4
1 2 8 aprun -n2 -d8 jsrun -n1 -g1 -a2 -c16 -bpacked:8


The following example images show how a single-gpu/single-task job would be placed on a single Titan and Summit node. On Summit, the red box represents a resource set created by jsrun. The resource set looks similar to a Titan node, containing a single GPU, a single core, and memory.

Titan Node Summit Node
aprun -n1 jsrun -n1 -g1 -a1 -c1

Because Summit’s nodes are much larger than Titan’s, 6 single-gpu resource sets can be created on a single Summit node. The following image shows how six single-gpu, single-task resource sets would be placed on a node by default. In the example, the command jsrun -n6 -g1 -a1 -c1 is used to create six single-gpu resource sets on the node. Each resource set is indicated by differing colors. Notice that the -n flag is all that changed from the single resource set example above. The -n flag tells jsrun to create six resource sets.

  • jsrun -n 6 -g 1 -a 1 -c 1
  • Starts 6 resource sets, each indicated by differing colors
  • Each resource set contains 1 GPU, 1 Core, and memory
  • The red resource set contains GPU 0 and Core 0
  • The purple resource set contains GPU 3 and Core 84
  • -n 6 tells jsrun how many resource sets to create
  • In this example, each resource set is similar to a single Titan node

 

jsrun Examples

The below examples were launched in the following 2 node interactive batch job:

summit> bsub -nnodes 2 -Pprj123 -W02:00 -Is $SHELL

Single MPI Task, single GPU per RS

The following example will create 12 resource sets each with 1 MPI task and 1 GPU. Each MPI task will have access to a single GPU.

Rank 0 will have access to GPU 0 on the first node ( red resource set). Rank 1 will have access to GPU 1 on the first node ( green resource set). This pattern will continue until 12 resource sets have been created.

The following jsrun command will request 12 resource sets ( -n12 ) 6 per node ( -r6 ). Each resource set will contain 1 MPI task ( -a1 ), 1 GPU ( -g1 ), and 1 core ( -c1 ).

summit> jsrun -n12 -r6 -a1 -g1 -c1 ./a.out
Rank:    0; NumRanks: 12; RankCore:   0; Hostname: h41n04; GPU: 0
Rank:    1; NumRanks: 12; RankCore:   4; Hostname: h41n04; GPU: 1
Rank:    2; NumRanks: 12; RankCore:   8; Hostname: h41n04; GPU: 2
Rank:    3; NumRanks: 12; RankCore:  84; Hostname: h41n04; GPU: 3
Rank:    4; NumRanks: 12; RankCore:  89; Hostname: h41n04; GPU: 4
Rank:    5; NumRanks: 12; RankCore:  92; Hostname: h41n04; GPU: 5

Rank:    6; NumRanks: 12; RankCore:   0; Hostname: h41n03; GPU: 0
Rank:    7; NumRanks: 12; RankCore:   4; Hostname: h41n03; GPU: 1
Rank:    8; NumRanks: 12; RankCore:   8; Hostname: h41n03; GPU: 2
Rank:    9; NumRanks: 12; RankCore:  84; Hostname: h41n03; GPU: 3
Rank:   10; NumRanks: 12; RankCore:  89; Hostname: h41n03; GPU: 4
Rank:   11; NumRanks: 12; RankCore:  92; Hostname: h41n03; GPU: 5

Multiple tasks, single GPU per RS

The following jsrun command will request 12 resource sets ( -n12 ). Each resource set will contain 2 MPI tasks ( -a2 ), 1 GPU ( -g1 ), and 2 cores ( -c2 ). 2 MPI tasks will have access to a single GPU.

Ranks 0 – 1 will have access to GPU 0 on the first node ( red resource set). Ranks 2 – 3 will have access to GPU 1 on the first node ( green resource set). This pattern will continue until 12 resource sets have been created.

Adding cores to the RS: The -c flag should be used to request the needed cores for tasks and threads. The default -c core count is 1. In the above example, if -c is not specified, both tasks will run on a single core.

summit> jsrun -n12 -a2 -g1 -c2 -dpacked ./a.out | sort
Rank:    0; NumRanks: 24; RankCore:   0; Hostname: a01n05; GPU: 0
Rank:    1; NumRanks: 24; RankCore:   4; Hostname: a01n05; GPU: 0

Rank:    2; NumRanks: 24; RankCore:   8; Hostname: a01n05; GPU: 1
Rank:    3; NumRanks: 24; RankCore:  12; Hostname: a01n05; GPU: 1

Rank:    4; NumRanks: 24; RankCore:  16; Hostname: a01n05; GPU: 2
Rank:    5; NumRanks: 24; RankCore:  20; Hostname: a01n05; GPU: 2

Rank:    6; NumRanks: 24; RankCore:  88; Hostname: a01n05; GPU: 3
Rank:    7; NumRanks: 24; RankCore:  92; Hostname: a01n05; GPU: 3

Rank:    8; NumRanks: 24; RankCore:  96; Hostname: a01n05; GPU: 4
Rank:    9; NumRanks: 24; RankCore: 100; Hostname: a01n05; GPU: 4

Rank:   10; NumRanks: 24; RankCore: 104; Hostname: a01n05; GPU: 5
Rank:   11; NumRanks: 24; RankCore: 108; Hostname: a01n05; GPU: 5

Rank:   12; NumRanks: 24; RankCore:   0; Hostname: a01n01; GPU: 0
Rank:   13; NumRanks: 24; RankCore:   4; Hostname: a01n01; GPU: 0

Rank:   14; NumRanks: 24; RankCore:   8; Hostname: a01n01; GPU: 1
Rank:   15; NumRanks: 24; RankCore:  12; Hostname: a01n01; GPU: 1

Rank:   16; NumRanks: 24; RankCore:  16; Hostname: a01n01; GPU: 2
Rank:   17; NumRanks: 24; RankCore:  20; Hostname: a01n01; GPU: 2

Rank:   18; NumRanks: 24; RankCore:  88; Hostname: a01n01; GPU: 3
Rank:   19; NumRanks: 24; RankCore:  92; Hostname: a01n01; GPU: 3

Rank:   20; NumRanks: 24; RankCore:  96; Hostname: a01n01; GPU: 4
Rank:   21; NumRanks: 24; RankCore: 100; Hostname: a01n01; GPU: 4

Rank:   22; NumRanks: 24; RankCore: 104; Hostname: a01n01; GPU: 5
Rank:   23; NumRanks: 24; RankCore: 108; Hostname: a01n01; GPU: 5

summit>

Multiple Task, Multiple GPU per RS

The following example will create 4 resource sets each with 6 tasks and 3 GPUs. Each set of 6 MPI tasks will have access to 3 GPUs.

Ranks 0 – 5 will have access to GPUs 0 – 2 on the first socket of the first node ( red resource set). Ranks 6 – 11 will have access to GPUs 3 – 5 on the second socket of the first node ( green resource set). This pattern will continue until 4 resource sets have been created.

The following jsrun command will request 4 resource sets ( -n4 ). Each resource set will contain 6 MPI tasks ( -a6 ), 3 GPUs ( -g3 ), and 6 cores ( -c6 ).

summit> jsrun -n 4 -a 6 -c 6 -g 3 -d packed -l GPU-CPU ./a.out
Rank:    0; NumRanks: 24; RankCore:   0; Hostname: a33n06; GPU: 0, 1, 2
Rank:    1; NumRanks: 24; RankCore:   4; Hostname: a33n06; GPU: 0, 1, 2
Rank:    2; NumRanks: 24; RankCore:   8; Hostname: a33n06; GPU: 0, 1, 2
Rank:    3; NumRanks: 24; RankCore:  12; Hostname: a33n06; GPU: 0, 1, 2
Rank:    4; NumRanks: 24; RankCore:  16; Hostname: a33n06; GPU: 0, 1, 2 
Rank:    5; NumRanks: 24; RankCore:  20; Hostname: a33n06; GPU: 0, 1, 2
 
Rank:    6; NumRanks: 24; RankCore:  88; Hostname: a33n06; GPU: 3, 4, 5
Rank:    7; NumRanks: 24; RankCore:  92; Hostname: a33n06; GPU: 3, 4, 5
Rank:    8; NumRanks: 24; RankCore:  96; Hostname: a33n06; GPU: 3, 4, 5
Rank:    9; NumRanks: 24; RankCore: 100; Hostname: a33n06; GPU: 3, 4, 5
Rank:   10; NumRanks: 24; RankCore: 104; Hostname: a33n06; GPU: 3, 4, 5
Rank:   11; NumRanks: 24; RankCore: 108; Hostname: a33n06; GPU: 3, 4, 5

Rank:   12; NumRanks: 24; RankCore:   0; Hostname: a33n05; GPU: 0, 1, 2
Rank:   13; NumRanks: 24; RankCore:   4; Hostname: a33n05; GPU: 0, 1, 2
Rank:   14; NumRanks: 24; RankCore:   8; Hostname: a33n05; GPU: 0, 1, 2
Rank:   15; NumRanks: 24; RankCore:  12; Hostname: a33n05; GPU: 0, 1, 2
Rank:   16; NumRanks: 24; RankCore:  16; Hostname: a33n05; GPU: 0, 1, 2
Rank:   17; NumRanks: 24; RankCore:  20; Hostname: a33n05; GPU: 0, 1, 2

Rank:   18; NumRanks: 24; RankCore:  88; Hostname: a33n05; GPU: 3, 4, 5
Rank:   19; NumRanks: 24; RankCore:  92; Hostname: a33n05; GPU: 3, 4, 5
Rank:   20; NumRanks: 24; RankCore:  96; Hostname: a33n05; GPU: 3, 4, 5
Rank:   21; NumRanks: 24; RankCore: 100; Hostname: a33n05; GPU: 3, 4, 5
Rank:   22; NumRanks: 24; RankCore: 104; Hostname: a33n05; GPU: 3, 4, 5
Rank:   23; NumRanks: 24; RankCore: 108; Hostname: a33n05; GPU: 3, 4, 5
summit>

Single Task, Single GPU, Multiple Threads per RS

The following example will create 12 resource sets each with 1 task, 4 threads, and 1 GPU. Each MPI task will start 4 threads and have access to 1 GPU.

Rank 0 will have access to GPU 0 and start 4 threads on the first socket of the first node ( red resource set). Rank 2 will have access to GPU 1 and start 4 threads on the second socket of the first node ( green resource set). This pattern will continue until 12 resource sets have been created.

The following jsrun command will create 12 resource sets ( -n12 ). Each resource set will contain 1 MPI task ( -a1 ), 1 GPU ( -g1 ), and 4 cores ( -c4 ). Notice that more cores are requested than MPI tasks; the extra cores will be needed to place threads. Without requesting additional cores, threads will be placed on a single core.

Requesting Cores for Threads: The -c flag should be used to request additional cores for thread placement. Without requesting additional cores, threads will be placed on a single core.

Binding Cores to Tasks: The -b binding flag should be used to bind cores to tasks. Without specifying binding, all threads will be bound to the first core.

summit> setenv OMP_NUM_THREADS 4
summit> jsrun -n12 -a1 -c4 -g1 -b packed:4 -d packed ./a.out
Rank: 0; RankCore: 0; Thread: 0; ThreadCore: 0; Hostname: a33n06; OMP_NUM_PLACES: {0},{4},{8},{12}
Rank: 0; RankCore: 0; Thread: 1; ThreadCore: 4; Hostname: a33n06; OMP_NUM_PLACES: {0},{4},{8},{12}
Rank: 0; RankCore: 0; Thread: 2; ThreadCore: 8; Hostname: a33n06; OMP_NUM_PLACES: {0},{4},{8},{12}
Rank: 0; RankCore: 0; Thread: 3; ThreadCore: 12; Hostname: a33n06; OMP_NUM_PLACES: {0},{4},{8},{12}

Rank: 1; RankCore: 16; Thread: 0; ThreadCore: 16; Hostname: a33n06; OMP_NUM_PLACES: {16},{20},{24},{28}
Rank: 1; RankCore: 16; Thread: 1; ThreadCore: 20; Hostname: a33n06; OMP_NUM_PLACES: {16},{20},{24},{28}
Rank: 1; RankCore: 16; Thread: 2; ThreadCore: 24; Hostname: a33n06; OMP_NUM_PLACES: {16},{20},{24},{28}
Rank: 1; RankCore: 16; Thread: 3; ThreadCore: 28; Hostname: a33n06; OMP_NUM_PLACES: {16},{20},{24},{28}

...

Rank: 10; RankCore: 104; Thread: 0; ThreadCore: 104; Hostname: a33n05; OMP_NUM_PLACES: {104},{108},{112},{116}
Rank: 10; RankCore: 104; Thread: 1; ThreadCore: 108; Hostname: a33n05; OMP_NUM_PLACES: {104},{108},{112},{116}
Rank: 10; RankCore: 104; Thread: 2; ThreadCore: 112; Hostname: a33n05; OMP_NUM_PLACES: {104},{108},{112},{116}
Rank: 10; RankCore: 104; Thread: 3; ThreadCore: 116; Hostname: a33n05; OMP_NUM_PLACES: {104},{108},{112},{116}

Rank: 11; RankCore: 120; Thread: 0; ThreadCore: 120; Hostname: a33n05; OMP_NUM_PLACES: {120},{124},{128},{132}
Rank: 11; RankCore: 120; Thread: 1; ThreadCore: 124; Hostname: a33n05; OMP_NUM_PLACES: {120},{124},{128},{132}
Rank: 11; RankCore: 120; Thread: 2; ThreadCore: 128; Hostname: a33n05; OMP_NUM_PLACES: {120},{124},{128},{132}
Rank: 11; RankCore: 120; Thread: 3; ThreadCore: 132; Hostname: a33n05; OMP_NUM_PLACES: {120},{124},{128},{132}

summit>

Hardware Threads: Multiple Threads per Core

Each physical core on Summit contains 4 hardware threads. The SMT level can be set using LSF flags:

SMT1
#BSUB -alloc_flags smt1
jsrun -n1 -c1 -a1 -bpacked:4 csh -c 'echo $OMP_PLACES'
0
SMT2
#BSUB -alloc_flags smt2
jsrun -n1 -c1 -a1 -bpacked:4 csh -c 'echo $OMP_PLACES'
{0:2}
SMT4
#BSUB -alloc_flags smt4
jsrun -n1 -c1 -a1 -bpacked:4 csh -c 'echo $OMP_PLACES'
{0:4}

Common Use Cases

The following table provides a quick reference for creating resource sets of various common use cases. The -n flag can be altered to specify the number of resource sets needed.

Resource Sets MPI Tasks Threads Physical Cores GPUs jsrun Command
1 42 0 42 0 jsrun -n1 -a42 -c42 -g0
1 1 0 1 1 jsrun -n1 -a1 -c1 -g1
1 2 0 2 1 jsrun -n1 -a2 -c2 -g1
1 1 0 1 2 jsrun -n1 -a1 -c1 -g2
1 1 21 21 3 jsrun -n1 -a1 -c21 -g3 -bpacked:21

 

jsrun Tools

This section describes tools that users might find helpful to better understand the jsrun job launcher.

Hello_jsrun

Hello_jsrun is a “Hello World”-type program that users can run on Summit nodes to better understand how MPI ranks and OpenMP threads are mapped to the hardware.
https://code.ornl.gov/t4p/Hello_jsrun


jsrunVisualizer

jsrunVisualizer is a web-application that mimics basic jsrun behavior locally in your browser. It’s an easy way to get familiar with jsrun options for Summit, understand how multiple flags interact, and share your layout ideas with others. Once you’ve crafted your per-node resource sets, you can take the job script it generates and run the same layout on Summit itself!

https://jsrunvisualizer.olcf.ornl.gov/

More Information

This section provides some of the most commonly used LSF commands as well as some of the most useful options to those commands and information on jsrun, Summit’s job launch command. Many commands have much more information than can be easily presented here. More information about these commands is available via the online manual (i.e. man jsrun). Additional LSF information can be found on IBM’s website, specifically the Running Jobs and Command Reference Documents.

Debugging

Arm DDT

Arm DDT is an advanced debugging tool used for scalar, multi-threaded, and large-scale parallel applications.
In addition to traditional debugging features (setting breakpoints, stepping through code, examining variables), DDT also supports attaching to already-running processes and memory debugging. In-depth details of DDT can be found in the Official DDT User Guide, and instructions for how to use it on OLCF systems can be found on the Forge (DDT/MAP) Software Page. DDT is the OLCF’s recommended debugging software for large parallel applications.

GDB

GDB, the GNU Project Debugger, is a command-line debugger useful for traditional debugging and investigating code crashes. GDB lets you debug programs written in Ada, C, C++, Objective-C, Pascal (and many other languages). GDB is available on Summit under all compiler families:

module load gdb

Additional information about GDB usage and OLCF-provided builds can be found on the GDB Software Page.

Valgrind

Valgrind is an instrumentation framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile your programs in detail. You can also use Valgrind to build new tools.

The Valgrind distribution currently includes five production-quality tools: a memory error detector, a thread error detector, a cache and branch-prediction profiler, a call-graph generating cache profiler, and a heap profiler. It also includes two experimental tools: a data race detector, and an instant memory leak detector.

The Valgrind tool suite provides a number of debugging and profiling tools. The most popular is Memcheck, a memory checking tool which can detect many common memory errors such as:

  • Touching memory you shouldn’t (e.g. overrunning heap block boundaries, or reading/writing freed memory).
  • Using values before they have been initialized.
  • Incorrect freeing of memory, such as double-freeing heap blocks.
  • Memory leaks.

Valgrind is available on Summit under all compiler families:

module load valgrind
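
As a simple illustration, Memcheck can be wrapped around a parallel executable from within a job by placing valgrind on the jsrun command line (shown here for a single task and a hypothetical binary a.out):

jsrun -n1 -a1 -c1 valgrind --leak-check=yes ./a.out

Valgrind instrumentation slows execution considerably, so this is best done at small scale with a reduced problem size.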

Additional information about Valgrind usage and OLCF-provided builds can be found on the Valgrind Software Page.

Optimizing and Profiling

Profiling CUDA Code with NVPROF

NVIDIA’s command-line profiler, nvprof, provides profiling for CUDA codes. No extra compiling steps are required to use nvprof. The profiler includes tracing capability as well as the ability to gather many performance metrics, including FLOPS. The profiler data output can be saved and imported into the NVIDIA Visual Profiler for additional graphical analysis.

To use nvprof, the cuda module must be loaded.

summit> module load cuda

A simple “Hello, World!” run using nvprof can be done by adding “nvprof” to the jsrun line in your batch script.

...
jsrun -n1 -a1 -g1 nvprof ./hello_world_gpu
...

Although nvprof doesn’t provide aggregated MPI data, the %h and %p output file modifiers can be used to create separate output files for each host and process.

...
jsrun -n1 -a1 -g1 nvprof -o output.%h.%p ./hello_world_gpu
...

There are many metrics and events that the profiler can capture. For example, to output the number of double-precision FLOPS, you may use the following:

...
jsrun -n1 -a1 -g1 nvprof --metrics flops_dp -o output.%h.%p ./hello_world_gpu
...

To see a list of all available metrics and events, use the following:

summit> nvprof --query-metrics
summit> nvprof --query-events

While using nvprof on the command-line is a quick way to gain insight into your CUDA application, a full visual profile is often even more useful. For information on how to view the output of nvprof in the NVIDIA Visual Profiler, see the NVIDIA Documentation.

Score-P

The Score-P measurement infrastructure is a highly scalable and easy-to-use tool suite for profiling, event tracing, and online analysis of HPC applications. Score-P supports analyzing C, C++ and Fortran applications that make use of multi-processing (MPI, SHMEM), thread parallelism (OpenMP, PThreads) and accelerators (CUDA, OpenCL, OpenACC) and combinations.

For detailed information about using Score-P on Summit and the builds available, please see the Score-P Software Page.

Vampir

Vampir is a software performance visualizer focused on highly parallel applications. It presents a unified view on an application run including the use of programming paradigms like MPI, OpenMP, PThreads, CUDA, OpenCL and OpenACC. It also incorporates file I/O, hardware performance counters and other performance data sources. Various interactive displays offer detailed insight into the performance behavior of the analyzed application. Vampir’s scalable analysis server and visualization engine enable interactive navigation of large amounts of performance data. Score-P and TAU generate OTF2 trace files for Vampir to visualize.

For detailed information about using Vampir on Summit and the builds available, please see the Vampir Software Page.

Known Issues

Coming soon