System Overview

Cumulus is a Cray® XC40™ cluster with Intel Broadwell processors and Cray’s Aries interconnect in a network topology called Dragonfly. Each node has (128) GB of DDR3 SDRAM and (2) sockets with (18) physical cores each. The system has (2) external login nodes.

Logging In to Cumulus

To log in to Cumulus, please use your XCAMS/UCAMS username and password:

$ ssh cumulus.ccs.ornl.gov

NOTE: You do not need to use an RSA token to log in to Cumulus. Please use your XCAMS/UCAMS username and password (which is different from the username and PIN + RSA token code used to log in to other OLCF systems such as Summit).

NOTE: It takes about 5 minutes for your directories to be created after your account is set up. If you log in to a newly created account and do not have a home directory, this is likely the reason.

Data Storage and Filesystems

Cumulus mounts the OLCF open enclave file systems. The available filesystems are summarized in the table below.

Filesystem               Mount Points                         Backed Up?  Purged?  Quota?  Comments
Home Directories (NFS)   /ccsopen/home/$USER                  Yes         No       50GB
Project Space (NFS)      /ccsopen/proj/$PROJECTID             Yes         No       50GB
Parallel Scratch (GPFS)  /gpfs/wolf/proj-shared/$PROJECTID    No          Yes
                         /gpfs/wolf/scratch/$USER/$PROJECTID
                         /gpfs/wolf/world-shared/$PROJECTID
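
As a small illustration of the mount points above, the following sketch prints the three parallel scratch locations for a given user and project. The helper name scratch_paths and the project ID abc123 are hypothetical; only the path patterns come from the table.

```shell
# Print the three GPFS scratch locations for a user and project.
# scratch_paths is a hypothetical helper; the path patterns are taken
# from the filesystem table, and "abc123" is a made-up project ID.
scratch_paths() {
    user=$1
    project=$2
    echo "/gpfs/wolf/proj-shared/$project"
    echo "/gpfs/wolf/scratch/$user/$project"
    echo "/gpfs/wolf/world-shared/$project"
}

scratch_paths "${USER:-someuser}" abc123
```

Because these locations are purged and not backed up, anything worth keeping should be copied to the home or project space.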

Software Environment

The software environment is managed through the Environment Module tool.

Command Description
module list Lists modules currently loaded in a user’s environment
module avail Lists all available modules on a system in condensed format
module avail -l Lists all available modules on a system in long format
module display Shows environment changes that will be made by loading a given module
module load Loads a module
module unload Unloads a module
module help Shows help for a module
module swap Swaps a currently loaded module for an unloaded module

Compiling

Available Compilers

The following compilers are available on Cumulus:

  • Intel Composer XE (default)
  • Portland Group Compiler Suite
  • GNU Compiler Collection
  • Cray Compiling Environment

Upon login, the default versions of the Intel compiler and associated Message Passing Interface (MPI) libraries are added to each user’s environment through a programming environment module. Users do not need to make any environment changes to use the default version of Intel and MPI.

Changing Compilers

If a different compiler is required, it is important to use the correct environment for each compiler. To aid users in pairing the correct compiler and environment, programming environment modules are provided. The programming environment modules will load the correct pairing of compiler version, message passing libraries, and other items required to build and run. We highly recommend that the programming environment modules be used when changing compiler vendors.

The following programming environment modules are available:

  • PrgEnv-intel (default)
  • PrgEnv-pgi
  • PrgEnv-gnu
  • PrgEnv-cray

To change the default loaded Intel environment to the default GCC environment use:

$ module unload PrgEnv-intel 
$ module load PrgEnv-gnu

Or alternatively:

$ module swap PrgEnv-intel PrgEnv-gnu
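
If you are not sure which programming environment is currently loaded, the swap command can be derived from the LOADEDMODULES variable, which the Environment Modules tool maintains as a colon-separated list of loaded modules. The sketch below only prints the command it would suggest; the helper name prgenv_swap_cmd is hypothetical.

```shell
# Print the module command needed to reach a target programming
# environment, based on the colon-separated LOADEDMODULES variable.
# prgenv_swap_cmd is a hypothetical helper; it does not change the
# environment itself, it only prints the suggested command.
prgenv_swap_cmd() {
    target=$1
    current=$(printf '%s\n' "${LOADEDMODULES:-}" | tr ':' '\n' |
              sed -n 's|^\(PrgEnv-[^/]*\).*|\1|p' | head -n 1)
    if [ -z "$current" ]; then
        echo "module load $target"
    elif [ "$current" = "$target" ]; then
        echo "# $target is already loaded"
    else
        echo "module swap $current $target"
    fi
}

prgenv_swap_cmd PrgEnv-gnu
```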

Changing Versions of the Same Compiler

To use a specific compiler version, you must first ensure the compiler’s PrgEnv module is loaded, and then swap to the correct compiler version. For example, the following will configure the environment to use the GCC compilers, then load a non-default GCC compiler version:

$ module swap PrgEnv-intel PrgEnv-gnu
$ module swap gcc gcc/4.6.1

Compiler Commands

As is the case with our other Cray systems, the C, C++, and Fortran compilers are invoked with the following commands:

  • For the C compiler: cc
  • For the C++ compiler: CC
  • For the Fortran compiler: ftn

These are actually compiler wrappers that automatically link in appropriate libraries (such as MPI and math libraries) and build code that targets the compute-node processor architecture. These wrappers should be used regardless of the underlying compiler (Intel, PGI, GNU, or Cray).

Note: You should not call the vendor compilers (e.g. pgf90, icpc, gcc) directly.
Note: Commands such as mpicc, mpiCC, and mpif90 are not available on Cray systems. You should use cc, CC, and ftn instead.
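
The mapping from source language to wrapper can be sketched as a small shell helper. The helper name wrapper_for and the list of file extensions are assumptions made for illustration; the wrapper names cc, CC, and ftn are the ones listed above. The loop prints the compile lines rather than running them.

```shell
# Map a source file to the Cray compiler wrapper that should build it.
# wrapper_for is a hypothetical helper; the extension list is an assumption.
wrapper_for() {
    case "$1" in
        *.c)                    echo cc  ;;
        *.C|*.cc|*.cpp|*.cxx)   echo CC  ;;
        *.f|*.F|*.f90|*.F90)    echo ftn ;;
        *) echo "unknown source type: $1" >&2; return 1 ;;
    esac
}

# Print (rather than run) a compile line for each example file.
for src in solver.c mesh.cpp physics.f90; do
    echo "$(wrapper_for "$src") -c $src"
done
```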

General Programming Environment Guidelines

We recommend the following general guidelines for using the programming environment modules:

  • Do not purge all modules; rather, use the default module environment provided at the time of login, and modify it.
  • Do not swap or unload any of the Cray provided modules (those with names like xt-*, xe-*, xk-*, or cray-*).
  • Do not swap moab, torque, or MySQL modules after loading a programming environment modulefile.

Threaded Codes

When building threaded codes on Cray machines, you may need to take additional steps to ensure a proper build.

For Intel, use the -openmp option (newer Intel compiler releases spell this option -qopenmp):

$ cc -openmp test.c -o test.x
$ export OMP_NUM_THREADS=2
$ aprun -n2 -d2 ./test.x

For PGI, add -mp to the build line:

$ module swap PrgEnv-intel PrgEnv-pgi
$ cc -mp test.c -o test.x
$ export OMP_NUM_THREADS=2
$ aprun -n2 -d2 ./test.x

For GNU, add -fopenmp to the build line:

$ module swap PrgEnv-intel PrgEnv-gnu
$ cc -fopenmp test.c -o test.x
$ export OMP_NUM_THREADS=2
$ aprun -n2 -d2 ./test.x

For Cray, no additional flags are required:

$ module swap PrgEnv-intel PrgEnv-cray
$ cc test.c -o test.x
$ export OMP_NUM_THREADS=2
$ aprun -n2 -d2 ./test.x
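
The four cases above can be summarized in a small lookup. This is a minimal sketch: the helper name openmp_flag is hypothetical, and the flags are exactly the ones listed above. The loop prints the build lines instead of running them.

```shell
# Return the OpenMP compile flag for a programming environment, following
# the four cases listed above. openmp_flag is a hypothetical helper.
openmp_flag() {
    case "$1" in
        PrgEnv-intel) echo "-openmp"  ;;
        PrgEnv-pgi)   echo "-mp"      ;;
        PrgEnv-gnu)   echo "-fopenmp" ;;
        PrgEnv-cray)  echo ""         ;;  # no additional flag required
        *) echo "unknown programming environment: $1" >&2; return 1 ;;
    esac
}

# Print (rather than run) the build line for each environment.
for pe in PrgEnv-intel PrgEnv-pgi PrgEnv-gnu PrgEnv-cray; do
    echo "$pe: cc $(openmp_flag "$pe") test.c -o test.x"
done
```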

Running with OpenMP and PrgEnv-intel

An extra thread created by the Intel OpenMP runtime interacts with the CLE thread binding mechanism and causes poor performance. To work around this issue, CPU-binding should be turned off. This is only an issue for OpenMP with the Intel programming environment.

How CPU-binding is shut off depends on how the job is placed on the node. In the following examples, depth is the number of threads per MPI task, controlled by the -d option to aprun, and npes is the number of MPI tasks (processing elements) per socket, controlled by the -n option to aprun. Replace depth and npes with the values you plan to use.

For the case of running when depth divides evenly into the number of processing elements on a socket (npes):

# OMP_NUM_THREADS may be any value less than or equal to depth
export OMP_NUM_THREADS=depth
aprun -n npes -d depth -cc numa_node a.out

For the case of running when depth does not divide evenly into the number of processing elements on a socket (npes):

# OMP_NUM_THREADS may be any value less than or equal to depth
export OMP_NUM_THREADS=depth
aprun -n npes -d depth -cc none a.out
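
The choice between the two cases can be sketched as follows. The helper name intel_omp_aprun is hypothetical, and it only prints the commands it would suggest; it reads "divides evenly" as npes being an exact multiple of depth.

```shell
# Print the aprun line for an OpenMP code built with PrgEnv-intel.
# intel_omp_aprun is a hypothetical helper; it only echoes the commands.
intel_omp_aprun() {
    npes=$1
    depth=$2
    exe=$3
    if [ $((npes % depth)) -eq 0 ]; then
        binding=numa_node   # depth divides evenly into npes
    else
        binding=none        # otherwise turn CPU binding off entirely
    fi
    echo "export OMP_NUM_THREADS=$depth"
    echo "aprun -n $npes -d $depth -cc $binding $exe"
}

intel_omp_aprun 8 2 ./a.out
intel_omp_aprun 6 4 ./a.out
```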

In the future, a new feature should provide an aprun option to interface more smoothly with OpenMP codes using the Intel programming environment. This documentation will be updated at that time.

Running

In High Performance Computing (HPC), computational work is performed by jobs. Individual jobs produce data that lend relevant insight into grand challenges in science and engineering. As such, the timely, efficient execution of jobs is the primary concern in the operation of any HPC system.

A job on Cumulus typically comprises a few different components:

  • A batch submission script.
  • A binary executable.
  • A set of input files for the executable.
  • A set of output files created by the executable.

And the process for running a job, in general, is to:

  1. Prepare executables and input files.
  2. Write a batch script.
  3. Submit the batch script to the batch scheduler.
  4. Optionally monitor the job before and during execution.

The following sections describe in detail how to create, submit, and manage jobs for execution on Cumulus.

Login vs. Compute Nodes

Cray supercomputers are complex collections of different types of physical nodes/machines. For simplicity, we can think of Cumulus nodes as existing in two categories: login nodes or compute nodes.

Login Nodes

Login nodes are designed to facilitate ssh access into the overall system, and to handle simple tasks. When you first log in, you are placed on a login node. Login nodes are shared by all users of a system, and should only be used for basic tasks such as file editing, code compilation, data backup, and job submission. Login nodes should not be used for memory-intensive or processing-intensive tasks. Users should also limit the number of simultaneous tasks performed on login nodes. For example, a user should not run ten simultaneous tar processes.

Warning: Processor-intensive, memory-intensive, or otherwise disruptive processes running on login nodes may be killed without warning.

Compute Nodes

On Cray machines, when the aprun command is issued within a job script (or on the command line within an interactive batch job), the binary passed to aprun is copied to and executed in parallel on a set of compute nodes. Compute nodes run a Linux microkernel for reduced overhead and improved performance.

Note: On Cray machines, the only way to access the compute nodes is via the aprun command.

Parallel Execution through aprun

The aprun command is used to run a compiled application program across one or more compute nodes. You use the aprun command to specify application resource requirements, request application placement, and initiate application launch.

Common aprun Options

The following table lists commonly-used options to aprun. For a more detailed description of aprun options, see the aprun man page.

Option  Description
-D      Debug; shows the layout aprun will use.
-n      Total number of MPI tasks (aka ‘processing elements’) for the executable. If you do not specify the number of tasks to aprun, the system will default to 1.
-N      Number of MPI tasks (aka ‘processing elements’) per physical node.
        Warning: Because each node contains multiple processors/NUMA nodes, the -S option is likely a better option than -N to control layout within a node.
-m      Memory required per MPI task. There is a maximum of 2GB per core; for example, requesting 2.1GB will allocate a minimum of two cores per MPI task.
-d      Number of threads per MPI task.
        Warning: The default value for -d is 1. If you specify OMP_NUM_THREADS but do not give a -d option, aprun will allocate your threads to a single core. Use OMP_NUM_THREADS to specify to your code the number of threads per MPI task; use -d to tell aprun how to place those threads.
-cc     This is the cpu_list option. It binds MPI tasks or threads to the specified CPUs. The list is given as a set of comma-separated numbers (0 through 15), each of which specifies a compute unit (core) on the node. The list can also be given as hyphen-separated ranges of numbers, each of which specifies a range of compute units (cores) on the node. See man aprun.
-S      Number of MPI tasks (aka ‘processing elements’) per NUMA node. Can be 1, 2, 3, 4, 5, 6, 7, or 8.
-ss     Strict memory containment per NUMA node. The default is to allow remote NUMA node memory access. This option prevents memory access of the remote NUMA node.
-r      Assign system services associated with your application to a compute core.
        If you use fewer than 16 cores, you can request that all of the system services be placed on an unused core. This will reduce “jitter” (i.e. application variability) because the daemons will not cause the application to context switch unexpectedly.
        Use -r 1, and ensure that -N is less than 16 or -S is less than 8.
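
When combining -n and -d, it helps to work out how many nodes the layout needs before writing the batch script. The sketch below does that arithmetic. It assumes 36 physical cores per node (2 sockets with 18 cores each, as described in the System Overview) and one thread per core; the helper name nodes_needed is hypothetical.

```shell
# Estimate how many nodes a hybrid MPI/OpenMP job needs.
# Assumes 36 physical cores per node (2 sockets x 18 cores, per the
# System Overview) and one thread per core; nodes_needed is hypothetical.
CORES_PER_NODE=36

nodes_needed() {
    ntasks=$1
    depth=$2
    tasks_per_node=$((CORES_PER_NODE / depth))
    if [ "$tasks_per_node" -lt 1 ]; then
        echo "depth $depth exceeds $CORES_PER_NODE cores per node" >&2
        return 1
    fi
    # Integer ceiling of ntasks / tasks_per_node.
    echo $(( (ntasks + tasks_per_node - 1) / tasks_per_node ))
}

nodes_needed 72 1    # 36 single-threaded tasks fit on each node
nodes_needed 20 4    # 9 tasks of 4 threads each fit on each node
```

The result is the value to request with nodes= in the batch script; the matching aprun line would then use -n for the task count and -d for the depth.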

Batch System

Batch Scripts

Batch scripts, or job submission scripts, are the mechanism by which a user submits and configures a job for eventual execution. A batch script is simply a shell script which contains:

  • Commands that can be interpreted by batch scheduling software (e.g. PBS)
  • Commands that can be interpreted by a shell

The batch script is submitted to the batch scheduler where it is parsed. Based on the parsed data, the batch scheduler places the script in the scheduler queue as a batch job. Once the batch job makes its way through the queue, the script will be executed on a service node within the set of allocated computational resources.

Sections of a Batch Script

Batch scripts are parsed into the following three sections:

  1. The Interpreter Line
     The first line of a script can be used to specify the script’s interpreter. This line is optional. If not used, the submitter’s default shell will be used. The line uses the “hash-bang-shell” syntax: #!/path/to/shell

  2. The Scheduler Options Section
     The batch scheduler options are preceded by #PBS, making them appear as comments to a shell. PBS will look for #PBS options in a batch script from the script’s first line through the first non-comment line. A comment line begins with #. #PBS options entered after the first non-comment line will not be read by PBS.

     Note: All batch scheduler options must appear at the beginning of the batch script.

  3. The Executable Commands Section
     The shell commands follow the last #PBS option and represent the main content of the batch job. If any #PBS lines follow executable statements, they will be ignored as comments.

The execution section of a script will be interpreted by a shell and can contain multiple lines of executable invocations, shell commands, and comments. When the job’s queue wait time is finished, commands within this section will be executed on a service node (sometimes called a “head node”) from the set of the job’s allocated resources. Under normal circumstances, the batch job will exit the queue after the last line of the script is executed.
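
Because misplaced #PBS lines are silently treated as comments, it can be worth checking a script before submitting it. The following is a minimal sketch of such a check, not a PBS tool: the helper name check_pbs_placement is hypothetical, and it assumes that blank lines do not end the scheduler options section.

```shell
# Report #PBS lines that appear after the first executable line, where
# PBS would treat them as ordinary comments. check_pbs_placement is a
# hypothetical helper; skipping blank lines is an assumption.
check_pbs_placement() {
    awk '
        /^[ \t]*$/ { next }                 # skip blank lines
        /^#/ {
            if (in_body && /^#PBS/) {
                printf "line %d would be ignored: %s\n", NR, $0
                bad = 1
            }
            next
        }
        { in_body = 1 }                     # first non-comment line seen
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Example: a script with one directive placed too late.
script=$(mktemp)
cat > "$script" <<'EOF'
#!/bin/bash
#PBS -A pjt000
#PBS -l walltime=1:00:00,nodes=1
cd $MEMBERWORK/pjt000
#PBS -N test
aprun -n 16 ./a.out
EOF

check_pbs_placement "$script" || echo "move the directives listed above to the top of the script"
rm -f "$script"
```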

An Example Batch Script
 1: #!/bin/bash
 2: #    Begin PBS directives
 3: #PBS -A pjt000
 4: #PBS -N test
 5: #PBS -j oe
 6: #PBS -l walltime=1:00:00,nodes=1
 7: #    End PBS directives and begin shell commands
 8: cd $MEMBERWORK/pjt000
 9: date
10: aprun -n 16 ./a.out

The lines of this batch script do the following:

Line  Option    Description
1     Optional  Specifies that the script should be interpreted by the bash shell.
2     Optional  Comments do nothing.
3     Required  The job will be charged to the “pjt000” project.
4     Optional  The job will be named “test”.
5     Optional  The job’s standard output and error will be combined.
6     Required  The job will request 1 compute node for 1 hour.
7     Optional  Comments do nothing.
8               This shell command will change to the user’s member work directory.
9               This shell command will run the date command.
10              This invocation will run 16 MPI instances of the executable a.out on the compute nodes allocated by the batch system.

Note: For more batch script examples, please see the Additional Example Batch Scripts section of the Titan User Guide.

Batch Scheduler Node Requests

A node’s cores cannot be allocated to multiple jobs. Because the OLCF charges based upon the computational resources a job makes unavailable to others, a job is charged for an entire node even if the job uses only one processor core. To simplify the process, users are required to request an entire node through PBS.

Note: Whole nodes must be requested at the time of job submission, and allocations are reduced by core-hour amounts corresponding to whole nodes, regardless of actual core utilization.
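
To make the whole-node charging concrete, the sketch below estimates the core-hours charged for a job. It assumes 36 cores per node (2 sockets with 18 cores each, per the System Overview) and that the charge is simply nodes multiplied by cores per node and by hours; the exact accounting formula is an assumption, and the helper name core_hours_charged is hypothetical.

```shell
# Estimate core-hours charged for a job: whole nodes are billed regardless
# of how many cores are used. Assumes 36 cores per node (2 sockets x 18
# cores); core_hours_charged is a hypothetical helper, not an OLCF tool.
core_hours_charged() {
    nodes=$1
    hours=$2
    echo $((nodes * 36 * hours))
}

core_hours_charged 1 1     # a 1-node, 1-hour job, even if it uses one core
core_hours_charged 4 12    # a 4-node, 12-hour job
```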

Submission

Once written, a batch script is submitted to the batch scheduler via the qsub command.

$ cd /path/to/batch/script
$ qsub ./script.pbs

If successfully submitted, a PBS job ID will be returned. This ID is needed to monitor the job’s status with various job monitoring utilities. It is also necessary information when troubleshooting a failed job, or when asking the OLCF User Assistance Center for help.

Note: Always make a note of the returned job ID upon job submission, and include it in help requests to the OLCF User Assistance Center.

Options to the qsub command allow the specification of attributes which affect the behavior of the job. In general, options to qsub on the command line can also be placed in the batch scheduler options section of the batch script via #PBS.
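
One way to make sure the job ID is never lost is to capture the output of qsub at submission time. The following is a minimal sketch: the helper name submit_and_log and the default log file name are hypothetical, and it assumes that qsub prints only the job ID on standard output.

```shell
# Submit a batch script and keep a record of the returned job ID.
# submit_and_log is a hypothetical helper; the log file name is arbitrary.
submit_and_log() {
    script=$1
    logfile=${2:-submitted_jobs.log}
    jobid=$(qsub "$script") || {
        echo "qsub failed for $script" >&2
        return 1
    }
    # Record when the job was submitted, its ID, and the script used.
    echo "$(date '+%Y-%m-%d %H:%M:%S') $jobid $script" >> "$logfile"
    echo "$jobid"
}
```

On a login node this would be called as submit_and_log ./script.pbs, after which the job ID is both printed and appended to the log file for later help requests.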

Scheduling Policy

The following queues are available on Cumulus:

Queue Max Walltime
batch (default) 48 hrs
debug 1 hr
research 120 hrs

The batch queue is the default. If no queue is specified, the job will run in the batch queue.

In addition to the above per-queue limits, the following batch limits apply:

  • Unlimited running jobs
  • Limit of (5) eligible-to-run jobs per user.
  • Debug queue is limited to (1) queued job per user in any state.
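
Before submitting, a requested walltime can be checked against the per-queue limits in the table above. This is a minimal sketch with a hypothetical helper name, walltime_ok, and it only handles walltimes given in whole hours.

```shell
# Check a requested walltime (in whole hours) against the per-queue
# limits in the table above. walltime_ok is a hypothetical helper.
walltime_ok() {
    queue=$1
    hours=$2
    case "$queue" in
        batch)    limit=48  ;;
        debug)    limit=1   ;;
        research) limit=120 ;;
        *) echo "unknown queue: $queue" >&2; return 2 ;;
    esac
    if [ "$hours" -le "$limit" ]; then
        echo "ok: $hours h fits in $queue (max $limit h)"
    else
        echo "too long: $hours h exceeds the $queue limit of $limit h"
        return 1
    fi
}

walltime_ok debug 1
walltime_ok batch 72 || echo "consider another queue or a shorter walltime"
```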