Cumulus is a Cray® XC40™ cluster with Intel Broadwell processors and Cray’s Aries interconnect in a network topology called Dragonfly. Each node has (128) GB of DDR3 SDRAM and (2) sockets with (18) physical cores each. The system has (2) external login nodes.
Logging In to Cumulus
To log in to Cumulus, please use your XCAMS/UCAMS username and password:
$ ssh cumulus.ccs.ornl.gov
NOTE: You do not need to use an RSA token to log in to Cumulus. Please use your XCAMS/UCAMS username and password (which is different from the username and PIN + RSA token code used to log in to other OLCF systems such as Summit).
NOTE: It takes about 5 minutes for your directories to be created. If your account was just created and you have no home directory when you log in, this is likely the reason.
Data Storage and Filesystems
Cumulus mounts the OLCF open enclave file systems. The available filesystems are summarized in the table below.
|Filesystem||Mount Points||Backed Up?||Purged?||Quota?||Comments|
|Home Directories (NFS)||/ccsopen/home/$USER||Yes||No||50GB|
|Project Space (NFS)||/ccsopen/proj/$PROJECTID||Yes||No||50GB|
|Parallel Scratch (GPFS)||/gpfs/wolf/proj-shared/$PROJECTID
The software environment is managed through the Environment Module tool.
||Lists modules currently loaded in a user’s environment|
||Lists all available modules on a system in condensed format|
||Lists all available modules on a system in long format|
||Shows environment changes that will be made by loading a given module|
||Loads a module|
||Unloads a module|
||Shows help for a module|
||Swaps a currently loaded module for an unloaded module|
The following compilers are available on Cumulus:
- Intel Composer XE (default)
- Portland Group Compiler Suite
- GNU Compiler Collection
- Cray Compiling Environment
Upon login, the default versions of the Intel compiler and associated Message Passing Interface (MPI) libraries are added to each user’s environment through a programming environment module. Users do not need to make any environment changes to use the default version of Intel and MPI.
If a different compiler is required, it is important to use the correct environment for each compiler. To aid users in pairing the correct compiler and environment, programming environment modules are provided. The programming environment modules will load the correct pairing of compiler version, message passing libraries, and other items required to build and run. We highly recommend that the programming environment modules be used when changing compiler vendors.
The following programming environment modules are available:
- PrgEnv-intel (default)
To change the default loaded Intel environment to the default GCC environment use:
$ module unload PrgEnv-intel
$ module load PrgEnv-gnu
or, equivalently:
$ module swap PrgEnv-intel PrgEnv-gnu
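Because module swap needs the name of the environment that is currently loaded, a small helper can look it up first. The following is a minimal sketch, not part of the Cumulus tooling: the function name current_prgenv is hypothetical, and it assumes the module tool exports LOADEDMODULES as a colon-separated list of name/version entries, as Environment Modules normally does.

```shell
# Print the name of the currently loaded programming environment module.
# Assumption: LOADEDMODULES is a colon-separated list of "name/version"
# entries maintained by the module tool.
current_prgenv() (
    IFS=:
    for entry in ${LOADEDMODULES:-}; do
        case $entry in
            # Strip the "/version" suffix and report the first match.
            PrgEnv-*) echo "${entry%%/*}"; exit 0 ;;
        esac
    done
    # No programming environment module is loaded.
    exit 1
)

# Usage (on a login node): swap whatever is loaded for the GNU environment.
#   loaded=$(current_prgenv) && module swap "$loaded" PrgEnv-gnu
```

The function body runs in a subshell, so the temporary IFS change does not leak into the calling shell.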
Changing Versions of the Same Compiler
To use a specific compiler version, you must first ensure the compiler’s PrgEnv module is loaded, and then swap to the correct compiler version. For example, the following will configure the environment to use the GCC compilers, then load a non-default GCC compiler version:
$ module swap PrgEnv-intel PrgEnv-gnu
$ module swap gcc gcc/4.6.1
As is the case with our other Cray systems, the C, C++, and Fortran compilers are invoked with the following commands:
- For the C compiler:
- For the C++ compiler:
- For the Fortran compiler:
These are actually compiler wrappers that automatically link in appropriate libraries (such as MPI and math libraries) and build code that targets the compute-node processor architecture. These wrappers should be used regardless of the underlying compiler (Intel, PGI, GNU, or Cray).
mpif90 are not available on Cray systems. You should use
General Programming Environment Guidelines
We recommend the following general guidelines for using the programming environment modules:
- Do not purge all modules; rather, use the default module environment provided at the time of login, and modify it.
- Do not swap or unload any of the Cray provided modules (those with names like
- Do not swap moab, torque, or MySQL modules after loading a programming environment modulefile.
When building threaded codes on Cray machines, you may need to take additional steps to ensure a proper build.
For Intel, use the -openmp option:
$ cc -openmp test.c -o test.x
$ setenv OMP_NUM_THREADS 2
$ aprun -n2 -d2 ./test.x
For PGI, add -mp to the build line:
$ module swap PrgEnv-intel PrgEnv-pgi
$ cc -mp test.c -o test.x
$ setenv OMP_NUM_THREADS 2
$ aprun -n2 -d2 ./test.x
For GNU, add -fopenmp to the build line:
$ module swap PrgEnv-intel PrgEnv-gnu
$ cc -fopenmp test.c -o test.x
$ setenv OMP_NUM_THREADS 2
$ aprun -n2 -d2 ./test.x
For Cray, no additional flags are required:
$ module swap PrgEnv-intel PrgEnv-cray
$ cc test.c -o test.x
$ setenv OMP_NUM_THREADS 2
$ aprun -n2 -d2 ./test.x
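The per-compiler flags above can be collected in one place so that a build script does not hard-code them. This is a minimal sketch; the function name openmp_flag is hypothetical, and the mapping simply restates the flags listed in this section.

```shell
# Map a programming environment module to the OpenMP build flag
# described in this section.
openmp_flag() {
    case $1 in
        PrgEnv-intel) echo "-openmp" ;;
        PrgEnv-pgi)   echo "-mp" ;;
        PrgEnv-gnu)   echo "-fopenmp" ;;
        PrgEnv-cray)  echo "" ;;   # Cray: no additional flag required
        *) echo "unknown programming environment: $1" >&2; return 1 ;;
    esac
}

# Usage (on a login node, with the matching PrgEnv module loaded):
#   cc $(openmp_flag PrgEnv-gnu) test.c -o test.x
```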
Running with OpenMP and PrgEnv-intel
An extra thread created by the Intel OpenMP runtime interacts with the CLE thread binding mechanism and causes poor performance. To work around this issue, CPU-binding should be turned off. This is only an issue for OpenMP with the Intel programming environment.
How CPU-binding is shut off depends on how the job is placed on the node. In the following examples, we refer to the number of threads per MPI task as depth; this is controlled by the -d option to aprun. We refer to the number of MPI tasks (processing elements) per socket as npes; this is controlled by the -n option to aprun. In the examples, replace depth with the number of threads per MPI task and npes with the number of MPI tasks (processing elements) per socket that you plan to use.
For the case of running when depth divides evenly into the number of processing elements on a socket (npes):
export OMP_NUM_THREADS="<=depth"
aprun -n npes -d "depth" -cc numa_node a.out
For the case of running when depth does not divide evenly into the number of processing elements on a socket (npes):
export OMP_NUM_THREADS="<=depth"
aprun -n npes -d "depth" -cc none a.out
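The choice between the two cases can be made mechanically. The sketch below only prints the commands rather than running them. The function name intel_omp_launch is hypothetical, and setting OMP_NUM_THREADS equal to depth is an assumption (the text only requires that it not exceed depth).

```shell
# Print the export and aprun lines for an OpenMP job built with PrgEnv-intel.
# Arguments: npes (MPI tasks per socket), depth (threads per task), executable.
intel_omp_launch() (
    npes=$1 depth=$2 exe=$3
    if [ "$depth" -le 0 ]; then
        echo "depth must be a positive integer" >&2
        exit 1
    fi
    if [ $((npes % depth)) -eq 0 ]; then
        binding=numa_node   # depth divides evenly into npes
    else
        binding=none        # depth does not divide evenly into npes
    fi
    # Assumption: use exactly depth threads (any value <= depth is allowed).
    echo "export OMP_NUM_THREADS=$depth"
    echo "aprun -n $npes -d $depth -cc $binding $exe"
)

# Usage: inspect the generated lines before pasting them into a batch script.
#   intel_omp_launch 8 2 ./a.out
```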
In the future, a new feature should provide an aprun option to interface more smoothly with OpenMP codes using the Intel programming environment. This documentation will be updated at that time.
In High Performance Computing (HPC), computational work is performed by jobs. Individual jobs produce data that lend relevant insight into grand challenges in science and engineering. As such, the timely, efficient execution of jobs is the primary concern in the operation of any HPC system.
A job on Cumulus typically comprises a few different components:
- A batch submission script.
- A binary executable.
- A set of input files for the executable.
- A set of output files created by the executable.
And the process for running a job, in general, is to:
- Prepare executables and input files.
- Write a batch script.
- Submit the batch script to the batch scheduler.
- Optionally monitor the job before and during execution.
The following sections describe in detail how to create, submit, and manage jobs for execution on Cumulus.
Login vs. Compute Nodes
Cray supercomputers are complex collections of different types of physical nodes/machines. For simplicity, we can think of Cumulus nodes as existing in two categories: login nodes or compute nodes.
Login nodes are designed to facilitate ssh access into the overall system, and to handle simple tasks. When you first log in, you are placed on a login node. Login nodes are shared by all users of a system, and should only be used for basic tasks such as file editing, code compilation, data backup, and job submission. Login nodes should not be used for memory-intensive or processing-intensive tasks. Users should also limit the number of simultaneous tasks performed on login nodes. For example, a user should not run ten simultaneous tar processes.
On Cray machines, when the aprun command is issued within a job script (or on the command line within an interactive batch job), the binary passed to aprun is copied to and executed in parallel on a set of compute nodes. Compute nodes run a Linux microkernel for reduced overhead and improved performance.
Parallel Execution through aprun
The aprun command is used to run a compiled application program across one or more compute nodes. You use the aprun command to specify application resource requirements, request application placement, and initiate application launch.
The following table lists commonly-used options to aprun. For a more detailed description of aprun options, see the aprun man page.
||Debug; shows the layout
||Number of total MPI tasks (aka ‘processing elements’) for the executable. If you do not specify the number of tasks to
||Number of MPI tasks (aka ‘processing elements’) per physical node.
Warning: Because each node contains multiple processors/NUMA nodes, the -S option is likely a better option than -N to control layout within a node.
||Memory required per MPI task. There is a maximum of 2GB per core, i.e. requesting 2.1GB will allocate two cores minimum per MPI task|
||Number of threads per MPI task.
Warning: The default value for
||This is the cpu_list option. It binds MPI tasks or threads to the specified CPUs. The list is given as a set of comma-separated numbers (0 through 15) which each specify a compute unit (core) on the node. The list can also be given as hyphen-separated ranges of numbers which each specify a range of compute units (cores) on the node. See man aprun.|
||Number of MPI tasks (aka ‘processing elements’) per NUMA node. Can be 1, 2, 3, 4, 5, 6, 7, or 8.|
||Strict memory containment per NUMA node. The default is to allow remote NUMA node memory access. This option prevents memory access of the remote NUMA node.|
Assign system services associated with your application to a compute core.
If you use fewer than 16 cores, you can request all of the system services to be placed on an unused core. This will reduce “jitter” (i.e. application variability) because the daemons will not cause the application to context switch unexpectedly.
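When choosing values for the total task count and the threads per task, it helps to estimate how many whole nodes the launch will occupy. The following is a rough sketch under stated assumptions: the function name nodes_needed is hypothetical, each thread is assumed to occupy one physical core, the default of 36 cores per node comes from the system description (2 sockets with 18 cores each), and per-task memory limits are ignored.

```shell
# Estimate how many whole nodes an aprun launch needs.
# Arguments: total MPI tasks, threads per task (default 1),
#            cores per node (default 36 = 2 sockets x 18 cores).
# Assumption: one physical core per thread; memory limits are not considered.
nodes_needed() (
    tasks=$1 depth=${2:-1} cores_per_node=${3:-36}
    cores=$((tasks * depth))
    # Integer ceiling division: round up to whole nodes.
    echo $(( (cores + cores_per_node - 1) / cores_per_node ))
)

# Usage: 18 tasks with 4 threads each need 72 cores.
#   nodes_needed 18 4
```

The result can be compared against the nodes value requested from the batch scheduler before submitting.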
Batch scripts, or job submission scripts, are the mechanism by which a user submits and configures a job for eventual execution. A batch script is simply a shell script which contains:
- Commands that can be interpreted by batch scheduling software (e.g. PBS)
- Commands that can be interpreted by a shell
The batch script is submitted to the batch scheduler where it is parsed. Based on the parsed data, the batch scheduler places the script in the scheduler queue as a batch job. Once the batch job makes its way through the queue, the script will be executed on a service node within the set of allocated computational resources.
Sections of a Batch Script
Batch scripts are parsed into the following three sections:
- The Interpreter Line
- The Scheduler Options Section
- The Executable Commands Section
The first line of a script can be used to specify the script’s interpreter. This line is optional. If not used, the submitter’s default shell will be used. The line uses the “hash-bang-shell” syntax:
The batch scheduler options are preceded by #PBS, making them appear as comments to a shell. PBS will look for #PBS options in a batch script from the script’s first line through the first non-comment line. A comment line begins with #. #PBS options entered after the first non-comment line will not be read by PBS.
The shell commands follow the last #PBS option and represent the main content of the batch job. If any #PBS lines follow executable statements, they will be ignored as comments.
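The scanning rule described above can be illustrated with a short filter that lists only the #PBS options PBS would read from a script. This is a sketch of the rule as stated here, not of PBS itself: the function name pbs_directives is hypothetical, and treating a blank line as the first non-comment line is an assumption that may be stricter than the scheduler's actual behavior.

```shell
# Print the options from the #PBS lines that would be read from a batch script.
# Scanning stops at the first line that does not start with "#".
# Assumption: a blank line counts as a non-comment line in this sketch.
pbs_directives() {
    awk '
        /^#PBS[ \t]/ { sub(/^#PBS[ \t]+/, ""); print; next }
        /^#/         { next }   # ordinary comment or interpreter line
                     { exit }   # first non-comment line ends the scan
    ' "$1"
}

# Usage: check which options will take effect before submitting.
#   pbs_directives ./script.pbs
```

Running this on a script is a quick way to spot a #PBS line that was accidentally placed after the first shell command.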
The execution section of a script will be interpreted by a shell and can contain multiple lines of executable invocations, shell commands, and comments. When the job’s queue wait time is finished, commands within this section will be executed on a service node (sometimes called a “head node”) from the set of the job’s allocated resources. Under normal circumstances, the batch job will exit the queue after the last line of the script is executed.
An Example Batch Script
1: #!/bin/bash
2: # Begin PBS directives
3: #PBS -A pjt000
4: #PBS -N test
5: #PBS -j oe
6: #PBS -l walltime=1:00:00,nodes=1
7: # End PBS directives and begin shell commands
8: cd $MEMBERWORK/pjt000
9: date
10: aprun -n 16 ./a.out
The lines of this batch script do the following:
|1||Optional||Specifies that the script should be interpreted by the bash shell.|
|2||Optional||Comments do nothing.|
|3||Required||The job will be charged to the “pjt000” project.|
|4||Optional||The job will be named “test”.|
|5||Optional||The job’s standard output and error will be combined.|
|6||Required||The job will request 1 compute node for 1 hour.|
|7||Optional||Comments do nothing.|
|8||—||This shell command will change to the user’s member work directory.|
|9||—||This shell command will run the date command.|
|10||—||This invocation will run 16 MPI instances of the executable a.out on the compute nodes allocated by the batch system.|
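When many similar jobs are needed, the example script can be generated from a few parameters instead of being edited by hand. The sketch below writes a script with the same directives as the example. The function name write_batch_script and its argument order are hypothetical, and the values passed in the usage line are the placeholder ones from the example.

```shell
# Write a minimal PBS batch script to the given file.
# Arguments: output file, project ID, job name, walltime, nodes,
#            total MPI tasks, executable.
write_batch_script() (
    out=$1 project=$2 name=$3 walltime=$4 nodes=$5 tasks=$6 exe=$7
    # \$MEMBERWORK is escaped so it is expanded when the job runs,
    # not when the script is generated.
    cat > "$out" <<EOF
#!/bin/bash
#PBS -A $project
#PBS -N $name
#PBS -j oe
#PBS -l walltime=$walltime,nodes=$nodes
cd \$MEMBERWORK/$project
date
aprun -n $tasks $exe
EOF
)

# Usage: recreate the example script shown above.
#   write_batch_script script.pbs pjt000 test 1:00:00 1 16 ./a.out
```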
A node’s cores cannot be allocated to multiple jobs. Because the OLCF charges based upon the computational resources a job makes unavailable to others, a job is charged for an entire node even if the job uses only one processor core. To simplify the process, users are required to request an entire node through PBS.
Once written, a batch script is submitted to the batch scheduler via the qsub command:
$ cd /path/to/batch/script
$ qsub ./script.pbs
If successfully submitted, a PBS job ID will be returned. This ID is needed to monitor the job’s status with various job monitoring utilities. It is also necessary information when troubleshooting a failed job, or when asking the OLCF User Assistance Center for help.
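Since the job ID is needed later, it is convenient to capture it at submission time. This is a minimal sketch: the function name submit_job is hypothetical, and the optional second argument exists only so that the submit command can be replaced when testing away from the system.

```shell
# Submit a batch script and print the job ID returned by the scheduler.
# Arguments: script path, submit command (default: qsub).
submit_job() (
    script=$1
    submit_cmd=${2:-qsub}
    if ! jobid=$("$submit_cmd" "$script"); then
        echo "submission failed: $script" >&2
        exit 1
    fi
    echo "$jobid"
)

# Usage (on a login node): keep the ID for monitoring or for a support ticket.
#   jobid=$(submit_job ./script.pbs) && echo "submitted as $jobid"
```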
Options to the qsub command allow the specification of attributes which affect the behavior of the job. In general, options to qsub on the command line can also be placed in the batch scheduler options section of the batch script via #PBS.
The following queues are available on Cumulus:
|batch (default)||48 hrs|
The batch queue is the default. If no queue is specified, the job will run in the batch queue.
In addition to the above per-queue limits, the following batch limits apply:
- Unlimited running jobs
- Limit of (5) eligible-to-run jobs per user.
- Debug queue is limited to (1) queued job per user in any state.