System Overview
Percival is a 168-node Cray® XC40™ supercomputer with 64-core Intel Xeon Phi 7230 processors (previously code named “Knights Landing” or “KNL”). Percival uses Cray’s Aries interconnect in a network topology called Dragonfly. The system has (2) external login nodes.
Access
There are two login nodes, which can be accessed via SSH from hub.ccs.ornl.gov and/or home.ccs.ornl.gov. First SSH to hub.ccs.ornl.gov or home.ccs.ornl.gov, and then from hub/home SSH into percival.
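A minimal login sequence could look like the following sketch (username is a placeholder for your OLCF user name):
$ ssh username@home.ccs.ornl.gov
$ ssh percival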
Data Storage and Filesystems
The available file systems are summarized in the table below.
Filesystem | Mount Points | Backed Up? | Purged? | Quota? | Comments |
---|---|---|---|---|---|
Home Directories (NFS) | /ccs/home/$USER | Yes | No | 50GB | Shared across all OLCF resources |
Project Space (NFS) | /ccs/proj/$PROJECTID | Yes | No | 50GB | |
Parallel Scratch (Lustre) | /lustre/atlas/proj-shared/$PROJECTID, /lustre/atlas/scratch/$USER/$PROJECTID, /lustre/atlas/world-shared/$PROJECTID | No | Yes | | Shared across all OLCF resources |
Software Environment
As on other OLCF systems, the software environment is managed through the Environment Modules tool.
Command | Description |
---|---|
module list | Lists modules currently loaded in a user’s environment |
module avail | Lists all available modules on a system in condensed format |
module avail -l | Lists all available modules on a system in long format |
module display | Shows environment changes that will be made by loading a given module |
module load | Loads a module |
module unload | Unloads a module |
module help | Shows help for a module |
module swap | Swaps a currently loaded module for an unloaded module |
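For example, a typical sequence for inspecting and loading an additional module might look like the following (cray-hdf5 is used here only as an illustrative module name; check module avail for what is actually installed on Percival):
$ module avail
$ module display cray-hdf5
$ module load cray-hdf5
$ module list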
Compiling
Available Compilers
The following compilers are available on Percival:
- Intel Composer XE (default)
- Portland Group Compiler Suite
- GNU Compiler Collection
- Cray Compiling Environment
Upon login, the default versions of the Intel compiler and associated Message Passing Interface (MPI) libraries are added to each user’s environment through a programming environment module. Users do not need to make any environment changes to use the default version of Intel and MPI.
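To verify which programming environment is currently loaded, you can inspect your loaded modules; note that the module command typically writes its output to standard error, hence the redirection in this sketch:
$ module list 2>&1 | grep PrgEnv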
Changing Compilers
If a different compiler is required, it is important to use the correct environment for each compiler. To aid users in pairing the correct compiler and environment, programming environment modules are provided. The programming environment modules will load the correct pairing of compiler version, message passing libraries, and other items required to build and run. We highly recommend that the programming environment modules be used when changing compiler vendors.
The following programming environment modules are available:
- PrgEnv-intel (default)
- PrgEnv-pgi
- PrgEnv-gnu
- PrgEnv-cray
To change the default loaded Intel environment to the default GCC environment use:
$ module unload PrgEnv-intel
$ module load PrgEnv-gnu
Or alternatively:
$ module swap PrgEnv-intel PrgEnv-gnu
Changing Versions of the Same Compiler
To use a specific compiler version, you must first ensure the compiler’s PrgEnv module is loaded, and then swap to the correct compiler version. For example, the following will configure the environment to use the GCC compilers, then load a non-default GCC compiler version:
$ module swap PrgEnv-intel PrgEnv-gnu
$ module swap gcc gcc/4.6.1
Compiler Commands
As is the case with our other Cray systems, the C, C++, and Fortran compilers are invoked with the following commands:
- For the C compiler: cc
- For the C++ compiler: CC
- For the Fortran compiler: ftn
These are actually compiler wrappers that automatically link in appropriate libraries (such as MPI and math libraries) and build code that targets the compute-node processor architecture. These wrappers should be used regardless of the underlying compiler (Intel, PGI, GNU, or Cray).
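For example, an MPI code can be built by calling the wrappers directly (the source and output file names below are placeholders):
$ cc mpi_hello.c -o hello_c.x
$ ftn mpi_hello.f90 -o hello_f.x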
Note: You should not call the underlying compilers (e.g. pgf90, icpc, gcc) directly.
Note: The MPI compiler wrappers mpicc, mpiCC, and mpif90 are not available on Cray systems. You should use cc, CC, and ftn instead.
General Programming Environment Guidelines
We recommend the following general guidelines for using the programming environment modules:
- Do not purge all modules; rather, use the default module environment provided at the time of login, and modify it.
- Do not swap or unload any of the Cray-provided modules (those with names like xt-*, xe-*, xk-*, or cray-*).
- Do not swap moab, torque, or MySQL modules after loading a programming environment modulefile.
Threaded Codes
When building threaded codes on Cray machines, you may need to take additional steps to ensure a proper build.
For Intel, use the -qopenmp option:
$ cc -qopenmp test.c -o test.x
$ setenv OMP_NUM_THREADS 2
$ aprun -n2 -d2 ./test.x
For PGI, add -mp to the build line:
$ module swap PrgEnv-intel PrgEnv-pgi
$ cc -mp test.c -o test.x
$ setenv OMP_NUM_THREADS 2
$ aprun -n2 -d2 ./test.x
For GNU, add -fopenmp to the build line:
$ module swap PrgEnv-intel PrgEnv-gnu
$ cc -fopenmp test.c -o test.x
$ setenv OMP_NUM_THREADS 2
$ aprun -n2 -d2 ./test.x
For Cray, no additional flags are required:
$ module swap PrgEnv-intel PrgEnv-cray
$ cc test.c -o test.x
$ setenv OMP_NUM_THREADS 2
$ aprun -n2 -d2 ./test.x
Running with OpenMP and PrgEnv-intel
An extra thread created by the Intel OpenMP runtime interacts with the CLE thread binding mechanism and causes poor performance. To work around this issue, CPU-binding should be turned off. This is only an issue for OpenMP with the Intel programming environment.
How CPU-binding is shut off depends on how the job is placed on the node. In the following examples, we refer to the number of threads per MPI task as depth; this is controlled by the -d option to aprun. We refer to the number of MPI tasks (processing elements) per socket as npes; this is controlled by the -n option to aprun. In the examples below, replace depth with the number of threads per MPI task, and npes with the number of MPI tasks (processing elements) per socket that you plan to use.
For the case when depth divides evenly into the number of processing elements on a socket (npes):
export OMP_NUM_THREADS="<=depth"
aprun -n npes -d "depth" -cc numa_node a.out
For the case when depth does not divide evenly into the number of processing elements on a socket (npes):
export OMP_NUM_THREADS="<=depth"
aprun -n npes -d "depth" -cc none a.out
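As a concrete sketch of the placeholders above, using hypothetical values of 8 MPI tasks per socket and 4 threads per task (4 divides evenly into 8, so the numa_node binding case applies):
export OMP_NUM_THREADS=4
aprun -n 8 -d 4 -cc numa_node ./a.out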
In the future, a new feature should provide an aprun option to interface more smoothly with OpenMP codes using the Intel programming environment. This documentation will be updated at that time.
Running
In High Performance Computing (HPC), computational work is performed by jobs. Individual jobs produce data that lend relevant insight into grand challenges in science and engineering. As such, the timely, efficient execution of jobs is the primary concern in the operation of any HPC system.
A job on Percival typically comprises a few different components:
- A batch submission script.
- A binary executable.
- A set of input files for the executable.
- A set of output files created by the executable.
And the process for running a job, in general, is to:
- Prepare executables and input files.
- Write a batch script.
- Submit the batch script to the batch scheduler.
- Optionally monitor the job before and during execution.
The following sections describe in detail how to create, submit, and manage jobs for execution on Percival.
Login vs. Compute Nodes
Cray Supercomputers are complex collections of different types of physical nodes/machines. For simplicity, we can think of Percival nodes as existing in two categories: login nodes or compute nodes.
Login Nodes
Login nodes are designed to facilitate SSH access into the overall system and to handle simple tasks. When you first log in, you are placed on a login node. Login nodes are shared by all users of a system, and should only be used for basic tasks such as file editing, code compilation, data backup, and job submission. Login nodes should not be used for memory-intensive or processing-intensive tasks. Users should also limit the number of simultaneous tasks performed on login nodes; for example, a user should not run ten simultaneous tar processes.
Compute Nodes
On Cray machines, when the aprun command is issued within a job script (or on the command line within an interactive batch job), the binary passed to aprun is copied to and executed in parallel on a set of compute nodes. Compute nodes run a Linux microkernel for reduced overhead and improved performance.
Note: Compute nodes can only be accessed via the aprun command.
Parallel Execution through aprun
The aprun command is used to run a compiled application program across one or more compute nodes. You use the aprun command to specify application resource requirements, request application placement, and initiate application launch.
Common aprun Options
The following table lists commonly used options to aprun. For a more detailed description of aprun options, see the aprun man page.
Option | Description |
---|---|
-D | Debug; shows the layout aprun will use |
-n | Number of total MPI tasks (aka ‘processing elements’) for the executable. If you do not specify the number of tasks to aprun, the system will default to 1. |
-N | Number of MPI tasks (aka ‘processing elements’) per physical node. Warning: Because each node contains multiple processors/NUMA nodes, the -S option is likely a better option than -N to control layout within a node. |
-m | Memory required per MPI task. There is a maximum of 2GB per core, i.e. requesting 2.1GB will allocate two cores minimum per MPI task. |
-d | Number of threads per MPI task. Warning: The default value for -d is 1. If you specify OMP_NUM_THREADS but do not give a -d option, aprun will allocate your threads to a single core. Use OMP_NUM_THREADS to specify to your code the number of threads per MPI task; use -d to tell aprun how to place those threads. |
-cc | This is the cpu_list option. It binds MPI tasks or threads to the specified CPUs. The list is given as a set of comma-separated numbers (0 through 15) which each specify a compute unit (core) on the node. The list can also be given as hyphen-separated ranges of numbers which each specify a range of compute units (cores) on the node. See man aprun. |
-S | Number of MPI tasks (aka ‘processing elements’) per NUMA node. Can be 1, 2, 3, 4, 5, 6, 7, or 8. |
-ss | Strict memory containment per NUMA node. The default is to allow remote NUMA node memory access. This option prevents memory access of the remote NUMA node. |
-r | Assign system services associated with your application to a compute core. If you use fewer than 16 cores, you can request all of the system services to be placed on an unused core; this will reduce “jitter” (i.e. application variability) because the daemons will not cause the application to context switch unexpectedly. Use -r 1 to bind the system services to a single unused core. |
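As an illustration, several of the options above can be combined on a single aprun line; the task, thread, and placement values below are arbitrary examples rather than recommendations:
# 8 MPI tasks total, 2 tasks per NUMA node, 2 threads per task,
# with strict per-NUMA-node memory containment
export OMP_NUM_THREADS=2
aprun -n 8 -S 2 -d 2 -ss ./a.out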
Batch System
Batch Scripts
Batch scripts, or job submission scripts, are the mechanism by which a user submits and configures a job for eventual execution. A batch script is simply a shell script which contains:
- Commands that can be interpreted by batch scheduling software (e.g. PBS)
- Commands that can be interpreted by a shell
The batch script is submitted to the batch scheduler where it is parsed. Based on the parsed data, the batch scheduler places the script in the scheduler queue as a batch job. Once the batch job makes its way through the queue, the script will be executed on a service node within the set of allocated computational resources.
Sections of a Batch Script
Batch scripts are parsed into the following three sections:
- The Interpreter Line
The first line of a script can be used to specify the script’s interpreter. This line is optional. If not used, the submitter’s default shell will be used. The line uses the “hash-bang-shell” syntax: #!/path/to/shell
- The Scheduler Options Section
The batch scheduler options are preceded by #PBS, making them appear as comments to a shell. PBS will look for #PBS options in a batch script from the script’s first line through the first non-comment line. A comment line begins with #. #PBS options entered after the first non-comment line will not be read by PBS.
Note: All batch scheduler options must appear at the beginning of the batch script.
- The Executable Commands Section
The shell commands follow the last #PBS option and represent the main content of the batch job. If any #PBS lines follow executable statements, they will be ignored as comments.
The execution section of a script will be interpreted by a shell and can contain multiple lines of executable invocations, shell commands, and comments. When the job’s queue wait time is finished, commands within this section will be executed on a service node (sometimes called a “head node”) from the set of the job’s allocated resources. Under normal circumstances, the batch job will exit the queue after the last line of the script is executed.
An Example Batch Script
1: #!/bin/bash
2: # Begin PBS directives
3: #PBS -A pjt000
4: #PBS -N test
5: #PBS -j oe
6: #PBS -l walltime=1:00:00,nodes=373
7: # End PBS directives and begin shell commands
8: cd $MEMBERWORK/pjt000
9: date
10: aprun -n 5968 ./a.out
The lines of this batch script do the following:
Line | Option | Description |
---|---|---|
1 | Optional | Specifies that the script should be interpreted by the bash shell. |
2 | Optional | Comments do nothing. |
3 | Required | The job will be charged to the “pjt000” project. |
4 | Optional | The job will be named “test”. |
5 | Optional | The job’s standard output and error will be combined. |
6 | Required | The job will request 373 compute nodes for 1 hour. |
7 | Optional | Comments do nothing. |
8 | — | This shell command will change to the user’s member work directory. |
9 | — | This shell command will run the date command. |
10 | — | This invocation will run 5968 MPI instances of the executable a.out on the compute nodes allocated by the batch system. |
Batch Scheduler Node Requests
A node’s cores cannot be allocated to multiple jobs. Because the OLCF charges based upon the computational resources a job makes unavailable to others, a job is charged for an entire node even if the job uses only one processor core. To simplify the process, users are required to request an entire node through PBS.
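For example, even a small single-core test still requests at least one full node in its batch script (the walltime value here is only a placeholder):
#PBS -l nodes=1,walltime=0:30:00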
Submission
Once written, a batch script is submitted to the batch scheduler via the qsub command.
$ cd /path/to/batch/script
$ qsub ./script.pbs
If successfully submitted, a PBS job ID will be returned. This ID is needed to monitor the job’s status with various job monitoring utilities. It is also necessary information when troubleshooting a failed job, or when asking the OLCF User Assistance Center for help.
Options to the qsub command allow the specification of attributes which affect the behavior of the job. In general, options to qsub on the command line can also be placed in the batch scheduler options section of the batch script via #PBS.
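For example, the project, walltime, and node count could be supplied on the command line rather than in the script (the values below are placeholders):
$ qsub -A pjt000 -l walltime=1:00:00,nodes=2 ./script.pbs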
Scheduling Policy
The batch queue enforces the following policies:
- Unlimited running jobs
- Limit of (4) eligible-to-run jobs per user.
- Maximum walltime of (6) hours per batch job.