Percival is a 168-node Cray® XC40™ supercomputer with 64-core Intel Xeon Phi 7230 processors (previously code-named “Knights Landing” or “KNL”). Percival uses Cray’s Aries interconnect in a network topology called Dragonfly. The system has two external login nodes.
The login nodes can be accessed via SSH from home.ccs.ornl.gov and then from
Data Storage and Filesystems
The available file systems are summarized in the table below.
|Filesystem||Mount Points||Backed Up?||Purged?||Quota?||Comments|
|Home Directories (NFS)||/ccs/home/$USER||Yes||No||50GB||Shared across all OLCF resources|
|Project Space (NFS)||/ccs/proj/$PROJECTID||Yes||No||50GB|
|Parallel Scratch (Lustre)||/lustre/atlas/proj-shared/$PROJECTID||No||Yes|| ||Shared across all OLCF resources|
As on other OLCF systems, the software environment is managed through the
Environment Module tool.
||Lists modules currently loaded in a user’s environment|
||Lists all available modules on a system in condensed format|
||Lists all available modules on a system in long format|
||Shows environment changes that will be made by loading a given module|
||Loads a module|
||Unloads a module|
||Shows help for a module|
||Swaps a currently loaded module for an unloaded module|
The following compilers are available on Percival:
- Intel Composer XE (default)
- Portland Group Compiler Suite
- GNU Compiler Collection
- Cray Compiling Environment
Upon login, the default versions of the Intel compiler and associated Message Passing Interface (MPI) libraries are added to each user’s environment through a programming environment module. Users do not need to make any environment changes to use the default version of Intel and MPI.
If a different compiler is required, it is important to use the correct environment for each compiler. To aid users in pairing the correct compiler and environment, programming environment modules are provided. The programming environment modules will load the correct pairing of compiler version, message passing libraries, and other items required to build and run. We highly recommend that the programming environment modules be used when changing compiler vendors.
The following programming environment modules are available:
- PrgEnv-intel (default)
- PrgEnv-pgi
- PrgEnv-gnu
- PrgEnv-cray
To change the default loaded Intel environment to the default GCC environment use:
$ module unload PrgEnv-intel
$ module load PrgEnv-gnu
Or, alternatively:
$ module swap PrgEnv-intel PrgEnv-gnu
Changing Versions of the Same Compiler
To use a specific compiler version, you must first ensure the compiler’s PrgEnv module is loaded, and then swap to the correct compiler version. For example, the following will configure the environment to use the GCC compilers, then load a non-default GCC compiler version:
$ module swap PrgEnv-intel PrgEnv-gnu
$ module swap gcc gcc/4.6.1
As is the case with our other Cray systems, the C, C++, and Fortran compilers are invoked with the following commands:
- For the C compiler:
- For the C++ compiler:
- For the Fortran compiler:
These are actually compiler wrappers that automatically link in appropriate libraries (such as MPI and math libraries) and build code that targets the compute-node processor architecture. These wrappers should be used regardless of the underlying compiler (Intel, PGI, GNU, or Cray).
mpif90 are not available on Cray systems. You should use
General Programming Environment Guidelines
We recommend the following general guidelines for using the programming environment modules:
- Do not purge all modules; rather, use the default module environment provided at the time of login, and modify it.
- Do not swap or unload any of the Cray provided modules (those with names like
- Do not swap moab, torque, or MySQL modules after loading a programming environment modulefile.
When building threaded codes on Cray machines, you may need to take additional steps to ensure a proper build.
For Intel, add -qopenmp to the build line:
$ cc -qopenmp test.c -o test.x
$ setenv OMP_NUM_THREADS 2
$ aprun -n2 -d2 ./test.x
For PGI, add -mp to the build line:
$ module swap PrgEnv-intel PrgEnv-pgi
$ cc -mp test.c -o test.x
$ setenv OMP_NUM_THREADS 2
$ aprun -n2 -d2 ./test.x
For GNU, add -fopenmp to the build line:
$ module swap PrgEnv-intel PrgEnv-gnu
$ cc -fopenmp test.c -o test.x
$ setenv OMP_NUM_THREADS 2
$ aprun -n2 -d2 ./test.x
For Cray, no additional flags are required:
$ module swap PrgEnv-intel PrgEnv-cray
$ cc test.c -o test.x
$ setenv OMP_NUM_THREADS 2
$ aprun -n2 -d2 ./test.x
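The per-compiler flags shown in the build lines above can be collected into one small helper. This is only a sketch: the function name `openmp_flag` is hypothetical and is not part of any Cray or OLCF tooling; it simply restates the flags used in the examples above.

```shell
# Hypothetical helper (illustration only): print the OpenMP build flag
# for a given programming environment module, as used in the examples above.
openmp_flag() {
    case "$1" in
        PrgEnv-intel) echo "-qopenmp" ;;
        PrgEnv-pgi)   echo "-mp" ;;
        PrgEnv-gnu)   echo "-fopenmp" ;;
        PrgEnv-cray)  echo "" ;;   # no additional flag is required
        *) echo "unknown programming environment: $1" >&2; return 1 ;;
    esac
}

# Example: print the build line for the GNU environment.
echo "cc $(openmp_flag PrgEnv-gnu) test.c -o test.x"
```

Keeping the mapping in one place can help build scripts that need to work under more than one programming environment.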
Running with OpenMP and PrgEnv-intel
An extra thread created by the Intel OpenMP runtime interacts with the CLE thread binding mechanism and causes poor performance. To work around this issue, CPU-binding should be turned off. This is only an issue for OpenMP with the Intel programming environment.
How CPU-binding is shut off depends on how the job is placed on the node. In the following examples, we refer to the number of threads per MPI task as depth; this is controlled by the -d option to aprun. We refer to the number of MPI tasks (processing elements) per socket as npes; this is controlled by the -n option to aprun. In the examples, replace depth with the number of threads per MPI task and npes with the number of MPI tasks (processing elements) per socket that you plan to use.
For the case when depth divides evenly into the number of processing elements on a socket (npes):
export OMP_NUM_THREADS="<=depth"
aprun -n npes -d "depth" -cc numa_node a.out
For the case when depth does not divide evenly into the number of processing elements on a socket (npes):
export OMP_NUM_THREADS="<=depth"
aprun -n npes -d "depth" -cc none a.out
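The choice between the two cases above comes down to a divisibility check. The following is a minimal sketch, assuming a hypothetical helper named `cc_policy` that takes depth and npes as arguments; it is not an aprun feature.

```shell
# Hypothetical helper (illustration only): choose the aprun -cc setting
# for OpenMP runs under PrgEnv-intel.
#   $1 = depth (threads per MPI task)
#   $2 = npes  (MPI tasks per socket)
cc_policy() {
    if [ "$1" -lt 1 ]; then
        echo "depth must be at least 1" >&2
        return 1
    fi
    if [ $(( $2 % $1 )) -eq 0 ]; then
        echo "numa_node"   # depth divides evenly into npes
    else
        echo "none"        # depth does not divide evenly into npes
    fi
}

# Example with depth=4 and npes=16.
depth=4
npes=16
echo "aprun -n $npes -d $depth -cc $(cc_policy "$depth" "$npes") a.out"
```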
In the future, a new feature should provide an aprun option to interface more smoothly with OpenMP codes using the Intel programming environment. This documentation will be updated at that time.
In High Performance Computing (HPC), computational work is performed by jobs. Individual jobs produce data that lend relevant insight into grand challenges in science and engineering. As such, the timely, efficient execution of jobs is the primary concern in the operation of any HPC system.
A job on Percival typically comprises a few different components:
- A batch submission script.
- A binary executable.
- A set of input files for the executable.
- A set of output files created by the executable.
And the process for running a job, in general, is to:
- Prepare executables and input files.
- Write a batch script.
- Submit the batch script to the batch scheduler.
- Optionally monitor the job before and during execution.
The following sections describe in detail how to create, submit, and manage jobs for execution on Percival.
Login vs. Compute Nodes
Cray Supercomputers are complex collections of different types of physical nodes/machines. For simplicity, we can think of Percival nodes as existing in two categories: login nodes or compute nodes.
Login nodes are designed to facilitate ssh access into the overall system, and to handle simple tasks. When you first log in, you are placed on a login node. Login nodes are shared by all users of a system, and should only be used for basic tasks such as file editing, code compilation, data backup, and job submission. Login nodes should not be used for memory-intensive or processing-intensive tasks. Users should also limit the number of simultaneous tasks performed on login nodes. For example, a user should not run ten simultaneous tar processes.
On Cray machines, when the aprun command is issued within a job script (or on the command line within an interactive batch job), the binary passed to aprun is copied to and executed in parallel on a set of compute nodes. Compute nodes run a Linux microkernel for reduced overhead and improved performance.
Parallel Execution through aprun
The aprun command is used to run a compiled application program across one or more compute nodes. You use the aprun command to specify application resource requirements, request application placement, and initiate application launch.
The following table lists commonly used options to aprun. For a more detailed description of aprun options, see the aprun man page.
||Debug; shows the layout
||Number of total MPI tasks (aka ‘processing elements’) for the executable. If you do not specify the number of tasks to
||Number of MPI tasks (aka ‘processing elements’) per physical node. Warning: Because each node contains multiple processors/NUMA nodes, the -S option is likely a better option than -N to control layout within a node.|
||Memory required per MPI task. There is a maximum of 2GB per core, i.e. requesting 2.1GB will allocate two cores minimum per MPI task|
||Number of threads per MPI task.
Warning: The default value for
||This is the cpu_list option. It binds MPI tasks or threads to the specified CPUs. The list is given as a set of comma-separated numbers (0 through 15), each of which specifies a compute unit (core) on the node. The list can also be given as hyphen-separated ranges of numbers, each of which specifies a range of compute units (cores) on the node. See man aprun.|
||Number of MPI tasks (aka ‘processing elements’) per NUMA node. Can be 1, 2, 3, 4, 5, 6, 7, or 8.|
||Strict memory containment per NUMA node. The default is to allow remote NUMA node memory access. This option prevents memory access of the remote NUMA node.|
||Assign system services associated with your application to a compute core. If you use fewer than 16 cores, you can request that all of the system services be placed on an unused core. This reduces “jitter” (i.e., application variability) because the daemons will not cause the application to context switch unexpectedly.|
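The 2GB-per-core limit in the memory row of the table above implies a simple ceiling calculation. The following is a minimal sketch, assuming the request is expressed in megabytes and taking 2GB as 2048MB; the helper name `cores_for_mem` is hypothetical.

```shell
# Hypothetical helper (illustration only): minimum number of cores reserved
# per MPI task for a per-task memory request given in MB.
# Assumption: the 2GB-per-core limit is treated as 2048 MB.
cores_for_mem() {
    per_core_mb=2048
    # Integer ceiling of request / per-core limit.
    echo $(( ($1 + per_core_mb - 1) / per_core_mb ))
}

# A request of roughly 2.1GB per task needs two cores per task.
cores_for_mem 2150
```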
Batch scripts, or job submission scripts, are the mechanism by which a user submits and configures a job for eventual execution. A batch script is simply a shell script which contains:
- Commands that can be interpreted by batch scheduling software (e.g. PBS)
- Commands that can be interpreted by a shell
The batch script is submitted to the batch scheduler where it is parsed. Based on the parsed data, the batch scheduler places the script in the scheduler queue as a batch job. Once the batch job makes its way through the queue, the script will be executed on a service node within the set of allocated computational resources.
Sections of a Batch Script
Batch scripts are parsed into the following three sections:
- The Interpreter Line
The first line of a script can be used to specify the script’s interpreter. This line is optional. If not used, the submitter’s default shell will be used. The line uses the “hash-bang-shell” syntax:
- The Scheduler Options Section
The batch scheduler options are preceded by #PBS, making them appear as comments to a shell. PBS will look for #PBS options in a batch script from the script’s first line through the first non-comment line. A comment line begins with #. #PBS options entered after the first non-comment line will not be read by PBS.
Note: All batch scheduler options must appear at the beginning of the batch script.
- The Executable Commands Section
The shell commands follow the last #PBS option and represent the main content of the batch job. If any #PBS lines follow executable statements, they will be ignored as comments.
The execution section of a script will be interpreted by a shell and can contain multiple lines of executable invocations, shell commands, and comments. When the job’s queue wait time is finished, commands within this section will be executed on a service node (sometimes called a “head node”) from the set of the job’s allocated resources. Under normal circumstances, the batch job will exit the queue after the last line of the script is executed.
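The rule that scheduler options are read only up to the first non-comment line can be illustrated with a short filter. This is a sketch rather than a description of how PBS itself parses scripts: the helper name `pbs_directives` is hypothetical, and it assumes that any line not starting with `#` (including a blank line) ends the scheduler options section.

```shell
# Hypothetical helper (illustration only): print the #PBS lines that appear
# before the first non-comment line of a batch script read from stdin.
# Assumption: any line that does not begin with '#' ends the options section.
pbs_directives() {
    awk '!/^#/ { exit } /^#PBS/ { print }'
}

# The "-j oe" option below follows a shell command, so it is not printed.
pbs_directives <<'EOF'
#!/bin/bash
#PBS -A pjt000
#PBS -N test
cd $MEMBERWORK/pjt000
#PBS -j oe
EOF
```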
An Example Batch Script
1: #!/bin/bash
2: # Begin PBS directives
3: #PBS -A pjt000
4: #PBS -N test
5: #PBS -j oe
6: #PBS -l walltime=1:00:00,nodes=373
7: # End PBS directives and begin shell commands
8: cd $MEMBERWORK/pjt000
9: date
10: aprun -n 5968 ./a.out
The lines of this batch script do the following:
|1||Optional||Specifies that the script should be interpreted by the bash shell.|
|2||Optional||Comments do nothing.|
|3||Required||The job will be charged to the “pjt000” project.|
|4||Optional||The job will be named “test”.|
|5||Optional||The job’s standard output and error will be combined.|
|6||Required||The job will request 373 compute nodes for 1 hour.|
|7||Optional||Comments do nothing.|
|8||N/A||This shell command will change to the user’s member work directory.|
|9||N/A||This shell command will run the date command.|
|10||N/A||This invocation will run 5968 MPI instances of the executable a.out.|
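The task count on line 10 of the example follows from the node count on line 6: 373 nodes times 16 tasks per node gives 5968. A minimal sketch of that arithmetic, assuming 16 MPI tasks per node as the example's numbers imply (the variable names are illustrative only):

```shell
# Illustrative variables; the tasks-per-node value is an assumption
# inferred from the example (5968 / 373 = 16).
nodes=373
tasks_per_node=16
total_tasks=$((nodes * tasks_per_node))

# Print the aprun line that matches the node request.
echo "aprun -n $total_tasks ./a.out"
```

Deriving the task count from the node request in this way helps keep the two values consistent when the job size changes.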
A node’s cores cannot be allocated to multiple jobs. Because the OLCF charges based upon the computational resources a job makes unavailable to others, a job is charged for an entire node even if the job uses only one processor core. To simplify the process, users are required to request an entire node through PBS.
Once written, a batch script is submitted to the batch scheduler via the qsub command:
$ cd /path/to/batch/script
$ qsub ./script.pbs
If successfully submitted, a PBS job ID will be returned. This ID is needed to monitor the job’s status with various job monitoring utilities. It is also necessary information when troubleshooting a failed job, or when asking the OLCF User Assistance Center for help.
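Because the job ID is printed when the script is submitted, it can be captured in a shell variable for later use. The following is a minimal sketch; the wrapper name `submit_job` is hypothetical, and `qstat`, mentioned in the comments, is the standard PBS status command rather than something described in this guide.

```shell
# Hypothetical wrapper (illustration only): submit a batch script and
# print the job ID that qsub returns.
submit_job() {
    jobid=$(qsub "$1") || return 1
    echo "$jobid"
}

# Usage on a login node:
#   id=$(submit_job ./script.pbs)
#   qstat "$id"     # check the job's status using the saved ID
```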
Options to the qsub command allow the specification of attributes which affect the behavior of the job. In general, options to qsub on the command line can also be placed in the batch scheduler options section of the batch script via #PBS.
The batch queue enforces the following policies:
- Unlimited running jobs
- Limit of (4) eligible-to-run jobs per user.
- Maximum walltime of (6) hours per batch job.
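A requested walltime can be checked against the 6-hour limit before submission. The following is a minimal sketch, assuming the walltime is written as H:MM:SS as in the example batch script; the helper name `walltime_ok` is hypothetical.

```shell
# Hypothetical pre-submission check (illustration only): succeed if an
# H:MM:SS walltime is within the 6-hour per-job limit.
walltime_ok() {
    echo "$1" | awk -F: '{ exit !(($1 * 3600 + $2 * 60 + $3) <= 6 * 3600) }'
}

if walltime_ok "1:00:00"; then
    echo "within limit"
else
    echo "exceeds limit"
fi
```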