The aprun command is used to run a compiled application program across one or more compute nodes. You use aprun to specify application resource requirements, request application placement, and initiate application launch.
The machine’s physical node layout plays an important role in how aprun works. Each Titan compute node contains two 8-core NUMA nodes on a single socket (a total of 16 cores).
The aprun command is the only mechanism for running an executable in parallel on compute nodes. To run jobs as efficiently as possible, a thorough understanding of how to use aprun and its various options is paramount.
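As a minimal sketch (the executable name ./a.out is hypothetical), a pure-MPI run that fills the 16 cores of one Titan compute node looks like:

```shell
# Launch 16 MPI tasks; with 16 cores per node, this fills one compute node.
# ./a.out is a hypothetical executable name.
aprun -n 16 ./a.out
```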
OLCF uses a version of aprun with two extensions. One is used to identify which libraries are used by an executable to allow us to better track third party software that is being actively used on the system. The other analyzes the command line to identify cases where users might be able to optimize their application’s performance by using slightly different job layout options. We highly recommend using both of these features; however, if there is a reason you wish to disable one or the other please contact the User Assistance Center for information on how to do that.
Shell Resource Limits
aprun will not forward shell limits set by
ulimit for sh/ksh/bash or by
limit for csh/tcsh.
To pass these settings to your batch job, you should set the environment variable
APRUN_XFER_LIMITS to 1 via
export APRUN_XFER_LIMITS=1 for sh/ksh/bash or
setenv APRUN_XFER_LIMITS 1 for csh/tcsh.
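For example, in an sh/ksh/bash batch script (the executable name is hypothetical), raising the stack limit and forwarding it to the compute nodes might look like:

```shell
ulimit -s unlimited          # raise the stack-size limit in the batch shell
export APRUN_XFER_LIMITS=1   # ask aprun to forward shell limits to the compute nodes
aprun -n 16 ./a.out          # hypothetical executable; runs with the raised limit
```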
All aprun processes are launched from a small number of shared service nodes. Because large numbers of aprun processes can cause other users’ apruns to fail, users are asked to limit the number of simultaneous apruns executed within a batch script. Users are limited to 50 aprun processes per batch job; attempts to launch apruns over the limit will result in the following error:
apsched: no more claims allowed for this reservation (max 50)
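Within that limit, several apruns can run simultaneously by backgrounding each one and waiting for all of them to finish. A batch-script sketch (application names are hypothetical):

```shell
# Run three independent apruns at the same time, well under the 50-aprun limit.
aprun -n 16 ./app_a &
aprun -n 16 ./app_b &
aprun -n 16 ./app_c &
wait    # do not let the batch script exit until all three apruns finish
```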
Process Ensembles with wraprun
In some situations, the simultaneous aprun limit can be overcome by using the utility wraprun. Wraprun has the capacity to run an arbitrary number and combination of qualified MPI or serial applications under a single aprun call. MPI executables launched under wraprun must be dynamically linked. Non-MPI applications must be launched using a serial wrapper included with wraprun. The applications bundled under wraprun should each consume approximately the same walltime to avoid wasting allocation hours.
Complete information and examples on using wraprun can be found in the wraprun documentation.
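As a sketch of the idea (executable names are hypothetical; consult the wraprun documentation for the authoritative syntax), wraprun bundles several applications into one launch using colon-separated task groups:

```shell
# Bundle an 80-task and a 160-task MPI application under a single aprun call.
wraprun -n 80 ./foo.out : -n 160 ./bar.out
```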
The following table lists commonly used options to aprun. For a more detailed description of aprun options, see the aprun man page.
| Option | Description |
| --- | --- |
| -D | Debug; shows the layout aprun will use. |
| -n | Number of total MPI tasks (aka “processing elements”) for the executable. If you do not specify the number of tasks to aprun, the system default is 1 task. |
| -N | Number of MPI tasks (aka “processing elements”) per physical node. Warning: Because each node contains multiple processors/NUMA nodes, the -S option is likely a better option than -N to control layout within a node. |
| -m | Memory required per MPI task. There is a maximum of 2 GB per core, i.e., requesting 2.1 GB will allocate a minimum of two cores per MPI task. |
| -d | Number of threads per MPI task. Warning: The default value for -d is 1. |
| -j | For Titan: number of CPUs to use per paired-core compute unit. Valid values are 0 (use the system default), 1 (use one integer core), and 2 (use both integer cores; this is the system default). For Eos: controls Hyper-Threading. Valid values are 0 (use the system default), 1 (turn Hyper-Threading off), and 2 (turn Hyper-Threading on; this is the system default). |
| -cc | The cpu_list option; binds MPI tasks or threads to the specified CPUs. The list is given as comma-separated numbers (0 through 15), each specifying a compute unit (core) on the node, or as hyphen-separated ranges of numbers, each specifying a range of compute units (cores) on the node. See man aprun. |
| -S | Number of MPI tasks (aka “processing elements”) per NUMA node. Can be 1, 2, 3, 4, 5, 6, 7, or 8. |
| -ss | Strict memory containment per NUMA node. The default is to allow remote-NUMA-node memory access; this option prevents it. |
| -r | Assign system services associated with your application to a compute core. If you use fewer than 16 cores, you can request that all system services be placed on an unused core. This reduces “jitter” (i.e., application variability) because the daemons will not cause the application to context switch unexpectedly. |
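Putting several of these options together, a hybrid MPI/OpenMP launch sketch (the executable name is hypothetical) that places one MPI task per NUMA node, with eight threads each and strict memory containment:

```shell
export OMP_NUM_THREADS=8          # 8 OpenMP threads per MPI task, matching -d
aprun -n 4 -S 1 -d 8 -ss ./a.out
# -n 4 : four MPI tasks in total
# -S 1 : one task per NUMA node, i.e., two tasks per physical node
# -d 8 : eight threads per task, filling each 8-core NUMA node
# -ss  : strict NUMA memory containment for each task
```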