Controlling MPI Task Layout Within a Physical Node
Categories: Running Jobs
Users have (2) ways to control MPI task layout:
- Within a physical node
- Across physical nodes
This article focuses on how to control MPI task layout within a physical node.
Understanding NUMA Nodes
Each physical node is organized into (2) 8-core NUMA nodes. NUMA is an acronym for “Non-Uniform Memory Access”. You can think of a NUMA node as a division of a physical node that contains a subset of processor cores and their high-affinity memory.
Applications may use resources from one or both NUMA nodes. The default MPI task layout is SMP-style. This means MPI will sequentially allocate all cores on one NUMA node before allocating tasks to another NUMA node.
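The default SMP-style ordering can be illustrated with a short Python sketch. This is a hypothetical model, not part of aprun, and it assumes a 16-core physical node with two 8-core NUMA nodes:

```python
# Hypothetical model (not part of aprun) of default SMP-style placement
# on one physical node with two 8-core NUMA nodes.
CORES_PER_NUMA = 8

def smp_placement(nranks):
    """Return (numa_node, core) for each rank: NUMA 0 fills before NUMA 1."""
    return [(rank // CORES_PER_NUMA, rank % CORES_PER_NUMA)
            for rank in range(nranks)]

# Ten ranks: ranks 0-7 fill NUMA node 0, ranks 8-9 spill onto NUMA node 1.
print(smp_placement(10))
```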
Spreading MPI Tasks Across NUMA Nodes
Each physical node contains (2) NUMA nodes. Users can control MPI task layout using the aprun NUMA node flags. For jobs that do not utilize all cores on a node, it may be beneficial to spread a physical node's MPI task load over the (2) available NUMA nodes via the -S option to aprun.
Example 1: Default NUMA Placement
Job requests (2) processor cores without a NUMA flag. Both tasks are placed on the first NUMA node:

$ aprun -n2 ./a.out
Rank 0, Node 0, NUMA 0, Core 0
Rank 1, Node 0, NUMA 0, Core 1
Example 2: Specific NUMA Placement
Job requests (2) processor cores with aprun -S. A task is placed on each of the (2) NUMA nodes:

$ aprun -n2 -S1 ./a.out
Rank 0, Node 0, NUMA 0, Core 0
Rank 1, Node 0, NUMA 1, Core 0
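The effect of -S can be sketched with a small Python model. This is a hypothetical illustration, not part of aprun, assuming two 8-core NUMA nodes per physical node:

```python
# Hypothetical model (not part of aprun) of "-S pes_per_numa", which caps
# the number of processing elements (PEs) placed on each NUMA node.
def s_placement(nranks, pes_per_numa):
    """Return (numa_node, core) per rank when -S limits ranks per NUMA node."""
    return [(rank // pes_per_numa, rank % pes_per_numa)
            for rank in range(nranks)]

# aprun -n2 -S1: one rank on each NUMA node, as in Example 2 above.
print(s_placement(2, 1))   # [(0, 0), (1, 0)]
```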
The following table summarizes common NUMA node options to aprun:

Option | Description
-S | Processing elements (essentially processor cores) per NUMA node. Specifies the number of PEs to allocate per NUMA node. Can be 1, 2, 3, 4, 5, 6, 7, or 8.
-ss | Strict memory containment per NUMA node. The default is to allow remote NUMA node memory access; this option prevents memory access of the remote NUMA node.
Advanced NUMA Node Placement
Example 1: Grouping MPI Tasks on a Single NUMA Node
Run a.out on (8) cores. Place all (8) MPI tasks on (1) NUMA node. In this case the aprun -S option is optional, because the default SMP-style placement already fills the first NUMA node:

$ aprun -n8 -S8 ./a.out
Example 2: Spreading MPI Tasks Across NUMA Nodes
Run a.out on (8) cores. Place (4) MPI tasks on each of the (2) NUMA nodes via -S4:

$ aprun -n8 -S4 ./a.out
Example 3: Spread Out MPI Tasks Across Paired-Core Compute Units
The -j option can be used to allow only one task per paired-core compute unit.
Run a.out on (8) cores; (4) cores per NUMA node; but only (1) core on each paired-core compute unit:
$ aprun -n8 -S4 -j1 ./a.out
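Which cores remain usable under -j can be sketched in Python. This is a hypothetical model, not part of aprun; it assumes each 8-core NUMA node is grouped into (4) two-core compute units, with the cores of each pair numbered consecutively:

```python
# Hypothetical model (not part of aprun): with paired-core compute units,
# "-j 1" allows tasks on only one core of each pair. Assumes 8 cores per
# NUMA node grouped into 4 two-core compute units, numbered consecutively.
CORES_PER_CU = 2

def cores_per_numa_with_j(cores_per_numa=8, j=1):
    """Cores usable on one NUMA node when only j cores per compute unit run tasks."""
    return [cu * CORES_PER_CU + k
            for cu in range(cores_per_numa // CORES_PER_CU)
            for k in range(j)]

print(cores_per_numa_with_j(8, 1))  # one core per pair: [0, 2, 4, 6]
print(cores_per_numa_with_j(8, 2))  # both cores of every pair: 0 through 7
```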
To see MPI rank placement information on the nodes, set the PMI_DEBUG environment variable to 1:

csh/tcsh:
$ setenv PMI_DEBUG 1

bash/ksh:
$ export PMI_DEBUG=1
Example 4: Assign System Services to an Unused Compute Core
The -r option can be used to assign the system services associated with your application to a compute core.
If you use fewer than 16 cores, you can request that all of the system services be placed on an unused core. This will reduce “jitter” (i.e., application variability), because the service daemons will not cause the application to context switch unexpectedly.
You should use -r 1 while ensuring that -N is less than 16 or -S is less than 8. The following example will place tasks from a.out on cores 0-14; core 15 will be used only for system services:
Run a.out on (8) cores; (4) cores per NUMA node; but only (1) core on each paired-core compute unit. All node services will be placed on the node’s last core:
$ aprun -n8 -S4 -j1 -r1 ./a.out
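The reservation made by -r can be sketched with a trivial Python model. This is a hypothetical illustration, not part of aprun, assuming a 16-core node:

```python
# Hypothetical model (not part of aprun): "-r 1" reserves the node's last
# core for system services, leaving the rest for application tasks.
def application_cores(cores_per_node=16, r=1):
    """Cores available to the application after reserving r cores for services."""
    return list(range(cores_per_node - r))

print(application_cores())  # cores 0-14; core 15 handles system services
```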