Controlling MPI Task Layout Within an Eos Node
Users have (2) ways to control MPI task layout:
- Within a physical node
- Across physical nodes
This article focuses on how to control MPI task layout within a physical node.
Understanding NUMA Nodes
Each physical node is organized into (2) 8-core NUMA nodes. NUMA is an acronym for “Non-Uniform Memory Access”. You can think of a NUMA node as a division of a physical node that contains a subset of processor cores and their high-affinity memory.
Applications may use resources from one or both NUMA nodes. The default MPI task layout is SMP-style: tasks are placed sequentially, filling all cores on one NUMA node before any tasks are assigned to the other NUMA node.
Spreading MPI Tasks Across NUMA Nodes
Each physical node contains (2) NUMA nodes. Users can control MPI task layout using the aprun NUMA node flags. For jobs that do not utilize all cores on a node, it may be beneficial to spread a physical node’s MPI task load over the (2) available NUMA nodes via the -S option to aprun.
Example 1: Default NUMA Placement
Job requests (2) processor cores without a NUMA flag. Both tasks are placed on the first NUMA node.
$ aprun -n2 ./a.out
Rank 0, Node 0, NUMA 0, Core 0
Rank 1, Node 0, NUMA 0, Core 1
Example 2: Specific NUMA Placement
Job requests (2) processor cores with aprun -S. A task is placed on each of the (2) NUMA nodes:
$ aprun -n2 -S1 ./a.out
Rank 0, Node 0, NUMA 0, Core 0
Rank 1, Node 0, NUMA 1, Core 0
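The placement listings above come from a small test program (a.out) that reports where each MPI task is running. This article does not show its source, but the following is a minimal sketch of such a program, assuming MPI and glibc's sched_getcpu(); it derives the NUMA node from the logical core number under the assumption of 8 physical cores per NUMA node (as on Eos), and it prints the node's hostname rather than the numeric node index shown above.

/* placement.c -- a minimal sketch of a test program like the a.out used in
 * these examples (the original program is not shown in this article).
 * Each MPI task reports its rank, node name, NUMA node, and core. */
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, len;
    char node[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(node, &len);

    /* Logical core this task is currently running on. */
    int core = sched_getcpu();

    /* Assumes 8 physical cores per NUMA node and the usual Linux numbering
     * in which the second hardware threads follow the first 16 cores. */
    int numa = (core / 8) % 2;

    printf("Rank %d, Node %s, NUMA %d, Core %d\n", rank, node, numa, core);

    MPI_Finalize();
    return 0;
}

On a Cray system such as Eos, the compiler wrapper would typically be used to build it, e.g. cc placement.c -o a.out.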
The following table summarizes common NUMA node options to aprun:

| Option | Description |
| --- | --- |
| -S | Processing elements (essentially a processor core) per NUMA node. Specifies the number of PEs to allocate per NUMA node. Can be 1, 2, 3, 4, 5, 6, 7, or 8. |
| -ss | Strict memory containment per NUMA node. The default is to allow remote NUMA node memory access; this option prevents memory access of the remote NUMA node. |
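For jobs that should also keep each task's memory allocations on its local NUMA node, -ss can be combined with -S. A usage sketch, assuming the flags behave as summarized above:

$ aprun -n8 -S4 -ss ./a.out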
Advanced NUMA Node Placement
Example 1: Grouping MPI Tasks on a Single NUMA Node
Run a.out on (8) cores. Place (8) MPI tasks on (1) NUMA node. In this case the aprun -S option is optional:
$ aprun -n8 -S8 ./a.out
Example 2: Spreading MPI Tasks Across NUMA Nodes
Run a.out on (8) cores. Place (4) MPI tasks on each of the (2) NUMA nodes via -S4:
$ aprun -n8 -S4 ./a.out
Example 3: Spreading MPI Tasks Across NUMA Nodes with Hyper Threading
Hyper Threading is enabled by default and can be explicitly requested with the -j2 aprun option. With Hyper Threading, each NUMA node behaves as if it has 16 logical cores.
Run a.out on (18) cores. Place (9) MPI tasks on each of the (2) NUMA nodes via -S9:
$ aprun -n18 -S9 ./a.out
To see MPI rank placement information on the nodes, set the PMI_DEBUG environment variable to 1:
$ setenv PMI_DEBUG 1     (csh/tcsh)
$ export PMI_DEBUG=1     (bash/ksh)