Controlling MPI Task Layout Within an Eos Node

Users can control MPI task layout at (2) levels:

  1. Within a physical node
  2. Across physical nodes

This article focuses on how to control MPI task layout within a physical node.

Understanding NUMA Nodes

Each physical node is organized into (2) 8-core NUMA nodes. NUMA is an acronym for “Non-Uniform Memory Access”. You can think of a NUMA node as a division of a physical node that contains a subset of processor cores and their high-affinity memory.

Applications may use resources from one or both NUMA nodes. The default MPI task layout is SMP-style: MPI tasks are assigned sequentially to all cores on one NUMA node before any task is assigned to the other NUMA node.

Note: A brief description of how a physical XC30 node is organized can be found on the XC30 node description page.

Spreading MPI Tasks Across NUMA Nodes

Users can control MPI task layout within a physical node using the aprun NUMA node flags. The -S option to aprun sets the number of MPI tasks placed on each NUMA node, so a job can spread its tasks over the (2) available NUMA nodes instead of filling the first one.

Note: Jobs that do not utilize all of a physical node’s processor cores may see performance improvements by spreading MPI tasks across NUMA nodes within a physical node.

Example 1: Default NUMA Placement

Job requests (2) processor cores without a NUMA flag. Both tasks are placed on the first NUMA node.

$ aprun -n2 ./a.out
Rank 0, Node 0, NUMA 0, Core 0
Rank 1, Node 0, NUMA 0, Core 1

Example 2: Specific NUMA Placement

Job requests (2) processor cores with aprun -S. A task is placed on each of the (2) NUMA nodes:

$ aprun -n2 -S1 ./a.out
Rank 0, Node 0, NUMA 0, Core 0
Rank 1, Node 0, NUMA 1, Core 0
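
The two examples above can be reproduced with a small model of the placement rule. This is a minimal sketch, not a query of aprun itself: numa_layout is a hypothetical helper name, and it assumes (2) NUMA nodes per physical node and that each NUMA node receives the given number of tasks before the next NUMA node is used.

```shell
# Hypothetical helper (not part of aprun): print where N ranks would land
# when S ranks are placed on each NUMA node.
# Assumptions: 2 NUMA nodes per physical node; a NUMA node is filled with
# S ranks before the next NUMA node is used.
numa_layout() (
    ntasks=$1
    per_numa=$2
    rank=0
    while [ "$rank" -lt "$ntasks" ]; do
        node=$(( rank / (2 * per_numa) ))    # physical node index
        numa=$(( (rank / per_numa) % 2 ))    # NUMA node within that node
        core=$(( rank % per_numa ))          # core within that NUMA node
        echo "Rank $rank, Node $node, NUMA $numa, Core $core"
        rank=$(( rank + 1 ))
    done
)

numa_layout 2 8   # models: aprun -n2 ./a.out (first NUMA node filled first)
numa_layout 2 1   # models: aprun -n2 -S1 ./a.out
```

Under these assumptions the first call prints the same lines as Example 1 and the second call prints the same lines as Example 2. Actual placement should still be confirmed on the system.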

The following table summarizes common NUMA node options to aprun:

Option  Description
-S      Number of processing elements (PEs) to allocate per NUMA node; a PE is essentially a processor core. Can be 1 through 8. With Hyper Threading each NUMA node exposes 16 logical cores, so larger values such as the -S9 used in Example 3 below can be used.
-ss     Strict memory containment per NUMA node. The default is to allow remote NUMA node memory access. This option prevents memory access of the remote NUMA node.

Advanced NUMA Node Placement

Example 1: Grouping MPI Tasks on a Single NUMA Node

Run a.out on (8) cores. Place (8) MPI tasks on (1) NUMA node. In this case the aprun -S option is optional, because the default SMP-style layout already fills one NUMA node before using the other:

$ aprun -n8 -S8 ./a.out

Compute Node (entries are MPI ranks; - marks an idle core)
        Core 0  Core 1  Core 2  Core 3  Core 4  Core 5  Core 6  Core 7
NUMA 0  0       1       2       3       4       5       6       7
NUMA 1  -       -       -       -       -       -       -       -

Example 2: Spreading MPI tasks across NUMA nodes

Run a.out on (8) cores. Place (4) MPI tasks on each of (2) NUMA nodes via aprun -S.

$ aprun -n8 -S4 ./a.out

Compute Node (entries are MPI ranks; - marks an idle core)
        Core 0  Core 1  Core 2  Core 3  Core 4  Core 5  Core 6  Core 7
NUMA 0  0       1       2       3       -       -       -       -
NUMA 1  4       5       6       7       -       -       -       -
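
The -S value in this example is simply the task count divided over the (2) NUMA nodes. A minimal sketch of that arithmetic follows. The helper name spread_per_numa is hypothetical, and it assumes all tasks are meant to fit on one physical node.

```shell
# Hypothetical helper: -S value that splits N tasks over the 2 NUMA nodes
# of one physical node, rounding up when N is odd.
spread_per_numa() {
    echo $(( ($1 + 1) / 2 ))
}

ntasks=8
per_numa=$(spread_per_numa "$ntasks")

# Print the command rather than running it, so the sketch works anywhere.
echo "aprun -n$ntasks -S$per_numa ./a.out"
```

For 8 tasks this prints the aprun command used in this example.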

Example 3: Spreading MPI Tasks Across NUMA Nodes with Hyper Threading

Hyper Threading is enabled by default and can be explicitly enabled with the -j2 aprun option. With Hyper Threading, the NUMA nodes behave as if each has 16 logical cores.

Run a.out on (18) logical cores. Place (9) MPI tasks on each of (2) NUMA nodes via aprun -S.

$ aprun -n18 -S9 ./a.out

Compute Node 0 (entries are MPI ranks; - marks an idle logical core)
        Core 0  Core 1  Core 2  Core 3  Core 4  Core 5  Core 6  Core 7
NUMA 0  0       1       2       3       4       5       6       7
NUMA 1  9       10      11      12      13      14      15      16

        Core 8  Core 9  Core 10 Core 11 Core 12 Core 13 Core 14 Core 15
NUMA 0  8       -       -       -       -       -       -       -
NUMA 1  17      -       -       -       -       -       -       -
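
One way to see why -S9 keeps all (18) tasks on a single physical node is to count how many physical nodes a given task count and -S value would occupy. This is a rough sketch under the same model as the article: (2) NUMA nodes per physical node and exactly the -S number of tasks per NUMA node. The helper name nodes_needed is hypothetical.

```shell
# Hypothetical helper: physical nodes occupied when S tasks are placed on
# each of the 2 NUMA nodes, i.e. ceil(N / (2 * S)) in integer arithmetic.
nodes_needed() {
    echo $(( ($1 + 2 * $2 - 1) / (2 * $2) ))
}

nodes_needed 18 9   # 9 tasks per NUMA node, relies on Hyper Threading
nodes_needed 18 8   # 8 tasks per NUMA node, one per physical core
```

Under this model the first call prints 1 and the second prints 2: with -S8, the last (2) tasks would spill onto a second physical node.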


To see MPI rank placement information on the nodes, set the PMI_DEBUG environment variable to 1.

For csh:

$ setenv PMI_DEBUG 1

For bash:

$ export PMI_DEBUG=1
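
Putting the pieces together, a launch from a bash or sh session might look like the sketch below. The aprun line reuses the command from Example 2 of the advanced section. The check for aprun is an addition for illustration, so that the snippet does nothing harmful on a machine where aprun is not installed.

```shell
# Enable PMI rank placement output for this shell and its child processes.
export PMI_DEBUG=1

# Launch only where aprun exists (for example inside a batch job);
# elsewhere, report the command that would have been run.
if command -v aprun >/dev/null 2>&1; then
    aprun -n8 -S4 ./a.out
else
    echo "aprun not found; would run: aprun -n8 -S4 ./a.out"
fi
```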