Eos: Controlling Thread Layout Within a Physical Node

Eos supports threaded programming within a compute node. Threads may span both processors within a single compute node, but cannot span compute nodes. With Intel Hyper-Threading enabled, each node's 16 physical cores can behave as 32 logical cores. Hyper-Threading is enabled by default, so users must pass the -j1 option to aprun to explicitly turn it off. For threaded codes, Hyper-Threading allows twice as many threads to run per physical core. Users have a great deal of flexibility in thread placement; several examples are shown below.

Note: Threaded codes must use the -d (depth) option to aprun.

The -d option to aprun specifies the number of threads per MPI task. Under previous CNL versions this option was not required; under the current CNL version, the number of cores used is the value of -d multiplied by the value of -n, as illustrated below.

Warning: Without the -d option, all threads will be started on the same processor core. This can lead to performance degradation for threaded codes.
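
As a quick illustration of this arithmetic (a hypothetical launch, not one of the examples below), (4) MPI tasks at a depth of (4) use 4 x 4 = 16 cores, which is exactly one Eos compute node's worth of physical cores when Hyper-Threading is disabled with -j1:

$ export OMP_NUM_THREADS=4
$ aprun -n4 -d4 -j1 a.out
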
Thread Layout Examples

The following examples are written for the bash shell. If using csh/tcsh, you should change export OMP_NUM_THREADS=x to setenv OMP_NUM_THREADS x wherever it appears.
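
For example, where the bash examples below set the thread count with:

$ export OMP_NUM_THREADS=16

a csh/tcsh user would instead run:

$ setenv OMP_NUM_THREADS 16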

Example 1: (2) MPI tasks, (16) Threads Each

This example will launch (2) MPI tasks, each with (16) threads, with each thread running on its own dedicated physical core. This uses (2) compute nodes and requires a node request of (2):

$ export OMP_NUM_THREADS=16
$ aprun -n2 -d16 -j1 a.out

Rank 0, Thread  0, Node 0, NUMA 0, Core 0 <-- MASTER
Rank 0, Thread  1, Node 0, NUMA 0, Core 1 <-- slave
Rank 0, Thread  2, Node 0, NUMA 0, Core 2 <-- slave
Rank 0, Thread  3, Node 0, NUMA 0, Core 3 <-- slave
Rank 0, Thread  4, Node 0, NUMA 0, Core 4 <-- slave
Rank 0, Thread  5, Node 0, NUMA 0, Core 5 <-- slave
Rank 0, Thread  6, Node 0, NUMA 0, Core 6 <-- slave
Rank 0, Thread  7, Node 0, NUMA 0, Core 7 <-- slave
Rank 0, Thread  8, Node 0, NUMA 1, Core 0 <-- slave
Rank 0, Thread  9, Node 0, NUMA 1, Core 1 <-- slave
Rank 0, Thread 10, Node 0, NUMA 1, Core 2 <-- slave
Rank 0, Thread 11, Node 0, NUMA 1, Core 3 <-- slave
Rank 0, Thread 12, Node 0, NUMA 1, Core 4 <-- slave
Rank 0, Thread 13, Node 0, NUMA 1, Core 5 <-- slave
Rank 0, Thread 14, Node 0, NUMA 1, Core 6 <-- slave
Rank 0, Thread 15, Node 0, NUMA 1, Core 7 <-- slave
Rank 1, Thread  0, Node 1, NUMA 0, Core 0 <-- MASTER
Rank 1, Thread  1, Node 1, NUMA 0, Core 1 <-- slave
Rank 1, Thread  2, Node 1, NUMA 0, Core 2 <-- slave
Rank 1, Thread  3, Node 1, NUMA 0, Core 3 <-- slave
Rank 1, Thread  4, Node 1, NUMA 0, Core 4 <-- slave
Rank 1, Thread  5, Node 1, NUMA 0, Core 5 <-- slave
Rank 1, Thread  6, Node 1, NUMA 0, Core 6 <-- slave
Rank 1, Thread  7, Node 1, NUMA 0, Core 7 <-- slave
Rank 1, Thread  8, Node 1, NUMA 1, Core 0 <-- slave
Rank 1, Thread  9, Node 1, NUMA 1, Core 1 <-- slave
Rank 1, Thread 10, Node 1, NUMA 1, Core 2 <-- slave
Rank 1, Thread 11, Node 1, NUMA 1, Core 3 <-- slave
Rank 1, Thread 12, Node 1, NUMA 1, Core 4 <-- slave
Rank 1, Thread 13, Node 1, NUMA 1, Core 5 <-- slave
Rank 1, Thread 14, Node 1, NUMA 1, Core 6 <-- slave
Rank 1, Thread 15, Node 1, NUMA 1, Core 7 <-- slave
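
The node request of (2) noted above is made in the batch script used to submit the job. A minimal sketch, assuming a Torque/Moab-style batch script (the project ID PRJ123 and the walltime are placeholders):

#!/bin/bash
#PBS -A PRJ123
#PBS -l nodes=2
#PBS -l walltime=00:10:00

cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=16
aprun -n2 -d16 -j1 a.out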

This can also be accomplished on a single compute node with Hyper-Threading. Because Hyper-Threading provides 32 logical cores per node, the (2) tasks with (16) threads each fit on (1) node, requiring a node request of only (1):

$ export OMP_NUM_THREADS=16
$ aprun -n2 -d16 a.out

Rank 0, Thread  0, Node 0, NUMA 0, Core  0 <-- MASTER
Rank 0, Thread  1, Node 0, NUMA 0, Core  1 <-- slave
Rank 0, Thread  2, Node 0, NUMA 0, Core  2 <-- slave
Rank 0, Thread  3, Node 0, NUMA 0, Core  3 <-- slave
Rank 0, Thread  4, Node 0, NUMA 0, Core  4 <-- slave
Rank 0, Thread  5, Node 0, NUMA 0, Core  5 <-- slave
Rank 0, Thread  6, Node 0, NUMA 0, Core  6 <-- slave
Rank 0, Thread  7, Node 0, NUMA 0, Core  7 <-- slave
Rank 0, Thread  8, Node 0, NUMA 0, Core  8 <-- slave
Rank 0, Thread  9, Node 0, NUMA 0, Core  9 <-- slave
Rank 0, Thread 10, Node 0, NUMA 0, Core 10 <-- slave
Rank 0, Thread 11, Node 0, NUMA 0, Core 11 <-- slave
Rank 0, Thread 12, Node 0, NUMA 0, Core 12 <-- slave
Rank 0, Thread 13, Node 0, NUMA 0, Core 13 <-- slave
Rank 0, Thread 14, Node 0, NUMA 0, Core 14 <-- slave
Rank 0, Thread 15, Node 0, NUMA 0, Core 15 <-- slave
Rank 1, Thread  0, Node 0, NUMA 1, Core  0 <-- MASTER
Rank 1, Thread  1, Node 0, NUMA 1, Core  1 <-- slave
Rank 1, Thread  2, Node 0, NUMA 1, Core  2 <-- slave
Rank 1, Thread  3, Node 0, NUMA 1, Core  3 <-- slave
Rank 1, Thread  4, Node 0, NUMA 1, Core  4 <-- slave
Rank 1, Thread  5, Node 0, NUMA 1, Core  5 <-- slave
Rank 1, Thread  6, Node 0, NUMA 1, Core  6 <-- slave
Rank 1, Thread  7, Node 0, NUMA 1, Core  7 <-- slave
Rank 1, Thread  8, Node 0, NUMA 1, Core  8 <-- slave
Rank 1, Thread  9, Node 0, NUMA 1, Core  9 <-- slave
Rank 1, Thread 10, Node 0, NUMA 1, Core 10 <-- slave
Rank 1, Thread 11, Node 0, NUMA 1, Core 11 <-- slave
Rank 1, Thread 12, Node 0, NUMA 1, Core 12 <-- slave
Rank 1, Thread 13, Node 0, NUMA 1, Core 13 <-- slave
Rank 1, Thread 14, Node 0, NUMA 1, Core 14 <-- slave
Rank 1, Thread 15, Node 0, NUMA 1, Core 15 <-- slave
Example 2: (2) MPI tasks, (6) Threads Each

This example will launch (2) MPI tasks, each with (6) threads, placing (1) MPI task per NUMA node via the -S1 option. This uses (1) physical compute node and requires a node request of (1):

$ export OMP_NUM_THREADS=6
$ aprun -n2 -d6 -S1 a.out

Rank 0, Thread 0, Node 0, NUMA 0, Core 0 <-- MASTER
Rank 0, Thread 1, Node 0, NUMA 0, Core 1 <-- slave
Rank 0, Thread 2, Node 0, NUMA 0, Core 2 <-- slave
Rank 0, Thread 3, Node 0, NUMA 0, Core 3 <-- slave
Rank 0, Thread 4, Node 0, NUMA 0, Core 4 <-- slave
Rank 0, Thread 5, Node 0, NUMA 0, Core 5 <-- slave
Rank 1, Thread 0, Node 0, NUMA 1, Core 0 <-- MASTER
Rank 1, Thread 1, Node 0, NUMA 1, Core 1 <-- slave
Rank 1, Thread 2, Node 0, NUMA 1, Core 2 <-- slave
Rank 1, Thread 3, Node 0, NUMA 1, Core 3 <-- slave
Rank 1, Thread 4, Node 0, NUMA 1, Core 4 <-- slave
Rank 1, Thread 5, Node 0, NUMA 1, Core 5 <-- slave

Cores 6 and 7 of each NUMA node remain idle.

Example 3: (4) MPI tasks, (2) Threads Each

This example will launch (4) MPI tasks, each with (2) threads, placing only (1) MPI task [and its (2) threads] on each NUMA node. This uses (2) physical compute nodes and requires a node request of (2), even though only (8) cores are actually used:

$ export OMP_NUM_THREADS=2
$ aprun -n4 -d2 -S1 a.out

Rank 0, Thread 0, Node 0, NUMA 0, Core 0 <-- MASTER
Rank 0, Thread 1, Node 0, NUMA 0, Core 1 <-- slave
Rank 1, Thread 0, Node 0, NUMA 1, Core 0 <-- MASTER
Rank 1, Thread 1, Node 0, NUMA 1, Core 1 <-- slave
Rank 2, Thread 0, Node 1, NUMA 0, Core 0 <-- MASTER
Rank 2, Thread 1, Node 1, NUMA 0, Core 1 <-- slave
Rank 3, Thread 0, Node 1, NUMA 1, Core 0 <-- MASTER
Rank 3, Thread 1, Node 1, NUMA 1, Core 1 <-- slave