
Eos Thread Affinity Example


The example application is a Cray test code, cray-mpi-ex.c, that shows how your MPI tasks are placed on the cores within Eos's compute nodes. This application may be useful if you are trying to determine how your application's MPI tasks will be distributed. The cray-mpi-ex.c test code is given at the bottom of this example if you want to view it or copy it for your own use.

Compile

The compiler wrappers on Eos function like the compiler wrappers on Titan. The main functional difference is that the Intel compiler and programming environment are the default on Eos, while the PGI equivalents are the default on Titan. I will use the Intel environment below.
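
If your code needs a different compiler suite, you can swap programming-environment modules before compiling, just as on other Cray systems (the exact module names below are an assumption; check module avail on Eos):

eos% module swap PrgEnv-intel PrgEnv-gnu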

To compile:

eos%  cc cray-mpi-ex.c
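
The wrapper produces an executable named a.out by default, which is what the batch scripts below launch. If you prefer a descriptive name (a minor variation, not part of the original workflow), add the usual -o flag:

eos% cc cray-mpi-ex.c -o cray-mpi-ex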
Batch Script

Here is the example batch script cray-mpi.pbs:

#!/bin/bash
#    Begin PBS directives
#PBS -A STF007
#PBS -N cray-mpi
#PBS -j oe
#PBS -l walltime=1:00:00,nodes=2
#    End PBS directives and begin shell commands
cd $MEMBERWORK/stf007
aprun -n 32 -j1 ./a.out
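
The aprun line launches 32 MPI ranks, and -j1 restricts placement to one rank per physical core; with 16 physical cores per Eos node, that is why two nodes are requested. If you want to state the ranks-per-node count explicitly, aprun's -N option can be added (an equivalent variant, shown here only as a sketch):

aprun -n 32 -N 16 -j1 ./a.out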

To submit this from $MEMBERWORK/[projid]:

eos% qsub cray-mpi.pbs

The output will be in $MEMBERWORK/[projid] in a file called cray-mpi.ojob_number.
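
While the job is waiting or running, you can check on it with the usual batch tools, for example (a common PBS/Torque form, not something specific to this example):

eos% qstat -u $USER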

I used two nodes and I did not use Hyper Threading. In the output below, notice that two nodes, 708 and 709, have been used, and two sets of cores, 0-15, are listed for the 32 ranks.

Rank 9, Node 00708, Core 9
Rank 8, Node 00708, Core 8
Rank 15, Node 00708, Core 15
Rank 7, Node 00708, Core 7
Rank 5, Node 00708, Core 5
Rank 3, Node 00708, Core 3
Rank 2, Node 00708, Core 2
Rank 1, Node 00708, Core 1
Rank 4, Node 00708, Core 4
Rank 6, Node 00708, Core 6
Rank 12, Node 00708, Core 12
Rank 14, Node 00708, Core 14
Rank 10, Node 00708, Core 10
Rank 13, Node 00708, Core 13
Rank 0, Node 00708, Core 0
Rank 11, Node 00708, Core 11
Rank 25, Node 00709, Core 9
Rank 20, Node 00709, Core 4
Rank 16, Node 00709, Core 0
Rank 30, Node 00709, Core 14
Rank 21, Node 00709, Core 5
Rank 22, Node 00709, Core 6
Rank 19, Node 00709, Core 3
Rank 23, Node 00709, Core 7
Rank 18, Node 00709, Core 2
Rank 26, Node 00709, Core 10
Rank 31, Node 00709, Core 15
Rank 28, Node 00709, Core 12
Rank 24, Node 00709, Core 8
Rank 27, Node 00709, Core 11
Rank 29, Node 00709, Core 13
Rank 17, Node 00709, Core 1
Application 19374 resources: utime ~4s, stime ~20s, Rss ~4736, inblocks ~10667, outblocks ~24358

If I wanted to use Hyper Threading, I could omit the -j1 aprun option and use half as many nodes. Here is the modified batch script:

#!/bin/bash
#    Begin PBS directives
#PBS -A STF007
#PBS -N cray-mpi
#PBS -j oe
#PBS -l walltime=1:00:00,nodes=1
#    End PBS directives and begin shell commands
cd $MEMBERWORK/stf007
aprun -n 32 ./a.out
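
Without -j1, aprun is free to place one rank on every logical CPU of the node. If you prefer to request both hardware threads per core explicitly, aprun's -j option should also accept a value of 2 (worth confirming with man aprun on Eos):

aprun -n 32 -j2 ./a.out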

Below is the output with Hyper Threading. Notice that only one node, 708, was used, and this time cores 0-31 were used for the 32 ranks. Hyper Threading makes each physical core behave as two logical cores. Each logical core stores a complete program state, but the two logical cores must share the resources of the physical core.

Rank 0, Node 00708, Core 0
Rank 1, Node 00708, Core 16
Rank 29, Node 00708, Core 30
Rank 28, Node 00708, Core 14
Rank 2, Node 00708, Core 1
Rank 3, Node 00708, Core 17
Rank 12, Node 00708, Core 6
Rank 13, Node 00708, Core 22
Rank 9, Node 00708, Core 20
Rank 8, Node 00708, Core 4
Rank 22, Node 00708, Core 11
Rank 23, Node 00708, Core 27
Rank 17, Node 00708, Core 24
Rank 30, Node 00708, Core 15
Rank 19, Node 00708, Core 25
Rank 31, Node 00708, Core 31
Rank 18, Node 00708, Core 9
Rank 5, Node 00708, Core 18
Rank 4, Node 00708, Core 2
Rank 21, Node 00708, Core 26
Rank 20, Node 00708, Core 10
Rank 26, Node 00708, Core 13
Rank 27, Node 00708, Core 29
Rank 14, Node 00708, Core 7
Rank 15, Node 00708, Core 23
Rank 10, Node 00708, Core 5
Rank 11, Node 00708, Core 21
Rank 25, Node 00708, Core 28
Rank 24, Node 00708, Core 12
Rank 16, Node 00708, Core 8
Rank 7, Node 00708, Core 19
Rank 6, Node 00708, Core 3
Application 19376 resources: utime ~1s, stime ~1s, Rss ~3484, inblocks ~5226, outblocks ~12180
The Test Application cray-mpi-ex.c

Here is the example program if you wish to view or copy it.

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <string.h>
#include <unistd.h>
#include "mpi.h"

/* Borrowed from util-linux-2.13-pre7/schedutils/taskset.c.
 * Converts a cpu_set_t affinity mask into a compact string
 * such as "0-3,8,10-11". */
static char *cpuset_to_cstr(cpu_set_t *mask, char *str)
{
    char *ptr = str;
    int i, j, entry_made = 0;

    for (i = 0; i < CPU_SETSIZE; i++) {
        if (CPU_ISSET(i, mask)) {
            int run = 0;
            entry_made = 1;
            /* Count how many consecutive CPUs follow CPU i in the mask. */
            for (j = i + 1; j < CPU_SETSIZE; j++) {
                if (CPU_ISSET(j, mask)) run++;
                else break;
            }
            if (!run)
                sprintf(ptr, "%d,", i);
            else if (run == 1) {
                sprintf(ptr, "%d,%d,", i, i + 1);
                i++;
            } else {
                sprintf(ptr, "%d-%d,", i, i + run);
                i += run;
            }
            while (*ptr != 0) ptr++;
        }
    }
    ptr -= entry_made;   /* Drop the trailing comma, if any. */
    *ptr = 0;
    return str;
}

int main(int argc, char *argv[])
{
    int rank, i;
    cpu_set_t coremask;
    char clbuf[7 * CPU_SETSIZE], hnbuf[64], node[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    memset(clbuf, 0, sizeof(clbuf));
    memset(hnbuf, 0, sizeof(hnbuf));
    memset(node, 0, sizeof(node));
    (void)gethostname(hnbuf, sizeof(hnbuf));

    /* Strip the leading "nid" from the Cray hostname (e.g. "nid00708"). */
    for (i = 3; hnbuf[i] != '\0'; i++)
        node[i - 3] = hnbuf[i];

    /* Query this rank's CPU affinity mask and print it. */
    (void)sched_getaffinity(0, sizeof(coremask), &coremask);
    cpuset_to_cstr(&coremask, clbuf);
    printf("Rank %d, Node %s, Core %s\n", rank, node, clbuf);

    MPI_Finalize();
    return 0;
}
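
The program above reports affinity per MPI rank only. If you also want to see where OpenMP threads land, a minimal sketch along the same lines is shown below. It is not part of the original example; it reports the CPU each thread is currently running on via sched_getcpu() rather than the full affinity mask, and it assumes an OpenMP compile flag such as the Intel compiler's -qopenmp.

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <unistd.h>
#include <omp.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank;
    char hnbuf[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    (void)gethostname(hnbuf, sizeof(hnbuf));

    /* Each OpenMP thread reports the logical CPU it is executing on.
     * Note: sched_getcpu() gives the current CPU, which can change
     * between calls unless the threads are pinned. */
    #pragma omp parallel
    {
        int thread = omp_get_thread_num();
        int cpu = sched_getcpu();
        printf("Rank %d, Thread %d, Node %s, Core %d\n",
               rank, thread, hnbuf, cpu);
    }

    MPI_Finalize();
    return 0;
}

To try it, set OMP_NUM_THREADS in the batch script and give each rank a matching number of CPUs with aprun's -d option (again, verify the exact aprun options against the man page on Eos).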

If you have any questions about this example, please contact user support or your scientific computing liaison.