In High Performance Computing (HPC), computational work is performed by jobs. Individual jobs produce data that lend relevant insight into grand challenges in science and engineering. As such, the timely, efficient execution of jobs is the primary concern in the operation of any HPC system.

A job on Eos typically comprises a few different components:

  • A batch submission script.
  • A binary executable.
  • A set of input files for the executable.
  • A set of output files created by the executable.

And the process for running a job, in general, is to:

  1. Prepare executables and input files.
  2. Write a batch script.
  3. Submit the batch script to the batch scheduler.
  4. Optionally monitor the job before and during execution.

The following sections describe in detail how to create, submit, and manage jobs for execution on Eos.

Login vs. Compute Nodes

Cray Supercomputers are complex collections of different types of physical nodes/machines. For simplicity, we can think of Eos nodes as existing in two categories: login nodes or compute nodes.

Login Nodes

Login nodes are designed to facilitate SSH access into the overall system, and to handle simple tasks. When you first log in, you are placed on a login node. Login nodes are shared by all users of a system, and should only be used for basic tasks such as file editing, code compilation, data backup, and job submission. Login nodes should not be used for memory-intensive or processor-intensive tasks. Users should also limit the number of simultaneous tasks performed on login nodes. For example, a user should not run ten simultaneous tar processes.

Warning: Processor-intensive, memory-intensive, or otherwise disruptive processes running on login nodes may be killed without warning.

Compute Nodes

On Cray machines, when the aprun command is issued within a job script (or on the command line within an interactive batch job), the binary passed to aprun is copied to and executed in parallel on a set of compute nodes. Compute nodes run a Linux microkernel for reduced overhead and improved performance.

Note: On Cray machines, the only way to access the compute nodes is via the aprun command.

Filesystems Available to Compute Nodes on Eos

The Eos compute nodes can only see the center-wide Lustre® filesystem (Spider) which includes the $MEMBERWORK, $PROJWORK, and $WORLDWORK storage areas. Other storage spaces (User Home, User Archive, Project Home, and Project Archive) are not mounted on compute nodes.

Warning: Only $MEMBERWORK, $PROJWORK, and $WORLDWORK areas are available to compute nodes on Eos.

As a result, job executable binaries and job input files must reside within a Lustre-backed work directory, e.g. $MEMBERWORK/[projid]. Job output must also be sent to a Lustre-backed work directory.

Batch jobs can be submitted from User Home or Project Home, but additional steps are required to ensure the job runs successfully. Jobs submitted from Home areas should cd into a Lustre-backed work directory prior to invoking aprun. An error like the following may be returned if this is not done:

aprun: [NID 94]Exec /lustre/atlas/scratch/userid/projid/a.out failed: chdir /autofs/na1_home/userid
No such file or directory
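To avoid this, a script submitted from a Home area can change into a Lustre-backed work directory before the aprun call. A minimal sketch, assuming a hypothetical project ID projid and executable a.out:

cd $MEMBERWORK/projid    # move to a Lustre-backed work directory first
aprun -n 16 ./a.out      # aprun now launches from a filesystem the compute nodes can see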

Writing Batch Scripts for Eos

Batch scripts, or job submission scripts, are the mechanism by which a user submits and configures a job for eventual execution. A batch script is simply a shell script which contains:

  • Commands that can be interpreted by batch scheduling software (e.g. PBS)
  • Commands that can be interpreted by a shell

The batch script is submitted to the batch scheduler where it is parsed. Based on the parsed data, the batch scheduler places the script in the scheduler queue as a batch job. Once the batch job makes its way through the queue, the script will be executed on a service node within the set of allocated computational resources.

Sections of a Batch Script

Batch scripts are parsed into the following three sections:

  1. The Interpreter Line
    The first line of a script can be used to specify the script’s interpreter. This line is optional. If not used, the submitter’s default shell will be used. The line uses the “hash-bang-shell” syntax: #!/path/to/shell
  2. The Scheduler Options
    The batch scheduler options are preceded by #PBS, making them appear as comments to a shell. PBS will look for #PBS options in a batch script from the script’s first line through the first non-comment line. A comment line begins with #. #PBS options entered after the first non-comment line will not be read by PBS.

    Note: All batch scheduler options must appear at the beginning of the batch script.
  3. The Executable Commands
    The shell commands follow the last #PBS option and represent the main content of the batch job. If any #PBS lines follow executable statements, they will be ignored as comments.

The execution section of a script will be interpreted by a shell and can contain multiple lines of executable invocations, shell commands, and comments. When the job’s queue wait time is finished, commands within this section will be executed on a service node (sometimes called a “head node”) from the set of the job’s allocated resources. Under normal circumstances, the batch job will exit the queue after the last line of the script is executed.

An Example Batch Script

 1: #!/bin/bash
 2: #    Begin PBS directives
 3: #PBS -A pjt000
 4: #PBS -N test
 5: #PBS -j oe
 6: #PBS -l walltime=1:00:00,nodes=373
 7: #    End PBS directives and begin shell commands
 8: cd $MEMBERWORK/pjt000
 9: date
10: aprun -n 5968 ./a.out

The lines of this batch script do the following:

Line Option Description
1 Optional Specifies that the script should be interpreted by the bash shell.
2 Optional Comments do nothing.
3 Required The job will be charged to the “pjt000” project.
4 Optional The job will be named “test”.
5 Optional The job’s standard output and error will be combined.
6 Required The job will request 373 compute nodes for 1 hour.
7 Optional Comments do nothing.
8 This shell command will change to the user’s member work directory.
9 This shell command will run the date command.
10 This invocation will run 5968 MPI instances of the executable a.out on the compute nodes allocated by the batch system.
Note: For more batch script examples, please see the Additional Example Batch Scripts section of the Titan User Guide.

Batch Scheduler Node Requests

A node’s cores cannot be allocated to multiple jobs. Because the OLCF charges based upon the computational resources a job makes unavailable to others, a job is charged for an entire node even if the job uses only one processor core. To simplify the process, users are required to request an entire node through PBS.

Note: Whole nodes must be requested at the time of job submission, and allocations are reduced by core-hour amounts corresponding to whole nodes, regardless of actual core utilization.

Submitting Batch Scripts

Once written, a batch script is submitted to the batch scheduler via the qsub command.

$ cd /path/to/batch/script
$ qsub ./script.pbs

If successfully submitted, a PBS job ID will be returned. This ID is needed to monitor the job’s status with various job monitoring utilities. It is also necessary information when troubleshooting a failed job, or when asking the OLCF User Assistance Center for help.

Note: Always make a note of the returned job ID upon job submission, and include it in help requests to the OLCF User Assistance Center.

Options to the qsub command allow the specification of attributes which affect the behavior of the job. In general, options to qsub on the command line can also be placed in the batch scheduler options section of the batch script via #PBS.
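For example, the following two approaches request the same resources; this sketch uses the hypothetical project pjt000 and script script.pbs:

# Inside the batch script:
#PBS -A pjt000
#PBS -l walltime=1:00:00,nodes=2

# Or, equivalently, on the command line at submission time:
$ qsub -A pjt000 -l walltime=1:00:00,nodes=2 ./script.pbs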


Interactive Batch Jobs

Batch scripts are useful for submitting a group of commands, allowing them to run through the queue, then viewing the results at a later time. However, it is sometimes necessary to run tasks within a job interactively. Users are not permitted to access compute nodes nor run aprun directly from login nodes. Instead, users must use an interactive batch job to allocate and gain access to compute resources interactively. This is done by using the -I option to qsub.

Interactive Batch Example

For interactive batch jobs, PBS options are passed through qsub on the command line.

$ qsub -I -A pjt000 -q debug -X -l nodes=3,walltime=30:00

This request will:

Option Description
-I Start an interactive session
-A Charge to the “pjt000” project
-X Enables X11 forwarding. The DISPLAY environment variable must be set.
-q debug Run in the debug queue
-l nodes=3,walltime=30:00 Request 3 compute nodes for 30 minutes (you get all cores per node)

After running this command, you will have to wait until enough compute nodes are available, just as in any other batch job. However, once the job starts, you will be given an interactive prompt on the head node of your allocated resource. From here commands may be executed directly instead of through a batch script.
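A minimal sketch of such a session, assuming a hypothetical project pjt000 and executable a.out, might look like the following once the interactive prompt appears:

$ cd $MEMBERWORK/pjt000     # move to a Lustre-backed work directory
$ aprun -n 16 ./a.out       # run on the allocated compute node
$ aprun -n 16 ./a.out       # edit, rebuild, and rerun as needed without losing the allocation
$ exit                      # end the interactive job and release the nodes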

Debugging via Interactive Jobs

A common use of interactive batch is to aid in debugging efforts. Interactive access to compute resources allows the ability to run a process to the point of failure; however, unlike a batch job, the process can be restarted after brief changes are made without losing the compute resource allocation. This may help speed the debugging effort because a user does not have to wait in the queue between run attempts.

Note: To tunnel a GUI from an interactive batch job, the -X PBS option should be used to enable X11 forwarding.

Choosing an Interactive Job’s nodes Value

Because interactive jobs must wait in the queue until enough resources become available, it is useful to base the nodes request on the number of currently unallocated nodes to shorten the queue wait time. The showbf command (i.e., “show backfill”) can be used to see resource limits that would allow your job to be immediately back-filled (and thus started) by the scheduler. For example, the snapshot below shows that 802 nodes are currently free.

$ showbf
Partition   Tasks   Nodes   StartOffset    Duration       StartDate
---------   -----   -----   ------------   ------------   --------------
ALL         4744    802     INFINITY       00:00:00       HH:MM:SS_MM/DD

See showbf --help for additional options.


Common Batch Options to PBS

The following table summarizes frequently-used options to PBS:

Option Use Description
-A #PBS -A <account> Causes the job time to be charged to <account>. The account string, e.g. pjt000, is typically composed of three letters followed by three digits and optionally followed by a subproject identifier. The utility showproj can be used to list your valid assigned project ID(s). This option is required by all jobs.
-l #PBS -l nodes=<value> Maximum number of compute nodes. Jobs cannot request partial nodes.
#PBS -l walltime=<time> Maximum wall-clock time. <time> is in the format HH:MM:SS.
#PBS -l partition=<partition_name> Allocates resources on specified partition.
-o #PBS -o <filename> Writes standard output to <filename> instead of <job script>.o$PBS_JOBID. $PBS_JOBID is an environment variable created by PBS that contains the PBS job identifier.
-e #PBS -e <filename> Writes standard error to <filename> instead of <job script>.e$PBS_JOBID.
-j #PBS -j {oe,eo} Combines standard output and standard error into the standard error file (eo) or the standard out file (oe).
-m #PBS -m a Sends email to the submitter when the job aborts.
#PBS -m b Sends email to the submitter when the job begins.
#PBS -m e Sends email to the submitter when the job ends.
-M #PBS -M <address> Specifies email address to use for -m options.
-N #PBS -N <name> Sets the job name to <name> instead of the name of the job script.
-S #PBS -S <shell> Sets the shell to interpret the job script.
-q #PBS -q <queue> Directs the job to the specified queue. This option is not required to run in the default queue on any given system.
-V #PBS -V Exports all environment variables from the submitting shell into the batch job shell. See the note below on why this option is not recommended.
-X #PBS -X Enables X11 forwarding. The -X PBS option should be used to tunnel a GUI from an interactive batch job.
Note: Because the login nodes differ from the service nodes, using the ‘-V’ option is not recommended. Users should create the needed environment within the batch job.

Further details and other PBS options may be found through the qsub man page.


Batch Environment Variables

PBS sets multiple environment variables at submission time. The following PBS variables are useful within batch scripts:

Variable Description
$PBS_O_WORKDIR The directory from which the batch job was submitted. By default, a new job starts in your home directory. You can get back to the directory of job submission with cd $PBS_O_WORKDIR. Note that this is not necessarily the same directory in which the batch script resides.
$PBS_JOBID The job’s full identifier. A common use for PBS_JOBID is to append the job’s ID to the standard output and error files.
$PBS_NUM_NODES The number of nodes requested.
$PBS_JOBNAME The job name supplied by the user.
$PBS_NODEFILE The name of the file containing the list of nodes assigned to the job. Used sometimes on non-Cray clusters.
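As a sketch (assuming the job was submitted from a Lustre-backed work directory and uses a hypothetical executable a.out), these variables might be used in a batch script as follows:

cd $PBS_O_WORKDIR                              # return to the submission directory
echo "Job $PBS_JOBID ($PBS_JOBNAME) is using $PBS_NUM_NODES node(s)"
aprun -n 16 ./a.out > run.$PBS_JOBID.log       # tag output with the job ID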

Modifying Batch Jobs

The batch scheduler provides a number of utility commands for managing submitted jobs. See each utility’s man page for more information.

Removing and Holding Jobs

qdel

Jobs in the queue in any state can be stopped and removed from the queue using the command qdel.

$ qdel 1234

qhold

Jobs in the queue in a non-running state may be placed on hold using the qhold command. Jobs placed on hold will not be removed from the queue, but they will not be eligible for execution.

$ qhold 1234

qrls

Once on hold the job will not be eligible to run until it is released to return to a queued state. The qrls command can be used to remove a job from the held state.

$ qrls 1234

Modifying Job Attributes

qalter

Non-running jobs in the queue can be modified with the PBS qalter command. Among other things, the qalter utility can be used to:

Modify the job’s name:

$ qalter -N newname 130494

Modify the number of requested nodes:

$ qalter -l nodes=12 130494

Modify the job’s walltime:

$ qalter -l walltime=01:00:00 130494
Note: Once a batch job moves into a running state, the job’s walltime cannot be increased.

Monitoring Batch Jobs

PBS and Moab provide multiple tools to view queue, system, and job status. Below are the most common and useful of these tools.

Job Monitoring Commands

showq

The Moab utility showq can be used to view a more detailed description of the queue. The utility displays jobs in the queue grouped into the following states:

State Description
Active These jobs are currently running.
Eligible These jobs are currently queued awaiting resources. Eligible jobs are shown in the order in which the scheduler will consider them for allocation.
Blocked These jobs are currently queued but are not eligible to run. A job may be in this state because the user has more jobs that are “eligible to run” than the system’s queue policy allows.

To see all jobs currently in the queue:

$ showq

To see all jobs owned by userA currently in the queue:

$ showq -u userA

To see all jobs submitted to partitionA:

$ showq -p partitionA

To see all completed jobs:

$ showq -c
Note: To improve response time, the Moab utilities (showstart, checkjob) display a cached result. The cache updates every 30 seconds. Because a cached result is displayed, you may see the following message:

--------------------------------------------------------------------
NOTE: The following information has been cached by the remote server
      and may be slightly out of date.
--------------------------------------------------------------------

checkjob

The Moab utility checkjob can be used to view details of a job in the queue. For example, if job 736 is a job currently in the queue in a blocked state, the following can be used to view why the job is in a blocked state:

$ checkjob 736

The return may contain a line similar to the following:

BlockMsg: job 736 violates idle HARD MAXJOB limit of X for user (Req: 1 InUse: X)

This line indicates the job is in the blocked state because the owning user has reached the limit for jobs in the “eligible to run” state.

qstat

The PBS utility qstat will poll PBS (Torque) for job information. However, qstat does not know of Moab’s blocked and eligible states. Because of this, the showq Moab utility (see above) will provide a more accurate batch queue state. To show all queued jobs:

$ qstat -a

To show details about job 1234:

$ qstat -f 1234

To show all currently queued jobs owned by userA:

$ qstat -u userA

Eos Scheduling Policy

Queue Policy

Queues are used by the batch scheduler to aid in the organization of jobs. Users typically have access to multiple queues, and each queue may allow different job limits and have different priorities. Unless otherwise notified, users have access to the following queues on Eos:

Name Usage Description
batch No explicit request required Default; most Eos work runs in this queue. 700 nodes available. See limits in the batch Queue section table below.
debug #PBS -q debug Quick-turnaround; short jobs for software generation, verification, and debugging. 36 nodes available. Users are limited to 1 job in any state for this queue.

The batch Queue

The batch queue is the default queue for work on Eos and has 700 nodes available. Most work on Eos is handled through this queue. The job time limit is based on job size as follows:

Size in Nodes Wall Clock Limit
1 to 175 nodes 24 hours
176 to 350 12 hours
351 to 700 4 hours

The batch queue enforces the following policies:

  • Unlimited running jobs
  • Limit of (2) eligible-to-run jobs per user.
  • Jobs in excess of the per user limit above will be placed into a held state, but will change to eligible-to-run at the appropriate time.

The debug Queue

The debug queue is intended to provide faster turnaround times for the code verification and debugging cycle. For example, interactive parallel work is an ideal use for the debug queue. 36 nodes are set aside exclusively for debug use, although a debug job can request more nodes and use nodes in the compute partition. The debug queue has a walltime limit of 2 hours and a limit of 1 job per user in any state.

Queue Priority

INCITE, ALCC, and Director’s Discretionary projects enter the queue system with equal priority by default on Eos.

The basic priority-setting mechanism for jobs waiting in the queue is the time a job has been waiting relative to other jobs in the queue. However, several factors are applied by the batch system to modify the apparent time a job has been waiting. These factors include:

      • The number of nodes requested by the job.
      • The queue to which the job is submitted.
      • The 8-week history of usage for the project associated with the job.
      • The 8-week history of usage for the user associated with the job.
Note: The command-line utility mdiag -p can be used to see the individual factors contributing to a job’s priority.

If your jobs require resources outside these queue policies, please complete the relevant request form on the Special Requests page. If you have any questions or comments on these queue policies, please direct them to the User Assistance Center.

Allocation Overuse Policy

Projects that overrun their allocation are still allowed to run on OLCF systems, although at a reduced priority. The reduction is applied as an adjustment to the apparent submit time of the job, which has the effect of making these jobs appear much younger than jobs submitted under projects that have not exceeded their allocation. In addition to the priority change, these jobs are also limited in the amount of wall time that can be used.

For example, consider that job1 is submitted at the same time as job2. The project associated with job1 is over its allocation, while the project for job2 is not. The batch system will consider job2 to have been waiting for a longer time than job1.

The adjustment to the apparent submit time depends upon the percentage that the project is over its allocation, as shown in the table below:

% Of Allocation Used Priority Reduction
< 100% 0 days
100% to 125% 30 days
> 125% 365 days

Impact of Overuse on Separately Allocated Resources

Running in excess of the allocated time on one resource will not impact the priority on separately allocated resources. Eos allocations are given separately from Titan allocations; overuse of a project’s allocation on Titan will not impact that project’s priority on Eos if there is time remaining in the project’s Eos allocation.

FairShare Scheduling Policy

FairShare, as its name suggests, tries to push each user and project towards their fair share of the system’s utilization: in this case, 5% of the system’s utilization per user and 10% of the system’s utilization per project.

To do this, the job scheduler adds (30) minutes of priority aging per user and (1) hour of priority aging per project for every (1) percent the user or project is under its fair share value for the prior (8) weeks. Similarly, the job scheduler subtracts priority in the same way for users or projects that are over their fair share.

For instance, a user who has personally used 0.0% of the system’s utilization over the past (8) weeks who is on a project that has also used 0.0% of the system’s utilization will get a (12.5) hour bonus (5 * 30 min for the user + 10 * 1 hour for the project).

In contrast, a user who has personally used 0.0% of the system’s utilization on a project that has used 12.5% of the system’s utilization would get no bonus (5 * 30 min for the user – 2.5 * 1 hour for the project).
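The aging arithmetic described above can be illustrated with a small shell calculation. The usage percentages below are hypothetical; the 5%/10% targets and the 30-minute/1-hour rates come from the paragraphs above, and a negative result would represent a priority penalty rather than a bonus:

user_used=0    # user's share of system utilization over the prior 8 weeks, in percent
proj_used=0    # project's share of system utilization over the prior 8 weeks, in percent
bonus_minutes=$(( (5 - user_used) * 30 + (10 - proj_used) * 60 ))
echo "priority aging bonus: ${bonus_minutes} minutes"    # 750 minutes = 12.5 hours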


Eos Job Resource Accounting

Charging Factor

To match the charge factor for Titan, the usage factor for Eos is set at 30 times the number of nodes requested. Each node has 16 physical cores, or 32 logical cores with Intel Hyper Threading enabled (the default; the -j2 aprun option makes this explicit, and -j1 disables it). If you use showq to check your job’s allocation, it will show the number of cores requested in multiples of 32 regardless of the chosen Hyper Threading options or the number of cores actually utilized. This is because the scheduler must allocate the potential maximum number of cores per node to allow every possible way of utilizing the node. This does not affect the aprun options you have chosen and will not change the 30-per-node usage factor.

Note: Whole nodes must be requested at the time of job submission regardless of actual CPU or GPU core utilization.

Viewing Allocation Utilization

Projects are allocated time on Eos in units of “Eos core-hours”. Other OLCF systems are allocated in units of “core-hours”. This section describes how such units are calculated, and how users can access more detailed information on their relevant allocations.

Eos Core-Hour Calculation

The Eos core-hour charge for each batch job will be calculated as follows:

Eos core-hours = nodes requested * 30 * ( batch job endtime - batch job starttime )

Where batch job starttime is the time the job moves into a running state, and batch job endtime is the time the job exits a running state.

A batch job’s usage is calculated solely on requested nodes and the batch job’s start and end time. The number of cores actually used within any particular node within the batch job is not used in the calculation. For example, if a job requests 64 nodes through the batch script, runs for an hour, and uses only 2 CPU cores per node, the job will still be charged for 64 * 30 * 1 = 1,920 Eos core-hours.
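A quick shell sketch of the same calculation, using the hypothetical values from the example above:

nodes=64
hours=1
echo "$(( nodes * 30 * hours )) Eos core-hours"    # prints 1920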

Viewing Usage

Utilization is calculated daily using batch jobs which complete between 00:00 and 23:59 of the previous day. For example, if a job moves into a run state on Tuesday and completes Wednesday, the job’s utilization will be recorded Thursday. Only batch jobs which write an end record are used to calculate utilization. Batch jobs which do not write end records due to system failure or other reasons are not used when calculating utilization.

Each user may view usage for projects of which they are members using the command-line tool showusage or the My OLCF site.

On the Command Line via showusage

The showusage utility can be used to view your usage from January 01 through midnight of the previous day. For example:

$ showusage
Usage on Eos:
                                  Project Totals          <userid>
 Project      Allocation        Usage    Remaining          Usage
_________________________|___________________________|_____________
 <YourProj>    2000000   |   123456.78   1876543.22  |     1560.80

The -h option will list more usage details.

On the Web via My OLCF

More detailed metrics may be found on each project’s usage section of the My OLCF site.

The following information is available for each project:

  • YTD usage by system, subproject, and project member
  • Monthly usage by system, subproject, and project member
  • YTD usage by job size groupings for each system, subproject, and project member
  • Weekly usage by job size groupings for each system, and subproject
  • Batch system priorities by project and subproject
  • Project members

The My OLCF site is provided to aid in the utilization and management of OLCF allocations. If you have any questions or have a request for additional data, please contact the OLCF User Assistance Center.


Job Execution on Eos

Running jobs on Eos is similar to running jobs on Titan, with a few important differences:

  • The compute nodes have 16 physical cores and no GPUs are present.
  • Intel’s Hyper-Threading (HT) technology allows each physical core to appear as two logical cores, so each node can function as if it has 32 cores.
  • The default on Eos is to run with Hyper Threading enabled; the -j1 option to the aprun command explicitly disables HT.
  • Each code should be tested to see how HT impacts its performance before HT is used (see the sketch following this list).
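A minimal sketch of such a test, assuming a single-node allocation and a hypothetical executable a.out, is to time the same fixed-size problem with HT off and then on:

time aprun -n 16 -j1 ./a.out    # HT off: one MPI task per physical core
time aprun -n 32 -j2 ./a.out    # HT on (the default): two MPI tasks per physical core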

Once resources have been allocated through the batch system, users can:

  • Run commands in serial on the resource pool’s primary service node
  • Run executables in parallel across compute nodes in the resource pool

Serial Execution

The executable portion of a batch script is interpreted by the shell specified on the first line of the script. If a shell is not specified, the submitting user’s default shell will be used. This portion of the script may contain comments, shell commands, executable scripts, and compiled executables. These can be used in combination to, for example, navigate file systems, set up job execution, run executables, and even submit other batch jobs.

Parallel Execution

By default, commands in the job submission script will be executed on the job’s primary service node. The aprun command is used to execute a binary on one or more compute nodes within a job’s allocated resource pool.

Note: On Eos, the only way to access a compute node is via the aprun command within a batch job.

Using the aprun command

The aprun command is used to run a compiled application program across one or more compute nodes. You use the aprun command to specify application resource requirements, request application placement, and initiate application launch.

The machine’s physical node layout plays an important role in how aprun works. Each Eos compute node contains (2) 8-core NUMA nodes, one per socket (a total of 16 physical cores).

Note: The aprun command is the only mechanism for running an executable in parallel on compute nodes. To run jobs as efficiently as possible, a thorough understanding of how to use aprun and its various options is paramount.

OLCF uses a version of aprun with two extensions. One is used to identify which libraries are used by an executable to allow us to better track third party software that is being actively used on the system. The other analyzes the command line to identify cases where users might be able to optimize their application’s performance by using slightly different job layout options. We highly recommend using both of these features; however, if there is a reason you wish to disable one or the other please contact the User Assistance Center for information on how to do that.

Shell Resource Limits

By default, aprun will not forward shell limits set by ulimit for sh/ksh/bash or by limit for csh/tcsh.

To pass these settings to your batch job, you should set the environment variable APRUN_XFER_LIMITS to 1 via export APRUN_XFER_LIMITS=1 for sh/ksh/bash or setenv APRUN_XFER_LIMITS 1 for csh/tcsh.
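For example, a bash batch script that raises the stack size limit and forwards it to the compute nodes might contain something like the following (the specific ulimit setting is only an illustration):

ulimit -s unlimited            # raise the shell's stack size limit
export APRUN_XFER_LIMITS=1     # ask aprun to forward shell limits to the compute nodes
aprun -n 16 ./a.out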

Simultaneous aprun Limit

All aprun processes are launched from a small number of shared service nodes. Because large numbers of aprun processes can cause other users’ apruns to fail, users are asked to limit the number of simultaneous apruns executed within a batch script. Users are limited to 50 aprun processes per batch job; attempts to launch apruns over the limit will result in the following error:

apsched: no more claims allowed for this reservation (max 50)
Warning: Users are limited to 50 aprun processes per batch job.

Single-aprun Process Ensembles with wraprun

Wraprun is a utility that enables independent execution of multiple MPI applications under a single aprun call. It borrows from aprun MPMD syntax and also contains some wraprun specific syntax. The latest documentation can be found on the wraprun development README.

In some situations, the simultaneous aprun limit can be overcome by using the utility wraprun. Wraprun has the capacity to run an arbitrary number and combination of qualified MPI or serial applications under a single aprun call.

Note: MPI executables launched under wraprun must be dynamically linked. Non-MPI applications must be launched using a serial wrapper included with wraprun.
Warning: Tasks bundled with wraprun should each consume approximately the same walltime to avoid wasting allocation hours.
Using Wraprun

By default wraprun applications must be dynamically linked. Wraprun itself depends on Python and is compatible with python/2.7.X and python/3.X but requires the user to load their preferred python environment module before loading wraprun. Below is an example of basic wraprun use for the applications foo.out and bar.out.

$ module load dynamic-link
$ cc foo.c -o foo.out
$ cc bar.c -o bar.out
$ module load python wraprun
$ wraprun -n 80 ./foo.out : -n 160 ./bar.out

foo.out and bar.out will run independently under a single aprun call.

In addition to the standard process placement flags available to aprun, the --w-cd flag can be set to change the current working directory for each executable:

$ wraprun -n 80 --w-cd /foo/dir ./foo.out : -n 160 --w-cd /bar/dir ./bar.out

This is particularly useful for legacy FORTRAN applications that use hard coded input and output file names.

Multiple instances of an application can be placed on a node using the comma-separated PES1,PES2,…,PESN syntax. For instance:

$ wraprun -n 2,2,2 ./foo.out [ : other tasks...] 

would launch 3 two-process instances of foo.out on a single node.

In this case, the number of allocated nodes must be at least equal to the sum of processes in the comma-separated list of processing elements, divided by the maximum number of processes per node.

This may also be combined with the --w-cd flag:

$ wraprun -n 2,2,2 --w-cd /foo/dir1,/foo/dir2,/foo/dir3 ./foo.out [ : other tasks...] 

For non-MPI executables, a wrapper application, serial, is provided. This wrapper ensures that all executables will run to completion before aprun exits. To use, place serial in front of your application and arguments:

$ wraprun -n 1 serial ./foo.out -foo_args [ : other tasks...] 

The stdout/err of each task run under wraprun will be directed to its own unique file in the current working directory with names in the form:

${PBS_JOBID}_w_${TASK_ID}.out
${PBS_JOBID}_w_${TASK_ID}.err
Recommendations and Limitations

It is recommended that applications be dynamically linked. On Eos this can be accomplished by loading the dynamic-link module before invoking the Cray compile wrappers (CC, cc, ftn).

The library may be statically linked although this is not fully supported.

All executables must reside in a compute node visible filesystem, e.g. Lustre. The underlying aprun call made by wraprun enforces the aprun ‘no-copy’ (‘-b’) flag.

wraprun works by intercepting all MPI function calls that contain an MPI_Comm argument. If an application calls an MPI function, containing an MPI_Comm argument, not included in src/split.c, the results are undefined.

Common aprun Options

The following table lists commonly-used options to aprun. For a more detailed description of aprun options, see the aprun man page.

Option Description
-D Debug; shows the layout aprun will use
-n Number of total MPI tasks (aka ‘processing elements’) for the executable. If you do not specify the number of tasks to aprun, the system will default to 1.
-N Number of MPI tasks (aka ‘processing elements’) per physical node.

Warning: Because each node contains multiple processors/NUMA nodes, the -S option is likely a better option than -N to control layout within a node.
-m Memory required per MPI task. There is a maximum of 2GB per core; i.e., requesting 2.1GB will allocate a minimum of two cores per MPI task.
-d Number of threads per MPI task.

Warning: The default value for -d is 1. If you specify OMP_NUM_THREADS but do not give a -d option, aprun will allocate your threads to a single core. Use OMP_NUM_THREADS to specify to your code the number of threads per MPI task; use -d to tell aprun how to place those threads.
-j
For Titan: Number of CPUs to use per paired-core compute unit. The -j parameter specifies the number of CPUs to be allocated per paired-core compute unit. The valid values for -j are 0 (use the system default), 1 (use one integer core), and 2 (use both integer cores; this is the system default).
For Eos: The -j parameter controls Hyper Threading. The valid values for -j are 0 (use the system default), 1 (turn Hyper Threading off), and 2 (turn Hyper Threading on; this is the system default).
-cc This is the cpu_list option. It binds MPI tasks or threads to the specified CPUs. The list is given as a set of comma-separated numbers (0 through 15) which each specify a compute unit (core) on the node. The list can also be given as hyphen-separated ranges of numbers which each specify a range of compute units (cores) on the node. See man aprun.
-S Number of MPI tasks (aka ‘processing elements’) per NUMA node. Can be 1, 2, 3, 4, 5, 6, 7, or 8.
-ss Strict memory containment per NUMA node. The default is to allow remote NUMA node memory access. This option prevents memory access of the remote NUMA node.
-r Assign system services associated with your application to a compute core.

If you use less than 16 cores, you can request all of the system services to be placed on an unused core. This will reduce “jitter” (i.e. application variability) because the daemons will not cause the application to context switch unexpectedly.

Use -r 1, ensuring that -N is less than 16 (or -S is less than 8) so that a core remains free for system services; see the sketch below.
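For example, a sketch that leaves one core per node free for system services (assuming a 4-node allocation and a hypothetical executable a.out):

aprun -n 60 -N 15 -r 1 ./a.out    # 15 MPI tasks per node; 1 core per node reserved for system services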

XC30 CPU Description

The compute blade architecture of the XC30 is similar to the CPU blades of Titan with a few important differences.

Each node has two sockets with 8 physical cores each. Each core has its own level 1 (L1) and level 2 (L2) caches. The eight cores on each socket share a 20 MB L3 cache and 32 GB of SDRAM connected by a 4-channel DDR3 pipeline. Each unit of 8 cores, with its associated memory and caches, can be thought of as a non-uniform memory access (NUMA) domain.


Hyper Threading

Hyper Threading Overview

Eos includes Intel processors with Intel’s Hyper-Threading technology. With Hyper-Threading enabled, the operating system recognizes each physical core as two logical cores. Two independent processes or threads can run simultaneously on the same physical core, but because the two logical cores share the same execution resources, the two streams may run at roughly half the speed of a single stream. If a stream running on one of the logical cores stalls, the second stream on that core can use the stalled stream’s execution resources and possibly recoup cycles that would have been idle if the streams had been run with only one per physical core. Hyper Threading on Eos is supported by each of the available compilers: Intel, PGI, Cray, and GNU.

Note: Hyper-Threading is enabled on Eos by default. The -j1 option to aprun explicitly disables Hyper Threading on Eos.

Hyper Threading for MPI Applications

For MPI applications, Hyper Threading can be utilized in a few different ways. One way is by running on half the nodes that you would typically need to allocate without Hyper Threading. The example below shows the code to do so and the resulting task layout on a node.

Code Example 1: (32) MPI tasks, (1) task per “core”, with Hyper Threading

# Request 1 node.
#PBS -l nodes=1
# Tell aprun to use hyper threading.

# Implicitly using HT (default)
aprun -n 32 ./a.out

# Explicitly using HT
aprun -n 32 -j2 ./a.out

Compute Node 0
  NUMA 0: ranks 0-7 on cores C0-C7; ranks 8-15 on cores C8-C15
  NUMA 1: ranks 16-23 on cores C0-C7; ranks 24-31 on cores C8-C15

(‘C[n]’ indicates logical core [n] within the NUMA node; with Hyper Threading each 8-core NUMA node presents 16 logical cores.)

In contrast, the same example without Hyper Threading is shown below along with the resulting task layout on a node.

Code Example 2: (32) MPI tasks without Hyper Threading
# Request 2 nodes.
#PBS -l nodes=2
# No aprun hyper threading.
aprun -n 32 -j1 ./a.out

Compute Node 0
  NUMA 0: ranks 0-7 on cores C0-C7
  NUMA 1: ranks 8-15 on cores C0-C7
Compute Node 1
  NUMA 0: ranks 16-23 on cores C0-C7
  NUMA 1: ranks 24-31 on cores C0-C7

(‘C[n]’ indicates core [n] within the NUMA node.)

Hyper Threading for Threaded Codes

For threaded codes, Hyper Threading can allow you to run with double the number of threads per MPI task for a fixed number of nodes and tasks.

For example, an MPI/OpenMP code designed to run with (2) MPI tasks and (8) threads per task without hyper threading via:

 aprun -n2 -d8 -j1

Could be run with (16) threads per task with Hyper Threading:

 aprun -n2 -d16

Controlling MPI Task Layout Within an Eos Node

Users have (2) ways to control MPI task layout:

  1. Within a physical node
  2. Across physical nodes

This article focuses on how to control MPI task layout within a physical node.

Understanding NUMA Nodes

Each physical node is organized into (2) 8-core NUMA nodes. NUMA is an acronym for “Non-Uniform Memory Access”. You can think of a NUMA node as a division of a physical node that contains a subset of processor cores and their high-affinity memory.

Applications may use resources from one or both NUMA nodes. The default MPI task layout is SMP-style. This means MPI will sequentially allocate all cores on one NUMA node before allocating tasks to another NUMA node.

Note: A brief description of how a physical XC30 node is organized can be found on the XC30 node description page.

Spreading MPI Tasks Across NUMA Nodes

Each physical node contains (2) NUMA nodes. Users can control MPI task layout using the aprun NUMA node flags. For jobs that do not utilize all cores on a node, it may be beneficial to spread a physical node’s MPI task load over the (2) available NUMA nodes via the -S option to aprun.

Note: Jobs that do not utilize all of a physical node’s processor cores may see performance improvements by spreading MPI tasks across NUMA nodes within a physical node.
Example 1: Default NUMA Placement

Job requests (2) processor cores without a NUMA flag. Both tasks are placed on the first NUMA node.

$ aprun -n2 ./a.out
Rank 0, Node 0, NUMA 0, Core 0
Rank 1, Node 0, NUMA 0, Core 1
Example 2: Specific NUMA Placement

Job requests (2) processor cores with aprun -S. A task is placed on each of the (2) NUMA nodes:

$ aprun -n2 -S1 ./a.out
Rank 0, Node 0, NUMA 0, Core 0
Rank 1, Node 0, NUMA 1, Core 0

The following table summarizes common NUMA node options to aprun:

Option Description
-S Processing elements (essentially a processor core) per NUMA node. Specifies the number of PEs to allocate per NUMA node. Can be 1, 2, 3, 4, 5, 6, 7, or 8.
-ss Strict memory containment per NUMA node. The default is to allow remote NUMA node memory access. This option prevents memory access of the remote NUMA node.

Advanced NUMA Node Placement

Example 1: Grouping MPI Tasks on a Single NUMA Node

Run a.out on (8) cores. Place (8) MPI tasks on (1) NUMA node. In this case the aprun -S option is optional:

$ aprun -n8 -S8 ./a.out

Compute Node 0
  NUMA 0: ranks 0-7 on cores C0-C7
  NUMA 1: no tasks placed
Example 2: Spreading MPI tasks across NUMA nodes

Run a.out on (8) cores. Place (4) MPI tasks on each of (2) NUMA nodes via aprun -S.

$ aprun -n8 -S4 ./a.out
Compute Node 0
  NUMA 0: ranks 0-3 on cores C0-C3
  NUMA 1: ranks 4-7 on cores C0-C3
Example 3: Spreading Out MPI Tasks Across NUMA Nodes with Hyper Threading

Hyper Threading is enabled by default and can be explicitly enabled with the -j2 aprun option. With Hyper Threading, the NUMA nodes behave as if each has 16 logical cores.

Run a.out on (18) cores. Place (9) MPI tasks on each of (2) NUMA nodes via aprun -S.

aprun -n 18 -S9 ./a.out

Compute Node 0
  NUMA 0: ranks 0-7 on logical cores 0-7; rank 8 on logical core 8
  NUMA 1: ranks 9-16 on logical cores 0-7; rank 17 on logical core 8

To see MPI rank placement information on the nodes, set the PMI_DEBUG environment variable to 1.

For csh/tcsh:

$ setenv PMI_DEBUG 1

For bash:

$ export PMI_DEBUG=1

Controlling MPI Task Layout Across Many Physical Nodes

Users have (2) ways to control MPI task layout:

  1. Within a physical node
  2. Across physical nodes

This article focuses on how to control MPI task layout across physical nodes.

The default MPI task layout is SMP-style. This means MPI will sequentially allocate all virtual cores on one physical node before allocating tasks to another physical node.

Viewing Multi-Node Layout Order

Task layout can be seen by setting MPICH_RANK_REORDER_DISPLAY to 1.
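For example (bash syntax; use setenv for csh/tcsh), assuming a hypothetical executable a.out:

$ export MPICH_RANK_REORDER_DISPLAY=1
$ aprun -n 32 -j1 ./a.out    # the rank-to-node mapping is printed when the job launches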

Changing Multi-Node Layout Order

For multi-node jobs, layout order can be changed using the environment variable MPICH_RANK_REORDER_METHOD. See man intro_mpi for more information.

Multi-Node Layout Order Examples

Example 1: Default Layout

The following will run a.out across (32) cores. This requires (2) physical compute nodes.

# On Titan
$ aprun -n 32 ./a.out

# On Eos, Hyper-threading must be disabled:
$ aprun -n 32 -j1 ./a.out

Compute Node 0
  NUMA 0: ranks 0-7 on cores 0-7
  NUMA 1: ranks 8-15 on cores 0-7
Compute Node 1
  NUMA 0: ranks 16-23 on cores 0-7
  NUMA 1: ranks 24-31 on cores 0-7

Example 2: Round-Robin Layout

The following will place tasks in a round robin fashion. This requires (2) physical compute nodes.

$ setenv MPICH_RANK_REORDER_METHOD 0
# On Titan
$ aprun -n 32 ./a.out

# On Eos, Hyper-threading must be disabled:
$ aprun -n 32 -j1 ./a.out

Compute Node 0
  NUMA 0: ranks 0, 2, 4, ..., 14 on cores 0-7
  NUMA 1: ranks 16, 18, 20, ..., 30 on cores 0-7
Compute Node 1
  NUMA 0: ranks 1, 3, 5, ..., 15 on cores 0-7
  NUMA 1: ranks 17, 19, 21, ..., 31 on cores 0-7

Example 3: Combining Inter-Node and Intra-Node Options

The following combines MPICH_RANK_REORDER_METHOD and -S to place (3) tasks on each NUMA node within a node and distribute tasks in a round-robin fashion across nodes.

$ setenv MPICH_RANK_REORDER_METHOD 0
$ aprun -n12 -S3 ./a.out

Compute Node 0
  NUMA 0: ranks 0, 2, 4 on cores 0-2
  NUMA 1: ranks 6, 8, 10 on cores 0-2
Compute Node 1
  NUMA 0: ranks 1, 3, 5 on cores 0-2
  NUMA 1: ranks 7, 9, 11 on cores 0-2

Eos Controlling Thread Layout Within a Physical Node

Eos supports threaded programming within a compute node. Threads may span both processors (NUMA nodes) within a single compute node, but cannot span compute nodes. With Intel’s Hyper Threading enabled, each node with 16 physical cores can behave as if it has 32 logical cores. Hyper Threading is enabled by default, so users must pass the -j1 aprun option to explicitly turn it off. For threaded codes, Hyper Threading allows the user to run twice as many threads per node (two per physical core). Users have a great deal of flexibility in thread placement. Several examples are shown below.

Note: Threaded codes must use the -d (depth) option to aprun.

The -d option to aprun specifies the number of threads per MPI task. Under previous CNL versions this option was not required. Under the current CNL version, the number of cores used is calculated by multiplying the value of -d by the value of -n.

Warning: Without the -d option, all threads will be started on the same processor core. This can lead to performance degradation for threaded codes.

Thread Layout Examples

The following examples are written for the bash shell. If using csh/tcsh, you should change export OMP_NUM_THREADS=x to setenv OMP_NUM_THREADS x wherever it appears.

Example 1: (2) MPI tasks, (16) Threads Each

This example will launch (2) MPI tasks, each with (16) threads running on their own dedicated physical cores. This uses (2) compute nodes and requires a nodes request of (2):

$ export OMP_NUM_THREADS=16
$ aprun -n2 -d16 -j1 a.out

Rank 0, Thread 0, Node 0, NUMA 0, Core 0 <-- MASTER
Rank 0, Thread 1, Node 0, NUMA 0, Core 1 <-- slave
Rank 0, Thread 2, Node 0, NUMA 0, Core 2 <-- slave
Rank 0, Thread 3, Node 0, NUMA 0, Core 3 <-- slave
Rank 0, Thread 4, Node 0, NUMA 0, Core 4 <-- slave
Rank 0, Thread 5, Node 0, NUMA 0, Core 5 <-- slave
Rank 0, Thread 6, Node 0, NUMA 0, Core 6 <-- slave
Rank 0, Thread 7, Node 0, NUMA 0, Core 7 <-- slave
Rank 0, Thread 8, Node 0, NUMA 1, Core 0 <-- slave
Rank 0, Thread 9, Node 0, NUMA 1, Core 1 <-- slave
Rank 0, Thread 10,Node 0, NUMA 1, Core 2 <-- slave
Rank 0, Thread 11,Node 0, NUMA 1, Core 3 <-- slave
Rank 0, Thread 12,Node 0, NUMA 1, Core 4 <-- slave
Rank 0, Thread 13,Node 0, NUMA 1, Core 5 <-- slave
Rank 0, Thread 14,Node 0, NUMA 1, Core 6 <-- slave
Rank 0, Thread 15,Node 0, NUMA 1, Core 7 <-- slave
Rank 1, Thread 0, Node 1, NUMA 0, Core 0 <-- MASTER
Rank 1, Thread 1, Node 1, NUMA 0, Core 1 <-- slave
Rank 1, Thread 2, Node 1, NUMA 0, Core 2 <-- slave
Rank 1, Thread 3, Node 1, NUMA 0, Core 3 <-- slave
Rank 1, Thread 4, Node 1, NUMA 0, Core 4 <-- slave
Rank 1, Thread 5, Node 1, NUMA 0, Core 5 <-- slave
Rank 1, Thread 6, Node 1, NUMA 0, Core 6 <-- slave
Rank 1, Thread 7, Node 1, NUMA 0, Core 7 <-- slave
Rank 1, Thread 8, Node 1, NUMA 1, Core 0 <-- slave
Rank 1, Thread 9, Node 1, NUMA 1, Core 1 <-- slave
Rank 1, Thread 10,Node 1, NUMA 1, Core 2 <-- slave
Rank 1, Thread 11,Node 1, NUMA 1, Core 3 <-- slave
Rank 1, Thread 12,Node 1, NUMA 1, Core 4 <-- slave
Rank 1, Thread 13,Node 1, NUMA 1, Core 5 <-- slave
Rank 1, Thread 14,Node 1, NUMA 1, Core 6 <-- slave
Rank 1, Thread 15,Node 1, NUMA 1, Core 7 <-- slave

This can also be accomplished on one node with Hyper Threading.

This uses only (1) compute node and requires a nodes request of (1):

$ export OMP_NUM_THREADS=16
$ aprun -n2 -d16 a.out

Rank 0, Thread 0, Node 0, NUMA 0, Core 0 <-- MASTER
Rank 0, Thread 1, Node 0, NUMA 0, Core 1 <-- slave
Rank 0, Thread 2, Node 0, NUMA 0, Core 2 <-- slave
Rank 0, Thread 3, Node 0, NUMA 0, Core 3 <-- slave
Rank 0, Thread 4, Node 0, NUMA 0, Core 4 <-- slave
Rank 0, Thread 5, Node 0, NUMA 0, Core 5 <-- slave
Rank 0, Thread 6, Node 0, NUMA 0, Core 6 <-- slave
Rank 0, Thread 7, Node 0, NUMA 0, Core 7 <-- slave
Rank 0, Thread 8, Node 0, NUMA 0, Core 8 <-- slave
Rank 0, Thread 9, Node 0, NUMA 0, Core 9 <-- slave
Rank 0, Thread 10,Node 0, NUMA 0, Core 10 <-- slave
Rank 0, Thread 11,Node 0, NUMA 0, Core 11<-- slave
Rank 0, Thread 12,Node 0, NUMA 0, Core 12<-- slave
Rank 0, Thread 13,Node 0, NUMA 0, Core 13<-- slave
Rank 0, Thread 14,Node 0, NUMA 0, Core 14<-- slave
Rank 0, Thread 15,Node 0, NUMA 0, Core 15 <-- slave
Rank 1, Thread 0, Node 0, NUMA 1, Core 0 <-- MASTER
Rank 1, Thread 1, Node 0, NUMA 1, Core 1 <-- slave
Rank 1, Thread 2, Node 0, NUMA 1, Core 2 <-- slave
Rank 1, Thread 3, Node 0, NUMA 1, Core 3 <-- slave
Rank 1, Thread 4, Node 0, NUMA 1, Core 4 <-- slave
Rank 1, Thread 5, Node 0, NUMA 1, Core 5 <-- slave
Rank 1, Thread 6, Node 0, NUMA 1, Core 6 <-- slave
Rank 1, Thread 7, Node 0, NUMA 1, Core 7 <-- slave
Rank 1, Thread 8, Node 0, NUMA 1, Core 8 <-- slave
Rank 1, Thread 9, Node 0, NUMA 1, Core 9 <-- slave
Rank 1, Thread 10,Node 0, NUMA 1, Core 10 <-- slave
Rank 1, Thread 11,Node 0, NUMA 1, Core 11<-- slave
Rank 1, Thread 12,Node 0, NUMA 1, Core 12<-- slave
Rank 1, Thread 13,Node 0, NUMA 1, Core 13<-- slave
Rank 1, Thread 14,Node 0, NUMA 1, Core 14<-- slave
Rank 1, Thread 15,Node 0, NUMA 1, Core 15<-- slave
Example 2: (2) MPI tasks, (6) Threads Each

This example will launch (2) MPI tasks, each with (6) threads, placing (1) MPI task per NUMA node. This uses (1) physical compute node and requires a nodes request of (1):

$ export OMP_NUM_THREADS=6
$ aprun -n2 -d6 -S1 a.out

Compute Node
  NUMA 0: Rank 0, threads 0-5 on cores 0-5
  NUMA 1: Rank 1, threads 0-5 on cores 0-5

Example 3: (4) MPI tasks, (2) Threads Each

This example will launch (4) MPI tasks, each with (2) threads. Place only (1) MPI task [and its (2) threads] on each NUMA node. This requests (2) physical compute nodes and requires a nodes request of (2), even though only (8) cores are actually being used:

$ export OMP_NUM_THREADS=2
$ aprun -n4 -d2 -S1 a.out

Rank 0, Thread 0, Node 0, NUMA 0, Core 0 <-- MASTER
Rank 0, Thread 1, Node 0, NUMA 0, Core 1 <-- slave
Rank 1, Thread 0, Node 0, NUMA 1, Core 0 <-- MASTER
Rank 1, Thread 1, Node 0, NUMA 1, Core 1 <-- slave
Rank 2, Thread 0, Node 1, NUMA 0, Core 0 <-- MASTER
Rank 2, Thread 1, Node 1, NUMA 0, Core 1 <-- slave
Rank 3, Thread 0, Node 1, NUMA 1, Core 0 <-- MASTER
Rank 3, Thread 1, Node 1, NUMA 1, Core 1 <-- slave


Thread Affinity Example

The example application is a Cray test code, cray-mpi-ex.c, that shows how your MPI tasks are placed on the “cores” within Eos’s compute nodes. This application may be useful if you are trying to figure out how your application’s MPI tasks will be distributed. The cray-mpi-ex.c test code is given at the bottom of this example if you want to view it or copy it for your own use.

Compile

The compiler wrappers on Eos function like the compiler wrappers on Titan. The main functional difference is that the Intel compiler and programming environment is the default on Eos, while the PGI equivalents are the default on Titan. I will use the Intel environment below.

To compile:

eos%  cc cray-mpi-ex.c

Batch Script

Here is the example batch script cray-mpi.pbs:

#!/bin/bash
#    Begin PBS directives
#PBS -A STF007
#PBS -N cray-mpi
#PBS -j oe
#PBS -l walltime=1:00:00,nodes=2
#    End PBS directives and begin shell commands
cd $MEMBERWORK/stf007
aprun -n 32 -j1 ./a.out

To submit this from $MEMBERWORK/[projid]:

eos% qsub cray-mpi.pbs

The output will be in $MEMBERWORK/[projid] in a file called cray-mpi.ojob_number.

I used two nodes and did not use Hyper Threading. In the output below, notice that two nodes, 708 and 709, have been used, and two sets of cores, 0-15, are listed for the 32 ranks.

Rank 9, Node 00708, Core 9
Rank 8, Node 00708, Core 8
Rank 15, Node 00708, Core 15
Rank 7, Node 00708, Core 7
Rank 5, Node 00708, Core 5
Rank 3, Node 00708, Core 3
Rank 2, Node 00708, Core 2
Rank 1, Node 00708, Core 1
Rank 4, Node 00708, Core 4
Rank 6, Node 00708, Core 6
Rank 12, Node 00708, Core 12
Rank 14, Node 00708, Core 14
Rank 10, Node 00708, Core 10
Rank 13, Node 00708, Core 13
Rank 0, Node 00708, Core 0
Rank 11, Node 00708, Core 11
Rank 25, Node 00709, Core 9
Rank 20, Node 00709, Core 4
Rank 16, Node 00709, Core 0
Rank 30, Node 00709, Core 14
Rank 21, Node 00709, Core 5
Rank 22, Node 00709, Core 6
Rank 19, Node 00709, Core 3
Rank 23, Node 00709, Core 7
Rank 18, Node 00709, Core 2
Rank 26, Node 00709, Core 10
Rank 31, Node 00709, Core 15
Rank 28, Node 00709, Core 12
Rank 24, Node 00709, Core 8
Rank 27, Node 00709, Core 11
Rank 29, Node 00709, Core 13
Rank 17, Node 00709, Core 1
Application 19374 resources: utime ~4s, stime ~20s, Rss ~4736, inblocks ~10667, outblocks ~24358

If I wanted to use Hyper Threading, I could omit the -j1 aprun option and use half as many nodes. Here is the modified batch script:

#!/bin/bash
#    Begin PBS directives
#PBS -A STF007
#PBS -N cray-mpi
#PBS -j oe
#PBS -l walltime=1:00:00,nodes=1
#    End PBS directives and begin shell commands
cd $MEMBERWORK/stf007
aprun -n 32 ./a.out

Below is the output with Hyper Threading. Notice that only one node, 708, was used, and this time cores 0-31 were used for the 32 ranks. Hyper Threading makes each physical core behave as two logical cores. Each logical core stores a complete program state, but it must split the resources of the physical core.

Rank 0, Node 00708, Core 0
Rank 1, Node 00708, Core 16
Rank 29, Node 00708, Core 30
Rank 28, Node 00708, Core 14
Rank 2, Node 00708, Core 1
Rank 3, Node 00708, Core 17
Rank 12, Node 00708, Core 6
Rank 13, Node 00708, Core 22
Rank 9, Node 00708, Core 20
Rank 8, Node 00708, Core 4
Rank 22, Node 00708, Core 11
Rank 23, Node 00708, Core 27
Rank 17, Node 00708, Core 24
Rank 30, Node 00708, Core 15
Rank 19, Node 00708, Core 25
Rank 31, Node 00708, Core 31
Rank 18, Node 00708, Core 9
Rank 5, Node 00708, Core 18
Rank 4, Node 00708, Core 2
Rank 21, Node 00708, Core 26
Rank 20, Node 00708, Core 10
Rank 26, Node 00708, Core 13
Rank 27, Node 00708, Core 29
Rank 14, Node 00708, Core 7
Rank 15, Node 00708, Core 23
Rank 10, Node 00708, Core 5
Rank 11, Node 00708, Core 21
Rank 25, Node 00708, Core 28
Rank 24, Node 00708, Core 12
Rank 16, Node 00708, Core 8
Rank 7, Node 00708, Core 19
Rank 6, Node 00708, Core 3
Application 19376 resources: utime ~1s, stime ~1s, Rss ~3484, inblocks ~5226, outblocks ~12180

Example Application Code

The code for this example application can be found below if you wish to view or copy it.

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <string.h>
#include <unistd.h>    /* gethostname() */
#include "mpi.h"

/* Convert a CPU affinity mask to a compact string such as "0-3,8".
   Borrowed from util-linux-2.13-pre7/schedutils/taskset.c */
static char *cpuset_to_cstr(cpu_set_t *mask, char *str)
{
  char *ptr = str;
  int i, j, entry_made = 0;

  for (i = 0; i < CPU_SETSIZE; i++) {
    if (CPU_ISSET(i, mask)) {
      int run = 0;
      entry_made = 1;
      /* Count the run of consecutive set CPUs starting at i. */
      for (j = i + 1; j < CPU_SETSIZE; j++) {
        if (CPU_ISSET(j, mask)) run++;
        else break;
      }
      if (!run)
        sprintf(ptr, "%d,", i);
      else if (run == 1) {
        sprintf(ptr, "%d,%d,", i, i + 1);
        i++;
      } else {
        sprintf(ptr, "%d-%d,", i, i + run);
        i += run;
      }
      while (*ptr != 0) ptr++;
    }
  }
  ptr -= entry_made;   /* drop the trailing comma, if any */
  *ptr = 0;
  return str;
}

int main(int argc, char *argv[])
{
  int rank, i;
  cpu_set_t coremask;
  char clbuf[7 * CPU_SETSIZE], hnbuf[64], node[64];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  memset(clbuf, 0, sizeof(clbuf));
  memset(hnbuf, 0, sizeof(hnbuf));
  memset(node, 0, sizeof(node));
  (void)gethostname(hnbuf, sizeof(hnbuf));

  /* Strip the leading "nid" from the node name, e.g. "nid00708" -> "00708". */
  for (i = 3; hnbuf[i] != '\0'; i++)
    node[i - 3] = hnbuf[i];

  /* Report the set of cores this rank is bound to. */
  (void)sched_getaffinity(0, sizeof(coremask), &coremask);
  cpuset_to_cstr(&coremask, clbuf);
  printf("Rank %d, Node %s, Core %s\n", rank, node, clbuf);

  MPI_Finalize();
  return 0;
}

If you have any questions about this example, please contact user support or your scientific computing liaison.