In High Performance Computing (HPC), computational work is performed by jobs. Individual jobs produce data that lend relevant insight into grand challenges in science and engineering. As such, timely and efficient execution of jobs is the primary concern in the operation of any HPC system.

A job on Titan typically comprises a few different components:

  • A batch submission script.
  • A statically-linked binary executable.
  • A set of input files for the executable.
  • A set of output files created by the executable.

And the process for running a job, in general, is to:

  1. Prepare executables and input files.
  2. Write a batch script.
  3. Submit the batch script to the batch scheduler.
  4. Optionally monitor the job before and during execution.

The following sections describe in detail how to create, submit, and manage jobs for execution on Titan.

Login vs. Service vs. Compute Nodes

Cray supercomputers are complex collections of different types of physical nodes/machines. For simplicity, we can think of Titan nodes as existing in one of three categories: login nodes, service nodes, or compute nodes.

Login Nodes

Login nodes are designed to facilitate ssh access into the overall system, and to handle simple tasks. When you first log in, you are placed on a login node. Login nodes are shared by all users of a system, and should only be used for basic tasks such as file editing, code compilation, data backup, and job submission. Login nodes should not be used for memory-intensive nor processing-intensive tasks. Users should also limit the number of simultaneous tasks performed on login nodes. For example, a user should not run ten simultaneous tar processes.

Warning: Processor-intensive, memory-intensive, or otherwise disruptive processes running on login nodes may be killed without warning.

Service Nodes

Memory-intensive tasks, processor-intensive tasks, and any production-type work should be submitted to the machine’s batch system (e.g. to Torque/MOAB via qsub). When a job is submitted to the batch system, the job submission script is first executed on a service node. Any job submitted to the batch system is handled in this way, including interactive batch jobs (e.g. via qsub -I). Often users are under the (false) impression that they are executing commands on compute nodes while typing commands in an interactive batch job. On Cray machines, this is not the case.

Compute Nodes

On Cray machines, when the aprun command is issued within a job script (or on the command line within an interactive batch job), the binary passed to aprun is copied to and executed in parallel on a set of compute nodes. Compute nodes run a Linux microkernel for reduced overhead and improved performance.

Note: On Cray machines, the only way to access the compute nodes is via the aprun command.

Filesystems Available to Compute Nodes

The compute nodes do not mount all filesystems available from the login and service nodes. The Lustre® areas ($MEMBERWORK and $PROJWORK) as well as the /ccs/proj areas are available to compute nodes on OLCF Cray systems. User home directories are not mounted on compute nodes.

Warning: Home directories (/ccs/home/$USER) are not available from the compute nodes.
As a result, job executable binaries and job input files must reside within a Lustre or /ccs/proj work space, e.g. $MEMBERWORK/[projid].

Overview of filesystems available to compute nodes

Type    Mount                                Access      Suggested Use
Lustre  $MEMBERWORK, $PROJWORK, $WORLDWORK   Read/Write  Batch job input and output
NFS     /ccs/proj                            Read-only   Binaries, shared object libraries, Python scripts

Notice: Due to metadata overhead, the NFS areas are the preferred storage locations for shared object libraries and Python scripts.

Shared Object and Python Scripts

Because the /ccs/proj areas are backed up daily and are accessible by all members of a project, they are very useful for sharing code with other project members. Due to metadata overhead, the NFS areas are the preferred storage locations for shared object libraries and Python modules.

The Lustre $MEMBERWORK, $PROJWORK, and $WORLDWORK areas are much larger than the NFS areas and are configured for large data I/O.

/ccs/proj Update Delay and Read-Only Access

Access to /ccs/proj area from compute nodes is read-only. The /ccs/proj areas are not directly mounted on the compute nodes, but rather rely on a periodic rsync to provide access to shared project-centric data. The /ccs/proj areas are mounted for read/write on the login and service nodes, but it may take up to 30 minutes for changes to propagate to the compute nodes.

Note: Due to /ccs/proj areas being read-only on compute nodes, job output must be sent to a Lustre work space.

Home Directory Access Error

Batch jobs can be submitted from the User Home space, but additional steps are required to ensure the job runs successfully. Batch jobs submitted from Home areas should cd into a Lustre work directory prior to invoking aprun in the batch script. An error like the following may be returned if this is not done:

aprun: [NID 94]Exec /lustre/atlas/scratch/userid/a.out failed: chdir /autofs/na1_home/userid
No such file or directory

Writing Batch Scripts

Batch scripts, or job submission scripts, are the mechanism by which a user configures and submits a job for execution. A batch script is simply a shell script that also includes commands to be interpreted by the batch scheduling software (e.g. PBS).

Batch scripts are submitted to the batch scheduler, where they are then parsed for the scheduling configuration options. The batch scheduler places the script in the appropriate queue, where it is designated as a batch job. Once the batch job makes its way through the queue, the script will be executed on a service node within the set of allocated computational resources.

Sections of a Batch Script

Batch scripts are parsed into the following three sections:

  1. The Interpreter Line
    The first line of a script can be used to specify the script’s interpreter. This line is optional. If not used, the submitter’s default shell will be used. The line uses the “hash-bang-shell” syntax: #!/path/to/shell
  2. The Scheduler Options
    The batch scheduler configuration options are preceded by #PBS, making them appear as comments to a shell. PBS will look for #PBS options in a batch script from the script’s first line through the first non-comment line. A comment line begins with #. #PBS options entered after the first non-comment line will not be read by PBS.

    Note: All batch scheduler options must appear at the beginning of the batch script.
  3. The Executable Commands
    The shell commands follow the last #PBS option and represent the main content of the batch job. If any #PBS lines follow executable statements, they will be ignored as comments.

The executable commands section of a script will be interpreted by a shell and can contain multiple lines of executable invocations, shell commands, and comments. When the job’s queue wait time is finished, commands within this section will be executed on a service node (sometimes called a “head node”) from the set of the job’s allocated resources. Under normal circumstances, the batch job will exit the queue after the last line of the script is executed.

An Example Batch Script

 1: #!/bin/bash
 2: #    Begin PBS directives
 3: #PBS -A pjt000
 4: #PBS -N test
 5: #PBS -j oe
 6: #PBS -l walltime=1:00:00,nodes=1500
 7: #PBS -l gres=atlas1%atlas2
 8: #    End PBS directives and begin shell commands
 9: cd $MEMBERWORK/pjt000
10: date
11: aprun -n 24000 ./a.out

The lines of this batch script do the following:

Line Option Description
1 Optional Specifies that the script should be interpreted by the bash shell.
2 Optional Comments do nothing.
3 Required The job will be charged to the “pjt000” project.
4 Optional The job will be named “test”.
5 Optional The job’s standard output and error will be combined.
6 Required The job will request 1,500 compute nodes for 1 hour.
7 Optional The job will require both the atlas1 and atlas2 Lustre® filesystems to be online.
8 Optional Comments do nothing.
9 This shell command changes to the user’s $MEMBERWORK/pjt000 directory.
10 This shell command will run the date command.
11 This invocation will run 24,000 MPI instances of the executable a.out on the compute nodes allocated by the batch system.

Additional Example Batch Scripts

Warning: Compute nodes can see only the Lustre-backed storage areas; binaries must be executed from within the User Work (i.e., $MEMBERWORK/{projectid}) or Project Work (i.e., $PROJWORK/{projectid}) areas. All data needed by a binary must also exist in these Lustre-backed areas. More information can be found in the Filesystems Available to Compute Nodes section.

Launching an MPI-only job

Suppose we want to launch a job on 300 Titan nodes, each using all 16 available CPU cores. The following example will request (300) nodes for 1 hour and 30 minutes. It will then launch 4,800 (300 x 16) MPI ranks on the allocated cores (one task per core).

  #!/bin/bash
  # File name: my-job-name.pbs
  #PBS -A PRJ123
  #PBS -l walltime=1:30:00
  #PBS -l nodes=300
  #PBS -l gres=atlas1%atlas2

  cd $MEMBERWORK/prj123
   
  aprun -n 4800 my-simulation.exe

The first PBS directive (#PBS -A PRJ123) tells the scheduler that this job should be charged against the PRJ123 allocation. For example, if you are a member of project SCI404, your PBS scripts would instead say #PBS -A SCI404. If you are a member of multiple projects, be careful to double check that your jobs launch with the intended allocation.

To invoke the above script from the command line, simply:

  $ qsub my-job-name.pbs
    123456.nid00004

You can check the status of job number 123456 by running:

  $ showq | grep 123456
    123456   userid    Running   4800   00:00:44   Sat Oct 15 06:18:56

Naming Jobs

Users who submit many jobs to the queue at once may want to consider naming their jobs in order to keep track of which ones are running and which are still being held in the queue. This can be done with the #PBS -N my-job-name option. For example, to name your job P3HT-PCBM-simulation-1:

  #!/bin/bash
  # File name: simulation1.pbs
  #PBS -A PRJ123
  #PBS -N P3HT-PCBM-simulation-1
  #PBS -l walltime=1:30:00
  #PBS -l nodes=300
  #PBS -l gres=atlas1%atlas2

  cd $MEMBERWORK/prj123
   
  aprun -n 4800 my-simulation.exe

Controlling Output

By default, when your jobs print data to STDOUT or STDERR, it gets aggregated into two files: job-name.o123456 and job-name.e123456 (where 123456 is your job id). These files are written into the directory from which you submitted your job with the qsub command.

If you wish to aggregate this output into a single file (which may help you understand where errors occur), you can join these two output streams by using the #PBS -j oe option. For example,

  #!/bin/bash
  # File name: simulation1.pbs
  #PBS -A PRJ123
  #PBS -N P3HT-PCBM-simulation-1
  #PBS -j oe
  #PBS -l walltime=1:30:00
  #PBS -l nodes=300
  #PBS -l gres=atlas1%atlas2

  cd $MEMBERWORK/prj123
   
  aprun -n 4800 my-simulation.exe

Using Environment Modules

By default, the module environment tool is not available in batch scripts. If you need to load modules before launching your jobs (to adjust your $PATH or to make shared libraries available), you will first need to import the module utility into your batch script with the following command:

  source $MODULESHOME/init/<myshell>

where <myshell> is the name of your default shell.

As an example, let’s load the ddt module before launching the following simulation (assuming we are using the bash shell):

  #!/bin/bash
  # File name: simulation.pbs
  #PBS -A PRJ123
  #PBS -N P3HT-PCBM-simulation
  #PBS -j oe
  #PBS -l walltime=1:30:00
  #PBS -l nodes=300
  #PBS -l gres=atlas1%atlas2

  source $MODULESHOME/init/bash
  module load ddt
  cd $MEMBERWORK/prj123
   
  aprun -n 4800 my-simulation.exe

If you are loading a specific programming environment, it is advisable to load the programming environment before loading other modules. Some modules behave differently under each programming environment, and may not load correctly if the programming environment is not specified first.
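
As a minimal sketch (assuming the PGI programming environment is currently loaded and that cray-hdf5 is a library your code needs; substitute the modules relevant to your build):

  $ module swap PrgEnv-pgi PrgEnv-gnu   # select the programming environment first
  $ module load cray-hdf5               # then load modules that depend on it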

Basic MPI on Partial Nodes

A node’s cores cannot be shared by multiple batch jobs or aprun jobs; therefore, a job must be allocated all cores on a node. However, users do not have to utilize all of the cores allocated to their batch job. Through aprun options, users have the ability to run on all or only some of a node’s cores and they have some control over which cores are being used.

Reasons for utilizing only a portion of a node’s cores can be: increasing memory available to each task, utilizing one floating point unit per compute unit, and increasing memory bandwidth available to each task.

Each node contains (2) NUMA nodes. Users can control CPU task layout using the aprun NUMA node flags. For jobs that do not utilize all cores on a node, it may be beneficial to spread a node’s task load over the (2) NUMA nodes using aprun -S. The -j option can also be used to utilize only one integer core on each paired-core compute unit.

The following example will request 4,000 nodes for (8) hours. It will then run a 24,000 task MPI job using (6) of each allocated node’s (16) cores.

  #!/bin/bash
  #PBS -A PRJ123
  #PBS -N mpi-partial-node
  #PBS -j oe
  #PBS -l walltime=8:00:00,nodes=4000
  #PBS -l gres=atlas1%atlas2

  cd $MEMBERWORK/prj123
   
  aprun -n 24000 -S 3 -j1 a.out

To submit the script and check the job’s status:

  $ qsub mpi-partial-node-ex.pbs
    234567.nid00004
  $ showq | grep 234567
    234567   userid   Running   64000   00:00:44   Mon Oct 09 03:11:23

Please note that per Titan’s scheduling policy, the job will be charged for all 4,000 nodes.

Hybrid OpenMP/MPI

The following example batch script will request (3) nodes for (1) hour. It will then run a hybrid OpenMP/MPI job using (3) MPI tasks each running (16) threads.

  #!/bin/bash
  #PBS -A PRJ123
  #PBS -N hybrid-test
  #PBS -j oe
  #PBS -l walltime=1:00:00,nodes=3
  #PBS -l gres=atlas1%atlas2

  cd $PROJWORK/prj123
   
  export OMP_NUM_THREADS=16

  aprun -n 3 -N 1 -d 16 mpi-openmp-ex.x

To compile (on a login node), submit, and monitor this example:

  $ cc -mp mpi-openmp-ex.c -o mpi-openmp-ex.x
  $ qsub mpi-openmp-ex.pbs
    345678.nid00004
  $ showq | grep 345678
    345678   userid   Running   48   00:00:44   Mon Aug 19 21:49:18

Thread Performance Considerations

On Titan, each pair of CPU cores shares a single Floating Point Unit (FPU). This means that arithmetic-laden threads on neighboring CPU cores may contend for the shared FPU, which could lead to performance degradation.

To help avoid this issue, aprun can force only 1 thread to be associated with each core pair by using the -j 1 option. Here’s how we could revise the previous example to allocate only 1 thread per FPU:

  #!/bin/bash
  #PBS -A PRJ123
  #PBS -N hybrid-test
  #PBS -j oe
  #PBS -l walltime=1:00:00,nodes=3
  #PBS -l gres=atlas1%atlas2

  cd $PROJWORK/prj123
   
  export OMP_NUM_THREADS=8

  aprun -n 3 -N 1 -d 8 -j 1 mpi-openmp-ex.x

Launching Several Executables at Once

Warning: Because large numbers of aprun processes can cause other users’ apruns to fail, users are asked to limit the number of simultaneous apruns executed within a batch script. Users are limited to 50 aprun processes per batch job.

The following example will request 6,000 nodes for (12) hours. It will then run (4) MPI jobs each simultaneously running on 24,000 cores. The OS will spread each aprun job out such that the jobs do not share nodes.

  #!/bin/bash
  #PBS -A PROJ123
  #PBS -N multi-job
  #PBS -j oe
  #PBS -l walltime=12:00:00,nodes=6000
  #PBS -l gres=atlas1%atlas2

  cd $MEMBERWORK/prj123

  aprun -n 24000 a.out1 &
  aprun -n 24000 a.out2 &
  aprun -n 24000 a.out3 &
  aprun -n 24000 a.out4 &

  wait

To submit the script and check the job’s status:

  $ qsub multi-job-ex.pbs
    456789.nid00004
  $ showq | grep 456789
    456789   userid    Running   96000   00:00:44   Thu Oct 07 11:32:52

Important Considerations for Simultaneous Jobs in a Single Script

  • The aprun instances must be backgrounded
    The & symbols in the example above place each aprun in the background, allowing the OS to place and run them simultaneously. Without placing the apruns in the background, the OS will run them serially, waiting until one completes before starting the next.
  • The batch script must wait for backgrounded processes
    The wait command will prevent the batch script from returning until each backgrounded aprun completes. Without the wait the script will return once each aprun has been started, causing the batch job to end, which kills each of the backgrounded aprun processes.
  • The aprun instances cannot share nodes
    The system will only run one aprun per node; the system will not run multiple aprun instances on the same node at the same time. For example, users cannot run (2) 8-core aprun jobs on the same node. In order to run (2) 8-core aprun instances at the same time, (2) nodes must be allocated.

    Note: aprun disallows multiple instances on a single node. See the wraprun section for details regarding sharing a node’s cores among multiple aprun instances.

Chaining Batch Jobs

The following example will

  1. Submit 1.pbs which will be immediately eligible for execution
  2. Submit 2.pbs which will be placed in a held state, not eligible for execution until 1.pbs completes without errors:

$ qsub 1.pbs
123451
$ qsub -W depend=afterok:123451 2.pbs
123452

You can then use the showq and checkjob utilities to view job states:

$ showq -u userid
...
123451              userid    Running    16
...
123452              userid       Hold    16
...
$ checkjob 123452
...
NOTE:  job cannot run  (dependency 123451 jobsuccessfulcomplete not met)

Batch Scheduler Node Requests

A node’s cores cannot be allocated to multiple jobs. The OLCF charges allocations based upon the computational resources a job makes unavailable to others. Thus, a job is charged for an entire node even if the job uses only one processor core. To simplify the process, users are required to request an entire node through PBS.

Note: Whole nodes must be requested at the time of job submission, and allocations are reduced by core-hour amounts corresponding to whole nodes, regardless of actual core utilization.
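
For illustration (a minimal sketch with placeholder project ID and executable), even a run that uses a single core must request a whole node, and the allocation is charged for that whole node:

  #!/bin/bash
  #PBS -A PRJ123
  #PBS -l nodes=1,walltime=0:30:00

  cd $MEMBERWORK/prj123
  aprun -n 1 ./single-core-task.exe   # uses 1 of the node's 16 cores; the full node is still charged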

Submitting Batch Scripts

Once written, a batch script is submitted to the batch scheduler via the qsub command.

$ cd /path/to/batch/script
$ qsub ./script.pbs

If successfully submitted, a PBS job ID will be returned. This ID is needed to monitor the job’s status with various job monitoring utilities. It is also necessary information when troubleshooting a failed job, or when asking the OLCF User Assistance Center for help.

Note: Always make a note of the returned job ID upon job submission, and include it in help requests to the OLCF User Assistance Center.

Options to the qsub command allow the specification of attributes which affect the behavior of the job. In general, options to qsub on the command line can also be placed in the batch scheduler options section of the batch script via #PBS.
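
For example, the following command-line submission (with illustrative values):

$ qsub -A PRJ123 -l nodes=1,walltime=1:00:00 script.pbs

is equivalent to placing these lines in the batch scheduler options section of script.pbs:

#PBS -A PRJ123
#PBS -l nodes=1,walltime=1:00:00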


Interactive Batch Jobs

Batch scripts are useful for submitting a group of commands, allowing them to run through the queue, then viewing the results at a later time. However, it is sometimes necessary to run tasks within a job interactively. Users are not permitted to access compute nodes nor run aprun directly from login nodes. Instead, users must use an interactive batch job to allocate and gain access to compute resources interactively. This is done by using the -I option to qsub.

Interactive Batch Example

For interactive batch jobs, PBS options are passed through qsub on the command line.

$ qsub -I -A pjt000 -q debug -X -l nodes=3,walltime=30:00

This request will:

Option Description
-I Start an interactive session
-A Charge to the “pjt000” project
-X Enables X11 forwarding. The DISPLAY environment variable must be set.
-q debug Run in the debug queue
-l nodes=3,walltime=30:00 Request 3 compute nodes for 30 minutes (you get all cores per node)

After running this command, you will have to wait until enough compute nodes are available, just as in any other batch job. However, once the job starts, you will be given an interactive prompt on the head node of your allocated resource. From here commands may be executed directly instead of through a batch script.

Debugging via Interactive Jobs

A common use of interactive batch is to aid in debugging efforts. Interactive access to compute resources allows the ability to run a process to the point of failure; however, unlike a batch job, the process can be restarted after brief changes are made without losing the compute resource allocation. This may help speed the debugging effort because a user does not have to wait in the queue in between each run attempts.
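
A typical interactive debugging session might look like the following sketch (the project ID, node count, and executable name are placeholders):

$ qsub -I -A PRJ123 -q debug -l nodes=1,walltime=1:00:00
# ...wait in the queue; a prompt then opens on the job's head node...
$ cd $MEMBERWORK/prj123
$ aprun -n 16 ./a.out      # run to the point of failure
# make brief changes, then rerun aprun without losing the allocation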

Note: To tunnel a GUI from an interactive batch job, the -X PBS option should be used to enable X11 forwarding.

Choosing an Interactive Job’s nodes Value

Because interactive jobs must sit in the queue until enough resources become available, it is useful to base the nodes request on the number of currently unallocated nodes in order to shorten the queue wait time. The showbf command (i.e., “show backfill”) can be used to see resource limits that would allow your job to be immediately back-filled (and thus started) by the scheduler. For example, the snapshot below shows that 802 nodes are currently free.

$ showbf
Partition   Tasks   Nodes   StartOffset    Duration       StartDate
---------   -----   -----   ------------   ------------   --------------
ALL         4744    802     INFINITY       00:00:00       HH:MM:SS_MM/DD

See showbf --help for additional options.


Common Batch Options to PBS

The following table summarizes frequently-used options to PBS:

Option Use Description
-A #PBS -A <account> Causes the job time to be charged to <account>. The account string, e.g. pjt000, is typically composed of three letters followed by three digits and optionally followed by a subproject identifier. The utility showproj can be used to list your valid assigned project ID(s). This option is required by all jobs.
-l #PBS -l nodes=<value> Maximum number of compute nodes. Jobs cannot request partial nodes.
#PBS -l walltime=<time> Maximum wall-clock time. <time> is in the format HH:MM:SS.
#PBS -l partition=<partition_name> Allocates resources on specified partition.
-o #PBS -o <filename> Writes standard output to <filename> instead of <job script>.o$PBS_JOBID. $PBS_JOBID is an environment variable created by PBS that contains the PBS job identifier.
-e #PBS -e <filename> Writes standard error to <filename> instead of <job script>.e$PBS_JOBID.
-j #PBS -j {oe,eo} Combines standard output and standard error into the standard error file (eo) or the standard out file (oe).
-m #PBS -m a Sends email to the submitter when the job aborts.
#PBS -m b Sends email to the submitter when the job begins.
#PBS -m e Sends email to the submitter when the job ends.
-M #PBS -M <address> Specifies email address to use for -m options.
-N #PBS -N <name> Sets the job name to <name> instead of the name of the job script.
-S #PBS -S <shell> Sets the shell to interpret the job script.
-q #PBS -q <queue> Directs the job to the specified queue. This option is not required to run in the default queue on any given system.
-V #PBS -V Exports all environment variables from the submitting shell into the batch job shell (see the note below).
-X #PBS -X Enables X11 forwarding. The -X PBS option should be used to tunnel a GUI from an interactive batch job.
Note: Because the login nodes differ from the service nodes, using the ‘-V’ option is not recommended. Users should create the needed environment within the batch job.

Further details and other PBS options may be found through the qsub man page.


Batch Environment Variables

PBS sets multiple environment variables at submission time. The following PBS variables are useful within batch scripts:

Variable Description
$PBS_O_WORKDIR The directory from which the batch job was submitted. By default, a new job starts in your home directory. You can get back to the directory of job submission with cd $PBS_O_WORKDIR. Note that this is not necessarily the same directory in which the batch script resides.
$PBS_JOBID The job’s full identifier. A common use for PBS_JOBID is to append the job’s ID to the standard output and error files.
$PBS_NUM_NODES The number of nodes requested.
$PBS_JOBNAME The job name supplied by the user.
$PBS_NODEFILE The name of the file containing the list of nodes assigned to the job. Used sometimes on non-Cray clusters.
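
A brief sketch showing how these variables might be used within a batch script (the project ID and executable are placeholders):

  #!/bin/bash
  #PBS -A PRJ123
  #PBS -l nodes=1,walltime=0:10:00

  cd $MEMBERWORK/prj123
  echo "Job $PBS_JOBID ($PBS_JOBNAME) was submitted from $PBS_O_WORKDIR"
  aprun -n 16 ./a.out > run.$PBS_JOBID.log 2>&1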

Modifying Batch Jobs

The batch scheduler provides a number of utility commands for managing submitted jobs. See each utility’s man page for more information.

Removing and Holding Jobs

qdel

Jobs in the queue in any state can be stopped and removed from the queue using the command qdel.

$ qdel 1234

qhold

Jobs in the queue in a non-running state may be placed on hold using the qhold command. Jobs placed on hold will not be removed from the queue, but they will not be eligible for execution.

$ qhold 1234

qrls

Once on hold the job will not be eligible to run until it is released to return to a queued state. The qrls command can be used to remove a job from the held state.

$ qrls 1234

Modifying Job Attributes

qalter

Non-running jobs in the queue can be modified with the PBS qalter command. Among other operations, the qalter utility can be used to do the following:

Modify the job’s name:

$ qalter -N newname 130494

Modify the number of requested nodes:

$ qalter -l nodes=12 130494

Modify the job’s walltime:

$ qalter -l walltime=01:00:00 130494

Note: Once a batch job moves into a running state, the job’s walltime cannot be increased.

Monitoring Batch Jobs

PBS and Moab provide multiple tools to view queue, system, and job status. Below are the most common and useful of these tools.

Job Monitoring Commands

showq

The Moab utility showq can be used to view a more detailed description of the queue. The utility will display the queue in the following states:

State Description
Active These jobs are currently running.
Eligible These jobs are currently queued awaiting resources. Eligible jobs are shown in the order in which the scheduler will consider them for allocation.
Blocked These jobs are currently queued but are not eligible to run. A job may be in this state because the user has more jobs that are “eligible to run” than the system’s queue policy allows.

To see all jobs currently in the queue:

$ showq

To see all jobs owned by userA currently in the queue:

$ showq -u userA

To see all jobs submitted to partitionA:

$ showq -p partitionA

To see all completed jobs:

$ showq -c

Note: To improve response time, the Moab utilities (showstart, checkjob) display a cached result. The cache updates every 30 seconds. Because the cached result is displayed, you may see the following message:

--------------------------------------------------------------------
NOTE: The following information has been cached by the remote server
      and may be slightly out of date.
--------------------------------------------------------------------

checkjob

The Moab utility checkjob can be used to view details of a job in the queue. For example, if job 736 is a job currently in the queue in a blocked state, the following can be used to view why the job is in a blocked state:

$ checkjob 736

The return may contain a line similar to the following:

BlockMsg: job 736 violates idle HARD MAXJOB limit of X for user (Req: 1 InUse: X)

This line indicates the job is in the blocked state because the owning user has reached the limit for jobs in the “eligible to run” state.

qstat

The PBS utility qstat will poll PBS (Torque) for job information. However, qstat does not know about Moab’s blocked and eligible states. Because of this, the showq Moab utility (see above) will provide a more accurate view of the batch queue state. To show all queued jobs:

$ qstat -a

To show details about job 1234:

$ qstat -f 1234

To show all currently queued jobs owned by userA:

$ qstat -u userA

Titan Batch Queues

Queues are used by the batch scheduler to aid in the organization of jobs. Users typically have access to multiple queues, and each queue may allow different job limits and have different priorities. Unless otherwise notified, users have access to the following queues on Titan:

Name Usage Description Limits
batch No explicit request required Default; most production work runs in this queue. See the Titan Scheduling Policy for details.
killable #PBS -q killable Opportunistic; jobs start even if they will not complete before the onset of a scheduled outage.
debug #PBS -q debug Quick-turnaround; short jobs for software development, testing, and debugging.

The batch Queue

The batch queue is the default queue for production work on Titan. Most work on Titan is handled through this queue.

The killable Queue

At the start of a scheduled system outage, a queue reservation is used to ensure that no jobs are running. In the batch queue, the scheduler will not start a job if it expects that the job would not complete (based on the job’s user-specified max walltime) before the reservation’s start time. In contrast, the killable queue allows the scheduler to start a job even if it will not complete before a scheduled reservation.

Note: If your job can perform usable work within a (1) hour timeframe and is tolerant of abrupt termination, this queue may allow you to take advantage of idle resources available prior to a scheduled outage.

The debug Queue

The debug queue is intended to provide faster turnaround times for the code development, testing, and debugging cycle. For example, interactive parallel work is an ideal use for the debug queue.

Warning: Users who misuse the debug queue may have further access to the queue denied.

More detailed information on any of the batch scheduler queues can be found on the Titan Scheduling Policy page.


Job Execution on Titan

Once resources have been allocated through the batch system, users can:

  • Run commands in serial on the resource pool’s primary service node
  • Run executables in parallel across compute nodes in the resource pool

Serial Execution

The executable portion of a batch script is interpreted by the shell specified on the first line of the script. If a shell is not specified, the submitting user’s default shell will be used. This portion of the script may contain comments, shell commands, executable scripts, and compiled executables. These can be used in combination to, for example, navigate file systems, set up job execution, run executables, and even submit other batch jobs.

Warning: On Titan, each batch job is limited to 200 simultaneous processes. Attempting to open more simultaneous processes than the limit will result in No space left on device errors.

Parallel Execution

By default, commands in the job submission script will be executed on the job’s primary service node. The aprun command is used to execute a binary on one or more compute nodes within a job’s allocated resource pool.

Note: On Titan, the only way to access a compute node is via the aprun command within a batch job.

Using the aprun command

The aprun command is used to run a compiled application program across one or more compute nodes. You use the aprun command to specify application resource requirements, request application placement, and initiate application launch. The machine’s physical node layout plays an important role in how aprun works. Each Titan compute node contains (2) 8-core NUMA nodes on a single socket (a total of 16 cores).

Note: The aprun command is the only mechanism for running an executable in parallel on compute nodes. To run jobs as efficiently as possible, a thorough understanding of how to use aprun and its various options is paramount.

OLCF uses a version of aprun with two extensions. One is used to identify which libraries are used by an executable to allow us to better track third party software that is being actively used on the system. The other analyzes the command line to identify cases where users might be able to optimize their application’s performance by using slightly different job layout options. We highly recommend using both of these features; however, if there is a reason you wish to disable one or the other please contact the User Assistance Center for information on how to do that.

Shell Resource Limits

By default, aprun will not forward shell limits set by ulimit for sh/ksh/bash or by limit for csh/tcsh. To pass these settings to your batch job, you should set the environment variable APRUN_XFER_LIMITS to 1 via export APRUN_XFER_LIMITS=1 for sh/ksh/bash or setenv APRUN_XFER_LIMITS 1 for csh/tcsh.
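
For example, to raise the core-file size limit and forward shell limits to the compute nodes (bash syntax; the limit chosen is illustrative):

  ulimit -c unlimited               # set the desired limit in the batch script
  export APRUN_XFER_LIMITS=1        # forward shell limits to the compute nodes
  aprun -n 16 ./a.out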

Simultaneous aprun Limit

All aprun processes are launched from a small number of shared service nodes. Because large numbers of aprun processes can cause other users’ apruns to fail, users are asked to limit the number of simultaneous apruns executed within a batch script. Users are limited to 50 aprun processes per batch job; attempts to launch apruns over the limit will result in the following error:

apsched: no more claims allowed for this reservation (max 50)

Warning: Users are limited to 50 aprun processes per batch job.


Single-aprun Process Ensembles with wraprun

Wraprun is a utility that enables independent execution of multiple MPI applications under a single aprun call. It borrows from aprun MPMD syntax and also contains some wraprun specific syntax. The latest documentation can be found on the wraprun development README.

In some situations, the simultaneous aprun limit can be overcome by using the utility wraprun. Wraprun has the capacity to run an arbitrary number and combination of qualified MPI or serial applications under a single aprun call.

Note: MPI executables launched under wraprun must be dynamically linked. Non-MPI applications must be launched using a serial wrapper included with wraprun.

Warning: Tasks bundled with wraprun should each consume approximately the same walltime to avoid wasting allocation hours.

Using Wraprun

By default wraprun applications must be dynamically linked. Wraprun itself depends on Python and is compatible with python/2.7.X and python/3.X but requires the user to load their preferred python environment module before loading wraprun. Below is an example of basic wraprun use for the applications foo.out and bar.out.

$ module load dynamic-link
$ cc foo.c -o foo.out
$ cc bar.c -o bar.out
$ module load python wraprun
$ wraprun -n 80 ./foo.out : -n 160 ./bar.out

foo.out and bar.out will run independently under a single aprun call.

In addition to the standard process placement flags available to aprun, the --w-cd flag can be set to change the current working directory for each executable:

$ wraprun -n 80 --w-cd /foo/dir ./foo.out : -n 160 --w-cd /bar/dir ./bar.out

This is particularly useful for legacy FORTRAN applications that use hard coded input and output file names.

Multiple instances of an application can be placed on a node using the comma-separated PES syntax PES1,PES2,…,PESN, for instance:

$ wraprun -n 2,2,2 ./foo.out [ : other tasks...] 

would launch 3 two-process instances of foo.out on a single node.

In this case, the number of allocated nodes must be at least equal to the sum of processes in the comma-separated list of processing elements, divided by the maximum number of processes per node.

This may also be combined with the --w-cd flag:

$ wraprun -n 2,2,2 --w-cd /foo/dir1,/foo/dir2,/foo/dir3 ./foo.out [ : other tasks...] 

For non-MPI executables, a wrapper application, serial, is provided. This wrapper ensures that all executables will run to completion before aprun exits. To use it, place serial in front of your application and arguments:

$ wraprun -n 1 serial ./foo.out -foo_args [ : other tasks...] 

The stdout/err of each task run under wraprun will be directed to its own unique file in the current working directory, with names of the form:

${PBS_JOBID}_w_${TASK_ID}.out
${PBS_JOBID}_w_${TASK_ID}.err

Recommendations and Limitations

It is recommended that applications be dynamically linked. On Titan this can be accomplished by loading the dynamic-link module before invoking the Cray compile wrappers CC, cc, and ftn.

The library may be statically linked although this is not fully supported.

All executables must reside in a compute node visible filesystem, e.g. Lustre. The underlying aprun call made by wraprun enforces the aprun ‘no-copy’ (‘-b’) flag.

wraprun works by intercepting all MPI function calls that contain an MPI_Comm argument. If an application calls an MPI function with an MPI_Comm argument that is not included in src/split.c, the results are undefined.

Common aprun Options

The following table lists commonly-used options to aprun. For a more detailed description of aprun options, see the aprun man page.

Option Description
-D Debug; shows the layout aprun will use
-n Number of total MPI tasks (aka ‘processing elements’) for the executable. If you do not specify the number of tasks to aprun, the system will default to 1.
-N Number of MPI tasks (aka ‘processing elements’) per physical node.

Warning: Because each node contains multiple processors/NUMA nodes, the -S option is likely a better option than -N to control layout within a node.
-m Memory required per MPI task. There is a maximum of 2 GB per core; i.e., requesting 2.1 GB will allocate a minimum of two cores per MPI task.
-d Number of threads per MPI task.

Warning: The default value for -d is 1. If you specify OMP_NUM_THREADS but do not give a -d option, aprun will allocate your threads to a single core. Use OMP_NUM_THREADS to specify to your code the number of threads per MPI task; use -d to tell aprun how to place those threads.
-j
For Titan: Number of CPUs to use per paired-core compute unit. The -j parameter specifies the number of CPUs to be allocated per paired-core compute unit. The valid values for -j are 0 (use the system default), 1 (use one integer core), and 2 (use both integer cores; this is the system default).
For Eos: The -j parameter controls Hyper Threading. The valid values for -j are 0 (use the system default), 1 (turn Hyper Threading off), and 2 (turn Hyper Threading on; this is the system default).
-cc This is the cpu_list option. It binds MPI tasks or threads to the specified CPUs. The list is given as a set of comma-separated numbers (0 through 15), each of which specifies a compute unit (core) on the node. The list can also be given as hyphen-separated ranges of numbers, each of which specifies a range of compute units (cores) on the node. See man aprun.
-S Number of MPI tasks (aka ‘processing elements’) per NUMA node. Can be 1, 2, 3, 4, 5, 6, 7, or 8.
-ss Strict memory containment per NUMA node. The default is to allow remote NUMA node memory access. This option prevents memory access of the remote NUMA node.
-r Assign system services associated with your application to a compute core. If you use fewer than 16 cores, you can request that all system services be placed on an unused core. This will reduce “jitter” (i.e. application variability) because the daemons will not cause the application to context switch unexpectedly. Use -r 1, ensuring that -N is less than 16 or -S is less than 8.

XK7 CPU Description

Each Titan compute node contains (1) AMD Opteron™ 6274 (Interlagos) CPU. Each CPU contains (2) dies. Each die contains (4) “bulldozer” compute units and a shared L3 cache. Each compute unit contains (2) integer cores (and their L1 cache), a shared floating point scheduler, and a shared L2 cache. To aid in task placement, each die is organized into a NUMA node. Each compute node contains (2) NUMA nodes. Each NUMA node contains a die’s L3 cache and its (4) compute units (8 cores). This configuration is shown graphically below.

(Figure: Opteron 6274 CPU schematic)

Controlling MPI Task Layout Within a Physical Node

Users have (2) ways to control MPI task layout:

  1. Within a physical node
  2. Across physical nodes

This article focuses on how to control MPI task layout within a physical node.

Understanding NUMA Nodes

Each physical node is organized into (2) 8-core NUMA nodes. NUMA is an acronym for “Non-Uniform Memory Access”. You can think of a NUMA node as a division of a physical node that contains a subset of processor cores and their high-affinity memory. Applications may use resources from one or both NUMA nodes. The default MPI task layout is SMP-style. This means MPI will sequentially allocate all cores on one NUMA node before allocating tasks to another NUMA node.

Spreading MPI Tasks Across NUMA Nodes

Each physical node contains (2) NUMA nodes. Users can control MPI task layout using the aprun NUMA node flags. For jobs that do not utilize all cores on a node, it may be beneficial to spread a physical node’s MPI task load over the (2) available NUMA nodes via the -S option to aprun.

Note: Jobs that do not utilize all of a physical node’s processor cores may see performance improvements by spreading MPI tasks across NUMA nodes within a physical node.

Example 1: Default NUMA Placement

Job requests (2) processor cores without a NUMA flag. Both tasks are placed on the first NUMA node.

$ aprun -n2 ./a.out
Rank 0, Node 0, NUMA 0, Core 0
Rank 1, Node 0, NUMA 0, Core 1

Example 2: Specific NUMA Placement

Job requests (2) processor cores with aprun -S. A task is placed on each of the (2) NUMA nodes:

$ aprun -n2 -S1 ./a.out
Rank 0, Node 0, NUMA 0, Core 0
Rank 1, Node 0, NUMA 1, Core 0

The following table summarizes common NUMA node options to aprun:

Option Description
-S Processing elements (essentially a processor core) per NUMA node. Specifies the number of PEs to allocate per NUMA node. Can be 1, 2, 3, 4, 5, 6, 7, or 8.
-ss Strict memory containment per NUMA node. The default is to allow remote NUMA node memory access. This option prevents memory access of the remote NUMA node.

Advanced NUMA Node Placement

Example 1: Grouping MPI Tasks on a Single NUMA Node

Run a.out on (8) cores. Place (8) MPI tasks on (1) NUMA node. In this case the aprun -S option is optional:

$ aprun -n8 -S8 ./a.out

Compute Node
NUMA 0: ranks 0-7 on cores 0-7
NUMA 1: idle

Example 2: Spreading MPI tasks across NUMA nodes

Run a.out on (8) cores. Place (4) MPI tasks on each of (2) NUMA nodes via aprun -S.

$ aprun -n8 -S4 ./a.out

Compute Node
NUMA 0: ranks 0-3 on cores 0-3
NUMA 1: ranks 4-7 on cores 0-3

Example 3: Spread Out MPI Tasks Across Paired-Core Compute Units

The -j option can be used for codes to allow one task per paired-core compute unit. Run a.out on (8) cores; (4) cores per NUMA node; but only (1) core on each paired-core compute unit:

$ aprun -n8 -S4 -j1 ./a.out

Compute Node
NUMA 0: ranks 0-3 on cores 0, 2, 4, 6 (one core per paired-core compute unit)
NUMA 1: ranks 4-7 on cores 0, 2, 4, 6

To see MPI rank placement information on the nodes, set the PMI_DEBUG environment variable to 1.
For csh/tcsh:

$ setenv PMI_DEBUG 1

For bash:

$ export PMI_DEBUG=1

Example 4: Assign System Services to an Unused Compute Core

The -r option can be used to assign system services associated with your application to a compute core. If you use fewer than 16 cores, you can request that all of the system services be placed on an unused core. This will reduce “jitter” (i.e. application variability) because the daemons will not cause the application to context switch unexpectedly. Use -r 1, ensuring that -N is less than 16 or -S is less than 8. The following example runs a.out on (8) cores, (4) cores per NUMA node, with only (1) core used on each paired-core compute unit; the node’s last core is reserved for system services only:

$ aprun -n8 -S4 -j1 -r1 ./a.out

Compute Node
NUMA 0: ranks 0-3 on cores 0, 2, 4, 6
NUMA 1: ranks 4-7 on cores 0, 2, 4, 6; core 7 (the node's last core) is reserved for system services

Controlling MPI Task Layout Across Many Physical Nodes

Users have (2) ways to control MPI task layout:

  1. Within a physical node
  2. Across physical nodes

This article focuses on how to control MPI task layout across physical nodes. The default MPI task layout is SMP-style. This means MPI will sequentially allocate all virtual cores on one physical node before allocating tasks to another physical node.

Viewing Multi-Node Layout Order

Task layout can be seen by setting MPICH_RANK_REORDER_DISPLAY to 1.
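
For example (bash syntax):

$ export MPICH_RANK_REORDER_DISPLAY=1
$ aprun -n 32 ./a.out     # the rank-to-node placement is printed at launch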

Changing Multi-Node Layout Order

For multi-node jobs, layout order can be changed using the environment variable MPICH_RANK_REORDER_METHOD. See man intro_mpi for more information.

Multi-Node Layout Order Examples

Example 1: Default Layout

The following will run a.out across (32) cores. This requires (2) physical compute nodes.

# On Titan
$ aprun -n 32 ./a.out

# On Eos, Hyper-threading must be disabled:
$ aprun -n 32 -j1 ./a.out

Compute Node 0
NUMA 0: ranks 0-7 on cores 0-7
NUMA 1: ranks 8-15 on cores 0-7
Compute Node 1
NUMA 0: ranks 16-23 on cores 0-7
NUMA 1: ranks 24-31 on cores 0-7

Example 2: Round-Robin Layout

The following will place tasks in a round robin fashion. This requires (2) physical compute nodes.

$ setenv MPICH_RANK_REORDER_METHOD 0
# On Titan
$ aprun -n 32 ./a.out

# On Eos, Hyper-threading must be disabled:
$ aprun -n 32 -j1 ./a.out

Compute Node 0
NUMA 0: ranks 0, 2, 4, 6, 8, 10, 12, 14 on cores 0-7
NUMA 1: ranks 16, 18, 20, 22, 24, 26, 28, 30 on cores 0-7
Compute Node 1
NUMA 0: ranks 1, 3, 5, 7, 9, 11, 13, 15 on cores 0-7
NUMA 1: ranks 17, 19, 21, 23, 25, 27, 29, 31 on cores 0-7

Example 3: Combining Inter-Node and Intra-Node Options

The following combines MPICH_RANK_REORDER_METHOD and -S to place tasks on three cores per NUMA node within a node, and in a round-robin fashion across nodes.

$ setenv MPICH_RANK_REORDER_METHOD 0
$ aprun -n12 -S3 ./a.out

Compute Node 0
NUMA 0: ranks 0, 2, 4 on cores 0-2
NUMA 1: ranks 6, 8, 10 on cores 0-2
Compute Node 1
NUMA 0: ranks 1, 3, 5 on cores 0-2
NUMA 1: ranks 7, 9, 11 on cores 0-2


Controlling Thread Layout Within a Physical Node

Titan supports threaded programming within a compute node. Threads may span across both processors within a single compute node, but cannot span compute nodes. Users have a great deal of flexibility in thread placement. Several examples are shown below.

Note: Threaded codes must use the -d (depth) option to aprun.

The -d option to aprun specifies the number of threads per MPI task. Under previous CNL versions this option was not required. Under the current CNL version, the number of cores used is calculated by multiplying the value of -d by the value of -n.

Warning: Without the -d option, all threads will be started on the same processor core. This can lead to performance degradation for threaded codes.

Thread Layout Examples

The following examples are written for the bash shell. If using csh/tcsh, you should change export OMP_NUM_THREADS=x to setenv OMP_NUM_THREADS x wherever it appears.

Example 1: (2) MPI tasks, (16) Threads Each

This example will launch (2) MPI tasks, each with (16) threads. This spans (2) compute nodes and requires a nodes request of (2):

$ export OMP_NUM_THREADS=16
$ aprun -n2 -d16 a.out

Rank 0, Thread 0, Node 0, NUMA 0, Core 0 <-- MASTER
Rank 0, Thread 1, Node 0, NUMA 0, Core 1 <-- slave
Rank 0, Thread 2, Node 0, NUMA 0, Core 2 <-- slave
Rank 0, Thread 3, Node 0, NUMA 0, Core 3 <-- slave
Rank 0, Thread 4, Node 0, NUMA 0, Core 4 <-- slave
Rank 0, Thread 5, Node 0, NUMA 0, Core 5 <-- slave
Rank 0, Thread 6, Node 0, NUMA 0, Core 6 <-- slave
Rank 0, Thread 7, Node 0, NUMA 0, Core 7 <-- slave
Rank 0, Thread 8, Node 0, NUMA 1, Core 0 <-- slave
Rank 0, Thread 9, Node 0, NUMA 1, Core 1 <-- slave
Rank 0, Thread 10,Node 0, NUMA 1, Core 2 <-- slave
Rank 0, Thread 11,Node 0, NUMA 1, Core 3 <-- slave
Rank 0, Thread 12,Node 0, NUMA 1, Core 4 <-- slave
Rank 0, Thread 13,Node 0, NUMA 1, Core 5 <-- slave
Rank 0, Thread 14,Node 0, NUMA 1, Core 6 <-- slave
Rank 0, Thread 15,Node 0, NUMA 1, Core 7 <-- slave
Rank 1, Thread 0, Node 1, NUMA 0, Core 0 <-- MASTER
Rank 1, Thread 1, Node 1, NUMA 0, Core 1 <-- slave
Rank 1, Thread 2, Node 1, NUMA 0, Core 2 <-- slave
Rank 1, Thread 3, Node 1, NUMA 0, Core 3 <-- slave
Rank 1, Thread 4, Node 1, NUMA 0, Core 4 <-- slave
Rank 1, Thread 5, Node 1, NUMA 0, Core 5 <-- slave
Rank 1, Thread 6, Node 1, NUMA 0, Core 6 <-- slave
Rank 1, Thread 7, Node 1, NUMA 0, Core 7 <-- slave
Rank 1, Thread 8, Node 1, NUMA 1, Core 0 <-- slave
Rank 1, Thread 9, Node 1, NUMA 1, Core 1 <-- slave
Rank 1, Thread 10,Node 1, NUMA 1, Core 2 <-- slave
Rank 1, Thread 11,Node 1, NUMA 1, Core 3 <-- slave
Rank 1, Thread 12,Node 1, NUMA 1, Core 4 <-- slave
Rank 1, Thread 13,Node 1, NUMA 1, Core 5 <-- slave
Rank 1, Thread 14,Node 1, NUMA 1, Core 6 <-- slave
Rank 1, Thread 15,Node 1, NUMA 1, Core 7 <-- slave

Example 2: (2) MPI tasks, (6) Threads Each

This example will launch (2) MPI tasks, each with (6) threads, placing (1) MPI task per NUMA node. This requires (1) physical compute node and a nodes request of (1):

$ export OMP_NUM_THREADS=6
$ aprun -n2 -d6 -S1 a.out

Compute Node
NUMA 0: Rank 0, Threads 0-5 on cores 0-5
NUMA 1: Rank 1, Threads 0-5 on cores 0-5

Example 3: (4) MPI tasks, (2) Threads Each

This example will launch (4) MPI tasks, each with (2) threads. Place only (1) MPI task [and its (2) threads] on each NUMA node. This requests (2) physical compute nodes and requires a nodes request of (2), even though only (8) cores are actually being used:

$ export OMP_NUM_THREADS=2
$ aprun -n4 -d2 -S1 a.out

Rank 0, Thread 0, Node 0, NUMA 0, Core 0 <-- MASTER
Rank 0, Thread 1, Node 0, NUMA 0, Core 1 <-- slave
Rank 1, Thread 0, Node 0, NUMA 1, Core 0 <-- MASTER
Rank 1, Thread 1, Node 0, NUMA 1, Core 1 <-- slave
Rank 2, Thread 0, Node 1, NUMA 0, Core 0 <-- MASTER
Rank 2, Thread 1, Node 1, NUMA 0, Core 1 <-- slave
Rank 3, Thread 0, Node 1, NUMA 1, Core 0 <-- MASTER
Rank 3, Thread 1, Node 1, NUMA 1, Core 1 <-- slave

Example 4: (2) MPI tasks, (4) Threads Each, Using only (1) core per compute unit

The -j option can be used to allow use of only one core per paired-core compute unit. This example will launch (2) MPI tasks, each with (4) threads, placing only (1) MPI task [and its (4) threads] on each NUMA node. One core per paired-core compute unit sits idle. This requires (1) physical compute node and a nodes request of (1), even though only (8) cores are actually being used:

$ export OMP_NUM_THREADS=4
$ aprun -n2 -d4 -S1 -j1 a.out

Compute Node
NUMA 0: Rank 0, Threads 0-3 on cores 0, 2, 4, 6
NUMA 1: Rank 1, Threads 0-3 on cores 0, 2, 4, 6

Example 5: (2) MPI tasks, (8) Threads Each, Using only (1) core per compute unit

The -j option can be used to allow use of only one core per paired-core compute unit. This example will launch (2) MPI tasks, each with (8) threads. One core per paired-core compute unit will sit idle. This requires (2) physical compute nodes and a nodes request of (2), even though only (16) cores are actually being used:

$ export OMP_NUM_THREADS=8
$ aprun -n2 -d8 -N1 -j1 a.out

Compute Node 0
NUMA 0: Rank 0, Threads 0-3 on cores 0, 2, 4, 6
NUMA 1: Rank 0, Threads 4-7 on cores 0, 2, 4, 6
Compute Node 1
NUMA 0: Rank 1, Threads 0-3 on cores 0, 2, 4, 6
NUMA 1: Rank 1, Threads 4-7 on cores 0, 2, 4, 6

The -cc option can be used to control the placement of threads or tasks on specific processing units. To accomplish the same layout shown above with -cc:

$ export OMP_NUM_THREADS=8
$ aprun -n2 -d8 -N1 -cc 0,2,4,6,8,10,12,14 ./a.out


Running Accelerated Applications on Titan

Each of Titan’s 18,688 compute nodes contains an NVIDIA K20X accelerator; the login and service nodes do not. As such, the only way to reach a node containing an accelerator is through aprun. For more details on the types of nodes that constitute Titan, please see the Login vs. Service vs. Compute Nodes section above. No additional steps are required to access the GPU beyond what is required by the acceleration technique used. Titan does possess a few unique accelerator characteristics that are discussed below.

Accelerator Modules

Access to the CUDA framework is provided through the cudatoolkit module. This module provides access to NVIDIA tools such as nvcc, as well as libraries such as the CUDA runtime. When the cudatoolkit module is loaded, shared linking will be enabled in the Cray compiler wrappers CC, cc, and ftn. The craype-accel-nvidia35 module will load the cudatoolkit as well as set several accelerator options used by the Cray toolchain.
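
For example:

$ module load cudatoolkit              # provides nvcc and the CUDA runtime libraries
# or, to also set the Cray toolchain accelerator options:
$ module load craype-accel-nvidia35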

Compiling and Linking CUDA

When compiling on Cray machines, the Cray compiler wrappers (e.g. cc, CC, and ftn) work in conjunction with the modules system to link in needed libraries such as MPI; it is therefore recommended that the Cray compiler wrappers be used to compile CPU portions of your code.

To generate compiler-portable code, it is necessary to compile CUDA C and any code containing CUDA runtime calls with NVCC. The resulting NVCC-compiled object code must then be linked with object code compiled with the Cray wrappers. NVCC performs GNU-style C++ name mangling on compiled functions, so care must be taken when compiling and linking codes.

The section below briefly covers this technique. For complete examples, please see our tutorial on Compiling Mixed GPU and CPU Code.

C++ Host Code

When linking C++ host code with NVCC compiled code, the C++ code must use GNU-compatible name mangling. This is controlled through compiler specific wrapper flags.

PGI Compiler

$ nvcc -c GPU.cu
$ CC --gnu CPU.cxx GPU.o

Cray, GNU, and Intel Compilers

$ nvcc -c GPU.cu
$ CC CPU.cxx GPU.o

C Host Code

NVCC name mangling must be disabled if the compiled code is to be linked with C code. This requires the extern "C" function qualifier be used on functions compiled with NVCC but called from cc-compiled code.

extern "C" void GPU_function()
{
...
}
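
A minimal compile-and-link sketch (the file names are hypothetical):

$ nvcc -c GPU.cu            # defines the extern "C" GPU_function
$ cc CPU.c GPU.o            # CPU.c declares and calls GPU_function()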

Fortran: Simple

To allow C code to be called from Fortran, one method requires that the C function name be modified. NVCC name mangling must be disabled if it is to be linked with Fortran code. This requires the extern "C" function qualifier. Additionally, function names must be lowercase and end in an underscore character (i.e., _ ).

NVCC Compiled

extern "C" void gpu_function_()
{
...
}

ftn Compiled

call gpu_function()
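
The corresponding compile-and-link step might look like the following sketch (file names are hypothetical):

$ nvcc -c GPU.cu            # defines the extern "C" gpu_function_
$ ftn main.f90 GPU.o        # main.f90 contains "call gpu_function()"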

Fortran: ISO_C_BINDING

ISO_C_BINDING provides Fortran a greater interoperability with C and removes the need to modify the C function name. Additionally the ISO_C_BINDING guarantees data type compatibility.

NVCC Compiled

extern "C" void gpu_function()
{
...
}

ftn Compiled

module gpu
  INTERFACE
    subroutine gpu_function() BIND (C, NAME='gpu_function')
      USE ISO_C_BINDING
      implicit none
    end subroutine gpu_function
  END INTERFACE
end module gpu

...

call gpu_function()

CUDA Proxy

The default GPU compute mode for Titan is exclusive process. In this mode, many threads within a process may access the GPU context. To allow multiple processes access to the GPU context, such as multiple MPI tasks on a single node accessing the GPU, the CUDA proxy server was developed. Once enabled, the CUDA proxy server transparently manages work issued to the GPU context from multiple processes.

Warning: Currently, GPU memory between processes accessing the proxy is not guarded, meaning process i can access memory allocated by process j. The proxy SHOULD NOT be used to share memory between processes, and care should be taken to ensure that processes access only GPU memory they have allocated themselves.

How to Enable

To enable the proxy server, set the following environment variable before invoking aprun:

$ export CRAY_CUDA_PROXY=1
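
For context, the following minimal batch-script sketch shows where this setting fits relative to aprun when running multiple MPI ranks per node against a single GPU. The project ID, node count, walltime, and executable name are placeholders.

#!/bin/bash
#PBS -A ABC123
#PBS -l nodes=1,walltime=0:30:00

cd $MEMBERWORK/abc123

# Enable the CUDA proxy server before invoking aprun
export CRAY_CUDA_PROXY=1

# All 16 ranks on the node may now share the node's K20X through the proxy
aprun -n 16 -N 16 ./a.out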

Issues

Currently, GPU debugging and profiling are not supported when the proxy is enabled. On Titan, specifying the qsub flag -l feature=gpudefault will switch the compute mode from exclusive process to the CUDA default mode. In the default mode, debugging and profiling are available, and multiple MPI ranks will be able to access the GPU. However, the default compute mode is not recommended on Titan: in the default compute mode, approximately 120 MB of device memory is consumed per process accessing the GPU, and inconsistent behavior may be encountered under certain conditions.

GPUDirect: CUDA-enabled MPICH

Cray’s implementation of MPICH2 allows GPU memory buffers to be passed directly to MPI function calls, eliminating the need to manually copy GPU data to the host before passing data through MPI. Several examples of using this feature are given below.

How to Enable

To enable GPUDirect, set the following environment variables before invoking aprun:

  $ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
  $ export MPICH_RDMA_ENABLED_CUDA=1
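
Within a batch script, these settings are placed before the aprun call. The sketch below is a minimal illustration; the project ID, node count, walltime, and executable name are placeholders.

#!/bin/bash
#PBS -A ABC123
#PBS -l nodes=2,walltime=0:30:00

cd $MEMBERWORK/abc123

# Enable GPUDirect before invoking aprun
export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
export MPICH_RDMA_ENABLED_CUDA=1

# The application may now pass GPU buffers directly to MPI calls
aprun -n 2 -N 1 ./a.out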

Optimizations

Several optimizations for improving performance are given below. These optimizations are highly application dependent and may require some trial-and-error tuning to achieve best results.

Pipelining

Pipelining allows overlapping of GPU-to-GPU MPI messages and may improve message-passing performance for large, bandwidth-bound messages. Setting the environment variable MPICH_G2G_PIPELINE=N allows a maximum of N GPU-to-GPU messages to be in flight at any given time. The default value of MPICH_G2G_PIPELINE is (16), and messages under (8) kilobytes in size are never pipelined.
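
For example, the following sketch raises the pipeline depth in a batch script before launching; the value 64 and the task count are illustrative, not recommendations, and should be tuned per application.

# Allow up to 64 GPU-to-GPU messages in flight (illustrative value)
export MPICH_G2G_PIPELINE=64
aprun -n 16 ./a.out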

Nemesis

Applications using asynchronous MPI calls may benefit from enabling the MPICH asynchronous progress feature. Setting the MPICH_NEMESIS_ASYNC_PROGRESS=1 environment variable enables additional threads to be spawned to progress the MPI state.

This feature requires that the thread level be set to multiple: MPICH_MAX_THREAD_SAFETY=multiple.

This feature works best when used in conjunction with core specialization: aprun -r N, which allows for N CPU cores to be reserved for system services.
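
Putting these together, a minimal batch-script sketch might look like the following; the task counts and executable name are placeholders, assuming a 4-node batch job.

# Enable asynchronous progress threads and require the "multiple" thread level
export MPICH_NEMESIS_ASYNC_PROGRESS=1
export MPICH_MAX_THREAD_SAFETY=multiple

# Reserve (1) core per node for system services via core specialization,
# leaving 15 cores per node for the application
aprun -r 1 -n 60 -N 15 ./a.out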

Example

Several examples are provided in our GPU Direct Tutorial.

Microbenchmark

The following benchmarks were performed with cray-mpich2/6.1.1.

[Benchmark figures]

Pinned Memory

Memory bandwidth between the CPU (host) and GPU (device) can be increased through the use of pinned, or page-locked, host memory. Additionally, pinned memory allows for asynchronous memory copies.

To transfer memory between the host and device, the device driver must know that the host memory is pinned. If it is not, the memory will first be copied into a pinned buffer and then transferred, effectively lowering copy bandwidth. For this reason, pinned memory usage is recommended on Titan.

[Figure: K20X Bandwidth]

Job Resource Accounting

The hybrid nature of Titan’s accelerated XK7 nodes mandated a new approach to node allocation and job charge units. For the sake of resource accounting, each Titan XK7 node is defined as possessing (30) total cores (i.e., (16) CPU cores + (14) GPU core equivalents). Jobs consume charge units in “Titan core-hours”, and each Titan node consumes (30) such units per hour. As in years past, jobs on the Titan system are scheduled in full-node increments; a node’s cores cannot be allocated to multiple jobs. Because the OLCF charges based on what a job makes unavailable to other users, a job is charged for an entire node even if it uses only one core on that node. To simplify the process, users are required to request whole nodes through PBS. Notably, codes that do not take advantage of GPUs will have only (16) CPU cores available per node; however, allocation requests, and the units charged, will be based on (30) cores per node.

Note: Whole nodes must be requested at the time of job submission, and associated allocations are reduced by (30) core-hours per node, regardless of actual CPU or GPU core utilization.

Titan Core-Hour Calculation

The Titan core-hour charge for each batch job will be calculated as follows:

Titan core-hours = nodes requested * 30 * ( batch job endtime - batch job starttime )

Where batch job starttime is the time the job moves into a running state, and batch job endtime is the time the job exits a running state. A batch job’s usage is calculated solely on requested nodes and the batch job’s start and end time. The number of cores actually used within any particular node within the batch job is not used in the calculation. For example, if a job requests 64 nodes through the batch script, runs for an hour, uses only 2 CPU cores per node, and uses no GPU cores, the job will still be charged for 64 * 30 * 1 = 1,920 Titan core-hours.

Note: Projects are allocated time on Titan in units of “Titan core-hours”. Other OLCF systems are allocated in units of “core-hours”.

Viewing Allocation Utilization

Utilization is calculated daily using batch jobs which complete between 00:00 and 23:59 of the previous day. For example, if a job moves into a run state on Tuesday and completes Wednesday, the job’s utilization will be recorded Thursday. Only batch jobs which write an end record are used to calculate utilization. Batch jobs which do not write end records due to system failure or other reasons are not used when calculating utilization. Each user may view usage for projects of which they are members using the command-line tool showusage or via the My OLCF site.

On the Command Line via showusage

The showusage utility can be used to view your usage from January 01 through midnight of the previous day. For example:

$ showusage
Usage on titan:
                                  Project Totals          <userid>
 Project      Allocation        Usage    Remaining          Usage
_________________________|___________________________|_____________
 <YourProj>    2000000   |   123456.78   1876543.22  |     1560.80

The -h option will list more usage details.

On the Web via My OLCF

More detailed metrics may be found on each project’s usage section of the My OLCF site. The following information is available for each project:

  • YTD usage by system, subproject, and project member
  • Monthly usage by system, subproject, and project member
  • YTD usage by job size groupings for each system, subproject, and project member
  • Weekly usage by job size groupings for each system and subproject
  • Batch system priorities by project and subproject
  • Project members

The My OLCF site is provided to aid in the utilization and management of OLCF allocations. If you have any questions or have a request for additional data, please contact the OLCF User Assistance Center.


Titan Scheduling Policy

Note: This details an official policy of the OLCF, and must be agreed to by the following persons as a condition of access to or use of OLCF computational resources:

  • Principal Investigators (Non-Profit)
  • Principal Investigators (Industry)
  • All Users

Title: Titan Scheduling Policy Version: 13.02

In a simple batch queue system, jobs run in a first-in, first-out (FIFO) order. This often does not make effective use of the system. A large job may be next in line to run. If the system is using a strict FIFO queue, many processors sit idle while the large job waits to run. Backfilling would allow smaller, shorter jobs to use those otherwise idle resources, and with the proper algorithm, the start time of the large job would not be delayed. While this does make more effective use of the system, it indirectly encourages the submission of smaller jobs.

The DOE Leadership-Class Job Mandate

As a DOE Leadership Computing Facility, the OLCF has a mandate that a large portion of Titan’s usage come from large, leadership-class (aka capability) jobs. To ensure the OLCF complies with DOE directives, we strongly encourage users to run jobs on Titan that are as large as their code will warrant. To that end, the OLCF implements queue policies that enable large jobs to run in a timely fashion.

Note: The OLCF implements queue policies that encourage the submission and timely execution of large, leadership-class jobs on Titan.

The basic priority-setting mechanism for jobs waiting in the queue is the time a job has been waiting relative to other jobs in the queue. However, several factors are applied by the batch system to modify the apparent time a job has been waiting. These factors include:

  • The number of nodes requested by the job.
  • The queue to which the job is submitted.
  • The 8-week history of usage for the project associated with the job.
  • The 8-week history of usage for the user associated with the job.
Note: The command line utility $ mdiag -p can be used to see the individual factors contributing to a job’s priority.

If your jobs require resources outside these queue policies, please complete the relevant request form on the Special Requests page. If you have any questions or comments on the queue policies below, please direct them to the User Assistance Center.

Job Priority by Processor Count

Jobs are aged according to the job’s requested processor count (older age equals higher queue priority). Each job’s requested processor count places it into a specific bin. Each bin has a different aging parameter, which all jobs in the bin receive.

Bin   Min Nodes   Max Nodes   Max Walltime (Hours)   Aging Boost (Days)
1     11,250      --          24.0                   15
2     3,750       11,249      24.0                   5
3     313         3,749       12.0                   0
4     126         312         6.0                    0
5     1           125         2.0                    0

FairShare Scheduling Policy

FairShare, as its name suggests, tries to push each user and project towards their fair share of the system’s utilization: in this case, 5% of the system’s utilization per user and 10% of the system’s utilization per project. To do this, the job scheduler adds (30) minutes priority aging per user and (1) hour of priority aging per project for every (1) percent the user or project is under its fair share value for the prior (8) weeks. Similarly, the job scheduler subtracts priority in the same way for users or projects that are over their fair share. For instance, a user who has personally used 0.0% of the system’s utilization over the past (8) weeks who is on a project that has also used 0.0% of the system’s utilization will get a (12.5) hour bonus (5 * 30 min for the user + 10 * 1 hour for the project). In contrast, a user who has personally used 0.0% of the system’s utilization on a project that has used 12.5% of the system’s utilization would get no bonus (5 * 30 min for the user – 2.5 * 1 hour for the project).

batch Queue Policy

The batch queue is the default queue for production work on Titan. Most work on Titan is handled through this queue. It enforces the following policies:

  • Limit of (4) eligible-to-run jobs per user.
  • Jobs in excess of the per user limit above will be placed into a held state, but will change to eligible-to-run at the appropriate time.
  • Users may have only (2) jobs in bin 5 running at any time. Any additional jobs will be blocked until one of the running jobs completes.
Note: The eligible-to-run state is not the running state. Eligible-to-run jobs have not started and are waiting for resources. Running jobs are actually executing.

killable Queue Policy

At the start of a scheduled system outage, a queue reservation is used to ensure that no jobs are running. In the batch queue, the scheduler will not start a job if it expects that the job would not complete (based on the job’s user-specified max walltime) before the reservation’s start time. In contrast, the killable queue allows the scheduler to start a job even if it will not complete before a scheduled reservation. It enforces the following policies:

  • Jobs will be killed if still running when a system outage begins.
  • The scheduler will stop scheduling jobs in the killable queue (1) hour before a scheduled outage.
  • Maximum-jobs-per-user limits are the same as, and counted in conjunction with, the batch queue.
  • Any killed jobs will be automatically re-queued after a system outage completes.

debug Queue Policy

The debug queue is intended to provide faster turnaround times for the code development, testing, and debugging cycle. For example, interactive parallel work is an ideal use for the debug queue. It enforces the following policies:

  • Production jobs are not allowed.
  • Maximum job walltime of (1) hour.
  • Limit of (1) job per user regardless of the job’s state.
  • Jobs receive a (2)-day priority aging boost for scheduling.
Warning: Users who misuse the debug queue may have further access to the queue denied.

Allocation Overuse Policy

Projects that overrun their allocation are still allowed to run on OLCF systems, although at a reduced priority. Like the adjustment for the number of processors requested above, this is an adjustment to the apparent submit time of the job. However, this adjustment has the effect of making jobs appear much younger than jobs submitted under projects that have not exceeded their allocation. In addition to the priority change, these jobs are also limited in the amount of wall time that can be used. For example, consider that job1 is submitted at the same time as job2. The project associated with job1 is over its allocation, while the project for job2 is not. The batch system will consider job2 to have been waiting for a longer time than job1. Additionally, projects that are at 125% of their allocated time will be limited to only one running job at a time. The adjustment to the apparent submit time depends upon the percentage that the project is over its allocation, as shown in the table below:

% Of Allocation Used   Priority Reduction   Number Eligible-to-Run   Number Running
< 100%                 0 days               4 jobs                   unlimited jobs
100% to 125%           30 days              4 jobs                   unlimited jobs
> 125%                 365 days             4 jobs                   1 job

System Reservation Policy

Projects may request to reserve a set of processors for a period of time through the reservation request form, which can be found on the Special Requests page. If the reservation is granted, the reserved processors will be blocked from general use for a given period of time. Only users that have been authorized to use the reservation can utilize those resources. Since no other users can access the reserved resources, it is crucial that groups given reservations take care to ensure the utilization on those resources remains high. To prevent reserved resources from remaining idle for an extended period of time, reservations are monitored for inactivity. If activity falls below 50% of the reserved resources for more than (30) minutes, the reservation will be canceled and the system will be returned to normal scheduling. A new reservation must be requested if this occurs. Since a reservation makes resources unavailable to the general user population, projects that are granted reservations will be charged (regardless of their actual utilization) a CPU-time equivalent to (# of cores reserved) * (length of reservation in hours).


Aprun Tips

The following tips may help you diagnose errors and improve job runtime.

Layout Suggestion: Avoiding Floating-Point Contention

Note: Because the layout of tasks within a node may negatively impact performance, you may receive an aprun warning notice if we detect that the specified aprun layout does not spread the tasks evenly over the node.

An aprun wrapper will parse the given layout options and return a warning if tasks are not spread evenly over a node’s compute units and/or NUMA nodes. You may see a warning similar to the following if the wrapper detects a possible non-optimal layout:

 APRUN usage: requested less processes than cores (-N 2) without using -j 1 to avoid floating-point unit contention 

Each Titan compute node contains (1) AMD Opteron™ 6274 (Interlagos) CPU. Each CPU contains (2) dies, each die contains (4) “bulldozer” compute units, and each compute unit contains (2) integer cores and a shared floating point scheduler. By default, aprun will place 16 processes on a node; in this layout, pairs of processes placed on the same compute unit contend for the compute unit’s floating point scheduler. If your code is floating point intensive, sharing the floating point scheduler may degrade performance. You can override this behavior using the aprun options -j and -S to control process layout. The following examples do not use all cores on a node, yet by default the tasks still share compute units’ floating point schedulers. The examples assume:

  • 16 cores per node
  • 4 nodes allocated to batch job: #PBS -l nodes=4

aprun -n16 -S2 ./a.out

Problem:
Not all cores on the node are used, but the tasks will be placed on the first compute unit of each NUMA node. With the default layout, 3 compute units on each NUMA node will sit idle while the tasks on the first compute unit contend for its floating point scheduler.
Suggestion:
Using the -j1 aprun flag, the tasks will be spread out such that only one integer core on each compute unit is used. This prevents contention for each compute unit’s floating point scheduler.

aprun -n16 -S2 -j1 ./a.out
Note: When using the -j flag, a portion of a node’s integer cores will sit idle. Batch jobs cannot share nodes; a batch job will be charged for an entire node (30 core-hours per node) regardless of actual CPU or GPU core utilization.

aprun -n16 -N4 ./a.out

Problem:
Not all cores on the node are used, but the tasks will be placed on the first two compute units of the node’s first NUMA node. With the default layout, the node’s second NUMA node will sit idle and each pair of tasks will share a floating point scheduler.
Suggestion:
Using the -S and -j1 aprun flags, the tasks will be spread out such that both NUMA nodes on each node are used and only one integer core on each compute unit is used. This prevents contention for each compute unit’s floating point scheduler.

aprun -n16 -S2 -j1 ./a.out

Error: Requesting more resources than have been allocated

It is possible to ask aprun to utilize more cores than have been allocated to the batch job. Attempts to over-allocate a batch job’s reserved nodes may result in the following message:

  claim exceeds reservation's CPUs

The following examples result in over-allocation attempts. The examples assume:

  • 16 cores per node
  • 4 nodes allocated to batch job: #PBS -l nodes=4

aprun -n128 ./a.out

Problem:
There are not enough cores allocated to the batch job to fulfill the request: 4 nodes with 16 cores per node provide only 64 cores, while 128 tasks were requested.
Corrections:
Request more nodes:

#PBS -lnodes=8
aprun -n128 ./a.out

Request fewer tasks:

aprun -n64 ./a.out

aprun -n32 -N8 -S2 ./a.out

Problem:
There are enough cores allocated (64) to fulfill the task request (-n32), and there are enough nodes allocated to run 8 tasks per node (-N8 * 4 nodes). However, -S2 requests that aprun run only 2 tasks per NUMA node. Since there are only 2 NUMA nodes per node, only 4 tasks can be placed per node (4 tasks * 4 nodes < 32).
Corrections:
The -N is not needed when -S is used. You could remove the -N flag and increase the number of tasks per NUMA node by increasing -S from 2 to 4:

 aprun -n32 -S4 -j1  ./a.out

You could remove the -N flag and increase the number of nodes allocated to the batch job:

#PBS -lnodes=8
aprun -n32 -S2 ./a.out

For more information on Titan’s aprun options and layout examples, see the Job Execution section of Titan’s user guide.

Working Around Node Failure

With the large number of nodes on Titan, you may occasionally experience node failures. Node failures may surface in the following ways:

  • If a node fails between batch job allocation and the first aprun, you may see the following error:
      claim exceeds reservation's CPUs
    Note: This most often occurs when attempting to run on more resources than were allocated to the batch job. See the requesting more resources than have been allocated section for more information on this message when not related to node failure.
  • If a node fails during an aprun job, the aprun process should terminate.

The following steps may be useful when dealing with node failure:

  1. Request more nodes than are required by aprun.
  2. Add a loop around aprun to check for success (this check is code specific) and re-run the aprun process on the additional allocated nodes upon error.

The following is a pseudo code example:

#PBS -lnodes=(a few more than you need)

while (not successful)

aprun -n (exact number you need) ./a.out

sleep 120

end while

The loop’s purpose is to re-run the aprun process on the extra nodes if the initial aprun does not succeed. Upon completion of aprun, unless the success check determines that aprun completed successfully, aprun will be re-run. If aprun fails due to a node issue, the re-run allows the system to place the tasks on one of the extra nodes instead of the troubled node. This process may allow the job to work through a node issue without exiting the batch system and re-entering the batch queue. Its success depends on how well you can tailor the success test to your code.
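
As a concrete illustration of the pseudo code above, the bash sketch below retries aprun a fixed number of times. The project ID, node counts, executable name, and the success test (here, aprun’s exit status plus a completion marker assumed to be written by the application) are placeholders that must be tailored to your code.

#!/bin/bash
#PBS -A ABC123
#PBS -l nodes=6,walltime=2:00:00   # a few more nodes than the 4 that aprun needs

cd $MEMBERWORK/abc123

success=0
for attempt in 1 2 3; do
    aprun -n 64 ./a.out > run.out 2>&1   # exact number of tasks you need
    rc=$?
    # Code-specific success test: check aprun's exit status and look for a
    # completion marker (placeholder string) written by the application.
    if [ $rc -eq 0 ] && grep -q "RUN COMPLETE" run.out; then
        success=1
        break
    fi
    sleep 120
done

if [ $success -ne 1 ]; then
    echo "aprun did not complete successfully after 3 attempts" >&2
    exit 1
fi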

Working Around Hanging Apruns

This simple script demonstrates how to kill a job that is not making forward progress and start again without going back to the scheduling queue. A job may not be making forward progress for a variety of reasons including hardware and software failures. The script does not help a user in identifying the root cause of why a job is hung. The goal of this script is to help a user do the following things:

  1. Detect that a job is hung (i.e., the job is not making forward progress).
  2. Kill the hung job.
  3. Restart the application without having to go back to the scheduling queue.

As documented in the script, the detection is fairly straightforward: the script watches an output file to which the job is expected to write periodically. If there have been no writes to the file for a certain time period, the script tags the job as hung and takes further action. Two variables must be set appropriately for this step to work correctly: OUTFILE and WINDOW. The OUTFILE variable is the output file that the script watches periodically. The WINDOW variable is the longest time interval (in minutes) after which the script tags the job as hung if there have been no writes to OUTFILE. Currently, the WINDOW variable is set to 120 minutes, but it can be changed as needed.

If a job is detected to be hung, then the script automatically kills the job by obtaining its APID without user intervention.

Finally, the script automatically attempts to restart the job by relaunching aprun with the same parameters. For this to work correctly, the user is advised to allocate a couple more nodes than are used in the aprun command, as illustrated in the script. The number of restart attempts can be changed by adjusting the loop iteration counter ("for i in `seq 1 4`;"). If the user does not allocate a few spare nodes, the application will not restart correctly if there is a hardware problem with one of the allocated nodes.

#!/bin/bash
#PBS -l nodes=2,walltime=0:10:00
#PBS -j oe
#PBS -m e
#PBS -A ABC123

WINDOW=120 # 2 hour window of no activity/progress, while loop checks the file every minute
USER=`whoami`
BINDIR="/lustre/atlas1/abc123/scratch/${USER}/apkill"
cd $BINDIR

for i in `seq 1 4`;
do
    aprun -n 1 ./a.out $i &  # the trailing "&" is essential so that the monitoring code below executes while aprun runs; it watches the temporary output file
    #echo "loop = $i"

#####################################################################################################################
    # Snippet to be moved to application PBS/qsub script:
    # Make sure to set the USER and WINDOW variables the same as above, or as appropriate.
    # Flow: record the current length (number of lines) of the temporary output file and check every minute for updates.
    # If the file is being updated, stay in the while loop; if it has not been updated for a long duration (2 hours), run apkill.
#####################################################################################################################
    OUTFILE="$PBS_JOBID.OU"
    OUTLEN=`wc -l $OUTFILE | awk '{print $1}'`
    #echo "outlen = $OUTLEN"
    TIME=0;
    while true
    do
        sleep 60; # sleep in number of seconds
        OUTLEN_NEW=`wc -l $OUTFILE | awk '{print $1}'`
        #echo "len = $OUTLEN and $OUTLEN_NEW"
        if [ $OUTLEN -eq $OUTLEN_NEW ]; then
            TIME=`expr $TIME + 1`
        else
            TIME=0;
            OUTLEN=$OUTLEN_NEW
        fi

        #echo "len after = $OUTLEN and $OUTLEN_NEW"

        APID=`apstat | grep $USER | tail -n 1 | awk '{print $1}'`
        #echo "apid = $APID"
        if [ -n "$APID" ]; then
            if [ $TIME -gt $WINDOW ]; then
                apkill $APID
                #echo "apkill = $APID"
            fi
        else
            break # break the while loop if there is no APID found
        fi
    done
#####################################################################################################################
    #end of snippet to be moved to application pbs script
#####################################################################################################################

done
wait

Troubleshooting and Common Errors

Below are some error messages that you may encounter. Check back often, as this section will be updated.

Error Message: claim exceeds reservation’s nodes

This may occur if the batch job did not request enough nodes. In this case, request more nodes in the #PBS -lnodes line of your batch script.

In some cases, the error may occur if a troubled node is allocated to the job.

In this case, it may be useful to request more nodes than are required by aprun and to add a loop around aprun that checks for success (this check is code specific) and re-runs aprun on the additional allocated nodes upon error. See the Working Around Node Failure section above for a pseudo-code example and a fuller discussion of this approach.

Error Message: MPICH2 ERROR

Error Message:  MPICH2 ERROR [Rank 65411] [job id 2526230] [Thu May 16 04:17:23 2013] [c18-3c1s6n0] [nid07084] - MPID_nem_gni_check_localCQ(): GNI_CQ_EVENT_TYPE_POST had error (SOURCE_SSID_DREQ:MDD_INV)

Recommendation: Resubmit the job. Possible causes of this error include system issues. If the error recurs, you may have an error in your code.

Error Message: Received node event ec_node_failed

Error Message:  [NID 04228] 2013-05-10 08:08:28 Apid 2509826 killed. Received node event ec_node_failed for nid 4313

Explanation: Sometimes a node will be in an unstable state but the system will still consider it to be up. When a job runs on it and fails, the system software sees that failure and then marks the node down.

Recommendation: Resubmit the job. Possible causes of this error include system issues. If the error recurs, you may have an error in your code.

Error Message: CUDA_ERROR_LAUNCH_FAILED

Error Message:  ACC: craylibs/libcrayacc/acc_hw_nvidia.c:548 CRAY_ACC_ERROR -  cuStreamSynchronize returned CUDA_ERROR_LAUNCH_FAILED[NID 11408] 2013-05-10 01:57:46 Apid 2508898: initiated application termination

Recommendation: Try resubmitting the job. Possible causes of this error include system issues. If the error recurs, you may have race conditions or other errors in your code. Contact user support if you have questions.

Error Message:  CUDA driver error 700

Recommendation: Try resubmitting the job. Possible causes of this error include system issues. If the error recurs, you may have race conditions or other errors in your code. Contact user support if you have questions.

Cray HSN detected criticalerror

[NID ####] 2013-01-01 23:55:00 Apid #######: Cray HSN detected criticalerror 0x40c[ptag 249]. Please contact admin for details. Killing pid ###(@@##)

Recommendation:  This could be a problem with the user code, MPI library, or compiler.

0x40c decodes to (GHAL_ERROR_MASK_FMA:HT_BAD_NP_REQUEST)

This can happen when a code has corrupted memory in some way that leads to loads or stores targeting the FMA window. The error can show up whenever the application has generated an invalid address to load from or store to. Typically this would cause a segmentation fault, since the process would take a page fault and the kernel would see that the virtual address involved in the load/store was invalid, raising SIGSEGV. However, on Gemini there are 3 FMA windows, each 1 GB in size, mapped into the process address space. These windows are very large and often sit near mmaps created by glibc malloc. If the offending address lies within one of these windows, you get these kgni kill-type errors. Note that, because of the way the hardware works, even a store into the window will trigger an error, since the window must be prepared for accepting stores out onto the network, and the routines that do this preparation are not available to end-user applications. If the number of nodes used is on the order of 10, we recommend that the user enable core dumps and try a debug run with:

export MPICH_NEMESIS_NETMOD=tcp

This will tell MPI to use TCP instead of the low-level Gemini interface. Performance will be worse, but the code should segfault and dump core instead of being killed by the Gemini driver.

To enable core dumps, place one of the following commands in your batch script before the aprun call:
ulimit -c unlimited (if you're using sh/ksh/bash)
limit coredumpsize unlimited (if you're using csh/tcsh)

You may want to first recompile your code and add the “-g” option to compile commands. This will enable debugging information and will make it easier to pinpoint the source of the problem.
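
For example, recompiling with debugging information through the Cray wrappers might look like the following (source file names are placeholders):

$ cc -g -O2 my_code.c -o a.out
$ ftn -g -O2 my_code.f90 -o a.out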

The module list command will show you the current versions of MPI and the compilers that you have loaded.

CRAY_CUDA_MPS=1 Segmentation Faults

Issue: When using OpenCL, setting CRAY_CUDA_MPS=1 (turning on the proxy) results in segmentation faults, no matter what is done, until the node is released.

Recommendation: It is a known issue that CRAY_CUDA_MPS=1 is incompatible with OpenCL. Do not use CRAY_CUDA_MPS=1 with OpenCL.

Possible DNS Spoofing Detected on DTNs

Error: You try to log in to dtn.ccs.ornl.gov and receive this message:

[home2][02:11:32][~]$ssh dtn
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: POSSIBLE DNS SPOOFING DETECTED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
The RSA host key for dtn has changed,
and the key for the according IP address 160.91.202.138
is unchanged. This could either mean that
DNS SPOOFING is happening or the IP address for the host
and its host key have changed at the same time.
Offending key for IP in /ccs/home/suzanne/.ssh/known_hosts:97
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
b3:31:ac:44:83:2b:ce:37:cc:23:f4:be:7a:40:83:85.
Please contact your system administrator.
Add correct host key in /ccs/home/user/.ssh/known_hosts to get rid of this message.
Offending key in /ccs/home/user /.ssh/known_hosts:106
RSA host key for dtn has changed and you have requested strict checking.
Host key verification failed.

Reason: We have just changed dtn.ccs.ornl.gov to point to dtn03 and dtn04 rather than dtn01 and dtn02. The ssh client will notice that the key signatures for dtn.ccs.ornl.gov no longer match those that were stored in your /ccs/home/user/.ssh/known_hosts file.

Resolution: You must remove the dtn key signatures from /ccs/home/user/.ssh/known_hosts. From OLCF resources like home.ccs.ornl.gov, you may do this by issuing

ssh-keygen -R dtn

You may also need to do this on your desktop machine if you log directly into the DTNs. If your desktop does not have ssh-keygen, you can manually remove all of the dtn signatures from your desktop’s ~/.ssh/known_hosts with vi or any text editor.