Aprun Tips

The following tips may help diagnose errors and improve job runtime by providing suggested task layouts and explanations of common aprun error messages.


Layout Suggestion: Avoiding Floating-Point Contention
Note: Because the layout of tasks within a node can negatively impact performance, you may receive an aprun warning if the specified layout does not spread the tasks evenly over the node.

An aprun wrapper will parse the given layout options, returning a warning if tasks are not spread evenly over a node’s compute units and/or NUMA nodes. You may see a warning similar to the following if the wrapper detects a possibly non-optimal layout:

APRUN usage: requested less processes than cores (-N 2) without using -j 1 to avoid floating-point unit contention

Each Titan compute node contains one AMD Opteron™ 6274 (Interlagos) CPU. Each CPU contains two dies, each die contains four “bulldozer” compute units, and each compute unit contains two integer cores and a shared floating point scheduler. A node therefore provides 16 integer cores but only 8 floating point schedulers.

By default, aprun will place 16 processes on a node. In this layout, pairs of processes placed on the same compute unit will contend for the compute unit’s floating point scheduler. If your code is floating point intensive, sharing the floating point scheduler may degrade performance. You can override this behavior using the aprun options -j and -S to control process layout.
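
For example, assuming a 4-node allocation with 16 cores per node (as in the examples below), the following sketch contrasts the default layout with a spread-out layout:

aprun -n64 ./a.out
uses every integer core, so pairs of tasks share each compute unit’s floating point scheduler.

aprun -n32 -S4 -j1 ./a.out
places one task on each compute unit, so no two tasks share a floating point scheduler.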

Please note: Batch jobs cannot share nodes; a batch job will be charged for an entire node (30 core-hours per node) regardless of actual CPU or GPU core utilization.

For more information on Titan’s aprun options, node description, and layout examples, see the Job Execution section of Titan’s user guide.

The following examples do not use all cores on a node, yet still share compute units’ floating point schedulers. The examples assume:

  • 16 cores per node
  • 4 nodes allocated to batch job: #PBS -l nodes=4

aprun -n16 -S2 ./a.out
Problem:
Not all cores on the node are used, but the tasks will be placed on the first compute unit of each NUMA node. Taking the default layout, 3 compute units on each NUMA node will sit idle.
Suggestion:
Using the -j1 aprun flag, the job will be spread out such that only one integer core on each compute unit is used. This will prevent contention for each compute unit’s floating point scheduler.

aprun -n16 -S2 -j1 ./a.out
Note: When using the -j flag, a portion of a node’s integer cores will sit idle. Batch jobs cannot share nodes; a batch job will be charged for an entire node (30 core-hours per node) regardless of actual CPU or GPU core utilization.

aprun -n16 -N4 ./a.out
Problem:
Not all cores on the node are used, but the tasks will be placed on the first two compute units of the node’s first NUMA node. Taking the default layout, the node’s second NUMA node will sit idle.

Suggestion:
Using the -S and -j1 aprun flags, the job will be spread out such that both NUMA nodes on a node are used and only one integer core on each compute unit is used. This will prevent contention for each compute unit’s floating point scheduler.

aprun -n16 -S2 -j1 ./a.out


Error: Requesting more resources than have been allocated

It is possible to ask aprun to utilize more cores than have been allocated to the batch job. Attempts to over-allocate a batch job’s reserved nodes may result in the following message:

  claim exceeds reservation's CPUs 

The following examples result in over-allocation attempts. The examples assume:

  • 16 cores per node
  • 4 nodes allocated to batch job: #PBS -l nodes=4

aprun -n128 ./a.out
Problem:
There are not enough cores allocated to the batch job to fulfill the request: 4 nodes with 16 cores per node provide only 64 cores, but 128 tasks were requested.
Corrections:
Request more nodes:

#PBS -lnodes=8
aprun -n128 ./a.out

Request fewer tasks:

aprun -n64 ./a.out

aprun -n32 -N8 -S2 ./a.out
Problem:
There are enough cores allocated (64) to fulfill the task request (-n32). There are also enough nodes allocated to run 8 tasks per node (-N8 * 4 nodes). But -S2 requests that aprun run only 2 tasks per NUMA node. Since there are only 2 NUMA nodes per node, only 4 tasks can be placed per node (4 tasks * 4 nodes = 16 < 32).
Corrections:
The -N flag is not needed when -S is used.

You could remove the -N flag and increase the number of tasks per NUMA node by increasing -S from 2 to 4:

aprun -n32 -S4 -j1 ./a.out

You could remove the -N flag and increase the number of nodes allocated to the batch job:

#PBS -lnodes=8
aprun -n32 -S2 ./a.out

For more information on Titan’s aprun options and layout examples, see the Job Execution section of Titan’s user guide.


Working Around Node Failure

With the large number of nodes on Titan, you may experience node failure on occasion. You may see node failures in the following ways:

  • If a node fails between batch job allocation and the first aprun, you may see the following error:

      claim exceeds reservation's CPUs 
    Note: This most often occurs when attempting to run on more resources than were allocated to the batch job. See the section on requesting more resources than have been allocated, above, for more information on this message when it is not related to node failure.
  • If a node fails during an aprun job, the aprun process should terminate.

The following steps may be useful when dealing with node failure:

  1. Request more nodes than are required by aprun.
  2. Add a loop around aprun to check for success (this check is code specific) and, upon error, re-run the aprun process on the additional nodes.

The following is a pseudo-code example:

#PBS -lnodes=(a few more than you need)

while (not successful)
    aprun -n (exact number you need) ./a.out
    sleep 120
end while

The loop’s purpose is to re-run the aprun process on the extra nodes in the case that the aprun process does not succeed. Upon completion of aprun, unless the success check determines that aprun completed successfully, aprun will be re-run. If aprun fails due to a node issue, re-running it allows the system to place the tasks on one of the extra nodes instead of the troubled node. This process may allow the job to work through a node issue without exiting the batch system and re-entering the batch queue. Its success depends on how well you can tailor the success test to your code.
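
A minimal bash sketch of this pattern follows. The node count, walltime, project account, task count, and the "DONE" success marker are illustrative assumptions; the success test in particular must be tailored to your own code:

#!/bin/bash
#PBS -l nodes=6,walltime=1:00:00   # 6 nodes allocated, but only 4 are needed by the aprun below
#PBS -A ABC123                     # example project account

cd $PBS_O_WORKDIR

attempt=1
success=0
while [ $success -eq 0 ] && [ $attempt -le 4 ]
do
    # Exact number of tasks you need: 64 tasks fit on 4 of the 6 allocated
    # nodes, leaving 2 spare nodes in case of node failure.
    aprun -n 64 ./a.out > run.out 2>&1

    # Success check is code specific; here we assume a.out prints a "DONE"
    # marker to run.out when it finishes correctly.
    if grep -q "DONE" run.out; then
        success=1
    else
        attempt=`expr $attempt + 1`
        sleep 120
    fi
done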


Working Around Hanging Apruns

This simple script demonstrates how to kill a job that is not making forward progress and start it again without going back to the scheduling queue. A job may not be making forward progress for a variety of reasons, including hardware and software failures. The script does not help a user identify the root cause of a hang. The goal of this script is to help a user do the following things:

  1. Detect that a job is hung (i.e., the job is not making forward progress).
  2. Kill the hung job.
  3. Restart the application without having to go back to the scheduling queue.

As documented in the script, the detection is fairly straightforward. It watches an output file to which the job is supposed to write its output periodically. If there have been no writes to the file for a certain time period, the script tags the job as hung and takes further action. There are two key items a user needs to set for this step to work correctly: the OUTFILE and WINDOW variables. The OUTFILE variable corresponds to the output file which this script watches periodically. The WINDOW variable is the longest time interval (in minutes) after which the script tags the job as hung if there is no write to OUTFILE. Currently, the WINDOW variable is set to 120 minutes, but it can be changed as needed.

If a job is detected to be hung, then the script automatically kills the job by obtaining its APID without user intervention.

Finally, the script automatically attempts to restart the job by relaunching aprun with the same parameters. For this to work correctly, the user is advised to allocate a couple more nodes than are used in the aprun command. This is illustrated in the script. The user can change the number of restart attempts by changing the loop iteration counter as desired (“for i in `seq 1 4`;”). If the user does not allocate a few spare nodes, the application will not restart correctly if there is a hardware problem with one of the allocated nodes.

#!/bin/bash
#PBS -l nodes=2,walltime=0:10:00
#PBS -j oe
#PBS -m e
#PBS -A ABC123

WINDOW=120 # 2 hour window of no activity/progress, while loop checks the file every minute
USER=`whoami`                                          # used below to find this job's aprun in apstat output
BINDIR="/lustre/atlas1/abc123/scratch/${USER}/apkill"  # working directory for this example (site-specific path)
cd $BINDIR

# Up to 4 launch attempts; change the seq bound to allow more or fewer restarts.
for i in `seq 1 4`;
do
    aprun -n 1 ./a.out $i &  # the "&" is essential so that the monitoring code below runs while aprun executes
    #echo "loop = $i"

#####################################################################################################################
    # Snippet to be moved to the application PBS/qsub script:
    # Make sure to set the USER and WINDOW variables as above, or as appropriate.
    # Flow: record the status (number of lines) of the output file and check every minute for updates.
    # If the file is being updated, stay in the while loop; if it has not been updated for a long
    # duration (WINDOW minutes, 2 hours by default), apkill the job.
#####################################################################################################################
    OUTFILE="$PBS_JOBID.OU"
    OUTLEN=`wc -l $OUTFILE | awk '{print $1}'`
    #echo "outlen = $OUTLEN"
    TIME=0;
    while true
    do
        sleep 60; # check once per minute (sleep takes seconds)
        OUTLEN_NEW=`wc -l $OUTFILE | awk '{print $1}'`
        #echo "len = $OUTLEN and $OUTLEN_NEW"
        if [ $OUTLEN -eq $OUTLEN_NEW ]; then
            TIME=`expr $TIME + 1`
        else
            TIME=0;
            OUTLEN=$OUTLEN_NEW
        fi

        #echo "len after = $OUTLEN and $OUTLEN_NEW"

        # apstat lists running applications; grab the APID (first column) of this user's most recent aprun
        APID=`apstat | grep $USER | tail -n 1 | awk '{print $1}'`
        #echo "apid = $APID"
        if [ -n "$APID" ]; then
            if [ $TIME -gt $WINDOW ]; then
                apkill $APID
                #echo "apkill = $APID"
            fi
        else
            break # break the while loop if there is no APID found
        fi
    done
#####################################################################################################################
    #end of snippet to be moved to application pbs script
#####################################################################################################################

done
wait
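
The script can be submitted like any other batch script, for example with qsub apkill.pbs (the file name here is arbitrary). The WINDOW, BINDIR, project account, and aprun arguments should be adjusted to match your own job.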