Notice: Rhea and the DTNs will migrate batch schedulers from Moab to Slurm on September 03, 2019. To aid in the transition, a portion of Rhea’s compute nodes and the DTNs’ scheduled nodes were migrated to Slurm on August 13. Usage details can be found below.

In High Performance Computing (HPC), computational work is performed by jobs. Individual jobs produce data that lend relevant insight into grand challenges in science and engineering. As such, the timely, efficient execution of jobs is the primary concern in the operation of any HPC system.

A job on a commodity cluster typically comprises a few different components:

  • A batch submission script.
  • A binary executable.
  • A set of input files for the executable.
  • A set of output files created by the executable.

And the process for running a job, in general, is to:

  1. Prepare executables and input files.
  2. Write a batch script.
  3. Submit the batch script to the batch scheduler.
  4. Optionally monitor the job before and during execution.

The following sections describe in detail how to create, submit, and manage jobs for execution on commodity clusters.


Login vs Compute Nodes on Commodity Clusters

Login Nodes

When you log into an OLCF cluster, you are placed on a login node. Login node resources are shared by all users of the system. Because of this, users should be mindful when performing tasks on a login node.

Login nodes should be used for basic tasks such as file editing, code compilation, data backup, and job submission. Login nodes should not be used for memory- or compute-intensive tasks. Users should also limit the number of simultaneous tasks performed on the login resources. For example, a user should not run (10) simultaneous tar processes on a login node.

Warning: Compute-intensive, memory-intensive, or otherwise disruptive processes running on login nodes may be killed without warning.

Compute Nodes

Rhea contains 521 compute nodes separated into two partitions:

Partition       Node Count  Memory  GPU               CPU
rhea (default)  512         128GB   -                 [2x] Intel® Xeon® E5-2650 @ 2.0 GHz – 8 cores, 16 HT (16 cores, 32 HT total per node)
gpu             9           1TB     [2x] NVIDIA® K80  [2x] Intel® Xeon® E5-2695 @ 2.3 GHz – 14 cores, 28 HT (28 cores, 56 HT total per node)

Rhea Partition

The first 512 nodes make up the rhea partition, where each node contains two 8-core 2.0 GHz Intel Xeon processors with Intel’s Hyper-Threading (HT) Technology and 128GB of main memory. Each CPU in this partition features 8 physical cores, for a total of 16 physical cores per node. With Intel® Hyper-Threading Technology enabled, each node has 32 logical cores capable of executing 32 hardware threads for increased parallelism.

GPU Partition

Rhea also has nine large memory/GPU nodes, which make up the gpu partition. These nodes each have 1TB of main memory and two NVIDIA K80 GPUs in addition to two 14-core 2.30 GHz Intel Xeon processors with HT Technology. Each CPU in this partition features 14 physical cores, for a total of 28 physical cores per node. With Hyper-Threading enabled, these nodes have 56 logical cores that can execute 56 hardware threads for increased parallelism.

Note: To access the gpu partition, batch job submissions should request -p gpu.

Slurm

Rhea and DTN will migrate to the Slurm scheduler in September 2019. The following section provides batch scheduler instructions for Slurm. Below is a comparison table to the schedulers used on other OLCF resources:

Task                                Moab      LSF               Slurm
View batch queue                    showq     jobstat           squeue
Submit batch script                 qsub      bsub              sbatch
Submit interactive batch job        qsub -I   bsub -Is $SHELL   salloc
Run parallel code within batch job  mpirun    jsrun             srun

Writing Batch Scripts

Batch scripts, or job submission scripts, are the mechanism by which a user configures and submits a job for execution. A batch script is simply a shell script that also includes commands to be interpreted by the batch scheduling software (e.g. Slurm).

Batch scripts are submitted to the batch scheduler, where they are parsed for scheduling configuration options. The batch scheduler then places the script in the appropriate queue, where it is designated as a batch job. Once the batch job makes its way through the queue, the script will be executed on the primary compute node of the allocated resources.

Components of a Batch Script

Batch scripts are parsed into the following (3) sections:

Interpreter Line

The first line of a script can be used to specify the script’s interpreter; this line is optional. If not used, the submitter’s default shell will be used. The line uses the hash-bang syntax, i.e., #!/path/to/shell.

Slurm Submission Options

The Slurm submission options are preceded by the string #SBATCH, making them appear as comments to a shell. Slurm will look for SBATCH options in a batch script from the script’s first line through the first non-comment line. A comment line begins with #. #SBATCH options entered after the first non-comment line will not be read by Slurm.
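As an illustration (a hypothetical sketch, not a complete submission script), the final #SBATCH line below is ignored because it follows the first non-comment line:

```shell
#!/bin/bash
#SBATCH -J goodname        # read by Slurm
date                       # first non-comment line; SBATCH parsing stops here
#SBATCH -J ignored         # NOT read by Slurm; treated as an ordinary shell comment
```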

Shell Commands

The shell commands follow the last #SBATCH option and represent the executable content of the batch job. If any #SBATCH lines follow executable statements, they will be treated as comments only.

The execution section of a script will be interpreted by a shell and can contain multiple lines of executables, shell commands, and comments. When the job’s queue wait time is finished, commands within this section will be executed on the primary compute node of the job’s allocated resources. Under normal circumstances, the batch job will exit the queue after the last line of the script is executed.

Example Batch Script

  1: #!/bin/bash
  2: #SBATCH -A XXXYYY
  3: #SBATCH -J test
  4: #SBATCH -N 2
  5: #SBATCH -t 1:00:00
  6:
  7: cd $SLURM_SUBMIT_DIR
  8: date
  9: srun -n 8 ./a.out

This batch script shows examples of the three sections outlined above:

Interpreter Line

1: This line is optional and can be used to specify a shell to interpret the script. In this example, the bash shell will be used.

Slurm Options

2: The job will be charged to the “XXXYYY” project.
3: The job will be named test.
4: The job will request (2) nodes.
5: The job will request (1) hour walltime.

Shell Commands

6: This line is left blank, so it will be ignored.
7: This command will change the current directory to the directory from where the script was submitted.
8: This command will run the date command.
9: This command will run (8) MPI instances of the executable a.out on the compute nodes allocated by the batch system.

Batch scripts can be submitted for execution using the sbatch command. For example, the following will submit the batch script named test.slurm:

  sbatch test.slurm

If successfully submitted, a Slurm job ID will be returned. This ID can be used to track the job. It is also helpful in troubleshooting a failed job; make a note of the job ID for each of your jobs in case you must contact the OLCF User Assistance Center for support.
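For example, a successful submission prints the new job's ID (the ID shown here is illustrative):

```shell
$ sbatch test.slurm
Submitted batch job 102938
```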

Interactive Batch Jobs on Commodity Clusters

Batch scripts are useful when one has a pre-determined group of commands to execute, the results of which can be viewed at a later time. However, it is often necessary to run tasks on compute resources interactively.

Users are not allowed to access cluster compute nodes directly from a login node. Instead, users must use an interactive batch job to allocate and gain access to compute resources. This is done by using the Slurm salloc command. Other Slurm options are passed to salloc on the command line as well:

  $ salloc -A abc123 -p gpu -N 4 -t 1:00:00

This request will:

salloc       Start an interactive session
-A abc123    Charge to the abc123 project
-p gpu       Run in the gpu partition
-N 4         Request (4) nodes…
-t 1:00:00   …for (1) hour

After running this command, the job will wait until enough compute nodes are available, just as any other batch job must. However, once the job starts, the user will be given an interactive prompt on the primary compute node within the allocated resource pool. Commands may then be executed directly (instead of through a batch script).

Using Interactive Jobs to Debug

A common use of interactive batch jobs is to aid in debugging efforts. Interactive access to compute resources allows a user to run a process to the point of failure. Unlike a batch job, the process can then be restarted after brief changes are made without losing the compute resource pool, speeding up the debugging effort.

Choosing a Job Size

Because interactive jobs must sit in the queue until enough resources become available to allocate, it is useful to know when a job can start.

Use the sbatch --test-only command to see when a job of a specific size could be scheduled. For example, the snapshot below shows that a (2) node job would start at 10:54.

  $ sbatch --test-only -N2 -t1:00:00 batch-script.slurm

      sbatch: Job 1375 to start at 2019-08-06T10:54:01 using 64 processors on nodes rhea[499-500] in partition batch

Note: The queue is fluid; the given time is an estimate made from the current queue state and load. Future job submissions and job completions can alter the estimate.

Common Batch Options to Slurm

The following table summarizes frequently-used options to Slurm:

Option          Use                             Description
-A              #SBATCH -A <account>            Causes the job time to be charged to <account>. The account string, e.g. pjt000, is typically composed of three letters followed by three digits and optionally followed by a subproject identifier. The utility showproj can be used to list your valid assigned project ID(s). This option is required by all jobs.
-N              #SBATCH -N <value>              Number of compute nodes to allocate. Jobs cannot request partial nodes.
-t              #SBATCH -t <time>               Maximum wall-clock time. <time> is in the format HH:MM:SS.
-p              #SBATCH -p <partition_name>     Allocates resources on the specified partition.
-o              #SBATCH -o <filename>           Writes standard output to <filename> instead of <job script>.o$SLURM_JOB_ID. $SLURM_JOB_ID is an environment variable created by Slurm that contains the batch job identifier.
-e              #SBATCH -e <filename>           Writes standard error to <filename> instead of <job script>.e$SLURM_JOB_ID.
--mail-type     #SBATCH --mail-type=FAIL        Sends email to the submitter when the job fails.
                #SBATCH --mail-type=BEGIN       Sends email to the submitter when the job begins.
                #SBATCH --mail-type=END         Sends email to the submitter when the job ends.
--mail-user     #SBATCH --mail-user=<address>   Specifies the email address to use for --mail-type options.
-J              #SBATCH -J <name>               Sets the job name to <name> instead of the name of the job script.
--get-user-env  #SBATCH --get-user-env          Exports all environment variables from the submitting shell into the batch job shell.

Note: Because the login nodes differ from the service nodes, using the '--get-user-env' option is not recommended. Users should create the needed environment within the batch job.

Further details and other Slurm options may be found through the sbatch man page.


Batch Environment Variables

Slurm sets multiple environment variables at submission time. The following Slurm variables are useful within batch scripts:

Variable Description
$SLURM_SUBMIT_DIR The directory from which the batch job was submitted. By default, a new job starts in your home directory. You can get back to the directory of job submission with cd $SLURM_SUBMIT_DIR. Note that this is not necessarily the same directory in which the batch script resides.
$SLURM_JOBID The job’s full identifier. A common use for SLURM_JOBID is to append the job’s ID to the standard output and error files.
$SLURM_JOB_NUM_NODES The number of nodes requested.
$SLURM_JOB_NAME The job name supplied by the user.
$SLURM_NODELIST The list of nodes assigned to the job.
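The variables above can be combined in a batch script. The sketch below (the project ID and resource sizes are placeholders) echoes the job's identity from the primary compute node:

```shell
#!/bin/bash
#SBATCH -A XXXYYY
#SBATCH -J env-demo
#SBATCH -N 1
#SBATCH -t 0:10:00

# Return to the directory from which the job was submitted
cd $SLURM_SUBMIT_DIR
echo "Job $SLURM_JOBID ($SLURM_JOB_NAME) is using $SLURM_JOB_NUM_NODES node(s): $SLURM_NODELIST"
```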

Modifying Batch Jobs

The batch scheduler provides a number of utility commands for managing submitted jobs. See each utility's man page for more information.

Removing and Holding Jobs

scancel

Jobs in the queue in any state can be stopped and removed from the queue using the command scancel:

$ scancel 1234

scontrol hold

Jobs in the queue in a non-running state may be placed on hold using the scontrol hold command. Jobs placed on hold will not be removed from the queue, but they will not be eligible for execution.

$ scontrol hold 1234

scontrol release

Once on hold, the job will not be eligible to run until it is released to return to a queued state. The scontrol release command can be used to remove a job from the held state.

$ scontrol release 1234

Monitoring Batch Jobs

Slurm provides multiple tools to view queue, system, and job status. Below are the most common and useful of these tools.

Job Monitoring Commands

squeue

The Slurm utility squeue can be used to view the batch queue.
To see all jobs currently in the queue:

$ squeue -l

To see all of your queued jobs:

$ squeue -l -u $USER

sacct

The Slurm utility sacct can be used to view jobs currently in the queue and those completed days prior. The utility can also be used to see job steps in each batch job:

To see all jobs currently in the queue:

$ sacct -a -X

To see all jobs including steps owned by userA currently in the queue:

$ sacct -u userA

To see all steps submitted to job 123:

$ sacct -j 123

To see all of your jobs that completed on 2019-06-10:

$ sacct -S 2019-06-10T00:00:00 -E 2019-06-10T23:59:59 -o"jobid,user,account%16,cluster,AllocNodes,Submit,Start,End,TimeLimit" -X -P

jobstat

Similar to Summit, the local tool jobstat can be used to view the queue.

$ jobstat
Running    jobs------------------------
ST  JOBID USER  ACCOUNT NODES PARTITION  NAME TIME_LIMIT     START_TIME           TIME_LEFT
R   1671  usrB  abc123  10    batch      jobA 10:00:00       2019-08-13T10:22:18  3:7:40

Pending    jobs------------------------
ST  JOBID USER  ACCOUNT  NODES PARTITION  NAME TIME_LIMIT  SUBMIT_TIME       PRIORITY START_TIME        REASON
PD  1677  usrA  abc123   10    batch      jobB 10:00       2019-08-13T13:43  10101    2019-08-13T17:45  Resources

scontrol show job <jobid>

Provides additional details about the given job.

sview

The sview tool provides a graphical queue monitoring interface. To use it, you will need an X server running on your local system. You will also need to tunnel X traffic through your SSH connection:

local-system> ssh -Y userid@rhea.ccs.ornl.gov
rhea-login> sview

Job Execution

Once resources have been allocated through the batch system, users have the option of running commands on the allocated resources’ primary compute node (a serial job) and/or running an MPI/OpenMP executable across all the resources in the allocated resource pool simultaneously (a parallel job).

Serial Job Execution

The executable portion of batch scripts is interpreted by the shell specified on the first line of the script. If a shell is not specified, the submitting user’s default shell will be used.

The serial portion of the batch script may contain comments, shell commands, executable scripts, and compiled executables. These can be used in combination to, for example, navigate file systems, set up job execution, run serial executables, and even submit other batch jobs.

Parallel Job Execution

Rhea compute node description

The following diagram represents a high-level compute node that will be used below to display layout options.

Note: The Intel cores are numbered in a round-robin fashion: 0 and 16 are on the same physical core.

Using srun

By default, commands will be executed on the job’s primary compute node, sometimes referred to as the job’s head node. The srun command is used to execute an MPI binary on one or more compute nodes in parallel.

srun accepts the following common options:

-N                Minimum number of nodes
-n                Total number of MPI tasks
-c                Cores per MPI task
--cpu-bind=no     Allow code to control thread affinity
--cpu-bind=cores  Bind tasks to cores

Note: If you do not specify the number of MPI tasks to srun via -n, the system will default to one task per node.

MPI Task Layout

Each compute node on Rhea contains two sockets each with 8 cores. Depending on your job, it may be useful to control task layout within and across nodes.

Physical Core Binding

The following will run four copies of a.out, one per CPU, two per node, with physical core binding:
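A command matching this description might look like the following (a sketch built from the srun options listed above; verify the exact flags for your job):

```shell
srun -n4 -N2 -c1 --cpu-bind=cores ./a.out
```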

Hyper Thread Binding

The following will run four copies of a.out, one per hyper-thread, two per node using a round robin task layout between nodes:
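A sketch of such a launch, using --cpu-bind=threads to bind each task to a hardware thread:

```shell
srun -n4 -N2 -c1 --cpu-bind=threads ./a.out
```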

Thread Layout
Thread per Hyper-Thread

The following will run four copies of a.out. Each task will launch two threads. The -c flag will provide room for the threads.
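A sketch of such a launch, assuming an OpenMP executable: OMP_NUM_THREADS requests two threads per task, and -c2 reserves room for them:

```shell
export OMP_NUM_THREADS=2
srun -n4 -N2 -c2 --cpu-bind=threads ./a.out
```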

Warning: If enough resources are not requested via the -c flag, threads may be placed on the same resource.

Multiple Simultaneous Jobsteps

Multiple simultaneous sruns can be executed within a batch job by placing sruns in the background.

#!/bin/bash
#SBATCH -N 2
#SBATCH -t 1:00:00
#SBATCH -A prj123
#SBATCH -J simultaneous-jobsteps

srun -n16 -N2 -c1 --cpu-bind=cores --exclusive ./a.out &
srun -n8 -N2 -c1 --cpu-bind=cores --exclusive ./b.out &
srun -n4 -N1 -c1 --cpu-bind=threads --exclusive ./c.out &
srun -n4 -N1 -c1 --cpu-bind=threads --exclusive ./c.out &
wait
Notice: The wait command must be used in a batch script to prevent the shell from exiting before all backgrounded sruns have completed.

Warning: The --exclusive flag must be used to prevent resource sharing. Without the flag, each backgrounded srun will likely be placed on the same resources.

Batch Queues on Rhea

The compute nodes on Rhea are separated into two partitions (rhea and gpu) and are available through a single batch queue: batch. The scheduling policies for the individual partitions are as follows:

Rhea Partition Policy (default)

Jobs that do not specify a partition will run in the 512 node rhea partition.

Bin  Node Count      Duration   Policy
A    1 – 16 Nodes    0 – 48 hr  max 4 jobs running and 4 jobs eligible per user in bins A, B, and C
B    17 – 64 Nodes   0 – 36 hr
C    65 – 384 Nodes  0 – 3 hr

GPU Partition Policy

To access the 9-node gpu partition, batch job submissions should request -p gpu.

Node Count  Duration    Policy
1-2 Nodes   0 – 48 hrs  max 1 job running per user

The queue structure was designed based on user feedback and analysis of batch jobs over recent years. However, we understand that the structure may not meet the needs of all users. If this structure limits your use of the system, please let us know. We want Rhea to be a useful OLCF resource and will work with you, providing exceptions or even changing the queue structure if necessary.

Users wishing to submit jobs that fall outside the queue structure are encouraged to request a reservation via the Special Request Form.

Allocation Overuse Policy

Projects that overrun their allocation are still allowed to run on OLCF systems, although at a reduced priority. Like the adjustment for the number of processors requested above, this is an adjustment to the apparent submit time of the job. However, this adjustment has the effect of making jobs appear much younger than jobs submitted under projects that have not exceeded their allocation. In addition to the priority change, these jobs are also limited in the amount of wall time that can be used.

For example, consider that job1 is submitted at the same time as job2. The project associated with job1 is over its allocation, while the project for job2 is not. The batch system will consider job2 to have been waiting for a longer time than job1. Also projects that are at 125% of their allocated time will be limited to only one running job at a time. The adjustment to the apparent submit time depends upon the percentage that the project is over its allocation, as shown in the table below:

% Of Allocation Used Priority Reduction number eligible-to-run number running
< 100% 0 days 4 jobs 4 jobs
100% to 125% 30 days 4 jobs 4 jobs
> 125% 365 days 4 jobs 1 job

Job Accounting on Rhea

Jobs on Rhea are scheduled in full node increments; a node's cores cannot be allocated to multiple jobs. Because the OLCF charges based on what a job makes unavailable to other users, a job is charged for an entire node even if it uses only one core on the node. To simplify the process, users are given multiples of entire nodes through Slurm.

Viewing Allocation Utilization

Projects are allocated time on Rhea in units of node-hours. This is separate from a project’s Titan or Eos allocation, and usage of Rhea does not count against that allocation. This page describes how such units are calculated, and how users can access more detailed information on their relevant allocations.

Node-Hour Calculation

The node-hour charge for each batch job will be calculated as follows:

node-hours = nodes requested * ( batch job endtime - batch job starttime )

Where batch job starttime is the time the job moves into a running state, and batch job endtime is the time the job exits a running state.

A batch job’s usage is calculated solely from the requested nodes and the batch job’s start and end times. The number of cores actually used within any particular node is not used in the calculation. For example, if a job requests (6) nodes through the batch script, runs for (1) hour, and uses only (2) CPU cores per node, the job will still be charged for 6 nodes * 1 hour = 6 node-hours.
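The charge for that hypothetical job can be checked with simple shell arithmetic (the numbers mirror the example above; note that core count is intentionally absent from the formula):

```shell
# Hypothetical example: 6 nodes requested, job ran for 1 hour.
# Cores actually used per node do not appear in the formula.
nodes_requested=6
elapsed_hours=1
node_hours=$(( nodes_requested * elapsed_hours ))
echo "charged: ${node_hours} node-hours"
```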

Viewing Usage

Utilization is calculated daily using batch jobs which complete between 00:00 and 23:59 of the previous day. For example, if a job moves into a run state on Tuesday and completes Wednesday, the job’s utilization will be recorded Thursday. Only batch jobs which write an end record are used to calculate utilization. Batch jobs which do not write end records due to system failure or other reasons are not used when calculating utilization.

Each user may view usage for projects on which they are members via the command-line tool showusage and the My OLCF site.

On the Command Line via showusage

The showusage utility can be used to view your usage from January 01 through midnight of the previous day. For example:

  $ showusage
    Usage:
                             Project Totals                      
    Project             Allocation      Usage      Remaining     Usage
    _________________|______________|___________|____________|______________
    abc123           |  20000       |   126.3   |  19873.7   |   1560.80

The -h option will list more usage details.

On the Web via My OLCF

More detailed metrics may be found on each project’s usage section of the My OLCF site.

The following information is available for each project:

  • YTD usage by system, subproject, and project member
  • Monthly usage by system, subproject, and project member
  • YTD usage by job size groupings for each system, subproject, and project member
  • Weekly usage by job size groupings for each system, and subproject
  • Batch system priorities by project and subproject
  • Project members

The My OLCF site is provided to aid in the utilization and management of OLCF allocations. If you have any questions or have a request for additional data, please contact the OLCF User Assistance Center.


Enabling Workflows through Cross-System Batch Submission

The OLCF now supports submitting jobs between OLCF systems via batch scripts. This can be useful for automatically triggering analysis and storage of large data sets after a successful simulation job has ended, or for launching a simulation job automatically once the input deck has been retrieved from HPSS and pre-processed.

Cross-Submission allows jobs on one OLCF resource to submit new jobs to other OLCF resources.

The key to remote job submission is the command qsub -q host script.pbs which will submit the file script.pbs to the batch queue on the specified host. This command can be inserted at the end of an existing batch script in order to automatically trigger work on another OLCF resource. This feature is supported on the following hosts:

Host Remote Submission Command
Rhea qsub -q rhea visualization.pbs
Eos qsub -q eos visualization.pbs
Titan qsub -q titan compute.pbs
Data Transfer Nodes (DTNs) qsub -q dtn retrieve_data.pbs

Example Workflow 1: Automatic Post-Processing

The simplest example of a remote submission workflow would be automatically triggering an analysis task on Rhea at the completion of a compute job on Titan. This workflow would require two batch scripts, one to be submitted on Titan, and a second to be submitted automatically to Rhea. Visually, this workflow may look something like the following:

Post-processing Workflow
The batch scripts for such a workflow could be implemented as follows:

Batch-script-1.pbs

#PBS -l walltime=0:30:00
#PBS -l nodes=4096
#PBS -A PRJ123
#PBS -l gres=atlas1%atlas2

# run compute job on titan
cd $MEMBERWORK/prj123
aprun -n 65536 ./run_simulation.exe

# Submit visualization processing job to Rhea
qsub -q rhea Batch-script-2.pbs

Batch-script-2.pbs

#PBS -l walltime=2:00:00
#PBS -l nodes=10
#PBS -A PRJ123
#PBS -l gres=atlas1%atlas2

# Launch executable
cd $MEMBERWORK/prj123
mpirun -n 10 ./post_process_job.exe

The key to this workflow is the qsub -q rhea Batch-script-2.pbs command, which tells qsub to submit the file Batch-script-2.pbs to the batch queue on Rhea.

Initializing the Workflow

We can initialize this workflow in one of two ways:

  • Log into titan.ccs.ornl.gov and run qsub Batch-script-1.pbs OR
  • From Titan or Rhea, run qsub -q titan Batch-script-1.pbs

Example Workflow 2: Data Staging, Compute, and Archival

Now we give another example of a linear workflow. This example shows how to use the Data Transfer Nodes (DTNs) to retrieve data from HPSS and stage it to your project’s scratch area before beginning. Once the computation is done, we will automatically archive the output.

Post-processing Workflow
Batch-script-1.pbs

#PBS -l walltime=0:30:00
#PBS -l nodes=1
#PBS -A PRJ123
#PBS -l gres=atlas1%atlas2

# Retrieve Data from HPSS
cd $MEMBERWORK/prj123
htar -xf /proj/prj123/input_data.htar input_data/

# Launch compute job
qsub -q titan Batch-script-2.pbs

Batch-script-2.pbs

#PBS -l walltime=6:00:00
#PBS -l nodes=4096
#PBS -A PRJ123
#PBS -l gres=atlas1%atlas2

# Launch executable
cd $MEMBERWORK/prj123
aprun -n 65536 ./analysis-task.exe

# Submit data archival job to DTNs
qsub -q dtn Batch-script-3.pbs

Batch-script-3.pbs

#PBS -l walltime=0:30:00
#PBS -l nodes=1
#PBS -A PRJ123
#PBS -l gres=atlas1%atlas2

# Launch executable
cd $MEMBERWORK/prj123
htar -cf /proj/prj123/viz_output.htar viz_output/
htar -cf /proj/prj123/compute_data.htar compute_data/

Initializing the Workflow

We can initialize this workflow in one of two ways:

  • Log into dtn.ccs.ornl.gov and run qsub Batch-script-1.pbs OR
  • From Titan or Rhea, run qsub -q dtn Batch-script-1.pbs

Example Workflow 3: Data Staging, Compute, Visualization, and Archival

This is an example of a “branching” workflow. What we will do is first use Rhea to prepare a mesh for our simulation on Titan. We will then launch the compute task on Titan, and once this has completed, our workflow will branch into two separate paths: one to archive the simulation output data, and one to visualize it. After the visualizations have finished, we will transfer them to a remote institution.

Post-processing Workflow
Step-1.prepare-data.pbs

#PBS -l walltime=0:30:00
#PBS -l nodes=10
#PBS -A PRJ123
#PBS -l gres=atlas1%atlas2

# Prepare Mesh for Simulation
mpirun -n 160 ./prepare-mesh.exe

# Launch compute job
qsub -q titan Step-2.compute.pbs

Step-2.compute.pbs

#PBS -l walltime=6:00:00
#PBS -l nodes=4096
#PBS -A PRJ123
#PBS -l gres=atlas1%atlas2

# Launch executable
cd $MEMBERWORK/prj123
aprun -n 65536 ./analysis-task.exe

# Workflow branches at this stage, launching 2 separate jobs

# - Launch Archival task on DTNs
qsub -q dtn Step-3.archive-compute-data.pbs

# - Launch Visualization task on Rhea
qsub -q rhea Step-4.visualize-compute-data.pbs

Step-3.archive-compute-data.pbs

#PBS -l walltime=0:30:00
#PBS -l nodes=1
#PBS -A PRJ123
#PBS -l gres=atlas1%atlas2

# Archive compute data in HPSS
cd $MEMBERWORK/prj123
htar -cf /proj/prj123/compute_data.htar compute_data/

Step-4.visualize-compute-data.pbs

#PBS -l walltime=2:00:00
#PBS -l nodes=64
#PBS -A PRJ123
#PBS -l gres=atlas1%atlas2

# Visualize Compute data
cd $MEMBERWORK/prj123
mpirun -n 768 ./visualization-task.py

# Launch transfer task
qsub -q dtn Step-5.transfer-visualizations-to-campus.pbs

Step-5.transfer-visualizations-to-campus.pbs

#PBS -l walltime=2:00:00
#PBS -l nodes=1
#PBS -A PRJ123
#PBS -l gres=atlas1%atlas2

# Transfer visualizations to storage area at home institution
cd $MEMBERWORK/prj123
SOURCE=gsiftp://dtn03.ccs.ornl.gov/$MEMBERWORK/visualization.mpg
DEST=gsiftp://dtn.university-name.edu/userid/visualization.mpg
globus-url-copy -tcp-bs 12M -bs 12M -p 4 $SOURCE $DEST

Initializing the Workflow

We can initialize this workflow in one of two ways:

  • Log into rhea.ccs.ornl.gov and run qsub Step-1.prepare-data.pbs OR
  • From Titan or the DTNs, run qsub -q rhea Step-1.prepare-data.pbs

Checking Job Status

Host Remote qstat Remote showq
Rhea qstat -a @rhea-batch showq --host=rhea-batch
Eos qstat -a @eos-batch showq --host=eos-batch
Titan qstat -a @titan-batch showq --host=titan-batch
Data Transfer Nodes (DTNs) qstat -a @dtn-batch showq --host=dtn-batch

Deleting Remote Jobs

In order to delete a job (say, job number 18688) from a remote queue, use the following:

Host Remote qdel
Rhea qdel 18688@rhea-batch
Eos qdel 18688@eos-batch
Titan qdel 18688@titan-batch
Data Transfer Nodes (DTNs) qdel 18688@dtn-batch

Potential Pitfalls

The OLCF advises users to keep their remote submission workflows simple, short, and mostly linear. Workflows that contain many layers of branches, or that trigger many jobs at once, may prove difficult to maintain and debug. Workflows that contain loops or recursion (jobs that can submit themselves again) may inadvertently waste allocation hours if a suitable exit condition is not reached.

Recursive workflows which do not exit will drain your project’s allocation. Refunds will not be granted. Please be extremely cautious when designing workflows that cause jobs to re-submit themselves.

Circular Workflow

As always, users on multiple projects are strongly advised to double check that the #PBS -A <PROJECTID> field is set to the correct project prior to submission. This will ensure that resource usage is associated with the intended project.