Publishing Notice Prior to publishing work performed on Summit, please contact help@olcf.ornl.gov.

On April 26, the following Summit batch policy changes were made to increase the maximum number of compute nodes available to an individual batch job and to improve batch access to all early Summit projects.

Maximum job size limit increased
The maximum node request has been increased to allow access to all nodes in the partition.
The new limit allows jobs up to 1,024 nodes for 2 hours.

Maximum number of submitted batch jobs
Each user is now limited to 3 simultaneously queued jobs, in any state.
Submissions that would result in more than 3 queued jobs will be rejected at submission time.

Priority by job size
Similar to Titan, a priority boost is now given to jobs based on the number of requested nodes.
Groupings:
0 – 255 : 1 day boost
256 – 511 : 6 day boost
512 – 1024 : 16 day boost

Over-allocation penalties
Over-allocation penalties are now enforced on Summit.
The showusage command line utility can be used to see a project’s usage and allocation. The utility can be executed from Summit’s command line.

Maximum number of running batch jobs
Each user is limited to 1 running job.
Up to 3 jobs may be queued, but only one will run at a time.

Policy exemptions have been removed
Policy exemptions approved prior to April 26 have been removed.
Please contact help@olcf.ornl.gov to request exemptions if needed.
Please keep in mind that the current 1,024 node partition must be shared by all early access projects.

As is the case on other OLCF systems, computational work on Summit is performed within jobs.  A typical job consists of several components:

  • A submission script
  • An executable
  • Input files needed by the executable
  • Output files created by the executable

In general, the process for running a job is to:

  1. Prepare executables and input files
  2. Write the batch script
  3. Submit the batch script
  4. Monitor the job’s progress before and during execution

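The steps above can be sketched end to end. A minimal example, assuming a project named ABC123 and an executable ./a.out (both placeholders for your own):

```shell
# Step 2: write a minimal batch script (project name ABC123 and the
# executable ./a.out are placeholders).
cat > submit.lsf <<'EOF'
#!/bin/bash
#BSUB -P ABC123
#BSUB -W 0:30
#BSUB -nnodes 1
#BSUB -J quick_test
jsrun -n 1 ./a.out
EOF

# Step 3: submit it (run on a Summit login node; shown commented out here)
# bsub submit.lsf

# Step 4: monitor its progress
# bjobs
```

The bsub and bjobs commands only work on Summit itself, which is why they are commented out in the sketch.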
The following sections will provide more information regarding running jobs on Summit.  Summit uses IBM Spectrum Load Sharing Facility (LSF) as the batch scheduling system.

Login, Launch, and Compute Nodes

Recall from the System Overview section that Summit has three types of nodes: login, launch, and compute. When you log into the system, you are placed on a login node. When your batch scripts or interactive jobs run, the resulting shell will run on a launch node. Compute nodes are accessed via the jsrun command. The jsrun command should only be issued from within an LSF job (either batch or interactive) on a launch node. Otherwise, you will not have any compute nodes allocated and your parallel job will run on the login node. If this happens, your job will interfere with (and be interfered with by) other users’ login node tasks.

Batch Scripts

The most common way to interact with the batch system is via batch jobs. A batch job is simply a shell script with added directives to request various resources from or provide certain information to the batch scheduling system. Aside from the lines containing LSF options, the batch script is simply the series of commands needed to set up and run your job.

To submit a batch script, use the bsub command:
bsub myjob.lsf

If you’ve previously used LSF, you’re probably used to submitting a job with input redirection (i.e. bsub < myjob.lsf).
This is not needed (and will not work) on Summit.

As an example, consider the following batch script:

1.	#!/bin/bash
2.	# Begin LSF Directives
3.	#BSUB -P ABC123
4.	#BSUB -W 3:00
5.	#BSUB -nnodes 2048
6.	#BSUB -alloc_flags gpumps
7.	#BSUB -J RunSim123
8.	#BSUB -o RunSim123.%J
9.	#BSUB -e RunSim123.%J
10.	 
11.	cd $MEMBERWORK/abc123
12.	cp $PROJWORK/abc123/RunData/Input.123 ./Input.123
13.	date
14.	jsrun -n 4096 -r 2 -a 12 -g 3 ./a.out
15.	cp my_output_file /ccs/proj/abc123/Output.123
Line # Option Description
1 Shell specification. This script will run with bash as the shell
2 Comment line
3 Required This job will charge to the ABC123 project
4 Required Maximum walltime for the job is 3 hours
5 Required The job will use 2,048 nodes
6 Optional Enable GPU Multi-Process Service
7 Optional The name of the job is RunSim123
8 Optional Write standard output to a file named RunSim123.#, where # is the job ID assigned by LSF
9 Optional Write standard error to a file named RunSim123.#, where # is the job ID assigned by LSF
10 Blank line
11 Change into one of the scratch filesystems
12 Copy input files into place
13 Run the date command to write a timestamp to the standard output file
14 Run the executable
15 Copy output files from the scratch area into a more permanent location

Interactive Jobs

Most users will find batch jobs to be the easiest way to interact with the system, since they permit you to hand off a job to the scheduler and then work on other tasks; however, it is sometimes preferable to run interactively on the system. This is especially true when developing, modifying, or debugging a code.

Since all compute resources are managed/scheduled by LSF, it is not possible to simply log into the system and begin running a parallel code interactively. You must request the appropriate resources from the system and, if necessary, wait until they are available. This is done with an “interactive batch” job. Interactive batch jobs are submitted via the command line, which supports the same options that are passed via #BSUB parameters in a batch script. The final options on the command line are what makes the job “interactive batch”: -Is followed by a shell name. For example, to request an interactive batch job (with bash as the shell) equivalent to the sample batch script above, you would use the command:
bsub -W 3:00 -nnodes 2048 -P ABC123 -Is /bin/bash

Common bsub Options

The table below summarizes options for submitted jobs. Unless otherwise noted, these can be used from batch scripts or interactive jobs. For interactive jobs, the options are simply added to the bsub command line. For batch scripts, they can either be added on the bsub command line or they can appear as a #BSUB directive in the batch script. If conflicting options are specified (e.g. different walltimes specified on the command line and in the script), the option on the command line takes precedence. Note that LSF has numerous options; only the most common ones are described here. For more in-depth information about other LSF options, see the documentation.

Option Example Usage Description
-W #BSUB -W 50 Requested maximum walltime.

NOTE: The format is [hours:]minutes, not [[hours:]minutes:]seconds like PBS/Torque/Moab

-nnodes #BSUB -nnodes 1024 Number of nodes

NOTE: This is specified with only one hyphen (i.e. -nnodes, not --nnodes)

-P #BSUB -P ABC123 Specifies the project to which the job should be charged
-o #BSUB -o jobout.%J File into which job STDOUT should be directed (%J will be replaced with the job ID number)

If you do not also specify a STDERR file with -e or -eo, STDERR will also be written to this file.

-e #BSUB -e jobout.%J File into which job STDERR should be directed (%J will be replaced with the job ID number)
-J #BSUB -J MyRun123 Specifies the name of the job (if not present, LSF will use the name of the job script as the job’s name)
-w #BSUB -w ended() Place a dependency on the job
-N #BSUB -N Send a job report via email when the job completes
-XF #BSUB -XF Use X11 forwarding
-alloc_flags #BSUB -alloc_flags gpumps
#BSUB -alloc_flags smt4
Used to request GPU Multi-Process Service (MPS) and to set SMT (Simultaneous Multithreading) levels.

Setting gpumps enables NVIDIA’s Multi-Process Service, which allows multiple MPI ranks to simultaneously access a GPU.

Setting smtn (where n is 1, 2, or 4) sets different SMT levels. To run with 4 hardware threads per physical core, you’d use smt4.
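
Because the -W format differs from the PBS/Torque/Moab one, a small helper can make the conversion explicit. A sketch (pbs_to_lsf_walltime is a hypothetical name, not an LSF utility; it assumes an H:MM:SS input and simply drops the seconds field):

```shell
# Hypothetical helper: convert a PBS-style H:MM:SS walltime into LSF's
# [hours:]minutes form by stripping the trailing seconds field.
pbs_to_lsf_walltime() {
  printf '%s\n' "${1%:*}"
}

pbs_to_lsf_walltime 1:00:00   # prints 1:00
pbs_to_lsf_walltime 12:30:00  # prints 12:30
```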

Batch Environment Variables

LSF provides a number of environment variables in your job’s shell environment. Many job parameters are stored in environment variables and can be queried within the batch job. Several of these variables are summarized in the table below. This is not an all-inclusive list of variables available to your batch job; in particular only LSF variables are discussed, not the many “standard” environment variables that will be available (such as $PATH).

Variable Description
LSB_JOBID The ID assigned to the job by LSF
LS_JOBPID The job’s process ID
LSB_JOBINDEX The job’s index (if it belongs to a job array)
LSB_HOSTS The hosts assigned to run the job
LSB_QUEUE The queue from which the job was dispatched
LSB_INTERACTIVE Set to “Y” for an interactive job; otherwise unset
LS_SUBCWD The directory from which the job was submitted
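
As an illustration of how these variables might be used, a job script could keep each run’s files in a directory named after the job ID. A sketch (the fallback values are only so the snippet runs outside a real job, where LSF sets the variables itself):

```shell
# Inside a real batch job LSF sets these automatically; fallback values
# are provided here only so the snippet can run outside a job.
: "${LSB_JOBID:=12345}"
: "${LS_SUBCWD:=$PWD}"

# Keep each job's files in their own directory, named by job ID.
outdir="$LS_SUBCWD/run.$LSB_JOBID"
mkdir -p "$outdir"
echo "job $LSB_JOBID using $outdir"
```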

Job States

A job will progress through a number of states through its lifetime. The states you’re most likely to see are:

State Description
PEND Job is pending
RUN Job is running
DONE Job completed normally (with an exit code of 0)
EXIT Job completed abnormally
PSUSP Job was suspended (either by the user or an administrator) while pending
USUSP Job was suspended (either by the user or an administrator) after starting
SSUSP Job was suspended by the system after starting

Scheduling Policy

To allow development opportunity to all early access users, the following queue policies will be enforced.

Running Jobs Eligible Jobs Max Nodes Max Walltime Max Queued
1 1 1024 2 hours 3 (any state)

Job Dependencies

As is the case with many other queuing systems, it is possible to place dependencies on jobs to prevent them from running until other jobs have started/completed/etc. Several possible dependency settings are described in the table below:

Expression Meaning
#BSUB -w started(12345) The job will not start until job 12345 starts. Job 12345 is considered to have started if it is in any of the following states: USUSP, SSUSP, DONE, EXIT, or RUN (with any pre-execution command specified by bsub -E completed)
#BSUB -w done(12345)
#BSUB -w 12345
The job will not start until job 12345 has a state of DONE (i.e. completed normally). If a job ID is given with no condition, done() is assumed.
#BSUB -w exit(12345) The job will not start until job 12345 has a state of EXIT (i.e. completed abnormally)
#BSUB -w ended(12345) The job will not start until job 12345 has a state of EXIT or DONE

Dependency expressions can be combined with logical operators. For example, if you want a job held until job 12345 is DONE and job 12346 has started, you can use #BSUB -w "done(12345) && started(12346)"
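
One common pattern is to capture the job ID that bsub prints and feed it to -w when submitting the next job. A sketch, assuming bsub’s usual "Job <id> is submitted..." output line (the exact wording can vary, so adjust the parsing if needed):

```shell
# Sketch: chain two jobs so the second waits for the first to finish
# normally. On Summit this would look like:
#
#   jobid=$(bsub step1.lsf | sed -n 's/.*<\([0-9][0-9]*\)>.*/\1/p')
#   bsub -w "done($jobid)" step2.lsf
#
# The parsing step, demonstrated on a sample output line:
sample='Job <12345> is submitted to default queue <batch>.'
jobid=$(printf '%s\n' "$sample" | sed -n 's/.*<\([0-9][0-9]*\)>.*/\1/p')
echo "$jobid"   # prints 12345
```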

Monitoring Jobs

LSF provides several utilities with which you can monitor jobs. These include monitoring the queue, getting details about a particular job, viewing STDOUT/STDERR of running jobs, and more.

The most straightforward monitoring is with the bjobs command. This command will show the current queue, including both pending and running jobs. Running bjobs -l will provide much more detail about a job (or group of jobs). For detailed output of a single job, specify the job ID after the -l. For example, for detailed output of job 12345, you can run bjobs -l 12345. Other options to bjobs are shown below. In general, if the command is specified with -u all it will show information for all users/all jobs. Without that option, it only shows your jobs. Note that this is not an exhaustive list. See man bjobs for more information.

Command Description
bjobs Show your current jobs in the queue
bjobs -u all Show currently queued jobs for all users
bjobs -P ABC123 Show currently queued jobs for project ABC123
bjobs -UF Don’t format output (might be useful if you’re using the output in a script)
bjobs -a Show jobs in all states, including recently finished jobs
bjobs -l Show long/detailed output
bjobs -l 12345 Show long/detailed output for job 12345
bjobs -d Show details for recently completed jobs
bjobs -s Show suspended jobs, including the reason(s) they’re suspended
bjobs -r Show running jobs
bjobs -p Show pending jobs
bjobs -w Use “wide” formatting for output

If you want to check the STDOUT/STDERR of a currently running job, you can do so with the bpeek command. The command supports several options:

Command Description
bpeek -J jobname Show STDOUT/STDERR for the job you’ve most recently submitted with the name jobname
bpeek 12345 Show STDOUT/STDERR for job 12345
bpeek -f ... Used with other options. Makes bpeek use tail -f and exit once the job completes.

Interacting With Jobs

Sometimes it’s necessary to interact with a batch job after it has been submitted. LSF provides several commands for interacting with already-submitted jobs.

Many of these commands can operate on either one job or a group of jobs. In general, they only operate on the most recently submitted job that matches other criteria provided unless “0” is specified as the job id.

Modifying Job Submission Options

For pending jobs, LSF permits users to modify any option that can be specified at job submit time. This includes not only options you originally specified (for example, changing the walltime value specified via -W) but also options that were not originally submitted. Thus, if you forgot to specify a STDOUT/STDERR file in your original job submission, you may do so via bmod.

The modifications permitted on jobs that have already started running are much more limited. You can specify a new -R value, although there are some restrictions on what can be set. You can also modify the -We, -We+, and -Wep flags, although most users will not use those and will instead only use the -W flag. LSF documentation notes several other options that can be modified if the site configuration file sets LSB_MOD_ALL_JOBS=y; OLCF does not set that flag, so the associated parameters cannot be modified.

If you are using expert mode (i.e. bsub -csm y), you can modify the resource request (-R) with bmod as noted above. Any such modification will completely replace the existing resource request rather than appending to it. This is consistent with the behavior of other options modified via bmod (i.e. the original values are completely replaced).

LSF also allows you to easily undo a bmod command. If you’d like to restore a parameter that you modified to its initial value, you don’t have to remember the old value. You can simply run the bmod command with a lowercase “n” appended to the option that you’d like to reset. For example, if you incorrectly modified the project parameter (i.e. -P), you can set it back to its original value by running bmod -Pn.

LSF can show you a history of modifications to a job. Both the bjobs and bhist commands support a -l option to show detailed job history information. For example, to see detailed information about job 12345, run the command bjobs -l 12345.

For a full description of the bmod command and its options, see the documentation sources listed in the ‘For More Information’ section below.

Suspending and Resuming Jobs

LSF supports user-level suspension and resumption of jobs. Jobs are suspended with the bstop command and resumed with the bresume command. The simplest way to invoke these commands is to list the job id to be suspended/resumed:

bstop 12345
bresume 12345

Instead of specifying a job id, you can specify other criteria that will allow you to suspend some/all jobs that meet other criteria such as a job name, a queue name, etc. These are described in the manpages for bstop and bresume.

Signaling Jobs

You can send signals to jobs with the bkill command. While the command name suggests its only purpose is to terminate jobs, this is not the case. Similar to the kill command found in Unix-like operating systems, this command can be used to send various signals (not just SIGTERM and SIGKILL) to jobs. The command can accept both numbers and names for signals. For a list of accepted signal names, run bkill -l. Common ways to invoke the command include:

Command Description
bkill 12345 Force a job to stop by sending SIGINT, SIGTERM, and SIGKILL. These signals are sent in that order, so users can write applications such that they will trap SIGINT and/or SIGTERM and exit in a controlled manner.
bkill -s USR1 12345 Send SIGUSR1 to job 12345

NOTE: When specifying a signal by name, omit SIG from the name. Thus, you specify USR1 and not SIGUSR1 on the bkill command line.

bkill -s 9 12345 Send signal 9 to job 12345

Like bstop and bresume, the bkill command supports identifying the job(s) to be signaled by criteria other than the job ID. These include some/all jobs with a given name, in a particular queue, etc. See man bkill for more information.

Checkpointing Jobs

LSF documentation mentions the bchkpnt and brestart commands for checkpointing and restarting jobs, as well as the -k option to bsub for configuring checkpointing. Since checkpointing is very application specific and a wide range of applications run on OLCF resources, this type of checkpointing is not configured on Summit. If you wish to use checkpointing (which is highly encouraged), you’ll need to configure it within your application.

If you wish to implement some form of on-demand checkpointing, keep in mind the bkill command is really a signaling command and you can have your job script/application checkpoint as a response to certain signals (such as SIGUSR1).
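
A minimal sketch of that pattern, where the "checkpoint" is just a marker file written by a signal handler (a real application would save restart data instead):

```shell
#!/bin/bash
# Sketch: checkpoint in response to SIGUSR1, which an operator can send
# with `bkill -s USR1 <jobid>`. The handler just appends to a marker
# file; a real code would write restart state here.
checkpoint() {
  echo "checkpoint at step $step" >> checkpoint.log
}
trap checkpoint USR1

for step in 1 2 3; do
  : # stand-in for one unit of real work
done

# For demonstration only: deliver the signal to this script itself.
kill -USR1 $$
: # the trap handler runs at the next command boundary
```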

Other LSF Commands

The table below summarizes some additional LSF commands that might be useful.

Command Description
bparams -a Show current parameters for LSF

The behavior/available options for some LSF commands depend on settings in various configuration files. This command shows those settings without having to search for the actual files.

bjdepinfo Show job dependency information (could be useful in determining what job is keeping another job in a pending state)

PBS/Torque/MOAB-to-LSF Translation

More details about these commands are given elsewhere in this section; the table below is simply for your convenience in looking up various LSF commands.

Users of other OLCF resources are likely familiar with PBS-like commands which are used by the Torque/Moab instances on other systems. The table below summarizes the equivalent LSF command for various PBS/Torque/Moab commands.

LSF Command PBS/Torque/Moab Command Description
bsub job.sh qsub job.sh Submit the job script job.sh to the batch system
bsub -Is /bin/bash qsub -I Submit an interactive batch job
bjobs -u all qstat
showq
Show jobs currently in the queue

NOTE: without the -u all argument, bjobs will only show your jobs

bjobs -l checkjob Get information about a specific job
bjobs -d showq -c Get information about completed jobs
bjobs -p showq -i
showq -b
checkjob
Get information about pending jobs
bjobs -r showq -r Get information about running jobs
bkill qsig Send a signal to a job
bkill qdel Terminate/Kill a job
bstop qhold Hold a job/stop a job from running
bresume qrls Release a held job
bmod[ify] qalter Modify job submission parameters
bqueues qstat -q Get information about queues
bjdepinfo checkjob Get information about job dependencies

The table below shows LSF (bsub) command-line/batch script options and the PBS/Torque/Moab (qsub) options that provide similar functionality.

LSF Option PBS/Torque/Moab Option Description
#BSUB -W 60 #PBS -l walltime=1:00:00 Request a walltime of 1 hour
#BSUB -nnodes 1024 #PBS -l nodes=1024 Request 1024 nodes
#BSUB -P ABC123 #PBS -A ABC123 Charge the job to project ABC123
#BSUB -alloc_flags gpumps No equivalent (set via environment variable) Enable multiple MPI tasks to simultaneously access a GPU

Easy Mode vs. Expert Mode

The Cluster System Management (CSM) component of the job launch environment supports two methods of job submission, termed “easy” mode and “expert” mode. The difference in the modes is where the responsibility for creating the LSF resource string is placed. In easy mode, the system software converts options such as -nnodes in a batch script into the resource string needed by the scheduling system. In expert mode, the user is responsible for creating this string and options such as -nnodes cannot be used.

In easy mode, you will not be able to use bsub -R to create resource strings. The system will automatically create the resource string based on your other bsub options. In expert mode, you will be able to use -R, but you will not be able to use the following options to bsub: -ln_slots, -ln_mem, -cn_cu, or -nnodes.

Most users will want to use easy mode. However, if you need precise control over your job’s resources, such as placement on (or avoidance of) specific nodes, you will need to use expert mode. To use expert mode, add #BSUB -csm y to your batch script (or -csm y to your bsub command line).

Hardware Threads

Hardware threads are a feature of the POWER9 processor through which individual physical cores can support multiple execution streams, essentially looking like one or more virtual cores (similar to hyperthreading on some Intel® microprocessors). This feature is often called Simultaneous Multithreading or SMT. The POWER9 processor on Summit supports SMT levels of 1, 2, or 4, meaning (respectively) each physical core looks like 1, 2, or 4 virtual cores. The SMT level is controlled by the -alloc_flags option to bsub. For example, to set the SMT level to 4, add the line
#BSUB -alloc_flags smt4 to your batch script, or add the option -alloc_flags smt4 to your bsub command line.

The default SMT level is 1.

Other Notes

Compute nodes are only allocated to one job at a time; they are not shared. This is why users request nodes (instead of some other resource such as cores or GPUs) in batch jobs and is why projects are charged based on the number of nodes allocated multiplied by the amount of time for which they were allocated. Thus, a job using only 1 core on each of its nodes is charged the same as a job using every core and every GPU on each of its nodes.
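
For example, the 2,048-node, 3-hour job from the sample batch script above would be charged the same number of node-hours whether it used one core per node or every core and GPU:

```shell
# Charges accrue as nodes x wallclock hours, independent of how many
# cores or GPUs each node actually uses.
nodes=2048
hours=3
echo "$(( nodes * hours )) node-hours charged"   # 6144 node-hours charged
```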




Job Launcher (jsrun)

The default job launcher for Summit is jsrun. jsrun was developed by IBM for the Oak Ridge and Livermore Power systems. The tool will execute a given program on resources allocated through the LSF batch scheduler, providing functionality similar to mpirun and aprun.

jsrun Format

  jsrun    [ -n #resource sets ]   [tasks, threads, and GPUs within each resource set]   program [ program args ] 

Compute Node Description

The following node image will be used to describe jsrun resource sets and layout.

  • 1 node
  • 2 sockets (grey)
  • 42 physical cores* (dark blue)
  • 168 hardware threads (light blue)
  • 6 GPUs (orange)
  • 2 Memory blocks (yellow)

*Core Isolation: 1 core on each socket has been set aside for overhead and is not available for allocation through jsrun. These cores have been omitted and are not shown in the above image.

Resource Sets

While jsrun performs job launching functions similar to aprun and mpirun, its syntax is very different. A large reason for the difference is the introduction of the resource set concept. Through resource sets, jsrun can control how a node appears to each job: users can, through jsrun command line flags, control which resources on a node are visible to a job. Resource sets also make it possible to run multiple jsrun invocations simultaneously within a node. Under the covers, a resource set is a cgroup.

At a high level, a resource set allows users to configure what a node looks like to their job.

jsrun will create one or more resource sets within a node. Each resource set will contain 1 or more cores and 0 or more GPUs. A resource set can span sockets within a node, but it cannot span nodes. When a resource set does span sockets, consideration should be given to the cost of cross-socket communication; creating resource sets that stay within a single socket avoids it.

Subdividing a Node with Resource Sets

Resource sets provide the ability to subdivide a node’s resources into smaller groups. The following examples show how a node can be subdivided and how many resource sets could fit on a node.

Multiple Methods to Creating Resource Sets

Resource sets should be created to fit code requirements. The following examples show multiple ways to create resource sets that allow two MPI tasks access to a single GPU.

  1. 6 resource sets per node: 1 GPU and 2 cores each (similar to a Titan node)

    In this case, each pair of cores can only see its single assigned GPU.
  2. 2 resource sets per node: 3 GPUs and 6 cores each (one per socket)

    In this case, all 6 cores in a set can see 3 GPUs. The code must manage CPU-to-GPU communication. Cores on socket0 cannot access GPUs or memory on socket1.
  3. Single resource set per node: 6 GPUs and 12 cores

    In this case, all 12 cores can see all 6 of the node’s GPUs. The code must manage CPU-to-GPU communication, and because cores on socket0 can access GPUs and memory on socket1, it must also manage cross-socket communication.

Configuring a Resource Set

Resource sets allow each jsrun to control how the node appears to a code. This method is unique to jsrun, and requires thinking of each job launch differently than aprun or mpirun. While the method is unique, it is not complicated and can be reasoned through in a few basic steps.

The first step in creating resource sets is understanding how a code would like the node to appear; for example, the number of tasks/threads per GPU. Once this is understood, the next step is simply to calculate the number of resource sets that can fit on a node. From there, the number of nodes needed can be calculated and passed to the batch job request.

The basic steps to creating resource sets:

1) Understand how your code expects to interact with the system.
How many tasks/threads per GPU?
Does each task expect to see a single GPU? Do multiple tasks expect to share a GPU? Is the code written to internally manage task to GPU workload based on the number of available cores and GPUs?
2) Create resource sets containing the needed GPU to task binding
Based on how your code expects to interact with the system, you can create resource sets containing the needed GPU and core resources.
If a code expects to utilize one GPU per task, a resource set would contain one core and one GPU. If a code expects to pass work to a single GPU from two tasks, a resource set would contain two cores and one GPU.
3) Decide on the number of resource sets needed
Once you understand tasks, threads, and GPUs in a resource set, you simply need to decide the number of resource sets needed.

As on Titan, it is useful to keep the general layout of a node in mind when laying out resource sets.
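
The arithmetic in these steps is simple enough to script. A sketch using illustrative numbers for the two-tasks-per-GPU layout discussed above (all values are examples, not requirements):

```shell
# Illustrative numbers: 2 MPI tasks share each GPU, so each resource set
# holds 2 tasks, 2 cores, and 1 GPU; a Summit node has 6 GPUs, allowing
# 6 such resource sets per node.
total_tasks=240
tasks_per_rs=2
rs_per_node=6

rs_needed=$(( total_tasks / tasks_per_rs ))
nodes=$(( (rs_needed + rs_per_node - 1) / rs_per_node ))   # round up

echo "resource sets: $rs_needed"   # 120
echo "nodes needed:  $nodes"       # 20

# Corresponding (illustrative) request:
#   #BSUB -nnodes 20
#   jsrun -n120 -r6 -a2 -c2 -g1 ./a.out
```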

Launching a Job with jsrun

Common jsrun Options

Below are common jsrun options. More flags and details can be found in the jsrun man page.

Flags Description
Long Short
--nrs -n Number of resource sets
--tasks_per_rs -a Number of MPI tasks (ranks) per resource set
--cpu_per_rs -c Number of CPUs (cores) per resource set.
--gpu_per_rs -g Number of GPUs per resource set
--bind -b Binding of tasks within a resource set. Can be none, rs, or packed:#
--rs_per_host -r Number of resource sets per host
--latency_priority -l Latency Priority. Controls layout priorities. Can currently be cpu-cpu or gpu-cpu
--launch_distribution -d How tasks are started on resource sets

Aprun to Jsrun

Mapping aprun commands used on Titan to Summit’s jsrun is only possible in simple single GPU cases. The following table shows some basic single GPU examples that could be executed on Titan or Summit. In the single node examples, each resource set will resemble a Titan node containing a single GPU and one or more cores. Although not required in each case, common jsrun flags (resource set count, GPUs per resource set, tasks per resource set, cores per resource set, binding) are included in each example for reference. The jsrun -n flag can be used to increase the number of resource sets needed. Multiple resource sets can be created on a single node. If each MPI task requires a single GPU, up to 6 resource sets could be created on a single node.

GPUs per Task MPI Tasks Threads per Task aprun jsrun
1 1 0 aprun -n1 jsrun -n1 -g1 -a1 -c1
1 2 0 aprun -n2 jsrun -n1 -g1 -a2 -c1
1 1 4 aprun -n1 -d4 jsrun -n1 -g1 -a1 -c4 -bpacked:4
1 2 8 aprun -n2 -d8 jsrun -n1 -g1 -a2 -c16 -bpacked:8


The following example images show how a single-GPU/single-task job would be placed on a single Titan node and a single Summit node. On Summit, the red box represents a resource set created by jsrun. The resource set looks similar to a Titan node, containing a single GPU, a single core, and memory.

Titan node: aprun -n1
Summit node: jsrun -n1 -g1 -a1 -c1

Because Summit’s nodes are much larger than Titan’s, 6 single-GPU resource sets can be created on a single Summit node. The following image shows how six single-GPU, single-task resource sets would be placed on a node by default. In the example, the command jsrun -n6 -g1 -a1 -c1 is used to create six single-GPU resource sets on the node, each indicated by a different color. Notice that the -n flag is the only change from the single resource set example above; -n6 tells jsrun to create six resource sets.

    • jsrun -n 6 -g 1 -a 1 -c 1
    • Starts 6 resource sets, each indicated by differing colors
    • Each resource contains 1 GPU, 1 Core, and memory
    • The red resource set contains GPU 0 and Core 0
    • The purple resource set contains GPU 3 and Core 84
    • -n 6 tells jsrun how many resource sets to create
    • In this example, each resource set is similar to a single Titan node

 

jsrun Resource Set Reference Table

The following table provides a quick reference for creating resource sets of various common use cases. The -n flag can be altered to specify the number of resource sets needed.

Resource Sets MPI Tasks Threads Physical Cores GPUs jsrun Command
1 16 0 16 0 jsrun -n1 -a16 -c16 -g0
1 1 0 1 1 jsrun -n1 -a1 -c1 -g1
1 2 0 2 1 jsrun -n1 -a2 -c2 -g1
1 1 0 1 2 jsrun -n1 -a1 -c1 -g2
1 1 16 16 3 jsrun -n1 -a1 -c16 -g3 -bpacked:16

 

jsrun Examples

The below examples were launched in the following 2 node interactive batch job:

summit> bsub -nnodes 2 -Pprj123 -W02:00 -Is $SHELL

Single MPI Task, single GPU per RS

The following example will create 12 resource sets each with 1 MPI task and 1 GPU. Each MPI task will have access to a single GPU.

Rank 0 will have access to GPU 0 on the first node (red resource set). Rank 1 will have access to GPU 1 on the first node (green resource set). This pattern will continue until 12 resource sets have been created.

The following jsrun command will request 12 resource sets (-n12), 6 per node (-r6). Each resource set will contain 1 MPI task (-a1), 1 GPU (-g1), and 1 core (-c1).

summit> jsrun -n12 -r6 -a1 -g1 -c1 ./a.out
Rank:    0; NumRanks: 12; RankCore:   0; Hostname: h41n04; GPU: 0
Rank:    1; NumRanks: 12; RankCore:   4; Hostname: h41n04; GPU: 1
Rank:    2; NumRanks: 12; RankCore:   8; Hostname: h41n04; GPU: 2
Rank:    3; NumRanks: 12; RankCore:  84; Hostname: h41n04; GPU: 3
Rank:    4; NumRanks: 12; RankCore:  89; Hostname: h41n04; GPU: 4
Rank:    5; NumRanks: 12; RankCore:  92; Hostname: h41n04; GPU: 5

Rank:    6; NumRanks: 12; RankCore:   0; Hostname: h41n03; GPU: 0
Rank:    7; NumRanks: 12; RankCore:   4; Hostname: h41n03; GPU: 1
Rank:    8; NumRanks: 12; RankCore:   8; Hostname: h41n03; GPU: 2
Rank:    9; NumRanks: 12; RankCore:  84; Hostname: h41n03; GPU: 3
Rank:   10; NumRanks: 12; RankCore:  89; Hostname: h41n03; GPU: 4
Rank:   11; NumRanks: 12; RankCore:  92; Hostname: h41n03; GPU: 5

Multiple Tasks, Single GPU per RS

The following jsrun command will request 12 resource sets (-n12). Each resource set will contain 2 MPI tasks (-a2), 1 GPU (-g1), and 2 cores (-c2). Each pair of MPI tasks will share a single GPU.

Ranks 0 – 1 will have access to GPU 0 on the first node (red resource set). Ranks 2 – 3 will have access to GPU 1 on the first node (green resource set). This pattern will continue until 12 resource sets have been created.

Adding cores to the RS: The -c flag should be used to request the cores needed for tasks and threads. The default -c core count is 1. In the above example, if -c is not specified, both tasks will run on a single core.

summit> jsrun -n12 -a2 -g1 -c2 -dpacked ./a.out | sort
Rank:    0; NumRanks: 24; RankCore:   0; Hostname: a01n05; GPU: 0
Rank:    1; NumRanks: 24; RankCore:   4; Hostname: a01n05; GPU: 0

Rank:    2; NumRanks: 24; RankCore:   8; Hostname: a01n05; GPU: 1
Rank:    3; NumRanks: 24; RankCore:  12; Hostname: a01n05; GPU: 1

Rank:    4; NumRanks: 24; RankCore:  16; Hostname: a01n05; GPU: 2
Rank:    5; NumRanks: 24; RankCore:  20; Hostname: a01n05; GPU: 2

Rank:    6; NumRanks: 24; RankCore:  88; Hostname: a01n05; GPU: 3
Rank:    7; NumRanks: 24; RankCore:  92; Hostname: a01n05; GPU: 3

Rank:    8; NumRanks: 24; RankCore:  96; Hostname: a01n05; GPU: 4
Rank:    9; NumRanks: 24; RankCore: 100; Hostname: a01n05; GPU: 4

Rank:   10; NumRanks: 24; RankCore: 104; Hostname: a01n05; GPU: 5
Rank:   11; NumRanks: 24; RankCore: 108; Hostname: a01n05; GPU: 5

Rank:   12; NumRanks: 24; RankCore:   0; Hostname: a01n01; GPU: 0
Rank:   13; NumRanks: 24; RankCore:   4; Hostname: a01n01; GPU: 0

Rank:   14; NumRanks: 24; RankCore:   8; Hostname: a01n01; GPU: 1
Rank:   15; NumRanks: 24; RankCore:  12; Hostname: a01n01; GPU: 1

Rank:   16; NumRanks: 24; RankCore:  16; Hostname: a01n01; GPU: 2
Rank:   17; NumRanks: 24; RankCore:  20; Hostname: a01n01; GPU: 2

Rank:   18; NumRanks: 24; RankCore:  88; Hostname: a01n01; GPU: 3
Rank:   19; NumRanks: 24; RankCore:  92; Hostname: a01n01; GPU: 3

Rank:   20; NumRanks: 24; RankCore:  96; Hostname: a01n01; GPU: 4
Rank:   21; NumRanks: 24; RankCore: 100; Hostname: a01n01; GPU: 4

Rank:   22; NumRanks: 24; RankCore: 104; Hostname: a01n01; GPU: 5
Rank:   23; NumRanks: 24; RankCore: 108; Hostname: a01n01; GPU: 5

summit>

Multiple Tasks, Multiple GPUs per RS

The following example will create 4 resource sets each with 6 tasks and 3 GPUs. Each set of 6 MPI tasks will have access to 3 GPUs.

Ranks 0 – 5 will have access to GPUs 0 – 2 on the first socket of the first node (red resource set). Ranks 6 – 11 will have access to GPUs 3 – 5 on the second socket of the first node (green resource set). This pattern will continue until 4 resource sets have been created.

The following jsrun command will request 4 resource sets (-n4). Each resource set will contain 6 MPI tasks (-a6), 3 GPUs (-g3), and 6 cores (-c6).

summit> jsrun -n 4 -a 6 -c 6 -g 3 -d packed -l GPU-CPU ./a.out
Rank:    0; NumRanks: 24; RankCore:   0; Hostname: a33n06; GPU: 0, 1, 2
Rank:    1; NumRanks: 24; RankCore:   4; Hostname: a33n06; GPU: 0, 1, 2
Rank:    2; NumRanks: 24; RankCore:   8; Hostname: a33n06; GPU: 0, 1, 2
Rank:    3; NumRanks: 24; RankCore:  12; Hostname: a33n06; GPU: 0, 1, 2
Rank:    4; NumRanks: 24; RankCore:  16; Hostname: a33n06; GPU: 0, 1, 2 
Rank:    5; NumRanks: 24; RankCore:  20; Hostname: a33n06; GPU: 0, 1, 2
 
Rank:    6; NumRanks: 24; RankCore:  88; Hostname: a33n06; GPU: 3, 4, 5
Rank:    7; NumRanks: 24; RankCore:  92; Hostname: a33n06; GPU: 3, 4, 5
Rank:    8; NumRanks: 24; RankCore:  96; Hostname: a33n06; GPU: 3, 4, 5
Rank:    9; NumRanks: 24; RankCore: 100; Hostname: a33n06; GPU: 3, 4, 5
Rank:   10; NumRanks: 24; RankCore: 104; Hostname: a33n06; GPU: 3, 4, 5
Rank:   11; NumRanks: 24; RankCore: 108; Hostname: a33n06; GPU: 3, 4, 5

Rank:   12; NumRanks: 24; RankCore:   0; Hostname: a33n05; GPU: 0, 1, 2
Rank:   13; NumRanks: 24; RankCore:   4; Hostname: a33n05; GPU: 0, 1, 2
Rank:   14; NumRanks: 24; RankCore:   8; Hostname: a33n05; GPU: 0, 1, 2
Rank:   15; NumRanks: 24; RankCore:  12; Hostname: a33n05; GPU: 0, 1, 2
Rank:   16; NumRanks: 24; RankCore:  16; Hostname: a33n05; GPU: 0, 1, 2
Rank:   17; NumRanks: 24; RankCore:  20; Hostname: a33n05; GPU: 0, 1, 2

Rank:   18; NumRanks: 24; RankCore:  88; Hostname: a33n05; GPU: 3, 4, 5
Rank:   19; NumRanks: 24; RankCore:  92; Hostname: a33n05; GPU: 3, 4, 5
Rank:   20; NumRanks: 24; RankCore:  96; Hostname: a33n05; GPU: 3, 4, 5
Rank:   21; NumRanks: 24; RankCore: 100; Hostname: a33n05; GPU: 3, 4, 5
Rank:   22; NumRanks: 24; RankCore: 104; Hostname: a33n05; GPU: 3, 4, 5
Rank:   23; NumRanks: 24; RankCore: 108; Hostname: a33n05; GPU: 3, 4, 5
summit>

Single Task, Multiple GPUs, Multiple Threads per RS

The following example will create 12 resource sets each with 1 task, 4 threads, and 1 GPU. Each MPI task will start 4 threads and have access to 1 GPU.

Rank 0 will have access to GPU 0 and start 4 threads on the first socket of the first node (red resource set). Rank 1 will have access to GPU 1 and start 4 threads, also on the first socket of the first node (green resource set). This pattern will continue until 12 resource sets have been created.

The following jsrun command will create 12 resource sets (-n12). Each resource set will contain 1 MPI task (-a1), 1 GPU (-g1), and 4 cores (-c4). Notice that more cores are requested than MPI tasks; the extra cores are needed to place the threads.

Requesting Cores for Threads: The -c flag should be used to request additional cores for thread placement. Without requesting additional cores, threads will be placed on a single core.

Binding Cores to Tasks: The -b binding flag should be used to bind cores to tasks. Without specifying binding, all threads will be bound to the first core.

summit> setenv OMP_NUM_THREADS 4
summit> jsrun -n12 -a1 -c4 -g1 -b packed:4 -d packed ./a.out
Rank: 0; RankCore: 0; Thread: 0; ThreadCore: 0; Hostname: a33n06; OMP_NUM_PLACES: {0},{4},{8},{12}
Rank: 0; RankCore: 0; Thread: 1; ThreadCore: 4; Hostname: a33n06; OMP_NUM_PLACES: {0},{4},{8},{12}
Rank: 0; RankCore: 0; Thread: 2; ThreadCore: 8; Hostname: a33n06; OMP_NUM_PLACES: {0},{4},{8},{12}
Rank: 0; RankCore: 0; Thread: 3; ThreadCore: 12; Hostname: a33n06; OMP_NUM_PLACES: {0},{4},{8},{12}

Rank: 1; RankCore: 16; Thread: 0; ThreadCore: 16; Hostname: a33n06; OMP_NUM_PLACES: {16},{20},{24},{28}
Rank: 1; RankCore: 16; Thread: 1; ThreadCore: 20; Hostname: a33n06; OMP_NUM_PLACES: {16},{20},{24},{28}
Rank: 1; RankCore: 16; Thread: 2; ThreadCore: 24; Hostname: a33n06; OMP_NUM_PLACES: {16},{20},{24},{28}
Rank: 1; RankCore: 16; Thread: 3; ThreadCore: 28; Hostname: a33n06; OMP_NUM_PLACES: {16},{20},{24},{28}

...

Rank: 10; RankCore: 104; Thread: 0; ThreadCore: 104; Hostname: a33n05; OMP_NUM_PLACES: {104},{108},{112},{116}
Rank: 10; RankCore: 104; Thread: 1; ThreadCore: 108; Hostname: a33n05; OMP_NUM_PLACES: {104},{108},{112},{116}
Rank: 10; RankCore: 104; Thread: 2; ThreadCore: 112; Hostname: a33n05; OMP_NUM_PLACES: {104},{108},{112},{116}
Rank: 10; RankCore: 104; Thread: 3; ThreadCore: 116; Hostname: a33n05; OMP_NUM_PLACES: {104},{108},{112},{116}

Rank: 11; RankCore: 120; Thread: 0; ThreadCore: 120; Hostname: a33n05; OMP_NUM_PLACES: {120},{124},{128},{132}
Rank: 11; RankCore: 120; Thread: 1; ThreadCore: 124; Hostname: a33n05; OMP_NUM_PLACES: {120},{124},{128},{132}
Rank: 11; RankCore: 120; Thread: 2; ThreadCore: 128; Hostname: a33n05; OMP_NUM_PLACES: {120},{124},{128},{132}
Rank: 11; RankCore: 120; Thread: 3; ThreadCore: 132; Hostname: a33n05; OMP_NUM_PLACES: {120},{124},{128},{132}

summit>

Hardware Threads: Multiple Threads per Core

Each physical core on Summit contains 4 hardware threads. The SMT level can be set using LSF flags:

SMT1
#BSUB -alloc_flags smt1
jsrun -n1 -c1 -a1 -bpacked:4 csh -c 'echo $OMP_PLACES'
0
SMT2
#BSUB -alloc_flags smt2
jsrun -n1 -c1 -a1 -bpacked:4 csh -c 'echo $OMP_PLACES'
{0:2}
SMT4
#BSUB -alloc_flags smt4
jsrun -n1 -c1 -a1 -bpacked:4 csh -c 'echo $OMP_PLACES'
{0:4}

jsrun Tools

This section describes tools that users might find helpful to better understand the jsrun job launcher.

Hello_jsrun

Hello_jsrun is a “Hello World”-type program that users can run on Summit nodes to better understand how MPI ranks and OpenMP threads are mapped to the hardware.
https://code.ornl.gov/t4p/Hello_jsrun

A screencast showing how to use Hello_jsrun is also available.

For More Information

This section provides some of the most commonly used LSF commands, some of the most useful options to those commands, and information on jsrun, Summit's job launch command. Many commands have much more information than can easily be presented here; more is available via the online manual (e.g., man jsrun). Additional LSF information can be found on IBM's website, specifically the Running Jobs and Command Reference documents.