Technology - January 5, 2016

OLCF Delivers New Optimized Job Launching Tool

Last summer, OLCF user support specialists Adam Simpson and Matt Belhorn developed and released Wraprun, a new job submission tool for bundling large numbers of small jobs on Titan, a 27-petaflop Cray XK7 machine with a hybrid CPU/GPU architecture located at the Department of Energy’s Oak Ridge National Laboratory.

Support staff develops tool for bundling a large number of small jobs

When dozens of users are running millions of computations per day, scheduling jobs to run on the Titan supercomputer can become complex. Fortunately for users at the Oak Ridge Leadership Computing Facility (OLCF)—a US Department of Energy (DOE) Office of Science User Facility—launching jobs has become easier and more efficient with the recent development of Wraprun, a job launcher that allows users to “wrap” a large number of small jobs into one launch call.

Back in the summer, OLCF user support specialists Adam Simpson and Matt Belhorn developed and released Wraprun; its name is a play on Aprun, the Cray job launcher that users of the OLCF’s Titan supercomputer have typically used. With Wraprun, users can more easily queue up a large number of small jobs and produce more science results in less time.

The OLCF developed Wraprun to assist Titan users who wish to run ensemble jobs (sets of multiple independent jobs that run simultaneously) but have had difficulty doing so with the default job launcher, Cray’s Aprun.

What makes Wraprun different is that users can launch their ensembles—which typically would be launched with hundreds or thousands of Aprun calls—with a single Aprun call. Wraprun isolates the communication of the individual jobs by dividing the nodes efficiently—the nodes running one job will communicate only with the nodes working on the same job. Without the tool, each job communicates with all the other nodes, causing errors and crashes.
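The idea of dividing nodes so that each job talks only to its own nodes can be illustrated with a minimal sketch. This is not Wraprun's actual implementation, which the article does not describe; the function names `partition_nodes` and `peers_of` and the integer node IDs are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def partition_nodes(node_ids: List[int], nodes_per_job: List[int]) -> Dict[int, List[int]]:
    """Give each job in an ensemble its own disjoint slice of the allocated nodes.

    Hypothetical helper for illustration only; not part of Wraprun.
    """
    if sum(nodes_per_job) > len(node_ids):
        raise ValueError("ensemble needs more nodes than were allocated")
    groups: Dict[int, List[int]] = {}
    start = 0
    for job_id, count in enumerate(nodes_per_job):
        # Consecutive, non-overlapping slices keep the jobs isolated.
        groups[job_id] = node_ids[start:start + count]
        start += count
    return groups


def peers_of(node: int, groups: Dict[int, List[int]]) -> List[int]:
    """Return the other nodes that `node` may communicate with (same job only)."""
    for members in groups.values():
        if node in members:
            return [m for m in members if m != node]
    raise KeyError(f"node {node} is not assigned to any job")


if __name__ == "__main__":
    # Six allocated nodes shared by three independent jobs of sizes 2, 3, and 1.
    groups = partition_nodes(list(range(6)), [2, 3, 1])
    print(groups)
    print(peers_of(3, groups))
```

In this sketch, a node assigned to one job never appears in another job's peer list, which is the isolation property the paragraph above describes.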

By bundling the jobs, Wraprun allows users to launch more jobs in less time, increasing efficiency and productivity. Before the tool, users could launch only about 100 jobs in an ensemble at a time, and because Titan users are limited to two running batch jobs, users could run only about 200 jobs simultaneously. With Wraprun, there is no known upper limit on the number of jobs that can be in an ensemble. In principle, users can bundle a full machine’s worth of jobs into one Aprun call.

One of the major benefits of developing Wraprun in-house is that Simpson and Belhorn can make changes and add features to the tool for OLCF users with relative ease. In fact, earlier this year, Simpson was able to implement an error-handling feature for an OLCF user within an hour—a request that would have taken days with a third-party tool.

For all of the benefits it provides, the Wraprun tool is also simple to use.

“This is what makes it a great solution for what it does,” Belhorn said. “All you need to do is add ‘Wr’ in front of the Aprun call and add some extra flags, and now you can do things on Titan that you couldn’t do before.” – Miki Nolin

Oak Ridge National Laboratory is supported by the US Department of Energy’s Office of Science. The single largest supporter of basic research in the physical sciences in the United States, the Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.