OLCF’s Chris Fuson works with Summit vendors and OLCF team members to ready Summit’s batch scheduler and job launcher

The Faces of Summit series shares stories of the people behind America’s top supercomputer for open science, the Oak Ridge Leadership Computing Facility’s Summit. The IBM AC922 machine launched in June 2018.

At the Oak Ridge Leadership Computing Facility (OLCF), supercomputing staff and users are already talking about what kinds of science problems they will be able to solve once they “get on Summit.”

But before they run their science applications on the 200-petaflop IBM AC922 supercomputer later this year, they will have to go through the system’s batch scheduler and job launcher.

“The batch scheduler and job launcher control access to the compute resources on the new machine,” said Chris Fuson, OLCF high-performance computing (HPC) support specialist. “As a user, you will need to understand these resources to utilize the system effectively.”

HPC Support Specialist Chris Fuson in the Summit computing center

A staff member in the User Assistance and Outreach (UAO) Group, Fuson has worked on five flagship supercomputers at OLCF—Cheetah, Phoenix, Jaguar, Titan, and now Summit.

With a background in programming and computer science, Fuson said he likes to focus on solving the unexpected issues that come up during installation and testing, such as fixing bugs or adding new features to help users navigate the system.

Fuson can often be found standing at his desk listening to background music while he sorts through new tasks, user requests, and technical issues related to job scheduling.

“As the systems change and evolve, the detective work involved in helping users solve problems as they run on a new machine keeps it interesting,” he said.

Of course, the goal is to make the transition to a new system as smooth as possible for users. While still responding to day-to-day tasks related to the OLCF’s current supercomputer, Titan, Fuson and the UAO group also work with IBM to learn, incorporate, and document the IBM Load Sharing Facility (LSF) batch scheduler and the parallel job launcher jsrun for Summit. LSF allocates Summit resources, and jsrun launches jobs on the compute nodes.

“The new launcher provides similar functionality to other parallel job launchers, such as aprun and mpirun, but requires users to take a slightly different approach in determining how to request and lay out resources for a job,” Fuson said.
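To make that "slightly different approach" concrete, here is a minimal sketch, not OLCF's or IBM's tooling, of the arithmetic a user might do before writing a jsrun line. It assumes the resource-set model described in jsrun's documentation, in which a job is built from identical bundles of tasks, cores, and GPUs. The node size (42 usable cores and 6 GPUs) is an assumption passed in as a default, and the names ResourceSet, sets_per_node, and jsrun_command are hypothetical helpers invented for illustration. Only the short options -n, -a, -c, and -g are jsrun's own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceSet:
    """One bundle of resources shared by a group of tasks (hypothetical helper)."""
    tasks: int  # tasks per resource set   (jsrun -a)
    cores: int  # CPU cores per resource set (jsrun -c)
    gpus: int   # GPUs per resource set    (jsrun -g)


def sets_per_node(rs, node_cores=42, node_gpus=6):
    """Return how many copies of rs fit on one node.

    The node size defaults are assumptions, not values taken from the article.
    """
    if rs.cores <= 0:
        raise ValueError("a resource set needs at least one core")
    limits = [node_cores // rs.cores]
    if rs.gpus > 0:
        # GPUs only constrain the layout when the resource set asks for them.
        limits.append(node_gpus // rs.gpus)
    return min(limits)


def jsrun_command(rs, nodes, executable, node_cores=42, node_gpus=6):
    """Build a jsrun line that fills the given number of nodes with resource sets."""
    total_sets = sets_per_node(rs, node_cores, node_gpus) * nodes
    return (
        f"jsrun -n {total_sets} -a {rs.tasks} "
        f"-c {rs.cores} -g {rs.gpus} {executable}"
    )


if __name__ == "__main__":
    # One task driving one GPU with seven cores, spread over two nodes.
    one_gpu_per_task = ResourceSet(tasks=1, cores=7, gpus=1)
    print(jsrun_command(one_gpu_per_task, nodes=2, executable="./my_app"))
```

Under these assumptions the script prints a request for 12 resource sets, six per node. With aprun or mpirun a user would more typically start from a total task count. Here the starting point is the shape of a single bundle, and the launcher replicates it across the allocation that LSF has granted.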

IBM developed jsrun to meet the unique computing needs of two partners in CORAL (the Collaboration of Oak Ridge, Argonne, and Livermore): the US Department of Energy’s (DOE’s) Oak Ridge and Lawrence Livermore National Laboratories.

“We relayed our workload and scheduling requirements to IBM,” Fuson said. “For example, as a leadership computing facility, we provide priority for large jobs in the batch queue. We work with LSF developers to incorporate our center’s policy requirements and diverse workload needs into the existing scheduler.”

OLCF Center for Accelerated Application Readiness team members, who are optimizing application codes for Summit, have tested LSF and jsrun on Summitdev, an early access system whose IBM processors are one generation older than Summit’s Power9 processors.

“Early users are already providing feedback,” Fuson said. “There’s a lot of work that goes into getting these pieces polished. At first, it is always a struggle as we work toward production, but things will begin to fall into place.”

To prepare all facility users for scheduling on Summit, Fuson is also developing user documentation and training. In February, he introduced users to jsrun on the monthly User Conference Call for the OLCF, a DOE Office of Science User Facility at ORNL.

“Right now, Summit is a big focus,” he said. “We’ve invested time in learning these new tools and testing them in the Summit environment.”

And what about his free time, when Summit is not the focus? Fuson spends his off-hours scheduling as well. “My hobby is taxiing my kids around town between practices,” he joked.

ORNL is managed by UT–Battelle for the Department of Energy’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit https://science.energy.gov/.