Faces of Summit: Leading a Systems Expedition
HPC systems administrator Matt Ezell irons out crucial details during Summit project’s final stages
The Faces of Summit series shares stories of the people behind America’s top supercomputer for open science, the Oak Ridge Leadership Computing Facility’s Summit. The IBM AC922 machine launched in June 2018.
Installing software. Troubleshooting problems. Migrating procedures. Debugging programs. These are just a few of the countless tasks performed by systems administrators—people who oversee the configuration and upkeep of a computer system’s operations.
At the Oak Ridge Leadership Computing Facility (OLCF), a US Department of Energy (DOE) Office of Science User Facility located at DOE’s Oak Ridge National Laboratory (ORNL), the systems administrator’s responsibilities scale with the size of the system. As the OLCF prepares to launch its most powerful system yet, the job remains as critical as ever.
The IBM AC922 Summit, the OLCF’s next top supercomputer for open science, will have a peak performance of around 200 petaflops—or 200 quadrillion calculations per second—when it comes online later this year. Matt Ezell, one of the OLCF’s high-performance computing (HPC) systems administrators for Summit, is currently undertaking some of the most important tasks necessary for standing up the machine. One of his largest tasks is ensuring that the system software components all work reliably and perform well together. These components range from the Red Hat Enterprise Linux operating system to the IBM Spectrum Load Sharing Facility batch system, the software that determines how jobs are scheduled and where they will run on the machine.
“There’s a lot of work that our team has had to do to migrate all of the policies and procedures from Titan to Summit,” Ezell said, referring to the OLCF’s current Cray XK7 supercomputer and the upcoming system, respectively.
As part of the HPC Operations (HPC Ops) Group at the OLCF, Ezell works alongside HPC systems team lead and system owner Don Maxwell, HPC systems administrator Christopher Muzyn, and HPC Linux systems administrator Richard Ray on some of the most challenging aspects of the system software.
“Each time a new software release drops from IBM, we have to work with IBM and the other vendors to tune it for Summit,” Ezell said. “The biggest challenge is installing and testing all of the new software and making sure we get it working in a way that’s best going to support our users and their science. It’s definitely an iterative process.”
Troubleshooting beyond Google
Ezell, though, finds problem-solving one of the most rewarding aspects of his career.
Regardless of the task at hand, he enjoys dissecting jobs into their components. During high school, Ezell worked as a graphic designer of sorts—only he was much more interested in the inner workings of the computer than in any of the actual graphics.
“The job took someone who knew about computers and about the software involved,” Ezell said. “I had to help people understand what I needed from them as far as image resolution goes, and then I could work with that to achieve certain results for their promotional materials.”
Having been interested in the “guts” of machines from a young age, Ezell pursued a degree in electrical engineering at the University of Tennessee–Knoxville (UT). During his undergraduate program, he began working for the National Institute for Computational Sciences (NICS), UT’s supercomputing center at ORNL.
Once he started his master’s program in computer engineering at UT, he took on the role of lead systems administrator for NICS’s Kraken supercomputer, a Cray XT5 system. Upon deployment in 2009, Kraken boasted a peak performance of 1.17 petaflops, making it the most powerful academic computer at the time.
Ezell was hired into the OLCF’s HPC Ops Group in 2012, when ORNL was launching Titan. Figuring out how to manage Titan’s 18,688 GPUs made the learning curve for Summit far less steep, but Ezell has still encountered unique challenges.
“Troubleshooting Summit has been a whole new beast,” Ezell said. “You can’t just go to Google and type in something and expect to find the answer. A lot of times you have to read the source code yourself and figure out what’s wrong or talk to the software developers. We are solving problems that nobody has ever seen before, and that is really exciting.”
Summit’s software needed to be transformed to run on a new hardware system—a task that has required writing code from the ground up and finding ways of creatively renovating software components. During Summit’s deployment, Ezell contributed multiple code patches to IBM’s open-source Extreme Cloud Administration Toolkit management software that helped improve security and fix bugs.
“Matt has an uncanny ability to dig in and figure out how to fix problems,” Maxwell said. “He really knows the low-level stuff and understands how to fix it. Summit would not be where it is today without Matt’s technical leadership.”
A jack of all trades
Ezell and the HPC systems team are currently preparing for the final stages of acceptance, a process during which Summit undergoes rigorous testing to verify its performance. The complex task requires the expertise of several OLCF groups. The HPC systems team collaborates with the OLCF’s User Assistance and Outreach (UAO) Group to identify system problems based on end-user reports. HPC systems staff members also work with the OLCF’s Scientific Computing Group to focus on more technical problems that may affect users and their scientific applications.
Ezell, in particular, has played a broad role in readying the IBM system. As colead for the cluster systems management (CSM) and Linux systems management (LSM) working groups, he has contributed to the nonrecurring engineering R&D phase of Summit, which aims to address the gaps in Summit’s software and hardware stack. Consisting of members of HPC Ops, UAO, and the Technology Integration Group, the CSM and LSM working groups coordinate with IBM to understand which new technologies and features the current version of the software may need.
The groups use Peak, Summit’s lone test cabinet, to track and fix bugs in collaboration with IBM. Ezell works with OLCF user support specialist and programmer Verónica Vergara Larrea to run acceptance tests on Peak before migrating the software onto Summit itself.
“Peak allows us to find many of our bugs and work through issues before we take an outage and put it on the big system,” Ezell said. “It insulates users from these growing pains.”
The HPC systems team expects that the last software stack, intended for full acceptance and production, will be installed this summer. The final stages of Summit’s software installation double as a thorough test of the knowledge base Ezell and his colleagues have compiled over the last few years. It’s a body of work that may also serve Ezell in his next big role: system owner of the OLCF’s first exascale system, scheduled to be deployed around 2021. Coincidentally, Ezell submitted the winning name for this system, “Frontier,” in 2017.
ORNL is managed by UT-Battelle for DOE’s Office of Science. The Office of Science is the single largest supporter of basic research in the physical sciences in the United States and is working to address some of the most pressing challenges of our time. For more information, please visit https://science.energy.gov/.