Faces of Summit: Putting the System to the Test
Verónica Vergara Larrea’s team is tasked with ensuring Summit’s performance
The Faces of Summit series shares stories of people working to stand up America’s next top supercomputer for open science, the Oak Ridge Leadership Computing Facility’s Summit. The next-generation machine is scheduled to come online in 2018.
Accepting a new supercomputer is one of the most unusual and challenging projects a programmer may take on during a career. The Oak Ridge Leadership Computing Facility (OLCF) has now received much of the hardware for its next big supercomputer, Summit, and has completed the first phase of acceptance testing.
Verónica Vergara Larrea, a user support specialist and programmer at the OLCF, is tasked with leading the design and selection of test codes and problem cases to ensure the stability of the system. The OLCF is a US Department of Energy (DOE) Office of Science User Facility at DOE’s Oak Ridge National Laboratory (ORNL). As part of the System Test Working Group, Vergara Larrea coordinates and organizes test development and benchmarking of the massive architecture for ORNL.
Originally from Quito, Ecuador, Vergara Larrea came to the US as a mathematics/physics major, earning a bachelor of arts from Reed College in Portland, Oregon, in 2008. Computational science seemed a natural fit for Vergara Larrea, who sought a practical way to apply her knack for problem-solving. She went on to receive a master of science in computational science at Florida State University and began her career troubleshooting and debugging scientific applications on large cluster resources at Purdue University. Now, Vergara Larrea is troubleshooting one of the largest supercomputing systems in the world.
“Our group deals with all kinds of issues. We try to make sure that every issue we find—whether it’s with the compilers or other aspects of the system—is resolved before Summit goes live,” Vergara Larrea said. Compilers help translate codes into instructions a computer can understand; therefore, their debugging is essential for efficient and smooth operation.
In January 2016, the acceptance team designed an acceptance test plan for Summitdev, the early access system for Summit. Featuring IBM’s POWER8+ processors and NVIDIA Pascal GPUs, Summitdev provided a bridge to Summit by giving staff and users a chance to try out the predecessor to Summit’s POWER9 architecture and NVIDIA Volta GPUs. Teams in the Center for Accelerated Application Readiness began running their scientific applications on the early access system in the first months of 2017, aiming to optimize them so that users from every scientific domain will be able to run their codes effortlessly on Summit when it is available to users in 2019. The acceptance test plan for Summit, delivered in early 2017, consists of more than 380 tests, with more than 7,800 jobs running over a continuous 92-hour period for the Phase 1 stability testing.
“The acceptance process for Summit is very different from that of Summitdev,” Vergara Larrea said. “For Summit, in addition to the hardware diagnostics and basic functionality testing, we run benchmarks and full applications at different problem sizes to verify correctness while also evaluating performance. Then we look at what happens if we put as much load on the system as we can to test its stability.”
As of early 2018, the OLCF has completed acceptance of Phase 1 of Summit, which totals approximately 25 percent of the final system. The next phase, beginning soon, will follow the same process of testing hardware, functionality, performance, and stability for the entire system.
No “I” in team
Although Vergara Larrea’s troubleshooting skills extend back to her days at Purdue University, where she worked as a scientific applications analyst for 3 years, Summit’s scale makes her work especially challenging.
“Part of my job is doing validation testing and making sure all the tests that have been developed will give us correct results,” Vergara Larrea said. “Once we had these codes and applications working correctly on Summitdev, we had to wait for Summit’s hardware to arrive before we could determine if there were major problems that had to be addressed before acceptance could start. It’s a complex process and requires us to work closely with IBM development teams.”
But scheduling around numerous other groups working on various aspects of the project makes acceptance tricky. Because acceptance requires installed hardware to run test cases and validate software, it depends on the success of multiple groups. The acceptance team consisted of 36 people from multiple groups at the National Center for Computational Sciences at ORNL, including staff members in the OLCF’s High-Performance Computing Operations (HPC Ops), User Assistance and Outreach, Scientific Computing, Technology Integration, and Computer Science Research Groups.
The teams have had to coordinate often to meet milestones. If one aspect is delayed, such as the delivery of the cabinets or installation of the nodes, it affects acceptance. Although teams have separate goals for the system, they must work together to guarantee Summit’s overall success.
A blank slate
Vergara Larrea has worked with new architectures previously, but before the Summit project, she had not dealt with pre-release technologies. She said the most exciting aspect of working on a project such as Summit is the ability to run on a novel architecture with a specialized software stack designed specifically for the Collaboration of Oak Ridge, Argonne, and Livermore (CORAL). The goal of CORAL is to stand up leadership computers at each of the three sites that deliver 5 to 10 times the performance of current DOE leadership systems.
After the OLCF’s HPC Ops Group sets up and configures new supercomputing systems, Vergara Larrea’s team is one of the first to run scientific applications and codes on the system, troubleshooting bugs and problems along the way.
“Summit’s compute nodes are not generally available yet, and that’s incredibly exciting,” Vergara Larrea said. “It’s not just an architecture that’s never been in production. It’s being deployed at a scale that hasn’t been tested before—and we get to make it work.”
ORNL is managed by UT-Battelle for the Department of Energy’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit http://science.energy.gov/.