Faces of Summit: Serving up Software
Mark Berrill's one-of-a-kind background helps him navigate software-related aspects of the Summit project
The Faces of Summit series shares stories of people working to stand up America’s next top supercomputer for open science, the Oak Ridge Leadership Computing Facility’s Summit. The next-generation machine is scheduled to come online in 2018.
The installation of a new supercomputer demands the expertise of individuals with diverse knowledge sets. As projects evolve, diverge, and grow, they require a special kind of talent to fit certain pieces together. When it comes to scientific codes and software, they require someone who knows computers and science—someone like Mark Berrill, a computational scientist in the Scientific Computing Group (SciComp) at the Oak Ridge Leadership Computing Facility (OLCF).
Berrill applies his knowledge of engineering and computing to manage software projects related to the IBM AC922 Summit, the next leadership-class supercomputer at the OLCF, a US Department of Energy (DOE) Office of Science User Facility located at DOE’s Oak Ridge National Laboratory (ORNL). With Summit’s arrival approaching, people like Berrill are bridging the gap between multiple areas of expertise to ensure the success of the potential 200-petaflop system.
A software specialist who has been at ORNL since 2010, Berrill is not new to working with high-performance computing (HPC) resources. From 2010 to 2014, he was a software engineer in the Computational Engineering and Energy Sciences Group led by distinguished R&D staff member John Turner. He transitioned to SciComp when his research began heading increasingly toward computing.
“I really liked Titan and working with these large computing resources,” Berrill said, referencing the OLCF’s current flagship supercomputer. “I get to work on many different projects, and that’s why I knew this group would be a good fit—because there’s always so much diversity in the work we do.”
Drawn to machines
Although Berrill hasn’t always worked with software, computing has always held his interest.
“When I was in high school, I liked to take apart electronics and rebuild them,” Berrill said. “I liked building and working with computers, and I initially wanted to go into robotics.”
After dabbling in robotics and communications at Colorado State University (CSU), Berrill decided to pursue an electrical engineering degree. But even in graduate school, he still sought out opportunities to work with computers.
“In graduate school, I elected to call myself the ‘lone theoretician in an experimental group,’” he recalled. “I did all the computational modeling and more of the theoretical work in a laser research group and plasma research group. Everybody else was experimental, and I did computer modeling.”
In 2006, while earning his PhD in electrical engineering from CSU, he became a fellow in the DOE Computational Science Graduate Fellowship program, which provides opportunities to students pursuing degrees in fields that apply HPC to science and engineering problems. After participating in a summer practicum at Los Alamos National Laboratory and gaining access to large parallel computing resources, Berrill began looking at opportunities within the national laboratories. In 2010, he applied for and earned the Eugene P. Wigner Fellowship at ORNL.
The birth of Summit
In the early stages of Summit, Berrill played a crucial role in updating the OLCF’s acceptance test harness, a software package originally developed by OLCF computational scientist Arnold Tharrington that is designed to simulate multiple users running codes on massive HPC systems such as Summit.
“The acceptance harness runs applications through the process of compiling the codes and submitting them to the scheduler and making sure that the answer is correct, according to the definition of correct for whatever the application in question is,” Berrill said. “It’s a way to simulate that process so we don’t have to submit and check jobs manually for 24 hours a day during the stability phase of acceptance.” Stability testing puts a heavy job load on the system to test how codes perform when the system is stressed. The project was completed by a team consisting of Berrill, Tharrington, OLCF user support specialist Verónica Vergara Larrea, and OLCF computational scientist Wayne Joubert.
Berrill was also in charge of drafting the Summit acceptance document, a birth certificate of sorts. The document described the tests ORNL would run on Summit before its inauguration into the world of computing and served as the guide for the execution of acceptance. A team led by Jim Rogers, computing and facilities director at the National Center for Computational Sciences, submitted the document to IBM, which is partnering with ORNL to stand up Summit.
Now Berrill is focused on smaller acceptance test applications, such as his mini application (miniapp) XRayTrace. He developed the miniapp to evaluate different programming models for his original RayTrace code, which is used to simulate x-ray lasers. Testing application performance under different programming models will help ensure success on unfamiliar parallel architectures such as Summit, which will feature IBM’s POWER9 architecture and NVIDIA Volta GPUs.
A natural intermediary, Berrill also acts as the official and unofficial liaison for a number of projects at the OLCF, including three engineering projects under the Innovative and Novel Computational Impact on Theory and Experiment, or INCITE, program. He has also played a large role in the Consortium for Advanced Simulation of Light Water Reactors (CASL), serving as the point of contact for CASL’s interaction with the OLCF.
Today Berrill is filling in the gaps for Summit where necessary, working to get the final test pieces ready for acceptance. As hardware rolls in, Berrill’s team is racing against time to ensure software tests are free of bugs.
“While there are always problems with new hardware, our job is to make sure that we don’t have problems with the tests themselves,” he said. “If the tests identify an issue, we want to make sure it’s a legitimate problem as opposed to a flawed test. With these final tests, it’s a mad dash to the finish line, and we’re going full speed ahead.”
ORNL is managed by UT–Battelle for the Department of Energy’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit http://science.energy.gov/.