OLCF hackathon opens new frontiers for users

The Oak Ridge Leadership Computing Facility (OLCF) wrapped up its 2021 hackathon with teams from around the globe working on projects that spanned the cosmos.

Ten teams with a total of 71 participants from 21 organizations took part in the October hackathon hosted by the OLCF, home to Summit, the nation’s fastest supercomputer. The event capped a yearlong series of hackathons sponsored by NVIDIA.

Scientists from across the US, along with participants from Brazil and the Netherlands, collaborated to tackle problems of computational fluid dynamics, electronic structure, nuclear fusion and astrophysics under the supervision of experts from the OLCF, NVIDIA, Georgia Tech and the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory. The OLCF and NERSC are US Department of Energy Office of Science user facilities.

As in 2020, the events were held entirely online due to the COVID-19 pandemic.

The hackathons began in 2014 as a way to help OLCF users adapt their application codes to graphics processing units, or GPUs, which power the hybrid architecture of supercomputers such as Summit, the OLCF’s 200-petaflop IBM AC922 system, and its 27-petaflop predecessor, Titan. Earlier supercomputers relied mainly on central processing units, or CPUs, which run individual calculations quickly but can’t match the number of operations that GPUs run in parallel. Hybrid computing architectures like Summit’s have since become standard.

Each hackathon team works on a specific code using Ascent, a stand-alone 18-node cluster built on the same design as Summit, and spends the week porting and profiling that code to run on GPUs, working out bugs, and fine-tuning other details.

Graphics processing units, or GPUs, like this one enable supercomputers like Summit to perform quadrillions of calculations per second. Image Credit: ORNL

“We started with a single event here at the OLCF, and now it’s become an annual series that takes place worldwide,” said Tom Papatheodore, a high-performance computing engineer who’s helped organize the events since 2017. “In the pre-pandemic world, we used to host these events in person in a conference room, with teams working together at their own tables. Now we host these events virtually on Zoom with breakout rooms instead of in-person huddles.”

One of the star crews from the 2021 series proved to be the Thornado team, led by Eirik Endeve and Austin Harris, computational astrophysicists at Oak Ridge National Laboratory, with graduate students Ran Chu and Sam Dunham. The team’s name comes from their code—the Toolkit for High-Order Neutrino Radiation Hydrodynamics (Thornado). The code aims to help predict the elemental fallout, known as the nucleosynthetic yield, scattered when a star explodes in a supernova, along with other observables such as gravitational waves and neutrinos produced by the explosion.

“Hackathons like these are great places to bring students for even just a day to brush up on existing skills and learn new skills that will be invaluable in making Thornado perform well on DOE supercomputers,” Endeve said. “When we leave the event, they can go off on their own and make a lot of progress. Each member of the team used the hackathon to work on various kernels of the code that are critical to solving its equations. We probably wouldn’t have made as much progress as we have otherwise.”

The Thornado team took advantage of the hackathons to speed up their code and streamline operations. Before the OLCF hackathon, simulations using the code took about 10 days. The work completed at the hackathon helped the team shave that time down to 2.5 days.

“We’ve been working on this project for the past 6 years,” said Chu, a doctoral student at the University of Tennessee, Knoxville. “The hackathon was a great opportunity to finish the last steps needed. With everyone working together virtually at the same time, we could get feedback quickly and efficiently, and the mentors could look over our work to make suggestions early in the process. We can talk to participants from other groups in the breakout rooms in between work. After the OLCF hackathon, we were able to run our supernova simulations fully on GPUs at a speedup of about 17 times faster. That’s a big milestone for us.”

Dunham, a doctoral candidate at Vanderbilt University, used the hackathon to get some stubborn nodes connected and port a module on the CPU side.

“This module was one of the big bottlenecks in the code, so having the access to mentors offered by the hackathon was a big help,” he said. “They could give me hints on which rabbit holes to go down and where to focus my time and effort.”

Thornado has been made publicly available, and the team looks forward to using it on Frontier, the 1.5-exaflop successor to Summit expected to open to full user operations by 2023.

“Thanks in part to the experiences taken from the hackathon, Thornado will run more efficiently on Frontier and hopefully enable scientists to do more science,” Endeve said.

UT-Battelle LLC manages Oak Ridge National Laboratory for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.