When it comes to solving complex technical issues for GPU-accelerated supercomputers, the national labs have found that tackling them is “better together”

Every new generation of supercomputer that arrives at the nation’s leadership computing facilities brings an inevitable learning curve as computational scientists get acclimated to unfamiliar systems. Software bugs must be hunted down and hardware quirks ironed out. But even after those initial issues are resolved, the ongoing pursuit of system optimization requires more than just initial technical support. In fact, at three national labs, “technical support” has been redefined as an ongoing collaboration to refine and expand the capabilities of these high-performance computing (HPC) machines long after their delivery dates.

The Summit on Summit, Sierra, and Perlmutter (SoSSP) is not merely a triannual workshop hosted by NVIDIA to assist its supercomputer customers—it’s actually part of a year-round team effort to solve HPC issues and help conduct new science across the national labs. Named after the supercomputer systems of the three main participants—Oak Ridge National Laboratory (ORNL) with Summit, Lawrence Livermore National Laboratory (LLNL) with Sierra, and the National Energy Research Scientific Computing Center (NERSC) with the upcoming Perlmutter housed at Lawrence Berkeley National Laboratory—the SoSSP marked 2 years of this new, unified approach to problem-solving and developer-to-developer engagement at its September meeting.

“We focus on technical issues that can benefit a large number of users of these machines,” said Tjerk Straatsma, a distinguished research scientist at the National Center for Computational Sciences who also serves as ORNL’s SoSSP liaison. “It is a joint effort to use the NVIDIA GPU architecture as effectively as possible for a wide range of applications, and it focuses on the implications for both the software environment as well as scientific methods and libraries.”

In the SoSSP, personnel from each of the US Department of Energy’s Advanced Scientific Computing Research (ASCR) computing facilities are attached to individual “work streams,” combining their expertise to overcome cross-system hurdles. These work streams are dedicated to 10 different focus areas—ranging from compilers to data analytics to math libraries—and are assigned a set of goals due approximately each quarter. In partnership with the ASCR personnel, every work stream is led by a person from NVIDIA who drives the technical discussions around these deliverables.

“The work of the national labs is critically important in tackling many of the global challenges we face today,” said Ian Buck, general manager and vice president of Accelerated Computing at NVIDIA. “The ongoing SoSSP, dev-to-dev partnership demonstrates our commitment to working with the national labs to maximize our scientific computing platform today and for years to come.”

The collaborative team of end-user developers from the labs and CUDA platform developers meets each month to assess the progress of each work stream. New goals are then set and broadly reviewed at the triannual meetings at NVIDIA’s headquarters in Santa Clara, California, which are also attended by representatives from the Sandia, Los Alamos, Oak Ridge, Lawrence Berkeley, Lawrence Livermore, and Argonne national labs. (Because of the COVID-19 pandemic, meetings are now virtual.)

The idea for this continuing team effort originated from a workshop held at ORNL in August 2018, soon after the launch of Summit. Dubbed “Summit on Summit,” the 2-day symposium brought technicians from NVIDIA to Oak Ridge to discuss Summit’s capabilities and how to use the new machine most effectively.

“An important driver for the ongoing existence of this summit is the new architectural features that the NVIDIA Volta GPUs provide, including the tensor cores and the fast NVLink intra-node network. It became apparent that these new hardware features could particularly benefit data-driven machine learning applications in addition to model-driven simulations. To learn how to use this powerful new architecture effectively, we started this collaboration with NVIDIA,” Straatsma said.

In November 2018, NVIDIA technicians also met with the Sierra team at LLNL for much the same reasons: to explore that lab’s unique challenges and opportunities and to look for commonalities across lab sites.

“Any new large system like Sierra and Summit presents both learning curves and issues,” said Rob Neely, LLNL liaison for the SoSSP and associate division leader at the Center for Applied Scientific Computing. “The learning curves are most easily summed up first as ‘How can I get my applications to work correctly?’, followed soon after with ‘How can I get my applications to work well (i.e., run fast)?’”

The main areas that LLNL wanted to work on initially were very similar to ORNL’s: (1) compiler support and features, (2) tools for debugging and performance analysis, (3) monitoring of the compute center to understand how well the accelerators are being used, (4) optimization of software for peak performance, and (5) future integration of machine learning into application workflows. So, for the next workshop in April 2019, the location moved to NVIDIA’s headquarters in California, and both labs attended what became “Summit on Summit and Sierra.” The benefits of a combined effort were immediately apparent to participants.

“If there are two labs struggling with the same issue and asking for a solution, that is always more compelling and helpful as it helps drive the vendor’s priorities,” Neely said. “It’s also beneficial to know how others may be working around issues, or even running into issues you haven’t yet—but will soon.”

For NVIDIA, the workshops became more than just a way to help its customers adjust to GPU-accelerated computing. CJ Newburn, NVIDIA’s HPC lead for Compute Software, initiated the summits to help foster a connection between end-user developers and NVIDIA’s developers of the CUDA parallel computing platform. But the meetings have also helped inform the development of NVIDIA’s products themselves.

“We don’t just build something and throw it over the wall, hoping that it’ll be useful. We stick around in deep engagements to see things through,” Newburn said. “These interactions have helped us mature and change features like CUDA Graphs and have significantly impacted our roadmaps in math libraries and compilers. We’ve also brought together visualization tools using a common interface with LANL and refined our datacenter monitoring solutions to create much greater visibility into usage on Summit.”

In January 2020, a new system was added to the workshop’s title: “Summit on Summit, Sierra, and Perlmutter.” NERSC’s Perlmutter is a Hewlett Packard Enterprise pre-exascale system scheduled to begin delivery and installation in late 2020. It is also the first production computing system at NERSC to be accelerated with GPUs, which sparked NERSC’s early interest in learning how best to port applications to GPUs and take full advantage of the hardware parallelism.

According to Brian Friesen, a NERSC application performance specialist and SoSSP liaison, NERSC has over 7,000 users spanning 800 DOE projects and 700 codes, and it is a significant challenge to migrate a workload of this size to an accelerated system like Perlmutter.

“Fortunately, the SoSSP workstreams intersect these topics in several different ways, and since ORNL and LLNL face similar challenges, there are many opportunities for us to collaborate on these problems to find common solutions,” Friesen said. “Collaboration with the other labs has shown significant benefit over working through these challenges alone. While some issues can be unique to some labs and don’t share the same priority across all DOE computing facilities, the breadth and depth of experiences and perspectives of experts at the different labs are always helpful in identifying solutions to the problems we face.”

About the OLCF and Oak Ridge National Laboratory
As a US Department of Energy (DOE) Office of Science User Facility, the Oak Ridge Leadership Computing Facility (OLCF) offers leadership-class computing resources to researchers from government, academia, and industry who have many of the largest computing problems in science. Oak Ridge National Laboratory (ORNL), located in Oak Ridge, Tennessee, is the largest DOE science and energy laboratory, conducting basic and applied research to deliver transformative solutions to compelling problems in energy and security. UT-Battelle LLC manages ORNL for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science. 

About NERSC and Berkeley Lab
The National Energy Research Scientific Computing Center (NERSC) is a US Department of Energy (DOE) Office of Science User Facility that serves as the primary high-performance computing center for scientific research sponsored by the Office of Science. Located at Lawrence Berkeley National Laboratory, the NERSC Center serves more than 7,000 scientists at national laboratories and universities researching a wide range of problems in combustion, climate modeling, fusion energy, materials science, physics, chemistry, computational biology, and other disciplines. Berkeley Lab is a DOE national laboratory located in Berkeley, California. It conducts unclassified scientific research and is managed by the University of California for the DOE.

About Lawrence Livermore National Laboratory
Founded in 1952, Lawrence Livermore National Laboratory provides solutions to our nation’s most important national security challenges through innovative science, engineering and technology. Lawrence Livermore National Laboratory is managed by Lawrence Livermore National Security, LLC for the US Department of Energy’s National Nuclear Security Administration.