ORNL-NOAA collaboration repurposes cloud-computing tool to share climate models
Successfully running a simulation program on a supercomputer requires more than just writing code. An application such as a climate model requires libraries, network support, and the correct operating environment. Because of these complex and multilayered requirements, applications are typically customized and built natively on a computing system, making them difficult to distribute or use on other machines. But what if you could package all the elements of an application so that it’s simpler and easier to share?
That’s the goal of a collaboration among researchers at the US Department of Energy’s (DOE’s) Oak Ridge National Laboratory (ORNL) and the National Oceanic and Atmospheric Administration (NOAA). They are developing containers that run on Gaea, which is NOAA’s Cray XC supercomputer used for climate modeling and simulation. In high-performance computing (HPC), containers are software packages that include the application code and any dependencies for that code, such as libraries or elements of the operating system. Originally developed for cloud-computing systems, containers now enable ORNL and NOAA to distribute climate models, such as the Earth System Model version 4 (ESM4) and the Atmosphere Model version 4 (AM4), and make those models more accessible to other researchers.
“Containers will allow the next generation of models to run on more computers and to be more usable by the community,” said Tom Robinson, a physical scientist at the Geophysical Fluid Dynamics Laboratory at NOAA and a collaborator on the project.
“I hope that NOAA can use containers to share models with climate scientists who can then immediately begin their research without going through a multistage process of building the application,” said Subil Abraham, an ORNL HPC engineer and a project collaborator.
However, packaging all the elements that allow an application to run isn’t as simple as adding files to a folder on a desktop. Natively built applications, such as the climate models AM4 and ESM4, are optimized for and rely on the environment of a given system. Additionally, NOAA has strict reproducibility requirements for its models: the output must be consistent regardless of the computer on which the model runs.
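To make the reproducibility requirement concrete, here is a minimal sketch of one way to check that two runs produced identical output. It is an illustration only; the article does not describe NOAA's actual verification procedure, and the function names and file paths are hypothetical.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large model outputs never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def outputs_match(native_output, container_output):
    """True if the two output files are byte-for-byte identical.

    Both paths are hypothetical: one file written by a natively built
    model, the other by the same model running inside a container.
    """
    return file_digest(native_output) == file_digest(container_output)
```

A byte-level comparison is the strictest possible check. Output files that embed timestamps or other run metadata could differ even when the simulated fields agree, so a practical check would likely compare the data fields themselves.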
At ORNL, Abraham is leading the work to build containers that maintain the speed and accuracy of the models that run on Gaea. The team learned that for containers to work, they must contend with a major challenge: Message Passing Interface (MPI) libraries.
MPI is the standard that allows processes running on the many processors of an HPC system to communicate with one another. The climate models rely on MPI libraries that tell the applications how to communicate across the nodes and how to distribute the work. The libraries are critical to running the models correctly, but Abraham and the ORNL team, which included HPC Engineer Matthew Davis and HPC Data Analytics Engineer Ryan Prout, needed to determine the best way to integrate the libraries with the container. The team tested two designs to find a viable solution; the results of their research were presented in May at the 2022 Cray User Group meeting.
In the first design, the hybrid model, an MPI library included in the container communicates with the MPI library on the computing node through an interface known as the Process Management Interface (PMI). The main advantage of the hybrid model is that it allows each container to use a different MPI library from the one on the host compute node or in the other containers running on the system, making it more flexible and adaptable for different computing systems. However, the multiple layers of communication slow down the climate model’s run time.
A second option is called the bind model. In this design, the container uses the MPI library on the host system. Although this design doesn’t offer the same flexibility as the hybrid model, the bind model makes up for it with speed.
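The difference between the two designs shows up in how a containerized model might be launched. The sketch below is illustrative only and rests on several assumptions the article does not confirm: a Slurm `srun` launcher, a Singularity-style container runtime with its `--bind` option and `SINGULARITYENV_` environment prefix, and hypothetical image, executable, and library paths.

```python
def launch_command(image, executable, ranks, host_mpi_dir=None):
    """Build (argv, extra_env) for launching an MPI program in a container.

    host_mpi_dir=None  -> hybrid model: the container uses its own MPI library.
    host_mpi_dir=<dir> -> bind model: the host's MPI directory is mounted
                          into the container and used instead.
    All paths and names are hypothetical placeholders.
    """
    argv = ["srun", "-n", str(ranks), "singularity", "exec"]
    extra_env = {}
    if host_mpi_dir is not None:
        # Mount the host MPI directory at the same path inside the container.
        argv += ["--bind", f"{host_mpi_dir}:{host_mpi_dir}"]
        # Variables prefixed with SINGULARITYENV_ are passed into the container,
        # so the dynamic linker there looks in the host MPI directory.
        extra_env["SINGULARITYENV_LD_LIBRARY_PATH"] = host_mpi_dir
    argv += [image, executable]
    return argv, extra_env


if __name__ == "__main__":
    # Hypothetical image name, executable, and library location.
    print(launch_command("am4.sif", "./am4.x", 4))
    print(launch_command("am4.sif", "./am4.x", 4, host_mpi_dir="/opt/host-mpi"))
```

In this sketch the only difference between the two launches is the bind mount and the library search path, which is why the bind model ties a container to a host with a compatible MPI installation.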
“This design works great for speed because you’re pulling the host MPI library, which is optimized for that particular HPC system to be as fast as possible,” Abraham said.
“The MPI bind is one of the major developments that came out of this collaboration. We needed ORNL to investigate the design for us, and their effort made the model run much faster. It’s more efficient,” said Robinson.
The NOAA team is already seeing the benefits that containers provide to other climate scientists and anticipates using them even for applications that don’t require parallel computing.
“I collaborated with a scientist who used AM4 to study the effects of sea-surface temperature on the number of tropical cyclones per year, and I created a container to share the climate model with him,” said Robinson. “There are other models that don’t require a supercomputer, but you still have to compile all the code, so having the environment and application in the container makes it much easier to run the model.”
At ORNL, Abraham is designing containers to run on Summit, ORNL’s IBM Power System AC922 supercomputer, and he hopes also to make the containers available on Frontier, ORNL’s Hewlett Packard Enterprise Cray EX exascale computer, by applying the skills he learned while working on Gaea.
“For the longest time, HPC has existed just fine without containers, but HPC and containers could end up coexisting. They would be a welcome addition to what’s already being provided by HPC systems,” said Abraham.
UT-Battelle LLC manages ORNL for the DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. The Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit https://energy.gov/science.