PI: Scott Klasky,
Oak Ridge National Laboratory
In 2016, the Department of Energy’s Exascale Computing Project (ECP) set out to develop advanced software for the arrival of exascale-class supercomputers capable of a quintillion (10¹⁸) or more calculations per second. That leap meant rethinking, reinventing, and optimizing dozens of scientific applications and software tools to leverage exascale’s thousandfold increase in computing power. That time has arrived as the first DOE exascale computer, the Oak Ridge Leadership Computing Facility’s Frontier, opened to users around the world. “Exascale’s New Frontier” explores the applications and software technology for driving scientific discoveries in the exascale era.
Why Exascale Needs ADIOS
The arrival of exascale-class supercomputers such as Frontier brings with it the need to effectively manage the massive amounts of data that these systems produce. For computational scientists, what matters most is not just the sheer quantity of data their simulations create but the ability to quickly write, read, and effectively analyze all of it. That’s where the Adaptable IO System, or ADIOS, comes in.
At its core, ADIOS is an input/output, or I/O, software framework that provides a simple, flexible way for scientists to describe, in their code, the data that may need to be written, read, or processed outside of the running simulation. ADIOS greatly increases the efficiency of I/O operations by transporting data as groups of self-describing variables and attributes to different media types. In addition to using the host machine’s file system, ADIOS can transfer data between high-performance computing systems and other facilities by using a common application programming interface. All these features make it considerably easier for researchers to target the results they want to analyze out of huge volumes of data.
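To make the idea of “self-describing variables and attributes” concrete, here is a minimal Python sketch. It is not the ADIOS API and does not reflect how ADIOS lays out data internally; the names `Variable`, `write_group`, `read_index`, and `read_variable`, and the byte layout itself, are hypothetical and chosen only for illustration. The point is that each variable’s name, type, and shape travel with the data, so a reader can discover what is available and pull out a single variable without knowing anything about the code that wrote it.

```python
import json
import struct
from dataclasses import dataclass


@dataclass
class Variable:
    """Hypothetical self-describing variable: metadata travels with the values."""
    name: str
    shape: tuple
    values: list  # flat list of floats in row-major order


def write_group(variables, attributes):
    """Pack variables and attributes into one byte string with a JSON index."""
    payload = b""
    index = {"attributes": attributes, "variables": {}}
    for var in variables:
        raw = struct.pack(f"<{len(var.values)}d", *var.values)
        index["variables"][var.name] = {
            "dtype": "float64",
            "shape": list(var.shape),
            "offset": len(payload),  # where this variable starts in the payload
            "nbytes": len(raw),
        }
        payload += raw
    header = json.dumps(index).encode("utf-8")
    # Layout: 8-byte little-endian header length, then the header, then the payload.
    return struct.pack("<Q", len(header)) + header + payload


def read_index(blob):
    """Return the index and the byte position where the payload begins."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    index = json.loads(blob[8:8 + header_len].decode("utf-8"))
    return index, 8 + header_len


def read_variable(blob, name):
    """Read one variable by name, using only the metadata stored in the blob."""
    index, data_start = read_index(blob)
    meta = index["variables"][name]
    count = meta["nbytes"] // 8  # float64 values are 8 bytes each
    values = struct.unpack_from(f"<{count}d", blob, data_start + meta["offset"])
    return meta["shape"], list(values)


blob = write_group(
    [Variable("temperature", (2, 2), [1.0, 2.0, 3.0, 4.0]),
     Variable("pressure", (3,), [0.5, 0.6, 0.7])],
    {"units": "arbitrary", "step": 0},
)
index, _ = read_index(blob)
print(sorted(index["variables"]))      # discover what the group contains
print(read_variable(blob, "pressure"))  # fetch only the piece of interest
```

A reader that wants only `pressure` never has to decode `temperature`, which mirrors, in a very small way, the idea of targeting specific results within a large volume of data.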
“A simulation produces all this data, but that doesn’t mean everything has to be written. If you want to get some pieces, then you can get it — you can get it to storage or you can just get it in memory and process it. That’s the beauty of ADIOS — it allows for those modalities,” said Scott Klasky, who heads ADIOS development and leads the Workflow Systems Group at the National Center for Computational Sciences at ORNL.
Technical Challenges
Development of ADIOS began in 2008 with a team formed by Klasky. After 14 major releases by 2015, ADIOS’s patchwork of code was becoming difficult to manage and needed a complete revamp to meet the needs of exascale computing.
With ECP support, Klasky was able to hire software engineers to develop a new ADIOS. A larger team was formed with researchers from ORNL, Lawrence Berkeley National Laboratory, Georgia Tech, and Rutgers University, along with software developers from Kitware, Inc. The team designed an entirely new framework to make ADIOS more modular and easier to extend or change. Moreover, every change can now be thoroughly tested to ensure its stability on different systems.
“ADIOS 2.0 was born from scratch in 2016 — zero lines from ADIOS 1.0 were included in the new version. We went from C to the C++ 11 programming language, which completely changed the whole thing. Our main goals were twofold: first, redesign and reimplement the product for file system support for the incoming exascale computers. Second, make data staging production quality so that it can be used by applications daily,” said Norbert Podhorszki, an ORNL computer scientist who oversaw much of the development.
ECP and Frontier Successes
ADIOS 2.9, released at the end of ECP, allows important applications running on the Frontier supercomputer to produce and consume multiple terabytes of data per second using its Orion file system. The new framework also enables the transfer of tens of terabytes per second of data from one half of Frontier’s compute nodes to the other.
ADIOS now has many forms of staging capabilities, including traditional in-situ visualization; staging to other nodes for detached, asynchronous, in-situ processing; and post-processing from files. Most important, all these options work with the same application code, so users can pick the staging strategy that best suits their needs. Kitware also added ParaView support to ADIOS for in-situ processing options in applications that produce ADIOS data.
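The idea that every staging option works with the same application code can be sketched in a few lines of Python. This is an illustration of the design pattern only, not ADIOS itself: `FileEngine`, `MemoryEngine`, `put_step`, and `run_simulation` are hypothetical names, and the JSON-lines file format is an assumption made to keep the example self-contained. The simulation loop is written once against a small interface, and the choice between writing to disk and handing data to an in-memory consumer is made outside that loop.

```python
import json
import os
import tempfile
from collections import deque


class FileEngine:
    """Hypothetical engine that writes each step to disk for later post-processing."""

    def __init__(self, path):
        self._file = open(path, "w", encoding="utf-8")

    def put_step(self, step, variables):
        self._file.write(json.dumps({"step": step, "variables": variables}) + "\n")

    def close(self):
        self._file.close()


class MemoryEngine:
    """Hypothetical engine that hands each step to an in-memory queue for analysis."""

    def __init__(self, queue):
        self._queue = queue

    def put_step(self, step, variables):
        self._queue.append({"step": step, "variables": variables})

    def close(self):
        pass  # nothing to release for an in-memory queue


def run_simulation(engine, steps=3):
    """Application code: identical no matter which engine is passed in."""
    for step in range(steps):
        field = [step * 1.0, step * 2.0]  # stand-in for real simulation output
        engine.put_step(step, {"field": field})
    engine.close()


# Strategy 1: keep the data in memory for a consumer to process.
queue = deque()
run_simulation(MemoryEngine(queue))
print([item["step"] for item in queue])

# Strategy 2: write to a file, with no change to run_simulation.
path = os.path.join(tempfile.mkdtemp(), "output.jsonl")
run_simulation(FileEngine(path))
with open(path, encoding="utf-8") as handle:
    print(sum(1 for _ in handle))  # number of steps written to the file
```

Because `run_simulation` depends only on `put_step` and `close`, a new strategy can be added later without touching the application, which is the property the article highlights.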
ADIOS’s staging capability has been used by the ECP’s Whole Device Modeling fusion application, or WDMApp, to couple multiple simulations of different regions of a fusion device. The simulations exchange data in every iteration, while multiple visualizations are produced of the data flowing through each simulation and the coupling region. Multiple analysis results are also computed in situ, concurrently with the simulations and without slowing them down.
What’s Next?
With over 1 exaflop of computing power and its giant file system, Frontier’s ability to churn out petabytes of data is considerable. But with that volume of data comes new data-management issues, and for the ADIOS team, that means finding new ways of staging data on different types of platforms.
Klasky has a solution that’s comparable to how the thousands of photos you take using your smartphone are made easily available for you to browse. Most of the photos are not actually stored on your phone; they end up on a cloud service. But the photo app on your phone offers representations of those photos so you can see what they look like and select them for downloading or sharing.
“Can we provide that sort of experience with huge data? How do you work with some of the largest datasets now on, say, your laptop? On your cluster?” Klasky asks. “I don’t think everyone has to have Frontier to just get a glimpse of what’s in their data. So that’s a big push — that’s where we’ve been going — how do we do that?”
Support for this research came from the ECP, a collaborative effort of the DOE Office of Science and the National Nuclear Security Administration, and from the DOE Office of Science’s Advanced Scientific Computing Research program. The OLCF is a DOE Office of Science user facility.
UT-Battelle LLC manages ORNL for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. The Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit https://energy.gov/science.