Imagine solving the greatest scientific question of the era but having no way to save your answer.
No need to fear, thanks to Dustin Leverman and his team. Leverman leads the Oak Ridge Leadership Computing Facility’s High-Performance Computing Storage and Archive Group at the U.S. Department of Energy’s Oak Ridge National Laboratory. He oversees installation and acceptance of Orion, the massive file system that will support Frontier, the world’s first exascale supercomputing system and fastest computer in the world.
“Orion is very much a collaborative effort and might not get the spotlight, but we can’t live without it,” Leverman said. “We’re always going to need a place to store data and the results of all this number crunching.”
Frontier set a new record for computing speed when the HPE Cray EX system debuted in May 2022 at an average speed of 1.1 exaflops, or 1.1 quintillion calculations per second. Engineers at the OLCF, which houses Frontier and its predecessor, Summit, expect that speed to be even faster after final tuning.
Teams of scientists have waited to unleash Frontier’s computational power on problems that range from harnessing nuclear fusion to exploring the origins of the universe — and just taking notes as they go won’t work.
“We have one user with a single job on Frontier that will generate 80 petabytes of data,” Leverman said. “One petabyte equals about 500 billion pages of text, so do the math and that gives you some idea of the scale of data produced by these studies. If the user needs to refine the model, there may be even more data generated. We need ways to keep this kind of data available for verification and for future study.”
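To put that in concrete terms, the math Leverman alludes to can be sketched in a few lines of Python. The two constants come from the quote above; the function name is only illustrative, and the pages-per-petabyte figure is the rough estimate he gives, not an exact conversion.

```python
# Figures quoted in the article; the pages-per-petabyte value is approximate.
PAGES_PER_PETABYTE = 500_000_000_000  # "about 500 billion pages of text"
JOB_OUTPUT_PETABYTES = 80             # the single job cited by Leverman


def pages_of_text(petabytes: int) -> int:
    """Convert a data volume in petabytes to an approximate page count."""
    return petabytes * PAGES_PER_PETABYTE


total_pages = pages_of_text(JOB_OUTPUT_PETABYTES)
print(f"{total_pages:,} pages")  # 40,000,000,000,000 pages
```

By that estimate, the one job would produce the equivalent of roughly 40 trillion pages of text.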
Orion consists of 50 cabinets with capacity for up to 700 petabytes of data, spread across a three-tiered system of flash memory, spinning disk drives and other nonvolatile media. The system relies on open-source Lustre and ZFS technologies.
“We’re taking a hybrid approach to this system because these spinning mechanical drives by themselves aren’t fast enough to keep up with the processing speeds,” Leverman said. “The more time a computer burns stopping to save data, the less time you have for computational workloads. We don’t want 20% of anyone’s time on Frontier to be spent on just writing data.”
Files will be distributed across:
- a flash-based performance tier of 5,400 nonvolatile memory express, or NVMe, devices providing 11.5 petabytes of capacity at peak read-write speeds of 10 terabytes per second with more than 2 million random-read input-output operations per second;
- a hard disk–based capacity tier of 47,700 perpendicular magnetic recording devices providing 679 petabytes of capacity at peak read speeds of 4.6 terabytes per second and peak write speeds of 5.5 terabytes per second; and
- a flash-based metadata tier of 480 NVMe devices providing an additional capacity of 10 petabytes.
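The tier figures above can be checked against each other with a short Python sketch. The device counts, capacities and peak speeds are taken from the list; the class and function names are hypothetical, decimal units (1 petabyte = 1,000 terabytes) are an assumption, and the timing is a best case that treats the quoted peak write speed as if it could be sustained.

```python
from dataclasses import dataclass
from typing import Optional

TB_PER_PB = 1_000  # assumption: decimal units


@dataclass(frozen=True)
class Tier:
    name: str
    devices: int
    capacity_pb: float
    peak_write_tbps: Optional[float]  # None where the article gives no figure


TIERS = [
    Tier("performance (NVMe flash)", 5_400, 11.5, 10.0),
    Tier("capacity (hard disk)", 47_700, 679.0, 5.5),
    Tier("metadata (NVMe flash)", 480, 10.0, None),
]


def total_capacity_pb(tiers) -> float:
    """Sum the quoted capacity of every tier, in petabytes."""
    return sum(t.capacity_pb for t in tiers)


def hours_to_write(petabytes: float, peak_write_tbps: float) -> float:
    """Best-case hours to write a volume at a sustained peak rate."""
    seconds = petabytes * TB_PER_PB / peak_write_tbps
    return seconds / 3600


total = total_capacity_pb(TIERS)
hours = hours_to_write(80, TIERS[1].peak_write_tbps)
print(f"Total capacity: {total} PB")
print(f"80 PB at the capacity tier's peak write speed: about {hours:.1f} hours")
```

The three tiers add up to 700.5 petabytes, in line with the roughly 700 petabytes cited for the system as a whole. Under these idealized assumptions, an 80-petabyte job would need about four hours of writing at the hard disk tier's peak speed alone.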
Eventually, even that much storage won’t be enough, so files older than 90 days will be moved to a data archive.
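A 90-day rule of this kind can be illustrated with a minimal Python sketch. This is not the OLCF's actual tooling, which the article does not describe. The function names are hypothetical, and using a file's last-modified time to define "older" is an assumption, since the article does not say which timestamp applies.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Iterator, Optional

ARCHIVE_AGE = timedelta(days=90)  # threshold quoted in the article


def is_archive_candidate(
    modified: datetime, now: datetime, max_age: timedelta = ARCHIVE_AGE
) -> bool:
    """Return True if the file is strictly older than the threshold."""
    return now - modified > max_age


def archive_candidates(root: Path, now: Optional[datetime] = None) -> Iterator[Path]:
    """Yield regular files under root whose last-modified time exceeds the threshold."""
    now = now or datetime.now(timezone.utc)
    for path in root.rglob("*"):
        if path.is_file():
            # Assumption: modification time stands in for the file's age.
            modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
            if is_archive_candidate(modified, now):
                yield path
```

Calling `archive_candidates(Path("some/project/dir"))` would list the files a policy like this might select; moving them to an archive is left out of the sketch.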
Frontier marks the fourth leadership-class supercomputing system Leverman has helped deploy. He joined the OLCF in 2009 and worked on not only Summit but also its predecessors, Titan and Jaguar, each the world’s fastest supercomputer at the time of launch.
“There are always challenges throughout the process, and this has been no different,” Leverman said. “Each of these machines has helped change the future of science, and we’re excited to help make that possible.”
UT-Battelle LLC manages Oak Ridge National Laboratory for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.