Users of the Oak Ridge Leadership Computing Facility (OLCF) are experiencing big changes in how their data is stored, and for the better. With the OLCF’s High Performance Storage System (HPSS) set to be decommissioned in early 2025 after decades of service, users are becoming familiar with Kronos, a replacement that is already proving its ease of use and efficiency for data storage.
Kronos is a 134 PB, multiprogrammatic nearline storage system that also provides tape-based backups of all data as a disaster-recovery measure. The system uses IBM’s Storage Scale, known to many as the General Parallel File System (GPFS), and the disaster-recovery component leverages a magnetic tape format called the Linear Tape File System (LTFS). The two tiers are linked by IBM’s Storage Archive hierarchical storage manager.
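Conceptually, this arrangement behaves like a classic hierarchical storage manager: every file has a disk presence in Storage Scale and, once archived, a twin on LTFS tape. The sketch below models those states in Python, borrowing the state names commonly used by HSM products such as IBM Storage Archive (resident, premigrated, migrated). It is an illustration of the concept only, not OLCF’s actual implementation.

```python
from dataclasses import dataclass
from enum import Enum

class FileState(Enum):
    RESIDENT = "resident"        # on disk only, not yet copied to tape
    PREMIGRATED = "premigrated"  # identical copies on disk and tape
    MIGRATED = "migrated"        # on tape only; per the description above,
                                 # Kronos keeps the disk copy, so files
                                 # should not end up in this state

@dataclass
class ManagedFile:
    path: str
    state: FileState = FileState.RESIDENT

    def archive(self) -> None:
        """Copy the file to tape while keeping it on disk.

        On Kronos, files effectively live in this dual-copy state:
        reads are always served from disk, and the tape copy exists
        only for disaster recovery.
        """
        if self.state is FileState.RESIDENT:
            self.state = FileState.PREMIGRATED
```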
With 134 PB of space, Kronos is so massive that approximately six copies of the entire Library of Congress could be stored on its drives.
“It’s exciting to provide a different tool for the researchers’ toolbox,” Brenna Miller, high-performance computing systems engineer and system owner of Kronos, said. “Ultimately, my job is to help provide them with the tools they need to do their research. That’s the best part for me — providing something that will hopefully improve their experience and enhance their ability to do their jobs.”
Planning for Kronos began over three years ago, and the system went into production on July 31, 2024. For users, the interface will be familiar, but the consistent speed at which they can access their data is a marked improvement.
One main difference from HPSS is that Kronos provides resilience by storing two copies of every file: one on disk and one on tape. However, users will never need to recall a file from tape because that copy exists strictly as a disaster-recovery mechanism.
“We have all the data protections of having tape on the back end, but we have the performance of disks up front that the users interact with,” said Dustin Leverman, group leader for the HPC Storage and Archive Group. “As users update those files on disk, those same files are also updated on tape. It’s a different way of thinking about long-term storage.”
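In other words, the tape tier trails the disk tier: when a file’s disk copy changes, its tape copy is refreshed in the background. Below is a minimal sketch of that bookkeeping, assuming a hypothetical index, tape_mtimes, that records when each file was last written to tape; a real HSM keeps this information in file system metadata rather than a Python dictionary.

```python
import os

def stale_tape_copies(disk_root: str, tape_mtimes: dict[str, float]) -> list[str]:
    """Find files whose disk copy is newer than their last tape copy.

    'tape_mtimes' is a hypothetical index mapping each path to the
    time its tape copy was written. Any file modified on disk since
    then needs a fresh copy sent to tape.
    """
    stale = []
    for dirpath, _, filenames in os.walk(disk_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > tape_mtimes.get(path, 0.0):
                stale.append(path)  # disk copy is newer; re-copy to tape
    return stale
```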
Planning for Kronos involved evaluating usage patterns on HPSS and learning what users needed from data storage. Ultimately, the goal was a clearer understanding of those needs so that the new system could accommodate them.
“We know there are users who truly need their data stored forever,” Miller said. “The planning also included surveying HPSS power users and gathering the use cases that we could glean from access patterns.”
The overall transition from HPSS to Kronos has been remarkably smooth so far. The system became available to early users on July 31, and 8 PB of data was ingested in the first 30 days. On Aug. 31, HPSS became read-only, and data can no longer be written to the file system. The final step in the process will occur on Jan. 31, 2025, when HPSS is officially decommissioned. After that, any data remaining on HPSS will be unavailable for retrieval.
Leverman and Miller want to emphasize that users of OLCF computing systems should not wait to start moving their data. Users should act quickly and use the “hsi_xfer” tool, which provides optimized recall of data from tape. Using this tool reduces both transfer time and undue load on the HPSS system. Moreover, handling the transfer sooner rather than later gives users time to address potential issues should they arise.
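For users scripting their migration, the key idea behind such a tool is batching: handing it many files at once lets it order tape recalls efficiently instead of issuing one-off requests. A rough Python sketch follows; the hsi_xfer command line shown is an assumption made for illustration, and the OLCF user documentation remains the authority on its actual arguments.

```python
import subprocess

def recall_batch(hpss_paths: list[str], destination: str) -> None:
    """Recall many files from HPSS in a single hsi_xfer run.

    The command line below is a hypothetical invocation; check the
    OLCF documentation for hsi_xfer's real arguments. The point is
    that one batched request lets the tool sort recalls by tape and
    position, minimizing mounts and seeks, which is what makes it
    faster than transferring files one at a time.
    """
    cmd = ["hsi_xfer", *hpss_paths, destination]  # hypothetical syntax
    subprocess.run(cmd, check=True)
```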
It’s vital for users to understand that, although they have time to move their data, waiting until the last minute will make the process difficult.
“We’re concerned that if all of the users wait until January to transfer their data, we will run out of time,” said Leverman. “We want to preserve all of their critical data.”
Kronos was designed to be extensible, so even as hardware becomes deprecated and is retired, new hardware can be deployed within the system without service interruption. The design was intended to accommodate a variety of needs and will be able to support non-OLCF programs as well.
“If this specific technology stack still meets the needs, if it’s still checking all the boxes for use cases in five years, then there’s no reason why it can’t remain the same platform and keep going,” Miller said.
The OLCF, a Department of Energy Office of Science user facility at Oak Ridge National Laboratory, helps researchers solve some of the world’s most challenging scientific problems with a combination of world-class HPC resources and unparalleled expertise in scientific computing. To learn more, visit olcf.ornl.gov.
UT-Battelle manages ORNL for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. The Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit energy.gov/science.