Changing a common perception about checkpoint frequency leads to better I/O performance for supercomputer users
In computing, checkpointing is a process in which a long-running application periodically saves information about the “state” of its progress to permanent storage. In the event of an application or system failure, the application can then be restarted from the last saved “state,” minimizing the amount of work lost to the failure.
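To make the mechanism concrete, here is a minimal sketch of application-level checkpointing in Python. The file name, the contents of the state, and the one-hour interval are illustrative assumptions, not details taken from any OLCF application.

```python
import os
import pickle
import time

CHECKPOINT_FILE = "state.ckpt"   # hypothetical path, chosen for illustration
CHECKPOINT_INTERVAL = 3600       # seconds between checkpoints (illustrative)

def save_checkpoint(state):
    """Write the application state to permanent storage."""
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT_FILE)  # atomic rename: a crash mid-write
                                      # cannot corrupt the last good checkpoint

def load_checkpoint():
    """Restart from the last saved state, or start fresh if none exists."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "rb") as f:
            return pickle.load(f)
    return {"step": 0}

state = load_checkpoint()
last_ckpt = time.monotonic()
while state["step"] < 1_000_000:
    state["step"] += 1            # stand-in for real computation
    if time.monotonic() - last_ckpt >= CHECKPOINT_INTERVAL:
        save_checkpoint(state)
        last_ckpt = time.monotonic()
```

Writing to a temporary file and renaming it is a common defensive pattern: if the application dies while saving, the previous checkpoint remains intact.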
Users’ perceptions of how frequently they should checkpoint vary widely and are often subjective. Excessive checkpointing means less time spent solving the problem at hand and more time generating input/output (I/O) on a shared resource. The balance is delicate, because unnecessary I/O activity compromises system efficiency. As computing resources approach exascale (a thousandfold increase over current supercomputers), maintaining or even increasing efficiency during checkpointing will be paramount.
When it comes to checkpointing on high-performance computing systems, being a bit “lazy” can actually be a good thing. At the Oak Ridge Leadership Computing Facility (OLCF), a US Department of Energy Office of Science User Facility, computer scientists Devesh Tiwari and Saurabh Gupta see laziness, in the form of a changed perception that reduces the frequency of application-level checkpoints, as increasingly important in recovering from computer failures on larger and faster machines.
“People have a general notion of the mean time between failures [MTBF], which may indicate that every 10 hours or 20 hours, you will have a failure,” Tiwari said. If an uncorrectable error occurs on hardware used by an application, the application may quit unexpectedly or produce unintended results. So, given runtime policies that may allow applications to run for days at a time, and the expectation that an application is potentially exposed to one or more errors during that time, most researchers choose to take periodic checkpoints during their supercomputing runs.
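A classical way to turn an MTBF estimate into a checkpoint interval is Young’s approximation (later refined by Daly), which balances the cost of writing checkpoints against the work expected to be lost at a failure. It is not specific to this study, but it shows the trade-off concretely; the numbers below are illustrative.

```python
import math

def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's approximation for the optimal checkpoint interval:
    tau = sqrt(2 * C * M), where C is the time to write one checkpoint
    and M is the mean time between failures."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Example: a 5-minute checkpoint on a system with a 20-hour MTBF
tau = young_interval(5 * 60, 20 * 3600)
print(f"Checkpoint roughly every {tau / 3600:.1f} hours")  # ~1.8 hours
```

Note that this formula assumes failures arrive uniformly at the MTBF rate, which is exactly the assumption the OLCF analysis calls into question.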
As supercomputers increase in size, the sheer volume of checkpoint activity can significantly affect shared networks and file systems. These I/O constraints led Tiwari and Gupta, of the National Center for Computational Sciences Technology Integration (TechInt) team, to take a closer look at checkpointing on supercomputers.
After examining several supercomputers at the OLCF, including the Cray XK7 Titan, as well as large systems at other centers, Tiwari and Gupta were able not only to characterize the failure patterns in these large systems but also to debunk a common perception about MTBF. Their analysis concluded not only that the vast majority of failures occur before the statistical MTBF but also that subsequent errors are clustered in time, occurring close to the originating hardware failure. For Tiwari and Gupta, this demonstrated the need for a different approach to checkpointing.
The OLCF, on the campus of DOE’s Oak Ridge National Laboratory, houses one of the most stable supercomputing systems anywhere. Hundreds of thousands of applications run over the course of a year, and few encounter uncorrectable errors. However, when users take an overly conservative stance toward checkpointing, the extra I/O demand from excess checkpoints can strain the efficiency of OLCF computer systems.
Based on their observation that system failures are unevenly distributed over time and cluster close to the original event, Tiwari and Gupta proposed that checkpoint frequency should track this pattern. A single hardware failure, even if it directly affects only one application, should signal other applications to increase their checkpoint frequency to ensure they are adequately protected from this clustered-failure effect. As the time since the last observed failure grows, the need to checkpoint decreases, so the system can allow an application to take fewer checkpoints. Of course, there is a limit to a good thing: the proposed technique places upper bounds on how relaxed one can be before disrupting the balance between checkpointing and an eventual hardware failure.
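As an illustration only (not the authors’ published algorithm), a “lazy” policy might reset to a short interval whenever any failure is observed on the system, then stretch the interval as failure-free time accumulates, up to a cap. All parameters below are assumptions chosen for the sketch.

```python
# Illustrative parameters, not values from the paper.
BASE_INTERVAL = 2 * 3600      # interval used just after a failure (s)
MAX_INTERVAL = 12 * 3600      # upper bound on laziness (s)
RELAX_FACTOR = 1.5            # growth per failure-free interval

def next_checkpoint_interval(seconds_since_last_failure):
    """Grow the checkpoint interval as failure-free time accumulates,
    reflecting the observation that failures cluster in time."""
    interval = BASE_INTERVAL
    elapsed = 0.0
    while elapsed + interval < seconds_since_last_failure and interval < MAX_INTERVAL:
        elapsed += interval
        interval = min(interval * RELAX_FACTOR, MAX_INTERVAL)
    return interval

# Shortly after a system failure, checkpoint often ...
print(next_checkpoint_interval(10 * 60) / 3600)    # 2.0 hours
# ... but after a long failure-free stretch, relax toward the cap.
print(next_checkpoint_interval(48 * 3600) / 3600)  # 12.0 hours
```

Here `seconds_since_last_failure` would be reset whenever a hardware failure is reported anywhere on the system, capturing the idea that one failure warns all applications.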
Tiwari and Gupta wrote a paper supporting “lazy” checkpointing and presented their results at the 2014 Institute of Electrical and Electronics Engineers/International Federation for Information Processing (IEEE/IFIP) International Conference on Dependable Systems and Networks (DSN). They were finalists for the best paper award at DSN, one of the premier conferences for reliability and fault tolerance research in supercomputing.
Gupta said that a smarter system for checkpointing data will give all users a better experience on supercomputing resources. “If we reduce I/O, which is such a constrained resource on supercomputers, by even 20 to 30 percent of the I/O volume, that’s a big advantage,” Gupta said. “That means less contention, and everyone sees better performance. That’s the side effect that we see from this, and it is a positive byproduct.”
Though Tiwari and Gupta said the TechInt team is available to help users find smarter ways to checkpoint while running their applications on Titan, they ultimately want to create a tool that helps users make more efficient checkpoint decisions. The tool will draw on system- and job-related parameters such as job size, checkpoint size, file system performance, and system MTBF. —Eric Gedenk
Oak Ridge National Laboratory is supported by the US Department of Energy’s Office of Science. The single largest supporter of basic research in the physical sciences in the United States, the Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.