RAIT allows OLCF staff members to place users’ data sets on multiple tape drives while also having a master backup, or parity tape, in case a data tape is lost or damaged.

New method for backing up data on the High-Performance Storage System increases reliability

When staff at the US Department of Energy’s (DOE’s) Oak Ridge National Laboratory began using petascale computers—machines capable of at least a quadrillion calculations per second—they anticipated that data demands would grow.

And grow they have.

The Oak Ridge Leadership Computing Facility (OLCF), a DOE Office of Science User Facility, has used its tape-driven High-Performance Storage System (HPSS) since 1998 to manage massive amounts of data coming from its users. The OLCF now adds more than a petabyte of new data roughly every 11 weeks, and as computers increase in size and power, demands on data storage will only continue to grow.

Tape storage can offer more security than storing data on disks or disk arrays because it tolerates a wider range of environmental conditions. In addition, tapes do not need power unless data is being moved on or off a tape. Until recently, though, tape storage still carried some risk for particularly large data sets.

For data that could fit on a single tape, OLCF staff used to “stripe” the information across two tapes. If one of the tapes was damaged or lost, though, there was no recourse for recovering that tape’s data. Staff rarely striped data across more than two tapes because of that risk.

Enter the Redundant Array of Independent Tapes (RAIT) feature. The OLCF’s RAIT feature allows incoming data to be striped across four tapes and adds an extra tape, called a parity tape, from which the data can be reconstructed if a tape is damaged or lost.

“RAIT uses the same algorithm that modern disk arrays employ,” said Jason Hill, OLCF staff member. “In our case, you can take a byte of data and split it into 8 bits. Put those 8 bits into groups of 2, then calculate a checksum of all those data points, and put it into a separate parity segment.” This “4 + 1” approach (four data tapes plus one parity tape) gives users a way to recover from a tape failure, an option that was never before available for data saved on tape.
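The parity idea described above can be illustrated with a short Python sketch. It assumes a simple XOR parity of the kind used in single-parity disk arrays; the article does not specify the exact calculation HPSS performs, and the function names `stripe_with_parity` and `rebuild` are hypothetical. For simplicity, each stripe here is a contiguous block of the data rather than an interleaved one.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings position by position."""
    return bytes(x ^ y for x, y in zip(a, b))


def stripe_with_parity(data: bytes, n_data: int = 4):
    """Split data into n_data equal stripes and compute one XOR parity stripe.

    Assumption: XOR parity stands in for whatever checksum HPSS really uses.
    """
    stripe_len = -(-len(data) // n_data)  # ceiling division
    padded = data.ljust(stripe_len * n_data, b"\0")  # pad the last stripe
    stripes = [
        padded[i * stripe_len:(i + 1) * stripe_len] for i in range(n_data)
    ]
    parity = reduce(xor_bytes, stripes)
    return stripes, parity


def rebuild(stripes, parity, lost: int) -> bytes:
    """Reconstruct the stripe at index `lost` from the survivors and parity."""
    survivors = [s for i, s in enumerate(stripes) if i != lost]
    return reduce(xor_bytes, survivors, parity)


stripes, parity = stripe_with_parity(b"OLCF user data set")
# Pretend the third "tape" is damaged and recover its contents.
recovered = rebuild(stripes, parity, lost=2)
assert recovered == stripes[2]
```

Because XOR is its own inverse, combining the parity stripe with the three surviving stripes yields the missing one, which is why a single parity tape is enough to survive the loss of any one data tape in this simplified model.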

OLCF staff also can use a “7 + 2” approach—striping data across seven tapes and having two parity segments—if data sets are large enough, but to date, this has not been necessary.

The storage team is currently working on a strategy that would take single-copy stored files and place them into a data-protected storage class. This means that data will go into either a class that supports the RAIT feature or a storage class where staff will automatically store two copies of the same data on separate physical tapes.
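To make the difference between the two protected storage classes concrete, the sketch below compares how much raw tape each scheme consumes per unit of user data. This is plain arithmetic based on the tape counts given in the article; the helper name `raw_tape_per_unit` is hypothetical, and the article does not say how the OLCF weighs these costs.

```python
from fractions import Fraction


def raw_tape_per_unit(data_tapes: int, parity_tapes: int) -> Fraction:
    """Raw tape written per unit of user data for a data + parity layout."""
    return Fraction(data_tapes + parity_tapes, data_tapes)


schemes = {
    "RAIT 4 + 1": raw_tape_per_unit(4, 1),
    "RAIT 7 + 2": raw_tape_per_unit(7, 2),
    "two full copies": Fraction(2),  # every byte is stored twice
}

for name, factor in schemes.items():
    print(f"{name}: {float(factor):.2f}x raw tape per unit of user data")
```

Under these assumptions, the 4 + 1 layout writes 1.25 units of tape per unit of data and the 7 + 2 layout about 1.29, while keeping two full copies writes 2.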

RAIT recently became available to users, and the OLCF provides documentation explaining how to use the feature.

More information about HPSS upgrades at the OLCF is available on the OLCF website.

Oak Ridge National Laboratory is supported by the US Department of Energy’s Office of Science. The single largest supporter of basic research in the physical sciences in the United States, the Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.