
OLCF TechInt team develops, deploys HPSS data quality assurance framework

A new data quality assurance (QA) tool developed by the Technology Integration Group (TechInt) at the Oak Ridge Leadership Computing Facility (OLCF), a DOE Office of Science User Facility, will help protect archived data by validating the integrity of the High Performance Storage System (HPSS) archive.

The HPSS provides longer-term storage for the large amounts of data created on the compute systems at the OLCF. The mass storage facility consists of tape and disk storage components, servers, and HPSS software. Over the course of decades, about 55 million data files have been accumulated and safely stored.

This new tool, the HPSS Integrity Crawler (HPSS IC), works by drawing a statistical sample of datasets from the overall population on every iteration. The sample is drawn across several dimensions, such as file size, age, media type, and project. The tool also provides a plug-in framework that allows new data QA mechanisms to be developed quickly and deployed easily, and it builds a queryable repository of integrity metadata based on the QA schemes. It has been tested and refined since becoming operational in early 2014.
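One common way to sample across dimensions like these is stratified sampling: group the files by the chosen attributes and draw from every group, so that rare combinations are not missed. The article does not say which sampling scheme HPSS IC actually uses, so the following is only a minimal sketch of the idea. The function names, the size thresholds, and the file records are all hypothetical.

```python
import random
from collections import defaultdict

def size_bucket(nbytes):
    """Map a file size to a coarse bucket (thresholds are illustrative assumptions)."""
    if nbytes < 1_000_000:
        return "small"
    if nbytes < 1_000_000_000:
        return "medium"
    return "large"

def stratified_sample(files, key, fraction, seed=None):
    """Draw about `fraction` of the files from each stratum, at least one per stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for f in files:
        strata[key(f)].append(f)          # group files that share the same attributes
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical archive listing, not real HPSS metadata.
files = [
    {"path": "/proj_a/run1.dat", "size": 4_000, "media": "disk", "project": "proj_a"},
    {"path": "/proj_a/run2.dat", "size": 9_000, "media": "disk", "project": "proj_a"},
    {"path": "/proj_a/big1.h5", "size": 5_000_000_000, "media": "tape", "project": "proj_a"},
    {"path": "/proj_b/big2.h5", "size": 7_000_000_000, "media": "tape", "project": "proj_b"},
    {"path": "/proj_b/mid.nc", "size": 20_000_000, "media": "tape", "project": "proj_b"},
]

# Stratify on (size bucket, media type); other dimensions could be added to the key.
sample = stratified_sample(
    files,
    key=lambda f: (size_bucket(f["size"]), f["media"]),
    fraction=0.5,
    seed=0,
)
```

Drawing a fresh sample on each iteration (for example, by changing the seed) would gradually cover more of the population without scanning all of it at once.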

For example, the Checksum Verifier plug-in allows for an integrity check by scanning HPSS files and computing their checksums. By building and tracking checksums for a statistical sample of files in the archive, the tool can help increase confidence in the accuracy and completeness of the data stored in the archive.

“A checksum is a fingerprint of the file,” said Sudharshan Vazhkudai, group leader for TechInt. “The integrity crawler creates this fingerprint, catalogs it, and recomputes the checksum to see if the file still matches.”

Tom Barron, a member of the TechInt team and developer of the tool, added, “This shows us that the file hasn’t been changed or corrupted in any way.”
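The fingerprint-and-recheck cycle described here can be sketched in a few lines. This is an illustration under stated assumptions, not the Checksum Verifier's actual code: the article does not name a hash algorithm, so SHA-256 from Python's standard `hashlib` is assumed, and the in-memory `catalog` dictionary stands in for whatever metadata repository the tool really uses.

```python
import hashlib

def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Compute a file's checksum, reading in chunks so large files fit in memory."""
    digest = hashlib.new(algorithm)       # algorithm choice is an assumption
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path, catalog):
    """Catalog the checksum on first sight; on later visits, compare against it.

    `catalog` is a hypothetical dict mapping path -> recorded checksum.
    Returns True when the file still matches its recorded fingerprint.
    """
    current = file_checksum(path)
    recorded = catalog.setdefault(path, current)
    return recorded == current
```

The first call for a given path records the fingerprint and returns `True`; any later call returns `False` if the file's contents have changed since then.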

The Tape Copy Checker plug-in ensures appropriate redundancy when requests are made to store multiple copies of a file. The plug-in looks to see how many copies of a file should be present, then checks to make sure that many copies actually are stored.
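In essence, this check compares a requested copy count with an observed one. A minimal sketch follows, assuming both counts are already available as simple path-to-count mappings; how the real plug-in obtains them from HPSS is not described in the article, and the function name is hypothetical.

```python
def find_missing_copies(requested, stored):
    """Return files that have fewer stored copies than were requested.

    `requested` and `stored` are hypothetical dicts mapping path -> copy count.
    The result maps each deficient path to (requested_count, stored_count).
    """
    return {
        path: (want, stored.get(path, 0))
        for path, want in requested.items()
        if stored.get(path, 0) < want
    }
```

A file absent from `stored` is treated as having zero copies, so it is reported rather than silently skipped.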

The migration checker verifies whether files have been migrated across the media tiers to their end locations, while the purge record checker scans the logs for outstanding data purge operations. “Over time, the tool will touch all components of HPSS,” Barron said. Vazhkudai added, “We are collecting a wealth of metadata.”

The plug-in aspect of the HPSS IC framework is key to the design of the tool because plug-ins can be added as the need arises. If there is a specific way files should be validated, a plug-in can be designed to serve that purpose.
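A plug-in design of this kind is often built around a registry that the crawler iterates over, so adding a check means adding one function. The sketch below shows that general pattern only; it is not the HPSS IC plug-in interface, and the registry, decorator, check names, and record fields are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical registry: plug-in name -> check function taking one file record.
PLUGINS: Dict[str, Callable[[dict], bool]] = {}

def plugin(name):
    """Decorator that registers a check function under `name`."""
    def register(func):
        PLUGINS[name] = func
        return func
    return register

@plugin("migrated")
def migrated_ok(record):
    # Passes when the file has reached its intended media tier.
    return record["tier"] == record["target_tier"]

@plugin("nonempty")
def nonempty_ok(record):
    # Example of a site-specific rule added later without touching the crawler.
    return record["size"] > 0

def run_all(record):
    """Run every registered plug-in against one file record."""
    return {name: check(record) for name, check in PLUGINS.items()}
```

For a record such as `{"tier": "tape", "target_tier": "tape", "size": 1024}`, `run_all` returns one pass/fail result per registered plug-in, which is the sort of per-file integrity metadata a repository could store.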

“We want to evaluate everyone’s data in HPSS and even make the tool available to administrators at other sites for their use,” Vazhkudai said. “The goal is for it to be as lightweight as possible. The primary intent is not to interfere with operations while we build confidence in our data holdings. IC is a great tool for the data QA framework.”

A future plug-in called Drill Instructor “will run HPSS through its paces,” Barron said. “It will make sure tapes are still good, drives are still good, disks are still working. It will test all the components.”

Also under development is a dashboard that Barron hopes to complete this year. “The dashboard will allow a manager to see what is verified and cataloged,” Vazhkudai said. It will allow the data collected to be visible in summary form and enable an administrator to tune the “knobs” to affect the behavior of the crawler.


The Oak Ridge National Laboratory is supported by the Office of Science of the U.S. Department of Energy. The Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.