DataFed, an ORNL-developed platform, allows for domain-specific metadata customization that fits the needs of complex research collaboration networks

This story is the fourth in a series called “Data Matters,” a collection of features dedicated to exploring the relevance of data for scientific advancement and exploration at the Oak Ridge Leadership Computing Facility, across Oak Ridge National Laboratory, and beyond.

The increasingly collaborative nature of scientific research has established a need to share data, but keeping it organized in a way that’s effective and beneficial to the team can be challenging.

DataFed, a scientific data management system (SDMS) that helps scientists keep track of their data assets throughout a research project, is designed to enable collaborative scientific work across multiple facilities.

Developed by the National Center for Computational Sciences (NCCS), DataFed was created as a “lightweight, distributed SDMS that spans a federation of storage systems within a loosely coupled network of scientific facilities.”

The system is available to researchers and users at the Oak Ridge Leadership Computing Facility (OLCF)—a US Department of Energy (DOE) Office of Science User Facility located at Oak Ridge National Laboratory (ORNL)—who are looking to eliminate silos in data sharing, said Dale Stansberry, systems programmer in OLCF’s Advanced Data and Workflows Group (ADWG) and leader of the project.

“We tend to pay a lot of attention to data acquisition systems, but data management is just as important. In the old days we used to keep everything in lab notebooks, but we are now dealing with larger volumes of information, and that’s simply not enough. That’s how we started looking for and then building a solution,” said Stansberry, who created the tool inspired by his own experiences as a researcher.

The perfect fit

Prior research has shown that as much as 50 to 80 percent of the time spent on a research project is devoted to data management. Scientists expect this share to rise, particularly in fields such as machine learning and genetics, where simulations generate new data, often in the terabyte range.

“Researchers continue to use suboptimal data practices because they lack appropriate data management infrastructure,” said Suhas Somnath, computer scientist in NCCS’s Advanced Technologies Section.

Although data management systems are not a new thing, DataFed is different in that it is domain agnostic and offers high levels of customization, two features that could translate into the flexibility scientists need to manage their data.

When using traditional file systems, researchers embed key information into the paths and names of files because this is how such systems index searchable data, explained Somnath: “DataFed solves this problem with a search capability that enables users to discover data using powerful metadata queries, allowing scientists to zero in on domain-specific fields and provide very specific results.”
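The difference between path-encoded information and a metadata query can be illustrated with a minimal Python sketch. It does not use the DataFed API; the record layout, the field names (`instrument`, `temperature_k`, `sample`), and the `search` helper are assumptions made up for illustration.

```python
import operator

# Hypothetical records: field names and values are illustrative only and
# are not taken from DataFed or from any real dataset.
RECORDS = [
    {"id": "rec-1", "metadata": {"instrument": "afm", "temperature_k": 300, "sample": "BaTiO3"}},
    {"id": "rec-2", "metadata": {"instrument": "stem", "temperature_k": 77, "sample": "BaTiO3"}},
    {"id": "rec-3", "metadata": {"instrument": "afm", "temperature_k": 120, "sample": "SrTiO3"}},
]

# Comparison operators supported by this toy query helper.
OPS = {"==": operator.eq, "<": operator.lt, ">": operator.gt}


def search(records, criteria):
    """Return ids of records whose metadata satisfies every criterion.

    Each criterion is a (field, op, value) tuple. A record that lacks
    the field does not match.
    """
    hits = []
    for record in records:
        meta = record["metadata"]
        if all(
            field in meta and OPS[op](meta[field], value)
            for field, op, value in criteria
        ):
            hits.append(record["id"])
    return hits


if __name__ == "__main__":
    # Find AFM measurements taken below 200 K, without parsing file names.
    print(search(RECORDS, [("instrument", "==", "afm"), ("temperature_k", "<", 200)]))
    # prints ['rec-3']
```

Because the criteria refer to named fields rather than positions in a file path, adding a new domain-specific field in this sketch only requires adding a key to the metadata.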

All this means that metadata is a key portion of the tool, said Jessica Breet, an Oak Ridge Associated Universities (ORAU) research associate in distributed computing: “By setting metadata fields that appropriately reflect a domain, we can accurately interpret what this information is telling us, what it represents, and how to use it in the context of a different domain.”

Other platforms such as laboratory information management systems (LIMS), which are regularly used in domains such as bioinformatics, typically come with preset fields that the researcher fills in.

“The problem with that is that sometimes you don’t exactly know what you are going to do, where you will be going, what you will be finding,” Stansberry said.

“If you’re experimenting, oftentimes you don’t have a stepwise A-B-C process. And as your project evolves, your data and the way you save it and share it evolve too. So you need a tool that is capable of accommodating that.”

DataFed offers the possibility to organize metadata with fields that are relevant to the research team. However, developing metadata schemas (structures that define required and optional fields, elements, and the relationships between elements of metadata) in a way that is useful for current and future researchers can be challenging, and it can require the assistance of library and information science experts, something not all scientists have at hand.
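As a rough illustration of what a schema with required and optional fields does, the following Python sketch checks a metadata record against a small, invented schema. The schema layout, the field names, and the `validate` function are hypothetical and do not represent DataFed's schema format.

```python
# Hypothetical schema: maps each field name to the Python type(s) it accepts.
# The fields are invented for illustration, not taken from DataFed.
SCHEMA = {
    "required": {"instrument": str, "sample": str, "temperature_k": (int, float)},
    "optional": {"operator_notes": str},
}


def validate(metadata, schema):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    # Required fields must be present and have an accepted type.
    for field, expected in schema["required"].items():
        if field not in metadata:
            problems.append(f"missing required field: {field}")
        elif not isinstance(metadata[field], expected):
            problems.append(f"wrong type for field: {field}")
    # Optional fields may be absent, but are type-checked when present.
    for field, expected in schema["optional"].items():
        if field in metadata and not isinstance(metadata[field], expected):
            problems.append(f"wrong type for field: {field}")
    # Fields the schema does not know about are reported, not silently kept.
    known = set(schema["required"]) | set(schema["optional"])
    for field in metadata:
        if field not in known:
            problems.append(f"unknown field: {field}")
    return problems


if __name__ == "__main__":
    record = {"instrument": "afm", "temperature_k": "cold", "color": "blue"}
    for problem in validate(record, SCHEMA):
        print(problem)
```

Reporting unknown fields instead of rejecting the record outright is one possible design choice; a team whose project is still evolving might prefer to accept extra fields and promote them into the schema later.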

At OLCF, this role is filled by Katie Knight, a data engineer with expertise in the Findable, Accessible, Interoperable, and Reusable (FAIR) Data Initiative, a set of principles that assists scientists in improving the FAIRness of digital assets.

“Categories are directly connected to the context in which they and their mode of organization are created and used. They can change over time and mean different things from one domain to the next, which is why the objectives and discourse of a given domain should be understood and employed when developing metadata concepts for any organization and retrieval system that incorporates them. This means that a system that allows for domain-specific metadata is essential,” Knight said.

ORNL’s Center for Nanophase Materials Sciences (CNMS) is putting this flexibility to the test by piloting DataFed in their workflow.

The goal is for users to log into their DataFed account, take measurements using an instrument (for example, a microscope with preestablished metadata fields), and have both the data and metadata from the instrument’s output automatically captured and added to their DataFed account. They can also include additional metadata and context that would otherwise be recorded in physical lab notebooks.

Once this is done, users can access that data and share it with a collaborator, regardless of the collaborator’s institution or the physical location of the data, as long as the collaborator has a DataFed account. This option to share easily is yet another benefit of DataFed.

While other platforms are typically restricted to just one laboratory, DataFed lets users share data in an interconnected way, which fits complex systems, such as the DOE network of labs, facilities, and collaborators.

“It’s about cross-collaboration,” Stansberry said. “One team member could be running experiments on an instrument at one facility (pushing data into DataFed), while another team member at a different facility can pull that data and do additional experiments using it as an input or run analytics on it at a different facility.”

Interoperability as the end goal

The team behind DataFed expects that once scientists become familiar with the benefits of correct metadata labeling, research communities in each domain can organize themselves to develop interoperable and standard metadata schemas that work for them.

DataFed is all about amplifying the value and usability of the data that scientists gather so that it can, hopefully, become the cornerstone of groundbreaking research.

Related Publication: D. Stansberry, S. Somnath, J. Breet, G. Shutt, and M. Shankar, “DataFed: Towards Reproducible Research via Federated Data Management,” presented at the 2019 International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, NV, USA, 2019, pp. 1312-1317.

doi: 10.1109/CSCI49370.2019.00245

UT-Battelle LLC manages ORNL for the Department of Energy’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. The Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit https://energy.gov/science.