A software package helps scientists publish the information used in research projects

This story is the third in a series called “Data Matters,” a collection of features dedicated to exploring the relevance of data for scientific advancement and exploration at the Oak Ridge Leadership Computing Facility, across Oak Ridge National Laboratory, and beyond.

As new science builds on previous research, continuous access to data sets is vital. But hosting information and indexing it so that others can find it can often lead scientists down expensive and difficult-to-navigate avenues. Sorting out where and how best to store data can distract from the work of producing more quality science.

For researchers working with the Oak Ridge Leadership Computing Facility (OLCF), a US Department of Energy (DOE) Office of Science User Facility at Oak Ridge National Laboratory (ORNL), publishing this data also means fulfilling certain requirements established by DOE’s Office of Scientific and Technical Information (OSTI).

To facilitate that process, OLCF offers its users a software package called Constellation DOI.

First developed in 2016, the system assists scientists in acquiring a Digital Object Identifier (DOI) for the data sets that accompany published research.

Defined as unique and permanent alphanumeric codes, DOIs are used to identify media and digital objects such as articles and data sets. Users can then refer to these codes to search and retrieve data sets at a later date, thanks to a collection of metadata tags gathered by Constellation DOI.
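To make the idea of a DOI concrete, here is a minimal Python sketch of how such a code is structured and how it maps to a web address. It relies only on two general facts about DOIs: they begin with a "10." registrant prefix followed by a slash and an item suffix, and they can be resolved through the public https://doi.org/ resolver. The DOI shown is a made-up placeholder, not a real Constellation record, and the helper names are hypothetical.

```python
from urllib.parse import quote


def split_doi(doi: str) -> tuple[str, str]:
    """Split a DOI into its registrant prefix and its item suffix."""
    doi = doi.strip()
    # The first slash separates the prefix ("10.xxxx") from the suffix.
    prefix, sep, suffix = doi.partition("/")
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a well-formed DOI: {doi!r}")
    return prefix, suffix


def resolver_url(doi: str) -> str:
    """Build the address at which the public DOI resolver redirects to the record."""
    prefix, suffix = split_doi(doi)
    # Percent-encode unusual characters in the suffix, but keep slashes readable.
    return f"https://doi.org/{prefix}/{quote(suffix, safe='/')}"


# Placeholder DOI for illustration only; it does not point to a real data set.
EXAMPLE_DOI = "10.1234/example.dataset.v1"
print(resolver_url(EXAMPLE_DOI))
```

Because the identifier itself never changes, a link built this way can keep working even if the data set's hosting location moves.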

Data sets indexed using Constellation DOI are even discoverable via commercial search engines, such as Google or Bing.

“Having this resource helps collect, preserve, and readily disseminate the scientific knowledge that our researchers work so hard on,” said Sudharshan Vazhkudai, director of OLCF’s Scalable Data Infrastructure for Science Program.

A forever home for data sets

Scientists using OLCF facilities to conduct research have the opportunity to host their output data with ORNL and use the Constellation software package to get a DOI for it.

Alternatively, scientists can take care of storage themselves, but this option doesn’t come without challenges.

Data storage costs money, and depending on the size of the set, this investment can be significant. “Also, without adequate maintenance or even a sustainable business model, private servers can come and go, which means that the data you’re paying to have stored today might not be available in 5, 10 years,” said OLCF software engineer Ross Miller, who for the past few years has worked on the development of Constellation DOI. “At the end of the day, what we’re offering is to host this data at no cost to OLCF users, technically forever, in a way that is compliant with the highest standards of research today.”

Constellation DOI currently hosts and has processed DOI requests for data sets in domains that include electron diffraction, neutron spectroscopy, scanning probe microscopy, genome association, and climate benchmarking data, as well as other assets such as BigNeuron and magnetopause data.

Still, the team behind Constellation DOI is eager to expand. Recently, they also opened this option to other divisions within ORNL through the lab’s Compute and Data Environment for Science (CADES), which means more scientists can take advantage of this opportunity.

A thousand and one uses

Dan Jacobson, ORNL’s chief scientist for Computational Systems Biology, has been participating in the Constellation DOI project since its inception. “I think I was one of the first beta-testing phase users,” he said.

Part of Jacobson’s work on viral genomes has found a permanent home in Constellation DOI. The program has also proved useful for his bioenergy research, where he and his team produce data files describing the genetic diversity of populations that are later used in genotype-to-phenotype association studies.

“We want to be able to have people reproduce those results, of course. That means we need a long-standing, easy way to have a reference for our data sets in all our publications,” he explained.

In his line of work, the information collected is not always static. “Our data sets can change over time as we add members to a population and come up with new algorithms to ferret out the diversity, meaning that there can be multiple iterations of a data set. So, it’s been great to have a place where we can simply assign a DOI and refer to that DOI in publications and websites, and as we have a new derivative data product, we can create a new DOI and refer to that new version.”

Recently, Jacobson and his team have also built, and are storing on Constellation DOI, several robust k-mer profiles, sequences that help represent patterns in a genome. So far, they have done that for over 700,000 viral genomes.
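For readers unfamiliar with the term, a k-mer is simply a stretch of k consecutive letters in a genome sequence, and a k-mer profile tallies how often each one occurs. The Python sketch below shows the basic counting idea on a toy sequence. It is an illustration under simplifying assumptions, not the method Jacobson's team uses; the function name and the sample sequence are hypothetical, and production pipelines handle far larger inputs and additional details.

```python
from collections import Counter


def kmer_profile(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k in a sequence."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    seq = sequence.upper()
    # Slide a window of width k across the sequence, one letter at a time.
    # If the sequence is shorter than k, the range is empty and so is the profile.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Toy sequence for illustration only.
profile = kmer_profile("ATGATGC", 3)
for kmer, count in sorted(profile.items()):
    print(kmer, count)
```

In this toy case the 3-mer "ATG" appears twice and each of the others once, which is the kind of pattern a profile makes visible across an entire genome.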

“I have truly enjoyed working with the Constellation team because they are great at listening to our needs as scientists. Correctly storing metadata is not always a one-size-fits-all situation, and they are very accommodating of any questions that come up during the process,” he said.

Computational scientist Junqi Yin is another researcher paying attention to the Constellation DOI project.

Yin, who is part of OLCF’s Advanced Data and Workflow Group, recently used one of his own Constellation DOI–hosted sets to design one of the data challenges of the upcoming 2020 Smoky Mountains Computational Sciences and Engineering Conference.

The event, now in its fourth year, has become a valuable experience in finding solutions for data analytics challenges faced by high-performance computing systems, such as ORNL’s Summit, one of the world’s fastest supercomputers.

For Yin’s challenge, participants are asked to download a specific data set hosted in Constellation to build a machine learning algorithm that accurately predicts materials’ crystal structures, can be tweaked to account for imbalances in the data set—such as too few data points for a certain type of structure—and can multitask to predict material structures while reducing the effects of imbalances in the data.
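One common way to account for the kind of imbalance described above is to weight each class inversely to how often it appears, so that rare structure types count for more during training. The Python sketch below computes such weights using the "balanced" heuristic (total samples divided by the number of classes times the class count). This is a generic illustration, not the approach required by the challenge; the function name and the structure labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels: list[str]) -> dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical, deliberately imbalanced labels: one structure type is rare.
labels = ["cubic"] * 6 + ["hexagonal"] * 3 + ["triclinic"] * 1
weights = balanced_class_weights(labels)
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

With these weights, the single rare example carries as much total influence as each of the more common classes, which is one simple way to keep a model from ignoring underrepresented structures.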

“This data set addresses more than 60,000 types of materials, so insights beyond traditional modeling and simulation would be impactful to the materials science field,” said Yin. “This kind of challenge has been a vital driver for deep learning success in industry, and this specific data set represents a great opportunity to create a deep learning benchmark for materials science research, so it is vital to us that the information used by the scientists participating in the challenge is properly stored.”

Vazhkudai echoes this sentiment and leaves an open invitation to all ORNL researchers and OLCF users to take advantage of Constellation DOI: “The product of your work already lives here at ORNL, so it would be a natural next step to formally store it here, too. But on top of that, having DOIs for your data sets can increase your chances to be cited, and for good science to be done—one of the common goals that unites us all.”

UT-Battelle LLC manages ORNL for DOE’s Office of Science. The Office of Science is the single largest supporter of basic research in the physical sciences in the United States and is working to address some of the most pressing challenges of our time. For more information, please visit https://energy.gov/science.