Data Curation for the Exascale Era

This summer, Alex May faced a task daunting enough for any human: sift through a dataset of 10 million organic molecules, make sure everything was correct and get it done within two weeks.

It was not the first time May had to assess a large volume of data within a tight time frame at the Department of Energy’s Oak Ridge National Laboratory. Although his official ORNL title is a modest “data services engineer” in the Data Lifecycle Technologies group, he is the first and only full-time data curator at the Oak Ridge Leadership Computing Facility, a DOE Office of Science user facility located at ORNL.

May’s job primarily consists of evaluating datasets developed by computational scientists before they are made public through the OLCF’s Constellation portal for open data exchange. He has a long checklist of potential issues to monitor, from the basic functionality of the codes to the ethical nature of the data itself.

“I look at the data and evaluate it — check the files and read the documentation. Then I’ll try to run the code, if there’s a code involved, and duplicate the steps. There might be a next-generation scientist who will want to use this dataset, and if you’re missing a step, that could create a lot of frustration,” May said. “Sensitive information is also becoming a big issue, and sometimes scientists are more focused on just doing their research. My role is to ensure that the author and the lab are in compliance with best practices for data security and privacy.”

In the case of the organic molecules dataset, May unearthed a disparity: The specific number of molecules declared in its description did not match the actual number of molecules in the dataset. All project leader Massimiliano Lupo Pasini had to do was correct the description. However, because the dataset was part of a paper on AI-enabled molecular design being submitted for publication in the journal Nature Scientific Data, it was an important fix.
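The kind of consistency check described above can be illustrated with a minimal sketch. It is not May's actual workflow or tooling: the function name, the declared count and the sample records below are all hypothetical stand-ins, and a real dataset would be read from its files rather than from an in-memory list.

```python
def count_mismatch(declared_count, records):
    """Compare a declared record count with the number of records present.

    Returns (actual_count, difference), where difference = actual - declared.
    """
    # Count lazily so that any iterable of records works, not only lists.
    actual = sum(1 for _ in records)
    return actual, actual - declared_count


# Hypothetical example: the description claims 5 molecules,
# but only 4 records are present in the dataset.
declared = 5
records = ["CCO", "c1ccccc1", "CC(=O)O", "CN"]

actual, diff = count_mismatch(declared, records)
if diff != 0:
    print(f"Description declares {declared} molecules, dataset holds {actual}")
```

A nonzero difference does not say which side is wrong. In the case above, it was the description that needed correcting, not the data.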

“Especially now that AI is becoming a strong key player in scientific computing, data curation is very important and to some extent is at the core of reproducibility requirements of the results. Not being able to reproduce results questions the credibility of the researcher,” said Lupo Pasini, a data scientist in ORNL’s Computational Sciences and Engineering Division. “So, having a data curator makes sure that the scientific output is really helping the scientific progress of the community, and it protects the person who publishes the data.”

But May’s mission goes beyond ensuring the OLCF’s datasets are ready for public consumption. Along with Olga Kuchar, leader of ORNL’s Data Lifecycle Technologies group, May co-chairs DOE’s newly formed Data Curation Working Group, or DCWG, spreading the fundamentals of comprehensive data curation to the other national labs. Their goal is to establish an adaptable infrastructure and workflows to ensure the quality, transparency and accessibility of DOE datasets for years to come. And May hopes to make AI an integral part of that plan.

OLCF data curator Alex May. Photo credit: Carlos Jones/ORNL; background image: JohnsonGoh, CC0, via Wikimedia Commons.

Data’s past and present

May has a bachelor’s degree in ancient history, so his current job safeguarding the integrity of data produced by some of the world’s fastest supercomputers may seem unexpected. But combined with his master’s in information and library science, May’s career path inevitably led to the field of data curation.

While working as a book cataloger at Tufts University’s Tisch Library in 2011, May found himself running the library’s repository. As the repository grew, his position at the library became increasingly focused on repository infrastructures and curation models. He’d later take that experience to Arizona State University’s repository, going on to develop metadata standards, local taxonomies and workflows for curation.

All of this experience sparked in May an unexpected fascination with analyzing data.

“It’s the puzzle — the joy of trying to solve a puzzle and learning the domain. Getting a dataset, and wondering ‘What’s in this dataset? What do I need to be aware of? What could potentially be there that I might need to talk with a researcher about?’” he said.

About two years ago, that passion finally led May to ORNL, where the production of data is substantial. May joined the Data Lifecycle Technologies group — formed in 2020 as the Data Lifecycle and Workflows group — which addresses a growing need among the OLCF user community: tools to support big data. The group researches and develops innovative data-management services, establishing data standards and policies for scalable research ecosystems that enable the next generation of scientific discoveries.

Meanwhile, Constellation currently holds approximately 3.5 petabytes of data, with a 4-petabyte dataset about to be ingested — a prospect that animates May with even more energy.

“Coming to ORNL has been an amazing experience for me — the volume, velocity and variety of the data produced here is so different. The projects are so much larger here that you’re posed with these amazing, grand-challenge problems,” May said. “It has been, for me professionally, a very exciting time — I’ve learned so much and have gained so many skills just learning how to curate the type of data that we produce out here.”

Part of May’s job is mitigating risk for each one of the lab’s researchers, the lab itself and anyone else linked to these data-based projects. For any public-facing dataset, May is vigilant to prevent the inadvertent inclusion of sensitive information.

May has enlisted an ad hoc group of experts at ORNL to help him navigate these and other issues, including Kuchar, privacy officer Dan DeVore, intellectual property attorney Michael Johnson, data architect Jay Barua, technical information officer Diana Stanley, Information Services Group Leader Anna Galyon, data engineering team leader Ian Goethert and others as needed.

“I may be the first data curator at ORNL, but it’s not just me — it’s a team of people. I’m constantly being supported by all sorts of different staffers here,” May said. “I’ve managed to catch issues and bring them to this ad hoc working group so that we could group-think it and present the best paths forward for the researcher. Everything we’re trying to do is to make this dataset available to the larger world and to protect the researcher.”

Data vitality for the future

Data curation as an essential practice in research is still a relatively new concept. In 2016, Mark D. Wilkinson and colleagues published “The FAIR Guiding Principles for scientific data management and stewardship” in Nature Scientific Data. These principles assert the basic tenets of data curation: findability, accessibility, interoperability and reusability. Two years later, those ideas were refined into granular workflows by the Data Curation Network, or DCN, a member organization of data repositories. With its specific checklists, the DCN CURATE(D) Steps can cover everything from checking files to documenting curation activities.

The efforts to establish data curation practices also addressed an emerging issue: ensuring the future usability of today’s datasets. As datasets grow ever larger and more critical in computational science and academic research, “future-proofing” them against ever-evolving file formats, software iterations and hardware platforms has become an imperative.

“If your data is not understandable, if it hasn’t been curated and if it’s not trustworthy, if there aren’t ways for someone coming back at it a decade from now with new tooling to be able to use it, then we haven’t really done a good job,” May said. “So how do you start thinking about the steps that we’re going to take now to ensure that these datasets are available at least 10 years into the future?”

Those are just some of the issues on May’s agenda as he works toward establishing new data-curation workflows at ORNL, as well as for the DOE complex, that set a citation-level metadata schema and comply with federal mandates for research data.

In his presentation at last year’s DOE Data Days conference, May pointed out the need for a DOE-wide working group of curators, librarians, information scientists, researchers and developers representing the different laboratories. With the interest generated at the conference, he and Kuchar drafted a charter and seven objectives for the DOE DCWG, from developing data curator toolkits to establishing lab-centric best practices.

“This isn’t just a group to get together and talk about data curation issues. There are real challenges to curating data for the labs, and we want to solve them. And that’s what makes this group trailblazing — here we are as an ad hoc group from the ground up that’s breaking down the silos between labs, developing trust with each other,” May said.

Among the DCWG’s goals is using machine learning to assist data curation, an idea originally suggested by Arjun Shankar, head of the Advanced Technologies Section of the National Center for Computational Sciences at ORNL. The group aims to write a proof-of-concept short paper that demonstrates how accessing repository metadata can allow for semantic search capabilities and ease the burden of evaluating huge datasets. The concept is also being tackled by the DCN, where May is co-chair of the Big Data Interest Group.

“For me, the question is how do we start using these technologies to ensure that our datasets are trustworthy, useful, and can also feed back into the AI models? This is all cutting-edge — no one has an answer here, but we need to start thinking about this now,” May said.

UT-Battelle manages ORNL for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit energy.gov/science.