With scientists producing more data than ever before, the challenge now is finding ways to better use it

This story is the first in “Data Matters,” a short series of features dedicated to exploring the relevance of data for scientific advancement and exploration at the Oak Ridge Leadership Computing Facility, across Oak Ridge National Laboratory, and beyond.

Optimizing large datasets to drive groundbreaking discoveries is a responsibility that a multidisciplinary team of researchers at the Oak Ridge Leadership Computing Facility (OLCF) takes very seriously. The OLCF is a US Department of Energy (DOE) Office of Science User Facility located at DOE’s Oak Ridge National Laboratory (ORNL).

“Data matters because, without it, we can’t find answers or pose new questions,” said Gina Tourassi, director of the National Center for Computational Sciences (NCCS), a division of the ORNL Computing and Computational Sciences Directorate (CCSD) that hosts OLCF.

But transformational use of data in areas like health, manufacturing, and energy generation requires new techniques and a continuous commitment to process improvement.

A change for the better

In the 1980s and ’90s, it became evident that fully exploring the potential of science at national laboratories and research facilities would require large computers capable of generating and analyzing vital datasets.

In the early 2000s, big datasets became an integral component of the scientific process. Computers got much better at collecting high-resolution information in larger quantities, and data storage and management also improved. That’s how data became the fourth leg of a now very steady table: theory, experiment, computing, and data.

“We would say that big data is any amount of information that could not be easily processed using a desktop computer,” explained Arjun Shankar, group leader of the Advanced Data and Workflow Group in NCCS and director of ORNL’s Compute and Data Environment for Science (CADES).

Data-driven discovery then took an accelerated turn for the better when computers became powerful enough to dive into a sea of information to identify patterns and details, dramatically outperforming existing approaches. Furthermore, computers were now able to learn from those findings and make predictions based on data science, leading to a resurgence of artificial intelligence, including machine learning and deep learning.

“In the past decade, we have produced more data than in all previous years combined. But since not all data is information, we need to carefully rethink the way we manage it,” said Tourassi.

For the OLCF, this new era of data has meant several opportunities for marshaling some of the fastest supercomputers in the world—such as Titan and Summit—as key players in scientific simulation, an exercise that has helped resolve some of humankind’s most puzzling questions. For Shankar, positioning these computers to analyze data at large scales interleaved with the simulations is the opportunity that lies ahead.

The reason for that balance of focus is something that Tourassi calls “computer-assisted serendipity”: analyzing the information available in large datasets—without necessarily having a specific question to answer—can help scientists come up with new, powerful scientific questions that they can later attempt to resolve. And this could apply even to data produced by simulations.

Looking into the future

Today, more than 30 countries have developed national artificial intelligence strategies. And only those that best exploit their data will become the most competitive players in the world of research.

“To fly a plane, you need to have the right type of fuel. Data is the fuel for science in the 21st century. And we need to make sure that the data we collect and the way we manage it truly represent the emergent needs of the scientific community,” Tourassi said.

With this challenge in sight, and leveraging its place as DOE’s largest science and energy laboratory, ORNL has been working on several data improvement projects for the last decade.

“We want to enhance the overall life cycle of data—the way it’s collected, processed, identified, analyzed, stored, and retrieved—so others can later use it in an efficient, productive way,” Shankar said.

One of those initiatives is the NCCS’s Constellation DOI Framework and Portal project.

DOI, which stands for Digital Object Identifier, is a unique, persistent code assigned to pieces of scholarly material such as books and e-books, journal articles, and government information.

When combined with the metadata of these pieces—such as author, title, date, and subject—DOIs provide a way to retrieve information for consultation and reuse, much as a book in a library can be checked out for reference.

The same principle can be applied to supercomputing datasets, and that’s precisely what Constellation does: scientists working with the OLCF can store and access extensive data collections tagged with DOIs to use in their own scientific work.
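The idea of pairing a DOI with descriptive metadata as a retrieval key can be sketched in a few lines of code. The following is a minimal, hypothetical illustration in the spirit of what a portal like Constellation enables—all class and field names here are assumptions for illustration, not the actual Constellation API:

```python
# Hypothetical sketch: a DOI plus descriptive metadata acts as a
# retrieval key for a dataset, much like a catalog entry for a book.
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """A dataset tagged with a DOI and basic descriptive metadata."""
    doi: str                                  # e.g. "10.xxxx/abcd1234"
    title: str
    authors: list = field(default_factory=list)
    subject: str = ""
    year: int = 0


class DatasetCatalog:
    """An in-memory catalog that resolves DOIs to dataset records."""

    def __init__(self):
        self._records = {}

    def register(self, record: DatasetRecord) -> None:
        # Index the record by its DOI so it can be found later.
        self._records[record.doi] = record

    def resolve(self, doi: str) -> DatasetRecord:
        """Look up a dataset by its DOI, like checking out a library book."""
        return self._records[doi]

    def search(self, subject: str) -> list:
        """Find all datasets whose subject metadata matches a keyword."""
        return [r for r in self._records.values()
                if subject.lower() in r.subject.lower()]
```

A researcher could then register a dataset once and let collaborators retrieve it by DOI, or discover it by searching the attached metadata—the two access patterns the DOI-plus-metadata pairing is meant to support.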

The DataFed project is another example of how ORNL is working to improve the sharing and management of data. Deployed as a federated hub for computational science and data analytics within distributed high-performance- and cloud-computing environments, DataFed is currently being piloted with scientific collaborators at ORNL’s Center for Nanophase Materials Sciences and external academic collaborators.

DataFed provides an efficient way for researchers to manage and access data without getting sidetracked from other important scientific tasks.

These and other efforts are supported by DOE, which in the past year organized a series of four town hall meetings on artificial intelligence to gather input from scientific communities at the national laboratories about how they use data in high-performance computing (HPC).

The path forward

There has never been a better time to discuss what the future of data looks like, Shankar said.

“Today, there is wide acknowledgment of how much further we can get with large datasets, going beyond each person trying to make sense of their own data. Now, data sharing, collaboration, and AI across all phases of the data life cycle offer us new possibilities,” he added.

With that potential, however, come challenges as well: scientists are now more aware than ever that they need to manage and analyze their data in principled but flexible ways—and do this at scale. Advances in methodologies in this area will increase the value of what we can extract from data.

“It’s no longer enough for one team to have their data and use it. We are eager to see where these data initiatives can take us as a laboratory, an organization, and a community,” Tourassi said.

UT-Battelle LLC manages Oak Ridge National Laboratory for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.