Data is the fuel that propels scientific research, and fortunately for scientists, there is no shortage of data in the modern world—a single scientific instrument can produce terabytes of data. However, just as important as collecting the data is being able to access it and analyze it. For researchers at the US Department of Energy’s Oak Ridge National Laboratory (ORNL), this can be a challenging task complicated by security protocols and non-networked instruments.
A new software tool aims to alleviate these challenges, making data transfer easier and faster; in the long run, it might even open the door to autonomous experiments. Using DataFlow, whose development was funded by the Oak Ridge Leadership Computing Facility (OLCF) and supported by staff in ORNL’s Data Lifecycle and Scalable Workflows group, scientists can easily transfer scientific data from their instruments to centralized data storage resources through a simple web application or an API. Crucially, DataFlow enables quick and simple data transfer from machines, computers, or instruments that are air-gapped, that is, not connected to the internet or laboratory network.
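For readers curious what a scripted transfer of a very large file might look like, here is a minimal Python sketch. It is not DataFlow's actual API, which this article does not describe. The endpoint URL, the bearer token, and the `X-Checksum-SHA256` header are all hypothetical. The sketch only illustrates the general idea of reading a file in fixed-size chunks, so that a multi-terabyte dataset never has to fit in memory, and attaching a checksum the receiving side could verify.

```python
import hashlib
import urllib.request
from pathlib import Path
from typing import Iterator

# Hypothetical endpoint: the article does not describe DataFlow's real API.
UPLOAD_URL = "https://dataflow.example.invalid/api/upload"
CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB per read; an arbitrary choice


def iter_chunks(path: Path, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield the file in fixed-size pieces so it never sits fully in memory."""
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            yield chunk


def sha256_of(path: Path, chunk_size: int = CHUNK_SIZE) -> str:
    """Checksum the file chunk by chunk so a receiver could verify it."""
    digest = hashlib.sha256()
    for chunk in iter_chunks(path, chunk_size):
        digest.update(chunk)
    return digest.hexdigest()


def build_upload_request(path: Path, token: str) -> urllib.request.Request:
    """Prepare (but do not send) a streaming PUT request for one file."""
    return urllib.request.Request(
        UPLOAD_URL,
        data=iter_chunks(path),  # an iterable body is streamed when sent
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",  # hypothetical auth scheme
            "Content-Type": "application/octet-stream",
            "Content-Length": str(path.stat().st_size),
            "X-Checksum-SHA256": sha256_of(path),  # hypothetical header
        },
    )
```

Passing the prepared request to `urllib.request.urlopen` would send it. The sketch stops short of that because the real server address and authentication scheme are not given here.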
“We wanted a tool that was capable of heavy-duty data transfer, like from a microscope generating terabytes of data,” said Suhas Somnath, previously a computer scientist in the National Center for Computational Sciences—now an associate at McKinsey and Company in Boston, MA—who helped develop the tool. “How do you take that data and quickly move it to a place where it can be safely stored and analyzed quickly?”
“We designed it to be as simple as possible,” explained Greg Shutt, DataFlow technical lead and software engineer in the Software Services Development group at ORNL. “Researchers drag and drop their dataset into a web browser, and it works no matter the file size, no matter the file type.”
The issue of data transfer from air-gapped machines was the team’s primary concern. Machines are air-gapped for several reasons, such as cybersecurity policies. Often, machines and instruments run software that is developed for a particular version of an operating system and will not run if the version is updated or patched. Of course, this presents cybersecurity risks to the lab, so these machines are required to operate off-network. Another concern is that automatic updates can force a machine to restart unexpectedly, which could cause significant data loss if the restart occurs in the middle of an experiment.
Depending on their needs, users can access DataFlow through a web browser by connecting their instruments to a small private network that talks to the DataFlow server. Shutt has also been working with ORNL’s IT team to set up an internal network well suited for DataFlow’s users.
“These previously air-gapped instruments can connect to the internal ORNL network,” Shutt explained, “and DataFlow provides a simple interface for data movement and metadata capture.”
Critically, DataFlow doesn’t require researchers to install software or change code on their instruments, either of which risks interfering with finely tuned programming. At the same time, the DataFlow code is easily modifiable and can be customized to capture metadata according to the requirements of a particular project.
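As a rough illustration of what project-specific metadata capture can mean in practice, the Python sketch below defines a small schema that rejects an upload record when a required field is missing. It does not reflect DataFlow's internal code. The class name, field names, project name, and file name are all hypothetical, chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class MetadataSchema:
    """Fields one project requires; every name here is hypothetical."""

    project: str
    required: tuple[str, ...]

    def capture(self, filename: str, **values: str) -> dict[str, str]:
        """Return a metadata record, or raise if a required field is absent."""
        missing = [name for name in self.required if not values.get(name)]
        if missing:
            raise ValueError(f"missing metadata fields: {', '.join(missing)}")
        return {
            "project": self.project,
            "filename": filename,
            "captured_at": datetime.now(timezone.utc).isoformat(),
            **values,
        }


# Example: a microscopy project that insists on an instrument and a sample ID.
microscopy = MetadataSchema("nanoscale-imaging", required=("instrument", "sample_id"))
record = microscopy.capture("scan_001.h5", instrument="AFM-2", sample_id="S-17")
```

A different project would declare its own `required` tuple, so the same upload path could enforce different metadata rules without any change on the instrument itself.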
“Researchers spend half their time moving their data around when they could be using that time to think about different experiments, about different ways to approach their science, and I think that’s unacceptable,” Shutt said. “DataFlow takes the handcuffs off and allows the technology to work with you rather than against you.”
Although DataFlow is a relatively new tool, it is quickly gaining traction at ORNL. Rama Vasudevan, group leader of the Data Nanoanalytics group in the Center for Nanophase Materials Sciences, develops microscopy techniques to better understand how materials behave at the nanoscale. For the past several years, his research has included trying to automate the instruments to make decisions on their own, an area called smart characterization.
Vasudevan and his team started using DataFlow to upload data directly to the Compute and Data Environment for Science storage system, an ability that simplified and sped up their research process.
“Previously, we had to save the file to the local hard drive, then connect a laptop to the computer, transfer the file to the laptop, then transfer from the laptop to the server, and finally get data back from the server. DataFlow simplified this procedure dramatically, and now we can run experiments in real time,” Vasudevan said.
While DataFlow’s current capabilities are already making an impact for researchers at ORNL, Shutt plans to add features and integrations with other tools, such as Dropbox and OneDrive, to make the software even more user friendly. DataFlow also has potential that has not yet been explored, including data transfer from external laboratories or universities directly to ORNL’s storage systems.
In the meantime, Shutt is working to raise awareness of DataFlow within ORNL, and he’s confident in what the tool has to offer.
“It reduces the time, the complexity, and the friction that researchers face in getting their data to analysis. Once they start using this thing, they’re going to love it,” Shutt said.
The OLCF is a DOE Office of Science User Facility at ORNL.
UT-Battelle LLC manages ORNL for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.