A team led by Joshua Agar, a faculty member in Lehigh University’s Department of Materials Science and Engineering, has realized the first live implementation of an innovative data management system at the US Department of Energy’s (DOE’s) Oak Ridge National Laboratory. Called DataFed, the system was designed to allow authorized users to create, access, and share scientific data across any DOE facility or federated external organization.
According to its webpage, DataFed is “a federated, big-data storage, collaboration, and full-life-cycle management system for computational science and/or data analytics within distributed high-performance computing (HPC) and/or cloud-computing environments.” Although DataFed offers multiple interfaces—including a web application, a command line interface, and an application programming interface (API)—Agar’s group is primarily leveraging the API to integrate DataFed with specific scientific instruments and instrument control systems for their experimental work on materials.
Agar’s team recently developed a novel machine learning approach that can determine similarities in materials and help scientists identify trends in material structures. After building the neural network and training it to include symmetry-aware features, the team applied the method to more than 25,000 piezoresponse force microscopy images from the University of California, Berkeley. They were ultimately able to group classes of materials together and observe trends in the material structures, which gave them a better understanding of the relationships between similar materials.
“Machine learning and deep learning rely on users having large volumes of well-labeled, high-quality data to teach them,” said Suhas Somnath, a computer scientist in the Data Lifecycle and Scalable Workflows Group at the Oak Ridge Leadership Computing Facility. “The ability to collect this data with all necessary context in a seamless and consistent manner in DataFed is a major benefit for users like Agar’s team.”
When scientists are searching for a specific file among tens of thousands of datasets, as is the case for Agar’s team, data management systems light the way. Knowing the exact combination of experimental parameters for a measurement, the material synthesis parameters, the annotations, and any other descriptive information about the data helps scientists form a query and locate the file more easily. DataFed is effective because it captures this contextual information, or metadata. It also lets users collect data from multiple locations without having to worry about where the individual data are stored.
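To make the idea of a metadata query concrete, here is a minimal Python sketch of looking up records by their contextual fields rather than by file location. It does not use the DataFed API; the `Record` class, the `find_records` function, the field names, and the catalog entries are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Record:
    """Hypothetical data record: an ID, where the raw data lives, and its metadata."""
    record_id: str
    location: str
    metadata: dict[str, Any] = field(default_factory=dict)


def find_records(records, **criteria):
    """Return the records whose metadata matches every key/value pair in criteria."""
    return [
        r for r in records
        if all(r.metadata.get(key) == value for key, value in criteria.items())
    ]


# Made-up catalog entries spread across two made-up repositories.
CATALOG = [
    Record("img-001", "repo-a", {"technique": "PFM", "sample": "A", "temperature_K": 300}),
    Record("img-002", "repo-b", {"technique": "PFM", "sample": "B", "temperature_K": 300}),
    Record("img-003", "repo-a", {"technique": "AFM", "sample": "A", "temperature_K": 77}),
]

# The query names experimental parameters only; storage location never appears in it.
matches = find_records(CATALOG, technique="PFM", temperature_K=300)
print([(r.record_id, r.location) for r in matches])
```

The query in this sketch mentions only what the scientist knows about the measurement, and the matching records come back regardless of which repository holds them. A real system would index the metadata rather than scan every record.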
“DataFed is probably the only tool I’m aware of that can flexibly store not just the data but also the rich metadata and relationships between data records,” Somnath said.
Agar was initially impressed with DataFed’s ability to capture and share data in a user-friendly manner with an intuitive web interface and API. He first procured his own data repository at Lehigh for his team’s multidepartment, multidisciplinary research at the Nano Human Interface Initiative and then plugged the repository into DataFed. Now, his team’s machine learning approach is helping them understand complex material structures.
“Humans have limited capacity for memory and recollection,” Agar said. “DataFed is a modern-day Memex; it provides a memory of scientific information that can easily be found and recalled.”
DataFed is currently in a preproduction stage and includes supporting documentation in the form of online videos that walk users through common tasks. For more information about DataFed, see https://ornl.github.io/DataFed/system/introduction.html.
UT-Battelle LLC manages Oak Ridge National Laboratory for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.