Big science generates big data. But what happens when researchers produce more data than they can process? That is becoming a real problem for research facilities, which now generate so much data that only supercomputers can handle the volume.
To bridge the gap between experimental facilities and supercomputers, experts from SLAC National Accelerator Laboratory are teaming up with other Department of Energy national laboratories to build a new data streaming pipeline. The pipeline will allow researchers to send their data to the nation’s leading computing centers for analysis in real time even as their experiments are taking place.
Not only will the project alleviate stress on computing systems, but software integrations powered by AI and machine learning will also help optimize the experiments, leading to faster, more accurate results.
“Rapid data analysis and computing capabilities for autonomous experiment steering are essential for advancing science in the 21st century as well as our ability to shed new light on a number of challenging scientific research areas,” said project lead Jana Thayer, technical research manager at SLAC.
DOE operates a fleet of world-leading reactor- and accelerator-based research facilities that generate beams of X-rays and neutrons to study energy and materials. The X-ray facilities are known as light sources. The neutron facilities are known as neutron sources.
Also under DOE are the world’s most powerful supercomputers, including Frontier and Aurora, exascale-class HPE Cray supercomputers located at Oak Ridge and Argonne National Laboratories, respectively. The systems are currently ranked No. 1 and No. 2 on the TOP500 list of supercomputers. Frontier and Aurora can perform more than a quintillion, or a billion-billion, calculations per second.
“Our ultimate goal is to stream data straight from the experiment directly to a remote computing facility without ever having to save anything to disk,” Thayer said. “The ability to analyze the data while the experiment is still going on will allow us to make educated decisions such as changing the temperature and pressure to optimize the research results. That means faster times to solutions and more accurate science.”
The 5-year project is called Intelligent Learning for Light Source and Neutron Source User Measurements Including Navigation and Experiment Steering, or ILLUMINE. The project is part of the early development of DOE’s Integrated Research Infrastructure, an initiative that expands connections between DOE’s computing centers and research facilities within the U.S. national laboratory system.
The research facilities involved in the ILLUMINE project include SLAC’s Linac Coherent Light Source, or LCLS; Brookhaven National Laboratory’s National Synchrotron Light Source II; Argonne National Laboratory’s Advanced Photon Source; Lawrence Berkeley National Laboratory’s Advanced Light Source; and Oak Ridge National Laboratory’s Spallation Neutron Source, or SNS, and High Flux Isotope Reactor. Each facility is a DOE Office of Science user facility.
“The challenge these facilities face is that they lack the computing resources necessary to handle the huge amounts of data they generate. And that’s especially true for the next-generation facilities coming online now and the ones currently under construction,” Thayer said. “We’re talking terabytes per second — orders of magnitude more than what we currently produce.”
“That’s where high-performance computing centers come in,” said Valerio Mariani, the head of the LCLS Data Analytics Department at SLAC. “Streaming data straight from the facility to the supercomputer would greatly expedite the data analysis process and alleviate in-house data storage issues.”
Currently, each facility has a different solution that only works with a certain computing system.
“The software they use now is designed to run on laptops or small clusters,” said Tom Beck, section head for science engagement in ORNL’s National Center for Computational Sciences. “So, we need to build software and infrastructure that not only works on the biggest systems but also enables the supercomputers, the neutron sources and the light sources to all speak the same language.”
The science behind the sources
The facilities involved in the ILLUMINE project are among today’s marvels of engineering. The accelerator-based sources, for example, are enormous facilities that range from hundreds of yards to more than a couple of miles in length. Each facility houses high-powered instruments, specialized for studying energy and materials at extremely small scales, from micrometers to subatomic sizes.
Generally, these sources work by bombarding experimental samples with beams of photons or neutrons traveling at or near the speed of light. Upon interacting with the sample, the particles scatter off and get collected by detectors that record the particles’ energy, speed, position and trajectory.
This scattering technique is like shining a flashlight inside a material to see how the atoms are arranged and how they behave. The fundamental insights are used to improve a wide range of technologies from batteries and cosmetics to building materials and novel treatments for diseases.
“To appreciate the volume of data the neutron and light source facilities create, consider this: At SNS, for example, every neutron that hits a detector is ‘stamped’ with a time of arrival and position on the detector. That time is calculated from the moment the neutrons are generated,” said Jon Taylor, Division Director for ORNL’s Neutron Scattering Division. “The pulsed beam of the accelerator generates a million-billion neutrons per pulse, sixty times a second. Multiply that by 5,000 hours, roughly 208 days a year of operation and the amount of data is pretty staggering.”
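Taylor's figures can be multiplied out for a rough, back-of-the-envelope number. The Python sketch below does only that; the constant and function names are illustrative and do not come from any facility software. The result counts neutrons generated per year, which is presumably an upper bound on what the detectors actually record, since only a fraction of the neutrons produced ever reach a detector.

```python
# Figures quoted in the article (integers keep the arithmetic exact).
NEUTRONS_PER_PULSE = 10**15   # "a million-billion neutrons per pulse"
PULSES_PER_SECOND = 60        # "sixty times a second"
HOURS_PER_YEAR = 5_000        # "5,000 hours ... a year of operation"

SECONDS_PER_HOUR = 3_600


def neutrons_per_year(per_pulse=NEUTRONS_PER_PULSE,
                      pulses_per_second=PULSES_PER_SECOND,
                      hours=HOURS_PER_YEAR):
    """Neutrons generated over one year of operation."""
    return per_pulse * pulses_per_second * hours * SECONDS_PER_HOUR


OPERATING_DAYS = HOURS_PER_YEAR / 24   # about 208 days
TOTAL = neutrons_per_year()

print(f"Operating time: about {OPERATING_DAYS:.0f} days")
print(f"Neutrons generated per year: {TOTAL:.2e}")
```

With the quoted figures, this comes to roughly 1.08 × 10^24 neutrons generated over about 208 operating days.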
Addressing the deluge of big data
As a test bed, the ILLUMINE team is using SLAC’s Coherent X-ray Imaging, or CXI, beamline in California and ORNL’s IBM AC922 Summit, a 200-petaflop supercomputer managed by the OLCF in Tennessee.
On average, imaging beamlines generate more data than their nonimaging counterparts. The CXI instrument is like a camera that can take thousands of photographs per second to create images that closely resemble medical X-rays. Detectors count the particles as they pass through the sample to construct an image similar to a shadow cast by shining light over an object. The most important regions of the image are called Bragg peaks.
“With Bragg peaks, all you need is the information just around the peaks, which means you can throw away the rest of the image. Right now, that’s done by a person,” Mariani said. “We’re testing as much as we can in real time so that the system becomes better and faster at finding the peaks and deciding what adjustments need to be made to get the optimal data. That will save us a lot of memory and computing time.”
The team is also developing an algorithm called PeakNet, which uses AI and machine learning to recognize different kinds of Bragg peaks and discard any extraneous data.
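The idea of keeping only the pixels around each peak can be sketched in a few lines of Python. This is a minimal illustration, not PeakNet or the team's actual method: it swaps the neural network for a simple threshold-plus-local-maximum test, and the names `find_peaks` and `crop_patches`, the `radius` parameter and the toy image are all hypothetical.

```python
def find_peaks(image, threshold):
    """Return (row, col) of pixels at or above threshold that are strictly
    brighter than all of their neighbors (a stand-in for a learned peak finder)."""
    rows, cols = len(image), len(image[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = image[r][c]
            if value < threshold:
                continue
            neighbors = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(value > n for n in neighbors):
                peaks.append((r, c))
    return peaks


def crop_patches(image, peaks, radius=1):
    """Keep only a small window around each peak; the rest can be discarded."""
    return {
        (r, c): [row[max(0, c - radius):c + radius + 1]
                 for row in image[max(0, r - radius):r + radius + 1]]
        for r, c in peaks
    }


# Toy 6x6 detector image with two bright spots (hypothetical data).
IMAGE = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 2, 1, 0, 0],
    [0, 2, 9, 2, 0, 0],
    [0, 1, 2, 1, 0, 0],
    [0, 0, 0, 0, 1, 3],
    [0, 0, 0, 0, 3, 8],
]

PEAKS = find_peaks(IMAGE, threshold=5)
PATCHES = crop_patches(IMAGE, PEAKS, radius=1)

# Pixel count across patches (overlapping patches would be counted twice).
KEPT = sum(len(row) for patch in PATCHES.values() for row in patch)
TOTAL_PIXELS = len(IMAGE) * len(IMAGE[0])
print(f"Peaks at {PEAKS}; kept {KEPT} of {TOTAL_PIXELS} pixels")
```

Even in this toy case, only 13 of 36 pixels survive. On real detector frames, where peaks are sparse, the saving would likely be far larger.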
A year into the ILLUMINE project, the team has achieved several important milestones. A feasibility demonstration succeeded in transferring datasets from the CXI instrument to the Summit supercomputer via the Energy Sciences Network. This high-performance network serves the entire national laboratory system and connects more than 50 major research facilities.
In addition to developing machinery to evaluate the system’s performance, produce high-quality datasets for training, and enable remote workflow capabilities, the team also leveraged thousands of Summit’s NVIDIA GPUs to train a series of neural network models for advanced processing of complex visual data.
The future on Frontier
After 6 years of production, the Summit supercomputer is scheduled to be decommissioned at the end of 2024. The ILLUMINE project will then shift to working on Summit’s successor, Frontier, also located at the OLCF.
“Access to Summit has been crucial to the success of this project. Summit provided us with several orders of magnitude more GPU resources than what we have available, enabling us at LCLS to start training the Bragg Peak Finder on a large dataset,” Mariani said. “OLCF expertise was vital to overcoming the obstacles associated with transferring data to the facility and launching and executing the code.”
At ORNL, the ILLUMINE project is well timed. Construction is almost complete for VENUS, which will be the most advanced neutron-imaging beamline for open science in the United States. ORNL is also building the Second Target Station, an entirely new, next-generation neutron source that will also be powered by the SNS accelerator.
“As more beamlines incorporate automation, eventually the system could be developed to run the entire experiment almost entirely on its own — safely and securely from anywhere in the country,” Taylor said. “Getting results in a fraction of the time because of machine learning means users will save a lot of time and money, while at the same time, the neutron and light source facilities will be able to accommodate more users each year.”
In addition to Thayer, Mariani, Beck and Taylor, ILLUMINE team members include Ryan Herbst, Wilko Kroeger, Vivek Thampy and Cong Wang (SLAC); Stuart Campbell, Daniel Allan, Andi Barbour, Thomas Caswell, Natalie Isenberg, Phillip Maffettone, Daniel Olds, Max Rakitin and Nathan Urban (BNL); Nicholas Schwarz, Franck Cappello, Ian Foster and Antonino Miceli (Argonne); Alexander Hexemer and Dylan McReynolds (Berkeley Lab); and David M. Rogers (ORNL).
Computing resources for the project are provided through the SummitPLUS allocation program.
Support for this research came from the DOE Office of Science’s Advanced Scientific Computing Research program. The OLCF is a DOE Office of Science user facility located at ORNL.
UT-Battelle manages ORNL for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. The Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit energy.gov/science.