Collaboration between the OLCF and APPL facilities leads to new data automation process

At the Department of Energy’s Oak Ridge National Laboratory, scientists studying plant characteristics have access to sophisticated robotic imaging tools and sensors at the Advanced Plant Phenotyping Laboratory, or APPL. This greenhouse-like lab offers one of the most diverse suites of imaging capabilities for plant phenotyping worldwide, letting scientists quickly analyze the success of genetic modifications made to improve plant traits.

The system measures photosynthetic activity, water and nitrogen content, stress response and other characteristics for hundreds of plants. Consequently, APPL’s automated instruments produce a massive amount of data each year.

Processing, storing and sharing those terabytes of high-resolution data — used by scientists around the world — presents a unique challenge for APPL. To help wrangle that much information, the facility recently teamed up with the Oak Ridge Leadership Computing Facility, or OLCF, a DOE Office of Science user facility also located at ORNL, to develop a scalable, automated pipeline for data management. The project is an early effort in DOE’s Integrated Research Infrastructure, or IRI, which aims to link distributed research resources to support modern collaborative science.

The work was led at APPL by Stanton Martin, a bioinformatics data scientist; Hong-Jun Yoon, an R&D staff member; and Daniel Hopp, a computer programmer and application developer. The data pipeline development was led by the OLCF’s Ryan Prout, acting group leader for Software Services Development, and Kellen Leland, an HPC engineer, with assistance from Paul Bryant, a software developer; Jeffrey Miller, group leader for HPC Infrastructure and Platforms; and James “Jake” Wynne, an HPC storage systems engineer.

The collaboration between APPL and OLCF included staff from across ORNL: (from left) James “Jake” Wynne, Hong-Jun Yoon, Ryan Prout, Stanton Martin, Daniel Hopp and Kellen Leland. Credit: Jeff Otto/ORNL, U.S. Dept. of Energy

Martin reached out to the OLCF for a custom, scalable data infrastructure to centrally store and share data collected at APPL, starting with data to be collected from the lab’s existing hyperspectral imaging instruments. While only 0.0035% of the electromagnetic spectrum is visible to the human eye, these instruments break down hundreds of spectral bands for each pixel in a single plant image, providing a more comprehensive look into the physiological characteristics of the plant.

This layered, spectral data is presented in 3D data cubes, each of which contains around three gigabytes of data. The data system was built using images from hyperspectral cameras originally set up in the APPL greenhouse as part of a project examining nitrogen stress in plants, led by ORNL staff scientist Dave Weston. APPL’s automated imaging systems have since produced hundreds of images and related hypercubes, with plans to create many more.
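To give a rough sense of why a single hypercube reaches the gigabyte range, the sketch below treats a cube as a 3D array with two spatial axes and one spectral axis. The image dimensions, band count and sample type are hypothetical values chosen only so the arithmetic lands near three gigabytes; they are not APPL's actual instrument settings.

```python
import numpy as np

# Hypothetical cube dimensions, chosen only for illustration.
# The real sensor resolution and band count are not stated here.
HEIGHT, WIDTH, BANDS = 1024, 1024, 700
BYTES_PER_SAMPLE = np.dtype(np.float32).itemsize  # 4 bytes per value (assumed)


def cube_size_gb(height, width, bands, bytes_per_sample):
    """Return the uncompressed size of a hypercube in gigabytes (10**9 bytes)."""
    return height * width * bands * bytes_per_sample / 1e9


# A tiny cube showing the layout: two spatial axes plus one spectral axis.
small_cube = np.zeros((4, 4, BANDS), dtype=np.float32)

# Every pixel carries a full spectrum: one value per spectral band.
pixel_spectrum = small_cube[2, 3, :]

size_gb = cube_size_gb(HEIGHT, WIDTH, BANDS, BYTES_PER_SAMPLE)
print(f"Spectrum length per pixel: {pixel_spectrum.shape[0]} bands")
print(f"Approximate size of one full cube: {size_gb:.2f} GB")
```

Under these assumed dimensions, a few hundred cubes would already add up to roughly a terabyte of uncompressed data.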

Hyperspectral data is captured along with other critical measurements in ORNL’s Advanced Plant Phenotyping Laboratory and then presented in a 3D cube. This hypercube shows spectral data collected from poplar trees. Credit: Hong-Jun Yoon and Stan Martin/ORNL, U.S. Dept. of Energy

“We needed a way to scale both the delivery of data and our development, because we can’t have a closed system that only a few developers can access and improve,” Martin said.

To accomplish this goal, Prout and Leland used two readily available tools: Kubernetes, an open-source platform for deploying and managing containerized applications, and the Themis storage system, which sits in an open-security storage enclave managed by the OLCF. Together, these tools enabled them to create a pipeline that automatically builds and deploys APPL’s hyperspectral imaging application and provides secure, accessible and local data storage. The method is described in a technical report published earlier this year.

The project was an important pilot for both facilities. For APPL, the collaboration is a critical step in making its data more accessible to external users and paving the way for continued streamlining of its data management.

“Our biggest success in all this was that we demonstrated the ability to use an open and scalable system for our data,” Martin said. The result gets to the heart of APPL’s mission to accelerate the transformation of plants as hardy feedstocks for bioenergy and bioproducts, for agriculture and for climate resilience.

For the OLCF, which primarily offers HPC resources and support, working with APPL was an opportunity to develop, test and deploy scalable and secure data services for users outside the facility, expanding its support capabilities for open research.

“There’s a lot of good science that could be enabled by these data services that otherwise might not be possible or would take much more time,” Leland said.

The collaboration is also a promising example of the enhanced capabilities at DOE facilities that will be possible through IRI.

“Pairing reliable, scalable and flexible data services with the OLCF’s computing resources is really unique,” Prout said. “I hope that this work ultimately contributes to the future of science and scientific discovery.”

The work was supported by the Laboratory Directed Research and Development Program at ORNL.

UT-Battelle manages ORNL for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. The Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit energy.gov/science.