
The Advanced Computing Ecosystem (ACE) testbed is a unique OLCF capability that provides a centralized, sandboxed area for deploying heterogeneous computing and data resources and for evaluating diverse workloads across a broad spectrum of system architectures. ACE is designed to fuel the cycle of productization of new HPC technologies, as applicable to the OLCF and DOE missions. ACE is an open-access environment consisting of production-capable HPC resources, and it allows researchers and HPC system architects to assess existing and emerging technologies more freely, without the limitations of a production environment. Topics of interest include:

  • IRI workflows and patterns (i.e., time-sensitive, data-intensive)
  • Emerging compute architectures and techniques for HPC (e.g., ARM, AI appliances, reconfigurable)
  • Emerging storage architectures and techniques for HPC (e.g., object storage)
  • Emerging network architectures and techniques (e.g., DPUs)
  • Cloudification of traditional HPC architectures (e.g., multi-tenancy, preemptible queues)

Testbeds System Support

The Testbeds Guide is the definitive source of information about the Advanced Computing Ecosystem (ACE) testbed, and details everything from connecting to running complex workflows. Please direct questions about Testbeds and their usage to the OLCF User Assistance Center by emailing [email protected].

Earth Systems Grid Federation

The Earth System Grid Federation (ESGF) Peer-to-Peer (P2P) enterprise system is a collaboration that develops, deploys, and maintains software infrastructure for the management, dissemination, and analysis of model output and observational data. ESGF’s primary goal is to facilitate advancements in Earth System Science. It is an interagency and international effort led by the Department of Energy (DOE) and co-funded by the National Aeronautics and Space Administration (NASA), the National Oceanic and Atmospheric Administration (NOAA), the National Science Foundation (NSF), and international laboratories such as the Max Planck Institute for Meteorology (MPI-M), the German Climate Computing Centre (DKRZ), the Australian National University (ANU) National Computational Infrastructure (NCI), the Institut Pierre-Simon Laplace (IPSL), and the Centre for Environmental Data Analysis (CEDA). The ESGF mission is to:

  • Support current CMIP6 activities, and prepare for future assessments
  • Develop data and metadata facilities for inclusion of observations and reanalysis products for CMIP6 use
  • Enhance and improve current climate research infrastructure capabilities through involvement of the software development community and through adherence to sound software principles
  • Foster collaboration across agency and political boundaries
  • Integrate and interoperate with other software designed to meet the objectives of ESGF: e.g., software developed by NASA, NOAA, ESIP, and the European IS-ENES
  • Create software infrastructure and tools that facilitate scientific advancements

ESGF P2P is a component architecture expressly designed to handle large-scale data management for worldwide distribution. The team of computer scientists and climate scientists has developed an operational system for serving climate data from multiple locations and sources. Model simulations, satellite observations, and reanalysis products are all being served from the ESGF P2P distributed data archive.
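
As a concrete illustration of how data served by the federation can be discovered programmatically, the sketch below queries the public ESGF search API for a handful of CMIP6 dataset records; the index node URL, search facets, and result handling are illustrative choices rather than a prescribed workflow.

# Minimal sketch of querying the ESGF search API for CMIP6 datasets.
# The node URL and facet names below are illustrative; any ESGF index
# node exposing the esg-search endpoint could be substituted.
import requests

SEARCH_URL = "https://esgf-node.llnl.gov/esg-search/search"  # example index node

params = {
    "project": "CMIP6",            # facet: which intercomparison project
    "experiment_id": "historical", # facet: experiment of interest
    "variable": "tas",             # facet: near-surface air temperature
    "format": "application/solr+json",
    "limit": 5,
}

response = requests.get(SEARCH_URL, params=params, timeout=30)
response.raise_for_status()

for doc in response.json()["response"]["docs"]:
    print(doc.get("id"))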

Gamma Ray Energy Tracking Array (GRETA)

The Gamma Ray Energy Tracking Array (GRETA) is a state-of-the-art gamma-ray spectrometer being built at Lawrence Berkeley National Laboratory, to be first sited at the Facility for Rare Isotope Beams (FRIB) at Michigan State University. A key design requirement for the spectrometer is to perform gamma-ray tracking in near real time. To meet this requirement, we have used an inline, streaming approach to signal processing in the GRETA data acquisition system, using a GPU-equipped computing cluster. At full design capacity the data stream will reach 480 thousand events per second, at an aggregate data rate of 4 gigabytes per second. We have been able to simplify the architecture of the streaming system greatly by interfacing the FPGA-based detector electronics with the computing cluster using standard network technology. A set of high-performance software components implementing queuing, flow control, event processing, and event building has been developed, all in a streaming environment that matches detector performance. Prototypes of all high-performance components have been completed and meet design specifications.
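
The streaming pattern described above, a bounded queue between the network receiver and a pool of event-processing workers followed by an event builder, can be sketched in a few lines of Python; the queue depth, placeholder signal processing, and thread counts below are illustrative and are not the actual GRETA data acquisition code.

# Minimal sketch of an inline, streaming processing pipeline: a bounded queue
# provides flow control between a receiver and event-processing workers, and a
# simple event builder collects the results. All names and sizes are placeholders.
import queue
import threading

RAW_QUEUE_DEPTH = 1024                   # bounded queue gives backpressure (flow control)
raw_events = queue.Queue(maxsize=RAW_QUEUE_DEPTH)
processed_events = queue.Queue()

def receiver(n_events: int) -> None:
    """Stand-in for the network interface to the FPGA-based detector electronics."""
    for event_id in range(n_events):
        waveform = [event_id % 7] * 16          # placeholder digitized signal
        raw_events.put((event_id, waveform))    # blocks when the queue is full
    raw_events.put(None)                        # end-of-stream sentinel

def worker() -> None:
    """Per-event signal processing (placeholder for the real analysis)."""
    while True:
        item = raw_events.get()
        if item is None:
            raw_events.put(None)                # propagate the sentinel to peer workers
            break
        event_id, waveform = item
        energy = sum(waveform)                  # placeholder computation
        processed_events.put((event_id, energy))

def event_builder(n_events: int) -> None:
    """Collect processed fragments and emit time-ordered built events."""
    built = [processed_events.get() for _ in range(n_events)]
    built.sort(key=lambda e: e[0])
    print(f"built {len(built)} events, first: {built[0]}")

workers = [threading.Thread(target=worker) for _ in range(4)]
for t in workers:
    t.start()
receiver(1000)
event_builder(1000)
for t in workers:
    t.join()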

MOSSAIC

As a part of the NCI-DOE Collaboration, the MOSSAIC project applies natural language processing (NLP) and deep learning algorithms to population-based cancer data. This data comes from NCI’s Surveillance, Epidemiology, and End Results (SEER) program.
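
For a sense of the underlying task, the sketch below classifies short, synthetic pathology-report snippets by cancer site using a simple TF-IDF and logistic-regression pipeline; the reports, labels, and model are placeholders, as the project itself relies on deep learning models trained on SEER registry data.

# Minimal sketch of the kind of NLP task MOSSAIC addresses: classifying
# free-text pathology report snippets by cancer site. The reports, labels,
# and TF-IDF/logistic-regression model are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reports = [
    "infiltrating ductal carcinoma of the left breast",
    "adenocarcinoma identified in the sigmoid colon",
    "lobular carcinoma in situ, right breast biopsy",
    "colonic mucosa with invasive adenocarcinoma",
]
sites = ["breast", "colon", "breast", "colon"]   # hypothetical site labels

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(reports, sites)

print(model.predict(["biopsy shows ductal carcinoma of the breast"]))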

The goals of the MOSSAIC project are to:

  • deliver advanced computational and informatics solutions needed to support a comprehensive, scalable, and cost-effective national cancer surveillance program.
  • lay the foundation for an integrative data-driven approach to modeling cancer outcomes at scale and in real time.

Figure: MOSSAIC uses electronic health records from across the United States to predict patient outcomes with artificial intelligence. Records spanning diagnosis, pathology, molecular characterization, initial treatment, subsequent treatment, progression/recurrence, and survival/cause of death flow into the Surveillance, Epidemiology, and End Results program, then into supercomputing and deep learning, which support understanding treatments, improving real-world outcomes, and prospectively supporting the development of new diagnostics and treatments.


With this knowledge, scientists may better understand the impact of new diagnostics, treatments, and other factors affecting patient trajectories and outcomes. The MOSSAIC team is developing end-to-end capabilities, from scientific discovery to operationalization, as well as trustworthy, explainable, and secure artificial intelligence (AI) solutions that are extensible across a broad range of data sources.

MOSSAIC is co-led by:

Joint Genome Institute (JGI)

The Joint Genome Institute (JGI) is a multi-project user facility located at LBNL. These projects are focused on large-scale genomic sciences that address questions relevant to the missions of DOE’s Office of Biological and Environmental Research (BER) in sustainable biofuel and bioproducts production, global carbon and nutrient cycling, and biogeochemistry, among others. Several projects involve producing and processing microbial, fungal, algal, and plant genome data. These activities give rise to massive data transmission, processing, and storage challenges, and a cross-facility approach to computation and data processing may alleviate some of them.

The ACE testbed acts as a precursor to an infrastructure for data transfer, storage, and workflow processing that will support JGI-produced workflows and service artifacts such as the JGI Analysis Workflows Service (JAWS). The testbed can be seamlessly integrated into the existing JGI infrastructure, augmenting the compute and storage capabilities of the overall system.

JGI PoC: Steve Chan (LBL)
OLCF PoC: Ketan Maheshwari

Linac Coherent Light Source II - AI Models and Data Streams (LCLStream)

The LCLStream project is focused on the development of advanced AI models and data management techniques for the Linac Coherent Light Source II (LCLS-II). One goal of this project is to train a generalist AI model for interpreting experimental X-ray data, which is vital for steering and optimizing the various instruments of this high repetition rate X-ray light source. The project also aims to demonstrate a multi-stream and bidirectional data flow from high data rate detectors at LCLS-II, enabling real-time feedback and data processing. LCLStream addresses the challenge of constrained experimental design by integrating experimental setup, data collection, and analysis. This involves the development of data curation platforms, real-time compression algorithms, and ML inference techniques, enhancing edge-to-HPC analysis pipelines.
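
The edge-side pattern of compressing detector frames in flight and using lightweight inference to decide what to forward can be sketched as follows; the frame shape, compression choice, scoring function, and threshold are illustrative assumptions, not the LCLS-II implementation.

# Minimal sketch of streaming compression plus inference at the edge: each
# detector frame is losslessly compressed and a placeholder "model" decides
# whether the frame is worth forwarding for full analysis on the HPC side.
import zlib
import numpy as np

rng = np.random.default_rng(0)

def acquire_frame(shape=(256, 256)) -> np.ndarray:
    """Stand-in for reading one frame from a high-rate X-ray detector."""
    return rng.poisson(lam=2.0, size=shape).astype(np.uint16)

def score_frame(frame: np.ndarray) -> float:
    """Placeholder for ML inference, e.g. a hit-finding classifier."""
    return float(frame.mean())

kept = 0
for _ in range(100):
    frame = acquire_frame()
    payload = zlib.compress(frame.tobytes(), level=1)   # fast, lossless compression
    if score_frame(frame) > 2.0:                        # veto uninteresting frames
        kept += 1
        # in a real pipeline: send `payload` upstream to the HPC analysis stage
print(f"kept {kept}/100 compressed frames")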

The ACE testbed plays a pivotal role in supporting the LCLStream project, aligning with the DOE Integrated Research Infrastructure (IRI) goals. For the AI model training effort, ACE provides the computational resources and environment necessary to train the model on an extensive dataset, facilitating effective data analysis and intelligent data collection. For the data flow and real-time feedback, ACE’s capabilities in handling high data rate detectors and facilitating real-time feedback loops are crucial for optimizing compression settings and improving experimental data focus. To address experimental design challenges, ACE’s support in data curation and processing is essential for rapid adaptations in experimental setups, leveraging large datasets, and optimizing instrument configurations.

Photo Credit: Greg Stewart (SLAC)

LCLS-II PoC: Jana Thayer, Frédéric Poitevin, and Ryan Coffee (SLAC)
OLCF PoC: David Rogers

HFIR’s IMAGINE

IMAGINE-X will accelerate structure solution and drug discovery for bio-preparedness by delivering new breakthrough capabilities for neutron structural biology at ORNL: using Dynamic Nuclear Polarization (DNP) technologies to attain order-of-magnitude gains in the signal-to-noise ratio of neutron data, using crystals that are radically smaller than has previously been possible, and controlled by new AI-driven analysis and simulation methods. This breakthrough will produce >100-fold gains in performance for neutron diffraction analysis of biological systems, accelerating the development of new therapeutics for disease and the understanding and control of enzymes with designed catalytic and ligand-binding behaviors.

A virtual instrument for in situ analysis: In parallel, we will exploit cutting-edge AI technologies and the Frontier and ACE computing resources at OLCF to develop a virtual instrument to help optimize the design of the new IMAGINE instrument, accelerate real-time data collection, reduce data redundancy, and enable autonomous online experiment steering. This co-design approach will simultaneously develop the new IMAGINE instrument, the virtual instrument, and the computing infrastructure needed to connect HFIR, OLCF, and edge computing at the beamline, improving experiment steering with computational workflows.

The outcome will be a workflow from instrument to edge to exascale as an Integrated Research Infrastructure (IRI) that enables: i) in situ data collection and reduction, ii) real-time estimation of missing Bragg peaks, and iii) prediction of tight error bounds. We will offer interactive diffraction data analysis software to extract Bragg peaks, using high-performance NVIDIA GPU-enabled edge computing nodes for model inference within the admissible set of protein crystal structures.
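
As a toy illustration of the Bragg-peak extraction step, the sketch below finds local maxima above a background threshold in a synthetic diffraction image; the image, filter size, and threshold are placeholders, and the production analysis would use far more careful physical modeling on the GPU-enabled edge nodes.

# Minimal sketch of Bragg-peak extraction: find local maxima above a
# background threshold in a (synthetic) diffraction image.
import numpy as np
from scipy.ndimage import maximum_filter

rng = np.random.default_rng(1)
image = rng.normal(loc=10.0, scale=1.0, size=(512, 512))    # smooth background
for y, x in [(100, 200), (300, 50), (400, 400)]:             # three fake peaks
    image[y, x] += 50.0

local_max = maximum_filter(image, size=5) == image           # local-maxima mask
threshold = image.mean() + 10.0 * image.std()                # crude background cut
peaks = np.argwhere(local_max & (image > threshold))

print(f"found {len(peaks)} candidate Bragg peaks at:\n{peaks}")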

IMAGINE Laue Diffractometer, beamline CG-4D at HFIR. Credit: ORNL

IMAGINE-X PoC: Dean Myles (SNS)
OLCF PoC: Jens Glaser

Compute Resources

Defiant
– 36 nodes: AMD EPYC CPU, 4 AMD MI100 GPUs, Slingshot 10 networking
– Former Frontier early access system

Wombat
– AArch64 testbed with multiple node configurations
– Fujitsu A64FX CPU, EDR IB (8 nodes)
– Ampere Computing Altra CPU, 2 NVIDIA Ampere GPUs, 2 BlueField-2 DPUs (8 nodes)

Graphcore
– Dedicated AI appliance – Bow Pod16 configuration

Holly
– Single Supermicro server with 8 NVIDIA H100 GPUs


Storage Resources

Polis – Lustre
– ~1.6 PB
– Primarily spinning disk with some flash, connected to Defiant

VastData – NFS
– ~600 TB
– NFS-over-RDMA storage appliance
– Flash, connected to the IB fabric

DAOS – Object Storage
– 4 servers with ~30 TB flash each and 4 HDR IB connections