Advanced Computing Ecosystem (ACE) Testbed
The Advanced Computing Ecosystem (ACE) testbed is a unique OLCF capability that provides a centralized, sandboxed area for deploying heterogeneous computing and data resources and for evaluating diverse workloads across a broad spectrum of system architectures. ACE is designed to fuel the cycle of productization of new HPC technologies, as applicable to the OLCF and DOE missions. ACE is an open-access environment consisting of production-capable HPC resources; it allows researchers and HPC system architects to assess existing and emerging technologies more freely, without the limitations of a production environment. Topics of interest include:
- Integrated Research Infrastructure (IRI) workflows and patterns (e.g., time-sensitive, data-intensive)
- Emerging compute architectures and techniques for HPC (e.g., ARM, AI appliances, reconfigurable)
- Emerging storage architectures and techniques for HPC (e.g., object storage)
- Emerging network architectures and techniques (e.g., DPUs)
- Cloudification of traditional HPC architectures (e.g., multi-tenancy, preemptible queues)
Supported Science Workflows
Joint Genome Institute (JGI)
The Joint Genome Institute (JGI) is a multi-project user facility located at LBNL. Its projects focus on large-scale genomic sciences that address questions of relevance to the missions of DOE’s Office of Biological and Environmental Research (BER) in sustainable biofuel and bioproducts production, global carbon and nutrient cycling, and biogeochemistry, among others. Several projects involve producing and processing microbial, fungal, algal, and plant genome data. These activities create significant challenges in transmitting, processing, and storing massive amounts of data. A cross-facility approach to computation and data processing may alleviate some of these challenges.
The ACE testbed acts as a precursor to an infrastructure for data transfer, storage, and workflow processing that will support JGI-produced workflows and service artifacts such as the JGI Analysis Workflows Service (JAWS). The testbed can be seamlessly integrated into the existing JGI infrastructure, augmenting the compute and storage capabilities of the overall system.
JGI PoC: Steve Chan (LBNL)
OLCF PoC: Ketan Maheshwari
Linac Coherent Light Source II – AI Models and Data Streams (LCLStream)
The LCLStream project is focused on the development of advanced AI models and data management techniques for the Linac Coherent Light Source II (LCLS-II). One goal of this project is to train a generalist AI model for interpreting experimental X-ray data, which is vital for steering and optimizing the various instruments of this high repetition rate X-ray light source. The project also aims to demonstrate a multi-stream and bidirectional data flow from high data rate detectors at LCLS-II, enabling real-time feedback and data processing. LCLStream addresses the challenge of constrained experimental design by integrating experimental setup, data collection, and analysis. This involves the development of data curation platforms, real-time compression algorithms, and ML inference techniques, enhancing edge-to-HPC analysis pipelines.
The ACE testbed plays a pivotal role in supporting the LCLStream project, aligning with the DOE Integrated Research Infrastructure (IRI) goals. For the AI model training effort, ACE provides the computational resources and environment necessary to train the model on an extensive dataset, facilitating effective data analysis and intelligent data collection. For the data flow and real-time feedback, ACE’s capabilities in handling data from high data rate detectors and facilitating real-time feedback loops are crucial for optimizing compression settings and improving experimental data focus. To address experimental design challenges, ACE’s support for data curation and processing is essential for rapidly adapting experimental setups, leveraging large datasets, and optimizing instrument configurations.
LCLS-II PoC: Jana Thayer, Frédéric Poitevin, and Ryan Coffee (SLAC)
OLCF PoC: David Rogers
IMAGINE-X
IMAGINE-X will accelerate structure solution and drug discovery for bio-preparedness by delivering new breakthrough capabilities for neutron structural biology at ORNL. It will use Dynamic Nuclear Polarization (DNP) technologies to attain order-of-magnitude gains in the signal-to-noise (S/N) ratio of neutron data, using crystals that are radically smaller than has previously been possible, and will be controlled by new AI-driven analysis and simulation methods. This breakthrough will produce >100-fold gains in performance for neutron diffraction analysis of biological systems, accelerating the development of new therapeutics for disease and the understanding and control of enzymes with designed catalytic and ligand-binding behaviors.
A virtual instrument for in situ analysis: In parallel, we will exploit cutting-edge AI technologies together with the Frontier and ACE computing resources at OLCF to develop a virtual instrument that helps optimize the design of the new IMAGINE instrument, accelerate real-time data collection, reduce data redundancy, and enable autonomous online experiment steering. This co-design approach will simultaneously develop the new IMAGINE instrument, the virtual instrument, and the computing infrastructure needed to connect HFIR, OLCF, and edge computing at the beamline, improving experiment steering with computational workflows.
The outcome will be an instrument-to-edge-to-exascale workflow, built as an Integrated Research Infrastructure (IRI), that enables: i) in situ data collection and reduction, ii) real-time estimation of missing Bragg peaks, and iii) prediction of tight error bounds. We will offer interactive diffraction data analysis software to extract Bragg peaks, using high-performance, NVIDIA GPU-enabled edge computing nodes for model inference within the admissible set of protein crystal structures.
IMAGINE-X PoC: Dean Myles (SNS)
OLCF PoC: Jens Glaser
– 36 nodes: AMD EPYC CPU, 4 AMD MI100 GPUs, Slingshot 10 networking
– Former Frontier early access system
– AArch64 testbed with multiple node configurations
– Fujitsu A64FX CPU, EDR IB (8 nodes)
– Ampere Computing Altra CPU, 2 NVIDIA Ampere GPUs, 2 BlueField-2 DPUs (8 nodes)
– Dedicated AI appliance – Bow Pod16 configuration
– Single Supermicro server with 8 NVIDIA H100 GPUs
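As a rough aid for sizing experiments, the GPU figures in the list above can be tallied with a short script. This is a minimal sketch, not an official inventory: the group labels are hypothetical, and it assumes the GPU counts listed for the 36-node and 8-node groups are per node, which the list does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroup:
    label: str           # hypothetical label, not an official system name
    nodes: int           # node count taken from the list above
    gpus_per_node: int   # ASSUMPTION: listed GPU counts are per node


GROUPS = [
    NodeGroup("AMD MI100 nodes", nodes=36, gpus_per_node=4),
    NodeGroup("Ampere Altra nodes", nodes=8, gpus_per_node=2),
    NodeGroup("H100 server", nodes=1, gpus_per_node=8),
]


def total_gpus(group: NodeGroup) -> int:
    """Return the number of GPUs in a node group."""
    return group.nodes * group.gpus_per_node


gpu_counts = {g.label: total_gpus(g) for g in GROUPS}

for label, count in gpu_counts.items():
    print(f"{label}: {count} GPUs")
```

Under the per-node assumption, this gives 144 MI100 GPUs, 16 NVIDIA Ampere GPUs, and 8 H100 GPUs.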
Polis – Lustre
– ~1.6 PB
– Primarily spinning disk with some flash, connected to Defiant
VastData – NFS
– ~600 TB
– NFS-over-RDMA storage appliance
– Flash, connected to the IB fabric
DAOS – Object Storage
– 4 servers with ~30 TB flash each and 4 HDR IB connections
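The storage figures above can be combined into an approximate aggregate capacity. The sketch below is illustrative only: all inputs are the approximate ("~") values from the list, decimal units (1 PB = 1000 TB) are assumed, the DAOS total is derived from the per-server figure rather than stated in the list, and the result says nothing about usable capacity after redundancy or file system overhead.

```python
TB_PER_PB = 1000  # ASSUMPTION: decimal units

# Approximate capacities in TB, taken from the list above.
lustre_tb = 1600       # Polis (Lustre): ~1.6 PB
nfs_tb = 600           # VastData (NFS): ~600 TB
daos_servers = 4       # DAOS: 4 servers
daos_tb_per_server = 30  # ~30 TB flash each

# Derived value, not stated in the list: aggregate DAOS flash.
daos_tb = daos_servers * daos_tb_per_server

total_tb = lustre_tb + nfs_tb + daos_tb

print(f"DAOS aggregate flash: ~{daos_tb} TB")
print(f"Total across all three systems: ~{total_tb} TB "
      f"(~{total_tb / TB_PER_PB:.2f} PB)")
```

With these inputs, the DAOS tier comes to roughly 120 TB of flash and the three systems together to roughly 2.3 PB.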