Uncovering hidden microbial worlds

PI: Kathy Yelick
Lawrence Berkeley National Laboratory

In 2016, the Department of Energy’s Exascale Computing Project, or ECP, set out to develop advanced software for the arrival of exascale-class supercomputers capable of a quintillion (10^18) or more calculations per second. That meant rethinking, reinventing and optimizing dozens of scientific applications and software tools to leverage exascale’s thousandfold increase in computing power. That time has arrived as the first DOE exascale computer, the Oak Ridge Leadership Computing Facility’s Frontier, opens to users around the world. “Exascale’s New Frontier” explores the applications and software technology for driving scientific discoveries in the exascale era.

The Scientific Challenge

A single drop of water or handful of dirt can contain its own universe of microbial organisms, many so small they continue to evade detection by all but the closest examination. Piecing together the traces of these microbes, particularly the proteins they rely on to survive, requires sifting through mountains of data at once. That has long been a task beyond the reach of even the fastest supercomputers, until now.

The ExaBiome project seeks to catalog the countless microscopic ecosystems, or microbiomes, found all around us via the power of exascale computing on the Frontier supercomputer at DOE’s Oak Ridge National Laboratory. Image credit: Getty Images

Why Exascale?

The ExaBiome project, a joint effort of scientists at Lawrence Berkeley and Los Alamos national laboratories and the Joint Genome Institute, seeks to catalog these microscopic ecosystems, or microbiomes, via the power of exascale computing on the Frontier supercomputer at DOE’s Oak Ridge National Laboratory. The ExaBiome team has spent years developing and optimizing codes such as MetaHipMer for assembling genomes from microbial samples; the Protein Alignment via Sparse Matrices code, or PASTIS; and the High Performance Markov Clustering algorithm, or HipMCL. These applications harness exascale’s speeds to reconstruct, classify and compare collected genome sequences and to understand the relationship and function of genes within microbial species.

“Assembly is like putting together a jigsaw puzzle with no box cover to guide us and with all the pieces dumped together from hundreds of different puzzles,” said Kathy Yelick, a senior computational scientist at Berkeley Lab. “We pick out these pieces and put them together into sequences. We may not have a complete puzzle, but we at least have parts of it. Then we can put these long sequences into bins that go together and compare them to what we already know. These organisms live in communities, so something like a single soil sample might have hundreds of thousands of these species in it.”

Even the average supercomputer can’t handle calculations of that size and complexity.

“If we don’t have the capability of running these large, distributed computations, these small species end up looking like errors because there aren’t enough single microbes to be recognized on their own,” said Leonid Oliker, a senior computational scientist at Berkeley Lab and director of the ExaBiome project. “It’s only when these microbes are combined that there’s something to see. That’s why only an exascale machine like Frontier can do this at the speed and scale that we need.”

Frontier Success

The ExaBiome codes have run calculations across all of Frontier’s more than 9,000 compute nodes. Frontier’s massive exascale throughput allowed researchers to shrink the cataloging and comparison work of months or weeks into days or hours. The overall peak performance on Frontier reflects a 536× improvement over the benchmark initially set by the team.

“Exascale has enabled us to discover new microbial species that don’t exist in any established databases,” Yelick said. “Thanks to Frontier, we can analyze much larger datasets than have ever been possible — up to 100 terabytes and more.

“Exascale has changed not just our understanding but how we conduct science in the environmental biology community. Now that we can analyze datasets this large, it’s worth collecting more data. Before, there was less incentive to collect information in this much detail because we were really limited by the computational aspects as far as what we could analyze. We couldn’t do anything with it. Now we have the opportunity to analyze as much as we can collect.”

What’s Next?

The ExaBiome team plans to apply exascale analysis to some of the fundamental questions of biology and genomics, from the human biome to microbe samples gathered from the ocean floor.

“Thanks to Frontier, we’re gaining a much clearer, much more detailed picture of what’s living and happening in the microbial world,” Yelick said. “What’s the functional behavior of these microbes? What genes do they possess? How do they interact? We’re now closer to answering all of these questions. If we really want to understand all the microbes in the world and exactly how they interact with one another, that’s a problem too big even for Frontier. But this is a start.”

Support for this research came from the ECP, a collaborative effort of the DOE Office of Science and the National Nuclear Security Administration, and from the DOE Office of Science’s Advanced Scientific Computing Research program. The OLCF is an Office of Science user facility at ORNL.

UT-Battelle LLC manages ORNL for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. The Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit https://energy.gov/science.