The decade-long effort to attain the next thousand-fold increase in compute power encompasses multiple agencies, committees, labs, companies, and some lucky choices
Solving the world’s biggest challenges requires the most sophisticated scientific tools, including fast and powerful supercomputers. That’s why the US Department of Energy (DOE) devotes so many resources to designing and building next-generation systems. But back in 2008, the feasibility of exascale-class computing, meaning supercomputers that can perform at least one exaflop, or a billion billion (10^18) floating-point operations per second, did not look good.
Peter M. Kogge, the McCourtney Professor of Computer Science and Engineering at the University of Notre Dame, led a study sponsored by the Defense Advanced Research Projects Agency (DARPA) to identify exascale’s key hurdles. Published in May 2008, Technology Challenges in Achieving Exascale Systems surveyed high-performance computing (HPC) experts from universities, research labs, and industry to predict whether the nation could—by 2015—attain a thousand-fold increase in computational power over the then-new petascale systems. Their consensus? Not without changing the trajectory of HPC technology from the state of the art at that time (see “Exascale Computing’s Four Biggest Challenges and How They Were Overcome”).
“We all believed that it would be unlikely to meet the challenges by 2015, especially power—the best we could forecast for 2015 technology was over three times the goal. It was not at all clear how exascale would be reached with business as usual,” Kogge said.
But now, 13 years later, exascale computing is about to become a reality. Frontier, an HPE Cray EX system featuring 3rd Gen AMD EPYC™ CPUs and AMD Instinct GPUs, is currently being installed at the DOE’s Oak Ridge National Laboratory (ORNL). Once switched on, Frontier will be the harbinger of a new era in computational science, followed in the next couple of years by Aurora at Argonne National Laboratory (Argonne) and El Capitan at Lawrence Livermore National Laboratory (LLNL). These exascale-class supercomputers will tackle more extensive problems and answer more complex questions than ever before. It’s a goal that several other countries are actively pursuing as well.
How did the Oak Ridge Leadership Computing Facility (OLCF)—a DOE Office of Science user facility at ORNL—and the other DOE national labs and their respective vendor partners (AMD, HPE, IBM, Intel, and NVIDIA) overcome the hurdles and achieve the remedies identified in Kogge’s report? It took a lot of organizational work, several false starts, and some fortuitous advances in computing technology.
“We had to build more energy-efficient processors. We had to build more reliable hardware and better interconnects. We had to design algorithms that can use that level of concurrency. We had to be able to move data around more efficiently,” said Jack Dongarra, director of the Innovative Computing Laboratory at the University of Tennessee, Knoxville. “So those are all complicated things, and it required research to develop the technology necessary to achieve them. But in the intervening 10 years or so, we did just that.”
So, Why Exascale?
In 2018, the DOE announced it would commit up to $1.8 billion to develop new exascale supercomputer hardware. The announcement culminated a decade-long effort along many different fronts to make the case that exascale could soon be achievable in terms of technology, productivity, and affordability. But underlying all the studies, workgroups, and presentations was a core concept: the United States needs exascale supercomputers to stay competitive.
“The whole reason for exascale computing is because of the science. To advance science, you need exascale computing. Everything today is done through simulation, and simulation is done on computers. To get the best simulations, you need the fastest computers, so getting to exascale provides a tool that scientists use to do their research,” said Dongarra, a seminal figure in HPC academia who devised the LINPACK benchmark in 1979, which evolved into the High Performance Linpack benchmark used for ranking the world’s most powerful computer systems in the TOP500.
Attaining exascale became a goal soon after petascale supercomputers became a reality with the 2008 launch of the Roadrunner machine at Los Alamos National Laboratory (LANL), the world’s first HPC system to break the petaflop barrier, at 1.026 petaflops. (A petaflop is one thousand million million, or 10^15, floating-point operations per second.) Dongarra and his peers immediately began considering the possibilities of the subsequent thousand-fold increase in computational power, as did government agencies such as DOE and DARPA, which underwrote their studies.
At the OLCF, computational scientists made the most out of the petascale power of Jaguar, which took the number one spot on the TOP500 in November 2009. The successor OLCF machine, Titan, which employed GPU accelerators to increase its performance to over 10 petaflops, took first place on the TOP500 list in November 2012. The 200-petaflop Summit arrived at the OLCF in 2018 and held the title of world’s most powerful computer from June 2018 to June 2020. Still, scientists saw a need for even more computational power.
“We had succeeded at petascale, yet at the same time we knew that we didn’t have enough physical fidelity for a lot of the problems we wanted to attack,” said Bronson Messer, the OLCF’s director of science. “We didn’t have enough grid points to resolve phenomena that were happening on smaller scales, and we couldn’t include as much physics as we wanted to in multiphysics simulations like climate, nuclear engineering, or stellar astrophysics. It was clear that orders of magnitude of additional capability were needed to make that possible.”
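Messer’s point about grid points can be made concrete with a back-of-the-envelope scaling estimate. The Python sketch below is illustrative only and is not drawn from any OLCF code. It assumes a uniform grid whose cost grows with the number of grid points and, optionally, a time step that must shrink in proportion to the grid spacing, as in many explicit solvers. The function name `resolution_gain` and its parameters are hypothetical.

```python
def resolution_gain(compute_factor: float, spatial_dims: int = 3,
                    refine_time_step: bool = True) -> float:
    """Estimate how much finer a uniform grid can get for a given compute increase.

    Assumption (not from the article): refining every spatial dimension by a
    factor r multiplies the cost by r**spatial_dims, with one extra factor of r
    if the time step has to shrink along with the grid spacing.
    """
    if compute_factor <= 0:
        raise ValueError("compute_factor must be positive")
    exponent = spatial_dims + (1 if refine_time_step else 0)
    return compute_factor ** (1.0 / exponent)


if __name__ == "__main__":
    # Petascale to exascale is a thousand-fold increase in compute.
    for dims in (2, 3):
        gain = resolution_gain(1000.0, spatial_dims=dims)
        print(f"{dims}D grid with time stepping: about {gain:.1f}x finer resolution")
```

Under these assumptions, a thousand-fold jump in compute buys only about a 5.6-fold finer three-dimensional grid, which is one way to see why scientists ask for orders of magnitude of additional capability rather than incremental gains.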
It was also clear to international scientists—and their governments in China, Japan, and the European Union—that exascale capability should be a priority for their countries, leading to their own efforts to launch exascale supercomputers.
“The people with the fastest computers can do the best science,” Dongarra said. “I like to think of the exascale computer as a very sophisticated scientific tool. These exascale computing tools will be used to understand something that we couldn’t understand before, so it allows us to push back those frontiers of science.”
Planning Ahead for a New Class of Supercomputer
To design and construct a next-generation supercomputer, the first step is the trickiest: find funding. It’s unlikely that any private companies will have the budgets (or business cases) to dedicate hundreds of millions of dollars toward amassing all the components and software needed to launch a new generation of supercomputers, so it’s typically the job of government. But to convince government agencies—and, in turn, Congress—to spend large sums of money on cutting-edge technology, a strong case must be made for its necessity and feasibility, and that entails lots of study groups and a lot of effort to find consensus.
After that first exascale study edited by Kogge in 2008, several reports were published over the ensuing decade that examined the possibilities (and potential impossibilities) of exascale computing, usually with Jack Dongarra’s name on them: The Opportunities and Challenges of Exascale Computing (Summary Report of the Advanced Scientific Computing Advisory Committee [ASCAC] Subcommittee, 2010), “The International Exascale Software Project Roadmap” (The International Journal of High Performance Computing Applications, 2011), Top Ten Exascale Research Challenges (DOE ASCAC Subcommittee Report, 2014), and many others. Most of them identified the same technical challenges that still needed to be overcome to achieve exascale—reducing energy consumption and increasing the reliability of the chips powering these potentially massive systems.
Meanwhile, exascale alliances began forming with the goal of figuring out how to solve the technical challenges identified in those reports, and in the hope of gaining enough political traction to move exascale projects through federal agencies. One early effort, called Nexus/Plexus, came out of the DOE Office of Science’s Advanced Scientific Computing Research program and attempted to cover six different technical research areas but never quite got off the ground. In 2010, the Scientific Partnership for Exascale Computing was formed between ORNL, LANL, and Sandia National Laboratories. Not long after, an alliance called ABLE was launched between Argonne, Lawrence Berkeley National Laboratory, and LLNL. Those groups were scrapped in favor of a single union of seven national labs, called E7.
Finally, however, real forward motion toward exascale came on three fronts.
First, relationships with computer technology vendors were formalized in 2012 with the FastForward and DesignForward programs. FastForward focused on processor, memory, and storage vendors, such as AMD and Intel, to address power consumption and resiliency issues. DesignForward focused mainly on system integrators, such as HPE, Cray, and IBM, to plan packaging, integration, and engineering. Cray was later acquired by HPE in 2019.
Second, the Collaboration of Oak Ridge, Argonne, and Livermore (CORAL) was formed by the DOE in 2012 to streamline the supercomputer procurement process. Each of those labs was acquiring a pre-exascale system at the time, and it made sense for them to work together. CORAL-2 continues this successful collaboration with the goal of procuring exascale systems.
Third, the Exascale Computing Project (ECP) launched in 2016 as part of the Exascale Computing Initiative, assembling over 1,000 researchers from 15 labs, 70 universities, and 32 vendors to tackle exascale application development as well as software libraries and software technologies. The software will be a critical factor in determining the early success of these new exascale systems.
“When you have a one-of-a-kind, world-class supercomputer coming, you better be ready to use it—you can’t afford to let it sit. We’re committed to delivering answers to problems of national interest,” said Doug Kothe, ECP project director. “Everything has to be production-hardened quality, performant, and portable on day one. So, to get the largest return on taxpayer investment, and the fastest route to new scientific discovery, we have to have all these apps ready.”
With these programs in place, requirements for actual exascale systems needed to be solidified for the DOE to issue Requests for Information and Requests for Proposals, necessary steps in the government procurement process to spur competition between vendors.
“We need these machines to be stable so that, in concert with our apps and software stack, they can fulfill their promise of being consequential science and engineering instruments for the nation. So, we as a community started talking about this years ago—and talking means writing down answers to key requirements questions like: Why? What? How? How Well? What do we need to focus on in terms of R&D and software?” Kothe said.
Many of those questions revolved around what basic architecture the system would ultimately use—which was partly determined by a choice made by the OLCF years earlier to use a piece of consumer PC hardware in supercomputers.
Specifying the Future of HPC
Before the OLCF’s Titan debuted in 2012, graphics processing units (GPUs) were best known for powering high-end gaming PCs. At the time, most supercomputers relied only on central processing units (CPUs) to crunch their algorithms. But Titan introduced a revolutionary hybrid processor architecture that combined 16-core AMD Opteron CPUs with NVIDIA Kepler GPUs. The GPUs tackled computationally intensive math problems while the CPUs efficiently directed tasks, significantly speeding up calculations.
“Titan’s GPU-based system was a unique supercomputer design at the time. I don’t think the whole community bought into it or thought it would become the base technology for exascale. Oak Ridge believed that accelerated-node computing would be the norm for the foreseeable future, and so far that has come to fruition,” Kothe said.
Titan led to many more GPU-based supercomputers, including Summit (Titan’s successor at the OLCF) and LLNL’s Sierra, both of which debuted in 2018. Both employ NVIDIA V100 GPUs, and that prudent choice also proved important in configuring the upcoming exascale systems’ capabilities.
Al Geist, chief technology officer for the ECP and the OLCF, wrote many of the documents for CORAL and CORAL-2. He sees Summit’s architecture as another fortunate turning point in supercomputer design.
“As supercomputers got larger and larger, we expected them to be more specialized and limited to a small number of applications that could exploit their particular capabilities. But what happened when Summit was announced, NVIDIA jumped up and said, ‘Oh, by the way, those Volta GPUs have something called tensor cores in them that allow you to do AI calculations and all sorts of additional things,’” Geist said. “What we found is that Summit’s architecture, based on powerful multi-GPU nodes and huge amounts of memory per node, turned out to have broad capabilities. They could do the traditional HPC modeling and simulation, but Summit is also very effective at doing high-performance data analytics and artificial intelligence.”
The ability of GPUs to greatly accelerate performance as well as handle mixed-precision arithmetic for data science—all while using less power than CPUs—made them the best choice for exascale architecture, especially with more powerful and efficient next-generation chip designs being produced by vendor partners like AMD, Intel, and NVIDIA.
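To illustrate what mixed-precision arithmetic means in practice, here is a minimal Python sketch. It shows the general idea only and is not a description of how any GPU is implemented. It assumes inputs rounded to 16-bit half precision and compares keeping the running sum in half precision against keeping it in 32-bit single precision. The helper names (`to_half`, `to_single`, `dot_half`, `dot_mixed`) are hypothetical, and the rounding is emulated with Python’s standard-library `struct` module.

```python
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (16-bit) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision (32-bit) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot_half(a, b):
    """Dot product with half-precision inputs, products, and running sum."""
    acc = 0.0
    for x, y in zip(a, b):
        product = to_half(to_half(x) * to_half(y))
        acc = to_half(acc + product)
    return acc


def dot_mixed(a, b):
    """Dot product with half-precision inputs but a single-precision running sum."""
    acc = 0.0
    for x, y in zip(a, b):
        product = to_half(x) * to_half(y)  # inputs are narrow, the product is not rounded to half
        acc = to_single(acc + product)     # the running sum is kept in the wider format
    return acc


if __name__ == "__main__":
    ones = [1.0] * 4096
    print("half-precision accumulation: ", dot_half(ones, ones))
    print("mixed-precision accumulation:", dot_mixed(ones, ones))
```

In this emulation the half-precision running sum stalls at 2048: the next sum, 2049, falls exactly between the representable values 2048 and 2050 and rounds back to 2048, so further additions of 1 have no effect. The single-precision accumulator reaches the exact answer, 4096. Keeping inputs narrow while accumulating in a wider format is the basic reason mixed precision can save memory and energy without giving up as much accuracy.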
“What we found out in the end is that exascale didn’t require this exotic technology that came out of that 2008 report—we didn’t need special architectures, we didn’t need new programming paradigms,” Geist said. “Getting to exascale turned out to be very incremental steps—not a giant leap like we thought it was going to take to get to Frontier.”
UT-Battelle LLC manages Oak Ridge National Laboratory for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.