In 2008, the Exascale Study Group (ESG) issued a report, Technology Challenges in Achieving Exascale Systems, sponsored by the Defense Advanced Research Projects Agency. It concluded that exascale supercomputers faced four major obstacles—power consumption, data movement, fault tolerance, and extreme parallelism—“where current technology trends are simply insufficient, and significant new research is absolutely needed to bring alternatives online.”
In the years since, ESG’s findings were corroborated by the many follow-on reports from different study groups, which identified the same hurdles. Nevertheless, with the installation of Frontier at the US Department of Energy’s (DOE’s) Oak Ridge National Laboratory (ORNL), supercomputing will at last enter the exascale age this fall.
So how were those issues surmounted?
Al Geist, chief technology officer for the Exascale Computing Project and the Oak Ridge Leadership Computing Facility at ORNL, was a member of several different exascale working groups and recently delivered a seminar on the history of US exascale efforts and how Frontier overcomes the four obstacles. Here is a brief summary:
Power Consumption
ESG 2008: “The Energy and Power Challenge is the most pervasive of the four and has its roots in the inability of the group to project any combination of currently mature technologies that will deliver sufficiently powerful systems in any class at the desired power levels.”
The Problem: Power consumption was the main challenge identified in the ESG report. The group’s analysis showed that a 1-exaflop system built from the technologies of the day would consume more than 600 megawatts.
“The electric bill alone would be over $600 million for every year the supercomputer ran, which is well beyond the budget of any supercomputer facility. But more importantly, the ESG report showed that if you built a stripped down, optimized, 1-exaflop system using projected 2015 technologies, it would still consume over 150 megawatts. That result really gave us pause about whether it would ever be feasible to build one of these machines,” Geist said.
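A back-of-envelope check of those figures, assuming an illustrative electricity rate of about $0.12 per kilowatt-hour (the rate is an assumption for illustration, not a number from the report):

```python
# Rough estimate of the annual electric bill for a 600 MW system.
# The electricity rate is an assumed illustrative value, not a figure
# from the ESG report.
power_mw = 600                   # projected draw of a 1-exaflop system, per ESG
hours_per_year = 24 * 365        # 8,760 hours
rate_per_kwh = 0.12              # assumed electricity rate, USD per kWh

energy_kwh = power_mw * 1_000 * hours_per_year   # MW -> kW, times hours
annual_cost = energy_kwh * rate_per_kwh

print(f"{energy_kwh:,.0f} kWh/year -> ${annual_cost:,.0f}/year")
# roughly 5.3 billion kWh/year -> about $630 million/year
```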
The Solution: The DOE’s Office of Science challenged chip and supercomputer vendors to increase chip reliability and reduce power consumption to 20 megawatts for a 1-exaflop system. DOE spent more than $300 million funding six companies over eight years to tackle this challenge. Frontier is proof that the effort succeeded: its power consumption comes in below 20 megawatts per exaflop. An added benefit of DOE’s investment is the development of lower-power electronics that end up in many consumer products, in turn reducing the nation’s energy demand.
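Put another way, the 20-megawatt goal is an energy-efficiency target of roughly 50 gigaflops per watt; a minimal sketch of the conversion:

```python
# Converting the DOE target (1 exaflop within 20 MW) into an efficiency figure.
exaflop = 1e18          # floating-point operations per second
power_watts = 20e6      # 20 MW power budget

flops_per_watt = exaflop / power_watts
print(f"{flops_per_watt / 1e9:.0f} GFLOPS per watt")   # 50 GFLOPS per watt
```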
Speed and Energy of Data Movement
ESG 2008: “The Memory and Storage Challenge concerns the lack of currently available technology to retain data at high enough capacities, and access it at high enough rates, to support the desired application suites at the desired computational rate, and still fit within an acceptable power envelope. This information storage challenge lies in both main memory (DRAM) and in secondary storage (rotating disks).”
The Problem: The time and energy required to move a byte of data from memory into the processors, and from the processors back out to storage, are orders of magnitude greater than the time and energy required to perform a floating-point operation on that data. This impediment to getting data in and out of memory is referred to as the memory wall.
“We were seeing large computational problems where the time to solution was being dominated by the movement of data rather than the number of computations. We reasoned that as the computations got faster and faster, the memory wall would get too high to be able to reach an exaflop. The calculations would be limited by the speed of the memory,” Geist said.
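A rough illustration of that imbalance, using order-of-magnitude energy figures that are assumptions about hardware of that era rather than measurements from the report or from Frontier:

```python
# Illustrative memory-wall arithmetic. The picojoule figures below are
# rough, order-of-magnitude assumptions, not measurements from the ESG
# report or from Frontier.
pj_per_flop = 20          # assumed energy of one double-precision operation
pj_per_byte_dram = 200    # assumed energy to move one byte from off-chip DRAM

# A kernel that reads 8 bytes (one double) per flop spends far more energy
# moving data than computing on it:
bytes_per_flop = 8
data_energy = bytes_per_flop * pj_per_byte_dram   # 1,600 pJ
compute_energy = pj_per_flop                      #    20 pJ
print(f"data movement costs ~{data_energy / compute_energy:.0f}x the compute energy")
```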
The Solution: Frontier reduces the height of the memory wall by having stacked, high-bandwidth memory soldered directly onto its GPUs, which increases the system’s data-moving speed by an order of magnitude.
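One way to see why an order-of-magnitude jump in bandwidth matters is the standard roofline model; the sketch below uses illustrative peak and bandwidth numbers, not Frontier’s actual specifications:

```python
# Roofline-style sketch: attainable performance is capped by either the
# peak compute rate or (bandwidth x arithmetic intensity). The peak and
# bandwidth values are illustrative assumptions, not Frontier's specs.
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

peak = 20_000                  # assumed GPU peak, GFLOP/s
ddr_bw, hbm_bw = 200, 2_000    # assumed GB/s: conventional DRAM vs. stacked HBM
intensity = 1.0                # a memory-bound kernel: ~1 flop per byte moved

print(f"{attainable_gflops(peak, ddr_bw, intensity):,.0f} GFLOP/s with conventional DRAM")
print(f"{attainable_gflops(peak, hbm_bw, intensity):,.0f} GFLOP/s with stacked HBM (10x)")
```

For memory-bound kernels with low arithmetic intensity, attainable performance scales directly with memory bandwidth, which is exactly where stacked high-bandwidth memory helps.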
Fault Tolerance
ESG 2008: “A Resiliency Challenge deals with the ability of a system to continue operation in the presence of either faults or performance fluctuations. This concern grew out of not only the explosive growth in component count for the larger systems, but also out of the need to use advanced technology at lower voltage levels, where individual devices and circuits become more and more sensitive to local operating environments, and new classes of aging effects become significant.”
The Problem: Exascale systems were expected to become so large and so complex that computer failure rates would increase to the point of becoming insurmountable.
“Extrapolating from 2009 error rates, we predicted that exascale system failures might happen faster than you could checkpoint a job. If the failure rate got this high, then traditional supercomputer applications would not be able to roll back to a checkpoint and make forward progress,” Geist said.
The Solution: DOE’s eight-year investment in the FastForward and DesignForward programs with vendors to increase reliability has led to chips that are much more tolerant of failures. Vendors also made their networks and system software much more adaptive, so systems can keep running despite component failures. In addition, Frontier’s on-node, non-volatile memory reduces checkpoint times from minutes to seconds, so even though failure rates increased, the checkpoint time is still much shorter than the mean time to failure.
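The checkpoint arithmetic can be made concrete with Young’s classic approximation for checkpoint overhead, a standard resilience model rather than anything specific to Frontier; the failure rate and checkpoint times below are assumptions chosen only to illustrate the trend:

```python
import math

# Young's approximation: the fraction of machine time lost to writing
# checkpoints and re-doing lost work is roughly
#   overhead ~ sqrt(2 * checkpoint_time / MTTF)
# A standard resilience model used only for illustration; the MTTF and
# checkpoint times below are assumptions, not Frontier measurements.
def overhead_fraction(checkpoint_time_s, mttf_s):
    return math.sqrt(2 * checkpoint_time_s / mttf_s)

mttf = 2 * 3600                  # assume one system-level failure every 2 hours
for c in (10, 600, 3600):        # checkpoint times: 10 s, 10 min, 1 hour
    print(f"checkpoint={c:>5}s -> ~{overhead_fraction(c, mttf):.0%} of time lost")
# As the checkpoint time approaches the MTTF, the estimate reaches 100%:
# the machine spends essentially all of its time checkpointing.
```

With checkpoints taking seconds rather than minutes or hours, the overhead in this model stays at a few percent even at relatively high failure rates.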
Billion-Way Parallelism
ESG 2008: “The Concurrency and Locality Challenge likewise grows out of the flattening of silicon clock rates and the end of increasing single-thread performance, which has left explicit, largely programmer visible parallelism as the only mechanism to increase overall system performance. While this affects all three classes of systems, projections for the data-center class systems in particular indicate that applications may have to support upwards of a billion separate threads to efficiently use the hardware.”
The Problem: Computing at a rate of 1 exaflop requires 1 billion floating-point units each performing 1 billion calculations per second. Exascale applications might therefore need to break their problems into a billion threads executing in parallel.
“In 2009, large-scale parallelism was typically less than 10,000-way parallelism, and the largest application on record was only about 100,000-way parallelism. Yet we were predicting that at exascale it would take a billion-way parallelism, and no one really had any clue about how or if that could be done effectively,” said Geist.
The Solution: Billion-way concurrency is still present, but Frontier addresses the challenge with large multi-GPU nodes. Each GPU provides anywhere from 1,000- to 10,000-way concurrency, so rather than write applications for billion-way parallelism, users need only think about parallelism at the level of GPUs or nodes. As a bonus, Frontier’s system software deals with only thousands of nodes, not a million.
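A sketch of how that hierarchy multiplies out; the node and GPU counts below are illustrative assumptions, not Frontier’s exact configuration:

```python
# How hierarchical parallelism reaches extreme concurrency without asking
# the programmer to manage a billion threads directly. The counts are
# illustrative assumptions, not Frontier's exact configuration.
nodes = 9_000                    # assumed number of compute nodes
gpus_per_node = 4                # assumed GPUs per node
concurrency_per_gpu = 10_000     # work items the GPU hardware schedules itself

total_concurrency = nodes * gpus_per_node * concurrency_per_gpu
print(f"{total_concurrency:,} concurrent work items")       # 360,000,000 (order 10^8-10^9)

# The programmer reasons only about nodes * gpus_per_node devices; the
# remaining factor of ~10,000 is handled inside each GPU.
print(f"{nodes * gpus_per_node:,} devices visible to the programmer")   # 36,000
```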
UT-Battelle LLC manages Oak Ridge National Laboratory for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.