How NCCS researchers helped Frontier cross the Top500 finish line
Teams worked around the clock for months to help prepare the new HPE Cray EX supercomputer and gauge its record-setting exascale performance. Frontier debuted at the end of May as the world’s fastest supercomputer on the 59th Top500 list at 1.1 exaflops, or more than 1 quintillion double-precision calculations per second. The system, housed at the Oak Ridge Leadership Computing Facility, topped the international rankings not only for overall computing speed but also in a newer ranking for AI performance, at more than 6.86 quintillion mixed-precision calculations per second, and for energy efficiency, at 62.8 gigaflops per watt.
To measure Frontier’s AI capabilities, NCCS researchers in the Analytics and AI Methods at Scale Group, with help from colleagues in the Algorithms and Performance Analysis Group, developed an in-house code that could gauge the mixed-precision calculations that AI networks typically rely on. This code, known as an HPL-AI benchmark, represents a new capability for ORNL.
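For readers unfamiliar with the idea, mixed-precision benchmarks of this kind are generally described as solving a large linear system with fast low-precision arithmetic and then refining the answer back to double-precision accuracy. The short Python sketch below illustrates that general principle only. It is not the ORNL code: it assumes classical iterative refinement (a simpler stand-in for the refinement schemes used in real benchmarks), emulates single precision by rounding through the standard `struct` module, and uses hypothetical function names and a tiny made-up 3-by-3 system.

```python
import struct


def to_f32(x):
    """Round a Python float (FP64) to the nearest FP32 value (emulated low precision)."""
    return struct.unpack("f", struct.pack("f", x))[0]


def solve_low_precision(A, b):
    """Gaussian elimination with partial pivoting, rounding every result to FP32."""
    n = len(A)
    # Augmented matrix [A | b], stored in emulated FP32.
    M = [[to_f32(v) for v in row] + [to_f32(bi)] for row, bi in zip(A, b)]
    for k in range(n):
        # Partial pivoting: move the largest entry in column k to the diagonal.
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = to_f32(M[i][k] / M[k][k])
            for j in range(k, n + 1):
                M[i][j] = to_f32(M[i][j] - to_f32(f * M[k][j]))
    # Back substitution, still in emulated FP32.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n]
        for j in range(i + 1, n):
            s = to_f32(s - to_f32(M[i][j] * x[j]))
        x[i] = to_f32(s / M[i][i])
    return x


def residual(A, x, b):
    """Residual r = b - A x, computed in full FP64."""
    return [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]


def refine(A, b, iters=5):
    """Low-precision solve followed by FP64 iterative refinement."""
    x = solve_low_precision(A, b)
    for _ in range(iters):
        r = residual(A, x, b)            # high-precision residual
        d = solve_low_precision(A, r)    # cheap low-precision correction
        x = [xi + di for xi, di in zip(x, d)]
    return x


def max_abs(v):
    return max(abs(t) for t in v)


# Hypothetical, well-conditioned test system chosen for illustration.
A = [[4.0, 1.0, 2.0], [1.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
b = [1.0, 2.0, 3.0]

x_low = solve_low_precision(A, b)
x_ref = refine(A, b)
print("low-precision residual:", max_abs(residual(A, x_low, b)))
print("refined residual:      ", max_abs(residual(A, x_ref, b)))
```

For simplicity the sketch repeats the elimination for every correction step; a production code would factor the matrix once in low precision and reuse those factors, which is where most of the speed advantage comes from.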
The team first ran the benchmark code at scale on Summit, Frontier’s 200-petaflop predecessor, to demonstrate its algorithmic correctness and scalability. Then came the time to run the code on Frontier.
“There are a lot of complexities, optimization paths and trade-offs associated with driving such a benchmark to the maximum at this unprecedented scale,” said Feiyi Wang, the group’s leader. “Vendors typically develop and run these benchmarks themselves and treat them as industry secrets. We took a risk, as this is something we have never done before. Thanks to the relentless drive for excellence from our staff scientists at the Lab, we did it.”
Hao Lu, a research scientist, and Michael Matheson, a visualization specialist, took the lead on developing the new capability, and the entire group pitched in.
The code they produced works across two major computing platforms to set a new high point for benchmarking supercomputer performance. The team plans to publish the code to aid other researchers worldwide.
“It’s the first known code base of this kind that can run on both of the most important accelerator-computing platforms,” Lu said.
Next steps include steadily and systematically monitoring and profiling Frontier’s performance to reach even greater speeds.
“We developed our own code base not just for the sake of producing a single performance number, and that single number doesn’t and can’t tell the whole story,” Wang said. “To some degree, you can say all benchmark numbers are wrong, but some are useful. This development will not only help us establish a performance baseline, but also position us better to shape this and future next-generation leadership computing systems from acquisition and evaluation all the way to acceptance testing. Moreover, we have demonstrated a variety of optimization techniques that can be broadly useful and applicable to a wide spectrum of scientific applications.
“We couldn’t have done this without such ardent support and continuous encouragement from NCCS leadership on taking risks. There have been many unsung heroes and of course, a truly great team responding to the mission calls and striving for excellence. We are fortunate to play our part in this significant milestone project.”