
A benchmark released by Hewlett Packard Enterprise and tested on multiple DOE systems, including the OLCF’s Summit, offers a better measure of network congestion

Companies that design the networks that allow compute systems to share information often rely on benchmarks to test how well those networks perform.

But the traditional benchmarks that measure network latency (the time it takes to send a message across the network) and bandwidth (the rate at which data is transferred between compute elements) don’t always represent the conditions of regular network use.
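
A traditional measurement of this kind is a two-rank ping-pong: bounce a message back and forth on an otherwise idle network and derive latency and bandwidth from the timed round trips. The following minimal sketch illustrates the idea; the use of MPI, the 1 MiB message size, and the iteration count are illustrative assumptions, not details from the article or from any particular vendor’s benchmark.

```c
/* Minimal idle-network ping-pong sketch. Run with at least two ranks,
 * e.g. mpirun -n 2 ./pingpong. Real benchmarks sweep many message
 * sizes (small for latency, large for bandwidth); one size is used
 * here for brevity. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (1 << 20)   /* 1 MiB payload, an arbitrary choice */
#define ITERS     1000

int main(int argc, char **argv)
{
    int rank;
    char *buf = calloc(MSG_BYTES, 1);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double rtt = (MPI_Wtime() - start) / ITERS;

    if (rank == 0) {
        /* One round trip moves the payload twice, once in each direction. */
        printf("one-way latency: %.2f us\n", rtt / 2.0 * 1e6);
        printf("bandwidth:       %.2f GB/s\n", 2.0 * MSG_BYTES / rtt / 1e9);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```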

“Many of the measurements used to benchmark these technologies are taken on an empty network, which doesn’t account for network traffic,” said Scott Atchley, high-performance computing (HPC) systems engineer in the Technology Integration Group at the Oak Ridge Leadership Computing Facility (OLCF). “It’s like comparing a road trip at 3:00 am with a road trip during rush hour.”

Earlier this year, Cray released a new networking benchmark, called the Global Performance and Congestion Network Test (GPCNeT), that accounts for network traffic when running codes on large-scale supercomputers. Last fall, Atchley and OLCF HPC systems engineer Chris Zimmer ran a prerelease version of GPCNeT on the OLCF’s Summit, the nation’s most powerful supercomputer, to see how well it captured the impact of congestion on the system’s Enhanced Data Rate, or EDR, InfiniBand network when running different communication patterns typical of scientific codes. The OLCF is a US Department of Energy (DOE) Office of Science User Facility at DOE’s Oak Ridge National Laboratory (ORNL).
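
Conceptually, a congestion-aware test interleaves a measured “victim” communication pattern with deliberately disruptive background traffic and compares the result against a quiet-network baseline. The sketch below shows that general idea only; the two-rank victim, the all-to-all congestor, and the slowdown ratio reported at the end are illustrative assumptions rather than GPCNeT’s actual patterns or metrics.

```c
/* Sketch of a congested-network measurement: ranks 0-1 time a
 * ping-pong first on a quiet network, then again while all remaining
 * ranks drive all-to-all "congestor" traffic. The ratio of the two
 * timings approximates a congestion impact factor. Run with many
 * ranks so the congestors actually load the fabric. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 200
#define BYTES 4096

/* Average round-trip time of ITERS ping-pongs between ranks 0 and 1
 * of comm. */
static double pingpong(MPI_Comm comm, char *buf)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    double start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, BYTES, MPI_CHAR, 1, 0, comm);
            MPI_Recv(buf, BYTES, MPI_CHAR, 1, 0, comm, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, BYTES, MPI_CHAR, 0, 0, comm, MPI_STATUS_IGNORE);
            MPI_Send(buf, BYTES, MPI_CHAR, 0, 0, comm);
        }
    }
    return (MPI_Wtime() - start) / ITERS;
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Ranks 0-1 are the measured "victims"; everyone else congests. */
    int victim = (rank < 2);
    MPI_Comm sub;
    MPI_Comm_split(MPI_COMM_WORLD, victim, rank, &sub);

    char *buf = calloc(BYTES, 1);
    double quiet = 0.0, congested = 0.0;

    /* Phase 1: measure on an otherwise idle network. */
    MPI_Barrier(MPI_COMM_WORLD);
    if (victim)
        quiet = pingpong(sub, buf);

    /* Phase 2: repeat while congestors flood the network. Fixed
     * iteration counts only roughly overlap the two activities; a
     * real benchmark coordinates the phases far more carefully. */
    MPI_Barrier(MPI_COMM_WORLD);
    if (victim) {
        congested = pingpong(sub, buf);
    } else {
        int n;
        MPI_Comm_size(sub, &n);
        char *src = calloc((size_t)n * BYTES, 1);
        char *dst = calloc((size_t)n * BYTES, 1);
        for (int i = 0; i < ITERS; i++)
            MPI_Alltoall(src, BYTES, MPI_CHAR, dst, BYTES, MPI_CHAR, sub);
        free(src);
        free(dst);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0)
        printf("congestion impact: %.2fx slowdown\n", congested / quiet);

    free(buf);
    MPI_Comm_free(&sub);
    MPI_Finalize();
    return 0;
}
```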

In a paper presented at the 2019 Supercomputing Conference (SC19), Atchley and Zimmer, along with researchers from Lawrence Livermore National Laboratory, evaluated the interconnects in both the OLCF’s Summit and Livermore Computing’s Sierra supercomputers using the new benchmark alongside traditional ones. In the paper, the engineers described GPCNeT’s benefits as well as the factors that affect consistency in performance speeds, known as performance variability.

“GPCNeT is a new valuable tool to add to our suite of tools to help us understand these networks,” Atchley said. “It’s particularly useful to help users set expectations on variability, which we want to improve.”

Hewlett Packard Enterprise (HPE), which acquired Cray last year, has now published a comprehensive description of GPCNeT that takes into account the data from Atchley and Zimmer’s runs on Summit as well as runs on the Argonne Leadership Computing Facility’s Theta supercomputer and the National Energy Research Scientific Computing Center’s Edison system. In the paper, HPE claims that the benchmark will allow for more substantive comparisons of different HPC networks.

The runs on Summit demonstrated the efficiency of its network, even without the congestion control feature that Summit’s infrastructure offers. Congestion control is not currently active in Summit’s production environment, but future software developments by Mellanox could enable the feature.

Zimmer said congestion’s largest impact is on tail latency, the long outlier delays that can determine the overall rate of communication. Measuring tail latency is a defined requirement for Frontier, and the team intends to measure it using GPCNeT during acceptance of the machine.
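
To make the tail concrete, the short standalone sketch below contrasts the mean with a high-percentile latency over a hypothetical set of per-message timings (the sample values and the nearest-rank p99 cutoff are illustrative assumptions): a few congested outliers barely move the mean but dominate the tail.

```c
/* Tail latency illustration: the mean hides slow outliers that a
 * high percentile exposes. Sample data is hypothetical. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile (p in 0-100) of n sorted samples. */
static double percentile(const double *sorted, int n, double p)
{
    int idx = (int)(p / 100.0 * (n - 1));
    return sorted[idx];
}

int main(void)
{
    /* Hypothetical per-message latencies in microseconds: mostly
     * ~1.5 us, plus a few congestion-induced stragglers. */
    double us[] = { 1.4, 1.5, 1.5, 1.6, 1.5, 1.4, 1.6, 1.5,
                    1.5, 1.6, 1.4, 1.5, 9.8, 1.5, 1.6, 27.3 };
    int n = sizeof(us) / sizeof(us[0]);
    double mean = 0.0;

    for (int i = 0; i < n; i++)
        mean += us[i];
    mean /= n;

    qsort(us, n, sizeof(double), cmp_double);
    printf("mean latency: %.2f us\n", mean);                  /* ~3.64 */
    printf("p99 latency:  %.2f us\n", percentile(us, n, 99)); /*  9.80 */
    printf("max latency:  %.2f us\n", us[n - 1]);             /* 27.30 */
    return 0;
}
```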

“The benchmark is open source, so we might even make enhancements to see what storage I/O congestion will look like on Frontier or other machines in the future,” Zimmer said.

Related Publication: Christopher Zimmer, Scott Atchley, Ramesh Pankajakshan, Brian E. Smith, Ian Karlin, Matthew L. Leininger, Adam Bertsch, Brian S. Ryujin, Jason Burmark, André Walker-Loud, M. A. Clark, and Olga Pearce, “An Evaluation of the CORAL Interconnects,” SC19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis 1, no. 39 (2019): 1–18. https://doi.org/10.1145/3295500.3356166

UT-Battelle LLC manages Oak Ridge National Laboratory for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.

Rachel McDowell

Rachel McDowell is a science writer for the Oak Ridge Leadership Computing Facility.