OLCF staff optimize tensor algebra library, boost performance up to 10 times

OLCF computational scientist Dmitry Liakh, left, and performance analyst Frank Winkler display a visual analysis of the improved runtime performance of a numerical tensor algebra library that can be used by chemistry applications on Summit, the OLCF’s next leadership computing system.

Preparing for a new supercomputer at the Oak Ridge Leadership Computing Facility (OLCF) is a little like preparing for the Olympics. About every 2 to 4 years, computational scientists start improving how a suite of science applications will perform on a different, more powerful architecture.

Testing and modifying application codes for faster systems is a computational workout that can take years but yields significant performance gains. Currently, this work is part of the OLCF Center for Accelerated Application Readiness (CAAR) program, which is focused on upgrading 13 applications across research areas for the facility’s next-generation supercomputer, Summit, expected to go online in 2018. The OLCF is a US Department of Energy Office of Science User Facility located at DOE’s Oak Ridge National Laboratory.

While working on three CAAR chemistry applications, OLCF computational scientist Dmitry Liakh, former OLCF performance analyst Frank Winkler, and NVIDIA computational scientist Antti-Pekka Hynninen fixed an unexpected bottleneck in a key library, improving the library’s runtime performance by as much as 10 times.

“In these codes, there are basic, underlying mathematical equations known as tensor operations that are computationally intensive,” Liakh said.

The three applications (DIRAC, LS-DALTON, and NWChem) use tensor algebra to carry out quantum mechanical calculations. Tensor operations are also important to deep learning applications, which are expected to see increased use on Summit.
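To give a sense of what a tensor operation involves, the sketch below performs one common kind, a tensor contraction, in plain Python. It is an illustration only, not code from the library or from any of the three applications. The function name `contract` and the small example tensors are hypothetical, and production libraries work on far larger tensors with optimized kernels.

```python
def contract(a, b):
    """Contract a[p][i][j] with b[i][j][q] over the shared indices i and j.

    Returns c[p][q] = sum over i, j of a[p][i][j] * b[i][j][q].
    Tensors are plain nested lists; names and shapes are illustrative.
    """
    n_p, n_i, n_j = len(a), len(b), len(b[0])
    n_q = len(b[0][0])
    c = [[0] * n_q for _ in range(n_p)]
    for p in range(n_p):
        for q in range(n_q):
            total = 0
            for i in range(n_i):
                for j in range(n_j):
                    total += a[p][i][j] * b[i][j][q]
            c[p][q] = total
    return c


# Hypothetical 2x2x2 tensors, chosen only to keep the arithmetic easy to check.
a = [[[1, 0], [0, 1]], [[1, 2], [3, 4]]]
b = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
result = contract(a, b)
print(result)  # [[8, 10], [50, 60]]
```

The four nested loops hint at why such operations are computationally intensive: the amount of work grows with the product of all the index ranges.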

Working in the pre-Summit developmental environment, known as Summitdev, OLCF staff realized it would be more efficient if the chemistry application used a numerical library of tensor operations. Instead of writing code for tensor operations directly into the application code, the application could pull in the library—a modular way to package routine code and optimize computing time.

Built from IBM POWER8 CPUs and NVIDIA Pascal GPUs, Summitdev is one generation behind Summit’s architecture, which will have IBM POWER9 CPUs and NVIDIA Volta GPUs.

OLCF staff created the TAL-SH library to implement numeric tensor algebra operations on a hybrid CPU–GPU system like Summit and optimized it specifically for NVIDIA GPUs. On such systems, CPUs typically spend a short time assigning operations to the GPUs, and the GPUs then do the intensive number crunching.

But when staff tested the new library on Summitdev, the computational speed was disappointing. Although the GPUs were completing tasks twice as fast as before optimization, staff expected the speedup to be greater.

“It wasn’t clear why it was happening. It should have been faster,” Liakh said.

To quickly identify and resolve the issue, Liakh worked with Winkler to record and visually analyze a test run of the tensor library intended for use in DIRAC.

When computational scientists discuss application performance, they are often concerned with either scalability (how effectively an application distributes operations across many nodes) or on-node efficiency (how fully it uses the CPU and GPU resources of a single node).

“This was an efficiency issue focused on a single node,” Liakh said.

The team needed to see how the GPUs were interacting with the CPUs to understand the bottleneck, Liakh said. Using the trace recording tool Score-P and the performance analysis tool Vampir, which analyzes both CPU and GPU performance, Winkler determined that the CPU code was not supplying tasks to the GPUs fast enough. The CPUs were spending almost as much time preparing the tasks as the GPUs were spending executing them, which left gaps in GPU activity and reduced overall performance on the node.
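A toy timeline model can illustrate why task preparation time matters on a CPU–GPU node. The Python sketch below is not derived from the Score-P or Vampir data, and every number in it is hypothetical. It assumes a single CPU thread feeding a single GPU, and it compares two simplified cases: one where the CPU waits for the GPU before preparing the next task, and one where preparation overlaps with GPU execution. How the actual library schedules its work is not described in this article.

```python
def simulate(n_tasks, prep, execute, overlap):
    """Return (total_time, gpu_utilization) for a toy CPU-to-GPU pipeline.

    prep    -- CPU time units needed to prepare one task (hypothetical)
    execute -- GPU time units needed to run one task (hypothetical)
    overlap -- if True, the CPU prepares the next task while the GPU runs;
               if False, the CPU waits for the GPU to finish first
    """
    cpu_free = 0  # time at which the CPU can start preparing a task
    gpu_free = 0  # time at which the GPU can start executing a task
    gpu_busy = 0  # total time the GPU spends executing
    for _ in range(n_tasks):
        ready = cpu_free + prep        # task is ready once the CPU is done
        start = max(ready, gpu_free)   # GPU waits for the task or for itself
        gpu_free = start + execute
        gpu_busy += execute
        cpu_free = ready if overlap else gpu_free
    return gpu_free, gpu_busy / gpu_free


# Preparation takes almost as long as execution (9 vs. 10 time units).
slow_total, slow_util = simulate(100, prep=9, execute=10, overlap=False)
# Same workload with a much cheaper preparation step.
fast_total, fast_util = simulate(100, prep=1, execute=10, overlap=False)
# Same slow preparation, but overlapped with GPU execution.
piped_total, piped_util = simulate(100, prep=9, execute=10, overlap=True)

print(slow_total, round(slow_util, 2))
print(fast_total, round(fast_util, 2))
print(piped_total, round(piped_util, 2))
```

In this model the GPU sits idle whenever it is waiting for the next task, so either shortening the preparation step or overlapping it with execution raises GPU utilization.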

Working with Hynninen on the CPU code, the team was then able to fix the tensor library bottleneck and boost performance. This and other projects focused on optimizing applications for Summit’s architecture are critical to preparing for the system, which will be at least five times as powerful as the OLCF’s 27-petaflop Titan supercomputer.


Oak Ridge National Laboratory is supported by the US Department of Energy’s Office of Science. The single largest supporter of basic research in the physical sciences in the United States, the Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.