Professor Spencer Bryngelson and his team of researchers at Georgia Tech’s School of Computational Science and Engineering were excited to bring their code into the new world of exascale-class supercomputing. The Multicomponent Flow Code, or MFC, originated in the mid-2000s to model the simultaneous movements of gases and liquids as they transform through different thermodynamic phases, such as water boiling into steam. When exascale computing arrived nearly two decades later, the freshly updated code was ready to leverage the latest advances in high-performance computing, or HPC.
Using the Oak Ridge Leadership Computing Facility’s early-access system, Crusher, Bryngelson would get a sneak peek of what MFC could do on the newly launched HPE Cray EX Frontier supercomputer. Powered by next-generation GPU accelerators, Frontier was the first machine in the world capable of conducting more than a billion-billion calculations per second, and that capability would surely launch the team’s code into new realms of achievement. Both Crusher and Frontier are managed by the OLCF, a Department of Energy Office of Science user facility located at Oak Ridge National Laboratory.
Unfortunately, MFC’s performance on Crusher was not good.
“We got MFC running on Crusher in 2022 — all of our tests were finally working, which took about a year of effort. But the performance was extremely poor — like, really, really poor when compared to its performance on previous GPU systems,” said Bryngelson, an assistant professor at Georgia Tech’s College of Computing. “It wasn’t just slow in terms of networking across multiple nodes. Rather, it wasn’t even fast on just a single GPU. Since this was a new computer architecture, trying to reason out the cause was challenging.”
This lack of speed was especially puzzling because MFC had run well on Frontier’s predecessor, the 200-petaflop Summit, in 2020. Of course, Crusher used a different architecture. Unlike Summit’s ecosystem of IBM CPUs and NVIDIA GPUs, Crusher featured all-new AMD CPUs and GPUs, just like Frontier. Bryngelson had chosen OpenACC, a hardware-agnostic programming model, to help make MFC compatible with NVIDIA’s chips, but its utility for AMD’s new chips was unproven. Something was not right.
To solve that dilemma, Bryngelson turned to the OLCF’s user-support ecosystem provided by its User Assistance Center: support tickets, virtual office hours, and hands-on coding hackathons. Led by the OLCF’s Christopher Fuson, the User Assistance Group consists of seven HPC engineers on call to help solve problems, as well as hardware vendor representatives providing additional support. With their expert help in hunting down some elusive bugs, the team overcame their initial difficulties and ended up providing solutions for other researchers who were working to bring their codes to exascale.

OLCF GPU Hackathons provide users with hands-on support, wherein OLCF staff and vendor partners join developers in multiday coding events several times a year to optimize their applications. Image: Carol Morgan/ORNL.
Getting a ticket
Bryngelson started working on MFC around 10 years ago at the California Institute of Technology. The computational fluid dynamics code is even older than that. It was begun in the mid-2000s by Eric Johnsen and then more formally developed by Vedran Coralic, both at Caltech at the time and under the guidance of professor Tim Colonius. The code was originally designed to run on the parallel CPU-only systems available in the late 2000s and early 2010s, and those systems are quite different from today’s flagship GPU-accelerated supercomputers.
To run on such new architectures, legacy codes must be rewritten so that their algorithms can leverage the particular CPUs and GPUs used by each supercomputer. Chip designs vary among manufacturers such as AMD, IBM, Intel, and NVIDIA, each with different ways of optimizing parallel processing. Researchers use directive-based programming interfaces such as OpenACC or OpenMP to augment their codes, which are written in languages such as Fortran or C++, so that they fully exploit the particular hardware on which they run.
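To give a flavor of what that augmentation looks like, the sketch below is purely illustrative rather than code from MFC: a plain Fortran loop carries an OpenACC directive telling a supporting compiler to offload it to a GPU, while compilers without OpenACC support simply treat the directive as a comment.

```fortran
! Illustrative only -- not code from MFC. The !$acc directive asks an
! OpenACC-aware compiler to run the loop on a GPU; other compilers see it
! as an ordinary comment and run the loop on the CPU.
program saxpy_acc
  implicit none
  integer, parameter :: n = 100000
  real :: x(n), y(n), a
  integer :: i

  a = 2.0
  x = 1.0
  y = 0.0

  !$acc parallel loop copyin(x) copy(y)
  do i = 1, n
     y(i) = y(i) + a*x(i)   ! each GPU thread handles a slice of the iterations
  end do

  print *, 'y(1) =', y(1)
end program saxpy_acc
```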
Bryngelson’s team used compilers from the open-source GNU Compiler Collection, which are supposed to implement the OpenACC specification. However, as he and his team discovered, there were compatibility issues.
“It turns out that when you actually try to put some of those specifications into action, the compilers aren’t always fully compliant, or they’re compliant in a different way than you expected,” Bryngelson said. “That’s when you end up in this compiler world, trying to figure out what the differences are — what is merely just a difference in implementation versus what is a compiler bug, perhaps never even implemented in the compiler at all.”
Isolating the exact bugs causing the slowdown in performance would require further expertise, so the team filed tickets with the OLCF’s User Assistance staff. The questions caught the attention of Reuben Budiardja, who is the group leader of Advanced Computing for Nuclear, Particle, and Astrophysics in the OLCF’s Science Engagement Section. He also leads its Compiler Working Group.
“This was strictly a ticket that User Assistance can deal with — they deal with a lot of things elegantly and do a great job. But I just browse tickets to see if there are systematic issues that we need to be aware of and that we can solve to make everyone’s life better,” Budiardja said. “One of the things I’m particularly keen on is Fortran and OpenMP and OpenACC. So, reading the ticket, I was like, ‘I don’t have an obvious solution to that. It should work.’ That’s the kind of thing that I’ll sometimes decide to spend time with to see what’s going on.”
Budiardja suggested some potential work-arounds, and the team identified a few dozen OpenACC-related compiler bugs. Many of these bugs were shared with OpenMP, which is an officially supported and recommended method of off-loading to Frontier’s AMD GPU devices.
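For comparison, a hedged sketch of what the same kind of loop looks like under OpenMP offloading follows; again, this is illustrative and not taken from MFC.

```fortran
! Illustrative only: the same kind of loop expressed with OpenMP target
! offload, the directive model officially supported for Frontier's AMD GPUs.
subroutine saxpy_omp(n, a, x, y)
  implicit none
  integer, intent(in)    :: n
  real,    intent(in)    :: a, x(n)
  real,    intent(inout) :: y(n)
  integer :: i

  !$omp target teams distribute parallel do map(to: x) map(tofrom: y)
  do i = 1, n
     y(i) = y(i) + a*x(i)
  end do
  !$omp end target teams distribute parallel do
end subroutine saxpy_omp
```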
“We got to the bottom of many different things, many with downstream effects for other early-access application teams. We found work-arounds for current compiler bugs, learned more about things that we were doing that weren’t quite right. All kinds of weird things. It was a very good learning experience for us and incredibly helpful,” Bryngelson said.
Despite these improvements, the perplexing speed issues remained. It was time to bring in reinforcements.

With few distractions at OLCF hackathons, both virtual and in-person, developer teams can focus solely on solving problems and improving codes for use on Frontier and other GPU-powered systems. Image: Carol Morgan/ORNL.
Heading to the office
OLCF Office Hours provides project teams with direct access to OLCF staff and vendor representatives to discuss software issues via Zoom sessions every Monday from 2 to 3 p.m. and Wednesday from 1 to 2 p.m. ET. That’s how Bryngelson met technical experts from AMD and Hewlett Packard Enterprise, and those experts jumped in to help solve the puzzle.
“I attend every OLCF Office Hour and Frontier Hackathon, as do most of the Frontier Center of Excellence engineers,” said Steve Abbott, technical lead for HPE’s Frontier Center of Excellence, which helps users optimize their performance on Frontier. “The OLCF Office Hours are one of the best innovations from the OLCF’s User Assistance team in the past few years and are the main way the Center of Excellence interacts with Frontier users.”
Abbott took note of the Georgia Tech team’s use of OpenACC. Although OpenACC is widely used in GPU computing, it is relatively new territory for applications on Frontier. The Cray Compiling Environment, or CCE, Fortran compiler is currently the only HPE-supported compiler on Frontier for OpenACC, though there are experimental efforts in other compilers. Abbott recommended the CCE Fortran compiler as the team’s best option. The switch proved fruitful, and the team worked around several compiler problems.
“We had a number of false starts where we had an effective work-around for a reduced test case, but some pattern in the full code would still break,” said Abbott, who’s also a principal engineer in HPE’s HPC Performance Engineering team. “These sorts of situations are where building a mental model of the code and compiler as a coupled system really helps, together with a love for the weird dark corners of programming languages where there are convoluted patterns no sensible person would write.”
The MFC bug involved Fortran module variables being used in device routines. The broader the scope of a variable, the more work must be done at both compile time and run time to ensure that all uses of that variable refer to the correct memory location.
“The MFC team was hitting a case where references to specific variables deep in OpenACC device procedures were not correctly being connected to the variables’ locations in memory. The compiler needed to connect several different references together like links in a chain, but one of those links wasn’t working correctly,” Abbott said.
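In Fortran terms, the pattern at issue looks roughly like the hypothetical sketch below; the names are invented and this is not MFC’s code. A module-level array is given a device copy with a declare directive and then referenced inside a routine that runs on the GPU, leaving the compiler to wire every such reference back to the right device memory.

```fortran
! Hypothetical sketch of the problematic pattern; names are invented and
! this is not MFC's code. The module array q gets a device copy via
! "declare create", and update_point, compiled for the GPU with "routine",
! references it from device code. The compiler must connect that reference
! to q's device memory -- the link that was breaking.
module flow_vars
  implicit none
  real, allocatable :: q(:)
  !$acc declare create(q)
contains
  subroutine update_point(i)
    !$acc routine seq
    integer, intent(in) :: i
    q(i) = 2.0*q(i)          ! module variable referenced from device code
  end subroutine update_point
end module flow_vars

program demo
  use flow_vars
  implicit none
  integer :: i

  allocate(q(1000))
  q = 1.0
  !$acc update device(q)     ! copy host values to the device copy

  !$acc parallel loop
  do i = 1, 1000
     call update_point(i)    ! device routine touching the module variable
  end do

  !$acc update self(q)       ! bring results back to the host
  print *, q(1)
end program demo
```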
Abbott devised a work-around that involved changing the variables from “allocatable” to “pointer.” By the OpenACC standard, this meant that instead of the compiler doing everything automatically, the programmers needed to manually make the connection themselves.
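A hedged sketch of that general idea follows; the names are invented and the details of MFC’s actual change may differ. With a pointer instead of an allocatable, the programmer explicitly creates and releases the device copy rather than relying on the compiler to manage it.

```fortran
! Hedged sketch only -- invented names, not the actual MFC change. The
! module array is now a pointer, and the code creates and releases its
! device copy by hand with explicit data directives.
module flow_vars
  implicit none
  real, pointer :: q(:) => null()
  !$acc declare create(q)
contains
  subroutine init_flow(n)
    integer, intent(in) :: n
    allocate(q(n))
    q = 1.0
    !$acc enter data copyin(q)   ! programmer creates/attaches the device copy
  end subroutine init_flow

  subroutine finalize_flow()
    !$acc exit data delete(q)    ! programmer releases the device copy
    deallocate(q)
  end subroutine finalize_flow
end module flow_vars
```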
Still, speed problems persisted. MFC’s GPU kernels were seeing a large amount of register spilling, which occurs when a kernel needs to keep more values in flight than fit in the GPU’s registers, its fastest memory, forcing the compiler to “spill” some of them out to much slower memory. Furthermore, MFC did not appear to be using as many registers as it was allowed.
To truly identify and remove MFC’s bugs, the entire team would need to delve into the application code itself.

With help from Brian Cornille, a software system design engineer in AMD’s technical staff, and Steve Abbott, technical lead for HPE’s Frontier Center of Excellence, the MFC team was able to overcome several unexpected bugs to run their code well on both Frontier and El Capitan supercomputers. Image: Carol Morgan/ORNL.
Hacking away at the problem
OLCF GPU Hackathons provide hands-on support, wherein OLCF staff and vendor partners join developers in multiday coding events several times a year to optimize their applications. With few distractions, the teams can focus solely on solving problems and improving codes for use on Frontier and other GPU-powered systems. MFC’s issues were beginning to diminish.
Brian Cornille, a software system design engineer on AMD’s technical staff, began working on the problem, too. He consulted with AMD compiler engineers to help pin down the compiler-generated attribute that was constraining the kernels’ register usage.
“Steve Abbott and I were able to identify a value that should have been different in the intermediate files generated by the compiler. To provide proof of concept and allow the MFC team to see the expected performance of these kernels, we manually edited the intermediate files and continued the compilation from there,” Cornille said. “With this change, the register usage increased to what was expected, and spilling was greatly reduced. This led to a substantial speedup in these kernels.”
Abbott and the MFC team then created a script to automatically edit the intermediate files. This work-around provided MFC a substantial performance improvement, which demonstrated the need for a permanent fix in the CCE Fortran compiler.
“The structure of MFC’s kernels was a pretty common pattern, and it was reasonable to assume that if this change benefited MFC, it would benefit many other applications. So, we needed to ask, ‘Why is the compiler setting this value to what it is?’ and ‘Can it do better?’ Fortunately, we have a good relationship with the compiler developers and could reach out to them directly,” Abbott said.
The team discovered that, during its initial development, the CCE Fortran compiler had been designed to choose default settings that allow the maximum possible number of threads per block when launching a kernel. That conservative approach ensures the code runs correctly every time, but it doesn’t always yield the best performance.
“Now that the toolchain was more mature, we could look for a better way. If we can be sure at compile time that a kernel will be run with a certain number of threads per block, then the compiler can pick an attribute value that will optimize for that number of threads,” Abbott said.
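In OpenACC terms, one way to give the compiler that guarantee is to fix the vector length on the kernel, as in the illustrative snippet below, which again is not MFC’s code: with the thread count known at compile time, the compiler can budget registers for that count instead of the hardware maximum.

```fortran
! Illustrative only. vector_length(256) fixes the threads per block at
! compile time, so the compiler can size register usage for 256 threads
! rather than assuming the hardware maximum.
subroutine saxpy_fixed_vl(n, a, x, y)
  implicit none
  integer, intent(in)    :: n
  real,    intent(in)    :: a, x(n)
  real,    intent(inout) :: y(n)
  integer :: i

  !$acc parallel loop gang vector vector_length(256) copyin(x) copy(y)
  do i = 1, n
     y(i) = y(i) + a*x(i)
  end do
end subroutine saxpy_fixed_vl
```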
After a presentation by the Center of Excellence and some discussion with the OLCF’s Compiler Working Group, the HPE compiler team quickly implemented the optimization, thereby benefiting other users.
“This issue involved such common code patterns that it would be unusual if other programs did not see remarkable speedups when the new compiler version was released,” Bryngelson said.
A thankful I/O upgrade
At last, with MFC operating at its expected speeds, the Georgia Tech team performed a scaling run on Frontier during the Thanksgiving week of 2023 — whereupon they hit a wall at 128 compute nodes, unable to use Frontier’s full array of more than 9,400 nodes. If they couldn’t solve this problem, they’d quickly use up their allocation hours on Frontier just trying to complete their calculations on only 128 nodes.
Luckily, Budiardja had some free time on Thanksgiving break and quickly identified an issue in MFC’s I/O code.
“I decided to take a look at the code, and it was just one of those, ‘Oh, I know what’s going on, and I know what to do.’ I was kind of excited, so when the kids went to bed after turkey dinner, I just rewrote their I/O code,” Budiardja said. “It was fortunate because the machine is a little bit empty during the break. They were struggling from running about 15 or 20 minutes with 128 nodes, but with the change that I put in the I/O module, they were able to run on the whole system — and it took about 5 minutes.”
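The article does not spell out exactly what changed in the I/O module, so the sketch below is only an illustration of one common pattern for scalable parallel output: each MPI rank writes its own block of a single shared file through a collective MPI-IO call, instead of funneling everything through one writer.

```fortran
! Illustrative sketch of collective MPI-IO output; not MFC's actual I/O code.
! Every rank writes its contiguous block of one shared file in a single
! collective call, which MPI can aggregate efficiently at scale.
subroutine write_field(comm, field, nlocal)
  use mpi
  implicit none
  integer, intent(in)      :: comm, nlocal
  real(kind=8), intent(in) :: field(nlocal)
  integer :: fh, rank, ierr
  integer(kind=MPI_OFFSET_KIND) :: offset

  call MPI_Comm_rank(comm, rank, ierr)
  call MPI_File_open(comm, 'field.bin', ior(MPI_MODE_CREATE, MPI_MODE_WRONLY), &
                     MPI_INFO_NULL, fh, ierr)

  ! Each rank's block starts at rank * nlocal * 8 bytes in the shared file.
  offset = int(rank, MPI_OFFSET_KIND) * int(nlocal, MPI_OFFSET_KIND) * 8

  ! Collective write: all ranks participate, letting MPI coordinate the I/O.
  call MPI_File_write_at_all(fh, offset, field, nlocal, MPI_DOUBLE_PRECISION, &
                             MPI_STATUS_IGNORE, ierr)
  call MPI_File_close(fh, ierr)
end subroutine write_field
```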
Bryngelson aims to use MFC for several projects to simulate multiphase-flow phenomena. These range from cavitation bubbles, which arise in applications such as nuclear reactors, to hydrodynamic instabilities that reveal how different types of interfaces between gases, liquids, and solids can break up and mix together.
“We learned a lot along the way, from identifying compiler bugs to realizing how pivoting from one computer architecture to another requires meaningful effort if you maintain a performance-critical application. I think our whole team feels wiser as a result, and we remain excited about new architectures. We will continue to tackle new systems as they appear,” Bryngelson said.
That includes Lawrence Livermore National Laboratory’s recently launched exascale system, El Capitan, which was ranked the fastest supercomputer in the world in November’s Top500 list at a benchmark speed of 1.742 exaflops. With El Capitan’s array of new AMD GPUs, the team applied the same strategies it had developed on Frontier and scaled MFC to nearly 100% of the system’s 11,000-plus compute nodes.
UT-Battelle manages ORNL for DOE’s Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE’s Office of Science is working to address some of the most pressing challenges of our time. For more information, visit energy.gov/science.