Code for Largest Cosmological Simulations Ever on GPUs Is Gordon Bell Finalist

HACC uses modules with algorithms specific to different supercomputing architectures

Zoom-in showing the clustering of dark matter, including the effect of massive neutrinos, as simulated on Titan.

Advancements to instruments in observatories and satellites can stretch the eye of the observer billions of light-years away to the fringes of the observable universe. Images from sky surveys of galaxies, quasars, and other astronomical objects offer scientists clues about how the distribution of mass is influenced by dark energy, the repelling force guiding the accelerated expansion of the universe.

But all the telescopes at scientists’ disposal cannot begin to canvas the distribution of mass across the entire universe through time—an analysis that would help physicists corroborate observational data with their understanding of the fundamental processes that govern how the structure of the universe is formed.

To create a comprehensive sky catalog of the development of the universe to which scientists can compare instrumental observations, researchers are using the Department of Energy’s (DOE’s) most powerful computing systems, including the nation’s top-ranked machine, Titan, managed by the Oak Ridge Leadership Computing Facility (OLCF), to simulate the evolution of the universe as it expands across billions of years.

“Basically what the code does is follow the formation of structure in the universe,” said Salman Habib, project leader and high-energy physicist and computational scientist at Argonne National Laboratory (ANL). “And the idea is to get very-high-accuracy simulations of the universe so you can compare them to observations of the sky.”

To simulate cosmic structure, the ANL research team and its collaborators developed a modular, high-performance computing (HPC) code called HACC (for Hardware/Hybrid Accelerated Cosmology Code) designed for diverse HPC architectures. HACC requires petascale computing to “evolve” trillions of interacting particles and is the first large-scale cosmology code that can run on a hybrid CPU/GPU supercomputer, as well as on multicore or many-core architectures.

The code’s exceptional performance on the OLCF’s 27-petaflop, hybrid CPU/GPU Cray XK7 supercomputer, Titan, located at Oak Ridge National Laboratory, as well as on the IBM Blue Gene/Q machines Sequoia (at Lawrence Livermore National Laboratory) and Mira (at ANL), earned the project a finalist nomination for ACM’s Gordon Bell Prize. The award recognizes outstanding achievement in high-performance supercomputing applications, and this year’s winner will be announced at the SC13 supercomputing conference in November. Four of the six finalists, including HACC, ran on Titan.

Pushing the boundaries of what is possible in HPC today, the most computationally demanding calculations in HACC run at performance levels that translate to more than 25 petaflops on Titan and at an estimated 10 petaflop sustained performance. Such extreme performance is needed to simultaneously capture the large scales and the low-level details that modern cosmology requires.

Tracing the cosmos billions of particles at a time

HACC simulations begin with an extremely dense and very uniform universe. As a run proceeds, the simulated universe expands in a series of thousands of time steps, while at the same time the initial uniform structure forms a complex cosmic web, developing a detailed clustering of mass across large distances.

“The code starts with an initial, smooth density field, and it tracks where the matter goes,” Habib said. “And within these clumps of matter, galaxies form and other things happen.”

But unlike many HPC simulations that are building virtual systems molecule by molecule or atom by atom, HACC’s trillions of points of mass are not exact representations of physical objects.

“We’re trying to track where the mass is in the universe, and of course, if you tried to do it by number of atoms, that would be hopeless because there are way too many of those,” Habib said.

The trillions of particles simulated are “tracer” particles, themselves representing conglomerations of mass like galaxies.

“Modern cosmology isn’t about looking at one object,” Habib said. “It’s about building statistics over billions of objects.”

The smallest clump of mass distinguishable in the code is 100 billion solar masses (roughly equivalent to the Milky Way galaxy), and the code resolves distances from kiloparsecs to gigaparsecs—from galaxy-scale distances to the entire swath of the observable universe.

“There’s very high resolution in this code,” Habib said. “And one of the reasons is that you may be comparing the results to a range of observational surveys with different parameters, so everywhere in the simulation you need a resolution on the order of a million to one.”

HACC has applications for a wide range of cosmological studies, especially in the production of predictive surveys for studies of dark matter, cosmic background radiation, and other indicators of dark energy at work.

“Many observations of the sky are ongoing or planned for the near future,” Habib said. “HACC is the modeling side of that.”

Habib’s team, for instance, aims to use HACC in conjunction with the Large Synoptic Survey Telescope (LSST) project. LSST will follow changes in the southern sky over a period of 10 years. By comparing HACC simulations with integrated time-lapse images from LSST, which will provide measurements of weak gravitational lensing (or the distortion of background galaxy images by foreground matter concentrations), researchers intend to probe the nature of dark energy.

Accelerating the universe on GPUs

One of the code’s most attractive features is its versatility. With limited and varied access to supercomputers, researchers benefit from a code that can be used on multiple architectures.

“We have modules that we can plug in and out for different architectures,” Habib said. “The beauty of HACC is we can run the exact same problem on different machines with different architectures using different algorithms and, by comparison, make sure the answer is accurate.”

And as the numbers prove, HACC is highly effective across architectures. In the test runs on Titan and Mira, the code evolved at least 1.1 trillion particles each time, and a larger run on Sequoia landed at 3.6 trillion particles.

“This is the largest cosmological benchmark ever performed,” said Katrin Heitmann, who is part of ANL’s HACC team.

To accommodate different architectures, HACC’s framework is divided into two levels: a more homogenous grid level and a detailed, computationally-intensive, particle-to-particle level.

The difference in running the code on Titan’s GPUs versus running it on the all-CPU, many-core Blue Gene/Q systems like Sequoia and Mira, lies in the short-range, particle-to-particle calculations. Hybrid architectures like Titan use an algorithm designed for accelerated GPU hardware, whereas the many-core Blue Gene machines use an algorithm developed for CPUs.

“The grid is responsible for four orders of magnitude of dynamic range, which are longer distances, while the particle interactions handle the critical two orders of magnitude at the shortest scales,” Habib said. “The bulk of the time-stepping computation takes place in the latter.”

As a user “zooms in” to smaller orders of magnitude, instances of mass in the field become denser, and increasingly accurate calculations are required. Calculating these particle-to-particle interactions is where Titan’s GPUs excelled, generating the largest cosmological simulation on GPUs to date.

“The advantage of the GPUs is they have this raw speed,” Habib said. And this raw speed is amplified by HACC’s mixed-precision code, meaning that some calculations require 8 figures after the decimal (single precision) and some 16 (double precision).

For single-precision calculations, both CPUs and GPUs can double in speed. However, GPUs are doubling or tripling an already higher speed, which further reduces computation time.

“We view the GPUs as a place where you get ‘computing for free,’” Habib said. “The particle interactions do well on the GPUs because they’re very computationally intensive.”

An additional benefit of HACC’s modular framework is its potential scalability as supercomputing architectures evolve. HACC is a highly parallel code, meaning that it can perform many calculations at once, and its capability is not limited to today’s petaflop machines. HACC is prepared to accelerate with petascale computing.

“We don’t see a problem at least into machines capable of hundreds of petaflops,” Habib said.

And that’s good news for researchers because resolving the history of the universe and the mystery of dark energy may take a lot of computational power and a little bit of time. —by Katie Elyce Jones