World’s Most Powerful Accelerator Comes to Titan with a High-Tech Scheduler

PanDA manages all of ATLAS’s data tasks from a server located at CERN, the European Organization for Nuclear Research. Image Credit: CERN.

Collaboration holds potential benefits for OLCF

The people who found the Higgs boson have serious data needs, and they’re meeting some of them on the Oak Ridge Leadership Computing Facility’s (OLCF’s) flagship Titan system.

Researchers with the ATLAS experiment at Europe’s Large Hadron Collider (LHC) have been using Titan since December, according to Ken Read, a physicist at Oak Ridge National Laboratory and the University of Tennessee. Read, who works with another LHC experiment, known as ALICE, noted that much of the challenge has been in integrating ATLAS’s advanced scheduling and analysis tool, PanDA, with Titan.

PanDA (for Production and Distributed Analysis) manages all of ATLAS’s data tasks from a server located at CERN, the European Organization for Nuclear Research. The job is daunting: the workflow includes 1.8 million computing jobs each day, distributed among 100 or so computing centers across the globe.

PanDA is able to match ATLAS’s computing needs seamlessly with disparate systems in its network, making efficient use of resources as they become available.

In all, PanDA manages 150 petabytes of data (enough to hold about 75 million hours of high-definition video), and its needs are growing so rapidly that it requires access to a supercomputer with the muscle of Titan, the United States’ most powerful system.

“For ATLAS, access to the leadership computing facilities will help it manage a hundredfold increase in the amount of data to be processed,” said ATLAS developer Alexei Klimentov of Brookhaven National Laboratory. PanDA was developed in the United States under the guidance of Kaushik De of the University of Texas at Arlington and Torre Wenaus from Brookhaven National Laboratory.

“Our grid resources are overutilized,” Klimentov said. “It’s a question of where we can find resources and use them opportunistically. We cannot scale the grid 100 times.”

To integrate with Titan, PanDA team developers Sergey Panitkin of Brookhaven National Laboratory (BNL) and Danila Oleynik of the University of Texas at Arlington (UTA) redesigned the parts of the PanDA system that handle job submission on remote sites (known as “Pilot”) for use on Titan and gave PanDA a new capability to collect information about unused worker nodes on Titan. This allows PanDA to precisely define the size and duration of jobs submitted to Titan according to the free resources available. This work was done in collaboration with OLCF technical staff.
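The idea of sizing a job to fit the free resources can be illustrated with a short Python sketch. This is not PanDA's or the Pilot's actual code: the names `BackfillWindow`, `JobRequest`, and `size_job`, along with the minimum-duration and safety-margin values, are hypothetical and chosen only to show the logic of matching a request to a block of idle nodes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BackfillWindow:
    """Hypothetical description of idle nodes and how long they stay free."""
    free_nodes: int
    minutes_available: int


@dataclass(frozen=True)
class JobRequest:
    """Hypothetical job shape: how many nodes to ask for, and for how long."""
    nodes: int
    walltime_minutes: int


def size_job(
    window: BackfillWindow,
    min_nodes: int = 1,
    min_minutes: int = 30,
    safety_margin_minutes: int = 5,
) -> Optional[JobRequest]:
    """Fit a job into the idle window, or return None if the window is too small.

    The thresholds are illustrative assumptions, not values used on Titan.
    """
    # Leave a small margin so the job ends before the nodes are needed again.
    usable_minutes = window.minutes_available - safety_margin_minutes
    if window.free_nodes < min_nodes or usable_minutes < min_minutes:
        return None
    return JobRequest(nodes=window.free_nodes, walltime_minutes=usable_minutes)
```

Under these assumptions, a window of 300 idle nodes free for 120 minutes yields a request for 300 nodes and 115 minutes, while a window free for only 20 minutes is skipped.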

The collaboration holds potential benefits for OLCF as well as for ATLAS.

In the first place, PanDA’s ability to efficiently match available computing time with high-priority tasks holds great promise for a leadership system such as Titan. While the OLCF focuses on projects that can use most, if not all, of Titan’s 18,000-plus computing nodes, occasionally a relatively small number of nodes sit idle for one or several hours. They sit idle because there are not enough of them, or they are not free for long enough, to handle a leadership computing job. A scheduler that can occupy those nodes with high-priority tasks would be very valuable.

“Today, if we use 90 or 92 percent of available hours, we think that is high utilization,” said Jack Wells, director of science at the OLCF. “That’s because of inefficiencies in scheduling big jobs. If we have a flexible workflow to schedule jobs for backfill, it would mean higher utilization of Titan for science.”

PanDA is also highly skilled at finding needles in haystacks, as it showed during the search for the Higgs boson.

According to the Standard Model of particle physics, the field associated with the Higgs is necessary for other particles to have mass. The boson is also very massive itself and decays almost instantly; this means it can be created and detected only by a very high-energy facility. In fact, it has, so far, been found definitively only at the LHC, which is the world’s most powerful particle accelerator.

But while high energy was necessary for identifying the Higgs, it was not sufficient. The LHC creates 800 million collisions between protons each second, yet it creates a Higgs boson only once every one to two hours. In other words, it takes 4 trillion collisions, more or less, to create a Higgs. And it takes PanDA to manage ATLAS’s data processing workflow in sifting through the data and finding it.
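The 4 trillion figure follows from the two rates quoted above, and the arithmetic can be checked in a few lines of Python. The constant and function names are illustrative only; the inputs are the article's own numbers.

```python
# Figures quoted in the article.
COLLISIONS_PER_SECOND = 800_000_000  # proton collisions each second
SECONDS_PER_HOUR = 3_600


def collisions_in(hours: int) -> int:
    """Total proton collisions over the given number of hours."""
    return COLLISIONS_PER_SECOND * SECONDS_PER_HOUR * hours


# One Higgs boson appears roughly every one to two hours.
low = collisions_in(1)   # collisions in one hour
high = collisions_in(2)  # collisions in two hours

print(f"{low:,} to {high:,} collisions per Higgs boson")
```

One hour gives about 2.9 trillion collisions and two hours about 5.8 trillion, so "4 trillion, more or less" sits inside that range.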

PanDA’s value to high-performance computing is widely recognized. The Department of Energy’s offices of Advanced Scientific Computing Research and High Energy Physics are, in fact, funding a project known as Big PanDA to expand the tool beyond high-energy physics to be used by other communities.