A major interest of ADAC’s three founding members is GPU-accelerated computing. The supercomputing centers of all three organizations—the Oak Ridge Leadership Computing Facility (OLCF) at ORNL, the Global Scientific Information and Computing Center at the Tokyo Institute of Technology (Tokyo Tech), and the Swiss National Supercomputing Center at the Swiss Federal Institute of Technology, Zurich (ETH Zurich)—manage heterogeneous supercomputers that rank among the world’s most powerful systems and provide essential leadership-class computing resources to academia, government, and industry for the benefit of the global scientific community.
“With ADAC we’re trying to gauge the challenges all three centers face collectively,” said OLCF Operations Manager Stephen McNally. “We want to share our experiences and knowledge and identify topics where we can come together to address hurdles facing accelerated computing.”
In its inaugural form, ADAC is focused on three areas: resource management, applications, and performance. Representatives from ORNL, Tokyo Tech, and ETH Zurich serve as leads in each area. For ORNL, McNally is serving as the resource management co-lead alongside Jim Rogers, the National Center for Computational Sciences director for computing and facilities. OLCF Scientific Computing Group Leader Tjerk Straatsma is serving as the applications lead, and ORNL’s Jeff Vetter is the performance lead.
During a workshop hosted by ETH Zurich earlier this summer, ADAC members identified multiple avenues for future discussion and collaboration.
For the resource management group, which is charged with ensuring the smooth and efficient operation of massive HPC systems, a clear long-term goal is to work with vendors to better define what it means to use a GPU, that is, which usage metrics a system should be able to report.
“Today we can monitor a few things concerning GPU usage, but we’d like to see some more useful metrics, such as a power profile where you can see the energy over time for certain applications,” McNally said. “If you can see the power trends when a job is running, you can really start to dig into whether it’s efficient use.”
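As a rough illustration of the kind of power profile McNally describes, the sketch below turns a series of timestamped power readings into total energy and average power for a job. It is only a minimal example under stated assumptions: the function names and the sample readings are hypothetical, and it does not represent any monitoring tool used by ADAC members. How the readings would be collected from a GPU is left out entirely.

```python
from typing import Sequence


def energy_joules(times_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate power samples (watts) over time (seconds) with the trapezoidal rule.

    Hypothetical helper for illustration; the samples are assumed to come from
    some external power monitor.
    """
    if len(times_s) != len(power_w):
        raise ValueError("times_s and power_w must have the same length")
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        if dt < 0:
            raise ValueError("timestamps must be non-decreasing")
        # Area of the trapezoid between two consecutive samples.
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def average_power_watts(times_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Average power over the sampled interval (energy divided by elapsed time)."""
    elapsed = times_s[-1] - times_s[0]
    if elapsed <= 0:
        raise ValueError("need at least two samples spanning a positive interval")
    return energy_joules(times_s, power_w) / elapsed


# Made-up readings for a 30-second job: time in seconds, power in watts.
times = [0.0, 10.0, 20.0, 30.0]
power = [100.0, 200.0, 200.0, 100.0]

print(energy_joules(times, power))        # total energy in joules
print(average_power_watts(times, power))  # average power in watts
```

Comparing the energy a job consumes with the work it completes is one way an operator could start to judge whether the job uses the GPU efficiently.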
Other discussions concerned the complexity of HPC storage—which has evolved over time into immediate, short-term, and long-term tiers—and the oftentimes underappreciated role of InfiniBand networks in linking compute and storage.
“InfiniBand is critical to what we do, yet it’s primarily thought of as a second-class technology,” McNally said. “Most storage or HPC system administrators are forced to learn this HPC networking standard out of necessity. At the most recent ADAC meeting, we recognized that HPC facilities may be better off treating InfiniBand with the same level of respect as Ethernet network systems by designating staff to support it.”
On the applications side, ADAC’s working group identified several fundamental capabilities used across the computational science community for which the institute could provide accelerated implementations, benefiting multiple applications in targeted science domains. The improvements would be especially beneficial to users of next-generation systems like the OLCF’s Summit.
These capabilities will be developed as general domain-specific libraries, highly optimized code implementations that efficiently execute on GPU-accelerated systems and provide software developers with “prepackaged” resources that have the potential to benefit a wide range of software. Currently ADAC members are creating libraries that will assist applications in biophysics, climate science, electronic structure, and computational chemistry.
“Within this collaboration we are leveraging our current code development programs, such as the OLCF’s Center for Accelerated Application Readiness effort,” Straatsma said. “Each ADAC member contributes its own ongoing projects, but there happens to be a lot of synergy between our development efforts. That’s what makes our collaboration so valuable.”
ADAC’s next official workshop will take place next month at the Smoky Mountains Computational Sciences and Engineering Conference in Gatlinburg, Tennessee.
Oak Ridge National Laboratory is supported by the US Department of Energy’s Office of Science. The single largest supporter of basic research in the physical sciences in the United States, the Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.