Building the Bridge to Exascale
Building an exascale computer—a machine that could solve complex science problems at least 50 times faster than today’s leading supercomputers—is a national effort.
To oversee the rapid research and development (R&D) of an exascale system by 2023, the US Department of Energy (DOE) created the Exascale Computing Project (ECP) last year. The project brings together experts in high-performance computing from six DOE laboratories with the nation’s most powerful supercomputers—including Oak Ridge, Argonne, Lawrence Berkeley, Lawrence Livermore, Los Alamos, and Sandia—and project members work closely with computing facility staff from the member laboratories.
At the first ECP annual meeting, held January 29–February 3 in Knoxville, Tennessee, about 450 project members convened to discuss collaboration in breakout sessions focused on project organization and upcoming R&D milestones for applications, software, hardware, and exascale systems focus areas. During facility-focused sessions, senior staff from the Oak Ridge Leadership Computing Facility (OLCF) met with ECP members to discuss opportunities for the project to use current petascale supercomputers, test beds, prototypes, and other facility resources for exascale R&D. The OLCF is a DOE Office of Science User Facility located at DOE’s Oak Ridge National Laboratory (ORNL).
“The ECP’s fundamental responsibilities are to provide R&D to build exascale machines more efficiently and to prepare the applications and software that will run on them,” said OLCF Deputy Project Director Justin Whitt. “The facilities’ responsibilities are to acquire, deploy, and operate the machines. We are currently putting advanced test beds and prototypes in place to evaluate technologies and enable R&D efforts like those in the ECP.”
ORNL has a unique connection to the ECP. The Tennessee-based laboratory is the location of the project office that manages collaboration within the ECP and among its facility partners. ORNL’s Laboratory Director Thom Mason delivered the opening talk at the conference, highlighting the need for coordination in a project of this scope.
On behalf of facility staff, Mark Fahey, director of operations at the Argonne Leadership Computing Facility, presented the latest delivery and deployment plans for upcoming computing resources during a plenary session. From the OLCF, Project Director Buddy Bland and Director of Science Jack Wells provided a timeline for the availability of Summit, OLCF’s next petascale supercomputer, which is expected to go online in 2018; it will be at least 5 times more powerful than the OLCF’s 27-petaflop Titan supercomputer.
“Exascale hardware won’t be around for several more years,” Wells said. “The ECP will need access to Titan, Summit, and other leadership computers to do the work that gets us to exascale.”
Wells said he was able to highlight the spring 2017 call for Innovative and Novel Computational Impact on Theory and Experiment, or INCITE, proposals, which will give 2-year projects the first opportunity for computing time on Summit. OLCF staff also introduced a handful of computing architecture test beds—including the developmental environment for Summit known as Summitdev, NVIDIA’s deep learning and accelerated analytics system DGX-1, an experimental cluster of ARM 64-bit compute nodes, and a Cray XC40 cluster of 168 nodes known as Percival—that are now available for OLCF users.
In addition to leveraging facility resources for R&D, the ECP must understand the future needs of facilities to design an exascale system that is ready for rigorous computational science simulations. Facilities staff can offer insight about the level of performance researchers will expect from science applications on exascale systems and estimate the amount of space and electrical power that will be available in the 2023 timeframe.
“Getting to capable exascale systems will require careful coordination between the ECP and the user facilities,” Whitt said.
One important collaboration so far was the development of a request for information, or RFI, for exascale R&D that the ECP released in February to industry vendors. The RFI enables the ECP to evaluate potential software and hardware technologies for exascale systems—a step in the R&D process that facilities often undertake. Facilities will later release requests for proposals when they are ready to begin building exascale systems.
Oak Ridge National Laboratory is supported by the US Department of Energy’s Office of Science. The single largest supporter of basic research in the physical sciences in the United States, the Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.