The Faces of Summit series shares stories of the people behind America’s top supercomputer for open science, the Oak Ridge Leadership Computing Facility’s Summit. The IBM AC922 machine launched in June 2018.
When the Oak Ridge Leadership Computing Facility’s (OLCF’s) newest supercomputer, Summit, comes online in 2018 at the US Department of Energy’s (DOE’s) Oak Ridge National Laboratory (ORNL), the system is expected to be one of the most powerful supercomputers in the world and one of the best machines for scientific computing and artificial intelligence applications.
At five to 10 times the computing power of OLCF’s current 27-petaflop Titan system, Summit’s leap in performance cannot be purchased out of the box. To meet specific performance, reliability, and efficiency requirements, OLCF staff collaborated with vendors IBM, NVIDIA, and Mellanox and CORAL partners Lawrence Livermore and Argonne National Laboratories to engineer Summit’s unique scientific computing environment, including customizations for software and hardware.
Known as nonrecurring engineering (NRE)—a one-time phase of R&D—this critical step in building Summit is steered for OLCF by Sudharshan Vazhkudai, Technology Integration (TechInt) Group leader, and Al Geist, OLCF Chief Technology Officer.
“The Technology Integration Group looks at the gaps in the system software and hardware stack and works with the vendors to figure out how we can deploy new solutions to address the gaps,” said Vazhkudai, who oversees more than a dozen R&D staff members in TechInt.
For OLCF’s NRE tasks, TechInt works closely with other OLCF groups, including High-Performance Computing (HPC) Operations and Computer Science Research.
“Nonrecurring engineering is a behind-the-scenes process with a cross-cutting group of people who are designing a better Summit for our users,” Vazhkudai said. “We are co-designing solutions by working with vendors and our CORAL partner.”
NRE collaborations require many experts who will serve different roles when the system is up and running. Vazhkudai said he likes helping fit the puzzle pieces together: “I like building systems and mobilizing a group that can do much more than I could do alone. That is why I’m at a place like OLCF where we build larger-than-life solutions.”
NRE collaborators organized working groups to focus on burst buffers, the file system, messaging, cluster systems management, and programming and tools. To design NRE tasks for the Summit software and hardware stack, OLCF staff provided requirements and use cases—studies of common requests and challenges for OLCF users—to vendors, who worked with OLCF to co-design a solution for each task. While Summit is under construction, OLCF staff are testing vendor prototypes on OLCF’s early access development platform Summitdev and providing feedback on usability.
Boosting the speed of Summit
Ultimately, Summit is built for scientific discovery and will process and produce extreme amounts of data. Summit’s IBM Power9 CPU and NVIDIA Volta GPU architecture expedite processing and boost memory to enable more complex simulations and data-intensive applications. To support faster communication, Mellanox high-speed interconnects reduce latency, the lag between sending and receiving data, and sustain high I/O throughput, the rate at which data is successfully delivered.
Three major NRE focus areas for OLCF’s TechInt Group involve memory and data flow, including developing software for burst buffers and a new file system and improving messaging between Summit’s hardware and software.
Each Summit node uses nonvolatile flash memory that serves as a burst buffer to absorb an onslaught of data before it is steadily drained to the file system. Because burst buffers are a new addition to OLCF systems, the burst buffer working group led by Vazhkudai co-designed the burst buffer application programming interface (API) to seamlessly drain the data from the node-local nonvolatile memory to the center-wide IBM General Parallel File System (GPFS).
In addition, the burst buffer working group developed software to monitor flash memory performance and usage and a distributed log-structured file system layered on top of the node-local flash memory. Together, the burst buffer API and the distributed file system address use cases in which multiple application processes write to multiple files as well as a single shared file. These NRE tasks on the burst buffer will allow applications to write data at more than 9.5 terabytes per second, which is roughly four times faster than the center-wide GPFS.
By reducing the number of partitions in the file system, the file system working group optimized the IBM GPFS to store up to 250 petabytes of Summit data in a single file system partition at 2.5 terabytes per second.
“Our solution needed to be a single file system without multiple partitions,” Vazhkudai said. “Our use cases are so extreme, we cannot deploy an out-of-the-box system.”
IBM and OLCF also worked together to develop software that scales GPFS, which will offer more than seven times the capacity of Titan’s file system, to meet Summit’s user data requirements and its stringent metadata demands, such as creating 50,000 files per second in a single directory.
In the messaging working group, TechInt staff members guided vendor developments that are improving application network performance for Summit. The group modified the Parallel Active Messaging Interface (PAMI) layer to take advantage of Summit’s Mellanox hardware, including developing accelerated VERBS in PAMI, integrating the use of Scalable Hierarchical Aggregation Protocol collectives, and developing support for hardware tag matching within Mellanox adaptors, among other tasks. The group also improved messaging overlap with GPU-initiated communication, including implementing NVIDIA’s GPUDirect technologies to reduce memory movement between GPUs.
Improving the user experience
TechInt works closely with other OLCF teams for cluster systems management and programming and tools. OLCF HPC Operations staff oversee OLCF’s NRE tasks for software deployed to launch, schedule, and manage jobs on Summit. The cluster systems management working group has added requirements for Summit’s software to include features that users are accustomed to on Titan.
Further, TechInt and HPC Operations have guided the development of a big data system that will provide OLCF staff a better overview of Summit usage and enable them to troubleshoot user applications. Vazhkudai and his team used their prior experience in developing and deploying the GUIDE framework for Titan, which correlated log data from multiple subsystems such as the compute nodes, network, I/O, and scheduling subsystems.
In addition to improving how users access Summit, scientists from the OLCF Computer Science Research Group and vendors are improving how user applications will run on Summit.
The programming working group’s NRE work has pushed the development of directive-based programming languages, such as OpenMP and OpenACC, to address the growing use of accelerator-based systems like Summit and ensure that high-quality compiler support is available to users. By working through the open-source compiler project LLVM, the improvements made through NRE activities will also be available to the broader community.
Because performance and debugging tools for accelerators are still relatively new compared to tools for CPU-only systems, OLCF staff and vendors are pushing the capability of these tools to the benefit of OLCF users, as well as the research community. For Summit users, NRE activities are ensuring the continuity of capabilities from Titan to Summit, particularly by addressing the significant increase in Summit’s node-level parallelism relative to Titan.
TechInt team members overseeing NRE tasks include Vazhkudai, Scott Atchley, Sarp Oral, and Chris Zimmer. HPC Operations tasks are led by Don Maxwell, and Computer Science Research tasks are overseen by David Bernholdt, Mike Brim, and Oscar Hernandez. Overall, NRE work has contributed to a more robust software and hardware stack for Summit, which OLCF staff are now readying for deployment later this year.
The OLCF is a DOE Office of Science User Facility.
ORNL is managed by UT-Battelle for the Department of Energy’s Office of Science. The Office of Science is the single largest supporter of basic research in the physical sciences in the United States and is working to address some of the most pressing challenges of our time. For more information, please visit https://science.energy.gov.