Since partnering with IBM, NVIDIA, and Mellanox in 2014, the US Department of Energy’s (DOE’s) Oak Ridge National Laboratory (ORNL) has been forging the path to its next big supercomputer, Summit, which will feature an IBM POWER9 architecture and NVIDIA Volta GPUs.
Scheduled to enter full production in 2019, Summit will allow researchers to dive deeper into some of the most complex problems facing our world today. Because Summit may simulate some systems at 5 to 10 times the speed of Titan, the Oak Ridge Leadership Computing Facility’s (OLCF’s) current flagship supercomputer, members of the Center for Accelerated Application Readiness (CAAR) teams are redesigning and optimizing their application codes so that users can make the best possible use of Summit’s increased performance. The OLCF is a DOE Office of Science User Facility located at ORNL.
Last fall, the OLCF received the Summit Early Access development platform (Summitdev), a system that is one generation removed from Summit’s architecture, enabling the CAAR teams and other staff members to test IBM’s POWER processors and NVIDIA’s Pascal GPU–based architecture line and optimize their codes before Summit arrives. Staff received IBM’s latest software stack for the system in November, and CAAR team members have already begun testing their codes on the early access system.
“The Summitdev system is not only for users to start developing for the future Summit system, but it’s also an opportunity for our staff to learn,” said Ashley Barker, group leader for the User Assistance and Outreach Group at the OLCF. “There are many aspects of administering this system that are different from our previous systems—for example, the way we deploy software, the way users run jobs, and even the file system. This makes Summitdev crucial so that we can get the system administration tasks figured out before Summit gets here.”
Summitdev Workshop
OLCF staff members led a Summitdev training workshop January 10–12 at ORNL to provide CAAR teams with training for using the new early access system, which differs significantly from previous architectures.
“We’re going from a typical x86 Intel AMD CPU in Titan to the POWER processors, accelerated with the more powerful Pascal GPUs,” said Tjerk Straatsma, group leader for the Scientific Computing Group at the OLCF. “Our Scientific Computing Group staff started working on Summitdev and came up with a lot of questions about many aspects of this new architecture. The workshop aimed to answer some of these questions.”
The training event included 2 days of talks and presentations from experts at IBM, NVIDIA, and PGI and a half-day of hands-on training to introduce staff members to the platform and give them a chance to implement their new knowledge. Topics included POWER architecture line differences, GPU accelerator changes, programming standards and languages, file system practices, and software development tools.
More than 80 professionals attended the workshop, including staff members working for the IBM/NVIDIA Center of Excellence at ORNL, which facilitates application readiness for Summit.
“What made this event so important was that the developers here had access to the vendors,” Straatsma said. “The vendors can tell our developers what the current state is of both the hardware and the software and talk about the specifics of the tools that are available, so this workshop was very beneficial.”
CAAR team members, who have received priority on the system since its arrival, made up the majority of the attendees. CAAR codes are represented in many current projects running on OLCF resources through various allocation programs, so their successful optimization on pre-Summit systems is essential for continued success on Summit itself. At the workshop, teams learned how to run jobs and troubleshoot problems; they also worked through porting exercises that they can reuse throughout 2017 to help optimize their codes.
Workshop attendee Bronson Messer, computational scientist at the OLCF and CAAR team lead for FLASH, has found great success on the Summitdev system thus far. He said that FLASH, a multiphysics code for simulating supernova explosions, has the potential to run four to five times faster on the full Summit system with OpenMP 4.5, a directive-based interface for parallel programming.
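For readers unfamiliar with OpenMP 4.5, the sketch below shows roughly what directive-based GPU offload looks like in C. It is a generic illustration and is not taken from FLASH or from the workshop materials; the function name `saxpy` and its parameters are assumptions chosen for the example. If the compiler does not have OpenMP offload enabled, the directive is ignored and the loop simply runs on the CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only (hypothetical helper, not FLASH code):
 * computes y[i] = a * x[i] + y[i] for i in [0, n). */
void saxpy(int n, float a, const float *x, float *y)
{
    assert(n >= 0);
    assert(n == 0 || (x != NULL && y != NULL));

    /* OpenMP 4.5 "target" offload: x is copied to the device,
     * y is copied to the device and back, and the loop iterations
     * are spread across teams of device threads. Without an
     * offload-capable compiler or device, the loop runs on the host. */
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The appeal of this style for porting work is that the loop body stays ordinary C; only the directive and its data-mapping clauses describe how the work moves to the accelerator.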
“One of the most important things to learn on a new machine is how to submit and run jobs. Without that, you can’t really experiment like you want to,” Messer said. “The workshop gave us a recipe for running jobs. Bob Walkup of the IBM Watson Research Center did a great job of explaining how to get your job done and how to fix problems you encounter, because he knows exactly how people compute, so his lectures were extremely helpful.”
Other attendees included group leaders Sudharshan Vazhkudai of the Technology Integration Group, David Bernholdt of the Computer Science Research Group, and members of the High-Performance Computing Operations Group. The workshop covered a broad range of technologies, software, and compilers to ensure that each group received applicable information.
“The workshop was very useful to our group because it helps us establish best practices that we can use to eventually help users with allocations,” Straatsma said. “One of the big benefits of workshops like this is that we learn about how to make applications ready for a new architecture. The more we learn about how to do this, the more we can help others do the same thing in the future.”
Summitdev, Tundra, and Percival
Summitdev is a two-cabinet system with 36 nodes that each contain two IBM POWER8 CPUs and four of NVIDIA’s Pascal architecture–based Tesla accelerator GPUs. The GPUs are connected by NVLink 1.0, a node-local network that enables fast communication between multiple processors. The two cabinets that make up Summitdev will remain in place over the next year for testing until the full Summit system goes online in 2018.
The OLCF also received an additional system cabinet called Tundra that staff members will separately test over the next year. This system is part of the early access system but will be used exclusively by staff members to limit outages on the Summitdev system that could affect CAAR teams’ progress.
Percival, a third system containing 168 nodes, is also available for CAAR team testing. The Cray XC40 cluster will advance OLCF portability goals because it is built on the Intel Knights Landing processor, the same processor used in the National Energy Research Scientific Computing Center’s Cori supercomputer and the Argonne Leadership Computing Facility’s Theta system.
Over the next year, CAAR team experts and staff members will participate in scheduled workshops and training events while continuing to optimize their codes for Summit. The next workshop will be in October 2017.
Oak Ridge National Laboratory is supported by the US Department of Energy’s Office of Science. The single largest supporter of basic research in the physical sciences in the United States, the Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.