New platform allows users and computer scientists to test software and algorithms in anticipation of Summit supercomputer
Computer scientists and support staff at the Oak Ridge Leadership Computing Facility (OLCF) at the US Department of Energy’s Oak Ridge National Laboratory want researchers to focus on scientific discovery. They can handle the rest.
In November, the OLCF, a DOE Office of Science User Facility, announced that its next-generation supercomputer, Summit, would be delivered in 2018. To make sure Summit is ready for researchers from day one, the OLCF Scientific Computing (SciComp), Technology Integration (TechInt), and High-Performance Computing Operations (HPC Ops) groups are collaborating on a test bed that staff and vendors can use to prepare for the new machine. Jason Hill, storage team leader for HPC Ops at the OLCF, explained why test beds are necessary when preparing for a new supercomputer.
“The effort of building a test bed is making sure that you are ready to serve your users,” Hill said. “If you don’t understand the hardware or the software stack the day you get the machine, you can’t serve your users, and that, in the end, means they can’t get their science done.” The initial test bed was ready for staff in December 2014, giving them multiple years to prepare their applications for Summit’s new architecture.
Shortly after the OLCF chose Summit, an IBM-built computer using components from NVIDIA and Mellanox, as its next system, members of SciComp, TechInt, and HPC Ops came together to build a simple yet realistic test bed based on Summit’s proposed computing architecture. Although the test bed is only a small cluster (four compute nodes, each equipped with four GPUs), it is managed via a central management server using administration and user-environment tools that will resemble those on Summit.
In addition to helping staff scale up scientific applications, the test bed serves as fertile ground for software testing. Most software vendors do not have access to large-scale computing resources. By running early versions of their software on the test bed, vendors can collaborate with OLCF support staff to fix bugs before products are released. “We can put problems that we encounter in the hands of TechInt, which does a great job of going out and telling our vendors, ‘Here’s where the problem is,’” Hill said. “Not only do we report a problem, but sometimes we can go out and actually find the source of the problem and, in some cases, we are able to hand over a patch and say, ‘Here’s the fix for it.’”
After building the initial test bed in December, the team plans to add a General Parallel File System to better store and access data as well as an InfiniBand interconnect to speed up communication between nodes. —Eric Gedenk
Oak Ridge National Laboratory is supported by the US Department of Energy’s Office of Science. The single largest supporter of basic research in the physical sciences in the United States, the Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.