
OLCF upgrades Arm test bed with NVIDIA GPUs and new CUDA software to explore new avenues in HPC

Alongside the regular high-performance computing (HPC) resources at supercomputing centers are test beds, small computing clusters that offer HPC experts a chance to explore new architectures before they are considered for deployment at a larger scale. Last year, the Oak Ridge Leadership Computing Facility (OLCF) ventured into new HPC territory when it installed the Arm-based test bed Wombat, powered by pre-production Cavium (now Marvell) ThunderX2 CPUs and AMD GPUs.

Now, the OLCF has upgraded Wombat, installing production Marvell ThunderX2 CPUs and NVIDIA V100 GPUs to test NVIDIA’s new CUDA software stack purpose-built for Arm CPU systems. In late October, immediately after the upgrade, eight teams ported their codes to the new system in the days leading up to the 2019 Supercomputing Conference (SC19). In less than two weeks, all eight codes, spanning a variety of scientific domains, were running smoothly on Wombat.
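
Part of what makes such rapid porting plausible is that CUDA source code does not change on an Arm host; only the host on which it is built differs. As a minimal, hypothetical sketch (file names and build details are illustrative, not taken from the Wombat teams), a standard vector-add program compiles on an aarch64 node with the same nvcc toolchain used on x86 systems:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Vector-add kernel: the device code is identical whether the host CPU
// is x86 or Arm; the Arm-enabled CUDA stack changes only the host build.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host buffers with known inputs so the result can be checked.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Device buffers and host-to-device copies.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch with 256 threads per block, enough blocks to cover n elements.
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", h_c[0]); // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

On the Arm host, building would look the same as anywhere else, for example nvcc -arch=sm_70 vecadd.cu -o vecadd for Wombat’s V100 GPUs.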

“At the end of the day, everything that we tried was successful,” said Oscar Hernandez, a computer scientist at the OLCF. “We didn’t have any applications that failed or couldn’t deliver at the time we wanted them to.”

Arm-NVIDIA server. Image Credit: NVIDIA

The undertaking represented the first implementation of an HPC server employing Arm CPUs and NVIDIA GPUs. Arm Holdings does not manufacture hardware but licenses its technology to hardware vendors. The company has a large stake in the mobile device market, but in recent years its technology, often touted for its energy efficiency, has been adopted by the HPC community.

“The combination of NVIDIA and Arm-based technology in a supercomputing cluster is very unique at this stage,” Hernandez said.

Although Arm did not traditionally focus on the HPC market, the use of Arm-based technology in computing systems aligns with the US Department of Energy’s (DOE’s) goals for encouraging a diverse HPC environment. The OLCF is a DOE Office of Science User Facility located at DOE’s Oak Ridge National Laboratory (ORNL).

“DOE seeks to create a sustainable vendor community to support a vibrant ecosystem of computing resources for its mission,” said Jack Wells, the director of the Office of Strategic Planning and Performance Management at ORNL. Wells served as the director of science at the OLCF from 2011 to 2019. “Understanding the work required to produce the necessary software tools for Arm-based systems was what led to the OLCF’s initial interest in deploying Wombat.”

One of the goals of the OLCF’s test bed program is to evaluate new and emerging technologies and assess their compatibility with one another in terms of functionality and performance. Experimenting with different implementations of separate technologies, in this case Arm CPUs and NVIDIA GPUs, allows HPC experts to gauge future requirements for current scientific applications and explore different combinations of hardware and software to meet those requirements.

NVIDIA granted the OLCF early access to the CUDA software platform for Arm, along with the supporting architectural components. Aiming to explore an array of applications on the updated system, the OLCF deployed eight leadership-class scientific codes that were either bandwidth-driven or compute-driven to understand the implications of running codes involving either high amounts of data movement or many compute-intensive calculations. In addition to these production applications, OLCF staff successfully ran parallel programming models, community libraries, benchmarking suites, and mini-applications on the system.
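
As an illustration of that distinction (these are hypothetical kernels, not the OLCF’s production codes), the sketch below contrasts the two extremes: the first kernel performs roughly one arithmetic operation per three memory accesses and is limited by memory bandwidth, while the second performs a long chain of arithmetic per element loaded and is limited by floating-point throughput.

#include <cstdio>
#include <cuda_runtime.h>

// Bandwidth-bound kernel (STREAM-triad style): one fused multiply-add
// per three 4-byte memory accesses, so runtime is set by how fast the
// GPU can move data, not by arithmetic throughput.
__global__ void triad(float *a, const float *b, const float *c, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = b[i] + s * c[i];
}

// Compute-bound kernel: a long dependent chain of fused multiply-adds
// per element loaded, so runtime is set by floating-point throughput.
__global__ void heavy(float *a, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = a[i];
        for (int k = 0; k < iters; ++k)
            x = x * 1.000001f + 0.000001f;
        a[i] = x;
    }
}

int main()
{
    const int n = 1 << 24;
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));
    // Zero-initialize inputs; the values are irrelevant to the timing behavior.
    cudaMemset(b, 0, n * sizeof(float));
    cudaMemset(c, 0, n * sizeof(float));

    triad<<<(n + 255) / 256, 256>>>(a, b, c, 2.0f, n); // memory-bound
    heavy<<<(n + 255) / 256, 256>>>(a, n, 1000);       // compute-bound
    cudaDeviceSynchronize();

    printf("done\n");
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}

The two classes of kernel stress different parts of the node, the first its memory system and the second its arithmetic units, which is why exercising both kinds of application gives a fuller picture of a new platform.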

Among the applications tested were the Combinatorial Metrics (CoMet) code for comparative genomics; GROningen MAchine for Chemical Simulations (GROMACS), used for molecular dynamics simulations; the Dynamical Cluster Approximation ++ (DCA++) code for quantum many-body systems; the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) code; and the Locally Self-consistent Multiple Scattering (LSMS) code for materials science. Application teams included members from organizations around the world as well as staff members at ORNL, including staff in the OLCF’s Scientific Computing Group.

“If we want to codesign a large-scale solution that links Arm and NVIDIA, understanding the current implementations and how to improve them will be crucial,” said Ross Miller, systems integration programmer at the OLCF. “Wombat gives teams the ability to identify areas requiring improvement to create a better ecosystem.”

ORNL plans to continue evaluating Arm in the context of emerging architectures.

ORNL collaborated with the following institutions during the evaluation of Wombat: the Innovative Computing Laboratory at the University of Tennessee, the University of Bristol, Sandia National Laboratories, the University of Illinois at Urbana-Champaign, the University of Tokyo, CERN, NVIDIA, Marvell, and Hewlett Packard Enterprise.

UT-Battelle LLC manages ORNL for DOE’s Office of Science. The Office of Science is the single largest supporter of basic research in the physical sciences in the United States and is working to address some of the most pressing challenges of our time. For more information, please visit https://energy.gov/science.

Rachel McDowell

Rachel McDowell is a science writer for the Oak Ridge Leadership Computing Facility.