OLCF Upgrades System to CUDA 6.5
Unified memory function facilitates data sharing, creates managed memory, improves performance
The Oak Ridge Leadership Computing Facility (OLCF), a US Department of Energy (DOE) Office of Science User Facility, completed its upgrade to CUDA 6.5 on June 23, delivering improvements in two areas.
The upgrade’s unified memory feature simplifies memory management between the CPUs and GPUs and also lets the two share data globally. The upgrade is a true win–win for the OLCF and its users.
“The upgrade features numerous benefits,” said Jim Rogers, director of Computing and Facilities for the National Center for Computational Sciences at DOE’s Oak Ridge National Laboratory. “It provides performance improvements, process improvements, better error reporting as well as improved Fortran support, high performance stack libraries, and other features. In short, this step is important for us for a number of reasons.”
Don Maxwell, task lead in the OLCF’s High-Performance Computing Operations Group, added, “We’ve been anxious to get to 6.5. The notion of the unified memory in CUDA 6.5 is important. Our users have been excited about taking advantage of its capabilities.”
“There was a bug in the CUDA 6.5 driver that delayed this deployment,” Maxwell said. “The problem was first found by Cray in its internal testing. NVIDIA couldn’t reproduce the issue in its own environment; they could only do it on the Cray XK7.”
Added Rogers, “We worked directly with NVIDIA to provide them access to our test system. In short order, NVIDIA discovered the root cause of the bug in the power management unit and corrected it. The correction eliminated the spurious double bit errors. This is something that had to be fixed; we couldn’t just ignore the double bit errors. Working with NVIDIA exposed the root cause and provided the prerequisites required for us to install CUDA 6.5.
“Providing them access to our system reduced our time to solution,” Rogers continued. “Truly, it was a Herculean effort, and now these modifications are available to everyone using the Tesla product.”
Making these modifications set the stage for the CUDA upgrade, which will benefit OLCF users, according to OLCF user support specialist Fernanda Foertter.
“Some of our users have been waiting on the managed memory aspects of the driver. Instead of explicitly managing the memory transfer, the driver creates a ‘bucket’ of memory on the host side and transfers it as needed,” she said. “It means less coding for the user, not having to think about when to transfer the memory, and whether the memory will be there when you need to use it.”
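As a rough illustration of the managed memory Foertter describes, the sketch below allocates a single buffer with `cudaMallocManaged` and uses the same pointer on both the CPU and the GPU, with no explicit `cudaMemcpy` calls. The kernel `scale` and the helper `scale_with_managed_memory` are hypothetical names chosen for this example and are not taken from OLCF code; the sketch assumes a GPU of compute capability 3.0 or higher.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Multiply every element of data by factor, one thread per element.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper for illustration: fills a managed buffer on the host,
// scales it on the GPU, and checks the result on the host.
// Returns the number of incorrect elements, or -1 on a CUDA error.
int scale_with_managed_memory(int n, float factor) {
    float *data = NULL;

    // One allocation, visible to both CPU and GPU through the same pointer.
    if (cudaMallocManaged(&data, (size_t)n * sizeof(float)) != cudaSuccess)
        return -1;

    // Host writes directly into the managed buffer.
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    // Launch the kernel on the same pointer; no explicit host-to-device copy.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(data, n, factor);

    // Wait for the kernel to finish before the host touches the data again.
    if (cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(data);
        return -1;
    }

    // Host reads the results directly; no explicit device-to-host copy.
    int wrong = 0;
    for (int i = 0; i < n; ++i)
        if (data[i] != factor) ++wrong;

    cudaFree(data);
    return wrong;
}
```

Calling `scale_with_managed_memory(1 << 16, 2.0f)` from a `main` function should return 0 on a working system. The explicit `cudaDeviceSynchronize` call matters: on GPUs of this generation, the host must not access managed memory while a kernel is still running.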
Rogers added, “Our collaboration with NVIDIA was a very positive aspect of this story. NVIDIA provided one of its most senior software engineers to help diagnose the bug. It demonstrates NVIDIA’s commitment to identifying and correcting a problem. We took it seriously and gave them access to our hardware. And now our users will enjoy the benefits of this upgrade to CUDA 6.5.”
Oak Ridge National Laboratory is supported by the US Department of Energy’s Office of Science. The single largest supporter of basic research in the physical sciences in the United States, the Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.