OLCF upgrade answers users’ requests for more RAM and GPUs
At the Oak Ridge Leadership Computing Facility (OLCF), users ask and they receive. Prompted by the annual user survey, OLCF’s Rhea received an upgrade this fall. In two phases, OLCF staff doubled the RAM on each Rhea node and added GPU nodes.
Rhea is a commodity-type Linux cluster that provides a conduit for large-scale scientific discovery through pre- and postprocessing and analysis of simulation data generated on the OLCF’s 27-petaflop Titan supercomputer.
“When users run a large simulation on Titan, often what results is a large amount of data output written to the Atlas file system, which is shared across all other OLCF compute resources,” said Clay England, high-performance computing systems Linux cluster administrator. “One of Rhea’s purposes is for users to do postprocessing of that data—for example, reducing it, visualizing it, or analyzing it.”
Before the upgrade, each of Rhea’s 512 nodes had 64 gigabytes of RAM. But according to the annual survey, users still needed more. During the week of August 18, the OLCF, a US Department of Energy (DOE) Office of Science User Facility located at DOE’s Oak Ridge National Laboratory, upgraded each node to 128 gigabytes of RAM, doubling Rhea’s main memory.
“We have found that some users’ data sets won’t fit into 64 gigabytes of RAM per node requested,” England said. “With this upgrade, users have more RAM available, allowing users to load more data on a per-node basis. For some user cases that are not compute intensive, as is the case with some analyses, users will require fewer Rhea nodes per job. Thus, Rhea will be able to accommodate more user jobs simultaneously.”
The OLCF upgraded Rhea in 2014, increasing its node count from 196 to 512. This year’s upgrade added only nine nodes, but these nodes are notably different. Each new node contains a terabyte of RAM (1,000 gigabytes) and two NVIDIA K80 GPUs, similar to the ones presently in Titan.
“There are other user cases that are inherently serial, that is, the jobs won’t efficiently parallelize. The new GPU nodes, with their large amount of RAM, are intended to service these user cases, which may require the entire data set to be loaded into memory at once,” England said.
After specialists from Dell upgraded the RAM, the OLCF ran tests to make sure it was working correctly. The GPU nodes were installed soon after, with the full upgrade completed last month. – Miki Nolin
Oak Ridge National Laboratory is supported by the US Department of Energy’s Office of Science. The single largest supporter of basic research in the physical sciences in the United States, the Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.