Upgrades allow staff to monitor hardware failures and performance data in real time
Staff members at the US Department of Energy’s (DOE’s) Oak Ridge National Laboratory (ORNL) are constantly trying to make users’ experiences on the country’s fastest supercomputer as smooth as possible.
One of the most complicated aspects of using supercomputers revolves around sending data to the various supercomputing resources of the Oak Ridge Leadership Computing Facility (OLCF), a DOE Office of Science User Facility located at ORNL.
The OLCF’s flagship supercomputer, a Cray XK7 called Titan, sends data between storage systems, analysis clusters, and transfer nodes using the Atlas file system.
Atlas contains roughly 20,000 disk drives housed within 72 controllers to handle supercomputer-sized data sets, making the task of monitoring Titan’s storage far more involved than monitoring a laptop’s. For years, DataDirect Networks (DDN) has partnered with the OLCF to help build these disk drives. Because of the vast number of drives, however, the OLCF needed to develop a more efficient performance monitoring tool.
“The DDNTool software that I wrote is basically designed to gather data from all 72 controllers and aggregate it,” said OLCF staffer Ross Miller. “In addition to getting information about whether a drive is failing, we’re also getting performance data.”
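The gather-and-aggregate pattern Miller describes can be illustrated with a minimal Python sketch. This is not DDNTool's actual code: `poll_controller`, `ControllerReport`, and the synthetic numbers are hypothetical stand-ins for whatever query a real controller would answer. Only the count of 72 controllers comes from the article.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

NUM_CONTROLLERS = 72  # figure quoted in the article


@dataclass
class ControllerReport:
    """Hypothetical per-controller snapshot (fields are assumptions)."""
    controller_id: int
    failing_drives: list[str]
    write_mb_per_s: float


def poll_controller(controller_id: int) -> ControllerReport:
    """Hypothetical stand-in for a real controller query.

    Returns synthetic, deterministic data so the sketch runs anywhere.
    """
    failing = [f"c{controller_id}-d7"] if controller_id % 24 == 0 else []
    return ControllerReport(controller_id, failing, 100.0 + controller_id)


def aggregate(reports: list[ControllerReport]) -> dict:
    """Fold the per-controller reports into one combined view."""
    return {
        "controllers": len(reports),
        "failing_drives": sorted(d for r in reports for d in r.failing_drives),
        "total_write_mb_per_s": sum(r.write_mb_per_s for r in reports),
    }


def collect_all(num_controllers: int = NUM_CONTROLLERS) -> dict:
    """Poll every controller concurrently, then aggregate the results."""
    with ThreadPoolExecutor(max_workers=16) as pool:
        reports = list(pool.map(poll_controller, range(num_controllers)))
    return aggregate(reports)


summary = collect_all()
print(summary["controllers"], summary["failing_drives"])
```

The point of the sketch is the shape of the problem: one combined view replaces 72 separate ones, and drive-health and performance figures arrive in the same pass.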
Miller has been working on DDNTool since he joined the OLCF in 2009. At the time, the OLCF had the Cray XT5 Jaguar supercomputer—Titan’s predecessor—and was using the Spider file system. Though DDN offered tools for monitoring systems, those tools were designed to support smaller systems.
Miller was tasked with creating a tool for the OLCF’s needs. Through the tool’s development, support staff working in the OLCF’s High-Performance Computing (HPC) Operations group went from clicking on 72 different screens to having one interface that could show all the drives simultaneously in near real time.
DDNTool must evolve to keep up with increases in the power of the OLCF’s computing resources. The second version, designed to complement the updated interface for DDN’s controllers, came out as soon as the OLCF upgraded to the Atlas file system.
In addition to alerting HPC Operations staff to potential drive failures, DDNTool has helped staff diagnose performance issues that users might experience with Titan. DDNTool feeds data into Splunk, a commercial software package for data mining, which then charts researchers’ use of object storage targets, or OSTs, on Titan. An OST is the smallest unit that researchers can use to save data at the OLCF. To get the maximum performance during file transfers, users need to design their codes to write to as many of the 1,008 OSTs as possible. In many cases, users unwittingly write to only a fraction of the OSTs, limiting their performance.
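To make the "fraction of the OSTs" idea concrete, the following Python sketch computes how much of the OST pool each job touches. It is an illustration only, not how DDNTool or Splunk computes its charts: the `(job, ost_index)` event format, the job names, and the `ost_coverage` function are assumptions. The total of 1,008 OSTs is the figure given above.

```python
from collections import defaultdict

TOTAL_OSTS = 1008  # figure quoted in the article


def ost_coverage(write_events, total_osts=TOTAL_OSTS):
    """Return, per job, the fraction of all OSTs that the job wrote to.

    write_events: iterable of (job_name, ost_index) pairs; this record
    format is a hypothetical one chosen for the example.
    """
    osts_by_job = defaultdict(set)
    for job, ost in write_events:
        if not 0 <= ost < total_osts:
            raise ValueError(f"OST index {ost} out of range")
        osts_by_job[job].add(ost)
    return {job: len(osts) / total_osts for job, osts in osts_by_job.items()}


# Synthetic data: one job keeps reusing 4 OSTs, another spreads over all of them.
events = [("narrow_job", i % 4) for i in range(100)]
events += [("wide_job", i) for i in range(TOTAL_OSTS)]

coverage = ost_coverage(events)
for job, fraction in sorted(coverage.items()):
    print(f"{job}: {fraction:.1%} of OSTs used")
```

A job like the hypothetical `narrow_job` here, which touches well under 1 percent of the targets, is the kind of pattern such a chart would make visible.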
Miller’s updates to DDNTool are proving beneficial to other DOE laboratories as well. Staff at Los Alamos National Laboratory recently requested the updated DDNTool so they could evaluate it on their own systems. Miller hopes to work with DDN to make the tool open source in the coming year.
Oak Ridge National Laboratory is supported by the US Department of Energy’s Office of Science. The single largest supporter of basic research in the physical sciences in the United States, the Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.