GUIDO helps aggregate, correlate, and analyze performance of Titan and OLCF subsystems
The Oak Ridge Leadership Computing Facility (OLCF) provides numerous resources for facility users, including the Titan supercomputer and its storage, scheduling, and archiving subsystems. These systems constantly collect a wealth of logs that can shed light on their individual and collective performance.
Optimizing compute jobs based on these separate and combined streams of information, however, can be challenging. Fortunately for staff and users at the OLCF—a US Department of Energy (DOE) Office of Science User Facility located at DOE’s Oak Ridge National Laboratory—a new tool called the Grand Unified Information Directory for the OLCF (GUIDO) has been developed to enhance existing capabilities.
GUIDO is a scalable software package that provides a window into the central operations of a leadership computing facility.
“GUIDO helps us make sense of these logs,” said Sudharshan Vazhkudai, the OLCF Technology Integration (TechInt) Group leader and creator of GUIDO. “Many high-performance computing centers have a lot of standalone tools that look at performance logs in isolation. GUIDO is different in that it is an infrastructure that allows center staff and administration to look at the logs in concert. It helps us aggregate and correlate the logs so we can derive deeper insights into how our resources are performing and being used.”
Vazhkudai came up with the idea to develop GUIDO last year, and over the course of 2015, the TechInt group built the tool. Ross Miller, the primary developer, built the overall infrastructure as well as the file system and I/O pages on the dashboard; the I/O page presents information based on storage system logs and plug-in information from other subsystems. Devesh Tiwari developed the reliability page by building tools that analyze and correlate the reliability logs from Titan’s CPUs and GPUs. Chris Zimmer developed the allocation analytics page, which provides insights into job node allocation and partitioning by running analytics on the scheduler logs.
GUIDO is not just a dashboard—a user-friendly interface that organizes and presents information—it’s also a window into how the OLCF and its resources operate. Currently, GUIDO presents information about storage, scheduling, allocation, reliability, the file system, archival storage trends, and interconnect congestion. GUIDO provides near real-time information, updating the dashboard every few seconds.
At a quick glance, center staff can see the instantaneous bandwidth that a storage system offers, what the read and write bandwidths are, how many I/O operations it performs, the object storage target (OST) usage, degraded storage pools, file listing performance, and so on. The Titan reliability page shows information on all system errors, GPU errors (double bit, single bit, and bus-off errors), and CPU errors (kernel panic and machine check exception). The allocation analytics page shows the average job partition distance, which affects runtime variability. The file system and archival storage pages extract usage trends, such as the file size distribution across hundreds of millions of files, which can help determine optimal file system block sizes.
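As a rough illustration of how a file size distribution can be summarized, the Python sketch below groups file sizes into power-of-two buckets. It is not GUIDO’s code; the function name size_histogram and the bucketing scheme are assumptions made for this example.

```python
from collections import Counter


def size_histogram(file_sizes_bytes):
    """Count files per power-of-two bucket (hypothetical helper).

    Each key is the bucket's upper bound in bytes: a file of size s lands in
    the smallest power of two that is >= s. Sizes 0 and 1 share bucket 1.
    """
    hist = Counter()
    for size in file_sizes_bytes:
        if size < 0:
            raise ValueError("file sizes must be non-negative")
        bucket = 1 if size <= 1 else 1 << (size - 1).bit_length()
        hist[bucket] += 1
    # Return buckets in ascending order so the distribution is easy to read.
    return dict(sorted(hist.items()))
```

A distribution dominated by small buckets would suggest that a large block size wastes space, while one dominated by large buckets would point the other way.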
The OLCF can use this information to help users improve their simulation runs. For example, when Miller looked at a large job running on all of Titan’s 18,688 nodes, he noticed that the job wasn’t using all of the OSTs within the Spider file system. With GUIDO, he was able to advise the project team to use all of the OSTs, ultimately improving I/O throughput by 15%. Another example involved adjusting the checkpoint frequency for the same application run. Using GUIDO, Tiwari determined that the application was checkpointing much more often than warranted by the system’s mean time between failures, which is calculated from an analysis of the reliability logs. He used this information, along with the I/O throughput logs, to advise the application team on the optimal checkpointing interval, which gave the application back roughly 500,000 hours for useful computation per run and millions of hours over several runs.
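The article does not say which model was used to pick the checkpoint interval. One common first-order estimate is Young’s approximation, which sets the interval to the square root of twice the checkpoint write time multiplied by the mean time between failures. The Python sketch below applies it under that assumption; the function names and any input values are hypothetical, not OLCF data.

```python
import math


def mtbf_from_failure_times(failure_times_s):
    """Estimate MTBF as the mean gap between failure timestamps (seconds)."""
    if len(failure_times_s) < 2:
        raise ValueError("need at least two failure timestamps")
    ordered = sorted(failure_times_s)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


def young_checkpoint_interval(checkpoint_cost_s, mtbf_s):
    """Young's approximation: interval = sqrt(2 * checkpoint cost * MTBF)."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)
```

In this sketch the checkpoint cost would come from the I/O throughput logs (checkpoint size divided by achieved write bandwidth) and the failure timestamps from the reliability logs. An application that checkpoints far more often than the estimate is spending time on I/O that could go to computation.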
“The power of this tool is not just the ability to collect and present resource statistics, but the ability to go back, look at the information, do visual analytics, and then improve performance, including performance debugging or hotspot analysis,” Vazhkudai said.
From an administrative standpoint, GUIDO can alert staff regarding degraded storage pools or identify issues with Titan’s compute nodes, which allows OLCF staff to study the reliability of the machine.
Furthermore, rather than looking at one subsystem or one log in isolation, GUIDO allows staff to cross-correlate the aggregated information to derive deeper insights. For example, if a user’s job achieved unusually high bandwidth while doing I/O to the storage system, GUIDO could correlate the job and I/O logs and send an email alert indicating that a job is doing leadership I/O. This way, the OLCF can proactively follow up on interesting science that is being done with leadership performance.
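The kind of cross-correlation described above can be pictured as a join between scheduler records and storage bandwidth samples on time. The Python sketch below is a simplified illustration, not GUIDO’s implementation: the Job and IOSample record layouts, the threshold, and the function name are all assumptions, and it returns the matching jobs rather than sending email.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical scheduler log record; times are epoch seconds."""
    job_id: str
    user: str
    start: float
    end: float


@dataclass(frozen=True)
class IOSample:
    """Hypothetical storage log record: system-wide bandwidth at one instant."""
    timestamp: float
    bandwidth_gbs: float


def jobs_with_high_io(jobs, samples, threshold_gbs):
    """Return {job_id: peak bandwidth} for jobs running during high-I/O samples."""
    peaks = {}
    for sample in samples:
        if sample.bandwidth_gbs < threshold_gbs:
            continue  # only samples at or above the threshold are of interest
        for job in jobs:
            if job.start <= sample.timestamp <= job.end:
                previous = peaks.get(job.job_id, 0.0)
                peaks[job.job_id] = max(previous, sample.bandwidth_gbs)
    return peaks
```

Because the samples here are system-wide, several overlapping jobs can match the same burst. Attributing bandwidth to a single job would need per-job or per-node I/O records, which this sketch does not assume.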
Moving forward, the group would like to take GUIDO to the next level by presenting Titan and its subsystems visually in a 3-D format and playing the logs together.
“Imagine if you could see Titan’s 18,688 nodes or the Spider storage system in 3-D,” Vazhkudai said. “You could really see the impact of failures or the storage system performance when the jobs are running.” – Miki Nolin
Oak Ridge National Laboratory is supported by the US Department of Energy’s Office of Science. The single largest supporter of basic research in the physical sciences in the United States, the Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.