Project Description

The project analyzes the reliability characteristics of Titan’s 299,008 CPU cores and 18,688 GPUs to understand trends in machine failures, mean time between failures (MTBF), single-bit errors, double-bit errors, off-the-bus errors, and the correlation between temperature and failure. The study was the first of its kind for a large-scale GPU deployment. Understanding the reliability characteristics of the system is critical to efficient system operations as well as to the acquisition of future systems. Below are some key outcomes of this effort:

1) Checkpoint advisory tool: Based on the insights gained from the detailed reliability analysis, we have developed a checkpoint advisory tool for applications. With an up-to-date MTBF for the production machine, the tool can advise users on the optimal frequency at which to write output or checkpoints, based on the portion of the system the job uses and the time needed to write the output. We are using it to advise applications on optimal checkpoint frequency, and it has the potential to save millions of core hours.

2) Our ongoing reliability analysis has resulted in tangible feedback to vendors on their products, which has helped to shape future roadmaps.

3) Real-time monitoring can alert operators and management to the changing state of the system, while post-mortem analysis can point to possible causes and historical trends.

4) Correlating machine failure logs with job logs allows us to glean insights into job productivity and into how jobs of different sizes are affected by node failures, which in turn allows the center to deploy remedial measures.
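
To illustrate the kind of calculation behind the checkpoint advice in outcome 1, the sketch below estimates a checkpoint interval from a system MTBF, the number of nodes a job uses, and the time to write one checkpoint. It is not the tool itself. It assumes Young's first-order approximation (interval = sqrt(2 * write time * MTBF)) and assumes that node failures are independent and uniformly distributed, so that a job's MTBF scales inversely with the fraction of the machine it occupies. The function names, the default node count (taken here as one node per GPU), and the 8-hour MTBF used in the example are all hypothetical.

```python
import math

# Assumption: one GPU per node, so the GPU count above is used as the node count.
TOTAL_NODES = 18_688


def job_mtbf_hours(system_mtbf_hours, job_nodes, total_nodes=TOTAL_NODES):
    """Scale the whole-system MTBF to the portion of the machine a job uses.

    Assumes independent, identically distributed node failures, so a job on
    a quarter of the nodes sees four times the system MTBF.
    """
    if not 0 < job_nodes <= total_nodes:
        raise ValueError("job_nodes must be between 1 and total_nodes")
    return system_mtbf_hours * total_nodes / job_nodes


def young_interval_hours(checkpoint_write_hours, mtbf_hours):
    """Young's first-order approximation of the optimal checkpoint interval."""
    if checkpoint_write_hours <= 0 or mtbf_hours <= 0:
        raise ValueError("write time and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_write_hours * mtbf_hours)


def advise_interval_hours(system_mtbf_hours, job_nodes, checkpoint_write_minutes,
                          total_nodes=TOTAL_NODES):
    """Suggested time between checkpoints, in hours, for one job."""
    mtbf = job_mtbf_hours(system_mtbf_hours, job_nodes, total_nodes)
    return young_interval_hours(checkpoint_write_minutes / 60.0, mtbf)


if __name__ == "__main__":
    # Hypothetical inputs: 8-hour system MTBF, quarter-machine job,
    # 6 minutes to write one checkpoint.
    hours = advise_interval_hours(8.0, 4_672, 6.0)
    print(f"Suggested checkpoint interval: {hours:.2f} hours")
```

Under these assumptions, smaller jobs and faster checkpoint writes both shift the advice: a job on fewer nodes sees a longer effective MTBF and can checkpoint less often, while a cheaper checkpoint makes more frequent writes worthwhile. The approximation is only reasonable when the checkpoint write time is small compared with the MTBF.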