Using Darshan to Profile I/O on Frontier
Overview
Optimizing I/O performance is an essential, yet often overlooked, aspect of High-Performance Computing (HPC). As AI datasets grow and GPU-accelerated training demands higher throughput, I/O can become a silent bottleneck. Understanding the interaction between application code and the Lustre parallel filesystem is critical for scaling applications effectively on Frontier.
This training provides a practical guide to profiling application I/O with Darshan. We will demonstrate how to analyze Darshan reports with command-line utilities, use Python to export and visualize I/O statistics, and identify common “bad” I/O behavior that limits performance. Although the darshan-runtime module is loaded by default on Frontier, this session assumes no prior experience and walks through the entire workflow, from raw log data to actionable performance insights.
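As a preview of the Python side of that workflow, the sketch below shows the kind of analysis covered in the session: computing average transfer sizes from per-rank POSIX counters. The counter values here are fabricated for illustration; in practice they would be exported from a real Darshan log (for example with the pydarshan package, whose exact API may vary by version), but the counter names such as POSIX_BYTES_READ are standard Darshan counters.

```python
# Hypothetical per-rank POSIX counters, as might be exported from a Darshan
# log (e.g., via pydarshan). Values below are fabricated for illustration.
counters = [
    {"rank": r,
     "POSIX_BYTES_READ": 409_600,      # bytes read by this rank
     "POSIX_READS": 100,               # number of read calls
     "POSIX_BYTES_WRITTEN": 1_048_576, # bytes written by this rank
     "POSIX_WRITES": 1}                # number of write calls
    for r in range(4)
]

# Average transfer size per call: a quick indicator of "bad" I/O behavior,
# since many tiny reads or writes perform poorly on Lustre.
avg_read = (sum(c["POSIX_BYTES_READ"] for c in counters)
            / sum(c["POSIX_READS"] for c in counters))
avg_write = (sum(c["POSIX_BYTES_WRITTEN"] for c in counters)
             / sum(c["POSIX_WRITES"] for c in counters))

print(f"avg read size:  {avg_read:.0f} bytes per call")
print(f"avg write size: {avg_write:.0f} bytes per call")
if avg_read < 1_048_576:
    print("warning: small reads detected; consider collective or buffered I/O")
```

A pattern like this, small average read sizes across all ranks, is exactly the kind of insight the session aims to surface from Darshan data.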
