Project Description
As supercomputers grow more expensive, procurement needs a more data-driven strategy: one that optimizes overall system performance without speculative resource provisioning that can cost millions of dollars. DDSD is a multiphase project that addresses this issue by basing the procurement of NCCS supercomputers on data collected from existing systems. It is composed of the following three phases.
The first phase collects resource-relevant data from executions of user application codes on the existing OLCF systems, including Summit. The target resources are the CPUs, GPUs, memory, interconnect, compute-node local storage, and the center-wide file system. For each resource, the target data might include the capacity and bandwidth utilized by the application and, when applicable, the observed latency. The aim is to collect these data for user applications from as many different domains as possible and to cover as many different parallelization configurations as possible. This phase also covers the design and implementation of a management system for the collected data.
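As an illustration of what a single collected data point from this phase might look like, the following is a minimal sketch in Python. The record layout, the field names, and the `ResourceSample` and `RESOURCES` names are assumptions made for this example; they are not the project's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

# Resource categories named in the phase description (labels are hypothetical).
RESOURCES = (
    "cpu",
    "gpu",
    "memory",
    "interconnect",
    "node_local_storage",
    "file_system",
)


@dataclass(frozen=True)
class ResourceSample:
    """One hypothetical measurement of one resource during one application run."""

    app_domain: str          # science domain of the application
    node_count: int          # parallelization configuration: nodes used
    resource: str            # one of RESOURCES
    capacity_used: float     # capacity utilized by the application
    bandwidth_used: float    # bandwidth utilized by the application
    latency: Optional[float] = None  # observed latency, when applicable

    def __post_init__(self) -> None:
        # Reject records that name a resource outside the target list.
        if self.resource not in RESOURCES:
            raise ValueError(f"unknown resource: {self.resource!r}")


sample = ResourceSample(
    app_domain="climate",
    node_count=128,
    resource="memory",
    capacity_used=0.6,
    bandwidth_used=0.4,
)
```

Keeping latency optional mirrors the "when applicable" qualifier above: not every resource has a meaningful latency measurement.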
The second phase analyzes the data collected in the first phase and evaluates the stress that the applications exert on the existing resources. The aim is to identify the resources that need to be improved to increase overall system efficiency. A resource can be improved in different ways, such as increasing its count, increasing its capacity or bandwidth, or changing its hierarchy.
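One simple way to picture this analysis is to express stress as the fraction of each resource's available capacity or bandwidth that applications actually use, and then rank the resources by that fraction. The sketch below assumes this definition of stress; the function names `stress_ratios` and `rank_by_stress` and the input values are hypothetical and purely illustrative, not measurements or methods from the project.

```python
from typing import Dict, List


def stress_ratios(used: Dict[str, float], available: Dict[str, float]) -> Dict[str, float]:
    """Return used/available for every resource present in `used`."""
    return {name: used[name] / available[name] for name in used}


def rank_by_stress(used: Dict[str, float], available: Dict[str, float]) -> List[str]:
    """Return resource names ordered from most to least stressed."""
    ratios = stress_ratios(used, available)
    return sorted(ratios, key=lambda name: ratios[name], reverse=True)


# Illustrative numbers only (arbitrary units).
used = {"memory": 80.0, "interconnect": 20.0, "gpu": 50.0}
available = {"memory": 100.0, "interconnect": 100.0, "gpu": 100.0}

ranking = rank_by_stress(used, available)
```

Under this assumed metric, the resources at the head of the ranking would be the first candidates for a larger count, more capacity or bandwidth, or a changed hierarchy.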
The third phase uses the collected data and the resulting analysis to predict the performance of a new system in terms of the desired metrics. This will allow NCCS to extensively evaluate many potential configurations of hardware and software from different vendors, and it will inform the design of future NCCS supercomputers and their associated computing and storage platforms.