Project Description

Spider PFS Metadata Snapshot Capture and Analysis: (i) LustreDU: Development of a scalable tool to capture daily snapshots of the metadata of the 1-billion-entry Spider PFS. The snapshots contain valuable information such as file paths, last modification and access times, and owner and group information. Snapshots have been captured for the past three years (each snapshot is ~119 GB).

(ii) Snapshot Analysis: When the daily metadata snapshots are analyzed in aggregate, they can provide deep insights into how the use of the PFS by the various science projects evolves over time. To this end, (a) created a Spark-based distributed analysis framework to analyze the ~127 TB of data, and (b) analyzed the data to study the temporal evolution of the file system, glean insights into project/user behavior and file characteristics, and understand the sharing between users/projects. The analysis was conducted across multiple dimensions: (1) How projects use the file system (directory depth, files per directory, wide-striping, burstiness), (2) File characteristics (popular file types, age), (3) Behavior analysis (e.g., how long after creation are files still accessed?), and (4) Sharing analysis of projects and users (e.g., does the sharing graph follow a power-law or small-world pattern, and what is the diameter of its connected components?). The analysis has already provided rich information for the design of future storage systems.
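
As an illustration of the per-snapshot statistics listed above (directory depth, files per directory, popular file types, and access after modification), the following is a minimal sketch only. It uses plain Python rather than Spark so that it stays self-contained, and it assumes a hypothetical record layout of (path, owner, mtime, atime); the real snapshot format and the framework's actual code are not described here. The names SAMPLE and summarize are invented for this example.

```python
from collections import Counter
import posixpath

# Hypothetical snapshot records: (path, owner, mtime, atime), times in epoch seconds.
# The real snapshot layout is an assumption; only the listed fields come from the text.
SAMPLE = [
    ("/proj/a/run1/out.h5", "alice", 100, 500),
    ("/proj/a/run1/log.txt", "alice", 100, 100),
    ("/proj/a/run2/out.h5", "bob", 200, 900),
    ("/proj/b/data.nc", "carol", 300, 300),
]


def summarize(records):
    """Aggregate simple usage statistics over one snapshot's records."""
    depth_hist = Counter()      # directory depth -> number of files at that depth
    files_per_dir = Counter()   # parent directory -> number of files it holds
    ext_counts = Counter()      # file extension -> number of files
    read_after_write = 0        # files accessed later than their last modification

    for path, _owner, mtime, atime in records:
        parent = posixpath.dirname(path)
        # Depth is the number of non-empty components in the parent directory.
        depth = len([part for part in parent.split("/") if part])
        depth_hist[depth] += 1
        files_per_dir[parent] += 1
        ext = posixpath.splitext(path)[1].lower() or "(none)"
        ext_counts[ext] += 1
        if atime > mtime:
            read_after_write += 1

    return {
        "depth_hist": dict(depth_hist),
        "files_per_dir": dict(files_per_dir),
        "ext_counts": dict(ext_counts),
        "read_after_write": read_after_write,
    }


if __name__ == "__main__":
    for key, value in summarize(SAMPLE).items():
        print(key, value)
```

In a distributed setting the same counters would typically be computed per partition and then merged, which is why each statistic here is a simple additive count.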

PFS Tools: Development of highly parallel tools for standard file system operations. These tools are used to operate on petabytes of data and 1 billion files, and have been deployed at other sites.
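
To make the idea of parallelizing a standard file system operation concrete, here is a small single-node sketch of a du-like size scan. It is not the tools described above, whose design is not given here; it only assumes that directory scans are independent and can be handed to a pool of workers, one tree level at a time. The names parallel_du and _scan_dir are hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def _scan_dir(path):
    """Return (bytes of regular files directly in path, list of its subdirectories)."""
    size, subdirs = 0, []
    try:
        with os.scandir(path) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    subdirs.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    size += entry.stat(follow_symlinks=False).st_size
    except OSError:
        # Unreadable or vanished directories are skipped in this sketch.
        pass
    return size, subdirs


def parallel_du(root, workers=8):
    """Sum regular-file sizes under root, scanning each tree level in parallel."""
    total = 0
    frontier = [root]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            # Scan every directory of the current level concurrently.
            results = list(pool.map(_scan_dir, frontier))
            frontier = []
            for size, subdirs in results:
                total += size
                frontier.extend(subdirs)
    return total


if __name__ == "__main__":
    print(parallel_du("."))
```

Threads suit this sketch because the work is dominated by metadata calls that wait on the file system; a tool operating at the scale described above would also need to spread the work across many nodes.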

I/O Signature Extraction: Automatic extraction of application I/O signatures from noisy Spider storage backend I/O logs, using a rich suite of statistical techniques. Unlike extant solutions, this work does not require application instrumentation to obtain the signatures.