Project Description

The Scientific Data Management System (SDMS) is a cross-facility data storage, indexing, collaboration, and provisioning system for data generated and used by research scientists within High Performance Computing (HPC) environments. The primary goal of the SDMS is to provide users with an easy-to-use tool for managed, high-performance data movement and storage within and across compute environments and security enclaves, both interactively and non-interactively (from within scheduled jobs or workflows).

There are currently many disjoint data management technologies available to scientific researchers at leadership-class computational facilities. There is a proliferation of both commercial and ad hoc systems that provide specific capabilities, such as data access and movement, indexing, dissemination, publication, and curation. Most commercial systems are oriented toward enterprise environments and do not address the complexities of HPC environments. The lack of integration, consistency, and suitability among these technologies places a significant burden on scientific researchers, who must adopt, learn, and configure each tool, and then follow complex or tedious manual procedures to operate a piecemeal data pipeline. The result can be wasted time, confusion, and even errors in properly identifying, locating, provisioning, and disseminating large scientific datasets.

The SDMS is envisioned as a suite of HPC-aware tools and services built on top of the high-performance data transfer technology of Globus Online (i.e., GridFTP). While Globus does provide services to manually access and coarsely share file system data, the SDMS will offer a higher-level “data-centric” view of research artifacts rather than the “file-centric” view provided by Globus (as well as iRODS). The SDMS, in essence, elevates simple “files” into traceable, managed “data artifacts” that can be uniformly identified and accessed by collaborators without regard to the physical storage location of the data. The SDMS will offer many additional features, such as smart data caching for workflows; fine-grained access control; non-hierarchical data organization and tagging; metadata extraction, indexing, and search; and configurable, automatic data dissemination.
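
To make the distinction between the “file-centric” and “data-centric” views concrete, the sketch below (Python, purely illustrative and not the SDMS interface; all names and fields are assumptions) shows one way a managed data artifact might be modeled: a uniform identifier, searchable metadata, tags, and access-control entries travel with the record, while the physical storage locations are resolved behind the scenes.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class DataArtifact:
        """Hypothetical record for a managed data artifact (illustration only)."""
        artifact_id: str                                         # uniform identifier used by collaborators
        title: str                                               # human-readable name
        owner: str                                               # owning researcher or project
        metadata: Dict[str, str] = field(default_factory=dict)   # extracted, indexable metadata
        tags: List[str] = field(default_factory=list)            # non-hierarchical organization
        acl: Dict[str, str] = field(default_factory=dict)        # fine-grained access control: user -> permission
        replicas: List[str] = field(default_factory=list)        # physical locations (e.g., endpoint paths)

    def resolve_location(artifact: DataArtifact) -> str:
        """Return a physical location for an artifact, so callers never need
        to know where the underlying files actually reside."""
        if not artifact.replicas:
            raise LookupError(f"no storage location registered for {artifact.artifact_id}")
        return artifact.replicas[0]

Under this kind of model, collaborators would refer to an artifact only by its identifier; location resolution, access checks, and metadata queries would be handled by the system rather than by manual file-path bookkeeping.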