Project Description

DataFed is a federated big-data storage, collaboration, and full-life-cycle management system for computational science and data analytics within distributed high-performance computing (HPC) and cloud-computing environments. DataFed provides a suite of features and capabilities that ease the data access and management burdens typically associated with complex, data-oriented research activities – allowing researchers to focus on their science instead of developing ad hoc data management solutions. When fully utilized, DataFed can improve the quality and repeatability of research output through advanced features including metadata and context capture, publish-subscribe data dissemination, and dependency-driven collaboration tools.

DataFed overlays and manages the data storage, communication, and security infrastructure present within a network of participating organizations and facilities, producing an abstract, homogeneous view of the underlying heterogeneous hardware. A centralized management service maintains awareness of, and control over, all data stored within this federation – resulting in an integrated, yet geographically distributed, data infrastructure with simple and consistent data access semantics. DataFed users can access and manage their data from anywhere within the federation, without ambiguity and without regard to physical storage location.

DataFed utilizes an object-store paradigm in which data is presented as a record (object) within a database, with an associated unique identifier, raw data, rich metadata, and user-defined relationships. This approach has advantages over the typical hierarchical file-system paradigm: it eliminates identity uncertainty (by separating identity from storage location), maintains synchronization between raw data and metadata, and enables data discovery through metadata indexing and search. It elevates plain files on an isolated file system into stable, unique data artifacts that can be safely referenced, shared, and discovered with a high degree of certainty.
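The record paradigm described above can be illustrated with a minimal sketch. Note that this is not DataFed's actual schema or API – the class names, field names, and identifier format (e.g. "d/1001") are hypothetical stand-ins chosen to show how separating identity from storage location enables stable references and metadata-driven discovery.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataRecord:
    """Illustrative data record: identity is separate from storage location."""
    record_id: str                     # stable, location-independent identifier
    raw_data_ref: str                  # pointer to the raw data, wherever it lives
    metadata: Dict[str, str] = field(default_factory=dict)
    relationships: List[str] = field(default_factory=list)  # IDs of related records


class RecordStore:
    """Minimal in-memory stand-in for a federated record database."""

    def __init__(self) -> None:
        self._records: Dict[str, DataRecord] = {}

    def put(self, rec: DataRecord) -> None:
        self._records[rec.record_id] = rec

    def get(self, record_id: str) -> DataRecord:
        # Lookup by identifier, never by physical path.
        return self._records[record_id]

    def search(self, **criteria: str) -> List[DataRecord]:
        # Metadata indexing enables discovery independent of storage paths.
        return [r for r in self._records.values()
                if all(r.metadata.get(k) == v for k, v in criteria.items())]
```

A record's raw data may later move between facilities; because consumers hold only the identifier, their references remain valid.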

DataFed is best suited for mid-tier data access patterns, such as managing important and/or relatively stable data over the duration of a research project. Applications that require low-latency or high-bandwidth data I/O should use local or high-performance parallel file systems; however, DataFed can (and should) be used to stage data into these systems before processing and to capture results once processing is complete. At the other end of the spectrum, DataFed can be used for long-term storage and dissemination of reference data; however, it does not currently provide all of the features required for long-term data curation or data publication – such as disaster recovery, Digital Object Identifiers, or interfaces with external publication indexing services.
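The stage-out step of this workflow can be sketched as follows. This is an assumption-laden illustration, not DataFed's actual transfer mechanism: the function name (stage_out), the local file copy standing in for a federated transfer, and the returned record dictionary are all hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def stage_out(scratch_file: Path, archive_root: Path, metadata: dict) -> dict:
    """Copy a finished result off a scratch file system into managed storage,
    recording a checksum and metadata so the artifact is stable and findable.

    Hypothetical sketch: a real system would transfer the file to a remote
    repository and register the record with a central management service.
    """
    archive_root.mkdir(parents=True, exist_ok=True)

    # Checksum the result so later consumers can verify integrity.
    digest = hashlib.sha256(scratch_file.read_bytes()).hexdigest()

    # Move the data off scratch (which is typically purged) into stable storage.
    dest = archive_root / scratch_file.name
    shutil.copy2(scratch_file, dest)

    # Return an illustrative record; the "d/<hash>" identifier is made up.
    return {"id": f"d/{digest[:8]}", "path": str(dest),
            "sha256": digest, "metadata": metadata}
```

In practice, a job script would run the simulation against the parallel file system for speed, then call a routine like this once per result file so nothing important remains only on purgeable scratch storage.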

DataFed development is ongoing, and the software is currently in an alpha-production state with baseline features. A recent CSCI DataFed whitepaper is available at . For more information, or for questions regarding access to source code or collaboration opportunities, please contact Dale Stansberry at .