Collaboration Results in Better Parallel File System Tools
Goal is to help speed up difficult jobs
Over the past few months, Titan users have had the difficult task of moving enormous amounts of data from the old file system, Widow, to the new and improved Atlas.
Fortunately, they have had help in the form of the powerful Distributed File Copy Tool (dcp). According to Oak Ridge Leadership Computing Facility’s (OLCF’s) Blake Caldwell, OLCF staff have used dcp to move 350 terabytes of data so far on behalf of some users, while other users have independently used dcp to copy much more.
OLCF helped develop dcp as part of a collaboration with Lawrence Livermore National Laboratory (LLNL), Los Alamos National Laboratory, and the company DataDirect Networks. The collaboration's aim is a suite of parallel file system tools designed for scalability and performance. Such tools (dcp included) speed up difficult jobs by distributing the workload across multiple processors.
LLNL was the primary developer of dcp, but OLCF has played an important role. Developer Dr. Feiyi Wang, for instance, contributed to several aspects of dcp, most notably testing the tool and improving its stability.
Though dcp is not the first parallel copy tool, it is unique. According to Wang, traditional multithreaded applications can’t scale beyond a single symmetric multiprocessing (SMP) node. To get around that limit, dcp uses MPI tasks instead of threads. Caldwell says, “MPI allows us to do the transfer over multiple nodes, which allows us to task more cores to each copy.”
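To make the idea of spreading one copy job across many tasks concrete, here is a minimal sketch in Python. It is only an illustration: the helper name `files_for_rank`, the file names, and the static round-robin split are all assumptions made for this example, and the article does not describe how dcp actually divides its work. The sketch simulates the tasks in a loop so that it runs without an MPI installation.

```python
from typing import List


def files_for_rank(files: List[str], rank: int, size: int) -> List[str]:
    """Return the files that task `rank` (out of `size` tasks) would copy.

    Hypothetical helper: a static round-robin split chosen for simplicity.
    It is not taken from dcp and may differ from how dcp assigns work.
    """
    if size < 1 or not 0 <= rank < size:
        raise ValueError("need size >= 1 and 0 <= rank < size")
    # Task 0 takes items 0, size, 2*size, ...; task 1 takes 1, size+1, ...
    return files[rank::size]


if __name__ == "__main__":
    # Made-up file names, for illustration only.
    files = [f"data_{i:03d}.dat" for i in range(10)]
    size = 4  # pretend there are four tasks, possibly on different nodes
    for rank in range(size):
        print(rank, files_for_rank(files, rank, size))
```

In a real MPI program each task would learn its own rank and the total task count from the MPI runtime (for example, `MPI.COMM_WORLD.Get_rank()` and `Get_size()` in mpi4py) rather than from a loop, and every task would copy only its own share. Because tasks can run on different nodes, the number of cores working on a single copy is not capped by one machine.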
However, Wang emphasizes that “dcp is just one of the tools in a suite of tools.” The collaboration that developed dcp has many other parallel file system tools in the works. OLCF is leading the design and implementation of two of them: dtar, which will use parallelism to efficiently collate many files into one, and dfind, which will use parallelism to find specific files in the masses of data on the system.
In fact, dcp itself is still evolving; Wang hopes to soon add a resume feature, allowing users to recover from a failed transfer, and provide progress information during the copying process. —Timothy Metcalf