Employing Data Transfer Nodes
Categories: Data Management, Data Transfer, File Systems
The OLCF provides nodes dedicated to data transfer that are available via
dtn.ccs.ornl.gov. These nodes have been tuned specifically for wide-area data transfers, and also perform well on local-area transfers. The OLCF recommends that users employ these nodes for data transfers, since in most cases transfer speed improves and load on computational systems’ login and service nodes is reduced.
Filesystems Accessible from DTNs
All OLCF filesystems — the NFS-backed User Home and Project Home areas, the Lustre®-backed User Work and Project Work areas, and the HPSS-backed User Archive and Project Archive areas — are accessible to users via the DTNs. For more information on available filesystems at the OLCF see the Data Management Overview page.
Interactive DTN Access
Members of allocated projects are automatically given access to the data transfer nodes. The interactive nodes are accessible for direct login through the dtn.ccs.ornl.gov alias.
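Wide-area transfers launched from an interactive DTN session can fail transiently. The following is a minimal sketch of a retry wrapper for scripted transfers; it is not an OLCF-provided tool, and the scp command at the bottom is a hypothetical example of how it might be invoked:

#!/bin/sh
# Sketch: run a transfer command, retrying up to MAX_TRIES times on failure.
MAX_TRIES=3

transfer_with_retries() {
  tries=0
  # Re-run the given command until it succeeds or the retry budget is spent
  until "$@"; do
    tries=$((tries + 1))
    if [ "$tries" -ge "$MAX_TRIES" ]; then
      echo "transfer failed after $MAX_TRIES attempts" >&2
      return 1
    fi
    echo "retrying ($tries/$MAX_TRIES)..." >&2
    sleep 1   # brief back-off before the next attempt
  done
}

# Hypothetical usage:
# transfer_with_retries scp largedatafileA user@dtn.ccs.ornl.gov:/path/to/dest/

A wrapper like this is most useful inside batch scripts, where a failed transfer would otherwise silently leave downstream steps without their input data.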
Batch DTN Access
Batch data transfer nodes can be accessed through the Torque/MOAB queuing system on the dtn.ccs.ornl.gov interactive node. The DTN batch nodes are also accessible from the Titan, Eos, and Rhea batch systems through remote job submission.
This is accomplished with the command qsub -q host script.pbs, which submits the file script.pbs to the batch queue on the specified host. This command can be inserted at the end of an existing batch script to automatically trigger work on another OLCF resource.
The following scripts show how this technique could be employed. Note that only the first script, retrieve.pbs, needs to be submitted manually; the others are triggered automatically from within the preceding batch scripts.
Example Workflow Using DTNs in Batch Mode
The first batch script, retrieve.pbs, retrieves data needed by a compute job. Once the data has been retrieved, the script submits a second batch script, compute.pbs, to run computations on the retrieved data.
To start the workflow, submit the first script from Titan or Rhea:

qsub -q dtn retrieve.pbs
$ cat retrieve.pbs
# Batch script to retrieve data from HPSS via DTNs

# PBS directives
#PBS -A PROJ123
#PBS -l walltime=8:00:00

# Retrieve required data
cd $MEMBERWORK/proj123
hsi get largedatafileA
hsi get largedatafileB

# Verification code could go here

# Submit other batch script to execute calculations on retrieved data
qsub -q titan compute.pbs
$
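The "Verification code could go here" step could be filled with a checksum comparison. A minimal sketch follows; the expected.md5 file (lines of "<hash>  <filename>" produced with md5sum before the data was archived) and the verify_retrieved helper are illustrative assumptions, not part of the OLCF workflow, and md5sum is assumed to be available on the DTNs:

#!/bin/sh
# Sketch: verify a retrieved file against a previously recorded checksum.
# expected.md5 is a hypothetical file of "<hash>  <filename>" lines.
verify_retrieved() {
  file=$1
  sumfile=$2
  # Select the line for this file and let md5sum re-check it;
  # exit status is nonzero if the checksum does not match.
  grep " $file\$" "$sumfile" | md5sum -c --status
}

# Hypothetical usage inside retrieve.pbs:
# verify_retrieved largedatafileA expected.md5 || exit 1

Failing the batch job on a checksum mismatch prevents compute.pbs from being submitted against corrupted or partially transferred input.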
The second batch script, compute.pbs, is submitted from the first to carry out computational work on the data. When the computational work is finished, it submits the batch script backup.pbs to archive the resulting data.
$ cat compute.pbs
# Batch script to carry out computation on retrieved data

# PBS directives
#PBS -l walltime=24:00:00
#PBS -l nodes=10000
#PBS -A PROJ123
#PBS -l gres=atlas1 # or atlas2

# Launch executable
cd $MEMBERWORK/proj123
aprun -n 160000 ./a.out

# Submit other batch script to transfer resulting data to HPSS
qsub -q dtn backup.pbs
$
The final batch script, backup.pbs, is submitted from the second to archive the resulting data to HPSS via the DTNs soon after it is created.
$ cat backup.pbs
# Batch script to back up resulting data

# PBS directives
#PBS -A PROJ123
#PBS -l walltime=8:00:00

# Store resulting data
cd $MEMBERWORK/proj123
hsi put largedatafileC
hsi put largedatafileD
$
Some items to note:
- Batch jobs submitted to the dtn partition will be executed on a DTN that is accessible exclusively via batch submission. These batch-accessible DTNs have configurations identical to the interactively accessible DTNs; the only difference between the two is how they are accessed.
- The DTNs are not currently a billable resource; i.e., the project specified in a batch job targeting the dtn partition will not be charged for time spent executing there.
Scheduled DTN Queue
- The walltime limit for jobs submitted to the dtn partition is 24 hours.
- Users may request a maximum of 4 nodes per batch job.
- There is a limit of two (2) eligible-to-run jobs per user.
- Jobs in excess of the per user limit above will be placed into a held state, but will change to eligible-to-run when appropriate.
- The queue allows each user a maximum of 6 running jobs.