Employing Data Transfer Nodes

Date: January 14th, 2013


The OLCF provides nodes dedicated to data transfer that are available via dtn.ccs.ornl.gov. These nodes have been tuned specifically for wide-area data transfers, and also perform well on local-area transfers. The OLCF recommends that users employ these nodes for data transfers, since in most cases transfer speed improves and load on computational systems’ login and service nodes is reduced.
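
For example, a wide-area transfer can be routed through the DTNs simply by naming dtn.ccs.ornl.gov as the endpoint. In the sketch below, userid, remote.site.edu, and the file and directory paths are placeholders:

  # From a remote site, pull a file from an OLCF work area through the DTNs
  $ scp userid@dtn.ccs.ornl.gov:/path/to/work/area/results.tar .

  # From an interactive DTN session, push results to a remote site with rsync
  $ rsync -avP results.tar userid@remote.site.edu:/data/incoming/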

Filesystems Accessible from DTNs

All OLCF filesystems — the NFS-backed User Home and Project Home areas, the Lustre®-backed User Work and Project Work areas, and the HPSS-backed User Archive and Project Archive areas — are accessible to users via the DTNs. For more information on available filesystems at the OLCF see the Data Management Overview page.
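
As a quick sketch of how this works in practice (the project ID and filenames below are placeholders), a single DTN session can reach all three classes of storage:

  # On dtn.ccs.ornl.gov, the NFS, Lustre, and HPSS areas are all reachable
  $ cp $HOME/input.nml $MEMBERWORK/proj123/   # User Home (NFS) to User Work (Lustre)
  $ cd $MEMBERWORK/proj123
  $ hsi put results.tar                       # User Work (Lustre) to User Archive (HPSS)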

Interactive DTN Access

Members of allocated projects are automatically given access to the data transfer nodes. The interactive nodes are accessible for direct login through the dtn.ccs.ornl.gov alias.
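
For example, an interactive session can be opened with ssh (replace userid with your OLCF user name):

  $ ssh userid@dtn.ccs.ornl.gov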

Batch DTN Access

Batch data transfer nodes can be accessed through the Torque/Moab queuing system on the dtn.ccs.ornl.gov interactive node. The DTN batch nodes are also accessible from the Titan, Eos, and Rhea batch systems through remote job submission.
This is accomplished with the command qsub -q host script.pbs, which submits the file script.pbs to the batch queue on the specified host. This command can be added at the end of an existing batch script to automatically trigger work on another OLCF resource.
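
For instance, a transfer script could be queued on the DTNs from Titan, and work could be handed back to Titan from a DTN (transfer.pbs and compute.pbs below are placeholder script names):

  # From Titan, submit a transfer script to the DTN batch queue
  $ qsub -q dtn transfer.pbs

  # From a DTN, submit a compute script to Titan's batch queue
  $ qsub -q titan compute.pbs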

Note: DTNs can help you manage your allocation hours efficiently by preventing billable compute resources from sitting idle.

The following scripts show how this technique could be employed. Note that only the first script, retrieve.pbs, needs to be submitted manually; the others are submitted automatically from within the preceding batch scripts.

Example Workflow Using DTNs in Batch Mode

The first batch script, retrieve.pbs, retrieves data needed by a compute job. Once the data has been retrieved, the script submits a different batch script, compute.pbs, to run computations on the retrieved data.

To run this script, submit it to the dtn queue from Titan or Rhea:

$ qsub -q dtn retrieve.pbs
$ cat retrieve.pbs

  #!/bin/bash
  # Batch script to retrieve data from HPSS via DTNs

  # PBS directives
  #PBS -A PROJ123
  #PBS -l walltime=8:00:00

  # Retrieve required data
  cd $MEMBERWORK/proj123 
  hsi get largedatafileA
  hsi get largedatafileB

  # Verification code could go here
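  # One possible sketch (placeholder filenames): abort before submitting the
  # compute job if either retrieval produced a missing or empty file
  if [ ! -s largedatafileA ] || [ ! -s largedatafileB ]; then
    echo "HPSS retrieval failed; not submitting compute.pbs" >&2
    exit 1
  fi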

  # Submit other batch script to execute calculations on retrieved data
  qsub -q titan compute.pbs

$

The second batch script is submitted from the first to carry out computational work on the data. When the computational work is finished, the batch script backup.pbs is submitted to archive the resulting data.

$ cat compute.pbs

  #!/bin/bash
  # Batch script to carry out computation on retrieved data

  # PBS directives
  #PBS -l walltime=24:00:00 
  #PBS -l nodes=10000
  #PBS -A PROJ123
  #PBS -l gres=atlas1 # or atlas2

  # Launch executable
  cd $MEMBERWORK/proj123 
  aprun -n 160000 ./a.out

  # Submit other batch script to transfer resulting data to HPSS
  qsub -q dtn backup.pbs

$

The final batch script is submitted from the second to archive the resulting data to HPSS, via the hsi utility, soon after it is created.

$ cat backup.pbs

  #!/bin/bash
  # Batch script to back up resulting data

  # PBS directives
  #PBS -A PROJ123
  #PBS -l walltime=8:00:00

  # Store resulting data 
  cd $MEMBERWORK/proj123 
  hsi put largedatafileC
  hsi put largedatafileD

$

Some items to note:

Scheduled DTN Queue

