OLCF users have many options for data storage. Each user has multiple user-affiliated storage spaces, and each project has multiple project-affiliated storage spaces where data can be shared for collaboration. The storage areas are mounted across all OLCF systems for maximum data availability.

A Storage Area for Every Activity
The storage area to use in any given situation depends upon the activity you wish to carry out.

Each user has a User Home area on a Network File System (NFS) and a User Archive area on the archival High Performance Storage System (HPSS). These user storage areas are intended to house user-specific files.

Each project has a Project Home area on NFS, multiple Project Work areas on Lustre, and a Project Archive area on HPSS. These project storage areas are intended to house project-centric files.

Simple Guidelines
The following sections describe all available storage areas and detail intended use for each.
A brief description of each area and basic guidelines to follow are provided in the table below:

If you need to store… then use… at path…
Long-term data for routine access that is unrelated to a project User Home $HOME
Long-term data for archival access that is unrelated to a project User Archive /home/$USER
Long-term project data for routine access that’s shared with other project members Project Home /ccs/proj/[projid]
Short-term project data for fast, batch-job access that you don’t want to share Member Work $MEMBERWORK/[projid]
Short-term project data for fast, batch-job access that’s shared with other project members Project Work $PROJWORK/[projid]
Short-term project data for fast, batch-job access that’s shared with those outside your project World Work $WORLDWORK/[projid]
Long-term project data for archival access that’s shared with other project members Project Archive /proj/[projid]

User-Centric Data Storage

Users are provided with several storage areas, each of which serves a different purpose. These areas are intended for storage of user data, not for storage of project data.

The following table summarizes user-centric storage areas available on OLCF resources and lists relevant policies.

User-Centric Storage Areas
Area Path Type Permissions Quota Backups Purged Retention
User Home $HOME NFS User-controlled 50 GB Yes No 90 days
User Archive /home/$USER HPSS User-controlled 2 TB [1] No No 90 days
[1] In addition, there is a quota/limit of 2,000 files on this directory.

User Home Directories (NFS)

Each user is provided a home directory to store frequently used items such as source code, binaries, and scripts.

User Home Path

Home directories are located in a Network File System (NFS) that is accessible from all OLCF resources as /ccs/home/$USER.

The environment variable $HOME will always point to your current home directory. It is recommended, where possible, that you use this variable to reference your home directory. In cases in which using $HOME is not feasible, it is recommended that you use /ccs/home/$USER.

Users should note that since this is an NFS-mounted filesystem, its performance will not be as high as other filesystems.

User Home Quotas

Quotas are enforced on user home directories. To request an increased quota, contact the OLCF User Assistance Center. To view your current quota and usage, use the quota command:
$ quota -Qs
Disk quotas for user usrid (uid 12345):
     Filesystem  blocks   quota   limit   grace   files   quota   limit   grace
nccsfiler1a.ccs.ornl.gov:/vol/home
                  4858M   5000M   5000M           29379   4295m   4295m

User Home Backups

If you accidentally delete files from your home directory, you may be able to retrieve them. Online backups are performed at regular intervals. Hourly backups for the past 24 hours, daily backups for the last 7 days, and 1 weekly backup are available. It is possible that the deleted files are available in one of those backups. The backup directories are named hourly.*, daily.* , and weekly.* where * is the date/time stamp of the backup. For example, hourly.2016-12-01-0905 is an hourly backup made on December 1, 2016 at 9:05 AM.

The backups are accessed via the .snapshot subdirectory. Note that if you do an ls (even with the -a option) of any directory you won’t see a .snapshot subdirectory, but you’ll be able to do “ls .snapshot” nonetheless. This will show you the hourly/daily/weekly backups available. The .snapshot feature is available in any subdirectory of your home directory and will show the online backup of that subdirectory. In other words, you don’t have to start at /ccs/home/$USER and navigate the full directory structure; if you’re in a /ccs/home subdirectory several “levels” deep, an “ls .snapshot” will access the available backups of that subdirectory.

To retrieve a backup, simply copy it into your desired destination with the cp command.
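For example, to restore a single file from one of the hourly backups (the file name and timestamp below are placeholders), you might run:

  $ cp .snapshot/hourly.2016-12-01-0905/myfile.c myfile.c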

User Home Permissions

The default permissions for user home directories are 0750 (full access to the user, read and execute for the group). Users have the ability to change permissions on their home directories, although it is recommended that permissions be kept as restrictive as possible (without interfering with your work).
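For example, to remove group access from your home directory entirely (adjust to whatever is least permissive for your workflow):

  $ chmod 700 /ccs/home/$USER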

User Website Directory

Users interested in sharing files publicly via the World Wide Web can request a user website directory be created for their account. User website directories (~/www) have a 5GB storage quota and allow access to files at http://users.nccs.gov/~user (where user is your userid). If you are interested in having a user website directory created, please contact the User Assistance Center at help@olcf.ornl.gov.

User Archive Directories (HPSS)

The High Performance Storage System (HPSS) at the OLCF provides longer-term storage for the large amounts of data created on the OLCF compute systems. The mass storage facility consists of tape and disk storage components, servers, and the HPSS software. After data is uploaded, it persists on disk for some period of time. The length of its life on disk is determined by how full the disk caches become. When data is migrated to tape, it is done so in a first-in, first-out fashion.

User archive areas on HPSS are intended for storage of data not immediately needed in either User Home directories (NFS) or User Work directories (Lustre®). User Archive areas also serve as a location for users to store backup copies of user files. User Archive directories should not be used to store project-related data. Rather, Project Archive directories should be used for project data.

User archive directories are located at /home/$USER.

User Archive Access

Each OLCF user receives an HPSS account automatically. Users can transfer data to HPSS from any OLCF system using the HSI or HTAR utilities. For more information on using HSI or HTAR, see the HPSS Best Practices section.

User Archive Accounting

Each file and directory on HPSS is associated with an HPSS storage allocation. For information on HPSS storage allocations, please visit the HPSS Archive Accounting section.

For information on usage and best practices for HPSS, please see the section HPSS - High Performance Storage System below.


Project-Centric Data Storage

Projects are provided with several storage areas for the data they need. Project directories provide members of a project with a common place to store code, data files, documentation, and other files related to their project. While this information could be stored in one or more user directories, storing in a project directory provides a common location to gather all files.

The following table summarizes project-centric storage areas available on OLCF resources and lists relevant policies.

Project-Centric Storage Areas
Area Path Type Permissions Quota Backups Purged Retention
Project Home /ccs/proj/[projid] NFS 770 50 GB Yes No 90 days
Member Work $MEMBERWORK/[projid] Lustre® 700 [1] 10 TB No 14 days 14 days
Project Work $PROJWORK/[projid] Lustre® 770 100 TB No 90 days 90 days
World Work $WORLDWORK/[projid] Lustre® 775 10 TB No 90 days 90 days
Project Archive /proj/[projid] HPSS 770 100 TB [2] No No 90 days
Important! Files within “Work” directories (i.e., Member Work, Project Work, World Work) are not backed up and are purged on a regular basis according to the timeframes listed above.

[1] Permissions on Member Work directories can be controlled to an extent by project members. By default, only the project member has access, but access can be granted to other project members by setting group permissions accordingly on the Member Work directory. The parent directory of the Member Work directory prevents access by “UNIX-others” and cannot be changed (as a security measure).

[2] In addition, there is a quota/limit of 100,000 files on this directory.

Project Home Directories (NFS)

Projects are provided with a Project Home storage area in the NFS-mounted filesystem. This area is intended for storage of data, code, and other files that are of interest to all members of a project. Since Project Home is an NFS-mounted filesystem, its performance will not be as high as other filesystems.

Project Home Path

Project Home area is accessible at /ccs/proj/abc123 (where abc123 is your project ID).

Project Home Quotas

To check your project’s current usage, run df -h /ccs/proj/abc123 (where abc123 is your project ID). Quotas are enforced on project home directories. The current limit is shown in the table above.

Project Home Backups

If you accidentally delete files from your project home directory, you may be able to retrieve them. Online backups are performed at regular intervals. Hourly backups for the past 24 hours, daily backups for the last 7 days, and 1 weekly backup are available. It is possible that the deleted files are available in one of those backups. The backup directories are named hourly.*, daily.* , and weekly.* where * is the date/time stamp of the backup. For example, hourly.2016-12-01-0905 is an hourly backup made on December 1, 2016 at 9:05 AM.

The backups are accessed via the .snapshot subdirectory. Note that if you do an ls (even with the -a option) of any directory you won’t see a .snapshot subdirectory, but you’ll be able to do “ls .snapshot” nonetheless. This will show you the hourly/daily/weekly backups available. The .snapshot feature is available in any subdirectory of your project home directory and will show the online backup of that subdirectory. In other words, you don’t have to start at /ccs/proj/abc123 and navigate the full directory structure; if you’re in a /ccs/proj subdirectory several “levels” deep, an “ls .snapshot” will access the available backups of that subdirectory.

To retrieve a backup, simply copy it into your desired destination with the cp command.

Project Home Permissions

The default permissions for project home directories are 0770 (full access to the user and group). The directory is owned by root and the group includes the project’s group members. All members of a project should also be members of that group-specific project. For example, all members of project “ABC123” should be members of the “abc123” UNIX group.

Project-Centric Work Directories

To provide projects and project members with high-performance storage areas that are accessible to batch jobs, projects are given (3) distinct project-centric work (i.e., scratch) storage areas within Spider, the OLCF’s center-wide Lustre® filesystem.

Three Project Work Areas to Facilitate Collaboration

To facilitate collaboration among researchers, the OLCF provides (3) distinct types of project-centric work storage areas: Member Work directories, Project Work directories, and World Work directories. Each directory should be used for storing files generated by computationally-intensive HPC jobs related to a project.

The difference between the three lies in the accessibility of the data to project members and to researchers outside of the project. Member Work directories are accessible only by an individual project member by default. Project Work directories are accessible by all project members. World Work directories are readable by any user on the system.

Paths

Paths to the various project-centric work storage areas are simplified by the use of environment variables that point to the proper directory on a per-user basis:

  • Member Work Directory: $MEMBERWORK/[projid]
  • Project Work Directory: $PROJWORK/[projid]
  • World Work Directory: $WORLDWORK/[projid]

Environment variables provide operational staff (aka “us”) flexibility in the exact implementation of underlying directory paths, and provide researchers (aka “you”) with consistency over the long-term. For these reasons, we highly recommend the use of these environment variables for all scripted commands involving work directories.
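For example, a job script might reference these areas as follows (abc123 and input.dat are placeholders):

  cd $MEMBERWORK/abc123
  cp $PROJWORK/abc123/input.dat .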

Permissions

UNIX permissions on each project-centric work storage area differ according to the area’s intended collaborative use. Under this setup, the process of sharing data with other researchers amounts to simply ensuring that the data resides in the proper work directory.

  • Member Work Directory: 700
  • Project Work Directory: 770
  • World Work Directory: 775

For example, if you have data that must be restricted only to yourself, keep them in your Member Work directory for that project (and leave the default permissions unchanged). If you have data that you intend to share with researchers within your project, keep them in the project’s Project Work directory. If you have data that you intend to share with researchers outside of a project, keep them in the project’s World Work directory.
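Alternatively, to grant your project group read access to a subdirectory of your Member Work area (the directory name below is a placeholder), you could adjust the group permissions yourself:

  $ chmod -R g+rX $MEMBERWORK/abc123/shared_results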

Project-centric Work Directory Quotas

Soft quotas are enforced on project-centric work directories. The current limit is shown in the table above. To request an increased quota, contact the User Assistance Center.

Backups

Member Work, Project Work, and World Work directories are not backed up. Project members are responsible for backing up these files, either to Project Archive areas (HPSS) or to an off-site location.

Project Archive Directories

Projects are also allocated project-specific archival space on the High Performance Storage System (HPSS). The default quota is shown on the table above. If a higher quota is needed, contact the User Assistance Center.

The Project Archive space on HPSS is intended for storage of data not immediately needed in either Project Home (NFS) areas nor Project Work (Lustre®) areas, and to serve as a location to store backup copies of project-related files.

The project archive directories are located at /proj/pjt000 (where pjt000 is your Project ID).

Project Archive Access

Project Archive directories may only be accessed via utilities called HSI and HTAR. For more information on using HSI or HTAR, see the HPSS Best Practices section.

Project Archive Accounting

Each file and directory on HPSS is associated with an HPSS storage allocation. For information on HPSS storage allocations, please visit the HPSS Archive Accounting section.

For information on usage and best practices for HPSS, please see the section HPSS – High Performance Storage System below.


Spider – the Center-Wide Lustre® File System

The OLCF’s center-wide Lustre® file system, called Spider, is the operational work file system for most OLCF computational resources. An extremely high-performance system, Spider serves over 26,000 clients, provides 32 petabytes of disk space, and can move data at more than 1 TB/s.

Spider is Center-Wide

Spider is currently accessible by nearly all of the OLCF’s computational resources, including Titan and its 18,000+ compute nodes, as well as Eos, Rhea, and the Data Transfer Nodes.

Note: Because the file system is shared by most OLCF computational resources, times of heavy load may impact file system responsiveness.

Spider is for Temporary Storage

Spider provides a location to temporarily store large amounts of data needed and produced by batch jobs. Due to the size of the file system, the area is not backed up. In most cases, a regularly running purge removes data not recently accessed to help ensure available space for all users. Needed data should be copied to more permanent locations.

Warning: Spider provides temporary storage of data produced by or used by batch jobs. The space is not backed up. Users should copy needed data to more permanent locations.

Spider Comprises Multiple File Systems

Spider comprises (2) file systems:

File System Path
atlas1 /lustre/atlas1/
atlas2 /lustre/atlas2/

Why two filesystems?

There are a few reasons why having multiple file systems within Spider is advantageous.

  • More Metadata Servers – Currently, each Lustre filesystem can only utilize one Metadata Server (MDS). Interaction with the MDS is expensive; heavy MDS access will impact interactive performance. Providing (2) filesystems allows the load to be spread over (2) MDSs.
  • Higher Availability – The existence of multiple filesystems increases our ability to keep at least one filesystem available at all times.

Available Directories on Spider

Every project is provided three distinct project-centric work storage areas: one that is only accessible by an individual project member ($MEMBERWORK/[projid]), one that is accessible by all project members ($PROJWORK/[projid]), and one that is accessible by all users of the system ($WORLDWORK/[projid]).

For more details, see the section on Project-Centric Work Directories.

IMPORTANT! Files in lustre directories are not backed up and are purged on a regular basis according to the timeframes listed in the Data Management Policy.
Current Configuration of Spider
atlas1 atlas2
Total disk space 14 PB (usable) 14 PB (usable)
Number of OSTs 1008 1008
Default stripe count 4 4
Default stripe size 1 MB 1 MB

Lustre® Basics

Basic Components of a Lustre® System

  • Metadata Server (MDS) – The MDS makes metadata stored in the MDT available to Lustre clients. Each MDS manages the names and directories in the Lustre filesystem and provides network request handling for the MDT.
  • Metadata Target (MDT) – The MDT stores metadata.
  • Object Storage Server (OSS) – The OSS node provides file service and network request handling for one or more local OSTs.
  • Object Storage Target (OST) – The OST stores file data (chunks of files) as data objects on one or more OSSs. A single file may be striped across one or more OSTs. When a file is striped across multiple OSTs, chunks of the file will exist on more than one OST.
  • Lustre Clients – The nodes that mount the Lustre file system are Lustre clients. The service and compute nodes on Titan, for example, are Lustre clients.

[Figure: basic components of a Lustre cluster]

Basic Open/Write

The Metadata Server (MDS) stores information about each file including the number, layout, and location of the file’s stripe(s). A file’s data are stored on one or more Object Storage Targets (OSTs).

To access a file, the Lustre client must obtain a file’s information from the MDS. Once this information is obtained, the client will interact directly with the OSTs on which the file is striped.

Warning: Interaction with the Metadata Server (MDS) is expensive. Limiting tasks that require MDS access (e.g. directory operations, creating/opening/closing files, stat-ing files) will help improve file system interaction performance.

It is important to note that Spider 2, OLCF’s Lustre parallel file system, is a shared resource. As such, the I/O performance observed can vary depending on the particular user jobs running on the system. Information about OLCF I/O best practices can be found in the Spider Best Practices section below.

More detailed information on Lustre can be found on the Lustre wiki.

File Striping

A file may exist on one OST or multiple OSTs. If chunks of a file exist on multiple OSTs, the file is striped across the OSTs.

Advantages of Striping a File Across Multiple OSTs
  • File Size – By placing chunks of a file on multiple OSTs the space required by the file can also be spread over the OSTs. Therefore, a file’s size is not limited to the space available on a single OST.
  • Bandwidth – By placing chunks of a file on multiple OSTs the I/O bandwidth can also be spread over the OSTs. In this manner, a file’s I/O bandwidth is not limited to a single OST.
Disadvantages of Striping a File Across Multiple OSTs
  • Increased Overhead – By placing chunks of a file across multiple OSTs, the overhead needed to manage the file separation, network connections, and multiple OSTs increases.
  • Increased Risk – By placing chunks of a file across multiple OSTs, the odds that an event will take one of the file’s OSTs down or impact data transfer increases.

File/Directory Stripe Patterns

When a file or directory is created it will inherit the parent directory’s stripe settings. However, users have the ability to alter a directory’s stripe pattern and set a new file’s stripe pattern.

Users have the ability to alter the following stripe settings:

Setting Default Description
stripe count 4 Number of OSTs to stripe over
stripe size 1 MB File chunk size in bytes
stripe index -1 (allow system to select) OST on which first stripe will be placed
Warning: Stripe counts over 512 have a negative impact on system and file performance. Please do not create files with stripe counts over 512.
File Chunk Creation

When a file’s size is greater than the set stripe size, the file will be broken down into enough chunks of the specified stripe size to contain the file.

File Chunk Placement

When a file contains multiple chunks, and a stripe count greater than 1 is used, the file’s chunks will be placed on OSTs in a round robin fashion.

Basic Striping Examples

The following example shows various Lustre striping patterns over 3 OSTs for 3 different file sizes. In all three cases the default stripe size of 1MB and the default stripe index (i.e. -1) are used. Note that object 6 in File A is not pictured because the corresponding data has not been written, resulting in a sparse file — Lustre does not create unnecessary objects in the underlying file system.

  • File A is > 5 MB and < 7 MB, stripe count = 3
  • File B is < 1 MB
  • File C is > 1 MB and < 2 MB, stripe count = 1

[Figure: basic file striping examples]

Choosing a Stripe Pattern

A stripe pattern should be set based on a code’s I/O requirements. When choosing a stripe pattern consider the following:

Stripe Count

Over-striping can negatively impact performance by causing small chunks to be written to each OST. This may underutilize the OSTs and the network. In contrast, under-striping can place too much stress on individual OSTs, which may also cause resource and request contention. The following table provides general guidelines that can be used when choosing a stripe count for your files:

File size Recommended Stripe Count Notes
≤ 1 TB 4 This is the default stripe count
> 1 TB and ≤ 50 TB File size / 100 GB An 18 TB file would use 18 TB / 100 GB = 180 stripes
> 50 TB 512 Stripe counts > 512 can have a negative impact on performance; see the warning below.

When a large file uses few stripes, its individual chunks can occupy large portions of an OST, leaving insufficient storage space available for other files which can result in I/O errors. Following these guidelines will ensure that chunks on individual OSTs have a reasonable size.

Warning: Stripe counts over 512 have a negative impact on system and file performance. Please do not create files with stripe counts over 512. If you are working with files of 50 TB or more, please contact help@olcf.ornl.gov for more guidelines specific to your use case.
Stripe Size

The default stripe size is 1 MB; i.e., Lustre sends data in 1 MB chunks. It is not recommended to set a stripe size less than 1 MB or greater than 4 MB.
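For example, to create a directory whose new files default to a 4 MB stripe size (the directory name is a placeholder):

  $ lfs setstripe -s 4m dir1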

Stripe Index

The default stripe index allows the system to select the OST on which the first data chunk will be placed. The default stripe index should always be used allowing the system to choose the OSTs. Forcing the use of specific OSTs by setting the stripe index prevents the system from managing the OST load and can unnecessarily cause high OST load. It can also cause a new file to be striped across an OST that is down or write only; this would cause a file creation to unnecessarily fail.

Warning: The default stripe index, i.e. (-1), should always be used. Forcing the use of specific OSTs by setting the stripe index hinders the system’s OST management.

Viewing the Striping Information

The lfs getstripe command can be used to view the attributes of a file or directory. The following example shows that file1 has a stripe count of (6) and resides on OSTs 19, 59, 70, 54, 39, and 28:

  $ lfs getstripe -q dir/file1
    19      28675008      0x1b58bc0      0
    59      28592466      0x1b44952      0
    70      28656421      0x1b54325      0
    54      28652653      0x1b5346d      0
    39      28850966      0x1b83b16      0
    28      28854363      0x1b8485b      0

The following example shows that directory dir1 has a stripe count of (6), the stripe size is set to (0) (i.e., use the default), and the stripe index/offset is set to (-1) (i.e., use the default).

  $ lfs getstripe dir1
    ...
    dir1
    stripe_count: 6 stripe_size: 0 stripe_offset: -1
    ...

More details can be found in the lfs man page ($ man lfs).

Altering the Striping Pattern

A user can change the attributes for an existing directory or set the attributes when creating a new file in Lustre by using the lfs setstripe command. An existing file’s stripe may not be altered.

Warning: The default stripe index, i.e, (-1), should always be used. Forcing the use of specific OSTs by setting the stripe index hinders the system’s OST management.
Note: Files and directories inherit attributes from the parent directory. An existing file’s stripe may not be altered.
Creating New Files

The following will create a zero-length file named file1 with a stripe count of (16):

  $ lfs setstripe -c 16 file1

To alter the stripe of an existing file, you can create a new file with the needed attributes using setstripe and copy the existing file to the created file. To alter the stripe of a large number of files, you can create a new directory with the needed attributes and copy the existing files into the newly created directory. In this manner the files should inherit the directory’s attributes.
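For example, a minimal sketch of re-striping a single existing file by copying it into a newly created file (file names are placeholders):

  $ lfs setstripe -c 8 newfile
  $ cp oldfile newfile
  $ rm -f oldfile
  $ mv newfile oldfile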

Alter Existing Directories

The following example will change the stripe of directory dir1 to (2).

  $ lfs setstripe -c 2 dir1

More details can be found in the lfs man page ($ man lfs).

Viewing OST Storage

The lfs df command can be used to determine the amount of data stored on each Object Storage Target (OST).

The following example shows the size, used, and available space in human readable format for the /lustre/atlas2 filesystem:

  $ lfs df -h /lustre/atlas2
     UUID                     bytes        Used     Available  Use%      Mounted on
   atlas2-OST0000_UUID        14.0T        3.0T       10.3T    22%   /lustre/atlas2[OST:0]
   atlas2-OST0001_UUID        14.0T        3.0T       10.3T    22%   /lustre/atlas2[OST:1]
   atlas2-OST0002_UUID        14.0T        3.2T       10.1T    24%   /lustre/atlas2[OST:2] 
   atlas2-OST0003_UUID        14.0T        3.0T       10.3T    23%   /lustre/atlas2[OST:3]
   atlas2-OST0004_UUID        14.0T        3.2T       10.0T    24%   /lustre/atlas2[OST:4]
   ...
   atlas2-OST03ee_UUID        14.0T        2.0T       11.2T    15%   /lustre/atlas2[OST:1006]
   atlas2-OST03ef_UUID        14.0T        2.0T       11.2T    15%   /lustre/atlas2[OST:1007]

   filesystem summary:        13.8P        3.0P       10.1P    23%   /lustre/atlas2
Note: A no space left on device error will be returned during file I/O if one of the file’s associated OSTs becomes 100% utilized. An OST may become 100% utilized even if there is space available on the filesystem.

You can see a file or directory’s associated OSTs with lfs getstripe. lfs df can then be used to see the usage on each OST.
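For example, to see which OSTs hold a given file and then check how full those OSTs are (the file name is a placeholder):

  $ lfs getstripe -q bigfile.dat
  $ lfs df -h /lustre/atlas2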

More details can be found in the lfs man page (man lfs).

Associating a Batch Job with a File System

Through the PBS gres option, users can specify the scratch area used by their batch jobs so that the job will not start if that file system becomes degraded or unavailable.

Creating a Dependency on a Single File System

Line (5) in the following example will associate a batch job with the atlas2 file system. If atlas2 becomes unavailable prior to execution of the batch job, the job will be placed on hold until atlas2 returns to service.

  1: #!/bin/csh
  2: #PBS -A ABC123
  3: #PBS -l nodes=16000
  4: #PBS -l walltime=08:00:00
  5: #PBS -l gres=atlas2
  6:
  7: cd $MEMBERWORK/abc123
  8: aprun -n 256000 ./a.out

Creating a Dependency on Multiple File Systems

The following example will associate a batch job with the atlas1 and atlas2 file systems. If either atlas1 or atlas2 becomes unavailable prior to execution of the batch job, the job will be placed on hold until both atlas1 and atlas2 are in service.

  -l gres=atlas1%atlas2
Default is Dependency on All Spider File Systems

When the gres option is not used, batch jobs will be associated with all Spider filesystems by default as though the option -l gres=atlas1%atlas2 had been applied to the batch submission.

Note: To help prevent batch jobs from running during periods where a Spider file system is not available, batch jobs that do not explicitly specify the PBS gres option will be given a dependency on all Spider file systems.
Why Explicitly Associate a Batch Job with a File System?
  • Associating a batch job with a file system will prevent the job from running if the file system becomes degraded or unavailable.
  • If a batch job uses only (1) Spider file system, specifying that file system explicitly instead of taking the default of all (2) will prevent the job from being held if a file system not used by the job becomes degraded or unavailable.

Verify/View Batch Job File System Association

The checkjob utility can be used to view a batch job’s file system associations. For example:

  $ qsub -lgres=atlas2 batchscript.pbs
  851694.nid00004
  $ checkjob 851694 | grep "Dedicated Resources Per Task:"
  Dedicated Resources Per Task: atlas2: 1

Spider Best Practices

The following best practices can help achieve better I/O performance from your applications running on Spider.

Edit/Build Code in User Home and Project Home Areas Whenever Possible

Spider is built for large, contiguous I/O. Opening, closing, and stat-ing files are expensive tasks in Lustre. This, combined with periods of high load from compute jobs, can make basic editing and code builds noticeably slow. To work around this, users are encouraged to edit and build codes in their User Home (i.e., /ccs/home/$USER) and Project Home (i.e., /ccs/proj/) areas. While large scale I/O will be slower from the NFS areas, basic file system operations will be faster. The areas are also not accessible to compute nodes which limits the possible resource load.

When using the vi editor, the default behavior is to create an opened file’s temporary file in the current directory. If editing on a Spider file system, this can result in slowdown. You can specify that vi create temporary files in your User Home area by modifying your /ccs/home/$USER/.vimrc file:

  $ cat ~/.vimrc
  set swapsync=
  set backupdir=~
  set directory=~

Use ls -l Only Where Absolutely Necessary

Consider that ls -l must communicate with every OST that is assigned to any given listed file. When multiple files are listed, ls -l becomes a very expensive operation. It also causes excessive overhead for other users.

Open Files as Read-Only Whenever Possible

If a file to be opened is not subject to write(s), it should be opened as read-only. Furthermore, if the access time on the file does not need to be updated, the open flags should be O_RDONLY | O_NOATIME. If the file is opened by all processes in the group, the master process (rank 0) should open it O_RDONLY and all of the non-master processes (rank > 0) should open it O_RDONLY | O_NOATIME.
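A minimal sketch of that flag selection (a fragment in the style of the examples below; variable names are illustrative):

  #include <fcntl.h>
  #include <mpi.h>

  int iRank, iFD, iFlags;

  MPI_Comm_rank( MPI_COMM_WORLD, &iRank );

  // Master updates the access time; all other ranks skip the atime update
  iFlags = (iRank == 0) ? O_RDONLY : (O_RDONLY | O_NOATIME);
  iFD = open( PathName, iFlags, 0444 );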

Read Small, Shared Files from a Single Task

If a shared file is to be read and the data to be shared among the process group is less than approximately 100 MB, it is preferable to change the common code shown below (in C):

  int iFD;
  int iRead;
  char cBuf[SIZE];

  // Check file descriptor
  iFD=open( PathName, O_RDONLY | O_NOATIME, 0444 );

  // Check number of bytes read
  iRead=read( iFD, cBuf, SIZE );

…to the code shown here:

  int  iRank;
  int  iFD;
  int  iRead;
  char cBuf[SIZE];

  MPI_Comm_rank( MPI_COMM_WORLD, &iRank );
  if(!iRank) {

    // Check file descriptor
    iFD=open( PathName, O_RDONLY | O_NOATIME, 0444 );

    // Check number of bytes read
    iRead=read( iFD, cBuf, SIZE );

  }
  MPI_Bcast( cBuf, SIZE, MPI_CHAR, 0, MPI_COMM_WORLD );

Similarly, in Fortran, change the code shown below:

  INTEGER iRead
  CHARACTER cBuf(SIZE)

  OPEN(UNIT=1, FILE=PathName, ACTION='READ')
  READ(1,*) cBuf

…to the code shown here:

  INTEGER iRank
  INTEGER iRead
  INTEGER ierr
  CHARACTER cBuf(SIZE)

  CALL MPI_COMM_RANK(MPI_COMM_WORLD, iRank, ierr)
  IF (iRank .eq. 0) THEN
    OPEN(UNIT=1,FILE=PathName,ACTION='READ')
    READ(1,*) cBuf
  ENDIF
  CALL MPI_BCAST(cBuf, SIZE, MPI_CHARACTER, 0, MPI_COMM_WORLD, ierr)

Here, we gain several advantages: instead of making N (the number of tasks in MPI_COMM_WORLD) open and read requests, we make only (1). Also, the broadcast uses a fan-out method that reduces network traffic by allowing the interconnect routers of intermediate nodes to process less data.

However, if the shared data size exceeds 100 MB, you should contact the OLCF User Assistance Center for further optimizations.

Limit the Number of Files in a Single Directory

For large-scale applications that are going to write large numbers of files using private data, it is best to implement a subdirectory structure to limit the number of files in a single directory. A suggested approach is a (2)-level directory structure with sqrt(N) directories each containing sqrt(N) files, where N is the number of tasks.
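A hypothetical sketch of this layout (not OLCF-provided code; directory and file names are illustrative), in which each task writes its private file into one of ~sqrt(N) subdirectories:

  #include <math.h>
  #include <stdio.h>
  #include <sys/stat.h>
  #include <mpi.h>

  int main( int argc, char **argv )
  {
    int  iRank, iNTasks;
    char cPath[256];

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &iRank );
    MPI_Comm_size( MPI_COMM_WORLD, &iNTasks );

    int iNDirs = (int)ceil( sqrt( (double)iNTasks ) );   /* ~sqrt(N) subdirectories */

    if( iRank < iNDirs ) {                               /* one rank creates each subdirectory */
      snprintf( cPath, sizeof(cPath), "dir%04d", iRank );
      mkdir( cPath, 0700 );
    }
    MPI_Barrier( MPI_COMM_WORLD );                       /* all subdirectories exist past this point */

    /* each subdirectory ends up holding ~sqrt(N) files */
    snprintf( cPath, sizeof(cPath), "dir%04d/rank%06d.dat", iRank % iNDirs, iRank );
    FILE *fp = fopen( cPath, "w" );
    if( fp ) fclose( fp );

    MPI_Finalize();
    return 0;
  }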

Place Small Files on a Single OST

If only one process will read/write the file and the amount of data in the file is small, stat performance will be improved by limiting the file to a single OST on creation; every stat operation must communicate with every OST which contains file data. You can stripe a file across a single OST via:

  $ lfs setstripe PathName -s 1m -i -1 -c 1

Place Directories Containing Many Small Files on a Single OST

If you are going to create many small files in a single directory, stat (and therefore ls -l) will be more efficient if you set the directory’s striping count to (1) OST upon creation:

  $ lfs setstripe DirPathName -s 1m -i -1 -c 1

This is especially effective when extracting source code distributions from a tarball:

  $ lfs setstripe DirPathName -s 1m -i -1 -c 1
  $ cd DirPathName
  $ tar -x -f TarballPathName

All of the source files, header files, etc. span only (1) OST. When you build the code, all of the object files will use only (1) OST as well. The binary will also span only (1) OST; since that is not desirable, you can copy the binary with a new stripe count:

  $ lfs setstripe NewBin -s 1m -i -1 -c 4
  $ rm -f OldBinPath
  $ mv NewBin OldBinPath
stat Files from a Single Task

If many processes need the information from stat on a single file, it is most efficient to have a single process perform the stat call, then broadcast the results. This can be achieved by modifying the following code (shown in C):

  int iRank;
  int iRC;
  struct stat sB;

  iRC=lstat( PathName, &sB );

To the following:

  MPI_Comm_rank( MPI_COMM_WORLD, &iRank );
  if(!iRank)
  {
    iRC=lstat( PathName, &sB );
  }
  MPI_Bcast( &sB, sizeof(struct stat), MPI_CHAR, 0, MPI_COMM_WORLD );

Similarly, change the following Fortran code:

  INTEGER*4 sB(13)

  CALL LSTAT(PathName, sB, ierr)

To the following:

  INTEGER iRank
  INTEGER*4 sB(13)
  INTEGER ierr

  CALL MPI_COMM_RANK(MPI_COMM_WORLD, iRank, ierr)
  IF (iRank .eq. 0) THEN
    CALL LSTAT(PathName, sB, ierr)
  ENDIF
  CALL MPI_BCAST(sB, 13, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
Note: the Fortran lstat binding does not support files larger than 2 GB. Users must provide their own Fortran binding to the C lstat for files larger than 2 GB.

Consider Available I/O Middleware Libraries

For large scale applications that are going to share large amounts of data, one way to improve performance is to use a middleware library such as ADIOS, HDF5, or MPI-IO.

Use Large and Stripe-aligned I/O Whenever Possible

I/O requests should be large, i.e., a full stripe width or greater. In addition, you will get better performance by making I/O requests stripe aligned whenever possible. If the amount of data generated or required from the file on a client is small, a group of processes should be selected to perform the actual I/O request with those processes performing data aggregation.

HPSS – High Performance Storage System

The High Performance Storage System (HPSS) at the OLCF provides longer-term storage for the large amounts of data created on the OLCF compute systems. The mass storage facility consists of tape and disk storage components, servers, and the HPSS software. After data is uploaded, it persists on disk for some period of time. The length of its life on disk is determined by how full the disk caches become. When data is migrated to tape, it is done so in a first-in, first-out fashion.

HPSS Hardware

HPSS has SL8500 tape libraries, each holding up to 10,000 cartridges. The libraries house a total of (24) T10K-A tape drives (500 GB cartridges, uncompressed), (60) T10K-B tape drives (1 TB cartridges, uncompressed), (36) T10K-C tape drives, and (72) T10K-D tape drives. Each drive has a bandwidth of 250 MB/s.

HPSS Archive Accounting

Each file and directory archived in HPSS is associated with an HPSS storage account. This association is used to track the storage usage of each user and project.

Storage Account Types

There are (3) types of HPSS storage accounts:

Overhead Account – By default, files and directories created in User Archive areas on HPSS (i.e., /home/userid) will be associated with the user’s overhead account. This account is named the same as your user ID.
Project Accounts – By default, files and directories created in Project Archive areas on HPSS (i.e., /proj/projid) will be associated with a project account. The project account is named the same as the project’s project ID.
Legacy Account – Files stored on the system prior to March 15, 2008 are associated with a legacy project. The legacy project is used to record a file’s storage time period. Users are not able to associate new files with the legacy account; however, we do encourage users to associate files currently under the legacy account with the appropriate project account.

Determine Available Accounts

The showproj utility can be used from any NCCS system to determine your available accounts. For example:

  $ showproj -s hpss
  userid is a member of the following project(s) on hpss:
  xxxyyy
  userid
Viewing File/Directory’s Associated Account

To view a file or directory’s associated account, the following commands can be used:

  $ hsi
  :[/home/userid]: ls -UH

For example:

  :[/home/userid]: ls -UH
  Mode         Links  Owner    Group   COS    Acct     Where   Size         DateTime      Entry
  /home/userid:
  drwxr-x---   3      userid   1099           userid                  512   Apr 01 2008   Backups
  -rw-r-----   1      userid   1099    6001   legacy   TAPE       4280320   Oct 24 2006   file.tar
  -rw-r-----   1      userid   1099    6007   xxxyyy   DISK    1956472248   Mar 20 2008   a.inp

In the above example:

  • The associated account for directory Backups is userid. By default all files created within the Backups directory will be associated with the userid account.
  • The file file.tar was created prior to March 15, 2008 and is therefore associated with the legacy account.
  • The file a.inp is associated with the xxxyyy project.
Modifying File/Directory’s Associated Account

A new file and directory will inherit the parent directory’s account association. This is the easiest and recommended method for managing data stored against an account. That said, users are able to manually change a file or directory’s associated account through the hsi chacct command.

Note: When moving an existing file, it will not inherit the parent directory’s account association automatically. You must change it manually.

The syntax to manually change a directory’s associated account is chacct [-R] newacct mydir. Similarly, the syntax to manually change a file’s associated account is chacct newacct myfile.out.

For example:

  chacct xxxyyy a.out

will set a.out’s associated account to xxxyyy. Likewise,

  chacct -R xxxyyy dir1

will set the associated account of dir1 and all files and directories within dir1 to xxxyyy.

We strongly encourage users to associate HPSS files with projects as opposed to their individual user accounts; this helps us understand project needs and usage.

Note: HPSS account identifiers are case sensitive; you should use lower case when referring to the account name.
Viewing Storage

The showusage utility may be used to view the current storage associated with a user’s overhead project and with any other allocated projects of which the user is a member. The utility should be executed from an NCCS system; it may not be executed from within an HSI interactive session. For example:

  $ showusage -s hpss

  HPSS Storage in GB:                              
  Project                      Project Totals Storage    userid Storage              
  __________________________|__________________________|______________
  userid                    |       8.11               |       8.11
  legacy                    |                          |      25.67
Quotas

Space on the HPSS is for files that are not immediately needed. Users must not store files unrelated to their NCCS projects on HPSS. They must also periodically review their files and remove unneeded ones. Both Overhead and Project Accounts have a storage quota. For information on quotas, please see the OLCF Storage Policy Summary.

HPSS Best Practices

Currently HSI and HTAR are offered for archiving data into HPSS or retrieving data from the HPSS archive.

For optimal transfer performance we recommend sending files of 768 GB or larger to HPSS. The minimum file size that we recommend sending is 512 MB. HPSS will handle files between 0 KB and 512 MB, but write and read performance will be negatively affected. For files smaller than 512 MB we recommend bundling them with HTAR to achieve an archive file of at least 512 MB.

When retrieving data from a tar archive larger than 1 TB, we recommend that you pull only the files that you need rather than the full archive. Examples of this are given in the HTAR section below.

If you are using HSI to retrieve a single file larger than 1 TB, please make sure that the stripe pattern you choose is appropriate for the file’s size. See the “Choosing a Stripe Pattern” section of Lustre® Basics above to learn how and why choosing the right striping pattern is important.

We also recommend using our data transfer nodes (DTNs) for achieving the fastest possible transfer rates. This can be done by logging on to dtn.ccs.ornl.gov and initiating transfers interactively or by submitting a batch job from any OLCF resource to the DTNs as described in the HSI and HTAR Workflow section.

Using HSI

Issuing the command hsi will start HSI in interactive mode. Alternatively, you can use:

  hsi [options] command(s)

…to execute a set of HSI commands and then return.

To list your files on HPSS, you might use:

  hsi ls

hsi commands are similar to ftp commands. For example, hsi get and hsi put are used to retrieve and store individual files, and hsi mget and hsi mput can be used to retrieve multiple files.

To send a file to HPSS, you might use:

  hsi put a.out

To put a file in a pre-existing directory on HPSS:

  hsi “cd MyHpssDir; put a.out”

To retrieve one, you might use:

  hsi get /proj/projectid/a.out
Warning: If you are using HSI to retrieve a single file larger than 1 TB, please make sure that the stripe pattern you choose is appropriate for the file’s size. See the “Choosing a Stripe Pattern” section above to learn how and why.

Here is a list of commonly used hsi commands.

Command Function
cd Change current directory
get, mget Copy one or more HPSS-resident files to local files
cget Conditional get – get the file only if it doesn’t already exist
cp Copy a file within HPSS
rm, mdelete Remove one or more files from HPSS
ls List a directory
put, mput Copy one or more local files to HPSS
cput Conditional put – copy the file into HPSS unless it is already there
pwd Print current directory
mv Rename an HPSS file
mkdir Create an HPSS directory
rmdir Delete an HPSS directory

 

Additional HSI Documentation

There is interactive documentation on the hsi command available by running:

  hsi help

Additionally, documentation can be found at the Gleicher Enterprises website, including an HSI Reference Manual and man pages for HSI.

Using HTAR

The htar command provides an interface very similar to the traditional tar command found on UNIX systems. It is used as a command-line interface. The basic syntax of htar is:

  htar -{c|K|t|x|X} -f tarfile [directories] [files]

As with the standard Unix tar utility the -c, -x, and -t options, respectively, function to create, extract, and list tar archive files. The -K option verifies an existing tarfile in HPSS and the -X option can be used to re-create the index file for an existing archive.
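To verify an existing archive or rebuild its index, you might use commands like the following (the archive name is a placeholder):

  $ htar -Kvf allfiles.tar
  $ htar -Xvf allfiles.tar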

For example, to store all files in the directory dir1 to a file named allfiles.tar on HPSS, use the command:

  htar -cvf allfiles.tar dir1/*

To retrieve these files:

  htar -xvf allfiles.tar 

htar will overwrite files of the same name in the target directory.

When possible, extract only the files you need from large archives.

To display the names of the files in the project1.tar archive file within the HPSS home directory:

  htar -vtf project1.tar

To extract only one file, executable.out, from the project1 directory in the archive file called project1.tar:

  htar -xm -f project1.tar project1/executable.out

To extract all files from the project1/src directory in the archive file called project1.tar, and use the time of extraction as the modification time, use the following command:

  htar -xm -f project1.tar project1/src
HTAR Limitations

The htar utility has several limitations.

Appending Data

You cannot add or append files to an existing archive.

File Path Length

File path names within an htar archive of the form prefix/name are limited to 154 characters for the prefix and 99 characters for the file name. Link names cannot exceed 99 characters.

Size

There are limits to the size and number of files that can be placed in an HTAR archive.

Individual File Size Maximum 68GB, due to POSIX limit
Maximum Number of Files per Archive 1 million

 

For example, when attempting to HTAR a directory with one member file larger than 64 GB, the following error message will appear:


[titan-ext1]$ htar -cvf hpss_test.tar hpss_test/

INFO: File too large for htar to handle: hpss_test/75GB.dat (75161927680 bytes)
ERROR: 1 oversize member files found - please correct and retry
ERROR: [FATAL] error(s) generating filename list 
HTAR: HTAR FAILED
Additional HTAR Documentation


The HTAR user’s guide can be found at the Gleicher Enterprises website, including the HTAR man page.

HSI and HTAR Workflow

Transfers with HPSS should be launched from the external Titan login nodes, the interactive data transfer nodes (DTNs), or the batch-accessible DTNs.

If the file size is above 512 MB and HSI is initiated from the titan-ext or titan-batch nodes, the HSI-DTN will transfer files in a further optimized, striped manner.

Batch DTNs should be used for large, long-running transfers or for transfers that are part of a scripted workflow.

To submit a data archival job from any OLCF resource use the -q dtn option with qsub.

qsub -q dtn Batch-script.pbs

Your allocation will not be charged time for this job.

Below is an example batch script using HTAR.

Batch-script.pbs

#PBS -l walltime=0:30:00
#PBS -l nodes=1
#PBS -A PRJ123
#PBS -l gres=atlas1%atlas2

# Archive output data to HPSS
cd $MEMBERWORK/prj123
htar -cf /proj/prj123/viz_output.htar viz_output/
htar -cf /proj/prj123/compute_data.htar compute_data/

See the workflow documentation for more workflow examples.

Storage Locations

Users are provided with a User Archive directory on HPSS that is located at /home/userid (where userid is your User ID). Additionally, each project is given a Project Archive directory located at /proj/projectid (where projectid is the six-character project ID).

A Note on Bundling Data

HPSS is optimized for larger files, so if you have multiple files that are smaller than 2GB, you should combine them and store a single, larger file. In most cases, this will provide a faster transfer and it will allow HPSS to store the data more efficiently.

The HTAR command is very useful for bundling smaller files, and is often faster than using the conventional tar command and then transferring via HSI. HTAR has an individual file size limit of 64GB, due to the POSIX tar specification. However, HTAR can be used to store and retrieve directories that are in total larger than 64GB, provided that they do not contain any individual files larger than 64GB.

When retrieving a large number of files, if HSI knows there are many files needed, it can bundle the retrievals. This method allows HPSS to gather the needed files on a single tape and perform fewer mount/seek/rewind/unmount operations. For example:

The following will create a list of files and pass the list to HPSS to retrieve. Note that this method does not preserve directory structure and is better used when directory structure is not needed:

echo "get << EOFMARKER" > dir0.lst
hsi -q find testdir -type f >>& dir0.lst
echo "EOFMARKER" >> dir0.lst
hsi "out dir0.out ; in dir0.lst"
Classes of Service and Data Redundancy

The HPSS has several Classes of Service (COS) to ensure that files are efficiently stored based on their size. The COS is set automatically based on the size of the file that is being stored.

COS ID Name (based on file size) # Tapes
11 NCCS 0MB<16MB 3 copies
12 NCCS 16MB<8GB RAIT 2+1
13 NCCS 8GB<1TB RAIT 4+1
14 NCCS >1TB RAIT 4+1

 

For files less than 16 MB in size, three copies are written to tape. For files 16MB or greater in size, HPSS supports a Redundant Array of Independent Tapes (RAIT) so there is no need to use multiple copies to ensure file safety in the event of tape failure.

Neither multiple copies nor RAIT will protect your data if you accidentally delete it, so avoid commands such as hsi rm */*.

Using File Globbing Wildcard Characters with HSI

HSI has the option to turn file globbing on and off.
If you get this message:

  O:[/home/user]: ls -la file*
  *** hpss_Lstat: No such file or directory [-2: HPSS_ENOENT]
    /home/user/file*

…then you’ll need to turn on HSI file globbing with the glob command:

  O:[/home/user]: glob
  filename globbing turned on

  O:[/home/user]: ls -la file*
  -rw-r--r--   1 user  users     6164480 Jan 14 10:36 file.tar
  -rw-r--r--   1 user  users     6164480 Jan  6 11:08 file1.tar.gz
  -rw-r--r--   1 user  users     6164480 Jan  6 11:08 file2.tar
  -rw-r--r--   1 user  users     6164480 Jan  6 11:09 file3.tar
  -rw-r--r--   1 user  users     6164480 Jan  6 11:09 file4.tar
  -rw-r--r--   1 user  users     6164480 Jan  6 11:09 file5.tar

Employing Data Transfer Nodes

The OLCF provides nodes dedicated to data transfer that are available via dtn.ccs.ornl.gov. These nodes have been tuned specifically for wide-area data transfers, and also perform well on local-area transfers. The OLCF recommends that users employ these nodes for data transfers, since in most cases transfer speed improves and load on computational systems’ login and service nodes is reduced.

Filesystems Accessible from DTNs

All OLCF filesystems — the NFS-backed User Home and Project Home areas, the Lustre®-backed User Work and Project Work areas, and the HPSS-backed User Archive and Project Archive areas — are accessible to users via the DTNs. For more information on available filesystems at the OLCF see the Data Management Policy Summary section.

Batch DTN Access

Batch data transfer nodes can be accessed through the Torque/MOAB queuing system on the dtn.ccs.ornl.gov interactive node. The DTN batch nodes are also accessible from the Titan, Eos, and Rhea batch systems through remote job submission.
This is accomplished by the command qsub -q host script.pbs which will submit the file script.pbs to the batch queue on the specified host. This command can be inserted at the end of an existing batch script in order to automatically trigger work on another OLCF resource.

Note: DTNs can help you manage your allocation hours efficiently by preventing billable compute resources from sitting idle.

The following scripts show how this technique could be employed. Note that only the first script, retrieve.pbs, would need to be manually submitted; the others will trigger automatically from within the respective batch scripts.

Example Workflow Using DTNs in Batch Mode

The first batch script, retrieve.pbs, retrieves data needed by a compute job. Once the data has been retrieved, the script submits a different batch script, compute.pbs, to run computations on the retrieved data.

To run this script, start on Titan or Rhea:

qsub -q dtn retrieve.pbs
$ cat retrieve.pbs

  # Batch script to retrieve data from HPSS via DTNs

  # PBS directives
  #PBS -A PROJ123
  #PBS -l walltime=8:00:00

  # Retrieve required data
  cd $MEMBERWORK/proj123 
  hsi get largedatfileA
  hsi get largedatafileB

  # Verification code could go here

  # Submit other batch script to execute calculations on retrieved data
  qsub -q titan compute.pbs

$

The second batch script is submitted from the first to carry out computational work on the data. When the computational work is finished, the batch script backup.pbs is submitted to archive the resulting data.

$ cat compute.pbs

  # Batch script to carry out computation on retrieved data

  # PBS directives
  #PBS -l walltime=24:00:00 
  #PBS -l nodes=10000
  #PBS -A PROJ123
  #PBS -l gres=atlas1 # or atlas2

  # Launch executable
  cd $MEMBERWORK/proj123 
  aprun -n 160000 ./a.out

  # Submit other batch script to transfer resulting data to HPSS
  qsub -q dtn backup.pbs

$

The final batch script is submitted from the second to archive the resulting data soon after creation to HPSS via the hsi utility.

$ cat backup.pbs

  # Batch script to back-up resulting data

  # PBS directives
  #PBS -A PROJ123
  #PBS -l walltime=8:00:00

  # Store resulting data 
  cd $MEMBERWORK/proj123 
  hsi put largedatfileC
  hsi put largedatafileD

$

Some items to note:

  • Batch jobs submitted to the dtn partition will be executed on a DTN that is accessible exclusively via batch submissions. These batch-accessible DTNs have identical configurations to those DTNs that are accessible interactively; the only difference between the two is accessibility.
  • The DTNs are not currently a billable resource, i.e., the project specified in a batch job targeting the dtn partition will not be charged for time spent executing in the dtn partition.

Scheduled DTN Queue

  • The walltime limit for jobs submitted to the dtn partition is 24 hours.
  • Users may request a maximum of 4 nodes per batch job.
  • There is a limit of (2) eligible-to-run jobs per user.
  • Jobs in excess of the per user limit above will be placed into a held state, but will change to eligible-to-run when appropriate.
  • The queue allows each user a maximum of 6 running jobs.

Enabling Workflows through Cross-System Batch Submission

The OLCF supports submitting jobs between OLCF systems via batch scripts. This can be useful for automatically triggering analysis and storage of large data sets after a successful simulation job has ended, or for launching a simulation job automatically once the input deck has been retrieved from HPSS and pre-processed.

Cross-Submission allows jobs on one OLCF resource to submit new jobs to other OLCF resources.


The key to remote job submission is the command qsub -q host script.pbs which will submit the file script.pbs to the batch queue on the specified host. This command can be inserted at the end of an existing batch script in order to automatically trigger work on another OLCF resource. This feature is supported on the following hosts:

Host Remote Submission Command
Rhea qsub -q rhea visualization.pbs
Eos qsub -q eos visualization.pbs
Titan qsub -q titan compute.pbs
Data Transfer Nodes (DTNs) qsub -q dtn retrieve_data.pbs

Example Workflow 1: Automatic Post-Processing

The simplest example of a remote submission workflow is a linear chain in which each job automatically triggers the next on a different OLCF resource. In the example below, a job on the DTNs stages data from HPSS and submits a compute job to Titan, which in turn submits an archival job back to the DTNs. This workflow requires three batch scripts, only the first of which is submitted manually. Visually, this workflow may look something like the following:

Post-processing Workflow

The batch scripts for such a workflow could be implemented as follows:

Batch-script-1.pbs

#PBS -l walltime=0:30:00
#PBS -l nodes=1
#PBS -A PRJ123
#PBS -l gres=atlas1%atlas2

# Retrieve data from HPSS
cd $MEMBERWORK/prj123
htar -xf /proj/prj123/compute_data.htar compute_data/

# Submit compute job to Titan
qsub -q titan Batch-script-2.pbs

Batch-script-2.pbs

#PBS -l walltime=2:00:00
#PBS -l nodes=4096
#PBS -A PRJ123
#PBS -l gres=atlas1%atlas2

# Launch executable
cd $MEMBERWORK/prj123
aprun -n 65536 ./analysis-task.exe

# Submit data archival job to DTNs
qsub -q dtn Batch-script-3.pbs

Batch-script-3.pbs

#PBS -l walltime=0:30:00
#PBS -l nodes=1
#PBS -A PRJ123
#PBS -l gres=atlas1%atlas2

# Archive output data to HPSS
cd $MEMBERWORK/prj123
htar -cf /proj/prj123/viz_output.htar viz_output/
htar -cf /proj/prj123/compute_data.htar compute_data/

The key to this workflow is the qsub -q <host> command at the end of each script (for example, qsub -q titan Batch-script-2.pbs), which tells qsub to submit the next script to the batch queue on the specified resource.

Initializing the Workflow

We can initialize this workflow in one of two ways:

  • Log into dtn.ccs.ornl.gov and run qsub Batch-script-1.pbs OR
  • From Titan or Rhea, run qsub -q dtn Batch-script-1.pbs

Example Workflow 2: Data Staging, Compute, and Archival

Now we give another example of a linear workflow. This example shows how to use the Data Transfer Nodes (DTNs) to retrieve data from HPSS and stage it to your project's scratch area before the computation begins. Once the computation is done, the output is automatically archived.

Post-processing Workflow

Batch-script-1.pbs

#PBS -l walltime=0:30:00
#PBS -l nodes=1
#PBS -A PRJ123
#PBS -l gres=atlas1%atlas2

# Retrieve Data from HPSS
cd $MEMBERWORK/prj123
htar -xf /proj/prj123/input_data.htar input_data/

# Launch compute job
qsub -q titan Batch-script-2.pbs

Batch-script-2.pbs

#PBS -l walltime=6:00:00
#PBS -l nodes=4096
#PBS -A PRJ123
#PBS -l gres=atlas1%atlas2

# Launch executable
cd $MEMBERWORK/prj123
aprun -n 65536 ./analysis-task.exe

# Submit data archival job to DTNs
qsub -q dtn Batch-script-3.pbs

Batch-script-3.pbs

#PBS -l walltime=0:30:00
#PBS -l nodes=1
#PBS -A PRJ123
#PBS -l gres=atlas1%atlas2

# Archive output data to HPSS
cd $MEMBERWORK/prj123
htar -cf /proj/prj123/viz_output.htar viz_output/
htar -cf /proj/prj123/compute_data.htar compute_data/

Initializing the Workflow

We can initialize this workflow in one of two ways:

  • Log into dtn.ccs.ornl.gov and run qsub Batch-script-1.pbs OR
  • From Titan or Rhea, run qsub -q dtn Batch-script-1.pbs

Example Workflow 3: Data Staging, Compute, Visualization, and Archival

This is an example of a “branching” workflow. We first use Rhea to prepare a mesh for our simulation on Titan. We then launch the compute task on Titan; once it has completed, the workflow branches into two separate paths: one to archive the simulation output data, and one to visualize it. After the visualizations have finished, we transfer them to a remote institution.

Post-processing Workflow

Step-1.prepare-data.pbs

#PBS -l walltime=0:30:00
#PBS -l nodes=10
#PBS -A PRJ123
#PBS -l gres=atlas1%atlas2

# Prepare Mesh for Simulation
mpirun -n 160 ./prepare-mesh.exe

# Launch compute job
qsub -q titan Step-2.compute.pbs

Step-2.compute.pbs

#PBS -l walltime=6:00:00
#PBS -l nodes=4096
#PBS -A PRJ123
#PBS -l gres=atlas1%atlas2

# Launch executable
cd $MEMBERWORK/prj123
aprun -n 65536 ./analysis-task.exe

# Workflow branches at this stage, launching 2 separate jobs

# - Launch Archival task on DTNs
qsub -q dtn@dtn-batch Step-3.archive-compute-data.pbs

# - Launch Visualization task on Rhea
qsub -q rhea Step-4.visualize-compute-data.pbs

Step-3.archive-compute-data.pbs

#PBS -l walltime=0:30:00
#PBS -l nodes=1
#PBS -A PRJ123
#PBS -l gres=atlas1%atlas2

# Archive compute data in HPSS
cd $MEMBERWORK/prj123
htar -cf /proj/prj123/compute_data.htar compute_data/

Step-4.visualize-compute-data.pbs

#PBS -l walltime=2:00:00
#PBS -l nodes=64
#PBS -A PRJ123
#PBS -l gres=atlas1%atlas2

# Visualize Compute data
cd $MEMBERWORK/prj123
mpirun -n 768 ./visualization-task.py

# Launch transfer task
qsub -q dtn Step-5.transfer-visualizations-to-campus.pbs

Step-5.transfer-visualizations-to-campus.pbs

#PBS -l walltime=2:00:00
#PBS -l nodes=1
#PBS -A PRJ123
#PBS -l gres=atlas1%atlas2

# Transfer visualizations to storage area at home institution
cd $MEMBERWORK/prj123
SOURCE=gsiftp://dtn03.ccs.ornl.gov/$MEMBERWORK/visualization.mpg
DEST=gsiftp://dtn.university-name.edu/userid/visualization.mpg
globus-url-copy -tcp-bs 12M -bs 12M -p 4 $SOURCE $DEST

Initializing the Workflow

We can initialize this workflow in one of two ways:

  • Log into rhea.ccs.ornl.gov and run qsub Step-1.prepare-data.pbs OR
  • From Titan or the DTNs, run qsub -q rhea Step-1.prepare-data.pbs

Checking Job Status

Host Remote qstat Remote showq
Rhea qstat -a @rhea-batch showq --host=rhea-batch
Eos qstat -a @eos-batch showq --host=eos-batch
Titan qstat -a @titan-batch showq --host=titan-batch
Data Transfer Nodes (DTNs) qstat -a @dtn-batch showq --host=dtn-batch

Deleting Remote Jobs

To delete a job (for example, job number 18688) from a remote queue, use the command for the appropriate host:

Host Remote qdel
Rhea qdel 18688@rhea-batch
Eos qdel 18688@eos-batch
Titan qdel 18688@titan-batch
Data Transfer Nodes (DTNs) qdel 18688@dtn-batch
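
For example, to list your jobs on the DTN batch system and then remove job 18688:

$ qstat -a @dtn-batch | grep $USER
$ qdel 18688@dtn-batch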

Potential Pitfalls

The OLCF advises users to keep their remote submission workflows simple, short, and mostly linear. Workflows that contain many layers of branches, or that trigger many jobs at once, may prove difficult to maintain and debug. Workflows that contain loops or recursion (jobs that can submit themselves again) may inadvertently waste allocation hours if a suitable exit condition is not reached.

Recursive workflows which do not exit will drain your project’s allocation. Refunds will not be granted. Please be extremely cautious when designing workflows that cause jobs to re-submit themselves.

Circular Workflow
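
One way to guard a self-resubmitting job against runaway recursion is to carry an iteration counter between submissions and stop at a hard limit. The sketch below is illustrative only; the script name restart.pbs, the ITER and MAX_ITER variables, and the converged.flag file are assumptions made for this example rather than OLCF conventions.

#PBS -A PROJ123
#PBS -l walltime=1:00:00
#PBS -l nodes=1
#PBS -l gres=atlas1%atlas2

# ITER is passed between submissions with "qsub -v"; default to 0 on the first run
ITER=${ITER:-0}
MAX_ITER=10   # hard stop so a missed exit condition cannot drain the allocation

cd $MEMBERWORK/proj123
aprun -n 16 ./a.out

# Re-submit this script only while below the cap AND while no convergence flag exists
if [ "$ITER" -lt "$MAX_ITER" ] && [ ! -f converged.flag ]; then
  qsub -q titan -v ITER=$((ITER + 1)) restart.pbs
fi

With this pattern, the chain stops after at most MAX_ITER submissions even if the convergence check never succeeds.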

As always, users on multiple projects are strongly advised to double check that the #PBS -A <PROJECTID> field is set to the correct project prior to submission. This will ensure that resource usage is associated with the intended project.


Local Transfers

The OLCF provides a shared-storage environment, so transferring data between our machines is largely unnecessary. However, we provide tools both to move large amounts of data between scratch and archival storage and from one scratch area to another.

The following utilities can be used to transfer data between partitions of the filesystem.

Utility Amount of Data to Transfer Where to run utility Notes
cp < 500 GB command line / batch script Useful when transferring small amounts of data, available from Titan and the DTNs
bbcp < 500 GB DTN command line or batch script Multi-streaming ability can make bbcp a faster option than cp, should be executed from DTNs only
fcp > 500 GB batch script Faster than cp and bbcp for directory trees, can preserve lustre striping, only available from the batch scheduled DTNs

To avoid overloading the compute systems’ login nodes, use the DTNs for large transfers (more than roughly 500 GB), for transfers to the High Performance Storage System, and whenever running fcp or bbcp.
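
For example, from a DTN you might copy a small directory tree with cp and a larger single file with bbcp between two Lustre areas; the paths and file names below are placeholders:

# On dtn.ccs.ornl.gov
$ cp -r $MEMBERWORK/proj123/small_results $PROJWORK/proj123/

# Multi-stream bbcp copy of a single large file between scratch areas
$ bbcp -P 10 -w 8m -s 8 $MEMBERWORK/proj123/big_output.dat $PROJWORK/proj123/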

Because all areas of the shared Lustre filesystem are periodically swept, moving data from Lustre to archival storage on the High Performance Storage System (HPSS) is a necessary part of most users’ workflows. The commands hsi and htar provide easy-to-use interfaces to your User Archive and Project Archive spaces on the OLCF’s HPSS-based archival storage system.

The HPSS Best Practices section contains examples using HSI and HTAR, as well as sample workflows.
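
As a minimal illustration (see the HPSS Best Practices section for complete workflows), a directory tree can be bundled into HPSS with htar, and files can be listed and retrieved with hsi; the project ID, directory, and file names below are placeholders:

# Bundle a run directory from Lustre into a single archive file in HPSS
$ cd $PROJWORK/proj123
$ htar -cvf /proj/proj123/run42.htar run42/

# List the project archive and retrieve a single file
$ hsi ls /proj/proj123
$ hsi get /proj/proj123/inputs.tar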

fcp

fcp is a program designed for large-scale parallel data transfer from a source directory to a destination directory across locally mounted file systems. It is not intended for wide-area transfers, unlike ftp, bbcp, or globus-ftp; in that sense, it is closer to cp. One crucial difference from regular cp is that fcp requires both the source and destination to be directories; it will fail if this condition is not met. In the most general case, fcp works in two stages: first it analyzes the workload by walking the tree in parallel, and then it parallelizes the data copy operation.

At the OLCF, fcp is provided as a modulefile and it is available on the data transfer nodes (DTNs) and the analysis cluster (Rhea). To use fcp:

  1. Load the pcircle modulefile:
    $ module load pcircle
    
  2. Use mpirun to copy a directory:
    $ mpirun -np 8 fcp src_dir dest_dir
    

fcp options

In addition, fcp supports the following options:

-p, --preserve: Preserve metadata attributes. In the case of Lustre, the
striping information is kept.
  
-f, --force: Overwrite the destination directory. The default is off.
  
--verify: Perform checksum-based verification after the copy.
  
-s, --signature: Generate a single sha1 signature for the entire dataset.
This option also implies --verify for post-copy verification.
  
--chunksize sz: fcp will break up large files into pieces to increase
parallelism. By default, fcp adaptively sets the chunk size based on the
overall size of the workload. Use this option to specify a particular
chunk size in KB or MB. For example, --chunksize 128MB.
  
--reduce-interval: Controls progress report frequency. The default is 10
seconds.
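
For instance, a transfer that preserves Lustre striping, verifies the copy, and uses a fixed chunk size could combine these options as follows (the directory names are placeholders):

$ module load pcircle
$ mpirun -np 8 fcp --preserve --verify --chunksize 64MB ./src_dir ./dest_dir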

Using fcp inside a batch job

The data transfer nodes (DTNs) can be used to submit an fcp job to the batch queue:

  1. Connect to the DTNs:
    ssh <username>@dtn.ccs.ornl.gov
    
  2. Prepare a PBS submission script:
    #!/bin/bash -l
    #PBS -l nodes=2
    #PBS -A <projectID>
    #PBS -l walltime=00:30:00
    #PBS -N fcp_job
    #PBS -j oe
       
    cd $PBS_O_WORKDIR
    module load pcircle
    module list
       
    mpirun -n 16 --npernode 8 fcp ./src_dir ./dest_dir
    

    The --npernode option is needed to distribute the MPI processes across physical nodes. Omitting it would place all 16 MPI processes on the same node, which in this example is not the desired behavior, as it would reduce the amount of memory available to each process.
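
    Once the script is saved (here as fcp_job.pbs, a file name chosen to match the #PBS -N directive above), it can be submitted and monitored from the DTNs:

    $ qsub fcp_job.pbs
    $ qstat -u $USER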

Performance considerations

fcp performance is subject to the bandwidth and conditions of the source file system, the storage area network, the destination file system, and the number of processes and nodes involved in the transfer. Using more processes per node does not necessarily result in better performance due to an increase in the number of metadata operations as well as additional contention generated from a larger number of processes. A rule of thumb is to match or halve the number of physical cores per transfer node.

Both the post-copy verification (--verify) and dataset signature (--signature) options have performance implications. When enabled, fcp calculates checksums of each chunk/file for both source and destination, in addition to reading data back from the destination. This increases both the amount of bookkeeping and the memory usage. For large-scale data transfers, a large-memory node is therefore recommended.

Author and Code

Feiyi Wang (fwang2@ornl.gov), ORNL Technology Integration group.
Source code for PCircle can be found on the OLCF GitHub page.


Remote Transfers

The OLCF provides several tools for moving data between computing centers or between OLCF machines and local user workstations. The following tools are primarily designed for transfers over the internet and are not recommended for transferring data between OLCF machines.

The following table summarizes options for remote data transfers:

Tool                Data Security                                   Authentication  Transfer Speed  Required Infrastructure
GridFTP + GridCert  Insecure by default, secure with configuration  Passcode        Fast            GridFTP server at remote site
GridFTP + SSH       Insecure by default, secure with configuration  Passcode        Fast            GridFTP server at remote site
SFTP/SCP            Secure                                          Passcode        Slow            Comes standard with SSH install
BBCP                Insecure & unsuited for sensitive projects      Passcode        Fast            BBCP installed at remote site

Globus GridFTP

GridFTP is a high-performance data transfer protocol based on FTP and optimized for high-bandwidth wide-area networks. It is typically used to move large amounts of data between the OLCF and other major centers.
Globus is a service built on GridFTP that provides a web user interface for initiating, managing, and monitoring GridFTP transfers between endpoints. An endpoint is the logical address of a resource or filesystem attached to a Globus Connect GridFTP server. Many institutions host their own shared Globus Connect servers and endpoints; it is also possible to turn any non-commercial private resource into an endpoint using the Globus Connect Personal client. Globus can also be scripted.

The sections below describe how to install GridFTP and how to manage transfers through the Globus web interface, the Globus command-line interface, and SSH-authenticated globus-url-copy.

Installing GridFTP

Prior to using GridFTP, it must be installed on both the client and the server. Installation is independent of the authentication method chosen. GridFTP is currently available on each OLCF system and can be added to your environment using the globus module:
  $ module load globus

If your site does not already have GridFTP available, it can be downloaded from Globus. Download and installation information can be found on the Globus Toolkit Documentation site.

Transferring Data with Globus Online

In the browser of your choice, visit https://www.globus.org.

  1. Log in to Globus with your Globus identity. Most new Globus users should create a traditional stand-alone Globus ID by following the “Sign up” link. Returning OLCF users should generally follow the “Globus ID” link to log in.
    Only users who have an ORNL UCAMS institutional ID may choose “Oak Ridge National Lab” from the dropdown menu.

    See the Globus accounts FAQ for more information.

    Log in to the Globus webapp using an existing Globus ID, linked institutional ID, or new Globus account.

  2. Once logged in to the Globus webapp, select “Manage Data” and then “Transfer Files” from the blue menu bar.
    Choose “Transfer Files” from the navigation menu.

  3. Enter an OLCF endpoint into one of the two Endpoint fields. Using the endpoint selection dialog that appears, enter OLCF Atlas (or the alternate name olcf#dtn_atlas) to choose the OLCF as an endpoint.
    Select an OLCF endpoint in one of the two available fields.

    Workflows established prior to February 2016 may have used the now-discontinued olcf#dtn endpoint. This endpoint should no longer be used. Questions about migrating legacy workflows can be directed to help@olcf.ornl.gov.
  4. Globus must request permission from you via the OLCF to access your files. Press the Continue button. You will be redirected to our authentication page at “myproxy*.ccs.ornl.gov”. Enter your OLCF username in the “username” box and your OLCF SecurID passcode in the “Passcode” box. Upon success, you will be returned to the Globus web interface.
    Activating an endpoint using only your OLCF credentials requires a browser to authenticate the OAuth request from Globus.

    The endpoint lifetime is 72 hours. If the endpoint authentication expires before the transfer is complete, the transfer will automatically resume the next time you reactivate the endpoint.
  5. Enter the path to your data (for example, /tmp/work/username) in the “Path” field. Soon you should see a list of the files in that directory appear in the window below.
  6. Repeat this process in the other endpoint window with the endpoint at the other end of your transfer.
  7. Select the files you want to transfer by clicking on them. Use the arrows in the middle of the page to do the transfer.
  8. Globus will give you a message at the top of the page about the status of your transfer and send you an email when your transfer is complete.

Reactivating an Expired Endpoint

If the endpoint or proxy expires before the transfer is complete, the transfer will automatically resume the next time you activate the endpoint.

To reactivate an expired endpoint, choose “Manage Endpoints” from the Globus web interface. Select the OLCF endpoint you wish to reactivate and choose the “Activate” tab. Press the “Reactivate Now” button and enter your OLCF credentials to approve the request by Globus to access your account.

Reactivate an endpoint under the “Manage Endpoints” section.

Globus Online Command Line Interface

Globus Online also provides a scriptable command-line interface available via SSH at cli.globusonline.org using your Globus account credentials. Complete information about cli.globusonline.org can be found in the official Globus documentation.

To use the CLI interface, you must first generate an SSH public-private key pair on the host from which you will use the interface. From a terminal, call

$ ssh-keygen -t rsa -b 4096 -C "Globus key on $HOST" -f $HOME/.ssh/id_rsa.globus

It is highly recommended that all your SSH keys be protected by passphrases; passphrase-protected keys can be used in conjunction with an SSH agent for convenience. Compromised passphrase-less keys linked to Globus will allow read-write access to all of your activated endpoints.

Add the public key to your Globus ID’s list of authorized keys. From the web-UI, click on the Account menu link, then choose “manage SSH and X.509 keys”, then “Add a New Key”. Give the key any alias, choose the SSH Public Key Type, paste the full contents of $HOME/.ssh/id_rsa.globus.pub into the body field and click “Add Key”.

To use the interface, start an SSH session as your Globus ID username with:

$ ssh -i $HOME/.ssh/id_rsa.globus ${GLOBUS_UNAME}@cli.globusonline.org

This command will place you into an interactive console from which globus transfer management commands can be issued. Calling help will list all of the available commands. Full documentation for each command is available through man $COMMAND.

By encapsulating the SSH invocation into a shell function or alias and using an SSH-agent or passphrase-less key, it is possible to write convenient shell scripts for managing Globus transfers. The script below uses the Globus Online Tutorial Endpoints go#ep1 and go#ep2, which are available to all Globus users for practice, to demonstrate basic operations.

#!/bin/bash
#
# This script demos the simplest way to automate Globus Online transfers
# using the ssh://cli.globusonline.org interface.
#
#==============================================================================

# Edit these as needed for individual use
GLOBUS_KEY="$HOME/.ssh/id_rsa.globus"   # private key generated earlier
GLOBUS_UNAME="FIXME"

# Function to simplify remote Globus command invocations.
gocli() {
  ssh -i ${GLOBUS_KEY} ${GLOBUS_UNAME}@cli.globusonline.org "$@"
}

# Print available commands. Man pages can be read by starting an interactive
# session over ssh using `ssh ${GLOBUS_UNAME}@cli.globusonline.org`
gocli help

# Activate the endpoints.
# endpoint-activate returns 0 if active or successfully activated.
# Some endpoints may be interactively activated, but not the OLCF's.
# It is a good practice to exit the script on activation problems when the
# script is run non-interactively.
# TODO - Add a trap or timeout if this script is run non-interactively
#        against endpoints that can be interactively activated.
gocli endpoint-activate "go#ep1"
[ "$?" -ne 0 ] && exit 1

gocli endpoint-activate "go#ep2"
[ "$?" -ne 0 ] && exit 1

# Make destination dirs - this is not strictly necessary, just showing off
# `mkdir`.
if ! gocli ls "go#ep2/~/simple_demo" > /dev/null 2>&1; then
  gocli mkdir "go#ep2/~/simple_demo"
fi

# List the SRC and DST folder contents.
gocli ls -l "go#ep1/share/godata"
gocli ls -l "go#ep2/~/simple_demo"

# Bulk file transfer:
# Construct array of INPUTLINE(s) from ls output on src dir:
DATA_FILES=( $(gocli ls "go#ep1/share/godata/*.txt") )
{
for i in "${!DATA_FILES[@]}"; do
  f="${DATA_FILES[$i]}"
  echo "go#ep1/share/godata/$f go#ep2/~/simple_demo/bulk/$f"
done
# Pipe array into transfer command.
} | gocli transfer -s 3 --label="scripted_bulk_xfer_demo"

# Recursive transfer:
gocli transfer -s 3 --label="scripted_recursive_xfer_demo" -- \
  "go#ep1/share/godata/" \
  "go#ep2/~/simple_demo/recursive/" -r

# Print the status of the last two transfers:
# See `gocli man status` for format options to make parsing easier.
gocli status -a -l 2

Globus GridFTP with SSH Authentication

No setup is required if you will be using SSH for authentication. However, to use this for remote transfers, the remote facility must also accept SSH for authentication.

Transferring Data

Files are transferred using the globus-url-copy command. The arguments to that command will differ based on your authentication method.
To use globus-url-copy with SSH authentication load the globus module:

  $ module load globus

Then run the globus-url-copy command:

For example, while on an OLCF resource, you can transfer file1 in your OLCF User Work area to file2 on a remote system:

 $ globus-url-copy -tcp-bs 12M -bs 12M -p 4 -v -vb file:/lustre/atlas/scratch/$USER/file1 sshftp://user@remote.system/remote/dir/file2

From the OLCF, transfer file1 on a remote system to file2 in your User Work area:

 $ globus-url-copy -tcp-bs 12M -bs 12M -p 4 -v -vb sshftp://remote.system/remote/dir/file1 file:/lustre/atlas/scratch/$USER/file2

From a remote system, transfer file1 on that system to file2 in your User Work area:

 $ globus-url-copy -tcp-bs 12M -bs 12M -p 4 -v -vb file:/remote/dir/file1 sshftp://userid@dtn.ccs.ornl.gov/lustre/atlas/scratch/$USER/file2
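
globus-url-copy can also transfer entire directory trees using its -r (recurse) and -cd (create destination directories) options; directory URLs must end with a trailing slash. The paths below are placeholders:

 $ globus-url-copy -r -cd -tcp-bs 12M -bs 12M -p 4 -v -vb file:/lustre/atlas/scratch/$USER/results/ sshftp://user@remote.system/remote/dir/results/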

SSH Keys

SSH keys cannot be used to grant passwordless access to OLCF resources. However, SSH keys can be created on OLCF systems for use in accessing remote resources that support SSH keys.

To create an SSH key pair for dtn.ccs.ornl.gov:

Log in to dtn.ccs.ornl.gov and change to your ~/.ssh directory (create it if it does not exist).

$ ssh-keygen -t dsa
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /ccs/home/$USER/.ssh/id_dsa.
Your public key has been saved in /ccs/home/$USER/.ssh/id_dsa.pub.

This will generate an SSH key pair consisting of id_dsa.pub and id_dsa. If you choose not to enter a passphrase, anyone who gains access to your private key file will then be able to assume your identity.

To cache the private key and passphrase so that you do not need to enter it for every ssh operation in your session:

$ eval $(ssh-agent)
$ ssh-add
Enter passphrase for /ccs/home/$USER/.ssh/id_dsa: 
Identity added: /ccs/home/$USER/.ssh/id_dsa (/ccs/home/$USER/.ssh/id_dsa)

Copy your id_dsa.pub to the remote resource and append its contents to the ~/.ssh/authorized_keys file there.

To test whether this was successful, ssh to the remote resource from dtn.ccs.ornl.gov. If your SSH key works, you will not need to enter your password for the remote resource.

SFTP & SCP

The SSH-based SFTP and SCP utilities can be used to transfer files to and from OLCF systems. Because these utilities can be slow, we recommend using them only to transfer limited numbers of small files.

Both SCP and SFTP are available on all NCCS systems and should be a part of each user’s environment. For example, on a UNIX-based system, to transfer the file oldfile from your local system to your $HOME directory on OLCF systems as newfile, you would use one of the following commands:

SFTP

  sftp userid@dtn.ccs.ornl.gov
  sftp> put oldfile newfile
  sftp> bye

SCP

  scp ./oldfile userid@dtn.ccs.ornl.gov:~/newfile

…where userid is your given NCCS username.
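
Transfers in the opposite direction work the same way; for example, to pull a file from your OLCF home area to the current directory on your local machine:

  scp userid@dtn.ccs.ornl.gov:~/newfile ./newfile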

Standard file transfer protocol (FTP) and remote copy (RCP) should not be used to transfer files to the NCCS high-performance computing (HPC) systems due to security concerns.

SCP works with NCCS systems only if your per-process initialization files produce no output. This means that files such as .cshrc, .kshrc, .profile, etc. must not issue any commands that write to standard output. If you would like such a file to write to standard output for interactive sessions, you must edit the file so that it does so only for interactive sessions.

For sh-type shells such as bash and ksh use the following template:

  TTY=$( /usr/bin/tty )
  if [ $? = 0 ]; then
    /usr/bin/echo "interactive stuff goes here"
  fi

For c-shell-type shells such as csh and tcsh use:

  ( /usr/bin/tty ) > /dev/null
  if ( $status == 0 ) then
    /usr/bin/echo "interactive stuff goes here"
  endif

BBCP

Before you can use the BBCP utility, it must be installed on both the local and remote systems. It is currently available on each OLCF system and should be a part of each user’s default environment. Several pre-compiled binaries as well as the source can be downloaded from the Stanford Linear Accelerator Center (SLAC) BBCP page.

Installation from Source Tips

  • Refer to your operating system’s documentation to satisfy missing dependencies or problems in your build environment.
  • Clone the BBCP source code git repository from the Stanford Linear Accelerator Center (SLAC) BBCP page.
  • $ git clone http://www.slac.stanford.edu/~abh/bbcp/bbcp.git
    
  • From within the decompressed BBCP directory, run make. This should build the BBCP executable into the created bin directory. The build has been tested on Linux-based systems and should require few or no modifications. If your system’s uname command does not return Linux or Darwin, you may need to modify the Makefile.
  • $ cd bbcp/src
    $ uname
    Linux
    $ make
    

Common variable modifications

  • In MakeSname, the test command is hard-coded to /usr/bin/test. If this is not the location of test on your system, change the following line to the correct path (running “which test” will show the path to test on your system):
  • if /usr/bin/test -${1} $2; then
    
  • If the uname command is not in /bin on your system, change the uname variable in the MakeSname file. You will also need to change the following line in the file Makefile:
  • @cd src;$(MAKE) make`/bin/uname` OSVER=`../MakeSname`
    
  • If the libz.a library is not located at /usr/local/lib/libz.a on your system, change the libz path in the Makefile.
  • The file Makefile contains compiler and compiler flag options for the BBCP build. You can change the compilers and flags by modifying variables in this file. For example, to change the compilers used on a Linux system, modify the variables LNXCC and LNXcc.

Usage

To transfer the local file /local/path/largefile.tar to the remote system remotesystem as /remote/path/largefile.tar, use the following:

bbcp -P 2 -V -w 8m -s 16 /local/path/largefile.tar remotesystem:/remote/path/largefile.tar

where

-P 2 produces progress messages every 2 seconds.
-V produces verbose output, including detailed transfer-speed statistics.
-w 8m sets the size of the disk input/output (I/O) buffers.
-s 16 sets the number of parallel network streams to 16.

BBCP assumes the remote system’s non-interactive environment contains the path to the BBCP utility. This can be determined with the following command:

ssh remotesystem which bbcp

If this is not the case, the -T BBCP option can be used to specify how to start BBCP on the remote system. For example, you could use the following:

bbcp -P 2 -V -w 8m -s 16 -T 'ssh -x -a -oFallBackToRsh=no %I -l %U %H /remote/path/to/bbcp' /local/path/largefile.tar remotesystem:/remote/path/largefile.tar

Often, during large transfers the connection between the transferring systems is lost. The -a option gives BBCP the ability to pick up where it left off. For example, you could use the following:

bbcp -k -a /remotesystem/homedir/.bbcp/ -P 2 -V -w 8m -s 16 /local/path/largefile.tar remotesystem:/remote/path/largefile.tar

To transfer an entire directory tree, use the following:

bbcp -r -P 2 -V -w 8m -s 16 /local/path/* remotesystem:/remote/path

We strongly recommend that you use the Data Transfer Nodes when transferring files to and from the OLCF. If you are, however, connecting directly to systems such as the Cray XK, it is necessary to specify a particular node as the destination host because the host name (i.e. titan.ccs.ornl.gov) actually points to a server load-balancing device that returns node addresses in a round-robin fashion. For example, you could use the following:

bbcp -r -P 2 -V -w 8m -s 16 /local/path/* titan-login3.ccs.ornl.gov:/remote/path

You may encounter an error similar to the following:

bbcp: Accept timed out on port 5031
bbcp: Unable to allocate more than 0 of 8 data streams.
Killed by signal 15.

If this happens, add the -z option to your bbcp command. This tells bbcp to use the “reverse connection protocol” and can be helpful when a transfer is being blocked by a firewall.
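
For example, adding -z to the earlier directory-tree transfer:

bbcp -z -r -P 2 -V -w 8m -s 16 /local/path/* remotesystem:/remote/path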

Further Reading

More information on BBCP can be found by typing “bbcp -h” on OLCF systems as well as on the Stanford Linear Accelerator Center (SLAC) BBCP page.


Data Management Policy Summary

Users must agree to the full Data Management Policy as part of their account application. The “Data Retention, Purge, & Quotas” section is useful and is summarized below.

Data Retention, Purge, & Quota Summary
User-Centric Storage Areas
Area Path Type Permissions Quota Backups Purged Retention
User Home $HOME NFS User-controlled 10 GB Yes No 90 days
User Archive /home/$USER HPSS User-controlled 2 TB [1] No No 90 days

Project-Centric Storage Areas
Area Path Type Permissions Quota Backups Purged Retention
Project Home /ccs/proj/[projid] NFS 770 50 GB Yes No 90 days
Member Work $MEMBERWORK/[projid] Lustre® 700 [2] 10 TB No 14 days 14 days
Project Work $PROJWORK/[projid] Lustre® 770 100 TB No 90 days 90 days
World Work $WORLDWORK/[projid] Lustre® 775 10 TB No 90 days 90 days
Project Archive /proj/[projid] HPSS 770 100 TB [3] No No 90 days

Area The general name of storage area.
Path The path (symlink) to the storage area’s directory.
Type The underlying software technology supporting the storage area.
Permissions UNIX Permissions enforced on the storage area’s top-level directory.
Quota The limits placed on total number of bytes and/or files in the storage area.
Backups States if the data is automatically duplicated for disaster recovery purposes.
Purged Period of time, post-file-creation, after which a file will be marked as eligible for permanent deletion.
Retention Period of time, post-account-deactivation or post-project-end, after which data will be marked as eligible for permanent deletion.
Important! Files within “Work” directories (i.e., Member Work, Project Work, World Work) are not backed up and are purged on a regular basis according to the timeframes listed above.

[1] In addition, there is a quota/limit of 2,000 files on this directory.

[2] Permissions on Member Work directories can be controlled to an extent by project members. By default, only the project member has any access, but access can be granted to other project members by setting group permissions accordingly on the Member Work directory. The parent directory of the Member Work directory prevents access by UNIX “others” and cannot be changed (a security measure).

[3] In addition, there is a quota/limit of 100,000 files on this directory.