
Spider – the Center-Wide Lustre® File System


The OLCF’s center-wide Lustre® file system, called Spider, is the operational work file system for most OLCF computational resources. An extremely high-performance system, Spider serves over 26,000 clients, provides 32 petabytes of disk space, and can move data at more than 1 TB/s.

Spider is Center-Wide

Spider is currently accessible from nearly all of the OLCF’s computational resources, including Titan and its 18,000+ compute nodes. The file system is available from the following OLCF systems:

Note: Because the file system is shared by most OLCF computational resources, times of heavy load may impact file system responsiveness.
Spider is for Temporary Storage

Spider provides a location to temporarily store large amounts of data needed and produced by batch jobs. Due to the size of the file system, the area is not backed up. In most cases, a regularly running purge removes data not recently accessed to help ensure available space for all users. Needed data should be copied to more permanent locations.

Warning: Spider provides temporary storage of data produced by or used by batch jobs. The space is not backed up. Users should copy needed data to more permanent locations.
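
For example, data that must be retained can be copied to a more permanent location such as a User Home directory or the HPSS archive. The following is a minimal sketch, assuming a project ID of abc123 and placeholder file and directory names; consult the Data Management Policy for the appropriate destination:

  # Copy a small file back to the home area
  $ cp $MEMBERWORK/abc123/run42/inputs.nml $HOME/
  # Archive a large results directory to HPSS with the htar utility
  $ cd $MEMBERWORK/abc123
  $ htar -cvf results_run42.tar ./run42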
Spider Comprises Multiple File Systems

Spider comprises (2) file systems:

  File System   Path
  atlas1        /lustre/atlas1/
  atlas2        /lustre/atlas2/
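
Which of these file systems are mounted on a given machine, and how full each one is, can be checked with the standard Lustre lfs df command; the following is a sketch, and actual output will differ:

  $ lfs df -h /lustre/atlas1
  $ lfs df -h /lustre/atlas2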
Why two filesystems?

There are a few reasons why having multiple file systems within Spider is advantageous.

More Metadata Servers – Currently each Lustre filesystem can only utilize one Metadata Server (MDS). Interaction with the MDS is expensive; heavy MDS access will impact interactive performance. Providing (2) filesystems allows the load to be spread over (2) MDSs.

Higher Availability – The existence of multiple filesystems increases our ability to keep at least one filesystem available at all times.

Associating a Batch Job with a File System

Through the PBS gres option, users can specify the scratch area used by their batch jobs so that the job will not start if that file system becomes degraded or unavailable.

Creating a Dependency on a Single File System

Line (5) in the following example will associate a batch job with the atlas2 file system. If atlas2 becomes unavailable prior to execution of the batch job, the job will be placed on hold until atlas2 returns to service.

  1: #!/bin/csh
  2: #PBS -A ABC123
  3: #PBS -l nodes=16000
  4: #PBS -l walltime=08:00:00
  5: #PBS -l gres=atlas2
  6:
  7: cd $MEMBERWORK/abc123
  8: aprun -n 256000 ./a.out
Creating a Dependency on Multiple File Systems

The following example will associate a batch job with the atlas1 and atlas2 file systems. If either atlas1 or atlas2 becomes unavailable prior to execution of the batch job, the job will be placed on hold until both atlas1 and atlas2 are in service.

  -l gres=atlas1%atlas2
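
The same dependency can also be requested on the qsub command line rather than inside the script; for example, using a placeholder script name:

  $ qsub -l gres=atlas1%atlas2 batchscript.pbs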
Default is Dependency on All Spider File Systems

If a batch job is not associated with a file system, i.e., if the gres option is not used, the batch job will be associated with all Spider file systems, as if -l gres=atlas1%atlas2 had been added to the batch submission.

Note: To help prevent batch jobs from running during periods where a Spider file system is not available, batch jobs that do not explicitly specify the PBS gres option will be given a dependency on all Spider file systems.
Why Explicitly Associate a Batch Job with a File System?
  • Associating a batch job with a file system will prevent the job from running if the file system becomes degraded or unavailable.
  • If a batch job only uses (1) Spider file system, specifying that file system explicitly instead of taking the default of all (2) will prevent the job from being held if a file system not used by the job becomes degraded or unavailable.
Verifying/Viewing a Batch Job’s File System Association

The checkjob utility can be used to view a batch job’s file system associations. For example:

  $ qsub -lgres=atlas2 batchscript.pbs
  851694.nid00004
  $ checkjob 851694 | grep "Dedicated Resources Per Task:"
  Dedicated Resources Per Task: atlas2: 1
Available Directories on Spider

Every project is assigned a directory in the Spider filesystem, and all storage in the Spider filesystem is therefore associated with a project. This directory is further divided into user-specific areas, an area shared among all members of the project, and an area shared among all users of the system. For more details, see the article on Project-Centric Work Directories.
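
As a sketch, for a hypothetical project abc123 these areas are reached through environment variables such as the following (see the Project-Centric Work Directories article for the authoritative names and paths):

  $MEMBERWORK/abc123    # user-specific work area
  $PROJWORK/abc123      # shared among all members of the project
  $WORLDWORK/abc123     # shared among all users of the system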

IMPORTANT! Files in Lustre directories are not backed up and are purged on a regular basis according to the timeframes listed in the Data Management Policy.
Current Configuration of Spider
                         atlas1            atlas2
  Total disk space       14 PB (usable)    14 PB (usable)
  Number of OSTs         1008              1008
  Default stripe count   4                 4
  Default stripe size    1 MB              1 MB
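
The default striping shown above can be inspected, and changed per directory, with the standard Lustre lfs commands; the directory names below are placeholders:

  # Show the striping a directory will apply to new files
  $ lfs getstripe -d $MEMBERWORK/abc123
  # Stripe new files in this directory across 8 OSTs instead of the default 4
  $ lfs setstripe -c 8 $MEMBERWORK/abc123/big_output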
Additional Information

More information on Spider and Lustre can be found on the Spider Best Practices page and the Lustre Basics page.