Spider Best Practices

This article summarizes a handful of best practices you can follow to get the best I/O performance from your applications running on Spider, the OLCF’s center-wide Lustre® filesystem.

Edit/Build Code in User Home and Project Home Areas Whenever Possible

Spider is built for large, contiguous I/O. Opening, closing, and stat-ing files are expensive operations in Lustre. Combined with periods of high load from compute jobs, this can make basic editing and code builds noticeably slow. To work around this, users are encouraged to edit and build code in their User Home (i.e., /ccs/home/$USER) and Project Home (i.e., /ccs/proj/) areas. While large-scale I/O will be slower from these NFS areas, basic file system operations will be faster. These areas are also not accessible from compute nodes, which limits the possible resource load.

When using the vi editor, the default behavior is to create an opened file's temporary (swap) file in the same directory as the file being edited. If you are editing a file on a Spider file system, this places the swap file on Lustre as well and can result in slowdown. You can instead direct vi to create temporary files in your User Home area by modifying your /ccs/home/$USER/.vimrc file:

  $ cat ~/.vimrc
  set swapsync=
  set backupdir=~
  set directory=~

Use ls -l Only Where Absolutely Necessary

Consider that ls -l must communicate with every OST assigned to each file it lists. When many files are listed, ls -l becomes a very expensive operation, and it also creates excessive overhead that affects other users.

Open Files as Read-Only Whenever Possible

If a file to be opened will not be written to, it should be opened as read-only. Furthermore, if the file's access time does not need to be updated, the open flags should be O_RDONLY | O_NOATIME. If the file is opened by every process in the group, the master process (rank 0) should open it O_RDONLY, with all of the non-master processes (rank > 0) opening it O_RDONLY | O_NOATIME.
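
As a minimal sketch of that pattern (PathName is assumed to be defined elsewhere, error checking is omitted, and O_NOATIME requires <fcntl.h> with _GNU_SOURCE on Linux):

  int iRank;
  int iFD;

  MPI_Comm_rank( MPI_COMM_WORLD, &iRank );

  if( !iRank ) {
    // Master process: a single atime update is acceptable
    iFD=open( PathName, O_RDONLY );
  } else {
    // Non-master processes: skip the atime update entirely
    iFD=open( PathName, O_RDONLY | O_NOATIME );
  }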

Read Small, Shared Files from a Single Task

If a shared file is to be read and the data to be shared among the process group is less than approximately 100 MB, it is preferable to change the common code shown below (in C):

  int iFD;
  int iRead;
  char cBuf[SIZE];

  // Check file descriptor
  iFD=open( PathName, O_RDONLY | O_NOATIME, 0444 );

  // Check number of bytes read
  iRead=read( iFD, cBuf, SIZE );

…to the code shown here:

  int  iFD;
  int  iRank;
  int  iRead;
  char cBuf[SIZE];

  MPI_Comm_rank( MPI_COMM_WORLD, &iRank );
  if(!iRank) {

    // Check file descriptor
    iFD=open( PathName, O_RDONLY | O_NOATIME, 0444 );

    // Check number of bytes read
    iRead=read( iFD, cBuf, SIZE );

  }
  MPI_Bcast( cBuf, SIZE, MPI_CHAR, 0, MPI_COMM_WORLD );

Similarly, in Fortran, change the code shown below:

  INTEGER iRead
  CHARACTER cBuf(SIZE)

  OPEN(UNIT=1,FILE=PathName,ACTION='READ')
  READ(1,*) cBuf

…to the code shown here:

  INTEGER iRank
  INTEGER iRead
  INTEGER ierr
  CHARACTER cBuf(SIZE)

  CALL MPI_COMM_RANK(MPI_COMM_WORLD, iRank, ierr)
  IF (iRank .eq. 0) THEN
    OPEN(UNIT=1,FILE=PathName,ACTION='READ')
    READ(1,*) cBuf
  ENDIF
  CALL MPI_BCAST(cBuf, SIZE, MPI_CHARACTER, 0, MPI_COMM_WORLD, ierr)

Here we gain several advantages. Instead of making N (the number of tasks in MPI_COMM_WORLD) open and read requests, we make only one. In addition, the broadcast uses a fanout, which reduces network traffic by allowing the interconnect routers of intermediate nodes to process less data.

However, if the shared data size exceeds 100 MB, you should contact the OLCF User Assistance Center for further optimizations.

Limit the Number of Files in a Single Directory

For large-scale applications that are going to write large numbers of files using private (per-process) data, it is best to implement a subdirectory structure to limit the number of files in a single directory. A suggested approach is a two-level directory structure with sqrt(N) directories, each containing sqrt(N) files, where N is the number of tasks.
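
As a minimal sketch of how each task might derive its output path under such a scheme (the directory and file names are illustrative, and the directories themselves would need to be created beforehand, e.g., by rank 0):

  // Requires <math.h> and <stdio.h>; link with -lm
  int  iRank;
  int  iSize;
  int  nDirs;
  char cPath[256];

  MPI_Comm_rank( MPI_COMM_WORLD, &iRank );
  MPI_Comm_size( MPI_COMM_WORLD, &iSize );

  // Roughly sqrt(N) directories, each holding roughly sqrt(N) files
  nDirs=(int)ceil( sqrt( (double)iSize ) );

  // e.g., with 65536 tasks, rank 1234 writes to dir_004/file_01234
  snprintf( cPath, sizeof(cPath), "dir_%03d/file_%05d", iRank / nDirs, iRank );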

Place Small Files on a Single OST

If only one process will read/write the file and the amount of data in the file is small, stat performance will be improved by limiting the file to a single OST on creation, since every stat operation must communicate with every OST that contains file data. You can place a file on a single OST via:

  $ lfs setstripe PathName -s 1m -i -1 -c 1
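
If you would rather set the striping from within an application, the Lustre user-space library provides llapi_file_create(); the sketch below assumes the lustreapi header and library are available (names and availability vary by Lustre version, so check your site's documentation):

  // Also requires <stdio.h> and <string.h>; link with -llustreapi
  #include <lustre/lustreapi.h>

  // Create the file with a 1 MB stripe size, default starting OST (-1),
  // a stripe count of 1, and the default striping pattern (0)
  int rc=llapi_file_create( PathName, 1048576, -1, 1, 0 );
  if( rc < 0 )
    fprintf( stderr, "llapi_file_create: %s\n", strerror( -rc ) );
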
Place Directories Containing Many Small Files on a Single OST

If you are going to create many small files in a single directory, stat (and therefore ls -l) will be more efficient if you set the directory's stripe count to one OST upon creation:

  $ lfs setstripe DirPathName -s 1m -i -1 -c 1

This is especially effective when extracting source code distributions from a tarball:

  $ lfs setstripe DirPathName -s 1m -i -1 -c 1
  $ cd DirPathName
  $ tar -x -f TarballPathName

All of the source files, header files, etc. will span only one OST. When you build the code, all of the object files will use only one OST as well. The resulting binary will also span only one OST; if that is not desirable, you can copy the binary to a file with a larger stripe count:

  $ lfs setstripe NewBin -s 1m -i -1 -c 4
  $ cp OldBinPath NewBin
  $ rm -f OldBinPath
  $ mv NewBin OldBinPath

stat Files from a Single Task

If many processes need the information from stat on a single file, it is most efficient to have a single process perform the stat call, then broadcast the results. This can be achieved by modifying the following code (shown in C):

  int iRank;
  int iRC;
  struct stat sB;

  iRC=lstat( PathName, &sB );

To the following:

  MPI_Comm_rank( MPI_COMM_WORLD, &iRank );
  if(!iRank)
  {
    iRC=lstat( PathName, &sB );
  }
  MPI_Bcast( &sB, sizeof(struct stat), MPI_CHAR, 0, MPI_COMM_WORLD );

Similarly, change the following Fortran code:

  INTEGER*4 sB(13)
  INTEGER ierr

  CALL LSTAT(PathName, sB, ierr)

To the following:

  INTEGER iRank
  INTEGER*4 sB(13)
  INTEGER ierr

  CALL MPI_COMM_RANK(MPI_COMM_WORLD, iRank, ierr)
  IF (iRank .eq. 0) THEN
    CALL LSTAT(PathName, sB, ierr)
  ENDIF
  CALL MPI_BCAST(sB, 13, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)

Please note that the Fortran lstat binding does not support files larger than 2 GB; for such files, users must provide their own Fortran binding to the C lstat.
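
One possible approach is a small C helper that returns the 64-bit file size, as sketched below (the name lstat_size_ assumes a trailing-underscore Fortran name-mangling convention, and the Fortran caller must pass a null-terminated path, e.g., TRIM(PathName)//CHAR(0)):

  // Compile with large-file support, e.g., -D_FILE_OFFSET_BITS=64
  #include <sys/stat.h>

  void lstat_size_( const char *cPath, long long *llSize, int *iErr )
  {
    struct stat sB;

    *iErr=lstat( cPath, &sB );
    *llSize=(long long)sB.st_size;
  }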

Consider Available I/O Middleware Libraries

For large-scale applications that are going to share large amounts of data, one way to improve performance is to use an I/O middleware library such as ADIOS, HDF5, or MPI-IO.
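
For example, a minimal MPI-IO sketch in which every task writes one contiguous 1 MB block through a single collective call (FileName and the block size are illustrative, and error checking is omitted):

  #define BLOCK (1024*1024)

  int        iRank;
  char       cBuf[BLOCK];
  MPI_File   fh;
  MPI_Offset offset;

  MPI_Comm_rank( MPI_COMM_WORLD, &iRank );

  MPI_File_open( MPI_COMM_WORLD, FileName,
                 MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh );

  // Each task writes at a block-aligned offset; the collective call lets the
  // MPI-IO layer aggregate requests before they reach the file system
  offset=(MPI_Offset)iRank * BLOCK;
  MPI_File_write_at_all( fh, offset, cBuf, BLOCK, MPI_CHAR, MPI_STATUS_IGNORE );

  MPI_File_close( &fh );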

Use Large and Stripe-aligned I/O Whenever Possible

I/O requests should be large, i.e., a full stripe width or greater. In addition, you will get better performance by making I/O requests stripe-aligned whenever possible. If the amount of data each client generates or requires is small, a subset of processes should be selected to perform the actual I/O requests, with those processes aggregating the data from the others.
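
As a minimal sketch of that aggregation pattern (the stripe size, group size of four, and open file descriptor iFD are assumptions; in practice, MPI-IO collective buffering or the other middleware libraries above handle this for you):

  // Also requires <stdlib.h> and <unistd.h>
  #define STRIPE (1024*1024)             // assumed stripe size of the file
  #define CHUNK  (STRIPE/4)              // each task's small contribution

  int      iRank;
  int      iAggRank;
  char     cSmall[CHUNK];
  char    *cAgg=NULL;
  MPI_Comm aggComm;

  MPI_Comm_rank( MPI_COMM_WORLD, &iRank );

  // Group every four consecutive ranks; the first rank in each group
  // aggregates the group's data and issues one full-stripe write
  MPI_Comm_split( MPI_COMM_WORLD, iRank / 4, iRank, &aggComm );
  MPI_Comm_rank( aggComm, &iAggRank );

  if( !iAggRank )
    cAgg=malloc( STRIPE );

  MPI_Gather( cSmall, CHUNK, MPI_CHAR, cAgg, CHUNK, MPI_CHAR, 0, aggComm );

  if( !iAggRank ) {
    // One stripe-aligned write per group instead of four small writes
    pwrite( iFD, cAgg, STRIPE, (off_t)( iRank / 4 ) * STRIPE );
    free( cAgg );
  }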