darshan-util Overview
Darshan is a scalable HPC I/O characterization tool developed by the Argonne Leadership Computing Facility. It captures an accurate picture of an application's I/O behavior, which can then be used to investigate, understand, and tune that behavior.
Support
Usage
Using Darshan on Titan
Darshan is installed on Titan and is available via the darshan modulefile. The basic steps to run Darshan are the following (a condensed shell sketch of these steps follows the list):
- Load the Darshan modulefile
- Set the DARSHAN_LOGPATH variable to a directory on the Spider 2 Lustre file system ($MEMBERWORK, $PROJWORK, $WORLDWORK)
- Compile the application
- Submit a job to run the code
- Analyze Darshan logs using one of Darshan's utilities
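For quick reference, the sketch below condenses these steps into a single shell session, using the abc123 project ID and the hdf5_darshan working directory from the example that follows. It is a sketch rather than a script to paste verbatim; in particular, the interactive job starts a new shell, so you may need to re-load the modules and re-set DARSHAN_LOGPATH inside it.
$ module load cray-hdf5-parallel darshan
$ export DARSHAN_LOGPATH=$MEMBERWORK/abc123/myusername/hdf5_darshan/myLogs   # must be a Lustre workspace
$ mkdir -p $DARSHAN_LOGPATH
$ ftn -o hdf5_test file_create.f90          # compile with the darshan module loaded
$ qsub -l nodes=1,walltime=00:30:00 -A abc123 -I
$ aprun -n 2 ./hdf5_test                    # run inside the job
$ darshan-parser --file $DARSHAN_LOGPATH/*.darshan.gz   # inspect the resulting log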
Example on Titan: Creating a file using HDF5
To demonstrate the Darshan workflow, we will use a simple code that creates an HDF5 file. This code is part of the HDF5 Tutorial on file creation.
Getting the code
Create a file named file_create.f90 with that code, or download it from https://www.hdfgroup.org/ftp/HDF5/examples/parallel/file_create.f90.
For this tutorial, create a directory called hdf5_darshan and use the wget utility to download the code:
$ cd $MEMBERWORK/abc123/myusername
$ mkdir hdf5_darshan
$ cd hdf5_darshan
$ wget https://www.hdfgroup.org/ftp/HDF5/examples/parallel/file_create.f90
Starting an interactive job
Next, start an interactive job, which will be used to compile and run the example in the same session. To submit an interactive job:
$ qsub -l nodes=1,walltime=00:30:00 -A abc123 -I
When the job starts, continue to the next step.
Setting up Darshan
Since this example uses HDF5 and Darshan, both modules must be loaded:
$ module load cray-hdf5-parallel
$ module load darshan
Please set DARSHAN_LOGPATH to an existing directory in one of the Lustre workspaces
($MEMBERWORK, $PROJWORK, $WORLDWORK)
For example:
export DARSHAN_LOGPATH=$MEMBERWORK/<projID>/<username>/myLogs (in bash)
setenv DARSHAN_LOGPATH $MEMBERWORK/<projID>/<username>/myLogs (in tcsh)
Then, set DARSHAN_LOGPATH and create the directory that will store the Darshan logs:
$ export DARSHAN_LOGPATH=$MEMBERWORK/abc123/myusername/hdf5_darshan/myLogs
$ mkdir $DARSHAN_LOGPATH
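Optionally, a quick check that the variable is set and the directory exists (Darshan cannot save a log if the directory is missing or not writable):
$ echo $DARSHAN_LOGPATH      # should print the Lustre path set above
$ ls -ld $DARSHAN_LOGPATH    # the directory must exist and be writable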
Compiling the code
With the Darshan module loaded, we can proceed to compile the code:
$ cd $MEMBERWORK/abc123/myusername/hdf5_darshan
$ ls
file_create.f90
$ ftn -o hdf5_test file_create.f90
A warning about 'dlopen' will appear, but it is safe to ignore for this example.
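If you want to confirm that the Darshan library was actually linked into the executable, one rough check is to look for Darshan symbols in the binary (the exact symbol names vary between Darshan versions):
$ nm hdf5_test | grep -i darshan | head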
Running the code
Once the code is compiled, it can be run as usual:
$ aprun -n 2 ./hdf5_test
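The example writes a small HDF5 file into the working directory (named sds.h5, as the Darshan log later confirms), so a quick listing verifies that the run completed:
$ ls    # should now include sds.h5 alongside file_create.f90 and hdf5_test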
Analyzing Darshan logs
Only text-based analysis tools are currently installed on Titan. However, Darshan also provides utilities to generate plots and PDF reports for an application's I/O profile. To use these tools, we recommend transferring the raw data to your local system.
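For example, assuming darshan-util is installed on your workstation (its darshan-job-summary.pl script also needs perl, pdflatex, and gnuplot), you could copy the log over the OLCF data transfer nodes and build a PDF summary locally; the host name and the <logfile> and <path-to-DARSHAN_LOGPATH> placeholders below are assumptions for this sketch:
$ scp myusername@dtn.ccs.ornl.gov:<path-to-DARSHAN_LOGPATH>/<logfile>.darshan.gz .
$ darshan-job-summary.pl <logfile>.darshan.gz    # produces a PDF report for the job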
If all the previous steps have been successful, there should be a compressed log file in the $DARSHAN_LOGPATH directory:
$ ls $DARSHAN_LOGPATH
myusername_hdf5_test_id2456521_8-24-57921-1462825255861516406_1.darshan.gz
There are several utilities that can be used to analyze Darshan logs; as noted above, only the text-based utilities are installed on Titan. The darshan-parser utility can be used to obtain information about the I/O operations performed during the job:
$ darshan-parser --help
Usage: darshan-parser [options]
    --all   : all sub-options are enabled
    --base  : darshan log field data [default]
    --file  : total file counts
    --file-list  : per-file summaries
    --file-list-detailed  : per-file summaries with additional detail
    --perf  : derived perf data
    --total : aggregated darshan field data
In this case, since the code creates a single file, we can use the --file option to confirm that two MPI processes were used and that they wrote to a single shared file.
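The command that produces the report below looks like this (the log file name is the one listed by ls above; adjust it to match your own log):
$ darshan-parser --file $DARSHAN_LOGPATH/myusername_hdf5_test_id2456521_8-24-57921-1462825255861516406_1.darshan.gz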
# darshan log version: 2.06
# size of file statistics: 1328 bytes
# size of job statistics: 1080 bytes
# exe: ./hdf5_test
# uid: 11685
# jobid: 2456521
# start_time: 1440446721
# start_time_asci: Mon Aug 24 16:05:21 2015
# end_time: 1440446721
# end_time_asci: Mon Aug 24 16:05:21 2015
# nprocs: 2
# run time: 1
# metadata: lib_ver = 2.3.1
# metadata: h = romio_no_indep_rw=true;cb_nodes=4
# mounted file systems (device, mount point, and fs type)
# -------------------------------------------------------
# mount entry: 1156358687813405641 /etc dvs
# mount entry: -6093723773410666080 /lustre/atlas2 lustre
# mount entry: 4448267357135738885 /lustre/atlas1 lustre
# mount entry: 3832236260697971115 /lustre/atlas lustre
# mount entry: -648807988769344735 / dvs
# mount entry: -6093723773410666080 /lustre/atlas2 lustre
# mount entry: 4448267357135738885 /lustre/atlas1 lustre
# mount entry: 3832236260697971115 /lustre/atlas lustre
# mount entry: -648807988769344735 / rootfs
# files
# -----
# total: 1 799 799
# read_only: 0 0 0
# write_only: 1 799 799
# read_write: 0 0 0
# unique: 0 0 0
# shared: 1 799 799
The resulting I/O profile shows that a single shared file, sds.h5, was created by 2 MPI processes.
The following snippets from sample outputs show the different types of reports available:
- --file-list
# Per-file summary of I/O activity.
# <hash>: hash of file name
# <suffix>: last 15 characters of file name
# <type>: MPI or POSIX
# <nprocs>: number of processes that opened the file
# <slowest>: (estimated) time in seconds consumed in IO by slowest process
# <avg>: average time in seconds consumed in IO per process
#
17249959986228912358 _darshan/sds.h5 MPI 2 0.007429 0.005239
- --perf
# performance
# -----------
# total_bytes: 800
#
# I/O timing for unique files (seconds):
# ...........................
# unique files: slowest_rank_io_time: 0.000000
# unique files: slowest_rank_meta_time: 0.000000
# unique files: slowest rank: 0
#
# I/O timing for shared files (seconds):
# (multiple estimates shown; time_by_slowest is generally the most accurate)
# ...........................
# shared files: time_by_cumul_io_only: 0.005239
# shared files: time_by_cumul_meta_only: 0.005052
# shared files: time_by_open: 0.012253
# shared files: time_by_open_lastio: 0.007449
# shared files: time_by_slowest: 0.007429
#
# Aggregate performance, including both shared and unique files (MiB/s):
# (multiple estimates shown; agg_perf_by_slowest is generally the most accurate)
# ...........................
# agg_perf_by_cumul: 0.145617
# agg_perf_by_open: 0.062265
# agg_perf_by_open_lastio: 0.102423
# agg_perf_by_slowest: 0.102702
A complete report for a given application can be obtained by using the --all option.
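As a sketch, each of these reports is produced by passing the corresponding option and the log file to darshan-parser; the full --all report is long, so redirecting it to a file is convenient (<logfile> is a placeholder for your log name):
$ darshan-parser --file-list $DARSHAN_LOGPATH/<logfile>.darshan.gz
$ darshan-parser --perf $DARSHAN_LOGPATH/<logfile>.darshan.gz
$ darshan-parser --all $DARSHAN_LOGPATH/<logfile>.darshan.gz > full_report.txt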
Profiling dynamically linked applications
The procedure above works for statically linked executables. If your application is dynamically linked, you will need to add LD_PRELOAD to your job launch command. For our example above, the aprun command would become:
$ aprun -e LD_PRELOAD=$DARSHAN_HOME/lib/libdarshan.so ./hdf5_test
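If you are unsure how your executable was linked, standard tools can tell you before you decide whether LD_PRELOAD is needed:
$ file ./hdf5_test    # reports "statically linked" or "dynamically linked"
$ ldd ./hdf5_test     # lists shared libraries, or "not a dynamic executable" for static binaries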
Additional Resources
More information about Darshan and the analysis utilities it provides can be found on the Darshan project website.
Builds
SUMMIT
- darshan-util@3.1.4%gcc@4.8.5
TITAN
- darshan-util@3.1.4%gcc@5.3.0
- 2.3.1
RHEA
- 2.3.1