Checkpoint Advisory Tool
The Checkpoint Advisory tool estimates the optimal checkpointing interval for your application. As illustrated earlier, checkpointing too frequently may result in excessive I/O overhead; on the other hand, a large checkpointing interval can result in significant lost work. The aim of this tool, therefore, is to find the optimal checkpointing interval for a given application. The estimation is based primarily on two factors: the mean time to failure and the time to write one checkpoint. The instructions below cover how to invoke the tool, what parameters to provide, the expected return value, and how to integrate it into a job script.
The tool expects two inputs: job size and time to write one checkpoint. The job size parameter is used to determine the "expected" mean time between failures (MTBF) for the job. For example, if the MTBF for a system with 20,000 nodes is 50 hours, then the expected MTBF for a job utilizing only 10,000 nodes will be 100 hours. The time to write one checkpoint varies among applications, so it is expected as an input to the tool; note that this parameter should be passed in minutes. The return value is the optimal checkpointing interval (also in minutes). In some cases, the application may have to adjust iteration counts in the code base to match the returned optimal checkpointing interval.
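The article does not spell out the tool's internal formula, but the two inputs map naturally onto a well-known first-order model: scale the system MTBF by the fraction of nodes the job uses, then apply Young's approximation, T_opt = sqrt(2 x C x MTBF). The sketch below is illustrative only; the function names are not part of the tool, and the actual implementation may differ.

```python
import math

def expected_job_mtbf(system_mtbf_hours, system_nodes, job_nodes):
    """Scale the system MTBF to the fraction of nodes the job uses.
    A job on half the nodes sees, on average, half as many failures,
    so its expected MTBF doubles (e.g., 50 h on 20,000 nodes becomes
    100 h on 10,000 nodes, as in the example above)."""
    return system_mtbf_hours * system_nodes / job_nodes

def youngs_interval(checkpoint_minutes, job_mtbf_hours):
    """Young's first-order approximation of the optimal checkpoint
    interval: T_opt = sqrt(2 * C * MTBF), returned in minutes."""
    mtbf_minutes = job_mtbf_hours * 60
    return math.sqrt(2 * checkpoint_minutes * mtbf_minutes)

mtbf = expected_job_mtbf(50, 20000, 10000)      # 100.0 hours
print(mtbf)
print(round(youngs_interval(5, mtbf)))          # ~245 minutes for a 5-minute checkpoint
```

With a 5-minute checkpoint and a 100-hour job MTBF, this model advises checkpointing roughly every four hours, which illustrates why frequent (e.g., hourly) checkpointing can waste I/O bandwidth on a stable machine.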
Please note that the tool does not provide any guarantees against system failures; failures may still interrupt the application. However, by using this tool, the application will minimize the side effects of these failures (i.e., minimize lost work). Also, given the stability of the Titan supercomputer, the tool may advise the application to checkpoint less frequently, or not at all, in certain cases. This is especially useful because it can reduce I/O overhead significantly and improve scientific productivity by spending time on useful computation instead of checkpointing.
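For applications that checkpoint every N iterations rather than on a timer, the returned interval has to be translated into an iteration count. A minimal sketch, assuming the average per-iteration time is known from prior runs (the function name and values here are hypothetical):

```python
def iterations_per_checkpoint(optimal_interval_minutes, minutes_per_iteration):
    """Convert the advised checkpoint interval into a whole number of
    iterations between checkpoints, never fewer than one."""
    return max(1, round(optimal_interval_minutes / minutes_per_iteration))

# e.g., a 60-minute advised interval with 1.5-minute iterations
print(iterations_per_checkpoint(60, 1.5))   # 40 iterations per checkpoint
```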
Background on Checkpointing
The checkpoint interval is the time period after which an application regularly writes a checkpoint. More frequent checkpointing reduces the potential for lost work but increases checkpointing overhead; less frequent checkpointing does the opposite.
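This trade-off can be made concrete with a simple first-order cost model (a sketch, not the tool's actual model): each interval of length T pays the checkpoint cost C, and each failure (arriving at rate 1/MTBF) loses on average half an interval of work, so the overhead fraction is roughly C/T + T/(2 x MTBF).

```python
def overhead_fraction(interval_min, checkpoint_min, mtbf_min):
    """First-order overhead model: checkpoint cost per interval plus
    expected rework lost to failures (half an interval, on average)."""
    return checkpoint_min / interval_min + interval_min / (2 * mtbf_min)

# 5-minute checkpoints, 100-hour (6000-minute) MTBF
print(overhead_fraction(30, 5, 6000))    # ~0.17: too frequent, dominated by I/O
print(overhead_fraction(245, 5, 6000))   # ~0.04: near the minimum
print(overhead_fraction(1000, 5, 6000))  # ~0.09: too rare, dominated by lost work
```

Overhead is high at both extremes and minimized in between, which is exactly the sweet spot the advisory tool searches for.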
Usage Help for Checkpoint Advisory Tool
Checkpoint Advisory — v1.0 usage and command line options
Usage: ./checkpoint_advisory.py [arguments and options]
Usage: ./checkpoint_advisory.py -c <Time To Checkpoint> -s <Job Size> -n <Job Name> -i <Job ID> -t <Requested Wall Clock Time>

Required Arguments:
  -c --checkpoint_time   Time to write one checkpoint (in minutes)
  -s --job_size          Job size (number of nodes)

Optional Arguments:
  -n --job_name          Name of the job
  -i --job_id            Job ID
  -t --requested_time    Requested wall clock time

Notes:
  Time to checkpoint is a required argument. The input should be in minutes.
  It can be an estimate based on prior runs.
  Job size is the number of nodes to be allocated for the next job (aprun
  command). The job size argument is not necessarily the number of nodes
  requested by the batch job (i.e., $PBS_NUM_NODES).
  Job name and ID are optional parameters. If you are running multiple apruns
  within one batch job, supplying the application's name is recommended every
  time the checkpoint advisory is invoked.
./checkpoint_advisory.py --checkpoint_time 20 --job_size 100
./checkpoint_advisory.py -c 5 -s 4096
./checkpoint_advisory.py --checkpoint_time 2 --job_size 10000 -n $PBS_JOBNAME -i $PBS_JOBID --requested_time $PBS_WALLTIME
Example Job Integration with Checkpoint Advisory Tool
#!/bin/bash
#PBS -N Job_Name
#PBS -A Account_Number
#PBS -l walltime=05:00:00
#PBS -l nodes=1000
#PBS -o outputlog
#PBS -e errorlog

cd $MEMBERWORK/my_directory

# Query the advisory tool: time to checkpoint is 0.20 minutes,
# job size is 1000 nodes; OIC holds the optimal interval in minutes
OIC=$(./checkpoint_advisory.py --checkpoint_time 0.20 --job_size 1000)
echo "Optimal Interval for Checkpointing (Oh-I-See): " $OIC

# Launch the application
aprun -n 1200 -N 16 ./app_name_A
exit 0