
CheckpointAdvisory

Description

Checkpoint Advisory Tool

The Checkpoint Advisory tool estimates the optimal checkpointing interval for your application. Checkpointing too frequently can cause excessive I/O overhead, while a long checkpointing interval can result in significant lost work when a failure occurs. The tool aims to find the interval that balances the two for a given application. The estimate is based primarily on two factors: the mean time to failure and the time to write one checkpoint. The instructions below explain how to invoke the tool, which parameters to provide, what value it returns, and how to integrate it into a job script.

The tool expects two inputs: the job size and the time to write one checkpoint. The job size is used to determine the “expected” mean time between failures (MTBF) for the job. For example, if the MTBF for a system with 20,000 nodes is 50 hours, then the expected MTBF for a job using only 10,000 nodes is 100 hours. The time to write one checkpoint varies among applications, so it must be supplied as an input, in minutes. The return value is the optimal checkpointing interval, also in minutes. In some cases, the application may need to adjust the iteration count between checkpoints in its code so that it matches the returned interval.
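The job-size scaling in the example can be sketched as a short calculation. This is only an illustration: it assumes the expected MTBF scales inversely with the fraction of the system a job occupies, which is what the 20,000-node example implies. The function name job_mtbf_hours and its parameters are hypothetical and are not part of the tool.

```python
def job_mtbf_hours(system_mtbf_hours, system_nodes, job_nodes):
    """Estimate the expected MTBF for a job that uses a subset of the nodes.

    Assumes failures are spread evenly across nodes, so a job on half of
    the nodes sees failures half as often (an illustrative assumption).
    """
    if job_nodes <= 0 or job_nodes > system_nodes:
        raise ValueError("job_nodes must be between 1 and system_nodes")
    return system_mtbf_hours * system_nodes / job_nodes


# Example from the text: a 50-hour MTBF on 20,000 nodes, job on 10,000 nodes
print(job_mtbf_hours(50, 20000, 10000))  # 100.0
```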

Please note that the tool does not provide any guarantee against system failures, which may still interrupt the application. However, by using the tool, an application can minimize the side effects of these failures (i.e., the lost work). Also, given the stability of the Titan supercomputer, the tool may in certain cases advise the application to checkpoint less frequently or not at all. This is especially useful because it can significantly reduce I/O overhead and improve scientific productivity by spending time on useful computation instead of checkpointing.

Background on Checkpointing

Checkpointing-Overview

The checkpoint interval is the period of time between consecutive checkpoints taken by an application.

Checkpointing-overhead

More frequent checkpointing reduces the potential for lost work but increases the checkpointing overhead, and vice versa.
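This trade-off can be made concrete with Young's first-order approximation, a widely used textbook estimate in which the interval is the square root of twice the checkpoint time multiplied by the MTBF. The sketch below is illustrative only: this page does not state which model the Checkpoint Advisory tool uses, and the function name young_interval_minutes is hypothetical.

```python
import math


def young_interval_minutes(checkpoint_minutes, mtbf_minutes):
    """Young's approximation: sqrt(2 * checkpoint time * MTBF).

    Both inputs and the result are in minutes. This is a generic
    estimate, not necessarily the formula used by the tool.
    """
    if checkpoint_minutes <= 0 or mtbf_minutes <= 0:
        raise ValueError("both arguments must be positive")
    return math.sqrt(2.0 * checkpoint_minutes * mtbf_minutes)


# A 5-minute checkpoint with a 100-hour expected MTBF
interval = young_interval_minutes(5, 100 * 60)
print(round(interval, 1))  # 244.9
```

Under this approximation, a cheaper checkpoint or a shorter MTBF both shorten the recommended interval, which matches the qualitative trade-off described above.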

Usage

Usage Help for Checkpoint Advisory Tool

Checkpoint Advisory — v1.0 usage and command line options

Usage: ./checkpoint_advisory.py arguments and options

Usage: ./checkpoint_advisory.py -c <Time To Checkpoint> -s <Job Size> -n <Job Name> -i <Job ID> -t <Requested Wall Clock Time>

Required Arguments:

-c --checkpoint_time Time to write one checkpoint (in minutes)

-s --job_size Job size (number of nodes)

Optional Arguments:

-n --job_name Name of the job

-i --job_id Job ID

-t --requested_time Requested wall clock time

Notes:
Time to checkpoint is a required argument and must be given in minutes. It can be an estimate based on prior runs.

Job size is the number of nodes to be allocated for the next job (aprun command).

The job size argument is not necessarily the number of nodes requested by the batch job (i.e., $PBS_NUM_NODES).

Job name and ID are optional parameters.

If you are running multiple apruns within one batch job, supplying the application's name is recommended every time the Checkpoint Advisory tool is invoked.

Examples

./checkpoint_advisory.py --checkpoint_time 20 --job_size 100
./checkpoint_advisory.py -c 5 -s 4096
./checkpoint_advisory.py --checkpoint_time 2 --job_size 10000 -n $PBS_JOBNAME -i $PBS_JOBID --requested_time $PBS_WALLTIME
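When the tool is driven from a Python workflow script rather than typed by hand, the same command lines can be assembled programmatically. The following is a minimal sketch that uses only the flags listed in the usage help; the helper name build_advisory_command is hypothetical, and the script is assumed to sit in the current directory as in the examples.

```python
def build_advisory_command(checkpoint_minutes, job_size, job_name=None,
                           job_id=None, requested_time=None,
                           script="./checkpoint_advisory.py"):
    """Build the argument list for one invocation of the tool."""
    # Required arguments
    cmd = [script,
           "--checkpoint_time", str(checkpoint_minutes),
           "--job_size", str(job_size)]
    # Optional arguments are appended only when supplied
    if job_name is not None:
        cmd += ["--job_name", str(job_name)]
    if job_id is not None:
        cmd += ["--job_id", str(job_id)]
    if requested_time is not None:
        cmd += ["--requested_time", str(requested_time)]
    return cmd


print(" ".join(build_advisory_command(20, 100)))
# ./checkpoint_advisory.py --checkpoint_time 20 --job_size 100
```

The resulting list can be passed to subprocess.run with capture_output=True and text=True. Reading the interval from standard output is an assumption, based on the job script below capturing the tool's output in a shell variable.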

Example Job Integration with Checkpoint Advisory Tool

#!/bin/bash
#PBS -N Job_Name
#PBS -A Account_Number
#PBS -l walltime=05:00:00
#PBS -l nodes=1000
#PBS -o outputlog
#PBS -e errorlog

# Change to the working directory
cd "$MEMBERWORK/my_directory"

# Time to checkpoint is given in minutes; the returned
# optimal interval for checkpointing (OIC) is also in minutes
OIC=$(./checkpoint_advisory.py --checkpoint_time 0.20 --job_size 1000)
echo "Optimal Interval for Checkpointing (Oh-I-See): $OIC"

aprun -n 1200 -N 16 ./app_name_A

exit 0
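Once the interval is known, the application may need to translate it into a number of iterations between checkpoints, as noted in the description. The following is a minimal sketch, assuming the application knows roughly how long one iteration takes; iterations_per_checkpoint and seconds_per_iteration are hypothetical names, not part of the tool.

```python
import math


def iterations_per_checkpoint(interval_minutes, seconds_per_iteration):
    """Convert a checkpoint interval in minutes into an iteration count.

    Rounds down so checkpoints are never further apart than the
    interval, and always returns at least one iteration.
    """
    if seconds_per_iteration <= 0:
        raise ValueError("seconds_per_iteration must be positive")
    iterations = math.floor(interval_minutes * 60.0 / seconds_per_iteration)
    return max(1, iterations)


# A 30-minute interval with 12-second iterations
print(iterations_per_checkpoint(30, 12))  # 150
```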

Versions

Available Versions

System    Application/Version
Titan     checkpointadvisory/1.0.1