Daemon processes
Some workflows may benefit from or require a process that continues to run even when the user is not interactively logged into an OLCF system. Such a process is often referred to as a daemon. There are two main ways in which user daemons may run at the OLCF: through looped chained jobs and through cron. The examples below highlight the use of each method.
Looped Chained Jobs
A looped chained job is a batch job script that recursively submits itself such that the next job runs only after the current job has finished. For daemon processes a node size of 0 may be specified on Titan. Such jobs run on a single service node of the compute resource and do not affect the compute allocation. Inside the batch job script, the environment variable PBS_JOBID may be used in conjunction with the qsub flag -W depend=afterok to submit a job that will run only after the current job has finished. This method creates a chain of size 0 jobs that run back to back indefinitely. Some machines, such as the DTNs, may not support a job of size 0, so #PBS -l nodes=1 may be required in the example script below.
In the example below, the Python script daemon.py is run by the job script daemon_launcher.pbs on a service node of the compute resource to which the job is submitted:
daemon.py
import time
from datetime import datetime

# Print a timestamp every 30 seconds, forever.
while True:
    print("The time is: " + str(datetime.now()))
    time.sleep(30)
daemon_launcher.pbs
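#PBS -l walltime=24:00:00
#PBS -l nodes=0
#PBS -A PRJ123

# Submit the next link in the chain; it is held until this job finishes
qsub -W depend=afterok:$PBS_JOBID daemon_launcher.pbs

# Set up the module environment and load Python
source $MODULESHOME/init/bash
module load python

# Run the daemon unbuffered, appending stdout and stderr to daemon.out
python -u $HOME/daemon.py >> $HOME/daemon.out 2>&1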
To disable the daemon, use qdel to delete both the running job and the pending job.
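The resubmission step in daemon_launcher.pbs hinges on how the dependency string is assembled from PBS_JOBID. As a minimal sketch only, the helper below (build_resubmit_command is a hypothetical name, not part of any OLCF tooling) builds the same qsub argument list in Python without actually submitting anything. The job ID in the usage line is a made-up value for illustration.

```python
import os


def build_resubmit_command(script_path, job_id=None):
    """Return the qsub argument list that chains script_path after job_id."""
    # Inside a running batch job, PBS exposes the job's own ID as PBS_JOBID.
    if job_id is None:
        job_id = os.environ.get("PBS_JOBID")
    if not job_id:
        raise RuntimeError("PBS_JOBID is not set; not running inside a batch job?")
    # Same form as the launcher script: -W depend=afterok:<current job id>
    return ["qsub", "-W", "depend=afterok:" + job_id, script_path]


if __name__ == "__main__":
    # Hypothetical job ID, used purely for illustration.
    print(" ".join(build_resubmit_command("daemon_launcher.pbs", "12345.example")))
```

The returned list could be passed to subprocess.call from a job that already runs Python, but the shell one-liner in the launcher script remains the simpler choice.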
Cron
The cron utility may be used to schedule periodic system jobs. Cron is not available directly on Titan, Eos, or Rhea, but it is available on the Interactive Data Transfer Nodes and could potentially be used with cross-system submission to enable automated workflows. The example below shows a simple method for using cron to ensure that a single instance of the Python script daemon.py is always running:
daemon.py
import time
from datetime import datetime

# Print a timestamp every 30 seconds, forever.
while True:
    print("The time is: " + str(datetime.now()))
    time.sleep(30)
The goal is to create a cron entry that checks whether daemon.py is running and starts it if it is not. We can use flock(1) to do so. With the command below, flock attempts to place a lock on the specified file, /tmp/$USER.lockfile, and runs the command given by the -c flag if the file is successfully locked. If the file is already locked, the -n flag makes flock return immediately with exit code 1 without running the command given by -c.
With this setup, the system checks once every minute whether our daemon.py process is running and starts it if it is not. Standard output and standard error are redirected to daemon.out for our daemon process and to cron.out for the flock command. Note that the use of /tmp is deliberate: flock should not be used to lock files in NFS or Lustre areas.
The cron entry looks like this:
* * * * * flock -n /tmp/$USER.lockfile -c "python -u $HOME/daemon.py >> $HOME/daemon.out 2>&1" >> $HOME/cron.out 2>&1
To disable the daemon, remove the crontab entry and then kill the running processes started with flock -c.
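To illustrate what the non-blocking lock taken by flock -n amounts to, the sketch below performs the equivalent check from inside Python with the standard fcntl module. It is an illustration rather than the documented OLCF method: try_lock and the lock file name example_daemon.lockfile are hypothetical, and the lock file is assumed to live on a local file system such as /tmp.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Return an open file holding an exclusive lock on path, or None if it is already locked."""
    handle = open(path, "a")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting, like flock -n.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except (IOError, OSError):
        handle.close()
        return None
    return handle


if __name__ == "__main__":
    # Hypothetical lock file name, placed in the local temporary directory.
    lock_path = os.path.join(tempfile.gettempdir(), "example_daemon.lockfile")
    lock = try_lock(lock_path)
    if lock is None:
        print("another instance holds the lock; exiting")
    else:
        print("lock acquired; the daemon loop would run here")
        lock.close()  # closing the file releases the lock
```

Because the lock is tied to the open file, it is released automatically when the process exits, which is what lets the next cron invocation restart a daemon that has died.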