Daemon processes

Some workflows may benefit from, or require, a process that continues to run even when the user is not interactively logged into an OLCF system; such a process is often referred to as a daemon. There are two main ways to run user daemons at the OLCF: looped chained jobs and cron. The examples below highlight each method.

Please note that daemon processes run on shared resources; memory- or compute-intensive daemons are prohibited.

Looped Chained Jobs

A looped chained job is a batch job script that resubmits itself so that the next job runs only after the current one has finished. For daemon processes, a node size of 0 may be specified on Titan; such jobs run on a single service node of the compute resource and will not affect the project's compute allocation. Inside the batch job script, the environment variable PBS_JOBID may be used in conjunction with the qsub flag -W depend=afterok to submit a job that will run only after the current job has finished. This method creates a chain of size-0 jobs that run back to back indefinitely. Some machines, such as the DTNs, may not support a job of size 0, in which case #PBS -l nodes=1 may be required in the example script below.
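In its generic form, the resubmission issued from inside the running job looks like the following; the script name here is only a placeholder:

qsub -W depend=afterok:$PBS_JOBID my_daemon_job.pbs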

In the example below, the python script daemon.py is run by the job script daemon_launcher.pbs on a service node of the compute resource to which the job is submitted:

daemon.py

import time
from datetime import datetime

while True:
    print "The time is: " + str(datetime.now())
    time.sleep(30)

daemon_launcher.pbs

#PBS -l walltime=24:00:00
#PBS -l nodes=0
#PBS -A PRJ123

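# Submit the next copy of this script; it stays queued until the current
# job ($PBS_JOBID) completes successfully (depend=afterok).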
qsub -W depend=afterok:$PBS_JOBID daemon_launcher.pbs

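# Set up the environment and start the daemon; -u disables output
# buffering so daemon.out is updated as the script prints.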
source $MODULESHOME/init/bash
module load python
python -u $HOME/daemon.py >> $HOME/daemon.out 2>&1

To disable the daemon, use qdel to delete both the running job and the pending dependent job.
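For example, the job IDs can be listed with qstat and then removed; the IDs below are placeholders:

qstat -u $USER      # list your running and queued jobs
qdel 123456 123457  # delete the running job and its dependent job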

Cron

The OLCF does not recommend the use of cron unless absolutely necessary; please use looped chained jobs as described above if possible. The crontab is not backed up, and the OLCF provides no guarantee that cron jobs will run when scheduled. Scheduled cron jobs may be silently disabled if they are disruptive to the system. During DTN OS upgrades the crontab will be lost, so please keep a backup in your NFS home area, for example as shown below.
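A simple way to keep such a backup is to dump the crontab to a file in your home area and reinstall it after an upgrade; the file name here is only a suggestion:

crontab -l > $HOME/crontab.backup   # save the current crontab to your NFS home area
crontab $HOME/crontab.backup        # reinstall it after an upgrade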

The cron utility may be used to schedule periodic jobs. Cron is not available directly on Titan, Eos, or Rhea, but it is available on the Interactive Data Transfer Nodes and could be combined with cross-system submission to enable automated workflows. The example below shows a simple method for using cron to ensure that a single instance of the python script daemon.py is always running:

daemon.py

import time
from datetime import datetime

while True:
    print "The time is: " + str(datetime.now())
    time.sleep(30)

The goal is to create a cron entry that checks whether daemon.py is running and starts it if it is not. We can use the flock(1) utility to do so. With the command below, flock attempts to place a lock on the specified file, /tmp/$USER.lockfile, and runs the command specified by the -c flag if the lock is acquired. If the file is already locked, the -n flag specifies that flock returns with exit code 1 rather than attempting to run the command specified by -c.

Such a setup checks once every minute whether our daemon.py process is running; if it is not running, it will be started. Standard output and error are redirected to daemon.out for the daemon process and to cron.out for the flock command. Note that the use of /tmp is deliberate: flock should not be used to lock files in NFS or Lustre areas.

The cron entry will look like this:

* * * * * flock -n /tmp/$USER.lockfile -c "python -u $HOME/daemon.py >> $HOME/daemon.out 2>&1" >> $HOME/cron.out 2>&1
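To install the entry on an Interactive Data Transfer Node, the standard crontab utility can be used; this is a minimal sketch:

crontab -e   # open your crontab in an editor and add the line above
crontab -l   # verify that the entry was installed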

To disable the daemon, remove the crontab entry and then kill the running process started by flock -c.
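For example, the crontab entry can be removed in an editor and the daemon stopped with a pattern-based kill; pkill -f and the pattern below are only one possible approach:

crontab -e                     # delete the daemon.py line from your crontab
pkill -u $USER -f daemon.py    # kill the running daemon.py process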