Scheduled Cron Job Skipped Silently With No Error Logged

A scheduled job never fired and nothing showed up in logs. Fix by going UTC-only, adding heartbeat metrics, and alerting on missed execution counts.

You scheduled a job to run at 02:00 every day. This morning the downstream report is empty, the inbox is silent, and there is nothing useful in the log. No error, no stack trace, just a missing run. The most common reasons are a timezone mismatch between the scheduler and your assumption, lock contention with a previous instance, or at-most-once semantics where a lost lease meant nobody picked up the slot. Fix it by standardizing on UTC, emitting a heartbeat metric per run, and alerting when expected runs do not arrive.

Common causes

Listed roughly from most to least frequent.

1. Timezone mismatch

Your cron string says 0 2 * * * and you assumed local time, but the container or scheduler runs in UTC. The job fires at 02:00 UTC, which is, for example, 21:00 the previous day in an office on US Eastern Standard Time (UTC-5).

How to spot it: Run date inside the container and compare it with the time you expected, or check the timestamps in the scheduler log.
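
To see the offset concretely, here is a minimal sketch that converts a UTC firing time into the wall-clock time of another zone. The helper name local_fire_time, the zone America/New_York, and the date are illustrative assumptions, not values taken from any particular setup.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def local_fire_time(utc_hour, utc_minute, day, zone_name):
    """Return the local wall-clock time at which a UTC-scheduled job fires."""
    year, month, day_of_month = day
    fired_utc = datetime(year, month, day_of_month, utc_hour, utc_minute,
                         tzinfo=timezone.utc)
    return fired_utc.astimezone(ZoneInfo(zone_name))


# "0 2 * * *" evaluated in UTC, viewed from New York in winter (UTC-5).
local = local_fire_time(2, 0, (2024, 1, 15), "America/New_York")
print(local.isoformat())  # 2024-01-14T21:00:00-05:00
```

If the printed time is not the time you had in mind when writing the cron string, the schedule and your assumption disagree about the timezone.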

2. DST shift swallowed the run

The cron string targets 02:30 local time. On the spring-forward Sunday, 02:30 does not exist, so the run is skipped silently. In the fall-back direction the hour repeats, and some schedulers run the job twice.

How to spot it: Missing runs always align with DST boundaries.
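
A quick way to check whether a given local schedule time is affected is to classify it against the zone's DST rules. The sketch below assumes the America/Los_Angeles zone and 2024 dates purely as examples; classify_local_time is a hypothetical helper, not part of any scheduler.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def classify_local_time(naive, zone_name):
    """Classify a naive wall-clock time as 'missing', 'ambiguous', or 'normal'."""
    zone = ZoneInfo(zone_name)
    first = naive.replace(tzinfo=zone, fold=0)
    second = naive.replace(tzinfo=zone, fold=1)
    # A wall time skipped by spring-forward does not survive a UTC round trip.
    round_trip = first.astimezone(timezone.utc).astimezone(zone)
    if round_trip.replace(tzinfo=None) != naive:
        return "missing"
    # A wall time repeated by fall-back maps to two different UTC offsets.
    if first.utcoffset() != second.utcoffset():
        return "ambiguous"
    return "normal"


zone = "America/Los_Angeles"
print(classify_local_time(datetime(2024, 3, 10, 2, 30), zone))   # missing
print(classify_local_time(datetime(2024, 11, 3, 1, 30), zone))   # ambiguous
print(classify_local_time(datetime(2024, 6, 1, 2, 30), zone))    # normal
```

A "missing" result is the skipped-run case; "ambiguous" is the case where a scheduler may fire twice.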

3. Previous run still holding a lock

Job N took 25 hours. The scheduler tries to start job N+1 but a database advisory lock or file lock is still held. Job N+1 exits silently if the lock is non-blocking.

How to spot it: Long-running predecessor visible in pg_stat_activity or process list around the next scheduled time.

4. At-most-once with a lost lease

A distributed scheduler (e.g. Kubernetes CronJob with concurrencyPolicy: Forbid, or Temporal/Airflow with leases) lost track of the worker. With at-most-once semantics the slot is not retried, and nothing alerts.

How to spot it: Worker pod crashed shortly before the schedule; controller events show “missed schedule”.

5. Schedule disabled or paused

Someone clicked pause in the UI or set suspend: true and forgot. Or a deploy reset the schedule to default.

How to spot it: kubectl get cronjob shows SUSPEND: True, or Airflow DAG is paused.

Shortest path to fix

Step 1: Go UTC-only everywhere

Pick UTC for every scheduler expression and document it. Convert in the UI layer only.

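# Kubernetes CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 9 * * *"   # 09:00 UTC = 02:00 PDT, 01:00 PST
  timeZone: "Etc/UTC"     # timeZone field is stable since Kubernetes 1.27
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
  jobTemplate:            # required field; the container below is a placeholder
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: nightly-report
              image: registry.example.com/nightly-report:latest   # hypothetical image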

For systemd timers, set OnCalendar=*-*-* 09:00:00 UTC.

Step 2: Emit a heartbeat metric every run

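Every job pushes a timestamp gauge at the start and at the successful end of each run. Missing pulses are the signal.

# Python with prometheus_client
from prometheus_client import CollectorRegistry, Gauge, pushadd_to_gateway

def heartbeat(job, phase):
    # A gauge holding a Unix timestamp stays meaningful across short-lived
    # processes; a counter would restart from zero on every run.
    registry = CollectorRegistry()
    gauge = Gauge(f'cron_last_{phase}_timestamp',
                  f'Unix time of the last {phase} of a cron run',
                  registry=registry)
    gauge.set_to_current_time()
    # pushadd replaces only metrics with the same name in the group, so the
    # start push does not wipe the previous success timestamp.
    pushadd_to_gateway('pushgateway:9091', job=job, registry=registry)

heartbeat('nightly-report', 'start')    # sets cron_last_start_timestamp
do_work()                               # your job body
heartbeat('nightly-report', 'success')  # sets cron_last_success_timestamp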

For lighter setups, use a dead-man switch service (Healthchecks.io, Cronitor); these alert when the expected ping does not arrive.

Step 3: Alert on missed schedules

# Prometheus alert rule
- alert: CronMissedRun
  expr: |
    time() - max(cron_last_success_timestamp{job="nightly-report"}) > 90000
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "nightly-report has not succeeded in over 25 hours"

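Use a window slightly larger than the cadence (25h for daily, 75min for hourly). The expression returns nothing if the job has never reported, so consider a second rule on absent(cron_last_success_timestamp{job="nightly-report"}).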
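
The thresholds are just the cadence plus a grace period. As a rough sketch of the arithmetic (missed_run is a hypothetical helper; the grace values mirror the 25h and 75min figures above):

```python
def missed_run(last_success_ts, now_ts, cadence_s, grace_s):
    """True when the last success is older than one cadence plus a grace period."""
    return (now_ts - last_success_ts) > (cadence_s + grace_s)


HOUR = 3600
DAILY = 24 * HOUR

daily_threshold = DAILY + HOUR       # 25 h
hourly_threshold = HOUR + 15 * 60    # 75 min

print(daily_threshold)   # 90000
print(hourly_threshold)  # 4500
```

The daily value, 90000 seconds, is the number used in the alert expression.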

Step 4: Guard against overlap with explicit locks

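-- Postgres advisory lock, fails fast if already held
SELECT pg_try_advisory_lock(hashtext('nightly-report'));
-- returns true if acquired, false if a previous run is still going

# Python with psycopg 3; DSN and run_job() are assumed to be defined elsewhere
import psycopg

def run_with_lock(dsn, name):
    # autocommit avoids holding an idle transaction open for the whole job
    with psycopg.connect(dsn, autocommit=True) as conn:
        acquired, = conn.execute(
            "SELECT pg_try_advisory_lock(hashtext(%s))", [name]).fetchone()
        if not acquired:
            print("previous run still holding lock, skipping")
            return False
        try:
            run_job()
        finally:
            # session-level lock: release it explicitly before disconnecting
            conn.execute("SELECT pg_advisory_unlock(hashtext(%s))", [name])
        return True

run_with_lock(DSN, 'nightly-report')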

Emit a separate metric for “lock contention skip” so silent skips become visible.
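
As an illustration of making the skip visible, the sketch below swaps the Postgres advisory lock for a non-blocking file lock so it runs with the standard library alone (Unix only). The skip_counts dictionary is a stand-in for a real metrics client, and run_exclusive and the lock path are hypothetical names.

```python
import fcntl
import os
import tempfile

skip_counts = {}  # stand-in for a real "lock contention skip" metric


def run_exclusive(lock_path, job_name, work):
    """Run work() under a non-blocking file lock; record a skip if it is held."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            # Count and log the skip instead of exiting silently.
            skip_counts[job_name] = skip_counts.get(job_name, 0) + 1
            print(f"{job_name}: skipped, lock {lock_path} is held")
            return "skipped"
        work()
        return "ran"
    finally:
        os.close(fd)  # closing the descriptor also releases the flock


lock_file = os.path.join(tempfile.gettempdir(), "nightly-report.lock")
results = []


def overlapping_attempt():
    # Simulates the next scheduled run starting while this one holds the lock.
    results.append(run_exclusive(lock_file, "nightly-report", lambda: None))


outer = run_exclusive(lock_file, "nightly-report", overlapping_attempt)
print(outer, results, skip_counts)
```

The same shape applies to the advisory-lock version: the branch that gives up is the place to increment a counter and write a log line.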

Step 5: Check scheduler state in deploys

Add a smoke check to CI/CD: after deploy, verify no CronJob is suspended unintentionally.

kubectl get cronjob -o json \
  | jq -r '.items[] | select(.spec.suspend==true) | .metadata.name'

Block the deploy if the output is non-empty and not on an allowlist.
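
The allowlist comparison could look something like the following sketch. It assumes the input is the JSON printed by kubectl get cronjob -o json; unexpected_suspended and the sample names are hypothetical.

```python
import json


def unexpected_suspended(cronjob_list_json, allowlist):
    """Return names of suspended CronJobs that are not on the allowlist."""
    items = json.loads(cronjob_list_json).get("items", [])
    return sorted(
        item["metadata"]["name"]
        for item in items
        if item.get("spec", {}).get("suspend") is True
        and item["metadata"]["name"] not in allowlist
    )


# Sample input shaped like `kubectl get cronjob -o json` output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "nightly-report"}, "spec": {"suspend": True}},
        {"metadata": {"name": "legacy-export"}, "spec": {"suspend": True}},
        {"metadata": {"name": "hourly-sync"}, "spec": {"suspend": False}},
    ]
})

offenders = unexpected_suspended(sample, allowlist={"legacy-export"})
print(offenders)  # ['nightly-report']
```

In a pipeline, the script would read the kubectl output from stdin and exit non-zero when the returned list is non-empty.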

Prevention

  • UTC everywhere in code and config; humans see local time only in dashboards.
  • Every job emits start and success heartbeats; alert on missed pulses.
  • Use concurrencyPolicy: Forbid (Kubernetes) or explicit advisory locks to make overlap behavior explicit.
  • If a schedule must use local time, avoid the 01:00-03:00 window; most DST transitions fall inside it.
  • Keep a single source of truth for the cron catalog; review monthly.

Tags: #Backend #Troubleshooting #cron