You scheduled a job to run at 02:00 every day. This morning the downstream report is empty, the inbox is silent, and there is nothing useful in the log. No error, no stack trace, just a missing run. The most common reasons are a timezone mismatch between the scheduler and your assumption, lock contention with a previous instance, or at-most-once semantics where a lost lease meant nobody picked up the slot. Fix it by standardizing on UTC, emitting a heartbeat metric per run, and alerting when expected runs do not arrive.
Common causes
Ordered from most to least frequent.
1. Timezone mismatch
Your cron string says 0 2 * * * and you assumed local time. The container or scheduler runs in UTC. The job fires at 02:00 UTC, which for an office at UTC-5 is 21:00 the previous day.
How to spot it: run date inside the container and compare it with the time you expected, or check the timestamp in the scheduler log.
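To make the offset concrete, here is a minimal sketch that converts a UTC cron slot into office-local time. The fixed UTC-5 offset and the example date are assumptions chosen to match the 21:00 case above; substitute your own.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the office sits at a fixed UTC-5 offset (no DST handling here).
OFFICE = timezone(timedelta(hours=-5), name="UTC-5")

def local_fire_time(utc_hour: int, utc_minute: int, day: datetime) -> datetime:
    """Return the office-local time at which a UTC cron slot fires on `day`."""
    fire_utc = day.replace(hour=utc_hour, minute=utc_minute, second=0,
                           microsecond=0, tzinfo=timezone.utc)
    return fire_utc.astimezone(OFFICE)

# "0 2 * * *" evaluated in UTC on an arbitrary example date.
fired = local_fire_time(2, 0, datetime(2024, 1, 15))
print(fired.isoformat())  # 2024-01-14T21:00:00-05:00
```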
2. DST shift swallowed the run
Cron string targets 02:30 local. On the spring-forward Sunday, 02:30 does not exist, so the run is skipped silently. In the fall-back direction, a time inside the repeated hour occurs twice, and some schedulers run the job twice.
How to spot it: Missing runs always align with DST boundaries.
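A quick way to check whether a given wall-clock time exists on a given date is a round trip through UTC. This is a sketch using the standard zoneinfo module (it needs the system tz database); the zone and dates are assumed examples.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def exists_locally(naive: datetime, zone: str) -> bool:
    """True if the wall-clock time `naive` actually occurs in `zone`.

    A time inside a spring-forward gap does not survive a round trip
    through UTC: it comes back shifted by the size of the gap.
    """
    tz = ZoneInfo(zone)
    local = naive.replace(tzinfo=tz)
    round_trip = local.astimezone(timezone.utc).astimezone(tz)
    return round_trip.replace(tzinfo=None) == naive

# Assumed example: US clocks sprang forward on 2024-03-10.
print(exists_locally(datetime(2024, 3, 10, 2, 30), "America/Los_Angeles"))  # False
print(exists_locally(datetime(2024, 3, 11, 2, 30), "America/Los_Angeles"))  # True
```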
3. Previous run still holding a lock
Job N took 25 hours. The scheduler tries to start job N+1 but a database advisory lock or file lock is still held. Job N+1 exits silently if the lock is non-blocking.
How to spot it: Long-running predecessor visible in pg_stat_activity or process list around the next scheduled time.
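The silent exit is easy to reproduce with a non-blocking file lock. The sketch below is Unix-only (it uses fcntl), and the lock file path is hypothetical; the point is that the second attempt returns immediately, so the job has to log the skip itself.

```python
import fcntl
import os
import tempfile

def try_lock(path: str):
    """Return an open file holding an exclusive lock, or None if it is taken."""
    handle = open(path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle

# Hypothetical lock file in a fresh temporary directory.
lock_path = os.path.join(tempfile.mkdtemp(), "nightly-report.lock")

first = try_lock(lock_path)    # stands in for the long-running job N
second = try_lock(lock_path)   # job N+1 arriving at the next schedule
if second is None:
    print("job N+1 skipped: lock still held")
first.close()                  # closing the file releases the lock
```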
4. At-most-once with a lost lease
Distributed scheduler (e.g. Kubernetes CronJob with concurrencyPolicy: Forbid, or Temporal/Airflow with leases) lost track of the worker. No retry, no alert.
How to spot it: Worker pod crashed shortly before the schedule; controller events show “missed schedule”.
5. Schedule disabled or paused
Someone clicked pause in the UI or set suspend: true and forgot. Or a deploy reset the schedule to default.
How to spot it: kubectl get cronjob shows SUSPEND: True, or Airflow DAG is paused.
Shortest path to fix
Step 1: Go UTC-only everywhere
Pick UTC for every scheduler expression and document it. Convert in the UI layer only.
# Kubernetes CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 9 * * *"   # 09:00 UTC = 02:00 PDT, 01:00 PST
  timeZone: "Etc/UTC"     # Kubernetes 1.27+ supports timeZone
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
For systemd timers, set OnCalendar=*-*-* 09:00:00 UTC.
Step 2: Emit a heartbeat metric every run
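Every job pushes a timestamp at the start and at the successful end. Missing pulses are the signal.
# Python with prometheus_client
from prometheus_client import CollectorRegistry, Gauge, pushadd_to_gateway

def heartbeat(job, phase):
    # One gauge per phase (cron_last_start_timestamp, cron_last_success_timestamp).
    # A counter would restart at zero in every short-lived process.
    registry = CollectorRegistry()
    gauge = Gauge(f'cron_last_{phase}_timestamp',
                  f'Unix time of the last {phase}', registry=registry)
    gauge.set_to_current_time()
    # pushadd replaces only metrics with the same name, so a start push
    # does not wipe the previous success timestamp.
    pushadd_to_gateway('pushgateway:9091', job=job, registry=registry)

heartbeat('nightly-report', 'start')
do_work()
heartbeat('nightly-report', 'success')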
For lighter setups, use a dead-man switch service (Healthchecks.io, Cronitor); these alert when the expected ping does not arrive.
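A dead-man switch ping can be as small as one HTTP request after the job returns. The sketch below uses only the standard library; the URL is a placeholder, and the `opener` parameter is an illustrative hook for testing, not part of any service's API.

```python
import urllib.request

# Hypothetical ping URL; real services issue one per check.
PING_URL = "https://example.invalid/ping/nightly-report"

def ping(url: str, opener=urllib.request.urlopen, timeout: float = 10.0) -> bool:
    """Send one heartbeat ping; a monitoring failure must not crash the job."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return False

def run_with_ping(job, url: str = PING_URL, opener=urllib.request.urlopen) -> None:
    """Run `job` and ping only after it returns without raising."""
    job()
    ping(url, opener=opener)
```

Because the ping is sent only on success, a crashed or skipped run shows up as a missing ping on the service side.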
Step 3: Alert on missed schedules
# Prometheus alert rule
- alert: CronMissedRun
  expr: |
    time() - max(cron_last_success_timestamp{job="nightly-report"}) > 90000
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "nightly-report has not succeeded in over 25 hours"
Use a window slightly larger than the cadence (25h for daily, 75min for hourly).
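The thresholds are just cadence plus a grace period, expressed in seconds. A small sketch of the arithmetic, with hypothetical helper names, reproduces the two values mentioned above:

```python
from datetime import timedelta

def alert_threshold_seconds(cadence: timedelta, grace: timedelta) -> int:
    """Seconds of silence after which a job counts as missed."""
    return int((cadence + grace).total_seconds())

def is_missed(last_success: float, now: float, threshold: int) -> bool:
    """Mirror of the rule expression: now - last_success > threshold."""
    return now - last_success > threshold

daily = alert_threshold_seconds(timedelta(hours=24), timedelta(hours=1))
hourly = alert_threshold_seconds(timedelta(hours=1), timedelta(minutes=15))
print(daily, hourly)  # 90000 4500
```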
Step 4: Guard against overlap with explicit locks
-- Postgres advisory lock, fails fast if already held
SELECT pg_try_advisory_lock(hashtext('nightly-report'));
-- returns true if acquired, false if a previous run is still going
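# Python with psycopg 3
import psycopg

LOCK_KEY = 'nightly-report'

def main():
    # DSN is your connection string, defined elsewhere.
    with psycopg.connect(DSN) as conn:
        acquired, = conn.execute(
            "SELECT pg_try_advisory_lock(hashtext(%s))", [LOCK_KEY]).fetchone()
        if not acquired:
            print("previous run still holding lock, skipping")
            return
        try:
            run_job()
        finally:
            # Session-level advisory locks outlive transactions; release explicitly.
            conn.execute("SELECT pg_advisory_unlock(hashtext(%s))", [LOCK_KEY])

main()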
Emit a separate metric for “lock contention skip” so silent skips become visible.
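One way to structure this is to record exactly one outcome per invocation, with the lock skip as its own outcome. The sketch below is library-agnostic: the in-memory `outcomes` dict stands in for a real metrics client, and `run_guarded` and its callback parameters are hypothetical names.

```python
from collections import defaultdict
from typing import Callable

# In-memory stand-in for a metrics client; a real job would push these.
outcomes: dict = defaultdict(int)

def run_guarded(job: str, try_lock: Callable[[], bool],
                unlock: Callable[[], None], work: Callable[[], None]) -> str:
    """Run `work` under a lock and record one outcome per invocation."""
    if not try_lock():
        outcomes[(job, "skipped_lock")] += 1
        return "skipped_lock"
    try:
        work()
    except Exception:
        outcomes[(job, "failed")] += 1
        raise
    finally:
        unlock()  # runs on both success and failure
    outcomes[(job, "success")] += 1
    return "success"

# A run that finds the lock taken is counted instead of vanishing.
result = run_guarded("nightly-report", lambda: False, lambda: None, lambda: None)
print(result, dict(outcomes))
```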
Step 5: Check scheduler state in deploys
Add a smoke check to CI/CD: after deploy, verify no CronJob is suspended unintentionally.
kubectl get cronjob -o json \
| jq -r '.items[] | select(.spec.suspend==true) | .metadata.name'
Block the deploy if the output is non-empty and not on an allowlist.
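The allowlist comparison can be sketched in Python, operating on the JSON that kubectl get cronjob -o json prints. The allowlist contents and the sample job names are hypothetical.

```python
import json

# Hypothetical allowlist of CronJobs that may stay suspended.
ALLOWED_SUSPENDED = {"legacy-export"}

def unexpected_suspended(cronjob_list_json: str,
                         allowlist=ALLOWED_SUSPENDED) -> list[str]:
    """Names of suspended CronJobs that are not on the allowlist."""
    items = json.loads(cronjob_list_json).get("items", [])
    return sorted(
        item["metadata"]["name"]
        for item in items
        if item.get("spec", {}).get("suspend") is True
        and item["metadata"]["name"] not in allowlist
    )

# Sample input shaped like the kubectl output.
sample = json.dumps({"items": [
    {"metadata": {"name": "nightly-report"}, "spec": {"suspend": True}},
    {"metadata": {"name": "legacy-export"}, "spec": {"suspend": True}},
    {"metadata": {"name": "hourly-sync"}, "spec": {"suspend": False}},
]})
print(unexpected_suspended(sample))  # ['nightly-report']
```

In a pipeline, the script would read the kubectl output from stdin and exit non-zero when the returned list is non-empty.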
Prevention
- UTC everywhere in code and config; humans see local time only in dashboards.
- Every job emits start and success heartbeats; alert on missed pulses.
- Use concurrencyPolicy: Forbid (Kubernetes) or explicit advisory locks to make overlap behavior explicit.
- Avoid scheduling between 02:00 and 03:00 local time; many DST transitions fall in that window.
- Keep a single source of truth for the cron catalog; review monthly.