Scheduled Cron Job Skipped Silently With No Error Logged

A scheduled job never fired and nothing showed up in logs. Fix it by going UTC-only, checking startingDeadlineSeconds, adding a heartbeat, and alerting on missed runs.

Published: May 24, 2026 Updated: Jun 18, 2026 Author: AI Productivity Guide Team

You scheduled a job to run at 02:00 every day. This morning the downstream report is empty, the inbox is silent, and there is nothing useful in the log. No error, no stack trace, just a missing run.

Fastest fix: check three things in order. (1) The scheduler’s timezone vs the one you assumed (run date -u inside the container). (2) On Kubernetes, whether the CronJob controller is logging Too many missed start time (> 100) and refusing to start anything. (3) Whether a previous run is still holding a lock. Then make silent skips loud: emit a heartbeat per run and alert when an expected run does not arrive.

Which bucket are you in?

Symptom you can observe	Most likely cause	Jump to
Job ran, but hours off from when you expected	Timezone mismatch	Cause 1
Missing runs cluster on March / November Sundays	DST shift swallowed the run	Cause 2
Controller events show `missed schedule`, K8s logs say `Too many missed start time`	Missed-deadline cutoff hit	Cause 3
Previous run visible in `pg_stat_activity` at the next schedule time	Lock contention with the prior run	Cause 4
`kubectl get cronjob` shows `SUSPEND: True`	Schedule paused/suspended	Cause 5

Common causes

Ordered by hit rate.

1. Timezone mismatch

Your cron string says 0 2 * * * and you assumed local time. The container or scheduler runs in UTC. The job fires at 02:00 UTC, which is 21:00 the previous day in a US Eastern office during standard time (22:00 during daylight time).

How to spot it: run date -u and date inside the container and compare the output to the time you expected, or check the timestamps in the scheduler log. On Kubernetes, do not put CRON_TZ= or TZ= inline in .spec.schedule. It has never been officially supported, and since v1.29 the API server rejects new CronJobs that do so with a validation error. Use the dedicated .spec.timeZone field instead (stable since Kubernetes v1.27).
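To make the offset concrete, here is a minimal Python sketch that converts a UTC fire time into local wall-clock time using the standard-library zoneinfo module. The helper name local_fire_time and the sample dates are illustrative assumptions, not part of any scheduler API.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def local_fire_time(year, month, day, utc_hour, utc_minute, tz_name):
    """Return the local wall-clock time at which a UTC schedule fires on a given UTC date."""
    fire_utc = datetime(year, month, day, utc_hour, utc_minute, tzinfo=timezone.utc)
    return fire_utc.astimezone(ZoneInfo(tz_name))


# "0 2 * * *" evaluated in UTC, viewed from a US Eastern office
winter = local_fire_time(2026, 1, 15, 2, 0, "America/New_York")
summer = local_fire_time(2026, 7, 15, 2, 0, "America/New_York")

print(winter.strftime("%Y-%m-%d %H:%M %Z"))  # 2026-01-14 21:00 EST
print(summer.strftime("%Y-%m-%d %H:%M %Z"))  # 2026-07-14 22:00 EDT
```

The same schedule lands on a different local hour in winter and summer, which is why a job can look "an hour off" for half the year.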

2. DST shift swallowed the run

A cron string targets 02:30 local. On the spring-forward Sunday, 02:30 does not exist, so many schedulers skip the run without logging anything. In the fall-back direction the same local time happens twice, and some schedulers fire the job twice.

How to spot it: missing (or doubled) runs always line up with DST boundaries, which fall in mid-March and early November in the US and in late March and late October in the EU.
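One way to check whether a particular local time is affected is to ask the timezone database directly. The sketch below is an illustration, assuming Python 3.9+ with zoneinfo; the function name classify_local_time is hypothetical. It compares the UTC offsets for the two possible readings of a wall-clock time (the fold attribute) and round-trips through UTC to tell a skipped time from a repeated one.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def classify_local_time(year, month, day, hour, minute, tz_name):
    """Return 'normal', 'missing' (spring-forward gap) or 'ambiguous' (fall-back repeat)."""
    tz = ZoneInfo(tz_name)
    naive = datetime(year, month, day, hour, minute)
    first = naive.replace(tzinfo=tz, fold=0)
    second = naive.replace(tzinfo=tz, fold=1)
    if first.utcoffset() == second.utcoffset():
        return "normal"
    # The offsets differ, so the time sits on a transition. A time inside the
    # gap does not survive a round trip through UTC; a repeated time does.
    round_trip = first.astimezone(timezone.utc).astimezone(tz).replace(tzinfo=None)
    return "ambiguous" if round_trip == naive else "missing"


print(classify_local_time(2026, 3, 8, 2, 30, "America/New_York"))   # missing
print(classify_local_time(2026, 11, 1, 1, 30, "America/New_York"))  # ambiguous
print(classify_local_time(2026, 6, 1, 2, 30, "America/New_York"))   # normal
```

Running this over the next year of fire times for a local-time schedule lists the dates on which a run could be skipped or doubled.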

3. Kubernetes hit the 100-missed-start cutoff

This is the classic silent Kubernetes skip. If the CronJob controller was down, the cluster was paused, or status.lastScheduleTime is stale, the controller counts how many scheduled times it missed. Once that count exceeds 100, it gives up and stops starting jobs entirely, logging:

Cannot determine if job needs to be started: Too many missed start time (> 100). Set or decrease .spec.startingDeadlineSeconds or check clock skew.

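Without .spec.startingDeadlineSeconds, the controller counts every miss between status.lastScheduleTime and now. How quickly that reaches 100 depends on the cadence: an every-minute job crosses it after about 100 minutes of downtime, an hourly job after a little over four days. Once past the cutoff, the CronJob never recovers on its own.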

How to spot it: kubectl describe cronjob <name> events, or controller-manager logs, contain the string above. kubectl get cronjob shows a stale LAST SCHEDULE.

Fix: set a startingDeadlineSeconds so the controller only looks at a bounded window, then nudge the schedule. With a 200-second deadline, the controller only checks the last 200 seconds of misses instead of all of history:

spec:
  startingDeadlineSeconds: 200

If the controller is already wedged, recreate the CronJob (kubectl delete cronjob <name> then re-apply) to reset its status.lastScheduleTime.
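The counting rule can be sketched in a few lines of Python. This is a simplified model for a fixed-interval schedule, written from the description above; it is not the controller's actual code, and the names missed_starts and is_wedged are hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_MISSED = 100  # the cutoff described above


def missed_starts(last_schedule, now, interval, starting_deadline_seconds=None):
    """Count scheduled times in the window the controller examines (simplified model)."""
    earliest = last_schedule
    if starting_deadline_seconds is not None:
        # A deadline bounds the window to the most recent N seconds
        earliest = max(earliest, now - timedelta(seconds=starting_deadline_seconds))
    if now <= earliest:
        return 0
    # Scheduled times are last_schedule + k * interval; count those in (earliest, now]
    return (now - last_schedule) // interval - (earliest - last_schedule) // interval


def is_wedged(missed):
    return missed > MAX_MISSED


now = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
last = now - timedelta(hours=3)   # controller was down for three hours
every_minute = timedelta(minutes=1)

print(missed_starts(last, now, every_minute))        # 180, over the cutoff
print(missed_starts(last, now, every_minute, 200))   # 4, well under it
```

With no deadline, three hours of downtime on an every-minute job is already enough to trip the cutoff; with a 200-second deadline, the same outage counts only the last few misses.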

4. Previous run still holding a lock

Job N took 25 hours. The scheduler starts job N+1 on time, but a database advisory lock or file lock is still held by job N. If job N+1 tries the lock in non-blocking mode and gets false back, it exits without logging anything.

How to spot it: a long-running predecessor is visible in pg_stat_activity or the process list around the next scheduled time. On Kubernetes with concurrencyPolicy: Forbid, the controller skips the new run by design and only records it in events.

5. Schedule disabled or paused

Someone clicked pause in the UI or set suspend: true and forgot. Or a deploy reset the schedule to a default.

How to spot it: kubectl get cronjob shows SUSPEND: True, or the Airflow DAG is paused (toggle off in the UI).

Shortest path to fix

Step 1: Go UTC-only everywhere

Pick UTC for every scheduler expression and document it. Convert to local time in the UI layer only.

# Kubernetes CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 9 * * *"            # 09:00 UTC = 02:00 PDT, 01:00 PST
  timeZone: "Etc/UTC"             # .spec.timeZone is stable since Kubernetes v1.27
  concurrencyPolicy: Forbid       # skip a new run if the prior one is still going
  startingDeadlineSeconds: 200    # bounded miss window, avoids the >100 lockup
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5

For systemd timers, set OnCalendar=*-*-* 09:00:00 UTC. Reminder: do not put CRON_TZ= or TZ= inside the Kubernetes schedule string; use timeZone instead.

Step 2: Emit a heartbeat metric every run

Every job emits a counter at the start and at successful end. Missing pulses are the signal.

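# Python with prometheus_client
from prometheus_client import CollectorRegistry, Counter, Gauge, pushadd_to_gateway

run_registry = CollectorRegistry()
success_registry = CollectorRegistry()
runs = Counter('cron_runs_total', 'Cron run count', ['job', 'phase'], registry=run_registry)
# Gauge read by the CronMissedRun alert in Step 3
last_success = Gauge('cron_last_success_timestamp', 'Unix time of the last successful run',
                     registry=success_registry)

def heartbeat(job, phase):
    runs.labels(job=job, phase=phase).inc()
    # pushadd (HTTP POST) replaces only same-named metrics in the group, so the
    # start push leaves the previous run's success timestamp in place
    pushadd_to_gateway('pushgateway:9091', job=job, registry=run_registry)
    if phase == 'success':
        last_success.set_to_current_time()
        pushadd_to_gateway('pushgateway:9091', job=job, registry=success_registry)

heartbeat('nightly-report', 'start')
do_work()
heartbeat('nightly-report', 'success')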

For lighter setups, use a dead-man-switch service such as Healthchecks.io (free tier: 20 checks, 3 months of log history as of June 2026) or Cronitor (free tier: 5 monitors). The job pings a unique URL after each successful run; if the ping does not arrive inside the expected window, the service alerts you. This catches the case where the job never started at all, which an in-process metric cannot.
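A ping can be as small as one HTTP GET at the end of the job. The sketch below uses only the standard library and is an assumption about how you might wire it up: the function name ping, the retry count, and the timeout are illustrative, and the real ping URL comes from whichever service you choose.

```python
import urllib.request


def ping(url, opener=urllib.request.urlopen, attempts=3, timeout=10):
    """Send a success ping; return True once any attempt gets a 2xx response."""
    for _ in range(attempts):
        try:
            with opener(url, timeout=timeout) as resp:
                if 200 <= resp.status < 300:
                    return True
        except OSError:
            # URLError, HTTPError and socket timeouts are all OSError subclasses
            continue
    return False
```

Call it as the last step of the job, after the work has succeeded, for example ping("https://example.com/ping/your-check-id") with the placeholder replaced by your check's URL. A failed ping should not fail the job itself; the monitoring service will alert on the missing ping either way.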

Step 3: Alert on missed schedules

# Prometheus alert rule
- alert: CronMissedRun
  expr: |
    time() - max(cron_last_success_timestamp{job="nightly-report"}) > 90000
  for: 5m
  labels:
    severity: page
  annotations:
    summary: "nightly-report has not succeeded in over 25 hours"

Use a window slightly larger than the cadence: 90000 seconds (25h) for a daily job, 4500 seconds (75min) for an hourly one.
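The thresholds are just the cadence plus a grace period. A small Python sketch of the arithmetic, with hypothetical helper names, shows where the two numbers come from and how the alert condition is evaluated:

```python
def alert_threshold_seconds(cadence_seconds, grace_seconds):
    """Alert window = how often the job runs + how late it may finish."""
    return cadence_seconds + grace_seconds


def is_overdue(last_success_ts, now_ts, threshold_seconds):
    """Same comparison the alert rule makes: time() - last_success > threshold."""
    return (now_ts - last_success_ts) > threshold_seconds


DAILY = alert_threshold_seconds(24 * 3600, 3600)    # 90000 s = 25 h
HOURLY = alert_threshold_seconds(3600, 15 * 60)     # 4500 s = 75 min

print(DAILY, HOURLY)
```

Pick the grace period from the job's normal runtime plus scheduling jitter; too small a value pages on slow-but-healthy runs.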

Step 4: Guard against overlap with explicit locks

-- Postgres advisory lock, fails fast if already held
SELECT pg_try_advisory_lock(hashtext('nightly-report'));
-- returns true if acquired, false if a previous run is still going

# Python with psycopg 3; DSN is your connection string
import psycopg

def main():
    with psycopg.connect(DSN) as conn:
        acquired, = conn.execute(
            "SELECT pg_try_advisory_lock(hashtext(%s))", ['nightly-report']
        ).fetchone()
        if not acquired:
            # Make the skip visible instead of exiting silently
            print("previous run still holding lock, skipping")
            return
        try:
            run_job()
        finally:
            conn.execute("SELECT pg_advisory_unlock(hashtext(%s))", ['nightly-report'])

main()

Emit a separate metric for “lock contention skip” so a silent skip becomes a visible data point rather than a gap.
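For jobs without a database, the same skip-and-count pattern works with a file lock. The following is a minimal sketch for Unix-like systems using fcntl.flock; the names try_lock, run_exclusively, and the in-memory skip_counts dictionary are hypothetical stand-ins, and in practice the counter would be a real metric pushed to your monitoring system.

```python
import fcntl
import os

skip_counts = {}  # hypothetical stand-in for a "lock contention skip" metric


def try_lock(path):
    """Try to take an exclusive lock without blocking; return the fd or None."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def run_exclusively(job, lock_path, work):
    """Run work() under the lock; count and report the skip if the lock is taken."""
    fd = try_lock(lock_path)
    if fd is None:
        skip_counts[job] = skip_counts.get(job, 0) + 1
        print(f"{job}: skipped, previous run still holds {lock_path}")
        return False
    try:
        work()
        return True
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

The return value lets the caller decide whether a skip should change the exit code, and the counter turns each skip into a data point you can alert on.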

Step 5: Check scheduler state in deploys

Add a smoke check to CI/CD: after deploy, verify no CronJob is suspended unintentionally.

kubectl get cronjob -o json \
  | jq -r '.items[] | select(.spec.suspend==true) | .metadata.name'

Block the deploy if the output is non-empty and not on an allowlist.
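The allowlist comparison is easier to maintain in a small script than in shell. Here is a sketch in Python that takes the JSON printed by kubectl get cronjob -o json and returns the suspended CronJobs that are not expected to be suspended; the function name and the allowlist contents are assumptions for illustration.

```python
import json


def unexpected_suspended(cronjob_list_json, allowlist=()):
    """Return sorted names of suspended CronJobs that are not on the allowlist."""
    allowed = set(allowlist)
    items = json.loads(cronjob_list_json).get("items", [])
    suspended = [
        item["metadata"]["name"]
        for item in items
        if item.get("spec", {}).get("suspend") is True
    ]
    return sorted(name for name in suspended if name not in allowed)


sample = json.dumps({"items": [
    {"metadata": {"name": "nightly-report"}, "spec": {"suspend": True}},
    {"metadata": {"name": "legacy-export"}, "spec": {"suspend": True}},
    {"metadata": {"name": "hourly-sync"}, "spec": {"suspend": False}},
]})

print(unexpected_suspended(sample, allowlist=["legacy-export"]))  # ['nightly-report']
```

In a pipeline, feed the function the kubectl output from standard input and exit non-zero when the returned list is not empty.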

How to confirm it’s fixed

Trigger a manual run and confirm it completes: kubectl create job --from=cronjob/nightly-report manual-test-1 (Kubernetes), then kubectl get jobs shows COMPLETIONS 1/1.
Confirm both heartbeats landed: query cron_runs_total{job="nightly-report"} and check the start and success phases both incremented.
Confirm the controller is not still wedged: kubectl describe cronjob nightly-report should show a recent Last Schedule Time and no Too many missed start time event.
Wait one real scheduled cycle and confirm the dead-man-switch ping arrived on time.

Prevention

UTC everywhere in code and config; humans see local time only in dashboards.
Always set startingDeadlineSeconds on Kubernetes CronJobs so a paused or lagging controller cannot trip the silent > 100 cutoff.
Every job emits start and success heartbeats; alert on missed pulses with an external dead-man switch, not just an in-process metric.
Use concurrencyPolicy: Forbid (Kubernetes) or explicit advisory locks to make overlap behavior explicit.
Avoid scheduling between 02:00 and 03:00 local time; the US and Central European DST transitions both land in that hour, and other zones shift within an hour of it.
Keep a single source of truth for the cron catalog and review it monthly.

FAQ

Why was there no error at all in my application logs? Most silent skips happen before your code runs: the scheduler never created the process. Timezone drift, the Kubernetes > 100 missed-start cutoff, a Forbid concurrency skip, and a suspended schedule all live in the scheduler or controller layer, so your app logs stay clean. Check the scheduler’s own events (kubectl describe cronjob, controller-manager logs, Airflow scheduler logs) first.

What does Too many missed start time (> 100) actually mean? The Kubernetes CronJob controller compares now against status.lastScheduleTime and counts how many scheduled times it missed. If that count exceeds 100 it stops trying and logs Cannot determine if job needs to be started: Too many missed start time (> 100). The cure is to set startingDeadlineSeconds (which bounds the window it counts) and, if it is already stuck, recreate the CronJob to reset lastScheduleTime.

Should I set startingDeadlineSeconds? Yes, on Kubernetes. Without it, the controller has no bounded window and counts every miss since the last schedule, which is exactly how jobs hit the > 100 lockup after a pause or controller restart. A value modestly larger than your job’s start latency (for example 200) is a safe default.

Heartbeat metric or dead-man switch: which one? Both, and they catch different failures. An in-process Prometheus counter proves the job ran and how far it got. An external dead-man switch (Healthchecks.io, Cronitor) proves it ran at all and on time, which is the only thing that catches “the scheduler never fired.”

Why does everyone say to avoid 2-3 AM local time? Because the spring-forward DST transition deletes a wall-clock hour in that range, so a job scheduled at 02:30 local simply has no instant to run on that day and is skipped. Schedule in UTC, or pick a time outside 02:00-03:00, and the ambiguity disappears.

Tags: #Backend #Troubleshooting #cron
