A Temporal workflow crashes mid-execution, and you restart it from the last checkpoint. The agent’s internal state says “files A, B, C have been refactored,” but the crash happened after the agent updated its state record and before it actually wrote file C to disk. The agent skips C because it believes C is already done, and the resulting codebase is broken in exactly the place the agent thinks it fixed. Or a LangGraph pipeline restarts after a Redis eviction, and the checkpoint is three steps behind what was actually executed. The agent re-executes those three steps and applies the same mutations twice to a database whose writes are not idempotent.
Common causes
1. State is written before the side effect completes
The agent writes “step 3 complete” to its state store, then crashes before the actual file write or API call finishes. On restart, the state says “done” but the effect never happened. This is a classic write-before-commit ordering bug.
How to spot it: Look for any code where state.mark_done(step) or an equivalent call runs before, not after, the actual file write, API call, or database mutation. Either the state update and the side effect must be atomic, or the state update must come last.
2. Checkpoint is too coarse — covers multiple steps in one atomic unit
The checkpoint saves every 10 steps, but the crash happened at step 7. On restart, the agent replays the batch from step 1, re-executing steps 1-6, which already completed, and step 7, which may have partially run. If those steps are not idempotent (e.g., they append to a file, increment a counter, or send an email), re-execution causes duplicate side effects.
How to spot it: Find the checkpoint granularity in your workflow code. If a single checkpoint covers multiple side-effecting operations, any crash mid-checkpoint causes a desync between the checkpoint state and the real world.
3. External state changed between checkpoint and resume
The agent checkpoints “database schema = v4.” While the agent was stopped, a human engineer manually ran a migration to v5. The agent resumes from checkpoint and proceeds as if the schema is v4, generating migration SQL that is now wrong.
How to spot it: Compare the external state at resume time against the external state recorded in the checkpoint. Any difference is a desync. This requires a “pre-resume verification” step that validates checkpoint assumptions before execution continues.
4. In-memory state not persisted — only persists in process memory
Some pipeline frameworks accumulate state in a Python dict or object in memory. A process restart wipes this entirely. The agent restarts from scratch, re-doing all prior work — or worse, partially re-doing it and creating duplicates.
How to spot it: Search for state stores that are plain Python dicts (state = {}) or class attributes without any persistence call. If the state would be lost on kill -9, it is not persisted.
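As a contrast to a plain dict, here is a minimal sketch of a step-state store that survives a process kill. It assumes SQLite is an acceptable backing store; the class name DurableStepState and its methods are hypothetical, not part of any framework mentioned here.

```python
import json
import sqlite3


class DurableStepState:
    """Hypothetical step-state store backed by SQLite instead of a dict."""

    def __init__(self, path: str):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS steps ("
            "step_id TEXT PRIMARY KEY, result TEXT NOT NULL)"
        )
        self._conn.commit()

    def mark_done(self, step_id: str, result: dict) -> None:
        # `with conn` commits on success; the commit is what makes the
        # record visible to a new process after a restart.
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO steps (step_id, result) VALUES (?, ?)",
                (step_id, json.dumps(result)),
            )

    def is_done(self, step_id: str) -> bool:
        row = self._conn.execute(
            "SELECT 1 FROM steps WHERE step_id = ?", (step_id,)
        ).fetchone()
        return row is not None

    def close(self) -> None:
        self._conn.close()
```

Reopening the same database file in a new process and calling is_done is the quickest way to confirm that the state really outlives the process.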
5. Clock skew between checkpoint timestamp and current time
The checkpoint records “last_updated: 2026-05-25T14:30:00Z.” The agent wakes up, compares against the current time, and determines that dependencies modified after that timestamp need to be re-processed. If the system clock was wrong when the checkpoint was written, the agent either skips genuinely modified dependencies or re-processes unmodified ones.
How to spot it: Check whether your resume logic uses timestamps for change detection. If so, verify NTP synchronization across all machines involved in the workflow.
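One way to sidestep clock skew altogether is to detect changes by content rather than by timestamp. The sketch below assumes the checkpoint stores a mapping of file path to SHA-256 hash; the function names are illustrative.

```python
import hashlib
from pathlib import Path


def content_hash(path: Path) -> str:
    """SHA-256 of the file's bytes; independent of any machine's clock."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_dependencies(recorded: dict[str, str]) -> list[str]:
    """Return paths whose current content differs from the recorded hash."""
    changed = []
    for path_str, old_hash in recorded.items():
        path = Path(path_str)
        # A file that has disappeared counts as changed.
        if not path.exists() or content_hash(path) != old_hash:
            changed.append(path_str)
    return changed
```

Hashing costs more I/O than comparing modification times, so this trade-off suits workflows where correctness on resume matters more than resume latency.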
6. Partial checkpoint write — checkpoint itself is incomplete
The checkpoint write was interrupted mid-serialization (crash, OOM kill, disk full). On restart, the checkpoint reads back but some fields are at their default/zero values. The agent resumes with a mix of real state and default state that never existed together.
How to spot it: Add a checksum or is_complete: true field to every checkpoint. On load, verify the checksum or check the completion flag before trusting any checkpoint data.
Shortest path to fix
Step 1: Add a pre-resume state verification step
Before resuming from any checkpoint, verify that the real world matches the checkpoint’s assumptions:
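import os

# Assumed helpers, not defined in this guide: hash_file(path) returns a content
# hash for a file, and db.get_value(key) reads the live value from your database.
def verify_checkpoint(checkpoint: dict) -> list[str]:
    """Return a description of every mismatch between checkpoint and reality."""
    discrepancies = []
    # Files: compare each recorded hash with the file's current hash.
    for file_path, expected_hash in checkpoint.get("file_hashes", {}).items():
        actual = hash_file(file_path) if os.path.exists(file_path) else None
        if actual != expected_hash:
            discrepancies.append(
                f"{file_path}: expected {expected_hash}, got {actual}"
            )
    # Database: compare each recorded value with the live value.
    for key, expected_val in checkpoint.get("db_state", {}).items():
        actual_val = db.get_value(key)
        if actual_val != expected_val:
            discrepancies.append(
                f"DB {key}: expected {expected_val}, got {actual_val}"
            )
    return discrepancies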
If discrepancies are found, do not auto-resume — alert a human to reconcile first.
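verify_checkpoint calls a hash_file helper that this guide does not define, and it leaves the "do not auto-resume" decision to the caller. A possible shape for both is sketched below. ResumeBlockedError and require_clean_resume are hypothetical names, and SHA-256 is an assumed choice of hash.

```python
import hashlib


class ResumeBlockedError(RuntimeError):
    """Hypothetical error: the checkpoint no longer matches the real world."""


def hash_file(path: str, chunk_size: int = 65536) -> str:
    """Stream the file through SHA-256 so large files are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def require_clean_resume(discrepancies: list[str]) -> None:
    """Refuse to continue when verification found any mismatch."""
    if discrepancies:
        raise ResumeBlockedError(
            "Checkpoint does not match current state:\n" + "\n".join(discrepancies)
        )
```

A resume entry point would then call require_clean_resume(verify_checkpoint(checkpoint)) before running any step, and route the raised error to whatever alerting the team already uses.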
Step 2: Write state AFTER side effects, not before
# WRONG: state written before effect
def execute_step(step):
    state.mark_done(step.id)  # crash here = state says done, effect not done
    write_file(step.output)

# CORRECT: state written after effect
def execute_step(step):
    write_file(step.output)   # effect first
    state.mark_done(step.id)  # crash here = state says not done, effect done (safe to re-run if idempotent)
This ensures that a crash leaves the state one step behind reality, not one step ahead.
Step 3: Make side effects idempotent so re-execution is safe
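import json
import os

def write_file_idempotent(path: str, content: str, expected_hash: str):
    # expected_hash must be the hash of `content`, computed the same way as hash_file.
    if os.path.exists(path):
        actual_hash = hash_file(path)
        if actual_hash == expected_hash:
            return  # already written correctly, skip
    with open(path, 'w') as f:
        f.write(content)

# For database operations:
def upsert_record(table: str, key: str, value: dict):
    # INSERT ... ON CONFLICT DO UPDATE is idempotent: a re-run overwrites the
    # row with the same data. `table` is interpolated into the SQL text, so it
    # must come from trusted code, never from user or model input. The `?`
    # placeholders are SQLite-style; other drivers may expect %s.
    db.execute(
        f"INSERT INTO {table} (key, data) VALUES (?, ?) "
        "ON CONFLICT(key) DO UPDATE SET data=excluded.data",
        (key, json.dumps(value))
    )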
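upsert_record above depends on a db object that is not defined here. To see the idempotency claim hold end to end, the following sketch swaps in Python's built-in sqlite3 module with an in-memory database; it assumes a SQLite build of version 3.24 or newer, which is when the ON CONFLICT upsert clause became available. The table name records is illustrative.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (key TEXT PRIMARY KEY, data TEXT NOT NULL)")


def upsert_record(conn: sqlite3.Connection, key: str, value: dict) -> None:
    """Insert the row, or overwrite it if the key already exists."""
    with conn:  # commit on success, roll back on error
        conn.execute(
            "INSERT INTO records (key, data) VALUES (?, ?) "
            "ON CONFLICT(key) DO UPDATE SET data = excluded.data",
            (key, json.dumps(value)),
        )


# Simulate a step that ran once before the crash and again after the restart.
upsert_record(conn, "user:1", {"status": "migrated"})
upsert_record(conn, "user:1", {"status": "migrated"})

row_count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
stored = json.loads(conn.execute("SELECT data FROM records").fetchone()[0])
```

After both calls the table holds a single row, which is the property that makes replaying the step harmless.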
Step 4: Fine-grain the checkpoint boundary to one side effect per checkpoint
for step in steps:
    execute_step(step)  # one side effect
    checkpoint.save(step_id=step.id, state=current_state)  # immediately after
Each step is independently checkpointed. A restart re-runs at most one step (the last one), not a batch.
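The loop above shows the save side only. The resume side also has to skip steps that the checkpoint already lists as finished. A minimal sketch follows, assuming the checkpoint is simply a set of completed step IDs and that save_completed is whatever persistence function you already have; both names are hypothetical.

```python
from typing import Callable, Iterable


def run_steps(
    steps: Iterable[tuple[str, Callable[[], None]]],
    completed: set[str],
    save_completed: Callable[[set[str]], None],
) -> None:
    """Run each step at most once per recorded completion."""
    for step_id, action in steps:
        if step_id in completed:
            continue  # finished before the restart, do not repeat
        action()                   # exactly one side effect
        completed.add(step_id)
        save_completed(completed)  # checkpoint immediately after the effect
```

If the process dies inside action() or before save_completed returns, only that one step is repeated on the next run, which is why the step itself still needs to be idempotent.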
Step 5: Add a checksum to every checkpoint record
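import hashlib
import json

class CorruptedCheckpointError(Exception):
    """Raised when a checkpoint is truncated, incomplete, or fails its checksum."""

def save_checkpoint(state: dict, path: str):
    payload = json.dumps(state, sort_keys=True, default=str)
    checksum = hashlib.sha256(payload.encode()).hexdigest()
    with open(path, 'w') as f:
        # default=str must match the payload above; without it, non-JSON values
        # (datetimes, paths) would raise here after the file is already truncated.
        json.dump({"state": state, "checksum": checksum, "is_complete": True}, f, default=str)

def load_checkpoint(path: str) -> dict:
    try:
        with open(path) as f:
            record = json.load(f)
    except json.JSONDecodeError as exc:
        # A write interrupted mid-serialization usually leaves invalid JSON.
        raise CorruptedCheckpointError(f"Unparseable checkpoint at {path}") from exc
    if not record.get("is_complete"):
        raise CorruptedCheckpointError(f"Incomplete checkpoint at {path}")
    payload = json.dumps(record["state"], sort_keys=True, default=str)
    actual_checksum = hashlib.sha256(payload.encode()).hexdigest()
    if actual_checksum != record["checksum"]:
        raise CorruptedCheckpointError(f"Checksum mismatch at {path}")
    return record["state"]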
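A checksum detects a damaged checkpoint but does not prevent one. save_checkpoint opens the final path with mode 'w', which truncates the previous checkpoint before the new one is fully written. A common complement is to write to a temporary file in the same directory and then rename it over the target. The sketch below uses only standard-library calls; the function name write_json_atomically is hypothetical.

```python
import json
import os
import tempfile


def write_json_atomically(record: dict, path: str) -> None:
    """Write record to path so readers see the old file or the new one, never a partial."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(record, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # swaps the new file in over the old one
    except BaseException:
        # Serialization or I/O failed: discard the temp file, keep the old checkpoint.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this in place, a crash during a save leaves the previous checkpoint untouched, so the checksum check becomes a second line of defense rather than the only one.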
Prevention
- Always write state AFTER side effects complete — never before.
- Checkpoint at the smallest granularity your storage allows — one checkpoint per side-effecting step.
- Add a pre-resume verification step that confirms the real world matches checkpoint assumptions before continuing execution.
- Make every side effect idempotent so re-execution on restart produces the same result.
- Add checksums and a completion flag to checkpoint records; reject and alert on incomplete or corrupted checkpoints.
- Include the hash of key external artifacts (files, schema versions, API endpoint versions) in the checkpoint, not just internal state.
- Test your resume path explicitly in CI: write a test that crashes the workflow mid-run and verifies correct resume.
- Never use in-process memory as the only state store for any workflow longer than a single LLM call.
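The CI recommendation in the list above can start very small. The sketch below is a toy workflow, not a real framework integration: it appends one line per step to a log file (the side effect), records finished steps in a JSON file (the state), and uses a hypothetical SimulatedCrash exception in place of a real process kill.

```python
import json
import os
import tempfile
from typing import Optional


class SimulatedCrash(Exception):
    """Stands in for a process kill in this sketch."""


def run_workflow(workdir: str, crash_before: Optional[str] = None) -> None:
    state_path = os.path.join(workdir, "state.json")
    log_path = os.path.join(workdir, "effects.log")
    done = []
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = json.load(f)
    for step in ("a", "b", "c"):
        if step in done:
            continue  # already completed in an earlier run
        if step == crash_before:
            raise SimulatedCrash(step)
        with open(log_path, "a") as f:    # side effect first
            f.write(step + "\n")
        done.append(step)
        with open(state_path, "w") as f:  # state record second
            json.dump(done, f)


def test_resume_after_crash() -> None:
    with tempfile.TemporaryDirectory() as workdir:
        try:
            run_workflow(workdir, crash_before="c")
        except SimulatedCrash:
            pass
        run_workflow(workdir)  # resume from the persisted state
        with open(os.path.join(workdir, "effects.log")) as f:
            # Every step ran exactly once across the crash and the resume.
            assert f.read().splitlines() == ["a", "b", "c"]
```

A fuller version would also inject the crash between the side effect and the state write, which is the case that exercises idempotency rather than just skipping.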
FAQ
Q: Does Temporal guarantee no state desync? A: Temporal provides durable execution with event sourcing: the workflow history is the authoritative state, and activity results are recorded in that history before they are returned to workflow code. This eliminates most desync scenarios. However, activities that produce external side effects (file writes, external API calls) must still be idempotent, because Temporal retries an activity that fails or times out, for example when a worker dies mid-activity, so the same activity can run more than once.
Q: How do I handle a checkpoint that is partially valid — some fields are correct and some are wrong? A: Do not try to merge a partial checkpoint with the real world automatically. Archive the corrupted checkpoint, run the pre-resume verification, and let the workflow replay from the last fully-valid checkpoint. Automatic partial merges produce hard-to-debug hybrid states.
Q: Can I use a database transaction to make state write and side effect atomic? A: If both the state update and the side effect are database operations, yes — wrap them in a single transaction. For mixed side effects (file writes + DB updates), transactional sagas with compensating transactions are the right pattern, though they add significant complexity.
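For the all-database case, a minimal sketch with sqlite3 is shown here. It assumes a counters table and a completed_steps table, both illustrative, and applies a non-idempotent increment together with the "step done" record in one transaction, so either both are committed or neither is.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS counters "
            "(name TEXT PRIMARY KEY, value INTEGER NOT NULL)"
        )
        conn.execute(
            "CREATE TABLE IF NOT EXISTS completed_steps (step_id TEXT PRIMARY KEY)"
        )
        conn.execute("INSERT OR IGNORE INTO counters (name, value) VALUES ('total', 0)")


def apply_step_once(conn: sqlite3.Connection, step_id: str, amount: int) -> bool:
    """Apply the increment and record the step atomically.

    Returns True if the step was applied, False if it was already recorded.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        already = conn.execute(
            "SELECT 1 FROM completed_steps WHERE step_id = ?", (step_id,)
        ).fetchone()
        if already:
            return False
        conn.execute(
            "UPDATE counters SET value = value + ? WHERE name = 'total'", (amount,)
        )
        conn.execute("INSERT INTO completed_steps (step_id) VALUES (?)", (step_id,))
    return True
```

Because the state record and the effect share one commit, there is no window in which the state is ahead of or behind the data, and a replay of the same step_id becomes a no-op.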
Q: How long should I keep old checkpoints? A: Keep at least the last 3 checkpoints per workflow run. A single retained checkpoint is vulnerable to a write failure during the next checkpoint save. Three checkpoints give you a fallback in case the most recent one is corrupted.
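A retention policy like this can be sketched in a few lines. The version below assumes checkpoints are JSON files named with a zero-padded sequence number in a single directory; the file naming scheme and both function names are hypothetical.

```python
import glob
import json
import os


def save_rotated(record: dict, directory: str, seq: int, keep: int = 3) -> str:
    """Write checkpoint number seq, then delete all but the newest `keep` files."""
    path = os.path.join(directory, f"checkpoint-{seq:08d}.json")
    with open(path, "w") as f:
        json.dump(record, f)
    # Zero-padded sequence numbers sort lexicographically in write order.
    existing = sorted(glob.glob(os.path.join(directory, "checkpoint-*.json")))
    for old in existing[:-keep]:
        os.remove(old)
    return path


def load_newest_valid(directory: str) -> dict:
    """Return the newest checkpoint that parses, falling back to older ones."""
    paths = sorted(glob.glob(os.path.join(directory, "checkpoint-*.json")), reverse=True)
    for path in paths:
        try:
            with open(path) as f:
                return json.load(f)
        except (OSError, json.JSONDecodeError):
            continue  # corrupted or unreadable: try the next older checkpoint
    raise FileNotFoundError(f"No readable checkpoint in {directory}")
```

Falling back to an older checkpoint means more steps are replayed, so this only works safely when those steps are idempotent and the pre-resume verification passes.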