You restart a long-running Temporal or LangGraph workflow from its saved checkpoint after a server restart. The agent loads the checkpoint, reads completed_steps = 7, and proceeds from step 8. But the artifacts dict in the checkpoint is missing the output of step 5, because the checkpoint write was interrupted mid-serialization by an OOM kill. The agent tries to use the missing artifact in step 8, gets a KeyError, and crashes. Now you are stuck: the checkpoint is corrupted, the original state is gone, and all seven completed steps would need to be re-run to recover. Corrupted checkpoints are rare but catastrophic when they occur.
Common causes
1. Checkpoint write interrupted mid-serialization
The most common cause. The process receives a SIGKILL (OOM, instance shutdown, container eviction) while writing a large checkpoint to disk or a database. The file or database row is left holding a partial JSON blob. The next load usually fails with a parse error on the truncated tail. If the writer updates the record in several separate writes, the load can instead succeed on JSON that is valid but missing fields, which is the harder case to notice.
How to spot it: Check the checkpoint file for a valid JSON structure. If the file ends mid-string, mid-array, or with null values where there should be data, the write was interrupted. Look for recent OOM events in system logs correlating with the checkpoint timestamp.
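The check above can be scripted. The following is a minimal sketch, assuming the checkpoint is a single JSON document; the helper name inspect_checkpoint is hypothetical and not part of any framework.

```python
import json


def inspect_checkpoint(path: str) -> dict:
    """Report whether a checkpoint file parses as JSON and, if not, where parsing failed."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        # exc.pos is the character offset at which the parser gave up
        return {"valid": False, "size": len(text), "error_offset": exc.pos, "error": exc.msg}
    # Parsed fine: list top-level keys whose value is null, a hint of an incomplete write
    null_keys = [k for k, v in data.items() if v is None] if isinstance(data, dict) else []
    return {"valid": True, "size": len(text), "null_keys": null_keys}
```

An error_offset close to size suggests a truncated tail; an offset near the start suggests the file was overwritten with something that is not JSON at all.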
2. Concurrent writes from multiple processes corrupt the checkpoint
Two processes (a checkpoint writer and a recovery monitor) both open the same checkpoint file for writing simultaneously. One overwrites the other’s data. Depending on OS buffering, the two writes may interleave at the byte level and break the JSON outright. If the writes land whole but out of order, the result is harder to detect: the checkpoint is structurally intact JSON, but its fields belong to different runs.
How to spot it: Check whether more than one process has write access to checkpoint storage. Log the PID of every checkpoint write. If two different PIDs wrote to the same checkpoint path within a short time window, concurrent writes happened.
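One way to both log the writer’s PID and refuse a second concurrent writer is a lock file created with O_EXCL. This is a sketch under stated assumptions: the exclusive_writer name and the .lock suffix are hypothetical, and it only protects writers that all go through this helper on a local filesystem.

```python
import contextlib
import logging
import os

log = logging.getLogger("checkpoint")


@contextlib.contextmanager
def exclusive_writer(checkpoint_path: str):
    """Hold a lock file next to the checkpoint for the duration of a write."""
    lock_path = checkpoint_path + ".lock"  # hypothetical naming convention
    try:
        # O_EXCL makes creation fail if another process already holds the lock
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        with open(lock_path) as f:
            holder = f.read().strip() or "unknown"
        raise RuntimeError(f"{checkpoint_path} is being written by PID {holder}")
    try:
        # Record our PID in the lock file and in the log
        os.write(fd, str(os.getpid()).encode())
        log.info("checkpoint write start path=%s pid=%d", checkpoint_path, os.getpid())
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)
```

A process killed while holding the lock leaves the lock file behind, so a stale lock has to be cleared deliberately. That is a design choice here: a leftover lock is itself evidence that a write was interrupted.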
3. Checkpoint schema version mismatch after code deployment
The agent code was updated. The new version expects state["artifacts"]["step_5"]["type"] (nested). The checkpoint was written by the old version and has state["step_5_artifact"] (flat). The load appears to succeed, but fields are in the wrong shape for the new code. The agent proceeds with a structurally valid but semantically wrong state.
How to spot it: Check whether the checkpoint includes a schema version field. If it does not, or if the version in the checkpoint doesn’t match the version the code expects, schema drift has occurred.
4. Serialization handles non-serializable types silently
The state object contains a Python datetime, a numpy array, or a custom object. The JSON serializer uses default=str, which converts datetime(2026, 5, 25) to "2026-05-25 00:00:00". On load, the deserialized value is a string, not a datetime. Code that calls .isoformat() or .date() on it crashes with an AttributeError; code that only compares or formats the value silently gets a wrong result (for example, value == datetime(2026, 5, 25) is now False).
How to spot it: Compare the Python types of every field in the state object before serialization and after deserialization. Any type change (e.g., datetime → str) indicates silent type coercion.
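The comparison can be automated with a JSON round trip. This is a minimal sketch that only inspects top-level fields; the function name find_type_changes and the sample state are illustrative assumptions.

```python
import json
from datetime import datetime


def find_type_changes(state: dict) -> dict:
    """Return {key: (type_before, type_after)} for top-level fields whose type changes in a JSON round trip."""
    restored = json.loads(json.dumps(state, default=str))
    changes = {}
    for key, before in state.items():
        after = restored.get(str(key))  # JSON object keys always come back as strings
        if type(before) is not type(after):
            changes[key] = (type(before).__name__, type(after).__name__)
    return changes


state = {"completed_steps": 7, "started_at": datetime(2026, 5, 25), "tags": ("a", "b")}
print(find_type_changes(state))
# {'started_at': ('datetime', 'str'), 'tags': ('tuple', 'list')}
```

Running this in a unit test against a representative state object flags coercions before they reach a real checkpoint.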
5. Storage backend returns stale/inconsistent data
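Redis without AOF persistence loses every write since the last RDB snapshot on a restart. An eventually consistent object store can return a stale version immediately after a write. Amazon S3 itself has provided strong read-after-write consistency since December 2020, but some S3-compatible stores, caches, and cross-region replicas do not. In each case the storage backend returns successfully but the data is old or partially written.
How to spot it: Check your checkpoint storage backend’s durability settings. Redis without AOF persistence and databases with fsync disabled can lose acknowledged writes on a crash; S3 without object versioning keeps no earlier copy to roll back to after a bad overwrite. Verify by writing a checkpoint and immediately reading it back before trusting the backend. A read-back catches stale or partial reads, but it does not prove the write would survive a restart.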
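The write-then-read-back check can be sketched independently of any particular backend. The put and get callables below are assumptions standing in for whatever client your storage uses; write_and_verify and ReadBackMismatch are hypothetical names.

```python
from typing import Callable, Optional


class ReadBackMismatch(Exception):
    """The backend acknowledged a write but returned different data on read."""


def write_and_verify(
    put: Callable[[str, bytes], None],
    get: Callable[[str], Optional[bytes]],
    key: str,
    payload: bytes,
) -> None:
    """Write payload under key, read it straight back, and fail loudly on any difference."""
    put(key, payload)
    stored = get(key)
    if stored != payload:
        got = "nothing" if stored is None else f"{len(stored)} bytes"
        raise ReadBackMismatch(f"{key!r}: wrote {len(payload)} bytes, read back {got}")


# Usage with a dict standing in for the real backend client
store = {}
write_and_verify(store.__setitem__, store.get, "run-42", b'{"completed_steps": 7}')
```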
6. Checkpoint is compressed and compression codec is unavailable on load
The checkpoint was written with lz4 compression. The server was rebuilt without the lz4 library. The load silently reads the compressed bytes as raw text and produces garbage data, or the decompressor raises an exception that is caught and treated as “no checkpoint — start fresh.”
How to spot it: Check whether checkpoint read failures trigger “no checkpoint found” fallback behavior. If they do, a load error is indistinguishable from a missing checkpoint, and the pipeline starts from scratch silently.
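A loader can keep the two cases apart by treating only a missing file as "no checkpoint" and raising on everything else. A minimal sketch for an uncompressed JSON checkpoint; load_or_none and CheckpointUnreadable are hypothetical names.

```python
import json


class CheckpointUnreadable(Exception):
    """The checkpoint exists but could not be decoded."""


def load_or_none(path: str):
    """Return the parsed checkpoint, or None only when no file exists at all."""
    try:
        with open(path, "rb") as f:
            raw = f.read()
    except FileNotFoundError:
        return None  # genuinely no checkpoint: starting fresh is legitimate
    try:
        return json.loads(raw)
    except ValueError as exc:
        # Covers JSONDecodeError and UnicodeDecodeError; never report this as "not found"
        raise CheckpointUnreadable(f"{path}: {exc}") from exc
```

With this split, compressed bytes read by a build that lacks the codec surface as CheckpointUnreadable instead of silently restarting the pipeline.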
Shortest path to fix
Step 1: Validate checkpoint integrity on every load
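import json, hashlib

class CorruptedCheckpointError(Exception):
    """The checkpoint is truncated, incomplete, or fails its checksum."""

class SchemaVersionMismatch(Exception):
    """The checkpoint was written under a different schema version."""

def load_checkpoint_safe(path: str, schema_version: int) -> dict:
    try:
        with open(path) as f:
            record = json.load(f)
    except json.JSONDecodeError as exc:
        # A truncated write usually surfaces here as a parse error
        raise CorruptedCheckpointError(f"Unparseable checkpoint at {path}: {exc}") from exc
    # Check completeness flag
    if not isinstance(record, dict) or not record.get("is_complete"):
        raise CorruptedCheckpointError(f"Incomplete checkpoint at {path}")
    # Check schema version
    saved_version = record.get("schema_version")
    if saved_version != schema_version:
        raise SchemaVersionMismatch(
            f"Checkpoint schema v{saved_version} != code schema v{schema_version}"
        )
    # Verify checksum over the same canonical form the writer hashed
    state_blob = json.dumps(record.get("state"), sort_keys=True, default=str)
    expected = hashlib.sha256(state_blob.encode()).hexdigest()
    if record.get("checksum") != expected:
        raise CorruptedCheckpointError(f"Checksum mismatch at {path}")
    return record["state"]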
Step 2: Write checkpoints atomically using a temp-file swap
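import hashlib, json, os, tempfile
from datetime import datetime, timezone

def save_checkpoint_atomic(path: str, state: dict, schema_version: int):
    # Round-trip once so the checksum covers exactly what a later json.load() returns
    # (values coerced by default=str and keys turned into strings included)
    state = json.loads(json.dumps(state, default=str))
    state_blob = json.dumps(state, sort_keys=True)
    checksum = hashlib.sha256(state_blob.encode()).hexdigest()
    record = {
        "state": state,
        "checksum": checksum,
        "schema_version": schema_version,
        "is_complete": True,
        "saved_at": datetime.now(timezone.utc).isoformat(),
    }
    # Write to temp file in the same directory (same filesystem → atomic rename)
    dir_name = os.path.dirname(os.path.abspath(path))
    with tempfile.NamedTemporaryFile(
        mode='w', dir=dir_name, delete=False, suffix='.tmp'
    ) as tmp:
        json.dump(record, tmp)
        # Force the bytes to disk before the rename so a crash cannot
        # leave an empty or partial file under the final name
        tmp.flush()
        os.fsync(tmp.fileno())
        tmp_path = tmp.name
    # Atomic rename: either the new file is there or the old one is, never a partial write
    os.replace(tmp_path, path)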
os.replace() is atomic on POSIX systems when src and dst are on the same filesystem.
Step 3: Keep the last 3 checkpoint generations
import os

def rotate_checkpoints(base_path: str, state: dict, schema_version: int):
    # Rotate: checkpoint.2.json → checkpoint.3.json, .1 → .2, current → .1
    for i in range(3, 0, -1):
        src = f"{base_path}.{i-1}.json" if i > 1 else f"{base_path}.json"
        dst = f"{base_path}.{i}.json"
        if os.path.exists(src):
            os.replace(src, dst)
    save_checkpoint_atomic(f"{base_path}.json", state, schema_version)
If the current checkpoint is corrupted, fall back to .1.json, then .2.json, then .3.json.
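The fallback order can be sketched as a loop over the generation files. This is an illustration, not a fixed API: load_with_fallback is a hypothetical name, and the load argument is assumed to be any function that returns the state or raises when a file is unusable (for example, a validating loader bound to the expected schema version).

```python
import logging
from typing import Callable, Tuple

log = logging.getLogger("checkpoint")


def load_with_fallback(
    base_path: str, load: Callable[[str], dict], generations: int = 3
) -> Tuple[dict, str]:
    """Try base_path.json, then .1.json up to .N.json; return (state, path) of the first that loads."""
    candidates = [f"{base_path}.json"]
    candidates += [f"{base_path}.{i}.json" for i in range(1, generations + 1)]
    errors = []
    for path in candidates:
        try:
            state = load(path)
        except FileNotFoundError:
            continue  # this generation was never written
        except Exception as exc:  # corrupt, wrong schema, failed checksum, ...
            log.error("checkpoint %s rejected: %s", path, exc)
            errors.append(f"{path}: {exc}")
            continue
        if path != candidates[0]:
            log.warning("loaded fallback checkpoint %s", path)
        return state, path
    raise RuntimeError("no usable checkpoint generation; " + "; ".join(errors))
```

Every rejected generation is logged, and exhausting all of them raises instead of starting fresh, so the failure stays visible.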
Step 4: Add a schema migration function for version mismatches
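SCHEMA_VERSION = 3

def migrate_checkpoint(state: dict, from_version: int, to_version: int) -> dict:
    # Refuse unknown or newer-than-code versions instead of returning them unchanged
    if from_version is None or from_version > to_version:
        raise ValueError(f"Cannot migrate checkpoint from v{from_version} to v{to_version}")
    if from_version == 1 and to_version >= 2:
        # v1 → v2: flatten artifact structure
        for k, v in state.pop("artifacts_nested", {}).items():
            state[f"artifact_{k}"] = v
    if from_version <= 2 and to_version >= 3:
        # v2 → v3: add missing "completed_at" field
        state.setdefault("completed_at", {})
    return state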
Never silently load a mismatched version — migrate explicitly.
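As the number of versions grows, one possible arrangement is a registry with one function per version step, applied in order. The sketch below reuses the two example migrations from this step; the MIGRATIONS registry, the migration decorator, and upgrade are hypothetical names.

```python
from typing import Callable, Dict

# Hypothetical registry: MIGRATIONS[n] upgrades a state dict from version n to n + 1
MIGRATIONS: Dict[int, Callable[[dict], dict]] = {}


def migration(from_version: int):
    """Decorator that registers a single-step migration."""
    def register(fn):
        MIGRATIONS[from_version] = fn
        return fn
    return register


@migration(1)
def v1_to_v2(state: dict) -> dict:
    # Flatten the nested artifact structure
    for k, v in state.pop("artifacts_nested", {}).items():
        state[f"artifact_{k}"] = v
    return state


@migration(2)
def v2_to_v3(state: dict) -> dict:
    # Add the "completed_at" field if it is missing
    state.setdefault("completed_at", {})
    return state


def upgrade(state: dict, from_version: int, to_version: int) -> dict:
    """Apply every registered step between the two versions, failing on any gap."""
    if from_version > to_version:
        raise ValueError(f"checkpoint v{from_version} is newer than code v{to_version}")
    for v in range(from_version, to_version):
        if v not in MIGRATIONS:
            raise ValueError(f"no migration registered from v{v} to v{v + 1}")
        state = MIGRATIONS[v](state)
    return state
```

A missing step or a checkpoint newer than the code raises, which keeps the "migrate explicitly" rule enforceable as versions accumulate.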
Step 5: Test checkpoint corruption recovery in CI
# Simulate an interrupted write by cutting the checkpoint down to its first 512 bytes
# (truncate is part of GNU coreutils)
truncate -s 512 checkpoints/run-42.json
# Verify the pipeline detects corruption and falls back to checkpoint .1.json
python run_pipeline.py --run-id run-42 --resume
# Expected: "Loaded fallback checkpoint .1.json — proceeding from step 5"
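The same fixture can be built from Python when the CI suite runs under pytest. This is a sketch only: truncate_checkpoint is a hypothetical helper, tmp_path is pytest’s built-in temporary-directory fixture, and the test checks parse-level detection rather than your pipeline’s actual fallback path.

```python
import json
import os


def truncate_checkpoint(path: str, keep_fraction: float = 0.5) -> int:
    """Cut a file down to a fraction of its size, mimicking an interrupted write. Returns the new size."""
    size = os.path.getsize(path)
    new_size = int(size * keep_fraction)
    os.truncate(path, new_size)
    return new_size


def test_truncated_checkpoint_is_rejected(tmp_path):
    path = tmp_path / "run-42.json"
    path.write_text(json.dumps({"completed_steps": 7, "artifacts": {"step_5": "x" * 100}}))
    truncate_checkpoint(str(path))
    try:
        with open(path) as f:
            json.load(f)
    except json.JSONDecodeError:
        return  # corruption detected, as expected
    raise AssertionError("truncated checkpoint parsed without error")
```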
Prevention
- Write checkpoints atomically using temp-file-then-rename; never write directly to the checkpoint path.
- Include a checksum, completion flag, and schema version in every checkpoint record.
- Keep at least 3 checkpoint generations (ring buffer rotation); never delete a checkpoint until 2 newer ones are verified valid.
- Validate checksum and schema version on every checkpoint load before trusting any field in the state.
- Write checkpoint schema migration functions for every schema change; never silently load a mismatched version.
- Use a durable checkpoint backend (Redis with AOF + fsync, PostgreSQL, S3 with versioning) — not in-memory stores or eventually-consistent buckets.
- Test checkpoint corruption recovery in CI: truncate a checkpoint mid-file and verify the fallback path works.
- Never catch checkpoint load exceptions with “start fresh” fallback — log them loudly and require a human decision before proceeding.
FAQ
Q: Does Temporal handle checkpoint integrity automatically?
A: Temporal uses event sourcing: the workflow history is the checkpoint, and it is stored in a durable database (PostgreSQL, MySQL, or Cassandra) with transactional writes. Partial writes are rolled back. Schema changes require explicit versioning, for example with workflow.GetVersion() in the Go SDK. For Temporal, checkpoint corruption is essentially eliminated; the risk moves to history replay fidelity.
Q: How large can a checkpoint be before I need to use a different storage approach?
A: Keep checkpoints under 1 MB for file-based storage. Beyond that, use a database or object store. LangGraph’s SQLite checkpointer handles multi-MB states efficiently. For very large state (generated code files, analysis results), store the large blobs externally and checkpoint only references (file paths, S3 keys).
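A minimal sketch of the reference approach, assuming blobs go to a local directory and only top-level str or bytes values are candidates. The function name externalize_large_values, the "$blob" marker key, and the 64 KiB threshold are illustrative choices, not a LangGraph or Temporal convention.

```python
import hashlib
import os


def externalize_large_values(state: dict, blob_dir: str, threshold: int = 64 * 1024) -> dict:
    """Return a copy of state in which large str/bytes values are replaced by file references."""
    os.makedirs(blob_dir, exist_ok=True)
    slim = {}
    for key, value in state.items():
        data = value.encode("utf-8") if isinstance(value, str) else value
        if isinstance(data, bytes) and len(data) > threshold:
            # Name the blob after its content hash so identical blobs are stored once
            digest = hashlib.sha256(data).hexdigest()
            blob_path = os.path.join(blob_dir, digest)
            with open(blob_path, "wb") as f:
                f.write(data)
            slim[key] = {"$blob": blob_path, "sha256": digest, "is_text": isinstance(value, str)}
        else:
            slim[key] = value
    return slim
```

Storing the hash alongside the path lets a later load verify the blob before use, in the same spirit as the checkpoint checksum.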
Q: Can I inspect a corrupted checkpoint without loading it?
A: Yes. Use python -m json.tool checkpoint.json to check basic JSON validity; on failure it reports the line, column, and character offset of the first syntax error. For truncated files, jq '.' checkpoint.json 2>&1 reports the line and column of the error. Either tells you how much of the checkpoint is intact and whether a manual repair is feasible.
Q: How do I handle a corruption that only affects one field out of 20?
A: Use the last valid checkpoint generation instead of trying to patch the corrupted one. Partial manual repairs are risky: you may fix the visible corruption and miss a secondary corruption introduced by the same event. Roll back cleanly to the previous generation.