Restored Agent Checkpoint Is Corrupted

Your agent resumes from a checkpoint but the state is garbled, missing fields, or wrong-typed. Detect the corruption, recover from a good generation, and write crash-durable checkpoints.

Published: May 25, 2026 Updated: Jun 17, 2026 Author: AI Productivity Guide Team

You restart a long-running Temporal or LangGraph workflow from its saved checkpoint after a server restart. The agent loads the checkpoint, reads completed_steps = 7, and proceeds from step 8. But the artifacts dict is missing the output of step 5: the checkpoint write was interrupted mid-serialization by an OOM kill. Step 8 reaches for the missing artifact, throws a KeyError, and crashes. Now the live checkpoint is corrupt, the in-memory state is gone, and without an older good generation all seven completed steps would have to re-run. Corrupted checkpoints are rare but catastrophic.

Fastest fix: don’t repair the bad file. Roll back to the previous good checkpoint generation (run-42.1.json, then .2.json), and going forward write checkpoints crash-durably (temp file -> fsync -> atomic rename -> fsync the directory) with a checksum, a schema_version, and an is_complete flag that you verify on every load. The sections below cover detection, recovery, and a write path that cannot leave a half-written file behind.

Which bucket are you in?

Symptom on load	Most likely cause	First move
`json.JSONDecodeError` / `jq` reports a parse error at a line and column	Write interrupted mid-serialization (OOM, SIGKILL)	Roll back to the previous generation
Loads fine, but a field is `None` / empty that should have data	Truncated write, or “start fresh” fallback masked a load error	Roll back; remove the silent fallback
`KeyError` / `AttributeError` on a field the code expects	Schema drift after a code deploy	Run an explicit migration; don’t auto-load
`TypeError` on a `datetime`/`Decimal`/`set` field	Serializer coerced the type (e.g. `datetime` -> `str`)	Use a type-preserving serializer
Fields look valid but belong to a different run	Concurrent writes from two processes	Single-writer lock; log writer PID
Always “no checkpoint found”, state silently resets	Decompression/codec error caught as “missing”	Distinguish load-error from not-found

If you can’t tell which bucket you’re in, treat it as bucket one and roll back. A clean rollback to a verified-good generation is almost always safer than patching a corrupt file in place.

Common causes

1. Checkpoint write interrupted mid-serialization

The most common cause. The process receives a SIGKILL (OOM, instance shutdown, container eviction) while writing a large checkpoint. The file or DB row contains a partial blob. The next load either reads truncated “valid but incomplete” data or fails parsing on the truncated tail.

How to spot it: run python -m json.tool checkpoint.json or jq '.' checkpoint.json. Both report where the first syntax error sits (jq as a line and column, json.tool as a line, column, and character offset), which tells you how much of the file is intact. If the file ends mid-string, mid-array, or with null where data belongs, the write was interrupted. Correlate the checkpoint mtime with OOM events in dmesg / journalctl -k (look for Out of memory: Killed process).
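The same check can be sketched from Python, assuming the checkpoint is a JSON text file. The helper name intact_prefix is hypothetical; json.JSONDecodeError and its pos attribute come from the standard library.

```python
import json

def intact_prefix(path: str) -> tuple[bool, int, int]:
    """Return (is_valid, error_pos, total_chars) for a JSON checkpoint file.

    Hypothetical helper: error_pos is the character index at which the decoder
    reported the first error (equal to total_chars when the file is valid).
    """
    # errors="replace" keeps the read from failing if the tail holds binary junk
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    try:
        json.loads(text)
    except json.JSONDecodeError as exc:
        return False, exc.pos, len(text)
    return True, len(text), len(text)
```

An error position close to the end of the file points at a truncated tail; an error at position 0 usually means the file is not JSON at all (compressed, zeroed, or binary).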

2. The “atomic” write was atomic but not durable

A subtle one. You wrote to a temp file and called os.replace(), which is atomic on POSIX. But os.replace() only guarantees the filename points to one inode or the other; per POSIX it does not flush the data to disk. After a power loss or hard crash, you can be left with the renamed file pointing at zero-length or stale data, because neither the temp file’s contents nor the directory entry were fsync-ed. The fix in Step 2 below adds both flushes.

How to spot it: the file passes a JSON parse but the state is empty or reverts to a prior version specifically after an ungraceful host reboot (not a clean process restart). The give-away is “looks atomic in testing, loses data only on real power loss.”

3. Concurrent writes from multiple processes corrupt the checkpoint

Two processes (a checkpoint writer and a recovery monitor, or two agents sharing a thread_id) write the same path simultaneously. One clobbers the other, or in LangGraph the later write wins and silently drops the earlier agent’s state.

How to spot it: log the writer PID and a monotonically increasing sequence number in every checkpoint record. If two different PIDs wrote the same path inside a short window, or the sequence number went backwards, you had a concurrent write. Enforce a single writer (advisory file lock, or a DB row lock / SELECT ... FOR UPDATE).
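One way to enforce a single writer on a local filesystem is an advisory lock. This is a POSIX-only sketch using fcntl.flock; the names single_writer and stamp, and the writer_pid / seq fields, are hypothetical. Advisory locks only protect against processes that also take the lock, and they are not a substitute for a DB row lock on shared storage.

```python
import fcntl
import os
from contextlib import contextmanager

@contextmanager
def single_writer(lock_path: str):
    """Hold an exclusive advisory lock for the duration of a checkpoint write.

    Raises BlockingIOError immediately if another writer already holds the lock.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock

def stamp(record: dict, last_seq: int) -> dict:
    """Return a copy of record tagged with the writer PID and the next sequence number."""
    return {**record, "writer_pid": os.getpid(), "seq": last_seq + 1}
```

Wrap the whole save in `with single_writer(...)` and write the stamped record; a second process that hits BlockingIOError should log and back off rather than retry blindly.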

4. Checkpoint schema version mismatch after a code deploy

The new code expects state["artifacts"]["step_5"]["type"] (nested); the checkpoint was written by the old version with state["step_5_artifact"] (flat). The load “succeeds” but the shape is wrong, and the agent runs on structurally valid, semantically wrong state.

How to spot it: check for a schema_version field in the record. If it’s missing, or it doesn’t match the version the code expects, you have drift. Never auto-load a mismatched version — migrate it explicitly (Step 4).

5. Serialization silently coerces non-JSON types

The state holds a datetime, a Decimal, a set, or a numpy array. A plain json.dumps(..., default=str) turns datetime(2026, 5, 25) into "2026-05-25 00:00:00". On load it’s a str, not a datetime, so calling .isoformat() or .date() on it raises AttributeError. A set raises TypeError under stock json, and with default=str it is flattened into its printed form, a string that no longer behaves like a set.

How to spot it: compare the Python type of each field before serialize and after deserialize. Any change (datetime -> str, Decimal -> float, set -> list or error) means silent coercion. LangGraph’s default serializer avoids this (see the FAQ), but only if you let it serialize — a hand-rolled json.dumps in your own node code will not.
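The before/after comparison can be sketched as below, using a hand-rolled json.dumps(..., default=str) as the serializer under test. The helper name coerced_fields and the sample state are illustrative only.

```python
import json
from datetime import datetime
from decimal import Decimal

def coerced_fields(state: dict) -> dict:
    """Map field name -> (type before, type after) for every field whose type changed."""
    restored = json.loads(json.dumps(state, default=str))
    return {
        key: (type(value).__name__, type(restored[key]).__name__)
        for key, value in state.items()
        if type(value) is not type(restored[key])
    }

state = {
    "started": datetime(2026, 5, 25),
    "cost": Decimal("1.10"),
    "tags": {"a"},
    "steps": 7,
}
print(coerced_fields(state))
```

Here the datetime, the Decimal, and the set all come back as str, while the int survives. An empty result is the pass condition for whichever serializer you actually use.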

6. Storage backend returns stale or non-durable data

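Redis without AOF persistence loses checkpoint data on restart; an S3-compatible store, or a cache in front of the bucket, can briefly return a stale version after an overwrite (Amazon S3 itself has been strongly consistent for overwrites since December 2020); a DB without fsync/synchronous_commit loses the last writes on crash.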

How to spot it: check the backend’s durability settings. For self-managed Redis, confirm appendonly yes and an appendfsync policy of everysec or always. For Postgres, confirm synchronous_commit = on. Validate by writing a checkpoint, killing the host, and reading it back.

7. Checkpoint is compressed and the codec is missing on load

The checkpoint was written with lz4/zstd; the rebuilt server lacks the library. The load reads compressed bytes as raw text (garbage) or raises an exception that gets caught and treated as “no checkpoint — start fresh,” silently re-running from scratch.

How to spot it: check whether checkpoint read failures funnel into a “no checkpoint found” branch. If a decode error is indistinguishable from a missing file, you will silently lose work. Pin the codec library version and make a decode error a hard, logged failure — never a silent reset.
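A minimal sketch of keeping "not found" and "unreadable" apart. zlib stands in for lz4/zstd here so the example needs no third-party codec; CheckpointUnreadable and read_compressed_checkpoint are hypothetical names.

```python
import json
import zlib  # stand-in for lz4/zstd in this sketch

class CheckpointUnreadable(Exception):
    """The checkpoint exists but could not be decoded."""

def read_compressed_checkpoint(path: str):
    """Return the decoded record, or None only when the file truly does not exist."""
    try:
        with open(path, "rb") as f:
            raw = f.read()
    except FileNotFoundError:
        return None  # genuinely no checkpoint: starting fresh is legitimate
    try:
        return json.loads(zlib.decompress(raw))
    except (zlib.error, ValueError) as exc:
        # Decode failures must never look like "missing"; fail loudly instead.
        raise CheckpointUnreadable(f"{path}: {exc}") from exc
```

A missing codec library would surface as an ImportError at startup in this layout, which is also preferable to a silent reset.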

Shortest path to fix

Step 1: Validate checkpoint integrity on every load

import hashlib
import json

class CorruptedCheckpointError(Exception): ...
class SchemaVersionMismatch(Exception): ...

def load_checkpoint_safe(path: str, schema_version: int) -> dict:
    try:
        with open(path, encoding="utf-8") as f:
            record = json.load(f)
    except ValueError as exc:
        # JSONDecodeError (truncated tail) and UnicodeDecodeError (binary garbage)
        # are both ValueError subclasses. A missing file still raises FileNotFoundError.
        raise CorruptedCheckpointError(f"Unparseable checkpoint at {path}: {exc}") from exc

    if not isinstance(record, dict) or not record.get("is_complete") or "state" not in record:
        raise CorruptedCheckpointError(f"Incomplete checkpoint at {path}")

    # Checksum first, so a later SchemaVersionMismatch always refers to intact data.
    state_blob = json.dumps(record["state"], sort_keys=True)
    expected = hashlib.sha256(state_blob.encode()).hexdigest()
    if record.get("checksum") != expected:
        raise CorruptedCheckpointError(f"Checksum mismatch at {path}")

    saved_version = record.get("schema_version")
    if saved_version != schema_version:
        raise SchemaVersionMismatch(
            f"Checkpoint schema v{saved_version} != code schema v{schema_version}"
        )

    return record["state"]

Step 2: Write checkpoints crash-durably, not just “atomically”

os.replace() is atomic but not durable. To survive a real power loss you must fsync the temp file before the rename and fsync the parent directory after it. Otherwise the rename can land while the data behind it is still in the page cache.

import hashlib
import json
import os
import tempfile
from datetime import datetime, timezone

def save_checkpoint_atomic(path: str, state: dict, schema_version: int):
    # No default=str here: a non-JSON type (datetime, Decimal, set) raises TypeError
    # before any file is touched, instead of being silently coerced to a string.
    # Dict keys must be strings too: int keys come back as str and re-sort
    # differently, which changes the checksum on load.
    state_blob = json.dumps(state, sort_keys=True)
    record = {
        "state": state,
        "checksum": hashlib.sha256(state_blob.encode()).hexdigest(),
        "schema_version": schema_version,
        "is_complete": True,
        "saved_at": datetime.now(timezone.utc).isoformat(),
    }
    dir_name = os.path.dirname(path) or "."
    # temp file in the SAME directory -> same filesystem -> rename is atomic
    with tempfile.NamedTemporaryFile(
        mode="w", dir=dir_name, delete=False, suffix=".tmp"
    ) as tmp:
        json.dump(record, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())   # durability 1/2: the data is on disk
        tmp_path = tmp.name
    os.replace(tmp_path, path)   # atomic swap, never a partial file
    # durability 2/2: persist the directory entry for the rename
    # (os.O_DIRECTORY is POSIX-only; this step is not available on Windows)
    dir_fd = os.open(dir_name, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

datetime.utcnow() is deprecated as of Python 3.12; use timezone-aware datetime.now(timezone.utc) as above.

Step 3: Keep the last 3 checkpoint generations

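import os

def rotate_checkpoints(base_path: str, state: dict, schema_version: int):
    # Shift the ring oldest-first so nothing is overwritten before it has moved:
    # .2.json -> .3.json, .1.json -> .2.json, current -> .1.json, then write a fresh current.
    for i in range(3, 0, -1):
        src = f"{base_path}.{i-1}.json" if i > 1 else f"{base_path}.json"
        dst = f"{base_path}.{i}.json"
        if os.path.exists(src):
            os.replace(src, dst)
    # Between the last rename and this write there is no current file; the loader
    # must treat that as "fall back to .1.json", not as "no checkpoint".
    save_checkpoint_atomic(f"{base_path}.json", state, schema_version)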

If the current checkpoint fails validation, walk the ring: try .1.json, then .2.json, then .3.json, loading each through load_checkpoint_safe. The first one that validates is your recovery point.
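The ring walk described above can be sketched as follows. The function name load_with_fallback is hypothetical, and the validating loader is passed in as a callable so the sketch stays independent of any particular load function.

```python
from typing import Callable

def load_with_fallback(
    base_path: str, loader: Callable[[str], dict], generations: int = 3
) -> tuple[str, dict]:
    """Try the current checkpoint, then .1.json .. .N.json; return (path, state)."""
    candidates = [f"{base_path}.json"]
    candidates += [f"{base_path}.{i}.json" for i in range(1, generations + 1)]
    failures = []
    for path in candidates:
        try:
            return path, loader(path)
        except Exception as exc:  # any failure disqualifies this generation
            failures.append(f"{path}: {type(exc).__name__}: {exc}")
    # Nothing validated: surface every failure rather than starting fresh.
    raise RuntimeError("no valid checkpoint generation:\n" + "\n".join(failures))
```

With the Step 1 loader this would be called as load_with_fallback("checkpoints/run-42", lambda p: load_checkpoint_safe(p, SCHEMA_VERSION)). Log the returned path so it is visible which generation the run resumed from.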

Step 4: Migrate explicitly on a version mismatch

SCHEMA_VERSION = 3

def migrate_checkpoint(state: dict, from_version: int, to_version: int) -> dict:
    if from_version == 1 and to_version >= 2:
        # v1 -> v2: flatten the artifact structure
        for k, v in state.pop("artifacts_nested", {}).items():
            state[f"artifact_{k}"] = v
    if from_version <= 2 and to_version >= 3:
        # v2 -> v3: add the missing "completed_at" map
        state.setdefault("completed_at", {})
    return state

Never silently load a mismatched version. On SchemaVersionMismatch, run the migration, re-validate, then save a fresh checkpoint at the current version before resuming.
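As the number of versions grows, one alternative layout is a table of single-step migrations applied in order. This is a sketch rather than the function above: MIGRATIONS, upgrade, v1_to_v2, and v2_to_v3 are hypothetical names, and the two steps simply mirror the v1 -> v2 and v2 -> v3 changes already shown.

```python
import copy

def v1_to_v2(state: dict) -> dict:
    # flatten the artifact structure
    for key, value in state.pop("artifacts_nested", {}).items():
        state[f"artifact_{key}"] = value
    return state

def v2_to_v3(state: dict) -> dict:
    # add the missing "completed_at" map
    state.setdefault("completed_at", {})
    return state

# Each entry upgrades a state from version N to version N + 1.
MIGRATIONS = {1: v1_to_v2, 2: v2_to_v3}

def upgrade(state: dict, from_version: int, to_version: int) -> dict:
    """Apply every single-step migration between the two versions, on a copy."""
    if from_version > to_version:
        raise ValueError("checkpoint is newer than the running code")
    migrated = copy.deepcopy(state)  # leave the loaded state untouched
    for version in range(from_version, to_version):
        if version not in MIGRATIONS:
            raise LookupError(f"no migration from v{version} to v{version + 1}")
        migrated = MIGRATIONS[version](migrated)
    return migrated
```

Working on a copy means a failed migration leaves the original state available for a retry or a rollback.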

Step 5: Test corruption recovery in CI

# Corrupt the live checkpoint by zeroing its first 512 bytes
dd if=/dev/zero of=checkpoints/run-42.json count=1 bs=512 conv=notrunc
# The pipeline must detect corruption and fall back to .1.json
python run_pipeline.py --run-id run-42 --resume
# Expected log: "Loaded fallback checkpoint .1.json - proceeding from step 5"

How to confirm it’s fixed

Round-trip type check: serialize a state with a datetime, a Decimal, and a set; load it back; assert each field’s type() is unchanged.
Crash-during-write test: start a save, kill -9 the process mid-write, restart, and confirm the loader either reads the previous complete checkpoint or refuses to load — never silently uses a half-written file. Check that no .tmp files linger.
Power-loss simulation (if you can): on a VM, write a checkpoint and hard-reset the host. After reboot the latest complete checkpoint must still load; this is what the directory fsync in Step 2 buys you.
Checksum gate: flip one byte inside the state blob of a saved checkpoint and confirm load_checkpoint_safe raises CorruptedCheckpointError rather than returning the tampered data.
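The checksum gate can be exercised without touching real files. The sketch below uses its own tiny seal / is_intact pair (hypothetical names) with the same SHA-256-over-sorted-JSON scheme, and simulates tampering by changing one field after a serialize/deserialize round trip.

```python
import hashlib
import json

def seal(state: dict) -> dict:
    """Wrap state with a checksum over its canonical JSON form."""
    blob = json.dumps(state, sort_keys=True)
    return {"state": state, "checksum": hashlib.sha256(blob.encode()).hexdigest()}

def is_intact(record: dict) -> bool:
    """Recompute the checksum and compare it with the stored one."""
    blob = json.dumps(record["state"], sort_keys=True)
    return record.get("checksum") == hashlib.sha256(blob.encode()).hexdigest()

def tamper_is_detected(state: dict, field: str, tampered_value) -> bool:
    """Return True if changing one field after sealing makes verification fail."""
    on_disk = json.loads(json.dumps(seal(state)))  # simulate a save/load round trip
    if not is_intact(on_disk):
        raise AssertionError("a freshly sealed record must verify")
    on_disk["state"][field] = tampered_value       # the "flipped byte"
    return not is_intact(on_disk)
```

In a real suite the same assertion would run against your own save and load functions, so the test fails if someone later removes the checksum comparison.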

Prevention

Write checkpoints crash-durably: temp file in the same directory -> fsync the file -> os.replace -> fsync the parent directory. Never write straight to the live path.
Put a checksum, an is_complete flag, and a schema_version in every record, and verify all three on every load.
Keep at least 3 generations (ring rotation); never delete a checkpoint until 2 newer ones validate.
Write an explicit migration for every schema change; never auto-load a mismatched version.
Enforce a single writer per checkpoint path (file lock or DB row lock) and log the writer PID plus a sequence number.
Use a durable backend: Redis with appendonly yes + appendfsync everysec/always, Postgres with synchronous_commit = on, or S3 with object versioning. Avoid in-memory or eventually-consistent stores for the source of truth.
Never funnel checkpoint load errors into a “start fresh” fallback. Log loudly and require a human decision.
Test corruption recovery in CI (truncate a checkpoint, verify the fallback path).

Security note (LangGraph self-hosters)

If you load checkpoints with LangGraph, harden the load path. Check Point Research disclosed a chain in 2026 (CVE-2025-67644, a SQL injection in the SQLite checkpointer, plus CVE-2026-28277, unsafe msgpack deserialization) that can reach remote code execution when an attacker can write to the checkpoint store and the app exposes get_state_history(). As of June 2026 the fixes are in langgraph >= 1.0.10, langgraph-checkpoint-sqlite >= 3.0.1, and langgraph-checkpoint-redis >= 1.0.2. Also set LANGGRAPH_STRICT_MSGPACK=true (or pass an explicit allowed_msgpack_modules list to JsonPlusSerializer) so deserialization only reconstructs a known-safe set of types instead of any Python object found in the blob. See the GitHub advisory.

FAQ

Q: Does LangGraph’s built-in serializer already handle datetime, set, and Decimal? A: Mostly, yes. As of langgraph-checkpoint 4.x (current release 4.1.1, May 2026) the default JsonPlusSerializer uses ormsgpack with an extended-JSON fallback and round-trips datetimes, enums, sets, and LangChain/LangGraph primitives — it is not plain pickle. Corruption usually creeps in where your own node code calls json.dumps by hand, or after a major dependency upgrade. Keep SqliteSaver/PostgresSaver doing the serialization, add an integrity checksum on top, and set LANGGRAPH_STRICT_MSGPACK=true in production.

Q: Does Temporal handle checkpoint integrity automatically? A: Yes, by design. Temporal uses event sourcing: the workflow’s event history is the checkpoint, stored in a durable database (PostgreSQL, MySQL, or Cassandra) with transactional writes, so partial writes are rolled back. The risk shifts from corruption to replay fidelity. Make code changes with Workflow.getVersion() / patched() so old histories still replay deterministically (legacy “Worker Versioning” from before 2025 was removed from Temporal Server in March 2026). Run replay tests against captured histories before deploying.

Q: Can I inspect a corrupted checkpoint without loading it? A: Yes. python -m json.tool checkpoint.json checks basic validity and prints the line, column, and character offset of the first syntax error; jq '.' checkpoint.json reports its line and column. Either one tells you how much of the file is intact and whether a manual look is even worth it. For msgpack/binary blobs, dump the raw bytes and inspect the header rather than letting your app deserialize untrusted data.

Q: One field out of twenty is corrupt. Should I just patch that field? A: No. Roll back to the last valid generation. A partial repair fixes the corruption you can see and usually misses the secondary corruption introduced by the same event (the OOM that truncated field 5 may also have dropped field 12). A clean rollback to a checksum-verified generation is safer than hand-editing.

Q: How large can a checkpoint get before I change storage? A: Keep file-based checkpoints under ~1 MB; beyond that use a database or object store. For large state (generated files, long LLM histories), checkpoint only references (file paths, S3 keys) and store the blobs externally. Splitting a small “metadata” checkpoint (written every step) from a large “data” checkpoint (written only on change) cuts per-step write volume from tens of MB to a few KB.
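The "checkpoint only references" idea can be sketched like this, assuming local files as the external store. The function name, the blob_ref / sha256 field names, and the 64 KB threshold are all illustrative; an object store would replace the file write.

```python
import hashlib
import json
import os

def externalize_large_values(
    state: dict, blob_dir: str, threshold_bytes: int = 64 * 1024
) -> dict:
    """Return a slim copy of state where large values are replaced by references."""
    os.makedirs(blob_dir, exist_ok=True)
    slim = {}
    for key, value in state.items():
        payload = json.dumps(value, sort_keys=True).encode()
        if len(payload) <= threshold_bytes:
            slim[key] = value  # small enough to stay inline
            continue
        digest = hashlib.sha256(payload).hexdigest()
        blob_path = os.path.join(blob_dir, f"{digest}.json")
        # Content-addressed name: rewriting an unchanged value hits the same file.
        with open(blob_path, "wb") as f:
            f.write(payload)
        slim[key] = {"blob_ref": blob_path, "sha256": digest}
    return slim
```

The blob write here is deliberately plain; in production it needs the same temp file, fsync, and rename treatment as the checkpoint itself, and the stored sha256 should be verified when the blob is read back.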

Tags: #AI coding #Agents #Troubleshooting