Agent State Desyncs After Restart: Detect Drift and Resync

After a crash or restart, your agent thinks the world is in a state that no longer matches reality. Here's how to detect the drift and resync reliably with LangGraph, Temporal, and custom checkpoints.

Published: May 25, 2026 Updated: Jun 17, 2026 Author: AI Productivity Guide Team

A workflow crashes mid-execution. You restart it from the last checkpoint, and the agent’s internal state says “files A, B, C have been refactored”, but the crash happened after it recorded step 3 as complete and before it actually wrote file C to disk. The agent skips C because it believes C is already done, and the codebase is now broken in exactly the place the agent thinks it fixed. Or a LangGraph pipeline that checkpoints with InMemorySaver (formerly MemorySaver) restarts, the in-memory checkpoint is gone, and the agent re-runs every step, doubling mutations on a database whose writes are not idempotent.

TL;DR (fastest fix): Most desyncs come down to one of two things: (1) state was written before the side effect completed (reorder so state is written after), or (2) runtime state is stored in process memory (a Python set/dict or LangGraph InMemorySaver) that any process restart, including a kill -9, wipes (move it to a durable saver: PostgresSaver/RedisSaver for LangGraph, or let Temporal’s event history hold it). Then add a pre-resume verification step so the agent never blindly trusts a checkpoint. Details below.

Which bucket are you in?

Symptom on restart	Most likely cause	Jump to
Agent skips a step that was never actually finished	State written before the side effect	Cause 1
Agent re-runs work that already ran (duplicate emails, double DB rows)	Checkpoint too coarse / non-idempotent steps	Cause 2
Agent starts over from zero after restart	State lived only in process memory	Cause 3
Agent uses stale facts (old schema, old file contents)	External world changed while stopped	Cause 4
`KeyError` / `None` on a state field after a deploy	Checkpoint schema version mismatch	Cause 5
Checkpoint loads but some fields are at defaults	Partial / corrupted checkpoint write	Cause 6
`nondeterministic` / “command does not match event history” (Temporal)	Workflow code changed without versioning	Temporal note

Common causes

1. State is written before the side effect completes

The agent writes “step 3 complete” to its state store, then crashes before the actual file write or API call finishes. On restart the state says “done” but the effect never happened. This is a classic write-before-commit ordering bug, and it’s the single most common source of desync.

How to spot it: Look for any code where state.mark_done(step) (or a checkpoint save) is called before — not after — the actual file write, API call, or database mutation. The state update must come last, or the two must be atomic.

2. Checkpoint is too coarse — covers multiple steps in one unit

The checkpoint saves every 10 steps, but the crash happened at step 7. On restart the agent replays steps 1-10, re-executing steps 1-7 that already ran. If those steps are not idempotent (they append to a file, increment a counter, charge a card, or send an email), re-execution produces duplicate side effects.

How to spot it: Find the checkpoint granularity in your workflow code, then compute crash_step minus latest_checkpoint_step. That gap is the number of steps that will be replayed; anything above one means steps that already completed will run again. If a single checkpoint covers multiple side-effecting operations, a crash mid-checkpoint guarantees a desync.
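
To make the gap concrete, here is a minimal sketch of the arithmetic. The function name and the numbering scheme (steps counted from 1, a checkpoint saved after every N completed steps) are assumptions for illustration, not part of any framework.

```python
def replayed_steps(crash_step: int, checkpoint_interval: int) -> range:
    """Steps that run again after a crash during `crash_step`.

    Assumes steps are numbered from 1 and a checkpoint is saved after
    every `checkpoint_interval` completed steps (hypothetical scheme).
    """
    # Every step before the crashing one completed successfully.
    completed = crash_step - 1
    # The newest checkpoint sits at the highest multiple of the interval.
    last_checkpoint_step = (completed // checkpoint_interval) * checkpoint_interval
    # Everything after that checkpoint, including the crashed step, reruns.
    return range(last_checkpoint_step + 1, crash_step + 1)
```

With an interval of 10 and a crash during step 7, this returns steps 1 through 7, matching the scenario above. With an interval of 1 it returns only step 7.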

3. Runtime state lived only in process memory

Many pipelines accumulate state in a plain Python dict/set or a class attribute. A process restart wipes it entirely, so the agent either starts from scratch or partially re-does work and creates duplicates. In LangGraph specifically, this is what InMemorySaver (the modern name for the old MemorySaver) does — it stores checkpoints in a defaultdict in RAM and is for testing only. As of June 2026 the LangChain docs are explicit: use PostgresSaver/AsyncPostgresSaver (or RedisSaver) for anything that must survive a restart.

How to spot it: List every state variable and tag its storage location (memory / file / Redis / Postgres). Any variable the workflow needs after resume that lives only in memory will be lost on kill -9.

4. External state changed between checkpoint and resume

The agent checkpoints db_schema = v4. While it was stopped, a human ran a migration to v5. The agent resumes from the v4 checkpoint and generates migration SQL that is now wrong. Same class of bug if a temp file the agent relied on was cleaned up, or a file it “already wrote” was edited by someone else.

How to spot it: Compare the external world at resume time against what the checkpoint recorded — file hashes, schema version, row counts, API resource versions. Any difference is a desync, which is why you need an explicit pre-resume verification step (Step 1 below).

5. Checkpoint schema version mismatch after a deploy

You ship new code whose state shape differs from the checkpoint that’s already on disk. state["new_field"] doesn’t exist in the old checkpoint, so the agent gets None or a KeyError and continues with a wrong default. In Temporal this surfaces differently — see the determinism note — but for hand-rolled and LangGraph state it’s a plain serialization-version gap.

How to spot it: Diff the current state schema against the field list in the newest checkpoint. Missing or renamed fields are the trigger. Stamp every checkpoint with a schema_version and migrate on load.
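
One way to migrate on load is a chain of small per-version upgrade functions. The sketch below assumes a checkpoint record shaped like {"schema_version": ..., "state": ...}; the version numbers and the field changes (retry_count, done, processed_ids) are invented purely for illustration.

```python
from typing import Callable

CURRENT_SCHEMA_VERSION = 3  # hypothetical: the version this code expects


def _v1_to_v2(state: dict) -> dict:
    # Hypothetical change: v2 added a `retry_count` field.
    return {**state, "retry_count": 0}


def _v2_to_v3(state: dict) -> dict:
    # Hypothetical change: v3 renamed `done` to `processed_ids`.
    migrated = dict(state)
    migrated["processed_ids"] = migrated.pop("done", [])
    return migrated


# Maps "version found on disk" to the function that upgrades it by one.
MIGRATIONS: dict[int, Callable[[dict], dict]] = {1: _v1_to_v2, 2: _v2_to_v3}


def migrate_on_load(record: dict) -> dict:
    """Upgrade a checkpoint record one version at a time."""
    version = record.get("schema_version", 1)  # unstamped => oldest known
    if version > CURRENT_SCHEMA_VERSION:
        # Written by newer code (e.g. after a rollback): refuse to guess.
        raise ValueError(f"checkpoint schema {version} is newer than this code")
    state = record["state"]
    while version < CURRENT_SCHEMA_VERSION:
        state = MIGRATIONS[version](state)
        version += 1
    return {"schema_version": version, "state": state}
```

Upgrading one version at a time keeps each migration small and lets a very old checkpoint walk the full path instead of needing a dedicated jump.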

6. Partial or corrupted checkpoint write

The checkpoint write was interrupted mid-serialization (crash, OOM kill, disk full), so it reads back with some fields at their zero/default values. The agent resumes with a Frankenstein mix of real and default state that never existed together. Concurrent writers without a lock cause the same thing — half the fields from one writer, half from another.

How to spot it: Add a checksum and an is_complete: true flag to every checkpoint. On load, verify both before trusting any field. Check updated_at timestamps for two writes in the same millisecond, which signals an unlocked concurrent write.

Shortest path to fix

Step 1: Add a pre-resume state verification step

Before resuming from any checkpoint, verify that the real world matches the checkpoint’s assumptions. Do not auto-resume on a mismatch — alert a human to reconcile first.

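import hashlib
import os

def hash_file(path: str) -> str:
    # Assumed helper: SHA-256 of the file contents. It must match however
    # the hashes stored in the checkpoint were produced.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def verify_checkpoint(checkpoint: dict) -> list[str]:
    """Return human-readable mismatches between the checkpoint and reality."""
    discrepancies = []
    # Files: a missing file maps to None, which never equals a recorded hash.
    for file_path, expected_hash in checkpoint.get("file_hashes", {}).items():
        actual = hash_file(file_path) if os.path.exists(file_path) else None
        if actual != expected_hash:
            discrepancies.append(
                f"{file_path}: expected {expected_hash}, got {actual}"
            )
    # Database: `db` is your own client; get_value() stands in for its lookup call.
    for key, expected_val in checkpoint.get("db_state", {}).items():
        actual_val = db.get_value(key)
        if actual_val != expected_val:
            discrepancies.append(
                f"DB {key}: expected {expected_val}, got {actual_val}"
            )
    return discrepancies  # empty list => safe to resume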
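
The verification result still has to gate the resume. A minimal sketch of that gate follows; the verify, resume, and alert callables are placeholders you would supply yourself, and CheckpointMismatchError is a hypothetical exception name.

```python
from typing import Callable


class CheckpointMismatchError(RuntimeError):
    """Raised when the real world no longer matches the checkpoint."""


def resume_if_consistent(
    checkpoint: dict,
    verify: Callable[[dict], list[str]],
    resume: Callable[[dict], None],
    alert: Callable[[list[str]], None],
) -> None:
    """Resume only when verification reports no discrepancies."""
    discrepancies = verify(checkpoint)
    if discrepancies:
        alert(discrepancies)  # notify a human; never auto-resume on a mismatch
        raise CheckpointMismatchError("; ".join(discrepancies))
    resume(checkpoint)
```

Raising instead of returning a flag means a caller cannot ignore the mismatch and continue by accident.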

Step 2: Write state AFTER side effects, not before

# WRONG — state written before effect
def execute_step(step):
    state.mark_done(step.id)   # crash here => state says done, effect not done
    write_file(step.output)

# CORRECT — state written after effect
def execute_step(step):
    write_file(step.output)    # effect first
    state.mark_done(step.id)   # crash here => state says not done, effect done (safe to re-run if idempotent)

This guarantees a crash leaves state one step behind reality, never one step ahead. Re-running the last step is then safe as long as Step 3 holds.

Step 3: Make side effects idempotent so re-execution is safe

def write_file_idempotent(path: str, content: str, expected_hash: str):
    if os.path.exists(path) and hash_file(path) == expected_hash:
        return  # already written correctly — skip
    with open(path, "w") as f:
        f.write(content)

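# For database operations (needs `import json`; `db` is your own connection):
def upsert_record(table: str, key: str, value: dict):
    # INSERT ... ON CONFLICT DO UPDATE is idempotent: replaying it leaves one row.
    # `table` is interpolated into the SQL text, so it must come from trusted
    # code, never from user input or model output.
    # `?` is the SQLite placeholder style; other drivers may use %s instead.
    db.execute(
        f"INSERT INTO {table} (key, data) VALUES (?, ?) "
        "ON CONFLICT(key) DO UPDATE SET data = excluded.data",
        (key, json.dumps(value)),
    )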

For non-idempotent external calls (charge a card, send an email), derive an idempotency key from workflow_id + step_id and pass it to the provider so a replay is deduplicated server-side.
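
A minimal sketch of deriving such a key is shown below. The hashing scheme is one reasonable choice rather than a provider requirement, and FakeEmailProvider is a hypothetical stand-in that mimics server-side deduplication so the effect of a replay is visible; real providers each have their own parameter or header for the key.

```python
import hashlib
import json


def idempotency_key(workflow_id: str, step_id: str) -> str:
    """Derive a stable key: the same workflow and step always give the same value."""
    # JSON-encode the pair so ("a:b", "c") and ("a", "b:c") cannot collide.
    raw = json.dumps([workflow_id, step_id]).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class FakeEmailProvider:
    """Hypothetical stand-in for a provider that deduplicates on the key."""

    def __init__(self) -> None:
        self.delivered: list[str] = []
        self._seen: dict[str, str] = {}

    def send(self, to: str, body: str, key: str) -> str:
        if key in self._seen:  # replayed request: return the first result
            return self._seen[key]
        self.delivered.append(to)
        self._seen[key] = f"msg-{len(self.delivered)}"
        return self._seen[key]
```

The key must not include anything that changes between attempts (timestamps, random values, retry counters), or the replay will look like a new request.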

Step 4: Fine-grain the checkpoint boundary to one side effect per checkpoint

for step in steps:
    execute_step(step)                                    # one side effect
    checkpoint.save(step_id=step.id, state=current_state) # immediately after

A restart now re-runs at most one step (the last one), not a batch. If you’re on LangGraph, this is what a durable saver buys you — scope each run with a thread_id and let the saver persist after every node:

from langgraph.checkpoint.postgres import PostgresSaver

with PostgresSaver.from_conn_string(DB_URI) as checkpointer:
    checkpointer.setup()  # REQUIRED on first use: creates the checkpoint tables
    graph = builder.compile(checkpointer=checkpointer)
    config = {"configurable": {"thread_id": run_id}}
    # first run: pass the initial input
    graph.invoke(state, config)
    # resume (in the restarted process): pass None as the input and omit
    # checkpoint_id to continue from the latest checkpoint for this thread
    graph.invoke(None, config)

PostgresSaver (and AsyncPostgresSaver) require a one-time .setup() call before first use, as of June 2026. Pin to a specific point in history with {"configurable": {"thread_id": run_id, "checkpoint_id": "<uuid>"}}.

Step 5: Add a checksum and a completion flag to every checkpoint

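import hashlib, json, os

class CorruptedCheckpointError(Exception):
    """The checkpoint on disk is incomplete or fails its checksum."""

def save_checkpoint(state: dict, path: str):
    # Normalize through JSON first so the checksum covers exactly what
    # load_checkpoint() will read back (non-JSON values become strings).
    normalized = json.loads(json.dumps(state, default=str))
    payload = json.dumps(normalized, sort_keys=True)
    checksum = hashlib.sha256(payload.encode()).hexdigest()
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"state": normalized, "checksum": checksum,
                   "is_complete": True, "schema_version": 2}, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before the rename
    os.replace(tmp, path)  # atomic rename: never a half-written file at `path`

def load_checkpoint(path: str) -> dict:
    with open(path) as f:
        record = json.load(f)
    if not record.get("is_complete"):
        raise CorruptedCheckpointError(f"Incomplete checkpoint at {path}")
    # Recompute the checksum the same way save_checkpoint() produced it.
    payload = json.dumps(record["state"], sort_keys=True)
    if hashlib.sha256(payload.encode()).hexdigest() != record["checksum"]:
        raise CorruptedCheckpointError(f"Checksum mismatch at {path}")
    return record["state"]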

The tmp + os.replace() pattern makes the write atomic, so a crash mid-write never leaves a corrupt file at the real path — it leaves the previous good checkpoint untouched.

A note on Temporal determinism

Temporal does not store a memory snapshot. On restart it replays your workflow code against the recorded event history, skipping any activity that already succeeded and reusing its result — so the desync classes above mostly disappear inside the workflow function. Two things still bite:

Activities can still re-run on worker crash or retry, so every activity that touches the outside world must be idempotent (use the workflow_id + activity idempotency key pattern from Step 3).
Changing workflow code mid-flight breaks replay. If a running execution started on old code and resumes on a worker with an added/removed/reordered step, you get a nondeterminism error (“command does not match event history”). Guard code changes with the Workflow Versioning APIs and run replay tests in CI. Note: the pre-2025 experimental Worker Versioning was removed from Temporal Server in March 2026 — use the current versioning APIs, not the legacy one.

Worth knowing: a checkpointer is not the same thing as durable execution. A checkpoint just gives you a save point; you are still responsible for detecting that you need it, triggering the resume, and coordinating to avoid duplicate work. A durable-execution engine (Temporal, and increasingly LangGraph’s durable-execution mode) does that coordination for you.

How to confirm it’s fixed

Crash test in CI. Write a test that runs the workflow, hard-kills it at a random step (os._exit(1) inside a step, or kill -9 the subprocess), restarts from the checkpoint, and asserts the final state and side effects are correct exactly once.
Reconciliation check. After a run completes, compare the checkpoint’s “processed count” against the real source of truth (rows in the DB, files on disk). They must match. If they drift, you still have a desync.
Temporal replay test. Run your current workflow code against captured event histories from production; a passing replay test means your latest deploy won’t throw a nondeterminism error on in-flight executions.
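
The crash test can be sketched with nothing but the standard library. Everything below is illustrative: the five-step worker, the file names, and the convention that a crash step of 0 means "never crash" are assumptions, and a real test would drive your actual workflow entry point instead of this toy script.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

# Toy worker: one idempotent side effect per step, checkpoint after each.
WORKER = textwrap.dedent('''
    import json, os, sys
    from pathlib import Path

    workdir, crash_at = Path(sys.argv[1]), int(sys.argv[2])
    ckpt = workdir / "checkpoint.json"
    done = json.loads(ckpt.read_text())["done"] if ckpt.exists() else []

    for step in range(1, 6):
        if step in done:
            continue                              # finished before the crash
        out = workdir / f"step_{step}.txt"
        out.write_text(f"result {step}")          # effect first (idempotent)
        if step == crash_at:
            os._exit(1)                           # hard kill before the checkpoint
        done.append(step)
        tmp = workdir / "checkpoint.json.tmp"
        tmp.write_text(json.dumps({"done": done}))
        os.replace(tmp, ckpt)                     # state written after the effect
''')


def run_crash_test(crash_at: int) -> tuple[int, int, list[str]]:
    """Run the worker, crash it at `crash_at`, restart it, and report results."""
    with tempfile.TemporaryDirectory() as d:
        script = Path(d) / "worker.py"
        script.write_text(WORKER)
        first = subprocess.run([sys.executable, str(script), d, str(crash_at)])
        second = subprocess.run([sys.executable, str(script), d, "0"])  # 0 = no crash
        outputs = sorted(p.name for p in Path(d).glob("step_*.txt"))
        return first.returncode, second.returncode, outputs
```

A CI assertion would check that the first run exited non-zero, the second exited cleanly, and each of the five output files exists exactly once. Randomizing crash_at across runs covers more of the resume path over time.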

Prevention

Always write state AFTER side effects complete — never before.
Checkpoint at the smallest granularity your storage allows — one checkpoint per side-effecting step.
Add a pre-resume verification step that confirms the real world matches checkpoint assumptions before continuing.
Make every side effect idempotent; for non-idempotent external calls, pass a stable idempotency key.
Add a checksum, a completion flag, and a schema_version to checkpoints; reject and alert on incomplete or corrupted ones, and migrate on load.
Include hashes of key external artifacts (files, schema version, API resource versions) in the checkpoint, not just internal state.
Never use in-process memory (a bare dict/set or LangGraph InMemorySaver) as the only state store for any workflow longer than a single LLM call.
Treat the external database as the source of truth; rebuild processed_ids from it on resume rather than trusting the checkpoint’s copy.
Test the resume path explicitly in CI, and run Temporal replay tests before every deploy.

FAQ

Q: Does Temporal guarantee no state desync? A: Inside the workflow function, largely yes — Temporal uses durable execution with event sourcing, so the event history is authoritative and already-succeeded activities are skipped and their results reused on replay. But activities that produce external side effects (file writes, API calls) must still be idempotent because Temporal may re-execute them on worker restart, and changing workflow code without versioning can cause a nondeterminism error.

Q: What’s the difference between LangGraph’s MemorySaver/InMemorySaver, SqliteSaver, and PostgresSaver? A: InMemorySaver (the current name; MemorySaver is the older alias) keeps checkpoints in RAM and loses everything on process restart — testing only. SqliteSaver persists to a local SQLite file, so it survives a restart on the same machine. For production and multi-worker setups use PostgresSaver/AsyncPostgresSaver (or a Redis saver), which share checkpoints across workers. Postgres savers need a one-time .setup() call to create their tables.

Q: If the checkpoint and the external database disagree, which wins? A: The external database. It’s the source of truth; the checkpoint is only a record of execution progress. On resume, query the DB for which items were actually processed (by unique ID + status), rebuild processed_ids from that, and don’t blindly trust the checkpoint’s processed_ids.
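
A minimal sketch of that rebuild, using SQLite for illustration. The items table and its item_id and status columns are hypothetical; substitute whatever your schema uses to record completed work.

```python
import sqlite3


def rebuild_processed_ids(conn: sqlite3.Connection) -> set[str]:
    """Return the IDs the database says were actually processed."""
    rows = conn.execute("SELECT item_id FROM items WHERE status = 'processed'")
    return {row[0] for row in rows}


def reconcile(checkpoint: dict, conn: sqlite3.Connection) -> dict:
    """Rebuild progress from the DB and report where the checkpoint disagreed."""
    actual = rebuild_processed_ids(conn)
    claimed = set(checkpoint.get("processed_ids", []))
    return {
        "processed_ids": sorted(actual),                 # the DB wins
        "claimed_but_missing": sorted(claimed - actual),  # state ran ahead
        "done_but_unrecorded": sorted(actual - claimed),  # state lagged behind
    }
```

The two difference lists are worth logging: items claimed but missing point to state written before the side effect, while items done but unrecorded point to a crash between the effect and the checkpoint.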

Q: How do I handle a checkpoint that’s partially valid — some fields correct, some wrong? A: Don’t auto-merge a partial checkpoint with the real world. Archive the corrupted one, run the pre-resume verification, and replay from the last fully-valid checkpoint. Automatic partial merges create hard-to-debug hybrid states.

Q: How long should I keep old checkpoints? A: Keep at least the last 3 per workflow run. A single retained checkpoint is vulnerable to a write failure during the next save; three give you a fallback if the most recent is corrupted. For schema-migration code, keep the migration path for as many versions as you might still load on resume — two major versions is a common rule of thumb.
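
A minimal sketch of keeping the last three and falling back past a bad one. It assumes file-based checkpoints named checkpoint_<zero-padded sequence>.json that carry an is_complete flag; both the naming scheme and the function names are assumptions for illustration.

```python
import json
from pathlib import Path


def rotate_checkpoints(directory: Path, keep: int = 3) -> list[Path]:
    """Delete all but the newest `keep` checkpoints; return the survivors."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    # Zero-padded sequence numbers make lexical order equal save order.
    files = sorted(directory.glob("checkpoint_*.json"))
    for old in files[:-keep]:
        old.unlink()
    return files[-keep:]


def load_newest_valid(directory: Path) -> dict:
    """Return the newest checkpoint that parses and is marked complete."""
    for path in sorted(directory.glob("checkpoint_*.json"), reverse=True):
        try:
            record = json.loads(path.read_text())
        except json.JSONDecodeError:
            continue  # truncated or corrupted: fall back to an older one
        if record.get("is_complete"):
            return record
    raise FileNotFoundError(f"no valid checkpoint in {directory}")
```

Rotating only after a new checkpoint has been written successfully ensures there is never a moment with fewer good copies than intended.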

Tags: #AI coding #Agents #Troubleshooting