Shared Agent Memory Corrupted by Overlapping Writes

Two agents read the same shared memory, both write back, and one update silently vanishes. Diagnose the lost-update race and fix it with atomic ops, optimistic locking, or per-agent partitions.

Published: May 25, 2026 Updated: Jun 17, 2026 Author: AI Productivity Guide Team

Your AutoGen or CrewAI multi-agent system uses a shared Redis store as the “team memory.” Agent A reads the research summary, appends three findings, and writes back the combined list. At almost the same instant, Agent B reads the same summary (before A’s write lands), appends two different findings, and writes back its version. Agent A’s three findings are silently overwritten. The shared store now holds only Agent B’s two findings, and every downstream agent runs on incomplete context. This is the classic read-modify-write race — the “lost update” problem — applied to LLM agent memory.

Fastest fix: stop doing client-side read-modify-write on a shared key. Replace get + modify + set with an atomic server-side operation — RPUSH for append-only logs, HSET for field-level updates — or wrap the read and write in a Redis WATCH/MULTI/EXEC retry loop. If agents only need their own slice, give each agent a private key (agent:{id}:...) and merge later. The rest of this page is how to tell which of these you need and verify it worked.

Which bucket are you in?

Symptom	Most likely cause	Go to
Append-only list ends up shorter than the number of writes	Client-side `get`+append+`set` (non-atomic)	Step 1
Whole record clobbered; only the last writer’s fields survive	Read-modify-write on a JSON blob, no CAS	Step 2
Two agents meant to write different keys hit the same key	Key derived from a shared property	Step 3
Parallel LangGraph nodes update one state key, one update lost	Wrong / missing state reducer	Step 4
Intermittent loss only under high concurrency	Stale-retry after version conflict, or write-behind flush	Steps 2 and 5
In-process `dict`/`list` corrupts (TypeError, partial state)	No `asyncio.Lock`/`threading.Lock` on the critical section	Step 6

Common causes

1. Read-modify-write without atomic compare-and-swap

The most common pattern. Agents read the full state, modify it in Python, then write the whole thing back, with no check that the state hasn’t changed in between. Any overlapping read-modify-write pair loses one agent’s changes.

How to spot it: find any code shaped like state = store.get(key); state.update(new_data); store.set(key, state). Without a version check or a WATCH/CAS guard, every overlapping pair of writes loses one agent’s update.
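To make the race concrete, here is a minimal sketch that replays the interleaving by hand. It assumes a plain in-memory dict standing in for the shared store; the key name and finding strings are illustrative only and do not come from any agent framework.

```python
# In-memory stand-in for the shared store (assumption: a plain dict, no Redis).
store = {"summary": ["baseline"]}

# Both agents read before either one writes.
snapshot_a = list(store["summary"])   # Agent A reads
snapshot_b = list(store["summary"])   # Agent B reads the same state

# Each agent modifies its own private copy.
snapshot_a += ["a1", "a2", "a3"]
snapshot_b += ["b1", "b2"]

# Both write the whole value back; the later write wins.
store["summary"] = snapshot_a         # A's write lands first
store["summary"] = snapshot_b         # B's write replaces it wholesale

lost = [f for f in ("a1", "a2", "a3") if f not in store["summary"]]
print(store["summary"])  # ['baseline', 'b1', 'b2']
print(lost)              # ['a1', 'a2', 'a3']
```

Nothing raises and nothing is logged: the store ends up internally consistent, just incomplete, which is why this failure tends to go unnoticed.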

2. Append operations not atomic at the storage layer

Two agents both call store.append(key, item), but the implementation is store.set(key, store.get(key) + [item]) — a non-atomic read-modify-write. Both read the same list, both append one item, both write a list with one new element. The second write erases the first.

How to spot it: check whether your store’s append/add is an atomic server-side op (Redis RPUSH) or a client-side get + modify + set. Client-side is not safe for concurrent agents.

3. No write lock — multiple agents write the same key

Parallel agents all have write access to one namespace, with no mutex. The pipeline assumed agents would write to separate keys, but the key is derived from a shared task property (task category, model name, date), so they collide.

How to spot it: log every write with key, agent ID, and timestamp. Two different agent IDs writing the same key within a sub-second window is the signature.
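Once those write records exist, scanning them for the signature is mechanical. The sketch below assumes each record is a dict with key, agent, and ts fields; that shape, the function name find_collisions, and the one-second default window are assumptions you should adapt to your own logging.

```python
from collections import defaultdict

def find_collisions(write_log, window_s=1.0):
    """Return (key, agent_a, agent_b, gap_seconds) for consecutive writes to
    the same key by different agents that land within window_s of each other."""
    by_key = defaultdict(list)
    for entry in write_log:
        by_key[entry["key"]].append(entry)

    collisions = []
    for key, entries in by_key.items():
        entries.sort(key=lambda e: e["ts"])
        # Compare each write with the one immediately before it on that key.
        for prev, cur in zip(entries, entries[1:]):
            gap = cur["ts"] - prev["ts"]
            if prev["agent"] != cur["agent"] and gap <= window_s:
                collisions.append((key, prev["agent"], cur["agent"], round(gap, 3)))
    return collisions

log = [
    {"key": "summary:finance", "agent": "A", "ts": 100.00},
    {"key": "summary:finance", "agent": "B", "ts": 100.20},
    {"key": "summary:legal",   "agent": "C", "ts": 100.10},
    {"key": "summary:finance", "agent": "A", "ts": 105.00},
]
print(find_collisions(log))  # [('summary:finance', 'A', 'B', 0.2)]
```

A hit here does not prove data was lost, only that two agents wrote the same key close enough together for a lost update to be possible.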

4. Optimistic locking version check skipped under load

Each record has a version field. Agents read version v, compute an update, and write with a WHERE version = v condition. Under load the write fails (another agent won), but the error handler retries with the original (stale) data instead of re-reading — so the retry overwrites the winning write with stale data.

How to spot it: read the retry path for version-conflict / WatchError handling. If it replays the original payload rather than re-reading and recomputing, it corrupts state on the second attempt.
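The difference between a safe and an unsafe retry is easiest to see on a toy versioned record. This sketch assumes an in-memory VersionedStore class invented for illustration; it stands in for whatever versioned row or document your store provides.

```python
class VersionedStore:
    """Toy in-memory record with a version counter (hypothetical stand-in)."""
    def __init__(self, value):
        self.value, self.version = value, 0

    def read(self):
        return list(self.value), self.version

    def compare_and_set(self, new_value, expected_version):
        if self.version != expected_version:
            return False                      # someone else won the race
        self.value, self.version = new_value, self.version + 1
        return True

def append_with_reread(store, item, max_retries=5):
    for _ in range(max_retries):
        value, version = store.read()         # fresh read on EVERY attempt
        if store.compare_and_set(value + [item], version):
            return True
    return False

store = VersionedStore(["baseline"])
stale_value, stale_version = store.read()     # Agent A reads version 0
append_with_reread(store, "from_b")           # Agent B commits first: version 1

# Agent A's conditional write is rejected because its version is stale.
rejected = not store.compare_and_set(stale_value + ["from_a"], stale_version)

# Correct recovery: re-read and recompute, so B's item is preserved.
append_with_reread(store, "from_a")
print(rejected)      # True
print(store.read())  # (['baseline', 'from_b', 'from_a'], 2)
```

The buggy handler would instead force stale_value + ["from_a"] into the store after the rejection, which silently removes from_b.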

5. LangGraph state reducer merges incompatibly

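In LangGraph, parallel node outputs are merged by per-key reducers. A key with no reducer holds a single value, so a later write simply replaces the earlier one; when two nodes write that key in the same step, recent LangGraph versions reject the update with an InvalidUpdateError rather than merging it. Either way, one node’s contribution never reaches the state. If the key holds messages and you use operator.add instead of add_messages, you can get duplicates or broken tool-call pairing instead of a correct, deduplicated merge.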

How to spot it: review each key in your state TypedDict. Any key written by more than one parallel node needs an explicit reducer — Annotated[list, operator.add] for plain lists, Annotated[list, add_messages] for chat messages, or a custom merge for dicts.
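To see what a reducer buys you, here is a toy model in plain Python. It is not LangGraph's actual merge machinery; fold_updates and the sample state are hypothetical, and the sketch only illustrates the difference between a key that is folded through a reducer and a key that just keeps the most recent value.

```python
import operator

def fold_updates(state, updates, reducers):
    """Apply each node's partial update to a copy of state.

    Keys listed in `reducers` are merged with their reducer; every other key
    simply keeps the most recent value it was given.
    """
    result = dict(state)
    for update in updates:
        for key, value in update.items():
            reducer = reducers.get(key)
            if reducer is not None and key in result:
                result[key] = reducer(result[key], value)
            else:
                result[key] = value
    return result

state = {"findings": ["f0"], "status": "running"}
updates = [
    {"findings": ["f1"], "status": "node_a done"},   # delta from node A
    {"findings": ["f2"], "status": "node_b done"},   # delta from node B
]
merged = fold_updates(state, updates, {"findings": operator.add})
print(merged)
# {'findings': ['f0', 'f1', 'f2'], 'status': 'node_b done'}
```

The findings key keeps all three items, while status retains only the value from whichever node was applied last.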

6. In-process structures aren’t concurrency-safe

An in-memory dict or list shared across threads or asyncio tasks. In CPython the GIL makes a single d[k] = v atomic, but compound read-check-write (if k not in d: d[k] = v) is not, and under asyncio every await is a yield point where another coroutine can interleave.

How to spot it: confirm the data structure’s type and look for compound operations on it that are not wrapped in a threading.Lock (threads) or asyncio.Lock (coroutines).
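The asyncio case can be reproduced deterministically. In this sketch the claim_unsafe and claim_safe helpers and the task-1 key are made up for illustration; asyncio.sleep(0) stands in for any real await (an LLM call, a network read) that sits between the check and the write.

```python
import asyncio

async def claim_unsafe(registry, key, agent_id):
    if key not in registry:            # check
        await asyncio.sleep(0)         # any await lets another coroutine run
        registry[key] = agent_id       # write: may clobber another claimant
        return True
    return False

async def claim_safe(registry, lock, key, agent_id):
    async with lock:                   # check and write form one critical section
        if key not in registry:
            await asyncio.sleep(0)
            registry[key] = agent_id
            return True
        return False

async def main():
    unsafe = {}
    unsafe_results = await asyncio.gather(
        claim_unsafe(unsafe, "task-1", "A"),
        claim_unsafe(unsafe, "task-1", "B"),
    )
    safe, lock = {}, asyncio.Lock()
    safe_results = await asyncio.gather(
        claim_safe(safe, lock, "task-1", "A"),
        claim_safe(safe, lock, "task-1", "B"),
    )
    return unsafe_results, safe_results

unsafe_results, safe_results = asyncio.run(main())
print(unsafe_results)  # [True, True]   both agents believe they own the task
print(safe_results)    # [True, False]  exactly one claim succeeds
```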

7. Write-behind cache flushes and loses in-flight writes

A write-behind cache flushes every few seconds. Two agents write inside one flush window; the flush persists the last in-memory snapshot, which may hold only one of the two writes.

How to spot it: check the cache flush interval against the typical gap between concurrent agent writes. If the interval is longer, writes can be dropped at flush time. Prefer write-through for shared agent memory.

Shortest path to fix

Step 1: Use atomic server-side operations for shared-memory writes

For append-only data, push to a Redis list instead of read-modify-write:

import json
import redis

r = redis.Redis()

# WRONG — client-side read-modify-write, loses concurrent appends
def append_finding_unsafe(key: str, finding: str):
    findings = json.loads(r.get(key) or "[]")
    findings.append(finding)
    r.set(key, json.dumps(findings))

# CORRECT — atomic server-side append, safe under concurrency
def append_finding_safe(key: str, finding: str):
    r.rpush(key, finding)  # RPUSH is atomic; no read-modify-write

def get_findings(key: str) -> list:
    return [item.decode() for item in r.lrange(key, 0, -1)]

For structured records, update individual hash fields so only the touched field changes:

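# Atomic field-level write: every other field in the hash is left intact.
# agent_id, timestamp, and finding_data are supplied by the calling agent.
def record_finding(agent_id: str, timestamp: float, finding_data: dict):
    r.hset(
        "agent_memory",
        f"finding:{agent_id}:{timestamp}",
        json.dumps(finding_data),
    )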

RPUSH, HSET, ZADD, SADD, and SQL INSERT ... ON CONFLICT are all single-round-trip atomic operations. Reach for one of these before reaching for a lock.

Step 2: Optimistic locking with correct retry

When you genuinely must read a value, transform it, and write it back, wrap it in WATCH/MULTI/EXEC. After WATCH, redis-py runs commands immediately (so the get returns a value), and EXEC aborts with WatchError if the key changed since WATCH:

import logging
import json

logger = logging.getLogger(__name__)

class ConcurrencyError(RuntimeError):
    pass

def update_with_optimistic_lock(key: str, update_fn, max_retries: int = 5):
    for attempt in range(max_retries):
        with r.pipeline() as pipe:
            try:
                pipe.watch(key)                       # watch for concurrent change
                current = json.loads(pipe.get(key) or "{}")
                new_state = update_fn(current)        # recompute on CURRENT state
                pipe.multi()                          # begin transaction
                pipe.set(key, json.dumps(new_state))
                pipe.execute()                        # aborts if key changed since WATCH
                return new_state
            except redis.WatchError:
                logger.debug("CAS conflict on %s, retry %d", key, attempt + 1)
                continue                              # loop re-reads — never replays stale data
    raise ConcurrencyError(f"Could not update {key} after {max_retries} attempts")

The bug in cause 4 is replaying the original payload on conflict. The fix is the continue above: it loops back to watch + re-read, so update_fn always runs against fresh state. redis-py also ships a built-in helper, r.transaction(update_fn_taking_pipe, key), that handles the watch-and-retry boilerplate for you.
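A minimal sketch of that helper in use follows. It assumes client is a redis.Redis instance and that the stored value is a JSON object; update_with_transaction is a hypothetical wrapper name, while transaction() and its value_from_callable flag are part of redis-py.

```python
import json

def update_with_transaction(client, key, update_fn):
    """Read-transform-write `key` through redis-py's transaction() helper.

    `client` is assumed to be a redis.Redis instance; `update_fn` maps the
    current dict to the new dict.
    """
    def _txn(pipe):
        # transaction() has already issued WATCH, so this GET runs immediately.
        current = json.loads(pipe.get(key) or "{}")
        new_state = update_fn(current)        # recomputed on every retry
        pipe.multi()                          # start queuing the write
        pipe.set(key, json.dumps(new_state))
        return new_state

    # value_from_callable=True returns _txn's result instead of EXEC's reply.
    return client.transaction(_txn, key, value_from_callable=True)
```

One design difference is worth knowing: transaction() keeps retrying on WatchError with no attempt cap, so unlike the explicit loop it never raises ConcurrencyError. On a very hot key the explicit loop with max_retries may be the safer choice.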

Step 3: Partition memory by agent ID

The most robust fix is to remove the contention entirely. Give each agent a private write namespace and merge in a single coordinator after the fan-out completes:

def write_agent_memory(agent_id: str, key: str, value: dict):
    # Private key — no other agent ever writes here, so no lock needed
    r.set(f"agent:{agent_id}:{key}", json.dumps(value))

def read_shared_memory(key: str) -> dict:
    # Shared namespace is read-only for worker agents
    return json.loads(r.get(f"shared:{key}") or "{}")

def publish_to_shared(agent_id: str, contribution_key: str, value: dict):
    # Append to a stream; a single coordinator consumes and merges in order
    r.xadd("shared_memory_stream", {
        "agent_id": agent_id,
        "key": contribution_key,
        "value": json.dumps(value),
    })

XADD appends to a Redis Stream atomically and preserves order, so the coordinator sees every contribution and resolves conflicts deliberately instead of relying on write timing.
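The coordinator side might look like the sketch below. It assumes the entries arrive as (entry_id, fields) pairs with string values, the shape XRANGE returns when the client is created with decode_responses=True. The function name merge_contributions and the merge policy (field-level, later entry wins) are assumptions; substitute whatever conflict rule your pipeline needs.

```python
import json
from collections import defaultdict

def merge_contributions(entries):
    """Fold stream entries into one dict per contribution key.

    Entries are processed in stream order, so when two agents set the same
    field the later entry wins, and the provenance list records who wrote what.
    """
    merged = defaultdict(dict)
    provenance = defaultdict(list)
    for entry_id, fields in entries:
        value = json.loads(fields["value"])
        merged[fields["key"]].update(value)
        provenance[fields["key"]].append((entry_id, fields["agent_id"]))
    return dict(merged), dict(provenance)

entries = [
    ("1-0", {"agent_id": "A", "key": "summary", "value": json.dumps({"risk": "low"})}),
    ("2-0", {"agent_id": "B", "key": "summary", "value": json.dumps({"owner": "ops"})}),
]
merged, provenance = merge_contributions(entries)
print(merged)      # {'summary': {'risk': 'low', 'owner': 'ops'}}
print(provenance)  # {'summary': [('1-0', 'A'), ('2-0', 'B')]}
```

Because only the coordinator writes the merged result to the shared namespace, that final write has a single writer and needs no lock.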

Step 4: Fix the LangGraph reducer for parallel nodes

Give every key that more than one parallel node writes an explicit reducer. Use add_messages for chat messages (it merges by message ID and keeps tool-call pairs intact) and operator.add for plain lists:

from typing import Annotated
from typing_extensions import TypedDict
import operator
from langgraph.graph.message import add_messages

def deep_merge(a: dict, b: dict) -> dict:
    result = dict(a)
    for k, v in b.items():
        if k in result and isinstance(result[k], dict) and isinstance(v, dict):
            result[k] = deep_merge(result[k], v)
        else:
            result[k] = v
    return result

class AgentState(TypedDict):
    messages: Annotated[list, add_messages]          # dedupe-by-id merge for chat
    findings: Annotated[list[str], operator.add]     # concatenate both lists
    artifacts: Annotated[dict, deep_merge]           # custom deep merge for dicts

Have each parallel node return only its delta (return {"findings": ["finding 3"]}), never the full state — the reducer does the merge. Then test it explicitly:

def test_parallel_findings_merge():
    a = {"findings": ["finding 1", "finding 2"]}
    b = {"findings": ["finding 3"]}
    merged = operator.add(a["findings"], b["findings"])
    assert len(merged) == 3   # all three survive the fan-in

Step 5: Log writes and detect conflicts

Make collisions observable so you can prove the fix held and catch regressions:

import time

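def monitored_write(key: str, value, agent_id: str, window_s: float = 1.0):
    # window_s is an assumed collision window; tune it to your workload
    now = time.time()
    r.set(key, json.dumps(value))
    r.lpush(f"write_log:{key}", json.dumps({
        "agent": agent_id, "ts": now, "size": len(str(value)),
    }))
    recent = [json.loads(e) for e in r.lrange(f"write_log:{key}", 0, 5)]
    # Only writes inside the window count as concurrent, not every past writer
    agents = {e["agent"] for e in recent if now - e["ts"] <= window_s}
    if len(agents) > 1:
        logger.warning("Concurrent writes on %s by agents %s", key, agents)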

Step 6: Lock in-process shared structures

If the shared memory is an in-process object rather than Redis, wrap every compound operation in a lock so the read-check-write runs as one critical section:

import asyncio

class SharedMemory:
    def __init__(self):
        self._data: dict = {}
        self._lock = asyncio.Lock()

    async def append_to_list(self, key: str, item) -> None:
        async with self._lock:               # no other coroutine interleaves here
            self._data.setdefault(key, []).append(item)

    async def atomic_update(self, key: str, updater):
        async with self._lock:
            self._data[key] = updater(self._data.get(key))
            return self._data[key]

For OS-thread concurrency, swap asyncio.Lock for threading.Lock and use a plain with self._lock: block in place of async with.
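A threaded counterpart could be sketched as follows. ThreadSafeMemory mirrors the class above with synchronous methods; the class name, the counter key, and the eight-thread workload are illustrative assumptions.

```python
import threading

class ThreadSafeMemory:
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def append_to_list(self, key, item):
        with self._lock:                      # no other thread interleaves here
            self._data.setdefault(key, []).append(item)

    def atomic_update(self, key, updater):
        with self._lock:                      # read and write as one critical section
            self._data[key] = updater(self._data.get(key))
            return self._data[key]

memory = ThreadSafeMemory()

def bump():
    for _ in range(1000):
        memory.atomic_update("counter", lambda v: (v or 0) + 1)

threads = [threading.Thread(target=bump) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = memory.atomic_update("counter", lambda v: v)
print(total)  # 8000
```

Keep the updater short and free of I/O: every other thread waits while the lock is held.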

How to confirm it’s fixed

Write a concurrency test that reproduces the original loss, then watch it pass:

import threading

def test_no_lost_appends():
    key = "test:findings"
    r.delete(key)
    writers = [threading.Thread(target=append_finding_safe, args=(key, f"f{i}"))
               for i in range(50)]
    for t in writers: t.start()
    for t in writers: t.join()
    assert len(get_findings(key)) == 50   # every write present, none lost

Run it 10-20 times, since concurrency bugs are probabilistic. The unsafe version drops items intermittently; the atomic version returns exactly 50 every run. In production, tail write_log:{key} (Step 5) to confirm that no two agent IDs hit the same key inside your collision window. Redis keyspace notifications (CONFIG SET notify-keyspace-events KEA) can corroborate how often a key is written, but the events carry only the key and the command, not the writer’s identity.

Prevention

Prefer atomic server-side ops (Redis RPUSH, HSET, ZADD; SQL INSERT ON CONFLICT) over client-side read-modify-write for all shared writes.
Partition memory into per-agent private namespaces and merge through a coordinator; reach for locks only when sharing a key is unavoidable.
For read-then-write, use WATCH/CAS and re-read on retry — never replay the stale payload.
In LangGraph, define an explicit reducer for every key multiple parallel nodes touch (add_messages for messages, operator.add for lists, custom merge for dicts) and have nodes return deltas only.
Lock compound operations on in-process dict/list with asyncio.Lock or threading.Lock.
Avoid write-behind caches for shared agent memory; use write-through for durability.
Document which keys are single-writer (private) vs. multi-writer (shared); shared keys require explicit concurrency control.
Keep the concurrency test above in CI so a future refactor can’t silently reintroduce the race.

FAQ

Q: Does Redis support fully serialized multi-key transactions? A: Yes, via MULTI/EXEC plus WATCH for optimistic locking. On a single Redis instance this spans any keys. On Redis Cluster, every key in a transaction must hash to the same slot — use a hash tag ({team}:findings, {team}:version) to force colocation, or you get a CROSSSLOT error. The same constraint applies to Lua scripts (EVAL): they run atomically server-side, but all KEYS must live on one slot.
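You can check colocation before deploying by computing the slot locally. The sketch below follows the Redis Cluster rule (CRC16 of the key, or of the hash-tag contents when a non-empty tag is present, modulo 16384). The function names are our own, and on a live cluster CLUSTER KEYSLOT remains the authoritative answer.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0, as used by Redis Cluster."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """Hash slot for a key, honoring the {hash tag} rule."""
    raw = key.encode()
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        if end != -1 and end != start + 1:    # ignore an empty "{}" tag
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384

# Keys sharing a hash tag land on the same slot, so MULTI/EXEC can span them.
same_slot = key_slot("{team}:findings") == key_slot("{team}:version")
print(same_slot)  # True
```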

Q: Is a message queue safer than shared mutable memory for agent communication? A: Usually, yes. Streams and queues (Redis Streams, Kafka, SQS) serialize writes by design — each message is appended atomically and consumers read an ordered log. For inherently sequential agent work (each agent contributes findings), a stream is safer and far more auditable than a mutable shared dict.

Q: A version conflict keeps firing — should I just add a retry? A: Only if the retry re-reads. The cause-4 bug is retrying with the original stale payload, which overwrites the winner. Make the retry loop back to WATCH + re-read + recompute (Step 2). If conflicts are constant, the key is too hot — switch to per-agent partitions (Step 3) or an atomic append (Step 1) so writes stop contending.

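Q: How do I recover state that’s already corrupted? A: Reconstruct from the write log. Find the last write that produced a correct state, identify the overlapping writes after it, and re-apply the lost updates by hand. For future post-mortems on Redis, have a subscriber record keyspace notifications: they report which key changed and which command changed it, but not the written value, and they are not replayed to a subscriber that was disconnected, so pair them with an application-level write log.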

Q: Do vector stores have the same problem? A: Yes. Pinecone, Qdrant, and Chroma use upsert semantics — a concurrent upsert with the same vector ID overwrites the entire prior record (Pinecone replaces the whole record on a duplicate ID). Use unique IDs per write (agent_id + timestamp + content hash) instead of a fixed ID per topic. That turns a destructive overwrite into an append-and-deduplicate problem you control.
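One way to build such IDs is sketched below. The function name make_record_id, the colon-separated layout, and the 16-character hash prefix are assumptions; any scheme works as long as two different writes can never produce the same ID.

```python
import hashlib
import time

def make_record_id(agent_id, content, ts=None):
    """Build a per-write ID from agent, millisecond timestamp, and content hash."""
    ts = time.time() if ts is None else ts
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()[:16]
    return f"{agent_id}:{int(ts * 1000)}:{digest}"

id_a = make_record_id("agent-a", "Q3 revenue grew", ts=1.0)
id_b = make_record_id("agent-b", "Q3 revenue grew", ts=1.0)

print(id_a != id_b)                    # True: different agents never collide
print(id_a.split(":")[2] == id_b.split(":")[2])  # True: same content, same hash
```

The trailing hash is what makes deduplication cheap later: records whose final segment matches carry identical content, whichever agent wrote them.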

Tags: #AI coding #Agents #Troubleshooting