Agent Orchestrator Deadlocks Waiting on Each Other

Two agents block forever waiting for each other's output. Find the cycle, add a timeout, and break the deadlock in minutes — with LangGraph, Temporal, and AutoGen specifics.

Published: May 25, 2026 Updated: Jun 17, 2026 Author: AI Productivity Guide Team

Your LangGraph, Temporal, or AutoGen workflow hangs and never finishes. Agent A is waiting for Agent B to produce a schema before it writes the API handler; Agent B is waiting for Agent A to write the handler so it can infer the schema. Neither moves. The run sits there holding a worker slot or spinning in a polling loop, and the logs go quiet.

Fastest fix: wrap every blocking wait in a hard timeout (asyncio.wait_for(...) or a framework deadline) so the hang surfaces as an error instead of an infinite stall, then print the dependency graph and break the cycle by making one agent emit a stub/default first. Steps 5, 2, and 3 below cover those three moves in that order.

Deadlocks in agent orchestration are rarer than database lock deadlocks but harder to spot, because a deadlock looks like slowness, not a crash. The signature that tells them apart: a deadlocked run has near-zero CPU and zero new LLM API calls for minutes while the run is still marked “running.”

Which bucket are you in?

Match your symptom to the cause, then jump to the matching fix step.

Symptom you observe	Likely cause	Go to
`GraphRecursionError: Recursion limit of 25 reached` or a node never reaches `END`	Conditional edge forms an unintended cycle	Step 2 + Step 3
Two agents each hold one lock and wait for the other since the same timestamp	Inconsistent lock-acquisition order	Step 4
Both agents are in a blocking `wait_for_reply()` with messages queued for each	Message-queue deadlock (no one drains the inbox)	Step 1 + Step 5
AutoGen team loops on “you go first / no, you go first”	No turn cap or tie-break rule	Step 5 + Prevention
Temporal run stuck; worker log shows `PotentialDeadlock` / `Deadlock detected`	Blocking call inside the workflow thread (2s detector)	See “Temporal’s two kinds of deadlock”
Run is alive but idle; CPU ~0%, no token usage for minutes	Any blocking wait with no timeout	Step 1 + Step 5

Common causes

1. Circular dependency in the workflow graph

The most direct cause. Agent A depends on output Y from Agent B, and Agent B depends on output X from Agent A, forming a cycle: A -> needs Y -> B -> needs X -> A. If neither has a default or cached value, both block forever.

How to spot it: Draw or print the dependency graph. In LangGraph, call graph.get_graph().draw_mermaid(). Look for any edge where following the graph forward eventually leads back to the same node.

2. Shared lock held too long — acquisition order inconsistency

Agent A acquires a lock on resource_1, then tries to acquire resource_2. Agent B acquires resource_2 first, then tries to acquire resource_1. Classic dining-philosophers deadlock. This happens when multiple agents use a file-claim registry or a database row lock without a consistent acquisition order.

How to spot it: Log every lock acquisition with agent ID, resource name, and timestamp. A deadlock shows up as two agents each holding one lock and waiting for the other since the same t=T.
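
The check described above can be run mechanically over the acquisition log. The following is a minimal sketch, assuming each log record has already been reduced to an (agent, resource, event) triple; the `LockEvent` type, the event vocabulary, and `find_mutual_waits` are hypothetical names, not part of any framework.

```python
from typing import NamedTuple

class LockEvent(NamedTuple):
    agent: str
    resource: str
    event: str  # "acquired", "waiting", or "released" (assumed log vocabulary)

def find_mutual_waits(events: list[LockEvent]) -> list[tuple[str, str]]:
    """Return pairs of agents that each wait on a lock the other one holds."""
    holder: dict[str, str] = {}    # resource -> agent currently holding it
    waiting: dict[str, str] = {}   # agent -> resource it is blocked on
    for e in events:
        if e.event == "acquired":
            holder[e.resource] = e.agent
            waiting.pop(e.agent, None)      # it got the lock, so it is no longer waiting
        elif e.event == "released":
            if holder.get(e.resource) == e.agent:
                del holder[e.resource]
        elif e.event == "waiting":
            waiting[e.agent] = e.resource

    pairs = []
    for agent, resource in waiting.items():
        other = holder.get(resource)
        if other is None or other == agent:
            continue
        # The other agent must itself be blocked on a lock this agent holds.
        blocked_on = waiting.get(other, "")
        if holder.get(blocked_on) == agent and agent < other:  # report each pair once
            pairs.append((agent, other))
    return pairs

log = [
    LockEvent("agent_a", "resource_1", "acquired"),
    LockEvent("agent_b", "resource_2", "acquired"),
    LockEvent("agent_a", "resource_2", "waiting"),
    LockEvent("agent_b", "resource_1", "waiting"),
]
print(find_mutual_waits(log))
```

This only catches two-party cycles; longer chains (A waits on B, B on C, C on A) need a full cycle search over the wait-for graph.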

3. Message-queue deadlock — both agents waiting for a reply

In an AutoGen or CrewAI multi-agent chat, Agent A sends a message to Agent B and blocks on a response. Agent B simultaneously sends a message to Agent A and blocks on a response. Both sit in a blocking receive (written here generically as wait_for_reply()) with no timeout. The queue holds both messages, but neither agent is reading its inbox while it waits.

How to spot it: Check the pending message queue for both agents. If Agent A has an unread message from Agent B and Agent B has an unread message from Agent A, and both are in a blocking wait, it is a message-queue deadlock.

4. Conditional edges form an unintended cycle

In LangGraph, conditional edges that route on agent output can accidentally create a cycle. Agent A’s output triggers a “needs review” edge to Agent B. Agent B’s output triggers a “needs context” edge back to Agent A. Neither edge has a base case. In practice you do not hang forever here: LangGraph caps the number of steps and raises langgraph.errors.GraphRecursionError with the message “Recursion limit of 25 reached without hitting a stop condition.” (the default limit is 25 as of June 2026). Treat that error as “I have a cycle,” not “raise the limit.”

How to spot it: Trace every conditional edge function. For each function that can route back to a prior node, verify there is a reachable path that does NOT route back — a base case that terminates the cycle.
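
A routing function with an explicit base case might look like the sketch below. It is plain Python with no LangGraph import; the state fields `needs_context` and `review_rounds`, the cap `MAX_REVIEW_ROUNDS`, and the route labels are assumptions made for illustration.

```python
from typing import TypedDict

MAX_REVIEW_ROUNDS = 3  # assumed cap; tune per workflow

class ReviewState(TypedDict):
    needs_context: bool   # set by the reviewing agent
    review_rounds: int    # incremented by the reviewing node on every pass

def route_after_review(state: ReviewState) -> str:
    """Route back to agent_a only while the round budget lasts."""
    if state["review_rounds"] >= MAX_REVIEW_ROUNDS:
        return "end"       # base case 1: budget exhausted, stop cycling
    if state["needs_context"]:
        return "agent_a"   # the legitimate back-edge
    return "end"           # base case 2: the reviewer is satisfied
```

In a LangGraph graph this function would typically be registered with add_conditional_edges, mapping "agent_a" to the node of that name and "end" to END. The counter only works if the reviewing node actually increments review_rounds in the state it returns.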

5. Timeout set on the wrong layer — outer fires before inner resolves

The orchestrator has a 60-second timeout, but the sub-workflow it is waiting on has a 90-second timeout. The orchestrator times out and tries to cancel, but the sub-workflow is still running. The cancel goes to the sub-workflow’s input queue, which the sub-workflow is not reading because it is blocked on a tool call. Neither completes.

How to spot it: Map every timeout in your system (orchestrator, sub-workflow, tool calls, external API calls). If any outer timeout is shorter than the sum of inner timeouts in a normal run, the outer fires before the inner resolves.
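
Once the timeouts are mapped, the comparison can be automated. This is a rough sketch under two assumptions: the inner waits of a layer run one after another, and worst-case timeouts stand in for "a normal run" (which is conservative). `TimeoutNode`, `find_timeout_inversions`, and the 5-second margin are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class TimeoutNode:
    name: str
    timeout_s: float
    # Inner waits this layer blocks on, assumed to run sequentially.
    children: list["TimeoutNode"] = field(default_factory=list)

def find_timeout_inversions(node: TimeoutNode, margin_s: float = 5.0) -> list[str]:
    """List every layer whose timeout can fire before its inner waits resolve."""
    problems = []
    if node.children:
        inner_total = sum(child.timeout_s for child in node.children)
        if node.timeout_s < inner_total + margin_s:
            problems.append(
                f"{node.name}: {node.timeout_s}s is shorter than inner waits "
                f"({inner_total}s) plus {margin_s}s margin"
            )
    for child in node.children:
        problems.extend(find_timeout_inversions(child, margin_s))
    return problems

# The 60s orchestrator over a 90s sub-workflow from the example above.
tree = TimeoutNode("orchestrator", 60, [
    TimeoutNode("sub_workflow", 90, [TimeoutNode("tool_call", 30)]),
])
for problem in find_timeout_inversions(tree):
    print(problem)
```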

Shortest path to fix

Step 1: Detect the deadlock — print all waiting agents and what they wait for

import sys, threading, traceback

def dump_thread_stacks():
    for thread_id, frame in sys._current_frames().items():
        print(f"\n--- Thread {thread_id} ---")
        traceback.print_stack(frame)

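# One-shot dump 120 s after startup. If you track progress, call
# dump_thread_stacks() from your watchdog instead of on a fixed delay.
timer = threading.Timer(120, dump_thread_stacks)
timer.daemon = True   # do not keep the process alive just for the dump
timer.start()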

For LangGraph workflows, inspect the persisted state. get_state needs a checkpointer configured on the compiled graph; it returns a StateSnapshot whose .next is the tuple of nodes about to run and .tasks lists pending work:

state = graph.get_state(config)   # requires a checkpointer
print("Next nodes:", state.next)        # e.g. ('agent_b',)
print("Pending tasks:", state.tasks)    # PregelTask entries still to run
print("Pending interrupts:", state.interrupts)

If .next keeps pointing at the same node across snapshots and nothing advances, that node is your blocked party.
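
The threading.Timer in the first snippet of this step fires after a fixed delay whether or not the run is making progress. A watchdog that fires only when progress stops could be sketched as follows; `ProgressWatchdog` and its `beat()` method are hypothetical names, and where you call `beat()` (after each node, each LLM call) is up to you.

```python
import threading
import time

class ProgressWatchdog:
    """Calls on_stall() when beat() has not been called for stall_seconds."""

    def __init__(self, stall_seconds: float, on_stall):
        self._stall_seconds = stall_seconds
        self._on_stall = on_stall
        self._last_beat = time.monotonic()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        # Daemon thread: the watchdog never keeps the process alive by itself.
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self.beat()
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()

    def beat(self) -> None:
        """Record progress, e.g. after each node finishes or each LLM call returns."""
        with self._lock:
            self._last_beat = time.monotonic()

    def _run(self) -> None:
        poll = self._stall_seconds / 4
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(poll):
            with self._lock:
                idle = time.monotonic() - self._last_beat
            if idle >= self._stall_seconds:
                self._on_stall()
                self.beat()  # re-arm: at most one dump per stall window
```

Constructing it as ProgressWatchdog(120, dump_thread_stacks) and calling start() would reuse the stack dump from this step as the stall handler.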

Step 2: Visualize the dependency graph and find cycles

# LangGraph — render the graph to spot cycles
print(graph.get_graph().draw_mermaid())     # Mermaid source
# graph.get_graph().draw_mermaid_png()      # PNG bytes, if you prefer an image

For a manual cycle check on any dependency dict:

def has_cycle(graph: dict[str, list[str]]) -> bool:
    visited, rec_stack = set(), set()
    def dfs(node):
        visited.add(node)
        rec_stack.add(node)
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                if dfs(neighbor):
                    return True
            elif neighbor in rec_stack:
                return True
        rec_stack.discard(node)
        return False
    return any(dfs(n) for n in graph if n not in visited)
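
A yes/no answer tells you a cycle exists but not where. The variant below is a sketch that returns one offending path so it can be printed in an error message; `find_cycle` and the sample `deps` dict are illustrative names.

```python
from typing import Optional

def find_cycle(graph: dict[str, list[str]]) -> Optional[list[str]]:
    """Return one cycle as a node path (first node repeated at the end), or None."""
    visited: set[str] = set()
    path: list[str] = []   # nodes on the current DFS stack, in visit order

    def dfs(node: str) -> Optional[list[str]]:
        visited.add(node)
        path.append(node)
        for neighbor in graph.get(node, []):
            if neighbor in path:
                # Back-edge: slice the stack from the first occurrence of neighbor.
                return path[path.index(neighbor):] + [neighbor]
            if neighbor not in visited:
                found = dfs(neighbor)
                if found:
                    return found
        path.pop()
        return None

    for start in graph:
        if start not in visited:
            found = dfs(start)
            if found:
                return found
    return None

deps = {"agent_a": ["agent_b"], "agent_b": ["agent_a"], "agent_c": []}
print(find_cycle(deps))  # ['agent_a', 'agent_b', 'agent_a']
```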

Step 3: Break the circular dependency with an initialization contract

For the “A needs Y from B, B needs X from A” pattern, one agent must produce a stub or default first:

# Agent A produces a stub schema first, Agent B refines it
initial_schema = {
    "endpoint": "/api/users",
    "method": "POST",
    "body": "TBD",  # placeholder Agent B fills in
}

# Wire it acyclic: A(stub) -> B(refine) -> A(implement with real schema)
graph.add_edge("agent_a_stub", "agent_b_refine")
graph.add_edge("agent_b_refine", "agent_a_implement")

Designate one agent as the “provider of defaults” and the other as the “consumer who refines.” That single rule turns a 2-cycle into a 3-stage line.

Step 4: Enforce consistent lock-acquisition order

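RESOURCE_ORDER = ["database", "file_system", "message_queue"]

def acquire_locks(resources: list[str]) -> list:
    # Always acquire in canonical order to prevent deadlock
    ordered = sorted(resources, key=RESOURCE_ORDER.index)
    locks = []
    for r in ordered:
        lock = get_lock(r)   # get_lock: your own lock-registry lookup
        # acquire() returns False on timeout; it does not raise
        if not lock.acquire(timeout=10):
            for held in reversed(locks):   # back out so nothing stays held
                held.release()
            raise TimeoutError(f"Could not acquire lock on {r!r} within 10s")
        locks.append(lock)
    return locks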

Every agent must go through this function. Acquiring two locks “ad hoc” in different orders is exactly what creates a hold-and-wait cycle.

Step 5: Put a watchdog timeout on every blocking wait

import asyncio

class DeadlockError(Exception):
    pass

async def wait_with_timeout(coro, timeout_seconds: float, label: str):
    try:
        return await asyncio.wait_for(coro, timeout=timeout_seconds)
    except asyncio.TimeoutError:
        raise DeadlockError(
            f"Timed out after {timeout_seconds}s waiting for: {label}. "
            "Possible deadlock — check the dependency graph."
        )

Wrap every await that can block: sub-agent calls, message-queue reads, tool calls, external APIs. Name what you were waiting for in the message so the next on-call engineer diagnoses it in one read. For AutoGen teams, also cap the conversation itself (see Prevention) so a “you go first” loop ends instead of running to your token budget.
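
To see the wrapper turn a silent hang into a named error, here is a small self-contained demo. It repeats the helper so the block runs on its own; the two `agent` coroutines and their queues are a toy stand-in for a message-queue deadlock, not real framework agents, and the 0.2-second timeout is chosen only to keep the demo fast.

```python
import asyncio

class DeadlockError(Exception):
    pass

async def wait_with_timeout(coro, timeout_seconds: float, label: str):
    try:
        return await asyncio.wait_for(coro, timeout=timeout_seconds)
    except asyncio.TimeoutError as exc:
        raise DeadlockError(
            f"Timed out after {timeout_seconds}s waiting for: {label}"
        ) from exc

async def agent(name: str, inbox: asyncio.Queue, outbox: asyncio.Queue) -> str:
    # Each agent insists on hearing from its peer before it sends anything.
    request = await wait_with_timeout(inbox.get(), 0.2, f"{name} inbox")
    await outbox.put(f"{name} reply to {request}")
    return request

async def demo() -> list:
    a_inbox, b_inbox = asyncio.Queue(), asyncio.Queue()
    # return_exceptions=True collects both failures instead of stopping at the first.
    return await asyncio.gather(
        agent("agent_a", a_inbox, b_inbox),
        agent("agent_b", b_inbox, a_inbox),
        return_exceptions=True,
    )

results = asyncio.run(demo())
for r in results:
    print(type(r).__name__, r)
```

Without the wrapper, both inbox.get() calls would wait indefinitely and the script would never exit.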

Temporal’s two kinds of deadlock

Temporal has its own deadlock detector that is easy to confuse with the orchestration hang above, so name which one you have:

Workflow deadlock detector (worker error). The Temporal worker gives each workflow task a budget of roughly 2 seconds to yield control. If your workflow code runs a blocking call (a synchronous network request, time.sleep, a CPU-bound loop) instead of await-ing, the worker logs a PotentialDeadlock / “Deadlock detected” error and fails the task. As of June 2026 this 2s limit is fixed and not exposed as a tunable setting in the Python SDK. Fix: move the blocking work into an Activity; keep workflow code deterministic and non-blocking.
Logical “wait forever” hang. Durable execution survives crashes, but it cannot rescue a workflow that genuinely waits on a signal or activity that never arrives. Always pass a timeout to workflow.wait_condition(...) and a start_to_close_timeout to workflow.execute_activity(...):

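import asyncio
from datetime import timedelta
from temporalio import workflow

# Bounded wait for a human/peer signal; falls through instead of hanging.
# In the Python SDK, wait_condition returns None and raises
# asyncio.TimeoutError when the timeout elapses (it does not return a bool).
try:
    await workflow.wait_condition(
        lambda: self._approved is not None,
        timeout=timedelta(hours=72),
    )
except asyncio.TimeoutError:
    return "auto_rejected_timeout"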

How to confirm it’s fixed

Re-run the workflow that hung. It should now either complete or raise a named timeout/GraphRecursionError quickly — never sit silently for minutes.
Watch CPU and token usage during the run. A healthy run shows periodic LLM calls; a deadlocked one shows zero. Add an alert: tokens at zero while the run is still “running” past your normal duration means investigate.
For LangGraph, snapshot graph.get_state(config).next twice a few seconds apart. If it advances, you are no longer stuck on one node.
For lock-based flows, grep your acquisition log and confirm every multi-lock call lists resources in the canonical order from Step 4.
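
The "tokens at zero while still running" alert from the list above can be sketched as a pure function over metric samples. The `RunSample` fields and the 180-second idle window are assumptions; wire in whatever your metrics backend actually exposes.

```python
from dataclasses import dataclass

@dataclass
class RunSample:
    timestamp_s: float   # seconds since the run started
    total_tokens: int    # cumulative tokens consumed so far
    status: str          # e.g. "running", "completed"

def looks_deadlocked(samples: list[RunSample], idle_window_s: float = 180.0) -> bool:
    """True if the run is still 'running' but token usage has been flat for idle_window_s."""
    if not samples or samples[-1].status != "running":
        return False
    latest = samples[-1]
    idle_since = latest.timestamp_s
    # Walk backwards while the cumulative token count is unchanged.
    for sample in reversed(samples):
        if sample.total_tokens != latest.total_tokens:
            break
        idle_since = sample.timestamp_s
    return latest.timestamp_s - idle_since >= idle_window_s

stalled = [
    RunSample(0, 100, "running"),
    RunSample(60, 500, "running"),
    RunSample(120, 500, "running"),
    RunSample(300, 500, "running"),
]
print(looks_deadlocked(stalled))
```

A long tool call or a slow external API also produces a flat token count, so treat a hit as a reason to look at thread stacks or graph state, not as proof of a deadlock.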

Prevention

Draw the agent dependency graph before you build it, and run a cycle-detection check (Step 2) in CI; reject workflow definitions that contain cycles at startup.
Never let two agents hold a lock and request each other’s — enforce one canonical acquisition order globally.
Set explicit timeouts on every blocking wait: tool calls, sub-agent invocations, message-queue reads, external API calls.
In AutoGen v0.4+ AgentChat, build teams with RoundRobinGroupChat or SelectorGroupChat and cap them with max_turns plus a termination condition such as MaxMessageTermination(...), so “you go first” loops end. (In the older AutoGen v0.2 / AG2 GroupChat, the equivalent guard is the max_round argument of GroupChat itself.)
Keep Temporal workflow code non-blocking so the 2-second deadlock detector never trips, and put deadlines on every wait_condition and activity.
For genuinely bidirectional dependencies, restructure into three phases: A-stub -> B-refine -> A-finalize.
Add a watchdog that dumps all agent states if no progress is made in N seconds.
Instrument lock-acquisition latency: a lock that takes more than 5 seconds to acquire is an early deadlock warning.
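
The latency instrumentation in the last item can be as small as a timing wrapper around acquire(). A sketch, assuming threading-style locks whose acquire() accepts a timeout; `timed_acquire` and the logger name are hypothetical, and the 5-second threshold comes from the item above.

```python
import logging
import threading
import time

log = logging.getLogger("lock_latency")

def timed_acquire(lock, name: str, warn_after_s: float = 5.0, timeout: float = 10.0) -> bool:
    """Acquire a lock, logging a warning when the wait is long or the attempt times out."""
    start = time.monotonic()
    acquired = lock.acquire(timeout=timeout)   # False means the timeout elapsed
    waited = time.monotonic() - start
    if not acquired:
        log.warning("lock %s: gave up after %.1fs (possible deadlock)", name, waited)
    elif waited > warn_after_s:
        log.warning("lock %s: acquired after %.1fs (early deadlock warning)", name, waited)
    return acquired

db_lock = threading.Lock()
if timed_acquire(db_lock, "database"):
    try:
        pass  # critical section goes here
    finally:
        db_lock.release()
```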

FAQ

Q: How is an agent deadlock different from an agent loop? A: A loop makes progress — the agent runs, produces output, and iterates, burning token budget. A deadlock makes no progress at all: agents block on a precondition that will never be met, burning wall-clock time while holding resources. In LangGraph a runaway loop ends with GraphRecursionError (default limit 25); a true deadlock just hangs until your timeout fires.

Q: Does Temporal prevent deadlocks? A: It prevents crash-induced hangs through durable execution, and its 2-second deadlock detector catches a workflow thread that blocks instead of yielding. It cannot prevent a workflow that genuinely waits forever on a signal or activity that never arrives — that is on you to bound with a timeout on every wait_condition() and a start_to_close_timeout on every execute_activity().

Q: Can I detect a deadlock without modifying agent code? A: Yes. Monitor CPU and token usage. A deadlocked pipeline has near-zero CPU and zero LLM API calls over a multi-minute window. Alert when token usage is zero but the workflow is still “running” past its expected duration.

Q: My LangGraph run dies with “Recursion limit of 25 reached.” Should I just raise the limit? A: Usually no. That error almost always means a conditional edge routes back without a base case (cause 4). Raise recursion_limit only if the graph legitimately needs more steps; otherwise fix the edge so it can reach END.

Q: What is the safest way to break a live deadlock without data loss? A: Snapshot the current state of all agents first (inputs, outputs, lock holdings). Then cancel the agent holding the fewest resources and retry it after the other completes. Do not forcibly kill an agent mid-write — let write operations finish before canceling.
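
The selection rule in that last answer can be written down as a small function. This is a sketch: `pick_victim`, the `holdings` map (agent to held resources), and the `mid_write` set are hypothetical and would be filled from your state snapshot.

```python
from typing import Optional

def pick_victim(holdings: dict[str, set[str]], mid_write: set[str]) -> Optional[str]:
    """Pick the agent to cancel: fewest held resources, never one that is mid-write."""
    candidates = [agent for agent in holdings if agent not in mid_write]
    if not candidates:
        return None  # everyone is writing; wait for a write to finish first
    # Tie-break on name so repeated runs pick the same agent.
    return min(candidates, key=lambda agent: (len(holdings[agent]), agent))

holdings = {"agent_a": {"database", "file_system"}, "agent_b": {"message_queue"}}
print(pick_victim(holdings, mid_write=set()))  # agent_b
```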

Tags: #AI coding #Agents #Troubleshooting
