Agent Orchestrator Deadlocks Waiting on Each Other

Two agents block forever waiting for each other's output — a classic deadlock in async pipelines. Detect the cycle and break it in minutes.

Your LangGraph or Temporal workflow hangs indefinitely. Agent A is waiting for Agent B to produce a schema definition before it can write the API handler. Agent B is waiting for Agent A to write the API handler so it can infer the schema from it. Neither progresses. The pipeline just sits there, consuming a thread, burning a timeout, or spinning on a polling loop. In AutoGen GroupChat, two agents enter a “you go first / no, you go first” loop that the GroupChatManager never breaks. Deadlocks in agent orchestration are less common than in database locking but harder to diagnose because they often look like slowness rather than a hard hang.

Common causes

1. Circular dependency in the workflow graph

The most direct cause. Agent A depends on output Y from Agent B, and Agent B depends on output X from Agent A, forming a cycle: A → needs Y → B → needs X → A. If neither has a default or cached value, both block forever.

How to spot it: Draw or print the dependency graph. In LangGraph, call graph.get_graph().draw_mermaid(). Look for any edge where following the graph forward eventually leads back to the same node.

2. Inconsistent lock acquisition order

Agent A acquires a lock on resource_1, then tries to acquire resource_2. Agent B acquires resource_2 first, then tries to acquire resource_1. Each holds the lock the other needs, so neither can proceed. This is the classic lock-ordering deadlock, the same circular wait seen in the dining-philosophers problem. It happens when multiple agents use a file-claim registry or a database row lock without a consistent acquisition order.

How to spot it: Log every lock acquisition with agent ID, resource name, and timestamp. A deadlock appears as two agents each holding one lock and waiting for the other since t=T.
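
A lock log like this can be checked mechanically by building a wait-for graph and following it until it either ends or loops back. The sketch below is a minimal illustration, not a production detector. The `holders` and `waiting` mappings are hypothetical structures you would build from your own log, and it assumes each agent blocks on at most one resource at a time.

```python
def find_lock_deadlock(holders: dict[str, str], waiting: dict[str, str]) -> list[str]:
    """Return the agents forming a wait cycle, or [] if there is none.

    holders: resource name -> agent currently holding it
    waiting: agent -> resource it is blocked on
    Both mappings are hypothetical; build them from your own lock log.
    """
    for start in waiting:
        path = []
        agent = start
        while agent in waiting and agent not in path:
            path.append(agent)
            # Follow one edge: the agent waits on a resource held by some owner
            agent = holders.get(waiting[agent])
        if agent is not None and agent in path:
            # We walked back onto an agent already on the path: a cycle
            return path[path.index(agent):]
    return []
```

Called with agent_a holding resource_1 and waiting on resource_2 while agent_b holds resource_2 and waits on resource_1, it returns both agents as the cycle.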

3. Mutual blocking wait: both agents waiting for a reply

In an AutoGen or CrewAI multi-agent chat, Agent A sends a message to Agent B and blocks on a response. Agent B simultaneously sends a message to Agent A and blocks on a response. Both sit in a blocking wait (a wait_for_reply()-style call) with no timeout. The queue holds both messages, but neither agent processes incoming messages while it waits.

How to spot it: Check the pending message queue for both agents. If Agent A has an unread message from Agent B and Agent B has an unread message from Agent A, and both are in a blocking wait, it’s a message-queue deadlock.
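
The pattern is easy to reproduce in isolation, which helps when deciding whether a hang matches it. The following is a toy reproduction using plain asyncio queues, not AutoGen or CrewAI code. The `agent` coroutine and the queue names are invented for the illustration, and a short timeout is added so the script ends instead of hanging.

```python
import asyncio

async def agent(name, my_inbox, peer_inbox, my_replies, timeout):
    # Send a request to the peer, then block waiting for its reply.
    await peer_inbox.put(f"request from {name}")
    try:
        reply = await asyncio.wait_for(my_replies.get(), timeout=timeout)
        return f"{name} got {reply}"
    except asyncio.TimeoutError:
        # The peer's request is still sitting unread in my_inbox:
        # both sides were waiting and nobody was serving requests.
        return f"{name} timed out with {my_inbox.qsize()} unread message(s)"

async def main():
    inbox_a, inbox_b = asyncio.Queue(), asyncio.Queue()
    replies_a, replies_b = asyncio.Queue(), asyncio.Queue()
    return await asyncio.gather(
        agent("A", inbox_a, inbox_b, replies_a, timeout=0.1),
        agent("B", inbox_b, inbox_a, replies_b, timeout=0.1),
    )

results = asyncio.run(main())
print(results)
```

Both agents report a timeout with one unread message each, which is the signature described above. Without the timeout, the same script would never return.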

4. Conditional edges form an unintended cycle

In LangGraph, conditional edges that route based on agent output can accidentally create a cycle. Agent A’s output triggers a “needs review” edge that routes to Agent B. Agent B’s output triggers a “needs context” edge that routes back to Agent A. Neither edge has a base case or exit condition.

How to spot it: Trace every conditional edge function. For each function that can route back to a prior node, verify there is a reachable code path that does NOT route back — i.e., a base case that terminates the cycle.
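
A routing function with an explicit base case might look like the sketch below. The state keys (`review_rounds`, `needs_context`), the node names, and the cap of 3 are all hypothetical. It also assumes some node increments `review_rounds` on every pass, otherwise the cap never triggers.

```python
MAX_REVIEW_ROUNDS = 3  # illustrative cap, tune for your workflow

def route_after_review(state: dict) -> str:
    """Conditional-edge function: decide where to go after Agent B's review."""
    # Base case first: stop bouncing once the round budget is spent
    if state.get("review_rounds", 0) >= MAX_REVIEW_ROUNDS:
        return "finalize"
    if state.get("needs_context"):
        return "agent_a"  # route back, but only while under the cap
    return "finalize"
```

Putting the base case at the top of the function makes the exit path obvious during review: no later branch can route backward once the budget is exhausted.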

5. Timeout set on the wrong layer — outer timeout fires before inner resolves

The orchestrator has a 60-second timeout, but the sub-workflow it’s waiting on has a 90-second timeout. The orchestrator times out and attempts to cancel, but the sub-workflow is still running. The cancel request goes to the sub-workflow’s input queue — which the sub-workflow is not reading because it is blocked waiting on a tool call. Neither completes.

How to spot it: Map every timeout in your system (orchestrator, sub-workflow, tool calls, external API calls). If any outer timeout is shorter than the sum of inner timeouts in a normal run, the outer timeout fires before the inner resolves.
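
Once the timeouts are mapped, the comparison can be automated. The sketch below is one possible audit, assuming the inner waits run sequentially and using their configured timeouts as a conservative stand-in for normal-run durations. `TimeoutNode` and `audit_timeouts` are hypothetical names, not part of any framework.

```python
from dataclasses import dataclass, field

@dataclass
class TimeoutNode:
    name: str
    timeout_s: float
    # Inner waits this layer awaits one after another (assumption)
    children: list["TimeoutNode"] = field(default_factory=list)

def audit_timeouts(node: TimeoutNode) -> list[str]:
    """Report every layer whose timeout is shorter than its inner timeouts combined."""
    problems = []
    inner_total = sum(child.timeout_s for child in node.children)
    if node.children and node.timeout_s < inner_total:
        problems.append(f"{node.name}: {node.timeout_s}s < inner total {inner_total}s")
    for child in node.children:
        problems.extend(audit_timeouts(child))
    return problems
```

For the example above, an orchestrator node with a 60-second timeout wrapping a 90-second sub-workflow is reported as a problem, while the sub-workflow wrapping a 30-second tool call is not.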

Shortest path to fix

Step 1: Detect the deadlock — print all waiting agents and what they are waiting for

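import sys
import threading
import traceback

def dump_thread_stacks():
    # sys._current_frames() maps each thread ID to its topmost stack frame
    for thread_id, frame in sys._current_frames().items():
        print(f"\n--- Thread {thread_id} ---")
        traceback.print_stack(frame)

# Watchdog: dump all stacks 120 seconds from now, when a hang is suspected
watchdog = threading.Timer(120, dump_thread_stacks)
watchdog.daemon = True  # do not keep the process alive just for the dump
watchdog.start()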

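For LangGraph workflows, inspect the saved state (this requires a graph compiled with a checkpointer):

state = graph.get_state(config)
# state.next lists the node(s) scheduled to run next, not the one that just ran
print("Next node(s):", state.next)
print("Pending tasks:", state.tasks)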

Step 2: Visualize the dependency graph and find cycles

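# LangGraph: render the graph to spot cycles (in a notebook)
from IPython.display import Image, display
# draw_mermaid_png() returns PNG bytes; wrap them in Image to display them
display(Image(graph.get_graph().draw_mermaid_png()))
# Or print the Mermaid source:
print(graph.get_graph().draw_mermaid())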

For a manual cycle check:

def has_cycle(graph: dict[str, list[str]]) -> bool:
    visited, rec_stack = set(), set()
    def dfs(node):
        visited.add(node)
        rec_stack.add(node)
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                if dfs(neighbor): return True
            elif neighbor in rec_stack:
                return True
        rec_stack.discard(node)
        return False
    return any(dfs(n) for n in graph if n not in visited)
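
If you also want to know which nodes form the cycle, the standard library can report it. The sketch below uses `graphlib.TopologicalSorter` (Python 3.9+), whose `prepare()` raises `CycleError` with the offending nodes in `args[1]`. The `find_cycle` wrapper and the agent names are illustrative.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Optional

def find_cycle(deps: dict[str, list[str]]) -> Optional[list[str]]:
    """Return one cycle as a list of nodes (first == last), or None if acyclic."""
    try:
        # prepare() validates the whole graph and raises if any cycle exists
        TopologicalSorter(deps).prepare()
    except CycleError as exc:
        return exc.args[1]
    return None

cycle = find_cycle({"agent_a": ["agent_b"], "agent_b": ["agent_a"]})
print("Cycle:", cycle)
```

The returned list names the nodes involved, which is usually enough to locate the offending edge in the workflow definition.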

Step 3: Break the circular dependency with an initialization contract

For the “A needs Y from B, B needs X from A” pattern, one agent must produce a stub or default:

# Agent A produces a stub schema first, Agent B refines it
initial_schema = {
    "endpoint": "/api/users",
    "method": "POST",
    "body": "TBD"  # placeholder — Agent B fills this in
}

# Wire: A(stub) → B(refine schema) → A(implement with real schema)
graph.add_edge("agent_a_stub", "agent_b_refine")
graph.add_edge("agent_b_refine", "agent_a_implement")

Break any mutual dependency by designating one agent as the “provider of defaults” and the other as the “consumer who refines.”

Step 4: Enforce consistent lock acquisition order

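import threading

RESOURCE_ORDER = ["database", "file_system", "message_queue"]
# In-process stand-in: swap in your file-claim registry or row locks
_LOCKS = {name: threading.Lock() for name in RESOURCE_ORDER}

def get_lock(resource: str) -> threading.Lock:
    return _LOCKS[resource]

def acquire_locks(resources: list[str], timeout: float = 10) -> list[threading.Lock]:
    # Always acquire in canonical order to prevent deadlock
    ordered = sorted(set(resources), key=RESOURCE_ORDER.index)
    locks = []
    for r in ordered:
        lock = get_lock(r)
        # acquire() returns False on timeout instead of raising, so check it
        if not lock.acquire(timeout=timeout):
            for held in reversed(locks):
                held.release()  # back out so we never hold a partial set
            raise TimeoutError(f"Could not lock {r!r} within {timeout}s")
        locks.append(lock)
    return locks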

All agents must use this function — never acquire locks ad hoc.

Step 5: Add a watchdog timeout at every blocking wait

import asyncio

class DeadlockError(RuntimeError):
    """Raised when a blocking wait exceeds its timeout."""

async def wait_with_timeout(coro, timeout_seconds: float, label: str):
    try:
        return await asyncio.wait_for(coro, timeout=timeout_seconds)
    except asyncio.TimeoutError:
        raise DeadlockError(
            f"Timed out after {timeout_seconds}s waiting for: {label}. "
            "Possible deadlock. Check the dependency graph."
        ) from None

Set timeouts at every await point in your workflow. The error message should name what you were waiting for to make diagnosis instant.

Prevention

  • Draw the agent dependency graph before implementing it; run a cycle-detection algorithm on it as part of your CI pipeline.
  • Never let two agents hold a lock and request each other’s lock — enforce canonical acquisition order globally.
  • Set explicit timeouts on every blocking wait: tool calls, sub-agent invocations, message-queue reads, and external API calls.
  • In AutoGen, set a max_round limit on the GroupChat and give the GroupChatManager a tie-breaking rule so “you go first” loops terminate; in other multi-agent frameworks such as CrewAI, use the equivalent iteration cap.
  • Use a DAG (directed acyclic graph) constraint in your orchestration layer and reject workflow definitions that contain cycles at startup.
  • For genuinely bidirectional dependencies, restructure into three phases: A-stub → B-refine → A-finalize.
  • Add a watchdog thread that dumps all agent states if no progress is made in N seconds.
  • Instrument lock acquisition latency — a lock that takes over 5 seconds to acquire is a deadlock warning sign.
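
The watchdog item in the list above can be sketched as a small heartbeat monitor. This is one possible shape, not a framework feature. `ProgressWatchdog`, `heartbeat()`, and the `on_stall` callback are hypothetical names, and it assumes your agents call `heartbeat()` whenever they complete a unit of work.

```python
import threading
import time

class ProgressWatchdog:
    """Calls on_stall(idle_seconds) once if no heartbeat arrives in time."""

    def __init__(self, stall_after_s, on_stall, poll_s=1.0):
        self._stall_after_s = stall_after_s
        self._on_stall = on_stall
        self._poll_s = poll_s
        self._last_progress = time.monotonic()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def heartbeat(self):
        # Agents call this whenever they make progress
        with self._lock:
            self._last_progress = time.monotonic()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        # Event.wait returns False on timeout, True once stop() is called
        while not self._stop.wait(self._poll_s):
            with self._lock:
                idle = time.monotonic() - self._last_progress
            if idle >= self._stall_after_s:
                self._on_stall(idle)
                return  # fire once, then exit the thread
```

A typical `on_stall` callback would dump thread stacks or agent states and raise an alert. Using `time.monotonic()` keeps the idle measurement immune to system clock adjustments.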

FAQ

Q: How is an agent deadlock different from an agent loop? A: A loop makes progress (the agent runs, produces output, and iterates). A deadlock makes no progress at all — agents are blocked waiting for a precondition that will never be satisfied. A loop burns budget; a deadlock burns wall-clock time and holds resources.

Q: Does Temporal prevent deadlocks? A: Temporal’s durable execution prevents crash-induced hangs, but it cannot prevent a workflow that genuinely waits forever on a signal or activity that never arrives. You still need timeouts on every workflow.wait_condition() and workflow.execute_activity() call.

Q: Can I detect a deadlock without modifying agent code? A: Yes. Monitor CPU and token usage. A deadlocked pipeline makes zero LLM API calls over a multi-minute window and usually shows near-zero CPU as well (agents spinning on a polling loop are the exception: CPU stays busy while token usage is still zero). Set up an alert: if token usage drops to zero and the workflow is still marked “running,” trigger a deadlock investigation.
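
The alert rule from that answer can be written as a small predicate. The sketch below assumes you already collect one token-usage sample per minute and have a workflow status string; `looks_deadlocked` and the five-sample window are illustrative choices.

```python
def looks_deadlocked(status: str, token_samples: list[int], window: int = 5) -> bool:
    """True if the workflow is 'running' but used zero tokens for `window` samples."""
    recent = token_samples[-window:]
    # Require a full window so a freshly started workflow does not trip the alert
    return status == "running" and len(recent) == window and sum(recent) == 0
```

For example, a running workflow whose last five per-minute samples are all zero trips the alert, while one with fewer than five samples does not.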

Q: What is the safest way to break a live deadlock without data loss? A: First snapshot the current state of all agents (their inputs, outputs, and lock holdings). Then cancel the agent that holds the fewest resources and retry it after the other agent completes. Avoid forcibly killing agents mid-write — wait for write operations to complete before canceling.

Tags: #AI coding #Agents #Troubleshooting