Cycle in Agent Call Graph Goes Undetected

Agents call each other in a loop that never terminates because the orchestrator has no cycle detection. Here's how to find cycles before they run forever.

Your LangGraph or AutoGen orchestrator has a “planner” agent that delegates subtasks to a “researcher” and a “coder.” Under certain conditions, such as when the coder produces output the planner deems too abstract, the planner routes back to the researcher for more detail. The researcher asks the coder to produce a concrete example. The coder’s example is too abstract again. The loop runs 400 times before it exhausts the token budget.

Or an OpenAI Swarm implementation has agents that hand off to each other with no depth limit: Agent A → Agent B → Agent C → Agent A. Each agent in the cycle adds a small amount of reasoning. After 200 hops, the pipeline has spent $40 and produced nothing useful. No error is raised; the cycle just runs.

Common causes

1. Conditional routing with no base case

A conditional edge in LangGraph routes to agent B when quality is below a threshold. Agent B’s output always scores just below the threshold because the threshold was set too strictly. The route fires every time, creating a cycle with no base case that ever evaluates to “move on.”

How to spot it: For every conditional routing function that can route backward to a prior node, check whether there is a code path that does NOT route backward. If all branches route backward or to a holding state that always routes backward, there is no base case.
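A minimal sketch of a routing function that always has a forward path. The names `route_after_coder`, `QUALITY_THRESHOLD`, `MAX_REVISIONS`, and the state keys are hypothetical and not part of any framework. The point is only that one branch stops routing backward even when quality never improves.

```python
QUALITY_THRESHOLD = 0.8  # hypothetical score cutoff
MAX_REVISIONS = 3        # hypothetical cap on backward routes


def route_after_coder(state: dict) -> str:
    """Return the name of the next node for the given state."""
    if state["quality"] >= QUALITY_THRESHOLD:
        return "reviewer"    # normal forward path
    if state["revisions"] >= MAX_REVISIONS:
        return "escalate"    # base case: stop looping even if quality is low
    return "researcher"      # backward route, bounded by the counter


# Quality never improves here, yet the router stops sending work backward.
state = {"quality": 0.5, "revisions": 0}
route_log = []
while True:
    next_node = route_after_coder(state)
    route_log.append(next_node)
    if next_node != "researcher":
        break
    state["revisions"] += 1
```

If you cannot point to a branch like the `escalate` one in your own router, the route has no base case.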

2. Visited-node set not maintained across the call chain

Each agent invocation is stateless. Agent A calls Agent B, which calls Agent C, which calls Agent A. None of them checks “have I been called in this chain before?” because any “visited” set lives in the in-process memory of a single invocation and is never passed on to the next one.

How to spot it: Search for any “visited” set, “call stack,” or “depth counter” passed through the agent call chain. If these are absent or if they are stored only in the calling agent’s local variables (not passed forward), cycles are undetectable across invocation boundaries.

3. Agent routing decision is made by LLM with no depth constraint

The routing logic is: “Ask the LLM which agent should handle this next.” The LLM can produce any agent name, including the one that was just executing. With no depth limit or cycle-detection constraint injected into the routing prompt, the LLM can freely generate cycles.

How to spot it: Check whether the routing prompt includes the call history or depth. If the LLM receives only the current task and available agents (not the path taken to get here), it has no information to detect or avoid cycles.

4. Dynamic agent registration allows cycles at registration time

Agents register their “can delegate to” list at startup. Agent A says “can delegate to B, C.” Agent B says “can delegate to A, C.” Each registration looks fine on its own, but together they create the cycle A → B → A in the capability graph. The orchestrator doesn’t validate this graph for cycles at registration time, so it only discovers cycles at runtime after they occur.

How to spot it: Build the delegation graph from agent registrations and run a cycle-detection algorithm (DFS with a recursion stack) on it at startup. If the graph has a cycle, the orchestrator should reject the registration.
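One way to reject a cyclic registration as it arrives, rather than validating the whole graph afterwards, is to check reachability before adding edges. This is a sketch under assumptions: `DelegationRegistry`, `register`, and `CycleRejectedError` are hypothetical names, and it assumes each agent registers all of its delegates in one call.

```python
class CycleRejectedError(ValueError):
    """Raised when a registration would close a delegation cycle."""


class DelegationRegistry:
    def __init__(self) -> None:
        self.edges: dict[str, list[str]] = {}

    def _reaches(self, start: str, target: str) -> bool:
        """Iterative DFS: can `start` reach `target` via registered edges?"""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(self.edges.get(node, []))
        return False

    def register(self, agent: str, delegates: list[str]) -> None:
        # A new edge agent -> d closes a cycle if d can already reach agent.
        for d in delegates:
            if self._reaches(d, agent):
                raise CycleRejectedError(f"{agent} -> {d} would create a cycle")
        self.edges[agent] = list(delegates)


registry = DelegationRegistry()
registry.register("A", ["B", "C"])
try:
    registry.register("B", ["A", "C"])  # the A/B example from the text
    rejected = False
except CycleRejectedError:
    rejected = True
```

The rejected registration leaves the registry unchanged, so the orchestrator can log the offending agent and keep starting up with the acyclic part of the graph.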

5. Max-depth check was added but checked at the wrong layer

A depth < 10 guard was added to the routing function, but the routing function is called by a wrapper that catches MaxDepthError and silently re-invokes with depth=0 to “retry the routing cleanly.” The depth counter resets, so the guard never stops the cycle.

How to spot it: Trace every path where MaxDepthError (or equivalent) is caught. If any catch handler resets the depth counter rather than propagating the error, depth limiting is ineffective.
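A sketch of a retry wrapper that keeps the guard effective. The functions `route` and `route_with_retry` are hypothetical, and `TimeoutError` stands in for whatever transient failure the wrapper is meant to retry. The depth error is re-raised untouched and the depth value is never reset.

```python
class MaxDepthError(RuntimeError):
    """Raised when the routing depth guard trips."""


MAX_DEPTH = 10


def route(task: str, depth: int) -> str:
    # Hypothetical routing function carrying the depth guard from the text.
    if depth >= MAX_DEPTH:
        raise MaxDepthError(f"depth {depth} reached for task {task!r}")
    return "researcher"


def route_with_retry(task: str, depth: int, retries: int = 2) -> str:
    """Retry transient failures, but never swallow the depth guard."""
    attempt = 0
    while True:
        try:
            return route(task, depth)  # depth is passed through unchanged
        except MaxDepthError:
            raise                      # propagate; do not re-invoke with depth=0
        except TimeoutError:
            attempt += 1
            if attempt > retries:
                raise
```

The explicit `except MaxDepthError: raise` is redundant in this small example, but it documents the intent and protects the guard if someone later broadens the retry clause.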

6. Agent spawns sub-agents that re-enter the same pipeline

Agent A is part of Pipeline P. It spawns a sub-agent that runs Pipeline P to handle a subtask. Pipeline P eventually spawns Agent A again. The recursion is across pipeline boundaries, which makes it invisible to any cycle detection within a single pipeline.

How to spot it: Check whether any agent in a pipeline can trigger the same pipeline (or another pipeline that triggers this one) as a sub-workflow. Cross-pipeline cycles are harder to detect but follow the same pattern.
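Cross-pipeline re-entry can be caught with the same idea as a call path, applied to pipeline names. The sketch below assumes a hypothetical `PipelineContext` that every pipeline launch receives from its parent. `run_pipeline` and the `spawns` mapping are illustrative stand-ins for real pipeline execution.

```python
from dataclasses import dataclass


class PipelineReentryError(RuntimeError):
    """Raised when a pipeline is started from inside its own run."""


@dataclass(frozen=True)
class PipelineContext:
    # Pipelines currently active in this run, outermost first.
    active: tuple[str, ...] = ()

    def enter(self, pipeline: str) -> "PipelineContext":
        if pipeline in self.active:
            chain = " -> ".join(self.active + (pipeline,))
            raise PipelineReentryError(f"Pipeline re-entered: {chain}")
        return PipelineContext(self.active + (pipeline,))


def run_pipeline(
    name: str, spawns: dict[str, list[str]], ctx: PipelineContext
) -> list[str]:
    """Walk the sub-pipelines that `name` spawns and return the visit order."""
    child = ctx.enter(name)
    visited = [name]
    for sub in spawns.get(name, []):
        visited += run_pipeline(sub, spawns, child)
    return visited
```

Because the context is immutable and handed down explicitly, sibling sub-pipelines do not see each other, and only true re-entry along one chain is rejected.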

Shortest path to fix

Step 1: Add cycle detection to the graph at definition time

class CycleDetectedError(Exception):
    """Raised when the agent delegation graph contains a cycle."""


def validate_no_cycles(edges: dict[str, list[str]]) -> None:
    """Raise CycleDetectedError if the agent delegation graph contains a cycle."""
    visited: set[str] = set()
    path: list[str] = []  # current DFS path, kept in call order

    def dfs(node: str) -> None:
        visited.add(node)
        path.append(node)
        for neighbor in edges.get(node, []):
            if neighbor in path:
                # Back-edge: report only the nodes that form the cycle, in order.
                cycle_path = path[path.index(neighbor):] + [neighbor]
                raise CycleDetectedError(
                    f"Cycle detected in agent graph: {' → '.join(cycle_path)}"
                )
            if neighbor not in visited:
                dfs(neighbor)
        path.pop()

    for node in edges:
        if node not in visited:
            dfs(node)

# Run at agent registration time:
AGENT_EDGES = {
    "planner": ["researcher", "coder"],
    "researcher": ["coder"],  # OK — no back-edge to planner
    "coder": [],              # leaf node
}
validate_no_cycles(AGENT_EDGES)

Step 2: Thread a call-path token through every agent invocation

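from dataclasses import dataclass

# CycleDetectedError is the exception raised by validate_no_cycles in Step 1.
class MaxDepthError(Exception):
    """Raised when a call chain exceeds its depth limit."""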

@dataclass
class CallContext:
    run_id: str
    call_path: list[str]  # ordered list of agent names invoked so far
    max_depth: int = 20

    def enter_agent(self, agent_name: str) -> "CallContext":
        if agent_name in self.call_path:
            cycle = " → ".join(self.call_path + [agent_name])
            raise CycleDetectedError(f"Cycle detected: {cycle}")
        if len(self.call_path) >= self.max_depth:
            raise MaxDepthError(
                f"Max depth {self.max_depth} reached: {' → '.join(self.call_path)}"
            )
        return CallContext(
            run_id=self.run_id,
            call_path=self.call_path + [agent_name],
            max_depth=self.max_depth,
        )

# Pass context through every agent invocation:
def invoke_agent(agent_name: str, task: str, ctx: CallContext) -> str:
    child_ctx = ctx.enter_agent(agent_name)
    agent = AGENT_REGISTRY[agent_name]
    return agent.run(task, ctx=child_ctx)

Step 3: Inject call history into LLM routing prompts

def build_routing_prompt(task: str, call_path: list[str]) -> str:
    history = " → ".join(call_path) if call_path else "none"
    return f"""
You must choose the next agent to handle this task.

Task: {task}

Agents already invoked in this chain (DO NOT route back to any of these):
{history}

Available agents (choose one that has NOT already been invoked):
- researcher: gathers information
- coder: implements solutions
- reviewer: checks quality

Respond with ONLY the agent name. No other text.
"""

With call history in the prompt, the routing LLM has the information to avoid cycles.
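The prompt only asks the model to avoid visited agents; it does not enforce it. A small validation step on the reply closes that gap. This is a sketch: `parse_route` and `InvalidRouteError` are hypothetical names, and the reply is assumed to be the raw text returned by the routing model.

```python
class InvalidRouteError(RuntimeError):
    """Raised when the router's answer cannot be used."""


def parse_route(llm_reply: str, available: set[str], call_path: list[str]) -> str:
    """Validate the routing model's reply before acting on it."""
    choice = llm_reply.strip().lower()
    if choice not in available:
        raise InvalidRouteError(f"Unknown agent: {choice!r}")
    if choice in call_path:
        raise InvalidRouteError(
            f"Router chose {choice!r}, which is already in the call path {call_path}"
        )
    return choice
```

On `InvalidRouteError`, the caller can re-prompt a bounded number of times or fail the run; either way the decision is made in code, not left to the model.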

Step 4: Add a hard depth limit at the orchestration layer

MAX_AGENT_DEPTH = 15

def run_agent_chain(task: str, ctx: CallContext) -> str:
    # Checked by the orchestrator itself, independently of CallContext.max_depth.
    if len(ctx.call_path) >= MAX_AGENT_DEPTH:
        raise MaxDepthError(
            f"Agent chain reached maximum depth {MAX_AGENT_DEPTH}. "
            "Possible cycle; review the routing logic."
        )
    agent_name = route_task(task)
    return invoke_agent(agent_name, task, ctx)

The depth limit is a safety net independent of cycle detection — it catches cycles that escape the visited-set check.

Step 5: Test for cycles in CI using graph validation

# Run cycle detection as part of the test suite
python -m pytest tests/test_agent_graph.py -k "cycle or acyclic" -v

# tests/test_agent_graph.py
import pytest

def test_validator_detects_injected_cycle():
    # Inject a known cycle and confirm detection works
    graph = dict(build_agent_delegation_graph())
    graph["coder"] = ["planner"]
    with pytest.raises(CycleDetectedError):
        validate_no_cycles(graph)

def test_production_graph_is_acyclic():
    # The actual production graph must pass
    graph = PRODUCTION_AGENT_EDGES
    validate_no_cycles(graph)  # should not raise

Prevention

  • Run cycle detection on the agent delegation graph at startup — reject any registration that creates a cycle.
  • Thread a call_path list through every agent invocation boundary; check for the current agent’s name in the path before executing.
  • Include the call history in every LLM routing prompt so the model has information to avoid routing back to already-visited agents.
  • Add a hard max-depth limit as a secondary safety net independent of cycle detection.
  • Write a CI test that validates the production agent graph is acyclic after every graph definition change.
  • For legitimately iterative patterns (e.g., refine-until-quality-passes), use an explicit iteration counter with a hard cap instead of routing edges — make the loop visible and bounded in the graph definition.
  • Monitor the depth distribution of agent call chains in production; a tail that grows to depth 10+ is a cycle-near-miss.
  • Distinguish between allowed cycles (explicit bounded retry loops with a counter) and unintended cycles (unbounded delegation loops) in your graph definition.
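For the monitoring item above, a minimal sketch of summarizing chain depths for one reporting window. `summarize_depths` and `DEEP_CHAIN_DEPTH` are hypothetical names, and the input is assumed to be the final depth of each completed call chain, however your tracing records it.

```python
from collections import Counter

DEEP_CHAIN_DEPTH = 10  # depth the list above treats as a cycle near-miss


def summarize_depths(depths: list[int]) -> dict:
    """Summarize observed call-chain depths for one reporting window."""
    histogram = Counter(depths)
    deep = sum(n for d, n in histogram.items() if d >= DEEP_CHAIN_DEPTH)
    return {
        "max_depth": max(depths, default=0),
        "deep_chains": deep,
        "deep_fraction": deep / len(depths) if depths else 0.0,
    }


summary = summarize_depths([2, 3, 3, 4, 11, 2, 3, 12])
```

A nonzero `deep_chains` count is worth an alert on its own, since healthy chains in this example cluster at depth 2 to 4.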

FAQ

Q: Does LangGraph prevent cycles? A: LangGraph supports cycles explicitly, because they are how you implement retry and iterative refinement loops. It does not prevent cycles at definition time. It provides a recursion_limit setting (default 25), passed in the run config, that raises GraphRecursionError once the graph has executed that many steps without finishing. This is the built-in cycle guard. Set it to a reasonable value for your use case.

Q: What is the right max depth for a multi-agent chain? A: For most pipelines, 10 is generous. A chain deeper than 10 usually indicates a routing issue rather than legitimate task complexity. Set the hard limit to 15 to give headroom for complex tasks, but alert when depth exceeds 8.

Q: Can a DAG workflow ever produce a runtime cycle? A: A static DAG cannot have a cycle by definition. But dynamic routing — where the next node is determined at runtime by the current agent’s output — can produce a cycle even in a “DAG” framework. Dynamic routing requires runtime cycle detection (the call-path approach) rather than static graph analysis.

Q: How do I implement a legitimate “refine until good enough” loop without creating a cycle risk? A: Use an explicit iteration counter, not a routing cycle. while quality < threshold and iteration < 5: output = refine(output); iteration += 1. The loop terminates at 5 regardless of quality. If quality is still below threshold at iteration 5, fail and escalate — do not continue looping.
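The bounded loop from the last answer, written out as a sketch. `refine_until_good`, `QualityNotReachedError`, and the `score` and `refine` callables are hypothetical; in practice they would wrap your evaluator and your refining agent.

```python
MAX_ITERATIONS = 5


class QualityNotReachedError(RuntimeError):
    """Raised when the cap is hit before quality passes."""


def refine_until_good(output, score, refine, threshold):
    """Refine `output` until `score(output) >= threshold` or the cap is hit.

    `score` and `refine` are caller-supplied callables.
    """
    for _ in range(MAX_ITERATIONS):
        if score(output) >= threshold:
            return output
        output = refine(output)
    # Check the result of the final refinement before giving up.
    if score(output) >= threshold:
        return output
    raise QualityNotReachedError(
        f"quality below {threshold} after {MAX_ITERATIONS} refinements"
    )
```

Raising at the cap, instead of returning the last draft, forces the caller to decide how to escalate.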

Tags: #AI coding #Agents #Troubleshooting