Flaky Tool Triggers an Agent Retry Storm

One unreliable tool call causes your agent to retry hundreds of times, exhausting budget and rate limits. Here's how to add backoff and circuit-breaking.

Your Inngest or Temporal workflow calls a code-execution tool that works 90% of the time. On a transient timeout, the agent retries with zero delay, no backoff, and no retry cap, so the tool is called 50 times in 3 seconds. The execution sandbox rate-limits the agent at 10 req/s, so every further retry gets a 429, which the agent also retries. 500 LLM calls later, the pipeline has burned 300K tokens, hit the provider rate limit, and still not finished the original task. What started as a 5-second transient hiccup has turned into a 10-minute outage.

Common causes

1. No retry cap — agent loops until budget exhaustion

The agent’s retry logic has while not success: retry() with no maximum. It will retry forever or until the cost budget runs out, whichever comes first. This is the most common pattern in hand-rolled retry logic.

How to spot it: Search for retry loops in your agent or tool wrapper code. Any loop missing a max_attempts or attempt < N guard will grow unbounded on persistent failures.

2. No exponential backoff — retries arrive faster than recovery

The tool fails and returns a 503 (service unavailable). The agent retries in 100ms. Still failing. Retries every 100ms. A service that is overloaded recovers faster when callers back off; constant-interval retries keep the service under pressure and extend the outage.

How to spot it: Log the timestamps of retry attempts. If the inter-retry interval is constant (100ms, 500ms) rather than growing, there is no exponential backoff.
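The timestamp check can be automated with a few lines. This is a minimal sketch, assuming the retry timestamps have already been extracted from your logs as seconds; the function name and the 1.5x growth threshold are illustrative choices, not part of any library.

```python
def classify_retry_spacing(timestamps, growth_factor=1.5):
    """Label a list of retry timestamps (in seconds) by how the gaps evolve.

    Returns "backoff" if every gap is at least growth_factor times the
    previous gap, "no_backoff" otherwise, and "unknown" if there are
    fewer than two gaps to compare.
    """
    intervals = [later - earlier
                 for earlier, later in zip(timestamps, timestamps[1:])]
    if len(intervals) < 2:
        return "unknown"
    growing = all(nxt >= prev * growth_factor
                  for prev, nxt in zip(intervals, intervals[1:]))
    return "backoff" if growing else "no_backoff"


# Retries every 100 ms: the gaps never grow.
print(classify_retry_spacing([0.0, 0.1, 0.2, 0.3, 0.4]))
# Gaps of 1, 2, 4, 8 seconds: each gap doubles.
print(classify_retry_spacing([0, 1, 3, 7, 15]))
```

A threshold below 2.0 leaves room for jitter, which makes real backoff gaps slightly irregular.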

3. 429 rate-limit response not detected — treated as a generic error

The agent’s error handler checks if status_code != 200: retry(). A 429 means “stop calling me for N seconds”, and the Retry-After header, when the server sends it, tells you how long to wait. An agent that retries a 429 immediately generates more 429s, accelerating the storm.

How to spot it: Check whether your error handler differentiates HTTP 429 from other errors. If 429 is handled identically to 500 or 503, the agent cannot respect rate limits.
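One way to make the distinction explicit is to route every status code through a small classifier before deciding what to do. The sketch below is illustrative: the function name, the action labels, and the exact set of 5xx codes treated as retryable are assumptions to adapt to your tool.

```python
def retry_action(status_code):
    """Map an HTTP status code to the action the caller should take."""
    if 200 <= status_code < 300:
        return "ok"
    if status_code == 429:
        # Rate limited: wait for the server-specified period, not a generic backoff.
        return "wait_retry_after"
    if status_code in (500, 502, 503, 504):
        # Likely transient server-side failure: retry with exponential backoff.
        return "backoff"
    # Other errors (for example 400 or 404) will not succeed on retry.
    return "fail"


for code in (200, 429, 503, 400):
    print(code, retry_action(code))
```

Treating non-retryable client errors as "fail" matters as much as special-casing 429: retrying a malformed request only adds load.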

4. Parallel sub-agents all retry simultaneously

A fan-out launches 10 parallel agents that all use the same flaky tool. When the tool fails, all 10 agents retry at the same moment. Their combined retry load is 10x the single-agent load, saturating the tool even faster.

How to spot it: Check whether retries across parallel agents are coordinated (e.g., via a shared circuit breaker) or completely independent. Independent retries in parallel compound into a storm.

5. Retry logic wraps the entire LLM call, not just the tool call

When the tool fails, the agent retries the entire reasoning loop — re-calling the LLM, re-generating the tool call, re-executing it. Each “retry” costs a full LLM call even though the issue is in the tool, not the model’s reasoning. This multiplies cost by the model’s per-call price.

How to spot it: Check what exactly is retried. If retry() re-invokes the LLM and not just the tool executor, each retry pays for a full LLM call on top of the tool call, which can cost 10-50x more than necessary depending on the model's per-call price.

6. No circuit breaker — retries continue after the tool is clearly down

After 10 consecutive failures from the same tool endpoint, the tool is clearly unavailable. The agent should stop calling it and escalate. Without a circuit breaker, it continues retrying indefinitely, burning budget while the tool stays down.

How to spot it: Count consecutive failures for each tool in your logs. If there are long runs of 20+ consecutive failures with no “tool disabled” or “circuit open” event, there is no circuit breaker.
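Counting failure runs by hand gets tedious on large logs. The following sketch assumes the log has already been reduced to chronological (tool_name, succeeded) pairs; that record shape and the function name are hypothetical.

```python
from itertools import groupby


def longest_failure_runs(events):
    """Return {tool_name: longest run of consecutive failures}.

    events is an iterable of (tool_name, succeeded) pairs in
    chronological order.
    """
    outcomes_by_tool = {}
    for tool, succeeded in events:
        outcomes_by_tool.setdefault(tool, []).append(succeeded)
    return {
        tool: max(
            (sum(1 for _ in run)
             for succeeded, run in groupby(outcomes) if not succeeded),
            default=0,
        )
        for tool, outcomes in outcomes_by_tool.items()
    }


events = [
    ("sandbox", True), ("sandbox", False), ("search", True),
    ("sandbox", False), ("sandbox", False), ("sandbox", True),
    ("sandbox", False),
]
print(longest_failure_runs(events))  # {'sandbox': 3, 'search': 0}
```

Any tool whose longest run is far above your intended failure threshold was being called long after it should have been cut off.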

Shortest path to fix

Step 1: Add exponential backoff with jitter and a hard cap

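import logging
import random
import time

logger = logging.getLogger(__name__)

class TransientError(Exception):
    """Placeholder: raise this from your tool wrapper for failures worth retrying."""

def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError as e:
            if attempt == max_attempts:
                raise  # hard cap reached: surface the error to the caller
            # The delay doubles on each attempt (1s, 2s, 4s, ...) up to max_delay.
            delay = min(base_delay * (2 ** (attempt - 1)), max_delay)
            # Up to 10% random jitter keeps parallel callers out of lockstep.
            jitter = random.uniform(0, delay * 0.1)
            wait = delay + jitter
            logger.warning(
                "Tool call failed (attempt %d/%d): %s, retrying in %.1fs",
                attempt, max_attempts, e, wait
            )
            time.sleep(wait)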

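With max_attempts=5 and base_delay=1.0, the waits are roughly 1, 2, 4, and 8 seconds, so the fifth and final attempt fires about 15 seconds after the first failure (plus up to 10% jitter). That is enough for most transient issues to recover; raise base_delay if your tool typically needs longer.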
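Before settling on parameters, it helps to print the schedule they produce. The helper below is a small sketch (its name is illustrative) that reproduces the delay formula from Step 1 without jitter, so the total wait can be checked by hand.

```python
def backoff_schedule(max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Return the jitter-free wait, in seconds, before each retry.

    There is one wait after every failed attempt except the last,
    so the list has max_attempts - 1 entries.
    """
    return [min(base_delay * (2 ** (attempt - 1)), max_delay)
            for attempt in range(1, max_attempts)]


waits = backoff_schedule()
print(waits)       # [1.0, 2.0, 4.0, 8.0]
print(sum(waits))  # 15.0
```

With more attempts the cap starts to matter: at max_attempts=8 the seventh wait would be 64 seconds uncapped, and max_delay trims it to 60.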

Step 2: Handle 429 explicitly using the Retry-After header

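import logging
import time

import httpx

logger = logging.getLogger(__name__)

class RateLimitExhaustedError(Exception):
    """Raised when the tool is still rate-limiting after all waits."""

def call_tool_with_rate_limit(url: str, payload: dict) -> dict:
    for _ in range(5):
        resp = httpx.post(url, json=payload, timeout=30)
        if resp.status_code == 429:
            try:
                retry_after = int(resp.headers.get("Retry-After", 60))
            except ValueError:
                # Retry-After can also be an HTTP date; fall back to the default.
                retry_after = 60
            logger.warning("Rate limited, waiting %ds", retry_after)
            time.sleep(retry_after + 1)  # one extra second as a safety margin
            continue
        resp.raise_for_status()
        return resp.json()
    raise RateLimitExhaustedError("Still rate-limited after 5 waits")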

Never retry a 429 without reading and honoring the Retry-After header.
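Retry-After comes in two forms: a number of seconds, or an HTTP date. A parser that accepts both avoids crashing on the date form. This is a sketch using only the standard library; the function name, the 60-second default, and the injectable now parameter (there to make the function testable) are assumptions.

```python
import email.utils
from datetime import datetime, timezone


def parse_retry_after(value, default=60.0, now=None):
    """Convert a Retry-After header value into a wait in seconds."""
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        # Delta-seconds form, e.g. "120".
        return float(value)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        when = email.utils.parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means the wait is already over.
    return max(0.0, (when - now).total_seconds())


print(parse_retry_after("120"))  # 120.0
print(parse_retry_after(None))   # 60.0
```

Passing resp.headers.get("Retry-After") to a function like this keeps the sleep logic in one place, whichever form the server uses.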

Step 3: Implement a circuit breaker

import logging
import threading
import time
from enum import Enum

logger = logging.getLogger(__name__)

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitState(Enum):
    CLOSED = "closed"        # normal operation
    OPEN = "open"            # failing: reject calls
    HALF_OPEN = "half_open"  # testing recovery

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.opened_at: float = 0.0
        self._lock = threading.Lock()  # the breaker is shared across threads

    def call(self, fn):
        with self._lock:
            if self.state == CircuitState.OPEN:
                # time.monotonic() is unaffected by wall-clock adjustments.
                if time.monotonic() - self.opened_at > self.recovery_timeout:
                    self.state = CircuitState.HALF_OPEN
                else:
                    raise CircuitOpenError("Circuit breaker open: tool is down")
        try:
            result = fn()  # run outside the lock so callers are not serialized
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        with self._lock:
            self.failure_count = 0
            self.state = CircuitState.CLOSED

    def _on_failure(self):
        with self._lock:
            self.failure_count += 1
            # A failed probe in HALF_OPEN reopens the circuit immediately.
            if (self.state == CircuitState.HALF_OPEN
                    or self.failure_count >= self.failure_threshold):
                self.state = CircuitState.OPEN
                self.opened_at = time.monotonic()
                logger.error("Circuit breaker OPENED for tool after %d failures", self.failure_count)

Instantiate one CircuitBreaker per tool and share it across all agents using that tool.
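Sharing works only if every agent gets the same instance. One possible way to guarantee that inside a single process is a small registry keyed by tool name. The names get_breaker and _breakers are hypothetical, and the factory argument stands in for your breaker class.

```python
import threading

_breakers = {}
_breakers_lock = threading.Lock()


def get_breaker(tool_name, factory):
    """Return the single shared breaker for tool_name.

    factory is called with no arguments the first time a tool name
    is seen; later calls return the instance created then.
    """
    with _breakers_lock:
        if tool_name not in _breakers:
            _breakers[tool_name] = factory()
        return _breakers[tool_name]


# Using a plain object as a stand-in for a breaker instance:
first = get_breaker("sandbox", object)
second = get_breaker("sandbox", object)
print(first is second)  # True
```

Agents would then call something like get_breaker("sandbox", CircuitBreaker).call(fn) instead of constructing their own breaker. Across processes or machines the state has to live in a shared store instead.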

Step 4: Retry only the tool call, not the full LLM call

# WRONG — retries the full LLM reasoning loop
def agent_step_with_retry(state):
    for _ in range(5):
        try:
            return llm.invoke(state)  # full LLM call + tool call
        except ToolError:
            continue

# CORRECT — retries only the tool execution
def agent_step(state):
    tool_call = llm.plan_tool_call(state)  # LLM call — no retry needed here
    result = retry_with_backoff(lambda: execute_tool(tool_call))  # retry only tool
    return llm.process_result(state, result)  # LLM call — no retry needed

Step 5: Coordinate retries across parallel agents with a shared semaphore

import threading

_tool_semaphore = threading.Semaphore(5)  # max 5 concurrent calls to this tool

def call_tool_safe(payload):
    with _tool_semaphore:
        return retry_with_backoff(lambda: call_tool(payload))

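For distributed agents, a process-local semaphore is not enough, because each process gets its own copy. Use a Redis-backed rate limiter instead (e.g., a rate-limiting library with a Redis backend, or a Lua script with INCR and EXPIRE).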

Prevention

  • Wrap every tool call in a retry function with exponential backoff, jitter, and a hard max-attempts cap (5 is usually right).
  • Handle HTTP 429 specially: read the Retry-After header and sleep at least that long before retrying.
  • Add a circuit breaker per tool; open the circuit after 5 consecutive failures, test recovery after 60 seconds.
  • Retry only the tool execution layer, not the full LLM reasoning loop — retrying the LLM is expensive and usually unnecessary.
  • Coordinate parallel agents with a shared semaphore or rate limiter so they do not all retry simultaneously.
  • Set a per-tool retry budget (e.g., max 3 retries per task, not per agent call) so a persistent failure fails fast.
  • Test your retry path in CI with a mock tool that returns failures on the first 2 calls and succeeds on the 3rd.
  • Alert when a circuit breaker opens — it is a production incident, not a silent retry that resolves itself.
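The CI test suggested above can be sketched with a mock that fails a fixed number of times. Everything here is illustrative: FlakyTool is a hypothetical mock, and retry is a stand-in for your real wrapper, written with an injectable sleep so the tests run instantly.

```python
class FlakyTool:
    """Mock tool that fails the first `failures` calls, then succeeds."""

    def __init__(self, failures=2):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError(f"simulated transient failure #{self.calls}")
        return "ok"


def retry(fn, max_attempts=5, sleep=lambda seconds: None):
    """Stand-in retry wrapper with exponential delays and a hard cap."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            sleep(2 ** (attempt - 1))


def test_recovers_after_two_failures():
    tool = FlakyTool(failures=2)
    assert retry(tool) == "ok"
    assert tool.calls == 3  # two failures, then one success


def test_gives_up_at_cap():
    tool = FlakyTool(failures=10)
    try:
        retry(tool, max_attempts=3)
    except ConnectionError:
        pass
    else:
        raise AssertionError("expected the retry cap to surface the error")
    assert tool.calls == 3  # never called more than max_attempts times
```

The second test is the one that catches retry storms: it fails if the wrapper ever calls the tool more often than the cap allows.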

FAQ

Q: Should I use tenacity or write retry logic by hand? A: Use tenacity (Python) or retry (JS/TS). They are battle-tested and handle edge cases (thread safety, async, jittered delays) that hand-rolled loops miss. The @retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=1, max=60)) decorator covers most cases. By default tenacity retries on any exception, so add retry=retry_if_exception_type(...) to limit retries to transient failures.

Q: How do I pick the right failure threshold for a circuit breaker? A: Use 5 consecutive failures as a starting point. Monitor the circuit breaker’s open/close events for a week and adjust based on false positives (opened unnecessarily) and false negatives (should have opened sooner). Never use a percentage-based threshold for low-volume tools — use consecutive counts.

Q: Does Temporal handle this automatically? A: Partly. Temporal activities have built-in retries configured through RetryPolicy (backoff coefficient, max attempts, max interval). Set a RetryPolicy on every activity that calls an external tool, and set the maximum attempts explicitly, because the default policy retries without an attempt limit. Temporal does not implement circuit breaking, so add that in your activity code.

Q: What if the tool is always slow, not failing? A: Slow tool calls do not trigger retry logic — they just hold the agent’s thread. Add a timeout to every tool call (httpx.post(..., timeout=30)) so slow calls fail fast and enter the retry/circuit-breaking path. A tool that takes 5 minutes per call is functionally a failure.
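For tools that are not HTTP calls and expose no timeout parameter, one option is to enforce a deadline from the caller's side. The sketch below uses the standard library's thread pool; ToolTimeoutError and call_with_timeout are hypothetical names. Note the limitation stated in the comments: the worker thread is not killed, the caller only stops waiting for it.

```python
import concurrent.futures
import time


class ToolTimeoutError(Exception):
    """The tool call did not finish within its deadline."""


# Shared pool; size it to the number of concurrent tool calls you expect.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_s):
    """Run fn() in a worker thread and stop waiting after timeout_s seconds."""
    future = _executor.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # cancel() only helps if the call has not started yet; a running
        # call keeps its thread until it finishes on its own.
        future.cancel()
        raise ToolTimeoutError(f"tool call exceeded {timeout_s}s") from None


print(call_with_timeout(lambda: "fast result", timeout_s=1.0))
```

Raising a dedicated error lets the retry wrapper and circuit breaker treat a slow call like any other failure. Where the tool has a native timeout, as httpx does, prefer that, since it actually releases the underlying resources.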

Tags: #AI coding #Agents #Troubleshooting