Flaky Tool Triggers an Agent Retry Storm

One unreliable tool call makes your agent retry hundreds of times, burning budget and tripping rate limits. Add a retry cap, backoff with jitter, 429 handling, and a circuit breaker.

Published: May 25, 2026 Updated: Jun 17, 2026 Author: AI Productivity Guide Team

Your Inngest, Temporal, or LangGraph workflow calls a tool (code execution, an HTTP API, a sandbox) that works 90% of the time. On a transient timeout the agent retries — with zero delay, no backoff, no cap. The tool gets called 50 times in 3 seconds. The sandbox rate-limits at 10 req/s, so every retry now returns 429, which the agent also retries. Five hundred LLM calls later the pipeline has burned 300K tokens, hit the provider rate limit, and the original task still is not done. A 5-second hiccup became a 10-minute outage.

Fastest fix: cap retries (5 attempts is usually right), add exponential backoff with jitter, treat HTTP 429 specially by honoring Retry-After, and put a circuit breaker in front of each tool so the agent stops hammering an endpoint that is clearly down. Then retry only the tool call — never the whole LLM reasoning loop.

Which bucket are you in?

Symptom in your logs	Most likely cause	Jump to
Hundreds of calls to one tool, never stops	No retry cap (or framework default is “unlimited”)	Step 1
Retries at a fixed 100ms / 500ms interval	No exponential backoff	Step 1
Wall of `429` responses, each immediately retried	`429` treated as a generic error	Step 2
10 parallel agents all spike at the same instant	No jitter / no shared limiter	Steps 1 and 5
Cost is 10-50x what the work should cost	Retrying the LLM call, not just the tool	Step 4
Long run of 20+ consecutive failures, no recovery	No circuit breaker	Step 3
Retried a `400` / `403` forever	Non-retryable error treated as retryable	Step 2

Common causes

1. No retry cap — the agent loops until budget exhaustion

A hand-rolled while not success: retry() loop with no maximum will retry forever (or until your cost budget runs out). Framework defaults are worse than they look: a Temporal activity’s default RetryPolicy has maximum_attempts = 0, which means unlimited, so it will retry an external tool forever until you set a cap (its other defaults: backoff coefficient 2.0, 1s initial interval, 100s max interval, as of June 2026). Inngest runs each step.run() 5 times total (the initial attempt plus 4 retries) by default, and each step has its own independent counter, so a multi-step function can multiply that out. LangGraph node RetryPolicy defaults to max_attempts=3.

How to spot it: search your agent/tool wrapper for retry loops without a max_attempts or attempt < N guard, and check every framework RetryPolicy for an explicit max. An unset Temporal cap is the silent killer here.

2. No exponential backoff — retries arrive faster than recovery

The tool returns 503, the agent retries in 100ms, still failing, retries again in 100ms. An overloaded service recovers faster when callers back off; constant-interval retries keep it under pressure and extend the outage.

How to spot it: log retry timestamps. A constant inter-retry interval (100ms, 500ms) instead of a growing one means there is no exponential backoff.
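
One rough way to run that check over logged timestamps is sketched below. The function name looks_like_fixed_interval and the 10% tolerance are assumptions for illustration, and the timestamps are assumed to be in seconds and in ascending order.

```python
def looks_like_fixed_interval(timestamps, tolerance=0.10):
    """Return True if the gaps between retries are roughly constant.

    timestamps: retry times in seconds, ascending (hypothetical log extract).
    tolerance: allowed relative spread between the smallest and largest gap.
    """
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2:
        return False  # not enough retries to judge
    smallest, largest = min(gaps), max(gaps)
    if largest == 0:
        return True  # every retry at the same instant: no delay at all
    return (largest - smallest) / largest <= tolerance
```

A run such as [0.0, 0.1, 0.2, 0.3] is flagged as fixed-interval, while [0, 1, 3, 7] (gaps of 1s, 2s, 4s) is not.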

3. HTTP 429 not detected — treated as a generic error

An error handler that does if status_code != 200: retry() cannot respect rate limits. A 429 means “stop calling me,” and per RFC 9110 §10.2.3 the Retry-After header tells you exactly how long — either as a number of seconds (Retry-After: 30) or as an HTTP-date (Retry-After: Wed, 10 Jun 2026 14:30:00 GMT). A client that retries a 429 immediately just generates more 429s and accelerates the storm.

How to spot it: check whether your handler differentiates 429 from 500 / 503. If they take the same path, you cannot back off correctly.

4. Non-retryable errors retried forever

A 400 Bad Request (malformed args) or 403 Forbidden (missing permission) will never succeed no matter how many times you retry. A blanket “retry all errors” policy turns a deterministic failure into a storm. The same applies to deterministic exceptions — LangGraph’s default policy, for example, deliberately does not retry ValueError / TypeError.

How to spot it: look at the status codes of retried failures. Repeated 400 / 401 / 403 / 404 / 422 means a non-retryable error is being retried. Only 429, 500, 502, 503, 504, and connection/timeout errors are worth a retry.
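
The split between causes 3 and 4 can be written down as a small decision function. This is a sketch, not a library API: the Action enum and classify_status are hypothetical names, and treating every unrecognized status as fail-fast is an assumption you may want to relax.

```python
from enum import Enum

class Action(Enum):
    OK = "success"
    RETRY_AFTER_HEADER = "wait for Retry-After, then retry"
    RETRY_WITH_BACKOFF = "retry with backoff and jitter"
    FAIL_FAST = "do not retry"

RETRYABLE_SERVER_ERRORS = {500, 502, 503, 504}

def classify_status(status_code: int) -> Action:
    """Map an HTTP status code to the retry decision described above."""
    if 200 <= status_code < 300:
        return Action.OK
    if status_code == 429:
        return Action.RETRY_AFTER_HEADER
    if status_code in RETRYABLE_SERVER_ERRORS:
        return Action.RETRY_WITH_BACKOFF
    # 400 / 401 / 403 / 404 / 422 and anything unrecognized (assumption: fail fast)
    return Action.FAIL_FAST
```

Keeping the decision in one function makes it easy to see, and to test, that 429 and 503 no longer share a code path.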

5. Parallel sub-agents all retry simultaneously

A fan-out to 10 agents sharing one flaky tool: when it fails, all 10 retry at the same instant. Combined load is 10x the single-agent load and saturates the tool faster. Without jitter, even backed-off retries re-synchronize into waves (the “thundering herd”).

How to spot it: check whether retries across agents are coordinated (shared circuit breaker / limiter) or independent, and whether your backoff adds random jitter. Independent, jitter-free retries in parallel compound into a storm.
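
A small simulation illustrates the effect. It is a sketch with made-up parameters: retry_times and peak_simultaneous are hypothetical helpers, every agent is assumed to fail at t=0, and "the same instant" means the same millisecond after rounding.

```python
import random

def retry_times(num_agents, attempts, base_delay=1.0, jitter=False, seed=0):
    """Per agent, the times (seconds after the shared failure) of each retry."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    schedule = []
    for _ in range(num_agents):
        elapsed, times = 0.0, []
        for attempt in range(1, attempts + 1):
            cap = base_delay * 2 ** (attempt - 1)
            # Full jitter draws from [0, cap]; without jitter every agent waits cap.
            elapsed += rng.uniform(0, cap) if jitter else cap
            times.append(round(elapsed, 3))
        schedule.append(times)
    return schedule

def peak_simultaneous(schedule):
    """Largest number of agents retrying at the same (rounded) instant."""
    counts = {}
    for times in schedule:
        for t in times:
            counts[t] = counts.get(t, 0) + 1
    return max(counts.values())
```

Without jitter, peak_simultaneous(retry_times(10, 4)) is 10: every agent hits the tool at 1s, 3s, 7s, and 15s. With jitter=True the retries spread across each window instead of landing together.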

6. Retry logic wraps the whole LLM call, not just the tool call

When the tool fails, the agent retries the entire reasoning loop: it re-calls the LLM, re-generates the tool call, and re-executes it. Each “retry” costs a full LLM call even though the fault is in the tool, not the model, so every retry is billed at the model’s per-call price (with Opus 4.7 at $5/$25 per 1M tokens as of June 2026, an unnecessary 8K-token re-reason adds up fast). A related leak: appending each failure’s full error to the message history, so every retry sends a longer, more expensive prompt.

How to spot it: check exactly what retry() re-invokes. If it re-runs the LLM and not just the tool executor, each retry costs 10-50x more than necessary. Compare the token count of attempt 1 vs attempt 5 — if it grew, you are accumulating error text in context.

7. No circuit breaker — retries continue after the tool is clearly down

After 10 consecutive failures from the same endpoint, the tool is plainly unavailable. The agent should stop calling it and escalate. Without a circuit breaker it keeps retrying, burning budget while the tool stays down. Note that no major workflow engine ships this for you — Temporal, Inngest, and LangGraph all do retries but none implement circuit breaking; you add it in your tool wrapper.

How to spot it: count consecutive failures per tool. Long runs of 20+ failures with no “circuit open” / “tool disabled” event mean there is no breaker.
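
Counting those runs is easy to script. The sketch below assumes the logs can be reduced to chronological (tool_name, succeeded) pairs; that shape and the name longest_failure_runs are assumptions.

```python
from collections import defaultdict

def longest_failure_runs(events):
    """Return {tool_name: longest run of consecutive failures}.

    events: iterable of (tool_name, succeeded) tuples in chronological order.
    Tools that never failed are omitted from the result.
    """
    current = defaultdict(int)
    longest = defaultdict(int)
    for tool, succeeded in events:
        if succeeded:
            current[tool] = 0  # a success ends the current run
        else:
            current[tool] += 1
            longest[tool] = max(longest[tool], current[tool])
    return dict(longest)
```

Any tool whose longest run is well above your intended breaker threshold, with no matching “circuit open” event, is running without a breaker.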

Shortest path to fix

Step 1: Add exponential backoff with jitter and a hard cap

import logging
import random
import time

logger = logging.getLogger(__name__)

class TransientError(Exception):
    """Raise this from fn() for failures worth retrying (timeouts, 5xx, dropped connections)."""

def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError as e:
            if attempt == max_attempts:
                raise                                   # hard cap reached: give up
            delay = min(base_delay * (2 ** (attempt - 1)), max_delay)
            wait = random.uniform(0, delay)             # full jitter spreads the herd
            logger.warning(
                "Tool call failed (attempt %d/%d): %s; retrying in %.1fs",
                attempt, max_attempts, e, wait
            )
            time.sleep(wait)

With max_attempts=5, base_delay=1.0, and full jitter, the four waits are capped at 1, 2, 4, and 8 seconds, so the last retry fires at most 15 seconds after the first failure (about 7.5 seconds on average). If the tool typically needs longer than that to recover, raise base_delay or max_attempts. “Full jitter” (a random wait between 0 and the computed delay) de-synchronizes parallel agents far better than a fixed delay plus a small wobble.

If you would rather not hand-roll this, tenacity is battle-tested. For jittered exponential backoff use wait_random_exponential (the plain wait_exponential has no jitter):

from tenacity import (
    retry, stop_after_attempt, wait_random_exponential, retry_if_exception_type,
)

@retry(
    stop=stop_after_attempt(5),
    wait=wait_random_exponential(multiplier=1, max=60),
    retry=retry_if_exception_type((ConnectionError, TimeoutError)),
    reraise=True,
)
def call_tool(payload):
    return execute_tool(payload)

On Temporal, set an explicit cap so the default unlimited policy cannot run away:

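from datetime import timedelta
from temporalio import workflow
from temporalio.common import RetryPolicy

# Call this from inside a workflow method (the one decorated with @workflow.run):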

await workflow.execute_activity(
    call_flaky_tool, payload,
    start_to_close_timeout=timedelta(seconds=30),
    retry_policy=RetryPolicy(
        maximum_attempts=5,                       # NOT 0 — 0 means unlimited
        initial_interval=timedelta(seconds=1),
        backoff_coefficient=2.0,
        maximum_interval=timedelta(seconds=60),
        non_retryable_error_types=["BadRequestError", "ForbiddenError"],
    ),
)

Step 2: Handle 429 and skip non-retryable errors

Read Retry-After and honor it, and never retry a deterministic 4xx. Retry-After can be seconds or an HTTP-date, so parse both:

import logging
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

import httpx

logger = logging.getLogger(__name__)

NON_RETRYABLE = {400, 401, 403, 404, 422}

class RateLimitExhaustedError(Exception):
    """Raised when the tool is still rate-limited after max_attempts."""

def parse_retry_after(value: Optional[str], default: int = 60) -> int:
    if not value:
        return default                              # header missing or empty
    if value.isdigit():
        return int(value)                           # delta-seconds form
    try:                                            # HTTP-date form
        when = parsedate_to_datetime(value)
        return max(0, int((when - datetime.now(timezone.utc)).total_seconds()))
    except (TypeError, ValueError):                 # unparseable or timezone-naive date
        return default

def call_tool_with_rate_limit(url: str, payload: dict, max_attempts: int = 5) -> dict:
    for attempt in range(1, max_attempts + 1):
        resp = httpx.post(url, json=payload, timeout=30)
        if resp.status_code == 429:
            if attempt == max_attempts:
                break                               # out of attempts: do not sleep again
            wait = parse_retry_after(resp.headers.get("Retry-After"))
            logger.warning("Rate limited; waiting %ds", wait)
            time.sleep(wait + 1)                    # one second of slack past the server's deadline
            continue
        if resp.status_code in NON_RETRYABLE:
            resp.raise_for_status()                 # deterministic 4xx: fail fast, do not retry
        resp.raise_for_status()                     # 5xx also raises here; this loop only retries 429
        return resp.json()
    raise RateLimitExhaustedError("Still rate-limited after retries")

Never retry a 429 without reading Retry-After, and never retry a 400 / 403 at all.

Step 3: Put a circuit breaker in front of each tool

import logging
import threading
import time
from enum import Enum

logger = logging.getLogger(__name__)

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitState(Enum):
    CLOSED = "closed"        # normal operation
    OPEN = "open"            # failing: reject calls immediately
    HALF_OPEN = "half_open"  # probing recovery

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.opened_at: float = 0.0
        self._lock = threading.Lock()   # the breaker is shared across agents/threads

    def call(self, fn):
        with self._lock:
            if self.state == CircuitState.OPEN:
                # monotonic() is immune to wall-clock adjustments
                if time.monotonic() - self.opened_at > self.recovery_timeout:
                    self.state = CircuitState.HALF_OPEN   # let a probe call through
                else:
                    raise CircuitOpenError("Circuit open: tool is down")
        try:
            result = fn()               # run the tool outside the lock
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        with self._lock:
            self.failure_count = 0
            self.state = CircuitState.CLOSED

    def _on_failure(self):
        with self._lock:
            self.failure_count += 1
            # A failed half-open probe re-opens at once because the count was never reset.
            if self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN
                self.opened_at = time.monotonic()
                logger.error("Circuit OPENED for tool after %d failures", self.failure_count)

Instantiate one CircuitBreaker per tool and share it across every agent that uses that tool. The circuitbreaker PyPI package gives you the same behavior as a decorator if you prefer. When the circuit is open, decide deliberately: fail fast and tell the user “service temporarily unavailable,” fall back to a cached result or backup tool, or queue the task until the breaker closes.
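
The cached-result fallback could look roughly like the sketch below. It is illustrative only: call_with_fallback, the in-memory dict cache, and the "fresh" / "stale" labels are assumptions, and CircuitOpenError is redefined here as a stand-in so the snippet runs on its own.

```python
class CircuitOpenError(Exception):
    """Stand-in for the error a breaker raises when it rejects a call."""

def call_with_fallback(breaker_call, fn, cache, key):
    """Run fn through the breaker; serve the last good result if the circuit is open.

    breaker_call: a callable shaped like CircuitBreaker.call (takes fn, returns its result).
    cache: dict of last known good results (hypothetical, in-memory only).
    Returns (result, "fresh") or (cached_result, "stale").
    """
    try:
        result = breaker_call(fn)
    except CircuitOpenError:
        if key in cache:
            return cache[key], "stale"
        raise  # nothing cached: let the caller report "temporarily unavailable"
    cache[key] = result
    return result, "fresh"
```

Returning the freshness label alongside the result lets the agent tell the user, or the model, that it is working from older data.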

Step 4: Retry only the tool call, not the full LLM call

# WRONG — retries the full LLM reasoning loop (10-50x the cost)
def agent_step_with_retry(state):
    for _ in range(5):
        try:
            return llm.invoke(state)          # LLM call + tool call together
        except ToolError:
            continue

# CORRECT — retry only the tool execution
def agent_step(state):
    tool_call = llm.plan_tool_call(state)                       # LLM call, no retry
    result = retry_with_backoff(lambda: execute_tool(tool_call))  # retry only the tool
    return llm.process_result(state, result)                    # LLM call, no retry

While you are here, do not append full tool errors to the message history. Record a compact failure marker (error type and count) in a side field instead, so retries do not send an ever-growing, ever-more-expensive prompt.
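
A compact failure marker might be kept like this. The sketch assumes the agent state is a plain dict; the tool_failures key and both helper names are hypothetical.

```python
from collections import Counter

def record_tool_failure(state: dict, tool_name: str, error: Exception) -> dict:
    """Count the failure in a side field instead of appending error text to messages."""
    failures = state.setdefault("tool_failures", Counter())
    failures[(tool_name, type(error).__name__)] += 1
    return state

def failure_summary(state: dict) -> str:
    """One short line for the prompt, however many retries happened."""
    failures = state.get("tool_failures", {})
    return "; ".join(
        f"{tool} failed with {error_type} x{count}"
        for (tool, error_type), count in sorted(failures.items())
    )
```

Three timeouts from a tool named sandbox produce the single line "sandbox failed with TimeoutError x3", and the message history stays the same length on every retry.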

Step 5: Coordinate parallel agents with a shared limiter

import threading

_tool_semaphore = threading.Semaphore(5)  # max 5 concurrent calls to this tool

def call_tool_safe(payload):
    with _tool_semaphore:
        return retry_with_backoff(lambda: call_tool(payload))

A semaphore caps concurrency; a token bucket caps the rate over time. For distributed agents, back the limiter with Redis (redis-py plus a Lua script using INCR and EXPIRE, or a maintained rate-limiter library) so all workers share one budget no matter how many processes you scale to.
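
For a single process, a token bucket can be sketched in a few lines. This is a minimal in-process version under stated assumptions: the TokenBucket class is hypothetical, it is non-blocking (the caller decides what to do on False), and it does not coordinate across processes the way a Redis-backed limiter would.

```python
import threading
import time

class TokenBucket:
    """Allow at most `rate` calls per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)   # start full
        self.clock = clock              # injectable so tests need no real sleeping
        self.updated = clock()
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False instead of blocking."""
        with self._lock:
            now = self.clock()
            # Refill in proportion to elapsed time, never beyond capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False
```

A caller that gets False can sleep briefly and try again, or hand the task back to a queue, rather than sending the request anyway.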

How to confirm it’s fixed

Inject failure deterministically. Point the agent at a mock tool that fails the first two calls and succeeds on the third. Confirm the task completes on attempt 3 and stops at max_attempts when the mock fails every time.
Check the backoff curve. In the logs, the ceiling on the inter-retry interval should double with each attempt (up to 1s, 2s, 4s, …), and the actual waits should differ between parallel agents rather than fire on a fixed cadence.
Force a 429. Have the mock return 429 with Retry-After: 5; confirm the client waits ~5s, not milliseconds.
Trip the breaker. Make the mock fail 5+ times in a row; confirm a “Circuit OPENED” log line appears and subsequent calls raise CircuitOpenError immediately (no network call) until recovery_timeout elapses.
Watch total calls and tokens. Per task, total tool calls should be a small multiple of max_attempts, not hundreds, and token usage should not climb with each retry.
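
The deterministic failure injection from the first check can be built as a small test double. FlakyTool and run_with_cap are hypothetical names, and run_with_cap is a stripped-down capped retry with no sleeping, used here only to exercise the mock; in a real test you would pass the mock to your own retry wrapper.

```python
class FlakyTool:
    """Test double that raises on the first `failures` calls, then succeeds."""

    def __init__(self, failures: int, error_type: type = TimeoutError):
        self.failures = failures
        self.error_type = error_type
        self.calls = 0

    def __call__(self, payload=None):
        self.calls += 1
        if self.calls <= self.failures:
            raise self.error_type("injected failure")
        return {"ok": True, "attempt": self.calls}

def run_with_cap(tool, max_attempts: int):
    """Minimal capped retry (no backoff, no sleep) for exercising the mock."""
    for attempt in range(1, max_attempts + 1):
        try:
            return tool()
        except TimeoutError:
            if attempt == max_attempts:
                raise  # cap reached: surface the failure
```

With FlakyTool(failures=2) the call succeeds on attempt 3; with a mock that always fails, the wrapper should stop after exactly max_attempts calls.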

Prevention

Wrap every tool call in retry logic with exponential backoff, jitter, and a hard max_attempts (5 is a good default).
On Temporal, always set maximum_attempts explicitly — the default 0 means unlimited.
Handle 429 specially: parse Retry-After (seconds or HTTP-date) and sleep at least that long.
Only retry 429 / 5xx / connection / timeout errors. Never retry 400 / 401 / 403 / 404 / 422.
Add a circuit breaker per tool: open after 5 consecutive failures, probe recovery after 60s. No workflow engine does this for you.
Retry only the tool execution layer, not the full LLM loop, and keep error text out of the prompt history.
Coordinate parallel agents with a shared semaphore, token bucket, or Redis limiter.
Set a per-task retry budget (and a max-token budget) so a persistent failure fails fast instead of burning the whole budget.
Test the retry path in CI with a mock that fails the first 2 calls. Alert when a breaker opens — that is an incident, not a self-healing retry.
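
A per-task retry budget can be as simple as a shared counter. The sketch below is one possible shape: RetryBudget, RetryBudgetExceeded, and the default of 10 retries are assumptions, and a token budget could be tracked the same way.

```python
class RetryBudgetExceeded(Exception):
    """Raised when a task has used up its retry allowance."""

class RetryBudget:
    """Counts retries across every tool call made by one task."""

    def __init__(self, max_retries: int = 10):
        self.max_retries = max_retries
        self.used = 0

    def spend(self) -> None:
        """Call once before each retry; raises when the task-wide budget is gone."""
        if self.used >= self.max_retries:
            raise RetryBudgetExceeded(f"retry budget of {self.max_retries} exhausted")
        self.used += 1
```

Create one budget per task and pass it to every tool wrapper the task uses, so five tools with five attempts each cannot quietly add up to 25 retries.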

FAQ

Q: Should I use tenacity or write retry logic by hand? A: Use a library. tenacity (Python) and p-retry / async-retry (JS/TS) are battle-tested for thread safety, async, and jittered delays that hand-rolled loops miss. For jittered exponential backoff in tenacity, use wait=wait_random_exponential(multiplier=1, max=60) with stop=stop_after_attempt(5) — wait_exponential alone has no jitter and lets parallel agents re-synchronize.

Q: Doesn’t Temporal handle all of this automatically? A: Temporal retries activities with a RetryPolicy, but its default maximum_attempts is 0, which means unlimited — so a flaky tool retries forever until you cap it. Set maximum_attempts, list non_retryable_error_types, and bound maximum_interval. Temporal does not implement circuit breaking; add that in your activity code.

Q: How do I pick the circuit-breaker failure threshold? A: Start at 5 consecutive failures. Watch the open/close events for a week and tune for false positives (opened when the tool was fine) and false negatives (should have opened sooner). For low-volume tools use a consecutive-failure count, not a percentage.

Q: A retry storm is happening right now — how do I stop the bleeding? A: 1) Throttle at the gateway / load balancer to force backend request rate down to normal. 2) Pause workflow instances (Temporal, Inngest, or your LangGraph runner). 3) Temporarily set max_attempts to 1 and redeploy. Then fix the root cause from the steps above.

Q: The tool is slow, not failing — does retry logic help? A: No. Slow calls do not raise an exception, so they never enter the retry path; they just hold the agent’s thread. Put a timeout on every tool call (httpx.post(..., timeout=30)) so a hung call fails fast and enters retry/circuit-breaking. A call that takes 5 minutes is functionally a failure.

Tags: #AI coding #Agents #Troubleshooting