Your Inngest or Temporal workflow calls a code-execution tool that works 90% of the time. On a transient timeout, the agent retries with zero delay, no backoff, and no retry cap. The tool is called 50 times in 3 seconds. The execution sandbox rate-limits the agent at 10 req/s, so every retry now gets a 429, which the agent also retries. After 500 LLM calls, the pipeline has burned 300K tokens, hit the provider rate limit, and still not finished the original task. What started as a 5-second transient hiccup turned into a 10-minute outage.
Common causes
1. No retry cap — agent loops until budget exhaustion
The agent’s retry logic has while not success: retry() with no maximum. It will retry forever or until the cost budget runs out, whichever comes first. This is the most common pattern in hand-rolled retry logic.
How to spot it: Search for retry loops in your agent or tool wrapper code. Any loop missing a max_attempts or attempt < N guard will grow unbounded on persistent failures.
2. No exponential backoff — retries arrive faster than recovery
The tool fails and returns a 503 (service unavailable). The agent retries in 100ms. Still failing. Retries every 100ms. A service that is overloaded recovers faster when callers back off; constant-interval retries keep the service under pressure and extend the outage.
How to spot it: Log the timestamps of retry attempts. If the inter-retry interval is constant (100ms, 500ms) rather than growing, there is no exponential backoff.
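The check above can be automated once retry timestamps are in hand. The following is a minimal sketch, not a standard tool: the function names and the 1.5x growth threshold are assumptions chosen for illustration, and the timestamps are expected as seconds in chronological order.

```python
from typing import List


def retry_intervals(timestamps: List[float]) -> List[float]:
    """Return the gaps, in seconds, between consecutive retry attempts."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


def has_backoff(timestamps: List[float], min_growth: float = 1.5) -> bool:
    """True if every gap is at least `min_growth` times the previous gap.

    The 1.5 threshold is an assumption: it tolerates small jitter around
    a doubling schedule but rejects constant-interval retries.
    """
    gaps = retry_intervals(timestamps)
    if len(gaps) < 2:
        return False  # not enough retries to judge
    return all(b >= a * min_growth for a, b in zip(gaps, gaps[1:]))


constant = [0.0, 0.1, 0.2, 0.3, 0.4]  # a retry every 100ms
growing = [0.0, 1.0, 3.0, 7.0, 15.0]  # gaps of 1, 2, 4, and 8 seconds

print(has_backoff(constant))  # False
print(has_backoff(growing))   # True
```

Once a backoff schedule reaches its maximum delay the gaps stop growing, so apply a check like this only to the first few retries of a sequence.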
3. 429 rate-limit response not detected — treated as a generic error
The agent’s error handler checks if status_code != 200: retry(). A 429 means “stop calling me for N seconds” — the Retry-After header tells you exactly how long to wait. An agent that retries a 429 immediately generates more 429s, accelerating the storm.
How to spot it: Check whether your error handler differentiates HTTP 429 from other errors. If 429 is handled identically to 500 or 503, the agent cannot respect rate limits.
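One way to make that distinction explicit is to classify the status code before deciding what to do. This is a sketch under assumptions: the FailureKind names and the set of status codes treated as transient are illustrative choices, not a standard, and should be adjusted to the tool being called.

```python
from enum import Enum
from typing import Optional


class FailureKind(Enum):
    RATE_LIMITED = "rate_limited"  # wait for Retry-After, then retry
    TRANSIENT = "transient"        # retry with exponential backoff
    PERMANENT = "permanent"        # do not retry; surface the error


# Assumption: these codes usually indicate a temporary server-side problem.
TRANSIENT_CODES = {408, 500, 502, 503, 504}


def classify_status(status_code: int) -> Optional[FailureKind]:
    """Return None for success, otherwise the kind of failure."""
    if 200 <= status_code < 300:
        return None
    if status_code == 429:
        return FailureKind.RATE_LIMITED
    if status_code in TRANSIENT_CODES:
        return FailureKind.TRANSIENT
    return FailureKind.PERMANENT
```

With a classifier like this, a 404 or 400 fails immediately instead of being retried, and a 429 is routed to its own wait logic.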
4. Parallel sub-agents all retry simultaneously
A workflow fans out to 10 parallel agents, all using the same flaky tool. When the tool fails, all 10 agents retry simultaneously. Their combined retry load is 10x the single-agent load, saturating the tool even faster.
How to spot it: Check whether retries across parallel agents are coordinated (e.g., via a shared circuit breaker) or completely independent. Independent retries in parallel compound into a storm.
5. Retry logic wraps the entire LLM call, not just the tool call
When the tool fails, the agent retries the entire reasoning loop: it re-calls the LLM, re-generates the tool call, and re-executes it. Each “retry” costs a full LLM call even though the issue is in the tool, not the model’s reasoning. Every retry therefore adds the price of an LLM call on top of the tool call itself.
How to spot it: Check what exactly is retried. If retry() re-invokes the LLM and not just the tool executor, each retry costs 10-50x more than necessary.
6. No circuit breaker — retries continue after the tool is clearly down
After 10 consecutive failures from the same tool endpoint, the tool is clearly unavailable. The agent should stop calling it and escalate. Without a circuit breaker, it continues retrying indefinitely, burning budget while the tool stays down.
How to spot it: Count consecutive failures for each tool in your logs. If there are long runs of 20+ consecutive failures with no “tool disabled” or “circuit open” event, there is no circuit breaker.
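Counting those runs is straightforward once the logs are reduced to (tool, succeeded) pairs. The sketch below assumes that reduction has already been done; the event shape and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def longest_failure_runs(events: Iterable[Tuple[str, bool]]) -> Dict[str, int]:
    """Longest streak of consecutive failures per tool.

    `events` is (tool_name, succeeded) in chronological order.
    Tools that never failed are omitted from the result.
    """
    current = defaultdict(int)  # length of the streak in progress
    longest = defaultdict(int)  # longest streak seen so far
    for tool, succeeded in events:
        if succeeded:
            current[tool] = 0
        else:
            current[tool] += 1
            longest[tool] = max(longest[tool], current[tool])
    return dict(longest)


events = (
    [("sandbox", False)] * 3
    + [("sandbox", True)]
    + [("sandbox", False)] * 2
    + [("search", True)]
)
print(longest_failure_runs(events))  # {'sandbox': 3}
```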
Shortest path to fix
Step 1: Add exponential backoff with jitter and a hard cap
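import logging
import random
import time

logger = logging.getLogger(__name__)

class TransientError(Exception):
    """Placeholder: raise this from fn() for failures that are worth retrying."""

def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError as e:
            if attempt == max_attempts:
                raise  # retry cap reached: surface the error
            # Exponential delay: 1s, 2s, 4s, 8s, ... capped at max_delay
            delay = min(base_delay * (2 ** (attempt - 1)), max_delay)
            # Up to 10% extra so parallel callers do not retry in lockstep
            jitter = random.uniform(0, delay * 0.1)
            wait = delay + jitter
            logger.warning(
                "Tool call failed (attempt %d/%d): %s; retrying in %.1fs",
                attempt, max_attempts, e, wait
            )
            time.sleep(wait)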
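With max_attempts=5, the four waits are 1, 2, 4, and 8 seconds, so the last retry fires about 15 seconds after the first failure (plus jitter and the time each call takes). That is enough for many transient issues to recover; raise base_delay or max_attempts if your tool typically needs longer.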
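To sanity-check how long a given configuration waits in total, the delay formula from Step 1 can be evaluated on its own. This sketch ignores jitter and the duration of the calls themselves; backoff_schedule is a helper name introduced here for illustration.

```python
def backoff_schedule(max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Delays in seconds (without jitter) slept after each failed attempt.

    The final attempt is not followed by a sleep, so there are
    max_attempts - 1 delays.
    """
    return [
        min(base_delay * 2 ** (attempt - 1), max_delay)
        for attempt in range(1, max_attempts)
    ]


delays = backoff_schedule()
print(delays)       # [1.0, 2.0, 4.0, 8.0]
print(sum(delays))  # 15.0
```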
Step 2: Handle 429 explicitly using the Retry-After header
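import logging
import time

import httpx

logger = logging.getLogger(__name__)

class RateLimitExhaustedError(Exception):
    """Raised when the tool is still rate-limiting after all waits."""

def call_tool_with_rate_limit(url: str, payload: dict) -> dict:
    for _ in range(5):
        resp = httpx.post(url, json=payload, timeout=30)
        if resp.status_code == 429:
            try:
                # Retry-After is usually a number of seconds
                retry_after = int(resp.headers.get("Retry-After", 60))
            except ValueError:
                retry_after = 60  # HTTP-date or malformed header: use the default
            logger.warning("Rate limited; waiting %ds", retry_after)
            time.sleep(retry_after + 1)  # one extra second of margin
            continue
        resp.raise_for_status()  # non-429 errors propagate to the caller
        return resp.json()
    raise RateLimitExhaustedError("Still rate-limited after 5 waits")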
Never retry a 429 without reading and honoring the Retry-After header.
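The Retry-After header can carry either a number of seconds or an HTTP-date, so a parser that only calls int() will miss the second form. The following is a minimal standard-library sketch; parse_retry_after, its 60-second default, and the injectable `now` argument are assumptions made for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(
    value: Optional[str],
    default: float = 60.0,
    now: Optional[datetime] = None,
) -> float:
    """Seconds to wait according to a Retry-After header value.

    Falls back to `default` when the header is missing or unparseable.
    """
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        target = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default
    if target.tzinfo is None:
        target = target.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means the wait is already over.
    return max(0.0, (target - now).total_seconds())
```

Passing `now` explicitly keeps the function easy to test; in production code it can be left as None.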
Step 3: Implement a circuit breaker
import logging
import threading
import time
from enum import Enum

logger = logging.getLogger(__name__)

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitState(Enum):
    CLOSED = "closed"        # normal operation
    OPEN = "open"            # failing: reject calls
    HALF_OPEN = "half_open"  # testing recovery

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.opened_at: float = 0.0
        self._lock = threading.Lock()  # the breaker is shared across agents

    def call(self, fn):
        with self._lock:
            if self.state == CircuitState.OPEN:
                # monotonic() is immune to wall-clock adjustments
                if time.monotonic() - self.opened_at > self.recovery_timeout:
                    self.state = CircuitState.HALF_OPEN  # let a trial call through
                else:
                    raise CircuitOpenError("Circuit breaker open: tool is down")
        try:
            result = fn()  # run outside the lock so slow calls do not block others
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        with self._lock:
            self.failure_count = 0
            self.state = CircuitState.CLOSED

    def _on_failure(self):
        with self._lock:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN
                self.opened_at = time.monotonic()
                logger.error("Circuit breaker OPENED for tool after %d failures", self.failure_count)
Instantiate one CircuitBreaker per tool and share it across all agents using that tool.
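One way to guarantee a single shared instance per tool is a small registry keyed by tool name. This is a sketch, not part of any library: PerToolRegistry is a hypothetical helper, and a plain dict stands in for the CircuitBreaker class from Step 3 so the example runs on its own.

```python
import threading
from typing import Callable, Dict, Generic, TypeVar

T = TypeVar("T")


class PerToolRegistry(Generic[T]):
    """Hands out exactly one shared instance per tool name."""

    def __init__(self, factory: Callable[[], T]):
        self._factory = factory
        self._instances: Dict[str, T] = {}
        self._lock = threading.Lock()

    def get(self, tool_name: str) -> T:
        # The lock prevents two threads from creating separate
        # instances for the same tool at the same moment.
        with self._lock:
            if tool_name not in self._instances:
                self._instances[tool_name] = self._factory()
            return self._instances[tool_name]


# In real use the factory would be the CircuitBreaker class;
# dict is only a stand-in for this sketch.
breakers = PerToolRegistry(factory=dict)

same = breakers.get("code_exec") is breakers.get("code_exec")
different = breakers.get("code_exec") is not breakers.get("web_search")
print(same, different)  # True True
```

Every agent then asks the registry for the breaker by tool name instead of constructing its own, so failures observed by one agent count toward the same threshold for all of them.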
Step 4: Retry only the tool call, not the full LLM call
# WRONG: retries the full LLM reasoning loop
def agent_step_with_retry(state):
    for _ in range(5):
        try:
            return llm.invoke(state)  # full LLM call + tool call
        except ToolError:
            continue

# CORRECT: retries only the tool execution
def agent_step(state):
    tool_call = llm.plan_tool_call(state)  # LLM call, no retry needed here
    result = retry_with_backoff(lambda: execute_tool(tool_call))  # retry only the tool
    return llm.process_result(state, result)  # LLM call, no retry needed
Step 5: Coordinate retries across parallel agents with a shared semaphore
import threading

_tool_semaphore = threading.Semaphore(5)  # max 5 concurrent calls to this tool

def call_tool_safe(payload):
    with _tool_semaphore:
        return retry_with_backoff(lambda: call_tool(payload))
A threading.Semaphore only coordinates threads within one process. For distributed agents, use a Redis-backed rate limiter, for example a Lua script built on INCR and EXPIRE.
Prevention
- Wrap every tool call in a retry function with exponential backoff, jitter, and a hard max-attempts cap (5 is usually right).
- Handle HTTP 429 specially: read the Retry-After header and sleep at least that long before retrying.
- Add a circuit breaker per tool; open the circuit after 5 consecutive failures, test recovery after 60 seconds.
- Retry only the tool execution layer, not the full LLM reasoning loop — retrying the LLM is expensive and usually unnecessary.
- Coordinate parallel agents with a shared semaphore or rate limiter so they do not all retry simultaneously.
- Set a per-tool retry budget (e.g., max 3 retries per task, not per agent call) so a persistent failure fails fast.
- Test your retry path in CI with a mock tool that returns failures on the first 2 calls and succeeds on the 3rd.
- Alert when a circuit breaker opens — it is a production incident, not a silent retry that resolves itself.
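Two of the items above, the per-task retry budget and the CI test with a mock tool, can be sketched together. All names here (FlakyTool, RetryBudget, call_with_budget) are hypothetical, ConnectionError stands in for whatever transient error your tool raises, and backoff delays are left out so the tests run instantly.

```python
class FlakyTool:
    """Mock tool: raises on the first `failures` calls, then succeeds."""

    def __init__(self, failures: int = 2):
        self.failures = failures
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError(f"simulated failure #{self.calls}")
        return "ok"


class RetryBudget:
    """Allowance of retries shared by every tool call within one task."""

    def __init__(self, max_retries: int = 3):
        self.remaining = max_retries

    def spend(self) -> bool:
        """Consume one retry; return False if the budget is exhausted."""
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True


def call_with_budget(fn, budget: RetryBudget):
    """Call fn, retrying on ConnectionError while the budget allows."""
    while True:
        try:
            return fn()
        except ConnectionError:
            if not budget.spend():
                raise  # budget exhausted: fail fast


def test_recovers_on_third_call():
    tool, budget = FlakyTool(failures=2), RetryBudget(max_retries=3)
    assert call_with_budget(tool, budget) == "ok"
    assert tool.calls == 3
    assert budget.remaining == 1


def test_persistent_failure_fails_fast():
    tool, budget = FlakyTool(failures=100), RetryBudget(max_retries=3)
    try:
        call_with_budget(tool, budget)
    except ConnectionError:
        pass
    else:
        raise AssertionError("expected the call to fail")
    assert tool.calls == 4  # one initial call plus three retries
```

Because the budget object is passed in rather than created per call, several tool calls in the same task draw from one allowance, which is what makes a persistent failure stop early.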
FAQ
Q: Should I use tenacity or write retry logic by hand?
A: Use tenacity (Python) or retry (JS/TS). They are battle-tested and handle edge cases (thread safety, async, randomized delays) that hand-rolled loops miss. The @retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=1, max=60)) decorator covers most cases.
Q: How do I pick the right failure threshold for a circuit breaker?
A: Use 5 consecutive failures as a starting point. Monitor the circuit breaker’s open/close events for a week and adjust based on false positives (opened unnecessarily) and false negatives (should have opened sooner). Never use a percentage-based threshold for low-volume tools; use consecutive counts.
Q: Does Temporal handle this automatically?
A: Temporal activities have built-in retry policies with RetryPolicy (backoff coefficient, max attempts, max interval). Set a RetryPolicy on every activity that calls an external tool. Temporal does not implement circuit breaking — add that in your activity code.
Q: What if the tool is always slow, not failing?
A: Slow tool calls do not trigger retry logic — they just hold the agent’s thread. Add a timeout to every tool call (httpx.post(..., timeout=30)) so slow calls fail fast and enter the retry/circuit-breaking path. A tool that takes 5 minutes per call is functionally a failure.
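For tools that are not plain HTTP calls and have no timeout parameter of their own, a deadline can be imposed from the outside. This is a rough standard-library sketch; call_with_timeout and ToolTimeoutError are hypothetical names. Note the limitation: the worker thread is abandoned, not killed, so the slow call keeps running in the background until it finishes.

```python
import concurrent.futures


class ToolTimeoutError(Exception):
    """Raised when a tool call does not finish within its deadline."""


def call_with_timeout(fn, timeout_s: float):
    """Run fn in a worker thread and give up waiting after timeout_s seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise ToolTimeoutError(f"tool call exceeded {timeout_s}s") from None
    finally:
        # wait=False returns immediately instead of blocking on the slow call
        executor.shutdown(wait=False)
```

Treating ToolTimeoutError as a transient failure lets slow calls flow into the same retry and circuit-breaking path as outright errors. Where the client library offers a native timeout, as httpx does, prefer that, since it actually cancels the request.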