One Agent's Rate Limit Cascades Into a Chain Failure

A single rate-limited agent stalls the entire pipeline as the agents waiting on it queue up and time out. Learn to isolate rate limits and prevent cascade failures.

You have a 5-step LangGraph pipeline: research → summarize → code → review → deploy. The “code” agent runs 20 parallel sub-tasks and each makes its own call to the Anthropic API. The combined rate quickly hits the 60 req/min limit. The code agent starts receiving 429s and retrying. The review agent, which is waiting on the code agent’s output, times out after 120 seconds and fails. The deploy agent, waiting on review, also times out. The entire pipeline fails, not because the task was impossible, but because one agent’s load profile exceeded its allocated rate capacity and the failure propagated to every later stage through chained timeouts.

Common causes

1. All agents share a single API key with one rate-limit bucket

Every agent in the pipeline — orchestrator, sub-agents, reviewers — calls the same API key. The rate limit is shared. When one high-volume agent (e.g., a parallelized research agent making 30 concurrent calls) hits the limit, it consumes the entire bucket and starves all other agents.

How to spot it: Check whether all agents in the pipeline use the same API key. Pull API usage logs for that key and look for periods where a single agent type accounts for over 80% of requests. That agent is the consumer that triggers starvation.
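
As a rough sketch of that check, assume the usage logs have been exported as a list of records with a hypothetical `agent` field (the field name, helper names, and sample data are illustrative, not a provider format). The per-agent share of requests could then be computed like this:

```python
from collections import Counter

def request_share_by_agent(records: list[dict]) -> dict[str, float]:
    """Return each agent's fraction of total requests in the log window."""
    counts = Counter(r["agent"] for r in records)
    total = sum(counts.values())
    return {agent: n / total for agent, n in counts.items()} if total else {}

def starving_agents(records: list[dict], threshold: float = 0.8) -> list[str]:
    """Agents whose share of requests exceeds the threshold (80% by default)."""
    shares = request_share_by_agent(records)
    return sorted(agent for agent, share in shares.items() if share > threshold)

# Illustrative sample: 9 of 10 requests come from the research agent
sample = [{"agent": "research"}] * 9 + [{"agent": "review"}]
print(starving_agents(sample))  # ['research']
```

Run it over short windows (a minute or two) rather than a whole day, since starvation tends to show up as brief spikes that a daily average hides.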

2. Timeout on the waiting agent is shorter than the retry window on the blocked agent

Agent C waits 90 seconds for Agent B. Agent B hit a rate limit and is waiting 120 seconds for the Retry-After interval. Agent C times out at 90 seconds and fails before Agent B even has a chance to retry. The cascade happens because timeouts are not coordinated.

How to spot it: Map every timeout in the chain. For each agent, find the longest possible retry delay it could experience. Any waiting agent whose timeout is shorter than the worst-case retry delay of the agent it depends on will fail in a cascade.
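
The timeout map can be checked mechanically. The sketch below assumes a strictly linear pipeline listed in execution order; `Stage` and `find_timeout_mismatches` are hypothetical names, and the numbers mirror the Agent B / Agent C example above.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    wait_timeout_s: float      # how long this stage waits for the stage before it
    worst_case_retry_s: float  # longest retry delay this stage itself could sit through

def find_timeout_mismatches(stages: list[Stage]) -> list[tuple[str, str]]:
    """Return (waiting_stage, blocked_stage) pairs where the waiter gives up first."""
    mismatches = []
    for blocked, waiter in zip(stages, stages[1:]):
        if waiter.wait_timeout_s <= blocked.worst_case_retry_s:
            mismatches.append((waiter.name, blocked.name))
    return mismatches

pipeline = [
    Stage("B", wait_timeout_s=300, worst_case_retry_s=120),
    Stage("C", wait_timeout_s=90, worst_case_retry_s=60),
]
# C's 90s timeout is shorter than B's 120s retry wait
print(find_timeout_mismatches(pipeline))  # [('C', 'B')]
```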

3. No queue between agents — synchronous blocking

Agents are called synchronously: result_b = call_b(result_a). If B is slow due to rate limiting, the thread calling it blocks. If there are many such threads (one per parallel task), they pile up and exhaust the thread pool, making the problem worse.

How to spot it: Check whether the pipeline uses synchronous blocking calls or an async queue between stages. If every agent-to-agent call is a synchronous HTTP call with no message queue in between, there is no buffer for rate-limit delays.

4. Fan-out multiplies per-agent rate usage by N

The orchestrator dispatches 50 tasks to 50 parallel agent instances. Each makes 3 LLM calls. That is 150 calls fired simultaneously — well above a 60 req/min limit. The pipeline was tested with 5 parallel tasks, where 15 calls was fine, but 50 was never anticipated.

How to spot it: Calculate the maximum concurrent calls at peak fan-out: num_parallel_tasks * calls_per_task. Compare against the rate limit. If the theoretical maximum exceeds the limit, fan-out without concurrency control will trigger a cascade.
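
The arithmetic is simple enough to encode as a pre-deploy check. This is a minimal sketch using the figures from this section; the function names and the 20% headroom default are assumptions for illustration.

```python
def peak_calls(num_parallel_tasks: int, calls_per_task: int) -> int:
    """Worst case: every parallel task fires all of its calls in the same minute."""
    return num_parallel_tasks * calls_per_task

def max_safe_parallelism(calls_per_task: int, limit_per_minute: int,
                         headroom_pct: int = 20) -> int:
    """Largest fan-out whose worst-case burst fits in the limit minus headroom."""
    usable = limit_per_minute * (100 - headroom_pct) // 100
    return usable // calls_per_task

LIMIT = 60  # req/min, the figure used in this article
for tasks in (5, 50):
    calls = peak_calls(tasks, calls_per_task=3)
    print(tasks, calls, "over limit" if calls > LIMIT else "ok")
# prints "5 15 ok" then "50 150 over limit"

print(max_safe_parallelism(calls_per_task=3, limit_per_minute=LIMIT))  # 16
```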

5. No per-agent rate limit budget — one agent can consume all capacity

There is no allocation: any agent can use as much of the shared rate limit as it wants. A new high-volume agent was added to the pipeline without checking its rate usage against the remaining capacity for existing agents.

How to spot it: Check whether your rate-limiting layer enforces per-agent quotas or just a global limit. A global limit without per-agent quotas allows any single agent to starve all others.
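
For comparison, a per-agent quota layer can be very small. The sketch below uses a fixed one-minute window and in-process counters; `AgentBudget` is a hypothetical helper, the quota numbers are made up, and a multi-process deployment would need shared state (for example a cache or database) instead.

```python
import time

class AgentBudget:
    """Fixed-window request budget per agent (hypothetical helper, not a library API)."""

    def __init__(self, quotas_per_minute: dict[str, int], clock=time.monotonic):
        self.quotas = quotas_per_minute
        self.clock = clock
        self.window_start = clock()
        self.used = {agent: 0 for agent in quotas_per_minute}

    def try_acquire(self, agent: str) -> bool:
        """Return True and count the call if the agent still has budget this minute."""
        now = self.clock()
        if now - self.window_start >= 60:
            # New window: reset every agent's counter
            self.window_start = now
            self.used = {a: 0 for a in self.quotas}
        if self.used[agent] >= self.quotas[agent]:
            return False
        self.used[agent] += 1
        return True

budget = AgentBudget({"code_agent": 40, "review_agent": 10, "orchestrator": 10})
granted = sum(budget.try_acquire("code_agent") for _ in range(50))
# The code agent is capped at 40, and the review agent still has budget
print(granted, budget.try_acquire("review_agent"))  # 40 True
```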

6. Error handling treats rate limits the same as permanent failures

When an agent receives a 429, it raises a generic PipelineError. The orchestrator catches PipelineError and marks the task as permanently failed, triggering compensation actions (rollbacks, alerts). The task was not actually failed — it was temporarily rate-limited — but the error handling escalated it to a fatal state.

How to spot it: Check whether your error handling distinguishes RateLimitError (transient, retry with backoff) from AuthError or ContentPolicyError (permanent, do not retry). If all errors go to the same handler, rate limits are misclassified as fatal failures.

Shortest path to fix

Step 1: Identify the bottleneck agent by API key usage breakdown

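# Pull usage logs from the Anthropic console or your observability stack
# and group them by agent_id or metadata tag.
# NOTE: the endpoint path, query parameters, and response fields below are
# illustrative; take the real ones from your provider's usage-reporting docs.
curl -s "https://api.anthropic.com/v1/usage?start=2026-05-25&group_by=metadata_tag" \
  -H "x-api-key: $ANTHROPIC_API_KEY" | jq '.usage[] | {tag, request_count, input_tokens}'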

Alternatively, add a tag to every API call:

# Tag every call with the agent name
client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API; pick a value that fits the task
    messages=messages,
    metadata={"user_id": f"agent:{agent_name}:run:{run_id}"}
)

Step 2: Add a rate-limit-aware semaphore for the high-volume agent

import asyncio
import time

class RateLimitedSemaphore:
    """Sliding-window limiter: at most max_per_minute acquisitions per 60 seconds."""

    def __init__(self, max_per_minute: int):
        self.max_per_minute = max_per_minute
        # Also caps the number of calls in flight at once
        self.semaphore = asyncio.Semaphore(max_per_minute)
        self.call_times: list[float] = []
        self._lock = asyncio.Lock()  # serializes the window bookkeeping

    async def acquire(self):
        await self.semaphore.acquire()
        async with self._lock:
            while True:
                now = time.monotonic()
                # Remove calls older than 60 seconds
                self.call_times = [t for t in self.call_times if now - t < 60]
                if len(self.call_times) < self.max_per_minute:
                    break
                # Wait until the oldest call is 60s old, then re-check
                await asyncio.sleep(60 - (now - self.call_times[0]))
            self.call_times.append(time.monotonic())

    def release(self):
        self.semaphore.release()

# One semaphore shared across all parallel instances of the same agent
_code_agent_limiter = RateLimitedSemaphore(max_per_minute=40)  # leave headroom for others
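
One way to wire a limiter like this into the parallel sub-tasks is a small context manager that always releases the slot, even when the call raises. The sketch below is an assumption about usage, not part of any SDK: `limited` and `run_subtask` are hypothetical names, the sleep stands in for the real API call, and a plain `asyncio.Semaphore` is used as a stand-in because it has the same `acquire()` / `release()` shape as the limiter above.

```python
import asyncio
from contextlib import asynccontextmanager

@asynccontextmanager
async def limited(limiter):
    """Hold a limiter slot for the duration of one API call, releasing on error too."""
    await limiter.acquire()
    try:
        yield
    finally:
        limiter.release()

async def run_subtask(limiter, task_id: int) -> str:
    async with limited(limiter):
        await asyncio.sleep(0.01)  # placeholder for the real API call
        return f"result-{task_id}"

async def main() -> list[str]:
    # Stand-in with the same acquire()/release() interface as the limiter above
    limiter = asyncio.Semaphore(5)
    return await asyncio.gather(*(run_subtask(limiter, i) for i in range(20)))

results = asyncio.run(main())
print(len(results))  # 20
```

All 20 sub-tasks complete, but never more than 5 are inside the guarded section at once.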

Step 3: Separate API keys per agent tier with independent limits

import os

import anthropic

API_KEYS = {
    "orchestrator": os.environ["ANTHROPIC_KEY_ORCH"],   # Tier 1 — low volume
    "code_agent":   os.environ["ANTHROPIC_KEY_CODE"],   # Tier 2 — high volume
    "review_agent": os.environ["ANTHROPIC_KEY_REVIEW"], # Tier 2 — medium volume
}

def get_client(agent_type: str) -> anthropic.Anthropic:
    return anthropic.Anthropic(api_key=API_KEYS[agent_type])

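Each key draws on its own rate limit bucket only if the provider meters limits per key. Anthropic, for example, applies limits per organization and lets you set lower limits per workspace, so keys in the same workspace still share a bucket; put each tier's key in its own workspace, or otherwise confirm the limits are independent, before relying on this. Once the buckets are truly separate, a rate limit on the code agent does not affect the orchestrator or review agent.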

Step 4: Introduce an async queue between pipeline stages

import asyncio

code_output_queue: asyncio.Queue = asyncio.Queue(maxsize=100)

async def code_agent_worker(task):
    # Respects rate limits internally
    result = await generate_code(task)
    await code_output_queue.put(result)

async def review_agent_worker():
    while True:
        result = await asyncio.wait_for(
            code_output_queue.get(),
            timeout=300  # 5-minute timeout — longer than max retry window
        )
        await review_code(result)

The queue decouples the timing of code generation and review. Rate limit delays in the code agent do not directly time out the review agent.
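
A runnable miniature of the same pattern may help when testing it in isolation. Everything here is illustrative: the sleeps stand in for rate-limit delays, `None` is used as an end-of-work sentinel, and the timeout is shortened so the demo finishes quickly.

```python
import asyncio

async def producer(queue: asyncio.Queue, items: list[str]) -> None:
    for item in items:
        await asyncio.sleep(0.01)  # stands in for a rate-limit delay
        await queue.put(item)      # blocks when the queue is full (backpressure)
    await queue.put(None)          # sentinel: no more work

async def consumer(queue: asyncio.Queue, timeout_s: float) -> list[str]:
    reviewed = []
    while True:
        try:
            item = await asyncio.wait_for(queue.get(), timeout=timeout_s)
        except asyncio.TimeoutError:
            break  # producer stalled longer than the budgeted retry window
        if item is None:
            break
        reviewed.append(f"reviewed:{item}")
    return reviewed

async def main() -> list[str]:
    queue: asyncio.Queue = asyncio.Queue(maxsize=2)
    _, reviewed = await asyncio.gather(
        producer(queue, ["a", "b", "c"]), consumer(queue, timeout_s=1.0)
    )
    return reviewed

print(asyncio.run(main()))  # ['reviewed:a', 'reviewed:b', 'reviewed:c']
```

The bounded `maxsize` matters: without it a fast producer can pile up unbounded work while the consumer is throttled.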

Step 5: Fix error classification to treat 429 as transient

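import anthropic

class TransientError(Exception):
    """Safe to retry with backoff."""

class PermanentError(Exception):
    """Do not retry."""

class RateLimitError(TransientError):
    """Retry with backoff; do not mark as permanently failed."""
    def __init__(self, retry_after: float):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after

# APIStatusError is the SDK exception that carries status_code and response
def handle_api_error(exc: anthropic.APIStatusError) -> None:
    if exc.status_code == 429:
        retry_after = float(exc.response.headers.get("retry-after", 60))
        raise RateLimitError(retry_after=retry_after)
    elif exc.status_code in (500, 503, 529):  # 529 is Anthropic's "overloaded" status
        raise TransientError(str(exc))
    else:
        raise PermanentError(str(exc))  # 400, 401, 403: don't retry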
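
Classification only pays off if the caller acts on it. The following is a minimal sketch of a retry wrapper built on that split; the exception classes are local stand-ins with the same names as above so the block runs on its own, and `call_with_retry`, its defaults, and the `flaky` demo function are assumptions for illustration.

```python
import random
import time

class TransientError(Exception):
    """Safe to retry."""

class PermanentError(Exception):
    """Never retried; propagates immediately."""

class RateLimitError(TransientError):
    def __init__(self, retry_after: float):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after

def call_with_retry(fn, max_attempts: int = 5, base_delay_s: float = 1.0,
                    sleep=time.sleep):
    """Retry transient errors; honor retry_after on rate limits, else back off."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError as exc:
            if attempt == max_attempts:
                raise  # retries exhausted: only now is it a real failure
            if isinstance(exc, RateLimitError):
                delay = exc.retry_after
            else:
                # Exponential backoff with jitter
                delay = base_delay_s * 2 ** (attempt - 1) + random.uniform(0, base_delay_s)
            sleep(delay)

attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError(retry_after=2)
    return "ok"

waits = []  # record requested sleeps instead of actually sleeping
print(call_with_retry(flaky, sleep=waits.append), waits)  # ok [2, 2]
```

Compensation actions such as rollbacks belong after the final `raise`, not on the first 429.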

Prevention

  • Assign separate API keys to agent tiers with different volume profiles; never share one key across a high-volume and a low-volume agent.
  • Calculate peak concurrent API calls at maximum fan-out before deploying; if it exceeds 70% of the rate limit, add a semaphore or throttle.
  • Set timeout values on waiting agents to be longer than the maximum Retry-After value the agent they depend on could receive (usually 120 seconds).
  • Use an async queue between pipeline stages to decouple rate-limit delays from the timeouts of the stages waiting on them.
  • Classify HTTP 429 as a transient error with backoff, not a permanent failure — never trigger compensation or rollback on a rate limit alone.
  • Add per-agent rate usage to your monitoring dashboard; alert when any agent exceeds 80% of its allocated rate budget.
  • Test cascade scenarios explicitly: simulate a rate limit on the highest-volume agent and verify that other agents continue operating.
  • Reserve at least 20% of your rate limit capacity as headroom for retry bursts — never plan to use 100% of the rate limit in steady state.

FAQ

Q: Can Anthropic or OpenAI increase my rate limits? A: Yes — both providers allow rate limit increases for paid accounts. Contact support with your expected usage volume and use case. Increases typically take 1-3 business days. For sudden scale needs, this is faster than redesigning the pipeline.

Q: How do I distribute load across multiple API keys without violating terms of service? A: Using multiple keys owned by the same organization is typically permitted for load distribution. Using separate accounts to circumvent per-account limits is a terms-of-service violation. Check your provider’s terms before implementing multi-key load balancing.

Q: Our pipeline has unpredictable volume spikes. How do we handle them? A: Use a token bucket rate limiter that allows bursts up to a configurable burst size, then throttles to the sustained rate. Libraries such as limits (Python) and bottleneck (JavaScript) provide ready-made rate limiters; check which strategies each one supports before assuming token-bucket behavior. Set the burst size to 20% of the per-minute limit.
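
If a library is not an option, a token bucket is short enough to write by hand. This is a single-process sketch under stated assumptions: `TokenBucket` is a hypothetical class, the clock is injectable so the demo can run without waiting, and the burst of 12 is 20% of a 60 req/min limit.

```python
import time

class TokenBucket:
    """Allows bursts up to `burst` calls, then refills at the sustained rate."""

    def __init__(self, rate_per_minute: float, burst: int, clock=time.monotonic):
        self.rate_per_s = rate_per_minute / 60.0
        self.capacity = burst
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate_per_s)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

fake_now = [0.0]  # fake clock so the example runs instantly
bucket = TokenBucket(rate_per_minute=60, burst=12, clock=lambda: fake_now[0])
burst_granted = sum(bucket.try_acquire() for _ in range(20))
fake_now[0] += 5  # five seconds later, five tokens have refilled
refill_granted = sum(bucket.try_acquire() for _ in range(20))
print(burst_granted, refill_granted)  # 12 5
```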

Q: What’s the right Retry-After wait for a 429 in practice? A: Anthropic’s 429 responses include a retry-after header with the exact seconds to wait. Honor it exactly. If the header is missing, default to 60 seconds. Never retry a 429 in under 10 seconds without reading the header — you will just get another 429.

Tags: #AI coding #Agents #Troubleshooting