One Agent's Rate Limit Cascades Into a Chain Failure

One rate-limited agent stalls the whole pipeline as waiting agents time out. Isolate the bucket, decouple stages, and stop 429 cascades for good.

Published: May 25, 2026 | Updated: Jun 17, 2026 | Author: AI Productivity Guide Team

You have a 5-step pipeline: research, summarize, code, review, deploy. The “code” agent fans out 20 parallel sub-tasks, each calling the Anthropic Messages API on the same key. The combined burst blows past your tier’s requests-per-minute (RPM) ceiling, so the code agent starts getting 429 responses with a retry-after header. The review agent, waiting on code output with a 120-second timeout, gives up before the retries clear. Deploy, waiting on review, times out too. The whole run fails, not because the task was impossible, but because one agent’s load profile exceeded its share of the rate bucket and the failure propagated to every later stage through uncoordinated timeouts.

Fastest fix: put one shared rate limiter in front of the high-volume agent so its peak concurrency stays under your tier limit, and make every waiting agent’s timeout longer than the worst-case retry-after (treat a 429 as transient, never as a permanent failure). The rest of this guide is how to find the offending agent and harden the pipeline so it cannot happen again.

How rate limits actually work (June 2026)

Two facts decide everything below, and both are commonly misunderstood:

Anthropic limits are per organization, per model class — not per API key. Every key under the same org draws from one Opus 4.x bucket and one Sonnet 4.x bucket. Workspaces can set lower limits, but the org ceiling always applies on top. So handing each agent its own key does not split the bucket on Anthropic.
OpenAI limits are per organization + model too, with separate RPM, TPM, RPD, and TPD dimensions. Exceeding any one returns 429.

Both providers use a token-bucket that replenishes continuously, so there is no clean “reset at the top of the minute.” A 60 RPM limit is enforced as roughly 1 request/second; a tight burst trips it even if your 60-second average is fine. Anthropic also enforces an acceleration limit: a sharp spike in usage can return 429 even when you are under your steady-state ceiling, which is exactly what a retry storm produces.
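To make the burst behaviour concrete, here is a minimal simulation of a continuously refilling token bucket. It is a sketch of the general technique, not the provider's implementation: the `TokenBucket` class, `count_rejected`, and the bucket capacity of 1 are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Hypothetical bucket: refills `rate` tokens per second, stores at most `capacity`."""
    rate: float
    capacity: float
    tokens: float = 0.0
    last: float = 0.0

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, then try to spend one token.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def count_rejected(arrival_times, rpm=60, capacity=1.0):
    """Count requests that would be refused (a 429) for the given arrival times."""
    bucket = TokenBucket(rate=rpm / 60.0, capacity=capacity, tokens=capacity)
    return sum(not bucket.allow(t) for t in arrival_times)


burst = [0.0] * 20                      # 20 requests fired at the same instant
paced = [float(i) for i in range(20)]   # the same 20 requests, one per second

print(count_rejected(burst), count_rejected(paced))
```

Both workloads send 20 requests, well under 60 per minute on average, yet only the paced one gets through cleanly under these assumptions.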

When you get a 429, read the response. Anthropic returns:

retry-after: 12
anthropic-ratelimit-requests-remaining: 0
anthropic-ratelimit-requests-reset: 2026-06-17T18:04:30Z
anthropic-ratelimit-input-tokens-remaining: 41000
anthropic-ratelimit-tokens-reset: 2026-06-17T18:04:12Z

The error type is rate_limit_error, and the message names which limit (RPM, ITPM, or OTPM) you hit. OpenAI’s 429 carries Retry-After (and retry-after-ms) plus x-ratelimit-remaining-*. Honor retry-after exactly — guessing a shorter delay just earns another 429.
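A small helper can turn those headers into a wait time. This is a sketch under assumptions: `wait_seconds` is a hypothetical name, it assumes `retry-after` is a number of seconds as in the sample above, and falling back to the requests-reset timestamp is an illustrative choice rather than documented provider guidance.

```python
from datetime import datetime, timezone


def wait_seconds(headers, now=None, default=60.0):
    """Return how long to wait after a 429, preferring the explicit headers."""
    lowered = {k.lower(): v for k, v in headers.items()}
    if "retry-after-ms" in lowered:               # millisecond variant (OpenAI)
        return float(lowered["retry-after-ms"]) / 1000.0
    if "retry-after" in lowered:                  # seconds
        return float(lowered["retry-after"])
    reset = lowered.get("anthropic-ratelimit-requests-reset")
    if reset:
        # Assumed fallback: wait until the advertised reset time (RFC 3339, UTC).
        now = now or datetime.now(timezone.utc)
        reset_at = datetime.fromisoformat(reset.replace("Z", "+00:00"))
        return max(0.0, (reset_at - now).total_seconds())
    return default                                # no usable header at all
```

For example, `wait_seconds({"retry-after": "12"})` yields 12 seconds, and an empty header set falls back to the 60-second default.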

Which bucket are you in?

| Symptom | Most likely cause | Go to |
| --- | --- | --- |
| One agent owns over 80% of requests on the key | Single shared bucket, one greedy agent | Step 1, Step 2 |
| Total request rate rose after the first `429` | Retry storm / acceleration limit | Step 5 + Prevention |
| Fails only at high fan-out, fine in small tests | `num_tasks * calls_per_task` exceeds the limit | Step 2 |
| Waiting agent fails before the blocked one retries | Timeout shorter than worst-case `retry-after` | Step 4 |
| `429` triggers a rollback or alert | Error handler treats transient as permanent | Step 5 |
| You hit it instantly with no spike | Billing cap, not a rate limit (`insufficient_quota`) | FAQ |

Common causes

1. All agents share one bucket and one agent is greedy

Every agent (orchestrator, sub-agents, reviewers) runs through the same org/model bucket. When one high-volume agent (a parallelized code or research agent making 30 concurrent calls) saturates it, every other agent gets starved and starts seeing 429.

How to spot it: Tag each call (see Step 1) and pull usage by tag. If a single agent type accounts for over 80% of requests in the window where 429s appear, that is your consumer.

2. Fan-out multiplies per-agent usage by N

The orchestrator dispatches 50 tasks to 50 parallel instances, each making 3 LLM calls — 150 calls fired near-simultaneously, well above a tier where RPM is in the dozens. The pipeline was tested at 5 tasks (15 calls, fine) and 50 was never modeled.

How to spot it: Compute peak concurrency as num_parallel_tasks * calls_per_task and compare to your tier RPM. If the theoretical max exceeds the limit, uncontrolled fan-out will cascade.
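The check is simple enough to run before every deploy. A minimal sketch, assuming a hypothetical tier limit of 50 RPM and the roughly 70% planning threshold from the Prevention list; `peak_calls` and `fanout_is_safe` are illustrative names.

```python
def peak_calls(num_parallel_tasks: int, calls_per_task: int) -> int:
    """Worst case: every parallel task fires all of its calls at once."""
    return num_parallel_tasks * calls_per_task


def fanout_is_safe(num_parallel_tasks: int, calls_per_task: int,
                   tier_rpm: int, headroom: float = 0.7) -> bool:
    """True if the theoretical peak stays under `headroom` of the tier RPM."""
    return peak_calls(num_parallel_tasks, calls_per_task) <= tier_rpm * headroom


TIER_RPM = 50  # assumed value; substitute your own tier's limit

print(fanout_is_safe(5, 3, TIER_RPM))    # the small test run: 15 calls
print(fanout_is_safe(50, 3, TIER_RPM))   # the production run: 150 calls
```

Under these assumed numbers the 5-task test passes and the 50-task run does not, which matches the scenario described above.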

3. Waiting timeout is shorter than the blocked agent’s retry window

Agent C waits 90s for Agent B. Agent B hit a rate limit and is honoring a 120s retry-after. C times out at 90s and fails before B can even retry. The cascade is purely a timeout-coordination bug.

How to spot it: Map every timeout in the chain. For each agent, find its worst-case retry delay (the sum of retry-after waits across its allowed attempts). Any waiting agent whose timeout is shorter than the worst case of the agent it waits on will cascade.
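The timeout map can be checked mechanically. The sketch below uses hypothetical agent names and numbers (the 90s timeout and 120s retry-after mirror the example above); `worst_case_wait` and `find_short_timeouts` are illustrative helpers, not part of any library.

```python
def worst_case_wait(retry_after_s: float, max_attempts: int, jitter_s: float = 0.0) -> float:
    """Upper bound on time spent waiting if every allowed attempt is rate-limited."""
    return max_attempts * (retry_after_s + jitter_s)


def find_short_timeouts(waits_on, timeout_s, worst_case_s):
    """Return (waiter, blocked) pairs where the waiter gives up too early."""
    problems = []
    for waiter, blocked in waits_on.items():
        if timeout_s[waiter] < worst_case_s.get(blocked, 0.0):
            problems.append((waiter, blocked))
    return problems


# Hypothetical pipeline: review waits on code, deploy waits on review.
waits_on = {"review": "code", "deploy": "review"}
timeout_s = {"review": 90.0, "deploy": 600.0}
worst_case_s = {
    "code": worst_case_wait(120.0, max_attempts=1),  # one 120s retry-after
    "review": 150.0,                                 # assumed figure
}

problems = find_short_timeouts(waits_on, timeout_s, worst_case_s)
print(problems)
```

Note that a waiting agent's own worst case should include the time it may spend waiting on the stage before it, so the figures compound along the chain.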

4. No queue between stages — synchronous blocking

Stages are called synchronously: result_b = call_b(result_a). If B is slow from rate limiting, the calling thread blocks; with one thread per parallel task, they pile up and exhaust the pool, making the burst worse.

How to spot it: Check whether stages are joined by a synchronous call or an async queue. A direct synchronous HTTP call with no buffer in between has nowhere to absorb a rate-limit delay.

5. The error handler treats `429` like a fatal failure

On a 429, an agent raises a generic PipelineError. The orchestrator catches it, marks the task permanently failed, and fires compensation (rollbacks, alerts). The task was only rate-limited, not failed — the handler escalated transient to fatal.

How to spot it: Check whether your handler separates RateLimitError (transient, back off and retry) from AuthenticationError or content-policy errors (permanent, do not retry). If everything funnels to one handler, rate limits get misclassified.

6. Retries amplify the spike instead of draining it

Each agent retries on its own with a fixed 1s wait. Five agents retry at once, the global request rate doubles during the limit window, and you trip the acceleration limit, stretching a 2-minute incident into a 20-minute one.

How to spot it: Measure the request rate after the first 429. If it went up rather than down, retry amplification is the driver — you need backoff with jitter and a circuit breaker, not just retries.
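If you log a timestamp and status code per request, the comparison takes a few lines. A sketch assuming a hypothetical log format of `(timestamp_seconds, status_code)` tuples; `rate_around_first_429` is an illustrative name.

```python
def rate_around_first_429(events, window_s=60.0):
    """Count requests in the window before and after the first 429.

    `events` is an iterable of (timestamp_seconds, status_code) tuples.
    Returns (before, after), or None if no 429 was seen.
    """
    events = list(events)
    first = min((t for t, status in events if status == 429), default=None)
    if first is None:
        return None
    before = sum(1 for t, _ in events if first - window_s <= t < first)
    after = sum(1 for t, _ in events if first <= t < first + window_s)
    return before, after


# Synthetic log: 10 steady requests, then a retry storm starting at t=10s.
log = [(float(i), 200) for i in range(10)]
log += [(10.0 + 0.5 * i, 429) for i in range(30)]

before, after = rate_around_first_429(log)
print(before, after, "amplified" if after > before else "draining")
```

A healthy backoff policy should show the second number lower than the first.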

Shortest path to fix

Step 1: Find the bottleneck agent by tagging every call

Tag each request with the agent name so usage breaks down by consumer:

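import anthropic
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    messages=messages,
    metadata={"user_id": f"agent:{agent_name}:run:{run_id}"},  # per-agent tag
)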

Then read the per-agent breakdown in the Claude Console Usage page (platform.claude.com > Usage), or pull it programmatically with the Rate Limits and Usage APIs. The agent owning the spike is your target.
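If you would rather not depend on a dashboard, a client-side tally gives the same answer from your own process. This is a minimal sketch: `AgentCallLog` is a hypothetical helper, and it assumes you call `record` once per request with the agent name and the HTTP status you received.

```python
from collections import Counter


class AgentCallLog:
    """Hypothetical in-process tally of requests and 429s per agent."""

    def __init__(self):
        self.requests = Counter()
        self.rate_limited = Counter()

    def record(self, agent_name: str, status_code: int) -> None:
        self.requests[agent_name] += 1
        if status_code == 429:
            self.rate_limited[agent_name] += 1

    def top_consumer(self):
        """Return (agent_name, share_of_requests), or None if nothing was logged."""
        total = sum(self.requests.values())
        if total == 0:
            return None
        name, count = self.requests.most_common(1)[0]
        return name, count / total


log_by_agent = AgentCallLog()
for _ in range(17):
    log_by_agent.record("code", 200)
for _ in range(2):
    log_by_agent.record("review", 429)
log_by_agent.record("research", 200)

print(log_by_agent.top_consumer())
```

An agent whose share crosses the 80% mark from the symptom table is the one to put behind a limiter.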

Step 2: Put a shared rate limiter in front of the high-volume agent

The point is one limiter shared across all parallel instances of the same agent, sized below your tier so other agents keep headroom. If you are on LangChain, use the built-in token-bucket limiter rather than rolling your own:

from langchain_core.rate_limiters import InMemoryRateLimiter
from langchain_anthropic import ChatAnthropic

# 40 RPM = ~0.66 req/s; leave headroom for other agents under the org bucket
limiter = InMemoryRateLimiter(
    requests_per_second=0.66,
    check_every_n_seconds=0.1,
    max_bucket_size=5,   # cap burst size
)

code_llm = ChatAnthropic(model="claude-sonnet-4-6", rate_limiter=limiter)

In LangGraph, also cap branch fan-out so map-reduce nodes cannot all fire at once:

graph.invoke(state, config={"max_concurrency": 5})

Framework-agnostic, an asyncio.Semaphore gives similar protection by capping in-flight calls (it bounds concurrency, not requests per second). Use one semaphore instance shared by every worker for that agent:

import asyncio

# Shared by ALL parallel instances of the code agent
_code_agent_sem = asyncio.Semaphore(5)

async def code_agent_call(task):
    async with _code_agent_sem:
        return await generate_code(task)

Note the InMemoryRateLimiter is per-process. If your agents run across multiple workers or machines, back the limiter with Redis (or put a proxy like LiteLLM in front) so the bucket is genuinely shared.
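The semaphore caps how many calls are in flight, not how many start per second. If you need a rate cap without LangChain, a small asyncio token bucket is one option. The sketch below is illustrative only: `AsyncRateLimiter` is a hypothetical class, it is still per-process, and the injectable `clock` and `sleep` parameters exist purely to make it testable.

```python
import asyncio
import time


class AsyncRateLimiter:
    """Hypothetical per-process token bucket shared by all workers of one agent."""

    def __init__(self, requests_per_second, max_burst=1,
                 clock=time.monotonic, sleep=asyncio.sleep):
        self._rate = float(requests_per_second)
        self._capacity = float(max_burst)
        self._tokens = float(max_burst)
        self._clock = clock
        self._sleep = sleep
        self._last = clock()
        self._lock = asyncio.Lock()

    async def acquire(self):
        # Holding the lock while sleeping serializes waiters in arrival order.
        async with self._lock:
            while True:
                now = self._clock()
                elapsed = now - self._last
                self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
                self._last = now
                if self._tokens >= 1.0:
                    self._tokens -= 1.0
                    return
                # Sleep just long enough for one token to accumulate.
                await self._sleep((1.0 - self._tokens) / self._rate)


async def limited_call(limiter, coro_fn, *args):
    """Wait for a token, then run the wrapped coroutine function."""
    await limiter.acquire()
    return await coro_fn(*args)
```

Create one limiter inside your running event loop and pass the same instance to every worker; combining it with the semaphore bounds both the start rate and the number of calls in flight.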

Step 3: Stop trying to “split the bucket” with extra keys

On Anthropic, extra keys under the same org share the bucket, so this buys you nothing. Real isolation options, in order of effort:

Different model classes for different agents. Opus 4.x and Sonnet 4.x have separate buckets. Route the low-volume orchestrator to Opus and the high-volume worker to Sonnet (or Haiku) and they no longer contend.
Workspace limits to cap a noisy agent so it cannot eat the whole org bucket (Console > Settings > Limits). Workspace limits can only be set below the org limit.
Separate organizations for genuinely separate products. Using extra accounts to dodge a single org’s limit is a terms-of-service violation; separate legitimate orgs are not.

On OpenAI the bucket is per org+model, so the same logic applies: separate by model, not by key.

Step 4: Decouple stages with a queue and a generous timeout

Put an async queue between stages so a rate-limit delay in one stage does not directly time out the next. Set the consumer timeout longer than the worst-case retry-after chain:

import asyncio

code_output_queue: asyncio.Queue = asyncio.Queue(maxsize=100)

async def code_worker(task):
    result = await code_agent_call(task)  # respects the limiter from Step 2
    await code_output_queue.put(result)

async def review_worker():
    while True:
        result = await asyncio.wait_for(
            code_output_queue.get(),
            timeout=300,  # longer than max retry window
        )
        await review_code(result)

Step 5: Classify `429` as transient and honor `retry-after`

Read the header, never guess, and add jitter so retrying agents do not stampede in lockstep:

import random
import anthropic

class RateLimitError(Exception):
    """Transient: back off and retry, do NOT mark permanently failed."""

def classify(exc: anthropic.APIStatusError) -> tuple[RateLimitError, float]:
    """Return (transient marker, seconds to wait); re-raise permanent errors."""
    if exc.status_code == 429:
        # float() tolerates fractional values; fall back to 60s if the header is absent
        retry_after = float(exc.response.headers.get("retry-after", 60))
        return RateLimitError(), retry_after + random.uniform(0, 5)  # jitter
    if exc.status_code in (500, 503, 529):   # 529 = Anthropic overloaded
        return RateLimitError(), 5 + random.uniform(0, 5)
    raise exc  # 400/401/403 are permanent, do not retry

Cap retries (5 is standard) and wrap the agent in a circuit breaker so a sustained limit pauses the whole stage instead of hammering it. This is what stops the retry-amplification in cause 6.
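One way to combine the retry cap with a circuit breaker is sketched below. Everything here is hypothetical scaffolding: `Transient` stands in for a classified rate-limit error carrying its wait time, and `CircuitBreaker`, `CircuitOpen`, and `call_with_retries` are illustrative names with assumed thresholds, not a library API.

```python
import asyncio
import random
import time


class Transient(Exception):
    """Hypothetical transient failure that carries the wait the server asked for."""
    def __init__(self, wait_s: float):
        super().__init__(f"transient, retry in {wait_s}s")
        self.wait_s = wait_s


class CircuitOpen(Exception):
    """Raised instead of calling the API while the stage is paused."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_s=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def check(self):
        if self._opened_at is None:
            return
        if self._clock() - self._opened_at < self.cooldown_s:
            raise CircuitOpen("stage paused after repeated rate limits")
        self._opened_at, self._failures = None, 0   # cooldown over: allow a trial call

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()

    def record_success(self):
        self._failures = 0


async def call_with_retries(call, breaker, max_retries=5,
                            sleep=asyncio.sleep, max_jitter_s=5.0):
    """Retry `call` on Transient errors, honoring its wait plus random jitter."""
    for attempt in range(max_retries + 1):
        breaker.check()
        try:
            result = await call()
        except Transient as exc:
            breaker.record_failure()
            if attempt == max_retries:
                raise
            await sleep(exc.wait_s + random.uniform(0, max_jitter_s))
        else:
            breaker.record_success()
            return result
```

Sharing one breaker across all workers of a stage means that once the limit is clearly sustained, new calls fail fast with `CircuitOpen` instead of adding to the request rate.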

How to confirm it’s fixed

Reproduce the spike. Run the pipeline at peak fan-out (the largest num_tasks you expect). With the limiter in place, watch anthropic-ratelimit-requests-remaining in the response headers — it should approach 0 but 429s should not appear.
Inject a forced 429. Temporarily set a tiny workspace limit (or mock a 429 on the code agent) and confirm the review and deploy agents wait rather than fail. No rollback or alert should fire.
Check the rate did not climb after a limit. Graph requests/min during the test; after any 429, the rate should fall (backoff working), not rise.

Prevention

Route high-volume and low-volume agents to different model classes so they sit in separate buckets; do not rely on separate keys under one Anthropic org.
Compute peak concurrent calls at maximum fan-out before deploying; if it exceeds ~70% of your tier RPM, add a shared limiter or semaphore.
Set each waiting agent’s timeout longer than the worst-case retry-after chain of the agent it waits on (typically 120s+ per attempt).
Use an async queue between stages so a rate-limit delay in one stage does not trip the next stage’s timeout.
Classify 429 as transient with backoff and jitter; never trigger rollback or compensation on a rate limit alone.
Add per-agent usage to your dashboard and alert when any agent passes 80% of its budget; alert separately on 429 rate above ~5%.
Cap workspace limits on noisy agents so one consumer cannot drain the org bucket.
Reserve at least 20% rate headroom for retry bursts; never plan to run at 100% of the limit in steady state.
Test the cascade explicitly: force a 429 on the highest-volume agent and verify the rest keep running.

FAQ

Q: On Anthropic, will giving each agent its own API key split the rate limit? A: No. As of June 2026, Anthropic rate limits are per organization and per model class, so every key under the same org draws from the same Opus 4.x / Sonnet 4.x bucket. To isolate agents, route them to different model classes, set per-workspace limits, or use separate organizations, not separate keys. See Anthropic’s [rate limits docs](https://platform.claude.com/docs/en/api/rate-limits).

Q: I hit a 429 instantly with no traffic spike. Why? A: That is usually a billing cap, not a rate limit. OpenAI returns 429 with type: insufficient_quota when your org has hit its spend limit, and Anthropic blocks usage once you reach your tier’s monthly spend ceiling. Check your console billing/usage page; a rate-limit fix will not help a quota problem.

Q: What’s the right retry-after wait in practice? A: Read the header and honor it exactly. Anthropic’s retry-after aligns with anthropic-ratelimit-requests-reset (RPM) or anthropic-ratelimit-tokens-reset (TPM). OpenAI sends Retry-After and retry-after-ms. If the header is missing, default to 60s. Add a little random jitter so parallel agents do not all retry at the same instant.

Q: Can the provider raise my limits, and how fast? A: Both raise limits for paid accounts. Anthropic advances tiers automatically as cumulative spend crosses thresholds (up to Tier 4), and you can request more on Console > Settings > Limits; OpenAI graduates usage tiers by spend the same way. For urgent needs, contacting sales/support is faster than redesigning the pipeline, but design for headroom regardless.

Q: My burst is genuinely large — how do I smooth it without serializing everything? A: Use a token-bucket limiter that allows a bounded burst then throttles to the sustained rate (InMemoryRateLimiter’s max_bucket_size, or limits in Python / bottleneck in JS). Set the burst to roughly 20% of the per-minute limit. For batch jobs, also consider Anthropic’s Message Batches API, which has its own separate limit pool.

Tags: #AI coding #Agents #Troubleshooting
