Claude 429 Rate-Limit Retry Loops: Stop the Hammering and Fix It

Claude keeps looping on 429 rate_limit_error? Read retry-after, cap concurrency, separate 429 from 529, and stay under your tier’s RPM/ITPM/OTPM.

Published: May 17, 2026 Updated: Jun 15, 2026 Author: AI Productivity Guide Team

You’re running a Claude API script (bulk-translating 200 articles, or an agent chewing through support tickets) and halfway in it starts looping on 429 rate_limit_error. Your retry-on-error logic just keeps hammering and digs the hole deeper. Or you’re in Claude Code, the agent hits the wall, can’t stop, and keeps making things worse.

Fastest fix: stop the loop now, wait out the window, then on restart read the retry-after header instead of a fixed sleep(), and cap concurrency to your tier (Tier 1 is only 50 RPM as of June 2026). If you’re not on the API at all — you’re in the Claude.ai app or Claude Code on a Pro/Max plan — you’ve hit a subscription cap, not an API rate limit, and the only fix is to wait for the reset (see cause 6 and the plan-cap fix path below).

First, make sure you’re solving the right problem. Two different HTTP codes look similar but need different responses:

Code	`error.type`	Meaning	What to do
`429`	`rate_limit_error`	Your account exceeded a limit (RPM, ITPM, or OTPM)	Honor `retry-after`, throttle your own rate
`529`	`overloaded_error`	Anthropic’s API is overloaded across all users	Back off with jitter; optionally fail over to another model/provider

This article is about 429. If you’re seeing 529 overloaded_error, the cause is server-side traffic, not your code — check the Anthropic status page and retry with backoff; widening concurrency limits won’t help.
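The split in the table above can be encoded as a small dispatch function. This is only a sketch: `next_action` and the strings it returns are hypothetical names chosen for illustration, not part of any SDK.

```python
def next_action(status_code: int, error_type: str = "") -> str:
    """Map an HTTP status / error.type pair to a coarse retry strategy."""
    if status_code == 429 or error_type == "rate_limit_error":
        # Your own account exceeded RPM, ITPM, or OTPM.
        return "honor-retry-after-and-throttle"
    if status_code == 529 or error_type == "overloaded_error":
        # Server-side overload: slowing your own traffic will not clear it.
        return "backoff-with-jitter-or-failover"
    if 200 <= status_code < 300:
        return "ok"
    # Anything else (bad request, auth failure, ...) needs a human look.
    return "do-not-retry-blindly"
```

Keeping the two codes on separate branches makes it harder for a generic "retry on any error" handler to treat an account-level limit like a server outage.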

How Claude’s limits actually work (as of June 2026)

The Messages API enforces three separate rate limits per model class, not one combined “TPM”:

RPM — requests per minute
ITPM — input tokens per minute (only uncached input tokens count on most models)
OTPM — output tokens per minute

Exceeding any one returns a 429. Tier 1 (the tier every new org starts on) is tight. Current standard limits:

Tier 1 model (June 2026)	RPM	ITPM	OTPM
Claude Sonnet 4.x	50	30,000	8,000
Claude Haiku 4.5	50	50,000	10,000
Claude Opus 4.x	50	500,000	80,000

Two surprises catch people out:

Sonnet’s Tier 1 ITPM is only 30,000. A single 30K-token request can consume a full minute of input budget. Opus has a far higher ITPM (500K) because it’s metered on a separate, more generous schedule.
Cached input doesn’t count toward ITPM on most models. cache_read_input_tokens are excluded; only input_tokens (after your last cache breakpoint) plus cache_creation_input_tokens count. With an 80% cache hit rate only 20% of your input is metered, so you can push ~5x as much total input through the same ITPM ceiling.
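The caching arithmetic can be checked with a few lines. This sketch assumes a steady state where a fixed fraction of every request is served as cache reads, and it ignores the one-off cost of cache writes (which do count); `effective_input_per_minute` is a hypothetical helper name.

```python
def effective_input_per_minute(itpm_limit: int, cache_read_fraction: float) -> float:
    """Total input tokens per minute that fit under an ITPM ceiling
    when a fraction of every request is served as cache reads."""
    if not 0 <= cache_read_fraction < 1:
        raise ValueError("cache_read_fraction must be in [0, 1)")
    # Only the uncached share is metered against ITPM.
    counted_fraction = 1 - cache_read_fraction
    return itpm_limit / counted_fraction

total = effective_input_per_minute(30_000, 0.80)
print(round(total))  # 150000
```

At an 80% hit rate the 30,000 ITPM ceiling carries about 150,000 total input tokens per minute, which is where the 5x figure comes from.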

You can read your live numbers from the response headers on every call (not just 429s) or from the Limits page in the Console.

Common causes

Ordered by hit rate, highest first.

1. Tight retry loop that ignores retry-after

Every 429 response includes a retry-after header (in seconds). If your script retries instantly, each retry is rejected and — under acceleration limits — the window can stretch further. The header is authoritative: it aligns with anthropic-ratelimit-requests-reset when you blew RPM, or anthropic-ratelimit-tokens-reset when you blew a token limit (both RFC 3339 timestamps).

How to spot it: your retry uses a fixed sleep(1) instead of reading retry-after. That’s the bug.
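One way to turn those headers into a sleep duration is sketched below. It assumes the headers arrive as a plain dict with lowercase keys and that the reset values are RFC 3339 timestamps as described above; `seconds_to_wait` and its `default` parameter are hypothetical.

```python
from datetime import datetime, timezone

def seconds_to_wait(headers: dict, now: datetime, default: float = 1.0) -> float:
    """Prefer retry-after; otherwise fall back to the latest reset timestamp."""
    retry_after = headers.get("retry-after")
    if retry_after is not None:
        return max(0.0, float(retry_after))

    waits = []
    for name in ("anthropic-ratelimit-requests-reset",
                 "anthropic-ratelimit-tokens-reset"):
        value = headers.get(name)
        if value:
            # fromisoformat() rejects a trailing "Z" before Python 3.11.
            reset_at = datetime.fromisoformat(value.replace("Z", "+00:00"))
            waits.append((reset_at - now).total_seconds())
    # A reset time already in the past means no wait is needed.
    return max(0.0, max(waits)) if waits else default
```

Pass `datetime.now(timezone.utc)` as `now` in real use; taking it as a parameter keeps the function easy to test.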

2. Fanout too wide — instant RPM blowout

Promise.all over 50 prompts fires 50 requests in roughly one second. Even if your per-minute average is fine, the instant burst trips the limit, because the API enforces RPM as a token bucket (a 50 RPM limit behaves close to “1 request every ~1.2s” under bursts).

How to spot it: you’re using Promise.all / a thread pool with no concurrency cap.
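A client-side pacer is one way to flatten such bursts. The sketch below spaces request starts 60 / RPM seconds apart; the `Pacer` class is hypothetical, single-threaded, and takes an injectable clock and sleep function only so it can be tested without real waiting.

```python
import time

class Pacer:
    """Space out request starts so they never exceed `rpm` per minute."""

    def __init__(self, rpm: int, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / rpm   # 50 RPM -> 1.2 s between starts
        self._clock = clock
        self._sleep = sleep
        self._next_slot = clock()

    def wait_turn(self) -> float:
        """Block until the next free slot; return how long we slept."""
        now = self._clock()
        delay = max(0.0, self._next_slot - now)
        if delay:
            self._sleep(delay)
        # Reserve the following slot one interval after this one.
        self._next_slot = max(now, self._next_slot) + self.min_interval
        return delay
```

Call `pacer.wait_turn()` immediately before each request. With `Pacer(50)` a loop over 50 prompts takes about a minute instead of firing in one second.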

3. Long context exhausts ITPM

On Tier 1 Sonnet, ITPM is 30,000. Five concurrent requests of 30K input tokens add up to 150K in a minute, 5x the limit. Output also has its own 8,000 OTPM ceiling.

How to spot it: compute single-request input tokens × requests-per-minute and compare against your tier’s ITPM in Anthropic’s rate-limit docs. If you send large repeated context (a system prompt, a long document), see whether prompt caching gets you under the line.
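The check described above is a single multiplication and division. A small sketch, with `itpm_usage` as a hypothetical helper and the Tier 1 Sonnet figures from this article as inputs:

```python
def itpm_usage(tokens_per_request: int, requests_per_minute: int,
               itpm_limit: int) -> float:
    """Return projected input-token load as a multiple of the ITPM limit."""
    return (tokens_per_request * requests_per_minute) / itpm_limit

load = itpm_usage(tokens_per_request=30_000, requests_per_minute=5,
                  itpm_limit=30_000)
print(f"{load:.1f}x of ITPM")  # 5.0x of ITPM
```

Any result above 1.0 means the job cannot run at that rate without 429s, whatever the retry logic does.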

4. Agent doesn’t know when to stop

Claude Code or a custom agent sees a 429 from a tool call, treats it as “temporarily unavailable,” and just keeps retrying — forever, without explicit failure handling.

How to spot it: the agent log has more than 5 occurrences of 429 in a short span.
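A simple guard against this failure mode is a circuit breaker that counts 429s in a sliding window and tells the agent to stop. The sketch below uses the "more than 5" threshold from above; the 60-second window and the `RateLimitBreaker` name are assumptions for illustration.

```python
from collections import deque

class RateLimitBreaker:
    """Trip once too many 429s land inside a sliding time window."""

    def __init__(self, max_hits: int = 5, window_s: float = 60.0):
        self.max_hits = max_hits
        self.window_s = window_s
        self._hits = deque()

    def record_429(self, now: float) -> bool:
        """Record one 429 at time `now` (seconds); True means stop the agent."""
        self._hits.append(now)
        # Drop hits that have aged out of the window.
        while self._hits and now - self._hits[0] > self.window_s:
            self._hits.popleft()
        return len(self._hits) > self.max_hits
```

In an agent loop, call `record_429(time.monotonic())` on every rate-limit error and abort the run with an explicit failure when it returns True.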

5. Multiple clients sharing one API key

Backend, cron, and CI all use the same key. Each thinks its share is fine; combined they bust the org-level limit (limits are enforced per organization, and per workspace if you’ve set workspace limits).

How to spot it: the Console Usage page shows multiple source IPs hitting one key, or your aggregate RPM in the Usage charts is higher than any single service expects.

6. Claude Code / Claude.ai plan cap (not an API limit)

If you’re in the Claude.ai app or in Claude Code on a Pro ($20) or Max ($100 / $200) plan, you’re not hitting API rate limits at all — you’re hitting subscription usage caps. As of June 2026 there are two layers: a 5-hour rolling session window (the counter starts on your first prompt, not on the clock) and a weekly rolling cap on top of it. Higher-effort models (Opus 4.7) drain the weekly budget far faster than Sonnet 4.6. Hit the weekly cap and you’re locked even if the 5-hour window still has room.

How to spot it: the UI says something like “You’ve reached your usage limit, resets in X hours,” or /usage / /status inside Claude Code shows a weekly limit at 100%.

Shortest path to fix (API)

Step 1: Kill the loop, back off

ps aux | grep my-script
kill -9 <PID>

Or Ctrl+C the Claude Code agent. Don’t immediately restart. The token bucket replenishes continuously, so a short pause restores some headroom; wait until retry-after (or the anthropic-ratelimit-*-reset timestamp) has passed before debugging.

Step 2: Honor retry-after with exponential backoff as a fallback

import time
import anthropic
from anthropic import RateLimitError

client = anthropic.Anthropic()

def call_with_backoff(prompt, max_retries=5):
    """Call the Messages API, sleeping on 429 for as long as the server asks."""
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
            )
        except RateLimitError as e:
            # Trust the header; fall back to exponential backoff if absent.
            # float() accepts whole and fractional second values alike.
            wait = float(e.response.headers.get("retry-after", 2 ** attempt))
            print(f"Rate limited, sleeping {wait}s")
            time.sleep(wait)
    # Every attempt was rate limited: fail loudly instead of looping forever.
    raise RuntimeError("Max retries exceeded")

Key rule: read retry-after. Don’t hardcode a sleep. Add ±20% jitter when you fall back to 2 ** attempt so parallel workers don’t retry in lockstep (the “thundering herd”).
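The jittered fallback can be sketched as a standalone helper. The ±20% spread comes from the rule above; the 60-second cap, the `backoff_delay` name, and the injectable `rng` are assumptions added for illustration.

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  jitter: float = 0.2, rng=random) -> float:
    """Exponential delay for `attempt` (0-based), spread by +/- `jitter`."""
    raw = min(cap, base * (2 ** attempt))
    # Scale by a random factor in [1 - jitter, 1 + jitter].
    return raw * rng.uniform(1 - jitter, 1 + jitter)
```

Use it only when retry-after is missing. Two workers that fail on the same attempt then wake at slightly different times instead of colliding again.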

Note the SDK already retries 429 and 5xx automatically (default max_retries=2, ~1-2s exponential backoff). Historically the SDK did not auto-retry 529 overloaded_error, so if you face 529s, add that handling yourself or fail over.

Step 3: Cap concurrency

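import asyncio

sem = asyncio.Semaphore(3)  # at most 3 in flight

async def safe_call(prompt):
    async with sem:
        # call_with_backoff is synchronous, so run it in a worker thread
        # (asyncio.to_thread needs Python 3.9+); awaiting it directly fails.
        return await asyncio.to_thread(call_with_backoff, prompt)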

Rule of thumb: RPM / 60 is your sustainable request rate per second (about 0.8 on Tier 1’s 50 RPM). The number you can keep in flight is roughly that rate times your typical request latency, so on Tier 1 that’s a handful of workers, not fifty.
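To see a concurrency cap applied to a whole job, here is a self-contained sketch that fans out over a list of prompts with a bounded number in flight. `run_capped` and its `worker` argument are hypothetical; `worker` stands in for whatever async function actually calls the API.

```python
import asyncio

async def run_capped(prompts, worker, limit=3):
    """Run worker(prompt) for every prompt with at most `limit` in flight."""
    # Create the semaphore inside the running event loop.
    sem = asyncio.Semaphore(limit)

    async def one(prompt):
        async with sem:
            return await worker(prompt)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(one(p) for p in prompts))
```

A typical call is `asyncio.run(run_capped(prompts, worker, limit=3))`. A semaphore bounds concurrency, not request rate, so fast responses can still exceed RPM unless starts are paced as well.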

Step 4: Batch many small calls into one

Combine 10 short translations into a single prompt that returns 10 results:

Translate the following 10 English passages to Chinese.
Return a JSON array of 10 strings, in order:
[
  "text 1",
  "text 2",
  ...
]

Output tokens still count toward OTPM, but your RPM drops ~10x — usually the binding constraint on Tier 1.
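Building the combined prompt and checking the reply can be sketched as follows. This assumes the model returns a bare JSON array with no surrounding prose, which is not guaranteed; `build_batch_prompt` and `parse_batch_reply` are hypothetical helper names.

```python
import json

def build_batch_prompt(passages: list[str]) -> str:
    """Pack several passages into one translation request."""
    n = len(passages)
    return (
        f"Translate the following {n} English passages to Chinese.\n"
        f"Return a JSON array of {n} strings, in order:\n"
        + json.dumps(passages, ensure_ascii=False, indent=2)
    )

def parse_batch_reply(reply_text: str, expected: int) -> list[str]:
    """Parse the model's JSON array and fail loudly on a count mismatch."""
    items = json.loads(reply_text)
    if not isinstance(items, list) or len(items) != expected:
        raise ValueError(f"expected {expected} items, got {items!r}")
    return [str(item) for item in items]
```

The count check matters: if the model merges or drops a passage, every later result would otherwise be paired with the wrong source text.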

Step 5: Cache, downshift the model, or use the Batch API

Prompt caching: if every request shares a big system prompt or document, add a cache breakpoint. Cached reads don’t count toward ITPM on most models and are billed at ~10% of input price.
Lighter model: work that doesn’t need Opus → Sonnet 4.6 or Haiku 4.5 (Haiku has higher Tier 1 ITPM/OTPM).
Message Batches API: async, up to 24h to complete, has its own separate queue limits (doesn’t compete with live traffic), and is 50% cheaper. Ideal for bulk jobs with no latency requirement.
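The request shapes for the caching and batch options above look roughly like this. The sketch only builds the payloads and makes no network call; the helper names, the `job-N` ids, and the defaults are assumptions, with the model id taken from the example earlier in this article.

```python
def cached_request(shared_context: str, question: str,
                   model: str = "claude-sonnet-4-6") -> dict:
    """Keyword arguments for a Messages API call with a cacheable prefix."""
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": shared_context,
                # Cache breakpoint: the prefix up to here can be reused.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": question}],
    }

def batch_entries(prompts: list[str],
                  model: str = "claude-sonnet-4-6") -> list[dict]:
    """One Message Batches entry per prompt, keyed by a custom_id."""
    return [
        {
            "custom_id": f"job-{i}",
            "params": {
                "model": model,
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": p}],
            },
        }
        for i, p in enumerate(prompts)
    ]
```

With the Python SDK these would typically be passed as `client.messages.create(**cached_request(...))` and `client.messages.batches.create(requests=batch_entries(...))`; check the call names against the SDK version you have installed.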

Step 6: Upgrade tier or split keys

Tiers advance automatically as your cumulative credit purchases cross thresholds: Tier 1 at $5, Tier 2 at $40, Tier 3 at $200, Tier 4 at $400 (cumulative, as of June 2026). Each tier multiplies your RPM/ITPM/OTPM. For production, check the Limits page and add credit to advance, or contact sales for custom limits. Short term: give each service its own API key (or its own workspace with workspace limits) so one job can’t starve the others.

Shortest path to fix (Claude.ai / Claude Code plan cap)

This isn’t an API limit, so backoff code won’t help. Your options:

Wait for the reset. Run /usage (or /status) in Claude Code, or open Settings > Usage on Claude.ai, to see the exact per-limit reset time. The 5-hour window resets 5 hours after your first prompt; the weekly cap resets on a rolling 7-day basis.
Switch to a cheaper model. In Claude Code, dropping from Opus to Sonnet 4.6 stretches a weekly budget several times further.
Upgrade the plan (Pro → Max 5x → Max 20x) if you hit caps regularly.
Move heavy automation to the API with pay-as-you-go billing, which uses the RPM/ITPM/OTPM limits above instead of subscription caps.

How to confirm it’s fixed

Re-run the job and watch for zero 429s in the logs over a full minute at full concurrency.
Log anthropic-ratelimit-requests-remaining and anthropic-ratelimit-input-tokens-remaining after each call; if either stays above zero, you have headroom.
Track your 429 ratio over a day. Under ~1% with backoff is healthy; persistently higher means upgrade the tier or cut request volume.
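The ratio check in the last item can be computed from a plain list of logged status codes. A minimal sketch, with hypothetical helper names and the ~1% guideline as the default threshold:

```python
from collections import Counter

def rate_limit_ratio(status_codes) -> float:
    """Fraction of responses that were 429s (0.0 for an empty log)."""
    counts = Counter(status_codes)
    total = sum(counts.values())
    return counts[429] / total if total else 0.0

def is_healthy(status_codes, threshold: float = 0.01) -> bool:
    """True when the 429 share stays under the ~1% guideline."""
    return rate_limit_ratio(status_codes) < threshold
```

Feed it one entry per API response, including retried ones, so the ratio reflects what the API actually saw.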

Prevention

Use the SDK’s built-in retry (max_retries=N) instead of rolling your own — and add explicit 529 handling.
Before launching any script, estimate peak RPM and ITPM and leave ~20% headroom under your tier.
Don’t fire batch jobs on the hour (00:00, 05:00…); add jitter so you don’t pile onto global peaks.
Give backend, CI, and cron each their own API key (or workspace) for isolation.
For anything bulk with no latency need, default to the Batch API — it saves money and keeps live traffic clear.
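For scheduled jobs, one alternative to random jitter is a stable per-job offset derived from the job name, so each job always starts at the same off-peak second. This swaps in a hash-based technique rather than true randomness; `start_offset_seconds` and the 10-minute spread are assumptions.

```python
import hashlib

def start_offset_seconds(job_name: str, spread_s: int = 600) -> int:
    """Stable per-job delay in [0, spread_s) derived from the job name."""
    digest = hashlib.sha256(job_name.encode("utf-8")).digest()
    # Use the first 4 bytes as an integer and fold it into the spread.
    return int.from_bytes(digest[:4], "big") % spread_s
```

Sleep for `start_offset_seconds("nightly-translate")` at the top of the job, or bake the offset into the schedule itself. Different job names land on different offsets with high probability, which keeps them off the top of the hour and away from each other.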

FAQ

Is 429 the same as 529? No. 429 rate_limit_error means your account hit a limit — honor retry-after. 529 overloaded_error means Anthropic’s servers are overloaded across all users; it’s not about your usage. Both say “retry later,” but for 529 you should also check status.anthropic.com and consider failover.

Why do I get rate-limited when my per-minute average is well under the limit? Bursts. The API uses a token-bucket and enforces limits over short windows, so 50 requests fired in one second can trip a 50 RPM limit even though the minute-average is fine. Pace request starts to roughly RPM / 60 per second and keep only a few requests in flight.

Do output tokens count against the limit? Yes — there’s a separate OTPM (output tokens per minute) limit, distinct from input. On Tier 1 Sonnet it’s 8,000 OTPM. Large max_tokens doesn’t itself cost OTPM (only tokens actually generated count), so you can set a high max_tokens safely.

Does prompt caching help with rate limits? Yes. On most models cache_read_input_tokens don’t count toward ITPM, so caching a large shared system prompt or document can dramatically raise your effective input throughput without changing your tier.

I’m in Claude Code, not the API — why am I rate-limited? You’ve hit a subscription cap (5-hour rolling window and/or weekly cap), not an API rate limit. Run /usage to see the reset time, switch to Sonnet 4.6 to stretch the budget, or upgrade your plan. No code change fixes this.

Will retrying faster get me unblocked sooner? The opposite. Retrying before retry-after elapses just returns more 429s and can extend the block under acceleration limits. Wait out the header, then resume.

Tags: #Claude #Debug #Troubleshooting
