You run a CrewAI pipeline to refactor a 40-file codebase and it stops at file 22. The logs show BudgetExhaustedError: token limit 500000 reached or, in Claude Code, the usage-limit banner appears and the agent halts with half the tests still failing. The work already done may be partially correct, partially broken, or in a state that is worse than the starting point. Exhausting a budget mid-task is not just a cost problem — it is a correctness and recovery problem.
Common causes
1. Task scope was underestimated at planning time
The single biggest cause. You allocated a budget based on the happy path (no errors, no retries, model reads each file once). In reality, the agent re-reads files to verify changes, retries failed tool calls, and writes intermediate reasoning into its context — often 3-5x the naive estimate.
How to spot it: Compare the token count for a simple “read and summarize one file” run against the budget-per-file you allocated. If the ratio is under 3x, your budget math assumed a perfect run.
2. Long tool outputs inflate context without trimming
read_file returns a 1,200-line file verbatim. The agent keeps the full content in context across 15 subsequent tool calls instead of storing just the relevant excerpt. By file 10, the context is 80% stale file contents.
How to spot it: Print len(messages) and the total character count of the message history every 5 steps. If the size grows faster than O(steps) — i.e., it’s quadratic — tool outputs are accumulating uncontrolled.
3. Retry loops consume budget silently
A flaky external API causes 5 retries per call. Each retry carries the full conversation context. Twenty API calls with 5 retries each = 100 LLM calls instead of 20. The budget runs out at one-fifth of the expected progress.
How to spot it: Count distinct LLM API calls in your trace vs. the number of “logical steps” completed. A ratio above 2:1 indicates retry waste.
4. Sub-agents are not counted in the parent budget
In frameworks like AutoGen or LangGraph with nested agents, the parent agent’s budget tracker counts only its own calls. Sub-agent token usage accrues separately — or not at all — leaving the parent’s budget meter wrong by a large margin.
How to spot it: Sum token usage across all agent IDs in the trace. If the total is significantly higher than what the parent’s budget counter shows, sub-agent costs are off the books. See Cost Tracking Misses Sub-Agent Usage.
5. Model selected for the task is over-powered for most subtasks
The pipeline uses Claude Opus 4.7 or GPT-4o for every step including trivial ones like “extract the function name from this code.” At $15/MTok input, those cheap-looking subtasks add up fast compared to using a smaller model.
How to spot it: List the model used per step. Flag any step that doesn’t require reasoning — file reads, format conversions, substring extractions — that is using a frontier model.
6. No checkpointing means partial work is lost on exhaustion
The agent runs to exhaustion and the orchestrator has no checkpoint. Files 1-22 were edited in memory but the crash discards them. The user restarts from zero and burns the budget again.
How to spot it: Check whether your orchestration layer writes checkpoints (e.g., LangGraph’s SqliteSaver, Temporal’s durable execution). If restarts always go back to step 1, checkpointing is absent.
Shortest path to fix
Step 1: Audit actual vs. budgeted token spend per step
# LangChain / LangGraph callback
from langchain.callbacks import get_openai_callback
with get_openai_callback() as cb:
result = chain.invoke(input)
print(f"Total tokens: {cb.total_tokens}")
print(f"Total cost USD: {cb.total_cost:.4f}")
Run a single representative subtask with this instrumentation. Multiply by the number of subtasks to get a realistic budget estimate, then add a 3x safety margin.
Step 2: Set tiered budgets — hard stop + soft warning
SOFT_BUDGET_TOKENS = 400_000
HARD_BUDGET_TOKENS = 500_000
def check_budget(used: int, step: str):
if used >= HARD_BUDGET_TOKENS:
raise BudgetExhaustedError(f"Hard limit reached at step: {step}")
if used >= SOFT_BUDGET_TOKENS:
logger.warning("Soft budget hit at step %s — consider checkpointing", step)
The soft warning gives you time to checkpoint before the hard stop hits.
Step 3: Trim tool outputs before injecting into context
MAX_TOOL_OUTPUT_CHARS = 4_000
def trim_tool_output(output: str, max_chars: int = MAX_TOOL_OUTPUT_CHARS) -> str:
if len(output) <= max_chars:
return output
half = max_chars // 2
return output[:half] + "\n... [trimmed] ...\n" + output[-half:]
For file reads, use a search/grep approach instead of reading whole files:
# Instead of reading 1200 lines, extract the relevant function
grep -n "def authenticate" src/auth.py | head -5
sed -n '47,82p' src/auth.py
Step 4: Route cheap subtasks to smaller models
def pick_model(task_type: str) -> str:
cheap_tasks = {"extract", "format", "summarize_short", "classify"}
if task_type in cheap_tasks:
return "claude-haiku-3-5" # or gpt-4o-mini
return "claude-sonnet-4-6" # reasoning tasks only
This alone can cut costs by 60-80% on pipelines with mixed task complexity.
Step 5: Add checkpointing so partial work survives exhaustion
LangGraph with SQLite checkpointer:
from langgraph.checkpoint.sqlite import SqliteSaver
checkpointer = SqliteSaver.from_conn_string("checkpoints.db")
graph = graph.compile(checkpointer=checkpointer)
# Resume from the last checkpoint:
config = {"configurable": {"thread_id": "run-42"}}
result = graph.invoke(input, config=config)
Temporal handles this natively — every await workflow.execute_activity(...) is a durable checkpoint.
Step 6: Resume from checkpoint rather than restarting
After a budget exhaustion, identify the last completed step from the checkpoint, increase the budget, and resume:
# LangGraph: inspect last checkpoint state
python - <<'EOF'
from langgraph.checkpoint.sqlite import SqliteSaver
cp = SqliteSaver.from_conn_string("checkpoints.db")
state = cp.get({"configurable": {"thread_id": "run-42"}})
print(state["channel_values"].keys())
print("Last node:", state["metadata"]["writes"])
EOF
Prevention
- Benchmark a single representative subtask before setting any budget — use 3x the measured cost as minimum allocation.
- Trim all tool outputs to a maximum length before they enter the LLM context; store full outputs externally by reference.
- Enable checkpointing from day one — retrofitting it after a crash is painful.
- Route classification, extraction, and formatting subtasks to smaller, cheaper models.
- Implement a soft budget threshold at 80% of the hard limit and log a warning with current progress percentage.
- Track sub-agent token usage in a single aggregated counter — never trust the parent’s local counter alone.
- Cap retry attempts per tool call (e.g., 3 retries max) and fall back to a human-review queue rather than infinite retry.
- Build a “resume from step N” path into your pipeline before you need it — it is much harder to add after exhaustion.
FAQ
Q: How do I estimate the right budget for a new pipeline? A: Run the pipeline on a sample of 3-5 representative tasks with full instrumentation, take the 90th-percentile token count, and multiply by 2x. Never budget on the median — agents’ token usage has a heavy right tail.
Q: Can I pause an agent mid-run to add more budget? A: In LangGraph with a checkpointer, yes — the state is durably saved and you can re-invoke with a higher limit. In stateless frameworks, you need to implement your own pause/resume logic around checkpoints. Claude Code itself does not support mid-run budget increases.
Q: Does prompt caching help with budget exhaustion?
A: Yes, significantly. Anthropic’s prompt caching can reduce repeated-context costs by up to 90%. If your pipeline re-reads the same system prompt and large context blocks across many calls, enable caching with the cache_control parameter to cut input-token spend.
Q: What happens to half-edited files when the budget runs out?
A: Depends on the framework. If the agent edits files directly on disk (Claude Code, Cursor), the changes on disk are real and may leave code in a broken intermediate state. Run git diff to see what changed, and git stash or git checkout -- . if you need to revert before retrying.