Agent Budget Exhausted Halfway Through the Task

Your agent burns through its token or cost budget before finishing, leaving work incomplete. Here's how to diagnose spending and resume safely.

You run a CrewAI pipeline to refactor a 40-file codebase and it stops at file 22. The logs show BudgetExhaustedError: token limit 500000 reached or, in Claude Code, the usage-limit banner appears and the agent halts with half the tests still failing. The work already done may be partially correct, partially broken, or in a state that is worse than the starting point. Exhausting a budget mid-task is not just a cost problem — it is a correctness and recovery problem.

Common causes

1. Task scope was underestimated at planning time

The single biggest cause. You allocated a budget based on the happy path (no errors, no retries, model reads each file once). In reality, the agent re-reads files to verify changes, retries failed tool calls, and writes intermediate reasoning into its context — often 3-5x the naive estimate.

How to spot it: Compare the token count for a simple “read and summarize one file” run against the budget-per-file you allocated. If the ratio is under 3x, your budget math assumed a perfect run.

2. Long tool outputs inflate context without trimming

read_file returns a 1,200-line file verbatim. The agent keeps the full content in context across 15 subsequent tool calls instead of storing just the relevant excerpt. By file 10, the context is 80% stale file contents.

How to spot it: Print len(messages) and the total character count of the message history every 5 steps. If the size grows faster than O(steps) — i.e., it’s quadratic — tool outputs are accumulating uncontrolled.

3. Retry loops consume budget silently

A flaky external API causes 5 retries per call. Each retry carries the full conversation context. Twenty API calls with 5 retries each = 100 LLM calls instead of 20. The budget runs out at one-fifth of the expected progress.

How to spot it: Count distinct LLM API calls in your trace vs. the number of “logical steps” completed. A ratio above 2:1 indicates retry waste.

4. Sub-agents are not counted in the parent budget

In frameworks like AutoGen or LangGraph with nested agents, the parent agent’s budget tracker counts only its own calls. Sub-agent token usage accrues separately — or not at all — leaving the parent’s budget meter wrong by a large margin.

How to spot it: Sum token usage across all agent IDs in the trace. If the total is significantly higher than what the parent’s budget counter shows, sub-agent costs are off the books. See Cost Tracking Misses Sub-Agent Usage.

5. Model selected for the task is over-powered for most subtasks

The pipeline uses Claude Opus 4.7 or GPT-4o for every step including trivial ones like “extract the function name from this code.” At $15/MTok input, those cheap-looking subtasks add up fast compared to using a smaller model.

How to spot it: List the model used per step. Flag any step that doesn’t require reasoning — file reads, format conversions, substring extractions — that is using a frontier model.

6. No checkpointing means partial work is lost on exhaustion

The agent runs to exhaustion and the orchestrator has no checkpoint. Files 1-22 were edited in memory but the crash discards them. The user restarts from zero and burns the budget again.

How to spot it: Check whether your orchestration layer writes checkpoints (e.g., LangGraph’s SqliteSaver, Temporal’s durable execution). If restarts always go back to step 1, checkpointing is absent.

Shortest path to fix

Step 1: Audit actual vs. budgeted token spend per step

# LangChain / LangGraph callback
from langchain.callbacks import get_openai_callback

with get_openai_callback() as cb:
    result = chain.invoke(input)

print(f"Total tokens: {cb.total_tokens}")
print(f"Total cost USD: {cb.total_cost:.4f}")

Run a single representative subtask with this instrumentation. Multiply by the number of subtasks to get a realistic budget estimate, then add a 3x safety margin.

Step 2: Set tiered budgets — hard stop + soft warning

SOFT_BUDGET_TOKENS = 400_000
HARD_BUDGET_TOKENS = 500_000

def check_budget(used: int, step: str):
    if used >= HARD_BUDGET_TOKENS:
        raise BudgetExhaustedError(f"Hard limit reached at step: {step}")
    if used >= SOFT_BUDGET_TOKENS:
        logger.warning("Soft budget hit at step %s — consider checkpointing", step)

The soft warning gives you time to checkpoint before the hard stop hits.

Step 3: Trim tool outputs before injecting into context

MAX_TOOL_OUTPUT_CHARS = 4_000

def trim_tool_output(output: str, max_chars: int = MAX_TOOL_OUTPUT_CHARS) -> str:
    if len(output) <= max_chars:
        return output
    half = max_chars // 2
    return output[:half] + "\n... [trimmed] ...\n" + output[-half:]

For file reads, use a search/grep approach instead of reading whole files:

# Instead of reading 1200 lines, extract the relevant function
grep -n "def authenticate" src/auth.py | head -5
sed -n '47,82p' src/auth.py

Step 4: Route cheap subtasks to smaller models

def pick_model(task_type: str) -> str:
    cheap_tasks = {"extract", "format", "summarize_short", "classify"}
    if task_type in cheap_tasks:
        return "claude-haiku-3-5"   # or gpt-4o-mini
    return "claude-sonnet-4-6"      # reasoning tasks only

This alone can cut costs by 60-80% on pipelines with mixed task complexity.

Step 5: Add checkpointing so partial work survives exhaustion

LangGraph with SQLite checkpointer:

from langgraph.checkpoint.sqlite import SqliteSaver

checkpointer = SqliteSaver.from_conn_string("checkpoints.db")
graph = graph.compile(checkpointer=checkpointer)

# Resume from the last checkpoint:
config = {"configurable": {"thread_id": "run-42"}}
result = graph.invoke(input, config=config)

Temporal handles this natively — every await workflow.execute_activity(...) is a durable checkpoint.

Step 6: Resume from checkpoint rather than restarting

After a budget exhaustion, identify the last completed step from the checkpoint, increase the budget, and resume:

# LangGraph: inspect last checkpoint state
python - <<'EOF'
from langgraph.checkpoint.sqlite import SqliteSaver
cp = SqliteSaver.from_conn_string("checkpoints.db")
state = cp.get({"configurable": {"thread_id": "run-42"}})
print(state["channel_values"].keys())
print("Last node:", state["metadata"]["writes"])
EOF

Prevention

  • Benchmark a single representative subtask before setting any budget — use 3x the measured cost as minimum allocation.
  • Trim all tool outputs to a maximum length before they enter the LLM context; store full outputs externally by reference.
  • Enable checkpointing from day one — retrofitting it after a crash is painful.
  • Route classification, extraction, and formatting subtasks to smaller, cheaper models.
  • Implement a soft budget threshold at 80% of the hard limit and log a warning with current progress percentage.
  • Track sub-agent token usage in a single aggregated counter — never trust the parent’s local counter alone.
  • Cap retry attempts per tool call (e.g., 3 retries max) and fall back to a human-review queue rather than infinite retry.
  • Build a “resume from step N” path into your pipeline before you need it — it is much harder to add after exhaustion.

FAQ

Q: How do I estimate the right budget for a new pipeline? A: Run the pipeline on a sample of 3-5 representative tasks with full instrumentation, take the 90th-percentile token count, and multiply by 2x. Never budget on the median — agents’ token usage has a heavy right tail.

Q: Can I pause an agent mid-run to add more budget? A: In LangGraph with a checkpointer, yes — the state is durably saved and you can re-invoke with a higher limit. In stateless frameworks, you need to implement your own pause/resume logic around checkpoints. Claude Code itself does not support mid-run budget increases.

Q: Does prompt caching help with budget exhaustion? A: Yes, significantly. Anthropic’s prompt caching can reduce repeated-context costs by up to 90%. If your pipeline re-reads the same system prompt and large context blocks across many calls, enable caching with the cache_control parameter to cut input-token spend.

Q: What happens to half-edited files when the budget runs out? A: Depends on the framework. If the agent edits files directly on disk (Claude Code, Cursor), the changes on disk are real and may leave code in a broken intermediate state. Run git diff to see what changed, and git stash or git checkout -- . if you need to revert before retrying.

Tags: #AI coding #Agents #Troubleshooting