Agent Budget Exhausted Halfway Through the Task

Q: Why did my LangChain cost callback report zero tokens?

The legacy `get_openai_callback` is OpenAI-only and returns zero when responses are streamed; its home package `langchain-community` was archived on May 26, 2026. Switch to `UsageMetadataCallbackHandler` from `langchain-core`, which works across providers and with streaming.

Your agent burns its token or cost budget before finishing and leaves work half-done. Diagnose where the spend went, recover the partial work, and resume from a checkpoint.

Published: May 25, 2026 Updated: Jun 17, 2026 Author: AI Productivity Guide Team 🌐 查看中文版本

You run a CrewAI pipeline to refactor a 40-file codebase and it stops at file 22. The log shows BudgetExhaustedError: token limit 500000 reached, or in Claude Code the You've reached your usage limit banner appears and the agent halts with half the tests still failing. The work already done may be partially correct, partially broken, or in a state worse than where you started. Exhausting a budget mid-task is not just a cost problem. It is a correctness and recovery problem.

Fastest path: First recover the partial work — run git diff to see what the agent changed on disk, and git stash the half-edited files so you can restart cleanly without re-paying for completed steps. Then find where the budget actually went using the diagnosis table below (usually it is one of: under-budgeting, untrimmed tool outputs, retry loops, or off-the-books sub-agents). Fix that one cause, add a SQLite or Temporal checkpoint so the next run is resumable, and re-invoke from the last checkpoint instead of from step 1.

Which bucket are you in

Match your symptom to the most likely cause before you change any code.

Symptom in the trace	Most likely cause	Jump to
Total spend is roughly 3-5x your per-file estimate, evenly across steps	Scope under-estimated at planning time	Cause 1
Message-history size grows faster than the number of steps	Untrimmed tool outputs piling up	Cause 2
LLM API calls greatly outnumber logical steps completed	Silent retry loops	Cause 3
Parent budget meter reads low but the bill is high	Sub-agent usage off the books	Cause 4
Trivial steps (extract, format) are billed at a frontier-model rate	Over-powered model on cheap subtasks	Cause 5
Restart always goes back to step 1	No checkpointing	Cause 6

Common causes

1. Task scope was underestimated at planning time

The single biggest cause. You allocated a budget based on the happy path: no errors, no retries, model reads each file once. In reality the agent re-reads files to verify changes, retries failed tool calls, and writes intermediate reasoning into its context, often 3-5x the naive estimate.

How to spot it: Compare the token count for a simple “read and summarize one file” run against the budget-per-file you allocated. If the ratio is under 3x, your budget math assumed a perfect run.

2. Long tool outputs inflate context without trimming

read_file returns a 1,200-line file verbatim. The agent keeps the full content in context across 15 subsequent tool calls instead of storing just the relevant excerpt. By file 10, the context is 80% stale file contents, and every call pays for all of it again as input tokens.

How to spot it: Print len(messages) and the total character count of the message history every 5 steps. If the size grows faster than O(steps) (that is, quadratically), tool outputs are accumulating uncontrolled.

3. Retry loops consume budget silently

A flaky external API causes 5 retries per call. Each retry carries the full conversation context. Twenty API calls with 5 retries each is 100 LLM calls instead of 20, so the budget runs out at one-fifth of the expected progress.

How to spot it: Count distinct LLM API calls in your trace against the number of logical steps completed. A ratio above 2:1 indicates retry waste.

4. Sub-agents are not counted in the parent budget

In frameworks like AutoGen or LangGraph with nested agents, the parent agent’s budget tracker counts only its own calls. Sub-agent token usage accrues separately, or not at all, leaving the parent’s budget meter wrong by a large margin.

How to spot it: Sum token usage across all agent IDs in the trace. If the total is meaningfully higher than what the parent’s budget counter shows, sub-agent costs are off the books. See Cost Tracking Misses Sub-Agent Usage.

5. Model selected for the task is over-powered for most subtasks

The pipeline uses Claude Opus 4.7 or GPT-5.5 for every step, including trivial ones like “extract the function name from this code.” As of June 2026 Opus 4.7 bills at $5/MTok input and $25/MTok output, versus Sonnet 4.6 at $3 / $15. Those cheap-looking subtasks add up fast when each one is paying a frontier rate.

How to spot it: List the model used per step. Flag any step that does not require reasoning (file reads, format conversions, substring extractions) that is using a top-tier model.

6. No checkpointing means partial work is lost on exhaustion

The agent runs to exhaustion and the orchestrator has no checkpoint. Files 1-22 were edited in memory but the crash discards them. The user restarts from zero and burns the budget again.

How to spot it: Check whether your orchestration layer writes checkpoints (for example LangGraph’s SqliteSaver, or Temporal’s durable execution). If restarts always go back to step 1, checkpointing is absent.

Shortest path to fix

Step 1: Audit actual vs. budgeted token spend per step

Note: the older get_openai_callback helper lives in langchain-community, which was archived (read-only) on May 26, 2026, is OpenAI-only, and silently reports zero tokens when streaming. Use the provider-agnostic UsageMetadataCallbackHandler from langchain-core instead:

from langchain_core.callbacks import UsageMetadataCallbackHandler

cb = UsageMetadataCallbackHandler()
result = chain.invoke(input, config={"callbacks": [cb]})

# cb.usage_metadata is keyed by model name
for model, usage in cb.usage_metadata.items():
    print(model, usage["input_tokens"], usage["output_tokens"], usage["total_tokens"])

Run a single representative subtask with this instrumentation. Multiply by the number of subtasks to get a realistic budget estimate, then add a 3x safety margin.

Step 2: Set tiered budgets, a hard stop plus a soft warning

SOFT_BUDGET_TOKENS = 400_000
HARD_BUDGET_TOKENS = 500_000

def check_budget(used: int, step: str):
    if used >= HARD_BUDGET_TOKENS:
        raise BudgetExhaustedError(f"Hard limit reached at step: {step}")
    if used >= SOFT_BUDGET_TOKENS:
        logger.warning("Soft budget hit at step %s, consider checkpointing", step)

The soft warning gives you time to checkpoint before the hard stop hits.

Step 3: Trim tool outputs before injecting into context

MAX_TOOL_OUTPUT_CHARS = 4_000

def trim_tool_output(output: str, max_chars: int = MAX_TOOL_OUTPUT_CHARS) -> str:
    if len(output) <= max_chars:
        return output
    half = max_chars // 2
    return output[:half] + "\n... [trimmed] ...\n" + output[-half:]

For file reads, use a search or grep approach instead of reading whole files:

# Instead of reading 1200 lines, extract the relevant function
grep -n "def authenticate" src/auth.py | head -5
sed -n '47,82p' src/auth.py

Step 4: Route cheap subtasks to a cheaper model

Claude Code runs Anthropic models only, so within that ecosystem the lever is Opus 4.7 (reasoning) vs Sonnet 4.6 (everything else). In a general pipeline you can also fall back to a small/fast tier such as GPT-5.5 Instant or Gemini 3.1 Pro for extraction-grade work.

def pick_model(task_type: str) -> str:
    cheap_tasks = {"extract", "format", "summarize_short", "classify"}
    if task_type in cheap_tasks:
        return "claude-sonnet-4-6"   # workhorse, $3/$15 per MTok
    return "claude-opus-4-7"         # reasoning tasks only, $5/$25 per MTok

Keeping frontier-model calls to the steps that truly need reasoning typically cuts spend by 60-80% on pipelines with mixed task complexity.

Step 5: Add checkpointing so partial work survives exhaustion

In current LangGraph (the langgraph-checkpoint-sqlite package, refreshed May 12, 2026), SqliteSaver.from_conn_string is a context manager, so it must be used inside a with block:

from langgraph.checkpoint.sqlite import SqliteSaver

with SqliteSaver.from_conn_string("checkpoints.sqlite") as checkpointer:
    graph = workflow.compile(checkpointer=checkpointer)
    config = {"configurable": {"thread_id": "run-42"}}
    result = graph.invoke(input, config=config)

Temporal handles this natively: every await workflow.execute_activity(...) is a durable checkpoint, so a budget abort never loses completed activities.

Step 6: Resume from checkpoint rather than restarting

After a budget exhaustion, inspect the last saved state with graph.get_state(config), which returns a StateSnapshot. Its .values holds the channel state, .next lists the nodes that were about to run, and .metadata records the last writes. Raise the budget, then re-invoke with the same thread_id:

from langgraph.checkpoint.sqlite import SqliteSaver

with SqliteSaver.from_conn_string("checkpoints.sqlite") as checkpointer:
    graph = workflow.compile(checkpointer=checkpointer)
    config = {"configurable": {"thread_id": "run-42"}}

    snap = graph.get_state(config)
    print("Saved channels:", list(snap.values.keys()))
    print("Next nodes to run:", snap.next)

    # Resume: passing None re-enters at the pending node, not step 1
    result = graph.invoke(None, config=config)

For Claude Code, there is no mid-run budget top-up. When the You've reached your usage limit banner shows, the session goes read-only until the window resets. As of June 2026 Claude Code enforces two overlapping caps: a 5-hour rolling window (doubled for Pro, Max, Team, and seat-based Enterprise on May 6, 2026) and a weekly cap on active compute. Recover partial work with git diff / git stash, then continue after the reset or on a higher plan (Max 5x at $100, Max 20x at $200).

How to confirm it’s fixed

Re-run the same task with instrumentation. The UsageMetadataCallbackHandler total should land within your 3x-margin estimate, not blow past it.
Print message-history size every 5 steps. It should grow roughly linearly, not quadratically (confirms tool-output trimming works).
Force a failure mid-run (kill the process), then re-invoke with the same thread_id. The job must continue from the pending node, not from step 1.
Check the per-step model log. No extract / format / classify step should be hitting a frontier model.

Prevention

Benchmark a single representative subtask before setting any budget, and use 3x the measured cost as the minimum allocation.
Trim all tool outputs to a maximum length before they enter the LLM context; store full outputs externally by reference.
Enable checkpointing from day one. Retrofitting it after a crash is painful.
Route classification, extraction, and formatting subtasks to the cheaper model tier.
Implement a soft budget threshold at 80% of the hard limit and log a warning with the current progress percentage.
Track sub-agent token usage in a single aggregated counter; never trust the parent’s local counter alone.
Cap retries per tool call. In CrewAI, max_iter defaults to 25 and is the main cost driver, so set it to 5-8 per agent and pair it with max_rpm; in AutoGen, bound max_round / max_consecutive_auto_reply. Fall back to a human-review queue rather than infinite retry.
Enable Anthropic prompt caching on the stable parts of the prompt (system prompt, tool definitions, large shared context). Mark up to 4 breakpoints with cache_control: {"type": "ephemeral"} (default 5-minute TTL, optional "ttl": "1h"); cache reads are far cheaper than re-sending the prefix, though the first write costs about 25% more than a normal input token.
Build a “resume from step N” path into your pipeline before you need it. It is much harder to add after exhaustion.

FAQ

Q: How do I estimate the right budget for a new pipeline? A: Run the pipeline on a sample of 3-5 representative tasks with full instrumentation, take the 90th-percentile token count, and multiply by 2x. Never budget on the median. Agent token usage has a heavy right tail, and conversational steps grow roughly O(n^2) as history accumulates, not O(n).

Q: Can I pause an agent mid-run to add more budget? A: In LangGraph with a checkpointer, yes. The state is durably saved, so you re-invoke the same thread_id with a higher limit and pass None as the input to resume at the pending node. In stateless frameworks you must implement your own pause/resume around checkpoints. Claude Code itself does not support a mid-run budget increase; you wait for the window to reset or move to a higher plan.

Q: Does prompt caching help with budget exhaustion? A: Yes, significantly, but it solves a different problem than checkpointing. Caching cuts the cost of re-sending the same prefix (system prompt, tool defs, shared context) on every call by roughly 70-90% in real workflows. It does not recover execution state, so if a run dies at step 30 without a checkpoint you still restart from step 1, just at a lower per-step cost. Enable both.

Q: What happens to half-edited files when the budget runs out? A: It depends on the framework. If the agent edits files directly on disk (Claude Code, Cursor), the changes on disk are real and may leave code in a broken intermediate state. Run git diff to see what changed, then git stash or git checkout -- . to revert before retrying. Orchestrators that only edit in memory lose everything not yet checkpointed.

Q: Why did my LangChain cost callback report zero tokens? A: The legacy get_openai_callback is OpenAI-only and returns zero when responses are streamed; its home package langchain-community was archived on May 26, 2026. Switch to UsageMetadataCallbackHandler from langchain-core, which works across providers and with streaming.

Tags: #AI coding #Agents #Troubleshooting