Cost Tracking Misses Sub-Agent Usage

Your pipeline reports $2 but the invoice says $18. Sub-agent token usage isn't attributed to the parent run. Here's how to close the gap, verified June 2026.

Published: May 25, 2026 Updated: Jun 17, 2026 Author: AI Productivity Guide Team

Your LangGraph pipeline finishes and LangSmith reports a total cost of $2.10 for the run. At the end of the month the Anthropic invoice is $380, roughly 20x the spend you projected from those per-run totals. The gap is almost always sub-agent calls: your orchestrator spawns a research sub-agent, which spawns a web-search sub-agent, which makes its own LLM calls. Each level may use a different API key, or make calls that aren’t attributed back to the root run. Your cost tracker only sees the calls the orchestrator makes directly. The sub-agent tree is invisible, and so is its cost.

Fastest fix: read token counts from the API response’s usage field (never estimate them), include output and cache tokens in your formula, and propagate one root run ID through every sub-agent so all calls aggregate under it. If you want this without writing plumbing, route every agent through a single LiteLLM proxy and read spend_logs. Details below.

Which bucket are you in?

Symptom	Likely cause	Jump to
Invoice is far higher than tracker; you have more than one API key	Untracked keys	Cause 1
Sub-agents run as separate services/containers	No shared callback	Cause 2
LangSmith trace tree shows fewer nodes than agents	Broken parent linkage	Cause 3
Tracker undercounts by a fixed ~80% (output-heavy runs)	Output/cache tokens ignored	Cause 4
Undercount of ~20-40%, worse for non-English or tool-heavy calls	Client-side token estimates	Cause 5
Streaming calls contribute zero tokens	Streaming usage skipped	Cause 6

Common causes

1. Sub-agents use separate API keys not linked to the parent

The orchestrator has API key sk-orch. When it spawns a sub-agent, the sub-agent uses a different key sk-subagent loaded from its own environment. The cost-tracking callback is attached to the sk-orch calls only. Sub-agent costs accrue to sk-subagent and are never aggregated into the parent run’s report.

How to spot it: list every API key used anywhere in the pipeline (grep for ANTHROPIC_API_KEY, OPENAI_API_KEY, OPENROUTER_API_KEY, and any per-service secrets). If there is more than one, check whether each key’s usage rolls up into a single cost tracker. Any key not tracked equals invisible cost.
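The grep can be scripted so the inventory is repeatable. The following is a minimal sketch, assuming all services live under one directory tree; `find_key_references` and the `KEY_NAMES` tuple are illustrative names, and the tuple would need your own per-service secret names added.

```python
import re
from collections import defaultdict
from pathlib import Path

# Env var names to look for; extend with your own per-service secrets (assumption).
KEY_NAMES = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "OPENROUTER_API_KEY")
_PATTERN = re.compile("|".join(re.escape(name) for name in KEY_NAMES))


def find_key_references(root: str) -> dict[str, set[str]]:
    """Map each key name to the set of files under `root` that mention it."""
    hits: dict[str, set[str]] = defaultdict(set)
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip it rather than abort the scan
        for match in _PATTERN.finditer(text):
            hits[match.group(0)].add(str(path))
    return dict(hits)
```

This only finds where key names are referenced, not whether two services load different key values, so any name that shows up in more than one service still needs a manual check against the tracker.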

2. Sub-agent is a separate process or service with no shared callback

The orchestrator calls the sub-agent over HTTP (POST /run-agent). The sub-agent is a separate microservice that makes its own LLM calls. The orchestrator’s LangSmith or OpenAI callback only tracks calls made inside the orchestrator’s process. The sub-agent service has no callback, or reports to a separate project.

How to spot it: check whether any sub-agents run in separate processes, Docker containers, or services. Any sub-agent not in the same process as the cost tracker, and not propagating trace context across the boundary, is invisible.

3. LangSmith trace hierarchy is broken: sub-runs not parented to the root

LangSmith uses run_id and parent_run_id to build a trace tree. The trace tree aggregates token usage and cost for the whole trace, with per-run breakdowns rolled up to each parent. If the sub-agent is invoked without passing the parent run ID, it creates a root-level trace with no parent, and its cost is never folded into the main run’s total.

How to spot it: in LangSmith, open the root run and expand the trace tree. If you see fewer nodes than you have agents, some sub-runs are not parented. Also look for orphaned root-level runs with the same session or thread ID as the main run.
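The orphan check can also be run over an export of runs. This is a rough sketch, assuming the runs have been exported as plain dicts with the keys `id`, `parent_run_id`, and `thread_id`; those field names describe a hypothetical export, not LangSmith's own schema, and `find_orphaned_roots` is an illustrative helper.

```python
def find_orphaned_roots(runs: list[dict], root_run_id: str) -> list[dict]:
    """Return root-level runs that share the main run's thread but are not the main run."""
    by_id = {run["id"]: run for run in runs}
    main = by_id[root_run_id]
    return [
        run
        for run in runs
        if run["parent_run_id"] is None           # root-level: no parent recorded
        and run["id"] != root_run_id              # not the main run itself
        and run["thread_id"] == main["thread_id"]  # same session/thread as the main run
    ]
```

Every run this returns is a candidate sub-agent trace whose cost is missing from the main run's total.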

4. Aggregation counts only input tokens, ignoring output and cache tokens

Your cost formula is cost = input_tokens * price_per_input_token. It omits output token cost, cached-input write cost, and image tokens. On Claude Sonnet 4.6 output is 5x the input rate ($15 vs $3 per MTok as of June 2026); on Opus 4.7 it is also 5x ($25 vs $5). Output-heavy agent runs can be 80% output cost, so an input-only formula reports roughly a fifth of the real number.

How to spot it: compare your formula against the provider’s pricing page. If it does not include output tokens, cache-write tokens, and tool-use tokens separately, it is underestimating.
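The size of the gap is easy to check by hand. The sketch below uses the Sonnet 4.6 rates quoted above and a hypothetical output-heavy run of 100k input and 80k output tokens; the token counts are made up for illustration.

```python
# Sonnet 4.6 rates quoted above, USD per million tokens.
INPUT_RATE = 3.0
OUTPUT_RATE = 15.0


def input_only_cost(input_tokens: int) -> float:
    """The broken formula: prices input tokens and nothing else."""
    return input_tokens / 1_000_000 * INPUT_RATE


def input_plus_output_cost(input_tokens: int, output_tokens: int) -> float:
    """Adds the output term the broken formula leaves out."""
    return (
        input_tokens / 1_000_000 * INPUT_RATE
        + output_tokens / 1_000_000 * OUTPUT_RATE
    )


# Hypothetical run: 100k input tokens, 80k output tokens.
reported = input_only_cost(100_000)                # about $0.30
actual = input_plus_output_cost(100_000, 80_000)   # about $0.30 + $1.20 = $1.50
reported_share = reported / actual                 # about 0.20, i.e. a fifth
```

Here output is 80% of the real cost, so the input-only tracker shows about a fifth of it. Cache tokens are left out of this example to keep the arithmetic short.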

5. Token counts are estimated, not read from API responses

The cost tracker counts tokens by calling tiktoken.encode(prompt) or a similar client-side estimator. The provider bills using its own tokenizer, which counts differently. Claude’s tokenizer tends to count more tokens than tiktoken for the same text, and the gap is widest on code, function-call payloads, and non-English text. For those inputs the provider’s count is routinely 20-40% higher than a client-side guess.

How to spot it: compare your estimated token counts against the usage field in the API response for the same call. If the provider’s usage.input_tokens differs from your estimate by more than 10%, switch to the API-reported count everywhere.
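The comparison is one line of arithmetic per call. A minimal sketch, assuming you log both your own estimate and the provider's `usage.input_tokens` for the same request; the function names are illustrative.

```python
def estimate_drift_pct(estimated_tokens: int, reported_tokens: int) -> float:
    """Percent by which a client-side estimate differs from the provider's count."""
    if reported_tokens <= 0:
        raise ValueError("reported_tokens must be positive")
    return abs(estimated_tokens - reported_tokens) / reported_tokens * 100


def should_switch_to_reported(
    estimated_tokens: int, reported_tokens: int, threshold_pct: float = 10.0
) -> bool:
    """True when the estimate is off by more than the threshold (10% by default)."""
    return estimate_drift_pct(estimated_tokens, reported_tokens) > threshold_pct
```

For example, an estimate of 800 tokens against a reported 1,000 is a 20% drift, well past the 10% line.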

6. Streaming responses drop usage data

When using streaming, usage handling differs by provider. OpenAI does not return usage in a streamed response unless you pass stream_options={"include_usage": True}; if you forget it, those calls contribute zero tokens. Anthropic does emit usage during streaming (input tokens arrive in the message_start event, output tokens accumulate, and stream.get_final_message().usage holds the totals), but only if your code actually reads the final message instead of discarding the stream after the last text chunk.

How to spot it: log whether usage is present for every streaming response. Missing or zeroed usage on streaming calls produces systematic undercounting.

Shortest path to fix

Step 1: Build a centralized cost accumulator keyed by run ID

Track all four token types, not just input. The pricing table below is current as of June 2026 (cache write = 1.25x base input for the 5-minute TTL, cache read = 0.1x base input).

from collections import defaultdict
from dataclasses import dataclass
from threading import Lock

# Per-MTok USD rates, June 2026. cache_write is the 5-minute TTL rate (1.25x input).
MODEL_PRICING = {
    "claude-sonnet-4-6": {"input": 3.0, "output": 15.0, "cache_write": 3.75, "cache_read": 0.30},
    "claude-opus-4-7":   {"input": 5.0, "output": 25.0, "cache_write": 6.25, "cache_read": 0.50},
}

@dataclass
class RunCost:
    input_tokens: int = 0
    output_tokens: int = 0
    cache_write_tokens: int = 0
    cache_read_tokens: int = 0

    def total_cost_usd(self, model: str) -> float:
        p = MODEL_PRICING[model]
        return (
            self.input_tokens / 1_000_000 * p["input"]
            + self.output_tokens / 1_000_000 * p["output"]
            + self.cache_write_tokens / 1_000_000 * p.get("cache_write", 0)
            + self.cache_read_tokens / 1_000_000 * p.get("cache_read", 0)
        )

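# Keyed by (run_id, model) so a run that mixes models is priced per model.
_costs: dict[tuple[str, str], RunCost] = defaultdict(RunCost)
_lock = Lock()

def record_usage(run_id: str, usage: dict, model: str) -> None:
    with _lock:
        c = _costs[(run_id, model)]
        # Cache fields can come back as None, so coerce missing or None values to 0.
        c.input_tokens += usage.get("input_tokens") or 0
        c.output_tokens += usage.get("output_tokens") or 0
        c.cache_write_tokens += usage.get("cache_creation_input_tokens") or 0
        c.cache_read_tokens += usage.get("cache_read_input_tokens") or 0

def run_total_usd(run_id: str) -> float:
    # Sum the cost of every model used under this run ID.
    with _lock:
        return sum(c.total_cost_usd(m) for (rid, m), c in _costs.items() if rid == run_id)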

Step 2: Pass a root run ID to sub-agents and propagate it through every call

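import uuid

import httpx  # assumption: any HTTP client with a requests-style .post() works here

# In the orchestrator
root_run_id = str(uuid.uuid4())
sub_agent_client = httpx.Client(base_url="http://sub-agent:8000")  # hypothetical service URL

def invoke_sub_agent(task: str, parent_run_id: str) -> str:
    resp = sub_agent_client.post(
        "/run",
        json={"task": task},
        headers={"X-Run-Id": parent_run_id},  # propagate the ID across the boundary
    )
    resp.raise_for_status()  # fail loudly instead of parsing an error body
    return resp.json()["result"]

# In the sub-agent service (assumption: FastAPI; execute() is your own agent entry point)
from fastapi import FastAPI, Request
from pydantic import BaseModel

app = FastAPI()

class AgentRequest(BaseModel):
    task: str

@app.post("/run")
def run_agent(request: Request, body: AgentRequest):
    # Reuse the caller's run ID; a fresh one is minted only for direct, unparented calls
    run_id = request.headers.get("X-Run-Id", str(uuid.uuid4()))
    # Every LLM call in this service records against run_id
    result = execute(body.task, run_id=run_id)
    return {"result": result}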

For LangSmith specifically, pass parent_run_id (or use the tracing context propagation headers / OpenTelemetry) so distributed sub-runs attach to the root trace and their cost rolls up automatically.
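Inside the sub-agent service, the run ID read from the header still has to reach every LLM call without being threaded through each function signature. One option is a context variable. This is a minimal sketch using the standard library's `contextvars`; `bind_run_id` and `get_run_id` are hypothetical helper names, not part of any SDK.

```python
import contextvars
import uuid
from typing import Optional

# Holds the run ID for the current request; each request context sees its own value.
_run_id: contextvars.ContextVar[Optional[str]] = contextvars.ContextVar(
    "run_id", default=None
)


def bind_run_id(header_value: Optional[str]) -> str:
    """Call once at the request boundary with the X-Run-Id header value (or None)."""
    run_id = header_value or str(uuid.uuid4())
    _run_id.set(run_id)
    return run_id


def get_run_id() -> str:
    """Call wherever usage is recorded; raises if no ID was bound for this context."""
    run_id = _run_id.get()
    if run_id is None:
        raise RuntimeError("no run ID bound; call bind_run_id() at the request boundary")
    return run_id
```

Raising when nothing is bound is a deliberate choice in this sketch: a silent fallback ID would create exactly the kind of orphaned cost record this article is about.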

Step 3: Always read usage from the API response, never client-side estimates

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Anthropic
response = client.messages.create(
    model="claude-sonnet-4-6",
    messages=messages,
    max_tokens=1024,
)
usage = {
    "input_tokens": response.usage.input_tokens,
    "output_tokens": response.usage.output_tokens,
    # Cache fields can be absent or None when caching is not in play, so coerce to 0
    "cache_creation_input_tokens": getattr(response.usage, "cache_creation_input_tokens", None) or 0,
    "cache_read_input_tokens": getattr(response.usage, "cache_read_input_tokens", None) or 0,
}
record_usage(current_run_id, usage, model="claude-sonnet-4-6")

Step 4: Capture usage from streaming responses

# Anthropic: usage is on the final accumulated message
with client.messages.stream(
    model="claude-sonnet-4-6",
    messages=messages,
    max_tokens=1024,
) as stream:
    for _ in stream.text_stream:
        pass  # render tokens as they arrive

    final = stream.get_final_message()  # blocks until the stream completes
    record_usage(run_id, {
        "input_tokens": final.usage.input_tokens,
        "output_tokens": final.usage.output_tokens,
        # Cache fields can be absent or None when caching is not in play, so coerce to 0
        "cache_creation_input_tokens": getattr(final.usage, "cache_creation_input_tokens", None) or 0,
        "cache_read_input_tokens": getattr(final.usage, "cache_read_input_tokens", None) or 0,
    }, model="claude-sonnet-4-6")

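On OpenAI, the equivalent is to add stream_options={"include_usage": True}; the usage object then arrives in the final chunk. That chunk has an empty choices list, so code that indexes choices[0] on every chunk must guard against it.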
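The usage-bearing chunk is easy to miss because it carries no text. The sketch below assumes chunk objects shaped like the OpenAI Python SDK's streamed chat-completion chunks (`choices`, `delta.content`, `usage.prompt_tokens`, `usage.completion_tokens`); `drain_stream` is a hypothetical helper, and it renames the counts to the `input_tokens` / `output_tokens` keys used by the accumulator in Step 1.

```python
import logging
from typing import Iterable, Optional

logger = logging.getLogger(__name__)


def drain_stream(chunks: Iterable) -> tuple[str, Optional[dict]]:
    """Consume a streamed chat completion; return (text, usage dict or None)."""
    parts: list[str] = []
    usage: Optional[dict] = None
    for chunk in chunks:
        # The usage-bearing chunk has an empty choices list, so guard the index.
        if chunk.choices:
            delta = chunk.choices[0].delta.content
            if delta:
                parts.append(delta)
        chunk_usage = getattr(chunk, "usage", None)
        if chunk_usage is not None:
            usage = {
                "input_tokens": chunk_usage.prompt_tokens,
                "output_tokens": chunk_usage.completion_tokens,
            }
    if usage is None:
        # Surfaces the Cause 6 failure mode instead of silently recording zero.
        logger.warning("stream ended without usage; was include_usage set?")
    return "".join(parts), usage
```

Returning `None` rather than zeros when usage never arrives keeps a misconfigured call visible in logs and in the reconciliation step.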

Step 5: Reconcile internal tracking against provider invoices every billing period

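import logging
logger = logging.getLogger(__name__)

def reconcile_costs(internal_usd: float, invoice_usd: float) -> None:
    # Guard against an empty or missing invoice before dividing by it
    if invoice_usd <= 0:
        logger.warning("No invoice total to reconcile against (invoice_usd=%.2f)", invoice_usd)
        return
    discrepancy_pct = abs(internal_usd - invoice_usd) / invoice_usd * 100
    logger.info(
        "Cost reconciliation: internal=%.2f invoice=%.2f discrepancy=%.1f%%",
        internal_usd, invoice_usd, discrepancy_pct,
    )
    if discrepancy_pct > 10:
        # alert() is your own paging or notification hook
        alert(f"Cost tracking discrepancy {discrepancy_pct:.1f}%: investigate sub-agent attribution")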

Run this once per billing period. A discrepancy above 10% means you have untracked call paths.

Optional: skip the plumbing with a LiteLLM proxy

If you do not want to wire callbacks into every service, point all agents at a single LiteLLM proxy. It records cost for every call regardless of which service made it, and you query spend_logs grouped by tag.

# litellm_config.yaml
model_list:
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

general_settings:
  master_key: sk-your-proxy-key
  database_url: postgresql://...   # stores every call's spend record

Send metadata.tags (and a per-agent trace ID) on each request, then break spend down by tag. LiteLLM also exposes max_budget_per_session and per-tag budgets so a runaway sub-agent trips a limit instead of a surprise invoice.

How to confirm it’s fixed

Run one full pipeline, including every sub-agent, and note your tracker’s total for that run ID.
Wait for the calls to appear in the provider console (Anthropic Console → Usage, or OpenAI → Usage) filtered to the same time window.
The two numbers should match within ~5%. If the tracker is still low, the missing slice points back to a specific cause: an untracked key (Cause 1), a sub-agent service with no propagation (Cause 2), an orphaned LangSmith trace (Cause 3), input-only math (Cause 4), client-side token estimates (Cause 5), or dropped streaming usage (Cause 6).
Confirm the trace tree node count equals your agent count, and that every node shows non-zero output tokens.

Prevention

Use a single API key for the whole pipeline and sub-agent tree; if isolation is required, aggregate all keys into one cost tracker (or route through one LiteLLM proxy).
Propagate a root run ID through every sub-agent call via a header or context variable; every LLM call records against it.
Always read token usage from the API response usage field; never estimate client-side.
Include output tokens, cache-write tokens, and cache-read tokens in the formula alongside input tokens.
Capture usage on streaming responses explicitly: read the final message on Anthropic, set include_usage on OpenAI.
Reconcile internal tracking against provider invoices monthly; a growing discrepancy means a new untracked call path has appeared.
Add a per-run cost cap that fires an alert (not a hard stop) when a run exceeds 2x its expected cost.
Surface sub-agent spend prominently in dashboards; if it is invisible, planning will always underestimate it.
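The per-run cost cap from the list above can be sketched in a few lines. This assumes you already have an expected cost per run from past data; `exceeds_cost_cap` is an illustrative name, and the warning log stands in for whatever alerting hook you use.

```python
import logging

logger = logging.getLogger(__name__)


def exceeds_cost_cap(
    run_id: str, actual_usd: float, expected_usd: float, multiplier: float = 2.0
) -> bool:
    """Warn (without stopping the run) when cost passes multiplier x expected."""
    cap_usd = expected_usd * multiplier
    if actual_usd > cap_usd:
        logger.warning(
            "run %s cost $%.2f exceeds cap $%.2f (%.1fx expected $%.2f)",
            run_id, actual_usd, cap_usd, multiplier, expected_usd,
        )
        return True
    return False
```

Calling this after each recorded LLM call, rather than once at the end, means a runaway sub-agent is flagged while the run is still in progress.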

FAQ

Q: How do I track costs across multiple providers (Anthropic + OpenAI + Gemini)? A: Use a provider-agnostic schema (input tokens, output tokens, model, provider) plus a per-model pricing table. litellm ships built-in cost tracking with a unified interface across providers and its own pricing map, so completion_cost(response) returns dollars without you maintaining rates by hand. Refresh the table whenever a provider changes prices.

Q: Can LangSmith aggregate distributed sub-agent costs? A: Yes. Pass parent_run_id (or propagate tracing context / OpenTelemetry across service boundaries) so each sub-run attaches to the root. LangSmith then sums cost for every run in the tree under the root. The hard requirement is that every LLM call, orchestrator or sub-agent, is logged with the correct parent.

Q: Why is the cache-read cost so different from input cost? A: Prompt caching has its own rates. As of June 2026 a cache read is 0.1x the base input price and a 5-minute cache write is 1.25x (a 1-hour write is 2x). On Sonnet 4.6 that is $0.30/MTok read vs $3 input. If your formula treats cached input as full-price input, you over- or under-count depending on hit rate, so track cache_creation_input_tokens and cache_read_input_tokens as separate line items.
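A worked example makes the over- and under-counting concrete. The sketch below uses the Sonnet 4.6 rates above and a hypothetical call with 10k uncached input tokens, 50k cache-write tokens, and 200k cache-read tokens; the token counts are invented for illustration.

```python
# Sonnet 4.6 rates quoted above, USD per million tokens (5-minute cache write).
INPUT, CACHE_WRITE, CACHE_READ = 3.0, 3.75, 0.30

# Hypothetical call; the three counts are reported separately in usage.
uncached_in, cache_write, cache_read = 10_000, 50_000, 200_000


def usd(tokens: int, rate: float) -> float:
    return tokens / 1_000_000 * rate


# Separate line items, each at its own rate.
correct = usd(uncached_in, INPUT) + usd(cache_write, CACHE_WRITE) + usd(cache_read, CACHE_READ)

# Mistake A: price every input-side token at the full input rate (over-counts).
all_at_input_rate = usd(uncached_in + cache_write + cache_read, INPUT)

# Mistake B: read only input_tokens and ignore the cache fields (under-counts).
cache_ignored = usd(uncached_in, INPUT)
```

With these numbers the correct input-side cost is about $0.28, pricing everything at the input rate gives $0.78, and ignoring the cache fields gives $0.03.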

Q: Our sub-agents run on different teams’ infrastructure. How do we attribute cost? A: Tag every LLM call with team_id and pipeline_id in request metadata (LiteLLM, LangSmith, and the native SDKs all support custom tags). Build a weekly report that breaks cost down by team and pipeline.
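The report itself is a group-by over the tagged records. A minimal sketch, assuming each record is a dict with a `cost_usd` field and a `metadata` dict carrying `team_id` and `pipeline_id`; that record shape is hypothetical and would need mapping from whatever your tracker or proxy actually exports.

```python
from collections import defaultdict


def cost_by_team_and_pipeline(records: list[dict]) -> dict[tuple[str, str], float]:
    """Sum cost per (team_id, pipeline_id); untagged calls get their own bucket."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for record in records:
        meta = record.get("metadata") or {}
        key = (meta.get("team_id", "untagged"), meta.get("pipeline_id", "untagged"))
        totals[key] += record["cost_usd"]
    return dict(totals)
```

Keeping an explicit "untagged" bucket shows how much spend is still missing attribution instead of dropping it from the report.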

Q: What is a reasonable cost-per-run for a multi-agent code-review pipeline? A: Reviewing a single PR (5-10 files, 3 agent passes) on Sonnet 4.6 typically lands at $0.05-$0.30 depending on file sizes and context length. Anything over $1.00 per PR warrants checking for token bloat, full-context resends, or unexpected retry loops.

Tags: #AI coding #Agents #Troubleshooting