Cost Tracking Misses Sub-Agent Usage

Your pipeline's cost report shows $2 but the invoice says $18. Sub-agent token usage is not attributed to the parent run. Here's how to close the gap.

Your LangGraph pipeline finishes and LangSmith reports a total cost of $2.10 for the run. The Anthropic invoice at the end of the month is $380, 20x your projected spend. The gap comes from sub-agent calls: your orchestrator spawns a research sub-agent, which spawns a web-search sub-agent, which makes its own LLM calls. Each level either uses a different API key or makes calls that aren’t attributed back to the root run. Your cost tracker sees only the calls the orchestrator makes directly. The sub-agent tree is invisible, and so is its cost.

Common causes

1. Sub-agents use separate API keys not linked to the parent

The orchestrator has API key sk-orch. When it spawns a sub-agent, the sub-agent uses a different key sk-subagent loaded from its own environment. The cost tracking callback is attached to the sk-orch calls only. Sub-agent costs accrue to sk-subagent and are not aggregated into the parent run’s report.

How to spot it: List all API keys used anywhere in the pipeline. If there is more than one, check whether each key’s usage is rolled up into a single cost tracker. Any untracked key means invisible costs.
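
That check can be sketched as a small roll-up. This is a minimal illustration, assuming you can export token totals per key; the `KeyUsage` record and the helper names are hypothetical, not part of any SDK.

```python
from dataclasses import dataclass


@dataclass
class KeyUsage:
    """Token totals observed for one API key (hypothetical record shape)."""
    input_tokens: int = 0
    output_tokens: int = 0


def untracked_keys(keys_in_pipeline: set[str], tracked: dict[str, KeyUsage]) -> set[str]:
    """Return keys the pipeline uses that the cost tracker never sees."""
    return keys_in_pipeline - set(tracked)


def rollup(tracked: dict[str, KeyUsage]) -> KeyUsage:
    """Sum usage across every tracked key into one pipeline-wide total."""
    total = KeyUsage()
    for usage in tracked.values():
        total.input_tokens += usage.input_tokens
        total.output_tokens += usage.output_tokens
    return total


# Only the orchestrator key is tracked; the sub-agent key is not
tracked = {"sk-orch": KeyUsage(input_tokens=120_000, output_tokens=8_000)}
missing = untracked_keys({"sk-orch", "sk-subagent"}, tracked)
total = rollup(tracked)
```

Any key that shows up in `missing` is spend that the report cannot see.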

2. Sub-agent is a separate process/service — no shared callback

The orchestrator calls the sub-agent over HTTP (POST /run-agent). The sub-agent is a separate microservice that makes its own LLM calls. The orchestrator’s LangSmith or OpenAI callback only tracks calls made in the orchestrator’s process. The sub-agent service either has no callback at all or reports to a separate project.

How to spot it: Check whether any sub-agents run in separate processes, Docker containers, or services. Any sub-agent not in the same process as the cost tracker is invisible.

3. LangSmith trace hierarchy is broken — sub-runs not parented to the root

LangSmith uses run_id and parent_run_id to build a trace tree. If the sub-agent is invoked without passing the parent run ID, it creates a root-level trace with no parent. LangSmith’s per-run cost aggregation only sums a run and its explicit children — orphaned sub-runs are excluded.

How to spot it: In LangSmith, open the root run and expand the trace tree. If you see fewer nodes than you have agents, some sub-runs are not parented. Also look for orphaned root-level runs with the same session ID as the main run.
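
The same check can be run offline on exported run records. The sketch below assumes each record is a dict with `run_id` and `parent_run_id` keys; that shape is an assumption for illustration, not LangSmith’s export format.

```python
def find_orphans(runs: list[dict], root_run_id: str) -> list[str]:
    """Return IDs of runs that cannot reach root_run_id by following parent links."""
    parent_of = {r["run_id"]: r.get("parent_run_id") for r in runs}

    def reaches_root(run_id: str) -> bool:
        seen = set()  # guards against accidental cycles in the data
        current = run_id
        while current is not None and current not in seen:
            if current == root_run_id:
                return True
            seen.add(current)
            current = parent_of.get(current)
        return False

    return [rid for rid in parent_of if not reaches_root(rid)]


runs = [
    {"run_id": "root", "parent_run_id": None},
    {"run_id": "research", "parent_run_id": "root"},
    {"run_id": "web-search", "parent_run_id": None},  # spawned without a parent ID
    {"run_id": "llm-call", "parent_run_id": "web-search"},
]
orphans = find_orphans(runs, "root")
```

Here `web-search` and everything beneath it fall outside the root’s tree, so their cost would be missing from the root run’s total.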

4. Aggregation only counts input tokens — ignores output and cache tokens

Your cost formula is cost = input_tokens * price_per_input_token. It leaves out output tokens (for Claude Opus, $75/MTok against $15/MTok for input, five times the input rate), cache write and read tokens, and image tokens. The actual invoice includes all of these.

How to spot it: Compare your formula against the provider’s pricing page. If it doesn’t include output tokens, cache write tokens, and tool-use tokens separately, it is underestimating.
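
A quick worked comparison shows how large the gap can get. The prices are the Opus figures quoted above; the token counts are made up for illustration.

```python
INPUT_USD_PER_MTOK = 15.0   # input price quoted above
OUTPUT_USD_PER_MTOK = 75.0  # output price quoted above


def input_only_cost(input_tokens: int) -> float:
    """The incomplete formula: input tokens only."""
    return input_tokens / 1_000_000 * INPUT_USD_PER_MTOK


def input_plus_output_cost(input_tokens: int, output_tokens: int) -> float:
    """Input and output tokens, each at its own rate."""
    return (
        input_tokens / 1_000_000 * INPUT_USD_PER_MTOK
        + output_tokens / 1_000_000 * OUTPUT_USD_PER_MTOK
    )


# Hypothetical call: 200k input tokens, 40k output tokens
naive = input_only_cost(200_000)
actual = input_plus_output_cost(200_000, 40_000)
print(f"input-only: ${naive:.2f}  with output: ${actual:.2f}")
```

With these numbers the input-only formula reports half the real cost, even though output tokens are only a sixth of the total token count.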

5. Token counts are estimated, not read from API responses

The cost tracker counts tokens with a client-side estimator such as tiktoken (encoding.encode(prompt)). tiktoken implements OpenAI’s tokenizers, so for other providers’ models it is only an approximation, and every provider bills on its own count. For non-English text, function call payloads, and system prompts, the provider’s count can be 20-40% higher than client-side estimates.

How to spot it: Compare your estimated token counts against the usage field in the API response for the same call. If the provider’s usage.input_tokens differs from your estimate by more than 10%, use the API-reported count.
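
The comparison can be sketched as follows. The helper names are hypothetical, and measuring drift relative to the provider-reported count is an assumption; the 10% tolerance is the threshold mentioned above.

```python
def estimate_drift_pct(estimated_tokens: int, reported_tokens: int) -> float:
    """Percent difference between an estimate and the provider-reported count."""
    if reported_tokens <= 0:
        raise ValueError("reported_tokens must be positive")
    return abs(reported_tokens - estimated_tokens) / reported_tokens * 100


def estimate_is_acceptable(
    estimated_tokens: int, reported_tokens: int, tolerance_pct: float = 10.0
) -> bool:
    """True when the estimate is within tolerance of the reported count."""
    return estimate_drift_pct(estimated_tokens, reported_tokens) <= tolerance_pct


# Hypothetical call: the estimator said 800 tokens, the API reported 1000
drift = estimate_drift_pct(estimated_tokens=800, reported_tokens=1000)
ok = estimate_is_acceptable(800, 1000)
```

Running this over a sample of real calls gives a quick read on whether the estimator can be trusted for any part of the pipeline.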

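6. Streaming responses don’t emit usage data, so usage is never recorded

When streaming (stream=True), usage data is easy to miss. Some APIs and SDK versions omit the usage object from the final chunk unless you explicitly request it; OpenAI’s Chat Completions API, for example, includes it only when the request sets stream_options={"include_usage": True}. If the tracker never sees usage, those calls contribute zero tokens.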

How to spot it: Log whether usage is present in the final chunk of every streaming response. Missing usage fields in streaming calls result in systematic undercounting.
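
A provider-neutral way to run that check is to scan each stream for a usage payload. The sketch below assumes chunks have already been converted to plain dicts; the chunk shapes are hypothetical.

```python
from typing import Iterable, Optional


def usage_from_stream(chunks: Iterable[dict]) -> Optional[dict]:
    """Return the last non-empty usage payload in a stream, or None if there was none."""
    usage = None
    for chunk in chunks:
        if chunk.get("usage"):
            usage = chunk["usage"]
    return usage


# One stream reports usage in its last chunk, the other never does
with_usage = [
    {"text": "Hel"},
    {"text": "lo"},
    {"usage": {"input_tokens": 12, "output_tokens": 2}},
]
without_usage = [{"text": "Hel"}, {"text": "lo"}]

for name, chunks in [("with_usage", with_usage), ("without_usage", without_usage)]:
    if usage_from_stream(chunks) is None:
        print(f"{name}: no usage emitted, this call would be recorded as zero tokens")
```

Logging the `None` case per call site makes it clear which streaming paths undercount.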

Shortest path to fix

Step 1: Build a centralized cost accumulator keyed by run ID

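from collections import defaultdict
from dataclasses import dataclass
from threading import Lock

# USD per million tokens. Illustrative values only; copy the current
# numbers from your provider's pricing page.
MODEL_PRICING = {
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00, "cache_write": 3.75, "cache_read": 0.30},
}

@dataclass
class RunCost:
    input_tokens: int = 0
    output_tokens: int = 0
    cache_write_tokens: int = 0
    cache_read_tokens: int = 0

    def total_cost_usd(self, model: str) -> float:
        pricing = MODEL_PRICING[model]
        return (
            self.input_tokens / 1_000_000 * pricing["input"]
            + self.output_tokens / 1_000_000 * pricing["output"]
            + self.cache_write_tokens / 1_000_000 * pricing.get("cache_write", 0)
            + self.cache_read_tokens / 1_000_000 * pricing.get("cache_read", 0)
        )

# Keyed by (run_id, model) so each bucket is priced at its own model's rates
_costs: dict[tuple[str, str], RunCost] = defaultdict(RunCost)
_lock = Lock()

def record_usage(run_id: str, usage: dict, model: str) -> None:
    with _lock:
        c = _costs[(run_id, model)]
        # "or 0" guards against fields that are present but None
        c.input_tokens += usage.get("input_tokens") or 0
        c.output_tokens += usage.get("output_tokens") or 0
        c.cache_write_tokens += usage.get("cache_creation_input_tokens") or 0
        c.cache_read_tokens += usage.get("cache_read_input_tokens") or 0

def run_cost_usd(run_id: str) -> float:
    """Total cost of a run across every model it used."""
    with _lock:
        return sum(c.total_cost_usd(m) for (rid, m), c in _costs.items() if rid == run_id)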

Step 2: Pass run ID to sub-agents and propagate it through all calls

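import uuid

import httpx
from fastapi import FastAPI, Request
from pydantic import BaseModel

# --- In the orchestrator ---
# Assumes an httpx client; the base URL is a placeholder for the sub-agent service
sub_agent_client = httpx.Client(base_url="http://sub-agent.internal")
root_run_id = str(uuid.uuid4())

def invoke_sub_agent(task: str, parent_run_id: str) -> str:
    resp = sub_agent_client.post(
        "/run",
        json={"task": task},
        headers={"X-Run-Id": parent_run_id},  # propagate the ID
    )
    resp.raise_for_status()
    return resp.json()["result"]

# --- In the sub-agent service ---
app = FastAPI()

class AgentRequest(BaseModel):
    task: str

@app.post("/run")
def run_agent(request: Request, body: AgentRequest):
    # Fall back to a fresh ID only when the caller did not send one
    run_id = request.headers.get("X-Run-Id", str(uuid.uuid4()))
    # All LLM calls in this service record against run_id
    result = execute(body.task, run_id=run_id)  # execute() is the service's own agent loop
    return {"result": result}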

Step 3: Always read usage from API response, not client-side estimates

# Anthropic
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# messages and current_run_id come from the surrounding agent code
response = client.messages.create(
    model="claude-sonnet-4-6",
    messages=messages,
    max_tokens=1024,
)
# Read actual usage from the response; cache fields can be absent or None
usage = {
    "input_tokens": response.usage.input_tokens,
    "output_tokens": response.usage.output_tokens,
    "cache_creation_input_tokens": getattr(response.usage, "cache_creation_input_tokens", None) or 0,
    "cache_read_input_tokens": getattr(response.usage, "cache_read_input_tokens", None) or 0,
}
record_usage(current_run_id, usage, model="claude-sonnet-4-6")

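Step 4: Capture usage from streaming responses

# Anthropic streaming: the SDK accumulates usage as events arrive;
# read it from the final message once the stream has been consumed
with client.messages.stream(
    model="claude-sonnet-4-6",
    messages=messages,
    max_tokens=1024,
) as stream:
    for event in stream:
        pass  # process text events

    # Usage is available after the stream ends
    final_message = stream.get_final_message()
    record_usage(current_run_id, {
        "input_tokens": final_message.usage.input_tokens,
        "output_tokens": final_message.usage.output_tokens,
        "cache_creation_input_tokens": getattr(final_message.usage, "cache_creation_input_tokens", None) or 0,
        "cache_read_input_tokens": getattr(final_message.usage, "cache_read_input_tokens", None) or 0,
    }, model="claude-sonnet-4-6")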

Step 5: Reconcile internal tracking against provider invoices every billing period

import logging

logger = logging.getLogger(__name__)

def reconcile_costs(internal_usd: float, invoice_usd: float) -> None:
    if invoice_usd <= 0:
        raise ValueError("invoice_usd must be positive")
    discrepancy_pct = abs(internal_usd - invoice_usd) / invoice_usd * 100
    logger.info(
        "Cost reconciliation: internal=%.2f invoice=%.2f discrepancy=%.1f%%",
        internal_usd, invoice_usd, discrepancy_pct
    )
    if discrepancy_pct > 10:
        # alert() stands in for your own paging or notification hook
        alert(f"Cost tracking discrepancy {discrepancy_pct:.1f}%: investigate sub-agent attribution")

Run this once per billing period. A discrepancy above 10% means you have untracked calls.

Prevention

  • Use a single API key for the entire pipeline and sub-agent tree; if isolation is required, aggregate all keys into a single cost tracker.
  • Propagate a root run ID through all sub-agent calls via a header or context variable; every LLM call records against this ID.
  • Always read token usage from the API response’s usage field — never estimate client-side.
  • Include output tokens, cache write tokens, and cache read tokens in your cost formula alongside input tokens.
  • Enable usage reporting in streaming responses explicitly; it is often opt-in.
  • Reconcile your internal cost tracker against provider invoices monthly; a growing discrepancy means new call paths have appeared without tracking.
  • Add a per-run cost threshold that fires an alert (not a hard stop) when the run exceeds 2x the expected cost.
  • Include sub-agent costs in dashboards prominently — if sub-agent spend is invisible, it will always be underestimated in planning.
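
For in-process sub-agents, the context-variable option from the list above can be sketched with the standard library’s contextvars module. The function names here are hypothetical; the `X-Run-Id` header matches the one used in Step 2.

```python
import contextvars
import uuid
from typing import Optional

# Holds the root run ID for the current thread or asyncio task
_root_run_id: contextvars.ContextVar[Optional[str]] = contextvars.ContextVar(
    "root_run_id", default=None
)


def start_run() -> str:
    """Create a root run ID and make it visible to everything called from here."""
    run_id = str(uuid.uuid4())
    _root_run_id.set(run_id)
    return run_id


def current_run_id() -> str:
    """Return the active root run ID, failing loudly if no run was started."""
    run_id = _root_run_id.get()
    if run_id is None:
        raise RuntimeError("no active run; call start_run() first")
    return run_id


def outbound_headers() -> dict[str, str]:
    """Headers to attach when calling a sub-agent service over HTTP."""
    return {"X-Run-Id": current_run_id()}


run_id = start_run()
headers = outbound_headers()
```

Raising when no run is active is a deliberate choice in this sketch: a silent fallback ID would recreate the orphaned-run problem.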

FAQ

Q: How do I track costs across multiple providers (Anthropic + OpenAI + Gemini)? A: Use a provider-agnostic cost schema (input tokens, output tokens, model, provider) and maintain a pricing table per model. Libraries like litellm include built-in cost tracking across providers with a unified interface. Update the pricing table whenever providers change prices.
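
A minimal sketch of such a schema follows. The provider names, model names, and prices are placeholders, not real values, and the `UsageRecord` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    provider: str
    model: str
    input_tokens: int
    output_tokens: int


# USD per million tokens. Placeholder numbers, NOT real prices.
PRICING = {
    ("provider-a", "model-x"): {"input": 1.0, "output": 2.0},
    ("provider-b", "model-y"): {"input": 0.5, "output": 4.0},
}


def cost_usd(record: UsageRecord) -> float:
    """Price one record; a missing pricing entry is an error, never a silent zero."""
    key = (record.provider, record.model)
    if key not in PRICING:
        raise KeyError(f"no pricing entry for {record.provider}/{record.model}")
    price = PRICING[key]
    return (
        record.input_tokens * price["input"] + record.output_tokens * price["output"]
    ) / 1_000_000


records = [
    UsageRecord("provider-a", "model-x", 1_000_000, 500_000),
    UsageRecord("provider-b", "model-y", 2_000_000, 250_000),
]
total = sum(cost_usd(r) for r in records)
```

Failing on an unknown model matters here: defaulting to a zero price would hide exactly the kind of untracked spend this article is about.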

Q: Can LangSmith handle distributed sub-agent cost aggregation? A: Yes, if you pass the parent_run_id when creating each sub-run. LangSmith aggregates costs for all runs in a tree under the root. The critical requirement is that every LLM call, whether in the orchestrator or sub-agent, is logged to LangSmith with the correct parent_run_id.

Q: Our sub-agents run on different teams’ infrastructure — how do we attribute costs? A: Tag every LLM call with a team_id and pipeline_id in your own usage records, and in the request metadata where the provider or your gateway supports it. Build a weekly report that breaks down cost by team and pipeline.
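
The report itself is a small aggregation. This sketch assumes each recorded call is a dict carrying `team_id`, `pipeline_id`, and a precomputed `cost_usd`; that record shape is hypothetical.

```python
from collections import defaultdict


def cost_by_team_and_pipeline(calls: list[dict]) -> dict[tuple[str, str], float]:
    """Sum cost per (team_id, pipeline_id); calls missing a tag land under 'untagged'."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for call in calls:
        key = (call.get("team_id", "untagged"), call.get("pipeline_id", "untagged"))
        totals[key] += call["cost_usd"]
    return dict(totals)


calls = [
    {"team_id": "search", "pipeline_id": "code-review", "cost_usd": 0.25},
    {"team_id": "search", "pipeline_id": "code-review", "cost_usd": 0.5},
    {"team_id": "infra", "pipeline_id": "code-review", "cost_usd": 0.125},
    {"cost_usd": 0.5},  # a call path nobody tagged
]
report = cost_by_team_and_pipeline(calls)
```

Keeping an explicit `untagged` bucket makes unattributed spend visible instead of dropping it.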

Q: What is a reasonable cost-per-run budget for a typical multi-agent code-review pipeline? A: A pipeline that reviews a single PR (reads 5-10 files, runs 3 agent passes) with claude-sonnet-4-6 typically costs $0.05-$0.30 depending on file sizes and context lengths. Anything over $1.00 per PR warrants investigation of token bloat or unexpected retries.
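
A budget like this is easy to check mechanically. The sketch below flags runs above the $1.00 figure mentioned above; the function name and the per-run cost mapping are hypothetical.

```python
def runs_over_budget(run_costs: dict[str, float], limit_usd: float = 1.00) -> list[str]:
    """Return the IDs of runs whose total cost exceeds limit_usd, sorted by ID."""
    return sorted(run_id for run_id, cost in run_costs.items() if cost > limit_usd)


# Hypothetical per-PR totals in USD
run_costs = {"pr-101": 0.12, "pr-102": 1.85, "pr-103": 0.29}
flagged = runs_over_budget(run_costs)
```

Flagged runs are candidates for a closer look at retries and context size, not automatic failures.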

Tags: #AI coding #Agents #Troubleshooting