Agent Output Not Machine-Parseable Downstream

Your agent wraps JSON in markdown or adds prose commentary, breaking the downstream parser. Here's how to enforce structured output reliably.

Your LangGraph pipeline has an analysis agent that is supposed to output a JSON object such as `{"issues": [...], "severity": "high"}`. Downstream, a routing agent calls `json.loads(output)` and crashes with `JSONDecodeError: Expecting value: line 1 column 1 (char 0)`. The analysis agent actually returned:

````text
Here's my analysis:

```json
{"issues": ["missing null check"], "severity": "high"}
```

Let me know if you need more detail.
````

The JSON is there, but it is buried in a markdown code fence and surrounded by prose. This is one of the most common output-format failures in multi-agent pipelines, and it compounds: each downstream parse failure either crashes the pipeline or, if the error is swallowed, passes garbage data further downstream.

## Common causes

### 1. System prompt requests JSON but doesn't forbid prose

The prompt says "respond with JSON" but doesn't say "respond with ONLY JSON, no other text." LLMs default to conversational framing — they add preambles ("Here's the result:"), postambles ("Let me know if..."), and markdown fences even when asked for raw JSON.

**How to spot it**: Print the raw string content of the last 10 agent outputs before any parsing. Count how many contain characters before the first `{` or after the last `}`. If more than 2 out of 10 have leading/trailing text, the prompt is not strict enough.
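The check described above can be automated over logged outputs. The following is a minimal sketch, not a definitive implementation: the helper names `has_extra_text` and `count_wrapped` and the sample strings are hypothetical, and it assumes the expected payload is a single JSON object.

```python
def has_extra_text(raw: str) -> bool:
    """Return True if anything other than whitespace surrounds the outermost braces."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end == -1 or end < start:
        return True  # no JSON object at all also counts as a format failure
    return bool(raw[:start].strip() or raw[end + 1:].strip())


def count_wrapped(raw_outputs: list[str]) -> int:
    """Count outputs that carry leading or trailing text around the JSON object."""
    return sum(has_extra_text(raw) for raw in raw_outputs)


# Hypothetical samples; in practice, pull the last 10 raw strings from your logs.
samples = [
    '{"issues": [], "severity": "low"}',
    'Here\'s my analysis:\n```json\n{"issues": [], "severity": "low"}\n```',
    '{"issues": ["x"], "severity": "high"}\nLet me know if you need more.',
]
print(f"{count_wrapped(samples)} of {len(samples)} outputs have extra text")
```

Run it against raw strings captured before any parsing or cleanup, otherwise the wrapper text may already have been stripped.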

### 2. No output schema enforcement — relying on prompt alone

The pipeline relies entirely on prompt instructions to produce structured output. There is no schema validation, no Pydantic model, and no structured-output API call (e.g., OpenAI's `response_format`, Anthropic's tool-calling). The model's compliance is probabilistic, not enforced.

**How to spot it**: Check whether the agent call uses any structured-output API feature. If the model is invoked with a plain string prompt and the response is parsed as a string, there is no schema enforcement.

### 3. Model version change breaks a previously reliable format

Your pipeline worked for months with GPT-4o `gpt-4o-2024-05-13`. After an automatic model version update, the same prompt now sometimes produces code-fenced JSON. Different model versions have different formatting tendencies, and "worked before" is not a guarantee for a different checkpoint.

**How to spot it**: Check when format failures started. If they correlate with a model version change or a provider infrastructure update, format regression in the new model is the cause.

### 4. Long output triggers partial JSON with truncation

The agent is asked to return a large JSON array. The output hits the model's max-token limit mid-way through the array. The result is valid JSON up to a point, then cut off: `["item1", "item2", "ite` — which `json.loads()` rejects.

**How to spot it**: Check whether parse failures correlate with large result sets. If the failed outputs have a token count at or near the configured `max_tokens` limit, or the provider reports a length-based stop reason, truncation is the cause.
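When only the logged strings are available, a bracket-balance heuristic can separate truncated outputs from other parse failures. This is a rough sketch under that assumption; `looks_truncated` is a hypothetical helper. Where the raw API response is still available, the provider's own signal is more reliable (`finish_reason == "length"` in OpenAI chat completions, `stop_reason == "max_tokens"` in the Anthropic Messages API).

```python
def looks_truncated(raw: str) -> bool:
    """Heuristic: True if a bracket is left open or a string is unterminated."""
    depth = 0          # number of currently open { or [ brackets
    in_string = False  # inside a double-quoted JSON string
    escaped = False    # previous character was a backslash inside a string
    for ch in raw:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
    return in_string or depth > 0


print(looks_truncated('["item1", "item2", "ite'))
print(looks_truncated('{"issues": ["a"], "severity": "low"}'))
```

The heuristic only counts brackets and quotes, so it can misfire on prose that contains unbalanced double quotes; treat it as a triage aid for logs.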

### 5. Multi-turn conversation accumulates non-JSON turns

In a multi-turn agent session, the agent produces valid JSON on turn 3 but on turn 5 (after the conversation has grown longer) it starts adding commentary. The model is fitting to the conversational tone of earlier turns in the context window.

**How to spot it**: Log which turn number parse failures occur on. If failures cluster on later turns (turn 5+), context drift is causing format regression.
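A small tally makes the clustering visible. The sketch below assumes you log `(turn_number, raw_output)` pairs; the function name `failure_rate_by_turn` and the sample records are hypothetical.

```python
import json
from collections import Counter


def failure_rate_by_turn(records):
    """Map each turn number to the fraction of outputs that failed json.loads.

    records: iterable of (turn_number, raw_output) pairs from your logs.
    """
    totals = Counter()
    failures = Counter()
    for turn, raw in records:
        totals[turn] += 1
        try:
            json.loads(raw)
        except json.JSONDecodeError:
            failures[turn] += 1
    return {turn: failures[turn] / totals[turn] for turn in sorted(totals)}


# Hypothetical log records for illustration
records = [
    (3, '{"issues": []}'),
    (5, 'Sure! {"issues": []}'),
    (5, '{"issues": []}'),
]
print(failure_rate_by_turn(records))
```

A rate that climbs with the turn number points at context drift rather than a prompt that was never strict enough.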

### 6. Downstream parser assumes one format, agent changed format

The schema evolved: the agent now returns `{"result": {"issues": [...]}}` (nested) but the parser still reads `data["issues"]` (flat). There is no JSON parse error, just a `KeyError`, or a silent `None` from `data.get("issues")` where a list was expected.

**How to spot it**: Compare the schema in the parsing code against the actual schema the agent returns today. Schema drift between the two is a format mismatch even if the JSON is well-formed.
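One way to make that comparison mechanical is to diff the top-level keys the parser expects against what the agent returned. This is a minimal sketch; `EXPECTED_KEYS` and `schema_drift` are hypothetical names, and it only inspects the top level of the object.

```python
import json

# Keys the downstream parser reads today (assumption for this example)
EXPECTED_KEYS = {"issues", "severity"}


def schema_drift(raw: str, expected: set[str] = EXPECTED_KEYS) -> dict[str, set[str]]:
    """Report top-level keys that are missing from, or unexpected in, the output."""
    data = json.loads(raw)
    actual = set(data) if isinstance(data, dict) else set()
    return {"missing": expected - actual, "unexpected": actual - expected}


print(schema_drift('{"result": {"issues": []}}'))
```

Two empty sets mean the top-level shape matches; anything else is drift worth fixing before it reaches the router.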

## Shortest path to fix

### Step 1: Use structured output APIs instead of prompt-based formatting

OpenAI:

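```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()  # reads OPENAI_API_KEY from the environment


class AnalysisResult(BaseModel):
    issues: list[str]
    severity: str
    confidence: float


# `messages` is your existing chat history (a list of role/content dicts).
# The SDK derives a JSON schema from the Pydantic model and parses the reply.
response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=messages,
    response_format=AnalysisResult,
)
result = response.choices[0].message.parsed  # typed AnalysisResult object
```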

Anthropic (via tool-calling for structured output):

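```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [{
    "name": "submit_analysis",
    "description": "Submit the analysis result",
    "input_schema": {
        "type": "object",
        "properties": {
            "issues": {"type": "array", "items": {"type": "string"}},
            "severity": {"type": "string", "enum": ["low", "medium", "high"]},
        },
        "required": ["issues", "severity"]
    }
}]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    tools=tools,
    tool_choice={"type": "tool", "name": "submit_analysis"},
    messages=messages,
)
# Select the tool_use block explicitly rather than assuming it comes first
tool_use = next(block for block in response.content if block.type == "tool_use")
result = tool_use.input  # dict matching the schema
```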

### Step 2: Add a JSON extraction wrapper as a fallback

```python
import json
import re


def extract_json(text: str) -> dict:
    """Best-effort extraction of a JSON object from noisy agent output."""
    # Try direct parse first
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError:
        pass

    # Strip markdown code fences
    fenced = re.search(r'```(?:json)?\s*(\{.*?\})\s*```', text, re.DOTALL)
    if fenced:
        try:
            return json.loads(fenced.group(1))
        except json.JSONDecodeError:
            pass

    # Find the outermost JSON object
    start = text.find('{')
    end = text.rfind('}')
    if start != -1 and end != -1 and end > start:
        try:
            return json.loads(text[start:end+1])
        except json.JSONDecodeError:
            pass

    # Include a prefix of the raw text so the failure is diagnosable from logs
    raise ValueError(f"Could not extract JSON from agent output: {text[:200]!r}")
```

Use this as a fallback, not a primary strategy: fix the prompt or API call first.

### Step 3: Harden the system prompt with explicit negative constraints

```text
Respond with ONLY a valid JSON object. No markdown. No code fences. No preamble.
No postamble. No explanation. The first character of your response must be '{'.
The last character must be '}'. If you cannot produce valid JSON, respond with:
{"error": "unable to analyze", "reason": "<one sentence>"}
```

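The “first/last character” instruction is surprisingly effective, and it doubles as a cheap check: any output that does not start with `{` and end with `}` can be flagged before parsing.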

### Step 4: Validate schema after parsing

```python
from pydantic import BaseModel, ValidationError


class OutputFormatError(Exception):
    """Raised when agent output parses as JSON but does not match the schema."""


class AnalysisResult(BaseModel):
    issues: list[str]
    severity: str


def parse_and_validate(raw: str) -> AnalysisResult:
    data = extract_json(raw)  # the wrapper from Step 2
    try:
        return AnalysisResult(**data)
    except ValidationError as e:
        raise OutputFormatError(
            f"Agent output failed schema validation: {e}"
        ) from e
```

Schema validation catches field-level issues (missing keys, wrong types) that JSON parsing alone misses.

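### Step 5: Limit output size to prevent truncation

```python
# Bound the output size in the prompt, then set max_tokens with headroom above it.
# A JSON object with 20 issues averages ~500 tokens
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # not 4096: right-size to expected output, with headroom
    messages=messages,
)

# A max_tokens stop reason means the reply was cut off and the JSON is incomplete
if response.stop_reason == "max_tokens":
    raise RuntimeError("agent output truncated at max_tokens")

# If output might be large, page it
# Prompt: "Return at most 10 issues per call. Include a 'has_more' boolean."
```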

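### Step 6: Write a format regression test

```python
# load_fixture and run_agent stand in for your project's own test helpers;
# parse_and_validate is the function from Step 4.
def test_agent_output_format():
    sample_inputs = load_fixture("agent_format_test_inputs.json")
    for inp in sample_inputs:
        raw = run_agent(inp)
        result = parse_and_validate(raw)  # raises on unparseable or invalid output
        assert result.issues is not None
        assert result.severity in ("low", "medium", "high")
```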

Run this in CI. If the model version updates and format regresses, CI catches it before production.

## Prevention

- Use the provider’s structured output API (OpenAI `response_format`, Anthropic `tool_choice`) instead of relying on prompt instructions alone.
- Harden system prompts with explicit negative constraints (no prose, no fences, first character is an opening brace) for any path where structured output APIs are not available.
- Validate output against a Pydantic schema immediately after parsing, before any downstream consumption.
- Bound the output size in the prompt and size `max_tokens` to the expected output (with headroom), not the model’s maximum; truncation-induced parse failures are easy to prevent.
- Write format regression tests that run in CI against the production model endpoint.
- Version your output schema explicitly; when the schema changes, update both the agent prompt and the parser together in the same commit.
- Log the raw pre-parse string for every output that fails validation; you need the exact characters to diagnose format issues.
- Monitor parse failure rate in production; alert when it exceeds 1% of outputs.
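The monitoring item can be approximated in-process with a rolling window. The sketch below is one possible shape, not a prescribed design: the class name `ParseFailureMonitor` and the window size are assumptions, and the 1% threshold comes from the list above. In production you would more likely emit a metric to your existing monitoring system.

```python
from collections import deque


class ParseFailureMonitor:
    """Track parse outcomes over a rolling window and flag a high failure rate."""

    def __init__(self, window: int = 1000, threshold: float = 0.01):
        self.results = deque(maxlen=window)  # True = parsed OK, False = failed
        self.threshold = threshold

    def record(self, ok: bool) -> None:
        self.results.append(ok)

    @property
    def failure_rate(self) -> float:
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def should_alert(self) -> bool:
        return self.failure_rate > self.threshold


monitor = ParseFailureMonitor(window=100)
for i in range(100):
    monitor.record(ok=(i % 50 != 0))  # two failures in 100 outputs
print(monitor.failure_rate, monitor.should_alert())
```

Call `record` once per agent output, right where parsing and validation happen, so the rate reflects what downstream consumers actually see.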

## FAQ

**Q: Should I use JSON mode or tool-calling for structured output?**
A: Tool-calling is more reliable because the model treats the tool input as a typed API call rather than freeform text that happens to be JSON. OpenAI’s `response_format={"type": "json_schema", ...}` also enforces a schema. Anthropic’s equivalent is using `tool_choice` to force a specific tool call. Plain “JSON mode” (`response_format={"type": "json_object"}`) guarantees valid JSON but not schema compliance.

**Q: Can I fix this without changing the agent prompt?**
A: The extraction wrapper (Step 2) handles most fence-wrapping cases in production. But it is a band-aid: it cannot fix truncated JSON or schema drift. Fix the root cause with structured output APIs.

**Q: Our pipeline uses a fine-tuned model. Can we fine-tune format compliance?**
A: Yes. Include 200+ examples of correct JSON-only output in fine-tuning data, with explicit negative examples showing fenced/prose output labeled as wrong. Fine-tuning dramatically reduces format failures for custom model checkpoints.

**Q: How do I handle streaming responses that need to be parsed?**
A: Buffer the full stream before parsing. Parsing a partial stream produces fragmented JSON. If you need real-time progress from a streaming agent, emit structured progress events (`{"type": "progress", "pct": 50}`) rather than partial JSON objects.
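Buffering before parsing can be as simple as joining the text chunks once the stream ends. The sketch below assumes the stream has already been reduced to an iterable of text chunks; `parse_streamed_json` and the sample chunks are hypothetical.

```python
import json
from typing import Iterable


def parse_streamed_json(chunks: Iterable[str]) -> dict:
    """Join every text chunk of a finished stream, then parse once."""
    buffer = "".join(chunks)
    return json.loads(buffer)


# Hypothetical chunks as they might arrive from a streaming response
chunks = ['{"issues": ["missing ', 'null check"], "sev', 'erity": "high"}']
print(parse_streamed_json(chunks))
```

Any single chunk on its own is an incomplete fragment that `json.loads` rejects, which is why parsing waits until the stream is exhausted.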

Tags: #AI coding #Agents #Troubleshooting