Your LangGraph pipeline has an analysis agent that is supposed to output a JSON object with `{"issues": [...], "severity": "high"}`. Downstream, a routing agent calls `json.loads(output)` and crashes with `JSONDecodeError: Expecting value: line 1 column 1`. The analysis agent actually returned:

````text
Here's my analysis:
```json
{"issues": ["missing null check"], "severity": "high"}
```
Let me know if you need more detail.
````

The JSON is there — it's just buried in a markdown code fence and surrounded by prose. This is the most common output format failure in multi-agent pipelines, and it compounds: each downstream parser failure either crashes the pipeline or silently consumes garbage data that propagates further downstream.
## Common causes
### 1. System prompt requests JSON but doesn't forbid prose
The prompt says "respond with JSON" but doesn't say "respond with ONLY JSON, no other text." LLMs default to conversational framing — they add preambles ("Here's the result:"), postambles ("Let me know if..."), and markdown fences even when asked for raw JSON.
**How to spot it**: Print the raw string content of the last 10 agent outputs before any parsing. Count how many contain characters before the first `{` or after the last `}`. If more than 2 out of 10 have leading/trailing text, the prompt is not strict enough.
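
A minimal sketch of that check, assuming the raw outputs have already been collected into a list of strings. The helper names `has_wrapper_text` and `count_wrapped` and the sample data are illustrative, not part of any library:

```python
def has_wrapper_text(raw: str) -> bool:
    """Return True if non-whitespace text sits before the first '{' or after the last '}'."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end == -1 or end < start:
        return True  # no JSON object at all also counts as a format failure
    return bool(raw[:start].strip()) or bool(raw[end + 1:].strip())


def count_wrapped(outputs: list[str]) -> int:
    """Count outputs that carry leading or trailing non-JSON text."""
    return sum(has_wrapper_text(o) for o in outputs)


# Hypothetical sample of raw agent outputs
samples = [
    '{"issues": [], "severity": "low"}',
    'Here is the result:\n{"issues": [], "severity": "low"}',
    '```json\n{"issues": [], "severity": "high"}\n```',
]
print(f"{count_wrapped(samples)}/{len(samples)} outputs have leading or trailing text")
```

Run it over the last 10 real outputs rather than the sample list to get the ratio described above.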
### 2. No output schema enforcement — relying on prompt alone
The pipeline relies entirely on prompt instructions to produce structured output. There is no schema validation, no Pydantic model, and no structured-output API call (e.g., OpenAI's `response_format`, Anthropic's tool-calling). The model's compliance is probabilistic, not enforced.
**How to spot it**: Check whether the agent call uses any structured-output API feature. If the model is invoked with a plain string prompt and the response is parsed as a string, there is no schema enforcement.
### 3. Model version change breaks a previously reliable format
Your pipeline worked for months with GPT-4o `gpt-4o-2024-05-13`. After an automatic model version update, the same prompt now sometimes produces code-fenced JSON. Different model versions have different formatting tendencies, and "worked before" is not a guarantee for a different checkpoint.
**How to spot it**: Check when format failures started. If they correlate with a model version change or a provider infrastructure update, format regression in the new model is the cause.
### 4. Long output triggers partial JSON with truncation
The agent is asked to return a large JSON array. The output hits the model's max-token limit mid-way through the array. The result is valid JSON up to a point, then cut off: `["item1", "item2", "ite` — which `json.loads()` rejects.
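**How to spot it**: Check whether output parse failures correlate with large result sets, and inspect the stop metadata on the failed calls: OpenAI reports `finish_reason` of `"length"` and Anthropic reports `stop_reason` of `"max_tokens"` when generation hit the limit. If the output token count of failed calls sits at or near `max_tokens`, truncation is the cause.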
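
When only the raw strings were logged, a structural check can approximate truncation detection. The following is a heuristic sketch (the name `looks_truncated` is an assumption for illustration): it flags text that starts like JSON, fails to parse, and ends with an open string or unclosed brackets.

```python
import json


def looks_truncated(raw: str) -> bool:
    """Heuristic: JSON-shaped text that fails to parse and ends mid-structure."""
    text = raw.strip()
    if not text or text[0] not in "{[":
        return False  # not JSON-shaped; a different failure mode
    try:
        json.loads(text)
        return False  # parses cleanly, so nothing was cut off
    except json.JSONDecodeError:
        pass
    depth = 0          # count of unclosed '{' and '['
    in_string = False  # currently inside a JSON string literal
    escaped = False    # previous character was a backslash inside a string
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
    return in_string or depth > 0
```

A malformed but complete object such as `{"a": }` returns `False` here, which helps separate truncation from other syntax errors.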
### 5. Multi-turn conversation accumulates non-JSON turns
In a multi-turn agent session, the agent produces valid JSON on turn 3 but on turn 5 (after the conversation has grown longer) it starts adding commentary. The model is fitting to the conversational tone of earlier turns in the context window.
**How to spot it**: Log which turn number parse failures occur on. If failures cluster on later turns (turn 5+), context drift is causing format regression.
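
One way to tally this, sketched under the assumption that each logged output is stored as a `(turn_number, raw_text)` pair (the log shape and the function name `failures_by_turn` are hypothetical):

```python
import json
from collections import Counter


def failures_by_turn(turns: list[tuple[int, str]]) -> Counter:
    """Tally JSON parse failures per turn number."""
    failures: Counter = Counter()
    for turn_number, raw in turns:
        try:
            json.loads(raw)
        except json.JSONDecodeError:
            failures[turn_number] += 1
    return failures


# Hypothetical session log
log = [
    (3, '{"ok": true}'),
    (5, 'Sure! {"ok": true}'),
    (5, '{"ok": true} Hope that helps.'),
    (6, '{"ok": false}'),
]
for turn, count in sorted(failures_by_turn(log).items()):
    print(f"turn {turn}: {count} parse failure(s)")
```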
### 6. Downstream parser assumes one format, agent changed format
The schema evolved: the agent now returns `{"result": {"issues": [...]}}` (nested) but the parser still reads `data["issues"]` (flat). The JSON parses without error, so the failure surfaces later: a `KeyError` on `data["issues"]`, or a silent `None` from `data.get("issues")` where a list was expected.
**How to spot it**: Compare the schema in the parsing code against the actual schema the agent returns today. Schema drift between the two is a format mismatch even if the JSON is well-formed.
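
A small sketch of that comparison, assuming the parser's expected top-level keys are known. `EXPECTED_KEYS` and `schema_diff` are illustrative names, and only top-level keys are checked:

```python
EXPECTED_KEYS = {"issues", "severity"}  # what the parsing code reads today


def schema_diff(data: dict, expected_keys: set[str]) -> tuple[set[str], set[str]]:
    """Return (missing, unexpected) top-level keys relative to what the parser expects."""
    actual = set(data)
    return expected_keys - actual, actual - expected_keys


# The agent has drifted to a nested shape
nested = {"result": {"issues": ["missing null check"], "severity": "high"}}
missing, unexpected = schema_diff(nested, EXPECTED_KEYS)
print(f"missing={sorted(missing)} unexpected={sorted(unexpected)}")
```

A non-empty `missing` set on well-formed JSON points at schema drift rather than a formatting failure.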
## Shortest path to fix
### Step 1: Use structured output APIs instead of prompt-based formatting
OpenAI:

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()  # reads OPENAI_API_KEY from the environment


class AnalysisResult(BaseModel):
    issues: list[str]
    severity: str
    confidence: float


response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=messages,  # your existing chat messages
    response_format=AnalysisResult,
)
result = response.choices[0].message.parsed  # typed AnalysisResult object
```

Anthropic (via tool-calling for structured output):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [{
    "name": "submit_analysis",
    "description": "Submit the analysis result",
    "input_schema": {
        "type": "object",
        "properties": {
            "issues": {"type": "array", "items": {"type": "string"}},
            "severity": {"type": "string", "enum": ["low", "medium", "high"]},
        },
        "required": ["issues", "severity"]
    }
}]
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    tools=tools,
    tool_choice={"type": "tool", "name": "submit_analysis"},
    messages=messages,
)
# Look the tool_use block up by type instead of assuming it is at index 0
tool_use = next(block for block in response.content if block.type == "tool_use")
result = tool_use.input  # dict matching the schema
```

### Step 2: Add a JSON extraction wrapper as a fallback

```python
import re, json


def extract_json(text: str) -> dict:
    # Try direct parse first
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError:
        pass
    # Strip markdown code fences
    fenced = re.search(r'```(?:json)?\s*(\{.*?\})\s*```', text, re.DOTALL)
    if fenced:
        try:
            return json.loads(fenced.group(1))
        except json.JSONDecodeError:
            pass
    # Find the outermost JSON object
    start = text.find('{')
    end = text.rfind('}')
    if start != -1 and end != -1 and end > start:
        try:
            return json.loads(text[start:end+1])
        except json.JSONDecodeError:
            pass
    raise ValueError(f"Could not extract JSON from agent output: {text[:200]!r}")
```

Use this as a fallback, not a primary strategy; fix the prompt or API call first.
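
One limitation of slicing from the first `{` to the last `}` is that trailing prose containing a brace breaks the slice. A possible alternative, sketched here rather than offered as the article's method, uses the standard library's `json.JSONDecoder.raw_decode`, which parses one JSON value starting at a given index and ignores whatever follows. The function name `first_json_object` is illustrative:

```python
import json


def first_json_object(text: str) -> dict:
    """Return the first substring that parses as a JSON object, ignoring surrounding prose."""
    decoder = json.JSONDecoder()
    idx = text.find("{")
    while idx != -1:
        try:
            obj, _end = decoder.raw_decode(text, idx)
        except json.JSONDecodeError:
            pass  # this '{' did not start a valid JSON value; try the next one
        else:
            if isinstance(obj, dict):
                return obj
        idx = text.find("{", idx + 1)
    raise ValueError(f"No JSON object found in: {text[:200]!r}")


raw = 'Note {not json}: {"issues": ["x"], "severity": "high"} Use {placeholders} freely.'
print(first_json_object(raw))
```

It returns the first object that parses, so it suits outputs that contain exactly one JSON payload.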
### Step 3: Harden the system prompt with explicit negative constraints

```text
Respond with ONLY a valid JSON object. No markdown. No code fences. No preamble.
No postamble. No explanation. The first character of your response must be '{'.
The last character must be '}'. If you cannot produce valid JSON, respond with:
{"error": "unable to analyze", "reason": "<one sentence>"}
```

The “first/last character” instruction is surprisingly effective.
### Step 4: Validate schema after parsing

```python
from pydantic import BaseModel, ValidationError


class OutputFormatError(Exception):
    """Raised when agent output parses as JSON but fails schema validation."""


class AnalysisResult(BaseModel):
    issues: list[str]
    severity: str


def parse_and_validate(raw: str) -> AnalysisResult:
    data = extract_json(raw)  # fallback extractor from Step 2
    try:
        return AnalysisResult(**data)
    except ValidationError as e:
        raise OutputFormatError(
            f"Agent output failed schema validation: {e}"
        ) from e
```

Schema validation catches field-level issues (missing keys, wrong types) that JSON parsing alone misses.
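### Step 5: Limit output size to prevent truncation

```python
# Set max_tokens conservatively based on the expected schema size
# A JSON object with 20 issues averages ~500 tokens
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # not 4096: right-size to expected output, with headroom
    messages=messages,
)
# max_tokens is a hard cap, not a length target, so detect cut-off output explicitly
if response.stop_reason == "max_tokens":
    raise RuntimeError("Agent output was truncated at max_tokens")

# If output might be large, page it
# Prompt: "Return at most 10 issues per call. Include a 'has_more' boolean."
```
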
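
The paging idea from the prompt comment above could be driven by a loop like the following. This is a sketch under assumptions: `fetch_page` stands in for whatever function calls your agent for a given page and returns the parsed dict, and `fake_fetch_page` is a stub used only to make the example runnable.

```python
from typing import Callable


def collect_all_issues(fetch_page: Callable[[int], dict], max_pages: int = 20) -> list[str]:
    """Call the agent page by page until it reports has_more == False."""
    issues: list[str] = []
    for page in range(max_pages):
        result = fetch_page(page)
        issues.extend(result["issues"])
        if not result.get("has_more", False):
            return issues
    # Guard against an agent that never stops reporting more pages
    raise RuntimeError(f"Agent still reported has_more after {max_pages} pages")


def fake_fetch_page(page: int, page_size: int = 10) -> dict:
    """Stub agent: serves 25 issues in pages of 10."""
    data = [f"issue-{i}" for i in range(25)]
    chunk = data[page * page_size:(page + 1) * page_size]
    return {"issues": chunk, "has_more": (page + 1) * page_size < len(data)}


all_issues = collect_all_issues(fake_fetch_page)
print(len(all_issues))
```

The `max_pages` guard keeps a misbehaving agent from looping forever.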
### Step 6: Write a format regression test

```python
def test_agent_output_format():
    # load_fixture and run_agent are your project's own helpers
    sample_inputs = load_fixture("agent_format_test_inputs.json")
    for inp in sample_inputs:
        raw = run_agent(inp)
        result = parse_and_validate(raw)
        assert result.issues is not None
        assert result.severity in ("low", "medium", "high")
```

Run this in CI. If the model version updates and format regresses, CI catches it before production.
## Prevention
- Use the provider’s structured output API (OpenAI `response_format`, Anthropic `tool_choice`) instead of relying on prompt instructions alone.
- Harden system prompts with explicit negative constraints (no prose, no fences, first char is an opening brace) for any path where structured output APIs are not available.
- Validate output against a Pydantic schema immediately after parsing — before any downstream consumption.
- Size `max_tokens` to the expected output, not the model’s maximum; truncation-induced parse failures are easy to prevent.
- Write format regression tests that run in CI against the production model endpoint.
- Version your output schema explicitly; when the schema changes, update both the agent prompt and the parser together in the same commit.
- Log the raw pre-parse string for every output that fails validation — you need the exact characters to diagnose format issues.
- Monitor parse failure rate in production; alert when it exceeds 1% of outputs.
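
The last item, alerting on parse failure rate, might be tracked with a sliding window like this. It is a minimal in-process sketch: the class name `ParseFailureMonitor` and the window size are assumptions, and a production setup would more likely export the counter to an existing metrics system.

```python
from collections import deque


class ParseFailureMonitor:
    """Track parse outcomes over a sliding window and flag when failures exceed a threshold."""

    def __init__(self, window: int = 1000, threshold: float = 0.01):
        self.results: deque[bool] = deque(maxlen=window)  # True = parsed OK
        self.threshold = threshold

    def record(self, parsed_ok: bool) -> None:
        self.results.append(parsed_ok)

    @property
    def failure_rate(self) -> float:
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def should_alert(self) -> bool:
        return self.failure_rate > self.threshold
```

Call `record(...)` after every parse attempt and check `should_alert()` on whatever schedule your alerting uses.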
## FAQ

Q: Should I use JSON mode or tool-calling for structured output?
A: Tool-calling is more reliable because the model treats the tool input as a typed API call rather than freeform text that happens to be JSON. OpenAI’s `response_format={"type": "json_schema", ...}` is also excellent. Anthropic’s equivalent is using `tool_choice` to force a specific tool call. “JSON mode” (just `response_format={"type": "json_object"}`) guarantees valid JSON but not schema compliance.

Q: Can I fix this without changing the agent prompt?
A: The extraction wrapper (Step 2) handles most fence-wrapping cases in production. But it is a band-aid: it cannot fix truncated JSON or schema drift. Fix the root cause with structured output APIs.

Q: Our pipeline uses a fine-tuned model. Can we fine-tune format compliance?
A: Yes. Include 200+ examples of correct JSON-only output in fine-tuning data, with explicit negative examples showing fenced/prose output labeled as wrong. Fine-tuning dramatically reduces format failures for custom model checkpoints.

Q: How do I handle streaming responses that need to be parsed?
A: Buffer the full stream before parsing. Parsing a partial stream produces fragmented JSON. If you need real-time progress from a streaming agent, emit structured progress events (`{"type": "progress", "pct": 50}`) rather than partial JSON objects.
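
A minimal sketch of the buffering approach, assuming the stream is exposed as an iterable of text chunks (the exact chunk type depends on your client library, and `parse_streamed_json` is an illustrative name):

```python
import json
from typing import Iterable


def parse_streamed_json(chunks: Iterable[str]) -> dict:
    """Concatenate every streamed chunk, then parse once the stream has ended."""
    buffer = "".join(chunks)
    return json.loads(buffer)


# Hypothetical chunks as they might arrive from a streaming response
stream = ['{"issues": ["miss', 'ing null check"], ', '"severity": "high"}']
print(parse_streamed_json(stream))
```

Parsing any single chunk on its own raises `JSONDecodeError`, which is why the parse happens only after the join.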