You called the model and the response ends with “…as I was saying, the most important”. No closing quote, no period. Or your JSON parser fails because the model’s output ends "score": 0.8, "reason" with no value. Or the code block has no closing triple-backtick. The model didn’t get confused — it ran out of budget. The max_tokens parameter you set (or the default) capped generation, and the API truncated wherever the cap landed.
Truncation is one of the easiest bugs to confirm and one of the most common to ignore because the visible output looks “almost right.” Always check finish_reason before assuming the model misunderstood.
Common causes
1. max_tokens left at the SDK default
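The OpenAI Chat Completions API does not require max_tokens; if you omit it, generation is bounded only by the model's own output and context limits. Anthropic's Messages API requires max_tokens on every request, so the value is whatever your code or wrapper supplies, and many wrappers and sample snippets use 1024 or 2048. Long-form requests hit the cap silently.
How to spot it: Check what your SDK version or wrapper sends for max_tokens. If your code doesn't pass it, that default applies.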
2. max_tokens set conservatively years ago and never raised
You wrote max_tokens=500 for a chatbot in 2023. Now you’re using it for article generation. The number never got revisited.
How to spot it: Search the codebase for max_tokens=. Audit each value against current task length.
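The audit can be scripted. The following is a minimal sketch, assuming a Python codebase and that caps are written as literal numbers; the names find_token_caps and audit_tree and the 1024 threshold are illustrative, not part of any SDK.

```python
import re
from pathlib import Path

# Matches forms such as max_tokens=500, "max_tokens": 1024,
# and max_completion_tokens = 16_000 (literal numbers only).
CAP_PATTERN = re.compile(
    r"""\b(max_(?:completion_|output_)?tokens)["']?\s*[=:]\s*(\d[\d_]*)"""
)


def find_token_caps(source: str):
    """Return (line_number, parameter_name, value) for each hardcoded cap."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in CAP_PATTERN.finditer(line):
            value = int(match.group(2).replace("_", ""))
            hits.append((lineno, match.group(1), value))
    return hits


def audit_tree(root: str, threshold: int = 1024):
    """Yield (path, line_number, name, value) for caps at or below threshold."""
    for path in Path(root).rglob("*.py"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, name, value in find_token_caps(text):
            if value <= threshold:
                yield str(path), lineno, name, value
```

Caps computed at runtime (for example max_tokens=min(a, b)) are not matched, so treat the output as a starting list rather than a complete inventory.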
3. Reasoning tokens eat the budget
For o1, o3, Claude with extended thinking, and similar models, reasoning tokens count against the output cap (max_completion_tokens on OpenAI reasoning models, max_tokens on Claude). The model can spend most of that budget on internal reasoning, leaving the visible output tiny or empty.
How to spot it: On OpenAI, the response includes usage.completion_tokens_details.reasoning_tokens. If reasoning tokens exceed visible output tokens and finish_reason is length, the model thought a lot and said little.
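A quick way to see how much of the budget went to reasoning is to compute its share of completion tokens. This sketch assumes the usage object has been converted to a plain dict shaped like OpenAI's usage payload, and that completion_tokens already includes reasoning tokens; reasoning_share is a hypothetical helper name.

```python
def reasoning_share(usage: dict) -> float:
    """Fraction of completion tokens spent on reasoning (0.0 if not reported)."""
    completion = usage.get("completion_tokens", 0)
    # The details block may be missing or None on non-reasoning models.
    details = usage.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens", 0)
    return reasoning / completion if completion else 0.0
```

A share close to 1.0 together with finish_reason of length suggests the cap was consumed before any visible output was written.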
4. Streaming hides truncation
When streaming, your UI shows tokens as they arrive. When the stream ends, it ends — there’s no visible “truncated” badge unless you wire one. Users see a partial response and assume the model is done.
How to spot it: Streaming endpoints should still emit a final event with finish_reason. Check whether your client surfaces it.
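One way to make sure the client does not drop it: accumulate the text and keep the last non-null finish_reason seen. This is a sketch over plain dicts shaped like OpenAI chat completion chunks; if your SDK yields objects instead, the attribute access will differ. collect_stream is a hypothetical name.

```python
def collect_stream(chunks):
    """Concatenate streamed text and return (text, finish_reason)."""
    parts = []
    finish_reason = None
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:
            continue  # some chunks carry no choices, e.g. usage-only chunks
        choice = choices[0]
        delta = choice.get("delta") or {}
        if delta.get("content"):
            parts.append(delta["content"])
        if choice.get("finish_reason") is not None:
            finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason
```

If the returned finish_reason is length, the UI has what it needs to show a truncation badge.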
5. JSON mode + low max_tokens = invalid JSON
You enabled response_format={"type":"json_object"} to guarantee JSON. The model started a valid JSON object but ran out of tokens mid-object. Parser fails.
How to spot it: JSON parse error, output starts with { but ends without closing }. finish_reason: length.
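The two signals can be combined so that a parse failure caused by truncation is reported differently from genuinely malformed output. A minimal sketch, assuming you already have the raw text and the finish_reason; parse_json_or_flag is a hypothetical helper.

```python
import json


def parse_json_or_flag(text: str, finish_reason: str):
    """Return (parsed, None) on success, or (None, reason) on failure."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        if finish_reason == "length":
            # The model ran out of budget mid-object; retry with more tokens.
            return None, "truncated"
        return None, f"malformed: {exc.msg}"
```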
6. Long input + small max_tokens window
Some APIs cap total tokens (input + output). If your input is 100k tokens and the model has a 128k context, only ~28k is left for output. Depending on the provider, a request with max_tokens=50000 is then either rejected with an error or clamped to the remaining budget.
How to spot it: Output stops well below max_tokens you set. Check API error or usage logs.
7. Stop sequence accidentally inside content
You set stop=["END"]. Model generates a paragraph that contains “END” as a normal word. API truncates there.
How to spot it: finish_reason: stop and the output ends just before a word that matches a stop sequence.
Shortest path to fix
Step 1: Always check finish_reason
Before doing anything else, log it:
resp = client.chat.completions.create(...)
choice = resp.choices[0]
if choice.finish_reason == "length":
    raise RuntimeError("Output truncated by max_tokens")
finish_reason values: stop (natural end or stop sequence), length (hit max_tokens), content_filter (safety), tool_calls (function call).
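If you call more than one provider, it helps to normalize these values before alerting on them. The sketch below maps the OpenAI values listed above plus, as an assumption based on Anthropic's Messages API, the stop_reason values end_turn, max_tokens, stop_sequence, and tool_use. The category names and classify_finish are illustrative.

```python
# Provider-specific reason -> provider-neutral category.
_REASON_MAP = {
    "stop": "complete",           # OpenAI: natural end or stop sequence
    "end_turn": "complete",       # Anthropic: natural end
    "stop_sequence": "complete",  # Anthropic: hit a stop sequence
    "length": "truncated",        # OpenAI: hit the output cap
    "max_tokens": "truncated",    # Anthropic: hit the output cap
    "content_filter": "filtered",
    "tool_calls": "tool",
    "tool_use": "tool",
}


def classify_finish(reason):
    """Return 'complete', 'truncated', 'filtered', 'tool', or 'unknown'."""
    return _REASON_MAP.get(reason, "unknown")
```

Treating unknown values as their own category, rather than as complete, keeps new provider values from hiding truncation.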
Step 2: Size max_tokens by task
Rough budgets:
- Chat reply: 1000-2000
- Short summary: 500
- Article (1000 words): 4000
- Code generation, file-level: 8000
- Multi-file refactor: 16000+
When in doubt, set higher than you need. You’re billed only for tokens generated.
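The rough budgets above can live in one place instead of being scattered as literals. A minimal sketch; the task-type keys and budget_for are hypothetical names, and the numbers are the rough figures from the list, not measured values.

```python
TOKEN_BUDGETS = {
    "chat_reply": 2000,
    "short_summary": 500,
    "article": 4000,             # roughly a 1000-word article
    "code_file": 8000,
    "multi_file_refactor": 16000,
}


def budget_for(task_type: str) -> int:
    """Look up max_tokens for a task type; unknown types fail loudly."""
    try:
        return TOKEN_BUDGETS[task_type]
    except KeyError:
        raise KeyError(f"No token budget defined for task type {task_type!r}")
```

Raising on an unknown task type is a deliberate choice here: a silent global fallback is how a chatbot-sized cap ends up on an article generator.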
Step 3: For reasoning models, double the budget
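# OpenAI reasoning models (o1/o3) take max_completion_tokens, not max_tokens.
# The cap covers reasoning tokens and visible output together.
resp = client.chat.completions.create(
    model="o3",
    messages=[...],
    max_completion_tokens=16000,  # roughly 8k reasoning + 8k visible
)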
Look at usage.completion_tokens_details.reasoning_tokens to right-size.
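Right-sizing can be done from a handful of logged requests. This sketch assumes you have recorded reasoning and visible token counts per request; right_size and the 1.25 margin are illustrative choices, not provider guidance.

```python
import math


def right_size(reasoning_samples, visible_samples, margin=1.25):
    """Suggest max_completion_tokens from observed per-request token counts."""
    # Size for the largest combined request seen, then add headroom.
    peak = max(r + v for r, v in zip(reasoning_samples, visible_samples))
    return math.ceil(peak * margin)
```

For example, with observed pairs (6000, 1500) and (7200, 800), the peak is 8000 and the suggestion is 10000.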
Step 4: Detect truncation and recover
For prose, ask the model to continue:
if finish_reason == "length":
    cont = call_llm(messages + [
        {"role": "assistant", "content": partial_output},
        {"role": "user", "content": "Continue exactly where you left off. Do not repeat."}
    ])
    full_output = partial_output + cont
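A continuation can itself be truncated, so in practice this becomes a bounded loop. The sketch below assumes a hypothetical call_llm(messages) that returns a (text, finish_reason) tuple; adapt it to however your client exposes those two values.

```python
CONTINUE_PROMPT = "Continue exactly where you left off. Do not repeat."


def generate_with_continuation(call_llm, messages, max_rounds=3):
    """Keep asking for continuations while the output is cut off by the cap."""
    text, reason = call_llm(messages)
    rounds = 0
    while reason == "length" and rounds < max_rounds:
        follow_up = messages + [
            {"role": "assistant", "content": text},
            {"role": "user", "content": CONTINUE_PROMPT},
        ]
        more, reason = call_llm(follow_up)
        text += more
        rounds += 1
    # If reason is still "length" here, the caller should surface truncation.
    return text, reason
```

The max_rounds bound keeps a model that never finishes from looping indefinitely.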
For JSON, retry with higher max_tokens rather than concatenate — JSON continuation is fragile.
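The JSON retry can be sketched as a loop that doubles the budget up to a ceiling. It assumes a hypothetical call_llm(messages, max_tokens) returning (text, finish_reason); the starting budget and ceiling are placeholders to tune per task.

```python
import json


def json_with_retry(call_llm, messages, start_tokens=1000, ceiling=8000):
    """Re-run the whole prompt with a larger cap until the JSON completes."""
    budget = start_tokens
    while True:
        text, reason = call_llm(messages, budget)
        if reason != "length":
            # Not truncated: parse errors now indicate a real format problem.
            return json.loads(text)
        if budget >= ceiling:
            raise RuntimeError(f"JSON still truncated at max_tokens={budget}")
        budget = min(budget * 2, ceiling)
```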
Step 5: Stream and surface truncation in UI
When streaming, capture the final chunk’s finish_reason and badge the response:
{message.truncated && <span className="warn">Response was truncated — request more?</span>}
Step 6: For total-token caps, compute the budget
input_tokens = count_tokens(messages)
model_window = 128_000
safety_margin = 1000
max_output = model_window - input_tokens - safety_margin
max_tokens = min(desired_output, max_output)
Don’t hardcode max_tokens higher than the remaining context.
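The same arithmetic as a function, using the numbers from the 100k-input example earlier. This is a sketch: the 128,000 window and 1,000-token margin are assumptions to replace with your model's actual limits, and clamp_max_tokens is a hypothetical name.

```python
def clamp_max_tokens(input_tokens, desired_output, model_window=128_000, safety_margin=1000):
    """Return the largest safe max_tokens, or raise if the input leaves no room."""
    remaining = model_window - input_tokens - safety_margin
    if remaining <= 0:
        raise ValueError(
            f"Input of {input_tokens} tokens leaves no room for output "
            f"in a {model_window}-token window"
        )
    return min(desired_output, remaining)
```

With a 100,000-token input and a desired output of 50,000, this returns 27,000: 128,000 minus 100,000 minus the 1,000 margin.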
Step 7: Audit stop sequences
Make sure stop sequences are unique strings unlikely to appear in normal output. "\n\n" is risky for prose. "<|END|>" is safer.
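One way to audit is to test each candidate stop sequence against outputs you already know are good. A minimal sketch, assuming you have a sample of complete responses on hand; risky_stop_sequences is a hypothetical helper.

```python
def risky_stop_sequences(stop_sequences, sample_outputs):
    """Return {stop: count} for sequences that occur inside known-good outputs."""
    hits = {}
    for stop in stop_sequences:
        count = sum(1 for text in sample_outputs if stop in text)
        if count:
            hits[stop] = count
    return hits
```

Any sequence that shows up in the result would have cut off at least one legitimate response.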
When this is not on you
If you’re using a managed wrapper (LangChain, etc.), the default max_tokens may be set by the wrapper, not the underlying SDK. Check the wrapper docs — sometimes there’s a hidden cap.
Easy to misdiagnose as
A “model confusion” or “prompt clarity” issue. If the first half of the response is coherent and the second half doesn’t exist, it’s almost always max_tokens. Check finish_reason first, always.
Prevention
- Always log finish_reason and alert on length.
- Set max_tokens per task type, not a global default.
- For reasoning models, allocate ~2x the visible output you expect.
- For streaming UIs, surface truncation explicitly.
- When using JSON mode, set max_tokens 2x your expected JSON size.
- Use unambiguous stop sequences, never single newlines.
FAQ
- Is there a downside to setting max_tokens very high? It removes your ceiling on worst-case cost and latency, since a runaway generation can run until it hits the cap. Otherwise no, you only pay for tokens actually generated.
- Should I retry on finish_reason: length automatically? Yes for prose continuation. For JSON, raise max_tokens and re-call the whole prompt.