LLM Response Cut Off Mid-Sentence: max_tokens Too Low (2026 Fix)

The model's reply ends mid-sentence, mid-JSON, or with an unclosed code block. It is almost always the token cap. How to size it, detect truncation per SDK, and recover.

Published: May 24, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

You called the model and the reply ends with “…as I was saying, the most important”. No closing quote, no period. Or your JSON parser fails because the output stops at "score": 0.8, "reason" with no value. Or the code block has no closing triple-backtick. The model did not get confused. It ran out of budget. The token cap you set (or the SDK default) stopped generation, and the API truncated wherever the cap landed.

Fastest fix: raise the output-token cap and re-run. On the OpenAI Chat Completions API the field is max_completion_tokens (the old max_tokens is deprecated and rejected by reasoning models); on the OpenAI Responses API it is max_output_tokens; on Anthropic it is max_tokens. Set it to roughly 2x the longest reply you expect, then confirm the stop field came back clean (details below).

Truncation is one of the easiest bugs to confirm and one of the most ignored, because the visible output looks “almost right.” Always check the stop field before assuming the model misunderstood the prompt.

First: which stop field tells you it was truncated

The single field that proves truncation has a different name on each API, and the truncated value differs too. This trips up most people, so check the right one for your SDK:

| API / SDK | Field to read | Value that means “truncated” |
| --- | --- | --- |
| OpenAI Chat Completions | `choices[0].finish_reason` | `length` |
| OpenAI Responses API | `response.status` + `response.incomplete_details.reason` | `incomplete` + `max_output_tokens` |
| Anthropic Messages API | `response.stop_reason` | `max_tokens` (or `model_context_window_exceeded`) |

If you read finish_reason on an Anthropic response you will get nothing useful, because Anthropic calls it stop_reason. That mismatch is itself a common reason truncation goes undetected.

Common causes

1. The token cap left at the SDK default

The OpenAI Python SDK does not impose a small default output cap on chat requests. Anthropic’s max_tokens, by contrast, is a required parameter with no default, so whatever value was copied in first tends to stay (their quickstarts use max_tokens=1024), and many wrappers default to 1024 or 2048. Long-form requests hit that cap silently.

How to spot it: search your call sites for the output-token field. If your code does not pass it, the SDK or wrapper default applies, so look up that default for the version you run.

2. The cap set conservatively years ago and never raised

You wrote max_tokens=500 for a chatbot in 2023. Now the same client powers article generation. The number never got revisited.

How to spot it: grep the codebase for max_tokens=, max_completion_tokens=, and max_output_tokens=. Audit each value against current task length.
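
If plain grep output is hard to audit, a small script can list each hard-coded cap with its value. The following is a minimal sketch, not a complete linter: the function name find_token_caps and the regular expression are assumptions made for illustration, and it only catches literal integers written as keyword arguments or dictionary entries.

```python
import re

# Matches the three cap parameter names used in this article, written either
# as keyword arguments (name=123) or as dict entries ("name": 123).
CAP_PATTERN = re.compile(
    r"""["']?\b(max_completion_tokens|max_output_tokens|max_tokens)["']?\s*[=:]\s*(\d[\d_]*)"""
)


def find_token_caps(source: str) -> list[tuple[int, str, int]]:
    """Return (line_number, parameter_name, value) for each hard-coded cap."""
    hits = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for match in CAP_PATTERN.finditer(line):
            name, raw_value = match.groups()
            # Python allows underscores in integer literals (16_000).
            hits.append((line_number, name, int(raw_value.replace("_", ""))))
    return hits
```

Run it over each source file's text and compare every reported value against the length of reply that call site produces today. Caps read from environment variables or config files will not show up, so treat an empty result as "no literals found", not "no caps set".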

3. Reasoning tokens consume the budget (and the accounting differs by vendor)

For reasoning models (OpenAI GPT-5.5 in Thinking/Pro mode, Claude Opus 4.7 / Sonnet 4.6 with extended thinking, Gemini 3.1 Pro), the model spends a large share of the run on hidden internal reasoning before any visible text.

The accounting is not the same on each API, as of June 2026:

OpenAI: max_completion_tokens caps reasoning tokens and visible output together. If reasoning uses most of the cap, you can get an incomplete / length result with little or no visible text and still be billed for the reasoning. OpenAI recommends reserving at least 25,000 tokens for reasoning plus output when you start with these models, and tuning reasoning_effort (low / medium / high) to control hidden spend.
Anthropic: with extended thinking enabled, max_tokens must be larger than your configured thinking budget_tokens, because thinking output counts toward max_tokens. Size max_tokens as thinking budget plus the visible answer you want.

How to spot it: inspect usage.completion_tokens_details.reasoning_tokens (OpenAI) or the thinking blocks and usage (Anthropic). If reasoning tokens far exceed visible output tokens, the model thought a lot and said little.
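
To turn that inspection into a number you can log, compute the share of billed output that went to hidden reasoning. This is a rough sketch with hypothetical helper names. It assumes the OpenAI-style accounting in which completion_tokens already includes reasoning_tokens; verify that against your own usage payloads before relying on it.

```python
def reasoning_share(completion_tokens: int, reasoning_tokens: int) -> float:
    """Fraction of billed completion tokens spent on hidden reasoning."""
    if completion_tokens <= 0:
        return 0.0
    return reasoning_tokens / completion_tokens


def visible_tokens(completion_tokens: int, reasoning_tokens: int) -> int:
    """Tokens left for visible text, assuming reasoning is counted inside completion."""
    return max(completion_tokens - reasoning_tokens, 0)
```

A call that reports 2000 completion tokens with 1900 of them as reasoning has a share of 0.95 and only 100 visible tokens: the model thought a lot and said little, which is the pattern to alert on.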

4. Streaming hides truncation

When streaming, your UI prints tokens as they arrive. When the stream ends, it ends — there is no “truncated” badge unless you wire one. Users see a partial reply and assume the model finished.

How to spot it: streaming responses still deliver the stop field in the terminal event (the final chunk’s finish_reason on OpenAI Chat, the message_delta event’s stop_reason on Anthropic, the response.completed / response.incomplete event on the Responses API). Check whether your client reads it.
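
One way to make sure the stop field is not dropped is to collect it in the same loop that collects the text. The sketch below models Chat Completions stream chunks as plain dicts shaped like the JSON payload, which is a simplification (the SDK yields objects, not dicts), and consume_chat_stream is a hypothetical name.

```python
def consume_chat_stream(chunks):
    """Collect streamed text and the last finish_reason from Chat-style chunks."""
    parts = []
    finish_reason = None
    for chunk in chunks:
        choices = chunk.get("choices") or []
        if not choices:
            continue  # some chunks carry no choices, e.g. a usage-only chunk
        choice = choices[0]
        text = (choice.get("delta") or {}).get("content")
        if text:
            parts.append(text)
        # finish_reason is null on every chunk except the terminal one.
        if choice.get("finish_reason") is not None:
            finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason
```

If the returned finish_reason is length, mark the message as truncated in your UI state instead of treating the end of the stream as a normal completion.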

5. JSON / structured output + low cap = invalid JSON

You enabled JSON or Structured Outputs to guarantee parseable JSON. The model started a valid object but ran out of tokens mid-object. The parser fails.

How to spot it: JSON parse error, output starts with { but ends without a closing }, and the stop field is length / max_tokens / incomplete. Structured Outputs guarantees the schema of a completed response — it does not protect you if generation is cut off by the token cap.
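
It helps to separate "the JSON was cut off" from "the model produced bad JSON", because the fixes differ. A minimal sketch, assuming you have already pulled the stop value out of the response as a string; parse_json_reply and TRUNCATED_VALUES are illustrative names, not part of any SDK.

```python
import json

# Stop-field values that indicate truncation, per the table in this article.
TRUNCATED_VALUES = {"length", "max_tokens", "incomplete"}


def parse_json_reply(text: str, stop_value: str):
    """Return (data, problem); problem is None, "truncated", or "malformed"."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError:
        # Same parse error, different cause: check the stop field to tell them apart.
        problem = "truncated" if stop_value in TRUNCATED_VALUES else "malformed"
        return None, problem
```

A "truncated" result calls for a higher cap and a full re-run; a "malformed" result with a clean stop field points at the prompt or schema instead.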

6. Long input plus a small remaining window

The output cap cannot exceed the context window minus your input. If your input is 100k tokens and the model window is 128k, only about 28k is left for output. Setting max_completion_tokens=50000 will either clamp to the remaining budget or error.

How to spot it: output stops well below the cap you set. On Anthropic you may see stop_reason: model_context_window_exceeded; on the Responses API, incomplete_details.reason. Check usage logs.

7. Stop sequence accidentally inside the content

You set stop=["END"] (OpenAI) or stop_sequences=["END"] (Anthropic). The model generates a paragraph that contains “END” as an ordinary word. The API truncates there.

How to spot it: finish_reason: stop / stop_reason: stop_sequence, and the output ends right before a word that matches a stop sequence.

Shortest path to fix

Step 1: Read the right stop field, every call

Before anything else, log it. The field name depends on your SDK (see the table above).

OpenAI Chat Completions:

resp = client.chat.completions.create(...)
choice = resp.choices[0]
if choice.finish_reason == "length":
    raise RuntimeError("Output truncated by token cap")

finish_reason values: stop (natural end or stop sequence), length (hit the cap), content_filter (safety), tool_calls (function call).

Anthropic Messages:

resp = client.messages.create(...)
if resp.stop_reason == "max_tokens":
    raise RuntimeError("Output truncated by max_tokens")

stop_reason values: end_turn, max_tokens, stop_sequence, tool_use, pause_turn, refusal, model_context_window_exceeded.

OpenAI Responses API:

resp = client.responses.create(...)
if resp.status == "incomplete" and resp.incomplete_details.reason == "max_output_tokens":
    raise RuntimeError("Output truncated by max_output_tokens")
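
If one codebase talks to more than one of these APIs, the three checks can sit behind a single helper so that no call site reads the wrong field. This is a sketch under stated assumptions: responses are passed as plain dicts shaped like the JSON bodies, and the labels openai_chat, openai_responses, and anthropic are hypothetical names chosen for this example.

```python
def is_truncated(api: str, response: dict) -> bool:
    """Return True when the response was cut off by a token or window limit."""
    if api == "openai_chat":
        return response["choices"][0]["finish_reason"] == "length"
    if api == "openai_responses":
        details = response.get("incomplete_details") or {}
        return (
            response.get("status") == "incomplete"
            and details.get("reason") == "max_output_tokens"
        )
    if api == "anthropic":
        return response.get("stop_reason") in {
            "max_tokens",
            "model_context_window_exceeded",
        }
    raise ValueError(f"Unknown API label: {api!r}")
```

Raising on an unknown label is deliberate: a silent False would recreate the undetected-truncation problem this step is meant to remove.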

Step 2: Size the output cap by task

Rough budgets (output tokens, excluding any hidden reasoning):

Chat reply: 1000-2000
Short summary: 500
Article (about 1000 words): 4000
Code generation, file-level: 8000
Multi-file refactor: 16000+

When in doubt, set the cap higher than you need. You are billed only for tokens actually generated, so unused headroom under the cap costs nothing.
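
In code, the rough budgets above can live in one lookup table instead of being scattered across call sites. A minimal sketch: the task labels, the DEFAULT_CAP fallback, and output_cap_for are all hypothetical, and the numbers are the article's rough budgets (upper end where a range is given), not measured values.

```python
# Output-token caps per task type, excluding hidden reasoning.
OUTPUT_CAPS = {
    "short_summary": 500,
    "chat_reply": 2000,
    "article": 4000,
    "code_file": 8000,
    "multi_file_refactor": 16000,
}

# Hypothetical fallback for task types nobody has sized yet.
DEFAULT_CAP = 4000


def output_cap_for(task: str) -> int:
    """Look up the output cap for a task type, falling back to DEFAULT_CAP."""
    return OUTPUT_CAPS.get(task, DEFAULT_CAP)
```

Keeping the table in one place also makes the audit from cause 2 a one-file review.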

Step 3: For reasoning models, budget reasoning separately

On OpenAI, the cap covers reasoning plus visible text, so size it generously and tune effort:

# GPT-5.5 (Thinking) on the Responses API
resp = client.responses.create(
    model="gpt-5.5",
    input=[...],
    reasoning={"effort": "medium"},   # low | medium | high
    max_output_tokens=25000,          # OpenAI suggests >= 25k for reasoning + output
)

On Anthropic, max_tokens must exceed the thinking budget:

# Claude Opus 4.7 with extended thinking
resp = client.messages.create(
    model="claude-opus-4-7",
    thinking={"type": "enabled", "budget_tokens": 8000},
    max_tokens=16000,   # must be > budget_tokens; leaves ~8k for the visible answer
    messages=[...],
)

Look at usage.completion_tokens_details.reasoning_tokens (OpenAI) or the run’s usage (Anthropic) to right-size.
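
The two sizing rules reduce to simple arithmetic, sketched below. The helper names are hypothetical, the 25,000 floor is the starting reserve this article attributes to OpenAI, and reasoning_allowance is your own estimate rather than an API parameter.

```python
OPENAI_REASONING_FLOOR = 25_000  # starting reserve for reasoning plus output


def openai_cap(visible_answer_tokens: int, reasoning_allowance: int) -> int:
    """One cap covers reasoning and visible text, so add them and apply the floor."""
    return max(OPENAI_REASONING_FLOOR, visible_answer_tokens + reasoning_allowance)


def anthropic_max_tokens(budget_tokens: int, visible_answer_tokens: int) -> int:
    """max_tokens must exceed budget_tokens; the excess is room for the answer."""
    if visible_answer_tokens <= 0:
        raise ValueError("Leave room for a visible answer above the thinking budget")
    return budget_tokens + visible_answer_tokens
```

With a thinking budget of 8000 and an 8000-token answer, anthropic_max_tokens gives 16000, which matches the example above.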

Step 4: Detect truncation and recover

For prose, ask the model to continue from where it stopped:

if truncated:  # finish_reason == "length" / stop_reason == "max_tokens"
    cont = call_llm(messages + [
        {"role": "assistant", "content": partial_output},
        {"role": "user", "content": "Continue exactly where you left off. Do not repeat."}
    ])
    full_output = partial_output + cont

For JSON or tool calls, retry the whole prompt with a higher cap rather than concatenate — JSON continuation is fragile and a half-finished tool_use block cannot be stitched.
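
For prose, an automatic continuation should be bounded, otherwise a reply that keeps hitting the cap loops forever. The sketch below assumes a hypothetical call_llm(messages) callable that returns a (text, truncated) pair; wire it to whichever SDK and stop-field check you use.

```python
CONTINUE_PROMPT = "Continue exactly where you left off. Do not repeat."


def complete_with_continuation(call_llm, messages, max_rounds=3):
    """Call the model, then ask it to continue while the reply is truncated.

    call_llm is a hypothetical callable: call_llm(messages) -> (text, truncated).
    Returns (full_text, still_truncated) after at most max_rounds continuations.
    """
    text, truncated = call_llm(messages)
    rounds = 0
    while truncated and rounds < max_rounds:
        follow_up = messages + [
            {"role": "assistant", "content": text},
            {"role": "user", "content": CONTINUE_PROMPT},
        ]
        more, truncated = call_llm(follow_up)
        text += more
        rounds += 1
    return text, truncated
```

If the second return value is still True after the last round, surface that to the caller rather than presenting the text as complete.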

Step 5: Stream and surface truncation in the UI

When streaming, capture the terminal event’s stop field and badge the message:

{message.truncated && <span className="warn">Response was truncated. Request more?</span>}

Step 6: For window-bound caps, compute the budget

input_tokens = count_tokens(messages)
model_window = 1_000_000   # Opus 4.7 / Sonnet 4.6 / Gemini 3.1 Pro standard, as of June 2026
safety_margin = 1000
max_output = model_window - input_tokens - safety_margin
output_cap = min(desired_output, max_output)

Do not set the output cap higher than the remaining context. Note that an in-app chat window is not the same as the API window: ChatGPT Plus carries roughly 320 pages of in-app context (the full 1M is API-side or the $200 Pro tier), so an API call and a paste into the web UI behave differently.
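
The same budget calculation as a reusable function, with the window passed in rather than hard-coded. This is a sketch: window_bound_cap is a hypothetical name, and you still need a tokenizer or token-counting call for your provider to produce input_tokens.

```python
def window_bound_cap(desired_output: int, input_tokens: int,
                     model_window: int, safety_margin: int = 1000) -> int:
    """Clamp the desired output cap to what the context window has left."""
    remaining = model_window - input_tokens - safety_margin
    if remaining <= 0:
        raise ValueError("Input leaves no room for output; shorten the prompt")
    return min(desired_output, remaining)
```

Using the numbers from cause 6, a 100k-token input in a 128k window with a 1000-token margin leaves 27000, so window_bound_cap(50_000, 100_000, 128_000) returns 27000 instead of the 50000 you asked for.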

Step 7: Audit stop sequences

Make stop sequences unique strings unlikely to appear in normal output. "\n\n" is risky for prose. A sentinel like "<|END|>" is far safer.
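
A quick way to audit candidates is to check them against replies you have already logged. A minimal sketch, with risky_stop_sequences as a hypothetical helper name; it only tells you about the samples you feed it.

```python
def risky_stop_sequences(stop_sequences, sample_outputs):
    """Map each stop sequence to the number of sample outputs that contain it."""
    counts = {}
    for stop in stop_sequences:
        hits = sum(1 for text in sample_outputs if stop in text)
        if hits:
            counts[stop] = hits
    return counts
```

Any sequence that appears in the result would have cut at least one real reply short, so replace it before shipping.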

How to confirm it is fixed

Re-run the exact request that failed.
Assert the stop field is clean: finish_reason == "stop", or stop_reason == "end_turn", or status == "completed". It must NOT be length / max_tokens / incomplete.
For JSON, parse the output and assert it loads without error.
Add the assertion to your test suite or an alert so the next regression is caught automatically, not by a user.

When this is not on you

If you use a managed wrapper (LangChain and similar), the default cap may be set by the wrapper, not the underlying SDK, and the wrapper may still pass max_tokens to a reasoning model that now requires max_completion_tokens. Check the wrapper version and docs — sometimes there is a hidden cap or an outdated parameter name.
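
If you cannot upgrade the wrapper right away, one stopgap is to rename the cap field just before the request is sent. The sketch below is an assumption-heavy illustration: the API labels and normalize_cap are hypothetical, and where you can hook it in depends on the wrapper.

```python
# Cap field per API, as named in this article. The dictionary keys are
# hypothetical labels for illustration, not SDK identifiers.
CAP_FIELD = {
    "openai_chat": "max_completion_tokens",
    "openai_responses": "max_output_tokens",
    "anthropic": "max_tokens",
}
KNOWN_CAP_FIELDS = ("max_tokens", "max_completion_tokens", "max_output_tokens")


def normalize_cap(api: str, params: dict) -> dict:
    """Return a copy of params with the cap under the field name the API expects."""
    target = CAP_FIELD[api]
    present = [name for name in KNOWN_CAP_FIELDS if name in params]
    if len(present) > 1:
        raise ValueError(f"Conflicting cap fields: {present}")
    cleaned = {k: v for k, v in params.items() if k not in KNOWN_CAP_FIELDS}
    if present:
        cleaned[target] = params[present[0]]
    return cleaned
```

The function returns a new dict and leaves the input untouched, so it is safe to apply to shared default parameters.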

Easy to misdiagnose as

A “model confusion” or “prompt clarity” issue. If the first half of the reply is coherent and the second half does not exist, it is almost always the token cap, not the prompt. Check the stop field first, always.

Prevention

Log the stop field on every call and alert on length / max_tokens / incomplete.
Set the output cap per task type, not as a single global default.
For reasoning models, reserve budget for hidden reasoning (OpenAI: include it in the cap, start near 25k; Anthropic: max_tokens larger than budget_tokens).
For streaming UIs, surface truncation explicitly.
When using JSON / Structured Outputs, set the cap to about 2x your expected JSON size.
Use unambiguous stop sequences, never bare newlines such as "\n\n".

FAQ

My code still uses max_tokens with OpenAI and now errors. What changed?

OpenAI deprecated max_tokens on Chat Completions in favor of max_completion_tokens, and reasoning models reject the old name outright. Rename the field. On the newer Responses API the field is max_output_tokens.

Why does Anthropic ignore my finish_reason check?

Anthropic does not return finish_reason. It returns stop_reason, and the truncated value is max_tokens (not length). Check the right field for the right SDK.

Is there a downside to setting the cap very high?

Latency and a higher worst-case bill: a longer reply takes longer to generate, and a runaway reply costs more. But you only pay for tokens actually generated, so unused headroom is free.

Should I retry automatically on truncation?

Yes for prose continuation. For JSON or tool calls, raise the cap and re-run the whole prompt rather than stitching a partial response.

The reply is short but the stop field says it completed normally. Now what?

That is not truncation. It is a prompt or model-behavior issue. See the related articles below on lists ending early and format problems.

Tags: #Prompt engineering #Troubleshooting #llm-output #max-tokens #api-config #truncation