Local LLM Output Truncated Mid-Token (Ollama / llama.cpp)

Your local model stops mid-word with no EOS token. Diagnose num_predict limits, the VRAM-based num_ctx default, stop sequences, proxy buffering, and UTF-8 byte splits.

Published: May 25, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

Your local Llama 3.1 8B or Qwen2.5 model, running under Ollama or llama.cpp, generates a response and then stops abruptly in the middle of a word ("The recommended approach is to use Docke") with no end-of-sequence (EOS) token, no error, and no sign that anything went wrong. The client just stops receiving data.

Fastest fix: check the finish reason first. In Ollama’s /api/generate reply look at done_reason; in any OpenAI-compatible client (/v1, vLLM, llama-server) look at choices[0].finish_reason.

If it reads length, you hit a token cap: raise num_predict / max_tokens (or set num_predict to -1) and the truncation goes away.

If it reads stop but the text is clearly unfinished, the token cap is usually not the cause. More likely the context window filled, a stop sequence matched inside the body, a proxy buffered the stream, or your client decoded a half-finished UTF-8 character. One exception worth knowing: as of June 2026 Ollama’s OpenAI-compatible endpoint has a known bug where it can report finish_reason: stop even when output was actually capped by max_tokens, so when in doubt, cross-check the native done_reason. Work through the buckets below in order.

Which bucket are you in

Symptom signature	Most likely cause	Jump to
`done_reason` / `finish_reason` is `length`	`num_predict` / `max_tokens` cap hit	Cause 1
Cuts at the same spot regardless of `num_predict`; `stop` reason; long prompt	`num_ctx` too small, context filled	Cause 2
Stops right after `\n\n`, `###`, or a code fence	Stop sequence matched mid-body	Cause 3
Direct call to `:11434` works; proxied call truncates	Reverse-proxy buffering the stream	Cause 4
Garbled last character, often a CJK glyph; stream only	UTF-8 split across token chunks	Cause 5
Same cut every run, same prompt, deterministic	Corrupt GGUF or aggressive-quant decode bug	Cause 6

Common causes

Ordered by hit rate, highest first.

1. num_predict / max_tokens cap hit at a mid-word token

The most common cause, and the only one that returns a length finish reason. Tokenizers split words into subword pieces — Docker might tokenize as ["Do", "cker"]. If the cap lands on the Do token, the output ends with Do, which reads as a mid-word cut even though the limit was honored exactly.

Two traps make this confusing:

Ollama’s native endpoints (/api/generate, /api/chat) silently ignore the OpenAI-style max_tokens parameter. You must pass num_predict inside the options object instead (as of June 2026 a native max_tokens alias is still only a feature request).
Only the OpenAI-compatible endpoint /v1/chat/completions accepts max_tokens and maps it to num_predict internally.

How to spot it: the finish reason is length, and your cap is a round number (128, 256, 512, 1024). Ollama’s documented default num_predict is 128, so an unset limit on a native call often stops at exactly 128 tokens. Increase the cap and the cut disappears.
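A toy illustration of why an honored cap can still read as a mid-word cut. The subword split below is hypothetical (real tokenizers differ by model), and `generate_with_cap` is an illustrative helper, not an Ollama or llama.cpp function; the end of the token list stands in for EOS.

```python
def generate_with_cap(tokens, num_predict):
    """Emit tokens until the cap is hit or the list (standing in for EOS) ends."""
    if num_predict < 0:
        emitted = list(tokens)                # -1 style: no cap, run to the end
    else:
        emitted = list(tokens[:num_predict])  # cap honored exactly
    reason = "stop" if len(emitted) == len(tokens) else "length"
    return "".join(emitted), reason

# Hypothetical subword split of a short sentence.
tokens = ["The", " recommended", " approach", " is", " to", " use", " Do", "cker", "."]

capped_text, capped_reason = generate_with_cap(tokens, 7)
full_text, full_reason = generate_with_cap(tokens, -1)
print(repr(capped_text), capped_reason)
print(repr(full_text), full_reason)
```

The capped run ends on the " Do" piece and reports length; nothing was lost in transit, the limit simply fell between two pieces of one word.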

2. Context window (num_ctx) filled during generation

This is the cause whose default changed most recently, so it catches people whose articles or tooling predate the change. As of June 2026, modern Ollama no longer uses a flat 2048-token default. It sizes num_ctx from available VRAM: roughly 4K context under 24 GiB VRAM, 32K for 24–48 GiB, and 256K at 48 GiB or more (set it explicitly with OLLAMA_CONTEXT_LENGTH or the num_ctx option). Older builds, third-party wrappers, and many Modelfiles still pin 2048. When the prompt plus generated tokens reach num_ctx, Ollama does not return an error: it silently drops the oldest prompt tokens to make room, and the reply can end mid-sentence with done_reason: stop.

How to spot it: count prompt tokens and add your num_predict. If the sum approaches num_ctx, that is the cause. Run ollama show <model> (or check --ctx-size on llama-server) to see the active value; a 2048 there on a long prompt is the smoking gun.
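The budget check above can be sketched as a small helper. `context_budget` is a hypothetical name; the prompt token count is assumed to come from your own tokenizer or from the `prompt_eval_count` field that Ollama's native response reports, and the numbers below are made up for illustration.

```python
def context_budget(prompt_tokens, num_predict, num_ctx):
    """Return (fits, headroom) for a prompt plus a capped response.

    headroom is the number of context slots left over; a negative value
    means the context fills before the response cap is reached.
    """
    if num_predict < 0:
        raise ValueError("pass a positive num_predict; -1/-2 have no fixed budget")
    headroom = num_ctx - (prompt_tokens + num_predict)
    return headroom >= 0, headroom

# Illustrative numbers: an 1800-token prompt with a 512-token cap.
small = context_budget(1800, 512, 2048)
large = context_budget(1800, 512, 8192)
print("num_ctx=2048:", small)
print("num_ctx=8192:", large)
```

A negative headroom at the active num_ctx is the signature of this cause: the reply is cut off before the num_predict cap can be reached.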

One caution when you raise num_ctx: the KV cache grows with the context window, and if it overflows VRAM Ollama offloads it to CPU RAM, which can drop throughput from 50–100 tok/s to 2–5 tok/s. Size the bump to what the prompt actually needs rather than maxing it out.
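To get a feel for how fast the KV cache grows, here is a back-of-the-envelope estimate. It assumes an unquantized fp16 cache (2 bytes per element) and the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128); the runtime's real allocation can differ, for example with a quantized KV cache, so treat the figures as rough.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, num_ctx, bytes_per_elem=2):
    """Approximate KV cache size: keys and values for every layer, KV head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_elem

GIB = 1024 ** 3

# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads, head_dim 128.
for num_ctx in (4096, 8192, 32768, 131072):
    size = kv_cache_bytes(32, 8, 128, num_ctx)
    print(f"num_ctx={num_ctx:>6}: {size / GIB:.2f} GiB")
```

Under these assumptions the cache costs 128 KiB per token, so doubling num_ctx doubles the cache. The cache comes on top of the model weights.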

3. Stop sequence matching inside the body

A stop sequence such as "\n\n", "###", or "<|eot_id|>" matches text in the middle of the response instead of at the end. The server cuts the output at the first match, which often lands on a paragraph break or on a code fence (every code block opens with a triple backtick).

How to spot it: inspect every stop sequence in the request. In Ollama check ollama show <model> --modelfile | grep -i stop and the stop field in your API call; on llama-server check --stop flags. Re-run the same prompt with no stop sequences; if it completes, a stop string was the trigger.
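Once a run without stop sequences gives you a complete reply, you can scan that reply offline to see which stop string would have fired first. A minimal sketch; `first_stop_hit` is a hypothetical helper and the sample reply is invented.

```python
def first_stop_hit(text, stop_sequences):
    """Return (index, sequence) of the earliest stop match in text, or None."""
    hits = [(text.find(s), s) for s in stop_sequences if s and s in text]
    return min(hits) if hits else None

FENCE = "`" * 3  # triple backtick, built this way to keep the example readable

# Invented complete reply, as returned by a run with no stop sequences.
complete = (
    "Use a bridge network.\n\nExample:\n"
    + FENCE + "\ndocker network create app\n" + FENCE
    + "\n### Notes\nDone."
)

hit = first_stop_hit(complete, ["###", "\n\n", FENCE])
truncated = complete[:hit[0]]
print(hit)
print(repr(truncated))
```

Here the blank line after the first sentence fires before either the fence or the heading, so the reply would have ended after one sentence.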

4. Reverse proxy buffering the stream

If Ollama or llama-server sits behind nginx, Caddy, or a load balancer without streaming configured, the proxy can buffer the response and flush it on a timeout, cutting the stream mid-token. The model produced a complete answer; it never reached the client intact.

How to spot it: make the identical request directly to the Ollama port (http://127.0.0.1:11434) bypassing the proxy. If that completes but the proxied path truncates, the proxy is the culprit.
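One way to compare the two paths is to capture each stream and check whether the final done marker arrived. The sketch below assumes Ollama's native streaming format (one JSON object per line, with a last object carrying "done": true and a done_reason); `stream_summary` is a hypothetical helper and the sample lines are hand-written.

```python
import json

def stream_summary(ndjson_lines):
    """Return (text, done_seen, done_reason) for a captured NDJSON stream."""
    parts, done, reason = [], False, None
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            break  # a half-written line means the stream was cut in transit
        parts.append(obj.get("response", ""))
        if obj.get("done"):
            done, reason = True, obj.get("done_reason")
    return "".join(parts), done, reason

# Hand-written samples shaped like /api/generate streaming output.
direct = [
    '{"response": "1, 2", "done": false}',
    '{"response": ", 3", "done": false}',
    '{"response": "", "done": true, "done_reason": "stop"}',
]
proxied = direct[:1] + ['{"response": ", 3", "do']  # cut mid-line by the proxy

direct_result = stream_summary(direct)
proxied_result = stream_summary(proxied)
print(direct_result)
print(proxied_result)
```

A stream that never delivers the done object was cut between the server and the client, whatever the model itself did.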

5. UTF-8 character split across streaming chunks

In streaming mode the server emits one token at a time, and a single multi-byte Unicode character (any CJK glyph, emoji, or accented letter) can straddle a token or chunk boundary. A client that decodes each chunk’s bytes independently, instead of buffering until a valid character completes, drops or mangles the trailing partial character, which looks like a mid-token cut. This is why CJK output garbles far more often than ASCII: a Chinese character is 3 UTF-8 bytes, so a chunk can end after its first or second byte, whereas a 1-byte ASCII letter can never be split.

How to spot it: switch the same call to non-streaming ("stream": false). If the truncation vanishes, the bug is in your client’s stream decoder, not in the model.
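The failure is easy to reproduce offline. The sketch below splits the UTF-8 bytes of a short Chinese string at boundaries that fall inside characters, then compares per-chunk decoding with decoding the joined bytes; the chunk boundaries are chosen by hand for illustration and `naive_decode` is a hypothetical stand-in for a buggy client.

```python
def naive_decode(chunks):
    """Decode each chunk on its own, as a buggy hand-rolled client would."""
    return "".join(c.decode("utf-8", errors="replace") for c in chunks)

data = "容器网络".encode("utf-8")             # 4 characters, 3 bytes each
chunks = [data[:4], data[4:8], data[8:]]      # every boundary lands mid-character

naive = naive_decode(chunks)
joined = b"".join(chunks).decode("utf-8")     # decode only after all bytes arrived

print(repr(naive))
print(repr(joined))
print("contains U+FFFD:", "\ufffd" in naive)
```

The bytes are identical in both cases; only the decoding strategy differs, which is why the non-streaming call looks healthy.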

6. Corrupt GGUF or aggressive-quant decode bug

A partially downloaded or corrupted GGUF can make the llama.cpp backend stop decoding at the damaged tensor boundary. Separately, very aggressive quants (IQ2/IQ3 tiers) can emit a stray NULL or garbage byte on certain token sequences, which a downstream reader treats as end-of-stream.

How to spot it: the cut is deterministic. With the same prompt and seed, generation stops at the identical position on every run, and raising the token cap or context size does not move it. Re-run with a Q4_K_M or Q8_0 build of the same model; if the truncation disappears, the previous file or quant was the problem.

Shortest path to fix

Step 1: Check the finish reason, then raise the token cap

# OpenAI-compatible endpoint (works for Ollama /v1, vLLM, llama-server)
import openai
client = openai.OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="qwen2.5:7b",
    messages=[{"role": "user", "content": "Explain Docker networking in detail"}],
    max_tokens=2048,            # raise this well above expected length
)
print("finish_reason:", resp.choices[0].finish_reason)  # want "stop", not "length"
print(resp.choices[0].message.content)

On Ollama’s native API, max_tokens is ignored — set num_predict inside options. Use -1 to generate until EOS (or context), -2 to fill the context window:

curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Explain Docker networking in detail",
  "stream": false,
  "options": { "num_predict": -1, "num_ctx": 8192 }
}' | python3 -m json.tool | grep -E '"(done_reason|response)"'

done_reason: length confirms a cap was hit; done_reason: stop means the model emitted EOS (so look at Step 2 onward).
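If you log replies from both API styles, a small normalizer keeps the check in one place. A sketch under the assumption that replies are available as plain dicts (for example parsed JSON); `finish_reason_of` and `next_step` are hypothetical helpers, and the sample replies are trimmed to the relevant fields.

```python
def finish_reason_of(reply):
    """Return the finish reason from a native Ollama reply or an OpenAI-style reply."""
    if "done_reason" in reply:                 # native /api/generate or /api/chat
        return reply["done_reason"]
    choices = reply.get("choices") or []       # OpenAI-compatible /v1
    if choices:
        return choices[0].get("finish_reason")
    return None

def next_step(reason):
    """Map a finish reason to the next thing to check."""
    if reason == "length":
        return "raise num_predict / max_tokens"
    if reason == "stop":
        return "check num_ctx, stop sequences, proxy, decoder"
    return "no finish reason: the stream probably ended early"

native_reply = {"response": "The recommended approach is to use Do", "done": True, "done_reason": "length"}
openai_reply = {"choices": [{"message": {"content": "..."}, "finish_reason": "stop"}]}

for reply in (native_reply, openai_reply, {}):
    reason = finish_reason_of(reply)
    print(reason, "->", next_step(reason))
```

A reply with no finish reason at all usually means the final chunk never arrived, which points at the transport rather than at the model.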

Step 2: Set num_ctx large enough that the context never fills

# llama-server
./llama-server -m models/llama-3.1-8b-instruct-Q4_K_M.gguf \
  --ctx-size 8192 --n-predict 2048

# Ollama: set it per call (preferred) ...
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b", "prompt": "your long prompt",
  "options": { "num_ctx": 8192, "num_predict": 2048 }
}'

# ... or globally for the server (restart ollama serve to apply)
OLLAMA_CONTEXT_LENGTH=8192 ollama serve

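Confirm the value actually took effect. ollama show llama3.1:8b lists any num_ctx pinned in the Modelfile; a stale 2048 there is the most common reason a global num_ctx bump appears to do nothing. On recent Ollama builds, ollama ps also reports the context size of the currently loaded model.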

Step 3: Audit and strip stop sequences

ollama show llama3.1:8b --modelfile | grep -i stop

# OpenAI-compatible call, reusing `client` from Step 1: disable stop sequences to isolate
resp = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Explain Docker networking"}],
    max_tokens=2048,
    stop=None,                   # was ["###", "\n\n"]; re-add one at a time
)
print("finish_reason:", resp.choices[0].finish_reason)

If output completes with stop=None, re-add stop strings one at a time to find the offender. Never put a triple-backtick fence or a bare \n\n in stop for prose or code output.

Step 4: Test direct vs. proxied, then fix the proxy

# Direct to Ollama, bypassing any proxy
curl -N -s http://127.0.0.1:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Count from 1 to 50", "stream": true}'

If the direct call completes but the proxied one truncates, disable buffering on the proxy. For nginx:

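location /api/ {
    proxy_pass http://127.0.0.1:11434;
    proxy_http_version 1.1;        # talk HTTP/1.1 to the upstream so chunked responses pass through
    proxy_buffering off;           # forward each chunk as it arrives
    proxy_cache off;
    proxy_read_timeout 300s;       # allow long generations
    proxy_send_timeout 300s;
    chunked_transfer_encoding on;
}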

Step 5: Buffer streaming bytes until each UTF-8 character is complete

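# Never decode raw chunk bytes individually; let an incremental decoder hold partial characters
import codecs
import sys

def safe_stream(byte_chunks):
    # The decoder keeps a trailing partial character until its remaining bytes arrive
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in byte_chunks:
        text = decoder.decode(chunk)          # returns only complete characters
        if text:
            sys.stdout.write(text); sys.stdout.flush()
    tail = decoder.decode(b"", final=True)    # flush; a dangling partial char becomes U+FFFD
    if tail:
        sys.stdout.write(tail); sys.stdout.flush()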

If you use the official OpenAI or ollama-python SDK, this is handled for you — the bug only appears in hand-rolled byte readers.

Step 6: Verify GGUF integrity if the cut is deterministic

# Inspect metadata without running inference
python3 -c "
import gguf
r = gguf.GGUFReader('models/llama-3.1-8b-instruct-Q4_K_M.gguf')
print('Tensors:', len(r.tensors))
print('Arch:', r.fields['general.architecture'])
"

If the file is incomplete, re-download and verify its SHA256 against the file card on the Hugging Face repo, and prefer a Q4_K_M or Q8_0 build over IQ2/IQ3 quants for decode stability.
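The checksum comparison can be done with the standard library alone. A minimal sketch: `sha256_of` and `matches` are hypothetical helper names, and the expected digest is whatever the model's file card lists.

```python
import hashlib

def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in 1 MiB blocks so multi-gigabyte GGUFs never load fully into RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

def matches(path, expected_hex):
    """Compare the file's digest with a published one, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Call it as matches("models/llama-3.1-8b-instruct-Q4_K_M.gguf", expected) with the published digest pasted in as expected; a False result means the download should be repeated.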

How to confirm it’s fixed

The finish reason now reads stop (done_reason on Ollama native, finish_reason on /v1) and the text ends on a complete sentence.
Re-run with roughly 2x your longest realistic prompt — the response still completes.
For CJK or emoji output, the final character renders correctly with no replacement glyph (�).
If you fixed a proxy, the proxied path now matches the direct :11434 path byte for byte.
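The first and third checks can be automated as a rough regression test. This is a heuristic sketch, not a definitive completeness test: `completion_problems` is a hypothetical helper, and the list of acceptable endings is an assumption you may need to adjust for your output style.

```python
FENCE = "`" * 3  # a reply may legitimately end on a closing code fence
TERMINATORS = (".", "!", "?", ":", ")", "。", "！", "？", FENCE)

def completion_problems(text, finish_reason):
    """Return the reasons a reply still looks truncated (empty list = looks fine)."""
    problems = []
    if finish_reason != "stop":
        problems.append(f"finish reason is {finish_reason!r}, expected 'stop'")
    if "\ufffd" in text:
        problems.append("contains a U+FFFD replacement character")
    if not text.rstrip().endswith(TERMINATORS):
        problems.append("does not end on sentence punctuation or a closing fence")
    return problems

good = completion_problems("Use a user-defined bridge network.", "stop")
bad = completion_problems("The recommended approach is to use Docke", "length")
print(good)
print(bad)
```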

Prevention

Always log the finish reason. stop is healthy; length means you capped it; an unfinished stop means context, stop sequence, proxy, or decoder.
Set num_predict to -1, or to at least 2x your longest expected response — never trust the framework default (128 on Ollama native).
Set num_ctx to (longest prompt + longest response + ~512 headroom). Don’t assume the old 2048 default; as of June 2026 it is VRAM-derived and varies by machine, so pin it explicitly.
Keep \n\n, ###, and code fences out of stop sequences; test any stop string in isolation first.
Put proxy_buffering off and a 300s read timeout on any reverse proxy in front of Ollama before deploying.
Buffer streaming bytes to a UTF-8 character boundary, or use an SDK that already does.
Verify SHA256 on large GGUF downloads and prefer Q4_K_M/Q8_0 over IQ2/IQ3 for production decoding.
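The sizing rules in this list can be folded into one request builder so every native call pins its limits explicitly. A sketch: `build_generate_payload` is a hypothetical helper, the token counts are your own measurements, and the returned dict follows the /api/generate request shape used earlier in this article.

```python
def build_generate_payload(model, prompt, longest_prompt, longest_response, headroom=512):
    """Build a native /api/generate request body with explicit limits.

    num_ctx     = longest prompt + longest response + headroom
    num_predict = 2x the longest expected response
    """
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_ctx": longest_prompt + longest_response + headroom,
            "num_predict": 2 * longest_response,
        },
    }

# Illustrative measurements: 3000-token prompts, 1500-token replies.
payload = build_generate_payload("llama3.1:8b", "Explain Docker networking in detail", 3000, 1500)
print(payload["options"])
```

Note that the two rules interact: a reply that really does run to twice the expected length would overrun this num_ctx, so raise the headroom if such replies are plausible.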

FAQ

Q: Did the model reach EOS, or was it truncated? A: On Ollama’s /api/generate check done_reason: stop = a real EOS/stop token, length = num_predict exhausted. On any OpenAI-compatible endpoint (/v1, vLLM, llama-server) check choices[0].finish_reason, where stop and length mean the same thing.

Q: Ollama ignores my max_tokens — why? A: Because the native /api/generate and /api/chat endpoints don’t recognize max_tokens; they use num_predict inside the options object. Only /v1/chat/completions accepts max_tokens (and maps it to num_predict). As of June 2026 a native max_tokens alias is still an open feature request.

Q: Why is output clean in the Ollama chat UI but truncated via the API? A: The interactive UI generates until EOS, while your API call is capping output, usually through a max_tokens default set by your framework or wrapper, or a num_predict: 128 on a native call. Set the cap explicitly (2048+ or -1).

Q: Why does my Chinese (or emoji) output garble at the very end but English doesn’t? A: A CJK character is 3 UTF-8 bytes (most emoji are 4), so a streaming chunk can end partway through it, while a 1-byte ASCII letter can never be split. Buffer bytes until they decode cleanly (Step 5), or switch to "stream": false.

Q: finish_reason is stop but the sentence is clearly cut off. A: Usually something other than the token cap ended the output. The common causes are a filled num_ctx (raise it, Step 2), a stop sequence matching inside the body (Step 3), a chat-template mismatch that injects a premature <|eot_id|> / <|im_end|>, or a proxy cutting the stream (Step 4). There is also one upstream bug to rule out: as of June 2026 Ollama’s /v1 endpoint can return finish_reason: stop even when output was actually truncated by max_tokens, so cross-check the native done_reason (or just raise the cap and see if the text grows) before chasing the other causes.

Q: Can I resume a truncated reply instead of regenerating? A: Yes — append the partial text as an assistant message and send a short continue user turn. Each continuation grows the prompt, though, so you will eventually hit num_ctx; raising the limit up front is cleaner.
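The resume pattern from the last answer, sketched as two small helpers. `continuation_messages` and `stitch` are hypothetical names and the message list uses the usual role/content chat format. Plain concatenation assumes the model picks up exactly where it stopped, which is not guaranteed, so inspect the seam.

```python
def continuation_messages(history, partial_reply, nudge="continue"):
    """Return a new message list that asks the model to resume a truncated reply."""
    return list(history) + [
        {"role": "assistant", "content": partial_reply},
        {"role": "user", "content": nudge},
    ]

def stitch(partial_reply, continuation):
    """Join the truncated reply and its continuation without inserting whitespace."""
    return partial_reply + continuation

history = [{"role": "user", "content": "Explain Docker networking in detail"}]
partial = "The recommended approach is to use Do"

messages = continuation_messages(history, partial)
print([m["role"] for m in messages])
print(stitch(partial, "cker's user-defined bridge networks."))
```

Send messages to the chat endpoint as usual. Each round adds the partial reply to the prompt, which is why repeated continuations eventually hit num_ctx.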
