Your local Llama 3.1 8B running under Ollama 0.4 or llama.cpp generates a response and then cuts off abruptly in the middle of a word, for example “The recommended approach is to use Docke”, with no EOS token, no error, and no indication that anything went wrong. The client simply stops receiving data. This is distinct from the model choosing to end a response early: generation or delivery stopped before the word was complete, either at a token boundary that falls inside the word or partway through the stream. It surfaces in Ollama, vLLM, and llama-server and has six distinct root causes, each of which is fixable.
Common causes
Ordered by hit rate, highest first.
1. max_tokens / num_predict hit exactly at a mid-word token
The most common cause. LLM tokenizers split words into subword pieces. “Docker” might be tokenized as [“Do”, “cker”] or [“Docker”]. If num_predict (Ollama), n_predict (llama.cpp), or max_tokens (OpenAI-compatible APIs) is set to a value that lands exactly on the “Do” token, the output ends with “Do”, which looks like truncation mid-word even though the limit was respected exactly.
How to spot it: Check your API call parameters. If max_tokens, num_predict, or n_predict is a suspiciously round number (128, 256, 512, 1024) and the output stops at about that many tokens, it is very likely the culprit. Raise the limit well above the expected response length; if the cut-off point moves or disappears, the limit was the cause.
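To make the mechanism concrete, here is a minimal sketch of a hard token limit cutting between subword pieces. The token split is hypothetical (it is not real Llama 3.1 tokenizer output), and truncate_at_limit is an illustrative helper, not part of any library.

```python
# Toy illustration: a hard token limit cuts between subword pieces.
# The token list below is a made-up split, not real tokenizer output.

def truncate_at_limit(tokens, max_tokens):
    """Keep at most max_tokens tokens and report why generation ended."""
    kept = tokens[:max_tokens]
    # "length" mirrors the finish reason a server reports when the limit is hit.
    finish_reason = "length" if len(tokens) > max_tokens else "stop"
    return "".join(kept), finish_reason


tokens = ["The", " recommended", " approach", " is", " to", " use", " Do", "cker", "."]

# A limit of 7 lands exactly on the " Do" piece of "Docker".
print(truncate_at_limit(tokens, max_tokens=7))

# A generous limit lets the sentence finish.
print(truncate_at_limit(tokens, max_tokens=64))
```

The first call ends on “Do” with reason "length", which is exactly the symptom described above; the second returns the full sentence with reason "stop".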
2. Stop sequence matching a prefix in the model output
A stop sequence like "\n\n" or "###" or "<|eot_id|>" is matching text in the middle of the response rather than at the end. The server stops generating at the match point, streaming or not, which can be in the middle of a code block or sentence that happened to contain those characters.
How to spot it: Inspect the full list of stop sequences in your API request. In Ollama, check the stop parameter in your Modelfile or API call. In llama-server, check --stop flags. Run the same prompt without any stop sequences to see if output completes.
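The effect of an over-eager stop sequence can be reproduced offline. The sketch below assumes the server cuts output at the earliest match of any stop string; apply_stop_sequences is a hypothetical helper written for illustration, not an Ollama or llama.cpp function.

```python
def apply_stop_sequences(text, stop_sequences):
    """Cut text at the earliest occurrence of any stop sequence.

    Returns the surviving text and the stop string that matched (or None).
    """
    cut = len(text)
    matched = None
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1 and idx < cut:
            cut, matched = idx, stop
    return text[:cut], matched


full = "Step 1: pull the image.\n\nStep 2: run the container."

# A blank line between paragraphs matches "\n\n" and drops everything after it.
print(apply_stop_sequences(full, ["\n\n", "###"]))

# With no stop sequences the full answer survives.
print(apply_stop_sequences(full, []))
```

Feeding a known-good completion through a check like this shows which configured stop string would have fired, and where.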
3. Streaming buffer flushed prematurely by a proxy or reverse proxy
If Ollama or llama-server is behind nginx, Caddy, or another reverse proxy without proper streaming configuration, the proxy may buffer the response and flush it after a timeout — cutting the stream mid-token. The model itself generated a complete response; it never reached the client.
How to spot it: Make the same API request directly to the Ollama port (e.g., localhost:11434) without going through the proxy. If the response completes, the proxy is truncating.
4. Client-side streaming reader exits before all chunks arrive
If you’re reading the streaming response with a custom client that doesn’t wait for the final chunk ("done": true in Ollama’s native API, or the data: [DONE] SSE marker on OpenAI-compatible endpoints), it may close the connection before the model has finished. The partial output that was buffered client-side gets decoded, resulting in a mid-token cut.
How to spot it: Check your client code for response.close(), context manager exits, or timeouts that could interrupt the stream. Add logging to confirm whether the final "done": true chunk from Ollama was received.
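A streaming reader that only trusts the final chunk might look like the sketch below. It assumes Ollama’s native newline-delimited JSON format, where each chunk carries a "response" fragment and the last one has "done": true. The collect_stream helper and the sample chunks are made up for illustration.

```python
import json


def collect_stream(lines):
    """Accumulate "response" fragments until a chunk with "done": true arrives.

    lines: any iterable of NDJSON lines (str or bytes).
    Returns (text, done_reason, completed). completed is False when the
    stream ended without a final chunk, i.e. the output was cut off.
    """
    parts = []
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue  # skip keep-alive blank lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            return "".join(parts), chunk.get("done_reason"), True
    return "".join(parts), None, False


sample = [
    '{"response": "Use Do", "done": false}',
    '{"response": "cker", "done": false}',
    '{"response": "", "done": true, "done_reason": "stop"}',
]

print(collect_stream(sample))      # complete stream
print(collect_stream(sample[:1]))  # connection dropped after the first chunk
```

With the requests library, the same function could be fed from response.iter_lines() on a request made with stream=True; logging the completed flag makes a prematurely closed connection visible.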
5. Context window exhausted during generation
If the prompt was very long and the num_ctx / --ctx-size is set too small, the model may truncate at the point where the KV cache is full, cutting off generation even though max_tokens wasn’t reached. This looks identical to a mid-token truncation.
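How to spot it: Count the prompt tokens with the model’s own tokenizer (llama_tokenize, llama-server’s /tokenize endpoint, or the prompt_eval_count field in Ollama’s response) and add num_predict. tiktoken’s built-in encodings target OpenAI models, so they give only a rough estimate for Llama. If the sum approaches or exceeds your num_ctx setting, context exhaustion is the cause.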
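The arithmetic is simple enough to script. This sketch assumes you already have a prompt token count from one of the sources above; context_budget is a hypothetical helper and the numbers are examples only.

```python
def context_budget(prompt_tokens, num_predict, num_ctx):
    """Return the tokens left in the context window after prompt + generation.

    A negative result means the window overflows before num_predict is reached.
    """
    return num_ctx - (prompt_tokens + num_predict)


# An 1800-token prompt with num_predict=512 does not fit in a 2048-token window.
print(context_budget(prompt_tokens=1800, num_predict=512, num_ctx=2048))

# The same request fits comfortably with num_ctx=8192.
print(context_budget(prompt_tokens=1800, num_predict=512, num_ctx=8192))
```

A negative or near-zero result points at num_ctx rather than num_predict as the setting to raise.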
6. GGUF file corruption or incomplete download
A partially downloaded or corrupted GGUF file can cause the llama.cpp backend to silently stop decoding at the corrupted tensor boundary. The model appears to work for short prompts but truncates at the same position on longer generations.
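How to spot it: Run llama-cli -m model.gguf -p "<prompt>" -n 512 on several unrelated prompts. If every run stops at the same token position, below the -n limit and regardless of prompt content, the GGUF may be corrupted. Repeating a single prompt with an identical seed is not a useful test, because it reproduces the same output either way.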
Shortest path to fix
Step 1: Raise max_tokens well above your expected response length
# Ollama API — explicitly set a generous num_predict
curl -s http://localhost:11434/api/generate \
-d '{
"model": "llama3.1:8b",
"prompt": "Explain Docker networking in detail",
"stream": false,
"options": {
"num_predict": 2048,
"num_ctx": 8192
}
}' | python3 -m json.tool | grep -A2 '"response"'
Set num_predict to -1 to remove the limit entirely (generates until EOS):
curl -s http://localhost:11434/api/generate \
-d '{"model": "llama3.1:8b", "prompt": "your prompt", "options": {"num_predict": -1}}'
Step 2: Audit and simplify stop sequences
# Check what stop sequences are active in your Modelfile
ollama show llama3.1:8b --modelfile | grep -i stop
Remove all non-essential stop sequences and test:
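# OpenAI-compatible call: remove the stop parameter entirely to test
from openai import OpenAI

# Assumes Ollama's OpenAI-compatible endpoint on the default port; the API key is a placeholder
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Explain Docker networking"}],
    max_tokens=2048,
    # stop=["###", "\n\n"],  # Comment out to isolate
)
print(response.choices[0].finish_reason)  # "stop" or "length"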
Step 3: Test direct vs. proxied connection
# Direct to Ollama (bypass any proxy)
curl -N -s http://127.0.0.1:11434/api/generate \
-d '{"model": "llama3.1:8b", "prompt": "Count from 1 to 50", "stream": true}' \
| grep -o '"response":"[^"]*"' | tr -d '\n'
If this completes fully but the proxied version truncates, update your nginx config:
location /api/ {
    proxy_pass http://127.0.0.1:11434;
    proxy_buffering off;
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;
    chunked_transfer_encoding on;
}
Step 4: Increase num_ctx to prevent context exhaustion during generation
# llama-server
./llama-server \
-m models/llama-3.1-8b-instruct-Q4_K_M.gguf \
--ctx-size 8192 \
--n-predict 2048
# Ollama Modelfile
cat > /tmp/Modelfile << 'EOF'
FROM llama3.1:8b
PARAMETER num_ctx 8192
PARAMETER num_predict 2048
EOF
ollama create llama3.1-ctx8k -f /tmp/Modelfile
Step 5: Verify GGUF integrity
# Dump GGUF metadata without running inference (gguf-dump is installed with the Python gguf package)
gguf-dump models/llama-3.1-8b-instruct-Q4_K_M.gguf | head -n 40
# Or read it from Python with the same package
python3 -c "
import gguf
reader = gguf.GGUFReader('models/llama-3.1-8b-instruct-Q4_K_M.gguf')
print(f'Tensors: {len(reader.tensors)}')
print(f'Arch: {reader.fields[\"general.architecture\"]}')
"
If the file is corrupted, re-download it and verify the SHA256 against the HuggingFace repository checksum.
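The checksum comparison can be done with the standard library alone. In this sketch the helper names are illustrative, and the expected hash is something you copy from the model’s file listing; no real checksum is shown here.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in 1 MiB blocks so multi-gigabyte GGUF files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare the file's SHA256 against a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Usage (path and hash are placeholders to fill in yourself):
# matches_checksum("models/llama-3.1-8b-instruct-Q4_K_M.gguf", "<sha256 from the repository>")
```

A False result means the local file differs from the published one and should be downloaded again.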
Prevention
- Always set num_predict to -1 or to a value 2x larger than your expected longest response; never leave it at framework defaults.
- Test stop sequences in isolation before adding them to a production Modelfile or API wrapper.
- Configure nginx/Caddy with proxy_buffering off and proxy_read_timeout 300 before deploying Ollama behind a reverse proxy.
- After downloading large GGUF files, verify their SHA256 against the HuggingFace file card before loading.
- Set num_ctx to at least (longest expected prompt + longest expected response + 512) tokens of headroom.
- When building streaming clients, always read until the [DONE] SSE event or "done": true JSON field rather than reading a fixed number of bytes.
- Log the finish reason from the API response: "finish_reason": "stop" is normal; "length" indicates a max_tokens hit.
FAQ
Q: How do I know if the model reached EOS or was truncated?
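A: In Ollama’s /api/generate response, check "done": true and "done_reason". A value of "stop" means generation ended on an EOS token or a configured stop sequence; "length" means num_predict was exhausted. If no chunk with "done": true arrived at all, the stream was cut before the server finished. In vLLM’s OpenAI-compatible response, check choices[0].finish_reason.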
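As a rough sketch of that check, the function below classifies a parsed response from either API shape. It assumes only the fields named in the answer above ("done", "done_reason", and choices[0].finish_reason); diagnose_finish and its return strings are made up for illustration.

```python
def diagnose_finish(payload):
    """Classify how generation ended from a parsed JSON response.

    Accepts either an Ollama-native final chunk or an OpenAI-compatible
    chat completion dict.
    """
    if "choices" in payload:
        reason = payload["choices"][0].get("finish_reason")
    else:
        if not payload.get("done"):
            return "incomplete: no final chunk received"
        reason = payload.get("done_reason")

    if reason == "stop":
        return "complete: EOS or stop sequence"
    if reason == "length":
        return "truncated: token limit reached"
    return f"unknown finish reason: {reason!r}"


print(diagnose_finish({"done": True, "done_reason": "length"}))
print(diagnose_finish({"choices": [{"finish_reason": "stop"}]}))
print(diagnose_finish({"done": False, "response": "Use Do"}))
```

Logging this string alongside every response makes it easy to tell a token-limit hit from a dropped stream after the fact.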
Q: Does temperature affect truncation behavior?
A: No. Truncation is a generation length / stop sequence / context length issue, not a sampling issue. Temperature only affects which token is picked at each step, not when generation stops.
Q: I get clean output in the Ollama chat UI but truncation via the API — why?
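A: The chat UI and your API call are probably running with different length limits. The OpenAI SDK does not send max_tokens unless you set it, so a low limit (often 128 or 256) usually comes from a wrapper, a framework default, or the server’s own default. Explicitly set max_tokens: 2048 in your API call.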
Q: Why does truncation always happen at the same token count?
A: If the truncation is deterministic across different prompts, the most likely cause is a hard-coded num_predict or max_tokens in your wrapper code, or a stop sequence that always matches at that position. Check every layer of your stack for hardcoded limits.