Chat-Template Mismatch Produces Garbage Local LLM Output

A local LLM echoes your prompt, prints literal [INST] or <|im_start|> tags, or loops the same sentence. That is a chat-template mismatch. Find the model's real template and force the engine to use it.

Published: May 25, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

You load Mistral-7B-Instruct-v0.3 in llama-server and send a chat message. Instead of a clean reply, the model echoes your entire message back, sprinkles [INST] markers into the middle of the response, then starts writing what looks like the next user turn by itself. Or you load Llama 3 and the output literally begins with <|im_start|>assistant as plain text. That is a chat-template mismatch: the tokenizer and the inference engine disagree on how to wrap the conversation, so the model receives a token sequence that looks like mid-conversation injection instead of a clean prompt, and it generates noise.

Fastest fix (as of June 2026): run llama-server with the --jinja flag. That tells llama.cpp to use the Jinja chat template embedded in the GGUF (parsed by its built-in minja engine) instead of guessing a named template. --jinja is on by default in recent builds, but pass it explicitly so older configs do not silently fall back. If the GGUF has no embedded template, jump to Step 1 to read the model’s real template and pass the correct named one.

Which bucket are you in?

Match your exact symptom to the most likely cause before changing anything.

Symptom you see	Most likely cause	Go to
Literal `[INST]`, `<|im_start|>`, or `<|eot_id|>` text in the reply	Wrong named template for the model family	Cause 1 / Step 2
Output is coherent but never stops, or repeats one line	Stop / EOS token not registered for this template	Cause 6 / Step 5
Model just continues your text, never “answers”	You loaded a `base` model, not `instruct`	Cause 2
Works in Ollama, garbage in raw `llama.cpp`	Engine applied no template (raw completion mode)	Cause 3
First token is garbled or wrong language	Double BOS token prepended	Cause 5 / Step 4
System prompt leaks into the visible answer	System message placed outside the template	Cause 4
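
The first row of the table can be checked mechanically instead of by eye. The following is a minimal sketch: the helper name find_leaked_markers and the marker list are assumptions for illustration (the list only covers the formats discussed in this article and is not exhaustive).

```python
# Role/control markers that should never appear as literal text in a reply.
# Assumption: this list covers only the formats named in this article.
ROLE_MARKERS = [
    "[INST]", "[/INST]",
    "<|im_start|>", "<|im_end|>",
    "<|eot_id|>", "<|start_header_id|>", "<|end_header_id|>",
]


def find_leaked_markers(reply: str) -> list[str]:
    """Return every marker that occurs verbatim in the model's visible reply."""
    return [m for m in ROLE_MARKERS if m in reply]


if __name__ == "__main__":
    print(find_leaked_markers("Sure! [INST] What is 2+2? [/INST] 4"))
    # ['[INST]', '[/INST]']
```

An empty list does not prove the template is right; it only rules out the most visible symptom.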

Common causes

Ordered by hit rate, highest first.

1. Wrong named chat template for the model family

llama.cpp, llama-server, and Ollama each ship a list of built-in template names. If you pass --chat-template llama2 to a Llama 3 model, the engine wraps your message in [INST] ... [/INST] instead of the <|begin_of_text|>...<|eot_id|> format Llama 3 was trained on. The model never saw Llama 2 framing and produces incoherent output.

A frequent 2026 trap: the template name is more specific than people expect. Mistral-7B-Instruct-v0.3 needs mistral-v3, not mistral (which is not a valid name). Qwen2.5 uses chatml, not a qwen2 value. Mistral Nemo uses mistral-v3-tekken; Mistral Large 2411 uses mistral-v7.

How to spot it: run ./llama-cli --chat-template-help (or check the supported-templates wiki) and compare the names against the chat_template Jinja string in the model’s HuggingFace tokenizer_config.json.
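
Comparing a Jinja string against a list of names is easier with a rough first guess. The sketch below maps marker tokens found in a chat_template string to the llama.cpp template names mentioned in this article. It is a heuristic under stated assumptions: guess_template_name is a hypothetical helper, and an [INST] match cannot distinguish Llama 2 from the different Mistral versions, so that case still needs a manual check.

```python
# Marker token -> llama.cpp template name, using only names from this article.
# Heuristic sketch: order matters, and the [INST] case is deliberately vague.
FAMILY_HINTS = [
    ("<|start_header_id|>", "llama3"),
    ("<|im_start|>", "chatml"),
    ("<start_of_turn>", "gemma"),
    ("[INST]", "mistral-* or llama2 (check the exact version)"),
]


def guess_template_name(chat_template: str) -> str:
    """Return a likely template name for a Jinja chat_template string."""
    for marker, name in FAMILY_HINTS:
        if marker in chat_template:
            return name
    return "unknown"


if __name__ == "__main__":
    sample = "{{ '<|im_start|>' + message['role'] + '\n' }}"
    print(guess_template_name(sample))  # chatml
```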

2. You loaded a base model, not the instruct version

Base (pre-trained) checkpoints never went through SFT/RLHF, so they do not respond to any chat template. They only continue whatever tokens you feed them, which reads as endless rambling that ignores your question.

How to spot it: check the GGUF filename and the HuggingFace repo name for Instruct, Chat, or -it. If it says base or has no suffix, you have the wrong checkpoint. No template will fix a base model.
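
The filename check can be scripted when you manage many files. This is a sketch only: looks_instruction_tuned is a hypothetical helper, the regular expression encodes just the three suffixes named above, and naming conventions vary between uploaders, so treat a False result as a prompt to read the model card rather than as proof.

```python
import re

# Matches "Instruct", "Chat", or a "-it"/"_it" suffix followed by a separator
# or the end of the name. Heuristic: uploaders are free to name files anything.
_INSTRUCT_RE = re.compile(r"(instruct|chat|[-_]it)(?=[-_.]|$)", re.IGNORECASE)


def looks_instruction_tuned(name: str) -> bool:
    """Guess from a GGUF filename or repo name whether the model is instruct-tuned."""
    return bool(_INSTRUCT_RE.search(name))


if __name__ == "__main__":
    for name in ("Mistral-7B-Instruct-v0.3-Q4_K_M.gguf",
                 "gemma-2-9b-it-Q4_K_M.gguf",
                 "Llama-3.1-8B-Q4_K_M.gguf"):
        print(name, looks_instruction_tuned(name))
```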

3. GGUF has no embedded template, so the engine used a generic one

convert_hf_to_gguf.py copies the tokenizer’s chat_template into GGUF metadata. If the conversion used an old script or a third-party converter, the template may be missing, and the engine falls back to a generic format that does not match the model. The same garbage appears in raw llama-cli (which by default does no template wrapping) but not in Ollama (which applies its own).

How to spot it:

python3 -c "import gguf; r = gguf.GGUFReader('model.gguf'); print(r.fields.get('tokenizer.chat_template'))"

If the output is None, the template is missing from the GGUF and you must supply it explicitly.

4. System prompt placed outside the template structure

Some models (Mistral v0.3, for example) expect the system prompt folded into an [INST] block rather than sent as a standalone system turn; ChatML models such as Qwen2.5 do accept a separate system turn. If your client sends {"role": "system", "content": "..."} and the template does not handle a separate system role, the system text lands in the wrong place and can leak into the visible answer.

How to spot it: drop the system message and send only the user message. If output improves, system-role placement is the problem.
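
If dropping the system message helps but you still need the instructions, one client-side workaround is to fold the system text into the first user message before sending. The sketch below assumes OpenAI-style message dicts; fold_system_into_user is a hypothetical helper, and this is a workaround on the client, not a description of what any particular template does internally.

```python
def fold_system_into_user(messages: list[dict]) -> list[dict]:
    """Prepend all system text to the first user message and drop system turns.

    The input list is not modified. If there is no user message, the system
    text is dropped, so check for that case in real code.
    """
    system_text = "\n\n".join(
        m["content"] for m in messages if m["role"] == "system"
    )
    rest = [dict(m) for m in messages if m["role"] != "system"]
    if system_text:
        for m in rest:
            if m["role"] == "user":
                m["content"] = f"{system_text}\n\n{m['content']}"
                break
    return rest


if __name__ == "__main__":
    msgs = [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "Explain Docker networking."},
    ]
    print(fold_system_into_user(msgs))
```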

5. BOS (beginning-of-sequence) token added twice

Some wrappers prepend the BOS token manually before calling the tokenizer, which prepends it again. The Jinja template may add one more. The model was trained on sequences that start with exactly one BOS, so a duplicate pushes it off-distribution from the very first position: symptoms include a garbled first token or a prefix in the wrong language. Recent llama.cpp builds print a warning along the lines of "added a BOS token to the prompt as the prompt already starts with a BOS token".

How to spot it: start llama-server with --verbose and read the warning line, or inspect the leading token IDs. If the model’s BOS id (1 for Llama 2 and Mistral, 128000 for Llama 3) appears twice at the start, something is double-prepending it.
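
Inspecting the leading token IDs is a short loop once you have the tokenized prompt. The sketch below assumes you already have the IDs as a list of integers and that you read bos_id from your own model's metadata; count_leading_bos is a hypothetical helper name.

```python
def count_leading_bos(token_ids: list[int], bos_id: int) -> int:
    """Count how many consecutive BOS tokens open the sequence."""
    count = 0
    for token in token_ids:
        if token != bos_id:
            break
        count += 1
    return count


if __name__ == "__main__":
    # Illustrative IDs only; use the real tokenizer output for your model.
    ids = [1, 1, 733, 16289]
    n = count_leading_bos(ids, bos_id=1)
    print(f"{n} leading BOS token(s)")  # 2 leading BOS token(s)
```

Anything above 1 means some layer in the stack is adding BOS on top of another.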

6. Stop / EOS token not registered, or history in wrong role order

If the engine does not know the model’s end-of-turn token, the model “answers,” then keeps going and hallucinates the next user turn or loops one sentence forever. Separately, if your client sends roles as [assistant, user, assistant] instead of strictly alternating [user, assistant, user], the model sees an unprompted assistant turn and produces confused, repetitive output.

How to spot it: confirm --jinja or the named template registers the right stop tokens (<|eot_id|> for Llama 3, <|im_end|> for ChatML, </s> for Mistral), and log the exact JSON sent to /v1/chat/completions to verify the role sequence strictly alternates.
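
The role-sequence check is easy to automate on the logged JSON. A minimal sketch, assuming OpenAI-style message dicts with only system, user, and assistant roles (tool roles are not handled); role_order_problems is a hypothetical helper name.

```python
def role_order_problems(messages: list[dict]) -> list[str]:
    """Return a list of role-order problems; an empty list means the order is fine.

    System messages are ignored; the remaining turns must start with 'user'
    and must never repeat the same role twice in a row.
    """
    problems = []
    turns = [m["role"] for m in messages if m["role"] != "system"]
    if turns and turns[0] != "user":
        problems.append(f"first non-system turn is '{turns[0]}', expected 'user'")
    for i in range(1, len(turns)):
        if turns[i] == turns[i - 1]:
            problems.append(f"turns {i - 1} and {i} are both '{turns[i]}'")
    return problems


if __name__ == "__main__":
    bad = [{"role": r, "content": "..."} for r in ("assistant", "user", "assistant")]
    print(role_order_problems(bad))
```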

Shortest path to fix

Step 1: Find the model’s real chat template

The authoritative source is the model’s own tokenizer, not your memory.

# From the HuggingFace hub (works for open repos)
python3 - << 'EOF'
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
print(tok.chat_template)
EOF

For gated models (Llama 3 needs a HuggingFace login/token), download the config file directly:

hf download meta-llama/Llama-3.1-8B-Instruct tokenizer_config.json --local-dir .
python3 -c "import json; d=json.load(open('tokenizer_config.json')); print(d['chat_template'])"

To read what is actually baked into your GGUF:

python3 - << 'EOF'
import gguf
r = gguf.GGUFReader("model-Q4_K_M.gguf")
for f in r.fields.values():
    if "chat_template" in f.name:
        print("=== GGUF chat_template ===")
        print(bytes(f.parts[-1]).decode("utf-8"))
EOF
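
Once you have both strings (the HuggingFace chat_template and the one read from the GGUF), a diff shows whether the conversion preserved it. This sketch uses only the standard library; template_diff is a hypothetical helper, and ignoring trailing whitespace per line is an assumption about what counts as an insignificant difference.

```python
import difflib


def template_diff(hf_template: str, gguf_template: str) -> list[str]:
    """Unified diff of two chat templates, ignoring trailing whitespace per line.

    An empty list means the templates match under that normalization.
    """
    a = [line.rstrip() for line in hf_template.strip().splitlines()]
    b = [line.rstrip() for line in gguf_template.strip().splitlines()]
    return list(difflib.unified_diff(a, b, "huggingface", "gguf", lineterm=""))


if __name__ == "__main__":
    hf = "{% for m in messages %}\n<|im_start|>{{ m.role }}\n{% endfor %}"
    gg = "{% for m in messages %}\n[INST] {{ m.content }} [/INST]\n{% endfor %}"
    print("\n".join(template_diff(hf, gg)) or "templates match")
```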

Step 2: Force the correct template in llama-server

Prefer --jinja — it uses the GGUF-embedded template and is the most reliable path in 2026:

./llama-server -m model-Q4_K_M.gguf --jinja

If the embedded template is missing or wrong, pass an explicit named template. Use the exact current names (verified against llama.cpp June 2026):

# Llama 3 / 3.1 / 3.2
./llama-server -m models/Llama-3.1-8B-Instruct-Q4_K_M.gguf --chat-template llama3

# Mistral 7B Instruct v0.3  (NOT "mistral")
./llama-server -m models/Mistral-7B-Instruct-v0.3-Q4_K_M.gguf --chat-template mistral-v3

# Mistral Nemo (tekken tokenizer)
./llama-server -m models/Mistral-Nemo-Instruct-2407-Q4_K_M.gguf --chat-template mistral-v3-tekken

# Qwen 2 / 2.5  (uses ChatML, there is no "qwen2" name)
./llama-server -m models/Qwen2.5-7B-Instruct-Q4_K_M.gguf --chat-template chatml

# Gemma
./llama-server -m models/gemma-2-9b-it-Q4_K_M.gguf --chat-template gemma

If your model needs a template that is not built in, point at the raw Jinja file:

./llama-server -m model.gguf --jinja --chat-template-file ./my_template.jinja

Step 3: Fix an Ollama Modelfile

In an Ollama Modelfile the TEMPLATE block and the stop parameters must match the training format exactly. For Llama 3.1:

FROM llama3.1:8b

TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ range .Messages }}<|start_header_id|>{{ .Role }}<|end_header_id|>

{{ .Content }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

"""

PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_of_text|>"

Save as Modelfile.llama31, run ollama create myllama31 -f Modelfile.llama31, then verify with ollama show myllama31 --modelfile and compare the TEMPLATE block against the official Ollama library page for that model. Do not write the template from memory; copy it from the official source.
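
When comparing the saved Modelfile against the official one, the stop parameters are the easiest part to check automatically. The sketch below parses PARAMETER stop lines out of Modelfile text, for example the output of ollama show myllama31 --modelfile saved to a string. The helper names are hypothetical, and the parser assumes each stop value is a double-quoted string on its own line, as in the example above.

```python
import re

# Assumption: stop values are written as  PARAMETER stop "<value>"  on one line.
_STOP_RE = re.compile(r'^\s*PARAMETER\s+stop\s+"(.*)"\s*$',
                      re.IGNORECASE | re.MULTILINE)


def modelfile_stops(modelfile_text: str) -> set[str]:
    """Collect the quoted value of every PARAMETER stop line."""
    return set(_STOP_RE.findall(modelfile_text))


def missing_stops(modelfile_text: str, required: list[str]) -> set[str]:
    """Return the required stop tokens that the Modelfile does not declare."""
    return set(required) - modelfile_stops(modelfile_text)


if __name__ == "__main__":
    text = 'FROM llama3.1:8b\nPARAMETER stop "<|eot_id|>"\n'
    print(missing_stops(text, ["<|eot_id|>", "<|end_of_text|>"]))
    # {'<|end_of_text|>'}
```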

Step 4: Stop a double BOS

If --verbose shows the double-BOS warning, do not prepend BOS yourself. From a Python wrapper, let the chat path handle special tokens:

from llama_cpp import Llama

llm = Llama(
    model_path="model-Q4_K_M.gguf",
    n_gpu_layers=35,
    n_ctx=8192,
    chat_format="llama-3",   # or "chatml", "mistral-instruct"
    verbose=False,
)
# Do NOT add BOS manually; chat_format inserts it once.
resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is 2+2?"}]
)
print(resp["choices"][0]["message"]["content"])

For llama-server you can override the metadata flag, though note this is ignored for a few model families (a known bug as of June 2026, see llama.cpp issue #21786):

./llama-server -m model.gguf --jinja \
  --override-kv tokenizer.ggml.add_bos_token=bool:false

There is no --no-bos flag on llama-server; that flag exists only in some older llama-cli builds.

Step 5: Apply the template manually to inspect it

When you are not sure what the engine is sending, render the prompt yourself and read it:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/model/or/hub/id")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain Docker networking."},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(repr(prompt))  # exact string, including special tokens, before it hits the engine

apply_chat_template is the source of truth: BOS, role tokens, and the trailing generation prompt should all be present and in order.

Step 6: Confirm it is fixed

Send a deterministic test prompt and check the body, not just the HTTP status:

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is 2+2? Reply with just the number."}],
    "max_tokens": 10,
    "temperature": 0
  }' | python3 -m json.tool | grep content

It is fixed when:

the reply is 4 (clean, no leading/trailing junk),
no role markers ([INST], <|im_start|>, <|eot_id|>) appear as literal text,
generation stops on its own instead of running to max_tokens,
the server log shows no double-BOS warning.

If you still see role tokens or echoed input, the template is still wrong; go back to Step 1.
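
The first three criteria can be asserted in a script, which is useful as a smoke test after every model or build change. The sketch below works on an already-parsed response body and assumes the OpenAI-style shape (choices[0].message.content and choices[0].finish_reason); verify_chat_response is a hypothetical helper, and the double-BOS check is left out because it lives in the server log, not the response.

```python
ROLE_TOKENS = ("[INST]", "<|im_start|>", "<|eot_id|>")


def verify_chat_response(body: dict, expected: str = "4") -> list[str]:
    """Return the failed checks for a chat completion body; empty means pass."""
    choice = body["choices"][0]
    content = choice["message"]["content"]
    finish = choice.get("finish_reason")
    failures = []
    # Surrounding whitespace is tolerated; any other extra text fails.
    if content.strip() != expected:
        failures.append(f"unexpected content: {content!r}")
    if any(tok in content for tok in ROLE_TOKENS):
        failures.append("role marker leaked into content")
    # "length" would mean generation ran into max_tokens instead of stopping.
    if finish != "stop":
        failures.append(f"finish_reason is {finish!r}, not 'stop'")
    return failures


if __name__ == "__main__":
    sample = {"choices": [{"message": {"role": "assistant", "content": "4"},
                           "finish_reason": "stop"}]}
    print(verify_chat_response(sample) or "all checks passed")
```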

Prevention

Default to --jinja so the engine reads the GGUF-embedded template instead of a guessed name.
Always read tokenizer_config.json chat_template (or the GGUF metadata) before running a new model, and match it to the engine’s template name.
Re-convert with the latest convert_hf_to_gguf.py if the embedded template is missing or outdated.
In Ollama Modelfiles, copy the TEMPLATE block from the official library page, never from memory.
Pin the llama.cpp build when deploying; template and Jinja handling change between releases.
Keep a model-to-template mapping doc for everything in your deployment.
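
The mapping doc in the last item can live in code so that launch scripts cannot drift from it. A minimal sketch, using only the model and template names from Step 2; TEMPLATE_BY_MODEL and server_args are hypothetical names, and the paths are placeholders.

```python
# Model name -> llama.cpp template name, copied from the Step 2 examples.
TEMPLATE_BY_MODEL = {
    "Llama-3.1-8B-Instruct": "llama3",
    "Mistral-7B-Instruct-v0.3": "mistral-v3",
    "Mistral-Nemo-Instruct-2407": "mistral-v3-tekken",
    "Qwen2.5-7B-Instruct": "chatml",
    "gemma-2-9b-it": "gemma",
}


def server_args(model_name: str, gguf_path: str) -> list[str]:
    """Build a llama-server argument list with the mapped template.

    An unmapped model raises KeyError on purpose, so nobody launches
    a model with a guessed template.
    """
    template = TEMPLATE_BY_MODEL[model_name]
    return ["./llama-server", "-m", gguf_path, "--chat-template", template]


if __name__ == "__main__":
    print(" ".join(server_args("Qwen2.5-7B-Instruct",
                               "models/Qwen2.5-7B-Instruct-Q4_K_M.gguf")))
```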

FAQ

Q: Why does the model work with the /completion endpoint but produce garbage on /v1/chat/completions? A: /completion takes raw text and applies no chat template; /v1/chat/completions wraps your messages with the configured template. If that template is wrong for the model, only the chat endpoint breaks. Render the prompt with apply_chat_template (Step 5) and compare it to what you sent to /completion.

Q: Should I trust Ollama’s auto-detection? A: For models pulled from the official Ollama library, yes — the template ships with the model. For custom GGUFs you converted from HuggingFace, verify with ollama show model --modelfile and compare against the tokenizer config; auto-detection falls back to heuristics when the GGUF has no embedded template.

Q: Which template does Mistral 7B Instruct v0.3 use in llama.cpp? A: mistral-v3. There is no plain mistral template name. v0.3 added native tool-calling control tokens; if you apply a v1/v0.1 template, tool tokens like [TOOL_CALLS] show up as literal text. Mistral Nemo uses mistral-v3-tekken and Large 2411 uses mistral-v7.

Q: Output is not garbled but the model repeats one sentence forever — is that a template issue? A: Often yes, by way of the stop token. If the template does not register the model’s end-of-turn token, generation never stops. Confirm the right stop token is set (<|eot_id|> for Llama 3, <|im_end|> for ChatML, </s> for Mistral) in --jinja output or your Modelfile PARAMETER stop lines.

Q: Can a wrong template silently bypass safety guardrails? A: Yes. Instruct/RLHF alignment is trained against specific role tokens. If the template puts content in the wrong role, the model may not recognize a request as a user turn, and its refusal behavior can misfire — not a jailbreak, just format confusion.

Tags: #local-llm #llama.cpp #Troubleshooting
