You load Mistral-7B-Instruct-v0.3 in llama-server and send a chat message. Instead of a clean reply, the model echoes back your entire message, then adds [INST] markers in the middle of the response, then starts generating what looks like the next user turn by itself. Or you load Llama 3 and get a response that starts with <|im_start|>assistant as literal text in the output. These are symptoms of a chat template mismatch: the tokenizer and inference engine are not agreeing on how to structure the conversation, so the model receives raw token sequences that look like mid-conversation injection rather than a properly formatted prompt.
Common causes
Ordered by hit rate, highest first.
1. Using the wrong chat template name for the model family
The llama.cpp tools (llama-cli, llama-server) and Ollama each ship their own set of built-in templates. If you pass --chat-template llama2 to llama-server for a Llama 3 model, the engine wraps your message in [INST] ... [/INST] tags instead of the <|begin_of_text|>...<|eot_id|> format Llama 3 requires. The model was never trained on the Llama 2 format and will produce incoherent output.
How to spot it: Run ./llama-server --help and find the --chat-template entry, which lists the built-in template names for your build. Compare the name you are using with the model’s HuggingFace tokenizer_config.json, which contains the authoritative chat_template Jinja string.
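To make the mismatch concrete, the sketch below renders the same user message in both formats. The two helper functions are hypothetical and simplified (single user turn, no system prompt); they only illustrate how different the token layout is, and the model's own chat_template remains the authority.

```python
def render_llama2(user_msg: str) -> str:
    # Llama 2 style: the user turn is wrapped in [INST] ... [/INST]
    return f"<s>[INST] {user_msg} [/INST]"


def render_llama3(user_msg: str) -> str:
    # Llama 3 style: header tokens name the role, <|eot_id|> closes the turn,
    # and the prompt ends with an open assistant header for generation
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_msg}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


if __name__ == "__main__":
    msg = "Explain Docker networking."
    print(repr(render_llama2(msg)))
    print(repr(render_llama3(msg)))
```

A Llama 3 model that receives the first string sees none of the role headers it was trained on, which is why the output degrades.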
2. GGUF file has no embedded chat template and no template was specified
When converting models from HuggingFace to GGUF, the convert_hf_to_gguf.py script copies the chat_template from the tokenizer into the GGUF metadata. If the conversion was done with an older version of the script or a third-party converter, the template may be missing. llama-server then falls back to a generic template that doesn’t match the model.
How to spot it: Run python3 -c "import gguf; r = gguf.GGUFReader('model.gguf'); print(r.fields.get('tokenizer.chat_template'))". If the output is None, the template is missing from the GGUF.
3. Ollama Modelfile overrides the template incorrectly
In an Ollama Modelfile, the TEMPLATE directive must exactly match the model’s training format, including special tokens. A common mistake is copying a Llama 2 template for a Llama 3 model, or omitting the {{ .System }} block for models that use a system prompt in the template structure.
How to spot it: Run ollama show modelname --modelfile and compare the TEMPLATE block against the model’s official HuggingFace tokenizer_config.json.
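A quick first check before a line-by-line comparison is to see which family's marker tokens a template contains. The sketch below is a rough heuristic, not a validator: the marker table and the function name guess_template_family are assumptions made for illustration, and a template can contain the right markers and still be wrong in other ways.

```python
from __future__ import annotations

# Marker tokens that characterise each prompt format (illustrative, not exhaustive)
FAMILY_MARKERS = {
    "llama3": ["<|start_header_id|>", "<|eot_id|>"],
    "llama2/mistral": ["[INST]", "[/INST]"],
    "chatml": ["<|im_start|>", "<|im_end|>"],
}


def guess_template_family(template: str) -> list[str]:
    """Return every family whose markers all appear in the template text."""
    return [
        family
        for family, markers in FAMILY_MARKERS.items()
        if all(marker in template for marker in markers)
    ]


if __name__ == "__main__":
    modelfile_template = "[INST] {{ .Prompt }} [/INST]"
    print(guess_template_family(modelfile_template))
```

If the result for a Llama 3 model's Modelfile is anything other than llama3, the TEMPLATE block was most likely copied from another model family.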
4. System prompt placed outside the template structure
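Some models (Mistral-7B-Instruct, for example) have no separate system turn: the template either folds the system prompt into a user [INST] block or does not handle a system role at all. If your client sends {"role": "system", "content": "..."} as a standalone message and the template doesn’t handle it, the system content is dropped or lands in the conversation in the wrong position. Models that use ChatML, such as Qwen2.5, do take a separate system turn and are not affected by this.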
How to spot it: Remove the system message from your API call and send only the user message. If output improves, the system role placement is the issue.
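If the system role turns out to be the problem, one client-side workaround is to merge the system text into the first user message before sending. The sketch below assumes OpenAI-style message dicts; the function name and the blank-line separator are choices made for illustration, not something the model's template prescribes.

```python
def fold_system_into_first_user(messages):
    """Return a copy of messages with system content merged into the first user turn."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    # Copy each dict so the caller's list is not modified
    rest = [dict(m) for m in messages if m["role"] != "system"]
    if not system_parts:
        return rest
    for message in rest:
        if message["role"] == "user":
            message["content"] = "\n\n".join(system_parts + [message["content"]])
            break
    else:
        raise ValueError("no user message to carry the system prompt")
    return rest


if __name__ == "__main__":
    chat = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain Docker networking."},
    ]
    print(fold_system_into_first_user(chat))
```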
5. BOS (beginning-of-sequence) token added twice
Some llama.cpp wrappers prepend the BOS token manually before calling the tokenizer, which also prepends it. The model never saw two BOS tokens in a row during training, so the prompt starts out of distribution. Symptoms include garbled first-token output or the model producing a prefix in the wrong language.
How to spot it: Enable verbose mode in llama-server (--verbose) and inspect the token IDs at the start of the prompt. If the BOS token ID (1 for Llama 2 and Mistral, 128000 for Llama 3) appears twice in a row, the wrapper is double-prepending it.
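Once you have the prompt's token IDs, the check itself is mechanical. The sketch below is a minimal version with hypothetical helper names; the BOS ID is passed in because it differs between model families, and the token ID 733 in the demo is an arbitrary placeholder.

```python
def count_leading_bos(token_ids, bos_id):
    """Count how many times bos_id repeats at the very start of the prompt."""
    count = 0
    for token_id in token_ids:
        if token_id != bos_id:
            break
        count += 1
    return count


def strip_duplicate_bos(token_ids, bos_id):
    """Keep a single leading BOS and drop any extra copies."""
    extra = count_leading_bos(token_ids, bos_id)
    if extra > 1:
        return list(token_ids[extra - 1:])
    return list(token_ids)


if __name__ == "__main__":
    prompt_ids = [1, 1, 733]  # BOS twice, then an arbitrary placeholder token
    print(count_leading_bos(prompt_ids, bos_id=1))
    print(strip_duplicate_bos(prompt_ids, bos_id=1))
```

Stripping the duplicate is only a stopgap; the cleaner fix is to stop the wrapper from adding BOS in the first place.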
6. Chat history serialized in wrong role order
If your client sends messages as [assistant, user, assistant, user] instead of [user, assistant, user, assistant], the model receives a conversation that starts with an unprompted assistant turn. Models trained with RLHF expect user turns to always precede assistant turns, and an inverted order causes confused, repetitive output.
How to spot it: Log the exact JSON sent to the /v1/chat/completions endpoint and verify that, after an optional leading system message, the roles strictly alternate user → assistant → user.
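The same check can run in the client before every request. This is a minimal sketch assuming OpenAI-style message dicts and only the system, user, and assistant roles; tool messages are not handled, and check_role_order is a hypothetical name.

```python
def check_role_order(messages):
    """Return a list of problems; an empty list means the order looks valid."""
    problems = []
    roles = [m["role"] for m in messages]
    # A system message is only allowed in the first position
    if "system" in roles[1:]:
        problems.append("system message must come first")
    turns = [role for role in roles if role != "system"]
    for index, role in enumerate(turns):
        expected = "user" if index % 2 == 0 else "assistant"
        if role != expected:
            # Report only the first break; everything after it is shifted anyway
            problems.append(f"turn {index}: expected {expected}, got {role}")
            break
    return problems


if __name__ == "__main__":
    inverted = [
        {"role": "assistant", "content": "Hello!"},
        {"role": "user", "content": "Hi"},
    ]
    print(check_role_order(inverted))
```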
Shortest path to fix
Step 1: Find the correct chat template from the HuggingFace model card
# Download and inspect the tokenizer config
python3 - << 'EOF'
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
print(tok.chat_template)
EOF
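For gated models (Llama 3), accept the license on the model page, log in, and download the config with the HuggingFace Hub CLI:
huggingface-cli login
# --local-dir . writes the file to the current directory instead of the cache
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct tokenizer_config.json --local-dir .
python3 -c "import json; d=json.load(open('tokenizer_config.json')); print(d['chat_template'])"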
Step 2: Specify the correct template in llama-server
# Llama 3 models
./llama-server \
  -m models/Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --chat-template llama3
# Mistral v0.3 (recent builds name this mistral-v3; older builds accepted mistral)
./llama-server \
  -m models/Mistral-7B-Instruct-v0.3-Q4_K_M.gguf \
  --chat-template mistral-v3
# Qwen 2.5 uses the ChatML format
./llama-server \
  -m models/Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  --chat-template chatml
# List the template names your build supports (see the --chat-template entry)
./llama-server --help
Step 3: Fix an Ollama Modelfile with the correct template
For Llama 3.1 in Ollama:
FROM llama3.1:8b
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ range .Messages }}<|start_header_id|>{{ .Role }}<|end_header_id|>

{{ .Content }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

"""
PARAMETER stop "<|eot_id|>"
PARAMETER stop "<|end_of_text|>"
Save as Modelfile.llama31 and run ollama create myllama31 -f Modelfile.llama31.
Step 4: Apply the template manually using the tokenizer’s apply_chat_template
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/model/or/hub/id")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain Docker networking."},
]
# apply_chat_template handles BOS, role tokens, and EOS correctly
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(repr(prompt))  # Inspect the exact string before sending to the engine
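With the reference string in hand, you can compare it against whatever prompt the engine actually built, for example one copied from verbose logs. The sketch below finds the first character where the two diverge; both function names are hypothetical, and how you capture the engine's prompt is left as an assumption.

```python
def first_mismatch(reference, actual):
    """Return the index of the first differing character, or None if equal."""
    for index, (ref_char, act_char) in enumerate(zip(reference, actual)):
        if ref_char != act_char:
            return index
    if len(reference) != len(actual):
        # One string is a prefix of the other
        return min(len(reference), len(actual))
    return None


def describe_mismatch(reference, actual, context=30):
    """Build a short report with repr() so whitespace and newlines are visible."""
    index = first_mismatch(reference, actual)
    if index is None:
        return "prompts match"
    start = max(0, index - context)
    return (
        f"first difference at character {index}\n"
        f"  reference: {reference[start:index + context]!r}\n"
        f"  actual:    {actual[start:index + context]!r}"
    )


if __name__ == "__main__":
    print(describe_mismatch("<s>[INST] hi [/INST]", "<s><s>[INST] hi [/INST]"))
```

Using repr() in the report matters because many template bugs are a missing newline or a stray space that is invisible in normal output.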
Step 5: Disable BOS auto-prepend if it’s being added twice
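# If a wrapper adds BOS itself, fix the wrapper first. Otherwise, override the GGUF
# metadata key that controls BOS insertion so the tokenizer stops adding it.
./llama-server \
  -m model.gguf \
  --chat-template llama3 \
  --override-kv tokenizer.ggml.add_bos_token=bool:false  # prevents double-BOS when the prompt already includes <|begin_of_text|>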
Step 6: Validate output with a fixed test prompt
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is 2+2? Reply with just the number."}],
    "max_tokens": 10,
    "temperature": 0
  }' | python3 -m json.tool | grep content
Expected: "4". If you see role tokens or echoed input, the template is still wrong.
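The same check can be automated for a smoke test. The sketch below scans a reply for leaked role markers and for an echo of the user prompt; the marker list is illustrative rather than complete, and find_template_leaks is a hypothetical helper that works on the reply text you have already extracted from the JSON response.

```python
# Special tokens that should never appear as literal text in a reply (illustrative list)
LEAK_MARKERS = [
    "[INST]",
    "[/INST]",
    "<|im_start|>",
    "<|im_end|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|eot_id|>",
]


def find_template_leaks(reply, user_prompt):
    """Return the template-mismatch symptoms found in a model reply."""
    leaks = [marker for marker in LEAK_MARKERS if marker in reply]
    if user_prompt and user_prompt in reply:
        leaks.append("echoed user prompt")
    return leaks


if __name__ == "__main__":
    prompt = "What is 2+2? Reply with just the number."
    print(find_template_leaks("4", prompt))
    print(find_template_leaks("<|im_start|>assistant\n4", prompt))
```

An empty list does not prove the template is correct, but any non-empty result is a strong sign it is still wrong.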
Prevention
- Always check tokenizer_config.json for the chat_template field before running a new model, and match it to the engine’s template name.
- Re-convert GGUF files with the latest convert_hf_to_gguf.py if the embedded template is missing or outdated.
- In Ollama Modelfiles, always copy the template from the official Ollama library page for that model rather than writing it from memory.
- Log the raw prompt token IDs and the first 200 raw output tokens (not decoded) when testing a new model to catch double-BOS and misaligned special tokens early.
- Maintain a model-to-template mapping document for all models in your deployment.
- Use apply_chat_template from the HuggingFace transformers library as the authoritative source of truth for format validation.
- Pin the llama.cpp version when deploying, because template handling occasionally changes between releases.
FAQ
Q: The model works fine with direct prompts but fails with the chat API — why?
A: The /completion endpoint accepts raw text without any chat template wrapping; the /v1/chat/completions endpoint applies a template. If the model was fine-tuned to expect a specific format and the wrong template is applied by the chat endpoint, results will differ significantly.
Q: Ollama says it auto-detects the template — should I trust that?
A: Ollama reads the chat_template from the GGUF metadata and falls back to heuristics if missing. Auto-detection is reliable for models pulled from the official Ollama library. For custom GGUF files converted from HuggingFace, always verify with ollama show model --modelfile and compare against the HuggingFace tokenizer config.
Q: How does the Mistral v0.3 template differ from v0.1/v0.2?
A: Mistral v0.3 added native tool calling tokens ([TOOL_CALLS], [TOOL_RESULTS]) to the template. If you use the v0.1/v0.2 template with a v0.3 model, tool-call-related tokens appear in the response as literal text instead of being parsed correctly.
Q: Can a wrong template cause safety guardrails to stop working?
A: Yes. Instruct fine-tunes and RLHF alignment are trained with specific role tokens. If the template puts content in the wrong role position, the model’s refusal behaviors can be bypassed unintentionally: not because the guard was jailbroken, but because the model doesn’t recognize the malicious content as a user request due to format confusion.