Tokenizer Drift: Local LLM Token Counts Don't Match

Your app's token count disagrees with the local llama.cpp or Ollama server, causing context overflow or silent truncation. Use the server's own tokenizer as ground truth to fix the drift.

Published: May 25, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

Your Python app uses tiktoken to pre-count tokens and stay under an 8192-token limit. You count 7,900 tokens, add a buffer, and send the request to your local Ollama or llama-server running Llama 3.1. The server returns context length exceeded, or worse, silently truncates the prompt and answers from the wrong half of your context. The discrepancy is real: tiktoken uses OpenAI’s vocabulary, your model uses its own, and the same text counts differently in each.

Fastest fix: stop trusting a client-side estimate. Ask the server what it counts, using the exact tokenizer baked into the model you are serving. For llama-server, POST /tokenize and read back the token array length. For Ollama, send the prompt once and read prompt_eval_count from the response. That number is ground truth; match your budget to it. The rest of this page explains why the counts drift and how to lock them together.

First, rule out the silent killer: Ollama’s 4096-token default

Before you debug tokenizers, check whether your server is even using the context size you think it is. As of June 2026, Ollama still defaults num_ctx to 4096 tokens unless you override it, and when a prompt exceeds that, it does not raise an error — it silently truncates the oldest tokens and answers from what’s left. So your “tokenizer mismatch” may actually be a context-size mismatch: your client thinks the window is 8192, but the server is running 4096.

Confirm the active context size:

# Check what context size the loaded model is actually running
ollama ps
# The "CONTEXT" column shows the live num_ctx, e.g. 4096

# Override per request on the native /api/generate endpoint: num_ctx goes inside "options"
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "ping",
  "options": {"num_ctx": 8192},
  "stream": false
}' | python3 -c "import json,sys; print(json.load(sys.stdin)['prompt_eval_count'])"

A known trap (as of June 2026): the OpenAI-compatible /v1/chat/completions endpoint ignores num_ctx unless you pass it through options, and the OLLAMA_CONTEXT_LENGTH env var can be overridden back down to 4096 by VRAM limits. Set it explicitly per request or in a Modelfile PARAMETER num_ctx. If raising num_ctx makes the “mismatch” disappear, you were never drifting — you were truncating.
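To make the truncation-versus-drift distinction mechanical, you can compare the window your app assumes against the live value before touching any tokenizer. The sketch below is illustrative only: window_check and its argument names are hypothetical, and it assumes you obtain live_ctx yourself (for example by reading the CONTEXT column of ollama ps).

```python
def window_check(assumed_ctx: int, live_ctx: int, prompt_tokens: int) -> str:
    """Classify a prompt against the window the server is really running.

    assumed_ctx   -- context size your client budgets against
    live_ctx      -- context size the server reports (e.g. from `ollama ps`)
    prompt_tokens -- your count for the full prompt
    """
    if prompt_tokens > live_ctx:
        if live_ctx < assumed_ctx:
            # The classic case: client budgets for 8192, server runs 4096.
            return "truncating: prompt exceeds the live window; raise num_ctx"
        return "overflow: prompt exceeds even the window you assumed"
    if live_ctx < assumed_ctx:
        return "mismatch: prompt fits today, but the live window is smaller than assumed"
    return "ok"


print(window_check(assumed_ctx=8192, live_ctx=4096, prompt_tokens=7900))
```

If this reports truncating or mismatch, fix num_ctx first and only then start comparing tokenizers.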

Common causes

Ordered by how often they’re the culprit, highest first.

1. Using tiktoken for a non-OpenAI model

tiktoken is built for GPT models (cl100k_base for GPT-4, o200k_base for GPT-4o/GPT-5-class models). Using it to count tokens for Llama, Mistral, or Qwen is simply the wrong vocabulary. A code snippet with curly braces or an emoji may be 3 tokens under tiktoken and 7 under the model’s own tokenizer. For Chinese or mixed-script text the gap is routinely 15-30%.

How to spot it: run the same text through both tokenizers and compare. A difference above ~5% on plain English (or any difference on CJK/code) means you’re using the wrong tokenizer for your model.
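The comparison can be reduced to one number. This is a minimal sketch, assuming you supply two counting functions yourself; drift_percent is a hypothetical helper, and the commented wiring at the bottom assumes tiktoken and transformers are installed and the model repo is accessible.

```python
from typing import Callable


def drift_percent(
    text: str,
    count_candidate: Callable[[str], int],
    count_reference: Callable[[str], int],
) -> float:
    """Gap between two token counts, as a percentage of the reference count.

    count_reference should be the model's own tokenizer; count_candidate is
    whatever your app currently uses (for example tiktoken).
    """
    candidate = count_candidate(text)
    reference = count_reference(text)
    if reference == 0:
        return 0.0
    return abs(reference - candidate) / reference * 100.0


# Possible wiring (not executed here; needs the libraries and model access):
# import tiktoken
# from transformers import AutoTokenizer
# enc = tiktoken.get_encoding("cl100k_base")
# tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# drift_percent(
#     "def foo(x): return {x: 1}",
#     lambda s: len(enc.encode(s)),
#     lambda s: len(tok.encode(s, add_special_tokens=False)),
# )
```

Run it over a few English, code, and CJK samples rather than a single string; one sample can hide the drift.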

2. Wrong tokenizer family — and Llama 3 is not SentencePiece

A correction worth internalizing: Llama 3 and 3.1 do not use SentencePiece. They use a tiktoken-based BPE tokenizer with a 128,256-token vocabulary (it shares much of GPT-4’s cl100k_base merge table and adds extra tokens). Within the Llama family, only Llama 2 and Code Llama use SentencePiece (32,000 tokens). Mistral and Mixtral also use SentencePiece, while Qwen 2.5 and 3 use their own BPE vocabulary of roughly 151,000 tokens. So “I’m on Llama, tiktoken should be close” is a half-truth: Llama 3 lands closer to GPT-4 on English, but its 128k vocab still splits code and CJK differently, and Llama 2 diverges hard.

Model	Tokenizer type	Vocab size	Counts like tiktoken?
Llama 2 / Code Llama	SentencePiece	32,000	No — much higher on most text
Llama 3 / 3.1 / 3.3	tiktoken-based BPE	128,256	Close on English, off on code/CJK
Mistral / Mixtral	SentencePiece	32,000–32,768	No
Qwen 2.5 / 3	tiktoken-based BPE	~151,000	No — its own vocab

How to spot it: print the vocab size and compare to what you expect.

python3 -c "from transformers import AutoTokenizer; \
t=AutoTokenizer.from_pretrained('meta-llama/Llama-3.1-8B-Instruct'); \
print(len(t))"   # len() includes added special tokens; expect 128256 for Llama 3.x

3. Not applying the chat template when counting

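Chat models never see your raw message strings. The server wraps every message in the model’s chat template (for Llama 3.x, role headers such as <|start_header_id|> plus an <|eot_id|> terminator per message), and those wrapper tokens count against the window. If you count only the message text, you undercount on every turn.

How to spot it: count with and without apply_chat_template; the delta is your template overhead. For Llama 3.1 with a system prompt, expect roughly 20-40 extra tokens per exchange from template tokens alone.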
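The with-and-without comparison can be wrapped in a small function. This is a sketch, not the only way to do it: template_overhead is a hypothetical helper, and it assumes a HuggingFace-style tokenizer exposing apply_chat_template and encode. It renders the template to a string first and then encodes it, so the result does not depend on what apply_chat_template returns when tokenize=True.

```python
def template_overhead(tokenizer, messages: list[dict]) -> int:
    """Tokens added by the chat template on top of the raw message text."""
    rendered = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    # The rendered string already carries the template's special tokens,
    # so do not let encode() prepend a second BOS.
    with_template = len(tokenizer.encode(rendered, add_special_tokens=False))
    raw = sum(
        len(tokenizer.encode(m["content"], add_special_tokens=False))
        for m in messages
    )
    return with_template - raw


# Possible usage (needs transformers and model access):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# template_overhead(tok, [{"role": "user", "content": "Hello"}])
```

Log this number once per model; it is the fixed cost you pay on every request before any content is added.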

4. add_special_tokens mismatch (BOS/EOS)

With HuggingFace, tokenizer.encode(text, add_special_tokens=True) includes BOS/EOS; add_special_tokens=False doesn’t. The inference server normally adds BOS. If your counter uses add_special_tokens=False, you undercount by 1-3 tokens per segment — harmless on a 200k budget, fatal when you’re slicing to the exact boundary. Note the symmetry on the server side: llama-server’s /tokenize defaults add_special to false, so a fair comparison needs both sides set the same way.

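How to spot it: compare len(tokenizer.encode(text, add_special_tokens=False)) against len(tokenizer.encode(text, add_special_tokens=True)), then check which one your code uses. Note that encode defaults to add_special_tokens=True, so calling it with no argument already includes the special tokens.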

5. A modified or re-quantized GGUF tokenizer

GGUF files embed the vocabulary and byte-fallback rules. Some community quantizations regenerate or trim the vocab (merging tokens, dropping rare ones) so the GGUF tokenizer no longer matches the official HuggingFace one. Different normalization (whitespace, Unicode form) then produces drift even for identical UTF-8 input.

How to spot it: tokenize the exact same byte string with the server’s /tokenize endpoint and with your HF tokenizer, then compare the token IDs. If the IDs differ, the embedded vocab has diverged, so trust the GGUF.
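When the two ID lists differ, the useful detail is where they first part ways, because that usually points at the character or normalization rule responsible. A minimal sketch, assuming you already hold both ID lists; first_divergence is a hypothetical helper name.

```python
from itertools import zip_longest
from typing import Optional


def first_divergence(server_ids: list[int], client_ids: list[int]) -> Optional[int]:
    """Index of the first position where the two token ID lists differ.

    Returns None when the lists are identical. A length mismatch counts as
    a divergence at the end of the shorter list.
    """
    for index, (server_id, client_id) in enumerate(
        zip_longest(server_ids, client_ids)  # pads the shorter list with None
    ):
        if server_id != client_id:
            return index
    return None


print(first_divergence([1, 2, 3], [1, 9, 3]))
```

Decode the tokens on either side of the reported index to see which piece of text the two vocabularies split differently.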

6. System prompt left out of the budget

Many RAG pipelines count only the user query and retrieved chunks, forgetting the system prompt. A 500-token system prompt against an 8192 window leaves 7,692 usable tokens, not 8,192.

How to spot it: add system_prompt_tokens + all_message_tokens + expected_completion_tokens. If that exceeds the live num_ctx, you’ll overflow or truncate.

Which bucket am I in?

Symptom	Most likely cause	Jump to
Counts match but output is still cut / context “shrinks”	`num_ctx` defaulting to 4096, silent truncation	section above
Client count far below server count, English text	Wrong tokenizer (tiktoken on Llama/Qwen)	Cause 1, 2
Off by a fixed 20-40 tokens per turn	Chat template not counted	Cause 3
Off by 1-3 tokens, consistent	`add_special_tokens` / BOS mismatch	Cause 4
Off only on specific characters/code	Modified GGUF vocab or normalization	Cause 5
Off by exactly your system-prompt length	System prompt not in budget	Cause 6

Shortest path to fix

Step 1: Use the correct tokenizer for your model

from transformers import AutoTokenizer

# Correct: the model-specific tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def count_tokens(text: str) -> int:
    return len(tokenizer.encode(text, add_special_tokens=False))

# Wrong: tiktoken for a Llama model
# import tiktoken
# enc = tiktoken.encoding_for_model("gpt-4")  # DO NOT use for Llama/Qwen/Mistral

Step 2: Count tokens including the chat template

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def count_chat_tokens(messages: list[dict]) -> int:
    """Count tokens for a full chat payload, including template overhead."""
    templated = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
    )
    return len(templated)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain Docker networking in detail."},
]
print(f"Total tokens: {count_chat_tokens(messages)}")

Step 3: Cross-validate against the server’s own tokenizer (ground truth)

The server’s embedded tokenizer is the only count that matters at inference time. Match your client to it, not the other way around.

# llama-server: POST /tokenize returns {"tokens": [id, id, ...]}.
# Count the array length. add_special=false by default, so set it on
# both sides for a fair comparison.
curl -s http://localhost:8080/tokenize \
  -H "Content-Type: application/json" \
  -d '{"content": "your text here", "add_special": false}' \
  | python3 -c "import json,sys; print('server tokens:', len(json.load(sys.stdin)['tokens']))"

# Ollama: send the prompt once and read prompt_eval_count.
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "your text here", "stream": false}' \
  | python3 -c "import json,sys; print('prompt tokens:', json.load(sys.stdin)['prompt_eval_count'])"

Aim for agreement within 1-3 tokens (BOS/EOS only). A larger, consistent gap points straight at Causes 2-5 above.

Step 4: Build a context budget calculator with every component

def check_context_budget(
    system_prompt: str,
    user_message: str,
    context_chunks: list[str],
    max_context: int = 8192,
    max_completion: int = 1024,
    safety_margin: int = 128,
) -> dict:
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message + "\n\n" + "\n\n".join(context_chunks)},
    ]
    prompt_tokens = count_chat_tokens(messages)
    total_needed = prompt_tokens + max_completion + safety_margin

    return {
        "prompt_tokens": prompt_tokens,
        "max_completion": max_completion,
        "total_needed": total_needed,
        "available": max_context,         # must equal the server's live num_ctx
        "fits": total_needed <= max_context,
        "overflow": max(0, total_needed - max_context),
    }

Set max_context to the live num_ctx from ollama ps or your llama-server launch flag, not the model’s theoretical maximum.

Step 5: Test tokenizer fidelity on edge-case content

test_cases = [
    "Hello, world!",
    "def foo(x: dict[str, int]) -> None:",
    "Unicode emoji test",
    "https://example.com/path?query=value&key=123",
    "<|begin_of_text|>",  # special token — should not split unexpectedly
    "你好，世界",          # CJK — biggest source of drift
]

for text in test_cases:
    client_count = count_tokens(text)
    server_count = get_server_token_count(text)  # via /tokenize or prompt_eval_count
    diff = abs(client_count - server_count)
    print(f"{text[:30]!r}: client={client_count}, server={server_count}, diff={diff}")

Any diff above 2-3 tokens flags a vocabulary or normalization mismatch.
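The loop above calls get_server_token_count, which this page does not define. One possible implementation against llama-server is sketched below, using only the standard library. It assumes the server address from the curl example (localhost:8080) and the {"tokens": [...]} response shape described there; parse_token_count is a hypothetical helper split out so the parsing can be checked without a running server.

```python
import json
import urllib.request

# Same address as the curl example; adjust to your own server.
TOKENIZE_URL = "http://localhost:8080/tokenize"


def parse_token_count(body: bytes) -> int:
    """Count the token IDs in a /tokenize response body."""
    return len(json.loads(body)["tokens"])


def get_server_token_count(
    text: str, url: str = TOKENIZE_URL, timeout: float = 10.0
) -> int:
    """POST text to llama-server's /tokenize and return the token count."""
    # add_special is sent explicitly so it matches count_tokens(), which
    # uses add_special_tokens=False on the client side.
    payload = json.dumps({"content": text, "add_special": False}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_token_count(response.read())
```

For Ollama, swap this for a function that reads prompt_eval_count instead; keep in mind that count includes the chat template, so compare it against a templated client count.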

How to confirm it’s fixed

You’re done when all three are true:

ollama ps (or your llama-server -c flag) shows the context size your app assumes — no silent 4096 fallback.
For ten varied test strings (English, code, CJK), your client count and the server’s count agree within 1-3 tokens.
A request sized to num_ctx - safety_margin runs without context length exceeded and prompt_eval_count comes back below num_ctx.

Prevention

Never use tiktoken for non-OpenAI models — always load the model-specific tokenizer (HuggingFace AutoTokenizer, or the GGUF’s own via /tokenize).
Treat the server’s count (/tokenize length or prompt_eval_count) as ground truth and calibrate the client to it.
Always count with apply_chat_template so template tokens are in the budget.
Pin num_ctx explicitly (request options, Modelfile, or -c) so you never inherit the 4096 default by accident.
Keep a safety margin of at least 128 tokens (10-15% for CJK-heavy content) below num_ctx.
Re-test the budget when changing model versions (Llama 2 to 3, or any re-quantized GGUF) — vocabularies differ.
Cache the tokenizer at startup; HF tokenizer loads add 100-500ms per call.
Pin the transformers version in requirements.txt; tokenizer behavior can shift between minor releases.
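The safety-margin rule in the list above can be written down so it is applied the same way everywhere. This is a sketch under one stated assumption: the 10-15% figure is read as a share of num_ctx, which the page does not spell out. safety_margin and its parameters are hypothetical names.

```python
import math


def safety_margin(
    num_ctx: int, cjk_heavy: bool = False, cjk_fraction: float = 0.10
) -> int:
    """Tokens to keep free below num_ctx.

    Uses the flat 128-token floor for ordinary content, and a fraction of
    the window (assumed interpretation: 10-15% of num_ctx) for CJK-heavy
    content, never dropping below the floor.
    """
    floor = 128
    if not cjk_heavy:
        return floor
    return max(floor, math.ceil(num_ctx * cjk_fraction))


print(safety_margin(8192))                  # flat floor
print(safety_margin(8192, cjk_heavy=True))  # fraction of the window
```

Pass the result as the safety_margin argument of your budget check so the margin scales when num_ctx changes.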

FAQ

Q: Why does tiktoken count fewer tokens than Llama 3 for the same text? A: tiktoken’s cl100k_base (~100k tokens) and Llama 3’s BPE vocab (128,256 tokens) overlap heavily on English but diverge on code and non-Latin scripts. Llama 3 often splits curly braces, operators, and CJK characters into separate tokens, so its count runs higher there. They are both BPE, just different vocabularies — close is not equal.

Q: Isn’t Llama’s tokenizer SentencePiece? A: Only Llama 2 (and Code Llama, Mistral). Llama 3, 3.1, and 3.3 switched to a tiktoken-based BPE tokenizer with a 128,256-token vocabulary. If you’ve been treating Llama 3 as SentencePiece, that’s likely your drift.

Q: How do I count tokens without burning a full generation on Ollama? A: As of June 2026 Ollama has no standalone tokenize endpoint, so send the prompt with a tiny completion ("options": {"num_predict": 1}) and read prompt_eval_count. For a true zero-cost count, mirror the model with a HuggingFace tokenizer or use llama-server’s /tokenize, which doesn’t run the model.
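The one-token trick for Ollama might look like the following in Python. It is a sketch using only the standard library; it assumes the default Ollama address used elsewhere on this page, and build_count_request and ollama_prompt_tokens are hypothetical helper names. num_ctx is pinned in the same request so the count is not taken against a smaller default window.

```python
import json
import urllib.request

OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"


def build_count_request(model: str, prompt: str, num_ctx: int) -> dict:
    """Request body that evaluates the prompt but generates one token only."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": 1, "num_ctx": num_ctx},
    }


def ollama_prompt_tokens(
    model: str,
    prompt: str,
    num_ctx: int,
    url: str = OLLAMA_GENERATE_URL,
    timeout: float = 120.0,
) -> int:
    """Send the prompt once and return the server's prompt_eval_count."""
    payload = json.dumps(build_count_request(model, prompt, num_ctx)).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["prompt_eval_count"]
```

The prompt still has to be evaluated, so this costs real compute on long inputs; use it for calibration runs rather than on every request.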

Q: Does the llama.cpp /tokenize endpoint match what the GGUF uses for inference? A: Yes. /tokenize uses the exact tokenizer embedded in the GGUF that the inference engine uses, so it’s ground truth for that file. Remember it defaults add_special to false; set it the same on your client side when comparing.

Q: How do I handle drift across a multi-model deployment? A: Keep a tokenizer registry keyed by model name ({"llama3.1": ..., "mistral": ..., "qwen2.5": ...}) and route every pre-count through the matching tokenizer. Never share one tokenizer across families.
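A registry of this kind can be small. The sketch below is one possible shape, not a prescribed design: TokenizerRegistry, MODEL_REPOS, and default_loader are hypothetical names, and the repo IDs in the mapping are example choices you would replace with whatever you actually serve. Each tokenizer is loaded lazily and cached, which also covers the cache-at-startup advice.

```python
from typing import Callable

# Example mapping from deployment names to HuggingFace repos (assumed values).
MODEL_REPOS = {
    "llama3.1": "meta-llama/Llama-3.1-8B-Instruct",
    "mistral": "mistralai/Mistral-7B-Instruct-v0.3",
    "qwen2.5": "Qwen/Qwen2.5-7B-Instruct",
}


def default_loader(repo_id: str):
    # Imported lazily so the registry itself has no heavy dependency.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(repo_id)


class TokenizerRegistry:
    """Routes every count through the tokenizer that matches the model."""

    def __init__(self, repos: dict[str, str], loader: Callable = default_loader):
        self._repos = dict(repos)
        self._loader = loader
        self._cache: dict = {}

    def get(self, model_name: str):
        if model_name not in self._repos:
            # Fail loudly instead of falling back to some other family.
            raise KeyError(f"no tokenizer registered for {model_name!r}")
        if model_name not in self._cache:
            self._cache[model_name] = self._loader(self._repos[model_name])
        return self._cache[model_name]

    def count(self, model_name: str, text: str) -> int:
        return len(self.get(model_name).encode(text, add_special_tokens=False))


# registry = TokenizerRegistry(MODEL_REPOS)
# registry.count("qwen2.5", "你好，世界")
```

Raising on an unknown model name is deliberate: a silent fallback to a default tokenizer is exactly how cross-family drift creeps back in.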

Tags: #local-llm #llama.cpp #Troubleshooting