Your Python application uses tiktoken to pre-count tokens and stay within an 8192-token context limit. You count 7,900 tokens, add a buffer, and send the request to your local Ollama or llama-server running Llama 3.1. The server returns a context-length-exceeded error or truncates the response at what feels like 6,500 tokens. The discrepancy is real: tiktoken’s built-in encodings are OpenAI’s BPE vocabularies, while Llama 3.1 ships its own 128K-entry BPE vocabulary with different merges and special tokens (Llama 2 uses a SentencePiece vocabulary, which diverges further). The same sentence produces a different token count depending on which tokenizer you use, sometimes by 20-30% in either direction.
Common causes
Ordered by hit rate, highest first.
1. Using tiktoken for a non-OpenAI model
tiktoken ships encodings for OpenAI’s GPT models (cl100k_base for GPT-4, o200k_base for GPT-4o). Using it to count tokens for Llama, Mistral, or Qwen models is incorrect because their vocabularies, merge rules, and special tokens differ. A code snippet with curly braces or emoji may tokenize to 3 tokens with tiktoken and 7 tokens with the model’s own tokenizer.
How to spot it: Run the same text through both tokenizers and compare counts. If the difference is more than 5%, you’re using the wrong tokenizer for your model.
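A minimal sketch of that comparison, assuming tiktoken and transformers are both installed and the meta-llama/Llama-3.1-8B-Instruct tokenizer files are accessible. The helper names percent_difference and compare_counts are hypothetical, and the 5% threshold is simply the one suggested above.

```python
def percent_difference(reference: int, other: int) -> float:
    """Return how far `other` is from `reference`, as a percentage of `reference`."""
    if reference <= 0:
        raise ValueError("reference count must be positive")
    return abs(other - reference) * 100.0 / reference


def compare_counts(
    text: str,
    hf_model_id: str = "meta-llama/Llama-3.1-8B-Instruct",
    tiktoken_encoding: str = "cl100k_base",
) -> dict:
    """Count `text` with the model's own tokenizer and with tiktoken."""
    # Imported lazily so percent_difference works without these packages.
    import tiktoken
    from transformers import AutoTokenizer

    model_tokenizer = AutoTokenizer.from_pretrained(hf_model_id)
    model_count = len(model_tokenizer.encode(text, add_special_tokens=False))

    encoding = tiktoken.get_encoding(tiktoken_encoding)
    # disallowed_special=() makes tiktoken treat special-token text as plain text
    tiktoken_count = len(encoding.encode(text, disallowed_special=()))

    diff = percent_difference(model_count, tiktoken_count)
    return {
        "model_count": model_count,
        "tiktoken_count": tiktoken_count,
        "percent_difference": diff,
        "suspicious": diff > 5.0,  # threshold taken from the text above
    }
```

Run compare_counts on a few representative prompts (prose, code, non-English text) rather than a single sentence, since the gap varies by content type.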
2. Different vocabulary versions between application and server
Even within the Llama family, different checkpoint versions use different vocabulary sizes. Llama 2 has 32,000 tokens; Llama 3 has 128,256 tokens. The same sentence produces different token counts because the larger vocabulary can encode more characters in a single token. Mixing a Llama 2 tokenizer with a Llama 3 model produces systematic token-count drift.
How to spot it: Run python3 -c "from transformers import AutoTokenizer; t=AutoTokenizer.from_pretrained('model_path'); print(t.vocab_size, len(t))". The second number includes added special tokens, so it can be larger than vocab_size. Compare both numbers against those of the tokenizer used in your counting code.
3. Tokenizer not applying chat template when counting
The chat template adds special tokens (<|begin_of_text|>, <|start_header_id|>system<|end_header_id|>, <|eot_id|>, etc.) that have token IDs and count against the context. If your token-counting code counts only the raw message content without the template tokens, the actual context usage at the server will be systematically higher.
How to spot it: Count tokens with and without apply_chat_template. The difference is your template overhead. For Llama 3.1 with a system prompt, expect 20-40 extra tokens per message exchange from template tokens.
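One way to measure that overhead is sketched below. It assumes a HuggingFace-style tokenizer object that exposes encode and apply_chat_template; the function name template_overhead is hypothetical. The prompt is rendered to text first and then encoded, on the assumption that the template already emits the BOS token (as the Llama 3.1 template does).

```python
def template_overhead(tokenizer, messages: list[dict]) -> dict:
    """Compare the templated prompt length with the raw message content length."""
    # Raw content only: what a naive counter would report.
    raw = sum(
        len(tokenizer.encode(message["content"], add_special_tokens=False))
        for message in messages
    )

    # Render the full prompt as text, then encode it without adding a second BOS.
    rendered = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    templated = len(tokenizer.encode(rendered, add_special_tokens=False))

    return {"raw": raw, "templated": templated, "overhead": templated - raw}
```

The overhead grows with the number of messages, so measure it on a conversation of realistic length rather than a single turn.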
4. Counting with add_special_tokens=False
When tokenizing with HuggingFace, tokenizer.encode(text) defaults to add_special_tokens=True, which adds the tokenizer’s special tokens (BOS for Llama); add_special_tokens=False omits them. The inference server normally adds BOS to the prompt. If your counter uses add_special_tokens=False for the whole prompt, you’re undercounting by 1-3 tokens per request.
How to spot it: Compare len(tokenizer.encode(text, add_special_tokens=False)) vs. len(tokenizer.encode(text, add_special_tokens=True)). If they differ, check which your counting code uses.
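As a small sketch, the check can be wrapped in a helper. The name special_token_delta is hypothetical, and tokenizer is assumed to be any HuggingFace-style tokenizer whose encode accepts add_special_tokens.

```python
def special_token_delta(tokenizer, text: str) -> int:
    """Number of tokens that add_special_tokens=True contributes for this text."""
    with_special = tokenizer.encode(text, add_special_tokens=True)
    without_special = tokenizer.encode(text, add_special_tokens=False)
    return len(with_special) - len(without_special)
```

A result of 0 means the tokenizer adds nothing on its own; a positive result is the amount a raw-text counter misses when the server adds those tokens.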
5. Tokenizer running a different normalization or byte-fallback than the server
GGUF files embed the tokenizer vocabulary and byte-fallback rules. If the tokenizer used client-side (e.g., the Python HuggingFace tokenizer) and the one inside the GGUF were converted at different points in the model’s development history, normalization rules (whitespace handling, Unicode normalization) may differ slightly.
How to spot it: Send the exact byte sequence to both tokenizers. If tokenizer outputs differ for the same UTF-8 string, the byte-fallback or normalization rules diverged.
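To locate where two tokenizations diverge rather than just comparing totals, a sketch like the following can help. It assumes you already have both ID lists for the same model, for example from tokenizer.encode(text, add_special_tokens=False) on the client and from the "tokens" array of the server’s /tokenize response. The function name first_divergence is hypothetical.

```python
from itertools import zip_longest
from typing import Optional


def first_divergence(client_ids: list, server_ids: list) -> Optional[int]:
    """Return the index of the first differing token ID, or None if identical."""
    # zip_longest pads the shorter list with None, so a length mismatch
    # is reported at the first index past the end of the shorter list.
    for index, (client_id, server_id) in enumerate(zip_longest(client_ids, server_ids)):
        if client_id != server_id:
            return index
    return None
```

Decoding a few tokens around the returned index on the client side usually shows whether whitespace, Unicode normalization, or byte handling is responsible.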
6. System prompt token count not included in the budget
Many RAG applications count only the user query and context chunks, forgetting to subtract the system prompt token count from the available budget. A 500-token system prompt running against an 8192 context limit leaves only 7692 tokens, not 8192.
How to spot it: Add up: system_prompt_tokens + all_message_tokens + expected_completion_tokens. If this exceeds num_ctx, you’ll get truncation errors.
Shortest path to fix
Step 1: Use the correct tokenizer for your model family
from transformers import AutoTokenizer

# Correct: use the model-specific tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def count_tokens(text: str) -> int:
    # Raw text only; BOS and template tokens are counted in Step 2
    return len(tokenizer.encode(text, add_special_tokens=False))

# Wrong: using tiktoken for a Llama model
# import tiktoken
# enc = tiktoken.encoding_for_model("gpt-4")  # DO NOT use for Llama
Step 2: Count tokens including the chat template
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def count_chat_tokens(messages: list[dict]) -> int:
    """Count tokens for a full chat payload including template overhead."""
    templated = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
    )
    return len(templated)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain Docker networking in detail."},
]
print(f"Total tokens: {count_chat_tokens(messages)}")
Step 3: Cross-validate against the server’s token count
# Ask llama-server for its token count via the /tokenize endpoint
curl -s http://localhost:8080/tokenize \
  -H "Content-Type: application/json" \
  -d '{"content": "your text here"}' \
  | python3 -c "import json, sys; print(len(json.load(sys.stdin)['tokens']))"
# Or read Ollama's prompt token count from the generate response
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "your text here", "stream": false}' \
  | python3 -m json.tool | grep prompt_eval_count
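Compare the server’s count against your client-side count. Ollama’s /api/generate wraps the prompt in the model’s template unless you pass "raw": true, so prompt_eval_count includes template tokens. The goal is agreement within 2-3 tokens (for BOS/EOS differences).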
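The same Ollama check can be scripted from Python. The sketch below uses only the standard library and assumes Ollama is listening on its default local address. The helper names are hypothetical, and num_predict is set to 1 only to keep generation short while the prompt is still evaluated.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default local Ollama address


def build_generate_payload(model: str, prompt: str) -> dict:
    """Request body for /api/generate that limits generation to one token."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": 1},
    }


def extract_prompt_eval_count(response: dict) -> int:
    """Read the prompt token count, failing loudly if the field is absent."""
    if "prompt_eval_count" not in response:
        raise KeyError("response has no prompt_eval_count field")
    return int(response["prompt_eval_count"])


def ollama_prompt_tokens(model: str, prompt: str, base_url: str = OLLAMA_URL) -> int:
    """Send the prompt to Ollama and return the server-side prompt token count."""
    body = json.dumps(build_generate_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return extract_prompt_eval_count(json.load(response))


def counts_agree(client_count: int, server_count: int, tolerance: int = 3) -> bool:
    """True when the two counts are within the tolerance suggested above."""
    return abs(client_count - server_count) <= tolerance
```

Because the prompt is templated by Ollama here, compare the result against a client-side count that also includes the template, not against a raw-text count.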
Step 4: Build a context budget calculator with all components
def check_context_budget(
    system_prompt: str,
    user_message: str,
    context_chunks: list[str],
    max_context: int = 8192,
    max_completion: int = 1024,
    safety_margin: int = 128,
) -> dict:
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message + "\n\n" + "\n\n".join(context_chunks)},
    ]
    prompt_tokens = count_chat_tokens(messages)
    total_needed = prompt_tokens + max_completion + safety_margin
    return {
        "prompt_tokens": prompt_tokens,
        "max_completion": max_completion,
        "total_needed": total_needed,
        "available": max_context,
        "fits": total_needed <= max_context,
        "overflow": max(0, total_needed - max_context),
    }
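When the check reports an overflow, the usual remedy is to send fewer context chunks. The sketch below shows one possible approach under stated assumptions: the helper names are hypothetical, count_tokens is any function that returns a token count for a string, and chunks are kept in their original order until the next one no longer fits. Token counts are not exactly additive across concatenated text, so treat the result as an estimate and confirm it with the full budget check.

```python
from typing import Callable


def remaining_chunk_budget(
    base_prompt_tokens: int,
    max_context: int = 8192,
    max_completion: int = 1024,
    safety_margin: int = 128,
) -> int:
    """Tokens left for context chunks after the fixed parts of the request."""
    return max(0, max_context - max_completion - safety_margin - base_prompt_tokens)


def select_chunks_within_budget(
    count_tokens: Callable[[str], int],
    chunks: list,
    chunk_budget: int,
) -> list:
    """Keep chunks in order until the next one would exceed the budget."""
    selected = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > chunk_budget:
            break  # stop at the first chunk that does not fit
        selected.append(chunk)
        used += cost
    return selected
```

Here base_prompt_tokens would be the templated count of the system prompt and user message without any chunks attached.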
Step 5: Test tokenizer fidelity on edge-case content
test_cases = [
    "Hello, world!",
    "def foo(x: dict[str, int]) -> None:",
    "🎉 Unicode emoji test 🚀",
    "https://example.com/path?query=value&key=123",
    "<|begin_of_text|>",  # Special tokens: should not split unexpectedly
]
for text in test_cases:
    client_count = count_tokens(text)
    server_count = get_server_token_count(text)  # via /tokenize endpoint
    diff = abs(client_count - server_count)
    print(f"'{text[:30]}': client={client_count}, server={server_count}, diff={diff}")
Any diff greater than 2-3 tokens indicates a vocabulary or normalization mismatch.
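The loop above calls get_server_token_count without defining it. One possible implementation, using only the standard library, is sketched here. It assumes llama-server is reachable at http://localhost:8080 and that /tokenize returns a JSON object with a "tokens" array, as in Step 3. The parse_token_count helper is a hypothetical name.

```python
import json
import urllib.request

LLAMA_SERVER_URL = "http://localhost:8080"  # assumption: default local llama-server address


def parse_token_count(response_body: bytes) -> int:
    """Count the entries in the "tokens" array of a /tokenize response."""
    return len(json.loads(response_body)["tokens"])


def get_server_token_count(text: str, base_url: str = LLAMA_SERVER_URL) -> int:
    """Ask llama-server to tokenize `text` and return the number of tokens."""
    payload = json.dumps({"content": text}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/tokenize",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_token_count(response.read())
```

Sending the text as JSON from Python also avoids shell quoting problems with emoji and special characters that can affect the curl version.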
Prevention
- Never use tiktoken for non-OpenAI models; always use the model-specific HuggingFace tokenizer.
- Always count tokens with apply_chat_template to include template overhead in your budget calculations.
- Build a token count validation step into your RAG pipeline that compares client-side counts against the server’s prompt_eval_count in the response.
- Keep a safety margin of at least 128 tokens below num_ctx to absorb minor tokenizer discrepancies.
- When switching between model versions (e.g., Llama 2 to Llama 3), retest your token budget calculations, because vocabulary sizes and chat templates can differ.
- Cache the tokenizer instance at application startup rather than loading it per request, since loading a HuggingFace tokenizer can add 100-500ms per call.
- Pin the transformers package version in your requirements.txt, because tokenizer behavior can change between minor versions.
FAQ
Q: Why does tiktoken count fewer tokens than the Llama 3 tokenizer for the same text?
A: The vocabularies differ. GPT-4’s cl100k_base has roughly 100K tokens. Llama 3’s vocabulary (128,256 entries including special tokens) is also a byte-level BPE, but it adds tokens for non-English text and defines its own special tokens, so some patterns split differently. For code, the difference can be large when curly braces, parentheses, and operators end up as separate tokens in the model’s vocabulary, and the chat template adds tokens that a tiktoken count never sees.
Q: Is there a fast tokenizer that works for multiple model families?
A: The HuggingFace AutoTokenizer with use_fast=True uses the Rust-based tokenizer backend which is 3-5x faster than the Python implementation. You can also use the tokenizers library directly. For Llama family models, these are reliable. For production at scale, cache the tokenizer and use batch tokenization (tokenizer(texts) with a list) for bulk counting.
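A brief sketch of the batch approach, assuming a HuggingFace-style tokenizer that can be called with a list of strings and returns a mapping with an "input_ids" entry. The function name batch_token_counts is hypothetical.

```python
def batch_token_counts(tokenizer, texts: list) -> list:
    """Tokenize all texts in one call and return one token count per text."""
    # No padding is requested, so each entry keeps its own natural length.
    encoded = tokenizer(texts, add_special_tokens=False)
    return [len(ids) for ids in encoded["input_ids"]]
```

This is convenient for counting many retrieved chunks at once before deciding which ones fit in the budget.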
Q: How do I handle tokenizer drift in a multi-model deployment where different endpoints use different model families?
A: Maintain a tokenizer registry keyed by model name, e.g., {"llama3.1": llama3_tokenizer, "mistral": mistral_tokenizer}. Route each API call’s pre-tokenization through the correct tokenizer. Never use a single shared tokenizer for multiple model families.
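Such a registry might look like the sketch below. The class name TokenizerRegistry and its methods are hypothetical. The loader is passed in as a parameter (in practice likely AutoTokenizer.from_pretrained) so each tokenizer is loaded once on first use and then reused.

```python
from typing import Any, Callable


class TokenizerRegistry:
    """Maps deployment model names to lazily loaded, cached tokenizers."""

    def __init__(self, model_ids: dict, loader: Callable[[str], Any]):
        self._model_ids = dict(model_ids)  # e.g. {"llama3.1": "<tokenizer repo or path>"}
        self._loader = loader
        self._cache = {}

    def get(self, name: str) -> Any:
        """Return the tokenizer for `name`, loading it on first use."""
        if name not in self._model_ids:
            # Refuse to fall back to some other family's tokenizer.
            raise KeyError(f"no tokenizer registered for model {name!r}")
        if name not in self._cache:
            self._cache[name] = self._loader(self._model_ids[name])
        return self._cache[name]

    def count(self, name: str, text: str) -> int:
        """Count raw-text tokens with the tokenizer registered for `name`."""
        return len(self.get(name).encode(text, add_special_tokens=False))
```

Raising on an unknown model name is a deliberate choice in this sketch: a silent fallback to a default tokenizer would reintroduce the mismatch this article is about.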
Q: Does the llama.cpp tokenize endpoint give the exact same count as what the GGUF model uses for inference?
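A: Yes, provided the request options match. The /tokenize endpoint in llama-server uses the same tokenizer data embedded in the GGUF file that the inference engine uses, so it is the ground truth for token counts on that specific GGUF. When comparing against a full prompt, check whether the request adds special tokens such as BOS (the add_special option). Use it to validate your client-side tokenizer.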