You sent a prompt in English asking the model to summarize a Chinese article. The summary came back in Chinese. Or you set a system prompt in English, the user pasted a Japanese paragraph, and the assistant kept replying in Japanese for the rest of the conversation. Or worse — the first half of the answer is in English, then it silently switches to Spanish for the rest. The model is not broken. Without an explicit output-language instruction, it picks the language that has the strongest signal in the immediate context, and that signal is often the input text, not the system prompt.
This is one of the cheapest bugs to fix once you know where to look, and one of the most common to ship in multilingual products.
Common causes
Ordered roughly by hit rate in real-world pipelines.
1. No explicit output-language instruction
The prompt says “summarize this” but never says “in English.” The model defaults to matching the dominant language of the input text, which is often not what you want.
How to spot it: Search your prompt template for the word “English” or “Chinese” (or whatever target). If neither output language nor input language is named, this is the bug.
2. System prompt language differs from user input language
System prompt is English. User message is Japanese. Modern models weight the most recent / longest content most, so they reply in Japanese. The system instruction loses.
How to spot it: Reproduce by sending a system prompt in language A and a user message that is mostly language B. If the reply matches B, you have this pattern.
3. Mixed-language few-shot examples
Your few-shot block has 3 examples — 2 in English, 1 in Chinese. The model treats this as “either is acceptable” and picks based on input.
How to spot it: Audit the examples in your prompt template. If they are not all in the target output language, fix the examples.
4. The input contains a different-language quote
User asks in English: “Summarize this review.” The review is a long Japanese block. The model echoes the dominant language of the content to summarize, not the question.
How to spot it: Whenever the document being processed is longer than the wrapper instructions and in a different language, expect drift.
5. Mid-conversation language switch sticks
User opened in English, switched to Chinese for one turn, then back to English. The assistant kept replying in Chinese because the last user turn it saw was Chinese during its KV-cache window.
How to spot it: Look at the most recent user message before the bad reply. If it was a different language, the model latched onto it.
6. Translation task confused with summarization
Prompt: “Process this Spanish article.” Ambiguous — translate? Summarize? Extract? Without a task verb, the model often defaults to translation, especially across languages.
How to spot it: The prompt uses a verb like “process,” “handle,” or “deal with” instead of “summarize in English.”
7. Language inferred from output schema, but schema is empty
You ask for JSON. The JSON keys are English. But values can be any language. The model fills values in the input language because the schema didn’t force English.
How to spot it: JSON schema/example shows English keys but no language constraint on values.
Shortest path to fix
Step 1: State the output language explicitly, in the system prompt
Put it at the top, not the bottom, and use language the model parses directly.
You are a summarization assistant.
ALWAYS reply in English, regardless of the input language.
Never reply in Chinese, Japanese, or any other language unless explicitly asked.
The negative constraint matters — without it, models still drift on long non-English inputs.
Step 2: Repeat the language requirement in the user prompt for high-risk calls
For one-shot calls processing user-supplied text, restate at the bottom of the user message:
[long Japanese article here]
---
Summarize the above in 3 bullet points. Reply in English only.
The “Reply in English only” should be the LAST thing in the prompt — recency wins.
Step 3: Match few-shot examples to the target language
If you want English output, every example output must be English. Inputs can be mixed (that’s realistic), but outputs cannot.
Input: 这家餐厅服务很差。
Output: Service was poor.
Input: La nourriture est incroyable.
Output: The food is amazing.
Step 4: Pin language in JSON schema
If you return structured output, document the language constraint per field:
{
"summary": "string, English, max 200 chars",
"sentiment": "positive | neutral | negative"
}
Models honor field-level descriptions more reliably than they honor a global “be in English” line buried in the system prompt.
Step 5: Validate output language and retry
Use a fast language detector (e.g., langdetect in Python, franc in JS) on the output. If it doesn’t match the target, retry with a stronger reminder:
import langdetect
out = call_llm(prompt)
if langdetect.detect(out) != "en":
out = call_llm(prompt + "\n\nNote: previous reply was in the wrong language. Reply in English ONLY.")
Step 6: For multi-turn chats, pin language per session
Store user’s preferred language in session state and inject it into every system prompt:
User language preference: en-US
Always reply in en-US regardless of message language.
This survives mid-conversation switches.
Step 7: Watch for partial drift
Sometimes the model switches halfway through. Detect this by language-detecting each paragraph, not just the full output. If paragraph N is English and N+1 is Spanish, that’s the bug.
When this is not on you
Some open-weight models simply don’t speak some languages well, and they will fall back to one they know better. If you ask Llama 2 to reply in Vietnamese, it may drift to English just because its Vietnamese capability is weaker.
Easy to misdiagnose as
A “model bug” or a “prompt injection” attack. Most of the time it’s just an unspecified output language combined with longer input than instruction. Check for an explicit language line before assuming malice.
Prevention
- Every multilingual system prompt should name the output language in the first 3 lines.
- Few-shot example outputs all in the target language. No exceptions.
- Restate output language at the END of any one-shot user prompt for high-stakes calls.
- Validate output language post-hoc with a detector; retry with stronger instruction if mismatched.
- For chat sessions, store and inject user language preference on every turn.
FAQ
- Should I include the language instruction in English or in the target language? Both, ideally. Models follow native-language instructions more strongly when generating that language.
- Does temperature affect language drift? Yes — higher temperatures (>0.8) increase drift on borderline cases. Drop to 0.3 for language-critical tasks.
Related
- Latest sentence overrides earlier instructions
- Prompt misused system vs user role
- No output format specified
- Conflicting instructions weaken output
- Missing examples cause output drift
- Long prompt degrades output
- Mixed-tone instructions
- Too many examples overwhelm the model
- Prompt lacks context hierarchy
- AI output style drift
Tags: #Prompt engineering #Troubleshooting #llm-output #language-drift #multilingual #system-prompt