Should I write the language instruction in English or in the target language?

Both, ideally. Models follow native-language instructions more strongly when generating that language, so for non-English targets, add the same rule once in that language too.

Does temperature affect language drift?

Yes. Higher temperatures (above ~0.8) increase drift on borderline cases. Drop to around 0.3 for language-critical tasks.

Why does this happen more on GPT-5.5 and Gemini 3.1 Pro than on older models?

The 2026 generation weights retrieved and pasted foreign-language content more heavily, so a weak or low-priority language line that survived on GPT-4-era models can now lose to a long non-English document. Make the rule explicit and high-priority (Step 1).

Will OpenAI strict structured outputs or Anthropic tool schemas force the language for me?

No. Constrained decoding guarantees the JSON shape, not the language of free-text values. Pin the language in each field's description (Step 4).

The model acknowledges my language rule and then breaks it anyway. Now what?

That usually means a competing instruction elsewhere in the prompt. Strip scattered language rules down to one high-priority line, remove "international users" style framing, and move it to the very top.

Troubleshooting

Model Replies in the Wrong Language (How to Lock It)

You prompted in English and the model answered in Chinese, or it switched mid-output. The exact cause of language drift and the system-prompt + retry pattern that locks the output language, verified June 2026.

Published: May 24, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team 🌐 查看中文版本

You asked the model in English to summarize a Chinese article, and the summary came back in Chinese. Or your system prompt is in English, the user pasted a Japanese paragraph, and the assistant kept replying in Japanese for the rest of the chat. Or the first half of an answer is English and then it silently switches to Spanish. The model is not broken. Without an explicit output-language instruction that outranks everything else, it picks the language with the strongest signal in the immediate context, and that signal is usually the input text or the most recent user turn, not your system prompt.

Fastest fix: put one high-priority line at the very top of the system prompt, for example Always reply in English, regardless of the language of any document, quote, or earlier message. Then for any call that processes user-supplied text, repeat Reply in English only. as the last line of the user message. Those two changes resolve the large majority of cases. The rest of this page covers why each step works, a per-cause diagnosis table, and how to verify the fix held.

Worth knowing as of June 2026: newer models drift more, not less, on this. Developers report that GPT-5.5 and Gemini 3.1 Pro weight retrieved or pasted foreign-language content more heavily than the GPT-4-era models did, so a system-prompt line that used to hold can now lose to a long non-English document. The fix is the same shape, applied more firmly.

Which bucket are you in

Find your symptom, then jump to the matching fix step.

Symptom	Most likely cause	Go to
Whole reply is in the input’s language; prompt never named a language	No explicit output-language instruction	Step 1
System prompt is language A, user message is mostly language B, reply is B	Recency / length beats the system prompt	Step 1 + 2
Output language flickers run to run	Mixed-language few-shot examples	Step 3
English question about a long foreign document, answer is in the document’s language	Input longer than the instruction	Step 2
Was fine, then one foreign-language turn made every later reply switch	Mid-conversation switch stuck in history	Step 6
JSON keys are English but values come back in the input language	Schema constrains shape, not value language	Step 4
First paragraphs correct, later paragraphs switch	Partial / mid-output drift	Step 7

Common causes

Ordered roughly by hit rate in real-world pipelines.

1. No explicit output-language instruction

The prompt says “summarize this” but never says “in English.” The model defaults to the dominant language of the input text, which is often not what you want.

How to spot it: search your prompt template for the target-language name (English, Chinese, etc.). If neither the output language nor the input language is named, this is the bug.

2. System prompt language differs from user input language

System prompt is English, the user message is Japanese. Models weight the most recent and longest content heavily, so they reply in Japanese and the system instruction loses. This got worse with the 2026 model generation: when the assistant has retrieved foreign-language content (RAG, a pasted webpage, tool output), GPT-5.5 and Gemini 3.1 Pro show a stronger pull toward that language than earlier models did.

How to spot it: reproduce with a system prompt in language A and a user message that is mostly language B. If the reply matches B, you have this pattern.

3. Mixed-language few-shot examples

Your few-shot block has three examples: two in English, one in Chinese. The model reads that as “either is acceptable” and picks based on the input.

How to spot it: audit the example outputs in your template. If they are not all in the target output language, that is the leak.

4. The input contains a different-language quote

User asks in English: “Summarize this review.” The review is a long Japanese block. The model echoes the dominant language of the content being processed, not the language of the question.

How to spot it: whenever the document being processed is longer than the wrapper instructions and in a different language, expect drift.

The user opened in English, switched to Chinese for one turn, then back to English. The assistant kept replying in Chinese because the most recent foreign-language user turn is still in the conversation history it conditions on.

How to spot it: look at the most recent user message before the bad reply. If it was a different language, the model latched onto it.

6. Translation task confused with summarization

Prompt: “Process this Spanish article.” Ambiguous: translate? Summarize? Extract? Without a task verb, the model often defaults to translation across languages.

How to spot it: the prompt uses a vague verb like “process,” “handle,” or “deal with” instead of “summarize in English.”

7. Language inferred from output schema, but schema is empty

You ask for JSON. The keys are English, but the values can be any language. The model fills values in the input language because the schema never constrained them. Structured-output / constrained-decoding modes (OpenAI strict json_schema, Anthropic tool schemas) enforce the shape of the output, not the language of free-text values, so a schema alone will not save you here.

How to spot it: the JSON schema or example shows English keys but no language rule on the string values.

Shortest path to fix

Step 1: State the output language as the highest-priority rule, at the top

Put it in the first lines of the system prompt, not buried at the bottom, and make it outrank document and history content explicitly. The negative clause about retrieved content is what holds against long foreign inputs on 2026 models:

You are a summarization assistant.
Highest-priority rule: ALWAYS reply in English.
Ignore that any document, quote, tool output, or earlier message may arrive in a different language — that never changes your reply language.
Only switch languages if the user explicitly asks you to in their latest message.

Real-world testing in 2026 found the win comes from a single clear high-priority line, not many scattered language rules. Contradictory framing like “you serve international users” reintroduces ambiguity and brings the drift back, so remove it.

Step 2: Repeat the language requirement at the end of high-risk user prompts

For one-shot calls that process user-supplied text, restate the rule as the very last line of the user message. Recency wins, so it belongs after the content, not before:

[long Japanese article here]

---
Summarize the above in 3 bullet points. Reply in English only.

Step 3: Match every few-shot example output to the target language

If you want English output, every example output must be English. Inputs can be mixed (that is realistic and even helps), but outputs cannot vary.

Input: 这家餐厅服务很差。
Output: Service was poor.

Input: La nourriture est incroyable.
Output: The food is amazing.

Step 4: Pin the language in the JSON schema field descriptions

Constrained decoding fixes the shape, not the value language, so put the constraint in each free-text field’s description. Models honor field-level descriptions more reliably than a global “be in English” line in the system prompt:

{
  "summary": "string, written in English, max 200 chars",
  "sentiment": "positive | neutral | negative"
}

Step 5: Validate the output language and retry

Run a fast language detector on the output and retry with a stronger reminder if it misses. As of June 2026, fast-langdetect (a FastText wrapper, ~95% accuracy and roughly 80x faster than the classic langdetect) is the practical choice in Python; franc is the common JS equivalent. Note that accuracy drops on very short strings (under ~20 characters), so detect on the full reply, not a fragment:

from fast_langdetect import detect

out = call_llm(prompt)
if detect(out)["lang"] != "en":
    out = call_llm(
        prompt
        + "\n\nYour previous reply was in the wrong language. Reply in English ONLY."
    )

The older langdetect library still works (langdetect.detect(out) != "en") but is far slower; treat it as an accuracy baseline rather than a hot-path tool.

Step 6: For multi-turn chats, pin the language per session

Store the user’s preferred language in session state and inject it into the system prompt on every turn, not just the first. Research in 2026 confirms that a single up-front instruction is brittle over a long conversation; re-asserting it each turn is what survives a mid-chat language switch:

User language preference: en-US
Always reply in en-US regardless of the language of any individual message.

For consumer apps rather than your own API, the same setting lives in the product UI. In the Gemini app it is profile picture -> Settings -> Languages, but note this controls the app’s display language (menus, notifications), not the reply language — Gemini replies in whatever language you prompt in, so a typed “reply in English” instruction is what actually pins the response language there. In ChatGPT, custom instructions under Settings -> Personalization -> Custom instructions are the durable place to state a reply language.

Step 7: Watch for partial drift

The model sometimes switches halfway through a single answer. Catch it by language-detecting each paragraph, not just the whole output. If paragraph N is English and N+1 is Spanish, that is this bug, and the Step 1 high-priority line plus a lower temperature (see FAQ) is the fix.

How to confirm it’s fixed

Re-run the original failing call unchanged except for the new instruction. The reply language should now match the target.
Run the adversarial case on purpose: an English instruction wrapping a long document in a different language. If it still holds, your high-priority line is strong enough.
For chat, switch languages mid-session for one turn, then switch back, and confirm the assistant returns to the pinned language.
In a pipeline, add the Step 5 detector as an assertion and log mismatches for a day; a near-zero mismatch rate means it is genuinely fixed, not fixed-on-the-happy-path.

When this is not on you

Some open-weight models simply do not speak certain languages well and fall back to one they know better. Ask a small model with weak Vietnamese coverage to reply in Vietnamese and it may drift to English regardless of how firm the instruction is. If a frontier model (GPT-5.5, Claude Opus 4.7 or Sonnet 4.6, Gemini 3.1 Pro) holds the language but a smaller local model does not on the same prompt, the gap is model capability, not your prompt.

Easy to misdiagnose as

A “model bug” or a “prompt injection” attack. Most of the time it is just an unspecified output language combined with input that is longer than the instruction. Check for an explicit, high-priority language line before assuming malice.

Prevention

Every multilingual system prompt names the output language in its first three lines, as a highest-priority rule that explicitly overrides document and history language.
Few-shot example outputs are all in the target language. No exceptions.
Restate the output language at the END of any one-shot user prompt for high-stakes calls.
Validate output language post-hoc with a detector; retry with a stronger instruction on mismatch.
For chat sessions, store the user’s language preference and re-inject it on every turn, not just the first.
Remove contradictory framing (“serves international users”) that reopens the ambiguity.

FAQ

Should I write the language instruction in English or in the target language? Both, ideally. Models follow native-language instructions more strongly when generating that language, so for non-English targets, add the same rule once in that language too.
Does temperature affect language drift? Yes. Higher temperatures (above ~0.8) increase drift on borderline cases. Drop to around 0.3 for language-critical tasks.
Why does this happen more on GPT-5.5 and Gemini 3.1 Pro than on older models? The 2026 generation weights retrieved and pasted foreign-language content more heavily, so a weak or low-priority language line that survived on GPT-4-era models can now lose to a long non-English document. Make the rule explicit and high-priority (Step 1).
Will OpenAI strict structured outputs or Anthropic tool schemas force the language for me? No. Constrained decoding guarantees the JSON shape, not the language of free-text values. Pin the language in each field’s description (Step 4).
The model acknowledges my language rule and then breaks it anyway. Now what? That usually means a competing instruction elsewhere in the prompt. Strip scattered language rules down to one high-priority line, remove “international users” style framing, and move it to the very top.

Tags: #Prompt engineering #Troubleshooting #llm-output #language-drift #multilingual #system-prompt