Misconfigured RoPE Scaling Garbles Long-Context Output

A local LLM stays coherent up to its native context length, then degenerates into repetition or gibberish. Diagnose and fix RoPE scaling (YaRN, llama3, rope_theta) in llama.cpp and vLLM.

Published: May 25, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

You run a local Llama 3 8B server, set --ctx-size 32768 so it can chew through a long document, and the first few thousand tokens come back fine. Then it falls apart: repeated phrases, random topic switches, broken pronoun references, and eventually a stream of nonsense tokens. The model is still generating, but it has lost its positional grounding for anything past the context length it was trained on. That is the signature of missing or misconfigured RoPE (Rotary Position Embedding) scaling — the mechanism that lets a model address positions beyond its base training window.

Fastest fix (June 2026): if you are on a model that natively ships a long context (Llama 3.1, Qwen2.5/Qwen3, Mistral v0.3+, Gemma 2), do NOT pass any RoPE flags at all. The correct scaling is baked into the GGUF or config.json, and adding --rope-scaling on top of it is the most common way people break their own output. Just set --ctx-size to a value the model supports and stop there. Only the case of pushing an old, short-context base model (Llama 2 4k, Llama 3 8k) past its native window requires manual YaRN scaling — and even then, keep the extension to about 4x.

Which bucket are you in?

Your model	Native context	What to do
Llama 3.1 / 3.2 / 3.3	128k	No RoPE flags. Set `--ctx-size` up to 131072.
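Qwen2.5 / Qwen3	32k-128k	No RoPE flags within the context length the shipped config declares. Set `--ctx-size`; going beyond that length needs YaRN enabled as the model card describes.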
Mistral 7B v0.3, Mixtral	32k	No RoPE flags (`rope_theta=1000000`).
Gemma 2 / 3	8k-128k	No RoPE flags.
Llama 3 8B (not 3.1)	8k	Manual YaRN, `--rope-scale 4`, stay at or below ~32k.
Llama 2	4k	Manual linear or YaRN, stay at or below ~16k.

If your model is in the top four rows and you are passing --rope-scaling, --rope-scale, or --rope-freq-base, remove them first and re-test before reading any further. Most “garbled long context” reports on modern models are self-inflicted overrides.

Common causes

Ordered by hit rate, highest first.

1. Manual RoPE flags applied on top of a model that already scales itself

This is now the number-one cause. Llama 3.1, Qwen2.5/Qwen3, and Mistral v0.3 already encode their context-extension settings in the GGUF metadata (llama.rope.scaling.type, llama.rope.freq_base, llama.rope.scaling.factor). When you add --rope-scaling yarn --rope-scale 4 on the command line, you stack a second scaling on top of the built-in one, which double-scales the frequencies and corrupts positions well before the advertised limit.

How to spot it: remove every --rope-* and --yarn-* flag, keep only --ctx-size, and re-run the retrieval test in Step 6 below. If coherence returns, the flags were the problem.
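
The "remove every flag and re-test" check can be scripted. The following is a minimal sketch, not part of llama.cpp: `strip_rope_flags` is a hypothetical helper, and it assumes each `--rope-*` / `--yarn-*` flag is followed by exactly one space-separated value.

```python
def strip_rope_flags(argv):
    """Return (cleaned_argv, removed) with --rope-* / --yarn-* flags dropped.

    Assumption: every such flag takes exactly one value, passed as the
    next argument (for example "--rope-scale 4").
    """
    cleaned, removed = [], []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg.startswith(("--rope-", "--yarn-")):
            removed.append(" ".join(argv[i:i + 2]))
            i += 2  # skip the flag and its value
        else:
            cleaned.append(arg)
            i += 1
    return cleaned, removed


launch = [
    "./llama-server", "-m", "models/llama-3.1-8b-instruct-Q4_K_M.gguf",
    "--ctx-size", "131072",
    "--rope-scaling", "yarn", "--rope-scale", "4",
    "-ngl", "99",
]
cleaned, removed = strip_rope_flags(launch)
print("removed:", removed)
print("re-test with:", " ".join(cleaned))
```

If the probe passes with the cleaned command line, the removed flags were the cause.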

2. Context extended past native length with no scaling at all

The opposite mistake on an old model. Llama 2 was trained at 4096 tokens; Llama 3 base at 8192. Setting --ctx-size 32768 on one of these without any RoPE extension makes the positional encodings extrapolate far outside their training distribution, so everything past the native window degrades.

How to spot it: the output is clean up to roughly the model’s max_position_embeddings value, then falls off a cliff at that exact boundary. Check config.json for max_position_embeddings and compare it to your --ctx-size.
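
The comparison described above is easy to automate. This sketch assumes you already have the model's `config.json` as a dict; `cliff_risk` is a hypothetical helper name, and the sample values are the Llama 3 base figures quoted in this article.

```python
import json


def cliff_risk(config, ctx_size):
    """Compare a requested context size with the model's native window."""
    native = config["max_position_embeddings"]
    scaling = config.get("rope_scaling")
    if ctx_size <= native:
        return f"ok: {ctx_size} fits inside the native window ({native})"
    if scaling:
        return f"check: {ctx_size} exceeds {native}, but the config declares rope_scaling"
    return f"risk: {ctx_size} exceeds {native} and the config declares no rope_scaling"


# Stand-in for a real config.json (values as cited for Llama 3 base)
llama3_base = json.loads('{"max_position_embeddings": 8192, "rope_scaling": null}')
print(cliff_risk(llama3_base, 32768))
print(cliff_risk(llama3_base, 8192))
```

For a real model, replace the inline JSON with `json.load(open("/path/to/model/config.json"))`.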

3. Wrong RoPE scaling type for the model

llama.cpp’s --rope-scaling flag accepts only {none, linear, yarn} (verified against the current llama-server README, June 2026). linear is the crudest method and visibly degrades past about 2x. yarn (Yet Another RoPE ExtensioN) holds quality much better up to ~4x because it scales low- and high-frequency components differently. Picking linear for a 4x stretch on a Llama-family model garbles the long range.
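
To make the difference concrete, here is a rough sketch of how the two methods treat the per-dimension rotation frequencies. It follows the "NTK-by-parts" interpolation described in the YaRN paper, not llama.cpp's actual code path. The head dimension of 128 and the ramp bounds `alpha=1`, `beta=32` are assumptions for illustration, and YaRN's additional attention scaling is left out.

```python
import math


def rope_freqs(base, head_dim):
    """Per-pair rotation frequencies theta_i = base^(-2i/d)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def linear_scaled(freqs, factor):
    """Linear interpolation: every frequency is divided by the factor."""
    return [f / factor for f in freqs]


def yarn_scaled(freqs, factor, orig_ctx, alpha=1.0, beta=32.0):
    """Frequency part of YaRN: interpolate only the slow-rotating dimensions."""
    out = []
    for f in freqs:
        # Number of full turns this dimension completes inside the native window
        rotations = orig_ctx * f / (2.0 * math.pi)
        ramp = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        # ramp = 1 keeps the frequency as trained, ramp = 0 divides it by the factor
        out.append((1.0 - ramp) * f / factor + ramp * f)
    return out


freqs = rope_freqs(base=500000.0, head_dim=128)
lin = linear_scaled(freqs, 4.0)
yarn = yarn_scaled(freqs, 4.0, orig_ctx=8192)
for i in (0, 32, 63):
    print(f"dim pair {i}: original={freqs[i]:.3e} linear={lin[i]:.3e} yarn={yarn[i]:.3e}")
```

The fastest-rotating pair (index 0) is untouched by YaRN but slowed 4x by linear scaling, which is why linear blurs short-range attention. The slowest pair is treated identically by both.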

Note there is no llama3 value for the --rope-scaling CLI flag. The llama3-style scaling used by Llama 3.1 is a config.json/GGUF metadata rope_type, auto-detected at load time — you never pass it on the command line.

How to spot it: check the model card’s config.json rope_scaling.rope_type (or type). For Llama 3.1 it is "llama3"; for many long-context fine-tunes it is "yarn".

4. Wrong `--rope-scale` factor

--rope-scale N expands context by a factor of N and should equal target_ctx / native_ctx. Extending a native-8k model to 32k means --rope-scale 4. Setting 2 under-scales (positions past 16k drift); setting 8 over-scales (short-range attention smears even at 8k). A wrong factor degrades quality at moderate lengths, not just at the extreme.

How to spot it: compute target_ctx / native_ctx and compare to your --rope-scale. With YaRN you should also set --yarn-orig-ctx to the native length so the runtime knows the original window.
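
The arithmetic above can be wrapped in a small helper so the factor and the flags never drift apart. This is a sketch: `yarn_flags` is a hypothetical name, the 4x ceiling is this article's rule of thumb rather than a hard llama.cpp limit, and only flags already mentioned in this article are emitted.

```python
def yarn_flags(native_ctx, target_ctx, max_factor=4.0):
    """Build llama-server context flags for a short-context base model."""
    if target_ctx <= native_ctx:
        # Inside the native window: no RoPE flags are needed
        return ["--ctx-size", str(target_ctx)]
    factor = target_ctx / native_ctx
    if factor > max_factor:
        raise ValueError(
            f"factor {factor:g} exceeds {max_factor:g}x; prefer a natively long-context model"
        )
    return [
        "--ctx-size", str(target_ctx),
        "--rope-scaling", "yarn",
        "--rope-scale", f"{factor:g}",
        "--yarn-orig-ctx", str(native_ctx),
    ]


# Llama 3 base (native 8k) stretched to 32k
print(" ".join(yarn_flags(8192, 32768)))
```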

5. `rope_theta` not preserved in a third-party GGUF

Long-context models use a high rope_theta (rope_freq_base) for stable embeddings at high positions: 500000 for Llama 3/3.1, 1000000 for Mistral, versus 10000 for Llama 2. If a community GGUF was converted by an old or buggy script, the base frequency can be wrong, and llama.cpp then falls back to a default that breaks long-context coherence even with correct CLI flags.

How to spot it: read the embedded value (Step 2) and compare it to the HuggingFace config.json rope_theta.
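
Once you have both numbers, the comparison itself is trivial. This sketch assumes you pass in the value read from the GGUF (or `None` if the key is absent) and the `rope_theta` from `config.json`. `check_freq_base` is a hypothetical helper, and the 10000 fallback is the default named in this article.

```python
import math

DEFAULT_FREQ_BASE = 10000.0  # fallback when the GGUF carries no freq_base


def check_freq_base(gguf_value, hf_rope_theta):
    """Compare the GGUF base frequency with the source model's rope_theta."""
    effective = DEFAULT_FREQ_BASE if gguf_value is None else float(gguf_value)
    if math.isclose(effective, float(hf_rope_theta), rel_tol=1e-6):
        return "match"
    return (
        f"mismatch: GGUF gives {effective:.0f}, "
        f"config.json says {float(hf_rope_theta):.0f}; reconvert the GGUF"
    )


print(check_freq_base(500000.0, 500000.0))  # healthy Llama 3.1 conversion
print(check_freq_base(None, 500000.0))      # key missing, falls back to 10000
```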

6. vLLM not applying `rope_scaling` from the config

vLLM reads rope_scaling from config.json automatically when you load by Hub ID. It misbehaves when you load from a local path whose config.json is missing or stale, or when an extension that is NOT in the config needs to be added. As of June 2026 the old --rope-scaling CLI argument is no longer supported in vLLM; you supply overrides through --hf-overrides (Step 5).

How to spot it: log the startup output and look for the rope configuration vLLM resolved, or print AutoConfig.from_pretrained(...).rope_scaling for the path you are loading.

Shortest path to fix

Step 1: Read the model’s expected RoPE configuration

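# Requires: pip install transformers (>= 4.43.1 for Llama 3.1 rope_scaling)
# The meta-llama repos are gated: accept the license and log in (huggingface-cli login) first.
from transformers import AutoConfig

# trust_remote_code is not needed here: Llama is a built-in architecture.
config = AutoConfig.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# getattr keeps the script from crashing on a config that lacks one of these attributes
for name in ("max_position_embeddings", "rope_theta", "rope_scaling"):
    print(f"{name}:", getattr(config, name, None))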

For Llama 3.1 you should see rope_theta=500000 and:

{
  "factor": 8.0,
  "low_freq_factor": 1.0,
  "high_freq_factor": 4.0,
  "original_max_position_embeddings": 8192,
  "rope_type": "llama3"
}

If rope_scaling is None and max_position_embeddings is small (4096 or 8192), you have a short-context base model and must scale manually (Step 4). If it is populated, the model self-scales and you should pass NO RoPE flags.
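
That decision rule can be written down as a small function. The sketch below works on a plain dict (for example a parsed `config.json`). `rope_plan` is a hypothetical helper, and the fallback to a `rope_parameters` key with a `"default"` type is an assumption about newer config layouts, not something every model ships.

```python
def rope_plan(config):
    """Decide between 'no RoPE flags' and 'manual scaling' from a config dict."""
    native = config.get("max_position_embeddings")
    # Assumption: newer configs may carry the same data under "rope_parameters"
    scaling = config.get("rope_scaling") or config.get("rope_parameters") or {}
    # Older configs use "type", newer ones "rope_type"
    rope_type = scaling.get("rope_type") or scaling.get("type")
    if rope_type and rope_type != "default":
        return f"self-scaling ({rope_type}): pass no RoPE flags, set context size up to {native}"
    return f"no scaling in config: stay at or below {native}, or scale manually (Step 4)"


llama31 = {
    "max_position_embeddings": 131072,
    "rope_scaling": {
        "factor": 8.0,
        "low_freq_factor": 1.0,
        "high_freq_factor": 4.0,
        "original_max_position_embeddings": 8192,
        "rope_type": "llama3",
    },
}
llama3_base = {"max_position_embeddings": 8192, "rope_scaling": None}

print(rope_plan(llama31))
print(rope_plan(llama3_base))
```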

Step 2: Verify the GGUF carries the right rope_theta

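python3 << 'EOF'
# Requires: pip install gguf
from gguf import GGUFReader, GGUFValueType

reader = GGUFReader("models/llama-3.1-8b-instruct-Q4_K_M.gguf")

# Keys are prefixed with the GGUF architecture name ("llama" here; other families differ)
rope_freq = reader.fields.get("llama.rope.freq_base")
if rope_freq is not None:
    print("rope_theta in GGUF:", rope_freq.parts[-1][0])
else:
    print("rope_theta not in GGUF -- llama.cpp will use default 10000")

ctx_len = reader.fields.get("llama.context_length")
if ctx_len is not None:
    print("context_length in GGUF:", ctx_len.parts[-1][0])

# Dump any embedded scaling metadata; string values are stored as raw bytes
for f in reader.fields.values():
    if "rope" in f.name:
        raw = f.parts[-1]
        is_str = bool(f.types) and f.types[0] == GGUFValueType.STRING
        print(f.name, "=", bytes(raw).decode("utf-8") if is_str else list(raw))
EOF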

If llama.rope.freq_base reads 10000 on a model that should be 500000 (Llama 3.1) or 1000000 (Mistral), the GGUF was converted wrong. Reconvert it (Step 3) rather than patching with CLI flags.

Step 3: Re-convert the GGUF from source weights

# convert_hf_to_gguf.py preserves rope_theta and scaling metadata
python convert_hf_to_gguf.py \
  /path/to/Meta-Llama-3.1-8B-Instruct \
  --outtype f16 \
  --outfile llama-3.1-8b-instruct-f16.gguf

# Confirm the base frequency survived
python3 -c "import gguf; r=gguf.GGUFReader('llama-3.1-8b-instruct-f16.gguf'); print(r.fields.get('llama.rope.freq_base').parts[-1][0])"
# Expect 500000 for Llama 3.1

Prefer reconverting from the original HuggingFace weights over trusting an unknown third-party GGUF.

Step 4: Configure llama.cpp correctly

For a modern long-context model, the entire fix is “do not add RoPE flags”:

# Llama 3.1 8B -- RoPE is embedded; just size the context
./llama-server \
  -m models/llama-3.1-8b-instruct-Q4_K_M.gguf \
  --ctx-size 131072 \
  -ngl 99

# Mistral 7B v0.3 (32k native, rope_theta=1000000) -- also no flags
./llama-server \
  -m models/Mistral-7B-Instruct-v0.3-Q4_K_M.gguf \
  --ctx-size 32768 \
  -ngl 99

Only when you are stretching a short-context base model past its native window do you scale manually. --rope-scaling accepts {none, linear, yarn}; use yarn and tell it the original window with --yarn-orig-ctx:

# Llama 3 8B (native 8k) extended to 32k -- factor = 32768 / 8192 = 4
./llama-server \
  -m models/llama-3-8b-instruct-Q4_K_M.gguf \
  --ctx-size 32768 \
  --rope-scaling yarn \
  --rope-scale 4 \
  --yarn-orig-ctx 8192 \
  -ngl 99

Leave --yarn-attn-factor, --yarn-beta-fast, and --yarn-beta-slow at their defaults (each defaults to -1.00, meaning “auto from the model”) unless the model card publishes specific values. If the embedded rope_theta was wrong and you cannot reconvert, you can override it directly with --rope-freq-base 500000, but reconverting is cleaner.

Step 5: For vLLM, override through `--hf-overrides`

vLLM auto-applies rope_scaling when you load a model whose config.json already has it, so the first thing to try is loading by the Hub ID with no extra flags:

vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 131072

To extend a model whose config does NOT define scaling, pass the parameters as a JSON override (the old --rope-scaling CLI arg is no longer supported as of June 2026):

vllm serve Qwen/Qwen3-8B \
  --hf-overrides '{"rope_parameters": {"rope_type": "yarn", "factor": 4.0, "original_max_position_embeddings": 32768, "rope_theta": 1000000}}' \
  --max-model-len 131072

--max-model-len is the new maximum after extension (original x factor). Watch the startup log for the resolved rope configuration to confirm it took effect.
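
Hand-writing that JSON invites quoting mistakes and a `--max-model-len` that does not match the factor. A small generator keeps the two consistent. This is a sketch: `yarn_override` is a hypothetical helper that reuses the key names from the command above and prints the command rather than running it.

```python
import json
import shlex


def yarn_override(native_ctx, factor, rope_theta):
    """Return (hf_overrides_json, max_model_len) for a YaRN extension."""
    overrides = {
        "rope_parameters": {
            "rope_type": "yarn",
            "factor": float(factor),
            "original_max_position_embeddings": native_ctx,
            "rope_theta": rope_theta,
        }
    }
    # New maximum = original window x factor
    return json.dumps(overrides), int(native_ctx * factor)


overrides_json, max_len = yarn_override(native_ctx=32768, factor=4, rope_theta=1000000)
print(
    "vllm serve Qwen/Qwen3-8B "
    f"--hf-overrides {shlex.quote(overrides_json)} "
    f"--max-model-len {max_len}"
)
```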

Step 6: Confirm it is fixed with a retrieval probe

The cheapest reliable test is needle-in-a-haystack: bury a marker deep in a long prompt and ask the model to fetch it.

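python3 << 'EOF'
import requests

# Each marker costs very roughly 10-15 tokens (an estimate; it varies by tokenizer),
# so 200 markers is only about 2-3k tokens. Keep 200 for the short run and raise
# N_MARKERS (for example to 2000) for the run that must exceed the native window.
N_MARKERS = 200
TARGET = (3 * N_MARKERS) // 4  # a position deep in the prompt

markers = " ".join(f"Position {i}: the keyword is MARKER{i}." for i in range(1, N_MARKERS + 1))
prompt = f"Here is a numbered sequence:\n{markers}\n\nWhat is the keyword at Position {TARGET}? Answer with only the keyword."

resp = requests.post("http://localhost:8080/v1/chat/completions", json={
    "messages": [{"role": "user", "content": prompt}],
    "max_tokens": 16,
    "temperature": 0,
}, timeout=600)
data = resp.json()
# The reported prompt size tells you whether the run really crossed the native window
print("prompt tokens:", data.get("usage", {}).get("prompt_tokens"))
print("answer:  ", data["choices"][0]["message"]["content"])
print("expected:", f"MARKER{TARGET}")
EOF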

Run it at two sizes: one prompt comfortably under the native window and one that pushes well past it. If the short prompt is right and the long one is garbled, RoPE scaling is still wrong. If both are right, you are fixed.
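
The two-size comparison can be expressed as pure functions, separate from the HTTP call. This is a sketch under stated assumptions: `build_probe`, `grade`, and `verdict` are hypothetical helpers, and the wording for a failing short prompt is an interpretation, since the rule above only covers the other two outcomes.

```python
import re


def build_probe(n_markers, target=None):
    """Return (prompt, expected_answer) for a marker list of the given size."""
    target = target or (3 * n_markers) // 4
    markers = " ".join(
        f"Position {i}: the keyword is MARKER{i}." for i in range(1, n_markers + 1)
    )
    prompt = (
        f"Here is a numbered sequence:\n{markers}\n\n"
        f"What is the keyword at Position {target}? Answer with only the keyword."
    )
    return prompt, f"MARKER{target}"


def grade(answer, expected):
    """True only if the answer names exactly the expected marker."""
    return re.findall(r"MARKER\d+", answer) == [expected]


def verdict(short_ok, long_ok):
    if short_ok and long_ok:
        return "fixed: both sizes retrieve the marker"
    if short_ok:
        return "RoPE scaling is still wrong: only the long prompt fails"
    return "short prompt fails too: the problem is not specific to long context"


short_prompt, short_expected = build_probe(200)   # should sit under the native window
long_prompt, long_expected = build_probe(2000)    # sized to push past an 8k window
print(short_expected, long_expected)
print(verdict(short_ok=True, long_ok=False))
```

Send each prompt to the server as in the script above, pass the reply through `grade`, and feed the two booleans to `verdict`. Check the reported prompt token count to confirm the long run really exceeded the native window.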

How to confirm it is fixed

The retrieval probe in Step 6 returns the correct marker at a position past the model’s native window.
Output quality no longer collapses at a sharp boundary (the native max_position_embeddings); degradation, if any, is gradual at the far end of the extended range.
For modern models, you achieved this with NO --rope-* flags — only --ctx-size / --max-model-len.

Prevention

First check whether the model is natively long-context. If config.json already has rope_scaling, pass no RoPE flags and just set the context size.
Keep manual extensions to about 4x of the native window. Beyond that, switch to a model that was trained long rather than stretching a short one.
Re-convert GGUF files from the original HuggingFace weights when updating versions; verify llama.rope.freq_base matches the source rope_theta before deploying.
After any change to --ctx-size / --max-model-len, run the retrieval probe at a long position before putting the model in production.
Document each model’s native context, rope_theta, scaling type, and maximum reliable extension in your launch-script comments.
Do not stack --rope-scale on a self-scaling model, and do not trust unknown third-party GGUFs without checking their RoPE metadata.

FAQ

Q: My Llama 3.1 model garbles long context even though it is supposed to support 128k. Why? A: Almost always because you added RoPE flags it does not need. Llama 3.1 ships rope_type: "llama3" scaling inside the GGUF. Passing --rope-scaling yarn or --rope-scale on top double-scales the frequencies. Remove every --rope-* and --yarn-* flag, keep only --ctx-size 131072, and re-test.

Q: What context length is safe without any RoPE scaling? A: At most the model’s max_position_embeddings from config.json — 4096 for Llama 2, 8192 for Llama 3 base, but 131072 for Llama 3.1 (whose scaling is built in). Going even 10% past a short model’s native window without scaling starts breaking long-range dependencies.

Q: Does quantization cause this, or is it RoPE? A: Use length as the discriminator. Quantization damage is roughly constant across context sizes; RoPE damage is near-zero below the native window and collapses sharply past it. Run the same prompt at 4k and at 16k+ — a big gap points to RoPE, a uniform haze points to quantization.

Q: linear or yarn — which should I use? A: yarn for anything beyond ~2x, because it preserves short-range attention while extending long-range coverage. linear is fine only for very mild stretches. For models with a large rope_theta (Mistral, Llama 3.1) you usually need neither.

Q: Can I set RoPE scaling in an Ollama Modelfile? A: As of June 2026, Ollama exposes num_ctx (PARAMETER) for context size but does not surface llama.cpp’s low-level --rope-scaling / --yarn-* knobs in the Modelfile; native model scaling is honored automatically. If you need a custom extension, drive llama.cpp directly or use vLLM. For most users the better answer is to pick a model that is natively long-context.

Q: My output is clean to exactly 8k then breaks at 9k. What’s the threshold? A: That boundary is the model’s native max_position_embeddings (8192 for Llama 3 base) and the classic missing-scaling tell. Either use the 3.1 variant (built-in scaling) or add --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 8192, and the cliff moves out to the extended length.

Tags: #local-llm #llama.cpp #Troubleshooting
