llama.cpp Quality Drops After Switching to a More Aggressive Quant

Q: Is IQ4_XS better or worse than Q4_K_M?

With a good imatrix, `IQ4_XS` roughly matches `Q4_K_M` quality at a smaller size (about 4.46 bits/weight versus 4.89, so ~4.17 GB vs 4.58 GB for an 8B). Without a good imatrix it is usually worse. I-quants also decode slower than K-quants on CPU. If you don't control the imatrix, prefer `Q4_K_M`.

Q: I downloaded a worse quant. Can I generate an imatrix myself?

Yes, if you have the `fp16` source. Run `./llama-imatrix -m model-f16.gguf -f calibration_data.txt -o imatrix.dat --chunks 128`, then `./llama-quantize --imatrix imatrix.dat model-f16.gguf model-IQ4_XS.gguf IQ4_XS`. You need the original weights and enough RAM/VRAM to load the fp16 model.

Responses degrade after moving from Q5_K_M or Q8_0 to Q4_0, IQ4_XS, or lower in llama.cpp. Pick the right quant tier, fix bad re-quants, and confirm with perplexity.

Published: May 25, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team 🌐 查看中文版本

You moved a model to a smaller quant to fit a tighter VRAM budget — say Llama 3.1 70B from Q5_K_M to Q4_0, or an 8B from Q8_0 to IQ4_XS — and within a few prompts the output is clearly worse: repetition loops, facts it used to get right are now wrong, code drops brackets, and in bad cases it drifts into nonsense mid-generation. This is a real quantization cliff, not placebo, and it is predictable.

Fastest fix: stop using legacy quants (Q4_0, Q4_1, Q5_0, Q5_1) and use Q4_K_M instead — it is nearly the same file size but much higher quality. If you went to an IQ quant (IQ4_XS, IQ3_M, etc.), confirm it was built with an importance matrix — without one, an I-quant is often worse than the plain K-quant of the same size. And never re-quantize from an already-quantized GGUF; always quantize from the original fp16/bf16 weights.

The reason the Q5_K_M → Q4_0 gap feels bigger than “one bit” is that Q4_0 uses uniform scalar quantization that treats every weight the same, while K-quants (Q4_K_M, Q5_K_M, Q6_K) use mixed-precision blocks that spend extra bits on the most sensitive weights.

Which bucket are you in?

Symptom	Most likely cause	Jump to
Output got much worse moving to `Q4_0` / `Q5_0` / `Q5_1`	Legacy quant cliff	Cause 1 / Step 5
Downloaded an `IQ` GGUF, worse than the K-quant of same size	Missing or wrong imatrix	Cause 3 / Step 3
You re-quantized a GGUF you already had	Double quantization	Cause 5 / Step 2
7B/8B model at `Q2_K` or `Q3_K_S` falls apart	Small model + aggressive quant	Cause 6 / Step 4
Coherent at short context, garbage at long context	Quantized KV cache / flash-attn interaction	Cause 7
Same quant works for others, bad only on your model	Architecture sensitivity (MoE, head dims)	Cause 7

Common causes

Ordered by impact, highest first.

1. Jumping past the cliff between K-quants and legacy quants

The llama.cpp quant ladder has a steep cliff between the K-quant family (Q4_K_S, Q4_K_M, Q5_K_S, Q5_K_M, Q6_K) and the legacy uniform quants (Q4_0, Q4_1, Q5_0, Q5_1). K-quants use block quantization with mixed 4-bit and 6-bit precision for sensitive weight clusters; Q4_0 treats all weights equally. For nearly the same file size, Q4_K_M is materially better than Q4_0.

How to spot it: Run a repeatable benchmark prompt (e.g., a fixed 200-token continuation) with both quants at temperature 0. Q4_0 will show measurably higher perplexity than Q4_K_M on the same base weights — on a 7B model the published gap is roughly +0.25 ppl for Q4_0 versus +0.05 ppl for Q4_K_M over F16 (as of June 2026), a ~5x larger error for almost the same file size.

2. Embedding and output layers under-quantized

Even inside the K-quant family, the token-embedding layer and the final output/lm_head layer are the most sensitive to quantization error. Running llama-quantize with --pure disables the K-quant mixtures and forces every tensor to the same bit-width, which usually hurts quality more than the default mixed-precision scheme.

How to spot it: Re-quantize without --pure and compare. Run ./llama-perplexity -m model_q4km.gguf -f wiki.test.raw before and after; for a 7B model a perplexity increase above ~0.05 over the default mix is meaningful (a typical Q4_K_M sits around +0.05 versus F16, so doubling that is a real regression).

3. IQ (importance-matrix) quants built without the right imatrix

I-quants — IQ2_XS, IQ3_M, IQ4_XS, and friends — get their quality from an importance matrix (imatrix) computed on calibration data from that specific base model. If the GGUF you downloaded was built with no imatrix, or with one from a different model, the bit allocation is wrong and the I-quant can be worse than the equivalent K-quant. Known imatrix uploaders (e.g., Bartowski, Unsloth) usually document this; anonymous uploads frequently do not.

How to spot it: Inspect the GGUF metadata for a quantize.imatrix.file / quantize.imatrix.dataset field. Run ./llama-quantize --help to confirm the --imatrix flag, or dump metadata with the Python gguf reader (Step 1).

4. Quantizing a fine-tuned model with a base-model imatrix

If the model is an instruct/fine-tuned variant (e.g., Llama-3.1-70B-Instruct), the imatrix should be computed from chat-formatted prompts, not generic web text. A base-model imatrix under-protects the attention patterns reinforced during fine-tuning, so the chat behavior degrades even when raw perplexity looks fine.

How to spot it: Check the imatrix calibration source. If it came from a Wikitext/Wikipedia corpus and your model is an instruct variant, regenerate the imatrix from chat-formatted examples.

5. The GGUF was re-quantized from an already-quantized file (double quantization)

If you downloaded a Q8_0 (or any) GGUF and ran llama-quantize on it to produce Q4_K_M, you applied quantization error twice. llama-quantize even refuses this by default (it errors with requantizing from type ... is disabled) unless you pass --allow-requantize, precisely because, in the project’s own words, it “can severely reduce quality compared to quantizing from 16bit or 32bit.” The correct source is always the original fp16/bf16 HuggingFace weights.

How to spot it: Check what you fed llama-quantize. If the input was a .gguf rather than an f16/bf16 GGUF freshly converted from safetensors, you are double-quantizing. The GGUF’s general.source.url or quantization metadata can also reveal an already-quantized origin.

6. `Q2_K` / `Q3_K_S` on a model below 13B parameters

Small models tolerate aggressive quantization far less gracefully than 70B models. A 7B at Q2_K loses so much expressiveness that multi-step reasoning breaks down. “Small model + aggressive quant” is the most brittle combination.

How to spot it: If your model is under ~13B and you are on Q2_K, Q3_K_S, or Q3_K_M, degradation is expected. Step up to Q4_K_M at minimum, Q5_K_M for code or math.

7. Architecture or KV-cache interaction, not the weights

Some quality drops aren’t about the weight quant at all:

MoE models (Mixtral, DeepSeek-MoE): expert weights are more quant-sensitive than dense weights — keep MoE at Q5_K_M or higher.
Quantized KV cache + flash attention: a quantized KV cache (--cache-type-k q8_0 etc.) only works with flash attention enabled (otherwise llama.cpp dequantizes every attention step and you lose the savings), and open llama.cpp issues continue to report flash-attn errors, NaN attention, or large-context decode regressions with quantized KV on CUDA as of mid-2026 — for example issue #24166, where a q8_0 KV cache at very large -c inflates the CUDA flash-attn scratch allocation and thrashes VRAM. If the model is fine short and garbage (or jittery) at long context, this is the suspect, not the weight quant.
Mismatched head dims: models where n_embd_head_k != n_embd_head_v (e.g., some DeepSeek variants) silently disable flash attention, which can change behavior.

How to spot it: Keep the KV cache at f16 (the default) and set --flash-attn off; if quality returns, it was the KV/flash-attn path, not the weights.

Shortest path to fix

Step 1: Inspect the GGUF — imatrix present? re-quantized?

pip install gguf

python3 -c "
import gguf
r = gguf.GGUFReader('model.gguf')
for f in r.fields.values():
    if 'quantize' in f.name or 'source' in f.name or 'general.file_type' in f.name:
        print(f.name)
"

You want to see a quantize.imatrix.file / quantize.imatrix.dataset for any IQ quant, and a general.source.url that points to the original HF repo — not another GGUF.

Step 2: Establish a quality baseline with perplexity

# Small wikitext sample used by llama.cpp CI
wget https://huggingface.co/datasets/ggml-org/ci/resolve/main/wikitext-2-raw-v1.zip
unzip wikitext-2-raw-v1.zip

# Measure perplexity per quant (lower is better) at temperature 0
./llama-perplexity -m models/model.Q5_K_M.gguf -f wikitext-2-raw/wiki.test.raw --ctx-size 512
./llama-perplexity -m models/model.Q4_0.gguf    -f wikitext-2-raw/wiki.test.raw --ctx-size 512

As a reference, measured perplexity deltas versus F16 on a 7B model (as of June 2026) are roughly: Q8_0 +0.0004, Q6_K +0.0044, Q5_K_M +0.0142, Q4_K_M +0.0535, and the legacy Q4_0 around +0.25 — roughly 5x the Q4_K_M error at nearly identical file size. A Q4_0 value well above the Q4_K_M figure confirms the legacy-quant cliff.

Step 3: Re-quantize from fp16 source, not from another GGUF

# 1) Convert HF weights to an fp16 GGUF (allowed outtypes: f32, f16, bf16, q8_0, ...)
python convert_hf_to_gguf.py \
  /path/to/Meta-Llama-3.1-70B-Instruct \
  --outtype f16 \
  --outfile llama3.1-70b-instruct-f16.gguf

# 2) Quantize to Q4_K_M FROM the fp16 GGUF (not from a Q8_0)
./llama-quantize \
  llama3.1-70b-instruct-f16.gguf \
  llama3.1-70b-instruct-Q4_K_M.gguf \
  Q4_K_M

If you only have a quantized GGUF, llama-quantize will refuse (requantizing from type ... is disabled) unless you pass --allow-requantize — that flag is a warning sign, not a fix. Re-download or re-convert the fp16 source.

Step 4: Generate a proper imatrix for IQ quants

# Build the importance matrix from representative data.
# For instruct models, use chat-formatted prompts (100+ diverse examples), not raw web text.
./llama-imatrix \
  -m llama3.1-70b-instruct-f16.gguf \
  -f calibration_data_instruct.txt \
  -o llama3.1-70b-instruct.imatrix \
  --ctx-size 512 \
  --chunks 128

# Quantize the IQ tier WITH the imatrix
./llama-quantize \
  --imatrix llama3.1-70b-instruct.imatrix \
  llama3.1-70b-instruct-f16.gguf \
  llama3.1-70b-instruct-IQ4_XS.gguf \
  IQ4_XS

Step 5: Pick the right quant tier for your VRAM

File sizes below are measured for Llama-3.1 8B and 70B as of June 2026; leave 1-3 GB headroom for the KV cache and context.

70B model:
  ~50 GB VRAM → Q5_K_M (49.9 GB) — best practical quality
  ~44 GB VRAM → Q4_K_M (42.5 GB) — excellent
  ~40 GB VRAM → IQ4_XS (~38 GB, needs a good imatrix) — close to Q4_K_M, smaller
  24 GB VRAM  → Q4_K_M with CPU/-ngl offload — usable, slower

8B model:
  16 GB VRAM → Q8_0 (8.54 GB)  — near-lossless
  12 GB VRAM → Q6_K (6.60 GB)  — virtually F16
  10 GB VRAM → Q5_K_M (5.73 GB) — excellent
  8 GB VRAM  → Q4_K_M (4.92 GB) — good (floor for code/math)

Step 6: Avoid `Q4_0` — use `Q4_K_M` or a properly-imatrixed `IQ4_XS`

# Q4_K_M: mixed K-quant blocks — much better than Q4_0 at ~the same size
./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M

# IQ4_XS: ~0.4 bpw smaller than Q4_K_M, comparable quality WHEN built with an imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-IQ4_XS.gguf IQ4_XS

Step 7: Rule out the KV-cache / flash-attn path

# If it's only bad at long context, test default f16 KV cache with flash-attn forced off.
# (--flash-attn now takes on|off|auto and defaults to auto.)
./llama-cli -m model-Q4_K_M.gguf --flash-attn off -c 8192 -p "long-context test prompt" -n 200

If quality returns with --flash-attn off and an f16 KV cache, the regression was the quantized-KV/flash-attn path, not the weight quant — keep the KV cache at f16 or pin a llama.cpp build without the bug.

How to confirm it’s fixed

Perplexity is back in range: the new GGUF’s llama-perplexity delta versus F16 is close to the reference figures in Step 2 (a good Q4_K_M sits near +0.05 on a 7B).
A/B on your real tasks at temperature 0: run 5-10 fixed prompts (one code, one math, one factual) against the new quant and the old higher-bit quant; the gap should be small, not a cliff.
No repetition loops at temp 0: greedy decoding for ~300 tokens should not collapse into a repeating phrase.
Metadata sanity: the file shows an imatrix field for any IQ quant and a HF (not GGUF) general.source.url.

Prevention

Never quantize from an already-quantized GGUF — always start from fp16/bf16 HuggingFace weights. Treat --allow-requantize as a red flag.
Prefer GGUFs from documented imatrix uploaders (Bartowski, Unsloth, etc.); verify the quantize.imatrix.* metadata before trusting any IQ quant.
Avoid Q4_0, Q5_0, Q5_1 for production — Q4_K_M or a properly-imatrixed IQ4_XS is almost always better at the same or smaller size.
For models below ~13B, use Q4_K_M as the floor and Q5_K_M for code or math; reserve Q2_K/Q3_K_S for 70B-class models.
Keep MoE models (Mixtral, DeepSeek-MoE) at Q5_K_M or higher.
Keep one fp16 GGUF on a backup drive so you can re-quantize without re-downloading, and record the imatrix source + chunk count next to each quantized file.
Build a small perplexity table (Q4_K_M/Q5_K_M/Q6_K) for each model before committing a tier.

FAQ

Q: Is IQ4_XS better or worse than Q4_K_M? A: With a good imatrix, IQ4_XS roughly matches Q4_K_M quality at a smaller size (about 4.46 bits/weight versus 4.89, so ~4.17 GB vs 4.58 GB for an 8B). Without a good imatrix it is usually worse. I-quants also decode slower than K-quants on CPU. If you don’t control the imatrix, prefer Q4_K_M.

Q: What’s the minimum quant for a coding or math assistant? A: Code and math are the most quant-sensitive workloads — token precision affects bracket matching, indentation, and rare-identifier recall. Use Q5_K_M or higher for 7B/8B coding models, with Q4_K_M as the absolute floor.

Q: Does temperature affect quantization artifacts? A: Yes, but in a misleading way. At temperature 0 (greedy) the quantization error follows a deterministic path and is the cleanest way to see the loss. At higher temperature the sampling noise can mask artifacts while also adding randomness, so subjective chat quality is a poor measurement tool. Compare quants with perplexity or fixed temp-0 prompts.

Q: I downloaded a worse quant. Can I generate an imatrix myself? A: Yes, if you have the fp16 source. Run ./llama-imatrix -m model-f16.gguf -f calibration_data.txt -o imatrix.dat --chunks 128, then ./llama-quantize --imatrix imatrix.dat model-f16.gguf model-IQ4_XS.gguf IQ4_XS. You need the original weights and enough RAM/VRAM to load the fp16 model.

Q: Can a better prompt recover quality after downgrading? A: Only partially. Tighter, more constrained prompts and chain-of-thought reduce how often the model follows a degraded probability path, but they don’t restore lost precision. Fix the quant tier first; treat prompting as a small top-up.

Tags: #local-llm #llama.cpp #Troubleshooting

Which bucket are you in?

Common causes

1. Jumping past the cliff between K-quants and legacy quants

2. Embedding and output layers under-quantized

3. IQ (importance-matrix) quants built without the right imatrix

4. Quantizing a fine-tuned model with a base-model imatrix

5. The GGUF was re-quantized from an already-quantized file (double quantization)

6. Q2_K / Q3_K_S on a model below 13B parameters

7. Architecture or KV-cache interaction, not the weights

Shortest path to fix

Step 1: Inspect the GGUF — imatrix present? re-quantized?

Step 2: Establish a quality baseline with perplexity

Step 3: Re-quantize from fp16 source, not from another GGUF

Step 4: Generate a proper imatrix for IQ quants

Step 5: Pick the right quant tier for your VRAM

Step 6: Avoid Q4_0 — use Q4_K_M or a properly-imatrixed IQ4_XS

Step 7: Rule out the KV-cache / flash-attn path

How to confirm it’s fixed

Prevention

FAQ

Related

Related Articles

llama.cpp mmap Fails on a Network Drive

LM Studio Out of Memory When Loading a Model

Local Embedding Server Crashes Under Batched Requests

Chat-Template Mismatch Produces Garbage Local LLM Output

Multi-GPU Not Used — Local LLM Runs Only on GPU 0

Local LLM Output Truncated Mid-Token (Ollama / llama.cpp)

6. `Q2_K` / `Q3_K_S` on a model below 13B parameters

Step 6: Avoid `Q4_0` — use `Q4_K_M` or a properly-imatrixed `IQ4_XS`