Fix vLLM context length exceeded errors

vLLM rejects a request with "This model's maximum context length is X tokens". Set --max-model-len realistically, raise --gpu-memory-utilization, use an fp8 KV cache, and budget output tokens.

Published: May 25, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

You start a vllm serve process for Mistral-7B-Instruct or Llama-3.1-70B-Instruct, send a 12,000-token RAG prompt, and the server rejects it with "This model's maximum context length is 8192 tokens. However, you requested 12000 tokens", even though the model card advertises a much longer window (128k for Llama 3.1). The number in the error is almost never the model’s real maximum; it is whatever vLLM could actually fit into the paged KV cache at startup, and that is usually much smaller than the model config maximum.

Fastest fix (covers ~80% of cases): explicitly set --max-model-len to a value your VRAM can actually hold, and raise --gpu-memory-utilization so the KV cache pool is bigger. Start here:

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 16k \
  --gpu-memory-utilization 0.92

Then confirm the real served length with curl -s http://localhost:8000/v1/models | python3 -m json.tool | grep max_model_len. If that prints what you need, you are done. If not, work through the buckets below.

Note: as of June 2026 vLLM ships the V1 engine by default (V1 has been the default since v0.8, and the current line is ~v0.11). V1 turns on chunked prefill and automatic prefix caching by default, so flags that used to be required, such as --enable-chunked-prefill, no longer need to be passed. If you are still on an old V0 build, upgrade first: most context-length papercuts were fixed in V1.

Which bucket are you in?

Read the exact error string. vLLM emits three different ones and they have different causes.

Error string you see	What it means	Go to
`This model's maximum context length is X tokens. However, you requested Y tokens (A in the messages, B in the completion)`	Your request (prompt + `max_tokens`) is bigger than the server’s configured `max_model_len`	Cause 1 and Cause 4
`The model's max seq len (X) is larger than the maximum number of tokens that can be stored in KV cache (Y). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine`	The server refused to start because VRAM cannot hold the KV cache for `max_model_len`	Cause 2
`The prompt (total length 25938) is too long to fit into the model (context length 4096). Make sure that max_model_len is no smaller than the number of text tokens plus multimodal tokens`	Offline `LLM(...)` path, or a multimodal prompt whose image tokens blew the budget	Cause 4 and Cause 5

Common causes

Ordered by hit rate, highest first.

1. max_model_len silently capped to fit the KV cache

If you do not pass --max-model-len, vLLM reads max_position_embeddings from the model’s config.json (often 32768 or 131072) and tries to use that. But the KV cache needed to hold that many tokens often will not fit in VRAM, so vLLM caps the usable length down to whatever the cache can hold and logs a warning that scrolls past in the startup spam. Any request above the real cap then fails with This model's maximum context length is X tokens.

How to spot it: run curl -s http://localhost:8000/v1/models | python3 -m json.tool | grep max_model_len and compare it to what you expected. A rough rule for fp16 KV cache: about 1 GB per 4,000 tokens for a 7B model, and far more for 70B. If the served number is small, the cache, not the model, is the limit.
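
The per-GB figure above shifts a lot with the attention layout, so it helps to compute it for the model you actually serve. A minimal sketch, assuming a standard transformer that stores one key and one value vector per KV head per layer; the helper name is hypothetical, and the layer and head counts are the ones published in the Llama-3.1 config.json files:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Approximate KV cache bytes needed for one token (K and V, all layers)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

GIB = 1024 ** 3

# Llama-3.1-8B: 32 layers, 8 KV heads (grouped-query attention), head_dim 128
per_token_8b = kv_bytes_per_token(32, 8, 128)
# Llama-3.1-70B: 80 layers, 8 KV heads, head_dim 128
per_token_70b = kv_bytes_per_token(80, 8, 128)

print(per_token_8b, "bytes/token,", GIB // per_token_8b, "tokens per GiB")
print(per_token_70b, "bytes/token,", GIB // per_token_70b, "tokens per GiB")
```

Models with grouped-query attention fit more tokens per GB than older 7B models with full multi-head attention, so treat any single tokens-per-GB figure as an order-of-magnitude guide rather than a constant.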

2. Server refuses to start: KV cache too small for max_model_len

You set --max-model-len 32768, but VRAM only fits, say, 3,664 tokens of KV cache, so vLLM aborts at startup with The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (3664). In V1, the KV cache the engine reserves is roughly max_num_seqs × max_model_len worth of token slots, so a high --max-num-seqs makes this worse.

How to spot it: the process exits during init, not on a request. The error tells you the exact KV cache token ceiling. Lower --max-model-len below that ceiling, lower --max-num-seqs, or give vLLM a larger share of VRAM by raising --gpu-memory-utilization.

3. Wrong GPU memory budget (utilization too low, or model + cache too tight)

--gpu-memory-utilization defaults to 0.9. If something else on the card (another process, a notebook, an OS compositor) is holding VRAM, vLLM’s slice shrinks and the KV pool with it. Conversely, after switching to a bigger model you may have left utilization low from earlier testing.

How to spot it: run nvidia-smi and check free VRAM before launching. In vLLM’s startup logs, look for the # GPU blocks: line, which is the real cache size. Each block holds 16 tokens by default, so # GPU blocks: 2048 means a 32,768-token server-wide budget. V1 builds typically report the same budget directly in tokens, on a line starting with GPU KV cache size:.
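
The block arithmetic is easy to script when comparing launches. A small sketch, assuming the default block size of 16 tokens stated above; both helper names are hypothetical:

```python
def kv_budget_tokens(num_gpu_blocks: int, block_size: int = 16) -> int:
    """Total tokens the paged KV cache can hold across all sequences."""
    return num_gpu_blocks * block_size

def full_length_sequences(num_gpu_blocks: int, max_model_len: int,
                          block_size: int = 16) -> int:
    """How many sequences could sit at max_model_len at the same time."""
    return kv_budget_tokens(num_gpu_blocks, block_size) // max_model_len

budget = kv_budget_tokens(2048)
print(budget)                              # server-wide token budget
print(full_length_sequences(2048, 16000))  # concurrent full-length requests
```

The second number is the useful one for capacity planning: the budget is shared, so a server that can hold one 32k request may only hold two 16k requests at once.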

4. Prompt + completion together exceed max_model_len

vLLM counts input tokens plus the maximum output tokens against the window. The error spells this out: (A in the messages, B in the completion). If your prompt is 29,000 tokens and you set max_tokens=4096, the sum is 33,096, which exceeds a 32,768 cap by 328 tokens before the chat template adds any special tokens.

How to spot it: add prompt_tokens + max_tokens from your API call. If the total is at or above max_model_len, vLLM rejects the request before inference even starts. Note vLLM rejects rather than auto-clamping max_tokens, so you must leave headroom yourself.
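
A client-side check catches this before the request leaves your app. A minimal sketch with hypothetical helper names; the 256-token margin is an assumption meant to absorb chat-template and special tokens, not a vLLM constant:

```python
def headroom(max_model_len: int, prompt_tokens: int, max_tokens: int) -> int:
    """Tokens left in the window; negative means the request will not fit."""
    return max_model_len - (prompt_tokens + max_tokens)

def safe_max_tokens(max_model_len: int, prompt_tokens: int,
                    margin: int = 256) -> int:
    """Largest max_tokens that keeps prompt + output + margin inside the window."""
    return max(0, max_model_len - prompt_tokens - margin)

print(headroom(32768, 29000, 4096))    # negative: over the cap
print(safe_max_tokens(32768, 29000))   # output budget that still fits
```

Clamping max_tokens this way is a choice the application has to make, since the server rejects oversized requests instead of shrinking them.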

5. Multimodal image tokens inflate the prompt

For vision models (Qwen-VL, Llama-Vision, etc.), each image expands into hundreds or thousands of tokens depending on resolution and aspect ratio. A “short” prompt with one high-res base64 image can blow past max_model_len and produce The prompt (total length N) is too long to fit into the model.

How to spot it: drop the image and resend the same text. If the text alone fits, the image tokens are the culprit. Downscale the image or set the model’s --limit-mm-per-prompt / resolution controls.

6. RoPE scaling not picked up, so the long-context ceiling is the base length

Models like Llama-3.1 extend context past their base training length via RoPE scaling declared in config.json. If that block is missing or a community re-upload changed max_position_embeddings, vLLM will cap at the base length (e.g. 8192) no matter how much VRAM you have.

How to spot it: open the model’s config.json and check rope_scaling and max_position_embeddings against the official model card. If they were edited down, re-download the official weights or pass --rope-scaling and --max-model-len explicitly. See Misconfigured RoPE Scaling Garbles Long-Context Output for the gory details.
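
The config check can be scripted so it runs against every checkpoint you download. A minimal sketch, assuming a Hugging Face style config.json; the function names are hypothetical, and the reference values are the ones in the official Llama-3.1 config:

```python
import json

def load_config(path: str) -> dict:
    """Read a model's config.json from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)

def summarize_context_config(config: dict) -> dict:
    """Pull out the fields that determine the long-context ceiling."""
    rope = config.get("rope_scaling") or {}
    return {
        "max_position_embeddings": config.get("max_position_embeddings"),
        "rope_type": rope.get("rope_type") or rope.get("type"),
        "factor": rope.get("factor"),
        "original_max_position_embeddings":
            rope.get("original_max_position_embeddings"),
    }

def looks_truncated(config: dict, expected_len: int) -> bool:
    """True if the checkpoint advertises less context than the model card."""
    declared = config.get("max_position_embeddings") or 0
    return declared < expected_len

# Official Llama-3.1 values versus a hypothetical edited re-upload
official = {
    "max_position_embeddings": 131072,
    "rope_scaling": {"factor": 8.0, "low_freq_factor": 1.0,
                     "high_freq_factor": 4.0,
                     "original_max_position_embeddings": 8192,
                     "rope_type": "llama3"},
}
edited = {"max_position_embeddings": 8192}

print(summarize_context_config(official))
print(looks_truncated(edited, 131072))
```

In practice you would call load_config on the downloaded config.json and compare the summary against the model card before launching the server.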

Shortest path to fix

Step 1: Find the real served length

# What the server actually advertises:
curl -s http://localhost:8000/v1/models | python3 -m json.tool | grep max_model_len

# What it decided at startup (cache size and any cap warning).
# Assumes vLLM runs as a systemd unit named vllm; otherwise grep your own log file.
journalctl -u vllm -n 300 | grep -iE 'GPU blocks|KV cache size|max_model_len|maximum number of tokens|Reducing'

Step 2: Set max-model-len explicitly and raise the memory budget

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 32k \
  --gpu-memory-utilization 0.92 \
  --host 0.0.0.0 \
  --port 8000

--max-model-len accepts human-readable sizes (16k, 32k, 128k); lowercase k is 1000, uppercase K is 1024. Raise --gpu-memory-utilization in small steps (0.90 to 0.92 to 0.95) and watch for CUDA OOM — going too high starves the activation buffers.
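
The k versus K distinction is easy to get wrong when sizing client budgets. The sketch below only illustrates the rule described above; it is a hypothetical helper, not vLLM's own parser, and it handles just the two suffixes mentioned here:

```python
def parse_token_count(text: str) -> int:
    """Convert '16k' (x1000) or '16K' (x1024) or a plain integer to tokens."""
    multipliers = {"k": 1000, "K": 1024}
    if text and text[-1] in multipliers:
        return int(float(text[:-1]) * multipliers[text[-1]])
    return int(text)

print(parse_token_count("16k"))    # decimal thousand
print(parse_token_count("16K"))    # binary thousand
print(parse_token_count("32768"))  # plain integers pass through
```

If your client code assumes 16384 but the server was started with 16k, requests in the last 384 tokens of that range will be rejected.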

Step 3: If the server will not start, lower the ceiling or the concurrency

When you hit The model's max seq len is larger than the maximum number of tokens that can be stored in KV cache, the KV cache is the bottleneck. Either lower --max-model-len below the number the error reports, or lower --max-num-seqs (V1 reserves roughly max_num_seqs × max_model_len of cache):

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 16k \
  --max-num-seqs 16 \
  --gpu-memory-utilization 0.92

Step 4: Halve KV cache memory with fp8, or spill to CPU

--kv-cache-dtype fp8 stores the cache in 8-bit and roughly halves its memory, letting you keep a longer --max-model-len on the same card (needs CUDA 11.8+ or a supported AMD GPU):

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 64k \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.92
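
To see what the halving buys in practice, the sketch below compares how many tokens fit in a fixed cache budget at 2 bytes and at 1 byte per element. The 10 GiB budget and the model shape (32 layers, 8 KV heads, head_dim 128) are illustrative assumptions, and the estimate ignores any small overhead for fp8 scaling factors:

```python
def tokens_in_budget(cache_gib: float, num_layers: int, num_kv_heads: int,
                     head_dim: int, dtype_bytes: int) -> int:
    """Tokens that fit in cache_gib of KV cache at the given element width."""
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return int(cache_gib * 1024 ** 3) // bytes_per_token

fp16_tokens = tokens_in_budget(10, 32, 8, 128, dtype_bytes=2)
fp8_tokens = tokens_in_budget(10, 32, 8, 128, dtype_bytes=1)
print(fp16_tokens, fp8_tokens)
```

The doubled budget can go to a longer --max-model-len or to more concurrent sequences; it does not have to be spent on context alone.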

If you are still short, --cpu-offload-gb N offloads weights to host RAM to free VRAM for the cache (slower, but it lets a big model run at all):

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --max-model-len 32k \
  --cpu-offload-gb 16 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90

Step 5: Reclaim memory after quantizing the weights

AWQ/GPTQ shrink the weights but KV cache stays fp16 unless you say otherwise. After quantizing, push --gpu-memory-utilization up so the freed VRAM becomes cache:

vllm serve hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --quantization awq \
  --max-model-len 32k \
  --gpu-memory-utilization 0.95

Compare the # GPU blocks: line before and after — if the block count did not rise, the freed memory is going to waste.

Step 6: Budget output tokens in the client

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# If max_model_len = 16384 and the prompt is ~12,000 tokens,
# keep prompt + max_tokens comfortably under the cap:
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": long_prompt}],
    max_tokens=2048,  # 12000 + 2048 = 14048, safely under 16384
)

How to confirm it’s fixed

Send a request close to your target length and confirm it returns instead of erroring:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# ~10k-token prompt to probe the real ceiling.
# Each repetition is roughly 5 tokens, so 2000 repeats is about 10,000 tokens.
long_text = "Summarize the following text:\n" + ("The quick brown fox. " * 2000)
r = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": long_text}],
    max_tokens=200,
)
print(r.usage)            # prompt_tokens should be close to 10,000
print(r.choices[0].message.content[:200])

If r.usage.prompt_tokens is what you expected and no error is raised, the cap is genuinely lifted. Then watch vllm:gpu_cache_usage_perc in the /metrics Prometheus endpoint under real load. The gauge is reported as a fraction, so 0.8 means 80%, and newer builds may expose it as vllm:kv_cache_usage_perc. If it regularly sits above 80%, you are close to the edge again.
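
Reading the gauge does not need a full Prometheus stack. A minimal sketch that parses the plain-text /metrics output; the helper name and the sample payload are hypothetical, and it assumes each sample line ends with the value and carries no trailing timestamp:

```python
def read_gauge(metrics_text: str, name: str) -> list[float]:
    """Return every sample value for the metric `name` in Prometheus text."""
    values = []
    for line in metrics_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        if metric == name or metric.startswith(name + "{"):
            values.append(float(value))
    return values

# Illustrative payload only; real output has many more metrics and labels.
sample = """# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="demo"} 0.83
vllm:num_requests_running{model_name="demo"} 4.0
"""

usage = read_gauge(sample, "vllm:gpu_cache_usage_perc")
print(usage, "near the edge" if max(usage) > 0.8 else "ok")
```

Against a live server you would fetch http://localhost:8000/metrics and pass the response body to read_gauge in place of the sample string.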

Prevention

Always pass --max-model-len explicitly, sized to your VRAM, instead of trusting the model config default. It also stops a stale value from leaking in when you swap models.
Remember V1’s cache rule of thumb: reserved KV cache scales with max_num_seqs × max_model_len. Long context and high concurrency fight over the same pool — pick one to prioritize.
Reserve output budget: set max_tokens to at most max_model_len - expected_prompt_tokens - 256.
For RAG, truncate retrieved chunks to a fixed prompt budget in the app layer before the call, rather than letting the server reject the request.
Add a startup assertion to CI: hit /v1/models and check max_model_len is what you expect before routing traffic.
After AWQ/GPTQ quantization, raise --gpu-memory-utilization to 0.92-0.95 to turn the freed VRAM into cache.
Monitor vllm:gpu_cache_usage_perc; if it regularly tops 80%, add VRAM, enable --kv-cache-dtype fp8, or lower --max-num-seqs.
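
The startup assertion from the list above can be a few lines in a deploy script. A sketch using only the standard library; the function names are hypothetical, and it assumes the /v1/models response lists each model under data with a max_model_len field, as the curl check earlier relies on:

```python
import json
from urllib.request import urlopen

def served_max_model_len(models_payload: dict, model_id: str):
    """Find max_model_len for model_id in a /v1/models response body."""
    for entry in models_payload.get("data", []):
        if entry.get("id") == model_id:
            return entry.get("max_model_len")
    return None

def assert_served_length(base_url: str, model_id: str, expected: int) -> None:
    """Fail fast if the server is not serving the context length you planned for."""
    with urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        payload = json.load(resp)
    actual = served_max_model_len(payload, model_id)
    if actual != expected:
        raise AssertionError(
            f"{model_id}: served max_model_len={actual}, expected {expected}")

# Offline example with an illustrative payload
example = {"object": "list",
           "data": [{"id": "meta-llama/Llama-3.1-8B-Instruct",
                     "max_model_len": 16000}]}
print(served_max_model_len(example, "meta-llama/Llama-3.1-8B-Instruct"))
```

Calling assert_served_length("http://localhost:8000", "meta-llama/Llama-3.1-8B-Instruct", 16000) before routing traffic turns a silent misconfiguration into a failed deploy.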

FAQ

Q: The error says my context is 8192 but the model card says 128k. Who is right? A: Both. 128k is the architectural maximum; 8192 is what vLLM could fit in the KV cache on your GPU. The fix is not a different model — it is raising --gpu-memory-utilization, enabling --kv-cache-dtype fp8, lowering --max-num-seqs, or adding VRAM until the cache can hold the length you need.

Q: vLLM accepted my 32k prompt but the response cut off early — same problem? A: Usually not. An early cutoff is normally max_tokens set too low, or the model emitting an EOS token early because of a chat-template mismatch. Check the response’s finish_reason: length means you hit max_tokens; stop means the model ended on its own.

Q: What does gpu_cache_usage_perc at 100% mean? A: Every paged KV block is occupied. New requests wait in the queue, and running sequences may be preempted and recomputed later, which hurts latency. Lower --max-num-seqs, lower --max-model-len, switch to --kv-cache-dtype fp8, or add VRAM.

Q: Can I offload KV cache to CPU to get longer context? A: Not with this flag. --cpu-offload-gb N offloads model weights, not the KV cache, to host RAM. That frees VRAM the cache can then use, which indirectly lets you keep a longer --max-model-len. It trades latency for capacity, so reserve it for cases that genuinely will not fit otherwise.

Q: Does --tensor-parallel-size raise the max context length? A: Not max_model_len directly, but it shards the KV cache across GPUs, so the aggregate cache grows roughly with the number of cards. That lets you allocate more blocks and support either longer context or more concurrent sequences.

Q: Same code works on an A100 80GB but fails on an A6000 48GB — why? A: Less VRAM means a smaller KV cache, so the real ceiling drops roughly in proportion. Lower --max-model-len, enable --kv-cache-dtype fp8, or move to a multi-GPU setup with --tensor-parallel-size.

Tags: #local-llm #vllm #Troubleshooting