vLLM Throws context length exceeded

vLLM raises a context length exceeded error mid-request. Fix max-model-len, chunked prefill, and KV cache allocation to handle long prompts reliably.

You start a vLLM server with Mistral-7B-Instruct-v0.3 or Llama-3.1-70B-Instruct and send a 12,000-token RAG prompt. vLLM responds with ValueError: context length exceeded the model's context window, even though the model nominally supports far more (32k tokens for Mistral-7B-Instruct-v0.3, 128k for Llama 3.1) and you’re nowhere near that limit. Alternatively, the error appears inside a batch job when the total sequence length in a single scheduled batch exceeds the KV cache capacity. Both failure modes trace back to how vLLM allocates the paged KV cache at startup: the theoretical model maximum and the practical runtime maximum are very different numbers.

Common causes

Ordered by hit rate, highest first.

1. --max-model-len not set, defaulting to the model config maximum

vLLM reads max_position_embeddings from the model’s config.json (often 32768 or 131072) and uses that as the default max_model_len. But a KV cache large enough to hold even one sequence of that length often needs more VRAM than is left once the weights are loaded. Recent versions refuse to start in that case, reporting that the model’s maximum sequence length is larger than the number of tokens the KV cache can store. Where the server does start, the practical limit can still sit well below the advertised one, and requests that exceed the real cap fail.

How to spot it: Run curl http://localhost:8000/v1/models and check max_model_len in the response. If it’s larger than what your VRAM budget allows (roughly 1 GB of KV cache per 4000 tokens for a 7B model at fp16, though the exact figure depends on the model’s layer count and number of KV heads), requests exceeding the real cap will fail.
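
The per-token KV cache cost varies with architecture, so the rule of thumb is worth checking against the model's own config. Here is a minimal sketch, assuming the usual layout of one key vector and one value vector per layer per KV head. The function names and the two example shapes (a 7B-class model with full multi-head attention and one with grouped-query attention) are illustrative and not taken from vLLM.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: keys + values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(tokens: int, bytes_per_token: int) -> float:
    """Total KV cache size in GiB for a given number of cached tokens."""
    return tokens * bytes_per_token / 2**30


# Assumed shape: 32 layers, 32 KV heads, head_dim 128 (multi-head attention).
mha = kv_cache_bytes_per_token(32, 32, 128)
# Assumed shape: 32 layers, 8 KV heads, head_dim 128 (grouped-query attention).
gqa = kv_cache_bytes_per_token(32, 8, 128)

print(f"MHA: {kv_cache_gib(4000, mha):.2f} GiB per 4000 tokens")
print(f"GQA: {kv_cache_gib(4000, gqa):.2f} GiB per 4000 tokens")
```

Under these assumed shapes the two cases land on either side of the 1 GB per 4000 tokens figure, so treat that figure as an order-of-magnitude guide and plug in the values from your model's config.json.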

2. KV cache exhausted by concurrent long requests

vLLM uses a paged KV cache with a fixed number of blocks allocated at startup. If --max-num-seqs concurrent sequences each hold an 8k context, the total KV cache demand (e.g., 32 sequences × 8k = 256k token-slots) may exceed what was allocated, causing new requests to get a context-length-exceeded error even though each individual request is within max_model_len.

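How to spot it: Check the vLLM startup logs for the KV cache block count (the line that reports “# GPU blocks”). vLLM reserves memory up to --gpu-memory-utilization at startup, so GPU memory use near 90% before any requests arrive is expected. What matters is whether the block count multiplied by the block size (16 tokens by default) covers the context you expect to hold concurrently; if it doesn't, the KV pool will exhaust quickly under load.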
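
The concurrent-demand arithmetic can be sketched as follows. This is a rough estimate only: it assumes every sequence is at its full length at the same moment and uses the default block size of 16 tokens. The helper names are hypothetical.

```python
import math


def blocks_needed(seq_len: int, block_size: int = 16) -> int:
    """KV cache blocks one sequence of seq_len tokens occupies."""
    return math.ceil(seq_len / block_size)


def fits_in_pool(num_seqs: int, seq_len: int, num_gpu_blocks: int,
                 block_size: int = 16) -> bool:
    """True if num_seqs sequences of seq_len tokens fit in the block pool."""
    return num_seqs * blocks_needed(seq_len, block_size) <= num_gpu_blocks


# 32 concurrent sequences at 8k tokens each.
demand = 32 * blocks_needed(8192)
print(demand)                          # blocks required
print(fits_in_pool(32, 8192, 2048))    # pool of 2048 blocks
print(fits_in_pool(4, 8192, 2048))
```

With a 2048-block pool, only four full 8k sequences fit at once under these assumptions, which is why lowering --max-num-seqs or the per-request length relieves the pressure.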

3. Chunked prefill disabled with very long prompts

Without chunked prefill, vLLM must process the entire prompt in a single prefill pass. For 32k+ token prompts, this can OOM the forward pass even if the KV cache has capacity, because the attention matrix for the prefill pass alone is enormous.

How to spot it: Check whether --enable-chunked-prefill is set. If not, large prefill prompts will fail at the forward pass stage regardless of max_model_len.

4. Rope scaling not configured for extended context

Long-context models rely on RoPE scaling to extend context beyond their base training length; Llama-3.1, for example, ships a rope_scaling entry in its config.json. If vLLM doesn’t receive the correct rope_scaling configuration (from the model’s config.json or via --rope-scaling), it will refuse requests longer than the base trained context length (e.g., 4096 for older Llama 2 variants).

How to spot it: Check the model’s config.json for rope_scaling. If it contains {"type": "dynamic", "factor": 4.0} and your vLLM startup logs don’t mention rope scaling, the config wasn’t read correctly.
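
A quick way to see what the config actually declares is to read it directly. This is a small sketch, assuming a local config.json in the Hugging Face layout. The helper names are hypothetical, and the lookup of both "rope_type" and "type" reflects the assumption that different config versions use different key names.

```python
import json
from pathlib import Path


def summarize_rope_scaling(config: dict) -> dict:
    """Pull the context-length fields out of a parsed config.json."""
    scaling = config.get("rope_scaling") or {}
    return {
        "max_position_embeddings": config.get("max_position_embeddings"),
        # Assumption: newer configs use "rope_type", older ones use "type".
        "rope_type": scaling.get("rope_type", scaling.get("type")),
        "factor": scaling.get("factor"),
    }


def read_rope_scaling(config_path: str) -> dict:
    """Load config.json from disk and summarize its RoPE settings."""
    return summarize_rope_scaling(json.loads(Path(config_path).read_text()))


example = {"max_position_embeddings": 16384,
           "rope_scaling": {"type": "dynamic", "factor": 4.0}}
print(summarize_rope_scaling(example))
```

If rope_type comes back as None for a model that advertises extended context, the scaling entry is missing from the config you are actually serving.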

5. Prompt + completion tokens together exceed max_model_len

vLLM counts input tokens + maximum output tokens against the context window. If your request sets max_tokens=4096 and your prompt is 29,000 tokens, the sum (33,096) exceeds max_model_len=32768. The prompt alone fits; the combined budget does not.

How to spot it: Calculate prompt_tokens + max_tokens in your API call. If this exceeds max_model_len, vLLM rejects the request before inference starts.
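
The check is simple enough to run client-side before each call. Here is a minimal sketch; the function names are hypothetical, and the 256-token safety margin is an assumption rather than something vLLM enforces.

```python
def fits_context(prompt_tokens: int, max_tokens: int, max_model_len: int) -> bool:
    """True if prompt plus requested completion fits in the context window."""
    return prompt_tokens + max_tokens <= max_model_len


def clamp_max_tokens(prompt_tokens: int, requested: int,
                     max_model_len: int, margin: int = 256) -> int:
    """Largest max_tokens <= requested that leaves `margin` tokens spare."""
    available = max_model_len - prompt_tokens - margin
    return max(0, min(requested, available))


print(fits_context(29_000, 4096, 32_768))      # combined budget too large
print(clamp_max_tokens(29_000, 4096, 32_768))  # reduced completion budget
print(clamp_max_tokens(12_000, 2048, 16_384))  # already fits, unchanged
```

A result of 0 from clamp_max_tokens means the prompt itself leaves no room for output and needs to be shortened first.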

6. Quantization reducing KV cache capacity

When using AWQ or GPTQ quantization in vLLM, weight memory shrinks but the KV cache is still stored in fp16 by default, so the per-token cache cost does not change. If --gpu-memory-utilization is left at a conservative value chosen for the full-precision model, the server may not be using all of the memory headroom that quantization freed.

How to spot it: Check startup logs for “KV cache size: N blocks” before and after enabling quantization. If the block count didn’t increase after quantization, --gpu-memory-utilization needs adjustment.

Shortest path to fix

Step 1: Explicitly set --max-model-len to a realistic value

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90

Start conservatively (8192 or 16384) and increase if memory allows. Check via:

curl -s http://localhost:8000/v1/models | python3 -m json.tool | grep max_model_len
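
The same check can be scripted, for example in a deployment smoke test. This sketch assumes the response follows the OpenAI-style list shape with a "data" array and that each entry carries a max_model_len field; the function names are hypothetical.

```python
import json
import urllib.request


def extract_max_model_len(models_payload: dict) -> dict:
    """Map each served model id to its reported max_model_len (or None)."""
    return {m["id"]: m.get("max_model_len")
            for m in models_payload.get("data", [])}


def fetch_max_model_len(base_url: str = "http://localhost:8000") -> dict:
    """Query a running server; raises URLError if it is unreachable."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
        return extract_max_model_len(json.load(resp))


# Parsing a sample payload (no server needed for this part):
sample = {"object": "list",
          "data": [{"id": "meta-llama/Meta-Llama-3.1-8B-Instruct",
                    "max_model_len": 16384}]}
print(extract_max_model_len(sample))
```

Call fetch_max_model_len() against the live server and compare the result with the value you passed to --max-model-len.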

Step 2: Enable chunked prefill for long-context workloads

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --max-model-len 32768 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.90

--max-num-batched-tokens controls the chunk size per prefill step. 4096-8192 is a safe range for 24 GB GPUs.
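
To get a feel for the chunk size, it helps to count how many prefill steps a prompt needs at a given setting. This is only a lower bound: it assumes the whole token budget of each step goes to one prompt, whereas a real scheduler may share that budget with other requests. The function name is hypothetical.

```python
import math


def min_prefill_steps(prompt_tokens: int, max_num_batched_tokens: int) -> int:
    """Lower bound on scheduler steps needed to prefill one prompt."""
    return math.ceil(prompt_tokens / max_num_batched_tokens)


for budget in (4096, 8192):
    print(budget, min_prefill_steps(32_768, budget))
```

Smaller chunks mean more steps per prompt but a smaller peak activation footprint per step, which is the trade-off the flag controls.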

Step 3: Increase GPU memory utilization after quantization

# AWQ quantized model — weights use ~half the VRAM, so increase utilization
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-3-8B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.95

Step 4: Cap max_tokens in API requests to leave room for the prompt

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Placeholder: replace with your assembled RAG prompt.
long_prompt = "..."

# If max_model_len=16384 and prompt is ~12000 tokens:
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": long_prompt}],
    max_tokens=2048,  # 12000 + 2048 = 14048, safely under 16384
)
print(response.choices[0].message.content)

Step 5: Verify the KV cache block allocation at startup

# Start vLLM and look for this log line:
# "# GPU blocks: 2048, # CPU blocks: 512"
# Each block holds 16 tokens by default.
# 2048 blocks × 16 tokens = 32768 token capacity for the entire server.

# Increase block capacity by raising gpu-memory-utilization or reducing KV cache precision:
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --kv-cache-dtype fp8  # halves KV cache memory on supported hardware
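
The block-count arithmetic can be automated against captured startup logs. This sketch assumes the log line has the "# GPU blocks: N, # CPU blocks: M" shape quoted above and a block size of 16 tokens; the function name is hypothetical, and the log format may differ between versions.

```python
import re
from typing import Optional

# Matches the block-count portion of the startup log line quoted above.
BLOCK_LINE = re.compile(r"# GPU blocks: (\d+), # CPU blocks: (\d+)")


def kv_capacity_from_log(line: str, block_size: int = 16) -> Optional[dict]:
    """Return block counts and GPU token capacity, or None if no match."""
    match = BLOCK_LINE.search(line)
    if match is None:
        return None
    gpu_blocks, cpu_blocks = int(match.group(1)), int(match.group(2))
    return {
        "gpu_blocks": gpu_blocks,
        "cpu_blocks": cpu_blocks,
        "gpu_token_capacity": gpu_blocks * block_size,
    }


print(kv_capacity_from_log("INFO # GPU blocks: 2048, # CPU blocks: 512"))
```

Comparing gpu_token_capacity across restarts shows directly whether a flag change bought more KV cache.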

Prevention

  • Always set --max-model-len explicitly based on your VRAM budget rather than relying on the model config default.
  • Use --enable-chunked-prefill by default for any deployment that will receive prompts over 4096 tokens.
  • Reserve max_tokens budget when making API calls: set max_tokens to no more than max_model_len - expected_prompt_tokens - 256.
  • For RAG pipelines, truncate retrieved documents to stay within a safe prompt budget before sending to vLLM.
  • Monitor vllm:gpu_cache_usage_perc in vLLM’s Prometheus metrics (reported as a fraction, so 0.8 means 80%); if it regularly exceeds 80%, increase VRAM or reduce max-num-seqs.
  • When using AWQ or GPTQ, increase --gpu-memory-utilization to 0.92-0.95 to reclaim the headroom quantization created.
  • Test max_model_len with a single 90th-percentile-length request before load testing to confirm the server handles it.
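
For the RAG truncation point in the list above, one simple approach is to keep whole documents in rank order until the prompt budget is spent. This is a sketch under stated assumptions: the documents are already sorted by relevance, and the default token counter is a crude whitespace split standing in for the model's real tokenizer. The function name is hypothetical.

```python
from typing import Callable, List


def fit_documents(docs: List[str], prompt_budget: int,
                  count_tokens: Callable[[str], int] = lambda s: len(s.split())
                  ) -> List[str]:
    """Keep leading documents whose combined token count fits the budget."""
    kept: List[str] = []
    used = 0
    for doc in docs:
        cost = count_tokens(doc)
        if used + cost > prompt_budget:
            break  # stop at the first document that would overflow
        kept.append(doc)
        used += cost
    return kept


docs = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
print(fit_documents(docs, prompt_budget=5))
```

In practice, pass a count_tokens callable backed by the served model's tokenizer, since whitespace counts usually underestimate real token counts.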

FAQ

Q: vLLM accepted my 32k prompt but the response cuts off early — is this also a context limit issue? A: Possibly. If max_tokens was not set or was set too low in the API call, the response will stop at that limit. Also check whether the model is producing an EOS token early due to a chat template mismatch, which looks similar but is a different root cause.

Q: What does “KV cache utilization 100%” in vLLM metrics mean? A: It means all paged KV cache blocks are occupied. New requests will either be queued or rejected depending on your --max-num-seqs and scheduling configuration. Reduce max-num-seqs, reduce max_model_len, or add VRAM.
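
Cache utilization can also be read straight from the server's Prometheus text output. This sketch makes several assumptions: the gauge is exposed under the name vllm:gpu_cache_usage_perc, its value is a fraction between 0 and 1, and each sample line ends with the value and no timestamp. Check your own /metrics output, since metric names can differ between versions. The function name is hypothetical.

```python
from typing import List


def read_gauge(metrics_text: str, name: str) -> List[float]:
    """Return every sample value for metric `name` in Prometheus text format."""
    values: List[float] = []
    for line in metrics_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        if metric.split("{", 1)[0] == name:
            values.append(float(value))
    return values


sample_metrics = """# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="demo"} 0.85
vllm:num_requests_running{model_name="demo"} 3.0
"""
usage = read_gauge(sample_metrics, "vllm:gpu_cache_usage_perc")
print(usage, any(v > 0.8 for v in usage))
```

Polling this value during a load test shows how close the KV pool gets to full before requests start queuing.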

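Q: Can I use CPU KV cache offload in vLLM to extend effective context? A: Not directly with --cpu-offload-gb. In the vLLM versions that provide it, that flag offloads part of the model weights to CPU RAM rather than KV blocks. Moving weights off the GPU frees VRAM that the KV cache can then use, at the cost of slower forward passes because the offloaded weights have to cross the CPU-GPU interconnect. For example, --cpu-offload-gb 16 on a machine with 64+ GB RAM leaves roughly 16 GB more GPU memory for the KV cache. The separate --swap-space option sets the CPU swap space used for preempted sequences.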

Q: Does --tensor-parallel-size affect the max context length? A: Tensor parallelism splits attention heads across GPUs. This doesn’t change max_model_len directly, but it does multiply the total KV cache capacity proportionally (each GPU holds a shard), which allows allocating more KV blocks in aggregate and supporting longer contexts or more concurrent requests.

Tags: #local-llm #vllm #Troubleshooting