vLLM Startup Fails With CUDA Version Mismatch

vLLM fails to start with a CUDA version mismatch or undefined symbol error. Align your CUDA toolkit, driver, and PyTorch versions to fix the incompatibility.

You install vLLM 0.4 and run python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct. The process crashes with RuntimeError: CUDA error: no kernel image is available for execution on the device, or with ImportError: undefined symbol: __nvJitLinkCreate_12_4, or torch.cuda.is_available() simply returns False. Your NVIDIA driver reports CUDA 12.4, your pip-installed PyTorch reports CUDA 12.1, and vLLM was compiled against CUDA 12.2: three different version numbers that form an incompatible chain. This is the most common reason vLLM fails to start on a new machine.

Common causes

Ordered by hit rate, highest first.

1. vLLM wheel compiled for a different CUDA version than your toolkit

vLLM publishes separate wheels for CUDA 11.8 and CUDA 12.x. If you install vllm without specifying the CUDA version, pip may pull the CUDA 12.1 wheel even if your system has CUDA 11.8. At runtime, the compiled CUDA kernels can’t load on the mismatched toolkit.

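How to spot it: Run python -c "import vllm; print(vllm.__version__)" and then nvcc --version. Wheels built for a non-default CUDA version carry a local version tag such as +cu118, which also appears in pip show vllm | grep Version (the Location field shows only the install directory). In vLLM 0.4, a version with no tag is the default CUDA 12.1 build. The wheel's CUDA version must match your nvcc output.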
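
The check above can be scripted. This is a minimal sketch that assumes the installed build encodes its CUDA target as a local version tag such as +cu118; cuda_tag_from_version is a hypothetical helper name, and it returns None when no tag is present.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def cuda_tag_from_version(version_string: str) -> Optional[str]:
    """Return the CUDA version encoded in a local version tag, e.g. '+cu118' -> '11.8'."""
    match = re.search(r"\+cu(\d+)", version_string)
    if match is None:
        return None  # no CUDA tag in the version string
    digits = match.group(1)
    # The last digit is the minor version: 'cu118' -> 11.8, 'cu121' -> 12.1
    return f"{digits[:-1]}.{digits[-1]}"


if __name__ == "__main__":
    try:
        installed = version("vllm")  # reads package metadata without importing vllm
    except PackageNotFoundError:
        print("vllm is not installed in this environment")
    else:
        print(installed, "->", cuda_tag_from_version(installed))
```

Reading the metadata rather than importing vllm means the check still works when the import itself is what crashes.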

2. PyTorch CUDA version doesn’t match the system toolkit

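vLLM depends on PyTorch. If torch was installed with a plain pip install torch, pip picks whatever default build exists for your platform: a CPU-only build on some platforms, or a CUDA build that may not match your toolkit. With a CPU-only build, torch.cuda.is_available() returns False even though the GPU exists.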

How to spot it: Run python -c "import torch; print(torch.version.cuda)". Compare against nvcc --version. They must match (e.g., both 12.1).
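
The comparison can be automated with a short script. This is a sketch only: parse_nvcc_release and toolkit_matches_torch are hypothetical helper names, and the parsing assumes nvcc prints a line containing "release X.Y", as in the sample output shown in Step 1.

```python
import re
import subprocess
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract 'major.minor' from the 'release X.Y' part of `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def toolkit_matches_torch(nvcc_output: str, torch_cuda: Optional[str]) -> bool:
    """True when the toolkit release equals the CUDA version PyTorch was built for."""
    release = parse_nvcc_release(nvcc_output)
    return release is not None and release == torch_cuda


if __name__ == "__main__":
    try:
        out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
    except FileNotFoundError:
        out = ""  # nvcc is not on PATH
    try:
        import torch
        torch_cuda = torch.version.cuda  # None for CPU-only builds
    except (ImportError, OSError):
        torch_cuda = None  # torch is missing or its shared libraries failed to load
    print("toolkit:", parse_nvcc_release(out), "| torch:", torch_cuda)
    print("match:", toolkit_matches_torch(out, torch_cuda))
```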

3. Driver version too old for the CUDA toolkit version

CUDA 12.x requires a minimum NVIDIA driver version:

  • CUDA 12.0: driver >= 525
  • CUDA 12.1: driver >= 530
  • CUDA 12.4: driver >= 550

If your driver is older than the minimum, CUDA operations fail with “no kernel image available” or silent GPU detection failures.

How to spot it: Run nvidia-smi and check the “Driver Version” field. Cross-reference against the CUDA Toolkit Release Notes for the minimum driver requirement.
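
The table above can be turned into a quick lookup. This sketch hard-codes only the three minimums listed in this article; driver_is_new_enough is a hypothetical helper, and it compares the driver's major version number only.

```python
from typing import Optional

# Minimum driver major versions, copied from the list in this article.
MIN_DRIVER_MAJOR = {"12.0": 525, "12.1": 530, "12.4": 550}


def driver_is_new_enough(cuda_version: str, driver_version: str) -> Optional[bool]:
    """Compare a driver version such as '535.104.05' with the table above.

    Returns None when the CUDA version is not in the table.
    """
    minimum = MIN_DRIVER_MAJOR.get(cuda_version)
    if minimum is None:
        return None
    driver_major = int(driver_version.split(".")[0])
    return driver_major >= minimum


if __name__ == "__main__":
    # Example values; substitute the driver version reported by nvidia-smi.
    print(driver_is_new_enough("12.4", "535.104.05"))
```

The driver version string can be read from the "Driver Version" field of nvidia-smi, or with nvidia-smi --query-gpu=driver_version --format=csv,noheader.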

4. vLLM installed in a conda environment with a different CUDA path

Conda environments often have their own CUDA toolkit (installed via conda install cuda-toolkit). If the conda environment’s CUDA differs from the system CUDA, shared libraries may conflict. The process links against the system libcuda.so but tries to load conda-compiled CUDA kernels, causing undefined symbol errors.

How to spot it: Activate your conda environment and compare the output of which nvcc with which python. If both point into the conda environment, run conda list | grep cuda and verify that the listed toolkit version matches nvcc --version. If nvcc points to a system path while python points into conda, the environment is mixing two toolkits.
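
The path comparison can be sketched in Python. This assumes a Unix-style layout in which an environment keeps both executables in the same bin directory; same_env_prefix is a hypothetical helper name.

```python
import shutil
import sys
from pathlib import Path
from typing import Optional


def same_env_prefix(nvcc_path: Optional[str], python_path: str) -> bool:
    """True when nvcc and python live in the same '<prefix>/bin' directory."""
    if nvcc_path is None:
        return False  # nvcc not found on PATH
    return Path(nvcc_path).parent == Path(python_path).parent


if __name__ == "__main__":
    nvcc = shutil.which("nvcc")  # same lookup that `which nvcc` performs
    print("nvcc:  ", nvcc)
    print("python:", sys.executable)
    print("same environment:", same_env_prefix(nvcc, sys.executable))
```

A False result is not an error by itself. It only signals that nvcc comes from a different installation than the interpreter, which is the situation where the versions are worth comparing.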

5. Missing CUDA development libraries (nvcc present, libs absent)

Some minimal CUDA installations include only the driver and the core runtime, without the additional shared libraries (such as cuBLAS) that ship with the full toolkit. vLLM’s compiled CUDA extensions (_C.so files) need those shared libraries to be loadable at import time.

How to spot it: Run ldconfig -p | grep libcublas. If libcublas.so.12 is absent, install the full CUDA toolkit (not just the driver).
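
The same lookup can be done from Python by parsing the linker cache listing. This is a sketch that assumes the usual ldconfig -p line format ("libname.so.N (arch) => path"); find_sonames is a hypothetical helper name.

```python
import re
import subprocess
from typing import List


def find_sonames(ldconfig_output: str, library: str) -> List[str]:
    """List the sonames for `library` that appear in `ldconfig -p` output."""
    # Match the exact library name at the start of a line, so that
    # 'libcublas' does not also match 'libcublasLt'.
    pattern = re.compile(rf"^\s*({re.escape(library)}\.so[\.\d]*)\s", re.MULTILINE)
    return sorted(set(pattern.findall(ldconfig_output)))


if __name__ == "__main__":
    try:
        output = subprocess.run(["ldconfig", "-p"], capture_output=True, text=True).stdout
    except FileNotFoundError:
        output = ""  # ldconfig is not available on this system
    print(find_sonames(output, "libcublas") or "libcublas not found in the linker cache")
```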

6. Flash Attention version incompatible with CUDA/PyTorch

vLLM uses FlashAttention-2 for fast attention. If flash-attn was compiled for CUDA 12.1 but your system has CUDA 12.4, vLLM will fail to import it and crash with a CUDA symbol error.

How to spot it: Run python -c "import flash_attn; print(flash_attn.__version__)". If this raises an import error with a symbol mismatch, FlashAttention needs to be recompiled for your CUDA version.
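
To tell "not installed" apart from "installed but broken", the import can be wrapped so the failure reason is printed instead of a traceback. This is a minimal sketch; probe_import is a hypothetical helper name.

```python
import importlib
from typing import Tuple


def probe_import(module_name: str) -> Tuple[bool, str]:
    """Try to import a module and report its version or the failure reason."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        # Covers a missing package (ModuleNotFoundError) as well as
        # "undefined symbol" failures raised while loading a compiled extension.
        return False, f"{type(exc).__name__}: {exc}"
    return True, str(getattr(module, "__version__", "unknown"))


if __name__ == "__main__":
    ok, detail = probe_import("flash_attn")
    print("flash_attn:", "OK" if ok else "FAILED", "-", detail)
```

A ModuleNotFoundError means the package is simply absent, while a plain ImportError that mentions an undefined symbol points to the CUDA or PyTorch mismatch described above.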

Shortest path to fix

Step 1: Establish the ground truth CUDA version chain

# System CUDA toolkit version
nvcc --version
# Should print: Cuda compilation tools, release 12.4, V12.4.131

# NVIDIA driver and its max supported CUDA version
nvidia-smi
# Top-right corner shows: CUDA Version: 12.4

# PyTorch CUDA version
python -c "import torch; print('torch CUDA:', torch.version.cuda)"

# Installed vLLM CUDA target
pip show vllm
# The dist-info directory name includes any CUDA tag, e.g. vllm-0.4.3+cu118.dist-info
ls -d "$(python -c "import site; print(site.getsitepackages()[0])")"/vllm*.dist-info 2>/dev/null

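The toolkit, PyTorch, and vLLM must agree on the same CUDA major.minor version, and the CUDA version reported by nvidia-smi must be equal to or higher than that version.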
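
Once the versions are collected, the comparison itself is mechanical. The sketch below encodes the rule this article uses: the toolkit and PyTorch should agree, and the driver's reported CUDA version should be at least as new as PyTorch's. check_chain and as_tuple are hypothetical helper names.

```python
from typing import List, Optional, Tuple


def as_tuple(version: str) -> Tuple[int, int]:
    """'12.4' -> (12, 4); extra components such as a patch level are ignored."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def check_chain(toolkit: Optional[str], torch_cuda: Optional[str], driver_max: str) -> List[str]:
    """Return a list of problems; an empty list means the chain looks consistent."""
    problems = []
    if torch_cuda is None:
        problems.append("PyTorch is a CPU-only build")
    elif as_tuple(torch_cuda) > as_tuple(driver_max):
        problems.append(f"PyTorch CUDA {torch_cuda} is newer than the driver supports ({driver_max})")
    if toolkit and torch_cuda and as_tuple(toolkit) != as_tuple(torch_cuda):
        problems.append(f"toolkit {toolkit} differs from PyTorch CUDA {torch_cuda}")
    return problems


if __name__ == "__main__":
    # Example values taken from the scenario at the top of this article.
    for problem in check_chain(toolkit="12.4", torch_cuda="12.1", driver_max="12.4"):
        print("-", problem)
```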

Step 2: Reinstall PyTorch for your specific CUDA version

# For CUDA 12.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# For CUDA 12.4
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# For CUDA 11.8 (older GPUs)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Verify
python -c "import torch; print(torch.version.cuda); print(torch.cuda.is_available())"
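
The three index URLs above follow one pattern, so a setup script can derive the URL from the detected CUDA version. This is a sketch; torch_index_url is a hypothetical helper, and it only builds the string without checking that PyTorch publishes wheels for that CUDA version.

```python
def torch_index_url(cuda_version: str) -> str:
    """Map '12.1' to the matching PyTorch wheel index, following the cuXYZ pattern."""
    major, minor = cuda_version.split(".")[:2]
    return f"https://download.pytorch.org/whl/cu{major}{minor}"


if __name__ == "__main__":
    url = torch_index_url("12.1")
    print(f"pip install torch torchvision torchaudio --index-url {url}")
```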

Step 3: Install the matching vLLM wheel

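# For CUDA 12.1 (the default build of the vLLM 0.4.x wheel on PyPI)
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121

# For CUDA 11.8: the PyPI wheel targets CUDA 12.1, so install the +cu118 wheel
# from the vLLM GitHub release page instead. Adjust both versions to your setup
# and confirm the exact filename on the release page.
export VLLM_VERSION=0.4.3
export PYTHON_VERSION=310
pip install "https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl" \
  --extra-index-url https://download.pytorch.org/whl/cu118

# Or pin a specific vLLM release (PyPI does not accept +cuXXX local version tags):
pip install "vllm==0.4.3" --extra-index-url https://download.pytorch.org/whl/cu121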

Step 4: Reinstall flash-attn for your CUDA version

# Uninstall the existing flash-attn (may be the wrong CUDA build)
pip uninstall flash-attn -y

# Build from source for your exact CUDA + PyTorch combination
pip install flash-attn --no-build-isolation

# This compiles flash-attn for the currently installed PyTorch + CUDA
# Takes 5-20 minutes on first install

Step 5: Verify the full stack in a Python test

python3 << 'EOF'
import torch
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
# Query the GPU only when CUDA is usable; otherwise these calls raise
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

import vllm
print(f"vLLM: {vllm.__version__}")
print("All imports OK")
EOF

Step 6: Use Docker for a pre-validated CUDA environment

# Official vLLM Docker image with pre-matched CUDA/PyTorch/vLLM
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --max-model-len 16384

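The official Docker image bundles matching CUDA, PyTorch, and vLLM builds, so the only thing left to get right on the host is an NVIDIA driver recent enough for the image, plus the NVIDIA container runtime.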
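
Once the container is running, a quick request confirms that the server started and loaded the model. This sketch assumes the port mapping from the command above (localhost:8000) and uses the OpenAI-compatible /v1/models route; list_served_models is a hypothetical helper name.

```python
import json
import urllib.request
from typing import List


def list_served_models(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> List[str]:
    """Return the model IDs reported by GET /v1/models, or [] if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError):
        # OSError covers connection errors, HTTP errors, and timeouts;
        # ValueError covers a response body that is not valid JSON.
        return []
    return [entry["id"] for entry in payload.get("data", [])]


if __name__ == "__main__":
    models = list_served_models()
    print(models if models else "server not reachable yet")
```

Loading model weights can take a while, so an empty result shortly after docker run usually means the server is still starting.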

Prevention

  • Always install PyTorch before vLLM using the explicit --index-url for your CUDA version — never pip install torch without the CUDA specifier.
  • Create a requirements.txt with pinned versions and CUDA URL: --index-url https://download.pytorch.org/whl/cu121.
  • Use virtual environments (venv or conda) per project to isolate CUDA-dependent packages.
  • After any NVIDIA driver update, verify that the CUDA version shown by nvidia-smi is still at or above your PyTorch CUDA version, and that nvcc --version still matches it.
  • Document the full version matrix (driver, CUDA, PyTorch, vLLM, flash-attn) in your project README as a compatibility table.
  • For production deployments, use the official vLLM Docker image to eliminate CUDA version management.
  • Before pip install vllm, always run the Step 2 PyTorch verification to confirm the CUDA foundation is correct.

FAQ

Q: My NVIDIA driver shows “CUDA Version: 12.4” but nvcc shows 12.1 — which one matters? A: Both. The driver’s “CUDA Version” shows the maximum CUDA version the driver supports. nvcc --version shows the installed toolkit version. You can install PyTorch for CUDA 12.1 even if the driver supports 12.4 — backward compatibility works. But you can’t run CUDA 12.4 PyTorch on a driver that only supports 12.1.

Q: Can I run vLLM without CUDA — on CPU only? A: vLLM 0.4 has experimental CPU support but it’s very slow and not recommended for production. For CPU inference, use llama.cpp (with llama-server) or Ollama instead — they have mature CPU backends.

Q: After matching all versions, vLLM still fails with “Segmentation fault” on startup — why? A: Segfaults at import time are often caused by a FlashAttention build that doesn’t match the PyTorch ABI. Run pip uninstall flash-attn && pip install flash-attn --no-build-isolation to rebuild from source. Also check if xformers is installed with a conflicting CUDA version.

Q: Is there a faster way to check compatibility without installing everything? A: Use pip index versions vllm to see the available vLLM versions, then check the installation guide at https://docs.vllm.ai/en/latest/getting_started/installation.html for the PyTorch and CUDA requirements. Note that the /latest/ guide describes the newest release, so switch the docs to the version you plan to install. This lets you confirm compatibility before installing anything.

Tags: #local-llm #vllm #Troubleshooting