You pull mistral:7b and inference is inexplicably slow: 2-3 tokens/s instead of the 40+ you expect from a 3090. Running ollama ps shows the model with “100% CPU” in the processor column. The setup is Ollama 0.4 on Ubuntu 22.04 with a 24 GB RTX 3090 and CUDA 12.2 installed, yet the GPU sits idle at 0% utilization in nvidia-smi. The model is small enough to fit entirely in VRAM and the driver is recent, so something in Ollama’s detection chain is failing. This is one of the most common local LLM headaches, and it has six distinct failure modes, each with a different fix.
Common causes
Ordered by hit rate, highest first.
1. CUDA libraries not on LD_LIBRARY_PATH
Ollama bundles its own CUDA runtime but still needs libcuda.so.1 from the system driver to be discoverable. On systems where CUDA was installed via conda, a virtual environment, or a non-standard path, libcuda.so.1 may not be visible to the Ollama process.
How to spot it: Run ldconfig -p | grep libcuda. If it returns nothing, or if the path is only inside a conda env that isn’t active when Ollama starts, the library is invisible to the service.
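As a rough sketch of that check, the helper below lists the directories in which the linker cache knows a libcuda.so.1. The function name libcuda_dirs is made up for this example, and it assumes the usual "name (arch) => path" layout of ldconfig -p output. It reads the listing on stdin so the parsing can be tried without a GPU.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each directory that holds a libcuda.so.1
# according to `ldconfig -p` output supplied on stdin.
libcuda_dirs() {
  # Cache lines look like: "libcuda.so.1 (libc6,x86-64) => /path/libcuda.so.1"
  awk '$1 == "libcuda.so.1" && $(NF-1) == "=>" { print $NF }' \
    | while IFS= read -r path; do dirname "$path"; done \
    | sort -u
}

# Usage on a real system:
#   ldconfig -p | libcuda_dirs
```

Empty output means the cache has no entry at all. A directory that only exists inside a conda environment will not show up here either, which matches the symptom described above.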
2. Ollama service started before the NVIDIA driver was loaded
On Linux, if ollama.service starts at boot before nvidia-persistenced or the kernel module initializes, Ollama probes for GPUs, finds none, and caches “no GPU” for the lifetime of the process. Restarting the service fixes it, but the root cause is startup ordering.
How to spot it: Run systemctl status ollama and check the start timestamp against dmesg | grep -i nvidia — if the service started before the kernel module loaded, timestamps confirm it.
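The timestamp comparison can be scripted. This is a minimal sketch, assuming dmesg prints its default "[seconds.micros]" prefix and that the first kernel line mentioning nvidia approximates the module load time. The function name started_before_driver is hypothetical; ActiveEnterTimestampMonotonic is the systemd unit property that reports the start time in microseconds since boot.

```shell
#!/usr/bin/env bash
# Hypothetical helper. Exit 0 if the service became active before the first
# NVIDIA kernel message, 1 if it started later, 2 if no NVIDIA line exists.
#   $1: service start time in microseconds since boot
#   $2: kernel log text in the default dmesg format
started_before_driver() {
  local service_us="$1"
  local kernel_log="$2"
  printf '%s\n' "$kernel_log" | awk -v svc="$service_us" '
    tolower($0) ~ /nvidia/ {
      ts = $0
      sub(/^\[ */, "", ts)    # drop the leading "[" and padding
      sub(/\].*/, "", ts)     # drop everything after the timestamp
      rc = (svc < ts * 1000000) ? 0 : 1
      found = 1
      exit
    }
    END { exit found ? rc : 2 }
  '
}

# Usage on a real system:
#   svc_us=$(systemctl show ollama -p ActiveEnterTimestampMonotonic --value)
#   started_before_driver "$svc_us" "$(sudo dmesg)" && echo "Ollama started first"
```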
3. Wrong GPU selected when multiple devices are present
With two GPUs (e.g., a 3090 plus an older GT 730 for display), Ollama may pick the display GPU (index 0), which has only 2 GB of VRAM, and then immediately fall back to CPU because the model doesn’t fit.
How to spot it: Run nvidia-smi -L and compare indices. Check CUDA_VISIBLE_DEVICES — if unset, CUDA defaults to device 0 which may be the display adapter.
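One way to avoid guessing at indices is to pick the card with the most VRAM and refer to it by UUID, since nvidia-smi and the CUDA runtime do not always number devices in the same order. The sketch below assumes the comma-separated output of nvidia-smi's --query-gpu mode; the function name largest_gpu is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "index,uuid" of the GPU with the most VRAM.
# Expects CSV lines on stdin as produced by:
#   nvidia-smi --query-gpu=index,uuid,memory.total --format=csv,noheader,nounits
largest_gpu() {
  awk -F', *' '
    $3 + 0 > best { best = $3 + 0; pick = $1 "," $2 }
    END { if (pick != "") print pick }
  '
}

# Usage on a real system:
#   gpu=$(nvidia-smi --query-gpu=index,uuid,memory.total \
#           --format=csv,noheader,nounits | largest_gpu)
#   export CUDA_VISIBLE_DEVICES="${gpu#*,}"   # keep only the UUID part
```

CUDA_VISIBLE_DEVICES accepts a GPU UUID as well as an index, so the value stays valid even if the enumeration order changes.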
4. ROCm not configured for AMD GPUs
On AMD hardware, Ollama requires ROCm 5.7+ and HSA_OVERRIDE_GFX_VERSION for some consumer cards (RX 6xxx, RX 7xxx) that ROCm doesn’t officially list. Without the override, Ollama’s ROCm probe returns an empty device list.
How to spot it: Run rocm-smi — if it shows your GPU but ollama run llama3.2:3b still uses CPU, the GFX version override is missing.
5. Ollama installed via Snap or Flatpak with GPU access restricted
Snap and Flatpak sandbox the process and restrict access to /dev/nvidia* by default. Even with a correct driver, the containerized Ollama cannot open the device nodes.
How to spot it: Run snap connections ollama | grep hardware-observe or check the Flatpak permissions. If GPU interfaces are not connected, that’s the cause.
6. Metal/MPS not available on older macOS or non-Apple-Silicon Mac
On macOS, Ollama uses Metal/MPS. On Intel Macs with AMD discrete GPUs, Metal compute is available but Ollama 0.4 may not enumerate it correctly if the macOS version is below 13.0.
How to spot it: Run system_profiler SPDisplaysDataType | grep Metal — if Metal is “supported” but not “enabled,” or if macOS is 12.x or older, GPU acceleration will not work.
Shortest path to fix
Step 1: Confirm the GPU is visible to CUDA and check driver version
# NVIDIA
nvidia-smi
# Look for: Driver Version >= 525.xx, CUDA Version >= 12.0
# AMD
rocm-smi
For Ollama 0.4, NVIDIA driver 525+ with CUDA 12.x is required. Driver 520 or older will not work.
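The version threshold can be checked in a script instead of by eye. This is a small sketch that compares only the major driver number against the 525 minimum stated above; driver_ok is a hypothetical name, and the second argument lets you change the threshold.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the driver's major version meets the minimum.
#   $1: driver version string, e.g. 535.104.05
#   $2: minimum major version (defaults to 525, the figure used in this article)
driver_ok() {
  local version="$1"
  local minimum="${2:-525}"
  local major="${version%%.*}"
  # Reject empty or non-numeric input instead of guessing
  case $major in
    ''|*[!0-9]*) return 2 ;;
  esac
  [ "$major" -ge "$minimum" ]
}

# Usage on a real system:
#   driver_ok "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)" \
#     || echo "driver too old for this Ollama release"
```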
Step 2: Restart the Ollama service (not just the CLI)
# Linux systemd
sudo systemctl restart ollama
# macOS
pkill ollama
ollama serve &
# Then verify GPU is detected
ollama run llama3.2:3b "say hello"
# Check utilization during inference
watch -n 1 nvidia-smi
Step 3: Set CUDA_VISIBLE_DEVICES to pin the correct GPU
# List all GPUs and their indices
nvidia-smi -L
# GPU 0: NVIDIA GeForce RTX 3090 (UUID: ...)
# GPU 1: NVIDIA GeForce GT 730 (UUID: ...)
# Force Ollama to use GPU 0 (the 3090)
export CUDA_VISIBLE_DEVICES=0
ollama serve &
Add to /etc/systemd/system/ollama.service under [Service]:
Environment="CUDA_VISIBLE_DEVICES=0"
Then sudo systemctl daemon-reload && sudo systemctl restart ollama.
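An alternative to editing the unit file directly is a systemd drop-in, which stays in place if the main unit file is ever regenerated. The sketch below writes one; the function name write_gpu_dropin and the file name gpu.conf are choices made for this example, while the ollama.service.d directory follows the standard systemd drop-in convention.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a drop-in that pins CUDA_VISIBLE_DEVICES.
#   $1: drop-in directory, e.g. /etc/systemd/system/ollama.service.d
#   $2: device index or GPU UUID to expose to Ollama
write_gpu_dropin() {
  local dropin_dir="$1"
  local devices="$2"
  mkdir -p "$dropin_dir"
  printf '[Service]\nEnvironment="CUDA_VISIBLE_DEVICES=%s"\n' "$devices" \
    > "$dropin_dir/gpu.conf"
}

# Usage on a real system (as root):
#   write_gpu_dropin /etc/systemd/system/ollama.service.d 0
#   systemctl daemon-reload && systemctl restart ollama
```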
Step 4: Fix LD_LIBRARY_PATH for non-standard CUDA installs
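# Find libcuda
find /usr /opt -name "libcuda.so*" 2>/dev/null
# If found in a non-standard path, add its directory (example path shown)
export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH
# The export only affects an ollama serve started from this shell. For the
# systemd service, add this line under [Service] in the unit file instead:
#   Environment="LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64"
sudo systemctl daemon-reload && sudo systemctl restart ollama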
Step 5: For AMD GPUs, set HSA_OVERRIDE_GFX_VERSION
# Set only the one line that matches your card
# RX 6xxx series: override to gfx1030
export HSA_OVERRIDE_GFX_VERSION=10.3.0
# RX 7xxx series: use this line instead, override to gfx1100
export HSA_OVERRIDE_GFX_VERSION=11.0.0
ollama serve &
ollama run llama3.2:3b "say hello"
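To avoid setting the wrong value, the override can be derived from the gfx target the card reports. This sketch only encodes the two mappings given above; treating every gfx103x and gfx110x variant as part of the same family is an assumption, and gfx_override is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a reported gfx target to an override value.
# Only the two families discussed in this article are covered.
gfx_override() {
  case $1 in
    gfx103?) echo 10.3.0 ;;   # RX 6xxx family (assumed grouping)
    gfx110?) echo 11.0.0 ;;   # RX 7xxx family (assumed grouping)
    *) return 1 ;;            # unknown target: do not guess
  esac
}

# Usage on a real system:
#   target=$(rocminfo | grep -o -m1 'gfx[0-9a-f]*')
#   value=$(gfx_override "$target") && export HSA_OVERRIDE_GFX_VERSION="$value"
```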
Step 6: Verify GPU offloading is active
# After starting a model, check the ps output
ollama ps
# Should show: NAME ID SIZE PROCESSOR UNTIL
# ... ... ... GPU ...
# NVIDIA: confirm VRAM is consumed
nvidia-smi --query-gpu=memory.used,memory.free --format=csv
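The same check can be automated before a long job. This sketch assumes the column layout shown above and that the PROCESSOR column reads like "100% CPU", "100% GPU", or a CPU/GPU split; cpu_models is a made-up name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of every loaded model that runs fully
# or partly on the CPU, given `ollama ps` output on stdin.
cpu_models() {
  # Skip the header row, then match a CPU share in the PROCESSOR column
  awk 'NR > 1 && /% *CPU|CPU\/GPU/ { print $1 }'
}

# Usage on a real system:
#   ollama ps | cpu_models
```

No output means every loaded model reports a pure GPU placement.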
Prevention
- Pin CUDA_VISIBLE_DEVICES in the systemd unit file so the correct GPU is always selected at boot.
- Add After=nvidia-persistenced.service to ollama.service to ensure the driver is loaded before Ollama probes for devices.
- If libcuda.so.1 lives only inside a conda environment, expose that directory to the service with an Environment="LD_LIBRARY_PATH=..." line in the unit file. Activating the environment in ExecStartPre does not carry over to the main Ollama process.
- For AMD systems, document the HSA_OVERRIDE_GFX_VERSION value in a comment next to the entry in /etc/environment.
- After any driver update, always run sudo systemctl restart ollama, because Ollama caches the device list at startup.
- On macOS, keep the system on 13.0+ for reliable Metal compute access.
- Run ollama ps immediately after loading a model to confirm GPU vs. CPU before starting a long inference job.
- For Snap installs, switch to the official .tar.gz install script (curl -fsSL https://ollama.com/install.sh | sh) to avoid sandbox restrictions.
FAQ
Q: Ollama shows “GPU” in ollama ps but nvidia-smi shows 0% utilization — why?
A: The “GPU” label means Ollama allocated VRAM for the weights, but utilization spikes only during token generation. Start a prompt and immediately watch nvidia-smi — you’ll see the utilization jump during the forward pass.
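If the spike is too brief to catch by eye, sampling utilization once per second and keeping the maximum is one option. A minimal sketch: peak_util is a hypothetical name, and the 15-second window in the usage line is arbitrary.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the highest value from a stream of utilization
# samples (one integer percentage per line on stdin).
peak_util() {
  awk '$1 + 0 > max { max = $1 + 0 } END { print max + 0 }'
}

# Usage on a real system: start a prompt in another terminal, then run
#   timeout 15 nvidia-smi --query-gpu=utilization.gpu \
#     --format=csv,noheader,nounits -l 1 | peak_util
```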
Q: Can I use both CPU and GPU for a model that doesn’t fit in VRAM?
A: Yes. Ollama automatically splits layers between GPU and CPU when the model is larger than available VRAM: the GPU handles as many layers as fit, and the remainder run on CPU. To reserve headroom on each GPU, set OLLAMA_GPU_OVERHEAD, which takes a value in bytes (268435456 for 256 MiB). This lowers the amount of VRAM Ollama fills before it starts placing layers on the CPU.
Q: My VRAM is 8 GB but ollama ps shows 0 B GPU — why?
A: The model you loaded exceeds 8 GB even in its quantized form, so Ollama fell back entirely to CPU. Try a more aggressive quantization: ollama pull llama3.1:8b-instruct-q4_K_M instead of q8_0.
Q: Does Ollama support multiple GPUs for a single model?
A: Yes, starting in Ollama 0.3. Set OLLAMA_GPU_OVERHEAD and ensure all GPUs are CUDA-visible. Ollama distributes transformer layers evenly across available GPUs. See the multi-GPU article for split verification steps.