Ollama Doesn't Detect the GPU, Falls Back to CPU

Ollama ignores your NVIDIA or AMD GPU and runs on CPU only. Read the inference-compute log line, fix driver, CUDA, and ROCm mismatches, and force GPU offloading.

Published: May 25, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

You pull a small model like mistral:7b, and generation crawls at 2-5 tokens/s instead of the 40-120 tokens/s a recent GPU should deliver. ollama ps shows the model with 100% CPU in the PROCESSOR column, and nvidia-smi sits at 0% utilization even though the model fits in VRAM and the driver looks current. Ollama auto-detects NVIDIA, AMD, Apple, and (since 0.12.11) Vulkan GPUs, and when detection fails it falls back to CPU silently: no error, just slow output. The problem is usually in the driver/runtime layer, not in Ollama itself.

Fastest fix: stop the service and start Ollama in the foreground with debug logging on — the one log line msg="inference compute" tells you exactly which device (if any) Ollama found, which collapses the guesswork:

sudo systemctl stop ollama 2>/dev/null || true
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -iE 'inference compute|no compatible|cuda|rocm|metal|error'

If you see a library=cuda (or rocm/metal) line naming your card, the GPU is detected and your problem is VRAM/offload, not detection. If you see no compatible GPUs were discovered, work through the causes below.

Which bucket are you in?

| Symptom in the debug log / `nvidia-smi` | Most likely cause | Jump to |
| --- | --- | --- |
| `nvidia-smi` itself fails or shows old driver | Driver too old / not installed | Cause 1 |
| `inference compute` line missing, `CUDA_VISIBLE_DEVICES` is `""` or `-1` | Env var hiding the GPU | Cause 2 |
| Works in your shell, fails under systemd | Service can’t see driver/device nodes | Cause 3 |
| `rocm-smi` shows the card but Ollama says `amdgpu is not supported` | AMD gfx version not overridden / ROCm too old | Cause 4 |
| In WSL2, `nvidia-smi` empty or `/dev/dxg` missing | Windows-side driver / passthrough | Cause 5 |
| Snap or Flatpak install, `/dev/nvidia*` not accessible | Sandbox blocks device nodes | Cause 6 |
| Apple Silicon, slow despite Metal | `OLLAMA_NO_GPU`-style override or RAM pressure | Cause 7 |

Common causes

1. NVIDIA driver too old (or CUDA never installed)

As of June 2026, Ollama needs NVIDIA driver 531 or newer (CUDA 12.3+). Cards with compute capability 5.0-6.2 (Maxwell/Pascal, e.g. GTX 10-series, Tesla P40) need driver 570 or newer. Anything older is detected as incompatible and Ollama drops to CPU. Note that the display driver and the CUDA toolkit are separate; Ollama bundles its own CUDA runtime, but it still needs nvidia-smi to work and libcuda.so.1 from the system driver to be loadable.

How to spot it: nvidia-smi --query-gpu=driver_version --format=csv,noheader. If it errors with command not found or shows a version below 531 (or below 570 on an older card), that is your cause. Also run ldconfig -p | grep libcuda — if it returns nothing, the driver library is not on the loader path.
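The version check above can be scripted. The following is a minimal sketch, not part of Ollama: `driver_meets_minimum` is a hypothetical helper name, and it compares only the major version number against the thresholds quoted in this article.

```shell
# Return 0 when the driver's major version meets the minimum, 1 when it does
# not, and 2 when the version string cannot be parsed.
# $1: driver version string as printed by nvidia-smi (e.g. 535.183.01)
# $2: minimum major version (531, or 570 for compute-capability 5.0-6.2 cards)
driver_meets_minimum() {
    major=${1%%.*}                 # keep everything before the first dot
    case $major in
        ''|*[!0-9]*) return 2 ;;   # empty or non-numeric: cannot decide
    esac
    [ "$major" -ge "$2" ]
}

# Usage on a machine with an NVIDIA driver installed:
#   v=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   driver_meets_minimum "$v" 531 && echo "driver OK" || echo "driver too old"
```

Pass 570 as the second argument when the card is in the compute-capability 5.0-6.2 range.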

2. `CUDA_VISIBLE_DEVICES` set to empty or `-1`

Some conda envs, Docker images, and IDE plugins set CUDA_VISIBLE_DEVICES="" or =-1 on activation, which hides every GPU from CUDA. Ollama then finds no devices and uses CPU.

How to spot it: echo "[${CUDA_VISIBLE_DEVICES-unset}]". If it prints [] (set but empty) or [-1], that is the cause; [unset] means the variable is absent and is not hiding anything. Check ~/.bashrc, ~/.zshrc, and any conda/activate.d/ scripts for a stray export.
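An unset variable and an empty one look identical in a plain echo, yet only the empty one hides GPUs. As a rough sketch, the hypothetical helper `cuda_visible_state` below separates the three cases using standard shell parameter expansion.

```shell
# Print how CUDA_VISIBLE_DEVICES affects GPU visibility:
#   unset  -> variable absent, nothing is hidden
#   hidden -> empty string or -1, every GPU is hidden from CUDA
#   pinned -> any other value, selecting specific devices
cuda_visible_state() {
    if [ -z "${CUDA_VISIBLE_DEVICES+set}" ]; then
        echo unset
    elif [ -z "$CUDA_VISIBLE_DEVICES" ] || [ "$CUDA_VISIBLE_DEVICES" = "-1" ]; then
        echo hidden
    else
        echo pinned
    fi
}
```

Run it in the same shell or activated environment that launches Ollama, since that is where a stray export would take effect.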

3. Ollama runs under systemd but can’t reach the driver or device nodes

When Ollama runs as the ollama systemd service, two things break GPU access: (a) the service may start before the NVIDIA kernel module is loaded, so it caches “no GPU” for the life of the process, and (b) the service account may not be in the render/video groups, so it cannot open /dev/dri/* (AMD) or /dev/nvidia*.

How to spot it: systemctl status ollama start time vs. dmesg | grep -i nvidia module-load time; and journalctl -u ollama -n 100 | grep -iE 'permission|denied|no compatible'. A plain sudo systemctl restart ollama that suddenly makes the GPU appear confirms a startup-ordering race.
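The group-membership half of this check is easy to automate. Here is a small sketch with a hypothetical helper, `missing_groups`; it assumes the service account is named ollama, which may differ on your install.

```shell
# Print the required groups that are absent from a user's group list.
# $1: space-separated group names, e.g. the output of `id -nG ollama`
# Remaining arguments: the groups to require (this article names render and video)
missing_groups() {
    have=" $1 "
    shift
    missing=""
    for g in "$@"; do
        case $have in
            *" $g "*) ;;                              # already a member
            *) missing="${missing:+$missing }$g" ;;   # collect the gap
        esac
    done
    echo "$missing"
}

# Usage (assumes a service account named ollama):
#   missing_groups "$(id -nG ollama)" render video
```

An empty result means the account already belongs to every listed group. Otherwise the printed names can be added with `sudo usermod -aG`, followed by a service restart.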

4. AMD GPU: ROCm too old or gfx version not overridden

On AMD, Ollama now targets ROCm v7 (Linux and Windows). The ROCm backend only accepts a fixed list of GPU families. If OLLAMA_DEBUG=1 logs amdgpu is not supported (supported types:[gfx1030 gfx1100 gfx1101 gfx1102 gfx900 gfx906 gfx908 gfx90a gfx940 gfx941 gfx942]), your card’s gfx ID isn’t on the list and you must alias it to a supported one with HSA_OVERRIDE_GFX_VERSION.

How to spot it: rocm-smi shows the card, rocminfo | grep gfx prints its gfx ID, but ollama run stays on CPU. The debug log’s “supported types” line is the tell.

5. WSL2: GPU not passed through from Windows

In Windows 11 + WSL2 the GPU is exposed by the Windows-side driver — you must not install a Linux NVIDIA driver inside WSL2 (that breaks passthrough). You need a Windows NVIDIA driver 531+ and the /dev/dxg device present inside the distro.

How to spot it: inside WSL2 run nvidia-smi (should list the card) and ls /dev/dxg (must exist). If nvidia-smi is empty or /dev/dxg is missing, passthrough is the problem. AMD ROCm in WSL2 is unreliable because WSL2 exposes /dev/dxg, not the /dev/kfd ROCm expects.
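Both WSL2 checks can be folded into one report. This is only a sketch: `wsl_gpu_report` is a hypothetical helper, and the two arguments exist so the defaults (/dev/dxg and nvidia-smi) can be swapped out when experimenting.

```shell
# Report whether the WSL2 GPU device node and the NVIDIA CLI are both available.
# $1: device path to test (defaults to /dev/dxg)
# $2: command to look up on PATH (defaults to nvidia-smi)
wsl_gpu_report() {
    dev=${1:-/dev/dxg}
    tool=${2:-nvidia-smi}
    if [ -e "$dev" ]; then echo "device: present"; else echo "device: missing"; fi
    if command -v "$tool" >/dev/null 2>&1; then
        echo "tool: found"
    else
        echo "tool: not found"
    fi
}
```

Inside the distro, call `wsl_gpu_report` with no arguments. A missing device points at the Windows-side driver or passthrough rather than at Ollama.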

6. Snap or Flatpak install with GPU access sandboxed

Snap and Flatpak sandbox the process and restrict /dev/nvidia* by default, so even a correct driver leaves Ollama unable to open the device nodes.

How to spot it: snap connections ollama | grep -i hardware (or the Flatpak permission list). If the GPU interface is not connected, that is the cause.

7. Apple Silicon: an override forces CPU, or the model is too big for unified memory

On Apple Silicon, Ollama uses Metal automatically — there is no CUDA. If someone exported OLLAMA_NO_GPU=1 (or a similar flag), Metal is disabled and the failure is quiet. The other common case: a quantized model larger than free unified memory pushes layers onto CPU.

How to spot it: env | grep -i ollama for stray overrides; system_profiler SPDisplaysDataType | grep -i metal to confirm Metal support. Setting any CUDA_* variable on a Mac is pointless and can cause odd behavior.

Shortest path to fix

Step 1: Read the inference-compute log line first

sudo systemctl stop ollama 2>/dev/null || true
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -iE 'inference compute|no compatible|cuda|rocm|metal'

A healthy NVIDIA line looks like:

msg="inference compute" id=GPU-xxxx library=cuda compute=8.9 driver=12.x name="NVIDIA GeForce RTX 4090" total="23.6 GiB" available="23.2 GiB"

If that line names your card, skip to Step 7 (this is a VRAM/offload issue, not detection). If you see no compatible GPUs were discovered, continue.
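To avoid reading the log by eye, the detection line can be summarized with awk. The sketch below assumes the key=value layout of the sample line above; `summarize_detection` is a hypothetical helper. Treating library=cpu as a CPU fallback is an assumption about how some Ollama versions log, not something this article's sample confirms.

```shell
# Read Ollama server log text on stdin and summarize GPU detection.
# Prints "gpu <library> <name>" for each inference-compute line, "cpu-only"
# when that line reports library=cpu (assumed fallback marker), "none" when
# only the no-GPU message appears, and "unknown" when neither is found.
summarize_detection() {
    awk '
        /msg="inference compute"/ {
            lib = "?"; name = "?"
            if (match($0, /library=[^ ]+/)) lib  = substr($0, RSTART + 8, RLENGTH - 8)
            if (match($0, /name="[^"]*"/))  name = substr($0, RSTART + 6, RLENGTH - 7)
            if (lib == "cpu") print "cpu-only"
            else              print "gpu " lib " " name
            found = 1
        }
        /no compatible GPUs were discovered/ { none = 1 }
        END { if (!found) print (none ? "none" : "unknown") }
    '
}

# Usage: pipe a log capture into summarize_detection, for example the
# output of `journalctl -u ollama`.
```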

Step 2: Confirm the GPU and driver are visible to the OS

# NVIDIA
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
# Driver must be >= 531 (>= 570 for compute-capability 5.0-6.2 cards)

# AMD
rocm-smi
rocminfo | grep -i gfx   # note the gfxNNNN id

# Apple Silicon
system_profiler SPDisplaysDataType | grep -i "metal\|chipset"

Step 3: Clear environment variables that hide the GPU

unset CUDA_VISIBLE_DEVICES   # NVIDIA
unset ROCR_VISIBLE_DEVICES   # AMD device selector
unset OLLAMA_NO_GPU

# Re-test with verbose timing
ollama run llama3.2:3b "say hello" --verbose
# Look at the "eval rate" line — on GPU it should be well above 30 tokens/s
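# The eval-rate check can be scripted. This is a sketch only: classify_eval_rate
# is a hypothetical helper, and it assumes the statistics block contains a line
# of the form "eval rate:   <number> tokens/s".

```shell
# Read `ollama run --verbose` statistics on stdin and label the generation speed.
# $1: threshold in tokens/s (defaults to 30, the figure used in this article)
classify_eval_rate() {
    awk -v min="${1:-30}" '
        # Anchored at line start, so "prompt eval rate:" is not matched.
        /^[[:space:]]*eval rate:/ { rate = $3; seen = 1 }
        END {
            if (!seen) { print "no eval rate found"; exit 1 }
            label = (rate + 0 > min + 0) ? "gpu-like" : "cpu-like"
            print label " (" rate " tokens/s)"
        }
    '
}

# Usage (2>&1 captures the statistics whichever stream they are written to):
#   ollama run llama3.2:3b "say hello" --verbose 2>&1 | classify_eval_rate
```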

Step 4: Update the NVIDIA driver (Linux), then restart Ollama

# Ubuntu 22.04 / 24.04
ubuntu-drivers devices
sudo ubuntu-drivers autoinstall
sudo reboot

# After reboot, confirm and restart the service (Ollama caches the device list at startup)
nvidia-smi --query-gpu=driver_version --format=csv,noheader
sudo systemctl restart ollama

Step 5: Pin the correct GPU and give the service device access

If you have a display GPU plus a compute GPU, pin the compute card so Ollama doesn’t pick the small one:

nvidia-smi -L
# GPU 0: NVIDIA GeForce GT 730 (UUID: GPU-...)   <- display, 2 GB
# GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-...)  <- compute, 24 GB
export CUDA_VISIBLE_DEVICES=1   # or the GPU-UUID for stability
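# Pinning by UUID means pulling it out of the `nvidia-smi -L` listing. A minimal
# sketch with a hypothetical helper, gpu_uuid_by_name, assuming the
# "(UUID: GPU-...)" suffix shown in the listing above:

```shell
# Read `nvidia-smi -L` output on stdin and print the UUID of the first GPU
# whose line contains the given text.
# $1: text to look for in the GPU name, e.g. "RTX 3090"
gpu_uuid_by_name() {
    grep -F -- "$1" | sed -n 's/.*(UUID: \(GPU-[^)]*\)).*/\1/p' | head -n 1
}

# Usage:
#   export CUDA_VISIBLE_DEVICES="$(nvidia-smi -L | gpu_uuid_by_name 'RTX 3090')"
```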

Persist it in the unit file and grant device-node access:

sudo systemctl edit ollama
# In the drop-in, add (After= belongs in [Unit], not [Service]):
#   [Unit]
#   After=nvidia-persistenced.service
#
#   [Service]
#   Environment="CUDA_VISIBLE_DEVICES=1"
#   SupplementaryGroups=render video

sudo systemctl daemon-reload && sudo systemctl restart ollama

Step 6: AMD only — set the gfx override

If the debug log printed the “supported types” warning, alias your card to the nearest supported family:

# RX 6xxx (gfx103x) -> alias to gfx1030
export HSA_OVERRIDE_GFX_VERSION=10.3.0
# RX 7xxx (gfx110x) -> alias to gfx1100
export HSA_OVERRIDE_GFX_VERSION=11.0.0
# Multiple AMD cards: per-device form
# export HSA_OVERRIDE_GFX_VERSION_0=10.3.0

ollama run llama3.2:3b "say hello" --verbose

Note: forcing an override on a card whose gfx is already supported can crash inference (SIGSEGV) — only set it when the log says your type is unsupported.
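The decision of whether to set an override, and to what, can be written as a lookup. This is a sketch built only from the values in this article: the supported list is copied from the sample log line under Cause 4 and will differ between Ollama/ROCm versions. `suggest_gfx_override` is a hypothetical helper.

```shell
# Suggest an HSA_OVERRIDE_GFX_VERSION value for a gfx ID, following the
# mapping in this article. Prints "supported" when no override should be set.
# $1: gfx ID as printed by `rocminfo | grep gfx`, e.g. gfx1031
suggest_gfx_override() {
    case $1 in
        gfx1030|gfx1100|gfx1101|gfx1102|gfx900|gfx906|gfx908|gfx90a|gfx940|gfx941|gfx942)
            echo supported ;;      # already on the sample supported list
        gfx103?) echo 10.3.0 ;;    # RX 6xxx family: alias to gfx1030
        gfx110?) echo 11.0.0 ;;    # RX 7xxx family: alias to gfx1100
        *)       echo unknown ;;   # no mapping given in this article
    esac
}
```

For example, `suggest_gfx_override gfx1031` prints 10.3.0. Export a value only when the result is a version string, never when it prints supported.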

Step 7: Confirm GPU offloading is actually active

ollama run llama3.2:3b "say hello"   # load the model first
ollama ps
# PROCESSOR should read 100% GPU (or e.g. "30%/70% CPU/GPU" for a partial split)

# NVIDIA: confirm VRAM is consumed and utilization spikes during generation
watch -n 1 nvidia-smi

If ollama ps shows a CPU/GPU split, the model is too big for VRAM and some layers spilled to CPU — see the FAQ on offloading.
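For a quick pre-flight check before a long job, the PROCESSOR column can be classified automatically. The sketch below uses a hypothetical helper, `summarize_placement`, and assumes the three PROCESSOR formats described in this step plus a header row starting with NAME.

```shell
# Read `ollama ps` output on stdin; print each loaded model and its placement.
summarize_placement() {
    while IFS= read -r line; do
        case $line in
            ''|NAME*)           continue ;;   # blank line or header row
            *"100% GPU"*)       state="fully on GPU" ;;
            *"100% CPU"*)       state="CPU only" ;;
            *%/*%" CPU/GPU"*)   state="split across CPU and GPU" ;;
            *)                  state="unrecognized" ;;
        esac
        echo "${line%% *}: $state"            # first column is the model name
    done
}

# Usage: ollama ps | summarize_placement
```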

Prevention

After any driver update, run sudo systemctl restart ollama — Ollama probes for devices once at startup and caches the result.
Pin CUDA_VISIBLE_DEVICES (use the GPU UUID, not the index) in the systemd unit so the right card is always selected at boot.
Add After=nvidia-persistenced.service to the [Unit] section of ollama.service so the driver is loaded before Ollama probes.
In conda/venv activation scripts, never set CUDA_VISIBLE_DEVICES to an empty string or -1; leave it unset or pin a specific ID instead.
On WSL2, lock the Windows-side NVIDIA driver version and never install a Linux NVIDIA driver inside the distro.
For Snap/Flatpak woes, switch to the official install script: curl -fsSL https://ollama.com/install.sh | sh.
Keep OLLAMA_DEBUG=1 in the service EnvironmentFile so the inference compute line is always in the logs when you need it.
Run ollama ps right after loading a model to confirm GPU vs. CPU before a long job.

FAQ

Q: ollama ps shows GPU but nvidia-smi reads 0% utilization. Is the GPU really being used?
A: Probably yes. The PROCESSOR column reflects where the weights live, but GPU utilization only spikes during the forward pass. Start a prompt and watch -n 1 nvidia-smi during generation; you’ll see utilization jump. If VRAM is allocated but utilization never moves, then nothing is running on the GPU.

Q: Can I split a model that doesn’t fit in VRAM across GPU and CPU?
A: Yes. Ollama does this automatically, loading as many transformer layers onto the GPU as fit and running the rest on CPU. ollama ps shows the split in the PROCESSOR column. To trim a model below your VRAM, pull a smaller quant (e.g. llama3.1:8b-instruct-q4_K_M instead of q8_0). Set OLLAMA_GPU_OVERHEAD to reserve VRAM headroom if you hit out-of-memory under load.

Q: How old is too old for my NVIDIA driver?
A: As of June 2026, below driver 531 won’t work at all, and compute-capability 5.0-6.2 cards (GTX 10-series, Tesla P40, etc.) need 570+. Check with nvidia-smi --query-gpu=driver_version --format=csv,noheader.

Q: Ollama runs fine in my terminal but uses CPU under Docker. What’s missing?
A: Add --gpus all to docker run and install the nvidia-container-toolkit on the host. Verify with docker run --rm --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi before launching the Ollama container.

Q: My AMD card shows in rocm-smi but Ollama stays on CPU.
A: Look for amdgpu is not supported (supported types:[...]) in the OLLAMA_DEBUG=1 log. If your gfx ID isn’t listed, alias it with HSA_OVERRIDE_GFX_VERSION (e.g. 10.3.0 for RX 6xxx, 11.0.0 for RX 7xxx). Also confirm you’re on ROCm v7.

Q: Does Ollama use multiple GPUs for one model?
A: Yes. Make all cards CUDA-visible and Ollama distributes transformer layers across them. If only GPU 0 is used, see the multi-GPU article for split verification.

Tags: #local-llm #ollama #Troubleshooting

Related Articles

llama.cpp mmap Fails on a Network Drive

llama.cpp Quality Drops After Switching to a More Aggressive Quant

LM Studio Out of Memory When Loading a Model

Local Embedding Server Crashes Under Batched Requests

Chat-Template Mismatch Produces Garbage Local LLM Output

Multi-GPU Not Used — Local LLM Runs Only on GPU 0
