Multi-GPU Not Used — Model Runs Only on GPU 0

A local LLM uses only one GPU even though multiple are present. Fix tensor-parallel splits, NCCL setup, and Ollama multi-GPU configuration to distribute the workload.

You have two RTX 3090s (48 GB combined VRAM) in your workstation and load a 70B Q4_K_M model that requires 42 GB. Running nvidia-smi during inference shows GPU 0 at 100% utilization with 24 GB of VRAM used, while GPU 1 shows 0% and 0 MB. The combined capacity is sufficient, so the model should split across both cards, yet it loads only onto GPU 0 and no error is reported. This is the default behavior for both Ollama and llama.cpp when multi-GPU support isn’t explicitly configured: the engines enumerate GPUs and try to fit the model into a single card before falling back to split-layer mode.

Common causes

Ordered by hit rate, highest first.

1. Model fits in GPU 0 VRAM alone — no automatic split

Ollama and llama.cpp will use a single GPU if the model fits. A Q4_K_M 7B model (4.4 GB) on a system with two 24 GB GPUs will run entirely on GPU 0 — correctly, because there’s no benefit to splitting a 4.4 GB model across two GPUs. The “problem” only occurs when the model is large enough to benefit from splitting, and even then, the engine must be told to split.

How to spot it: Check the model VRAM footprint vs. single-GPU VRAM. If the model fits comfortably in one GPU, single-GPU execution is correct behavior. Only expect auto-split for models larger than a single GPU’s VRAM.
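
The comparison above can be sketched as a small shell function. This is an illustration only: the function name and the 10% headroom reserved for the KV cache and runtime overhead are assumptions, not values taken from Ollama or llama.cpp.

```shell
# Decide whether a model fits on one GPU, leaving headroom for the KV cache.
# Arguments are in MiB. The 10% headroom is an assumed margin, not a measured one.
fits_single_gpu() {
  model_mib="$1"
  gpu_mib="$2"
  # Usable budget = 90% of the card's total memory (integer arithmetic).
  budget=$(( gpu_mib * 90 / 100 ))
  if [ "$model_mib" -le "$budget" ]; then
    echo "fits"
  else
    echo "needs-split"
  fi
}

fits_single_gpu 4506 24576    # 4.4 GB model on a 24 GB card
fits_single_gpu 43008 24576   # 42 GB model on a 24 GB card
```

If the result is "fits", single-GPU execution is expected and there is nothing to fix.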

2. CUDA_VISIBLE_DEVICES restricts to GPU 0 only

A CUDA_VISIBLE_DEVICES=0 environment variable (set in shell profile, systemd unit, or a parent process) hides all GPUs except GPU 0. The engine genuinely only sees one GPU and cannot split across two invisible devices.

How to spot it: Run echo $CUDA_VISIBLE_DEVICES. If it’s set to 0 or a single GPU UUID, that’s the cause. Run nvidia-smi -L to verify both GPUs are present at the system level.
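
A rough way to turn that check into a number is to count the entries in the variable. The helper name below is hypothetical, and the word "unset" is used as a sentinel for a variable that is not set at all (meaning no restriction).

```shell
# Report how many devices a CUDA_VISIBLE_DEVICES value exposes.
# "unset" is a sentinel for "variable not set" (all GPUs visible);
# an empty string hides every GPU.
visible_gpu_count() {
  case "$1" in
    unset) echo "all" ;;
    "")    echo 0 ;;
    # Count comma-separated entries (indices or UUIDs).
    *)     echo "$1" | awk -F',' '{ print NF }' ;;
  esac
}

# ${VAR-word} expands to "word" only when VAR is not set at all.
visible_gpu_count "${CUDA_VISIBLE_DEVICES-unset}"
```

A result of 1 on a two-GPU machine points at this cause. Remember to run the check in the same environment as the engine (for example, inside the systemd unit), not only in your interactive shell.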

3. Ollama older than 0.3: multi-GPU not supported

Ollama added multi-GPU support in version 0.3. Earlier versions enumerate all GPUs at startup but only use GPU 0 for model loading and inference.

How to spot it: Run ollama --version. If it’s below 0.3.0, upgrade. On Linux: curl -fsSL https://ollama.com/install.sh | sh. On macOS: download the latest Ollama.app.
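
The version comparison can be scripted. The sketch below assumes GNU sort (for the -V version sort) and assumes the version number is the last word printed by ollama --version; the installed value is hard-coded here so the example runs without Ollama present.

```shell
# Return success (0) when version $1 is strictly older than version $2.
# Relies on `sort -V` (version sort), available in GNU coreutils.
version_lt() {
  [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Assumed parsing, not verified against every release:
#   installed=$(ollama --version | awk '{ print $NF }')
installed="0.2.8"   # example value for illustration

if version_lt "$installed" "0.3.0"; then
  echo "upgrade needed"
else
  echo "version ok"
fi
```

A plain string comparison would get versions such as 0.10.0 wrong, which is why the sketch uses a version-aware sort.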

4. GPUs connected over PCIe without NVLink

For optimal multi-GPU inference, the GPUs need to communicate efficiently. Consumer multi-GPU setups (two 3090s on PCIe without NVLink) can bottleneck on PCIe bandwidth for inter-layer communication. Some llama.cpp versions detect this and fall back to single-GPU mode to avoid the bandwidth penalty.

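How to spot it: Run nvidia-smi topo -m and check the relationship between your GPUs. NV# means NVLink (fast); PIX means the path crosses at most one PCIe bridge (the fastest PCIe route); PHB means it crosses a PCIe host bridge, typically the CPU (slower); SYS means it also crosses the interconnect between NUMA nodes (slowest). Without NVLink, split inference is slower than single-GPU inference for many models that would fit on one card.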

5. --tensor-split not configured for vLLM or llama-server

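vLLM needs --tensor-parallel-size N to distribute a model across multiple GPUs; its default is 1, so without the flag it uses a single GPU. llama-server takes --tensor-split A,B (proportional VRAM allocation) to control how much of the model goes on each GPU; without it, the split is left to the engine’s defaults, which may not match your hardware.

How to spot it: Check your startup command for --tensor-parallel-size (vLLM) or --tensor-split (llama-server/llama-cli). If --tensor-parallel-size is absent, vLLM uses only one GPU; if --tensor-split is absent, set it explicitly rather than relying on defaults.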

6. Mismatched GPU VRAM sizes causing unequal load distribution

If GPU 0 has 24 GB and GPU 1 has 16 GB, an equal 50/50 split will OOM GPU 1. Engines that don’t detect the asymmetry may abort the split and fall back to single-GPU mode. The model must be split proportionally to each GPU’s available VRAM.

How to spot it: Run nvidia-smi --query-gpu=memory.total --format=csv,noheader. If the two values are different, you need a proportional split, not a 50/50 split.
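
One way to derive a proportional split is to divide each GPU's total memory by the smallest one. The sketch below is a hypothetical helper, not part of llama.cpp: it reads one memory total per line (in MiB) and prints a value in the form that --tensor-split takes. Sample numbers are piped in so it runs without a GPU.

```shell
# Turn per-GPU memory totals (MiB, one per line on stdin) into a
# --tensor-split value normalised so the smallest GPU gets 1.
tensor_split_ratio() {
  awk '
    { mem[NR] = $1 + 0; if (NR == 1 || mem[NR] < min) min = mem[NR] }
    END {
      if (NR == 0 || min <= 0) exit 1
      for (i = 1; i <= NR; i++)
        printf "%s%.2f", (i > 1 ? "," : ""), mem[i] / min
      printf "\n"
    }'
}

# Real use (needs an NVIDIA driver):
#   nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | tensor_split_ratio
printf '24576\n16384\n' | tensor_split_ratio   # 1.50,1.00
```

Total memory is only a starting point. If one GPU also drives a display, its free memory is lower and the ratio may need to shift toward the other card.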

Shortest path to fix

Step 1: Verify both GPUs are visible

# List all GPUs and their indices
nvidia-smi -L
# GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-abc123)
# GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-def456)

# Unset CUDA_VISIBLE_DEVICES if it's restricting visibility
unset CUDA_VISIBLE_DEVICES

# Verify Ollama sees both GPUs
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i gpu

Step 2: For Ollama 0.4+, set CUDA_VISIBLE_DEVICES to include both GPUs

# In shell
export CUDA_VISIBLE_DEVICES=0,1
ollama serve &

# In systemd unit file
sudo systemctl edit ollama
# Add under [Service]:
# Environment="CUDA_VISIBLE_DEVICES=0,1"

sudo systemctl daemon-reload
sudo systemctl restart ollama

# Verify after loading a large model
ollama ps   # PROCESSOR column should read 100% GPU
nvidia-smi  # memory should be in use on both GPUs

Step 3: For llama-server, use --tensor-split for proportional split

# Two equal GPUs (both 24 GB) — 50/50 split
./llama-server \
  -m models/llama-3.1-70b-Q4_K_M.gguf \
  --tensor-split 1,1 \
  --n-gpu-layers 80 \
  --port 8080

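# Asymmetric GPUs (24 GB + 16 GB): proportional split (24/16 = 1.5)
# 40 GB total is less than the 42 GB model, so offload fewer layers;
# 70 is a starting guess, lower it if you hit out-of-memory errors
./llama-server \
  -m models/llama-3.1-70b-Q4_K_M.gguf \
  --tensor-split 1.5,1 \
  --n-gpu-layers 70 \
  --port 8080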

Step 4: For vLLM, use --tensor-parallel-size

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --port 8000

vLLM requires the model’s attention head count to be divisible by --tensor-parallel-size. For models with 64 attention heads, --tensor-parallel-size 2 is valid.
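
The head-count condition is easy to check before launching. The sketch below covers only that one condition; the function name is made up for illustration, and vLLM may enforce further constraints that are not modelled here.

```shell
# Succeed when an attention-head count divides evenly by a tensor-parallel size.
valid_tp_size() {
  heads="$1"
  tp="$2"
  [ "$tp" -gt 0 ] && [ $(( heads % tp )) -eq 0 ]
}

# Try a few candidate sizes for a model with 64 attention heads.
for tp in 1 2 3 4 8; do
  if valid_tp_size 64 "$tp"; then
    echo "tp=$tp ok"
  else
    echo "tp=$tp invalid"
  fi
done
```

The head count is usually listed as num_attention_heads in the model's config.json.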

Step 5: For llama-cli, verify split is active

# Run with verbose to confirm layer distribution
./llama-cli \
  -m models/llama-3.1-70b-Q4_K_M.gguf \
  --tensor-split 1,1 \
  --n-gpu-layers 80 \
  -p "say hello" \
  -n 10 \
  --verbose 2>&1 | grep "GPU\|layer"

You should see layers distributed across both GPUs in the verbose output.

Step 6: Monitor both GPUs during inference

# Watch both GPUs in real time
watch -n 1 "nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.free --format=csv,noheader"

Both GPUs should show non-zero utilization during token generation. If only GPU 0 is active, the split is not working.
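
To flag an idle card automatically rather than by eye, the CSV output can be filtered. This is a sketch with a hypothetical helper name; the 100 MiB memory threshold is an arbitrary assumption, and sample rows are piped in so it runs without a GPU.

```shell
# Read "index, utilization, memory.used" CSV rows on stdin and print the
# indices of GPUs that look idle (0% utilization and under 100 MiB used).
# The 100 MiB threshold is an arbitrary assumption for this sketch.
idle_gpus() {
  awk -F', *' '$2 + 0 == 0 && $3 + 0 < 100 { print $1 }'
}

# Real use (needs an NVIDIA driver):
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used \
#     --format=csv,noheader,nounits | idle_gpus
printf '0, 100, 24000\n1, 0, 0\n' | idle_gpus   # prints 1
```

Empty output means every GPU is either computing or holding part of the model. Sample during token generation, since all GPUs look idle between requests.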

Prevention

  • Always explicitly set CUDA_VISIBLE_DEVICES=0,1 (or all relevant GPU indices) in the Ollama systemd service, not just in your user shell.
  • Check nvidia-smi -L after every driver or OS update to confirm all GPUs are enumerated.
  • Use --tensor-split proportional to each GPU’s VRAM, not a fixed 50/50 split on mixed-capacity systems.
  • For vLLM, only use --tensor-parallel-size values that divide evenly into the model’s attention head count.
  • Accept that consumer multi-GPU setups without NVLink (PCIe-only) have significant communication overhead — only split models that don’t fit in a single GPU.
  • After setting up multi-GPU, run a generation benchmark and compare against single-GPU speed. If multi-GPU is slower, stick with single-GPU plus CPU offload for the remainder.
  • Document your --tensor-split ratio in a comment next to the launch script so future you or teammates don’t have to rediscover it.

FAQ

Q: Does multi-GPU make inference faster or just allow larger models? A: Primarily larger models. For models that fit in a single GPU, tensor parallelism across multiple GPUs introduces inter-GPU communication overhead (all-reduce operations) that typically makes inference slower than on a single GPU. On consumer hardware without NVLink, multi-GPU is a capacity expansion mechanism, not a speed optimization.

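Q: What’s the difference between tensor parallelism and pipeline parallelism for multi-GPU? A: Tensor parallelism (used by vLLM’s --tensor-parallel-size; llama.cpp’s --split-mode row is the closest equivalent) splits individual weight matrices across GPUs, requiring an all-reduce between GPUs in every layer. Pipeline parallelism splits the model’s layers sequentially across GPUs, with less communication but only one GPU busy at a time for a single request. llama.cpp’s default, --split-mode layer, assigns whole layers to each GPU in the ratio given by --tensor-split, so it behaves like the second scheme despite the flag’s name. Tensor parallelism is generally preferred for inference speed when the interconnect is fast.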

Q: ollama ps reports the model as 100% GPU, but utilization is still mostly on GPU 0. Why? A: ollama ps shows how much of the model is on GPU versus CPU, not which GPU is actively computing; use nvidia-smi to see per-GPU memory and utilization. With a layer split, every token passes through all layers in order, so GPU 1 waits while GPU 0 works through its layers and vice versa. Each GPU is busy only part of the time, and a GPU holding fewer layers shows lower utilization. Multi-GPU utilization is non-uniform, so the two GPUs won’t show equal utilization.

Q: Can I run two different models on two GPUs simultaneously without splitting? A: Yes. Set CUDA_VISIBLE_DEVICES=0 for one Ollama instance on port 11434 and CUDA_VISIBLE_DEVICES=1 for another on port 11435. Each instance uses its own GPU independently. This is often better than splitting one model for models that fit in a single GPU.
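
A sketch of that two-instance setup follows. It assumes the listen address is set through the OLLAMA_HOST environment variable and binds to 127.0.0.1; the helper name is hypothetical. The launch lines are commented out so the example only prints the settings.

```shell
# Print the environment assignments for one single-GPU Ollama instance.
# OLLAMA_HOST sets the listen address; binding to 127.0.0.1 is an assumption.
instance_env() {
  gpu="$1"
  port="$2"
  printf 'CUDA_VISIBLE_DEVICES=%s OLLAMA_HOST=127.0.0.1:%s\n' "$gpu" "$port"
}

instance_env 0 11434
instance_env 1 11435

# To actually launch them (commented out so the sketch has no side effects):
#   env $(instance_env 0 11434) ollama serve &
#   env $(instance_env 1 11435) ollama serve &
```

Clients then pick an instance by pointing OLLAMA_HOST (or the API base URL) at the matching port.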

Tags: #local-llm #ollama #Troubleshooting