Multi-GPU Not Used — Local LLM Runs Only on GPU 0

Your local LLM uses one GPU while the others sit at 0%. Fix it with llama.cpp --split-mode, vLLM --tensor-parallel-size, Ollama auto-spread, and the NCCL flags PCIe rigs need.

Published: May 25, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

You have two RTX 3090s (48 GB combined) and load a 70B Q4_K_M model that needs ~42 GB. During inference nvidia-smi shows GPU 0 at 100% and ~24 GB used while GPU 1 sits at 0% and 0 MB. No error — the model just runs on one card.

Fastest fix: make sure every GPU is visible to the engine, then set its split option explicitly. vLLM never splits on its own; llama.cpp and Ollama do, but only across the GPUs they can see.

llama.cpp / llama-server: add -sm layer (the default; spreads layers across cards) or -sm row for tensor-parallel, plus --tensor-split 1,1 and -ngl 99.
vLLM: add --tensor-parallel-size 2 (must divide the model’s attention-head count).
Ollama: make sure both cards are visible (CUDA_VISIBLE_DEVICES=0,1); current Ollama auto-spreads a model that doesn’t fit on one card. Force spreading with OLLAMA_SCHED_SPREAD=1.

If the model already fits on a single GPU, running on one card is correct — splitting it usually makes inference slower, not faster, on consumer hardware without NVLink. Work through the buckets below to find your case.

Which bucket are you in?

Symptom	Most likely cause	Jump to
Only GPU 0 has VRAM used, model is small (smaller than one GPU’s VRAM)	No split needed — single-GPU is correct	Cause 1
`nvidia-smi -L` lists fewer GPUs than you have	`CUDA_VISIBLE_DEVICES` / Docker hides cards	Cause 2
Engine starts, big model, still one card	Split mode/flag not set	Cause 3
vLLM crashes: “attention heads must be divisible”	`--tensor-parallel-size` doesn’t divide heads	Cause 4
vLLM hangs at startup on NCCL with no error	P2P over PCIe broken on consumer cards	Cause 5
Cards split, but slower than one card	No NVLink, PCIe bandwidth is the bottleneck	Cause 6
One card OOMs on an even split	Mismatched VRAM, needs a proportional split	Cause 7

Common causes

1. The model fits in GPU 0 alone — no split is needed

Ollama and vLLM (at its default --tensor-parallel-size 1) keep a model on a single GPU when it fits, and that is the right call. A Q4_K_M 7B model (~4.4 GB) on two 24 GB cards runs entirely on GPU 0 because splitting a 4.4 GB model across two GPUs adds inter-GPU traffic with no capacity gain. llama.cpp’s default layer mode may still spread a small model across every visible card; pass -sm none to keep it on one. A split only becomes necessary once the model is larger than one card’s free VRAM.

How to spot it: compare the model’s VRAM footprint against one GPU’s free VRAM. If it fits in one card, single-GPU is expected behavior, not a bug.
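The comparison can be scripted. The sketch below is illustrative and not part of any engine: estimate_model_gb and fits_on_one_gpu are hypothetical helper names, and the 4.8 bits-per-weight figure for Q4_K_M is an assumption chosen to match the ~42 GB quoted for a 70B model. Real footprints also include KV cache and compute buffers, so leave headroom.

```shell
# Estimate a quantized model's weight footprint in GB.
# $1 = parameters in billions, $2 = average bits per weight (assumed value).
estimate_model_gb() {
    awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * bpw / 8 }'
}

# Succeed (exit 0) if the model fits in at least one GPU's free VRAM.
# $1 = model size in MiB, remaining args = free MiB per GPU.
fits_on_one_gpu() (
    model_mib=$1
    shift
    for free_mib in "$@"; do
        [ "$free_mib" -ge "$model_mib" ] && exit 0
    done
    exit 1
)

# Live free-memory values would come from:
#   nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
```

For example, fits_on_one_gpu 4400 24000 24000 succeeds (single-GPU is expected), while fits_on_one_gpu 42000 24000 24000 fails (a split is required).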

2. CUDA_VISIBLE_DEVICES (or Docker) restricts the engine to one GPU

CUDA_VISIBLE_DEVICES=0 — set in a shell profile, a systemd unit, a conda activation hook, a CI script, or a parent process — hides every GPU except GPU 0. The engine genuinely only sees one device and cannot split across invisible cards. The Docker equivalent is launching with --gpus '"device=0"' instead of --gpus all.

How to spot it: run echo $CUDA_VISIBLE_DEVICES. If it prints 0 or a single GPU UUID, that is the cause. Confirm both cards exist at the system level with nvidia-smi -L. For a containerized engine, run nvidia-smi -L inside the container, not on the host.
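A small helper makes the three cases (unset, empty, restricted) explicit. This is a minimal sketch: visible_gpu_count is a hypothetical name, and it only counts comma-separated entries without validating that each index or UUID actually exists.

```shell
# Print how many GPUs a CUDA_VISIBLE_DEVICES value exposes.
# Prints "unset" when called with no argument (no restriction applies).
visible_gpu_count() {
    if [ "$#" -eq 0 ]; then
        echo "unset"
        return
    fi
    if [ -z "$1" ]; then
        echo 0          # set but empty hides every GPU
        return
    fi
    echo "$1" | awk -F',' '{ print NF }'
}

# Check the current shell; the ${VAR+...} form passes no argument at all
# when the variable is unset, and an empty argument when it is set but empty:
#   visible_gpu_count ${CUDA_VISIBLE_DEVICES+"$CUDA_VISIBLE_DEVICES"}
```

A result of 1 on a two-card machine points at this cause.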

3. The split mode / flag isn’t set

This is the most common real cause for a large model. Each engine has its own switch, and the defaults differ:

vLLM defaults to --tensor-parallel-size 1 (single GPU). Set it to your GPU count.
llama.cpp / llama-server uses --split-mode (-sm): layer (default, pipeline-style, each card holds a contiguous slice of layers), row (tensor-parallel, splits weights across cards every layer), tensor (experimental backend-agnostic tensor parallel), or none (one GPU). With the default layer mode it does spread layers automatically when a model is too big — but only across GPUs it can see, and only if -ngl is high enough to push layers onto the GPUs in the first place.
Ollama auto-spreads a model that doesn’t fit on one visible card; if you want it spread even when it would fit on one, set OLLAMA_SCHED_SPREAD=1 (as of June 2026, treat this as an advanced override, not a default).

How to spot it: check the launch command. No --tensor-parallel-size (vLLM)? No -ngl/-sm doing anything useful (llama.cpp)? A restrictive CUDA_VISIBLE_DEVICES in front of Ollama? Any of those pins you to GPU 0.
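The launch-command check can be automated in a rough way. The sketch below assumes you pass the command words yourself; check_split_flags is a hypothetical helper, it only looks for the long flag names mentioned in this article (plus -ngl), and it does not inspect flag values or short aliases.

```shell
# Report whether a launch command carries the multi-GPU flag its engine needs.
# $1 = engine name (vllm | llama.cpp), remaining args = the launch command words.
check_split_flags() (
    engine=$1
    shift
    case "$engine" in
        vllm)      wanted="--tensor-parallel-size --pipeline-parallel-size" ;;
        llama.cpp) wanted="-ngl --n-gpu-layers" ;;
        *)         echo "unknown engine: $engine"; exit 2 ;;
    esac
    for word in "$@"; do
        for flag in $wanted; do
            case "$word" in
                "$flag"|"$flag"=*) echo "ok: found $flag"; exit 0 ;;
            esac
        done
    done
    echo "missing: one of $wanted"
    exit 1
)
```

Calling check_split_flags vllm --model foo --port 8000 reports the missing flags and exits non-zero, which is the GPU-0-only case described above.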

4. vLLM: tensor-parallel-size doesn’t divide the attention-head count

vLLM tensor parallelism splits attention heads across GPUs, so the model’s number of attention heads must be divisible by --tensor-parallel-size. If it isn’t, vLLM aborts at load with:

Total number of attention heads must be divisible by tensor parallel size

A model with 64 heads is fine on --tensor-parallel-size 2, 4, or 8 but not 3 or 5.

How to spot it: read the startup traceback for the line above. If you hit it, pick a TP size that divides the head count, or use --pipeline-parallel-size instead (it splits layers, not heads).
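The divisibility rule is easy to check before launching. This sketch covers only the attention-head check described here (vLLM may enforce other constraints); valid_tp_sizes is a hypothetical helper, and the head count is assumed to come from the model's own configuration, such as the num_attention_heads field in a Hugging Face config.json.

```shell
# Print every tensor-parallel size up to the GPU count that divides the head count.
# $1 = number of attention heads, $2 = number of GPUs available.
valid_tp_sizes() (
    heads=$1
    gpus=$2
    sizes=""
    tp=1
    while [ "$tp" -le "$gpus" ]; do
        if [ $((heads % tp)) -eq 0 ]; then
            sizes="$sizes $tp"
        fi
        tp=$((tp + 1))
    done
    echo "${sizes# }"   # strip the leading space
)
```

With 64 heads and 8 GPUs, valid_tp_sizes 64 8 prints 1 2 4 8; sizes such as 3 or 5 are absent because they do not divide 64.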

5. vLLM hangs at NCCL init on consumer GPUs (no error, no progress)

On consumer cards without NVLink, GPU-to-GPU peer access (P2P) over PCIe is often broken by IOMMU/ACS or a driver quirk. vLLM gets stuck during NCCL initialization (around pynccl.py) with no traceback, or logs peer access is not supported between these two devices.

How to spot it / fix it: launch with NCCL_P2P_DISABLE=1 (and NCCL_IB_DISABLE=1 if there’s no InfiniBand). If startup now completes, P2P was the culprit. Disabling P2P costs throughput, so treat it as a diagnostic aid rather than a permanent answer; the real fix is disabling IOMMU/ACS or updating drivers.

6. No NVLink — PCIe bandwidth makes the split slower than one card

Tensor parallelism does an all-reduce across GPUs at every layer. On consumer rigs (two 3090s on PCIe, no NVLink) that all-reduce rides the PCIe bus and becomes the bottleneck. As of June 2026, the practical rule: for a single stream at low latency, splitting is often slower than one card; tensor parallel only clearly wins under high concurrency (roughly 10+ simultaneous requests), where PCIe cost is amortized across many requests. For single-stream work on PCIe, prefer pipeline/layer split (less cross-GPU chatter) or just keep the model on one card.

How to spot it: run nvidia-smi topo -m. NV# = NVLink (fast); PIX/PXB = PCIe via a switch; PHB/SYS = via the host bridge (slowest). No NV# means PCIe-only — set expectations accordingly.
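The link codes in that matrix can be mapped to the categories above with a small lookup. A minimal sketch: classify_link and its output labels are hypothetical, and NODE (a code from the nvidia-smi legend that this article does not discuss) is grouped with the host-bridge paths as an assumption.

```shell
# Map one cell of the `nvidia-smi topo -m` matrix to a coarse link category.
# $1 = link code such as NV4, PIX, PXB, PHB, NODE, SYS, or X.
classify_link() {
    case "$1" in
        NV*)          echo "nvlink" ;;       # NV# = NVLink, the fast path
        PIX|PXB)      echo "pcie-switch" ;;  # PCIe through one or more bridges
        PHB|NODE|SYS) echo "host-bridge" ;;  # through the host bridge / CPU, slowest
        X)            echo "self" ;;         # a GPU compared with itself
        *)            echo "unknown" ;;
    esac
}
```

If no GPU pair classifies as nvlink, the rig is PCIe-only and the layer split is usually the safer choice.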

7. Mismatched VRAM sizes — an even split OOMs the smaller card

If GPU 0 has 24 GB and GPU 1 has 16 GB, a 50/50 split (--tensor-split 1,1) gives each card half the model, which OOMs GPU 1 as soon as that half exceeds 16 GB. Split proportionally to each card’s free VRAM instead.

How to spot it: run nvidia-smi --query-gpu=memory.total,memory.free --format=csv,noheader. Different values across cards mean you need a proportional --tensor-split, not 1,1.
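The proportional ratio is simple arithmetic: divide each card's memory by the smallest card's memory. The sketch below shows one way to compute it; tensor_split_ratio is a hypothetical helper, and feeding it free rather than total memory is an assumption that leaves room for whatever else is using the cards.

```shell
# Print a --tensor-split ratio proportional to each GPU's memory,
# normalized so the smallest card is 1.00.
# Args = memory per GPU in MiB.
tensor_split_ratio() {
    echo "$@" | awk '{
        min = $1 + 0
        for (i = 2; i <= NF; i++) if ($i + 0 < min) min = $i + 0
        for (i = 1; i <= NF; i++) printf "%s%.2f", (i > 1 ? "," : ""), $i / min
        printf "\n"
    }'
}

# Live values (one number per GPU) would come from:
#   tensor_split_ratio $(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits)
```

For a 24 GB plus 16 GB pair, tensor_split_ratio 24576 16384 prints 1.50,1.00, which matches the 1.5,1 ratio used for asymmetric cards.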

Shortest path to fix

Step 1: Verify every GPU is visible

# List all GPUs and their indices
nvidia-smi -L
# GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-abc123)
# GPU 1: NVIDIA GeForce RTX 3090 (UUID: GPU-def456)

# Drop any restriction so every card is visible...
unset CUDA_VISIBLE_DEVICES
# ...or pin the visible set explicitly
export CUDA_VISIBLE_DEVICES=0,1

If nvidia-smi -L lists only one card but you have two, fix the driver/seating first — no software flag can split across a card the OS doesn’t see.

Step 2: Ollama — confirm both cards, force spread if needed

Current Ollama auto-spreads any model too big for one visible card, so you usually only need to make both visible.

# Don't restrict Ollama's view of the GPUs
unset CUDA_VISIBLE_DEVICES   # or: export CUDA_VISIBLE_DEVICES=0,1

# See which GPUs Ollama detected
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i "gpu\|cuda"
# expect one "inference compute id=GPU-..." line per card

# Optional: force spreading even for a model that would fit on one card
export OLLAMA_SCHED_SPREAD=1

For a systemd-managed Ollama, set the variables in the unit, not just your shell:

sudo systemctl edit ollama
# Add under [Service]:
# Environment="CUDA_VISIBLE_DEVICES=0,1"
# Environment="OLLAMA_SCHED_SPREAD=1"
sudo systemctl daemon-reload
sudo systemctl restart ollama

# After loading a large model, ollama ps should report it across both GPUs
ollama ps

Step 3: llama-server — pick a split mode and ratio

# Default layer split across two equal 24 GB cards
./llama-server \
  -m models/llama-3.1-70b-Q4_K_M.gguf \
  -sm layer \
  --tensor-split 1,1 \
  -ngl 99 \
  --port 8080

# Tensor-parallel (row) split — lower latency if you have NVLink
./llama-server -m models/llama-3.1-70b-Q4_K_M.gguf -sm row --tensor-split 1,1 -ngl 99 --port 8080

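# Asymmetric cards (24 GB + 16 GB): proportional split.
# 40 GB of VRAM cannot hold the full ~42 GB model, so offload fewer layers
# (-ngl 70 is an illustrative value; tune it until both cards fit).
./llama-server -m models/llama-3.1-70b-Q4_K_M.gguf -sm layer --tensor-split 1.5,1 -ngl 70 --port 8080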

As of June 2026, -sm/--split-mode takes none, layer (default), row (deprecated tensor split), and tensor (experimental). -ngl/--n-gpu-layers also accepts auto and all. Use -sm row/tensor only with a fast interconnect; on PCIe stick with layer.

Step 4: vLLM — set tensor (or pipeline) parallel size

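# The unquantized 70B checkpoint is ~140 GB in 16-bit weights, so on 2 x 24 GB cards
# substitute a quantized checkpoint; the parallelism flags are unchanged.
python -m vllm.entrypoints.openai.api_server \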
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --port 8000

--tensor-parallel-size must divide the model’s attention-head count (64 heads → 2/4/8 OK). If you hit Total number of attention heads must be divisible by tensor parallel size, or you’re on PCIe with no NVLink and want lower cross-GPU traffic, use pipeline parallel instead:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --pipeline-parallel-size 2 \
  --gpu-memory-utilization 0.90 \
  --port 8000

If vLLM hangs silently at startup on consumer GPUs, prepend NCCL_P2P_DISABLE=1 (see Cause 5).

Step 5: Confirm the split is actually live

# Watch every GPU during token generation
watch -n 1 "nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.free --format=csv,noheader"

How to confirm it’s fixed: during generation, every card should show non-zero VRAM used, and (with the model spread) more than one card should show GPU utilization spikes as tokens flow. If only GPU 0 ever moves, the split didn’t take — recheck visibility (Step 1) and the split flag for your engine.
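The VRAM half of that check can be scripted against the same query. A minimal sketch: gpus_without_vram is a hypothetical helper that reads the CSV rows on stdin, and the optional threshold argument (in MiB, default 0) is an assumption for rigs where an idle card still reports a few MiB in use.

```shell
# Read "index, utilization, memory.used, memory.free" CSV rows on stdin and
# print the index of every GPU whose used memory is at or below the threshold.
# $1 = threshold in MiB (optional, defaults to 0).
gpus_without_vram() {
    awk -F', *' -v limit="${1:-0}" '$3 + 0 <= limit + 0 { print $1 }'
}

# Live usage during generation:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.free \
#     --format=csv,noheader | gpus_without_vram
```

Empty output means every card holds part of the model; any printed index is a card the split never reached.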

Prevention

Set CUDA_VISIBLE_DEVICES=0,1 (all relevant indices) in the Ollama systemd unit, not just your interactive shell.
Re-run nvidia-smi -L after every driver or kernel update to confirm all cards still enumerate.
Use --tensor-split proportional to each card’s free VRAM on mixed-capacity rigs — never blind 1,1.
For vLLM, only use --tensor-parallel-size values that divide the model’s attention-head count; otherwise use --pipeline-parallel-size.
On PCIe-only rigs (no NVLink), only split models that won’t fit on one card, and benchmark multi-GPU against single-GPU before trusting it — for single-stream latency it’s often slower.
Write your split mode and --tensor-split ratio in a comment beside the launch script so the next person doesn’t rediscover it.

FAQ

Q: Does multi-GPU make inference faster or just allow bigger models? A: On consumer hardware without NVLink, mostly the latter. For a model that fits on one card, splitting it adds inter-GPU all-reduce traffic and is usually slower for a single request. The clear win is capacity (running a model that won’t fit on one card) and throughput under high concurrency, not single-stream speed.

Q: What’s the difference between tensor parallelism and pipeline parallelism? A: Tensor parallelism (vLLM --tensor-parallel-size, llama.cpp -sm row/tensor) splits each weight matrix across GPUs and communicates every layer — fast with NVLink, bandwidth-bound on PCIe. Pipeline parallelism (vLLM --pipeline-parallel-size, llama.cpp -sm layer) puts whole layers on each GPU and communicates far less, at the cost of “pipeline bubble” idle time. On PCIe-only rigs, pipeline/layer is usually the safer default.

Q: Ollama lists both GPUs in ollama ps but only GPU 0 shows utilization — why? A: ollama ps shows which GPUs hold model weights, not which are computing this instant. A card holding later layers sits near 0% utilization until tokens reach those layers, so utilization across cards is uneven by design. As long as both show VRAM used and both spike over time, the split is working.

Q: vLLM hangs at startup with no error on my two RTX cards. What now? A: That’s almost always broken GPU peer access over PCIe. Launch with NCCL_P2P_DISABLE=1 (add NCCL_IB_DISABLE=1 if you have no InfiniBand). If it starts, P2P was the issue — the durable fix is disabling IOMMU/ACS in BIOS or updating your driver.

Q: Can I run two different models, one per GPU, instead of splitting one? A: Yes, and it’s often better for models that each fit on one card. Start one Ollama instance with CUDA_VISIBLE_DEVICES=0 on port 11434 and another with CUDA_VISIBLE_DEVICES=1 on port 11435 (set each instance’s port with OLLAMA_HOST, for example OLLAMA_HOST=127.0.0.1:11435). Each uses its own card with no cross-GPU traffic.

Q: Can I tensor-parallel across two different GPU models (e.g. 4090 + 3090)? A: Technically yes, but match --tensor-split to each card’s real VRAM and expect the slower card to gate overall speed (weakest-link effect). Same-model pairs are far less painful in practice.

Tags: #local-llm #ollama #Troubleshooting
