Local LLM Runtime Issues | Troubleshooting

Running LLMs locally fails in different ways than calling cloud APIs: a download stalls mid-pull, the GPU is not detected, VRAM runs out, a quantized build loses quality, a chat-template mismatch produces garbage output, or a tool-calling model ignores your JSON schema. This hub covers the most common runtimes: Ollama, LM Studio, llama.cpp, vLLM, and MLX. Each article solves one symptom, such as "why did my model get dumber after I switched quants", "why is the GPU never being used", or "why doesn't ollama list show the model I just pulled". It skips beginner "how to install Ollama" content and goes straight to failure modes, the shortest fix, and a verification checklist.

Common problems

Ollama model download stalls at some percentage
Re-run ollama pull to resume; check proxy / DNS.
Ollama doesn't detect the GPU, falls back to CPU
Check ollama serve logs; align CUDA / ROCm driver versions.
LM Studio OOMs while loading a model
Lower quant + reduce context size; tune n_gpu_layers.
llama.cpp quality drops after switching to a more aggressive quant
Return to Q4_K_M / Q5_K_M baseline; compare via perplexity.
vLLM throws context length exceeded
Tune --max-model-len; check chat-template overhead.
Ollama startup fails with port already in use
Run lsof -i :11434 to find the listener; kill the old process or change OLLAMA_HOST.
Local model output truncated mid-token
Raise max_tokens; verify stop sequences aren't triggering prematurely.
Chat-template mismatch produces garbage output
Match template to model README; never reuse the wrong template.
Local embedding server crashes under batched requests
Lower batch size; add backpressure on the client.
Local RAG index rebuild is unbearably slow
Switch to incremental indexing; parallel embedding; persist vector store.
Ollama pull succeeds but the model isn't listed
Verify OLLAMA_MODELS path; ensure list hits the same process.
llama.cpp mmap fails on a network drive
Copy weights to local SSD; or pass --no-mmap.
Local model ignores the tool-calling format
Switch to a model fine-tuned for function calling; or bring your own parser.
Local model is very slow to first token after a cold start
Warm-up request; keep-alive; pin layers in VRAM.
Multi-GPU not used — model runs only on GPU 0
--tensor-split or CUDA_VISIBLE_DEVICES; verify runtime support.
Modelfile SYSTEM prompt is ignored
Check client override; ollama show <model> to confirm.
vLLM startup fails with CUDA version mismatch
Install matching wheel for the driver; or pin a docker image.
Misconfigured RoPE scaling garbles long-context output
Revert to the model's recommended scaling type / factor.
Tokenizer drift causes token-count mismatch
Ship tokenizer with the weights; do not mix HF cache versions.
MLX conversion from HuggingFace fails
Update mlx-lm; check config.json fields; or convert via GGUF.
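
For the "port already in use" entry above, it can help to confirm that something is actually listening before killing any process. The following is a minimal sketch using only the Python standard library, not a tool shipped by Ollama. The helper name port_in_use and the loopback host 127.0.0.1 are assumptions made for illustration; 11434 is the port named in the entry.

```python
import socket

# Port taken from the "port already in use" entry above.
OLLAMA_PORT = 11434


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper: the name and the default host are assumptions.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_in_use(OLLAMA_PORT):
        print(f"Port {OLLAMA_PORT} is taken: stop the old process or change OLLAMA_HOST.")
    else:
        print(f"Port {OLLAMA_PORT} is free.")
```

The check only shows that a listener exists, not which process owns it; identifying the owner still needs a tool such as lsof.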
