💻 Local LLM Runtime Issues
Ollama, LM Studio, llama.cpp, vLLM, MLX — model loading, GPU, VRAM, quantization, chat templates, tool calling.
Running LLMs locally fails in completely different ways than calling cloud APIs: a download stalls mid-pull, the GPU is not detected, VRAM runs out (OOM), a quantized version loses quality, a chat-template mismatch produces garbage output, a tool-calling model ignores your JSON schema. This hub covers the most common runtimes: Ollama, LM Studio, llama.cpp, vLLM, MLX. Each article solves one symptom: "why did my model get dumber after I switched quants", "why is the GPU never being used", "why doesn't ollama list show the model I just pulled". It skips beginner "how to install Ollama" content and goes straight to failure modes, the shortest fix, and a verification checklist.
Common problems
- Ollama model download stalls at some percentage: Re-run ollama pull to resume; check proxy / DNS.
- Ollama doesn't detect the GPU, falls back to CPU: Check ollama serve logs; align CUDA / ROCm driver versions.
- LM Studio OOMs while loading a model: Use a lower-bit quant and reduce context size; tune n_gpu_layers.
- llama.cpp quality drops after switching to a more aggressive quant: Return to a Q4_K_M / Q5_K_M baseline; compare via perplexity.
- vLLM throws context length exceeded: Tune --max-model-len; check chat-template overhead.
- Ollama startup fails with port already in use: Run lsof -i :11434; kill the old process or change OLLAMA_HOST.
- Local model output truncated mid-token: Raise max_tokens; verify stop sequences aren't triggering prematurely.
- Chat-template mismatch produces garbage output: Match the template to the model README; never reuse a template from a different model family.
- Local embedding server crashes under batched requests: Lower batch size; add backpressure on the client.
- Local RAG index rebuild is unbearably slow: Switch to incremental indexing; parallelize embedding; persist the vector store.
- Ollama pull succeeds but the model isn't listed: Verify the OLLAMA_MODELS path; ensure ollama list talks to the same server process.
- llama.cpp mmap fails on a network drive: Copy weights to a local SSD, or pass --no-mmap.
- Local model ignores the tool-calling format: Switch to a model fine-tuned for function calling, or bring your own parser.
- Local model very slow to first token after cold start: Send a warm-up request; set keep-alive; pin layers in VRAM.
- Multi-GPU not used (model runs only on GPU 0): Set --tensor-split or CUDA_VISIBLE_DEVICES; verify runtime support.
- Modelfile SYSTEM prompt is ignored: Check for a client-side override; run ollama show <model> to confirm.
- vLLM startup fails with CUDA version mismatch: Install the wheel matching the driver, or pin a Docker image.
- Misconfigured RoPE scaling garbles long-context output: Revert to the model's recommended scaling type / factor.
- Tokenizer drift causes token-count mismatch: Ship the tokenizer with the weights; do not mix HF cache versions.
- MLX conversion from Hugging Face fails: Update mlx-lm; check config.json fields; or convert via GGUF.
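For the "port already in use" entry above, a quick check from a script can confirm whether anything is already listening before you start a second server. This is a minimal sketch using only the Python standard library. The helper name port_in_use is made up for illustration, and 11434 is taken from the list entry above as Ollama's port. It only tells you that something accepts connections; lsof is still needed to see which process it is.

```python
import socket


def port_in_use(host: str = "127.0.0.1", port: int = 11434, timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper for illustration; it does not identify the owning process.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise,
        # so a refused connection does not raise.
        return sock.connect_ex((host, port)) == 0
```

Calling port_in_use() with the defaults before launching the server is one way to decide between killing the old process and pointing OLLAMA_HOST at a different port.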
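The OOM entry above recommends lowering the quant and the context size; a back-of-the-envelope estimate shows why both matter. The sketch below assumes a standard transformer with a per-layer key/value cache and ignores activation buffers and runtime overhead, so real usage will be higher. The function name, the 4.5 bits-per-weight figure, and the model shape in the usage note are illustrative assumptions, not measurements of any specific model or runtime.

```python
def estimate_memory_gib(
    n_params_billion: float,
    bits_per_weight: float,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    kv_bytes_per_value: int = 2,  # assumes a 16-bit KV cache
) -> dict:
    """Rough lower bound on memory for weights plus KV cache, in GiB."""
    gib = 2**30
    # Weights: parameter count times average bits per weight.
    weights = n_params_billion * 1e9 * bits_per_weight / 8
    # KV cache: one key and one value vector per layer, per KV head, per token.
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes_per_value
    return {
        "weights_gib": weights / gib,
        "kv_cache_gib": kv_cache / gib,
        "total_gib": (weights + kv_cache) / gib,
    }
```

For a hypothetical 8B-parameter model with 32 layers, 8 KV heads, and head dimension 128, estimate_memory_gib(8, 4.5, 32, 8, 128, 8192) gives a KV cache of exactly 1 GiB at 8192 tokens of context; halving the context halves that term, while changing the quant only affects the weights term.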
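The embedding-server entry above suggests smaller batches plus client-side backpressure. One possible shape for that, sketched with asyncio from the standard library: split the input into fixed-size batches and cap how many requests are in flight with a semaphore. The names embed_all and embed_batch are hypothetical; embed_batch stands in for whatever async call your client makes to the local embedding server.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence

Vector = List[float]


async def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], Awaitable[List[Vector]]],
    batch_size: int = 16,
    max_in_flight: int = 2,
) -> List[Vector]:
    """Embed texts in small batches with at most max_in_flight concurrent calls."""
    semaphore = asyncio.Semaphore(max_in_flight)
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

    async def run(batch: Sequence[str]) -> List[Vector]:
        # Backpressure: wait here until a slot frees up instead of
        # sending every batch to the server at once.
        async with semaphore:
            return await embed_batch(batch)

    # gather returns results in the order the batches were submitted.
    results = await asyncio.gather(*(run(batch) for batch in batches))
    return [vector for batch_vectors in results for vector in batch_vectors]
```

Starting with a small batch_size and max_in_flight and raising them until the server becomes unstable is a simple way to find a safe operating point.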