#local-llm - Tag | AI Tools Guidebook

Troubleshooting

llama.cpp mmap Fails on a Network Drive

llama.cpp crashes or stalls when loading a GGUF model from an NFS or SMB share. Fastest fix: add --no-mmap (and --no-direct-io if DirectIO is on), or copy the model to local disk.

May 25, 2026 #local-llm #llama.cpp

Troubleshooting

llama.cpp Quality Drops After Switching to a More Aggressive Quant

Responses degrade after moving from Q5_K_M or Q8_0 to Q4_0, IQ4_XS, or lower in llama.cpp. Pick the right quant tier, fix bad re-quants, and confirm with perplexity.

May 25, 2026 #local-llm #llama.cpp

Troubleshooting

LM Studio Out of Memory When Loading a Model

LM Studio crashes or shows "out of memory" when loading a GGUF. Fix it fast by cutting context length, enabling Flash Attention, and tuning GPU offload, with a VRAM sizing table.

May 25, 2026 #local-llm #lmstudio

Troubleshooting

Local Embedding Server Crashes Under Batched Requests

Ollama, llama-server, vLLM, or sentence-transformers crashes or OOMs on batched embeddings. Fix batch size, num_batch, sequence length, and concurrency — with the exact flags.

May 25, 2026 #local-llm #ollama

Troubleshooting

Chat-Template Mismatch Produces Garbage Local LLM Output

A local LLM echoes your prompt, prints literal [INST] or <|im_start|> tags, or loops the same sentence. That is a chat-template mismatch. Find the model's real template and force the engine to use it.

May 25, 2026 #local-llm #llama.cpp

Troubleshooting

Multi-GPU Not Used — Local LLM Runs Only on GPU 0

Your local LLM uses one GPU while the others sit at 0%. Fix it with llama.cpp --split-mode, vLLM --tensor-parallel-size, Ollama auto-spread, and the NCCL flags PCIe rigs need.

May 25, 2026 #local-llm #ollama

Troubleshooting

Local LLM Output Truncated Mid-Token (Ollama / llama.cpp)

Your local model stops mid-word with no EOS token. Diagnose num_predict limits, the VRAM-based num_ctx default, stop sequences, proxy buffering, and UTF-8 byte splits.

May 25, 2026 #local-llm #ollama

Troubleshooting

Misconfigured RoPE Scaling Garbles Long-Context Output

A local LLM stays coherent up to its native context length, then degenerates into repetition or gibberish. Diagnose and fix RoPE scaling (YaRN, llama3, rope_theta) in llama.cpp and vLLM.

May 25, 2026 #local-llm #llama.cpp

Troubleshooting

Local LLM Very Slow on First Token After Cold Start

A local LLM takes 30-120 s to produce the first token after loading, then runs fast. Diagnose disk I/O, model eviction, CUDA/Metal shader JIT, and KV cache allocation, and keep the model pinned warm.

May 25, 2026 #local-llm #ollama

Troubleshooting

Tokenizer Drift: Local LLM Token Counts Don't Match

Your app's token count disagrees with the local llama.cpp or Ollama server, causing context overflow or silent truncation. Use the server's own tokenizer as ground truth to fix the drift.

May 25, 2026 #local-llm #llama.cpp

Troubleshooting

Local Model Ignores the Tool-Calling Format

Local LLM writes tool names in prose instead of structured JSON, or ignores the tools list. Fix it with the right tool-capable model, --jinja in llama-server, and Ollama's format JSON-schema constraint.

May 25, 2026 #local-llm #ollama

Troubleshooting

Local RAG Index Rebuild Is Unbearably Slow

Rebuilding a local vector index from thousands of documents takes hours instead of minutes. Fix batch size, skip unchanged docs, batch-write the vectorstore, and right-size chunks.

May 25, 2026 #local-llm #ollama

Troubleshooting

mlx_lm.convert Fails Converting a HuggingFace Model

mlx_lm.convert errors when converting a HuggingFace model to MLX on Apple Silicon: Model type not supported, GatedRepoError 401, or OOM. Fixes verified June 2026.

May 25, 2026 #local-llm #mlx

Troubleshooting

Ollama Doesn't Detect the GPU, Falls Back to CPU

Ollama ignores your NVIDIA or AMD GPU and runs on CPU only. Read the inference-compute log line, fix driver, CUDA, and ROCm mismatches, and force GPU offloading.

May 25, 2026 #local-llm #ollama

Troubleshooting

Ollama Pull Stalls or Resets Mid-Download — Fixes

ollama pull freezes at a percentage, the progress bar runs backwards, or you see "max retries exceeded: EOF". Diagnose network, disk, and partial-blob causes and resume cleanly.

May 25, 2026 #local-llm #ollama

Troubleshooting

Ollama Pull Succeeds but the Model Isn't in ollama list

ollama pull finishes with no error but the model is missing from ollama list. Fix the OLLAMA_MODELS path split, the ollama service-user mismatch, and corrupted manifests.

May 25, 2026 #local-llm #ollama

Troubleshooting

Ollama Modelfile SYSTEM Prompt Is Ignored

Your Ollama Modelfile SYSTEM directive has no effect on model behavior. Fix it fast: verify the template injects .System, check for RENDERER/PARSER inheritance, and stop your client from overriding the system message.

May 25, 2026 #local-llm #ollama

Troubleshooting

Fix Ollama Port Already in Use (11434)

Ollama won't start because port 11434 is already bound. Find the process holding it, free the port, or move Ollama to another port — exact commands for macOS, Linux, and Windows.

May 25, 2026 #local-llm #ollama

Troubleshooting

Fix vLLM Context Length Exceeded Errors

vLLM rejects a request with "This model's maximum context length is X tokens". Set --max-model-len realistically, raise GPU memory utilization, use an fp8 KV cache, and budget output tokens.

May 25, 2026 #local-llm #vllm

Troubleshooting

Fix vLLM CUDA Version Mismatch and undefined symbol Errors

vLLM crashes on startup with "undefined symbol", "no kernel image", or a CUDA mismatch. Install into a clean env with uv pip install --torch-backend=auto and align driver, CUDA, and PyTorch.

May 25, 2026 #local-llm #vllm