llama.cpp mmap Fails on a Network Drive
llama.cpp crashes or errors when loading a GGUF model from an NFS or SMB network share. Disable mmap or copy the model to local storage to fix it.
Articles tagged with #local-llm
Responses degrade noticeably after moving from Q5_K_M to Q4_0 or lower in llama.cpp. Identify quality-sensitive layers and choose the right quantization tier.
LM Studio crashes or shows an out-of-memory error when loading a model. Diagnose VRAM limits, quantization choice, and context size to load successfully.
Local embedding server (Ollama, llama-server, or sentence-transformers) crashes or OOMs when processing large batches. Fix batch size, sequence length, and memory allocation.
Local LLM returns scrambled, repetitive, or role-confused output because the chat template doesn't match the model. Identify and apply the correct template.
A local LLM uses only one GPU even though multiple are present. Fix tensor-parallel splits, NCCL setup, and Ollama multi-GPU configuration to distribute the workload.
Local LLM stops generating mid-sentence or mid-word without an EOS token. Diagnose max_tokens limits, stop sequences, and streaming buffer issues.
Local model output becomes incoherent or repetitive beyond a certain context length due to wrong RoPE scaling settings. Diagnose and fix dynamic NTK or linear scaling config.
Local LLM takes 30-120 seconds to produce the first token after loading. Diagnose model loading, KV cache allocation, and GPU warmup to reduce cold-start latency.
Token counts from your application's tokenizer disagree with those reported by the local inference server, causing context overflow or incorrect billing. Align tokenizer versions to fix the drift.
Local LLM outputs tool names in plain text instead of structured JSON, or ignores the tools list entirely. Fix tool-call templates, grammar constraints, and model selection.
Rebuilding a local vector index from thousands of documents takes hours instead of minutes. Tune batch size, parallelism, and chunking to speed up RAG indexing.
mlx_lm.convert fails when converting a HuggingFace model to MLX format on Apple Silicon. Fix architecture support, dtype mismatches, and memory limits during conversion.
Ollama ignores your NVIDIA or AMD GPU and runs inference on CPU only. Diagnose driver, CUDA, and ROCm mismatches and force GPU offloading.
Ollama pull freezes mid-download at a specific percentage. Diagnose network, disk, and registry issues and resume cleanly.
Ollama pull completes without error but the model doesn't appear in ollama list. Fix manifest path, OLLAMA_MODELS conflicts, and corrupted registry state.
The SYSTEM directive in an Ollama Modelfile has no effect on the model's behavior. Diagnose template structure, system role injection, and chat API vs. generate API differences.
Ollama refuses to start because port 11434 is already bound. Find the conflicting process, free the port, or run Ollama on an alternate port.
vLLM raises a context length exceeded error mid-request. Fix max-model-len, chunked prefill, and KV cache allocation to handle long prompts reliably.
vLLM fails to start with a CUDA version mismatch or undefined symbol error. Align your CUDA toolkit, driver, and PyTorch versions to fix the incompatibility.