AI Tools Guidebook

#local-llm

Articles tagged with #local-llm

Troubleshooting

llama.cpp mmap Fails on a Network Drive

llama.cpp crashes or errors when loading a GGUF model from an NFS or SMB network share. Disable mmap or copy the model to local storage to fix it.

May 25, 2026 #local-llm #llama.cpp

Troubleshooting

llama.cpp Quality Drops After Switching to More Aggressive Quant

Responses degrade noticeably after moving from Q5_K_M to Q4_0 or lower in llama.cpp. Identify quality-sensitive layers and choose the right quantization tier.

May 25, 2026 #local-llm #llama.cpp

Troubleshooting

LM Studio OOMs While Loading a Model

LM Studio crashes or shows an out-of-memory error when loading a model. Diagnose VRAM limits, quantization choice, and context size to load successfully.

May 25, 2026 #local-llm #lmstudio

Troubleshooting

Local Embedding Server Crashes Under Batched Requests

Local embedding server (Ollama, llama-server, or sentence-transformers) crashes or OOMs when processing large batches. Fix batch size, sequence length, and memory allocation.

May 25, 2026 #local-llm #ollama

Troubleshooting

Chat-Template Mismatch Produces Garbage Output

Local LLM returns scrambled, repetitive, or role-confused output because the chat template doesn't match the model. Identify and apply the correct template.

May 25, 2026 #local-llm #llama.cpp

Troubleshooting

Multi-GPU Not Used — Model Runs Only on GPU 0

A local LLM uses only one GPU even though multiple are present. Fix tensor-parallel splits, NCCL setup, and Ollama multi-GPU configuration to distribute the workload.

May 25, 2026 #local-llm #ollama

Troubleshooting

Local Model Output Truncated Mid-Token

Local LLM stops generating mid-sentence or mid-word without an EOS token. Diagnose max_tokens limits, stop sequences, and streaming buffer issues.

May 25, 2026 #local-llm #ollama

Troubleshooting

Misconfigured RoPE Scaling Garbles Long-Context Output

Local model output becomes incoherent or repetitive beyond a certain context length due to wrong RoPE scaling settings. Diagnose and fix dynamic NTK or linear scaling config.

May 25, 2026 #local-llm #llama.cpp

Troubleshooting

Local Model Very Slow on First-Token After Cold Start

Local LLM takes 30-120 seconds to produce the first token after loading. Diagnose model loading, KV cache allocation, and GPU warmup to reduce cold-start latency.

May 25, 2026 #local-llm #ollama

Troubleshooting

Tokenizer Drift Causes Token-Count Mismatch

Token counts from your application's tokenizer disagree with the local inference server, causing context overflow or incorrect billing. Align tokenizer versions to fix the drift.

May 25, 2026 #local-llm #llama.cpp

Troubleshooting

Local Model Ignores the Tool-Calling Format

Local LLM outputs tool names in plain text instead of structured JSON, or ignores the tools list entirely. Fix tool-call templates, grammar constraints, and model selection.

May 25, 2026 #local-llm #ollama

Troubleshooting

Local RAG Index Rebuild Is Unbearably Slow

Rebuilding a local vector index from thousands of documents takes hours instead of minutes. Tune batch size, parallelism, and chunking to speed up RAG indexing.

May 25, 2026 #local-llm #ollama

Troubleshooting

MLX Conversion From HuggingFace Fails

mlx_lm.convert fails when converting a HuggingFace model to MLX format on Apple Silicon. Fix architecture support, dtype mismatches, and memory limits during conversion.

May 25, 2026 #local-llm #ollama

Troubleshooting

Ollama Doesn't Detect the GPU, Falls Back to CPU

Ollama ignores your NVIDIA or AMD GPU and runs inference on CPU only. Diagnose driver, CUDA, and ROCm mismatches and force GPU offloading.

May 25, 2026 #local-llm #ollama

Troubleshooting

Ollama Model Download Stalls at Some Percentage

Ollama pull freezes mid-download at a specific percentage. Diagnose network, disk, and registry issues and resume cleanly.

May 25, 2026 #local-llm #ollama

Troubleshooting

Ollama Pull Succeeds but the Model Isn't Listed

Ollama pull completes without error but the model doesn't appear in ollama list. Fix manifest path, OLLAMA_MODELS conflicts, and corrupted registry state.

May 25, 2026 #local-llm #ollama

Troubleshooting

Modelfile SYSTEM Prompt Is Ignored

The SYSTEM directive in an Ollama Modelfile has no effect on the model's behavior. Diagnose template structure, system role injection, and chat API vs. generate API differences.

May 25, 2026 #local-llm #ollama

Troubleshooting

Ollama Startup Fails With "port already in use"

Ollama refuses to start because port 11434 is already bound. Find the conflicting process, free the port, or run Ollama on an alternate port.

May 25, 2026 #local-llm #ollama

Troubleshooting

vLLM Throws "context length exceeded"

vLLM raises a context length exceeded error mid-request. Fix max-model-len, chunked prefill, and KV cache allocation to handle long prompts reliably.

May 25, 2026 #local-llm #vllm

Troubleshooting

vLLM Startup Fails With CUDA Version Mismatch

vLLM fails to start with a CUDA version mismatch or undefined symbol error. Align your CUDA toolkit, driver, and PyTorch versions to fix the incompatibility.

May 25, 2026 #local-llm #vllm

A bilingual content site focused on AI tools and digital productivity.

© 2026 AI Tools Guidebook. All rights reserved.