Local Embedding Server Crashes Under Batched Requests
Local embedding server (Ollama, llama-server, or sentence-transformers) crashes or OOMs when processing large batches. Fix batch size, sequence length, and memory allocation.
Articles tagged with #ollama
A local LLM uses only one GPU even though multiple are present. Fix tensor-parallel splits, NCCL setup, and Ollama multi-GPU configuration to distribute the workload.
Local LLM stops generating mid-sentence or mid-word without an EOS token. Diagnose max_tokens limits, stop sequences, and streaming buffer issues.
Local LLM takes 30-120 seconds to produce the first token after loading. Diagnose model loading, KV cache allocation, and GPU warmup to reduce cold-start latency.
Local LLM outputs tool names in plain text instead of structured JSON, or ignores the tools list entirely. Fix tool-call templates, grammar constraints, and model selection.
Rebuilding a local vector index from thousands of documents takes hours instead of minutes. Tune batch size, parallelism, and chunking to speed up RAG indexing.
mlx_lm.convert fails when converting a HuggingFace model to MLX format on Apple Silicon. Fix architecture support, dtype mismatches, and memory limits during conversion.
Ollama ignores your NVIDIA or AMD GPU and runs inference on CPU only. Diagnose driver, CUDA, and ROCm mismatches and force GPU offloading.
Ollama pull freezes mid-download at a specific percentage. Diagnose network, disk, and registry issues and resume cleanly.
Ollama pull completes without error but the model doesn't appear in ollama list. Fix manifest path, OLLAMA_MODELS conflicts, and corrupted registry state.
The SYSTEM directive in an Ollama Modelfile has no effect on the model's behavior. Diagnose template structure, system role injection, and chat API vs. generate API differences.
Ollama refuses to start because port 11434 is already bound. Find the conflicting process, free the port, or run Ollama on an alternate port.