Local Embedding Server Crashes Under Batched Requests

Ollama, llama-server, vLLM, or sentence-transformers crashes or OOMs on batched embeddings. Fix batch size, num_batch, sequence length, and concurrency — with the exact flags.

Published: May 25, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

Single embeddings work. Then your RAG indexer pushes 64 or 128 chunks at once and the local embedding server — Ollama serving nomic-embed-text, llama-server with bge-large-en-v1.5, vLLM in embed mode, or a sentence-transformers FastAPI service — OOMs, hangs, or starts returning 500s after a few batches while silently dropping the rest.

Fastest fix: cut the batch and the per-item context first. For Ollama, lower the runner batch with a Modelfile PARAMETER num_batch 64 (default is 512) and pre-truncate inputs to ~512 tokens; for llama-server set --ubatch-size 64 --ctx-size 512; for sentence-transformers set model.max_seq_length = 512 and encode(..., batch_size=16). Then cap client concurrency to 2-4. Embedding models run the entire batch through every encoder layer in one forward pass, so peak memory scales with batch_size × longest_sequence², not with the average — one outlier document can sink an otherwise-fine batch.

If you are on Ollama and the crash log says `caching disabled but unable to fit entire input in a batch`, jump straight to Cause 3; that is a known v0.13.x regression.

Which bucket are you in?

Symptom	Most likely cause	Go to
VRAM spikes to 100% then the process dies	Batch too large for VRAM	Cause 1
Crashes only when one chunk is a huge un-split document	Outlier sequence length / padding blowup	Cause 2
Ollama log: `caching disabled but unable to fit entire input in a batch`	`num_batch` too high / v0.13.x regression	Cause 3
Works one at a time, dies under parallel workers	Concurrent forward passes	Cause 4
`model.max_seq_length` prints 4096/8192 on a small GPU	sentence-transformers long-context default	Cause 5
Slow, queue grows, then OOM; `ollama ps` shows CPU	Embeddings not on GPU	Cause 6
llama-server uses huge memory per request	Embedding mode/pooling not enabled	Cause 7

Common causes

Ordered by hit rate, highest first.

1. Batch size too large for available VRAM

Embedding models run every item in the batch simultaneously through the encoder. For bge-large-en-v1.5 (335M params, fp32), a batch of 128 items at 512 tokens each needs roughly 128 × 512 × 1024 × 4 bytes ≈ 268 MB just for the input representations, before attention matrices and intermediate activations. On an 8 GB GPU, batches of 256+ OOM almost every time.
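The arithmetic above can be checked with a small estimator. This is a rough sketch, not a profiler: both function names are hypothetical, and the attention term assumes the standard BERT-large layout (16 heads) with fp32 score matrices materialized for one layer at a time, which fused attention kernels may avoid.

```python
def input_repr_bytes(batch: int, seq_len: int, hidden: int, bytes_per_val: int = 4) -> int:
    """Bytes for one [batch, seq_len, hidden] activation tensor."""
    return batch * seq_len * hidden * bytes_per_val


def attention_score_bytes(batch: int, seq_len: int, heads: int, bytes_per_val: int = 4) -> int:
    """Bytes for one layer's [batch, heads, seq_len, seq_len] score matrix."""
    return batch * heads * seq_len * seq_len * bytes_per_val


if __name__ == "__main__":
    # Same numbers as the text: 128 items, 512 tokens, hidden size 1024, fp32.
    print(input_repr_bytes(128, 512, 1024) / 1e6, "MB for input representations")
    # Assumed: 16 attention heads, scores kept in fp32 for a single layer.
    print(attention_score_bytes(128, 512, 16) / 1e9, "GB for one layer's attention scores")
```

Under these assumptions the score matrix for a single layer is several times larger than the input representations, which is why halving the batch or the sequence length has such a visible effect.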

How to spot it: run nvidia-smi dmon -s m -d 1 while sending batches. If VRAM climbs to the ceiling and the process then dies, the batch is the cause.

2. One outlier long sequence makes the whole batch huge

A batch is padded to its longest member. Mix a 10-token chunk with a 2,000-token chunk and the whole batch is sized for 2,000 tokens. Because attention memory grows with the square of sequence length, a single un-split PDF page in a batch of 64 can OOM even though the other 63 chunks are tiny.

How to spot it: log max(len(t) for t in batch) (token count, not characters) before each call. If the max is far above the average, padding is amplifying memory.
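One way to make that check concrete is to log how much of each padded batch is wasted. The helper below is a minimal sketch; `padding_stats` is a hypothetical name, and it assumes you already have per-item token counts from whatever tokenizer your model uses.

```python
def padding_stats(token_counts: list[int]) -> dict:
    """Summarize how much of a padded batch is padding rather than real tokens."""
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    longest = max(token_counts)
    real_tokens = sum(token_counts)
    # Every item is padded up to the longest member of the batch.
    padded_tokens = longest * len(token_counts)
    return {
        "max": longest,
        "mean": real_tokens / len(token_counts),
        "padded_tokens": padded_tokens,
        "waste": 1 - real_tokens / padded_tokens,
    }


if __name__ == "__main__":
    # The 10-token and 2,000-token chunks from the example above.
    print(padding_stats([10, 2000]))
```

A `waste` value close to 0.5 or higher on a small batch usually means one outlier is setting the size for everything else.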

3. Ollama: `num_batch` too high (and the v0.13.x regression)

Ollama’s runner batch defaults to num_batch = 512 (inherited from llama.cpp). On large-context embedding inputs this is the classic OOM trigger, fixed by lowering it. As of June 2026 there is also a specific regression: Ollama v0.13.0–v0.13.2 crashes on embeddings with the panic `caching disabled but unable to fit entire input in a batch`, while the same workload runs fine on v0.12.11. Note that the two embedding endpoints have different shapes: the modern /api/embed takes an input string or array, returns fp32, and L2-normalizes; the legacy /api/embeddings takes a single prompt string only.

How to spot it: check ollama --version. If you are on 0.13.0–0.13.2 and see that panic, downgrade to 0.12.11 or lower num_batch. Set it via a Modelfile (PARAMETER num_batch 64) — keep it >= 32, or llama.cpp won’t use the cuBLAS prompt-eval kernels. You can also bound embedding context in the same Modelfile with PARAMETER num_ctx 2048 so over-long inputs don’t blow up the runner.

4. Concurrent embedding requests from multiple RAG workers

If your indexer spawns parallel workers that each POST a batch to the same server, the server can start several forward passes before earlier ones free their memory, so the effective concurrent batch is workers × batch_size. Eight workers sending 32-item batches behaves like a 256-item batch.

How to spot it: count your parallel workers and multiply by the per-call batch size. If that product is much larger than what a single batch tolerates, concurrency is the cause.
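The multiplication can be turned into a quick sizing check. This is a sketch with hypothetical helper names; `safe_batch` is an assumption you supply, namely the largest single batch you have seen the server survive.

```python
def effective_batch(workers: int, batch_size: int) -> int:
    """Worst-case number of items in flight if every worker posts at once."""
    return workers * batch_size


def max_workers_for(safe_batch: int, batch_size: int) -> int:
    """Largest worker count whose combined load stays within safe_batch (at least 1)."""
    return max(1, safe_batch // batch_size)


if __name__ == "__main__":
    # Eight workers sending 32-item batches, as in the example above.
    print(effective_batch(8, 32))
    # If a single 64-item batch is the most the server tolerates:
    print(max_workers_for(64, 32))
```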

5. sentence-transformers long-context `max_seq_length` default

SentenceTransformer.encode() defaults to batch_size=32, but the per-item ceiling comes from model.max_seq_length, which varies by model — classic BERT-based models cap at 512, while several modern embedding models default to 4096 or 8192. On a small GPU that long-context default OOMs at even moderate batch sizes.

How to spot it: print(model.max_seq_length). If it is 4096/8192 and your GPU is under 16 GB, drop it to 512 unless you truly need long contexts.

6. Ollama embedding model not GPU-accelerated

On some setups Ollama runs the embedding model on CPU when the generation GPU is busy. CPU embedding is 20-100x slower, so under batched load the request queue grows until the in-memory queue itself OOMs.

How to spot it: run ollama ps while under load and read the Processor column. If it shows 100% CPU (or any CPU share), the embedding model is not fully on the GPU.

7. llama-server not actually in embedding mode

llama-server needs --embeddings to expose the OpenAI-compatible /v1/embeddings endpoint, and the model must use a pooling mode other than none. Without the right pooling it either errors or falls back to per-request generation buffers, inflating memory and latency.

How to spot it: check your startup command for --embeddings and a --pooling value (mean or cls; rank is for rerankers). If --pooling is missing or none, fix it.

Shortest path to fix

Step 1: Reduce batch size and add backoff (Ollama `/api/embed`)

import time
import requests

def embed_with_retry(texts: list[str], batch_size: int = 16) -> list:
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        for attempt in range(3):
            try:
                resp = requests.post(
                    "http://localhost:11434/api/embed",
                    json={"model": "nomic-embed-text", "input": batch},
                    timeout=60,
                )
                resp.raise_for_status()
                embeddings.extend(resp.json()["embeddings"])
                break
            except Exception:
                if attempt == 2:
                    raise
                time.sleep(2 ** attempt)
    return embeddings
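Backoff retries the same batch size, which does not help if the batch itself is too large. A variation, sketched below, halves the batch after each failure instead. `embed_halving` and `embed_fn` are hypothetical names: `embed_fn` stands for any callable that takes a list of texts and returns one vector per text, such as a thin wrapper around the POST above.

```python
from typing import Callable, Sequence


def embed_halving(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], list],
    batch_size: int = 16,
    min_batch: int = 1,
) -> list:
    """Embed texts in order, halving the batch size whenever embed_fn raises."""
    embeddings = []
    size = batch_size
    i = 0
    while i < len(texts):
        batch = texts[i:i + size]
        try:
            embeddings.extend(embed_fn(batch))
        except Exception:
            if size <= min_batch:
                raise  # Already at the floor; the failure is not about batch size
            size = max(min_batch, size // 2)
            continue  # Retry the same position with the smaller batch
        i += len(batch)
    return embeddings
```

The reduced size is kept for the rest of the run, on the assumption that a batch that failed once will fail again.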

Step 2: Lower Ollama’s `num_batch` (and cap embedding context)

If the runner itself OOMs, drop the batch the runner uses, not just the request batch. Create a small Modelfile:

FROM nomic-embed-text
PARAMETER num_batch 64
PARAMETER num_ctx 2048

Build and use it: ollama create nomic-embed-batched -f Modelfile. Keep num_batch >= 32. The PARAMETER num_ctx 2048 line bounds the per-request context. Leave truncate at its default (true) so over-long inputs are clipped instead of crashing the runner.

Step 3: Start llama-server with proper embedding flags

./llama-server \
  -m models/bge-large-en-v1.5-Q8_0.gguf \
  --embeddings \
  --pooling mean \
  --ctx-size 512 \
  --batch-size 512 \
  --ubatch-size 64 \
  --n-gpu-layers 99 \
  --port 8081

--ubatch-size is the physical batch actually computed at once — keep it at 32-64 to bound peak attention memory while --batch-size stays larger for scheduling throughput.
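Once the server is up, a client call might look like the sketch below. It uses only the standard library and assumes the response follows the OpenAI embeddings shape (a `data` list whose items carry `index` and `embedding`); the function names are hypothetical and the port matches the command above.

```python
import json
import urllib.request


def parse_embeddings(payload: dict) -> list:
    """Return vectors ordered by each item's `index` field."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed_llama_server(texts: list, base_url: str = "http://localhost:8081", timeout: int = 60) -> list:
    """POST a batch of texts to /v1/embeddings and return one vector per text."""
    body = json.dumps({"input": texts}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return parse_embeddings(json.load(resp))
```

Sorting by `index` is defensive: it keeps the output aligned with the input order even if the server returns items in a different order.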

Step 4: Enforce max sequence length in sentence-transformers

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
model.max_seq_length = 512  # Override the model default (may be 4096/8192)

def embed_documents(texts: list[str]) -> list:
    return model.encode(
        texts,
        batch_size=16,
        show_progress_bar=True,
        convert_to_numpy=True,
        normalize_embeddings=True,
    ).tolist()

Step 5: Sort batches by length to minimize padding waste

def embed_sorted(texts: list[str], model, batch_size: int = 32) -> list:
    # Group similar lengths so short chunks aren't padded up to a long one.
    # len() here counts characters, a cheap proxy for token count.
    indexed = sorted(enumerate(texts), key=lambda x: len(x[1]), reverse=True)
    sorted_texts = [t for _, t in indexed]
    original_indices = [i for i, _ in indexed]

    embeddings_sorted = model.encode(sorted_texts, batch_size=batch_size)

    result = [None] * len(texts)  # Restore original order
    for orig_idx, emb in zip(original_indices, embeddings_sorted):
        result[orig_idx] = emb
    return result

Step 6: Limit concurrent embedding workers

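import asyncio

import aiohttp  # Third-party: pip install aiohttp

sem = asyncio.Semaphore(2)  # At most 2 concurrent embedding requests

async def embed_chunk(session: aiohttp.ClientSession, chunk: str) -> list:
    # The semaphore is held for the whole request, so a third caller waits here
    async with sem:
        async with session.post(
            "http://localhost:11434/api/embed",
            json={"model": "nomic-embed-text", "input": [chunk]},
        ) as resp:
            resp.raise_for_status()  # Surface 500s instead of a KeyError below
            data = await resp.json()
            return data["embeddings"][0]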
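To see the limit in action without a running server, the same pattern can be exercised with a stand-in coroutine. This is a sketch only: `run_limited` and `fake_embed` are hypothetical names, and the sleep simulates request latency.

```python
import asyncio


async def run_limited(items, worker, limit: int = 2) -> list:
    """Run worker(item) for every item with at most `limit` calls in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(item):
        async with sem:
            return await worker(item)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(item) for item in items))


if __name__ == "__main__":
    in_flight = {"now": 0, "peak": 0}

    async def fake_embed(chunk: str) -> int:
        in_flight["now"] += 1
        in_flight["peak"] = max(in_flight["peak"], in_flight["now"])
        await asyncio.sleep(0.01)  # Stand-in for the HTTP round trip
        in_flight["now"] -= 1
        return len(chunk)

    print(asyncio.run(run_limited(["a", "bb", "ccc", "dddd"], fake_embed, limit=2)))
    print("peak concurrency:", in_flight["peak"])
```

Creating the semaphore inside the function, rather than at module level, keeps it bound to the event loop that actually runs the requests.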

Step 7 (vLLM): cap concurrent sequences

If you serve embeddings with vLLM, the default --max-num-seqs is tuned for throughput, not for an 8 GB card. Pin it down and bound the model length:

vllm serve BAAI/bge-large-en-v1.5 \
  --task embed \
  --max-num-seqs 32 \
  --max-model-len 512 \
  --gpu-memory-utilization 0.85 \
  --port 8001

How to confirm it’s fixed

Re-run the batch that used to crash. The process must finish without dropping to a runner restart.
Watch nvidia-smi dmon -s m -d 1 (or ollama ps) during a full index pass — peak VRAM should plateau well under the card’s limit, not pin at 100%.
Check the count: len(embeddings) == len(texts). Silent drops, not crashes, are the failure mode of an over-loaded server.
Spot-check one vector’s dimension (len(embeddings[0])) matches the model (nomic-embed-text is 768, bge-large-en-v1.5 is 1024) — proof the items weren’t truncated to empty.
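The count and dimension checks above can be bundled into one post-run assertion. The helper below is a minimal sketch with a hypothetical name; `expected_dim` is whatever your model produces (768 or 1024 for the two models named above).

```python
def verify_embeddings(texts: list, embeddings: list, expected_dim: int) -> list:
    """Return a list of human-readable problems; an empty list means the run looks sane."""
    problems = []
    if len(embeddings) != len(texts):
        problems.append(
            f"count mismatch: {len(texts)} texts but {len(embeddings)} embeddings"
        )
    for i, vec in enumerate(embeddings):
        if len(vec) != expected_dim:
            problems.append(f"item {i}: dimension {len(vec)}, expected {expected_dim}")
    return problems


if __name__ == "__main__":
    # One text was silently dropped in this made-up result.
    print(verify_embeddings(["a", "b"], [[0.0] * 768], expected_dim=768))
```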

Prevention

Start with batch_size 8-32 for large models (335M+ params) and 32-64 for small ones (~110M), then tune upward while watching VRAM.
Enforce max_seq_length = 512 unless you genuinely need longer contexts — most RAG chunks should be 128-512 tokens anyway.
Sort batches by length before embedding to cut padding overhead (the same idea as dynamic padding in training).
Run one embedding server with a request queue instead of several parallel servers fighting for the same VRAM.
Monitor VRAM with nvidia-smi dmon -s m -d 1 before a full-scale indexing run.
Keep the embedding model on a dedicated GPU or VRAM allocation, separate from any generation model.
Pin your Ollama version in production — embedding behavior changed between 0.12.x and 0.13.x, so test before upgrading.
Add circuit-breaker logic that pauses and retries on a 429/500 instead of hammering the server.
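The circuit-breaker item can be sketched in a few lines. This is one possible shape, not a reference implementation: the class name, thresholds, and cooldown are all assumptions, and the clock is injectable so the behavior can be tested without waiting.

```python
import time


class CircuitBreaker:
    """Pause requests after repeated 429/5xx responses, then allow a trial call."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (requests allowed)

    def allow(self) -> bool:
        """Return True if a request may be sent right now."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: close the circuit and let the next call through.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, status_code: int) -> None:
        """Feed in the HTTP status of the last response."""
        if status_code == 429 or status_code >= 500:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
        else:
            self.failures = 0
```

A caller would check `allow()` before each POST, sleep briefly when it returns False, and pass every response status to `record()`.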

FAQ

Q: What’s the difference between --batch-size and --ubatch-size in llama-server? A: --batch-size is the logical batch the scheduler accepts; --ubatch-size is the physical micro-batch actually computed in one pass. For embeddings, lower --ubatch-size to 32-64 to cap peak attention memory while keeping --batch-size larger for throughput.

Q: My Ollama embeddings crash with caching disabled but unable to fit entire input in a batch. What now? A: That panic appears on Ollama v0.13.0–v0.13.2 (the same workload works on v0.12.11). As of June 2026, either downgrade to v0.12.11 and confirm the build with ollama --version, or lower the runner batch via a Modelfile PARAMETER num_batch 64 and bound context with PARAMETER num_ctx 2048 in the same Modelfile.

Q: Should I use /api/embed or /api/embeddings? A: Use /api/embed. It accepts an input array (real batching), returns fp32, and L2-normalizes. The older /api/embeddings only takes one prompt string and skips normalization, so cosine similarity can be off if you mix the two.
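If an existing index already holds vectors from the legacy endpoint, normalizing them on the client brings them in line with /api/embed output. The function below is a minimal pure-Python sketch with a hypothetical name; in practice you would likely do this with NumPy over the whole matrix.

```python
import math


def l2_normalize(vec: list) -> list:
    """Scale a vector to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


if __name__ == "__main__":
    print(l2_normalize([3.0, 4.0]))
```

After normalization the dot product of two vectors equals their cosine similarity, so both sets of vectors can share one similarity function.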

Q: Why does the server crash only on the 50th batch, not the first? A: Most likely memory fragmentation. The first batches allocate and free memory, but the allocator hands back fragmented blocks. When a later batch needs one large contiguous region, the allocation fails. On Linux, setting MALLOC_ARENA_MAX=2 before launching can reduce fragmentation of host (CPU) memory; it does not change how GPU memory is allocated.

Q: I need a CPU-only embedding server — what batch size is safe? A: On CPU with sentence-transformers, batch_size 1-4 is safe for long sequences and ~16 for short (128-token) chunks. Watch RAM with htop; if the process nears your system RAM limit, halve the batch.

Q: Can I run embeddings and generation on one llama-server instance? A: Not recommended. The instance allocates a fixed KV cache at startup tuned for either generation (large autoregressive cache) or embedding (no autoregressive cache). Run separate instances on different ports.

Tags: #local-llm #ollama #Troubleshooting
