You’re building a RAG pipeline and start pushing document chunks to your local embedding server: Ollama serving nomic-embed-text, llama-server with bge-large-en-v1.5, or a sentence-transformers FastAPI service. Individual embeddings work fine. But when your indexer sends batches of 64 or 128 chunks at once, the server crashes with an OOM error, hangs indefinitely, or starts returning 500 errors for some batches while silently dropping others. The root cause is almost always that an embedding model runs the whole batch through every encoder layer in a single forward pass, so activation memory grows with batch size times the padded sequence length, and attention memory grows faster still as sequences get longer.
Common causes
Ordered by hit rate, highest first.
1. Batch size too large for available VRAM
Embedding models run every item in a batch through the encoder at once. For a model like bge-large-en-v1.5 (335M parameters, hidden size 1024, fp32), a batch of 128 items at 512 tokens each needs roughly 128 × 512 × 1024 × 4 bytes ≈ 268 MB for a single layer's hidden states alone, plus attention matrices and other intermediate activations. On an 8 GB GPU, batches of 256 or more are very likely to OOM.
How to spot it: Run nvidia-smi dmon -s m -d 1 while sending batches. If VRAM spikes to maximum and then the process dies, the batch size is the cause.
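The arithmetic above can be turned into a rough back-of-the-envelope estimator. This is a sketch under stated assumptions, not a measurement: it assumes fp32 activations, 16 attention heads (an assumption about the model, not stated above), and attention scores fully materialized as batch × heads × seq × seq. Fused attention kernels use far less, so treat the second number as an upper-bound illustration. The function names are hypothetical.

```python
def hidden_state_bytes(batch_size: int, seq_len: int, hidden_size: int,
                       bytes_per_value: int = 4) -> int:
    """Memory for one layer's hidden states: batch x seq x hidden."""
    return batch_size * seq_len * hidden_size * bytes_per_value


def attention_score_bytes(batch_size: int, seq_len: int, num_heads: int,
                          bytes_per_value: int = 4) -> int:
    """Memory for one layer's attention scores if fully materialized:
    batch x heads x seq x seq. Note the quadratic growth in seq_len."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value


if __name__ == "__main__":
    # Assumed shape: hidden size 1024 (from the text), 16 heads (assumption).
    for batch in (16, 64, 128, 256):
        hidden = hidden_state_bytes(batch, 512, 1024)
        scores = attention_score_bytes(batch, 512, 16)
        print(f"batch={batch:4d}  hidden={hidden / 1e6:8.0f} MB  "
              f"attn scores/layer={scores / 1e6:8.0f} MB")
```

For a batch of 128 at 512 tokens, the hidden states come to 268,435,456 bytes (the ≈ 268 MB figure above), while the materialized attention scores for one layer come to about 2.1 GB under these assumptions. That gap is why sequence length often matters more than the raw batch count.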
2. Variable-length sequences in a batch without padding limits
When batch items have very different lengths (e.g., one item is 10 tokens and another is 512), the whole batch is padded to the longest sequence. A single long outlier can make the batch 10x larger in memory than expected. Without an enforced maximum sequence length, one 2000-token document in a batch of 64 can trigger an OOM even if the other 63 documents are short, because attention memory grows quadratically with the padded length.
How to spot it: Log the longest item in each batch before the embedding call. Note that max(len(chunk) for chunk in batch) counts characters, not tokens, so treat it as a rough proxy or count with the model's tokenizer. If the max is far above your average, padding is amplifying memory usage.
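A minimal sketch of that logging step follows. The helper names are hypothetical, and the default whitespace counter is only a crude stand-in for a real tokenizer; pass your model's own token counter as count_tokens for accurate numbers.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger("embed_batches")


def whitespace_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def batch_length_stats(
    batch: Sequence[str],
    count_tokens: Callable[[str], int] = whitespace_token_count,
) -> dict:
    """Return max and mean token counts plus the padding amplification factor."""
    lengths = [count_tokens(text) for text in batch]
    if not lengths:
        return {"max": 0, "mean": 0.0, "padding_factor": 1.0}
    longest = max(lengths)
    mean = sum(lengths) / len(lengths)
    # Padded cost is len(batch) * longest; useful cost is sum(lengths).
    factor = longest / mean if mean > 0 else 1.0
    return {"max": longest, "mean": mean, "padding_factor": factor}


def log_batch(batch: Sequence[str], warn_factor: float = 3.0) -> dict:
    """Log the stats and warn when one outlier dominates the batch."""
    stats = batch_length_stats(batch)
    logger.info("batch size=%d max=%d mean=%.1f",
                len(batch), stats["max"], stats["mean"])
    if stats["padding_factor"] >= warn_factor:
        logger.warning("padding inflates this batch %.1fx",
                       stats["padding_factor"])
    return stats
```

A padding factor near 1 means the batch is well matched; a factor of 3 or more means most of the padded batch is wasted on a single long item.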
3. llama-server using generation mode instead of embedding mode
llama-server requires the --embedding flag to enable its embedding endpoints. Without it, embedding support is not properly enabled: depending on the build, requests are rejected with an error or handled with generation-oriented settings that sharply increase memory and latency per item.
How to spot it: Check your llama-server startup command for --embedding. Also check the --pooling flag: it has to match the pooling the model was trained with (for example --pooling cls for BGE models, --pooling mean for mean-pooled models), otherwise the extracted embeddings will be wrong.
4. sentence-transformers worker processes not limiting batch memory
SentenceTransformer.encode() takes a batch_size parameter, but memory per item is governed by the model's max_seq_length, the length at which inputs are truncated. If max_seq_length is 8192 (the default for some modern long-context models) and the GPU is small, even moderate batch sizes will OOM on long inputs.
How to spot it: Run print(model.max_seq_length) on your SentenceTransformer instance. If it’s 4096 or 8192 and your GPU is less than 16 GB, lower it to 512 unless you specifically need long contexts.
5. Concurrent embedding requests from multiple RAG workers
If your document indexer spawns multiple parallel workers that each send batches to the same embedding server, the server can end up handling several times the batch size it was sized for. If it starts executing queued requests before previous batches finish, the concurrent forward passes lead to OOM.
How to spot it: Count the number of parallel workers in your indexing pipeline. If you have 8 workers each sending 32-item batches, the effective concurrent batch size is 256.
6. Ollama embedding model not GPU-accelerated
Ollama may load embedding models (nomic-embed-text, mxbai-embed-large) on CPU on some installations, for example when the GPU's VRAM is already occupied by a generation model. CPU embedding can be 20-100x slower, and under batched load the request queue grows until the server runs out of memory.
How to spot it: Run ollama ps while the embedding server is under load. Check the “Processor” column. If it shows “100% CPU” or a CPU/GPU split, the embedding model is not fully on the GPU.
Shortest path to fix
Step 1: Reduce batch size and add exponential backoff
```python
import time

import requests


def embed_with_retry(texts: list[str], batch_size: int = 16) -> list:
    """Embed texts in small batches, retrying each batch up to 3 times."""
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        for attempt in range(3):
            try:
                resp = requests.post(
                    "http://localhost:11434/api/embed",
                    json={"model": "nomic-embed-text", "input": batch},
                    timeout=60,
                )
                resp.raise_for_status()
                embeddings.extend(resp.json()["embeddings"])
                break
            except requests.RequestException:
                if attempt == 2:
                    raise  # Out of retries: surface the error to the caller
                time.sleep(2 ** attempt)  # Back off 1 s, then 2 s
    return embeddings
```
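Backoff alone retries the same batch size, which does not help if the batch itself is what exhausts memory. One possible complement, sketched here under the assumption that any exception from the embedding call may be size-related, is to halve the batch size on failure. embed_fn is a hypothetical callable that takes a list of texts and returns one vector per text; it is not part of any library.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def embed_adaptive(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[Vector]],
    batch_size: int = 16,
    min_batch_size: int = 1,
) -> List[Vector]:
    """Embed texts in order, halving the batch size whenever embed_fn raises."""
    embeddings: List[Vector] = []
    i = 0
    while i < len(texts):
        batch = list(texts[i:i + batch_size])
        try:
            vectors = embed_fn(batch)
        except Exception:
            if batch_size <= min_batch_size:
                raise  # Even the smallest batch fails: not a batch-size problem.
            batch_size = max(min_batch_size, batch_size // 2)
            continue  # Retry the same position with a smaller batch.
        embeddings.extend(vectors)
        i += len(batch)
    return embeddings
```

The reduced batch size is kept for the rest of the run, on the assumption that a size that failed once will fail again. In practice you would combine this with the sleep-based backoff above so a crashed server has time to restart.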
Step 2: Start llama-server with proper embedding flags
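```bash
./llama-server \
  -m models/bge-large-en-v1.5-Q8_0.gguf \
  --embedding \
  --pooling cls \
  --ctx-size 512 \
  --batch-size 512 \
  --ubatch-size 512 \
  --n-gpu-layers 99 \
  --port 8081
```
--batch-size is the logical batch size in tokens, and --ubatch-size is the physical batch size that is actually pushed through the model at once. Encoder-style embedding models need a whole input in a single physical batch, so recent llama.cpp builds reject any input longer than --ubatch-size. Keep it at least as large as your longest chunk (512 here) and control peak memory with --ctx-size and a smaller client-side batch instead. BGE models are trained with CLS pooling, hence --pooling cls; use --pooling mean only for models trained with mean pooling.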
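For completeness, here is a hedged sketch of a client for that server. It assumes llama-server exposes the OpenAI-compatible /v1/embeddings endpoint on port 8081 (the port used in the command above) and that the response follows the OpenAI shape, with a data list of items carrying index and embedding fields. The helper names are hypothetical, and the response shape should be checked against your build.

```python
import json
import urllib.request
from typing import List

# Assumed endpoint; the port matches the startup command above.
LLAMA_URL = "http://localhost:8081/v1/embeddings"


def build_request(texts: List[str], url: str = LLAMA_URL) -> urllib.request.Request:
    """Build a JSON POST request carrying the texts as the `input` field."""
    body = json.dumps({"input": texts}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(payload: dict) -> List[List[float]]:
    """Extract vectors from an OpenAI-style response, ordered by `index`."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts: List[str], url: str = LLAMA_URL,
          timeout: float = 60.0) -> List[List[float]]:
    """Send one batch to the server and return one vector per input text."""
    with urllib.request.urlopen(build_request(texts, url), timeout=timeout) as resp:
        return parse_embeddings(json.load(resp))
```

Sorting by index guards against a server returning items out of order, which would otherwise silently pair vectors with the wrong chunks.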
Step 3: Enforce max sequence length in sentence-transformers
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")
# Cap the sequence length explicitly (some long-context models default to 8192)
model.max_seq_length = 512


def embed_documents(texts: list[str]) -> list[list[float]]:
    """Encode texts in small batches and return plain Python lists."""
    return model.encode(
        texts,
        batch_size=32,
        show_progress_bar=True,
        convert_to_numpy=True,
        normalize_embeddings=True,
    ).tolist()
```
Step 4: Sort batches by length to minimize padding waste
```python
def embed_sorted(texts: list[str], model, batch_size: int = 32) -> list:
    # Sort by character length, a cheap proxy for token length, so that
    # similar-length sequences land in the same batch.
    # Note: SentenceTransformer.encode() already sorts by length internally;
    # this pattern matters most when you build batches for an HTTP server.
    indexed = sorted(enumerate(texts), key=lambda x: len(x[1]), reverse=True)
    sorted_texts = [t for _, t in indexed]
    original_indices = [i for i, _ in indexed]
    embeddings_sorted = model.encode(sorted_texts, batch_size=batch_size)
    # Restore original order
    result = [None] * len(texts)
    for orig_idx, emb in zip(original_indices, embeddings_sorted):
        result[orig_idx] = emb
    return result
```
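Sorting fixes padding waste but still uses a fixed item count per batch, so a batch of long chunks costs far more than a batch of short ones. A possible refinement, sketched here as an assumption rather than something the tools above provide, is to cap each batch by its padded token count (items × longest item). The function name and the whitespace-based default counter are hypothetical; substitute your tokenizer for real use.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def token_budget_batches(
    texts: Sequence[str],
    max_padded_tokens: int = 8192,
    count_tokens: Callable[[str], int] = lambda t: len(t.split()),
) -> Iterator[List[Tuple[int, str]]]:
    """Yield batches of (original_index, text) whose padded size
    (number of items x longest item) stays within max_padded_tokens."""
    # Longest first, so the first item of each batch sets the padded width.
    order = sorted(range(len(texts)),
                   key=lambda i: count_tokens(texts[i]), reverse=True)
    batch: List[Tuple[int, str]] = []
    width = 0
    for i in order:
        length = max(1, count_tokens(texts[i]))
        # Close the batch if one more item would exceed the budget.
        if batch and (len(batch) + 1) * width > max_padded_tokens:
            yield batch
            batch = []
        if not batch:
            width = length
        batch.append((i, texts[i]))
    if batch:
        yield batch
```

Each yielded batch keeps the original indices, so results can be written back in input order the same way embed_sorted does. An item that exceeds the budget on its own is still emitted as a single-item batch rather than dropped.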
Step 5: Limit concurrent embedding workers
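```python
import asyncio

import aiohttp  # `session` below is expected to be an aiohttp.ClientSession

sem = asyncio.Semaphore(2)  # Max 2 concurrent embedding requests


async def embed_chunk(session: aiohttp.ClientSession, chunk: str) -> list[float]:
    async with sem:
        async with session.post(
            "http://localhost:11434/api/embed",
            json={"model": "nomic-embed-text", "input": [chunk]},
        ) as resp:
            resp.raise_for_status()  # Surface 429/500 instead of a KeyError below
            data = await resp.json()
            return data["embeddings"][0]
```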
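To show how a semaphore-guarded call is typically driven, here is a small self-contained sketch that fans out over many chunks while keeping a bounded number in flight. embed_one is a hypothetical async callable standing in for a request function like the one above, which makes the pattern testable without a running server.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence


async def embed_all(
    chunks: Sequence[str],
    embed_one: Callable[[str], Awaitable[List[float]]],
    max_concurrency: int = 2,
) -> List[List[float]]:
    """Embed every chunk, keeping at most max_concurrency calls in flight.
    Results come back in the same order as chunks."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(chunk: str) -> List[float]:
        async with sem:
            return await embed_one(chunk)

    return await asyncio.gather(*(guarded(c) for c in chunks))
```

Creating the semaphore inside the coroutine ties it to the running event loop, and asyncio.gather preserves input order, so the returned vectors line up with the chunks even though requests finish at different times.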
Prevention
- Set batch_size to 8-32 for large models (335M+ parameters) and 32-64 for small models (110M) as starting points, then tune upward.
- Always enforce a max_seq_length of 512 unless your use case specifically needs longer sequences; most RAG chunks should be 128-512 tokens anyway.
- Sort input batches by length before embedding to minimize padding overhead (same principle as dynamic padding in training).
- Use a single embedding server with a request queue rather than multiple parallel servers competing for GPU memory.
- Monitor VRAM during indexing with nvidia-smi dmon -s m -d 1 before running full-scale indexing jobs.
- Keep the embedding model running on a dedicated GPU or VRAM allocation separate from the generation model to avoid resource competition.
- Implement circuit-breaker logic in your indexer to pause and retry after a 429 or 500 response, rather than hammering the server.
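The circuit-breaker item can be sketched as a small class. This is one minimal design under stated assumptions, not a standard implementation: the class name, the thresholds, and the choice to treat 429 and any 5xx status as failures are all illustrative. The clock is injectable so the behavior can be checked without sleeping.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after failure_threshold consecutive failures and stays open
    for cooldown_seconds; a successful response resets the failure count."""

    def __init__(self, failure_threshold: int = 3,
                 cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        """Return True if the indexer may send a request right now."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: close the breaker, but keep it one
            # failure away from reopening.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
            return True
        return False

    def record(self, status_code: int) -> None:
        """Record the HTTP status of the latest embedding response."""
        if status_code == 429 or status_code >= 500:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
        else:
            self._failures = 0
```

In an indexer loop, call allow_request() before each batch and sleep briefly when it returns False, then pass each response's status code to record(). This gives a struggling server time to recover instead of receiving a steady stream of retries.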
FAQ
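Q: What’s the difference between --batch-size and --ubatch-size in llama-server?
A: --batch-size is the logical batch size: the maximum number of tokens the server schedules in one processing step. --ubatch-size is the physical batch size: the number of tokens actually pushed through the model at once, which is what drives peak compute memory. For generation workloads, lowering --ubatch-size reduces peak memory. For encoder-style embedding models, each input has to fit in a single physical batch in recent llama.cpp builds, so --ubatch-size must be at least as long as your longest chunk; reduce --ctx-size or the client-side batch size to save memory instead.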
Q: Can I run embeddings and generation on the same llama-server instance?
A: Not recommended. llama-server allocates a fixed KV cache at startup optimized for either generation (large KV cache, autoregressive) or embedding (no autoregressive KV cache needed). Run separate instances on different ports for each workload type.
Q: Why does the embedding server crash only on the 50th batch and not the first?
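A: There are two common reasons. First, batches differ: the 50th batch may simply contain the longest chunk so far, so its padded size is the largest (see cause 2). Second, memory fragmentation: earlier batches allocate and free memory, and the allocator is left with fragmented blocks rather than one contiguous region, so a later large allocation fails. For host RAM on Linux with glibc, setting MALLOC_ARENA_MAX=2 in the server's environment can reduce fragmentation. For GPU memory under PyTorch, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True targets the same problem.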
Q: My embeddings are fast on GPU, but I have to deploy on a CPU-only server. What batch size is safe?
A: On CPU with sentence-transformers, batch size 1-4 is typically safe for long sequences. For short sequences (128 tokens), batch_size=16 is usually fine. Monitor RAM usage with htop: if the process approaches your system RAM limit, halve the batch size.