You have 15,000 markdown files to index for your local RAG system using nomic-embed-text via Ollama or bge-large-en-v1.5 via sentence-transformers. You kick off the indexing job and it’s been running for 4 hours with 3,000 documents left. At this rate, a full rebuild takes about 5 hours, which is completely impractical for daily incremental updates. The embedding model is on a 4090 that handles individual embeddings in under 10ms, yet the wall-clock indexing rate is only 50 documents/minute. The throughput loss is usually hiding in three places: small batch sizes that waste GPU cycles, unnecessary re-embedding of unchanged documents, and a vectorstore that serializes writes one at a time.
Common causes
Ordered by impact, highest first.
1. Batch size of 1 — embedding documents one at a time
Many hand-written indexing loops, and some LangChain and LlamaIndex embedding integrations depending on how they are configured, end up sending one item per call (batch_size=1 or chunk_size=1). Each single-item call pays the same GPU launch overhead as a 32-item batch, so up to 97% of potential throughput is wasted. On a 4090, single-item embedding calls through the Ollama HTTP API average roughly 15ms of network + GPU overhead, which caps throughput at about 67 calls per second regardless of model speed. A document that splits into dozens of chunks needs dozens of those calls.
How to spot it: Add a timer around your embedding calls and compare a 1-item call with a 32-item call. If the time per call is roughly the same in both cases, fixed per-call overhead dominates, and single-item calls pay that overhead for every item instead of amortizing it across a batch.
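The timing check described above can be sketched as follows. This is a minimal harness, not a benchmark: `time_per_item` and `fake_embed` are hypothetical names, and `fake_embed` is a stand-in that simulates a fixed per-call cost so the example runs without a GPU. Swap in your real embedding function to measure your own setup.

```python
import time
from typing import Callable, Sequence


def time_per_item(embed: Callable[[Sequence[str]], object],
                  texts: Sequence[str],
                  batch_size: int) -> float:
    """Return average seconds per item when embedding `texts` in batches."""
    start = time.perf_counter()
    for i in range(0, len(texts), batch_size):
        embed(texts[i:i + batch_size])
    return (time.perf_counter() - start) / len(texts)


def fake_embed(batch: Sequence[str]) -> list[list[float]]:
    """Stand-in embedder (assumption): 5 ms fixed cost per call plus 0.1 ms per item."""
    time.sleep(0.005 + 0.0001 * len(batch))
    return [[0.0] for _ in batch]


if __name__ == "__main__":
    sample = ["some chunk text"] * 64
    for bs in (1, 8, 32):
        per_item_ms = time_per_item(fake_embed, sample, bs) * 1000
        print(f"batch_size={bs:>2}: {per_item_ms:.2f} ms/item")
```

If the per-item time drops sharply as the batch size grows, per-call overhead was the bottleneck and batching will help.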
2. Re-embedding documents that haven’t changed since last index
Every full rebuild re-embeds all documents even if 95% of them are unchanged. For a 50,000-document corpus, 47,500 re-embeddings are wasted work.
How to spot it: Check your indexing code for content-hash or modification-time checks. If the code calls embed_documents(all_chunks) without filtering to only modified chunks, all documents are re-embedded every run.
3. Vectorstore write serialization — inserting embeddings one by one
FAISS, Chroma, and Qdrant all accept batched writes (batch add in FAISS, batch add or upsert in Chroma and Qdrant). If your code calls something like collection.add(ids=[doc_id], embeddings=[embedding]) in a loop, every call pays its own fixed cost; in persistent stores that typically means acquiring a write lock, committing a transaction, and releasing the lock for each item. A loop of 50,000 individual adds can be 100x slower than a handful of batched calls such as collection.add(ids=ids_list, embeddings=embeddings_list).
How to spot it: Profile the vectorstore write phase independently from the embedding phase. If embedding takes 10 minutes but the vectorstore writes take 3 hours, serialization is the bottleneck.
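One way to profile the two phases independently is a small accumulating timer. The sketch below uses only the standard library; `phase` and `report` are hypothetical helper names, and the `time.sleep` calls are placeholders for your real embedding and vectorstore calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Total wall-clock seconds spent in each named phase.
phase_seconds = defaultdict(float)


@contextmanager
def phase(name: str):
    """Accumulate wall-clock time spent inside the `with` block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_seconds[name] += time.perf_counter() - start


def report() -> list[tuple[str, float]]:
    """Return (phase, seconds) pairs sorted slowest first."""
    return sorted(phase_seconds.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    with phase("embed"):
        time.sleep(0.01)  # placeholder for model.encode(...)
    with phase("write"):
        time.sleep(0.03)  # placeholder for collection.upsert(...)
    for name, secs in report():
        print(f"{name:<6} {secs:.3f}s")
```

Wrapping each batch iteration in these blocks gives per-phase totals for the whole run, which is enough to tell whether embedding or writing dominates.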
4. Chunking strategy producing too many tiny chunks
If your document splitter uses a chunk size of 50 tokens with a 10-token overlap, a 10-page document can produce well over 100 chunks (about 125 for a 5,000-token document, versus about 11 at 500 tokens with a 50-token overlap). Each tiny chunk costs one embedding and one vectorstore write, so you’re doing roughly 10x the work. Chunks that small also rarely carry enough context to embed well.
How to spot it: Count the total number of chunks across your corpus. If the average chunks-per-document is more than 20-30 for typical documents, your chunk size may be too small.
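The chunk count can be estimated before running the splitter. The sketch below assumes a plain sliding window with a fixed stride of chunk_size minus overlap; real splitters that respect sentence or paragraph boundaries will deviate somewhat. `estimate_chunks` and `average_chunks_per_doc` are hypothetical helper names.

```python
def estimate_chunks(n_tokens: int, chunk_size: int, overlap: int) -> int:
    """Estimate sliding-window chunk count for a document of `n_tokens` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if n_tokens <= 0:
        return 0
    stride = chunk_size - overlap
    # Integer ceiling of (n_tokens - overlap) / stride, with at least one chunk.
    return max(1, (n_tokens - overlap + stride - 1) // stride)


def average_chunks_per_doc(doc_token_counts: list[int],
                           chunk_size: int,
                           overlap: int) -> float:
    """Average estimated chunks per document across a corpus."""
    if not doc_token_counts:
        return 0.0
    total = sum(estimate_chunks(n, chunk_size, overlap) for n in doc_token_counts)
    return total / len(doc_token_counts)


if __name__ == "__main__":
    for size, overlap in ((50, 10), (500, 50)):
        n = estimate_chunks(5000, size, overlap)
        print(f"chunk_size={size}, overlap={overlap}: {n} chunks")
```

For a 5,000-token document this gives 125 chunks at 50/10 and 11 chunks at 500/50.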
5. CPU fallback for the embedding model
If the embedding model is running on CPU instead of GPU (see the embedding server crash article), throughput drops 20-100x. This is especially common for Ollama embedding models that share GPU with a generation model, or for sentence-transformers models that silently fall back to CPU because CUDA is not available to PyTorch (for example, with a CPU-only torch build) and device was never set explicitly.
How to spot it: Run nvidia-smi during indexing. If the embedding model’s GPU utilization is 0%, it’s running on CPU.
6. Network I/O overhead for large file corpus
If your documents are on a network share (NFS, SMB, or S3-mounted via rclone), reading 15,000 files from a remote source introduces latency before any embedding work begins. File-reading bottlenecks look identical to embedding bottlenecks if you’re not profiling each phase separately.
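How to spot it: Time just the file-reading phase: time find /path/to/docs -name '*.md' -exec cat {} + | wc -c. Piping through cat forces the file contents to be read; running wc -c on the files directly can report sizes from metadata without reading them. If this takes more than a few seconds for your corpus, disk/network I/O is part of the problem.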
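The same measurement can be taken from Python, which is convenient if you want to log it alongside the other phases. This is a minimal sketch using only the standard library; `time_file_reads` is a hypothetical helper name, and the default root of "." is just a placeholder.

```python
import sys
import time
from pathlib import Path


def time_file_reads(root: str, pattern: str = "*.md") -> tuple[int, int, float]:
    """Read every matching file under `root`; return (files, bytes, seconds)."""
    start = time.perf_counter()
    n_files = 0
    n_bytes = 0
    for path in Path(root).rglob(pattern):
        if path.is_file():
            n_bytes += len(path.read_bytes())  # forces the contents to be read
            n_files += 1
    return n_files, n_bytes, time.perf_counter() - start


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    files, size, secs = time_file_reads(root)
    print(f"{files} files, {size / 1e6:.1f} MB in {secs:.2f}s")
```

A second run over the same files is usually served from the OS page cache, so the first run after a reboot or remount is the more representative number.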
Shortest path to fix
Step 1: Switch to large batch embedding calls
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")
model.max_seq_length = 512  # longer inputs are truncated to this many tokens

# Bad: embedding one chunk per call
# embeddings = [model.encode(chunk) for chunk in chunks]

# Good: pass the whole list and let encode() batch it internally
embeddings = model.encode(
    chunks,
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True,
)
For Ollama’s API, batch the input field:
import requests

def embed_batch(texts: list[str]) -> list[list[float]]:
    resp = requests.post(
        "http://localhost:11434/api/embed",
        json={"model": "nomic-embed-text", "input": texts},
        timeout=120,
    )
    resp.raise_for_status()  # surface HTTP errors instead of a KeyError below
    return resp.json()["embeddings"]

# Process 64 chunks per API call
batch_size = 64
all_embeddings = []
for i in range(0, len(chunks), batch_size):
    all_embeddings.extend(embed_batch(chunks[i:i+batch_size]))
Step 2: Add content-hash-based change detection
import hashlib
import json
import pathlib

def compute_hash(text: str) -> str:
    # 16 hex characters (64 bits) is enough to detect edits
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

# Load existing hash registry
hash_file = pathlib.Path(".index_hashes.json")
hashes = json.loads(hash_file.read_text()) if hash_file.exists() else {}

# Filter to only changed chunks (all_ids and all_chunks come from your chunking step)
changed_chunks = []
changed_ids = []
for chunk_id, chunk_text in zip(all_ids, all_chunks):
    new_hash = compute_hash(chunk_text)
    if hashes.get(chunk_id) != new_hash:
        changed_chunks.append(chunk_text)
        changed_ids.append(chunk_id)
        hashes[chunk_id] = new_hash

print(f"Re-embedding {len(changed_chunks)} of {len(all_chunks)} chunks")

# Save updated hashes. Run this only after the changed chunks have been embedded
# and written; otherwise a crash leaves them marked as indexed when they are not.
hash_file.write_text(json.dumps(hashes))
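The registry above tracks new and modified chunks but not chunks whose source text has been deleted. A possible extension, sketched here under the assumption that chunk IDs are stable across runs, returns removed IDs as well and leaves saving the registry to the caller. `diff_against_registry` is a hypothetical helper name.

```python
import hashlib


def compute_hash(text: str) -> str:
    """Short content hash; 16 hex characters is enough to detect edits."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def diff_against_registry(
    registry: dict[str, str],
    current: dict[str, str],
) -> tuple[list[str], list[str], dict[str, str]]:
    """Compare current chunks (chunk_id -> text) with stored hashes.

    Returns (changed_ids, removed_ids, new_registry). The caller should
    persist new_registry only after embedding and writing have succeeded.
    """
    new_registry = {cid: compute_hash(text) for cid, text in current.items()}
    changed = [cid for cid, h in new_registry.items() if registry.get(cid) != h]
    removed = [cid for cid in registry if cid not in new_registry]
    return changed, removed, new_registry


if __name__ == "__main__":
    _, _, stored = diff_against_registry({}, {"a": "one", "b": "two"})
    changed, removed, _ = diff_against_registry(stored, {"a": "one", "c": "three"})
    print("changed:", changed, "removed:", removed)
```

The removed IDs can then be deleted from the vectorstore (in Chroma, via collection.delete(ids=...)) so stale chunks stop appearing in search results.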
Step 3: Batch-insert into the vectorstore
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("docs")

# Embed only the chunks that changed (see Step 2), not the whole corpus
changed_embeddings = model.encode(
    changed_chunks, batch_size=64, normalize_embeddings=True
).tolist()

# Bad: inserting one at a time in a loop
# for doc_id, emb, doc in zip(ids, embeddings, documents):
#     collection.add(ids=[doc_id], embeddings=[emb], documents=[doc])

# Good: batched upserts. Chroma caps the items per call; the exact limit depends
# on the Chroma version and SQLite build (5461 is a commonly reported value).
CHROMA_MAX_BATCH = 5000
for i in range(0, len(changed_ids), CHROMA_MAX_BATCH):
    collection.upsert(
        ids=changed_ids[i:i+CHROMA_MAX_BATCH],
        embeddings=changed_embeddings[i:i+CHROMA_MAX_BATCH],
        documents=changed_chunks[i:i+CHROMA_MAX_BATCH],
    )
Step 4: Increase chunk size to reduce total chunk count
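from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # was 100; increase to reduce chunk count about 5x
    chunk_overlap=64,     # 12.5% overlap is usually sufficient for retrieval quality
    length_function=len,  # len counts characters, so both sizes are in characters, not tokens
)
# To size chunks in tokens instead, pass a length_function that counts tokens
# with your embedding model's tokenizer.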
Step 5: Run embedding and vectorstore write in parallel pipelines
import concurrent.futures
import queue

embed_queue = queue.Queue(maxsize=10)  # bounded so the producer cannot run far ahead

def embedding_producer(ids, chunks, batch_size=64):
    try:
        for i in range(0, len(chunks), batch_size):
            batch = chunks[i:i+batch_size]
            embeddings = model.encode(batch, normalize_embeddings=True)
            embed_queue.put((ids[i:i+batch_size], batch, embeddings.tolist()))
    finally:
        embed_queue.put(None)  # sentinel, sent even if encoding fails

def vectorstore_consumer():
    while True:
        item = embed_queue.get()
        if item is None:
            break
        batch_ids, texts, embeddings = item
        collection.upsert(ids=batch_ids, embeddings=embeddings, documents=texts)

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as ex:
    futures = [
        ex.submit(embedding_producer, changed_ids, changed_chunks),
        ex.submit(vectorstore_consumer),
    ]
    for f in futures:
        f.result()  # re-raise any exception from either thread
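One hazard with a bounded queue is that a failing writer can leave the producer blocked on a full queue. The sketch below is one possible variant, not the only way to structure this: it uses plain threads, takes the embedding and writing functions as parameters (`embed` and `write` are stand-ins for your model and vectorstore calls), and keeps draining the queue after a write failure so the producer always finishes. `run_pipeline` is a hypothetical name.

```python
import queue
import threading


def run_pipeline(batches, embed, write, maxsize=4):
    """Embed in one thread and write in another; return a list of errors."""
    q = queue.Queue(maxsize=maxsize)
    errors = []

    def producer():
        try:
            for batch in batches:
                q.put((batch, embed(batch)))
        except Exception as exc:
            errors.append(exc)
        finally:
            q.put(None)  # always release the consumer

    def consumer():
        failed = False
        while True:
            item = q.get()
            if item is None:
                break
            if failed:
                continue  # keep draining so the producer never blocks
            try:
                write(*item)
            except Exception as exc:
                errors.append(exc)
                failed = True

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


if __name__ == "__main__":
    written = []
    errs = run_pipeline(
        [["a", "b"], ["c"]],
        embed=lambda batch: [[0.0] for _ in batch],             # stand-in embedder
        write=lambda batch, vecs: written.append(len(batch)),   # stand-in writer
    )
    print("batches written:", len(written), "errors:", len(errs))
```

Returning the errors instead of raising inside the threads lets the caller decide whether to retry the run or skip saving the hash registry.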
Prevention
- Always set batch_size explicitly when calling model.encode() or the Ollama embedding API; never rely on defaults.
- Implement content-hash change detection from the start of a project; retrofitting it later requires a full metadata migration.
- Set a minimum chunk size of 256 tokens for RAG indexing unless you have a specific reason for smaller chunks.
- Store chunk hashes, IDs, and embeddings in a structured format (SQLite or Parquet) so partial rebuilds are resumable without re-processing everything.
- Profile embedding vs. vectorstore write time separately on day one — the bottleneck is not always where you expect it.
- Keep the embedding model pinned to GPU with device="cuda" and verify with nvidia-smi before starting large indexing runs.
- Use a local SSD for both the document corpus and the vectorstore; NFS-mounted paths can substantially reduce throughput on large corpora.
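The suggestion above to keep chunk hashes in a structured store so that partial rebuilds are resumable could look like the following with SQLite. This is a minimal sketch: the table layout and the function names (`open_registry`, `record_chunk`, `pending_ids`, `mark_indexed`) are assumptions for illustration, and embeddings themselves are left out.

```python
import sqlite3


def open_registry(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the chunk registry database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        " chunk_id TEXT PRIMARY KEY,"
        " content_hash TEXT NOT NULL,"
        " indexed INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def record_chunk(conn: sqlite3.Connection, chunk_id: str, content_hash: str) -> None:
    """Register a chunk; mark it as not indexed if it is new or its hash changed."""
    row = conn.execute(
        "SELECT content_hash FROM chunks WHERE chunk_id = ?", (chunk_id,)
    ).fetchone()
    if row is None or row[0] != content_hash:
        conn.execute(
            "INSERT OR REPLACE INTO chunks (chunk_id, content_hash, indexed) "
            "VALUES (?, ?, 0)",
            (chunk_id, content_hash),
        )
        conn.commit()


def pending_ids(conn: sqlite3.Connection) -> list[str]:
    """Chunk IDs that still need to be embedded and written."""
    rows = conn.execute(
        "SELECT chunk_id FROM chunks WHERE indexed = 0 ORDER BY chunk_id"
    )
    return [r[0] for r in rows]


def mark_indexed(conn: sqlite3.Connection, chunk_ids: list[str]) -> None:
    """Flag chunks as indexed once their vectorstore write has succeeded."""
    conn.executemany(
        "UPDATE chunks SET indexed = 1 WHERE chunk_id = ?",
        [(cid,) for cid in chunk_ids],
    )
    conn.commit()
```

Calling mark_indexed after each successful vectorstore batch means an interrupted run can resume from pending_ids instead of starting over.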
FAQ
Q: Should I use Ollama or sentence-transformers for local embedding at scale?
A: sentence-transformers with device="cuda" and batch_size=64-128 is generally faster for bulk indexing because it avoids HTTP round-trip overhead. Ollama is more convenient for interactive use where you want a unified server. For nightly batch indexing, sentence-transformers directly is the better choice.
Q: What chunk size gives the best RAG retrieval quality?
A: Published comparisons often put the sweet spot for retrieval recall at around 256-512 tokens per chunk, though the best value depends on the corpus and the embedding model. Chunks below 128 tokens often lack enough context for the embeddings to be semantically meaningful. Chunks above 1024 tokens risk burying the relevant sentence in irrelevant surrounding text.
Q: Is FAISS faster than Chroma for local RAG vectorstore writes?
A: For batch indexing, a FAISS index (IndexFlatL2 or IndexHNSWFlat) built in memory and saved to disk can be 10-50x faster than Chroma for pure insertion throughput. Chroma adds metadata indexing and SQLite write overhead. If you don’t need Chroma’s metadata filtering, FAISS with numpy batch adds is significantly faster.
Q: How do I handle documents that span more than the embedding model’s max sequence length?
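A: Decide explicitly rather than relying on silent truncation. sentence-transformers truncates input at max_seq_length tokens; bge-large-en-v1.5 is limited to 512 tokens, while nomic-embed-text accepts longer inputs (up to 8,192 tokens per its model card, subject to the serving configuration). Embedding only the first 512 tokens of a long document often captures its main topic, but anything after the cutoff is invisible to retrieval. For long documents, use hierarchical indexing: embed each 512-token chunk plus a summary of the full document.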
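Splitting a long document into overlapping windows, as opposed to truncating it, can be sketched as below. The example works on a pre-tokenized list and uses whitespace splitting as a rough stand-in for the embedding model's tokenizer, which is an assumption; real token counts will differ. `split_into_windows` is a hypothetical helper name.

```python
def split_into_windows(tokens: list[str],
                       window: int = 512,
                       overlap: int = 64) -> list[list[str]]:
    """Split `tokens` into windows of at most `window` items that overlap by `overlap`."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    stride = window - overlap
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the document
    return windows


if __name__ == "__main__":
    text = "word " * 1200
    tokens = text.split()  # whitespace tokens as a rough proxy (assumption)
    chunks = [" ".join(w) for w in split_into_windows(tokens)]
    print(f"{len(tokens)} tokens -> {len(chunks)} chunks")
```

Each window can then be embedded as its own chunk, with the document-level summary embedded separately for the hierarchical approach.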