Local RAG Index Rebuild Is Unbearably Slow

Rebuilding a local vector index from thousands of documents takes hours instead of minutes. Fix batch size, skip unchanged docs, batch-write the vectorstore, and right-size chunks.

Published: May 25, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

You have 15,000 markdown files to index for a local RAG system using nomic-embed-text via Ollama or bge-large-en-v1.5 via sentence-transformers. You kick off the job, and four hours later it still has 3,000 documents left. At this rate a full rebuild takes 6-8 hours, which kills any plan for daily incremental updates. The embedding model is on a 4090 that handles individual embeddings in under 10ms, yet the wall-clock rate is only 50 documents/minute.

Fastest fix: the throughput loss is almost never the model. It is hiding in three places, in this order of impact: (1) embedding one item per call instead of in batches of 32-128, (2) re-embedding documents that have not changed since the last run, and (3) writing the vectorstore one row at a time instead of one batch upsert. Fix those three and a 6-hour rebuild typically drops under 15 minutes. Work down the table below to find your bucket before touching code.

Which bucket are you in?

| Symptom you observe | Most likely cause | Jump to |
| --- | --- | --- |
| Time-per-item is the same whether you send 1 or 64 texts | `batch_size=1`: per-call overhead not amortized | Step 1 |
| Every rebuild re-embeds the whole corpus even when little changed | No content-hash change detection | Step 2 |
| Embedding finishes fast but writes take hours | One-by-one vectorstore inserts | Step 3 |
| Chunk count is 20-50x the document count | Chunk size too small | Step 4 |
| `nvidia-smi` shows 0% GPU during indexing | Embedding model fell back to CPU | Causes below |
| First phase is slow before any embedding starts | File read I/O from NFS/SMB/S3 | Causes below |

Common causes

Ordered by impact, highest first.

1. Batch size of 1 — embedding documents one at a time

The default for many LangChain and LlamaIndex embedding integrations is effectively one item per call. Each single-item request carries the same GPU launch and HTTP round-trip overhead as a 64-item batch, so you pay that overhead 64 times instead of once. Through the Ollama HTTP API on a local machine, a single-item embedding call averages roughly 10-15ms of network plus scheduling overhead, capping throughput near 60-70 documents/minute no matter how fast the model itself is.

How to spot it: Add a timer around your embedding calls. If time-per-item is roughly constant whether you send 1 or 64 items, you are paying per-call overhead for every item instead of amortizing it across a batch.
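A minimal sketch of that timer, assuming a hypothetical `embed_fn` callable that accepts a list of strings (wrap `model.encode` or your Ollama client in it). The `fake_embed` function is only a stand-in that simulates a fixed per-call cost so the snippet runs without a model; its 5 ms figure is illustrative, not a measurement.

```python
import time
from typing import Callable, Sequence


def time_per_item(embed_fn: Callable[[Sequence[str]], object],
                  texts: Sequence[str], batch_size: int) -> float:
    """Return average seconds per text when embedding in batches of batch_size."""
    if not texts:
        raise ValueError("texts must not be empty")
    start = time.perf_counter()
    for i in range(0, len(texts), batch_size):
        embed_fn(texts[i:i + batch_size])
    return (time.perf_counter() - start) / len(texts)


def fake_embed(batch: Sequence[str]) -> list[list[float]]:
    """Stand-in for a real model: fixed 5 ms per call plus 0.1 ms per item."""
    time.sleep(0.005 + 0.0001 * len(batch))
    return [[0.0] for _ in batch]


if __name__ == "__main__":
    sample = ["some chunk text"] * 128
    for bs in (1, 8, 64):
        ms = time_per_item(fake_embed, sample, bs) * 1000
        print(f"batch_size={bs:>3}: {ms:.2f} ms/item")
```

With a real model, a per-item time that falls sharply as `batch_size` grows means batching is working. A flat line means every item still pays the per-call cost.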

2. Re-embedding documents that have not changed since the last index

Every full rebuild re-embeds all documents even if 95% of them are unchanged. For a 50,000-chunk corpus, that is 47,500 wasted embeddings per run.

How to spot it: Check your indexing code for a content-hash or modification-time check. If it calls embed_documents(all_chunks) without first filtering to changed chunks, everything is re-embedded every run.

3. Vectorstore write serialization — inserting embeddings one by one

FAISS, Chroma, and Qdrant all support batch writes. If your code calls `collection.add(ids=[doc_id], embeddings=[embedding])` inside a loop, each call acquires a write lock, commits, and releases, once for every single row. A loop of 50,000 individual add calls can be 100x slower than a handful of `collection.upsert(ids=ids_list, embeddings=embeddings_list)` calls. As of June 2026, Chroma rejects any single call over 5,461 items (ValueError: Cannot submit more than 5,461 embeddings at once), so you still chunk, just into large batches rather than singletons.

How to spot it: Profile the vectorstore write phase separately from the embedding phase. If embedding takes 10 minutes but writes take 3 hours, serialization is the bottleneck.
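One lightweight way to separate the two phases is a timing context manager wrapped around each stage of the loop. This is a sketch using only the standard library; the `phase` and `report` names are hypothetical, and the `time.sleep` calls stand in for real embedding and write work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per named phase.
phase_seconds: dict[str, float] = defaultdict(float)


@contextmanager
def phase(name: str):
    """Add the time spent inside the with-block to phase_seconds[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        phase_seconds[name] += time.perf_counter() - start


def report() -> str:
    """Format phases slowest-first with their share of the total time."""
    total = sum(phase_seconds.values()) or 1.0
    rows = sorted(phase_seconds.items(), key=lambda kv: kv[1], reverse=True)
    return "\n".join(
        f"{name:<8}{secs:8.2f}s {secs / total:6.1%}" for name, secs in rows
    )


if __name__ == "__main__":
    for _ in range(3):
        with phase("embed"):
            time.sleep(0.01)   # placeholder for the embedding call
        with phase("write"):
            time.sleep(0.03)   # placeholder for the vectorstore write
    print(report())
```

In a real indexer, wrap the embedding call and the vectorstore call in their own `with phase(...)` blocks and print the report at the end of the run.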

4. Chunking strategy producing too many tiny chunks

If your splitter uses a chunk size of 50 tokens with 10-token overlap, a 10-page document explodes into 400+ chunks. A 50-token chunk costs the same embedding overhead as a 500-token chunk, so you do roughly 10x the work for marginal retrieval gain.

How to spot it: Count total chunks across the corpus and divide by document count. If the average is more than 20-30 chunks per typical document, your chunk size is too small.
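As a rough check on that arithmetic, a fixed-size sliding window over n tokens with chunk size c and overlap o yields about ceil((n - o) / (c - o)) chunks. The sketch below assumes that idealized splitter (real splitters that respect separators will deviate), and the 16,000-token length for a dense 10-page document is an assumed figure chosen for illustration.

```python
import math


def estimate_chunks(doc_tokens: int, chunk_size: int, overlap: int) -> int:
    """Estimate chunk count for an idealized fixed-size sliding-window splitter."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if doc_tokens <= 0:
        return 0
    if doc_tokens <= chunk_size:
        return 1
    # Each chunk after the first advances by (chunk_size - overlap) tokens.
    return math.ceil((doc_tokens - overlap) / (chunk_size - overlap))


def average_chunks_per_doc(chunk_counts: list[int]) -> float:
    """Average chunks per document; compare against the 20-30 rule of thumb."""
    return sum(chunk_counts) / len(chunk_counts) if chunk_counts else 0.0


if __name__ == "__main__":
    print(estimate_chunks(16_000, chunk_size=50, overlap=10))    # 400
    print(estimate_chunks(16_000, chunk_size=512, overlap=64))   # 36
```

Under these assumptions the same document drops from 400 chunks to 36, which is where the roughly 10x difference in embedding work comes from.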

5. CPU fallback for the embedding model

If the embedding model runs on CPU instead of GPU, throughput drops 20-100x. This is common when an Ollama embedding model shares the GPU with a chat model and gets evicted under VRAM pressure, or when a sentence-transformers model silently falls back to CPU because device was never set.

How to spot it: Run nvidia-smi (or ollama ps, which prints a PROCESSOR column showing GPU, CPU, or a split) during indexing. If the embedding model shows 0% GPU utilization or 100% CPU, it is running on CPU.

6. File-read I/O for a large or remote corpus

If documents live on a network share (NFS, SMB, or S3 mounted via rclone), reading 15,000 files introduces latency before any embedding starts. A read bottleneck looks identical to an embedding bottleneck unless you time each phase separately.

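How to spot it: Time just the read phase: `time find /path/to/docs -name '*.md' -exec cat {} + > /dev/null`. Use cat rather than wc -c here, because a byte count can be answered from file metadata without reading the contents. If the read alone takes more than a few seconds, disk or network I/O is part of the problem.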
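The same measurement can be taken from inside the indexer. A minimal standard-library sketch follows; `time_read_phase` is a hypothetical helper name, and `/path/to/docs` is the same placeholder path used above.

```python
import time
from pathlib import Path


def time_read_phase(root: str, pattern: str = "*.md") -> tuple[int, int, float]:
    """Fully read every matching file; return (file_count, total_bytes, seconds)."""
    start = time.perf_counter()
    count = 0
    total_bytes = 0
    for path in Path(root).rglob(pattern):
        if path.is_file():
            total_bytes += len(path.read_bytes())  # forces a real read
            count += 1
    return count, total_bytes, time.perf_counter() - start


if __name__ == "__main__":
    files, size, secs = time_read_phase("/path/to/docs")
    print(f"{files} files, {size / 1e6:.1f} MB in {secs:.2f}s")
```

A second run right after the first is usually served from the operating system's page cache, so judge by the first (cold) run.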

Shortest path to fix

Step 1: Switch to large batch embedding calls

With sentence-transformers, pass the whole list and set batch_size explicitly (the encode default is only 32):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")
model.max_seq_length = 512

# Bad: one at a time
# embeddings = [model.encode(chunk) for chunk in chunks]

# Good: embed the full list, GPU batches internally
embeddings = model.encode(
    chunks,
    batch_size=64,            # raise to 128 if VRAM allows
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True,
)

For Ollama, use /api/embed (the current endpoint, not the deprecated /api/embeddings). Its input field accepts an array, returns an embeddings array, and the vectors are already L2-normalized:

import requests

def embed_batch(texts: list[str]) -> list[list[float]]:
    resp = requests.post(
        "http://localhost:11434/api/embed",
        json={
            "model": "nomic-embed-text",
            "input": texts,
            "options": {"num_ctx": 8192},  # see note below
        },
        timeout=120,
    )
    resp.raise_for_status()            # surface HTTP errors instead of a KeyError below
    return resp.json()["embeddings"]   # plural; one vector per input

batch_size = 64
all_embeddings = []
for i in range(0, len(chunks), batch_size):
    all_embeddings.extend(embed_batch(chunks[i:i + batch_size]))

Two things that bite people here, both current as of June 2026:

nomic-embed-text supports an 8192-token context, but Ollama’s model card defaults num_ctx to 2048. If your chunks run long and you do not set num_ctx, the tail is silently truncated. Pass "options": {"num_ctx": 8192} (above) to use the full window.
Sending a large input array from one client does not by itself give you server-side parallelism. Ollama’s OLLAMA_NUM_PARALLEL defaults to 1, so concurrent requests beyond that just queue. For bulk indexing, the bigger win is the single large batch per call shown above, not many concurrent small requests.

Step 2: Add content-hash change detection

import hashlib, json, pathlib

def compute_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()[:16]

hash_file = pathlib.Path(".index_hashes.json")
hashes = json.loads(hash_file.read_text()) if hash_file.exists() else {}

changed_chunks, changed_ids = [], []
for chunk_id, chunk_text in zip(all_ids, all_chunks):
    new_hash = compute_hash(chunk_text)
    if hashes.get(chunk_id) != new_hash:
        changed_chunks.append(chunk_text)
        changed_ids.append(chunk_id)
        hashes[chunk_id] = new_hash

print(f"Re-embedding {len(changed_chunks)} of {len(all_chunks)} chunks")
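# Run this write only after the embed and upsert for changed_chunks succeed.
# Saving earlier marks failed chunks as indexed, so the next run skips them.
hash_file.write_text(json.dumps(hashes))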

Hash the chunk text, not the file — a one-line edit to a large file should only re-embed the chunks it touched, not the whole document.
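To illustrate why chunk-level hashing pays off, here is a self-contained sketch. The `diff_chunks` helper and the `path#n` ID scheme are hypothetical, not part of any library. It also reports IDs that have disappeared from the corpus, which the snippet above does not handle, so they can be deleted from the vectorstore.

```python
import hashlib


def chunk_hash(text: str) -> str:
    """Short content hash; used for change detection only, not for security."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


def diff_chunks(old_hashes: dict[str, str], chunks: dict[str, str]):
    """Compare stored hashes against the current chunks.

    Returns (changed_ids, stale_ids, new_hashes):
      changed_ids - new or modified chunks that need re-embedding
      stale_ids   - IDs in the old index that no longer exist
      new_hashes  - hashes to persist once the index update succeeds
    """
    new_hashes = {cid: chunk_hash(text) for cid, text in chunks.items()}
    changed = [cid for cid, h in new_hashes.items() if old_hashes.get(cid) != h]
    stale = [cid for cid in old_hashes if cid not in new_hashes]
    return changed, stale, new_hashes


if __name__ == "__main__":
    before = {"notes.md#0": "intro", "notes.md#1": "body", "notes.md#2": "outro"}
    _, _, stored = diff_chunks({}, before)

    after = {"notes.md#0": "intro", "notes.md#1": "body, edited"}
    changed, stale, _ = diff_chunks(stored, after)
    print(changed, stale)  # ['notes.md#1'] ['notes.md#2']
```

One caveat with positional IDs like these: inserting text near the top of a file shifts every later chunk, so all of them register as changed.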

Step 3: Batch-insert into the vectorstore

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("docs")

# Bad: one at a time in a loop
# for id, emb, doc in zip(ids, embeddings, documents):
#     collection.add(ids=[id], embeddings=[emb], documents=[doc])

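# Good: chunked batch upsert. Stay under Chroma's 5,461-per-call ceiling.
# all_embeddings must hold vectors for changed_chunks only (run Step 1 on
# changed_chunks), so that row i lines up with changed_ids[i].
assert len(all_embeddings) == len(changed_ids) == len(changed_chunks)
CHROMA_MAX_BATCH = 5000
for i in range(0, len(changed_ids), CHROMA_MAX_BATCH):
    collection.upsert(
        ids=changed_ids[i:i + CHROMA_MAX_BATCH],
        embeddings=all_embeddings[i:i + CHROMA_MAX_BATCH],
        documents=changed_chunks[i:i + CHROMA_MAX_BATCH],
    )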

Use upsert, not add, so re-running after a partial failure overwrites instead of erroring on duplicate IDs.
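The slicing loop generalizes to a small helper that also guards against misaligned inputs. This is a sketch: `batched_rows` is a hypothetical name, and the 5,000 default simply mirrors the constant above rather than a limit queried from Chroma.

```python
from typing import Iterator, Sequence


def batched_rows(ids: Sequence[str],
                 embeddings: Sequence[Sequence[float]],
                 documents: Sequence[str],
                 max_batch: int = 5000) -> Iterator[dict]:
    """Yield keyword-argument dicts, each holding at most max_batch aligned rows."""
    if not (len(ids) == len(embeddings) == len(documents)):
        raise ValueError("ids, embeddings, and documents must be the same length")
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    for i in range(0, len(ids), max_batch):
        yield {
            "ids": ids[i:i + max_batch],
            "embeddings": embeddings[i:i + max_batch],
            "documents": documents[i:i + max_batch],
        }


if __name__ == "__main__":
    demo_ids = [f"doc#{n}" for n in range(5)]
    demo_vecs = [[float(n)] for n in range(5)]
    demo_docs = [f"text {n}" for n in range(5)]
    for kwargs in batched_rows(demo_ids, demo_vecs, demo_docs, max_batch=2):
        print(len(kwargs["ids"]))  # with Chroma: collection.upsert(**kwargs)
```

The length check catches the common mistake of passing embeddings for the whole corpus alongside IDs for only the changed chunks.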

Step 4: Increase chunk size to cut total chunk count

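from langchain_text_splitters import RecursiveCharacterTextSplitter

# length_function=len measures chunk_size in characters, not tokens.
# To size chunks in tokens, pass a tokenizer-based length function instead.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # was 100; roughly 5x fewer chunks
    chunk_overlap=64,  # ~12% overlap is enough for retrieval continuity
    length_function=len,
)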

Step 5: Overlap embedding and writing with a producer/consumer pipeline

While the GPU embeds the next batch, the previous one is being written. A small bounded queue keeps both busy without unbounded memory growth.

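import concurrent.futures, queue

# Bounded queue: the producer blocks when the writer falls behind.
embed_queue = queue.Queue(maxsize=10)

def embedding_producer(chunks, ids, batch_size=64):
    try:
        for i in range(0, len(chunks), batch_size):
            batch = chunks[i:i + batch_size]
            batch_ids = ids[i:i + batch_size]
            vecs = model.encode(batch, batch_size=batch_size, normalize_embeddings=True)
            embed_queue.put((batch_ids, batch, vecs.tolist()))
    finally:
        embed_queue.put(None)  # sentinel, sent even on error so the consumer exits

def vectorstore_consumer():
    error = None
    while (item := embed_queue.get()) is not None:
        if error is not None:
            continue  # keep draining so the producer never blocks on a full queue
        ids, texts, vecs = item
        try:
            collection.upsert(ids=ids, embeddings=vecs, documents=texts)
        except Exception as exc:
            error = exc
    if error is not None:
        raise error

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as ex:
    futures = [
        ex.submit(embedding_producer, changed_chunks, changed_ids),
        ex.submit(vectorstore_consumer),
    ]
    for f in futures:
        f.result()  # re-raise worker exceptions instead of silently dropping them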

How to confirm it’s fixed

Run a timed full rebuild and watch the documents/minute rate. After Steps 1-3 a corpus that took 6-8 hours should finish in roughly 10-20 minutes on a single 4090.
Run the indexer a second time with no document changes. The “Re-embedding N of M chunks” line should print 0 of M, and the whole run should finish in seconds — proof that change detection works.
During the run, confirm nvidia-smi shows the embedding model pinned near 100% GPU, not idle and not on CPU.
Spot-check that the chunk count fell after Step 4 (print len(chunks) before and after).

Prevention

Always set batch_size explicitly on model.encode() and batch the input array for /api/embed — never rely on per-item defaults.
Build content-hash change detection in from the start; retrofitting it later means a full metadata migration.
Keep a minimum chunk size of 256 tokens for RAG indexing unless you have a specific reason to go smaller.
Store chunk hashes, IDs, and embeddings in a structured store (SQLite or Parquet) so partial rebuilds resume without reprocessing everything.
Profile embedding time vs. vectorstore write time separately on day one — the bottleneck is rarely where you first assume.
Pin the embedding model to GPU with device="cuda" (or verify ollama ps shows GPU) and check before every large run.
Keep both the corpus and the vectorstore on a local SSD; NFS-mounted paths can halve throughput on large corpora.
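One way to follow the structured-store advice above is a small SQLite table keyed by chunk ID. This is a minimal sketch using only the standard library; the file name, table name, columns, and helper names are illustrative assumptions, and the embeddings themselves are assumed to stay in the vectorstore.

```python
import sqlite3


def open_hash_store(path: str = "index_state.db") -> sqlite3.Connection:
    """Open (or create) the SQLite file that tracks one hash per chunk ID."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunk_hashes ("
        "chunk_id TEXT PRIMARY KEY, content_hash TEXT NOT NULL)"
    )
    return conn


def load_hashes(conn: sqlite3.Connection) -> dict[str, str]:
    """Return every stored {chunk_id: content_hash} pair."""
    return dict(conn.execute("SELECT chunk_id, content_hash FROM chunk_hashes"))


def save_hashes(conn: sqlite3.Connection, hashes: dict[str, str]) -> None:
    """Insert or overwrite a batch of hashes in a single transaction."""
    with conn:  # commits on success, rolls back on exception
        conn.executemany(
            "INSERT OR REPLACE INTO chunk_hashes (chunk_id, content_hash) "
            "VALUES (?, ?)",
            list(hashes.items()),
        )


if __name__ == "__main__":
    conn = open_hash_store(":memory:")
    save_hashes(conn, {"notes.md#0": "aaaa", "notes.md#1": "bbbb"})
    save_hashes(conn, {"notes.md#1": "cccc"})   # an edited chunk
    print(load_hashes(conn))
```

Calling `save_hashes` once per successfully written batch means an interrupted rebuild resumes from the last committed batch rather than from zero.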

FAQ

Q: Ollama or sentence-transformers for local embedding at scale? A: For bulk nightly indexing, sentence-transformers with device="cuda" and batch_size=64-128 is generally faster because it avoids the HTTP round trip entirely and lets the GPU batch internally. Ollama is more convenient for interactive use and a unified server, and its /api/embed array input closes most of the gap — but for the fastest rebuild, call the model in-process.

Q: My Ollama embeddings still look slow even with a batched input array. Why? A: Two usual culprits as of June 2026. First, OLLAMA_NUM_PARALLEL defaults to 1, so firing many concurrent small requests just queues them — send fewer, larger batches instead. Second, the embedding model may have been evicted to CPU under VRAM pressure from a running chat model; check ollama ps for the PROCESSOR column.

Q: What chunk size gives the best RAG retrieval quality? A: Most benchmarks land on 256-512 tokens per chunk as the sweet spot for recall. Below 128 tokens, chunks often lack enough context to be semantically meaningful; above 1024 tokens, the relevant sentence gets buried in surrounding text.

Q: Is FAISS faster than Chroma for local RAG writes? A: For pure batch insertion, an in-memory FAISS index (IndexFlatL2 or IndexHNSWFlat) saved to disk is 10-50x faster than Chroma, which adds metadata indexing and SQLite write overhead. If you do not need Chroma’s metadata filtering, FAISS with numpy batch adds wins on raw throughput.

Q: How do I handle documents longer than the embedding model’s max sequence length? A: Right-size chunks so each fits the window, and set the window deliberately — nomic-embed-text allows 8192 tokens but Ollama caps num_ctx at 2048 unless you raise it. For very long source documents, use hierarchical indexing: embed each chunk plus a short summary of the full document.

Tags: #local-llm #ollama #Troubleshooting
