Local LLM Very Slow on First Token After Cold Start

Local LLM takes 30-120s to produce the first token after loading, then runs fast. Diagnose disk I/O, model eviction, CUDA/Metal shader JIT, and KV cache allocation, and pin the model warm.

Published: May 25, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

You run ollama run llama3.1:8b "say hello" on a freshly booted workstation. The cursor blinks for 45 seconds before the first token appears. After that, tokens stream at 40+ t/s and a second prompt answers almost instantly. Nothing is broken — this is the cold-start penalty, and it shows up identically across Ollama, llama-server (llama.cpp), and LM Studio on NVIDIA, AMD, and Apple Silicon.

Fastest fix: the slow first token is almost always one of two things — the model isn’t in memory yet (disk read + VRAM copy), or it was evicted after going idle. Pin it warm and pre-warm it once at startup:

# Pin the model in memory so it never unloads (Ollama).
# The variable is read by the server, so set it before "ollama serve" starts:
export OLLAMA_KEEP_ALIVE=-1
# Then preload it with an empty request (no generation, just load):
curl http://localhost:11434/api/generate -d '{"model": "llama3.1:8b"}'

That removes the eviction-driven cold start entirely. The rest of this guide diagnoses the first-ever cold start (the one even a warm-keep can’t avoid) and shrinks it.

What “cold start” actually includes

Time-to-first-token (TTFT) on a cold model is the sum of several sequential phases, each with its own cause and fix:

Phase	What happens	Typical cost (8B Q4)	Typical cost (70B Q4)
Disk read	GGUF weights read from SSD/HDD into RAM	2-8s NVMe / 30-90s HDD	10-25s NVMe / 4-10 min HDD
VRAM copy	Weights copied host RAM to GPU VRAM	1-3s	5-15s
Shader JIT	CUDA/Metal kernels compiled on first use	5-30s first run, then cached	5-30s first run, then cached
KV cache alloc	Contiguous VRAM block reserved for context	under 1s	2-8s
Prefill	Input prompt processed before token 1	1-3s for 2k tokens	15-30s for 2k tokens

Once the model is loaded and the shader cache is warm, every phase except prefill is skipped on the next request — which is why the second prompt is fast.

Common causes

Ordered by impact on TTFT, highest first.

1. The model was evicted — it has to reload

This is the most common “random slow first token.” Ollama and LM Studio both unload idle models by default:

Ollama unloads a model 5 minutes after its last request (server-wide OLLAMA_KEEP_ALIVE, default 5m, as of June 2026). Come back after 10 minutes and the next request triggers a full reload.
LM Studio gives JIT-loaded models a 60-minute idle TTL by default; models you load manually via lms load have no TTL and stay resident until you unload them.

How to spot it: run ollama ps before you send the prompt. If the model is not listed, it was evicted and the next request pays the full cold start. In LM Studio, check the loaded-models list in the Developer tab.
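
The same residency check can be scripted. The sketch below assumes Ollama's HTTP API exposes the ollama ps listing at GET /api/ps as a JSON object with a top-level models array whose entries carry a name field. The helper names (loaded_models, is_resident, fetch_ps) are hypothetical, chosen for illustration only.

```python
import json
import urllib.request


def loaded_models(ps_response: dict) -> list[str]:
    """Return the model names listed in an /api/ps style response."""
    return [m.get("name", "") for m in ps_response.get("models", [])]


def is_resident(ps_response: dict, model: str) -> bool:
    """True if the model is currently loaded, so the next request is warm."""
    return model in loaded_models(ps_response)


def fetch_ps(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the running-model list from a local Ollama server (assumed URL)."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp)
```

Calling is_resident(fetch_ps(), "llama3.1:8b") right before a prompt tells you whether that prompt will pay the reload cost.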

2. Weights not in the OS file cache (first read from disk)

On a fresh boot the OS page cache is empty, so the GGUF is read from disk. A 40 GB Q4_K_M file off an NVMe SSD (about 3-5 GB/s) takes roughly 8-15 seconds; off a SATA SSD (about 500 MB/s) it takes 80+ seconds; off a spinning HDD (about 150 MB/s) it can take several minutes. This disk-read phase dominates the very first cold start.

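How to spot it: run time ollama run llama3.1:8b "a", unload the model with ollama stop llama3.1:8b, then run the same command again. The unload matters: without it the second run is fast simply because the model is still resident, which tells you nothing about the disk. If the second run is 5-10x faster, a file-cache miss (disk I/O) was the dominant cost. Find the backing device with df -h ~/.ollama/models, then check whether it is NVMe, SATA, or a spinning disk (on Linux, lsblk -d -o NAME,ROTA,TRAN).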
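
The disk-read estimates above are plain size-over-throughput arithmetic. A minimal sketch that reproduces them, using the approximate sequential-read rates quoted in this section (the function name read_seconds is illustrative, and real throughput varies with the drive and filesystem):

```python
def read_seconds(size_gb: float, throughput_gb_per_s: float) -> float:
    """Best-case time to read a file sequentially at a given throughput."""
    return size_gb / throughput_gb_per_s


# Approximate sequential-read rates from the text, in GB/s.
TIERS = {"NVMe": 3.0, "SATA SSD": 0.5, "HDD": 0.15}

for tier, rate in TIERS.items():
    print(f"40 GB from {tier}: about {read_seconds(40, rate):.0f} s")
```

Swap in your own file size and a measured throughput to get a lower bound for the disk-read phase on your machine.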

3. CUDA shader JIT compilation (NVIDIA)

Prebuilt CUDA builds of llama.cpp (which Ollama bundles) ship compiled SASS for common architectures, but if your GPU’s architecture isn’t among them, the driver JIT-compiles the embedded PTX on first launch, which takes anywhere from a few hundred ms to several seconds. The result is cached in ~/.cache/nvidia/ComputeCache/ on Linux (location set by CUDA_CACHE_PATH, size capped by CUDA_CACHE_MAXSIZE; the default cap depends on driver and platform, 256 MiB on older drivers and embedded platforms and 1 GiB on current desktop and server drivers). Subsequent runs skip it.

How to spot it: during the first load, watch nvidia-smi — JIT compilation shows GPU utilization sitting at 10-30% for several seconds (not 0%, not 90%+). Check whether ~/.cache/nvidia/ComputeCache/ grows on first startup. Setting CUDA_CACHE_DISABLE=1 forces recompilation every run, which is a quick way to confirm the cost.

4. Metal shader compilation (Apple Silicon)

On macOS the Metal backend compiles shaders on first use, costing 5-15 seconds on M2/M3. The result is cached, so it’s a one-time hit per Ollama/LM Studio version — updating the runtime invalidates the cache and you pay it again.

How to spot it: look for shader-compilation lines in the Ollama log, and watch Activity Monitor — first-load shader compilation shows a burst of CPU (not GPU) activity. If a runtime update made cold start slow again, this is why.

5. KV cache allocation delay

A 70B model at 8192 context needs a large contiguous VRAM block (roughly 4-8 GB) for the KV cache. On systems where VRAM is fragmented or shared with the display, that allocation can stall for several seconds.

How to spot it: run nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader before and right after model load. A large jump in memory.used spread over 5-10 seconds is the KV-cache allocation phase.
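
To judge how large the allocation should be for your own model, the usual back-of-the-envelope formula is two tensors (K and V) per layer, each of size KV heads x head dimension x context length. The sketch below is an estimate under stated assumptions: the shape numbers are hypothetical, not read from any specific GGUF, and the 2-byte default assumes an f16 cache. The function name kv_cache_bytes is illustrative.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size; the leading 2 covers one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Hypothetical shape for illustration; read the real values from your model's metadata.
size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, n_ctx=8192)
print(f"Estimated KV cache: {size / 2**30:.2f} GiB")
```

The result scales linearly with context length and with the number of KV heads, so doubling either doubles the block that has to be reserved.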

6. mmap lazy loading causes page faults on the first prefill

llama.cpp memory-maps the GGUF by default (--mmap, enabled), so startup looks instant but the weights aren’t actually read until they’re touched. The first inference then triggers a storm of page faults, so the first few tokens crawl and only later tokens hit full speed.

How to spot it: perf stat -e page-faults,major-faults ./llama-cli -m model.gguf -p "hello" -n 10. Tens of thousands of major faults during the first generation point at mmap lazy loading. Fix it with --no-mmap (full read at load) plus --mlock.

7. Long system prompt = long prefill

If the first request carries a 2,000-token system prompt, prefill (processing all input before token 1) scales with input length: about 1-3s on an 8B model, 15-30s on a 70B. This is real latency even when the model is fully warm.

How to spot it: send a minimal prompt ("a") to measure base TTFT, then send your full prompt. The difference is prefill. With llama-server, cache_prompt is true by default, so a repeated identical prefix is reused instead of reprocessed.
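
One way to take that measurement is to time the first streamed chunk for both prompts and subtract. This is a sketch, not a benchmark harness: the helper names are hypothetical, the stream is passed in as a factory so any client can be used, and the first chunk is only an approximation of the first token.

```python
import time
from typing import Callable, Iterable


def time_to_first_chunk(stream_factory: Callable[[], Iterable[object]]) -> float:
    """Seconds from issuing the request until the first streamed chunk arrives."""
    start = time.perf_counter()
    for _ in stream_factory():
        return time.perf_counter() - start
    raise RuntimeError("stream ended without producing any chunk")


def prefill_estimate(base_ttft_s: float, full_ttft_s: float) -> float:
    """Extra TTFT attributable to the long prompt, clamped at zero."""
    return max(0.0, full_ttft_s - base_ttft_s)


# Example wiring with the requests library (assumed client, streaming enabled):
#   stream = lambda: requests.post(url, json=payload, stream=True).iter_lines()
#   base = time_to_first_chunk(stream)
```

Run both measurements against an already warm model, otherwise load time leaks into the difference.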

Shortest path to fix

Step 1: Pin the model in memory (kills eviction cold starts)

# Keep every model loaded indefinitely (server-wide default)
export OLLAMA_KEEP_ALIVE=-1
ollama serve &

# Or per-request, overriding the server default:
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "hello", "keep_alive": -1}'

# Or a fixed window, e.g. 1 hour:
export OLLAMA_KEEP_ALIVE=1h

For a persistent systemd install, add Environment="OLLAMA_KEEP_ALIVE=-1" under [Service] in /etc/systemd/system/ollama.service, then sudo systemctl daemon-reload && sudo systemctl restart ollama. On macOS, when Ollama runs as the desktop app, Ollama's documented route is launchctl setenv OLLAMA_KEEP_ALIVE "-1" followed by restarting the app.

Note: keep_alive: -1 is runtime state only — after a server restart the model is unloaded again, so you still need Step 2. Also, if VRAM fills while a model is pinned, Ollama returns an error rather than evicting the pinned model.

For LM Studio, set the JIT TTL in Developer tab > Server Settings, or load the model manually with lms load <model> (no TTL) so it stays resident.

Step 2: Pre-warm the model at service startup

#!/bin/bash
# warmup.sh — run after Ollama starts (e.g. from ExecStartPost or app boot)
until curl -s http://localhost:11434/api/version > /dev/null; do sleep 1; done
# An empty request loads the model without generating anything:
curl -s http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "keep_alive": -1}' > /dev/null
echo "Model warmed up"

Sending the request with no prompt field loads the model into VRAM (and compiles shaders) without running a generation, so the first real user never sees the cold start.

Step 3: Pre-load weights into the OS page cache

# Populate the OS file cache so the disk-read phase is already done.
# Find the model blob first:
ls -lh ~/.ollama/models/blobs/

# Then warm it (vmtouch is cleanest):
vmtouch -t "$HOME/.ollama/models/blobs/sha256-<model-hash>"
# No vmtouch? A plain read works too.
# $HOME is used instead of ~ because not every shell expands ~ after if=.
dd if="$HOME/.ollama/models/blobs/sha256-<model-hash>" of=/dev/null bs=4M status=progress

# Verify it's resident:
vmtouch "$HOME/.ollama/models/blobs/sha256-<model-hash>"

The file stays cached until memory pressure evicts it. Pair this with Step 1 so the model also stays in VRAM.
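
If neither vmtouch nor dd is at hand, any sequential read has the same effect. A minimal Python sketch, assuming the OS keeps recently read pages in its file cache (which it does until memory pressure evicts them). The helper name warm_page_cache is hypothetical.

```python
def warm_page_cache(path: str, chunk_bytes: int = 4 * 1024 * 1024) -> int:
    """Read a file start to finish, discarding the data, and return bytes read.

    The read pulls the file through the OS page cache as a side effect.
    """
    total = 0
    # buffering=0 avoids a second copy in Python's own buffer.
    with open(path, "rb", buffering=0) as f:
        while chunk := f.read(chunk_bytes):
            total += len(chunk)
    return total
```

Pass it the blob path found with ls above; the returned byte count should match the file size.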

Step 4: Pre-compile the GPU shaders with a warmup inference

# Start the server, then fire one tiny generation to trigger and cache shader JIT.
./llama-server -m model.gguf --n-gpu-layers all --port 8080 &

# Wait for it to come up, then send a 1-token warmup:
until curl -s http://localhost:8080/health > /dev/null; do sleep 1; done
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "a"}], "max_tokens": 1}' > /dev/null
echo "Shaders compiled and cached"

On NVIDIA this populates ~/.cache/nvidia/ComputeCache/; on Apple Silicon it warms the Metal shader cache. Do this once per machine and once after every runtime upgrade.

Step 5: Reuse the system-prompt prefill instead of reprocessing it

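import requests

# Placeholder: substitute your real, unchanging system prompt.
FIXED_SYSTEM_PROMPT = "You are a concise assistant."

# llama-server reuses the KV cache for an identical prefix automatically:
# cache_prompt defaults to true. The first request pays the prefill cost for
# the system prompt; later requests with the same prefix skip it.
response = requests.post("http://localhost:8080/v1/chat/completions", json={
    "messages": [
        {"role": "system", "content": FIXED_SYSTEM_PROMPT},
        {"role": "user", "content": "first question"},
    ],
    "cache_prompt": True,   # already the default in current builds; shown for clarity
}, timeout=300)
response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body
print(response.json()["choices"][0]["message"]["content"])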

To persist that KV cache across server restarts, start llama-server with --slot-save-path /path/to/cache. The flag only enables the slot save and restore actions (POST /slots/<id>?action=save and ?action=restore); nothing is written to disk until you call them. Enable cross-request reuse with --cache-reuse N (min reusable chunk size; default 0 = off). Host-memory prompt caching is governed by -cram, --cache-ram N (default 8192 MiB, -1 = no limit, 0 = disable). On a 16 GB laptop, set --cache-ram 0 so the cache doesn’t starve the live KV cache.

Step 6: Eliminate mmap page faults with --no-mmap and --mlock

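# --no-mmap: read the whole model at load (slower load, no page-fault stalls)
# --mlock:   pin weights in RAM, prevent swap/compression
# Comments sit above the command because text after a trailing backslash breaks the continuation.
./llama-server \
  -m model.gguf \
  --no-mmap \
  --mlock \
  --n-gpu-layers all \
  --port 8080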

--no-mmap makes the load itself slower (the whole file is read up front) but removes the first-prefill page-fault stall. --mlock forces the system to keep the model in RAM rather than swapping or compressing it, so a brief idle period doesn’t push pages out. Use both together for latency-sensitive single-user setups.

How to confirm it’s fixed

ollama ps (or the LM Studio loaded-models list) shows the model resident before you prompt.
Cold first request and a warm second request now have nearly the same TTFT — the gap that was 30-120s is gone.
With ollama run --verbose, load duration is near zero on the warmed model (it was the bulk of TTFT before); a large remaining prompt eval duration means your prompt is just long (cause 7), not that the model is cold.
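
The same split is available programmatically. The sketch below assumes a non-streaming Ollama /api/generate response carries load_duration, prompt_eval_duration, and eval_duration as integer nanoseconds; the helper names and the sample numbers are illustrative, not measured.

```python
NS_PER_S = 1_000_000_000


def split_durations(resp: dict) -> dict:
    """Convert the nanosecond duration fields of a response into seconds."""
    return {
        "load_s": resp.get("load_duration", 0) / NS_PER_S,
        "prefill_s": resp.get("prompt_eval_duration", 0) / NS_PER_S,
        "generate_s": resp.get("eval_duration", 0) / NS_PER_S,
    }


def dominant_phase(resp: dict) -> str:
    """Name the phase that took the longest in this response."""
    phases = split_durations(resp)
    return max(phases, key=phases.get)


# Illustrative cold-start response: load time dwarfs everything else.
cold = {"load_duration": 42_000_000_000,
        "prompt_eval_duration": 300_000_000,
        "eval_duration": 900_000_000}
print(split_durations(cold), dominant_phase(cold))
```

On a warmed model load_s should be close to zero; if prefill_s is then the largest value, the prompt length is the remaining cost.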

Prevention

Set OLLAMA_KEEP_ALIVE=-1 (or a window longer than your idle gaps) in production so models aren’t evicted between requests.
Add a startup warmup request (empty prompt) before the service accepts user traffic.
Store GGUFs on NVMe, never SATA SSD or HDD — at 40+ GB this alone cuts the first cold start by 60-80%.
In any UI, show a “loading model” indicator while the model is cold so users don’t think it froze.
Monitor TTFT separately from inter-token latency; otherwise a cold start looks like a generic latency spike.
On shared/multi-tenant servers, run a keepalive heartbeat (a periodic empty request) to keep the model warm.
Bake the shader-warmup inference into your deploy/CI step so the ComputeCache/Metal cache is primed before first traffic.
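
For the monitoring point above, keeping TTFT and inter-token latency as separate metrics only requires timestamps for the request start and for each token's arrival. A minimal sketch; the function name latency_metrics and the sample timestamps are hypothetical, and how you collect the timestamps depends on your client.

```python
from statistics import mean


def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """Split one response into TTFT and mean inter-token latency (seconds)."""
    if not token_times:
        raise ValueError("no tokens were received")
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {
        "ttft_s": token_times[0] - request_start,
        "mean_itl_s": mean(gaps) if gaps else 0.0,
    }


# Illustrative cold start: 45 s to the first token, then one token every 25 ms.
print(latency_metrics(0.0, [45.0, 45.025, 45.05, 45.075]))
```

Reported this way, a cold start shows up as a TTFT outlier while inter-token latency stays flat, instead of inflating a single blended latency number.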

FAQ

Q: Why is the second request always fast even when the first crawled? A: After the first request the weights are in VRAM, the CUDA/Metal shaders are compiled and cached, and the GGUF is in the OS page cache. All three cold-start costs are gone for as long as the model stays loaded — which is exactly what OLLAMA_KEEP_ALIVE=-1 guarantees.

Q: How do I separate model-load time from prefill time? A: Use ollama run --verbose: load duration is the model-load (cold-start) cost and prompt eval duration is the prefill cost. If load duration is large, the model was cold; if prompt eval duration is large, your prompt is long. With llama-server, warm the model first, then time a second request — that second TTFT is pure prefill.

Q: Does --n-gpu-layers change cold-start time? A: Yes. With --n-gpu-layers all (or auto resolving to all), every layer is copied from RAM into VRAM, which is an extra cold-start step. Offloading fewer layers shrinks that copy but slows inference. The default is auto as of June 2026; pin an explicit number only if you’re tuning a specific VRAM budget.

Q: Is cold start faster on Apple Silicon than NVIDIA? A: Usually, yes. Apple’s unified memory means there’s no separate RAM-to-VRAM copy step. Metal shader compilation adds 5-15s on the first run but is then cached. A 13B Q4_K_M cold start on an M3 Max is typically 3-8s versus 15-30s on a 4090 — but a runtime update re-triggers the Metal compile.

Q: Can I persist the loaded state across a server or container restart? A: Not the VRAM state itself. The practical answer is to keep the process running (keep_alive: -1 plus a Docker restart: unless-stopped policy) and re-run the warmup request on boot. For prefill specifically, llama-server --slot-save-path lets you save a slot’s KV cache to disk and restore it after a restart, so an identical system prompt isn’t reprocessed.

Tags: #local-llm #ollama #Troubleshooting