Local Model Very Slow on First-Token After Cold Start

Local LLM takes 30-120 seconds to produce the first token after loading. Diagnose model loading, KV cache allocation, and GPU warmup to reduce cold-start latency.

You run ollama run llama3.1:8b "say hello" on a freshly booted workstation. The cursor blinks for 45 seconds before the first token appears; after that, tokens arrive at 40+ t/s. The cold-start penalty is real and consistent: it shows up across Ollama, llama-server, and LM Studio on NVIDIA, AMD, and Apple Silicon hardware. It’s not a bug but a combination of model weight loading, GPU VRAM allocation, CUDA/Metal JIT compilation of GPU kernels (shaders), and KV cache initialization, all of which must finish before the first forward pass can begin.

Common causes

Ordered by impact on time-to-first-token, highest first.

1. Model weights not cached in RAM (cold file cache)

On a fresh boot, the OS file cache is empty. Loading a 40 GB Q4_K_M GGUF (roughly a 70B model) from an NVMe SSD takes 10-20 seconds even at 3 GB/s read speed; from a SATA SSD (500 MB/s) it takes 80+ seconds. Smaller models scale down proportionally, but for large models this disk I/O phase alone can account for most of the cold-start latency.

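How to spot it: After a fresh boot, run time ollama run llama3.1:8b "a" 2>/dev/null, unload the model (ollama stop llama3.1:8b in recent Ollama versions), then run the timed command again. The unload matters: without it, the second run is fast simply because the model is still resident in VRAM. If the second run is still 5-10x faster, disk I/O (a file cache miss) is the dominant cold-start cost.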
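The load-time figures above are just file size divided by sequential read throughput. A minimal sketch of that arithmetic follows; the function name is illustrative, and the result is a lower bound that ignores filesystem overhead and any partially warm cache.

```python
def estimate_read_seconds(model_gb: float, read_gb_per_s: float) -> float:
    """Lower bound on load time: file size divided by sequential read speed."""
    if read_gb_per_s <= 0:
        raise ValueError("read throughput must be positive")
    return model_gb / read_gb_per_s


# The two cases from the text: a 40 GB file on NVMe vs. a SATA SSD
for label, speed in [("NVMe at 3 GB/s", 3.0), ("SATA SSD at 0.5 GB/s", 0.5)]:
    print(f"40 GB from {label}: about {estimate_read_seconds(40, speed):.0f} s")
```

Plugging in your own model size and measured disk throughput gives a quick sanity check on whether disk I/O can plausibly explain the delay you see.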

2. CUDA shader JIT compilation

If the binary does not ship precompiled code for your exact GPU architecture, NVIDIA’s driver JIT-compiles each kernel (the CUDA counterpart of a shader) from PTX the first time it is loaded. For llama.cpp’s CUDA backend, this takes 5-30 seconds on the first run; the result is cached in ~/.nv/ComputeCache/ (the default location on Linux), so subsequent runs skip this step.

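How to spot it: Compare the size of the CUDA cache directory before and after the first startup, for example with du -sh ~/.nv/ComputeCache. If it grows during startup, kernels were JIT-compiled. Make sure CUDA_CACHE_DISABLE is unset or 0; with the cache disabled, every start pays the compilation cost again. CUDA_LAUNCH_BLOCKING=1 is not needed for this check, since it only makes kernel launches synchronous.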
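The cache-growth check can also be scripted. The sketch below assumes the default Linux cache location (~/.nv/ComputeCache); the helper name is made up for illustration. Call it once before the first startup and once after the first response, then compare the two numbers.

```python
import os
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all files under root (0 if root does not exist)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += (Path(dirpath) / name).stat().st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


# Assumed default location of the CUDA compute cache on Linux
CUDA_CACHE_DIR = Path.home() / ".nv" / "ComputeCache"
print(f"compute cache size: {dir_size_bytes(CUDA_CACHE_DIR) / 1e6:.1f} MB")
```

If the size is unchanged between the two measurements, JIT compilation is probably not what is slowing the first run.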

3. Metal shader compilation on macOS

On Apple Silicon, Metal shaders are compiled on first use and cached in /var/folders/.../com.apple.metal.default.metallib.cache. LM Studio and Ollama on macOS show the same 10-20 second first-run JIT overhead.

How to spot it: Delete the Metal cache directory and rerun — if the cold-start is significantly longer after cache deletion, shader compilation is the cause.

4. Ollama model keepalive expired — model unloaded from VRAM

Ollama unloads models from VRAM after 5 minutes of inactivity by default (OLLAMA_KEEP_ALIVE=5m). If you come back after 10 minutes, the model has been evicted and the next request triggers a full reload. This looks like a “random” slow first token.

How to spot it: Run ollama ps before sending your prompt. If the model is not listed (not loaded), Ollama will reload it on the next request, causing a cold-start delay.
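An application can make the same check programmatically before deciding whether to show a loading indicator. This is a sketch, assuming Ollama's /api/ps endpoint returns a JSON object with a "models" array whose entries carry a "name" field; the function names are illustrative.

```python
import json
import urllib.request


def loaded_models(ps_payload: dict) -> list[str]:
    """Extract model names from a /api/ps-style response body."""
    return [m.get("name", "") for m in ps_payload.get("models", [])]


def is_model_loaded(model: str, base_url: str = "http://localhost:11434") -> bool:
    """Ask a running Ollama server whether `model` is currently resident."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        payload = json.load(resp)
    return model in loaded_models(payload)
```

With a server running, is_model_loaded("llama3.1:8b") returning False means the next request will pay the reload cost.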

5. KV cache memory allocation delay

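The KV cache for a 70B model at 8192 context length requires allocating a multi-gigabyte block of VRAM up front (about 2.5 GiB at fp16 for a Llama-3-style 70B with grouped-query attention, and considerably more for models without it). On systems with fragmented VRAM, or when VRAM is shared with a display adapter, this allocation can take several seconds.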

How to spot it: Run nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader repeatedly (for example with -l 1) while the model loads. The weights account for the first and largest rise in memory.used; a further jump that drags on for 5-10 seconds after the weights are in place indicates the KV cache allocation phase.
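To know how large a jump to expect, the KV cache size can be estimated from the model's shape. The sketch below uses the standard formula (keys plus values, per layer, per KV head, per token); the example shape values (80 layers, 8 KV heads, head dimension 128) are assumptions for a Llama-3-style 70B, so check your model's metadata before relying on them.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold K and V for n_ctx tokens across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Assumed shape for a Llama-3-style 70B at 8192 context, fp16 cache
size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, n_ctx=8192)
print(f"{size / 2**30:.1f} GiB")
```

Without grouped-query attention (64 KV heads instead of 8) the same call yields eight times as much, which is why the figure varies so widely between model families.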

6. Prefill latency for long system prompts

If your first request includes a 2,000-token system prompt, the prefill phase (processing all input tokens before generating the first output token) takes proportionally longer. A 2k-token prefill on an 8B model takes 1-3 seconds; the same prompt on a 70B model can take 15-30 seconds.

How to spot it: Send a minimal test prompt (“a”) to measure base first-token latency, then send your full prompt. The difference is prefill latency.
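The comparison needs a consistent way to time the first token. The following sketch assumes Ollama's streaming /api/generate endpoint, which emits one JSON object per line with the generated text in a "response" field; the function names are illustrative. The clock starts before the request is sent so that any model-load time is included.

```python
import json
import time
import urllib.request
from typing import Callable, Iterable, Optional


def seconds_until_first_token(lines: Iterable[bytes], started_at: float,
                              clock: Callable[[], float] = time.perf_counter
                              ) -> Optional[float]:
    """Return elapsed seconds at the first chunk that carries non-empty text."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        if json.loads(raw).get("response"):
            return clock() - started_at
    return None  # stream ended without producing any text


def ollama_ttft(model: str, prompt: str,
                base_url: str = "http://localhost:11434") -> Optional[float]:
    """Stream one generate request and measure its time-to-first-token."""
    body = json.dumps({"model": model, "prompt": prompt}).encode()
    req = urllib.request.Request(f"{base_url}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    started_at = time.perf_counter()  # start before sending so load time counts
    with urllib.request.urlopen(req) as resp:
        return seconds_until_first_token(resp, started_at)
```

Calling ollama_ttft("llama3.1:8b", "a") and then ollama_ttft with your full prompt, both against an already loaded model, gives the two numbers whose difference is the prefill cost.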

Shortest path to fix

Step 1: Keep the model warm in VRAM with OLLAMA_KEEP_ALIVE

# Keep the model loaded indefinitely
export OLLAMA_KEEP_ALIVE=-1
ollama serve &

# Or set per-request in the API
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "hello", "keep_alive": -1}'

# Or for a specific duration (e.g., 1 hour)
export OLLAMA_KEEP_ALIVE=1h

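Add Environment="OLLAMA_KEEP_ALIVE=-1" to the [Service] section of /etc/systemd/system/ollama.service for a persistent setting, then run systemctl daemon-reload and systemctl restart ollama so it takes effect.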

Step 2: Pre-warm the model at service startup

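#!/bin/bash
# warmup.sh: run after Ollama starts
set -euo pipefail

# Wait until the API answers
until curl -sf http://localhost:11434/api/version > /dev/null; do sleep 1; done

# Load the model and keep it resident; -f makes curl fail on HTTP errors
# (for example, when the model has not been pulled yet)
curl -sf http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": " ", "stream": false, "keep_alive": -1}' \
  > /dev/null
echo "Model warmed up"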

Add this to your Ollama service’s ExecStartPost or run it from your application startup.

Step 3: Pre-load model weights into the OS page cache

# "Touch" the model file to populate the OS file cache
vmtouch -t "$HOME/.ollama/models/blobs/sha256-<model-hash>"
# Or without vmtouch ($HOME instead of ~, which not every shell expands after if=):
dd if="$HOME/.ollama/models/blobs/sha256-<model-hash>" of=/dev/null bs=4M status=progress

# The file stays in cache until memory pressure evicts it
# Check cache status
vmtouch "$HOME/.ollama/models/blobs/sha256-<model-hash>"

Step 4: Pre-compile CUDA shaders by running a warmup inference

# Run a short warmup prompt at server startup to trigger shader compilation
./llama-server \
  -m model.gguf \
  --n-gpu-layers 80 \
  --port 8080 &

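# Wait until the model has finished loading (/health returns 200),
# then send the warmup request. A fixed sleep is too short for large models.
until curl -sf http://localhost:8080/health > /dev/null; do sleep 1; done
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": " "}], "max_tokens": 1}' \
  > /dev/null
echo "CUDA shaders compiled and cached"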

Step 5: Shrink or cache the system prompt for latency-sensitive requests

import requests

# Assumes llama-server is listening on localhost:8080, as started in Step 4
BASE_URL = "http://localhost:8080"
FIXED_SYSTEM_PROMPT = "..."  # placeholder: your long, unchanging system prompt

# llama-server can reuse the KV cache across requests that share an identical
# prefix. The first request populates the cache for the system prompt;
# later requests with the same system prompt skip most of the prefill.
response1 = requests.post(f"{BASE_URL}/v1/chat/completions", json={
    "messages": [
        {"role": "system", "content": FIXED_SYSTEM_PROMPT},
        {"role": "user", "content": "first question"},
    ],
    "cache_prompt": True,  # llama-server extension
}, timeout=300)
response1.raise_for_status()

Step 6: For llama.cpp, use --mlock to keep weights resident in RAM

# --mlock pins model weights in RAM and prevents eviction.
# (A comment after a trailing backslash would break the line continuation.)
./llama-server \
  -m model.gguf \
  --mlock \
  --n-gpu-layers 80 \
  --port 8080

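--mlock increases cold-start time slightly (the kernel must commit all pages up front) but prevents the weights from being paged out, which eliminates paging-induced stalls on the first request after an idle period. If the process’s locked-memory limit (ulimit -l) is smaller than the model, the lock fails and llama.cpp prints a warning instead.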
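Because mlock is subject to the process's RLIMIT_MEMLOCK, it can help to check that limit before launching. A small sketch using Python's standard resource module (Unix only) follows; the helper names are illustrative, and the model size is whatever your GGUF file occupies on disk.

```python
import resource


def mlock_limit_allows(model_bytes: int, soft_limit: int) -> bool:
    """True if a RLIMIT_MEMLOCK soft limit is large enough to pin model_bytes."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= model_bytes


def current_memlock_soft_limit() -> int:
    """Soft RLIMIT_MEMLOCK of this process, in bytes (or RLIM_INFINITY)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return soft
```

For a 40 GB model, mlock_limit_allows(40 * 10**9, current_memlock_soft_limit()) returning False suggests raising the limit (for example via LimitMEMLOCK= in a systemd unit) before relying on --mlock.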

Prevention

  • Set OLLAMA_KEEP_ALIVE=-1 in production deployments to prevent model eviction between requests.
  • Add a startup warmup script to your application that runs a no-op prompt before accepting user requests.
  • Store GGUF files on NVMe (not SATA SSD or HDD) — the difference in cold-start time is 60-80% at 40+ GB model sizes.
  • For interactive applications, show a “model loading” indicator when the model is cold so users don’t think the app is frozen.
  • Track time-to-first-token separately from inter-token latency in your monitoring — cold-start TTFT will look like a latency spike if not separated.
  • On shared servers, use a model keepalive daemon that sends periodic heartbeat requests to prevent eviction.
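  • For llama-server deployments, populate the CUDA compute cache by running a warmup inference on the target host as part of each deployment. The cache lives in the home directory of the user running the server, so compiling it on a separate CI runner does not help.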
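The keepalive daemon mentioned in the list can be very small. The sketch below assumes Ollama's /api/generate endpoint accepts a request with no prompt and simply loads the model and refreshes its keep_alive timer; the function names, the 10-minute keep_alive value, and the interval are illustrative choices.

```python
import json
import time
import urllib.request
from typing import Callable


def send_keepalive(model: str, base_url: str = "http://localhost:11434") -> None:
    """Send a prompt-less generate request that only refreshes keep_alive."""
    body = json.dumps({"model": model, "keep_alive": "10m"}).encode()
    req = urllib.request.Request(f"{base_url}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=300) as resp:
        resp.read()  # drain the response so the connection closes cleanly


def run_heartbeat(send: Callable[[], None], interval_s: float, beats: int,
                  sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send() `beats` times, sleeping between calls; return failure count."""
    failures = 0
    for i in range(beats):
        try:
            send()
        except OSError:  # urllib's URLError and HTTPError are OSError subclasses
            failures += 1  # server down or restarting; try again next beat
        if i < beats - 1:
            sleep(interval_s)
    return failures
```

A call such as run_heartbeat(lambda: send_keepalive("llama3.1:8b"), interval_s=240, beats=15) keeps the model resident for about an hour, with the interval chosen to stay under the default 5-minute eviction window.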

FAQ

Q: Why is the second request always fast even after the first was slow? A: After the first request, the model weights are in VRAM (GPU-side), the CUDA shaders are compiled and cached, and the OS page cache contains the model file. All three cold-start costs are eliminated for subsequent requests as long as the model stays loaded.

Q: How can I measure prefill latency vs. model-load latency separately? A: Load the model with a warmup request, then time a second request with your target prompt. The second request’s time-to-first-token is pure prefill latency (model weights are already in VRAM). Compare against the first request’s TTFT to isolate model-load overhead.
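The two-request method reduces to one subtraction. A minimal sketch, with illustrative names, that takes the cold and warm time-to-first-token for the same prompt and attributes the difference to model loading:

```python
def split_cold_start(cold_ttft_s: float, warm_ttft_s: float) -> dict:
    """Attribute cold TTFT to model load vs. prefill via the two-request method."""
    load_s = max(cold_ttft_s - warm_ttft_s, 0.0)  # clamp measurement noise
    return {"load_s": load_s, "prefill_s": warm_ttft_s}


# Example: 45 s on the cold request, 2 s for the same prompt once warm
print(split_cold_start(45.0, 2.0))
```

One caveat: if the server reuses a cached prompt prefix, the warm measurement understates prefill, so either use a prompt that differs from the warmup prompt from its first token or disable prompt caching for the measurement.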

Q: Does --n-gpu-layers affect cold-start time? A: Yes. With --n-gpu-layers 99 (all layers on GPU), the model copy from system RAM to VRAM is an additional cold-start step. With fewer GPU layers (CPU offload), more of the model stays in system RAM and the GPU copy is smaller. However, this trades cold-start time for slower inference speed — the optimal setting depends on your VRAM capacity.

Q: On macOS with Apple Silicon, is cold-start faster than on NVIDIA? A: Generally yes — Apple Silicon uses unified memory shared between CPU and GPU, so there’s no separate CPU RAM to GPU VRAM copy step. Metal shader compilation adds 5-15 seconds on first run, but is then cached. Cold starts on M3 Max with a 13B Q4_K_M model are typically 3-8 seconds vs. 15-30 seconds on a 4090.

Tags: #local-llm #ollama #Troubleshooting