LM Studio OOMs While Loading a Model

LM Studio crashes or shows an out-of-memory error when loading a model. Diagnose VRAM limits, quantization choice, and context size to load successfully.

You open LM Studio 0.3.5 on a MacBook Pro with an M3 Max (96 GB unified memory) or a Windows workstation with a 4080 (16 GB VRAM) and click Load on a 70B model. The progress bar fills, then LM Studio either crashes silently or shows “Failed to load model: out of memory” with no further detail. The model file on disk is 40 GB, your system has 96 GB RAM, yet the load fails. The confusion usually comes from underestimating the KV cache allocation on top of the weight footprint, or from LM Studio defaulting to a context length that multiplies memory consumption several times over.

Common causes

Ordered by hit rate, highest first.

1. Context length set too high, inflating KV cache

LM Studio defaults to the model’s maximum context length, often 128k tokens for modern models. For a 70B model at 128k context with an fp16 KV cache, the cache alone needs roughly 40 GB when the model uses grouped-query attention (as Llama 3 70B does) and several times that without it, on top of the ~40 GB for Q4_K_M weights. The combined 80 GB or more exceeds what even large VRAM or unified-memory systems can hand to the GPU.

How to spot it: In LM Studio, open the model’s settings and check “Context Length.” If it’s 32768, 65536, or 131072, that’s likely the cause. Reduce to 4096 or 8192 and retry.
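
To see why context length dominates, the KV cache can be estimated directly. The sketch below is a rough calculation, not LM Studio's own accounting. The function name kv_cache_bytes is hypothetical, and the architecture figures (80 layers, 8 KV heads, head dimension 128) are assumed values for a Llama-3-style 70B model with grouped-query attention; check the model card for the real numbers.

```python
def kv_cache_bytes(context_length, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size: one key and one value vector per layer, per KV head, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # 2 = keys + values
    return context_length * per_token


# Assumed architecture for a Llama-3-style 70B model (example values, not read from a file).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

for ctx in (4096, 8192, 32768, 131072):
    gib = kv_cache_bytes(ctx, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.2f} GiB of fp16 KV cache")
```

Under these assumptions the cache grows from 1.25 GiB at 4096 tokens to 40 GiB at 131072 tokens. Models without grouped-query attention have more KV heads and need proportionally more.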

2. Quantization too large for available VRAM

A Q8_0 70B model is roughly 75 GB. A Q4_K_M 70B is roughly 42 GB. If you’re on a 16 GB GPU, even Q4_K_M exceeds VRAM and LM Studio falls back to CPU/RAM loading, which can OOM if system RAM is also limited.

How to spot it: Check the model file size. On a 16 GB VRAM system, only models up to ~12 GB fit with headroom for the KV cache (Q4_K_M 13B ≈ 8 GB, Q4_K_M 8B ≈ 5 GB).
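
If the file has not been downloaded yet, the weight footprint can be approximated from the parameter count and the quantization's bits per weight. This is a minimal sketch: weight_size_gb is a hypothetical helper, the Q8_0 and Q4_0 figures follow from their block layouts, and the K-quant and IQ values are rough averages that vary by model.

```python
# Approximate bits per weight for common GGUF quantizations.
# Q8_0 and Q4_0 follow from their block layouts; the other figures are
# assumed averages for illustration and differ slightly between models.
BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
    "Q4_0": 4.5,
    "IQ4_XS": 4.25,
}


def weight_size_gb(params_billions, quant):
    """Estimate the weight footprint in GB (10^9 bytes) for a given parameter count."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8


for quant in BITS_PER_WEIGHT:
    print(f"70B {quant}: ~{weight_size_gb(70, quant):.0f} GB")
```

The estimate covers weights only; the KV cache and runtime buffers come on top of it.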

3. GPU memory fragmentation from a previous session

If you loaded a model, ran inference, then tried to load a different (larger) model without fully unloading the first, VRAM may be fragmented. The driver reports 10 GB free, but the largest contiguous block is only 4 GB, so the new model’s allocation fails.

How to spot it: Unload all models in LM Studio, then check GPU memory via nvidia-smi or Activity Monitor. If VRAM usage doesn’t drop to near zero after unloading, memory is fragmented or has leaked.

4. Metal Performance Shaders memory limit on macOS

On Apple Silicon, Metal limits GPU allocations to roughly 75% of system RAM by default (adjustable in newer macOS but enforced by the GPU driver). On a 16 GB M2, that’s 12 GB for the entire GPU workload. A Q4_K_M 7B model (4.4 GB weights + KV cache) fits fine, but a 13B at Q4_K_M (8 GB weights) may push past the limit when the KV cache is added.

How to spot it: In Activity Monitor on macOS, check the “Memory” tab under “GPU Memory.” Compare against the model’s expected footprint.

5. Mlock / mmap competition with system RAM

LM Studio uses mmap to load model weights. On Windows, if the system virtual memory (page file) is small or disabled, large mmap regions fail to commit. On Linux, vm.overcommit_memory=2 with a small swap can also cause OOM at map time.

How to spot it: On Windows, open System → Advanced → Performance → Virtual Memory and check the page file size. It should be at least 1.5x the model file size.

6. Running multiple models simultaneously

LM Studio 0.3 allows loading multiple models in the “Local Server” tab. If two models are active, they compete for the same VRAM pool, and a third load attempt will OOM even if the third model alone would fit.

How to spot it: Open the “My Models” or “Local Server” tab in LM Studio and check for multiple green “Loaded” badges. Unload all unused models before loading a large one.

Shortest path to fix

Step 1: Reduce context length to the minimum you actually need

In LM Studio, select the model, go to “Model Settings” (the sliders icon), and set:

Context Length: 4096

For most chat tasks, 4096 tokens is sufficient. For RAG or long documents, try 8192. Only use 32k+ when the use case genuinely requires it — memory usage scales roughly linearly with context length.

Step 2: Switch to a more aggressive quantization

Q8_0 → Q5_K_M → Q4_K_M → Q4_0 → IQ4_XS

For a 70B model on a 16 GB GPU: Q4_K_M is the largest practical option (42 GB) but still requires CPU offload. Consider using a 13B Q4_K_M (8 GB) for pure GPU inference instead.

In LM Studio, search for the model name followed by “GGUF” and filter by quantization. Recommended targets by VRAM:

VRAM     Safe model + quant
8 GB     7B Q4_K_M or 8B Q4_K_M
16 GB    13B Q4_K_M or 13B Q5_K_M
24 GB    34B Q4_K_M or 13B Q8_0
48 GB    70B Q4_K_M

Step 3: Unload all other models and clear VRAM

# NVIDIA: confirm VRAM is free before loading
nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader

# Apple Silicon: check in Activity Monitor > GPU Memory

In LM Studio, go to “My Models” and click “Eject” on every loaded model before attempting to load the new one.
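
If you want to script this check, the CSV output of the nvidia-smi query can be parsed with a few lines. This is a sketch under assumptions: parse_gpu_memory and vram_is_clear are hypothetical helper names, the 1024 MiB threshold is arbitrary, and the sample string is illustrative rather than captured from a real machine.

```python
def parse_gpu_memory(csv_text):
    """Parse 'used, free' lines such as '612 MiB, 15764 MiB' into (used_mib, free_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, free = (int(field.split()[0]) for field in line.split(","))
        rows.append((used, free))
    return rows


def vram_is_clear(csv_text, max_used_mib=1024):
    """True when every GPU reports at most max_used_mib in use (threshold is an assumption)."""
    return all(used <= max_used_mib for used, _ in parse_gpu_memory(csv_text))


# Illustrative sample in the csv,noheader format, one line per GPU.
sample = "612 MiB, 15764 MiB\n"
print(parse_gpu_memory(sample), vram_is_clear(sample))
```

To use live data, pass in the stdout of the nvidia-smi command, for example via subprocess.run(..., capture_output=True, text=True).stdout.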

Step 4: On Windows, increase the page file

Open: Control Panel → System → Advanced System Settings → Performance → Settings → Advanced → Virtual Memory → Change.

Set a custom size with minimum = 1.5x the model file size in MB and maximum = 3x. Apply and restart.
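
The Virtual Memory dialog takes sizes in MB, so the multipliers need converting. A small sketch of that arithmetic, using a hypothetical helper page_file_range_mb and assuming 1 GB = 1024 MB:

```python
def page_file_range_mb(model_file_gb, min_factor=1.5, max_factor=3.0):
    """Return (initial_mb, maximum_mb) for the Windows custom page file fields."""
    model_mb = model_file_gb * 1024  # the dialog expects sizes in MB
    return round(model_mb * min_factor), round(model_mb * max_factor)


# Example: a 42 GB Q4_K_M 70B file.
print(page_file_range_mb(42))  # (64512, 129024)
```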

Step 5: Enable GPU offload split for large models

For models that won’t fit in VRAM alone, LM Studio can split layers between GPU and CPU RAM. In model settings, set “GPU Layers” to a value less than the total layer count (e.g., 32 out of 80 for a 70B model). This trades speed for the ability to load the model at all.

GPU Layers: 20  (start low, raise until loading fails, then step back)
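
A starting value for the slider can be estimated instead of guessed. The sketch below assumes the weights are spread evenly across layers, which is only approximately true, and reserves a fixed amount of VRAM for the KV cache and buffers. The helper name starting_gpu_layers and the 3 GB reserve are assumptions for illustration.

```python
def starting_gpu_layers(vram_gb, weights_gb, total_layers, reserve_gb=3.0):
    """Estimate how many layers fit in VRAM, assuming weights are spread evenly across layers."""
    per_layer_gb = weights_gb / total_layers
    budget_gb = max(vram_gb - reserve_gb, 0.0)  # leave room for KV cache and buffers
    return min(total_layers, int(budget_gb / per_layer_gb))


# 16 GB card, 42 GB of Q4_K_M weights, 80 layers.
print(starting_gpu_layers(16, 42, 80))
```

Treat the result as a first guess: a longer context needs a larger reserve, so lower the layer count if the load still fails.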

Prevention

  • Before downloading a model, estimate its memory footprint in GB: GGUF file size + KV cache, where the fp16 KV cache ≈ context_length × 2 (keys and values) × num_kv_heads × head_dim × num_layers × 2 bytes ÷ 1e9.
  • Set a default context length of 4096 in LM Studio’s global settings and only increase per-session when needed.
  • Keep at least one model slot empty in LM Studio’s server tab as a buffer.
  • On Windows, always configure a manual page file of at least 32 GB on your fastest drive.
  • Add ~/.cache/lm-studio or the LM Studio models folder to antivirus exclusions to avoid I/O stalls during mmap.
  • After any model eviction, wait 5 seconds and verify nvidia-smi shows VRAM returned before loading the next model.
  • Track peak VRAM usage with nvidia-smi dmon -s m during your first run so you know the true footprint.

FAQ

Q: LM Studio says the model loaded successfully but then crashes when I send the first message. Is this the same problem? A: Usually, yes. Depending on the backend, part of the memory for the KV cache and compute buffers may only be committed on the first forward pass, not at load time. If your context length is 128k and that allocation fails on the first token, you’ll see a crash at inference time rather than at load time. Reduce context length and reload.

Q: What is IQ4_XS and is it a good choice? A: IQ4_XS is an “importance-matrix” quantization that targets 4.25 bits/weight using non-uniform quantization. It is slightly smaller than Q4_K_M and comparably accurate — often the best option for fitting a 70B model in 40 GB. LM Studio 0.3.5+ supports IQ4_XS GGUF files natively.

Q: Can I use both GPU and RAM for a model that doesn’t fit in VRAM? A: Yes, via the “GPU Layers” slider. Layers assigned to the GPU run at full GPU speed; the rest run on the CPU from system RAM. Performance degrades gradually: a 70B model with 20 of its 80 layers on the GPU is slow but usable for non-interactive tasks.

Q: Why does LM Studio show “16 GB VRAM available” but still OOM on a 10 GB model? A: The VRAM “available” figure includes memory that Windows or macOS may reclaim for display or system use. The practical allocatable pool is often 80-85% of the nominal VRAM. On a 16 GB card, assume roughly 13 GB is reliably allocatable for model weights.
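
The same reasoning can be turned into a quick pre-load check. This is a rough sketch, not LM Studio's own logic: fits_in_vram is a hypothetical helper, the 0.8 usable fraction comes from the 80-85% rule of thumb above, and the 1 GB overhead for runtime buffers is an assumption.

```python
def fits_in_vram(weights_gb, kv_cache_gb, nominal_vram_gb, usable_fraction=0.8, overhead_gb=1.0):
    """Compare the estimated footprint against the portion of VRAM that is realistically allocatable."""
    needed_gb = weights_gb + kv_cache_gb + overhead_gb
    usable_gb = nominal_vram_gb * usable_fraction
    return needed_gb <= usable_gb, needed_gb, usable_gb


# A 10 GB model with an assumed 2.5 GB KV cache on a 16 GB card.
ok, needed, usable = fits_in_vram(weights_gb=10, kv_cache_gb=2.5, nominal_vram_gb=16)
print(f"needs {needed:.1f} GB, usable {usable:.1f} GB, fits: {ok}")
```

With these example numbers the model needs 13.5 GB against 12.8 GB usable, which is why a 10 GB file can still fail on a 16 GB card.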

Tags: #local-llm #lmstudio #Troubleshooting