LM Studio Out of Memory When Loading a Model

LM Studio crashes or shows out of memory loading a GGUF. Fix it fast by cutting context length, enabling Flash Attention, and tuning GPU offload — with a VRAM sizing table.

Published: May 25, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

You click Load in LM Studio and the progress bar fills, then the app either crashes silently or shows “Failed to load model: out of memory”. In the logs you may see a more specific line such as “Error: Failed to initialize the context: failed to allocate buffer for kv cache” or “llama_model_load: error loading model: failed to allocate ... out of memory”. The confusing part: the model file on disk is 40 GB and your machine has 96 GB of RAM, yet it still fails.

Fastest fix (works for most people): open the model’s load settings, drop Context Length to 4096, make sure Flash Attention is on, and reload. The KV cache — not the weights — is almost always what blows past your memory budget, and context length is the single biggest lever on KV-cache size. Everything below is for when that alone doesn’t do it.

As of June 2026 this article tracks LM Studio 0.3.3x (the loader UI, Model Loading Guardrails, and the lms CLI all match that build).

Which bucket are you in?

Symptom	Most likely cause	Jump to
Big model, big RAM, OOM at load	Context length inflating the KV cache	Cause 1, Step 1
Model file is larger than your VRAM	Quantization too heavy for the GPU	Cause 2, Step 2
Loads fine, then OOMs on first message	Memory not fully committed until the first forward pass	Cause 1, FAQ
Loader says “may not have enough resources” / blocks load	Resource Guardrails estimate	Cause 7, Step 5
31B+ GGUF OOMs only with KV-on-GPU enabled	“Offload KV Cache to GPU Memory” toggle	Cause 6, Step 3
OOM after switching models without ejecting	VRAM not released from prior model	Cause 5, Step 4

Common causes

Ordered by hit rate, highest first.

1. Context length set too high, inflating the KV cache

LM Studio pre-allocates the full KV cache for your chosen context window before inference starts. It often defaults to the model’s maximum, frequently 128k tokens on modern models. For a 70B model at 128k context with fp16 KV, the cache alone needs roughly 40 GB when the model uses grouped-query attention (and considerably more when it does not), on top of the ~40 GB for Q4_K_M weights. Add compute buffers and the total passes 80 GB, which exceeds the GPU memory budget of even a 96 GB unified-memory Mac.
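
A rough worked example of that arithmetic, as a sketch rather than LM Studio's own estimator. It assumes a Llama-style 70B with grouped-query attention (80 layers, 8 KV heads, head dimension 128); those architecture values and the function name are assumptions for illustration, so check the model card for your model. Models with more KV heads land proportionally higher.

```python
def kv_cache_gb(context_length, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV-cache size in GB (1e9 bytes) for a single sequence."""
    # Factor 2: each layer stores one K tensor and one V tensor per token.
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return context_length * per_token_bytes / 1e9


# Assumed architecture for a Llama-style 70B (not read from any file).
LLAMA_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128)

full_window = kv_cache_gb(131072, **LLAMA_70B)  # 128k context, fp16 KV
chat_window = kv_cache_gb(4096, **LLAMA_70B)    # 4k context, fp16 KV

print(f"128k context: {full_window:.1f} GB")
print(f"4k context:   {chat_window:.2f} GB")
```

Under these assumptions the cache shrinks from about 43 GB to about 1.3 GB when the context drops from 128k to 4096, which is why context length is the first thing to cut.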

How to spot it: Open the model in My Models, click the gear/settings, and check Context Length under Load settings. If it reads 32768, 65536, or 131072, that is almost certainly the cause. Drop it to 4096 and reload.

2. Quantization too large for available VRAM

A Q8_0 70B is roughly 75 GB; a Q4_K_M 70B is roughly 42 GB. On a 16 GB GPU, even Q4_K_M exceeds VRAM, so LM Studio offloads to CPU/RAM, which can still OOM if system RAM is tight.

How to spot it: Check the GGUF file size against your VRAM. On a 16 GB card, only models up to ~12 GB of weights fit with headroom for the KV cache (Q4_K_M 13B ≈ 8 GB, Q4_K_M 8B ≈ 5 GB).

3. Flash Attention disabled on an older config

Flash Attention reduces the memory used during attention and shrinks per-token KV-cache overhead — typically freeing 20-30% of VRAM at a given context length. It became the default on CUDA in v0.3.31 and on Vulkan/Metal in v0.3.32, but a per-model config saved before that may still have it forced off.

How to spot it: In the model’s Load settings, check that Flash Attention is on (or Auto). If it was manually set to Off, a long-context load that should fit will OOM.

4. Metal memory budget on Apple Silicon

On Apple Silicon, the GPU can address most of unified memory, but macOS still caps the GPU working set (historically ~75% of RAM, enforced by the Metal driver). On a 16 GB M2 that’s roughly 12 GB for the entire GPU workload. A Q4_K_M 7B (4.4 GB weights + KV cache) fits; a Q4_K_M 13B (8 GB weights) can push past once the KV cache is added.

How to spot it: In Activity Monitor, open the GPU view (or the Memory tab) and compare GPU memory against the model’s expected footprint.
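
The budget check above can be sketched in a few lines. The 0.75 fraction is the historical figure quoted in this section, and the 1 GB overhead and the KV-cache sizes in the example calls are placeholder assumptions, not measured values.

```python
def metal_gpu_budget_gb(total_ram_gb, gpu_fraction=0.75):
    """Approximate GPU working-set cap on Apple Silicon (assumed ~75% of RAM)."""
    return total_ram_gb * gpu_fraction


def fits_metal(weights_gb, kv_cache_gb, total_ram_gb, overhead_gb=1.0):
    """True if weights + KV cache + an assumed overhead fit under the cap."""
    needed = weights_gb + kv_cache_gb + overhead_gb
    return needed <= metal_gpu_budget_gb(total_ram_gb)


print(metal_gpu_budget_gb(16))      # budget on a 16 GB machine
print(fits_metal(4.4, 1.0, 16))     # 7B Q4_K_M with a small cache
print(fits_metal(8.0, 3.5, 16))     # 13B Q4_K_M with a larger cache
```

With these placeholder numbers the 7B case fits and the 13B case does not, matching the pattern described above.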

5. VRAM not released from a previous model

If you loaded a model, ran inference, then loaded a different (larger) one without ejecting the first, the prior allocation may still be resident. VRAM may also be fragmented: the system reports 10 GB free, but the largest contiguous block is 4 GB, so the new model’s buffer allocation fails.

How to spot it: Eject all models, then check GPU memory with nvidia-smi (NVIDIA) or Activity Monitor (Mac). If VRAM doesn’t drop to near-idle after ejecting, a leak or fragmentation is present.

6. “Offload KV Cache to GPU Memory” forcing the cache into VRAM

This toggle stores the KV cache in VRAM (faster) instead of system RAM. On a VRAM-tight setup, or with certain large GGUFs (multiple 31B models showed this in 2026), enabling it tips the load over the edge — the same model loads cleanly with the toggle off.

How to spot it: In Load settings, find Offload KV Cache to GPU Memory. If it’s on and you’re close to your VRAM ceiling, turn it off so the cache spills to RAM.

7. Resource Guardrails blocking the load before it starts

LM Studio’s loader estimates weights + KV cache + compute buffers and compares it to free memory. Under Settings → Model Loading Guardrails (modes: Strict, Balanced, Relaxed, Off), a too-tight estimate can refuse the load with a message like “not enough resources to run model with the current settings,” even on configs that would actually fit. The estimate is known to run high versus raw llama.cpp.

How to spot it: If you see the guardrail warning rather than a hard crash, lower context first; if you’re confident it fits, use Load anyway or relax the guardrail mode.

Shortest path to fix

Step 1: Cut context length to what you actually need

In the model’s Load settings, set:

Context Length: 4096

4096 covers most chat. Use 8192 for RAG or long documents; only go to 32k+ when the task genuinely needs it. KV-cache memory scales roughly linearly with context length, so this is the highest-leverage single change.
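
Because the relationship is roughly linear, you can also invert it and ask how much context a given memory budget buys. This is a minimal sketch; the function name is hypothetical and the example architecture (32 layers, 8 KV heads, head dimension 128, typical of a Llama-style 8B) is an assumption.

```python
def max_context_for_budget(kv_budget_gb, num_layers, num_kv_heads, head_dim,
                           bytes_per_value=2):
    """Largest context length (rounded down to a multiple of 1024) whose
    fp16 KV cache fits in kv_budget_gb, counting 1 GB as 1e9 bytes."""
    # Factor 2 covers the K and V tensors stored per layer per token.
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    tokens = int(kv_budget_gb * 1e9 // per_token_bytes)
    return tokens // 1024 * 1024


# Assumed Llama-style 8B architecture; 2 GB and 4 GB set aside for the cache.
print(max_context_for_budget(2, 32, 8, 128))
print(max_context_for_budget(4, 32, 8, 128))
```

Doubling the cache budget roughly doubles the usable context, so pick the budget first and let it set the context length, not the other way around.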

Before loading from the terminal, you can preview the cost without committing:

# Print a memory estimate and exit (no load), honoring your flags
lms load <model-key> --context-length 4096 --gpu max --estimate-only

Step 2: Move to a more memory-efficient quantization

Q8_0 → Q6_K → Q5_K_M → Q4_K_M → IQ4_XS

Q4_K_M is the usual sweet spot (quality loss is hard to notice on most tasks for roughly half the memory of Q8_0). In LM Studio’s downloader, search the model name and filter by quantization. Targets by VRAM, as of June 2026:

VRAM	Safe model + quant	Context to start
8 GB	7B/8B Q4_K_M	2048-4096
16 GB	13B Q4_K_M or 13B Q5_K_M	4096-8192
24 GB	34B Q4_K_M or 13B Q8_0	8192-16384
48 GB	70B Q4_K_M or IQ4_XS	8192-16384

For a 70B on a 16 GB GPU, Q4_K_M (42 GB) still needs heavy CPU offload — a 13B Q4_K_M (8 GB) that lives entirely on the GPU is usually the better trade.
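
To see where the table's rows come from, here is a rough sketch that walks the quantization ladder from heaviest to lightest and returns the first level whose weights fit. The bits-per-weight figures are approximations (real GGUF files vary by model), and the 3 GB reserve for KV cache and compute buffers is an assumed placeholder.

```python
# Approximate bits per weight, heaviest first. Assumed figures for illustration.
BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q6_K": 6.56,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.85,
    "IQ4_XS": 4.25,
}


def weights_gb(params_billion, quant):
    """Approximate weight size in GB for a model with params_billion parameters."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def largest_quant_that_fits(params_billion, vram_gb, reserve_gb=3.0):
    """Heaviest quantization whose weights plus a reserve fit in vram_gb, or None."""
    for quant in BITS_PER_WEIGHT:  # dicts keep insertion order: heaviest first
        if weights_gb(params_billion, quant) + reserve_gb <= vram_gb:
            return quant
    return None


for params, vram in [(8, 8), (34, 24), (70, 48), (70, 16)]:
    print(f"{params}B on {vram} GB: {largest_quant_that_fits(params, vram)}")
```

A result of None means no level on the ladder fits entirely in VRAM, which is the case where a smaller model or partial offload is the realistic option.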

Step 3: Turn on Flash Attention, quantize the KV cache, and right-size offload

Three load-setting levers, in order of impact:

Flash Attention → on (default in current builds). Frees ~20-30% VRAM and speeds decode.
KV Cache Quantization → switch from fp16 to Q8_0, which roughly halves per-token cache memory at negligible quality cost.
Offload KV Cache to GPU Memory → turn off if you’re VRAM-bound, so the cache spills to system RAM (slower, but it loads).

For models that won’t fit in VRAM at all, set partial GPU Offload so some layers run on the GPU and the rest on CPU RAM:

GPU Offload: 20  (start low, raise until just before OOM)

Or from the CLI, offload by fraction:

lms load <model-key> --gpu 0.5 --context-length 4096   # 50% of layers on GPU
lms load <model-key> --gpu off                          # CPU only, no VRAM pressure
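
Instead of guessing the starting value, you can derive a first estimate from the file size. This sketch assumes all layers are the same size (it ignores embeddings and the output head) and reserves an assumed 2 GB of VRAM for KV cache and compute buffers; treat the result as a starting point to adjust, not a guarantee.

```python
def starting_gpu_layers(weights_gb, num_layers, vram_free_gb, reserve_gb=2.0):
    """Return (layers, fraction): a first guess for GPU Offload and for --gpu."""
    per_layer_gb = weights_gb / num_layers          # assumes equal-sized layers
    usable_gb = max(vram_free_gb - reserve_gb, 0.0)
    layers = min(int(usable_gb / per_layer_gb), num_layers)
    return layers, layers / num_layers


# 70B Q4_K_M (about 42 GB, assumed 80 layers) on a 16 GB card.
layers, fraction = starting_gpu_layers(42, 80, 16)
print(f"GPU Offload: {layers} layers (--gpu {fraction:.2f})")
```

If the load still fails at that value, step the layer count down a few at a time; if it succeeds with room to spare, step it up.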

Step 4: Eject other models and clear VRAM

# NVIDIA: confirm VRAM is free before loading
nvidia-smi --query-gpu=memory.used,memory.free --format=csv,noheader

# Apple Silicon: check Activity Monitor > GPU

In LM Studio, open My Models (or the server/Power-User panel) and click Eject on every loaded model before loading the new one. Auto-evict and “only keep last JIT-loaded model” settings help avoid stacking models unintentionally.
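
If you want to script the check, the CSV output of the nvidia-smi command above is easy to parse. A minimal sketch follows; the sample line is made up for illustration and the function names are hypothetical.

```python
def parse_nvidia_smi_csv(output):
    """Parse 'memory.used,memory.free' rows (csv,noheader) into (used, free) MiB."""
    rows = []
    for line in output.strip().splitlines():
        used, free = (field.strip() for field in line.split(","))
        # Each field looks like '1250 MiB'; keep the integer part.
        rows.append((int(used.split()[0]), int(free.split()[0])))
    return rows


def enough_free(output, needed_gb):
    """True if any GPU reports at least needed_gb (1e9 bytes) of free memory."""
    needed_mib = needed_gb * 1e9 / (1024 * 1024)
    return any(free >= needed_mib for _, free in parse_nvidia_smi_csv(output))


sample = "1250 MiB, 23014 MiB"  # illustrative output, one line per GPU
print(parse_nvidia_smi_csv(sample))
print(enough_free(sample, 20))
```

In practice the output string would come from running the nvidia-smi command shown above, for example through Python's subprocess module.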

Step 5: If guardrails block a fit-able config, relax them deliberately

If the loader refuses with a resource warning rather than crashing, go to Settings → Model Loading Guardrails and move from Strict toward Balanced or Relaxed, or click Load anyway in the loader. Only do this once your own estimate (or --estimate-only) shows the config genuinely fits — guardrails exist to stop you from freezing the machine.

Step 6 (Windows): increase the page file as a spill buffer

Control Panel → System → Advanced System Settings → Performance → Settings → Advanced → Virtual Memory → Change. Set a custom size with minimum ≈ 1.5x the model file size and maximum ≈ 3x, apply, and restart. A disabled or tiny page file makes large mmap regions fail to commit on Windows.
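
The Virtual Memory dialog takes sizes in megabytes, so the two multipliers translate as in the following sketch. The function name is hypothetical, and it treats 1 GB as 1024 MB.

```python
def page_file_mb(model_file_gb, min_factor=1.5, max_factor=3.0):
    """Return (initial, maximum) page-file sizes in MB for a model file size in GB."""
    mb_per_gb = 1024
    initial = int(model_file_gb * min_factor * mb_per_gb)
    maximum = int(model_file_gb * max_factor * mb_per_gb)
    return initial, maximum


initial, maximum = page_file_mb(40)  # a 40 GB GGUF
print(f"Initial size (MB): {initial}")
print(f"Maximum size (MB): {maximum}")
```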

How to confirm it’s fixed

Load the model and watch the loader’s live memory readout settle below your total — it should not pin at 100%.
Send a real prompt that fills most of your context window. Some memory is only committed on the first forward pass, so a prompt that streams to completion without crashing confirms that the cache and compute buffers fit.
Watch memory during generation:

nvidia-smi --query-gpu=memory.used,memory.free --format=csv -l 2

VRAM use should plateau, not keep climbing toward the ceiling. On a Mac, the GPU view in Activity Monitor should hold steady through a long generation.
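
One way to make "plateau" concrete is to collect the memory.used readings and check that the most recent ones stay within a small band. This is a sketch with assumed thresholds (5 samples, 64 MiB) and made-up readings, not a definitive test.

```python
def has_plateaued(samples_mib, window=5, tolerance_mib=64):
    """True if the last `window` readings stay within tolerance_mib of each other."""
    if len(samples_mib) < window:
        return False  # not enough data to judge
    recent = samples_mib[-window:]
    return max(recent) - min(recent) <= tolerance_mib


# Illustrative readings in MiB, taken every 2 seconds.
steady = [9000, 14200, 14310, 14320, 14318, 14322, 14320]
climbing = [14000, 14400, 14800, 15200, 15600, 16000]

print(has_plateaued(steady))
print(has_plateaued(climbing))
```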

Prevention

Estimate before downloading: total ≈ GGUF size + KV cache, where the fp16 KV cache in GB is context_length × 2 (K and V) × num_kv_heads × head_dim × num_layers × 2 bytes / 1e9; halve the KV term if you’ll use Q8_0 KV cache.
Default new models to 4096 context and raise per-session only when needed.
Keep Flash Attention on and prefer Q4_K_M unless a task is provably sensitive to quantization.
Turn off Auto-load last model on startup so a large model never loads before you’ve sized it.
After ejecting a model, wait a few seconds and confirm nvidia-smi shows VRAM returned before loading the next.
On Windows, keep a manual page file of at least 32 GB on your fastest SSD.
Add the LM Studio models folder to antivirus exclusions to avoid I/O stalls during mmap.
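
The pre-download estimate from the first item can be sketched as below. The example assumes a 13B model without grouped-query attention (40 layers, 40 KV heads, head dimension 128) and an 8 GB Q4_K_M file; those values, the function name, and treating Q8_0 KV as exactly half of fp16 are all assumptions for illustration.

```python
def estimate_total_gb(gguf_gb, context_length, num_layers, num_kv_heads,
                      head_dim, kv_quant="fp16"):
    """GGUF size plus approximate KV cache, in GB (1e9 bytes)."""
    # Bytes per stored value; Q8_0 is treated as roughly half of fp16.
    bytes_per_value = {"fp16": 2.0, "q8_0": 1.0}[kv_quant]
    kv_gb = (context_length * 2 * num_kv_heads * head_dim * num_layers
             * bytes_per_value / 1e9)
    return gguf_gb + kv_gb


# Assumed 13B architecture, 8 GB file, 8192-token context.
fp16_total = estimate_total_gb(8.0, 8192, 40, 40, 128, kv_quant="fp16")
q8_total = estimate_total_gb(8.0, 8192, 40, 40, 128, kv_quant="q8_0")

print(f"fp16 KV: {fp16_total:.2f} GB")
print(f"Q8_0 KV: {q8_total:.2f} GB")
```

Compare the result against the memory you can actually allocate, not the nominal figure on the box; compute buffers come on top of this estimate.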

FAQ

Q: The model loads fine but crashes when I send the first message. Is it the same problem? A: Usually, yes. Even when the KV cache is reserved at load, the operating system and backend may not commit all of that memory, or size the compute buffers, until the first forward pass. With a 128k context, that allocation can fail on the first token, producing a crash at inference rather than at load. Lower Context Length and reload.

Q: A 31B model OOMs only when “Offload KV Cache to GPU Memory” is on. Why? A: That toggle forces the cache into VRAM. Some large GGUFs sit right at the VRAM ceiling and tip over once the cache is added on top; turning the toggle off keeps the cache in system RAM and they load. Enabling Flash Attention and Q8_0 KV cache also helps.

Q: What is IQ4_XS, and is it a good pick? A: It’s an importance-matrix quantization targeting ~4.25 bits/weight with non-uniform precision — slightly smaller than Q4_K_M and comparably accurate, often the best way to squeeze a 70B into ~40 GB. LM Studio supports IQ4_XS GGUFs natively.

Q: The loader says I “may not have enough resources” but I’m sure it fits. Can I force it? A: Yes. In the GUI use Load anyway, or set Settings → Model Loading Guardrails to Relaxed or Off. The estimate is conservative and tends to run higher than actual usage. Confirm with lms load <model-key> --estimate-only first.

Q: Same model loads in Ollama but OOMs in LM Studio. Why? A: Different context-length defaults, and therefore different KV-cache sizes. Ollama often defaults to a smaller context (e.g. 2048), while LM Studio may default to the model’s maximum. Lower Context Length in LM Studio and the memory use usually matches.

Q: Why does it say “16 GB VRAM available” but still OOM on a 10 GB model? A: The nominal figure includes memory that the OS, the display, and other applications already hold. The reliably allocatable pool is roughly 80-85% of the nominal VRAM, so on a 16 GB card budget around 13 GB for weights plus cache plus compute buffers.

Tags: #local-llm #lmstudio #Troubleshooting