llama.cpp mmap Fails on a Network Drive

llama.cpp crashes or stalls loading a GGUF model from an NFS or SMB share. Fastest fix: add --no-mmap (and --no-direct-io if DirectIO is on), or copy the model to local disk.

Published: May 25, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

You store your 40 GB GGUF models on a NAS mounted at /mnt/nas/models/ via NFS. When you run ./llama-server -m /mnt/nas/models/llama-3.1-70b-Q4_K_M.gguf, one of three things happens. The process crashes immediately with mmap failed: Invalid argument or mmap failed: Operation not permitted. Or the model loads extremely slowly (30-60 minutes, compared with under 30 seconds from a local SSD). Or the model loads, but random reads during inference stall the server for 5-30 seconds mid-generation while the NFS client waits for page-fault I/O. All three trace back to llama.cpp’s loader, which assumes the weights sit on a local, low-latency disk.

Fastest fix (do this first): add --no-mmap to your launch command. That makes llama.cpp read the file with ordinary read() calls into RAM instead of memory-mapping it, which sidesteps the network filesystem’s mmap restriction entirely. If you are on a recent build (since early 2026) that enabled DirectIO, also add --no-direct-io, because O_DIRECT fails on network mounts the same way and surfaces a slightly different string: read error: Invalid argument. If you load through Ollama, set OLLAMA_NO_MMAP=1. The most reliable long-term fix is to copy the model to a local SSD once and load from there.

Pick your bucket first

The right flag depends on which symptom you actually have. As of June 2026, llama.cpp’s default loader is still mmap (DirectIO is opt-in via -dio/--direct-io), but some Windows GGUF distributions and tuned scripts turn DirectIO on, which changes the error string and the fix.

| Symptom | Most likely cause | First fix |
| --- | --- | --- |
| Crash on load: `mmap failed: Invalid argument` / `Operation not permitted` | NFS/CIFS rejects `mmap` (`MAP_SHARED` or `MAP_PRIVATE`) | `--no-mmap` |
| Crash on load: `read error: Invalid argument` | DirectIO (`O_DIRECT`) not supported on the network mount | `--no-direct-io` (and `--no-mmap`) |
| Loads but each token takes 5-30 s | mmap page faults fetched over the network during inference | `--no-mmap` or copy to local disk |
| Loads “too fast”, output is garbage | network dropped mid-load; mmap region has zeroed pages | re-copy to local disk; check `dmesg \| grep nfs` |
| `Permission denied` opening the `.gguf` | NFS `root_squash` maps root to `nobody` | fix export/permissions (see cause 6) |
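
The table can also be read as a simple string lookup. The sketch below is only an illustration: `RULES` and `first_fix` are hypothetical names, and the needles are the error strings quoted in this article, which may differ between llama.cpp builds.

```python
from typing import Optional

# Hypothetical lookup: error strings quoted in this article -> first fix to try.
RULES = [
    ("read error: Invalid argument", "--no-direct-io (and --no-mmap)"),
    ("mmap failed: Invalid argument", "--no-mmap"),
    ("mmap failed: Operation not permitted", "--no-mmap"),
    ("Permission denied", "fix NFS export/permissions"),
]

def first_fix(log_text: str) -> Optional[str]:
    """Return the suggested first fix for the first known string in log_text."""
    for needle, fix in RULES:
        if needle in log_text:
            return fix
    return None  # no known failure string found
```

Feeding it the last few lines of the server log is usually enough, since the failure string appears right where the load aborts.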

Common causes

Ordered by hit rate, highest first.

1. NFS mount does not support mmap

NFSv3 and some NFSv4 configurations do not support mmap(MAP_SHARED) for client-side memory mapping. The kernel returns EINVAL or EPERM when llama.cpp calls mmap on the file descriptor, causing an immediate crash. This is a protocol limitation, not a permissions issue.

How to spot it: Run mount | grep nfs to confirm the filesystem type, then try python3 -c "import mmap, os; f=open('/mnt/nas/models/test.bin','rb'); m=mmap.mmap(f.fileno(),0,access=mmap.ACCESS_READ)". If it raises OSError: [Errno 22] Invalid argument, mmap is unsupported on this mount.
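
For repeated checks, the one-liner can be wrapped in a small function that closes the file and reports the errno instead of raising. This is a minimal sketch; `probe_mmap` is a hypothetical name, and an empty file raises `ValueError` rather than `OSError`, so point it at a non-empty file.

```python
import mmap

def probe_mmap(path: str):
    """Try a read-only mmap of path; return (True, None) or (False, errno)."""
    try:
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                m[:1]  # touch the first byte so at least one page is faulted in
        return True, None
    except OSError as e:
        return False, e.errno
```

A result of `(False, 22)` corresponds to the `[Errno 22] Invalid argument` case described above.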

2. SMB/CIFS mount with mmap disabled by kernel

The Linux SMB client (cifs) disables mmap on non-local files by default for security reasons. Even with cache=none or cache=strict, mmap calls on CIFS-mounted files return EINVAL. On Windows the equivalent path uses CreateFileMappingA over a UNC path (\\server\share\model.gguf) or a mapped drive letter, and it fails the same way.

How to spot it: Run mount | grep cifs. Any CIFS/SMB mount will cause mmap failures for llama.cpp unless you explicitly use --no-mmap.

3. DirectIO (O_DIRECT) is enabled and the network mount rejects it

This cause is new since early 2026. llama.cpp added a DirectIO loader (-dio / --direct-io) that bypasses mmap and reads with O_DIRECT. It is opt-in on the default upstream build, but several prebuilt binaries and tuned launch scripts enable it. Network filesystems generally do not support O_DIRECT, so the load fails with read error: Invalid argument, and the log line direct I/O is enabled, disabling mmap appears right before the failure. Because DirectIO already disabled mmap, adding --no-mmap alone will not help here; you have to turn DirectIO off.

How to spot it: Look in the startup log for direct I/O is enabled, disabling mmap. If the failure string is read error: Invalid argument (not mmap failed: ...), DirectIO is the culprit. Fix it with --no-direct-io.

4. Model loads but page faults during inference cause multi-second stalls

Even when mmap succeeds (some NFS configurations do allow it), each forward pass reads different portions of the model weights. Over NFS, each page fault that requires fetching a page from the server introduces network latency. A 70B model has thousands of such faults per forward pass, making each token take 5-30 seconds instead of milliseconds.

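How to spot it: Watch the server’s major page-fault count while generating tokens, for example by running ps -o min_flt,maj_flt -p <pid> a few seconds apart. If maj_flt keeps climbing during inference, the model weights are not fully resident in RAM. Note that the si column in vmstat 1 counts swap-ins only, so it stays at zero for file-backed mmap pages.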
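
A process’s major page-fault counter is a direct signal that pages are being fetched from backing storage during generation. The sketch below reads it from Linux’s /proc/<pid>/stat, where minflt and majflt are fields 10 and 12; `parse_faults` and `read_faults` are hypothetical helper names.

```python
def parse_faults(stat_line: str):
    """Return (minflt, majflt) from one line of /proc/<pid>/stat."""
    # The comm field is parenthesised and may contain spaces, so split after
    # the last ')'. The remainder then starts at field 3 (state).
    rest = stat_line.rsplit(")", 1)[1].split()
    return int(rest[7]), int(rest[9])  # field 10 (minflt), field 12 (majflt)

def read_faults(pid: int):
    """Read the current fault counters for a running process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_faults(f.read())
```

Calling `read_faults` twice, a few seconds apart while tokens are being generated, and comparing the second value of each pair shows whether major faults are still accumulating.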

5. mlock disabled or failing on large network-mounted files

With --mlock, llama.cpp calls mlock to pin model weights in RAM after mapping them. If that call fails or covers only part of the mapping, typically because the memlock limit is lower than the model size, the unpinned pages stay subject to the OS’s LRU eviction under memory pressure, and page faults against the network mount recur during long inference sessions.

How to spot it: Run ulimit -l — if the mlock limit is less than the model file size in KB, mlock won’t fully pin the model. Also check /proc/sys/vm/nr_hugepages.
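
The same comparison can be scripted. A minimal sketch, assuming a Unix system: `resource.getrlimit` reports the limit in bytes (whereas `ulimit -l` prints KB), and `limit_covers` and `memlock_covers` are hypothetical names.

```python
import os
import resource

def limit_covers(soft_limit: int, size_bytes: int) -> bool:
    """True if a memlock limit (bytes) is unlimited or at least size_bytes."""
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= size_bytes

def memlock_covers(path: str) -> bool:
    """True if the current soft RLIMIT_MEMLOCK can pin the whole file."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return limit_covers(soft, os.path.getsize(path))
```

If `memlock_covers` returns False for the model path, mlock cannot pin the whole model under the current limit.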

6. File permission on NFS prevents the open call

Some NFS configurations with root_squash enabled map the root user to nobody, and nobody may not have read permission on the model file. If llama.cpp is run as root (common in Docker containers), this results in a permission error during model load.

How to spot it: Run ls -la /mnt/nas/models/*.gguf as the user running llama.cpp. If you see “Permission denied,” the NFS permission mapping is the cause.

7. Network interruption mid-load causes a corrupted mmap state

If the network connection to the NFS server drops during model loading (the initial mmap + sequential page load phase), llama.cpp may continue running with a partially-populated mmap region containing zeros. Inference on zeroed weights produces garbage output rather than an error.

How to spot it: Run dmesg | grep nfs after a suspiciously fast model load. If you see timeout or reconnect messages, the model may have loaded with network-interrupted pages.
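
A checksum is one way to confirm whether the bytes you can read match a known-good copy. The sketch below streams the file through SHA-256 in 1 MiB chunks; `sha256_of` is a hypothetical helper, and it assumes you have a reference digest to compare against (for example one computed on the NAS itself).

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, read in chunk_size pieces."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at end of file
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

This is most useful after copying the model to local disk: a digest mismatch between the source and the copy means the transfer should be repeated.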

Shortest path to fix

Step 1: Disable mmap (and DirectIO) so weights load with plain reads

# --no-mmap reads the model into RAM with ordinary read() calls on load
# (slower initial load, but fast inference — no page faults over the network)
# --no-direct-io is only needed if your build/script enabled DirectIO
./llama-server \
  -m /mnt/nas/models/llama-3.1-70b-Q4_K_M.gguf \
  --no-mmap \
  --no-direct-io \
  --n-gpu-layers 80 \
  --port 8080

With --no-mmap, llama.cpp reads the file sequentially into system RAM during load and then runs entirely from RAM, so no further NFS I/O happens during inference. Load time is bounded by the link: 40 GB takes roughly 6 minutes over gigabit Ethernet (about 115 MB/s) and under a minute over 10 GbE. The flag names are stable in current builds: --mmap / --no-mmap (mmap is on by default) and -dio / --direct-io / -ndio / --no-direct-io (DirectIO is off by default upstream).

If you cannot edit the command line — for example a packaged service or a third-party wrapper — set an environment variable instead. Any of these disable mmap for the llama.cpp loader:

# llama.cpp: presence of this var (any value) disables mmap
export LLAMA_ARG_NO_MMAP=1
# or, explicitly:
export LLAMA_ARG_MMAP=false   # also accepts 0, off, disabled

# Ollama (which embeds llama.cpp): disable mmap globally
export OLLAMA_NO_MMAP=1
# or per model in a Modelfile:
#   PARAMETER use_mmap false

How to confirm it’s fixed: the server should reach main: model loaded (or llama_model_loader: loaded meta data) and start listening on the port without any mmap failed: or read error: line. If you still see a crash, read the exact string and jump to the matching row in the bucket table above.

Step 2: Copy the model to local SSD before loading

# Copy once, then load from local path
rsync --progress \
  /mnt/nas/models/llama-3.1-70b-Q4_K_M.gguf \
  /home/$USER/models/llama-3.1-70b-Q4_K_M.gguf

./llama-server \
  -m /home/$USER/models/llama-3.1-70b-Q4_K_M.gguf \
  --n-gpu-layers 80 \
  --port 8080

This is the fastest option for repeated use. A 40 GB model copied from NFS to local NVMe at 500 MB/s takes under 90 seconds and loads from local disk in under 30 seconds with mmap.
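
The copy-time figure is plain division, which makes it easy to redo for your own model size and link speed. A small sketch using decimal units (1 GB = 1000 MB); `transfer_seconds` is a hypothetical helper name.

```python
def transfer_seconds(size_gb: float, throughput_mb_s: float) -> float:
    """Seconds needed to move size_gb gigabytes at throughput_mb_s MB/s."""
    return size_gb * 1000 / throughput_mb_s

print(transfer_seconds(40, 500))  # 80.0, i.e. under 90 seconds
```

The throughput to plug in is the slower of the network link and the destination disk.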

Step 3: Increase mlock limits if using --no-mmap is insufficient

# Check current mlock limit
ulimit -l
# If "unlimited", mlock is not the issue
# If a number, increase it:

# Temporary (current session)
ulimit -l unlimited

# Permanent (add to /etc/security/limits.conf)
echo "* soft memlock unlimited" | sudo tee -a /etc/security/limits.conf
echo "* hard memlock unlimited" | sudo tee -a /etc/security/limits.conf

Step 4: For Docker-based deployments, mount the local path instead of NFS inside the container

# docker-compose.yml — mount local SSD path, not NFS
services:
  llama:
    image: ghcr.io/ggml-org/llama.cpp:server
    volumes:
      - /home/user/models:/models  # local path, not /mnt/nas
    command: >
      -m /models/llama-3.1-70b-Q4_K_M.gguf
      --no-mmap
      --n-gpu-layers 80
      --port 8080
    ports:
      - "8080:8080"

A bind mount does not change this: if the host directory is itself an NFS or CIFS mount, the container sees the same network filesystem and mmap/O_DIRECT fail exactly the same way. Either mount a genuinely local host path or keep --no-mmap in the command.

Step 5: If you must use NFS, add the async and rsize mount options

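# /etc/fstab entry for the NFS share
# (keep it on one line: fstab does not support backslash line continuation)
nas:/models /mnt/nas/models nfs rw,soft,async,rsize=1048576,wsize=1048576,timeo=600,retrans=5 0 0

# Remount. rsize/wsize are negotiated at mount time, so if the new values
# do not show up in the check below, unmount and mount the share again
sudo mount -o remount /mnt/nas/models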

# Verify rsize
mount | grep nas | grep rsize

The rsize=1048576 (1 MB read size) maximizes sequential read throughput for the initial model load with --no-mmap.
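
The rsize check can also be done programmatically by parsing /proc/mounts-style text. This is a sketch under the assumption of the usual whitespace-separated layout (device, mount point, type, options); `mount_options` is a hypothetical helper name.

```python
def mount_options(mounts_text: str, mount_point: str) -> dict:
    """Return the options of mount_point as a dict, or {} if it is not mounted."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            opts = {}
            for item in fields[3].split(","):
                key, _, value = item.partition("=")  # flags get an empty value
                opts[key] = value
            return opts
    return {}
```

On a live system, pass it the contents of /proc/mounts and compare `opts.get("rsize")` with the value you requested, since the server may negotiate a smaller size.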

Prevention

Store GGUF model files on local NVMe or SSD whenever performance matters — NFS is a network filesystem, not a local storage substitute.
Always include --no-mmap (and --no-direct-io if DirectIO is enabled) in launch scripts when the model path is on any network filesystem (NFS, CIFS, sshfs). Setting LLAMA_ARG_NO_MMAP=1 once in the service environment covers every invocation.
When buying a NAS for model storage, also budget for a local SSD cache — copy models to local disk before use.
Add a pre-flight check to your launch script that verifies the model file is on a local filesystem before starting:

FSTYPE=$(stat -f -c %T "$MODEL_PATH" 2>/dev/null || stat -f "$MODEL_PATH" | grep 'Type:' | awk '{print $NF}')
if echo "$FSTYPE" | grep -qiE "nfs|cifs|smb|fuse"; then
  echo "Warning: model on network filesystem — using --no-mmap --no-direct-io"
  EXTRA_FLAGS="--no-mmap --no-direct-io"
fi

For shared teams using a NAS, set up a nightly rsync job to pre-cache frequently used models on each workstation’s local disk.
Monitor dmesg | grep nfs for timeout/reconnect messages that indicate NFS instability during long inference sessions.
If mmap must work over the network (e.g., low-RAM systems), use NFSv4.1 with pNFS and enable FS-Cache (the fsc mount option plus the cachefilesd daemon) for local page caching.
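
For launchers written in Python, the pre-flight check from the list above can be sketched by matching the model path against /proc/mounts-style text. The names `fstype_for`, `is_network_fs`, and `NETWORK_FS` are hypothetical, and the prefix list is an assumption that you may need to extend for your mounts.

```python
# Filesystem type prefixes treated as "network" (an assumption, not exhaustive).
NETWORK_FS = ("nfs", "cifs", "smb", "fuse.sshfs")

def fstype_for(path: str, mounts_text: str) -> str:
    """Return the filesystem type of the longest mount point containing path."""
    best_len, best_type = -1, ""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mnt, fstype = fields[1], fields[2]
        prefix = mnt.rstrip("/") + "/"
        # >= lets a later mount on the same mount point win, as the kernel does
        if (path == mnt or path.startswith(prefix)) and len(mnt) >= best_len:
            best_len, best_type = len(mnt), fstype
    return best_type

def is_network_fs(fstype: str) -> bool:
    """True for types such as nfs, nfs4, cifs, smb3, or fuse.sshfs."""
    return fstype.startswith(NETWORK_FS)
```

In practice you would resolve the model path with `os.path.realpath` first and pass in the contents of /proc/mounts, then append --no-mmap --no-direct-io when `is_network_fs` returns True.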

FAQ

Q: I added --no-mmap and now I get read error: Invalid argument instead — what changed? A: That string means DirectIO (O_DIRECT) is active, and network mounts do not support it. The startup log usually shows direct I/O is enabled, disabling mmap just before the failure, so --no-mmap alone cannot help. Add --no-direct-io. On Windows you may instead see read error: An attempt was made to move the file pointer before the beginning of the file, which is a separate --no-mmap regression on some builds; for that case prefer copying the model to a local drive letter.

Q: Does --mlock help when the model is on NFS? A: --mlock calls mlock on the mmap’d region to prevent page eviction. On NFS, even locked pages must be fetched from the network on first access — --mlock only prevents eviction after the initial fault. --no-mmap is more reliable because it reads all data upfront into RAM without going through the page-fault path.

Q: Can I use a RAM disk as a middle layer between NFS and llama.cpp? A: Yes. Create a tmpfs mount (sudo mount -t tmpfs -o size=50G tmpfs /mnt/ramdisk), copy the model there (cp /mnt/nas/models/model.gguf /mnt/ramdisk/), and load from /mnt/ramdisk/. This gives mmap the local, low-latency filesystem it needs while keeping the master copy on NFS.
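
Before copying into the RAM disk it is worth checking that the model fits, since tmpfs contents consume RAM. A minimal sketch: `fits_with_headroom` is a hypothetical helper, and the 10% default headroom is an arbitrary assumption.

```python
import os
import shutil

def fits_with_headroom(model_path: str, target_dir: str, headroom: float = 0.1) -> bool:
    """True if target_dir has free space for the model plus fractional headroom."""
    needed = os.path.getsize(model_path) * (1 + headroom)
    return shutil.disk_usage(target_dir).free >= needed
```

For the example above, that would be `fits_with_headroom("/mnt/nas/models/model.gguf", "/mnt/ramdisk")`.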

Q: My model is on a local ext4 SSD but I’m still getting mmap errors. Why? A: First check that the running user can read the model file (ls -la model.gguf). A noexec mount option is an unlikely cause, because llama.cpp maps the weights read-only rather than executable. Free space (df -h /path/to/model) matters only when something writes to the volume, such as a model conversion, since loading a model does not write to it.

Q: Does LM Studio have the same NFS mmap issue? A: Yes. LM Studio uses the same llama.cpp backend. The fix is the same: copy the model to local storage. LM Studio does not expose a --no-mmap toggle in the UI, so the only practical option is to store models on local disk.

Tags: #local-llm #llama.cpp #Troubleshooting
