You store your 40 GB GGUF models on a NAS mounted at /mnt/nas/models/ via NFS, and when you run ./llama-server -m /mnt/nas/models/llama-3.1-70b-Q4_K_M.gguf, the process either crashes immediately with mmap failed: Invalid argument or mmap failed: Operation not permitted, or loads extremely slowly (30-60 minutes) compared to local SSD (under 30 seconds). A third failure mode is that the model loads but random reads during inference cause the server to stall for 5-30 seconds mid-generation as the NFS client waits for page-fault I/O. All three failure modes trace back to llama.cpp’s default memory-mapped file loading, which assumes a local, low-latency filesystem.
Common causes
Ordered by hit rate, highest first.
1. NFS mount does not support mmap
The Linux NFS client normally supports read-only mmap, but some server, export, or mount configurations reject the mapping, and the kernel returns EINVAL or EPERM when llama.cpp calls mmap on the file descriptor. llama.cpp treats that as fatal, so the process exits immediately. File permissions are not the cause here: the open call has already succeeded by the time mmap fails.
How to spot it: Run mount | grep nfs to confirm the filesystem type, then try python3 -c "import mmap; f=open('/mnt/nas/models/test.bin','rb'); m=mmap.mmap(f.fileno(),0,access=mmap.ACCESS_READ)" against any non-empty file on the share (test.bin is a placeholder name). If it raises OSError: [Errno 22] Invalid argument, mmap is unsupported on this mount.
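If you would rather script the filesystem check than read mount output by eye, the longest matching mount point for a path can be looked up in /proc/mounts. This is a minimal sketch, assuming Linux; fstype_for_path is a hypothetical helper name, the path must be absolute with symlinks already resolved, and mount points containing spaces (escaped as \040 in /proc/mounts) are not handled.

```shell
# Print the filesystem type of the mount that contains an absolute path.
# Reads /proc/mounts-formatted lines on stdin: device mountpoint fstype options ...
# Usage: fstype_for_path /mnt/nas/models/model.gguf < /proc/mounts
fstype_for_path() (
    target=$1
    best_len=-1
    best_type=unknown
    while read -r _dev mnt fstype _rest; do
        case "$target/" in
            ("${mnt%/}/"*)
                # Longest mount point wins; on a tie the later entry wins.
                if [ "${#mnt}" -ge "$best_len" ]; then
                    best_len=${#mnt}
                    best_type=$fstype
                fi
                ;;
        esac
    done
    echo "$best_type"
)
```

A result such as nfs, nfs4, cifs, or anything starting with fuse means the model is not on local storage.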
2. SMB/CIFS mount with mmap disabled by kernel
mmap behaviour on the Linux SMB client (cifs) depends on the kernel version and the cache= mount option, and mapping a large model file can fail with EINVAL. If the load aborts with mmap failed on a CIFS mount, treat the mount as unable to support llama.cpp’s default loading path.
How to spot it: Run mount | grep cifs. If the model path is on a CIFS/SMB mount and the load fails in mmap, pass --no-mmap explicitly.
3. Model loads but page faults during inference cause multi-second stalls
Even when mmap succeeds (some NFS configurations do allow it), each forward pass reads different portions of the model weights. Over NFS, each page fault that requires fetching a page from the server introduces network latency. A 70B model has thousands of such faults per forward pass, making each token take 5-30 seconds instead of milliseconds.
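How to spot it: Watch the server’s major-fault counter while generating tokens, for example by running ps -o min_flt,maj_flt -p $(pidof llama-server) every few seconds. If maj_flt keeps climbing during inference, weights are still being fetched from the NFS server rather than served from RAM. The si column of vmstat is not a reliable signal here: it counts swap-ins, not faults on file-backed mappings.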
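The same counter is available without ps, as field 12 (majflt) of /proc/<pid>/stat. The sketch below extracts it from a stat line; majflt_from_stat is a hypothetical helper name, and it assumes the Linux proc(5) field layout.

```shell
# Print the major-fault count (field 12) from one /proc/<pid>/stat line.
# Field 2 (the command name) may contain spaces or parentheses, so everything
# through the last ") " is stripped first; the remainder starts at field 3.
# Usage: majflt_from_stat "$(cat /proc/$(pidof llama-server)/stat)"
majflt_from_stat() (
    rest=${1##*') '}
    set -- $rest      # intentional word splitting into fields 3, 4, 5, ...
    echo "${10}"      # 10th remaining field is field 12 overall
)
```

Sampling it twice a few seconds apart during generation and comparing the two numbers shows whether inference is still faulting pages in from the share.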
4. mlock disabled or failing on large network-mounted files
With --mlock, llama.cpp asks the kernel to pin the mapped weights in RAM. If the locked-memory limit (RLIMIT_MEMLOCK) is smaller than the model, the lock fails, llama.cpp logs a warning, and the run continues unpinned. Unpinned NFS-backed pages are ordinary page cache, so the kernel can evict them under memory pressure, and the page faults recur during long inference sessions.
How to spot it: Run ulimit -l. The value is in KiB; if it is smaller than the model file size (and not unlimited), mlock cannot pin the whole model. Also look for a failed-to-mlock warning in the llama.cpp startup log.
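The comparison between the limit and the file size is easy to get wrong by a factor of 1024, so it can help to script it. A small sketch, with memlock_covers as a hypothetical helper name; it takes the limit exactly as ulimit -l prints it (KiB or the word unlimited) and the file size in bytes.

```shell
# Succeed if a memlock limit (KiB, or "unlimited") can cover a file of size_bytes.
# Usage: memlock_covers "$(ulimit -l)" "$(stat -c %s "$MODEL_PATH")"   # GNU stat
memlock_covers() (
    limit_kib=$1
    size_bytes=$2
    [ "$limit_kib" = "unlimited" ] && exit 0
    need_kib=$(( (size_bytes + 1023) / 1024 ))   # round up to whole KiB
    [ "$limit_kib" -ge "$need_kib" ]
)
```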
5. File permission on NFS prevents the open call
Some NFS configurations with root_squash enabled map the root user to nobody, and nobody may not have read permission on the model file. If llama.cpp is run as root (common in Docker containers), this results in a permission error during model load.
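How to spot it: Run head -c 4 /mnt/nas/models/llama-3.1-70b-Q4_K_M.gguf as the user running llama.cpp. A healthy file prints GGUF; “Permission denied” points to the NFS permission mapping. Running ls -la on the same path shows which owner and mode bits the server reports.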
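An actual read is a better test than a permission-bit check, because a local root user passes [ -r ] even when the server squashes root to nobody. A minimal sketch, with is_gguf as a hypothetical helper name; it relies only on the fact that GGUF files begin with the four ASCII bytes GGUF.

```shell
# Succeed if the file can really be read and starts with the 4-byte GGUF magic.
# Usage: is_gguf /mnt/nas/models/model.gguf || echo "unreadable or not a GGUF file"
is_gguf() {
    [ -r "$1" ] && [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}
```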
6. Network interruption mid-load causes a corrupted mmap state
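If the connection to the NFS server drops during model loading (the initial mmap plus sequential page-load phase), the outcome depends on the mount options. On a hard mount the load simply hangs until the server returns. On a soft mount the stalled read fails with an I/O error once the retries run out, which usually kills a process touching an mmap’d page with SIGBUS. nfs(5) also warns that soft timeouts can cause silent data corruption in some cases, and corrupted weights produce garbage output rather than an error.
How to spot it: Run dmesg | grep -i nfs after a load that stalled, crashed, or finished suspiciously fast. Messages such as “nfs: server nas not responding” mean the network was interrupted while pages were being read.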
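For a launch script or a monitoring job, counting those kernel messages is more useful than reading them. A sketch, assuming the usual Linux client wording "nfs: server <host> not responding"; nfs_timeouts is a hypothetical helper name.

```shell
# Count kernel log lines that report an unresponsive NFS server.
# Reads log text on stdin. Usage: dmesg | nfs_timeouts   (dmesg may need sudo)
nfs_timeouts() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -ci 'nfs: server .* not responding' || true
}
```

Any non-zero count since the model was loaded is a reason to reload it before trusting the output.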
Shortest path to fix
Step 1: Disable mmap and load weights into RAM directly
# The --no-mmap flag reads the entire model into RAM on load
# (slow initial load, but fast inference — no page faults)
./llama-server \
  -m /mnt/nas/models/llama-3.1-70b-Q4_K_M.gguf \
  --no-mmap \
  --n-gpu-layers 80 \
  --port 8080
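With --no-mmap, llama.cpp reads the file sequentially into memory during load (system RAM, plus VRAM for offloaded layers) and then runs without touching the share again. For 40 GB, expect roughly 6 minutes over gigabit Ethernet at about 115 MB/s, or under a minute over 10 GbE. No further NFS I/O occurs during inference.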
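A sequential load is bounded by link throughput, so the wait can be estimated up front. The sketch below is plain integer arithmetic in decimal units; load_seconds is a hypothetical helper name, and the throughput figures in the comments (about 115 MB/s for gigabit Ethernet, about 1100 MB/s for 10 GbE) are rough assumptions rather than measurements.

```shell
# Estimated seconds to read size_gb gigabytes at mb_per_s megabytes per second.
# Decimal units (1 GB = 1000 MB), integer arithmetic, rounded down.
load_seconds() {
    echo $(( $1 * 1000 / $2 ))
}

load_seconds 40 115    # gigabit Ethernet (assumed ~115 MB/s): prints 347
load_seconds 40 1100   # 10 GbE (assumed ~1100 MB/s): prints 36
```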
Step 2: Copy the model to local SSD before loading
# Copy once, then load from local path
mkdir -p "/home/$USER/models"
rsync --progress \
  /mnt/nas/models/llama-3.1-70b-Q4_K_M.gguf \
  "/home/$USER/models/llama-3.1-70b-Q4_K_M.gguf"
./llama-server \
  -m "/home/$USER/models/llama-3.1-70b-Q4_K_M.gguf" \
  --n-gpu-layers 80 \
  --port 8080
This is the fastest option for repeated use. A 40 GB model copied from NFS to local NVMe at 500 MB/s takes about 80 seconds (that rate needs a link faster than gigabit Ethernet) and then loads from local disk in under 30 seconds with mmap.
Step 3: Increase mlock limits if using --no-mmap is insufficient
# Check current mlock limit
ulimit -l
# If "unlimited", mlock is not the issue
# If a number, increase it:
# Temporary (current session; fails if the hard limit is lower, unless run as root)
ulimit -l unlimited
# Permanent (append to /etc/security/limits.conf; takes effect at the next login)
echo "* soft memlock unlimited" | sudo tee -a /etc/security/limits.conf
echo "* hard memlock unlimited" | sudo tee -a /etc/security/limits.conf
Step 4: For Docker-based deployments, mount the local path instead of NFS inside the container
# docker-compose.yml — mount local SSD path, not NFS
services:
  llama:
    image: ghcr.io/ggerganov/llama.cpp:server
    volumes:
      - /home/user/models:/models # local path, not /mnt/nas
    command: >
      -m /models/llama-3.1-70b-Q4_K_M.gguf
      --no-mmap
      --n-gpu-layers 80
      --port 8080
    ports:
      - "8080:8080"
Step 5: If you must use NFS, add the async and rsize mount options
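# /etc/fstab entry for the NFS share (must be one line; fstab has no line continuation)
nas:/models /mnt/nas/models nfs rw,soft,async,rsize=1048576,wsize=1048576,timeo=600,retrans=5 0 0
# Remount (if the new rsize does not show up below, unmount and mount the share again)
sudo mount -o remount /mnt/nas/models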
# Verify rsize
mount | grep nas | grep rsize
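The rsize=1048576 (1 MiB read size) maximizes sequential read throughput for the initial model load with --no-mmap. The server can negotiate a smaller value, so check what the mount actually reports. Note that soft makes a timed-out request fail with an I/O error instead of retrying forever; nfs(5) recommends it only when responsiveness matters more than data integrity.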
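To verify the negotiated value in a script, the rsize can be pulled out of the mount's option string. A sketch, with rsize_of as a hypothetical helper name; the option string is the fourth field of the share's line in /proc/mounts.

```shell
# Print the rsize value from a comma-separated mount option string (nothing if absent).
# Usage: rsize_of "$(awk '$2 == "/mnt/nas/models" { print $4 }' /proc/mounts)"
rsize_of() {
    printf '%s\n' "$1" | tr ',' '\n' | sed -n 's/^rsize=//p'
}
```

If the printed number is smaller than the value requested in fstab, the server capped it and the mount option alone will not raise throughput.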
Prevention
- Store GGUF model files on local NVMe or SSD whenever performance matters — NFS is a network filesystem, not a local storage substitute.
- Always include --no-mmap in launch scripts when the model path is on any network filesystem (NFS, CIFS, sshfs).
- When buying a NAS for model storage, also budget for a local SSD cache: copy models to local disk before use.
- Add a pre-flight check to your launch script that verifies the model file is on a local filesystem before starting:
FSTYPE=$(stat -f -c %T "$MODEL_PATH")  # GNU coreutils stat; prints e.g. nfs, cifs, smb2, fuseblk
if echo "$FSTYPE" | grep -qiE "nfs|cifs|smb|fuse"; then
echo "Warning: model on network filesystem — using --no-mmap"
EXTRA_FLAGS="--no-mmap"
fi
- For shared teams using a NAS, set up a nightly rsync job to pre-cache frequently used models on each workstation’s local disk.
- Monitor dmesg | grep nfs for timeout/reconnect messages that indicate NFS instability during long inference sessions.
- If mmap must work over the network (e.g., low-RAM systems), use NFSv4.1 (with pNFS where the server supports it) and mount with the fsc option, backed by a running cachefilesd, so pages are cached on local disk.
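The copy-before-use advice above can be wrapped in a helper that a launch script or a nightly job calls. This is a sketch under stated assumptions: cache_model and size_of are hypothetical names, cp stands in for rsync to keep the example dependency-free, and a matching byte count is treated as "already cached", which is a weak check compared with a checksum.

```shell
# Print a file's size in bytes (the arithmetic strips any padding wc adds).
size_of() {
    echo $(( $(wc -c < "$1") ))
}

# Copy src into cache_dir unless a same-sized copy already exists there,
# then print the local path. The copy goes to a .partial name first so a
# half-written file is never picked up by a concurrent launch.
# Usage: MODEL_PATH=$(cache_model /mnt/nas/models/model.gguf "$HOME/models")
cache_model() (
    src=$1
    cache_dir=$2
    dst=$cache_dir/$(basename "$src")
    mkdir -p "$cache_dir" || exit 1
    if [ ! -f "$dst" ] || [ "$(size_of "$src")" -ne "$(size_of "$dst")" ]; then
        cp "$src" "$dst.partial" && mv "$dst.partial" "$dst" || exit 1
    fi
    echo "$dst"
)
```

Because the function prints the cached path, the server can be started against local disk with mmap left enabled.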
FAQ
Q: Does --mlock help when the model is on NFS?
A: --mlock calls mlock on the mmap’d region to prevent page eviction. On NFS, even locked pages must be fetched from the network on the first access — mlock only prevents eviction after the initial fault. Using --no-mmap is more reliable because it reads all data upfront into RAM without going through the page fault path.
Q: Can I use a RAM disk as a middle layer between NFS and llama.cpp?
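A: Yes, if the machine has enough RAM for it. Create a tmpfs mount (sudo mkdir -p /mnt/ramdisk && sudo mount -t tmpfs -o size=50G tmpfs /mnt/ramdisk), copy the model there (cp /mnt/nas/models/model.gguf /mnt/ramdisk/), and load from /mnt/ramdisk/. tmpfs pages live in RAM (or swap), so the copy counts against system memory. This gives mmap the local, low-latency filesystem it needs while keeping the master copy on NFS.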
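Before copying tens of gigabytes into tmpfs, it is worth checking that the model fits in available memory with some headroom. A sketch, assuming the Linux /proc/meminfo format; mem_available_kib and fits_in_ram are hypothetical helper names, and the headroom figure is whatever margin you choose to leave for the rest of the system.

```shell
# Print MemAvailable in KiB from /proc/meminfo-formatted text on stdin.
# Usage: avail=$(mem_available_kib < /proc/meminfo)
mem_available_kib() {
    awk '$1 == "MemAvailable:" { print $2 }'
}

# Succeed if size_bytes plus headroom_kib fits within avail_kib.
# Usage: fits_in_ram "$avail" "$(stat -c %s model.gguf)" 4194304   # 4 GiB headroom
fits_in_ram() {
    [ $(( ($2 + 1023) / 1024 + $3 )) -le "$1" ]
}
```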
Q: My model is on a local ext4 SSD but I’m still getting mmap errors — why?
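A: A noexec mount option is not the cause: llama.cpp maps the file read-only, and noexec only blocks executable mappings. Check that the running user can read the file, that the download is complete (a truncated GGUF fails during load), and that no virtual-memory limit (ulimit -v) is smaller than the model. A full volume matters only when writing files, such as during model conversion. Run ls -la model.gguf, ulimit -v, and df -h /path/to/model.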
Q: Does LM Studio have the same NFS mmap issue?
A: Yes. LM Studio uses the same llama.cpp backend, so the same failure modes apply. The most dependable fix is also the same: copy the model to local storage. If your LM Studio version does not offer an mmap toggle in its model load settings, storing models on local disk is the only practical option.