mlx_lm.convert Fails Converting a HuggingFace Model

mlx_lm.convert errors when converting a HuggingFace model to MLX on Apple Silicon: Model type not supported, GatedRepoError 401, or OOM. Fixes verified June 2026.

Published: May 25, 2026 Updated: Jun 21, 2026 Author: AI Productivity Guide Team

You run mlx_lm.convert --hf-path meta-llama/Llama-3.1-8B-Instruct --mlx-path ./mlx-llama31-8b -q on your Apple Silicon Mac and it fails one of three ways: a ValueError: Model type X not supported (often paired with No module named 'mlx_lm.models.X'), a GatedRepoError: 401 Client Error during download, or the process hangs near the end and gets killed by macOS with an out-of-memory crash.

mlx-lm (the LLM layer on top of Apple’s MLX framework) ships an explicit Python class for each architecture it supports, so conversion breaks the moment the model is newer than your installed package, the repo is gated and you are not logged in, or the unmodified weights do not fit in unified memory before quantization runs.

Fastest fix (covers most cases): upgrade with pip install --upgrade mlx-lm mlx, log in via hf auth login for gated repos (Llama, Gemma, some Mistral), then re-run. If the architecture is brand-new and still unsupported on the PyPI release, install mlx-lm from GitHub. If your Mac is short on memory, skip conversion entirely and pull a prebuilt model from mlx-community. As of June 2026 the current release is mlx-lm 0.31.3 (April 22, 2026).

Which bucket are you in?

| Symptom on screen | Most likely cause | Jump to |
| --- | --- | --- |
| `ValueError: Model type X not supported` / `No module named 'mlx_lm.models.X'` | Architecture not in your installed `mlx-lm` | Step 1 |
| `GatedRepoError`, `401 Client Error`, `You are trying to access a gated repo` | Not authenticated / access not approved | Step 2 |
| Hang near 100%, then `Killed: 9` or beachball + OOM | Unquantized weights exceed unified memory | Step 3 |
| `No such file` / shard `model-0000N-of-...safetensors` missing | Interrupted download, partial cache | Step 4 |
| `Error while deserializing` / only `pytorch_model.bin` present | Repo ships PyTorch pickle, not safetensors | Step 5 |
| `ImportError: cannot import name ... from 'mlx.core'` | `mlx` and `mlx-lm` versions mismatched | Step 1 |
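
If you are triaging logs from a script rather than reading them by eye, the table can be expressed as a small lookup. This is a minimal sketch: `triage` and `BUCKETS` are hypothetical names, not part of mlx-lm, and the patterns are only the literal strings quoted in this guide, so unfamiliar messages fall through to `unknown`. The version-mismatch pattern points at Step 1 because that is where both packages are upgraded together.

```python
# Hypothetical helper: maps converter output to the steps in this guide.
# Patterns are the strings quoted in the symptom table, not an official list.
BUCKETS = [
    ("Step 1", ("not supported", "No module named 'mlx_lm.models.")),
    ("Step 2", ("GatedRepoError", "401 Client Error", "gated repo")),
    ("Step 3", ("Killed: 9",)),
    ("Step 4", ("No such file",)),
    ("Step 5", ("Error while deserializing",)),
    ("Step 1", ("from 'mlx.core'",)),  # mlx / mlx-lm mismatch: upgrade both
]


def triage(output: str) -> str:
    """Return the first step whose pattern appears in the output, else 'unknown'."""
    lowered = output.lower()
    for step, needles in BUCKETS:
        if any(needle.lower() in lowered for needle in needles):
            return step
    return "unknown"


print(triage("ValueError: Model type X not supported"))
```

Matching is case-insensitive and first-hit-wins, so keep the more specific patterns near the top if you extend the list.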

Common causes

Ordered by hit rate, highest first.

1. mlx-lm doesn’t have a class for this architecture

mlx-lm keeps one module per architecture under mlx_lm/models/ and matches it to the model_type field in the repo’s config.json. If there is no matching module, the loader raises ValueError: Model type X not supported and often No module named 'mlx_lm.models.X'. This is the single most common failure, and it hits new releases first: in 2025-2026 it was reported for qwen3_moe, minimax, gemma4_unified, minicpmv, and other architectures before support landed.

How to spot it: check the model’s model_type and compare it to your install:

python3 -c "
import os, mlx_lm.models as m
d = os.path.dirname(m.__file__)
print(sorted(f[:-3] for f in os.listdir(d) if f.endswith('.py') and not f.startswith('_')))
"

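If model_type is not in that list, your installed mlx-lm most likely cannot convert it. (mlx-lm aliases a few model types to another architecture's module, so a missing name is a strong hint rather than proof.) Upgrade (Step 1); if the architecture only landed on main, install from GitHub.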
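
To avoid comparing the two by eye, the check can be scripted. The sketch below assumes you have the repo's config.json on disk; the function names are hypothetical, and the comparison is a plain name match against the files in mlx_lm/models/, which mirrors the one-liner above rather than calling any mlx-lm API.

```python
import json
import os


def read_model_type(config_path):
    """Return the model_type field from a HuggingFace config.json."""
    with open(config_path, encoding="utf-8") as fh:
        return json.load(fh).get("model_type")


def supported_types(models_dir):
    """List architecture module names found in a models directory."""
    return sorted(
        name[:-3]
        for name in os.listdir(models_dir)
        if name.endswith(".py") and not name.startswith("_")
    )


def is_supported(config_path, models_dir):
    # Name match only; any aliasing done inside mlx-lm is not modeled here.
    return read_model_type(config_path) in supported_types(models_dir)


def installed_models_dir():
    """Locate mlx_lm/models in this environment, or None if mlx-lm is absent."""
    try:
        import mlx_lm.models as m
    except ImportError:
        return None
    return os.path.dirname(m.__file__)


print("mlx_lm models dir:", installed_models_dir())
```

With mlx-lm installed, `is_supported("./llama31-8b-hf/config.json", installed_models_dir())` answers the question for a locally downloaded repo (the path is an example).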

2. Gated HuggingFace repo, not authenticated

Llama 3.x, Gemma, and some Mistral repos are gated: you must accept the license on the model page and be logged in. Without a valid token, mlx_lm.convert fails during download with GatedRepoError, 401 Client Error, or Cannot access gated repo ... You are trying to access a gated repo. Make sure to request access.

How to spot it: run hf auth whoami. If it prints Not logged in, that is your problem. Note the CLI was renamed from huggingface-cli to hf in mid-2025; the old huggingface-cli login/whoami still work but print a deprecation notice pointing you to hf auth ....

3. OOM — full-precision weights don’t fit before quantization

mlx_lm.convert loads the model in its stored precision (the torch_dtype from config.json, which for modern Llama/Qwen/Mistral/Gemma is bfloat16), then quantizes. So peak memory is driven by the bf16 size, not fp32. Rough math: bf16 size in GB is roughly the parameter count in billions multiplied by 2 (2 bytes per weight). A 70B model is about 140 GB in bf16, and conversion needs headroom on top of that, so it OOMs on a 128 GB Mac and only fits on a 192 GB or larger Studio/Ultra. macOS kills the process (Killed: 9) or thrashes into the beachball.

How to spot it: before converting, check free memory in Activity Monitor (Memory tab, watch Memory Pressure) or run vm_stat; during a stalled run, vm_stat 1 showing large, growing Pageouts means you are out of memory.
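
The rough math above is easy to script before you start a long download. This is a back-of-the-envelope sketch, not a measurement: the function names are hypothetical, the `headroom` factor is an assumed safety margin, and the quantized estimate assumes each group stores a 16-bit scale and a 16-bit bias on top of the packed weights.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def bf16_gb(params_billion):
    # 16 bits = 2 bytes per weight, the rule of thumb from the text
    return weights_gb(params_billion, 16)


def quantized_gb(params_billion, bits=4, group_size=64):
    # Assumption: one 16-bit scale and one 16-bit bias per group of weights
    return weights_gb(params_billion, bits + 32 / group_size)


def fits_for_conversion(params_billion, unified_memory_gb, headroom=1.2):
    # headroom is a hypothetical safety factor, not a documented requirement
    return bf16_gb(params_billion) * headroom <= unified_memory_gb


for size in (8, 13, 70):
    print(f"{size}B: bf16 ~{bf16_gb(size):.0f} GB, 4-bit ~{quantized_gb(size):.1f} GB")
print("70B on 128 GB:", fits_for_conversion(70, 128))
print("70B on 192 GB:", fits_for_conversion(70, 192))
```

Treat a "fits" result as necessary rather than sufficient: whatever else is running on the Mac shares the same unified memory.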

4. Sharded model not fully downloaded

Large repos split weights into model-00001-of-0000N.safetensors shards. An interrupted download leaves some shards missing; the converter loads the first ones and then fails on the missing file.

How to spot it: compare the shards on disk to the manifest. The repo’s model.safetensors.index.json lists every expected shard:

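# Name the repo explicitly; models--* would count shards from every cached model
ls ~/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/*/model-*.safetensors | wc -l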

If that count is lower than the shards referenced in index.json, re-download (Step 4).
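
Counting tells you that something is missing; comparing by name tells you which shard. A minimal sketch, assuming the directory contains a model.safetensors.index.json whose `weight_map` maps each tensor name to its shard filename (`missing_shards` is a hypothetical helper name):

```python
import json
import os


def missing_shards(snapshot_dir):
    """Return shard filenames listed in the index but absent from snapshot_dir."""
    index_path = os.path.join(snapshot_dir, "model.safetensors.index.json")
    with open(index_path, encoding="utf-8") as fh:
        weight_map = json.load(fh)["weight_map"]  # tensor name -> shard filename
    expected = set(weight_map.values())
    # os.path.exists follows symlinks, so a dangling cache link counts as missing
    return sorted(
        name for name in expected
        if not os.path.exists(os.path.join(snapshot_dir, name))
    )
```

Call it as `missing_shards("./llama31-8b-hf")` (example path); an empty list means every shard named in the index is present.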

5. Repo ships PyTorch pickle, not safetensors

Some older or custom repos only contain pytorch_model.bin (Python pickle) with no model.safetensors. mlx_lm.convert expects safetensors; with only .bin files it can fail to load or fail on unexpected weight key names. (Note: mlx_lm.convert does not accept GGUF as input at all — see the FAQ.)

How to spot it: ls ~/.cache/huggingface/hub/models--*/snapshots/*/ — if you see pytorch_model*.bin and no *.safetensors, convert the weights to safetensors first (Step 5).

6. mlx and mlx-lm versions mismatched

mlx-lm pins a compatible range of the core mlx package. Upgrading one without the other can break imports or produce wrong results.

How to spot it: an ImportError: cannot import name 'X' from 'mlx.core' is the tell. Run pip show mlx mlx-lm and upgrade them together (Step 1).
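
The same information as `pip show` is available from Python, along with the mlx range that your installed mlx-lm declares. The sketch uses only the standard library's `importlib.metadata`; the helper names are hypothetical, and whether a requirement on mlx is listed at all depends on the package metadata of the version you have installed.

```python
import re
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def mlx_requirements(requirement_strings):
    """Keep only the requirement lines whose package name is exactly 'mlx'."""
    keep = []
    for req in requirement_strings:
        name = re.split(r"[\s<>=!~;(\[]", req, maxsplit=1)[0]
        if name.lower() == "mlx":
            keep.append(req)
    return keep


def declared_mlx_requirement():
    """Return the mlx requirement(s) declared by the installed mlx-lm, if any."""
    try:
        return mlx_requirements(metadata.requires("mlx-lm") or [])
    except metadata.PackageNotFoundError:
        return []


print("mlx:", installed_version("mlx"))
print("mlx-lm:", installed_version("mlx-lm"))
print("mlx-lm requires:", declared_mlx_requirement())
```

If the installed mlx version falls outside the printed range, that is the mismatch to fix.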

Shortest path to fix

Step 1: Upgrade mlx-lm and mlx together

# Upgrade both in one shot so their versions stay compatible
pip install --upgrade mlx-lm mlx

# Confirm the installed version (0.31.3 as of June 2026)
python3 -c "import mlx_lm; print(mlx_lm.__version__)"

# Confirm your architecture now has a module
python3 -c "
import os, mlx_lm.models as m
d = os.path.dirname(m.__file__)
print(sorted(f[:-3] for f in os.listdir(d) if f.endswith('.py') and not f.startswith('_')))
"

If the architecture only landed on main and is not in the PyPI release yet, install the development build:

pip install --upgrade "mlx-lm @ git+https://github.com/ml-explore/mlx-lm.git"

Step 2: Authenticate with HuggingFace for gated repos

# New CLI (huggingface-cli still works but is deprecated)
hf auth login
# Paste a token from https://huggingface.co/settings/tokens (read scope is enough)

# Verify you are logged in
hf auth whoami

# For gated models you must ALSO accept the license on the model page, e.g.
# https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct  -> click "Agree and access"

If hf auth whoami succeeds but you still get a 401, the access request for that specific repo has not been approved yet — approval can take minutes to days. Confirm the green “You have been granted access” banner on the model page.

Step 3: Convert with quantization, and pick a dtype you can fit

-q (short for --quantize) produces a 4-bit model by default (group size 64, affine quantization). This is the lowest-memory output. Conversion still loads the source weights first, so the bf16 size is your floor.

# 8B model: convert and quantize to 4-bit in one step
mlx_lm.convert \
  --hf-path meta-llama/Llama-3.1-8B-Instruct \
  --mlx-path ./mlx-llama31-8b-4bit \
  -q --q-bits 4 --q-group-size 64

# Unquantized output in bf16 (saves space only if the source is stored in float32):
mlx_lm.convert \
  --hf-path meta-llama/Llama-3.1-8B-Instruct \
  --mlx-path ./mlx-llama31-8b-bf16 \
  --dtype bfloat16

If a model converts in bf16 but fails only when you add -q, the problem is in the quantization step, not architecture support — try --q-group-size 32, or skip quantization and run bf16.
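
That fallback order can be written down as a short script. This is a sketch under stated assumptions: `convert_attempts` and `run_first_success` are hypothetical helpers, the output folder suffixes are made up for illustration, and only flags already shown in this guide are used. Each attempt writes to its own folder so a failed run does not block the next one.

```python
import shlex
import subprocess


def convert_attempts(hf_path, out_prefix):
    """Commands to try in order: 4-bit (group 64), 4-bit (group 32), then bf16."""
    base = ["mlx_lm.convert", "--hf-path", hf_path]
    return [
        base + ["--mlx-path", f"{out_prefix}-4bit",
                "-q", "--q-bits", "4", "--q-group-size", "64"],
        base + ["--mlx-path", f"{out_prefix}-4bit-g32",
                "-q", "--q-bits", "4", "--q-group-size", "32"],
        base + ["--mlx-path", f"{out_prefix}-bf16", "--dtype", "bfloat16"],
    ]


def run_first_success(attempts, runner=subprocess.run):
    """Run each command until one exits with code 0; return its index or None."""
    for index, cmd in enumerate(attempts):
        if runner(cmd).returncode == 0:
            return index
    return None


# Print the plan without running anything
for cmd in convert_attempts("meta-llama/Llama-3.1-8B-Instruct", "./mlx-llama31-8b"):
    print(shlex.join(cmd))
```

Passing the list to `run_first_success` executes the plan; the `runner` parameter exists only so the logic can be exercised without launching real conversions.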

If the bf16 weights simply do not fit in your unified memory, do not try to convert locally. Download a prebuilt MLX model from mlx-community (Step 6) or quantize on a larger machine.

Step 4: Pre-download all shards before converting

# Fetch every shard up front so an interrupted run can't leave the cache partial
python3 << 'EOF'
from huggingface_hub import snapshot_download
snapshot_download(
    "meta-llama/Llama-3.1-8B-Instruct",
    local_dir="./llama31-8b-hf",
    ignore_patterns=["*.msgpack", "*.h5", "*.bin"],  # safetensors only
)
print("Download complete")
EOF

# Then convert from the local directory
mlx_lm.convert \
  --hf-path ./llama31-8b-hf \
  --mlx-path ./mlx-llama31-8b-4bit \
  -q --q-bits 4

Step 5: Convert PyTorch .bin weights to safetensors first

pip install safetensors torch

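python3 << 'EOF'
import os
import torch
from safetensors.torch import save_file

model_path = "./model-path"  # local folder holding the pytorch_model*.bin files
for f in sorted(os.listdir(model_path)):
    # Only weight files: other pickles (e.g. training_args.bin) are not state dicts
    if f.startswith("pytorch_model") and f.endswith(".bin"):
        # weights_only=True refuses to unpickle arbitrary Python objects
        sd = torch.load(os.path.join(model_path, f), map_location="cpu", weights_only=True)
        # save_file rejects tensors that share memory (tied weights), so copy each one
        sd = {k: v.clone().contiguous() for k, v in sd.items()}
        # Use the usual model*.safetensors naming (assumed to be what mlx-lm looks for)
        out = f.replace("pytorch_model", "model", 1).replace(".bin", ".safetensors")
        save_file(sd, os.path.join(model_path, out), metadata={"format": "pt"})
        print(f"Converted {f} -> {out}")
EOF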

After this, point --hf-path at the local folder containing the new .safetensors files.

Step 6: Skip conversion — pull a prebuilt MLX model

If conversion keeps fighting you (unsupported architecture, OOM, gated access), the mlx-community org on HuggingFace already publishes thousands of pre-converted MLX models. mlx_lm.generate and mlx_lm.load download them on first use:

# 4-bit (smallest footprint)
mlx_lm.generate \
  --model mlx-community/Llama-3.1-8B-Instruct-4bit \
  --prompt "What is 2+2?" --max-tokens 20

# 8-bit (higher quality, larger)
mlx_lm.generate \
  --model mlx-community/Llama-3.1-8B-Instruct-8bit \
  --prompt "What is 2+2?" --max-tokens 20

How to confirm it’s fixed

Run a quick generation against your converted folder, then check the weights are actually quantized:

# 1. It should produce coherent text, not garbage or an error
mlx_lm.generate --model ./mlx-llama31-8b-4bit --prompt "Name three primary colors." --max-tokens 30

# 2. Confirm quantization actually applied (a 4-bit model carries quantization config)
python3 -c "
import json
with open('./mlx-llama31-8b-4bit/config.json') as fh:
    cfg = json.load(fh)
print('quantization:', cfg.get('quantization'))
"

If quantization is None, the -q flag did not take effect and the model is full-size — re-run Step 3 with -q.
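
For a slightly more readable check, the sketch below reports the bit width and group size and adds up the size of the weight files on disk, which should land near the 4-bit estimate rather than the bf16 one. It assumes the converted folder holds config.json and one or more .safetensors files, and that the `quantization` entry carries `bits` and `group_size` keys; the function names are hypothetical.

```python
import json
import os


def quantization_summary(model_dir):
    """Describe the quantization block in config.json, if there is one."""
    with open(os.path.join(model_dir, "config.json"), encoding="utf-8") as fh:
        quant = json.load(fh).get("quantization")
    if not quant:
        return "not quantized"
    return f"{quant.get('bits')}-bit, group size {quant.get('group_size')}"


def weights_size_gb(model_dir):
    """Total size of the .safetensors files in model_dir, in GB (1e9 bytes)."""
    total = sum(
        os.path.getsize(os.path.join(model_dir, name))
        for name in os.listdir(model_dir)
        if name.endswith(".safetensors")
    )
    return total / 1e9
```

For example, `quantization_summary("./mlx-llama31-8b-4bit")` followed by `weights_size_gb("./mlx-llama31-8b-4bit")` gives both signals in one place.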

Prevention

Pin mlx-lm and mlx together in requirements.txt and upgrade both at once; a lone mlx bump is a common source of ImportError.
Before converting a brand-new model, search mlx-lm GitHub issues for its model_type — unsupported architectures are usually tracked there with the release that adds them.
Always run hf auth whoami before a long conversion so you do not waste a 30-minute download on an expired token.
Estimate memory up front: bf16 size in GB is roughly parameters-in-billions times 2. If that exceeds your free unified memory, grab a prebuilt model from mlx-community instead of converting locally.
Use snapshot_download to fetch all shards before converting so an interrupted download cannot leave a partial cache.
After conversion, run a short generation test and verify config.json shows the expected quantization before deleting the HuggingFace source weights.

FAQ

Q: Can I convert a GGUF model to MLX format? A: Not with mlx_lm.convert — it only accepts HuggingFace safetensors (or PyTorch .bin) as input, not GGUF. You would first convert GGUF back to a HuggingFace repo with llama.cpp’s conversion scripts, then run mlx_lm.convert, which means double quantization and quality loss. It is almost always better to download the original safetensors repo, or grab a prebuilt model from mlx-community.

Q: I get ValueError: Model type X not supported even after upgrading. Now what? A: The PyPI release may lag main by weeks. Install the dev build with pip install --upgrade "mlx-lm @ git+https://github.com/ml-explore/mlx-lm.git". If main does not support it either, the architecture simply has not been ported yet — check the GitHub issues, or use a GGUF build via llama.cpp/Ollama in the meantime.

Q: Is MLX 4-bit slower than bf16 on Apple Silicon? A: Usually not for generation. Token generation is typically limited by memory bandwidth, so a 4-bit model, which reads roughly a quarter of the bytes per token, tends to generate faster than bf16; dequantization overhead matters mainly during prompt processing. 4-bit is also often the only option that fits: a 13B model is roughly 26 GB in bf16 and about 7-8 GB at 4-bit, so on a 16 GB Mac, 4-bit is the only one that loads at all.

Q: The converted model runs but is very slow — why? A: Confirm it is actually quantized. Run the config.json check in “How to confirm it’s fixed.” If quantization is None, the -q flag was missing and you are running full-precision weights; re-convert with -q --q-bits 4. Also make sure other apps are not pushing you into Memory Pressure, which forces swap.

Q: Do I have to log in for every model? A: No. hf auth login stores a token once. You only need per-model access approval for gated repos (Llama, Gemma, some Mistral). Open repos like most mlx-community and Qwen models need no login at all.
