MLX Conversion From HuggingFace Fails

mlx_lm.convert fails when converting a HuggingFace model to MLX format on Apple Silicon. Fix architecture support, dtype mismatches, and memory limits during conversion.

You run mlx_lm.convert --hf-path meta-llama/Meta-Llama-3.1-8B-Instruct --mlx-path mlx-llama31-8b --quantize --q-bits 4 on your M3 Max and hit one of three failures: KeyError: 'rope_type', NotImplementedError: Architecture llama3 is not supported, or a conversion that hangs at 98% and then crashes with an out-of-memory error. MLX is Apple’s machine learning framework for Apple Silicon, and mlx-lm is the LLM package built on top of it. Conversion usually fails for one of three reasons: the model architecture is newer than the installed mlx-lm version, the HuggingFace model is gated and you are not authenticated, or loading fp32 weights before quantization exhausts unified memory.

Common causes

Ordered by hit rate, highest first.

1. mlx-lm version doesn’t support the model architecture

Each mlx-lm release adds support for new architectures. Support for Llama 3.1’s rope_type=llama3 field arrived in mlx-lm 0.12, and support for Qwen2.5’s sliding window attention in 0.15. If your model uses an architecture feature added after your installed version, the converter throws a KeyError or NotImplementedError.

How to spot it: Run pip show mlx-lm | grep Version. Then check the mlx-lm GitHub releases page for the version that first added your model’s architecture. If your installed version is older, upgrade.
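The version comparison can also be scripted. The following is a minimal sketch, assuming plain dotted numeric versions; the 0.12 threshold is the Llama 3.1 figure quoted above, and the helper names are invented for illustration.

```python
from importlib import metadata


def parse_version(text):
    """Turn '0.12.1' into (0, 12, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_at_least(installed, required):
    """True if the installed version is >= the required one."""
    a, b = parse_version(installed), parse_version(required)
    width = max(len(a), len(b))
    # Pad with zeros so '0.12' and '0.12.0' compare as equal
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_version(package="mlx-lm"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("mlx-lm is not installed")
    else:
        verdict = "new enough" if is_at_least(found, "0.12") else "too old"
        print(f"mlx-lm {found} is {verdict} for Llama 3.1")
```

Swap the threshold for whichever release first added your model's architecture.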

2. HuggingFace gated model not authenticated

Llama 3.x, Mistral-7B, and Gemma models require a HuggingFace login and an approved access request. If huggingface-cli is not authenticated, the Hub refuses the download (typically with a 401 or 403 response), and mlx_lm.convert may then fail during weight loading with a confusing error like Invalid file format or Unable to read tensor.

How to spot it: Run huggingface-cli whoami. If it shows “Not logged in,” run huggingface-cli login and re-request access for gated models.
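For an offline pre-check, a script can at least confirm that a token is stored locally. This sketch assumes the default huggingface_hub layout (the HF_TOKEN environment variable, or a token file under HF_HOME, which defaults to ~/.cache/huggingface); the function name is hypothetical. It does not prove the token is valid or that access to a gated repository was granted; only whoami and an actual download can confirm that.

```python
import os
from pathlib import Path


def find_hf_token(env=None, home=None):
    """Return 'env' or 'file' depending on where a token was found, else None."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    if env.get("HF_TOKEN"):
        return "env"
    # Assumed default location; HF_HOME overrides the cache root
    hf_home = Path(env.get("HF_HOME", home / ".cache" / "huggingface"))
    token_file = hf_home / "token"
    if token_file.is_file() and token_file.read_text().strip():
        return "file"
    return None


if __name__ == "__main__":
    print(find_hf_token() or "no token found; run huggingface-cli login")
```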

3. OOM during conversion — weights loaded in fp32 before quantization

MLX conversion loads the full fp32 model into unified memory before quantizing. A 70B model in fp32 requires ~280 GB (70 billion parameters × 4 bytes), which exceeds even a Mac with 192 GB of unified memory. The conversion process crashes at the out-of-memory point, often midway through loading shards.

How to spot it: Check your Mac’s free memory with vm_stat or Activity Monitor before starting conversion. If free + compressed memory is less than the model’s fp32 size, the conversion will OOM.
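The size comparison is simple arithmetic: parameter count times bytes per parameter. A small sketch of that estimate follows; the 0.8 headroom factor is an arbitrary safety margin chosen for illustration, not an MLX setting.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0}


def weights_gb(num_params, dtype="fp32"):
    """Size of the raw weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def fits(num_params, dtype, free_gb, headroom=0.8):
    """True if the weights fit within a fraction of the free memory."""
    return weights_gb(num_params, dtype) <= free_gb * headroom


if __name__ == "__main__":
    for name, params in [("8B", 8e9), ("70B", 70e9)]:
        sizes = {d: round(weights_gb(params, d)) for d in BYTES_PER_PARAM}
        print(name, sizes)
```

Pass the free memory you read from Activity Monitor as free_gb to get a quick go/no-go answer before starting a long conversion.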

4. Sharded model not all downloaded (partial HuggingFace cache)

Many large models on HuggingFace Hub are split across multiple model-00001-of-00008.safetensors shards. If the download was interrupted, some shards are missing. The converter loads the first N shards successfully and then crashes on the missing shard.

How to spot it: Check ~/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3.1-70B-Instruct/snapshots/*/model-*.safetensors and count the files. Compare against the total shard count listed in the model’s HuggingFace model.safetensors.index.json.
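Counting by hand is error-prone, so the comparison against the index can be automated. This sketch assumes the standard index layout, in which model.safetensors.index.json holds a weight_map from tensor names to shard filenames; the function name is made up for this example.

```python
import json
import sys
from pathlib import Path


def missing_shards(snapshot_dir, index_name="model.safetensors.index.json"):
    """List shard files named in the index that are absent on disk."""
    snapshot = Path(snapshot_dir)
    index = json.loads((snapshot / index_name).read_text())
    expected = sorted(set(index["weight_map"].values()))
    # exists() follows symlinks, so a dangling cache link counts as missing
    return [name for name in expected if not (snapshot / name).exists()]


if __name__ == "__main__":
    if len(sys.argv) > 1:
        gaps = missing_shards(sys.argv[1])
        print("all shards present" if not gaps else f"missing: {gaps}")
```

Point it at the snapshot directory that contains the index file.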

5. Incompatible safetensors format — model uses PyTorch pickle format

Some older HuggingFace models use pytorch_model.bin (PyTorch pickle format) instead of the newer model.safetensors. The mlx_lm.convert script may fail to load .bin files, especially if the weight key names differ from the expected pattern.

How to spot it: Run ls ~/.cache/huggingface/hub/models--*/snapshots/*/*.bin. If the weight files are pytorch_model*.bin rather than .safetensors, the model may need explicit format handling.
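The same check works from a script when you are validating several model folders. A minimal sketch, assuming the usual HuggingFace file naming (model*.safetensors and pytorch_model*.bin); the function name and return labels are illustrative.

```python
from pathlib import Path


def weight_format(model_dir):
    """Return 'safetensors', 'pytorch_bin', or 'none' for a model folder."""
    folder = Path(model_dir)
    if any(folder.glob("*.safetensors")):
        return "safetensors"
    if any(folder.glob("pytorch_model*.bin")):
        return "pytorch_bin"
    return "none"
```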

6. mlx version incompatible with mlx-lm version

mlx-lm depends on a specific version range of the core mlx package. If you updated mlx via pip without also updating mlx-lm (or vice versa), the two may be incompatible, causing import errors or silent calculation failures during conversion.

How to spot it: Run pip show mlx mlx-lm to see both installed versions, then run pip check, which reports any installed package whose declared dependency range is not satisfied. Also check for ImportError: cannot import name 'X' from 'mlx.core'.
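To see exactly which mlx range the installed mlx-lm declares, you can read its package metadata. This is a sketch using only the standard library; the helper names are hypothetical, and the name matching is a simplification of full requirement parsing.

```python
import re
from importlib import metadata


def normalize(name):
    """Normalize a distribution name (case-insensitive, -, _ and . equal)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def find_requirement(requirements, package):
    """Return the requirement string declared for `package`, or None."""
    for req in requirements or []:
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", req.strip())
        if match and normalize(match.group(0)) == normalize(package):
            return req
    return None


if __name__ == "__main__":
    try:
        declared = find_requirement(metadata.requires("mlx-lm"), "mlx")
        print("mlx-lm declares:", declared)
        print("installed mlx:", metadata.version("mlx"))
    except metadata.PackageNotFoundError as exc:
        print("not installed:", exc)
```

If the installed mlx version falls outside the declared range, upgrade both packages together.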

Shortest path to fix

Step 1: Update mlx-lm to the latest version

pip install --upgrade mlx-lm mlx

# Verify the update
python3 -c "import mlx_lm; print(mlx_lm.__version__)"

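# List the architecture modules bundled with the installed mlx-lm.
# Each module under mlx_lm.models generally corresponds to a model_type;
# the list also includes a few shared helper modules.
python3 -c "
import pkgutil, mlx_lm.models
print(sorted(m.name for m in pkgutil.iter_modules(mlx_lm.models.__path__)))
"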

Step 2: Authenticate with HuggingFace for gated models

# Login (stores the token at ~/.cache/huggingface/token by default)
huggingface-cli login
# Enter your HuggingFace access token from https://huggingface.co/settings/tokens

# Verify authentication
huggingface-cli whoami

# For Llama models, also request access at:
# https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct

Step 3: Run conversion with --quantize to reduce peak memory

# Convert and quantize in one step (lower peak memory than convert-then-quantize)
mlx_lm.convert \
  --hf-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --mlx-path ./mlx-llama31-8b-4bit \
  --quantize \
  --q-bits 4 \
  --q-group-size 64

# For 70B models, the same command applies. A smaller group size (32)
# stores more scales, trading a slightly larger output for better accuracy;
# it does not lower peak memory during conversion:
mlx_lm.convert \
  --hf-path meta-llama/Meta-Llama-3.1-70B-Instruct \
  --mlx-path ./mlx-llama31-70b-4bit \
  --quantize \
  --q-bits 4 \
  --q-group-size 32
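To see what --q-group-size changes, it helps to estimate the stored bits per weight. The sketch below assumes group-wise quantization in which each group keeps one 16-bit scale and one 16-bit bias alongside the packed weights; it ignores layers left unquantized, so treat the result as a rough lower bound on output size rather than an exact figure.

```python
def bits_per_weight(q_bits, group_size, meta_bits=16):
    """Average stored bits per weight under group-wise quantization.

    Assumes each group keeps one scale and one bias of `meta_bits` each.
    """
    return q_bits + 2 * meta_bits / group_size


def quantized_gb(num_params, q_bits, group_size):
    """Approximate size of the quantized weights in decimal gigabytes."""
    return num_params * bits_per_weight(q_bits, group_size) / 8 / 1e9


if __name__ == "__main__":
    for group in (64, 32):
        size = quantized_gb(70e9, 4, group)
        print(f"70B at 4-bit, group size {group}: about {size:.1f} GB")
```

Under these assumptions, halving the group size doubles the per-group overhead, so the output grows rather than shrinks.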

Step 4: Pre-download all model shards before converting

# Download all shards explicitly before conversion
python3 << 'EOF'
from huggingface_hub import snapshot_download
snapshot_download(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    local_dir="./llama31-8b-hf",
    ignore_patterns=["*.msgpack", "*.h5"],  # Skip non-safetensors formats
)
print("Download complete")
EOF

# Then convert from the local directory
mlx_lm.convert \
  --hf-path ./llama31-8b-hf \
  --mlx-path ./mlx-llama31-8b-4bit \
  --quantize \
  --q-bits 4

Step 5: For PyTorch .bin format, convert to safetensors first

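# Install the conversion dependencies
pip install safetensors torch

# Convert pytorch_model*.bin weight files to model*.safetensors
python3 << 'EOF'
import os

import torch
from safetensors.torch import save_file

model_path = "./model-path"
for bin_file in sorted(os.listdir(model_path)):
    # Only weight files; skips extras such as training_args.bin
    if not (bin_file.startswith("pytorch_model") and bin_file.endswith(".bin")):
        continue
    # weights_only=True refuses to unpickle arbitrary objects
    state_dict = torch.load(
        os.path.join(model_path, bin_file), map_location="cpu", weights_only=True
    )
    # save_file rejects tensors that share memory (tied weights), so copy each one
    state_dict = {k: v.clone().contiguous() for k, v in state_dict.items()}
    # Follow the HuggingFace naming for safetensors weights (model*.safetensors)
    stem = bin_file[: -len(".bin")].replace("pytorch_model", "model", 1)
    out_file = stem + ".safetensors"
    save_file(state_dict, os.path.join(model_path, out_file), metadata={"format": "pt"})
    print(f"Converted {bin_file} -> {out_file}")
EOF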

Step 6: Verify the converted model loads and runs correctly

# Quick generation test on the converted model
mlx_lm.generate \
  --model ./mlx-llama31-8b-4bit \
  --prompt "What is 2+2?" \
  --max-tokens 20

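# Check the model's stored parameter count and weight memory
python3 << 'EOF'
from mlx.utils import tree_flatten
from mlx_lm import load

model, tokenizer = load("./mlx-llama31-8b-4bit")
# model.parameters() is a nested dict, so flatten it to (name, array) pairs
params = tree_flatten(model.parameters())
# Quantized weights are packed, so this counts stored elements, not original weights
total_elements = sum(v.size for _, v in params)
total_bytes = sum(v.nbytes for _, v in params)
print(f"Stored elements: {total_elements:,}")
print(f"Weight memory: {total_bytes / 1e9:.2f} GB")
EOF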
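A lighter check that does not load the weights is to read the converted model's config.json. This sketch assumes the converter records its settings under a quantization key (with bits and group_size) when --quantize was used; the function name is illustrative.

```python
import json
import sys
from pathlib import Path


def quantization_settings(model_dir):
    """Return the 'quantization' entry from config.json, or None if absent."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return config.get("quantization")


if __name__ == "__main__":
    if len(sys.argv) > 1:
        settings = quantization_settings(sys.argv[1])
        print(settings if settings else "no quantization entry found")
```

A missing entry suggests the model was saved at full or half precision.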

Prevention

  • Pin mlx-lm and mlx versions together in requirements.txt and update both simultaneously.
  • Before converting a new model, check mlx-lm GitHub issues or the model’s HuggingFace page for known MLX conversion issues.
  • Always run huggingface-cli whoami before starting a long conversion to confirm authentication is valid.
  • Pre-calculate peak memory needs: fp32 weight size = model parameter count × 4 bytes. For a 70B model, that’s 280 GB, more than a 192 GB Mac can hold at once, so conversion is only feasible if weights are quantized as shards load (the --quantize flag) rather than held in fp32 all together.
  • Use snapshot_download to pre-fetch all shards before conversion to avoid mid-conversion interruption failures.
  • Keep a model-to-mlx-version compatibility note file so you know which mlx-lm version was used for each converted model.
  • After conversion, always run a short generation test before deleting the HuggingFace source weights.
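The compatibility note from the list above can be written automatically at conversion time. This is one possible sketch: the conversion_note.json filename and the field names are invented for this example, and the versions come from the installed package metadata.

```python
import json
from datetime import date
from importlib import metadata
from pathlib import Path


def package_versions(names=("mlx", "mlx-lm")):
    """Map each package to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def write_conversion_note(model_dir, source_repo, versions=None):
    """Write conversion_note.json (a made-up filename) into model_dir."""
    note = {
        "source": source_repo,
        "converted_on": date.today().isoformat(),
        "versions": package_versions() if versions is None else versions,
    }
    path = Path(model_dir) / "conversion_note.json"
    path.write_text(json.dumps(note, indent=2))
    return path
```

Call write_conversion_note right after a successful conversion so the note travels with the model directory.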

FAQ

Q: Can I convert a GGUF model to MLX format instead of converting from HuggingFace? A: Not directly. mlx_lm.convert takes a HuggingFace-format model as input. One workaround is to load the GGUF file with transformers (from_pretrained accepts a gguf_file argument for supported architectures and dequantizes the weights), save the result with save_pretrained, then convert that to MLX. This is slower, and quantizing already-quantized weights a second time compounds the quality loss. Download from HuggingFace directly when possible.

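Q: How much slower is MLX with Q4 quantization vs. fp16? A: It depends on the phase, and Q4 is often not slower at all. Token generation is typically limited by memory bandwidth, so 4-bit weights, which move roughly a quarter of the bytes of fp16, usually generate tokens faster. Prompt processing is compute-bound, and there dequantization overhead can make Q4 somewhat slower than fp16. Benchmark both on your own machine. Q4 also allows running models that would not fit in unified memory at fp16: for a 13B model on a 16 GB M2, Q4 (about 7.5 GB) is the only option, because fp16 (26 GB) exceeds available memory.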

Q: Does mlx-lm support vision models like LLaVA or Qwen-VL? A: mlx-lm targets text-only language models. Vision-language models such as LLaVA and Qwen-VL are handled by the separate mlx-vlm package, which provides its own conversion command. Not every vision architecture is covered, so check that package’s list of supported models before attempting conversion.

Q: The converted MLX model generates correct text but very slowly (2-3 t/s on M3 Max). Why? A: Check whether the model is stored in 8-bit or 16-bit precision instead of 4-bit. Open config.json in the converted model directory and look for a quantization entry, which records bits and group_size. If the entry is missing, the --quantize flag wasn’t applied during conversion; if bits is 8, the model was quantized at 8-bit. Re-convert with --quantize --q-bits 4.

Tags: #local-llm #ollama #Troubleshooting