MLX Local Inference Stack
Local AI inference on Apple Silicon. oMLX handles LLM/VLM with continuous batching.
Python libraries handle Embedding/ASR/OCR directly via uv.
Architecture
┌─────────────────────────────────────┐
│ oMLX (localhost:8000/v1)            │
│ - LLM (Qwen3.5-35B, etc.)           │
│ - VLM (vision-language models)      │
│ - Continuous batching + SSD cache   │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Python Libraries (via uv run)       │
│ - mlx-lm: Embedding                 │
│ - mlx-vlm: OCR (PaddleOCR-VL)       │
│ - mlx-audio: ASR (Qwen3-ASR)        │
└─────────────────────────────────────┘
Models
| Capability | Implementation | Model | Size |
|---|---|---|---|
| 💬 LLM | oMLX API | Qwen3.5-35B-A3B-4bit | ~20 GB |
| 👁️ VLM | oMLX API | Any mlx-vlm model | varies |
| 📐 Embed | mlx-lm (uv) | Qwen3-Embedding-0.6B-4bit-DWQ | ~1 GB |
| 🎤 ASR | mlx-audio (uv) | Qwen3-ASR-1.7B-8bit | ~1.5 GB |
| 👁️ OCR | mlx-vlm (uv) | PaddleOCR-VL-1.5-6bit | ~3.3 GB |
Usage
LLM / Vision-Language (via oMLX API)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
# Text generation
resp = client.chat.completions.create(
    model="Qwen3.5-35B-A3B-4bit",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
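The section covers vision-language models but only shows a text request. The sketch below illustrates how an image might be sent through the same endpoint, assuming oMLX accepts OpenAI-style `image_url` content parts with base64 data URLs (this is an assumption, not something the text above confirms). The helper names `build_image_message` and `describe_image` are hypothetical.

```python
import base64
from pathlib import Path


def build_image_message(prompt: str, image_path: str, mime: str = "image/jpeg") -> list:
    """Build an OpenAI-style chat message carrying text plus one inline image."""
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # Assumption: the gateway accepts data URLs in image_url parts.
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }]


def describe_image(image_path: str, model: str) -> str:
    """Send the image to the local gateway; requires oMLX to be running."""
    from openai import OpenAI  # imported lazily so the helper above stays dependency-free

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
    resp = client.chat.completions.create(
        model=model,  # hypothetical: use a VLM name reported by /v1/models
        messages=build_image_message("Describe this image.", image_path),
    )
    return resp.choices[0].message.content
```

Keeping the payload construction separate from the network call makes it possible to inspect the message before sending anything to the server.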
Embeddings (via mlx-lm + uv)
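uv run --with mlx-lm python -c "
import os
import mlx.core as mx
from mlx_lm import load

# '~' is not expanded inside a Python string, so expand it explicitly
model, tokenizer = load(os.path.expanduser('~/models/Qwen3-Embedding-0.6B-4bit-DWQ'))
text = 'text to embed'
# Token IDs with a leading batch dimension: shape (1, seq_len)
tokens = mx.array([tokenizer.encode(text)])
# Assumption: model.model is the transformer backbone and returns hidden states
# of shape (1, seq_len, hidden_dim); calling model(...) directly returns logits
hidden = model.model(tokens)
embeddings = hidden.mean(axis=1)  # mean-pool over the token axis
print(embeddings.shape)
"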
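Once embeddings are available, a common next step is comparing them. The following is a minimal, dependency-free sketch of cosine similarity and ranking; the function names are hypothetical and it assumes each embedding has already been converted to a plain list of floats.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: list, docs: dict) -> list:
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For larger collections a vectorized implementation would be faster; the pure-Python version is only meant to show the computation.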
ASR — Speech-to-Text (via mlx-audio + uv)
Important: Must run with --python 3.11 to avoid OpenMP threading issues (SIGSEGV).
uv run --python 3.11 --with mlx-audio python -m mlx_audio.stt.generate \
--model ~/models/Qwen3-ASR-1.7B-8bit \
--audio "audio.wav" \
--output-path /tmp/asr_result \
--format txt \
--language zh \
--verbose
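When the transcription needs to be triggered from Python rather than a shell, the same command can be assembled as an argument list. This is a sketch that only reuses the flags shown above; the helper names `build_asr_command` and `transcribe` are hypothetical, and nothing is assumed about the name of the file the CLI writes.

```python
import os
import subprocess


def build_asr_command(audio: str, model_dir: str, output_path: str, language: str = "zh") -> list:
    """Assemble the argv list for the mlx-audio speech-to-text CLI."""
    return [
        "uv", "run", "--python", "3.11", "--with", "mlx-audio",
        "python", "-m", "mlx_audio.stt.generate",
        "--model", os.path.expanduser(model_dir),  # no shell here, so expand '~' ourselves
        "--audio", audio,
        "--output-path", output_path,
        "--format", "txt",
        "--language", language,
    ]


def transcribe(audio: str, model_dir: str = "~/models/Qwen3-ASR-1.7B-8bit",
               output_path: str = "/tmp/asr_result") -> None:
    """Run the CLI and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_asr_command(audio, model_dir, output_path), check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with audio filenames that contain spaces.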
OCR (via mlx-vlm + uv)
Important: The generate function parameter order must be (model, processor, prompt, image).
cat << 'PY_EOF' > run_ocr.py
import os
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
model_path = os.path.expanduser("~/models/PaddleOCR-VL-1.5-6bit")
model, processor = load(model_path)
prompt = apply_chat_template(processor, config=model.config, prompt="OCR:", num_images=1)
output = generate(model, processor, prompt, "document.jpg", max_tokens=512, temp=0.0)
print(output.text)
PY_EOF
uv run --python 3.11 --with mlx-vlm python run_ocr.py
Service Management (oMLX only)
# Check running models
curl http://localhost:8000/v1/models
# Restart oMLX
launchctl kickstart -k gui/$(id -u)/com.omlx-server
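The `curl` check can also be done from Python before sending any requests. The sketch below assumes the endpoint returns the usual OpenAI-style body (`{"data": [{"id": ...}, ...]}`); the function names are hypothetical.

```python
import json
import urllib.error
import urllib.request
from typing import List, Optional


def parse_model_ids(payload: str) -> List[str]:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    data = json.loads(payload).get("data", [])
    return [entry["id"] for entry in data if "id" in entry]


def list_models(base_url: str = "http://localhost:8000/v1",
                timeout: float = 3.0) -> Optional[List[str]]:
    """Return the served model IDs, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            return parse_model_ids(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        return None
```

A `None` result suggests the service is down, in which case the `launchctl kickstart` command above is the next thing to try.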
Model Storage Strategy
All models are stored in ~/models/ using an oMLX-compatible structure:
~/models/
├── Qwen3-Embedding-0.6B-4bit-DWQ/
├── Qwen3-ASR-1.7B-8bit/
├── PaddleOCR-VL-1.5-6bit/
└── Qwen3.5-35B-A3B-4bit/
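A quick way to confirm the layout above is to check that each expected directory exists. This is a small sketch using the directory names from the tree; the function name `missing_models` is hypothetical.

```python
from pathlib import Path
from typing import List, Sequence

# Directory names taken from the tree above.
EXPECTED_MODELS = (
    "Qwen3-Embedding-0.6B-4bit-DWQ",
    "Qwen3-ASR-1.7B-8bit",
    "PaddleOCR-VL-1.5-6bit",
    "Qwen3.5-35B-A3B-4bit",
)


def missing_models(root: str = "~/models",
                   expected: Sequence[str] = EXPECTED_MODELS) -> List[str]:
    """Return the expected model directories that do not exist under root."""
    base = Path(root).expanduser()
    return [name for name in expected if not (base / name).is_dir()]
```

Calling `missing_models()` with no arguments checks the default location and returns an empty list when everything is in place.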
Requirements
- Apple Silicon Mac (M1/M2/M3/M4)
- uv installed (curl -LsSf https://astral.sh/uv/install.sh | sh)
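- Make sure OpenClaw is installed (local or Docker deployment)
- In the chat box, enter the install command: /install mlx-local-inference
- Once installed, invoke the Skill by name or trigger it with /mlx-local-inference
- Supply the inputs described in the Skill's parameter documentation to get structured output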
What is mlx-local-inference?
Use when calling local AI on this Mac: text generation, embeddings, speech-to-text, OCR, or image understanding. LLM/VLM via oMLX gateway at localhost:8000/... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 772 downloads to date.
How do I install mlx-local-inference?
Run the command "/install mlx-local-inference" in the OpenClaw or Claude Code chat box for a one-step install; no extra configuration is needed.
Is mlx-local-inference free?
Yes. mlx-local-inference is completely free, released under the MIT-0 license, and can be freely downloaded, installed, and used.
Which platforms does mlx-local-inference support?
mlx-local-inference runs wherever OpenClaw / Claude Code is deployed; the listed platform is darwin (macOS).
Who develops mlx-local-inference?
It is developed and maintained by bendusy (@bendusy); the current version is v2.2.1.