Description

MLX Swift LM - Run LLMs and VLMs on Apple Silicon using MLX. Covers local inference, streaming, tool calling, LoRA fine-tuning, and embeddings.

README (SKILL.md)

mlx-swift-lm Skill

Name: MLX Swift LM Expert
Author: ronaldmannak

1. Overview & Triggers

mlx-swift-lm is a Swift package for running Large Language Models (LLMs) and Vision-Language Models (VLMs) on Apple Silicon using MLX. It supports local inference, fine-tuning via LoRA/DoRA, and embedding generation.

When to Use This Skill

Running LLM/VLM inference on macOS/iOS with Apple Silicon
Streaming text generation from local models
Vision tasks with images/video (VLMs)
Tool calling / function calling with models
LoRA adapter training and fine-tuning
Text embeddings for RAG/semantic search

Architecture Overview

MLXLMCommon     - Core infrastructure (ModelContainer, ChatSession, KVCache, etc.)
MLXLLM          - Text-only LLM support (Llama, Qwen, Gemma, Phi, DeepSeek, etc. - examples, not exhaustive)
MLXVLM          - Vision-Language Models (Qwen2-VL, PaliGemma, Gemma3, etc. - examples, not exhaustive)
Embedders       - Embedding models (BGE, Nomic, MiniLM)

2. Key File Reference

Purpose	File Path
Thread-safe model wrapper	`Libraries/MLXLMCommon/ModelContainer.swift`
Simplified chat API	`Libraries/MLXLMCommon/ChatSession.swift`
Generation & streaming	`Libraries/MLXLMCommon/Evaluate.swift`
KV cache types	`Libraries/MLXLMCommon/KVCache.swift`
Model configuration	`Libraries/MLXLMCommon/ModelConfiguration.swift`
Chat message types	`Libraries/MLXLMCommon/Chat.swift`
Tool call processing	`Libraries/MLXLMCommon/Tool/ToolCallFormat.swift`
Concurrency utilities	`Libraries/MLXLMCommon/Utilities/SerialAccessContainer.swift`
LLM factory & registry	`Libraries/MLXLLM/LLMModelFactory.swift`
VLM factory & registry	`Libraries/MLXVLM/VLMModelFactory.swift`
LoRA configuration	`Libraries/MLXLMCommon/Adapters/LoRA/LoRAContainer.swift`
LoRA training	`Libraries/MLXLLM/LoraTrain.swift`

3. Quick Start

LLM Chat (Simplest API)

import MLXLLM
import MLXLMCommon

// Load model (downloads from HuggingFace automatically)
let modelContainer = try await LLMModelFactory.shared.loadContainer(
    configuration: .init(id: "mlx-community/Qwen3-4B-4bit")
)

// Create chat session
let session = ChatSession(modelContainer)

// Single response
let response = try await session.respond(to: "What is Swift?")
print(response)

// Streaming response
for try await chunk in session.streamResponse(to: "Explain concurrency") {
    print(chunk, terminator: "")
}

VLM with Image

import MLXVLM
import MLXLMCommon

let modelContainer = try await VLMModelFactory.shared.loadContainer(
    configuration: .init(id: "mlx-community/Qwen2-VL-2B-Instruct-4bit")
)

let session = ChatSession(modelContainer)

// With image (video is also an optional parameter)
let image = UserInput.Image.url(imageURL)
let response = try await session.respond(
    to: "Describe this image",
    image: image,
    video: nil  // Optional video parameter
)

Embeddings

import Embedders

// Note: Embedders uses loadModelContainer() helper (not a factory pattern)
let container = try await loadModelContainer(
    configuration: ModelConfiguration(id: "mlx-community/bge-small-en-v1.5-mlx")
)

let embeddings = await container.perform { model, tokenizer, pooler in
    let tokens = tokenizer.encode(text: "Hello world")
    let input = MLXArray(tokens).expandedDimensions(axis: 0)
    let output = model(input)
    let pooled = pooler(output, normalize: true)
    eval(pooled)
    return pooled
}
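
For the RAG and semantic-search use cases mentioned earlier, the pooled vectors are typically compared by cosine similarity. The following is a minimal sketch in plain Swift, assuming the pooled MLXArray has already been converted to a `[Float]` per text. The helper names `cosineSimilarity` and `rank` are hypothetical and not part of the Embedders module.

```swift
import Foundation

/// Cosine similarity between two embedding vectors of equal length.
/// Returns 0 if the lengths differ or either vector has zero magnitude.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    guard a.count == b.count, !a.isEmpty else { return 0 }
    var dot: Float = 0, normA: Float = 0, normB: Float = 0
    for i in a.indices {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    let denominator = normA.squareRoot() * normB.squareRoot()
    return denominator > 0 ? dot / denominator : 0
}

/// Rank documents by similarity to the query, best match first.
/// (Hypothetical helper for illustration only.)
func rank(
    query: [Float],
    documents: [(id: String, vector: [Float])]
) -> [(id: String, score: Float)] {
    documents
        .map { (id: $0.id, score: cosineSimilarity(query, $0.vector)) }
        .sorted { $0.score > $1.score }
}

// Toy 3-dimensional vectors standing in for real embeddings.
let ranked = rank(
    query: [1, 0, 0],
    documents: [
        (id: "a", vector: [0, 1, 0]),
        (id: "b", vector: [0.9, 0.1, 0]),
    ]
)
print(ranked.map { $0.id })  // prints ["b", "a"]
```

When the pooler is called with `normalize: true`, as in the example above, the vectors already have unit length, so the dot product alone would give the same ranking.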

4. Primary Workflow: LLM Inference

ChatSession API (Recommended)

ChatSession manages conversation history and KV cache automatically:

let session = ChatSession(
    modelContainer,
    instructions: "You are a helpful assistant",  // System prompt
    generateParameters: GenerateParameters(
        maxTokens: 500,
        temperature: 0.7
    )
)

// Multi-turn conversation (history preserved automatically)
let r1 = try await session.respond(to: "What is 2+2?")
let r2 = try await session.respond(to: "And if you multiply that by 3?")

// Clear session to start fresh
await session.clear()

Streaming with generate()

For lower-level control, use generate() directly:

let input = try await modelContainer.prepare(input: UserInput(prompt: .text("Hello")))
let stream = try await modelContainer.generate(input: input, parameters: GenerateParameters())

for await generation in stream {
    switch generation {
    case .chunk(let text):
        print(text, terminator: "")
    case .info(let info):
        print("\n\(info.tokensPerSecond) tok/s")
    case .toolCall(let call):
        // Handle tool call
        break
    }
}

Tool Calling

// 1. Define tool
struct WeatherInput: Codable { let location: String }
struct WeatherOutput: Codable { let temperature: Double; let conditions: String }

let weatherTool = Tool<WeatherInput, WeatherOutput>(
    name: "get_weather",
    description: "Get current weather",
    parameters: [.required("location", type: .string, description: "City name")]
) { input in
    WeatherOutput(temperature: 22.0, conditions: "Sunny")
}

// 2. Include tool schema in request
let input = UserInput(
    prompt: .text("What's the weather in Tokyo?"),
    tools: [weatherTool.schema]
)

// 3. Handle tool calls in generation stream
for await generation in try await modelContainer.generate(input: input, parameters: params) {
    switch generation {
    case .chunk(let text): print(text)
    case .toolCall(let call):
        let result = try await call.execute(with: weatherTool)
        print("Weather: \(result.conditions)")
    case .info: break
    }
}

See references/tool-calling.md for multi-turn and feeding results back.
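
When a request offers more than one tool, the handler for a call has to be chosen by name. The sketch below shows one way to do that in plain Swift, assuming the tool call's name and JSON-encoded arguments can be extracted from the call. `ToolRegistry` is a hypothetical type for illustration, not part of MLXLMCommon.

```swift
import Foundation

struct WeatherInput: Codable { let location: String }
struct WeatherOutput: Codable { let temperature: Double; let conditions: String }

/// Hypothetical registry mapping tool names to handlers that take and return JSON data.
struct ToolRegistry {
    typealias Handler = (Data) throws -> Data
    private var handlers: [String: Handler] = [:]

    init() {}

    /// Wrap a typed handler so it can be called with raw JSON arguments.
    mutating func register<Input: Decodable, Output: Encodable>(
        _ name: String,
        handler: @escaping (Input) throws -> Output
    ) {
        handlers[name] = { data in
            let input = try JSONDecoder().decode(Input.self, from: data)
            return try JSONEncoder().encode(handler(input))
        }
    }

    /// Run the named tool. Returns nil when no tool with that name is registered.
    func run(_ name: String, arguments: Data) throws -> Data? {
        try handlers[name]?(arguments)
    }
}

var registry = ToolRegistry()
registry.register("get_weather") { (input: WeatherInput) -> WeatherOutput in
    // Placeholder result, as in the example above.
    WeatherOutput(temperature: 22.0, conditions: "Sunny")
}
```

Returning `nil` for an unknown name lets the caller decide how to report a tool the model requested but the app never registered.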

GenerateParameters

let params = GenerateParameters(
    maxTokens: 1000,           // nil = unlimited
    maxKVSize: 4096,           // Sliding window (uses RotatingKVCache)
    kvBits: 4,                 // Quantized cache (4 or 8 bit)
    temperature: 0.7,          // 0 = greedy/argmax
    topP: 0.9,                 // Nucleus sampling
    repetitionPenalty: 1.1,    // Penalize repeats
    repetitionContextSize: 20  // Window for penalty
)
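
To make the `temperature` and `topP` comments above concrete, here is a sketch of how these two parameters conventionally narrow the candidate tokens at one sampling step. It is plain Swift over a toy logits array and is not the library's sampler. The function names are hypothetical, and the exact cutoff rule in the library may differ.

```swift
import Foundation

/// Softmax over logits after dividing by `temperature` (must be > 0).
func softmax(_ logits: [Double], temperature: Double) -> [Double] {
    let scaled = logits.map { $0 / temperature }
    let maxLogit = scaled.max() ?? 0
    // Subtract the maximum for numerical stability.
    let exps = scaled.map { exp($0 - maxLogit) }
    let total = exps.reduce(0, +)
    return exps.map { $0 / total }
}

/// Indices of the smallest set of tokens whose cumulative probability
/// reaches `topP`, ordered from most to least likely.
func nucleus(_ probabilities: [Double], topP: Double) -> [Int] {
    let order = probabilities.indices.sorted { probabilities[$0] > probabilities[$1] }
    var kept: [Int] = []
    var cumulative = 0.0
    for index in order {
        kept.append(index)
        cumulative += probabilities[index]
        if cumulative >= topP { break }
    }
    return kept
}

/// Candidate token indices for one sampling step.
/// A temperature of 0 is treated as greedy decoding (argmax only).
func candidates(logits: [Double], temperature: Double, topP: Double) -> [Int] {
    guard let best = logits.indices.max(by: { logits[$0] < logits[$1] }) else { return [] }
    if temperature == 0 { return [best] }
    return nucleus(softmax(logits, temperature: temperature), topP: topP)
}
```

Lower temperatures sharpen the distribution, so fewer tokens are needed to reach the `topP` mass. Higher temperatures flatten it and widen the candidate set.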

Prompt Caching / History Re-hydration

Restore chat from persisted history:

let history: [Chat.Message] = [
    .system("You are helpful"),
    .user("Hello"),
    .assistant("Hi there!")
]

let session = ChatSession(
    modelContainer,
    history: history
)
// Continues from this point
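
Re-hydration presupposes that the history was saved somewhere first. One possible approach is to keep a small Codable mirror of each turn and store it as JSON. In the sketch below, `StoredMessage`, `encodeHistory`, and `decodeHistory` are hypothetical names and not library types. The final mapping back to `Chat.Message` is shown only as a comment, using the `.system`, `.user`, and `.assistant` constructors from the example above.

```swift
import Foundation

/// Hypothetical on-disk representation of one chat turn (not a library type).
struct StoredMessage: Codable, Equatable {
    enum Role: String, Codable { case system, user, assistant }
    let role: Role
    let content: String
}

/// Serialize the stored turns to JSON.
func encodeHistory(_ messages: [StoredMessage]) throws -> Data {
    try JSONEncoder().encode(messages)
}

/// Restore the stored turns from JSON.
func decodeHistory(_ data: Data) throws -> [StoredMessage] {
    try JSONDecoder().decode([StoredMessage].self, from: data)
}

// After decoding, each turn could be mapped back before creating the session:
//   switch stored.role {
//   case .system:    .system(stored.content)
//   case .user:      .user(stored.content)
//   case .assistant: .assistant(stored.content)
//   }
```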

5. Secondary Workflow: VLM Inference

Image Input Types

// From URL (file or remote)
let image = UserInput.Image.url(fileURL)

// From CIImage
let image = UserInput.Image.ciImage(ciImage)

// From MLXArray directly
let image = UserInput.Image.array(mlxArray)

Video Input

// From URL (file or remote)
let video = UserInput.Video.url(videoURL)

// From AVFoundation asset
let video = UserInput.Video.avAsset(avAsset)

// From pre-extracted frames
let video = UserInput.Video.frames(videoFrames)

let response = try await session.respond(
    to: "What happens in this video?",
    video: video
)

Multiple Images

let images: [UserInput.Image] = [
    .url(url1),
    .url(url2)
]

let response = try await session.respond(
    to: "Compare these two images",
    images: images,
    videos: []
)

VLM-Specific Processing

let session = ChatSession(
    modelContainer,
    processing: UserInput.Processing(
        resize: CGSize(width: 512, height: 512)  // Resize images
    )
)

6. Best Practices

DO

// DO: Use ChatSession for multi-turn conversations
let session = ChatSession(modelContainer)

// DO: Use AsyncStream APIs (modern, Swift concurrency)
for try await chunk in session.streamResponse(to: prompt) { ... }

// DO: Check Task.isCancelled in long-running loops
for try await generation in stream {
    if Task.isCancelled { break }
    // process generation
}

// DO: Use ModelContainer.perform() for thread-safe access
try await modelContainer.perform { context in
    // Access model, tokenizer safely
    let tokens = try context.tokenizer.applyChatTemplate(messages: messages)
    return tokens
}

// DO: When breaking early from generation, use generateTask() to get a task handle
// This is the lower-level API used internally by ChatSession
let (stream, task) = generateTask(...)  // Returns (AsyncStream, Task)

for await item in stream {
    if shouldStop { break }
}
await task.value  // Ensures KV cache cleanup before next generation

generateTask() is defined in Evaluate.swift. Most users should use ChatSession which handles this internally.
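
The stream-plus-task pattern above can be tried in isolation with plain Swift concurrency. The sketch below uses a hypothetical `makeProducer` function as a stand-in for `generateTask()`. It yields integers rather than generations, and cancelling the producer from `onTermination` is this sketch's own choice, not a statement about the library's internals.

```swift
/// Hypothetical stand-in for a producer that returns a stream plus the task feeding it.
func makeProducer(count: Int) -> (AsyncStream<Int>, Task<Void, Never>) {
    let (stream, continuation) = AsyncStream<Int>.makeStream()
    let task = Task {
        for value in 0..<count {
            if Task.isCancelled { break }  // Stop promptly once the consumer is gone
            continuation.yield(value)
        }
        continuation.finish()
    }
    // When the consumer stops iterating, cancel the producer.
    continuation.onTermination = { _ in task.cancel() }
    return (stream, task)
}

/// Read at most `limit` values, then wait for the producer to wind down.
func consumeFirst(_ limit: Int) async -> [Int] {
    let (stream, task) = makeProducer(count: 100)
    var received: [Int] = []
    for await value in stream {
        received.append(value)
        if received.count == limit { break }  // Early exit from the stream
    }
    await task.value  // Producer has fully finished before we continue
    return received
}
```

Awaiting `task.value` after the loop is the step that guarantees the producer is no longer touching shared state when the next request starts.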

DON'T

// DON'T: Share MLXArray across tasks (not Sendable)
let array = MLXArray(...)
Task { array.sum() }  // Wrong!

// DON'T: Use deprecated callback-based generation
// Old:
generate(input: input, parameters: params) { tokens in ... }  // Deprecated
// New:
for await generation in try generate(input: input, parameters: params, context: context) { ... }

// DON'T: Use old perform(model, tokenizer) signature
// Old:
modelContainer.perform { model, tokenizer in ... }  // Deprecated
// New:
modelContainer.perform { context in ... }

// DON'T: Forget to eval() MLXArrays before returning from perform()
await modelContainer.perform { context in
    let result = context.model(input)
    eval(result)  // Required before returning
    return result.item(Float.self)
}

Thread Safety

ModelContainer is Sendable and thread-safe
ChatSession is NOT thread-safe (use from single task)
MLXArray is NOT Sendable - don't pass across isolation boundaries
Use SendableBox for transferring non-Sendable data in consuming contexts

Memory Management

// For long contexts, use sliding window cache
let params = GenerateParameters(maxKVSize: 4096)

// For memory efficiency, use quantized cache
let params = GenerateParameters(kvBits: 4)  // or 8

// Clear session cache when done
await session.clear()
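
A back-of-the-envelope estimate shows why these two settings matter. The sketch below computes an approximate KV cache size from model dimensions. The dimensions used (32 layers, 8 KV heads, head size 128) are hypothetical and do not describe any specific model, and the estimate ignores the extra scales and biases that a quantized cache stores.

```swift
/// Approximate KV cache size in bytes.
/// Keys and values are both cached, hence the factor of 2.
func kvCacheBytes(
    layers: Int,
    kvHeads: Int,
    headDim: Int,
    tokens: Int,
    bitsPerElement: Int
) -> Int {
    2 * layers * kvHeads * headDim * tokens * bitsPerElement / 8
}

// Hypothetical model dimensions with a 4096-token window.
let fullPrecision = kvCacheBytes(
    layers: 32, kvHeads: 8, headDim: 128, tokens: 4096, bitsPerElement: 16
)
let quantized = kvCacheBytes(
    layers: 32, kvHeads: 8, headDim: 128, tokens: 4096, bitsPerElement: 4
)
print(fullPrecision / 1_048_576, "MiB at 16 bits")  // 512 MiB
print(quantized / 1_048_576, "MiB at 4 bits")       // 128 MiB
```

The size grows linearly with the token count, which is what `maxKVSize` caps, and linearly with bits per element, which is what `kvBits` reduces.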

7. Reference Links

For detailed documentation on specific topics, see:

Reference	When to Use
references/model-container.md	Loading models, ModelContainer API, ModelConfiguration
references/kv-cache.md	Cache types, memory optimization, cache serialization
references/concurrency.md	Thread safety, SerialAccessContainer, async patterns
references/tool-calling.md	Function calling, tool formats, ToolCallProcessor
references/tokenizer-chat.md	Tokenizer, Chat.Message, EOS tokens
references/supported-models.md	Model families, registries, model-specific config
references/lora-adapters.md	LoRA/DoRA/QLoRA, loading adapters
references/training.md	LoRATrain API, fine-tuning
references/embeddings.md	EmbeddingModel, pooling, use cases

8. Deprecated Patterns Summary

Most common migrations (see individual reference files for topic-specific deprecations):

If you see...	Use instead...
`generate(... didGenerate:)` callback	`generate(...) -> AsyncStream`
`perform { model, tokenizer in }`	`perform { context in }`
`TokenIterator(prompt: MLXArray)`	`TokenIterator(input: LMInput)`
`ModelRegistry` typealias	`LLMRegistry` or `VLMRegistry`
`createAttentionMask(h:cache:[KVCache]?)`	`createAttentionMask(h:cache:KVCache?)`

Each reference file contains a "Deprecated Patterns" section with topic-specific migrations.

9. Automatic vs Manual Configuration

Automatic Behaviors (NO developer action needed)

The framework handles these automatically:

Feature	Details
EOS token loading	Loaded from `config.json`
EOS token override	Priority: `generation_config.json` > `config.json` > defaults
EOS token merging	All sources merged at generation time
EOS token detection	Stops generation automatically when EOS encountered
Chat template application	Applied automatically via `applyChatTemplate()`
Tool call format detection	Inferred from `model_type` in `config.json`
Cache type selection	Based on GenerateParameters (`maxKVSize`, `kvBits`)
Tokenizer loading	Loaded from `tokenizer.json` automatically
Model weights loading	Downloaded and loaded from HuggingFace

Optional Configuration (Developer MAY configure)

Feature	When to Configure
`extraEOSTokens`	Only if model has unlisted stop tokens
`toolCallFormat`	Only to override auto-detection
`maxKVSize`	To enable sliding window cache
`kvBits`	To enable quantized cache (4 or 8 bit)
`maxTokens`	To limit output length

Usage Guidance

This skill is internally coherent for running and fine-tuning local models on Apple Silicon. Before installing or using it, consider:

(1) Model downloads: it fetches model weights from HuggingFace (public models are fine; private models may require your HF token). Only load models from sources you trust.
(2) Disk and memory: large models require substantial disk space and memory, and the skill stores caches and .safetensors files (the docs reference ~/.cache/huggingface and adapter.safetensors).
(3) Local file I/O: LoRA training and loading functions read training data from local directories and write checkpoints. Ensure training data and target paths are correct and safe.
(4) Network: the skill expects network access to download models, so account for any network restrictions.
(5) Environment: the skill itself requires no environment variables or installs, but you should still review any model code you load (model weights are data but can encode harmful behaviors).

If you need higher assurance, ask the maintainer for the skill's source repository or a signed release, and only use trusted model IDs.

Capability Analysis

Type: OpenClaw Skill
Name: mlx-swift-lm
Version: 1.0.0

The OpenClaw AgentSkills skill bundle for 'mlx-swift-lm' is benign. The documentation and code examples describe a Swift library for running and fine-tuning LLMs/VLMs on Apple Silicon using MLX. All observed behaviors, such as downloading models from HuggingFace, loading/saving model weights and caches locally, and performing network calls for model inference, are legitimate and aligned with the stated purpose of an ML library. There is no evidence of data exfiltration, malicious execution, persistence mechanisms, prompt injection against the OpenClaw agent, or obfuscation.

Capability Assessment

✓ Purpose & Capability

The name/description (MLX Swift LM for Apple Silicon) match the documented behavior: loading models, running inference, streaming, tool-calling, LoRA training, and embeddings. Required capabilities (downloading models, reading model directories, saving adapter weights, local training) are consistent with the stated purpose. No unrelated credentials, binaries, or config paths are requested.

ℹ Instruction Scope

SKILL.md and reference docs instruct the agent/developer to download models (HuggingFace hub), load tokenizers, read local model/adapter directories, and save adapter/checkpoint files. This file and network I/O and local filesystem access are expected for a local model runtime. The instructions do reference cache paths (~/.cache/huggingface/...) and reading training data files via FileManager; those are within scope but users should be aware the skill will read/write model and training files on disk.

✓ Install Mechanism

Instruction-only skill with no install spec and no code files to execute at install time. This is low-risk from an install mechanism perspective.

✓ Credentials

The skill declares no required environment variables or credentials. It documents optional use of a HubApi with an hfToken for private models, which is appropriate. No unrelated secrets or multiple unrelated credentials are requested.

✓ Persistence & Privilege

always:false and default invocation settings. The skill does not request permanent system presence or modify other skills or system-wide agent settings. It describes saving/loading adapter weights and model caches to user-specified file paths (normal for this domain).

Version History

v1.0.0

Initial release of mlx-swift-lm.

- Run Large Language Models (LLMs) and Vision-Language Models (VLMs) locally on Apple Silicon using MLX.
- Supports local inference, streaming text generation, and both single-turn and multi-turn chat via a ChatSession API.
- Enables tool/function calling, LoRA/DoRA fine-tuning, and text embeddings for search/semantic applications.
- Provides Swift-friendly factory/load interfaces for a variety of model types (LLM, VLM, Embeddings).
- Offers quick-start code examples and comprehensive API references for common ML workflows.

Metadata

Slug mlx-swift-lm

Version 1.0.0

License —

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is MLX Swift LM Expert?

MLX Swift LM Expert is an AI agent skill for Claude Code / OpenClaw that covers running LLMs and VLMs on Apple Silicon using MLX: local inference, streaming, tool calling, LoRA fine-tuning, and embeddings. It has 1986 downloads so far.

How do I install MLX Swift LM Expert?

Run "/install mlx-swift-lm" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is MLX Swift LM Expert free?

Yes, MLX Swift LM Expert is completely free (open-source). You can download, install and use it at no cost.

Which platforms does MLX Swift LM Expert support?

The skill itself is cross-platform and runs anywhere OpenClaw / Claude Code is available. The mlx-swift-lm library it documents targets Apple Silicon.

Who created MLX Swift LM Expert?

It is built and maintained by ronaldmannak (@ronaldmannak); the current version is v1.0.0.
