Description

On-device speech-to-text (Whisper) + text-to-speech (Qwen3-TTS) CLI. Runs on the Apple Neural Engine (ANE), Apple's low-power, dedicated ML inference chip. M...

README (SKILL.md)

whisperkit-cli

Name: Argmax Transcription and TTS
Author: zachnagengast

On-device Whisper transcription + Qwen3-TTS synthesis
Local file-based audio I/O -- models are downloaded from HuggingFace on first run, then all inference runs on-device with no network required. Perfect for agents that receive voice messages/attachments and reply with text or generated audio files.

The agent saves incoming audio attachments to a temp path, runs the CLI, and either returns the transcribed text in chat or attaches the generated .wav/.m4a reply.

Why agents love this skill

Runs on ANE -- no GPU contention, low power, always available
No API keys, no per-request costs, no data leaves the machine after setup
One-time model download on first run, then fully offline
Handles audio files from user messages (m4a, wav, mp3, flac)
Generates reply audio files the agent can attach/send
9 built-in voices + 10 languages
Natural-language style instructions (1.7B model)

Installation

brew install whisperkit-cli

First run automatically downloads models as needed.

Core Commands

Transcribe (Audio File -> Text)

whisperkit-cli transcribe --help

Agent patterns

# Transcribe user-uploaded audio attachment (recommended default)
whisperkit-cli transcribe --audio-path /tmp/user-message.m4a

Important model notes

By default, whisperkit-cli transcribe automatically selects the highest-quality model that fits on your Apple Silicon device (typically a large-v3 variant on M1+). This is great for accuracy but may be slower for real-time agent workflows.
--model small is a fast option that works well across languages (tiny and base are faster still, at lower accuracy). For non-English audio, pass --language with the ISO code (e.g. --language ja for Japanese). Avoid .en model variants for non-English audio.

# Explicit small model (fast + good quality for most cases)
whisperkit-cli transcribe --model small --audio-path /tmp/voice-note.wav

# Non-English audio -- specify the language ISO code
whisperkit-cli transcribe --model small --language ja --audio-path /tmp/japanese-message.m4a

# Higher quality with auto language detection (no --language needed)
# --prompt provides context as if it were the previous transcript segment,
# helping the model spell proper nouns and domain terms correctly
whisperkit-cli transcribe --model large-v3-v20240930_626MB --audio-path /tmp/long-meeting.m4a \
  --word-timestamps --prompt "Argmax, WhisperKit, CoreML"

The transcript is written to stdout as clean text, so the agent can copy it directly into the chat reply.
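Because the transcript arrives on stdout, an agent can capture it with command substitution. The wrapper below is a minimal sketch, not part of the CLI: transcribe_to_text and the WHISPERKIT_BIN override are hypothetical names introduced here, and only the transcribe, --model and --audio-path flags come from the examples above.

```shell
# Sketch: capture the transcript from stdout and surface failures.
# WHISPERKIT_BIN is a hypothetical override (handy for pointing at a stub).
WHISPERKIT_BIN="${WHISPERKIT_BIN:-whisperkit-cli}"

transcribe_to_text() {
  # $1: path to the audio file; $2: optional model name (defaults to "small")
  audio_path="$1"
  model="${2:-small}"

  # Fail early with a distinct status if the attachment was never saved.
  if [ ! -f "$audio_path" ]; then
    echo "transcribe_to_text: no such file: $audio_path" >&2
    return 2
  fi

  # Quoting keeps paths that contain spaces intact.
  "$WHISPERKIT_BIN" transcribe --model "$model" --audio-path "$audio_path"
}
```

A caller might then write transcript="$(transcribe_to_text /tmp/user-message.m4a)" and check the exit status before replying.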

TTS (Text -> Audio File)

whisperkit-cli tts --help

Agent patterns

# Generate reply audio file (agent will attach it)
whisperkit-cli tts --text "Got it, I'll handle the report by Friday" \
  --output-path /tmp/agent-reply

# With voice + language
whisperkit-cli tts --text "こんにちは、世界" \
  --speaker ono-anna --language japanese \
  --output-path /tmp/japanese-reply.m4a

# 1.7B model with expressive style instruction
whisperkit-cli tts --model 1.7b \
  --text "Once upon a time in a galaxy far, far away..." \
  --instruction "Read dramatically like a movie trailer narrator" \
  --output-path /tmp/story-reply.m4a

# From text file (great for long LLM summaries)
whisperkit-cli tts --text-file /tmp/llm-response.txt \
  --output-path /tmp/voice-reply.m4a

You can include the extension in --output-path (e.g. /tmp/reply.m4a) or omit it and the CLI will append it based on --output-format (default .m4a). Use --output-format wav for .wav. Default voice is aiden if --speaker is omitted.
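An agent that omits the extension still needs to know which file to attach afterwards. The helper below is a sketch that mirrors the rule described above; resolve_output_path is a hypothetical name, and the way the CLI itself detects an existing extension is not documented here, so treat the pattern match as an assumption.

```shell
# Sketch: predict the file the CLI will write for a given --output-path.
resolve_output_path() {
  # $1: value passed to --output-path; $2: --output-format (defaults to m4a)
  path="$1"
  format="${2:-m4a}"
  base="${path##*/}"   # file name only, so dots in directory names are ignored

  case "$base" in
    ?*.?*) printf '%s\n' "$path" ;;           # already carries an extension
    *)     printf '%s\n' "$path.$format" ;;   # append the output format
  esac
}
```

For example, resolve_output_path /tmp/agent-reply yields /tmp/agent-reply.m4a, while a path that already ends in .m4a is returned unchanged.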

Voices (TTS)

ryan, aiden, ono-anna, sohee, eric, dylan, serena, vivian, uncle-fu

Languages (TTS)

english, chinese, japanese, korean, german, french, russian, portuguese, spanish, italian

Local OpenAI-Compatible API Server

whisperkit-cli serve --port 50060

Auto-selects the best model for your device. To specify a model explicitly:

whisperkit-cli serve --model small --port 50060

Exposes OpenAI-compatible endpoints at http://127.0.0.1:50060:

POST /v1/audio/transcriptions -- transcribe audio to text
POST /v1/audio/translations -- translate audio to English
GET /health -- health check
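The endpoints can be exercised with curl. This is a sketch under one assumption: that the server accepts the same multipart file field as OpenAI's transcription API, which the text above implies ("OpenAI-compatible") but does not spell out. CURL_BIN, WHISPERKIT_URL and both function names are hypothetical; check the project docs for any additional form fields the server expects.

```shell
# Sketch: call the local server started with `whisperkit-cli serve --port 50060`.
CURL_BIN="${CURL_BIN:-curl}"
WHISPERKIT_URL="${WHISPERKIT_URL:-http://127.0.0.1:50060}"

server_is_healthy() {
  # Succeeds only when GET /health returns a non-error HTTP status.
  "$CURL_BIN" --silent --fail --output /dev/null "$WHISPERKIT_URL/health"
}

transcribe_via_server() {
  # $1: path to the audio file, uploaded as multipart form data.
  # ASSUMPTION: the form field is named "file", as in OpenAI's API.
  "$CURL_BIN" --silent --show-error --fail \
    -F "file=@$1" \
    "$WHISPERKIT_URL/v1/audio/transcriptions"
}
```

Calling server_is_healthy before transcribe_via_server lets the agent fall back to the plain CLI when the server is not running.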

Agent Usage Patterns

# Typical voice message flow
# User sends audio -> agent saves to /tmp/user-audio.m4a
whisperkit-cli transcribe --model small --audio-path /tmp/user-audio.m4a

# Agent sends text to LLM, gets response, generates voice reply
whisperkit-cli tts --text "{{llm_response}}" --output-path /tmp/reply --speaker ryan

# Agent attaches /tmp/reply.m4a to the chat message
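Substituting {{llm_response}} straight into a command line means quotes or shell metacharacters in the LLM output can break or alter the command. One way to avoid that, sketched below, is to write the text to a temporary file and pass it with --text-file. speak_reply and WHISPERKIT_BIN are hypothetical names; the tts flags are the ones shown earlier in this README.

```shell
# Sketch: synthesize a reply without interpolating untrusted text into the command.
WHISPERKIT_BIN="${WHISPERKIT_BIN:-whisperkit-cli}"

speak_reply() {
  # $1: reply text (untrusted LLM output); $2: output path without extension
  reply_text="$1"
  out_base="$2"
  text_file="$(mktemp "${TMPDIR:-/tmp}/tts-reply.XXXXXX")" || return 1

  # printf '%s' writes the text verbatim; the shell never evaluates it.
  printf '%s' "$reply_text" > "$text_file"

  status=0
  "$WHISPERKIT_BIN" tts --text-file "$text_file" --speaker ryan \
    --output-path "$out_base" || status=$?
  rm -f "$text_file"
  [ "$status" -eq 0 ] || return "$status"

  # With no extension given, the CLI is described as appending the default .m4a.
  printf '%s\n' "$out_base.m4a"
}
```

The function prints the path of the file to attach, for example reply_file="$(speak_reply "$llm_response" /tmp/reply)".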

Full docs & model list

https://github.com/argmaxinc/WhisperKit

Whisper model sizes (speed vs quality trade-off):

tiny (~76MB), base (~146MB) -- fastest, lower accuracy
small (~486MB) -- recommended for most agents; fast with good accuracy. Works across languages when --language is specified. Avoid .en variants for non-English.
large-v3-v20240930_626MB (~626MB) -- quantized large model, best balance of accuracy and size. Auto-detects language without needing --language.
large-v3-v20240930 (~1.6GB) -- auto-selected default on M1+, full-precision large model.

Model names use the short form after the openai_whisper- prefix (e.g. --model small resolves to openai_whisper-small). Append .en for English-only variants.
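The naming rule can be written as a small helper, which may be useful when matching a --model value against a downloaded model folder. This is a sketch of the rule as stated above only; whisper_model_full_name is a hypothetical name, and the CLI's real lookup logic may differ.

```shell
# Sketch: expand a short --model value to its prefixed form.
whisper_model_full_name() {
  # $1: short model name, e.g. "small", "small.en", "large-v3-v20240930_626MB"
  case "$1" in
    openai_whisper-*) printf '%s\n' "$1" ;;                  # already prefixed
    *)                printf '%s\n' "openai_whisper-$1" ;;   # add the prefix
  esac
}
```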

TTS model sizes:

0.6b -- fast, works on all Apple Silicon devices
1.7b -- best quality + style instructions, macOS 15+

Run whisperkit-cli transcribe --help or whisperkit-cli tts --help for the latest flags.

Usage Guidance

This skill is internally consistent with its description: it installs a Homebrew CLI that runs locally and requires no API keys. Before installing, verify the Homebrew formula/tap and the GitHub repo (https://github.com/argmaxinc/WhisperKit) are legitimate. Be aware that on first run the tool will download models from HuggingFace (large files, network activity) and will store them locally—check disk usage and model storage location. The skill can also start a local HTTP server (examples show 127.0.0.1:50060); ensure it’s bound to localhost and not exposed to external networks. If you need stronger assurance, inspect the brew formula source or run the CLI in a controlled environment (VM or dedicated machine) before adding it to an agent that processes sensitive audio.

Capability Analysis

Type: OpenClaw Skill
Name: argmax-cli
Version: 1.0.1

The OpenClaw AgentSkills bundle for `argmax-cli` (whisperkit-cli) appears benign. It provides legitimate on-device speech-to-text and text-to-speech functionality using the `whisperkit-cli` tool. Installation is via Homebrew, and all described operations involve local file processing (`/tmp/`) or a local API server (`127.0.0.1`). There is no evidence of intentional data exfiltration, malicious execution, persistence mechanisms, or prompt injection attempts against the OpenClaw agent itself. Passing user- or LLM-controlled text to CLI arguments (e.g., `--prompt`, `--instruction`, `{{llm_response}}`) could introduce vulnerabilities if the agent or the `whisperkit-cli` tool does not properly sanitize inputs, but this is a potential weakness in the execution environment rather than malicious intent within the `SKILL.md` instructions.

Capability Assessment

✓ Purpose & Capability

Name/description (on-device STT + TTS) aligns with declared requirements: a single binary (whisperkit-cli) and a Homebrew install. No unrelated credentials, binaries, or config paths are requested.

ℹ Instruction Scope

Runtime instructions are scoped to saving incoming audio to /tmp, invoking whisperkit-cli for transcribe/tts, and attaching generated files. SKILL.md documents a one-time model download from HuggingFace on first run and a local OpenAI-compatible server (binds to 127.0.0.1 in examples). These behaviors are reasonable for the stated purpose: network activity is needed only at setup (model download), and the local HTTP endpoint can be called by other processes on the same machine.

✓ Install Mechanism

Install is via a Homebrew formula (whisperkit-cli). Using Homebrew is a standard, low-risk distribution method compared with arbitrary downloads; the SKILL.md points to a GitHub project. Users should still confirm the formula source (tap) is trustworthy before installing.

✓ Credentials

The skill requests no environment variables, credentials, or unrelated config paths. This is proportionate to a local CLI tool that performs on-device inference.

ℹ Persistence & Privilege

The skill does not request elevated platform privileges or always:true. It will install a binary via Homebrew and download/keep model files on disk (potentially hundreds of MBs to GBs). It also documents running a local server; ensure you understand where models are stored and that the server binds to localhost only.

Version History

v1.0.1

Added detail on model sizes and capabilities

v1.0.0

First release of the argmax-cli skill, featuring STT and TTS generation on the Apple Neural Engine via the whisperkit-cli Homebrew formula.

Metadata

Slug argmax-cli

Version 1.0.1

License —

All-time Installs 0

Active Installs 0

Total Versions 2

Frequently Asked Questions

What is Argmax Transcription and TTS?

On-device speech-to-text (Whisper) + text-to-speech (Qwen3-TTS) CLI. Runs on the Apple Neural Engine (ANE), Apple's low-power, dedicated ML inference chip. M... It is an AI Agent Skill for Claude Code / OpenClaw, with 321 downloads so far.

How do I install Argmax Transcription and TTS?

Run "/install argmax-cli" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Argmax Transcription and TTS free?

Yes, Argmax Transcription and TTS is completely free (open-source). You can download, install and use it at no cost.

Which platforms does Argmax Transcription and TTS support?

Argmax Transcription and TTS runs on macOS on Apple Silicon devices, wherever OpenClaw or Claude Code is available.

Who created Argmax Transcription and TTS?

It is built and maintained by Zach Nagengast (@zachnagengast); the current version is v1.0.1.
