/install smart-voice-recognition
🎤 Voice Recognition — Smart Auto-Model Selection
Transcribe audio to text using local OpenAI Whisper. No API keys, 100% private, and no internet required once the model has been downloaded on first run.
Smart auto-selection dynamically picks the best model based on your audio characteristics — you never have to think about which model to use.
Quick Start
# Auto mode — analyzes audio, picks best model automatically
scripts/transcribe.py voice.ogg
# Force a specific model
scripts/transcribe.py voice.ogg --model small
# Specify language (auto-detect if omitted)
scripts/transcribe.py voice.ogg --language zh # Chinese (Mandarin)
scripts/transcribe.py voice.ogg --language en # English
scripts/transcribe.py voice.ogg --language yue # Cantonese
# Show segment timestamps
scripts/transcribe.py voice.ogg --segments
# Save transcript to file
scripts/transcribe.py voice.ogg -o transcript.txt
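The flags above could be parsed along the following lines. This is a minimal sketch, not the actual contents of scripts/transcribe.py: the `build_parser` helper, the `auto` default for `--model`, and the long form `--output` are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in Quick Start."""
    parser = argparse.ArgumentParser(prog="transcribe.py")
    parser.add_argument("audio", help="path to the audio file")
    # "auto" is an assumed sentinel meaning "let the script pick the model".
    parser.add_argument(
        "--model",
        default="auto",
        choices=["auto", "tiny", "base", "small", "medium", "large"],
    )
    # None means "auto-detect the language".
    parser.add_argument("--language", default=None, help="language code, e.g. zh, en")
    parser.add_argument("--segments", action="store_true", help="show segment timestamps")
    # The long name --output is assumed; only -o appears in the examples.
    parser.add_argument("-o", "--output", default=None, help="save transcript to this file")
    return parser


args = build_parser().parse_args(["voice.ogg", "--language", "zh", "--segments"])
print(args.model, args.language, args.segments, args.output)
# prints: auto zh True None
```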
Smart Auto-Selection
The script analyzes audio duration + complexity and selects the optimal model automatically:
| Audio Characteristic | Model Used | Why |
|---|---|---|
| Short (<10s), clean speech | base | Fast (2-3s). Accurate enough for simple content. |
| Short (<10s), mixed languages | small | Better multilingual handling for code-switching. |
| Medium (10-60s), clean | base | Balanced speed and accuracy. |
| Medium (10-60s), mixed | small | Handles accents and language transitions. |
| Long (1-2min) | small | Maintains context, still fast enough. |
| Very long (2min+) | medium | Maximum accuracy for extended recordings. |
You don't need to think about models. Just send audio.
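The selection table can be read as a small decision function. The sketch below is one possible encoding, not the script's actual logic: `select_model` and the `mixed_language` flag are hypothetical names, the handling of exact boundaries (60s, 120s) is an assumption, and how the script detects mixed-language audio is not described here.

```python
def select_model(duration_s: float, mixed_language: bool = False) -> str:
    """Pick a Whisper model name from duration and a mixed-language flag.

    Thresholds follow the table above; boundary handling is assumed.
    """
    if duration_s >= 120:   # very long (2min+)
        return "medium"
    if duration_s >= 60:    # long (1-2min)
        return "small"
    # short (<10s) and medium (10-60s) clips share the same rule
    return "small" if mixed_language else "base"


print(select_model(7.5))                        # base
print(select_model(32.0, mixed_language=True))  # small
print(select_model(90.0))                       # small
print(select_model(300.0))                      # medium
```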
Installation
Prerequisites
- Python 3.10+
- pip (Python package manager)
Via bundled installer
python3 scripts/install.py
Manual
pip install openai-whisper soundfile numpy
pip install torch --index-url https://download.pytorch.org/whl/cpu
Using requirements.txt
pip install -r requirements.txt
pip install torch --index-url https://download.pytorch.org/whl/cpu
Note: The first run downloads the Whisper model (~139MB for base, ~461MB for small). Subsequent runs load the cached model from ~/.cache/whisper/ and skip the download.
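To see whether a model is already cached before a run, a check like the one below may help. It is a sketch that assumes checkpoints are stored as `<name>.pt` inside the cache directory; `cached_model_path` is a hypothetical helper, not part of this skill.

```python
from pathlib import Path
from typing import Optional

DEFAULT_CACHE = Path.home() / ".cache" / "whisper"


def cached_model_path(name: str, cache_dir: Path = DEFAULT_CACHE) -> Optional[Path]:
    """Return the cached checkpoint path, or None if it is not downloaded yet.

    Assumes the checkpoint file is named <name>.pt.
    """
    candidate = cache_dir / f"{name}.pt"
    return candidate if candidate.is_file() else None


for name in ("base", "small"):
    status = "cached" if cached_model_path(name) else "will be downloaded on first use"
    print(f"{name}: {status}")
```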
Model Reference
| Model | Size | Speed | Accuracy | Best For |
|---|---|---|---|---|
| tiny | 72MB | ⚡⚡⚡ | ⭐⭐ | Real-time preview, very short clips |
| base | 139MB | ⚡⚡ | ⭐⭐⭐ | General use (auto-select default for short audio) |
| small | 461MB | ⚡ | ⭐⭐⭐⭐ | Mixed languages, accents (auto-select for long/complex) |
| medium | 1.5GB | 🐢 | ⭐⭐⭐⭐⭐ | Maximum accuracy, long recordings |
| large | 2.9GB | 🐢 | ⭐⭐⭐⭐⭐ | Research-grade transcription |
Language Support
Whisper supports 99 languages including:
- 🇨🇳 Chinese (Mandarin, Cantonese)
- 🇺🇸 English
- 🇪🇸 Spanish
- 🇯🇵 Japanese
- 🇰🇷 Korean
- 🇫🇷 French
- 🇩🇪 German
Auto-detects language by default. Use --language to provide a hint for better accuracy.
Features
| Feature | Description |
|---|---|
| 🔒 100% Private | Everything runs locally. No data leaves your machine. |
| 🆓 No API Costs | Free unlimited transcription. No quotas, no keys. |
| 🌐 99 Languages | Supports virtually all major world languages. |
| 🧠 Smart Auto-Model | Analyzes audio → picks optimal model automatically. |
| ⚡ Fast by Default | Short clips → base model (2-3s). Long clips → small/medium. |
| 🎯 Accurate When Needed | Complex/mixed audio automatically upgrades the model. |
| 📊 Segment Timestamps | Sentence-level timing for long recordings. |
| 📁 Multiple Formats | OGG, WAV, MP3, M4A, FLAC, OPUS and more. |
Supported Audio Formats
| Format | Extension | Notes |
|---|---|---|
| OGG Opus | .ogg | Common voice message format ✅ |
| WAV | .wav | Uncompressed, high quality |
| MP3 | .mp3 | Compressed audio |
| M4A | .m4a | Apple/MPEG-4 audio |
| FLAC | .flac | Lossless compressed |
| OPUS | .opus | Pure Opus stream |
Usage Examples
Quick transcription (auto model)
$ scripts/transcribe.py meeting.ogg
📂 Loading audio: meeting.ogg
⏱ Duration: 32.0s | Sample rate: 16000Hz
🧠 Auto-selected model: BASE
✓ Model loaded (1.0s)
🎯 Transcribing...
✅ Done (4.1s total)
Meeting notes: Today we discuss three topics. First, project progress...
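The duration and sample rate reported in the log are the inputs to auto-selection. As an illustration only, the sketch below reads both values from a WAV file using the standard library `wave` module as a stand-in; the script itself may read audio differently (the installation lists soundfile), and `wav_info` and `write_silence` are hypothetical helpers.

```python
import os
import tempfile
import wave
from typing import Tuple


def wav_info(path: str) -> Tuple[float, int]:
    """Return (duration in seconds, sample rate in Hz) for a WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return wf.getnframes() / rate, rate


def write_silence(path: str, seconds: float, rate: int = 16000) -> None:
    """Write a mono 16-bit WAV file of silence, used here only as demo input."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * int(seconds * rate))


with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.wav")
    write_silence(demo, 2.0)
    duration, rate = wav_info(demo)
    print(f"Duration: {duration:.1f}s | Sample rate: {rate}Hz")
    # prints: Duration: 2.0s | Sample rate: 16000Hz
```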
Transcription in context
# Chinese
scripts/transcribe.py voice.ogg --language zh
# English lecture with timestamps
scripts/transcribe.py lecture.m4a --language en --segments
# Mixed Chinese-English interview (auto complexity detection)
scripts/transcribe.py interview.ogg
# Save to file
scripts/transcribe.py podcast.mp3 -o transcript.txt
# Force high accuracy
scripts/transcribe.py important.wav --model medium
Output with segments
$ scripts/transcribe.py message.ogg --segments
📂 Loading audio: message.ogg
⏱ Duration: 7.5s | Sample rate: 16000Hz
🧠 Auto-selected model: BASE
✓ Model loaded (1.0s)
🎯 Transcribing...
✅ Done (2.4s total)
Now I'm sending this voice message to XiaoA, can you recognize what I said?
📝 Segments:
[0.0s - 3.6s] Now I'm sending this voice message
[3.6s - 7.4s] to XiaoA, can you recognize what I said?
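The segment lines above can be produced from a list of segments with `start`, `end`, and `text` keys, which is the shape Whisper's `transcribe()` returns under `"segments"`. The sketch below is a hypothetical formatter, not the script's own code, and the sample data is copied from the output above.

```python
from typing import Dict, List


def format_segments(segments: List[Dict]) -> str:
    """Render segments as '[start - end] text' lines, one per segment."""
    return "\n".join(
        f"[{seg['start']:.1f}s - {seg['end']:.1f}s] {seg['text'].strip()}"
        for seg in segments
    )


sample = [
    {"start": 0.0, "end": 3.6, "text": " Now I'm sending this voice message"},
    {"start": 3.6, "end": 7.4, "text": " to XiaoA, can you recognize what I said?"},
]
print(format_segments(sample))
```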
Troubleshooting
| Problem | Solution |
|---|---|
| No module error | Use the venv Python: python3 scripts/transcribe.py or run scripts/install.py |
| Slow transcription | The first run downloads and caches the model (~139-461MB), so it is slower. Later runs reuse the cache. |
| Wrong language detected | Pass --language en or --language zh for a hint |
| Background noise | Use --model small or --model medium for noisy environments |
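For the "No module error" case, it can help to confirm which interpreter is running and which packages it can see. This is a small diagnostic sketch, assuming the import names `whisper`, `soundfile`, `numpy`, and `torch` for the packages listed under Installation; `missing_modules` is a hypothetical helper.

```python
import importlib.util
import sys
from typing import Iterable, List

# Import names for the packages listed under Installation.
REQUIRED = ("whisper", "soundfile", "numpy", "torch")


def missing_modules(names: Iterable[str] = REQUIRED) -> List[str]:
    """Return the module names the current interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Missing modules:", ", ".join(missing))
    print("Interpreter in use:", sys.executable)
else:
    print("All required modules are importable.")
```

If modules are reported missing even after installing them, the interpreter path printed above is likely not the one the packages were installed into.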
Token Savings Examples
| Scenario | Cloud API Cost | This Skill | Savings |
|---|---|---|---|
| 10 short voice messages/day | ~$0.60/day (Whisper API) | $0 | ∞ |
| 1 hour meeting transcription | ~$2.88 (Deepgram) | $0 | ∞ |
| 1000 files for a project | ~$50-200 | $0 | ∞ |
| Agent processing voice inputs | LLM tokens + API fees | 0 tokens | Full token budget saved |
Privacy & Security
- 100% offline — no data leaves your machine.
- No API keys — no third-party services, no accounts.
- No telemetry — zero tracking.
- No cloud — everything runs locally.
- Zero token consumption — frees your LLM budget for reasoning.
Your audio is yours. Always.
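- Make sure OpenClaw is installed (local or Docker deployment).
- Enter the install command in the chat box: /install smart-voice-recognition
- After installation, call the Skill by name or use /smart-voice-recognition to trigger it.
- Provide the required inputs according to the Skill's parameter description to get structured output.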
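What is Voice Recognition?
Intelligent speech-to-text using local OpenAI Whisper (no API key needed, fully private). Use when you need to transcribe audio files, convert voice messages... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 31 downloads to date.
How do I install Voice Recognition?
Run the command "/install smart-voice-recognition" in the OpenClaw or Claude Code chat box to install it in one step. No extra configuration is needed.
Is Voice Recognition free?
Yes. Voice Recognition is completely free and released under the MIT-0 license, so you can download, install, and use it freely.
Which platforms does Voice Recognition support?
Voice Recognition is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.
Who developed Voice Recognition?
It is developed and maintained by 08Jacky04 (@08jacky04). The current version is v1.1.0.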