08jacky04

Voice Recognition

Author: 08Jacky04 · GitHub · v1.1.0 · MIT-0
cross-platform ⚠ suspicious
Total downloads: 31
Favorites: 0
Current installs: 0
Versions: 2
Install in OpenClaw
/install smart-voice-recognition
Description
Intelligent speech-to-text using local OpenAI Whisper (no API key needed, fully private). Use when you need to transcribe audio files, convert voice messages...
Usage (SKILL.md)

🎤 Voice Recognition — Smart Auto-Model Selection

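Transcribe audio to text using local OpenAI Whisper. No API keys are needed, and transcription runs entirely on your machine. Internet access is required only for installing the packages and for the one-time model download.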

Smart auto-selection dynamically picks the best model based on your audio characteristics — you never have to think about which model to use.

Quick Start

# Auto mode — analyzes audio, picks best model automatically
scripts/transcribe.py voice.ogg

# Force a specific model
scripts/transcribe.py voice.ogg --model small

# Specify language (auto-detect if omitted)
scripts/transcribe.py voice.ogg --language zh   # Chinese (Mandarin)
scripts/transcribe.py voice.ogg --language en   # English
scripts/transcribe.py voice.ogg --language yue  # Cantonese

# Show segment timestamps
scripts/transcribe.py voice.ogg --segments

# Save transcript to file
scripts/transcribe.py voice.ogg -o transcript.txt

Smart Auto-Selection

The script analyzes audio duration + complexity and selects the optimal model automatically:

| Audio Characteristic | Model Used | Why |
| --- | --- | --- |
| Short (<10s), clean speech | base | Fast (2-3s). Accurate enough for simple content. |
| Short (<10s), mixed languages | small | Better multilingual handling for code-switching. |
| Medium (10-60s), clean | base | Balanced speed and accuracy. |
| Medium (10-60s), mixed | small | Handles accents and language transitions. |
| Long (1-2min) | small | Maintains context, still fast enough. |
| Very long (2min+) | medium | Maximum accuracy for extended recordings. |

You don't need to think about models. Just send audio.
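
The selection table above can be read as a small decision function. The following is a minimal sketch, not the skill's actual code: the name `pick_model`, the `mixed_language` flag, and the exact boundary handling (for example, treating exactly 60s as "long") are assumptions made for illustration.

```python
def pick_model(duration_s: float, mixed_language: bool = False) -> str:
    """Return a Whisper model name following the table above.

    Hypothetical helper: the real script may detect complexity differently.
    """
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    if duration_s >= 120:   # very long (2min+)
        return "medium"
    if duration_s >= 60:    # long (1-2min)
        return "small"
    # short and medium clips: upgrade only for mixed-language audio
    return "small" if mixed_language else "base"
```

Under these assumptions, a clean 32-second clip maps to `base`, which matches the sample run in Usage Examples.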

Installation

Prerequisites

  • Python 3.10+
  • pip (Python package manager)

Via bundled installer

python3 scripts/install.py

Manual

pip install openai-whisper soundfile numpy
pip install torch --index-url https://download.pytorch.org/whl/cpu

Using requirements.txt

pip install -r requirements.txt
pip install torch --index-url https://download.pytorch.org/whl/cpu

Note: The first run downloads the Whisper model (~139MB for base, ~461MB for small). Subsequent runs load the cached model from ~/.cache/whisper/ without downloading it again.
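
To see whether a model is already cached before going offline, a check along these lines may help. It is a sketch under two assumptions: that the cache lives in `~/.cache/whisper/` (or under `XDG_CACHE_HOME` if that variable is set), and that each model is stored as `<name>.pt`. The helper names `whisper_cache_dir` and `is_model_cached` are hypothetical and not part of the skill.

```python
import os
from pathlib import Path
from typing import Optional


def whisper_cache_dir() -> Path:
    """Assumed cache location: $XDG_CACHE_HOME/whisper or ~/.cache/whisper."""
    base = os.getenv("XDG_CACHE_HOME", str(Path.home() / ".cache"))
    return Path(base) / "whisper"


def is_model_cached(model_name: str, cache_dir: Optional[Path] = None) -> bool:
    """Return True if <model_name>.pt exists in the cache directory."""
    if cache_dir is None:
        cache_dir = whisper_cache_dir()
    # Assumption: one checkpoint file per model, named after the model.
    return (cache_dir / f"{model_name}.pt").is_file()
```

For example, `is_model_cached("base")` returning `False` suggests the next run will need network access to download the model.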

Model Reference

| Model | Size | Speed | Accuracy | Best For |
| --- | --- | --- | --- | --- |
| tiny | 72MB | ⚡⚡⚡ | ⭐⭐ | Real-time preview, very short clips |
| base | 139MB | ⚡⚡ | ⭐⭐⭐ | General use (auto-select default for short audio) |
| small | 461MB | ⚡ | ⭐⭐⭐⭐ | Mixed languages, accents (auto-select for long/complex) |
| medium | 1.5GB | 🐢 | ⭐⭐⭐⭐⭐ | Maximum accuracy, long recordings |
| large | 2.9GB | 🐢 | ⭐⭐⭐⭐⭐ | Research-grade transcription |

Language Support

Whisper supports 99 languages including:

  • 🇨🇳 Chinese (Mandarin, Cantonese)
  • 🇺🇸 English
  • 🇪🇸 Spanish
  • 🇯🇵 Japanese
  • 🇰🇷 Korean
  • 🇫🇷 French
  • 🇩🇪 German

Auto-detects language by default. Use --language to provide a hint for better accuracy.

Features

| Feature | Description |
| --- | --- |
| 🔒 100% Private | Everything runs locally. No data leaves your machine. |
| 🆓 No API Costs | Free unlimited transcription. No quotas, no keys. |
| 🌐 99 Languages | Supports virtually all major world languages. |
| 🧠 Smart Auto-Model | Analyzes audio → picks optimal model automatically. |
| ⚡ Fast by Default | Short clips → base model (2-3s). Long clips → small/medium. |
| 🎯 Accurate When Needed | Complex/mixed audio automatically upgrades the model. |
| 📊 Segment Timestamps | Sentence-level timing for long recordings. |
| 📁 Multiple Formats | OGG, WAV, MP3, M4A, FLAC, OPUS and more. |

Supported Audio Formats

| Format | Extension | Notes |
| --- | --- | --- |
| OGG Opus | .ogg | Common voice message format ✅ |
| WAV | .wav | Uncompressed, high quality |
| MP3 | .mp3 | Compressed audio |
| M4A | .m4a | Apple/MPEG-4 audio |
| FLAC | .flac | Lossless compressed |
| OPUS | .opus | Pure Opus stream |

Usage Examples

Quick transcription (auto model)

$ scripts/transcribe.py meeting.ogg
📂 Loading audio: meeting.ogg
⏱  Duration: 32.0s | Sample rate: 16000Hz
🧠 Auto-selected model: BASE
✓ Model loaded (1.0s)
🎯 Transcribing...
✅ Done (4.1s total)
Meeting notes: Today we discuss three topics. First, project progress...

Transcription in context

# Chinese
scripts/transcribe.py voice.ogg --language zh

# English lecture with timestamps
scripts/transcribe.py lecture.m4a --language en --segments

# Mixed Chinese-English interview (auto complexity detection)
scripts/transcribe.py interview.ogg

# Save to file
scripts/transcribe.py podcast.mp3 -o transcript.txt

# Force high accuracy
scripts/transcribe.py important.wav --model medium

Output with segments

$ scripts/transcribe.py message.ogg --segments
📂 Loading audio: message.ogg
⏱  Duration: 7.5s | Sample rate: 16000Hz
🧠 Auto-selected model: BASE
✓ Model loaded (1.0s)
🎯 Transcribing...
✅ Done (2.4s total)
Now I'm sending this voice message to XiaoA, can you recognize what I said?

📝 Segments:
   [0.0s - 3.6s] Now I'm sending this voice message
   [3.6s - 7.4s] to XiaoA, can you recognize what I said?
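
The segment lines above have a simple shape: start time, end time, text. Whisper's Python `transcribe()` call returns a dict whose `"segments"` list carries `"start"`, `"end"`, and `"text"` for each segment, so output like this could be produced roughly as follows. This is a sketch only: `format_segments` is a hypothetical helper, and the script's real formatting code may differ.

```python
def format_segments(segments):
    """Render Whisper-style segment dicts as '[start - end] text' lines.

    Each segment is expected to have 'start' and 'end' (seconds) and 'text'.
    """
    lines = []
    for seg in segments:
        text = seg["text"].strip()  # Whisper segment text often has a leading space
        lines.append(f"   [{seg['start']:.1f}s - {seg['end']:.1f}s] {text}")
    return "\n".join(lines)


example = [
    {"start": 0.0, "end": 3.6, "text": " Now I'm sending this voice message"},
    {"start": 3.6, "end": 7.4, "text": " to XiaoA, can you recognize what I said?"},
]
print(format_segments(example))
```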

Troubleshooting

| Problem | Solution |
| --- | --- |
| No module error | Use the venv Python: python3 scripts/transcribe.py or run scripts/install.py |
| Slow transcription | First download caches the model (~139-461MB). Normal for first run. |
| Wrong language detected | Pass --language en or --language zh for a hint |
| Background noise | Use --model small or --model medium for noisy environments |
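
For the "No module" case, it can help to see which packages the current interpreter actually finds before reinstalling anything. The sketch below assumes the import names `whisper`, `soundfile`, `numpy`, and `torch`, which correspond to the packages in the manual install commands. `missing_modules` is a hypothetical helper and not part of the skill.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names for: openai-whisper, soundfile, numpy, torch
    required = ["whisper", "soundfile", "numpy", "torch"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Run it with the same Python that runs `scripts/transcribe.py`. A module that is missing here but installed elsewhere usually points to a different interpreter or virtual environment.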

Token Savings Examples

| Scenario | Cloud API Cost | This Skill | Savings |
| --- | --- | --- | --- |
| 10 short voice messages/day | ~$0.60/day (Whisper API) | $0 | |
| 1 hour meeting transcription | ~$2.88 (Deepgram) | $0 | |
| 1000 files for a project | ~$50-200 | $0 | |
| Agent processing voice inputs | LLM tokens + API fees | 0 tokens | Full token budget saved |

Privacy & Security

  • 100% offline — no data leaves your machine.
  • No API keys — no third-party services, no accounts.
  • No telemetry — zero tracking.
  • No cloud — everything runs locally.
  • Zero token consumption — frees your LLM budget for reasoning.

Your audio is yours. Always.

Security recommendations
Install only in an isolated environment you control. Before using it, remove or audit the /tmp/whisper-venv import fallback and avoid the installer path that uses --break-system-packages. The transcription function itself appears purpose-aligned and local, but setup/model downloads still require external package/model sources.
Capability tags
requires-sensitive-credentials
Capability assessment
Purpose & Capability
The main functionality matches the stated purpose: it reads a user-supplied audio file and transcribes it locally with Whisper. Users should still note that setup/model downloads require network access despite broad 'no internet required' wording.
Instruction Scope
The documented commands are user-directed transcription and installation examples. The artifacts do not show prompt injection, hidden goal changes, or autonomous background behavior.
Install Mechanism
Installation uses external pip packages and the bundled installer can fall back to system pip with --break-system-packages, which may alter the user's broader Python environment rather than staying contained in the skill directory.
Credentials
The transcription script prepends /tmp/whisper-venv site-packages to Python's import path before importing Whisper, which is not the documented local .venv and could load unreviewed code from a temporary location.
Persistence & Privilege
The skill creates a virtual environment and caches Whisper models on disk, which is expected for local Whisper use. No background process, self-starting persistence, or credential storage is shown.
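
One way to audit the import-path concern raised under Credentials is to list any `sys.path` entries that live under a temporary directory. The following is a sketch under stated assumptions: `paths_under_temp` is a hypothetical helper, paths are compared textually without resolving symlinks, and the default roots are `/tmp` plus whatever `tempfile.gettempdir()` reports.

```python
import sys
import tempfile
from pathlib import Path


def paths_under_temp(paths, temp_roots=None):
    """Return the entries of `paths` located at or below a temp directory."""
    if temp_roots is None:
        temp_roots = {Path("/tmp"), Path(tempfile.gettempdir())}
    flagged = []
    for entry in paths:
        if not entry:  # skip '' (current directory) entries
            continue
        p = Path(entry)
        if any(p == root or root in p.parents for root in temp_roots):
            flagged.append(entry)
    return flagged


if __name__ == "__main__":
    for entry in paths_under_temp(sys.path):
        print("import path in temp location:", entry)
```

Calling `paths_under_temp(sys.path)` after the script has set up its imports would surface an entry such as the /tmp/whisper-venv site-packages directory, if it is present.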
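How to use
  1. Make sure OpenClaw is installed (locally or via Docker).
  2. Enter the install command in the chat box: /install smart-voice-recognition
  3. Once installation finishes, invoke the Skill by name or trigger it with /smart-voice-recognition
  4. Provide the inputs described in the Skill's parameter documentation to get structured output.
Version history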
v1.1.0
Added token savings comparison table. Highlighted zero token consumption as key differentiator.
v1.0.0
  • Initial release of smart-voice-recognition with local, private OpenAI Whisper transcription (no API keys, no internet required).
  • Smart auto-model selection: Automatically analyzes audio length and complexity to choose the optimal Whisper model for speed and accuracy.
  • Supports 99+ languages with automatic language detection and manual override option.
  • Multiple audio formats supported, including OGG, WAV, MP3, M4A, FLAC, and OPUS.
  • Features include segment timestamps, output file saving, and fully offline privacy.
Metadata
Slug: smart-voice-recognition
Version: 1.1.0
License: MIT-0
Total installs: 0
Current installs: 0
Version count: 2
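FAQ

What is Voice Recognition?

Intelligent speech-to-text using local OpenAI Whisper (no API key needed, fully private). Use when you need to transcribe audio files, convert voice messages... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 31 total downloads so far.

How do I install Voice Recognition?

Run the command "/install smart-voice-recognition" in the OpenClaw or Claude Code chat box for a one-step install; no additional configuration is needed.

Is Voice Recognition free?

Yes. Voice Recognition is completely free and released under the MIT-0 license, so you can download, install, and use it freely.

Which platforms does Voice Recognition support?

Voice Recognition is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops Voice Recognition?

It is developed and maintained by 08Jacky04 (@08jacky04). The current version is v1.1.0.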
