
Voice Recognition


Author: 08Jacky04 · GitHub · v1.1.0 · MIT-0

cross-platform ⚠ suspicious


Install in OpenClaw

/install smart-voice-recognition

Description

Intelligent speech-to-text using local OpenAI Whisper (no API key needed, fully private). Use when you need to transcribe audio files, convert voice messages...

Usage Instructions (SKILL.md)

🎤 Voice Recognition — Smart Auto-Model Selection

Transcribe audio to text using local OpenAI Whisper. No API keys are needed, and once the model has been downloaded on first run, transcription works offline and your audio stays on your machine.

Smart auto-selection picks a Whisper model from the audio's duration and complexity, so you never have to choose one yourself.

Quick Start

# Auto mode — analyzes audio, picks best model automatically
scripts/transcribe.py voice.ogg

# Force a specific model
scripts/transcribe.py voice.ogg --model small

# Specify language (auto-detect if omitted)
scripts/transcribe.py voice.ogg --language zh   # Chinese (Mandarin)
scripts/transcribe.py voice.ogg --language en   # English
scripts/transcribe.py voice.ogg --language yue  # Cantonese

# Show segment timestamps
scripts/transcribe.py voice.ogg --segments

# Save transcript to file
scripts/transcribe.py voice.ogg -o transcript.txt
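
The flags above could be declared with `argparse` roughly as follows. This is a minimal sketch and not the skill's actual `transcribe.py`: the long option name `--output`, the `choices` list, and the help strings are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in Quick Start."""
    parser = argparse.ArgumentParser(
        prog="transcribe.py",
        description="Transcribe an audio file with a local Whisper model.",
    )
    parser.add_argument("audio", help="path to the audio file")
    # None means "auto mode": the script chooses a model itself.
    parser.add_argument(
        "--model",
        choices=["tiny", "base", "small", "medium", "large"],
        default=None,
        help="force a specific model instead of auto-selection",
    )
    # None means "auto-detect the language".
    parser.add_argument("--language", default=None, help="language hint, e.g. zh or en")
    parser.add_argument("--segments", action="store_true", help="print segment timestamps")
    # The long name --output is an assumption; only -o appears in the docs.
    parser.add_argument("-o", "--output", default=None, help="write the transcript to this file")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["voice.ogg", "--model", "small", "--segments"])
    print(args.audio, args.model, args.language, args.segments, args.output)
```

Leaving `--model` and `--language` at `None` by default keeps "not specified" distinguishable from any explicit choice, which is what an auto mode needs.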

Smart Auto-Selection

The script analyzes audio duration + complexity and selects the optimal model automatically:

Audio Characteristic	Model Used	Why
Short (<10s), clean speech	base	Fast (2-3s). Accurate enough for simple content.
Short (<10s), mixed languages	small	Better multilingual handling for code-switching.
Medium (10-60s), clean	base	Balanced speed and accuracy.
Medium (10-60s), mixed	small	Handles accents and language transitions.
Long (1-2min)	small	Maintains context, still fast enough.
Very long (2min+)	medium	Maximum accuracy for extended recordings.

You don't need to think about models. Just send audio.
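
The selection table can be read as a small decision function. The sketch below is an illustration of the table only, not the script's real logic: the name `pick_model`, the `mixed` flag, and the treatment of exactly 60 s as "long" are assumptions, and the documentation does not say how mixed-language audio is detected.

```python
def pick_model(duration_s: float, mixed: bool = False) -> str:
    """Map audio duration (seconds) and a 'mixed languages' flag to a model name.

    Thresholds follow the table above; boundary handling is an assumption.
    """
    if duration_s >= 120:   # very long (2 min+)
        return "medium"
    if duration_s >= 60:    # long (1-2 min)
        return "small"
    # Short and medium clips: upgrade only when the audio is mixed/complex.
    return "small" if mixed else "base"


if __name__ == "__main__":
    for duration, mixed in [(7.5, False), (7.5, True), (32.0, False), (90.0, False), (300.0, False)]:
        print(duration, mixed, pick_model(duration, mixed))
```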

Installation

Prerequisites

Python 3.10+
pip (Python package manager)

Via bundled installer

python3 scripts/install.py

Manual

pip install openai-whisper soundfile numpy
pip install torch --index-url https://download.pytorch.org/whl/cpu

Using requirements.txt

pip install -r requirements.txt
pip install torch --index-url https://download.pytorch.org/whl/cpu

Note: The first run downloads the Whisper model (~139MB for base, ~461MB for small), which requires network access. Subsequent runs reuse the cached model (~/.cache/whisper/) and skip the download.
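
To tell in advance whether a run will trigger a download, you can look for the model file in the cache directory. This is a hedged sketch: the helper names are hypothetical, and the `<model>.pt` file naming is an assumption about how Whisper lays out its cache, so verify it against your own `~/.cache/whisper/` directory.

```python
from pathlib import Path
from typing import Optional


def default_cache_dir() -> Path:
    """Cache location mentioned in the note above."""
    return Path.home() / ".cache" / "whisper"


def is_model_cached(model: str, cache_dir: Optional[Path] = None) -> bool:
    """Return True if a file named '<model>.pt' exists in the cache directory.

    The '<model>.pt' naming is an assumption, not something the skill documents.
    """
    directory = cache_dir if cache_dir is not None else default_cache_dir()
    return (directory / f"{model}.pt").is_file()


if __name__ == "__main__":
    for name in ("base", "small"):
        state = "cached" if is_model_cached(name) else "will be downloaded on first use"
        print(f"{name}: {state}")
```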

Model Reference

Model	Size	Speed	Accuracy	Best For
tiny	72MB	⚡⚡⚡	⭐⭐	Real-time preview, very short clips
base	139MB	⚡⚡	⭐⭐⭐	General use (auto-select default for short audio)
small	461MB	⚡	⭐⭐⭐⭐	Mixed languages, accents (auto-select for long/complex)
medium	1.5GB	🐢	⭐⭐⭐⭐⭐	Maximum accuracy, long recordings
large	2.9GB	🐢	⭐⭐⭐⭐⭐	Research-grade transcription

Language Support

Whisper supports 99 languages including:

🇨🇳 Chinese (Mandarin, Cantonese)
🇺🇸 English
🇪🇸 Spanish
🇯🇵 Japanese
🇰🇷 Korean
🇫🇷 French
🇩🇪 German

The language is auto-detected by default. Use --language to provide a hint for better accuracy.

Features

Feature	Description
🔒 100% Private	Everything runs locally. No data leaves your machine.
🆓 No API Costs	Free unlimited transcription. No quotas, no keys.
🌐 99 Languages	Supports virtually all major world languages.
🧠 Smart Auto-Model	Analyzes audio → picks optimal model automatically.
⚡ Fast by Default	Short clips → base model (2-3s). Long clips → small/medium.
🎯 Accurate When Needed	Complex/mixed audio automatically upgrades the model.
📊 Segment Timestamps	Sentence-level timing for long recordings.
📁 Multiple Formats	OGG, WAV, MP3, M4A, FLAC, OPUS and more.

Supported Audio Formats

Format	Extension	Notes
OGG Opus	`.ogg`	Common voice message format ✅
WAV	`.wav`	Uncompressed, high quality
MP3	`.mp3`	Compressed audio
M4A	`.m4a`	Apple/MPEG-4 audio
FLAC	`.flac`	Lossless compressed
OPUS	`.opus`	Pure Opus stream
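
A caller that batches files may want to filter by extension before invoking the script. The sketch below checks only the six extensions listed in the table; the helper name is hypothetical, and since the feature list says "and more", a file rejected here may still be readable by the script.

```python
from pathlib import Path

# Extensions taken from the table above; the skill may accept others.
LISTED_EXTENSIONS = {".ogg", ".wav", ".mp3", ".m4a", ".flac", ".opus"}


def is_listed_format(path: str) -> bool:
    """Return True if the file extension appears in the documented format table."""
    return Path(path).suffix.lower() in LISTED_EXTENSIONS


if __name__ == "__main__":
    for candidate in ("voice.ogg", "lecture.M4A", "notes.txt"):
        print(candidate, is_listed_format(candidate))
```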

Usage Examples

Quick transcription (auto model)

$ scripts/transcribe.py meeting.ogg
📂 Loading audio: meeting.ogg
⏱  Duration: 32.0s | Sample rate: 16000Hz
🧠 Auto-selected model: BASE
✓ Model loaded (1.0s)
🎯 Transcribing...
✅ Done (4.1s total)
Meeting notes: Today we discuss three topics. First, project progress...

Transcription in context

# Chinese
scripts/transcribe.py voice.ogg --language zh

# English lecture with timestamps
scripts/transcribe.py lecture.m4a --language en --segments

# Mixed Chinese-English interview (auto complexity detection)
scripts/transcribe.py interview.ogg

# Save to file
scripts/transcribe.py podcast.mp3 -o transcript.txt

# Force high accuracy
scripts/transcribe.py important.wav --model medium

Output with segments

$ scripts/transcribe.py message.ogg --segments
📂 Loading audio: message.ogg
⏱  Duration: 7.5s | Sample rate: 16000Hz
🧠 Auto-selected model: BASE
✓ Model loaded (1.0s)
🎯 Transcribing...
✅ Done (2.4s total)
Now I'm sending this voice message to XiaoA, can you recognize what I said?

📝 Segments:
   [0.0s - 3.6s] Now I'm sending this voice message
   [3.6s - 7.4s] to XiaoA, can you recognize what I said?
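
Whisper's Python API returns a result dictionary whose `segments` entries carry `start`, `end`, and `text` fields. Assuming the script prints them directly (its source is not shown here), the listing above could be produced by a formatter like this sketch; the function name and the three-space indent are illustrative.

```python
def format_segments(segments):
    """Render Whisper-style segment dicts as '[start - end] text' lines."""
    lines = []
    for seg in segments:
        # Whisper segment text usually begins with a space, so strip it.
        text = seg["text"].strip()
        lines.append(f"   [{seg['start']:.1f}s - {seg['end']:.1f}s] {text}")
    return lines


if __name__ == "__main__":
    sample = [
        {"start": 0.0, "end": 3.6, "text": " Now I'm sending this voice message"},
        {"start": 3.6, "end": 7.4, "text": " to XiaoA, can you recognize what I said?"},
    ]
    print("\n".join(format_segments(sample)))
```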

Troubleshooting

Problem	Solution
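`No module` error	Run the script with the Python interpreter that has the dependencies installed (for example the skill's virtual environment), or run `python3 scripts/install.py` again
Slow transcription	The first run downloads and caches the model (~139-461MB), so it is slower. Later runs reuse the cache.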
Wrong language detected	Pass `--language en` or `--language zh` for a hint
Background noise	Use `--model small` or `--model medium` for noisy environments
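
When a `No module` error appears, it helps to confirm which dependency the current interpreter cannot see. The sketch below uses the standard library only; the module list reflects the packages from the Manual install step (the `openai-whisper` package is imported as `whisper`), and the helper name is hypothetical.

```python
import importlib.util

# Import names for the packages listed under "Manual" installation.
REQUIRED_MODULES = ["whisper", "soundfile", "numpy", "torch"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that this interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Run it with the same interpreter you use for `transcribe.py`; a different result under another interpreter points to an environment mismatch.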

Token Savings Examples

Scenario	Cloud API Cost	This Skill	Savings
10 short voice messages/day	~$0.60/day (Whisper API)	$0	100%
1 hour meeting transcription	~$2.88 (Deepgram)	$0	100%
1000 files for a project	~$50-200	$0	100%
Agent processing voice inputs	LLM tokens + API fees	0 tokens	Full token budget saved
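
The cloud figures in the table follow the usual per-minute pricing model: cost equals audio minutes times a rate. The sketch below only reproduces that arithmetic from the table's own numbers; the function names are hypothetical, and the rates are whatever the table implies, not verified provider prices.

```python
def cloud_cost(audio_minutes: float, rate_per_minute: float) -> float:
    """Cost of transcribing audio with a per-minute priced cloud service."""
    return audio_minutes * rate_per_minute


def implied_rate(total_cost: float, audio_minutes: float) -> float:
    """Per-minute rate implied by a total cost for a given amount of audio."""
    return total_cost / audio_minutes


if __name__ == "__main__":
    # The table quotes about $2.88 for a one-hour meeting.
    rate = implied_rate(2.88, 60)
    print(f"implied rate: ${rate:.3f} per minute")
    print(f"30 minutes at that rate: ${cloud_cost(30, rate):.2f}")
```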

Privacy & Security

Offline transcription: no audio leaves your machine. Network access is needed only to install packages and download the model on first run.
No API keys: no third-party services, no accounts.
No telemetry: zero tracking.
No cloud: all processing runs locally.
Zero token consumption: frees your LLM budget for reasoning.

Your audio is yours. Always.

Safe Usage Recommendations

Install only in an isolated environment you control. Before using it, remove or audit the /tmp/whisper-venv import fallback and avoid the installer path that uses --break-system-packages. The transcription function itself appears purpose-aligned and local, but setup/model downloads still require external package/model sources.
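
One way to audit the import fallback is to list import-path entries that live under a temporary directory. This is a minimal sketch with a hypothetical helper name; it inspects path strings only and assumes POSIX-style paths, so it complements reading the script's source and does not replace it.

```python
import sys
from pathlib import PurePosixPath


def entries_under(paths, roots=("/tmp",)):
    """Return the path entries located at or below any of the given roots."""
    root_paths = [PurePosixPath(root) for root in roots]
    flagged = []
    for entry in paths:
        if not entry:  # an empty string means "current directory"; skip it
            continue
        candidate = PurePosixPath(entry)
        if any(candidate == root or root in candidate.parents for root in root_paths):
            flagged.append(entry)
    return flagged


if __name__ == "__main__":
    suspicious = entries_under(sys.path)
    print("Import path entries under /tmp:", suspicious or "none")
```

To see what the transcription script itself adds, the same check would have to run after its path manipulation, for example from a debugging session inside that script.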

Capability Tags

requires-sensitive-credentials

Capability Assessment

ℹ Purpose & Capability

The main functionality matches the stated purpose: it reads a user-supplied audio file and transcribes it locally with Whisper. Users should still note that setup/model downloads require network access despite broad 'no internet required' wording.

✓ Instruction Scope

The documented commands are user-directed transcription and installation examples. The artifacts do not show prompt injection, hidden goal changes, or autonomous background behavior.

⚠ Install Mechanism

Installation uses external pip packages and the bundled installer can fall back to system pip with --break-system-packages, which may alter the user's broader Python environment rather than staying contained in the skill directory.
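
A contained alternative to `--break-system-packages` is a dedicated virtual environment created with the standard library's `venv` module. The sketch below is an assumption about how one might set that up, not what the bundled installer does; the function name and the `.venv` location are illustrative.

```python
import sys
import venv
from pathlib import Path


def create_skill_venv(target, with_pip: bool = True) -> Path:
    """Create a virtual environment at `target` and return its Python executable path."""
    target = Path(target)
    venv.create(target, with_pip=with_pip)
    # Virtual environments use Scripts\ on Windows and bin/ elsewhere.
    if sys.platform == "win32":
        return target / "Scripts" / "python.exe"
    return target / "bin" / "python"


if __name__ == "__main__":
    python_path = create_skill_venv(".venv")
    print("Install the dependencies with:")
    print(f"  {python_path} -m pip install openai-whisper soundfile numpy")
```

Installing into this environment leaves the system Python untouched, and removing the directory undoes the installation.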

⚠ Credentials

The transcription script prepends /tmp/whisper-venv site-packages to Python's import path before importing Whisper, which is not the documented local .venv and could load unreviewed code from a temporary location.

ℹ Persistence & Privilege

The skill creates a virtual environment and caches Whisper models on disk, which is expected for local Whisper use. No background process, self-starting persistence, or credential storage is shown.

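How to Use

Make sure OpenClaw is installed (locally or via Docker).
Enter the install command in the chat box: /install smart-voice-recognition
Once installation finishes, invoke the Skill by name or trigger it with /smart-voice-recognition
Provide the required inputs as described in the Skill's parameter documentation to get structured output.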

Version History

v1.1.0

Added token savings comparison table. Highlighted zero token consumption as key differentiator.

v1.0.0

- Initial release of smart-voice-recognition with **local, private OpenAI Whisper transcription** (no API keys, no internet required).
- **Smart auto-model selection**: Automatically analyzes audio length and complexity to choose the optimal Whisper model for speed and accuracy.
- Supports **99+ languages** with automatic language detection and manual override option.
- Multiple audio formats supported, including OGG, WAV, MP3, M4A, FLAC, and OPUS.
- Features include segment timestamps, output file saving, and fully offline privacy.

Metadata

Slug smart-voice-recognition

Version 1.1.0

License MIT-0

Total installs 0

Current installs 0

Versions 2
