
Voice Recognition

Name: Voice Recognition
Author: 08jacky04

by 08Jacky04 · GitHub ↗ · v1.1.0 · MIT-0

cross-platform ⚠ suspicious

Install in OpenClaw

/install smart-voice-recognition

Description

Intelligent speech-to-text using local OpenAI Whisper (no API key needed, fully private). Use when you need to transcribe audio files, convert voice messages...

README (SKILL.md)

🎤 Voice Recognition — Smart Auto-Model Selection

Transcribe audio to text using local OpenAI Whisper. No API keys, no internet required, 100% private.

Smart auto-selection dynamically picks the best model based on your audio characteristics — you never have to think about which model to use.

Quick Start

# Auto mode — analyzes audio, picks best model automatically
scripts/transcribe.py voice.ogg

# Force a specific model
scripts/transcribe.py voice.ogg --model small

# Specify language (auto-detect if omitted)
scripts/transcribe.py voice.ogg --language zh   # Chinese (Mandarin)
scripts/transcribe.py voice.ogg --language en   # English
scripts/transcribe.py voice.ogg --language yue  # Cantonese

# Show segment timestamps
scripts/transcribe.py voice.ogg --segments

# Save transcript to file
scripts/transcribe.py voice.ogg -o transcript.txt
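
The flags shown above could be modeled with a small argparse parser. This is only a sketch of a command line that accepts the documented options, not the actual code of scripts/transcribe.py: the function name `build_parser`, the `output` attribute name, and the list of model choices (taken from the Model Reference table) are assumptions.

```python
import argparse

# Model names as listed in the Model Reference table (assumed to be the valid choices).
MODELS = ["tiny", "base", "small", "medium", "large"]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the flags documented in Quick Start."""
    parser = argparse.ArgumentParser(prog="transcribe.py")
    parser.add_argument("audio", help="path to the audio file")
    # No --model means "auto mode" in the README; None stands in for that here.
    parser.add_argument("--model", choices=MODELS, default=None)
    # No --language means auto-detect.
    parser.add_argument("--language", default=None)
    parser.add_argument("--segments", action="store_true")
    # Only the short form -o appears in the README, so no long form is invented.
    parser.add_argument("-o", dest="output", default=None)
    return parser


args = build_parser().parse_args(["voice.ogg", "--language", "zh", "--segments"])
print(args.audio, args.model, args.language, args.segments, args.output)
```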

Smart Auto-Selection

The script analyzes the audio's duration and complexity, then selects a model according to this table:

Audio Characteristic	Model Used	Why
Short (<10s), clean speech	base	Fast (2-3s). Accurate enough for simple content.
Short (<10s), mixed languages	small	Better multilingual handling for code-switching.
Medium (10-60s), clean	base	Balanced speed and accuracy.
Medium (10-60s), mixed	small	Handles accents and language transitions.
Long (1-2min)	small	Maintains context, still fast enough.
Very long (2min+)	medium	Maximum accuracy for extended recordings.

You don't need to think about models. Just send audio.
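
The table above can be read as a simple decision rule. The sketch below is one possible encoding of it, not the script's actual logic: the function name `choose_model` and the `mixed` flag are hypothetical, how the script detects mixed or complex audio is not documented here, and the handling of the exact 60s and 120s boundaries is an assumption.

```python
def choose_model(duration_s: float, mixed: bool = False) -> str:
    """Map audio duration (seconds) and a 'mixed languages' flag to a model name.

    Boundaries follow the README table; which side the exact 60s and 120s
    values fall on is an assumption.
    """
    if duration_s >= 120:   # Very long (2min+)
        return "medium"
    if duration_s >= 60:    # Long (1-2min)
        return "small"
    # Short (<10s) and Medium (10-60s) rows share the same outcome:
    # clean speech uses base, mixed speech uses small.
    return "small" if mixed else "base"


for seconds, is_mixed in [(7.5, False), (32.0, False), (8.0, True), (90.0, False), (300.0, False)]:
    print(seconds, is_mixed, choose_model(seconds, is_mixed))
```

The 7.5s and 32.0s cases correspond to the two sample runs under Usage Examples, both of which report the base model.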

Installation

Prerequisites

Python 3.10+
pip (Python package manager)

Via bundled installer

python3 scripts/install.py

Manual

pip install openai-whisper soundfile numpy
pip install torch --index-url https://download.pytorch.org/whl/cpu

Using requirements.txt

pip install -r requirements.txt
pip install torch --index-url https://download.pytorch.org/whl/cpu

Note: The first run downloads the Whisper model (~139MB for base, ~461MB for small), which requires network access. Subsequent runs load the cached model from ~/.cache/whisper/ without downloading it again.
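
To see whether a model is already cached before going offline, a check along these lines may help. It is a sketch under two assumptions: that the cache lives at ~/.cache/whisper/ as stated above, and that each model is stored as `<name>.pt` in that directory. The helper names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def default_cache_dir() -> Path:
    """Cache location named in the README."""
    return Path.home() / ".cache" / "whisper"


def is_model_cached(name: str, cache_dir: Optional[Path] = None) -> bool:
    """Return True if '<name>.pt' exists in the cache directory.

    The '<name>.pt' file naming is an assumption about how models are stored.
    """
    directory = cache_dir if cache_dir is not None else default_cache_dir()
    return (directory / f"{name}.pt").is_file()


for model_name in ("base", "small"):
    print(model_name, "cached" if is_model_cached(model_name) else "not cached")
```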

Model Reference

Model	Size	Speed	Accuracy	Best For
tiny	72MB	⚡⚡⚡	⭐⭐	Real-time preview, very short clips
base	139MB	⚡⚡	⭐⭐⭐	General use (auto-select default for short audio)
small	461MB	⚡	⭐⭐⭐⭐	Mixed languages, accents (auto-select for long/complex)
medium	1.5GB	🐢	⭐⭐⭐⭐⭐	Maximum accuracy, long recordings
large	2.9GB	🐢	⭐⭐⭐⭐⭐	Research-grade transcription

Language Support

Whisper supports 99 languages including:

🇨🇳 Chinese (Mandarin, Cantonese)
🇺🇸 English
🇪🇸 Spanish
🇯🇵 Japanese
🇰🇷 Korean
🇫🇷 French
🇩🇪 German

The language is auto-detected by default. Pass --language to set it explicitly, which helps when detection picks the wrong language.
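
For readers calling Whisper from Python rather than through the script, the language hint corresponds to the `language` argument of the openai-whisper `transcribe()` method. The sketch below assumes openai-whisper is installed; the helper names `language_kwargs` and `transcribe_with_hint` are hypothetical and are not part of this skill.

```python
from typing import Optional


def language_kwargs(language: Optional[str]) -> dict:
    """Build keyword arguments for transcribe(); an empty hint means auto-detect."""
    if language is None or not language.strip():
        return {}
    return {"language": language.strip().lower()}


def transcribe_with_hint(path: str, model_name: str = "base",
                         language: Optional[str] = None) -> str:
    """Transcribe a file, optionally forcing the language (requires openai-whisper)."""
    import whisper  # imported lazily so the helper above works without it

    model = whisper.load_model(model_name)
    result = model.transcribe(path, **language_kwargs(language))
    return result["text"].strip()


print(language_kwargs("zh"))   # hint supplied
print(language_kwargs(None))   # no hint, Whisper detects the language
```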

Features

Feature	Description
🔒 100% Private	Everything runs locally. No data leaves your machine.
🆓 No API Costs	Free unlimited transcription. No quotas, no keys.
🌐 99 Languages	Supports virtually all major world languages.
🧠 Smart Auto-Model	Analyzes audio → picks optimal model automatically.
⚡ Fast by Default	Short clips → base model (2-3s). Long clips → small/medium.
🎯 Accurate When Needed	Complex/mixed audio automatically upgrades the model.
📊 Segment Timestamps	Sentence-level timing for long recordings.
📁 Multiple Formats	OGG, WAV, MP3, M4A, FLAC, OPUS and more.

Supported Audio Formats

Format	Extension	Notes
OGG Opus	`.ogg`	Common voice message format ✅
WAV	`.wav`	Uncompressed, high quality
MP3	`.mp3`	Compressed audio
M4A	`.m4a`	Apple/MPEG-4 audio
FLAC	`.flac`	Lossless compressed
OPUS	`.opus`	Pure Opus stream
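
When batch-processing a folder, it can be convenient to filter files by the extensions in the table above. This sketch checks only the file name, not the audio content, and since the feature list says "and more", the set is not exhaustive. The names `DOCUMENTED_EXTENSIONS` and `has_documented_extension` are hypothetical.

```python
from pathlib import Path

# Extensions listed in the Supported Audio Formats table (not exhaustive).
DOCUMENTED_EXTENSIONS = {".ogg", ".wav", ".mp3", ".m4a", ".flac", ".opus"}


def has_documented_extension(path: str) -> bool:
    """Return True if the file extension is one of the documented formats."""
    return Path(path).suffix.lower() in DOCUMENTED_EXTENSIONS


for name in ("voice.OGG", "lecture.m4a", "notes.txt"):
    print(name, has_documented_extension(name))
```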

Usage Examples

Quick transcription (auto model)

$ scripts/transcribe.py meeting.ogg
📂 Loading audio: meeting.ogg
⏱  Duration: 32.0s | Sample rate: 16000Hz
🧠 Auto-selected model: BASE
✓ Model loaded (1.0s)
🎯 Transcribing...
✅ Done (4.1s total)
Meeting notes: Today we discuss three topics. First, project progress...

Transcription in context

# Chinese
scripts/transcribe.py voice.ogg --language zh

# English lecture with timestamps
scripts/transcribe.py lecture.m4a --language en --segments

# Mixed Chinese-English interview (auto complexity detection)
scripts/transcribe.py interview.ogg

# Save to file
scripts/transcribe.py podcast.mp3 -o transcript.txt

# Force high accuracy
scripts/transcribe.py important.wav --model medium

Output with segments

$ scripts/transcribe.py message.ogg --segments
📂 Loading audio: message.ogg
⏱  Duration: 7.5s | Sample rate: 16000Hz
🧠 Auto-selected model: BASE
✓ Model loaded (1.0s)
🎯 Transcribing...
✅ Done (2.4s total)
Now I'm sending this voice message to XiaoA, can you recognize what I said?

📝 Segments:
   [0.0s - 3.6s] Now I'm sending this voice message
   [3.6s - 7.4s] to XiaoA, can you recognize what I said?
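
The segment lines above can be reproduced from Whisper's result, whose `segments` entries carry `start`, `end`, and `text` fields. The sketch below formats such entries in the same layout. The function name `format_segments` is hypothetical, and the sample data is copied from the output above rather than produced by a real run.

```python
from typing import Dict, List


def format_segments(segments: List[Dict]) -> str:
    """Render segments as '   [start - end] text' lines, one per segment."""
    lines = []
    for seg in segments:
        text = seg["text"].strip()
        lines.append(f"   [{seg['start']:.1f}s - {seg['end']:.1f}s] {text}")
    return "\n".join(lines)


# Sample data mirroring the README output (not from an actual transcription).
sample = [
    {"start": 0.0, "end": 3.6, "text": " Now I'm sending this voice message"},
    {"start": 3.6, "end": 7.4, "text": " to XiaoA, can you recognize what I said?"},
]
print("📝 Segments:")
print(format_segments(sample))
```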

Troubleshooting

Problem	Solution
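`No module` error	Run the script with the Python that has the dependencies installed (`python3 scripts/transcribe.py`), or run `python3 scripts/install.py` first
Slow transcription	The first run downloads and caches the model (~139-461MB), so it takes longer. Later runs reuse the cache.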
Wrong language detected	Pass `--language en` or `--language zh` for a hint
Background noise	Use `--model small` or `--model medium` for noisy environments
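
For the `No module` case, a quick way to see which dependencies the current interpreter cannot find is to probe for them without importing. The sketch assumes the import names `whisper`, `soundfile`, `numpy`, and `torch` for the packages listed under Installation; the function name `missing_modules` is hypothetical.

```python
import importlib.util
from typing import Iterable, List

# Import names assumed for the packages in the Manual install section.
REQUIRED = ["whisper", "soundfile", "numpy", "torch"]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that this interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Running this with the same interpreter that runs scripts/transcribe.py shows whether the error comes from a missing package or from using the wrong Python.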

Token Savings Examples

Scenario	Cloud API Cost	This Skill	Savings
10 short voice messages/day	~$0.60/day (Whisper API)	$0	100%
1 hour meeting transcription	~$2.88 (Deepgram)	$0	100%
1000 files for a project	~$50-200	$0	100%
Agent processing voice inputs	LLM tokens + API fees	0 tokens	Full token budget saved

Privacy & Security

100% offline — no data leaves your machine.
No API keys — no third-party services, no accounts.
No telemetry — zero tracking.
No cloud — everything runs locally.
Zero token consumption — frees your LLM budget for reasoning.

Your audio is yours. Always.

Usage Guidance

Install only in an isolated environment you control. Before using it, remove or audit the /tmp/whisper-venv import fallback and avoid the installer path that uses --break-system-packages. The transcription function itself appears purpose-aligned and local, but setup/model downloads still require external package/model sources.
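
One way to follow this guidance is to install the dependencies into a dedicated virtual environment instead of using the bundled installer. The sketch below only builds the commands, using the package names and index URL from the Manual install section, and never passes --break-system-packages. The directory name `.venv` and the helper names are assumptions.

```python
import os
import sys
from pathlib import Path
from typing import List

PACKAGES = ["openai-whisper", "soundfile", "numpy"]
TORCH_CPU_INDEX = "https://download.pytorch.org/whl/cpu"


def venv_python(venv_dir: Path) -> Path:
    """Path to the interpreter inside the virtual environment."""
    if os.name == "nt":
        return venv_dir / "Scripts" / "python.exe"
    return venv_dir / "bin" / "python"


def install_commands(venv_dir: Path) -> List[List[str]]:
    """Commands that create the venv and install into it, never system-wide."""
    py = str(venv_python(venv_dir))
    return [
        [sys.executable, "-m", "venv", str(venv_dir)],
        [py, "-m", "pip", "install", *PACKAGES],
        [py, "-m", "pip", "install", "torch", "--index-url", TORCH_CPU_INDEX],
    ]


for cmd in install_commands(Path(".venv")):
    print(" ".join(cmd))
    # To execute instead of print: subprocess.run(cmd, check=True)
```

Printing the commands first makes it easy to review exactly what will run before anything touches the machine.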

Capability Tags

requires-sensitive-credentials

Capability Assessment

ℹ Purpose & Capability

The main functionality matches the stated purpose: it reads a user-supplied audio file and transcribes it locally with Whisper. Users should still note that setup/model downloads require network access despite broad 'no internet required' wording.

✓ Instruction Scope

The documented commands are user-directed transcription and installation examples. The artifacts do not show prompt injection, hidden goal changes, or autonomous background behavior.

⚠ Install Mechanism

Installation uses external pip packages and the bundled installer can fall back to system pip with --break-system-packages, which may alter the user's broader Python environment rather than staying contained in the skill directory.

⚠ Credentials

The transcription script prepends /tmp/whisper-venv site-packages to Python's import path before importing Whisper, which is not the documented local .venv and could load unreviewed code from a temporary location.
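
A reviewer could confirm this kind of issue by listing import-path entries that point into a temporary directory. The following is a sketch of such an audit, not part of the skill: the function name `temp_path_entries` is hypothetical, and the example paths are illustrative.

```python
import sys
import tempfile
from pathlib import Path
from typing import Iterable, List, Optional


def temp_path_entries(paths: Iterable[str],
                      temp_roots: Optional[Iterable[str]] = None) -> List[str]:
    """Return the entries of `paths` that live under a temporary directory."""
    if temp_roots is None:
        temp_roots = ["/tmp", tempfile.gettempdir()]
    roots = [Path(root).resolve() for root in temp_roots]
    flagged = []
    for entry in paths:
        if not entry:  # an empty string means the current directory
            continue
        resolved = Path(entry).resolve()
        if any(resolved == root or root in resolved.parents for root in roots):
            flagged.append(entry)
    return flagged


# Audit the import path of the running interpreter.
for entry in temp_path_entries(sys.path):
    print("import path entry under a temp directory:", entry)
```

Calling the same function from inside the transcription script, after its path setup and before Whisper is imported, would show whether the /tmp/whisper-venv fallback is active.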

ℹ Persistence & Privilege

The skill creates a virtual environment and caches Whisper models on disk, which is expected for local Whisper use. No background process, self-starting persistence, or credential storage is shown.

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install smart-voice-recognition
After installation, invoke the skill by name or use /smart-voice-recognition
Provide required inputs per the skill's parameter spec and get structured output

Version History

v1.1.0

Added token savings comparison table. Highlighted zero token consumption as key differentiator.

v1.0.0

- Initial release of smart-voice-recognition with **local, private OpenAI Whisper transcription** (no API keys, no internet required).
- **Smart auto-model selection**: Automatically analyzes audio length and complexity to choose the optimal Whisper model for speed and accuracy.
- Supports **99+ languages** with automatic language detection and manual override option.
- Multiple audio formats supported, including OGG, WAV, MP3, M4A, FLAC, and OPUS.
- Features include segment timestamps, output file saving, and fully offline privacy.

Metadata

Slug smart-voice-recognition

Version 1.1.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 2

Frequently Asked Questions

What is Voice Recognition?

Intelligent speech-to-text using local OpenAI Whisper (no API key needed, fully private). Use when you need to transcribe audio files, convert voice messages... It is an AI Agent Skill for Claude Code / OpenClaw, with 31 downloads so far.

How do I install Voice Recognition?

Run "/install smart-voice-recognition" in the OpenClaw or Claude Code chat to install the skill. The Python dependencies listed under Installation are still needed before the first transcription.

Is Voice Recognition free?

Yes, Voice Recognition is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Voice Recognition support?

Voice Recognition is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Voice Recognition?

It is built and maintained by 08Jacky04 (@08jacky04); the current version is v1.1.0.
