Description

Complete offline voice-to-voice AI assistant for OpenClaw (Whisper.cpp STT + Pocket-TTS). 100% local processing, no cloud APIs, no costs. Use for hands-free...

README (SKILL.md)

Voice Agent - OpenClaw Skill

Name: Local Voice Agent
Author: pinological

Complete voice-to-voice AI assistant for hands-free operation.

Architecture

User Voice → Whisper STT → Text → OpenClaw AI → Text → Pocket-TTS → Voice Response
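
The flow above can be sketched as three pluggable stages. This is a minimal illustration, not the skill's actual code: `run_pipeline` and its three callables are hypothetical names, and the stubs stand in for the real Whisper, OpenClaw, and Pocket-TTS calls.

```python
from typing import Callable, Dict


def run_pipeline(
    audio_path: str,
    transcribe: Callable[[str], str],
    respond: Callable[[str], str],
    synthesize: Callable[[str], str],
) -> Dict[str, str]:
    """Run one voice turn: audio file -> transcript -> reply text -> reply audio path."""
    transcription = transcribe(audio_path)
    if not transcription.strip():
        # Nothing was recognized, so there is nothing to send to the assistant.
        raise ValueError("empty transcription")
    response_text = respond(transcription)
    audio_response = synthesize(response_text)
    return {
        "transcription": transcription,
        "response_text": response_text,
        "audio_response": audio_response,
    }


# Stub stages (assumptions) so the sketch runs without any audio tooling.
result = run_pipeline(
    "recording.wav",
    transcribe=lambda path: "What time is it?",
    respond=lambda text: f"You asked: {text}",
    synthesize=lambda text: "/tmp/response.wav",
)
print(result)
```

Keeping each stage behind a plain function makes it possible to test the glue logic without a microphone, a model file, or a running TTS server.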

Prerequisites

1. Whisper.cpp (Speech-to-Text)

# Clone and build
git clone https://github.com/ggerganov/whisper.cpp ~/.local/whisper.cpp
cd ~/.local/whisper.cpp
make -j4

# Download tiny model (fast, low-resource)
bash ./models/download-ggml-model.sh tiny

Test:

./build/bin/whisper-cli -m models/ggml-tiny.bin -f samples/jfk.wav
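
A wrapper script could assemble the same command programmatically. The sketch below is an assumption about how such a wrapper might look (`build_whisper_command` is a hypothetical helper); it only uses the install path from the clone step above and the `whisper-cli` options `-m`, `-f`, `-t`, `-l`, and `-nt`.

```python
import shlex
from pathlib import Path
from typing import List


def build_whisper_command(
    wav_path: str,
    model: str = "tiny",
    threads: int = 4,
    language: str = "en",
    whisper_dir: str = "~/.local/whisper.cpp",
) -> List[str]:
    """Return the argv list for one whisper-cli transcription run."""
    root = Path(whisper_dir).expanduser()
    return [
        str(root / "build" / "bin" / "whisper-cli"),
        "-m", str(root / "models" / f"ggml-{model}.bin"),  # model file
        "-f", str(wav_path),                                # input audio
        "-t", str(threads),                                 # CPU threads
        "-l", language,                                     # spoken language
        "-nt",                                              # no timestamps in output
    ]


cmd = build_whisper_command("samples/jfk.wav")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, capture_output=True, text=True, check=True)` would run the transcription and expose the text on `stdout`; an argv list avoids shell quoting problems with file names.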

2. Pocket-TTS (Text-to-Speech)

Option A: Use existing server

export POCKET_TTS_URL="http://localhost:5000"

Option B: Install locally

# Clone your Pocket-TTS server
cd /path/to/pockettts
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python3 -m app.main --host 0.0.0.0 --port 5000

3. FFmpeg (Audio Conversion)

sudo apt-get install -y ffmpeg

Quick Start

Voice Command (One-shot)

# Record → Transcribe → Process → Speak
./bin/voice-agent "What's the weather today?"

Interactive Mode

# Continuous voice conversation
./bin/voice-agent --interactive

Voice File Processing

# Transcribe existing audio file
./bin/voice-to-text recording.wav

# Generate voice from text
./bin/text-to-voice "Hello world!" output.wav

Configuration

Edit config/voices.yaml:

# Default voices
stt:
  model: tiny  # tiny, small, medium (larger = more accurate, slower)
  language: en  # en, ne, hi, etc.

tts:
  url: http://localhost:5000
  voice: peter voice  # Your custom voice
  format: wav  # wav, mp3

# Performance
performance:
  threads: 4  # CPU threads for Whisper
  realtime: true  # Faster-than-realtime processing
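
One way to consume this file is to merge user settings over defaults and validate the result. The sketch below assumes the YAML has already been parsed into a dict (for example with a YAML library); `merge_config` and `validate_config` are hypothetical helpers, and the allowed values are taken from the comments in the config above.

```python
import copy
from typing import Dict, List

DEFAULTS = {
    "stt": {"model": "tiny", "language": "en"},
    "tts": {"url": "http://localhost:5000", "voice": "peter voice", "format": "wav"},
    "performance": {"threads": 4, "realtime": True},
}
VALID_MODELS = {"tiny", "small", "medium"}
VALID_FORMATS = {"wav", "mp3"}


def merge_config(overrides: Dict[str, dict]) -> Dict[str, dict]:
    """Overlay user settings on the defaults without mutating DEFAULTS."""
    merged = copy.deepcopy(DEFAULTS)
    for section, values in overrides.items():
        merged.setdefault(section, {}).update(values)
    return merged


def validate_config(config: Dict[str, dict]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    if config["stt"]["model"] not in VALID_MODELS:
        problems.append(f"unknown stt.model: {config['stt']['model']}")
    if config["tts"]["format"] not in VALID_FORMATS:
        problems.append(f"unknown tts.format: {config['tts']['format']}")
    threads = config["performance"]["threads"]
    if isinstance(threads, bool) or not isinstance(threads, int) or threads < 1:
        problems.append("performance.threads must be a positive integer")
    return problems


config = merge_config({"stt": {"model": "small"}})
print(config["stt"], validate_config(config))
```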

API Endpoints

POST /v1/voice/command

Voice command processing:

curl -X POST "http://localhost:5000/v1/voice/command" \
  -F "[email protected]" \
  -F "action=openclaw"

Response:

{
  "transcription": "What's the weather today?",
  "response_text": "The weather in Kathmandu is partly cloudy, 22 degrees Celsius.",
  "audio_response": "/tmp/response.wav"
}
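
A client might parse this response into a typed object before using it. This is a sketch that assumes only the three fields shown above; `VoiceCommandResult` and `parse_voice_response` are hypothetical names.

```python
import json
from dataclasses import dataclass

REQUIRED_FIELDS = ("transcription", "response_text", "audio_response")


@dataclass
class VoiceCommandResult:
    transcription: str
    response_text: str
    audio_response: str  # path to the synthesized reply


def parse_voice_response(body: str) -> VoiceCommandResult:
    """Decode the JSON body and fail loudly if a field is missing."""
    data = json.loads(body)
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError("response is missing fields: " + ", ".join(missing))
    return VoiceCommandResult(*(data[name] for name in REQUIRED_FIELDS))


sample = """
{
  "transcription": "What's the weather today?",
  "response_text": "The weather in Kathmandu is partly cloudy, 22 degrees Celsius.",
  "audio_response": "/tmp/response.wav"
}
"""
parsed = parse_voice_response(sample)
print(parsed.audio_response)
```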

GET /v1/voices

List available TTS voices:

curl http://localhost:5000/v1/voices

Use Cases

1. Daily Briefings (Voice)

./bin/voice-agent "Give me my morning briefing"

2. Voice Notes

./bin/voice-agent "Remind me to call Peter at 3 PM"

3. Hands-Free Coding

./bin/voice-agent "Show me the status of my git repository"

4. Accessibility

Perfect for users who prefer voice interaction or have mobility constraints.


Scripts

bin/voice-to-text

Convert speech to text:

./bin/voice-to-text input.wav
./bin/voice-to-text input.ogg  # Auto-converts with ffmpeg
./bin/voice-to-text input.mp4  # Extracts audio from video
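
The auto-conversion step can be approximated as follows. This is a sketch under assumptions, not the script's actual logic: `needs_conversion` and `build_ffmpeg_command` are hypothetical helpers, and the target format (16 kHz, mono, 16-bit PCM WAV) is the input whisper.cpp expects.

```python
from pathlib import Path
from typing import List, Optional


def needs_conversion(path: str) -> bool:
    """Treat anything that is not a .wav file as needing conversion.

    Simplification: a .wav file with a different sample rate would also
    need resampling, which this check does not detect.
    """
    return Path(path).suffix.lower() != ".wav"


def build_ffmpeg_command(src: str, dst: Optional[str] = None) -> List[str]:
    """Return an ffmpeg argv that writes 16 kHz mono 16-bit PCM WAV."""
    if dst is None:
        source = Path(src)
        dst = str(source.with_name(source.stem + "_16k.wav"))
    return [
        "ffmpeg", "-y",        # overwrite the output if it exists
        "-i", src,
        "-vn",                 # drop any video stream (e.g. .mp4 input)
        "-ar", "16000",        # 16 kHz sample rate
        "-ac", "1",            # mono
        "-c:a", "pcm_s16le",   # 16-bit PCM
        dst,
    ]


for name in ("input.wav", "input.ogg", "input.mp4"):
    if needs_conversion(name):
        print(" ".join(build_ffmpeg_command(name)))
```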

bin/text-to-voice

Convert text to speech:

./bin/text-to-voice "Hello world!" output.wav
./bin/text-to-voice --voice "usha lama" "Namaste!" greeting.wav

bin/voice-agent

Full voice pipeline:

./bin/voice-agent "What time is it?"
./bin/voice-agent --interactive  # Conversation mode
./bin/voice-agent --file recording.wav  # Process file
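
The three invocation forms above map naturally onto a small argument parser. The sketch below is illustrative only: `select_mode` is a hypothetical helper, and the fallback "record" mode for an empty command line is an assumption rather than documented behavior.

```python
import argparse
from typing import List, Optional, Tuple


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="voice-agent")
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--interactive", action="store_true",
                      help="continuous voice conversation")
    mode.add_argument("--file", metavar="AUDIO",
                      help="process an existing recording")
    parser.add_argument("text", nargs="?", help="one-shot command text")
    return parser


def select_mode(argv: List[str]) -> Tuple[str, Optional[str]]:
    """Map command-line arguments to a (mode, payload) pair."""
    args = build_parser().parse_args(argv)
    if args.interactive:
        return ("interactive", None)
    if args.file:
        return ("file", args.file)
    if args.text:
        return ("text", args.text)
    return ("record", None)  # assumed default: record from the microphone


print(select_mode(["What time is it?"]))
print(select_mode(["--file", "recording.wav"]))
```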

Troubleshooting

Whisper.cpp Errors

"failed to read audio file"

Convert to WAV first: ffmpeg -i input.ogg -ar 16000 -ac 1 output.wav

"model not found"

Download model: bash models/download-ggml-model.sh tiny

Pocket-TTS Errors

"Connection refused"

Start TTS server: python3 -m app.main
Check URL: export POCKET_TTS_URL="http://localhost:5000"
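
Before retrying a request, it can help to confirm that something is listening at the configured address at all. The following sketch uses only a plain TCP connect; `tts_server_reachable` is a hypothetical helper, and a successful connect does not prove that the listener is actually Pocket-TTS.

```python
import socket
from urllib.parse import urlparse


def tts_server_reachable(url: str, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    host = parsed.hostname or "localhost"
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    print(tts_server_reachable("http://localhost:5000"))
```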

"Voice not found"

List voices: curl http://localhost:5000/v1/voices
Clone custom voice if needed

Performance Issues

Slow transcription

Use smaller model: tiny instead of small
Resample audio to 16 kHz mono, the format Whisper.cpp expects: ffmpeg -i input.wav -ar 16000 -ac 1 output.wav

Slow TTS

Use shorter text
Generate in background

Examples

See examples/ directory for:

morning-briefing.sh - Automated voice briefing
voice-reminder.sh - Voice-based reminders
conversation-mode.sh - Interactive voice chat

Performance

Model	RAM	Processing time (1 min audio)	Accuracy
tiny	500 MB	~30 sec	~90%
small	1 GB	~60 sec	~95%
medium	2 GB	~120 sec	~98%

Recommendation: Start with tiny, upgrade to small if needed.
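
The figures in the table can be turned into a rough planning aid. This sketch copies the approximate numbers from the table and assumes processing time scales linearly with audio length, which is an assumption and not a measured property; `candidates` and `estimate_seconds` are hypothetical helpers.

```python
from typing import List

# (model, approx. RAM in MB, approx. seconds to process 1 minute of audio)
MODELS = [
    ("tiny", 500, 30),
    ("small", 1000, 60),
    ("medium", 2000, 120),
]


def candidates(available_ram_mb: int) -> List[str]:
    """Models that fit in the given RAM, smallest first."""
    return [name for name, ram_mb, _ in MODELS if ram_mb <= available_ram_mb]


def estimate_seconds(model: str, audio_seconds: float) -> float:
    """Rough processing time, assuming linear scaling with audio length."""
    for name, _, per_minute in MODELS:
        if name == model:
            return audio_seconds / 60 * per_minute
    raise KeyError(f"unknown model: {model}")


print(candidates(1500))
print(estimate_seconds("small", 90))
```

Listing the smallest model first matches the recommendation to start with tiny and move up only if accuracy is insufficient.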

License

MIT License - See LICENSE file

Credits

Whisper.cpp by Georgi Gerganov (ggerganov/whisper.cpp)
Pocket-TTS by Kyutai Labs (kyutai-labs/pocket-tts)
OpenClaw by OpenClaw Team (openclaw/openclaw)

Support

GitHub Issues: [Your Repo Link]
OpenClaw Discord: https://discord.com/invite/clawd
Documentation: [Your Docs Link]

Usage Guidance

This package appears to do what it claims (local Whisper.cpp STT + Pocket‑TTS). Before installing:

1) Open config/voices.yaml and ensure openclaw.session_key is empty unless you intentionally provide a session token.
2) Keep tts.url set to http://localhost:5000 unless you trust a remote TTS server (changing it will send text/audio data off‑device).
3) Note that the skill creates cache files (~/.cache/voice-agent) and a log (~/.local/log/voice-agent.log) which may contain transcripts; delete or secure them if needed.
4) Run install.sh manually and review its file‑copy operations (it copies into ~/.openclaw/workspace/skills/voice-agent).
5) The install may clone and build whisper.cpp and download models from GitHub; review those third‑party repos if you need to.

If you want extra caution, run the skill in a sandboxed account or VM and verify network activity (ensure Pocket‑TTS runs on localhost) before granting broader access.

Capability Analysis

Type: OpenClaw Skill
Name: local-voice-agent
Version: 1.0.2

The local-voice-agent skill is a legitimate implementation of a voice-to-voice AI assistant using Whisper.cpp for speech-to-text and Pocket-TTS for text-to-speech. The bundle consists of shell scripts (bin/voice-agent.sh, bin/voice-to-text.sh) and Python wrappers (lib/stt.py, lib/tts.py) that coordinate audio recording via ffmpeg, transcription, and synthesis. While bin/voice-agent.sh uses 'eval' for tilde expansion in paths (a minor shell injection vulnerability if the config file is untrusted), the overall behavior is transparent, well-documented, and strictly aligned with the stated purpose of local audio processing without evidence of data exfiltration or malicious intent.
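
The risk with 'eval' is that any shell syntax in a configured path gets executed. As an illustration of the safer pattern (not a patch to the skill), tilde expansion can be done without invoking a shell at all; `expand_config_path` is a hypothetical helper around the standard `os.path.expanduser`.

```python
import os


def expand_config_path(raw: str) -> str:
    """Expand a leading ~ to the home directory without running a shell.

    Shell metacharacters such as ; or $(...) are left as literal text.
    """
    return os.path.expanduser(raw)


print(expand_config_path("~/.cache/voice-agent"))

# A hostile value stays a (strange) file name instead of running a command.
hostile = "~/models; touch /tmp/pwned"
print(expand_config_path(hostile))
```

In a shell script, a comparable effect is available through parameter expansion such as `${path/#\~/$HOME}`, which also avoids evaluating the value as code.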

Capability Assessment

✓ Purpose & Capability

Name/description (offline voice agent) match the code and runtime requirements: wrapper scripts call whisper-cli, ffmpeg and a local Pocket‑TTS HTTP server; install.sh installs/copies the skill into the OpenClaw workspace. Required binaries (whisper-cli, python3, ffmpeg) are appropriate for the stated purpose.

ℹ Instruction Scope

SKILL.md and the scripts stay within the stated voice pipeline (record → STT → AI → TTS → playback). They instruct cloning whisper.cpp and optionally running a Pocket‑TTS server. One runtime capability to note: the TTS client (lib/tts.py) POSTs to whatever url is configured (default localhost), so if you change config to a remote TTS endpoint the skill will send text to that server. The OpenClaw integration is a placeholder that reads an optional session_key from config but does not implement remote OpenClaw calls in the provided code.
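
A simple guard can flag a configuration that would send text off the device. The sketch below is an assumption about how such a check could look (`is_local_url` is a hypothetical helper); it treats only `localhost` and loopback IP addresses as local.

```python
import ipaddress
from urllib.parse import urlparse


def is_local_url(url: str) -> bool:
    """Return True only if the URL's host is localhost or a loopback address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other hostname is treated as remote.
        return False


for url in ("http://localhost:5000", "http://[::1]:5000", "https://tts.example.com"):
    print(url, is_local_url(url))
```

A caller could refuse to start, or ask for confirmation, when the configured TTS URL fails this check.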

✓ Install Mechanism

There is no automated package install from untrusted URLs; install.sh clones whisper.cpp from GitHub only on user approval and copies files into the user's OpenClaw workspace. No obscure download hosts, archive extraction, or arbitrary remote binaries are present in the package.

ℹ Credentials

The skill requests no environment variables or external credentials by default. It does read config/voices.yaml (which contains an optional openclaw.session_key field) and writes caches (~/.cache/voice-agent) and logs (~/.local/log/voice-agent.log). Because the TTS URL is configurable, changing it to a remote server would expose generated text and possibly user transcripts — keep TTS set to localhost to stay fully local.

✓ Persistence & Privilege

Skill is not always-enabled and does not request elevated/system-wide privileges. install.sh copies the skill into the user's OpenClaw workspace and suggests a PATH change, which is normal for a skill. It does not modify other skills' configurations or system authentication.

Version History

v1.0.2

- Removed CLAUDE.md and SECURITY.md files from the repository.
- No changes to functionality or documentation within SKILL.md.

v1.0.1

- Added FFmpeg ("ffmpeg") as a required binary dependency.
- Added multiple scripts and Python modules for voice-to-text, text-to-voice, and agent operations.
- Removed unnecessary package.json file.
- Updated installation instructions and requirements for Pocket-TTS server.
- Improved documentation with new SECURITY.md and CLAUDE.md files.

v1.0.0

- Initial release of local-voice-agent: a fully offline, voice-to-voice AI assistant for OpenClaw.
- Integrates Whisper.cpp for speech-to-text (STT) and Pocket-TTS for text-to-speech (TTS) with 100% local processing (no cloud APIs required).
- Provides scripts for one-shot and interactive voice operation, voice-to-text, and text-to-voice tasks.
- Features easy configuration, API endpoints for remote control, and guidance for installation and troubleshooting.
- Suitable for hands-free commands, accessibility, daily briefings, voice notes, and custom voice cloning.

Metadata

Slug local-voice-agent

Version 1.0.2

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 3

Frequently Asked Questions

What is Local Voice Agent?

Complete offline voice-to-voice AI assistant for OpenClaw (Whisper.cpp STT + Pocket-TTS). 100% local processing, no cloud APIs, no costs. Use for hands-free... It is an AI Agent Skill for Claude Code / OpenClaw, with 144 downloads so far.

How do I install Local Voice Agent?

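Run "/install local-voice-agent" in the OpenClaw or Claude Code chat to install the skill in one step. The prerequisites listed in the README (Whisper.cpp, a Pocket-TTS server, and FFmpeg) still need to be available on the machine.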

Is Local Voice Agent free?

Yes, Local Voice Agent is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Local Voice Agent support?

Local Voice Agent is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Local Voice Agent?

It is built and maintained by Pinological (@pinological); the current version is v1.0.2.
