Local Voice (FluidAudio TTS/STT)
Sub-second local voice AI for Apple Silicon Macs using FluidAudio's CoreML models.
Features
- TTS: Kokoro model with 54 voices, ~0.6-0.8s latency
- STT: Parakeet TDT v3, ~0.2-0.3s latency, 25 languages
- 100% local: No cloud, no cost, works offline
- Neural Engine: Runs on Apple's ANE for efficiency
Requirements
- macOS 14+ on Apple Silicon (M1/M2/M3/M4)
- Swift 5.9+
- espeak-ng (for TTS phoneme fallback)
Quick Setup
1. Install Dependencies
brew install espeak-ng
2. Build the Daemon
cd /path/to/skill/sources
swift build -c release
3. Install Binary and Framework
mkdir -p ~/clawd/bin
cp .build/release/StellaVoice ~/clawd/bin/
cp -R .build/arm64-apple-macosx/release/ESpeakNG.framework ~/clawd/bin/
install_name_tool -add_rpath @executable_path ~/clawd/bin/StellaVoice
4. Create LaunchAgent
mkdir -p ~/Library/LaunchAgents ~/.clawdbot/logs
# Unquoted EOF so the shell expands $HOME; launchd does not expand variables in plist strings.
cat > ~/Library/LaunchAgents/com.stella.tts.plist << EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.stella.tts</string>
    <key>ProgramArguments</key>
    <array>
        <string>$HOME/clawd/bin/StellaVoice</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardOutPath</key>
    <string>$HOME/.clawdbot/logs/stella-tts.log</string>
    <key>StandardErrorPath</key>
    <string>$HOME/.clawdbot/logs/stella-tts.err.log</string>
</dict>
</plist>
EOF
launchctl load ~/Library/LaunchAgents/com.stella.tts.plist
API Endpoints
The daemon listens on http://127.0.0.1:18790:
TTS - Text to Speech
# Simple text to WAV
curl -X POST http://127.0.0.1:18790/synthesize -d "Hello world" -o output.wav
# With speed control (0.5-2.0)
curl -X POST "http://127.0.0.1:18790/synthesize?speed=1.2" -d "Fast!" -o output.wav
# JSON endpoint
curl -X POST http://127.0.0.1:18790/synthesize/json \
-H "Content-Type: application/json" \
-d '{"text": "Hello", "speed": 1.0, "deEss": true}'
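For callers that build the JSON request in code, the following is a minimal sketch, not the daemon's own client. It assumes the field names shown in the curl example (text, speed, deEss) and clamps speed to the documented 0.5-2.0 range; the helper names clampSpeed and buildSynthesizeRequest are hypothetical.

```javascript
// Base URL taken from the API Endpoints section.
const BASE_URL = 'http://127.0.0.1:18790';

// Clamp a requested speed into the documented 0.5-2.0 range.
function clampSpeed(speed) {
  if (!Number.isFinite(speed)) return 1.0; // fall back to natural pace
  return Math.min(2.0, Math.max(0.5, speed));
}

// Build the URL and fetch options for POST /synthesize/json.
// Field names (text, speed, deEss) mirror the curl example.
function buildSynthesizeRequest(text, { speed = 1.0, deEss = true } = {}) {
  return {
    url: `${BASE_URL}/synthesize/json`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ text, speed: clampSpeed(speed), deEss }),
    },
  };
}
```

The result can be passed straight to fetch(url, options). The shape of the response from /synthesize/json is not documented here, so inspect it before relying on any particular field.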
STT - Speech to Text
curl -X POST http://127.0.0.1:18790/transcribe \
--data-binary @audio.wav \
-H "Content-Type: audio/wav"
# Returns: {"text": "transcribed text"}
Health Check
curl http://127.0.0.1:18790/health
# Returns: ok
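A client that starts alongside the daemon may want to wait until /health answers before sending audio. The sketch below polls the endpoint and assumes a healthy daemon replies with a 2xx status; the function name waitForDaemon, the retry count, and the delay are illustrative choices, and it relies on a runtime with a global fetch (for example Node.js 18+).

```javascript
// Poll GET /health until the daemon answers, or give up after `attempts` tries.
// fetchFn is injectable so the helper can be exercised without a running daemon.
async function waitForDaemon({
  url = 'http://127.0.0.1:18790/health',
  attempts = 20,
  delayMs = 500,
  fetchFn = fetch,
} = {}) {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetchFn(url);
      if (res.ok) return true; // daemon is up
    } catch {
      // Connection refused: the daemon is not listening yet, so retry.
    }
    if (i < attempts - 1) {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  return false;
}
```

Call it once at startup, for example `if (!(await waitForDaemon())) throw new Error('voice daemon not reachable')`.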
Voice Options
The default voice is af_sky. To change it, edit the source code and rebuild.
Top Kokoro voices (American female):
- af_heart (A grade) - warm, natural
- af_bella (A-) - expressive
- af_sky (C-) - clear, light
All 54 voices: See references/VOICES.md
Expressiveness
Speed Control
- speed=0.8 → Calm, relaxed
- speed=1.0 → Natural pace
- speed=1.2 → Energetic, upbeat
Punctuation (automatic)
- ! → Excited tone
- ? → Rising intonation
- . → Neutral, falling
- ... → Pauses
SSML Tags
<phoneme ph="kəkˈɔɹO">Kokoro</phoneme>
<sub alias="Doctor">Dr.</sub>
<say-as interpret-as="date">2024-01-15</say-as>
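When the text to speak is assembled in code, small helpers can keep these tags well-formed. This is a sketch under assumptions: the helper names (escapeXml, sub, phoneme, sayAs) are hypothetical, and it assumes the daemon accepts the tags inline in the request text and decodes standard XML entities, which this document does not confirm.

```javascript
// Escape characters that are special in XML text and attribute values.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// <sub alias="...">: speak `alias` in place of the written text.
const sub = (text, alias) =>
  `<sub alias="${escapeXml(alias)}">${escapeXml(text)}</sub>`;

// <phoneme ph="...">: force a specific phoneme string for a word.
const phoneme = (text, ph) =>
  `<phoneme ph="${escapeXml(ph)}">${escapeXml(text)}</phoneme>`;

// <say-as interpret-as="...">: hint how the text should be read (e.g. "date").
const sayAs = (text, interpretAs) =>
  `<say-as interpret-as="${escapeXml(interpretAs)}">${escapeXml(text)}</say-as>`;

// Example: compose a sentence that mixes plain text and tags.
const line = `${sub('Dr.', 'Doctor')} Smith arrives on ${sayAs('2024-01-15', 'date')}.`;
```

If the daemon turns out to read entities literally, drop the escaping and pass only trusted text to these helpers.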
Helper Script
See scripts/stella-tts.sh for a convenient wrapper:
scripts/stella-tts.sh "Hello world" output.wav
scripts/stella-tts.sh "Hello world" output.mp3 # Auto-converts
Integration Example
For voice assistants, update your voice proxy to use local endpoints:
// STT
const response = await fetch('http://127.0.0.1:18790/transcribe', {
  method: 'POST',
  headers: { 'Content-Type': 'audio/wav' },
  body: audioData
});
const { text } = await response.json();

// TTS
const audio = await fetch('http://127.0.0.1:18790/synthesize', {
  method: 'POST',
  body: textToSpeak
});
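In a long-running proxy it usually helps to surface HTTP failures and to read the WAV bytes explicitly. The following sketch wraps the two calls above; the function names transcribe and synthesize are hypothetical, the injectable fetchFn exists only to make the helpers testable, and it assumes /synthesize returns raw WAV bytes, as the curl examples that write output.wav suggest.

```javascript
const BASE_URL = 'http://127.0.0.1:18790';

// POST audio bytes to /transcribe and return the recognized text.
async function transcribe(audioData, fetchFn = fetch) {
  const res = await fetchFn(`${BASE_URL}/transcribe`, {
    method: 'POST',
    headers: { 'Content-Type': 'audio/wav' },
    body: audioData,
  });
  if (!res.ok) throw new Error(`transcribe failed: HTTP ${res.status}`);
  const { text } = await res.json();
  return text;
}

// POST text to /synthesize and return the response body as bytes.
async function synthesize(textToSpeak, fetchFn = fetch) {
  const res = await fetchFn(`${BASE_URL}/synthesize`, {
    method: 'POST',
    body: textToSpeak,
  });
  if (!res.ok) throw new Error(`synthesize failed: HTTP ${res.status}`);
  return new Uint8Array(await res.arrayBuffer());
}
```

A round trip then reads as `const reply = await synthesize(await transcribe(audioData))`, with any non-2xx response raised as an error instead of being parsed.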
Troubleshooting
Library not loaded (ESpeakNG)
- Ensure ESpeakNG.framework is in the same directory as the binary
- Run install_name_tool -add_rpath @executable_path /path/to/binary
Slow first request
- First request loads models (~8-10s)
- Subsequent requests are sub-second
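One way to hide the model-loading delay from users is to send a throwaway request when the client starts. This is a sketch: the warmUp name and the 30-second timeout are assumptions, and the source does not say whether a TTS request also loads the STT model, so a separate /transcribe warm-up may be needed.

```javascript
// Send one throwaway synthesis request so models are loaded before real traffic.
// Returns the elapsed time in milliseconds.
async function warmUp(fetchFn = fetch, url = 'http://127.0.0.1:18790/synthesize') {
  const started = Date.now();
  const res = await fetchFn(url, {
    method: 'POST',
    body: 'Warm up.',
    // Generous limit (an assumption) to cover the ~8-10s first-load time.
    signal: AbortSignal.timeout(30_000),
  });
  if (!res.ok) throw new Error(`warm-up failed: HTTP ${res.status}`);
  await res.arrayBuffer(); // drain the body; the audio itself is discarded
  return Date.now() - started;
}
```

Logging the returned duration on startup also gives a quick check that later requests really are sub-second.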
x86 vs ARM
- Must build and run on ARM64 native (not Rosetta)
- Check with uname -m (should show arm64)
Source Code
The daemon source is in the sources/ directory. It is a Swift package using:
- FluidAudio (TTS + STT models)
- Hummingbird (HTTP server)
Rebuild after modifying:
cd sources && swift build -c release
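- Make sure OpenClaw is installed (locally or via Docker).
- Enter the install command in the chat box: /install local-voice
- Once installed, invoke the skill by name or trigger it with /local-voice
- Provide the inputs described in the skill's parameter notes to get structured output.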
What is Local Voice (FluidAudio TTS/STT)?
Local text-to-speech (TTS) and speech-to-text (STT) using FluidAudio on Apple Silicon. Sub-second voice synthesis and transcription running entirely on-device via the Apple Neural Engine. Use it when setting up local voice capabilities, integrating a voice assistant, or replacing cloud TTS/STT services. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 1575 downloads to date.
How do I install Local Voice (FluidAudio TTS/STT)?
Run the command /install local-voice in the OpenClaw or Claude Code chat box to install the skill in one step. The skill itself needs no extra configuration; building and launching the daemon is covered under Quick Setup.
Is Local Voice (FluidAudio TTS/STT) free?
Yes. Local Voice (FluidAudio TTS/STT) is completely free and open source, and can be downloaded, installed, and used freely.
Which platforms does Local Voice (FluidAudio TTS/STT) support?
The skill can be installed in any environment running OpenClaw / Claude Code, but the voice daemon itself requires macOS 14+ on Apple Silicon (see Requirements).
Who develops Local Voice (FluidAudio TTS/STT)?
It is developed and maintained by Trond Wuellner (@trondw). The current version is v1.0.1.