Description

Local OPUS/Ogg voice-reply pipeline for Feishu/Discord with structured voice customization. Default voice is Juno (`voice/juno_ref.wav`), with support for re...

Usage (SKILL.md)

Local Voice Reply

Name: local-voice-reply
Author: tenured-master-chef-607

Use this skill to turn text into a cloned/custom-voice audio reply and deliver it reliably to Feishu or Discord.

Structured skill definition

Purpose: local low-latency voice replies in Opus/Ogg.
Channels: Feishu + Discord.
Default voice: juno (reference file: voice/juno_ref.wav).
Custom voice modes:
1. File-based: replace/update voice/juno_ref.wav.
2. Registry-based: upload/register voices via POST /voice/register, then call by voice_name.
Output: .opus (Ogg container) under .openclaw/media/outbound/voice-server-v3/ (or TARVIS_VOICE_OUTPUT_DIR).
Control scripts:
- scripts/send_voice_reply.ps1 (server API path)
- scripts/generate_cuda_voice.ps1 (stable local CUDA generation path)

Server implementation is kept with the skill (not workspace root):

server/voice_server_v3.py (FastAPI routes)
server/voice_engine.py (generation and cache engine)

Voice assets are also colocated with the skill:

voice/

Runtime requirements

ffmpeg must be installed and available on PATH (required for Opus encoding).
Python packages required by the server:
- fastapi
- uvicorn
- python-multipart
- chatterbox-tts
- torch
- torchaudio
- numpy
On first startup, ChatterboxTTS.from_pretrained() may download model assets, so the initial run can require network access and additional disk space.
Optional env vars:
- TARVIS_VOICE_OUTPUT_DIR to override where generated Opus files are written.
- TARVIS_VOICE_DEVICE to force device selection (cuda/gpu, mps, or cpu).
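
The requirements above can be checked before starting the server. The following is a minimal preflight sketch, not the skill's own code. The import names (`multipart` for python-multipart, `chatterbox` for chatterbox-tts), the helper names, and the alias table are assumptions made for illustration.

```python
import importlib.util
import os
import shutil

# pip name -> import name. The import names for python-multipart and
# chatterbox-tts are assumptions; adjust them to your installed versions.
REQUIRED_MODULES = {
    "fastapi": "fastapi",
    "uvicorn": "uvicorn",
    "python-multipart": "multipart",
    "chatterbox-tts": "chatterbox",
    "torch": "torch",
    "torchaudio": "torchaudio",
    "numpy": "numpy",
}

# Assumed reading of "cuda/gpu, mps, or cpu": "gpu" is treated as an alias of "cuda".
DEVICE_ALIASES = {"cuda": "cuda", "gpu": "cuda", "mps": "mps", "cpu": "cpu"}


def missing_modules(import_names):
    """Return the import names that cannot be located (nothing is imported)."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


def ffmpeg_available():
    """True when an ffmpeg executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


def resolve_device(env=None):
    """Map TARVIS_VOICE_DEVICE to a device string, or None when it is unset."""
    env = os.environ if env is None else env
    raw = env.get("TARVIS_VOICE_DEVICE", "").strip().lower()
    if not raw:
        return None  # leave the choice to the server's own default
    if raw not in DEVICE_ALIASES:
        raise ValueError(f"unsupported TARVIS_VOICE_DEVICE: {raw!r}")
    return DEVICE_ALIASES[raw]


if __name__ == "__main__":
    print("ffmpeg on PATH:", ffmpeg_available())
    print("missing modules:", missing_modules(REQUIRED_MODULES.values()))
    print("forced device:", resolve_device())
```

Running the file directly prints a short report; an empty "missing modules" list and `ffmpeg on PATH: True` suggest the server should at least be able to start.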

Persistence behavior

Uploaded voice samples from POST /voice/register are persisted under server/voices/.
Cache and registry data are persisted under server/voice_cache/.
Generated Opus outputs are written under .openclaw/media/outbound/voice-server-v3/ by default (or TARVIS_VOICE_OUTPUT_DIR when set).
POST /output/cleanup only deletes staged .opus files inside the configured output directory and their .json sidecar files.
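
To make the cleanup scope concrete, here is a hedged sketch of how an output directory could be resolved and cleaned so that only staged .opus files and their .json sidecars are removed. It is not the server's implementation: the function names, the home-relative default directory, and the sidecar naming (`name.opus` paired with `name.json`) are assumptions.

```python
import os
from pathlib import Path

# Assumed default location, relative to the user's home directory.
DEFAULT_OUTPUT_DIR = Path.home() / ".openclaw" / "media" / "outbound" / "voice-server-v3"


def resolve_output_dir(env=None):
    """Use TARVIS_VOICE_OUTPUT_DIR when set, otherwise the assumed default."""
    env = os.environ if env is None else env
    override = env.get("TARVIS_VOICE_OUTPUT_DIR", "").strip()
    return Path(override) if override else DEFAULT_OUTPUT_DIR


def cleanup_outputs(output_dir):
    """Delete *.opus files directly inside output_dir plus matching .json sidecars.

    Returns the names of the files that were removed.
    """
    root = Path(output_dir).resolve()
    removed = []
    for opus in sorted(root.glob("*.opus")):
        try:
            # Skip entries (e.g. symlinks) whose real location is outside root.
            opus.resolve().relative_to(root)
        except ValueError:
            continue
        sidecar = opus.with_suffix(".json")  # sidecar naming is an assumption
        for path in (opus, sidecar):
            if path.is_file():
                path.unlink()
                removed.append(path.name)
    return removed
```

Files with other extensions, and anything outside the resolved directory, are left untouched by this sketch.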

Use this workflow

Ensure local v3.3 TTS server is running from this skill folder:
- python -m uvicorn --app-dir server voice_server_v3:app --host 127.0.0.1 --port 8000
Call /speak with text (and optional speed, exaggeration, cfg).
- voice_name defaults to juno.
Receive Opus directly from server (audio/ogg) in Juno voice.
Save final media into allowed path:
- C:\Users\hanli\.openclaw\media\outbound\
Send with message tool:
- action=send
- filePath=<allowed-path>
- asVoice=true
- For Feishu: channel=feishu
- For Discord: channel=discord
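
The first steps of this workflow (call /speak, receive Opus, save it under the outbound folder) could be scripted roughly as below. This is a sketch under stated assumptions: the source does not say whether /speak takes a JSON body, form fields, or query parameters, so the JSON POST shape, the helper names, and the `reply.opus` file name are all hypothetical. Only the endpoint path, host, port, parameter names, and defaults come from the text above.

```python
import json
import urllib.request
from pathlib import Path

BASE_URL = "http://127.0.0.1:8000"  # host and port from the uvicorn command above


def build_speak_request(text, voice_name="juno", speed=1.2, exaggeration=None, cfg=None):
    """Build (but do not send) a POST /speak request. The JSON body is an assumption."""
    payload = {"text": text, "voice_name": voice_name, "speed": speed}
    if exaggeration is not None:
        payload["exaggeration"] = exaggeration
    if cfg is not None:
        payload["cfg"] = cfg
    return urllib.request.Request(
        f"{BASE_URL}/speak",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_reply(audio_bytes, outbound_dir, name="reply.opus"):
    """Write the returned Opus bytes into the allowed outbound directory."""
    out = Path(outbound_dir) / name
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_bytes(audio_bytes)
    return out


def speak_to_file(text, outbound_dir, timeout_sec=120):
    """Send the request to a running server and save the audio/ogg response."""
    with urllib.request.urlopen(build_speak_request(text), timeout=timeout_sec) as resp:
        return save_reply(resp.read(), outbound_dir)
```

The path returned by `speak_to_file` is what would be passed as filePath to the message tool in the final step.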

Voice customization guide

A) Replace default Juno reference

Replace voice/juno_ref.wav with your target reference voice sample.
Keep sample clean (single speaker, low noise, clear pronunciation).
Restart server and test with voice_name=juno.

B) Register additional named voices

Call POST /voice/register with a reference sample and target voice_name.
Confirm registration under server/voices/.
Generate with that voice_name in /speak or /speak_stream.
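
A registration call could look something like the sketch below. It assumes the endpoint accepts a multipart/form-data upload (python-multipart is among the requirements); the form field names `voice_name` and `file`, the default file name, and the helper name are hypothetical, so check server/voice_server_v3.py for the real ones.

```python
import urllib.request
import uuid


def build_register_request(voice_name, wav_bytes, filename="ref.wav",
                           base_url="http://127.0.0.1:8000"):
    """Build (but do not send) a multipart POST /voice/register request.

    Field names "voice_name" and "file" are assumptions, not the confirmed API.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="voice_name"\r\n\r\n'
        f"{voice_name}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/voice/register",
        data=head + wav_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

Against a running server, the request would be sent with `urllib.request.urlopen(build_register_request("myvoice", Path("sample.wav").read_bytes()))`, where `myvoice` and `sample.wav` are placeholder names.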

Defaults

voice_name: juno
speed: 1.2
Output format: Opus in Ogg container from server /speak (no post-conversion)
Discord compatibility: Ogg/Opus is supported and can be sent as voice/audio with asVoice=true

Speed Improvements In This Version

Caches model capability lookups once at startup.
Uses torch.inference_mode() during synthesis to reduce overhead.
Reuses phrase cache for both /speak and /speak_stream.
Improves chunking behavior for long CJK text to avoid oversized chunks.
Keeps latency metrics for benchmarking and tuning.
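
As an illustration of the CJK chunking point, the sketch below splits text after sentence-ending punctuation and hard-wraps any sentence that is still too long, so no chunk exceeds a character budget. It is one plausible approach, not the engine's actual algorithm; the punctuation set, the 60-character default, and the function name are assumptions.

```python
import re

# Zero-width split point right after CJK or Latin sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[。！？；.!?;])")


def chunk_text(text, max_chars=60):
    """Split text into chunks of at most max_chars characters.

    Sentences are packed together while they fit; a sentence longer than
    max_chars (common in CJK text without spaces) is cut at fixed width.
    """
    chunks = []
    current = ""
    for sentence in _SENTENCE_END.split(text.strip()):
        # Fixed-width pieces; an empty sentence yields no pieces at all.
        pieces = [sentence[i:i + max_chars] for i in range(0, len(sentence), max_chars)]
        for piece in pieces:
            if current and len(current) + len(piece) > max_chars:
                chunks.append(current)
                current = ""
            current += piece
    if current:
        chunks.append(current)
    return chunks
```

Joining the returned chunks reproduces the stripped input, so no text is dropped between synthesis calls.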

Common failure and fix

Error: LocalMediaAccessError ... path-not-allowed
Fix: copy the file into .openclaw/media/outbound before sending.
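
The fix can be automated with a small staging helper such as the sketch below, which copies a generated file into the outbound directory unless it already lives there. The function name is hypothetical, and the outbound directory is passed in by the caller rather than hardcoded.

```python
import shutil
from pathlib import Path


def stage_for_sending(src, outbound_dir):
    """Return a path inside outbound_dir for src, copying the file if needed."""
    src = Path(src).resolve()
    outbound = Path(outbound_dir).resolve()
    if outbound in src.parents:
        return src  # already under the allowed directory
    outbound.mkdir(parents=True, exist_ok=True)
    # copy2 also preserves timestamps; it returns the destination path.
    return Path(shutil.copy2(src, outbound / src.name))
```

The returned path is the one to pass as filePath when sending.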

Script

Use scripts/send_voice_reply.ps1 to generate Opus directly with defaults (voice_name=juno, speed=1.2). It auto-selects /speak_stream for longer text (or when -Stream is passed) for better throughput.

For stable CUDA generation command patterns under stricter exec approval policies, use:

scripts/generate_cuda_voice.ps1 -Text "..."

This keeps the outer command shape fixed so allow-always is more reusable.

Security recommendations

This skill appears to be what it claims: a local TTS server that produces Opus/Ogg outputs. Before installing, be aware:

1. You must have ffmpeg on PATH and install heavyweight Python deps (torch, torchaudio, chatterbox-tts); initial startup may download large model files and use significant disk and GPU/CPU resources.
2. Uploaded voice samples and generated audio are persisted locally under the skill's folders and by default in ~/.openclaw/media/outbound; only register voice samples you trust.
3. SKILL.md mentions helper scripts that are not present in the bundle; confirm whether those scripts are provided separately or replaced by your own invocation.
4. The service can read files referenced by its manifest if that file is edited, so avoid placing sensitive files under the skill's voice/manifest paths.

If you need network isolation, prevent ChatterboxTTS.from_pretrained() from downloading by pre-providing model artifacts or blocking outbound network during startup.

Analysis

Type: OpenClaw Skill
Name: local-voice-reply
Version: 3.3.3

The skill bundle provides a legitimate local text-to-speech (TTS) pipeline using FastAPI, ChatterboxTTS, and ffmpeg. The code in 'server/voice_engine.py' and 'server/voice_server_v3.py' is well-structured and includes security best practices such as path traversal guards (using .relative_to() checks) and safe subprocess execution (using argument lists instead of shell strings). While 'SKILL.md' contains a hardcoded local path ('C:\Users\hanli\...') in its instructions to the agent, this appears to be a developer artifact rather than a malicious injection. The high-risk capabilities (file system access and subprocess execution) are strictly aligned with the stated purpose of generating and managing audio files.
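
For readers unfamiliar with the "argument lists instead of shell strings" practice, the sketch below shows the general pattern for an ffmpeg Opus encode. It is illustrative only and not taken from voice_engine.py; the function names, the 32k bitrate, and the 60-second timeout are assumptions.

```python
import subprocess
from pathlib import Path


def opus_encode_command(wav_path, opus_path, bitrate="32k"):
    """Build the ffmpeg argv as a list, so paths are never parsed by a shell."""
    return [
        "ffmpeg", "-y",          # overwrite the output file if it exists
        "-i", str(wav_path),     # input audio
        "-c:a", "libopus",       # encode with the Opus codec
        "-b:a", bitrate,         # target audio bitrate (assumed value)
        str(opus_path),          # a .opus extension yields an Ogg container
    ]


def encode_to_opus(wav_path, opus_path, timeout_sec=60):
    """Run ffmpeg without shell=True; raises on a non-zero exit or a timeout."""
    subprocess.run(
        opus_encode_command(wav_path, opus_path),
        check=True,
        capture_output=True,
        timeout=timeout_sec,
    )
    return Path(opus_path)
```

Because each argument is a separate list element, a file name containing spaces or shell metacharacters reaches ffmpeg as a single literal argument.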

Capability assessment

✓ Purpose & Capability

Name/description (local OPUS/Ogg voice replies for Feishu/Discord) aligns with included FastAPI server and TTS engine. Required tools (ffmpeg, Python libraries including torch/torchaudio/chatterbox-tts) are proportional to the stated functionality.

ℹ Instruction Scope

SKILL.md instructs running the local uvicorn server and calling /speak endpoints, saving outputs under .openclaw/media/outbound; those instructions are consistent with code. One small mismatch: SKILL.md references control scripts (scripts/send_voice_reply.ps1 and scripts/generate_cuda_voice.ps1) that are not present in the file manifest—this may be an omission or packaging error. The skill persists uploaded voices and cache data under its own server folders and writes outputs into the user's .openclaw media dir (or TARVIS_VOICE_OUTPUT_DIR).

ℹ Install Mechanism

There is no install spec (instruction-only) and the server code is bundled with the skill, so install risk is low. However, the runtime requires large Python packages (torch/torchaudio/chatterbox-tts) and ffmpeg; ChatterboxTTS.from_pretrained() may download model artifacts over the network on first run, which is expected but can be large.

✓ Credentials

No required credentials or secret env vars. Optional env vars (TARVIS_VOICE_OUTPUT_DIR, TARVIS_VOICE_DEVICE, TARVIS_VOICE_FFMPEG_TIMEOUT_SEC, TARVIS_VOICE_PHRASE_RAM_CACHE_ITEMS) are relevant to operation and proportionate. The code reads only these environment variables (plus standard log-level).

✓ Persistence & Privilege

The skill persists uploaded voice samples under server/voices/ and cache data under server/voice_cache/, and writes generated .opus files to the configured output directory (default: ~/.openclaw/media/outbound/voice-server-v3). It does not request always:true or global privileges; persistence is limited to its own directories and the configured output path.

Version history

v3.3.3

docs: structured skill definition + explicit voice customization guide (juno_ref replacement and /voice/register flow)

v3.3.2

docs: rewrite skill description; clarify local low-latency voice customization via juno_ref/registered voices

v1.0.2

docs: clarify juno_ref.wav voice customization support in skill description

v3.3.1

Add MIT LICENSE and keep v3.3 voice-server structure.

v3.3.0

v3.3: split voice engine module, improved caching + inference mode, added CUDA generation script, and refreshed docs.

v1.0.1

- Updated skill description for clarity and user focus.
- Clarified that OPUS files are generated to match Feishu audio requirements.
- No functional or workflow changes; documentation improvements only.

v1.0.0

Summary: Initial release with an optimized workflow for generating and sending cloned-voice Feishu audio replies using a local TTS server.
- Introduces local Chatterbox TTS server integration with direct Opus output and structured FastAPI endpoints.
- Implements efficient caching for model lookups and repeated phrase synthesis, reducing response times.
- Enhances handling of long CJK text via improved chunking and stream support.
- Provides a PowerShell script for automatic voice reply generation and selection of streaming for long texts.
- Voice files are stored in the skill directory; correct media paths are enforced for Feishu compatibility.

Metadata

Slug local-voice-reply

Version 3.3.3

License MIT-0

Total installs 2

Current installs 2

Version count 7

FAQ