
Pocket TTS Complete Documentation

Author: leonaaardob · GitHub · v0.1.0
cross-platform ⚠ suspicious
Total downloads: 1031
Favorites: 0
Current installs: 0
Versions: 1
Install in OpenClaw
/install lb-pocket-tts-skill
Description
Generate speech from text using Kyutai Pocket TTS - lightweight, CPU-friendly, streaming TTS with voice cloning. English only. ~6x real-time on M4 MacBook Air.
Usage instructions (SKILL.md)

Pocket TTS

Lightweight CPU-friendly text-to-speech with voice cloning. No GPU required.

When to Use

  • Generating speech from text on CPU without GPU
  • Voice cloning from audio samples
  • Streaming audio generation (low latency)
  • Local TTS without API dependencies
  • Real-time speech synthesis (~6x faster than real-time)

Key Features

  • 100M parameters - Small, efficient model
  • CPU-optimized - No GPU needed, uses only 2 cores
  • ~6x real-time - Fast generation on modern CPUs
  • ~200ms latency - To first audio chunk (streaming)
  • Voice cloning - From 3-10s audio samples
  • 24kHz mono WAV - High-quality output
  • English only - More languages planned
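
To make the speed figures above concrete, here is a back-of-the-envelope sketch. It treats "~6x real-time" and "~200ms" as fixed constants, which is a simplification since both depend on hardware; the function names are illustrative and not part of pocket-tts.

```python
REALTIME_FACTOR = 6.0        # "~6x real-time"; approximate and hardware dependent
FIRST_CHUNK_LATENCY_S = 0.2  # "~200ms" to the first streamed chunk


def estimate_generation_seconds(audio_seconds: float,
                                realtime_factor: float = REALTIME_FACTOR) -> float:
    """Rough wall-clock time needed to synthesize `audio_seconds` of speech."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


def estimate_first_audio_seconds(streaming: bool, audio_seconds: float) -> float:
    """Time until playback can start.

    With streaming, playback can begin after the first chunk; without it,
    the whole clip has to be generated first.
    """
    if streaming:
        return FIRST_CHUNK_LATENCY_S
    return estimate_generation_seconds(audio_seconds)


print(estimate_generation_seconds(60.0))          # 10.0
print(estimate_first_audio_seconds(False, 30.0))  # 5.0
print(estimate_first_audio_seconds(True, 30.0))   # 0.2
```

Under these assumptions, one minute of speech takes roughly ten seconds to generate, and streaming matters most for long inputs.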

Installation

pip install pocket-tts
# or
uv add pocket-tts

CLI Commands

Generate Speech

# Basic generation (default voice)
pocket-tts generate --text "Hello world"

# Custom voice (local file, URL, or safetensors)
pocket-tts generate --voice ./my_voice.wav
pocket-tts generate --voice "hf://kyutai/tts-voices/alba-mackenna/casual.wav"
pocket-tts generate --voice ./voice.safetensors

# Quality tuning
pocket-tts generate --temperature 0.7 --lsd-decode-steps 3

See docs/generate.md for full CLI reference.
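
When calling the CLI from a script, it can help to assemble the arguments as a list rather than a shell string. The sketch below uses only the flags shown above (`--text`, `--voice`, `--temperature`, `--lsd-decode-steps`); the helper name `build_generate_command` is hypothetical, and other flags may exist in docs/generate.md.

```python
import shlex
from typing import List, Optional


def build_generate_command(
    text: str,
    voice: Optional[str] = None,
    temperature: Optional[float] = None,
    lsd_decode_steps: Optional[int] = None,
) -> List[str]:
    """Assemble an argument list for `pocket-tts generate`."""
    cmd = ["pocket-tts", "generate", "--text", text]
    if voice is not None:
        cmd += ["--voice", voice]
    if temperature is not None:
        cmd += ["--temperature", str(temperature)]
    if lsd_decode_steps is not None:
        cmd += ["--lsd-decode-steps", str(lsd_decode_steps)]
    return cmd


cmd = build_generate_command("Hello world", voice="./my_voice.wav", temperature=0.7)
print(shlex.join(cmd))  # shell-quoted form, handy for logging

# To actually run it (requires pocket-tts to be installed and on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems when the text contains spaces or quotes.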

Start Web Server

# Start FastAPI server with web UI
pocket-tts serve

# Custom host/port
pocket-tts serve --host localhost --port 8080

See docs/serve.md for server options.
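
Because `serve` opens an HTTP endpoint, a launcher script may want to refuse non-loopback hosts unless explicitly told otherwise. The following is a minimal sketch of such a guard using the `--host` and `--port` flags shown above; `is_loopback_host`, `build_serve_command`, and the `allow_public` switch are hypothetical names for this example.

```python
import ipaddress
from typing import List


def is_loopback_host(host: str) -> bool:
    """Return True for 'localhost' or a loopback IP literal (127.0.0.1, ::1, ...)."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example a DNS name): treat as non-loopback.
        return False


def build_serve_command(host: str, port: int, allow_public: bool = False) -> List[str]:
    """Assemble `pocket-tts serve` arguments, rejecting exposed hosts by default."""
    if not allow_public and not is_loopback_host(host):
        raise ValueError(f"refusing to bind to non-loopback host {host!r}")
    return ["pocket-tts", "serve", "--host", host, "--port", str(port)]


print(build_serve_command("localhost", 8080))
```

Binding to `0.0.0.0` would then require passing `allow_public=True` deliberately.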

Export Voice Embeddings

Convert audio files to .safetensors for faster loading:

# Single file
pocket-tts export-voice voice.mp3 voice.safetensors

# Batch conversion
pocket-tts export-voice voices/ embeddings/ --truncate

See docs/export_voice.md for export options.
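
If you prefer to export voices one file at a time (for example, to log or retry individual failures), the per-file commands can be planned in Python. This sketch only builds the argument lists in the single-file form shown above; the suffix set and helper names are assumptions for illustration.

```python
from pathlib import Path
from typing import List, Tuple

AUDIO_SUFFIXES = {".wav", ".mp3"}  # assumption: extend with other formats as needed


def plan_exports(src_dir: Path, dst_dir: Path) -> List[Tuple[Path, Path]]:
    """Pair each audio file in src_dir with a .safetensors target in dst_dir."""
    pairs = []
    for src in sorted(src_dir.iterdir()):
        if src.is_file() and src.suffix.lower() in AUDIO_SUFFIXES:
            pairs.append((src, dst_dir / (src.stem + ".safetensors")))
    return pairs


def export_commands(src_dir: Path, dst_dir: Path) -> List[List[str]]:
    """One `pocket-tts export-voice <in> <out>` argument list per audio file."""
    return [
        ["pocket-tts", "export-voice", str(src), str(dst)]
        for src, dst in plan_exports(src_dir, dst_dir)
    ]
```

Each returned list can be passed to `subprocess.run(..., check=True)` once pocket-tts is installed.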


Python API

Basic Usage

from pocket_tts import TTSModel
import scipy.io.wavfile

# Load model
model = TTSModel.load_model()

# Get voice state
voice = model.get_state_for_audio_prompt(
    "hf://kyutai/tts-voices/alba-mackenna/casual.wav"
)

# Generate audio
audio = model.generate_audio(voice, "Hello world!")

# Save
scipy.io.wavfile.write("output.wav", model.sample_rate, audio.numpy())

Load Model

model = TTSModel.load_model(
    config="b6369a24",       # Model variant
    temp=0.7,                # Temperature (0.5-1.0)
    lsd_decode_steps=1,      # Generation steps (1-5)
    eos_threshold=-4.0       # End-of-sequence threshold
)
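
The ranges in the comments above can be checked before the model is loaded. Here is a small sketch, assuming those ranges are recommendations rather than limits enforced by the library; `LoadSettings` is a hypothetical helper, not part of pocket_tts.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LoadSettings:
    """Generation settings with the ranges quoted in the comments above."""

    temp: float = 0.7
    lsd_decode_steps: int = 1
    eos_threshold: float = -4.0

    def __post_init__(self) -> None:
        if not 0.5 <= self.temp <= 1.0:
            raise ValueError(f"temp {self.temp} is outside 0.5-1.0")
        if not 1 <= self.lsd_decode_steps <= 5:
            raise ValueError(f"lsd_decode_steps {self.lsd_decode_steps} is outside 1-5")


settings = LoadSettings(temp=0.8, lsd_decode_steps=2)
print(asdict(settings))

# Intended use, with the keyword names taken from the call above:
#   model = TTSModel.load_model(**asdict(settings))
```

Validating early gives a clear error message instead of a surprising result at generation time.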

Voice State

# From audio file/URL
voice = model.get_state_for_audio_prompt("./voice.wav")
voice = model.get_state_for_audio_prompt("hf://kyutai/tts-voices/alba-mackenna/casual.wav")

# From safetensors (fast loading)
voice = model.get_state_for_audio_prompt("./voice.safetensors")

Streaming Generation

# Stream audio chunks
for chunk in model.generate_audio_stream(voice, "Long text..."):
    # Process/save/play each chunk as generated
    print(f"Chunk: {chunk.shape[0]} samples")
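
One way to consume the stream is to append each chunk to a WAV file as it arrives, using only the standard library. This is a sketch under two assumptions that are not stated above: each chunk is a 1-D sequence of float samples in [-1.0, 1.0], and it offers `.tolist()` (as tensors and NumPy arrays do) or is a plain list. The helper names are hypothetical.

```python
import struct
import wave
from typing import Iterable


def floats_to_pcm16(samples) -> bytes:
    """Convert float samples in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    values = [int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples]
    return struct.pack("<%dh" % len(values), *values)


def write_stream_to_wav(chunks: Iterable, path: str, sample_rate: int = 24_000) -> int:
    """Write chunks to `path` as they arrive; return the number of samples written."""
    total = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit
        wav.setframerate(sample_rate)
        for chunk in chunks:
            # Tensors and arrays expose .tolist(); plain lists pass through.
            samples = chunk.tolist() if hasattr(chunk, "tolist") else list(chunk)
            wav.writeframes(floats_to_pcm16(samples))
            total += len(samples)
    return total


# Intended use with the streaming call above:
#   write_stream_to_wav(model.generate_audio_stream(voice, "Long text..."),
#                       "output.wav", model.sample_rate)
```

For real-time playback you would hand each chunk to an audio device instead of a file, but the loop structure stays the same.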

Multi-Voice Management

# Preload multiple voices
voices = {
    "casual": model.get_state_for_audio_prompt("hf://kyutai/tts-voices/alba-mackenna/casual.wav"),
    "announcer": model.get_state_for_audio_prompt("./announcer.safetensors"),
}

# Use different voices
audio1 = model.generate_audio(voices["casual"], "Hey there!")
audio2 = model.generate_audio(voices["announcer"], "Breaking news!")

See docs/python-api.md for complete API reference.


Available Voices

Pre-made voices from hf://kyutai/tts-voices/:

  • alba-mackenna/casual.wav (default, female)
  • jessica-jian/casual.wav (female)
  • voice-donations/Selfie.wav (male, marius)
  • voice-donations/Butter.wav (male, javert)
  • ears/p010/freeform_speech_01.wav (male, jean)
  • vctk/p244_023.wav (female, fantine)
  • vctk/p262_023.wav (female, eponine)
  • vctk/p303_023.wav (female, azelma)

Or clone any voice from your own audio samples.
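
For scripts that switch between the pre-made voices, a small lookup table keeps the `hf://` paths in one place. The paths come from the list above; the keys "alba" and "jessica" are labels chosen for this sketch (the list gives no short name for those two), and `voice_url` is a hypothetical helper.

```python
VOICE_ROOT = "hf://kyutai/tts-voices/"

PREMADE_VOICES = {
    "alba": "alba-mackenna/casual.wav",     # default voice; key chosen for this sketch
    "jessica": "jessica-jian/casual.wav",   # key chosen for this sketch
    "marius": "voice-donations/Selfie.wav",
    "javert": "voice-donations/Butter.wav",
    "jean": "ears/p010/freeform_speech_01.wav",
    "fantine": "vctk/p244_023.wav",
    "eponine": "vctk/p262_023.wav",
    "azelma": "vctk/p303_023.wav",
}


def voice_url(name: str) -> str:
    """Return the full hf:// URL for a pre-made voice name."""
    try:
        return VOICE_ROOT + PREMADE_VOICES[name]
    except KeyError:
        raise ValueError(
            f"unknown voice {name!r}; choose from {sorted(PREMADE_VOICES)}"
        ) from None


print(voice_url("marius"))  # hf://kyutai/tts-voices/voice-donations/Selfie.wav
```

The resulting string can be passed to `--voice` or to `model.get_state_for_audio_prompt(...)`.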


Voice Cloning Tips

  • Clean audio - Remove background noise (use Adobe Podcast Enhance)
  • Length - 3-10 seconds of speech is ideal
  • Quality - Input quality affects output quality
  • Format - WAV, MP3, or any common audio format supported

Performance Tips

  • CPU-only - GPU provides no speedup (model too small, batch size 1)
  • 2 cores - Uses only 2 CPU cores efficiently
  • Streaming - Low latency (<200ms to first chunk)
  • Safetensors - Pre-process voices to .safetensors for instant loading

Output Format

All commands output WAV files:

  • Sample rate: 24 kHz
  • Channels: Mono
  • Bit depth: 16-bit PCM
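
These three parameters fix the data rate, which makes output size easy to estimate. The sketch below assumes a canonical 44-byte PCM WAV header; files with extra metadata chunks will be slightly larger, and the function names are illustrative.

```python
SAMPLE_RATE_HZ = 24_000
CHANNELS = 1
BYTES_PER_SAMPLE = 2     # 16-bit PCM
WAV_HEADER_BYTES = 44    # assumption: canonical header with no extra chunks


def bytes_per_second() -> int:
    """Raw audio data rate for 24 kHz mono 16-bit PCM."""
    return SAMPLE_RATE_HZ * CHANNELS * BYTES_PER_SAMPLE


def estimated_wav_bytes(duration_s: float) -> int:
    """Approximate file size for a clip of the given duration."""
    samples = round(duration_s * SAMPLE_RATE_HZ)
    return WAV_HEADER_BYTES + samples * CHANNELS * BYTES_PER_SAMPLE


print(bytes_per_second())         # 48000
print(estimated_wav_bytes(60.0))  # 2880044
```

In other words, one minute of output is a little under 3 MB.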


Security recommendations
This skill is documentation for Pocket TTS and appears internally consistent. Before installing or following the instructions consider: (1) installing via 'pip install pocket-tts' runs third-party code — review the package repository and maintainers if you need a higher assurance level; (2) model weights and voice files may be downloaded from HuggingFace or other URLs (public downloads are normal; private models would require tokens you must manage securely); (3) voice-cloning uses your audio files — avoid feeding sensitive or private audio you don't want embedded; (4) the 'serve' command starts a local HTTP server — do not bind it to 0.0.0.0 or an exposed interface unless you understand and secure the service; and (5) if you need higher assurance, review the upstream GitHub repo (https://github.com/kyutai-labs/pocket-tts) and the pip package source before installing. Overall, the skill is coherent and does not request unrelated privileges or secrets.
Functional analysis
Type: OpenClaw Skill
Name: lb-pocket-tts-skill
Version: 0.1.0
The skill documents several high-risk capabilities of the `pocket-tts` tool, which, while presented as features, could be leveraged for attacks if the OpenClaw agent is prompted with malicious inputs or if the underlying library has vulnerabilities. Specifically, the skill describes how to load model configurations from arbitrary local YAML files (e.g., via `--config` in `docs/generate.md`, `docs/serve.md`), which could lead to arbitrary code execution. It also details loading audio prompts for voice cloning from arbitrary local files or remote HTTP/HTTPS URLs (e.g., via `--voice` in `SKILL.md`, `docs/generate.md`, `docs/export_voice.md`, `docs/python-api.md`), posing risks of Local File Inclusion (LFI) or Server-Side Request Forgery (SSRF). Additionally, the `pocket-tts serve` command (documented in `SKILL.md`, `docs/serve.md`) starts a web server, exposing an API that could be a further attack surface. There is no direct evidence of intentional malicious instructions within the skill's markdown, but the documented capabilities introduce significant security risks.
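
One way to reduce the LFI/SSRF exposure described above is to validate every voice source before it reaches `--voice` or the Python API. The sketch below is a hypothetical allowlist, not something pocket-tts provides: it accepts only the `hf://kyutai/tts-voices/` prefix and files that resolve inside a directory you designate.

```python
from pathlib import Path

TRUSTED_PREFIXES = ("hf://kyutai/tts-voices/",)  # assumption: adjust to your needs


def is_allowed_voice_source(source: str, local_root: Path) -> bool:
    """Accept trusted hf:// prefixes or files located under local_root."""
    if "://" in source:
        # Remote source: must match a trusted prefix and contain no '..' segments.
        return source.startswith(TRUSTED_PREFIXES) and ".." not in source.split("/")
    # Local source: resolve it and require that it stays inside local_root.
    root = local_root.resolve()
    resolved = (root / source).resolve()
    return root in resolved.parents


root = Path("voices")
print(is_allowed_voice_source("hf://kyutai/tts-voices/alba-mackenna/casual.wav", root))
print(is_allowed_voice_source("https://example.com/x.wav", root))
print(is_allowed_voice_source("../secret.wav", root))
```

An agent wrapper could call this check first and reject any request whose voice argument fails it.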
Capability assessment
Purpose & Capability
The name/description (Pocket TTS, CPU-friendly streaming TTS with voice cloning) match the files and instructions. The docs explain CLI, Python API, exporting voices, and serving a local API — all expected for a TTS documentation skill.
Instruction Scope
SKILL.md is documentation-only and stays within TTS domain. It instructs the user/agent to run pocket-tts CLI commands, install the package via pip/uv, load models, read local audio files or URLs, and optionally start a FastAPI server. These actions are consistent with the stated functionality but do involve reading user audio files and downloading model weights from external URLs (e.g., HuggingFace) if the agent/user follows the docs.
Install Mechanism
There is no automated install spec embedded in the skill (instruction-only), which is low-risk. The docs recommend 'pip install pocket-tts' or 'uv add pocket-tts' — installing a third-party package is an expected step but carries the usual risks of executing code from PyPI/uv; the skill itself does not supply or automatically fetch binaries from untrusted URLs.
Credentials
The skill declares no required environment variables, credentials, or config paths. The documented behavior (loading local files, using HF URLs, writing safetensors/WAV outputs) matches that; there are no unrelated secrets requested.
Persistence & Privilege
The skill is not forced-always and doesn't request persistent elevated privileges. It documents a 'serve' command that can start a local web server (default host localhost), which is normal for this tool; the skill does not attempt to modify other skills or system-wide configs.
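How to use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install lb-pocket-tts-skill
  3. Once installed, call the Skill by name or trigger it with /lb-pocket-tts-skill
  4. Provide the inputs described in the Skill's parameter notes to get structured output
Version history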
v0.1.0
  • Initial release of Pocket-TTS skill.
  • Generate English speech from text using Kyutai Pocket TTS, optimized for CPU-only use with no GPU required.
  • Supports voice cloning from 3–10 second audio samples and streaming, low-latency audio generation (~6x real-time on modern CPUs).
  • Provides both CLI and Python API interfaces for text-to-speech generation and voice management.
  • Outputs high-quality 24kHz mono WAV files.
  • Includes guidance for installation, usage, voice cloning, and performance optimization.
Metadata
Slug lb-pocket-tts-skill
Version 0.1.0
License
Total installs 0
Current installs 0
Version count 1
FAQ

What is Pocket TTS Complete Documentation?

Generate speech from text using Kyutai Pocket TTS - lightweight, CPU-friendly, streaming TTS with voice cloning. English only. ~6x real-time on M4 MacBook Air. It is an AI agent Skill plugin for Claude Code / OpenClaw, with 1031 total downloads so far.

How do I install Pocket TTS Complete Documentation?

Run the command "/install lb-pocket-tts-skill" in the OpenClaw or Claude Code chat box for a one-step install; no extra configuration is needed.

Is Pocket TTS Complete Documentation free?

Yes. Pocket TTS Complete Documentation is completely free (free and open source) and can be freely downloaded, installed, and used.

Which platforms does Pocket TTS Complete Documentation support?

Pocket TTS Complete Documentation is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Pocket TTS Complete Documentation?

It is developed and maintained by leonaaardob (@leonaaardob); the current version is v0.1.0.
