Description

Local text-to-speech using Qwen3-TTS with mlx_audio (macOS Apple Silicon) or qwen-tts (Linux/Windows). Privacy-first offline TTS with natural, realistic voic...

Usage Instructions (SKILL.md)

Local TTS with Qwen3-TTS

Name: Local TTS
Author: irachex

Privacy-First | Offline | High-Quality | Natural Real Voices

Local text-to-speech synthesis using Qwen3-TTS models. Your text never leaves your machine.

Why Local TTS?

Unlike cloud TTS (Google, AWS, Azure), local-tts ensures:

Zero data transmission - 100% on-device processing
Works offline - No network required
No API keys - No external dependencies
GDPR/HIPAA friendly - Simplified compliance

See references/privacy_security.md for privacy and security details.

Platform Overview

Platform	Backend	Installation	Best For
macOS (Apple Silicon)	`mlx_audio`	`pip install mlx-audio`	M1/M2/M3/M4 Macs
Linux/Windows	`qwen-tts`	`pip install qwen-tts`	CUDA GPUs
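
The platform split in the table can be expressed as a small selection helper. This is a sketch only: `pick_backend` is a hypothetical name, not part of the skill's scripts, and it assumes that Apple Silicon Macs report `Darwin` and `arm64` through Python's `platform` module.

```python
import platform


def pick_backend(system=None, machine=None):
    """Map a platform to the pip package listed in the Platform Overview table.

    Hypothetical helper for illustration; not part of the skill's scripts.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    if system == "Darwin" and machine == "arm64":
        return "mlx-audio"  # Apple Silicon Macs use the MLX backend
    if system in ("Linux", "Windows"):
        return "qwen-tts"  # Linux/Windows use qwen-tts (best with a CUDA GPU)
    raise RuntimeError(f"No backend listed for {system} on {machine}")
```

Calling `pick_backend()` with no arguments inspects the current machine. Platforms outside the table (for example Intel Macs) raise an error rather than guessing.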

Quick Start

macOS

pip install mlx-audio
brew install ffmpeg

# Natural female voice
python -m mlx_audio.tts.generate \
    --text "Hello world" \
    --model mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit \
    --voice Chelsie

Linux/Windows

pip install qwen-tts

# With optimizations (FlashAttention, bfloat16, auto-device)
python scripts/tts_linux.py "Hello world" --female

Key Concepts

`--voice` vs `--instruct` (Important)

Model	`--voice`	`--instruct`	Notes
CustomVoice	Select preset voice	Add style/emotion	Can use together - voice + style control
VoiceDesign	N/A	Create voice from description	`--instruct` only
Base	N/A	N/A	For voice cloning with `--ref_audio`

CustomVoice with style control:

python -m mlx_audio.tts.generate \
    --text "Hello there!" \
    --model mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit \
    --voice Serena \
    --instruct "excited and enthusiastic"
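
The `--voice` / `--instruct` rules in the table lend themselves to a check before the command is launched. The sketch below is illustrative: `STYLE_FLAGS`, `model_family`, and `build_command` are hypothetical names, and inferring the family from the model ID assumes the family name always appears in the ID, as it does in the examples in this document.

```python
import shlex

# Which of --voice / --instruct each model family accepts, per the table above.
STYLE_FLAGS = {
    "CustomVoice": {"voice", "instruct"},
    "VoiceDesign": {"instruct"},
    "Base": set(),
}


def model_family(model_id):
    """Infer the family from the model ID (assumes the name is embedded in it)."""
    for family in STYLE_FLAGS:
        if family in model_id:
            return family
    raise ValueError(f"Unrecognized model family in {model_id!r}")


def build_command(text, model_id, **flags):
    """Return an argument list for `python -m mlx_audio.tts.generate`."""
    family = model_family(model_id)
    rejected = ({"voice", "instruct"} & set(flags)) - STYLE_FLAGS[family]
    if rejected:
        raise ValueError(f"{family} models do not take: {sorted(rejected)}")
    cmd = ["python", "-m", "mlx_audio.tts.generate",
           "--text", text, "--model", model_id]
    for name, value in flags.items():
        cmd += [f"--{name}", str(value)]  # other flags pass through unchecked
    return cmd


cmd = build_command(
    "Hello there!",
    "mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit",
    voice="Serena",
    instruct="excited and enthusiastic",
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

Passing `voice=` together with a VoiceDesign or Base model ID raises `ValueError` instead of producing a command the table marks as N/A.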

9 Preset Voices (Open Source CustomVoice)

Voice	Gender	Language	Character
Chelsie	Female	English (American)	Gentle, empathetic
Serena	Female	English	Warm, gentle
Ono Anna	Female	Japanese	Playful
Sohee	Female	Korean	Warm
Aiden	Male	English (American)	Sunny
Dylan	Male	English	Natural
Eric	Male	English	Real
Ryan	Male	English	Natural
Uncle Fu	Male	Chinese	Youthful Beijing

Defaults: Female=Serena, Male=Aiden
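
The preset table and its defaults can be kept as data, so a wrapper can fall back to a default voice when none is given. This is a sketch under assumptions: `resolve_voice` is a hypothetical helper, and the voice names are copied from the table as written; the identifiers the CLI actually accepts (particularly for names containing spaces) may differ.

```python
# Preset voices from the table above: name -> (gender, language).
PRESET_VOICES = {
    "Chelsie": ("Female", "English (American)"),
    "Serena": ("Female", "English"),
    "Ono Anna": ("Female", "Japanese"),
    "Sohee": ("Female", "Korean"),
    "Aiden": ("Male", "English (American)"),
    "Dylan": ("Male", "English"),
    "Eric": ("Male", "English"),
    "Ryan": ("Male", "English"),
    "Uncle Fu": ("Male", "Chinese"),
}

# Defaults stated in this document.
DEFAULT_VOICE = {"Female": "Serena", "Male": "Aiden"}


def resolve_voice(voice=None, gender="Female"):
    """Return the requested preset, or the default for the given gender."""
    if voice is None:
        return DEFAULT_VOICE[gender]
    if voice not in PRESET_VOICES:
        raise ValueError(f"Unknown preset voice: {voice!r}")
    return voice
```

For example, `resolve_voice(gender="Male")` returns `Aiden`, while an unknown name raises `ValueError` before any model is loaded.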

Usage Examples

CustomVoice (Preset Voices)

# Natural female
python -m mlx_audio.tts.generate \
    --text "Your text" --voice Serena --lang_code en \
    --model mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit

# Real male
python -m mlx_audio.tts.generate \
    --text "Your text" --voice Aiden --lang_code en \
    --model mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit

VoiceDesign (Text-Based)

python -m mlx_audio.tts.generate \
    --text "Hello" \
    --model mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-8bit \
    --instruct "A warm female voice, professional and clear"

Long Text Generation

For long text, increase `--max_tokens` and pass `--join_audio` so that the generated segments are merged into a single file (macOS/MLX only):

python -m mlx_audio.tts.generate \
    --text "Your very long text here..." \
    --model mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit \
    --voice Serena \
    --max_tokens 4096 \
    --join_audio \
    --output long_audio.wav
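
Long, multi-paragraph text is awkward to quote on a shell command line. One option, sketched here, is to launch the same command from Python with an argument list, which needs no shell quoting. `long_text_command` and `synthesize` are hypothetical helpers; only the flags shown in the command above are used.

```python
import subprocess
import sys

MODEL = "mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit"


def long_text_command(text, output="long_audio.wav", max_tokens=4096):
    """Build the long-text command as an argument list (no shell involved)."""
    # Collapse newlines and repeated spaces so the text travels as one argument.
    flat_text = " ".join(text.split())
    return [
        sys.executable, "-m", "mlx_audio.tts.generate",
        "--text", flat_text,
        "--model", MODEL,
        "--voice", "Serena",
        "--max_tokens", str(max_tokens),
        "--join_audio",
        "--output", output,
    ]


def synthesize(text, output="long_audio.wav"):
    """Run the command; requires macOS with mlx-audio installed."""
    subprocess.run(long_text_command(text, output), check=True)
```

Text read from a file can be passed straight to `synthesize`. Because the arguments are a list, quotes and other shell metacharacters in the text are not interpreted.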

Voice Cloning

python -m mlx_audio.tts.generate \
    --text "Cloned voice speaking" \
    --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit \
    --ref_audio sample.wav --ref_text "Sample transcript"

Parameters

Parameter	Description	Values
`--text`	Text to speak	Required
`--model`	Model ID	See table below
`--voice`	Preset voice (CustomVoice)	Chelsie, Serena, Aiden, Ryan...
`--instruct`	Voice description (VoiceDesign) or style/emotion (CustomVoice)	e.g., "excited", "calm", "professional"
`--speed`	Speaking rate	0.5-2.0 (default: 1.0)
`--pitch`	Voice pitch	0.5-2.0 (default: 1.0)
`--lang_code`	Language	en, cn, ja, ko, de, fr...
`--ref_audio`	Reference for cloning	File path
`--output`	Output file	Path (auto-generated if omitted)
`--max_tokens`	Max generation tokens	Integer (default: 2048) - Increase for long text
`--join_audio`	Merge audio segments	`true` (default) or `false` - Recommended for long text
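
The ranges and defaults in the table can be enforced before a model is loaded. The sketch below uses `argparse` for a hypothetical front end; `ranged_float` and `make_parser` are illustrative names, and this is not the parser used by `mlx_audio` or the skill's scripts.

```python
import argparse


def ranged_float(low, high):
    """Return an argparse type that accepts floats within [low, high]."""
    def parse(raw):
        value = float(raw)
        if not low <= value <= high:
            raise argparse.ArgumentTypeError(
                f"must be between {low} and {high}, got {value}")
        return value
    return parse


def make_parser():
    """Hypothetical parser mirroring part of the Parameters table."""
    parser = argparse.ArgumentParser(description="Local TTS front end (sketch)")
    parser.add_argument("--text", required=True, help="Text to speak")
    parser.add_argument("--speed", type=ranged_float(0.5, 2.0), default=1.0)
    parser.add_argument("--pitch", type=ranged_float(0.5, 2.0), default=1.0)
    parser.add_argument("--max_tokens", type=int, default=2048)
    return parser


args = make_parser().parse_args(["--text", "Hello", "--speed", "1.25"])
print(args.speed, args.pitch, args.max_tokens)
```

An out-of-range value such as `--speed 3.0` makes `argparse` print an error and exit, so the mistake surfaces before any synthesis starts.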

Models

Model	Size	Purpose
`Qwen3-TTS-12Hz-1.7B-CustomVoice`	1.7B	9 preset voices + style control
`Qwen3-TTS-12Hz-1.7B-VoiceDesign`	1.7B	Text-based voice creation
`Qwen3-TTS-12Hz-1.7B-Base`	1.7B	Voice cloning
`Qwen3-TTS-12Hz-0.6B-*`	0.6B	Lightweight versions

macOS: Add the `mlx-community/` prefix; the MLX model IDs used in this document also carry an `-8bit` suffix (e.g., `mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit`)
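
The naming rule for macOS can be captured in a one-function sketch. `mlx_model_id` is a hypothetical helper, and `8bit` is the only quantization suffix that appears in this document; whether other variants are published is not something this example assumes.

```python
def mlx_model_id(base_name, quant="8bit"):
    """Turn a model name from the table into the MLX model ID used on macOS."""
    prefix = "mlx-community/"
    suffix = f"-{quant}"
    name = base_name if base_name.startswith(prefix) else prefix + base_name
    return name if name.endswith(suffix) else name + suffix


print(mlx_model_id("Qwen3-TTS-12Hz-1.7B-Base"))
```

The helper leaves an ID unchanged if it already has the prefix and suffix, so it is safe to apply to either form.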

Scripts

scripts/tts_macos.py - macOS wrapper
scripts/tts_linux.py - Linux/Windows wrapper with optimizations

Optimizations (Linux/Windows)

tts_linux.py automatically enables:

FlashAttention - Faster, less memory
bfloat16 - Half the memory of float32 while keeping its dynamic range
Auto device - CUDA → CPU fallback
Mixed precision - Speed + quality
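
The selection logic behind these optimizations might look roughly like the sketch below. It is not the code in `tts_linux.py`: `choose_runtime` is a hypothetical function, the CPU settings are an assumed fallback, and the string values follow the naming used by PyTorch dtypes and by the `attn_implementation` option in Hugging Face transformers.

```python
def choose_runtime(has_cuda):
    """Pick device, dtype, and attention settings (a sketch, not tts_linux.py)."""
    if has_cuda:
        return {
            "device": "cuda",
            "dtype": "bfloat16",
            "attn_implementation": "flash_attention_2",
        }
    # Assumed CPU fallback: full precision and PyTorch's built-in SDPA attention.
    return {
        "device": "cpu",
        "dtype": "float32",
        "attn_implementation": "sdpa",
    }


print(choose_runtime(has_cuda=False))
```

With PyTorch installed, `has_cuda` would typically come from `torch.cuda.is_available()`. Keeping the decision in a pure function makes the fallback easy to test on machines without a GPU.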

Troubleshooting

Issue	Solution
macOS: Model not found	Use `mlx-community/` prefix
macOS: Audio format	`brew install ffmpeg`
Linux: CUDA OOM	Use `0.6B` models
Linux: Slow	Check CUDA: `torch.cuda.is_available()`
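
Several of these checks can be run in one pass. The sketch below is a hypothetical diagnostic, not part of the skill; the import name `qwen_tts` for the `qwen-tts` package is an assumption, while `mlx_audio` matches the module invoked in the commands above.

```python
import importlib.util
import shutil


def has_module(name):
    """True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def environment_report():
    """Collect the facts the troubleshooting table asks about."""
    report = {
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        "mlx_audio_installed": has_module("mlx_audio"),
        "qwen_tts_installed": has_module("qwen_tts"),  # assumed import name
        "cuda_available": False,
    }
    if has_module("torch"):
        import torch  # imported lazily so the script also runs without PyTorch
        report["cuda_available"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

A `False` for `cuda_available` on a Linux machine with an NVIDIA GPU usually points at the PyTorch build or driver setup rather than at the TTS scripts.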

Version

1.0.0 - See VERSION and package.json

Security Recommendations

This skill appears to do what it says: local, offline TTS wrappers for macOS (mlx_audio) and Linux/Windows (qwen-tts). Before installing or running it:

- Expect large one-time downloads for model weights (from Hugging Face-style model IDs) and significant disk/GPU usage for 1.7B models; the docs note smaller 0.6B alternatives.
- If you require strict air-gapped operation, pre-download and verify models and dependencies; the scripts call from_pretrained, which normally performs network fetches.
- Some model checkpoints may be gated and require a Hugging Face token (not declared by the skill); provide such credentials yourself if needed and verify the trustworthiness of the model source.
- The registry metadata had minor mismatches (homepage present in package.json but registry listed none), and tests reference a VERSION file not present in the manifest. These are build/metadata inconsistencies, not direct security red flags, but you may want to confirm the repository/author (package.json points to https://github.com/irachex/local-tts).
- Dependencies to install (mlx-audio, qwen-tts, torch, ffmpeg, optional flash-attn) are normal for TTS, but be prepared for native builds (flash-attn) and sizeable installs.

If you want to be extra cautious, review the upstream GitHub repo and the actual model sources before running the first model download.
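
For the air-gapped case, one common approach is to forbid network access at the Hugging Face Hub layer once the models have been downloaded. The sketch below sets `HF_HUB_OFFLINE=1`, an environment variable read by the `huggingface_hub` library. It assumes the backends load models through `huggingface_hub`, which this document suggests but does not confirm; `offline_env` is a hypothetical helper.

```python
import os


def offline_env(base=None):
    """Return a copy of the environment with Hugging Face Hub access disabled."""
    env = dict(os.environ if base is None else base)
    env["HF_HUB_OFFLINE"] = "1"  # huggingface_hub then uses only the local cache
    return env


# Example: pass the result as `env=` to subprocess.run(...) when launching
# a TTS command, so a missing model fails fast instead of reaching the network.
print(offline_env({"PATH": "/usr/bin"}))
```

The function copies the environment rather than mutating it, so the offline setting applies only to the processes it is handed to.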

Functional Analysis

Type: OpenClaw Skill
Name: local-tts
Version: 1.0.0

The skill bundle provides a legitimate local text-to-speech utility using Qwen3-TTS models for macOS, Linux, and Windows. The Python wrappers (scripts/tts_linux.py and scripts/tts_macos.py) use standard machine learning libraries (torch, transformers, mlx_audio) and implement safe subprocess handling without shell injection risks. The documentation (SKILL.md and references/privacy_security.md) is consistent with the stated purpose, focusing on privacy and offline processing, and contains no evidence of malicious prompt injection or hidden instructions.

Capability Assessment

✓ Purpose & Capability

Name/description (local Qwen3-TTS via mlx_audio or qwen-tts) match the included scripts and docs. The scripts call the expected libraries (mlx_audio on macOS, qwen-tts/torch on Linux/Windows) and expose the parameters described in SKILL.md. Minor metadata mismatch: the registry metadata lists no homepage, but package.json includes a GitHub homepage. This is a metadata inconsistency, not a security issue.

✓ Instruction Scope

SKILL.md and the scripts instruct only to run local TTS generation, reference local files (ref_audio) and standard parameters. They do not attempt to read arbitrary unrelated system files or exfiltrate data. The instructions do rely on downloading models (one-time) from model hosting (Hugging Face-style identifiers), which is documented in the README and references; that initial network activity is expected but should be acknowledged by users who require strict air-gapped operation.

✓ Install Mechanism

No install spec in the registry (instruction-only), so nothing is auto-downloaded by the platform. The code relies on pip-installable packages (mlx-audio, qwen-tts, torch, flash-attn) plus the ffmpeg binary, which are reasonable for this purpose. Model weights are loaded via from_pretrained calls, which fetch artifacts from model hosts; this is expected but can involve large downloads and possibly gated models that require credentials.

✓ Credentials

The skill requests no environment variables, no credentials, and no config paths, consistent with a local-offline TTS tool. Caveat: some Hugging Face-hosted models can be gated and would require a Hugging Face access token at download time (for example through the `HF_TOKEN` environment variable). The skill does not declare such env vars, so users should be prepared to provide a token manually if needed. Disk, memory and GPU resource requirements (large model files, VRAM) are documented in the references.

✓ Persistence & Privilege

The skill does not request always:true or any elevated/persistent privileges. It is a user-invocable wrapper that runs local Python code and invokes the installed libraries through subprocesses; this is proportional to its function.

Version History

v1.0.0

Initial release: Local text-to-speech with Qwen3-TTS, supporting macOS (mlx_audio) and Linux/Windows (qwen-tts) with FlashAttention and bfloat16 optimizations. Includes 9 natural preset voices, voice cloning, and voice design.

Metadata

Slug local-tts

Version 1.0.0

License MIT-0

Total installs 4

Current installs 4

Number of past versions 1

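FAQ

What is Local TTS?

Local text-to-speech using Qwen3-TTS with mlx_audio (macOS Apple Silicon) or qwen-tts (Linux/Windows). Privacy-first offline TTS with natural, realistic voic... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 359 cumulative downloads to date.

How do I install Local TTS?

Run the command "/install local-tts" in the OpenClaw or Claude Code chat box for a one-step install; no additional configuration is needed.

Is Local TTS free?

Yes. Local TTS is completely free and released under the MIT-0 license; it can be freely downloaded, installed, and used.

Which platforms does Local TTS support?

Local TTS is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Local TTS?

Developed and maintained by irachex (@irachex); the current version is v1.0.0.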
