Description

Local text-to-speech using Alibaba's CosyVoice3 on macOS Apple Silicon. Supports Chinese, English, Japanese, Korean, and 18+ Chinese dialects. Provides zero-...

Usage Instructions (SKILL.md)

CosyVoice3 TTS

Name: CosyVoice3 macOS
Author: lhuaizhong

Local text-to-speech using Alibaba's CosyVoice3 on macOS Apple Silicon.

Overview

CosyVoice3 is an advanced TTS system based on large language models, supporting:

9 languages: Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian
18+ Chinese dialects: Cantonese, Sichuan, Dongbei, Shanghai, etc.
Zero-shot voice cloning: Clone any voice from 3-10 seconds of audio
Cross-lingual synthesis: Speak Chinese with English voice or vice versa
Fine-grained control: Emotions, speed, volume via text tags

Prerequisites

macOS with Apple Silicon (M1/M2/M3)
Python 3.10
Conda installed
~5GB disk space for models
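
The prerequisites above can be checked before running the installer. The following is a minimal sketch and not part of the skill: `check_prerequisites` is a hypothetical helper, and it assumes a natively running (non-Rosetta) Python reports `arm64` from `platform.machine()`.

```python
import platform
import shutil
import sys


def check_prerequisites(system=None, machine=None, version=None, conda_path=None):
    """Return a list of problems; an empty list means every check passed.

    All arguments are optional overrides (useful for testing); when omitted,
    the values are read from the running interpreter and the current PATH.
    """
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    version = sys.version_info[:2] if version is None else tuple(version)[:2]
    conda_path = shutil.which("conda") if conda_path is None else conda_path

    problems = []
    if system != "Darwin":
        problems.append(f"expected macOS (Darwin), found {system}")
    if machine != "arm64":
        problems.append(f"expected Apple Silicon (arm64), found {machine}")
    if version != (3, 10):
        problems.append(f"expected Python 3.10, found {version[0]}.{version[1]}")
    if not conda_path:
        problems.append("conda was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Free disk space is not checked here; `shutil.disk_usage` could be added for that if needed.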

Installation

Run the installation script:

cd /Users/lhz/.openclaw/workspace/skills/cosyvoice3/scripts
bash install.sh

This will:

Create conda environment cosyvoice
Install PyTorch (CPU version for Apple Silicon)
Install CosyVoice dependencies
Download Fun-CosyVoice3-0.5B model (~2GB)

Usage

Quick Start - Basic TTS

Important: CosyVoice3 requires the <|endofprompt|> marker in the reference text!

cd /Users/lhz/.openclaw/workspace/cosyvoice3-repo
export PATH="$HOME/miniconda3/bin:$PATH"
conda activate cosyvoice

python -c "
import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import AutoModel
import torchaudio

cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B')
for i, j in enumerate(cosyvoice.inference_zero_shot(
    '你好，这是CosyVoice3语音合成测试。', 
    '希望你以后能够做的比我还好呦。<|endofprompt|>',  # note the required marker
    'asset/zero_shot_prompt.wav'
)):
    torchaudio.save('output.wav', j['tts_speech'], cosyvoice.sample_rate)
print('Generated: output.wav')
"

Using the TTS Script

Generate speech from text:

cd /Users/lhz/.openclaw/workspace/skills/cosyvoice3/scripts
conda activate cosyvoice

# Basic TTS with default voice
python tts.py "你好，这是一个测试。"

# With custom reference audio for voice cloning
python tts.py "你好，这是克隆的声音。" --reference /path/to/reference.wav

# Cross-lingual (English text with Chinese voice)
python tts.py "Hello, this is cross-lingual synthesis." --reference asset/zero_shot_prompt.wav --lang en

# With speed control
python tts.py "这是一段快速的语音。" --speed 1.5

# Save to specific path
python tts.py "你好。" --output ~/Desktop/greeting.wav
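
The flags used above suggest a small command-line interface. The source of tts.py is not shown here, so the following is only a sketch of how such a parser could look; `build_parser` and every default value are assumptions, not the script's actual behavior.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the examples above."""
    parser = argparse.ArgumentParser(description="Generate speech from text")
    parser.add_argument("text", help="text to synthesize")
    # Defaults below are illustrative assumptions.
    parser.add_argument("--reference", default="asset/zero_shot_prompt.wav",
                        help="reference audio used for voice cloning")
    parser.add_argument("--lang", default=None,
                        help="language code of the input text, e.g. en")
    parser.add_argument("--speed", type=float, default=1.0,
                        help="speech rate multiplier")
    parser.add_argument("--output", default="output.wav",
                        help="path of the generated WAV file")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```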

Available Assets

Reference audio files in cosyvoice3-repo/asset/:

zero_shot_prompt.wav - Default Chinese female voice
cross_lingual_prompt.wav - English prompt for cross-lingual

Advanced Features

Voice Cloning

Clone a voice from 3-10 seconds of reference audio:

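import sys
sys.path.append('third_party/Matcha-TTS')  # run from the cosyvoice3-repo root
from cosyvoice.cli.cosyvoice import AutoModel
import torchaudio

cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B')

# Clone the voice from the reference audio and generate speech
for i, j in enumerate(cosyvoice.inference_zero_shot(
    '这是克隆后的声音在说话。',
    'Reference text transcription<|endofprompt|>',  # transcript of the reference audio, plus the required marker
    '/path/to/reference.wav'
)):
    torchaudio.save('cloned.wav', j['tts_speech'], cosyvoice.sample_rate)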
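
Because the reference text must carry the <|endofprompt|> marker, it is easy to forget it when supplying your own transcript. A small sketch of a guard follows; `with_end_of_prompt` is a hypothetical helper, and appending the marker at the end simply follows the Quick Start example.

```python
END_OF_PROMPT = "<|endofprompt|>"


def with_end_of_prompt(prompt_text: str) -> str:
    """Append the end-of-prompt marker unless the text already contains it."""
    if END_OF_PROMPT in prompt_text:
        return prompt_text
    return prompt_text + END_OF_PROMPT


# The helper is idempotent, so it is safe to apply to any transcript.
print(with_end_of_prompt("希望你以后能够做的比我还好呦。"))
```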

Fine-Grained Control

Control prosody with special tags:

# Add laughter
"他突然[laughter]笑了起来[laughter]。"

# Add breathing
"他说完这句话[breath]，深吸一口气。"

# Strong emphasis
"这是<strong>非常重要</strong>的。"

# Combined
"在面对挑战时，他展现了非凡的<strong>勇气</strong>与<strong>智慧</strong>[breath]。"

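When tagged text is assembled programmatically, small helpers can keep the markup consistent. This is a sketch only: `emphasize` and `strong_tags_balanced` are hypothetical names, and the tag spellings are taken from the examples above.

```python
LAUGHTER = "[laughter]"
BREATH = "[breath]"


def emphasize(text: str) -> str:
    """Wrap text in the strong-emphasis tag pair."""
    return f"<strong>{text}</strong>"


def strong_tags_balanced(text: str) -> bool:
    """Cheap sanity check: every opening strong tag has a closing one."""
    return text.count("<strong>") == text.count("</strong>")


sentence = (
    "在面对挑战时，他展现了非凡的"
    + emphasize("勇气") + "与" + emphasize("智慧") + BREATH + "。"
)
print(sentence)
```
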
Dialect Support

Use instruct mode for dialects:

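import sys
sys.path.append('third_party/Matcha-TTS')  # run from the cosyvoice3-repo root
from cosyvoice.cli.cosyvoice import AutoModel
import torchaudio

# Dialect instructions use the CosyVoice-300M-Instruct model
cosyvoice = AutoModel(model_dir='pretrained_models/CosyVoice-300M-Instruct')

for i, j in enumerate(cosyvoice.inference_instruct(
    '你好，这是测试语音。',
    '中文男',  # speaker ID
    '用四川话说这句话<|endofprompt|>'  # instruction: "say this sentence in Sichuanese"
)):
    torchaudio.save('sichuan.wav', j['tts_speech'], cosyvoice.sample_rate)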

Troubleshooting

Model not found

If you get "model not found" errors, download models manually:

cd /Users/lhz/.openclaw/workspace/cosyvoice3-repo
export PATH="$HOME/miniconda3/bin:$PATH"
conda activate cosyvoice

python -c "
from modelscope import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
"

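To avoid re-downloading, the manual download can be wrapped in a check for an existing model directory. The sketch below assumes a non-empty directory means the model is present, which does not detect partial downloads; `model_is_present` and `ensure_model` are hypothetical helpers, and `snapshot_download` is called exactly as in the command above.

```python
from pathlib import Path


def model_is_present(model_dir: str) -> bool:
    """Return True if the directory exists and contains at least one entry."""
    path = Path(model_dir)
    return path.is_dir() and any(path.iterdir())


def ensure_model(
    model_dir: str = "pretrained_models/Fun-CosyVoice3-0.5B",
    model_id: str = "FunAudioLLM/Fun-CosyVoice3-0.5B-2512",
) -> bool:
    """Download the model only if it is missing; return True if a download ran."""
    if model_is_present(model_dir):
        return False
    # Imported lazily so the presence check works without modelscope installed.
    from modelscope import snapshot_download
    snapshot_download(model_id, local_dir=model_dir)
    return True
```
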
Memory issues

For long text, split into sentences:

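text = "很长的文本..."
# Split on the Chinese full stop and drop empty fragments
sentences = [s.strip() for s in text.split('。') if s.strip()]
for sent in sentences:
    # Process each sentence separately, e.g. pass it to cosyvoice.inference_zero_shot(...)
    print(sent)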
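
Splitting only on '。' misses sentences that end in other punctuation and discards the punctuation itself. A slightly more general splitter is sketched below; `split_sentences` is a hypothetical helper, and the set of sentence-ending characters is an assumption that may need extending.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split after Chinese or ASCII sentence-ending punctuation, keeping it."""
    # The lookbehind matches the empty position right after a terminator,
    # so the terminator stays attached to its sentence.
    parts = re.split(r"(?<=[。！？!?])", text)
    return [part.strip() for part in parts if part.strip()]


for sentence in split_sentences("你好。今天天气不错！走吧"):
    print(sentence)
```

Each returned sentence can then be synthesized on its own, which keeps the per-call input short.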

Audio format

Reference audio requirements:

Format: WAV, MP3
Sample rate: 16kHz+ (automatically resampled)
Duration: 3-10 seconds optimal
Content: Clear speech, minimal background noise
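
The sample-rate and duration guidelines above can be checked before cloning. The following sketch uses only the standard library `wave` module, so it handles uncompressed PCM WAV files only (not MP3); `check_reference` is a hypothetical helper and the thresholds simply restate the guidelines.

```python
import wave


def check_reference(path: str, min_seconds: float = 3.0,
                    max_seconds: float = 10.0, min_rate: int = 16000) -> list[str]:
    """Return a list of problems with a PCM WAV reference file (empty if none)."""
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        duration = wav_file.getnframes() / rate

    problems = []
    if rate < min_rate:
        problems.append(f"sample rate {rate} Hz is below {min_rate} Hz")
    if not min_seconds <= duration <= max_seconds:
        problems.append(
            f"duration {duration:.1f} s is outside {min_seconds}-{max_seconds} s"
        )
    return problems
```

Clarity and background noise cannot be judged this way and still need a listen.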

Resources

Scripts

install.sh - Installation script for macOS
tts.py - Main TTS script with CLI interface
download_models.py - Download pretrained models

References

Model Files

Located in cosyvoice3-repo/pretrained_models/:

Fun-CosyVoice3-0.5B/ - Main model (recommended)
CosyVoice2-0.5B/ - Previous version
CosyVoice-300M/ - Lighter model
CosyVoice-300M-SFT/ - SFT version
CosyVoice-300M-Instruct/ - Instruct version

Notes

First inference takes ~30 seconds (model warmup)
Subsequent inferences are faster
Apple Silicon uses CPU mode (no CUDA)
RTF (real-time factor) ~0.3-0.5 on M-series chips
Model files are cached locally after first download
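
The real-time factor quoted above is the synthesis time divided by the duration of the generated audio, so values below 1 mean faster than real time. A minimal sketch for measuring it follows; `real_time_factor` is a hypothetical helper, and deriving the audio duration as `j['tts_speech'].shape[1] / cosyvoice.sample_rate` assumes the output tensor has shape (channels, samples).

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Example: 4 seconds of compute for 10 seconds of speech.
rtf = real_time_factor(4.0, 10.0)
print(f"RTF = {rtf:.2f}")
```

In practice the synthesis time can be taken with `time.perf_counter()` around the inference loop, excluding the first (warmup) call.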

Security Recommendations

This skill appears coherent with its stated purpose, but review and consider the following before installing:

(1) It will download and install Miniconda, many pip packages, and ~2GB of model files from the internet (GitHub, PyPI, ModelScope). Ensure you trust the FunAudioLLM repo and are comfortable with network downloads and disk use.
(2) The scripts use hard-coded paths (/Users/lhz/.openclaw/…); update them to your environment before running.
(3) FastAPI/Gradio are installed (likely for demos); avoid running any web servers unless you intend to expose them.
(4) Run the installer in a sandboxed environment or isolated conda environment if you want to limit system impact, and inspect repository code (especially any demo/example scripts) before launching.

No credentials or secret exfiltration were requested or found in the provided files.

Functional Analysis

Type: OpenClaw Skill
Name: cosyvoice3-macos
Version: 1.0.0

The skill is designed for local text-to-speech using Alibaba's CosyVoice3. Its installation script (`scripts/install.sh`) performs extensive setup, including downloading and installing Miniconda from `repo.anaconda.com`, cloning the official CosyVoice GitHub repository, installing numerous Python dependencies, and downloading large ML models from Alibaba's ModelScope. While these actions involve downloading and executing remote code, they are standard and necessary for setting up a complex ML environment and are sourced from legitimate, well-known providers. The `SKILL.md` instructions and other Python scripts (`scripts/download_models.py`, `scripts/tts.py`) align with the stated purpose and do not exhibit any signs of malicious intent, such as data exfiltration, unauthorized persistence, or prompt injection attempts to subvert the agent's core function.

Capability Assessment

✓ Purpose & Capability

Name/description (local CosyVoice3 TTS) align with the included files: installer, downloader, and CLI TTS script. The model download, conda env, and Python deps are reasonable for running a local TTS model.

ℹ Instruction Scope

SKILL.md and scripts instruct the agent/user to run install.sh, clone the CosyVoice repo, install many Python packages, and download models. All of these are relevant, but instructions use hard-coded workspace paths (/Users/lhz/.openclaw/...) which are environment-specific and may need adjustment before running. The instructions perform network downloads (git clone, pip, modelscope snapshot_download) — expected for this purpose.

ℹ Install Mechanism

There is no separate install spec in registry, but the included install.sh performs network installs: Miniconda installer from repo.anaconda.com, git clone from GitHub, many pip installs from PyPI, and models downloaded via modelscope. These are standard but do execute remote code and write ~2GB of model files to disk.

✓ Credentials

The skill requests no environment variables, no credentials, and no config paths beyond the local workspace. Dependencies like modelscope and snapshot_download need network access but no secrets — proportional to the task.

✓ Persistence & Privilege

Skill is not always-enabled and does not request elevated privileges or modify other skills. It only writes to its own workspace and model directories per the installer.

Version History

v1.0.0

Initial release: Alibaba CosyVoice3 TTS for macOS Apple Silicon. Supports Chinese, English, 18+ dialects, zero-shot voice cloning.

Metadata

Slug cosyvoice3-macos

Version 1.0.0

License -

Total installs 2

Current installs 2

Version count 1

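FAQ

What is CosyVoice3 macOS?

Local text-to-speech using Alibaba's CosyVoice3 on macOS Apple Silicon. Supports Chinese, English, Japanese, Korean, and 18+ Chinese dialects. Provides zero-... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 645 cumulative downloads to date.

How do I install CosyVoice3 macOS?

Run the command "/install cosyvoice3-macos" in an OpenClaw or Claude Code chat to install it in one step; no additional configuration is required.

Is CosyVoice3 macOS free?

Yes. CosyVoice3 macOS is completely free (free and open source) and can be freely downloaded, installed, and used.

Which platforms does CosyVoice3 macOS support?

CosyVoice3 macOS is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed CosyVoice3 macOS?

It is developed and maintained by lhuaizhong (@lhuaizhong); the current version is v1.0.0.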
