
Fish Audio S2 Pro TTS

Author: OpenLark · GitHub · v1.0.0 · MIT-0
cross-platform ✓ Security check passed
114 total downloads · 0 favorites · 0 current installs · 1 version
Install in OpenClaw
/install fish-speech
Description
Fish Audio S2 Pro TTS.
Usage (SKILL.md)

Fish Audio S2 Pro TTS

Dual-AR architecture (Slow AR 4B + Fast AR 400M), 10 RVQ codebooks, ~21 Hz frame rate, 80+ languages.
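The codebook count and frame rate above imply a rough code budget per second of audio. A minimal sketch of that arithmetic, assuming exactly 21 frames per second (the summary only says approximately 21 Hz) and one code per codebook per frame; `estimate_codes` is a hypothetical helper, not part of fish-speech:

```python
FRAME_RATE_HZ = 21   # approximate value taken from the summary above
NUM_CODEBOOKS = 10   # RVQ codebooks per frame


def estimate_codes(audio_seconds: float) -> tuple[int, int]:
    """Return (frames, total_codes) for a clip of the given duration."""
    frames = round(audio_seconds * FRAME_RATE_HZ)
    return frames, frames * NUM_CODEBOOKS
```

Under these assumptions, a 10-second clip is about 210 frames, or about 2,100 codes.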

Installation

See references/install.md. Quick summary:

conda create -n fish-speech python=3.12 && conda activate fish-speech
pip install -e ".[cu129]"   # CUDA 12.9 (quoted so zsh does not expand the brackets)
# or: uv sync --python 3.12 --extra cu129
# minimal: pip install fish-speech

apt install portaudio19-dev libsox-dev ffmpeg  # System dependencies
hf download fishaudio/s2-pro --local-dir checkpoints/s2-pro

Server Deployment

vLLM-Omni (recommended, OpenAI compatible):

pip install fish-speech
vllm serve fishaudio/s2-pro --omni --port 8091
# Endpoints: POST /v1/audio/speech, /v1/audio/speech/batch

SGLang-Omni (high-performance streaming):

sgl-omni serve --model-path fishaudio/s2-pro --config examples/configs/s2pro_tts.yaml --port 8000
# RTF 0.195, TTFA ~100ms, throughput 3000+ t/s
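The real-time factor (RTF) quoted above is the ratio of synthesis time to audio duration, so values below 1 mean faster than real time. A small sketch of what the figure implies, assuming the quoted RTF of 0.195 holds on your hardware; `synthesis_seconds` is a hypothetical helper:

```python
def synthesis_seconds(audio_seconds: float, rtf: float = 0.195) -> float:
    """Estimate wall-clock synthesis time from the real-time factor.

    RTF = synthesis time / audio duration, so time = RTF * duration.
    The default RTF is the figure quoted for SGLang-Omni above.
    """
    return audio_seconds * rtf
```

At that RTF, 10 seconds of audio would take roughly 1.95 seconds to generate.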

Docker:

docker compose --profile webui up    # Port 7860
COMPILE=1 docker compose --profile webui up  # ~10x speedup

Native API Server:

python tools/api_server.py --llama-checkpoint-path checkpoints/s2-pro --decoder-checkpoint-path checkpoints/s2-pro/codec.pth --listen 0.0.0.0:8080
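Model loading can take a while, so clients may want to wait until the chosen server port accepts connections before sending requests. A minimal standard-library sketch; `wait_for_port` is a hypothetical helper, and it only checks that the TCP port is open, not that the model has finished loading:

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Poll until host:port accepts a TCP connection or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # not up yet; retry shortly
    return False
```

For example, `wait_for_port("127.0.0.1", 8080)` would suit the native API server command above, and port 8091 the vLLM-Omni one.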

Raw CLI Inference (Three Steps)

# 1. Extract VQ tokens from the reference audio (writes fake.npy)
python fish_speech/models/dac/inference.py -i "ref.wav" --checkpoint-path "checkpoints/s2-pro/codec.pth"
# 2. Generate semantic tokens from the text, conditioned on the reference (writes codes_0.npy)
python fish_speech/models/text2semantic/inference.py --text "Text" --prompt-text "Reference text" --prompt-tokens "fake.npy"
# 3. Decode the semantic tokens to audio
python fish_speech/models/dac/inference.py -i "codes_0.npy"
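The three steps can be chained from Python. The sketch below only assembles the same commands shown above and runs them in order; it assumes it is launched from the fish-speech repository root and that the intermediate files are named `fake.npy` and `codes_0.npy`, as in those commands. `build_commands` and `run_pipeline` are hypothetical helper names.

```python
import subprocess
import sys

CODEC = "checkpoints/s2-pro/codec.pth"  # path taken from the commands above


def build_commands(ref_wav: str, ref_text: str, text: str) -> list[list[str]]:
    """Return the three CLI invocations as argument lists."""
    py = sys.executable
    return [
        # 1. Encode the reference audio into VQ tokens
        [py, "fish_speech/models/dac/inference.py", "-i", ref_wav,
         "--checkpoint-path", CODEC],
        # 2. Generate semantic tokens for the target text
        [py, "fish_speech/models/text2semantic/inference.py", "--text", text,
         "--prompt-text", ref_text, "--prompt-tokens", "fake.npy"],
        # 3. Decode the generated tokens to audio
        [py, "fish_speech/models/dac/inference.py", "-i", "codes_0.npy"],
    ]


def run_pipeline(ref_wav: str, ref_text: str, text: str) -> None:
    """Run the steps sequentially, stopping at the first failure."""
    for cmd in build_commands(ref_wav, ref_text, text):
        subprocess.run(cmd, check=True)
```

Passing argument lists rather than a shell string avoids quoting problems when the text contains spaces or quotes.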

API Calls

cURL

# Basic TTS
curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello."}' --output out.wav

# Voice cloning (vLLM)
curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Cloned voice.", "ref_audio": "https://...", "ref_text": "Reference transcription"}' --output cloned.wav

# Streaming PCM
curl -N -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Streaming.", "stream": true, "response_format": "pcm"}' | play -t raw -r 44100 -e signed -b 16 -c 1 -

# Batch
curl -X POST http://localhost:8091/v1/audio/speech/batch \
  -H "Content-Type: application/json" \
  -d '{"items": [{"input": "Sentence 1"}, {"input": "Sentence 2"}], "voice": "default"}'

Python

import requests

# Voice cloning through the HTTP endpoint
resp = requests.post("http://localhost:8091/v1/audio/speech", json={
    "input": "Hello.", "voice": "default",
    "ref_audio": "https://...", "ref_text": "Reference text"
})
resp.raise_for_status()  # fail loudly instead of saving an error body as out.wav
with open("out.wav", "wb") as f:
    f.write(resp.content)

# OpenAI SDK
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8091/v1", api_key="none")
with client.audio.speech.with_streaming_response.create(
    model="fishaudio/s2-pro", voice="default", input="Hello."
) as response:
    response.stream_to_file("out.wav")
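The streaming cURL example pipes raw PCM to `play` as 44.1 kHz, signed 16-bit, mono. To save streamed audio as a file instead, the raw bytes need a WAV header. A minimal standard-library sketch, assuming those same audio parameters; `pcm_to_wav_bytes` is a hypothetical helper:

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 44100,
                     channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw signed PCM samples in a WAV container.

    Defaults mirror the `play` arguments in the cURL example:
    44100 Hz, 16-bit (2 bytes per sample), mono.
    """
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```

Collect the streamed chunks into one `bytes` object, pass it through this function, and write the result to disk.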

SGLang uses a different field for the reference voice. Instead of ref_audio/ref_text, pass: "references": [{"audio_path": "...", "text": "..."}]
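The two reference formats can be kept apart with small payload builders. This is a sketch under assumptions: the vLLM fields come from the cURL example, the `references` shape comes from the line above, and the use of an `input` key in the SGLang payload is assumed rather than confirmed by this document. Both function names are hypothetical.

```python
def vllm_clone_payload(text: str, ref_audio: str, ref_text: str) -> dict:
    """Voice-cloning body in the vLLM-Omni format (ref_audio / ref_text)."""
    return {"input": text, "ref_audio": ref_audio, "ref_text": ref_text}


def sglang_clone_payload(text: str, audio_path: str, ref_text: str) -> dict:
    """Voice-cloning body in the SGLang format (a list of references).

    ASSUMPTION: the text field is named "input" here as well.
    """
    return {
        "input": text,
        "references": [{"audio_path": audio_path, "text": ref_text}],
    }
```

Either dictionary can be passed as `json=` to `requests.post` against the matching server.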

Request Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| input | string | Required | Text to synthesize |
| voice | string | "default" | Voice |
| response_format | string | "wav" | wav/mp3/flac/pcm/aac/opus |
| speed | float | 1.0 | Speech speed (0.25-4.0) |
| stream | bool | false | Streaming (requires response_format="pcm") |
| ref_audio | string | null | Reference audio URL/base64/file:// |
| ref_text | string | null | Reference audio transcription |
| max_new_tokens | int | 2048 | Max generation tokens |
| temperature | float | null | Sampling temperature |
| top_p | float | null | Nucleus sampling |
| top_k | int | null | Top-K |
| repetition_penalty | float | null | Repetition penalty |
| seed | int | null | Random seed |
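The constraints in the parameter list (the speed range, the allowed formats, and streaming requiring PCM) can be checked on the client before a request is sent. A minimal sketch; `build_speech_request` is a hypothetical helper, and the server may enforce further rules not listed here:

```python
ALLOWED_FORMATS = {"wav", "mp3", "flac", "pcm", "aac", "opus"}


def build_speech_request(text: str, *, voice: str = "default",
                         response_format: str = "wav", speed: float = 1.0,
                         stream: bool = False, **optional) -> dict:
    """Validate documented constraints and return a JSON-ready payload."""
    if not text:
        raise ValueError("input is required")
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    if stream and response_format != "pcm":
        raise ValueError('stream=True requires response_format="pcm"')
    payload = {"input": text, "voice": voice,
               "response_format": response_format,
               "speed": speed, "stream": stream}
    # Optional fields (ref_audio, temperature, seed, ...) default to null on
    # the server, so only send the ones that were explicitly set.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload
```

For example, `build_speech_request("Hello.", stream=True, response_format="pcm", seed=7)` returns a payload that includes `seed`, while `stream=True` with the default `wav` format raises `ValueError`.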

Emotion Tags

Embed [tag] anywhere in the text. More than 15,000 free-form tags are supported:

[excited]Today is a great day![pause] [whisper in small voice]But there's a secret…
[professional broadcast tone]Welcome.

Common: [excited] [angry] [sad] [whisper] [shouting] [laughing] [pause] [emphasis] [echo] [inhale] [sigh] [singing]

Full reference: references/emotion-tags.md
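When preparing scripts it can help to list the tags a text contains, or to recover the plain text for display. A small sketch, assuming a tag is any bracketed span without nested brackets; the model's own parsing may differ, and both helpers are hypothetical:

```python
import re

# ASSUMPTION: a tag is "[...]" with no brackets inside it.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def extract_tags(text: str) -> list[str]:
    """Return the tag names in order of appearance."""
    return TAG_PATTERN.findall(text)


def strip_tags(text: str) -> str:
    """Remove all tags, leaving the spoken text."""
    return TAG_PATTERN.sub("", text)
```

Applied to the first example line above, `extract_tags` yields `excited`, `pause`, and `whisper in small voice`.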

Multi-Speaker

<|speaker:0|>Hello, welcome.
<|speaker:1|>Thank you, glad to be here.
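Dialogue turns can be assembled programmatically using speaker tokens of the form `<|speaker:N|>`, one turn per line as in the example above. A minimal sketch; `format_dialogue` is a hypothetical helper:

```python
def format_dialogue(turns: list[tuple[int, str]]) -> str:
    """Prefix each line with its speaker token, one turn per line."""
    return "\n".join(f"<|speaker:{sid}|>{line}" for sid, line in turns)
```

The returned string can be sent as the `input` field of a speech request.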

LoRA Fine-tuning

⚠️ Fine-tuning is not recommended for models that have already been through RL. If you do fine-tune, train only the Slow AR:

# Preparation: data/SPK1/*.mp3 + *.lab
python tools/vqgan/extract_vq.py data --config-name modded_dac_vq --checkpoint-path checkpoints/openaudio-s1-mini/codec.pth
python tools/llama/build_dataset.py --input data --output data/protos
python fish_speech/train.py --config-name text2semantic_finetune project=my_project +lora@model.model.lora_config=r_8_alpha_16
python tools/llama/merge_lora.py --lora-config r_8_alpha_16 --base-weight checkpoints/openaudio-s1-mini --lora-weight results/my_project/checkpoints/step_xxx.ckpt --output checkpoints/merged/

See references/finetune.md
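Before extracting VQ tokens, it is worth confirming that every audio clip has a matching `.lab` transcript, as the preparation comment requires. A minimal sketch; `missing_transcripts` is a hypothetical helper, and accepting `.wav` and `.flac` alongside `.mp3` is an assumption, since the comment above only names `.mp3`:

```python
from pathlib import Path

# ASSUMPTION: .wav and .flac clips are treated the same way as .mp3.
AUDIO_SUFFIXES = {".mp3", ".wav", ".flac"}


def missing_transcripts(data_dir: str) -> list[Path]:
    """Return audio files under data_dir that have no sibling .lab file."""
    root = Path(data_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.suffix.lower() in AUDIO_SUFFIXES
        and not p.with_suffix(".lab").exists()
    )
```

An empty list from `missing_transcripts("data")` means every clip is paired with a transcript.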

Important Notes

  1. Voice cloning: use 10-30 seconds of clear, noise-free reference audio and provide an accurate transcription
  2. Without reference audio, the voice tends to sound mechanical
  3. vLLM is easy to deploy; SGLang has better latency/throughput
  4. SGLang: BF16 RoPE precision must match training; if early EOS occurs, switch to FA3
  5. Fast AR torch.compile can achieve ~5x speedup
  6. Docker image does not include model weights; mount checkpoints
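The 10-30 second guideline for reference audio in note 1 can be checked before a clip is uploaded. A minimal standard-library sketch that handles uncompressed WAV input only (other formats would need conversion first, for example with ffmpeg); `reference_duration_ok` is a hypothetical helper:

```python
import wave


def reference_duration_ok(path, min_s: float = 10.0, max_s: float = 30.0) -> bool:
    """Return True if the WAV clip's duration is within [min_s, max_s] seconds.

    `path` may be a file path or an open binary file object.
    """
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    return min_s <= duration <= max_s
```

This checks length only; clarity and noise still have to be judged by ear.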
Security recommendations
This skill appears benign as documentation for running Fish Audio S2 Pro. Before installing, verify the external package/container/model sources, run setup in an isolated environment, keep the API bound to localhost unless you add access controls, and only upload voice samples with proper consent.
Functional analysis
Type: OpenClaw Skill
Name: fish-speech
Version: 1.0.0

The skill bundle provides comprehensive documentation and setup instructions for 'Fish Audio S2 Pro TTS', a high-performance text-to-speech system. The files (SKILL.md, install.md, api-reference.md) describe legitimate operations such as environment configuration via Conda/UV, model weight acquisition from Hugging Face, and server deployment using vLLM or Docker. Although the documentation contains future-dated references (e.g., arXiv 2026 and CUDA 12.9), which suggests it may be a synthetic or forward-looking example, the technical logic is entirely consistent with a standard machine learning tool and lacks any indicators of malicious intent, data exfiltration, or harmful prompt injection.
Capability tags
requires-sensitive-credentials
Capability assessment
Purpose & Capability
The TTS, voice cloning, streaming, and fine-tuning instructions match the stated Fish Audio S2 Pro purpose; voice cloning inherently handles sensitive voice samples, so consent and authorization matter.
Instruction Scope
The commands and API examples are explicit and user-directed. One server example binds to all network interfaces, so users should decide deliberately whether the service should be reachable beyond localhost.
Install Mechanism
There is no automatic install spec or shipped code; the documentation asks users to install external packages, Docker images, and Hugging Face model files that were not reviewed in this skill package.
Credentials
GPU, Docker, ffmpeg/sox, local HTTP APIs, and model downloads are proportionate for a local TTS system. The registry capability signal mentions sensitive credentials, but the provided docs do not show credential use.
Persistence & Privilege
Uploaded voice profiles are documented as persistent under the user's cache directory. System package installation may require elevated OS privileges, but it is presented as a user-run setup step rather than autonomous behavior.
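How to use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install fish-speech
  3. Once installed, invoke the Skill by name or trigger it with /fish-speech
  4. Provide the inputs described in the Skill's parameter documentation to get structured output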
Version history
v1.0.0
Initial release of fish-speech: Fish Audio S2 Pro TTS.
- Dual-AR TTS model supporting 80+ languages, 44.1kHz audio output, and 10 RVQ codebooks
- Multiple deployment options: vLLM, SGLang, Docker, native API server
- Supports voice cloning via reference audio and text; emotion tags and multi-speaker synthesis
- CLI and API (OpenAI-compatible) usage examples provided
- LoRA fine-tuning supported (Slow AR only, with guidance)
- Includes installation instructions, system requirements, and performance tips
Metadata
Slug fish-speech
Version 1.0.0
License MIT-0
Total installs 0
Current installs 0
Version count 1
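FAQ

What is Fish Audio S2 Pro TTS?

Fish Audio S2 Pro TTS is an AI Agent Skill plugin for Claude Code / OpenClaw. It has 114 total downloads to date.

How do I install Fish Audio S2 Pro TTS?

Run the command "/install fish-speech" in the OpenClaw or Claude Code chat box for a one-step install. No extra configuration is required.

Is Fish Audio S2 Pro TTS free?

Yes. Fish Audio S2 Pro TTS is completely free and released under the MIT-0 license, so it can be freely downloaded, installed, and used.

Which platforms does Fish Audio S2 Pro TTS support?

Fish Audio S2 Pro TTS is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Fish Audio S2 Pro TTS?

It is developed and maintained by OpenLark (@openlark). The current version is v1.0.0.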
