
Fish Audio S2 Pro TTS

Author: OpenLark · v1.0.0 · MIT-0

cross-platform ✓ Security check passed

114 total downloads

Install in OpenClaw

/install fish-speech

Description

Fish Audio S2 Pro TTS.

Usage (SKILL.md)

Fish Audio S2 Pro TTS

Dual-AR architecture (Slow AR 4B + Fast AR 400M), 10 RVQ codebooks, ~21 Hz frame rate, 80+ languages.

Model: fishaudio/s2-pro
Output: 44.1 kHz WAV/PCM mono
VRAM: ≥24GB for inference, A800/H200 recommended
Technical Report: arXiv 2603.08823 | Architecture
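
As a rough back-of-the-envelope sketch of what these figures imply, the snippet below converts an audio duration into a frame count, a number of codebook entries, and a raw PCM size. It assumes exactly 21 frames per second (the spec says ~21 Hz), one entry per codebook per frame, and 16-bit samples; the function name `estimate_sizes` is hypothetical and not part of fish-speech.

```python
FRAME_RATE_HZ = 21        # approximate; the spec above says ~21 Hz
NUM_CODEBOOKS = 10        # RVQ codebooks per frame
SAMPLE_RATE_HZ = 44_100   # output sample rate
BYTES_PER_SAMPLE = 2      # assumption: 16-bit PCM
CHANNELS = 1              # mono


def estimate_sizes(duration_s: float) -> dict:
    """Estimate token and byte counts for `duration_s` seconds of audio."""
    frames = round(duration_s * FRAME_RATE_HZ)
    samples = round(duration_s * SAMPLE_RATE_HZ)
    return {
        "frames": frames,
        "codebook_entries": frames * NUM_CODEBOOKS,
        "pcm_bytes": samples * BYTES_PER_SAMPLE * CHANNELS,
    }


print(estimate_sizes(10.0))
```

These are estimates only; the real frame rate is not exactly 21 Hz, so counts for long clips will drift.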

Installation

See references/install.md. Quick summary:

conda create -n fish-speech python=3.12 && conda activate fish-speech
pip install -e ".[cu129]"   # CUDA 12.9; quoted so shells like zsh do not glob the brackets
# or: uv sync --python 3.12 --extra cu129
# minimal: pip install fish-speech

apt install portaudio19-dev libsox-dev ffmpeg  # System dependencies
hf download fishaudio/s2-pro --local-dir checkpoints/s2-pro

Server Deployment

vLLM-Omni (recommended, OpenAI compatible):

pip install fish-speech
vllm serve fishaudio/s2-pro --omni --port 8091
# Endpoints: POST /v1/audio/speech, /v1/audio/speech/batch

SGLang-Omni (high-performance streaming):

sgl-omni serve --model-path fishaudio/s2-pro --config examples/configs/s2pro_tts.yaml --port 8000
# RTF 0.195, TTFA (time to first audio) ~100 ms, throughput 3000+ tokens/s
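
To make the RTF figure concrete: real-time factor is conventionally processing time divided by audio duration, so values below 1 mean faster than real time. The sketch below applies that definition to the quoted 0.195; the helper names are hypothetical, and actual timings depend on hardware and load.

```python
def synthesis_time_s(audio_duration_s: float, rtf: float = 0.195) -> float:
    """Seconds of compute needed to produce `audio_duration_s` seconds of audio."""
    return audio_duration_s * rtf


def speedup_over_realtime(rtf: float = 0.195) -> float:
    """How many seconds of audio are produced per second of compute."""
    return 1.0 / rtf


print(f"60 s of audio: about {synthesis_time_s(60):.1f} s of compute")
print(f"speed relative to real time: about {speedup_over_realtime():.1f}x")
```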

Docker:

docker compose --profile webui up    # Port 7860
COMPILE=1 docker compose --profile webui up  # ~10x speedup

Native API Server:

python tools/api_server.py --llama-checkpoint-path checkpoints/s2-pro --decoder-checkpoint-path checkpoints/s2-pro/codec.pth --listen 0.0.0.0:8080
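
Before sending requests, it can help to confirm that the chosen server is actually listening. The following is a minimal sketch using only the standard library; `is_port_open` is a hypothetical helper, and a successful TCP connection only shows that the port accepts connections, not that the model has finished loading.

```python
import socket


def is_port_open(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def report(host: str = "127.0.0.1") -> None:
    """Print the status of the ports used by the deployment options above."""
    ports = {"vLLM-Omni": 8091, "SGLang-Omni": 8000, "Docker WebUI": 7860, "native API": 8080}
    for name, port in ports.items():
        state = "open" if is_port_open(host, port) else "closed"
        print(f"{name:13s} port {port}: {state}")
```

Calling `report()` checks all four ports on localhost in turn.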

Raw CLI Inference (Three Steps)

# 1. Extract VQ tokens
python fish_speech/models/dac/inference.py -i "ref.wav" --checkpoint-path "checkpoints/s2-pro/codec.pth"
# 2. Generate semantic tokens
python fish_speech/models/text2semantic/inference.py --text "Text" --prompt-text "Reference text" --prompt-tokens "fake.npy"
# 3. Decode to audio
python fish_speech/models/dac/inference.py -i "codes_0.npy"
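
The three steps can also be chained from a script. The sketch below only assembles the same commands as argument lists and, optionally, runs them in order; it assumes it is launched from the repository root and reuses the file names `fake.npy` and `codes_0.npy` exactly as they appear in the commands above. `build_pipeline` and `run_pipeline` are hypothetical helpers, not part of fish-speech.

```python
import subprocess
import sys

CODEC = "checkpoints/s2-pro/codec.pth"


def build_pipeline(ref_wav: str, ref_text: str, text: str) -> list:
    """Return the three CLI steps as argument lists (no shell quoting needed)."""
    return [
        # 1. Extract VQ tokens from the reference audio
        [sys.executable, "fish_speech/models/dac/inference.py",
         "-i", ref_wav, "--checkpoint-path", CODEC],
        # 2. Generate semantic tokens for the target text
        [sys.executable, "fish_speech/models/text2semantic/inference.py",
         "--text", text, "--prompt-text", ref_text, "--prompt-tokens", "fake.npy"],
        # 3. Decode the generated tokens to audio
        [sys.executable, "fish_speech/models/dac/inference.py", "-i", "codes_0.npy"],
    ]


def run_pipeline(commands: list, dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


run_pipeline(build_pipeline("ref.wav", "Reference text", "Text"))
```

Passing argument lists instead of a shell string avoids quoting problems when the text contains spaces or quotes.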

API Calls

cURL

# Basic TTS
curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello."}' --output out.wav

# Voice cloning (vLLM)
curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Cloned voice.", "ref_audio": "https://...", "ref_text": "Reference transcription"}' --output cloned.wav

# Streaming PCM
curl -N -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Streaming.", "stream": true, "response_format": "pcm"}' | play -t raw -r 44100 -e signed -b 16 -c 1 -

# Batch
curl -X POST http://localhost:8091/v1/audio/speech/batch \
  -H "Content-Type: application/json" \
  -d '{"items": [{"input": "Sentence 1"}, {"input": "Sentence 2"}], "voice": "default"}'

Python

import requests

# Voice cloning via the REST endpoint
resp = requests.post("http://localhost:8091/v1/audio/speech", json={
    "input": "Hello.", "voice": "default",
    "ref_audio": "https://...", "ref_text": "Reference text"
}, timeout=300)
resp.raise_for_status()  # fail loudly instead of saving an error body as out.wav
with open("out.wav", "wb") as f:
    f.write(resp.content)

# OpenAI SDK
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8091/v1", api_key="none")
with client.audio.speech.with_streaming_response.create(
    model="fishaudio/s2-pro", voice="default", input="Hello."
) as response:
    response.stream_to_file("out.wav")
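
The streaming PCM request from the cURL section can be mirrored in Python. The sketch below uses only the standard library; it assumes the stream is headerless 16-bit mono PCM at 44.1 kHz (matching the `play` arguments in the cURL example) and wraps it in a WAV container. `stream_pcm` and `write_wav` are hypothetical helper names.

```python
import json
import urllib.request
import wave


def stream_pcm(url: str, text: str, chunk_size: int = 4096):
    """Yield raw PCM chunks from a streaming speech request."""
    body = json.dumps({"input": text, "stream": True, "response_format": "pcm"}).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    # urlopen raises urllib.error.HTTPError on non-2xx responses
    with urllib.request.urlopen(req, timeout=300) as resp:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            yield chunk


def write_wav(path: str, pcm_chunks, sample_rate: int = 44_100) -> int:
    """Wrap PCM chunks in a WAV container; return the number of PCM bytes written."""
    total = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # assumption: 16-bit samples
        wav.setframerate(sample_rate)
        for chunk in pcm_chunks:
            wav.writeframes(chunk)
            total += len(chunk)
    return total
```

With a server running, `write_wav("stream.wav", stream_pcm("http://localhost:8091/v1/audio/speech", "Streaming."))` would save the stream as it arrives.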

SGLang format: "references": [{"audio_path": "...", "text": "..."}]

Request Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `input` | string | Required | Text to synthesize |
| `voice` | string | `"default"` | Voice |
| `response_format` | string | `"wav"` | wav/mp3/flac/pcm/aac/opus |
| `speed` | float | `1.0` | Speech speed (0.25-4.0) |
| `stream` | bool | `false` | Streaming (requires `response_format="pcm"`) |
| `ref_audio` | string | `null` | Reference audio URL/base64/file:// |
| `ref_text` | string | `null` | Reference audio transcription |
| `max_new_tokens` | int | `2048` | Max generation tokens |
| `temperature` | float | `null` | Sampling temperature |
| `top_p` | float | `null` | Nucleus sampling |
| `top_k` | int | `null` | Top-K |
| `repetition_penalty` | float | `null` | Repetition penalty |
| `seed` | int | `null` | Random seed |
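
The constraints in this table (allowed formats, the speed range, and streaming requiring PCM) can be checked on the client before a request is sent. The following is a minimal sketch; `build_speech_request` is a hypothetical helper, and whether the server enforces the same rules is not stated here.

```python
ALLOWED_FORMATS = {"wav", "mp3", "flac", "pcm", "aac", "opus"}
OPTIONAL_FIELDS = {
    "ref_audio", "ref_text", "max_new_tokens", "temperature",
    "top_p", "top_k", "repetition_penalty", "seed",
}


def build_speech_request(text, voice="default", response_format="wav",
                         speed=1.0, stream=False, **optional):
    """Validate parameters against the table and return a JSON-ready payload."""
    if not text:
        raise ValueError("`input` is required")
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    if stream and response_format != "pcm":
        raise ValueError('stream=True requires response_format="pcm"')
    unknown = set(optional) - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    payload = {
        "input": text, "voice": voice, "response_format": response_format,
        "speed": speed, "stream": stream,
    }
    # Optional fields default to null, so leave them out unless a value is given.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


print(build_speech_request("Hello.", seed=42))
```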

Emotion Tags

Embed a [tag] anywhere in the text. More than 15,000 free-form tags are supported:

[excited]Today is a great day![pause] [whisper in small voice]But there's a secret…
[professional broadcast tone]Welcome.

Common: [excited] [angry] [sad] [whisper] [shouting] [laughing] [pause] [emphasis] [echo] [inhale] [sigh] [singing]

Full reference: references/emotion-tags.md
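
When preparing input programmatically, it can be useful to list the tags in a string or to recover the plain text, for example to log a transcript. The sketch below treats every non-nested `[...]` span as a tag, which is an assumption: literal square brackets in the text would be misread as tags. The helper names are hypothetical.

```python
import re

# Matches one bracketed tag such as [excited] or [whisper in small voice].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def extract_tags(text: str) -> list:
    """Return the tag names in order of appearance."""
    return TAG_PATTERN.findall(text)


def strip_tags(text: str) -> str:
    """Remove all tags and collapse the whitespace they leave behind."""
    without_tags = TAG_PATTERN.sub("", text)
    return re.sub(r"\s{2,}", " ", without_tags).strip()


sample = "[excited]Today is a great day![pause] [whisper in small voice]But there's a secret."
print(extract_tags(sample))
print(strip_tags(sample))
```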

Multi-Speaker

<|speaker:0|>Hello, welcome.
<|speaker:1|>Thank you, glad to be here.
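
A dialogue script in this form can be generated from structured data. The sketch below assumes the marker is the literal string `<|speaker:N|>` placed at the start of each turn, with one turn per line as in the example; `format_dialogue` is a hypothetical helper.

```python
def format_dialogue(turns) -> str:
    """Build multi-speaker input from (speaker_id, text) pairs, one turn per line."""
    lines = []
    for speaker_id, text in turns:
        if not isinstance(speaker_id, int) or speaker_id < 0:
            raise ValueError(f"speaker id must be a non-negative integer: {speaker_id!r}")
        lines.append(f"<|speaker:{speaker_id}|>{text.strip()}")
    return "\n".join(lines)


script = format_dialogue([
    (0, "Hello, welcome."),
    (1, "Thank you, glad to be here."),
])
print(script)
```

The resulting string would be sent as the `input` field of a speech request.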

LoRA Fine-tuning

⚠️ Fine-tuning is not recommended for models that have already been through RL. Fine-tune only the Slow AR:

# Preparation: data/SPK1/*.mp3 + *.lab
python tools/vqgan/extract_vq.py data --config-name modded_dac_vq --checkpoint-path checkpoints/openaudio-s1-mini/codec.pth
python tools/llama/build_dataset.py --input data --output data/protos
python fish_speech/train.py --config-name text2semantic_finetune project=my_project +lora@model.model.lora_config=r_8_alpha_16
python tools/llama/merge_lora.py --lora-config r_8_alpha_16 --base-weight checkpoints/openaudio-s1-mini --lora-weight results/my_project/checkpoints/step_xxx.ckpt --output checkpoints/merged/

See references/finetune.md
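
Since the preparation step expects every audio file to have a matching `.lab` transcript, a quick check before extracting VQ tokens can catch missing pairs. The following is a minimal sketch assuming the `data/SPK1/*.mp3 + *.lab` layout from the comment above, with each transcript sharing its audio file's base name; `find_missing_labels` is a hypothetical helper.

```python
from pathlib import Path


def find_missing_labels(data_dir, audio_suffix: str = ".mp3") -> list:
    """Return audio files under data_dir that have no sibling .lab transcript."""
    missing = []
    for audio in sorted(Path(data_dir).rglob(f"*{audio_suffix}")):
        if not audio.with_suffix(".lab").is_file():
            missing.append(audio)
    return missing


def report_missing(data_dir="data") -> None:
    """Print each unpaired audio file, or a confirmation when all are paired."""
    missing = find_missing_labels(data_dir)
    if not missing:
        print("every audio file has a .lab transcript")
    for path in missing:
        print(f"missing transcript for {path}")
```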

Important Notes

Voice cloning: use 10-30 seconds of clear, noise-free reference audio and provide an accurate transcription.
Without reference audio, the voice tends to sound mechanical.
vLLM is easier to deploy; SGLang offers better latency and throughput.
SGLang: BF16 RoPE precision must match training; if early EOS occurs, switch to FA3.
Applying torch.compile to the Fast AR can yield a ~5x speedup.
The Docker image does not include model weights; mount your checkpoints into the container.
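
The 10-30 second guideline for reference audio can be checked before a clip is uploaded. The sketch below uses the standard-library `wave` module, so it only handles uncompressed PCM WAV files; other formats would need converting first (for example with ffmpeg). The helper names are hypothetical.

```python
import wave


def wav_duration_s(path: str) -> float:
    """Duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_good_reference_length(path: str, min_s: float = 10.0, max_s: float = 30.0) -> bool:
    """True if the clip falls inside the recommended 10-30 second window."""
    return min_s <= wav_duration_s(path) <= max_s
```

This checks length only; clarity and background noise still have to be judged by ear.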

Safe usage recommendations

This skill appears benign as documentation for running Fish Audio S2 Pro. Before installing, verify the external package/container/model sources, run setup in an isolated environment, keep the API bound to localhost unless you add access controls, and only upload voice samples with proper consent.

Functional analysis

Type: OpenClaw Skill
Name: fish-speech
Version: 1.0.0

The skill bundle provides comprehensive documentation and setup instructions for 'Fish Audio S2 Pro TTS', a high-performance text-to-speech system. The files (SKILL.md, install.md, api-reference.md) describe legitimate operations such as environment configuration via Conda/UV, model weight acquisition from Hugging Face, and server deployment using vLLM or Docker. Although the documentation contains future-dated references (e.g., arXiv 2026 and CUDA 12.9), which suggests it may be a synthetic or forward-looking example, the technical logic is entirely consistent with a standard machine learning tool and lacks any indicators of malicious intent, data exfiltration, or harmful prompt injection.

Capability tags

requires-sensitive-credentials

Capability assessment

ℹ Purpose & Capability

The TTS, voice cloning, streaming, and fine-tuning instructions match the stated Fish Audio S2 Pro purpose; voice cloning inherently handles sensitive voice samples, so consent and authorization matter.

ℹ Instruction Scope

The commands and API examples are explicit and user-directed. One server example binds to all network interfaces, so users should decide deliberately whether the service should be reachable beyond localhost.

ℹ Install Mechanism

There is no automatic install spec or shipped code; the documentation asks users to install external packages, Docker images, and Hugging Face model files that were not reviewed in this skill package.

ℹ Credentials

GPU, Docker, ffmpeg/sox, local HTTP APIs, and model downloads are proportionate for a local TTS system. The registry capability signal mentions sensitive credentials, but the provided docs do not show credential use.

ℹ Persistence & Privilege

Uploaded voice profiles are documented as persistent under the user's cache directory. System package installation may require elevated OS privileges, but it is presented as a user-run setup step rather than autonomous behavior.

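How to use

Make sure OpenClaw is installed (locally or via Docker).
Enter the install command in the chat box: /install fish-speech
Once installed, call the Skill by name or trigger it with /fish-speech
Provide the inputs described in the Skill's parameter documentation to get structured output.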

Version history

v1.0.0

Initial release of fish-speech: Fish Audio S2 Pro TTS.
- Dual-AR TTS model supporting 80+ languages, 44.1 kHz audio output, and 10 RVQ codebooks
- Multiple deployment options: vLLM, SGLang, Docker, native API server
- Supports voice cloning via reference audio and text; emotion tags and multi-speaker synthesis
- CLI and API (OpenAI-compatible) usage examples provided
- LoRA fine-tuning supported (Slow AR only, with guidance)
- Includes installation instructions, system requirements, and performance tips

Metadata

Slug fish-speech

Version 1.0.0

License MIT-0

Total installs 0

Current installs 0

Number of versions 1

FAQ