/install fish-speech
Fish Audio S2 Pro TTS
Dual-AR architecture (Slow AR 4B + Fast AR 400M), 10 RVQ codebooks, ~21 Hz frame rate, 80+ languages.
- Model: fishaudio/s2-pro
- Output: 44.1 kHz WAV/PCM mono
- VRAM: ≥24GB for inference, A800/H200 recommended
- Technical Report: arXiv 2603.08823 | Architecture
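To make the codec figures above concrete, the sketch below estimates how many frames and codebook entries a clip of a given length corresponds to. It treats the nominal ~21 Hz frame rate as exact, which it is not, and `codec_budget` is a hypothetical helper name, not part of fish-speech.

```python
FRAME_RATE_HZ = 21        # nominal frame rate quoted above (approximate)
NUM_CODEBOOKS = 10        # RVQ codebooks per frame
SAMPLE_RATE_HZ = 44_100   # output sample rate

def codec_budget(seconds: float) -> dict:
    """Rough frame/code/sample counts for `seconds` of audio (illustrative only)."""
    frames = round(seconds * FRAME_RATE_HZ)
    return {
        "frames": frames,
        "codes": frames * NUM_CODEBOOKS,
        "samples": round(seconds * SAMPLE_RATE_HZ),
        "samples_per_frame": SAMPLE_RATE_HZ // FRAME_RATE_HZ,
    }

print(codec_budget(10))
```

Under these assumptions, each frame covers about 2100 output samples, which is a useful sanity check when sizing generation limits.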
Installation
See references/install.md. Quick summary:
conda create -n fish-speech python=3.12 && conda activate fish-speech
pip install -e .[cu129] # CUDA 12.9
# or: uv sync --python 3.12 --extra cu129
# minimal: pip install fish-speech
apt install portaudio19-dev libsox-dev ffmpeg # System dependencies
hf download fishaudio/s2-pro --local-dir checkpoints/s2-pro
Server Deployment
vLLM-Omni (recommended, OpenAI compatible):
pip install fish-speech
vllm serve fishaudio/s2-pro --omni --port 8091
# Endpoints: POST /v1/audio/speech, /v1/audio/speech/batch
SGLang-Omni (high-performance streaming):
sgl-omni serve --model-path fishaudio/s2-pro --config examples/configs/s2pro_tts.yaml --port 8000
# Reported: RTF 0.195, TTFA (time to first audio) ~100 ms, throughput 3000+ tokens/s
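The real-time factor (RTF) is synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. As a rough illustration, assuming the quoted RTF of 0.195 holds for your hardware and text (it may not), the hypothetical helper below converts it into wall-clock time:

```python
def synthesis_seconds(audio_seconds: float, rtf: float = 0.195) -> float:
    """Estimated compute time for `audio_seconds` of speech at a given RTF."""
    return audio_seconds * rtf

# One minute of speech at the quoted RTF
print(round(synthesis_seconds(60.0), 2))
```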
Docker:
docker compose --profile webui up # Port 7860
COMPILE=1 docker compose --profile webui up # ~10x speedup
Native API Server:
python tools/api_server.py --llama-checkpoint-path checkpoints/s2-pro --decoder-checkpoint-path checkpoints/s2-pro/codec.pth --listen 0.0.0.0:8080
Raw CLI Inference (Three Steps)
# 1. Extract VQ tokens from the reference audio (writes fake.npy)
python fish_speech/models/dac/inference.py -i "ref.wav" --checkpoint-path "checkpoints/s2-pro/codec.pth"
# 2. Generate semantic tokens, using the step 1 output as the prompt
python fish_speech/models/text2semantic/inference.py --text "Text" --prompt-text "Reference text" --prompt-tokens "fake.npy"
# 3. Decode the generated tokens to audio
python fish_speech/models/dac/inference.py -i "codes_0.npy"
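The three steps can also be driven from a script. The following is a minimal sketch that only builds the argument lists. It reuses the script paths, flags, and the `fake.npy` / `codes_0.npy` file names from the commands above. Where each step actually writes its output is assumed, and `three_step_commands` is a hypothetical helper.

```python
import shlex

CODEC = "checkpoints/s2-pro/codec.pth"

def three_step_commands(ref_wav: str, ref_text: str, text: str) -> list[list[str]]:
    """Build argv lists for the three CLI steps (nothing is executed here)."""
    return [
        # 1. Reference audio -> VQ tokens
        ["python", "fish_speech/models/dac/inference.py",
         "-i", ref_wav, "--checkpoint-path", CODEC],
        # 2. Text + reference prompt -> semantic tokens
        ["python", "fish_speech/models/text2semantic/inference.py",
         "--text", text, "--prompt-text", ref_text, "--prompt-tokens", "fake.npy"],
        # 3. Semantic tokens -> audio
        ["python", "fish_speech/models/dac/inference.py", "-i", "codes_0.npy"],
    ]

for argv in three_step_commands("ref.wav", "Reference text", "Text"):
    print(shlex.join(argv))
    # To run a step for real: subprocess.run(argv, check=True)
```

Passing argument lists rather than one shell string avoids quoting problems when the text contains spaces or quotes.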
API Calls
cURL
# Basic TTS
curl -X POST http://localhost:8091/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"input": "Hello."}' --output out.wav
# Voice cloning (vLLM)
curl -X POST http://localhost:8091/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"input": "Cloned voice.", "ref_audio": "https://...", "ref_text": "Reference transcription"}' --output cloned.wav
# Streaming PCM
curl -N -X POST http://localhost:8091/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"input": "Streaming.", "stream": true, "response_format": "pcm"}' --no-buffer | play -t raw -r 44100 -e signed -b 16 -c 1 -
# Batch
curl -X POST http://localhost:8091/v1/audio/speech/batch \
-H "Content-Type: application/json" \
-d '{"items": [{"input": "Sentence 1"}, {"input": "Sentence 2"}], "voice": "default"}'
Python
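import requests

# Voice-cloning request; ref_audio and ref_text are optional
resp = requests.post("http://localhost:8091/v1/audio/speech", json={
    "input": "Hello.", "voice": "default",
    "ref_audio": "https://...", "ref_text": "Reference text"
})
resp.raise_for_status()  # avoid writing an error response to out.wav
with open("out.wav", "wb") as f:
    f.write(resp.content)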
# OpenAI SDK
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8091/v1", api_key="none")
client.audio.speech.create(model="fishaudio/s2-pro", voice="default", input="Hello.").stream_to_file("out.wav")
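Streamed PCM has no header, so it has to be wrapped before most players will open it. The sketch below writes an iterable of raw chunks to a WAV file with the standard-library `wave` module. It assumes the stream is 16-bit signed mono at 44.1 kHz, matching the `play` flags in the cURL example. `pcm_chunks_to_wav` is a hypothetical helper.

```python
import wave
from typing import Iterable

def pcm_chunks_to_wav(chunks: Iterable[bytes], path: str,
                      sample_rate: int = 44_100) -> int:
    """Write raw 16-bit mono PCM chunks to a WAV file; return PCM bytes written."""
    total = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)       # mono
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(sample_rate)
        for chunk in chunks:
            if chunk:             # skip keep-alive/empty chunks
                wav.writeframes(chunk)
                total += len(chunk)
    return total

# Usage against a running server (not executed here):
# import requests
# with requests.post("http://localhost:8091/v1/audio/speech",
#                    json={"input": "Streaming.", "stream": True,
#                          "response_format": "pcm"},
#                    stream=True) as resp:
#     resp.raise_for_status()
#     pcm_chunks_to_wav(resp.iter_content(chunk_size=4096), "stream.wav")
```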
SGLang format: "references": [{"audio_path": "...", "text": "..."}]
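If the same client has to talk to both servers, the reference fields need to be reshaped. The sketch below moves `ref_audio` / `ref_text` into a `references` list as described above. It assumes all other fields carry over unchanged, which is not confirmed here, and `to_sglang_payload` is a hypothetical helper.

```python
def to_sglang_payload(vllm_payload: dict) -> dict:
    """Copy a vLLM-style request, moving ref_audio/ref_text into "references"."""
    payload = {k: v for k, v in vllm_payload.items()
               if k not in ("ref_audio", "ref_text")}
    ref_audio = vllm_payload.get("ref_audio")
    if ref_audio is not None:
        payload["references"] = [{
            "audio_path": ref_audio,
            "text": vllm_payload.get("ref_text") or "",
        }]
    return payload

print(to_sglang_payload({"input": "Hi", "ref_audio": "ref.wav", "ref_text": "hello"}))
```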
Request Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| input | string | Required | Text to synthesize |
| voice | string | "default" | Voice |
| response_format | string | "wav" | wav/mp3/flac/pcm/aac/opus |
| speed | float | 1.0 | Speech speed (0.25-4.0) |
| stream | bool | false | Streaming (requires response_format="pcm") |
| ref_audio | string | null | Reference audio URL/base64/file:// |
| ref_text | string | null | Reference audio transcription |
| max_new_tokens | int | 2048 | Max generation tokens |
| temperature | float | null | Sampling temperature |
| top_p | float | null | Nucleus sampling |
| top_k | int | null | Top-K |
| repetition_penalty | float | null | Repetition penalty |
| seed | int | null | Random seed |
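The constraints in the table (allowed formats, the speed range, and streaming requiring PCM) are easy to check on the client before a request is sent. The following is a minimal sketch under the assumption that the table is complete and accurate. `build_speech_request` is a hypothetical helper, and the server may validate differently.

```python
ALLOWED_FORMATS = {"wav", "mp3", "flac", "pcm", "aac", "opus"}

def build_speech_request(text: str, *, voice: str = "default",
                         response_format: str = "wav", speed: float = 1.0,
                         stream: bool = False, **sampling) -> dict:
    """Validate arguments against the parameter table; return a JSON-ready dict."""
    if not text:
        raise ValueError("input is required")
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    if stream and response_format != "pcm":
        raise ValueError('stream=True requires response_format="pcm"')
    payload = {"input": text, "voice": voice, "response_format": response_format,
               "speed": speed, "stream": stream}
    # Optional fields (temperature, top_p, seed, ...) default to null,
    # so only the ones explicitly set are included.
    payload.update({k: v for k, v in sampling.items() if v is not None})
    return payload

print(build_speech_request("Hello.", speed=1.25, seed=7))
```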
Emotion Tags
Embed [tag] anywhere in the text; 15,000+ free-form tags are supported:
[excited]Today is a great day![pause] [whisper in small voice]But there's a secret…
[professional broadcast tone]Welcome.
Common: [excited] [angry] [sad] [whisper] [shouting] [laughing] [pause] [emphasis] [echo] [inhale] [sigh] [singing]
Full reference: references/emotion-tags.md
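When logging or previewing requests, it can help to see which tags a piece of text contains and what the untagged text looks like. The sketch below does this with a simple regular expression. The pattern is an assumption about the bracket syntax shown above and may not match how the model or server actually parses tags.

```python
import re

# Assumed syntax: anything between one pair of square brackets is a tag
TAG_RE = re.compile(r"\[([^\[\]]+)\]")

def split_tags(text: str) -> tuple[list[str], str]:
    """Return (tags in order of appearance, text with tags removed)."""
    tags = TAG_RE.findall(text)
    plain = TAG_RE.sub("", text)
    return tags, " ".join(plain.split())  # collapse leftover whitespace

tags, plain = split_tags(
    "[excited]Today is a great day![pause] [whisper in small voice]But there's a secret…"
)
print(tags)
print(plain)
```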
Multi-Speaker
<|speaker:0|>Hello, welcome.
<|speaker:1|>Thank you, glad to be here.
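Dialogue scripts are usually generated from structured data rather than typed by hand. The following is a small sketch that renders (speaker index, line) pairs with the `<|speaker:N|>` prefix, one turn per line. `format_dialogue` is a hypothetical helper, and joining turns with newlines is an assumption based on the example above.

```python
def format_dialogue(turns: list[tuple[int, str]]) -> str:
    """Render (speaker_index, line) pairs as one speaker-tagged script."""
    return "\n".join(f"<|speaker:{idx}|>{line.strip()}" for idx, line in turns)

script = format_dialogue([
    (0, "Hello, welcome."),
    (1, "Thank you, glad to be here."),
])
print(script)
```

The resulting string can be sent as the `input` field of a speech request.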
LoRA Fine-tuning
⚠️ Fine-tuning is not recommended for models that have already been through RL. Fine-tune only the Slow AR:
# Preparation: data/SPK1/*.mp3 + *.lab
python tools/vqgan/extract_vq.py data --config-name modded_dac_vq --checkpoint-path checkpoints/openaudio-s1-mini/codec.pth
python tools/llama/build_dataset.py --input data --output data/protos
python fish_speech/train.py --config-name text2semantic_finetune project=my_project +lora@model.model.lora_config=r_8_alpha_16
python tools/llama/merge_lora.py --lora-config r_8_alpha_16 --base-weight checkpoints/openaudio-s1-mini --lora-weight results/my_project/checkpoints/step_xxx.ckpt --output checkpoints/merged/
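Since the preparation step pairs each audio file with a `.lab` transcript, it is worth checking for missing transcripts before extracting tokens. The sketch below assumes the `data/SPK1/*.mp3` + `*.lab` layout from the comment above, with each transcript sharing its audio file's base name. `missing_labels` is a hypothetical helper.

```python
from pathlib import Path

def missing_labels(data_dir: str, audio_ext: str = ".mp3") -> list[Path]:
    """List audio files under data_dir that have no sibling .lab transcript."""
    return sorted(
        audio for audio in Path(data_dir).rglob(f"*{audio_ext}")
        if not audio.with_suffix(".lab").exists()
    )

# Example: report problems before running extract_vq.py
for path in missing_labels("data"):
    print(f"missing transcript for {path}")
```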
Important Notes
- Voice cloning: use 10-30 seconds of clear, noise-free reference audio and provide an accurate transcription
- Without reference audio, the voice tends to sound mechanical
- vLLM is easier to deploy; SGLang has better latency and throughput
- SGLang: BF16 RoPE precision must match training; if generation stops early (premature EOS), switch to FA3
- Fast AR torch.compile can achieve ~5x speedup
- Docker image does not include model weights; mount checkpoints
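The 10-30 second guideline for reference audio can be checked automatically before a clip is used for cloning. The following is a minimal sketch using the standard-library `wave` module, so it only handles uncompressed WAV files. `reference_duration_ok` is a hypothetical helper and the bounds are simply the ones from the note above.

```python
import wave

def reference_duration_ok(path: str, min_s: float = 10.0, max_s: float = 30.0) -> bool:
    """True if the WAV file's duration falls inside the recommended window."""
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    return min_s <= duration <= max_s
```

A typical use is to call `reference_duration_ok("ref.wav")` and skip or trim the clip when it returns `False`.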
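- Make sure OpenClaw is installed (local or Docker deployment)
- Enter the install command in the chat box: /install fish-speech
- Once installed, invoke the Skill by name or trigger it with /fish-speech
- Provide the required inputs described in the Skill's parameters to get structured output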
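What is Fish Audio S2 Pro TTS?
Fish Audio S2 Pro TTS is an AI Agent Skill plugin for Claude Code / OpenClaw, with 114 downloads to date.
How do I install Fish Audio S2 Pro TTS?
Run the command "/install fish-speech" in the OpenClaw or Claude Code chat box for a one-step install; no extra configuration is needed.
Is Fish Audio S2 Pro TTS free?
Yes. Fish Audio S2 Pro TTS is completely free under the MIT-0 license and may be freely downloaded, installed, and used.
Which platforms does Fish Audio S2 Pro TTS support?
Fish Audio S2 Pro TTS is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.
Who develops Fish Audio S2 Pro TTS?
It is developed and maintained by OpenLark (@openlark); the current version is v1.0.0.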