/install fish-speech
Fish Audio S2 Pro TTS
Dual-AR architecture (Slow AR 4B + Fast AR 400M), 10 RVQ codebooks, ~21 Hz frame rate, 80+ languages.
- Model: fishaudio/s2-pro
- Output: 44.1 kHz WAV/PCM mono
- VRAM: ≥24GB for inference, A800/H200 recommended
- Technical Report: arXiv 2603.08823 | Architecture
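The frame rate and codebook count above determine how many codec tokens a clip of a given length occupies. The following is a rough arithmetic sketch only; `codec_budget` is a hypothetical helper, and the "~21 Hz" figure is approximate, so treat the results as estimates.

```python
FRAME_RATE_HZ = 21   # approximate frame rate quoted above (~21 Hz)
NUM_CODEBOOKS = 10   # RVQ codebooks per frame


def codec_budget(audio_seconds: float) -> tuple[int, int]:
    """Return (frames, codes) needed to represent audio of the given length."""
    frames = round(audio_seconds * FRAME_RATE_HZ)
    return frames, frames * NUM_CODEBOOKS


frames, codes = codec_budget(10.0)
print(frames, codes)  # 210 2100
```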
Installation
See references/install.md. Quick summary:
conda create -n fish-speech python=3.12 && conda activate fish-speech
pip install -e .[cu129] # CUDA 12.9
# or: uv sync --python 3.12 --extra cu129
# minimal: pip install fish-speech
apt install portaudio19-dev libsox-dev ffmpeg # System dependencies
hf download fishaudio/s2-pro --local-dir checkpoints/s2-pro
Server Deployment
vLLM-Omni (recommended, OpenAI compatible):
pip install fish-speech
vllm serve fishaudio/s2-pro --omni --port 8091
# Endpoints: POST /v1/audio/speech, /v1/audio/speech/batch
SGLang-Omni (high-performance streaming):
sgl-omni serve --model-path fishaudio/s2-pro --config examples/configs/s2pro_tts.yaml --port 8000
# RTF 0.195, TTFA ~100ms, throughput 3000+ t/s
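RTF (real-time factor) is synthesis time divided by the duration of the audio produced, so the quoted figure can be turned into a rough wall-clock estimate. This is a minimal sketch; `estimated_synthesis_seconds` is a hypothetical helper, and real timings depend on hardware, batch size, and text.

```python
def estimated_synthesis_seconds(audio_seconds: float, rtf: float = 0.195) -> float:
    """Estimate wall-clock synthesis time from audio length and real-time factor."""
    return audio_seconds * rtf


# One minute of speech at the quoted RTF takes roughly 12 seconds to generate.
print(round(estimated_synthesis_seconds(60.0), 1))
```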
Docker:
docker compose --profile webui up # Port 7860
COMPILE=1 docker compose --profile webui up # ~10x speedup
Native API Server:
python tools/api_server.py --llama-checkpoint-path checkpoints/s2-pro --decoder-checkpoint-path checkpoints/s2-pro/codec.pth --listen 0.0.0.0:8080
Raw CLI Inference (Three Steps)
# 1. Extract VQ tokens
python fish_speech/models/dac/inference.py -i "ref.wav" --checkpoint-path "checkpoints/s2-pro/codec.pth"
# 2. Generate semantic tokens
python fish_speech/models/text2semantic/inference.py --text "Text" --prompt-text "Reference text" --prompt-tokens "fake.npy"
# 3. Decode to audio
python fish_speech/models/dac/inference.py -i "codes_0.npy"
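The three steps above can be chained from Python. This is a sketch under assumptions: it must be run from the repository root, and it takes the intermediate file names (`fake.npy`, `codes_0.npy`) from the commands above rather than from any documented contract. `build_pipeline` and `run_pipeline` are hypothetical helper names.

```python
import subprocess
import sys

CODEC = "checkpoints/s2-pro/codec.pth"
DAC_SCRIPT = "fish_speech/models/dac/inference.py"
T2S_SCRIPT = "fish_speech/models/text2semantic/inference.py"


def build_pipeline(ref_wav: str, ref_text: str, text: str) -> list[list[str]]:
    """Return the three CLI steps as argument lists, in execution order."""
    return [
        # 1. Extract VQ tokens from the reference audio
        [sys.executable, DAC_SCRIPT, "-i", ref_wav, "--checkpoint-path", CODEC],
        # 2. Generate semantic tokens (assumes step 1 wrote fake.npy)
        [sys.executable, T2S_SCRIPT, "--text", text,
         "--prompt-text", ref_text, "--prompt-tokens", "fake.npy"],
        # 3. Decode to audio (assumes step 2 wrote codes_0.npy)
        [sys.executable, DAC_SCRIPT, "-i", "codes_0.npy"],
    ]


def run_pipeline(ref_wav: str, ref_text: str, text: str) -> None:
    """Run each step in turn, stopping at the first one that fails."""
    for cmd in build_pipeline(ref_wav, ref_text, text):
        subprocess.run(cmd, check=True)
```

Building the commands as lists avoids shell quoting problems when the text contains spaces or quotes.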
API Calls
cURL
# Basic TTS
curl -X POST http://localhost:8091/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"input": "Hello."}' --output out.wav
# Voice cloning (vLLM)
curl -X POST http://localhost:8091/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"input": "Cloned voice.", "ref_audio": "https://...", "ref_text": "Reference transcription"}' --output cloned.wav
# Streaming PCM
curl -N -X POST http://localhost:8091/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"input": "Streaming.", "stream": true, "response_format": "pcm"}' | play -t raw -r 44100 -e signed -b 16 -c 1 -
# Batch
curl -X POST http://localhost:8091/v1/audio/speech/batch \
-H "Content-Type: application/json" \
-d '{"items": [{"input": "Sentence 1"}, {"input": "Sentence 2"}], "voice": "default"}'
Python
import requests

resp = requests.post("http://localhost:8091/v1/audio/speech", json={
    "input": "Hello.", "voice": "default",
    "ref_audio": "https://...", "ref_text": "Reference text",
})
resp.raise_for_status()  # fail loudly instead of saving an error body as audio
with open("out.wav", "wb") as f:
    f.write(resp.content)
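# OpenAI SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8091/v1", api_key="none")
# with_streaming_response replaces the deprecated direct .stream_to_file() call
with client.audio.speech.with_streaming_response.create(
    model="fishaudio/s2-pro", voice="default", input="Hello."
) as response:
    response.stream_to_file("out.wav")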
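The streaming cURL call shown earlier can also be consumed from Python. The sketch below uses only the standard library and wraps the raw PCM in a WAV container. It assumes the stream is 16-bit signed mono at 44.1 kHz, as the `play` flags in that example imply; `write_pcm_as_wav` and `stream_speech` are hypothetical helper names.

```python
import json
import urllib.request
import wave

SAMPLE_RATE = 44100  # 44.1 kHz mono, per the output spec
SAMPLE_WIDTH = 2     # bytes per sample (16-bit signed), an assumption from the play flags


def write_pcm_as_wav(chunks, path: str) -> int:
    """Write an iterable of raw PCM byte chunks to a WAV file; return bytes written."""
    total = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        for chunk in chunks:
            wav.writeframes(chunk)
            total += len(chunk)
    return total


def stream_speech(text: str,
                  url: str = "http://localhost:8091/v1/audio/speech",
                  chunk_size: int = 4096):
    """Yield PCM chunks from the streaming endpoint as they arrive."""
    body = json.dumps(
        {"input": text, "stream": True, "response_format": "pcm"}
    ).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        while chunk := resp.read(chunk_size):
            yield chunk


# Usage (requires a running server):
# write_pcm_as_wav(stream_speech("Streaming."), "stream.wav")
```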
SGLang format: "references": [{"audio_path": "...", "text": "..."}]
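Because the two servers take reference audio in different shapes, a small helper can hide the difference. This is a sketch only: `build_clone_payload` is a hypothetical name, and it assumes the SGLang request uses the same `input` field as vLLM, which the line above does not confirm.

```python
def build_clone_payload(text: str, ref_audio: str, ref_text: str,
                        backend: str = "vllm") -> dict:
    """Build a voice-cloning request body for the chosen backend."""
    payload = {"input": text}  # assumed to be shared by both backends
    if backend == "vllm":
        payload["ref_audio"] = ref_audio
        payload["ref_text"] = ref_text
    elif backend == "sglang":
        payload["references"] = [{"audio_path": ref_audio, "text": ref_text}]
    else:
        raise ValueError(f"unknown backend: {backend!r}")
    return payload
```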
Request Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| input | string | Required | Text to synthesize |
| voice | string | "default" | Voice |
| response_format | string | "wav" | wav/mp3/flac/pcm/aac/opus |
| speed | float | 1.0 | Speech speed (0.25-4.0) |
| stream | bool | false | Streaming (requires response_format="pcm") |
| ref_audio | string | null | Reference audio URL/base64/file:// |
| ref_text | string | null | Reference audio transcription |
| max_new_tokens | int | 2048 | Max generation tokens |
| temperature | float | null | Sampling temperature |
| top_p | float | null | Nucleus sampling |
| top_k | int | null | Top-K |
| repetition_penalty | float | null | Repetition penalty |
| seed | int | null | Random seed |
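The constraints in the table can be checked on the client before a request is sent. The following is a minimal sketch covering only the rules stated above (required input, allowed formats, speed range, streaming requires PCM). `validate_request` is a hypothetical helper, not part of any server API, and the server may enforce further rules.

```python
ALLOWED_FORMATS = {"wav", "mp3", "flac", "pcm", "aac", "opus"}


def validate_request(params: dict) -> list[str]:
    """Return a list of problems found in a request body (empty if none)."""
    errors = []
    text = params.get("input")
    if not isinstance(text, str) or not text:
        errors.append("input is required and must be a non-empty string")
    fmt = params.get("response_format", "wav")
    if fmt not in ALLOWED_FORMATS:
        errors.append(f"unsupported response_format: {fmt!r}")
    speed = params.get("speed", 1.0)
    if not 0.25 <= speed <= 4.0:
        errors.append("speed must be between 0.25 and 4.0")
    if params.get("stream", False) and fmt != "pcm":
        errors.append('stream=true requires response_format="pcm"')
    return errors


print(validate_request({"input": "Hello.", "stream": True}))
```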
Emotion Tags
Embed [tag] anywhere in the text. More than 15,000 free-form tags are supported:
[excited]Today is a great day![pause] [whisper in small voice]But there's a secret…
[professional broadcast tone]Welcome.
Common: [excited] [angry] [sad] [whisper] [shouting] [laughing] [pause] [emphasis] [echo] [inhale] [sigh] [singing]
Full reference: references/emotion-tags.md
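When text is assembled programmatically, it helps to see which tags a string carries and what the plain text looks like without them. This sketch assumes a tag is any bracketed span with no nested brackets; `extract_tags` and `strip_tags` are hypothetical helpers and do not reflect how the model itself parses tags.

```python
import re

# A tag is text between square brackets, with no nested brackets (assumption).
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def extract_tags(text: str) -> list[str]:
    """Return the tag names in the order they appear."""
    return TAG_PATTERN.findall(text)


def strip_tags(text: str) -> str:
    """Remove tags and collapse any doubled whitespace left behind."""
    without_tags = TAG_PATTERN.sub("", text)
    return re.sub(r"\s{2,}", " ", without_tags).strip()


sample = "[excited]Hello![pause] [whisper]Secret."
print(extract_tags(sample))  # ['excited', 'pause', 'whisper']
print(strip_tags(sample))    # Hello! Secret.
```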
Multi-Speaker
<|speaker:0|>Hello, welcome.
<|speaker:1|>Thank you, glad to be here.
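A dialogue in this format can be generated from structured turns. This is a minimal sketch: `build_dialogue` is a hypothetical helper, and it assumes one `<|speaker:N|>` marker per line, as in the example above.

```python
def build_dialogue(turns: list[tuple[int, str]]) -> str:
    """Join (speaker_id, text) pairs into a multi-speaker prompt, one turn per line."""
    return "\n".join(f"<|speaker:{speaker}|>{text}" for speaker, text in turns)


script = build_dialogue([
    (0, "Hello, welcome."),
    (1, "Thank you, glad to be here."),
])
print(script)
```

The resulting string would be passed as the `input` field of a speech request.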
LoRA Fine-tuning
⚠️ Fine-tuning is not recommended for models that have been through RL. Fine-tune only the Slow AR:
# Preparation: data/SPK1/*.mp3 + *.lab
python tools/vqgan/extract_vq.py data --config-name modded_dac_vq --checkpoint-path checkpoints/openaudio-s1-mini/codec.pth
python tools/llama/build_dataset.py --input data --output data/protos
python fish_speech/train.py --config-name text2semantic_finetune project=my_project +lora@model.model.lora_config=r_8_alpha_16
python tools/llama/merge_lora.py --lora-config r_8_alpha_16 --base-weight checkpoints/openaudio-s1-mini --lora-weight results/my_project/checkpoints/step_xxx.ckpt --output checkpoints/merged/
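Before running the extraction step, it can be worth confirming that every audio file has a matching `.lab` transcript, as the preparation comment above requires. This is a sketch: `missing_labels` is a hypothetical helper, and accepting `.wav` and `.flac` alongside `.mp3` is an assumption, since only `.mp3` is named above.

```python
from pathlib import Path

# .mp3 comes from the layout above; .wav and .flac are assumed to be accepted too.
AUDIO_EXTS = {".mp3", ".wav", ".flac"}


def missing_labels(data_dir: str) -> list[Path]:
    """Return audio files under data_dir that lack a sibling .lab transcript."""
    root = Path(data_dir)
    return sorted(
        path
        for path in root.rglob("*")
        if path.suffix.lower() in AUDIO_EXTS
        and not path.with_suffix(".lab").exists()
    )


# Usage: list offending files before extracting VQ tokens.
# for path in missing_labels("data"):
#     print("no transcript for", path)
```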
Important Notes
- Voice cloning: use 10-30 seconds of clear, noise-free reference audio and provide an accurate transcription
- Without reference audio, voice tends to sound mechanical
- vLLM is easy to deploy; SGLang has better latency/throughput
- SGLang: BF16 RoPE precision must match training; if early EOS occurs, switch to FA3
- Fast AR torch.compile can achieve ~5x speedup
- Docker image does not include model weights; mount checkpoints
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat: /install fish-speech
- After installation, invoke the skill by name or use /fish-speech
- Provide required inputs per the skill's parameter spec and get structured output
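The 10-30 second guideline for reference audio in the notes above can be checked before a clip is uploaded. This sketch uses the standard-library `wave` module, so it handles only uncompressed WAV files; `reference_duration_ok` is a hypothetical helper.

```python
import wave


def reference_duration_ok(path: str, min_s: float = 10.0,
                          max_s: float = 30.0) -> tuple[bool, float]:
    """Return (within_range, duration_seconds) for a WAV reference clip."""
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    return min_s <= duration <= max_s, duration


# Usage:
# ok, seconds = reference_duration_ok("ref.wav")
# if not ok:
#     print(f"reference clip is {seconds:.1f}s; aim for 10-30s")
```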
What is Fish Audio S2 Pro TTS?
Fish Audio S2 Pro TTS is an AI Agent Skill for Claude Code / OpenClaw, with 114 downloads so far.
How do I install Fish Audio S2 Pro TTS?
Run "/install fish-speech" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.
Is Fish Audio S2 Pro TTS free?
Yes, Fish Audio S2 Pro TTS is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does Fish Audio S2 Pro TTS support?
Fish Audio S2 Pro TTS is cross-platform and runs anywhere OpenClaw / Claude Code is available.
Who created Fish Audio S2 Pro TTS?
It is built and maintained by OpenLark (@openlark); the current version is v1.0.0.