
Fish Audio S2 Pro TTS

Name: Fish Audio S2 Pro TTS
Author: openlark

by OpenLark · GitHub ↗ · v1.0.0 · MIT-0

cross-platform ✓ Security Clean

Downloads: 114

Install in OpenClaw

/install fish-speech

Description

Fish Audio S2 Pro TTS.

README (SKILL.md)

Fish Audio S2 Pro TTS

Dual-AR architecture (Slow AR 4B + Fast AR 400M), 10 RVQ codebooks, ~21 Hz frame rate, 80+ languages.

Model: fishaudio/s2-pro
Output: 44.1 kHz WAV/PCM mono
VRAM: ≥24GB for inference, A800/H200 recommended
Technical Report: arXiv 2603.08823 | Architecture
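The frame rate and codebook count listed above imply a rough code budget per second of audio. The sketch below is only back-of-the-envelope arithmetic: it assumes the frame rate is exactly 21 Hz (the summary says "~21") and that every frame carries one code per codebook. The function names are illustrative and not part of the Fish Audio code base.

```python
FRAME_RATE_HZ = 21   # approximate value taken from the summary above
NUM_CODEBOOKS = 10   # RVQ codebooks per frame


def frames_for(seconds: float) -> int:
    """Approximate number of codec frames for a clip of the given length."""
    return round(seconds * FRAME_RATE_HZ)


def codes_for(seconds: float) -> int:
    """Approximate total RVQ codes, assuming one code per codebook per frame."""
    return frames_for(seconds) * NUM_CODEBOOKS


print(frames_for(10), codes_for(10))  # 210 2100
```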

Installation

See references/install.md. Quick summary:

conda create -n fish-speech python=3.12 && conda activate fish-speech
pip install -e ".[cu129]"   # CUDA 12.9 (quoted so shells such as zsh do not expand the brackets)
# or: uv sync --python 3.12 --extra cu129
# minimal: pip install fish-speech

apt install portaudio19-dev libsox-dev ffmpeg  # System dependencies
hf download fishaudio/s2-pro --local-dir checkpoints/s2-pro

Server Deployment

vLLM-Omni (recommended, OpenAI compatible):

pip install fish-speech
vllm serve fishaudio/s2-pro --omni --port 8091
# Endpoints: POST /v1/audio/speech, /v1/audio/speech/batch

SGLang-Omni (high-performance streaming):

sgl-omni serve --model-path fishaudio/s2-pro --config examples/configs/s2pro_tts.yaml --port 8000
# RTF 0.195, TTFA ~100ms, throughput 3000+ t/s
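To make the RTF figure concrete: under the usual definition, real-time factor is generation time divided by audio duration, so values below 1 mean faster than real time. The helper below is a minimal sketch of that arithmetic. The function name is hypothetical, and the quoted 0.195 depends on the hardware and configuration used for the measurement.

```python
def synthesis_seconds(audio_seconds: float, rtf: float = 0.195) -> float:
    """Estimated wall-clock time to generate `audio_seconds` of speech.

    Uses RTF = generation_time / audio_duration, so the estimate is
    simply audio_duration * RTF.
    """
    return audio_seconds * rtf


# One minute of audio at the quoted RTF takes roughly 11.7 seconds to generate.
print(round(synthesis_seconds(60), 1))
```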

Docker:

docker compose --profile webui up    # Port 7860
COMPILE=1 docker compose --profile webui up  # ~10x speedup

Native API Server:

python tools/api_server.py --llama-checkpoint-path checkpoints/s2-pro --decoder-checkpoint-path checkpoints/s2-pro/codec.pth --listen 0.0.0.0:8080

Raw CLI Inference (Three Steps)

# 1. Extract VQ tokens
python fish_speech/models/dac/inference.py -i "ref.wav" --checkpoint-path "checkpoints/s2-pro/codec.pth"
# 2. Generate semantic tokens
python fish_speech/models/text2semantic/inference.py --text "Text" --prompt-text "Reference text" --prompt-tokens "fake.npy"
# 3. Decode to audio
python fish_speech/models/dac/inference.py -i "codes_0.npy"
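The three steps can also be chained from a script. The sketch below only assembles the same command lines shown above. It assumes the commands are run from the repository root and that step 1 writes fake.npy while step 2 writes codes_0.npy, as the file names in the commands suggest. build_steps and run_steps are hypothetical helper names.

```python
import subprocess
import sys

CODEC = "checkpoints/s2-pro/codec.pth"


def build_steps(ref_wav: str, ref_text: str, text: str) -> list[list[str]]:
    """Return the argv lists for the three CLI steps, in order."""
    py = sys.executable
    return [
        # 1. Extract VQ tokens from the reference audio (assumed to write fake.npy)
        [py, "fish_speech/models/dac/inference.py", "-i", ref_wav,
         "--checkpoint-path", CODEC],
        # 2. Generate semantic tokens (assumed to write codes_0.npy)
        [py, "fish_speech/models/text2semantic/inference.py", "--text", text,
         "--prompt-text", ref_text, "--prompt-tokens", "fake.npy"],
        # 3. Decode the generated tokens to audio
        [py, "fish_speech/models/dac/inference.py", "-i", "codes_0.npy"],
    ]


def run_steps(steps: list[list[str]]) -> None:
    """Run each step and stop at the first failure."""
    for argv in steps:
        subprocess.run(argv, check=True)
```

Calling run_steps(build_steps("ref.wav", "Reference text", "Text")) would execute the pipeline; building the lists first makes it easy to print and inspect the commands before running them.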

API Calls

cURL

# Basic TTS
curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello."}' --output out.wav

# Voice cloning (vLLM)
curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Cloned voice.", "ref_audio": "https://...", "ref_text": "Reference transcription"}' --output cloned.wav

# Streaming PCM
curl -N -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Streaming.", "stream": true, "response_format": "pcm"}' --no-buffer | play -t raw -r 44100 -e signed -b 16 -c 1 -

# Batch
curl -X POST http://localhost:8091/v1/audio/speech/batch \
  -H "Content-Type: application/json" \
  -d '{"items": [{"input": "Sentence 1"}, {"input": "Sentence 2"}], "voice": "default"}'
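The streaming example pipes raw PCM straight to a player. If the stream is saved instead, it has no header, so most tools cannot open it. The sketch below wraps raw PCM bytes in a WAV container using the same parameters passed to play above (44100 Hz, signed 16-bit, mono). It assumes little-endian samples, which is what the WAV format expects; pcm_to_wav is an illustrative name.

```python
import wave


def pcm_to_wav(pcm: bytes, path: str, sample_rate: int = 44100) -> None:
    """Write raw signed 16-bit mono PCM bytes to a WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 16-bit samples = 2 bytes
        wav.setframerate(sample_rate)  # 44.1 kHz by default
        wav.writeframes(pcm)
```

For example, after collecting the streamed chunks into a bytes object named pcm, pcm_to_wav(pcm, "stream.wav") produces a playable file.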

Python

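import requests

# Plain HTTP request; ref_audio and ref_text are optional and enable voice cloning
resp = requests.post("http://localhost:8091/v1/audio/speech", json={
    "input": "Hello.", "voice": "default",
    "ref_audio": "https://...", "ref_text": "Reference text"
})
resp.raise_for_status()  # fail loudly instead of writing an error body to out.wav
with open("out.wav", "wb") as f:
    f.write(resp.content)

# OpenAI SDK (the client requires a non-empty api_key, so a placeholder is passed)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8091/v1", api_key="none")
client.audio.speech.create(model="fishaudio/s2-pro", voice="default", input="Hello.").stream_to_file("out.wav")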

SGLang format for reference audio (in place of ref_audio/ref_text): "references": [{"audio_path": "...", "text": "..."}]
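Because the two servers take reference audio in different shapes, a small helper can keep client code backend-agnostic. The following is a sketch: the vLLM fields come from the cURL example and the SGLang "references" shape from the line above, but using "input" as the text field for SGLang is an assumption, and clone_payload is a hypothetical name.

```python
def clone_payload(backend: str, text: str, ref_audio: str, ref_text: str) -> dict:
    """Build a voice-cloning request body for the given backend."""
    if backend == "vllm":
        return {"input": text, "ref_audio": ref_audio, "ref_text": ref_text}
    if backend == "sglang":
        # Assumption: SGLang also accepts the text under "input".
        return {
            "input": text,
            "references": [{"audio_path": ref_audio, "text": ref_text}],
        }
    raise ValueError(f"unknown backend: {backend!r}")
```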

Request Parameters

Parameter	Type	Default	Description
`input`	string	Required	Text to synthesize
`voice`	string	`"default"`	Voice
`response_format`	string	`"wav"`	wav/mp3/flac/pcm/aac/opus
`speed`	float	`1.0`	Speech speed (0.25-4.0)
`stream`	bool	false	Streaming (requires `response_format="pcm"`)
`ref_audio`	string	null	Reference audio URL/base64/file://
`ref_text`	string	null	Reference audio transcription
`max_new_tokens`	int	2048	Max generation tokens
`temperature`	float	null	Sampling temperature
`top_p`	float	null	Nucleus sampling
`top_k`	int	null	Top-K
`repetition_penalty`	float	null	Repetition penalty
`seed`	int	null	Random seed
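The constraints in the table can be checked on the client before a request is sent. The validator below is a minimal sketch covering only the rules stated in the table (required input, allowed formats, speed range, streaming requiring pcm). The server may enforce more, and validate_request is an illustrative name rather than part of any SDK.

```python
FORMATS = {"wav", "mp3", "flac", "pcm", "aac", "opus"}


def validate_request(payload: dict) -> list[str]:
    """Return a list of problems found in a speech request body."""
    errors = []
    if not payload.get("input"):
        errors.append("input is required")
    fmt = payload.get("response_format", "wav")
    if fmt not in FORMATS:
        errors.append(f"unsupported response_format: {fmt}")
    speed = payload.get("speed", 1.0)
    if not 0.25 <= speed <= 4.0:
        errors.append("speed must be between 0.25 and 4.0")
    if payload.get("stream") and fmt != "pcm":
        errors.append('stream requires response_format="pcm"')
    return errors
```

An empty list means the payload satisfies the documented constraints.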

Emotion Tags

Embed [tag] anywhere in the text. More than 15,000 free-form tags are supported:

[excited]Today is a great day![pause] [whisper in small voice]But there's a secret…
[professional broadcast tone]Welcome.

Common: [excited] [angry] [sad] [whisper] [shouting] [laughing] [pause] [emphasis] [echo] [inhale] [sigh] [singing]

Full reference: references/emotion-tags.md
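When tagged text is generated programmatically, it can help to list the tags in a string or to recover the plain text, for example for logging or subtitles. The sketch below assumes tags are any non-nested [ ... ] spans, as in the examples above. Literal square brackets in the text would be treated as tags, and the helper names are hypothetical.

```python
import re

# A tag is any run of non-bracket characters enclosed in square brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def extract_tags(text: str) -> list[str]:
    """Return the tag names in order of appearance."""
    return TAG_PATTERN.findall(text)


def strip_tags(text: str) -> str:
    """Return the text with all tags removed."""
    return TAG_PATTERN.sub("", text)


sample = "[excited]Hi![pause] [whisper in small voice]Secret."
print(extract_tags(sample))  # ['excited', 'pause', 'whisper in small voice']
print(strip_tags(sample))    # Hi! Secret.
```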

Multi-Speaker

<|speaker:0|>Hello, welcome.
<|speaker:1|>Thank you, glad to be here.
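A dialogue held as structured data can be rendered into this format with a few lines of code. This sketch assumes the marker is <|speaker:N|> at the start of each turn and that turns are separated by newlines, as in the sample. format_dialogue is an illustrative name.

```python
def format_dialogue(turns: list[tuple[int, str]]) -> str:
    """Render (speaker_id, line) pairs as multi-speaker input text."""
    return "\n".join(f"<|speaker:{sid}|>{line}" for sid, line in turns)


print(format_dialogue([(0, "Hello, welcome."), (1, "Thank you, glad to be here.")]))
```

The resulting string would be passed as the "input" field of a request.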

LoRA Fine-tuning

⚠️ Fine-tuning is not recommended for models that have already been through RL training. If you do fine-tune, train only the Slow AR:

# Preparation: data/SPK1/*.mp3 + *.lab
python tools/vqgan/extract_vq.py data --config-name modded_dac_vq --checkpoint-path checkpoints/openaudio-s1-mini/codec.pth
python tools/llama/build_dataset.py --input data --output data/protos
python fish_speech/train.py --config-name text2semantic_finetune project=my_project +lora@model.model.lora_config=r_8_alpha_16
python tools/llama/merge_lora.py --lora-config r_8_alpha_16 --base-weight checkpoints/openaudio-s1-mini --lora-weight results/my_project/checkpoints/step_xxx.ckpt --output checkpoints/merged/

See references/finetune.md

Important Notes

Voice cloning: use 10-30 seconds of clear, noise-free reference audio and provide an accurate transcription.
Without reference audio, the voice tends to sound mechanical.
vLLM is easier to deploy; SGLang offers better latency and throughput.
SGLang: BF16 RoPE precision must match training; if generation hits EOS early, switch to FA3.
Compiling the Fast AR with torch.compile can give a ~5x speedup.
The Docker image does not include model weights; mount the checkpoints directory.
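The 10-30 second guideline for reference audio is easy to check before uploading a clip. The sketch below uses the standard-library wave module, so it only handles uncompressed PCM WAV files; MP3 or other formats would need a different reader. The function names are hypothetical.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def reference_length_ok(path: str, lo: float = 10.0, hi: float = 30.0) -> bool:
    """True if the clip length falls within the recommended range."""
    return lo <= wav_duration_seconds(path) <= hi
```

This checks length only; noise level and transcription accuracy still need to be judged separately.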

Usage Guidance

This skill appears benign as documentation for running Fish Audio S2 Pro. Before installing, verify the external package/container/model sources, run setup in an isolated environment, keep the API bound to localhost unless you add access controls, and only upload voice samples with proper consent.

Capability Analysis

Type: OpenClaw Skill
Name: fish-speech
Version: 1.0.0

The skill bundle provides comprehensive documentation and setup instructions for 'Fish Audio S2 Pro TTS', a high-performance text-to-speech system. The files (SKILL.md, install.md, api-reference.md) describe legitimate operations such as environment configuration via Conda/UV, model weight acquisition from Hugging Face, and server deployment using vLLM or Docker.

Although the documentation contains future-dated references (e.g., arXiv 2026 and CUDA 12.9), which suggests it may be a synthetic or forward-looking example, the technical logic is entirely consistent with a standard machine learning tool and lacks any indicators of malicious intent, data exfiltration, or harmful prompt injection.

Capability Tags

requires-sensitive-credentials

Capability Assessment

ℹ Purpose & Capability

The TTS, voice cloning, streaming, and fine-tuning instructions match the stated Fish Audio S2 Pro purpose; voice cloning inherently handles sensitive voice samples, so consent and authorization matter.

ℹ Instruction Scope

The commands and API examples are explicit and user-directed. One server example binds to all network interfaces, so users should decide deliberately whether the service should be reachable beyond localhost.
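A deployment script can flag non-local bind addresses before the server starts. The following sketch classifies a host:port string like the one passed to --listen in the native API server example. It treats "localhost" and loopback IPs as local and everything else, including unresolved hostnames, as potentially reachable from the network. is_loopback_listen is an illustrative helper, not part of the skill.

```python
import ipaddress


def is_loopback_listen(listen: str) -> bool:
    """True if a host:port listen string binds only to the local machine."""
    host = listen.rsplit(":", 1)[0].strip("[]")  # drop port and IPv6 brackets
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal; be conservative and treat it as non-local.
        return False


print(is_loopback_listen("0.0.0.0:8080"))    # False
print(is_loopback_listen("127.0.0.1:8080"))  # True
```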

ℹ Install Mechanism

There is no automatic install spec or shipped code; the documentation asks users to install external packages, Docker images, and Hugging Face model files that were not reviewed in this skill package.

ℹ Credentials

GPU, Docker, ffmpeg/sox, local HTTP APIs, and model downloads are proportionate for a local TTS system. The registry capability signal mentions sensitive credentials, but the provided docs do not show credential use.

ℹ Persistence & Privilege

Uploaded voice profiles are documented as persistent under the user's cache directory. System package installation may require elevated OS privileges, but it is presented as a user-run setup step rather than autonomous behavior.

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install fish-speech
After installation, invoke the skill by name or use /fish-speech
Provide required inputs per the skill's parameter spec and get structured output

Version History

v1.0.0

Initial release of fish-speech: Fish Audio S2 Pro TTS.
- Dual-AR TTS model supporting 80+ languages, 44.1kHz audio output, and 10 RVQ codebooks
- Multiple deployment options: vLLM, SGLang, Docker, native API server
- Supports voice cloning via reference audio and text; emotion tags and multi-speaker synthesis
- CLI and API (OpenAI-compatible) usage examples provided
- LoRA fine-tuning supported (Slow AR only, with guidance)
- Includes installation instructions, system requirements, and performance tips

Metadata

Slug fish-speech

Version 1.0.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is Fish Audio S2 Pro TTS?

Fish Audio S2 Pro TTS is an AI agent skill for Claude Code / OpenClaw that documents how to set up, deploy, and call the Fish Audio S2 Pro text-to-speech model. It has 114 downloads so far.

How do I install Fish Audio S2 Pro TTS?

Run "/install fish-speech" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Fish Audio S2 Pro TTS free?

Yes, Fish Audio S2 Pro TTS is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Fish Audio S2 Pro TTS support?

Fish Audio S2 Pro TTS is cross-platform and runs anywhere OpenClaw or Claude Code is available.

Who created Fish Audio S2 Pro TTS?

It is built and maintained by OpenLark (@openlark); the current version is v1.0.0.
