openlark

Fish Audio S2 Pro TTS

by OpenLark · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ✓ Security Clean
114 Downloads · 0 Stars · 0 Active Installs · 1 Version
Install in OpenClaw
/install fish-speech
Description
Fish Audio S2 Pro TTS.
README (SKILL.md)

Fish Audio S2 Pro TTS

Dual-AR architecture (Slow AR 4B + Fast AR 400M), 10 RVQ codebooks, ~21 Hz frame rate, 80+ languages.

Installation

See references/install.md. Quick summary:

conda create -n fish-speech python=3.12 && conda activate fish-speech
pip install -e ".[cu129]"   # CUDA 12.9 (quoted so shells such as zsh do not expand the brackets)
# or: uv sync --python 3.12 --extra cu129
# minimal: pip install fish-speech

apt install portaudio19-dev libsox-dev ffmpeg  # System dependencies
hf download fishaudio/s2-pro --local-dir checkpoints/s2-pro

Server Deployment

vLLM-Omni (recommended, OpenAI compatible):

pip install fish-speech
vllm serve fishaudio/s2-pro --omni --port 8091
# Endpoints: POST /v1/audio/speech, /v1/audio/speech/batch

SGLang-Omni (high-performance streaming):

sgl-omni serve --model-path fishaudio/s2-pro --config examples/configs/s2pro_tts.yaml --port 8000
# RTF 0.195, TTFA ~100ms, throughput 3000+ t/s
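
The RTF figure above can be turned into a rough latency estimate. The sketch below assumes the usual definition of real-time factor (processing time divided by audio duration), which the source does not spell out; the function names are illustrative only.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF under the common definition: processing time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def estimated_synthesis_seconds(audio_seconds: float, rtf: float = 0.195) -> float:
    """Rough wall-clock time needed to synthesize `audio_seconds` of speech."""
    return audio_seconds * rtf


if __name__ == "__main__":
    # With the quoted RTF of 0.195, 10 s of audio would take about 1.95 s.
    print(estimated_synthesis_seconds(10.0))
```

Actual numbers depend on hardware, batch size, and text length, so treat the result as an order-of-magnitude estimate.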

Docker:

docker compose --profile webui up    # Port 7860
COMPILE=1 docker compose --profile webui up  # ~10x speedup

Native API Server:

python tools/api_server.py --llama-checkpoint-path checkpoints/s2-pro --decoder-checkpoint-path checkpoints/s2-pro/codec.pth --listen 0.0.0.0:8080

Raw CLI Inference (Three Steps)

# 1. Extract VQ tokens
python fish_speech/models/dac/inference.py -i "ref.wav" --checkpoint-path "checkpoints/s2-pro/codec.pth"
# 2. Generate semantic tokens
python fish_speech/models/text2semantic/inference.py --text "Text" --prompt-text "Reference text" --prompt-tokens "fake.npy"
# 3. Decode to audio
python fish_speech/models/dac/inference.py -i "codes_0.npy"
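
The three steps above could be driven from Python. This is a minimal sketch that only assembles the argument lists; the script paths, flags, and intermediate file names (`fake.npy`, `codes_0.npy`) are copied from the commands above, while the helper name `build_cli_steps` is hypothetical.

```python
def build_cli_steps(ref_wav: str, text: str, prompt_text: str,
                    codec_ckpt: str = "checkpoints/s2-pro/codec.pth") -> list[list[str]]:
    """Return the three inference commands as argv lists, in execution order."""
    dac = "fish_speech/models/dac/inference.py"
    t2s = "fish_speech/models/text2semantic/inference.py"
    return [
        # 1. Extract VQ tokens from the reference audio
        ["python", dac, "-i", ref_wav, "--checkpoint-path", codec_ckpt],
        # 2. Generate semantic tokens conditioned on the reference
        ["python", t2s, "--text", text,
         "--prompt-text", prompt_text, "--prompt-tokens", "fake.npy"],
        # 3. Decode the generated codes to audio
        ["python", dac, "-i", "codes_0.npy"],
    ]


if __name__ == "__main__":
    for cmd in build_cli_steps("ref.wav", "Text", "Reference text"):
        print(" ".join(cmd))
```

To execute the steps, each list can be passed to `subprocess.run(cmd, check=True)` from the repository root, which stops the pipeline at the first failing step.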

API Calls

cURL

# Basic TTS
curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Hello."}' --output out.wav

# Voice cloning (vLLM)
curl -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Cloned voice.", "ref_audio": "https://...", "ref_text": "Reference transcription"}' --output cloned.wav

# Streaming PCM
curl -N -X POST http://localhost:8091/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "Streaming.", "stream": true, "response_format": "pcm"}' | play -t raw -r 44100 -e signed -b 16 -c 1 -

# Batch
curl -X POST http://localhost:8091/v1/audio/speech/batch \
  -H "Content-Type: application/json" \
  -d '{"items": [{"input": "Sentence 1"}, {"input": "Sentence 2"}], "voice": "default"}'

Python

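import requests

# Voice cloning request; ref_audio and ref_text are optional
resp = requests.post("http://localhost:8091/v1/audio/speech", json={
    "input": "Hello.", "voice": "default",
    "ref_audio": "https://...", "ref_text": "Reference text"
})
resp.raise_for_status()  # avoid writing an error body to out.wav
with open("out.wav", "wb") as f:
    f.write(resp.content)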

# OpenAI SDK
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8091/v1", api_key="none")
client.audio.speech.create(model="fishaudio/s2-pro", voice="default", input="Hello.").stream_to_file("out.wav")
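
Streamed PCM has no header, so it cannot be opened directly by most players. The sketch below wraps raw PCM bytes in a WAV container using only the standard library. The defaults (44100 Hz, signed 16-bit, mono) are taken from the `play` arguments in the streaming cURL example and are an assumption about what the server actually emits; `pcm_to_wav_bytes` is an illustrative name.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 44100,
                     channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw little-endian PCM samples in a WAV (RIFF) container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)   # bytes per sample: 2 = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


if __name__ == "__main__":
    one_second_of_silence = b"\x00\x00" * 44100
    with open("silence.wav", "wb") as f:
        f.write(pcm_to_wav_bytes(one_second_of_silence))
```

With `requests`, the PCM bytes could be collected from a `stream=True` response via `resp.iter_content(chunk_size=4096)` and joined before calling the helper.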

SGLang format: "references": [{"audio_path": "...", "text": "..."}]
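
The difference between the two request shapes can be sketched as a small conversion. This assumes the only change is that `ref_audio`/`ref_text` move into a `references` list; whether `audio_path` accepts URLs as well as local paths is not stated in the source, and `to_sglang_payload` is a hypothetical helper.

```python
def to_sglang_payload(vllm_payload: dict) -> dict:
    """Move ref_audio/ref_text into the SGLang-style "references" list."""
    payload = {k: v for k, v in vllm_payload.items()
               if k not in ("ref_audio", "ref_text")}
    ref_audio = vllm_payload.get("ref_audio")
    if ref_audio is not None:
        payload["references"] = [{
            "audio_path": ref_audio,
            "text": vllm_payload.get("ref_text") or "",
        }]
    return payload


if __name__ == "__main__":
    print(to_sglang_payload({
        "input": "Cloned voice.",
        "ref_audio": "ref.wav",
        "ref_text": "Reference transcription",
    }))
```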

Request Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| input | string | Required | Text to synthesize |
| voice | string | "default" | Voice |
| response_format | string | "wav" | wav/mp3/flac/pcm/aac/opus |
| speed | float | 1.0 | Speech speed (0.25-4.0) |
| stream | bool | false | Streaming (requires response_format="pcm") |
| ref_audio | string | null | Reference audio URL/base64/file:// |
| ref_text | string | null | Reference audio transcription |
| max_new_tokens | int | 2048 | Max generation tokens |
| temperature | float | null | Sampling temperature |
| top_p | float | null | Nucleus sampling |
| top_k | int | null | Top-K |
| repetition_penalty | float | null | Repetition penalty |
| seed | int | null | Random seed |
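
The constraints in the parameter list can be checked on the client before a request is sent. The following is a minimal sketch: the allowed formats, the speed range, and the stream/pcm rule come from the list above, while the helper name `build_speech_request` and the choice to drop `None` values are assumptions.

```python
ALLOWED_FORMATS = {"wav", "mp3", "flac", "pcm", "aac", "opus"}


def build_speech_request(text: str, *, voice: str = "default",
                         response_format: str = "wav", speed: float = 1.0,
                         stream: bool = False, **extra) -> dict:
    """Validate the documented constraints and return a JSON-ready payload."""
    if not text:
        raise ValueError("input is required")
    if response_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    if stream and response_format != "pcm":
        raise ValueError('stream=True requires response_format="pcm"')
    payload = {"input": text, "voice": voice,
               "response_format": response_format,
               "speed": speed, "stream": stream}
    # Optional fields (ref_audio, temperature, seed, ...) are sent only when set.
    payload.update({k: v for k, v in extra.items() if v is not None})
    return payload


if __name__ == "__main__":
    print(build_speech_request("Hello.", temperature=0.7, seed=42))
```

The returned dictionary can be passed as `json=` to `requests.post` against `/v1/audio/speech`.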

Emotion Tags

Embed [tag] anywhere in the text. More than 15,000 free-form tags are supported:

[excited]Today is a great day![pause] [whisper in small voice]But there's a secret…
[professional broadcast tone]Welcome.

Common: [excited] [angry] [sad] [whisper] [shouting] [laughing] [pause] [emphasis] [echo] [inhale] [sigh] [singing]

Full reference: references/emotion-tags.md
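
When tagged text is generated programmatically, it can help to list or strip the tags, for example to log them or to show a clean transcript. This sketch assumes tags are plain `[...]` spans without nesting, as in the examples above; it is not part of the Fish Audio tooling.

```python
import re

# Matches one bracketed tag such as [excited] or [whisper in small voice].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def extract_tags(text: str) -> list[str]:
    """Return the tag names in order of appearance, without brackets."""
    return TAG_PATTERN.findall(text)


def strip_tags(text: str) -> str:
    """Remove all tags and collapse the whitespace they leave behind."""
    without_tags = TAG_PATTERN.sub("", text)
    return re.sub(r"\s{2,}", " ", without_tags).strip()


if __name__ == "__main__":
    sample = "[excited]Today is a great day![pause] [whisper]But there's a secret."
    print(extract_tags(sample))
    print(strip_tags(sample))
```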

Multi-Speaker

<|speaker:0|>Hello, welcome.
<|speaker:1|>Thank you, glad to be here.
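
A dialogue in this format can be assembled from a list of turns. The sketch assumes speaker markers of the form `<|speaker:N|>` with one turn per line, as in the example above; `build_dialogue` is an illustrative helper.

```python
def build_dialogue(turns: list[tuple[int, str]]) -> str:
    """Join (speaker_id, text) turns into a multi-speaker input string."""
    lines = []
    for speaker_id, text in turns:
        if speaker_id < 0:
            raise ValueError("speaker_id must be non-negative")
        lines.append(f"<|speaker:{speaker_id}|>{text.strip()}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_dialogue([
        (0, "Hello, welcome."),
        (1, "Thank you, glad to be here."),
    ]))
```

The resulting string would be sent as the `input` field of a speech request.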

LoRA Fine-tuning

⚠️ Fine-tuning is not recommended for models that have already been through RL. Fine-tune only the Slow AR model:

# Preparation: data/SPK1/*.mp3 + *.lab
python tools/vqgan/extract_vq.py data --config-name modded_dac_vq --checkpoint-path checkpoints/openaudio-s1-mini/codec.pth
python tools/llama/build_dataset.py --input data --output data/protos
python fish_speech/train.py --config-name text2semantic_finetune project=my_project +lora@model.model.lora_config=r_8_alpha_16
python tools/llama/merge_lora.py --lora-config r_8_alpha_16 --base-weight checkpoints/openaudio-s1-mini --lora-weight results/my_project/checkpoints/step_xxx.ckpt --output checkpoints/merged/

See references/finetune.md

Important Notes

  1. Voice cloning: use 10-30 seconds of clear, noise-free reference audio and provide an accurate transcription
  2. Without reference audio, the voice tends to sound mechanical
  3. vLLM is easier to deploy; SGLang has better latency and throughput
  4. SGLang: BF16 RoPE precision must match training; if early EOS occurs, switch to FA3 (FlashAttention 3)
  5. Compiling Fast AR with torch.compile can give a ~5x speedup
  6. Docker image does not include model weights; mount checkpoints
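
The 10-30 second guideline for reference audio in note 1 can be checked before a clip is uploaded. This sketch uses the standard-library `wave` module, so it only handles uncompressed WAV input; the helper name and the strict bounds are assumptions.

```python
import wave


def reference_duration_ok(source, min_seconds: float = 10.0,
                          max_seconds: float = 30.0) -> bool:
    """Return True if the WAV at `source` (path or file object) is 10-30 s long."""
    with wave.open(source, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    return min_seconds <= duration <= max_seconds


if __name__ == "__main__":
    import sys
    for path in sys.argv[1:]:
        print(path, "ok" if reference_duration_ok(path) else "outside 10-30 s")
```

Duration is only one criterion; noise level and transcription accuracy still need to be checked by ear.
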
Usage Guidance
This skill appears benign as documentation for running Fish Audio S2 Pro. Before installing, verify the external package/container/model sources, run setup in an isolated environment, keep the API bound to localhost unless you add access controls, and only upload voice samples with proper consent.
Capability Analysis
Type: OpenClaw Skill · Name: fish-speech · Version: 1.0.0

The skill bundle provides comprehensive documentation and setup instructions for 'Fish Audio S2 Pro TTS', a high-performance text-to-speech system. The files (SKILL.md, install.md, api-reference.md) describe legitimate operations such as environment configuration via Conda/UV, model weight acquisition from Hugging Face, and server deployment using vLLM or Docker. Although the documentation contains future-dated references (e.g., arXiv 2026 and CUDA 12.9), which suggests it may be a synthetic or forward-looking example, the technical logic is entirely consistent with a standard machine learning tool and lacks any indicators of malicious intent, data exfiltration, or harmful prompt injection.
Capability Tags
requires-sensitive-credentials
Capability Assessment
Purpose & Capability
The TTS, voice cloning, streaming, and fine-tuning instructions match the stated Fish Audio S2 Pro purpose; voice cloning inherently handles sensitive voice samples, so consent and authorization matter.
Instruction Scope
The commands and API examples are explicit and user-directed. One server example binds to all network interfaces, so users should decide deliberately whether the service should be reachable beyond localhost.
Install Mechanism
There is no automatic install spec or shipped code; the documentation asks users to install external packages, Docker images, and Hugging Face model files that were not reviewed in this skill package.
Credentials
GPU, Docker, ffmpeg/sox, local HTTP APIs, and model downloads are proportionate for a local TTS system. The registry capability signal mentions sensitive credentials, but the provided docs do not show credential use.
Persistence & Privilege
Uploaded voice profiles are documented as persistent under the user's cache directory. System package installation may require elevated OS privileges, but it is presented as a user-run setup step rather than autonomous behavior.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install fish-speech
  3. After installation, invoke the skill by name or use /fish-speech
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
Initial release of fish-speech: Fish Audio S2 Pro TTS.
- Dual-AR TTS model supporting 80+ languages, 44.1kHz audio output, and 10 RVQ codebooks
- Multiple deployment options: vLLM, SGLang, Docker, native API server
- Supports voice cloning via reference audio and text; emotion tags and multi-speaker synthesis
- CLI and API (OpenAI-compatible) usage examples provided
- LoRA fine-tuning supported (Slow AR only, with guidance)
- Includes installation instructions, system requirements, and performance tips
Metadata
Slug fish-speech
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is Fish Audio S2 Pro TTS?

Fish Audio S2 Pro TTS is an AI agent skill for Claude Code / OpenClaw that documents how to install, deploy, and call the Fish Audio S2 Pro text-to-speech model. It has 114 downloads so far.

How do I install Fish Audio S2 Pro TTS?

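Run "/install fish-speech" in the OpenClaw or Claude Code chat to install the skill in one step. The skill ships documentation only, so the TTS packages, system dependencies, and model weights still have to be installed as described in the Installation section.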

Is Fish Audio S2 Pro TTS free?

Yes, Fish Audio S2 Pro TTS is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Fish Audio S2 Pro TTS support?

Fish Audio S2 Pro TTS is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Fish Audio S2 Pro TTS?

It is built and maintained by OpenLark (@openlark); the current version is v1.0.0.
