Description

OpenClaw text-to-speech workflow for an OpenAI-compatible TTS server, including remote/self-hosted deployments such as vLLM Omni. Use when configuring, testi...

README (SKILL.md)

Local TTS Workflow

Name: Local Tts Workflow
Author: mozi1924

Use this skill to debug the actual speech pipeline and to prepare text so the model reads it sanely.

Do not hardcode 127.0.0.1 blindly. Read the active OpenClaw config first and use the current messages.tts.openai.baseUrl as the source of truth.

Current known deployment in this workspace: http://127.0.0.1:8000/v1.

Current local model-path fallback worth remembering: if the server did not pull a model by registry name, it may be loading directly from a local path such as ./models/qwen3-tts-0.6b-mlx.

When exact route shape matters, the local OpenAPI document is available at:

http://localhost:8000/openapi.json

Use this OpenAPI doc as a schema/reference source to compare this local mlx-audio server against OpenAI’s API. Do not treat it as a health check.

Core rule: normalize numbers before synthesis

If text is meant to be spoken aloud, do not leave Arabic numerals in the final TTS input.

Convert them into words first.

Examples:

Chinese output: write 一二三, not 123
English output: write one two three, not 123

This rule matters because the TTS model can misread, skip, or garble digits when fed raw numerals.

When preparing spoken text, normalize:

dates
times
counts
version-like strings if they will be read aloud
mixed Chinese/English numeric snippets

If preserving exact machine-readable formatting matters, keep one copy for display and a separate normalized copy for TTS.
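A minimal sketch of the digit rule above, assuming a digit-by-digit reading is acceptable (as in the 123 examples). The helper name spell_digits is hypothetical and not part of OpenClaw or the TTS server. It does not read dates, times, or counts as quantities, so those still need their own normalization, and in English mode it collapses runs of whitespace.

```python
# Digit-by-digit maps, matching the "one two three" / 一二三 style shown above.
EN_DIGITS = ["zero", "one", "two", "three", "four",
             "five", "six", "seven", "eight", "nine"]
ZH_DIGITS = "零一二三四五六七八九"


def spell_digits(text: str, lang: str = "en") -> str:
    """Replace every ASCII digit with its spoken word (hypothetical helper)."""
    out = []
    for ch in text:
        if ch.isascii() and ch.isdigit():
            if lang == "zh":
                out.append(ZH_DIGITS[int(ch)])
            else:
                # Pad with spaces so adjacent digits become separate words.
                out.append(f" {EN_DIGITS[int(ch)]} ")
        else:
            out.append(ch)
    joined = "".join(out)
    # Collapse the padding spaces introduced around English digit words.
    return joined if lang == "zh" else " ".join(joined.split())


display_text = "Build 123 is ready"        # keep this copy for display
tts_text = spell_digits(display_text)      # send this copy to the TTS server
```

Keeping display_text and tts_text as separate values mirrors the advice to hold one copy for display and one normalized copy for synthesis.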

Workflow

1. Verify the server before touching OpenClaw

Read ~/.openclaw/openclaw.json first and extract:

messages.tts.provider
messages.tts.openai.baseUrl
messages.tts.openai.model
messages.tts.openai.voice
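The extraction above could be sketched as follows. This assumes the config file parses as strict JSON and that the dotted keys map to nested objects; both are assumptions about the file layout, and the function names are hypothetical. Only the four listed keys are returned, so nothing else in the file is surfaced.

```python
import json
from pathlib import Path


def read_tts_settings(config: dict) -> dict:
    """Pull only the messages.tts.* keys listed above out of a parsed config."""
    tts = config.get("messages", {}).get("tts", {})
    openai = tts.get("openai", {})
    return {
        "provider": tts.get("provider"),
        "baseUrl": openai.get("baseUrl"),
        "model": openai.get("model"),
        "voice": openai.get("voice"),
    }


def load_tts_settings(path: str = "~/.openclaw/openclaw.json") -> dict:
    """Read the config from disk; missing keys come back as None."""
    with Path(path).expanduser().open(encoding="utf-8") as fh:
        return read_tts_settings(json.load(fh))
```

A value of None for baseUrl is a signal to stop and fix the config before testing any endpoint.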

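Check the basics against the actual configured host. The commands below use the current workspace deployment; substitute the host from messages.tts.openai.baseUrl if it differs: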

curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/v1/models

Confirm that the intended TTS model exists.

If the model does not appear by pulled registry name, do not assume TTS is broken — this server may be loading a local-path model such as ./models/qwen3-tts-0.6b-mlx.
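To avoid hardcoding the host, the two probe URLs can be derived from the configured baseUrl. This is a sketch under two assumptions: that /health lives at the server root (as in the curl commands above) and that /v1/models returns an OpenAI-style body of the form {"data": [{"id": ...}]}. Both function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def probe_urls(base_url: str) -> dict:
    """Derive the health and model-list URLs from messages.tts.openai.baseUrl."""
    parts = urlsplit(base_url)
    # Assumption: /health sits at the server root, outside the /v1 prefix.
    root = urlunsplit((parts.scheme, parts.netloc, "", "", ""))
    return {
        "health": f"{root}/health",
        "models": f"{base_url.rstrip('/')}/models",
    }


def model_listed(models_response: dict, wanted: str) -> bool:
    """True if the wanted model id appears in an OpenAI-style model list."""
    return any(m.get("id") == wanted for m in models_response.get("data", []))
```

A False result from model_listed is not proof of failure, since a local-path model may be accepted by the server without appearing under its registry name.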

If the server is task-gated, ensure TTS is enabled:

MLX_AUDIO_SERVER_TASKS=tts uv run python server.py

2. Prove the raw TTS endpoint works

Always isolate the server from the client stack.

Minimal non-streaming test:

curl http://127.0.0.1:8000/v1/audio/speech \
 -X POST \
 -H 'Content-Type: application/json' \
 -d '{
 "model": "/models/lj-qwen3-tts/",
 "voice": "lj",
 "input": "你好，这是一次性返回完整音频的测试。",
 "response_format": "wav",
 "stream": false
 }' \
 --output sample.wav

Basic streaming test:

curl http://127.0.0.1:8000/v1/audio/speech \
 -H 'Content-Type: application/json' \
 -X POST \
 -d '{
 "model": "/models/lj-qwen3-tts/",
 "voice": "lj",
 "input": "你好，这是实时流式语音合成测试。",
 "response_format": "wav",
 "stream": true,
 "streaming_interval": 2.0
 }' \
 | ffplay -i -

If direct curl works but OpenClaw does not, the bug is probably in the TTS integration or provider selection layer, not the TTS backend.
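The non-streaming curl test can also be reproduced from Python with only the standard library, which is handy when the same check has to run against whatever baseUrl is configured. This is a sketch mirroring the curl payload above; the function names are hypothetical, and the 120-second timeout is an arbitrary assumption.

```python
import json
import urllib.request


def build_speech_request(base_url, model, text, voice=None, stream=False):
    """Build the same POST that the non-streaming curl example sends."""
    payload = {
        "model": model,
        "input": text,
        "response_format": "wav",
        "stream": stream,
    }
    if voice is not None:
        payload["voice"] = voice
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/audio/speech",
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_speech(request, out_path="sample.wav"):
    """Send the request and write the returned audio bytes to disk."""
    with urllib.request.urlopen(request, timeout=120) as resp, \
            open(out_path, "wb") as fh:
        fh.write(resp.read())
```

Building the request separately from sending it makes it possible to inspect the exact body before any network call is made.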

3. Distinguish server failure from integration failure

Use this rule:

Direct curl fails → fix the local TTS server first
Direct curl works, but OpenClaw sounds wrong or falls back → inspect OpenClaw provider selection, fallback, and request shape
OpenClaw sends requests but voice/mode is wrong → inspect fields like model, voice, instructions, ref_audio, ref_text, and streaming flags

4. Know the four TTS modes

Use the right request shape for the right model type.

Base speaker

Use built-in speaker playback.

Typical shape:

model type: base
no full ref_audio + ref_text
voice.id means built-in speaker name

Base clone

Use clone-style synthesis.

Typical shape:

model type: base
must provide both ref_audio and ref_text, or supply a consent voice identity that resolves to both

Hard rule: do not attempt clone with only ref_audio.

CustomVoice

Use a model with prebuilt custom speakers.

Typical shape:

model type: custom_voice
voice may be accepted either as a plain string or as {"id":"..."} depending on the server
for this workspace, lj-qwen3-tts / /models/lj-qwen3-tts/ must use speaker/voice lj
do not send clone payloads

VoiceDesign

Use style-description-driven synthesis.

Typical shape:

model type: voice_design
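must provide instructions (the local OpenAPI schema lists this field as instruct; confirm the exact name against /openapi.json)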
do not send voice, ref_audio, or ref_text
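The four request shapes can be expressed as a small pre-flight check. This is a sketch only: the mode labels are hypothetical names for the four sections above, the field name instructions is taken from this section and should be confirmed against the local schema, and the consent-voice-identity path for Base clone is not covered.

```python
def check_mode_payload(mode: str, payload: dict) -> list[str]:
    """Return a list of problems with a payload for the given mode label."""
    problems = []
    has_audio = bool(payload.get("ref_audio"))
    has_text = bool(payload.get("ref_text"))
    if mode == "base_speaker":
        if has_audio and has_text:
            problems.append("full ref_audio + ref_text makes this a clone request")
    elif mode == "base_clone":
        # Hard rule from the text: never clone with only ref_audio.
        if not (has_audio and has_text):
            problems.append("clone needs both ref_audio and ref_text")
    elif mode == "custom_voice":
        if has_audio or has_text:
            problems.append("do not send clone payloads to a CustomVoice model")
    elif mode == "voice_design":
        if not payload.get("instructions"):
            problems.append("VoiceDesign requires instructions")
        for key in ("voice", "ref_audio", "ref_text"):
            if key in payload:
                problems.append(f"VoiceDesign should not send {key}")
    else:
        problems.append(f"unknown mode: {mode}")
    return problems
```

An empty list means the payload matches the shape described for that mode; it says nothing about whether the server will accept the model or voice name.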

5. Treat streaming as a real transport choice

This server supports real incremental generation, not fake post-hoc slicing.

Important behavior:

Current OpenAPI says stream defaults to false
response_format defaults to mp3
streaming_interval defaults to 2.0
Required fields are only model and input
Extra optional fields exposed by this local server include instruct, voice, speed, gender, pitch, lang_code, ref_audio, ref_text, temperature, top_p, top_k, repetition_penalty, response_format, stream, streaming_interval, max_tokens, and verbose

Do not assume OpenAI parity on names or defaults — check the local OpenAPI schema first.
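One way to check defaults against the local schema is to walk the parsed /openapi.json document and collect the default of each field of interest. This is a sketch that assumes defaults are declared under a properties object with a default key, as is conventional in OpenAPI; it does not assume any particular schema name. If several schemas declare the same field, the first one found wins.

```python
def collect_defaults(node, wanted):
    """Walk a parsed OpenAPI document and return {field: default} for wanted names."""
    found = {}
    if isinstance(node, dict):
        props = node.get("properties")
        if isinstance(props, dict):
            for name, spec in props.items():
                if name in wanted and isinstance(spec, dict) and "default" in spec:
                    found.setdefault(name, spec["default"])
        # Recurse into every value so nested schemas are covered too.
        for value in node.values():
            for name, default in collect_defaults(value, wanted).items():
                found.setdefault(name, default)
    elif isinstance(node, list):
        for item in node:
            for name, default in collect_defaults(item, wanted).items():
                found.setdefault(name, default)
    return found
```

Called with the fields stream, response_format, and streaming_interval, the result can be compared directly with the defaults listed above.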

6. Use consent uploads properly

For consent-based clone flows, upload voice material through /v1/audio/voice_consents.

Use ref_text with the recording. That is not optional in spirit, even if a workflow tries to pretend otherwise.

If later synthesis depends on stored consent voices, verify that the saved identity actually maps to both:

reference audio
reference text

7. OpenClaw-specific debugging pattern

When OpenClaw TTS appears broken:

Confirm messages.tts points at the actual configured endpoint in openclaw.json
Confirm the intended model exists in /v1/models or is otherwise accepted by the server; if not, check whether it is a local-path-backed deployment such as ./models/qwen3-tts-0.6b-mlx
Confirm the selected provider is really the OpenAI-compatible path and not Microsoft fallback
Test direct curl with the same effective model/voice/mode assumptions
Inspect whether OpenClaw is falling back to another provider
If using [[tts:...]], verify whether single-reply override keys (model, voice, maybe provider) are enabled and are being honored
If needed, compare raw request shape with a dump proxy

If OpenClaw reaches the server successfully, the next question is usually which mode it actually requested.
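Once a request body has been captured, comparing it with the body of a known-good curl test usually shows the mode mismatch directly. The helper below is a hypothetical sketch that assumes both bodies have already been parsed into dictionaries; how the OpenClaw request is captured is left open.

```python
MISSING = "<missing>"


def diff_payloads(expected: dict, actual: dict) -> dict:
    """Map each differing key to an (expected, actual) pair of values."""
    keys = set(expected) | set(actual)
    return {
        key: (expected.get(key, MISSING), actual.get(key, MISSING))
        for key in sorted(keys)
        if expected.get(key, MISSING) != actual.get(key, MISSING)
    }
```

An empty result means the two requests agree on every top-level field, so the difference is more likely in headers, routing, or provider selection.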

8. Preferred test ladder

Use this order:

GET /health
GET /v1/models
direct non-streaming TTS test
direct streaming TTS test
consent upload test if clone is involved
OpenAI client compatibility test if relevant
OpenClaw integration test
dump-proxy / log inspection only if still ambiguous
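The ladder is meant to be walked in order and stopped at the first failing rung, since later rungs are not meaningful until earlier ones pass. A minimal sketch of that control flow, with a hypothetical function name and caller-supplied checks:

```python
from typing import Callable, Iterable, Optional, Tuple


def first_failure(steps: Iterable[Tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Run checks in order and return the name of the first one that fails."""
    for name, check in steps:
        try:
            ok = check()
        except Exception:
            # A check that raises (for example, connection refused) counts as failed.
            ok = False
        if not ok:
            return name
    return None  # every rung passed
```

Each check is a zero-argument callable returning True or False, so the same runner works whether a rung is an HTTP probe, a file check, or a manual confirmation.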

9. Common conclusions

Server good, integration bad

Typical signs:

manual curl returns playable audio
OpenClaw output sounds like fallback voice or wrong mode
provider selection is inconsistent

Conclusion: fix integration, not inference.

Text normalization bug

Typical signs:

synthesis succeeds technically
numbers are read awkwardly, skipped, or glitched

Conclusion: normalize the spoken text first. Do not blame the transport layer for a prompt-content problem.

Mode mismatch

Typical signs:

clone request sent to CustomVoice
VoiceDesign called without instructions
only ref_audio present for Base clone

Conclusion: wrong request semantics for the chosen model type.

10. Use the reference doc when exact fields matter

Read references/tts-api.md when you need exact behavior for:

/v1/audio/speech
/v1/audio/voice_consents
streaming vs non-streaming
stream_format="audio" vs stream_format="event"
mode selection and response headers
consent storage semantics
exact model/request mismatch errors

Do not assume generic OpenAI TTS docs fully match this local server.

Resources

references/

references/tts-api.md — exact local API behavior, streaming semantics, mode rules, consent upload flow, and common error conditions

Usage Guidance

This skill is coherent for debugging a local/OpenClaw-compatible TTS server, but before using it: (1) inspect ~/.openclaw/openclaw.json yourself to see what it contains — it may include provider settings or tokens the skill would read; (2) ensure the TTS baseUrl you allow the skill to contact is a trusted local/hosted server (examples use localhost); (3) be mindful that consent uploads persist audio and metadata on the TTS host; and (4) if you prefer not to allow autonomous invocation, restrict the skill's access or disable autonomous skills in your agent settings. If you want stronger assurance, run the curl examples manually rather than allowing the agent to execute them.

Capability Analysis

Type: OpenClaw Skill
Name: local-tts-workflow
Version: 1.0.1

The local-tts-workflow skill is a legitimate utility designed for configuring and debugging OpenAI-compatible text-to-speech (TTS) servers. It provides technical instructions for the agent to read local configuration files (~/.openclaw/openclaw.json) to identify active endpoints and includes standard curl commands for testing streaming and non-streaming audio synthesis. The documentation (SKILL.md and references/tts-api.md) is detailed and focuses on functional requirements like number normalization and API mode validation without any signs of malicious intent or data exfiltration.

Capability Assessment

✓ Purpose & Capability

Name/description match the behavior: the skill explains how to validate a local/OpenClaw-compatible TTS server, how to shape requests for different model modes, and how to normalize text for speech. It does not request unrelated credentials, binaries, or installs.

ℹ Instruction Scope

Runtime instructions ask the agent to read the OpenClaw config (~/.openclaw/openclaw.json), call local endpoints (e.g., /v1/audio/speech, /openapi.json), and upload consent audio to the local server. These actions are appropriate for debugging TTS, but reading the OpenClaw config can expose other configuration fields or tokens present in that file — the skill explicitly limits which keys to extract (messages.tts.*), which is reasonable.

✓ Install Mechanism

Instruction-only skill with no install spec and no code files — nothing is written to disk or downloaded by the skill itself.

ℹ Credentials

The skill declares no required environment variables or credentials. It references environment variables (MLX_AUDIO_SERVER_TASKS, MLX_AUDIO_TTS_WARMUP) as examples for running the server, which is proportional. Caveat: the config file the skill reads may contain provider info or secrets unrelated to TTS; users should inspect that file before permitting the skill to access it.

✓ Persistence & Privilege

The skill is not always-on and has no install-time persistence. disable-model-invocation is false (the platform default), so the agent could invoke the skill autonomously per normal platform behavior; this is expected and not by itself a red flag.

Version History

v1.0.1

- Updated server references from a remote IP to `127.0.0.1` to assume local deployment.
- Noted that models may be loaded via local file paths (e.g., `./models/qwen3-tts-0.6b-mlx`), not just registry pulls, and clarified OpenClaw model detection flow accordingly.
- Documented the local OpenAPI schema endpoint (`/openapi.json`) as the reference source for API details; clarified distinction between schema and server health.
- Revised TTS endpoint examples to match new field names (`response_format` instead of `format`, added `streaming_interval`), and updated streaming/non-streaming default behaviors per current OpenAPI spec.
- Expanded documentation of optional and required API parameters and urged users to consult the local OpenAPI for non-parity with OpenAI’s API.
- Added clarifications for troubleshooting and debugging patterns, focusing on local-path deployments and real server vs. integration failures.

v1.0.0

- Initial release of local-tts-workflow skill for OpenClaw-compatible text-to-speech integration
- Supports configuring, testing, and debugging local and remote/self-hosted TTS servers (including vLLM Omni)
- Focuses on correct pipeline setup, TTS mode selection (base, clone, custom, voice design), and validating endpoint behavior
- Provides clear guidance for number normalization in spoken text to ensure correct speech synthesis
- Includes detailed workflow, troubleshooting steps, curl testing commands, and OpenClaw-specific debugging patterns

Metadata

Slug local-tts-workflow

Version 1.0.1

License MIT-0

All-time Installs 1

Active Installs 1

Total Versions 2

Frequently Asked Questions

What is Local Tts Workflow?

OpenClaw text-to-speech workflow for an OpenAI-compatible TTS server, including remote/self-hosted deployments such as vLLM Omni. Use when configuring, testi... It is an AI Agent Skill for Claude Code / OpenClaw, with 121 downloads so far.

How do I install Local Tts Workflow?

Run "/install local-tts-workflow" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Local Tts Workflow free?

Yes, Local Tts Workflow is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Local Tts Workflow support?

Local Tts Workflow is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Local Tts Workflow?

It is built and maintained by Mozi Arasaka (@mozi1924); the current version is v1.0.1.
