Description

Baidu Intelligent Cloud Speech Synthesis (TTS) skill that supports multi-role dialogue audio generation, dual SSML/segment-merge modes, and speech rate/pitch adjustment.

README (SKILL.md)

Baidu Intelligent Cloud Speech Synthesis Skill

Name: Baidu Speech Synthesis
Author: guoxh

Triggers

Use this skill when the user mentions:

"Convert this dialogue to audio using Baidu TTS"
"Generate male-female dialogue, male voice using Duxiaoyao, female voice using Duxiaomei"
"Batch process all dialogues in dialogue.txt"
"Adjust speech rate to 7, pitch to 6"
"View available voice list"
"baidu tts", "dialogue to audio", "multi-speaker speech synthesis"
"baidu speech synthesis", "multi-speaker dialogue", "Baidu TTS"

Chinese triggers (for Chinese users):

"用百度TTS把这段对话转成音频"
"生成男女对话，男声用度逍遥，女声用度小美"
"批量处理 dialogue.txt 里的所有对话"
"调整语速到7，音调到6"
"查看可用的音色列表"

Overview

This skill calls the Baidu Intelligent Cloud Speech Synthesis API to convert text dialogues into audio files with a distinct voice for each character. Because the Baidu API accepts SSML for a single voice only, multi-speaker dialogues are synthesized segment by segment and then merged, while SSML mode is used for single-voice text. The skill offers a wide selection of voices and adjustable speech rate, pitch, and volume.

Installation Dependencies

# Install Python dependencies
pip install requests

# Ensure ffmpeg is installed (required for audio merging)
# Ubuntu/Debian:
sudo apt install ffmpeg
# macOS:
brew install ffmpeg
# Windows: Download from https://ffmpeg.org/download.html

# Optional: If pydub is needed (alternative merging solution)
# pip install pydub

Environment Variables Setup

Choose one of three authentication methods:

Method 1: API Key + Secret Key (auto-token)

export BAIDU_API_KEY="Your API Key (non-bce-v3 format)"
export BAIDU_SECRET_KEY="Your Secret Key"

Method 2: Direct access_token (starts with `1.`)

export BAIDU_API_KEY="1.a6b7dbd428f731035f771b8d********"
# BAIDU_SECRET_KEY not required

Method 3: IAM Key (starts with `bce-v3/`)

export BAIDU_API_KEY="bce-v3/ALTAK-8h6t5Y7uI9o0P1q3W2e4R5t6Y7u8I9o0P"
# BAIDU_SECRET_KEY not required
# Note: Existing bce-v3/ALTAK-... keys may be dedicated to other services (e.g., search).
# If authentication fails, create a dedicated speech synthesis application to get API Key + Secret Key.

Required Environment Variables

BAIDU_API_KEY must be set. Whether BAIDU_SECRET_KEY is needed depends on the authentication method:

Method 1: API Key + Secret Key (auto-token)

BAIDU_API_KEY=Your API Key (non-bce-v3 format)
BAIDU_SECRET_KEY=Your Secret Key

Method 2: Direct access_token (starts with `1.`)

BAIDU_API_KEY=1.a6b7dbd428f731035f771b8d********
# BAIDU_SECRET_KEY not required

Method 3: IAM Key (starts with `bce-v3/`)

BAIDU_API_KEY=bce-v3/ALTAK-8h6t5Y7uI9o0P1q3W2e4R5t6Y7u8I9o0P
# BAIDU_SECRET_KEY not required

The skill scripts detect the key format automatically and choose the matching authentication method. If BAIDU_API_KEY is not set, the user is prompted for it.
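The prefix rules above can be sketched as a small classifier. This is a minimal illustration, assuming the prefixes `1.` and `bce-v3/` are the only signals used; `detect_auth_method` and its return labels are hypothetical names, and the actual scripts may apply additional checks.

```python
from typing import Optional


def detect_auth_method(api_key: str, secret_key: Optional[str] = None) -> str:
    """Classify a key by its prefix (hypothetical helper, not the skill's own code)."""
    if not api_key:
        raise ValueError("BAIDU_API_KEY is not set")
    if api_key.startswith("bce-v3/"):
        # Method 3: IAM key, no secret key needed
        return "iam_key"
    if api_key.startswith("1."):
        # Method 2: the value is already an access_token
        return "access_token"
    if not secret_key:
        # Method 1 needs both values to request a token
        raise ValueError("BAIDU_SECRET_KEY is required with a plain API Key")
    return "api_key_secret"
```

In a script, the two arguments would typically come from `os.environ.get("BAIDU_API_KEY", "")` and `os.environ.get("BAIDU_SECRET_KEY")`.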

Usage

1. Direct script invocation (command line)

# Single dialogue file synthesis
python ~/.openclaw/skills/baidu-speech-synthesis/scripts/baidu_tts.py \
    --input dialogue.txt \
    --output conversation.mp3

# Specify voice mapping (character name → voice code)
python scripts/baidu_tts.py \
    --input script.txt \
    --map 小明:1 小红:0 老师:106

# Batch process all .txt files in a directory
python scripts/baidu_tts.py \
    --dir ./dialogues \
    --format mp3

# Adjust parameters
python scripts/baidu_tts.py \
    --input text.txt \
    --spd 7 --pit 6 --vol 5 \
    --aue 3
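The `--map` option above pairs each character name with a voice code. The following is a minimal sketch of how such `NAME:CODE` pairs could be parsed, assuming argparse is used; `parse_voice_map` and `build_parser` are hypothetical names, and the real baidu_tts.py may handle the option differently.

```python
import argparse


def parse_voice_map(pairs):
    """Turn ["小明:1", "小红:0"] into {"小明": 1, "小红": 0}."""
    mapping = {}
    for pair in pairs:
        # Split on the last colon so the code is always the final field
        name, sep, code = pair.rpartition(":")
        if not sep or not name or not code.isdecimal():
            raise ValueError(f"expected NAME:CODE, got {pair!r}")
        mapping[name] = int(code)
    return mapping


def build_parser():
    """Parser covering only the two options used in this sketch."""
    parser = argparse.ArgumentParser(description="voice mapping sketch")
    parser.add_argument("--input", required=True, help="dialogue text file")
    parser.add_argument("--map", nargs="*", default=[], metavar="NAME:CODE",
                        help="character name to voice code pairs")
    return parser
```

Rejecting malformed pairs early gives a clearer error than failing later during synthesis.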

2. Usage in OpenClaw sessions

When the user triggers the above phrases, the skill will:

Check the environment variable configuration
Ask for, or automatically identify, the input text or file
Generate SSML according to the default or specified voice assignment scheme
Call the Baidu API and return the audio file (which can be played automatically or saved)

File Structure

baidu-speech-synthesis/
├── SKILL.md                    # This file
├── scripts/
│   ├── baidu_tts.py            # Main API client (token acquisition, SSML requests, segment merging)
│   ├── dialogue_formatter.py   # Dialogue text → SSML conversion and voice mapping
│   └── audio_merger.py         # ffmpeg audio merging tool (segment merge solution)
└── references/
    ├── voice_list.md           # Voice code table, samples, recommended pairings
    ├── ssml_guide.md           # Baidu SSML tags, limitations, examples
    └── api_setup.md            # How to obtain keys, free quota (5 million chars/month), authentication details

Technical Points

Intelligent Mode Selection: Automatically detects multi-voice requirements and defaults to segment synthesis mode, because the Baidu API supports only single-voice SSML.
Segment Synthesis Solution: Splits a multi-role dialogue into single-voice segments, synthesizes each segment separately, then merges the results with ffmpeg. This works around the API limitation and is compatible with Python 3.13.
SSML Single-Voice Support: Supports single-voice SSML (tex_type=3) for complex speech expressions of individual characters.
Automatic Voice Assignment: Default mapping "老王" → Duxiaoyao (3), "张经理" → Duxiaoyu (1), "小李" → Duyaya (4), customizable via --map.
Error Handling: Friendly prompts for network timeouts, quota exhaustion, audio merge failures, etc.
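The segment synthesis flow described above can be sketched in two steps: split the dialogue into (voice, text) segments, then build an ffmpeg command that joins the per-segment audio files. This is an illustration under stated assumptions, not the skill's actual code: the `Speaker: text` line format, the fallback voice code 0, and all function names are hypothetical, and ffmpeg's concat demuxer is used here as one possible merge technique that audio_merger.py may or may not use.

```python
import re

# Default mapping quoted in the Technical Points above
DEFAULT_VOICES = {"老王": 3, "张经理": 1, "小李": 4}

# Assumed line format: "Speaker: text", with an ASCII or full-width colon
LINE_RE = re.compile(r"^\s*([^:：]+?)\s*[:：]\s*(.+?)\s*$")


def split_dialogue(text, voices=None, fallback_voice=0):
    """Return a list of (voice_code, utterance) tuples, one per dialogue line."""
    voices = DEFAULT_VOICES if voices is None else voices
    segments = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip blank lines and lines without a speaker prefix
        speaker, utterance = match.groups()
        segments.append((voices.get(speaker, fallback_voice), utterance))
    return segments


def concat_list_text(segment_paths):
    """Build the contents of the list file read by ffmpeg's concat demuxer."""
    return "".join(f"file '{path}'\n" for path in segment_paths)


def ffmpeg_concat_command(list_file, output):
    """Build (but do not run) the ffmpeg command that merges the segments."""
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", output]
```

Each segment would be synthesized with its own voice code and saved to a file; the returned command list can then be passed to `subprocess.run` once the list file has been written.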

Notes

Free Quota: Baidu Speech Synthesis provides a free quota of 5 million characters/month (per the latest policy as of 2026); usage beyond that is pay-as-you-go.
Authentication Methods: Supports three authentication methods (API Key + Secret Key, access_token, IAM Key), detected automatically by the skill.
SSML Limitations: SSML text is limited to 1024 bytes. A Chinese character occupies more than one byte, so count bytes rather than characters. Each sentence should not exceed 120 characters.
Dependencies: The segment-merge solution requires ffmpeg to be installed (the skill detects a missing installation and prompts). pydub does not need to be installed.
Voice Expressiveness: Baidu's base voices are relatively flat; enhance dialogue expressiveness through text optimization, such as adding modal particles (语气词) and emotional descriptions.
Key Security: Do not hardcode API keys in code; always use environment variables or .env files.
Error Handling: Detailed guidance provided for authentication failures; refer to references/api_setup.md for help.
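The byte limit noted above is easy to exceed with Chinese text, since a character count understates the encoded size. A minimal sketch of a byte-aware splitter follows; `split_by_bytes` is a hypothetical helper, and the encoding the API counts against is an assumption here (UTF-8 by default, adjustable through the `encoding` parameter).

```python
def split_by_bytes(text, max_bytes=1024, encoding="utf-8"):
    """Split text into chunks whose encoded size never exceeds max_bytes."""
    chunks, current, size = [], [], 0
    for ch in text:
        n = len(ch.encode(encoding))
        # Start a new chunk when adding this character would pass the limit
        if current and size + n > max_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks
```

This splits at arbitrary character positions; in practice, cutting at sentence boundaries would produce more natural audio.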

Changelog

2026-03-31 (v1.2.3): Fixed bare except: statements in audio_merger.py; replaced with proper exception handling to improve debugging and error visibility.
2026-03-26 (v1.2.2): Added MIT LICENSE file; updated metadata to declare ffmpeg dependency; addressing ClawHub security warnings.
2026-03-26 (v1.2.1): Complete English translation of skill documentation; improved bilingual triggers for both English and Chinese users.
2026-03-26 (v1.2): Switched to ffmpeg instead of pydub, solving Python 3.13 compatibility issues; corrected Baidu API limitation description (only supports single-voice SSML); optimized documentation and default voice mapping.
2026-03-26 (v1.1): Enhanced authentication support, added IAM Key and direct access_token authentication, updated free quota information, improved error guidance.
2026-03-26 (v1.0): Initial release, supporting multi-speaker dialogue synthesis, SSML/segment-merge dual modes.

Usage Guidance

This skill appears to do what it claims: construct SSML, call Baidu TTS endpoints, and merge audio with ffmpeg. Before installing, consider:

(1) Keys you provide (BAIDU_API_KEY / BAIDU_SECRET_KEY or access_token/IAM key) will be used to call Baidu endpoints. Keep them secret and prefer least-privilege keys scoped to TTS.
(2) validate_config may require both API and Secret for its checks and may reject some valid IAM/access-token formats; if you use an alternative auth method, the validator might give false errors.
(3) The skill runs ffmpeg via subprocess and writes temporary files. Avoid feeding untrusted input files to prevent maliciously crafted inputs from causing problems.
(4) The included requirements.txt lists pydub and python-dotenv in addition to requests; install only what you need and review the code if you plan to run it in sensitive environments.

Overall the package is internally consistent with its stated purpose.

Capability Analysis

Type: OpenClaw Skill
Name: baidu-speech-synthesis
Version: 1.2.3

The skill bundle provides a robust implementation for Baidu Intelligent Cloud Speech Synthesis (TTS), supporting multi-role dialogues and SSML formatting. The code is well-structured, using legitimate Baidu API endpoints (aip.baidubce.com and tsn.baidu.com), the requests library for HTTP calls, and ffmpeg for audio processing. Some test scripts (e.g., test_client.py) contain a hardcoded placeholder IAM key, and others print partial keys for diagnostic purposes; these are clearly intended for local debugging and authentication troubleshooting rather than data exfiltration or malicious intent.

Capability Assessment

✓ Purpose & Capability

Name/description (Baidu TTS) matches required binaries (python3, ffmpeg), required env vars (BAIDU_API_KEY, BAIDU_SECRET_KEY), and included client/formatter/merger scripts. No unrelated credentials or surprising binaries are requested.

ℹ Instruction Scope

SKILL.md and the scripts instruct the agent to read input text files, build SSML, call Baidu token and TTS endpoints, produce temporary audio files and merge them with ffmpeg. These actions are within the stated purpose. Note: some helper scripts (validate_config, diagnose_auth) perform network calls to Baidu endpoints and inspect environment variables (including BAIDU_ACCESS_TOKEN if present); this is expected behavior but worth noting.

✓ Install Mechanism

No remote download/install spec is present (instruction-only install). Dependencies are typical Python libraries and ffmpeg. Minor inconsistency: SKILL.md suggests installing only requests, whereas requirements.txt also lists pydub and python-dotenv; this is not a security issue but is a documentation mismatch to be aware of.

✓ Credentials

Requested environment variables (BAIDU_API_KEY as primary, BAIDU_SECRET_KEY when needed) are proportionate for a Baidu TTS client. The skill supports access_token and IAM key formats as well. One caveat: validate_config enforces specific length/alphanumeric checks for API/Secret that may not match all valid key formats (e.g., bce-v3 IAM keys), causing false failures if using alternate auth methods.

✓ Persistence & Privilege

Skill is not force-included (always: false) and is user-invocable. It allows autonomous invocation (platform default) but does not request elevated or system-wide persistence or credentials for other skills.

Version History

v1.2.3

Fixed bare except statements in audio_merger.py for better error visibility and debugging

v1.2.2

- Added MIT LICENSE file.
- Updated metadata to explicitly declare ffmpeg as a dependency.
- Addressed ClawHub security warnings.

v1.2.0

- Switched to using ffmpeg instead of pydub for audio merging, ensuring compatibility with Python 3.13.
- Clarified that Baidu TTS API only supports SSML for single voice; improved related documentation.
- Enhanced skill documentation with clearer setup, usage, and technical explanations.
- Improved default voice mapping and added more robust error handling guidance.

Metadata

Slug baidu-speech-synthesis

Version 1.2.3

License MIT-0

All-time Installs 1

Active Installs 1

Total Versions 3

Frequently Asked Questions

What is Baidu Speech Synthesis?

Baidu Intelligent Cloud Speech Synthesis (TTS), supporting multi-role dialogue audio generation, SSML/segment-merge dual modes, speech rate/pitch adjustment. It is an AI Agent Skill for Claude Code / OpenClaw, with 178 downloads so far.

How do I install Baidu Speech Synthesis?

Run "/install baidu-speech-synthesis" in the OpenClaw or Claude Code chat to install it in one step. Before first use, make sure ffmpeg is installed and BAIDU_API_KEY is set.

Is Baidu Speech Synthesis free?

Yes, the skill itself is free and licensed under MIT-0; you can download, install, and use it at no cost. Calls to the Baidu API beyond the free quota are billed pay-as-you-go.

Which platforms does Baidu Speech Synthesis support?

Baidu Speech Synthesis is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Baidu Speech Synthesis?

It is built and maintained by guoxh (@guoxh); the current version is v1.2.3.
