Description

ElevenLabs voice API integration — TTS, sound effects, music generation, speech-to-text, voice isolation, and streaming. Use when building voice-enabled apps...

Usage Instructions (SKILL.md)

ElevenLabs Toolkit

Name: Elevenlabs Toolkit
Author: nissan

Programmatic access to all 7 ElevenLabs API capabilities via FastAPI endpoints or standalone Python functions.

When to Use This / When NOT to Use This

Use ElevenLabs when:

Generating high-quality narration audio for videos, demos, or content (especially with Rachel or a consistent character voice)
Building a voice-enabled app that needs streamed speech in real-time
Transcribing audio files (STT/Scribe)
Generating ambient sound effects or background music from text descriptions
Isolating clean voice from a noisy recording

Do NOT use ElevenLabs when:

You need fast/cheap TTS with no quality bar — use local TTS instead (see below)
You're offline or the API key isn't available
You're generating large volumes of test audio and don't want to burn character quota

ElevenLabs vs Local TTS (kokoro / chatterbox)

Criteria	ElevenLabs	Local TTS (kokoro/chatterbox)
Voice quality	★★★★★ — natural, expressive	★★★ — good but robotic edges
Cost	Chars deducted from monthly quota	Free, unlimited
Latency	~300–800ms API round-trip	~50–200ms local inference
Voice consistency	Named voices (Rachel etc.) persist	Model-dependent
Offline use	❌ Requires internet + API key	✅ Fully local
Best for	Final narration, published content	Drafts, testing, high-volume batch

Rule of thumb: Use ElevenLabs for anything that will be seen/heard by a user. Use local TTS for drafts, tests, and volume work.

Capabilities

Tool	Endpoint	What It Does
Voices	GET /api/voices	Browse available voices with metadata
TTS	POST /api/voice/tts	Batch text-to-speech (any voice, any language)
TTS Stream	WS /api/voice/stream	Real-time WebSocket TTS streaming
Sound Effects	POST /api/voice/sfx	Generate ambient audio from text prompts
Music	POST /api/voice/music	Generate background music from descriptions
STT (Scribe)	POST /api/voice/stt	Transcribe audio with language detection
Voice Isolation	POST /api/voice/isolate	Extract clean voice from noisy audio

Known Voice IDs

These are confirmed voices used in OpenClaw workflows. Always prefer these over browsing the full list:

Voice	Voice ID	Best For
Rachel	`21m00Tcm4TlvDq8ikWAM`	Default narration — clear, warm, American English
Adam	`pNInz6obpgDQGcFmaJgB`	Male narration, authoritative tone
Domi	`AZnzlk1XvdvUeBnXmlld`	Energetic, conversational
Bella	`EXAVITQu4vr4xnSDxMaL`	Soft, gentle narration

Default for all narration tasks: Use Rachel (21m00Tcm4TlvDq8ikWAM) unless explicitly specified otherwise.
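The table and default rule above can be captured in a small lookup helper, so callers never pass a display name where an ID is expected. This is a minimal sketch: `KNOWN_VOICES`, `DEFAULT_VOICE`, and `resolve_voice_id` are hypothetical names, not part of the toolkit; only the IDs come from the table.

```python
# Voice IDs copied from the "Known Voice IDs" table above.
KNOWN_VOICES = {
    "rachel": "21m00Tcm4TlvDq8ikWAM",
    "adam": "pNInz6obpgDQGcFmaJgB",
    "domi": "AZnzlk1XvdvUeBnXmlld",
    "bella": "EXAVITQu4vr4xnSDxMaL",
}
DEFAULT_VOICE = "rachel"


def resolve_voice_id(name=None):
    """Return the voice ID for a known voice name (hypothetical helper).

    Falls back to Rachel when no name is given. Raises KeyError for a
    name that is not in the table instead of guessing an ID.
    """
    key = (name or DEFAULT_VOICE).strip().lower()
    return KNOWN_VOICES[key]
```

Calling `resolve_voice_id()` with no argument yields Rachel's ID, which matches the default narration rule.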

To get the full current list from the API:

curl -s -H "xi-api-key: $ELEVENLABS_API_KEY" https://api.elevenlabs.io/v1/voices | python3 -m json.tool

Quick Start

import os

import httpx

BASE = "http://localhost:8000"  # Your FastAPI app
KEY = os.environ["ELEVENLABS_API_KEY"]  # Fails fast with KeyError if the key is not set

# Get voices
voices = httpx.get(f"{BASE}/api/voices").json()

# Generate speech
audio = httpx.post(f"{BASE}/api/voice/tts", json={
    "text": "Hello world",
    "voice_id": voices[0]["voice_id"],
    "model_id": "eleven_multilingual_v2"
}).content  # Returns raw audio bytes

# Generate sound effects
sfx = httpx.post(f"{BASE}/api/voice/sfx", json={
    "prompt": "ocean waves on a quiet beach at night"
}).content

Audio Output Format

TTS and SFX endpoints return raw audio bytes (not base64, not JSON).

# Correct: .content gives you bytes
audio_bytes = response.content  # type: bytes

# Save to file
with open("output.mp3", "wb") as f:
    f.write(audio_bytes)

# The file format is MP3 by default
# File size estimate: ~1 MB per minute of speech at standard quality

What you get back from each endpoint:

Endpoint	Response type	How to handle
POST /api/voice/tts	`bytes` (MP3)	Write directly to `.mp3` file
POST /api/voice/sfx	`bytes` (MP3)	Write directly to `.mp3` file
POST /api/voice/music	`bytes` (MP3)	Write directly to `.mp3` file
POST /api/voice/stt	`JSON`	`{"text": "transcription...", "language": "en"}`
POST /api/voice/isolate	`bytes` (MP3)	Write directly to `.mp3` file
GET /api/voices	`JSON`	List of `{voice_id, name, labels, ...}`
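Since some endpoints return JSON and others return audio bytes, a caller can branch on the response's Content-Type instead of hard-coding the endpoint. The sketch below assumes the wrapper sets `application/json` on JSON responses, which the source does not state explicitly; `decode_body` is a hypothetical helper.

```python
import json


def decode_body(content_type, body):
    """Return parsed JSON for JSON responses, raw bytes otherwise.

    Assumes JSON endpoints are labeled `application/json`; anything else
    (for example audio) is passed through untouched.
    """
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/json":
        return json.loads(body)
    return body
```

With httpx this would be called as `decode_body(response.headers.get("content-type", ""), response.content)`.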

Voice Selection Guide

English only: Use eleven_turbo_v2_5 — faster, no accent bleed
Multilingual: Use eleven_multilingual_v2 — supports 29 languages
Accent warning: Multilingual model can bleed accents across languages. If an English voice sounds Japanese, switch to turbo.
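The selection rule above is simple enough to encode directly. A minimal sketch, where `pick_model` is a hypothetical helper and language codes are assumed to be short strings such as "en" or "ja":

```python
def pick_model(languages):
    """Pick a model ID following the guide above.

    English-only (or unspecified) content gets the turbo model;
    anything involving another language gets the multilingual model.
    """
    langs = {lang.strip().lower() for lang in languages}
    if langs <= {"en"}:  # empty set or English only
        return "eleven_turbo_v2_5"
    return "eleven_multilingual_v2"
```

The returned string can be passed as `model_id` in the TTS request body shown in Quick Start.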

Quota Management

ElevenLabs charges per character for TTS. Key patterns:

Cache aggressively — identical text + voice = identical audio
Use prompt-cache skill for SHA-256 dedup before calling TTS
A 6-scene children's story ≈ 2,000 characters
Free tier: 10k chars/month. Starter: 30k. Creator: 100k.
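The caching pattern above can be sketched as a SHA-256 keyed file cache in front of whatever function performs the TTS call. This is not the prompt-cache skill itself, only an illustration of the idea; `cache_key`, `cached_tts`, and the `synthesize` callback are hypothetical names, and the model ID is included in the key on the assumption that different models produce different audio.

```python
import hashlib
from pathlib import Path


def cache_key(text, voice_id, model_id):
    """Derive a stable hex digest from the inputs that determine the audio."""
    payload = "\x1f".join([model_id, voice_id, text]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def cached_tts(text, voice_id, model_id, synthesize, cache_dir):
    """Return cached audio bytes if present, else synthesize and store them.

    `synthesize(text, voice_id, model_id)` is supplied by the caller and
    must return bytes; it is only invoked on a cache miss.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / (cache_key(text, voice_id, model_id) + ".mp3")
    if path.exists():
        return path.read_bytes()  # Cache hit: no characters spent
    audio = synthesize(text, voice_id, model_id)
    path.write_bytes(audio)
    return audio
```

In practice `synthesize` would wrap the `POST /api/voice/tts` call, so repeated requests for the same scene text cost quota only once.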

Integration

Copy scripts/elevenlabs_api.py into your FastAPI app and mount the router:

from fastapi import FastAPI
from elevenlabs_api import router

app = FastAPI()
app.include_router(router)

Set ELEVENLABS_API_KEY in your environment. All endpoints handle errors gracefully with proper HTTP status codes.

What If the FastAPI Server Isn't Running?

The Quick Start examples assume http://localhost:8000 is live. If it's not:

# Check if the server is up before calling
import subprocess
import time

import httpx

try:
    httpx.get("http://localhost:8000/health", timeout=2.0)
except (httpx.ConnectError, httpx.ConnectTimeout):
    # Server is not running: start it first.
    # Assumes elevenlabs_api exposes an `app` object; otherwise point
    # uvicorn at the module that mounts the router.
    subprocess.Popen(["uvicorn", "elevenlabs_api:app", "--port", "8000"])
    time.sleep(2)  # Give it a moment to bind

Or call the ElevenLabs API directly without the FastAPI wrapper — the scripts/elevenlabs_api.py functions are importable standalone:

from elevenlabs_api import generate_tts  # if the module exposes standalone functions
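The fixed two-second sleep in the startup snippet above can be too short on a slow machine and wasteful on a fast one. One alternative is to poll until the health check succeeds or a deadline passes. The sketch below takes the probe as a callable so it stays independent of any HTTP client; `wait_until_ready` is a hypothetical helper.

```python
import time


def wait_until_ready(probe, timeout=10.0, interval=0.25, errors=(Exception,)):
    """Call `probe()` repeatedly until it stops raising or `timeout` elapses.

    Returns True as soon as one probe succeeds, False if the deadline
    passes first. Only exceptions listed in `errors` are treated as
    "not ready yet"; anything else propagates to the caller.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            probe()
            return True
        except errors:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

With httpx, a probe could be `lambda: httpx.get("http://localhost:8000/health", timeout=2.0)` with `errors=(httpx.TransportError,)`, called right after the `subprocess.Popen` line.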

Error Handling: API Key and Rate Limits

Missing or invalid API key:

httpx.HTTPStatusError: 401 Unauthorized
{"detail": {"status": "unauthorized", "message": "Invalid API key"}}

→ Check ELEVENLABS_API_KEY is set: echo $ELEVENLABS_API_KEY
→ Retrieve from 1Password: op read "op://OpenClaw/ElevenLabs API Credentials/credential"

Rate limited (429):

{"detail": {"status": "too_many_requests", "message": "Too many requests"}}

→ Wait and retry with exponential backoff. ElevenLabs rate limits are per-minute on the free/starter tiers.
→ On Creator tier and above, limits are much higher. Check your tier in the ElevenLabs dashboard.
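A retry loop with exponential backoff might look like the sketch below. It assumes the caller's request function raises a `RateLimited` exception when it sees HTTP 429; that exception class, `with_backoff`, and the delay values are all illustrative choices, not part of the toolkit.

```python
import random
import time


class RateLimited(Exception):
    """Raised by the caller's request function on HTTP 429 (hypothetical)."""


def with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=30.0,
                 sleep=time.sleep):
    """Run `call()` and retry on RateLimited with doubling delays.

    Delays grow as base_delay * 2**attempt, capped at max_delay, plus up
    to 10% random jitter. The last failure is re-raised to the caller.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 10))
```

The `sleep` parameter exists so the loop can be exercised in tests without actually waiting.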

Quota exhausted:

{"detail": {"status": "quota_exceeded", "message": "Quota exceeded"}}

→ Character quota for the month is used up. Either wait for monthly reset or upgrade tier.
→ Check current usage: curl -s -H "xi-api-key: $KEY" https://api.elevenlabs.io/v1/user/subscription
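To avoid hitting the quota error mid-batch, the subscription payload can be checked before submitting text. The sketch below assumes the JSON returned by the subscription endpoint carries `character_count` (used) and `character_limit` (allowed) fields; verify the field names against a real response before relying on them. Both function names are hypothetical.

```python
def remaining_characters(subscription):
    """Return how many characters are left in the current period.

    Assumes `subscription` is the parsed JSON from the subscription
    endpoint with `character_count` and `character_limit` keys.
    """
    used = int(subscription["character_count"])
    limit = int(subscription["character_limit"])
    return max(0, limit - used)


def fits_in_quota(text, subscription):
    """Return True if `text` can be synthesized without exceeding the quota."""
    return len(text) <= remaining_characters(subscription)
```

For example, a 2,000-character story fits a free-tier account only while at least 2,000 of its 10k characters remain.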

Files

scripts/elevenlabs_api.py — FastAPI router with all 7 endpoints

Common Mistakes

Treating the response as JSON when it's bytes
- ❌ response.json() on a TTS call → JSONDecodeError
- ✅ response.content → raw bytes, then write to .mp3
Using the wrong voice ID
- ElevenLabs voice IDs are opaque strings, not names
- ❌ "voice_id": "Rachel" → 404 or wrong voice
- ✅ "voice_id": "21m00Tcm4TlvDq8ikWAM" (Rachel's actual ID)
Calling TTS for large batches without caching
- Identical text+voice always produces identical audio — don't re-generate what's already cached
- Burns character quota unnecessarily
Using multilingual model for English-only content
- eleven_multilingual_v2 is slower and can produce accent artifacts on English-only text
- Use eleven_turbo_v2_5 for English-only work
Not checking the FastAPI server is running before calling
- httpx.ConnectError is confusing if you forget the local server dependency
- Add a health check or start-server step before calling endpoints

Security Notes

This skill uses patterns that may trigger automated security scanners:

base64: Used for encoding audio/binary data in API responses (standard practice for media APIs)
UploadFile: FastAPI's built-in file upload parameter for STT/voice isolation endpoints
"system prompt": Refers to configuring agent instructions, not prompt injection

Security Recommendations

This skill appears to implement the ElevenLabs features it advertises, but you should be cautious before installing or running it:

1) The package includes Python code that requires additional libraries (fastapi, httpx, websockets, mistralai, etc.) but provides no install instructions. Ask the author for a requirements file or installation spec, or prepare to install the dependencies yourself.
2) The code can optionally call Mistral if MISTRAL_API_KEY is present, but that env var is not declared. If you do not want it to call Mistral, ensure MISTRAL_API_KEY is not set in your environment.
3) The skill needs outbound network access and your ELEVENLABS_API_KEY. Never share that key with untrusted code.
4) Confirm expected behavior (for streaming, STT uploads, and conversational features) in a safe environment before using it in production.

If you need higher assurance, ask the author to:

(a) declare all required env vars (including optional ones),
(b) provide a requirements.txt or install spec that uses trusted package sources, and
(c) document exactly when additional services (like Mistral) will be invoked.

Functional Analysis

Type: OpenClaw Skill
Name: elevenlabs-toolkit
Version: 1.0.2

The elevenlabs-toolkit skill provides a legitimate FastAPI-based integration for ElevenLabs voice and audio services. The code in scripts/elevenlabs_api.py correctly handles API keys via environment variables and communicates only with official ElevenLabs and Mistral AI endpoints. The documentation in SKILL.md is informative and includes proactive security notes regarding common false-positive triggers like base64 encoding and file uploads.

Capability Assessment

✓ Purpose & Capability

The name/description (ElevenLabs TTS, STT, SFX, music, streaming, voice isolation) align with the code and SKILL.md: the code proxies to api.elevenlabs.io endpoints for voices, text-to-speech, sound generation, speech-to-text, isolation, and streaming. ELEVENLABS_API_KEY is declared and used as the primary credential.

ℹ Instruction Scope

SKILL.md and the included Python implement only the declared ElevenLabs features and expose FastAPI endpoints for them. However, the code also implements a conversational 'story concierge' that calls a third-party Mistral client if MISTRAL_API_KEY is present — this behavior is not declared in requires.env and broadens the runtime scope. SKILL.md's metadata mentions base64 usage but the implementation returns raw bytes (minor inconsistency).

⚠ Install Mechanism

There is no install spec, yet the included code depends on multiple Python packages (fastapi, starlette, httpx, websockets, mistralai, etc.). Without a declared install step, an environment running this skill may lack required dependencies or the operator may need to install them manually; that absence is an operational and supply-chain mismatch (not necessarily malicious but worth noting).

⚠ Credentials

ELEVENLABS_API_KEY is appropriate and declared as primary. The code optionally reads MISTRAL_API_KEY and imports a 'mistralai' client to call another service, but MISTRAL_API_KEY is not listed in requires.env. Requesting or using additional service credentials without declaration is a proportionality/information-gap concern.

✓ Persistence & Privilege

The skill does not request always:true, does not modify other skills, and has no declared persistent/system-level privileges. It performs outbound network calls to ElevenLabs (expected for the stated purpose).

Version History

v1.0.2

Add security_notes explaining base64 audio encoding, UploadFile type, and system prompt config context

v1.0.1

- Added explicit security notes field to metadata, clarifying use of base64 encoding and FastAPI's UploadFile.
- Version updated to 1.0.1.
- Documentation and integration details remain unchanged.

v1.0.0

Initial release — extracted from Sandman Tales v2 hackathon

Metadata

Slug elevenlabs-toolkit

Version 1.0.2

License MIT-0

Total installs 6

Current installs 5

Number of versions 3

Frequently Asked Questions