Description

Generate spoken audio from text using Google's Gemini TTS models (default is Gemini 3.1 Flash TTS Preview, with fallback to Gemini 2.5 Flash/Pro preview TTS)...

README (SKILL.md)

Gemini TTS

Name: Google Gemini TTS
Author: shubhamsaboo

Generate speech audio from text using Gemini TTS models. The default is Gemini 3.1 Flash TTS Preview, and the script still supports Gemini 2.5 preview TTS models when you pass -m.

What this skill does

Single-speaker text to speech
Two-speaker podcast-style audio
Style control with natural language prompts
WAV output that can be sent directly in chat or used in apps

Files

scripts/gemini_tts.sh: CLI wrapper around the Gemini REST API

Quick start

# Show all options
scripts/gemini_tts.sh --help

# Single speaker, default voice (Kore)
scripts/gemini_tts.sh "Hello, welcome to the show!"

# Pick a voice
scripts/gemini_tts.sh -v Puck "This is Puck speaking."

# With style control
scripts/gemini_tts.sh -s "Say in a warm, calm tone:" "Take a deep breath."

# Save to a specific file
scripts/gemini_tts.sh -o /tmp/greeting.wav "Hey there!"

# Multi-speaker conversation
scripts/gemini_tts.sh --multi "Host:Kore,Guest:Puck" \
  "Host: Welcome to the podcast! Guest: Thanks for having me."

The script prints the output WAV file path.
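Because the script prints the output path, it composes with ordinary shell loops. The following is a minimal sketch for batch narration, assuming one clip per non-empty line of a text file. `narrate_lines` and the `TTS_CMD` override are hypothetical names introduced here for illustration; they are not part of the skill.

```shell
# Hypothetical override so the command can be swapped out (e.g. for testing).
TTS_CMD="${TTS_CMD:-scripts/gemini_tts.sh}"

# narrate_lines FILE OUTDIR: create one WAV per non-empty line of FILE.
# Each output path printed by the TTS command is passed through on stdout.
narrate_lines() {
  local file="$1" outdir="$2" n=0 line
  mkdir -p "$outdir" || return 1
  while IFS= read -r line || [ -n "$line" ]; do
    [ -n "$line" ] || continue          # skip blank lines
    n=$((n + 1))
    # stdin comes from /dev/null so the child cannot consume the loop's input
    "$TTS_CMD" -o "$outdir/clip-$n.wav" "$line" </dev/null || return 1
  done < "$file"
}
```

Usage would look like `narrate_lines script.txt /tmp/clips`. The `</dev/null` redirect matters because child processes such as ffmpeg can read stdin and would otherwise swallow the remaining lines of the file.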

Models

Model	Best for
`gemini-3.1-flash-tts-preview` (default)	Recommended default: low latency, natural output, expressive narration
`gemini-2.5-flash-preview-tts`	Backward-compatible fast preview model
`gemini-2.5-pro-preview-tts`	Long-form narration and higher-end creative work

Current note: Gemini 3.1 Flash TTS Preview is live and should be the default path for this skill. Gemini 2.5 preview TTS models remain useful as compatibility fallbacks.

Preview model note: gemini-3.1-flash-tts-preview is a preview model. If Google renames or retires it, pass -m gemini-2.5-flash-preview-tts as a fallback, or check the current model list.

Switch model examples:

scripts/gemini_tts.sh -m gemini-2.5-pro-preview-tts "Your text here"
scripts/gemini_tts.sh -m gemini-2.5-flash-preview-tts "Your text here"
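The fallback advice in the preview model note can be automated with a small wrapper. This is a sketch only: it assumes the script exits non-zero when a model is unavailable, which this README does not confirm. `tts_with_fallback` and `TTS_CMD` are hypothetical names.

```shell
# Hypothetical override so the command can be swapped out (e.g. for testing).
TTS_CMD="${TTS_CMD:-scripts/gemini_tts.sh}"

# tts_with_fallback TEXT: try each model in order and print the first
# successful output path. Assumes a failed model yields a non-zero exit.
tts_with_fallback() {
  local text="$1" model out
  for model in \
    gemini-3.1-flash-tts-preview \
    gemini-2.5-flash-preview-tts \
    gemini-2.5-pro-preview-tts
  do
    if out="$("$TTS_CMD" -m "$model" "$text")"; then
      printf '%s\n' "$out"
      return 0
    fi
    echo "model $model failed, trying next" >&2
  done
  return 1
}
```

Since the 30 prebuilt voices are shared across these models, the same idea extends to passing `-v` through unchanged.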

Voices

Available prebuilt voices:

Zephyr, Puck, Charon, Kore, Fenrir, Leda, Orus, Aoede, Callirrhoe, Autonoe, Enceladus, Iapetus, Umbriel, Algieba, Despina, Erinome, Gacrux, Pulcherrima, Achird, Zubenelgenubi, Vindemiatrix, Sadachbia, Sadaltager, Sulafat, Laomedeia, Achernar, Schedar, Rasalgethi, Nashira, Enif

The same 30-voice library is shared between gemini-3.1-flash-tts-preview and the gemini-2.5-flash-preview-tts / gemini-2.5-pro-preview-tts fallbacks, so a voice you pick for the default model will still work if you drop back to a fallback via -m.
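A misspelled voice name is easy to catch before spending an API call. The sketch below checks a name against the list above; `GEMINI_VOICES` and `is_valid_voice` are hypothetical helpers, and the exact-case match is an assumption (the README does not say whether the API is case-sensitive).

```shell
# The 30 prebuilt voices listed in this README, space-separated.
GEMINI_VOICES="Zephyr Puck Charon Kore Fenrir Leda Orus Aoede Callirrhoe Autonoe \
Enceladus Iapetus Umbriel Algieba Despina Erinome Gacrux Pulcherrima Achird \
Zubenelgenubi Vindemiatrix Sadachbia Sadaltager Sulafat Laomedeia Achernar \
Schedar Rasalgethi Nashira Enif"

# is_valid_voice NAME: succeed only if NAME is spelled exactly as listed.
is_valid_voice() {
  case "$1" in
    ""|*" "*) return 1 ;;               # reject empty or multi-word input
  esac
  case " $GEMINI_VOICES " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_valid_voice "$voice" || echo "unknown voice: $voice" >&2` can guard a call to `scripts/gemini_tts.sh -v "$voice"`.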

Style control

Gemini 3.1 Flash TTS reads plain transcripts naturally, but gives you two complementary ways to steer the delivery when you want more control.

Inline audio tags

Drop bracketed directions into the transcript. They modify what follows, can appear anywhere, and can stack or repeat across a single script:

[excitedly] Massive update today — [whispers] but keep it between us. [laughs]

Tags are open-ended; anything in [ ] is treated as a direction to the model. A useful starting set:

Emotion — [excitedly], [bored], [reluctantly], [amazed], [curious], [mischievously], [panicked], [sarcastic], [serious], [tired], [trembling]
Pace and volume — [very fast], [very slowly], [asmr], [deep and loud shouting], [whispers]
Non-verbal — [gasp], [giggles], [sighs], [snorts], [cough], [laughs], [crying]
Character / style — [like dracula], [like a dog], [singing], [sarcastically, one painfully slow word at a time]

Structured context prompt

For longer pieces where you want a consistent persona, prepend an AUDIO PROFILE / SCENE / DIRECTOR'S NOTES / TRANSCRIPT block. The four headers are load-bearing — the model uses them to separate performance context from the script it should actually speak:

# AUDIO PROFILE: Jaz, London morning-show radio DJ

## THE SCENE: 10 PM, neon-lit studio, "ON AIR" tally blazing.
Jaz is bouncing on their heels, hands on the faders, infectious energy.

### DIRECTOR'S NOTES
Style: vocal smile always audible; punchy consonants; elongated vowels on excitement words.
Accent: Brixton, London.
Pace: energetic, bouncing cadence, no dead air.

#### TRANSCRIPT
[excitedly] Yes, massive vibes in the studio! [shouting] Turn it up!

Inline tags inside #### TRANSCRIPT override the baseline direction when you want a specific beat.
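For repeated use, the four-header block can be assembled by a small helper. This is a sketch: `build_prompt` is a hypothetical function, and passing the whole block as the script's text argument is an assumption rather than documented behaviour.

```shell
# build_prompt PROFILE SCENE NOTES TRANSCRIPT
# Prints a structured prompt using the four headers described above.
build_prompt() {
  cat <<EOF
# AUDIO PROFILE: $1

## THE SCENE: $2

### DIRECTOR'S NOTES
$3

#### TRANSCRIPT
$4
EOF
}

# Assumed usage (the whole block is passed as the text argument):
#   scripts/gemini_tts.sh "$(build_prompt "Jaz, London radio DJ" \
#     "10 PM, neon-lit studio" "Pace: energetic" "[excitedly] Turn it up!")"
```

Keeping the headers in one place avoids accidentally dropping one of them when prompts are edited by hand.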

Tips

Keep the script and direction coherent — the speaker, what is said, and how it is said should agree.
Don't overspecify. Give the model space to fill gaps; it reads better.
A simple preamble ("Say cheerfully: ...") still works for quick one-offs, but inline tags give you per-phrase control and structured prompts give you persona consistency.

Full prompting reference: Gemini speech-generation docs.

Multi-speaker

Up to 2 speakers. Use --multi "Name1:Voice1,Name2:Voice2" and make sure the speaker names in the text match.
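The two constraints above (at most two speakers, names matching the text) can be checked locally before calling the API. The following sketch uses a hypothetical `check_multi_spec` helper; treating "the name followed by a colon appears somewhere in the text" as the matching rule is an assumption based on the Quick start example.

```shell
# check_multi_spec SPEC TEXT
# SPEC looks like "Name1:Voice1,Name2:Voice2". Succeeds when SPEC has one or
# two well-formed pairs and every speaker name appears in TEXT as "Name:".
# The body runs in a subshell so the IFS and globbing changes stay local.
check_multi_spec() (
  spec=$1 text=$2 count=0
  IFS=','; set -f
  for pair in $spec; do
    name=${pair%%:*}
    voice=${pair#*:}
    count=$((count + 1))
    if [ -z "$name" ] || [ -z "$voice" ] || [ "$voice" = "$pair" ]; then
      echo "malformed pair: $pair" >&2; exit 1
    fi
    case "$text" in
      *"$name:"*) ;;
      *) echo "speaker '$name' not found in text" >&2; exit 1 ;;
    esac
  done
  if [ "$count" -lt 1 ] || [ "$count" -gt 2 ]; then
    echo "expected 1 or 2 speakers, got $count" >&2; exit 1
  fi
)
```

A typical guard would be `check_multi_spec "$spec" "$text" && scripts/gemini_tts.sh --multi "$spec" "$text"`.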

Supported languages

70+ languages are supported, including Arabic, Bengali, Chinese, English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Turkish, Ukrainian, Urdu, Vietnamese, and many more. See the Gemini speech-generation docs for the full locale list.

Limitations

Audio output only
Maximum 2 speakers in multi-speaker mode
Preview model names may change
No SSML support
No custom voice cloning

Verification

Basic smoke test once your API key is set:

export GEMINI_API_KEY=your_key_here   # GOOGLE_API_KEY is also accepted
scripts/gemini_tts.sh -o /tmp/gemini-test.wav "This is a Gemini TTS smoke test."
file /tmp/gemini-test.wav

Expected result: a playable WAV file is created (24 kHz mono, 16-bit PCM WAV).
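Where `file` is unavailable or its output is ambiguous, the format can be read straight from the WAV header. This sketch assumes the canonical layout in which the `fmt ` chunk immediately follows the RIFF header, so channels, sample rate, and bit depth sit at byte offsets 22, 24, and 34. `le_uint` and `wav_format` are hypothetical helpers.

```shell
# le_uint FILE OFFSET NBYTES: decode a little-endian unsigned integer.
# Bytes are read one at a time, so the result does not depend on host endianness.
le_uint() {
  local value=0 bits=0 byte
  for byte in $(od -An -t u1 -j "$2" -N "$3" "$1"); do
    value=$((value + ($byte << $bits)))
    bits=$((bits + 8))
  done
  echo "$value"
}

# wav_format FILE: print "<rate> Hz, <n> channel(s), <bits>-bit".
# Assumes the fmt chunk starts at byte 12 (canonical WAV header).
wav_format() {
  local file="$1" magic
  magic="$(dd if="$file" bs=1 count=4 2>/dev/null)"
  [ "$magic" = "RIFF" ] || { echo "not a RIFF file: $file" >&2; return 1; }
  echo "$(le_uint "$file" 24 4) Hz, $(le_uint "$file" 22 2) channel(s), $(le_uint "$file" 34 2)-bit"
}
```

For the smoke test above, `wav_format /tmp/gemini-test.wav` should report 24000 Hz, 1 channel, and 16-bit if the output matches the expected format.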

Usage Guidance

This skill appears to do exactly what it says: call Google's Gemini TTS API and save WAV output after converting raw PCM with ffmpeg. Before installing, confirm you trust the source (owner/slug) and intend to provide a GEMINI_API_KEY: the key will be sent to https://generativelanguage.googleapis.com. Prefer setting a dedicated GEMINI_API_KEY rather than reusing a shared GOOGLE_API_KEY environment variable to reduce accidental key sharing across tools. Also review the script if you want to change output paths or retention of temporary files, and be aware API usage may incur billing on the associated Google account.
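To see which variable a session would end up using, a check like the following can run before the script. It is a sketch: `which_gemini_key` is a hypothetical helper, and giving GEMINI_API_KEY precedence over GOOGLE_API_KEY is an assumption, since this page only says the latter is accepted as an alias. The function prints the variable name, never the key itself.

```shell
# which_gemini_key: print the NAME of the variable that would supply the key.
# Assumes GEMINI_API_KEY takes precedence over the GOOGLE_API_KEY alias.
which_gemini_key() {
  if [ -n "${GEMINI_API_KEY:-}" ]; then
    echo "GEMINI_API_KEY"
  elif [ -n "${GOOGLE_API_KEY:-}" ]; then
    echo "warning: using shared GOOGLE_API_KEY; prefer a dedicated GEMINI_API_KEY" >&2
    echo "GOOGLE_API_KEY"
  else
    echo "error: no API key set; export GEMINI_API_KEY" >&2
    return 1
  fi
}
```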

Capability Analysis

Type: OpenClaw Skill
Name: google-gemini-tts
Version: 1.0.3

The skill is a legitimate wrapper for the Google Gemini Text-to-Speech API. The shell script (scripts/gemini_tts.sh) safely handles API authentication via environment variables, constructs JSON payloads using jq to avoid injection, and uses ffmpeg for audio conversion. No evidence of data exfiltration, malicious execution, or harmful instructions was found.

Capability Tags

requires-sensitive-credentials

Capability Assessment

✓ Purpose & Capability

Name/description (Gemini TTS) align with required binaries (curl, jq, base64, ffmpeg) and a GEMINI_API_KEY. The script calls the Google Generative Language TTS endpoint and performs local audio conversion — all expected for a TTS wrapper.

✓ Instruction Scope

SKILL.md and the shipped script limit actions to building a TTS request, POSTing to https://generativelanguage.googleapis.com, decoding returned base64 audio, converting PCM→WAV with ffmpeg, and writing a local output file. The script checks only the declared binaries and the GEMINI_API_KEY (or alias) and does not read unrelated system files or other environment secrets.
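The request, decode, and convert steps described above can be sketched as follows. This is not the shipped script: the request body follows Google's public REST example for the Gemini 2.5 preview TTS models, and it is an assumption that the 3.1 preview model accepts the same shape. `build_tts_payload` and `tts_to_wav` are hypothetical names.

```shell
# Assumption: endpoint and body mirror Google's public REST example for the
# 2.5 preview TTS models; the shipped script may differ.
API_BASE="https://generativelanguage.googleapis.com/v1beta/models"

# build_tts_payload TEXT VOICE: emit the JSON request body.
# jq --arg escapes quotes and backslashes, so TEXT cannot break out of the JSON.
build_tts_payload() {
  jq -n --arg text "$1" --arg voice "$2" '{
    contents: [{parts: [{text: $text}]}],
    generationConfig: {
      responseModalities: ["AUDIO"],
      speechConfig: {voiceConfig: {prebuiltVoiceConfig: {voiceName: $voice}}}
    }
  }'
}

# tts_to_wav MODEL VOICE TEXT OUT.wav: POST, decode base64 PCM, wrap as WAV.
tts_to_wav() {
  local model="$1" voice="$2" text="$3" out="$4" pcm status
  pcm="$(mktemp)" || return 1
  build_tts_payload "$text" "$voice" |
    curl -sS --fail -X POST "$API_BASE/${model}:generateContent" \
      -H "x-goog-api-key: $GEMINI_API_KEY" \
      -H "Content-Type: application/json" \
      --data-binary @- |
    jq -r '.candidates[0].content.parts[0].inlineData.data // empty' |
    base64 --decode > "$pcm"
  # Only the last pipeline stage sets $?, so also require non-empty PCM.
  if [ -s "$pcm" ]; then
    # Raw audio is signed 16-bit little-endian, 24 kHz, mono.
    ffmpeg -loglevel error -y -f s16le -ar 24000 -ac 1 -i "$pcm" "$out"
    status=$?
  else
    status=1
  fi
  rm -f "$pcm"
  return "$status"
}
```

Building the body with `jq --arg` rather than string interpolation is what keeps quotes or backslashes in the text from corrupting the JSON.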

✓ Install Mechanism

No install spec (instruction-only) and the included files are simple shell/script/text files. Nothing is downloaded or extracted at install time, so there is no high-risk installer behavior.

ℹ Credentials

Only GEMINI_API_KEY is required (the script also accepts GOOGLE_API_KEY as an alias). This is appropriate for a Google API wrapper. Note: using GOOGLE_API_KEY as a shared env var may overlap with other tools that also read that name; prefer a dedicated GEMINI_API_KEY if you want to limit exposure.

✓ Persistence & Privilege

always is false and the skill does not attempt to persist itself or change other skills/configs. The agent-autonomy flag is default; combined with the limited scope and single API key requirement this is standard for an invocable TTS skill.

Version History

v1.0.3

Rewrite Style control docs to match Gemini 3.1 Flash TTS prompting guide: distinguish inline audio tags from structured AUDIO PROFILE/SCENE/DIRECTOR'S NOTES/TRANSCRIPT blocks, list real tag examples (emotion, pace, non-verbal, character), drop old preamble-only bullets.

v1.0.2

Dependency check parity: script's runtime deps loop now verifies base64 alongside curl, jq, ffmpeg, matching requires.bins.

v1.0.1

Security hardening: remove gcloud ADC fallback (no more ambient cloud credential use). Fix metadata/code parity: declare only GEMINI_API_KEY as required (GOOGLE_API_KEY still accepted as a script-level alias), add base64 to requires.bins. Fix display-name capitalization (TTS).

v1.0.0

Initial publish: Gemini 3.1 Flash TTS with 2.5 Flash/Pro fallbacks

Metadata

Slug google-gemini-tts

Version 1.0.3

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 4

Frequently Asked Questions

What is Google Gemini TTS?

Generate spoken audio from text using Google's Gemini TTS models (default is Gemini 3.1 Flash TTS Preview, with fallback to Gemini 2.5 Flash/Pro preview TTS)... It is an AI Agent Skill for Claude Code / OpenClaw, with 187 downloads so far.

How do I install Google Gemini TTS?

Run "/install google-gemini-tts" in the OpenClaw or Claude Code chat to install it in one step. To use it, set GEMINI_API_KEY and make sure curl, jq, base64, and ffmpeg are available.

Is Google Gemini TTS free?

Yes, the skill itself is free and licensed under MIT-0, so you can download, install, and use it at no cost. Gemini API usage may still incur billing on the associated Google account.

Which platforms does Google Gemini TTS support?

Google Gemini TTS is cross-platform and runs anywhere OpenClaw / Claude Code is available, provided the curl, jq, base64, and ffmpeg binaries are present.

Who created Google Gemini TTS?

It is built and maintained by Shubham Saboo (@shubhamsaboo); the current version is v1.0.3.
