← Back to Skills Marketplace

Gemini STT

Name: Gemini STT
Author: araa47

by araa47 · GitHub ↗ · v1.1.0

linuxdarwin ✓ Security Clean

3114

Downloads

Stars

Active Installs

Versions

Install in OpenClaw

/install gemini-stt

Description

Transcribe audio files using Google's Gemini API or Vertex AI

README (SKILL.md)

Gemini Speech-to-Text Skill

Transcribe audio files using Google's Gemini API or Vertex AI. Default model is gemini-2.0-flash-lite for fastest transcription.

Authentication (choose one)

Option 1: Vertex AI with Application Default Credentials (Recommended)

gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID

The script will automatically detect and use ADC when available.

Option 2: Direct Gemini API Key

Set GEMINI_API_KEY in environment (e.g., ~/.env or ~/.clawdbot/.env)

Requirements

Python 3.10+ (no external dependencies)
Either GEMINI_API_KEY or gcloud CLI with ADC configured

Supported Formats

.ogg / .opus (Telegram voice messages)
.mp3
.wav
.m4a

Usage

# Auto-detect auth (tries ADC first, then GEMINI_API_KEY)
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg

# Force Vertex AI
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --vertex

# With a specific model
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --model gemini-2.5-pro

# Vertex AI with specific project and region
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --vertex --project my-project --region us-central1

# With Clawdbot media
python ~/.claude/skills/gemini-stt/transcribe.py ~/.clawdbot/media/inbound/voice-message.ogg

Options

Option	Description
`\x3Caudio_file>`	Path to the audio file (required)
`--model`, `-m`	Gemini model to use (default: `gemini-2.0-flash-lite`)
`--vertex`, `-v`	Force use of Vertex AI with ADC
`--project`, `-p`	GCP project ID (for Vertex, defaults to gcloud config)
`--region`, `-r`	GCP region (for Vertex, default: `us-central1`)

Supported Models

Any Gemini model that supports audio input can be used. Recommended models:

Model	Notes
`gemini-2.0-flash-lite`	Default. Fastest transcription speed.
`gemini-2.0-flash`	Fast and cost-effective.
`gemini-2.5-flash-lite`	Lightweight 2.5 model.
`gemini-2.5-flash`	Balanced speed and quality.
`gemini-2.5-pro`	Higher quality, slower.
`gemini-3-flash-preview`	Latest flash model.
`gemini-3-pro-preview`	Latest pro model, best quality.

See Gemini API Models for the latest list.

How It Works

Reads the audio file and base64 encodes it
Auto-detects authentication:
- If ADC is available (gcloud), uses Vertex AI endpoint
- Otherwise, uses GEMINI_API_KEY with direct Gemini API
Sends to the selected Gemini model with transcription prompt
Returns the transcribed text

Example Integration

For Clawdbot voice message handling:

# Transcribe incoming voice message
TRANSCRIPT=$(python ~/.claude/skills/gemini-stt/transcribe.py "$AUDIO_PATH")
echo "User said: $TRANSCRIPT"

Error Handling

The script exits with code 1 and prints to stderr on:

No authentication available (neither ADC nor GEMINI_API_KEY)
File not found
API errors
Missing GCP project (when using Vertex)

Notes

Uses Gemini 2.0 Flash Lite by default for fastest transcription
No external Python dependencies (uses stdlib only)
Automatically detects MIME type from file extension
Prefers Vertex AI with ADC when available (no API key management needed)

Usage Guidance

This skill is coherent with its stated purpose, but before installing: (1) be aware it requires authentication—either set GEMINI_API_KEY or run 'gcloud auth application-default login' and ensure a proper GCP project is configured; the registry metadata currently omits these requirements. (2) Using ADC (gcloud) will cause the script to call 'gcloud auth print-access-token' and use your ADC permissions to call Vertex; prefer a least-privilege service account or isolated environment if you are concerned about exposing broader GCP credentials. (3) GEMINI_API_KEY should be stored securely (not in world-readable files). (4) Review and run the script in a safe environment if you want to inspect network calls; endpoints contacted are standard Google APIs (generativelanguage.googleapis.com and *.aiplatform.googleapis.com). If you need the metadata fixed or want the skill to declare GEMINI_API_KEY / GOOGLE_CLOUD_PROJECT as required, request that from the publisher before trusting it in production.

Capability Analysis

Type: OpenClaw Skill Name: gemini-stt Version: 1.1.0 The skill is designed to transcribe audio files using Google's Gemini API or Vertex AI. The `transcribe.py` script legitimately uses `subprocess` to interact with the `gcloud` CLI for authentication (retrieving access tokens and project IDs) and sends base64-encoded audio data to official Google API endpoints. There is no evidence of data exfiltration to unauthorized parties, malicious execution, persistence mechanisms, or prompt injection attempts against the OpenClaw agent in `SKILL.md`. All actions are aligned with the stated purpose of speech-to-text transcription.

Capability Assessment

ℹ Purpose & Capability

Skill name/description (Gemini/Vertex STT) match the code and runtime instructions. The only mismatch is registry metadata claiming 'no required env vars' while SKILL.md and the script require either GEMINI_API_KEY or Google ADC (gcloud). This is an inconsistency in metadata, not in functionality.

✓ Instruction Scope

Runtime instructions and the script are scoped to reading an audio file, base64-encoding it, and calling Google Gemini or Vertex endpoints. It invokes 'gcloud' only to obtain an access token/project configuration. It does not read unrelated system files or send data to unexpected endpoints.

✓ Install Mechanism

No install spec; the skill is instruction-only with a single Python script that uses only the standard library. Low risk from installation artifacts.

ℹ Credentials

Authentication requirements (GEMINI_API_KEY or gcloud ADC and possibly GOOGLE_CLOUD_PROJECT/CLOUDSDK_CORE_PROJECT) are appropriate for contacting Gemini/Vertex. However, the skill metadata declares no required environment variables or primary credential, which is inaccurate and could mislead users about needed credentials.

✓ Persistence & Privilege

The skill does not request permanent inclusion (always:false), does not modify other skills or system settings, and does not persist credentials. It runs commands locally (gcloud) but does not escalate privileges or change system-wide configuration.

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install gemini-stt
After installation, invoke the skill by name or use /gemini-stt
Provide required inputs per the skill's parameter spec and get structured output

Version History

v1.1.0

Added support for Google Vertex AI with Application Default Credentials (ADC). Now supports both GEMINI_API_KEY and gcloud ADC authentication methods. Auto-detects authentication method.

v1.0.0

Initial release of Gemini-based Speech-to-Text skill. Optimized for speed with gemini-2.0-flash-lite default.

Metadata

Slug gemini-stt

Version 1.1.0

License —

All-time Installs 11

Active Installs 11

Total Versions 2

Frequently Asked Questions

What is Gemini STT?

Transcribe audio files using Google's Gemini API or Vertex AI. It is an AI Agent Skill for Claude Code / OpenClaw, with 3114 downloads so far.

How do I install Gemini STT?

Run "/install gemini-stt" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Gemini STT free?

Yes, Gemini STT is completely free (open-source). You can download, install and use it at no cost.

Which platforms does Gemini STT support?

Gemini STT is cross-platform and runs anywhere OpenClaw / Claude Code is available (linux, darwin).

Who created Gemini STT?

It is built and maintained by araa47 (@araa47); the current version is v1.1.0.

More Skills