← Back to Skills Marketplace
araa47

Gemini STT

by araa47 · GitHub ↗ · v1.1.0
linuxdarwin ✓ Security Clean
3114
Downloads
2
Stars
11
Active Installs
2
Versions
Install in OpenClaw
/install gemini-stt
Description
Transcribe audio files using Google's Gemini API or Vertex AI
README (SKILL.md)

Gemini Speech-to-Text Skill

Transcribe audio files using Google's Gemini API or Vertex AI. Default model is gemini-2.0-flash-lite for fastest transcription.

Authentication (choose one)

Option 1: Vertex AI with Application Default Credentials (Recommended)

gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID

The script will automatically detect and use ADC when available.

Option 2: Direct Gemini API Key

Set GEMINI_API_KEY in environment (e.g., ~/.env or ~/.clawdbot/.env)

Requirements

  • Python 3.10+ (no external dependencies)
  • Either GEMINI_API_KEY or gcloud CLI with ADC configured

Supported Formats

  • .ogg / .opus (Telegram voice messages)
  • .mp3
  • .wav
  • .m4a

Usage

# Auto-detect auth (tries ADC first, then GEMINI_API_KEY)
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg

# Force Vertex AI
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --vertex

# With a specific model
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --model gemini-2.5-pro

# Vertex AI with specific project and region
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --vertex --project my-project --region us-central1

# With Clawdbot media
python ~/.claude/skills/gemini-stt/transcribe.py ~/.clawdbot/media/inbound/voice-message.ogg

Options

Option Description
\x3Caudio_file> Path to the audio file (required)
--model, -m Gemini model to use (default: gemini-2.0-flash-lite)
--vertex, -v Force use of Vertex AI with ADC
--project, -p GCP project ID (for Vertex, defaults to gcloud config)
--region, -r GCP region (for Vertex, default: us-central1)

Supported Models

Any Gemini model that supports audio input can be used. Recommended models:

Model Notes
gemini-2.0-flash-lite Default. Fastest transcription speed.
gemini-2.0-flash Fast and cost-effective.
gemini-2.5-flash-lite Lightweight 2.5 model.
gemini-2.5-flash Balanced speed and quality.
gemini-2.5-pro Higher quality, slower.
gemini-3-flash-preview Latest flash model.
gemini-3-pro-preview Latest pro model, best quality.

See Gemini API Models for the latest list.

How It Works

  1. Reads the audio file and base64 encodes it
  2. Auto-detects authentication:
    • If ADC is available (gcloud), uses Vertex AI endpoint
    • Otherwise, uses GEMINI_API_KEY with direct Gemini API
  3. Sends to the selected Gemini model with transcription prompt
  4. Returns the transcribed text

Example Integration

For Clawdbot voice message handling:

# Transcribe incoming voice message
TRANSCRIPT=$(python ~/.claude/skills/gemini-stt/transcribe.py "$AUDIO_PATH")
echo "User said: $TRANSCRIPT"

Error Handling

The script exits with code 1 and prints to stderr on:

  • No authentication available (neither ADC nor GEMINI_API_KEY)
  • File not found
  • API errors
  • Missing GCP project (when using Vertex)

Notes

  • Uses Gemini 2.0 Flash Lite by default for fastest transcription
  • No external Python dependencies (uses stdlib only)
  • Automatically detects MIME type from file extension
  • Prefers Vertex AI with ADC when available (no API key management needed)
Usage Guidance
This skill is coherent with its stated purpose, but before installing: (1) be aware it requires authentication—either set GEMINI_API_KEY or run 'gcloud auth application-default login' and ensure a proper GCP project is configured; the registry metadata currently omits these requirements. (2) Using ADC (gcloud) will cause the script to call 'gcloud auth print-access-token' and use your ADC permissions to call Vertex; prefer a least-privilege service account or isolated environment if you are concerned about exposing broader GCP credentials. (3) GEMINI_API_KEY should be stored securely (not in world-readable files). (4) Review and run the script in a safe environment if you want to inspect network calls; endpoints contacted are standard Google APIs (generativelanguage.googleapis.com and *.aiplatform.googleapis.com). If you need the metadata fixed or want the skill to declare GEMINI_API_KEY / GOOGLE_CLOUD_PROJECT as required, request that from the publisher before trusting it in production.
Capability Analysis
Type: OpenClaw Skill Name: gemini-stt Version: 1.1.0 The skill is designed to transcribe audio files using Google's Gemini API or Vertex AI. The `transcribe.py` script legitimately uses `subprocess` to interact with the `gcloud` CLI for authentication (retrieving access tokens and project IDs) and sends base64-encoded audio data to official Google API endpoints. There is no evidence of data exfiltration to unauthorized parties, malicious execution, persistence mechanisms, or prompt injection attempts against the OpenClaw agent in `SKILL.md`. All actions are aligned with the stated purpose of speech-to-text transcription.
Capability Assessment
Purpose & Capability
Skill name/description (Gemini/Vertex STT) match the code and runtime instructions. The only mismatch is registry metadata claiming 'no required env vars' while SKILL.md and the script require either GEMINI_API_KEY or Google ADC (gcloud). This is an inconsistency in metadata, not in functionality.
Instruction Scope
Runtime instructions and the script are scoped to reading an audio file, base64-encoding it, and calling Google Gemini or Vertex endpoints. It invokes 'gcloud' only to obtain an access token/project configuration. It does not read unrelated system files or send data to unexpected endpoints.
Install Mechanism
No install spec; the skill is instruction-only with a single Python script that uses only the standard library. Low risk from installation artifacts.
Credentials
Authentication requirements (GEMINI_API_KEY or gcloud ADC and possibly GOOGLE_CLOUD_PROJECT/CLOUDSDK_CORE_PROJECT) are appropriate for contacting Gemini/Vertex. However, the skill metadata declares no required environment variables or primary credential, which is inaccurate and could mislead users about needed credentials.
Persistence & Privilege
The skill does not request permanent inclusion (always:false), does not modify other skills or system settings, and does not persist credentials. It runs commands locally (gcloud) but does not escalate privileges or change system-wide configuration.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install gemini-stt
  3. After installation, invoke the skill by name or use /gemini-stt
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.1.0
Added support for Google Vertex AI with Application Default Credentials (ADC). Now supports both GEMINI_API_KEY and gcloud ADC authentication methods. Auto-detects authentication method.
v1.0.0
Initial release of Gemini-based Speech-to-Text skill. Optimized for speed with gemini-2.0-flash-lite default.
Metadata
Slug gemini-stt
Version 1.1.0
License
All-time Installs 11
Active Installs 11
Total Versions 2
Frequently Asked Questions

What is Gemini STT?

Transcribe audio files using Google's Gemini API or Vertex AI. It is an AI Agent Skill for Claude Code / OpenClaw, with 3114 downloads so far.

How do I install Gemini STT?

Run "/install gemini-stt" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Gemini STT free?

Yes, Gemini STT is completely free (open-source). You can download, install and use it at no cost.

Which platforms does Gemini STT support?

Gemini STT is cross-platform and runs anywhere OpenClaw / Claude Code is available (linux, darwin).

Who created Gemini STT?

It is built and maintained by araa47 (@araa47); the current version is v1.1.0.

💬 Comments