← 返回 Skills 市场
araa47

Gemini STT

作者 araa47 · GitHub ↗ · v1.1.0
linuxdarwin ✓ 安全检测通过
3114
总下载
2
收藏
11
当前安装
2
版本数
在 OpenClaw 中安装
/install gemini-stt
功能描述
Transcribe audio files using Google's Gemini API or Vertex AI
使用说明 (SKILL.md)

Gemini Speech-to-Text Skill

Transcribe audio files using Google's Gemini API or Vertex AI. Default model is gemini-2.0-flash-lite for fastest transcription.

Authentication (choose one)

Option 1: Vertex AI with Application Default Credentials (Recommended)

gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID

The script will automatically detect and use ADC when available.

Option 2: Direct Gemini API Key

Set GEMINI_API_KEY in environment (e.g., ~/.env or ~/.clawdbot/.env)

Requirements

  • Python 3.10+ (no external dependencies)
  • Either GEMINI_API_KEY or gcloud CLI with ADC configured

Supported Formats

  • .ogg / .opus (Telegram voice messages)
  • .mp3
  • .wav
  • .m4a

Usage

# Auto-detect auth (tries ADC first, then GEMINI_API_KEY)
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg

# Force Vertex AI
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --vertex

# With a specific model
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --model gemini-2.5-pro

# Vertex AI with specific project and region
python ~/.claude/skills/gemini-stt/transcribe.py /path/to/audio.ogg --vertex --project my-project --region us-central1

# With Clawdbot media
python ~/.claude/skills/gemini-stt/transcribe.py ~/.clawdbot/media/inbound/voice-message.ogg

Options

Option Description
\x3Caudio_file> Path to the audio file (required)
--model, -m Gemini model to use (default: gemini-2.0-flash-lite)
--vertex, -v Force use of Vertex AI with ADC
--project, -p GCP project ID (for Vertex, defaults to gcloud config)
--region, -r GCP region (for Vertex, default: us-central1)

Supported Models

Any Gemini model that supports audio input can be used. Recommended models:

Model Notes
gemini-2.0-flash-lite Default. Fastest transcription speed.
gemini-2.0-flash Fast and cost-effective.
gemini-2.5-flash-lite Lightweight 2.5 model.
gemini-2.5-flash Balanced speed and quality.
gemini-2.5-pro Higher quality, slower.
gemini-3-flash-preview Latest flash model.
gemini-3-pro-preview Latest pro model, best quality.

See Gemini API Models for the latest list.

How It Works

  1. Reads the audio file and base64 encodes it
  2. Auto-detects authentication:
    • If ADC is available (gcloud), uses Vertex AI endpoint
    • Otherwise, uses GEMINI_API_KEY with direct Gemini API
  3. Sends to the selected Gemini model with transcription prompt
  4. Returns the transcribed text

Example Integration

For Clawdbot voice message handling:

# Transcribe incoming voice message
TRANSCRIPT=$(python ~/.claude/skills/gemini-stt/transcribe.py "$AUDIO_PATH")
echo "User said: $TRANSCRIPT"

Error Handling

The script exits with code 1 and prints to stderr on:

  • No authentication available (neither ADC nor GEMINI_API_KEY)
  • File not found
  • API errors
  • Missing GCP project (when using Vertex)

Notes

  • Uses Gemini 2.0 Flash Lite by default for fastest transcription
  • No external Python dependencies (uses stdlib only)
  • Automatically detects MIME type from file extension
  • Prefers Vertex AI with ADC when available (no API key management needed)
安全使用建议
This skill is coherent with its stated purpose, but before installing: (1) be aware it requires authentication—either set GEMINI_API_KEY or run 'gcloud auth application-default login' and ensure a proper GCP project is configured; the registry metadata currently omits these requirements. (2) Using ADC (gcloud) will cause the script to call 'gcloud auth print-access-token' and use your ADC permissions to call Vertex; prefer a least-privilege service account or isolated environment if you are concerned about exposing broader GCP credentials. (3) GEMINI_API_KEY should be stored securely (not in world-readable files). (4) Review and run the script in a safe environment if you want to inspect network calls; endpoints contacted are standard Google APIs (generativelanguage.googleapis.com and *.aiplatform.googleapis.com). If you need the metadata fixed or want the skill to declare GEMINI_API_KEY / GOOGLE_CLOUD_PROJECT as required, request that from the publisher before trusting it in production.
功能分析
Type: OpenClaw Skill Name: gemini-stt Version: 1.1.0 The skill is designed to transcribe audio files using Google's Gemini API or Vertex AI. The `transcribe.py` script legitimately uses `subprocess` to interact with the `gcloud` CLI for authentication (retrieving access tokens and project IDs) and sends base64-encoded audio data to official Google API endpoints. There is no evidence of data exfiltration to unauthorized parties, malicious execution, persistence mechanisms, or prompt injection attempts against the OpenClaw agent in `SKILL.md`. All actions are aligned with the stated purpose of speech-to-text transcription.
能力评估
Purpose & Capability
Skill name/description (Gemini/Vertex STT) match the code and runtime instructions. The only mismatch is registry metadata claiming 'no required env vars' while SKILL.md and the script require either GEMINI_API_KEY or Google ADC (gcloud). This is an inconsistency in metadata, not in functionality.
Instruction Scope
Runtime instructions and the script are scoped to reading an audio file, base64-encoding it, and calling Google Gemini or Vertex endpoints. It invokes 'gcloud' only to obtain an access token/project configuration. It does not read unrelated system files or send data to unexpected endpoints.
Install Mechanism
No install spec; the skill is instruction-only with a single Python script that uses only the standard library. Low risk from installation artifacts.
Credentials
Authentication requirements (GEMINI_API_KEY or gcloud ADC and possibly GOOGLE_CLOUD_PROJECT/CLOUDSDK_CORE_PROJECT) are appropriate for contacting Gemini/Vertex. However, the skill metadata declares no required environment variables or primary credential, which is inaccurate and could mislead users about needed credentials.
Persistence & Privilege
The skill does not request permanent inclusion (always:false), does not modify other skills or system settings, and does not persist credentials. It runs commands locally (gcloud) but does not escalate privileges or change system-wide configuration.
如何使用
  1. 确保已安装 OpenClaw(本地或 Docker 部署)
  2. 在对话框中输入安装命令:/install gemini-stt
  3. 安装完成后,直接呼叫该 Skill 的名称或使用 /gemini-stt 触发
  4. 根据 Skill 的参数说明提供必要输入,即可获得结构化输出
版本历史
v1.1.0
Added support for Google Vertex AI with Application Default Credentials (ADC). Now supports both GEMINI_API_KEY and gcloud ADC authentication methods. Auto-detects authentication method.
v1.0.0
Initial release of Gemini-based Speech-to-Text skill. Optimized for speed with gemini-2.0-flash-lite default.
元数据
Slug gemini-stt
版本 1.1.0
许可证
累计安装 11
当前安装数 11
历史版本数 2
常见问题

Gemini STT 是什么?

Transcribe audio files using Google's Gemini API or Vertex AI. 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件,目前累计下载 3114 次。

如何安装 Gemini STT?

在 OpenClaw 或 Claude Code 对话框中运行命令「/install gemini-stt」即可一键安装,无需额外配置。

Gemini STT 是免费的吗?

是的,Gemini STT 完全免费(开源免费),可自由下载、安装和使用。

Gemini STT 支持哪些平台?

Gemini STT 跨平台运行,可在任意部署了 OpenClaw / Claude Code 的环境中使用(linux, darwin)。

谁开发了 Gemini STT?

由 araa47(@araa47)开发并维护,当前版本 v1.1.0。

💬 留言讨论