ahqazi-dev

audio to text and video to text

By ahqazi-dev · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ⚠ suspicious
203 total downloads · 0 favorites · 0 current installs · 1 version
Install in OpenClaw
/install audio-to-text-and-video-to-text
Description
Transcribe audio and video files into text using OpenAI's Whisper API. Use this skill whenever a user wants to convert any audio or video file to text — incl...
Usage instructions (SKILL.md)

Transcription Skill

Converts audio and video files into clean, readable text using OpenAI's Whisper API and ffmpeg for media handling.

Overview

This skill handles the full pipeline:

  1. Media extraction — use ffmpeg to strip audio from video files and convert to a Whisper-compatible format
  2. Chunking — split files larger than the chunk size (20 MB by default) into overlapping segments to stay within the API's 25 MB limit
  3. Transcription — send each chunk to OpenAI's Whisper API
  4. Assembly — merge chunk transcripts, adjusting timestamps, into a single clean output
  5. Post-processing — optionally clean up with Claude (punctuation, speaker labels, summaries)
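
The assembly step could look roughly like the sketch below. It assumes each chunk's transcript is a list of segments with start and end times relative to the chunk, and that the start offset of every chunk in the original recording is known. The Segment class and merge_chunks function are hypothetical names, not taken from scripts/transcribe.py, and skipping segments that fall entirely inside already covered audio is only one simple way to handle the overlap.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds, relative to the start of its chunk
    end: float
    text: str


def merge_chunks(chunks, chunk_offsets):
    """Shift each chunk's segments by the chunk's offset and concatenate them.

    chunks: list of lists of Segment (one inner list per chunk)
    chunk_offsets: start time of each chunk in the original recording, in seconds
    """
    merged = []
    for segments, offset in zip(chunks, chunk_offsets):
        for seg in segments:
            start, end = seg.start + offset, seg.end + offset
            # Skip segments that lie entirely inside audio already covered
            # by the previous chunk (text duplicated by the overlap).
            if merged and end <= merged[-1].end:
                continue
            merged.append(Segment(start, end, seg.text))
    return merged


if __name__ == "__main__":
    first = [Segment(0, 5, "Hello everyone."), Segment(5, 10, "Let's begin.")]
    second = [Segment(0, 1, "begin."), Segment(1, 4, "First item.")]
    for seg in merge_chunks([first, second], [0, 9]):
        print(f"{seg.start:6.1f} {seg.end:6.1f} {seg.text}")
```

A segment that only partly overlaps the previous chunk is kept as-is here; a real implementation may need to trim or deduplicate its text.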

Requirements

  • ffmpeg must be installed (run which ffmpeg to verify; it's usually pre-installed in claude.ai's environment)
  • OpenAI API key stored in the environment as OPENAI_API_KEY — the user must provide this
  • Python packages: openai, pydub (install via pip if needed)
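
The OPENAI_API_KEY requirement can be checked before any audio is processed. The sketch below assumes an explicit --api-key value takes precedence over the environment variable; that precedence, and the helper names resolve_api_key and mask, are illustrative assumptions rather than the script's documented behavior.

```python
import os


def resolve_api_key(cli_value=None, env=os.environ):
    """Return the API key from the CLI value or the environment, else fail early."""
    key = cli_value or env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; configure it as an environment "
            "variable or secret instead of pasting it into chat."
        )
    return key


def mask(key):
    """Return a redacted form of the key that is safe to show in logs."""
    return key[:3] + "..." + key[-4:] if len(key) > 8 else "***"


if __name__ == "__main__":
    try:
        print("Using key", mask(resolve_api_key()))
    except RuntimeError as err:
        print(err)
```

Logging only the masked form keeps the full key out of terminal history and transcripts of the session.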

Quick Start

When a user provides a media file, run the transcription script:

# Install dependencies if missing
pip install openai pydub --break-system-packages -q

# Run transcription
python /home/claude/transcription/scripts/transcribe.py \
  --input "/path/to/media/file" \
  --output "/mnt/user-data/outputs/transcript.txt" \
  --api-key "$OPENAI_API_KEY"

See scripts/transcribe.py for the full implementation.

Supported Formats

| Category | Formats |
| --- | --- |
| Audio | mp3, wav, m4a, ogg, flac, aac, opus, wma |
| Video | mp4, mov, avi, mkv, webm, wmv, m4v |

ffmpeg handles extraction from any of these.
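
As a rough illustration of the extraction step, the sketch below builds an ffmpeg command as an argument list and runs it without a shell. The 16 kHz mono target and the function names are assumptions made for this example; the actual settings used by scripts/transcribe.py are not shown in this document.

```python
import subprocess


def build_extract_command(src, dst, sample_rate=16000):
    """Return an ffmpeg argument list that writes a mono audio file from src."""
    return [
        "ffmpeg",
        "-y",                      # overwrite dst if it already exists
        "-i", src,                 # input media (audio or video)
        "-vn",                     # drop any video stream
        "-ac", "1",                # downmix to a single channel
        "-ar", str(sample_rate),   # resample to the target rate
        dst,                       # output format is inferred from the extension
    ]


def extract_audio(src, dst):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    # A list of arguments (no shell=True) means file names with spaces or
    # shell metacharacters are passed to ffmpeg literally.
    subprocess.run(build_extract_command(src, dst), check=True, capture_output=True)


if __name__ == "__main__":
    print(build_extract_command("meeting recording.mp4", "audio.mp3"))
```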

Options & Flags

| Flag | Default | Description |
| --- | --- | --- |
| --model | whisper-1 | Whisper model to use (whisper-1, gpt-4o-transcribe) |
| --language | auto-detect | ISO 639-1 language code (e.g. en, ar, fr) |
| --format | txt | Output format: txt, srt, vtt, json |
| --timestamps | off | Include timestamps in output |
| --chunk-size | 20 | Max chunk size in MB (must be ≤ 25) |
| --prompt | none | Context hint to improve accuracy (e.g. domain vocab) |
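
The flags above map naturally onto an argparse parser. The following is a sketch only: it assumes --timestamps is a boolean switch and that the ≤ 25 MB rule is enforced at parse time, neither of which is confirmed by this document.

```python
import argparse


def chunk_size_mb(value):
    """Parse --chunk-size and reject values outside 1..25 MB."""
    size = int(value)
    if not 0 < size <= 25:
        raise argparse.ArgumentTypeError("chunk size must be between 1 and 25 MB")
    return size


def build_parser():
    p = argparse.ArgumentParser(description="Transcribe audio/video (illustrative)")
    p.add_argument("--input", required=True, help="path to the media file")
    p.add_argument("--output", required=True, help="path for the transcript")
    p.add_argument("--api-key", default=None, help="falls back to OPENAI_API_KEY")
    p.add_argument("--model", default="whisper-1")
    p.add_argument("--language", default=None, help="ISO 639-1 code; None = auto-detect")
    p.add_argument("--format", choices=["txt", "srt", "vtt", "json"], default="txt")
    p.add_argument("--timestamps", action="store_true")
    p.add_argument("--chunk-size", type=chunk_size_mb, default=20)
    p.add_argument("--prompt", default=None, help="context hint, e.g. domain terms")
    return p


if __name__ == "__main__":
    args = build_parser().parse_args(["--input", "a.mp4", "--output", "t.srt", "--format", "srt"])
    print(args.model, args.format, args.chunk_size)
```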

Output Formats

  • txt — plain text, ideal for most uses
  • srt — SubRip subtitle format (for video players)
  • vtt — WebVTT format (for web video)
  • json — full Whisper JSON with segments and timestamps
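
To show what the srt format involves, here is a minimal sketch that renders timed segments as SubRip text. It assumes segments are available as (start_seconds, end_seconds, text) tuples; the function names are hypothetical and not taken from the skill's script.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render (start, end, text) tuples as numbered SRT cues."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([(0.0, 2.5, "Welcome to the meeting."), (2.5, 6.0, "Let's get started.")]))
```

WebVTT is very similar: the file starts with a WEBVTT header line and timestamps use a period instead of a comma before the milliseconds.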

Step-by-Step Workflow

1. Check for the file

Ask the user to upload the file or provide a local path. Check:

ls /mnt/user-data/uploads/

2. Check ffmpeg and install deps

which ffmpeg && ffmpeg -version 2>&1 | head -1
pip install openai pydub --break-system-packages -q 2>&1 | tail -3

3. Get the API key

If OPENAI_API_KEY is not set in the environment, ask the user:

"Please provide your OpenAI API key — it starts with sk-. You can get one at https://platform.openai.com/api-keys"

4. Run the script

python /home/claude/transcription/scripts/transcribe.py \
  --input "<file_path>" \
  --output "/mnt/user-data/outputs/transcript.txt"

5. Post-process (optional but recommended)

After transcription, offer to:

  • Clean up punctuation/formatting with Claude
  • Summarize the content
  • Extract action items, speakers, or key topics
  • Translate to another language

Use the transcript text directly in the conversation for these steps.

Handling Large Files

The script automatically splits files > 20 MB into overlapping chunks (with 1-second overlap for continuity). Each chunk is transcribed separately and the results are merged.
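
One way to picture the overlapping split is as a list of time windows. The sketch below assumes the size budget is converted to a duration using a constant-bitrate estimate and that chunks are cut by time; both are assumptions for illustration, and the helper names are hypothetical.

```python
def seconds_for_budget(chunk_mb, bitrate_bps):
    """Approximate seconds of audio that fit in chunk_mb at a constant bitrate."""
    return chunk_mb * 1024 * 1024 * 8 / bitrate_bps


def chunk_windows(duration_s, chunk_s, overlap_s=1.0):
    """Return (start, end) windows covering [0, duration_s].

    Each window after the first starts overlap_s seconds before the
    previous window ended, so words at a boundary appear in both chunks.
    """
    if chunk_s <= overlap_s:
        raise ValueError("chunk length must exceed the overlap")
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        windows.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s
    return windows


if __name__ == "__main__":
    per_chunk = seconds_for_budget(20, 128_000)  # 20 MB at 128 kbit/s
    for start, end in chunk_windows(3600, per_chunk):
        print(f"{start:8.1f} -> {end:8.1f}")
```

The start of each window is also the offset to add to that chunk's timestamps when the transcripts are merged.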

For very long recordings (> 1 hour), warn the user it may take a few minutes and show progress.

Error Handling

| Error | Fix |
| --- | --- |
| AuthenticationError | Invalid API key; ask user to verify |
| RateLimitError | Wait 60s and retry, or use --chunk-size 10 |
| InvalidRequestError: file too large | Reduce --chunk-size below 25 |
| ffmpeg not found | sudo apt install ffmpeg or brew install ffmpeg |
| No audio stream found | File may be corrupt or wrong format |
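
The wait-and-retry advice for rate limits can be wrapped in a small helper. This sketch uses exponential backoff rather than a fixed 60-second wait, which is a substitution made for illustration; with_retries and its parameters are hypothetical names, and the caller decides which exceptions count as retryable.

```python
import time


def with_retries(call, is_retryable, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call call() and retry on retryable errors, doubling the delay each time."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception as exc:
            # Give up on non-retryable errors or when attempts are exhausted.
            if not is_retryable(exc) or attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


if __name__ == "__main__":
    class FakeRateLimit(Exception):
        """Stand-in for a rate-limit error raised by an API client."""

    state = {"calls": 0}

    def flaky():
        state["calls"] += 1
        if state["calls"] < 3:
            raise FakeRateLimit()
        return "transcript text"

    print(with_retries(flaky, lambda e: isinstance(e, FakeRateLimit), base_delay=0.1))
```

With the openai Python package (v1 and later), is_retryable would typically check for openai.RateLimitError; setting base_delay=60 approximates the fixed wait suggested in the table.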

Example Interaction

User: "Can you transcribe this meeting recording?"
[uploads meeting.mp4]

→ Check file exists in /mnt/user-data/uploads/
→ Run transcribe.py on it
→ Save transcript to /mnt/user-data/outputs/
→ present_files() to the user
→ Offer to summarize or extract action items

Notes for openclaw.ai

  • Always save output to /mnt/user-data/outputs/ so users can download it
  • Use present_files() to share the transcript file with the user after saving
  • For business users, suggest the srt or vtt format if they're adding captions to video
  • The --prompt flag is useful for technical/domain-specific content: pass a few domain keywords to improve accuracy
Security recommendations
This skill's code looks like a normal Whisper transcription tool, but the registry metadata failed to declare the required OpenAI credential. Before installing:

  1. Do NOT paste your OpenAI API key into chat; prefer setting OPENAI_API_KEY via the platform's secure env/secret settings, or pass it to the script through the secure installer.
  2. Verify the publisher (source is unknown) and prefer skills that declare their required env vars in metadata.
  3. Confirm ffmpeg is available in your environment, and be aware the script will pip-install deps (it uses --break-system-packages).
  4. If you plan to use the optional 'Claude' post-processing, check what credentials (Anthropic) would be needed; the skill does not declare them.
  5. If unsure, ask the publisher to correct metadata (add OPENAI_API_KEY as primaryEnv) and to document exactly how to supply secrets securely; avoid pasting keys into chat.
Functional analysis
Type: OpenClaw Skill
Name: audio-to-text-and-video-to-text
Version: 1.0.0

The skill bundle provides a legitimate utility for transcribing audio and video files using OpenAI's Whisper API and ffmpeg. The implementation in `scripts/transcribe.py` uses safe subprocess handling (passing arguments as lists) to prevent shell injection and correctly manages the user's OpenAI API key without evidence of exfiltration. While the script performs automated package installation via pip, this behavior is aligned with the stated purpose of setting up the environment for transcription tasks.
Capability assessment
Purpose & Capability
The code and SKILL.md match the stated purpose (ffmpeg + OpenAI Whisper transcription, chunking, output formats). However the registry metadata claims no required environment variables or primary credential while SKILL.md and the script clearly require an OpenAI API key (OPENAI_API_KEY). The skill also mentions optional post-processing with 'Claude' (Anthropic) without declaring any Anthropic credentials, which is inconsistent.
Instruction Scope
The runtime instructions stay within transcription scope: checking uploads, verifying ffmpeg, installing Python deps, running the included transcribe.py, splitting large files, and saving outputs to /mnt/user-data/outputs/. These actions are expected for a transcription skill. One concern: instructions explicitly ask the user to provide an OpenAI API key (text starting with 'sk-') which could encourage pasting secrets into chat rather than setting environment variables through a secure platform mechanism.
Install Mechanism
No external download/install spec is included (instruction-only install). The bundled Python script self-installs dependencies via pip if missing. There are no remote downloads or URL-based installers in the manifest. The pip commands use --break-system-packages which can alter system package isolation in some environments; that's notable but common for scripts running in managed sandboxes.
Credentials
The script requires an OpenAI API key (OPENAI_API_KEY) and accepts --api-key on the CLI, but the registry metadata lists no required env or primary credential — a direct mismatch. Asking users to paste their secret (it even states the 'sk-' prefix) is risky; the skill does not request other unrelated credentials, so the scope of secrets is proportional but the omission in metadata and guidance to provide keys in-chat are concerning.
Persistence & Privilege
The skill does not request permanent presence (always:false), does not modify other skills or system-wide settings, and contains no code that appears to persist credentials beyond typical environment variable usage. It writes outputs to /mnt/user-data/outputs/, which is appropriate for delivering transcripts.
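How to use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install audio-to-text-and-video-to-text
  3. Once installed, invoke the Skill by name or trigger it with /audio-to-text-and-video-to-text
  4. Provide the required inputs as described in the Skill's parameters to get structured output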
Version history
v1.0.0
Version 1.0.0: Initial release

  • Enables transcription of audio and video files into text using OpenAI's Whisper API.
  • Supports a wide range of audio (MP3, WAV, M4A, etc.) and video (MP4, MOV, AVI, etc.) formats.
  • Automatically handles large files by splitting them into transcribable chunks.
  • Offers multiple output formats: plain text, SRT, VTT, and JSON.
  • Includes optional post-processing: formatting cleanup, summarization, speaker identification, and translation.
  • Provides robust error handling and clear guidance on setup and usage.
Metadata
Slug audio-to-text-and-video-to-text
Version 1.0.0
License MIT-0
Total installs 0
Current installs 0
Versions 1
FAQ

What is audio to text and video to text?

Transcribe audio and video files into text using OpenAI's Whisper API. Use this skill whenever a user wants to convert any audio or video file to text — incl... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 203 total downloads so far.

How do I install audio to text and video to text?

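Run the command "/install audio-to-text-and-video-to-text" in the OpenClaw or Claude Code chat box to install it in one step. Installation itself needs no extra configuration, but running the skill requires ffmpeg and an OpenAI API key (see Requirements).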

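Is audio to text and video to text free?

Yes, audio to text and video to text is completely free. It is released under the MIT-0 license and can be freely downloaded, installed, and used.

Which platforms does audio to text and video to text support?

audio to text and video to text is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed audio to text and video to text?

It is developed and maintained by ahqazi-dev (@ahqazi-dev); the current version is v1.0.0.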
