
Audio Video To Text

Name: Audio Video To Text
Author: ivan830826

by ivan830826 · GitHub ↗ · v1.0.0

cross-platform ✓ Security Clean

Downloads: 1024

Install in OpenClaw

/install audio-video-to-text

Description

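An audio/video-to-text skill that uses Whisper for speech recognition. It supports a range of audio and video formats and can output plain text, SRT/VTT subtitles, or JSON. Suitable for meeting notes, video subtitle generation, interview write-ups, and podcast transcription.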

README (SKILL.md)

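Audio/Video to Text

Overview

This skill uses the OpenAI Whisper model to convert audio and video files to text. It supports automatic language detection and multiple output formats.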
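As a rough illustration of what such a transcription step can look like, here is a minimal sketch built on the openai-whisper Python package. It is not taken from scripts/transcribe.py; the helper names `transcribe_options` and `transcribe` are hypothetical. Leaving out `language` lets Whisper detect the language itself.

```python
from typing import Optional


def transcribe_options(language: Optional[str] = None, device: str = "cpu") -> dict:
    """Build keyword arguments for Whisper's transcribe() call (hypothetical helper)."""
    # Whisper warns and falls back to FP32 when fp16 is requested on CPU,
    # so half precision is only enabled for non-CPU devices.
    options = {"fp16": device != "cpu"}
    if language is not None:
        options["language"] = language  # omitted -> automatic language detection
    return options


def transcribe(path: str, model_name: str = "base",
               language: Optional[str] = None, device: str = "cpu") -> dict:
    """Transcribe one file and return Whisper's result dictionary."""
    import whisper  # lazy import: requires `pip install openai-whisper`

    model = whisper.load_model(model_name, device=device)
    return model.transcribe(path, **transcribe_options(language, device))
```

Calling `transcribe("meeting.mp4")["text"]` would then return the transcribed text, assuming the package and ffmpeg are installed and the model weights can be downloaded.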

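When to Use

Turning meeting recordings into written notes
Generating subtitles (SRT/VTT) for video content
Organizing interview and podcast content
Converting voice memos to text
Preparing multilingual videos for translation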

Quick Start

1. Install dependencies

pip install openai-whisper ffmpeg-python

Make sure ffmpeg is installed on the system:

# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg

# Windows
# Download from https://ffmpeg.org/download.html
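Before running the script, it can help to confirm that the dependencies are actually visible to Python. The check below is a small sketch, not part of the skill; `missing_dependencies` is a hypothetical helper. It relies on the fact that openai-whisper is imported as `whisper` and ffmpeg-python as `ffmpeg`.

```python
import importlib.util
import shutil


def missing_dependencies(modules=("whisper", "ffmpeg"), binaries=("ffmpeg",)):
    """Return descriptions of Python modules or binaries that cannot be found."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(f"python module: {name}")
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH
        if shutil.which(name) is None:
            missing.append(f"binary: {name}")
    return missing


problems = missing_dependencies()
print("All dependencies found." if not problems else "\n".join(problems))
```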

2. Basic usage

python scripts/transcribe.py <input_file> [output_file] [options]

3. Examples

# Transcribe an MP4 video to text
python scripts/transcribe.py meeting.mp4

# Transcribe audio to SRT subtitles
python scripts/transcribe.py podcast.mp3 podcast.srt --output-format srt

# Specify Chinese and a smaller model (faster)
python scripts/transcribe.py interview.wav --model tiny --language zh

# Output JSON with timestamps
python scripts/transcribe.py video.mp4 result.json --output-format json

Command-Line Options

Option	Description	Default
`--model`	Model size: tiny, base, small, medium, large	base
`--language`	Language code: zh, en, ja, etc.	auto-detect
`--output-format`	Output format: txt, srt, vtt, json	txt
`--device`	Device to run on: cpu, cuda	cpu
`--keep-audio`	Keep the temporary audio file	false
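The options above map naturally onto an `argparse` definition. The sketch below reconstructs the interface from the table only; how scripts/transcribe.py actually declares its arguments is an assumption, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(
        prog="transcribe.py",
        description="Transcribe an audio or video file with Whisper.",
    )
    parser.add_argument("input_file", help="audio or video file to transcribe")
    parser.add_argument("output_file", nargs="?", default=None,
                        help="optional output path")
    parser.add_argument("--model", default="base",
                        choices=["tiny", "base", "small", "medium", "large"])
    parser.add_argument("--language", default=None,
                        help="language code such as zh, en, ja; omit to auto-detect")
    parser.add_argument("--output-format", default="txt",
                        choices=["txt", "srt", "vtt", "json"])
    parser.add_argument("--device", default="cpu", choices=["cpu", "cuda"])
    parser.add_argument("--keep-audio", action="store_true",
                        help="keep the temporary audio file")
    return parser


args = build_parser().parse_args(
    ["podcast.mp3", "podcast.srt", "--output-format", "srt"]
)
print(args.model, args.output_format, args.keep_audio)  # base srt False
```

Using `choices` means a typo such as `--model huge` is rejected up front rather than failing later when the model is loaded.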

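Model Selection Guide

Model	Parameters	Speed	Accuracy	Best for
tiny	39M	Fastest	Fair	Quick tests, short audio
base	74M	Fast	Good	Everyday use
small	244M	Medium	Better	Formal settings
medium	769M	Slow	Very good	High-accuracy needs
large	1550M	Slowest	Best	Professional transcription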
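One way to act on this guide in code is to treat the models as an ordered ladder and step up when accuracy is lacking or down when speed or memory is the problem. This is only an illustrative sketch; `MODEL_PARAMS_M` and `step_model` are hypothetical names, and the numbers are copied from the table.

```python
# Model sizes in millions, as listed in the guide, smallest to largest.
MODEL_PARAMS_M = {"tiny": 39, "base": 74, "small": 244, "medium": 769, "large": 1550}
MODEL_ORDER = list(MODEL_PARAMS_M)  # dicts preserve insertion order


def step_model(current: str, direction: int) -> str:
    """Move one step up (+1) or down (-1) the ladder, clamping at both ends."""
    index = MODEL_ORDER.index(current) + direction
    return MODEL_ORDER[max(0, min(index, len(MODEL_ORDER) - 1))]


print(step_model("base", +1))  # small
print(step_model("tiny", -1))  # tiny
```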

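Output Formats

TXT (plain text)

This is the full text of the transcription, suitable for reading and editing.

SRT (subtitle format)

1
00:00:01,000 --> 00:00:04,000
This is the first subtitle line.

2
00:00:04,500 --> 00:00:07,000
This is the second subtitle line.

VTT (web subtitles)

WEBVTT

00:00:01.000 --> 00:00:04.000
This is the first subtitle line.

00:00:04.500 --> 00:00:07.000
This is the second subtitle line.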
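The two subtitle formats differ mainly in the millisecond separator (comma for SRT, period for VTT), the numbered cues in SRT, and the `WEBVTT` header. The sketch below shows one way to render both from a list of segments; it assumes each segment is a dict with `start` and `end` in seconds plus `text`, which is how Whisper reports segments, and is not the skill's own formatter.

```python
def format_timestamp(seconds: float, separator: str) -> str:
    """Render seconds as HH:MM:SS<sep>mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{ms:03d}"


def to_srt(segments) -> str:
    """Numbered cues, comma before the milliseconds."""
    blocks = []
    for number, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"], ",")
        end = format_timestamp(seg["end"], ",")
        blocks.append(f"{number}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


def to_vtt(segments) -> str:
    """WEBVTT header, unnumbered cues, period before the milliseconds."""
    blocks = ["WEBVTT\n"]
    for seg in segments:
        start = format_timestamp(seg["start"], ".")
        end = format_timestamp(seg["end"], ".")
        blocks.append(f"{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


segments = [
    {"start": 1.0, "end": 4.0, "text": " First line."},
    {"start": 4.5, "end": 7.0, "text": " Second line."},
]
print(to_srt(segments))
print(to_vtt(segments))
```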

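JSON (full data)

Contains complete information including segments, timestamps, and confidence scores, suitable for programmatic processing.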
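The exact JSON schema the script writes is not documented here. The sketch below assumes it mirrors the dictionary returned by Whisper's `transcribe()`, with a top-level `text` and a `segments` list whose items carry `start`, `end`, `text`, and `avg_logprob`. Under that assumption, a consumer could summarize segments like this; `summarize` is a hypothetical helper, and `exp(avg_logprob)` is only a rough stand-in for a confidence score.

```python
import json
import math


def summarize(result: dict) -> list:
    """Return (start, end, approx_confidence, text) for each segment."""
    rows = []
    for seg in result.get("segments", []):
        logprob = seg.get("avg_logprob")
        # exp(avg_logprob) is a rough average per-token probability in (0, 1]
        confidence = None if logprob is None else round(math.exp(logprob), 2)
        rows.append((seg["start"], seg["end"], confidence, seg["text"].strip()))
    return rows


# Hand-written sample in the assumed shape, not real script output.
sample = json.loads("""
{
  "text": " Hello there.",
  "language": "en",
  "segments": [
    {"start": 0.0, "end": 2.5, "text": " Hello there.", "avg_logprob": -0.25}
  ]
}
""")

for start, end, confidence, text in summarize(sample):
    print(f"[{start:.1f}s - {end:.1f}s] ({confidence}) {text}")
```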

Supported File Formats

Audio: MP3, WAV, FLAC, OGG, M4A, AAC

Video: MP4, AVI, MOV, MKV, WEBM, FLV
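A caller that wants to filter files before handing them to the script could check extensions against these lists. This is a sketch under the assumption that the format is judged by file extension alone; `media_kind` is a hypothetical helper, and the script itself may rely on ffmpeg's own detection instead.

```python
from pathlib import Path
from typing import Optional

AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac", ".ogg", ".m4a", ".aac"}
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv", ".webm", ".flv"}


def media_kind(path: str) -> Optional[str]:
    """Return "audio", "video", or None for an unlisted extension."""
    suffix = Path(path).suffix.lower()  # case-insensitive: .MP4 == .mp4
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    return None


print(media_kind("meeting.MP4"))   # video
print(media_kind("podcast.mp3"))   # audio
print(media_kind("notes.txt"))     # None
```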

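Performance Tips

Prefer the tiny/base models for short audio - fast, with adequate accuracy
Use the CPU for long content - avoids running out of GPU memory
Specify the language - can improve both accuracy and speed
Batch processing - the script can be called in a loop to process multiple files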
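The batch-processing tip can be sketched as a small driver that calls the script once per file. The command layout follows the documented usage (`<input_file> [output_file] [options]`); the helper names, the `recordings` folder, and the `*.mp3` pattern are illustrative assumptions.

```python
import subprocess
import sys
from pathlib import Path
from typing import Optional


def build_command(input_path: Path, output_format: str = "txt",
                  model: str = "base", language: Optional[str] = None) -> list:
    """Build the argument list for one transcribe.py invocation."""
    output_path = input_path.with_suffix(f".{output_format}")
    command = [
        sys.executable, "scripts/transcribe.py",
        str(input_path), str(output_path),
        "--model", model,
        "--output-format", output_format,
    ]
    if language:
        command += ["--language", language]
    return command


def transcribe_all(folder: str, pattern: str = "*.mp3", **options) -> None:
    """Run the script on every matching file, stopping at the first failure."""
    for input_path in sorted(Path(folder).glob(pattern)):
        subprocess.run(build_command(input_path, **options), check=True)


# Example call (commented out because it launches real transcriptions):
# transcribe_all("recordings", pattern="*.mp3", model="tiny", language="zh")
```

Each call reloads the model, so for many short files it may be faster to load Whisper once in a single Python process instead.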

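Common Issues

Poor transcription quality

Try a larger model (small/medium/large)
Specify the correct language code
Make sure the audio is clear

Slow processing

Use a smaller model (tiny/base)
If a GPU is available, use --device cuda
Shorten the audio or process it in segments

Out of memory

Use a smaller model
Split long files and process the parts separately
Close other memory-hungry programs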
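For splitting a long file, ffmpeg's segment muxer is one option. The sketch below only builds the command; the ten-minute chunk length and the `_partNNN` naming are arbitrary choices for illustration, and `build_split_command` is a hypothetical helper.

```python
import subprocess
from pathlib import Path


def build_split_command(input_path: str, chunk_seconds: int = 600) -> list:
    """Build an ffmpeg command that cuts the input into fixed-length parts."""
    source = Path(input_path)
    # %03d is expanded by ffmpeg into 000, 001, 002, ...
    pattern = source.with_name(f"{source.stem}_part%03d{source.suffix}")
    return [
        "ffmpeg", "-i", str(source),
        "-f", "segment",                  # use the segment muxer
        "-segment_time", str(chunk_seconds),
        "-c", "copy",                     # copy streams without re-encoding
        "-reset_timestamps", "1",         # each part starts at time zero
        str(pattern),
    ]


command = build_split_command("meeting.mp4", chunk_seconds=300)
print(" ".join(command))
# To actually split the file: subprocess.run(command, check=True)
```

Because every part starts at time zero, subtitle timestamps from later parts need the part's offset added back when the results are merged.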

Scripts

scripts/transcribe.py - main transcription script


Usage Guidance

This skill appears to do only local transcription with Whisper and ffmpeg. Before installing/running: (1) verify you trust the skill source and the PyPI package name (openai-whisper) you will install, (2) be aware that Whisper will likely download large model files (especially medium/large) which use network bandwidth and disk space and may require substantial RAM/GPU, (3) install ffmpeg from official sources, (4) run the script in a virtual environment or sandbox and inspect the code if you have concerns, and (5) only run it on files you trust (the script spawns ffmpeg as a subprocess and writes a temp audio file under /tmp by default).

Capability Analysis

Type: OpenClaw Skill
Name: audio-video-to-text
Version: 1.0.0

The skill provides a standard utility for transcribing audio and video files using the OpenAI Whisper library. The code in scripts/transcribe.py uses safe subprocess calls to interface with ffmpeg and contains no evidence of data exfiltration, malicious execution, or prompt injection. All behaviors align with the stated purpose of audio-to-text transcription.

Capability Assessment

✓ Purpose & Capability

Name/description (audio/video → text using Whisper) align with the included script and SKILL.md. Required tools (whisper package, ffmpeg) are explainable and necessary for transcription.

✓ Instruction Scope

SKILL.md and the script limit actions to installing dependencies, extracting audio, loading a Whisper model, transcribing, formatting output, and deleting temporary audio. There are no instructions to read unrelated files, access environment secrets, or send data to external endpoints.

ℹ Install Mechanism

This is an instruction-only skill (no install spec). The script depends on the openai-whisper and ffmpeg-python packages and a system ffmpeg binary. Note: loading Whisper models will typically download large model weight files from the network the first time they are used, consuming disk and bandwidth.

✓ Credentials

The skill requires no environment variables, credentials, or config paths. It does not access unrelated secrets or other services.

✓ Persistence & Privilege

The skill is configured with always:false and default invocation settings. It does not attempt to persist or modify other skills or system-wide agent configuration.

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install audio-video-to-text
After installation, invoke the skill by name or use /audio-video-to-text
Provide required inputs per the skill's parameter spec and get structured output

Version History

v1.0.0

Initial release of the audio-video-to-text skill.
Converts audio/video files to text using OpenAI Whisper.
Supports multiple output formats: TXT, SRT, VTT, and JSON.
Handles various audio/video types: MP3, WAV, MP4, AVI, and more.
Allows model selection for speed/accuracy trade-offs.
Suitable for meeting notes, subtitles, interviews, and podcasts.

Metadata

Slug audio-video-to-text

Version 1.0.0

License —

All-time Installs 7

Active Installs 7

Total Versions 1

Frequently Asked Questions

What is Audio Video To Text?

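An audio/video-to-text skill that uses Whisper for speech recognition. It supports a range of audio and video formats and can output plain text, SRT/VTT subtitles, or JSON. Suitable for meeting notes, video subtitle generation, interview write-ups, and podcast transcription. It is an AI Agent Skill for Claude Code / OpenClaw, with 1024 downloads so far.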

How do I install Audio Video To Text?

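Run "/install audio-video-to-text" in the OpenClaw or Claude Code chat to install the skill in one step. The transcription script still needs its own dependencies: the openai-whisper and ffmpeg-python packages and a system ffmpeg binary.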

Is Audio Video To Text free?

Yes, Audio Video To Text is completely free (open-source). You can download, install and use it at no cost.

Which platforms does Audio Video To Text support?

Audio Video To Text is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Audio Video To Text?

It is built and maintained by ivan830826 (@ivan830826); the current version is v1.0.0.
