
Audio Video To Text

Author: ivan830826 · v1.0.0

cross-platform ✓ Security check passed

Total downloads: 1024

Install in OpenClaw

/install audio-video-to-text

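Description

An audio/video-to-text skill that uses Whisper for speech recognition. It supports a range of audio and video formats and can output plain text, SRT/VTT subtitles, or JSON. It is suited to meeting notes, video subtitle generation, interview write-ups, podcast transcription, and similar tasks.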

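Usage (SKILL.md)

Audio/Video to Text

Overview

This skill uses the OpenAI Whisper model to convert audio and video files to text. It supports automatic language detection and multiple output formats.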
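The source of scripts/transcribe.py is not shown on this page, so the following is only a minimal sketch of how the openai-whisper Python package is commonly called, not the script's actual code. The function name `transcribe_file` and its parameters are hypothetical; `whisper.load_model` and `model.transcribe` come from the openai-whisper package.

```python
def transcribe_file(path, model_name="base", language=None, device="cpu"):
    """Transcribe one media file and return Whisper's result dict."""
    # Imported lazily so this module loads even before openai-whisper is installed.
    import whisper

    # The first call for a given model name typically downloads its weights.
    model = whisper.load_model(model_name, device=device)
    # language=None lets Whisper detect the language; fp16 is only enabled on GPU.
    return model.transcribe(path, language=language, fp16=(device == "cuda"))
```

Under these assumptions, `transcribe_file("meeting.mp4")["text"]` would hold the full transcript, and the `"segments"` entry would hold the timestamped pieces.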

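When to use

Turning meeting recordings into written notes
Generating subtitles (SRT/VTT) for video content
Organizing interview or podcast content
Converting voice memos to text
Preparing multilingual videos for translation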

Quick start

1. Install dependencies

pip install openai-whisper ffmpeg-python

Make sure ffmpeg is installed on the system:

# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg

# Windows
# Download from https://ffmpeg.org/download.html
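Before running the script, it can help to confirm that the dependencies are visible to Python. The sketch below is not part of the skill; `check_environment` is a hypothetical helper. It relies on the fact that openai-whisper installs a module named `whisper` and ffmpeg-python installs a module named `ffmpeg`.

```python
import importlib.util
import shutil


def check_environment() -> dict:
    """Report whether the ffmpeg binary and the two Python packages can be found."""
    return {
        # shutil.which returns the executable's path, or None if it is not on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # find_spec returns None when a top-level module is not installed.
        "whisper": importlib.util.find_spec("whisper") is not None,
        "ffmpeg-python": importlib.util.find_spec("ffmpeg") is not None,
    }


for name, found in check_environment().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```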

2. Basic usage

python scripts/transcribe.py <input_file> [output_file] [options]

3. Examples

# Transcribe an MP4 video and output text
python scripts/transcribe.py meeting.mp4

# Transcribe audio and output SRT subtitles
python scripts/transcribe.py podcast.mp3 podcast.srt --output-format srt

# Specify Chinese and a smaller model (faster)
python scripts/transcribe.py interview.wav --model tiny --language zh

# Output JSON with timestamps
python scripts/transcribe.py video.mp4 result.json --output-format json

Command-line options

Option	Description	Default
`--model`	Model size: tiny, base, small, medium, large	base
`--language`	Language code: zh, en, ja, etc.	auto-detect
`--output-format`	Output format: txt, srt, vtt, json	txt
`--device`	Device to run on: cpu, cuda	cpu
`--keep-audio`	Keep the temporary audio file	false
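For readers who want to see how options like these are usually wired up, here is a sketch of an `argparse` parser that mirrors the table. It is an assumption about the interface's shape, not the actual parser in scripts/transcribe.py; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose options and defaults follow the table above."""
    parser = argparse.ArgumentParser(description="Transcribe audio/video with Whisper")
    parser.add_argument("input", help="input audio or video file")
    parser.add_argument("output", nargs="?", default=None, help="optional output file")
    parser.add_argument("--model", default="base",
                        choices=["tiny", "base", "small", "medium", "large"])
    # None stands for automatic language detection.
    parser.add_argument("--language", default=None, help="language code such as zh, en, ja")
    parser.add_argument("--output-format", default="txt",
                        choices=["txt", "srt", "vtt", "json"])
    parser.add_argument("--device", default="cpu", choices=["cpu", "cuda"])
    # A flag: false unless --keep-audio is passed.
    parser.add_argument("--keep-audio", action="store_true")
    return parser


args = build_parser().parse_args(["podcast.mp3", "podcast.srt", "--output-format", "srt"])
print(args.model, args.output_format, args.keep_audio)
```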

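Model selection guide

Model	Parameters	Speed	Accuracy	Best for
tiny	39M	Fastest	Fair	Quick tests, short audio
base	74M	Fast	Good	Everyday use
small	244M	Medium	Better	Formal settings
medium	769M	Slow	Very good	High-accuracy needs
large	1550M	Slowest	Best	Professional transcription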

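Output formats

TXT (plain text)

This is the complete transcribed text, suitable for reading and editing.

SRT (subtitle format)

1
00:00:01,000 --> 00:00:04,000
This is the first subtitle.

2
00:00:04,500 --> 00:00:07,000
This is the second subtitle.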
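The SRT layout above (a running index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, then the text) can be produced from timestamped segments with a few lines of code. This is an illustrative sketch rather than the skill's own formatter; it assumes each segment is a dict with `start` and `end` in seconds and a `text` string, as in Whisper's result.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)


print(to_srt([
    {"start": 1.0, "end": 4.0, "text": "This is the first subtitle."},
    {"start": 4.5, "end": 7.0, "text": "This is the second subtitle."},
]))
```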

VTT (web subtitles)

WEBVTT

00:00:01.000 --> 00:00:04.000
This is the first subtitle.

00:00:04.500 --> 00:00:07.000
This is the second subtitle.
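WebVTT differs from SRT in two visible ways in the sample above: the file starts with a `WEBVTT` header, and milliseconds follow a period instead of a comma. A minimal sketch of a writer, under the same assumption that segments are dicts with `start`, `end`, and `text` (the function names are hypothetical):

```python
def vtt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS.mmm (WebVTT uses a period before milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_vtt(segments) -> str:
    """Render segments as a WebVTT document with unnumbered cues."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        lines.append(f"{vtt_timestamp(seg['start'])} --> {vtt_timestamp(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")  # blank line ends the cue
    return "\n".join(lines)


print(to_vtt([
    {"start": 1.0, "end": 4.0, "text": "This is the first subtitle."},
    {"start": 4.5, "end": 7.0, "text": "This is the second subtitle."},
]))
```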

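JSON (full data)

Contains the complete information, including segments, timestamps, and confidence, suitable for programmatic processing.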
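The exact JSON schema written by the script is not documented here. The sketch below assumes it mirrors Whisper's own result dict (`text`, `language`, and a `segments` list whose items carry `start`, `end`, `text`, and `avg_logprob`); the sample data is made up for illustration, and the threshold of -1.0 is only an illustrative cut-off for flagging low-confidence segments.

```python
import json

# Hypothetical sample shaped like Whisper's result dict; not real script output.
SAMPLE = """
{
  "text": "Hello everyone. Let's begin.",
  "language": "en",
  "segments": [
    {"id": 0, "start": 1.0, "end": 4.0, "text": "Hello everyone.", "avg_logprob": -0.21},
    {"id": 1, "start": 4.5, "end": 7.0, "text": "Let's begin.", "avg_logprob": -1.35}
  ]
}
"""


def low_confidence_segments(result: dict, threshold: float = -1.0) -> list:
    """Return segments whose average log-probability falls below the threshold."""
    return [
        seg for seg in result.get("segments", [])
        if seg.get("avg_logprob", 0.0) < threshold
    ]


result = json.loads(SAMPLE)
flagged = low_confidence_segments(result)
for seg in flagged:
    print(f"review {seg['start']:.1f}s-{seg['end']:.1f}s: {seg['text']}")
```

If the script's real output uses different keys, only the key names in `low_confidence_segments` would need to change.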

Supported file formats

Audio: MP3, WAV, FLAC, OGG, M4A, AAC

Video: MP4, AVI, MOV, MKV, WEBM, FLV
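A caller that wants to reject unsupported files before invoking the script could check the extension against the two lists above. This is a hypothetical pre-check, not something the skill is known to do internally; `media_kind` is an illustrative name.

```python
from pathlib import Path

# Extensions taken from the supported-format lists above.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac", ".ogg", ".m4a", ".aac"}
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv", ".webm", ".flv"}


def media_kind(path: str):
    """Return "audio", "video", or None based on the file extension alone."""
    suffix = Path(path).suffix.lower()  # case-insensitive: .MP4 matches .mp4
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    return None


print(media_kind("meeting.MP4"), media_kind("podcast.mp3"), media_kind("notes.txt"))
```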

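Performance tips

Prefer the tiny/base models for short audio - fast, with sufficient accuracy
Use the CPU for long content - avoids running out of GPU memory
Specify the language - can improve accuracy and speed
Batch processing - the script can be called in a loop to process multiple files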
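The batch-processing tip can be sketched as a loop that builds one command per file and runs it. The command layout follows the usage line documented earlier; `build_commands` and `run_batch` are hypothetical helpers, and the sketch assumes it is run from the directory that contains scripts/transcribe.py.

```python
import subprocess
import sys
from pathlib import Path


def build_commands(folder, pattern="*.mp3", output_format="txt", model="base"):
    """Build one transcribe.py command per matching file, in sorted order."""
    commands = []
    for media in sorted(Path(folder).glob(pattern)):
        # Write each result next to its input, e.g. talk.mp3 -> talk.srt
        output = media.with_suffix(f".{output_format}")
        commands.append([
            sys.executable, "scripts/transcribe.py", str(media), str(output),
            "--output-format", output_format, "--model", model,
        ])
    return commands


def run_batch(folder, **kwargs):
    """Run the commands one after another, stopping at the first failure."""
    for command in build_commands(folder, **kwargs):
        subprocess.run(command, check=True)
```

Running the files sequentially keeps only one model in memory at a time, which fits the memory advice elsewhere in this document.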

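Common issues

Poor transcription quality

Try a larger model (small/medium/large)
Specify the correct language code
Make sure the audio is clear

Slow processing

Use a smaller model (tiny/base)
If a GPU is available, use --device cuda
Shorten the audio or process it in segments

Out of memory

Use a smaller model
Split long files and process the parts separately
Close other programs that are using memory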

Scripts

scripts/transcribe.py - main transcription script


Security recommendations

This skill appears to do only local transcription with Whisper and ffmpeg. Before installing/running: (1) verify you trust the skill source and the PyPI package name (openai-whisper) you will install, (2) be aware that Whisper will likely download large model files (especially medium/large) which use network bandwidth and disk space and may require substantial RAM/GPU, (3) install ffmpeg from official sources, (4) run the script in a virtual environment or sandbox and inspect the code if you have concerns, and (5) only run it on files you trust (the script spawns ffmpeg as a subprocess and writes a temp audio file under /tmp by default).

Functional analysis

Type: OpenClaw Skill
Name: audio-video-to-text
Version: 1.0.0

The skill provides a standard utility for transcribing audio and video files using the OpenAI Whisper library. The code in scripts/transcribe.py uses safe subprocess calls to interface with ffmpeg and contains no evidence of data exfiltration, malicious execution, or prompt injection. All behaviors align with the stated purpose of audio-to-text transcription.

Capability assessment

✓ Purpose & Capability

Name/description (audio/video → text using Whisper) align with the included script and SKILL.md. Required tools (whisper package, ffmpeg) are explainable and necessary for transcription.

✓ Instruction Scope

SKILL.md and the script limit actions to installing dependencies, extracting audio, loading a Whisper model, transcribing, formatting output, and deleting temporary audio. There are no instructions to read unrelated files, access environment secrets, or send data to external endpoints.

ℹ Install Mechanism

This is an instruction-only skill (no install spec). The script depends on the openai-whisper and ffmpeg-python packages and a system ffmpeg binary. Note: loading Whisper models will typically download large model weight files from the network the first time they are used, consuming disk and bandwidth.

✓ Credentials

The skill requires no environment variables, credentials, or config paths. It does not access unrelated secrets or other services.

✓ Persistence & Privilege

The skill sets always:false and uses default invocation settings. It does not attempt to persist or modify other skills or system-wide agent configuration.

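How to use

Make sure OpenClaw is installed (locally or via Docker)
Enter the install command in the chat box: /install audio-video-to-text
Once installed, invoke the skill by name or trigger it with /audio-video-to-text
Provide the required inputs as described in the skill's parameter documentation to get structured output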

Version history

v1.0.0

Initial release of the audio-video-to-text skill.
- Converts audio/video files to text using OpenAI Whisper.
- Supports multiple formats: txt, SRT, VTT, and JSON.
- Handles various audio/video types: MP3, WAV, MP4, AVI, and more.
- Allows model selection for speed/accuracy trade-offs.
- Suitable for meeting notes, subtitles, interviews, and podcasts.

Metadata

Slug audio-video-to-text

Version 1.0.0

License -

Total installs 7

Current installs 7

Versions 1
