
Video Subtitle Extractor

Author: forhonourlx · v1.0.0 · License: MIT-0

cross-platform ⚠ pending


Install in OpenClaw

/install video-subtitle-extractor

Description

Cross-platform video subtitle extraction using ASR (speech-to-text). Downloads audio from video URLs via yt-dlp, transcribes with openai-whisper (small/mediu...

Usage (SKILL.md)

# Video Subtitle Extractor 🎬→📝

Cross-platform ASR subtitle extraction pipeline. Downloads audio from any yt-dlp-compatible video platform, transcribes with openai-whisper, and applies LLM-based text calibration for Chinese content.

Tested and verified on Windows 11 with real Bilibili videos (medium model, ~95% accuracy for Chinese).

## Quick Start

```bash
# One-command full pipeline
python scripts/run.py <video_url> --model medium --language zh --output-dir ./output

# Download audio only
python scripts/download_audio.py <video_url> <output_dir>

# Transcribe existing audio
python scripts/transcribe.py <audio_file> --model medium --language zh
```

## When to Use This Skill

Use this skill when:
1. The video has **no built-in subtitles** (Bilibili, YouTube, etc.)
2. You need **high-accuracy Chinese transcription** (~95% with medium model)
3. You want **multiple output formats** (TXT, SRT, VTT, JSON)
4. You need **LLM-assisted text calibration** for financial/technical terms
5. The user says: "下载字幕", "提取字幕", "语音转文字", "视频转文字", "字幕提取", "ASR转写"

## Workflow

### Step 0: Install Dependencies (once)

```bash
python scripts/install_deps.py
```

The script detects the OS and installs ffmpeg (via winget, brew, or apt), yt-dlp (pip), and openai-whisper (pip). On Windows it also locates ffmpeg when the executable is not on PATH.
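
The OS detection and ffmpeg lookup described above could look roughly like the following. This is a minimal sketch, not the contents of `install_deps.py`: the function names, the command table, and the winget package ID are assumptions made for illustration and should be checked against your package manager.

```python
import platform
import shutil
from pathlib import Path
from typing import Iterable, Optional

# Hypothetical mapping from platform.system() to an ffmpeg install command.
# The winget package ID is an assumption; verify it before running.
FFMPEG_INSTALL_COMMANDS = {
    "Windows": ["winget", "install", "--id", "Gyan.FFmpeg", "-e"],
    "Darwin": ["brew", "install", "ffmpeg"],
    "Linux": ["sudo", "apt-get", "install", "-y", "ffmpeg"],
}


def ffmpeg_install_command(system: Optional[str] = None) -> list[str]:
    """Return the install command for the given (or current) OS."""
    system = system or platform.system()
    try:
        return FFMPEG_INSTALL_COMMANDS[system]
    except KeyError:
        raise RuntimeError(f"No known ffmpeg installer for {system!r}")


def find_ffmpeg(extra_dirs: Iterable[Path] = ()) -> Optional[Path]:
    """Look for ffmpeg on PATH first, then in caller-supplied directories."""
    found = shutil.which("ffmpeg")
    if found:
        return Path(found)
    exe = "ffmpeg.exe" if platform.system() == "Windows" else "ffmpeg"
    for directory in extra_dirs:
        candidate = Path(directory) / exe
        if candidate.is_file():
            return candidate
    return None
```

Passing candidate directories explicitly keeps the lookup testable; which directories a Windows install actually uses is left to the caller.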

### Step 1: Download Audio

Run `scripts/download_audio.py <url> [output_dir]`.

Uses yt-dlp to extract the best available audio format (m4a preferred). Supports Bilibili, YouTube, and 1800+ yt-dlp-compatible platforms. The script automatically detects ffmpeg even when not in system PATH.

**If download fails**: the video may require cookies. Try:
```bash
yt-dlp --cookies-from-browser chrome <url>
```
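
To illustrate the download step and the cookie fallback, here is a small sketch that builds the yt-dlp command line and retries with browser cookies if the first attempt fails. It is not the code in `download_audio.py`; the function names, the output filename template, and the retry policy are assumptions. Only standard yt-dlp options (`-f`, `-o`, `--cookies-from-browser`) are used.

```python
import subprocess
from pathlib import Path
from typing import Callable, Optional


def build_ytdlp_command(
    url: str, output_dir: str = ".", cookies_browser: Optional[str] = None
) -> list[str]:
    """Build a yt-dlp command that prefers m4a audio and falls back to best audio."""
    cmd = [
        "yt-dlp",
        "-f", "bestaudio[ext=m4a]/bestaudio",
        # Assumed naming scheme: one file per video ID.
        "-o", str(Path(output_dir) / "%(id)s.%(ext)s"),
    ]
    if cookies_browser:
        cmd += ["--cookies-from-browser", cookies_browser]
    cmd.append(url)
    return cmd


def download_audio(
    url: str,
    output_dir: str = ".",
    cookies_browser: str = "chrome",
    run: Callable = subprocess.run,
) -> None:
    """Try an anonymous download first; on failure, retry with browser cookies."""
    try:
        run(build_ytdlp_command(url, output_dir), check=True)
    except subprocess.CalledProcessError:
        run(build_ytdlp_command(url, output_dir, cookies_browser), check=True)
```

The `run` parameter exists only so the retry logic can be exercised without network access.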

### Step 2: ASR Transcription

Run `scripts/transcribe.py <audio> --model <size> --language <lang>`.

Models are auto-downloaded on first use (disk space required):

| Model | RAM | Disk | Speed | Quality | Best For |
|-------|-----|------|-------|---------|----------|
| `small` | ~2GB | 461MB | ~475 fps | ~90% | Quick tests |
| `medium` | ~5GB | 1.42GB | ~165 fps | **~95%** ✅ | **Recommended** |
| `large-v3` | ~10GB | 2.88GB | ~80 fps | ~97% | Best accuracy |
| `large-v3-turbo` | ~6GB | 1.6GB | ~120 fps | ~96% | Good balance |

> **⚠️ Windows note**: With <16GB RAM, `large-v3` may run out of memory and be killed (reported as SIGKILL). Fall back to `medium`.
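
The fallback advice can be turned into a simple selection rule. The sketch below picks the highest-quality model from the table whose RAM figure, multiplied by a headroom factor, fits in the machine's total RAM. The RAM and quality numbers come from the table above; the headroom factor of 1.6 and the function name are assumptions chosen so that `large-v3` is only selected at 16GB or more.

```python
# (model, approx. RAM in GB, approx. quality in %) taken from the table above.
MODEL_TABLE = [
    ("large-v3", 10, 97),
    ("large-v3-turbo", 6, 96),
    ("medium", 5, 95),
    ("small", 2, 90),
]

# Assumed safety margin for the OS and other processes.
HEADROOM = 1.6


def pick_model(total_ram_gb: float) -> str:
    """Return the best-quality model expected to fit in the given RAM."""
    by_quality = sorted(MODEL_TABLE, key=lambda row: row[2], reverse=True)
    for name, ram_gb, _quality in by_quality:
        if ram_gb * HEADROOM <= total_ram_gb:
            return name
    raise ValueError(f"{total_ram_gb}GB RAM is below the smallest model's needs")
```

For example, `pick_model(12)` skips `large-v3` and returns `large-v3-turbo`, while `pick_model(4)` returns `small`.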

Output formats: `txt`, `srt`, `vtt`, `json` (default: all).

See `references/asr_models.md` for full model comparison.
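
As an illustration of the SRT output, the sketch below converts transcription segments into SubRip text. It assumes each segment is a dict with `start`, `end` (seconds), and `text` keys, which is the shape openai-whisper returns under `result["segments"]`; how `transcribe.py` actually writes its files is not shown in this document, and the function names here are hypothetical.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms_total = round(seconds * 1000)
    hours, rem = divmod(ms_total, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```

VTT differs mainly in its `WEBVTT` header line and the use of a period instead of a comma before the milliseconds.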

### Step 3: LLM Text Calibration

After transcription, read the `.txt` output and apply corrections. Key calibration categories:

1. **Homophone fixes** (同音字): 硬钢→硬扛, 模→磨, 骨→股
2. **Company/product names**: Deepseat→DeepSeek, 中繼續創→中际旭创, HPM→HBM
3. **Financial terms**: 抛押→抛压, 护盘 (not 互盘), 筹码, K线收十字星 (not 14星)
4. **Common substitutions**: 跟锋→跟风, 微转→微赚, 落带为安→落袋为安
5. **Traditional→Simplified**: If model outputs traditional Chinese, convert to simplified
6. **Structural cleanup**: Add paragraph breaks at topic shifts, format as prose

See `references/calibration_guide.md` for the full 30+ pattern library.
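
Some of the patterns above are unambiguous enough to apply mechanically before handing the text to an LLM. The sketch below is such a deterministic pre-pass, not the LLM calibration itself: the rule table is a hypothetical structure filled with examples from the list above, and single-character fixes such as 模→磨 are deliberately left out because they depend on context.

```python
from collections import Counter

# Hypothetical rule table: category -> {wrong: right}, using examples from the list above.
CALIBRATION_RULES = {
    "homophone": {"硬钢": "硬扛"},
    "names": {"Deepseat": "DeepSeek", "中繼續創": "中际旭创", "HPM": "HBM"},
    "financial": {"抛押": "抛压", "互盘": "护盘", "14星": "十字星"},
    "substitution": {"跟锋": "跟风", "微转": "微赚", "落带为安": "落袋为安"},
}


def calibrate(text: str) -> tuple:
    """Apply literal replacements and count how many fixes each category made."""
    counts = Counter()
    for category, rules in CALIBRATION_RULES.items():
        for wrong, right in rules.items():
            hits = text.count(wrong)
            if hits:
                text = text.replace(wrong, right)
                counts[category] += hits
    return text, counts
```

The per-category counts can feed directly into the correction summary that Step 4 asks for.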

### Step 4: Deliver Results

Present the calibrated text. Always include:
- Model used (small/medium/large) and quality notes
- Any sections with low confidence or unclear audio
- Summary of corrections applied (counts by category)
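
One way to assemble these three items is sketched below. It assumes segments carry the `avg_logprob` field that openai-whisper attaches to each segment, and it flags anything below -1.0, which mirrors Whisper's default `logprob_threshold`. The report layout and function names are hypothetical.

```python
from typing import Dict, List


def low_confidence_segments(
    segments: List[dict], logprob_threshold: float = -1.0
) -> List[dict]:
    """Return segments whose average log-probability is below the threshold."""
    return [s for s in segments if s.get("avg_logprob", 0.0) < logprob_threshold]


def build_report(model: str, segments: List[dict], corrections: Dict[str, int]) -> str:
    """Summarize model, flagged segments, and correction counts as plain text."""
    lines = [f"Model: {model}"]
    flagged = low_confidence_segments(segments)
    lines.append(f"Low-confidence segments: {len(flagged)}")
    for seg in flagged:
        lines.append(f"  {seg['start']:.1f}s-{seg['end']:.1f}s: {seg['text'].strip()}")
    total = sum(corrections.values())
    by_category = ", ".join(f"{k}={v}" for k, v in sorted(corrections.items()))
    lines.append(f"Corrections: {total}" + (f" ({by_category})" if by_category else ""))
    return "\n".join(lines)
```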

## Platform Support

| Platform | Status | Notes |
|----------|--------|-------|
| Bilibili | ✅ | Audio-only streams available without login. 720P+ video needs cookies. |
| YouTube | ✅ | Full support. Cookies may improve format selection. |
| Douyin/TikTok | ✅ | Via yt-dlp |
| All yt-dlp sites | ✅ | 1800+ supported platforms |

## Extending with New ASR Models

`scripts/transcribe.py` is designed for backend extensibility:

1. Add model info to `MODEL_SIZES` dict
2. Implement `transcribe_<backend>()` function
3. Add CLI flag in argparse

**Planned backends**: faster-whisper (CTranslate2), whisper.cpp (native C++), Cloud APIs (AssemblyAI, iFlytek).
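
The three extension steps can be sketched as follows. Only the names `MODEL_SIZES` and `transcribe_<backend>()` come from this document; the dict's contents, the `--backend` flag, the `BACKENDS` registry, and the `dummy` backend are assumptions for illustration and may not match the real `transcribe.py`.

```python
import argparse

# Step 1: model info. The real MODEL_SIZES dict may have a different shape.
MODEL_SIZES = {
    "small": {"disk": "461MB"},
    "medium": {"disk": "1.42GB"},
}


def transcribe_whisper(audio_path: str, model: str, language: str) -> str:
    """Placeholder for the existing openai-whisper code path."""
    raise NotImplementedError("stand-in for the existing backend")


# Step 2: a new backend function. This one returns a canned string instead of running ASR.
def transcribe_dummy(audio_path: str, model: str, language: str) -> str:
    return f"[dummy:{model}:{language}] {audio_path}"


BACKENDS = {"whisper": transcribe_whisper, "dummy": transcribe_dummy}


# Step 3: expose the backend through a (hypothetical) CLI flag.
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Transcribe an audio file")
    parser.add_argument("audio")
    parser.add_argument("--model", default="medium", choices=sorted(MODEL_SIZES))
    parser.add_argument("--language", default="zh")
    parser.add_argument("--backend", default="whisper", choices=sorted(BACKENDS))
    return parser


def main(argv=None) -> str:
    args = build_parser().parse_args(argv)
    return BACKENDS[args.backend](args.audio, args.model, args.language)
```

Dispatching through a dict keeps the argparse `choices` list and the set of implemented backends in sync.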

## Troubleshooting

| Problem | Solution |
|---------|----------|
| SIGKILL during transcription | Model too large for available RAM. Use `--model medium` or `--model small`. |
| yt-dlp download fails | Update yt-dlp: `pip install -U yt-dlp`. Try with cookies. |
| "No subtitles found" | Expected. This skill uses ASR, not built-in captions. |
| ffmpeg not found | Run `install_deps.py` (handles Windows non-PATH detection). |
| GPU not utilized | openai-whisper uses the GPU only when a CUDA-enabled PyTorch build is installed; a default pip install is often CPU-only. Install CUDA-enabled PyTorch, or use `faster-whisper` once that backend is available. |

## Performance Benchmarks (Tested)

| Video Duration | Model | Time | RAM Peak | Accuracy |
|---------------|-------|------|----------|----------|
| 6 min (Bilibili) | small | ~1m 17s | ~2.5GB | ~90% |
| 6 min (Bilibili) | medium | ~4m 30s | ~6GB | ~95% |
| 13 min (Bilibili) | medium | ~8m | ~6.5GB | ~95% |

Tested on Windows 11, Intel i7, 16GB RAM. Performance may vary by CPU speed.
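
These timings can be turned into a rough estimate for longer videos. The sketch below computes a speed factor (audio seconds per wall-clock second) for each row and uses the slowest observed factor per model as a conservative estimate. It assumes transcription time scales linearly with duration on comparable hardware, which the table suggests but does not prove.

```python
# (model, audio seconds, wall-clock seconds) from the benchmark table above.
BENCHMARKS = [
    ("small", 6 * 60, 1 * 60 + 17),
    ("medium", 6 * 60, 4 * 60 + 30),
    ("medium", 13 * 60, 8 * 60),
]


def speed_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Audio seconds processed per second of wall-clock time."""
    return audio_seconds / wall_seconds


def estimate_minutes(model: str, audio_minutes: float) -> float:
    """Estimate transcription time using the slowest measured factor for the model."""
    factors = [speed_factor(a, w) for m, a, w in BENCHMARKS if m == model]
    if not factors:
        raise ValueError(f"No benchmark for model {model!r}")
    return audio_minutes / min(factors)
```

With these numbers, the slowest `medium` run processed about 1.33x real time, so a 60-minute video would take roughly 45 minutes on similar hardware.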

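How to Use

1. Make sure OpenClaw is installed (locally or via Docker).
2. Enter the install command in the chat box: /install video-subtitle-extractor
3. After installation, call the Skill by name or trigger it with /video-subtitle-extractor
4. Provide the inputs described in the Skill's parameter documentation to get structured output.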

Version History

v1.0.0

Cross-platform ASR subtitle extraction pipeline. Auto-installs ffmpeg, yt-dlp, openai-whisper. Configurable models. Multi-format output.

Metadata

Slug video-subtitle-extractor

Version 1.0.0

License MIT-0

Total installs 0

Current installs 0

Versions 1
