
Video Subtitle Extractor

by forhonourlx · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ⚠ pending
Downloads: 39 · Stars: 0 · Active Installs: 0 · Versions: 1
Install in OpenClaw
/install video-subtitle-extractor
Description
Cross-platform video subtitle extraction using ASR (speech-to-text). Downloads audio from video URLs via yt-dlp, transcribes with openai-whisper (small/mediu...
README (SKILL.md)

# Video Subtitle Extractor 🎬→📝

Cross-platform ASR subtitle extraction pipeline. Downloads audio from any yt-dlp-compatible video platform, transcribes it with openai-whisper, and applies LLM-based text calibration for Chinese content.

Tested and verified on Windows 11 with real Bilibili videos (medium model, ~95% accuracy for Chinese).

## Quick Start

```bash
# One-command full pipeline
python scripts/run.py <video_url> --model medium --language zh --output-dir ./output

# Download audio only
python scripts/download_audio.py <video_url> <output_dir>

# Transcribe existing audio
python scripts/transcribe.py <audio_file> --model medium --language zh
```

## When to Use This Skill

Use this skill when:
1. The video has **no built-in subtitles** (Bilibili, YouTube, etc.)
2. You need **high-accuracy Chinese transcription** (~95% with the medium model)
3. You want **multiple output formats** (TXT, SRT, VTT, JSON)
4. You need **LLM-assisted text calibration** for financial/technical terms
5. The user says: "下载字幕" (download subtitles), "提取字幕" (extract subtitles), "语音转文字" (speech to text), "视频转文字" (video to text), "字幕提取" (subtitle extraction), "ASR转写" (ASR transcription)

## Workflow

### Step 0: Install Dependencies (once)

```bash
python scripts/install_deps.py
```

The script detects the OS and installs ffmpeg (via winget, brew, or apt), yt-dlp (pip), and openai-whisper (pip). On Windows it also locates ffmpeg when the binary is not on PATH.
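
How `install_deps.py` locates ffmpeg is not shown in this README. As a minimal sketch, assuming the check is a PATH lookup followed by a scan of a few likely install directories, it could look like the following. The function name `find_ffmpeg` and the `extra_dirs` parameter are hypothetical and not taken from the script.

```python
import os
import shutil
from pathlib import Path
from typing import Iterable, Optional


def find_ffmpeg(extra_dirs: Iterable[str] = ()) -> Optional[str]:
    """Return a path to an ffmpeg executable, or None if none is found.

    Hypothetical helper: checks PATH first, then the given directories.
    """
    on_path = shutil.which("ffmpeg")
    if on_path:
        return on_path
    # Windows binaries carry an .exe suffix; other systems do not.
    exe = "ffmpeg.exe" if os.name == "nt" else "ffmpeg"
    for directory in extra_dirs:
        candidate = Path(directory) / exe
        if candidate.is_file():
            return str(candidate)
    return None
```

A caller might pass directories where a package manager is known to unpack ffmpeg on that machine, then hand the result to yt-dlp or add its parent directory to PATH.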

### Step 1: Download Audio

Run `scripts/download_audio.py <url> [output_dir]`.

The script uses yt-dlp to extract the best available audio format (m4a preferred). It supports Bilibili, YouTube, and the 1800+ other platforms yt-dlp handles, and it detects ffmpeg automatically even when ffmpeg is not on the system PATH.

**If the download fails**, the video may require cookies. Try:
```bash
yt-dlp --cookies-from-browser chrome <url>
```
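
For callers who prefer the yt-dlp Python API over the command line, the download step can be approximated as below. This is a sketch under assumptions, not the contents of `download_audio.py`: the helper names `build_ydl_opts` and `download_audio`, the output template, and the format selector string are illustrative choices. The option keys (`format`, `outtmpl`, `noplaylist`, `ffmpeg_location`, `cookiesfrombrowser`) are standard `YoutubeDL` options.

```python
from pathlib import Path
from typing import Any, Dict, Optional


def build_ydl_opts(output_dir: str,
                   ffmpeg_location: Optional[str] = None,
                   cookies_browser: Optional[str] = None) -> Dict[str, Any]:
    """Build yt-dlp options that prefer an m4a audio stream."""
    opts: Dict[str, Any] = {
        # Prefer m4a audio, then any audio-only stream, then the best format.
        "format": "bestaudio[ext=m4a]/bestaudio/best",
        "outtmpl": str(Path(output_dir) / "%(id)s.%(ext)s"),
        "noplaylist": True,
    }
    if ffmpeg_location:
        opts["ffmpeg_location"] = ffmpeg_location
    if cookies_browser:
        # Equivalent to `--cookies-from-browser <browser>` on the CLI.
        opts["cookiesfrombrowser"] = (cookies_browser,)
    return opts


def download_audio(url: str, output_dir: str = ".", **kwargs: Any) -> str:
    """Download the audio track and return the local file path."""
    import yt_dlp  # lazy import so build_ydl_opts works without yt-dlp

    Path(output_dir).mkdir(parents=True, exist_ok=True)
    with yt_dlp.YoutubeDL(build_ydl_opts(output_dir, **kwargs)) as ydl:
        info = ydl.extract_info(url, download=True)
        return ydl.prepare_filename(info)
```

Retrying with `cookies_browser="chrome"` after a failed first attempt mirrors the cookie fallback described above.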

### Step 2: ASR Transcription

Run `scripts/transcribe.py <audio> --model <size> --language <lang>`.

Models are downloaded automatically on first use; the Disk column shows the space each one needs:

| Model | RAM | Disk | Speed | Quality | Best For |
|-------|-----|------|-------|---------|----------|
| `small` | ~2GB | 461MB | ~475 fps | ~90% | Quick tests |
| `medium` | ~5GB | 1.42GB | ~165 fps | **~95%** ✅ | **Recommended** |
| `large-v3` | ~10GB | 2.88GB | ~80 fps | ~97% | Best accuracy |
| `large-v3-turbo` | ~6GB | 1.6GB | ~120 fps | ~96% | Good balance |

> **⚠️ Windows note**: With <16GB RAM, `large-v3` may be killed for running out of memory (SIGKILL). Fall back to `medium`.
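
One way to turn the RAM column into an automatic choice is sketched below. The RAM figures are copied from the table; the function name `pick_model` and the `headroom` safety factor of 1.6 are assumptions, chosen so that `large-v3` is only selected well above its listed ~10GB requirement.

```python
# Approximate RAM needs in GB, copied from the model table,
# ordered from highest to lowest listed quality.
MODEL_RAM_GB = [
    ("large-v3", 10),
    ("large-v3-turbo", 6),
    ("medium", 5),
    ("small", 2),
]


def pick_model(total_ram_gb: float, headroom: float = 1.6) -> str:
    """Return the highest-quality model whose RAM estimate, scaled by a
    safety factor, fits in total_ram_gb; fall back to the smallest model."""
    for name, ram_gb in MODEL_RAM_GB:
        if ram_gb * headroom <= total_ram_gb:
            return name
    return MODEL_RAM_GB[-1][0]
```

The headroom factor is a guess rather than a measured value, so it is worth tuning against the peak RAM observed on the target machine.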

Output formats: `txt`, `srt`, `vtt`, `json` (default: all).

See `references/asr_models.md` for the full model comparison.
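
To illustrate how a transcription result maps onto the SRT output format, here is a minimal sketch. It assumes the openai-whisper Python API, where `model.transcribe()` returns a dict whose `segments` entries carry `start`, `end`, and `text`. The helper names are hypothetical, and the actual `transcribe.py` may format its output differently.

```python
from typing import Dict, Iterable


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable[Dict]) -> str:
    """Render segments (dicts with start, end, text) as numbered SRT cues."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)


def transcribe_to_srt(audio_path: str, model_name: str = "medium",
                      language: str = "zh") -> str:
    """Transcribe with openai-whisper and return SRT text."""
    import whisper  # lazy import: the formatting helpers work without it

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path, language=language)
    return segments_to_srt(result["segments"])
```

The same segment list can feed a VTT or JSON writer, which is why the formatting is kept separate from the model call.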

### Step 3: LLM Text Calibration

After transcription, read the `.txt` output and apply corrections. Key calibration categories:

1. **Homophone fixes** (同音字): 硬钢→硬扛, 模→磨, 骨→股
2. **Company/product names**: Deepseat→DeepSeek, 中繼續創→中际旭创, HPM→HBM
3. **Financial terms**: 抛押→抛压, 护盘 (not 互盘), 筹码, K线收十字星 (not 14星)
4. **Common substitutions**: 跟锋→跟风, 微转→微赚, 落带为安→落袋为安
5. **Traditional→Simplified**: If the model outputs Traditional Chinese, convert it to Simplified
6. **Structural cleanup**: Add paragraph breaks at topic shifts and format the text as prose

See `references/calibration_guide.md` for the full library of 30+ patterns.
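
Some of these corrections are literal enough to apply before the LLM pass. The sketch below is an assumption about how such a pre-pass could work, not part of the skill itself: `CALIBRATION_RULES` and `calibrate` are hypothetical names, and the rules are a handful of multi-character patterns taken from the list above.

```python
from collections import Counter
from typing import Dict, Tuple

# A few multi-character patterns from the categories above. Single-character
# fixes such as 模→磨 depend on context and are left to the LLM pass.
CALIBRATION_RULES: Dict[str, Dict[str, str]] = {
    "homophone": {"硬钢": "硬扛"},
    "names": {"Deepseat": "DeepSeek", "中繼續創": "中际旭创", "HPM": "HBM"},
    "financial": {"抛押": "抛压", "互盘": "护盘", "14星": "十字星"},
    "common": {"跟锋": "跟风", "微转": "微赚", "落带为安": "落袋为安"},
}


def calibrate(text: str) -> Tuple[str, Counter]:
    """Apply literal replacements and count corrections per category."""
    counts: Counter = Counter()
    for category, rules in CALIBRATION_RULES.items():
        for wrong, right in rules.items():
            hits = text.count(wrong)
            if hits:
                text = text.replace(wrong, right)
                counts[category] += hits
    return text, counts
```

Literal replacement can misfire when a pattern such as `HPM` is legitimate in context, so the output still needs the LLM review described in this step. The per-category counts can be reused for the correction summary in Step 4.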

### Step 4: Deliver Results

Present the calibrated text. Always include:
- Model used (small/medium/large) and quality notes
- Any sections with low confidence or unclear audio
- Summary of corrections applied (counts by category)
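
The three items can be assembled mechanically, as in the sketch below. It assumes the segments come from openai-whisper, whose segment dicts include an `avg_logprob` field. The threshold of -1.0 is an assumption that mirrors whisper's default `logprob_threshold`, and the function names and report layout are hypothetical.

```python
from typing import Dict, List


def low_confidence_segments(segments: List[Dict],
                            logprob_threshold: float = -1.0) -> List[Dict]:
    """Return segments whose avg_logprob falls below the threshold."""
    return [s for s in segments
            if s.get("avg_logprob", 0.0) < logprob_threshold]


def delivery_notes(model_name: str, segments: List[Dict],
                   correction_counts: Dict[str, int]) -> str:
    """Build the notes that accompany the calibrated text."""
    lines = [f"Model: {model_name}"]
    flagged = low_confidence_segments(segments)
    if flagged:
        lines.append("Low-confidence sections:")
        for seg in flagged:
            span = f"{seg['start']:.1f}s-{seg['end']:.1f}s"
            lines.append(f"  {span}: {seg['text'].strip()}")
    else:
        lines.append("Low-confidence sections: none flagged")
    total = sum(correction_counts.values())
    detail = ", ".join(f"{k}: {v}"
                       for k, v in sorted(correction_counts.items()))
    lines.append(f"Corrections: {total}" + (f" ({detail})" if detail else ""))
    return "\n".join(lines)
```

A low average log-probability is only a rough proxy for unclear audio, so flagged spans are candidates for a manual listen rather than confirmed errors.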

## Platform Support

| Platform | Status | Notes |
|----------|--------|-------|
| Bilibili | ✅ | Audio-only streams available without login. 720P+ video needs cookies. |
| YouTube | ✅ | Full support. Cookies may improve format selection. |
| Douyin/TikTok | ✅ | Via yt-dlp |
| All yt-dlp sites | ✅ | 1800+ supported platforms |

## Extending with New ASR Models

`scripts/transcribe.py` is designed for backend extensibility:

1. Add model info to the `MODEL_SIZES` dict
2. Implement a `transcribe_<backend>()` function
3. Add a CLI flag in argparse

**Planned backends**: faster-whisper (CTranslate2), whisper.cpp (native C++), Cloud APIs (AssemblyAI, iFlytek).
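
The three steps above might be wired together as follows. The internals of `transcribe.py` are not shown in this README, so everything here is a sketch under assumptions: the shape of `MODEL_SIZES`, the `BACKENDS` registry, the `--backend` flag, and the `echo` stand-in backend are all hypothetical.

```python
import argparse
from typing import Callable, Dict, List, Optional

# Step 1: model info per backend (hypothetical shape; the real dict may differ).
MODEL_SIZES: Dict[str, List[str]] = {
    "whisper": ["small", "medium", "large-v3", "large-v3-turbo"],
    "echo": ["tiny-demo"],  # made-up backend used only for illustration
}


# Step 2: one transcribe_<backend>() function per backend.
def transcribe_whisper(audio: str, model: str, language: str) -> str:
    import whisper  # lazy import so other backends do not require it
    result = whisper.load_model(model).transcribe(audio, language=language)
    return result["text"]


def transcribe_echo(audio: str, model: str, language: str) -> str:
    """Stand-in backend that returns a description instead of a transcript."""
    return f"[{model}/{language}] {audio}"


BACKENDS: Dict[str, Callable[[str, str, str], str]] = {
    "whisper": transcribe_whisper,
    "echo": transcribe_echo,
}


# Step 3: expose the backend choice as a CLI flag.
def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("audio")
    parser.add_argument("--backend", choices=sorted(BACKENDS), default="whisper")
    parser.add_argument("--model", default="medium")
    parser.add_argument("--language", default="zh")
    args = parser.parse_args(argv)
    if args.model not in MODEL_SIZES[args.backend]:
        parser.error(f"{args.model!r} is not a {args.backend} model")
    return args


def run(argv: Optional[List[str]] = None) -> str:
    args = parse_args(argv)
    return BACKENDS[args.backend](args.audio, args.model, args.language)
```

Keeping every backend behind the same `(audio, model, language)` signature means a new backend only touches the two dicts and adds one function.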

## Troubleshooting

| Problem | Solution |
|---------|----------|
| SIGKILL during transcription | The model is too large for the available RAM. Use `--model medium` or `--model small`. |
| yt-dlp download fails | Update yt-dlp: `pip install -U yt-dlp`. Try with cookies. |
| "No subtitles found" | Expected. This skill uses ASR, not built-in captions. |
| ffmpeg not found | Run `install_deps.py` (handles Windows non-PATH detection). |
| GPU not utilized | openai-whisper uses the GPU only when a CUDA-enabled PyTorch build is installed; otherwise it runs on the CPU. Install a CUDA build of PyTorch, or try `faster-whisper`. |

## Performance Benchmarks (Tested)

| Video Duration | Model | Time | RAM Peak | Accuracy |
|---------------|-------|------|----------|----------|
| 6 min (Bilibili) | small | ~1m 17s | ~2.5GB | ~90% |
| 6 min (Bilibili) | medium | ~4m 30s | ~6GB | ~95% |
| 13 min (Bilibili) | medium | ~8m | ~6.5GB | ~95% |

Tested on Windows 11, Intel i7, 16GB RAM. Performance varies with CPU speed.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install video-subtitle-extractor
  3. After installation, invoke the skill by name or use /video-subtitle-extractor
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
Cross-platform ASR subtitle extraction pipeline. Auto-installs ffmpeg, yt-dlp, openai-whisper. Configurable models. Multi-format output.
Metadata
Slug video-subtitle-extractor
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is Video Subtitle Extractor?

Cross-platform video subtitle extraction using ASR (speech-to-text). Downloads audio from video URLs via yt-dlp, transcribes with openai-whisper (small/mediu... It is an AI Agent Skill for Claude Code / OpenClaw, with 39 downloads so far.

How do I install Video Subtitle Extractor?

Run "/install video-subtitle-extractor" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Video Subtitle Extractor free?

Yes, Video Subtitle Extractor is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Video Subtitle Extractor support?

Video Subtitle Extractor is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Video Subtitle Extractor?

It is built and maintained by forhonourlx (@forhonourlx); the current version is v1.0.0.
