Video Metadata Intelligence System
/install video-metadata-analyzer
Video Analyzer
Three-stage video analysis pipeline: parallel visual + audio observation, then metadata synthesis for Bilibili publishing.
When to Use
- User says "analyze this video", "视频分析" (video analysis), "提取视频信息" (extract video information), or "生成投稿元数据" (generate publishing metadata)
- User wants structured content analysis from a video file
- User wants to prepare a video for Bilibili publishing (title, intro, tags, category, cover)
- Upstream of bilibili-publish-playwright: this skill generates the metadata that feeds into Bilibili publishing
Architecture
Input Video ─────────────────────────────────────────────────────
  │                                                             │
  ├── Stage 1a: visual.py ──→ observations_visual.json          │
  │     (ffmpeg extract frames → encode → vision LLM observe)   │ PARALLEL
  │                                                             │
  ├── Stage 1b: transcribe.py ──→ observations_audio.json       │
  │     (ffmpeg extract audio → transcribe + structure)         │
  │                                                             │
  └── Stage 2: analyze.py ──→ metadata.json ←───────────────────┘
        (merge V+A observations → publishable metadata via LLM)
run.sh orchestrates the pipeline: it launches visual.py and transcribe.py as background processes (&), waits for both to finish, then optionally runs analyze.py.
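The launch-both, wait, then synthesize pattern can be sketched in Python. This is only an illustration of the control flow, not the contents of run.sh; `run_pipeline` and its callable arguments are hypothetical names standing in for the three scripts.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Optional


def run_pipeline(
    visual: Callable[[], Any],
    audio: Callable[[], Any],
    synthesize: Optional[Callable[[Any, Any], Any]] = None,
) -> Dict[str, Any]:
    """Run both observers concurrently, wait for both, then optionally synthesize."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        visual_future = pool.submit(visual)
        audio_future = pool.submit(audio)
        # .result() blocks until the stage finishes, like the shell `wait`.
        visual_obs = visual_future.result()
        audio_obs = audio_future.result()

    result = {"visual": visual_obs, "audio": audio_obs}
    if synthesize is not None:
        # Stage 2 only runs once both observation sets exist.
        result["metadata"] = synthesize(visual_obs, audio_obs)
    return result
```

Leaving `synthesize` as `None` corresponds to the observe-only case, where synthesis is deferred to a later run.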
Output Directory
$OUTPUT/
├── observations_visual.json # JSON array: one object per frame
├── observations_audio.json # JSON object: transcript + structured info
├── metadata.json # (optional) Synthesized Bilibili metadata
└── frames/ # (only with --keep-frames, auto-cleaned otherwise)
Procedure
1. Full pipeline (recommended — all external LLM API)
bash scripts/run.sh \
--video VIDEO_PATH --output /tmp/va-out \
--transcribe audio-llm \
--audio-llm-key KEY --audio-llm-base URL --audio-llm-model MODEL \
--vision-llm-key KEY --vision-llm-base URL --vision-llm-model MODEL \
--max-frames 15 \
--synthesize-method api \
--analyze-llm-key KEY --analyze-llm-base URL --analyze-llm-model MODEL
2. Agent-direct mode (no external API — agent reads frames/audio directly)
bash scripts/run.sh \
--video VIDEO_PATH --output /tmp/va-out --keep-frames
Agent then reads observations_visual.json (placeholder frames), observations_audio.json (audio file path), and optionally the frame images + audio file directly to generate metadata.
3. Mixed / observe-only
Omit --synthesize-method to observe only, then run analyze.py separately later. Each stage (visual, audio, synthesize) can use different keys and models.
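When the three modes are driven from another program, the command line can be assembled from optional settings. The sketch below is a hypothetical helper (`build_run_command` is not part of the skill) that only emits flags shown in this document and leaves out anything not supplied.

```python
from typing import Dict, List, Optional

# Stage prefixes follow the --<stage>-llm-<field> pattern used by run.sh.
STAGES = ("audio", "vision", "analyze")
FIELDS = ("key", "base", "model")


def build_run_command(
    video: str,
    output: str,
    transcribe: Optional[str] = None,
    max_frames: Optional[int] = None,
    keep_frames: bool = False,
    synthesize_method: Optional[str] = None,
    llm: Optional[Dict[str, Dict[str, str]]] = None,
) -> List[str]:
    """Return an argv list for scripts/run.sh; omitted options are left out."""
    cmd = ["bash", "scripts/run.sh", "--video", video, "--output", output]
    if transcribe is not None:
        cmd += ["--transcribe", transcribe]
    for stage in STAGES:
        for field in FIELDS:
            value = (llm or {}).get(stage, {}).get(field)
            if value:
                cmd += [f"--{stage}-llm-{field}", value]
    if max_frames is not None:
        cmd += ["--max-frames", str(max_frames)]
    if keep_frames:
        cmd.append("--keep-frames")
    if synthesize_method is not None:
        # Omitting this flag gives the observe-only behaviour.
        cmd += ["--synthesize-method", synthesize_method]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Calling it with only `video`, `output`, and `keep_frames=True` reproduces the agent-direct invocation.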
Key Parameters
| Parameter | Default | Purpose |
|---|---|---|
| --video PATH | (none) | Required. Input video file |
| --output DIR | (none) | Required. Output directory |
| --transcribe MODE | agent-direct | local / cloud / agent-direct / audio-llm |
| --max-frames N | 15 | Max frames per 4-min segment |
| --keep-frames | false | Keep extracted frame images |
| --synthesize-method METHOD | (none) | api / agent / manual. Omit = observe only |
All *-key, *-base, *-model parameters follow the pattern: --vision-llm-key, --audio-llm-key, --analyze-llm-key etc. See references/REFERENCE.md for the complete parameter table.
Scripts
| File | Role |
|---|---|
| scripts/common.py | Shared utilities: HTTP retry with backoff, media duration via ffprobe, JSON parse from LLM output |
| scripts/visual.py | Frame extraction (auto-segment, auto-compress >200KB) + vision LLM observation. Long videos: segments processed in parallel (max 4 concurrent) |
| scripts/transcribe.py | Audio extraction + transcription (4 modes). Auto-chunks large audio with 2s overlap for dedup |
| scripts/analyze.py | Observations → publish metadata (3 methods: api/agent/manual). Heuristic fallback on API failure |
| scripts/run.sh | Orchestrator: parallel visual+audio, then optional synthesis |
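To illustrate what chunking audio with a 2-second overlap means, the following sketch computes chunk boundaries. It is not taken from transcribe.py: the function name `chunk_bounds` and the `chunk_len` parameter are assumptions, and the real script may choose chunk sizes differently (for example by file size).

```python
from typing import List, Tuple


def chunk_bounds(
    duration: float, chunk_len: float, overlap: float = 2.0
) -> List[Tuple[float, float]]:
    """Split [0, duration] into chunks where each chunk starts `overlap`
    seconds before the previous one ends."""
    if chunk_len <= overlap:
        raise ValueError("chunk_len must be larger than overlap")
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_len, duration)
        bounds.append((start, end))
        if end >= duration:
            break
        # Step back so the seam is transcribed twice and can be deduplicated.
        start = end - overlap
    return bounds
```

Because every seam is covered by two chunks, a word cut off at the end of one chunk appears whole in the next, and the repeated text in the overlap can be dropped when the transcripts are merged.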
Output Summary
observations_visual.json — JSON array, one object per frame with frame, objects, desc, texts, actions, style, cover_candidate, segment, segment_start.
observations_audio.json — transcript, speakers, key_points, tone. Agent-direct mode includes audio_file path.
metadata.json — title (≤80 chars), intro (≤2000 chars), tags (≤10), category, sub_category, cover_suggestion (primary + reason + secondary), declaration, copyright_claim.
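The length limits listed for metadata.json are easy to check before publishing. A minimal sketch, assuming the limits count Python string characters and list items; `check_metadata_limits` is a hypothetical helper, not part of the skill.

```python
from typing import Any, Dict, List

# Limits as stated in the output summary: title, intro in chars; tags in items.
LIMITS = {"title": 80, "intro": 2000, "tags": 10}


def check_metadata_limits(meta: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for field, limit in LIMITS.items():
        value = meta.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif len(value) > limit:
            problems.append(f"{field}: length {len(value)} exceeds {limit}")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every violation in a single pass.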
Pitfalls
- API keys in chat: Some platforms truncate keys with …. Always pass keys via command-line arguments, not through messages.
- Model capability: Vision requires image_url support. Audio-LLM requires input_audio support. Check your provider.
- Game recordings: Frames are large (~300-380KB vs ~30-80KB for phone). Auto-compression handles this, but plan rate limits for long videos.
- Long videos = parallel API calls: A 30-min video splits into 8 segments of up to 15 frames each, which means 8 vision API calls (capped at 4 concurrent). Consider rate limits.
- Missing credentials auto-degrade: Omitting LLM keys → preprocess-only or agent-direct mode. Scripts never crash on missing keys.
- --interval deprecated: Ignored. Interval auto-calculated per segment based on --max-frames.
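The segment arithmetic behind the rate-limit pitfall can be worked through in a few lines. This sketch assumes one vision API call per 4-minute segment, as the 30-minute example implies; the even-spacing formula in `frame_interval` is a guess at how the interval could be derived, not the script's documented rule.

```python
import math
from typing import Dict

SEGMENT_SECONDS = 240  # 4-minute segments, per the parameter table


def plan_segments(
    duration_s: float, max_frames: int = 15, max_concurrent: int = 4
) -> Dict[str, int]:
    """Estimate the API load for a video of the given duration."""
    segments = math.ceil(duration_s / SEGMENT_SECONDS)
    return {
        "segments": segments,  # assumed: one vision API call each
        "max_total_frames": segments * max_frames,
        # Rounds of calls if each call takes a similar amount of time.
        "rounds": math.ceil(segments / max_concurrent),
    }


def frame_interval(segment_len_s: float, max_frames: int) -> float:
    """Assumed rule: spread max_frames evenly across the segment."""
    return segment_len_s / max_frames
```

For a 30-minute video, `plan_segments(1800)` gives 8 segments, at most 120 frames, and 2 rounds of concurrent calls; under the even-spacing assumption a full segment would be sampled every 16 seconds.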
Error Handling
Three-layer defense:
- HTTP retry — 3 retries with exponential backoff on 5xx / connection errors
- JSON parse retry — 3 attempts with error feedback sent back to LLM
- Graceful degradation — placeholder observations on visual failure, raw text on audio failure, heuristic fallback on synthesis failure
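The first two layers can be sketched as follows. This is an illustration under stated assumptions, not the code in scripts/common.py: the names `retry_with_backoff` and `ask_for_json`, the 1-second base delay, and the feedback wording are all hypothetical. Retrying on 5xx responses would require the wrapped function to raise one of the `retry_on` exceptions for such a status.

```python
import json
import time
from typing import Any, Callable, Optional, Tuple, Type


def retry_with_backoff(
    fn: Callable[[], Any],
    retries: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fn; on a retryable error wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == retries:
                raise  # retries exhausted: the caller decides how to degrade
            sleep(base_delay * 2 ** attempt)


def ask_for_json(
    ask: Callable[[str], str], prompt: str, attempts: int = 3
) -> Optional[Any]:
    """Ask an LLM for JSON, feeding the parse error back after each failure."""
    message = prompt
    for _ in range(attempts):
        reply = ask(message)
        try:
            return json.loads(reply)
        except json.JSONDecodeError as exc:
            message = (
                f"{prompt}\n\nThe previous reply was not valid JSON ({exc}). "
                "Reply with JSON only."
            )
    return None  # signals the caller to fall back to a placeholder
```

A `None` result from `ask_for_json`, or an exception escaping `retry_with_backoff`, is the point where the third layer would substitute a placeholder observation or heuristic metadata.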
Verification
- Check exit code: run.sh returns 0 on success
- Verify observations_visual.json has entries for the expected frame count
- Verify observations_audio.json has a transcript field (non-empty for speech videos)
- If --synthesize-method was used, verify metadata.json has all required fields (title, intro, tags, category, cover_suggestion)
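The file checks above can be automated with a small script. A minimal sketch, assuming the file names and required fields listed in this document; `verify_outputs` is a hypothetical helper and does not check the exit code of run.sh or the exact frame count.

```python
import json
from pathlib import Path
from typing import Any, List, Optional

REQUIRED_METADATA_FIELDS = ("title", "intro", "tags", "category", "cover_suggestion")


def _load(path: Path, problems: List[str]) -> Optional[Any]:
    """Load a JSON file, recording a problem instead of raising."""
    if not path.is_file():
        problems.append(f"{path.name} is missing")
        return None
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        problems.append(f"{path.name} is not valid JSON")
        return None


def verify_outputs(output_dir: str, expect_metadata: bool = False) -> List[str]:
    """Return a list of problems found in the output directory."""
    out = Path(output_dir)
    problems: List[str] = []

    visual = _load(out / "observations_visual.json", problems)
    if visual is not None and (not isinstance(visual, list) or not visual):
        problems.append("observations_visual.json has no frame entries")

    audio = _load(out / "observations_audio.json", problems)
    if audio is not None and (not isinstance(audio, dict) or "transcript" not in audio):
        problems.append("observations_audio.json lacks a transcript field")

    if expect_metadata:
        meta = _load(out / "metadata.json", problems)
        if meta is not None:
            missing = [f for f in REQUIRED_METADATA_FIELDS if f not in meta]
            if missing:
                problems.append("metadata.json is missing fields: " + ", ".join(missing))
    return problems
```

An empty list means every check passed; set `expect_metadata=True` only when a synthesis method was requested.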
For complete parameter reference, output schemas, standalone usage per script, and detailed error handling, see references/REFERENCE.md.
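- Make sure OpenClaw is installed (locally or via Docker).
- Enter the install command in the chat box: /install video-metadata-analyzer
- Once installation completes, invoke the Skill by name or trigger it with /video-metadata-analyzer
- Provide the required inputs described in the Skill's parameter documentation to get structured output.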
What is Video Metadata Intelligence System?
Drop in a video file and get back a complete Bilibili publishing package: title, intro, tags, category, cover suggestion, and content declaration, all generated automatically from visual and audio analysis. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 37 downloads to date.
How do I install Video Metadata Intelligence System?
Run the command "/install video-metadata-analyzer" in the OpenClaw or Claude Code chat box for a one-step install; no additional configuration is needed.
Is Video Metadata Intelligence System free?
Yes. Video Metadata Intelligence System is completely free and released under the MIT-0 license, so you can download, install, and use it freely.
Which platforms does Video Metadata Intelligence System support?
Video Metadata Intelligence System is cross-platform and can be used in any environment where OpenClaw or Claude Code is deployed.
Who develops Video Metadata Intelligence System?
It is developed and maintained by CyberKurry (@cyberkurry). The current version is v1.0.0.