Video Metadata Intelligence System
/install video-metadata-analyzer
Video Analyzer
Three-stage video analysis pipeline: parallel visual + audio observation, then metadata synthesis for Bilibili publishing.
When to Use
- User says "analyze this video", "视频分析" (video analysis), "提取视频信息" (extract video information), "生成投稿元数据" (generate publishing metadata)
- User wants structured content analysis from a video file
- User wants to prepare a video for Bilibili publishing (title, intro, tags, category, cover)
- Upstream of bilibili-publish-playwright: this skill generates the metadata that feeds into Bilibili publishing
Architecture
Input Video ──────────────────────────────────────────────────────
│ │
├── Stage 1a: visual.py ──→ observations_visual.json │
│ (ffmpeg extract frames → encode → vision LLM observe) │ PARALLEL
│ │
├── Stage 1b: transcribe.py ──→ observations_audio.json │
│ (ffmpeg extract audio → transcribe + structure) │
│ │
└── Stage 2: analyze.py ──→ metadata.json ←───────────────────┘
(merge V+A observations → publishable metadata via LLM)
run.sh orchestrates the pipeline: it launches visual.py and transcribe.py as background processes (&), waits for both, then optionally runs analyze.py.
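The launch-in-background, wait, then synthesize pattern can be sketched in Python as follows. This is a minimal illustration, not the contents of run.sh: the helper name run_parallel_then is hypothetical, the commands are stand-ins, and the sketch assumes that a failed observation stage should stop the pipeline (the real script may instead continue with placeholder observations).

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_parallel_then(
    parallel_cmds: Sequence[Sequence[str]],
    final_cmd: Optional[Sequence[str]] = None,
) -> int:
    """Start every command in parallel_cmds at once, wait for all of them,
    then run final_cmd (if given) only when all of them exited with 0.

    Returns 0 on success, otherwise the first non-zero exit code seen.
    Hypothetical helper: it mirrors the "&" + "wait" pattern, nothing more.
    """
    # Popen returns immediately, so the commands run concurrently.
    procs = [subprocess.Popen(list(cmd)) for cmd in parallel_cmds]
    # wait() blocks until that process exits and returns its exit code.
    codes = [p.wait() for p in procs]
    for code in codes:
        if code != 0:
            return code
    if final_cmd is None:
        return 0  # observe-only: no synthesis stage requested
    return subprocess.run(list(final_cmd)).returncode


if __name__ == "__main__":
    # Stand-in commands so the sketch runs anywhere; substitute the real
    # visual.py / transcribe.py / analyze.py invocations.
    noop = [sys.executable, "-c", "pass"]
    print(run_parallel_then([noop, noop], final_cmd=noop))
```

Passing final_cmd=None corresponds to the observe-only case, where synthesis is skipped.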
Output Directory
$OUTPUT/
├── observations_visual.json # JSON array: one object per frame
├── observations_audio.json # JSON object: transcript + structured info
├── metadata.json # (optional) Synthesized Bilibili metadata
└── frames/ # (only with --keep-frames, auto-cleaned otherwise)
Procedure
1. Full pipeline (recommended — all external LLM API)
bash scripts/run.sh \
--video VIDEO_PATH --output /tmp/va-out \
--transcribe audio-llm \
--audio-llm-key KEY --audio-llm-base URL --audio-llm-model MODEL \
--vision-llm-key KEY --vision-llm-base URL --vision-llm-model MODEL \
--max-frames 15 \
--synthesize-method api \
--analyze-llm-key KEY --analyze-llm-base URL --analyze-llm-model MODEL
2. Agent-direct mode (no external API — agent reads frames/audio directly)
bash scripts/run.sh \
--video VIDEO_PATH --output /tmp/va-out --keep-frames
The agent then reads observations_visual.json (placeholder frames), observations_audio.json (audio file path), and, optionally, the frame images and audio file themselves to generate metadata.
3. Mixed / observe-only
Omit --synthesize-method to observe only, then run analyze.py separately later. Each stage (visual, audio, synthesize) can use different keys and models.
Key Parameters
| Parameter | Default | Purpose |
|---|---|---|
| --video PATH | (none) | Required. Input video file |
| --output DIR | (none) | Required. Output directory |
| --transcribe MODE | agent-direct | local / cloud / agent-direct / audio-llm |
| --max-frames N | 15 | Max frames per 4-min segment |
| --keep-frames | false | Keep extracted frame images |
| --synthesize-method METHOD | (none) | api / agent / manual. Omit = observe only |
All *-key, *-base, *-model parameters follow the pattern: --vision-llm-key, --audio-llm-key, --analyze-llm-key etc. See references/REFERENCE.md for the complete parameter table.
Scripts
| File | Role |
|---|---|
| scripts/common.py | Shared utilities: HTTP retry with backoff, media duration via ffprobe, JSON parse from LLM output |
| scripts/visual.py | Frame extraction (auto-segment, auto-compress >200KB) + vision LLM observation. Long videos: segments processed in parallel (max 4 concurrent) |
| scripts/transcribe.py | Audio extraction + transcription (4 modes). Auto-chunks large audio with 2s overlap for dedup |
| scripts/analyze.py | Observations → publish metadata (3 methods: api/agent/manual). Heuristic fallback on API failure |
| scripts/run.sh | Orchestrator: parallel visual+audio, then optional synthesis |
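
To make the "2s overlap" in the transcribe.py row concrete, here is a minimal sketch of how overlapping chunk windows could be computed. The function name chunk_windows and the chunk_len parameter are hypothetical: the source does not say how transcribe.py decides chunk size, and the deduplication of the overlapping transcript text is not shown.

```python
from typing import List, Tuple


def chunk_windows(
    duration: float, chunk_len: float, overlap: float = 2.0
) -> List[Tuple[float, float]]:
    """Return (start, end) windows in seconds that cover [0, duration].

    Every window after the first starts `overlap` seconds before the
    previous one ended, so neighbouring chunks share a short stretch of
    audio. chunk_len is an assumed parameter, not a documented one.
    """
    if chunk_len <= overlap:
        raise ValueError("chunk_len must be greater than overlap")
    duration = float(duration)
    windows: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_len, duration)
        windows.append((start, end))
        if end >= duration:
            break  # the last window reached the end of the audio
        start = end - overlap  # step back so the next chunk overlaps
    return windows
```

For a 25-second file with 10-second chunks this gives the windows 0-10, 8-18 and 16-25, so each boundary is transcribed twice and the repeated words can be dropped when the chunks are stitched together.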
Output Summary
observations_visual.json — JSON array, one object per frame with frame, objects, desc, texts, actions, style, cover_candidate, segment, segment_start.
observations_audio.json — transcript, speakers, key_points, tone. Agent-direct mode includes audio_file path.
metadata.json — title (≤80 chars), intro (≤2000 chars), tags (≤10), category, sub_category, cover_suggestion (primary + reason + secondary), declaration, copyright_claim.
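
The size limits listed for metadata.json can be checked with a few lines of Python. This is a sketch under assumptions: check_metadata_limits is a hypothetical helper, characters are counted with len() (the platform may count differently), and tags is assumed to be a JSON array.

```python
from typing import Any, Dict, List

# Maximum lengths taken from the output summary for metadata.json.
MAX_CHARS = {"title": 80, "intro": 2000}
MAX_TAGS = 10


def check_metadata_limits(meta: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; empty means within limits."""
    problems: List[str] = []
    for field, limit in MAX_CHARS.items():
        value = meta.get(field, "")
        if not isinstance(value, str):
            problems.append(f"{field}: expected a string")
        elif len(value) > limit:
            problems.append(f"{field}: {len(value)} chars exceeds {limit}")
    tags = meta.get("tags", [])
    if not isinstance(tags, list):
        problems.append("tags: expected a list")
    elif len(tags) > MAX_TAGS:
        problems.append(f"tags: {len(tags)} exceeds {MAX_TAGS}")
    return problems
```

Running this on a loaded metadata.json before handing it to a publishing step catches over-long titles early.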
Pitfalls
- API keys in chat: Some platforms truncate keys with …, so always pass keys via command-line arguments, not through messages.
- Model capability: Vision requires image_url support. Audio-LLM requires input_audio support. Check your provider.
- Game recordings: Frames are large (~300-380KB vs ~30-80KB for phone). Auto-compression handles this, but plan rate limits for long videos.
- Long videos = parallel API calls: A 30-min video splits into 8 segments of up to 15 frames each, which means 8 vision API calls (capped at 4 concurrent). Consider rate limits.
- Missing credentials auto-degrade: Omitting LLM keys → preprocess-only or agent-direct mode. Scripts never crash on missing keys.
- --interval deprecated: Ignored. The interval is auto-calculated per segment based on --max-frames.
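
The segment arithmetic behind the long-video pitfall can be worked through in code. The 4-minute segment length and the default of 15 frames come from the parameter table; the interval formula (segment length divided by max frames) is an assumption about how the interval is auto-calculated, and plan_segments is a hypothetical name.

```python
import math
from typing import List, NamedTuple

SEGMENT_SECONDS = 240  # 4-minute segments, per the --max-frames description


class Segment(NamedTuple):
    index: int
    start: float     # seconds from the beginning of the video
    length: float    # seconds; the last segment may be shorter
    interval: float  # seconds between sampled frames (assumed formula)


def plan_segments(duration: float, max_frames: int = 15) -> List[Segment]:
    """Split a video of `duration` seconds into 4-minute segments and
    spread at most max_frames sampled frames evenly over each one."""
    if duration <= 0 or max_frames <= 0:
        raise ValueError("duration and max_frames must be positive")
    count = math.ceil(duration / SEGMENT_SECONDS)
    segments: List[Segment] = []
    for i in range(count):
        start = float(i * SEGMENT_SECONDS)
        length = min(float(SEGMENT_SECONDS), duration - start)
        segments.append(Segment(i, start, length, length / max_frames))
    return segments
```

For a 30-minute video, plan_segments(1800) returns 8 segments, which matches the 8 vision API calls quoted in the list; under the assumed formula a full segment samples one frame every 16 seconds.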
Error Handling
Three-layer defense:
- HTTP retry — 3 retries with exponential backoff on 5xx / connection errors
- JSON parse retry — 3 attempts with error feedback sent back to LLM
- Graceful degradation — placeholder observations on visual failure, raw text on audio failure, heuristic fallback on synthesis failure
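
The first two layers can be sketched as follows. This is an illustration under assumptions, not the code in common.py: TransientError stands in for a 5xx response or connection error, the 1-second base delay is invented for the example, and ask is a hypothetical callable that sends a prompt to an LLM and returns its text reply.

```python
import json
import time
from typing import Any, Callable, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a 5xx response or a connection error."""


def retry_with_backoff(
    call: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,  # assumed; the source does not state the delay
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); on TransientError retry up to `retries` more times,
    sleeping base_delay, then 2x, then 4x that between attempts."""
    for attempt in range(retries + 1):
        try:
            return call()
        except TransientError:
            if attempt == retries:
                raise  # retries exhausted: let the caller degrade gracefully
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


def ask_for_json(
    ask: Callable[[str], str], prompt: str, attempts: int = 3
) -> Optional[Any]:
    """Call ask(prompt) up to `attempts` times. After each reply that is
    not valid JSON, the parser error is appended to the next prompt."""
    current = prompt
    for _ in range(attempts):
        reply = ask(current)
        try:
            return json.loads(reply)
        except json.JSONDecodeError as err:
            current = (
                f"{prompt}\n\nYour previous reply was not valid JSON "
                f"({err}). Reply with JSON only."
            )
    return None  # caller substitutes a placeholder or heuristic result
```

Returning None from ask_for_json rather than raising leaves room for the third layer: the caller can fall back to a placeholder observation or a heuristic result.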
Verification
- Check exit code: run.sh returns 0 on success
- Verify observations_visual.json has entries for expected frame count
- Verify observations_audio.json has transcript field (non-empty for speech videos)
- If --synthesize-method used, verify metadata.json has all required fields (title, intro, tags, category, cover_suggestion)
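
The file checks in this list can be automated with a short script. The sketch below assumes the file names and field names given in this document; verify_outputs is a hypothetical helper, and it checks only that the visual array is non-empty, not that it matches a specific frame count.

```python
import json
from pathlib import Path
from typing import Any, List

REQUIRED_METADATA_FIELDS = ("title", "intro", "tags", "category", "cover_suggestion")
_MISSING = object()  # marks a file that could not be read or parsed


def verify_outputs(output_dir: str, expect_metadata: bool = False) -> List[str]:
    """Return a list of problems found in output_dir; empty means all passed."""
    out = Path(output_dir)
    problems: List[str] = []

    def load(name: str) -> Any:
        try:
            return json.loads((out / name).read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError) as err:
            problems.append(f"{name}: {err}")
            return _MISSING

    visual = load("observations_visual.json")
    if visual is not _MISSING and (not isinstance(visual, list) or not visual):
        problems.append("observations_visual.json: expected a non-empty array")

    audio = load("observations_audio.json")
    if audio is not _MISSING and not (isinstance(audio, dict) and "transcript" in audio):
        problems.append("observations_audio.json: missing transcript field")

    if expect_metadata:
        meta = load("metadata.json")
        if isinstance(meta, dict):
            missing = [f for f in REQUIRED_METADATA_FIELDS if f not in meta]
            if missing:
                problems.append("metadata.json: missing " + ", ".join(missing))
        elif meta is not _MISSING:
            problems.append("metadata.json: expected an object")
    return problems
```

Call verify_outputs("/tmp/va-out", expect_metadata=True) after a run that used --synthesize-method; an empty list means every check passed.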
For complete parameter reference, output schemas, standalone usage per script, and detailed error handling, see references/REFERENCE.md.
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat: /install video-metadata-analyzer
- After installation, invoke the skill by name or use /video-metadata-analyzer
- Provide required inputs per the skill's parameter spec and get structured output
What is Video Metadata Intelligence System?
Drop in a video file and get back a complete Bilibili publishing package: title, intro, tags, category, cover suggestion, and content declaration, all generated automatically from visual and audio analysis. It is an AI Agent Skill for Claude Code / OpenClaw, with 37 downloads so far.
How do I install Video Metadata Intelligence System?
Run "/install video-metadata-analyzer" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.
Is Video Metadata Intelligence System free?
Yes, Video Metadata Intelligence System is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does Video Metadata Intelligence System support?
Video Metadata Intelligence System is cross-platform and runs anywhere OpenClaw / Claude Code is available.
Who created Video Metadata Intelligence System?
It is built and maintained by CyberKurry (@cyberkurry); the current version is v1.0.0.