Description

Drop in a video file and get back a complete Bilibili publishing package: title, intro, tags, category, cover suggestion, and content declaration, all generated automatically from visual and audio analysis.

README (SKILL.md)

Video Analyzer

Name: Video Metadata Intelligence System
Author: cyberkurry

Three-stage video analysis pipeline: parallel visual + audio observation, then metadata synthesis for Bilibili publishing.

When to Use

User says "analyze this video", "视频分析" (video analysis), "提取视频信息" (extract video information), or "生成投稿元数据" (generate submission metadata)
User wants structured content analysis from a video file
User wants to prepare a video for Bilibili publishing (title, intro, tags, category, cover)
Upstream of bilibili-publish-playwright: this skill generates the metadata that feeds into Bilibili publishing

Architecture

Input Video ──────────────────────────────────────────────────────
    │                                                            │
    ├── Stage 1a: visual.py ──→ observations_visual.json          │
    │      (ffmpeg extract frames → encode → vision LLM observe)  │ PARALLEL
    │                                                            │
    ├── Stage 1b: transcribe.py ──→ observations_audio.json       │
    │      (ffmpeg extract audio → transcribe + structure)         │
    │                                                            │
    └── Stage 2: analyze.py ──→ metadata.json ←───────────────────┘
           (merge V+A observations → publishable metadata via LLM)

run.sh orchestrates the pipeline: it launches visual.py and transcribe.py as background processes (&), waits for both, then optionally runs analyze.py.
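The launch-both, wait-for-both, then-synthesize flow can be sketched in Python. This is not the contents of run.sh, only an illustration of the pattern; the function name `run_pipeline` is hypothetical, and the demo commands are trivial stand-ins for visual.py, transcribe.py, and analyze.py.

```python
import subprocess
import sys


def run_pipeline(parallel_cmds, final_cmd=None):
    """Start every command in parallel_cmds, wait for all, then run final_cmd.

    Returns the exit codes in order: parallel commands first, then the
    final command if one was given.
    """
    # Popen returns immediately, so all stage-1 processes run concurrently.
    procs = [subprocess.Popen(cmd) for cmd in parallel_cmds]
    # wait() blocks until that process exits; the loop waits for all of them.
    codes = [proc.wait() for proc in procs]
    if final_cmd is not None:
        # Synthesis runs only after both observation stages have finished.
        codes.append(subprocess.run(final_cmd).returncode)
    return codes


if __name__ == "__main__":
    py = sys.executable
    # Stand-ins for the two observation stages and the synthesis stage.
    print(run_pipeline([[py, "-c", "pass"], [py, "-c", "pass"]], [py, "-c", "pass"]))
```

Passing `final_cmd=None` mirrors the observe-only case, where synthesis is skipped.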

Output Directory

$OUTPUT/
├── observations_visual.json    # JSON array: one object per frame
├── observations_audio.json     # JSON object: transcript + structured info
├── metadata.json               # (optional) Synthesized Bilibili metadata
└── frames/                     # (only with --keep-frames, auto-cleaned otherwise)

Procedure

1. Full pipeline (recommended — all external LLM API)

bash scripts/run.sh \
  --video VIDEO_PATH --output /tmp/va-out \
  --transcribe audio-llm \
  --audio-llm-key KEY --audio-llm-base URL --audio-llm-model MODEL \
  --vision-llm-key KEY --vision-llm-base URL --vision-llm-model MODEL \
  --max-frames 15 \
  --synthesize-method api \
  --analyze-llm-key KEY --analyze-llm-base URL --analyze-llm-model MODEL

2. Agent-direct mode (no external API — agent reads frames/audio directly)

bash scripts/run.sh \
  --video VIDEO_PATH --output /tmp/va-out --keep-frames

The agent then reads observations_visual.json (placeholder frame entries), observations_audio.json (which holds the audio file path), and optionally the frame images and audio file directly to generate metadata.

3. Mixed / observe-only

Omit --synthesize-method to observe only, then run analyze.py separately later. Each stage (visual, audio, synthesize) can use different keys and models.

Key Parameters

| Parameter | Default | Purpose |
|---|---|---|
| `--video PATH` | — | Required. Input video file |
| `--output DIR` | — | Required. Output directory |
| `--transcribe MODE` | `agent-direct` | `local` / `cloud` / `agent-direct` / `audio-llm` |
| `--max-frames N` | `15` | Max frames per 4-min segment |
| `--keep-frames` | `false` | Keep extracted frame images |
| `--synthesize-method METHOD` | — | `api` / `agent` / `manual`. Omit = observe only |

All *-key, *-base, and *-model parameters follow the pattern --<stage>-llm-<key|base|model>, where <stage> is vision, audio, or analyze (for example --vision-llm-key, --audio-llm-base, --analyze-llm-model). See references/REFERENCE.md for the complete parameter table.
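As an illustration of the naming pattern, the flags for one stage can be generated rather than typed by hand. The helper `llm_flags` below is hypothetical and not part of the skill; the flag names it produces are the ones shown in the full-pipeline command.

```python
STAGES = ("vision", "audio", "analyze")


def llm_flags(stage, key, base, model):
    """Return the key/base/model CLI flags for one pipeline stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return [
        f"--{stage}-llm-key", key,
        f"--{stage}-llm-base", base,
        f"--{stage}-llm-model", model,
    ]


if __name__ == "__main__":
    # KEY, URL, and MODEL are placeholders, as in the commands above.
    print(llm_flags("vision", "KEY", "URL", "MODEL"))
```

Because each stage takes its own triple, different stages can point at different providers.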

Scripts

| File | Role |
|---|---|
| `scripts/common.py` | Shared utilities: HTTP retry with backoff, media duration via ffprobe, JSON parse from LLM output |
| `scripts/visual.py` | Frame extraction (auto-segment, auto-compress >200KB) + vision LLM observation. Long videos: segments processed in parallel (max 4 concurrent) |
| `scripts/transcribe.py` | Audio extraction + transcription (4 modes). Auto-chunks large audio with 2s overlap for dedup |
| `scripts/analyze.py` | Observations → publish metadata (3 methods: api/agent/manual). Heuristic fallback on API failure |
| `scripts/run.sh` | Orchestrator: parallel visual+audio, then optional synthesis |
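The overlapping audio chunks mentioned for transcribe.py can be pictured with a small sketch. The function `chunk_spans` and its `chunk_len` parameter are hypothetical; the source states only the 2-second overlap, not the chunk size or how the script computes boundaries.

```python
def chunk_spans(duration, chunk_len, overlap=2.0):
    """Split [0, duration] seconds into spans of at most chunk_len seconds.

    Every span after the first starts `overlap` seconds before the previous
    one ends, so speech cut at a boundary appears in both chunks and can be
    deduplicated afterwards.
    """
    if chunk_len <= overlap:
        raise ValueError("chunk_len must be greater than overlap")
    spans = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_len, duration)
        spans.append((start, end))
        if end >= duration:
            break
        # Step back by the overlap so consecutive chunks share a margin.
        start = end - overlap
    return spans


if __name__ == "__main__":
    # chunk_len=60 is an arbitrary value chosen for the demo.
    print(chunk_spans(100, 60))
```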

Output Summary

observations_visual.json — JSON array, one object per frame with frame, objects, desc, texts, actions, style, cover_candidate, segment, segment_start.

observations_audio.json — transcript, speakers, key_points, tone. Agent-direct mode includes audio_file path.

metadata.json — title (≤80 chars), intro (≤2000 chars), tags (≤10), category, sub_category, cover_suggestion (primary + reason + secondary), declaration, copyright_claim.
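The limits listed for metadata.json are easy to check mechanically. The sketch below is a hypothetical helper, not part of the skill, and it assumes that "chars" means Python string length (Unicode code points); the platform may count characters differently.

```python
# Limits taken from the metadata.json summary above.
TEXT_LIMITS = {"title": 80, "intro": 2000}
MAX_TAGS = 10


def check_metadata_limits(meta):
    """Return a list of limit violations; an empty list means all is fine."""
    problems = []
    for field, limit in TEXT_LIMITS.items():
        length = len(meta.get(field, ""))
        if length > limit:
            problems.append(f"{field}: {length} chars, limit is {limit}")
    tag_count = len(meta.get("tags", []))
    if tag_count > MAX_TAGS:
        problems.append(f"tags: {tag_count} entries, limit is {MAX_TAGS}")
    return problems


if __name__ == "__main__":
    print(check_metadata_limits({"title": "x" * 81, "intro": "", "tags": []}))
```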

Pitfalls

API keys in chat: Some platforms truncate keys with …. Always pass keys via command-line arguments, not through messages.
Model capability: Vision requires image_url support. Audio-LLM requires input_audio support. Check your provider.
Game recordings: Frames are large (~300-380KB vs ~30-80KB for phone footage). Auto-compression handles this, but plan for rate limits on long videos.
Long videos = parallel API calls: a 30-min video splits into 8 four-minute segments of up to 15 frames each, which means 8 vision API calls (at most 4 concurrent). Plan for rate limits.
Missing credentials auto-degrade: Omitting LLM keys → preprocess-only or agent-direct mode. Scripts never crash on missing keys.
--interval deprecated: Ignored. Interval auto-calculated per segment based on --max-frames.
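The segment arithmetic in the long-video pitfall can be worked through for any duration. This is a rough planning sketch, not the logic in visual.py: the name `plan_segments` is hypothetical, the 240-second segment length comes from the "4-min segment" wording, and `min_rounds` is only a lower bound that assumes at most 4 calls run at once.

```python
import math

SEGMENT_SECONDS = 240  # 4-minute segments


def plan_segments(duration_s, max_frames=15, max_concurrent=4):
    """Estimate segment count, frame ceiling, and sequential rounds."""
    segments = math.ceil(duration_s / SEGMENT_SECONDS)
    return {
        # One vision API call is made per segment.
        "segments": segments,
        # Upper bound: each segment contributes at most max_frames frames.
        "max_frames_total": segments * max_frames,
        # With a concurrency cap, calls need at least this many rounds.
        "min_rounds": math.ceil(segments / max_concurrent),
    }


if __name__ == "__main__":
    print(plan_segments(30 * 60))
```

For a 30-minute video this gives 8 segments, at most 120 frames, and at least 2 rounds of calls.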

Error Handling

Three-layer defense:

HTTP retry — 3 retries with exponential backoff on 5xx / connection errors
JSON parse retry — 3 attempts with error feedback sent back to LLM
Graceful degradation — placeholder observations on visual failure, raw text on audio failure, heuristic fallback on synthesis failure
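The first two layers can be sketched as follows. This is an illustration under stated assumptions, not the code in common.py: both function names are hypothetical, `ConnectionError` stands in for whatever the HTTP client raises on 5xx or connection failures, the 1-second base delay is arbitrary, and `ask_llm` is any callable that takes a prompt string and returns the model's reply as text.

```python
import json
import time


def retry_with_backoff(call, retries=3, base_delay=1.0, sleep=time.sleep):
    """Run call(); on ConnectionError retry up to `retries` more times.

    The delay doubles after each failure: base_delay, 2x, 4x, ...
    """
    for attempt in range(retries + 1):
        try:
            return call()
        except ConnectionError:
            if attempt == retries:
                raise  # retries exhausted, let the caller degrade gracefully
            sleep(base_delay * 2 ** attempt)


def parse_json_with_feedback(ask_llm, prompt, attempts=3):
    """Request JSON up to `attempts` times, feeding parse errors back."""
    message = prompt
    last_error = None
    for _ in range(attempts):
        reply = ask_llm(message)
        try:
            return json.loads(reply)
        except json.JSONDecodeError as exc:
            last_error = exc
            # Tell the model what went wrong so the next reply can fix it.
            message = (
                f"{prompt}\n\nYour previous reply was not valid JSON "
                f"({exc}). Reply with JSON only."
            )
    raise ValueError(f"no valid JSON after {attempts} attempts: {last_error}")
```

The `sleep` parameter is injectable so the backoff schedule can be checked without real waiting. A caller would catch the final exception and fall back to the third layer.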

Verification

Check exit code: run.sh returns 0 on success
Verify observations_visual.json has entries for expected frame count
Verify observations_audio.json has transcript field (non-empty for speech videos)
If --synthesize-method used, verify metadata.json has all required fields (title, intro, tags, category, cover_suggestion)
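The file checks above can be automated with a short script. This is a minimal sketch with a hypothetical function name, `verify_outputs`; it checks only what the list states (a non-empty array, a transcript field, the required metadata fields) and does not compare against an expected frame count.

```python
import json
from pathlib import Path

REQUIRED_METADATA_FIELDS = ("title", "intro", "tags", "category", "cover_suggestion")


def verify_outputs(output_dir, expect_metadata=False):
    """Return a list of problems found in the pipeline's output directory."""
    out = Path(output_dir)
    problems = []

    def load(name):
        # Record unreadable or malformed files instead of raising.
        try:
            return json.loads((out / name).read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError) as exc:
            problems.append(f"{name}: {exc}")
            return None

    visual = load("observations_visual.json")
    if visual is not None and not (isinstance(visual, list) and visual):
        problems.append("observations_visual.json: expected a non-empty array")

    audio = load("observations_audio.json")
    if audio is not None and not (isinstance(audio, dict) and "transcript" in audio):
        problems.append("observations_audio.json: missing transcript field")

    if expect_metadata:
        meta = load("metadata.json")
        if meta is not None:
            fields = meta if isinstance(meta, dict) else {}
            missing = [f for f in REQUIRED_METADATA_FIELDS if f not in fields]
            if missing:
                problems.append("metadata.json: missing fields: " + ", ".join(missing))
    return problems
```

Set `expect_metadata=True` only when --synthesize-method was passed, since metadata.json is optional otherwise.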

For complete parameter reference, output schemas, standalone usage per script, and detailed error handling, see references/REFERENCE.md.

Usage Guidance

Do not treat this as a completed security review. The workspace files need to be readable before making an installation decision.

Capability Tags

requires-sensitive-credentials

Capability Assessment

ℹ Purpose & Capability

Workspace command execution failed before files could be read; no artifact-backed purpose or capability assessment was possible.

ℹ Instruction Scope

SKILL.md could not be inspected, so instruction scope could not be verified from artifacts.

ℹ Install Mechanism

Install metadata and specs could not be inspected from the workspace in this run.

ℹ Credentials

No artifact-backed environment requirements were available for review.

ℹ Persistence & Privilege

No artifact-backed persistence or privilege behavior was available for review.

Version History

v1.0.0

Initial release of the video metadata analyzer skill.
- Launches a three-stage video analysis pipeline: frame extraction + vision LLM, audio transcription, and Bilibili-ready metadata synthesis.
- Supports multiple modes: full LLM-powered, agent-direct (API-free), and mixed observe-only.
- Outputs detailed observations and synthesized metadata, including Bilibili title, intro, tags, category, and cover suggestions.
- Robust error handling: retries, fallback methods, and graceful degradation on failures.
- Flexible architecture with parallel processing and support for custom LLM/API configuration.

Metadata

Slug video-metadata-analyzer

Version 1.0.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is Video Metadata Intelligence System?

Drop in a video file and get back a complete Bilibili publishing package: title, intro, tags, category, cover suggestion, and content declaration, all generated automatically from visual and audio analysis. It is an AI Agent Skill for Claude Code / OpenClaw, with 37 downloads so far.

How do I install Video Metadata Intelligence System?

Run "/install video-metadata-analyzer" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Video Metadata Intelligence System free?

Yes, Video Metadata Intelligence System is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Video Metadata Intelligence System support?

Video Metadata Intelligence System is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Video Metadata Intelligence System?

It is built and maintained by CyberKurry (@cyberkurry); the current version is v1.0.0.
