see-video
Extract frames from a video and inject them as a grid image + XML timestamps into LLM context.
Setup (first time only)
cd <skill directory>
npm install
Usage
node {baseDir}/scripts/inject.mjs <video_path> [--mode uniform|highlight] [--start N] [--end N]
On success, outputs JSON to stdout:
{
"gridPath": "/tmp/video_llm-frames.jpg",
"description": "<video_frames>...</video_frames>",
"duration": 1326,
"frameCount": 28,
"layout": { "cols": 4, "rows": 7, "cellW": 384, "cellH": 216 },
"videoWidth": 854,
"videoHeight": 480,
"inputSizeMb": 42.3
}
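A minimal sketch of consuming this output from Node.js, assuming the caller already holds the script's stdout as a string. The helper name parseInjectOutput and the consistency check (frameCount must fit in cols × rows cells) are illustrative assumptions, not part of the skill.

```javascript
// Hypothetical helper: parse and sanity-check the JSON that inject.mjs prints on success.
function parseInjectOutput(stdout) {
  const out = JSON.parse(stdout);

  // Fields used by the injection procedure must be present.
  for (const key of ["gridPath", "description", "frameCount", "layout"]) {
    if (!(key in out)) throw new Error(`missing field: ${key}`);
  }

  // Assumption: each frame occupies one grid cell, so the grid needs enough cells.
  const { cols, rows } = out.layout;
  if (out.frameCount > cols * rows) {
    throw new Error(`frameCount ${out.frameCount} exceeds ${cols}x${rows} grid`);
  }

  // The hint field is optional; normalize its absence to null.
  return { ...out, hint: out.hint ?? null };
}

// Usage with the example output above.
const sample = JSON.stringify({
  gridPath: "/tmp/video_llm-frames.jpg",
  description: "<video_frames>...</video_frames>",
  duration: 1326,
  frameCount: 28,
  layout: { cols: 4, rows: 7, cellW: 384, cellH: 216 },
  videoWidth: 854,
  videoHeight: 480,
  inputSizeMb: 42.3,
});
const parsed = parseInjectOutput(sample);
console.log(parsed.gridPath, parsed.frameCount, parsed.hint); // /tmp/video_llm-frames.jpg 28 null
```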
If the video exceeds 10 minutes and uniform mode was used without --start/--end, a hint field is included:
{
"hint": "Video is 30 minutes long. This is a uniform overview. For better scene coverage re-run with --mode highlight, or use --start/--end to zoom into a specific section."
}
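The condition above can be written as a small predicate, for example to predict whether a run will come back with a hint. This is a sketch of the documented rule, not code from inject.mjs; it assumes duration is in seconds (as in the example output) and reads "exceeds 10 minutes" as strictly more than 600 seconds. The name expectsHint is hypothetical.

```javascript
// Sketch of the documented hint rule (hypothetical helper, not taken from inject.mjs).
const TEN_MINUTES_S = 10 * 60;

function expectsHint({ durationS, mode = "uniform", start, end }) {
  const isLong = durationS > TEN_MINUTES_S; // "exceeds 10 minutes"
  const isUniform = mode === "uniform"; // uniform is the default mode
  const isWholeVideo = start === undefined && end === undefined; // no --start/--end
  return isLong && isUniform && isWholeVideo;
}

console.log(expectsHint({ durationS: 1800 })); // true
console.log(expectsHint({ durationS: 1800, mode: "highlight" })); // false
console.log(expectsHint({ durationS: 1800, start: 600, end: 900 })); // false
console.log(expectsHint({ durationS: 300 })); // false
```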
Recommended workflow for long videos:
- First run with --mode highlight, which shows key scene changes across the whole video.
- If the user wants detail on a specific section, re-run with --start N --end N.
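The two-pass workflow can be expressed as argument lists for the script. A sketch assuming the caller launches Node via child_process; the helper names and the script path are hypothetical, and only the flags (--mode, --start, --end) come from this documentation.

```javascript
// Hypothetical helpers that build argv arrays for inject.mjs following the long-video workflow.
function overviewArgs(scriptPath, videoPath) {
  // Pass 1: scene-change biased overview of the whole video.
  return [scriptPath, videoPath, "--mode", "highlight"];
}

function sectionArgs(scriptPath, videoPath, startS, endS) {
  // Pass 2: zoom into one section (seconds); mode falls back to the uniform default.
  if (!(Number.isFinite(startS) && Number.isFinite(endS) && startS >= 0 && startS < endS)) {
    throw new RangeError(`invalid section: ${startS}..${endS}`);
  }
  return [scriptPath, videoPath, "--start", String(startS), "--end", String(endS)];
}

const script = "/path/to/skill/scripts/inject.mjs"; // assumed location of {baseDir}
console.log(overviewArgs(script, "/path/to/video.mp4"));
console.log(sectionArgs(script, "/path/to/video.mp4", 600, 900));
// Either array can be passed to child_process.execFileSync("node", args).
```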
On error, the script writes ERROR: <message> followed by Hint: <diagnosis> to stderr and exits with code 1.
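A minimal sketch of splitting that stderr output into its two parts, assuming each part sits on its own line beginning with ERROR: or Hint: exactly as shown. The helper name parseScriptError is hypothetical, and the sample stderr text is made up from the moov atom row of the Diagnostics table.

```javascript
// Hypothetical helper: pull the message and diagnosis out of inject.mjs stderr.
function parseScriptError(stderr) {
  let message = null;
  let hint = null;
  for (const line of stderr.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed.startsWith("ERROR:")) message = trimmed.slice("ERROR:".length).trim();
    else if (trimmed.startsWith("Hint:")) hint = trimmed.slice("Hint:".length).trim();
  }
  return { message, hint };
}

// Illustrative stderr, not captured from a real run.
const sampleStderr = "ERROR: moov atom not found\nHint: Incomplete mp4 (streaming not finished)\n";
const result = parseScriptError(sampleStderr);
console.log(result.message); // moov atom not found
console.log(result.hint); // Incomplete mp4 (streaming not finished)
```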
Injection procedure
Step 1 — Run the script (bash tool):
node {baseDir}/scripts/inject.mjs "/path/to/video.mp4"
Step 2 — Parse JSON:
Extract gridPath and description.
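Steps 1 and 2 can also be scripted from Node.js. A sketch using the built-in child_process and fs modules: it runs inject.mjs, parses the JSON, and, because /tmp/ files are ephemeral, re-runs once if the grid file is already gone. The helper name runInject, the retry count, and the script path are assumptions.

```javascript
const { spawnSync } = require("node:child_process");
const fs = require("node:fs");

// Hypothetical helper: run inject.mjs and return its parsed JSON, re-running
// if the grid image has already disappeared from /tmp.
function runInject(scriptPath, videoPath, extraArgs = [], attempts = 2) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    // process.execPath is the running node binary.
    const res = spawnSync(process.execPath, [scriptPath, videoPath, ...extraArgs], {
      encoding: "utf8",
    });
    if (res.error) throw res.error; // node itself could not be spawned
    if (res.status !== 0) {
      // Non-zero exit: surface stderr (ERROR: ... / Hint: ...) to the caller.
      throw new Error(res.stderr.trim() || `inject.mjs exited with ${res.status}`);
    }
    const out = JSON.parse(res.stdout);
    if (fs.existsSync(out.gridPath)) return out; // grid still exists; read it right away
  }
  throw new Error("grid image missing after re-running inject.mjs");
}
```

Usage: call runInject("/path/to/skill/scripts/inject.mjs", "/path/to/video.mp4") and read the returned gridPath immediately, before /tmp can be cleaned.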
Step 3 — Inject image (read tool):
read <gridPath>
The read tool injects the JPEG into the context as a native multimodal image block.
After viewing the grid, use the description XML timestamps to reference frames:
"Look at the grid image above. Use the timestamps in the description XML to analyze the video. The number in the top-left of each cell is the frame index."
On error:
- Translate the Hint: message into natural language for the user. Do not paste raw error output.
- If read <gridPath> fails, the grid has probably been cleaned up, because /tmp/ files are ephemeral. Re-run the script and read the grid immediately.
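Once the grid has been read, the frame index drawn in the top-left of a cell can be turned back into pixel coordinates using the layout field of the JSON output. This sketch assumes frames fill the grid row by row, left to right, starting at index 0; the skill does not document the fill order or whether indices start at 0 or 1, so check against the numbers drawn in the cells. The helper name cellForFrame is hypothetical.

```javascript
// Hypothetical helper: locate a frame's cell in the grid image.
// Assumes row-major order and zero-based indices (not confirmed by the skill docs).
function cellForFrame(index, layout) {
  const { cols, rows, cellW, cellH } = layout;
  if (!Number.isInteger(index) || index < 0 || index >= cols * rows) {
    throw new RangeError(`frame index ${index} outside ${cols}x${rows} grid`);
  }
  const row = Math.floor(index / cols);
  const col = index % cols;
  // Top-left pixel of the cell, plus its size.
  return { row, col, x: col * cellW, y: row * cellH, w: cellW, h: cellH };
}

const layout = { cols: 4, rows: 7, cellW: 384, cellH: 216 }; // from the example output
const c = cellForFrame(9, layout);
console.log(c.row, c.col, c.x, c.y); // 2 1 384 432
```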
Options
| Option | Default | Description |
|---|---|---|
| --mode uniform | ✅ | Evenly spaced frames |
| --mode highlight | | Scene-change biased sampling |
| --start N | 0 | Segment start (seconds) |
| --end N | end of video | Segment end (seconds) |
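For a caller that wants to validate arguments before launching the script, the table maps onto a small parser. This is not inject.mjs's own parser; the name parseOptions and the validation rules (mode must be uniform or highlight, times must be non-negative, start must precede end) are assumptions drawn from the table.

```javascript
// Hypothetical parser mirroring the documented options table.
function parseOptions(argv, durationS) {
  const opts = { mode: "uniform", start: 0, end: durationS }; // documented defaults
  for (let i = 0; i < argv.length; i++) {
    const flag = argv[i];
    const value = argv[i + 1];
    if (flag === "--mode") {
      if (value !== "uniform" && value !== "highlight") throw new Error(`bad --mode: ${value}`);
      opts.mode = value;
    } else if (flag === "--start" || flag === "--end") {
      const n = Number(value);
      if (value === undefined || !Number.isFinite(n) || n < 0) throw new Error(`bad ${flag}: ${value}`);
      opts[flag.slice(2)] = n; // "start" or "end"
    } else {
      throw new Error(`unknown option: ${flag}`);
    }
    i++; // skip the value just consumed
  }
  if (opts.start >= opts.end) throw new Error("--start must be before --end");
  return opts;
}

console.log(JSON.stringify(parseOptions(["--mode", "highlight", "--start", "60"], 1326)));
// {"mode":"highlight","start":60,"end":1326}
```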
Diagnostics
| Error | Cause | Action |
|---|---|---|
| Input file not found | File missing or dropped by the channel's media size limit | Ask the user to share the file path directly as text |
| corrupt, incomplete, or unsupported format | Damaged file, interrupted transfer, or unsupported codec | Try a different file, or use --start/--end to skip problematic sections |
| moov atom not found | Incomplete mp4 (streaming not finished) | Retry with a complete file |
| ffmpeg not found | ffmpeg not installed | Check ffmpeg installation |
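When relaying the Hint to the user, a caller could map known error messages to the actions in this table. A minimal sketch, assuming the ERROR: message contains the table's error text as a case-insensitive substring; the helper name suggestAction and the fallback text are hypothetical.

```javascript
// Hypothetical lookup built from the Diagnostics table.
const DIAGNOSTICS = [
  { match: "input file not found", action: "Ask the user to share the file path directly as text" },
  {
    match: "corrupt, incomplete, or unsupported format",
    action: "Try a different file, or use --start/--end to skip problematic sections",
  },
  { match: "moov atom not found", action: "Retry with a complete file" },
  { match: "ffmpeg not found", action: "Check ffmpeg installation" },
];

function suggestAction(errorMessage) {
  const text = errorMessage.toLowerCase();
  const entry = DIAGNOSTICS.find((d) => text.includes(d.match));
  // Fallback for errors the table does not cover.
  return entry ? entry.action : "Unrecognized error; relay the Hint in plain language";
}

console.log(suggestAction("ERROR: moov atom not found")); // Retry with a complete file
```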
Notes
- Frame count and cell size are determined automatically from the video's duration and aspect ratio.
- The grid is about 1500×1500 px, and each cell's long side is 384–512 px.
- Timestamps appear only in the description XML; they are not overlaid on the image.
- Both portrait and landscape videos are supported.
- Telegram users: if a video file is not attached to the message, check channels.telegram.mediaMaxMb in the OpenClaw config. The file may have been dropped at the channel level before reaching the agent.
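The layout field makes these numbers easy to check. A sketch that computes the full grid size and compares the cell aspect ratio with the video's, using the example output from the Usage section. How the script actually chooses frame count and cell size is not documented here, so this only checks a returned layout; it does not reproduce the algorithm. The helper name describeGrid is hypothetical.

```javascript
// Hypothetical check of a returned layout against the documented grid and cell sizes.
function describeGrid({ layout, videoWidth, videoHeight }) {
  const { cols, rows, cellW, cellH } = layout;
  const cellLongSide = Math.max(cellW, cellH);
  return {
    gridW: cols * cellW,
    gridH: rows * cellH,
    cellLongSideInRange: cellLongSide >= 384 && cellLongSide <= 512, // documented 384–512 px
    cellAspect: cellW / cellH,
    videoAspect: videoWidth / videoHeight,
  };
}

const g = describeGrid({
  layout: { cols: 4, rows: 7, cellW: 384, cellH: 216 },
  videoWidth: 854,
  videoHeight: 480,
});
console.log(g.gridW, g.gridH, g.cellLongSideInRange); // 1536 1512 true
console.log(g.cellAspect.toFixed(3), g.videoAspect.toFixed(3)); // 1.778 1.779
```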
Installing from chat
- Make sure OpenClaw is installed (locally or via Docker).
- Run the install command in chat: /install see-video
- After installation, invoke the skill by name or with /see-video.
- Provide the required inputs per the skill's parameter spec to get structured output.
What is see-video?
see-video is an AI agent skill for Claude Code / OpenClaw, with 105 downloads so far. Use it when the user sends a video file or asks about video content. It extracts frames and injects them as an image grid directly into the LLM context, no proxy...
How do I install see-video?
Run "/install see-video" in the OpenClaw or Claude Code chat to install it. Before first use, also run npm install in the skill directory (see Setup) and make sure ffmpeg is installed.
Is see-video free?
Yes, see-video is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does see-video support?
see-video is cross-platform: it runs anywhere OpenClaw or Claude Code is available, as long as Node.js and ffmpeg are installed.
Who created see-video?
It is built and maintained by john-ver (@john-ver); the current version is v1.0.0.