see-video

Name: see-video
Author: john-ver

by john-ver · GitHub ↗ · v1.0.0 · MIT-0

cross-platform ✓ Security Clean

105 Downloads

Install in OpenClaw

/install see-video

Description

Use when the user sends a video file or asks about video content. Extracts frames and injects them as an image grid directly into the LLM context — no proxy...

README (SKILL.md)

see-video

Extract frames from a video and inject them as a grid image + XML timestamps into LLM context.

Setup (first time only)

cd <skill directory>
npm install

Usage

node {baseDir}/scripts/inject.mjs <video_path> [--mode uniform|highlight] [--start N] [--end N]

On success, outputs JSON to stdout:

{
  "gridPath": "/tmp/video_llm-frames.jpg",
  "description": "<video_frames>...</video_frames>",
  "duration": 1326,
  "frameCount": 28,
  "layout": { "cols": 4, "rows": 7, "cellW": 384, "cellH": 216 },
  "videoWidth": 854,
  "videoHeight": 480,
  "inputSizeMb": 42.3
}

If the video exceeds 10 minutes and uniform mode was used without --start/--end, a hint field is included:

{
  "hint": "Video is 30 minutes long. This is a uniform overview. For better scene coverage re-run with --mode highlight, or use --start/--end to zoom into a specific section."
}

Recommended workflow for long videos:

First run with --mode highlight — shows key scene changes across the whole video
If the user wants detail on a specific section, re-run with --start N --end N
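Users tend to describe a section as clock times ("from 12:30 to 15:00"), while --start and --end take seconds. A minimal sketch of that conversion follows; toSeconds and sectionRange are hypothetical helper names, not part of the skill.

```javascript
// Convert "SS", "MM:SS" or "HH:MM:SS" into whole seconds.
// Hypothetical helper: the skill itself only accepts seconds.
function toSeconds(clock) {
  const parts = String(clock).trim().split(":");
  if (parts.length > 3 || parts.some((p) => !/^\d+$/.test(p))) {
    throw new Error(`Unrecognized time: ${clock}`);
  }
  // Each step shifts the running total up by one base-60 place.
  return parts.reduce((total, p) => total * 60 + Number(p), 0);
}

// Turn a user-described section into values for --start / --end.
function sectionRange(from, to) {
  const start = toSeconds(from);
  const end = toSeconds(to);
  if (end <= start) throw new Error("Section end must be after its start");
  return { start, end };
}

console.log(sectionRange("12:30", "15:00")); // { start: 750, end: 900 }
```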

On error, writes ERROR: <message> + Hint: <diagnosis> to stderr and exits with code 1.
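The stderr text can be split into its two parts before anything is shown to the user. The sketch below assumes only that the output contains an "ERROR:" label and a "Hint:" label, on the same line or on separate lines; the exact layout is not specified here, and parseInjectError is a hypothetical name.

```javascript
// Split stderr from inject.mjs into { message, hint }.
// Assumption: "ERROR:" and "Hint:" labels appear literally in the text.
function parseInjectError(stderrText) {
  // Message runs until a "Hint:" label, a line break, or end of text.
  const message = /ERROR:\s*(.*?)(?=\s*Hint:|\r?\n|$)/.exec(stderrText);
  // Hint runs to the end of its line.
  const hint = /Hint:\s*(.*)/.exec(stderrText);
  return {
    message: message ? message[1].trim() : null,
    hint: hint ? hint[1].trim() : null,
  };
}

console.log(parseInjectError("ERROR: moov atom not found\nHint: file looks incomplete\n"));
// { message: 'moov atom not found', hint: 'file looks incomplete' }
```

Either field is null when its label is missing, so the caller can fall back to a generic message.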

Injection procedure

Step 1 — Run the script (bash tool):

node {baseDir}/scripts/inject.mjs "/path/to/video.mp4"

Step 2 — Parse JSON: Extract gridPath and description.

Step 3 — Inject image (read tool):

read <gridPath>

The read tool injects the jpg as a native multimodal image block into context. After viewing the grid, use the description XML timestamps to reference frames:

"Look at the grid image above. Use the timestamps in the description XML to analyze the video. The number in the top-left of each cell is the frame index."

On error:

Translate the Hint: message into natural language for the user. Do not paste raw error output.
If read <gridPath> fails: /tmp/ files are ephemeral. Re-run the script and read immediately.
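Steps 1 and 2 could be scripted roughly as follows. This is a sketch, not the skill's own code: runInject and pickInjectionFields are hypothetical names, and scriptPath stands in for {baseDir}/scripts/inject.mjs.

```javascript
// Step 2: pull out the fields the injection procedure needs.
function pickInjectionFields(stdoutText) {
  const result = JSON.parse(stdoutText);
  if (typeof result.gridPath !== "string" || typeof result.description !== "string") {
    throw new Error("inject.mjs output is missing gridPath or description");
  }
  return {
    gridPath: result.gridPath,
    description: result.description,
    hint: result.hint ?? null, // only present for long uniform runs
  };
}

// Step 1: run the script and hand its stdout to Step 2.
async function runInject(scriptPath, videoPath, extraArgs = []) {
  // Dynamic imports keep this sketch loadable as either ESM or CommonJS.
  const { execFile } = await import("node:child_process");
  const { promisify } = await import("node:util");
  // execFile passes arguments without a shell, so spaces in videoPath are safe.
  // On a non-zero exit the promise rejects, and the error carries .stderr.
  const { stdout } = await promisify(execFile)("node", [scriptPath, videoPath, ...extraArgs]);
  return pickInjectionFields(stdout);
}
```

Step 3 stays with the agent's read tool; nothing in this sketch loads the image into context.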

Options

| Option | Default | Description |
| --- | --- | --- |
| `--mode uniform` | ✅ | Evenly spaced frames |
| `--mode highlight` | | Scene-change biased sampling |
| `--start N` | `0` | Segment start (seconds) |
| `--end N` | end of video | Segment end (seconds) |
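When the command is assembled programmatically, the options above map onto an argument array. The sketch below uses a hypothetical buildInjectArgs helper; its validation rules (non-negative numbers, end after start) are assumptions made here, not documented behavior of inject.mjs.

```javascript
const MODES = ["uniform", "highlight"];

// Build the argument list that follows scripts/inject.mjs on the command line.
function buildInjectArgs(videoPath, { mode, start, end } = {}) {
  const args = [videoPath];
  if (mode !== undefined) {
    if (!MODES.includes(mode)) throw new Error(`Unknown mode: ${mode}`);
    args.push("--mode", mode);
  }
  for (const [flag, value] of [["--start", start], ["--end", end]]) {
    if (value === undefined) continue; // omitted flags fall back to the defaults
    if (!Number.isFinite(value) || value < 0) {
      throw new Error(`${flag} must be a non-negative number of seconds`);
    }
    args.push(flag, String(value));
  }
  if (start !== undefined && end !== undefined && end <= start) {
    throw new Error("--end must be greater than --start");
  }
  return args;
}

console.log(buildInjectArgs("/path/to/video.mp4", { mode: "highlight", start: 60, end: 120 }));
// [ '/path/to/video.mp4', '--mode', 'highlight', '--start', '60', '--end', '120' ]
```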

Diagnostics

| Error | Cause | Action |
| --- | --- | --- |
| `Input file not found` | File missing or dropped by channel media size limit | Ask the user to share the file path directly as text |
| `corrupt, incomplete, or unsupported format` | Damaged file, interrupted transfer, or unsupported codec | Try a different file, or use `--start`/`--end` to skip problematic sections |
| `moov atom not found` | Incomplete mp4 (streaming not finished) | Retry with a complete file |
| `ffmpeg not found` | ffmpeg not installed | Check ffmpeg installation |
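The table above can be treated as a lookup from error text to next step. This sketch assumes each error string appears as a substring of the stderr output; suggestAction is a hypothetical name.

```javascript
// Rows copied from the Diagnostics table: error substring -> suggested action.
const DIAGNOSTICS = [
  { match: "Input file not found", action: "Ask the user to share the file path directly as text" },
  { match: "corrupt, incomplete, or unsupported format", action: "Try a different file, or use --start/--end to skip problematic sections" },
  { match: "moov atom not found", action: "Retry with a complete file" },
  { match: "ffmpeg not found", action: "Check ffmpeg installation" },
];

// Return the action for the first matching row, or null if nothing matches.
function suggestAction(stderrText) {
  const text = stderrText.toLowerCase();
  const row = DIAGNOSTICS.find((d) => text.includes(d.match.toLowerCase()));
  return row ? row.action : null;
}

console.log(suggestAction("ERROR: moov atom not found")); // Retry with a complete file
```

A null result means the error is not in the table, in which case the Hint: text is the only guidance available.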

Notes

Frame count and cell size are determined automatically from video duration and aspect ratio
Grid is ~1500×1500px, cell long side 384–512px
Timestamps are in the description XML only, not overlaid on the image
Portrait and landscape videos both supported
Telegram users: if a video file is not attached to the message, check channels.telegram.mediaMaxMb in the OpenClaw config — the file may have been dropped at the channel level before reaching the agent
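As a sanity check on the grid numbers, the layout object from the sample output can be multiplied out. The sketch assumes cells are tiled edge to edge with no padding, which is not confirmed here; gridSize is a hypothetical helper.

```javascript
// Derive overall grid dimensions from the "layout" field of the JSON output.
// Assumption: no gaps or borders between cells.
function gridSize({ cols, rows, cellW, cellH }) {
  return {
    width: cols * cellW,
    height: rows * cellH,
    cellLongSide: Math.max(cellW, cellH),
  };
}

console.log(gridSize({ cols: 4, rows: 7, cellW: 384, cellH: 216 }));
// { width: 1536, height: 1512, cellLongSide: 384 }
```

For the sample layout this gives a 1536×1512 grid with a 384px cell long side, in line with the figures in the notes.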

Usage Guidance

This skill appears to do exactly what it says: it needs node and ffmpeg, runs a local script that extracts frames, writes a grid image to /tmp, and outputs JSON for injection into a vision-capable model. Before installing:

1) Be aware that npm install will fetch the llm-frames package from the public registry. Review that package (and the integrity hash in package-lock.json) if you have supply-chain concerns.
2) The grid image is written to the system tmpdir and may be readable by other local users on shared systems. Delete sensitive files after use.
3) The README mentions future audio transcription, but the included code does not perform network calls or transcription today.
4) Run the skill in an isolated environment if you will process highly sensitive video.
5) Ensure your model and platform correctly handle injected images (the 'read' tool will place the JPEG into the LLM context).
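For the advice about deleting sensitive files, a cleanup step might look like the sketch below. removeGridFile is a hypothetical helper, not something the skill ships. It refuses to delete anything outside the directory reported by os.tmpdir(), which may differ from a literal /tmp on some systems.

```javascript
// Delete a grid image after use, but only if it lives under the system tmpdir.
// Returns true if the path was accepted, false if it was refused.
async function removeGridFile(gridPath) {
  // Dynamic imports keep this sketch loadable as either ESM or CommonJS.
  const { rm } = await import("node:fs/promises");
  const { tmpdir } = await import("node:os");
  const path = await import("node:path");

  const rel = path.relative(path.resolve(tmpdir()), path.resolve(gridPath));
  const outside =
    rel === "" || rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel);
  if (outside) return false;

  // force: true means a file that is already gone is not treated as an error.
  await rm(gridPath, { force: true });
  return true;
}
```

It would be called with the gridPath value from the JSON output, after the image has been read into context.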

Capability Analysis

Type: OpenClaw Skill
Name: see-video
Version: 1.0.0

The 'see-video' skill is a legitimate utility designed to extract video frames into a grid for multimodal LLM analysis. The core logic in `scripts/inject.mjs` uses the `llm-frames` library to process video files and safely writes the resulting image to a temporary directory using randomized filenames. The instructions in `SKILL.md` and `README.md` are consistent with the stated purpose, providing clear guidance for the agent without any signs of prompt injection, data exfiltration, or malicious execution.

Capability Tags

crypto

Capability Assessment

✓ Purpose & Capability

Name/description require ffmpeg/node and the packaged script uses ffmpeg (via the llm-frames npm library) to extract frames and produce a JPEG grid — the declared binaries and npm dependency align with this purpose. No unrelated credentials or unusual tools are requested.

✓ Instruction Scope

SKILL.md instructs running the provided node script, parsing its JSON output, and using the platform 'read' tool to inject the produced jpg. The script only reads the provided video file, checks its size, extracts frames, writes a single grid JPEG to the system tmpdir, and emits metadata — it does not access other files, environment variables, or external endpoints.

ℹ Install Mechanism

Install is standard: npm install (pulls llm-frames from public npm) and an optional brew/apt ffmpeg install. This is expected for the task but carries normal supply-chain risk from an npm dependency; the package-lock includes an integrity hash for llm-frames.

✓ Credentials

No environment variables, secrets, or external credentials are requested. The skill does not require unrelated permissions or configuration paths.

✓ Persistence & Privilege

always:false and the skill does not attempt to modify other skills or global agent settings. It writes ephemeral output to the OS tmpdir (one JPEG per run) and exits; no background services or persistent privileges are requested.

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install see-video
After installation, invoke the skill by name or use /see-video
Provide required inputs per the skill's parameter spec and get structured output

Version History

v1.0.0

Initial release: video frame extraction for multimodal LLM context injection

Metadata

Slug see-video

Version 1.0.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is see-video?

see-video is an AI Agent Skill for Claude Code / OpenClaw. Use it when the user sends a video file or asks about video content: it extracts frames and injects them as an image grid directly into the LLM context (no proxy...). It has 105 downloads so far.

How do I install see-video?

Run "/install see-video" in the OpenClaw or Claude Code chat to install it. Before first use, run npm install in the skill directory and make sure ffmpeg is installed.

Is see-video free?

Yes, see-video is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does see-video support?

see-video is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created see-video?

It is built and maintained by john-ver (@john-ver); the current version is v1.0.0.
