john-ver

see-video

by john-ver · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ✓ Security Clean
Downloads: 105 · Stars: 0 · Active Installs: 0 · Versions: 1
Install in OpenClaw
/install see-video
Description
Use when the user sends a video file or asks about video content. Extracts frames and injects them as an image grid directly into the LLM context — no proxy...
README (SKILL.md)

see-video

Extract frames from a video and inject them as a grid image + XML timestamps into LLM context.

Setup (first time only)

cd <skill directory>
npm install

Usage

node {baseDir}/scripts/inject.mjs <video_path> [--mode uniform|highlight] [--start N] [--end N]

On success, outputs JSON to stdout:

{
  "gridPath": "/tmp/video_llm-frames.jpg",
  "description": "<video_frames>...</video_frames>",
  "duration": 1326,
  "frameCount": 28,
  "layout": { "cols": 4, "rows": 7, "cellW": 384, "cellH": 216 },
  "videoWidth": 854,
  "videoHeight": 480,
  "inputSizeMb": 42.3
}
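
A caller usually wants to check this JSON before relying on it. The following is a minimal sketch, not part of the skill: parseInjectOutput is a hypothetical helper, the field names are taken from the sample above, and the choice of which fields to treat as required is an assumption.

```javascript
// Sketch: validate the JSON that inject.mjs prints on success.
// parseInjectOutput is a hypothetical helper; which fields count as
// "required" is an assumption, not something the skill documents.
function parseInjectOutput(stdout) {
  const result = JSON.parse(stdout);
  for (const key of ["gridPath", "description"]) {
    if (typeof result[key] !== "string" || result[key].length === 0) {
      throw new Error(`inject.mjs output is missing "${key}"`);
    }
  }
  const { cols, rows } = result.layout ?? {};
  if (!Number.isInteger(cols) || !Number.isInteger(rows)) {
    throw new Error("inject.mjs output is missing layout.cols / layout.rows");
  }
  return result;
}

// The sample output shown above, reused as test input.
const sample = `{
  "gridPath": "/tmp/video_llm-frames.jpg",
  "description": "<video_frames>...</video_frames>",
  "duration": 1326,
  "frameCount": 28,
  "layout": { "cols": 4, "rows": 7, "cellW": 384, "cellH": 216 },
  "videoWidth": 854,
  "videoHeight": 480,
  "inputSizeMb": 42.3
}`;

const parsed = parseInjectOutput(sample);
console.log(parsed.gridPath); // /tmp/video_llm-frames.jpg
console.log(parsed.layout.cols * parsed.layout.rows); // 28
```

In this sample the number of grid cells (4 × 7) equals frameCount; whether that holds for every video is not stated in the README.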

If the video exceeds 10 minutes and uniform mode was used without --start/--end, a hint field is included:

{
  "hint": "Video is 30 minutes long. This is a uniform overview. For better scene coverage re-run with --mode highlight, or use --start/--end to zoom into a specific section."
}

Recommended workflow for long videos:

  1. First run with --mode highlight — shows key scene changes across the whole video
  2. If the user wants detail on a specific section, re-run with --start N --end N
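
The two passes above can be sketched as argument lists for the script. This is an illustration only: buildArgs is a hypothetical helper, the script path is a placeholder for {baseDir}/scripts/inject.mjs, and the 600 to 720 second range is an arbitrary example. Only the flags --mode, --start, and --end come from the README.

```javascript
// Sketch: build the argument list for inject.mjs from the documented flags.
// buildArgs is a hypothetical helper, not part of the skill.
function buildArgs(scriptPath, videoPath, { mode, start, end } = {}) {
  const args = [scriptPath, videoPath];
  if (mode !== undefined) {
    if (mode !== "uniform" && mode !== "highlight") {
      throw new Error(`unknown mode: ${mode}`);
    }
    args.push("--mode", mode);
  }
  if (start !== undefined) args.push("--start", String(start));
  if (end !== undefined) args.push("--end", String(end));
  return args;
}

// Pass 1: scene-change overview of the whole video.
const overview = buildArgs("scripts/inject.mjs", "/path/to/video.mp4", {
  mode: "highlight",
});

// Pass 2: zoom into one section (the range here is an arbitrary example).
const zoom = buildArgs("scripts/inject.mjs", "/path/to/video.mp4", {
  start: 600,
  end: 720,
});

console.log(["node", ...overview].join(" "));
console.log(["node", ...zoom].join(" "));
```

Returning an array rather than a single command string keeps paths with spaces intact when the list is handed to a process launcher that takes arguments separately.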

On error, the script writes ERROR: <message> and Hint: <diagnosis> to stderr and exits with code 1.
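
To separate the message from the hint, something like the following could work. It is a sketch under one loud assumption: that ERROR: and Hint: each begin their own line of stderr. The README does not specify the exact layout, and both parseInjectError and the sample text are made up for illustration.

```javascript
// Sketch: split the script's stderr into its message and hint parts.
// Assumption: "ERROR:" and "Hint:" each start a line. The real layout
// is not documented, so treat this as a starting point only.
function parseInjectError(stderr) {
  const message = /^ERROR:[ \t]*(.*)$/m.exec(stderr)?.[1].trim() ?? null;
  const hint = /^Hint:[ \t]*(.*)$/m.exec(stderr)?.[1].trim() ?? null;
  return { message, hint };
}

// Illustrative stderr text, composed by hand rather than captured from a run.
const sampleStderr =
  "ERROR: Input file not found\n" +
  "Hint: File missing or dropped by channel media size limit\n";

const failure = parseInjectError(sampleStderr);
console.log(failure.message); // Input file not found
console.log(failure.hint); // File missing or dropped by channel media size limit
```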

Injection procedure

Step 1 — Run the script (bash tool):

node {baseDir}/scripts/inject.mjs "/path/to/video.mp4"

Step 2 — Parse JSON: Extract gridPath and description.

Step 3 — Inject image (read tool):

read <gridPath>

The read tool injects the jpg as a native multimodal image block into context. After viewing the grid, use the description XML timestamps to reference frames:

"Look at the grid image above. Use the timestamps in the description XML to analyze the video. The number in the top-left of each cell is the frame index."

On error:

  • Translate the Hint: message into natural language for the user. Do not paste raw error output.
  • If read <gridPath> fails, the file has probably been cleaned up: /tmp/ files are ephemeral. Re-run the script and read the result immediately.
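
The three steps and the retry rule can be sketched as one function. This is not how the agent is implemented: runScript and readImage are hypothetical stand-ins for the bash and read tools, passed in as callbacks so the flow can be exercised with fakes.

```javascript
// Sketch of the injection procedure. runScript and readImage are
// hypothetical stand-ins for the agent's bash and read tools; they are
// not real OpenClaw APIs.
async function injectVideo(videoPath, tools, retries = 1) {
  // Step 1: run inject.mjs. runScript resolves to { code, stdout, stderr }.
  const { code, stdout, stderr } = await tools.runScript(videoPath);
  if (code !== 0) {
    // The caller should translate the Hint for the user, not paste this raw.
    throw new Error(`inject.mjs failed: ${stderr.trim()}`);
  }

  // Step 2: parse the JSON and pick out the two fields the agent needs.
  const { gridPath, description } = JSON.parse(stdout);

  // Step 3: read the grid image right away, since /tmp/ files are ephemeral.
  try {
    const image = await tools.readImage(gridPath);
    return { image, description };
  } catch (err) {
    if (retries <= 0) throw err;
    // The file vanished: re-run the script and read again immediately.
    return injectVideo(videoPath, tools, retries - 1);
  }
}

// Fake tools for demonstration: the first read fails, the second succeeds.
let reads = 0;
const fakeTools = {
  runScript: async () => ({
    code: 0,
    stdout: JSON.stringify({
      gridPath: "/tmp/video_llm-frames.jpg",
      description: "<video_frames>...</video_frames>",
    }),
    stderr: "",
  }),
  readImage: async (path) => {
    reads += 1;
    if (reads === 1) throw new Error("file vanished");
    return `image:${path}`;
  },
};

const injected = injectVideo("/path/to/video.mp4", fakeTools);
injected.then((r) => console.log(r.image, reads));
```

The retry count defaults to one so that a persistently failing read surfaces as an error instead of looping.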

Options

| Option | Default | Description |
| --- | --- | --- |
| --mode | uniform | Evenly spaced frames |
| --mode highlight | - | Scene-change biased sampling |
| --start N | 0 | Segment start (seconds) |
| --end N | end of video | Segment end (seconds) |

Diagnostics

| Error | Cause | Action |
| --- | --- | --- |
| Input file not found | File missing or dropped by channel media size limit | Ask the user to share the file path directly as text |
| corrupt, incomplete, or unsupported format | Damaged file, interrupted transfer, or unsupported codec | Try a different file, or use --start/--end to skip problematic sections |
| moov atom not found | Incomplete mp4 (streaming not finished) | Retry with a complete file |
| ffmpeg not found | ffmpeg not installed | Check ffmpeg installation |
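
The table can be turned into a small lookup. This is a sketch: suggestAction is a hypothetical helper, and matching by case-insensitive substring is an assumption, since the script's exact error wording may differ from the table entries.

```javascript
// Sketch: map an error string to the action suggested in the table above.
// Substring matching is an assumption; the script's real error wording
// may not match these entries exactly.
const DIAGNOSTICS = [
  {
    match: "Input file not found",
    action: "Ask the user to share the file path directly as text",
  },
  {
    match: "corrupt, incomplete, or unsupported format",
    action:
      "Try a different file, or use --start/--end to skip problematic sections",
  },
  { match: "moov atom not found", action: "Retry with a complete file" },
  { match: "ffmpeg not found", action: "Check ffmpeg installation" },
];

function suggestAction(errorText) {
  const lower = errorText.toLowerCase();
  const row = DIAGNOSTICS.find(({ match }) =>
    lower.includes(match.toLowerCase())
  );
  return row ? row.action : null;
}

console.log(suggestAction("ERROR: moov atom not found"));
// Retry with a complete file
console.log(suggestAction("ERROR: something else")); // null
```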

Notes

  • Frame count and cell size are determined automatically from video duration and aspect ratio
  • Grid is ~1500×1500px, cell long side 384–512px
  • Timestamps are in the description XML only, not overlaid on the image
  • Portrait and landscape videos both supported
  • Telegram users: if a video file is not attached to the message, check channels.telegram.mediaMaxMb in the OpenClaw config — the file may have been dropped at the channel level before reaching the agent
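
The grid-size note can be checked against the sample layout from the Usage section (4 columns, 7 rows, 384×216 cells). The sketch below also locates a cell by frame index, under two assumptions the README does not confirm: cells are filled row by row from the top left, and indices start at 0. cellRect is a hypothetical helper.

```javascript
// Sketch: pixel rectangle of one cell in the grid image.
// Assumptions (not confirmed by the README): cells are filled row by row,
// left to right, and frame indices start at 0.
function cellRect(layout, index) {
  const { cols, rows, cellW, cellH } = layout;
  if (!Number.isInteger(index) || index < 0 || index >= cols * rows) {
    throw new RangeError(
      `frame index ${index} is outside the ${cols}x${rows} grid`
    );
  }
  const col = index % cols;
  const row = Math.floor(index / cols);
  return { x: col * cellW, y: row * cellH, width: cellW, height: cellH };
}

// Layout values from the sample JSON output.
const layout = { cols: 4, rows: 7, cellW: 384, cellH: 216 };

// Overall grid size: 1536 x 1512, close to the ~1500x1500 target.
const gridW = layout.cols * layout.cellW;
const gridH = layout.rows * layout.cellH;
console.log(gridW, gridH); // 1536 1512

// Frame 5 sits in the second row, second column.
const rect = cellRect(layout, 5);
console.log(rect); // { x: 384, y: 216, width: 384, height: 216 }
```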
Usage Guidance
This skill appears to do exactly what it says: it needs node and ffmpeg, runs a local script that extracts frames, writes a grid image to /tmp, and outputs JSON for injection into a vision-capable model. Before installing:

  1. Be aware that npm install will fetch the llm-frames package from the public registry. Review that package (and the integrity hash in package-lock.json) if you have supply-chain concerns.
  2. The grid image is written to the system tmpdir and may be readable by other local users on shared systems. Delete sensitive files after use.
  3. The README mentions future audio transcription, but the included code does not perform network calls or transcription today.
  4. Run the skill in an isolated environment if you will process highly sensitive video.
  5. Ensure your model and platform correctly handle injected images (the 'read' tool will place the JPEG into the LLM context).
Capability Analysis
Type: OpenClaw Skill
Name: see-video
Version: 1.0.0

The 'see-video' skill is a legitimate utility designed to extract video frames into a grid for multimodal LLM analysis. The core logic in `scripts/inject.mjs` uses the `llm-frames` library to process video files and safely writes the resulting image to a temporary directory using randomized filenames. The instructions in `SKILL.md` and `README.md` are consistent with the stated purpose, providing clear guidance for the agent without any signs of prompt injection, data exfiltration, or malicious execution.
Capability Tags
crypto
Capability Assessment
Purpose & Capability
Name/description require ffmpeg/node and the packaged script uses ffmpeg (via the llm-frames npm library) to extract frames and produce a JPEG grid — the declared binaries and npm dependency align with this purpose. No unrelated credentials or unusual tools are requested.
Instruction Scope
SKILL.md instructs running the provided node script, parsing its JSON output, and using the platform 'read' tool to inject the produced jpg. The script only reads the provided video file, checks its size, extracts frames, writes a single grid JPEG to the system tmpdir, and emits metadata — it does not access other files, environment variables, or external endpoints.
Install Mechanism
Install is standard: npm install (pulls llm-frames from public npm) and an optional brew/apt ffmpeg install. This is expected for the task but carries normal supply-chain risk from an npm dependency; the package-lock includes an integrity hash for llm-frames.
Credentials
No environment variables, secrets, or external credentials are requested. The skill does not require unrelated permissions or configuration paths.
Persistence & Privilege
always:false and the skill does not attempt to modify other skills or global agent settings. It writes ephemeral output to the OS tmpdir (one JPEG per run) and exits; no background services or persistent privileges are requested.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install see-video
  3. After installation, invoke the skill by name or use /see-video
  4. Provide the video path, plus --mode, --start, or --end if needed; the script returns JSON describing the frame grid
Version History
v1.0.0
Initial release: video frame extraction for multimodal LLM context injection
Metadata
Slug: see-video
Version: 1.0.0
License: MIT-0
All-time Installs: 0
Active Installs: 0
Total Versions: 1
Frequently Asked Questions

What is see-video?

see-video is an AI agent skill for Claude Code / OpenClaw, intended for when the user sends a video file or asks about video content. It extracts frames and injects them as an image grid directly into the LLM context. It has 105 downloads so far.

How do I install see-video?

Run "/install see-video" in the OpenClaw or Claude Code chat. Per the README, first-time setup also requires running npm install in the skill directory, and ffmpeg must be installed.

Is see-video free?

Yes, see-video is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does see-video support?

see-video is cross-platform and runs anywhere OpenClaw / Claude Code is available, provided node and ffmpeg are installed.

Who created see-video?

It is built and maintained by john-ver (@john-ver); the current version is v1.0.0.
