
see-video

Name: see-video
Author: john-ver

By john-ver · GitHub · v1.0.0 · MIT-0

cross-platform ✓ Security check passed

Total downloads: 105

Install in OpenClaw

/install see-video

Description

Use when the user sends a video file or asks about video content. Extracts frames and injects them as an image grid directly into the LLM context — no proxy...

Usage instructions (SKILL.md)

see-video

Extract frames from a video and inject them into the LLM context as a grid image plus XML timestamps.

Setup (first time only)

cd <skill directory>
npm install

Usage

node {baseDir}/scripts/inject.mjs <video_path> [--mode uniform|highlight] [--start N] [--end N]

On success, outputs JSON to stdout:

{
  "gridPath": "/tmp/video_llm-frames.jpg",
  "description": "<video_frames>...</video_frames>",
  "duration": 1326,
  "frameCount": 28,
  "layout": { "cols": 4, "rows": 7, "cellW": 384, "cellH": 216 },
  "videoWidth": 854,
  "videoHeight": 480,
  "inputSizeMb": 42.3
}

If the video exceeds 10 minutes and uniform mode was used without --start/--end, a hint field is included:

{
  "hint": "Video is 30 minutes long. This is a uniform overview. For better scene coverage re-run with --mode highlight, or use --start/--end to zoom into a specific section."
}
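The condition for the hint field can be restated as a small predicate. This is a minimal sketch, not code from inject.mjs: it assumes duration is reported in seconds (the sample above shows 1326), that "exceeds 10 minutes" means strictly more than 600 seconds, and that passing either --start or --end suppresses the hint. The name shouldIncludeHint is hypothetical.

```javascript
// Hypothetical restatement of the documented rule for the "hint" field.
// Assumes duration is in seconds; not taken from scripts/inject.mjs.
const LONG_VIDEO_SECONDS = 10 * 60;

function shouldIncludeHint({ duration, mode = "uniform", start, end }) {
  const isLong = duration > LONG_VIDEO_SECONDS;
  const isUniform = mode === "uniform"; // uniform is the default mode
  // Assumption: either bound counts as zooming in, so no hint is emitted.
  const hasSegment = start !== undefined || end !== undefined;
  return isLong && isUniform && !hasSegment;
}

console.log(shouldIncludeHint({ duration: 1326 }));                      // true
console.log(shouldIncludeHint({ duration: 1326, mode: "highlight" }));   // false
console.log(shouldIncludeHint({ duration: 1326, start: 60, end: 120 })); // false
console.log(shouldIncludeHint({ duration: 300 }));                       // false
```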

Recommended workflow for long videos:

First run with --mode highlight to see key scene changes across the whole video
If the user wants detail on a specific section, re-run with --start N --end N
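The two-pass workflow can be expressed as two argument lists for the same script. This sketch uses only the flags documented here (--mode, --start, --end); the helper buildInjectArgs and the base directory /skills/see-video are hypothetical placeholders for {baseDir}.

```javascript
// Hypothetical helper that assembles argv for scripts/inject.mjs using only
// the documented flags. baseDir stands in for {baseDir}.
function buildInjectArgs(baseDir, videoPath, { mode, start, end } = {}) {
  const args = [`${baseDir}/scripts/inject.mjs`, videoPath];
  if (mode !== undefined) args.push("--mode", mode);
  if (start !== undefined) args.push("--start", String(start));
  if (end !== undefined) args.push("--end", String(end));
  return args;
}

// Pass 1: scene-change overview of the whole video.
const overview = buildInjectArgs("/skills/see-video", "/path/to/video.mp4", { mode: "highlight" });
// Pass 2: zoom into 5:00-6:30 (300-390 s) once the user asks about that part.
const zoom = buildInjectArgs("/skills/see-video", "/path/to/video.mp4", { start: 300, end: 390 });

console.log(["node", ...overview].join(" "));
console.log(["node", ...zoom].join(" "));
```

Keeping the arguments as an array, rather than one shell string, avoids quoting problems when the video path contains spaces.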

On error, writes ERROR: <message> and Hint: <diagnosis> to stderr and exits with code 1.
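After checking for exit code 1, the two stderr fields can be split apart before translating the hint for the user. This is a sketch that assumes ERROR: and Hint: each start their own line; the name parseInjectError is hypothetical, and the sample string below is illustrative, not real script output.

```javascript
// Sketch: split the documented "ERROR: <message>" / "Hint: <diagnosis>"
// stderr output into fields. Assumes each prefix starts its own line.
function parseInjectError(stderr) {
  let error = null;
  let hint = null;
  for (const rawLine of stderr.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line.startsWith("ERROR:")) error = line.slice("ERROR:".length).trim();
    else if (line.startsWith("Hint:")) hint = line.slice("Hint:".length).trim();
  }
  return { error, hint };
}

// Illustrative input only; the Hint text is a made-up placeholder.
const sample = "ERROR: Input file not found\nHint: example diagnosis text\n";
console.log(parseInjectError(sample));
```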

Injection procedure

Step 1 — Run the script (bash tool):

node {baseDir}/scripts/inject.mjs "/path/to/video.mp4"

Step 2 — Parse JSON: Extract gridPath and description.
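Step 2 amounts to a JSON.parse plus a check that the two fields the agent needs are present. The field names below come from the sample output under Usage; the function name parseInjectOutput and the error wording are assumptions.

```javascript
// Sketch of Step 2: parse the script's stdout and keep the fields the agent
// needs. Field names follow the sample JSON shown under "Usage".
function parseInjectOutput(stdout) {
  const result = JSON.parse(stdout);
  if (typeof result.gridPath !== "string" || typeof result.description !== "string") {
    throw new Error("inject.mjs output is missing gridPath or description");
  }
  return {
    gridPath: result.gridPath,
    description: result.description,
    hint: result.hint ?? null, // only present for long uniform runs
  };
}

const stdout = JSON.stringify({
  gridPath: "/tmp/video_llm-frames.jpg",
  description: "<video_frames>...</video_frames>",
  duration: 1326,
  frameCount: 28,
});
const { gridPath, description, hint } = parseInjectOutput(stdout);
console.log(gridPath, description, hint);
```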

Step 3 — Inject image (read tool):

read <gridPath>

The read tool injects the JPEG into the context as a native multimodal image block. After viewing the grid, use the timestamps in the description XML to reference frames:

"Look at the grid image above. Use the timestamps in the description XML to analyze the video. The number in the top-left of each cell is the frame index."
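To point at a specific frame, its cell can be located from the layout object in the JSON output. This sketch assumes row-major order (left to right, then top to bottom), cells packed with no padding, and labels starting at base (0 unless the grid's labels start at 1); none of that is stated by the skill, and cellForFrame is a hypothetical name.

```javascript
// Hypothetical helper: locate a frame's cell from the "layout" object.
// Assumes row-major order, no padding, and labels starting at `base`.
function cellForFrame(frameIndex, layout, base = 0) {
  const i = frameIndex - base;
  if (i < 0 || i >= layout.cols * layout.rows) {
    throw new RangeError(`frame ${frameIndex} is outside the grid`);
  }
  const row = Math.floor(i / layout.cols);
  const col = i % layout.cols;
  return { row, col, x: col * layout.cellW, y: row * layout.cellH };
}

const layout = { cols: 4, rows: 7, cellW: 384, cellH: 216 };
console.log(cellForFrame(5, layout)); // { row: 1, col: 1, x: 384, y: 216 }
```

The bound check uses cols × rows; if frameCount is smaller, the trailing cells are simply empty.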

On error:

Translate the Hint: message into natural language for the user. Do not paste raw error output.
If read <gridPath> fails, the file has probably been cleaned up: /tmp/ files are ephemeral. Re-run the script and read the grid immediately.
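The recovery step is "run, then read right away, and retry once". In this sketch the bash and read tools are replaced by injected callbacks so the control flow can be shown without touching the filesystem; readGridWithRetry, fakeRun and fakeRead are hypothetical, and the two-attempt limit is an assumption.

```javascript
// Sketch of "re-run the script and read immediately". runScript and readGrid
// are hypothetical stand-ins for the agent's bash and read tools.
function readGridWithRetry(runScript, readGrid, maxAttempts = 2) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const { gridPath } = runScript(); // each run writes a fresh grid file
    try {
      return readGrid(gridPath); // read before tmp cleanup can remove it
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

let runs = 0;
const fakeRun = () => ({ gridPath: `/tmp/grid-${++runs}.jpg` });
const fakeRead = (p) => {
  if (p === "/tmp/grid-1.jpg") throw new Error("ENOENT"); // first file already gone
  return `image block for ${p}`;
};
console.log(readGridWithRetry(fakeRun, fakeRead)); // image block for /tmp/grid-2.jpg
```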

Options

Option	Default	Description
`--mode uniform`	✅	Evenly spaced frames
`--mode highlight`		Scene-change biased sampling
`--start N`	`0`	Segment start (seconds)
`--end N`	end of video	Segment end (seconds)
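The defaults in the table can be applied in one place before calling the script. This is a sketch only: the validation rules (known mode names, start before end, clamping --end to the video length) are assumptions, not checks confirmed in inject.mjs, and resolveOptions is a hypothetical name.

```javascript
// Sketch: apply the Options-table defaults and sanity-check the segment.
// Validation and clamping rules are assumptions, not the script's own logic.
const MODES = new Set(["uniform", "highlight"]);

function resolveOptions({ mode = "uniform", start = 0, end } = {}, durationSeconds) {
  if (!MODES.has(mode)) throw new Error(`unknown --mode ${mode}`);
  // Default --end is the end of the video; larger values are clamped.
  const resolvedEnd = Math.min(end ?? durationSeconds, durationSeconds);
  if (!(start >= 0 && start < resolvedEnd)) {
    throw new RangeError(`invalid segment ${start}-${resolvedEnd}s`);
  }
  return { mode, start, end: resolvedEnd };
}

console.log(resolveOptions({}, 1326)); // { mode: 'uniform', start: 0, end: 1326 }
console.log(resolveOptions({ mode: "highlight", start: 300, end: 390 }, 1326));
```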

Diagnostics

Error	Cause	Action
`Input file not found`	File missing or dropped by channel media size limit	Ask the user to share the file path directly as text
`corrupt, incomplete, or unsupported format`	Damaged file, interrupted transfer, or unsupported codec	Try a different file, or use `--start`/`--end` to skip problematic sections
`moov atom not found`	Incomplete mp4 (streaming not finished)	Retry with a complete file
`ffmpeg not found`	ffmpeg not installed	Install ffmpeg (for example via brew or apt) and check that it is on PATH
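The Diagnostics table can be treated as a lookup from the ERROR message to a follow-up action. This sketch assumes the error patterns appear verbatim inside the message; the action strings are abridged from the table, and actionFor is a hypothetical name.

```javascript
// Sketch: map an ERROR message to the Diagnostics table's action column.
// Substring matching is an assumption about how messages are worded.
const DIAGNOSTICS = [
  ["Input file not found", "Ask the user to share the file path directly as text"],
  ["corrupt, incomplete, or unsupported format", "Try a different file, or use --start/--end to skip problematic sections"],
  ["moov atom not found", "Retry with a complete file"],
  ["ffmpeg not found", "Check the ffmpeg installation"],
];

function actionFor(errorMessage) {
  const match = DIAGNOSTICS.find(([pattern]) => errorMessage.includes(pattern));
  return match ? match[1] : null; // null: unknown error, rely on the Hint line
}

console.log(actionFor("moov atom not found")); // Retry with a complete file
```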

Notes

Frame count and cell size are determined automatically from video duration and aspect ratio
Grid is ~1500×1500px, cell long side 384–512px
Timestamps are in the description XML only, not overlaid on the image
Portrait and landscape videos both supported
Telegram users: if a video file is not attached to the message, check channels.telegram.mediaMaxMb in the OpenClaw config — the file may have been dropped at the channel level before reaching the agent
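As a worked check of these notes against the sample output under Usage: 4 columns of 384 px and 7 rows of 216 px give a 1536 × 1512 px grid, consistent with "~1500×1500px", and 4 × 7 = 28 cells matches frameCount. The sketch assumes cells are packed edge to edge with no borders; gridSize is a hypothetical name.

```javascript
// Worked check of the Notes using the sample "layout" from the Usage section.
// Assumes cells are packed edge to edge with no padding or borders.
function gridSize({ cols, rows, cellW, cellH }) {
  return { width: cols * cellW, height: rows * cellH, capacity: cols * rows };
}

const size = gridSize({ cols: 4, rows: 7, cellW: 384, cellH: 216 });
console.log(size); // { width: 1536, height: 1512, capacity: 28 }
```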

Security recommendations

This skill appears to do exactly what it says: it needs node and ffmpeg, runs a local script that extracts frames, writes a grid image to /tmp, and outputs JSON for injection into a vision-capable model. Before installing:
1) Be aware that npm install will fetch the llm-frames package from the public registry; review that package (and the integrity hash in package-lock.json) if you have supply-chain concerns.
2) The grid image is written to the system tmpdir and may be readable by other local users on shared systems; delete sensitive files after use.
3) The README mentions future audio transcription, but the included code does not perform network calls or transcription today.
4) Run the skill in an isolated environment if you will process highly sensitive video.
5) Ensure your model and platform correctly handle injected images (the 'read' tool will place the JPEG into the LLM context).

Analysis

Type: OpenClaw Skill
Name: see-video
Version: 1.0.0

The 'see-video' skill is a legitimate utility designed to extract video frames into a grid for multimodal LLM analysis. The core logic in `scripts/inject.mjs` uses the `llm-frames` library to process video files and safely writes the resulting image to a temporary directory using randomized filenames. The instructions in `SKILL.md` and `README.md` are consistent with the stated purpose, providing clear guidance for the agent without any signs of prompt injection, data exfiltration, or malicious execution.

Capability tags

crypto

Capability assessment

✓ Purpose & Capability

Name/description require ffmpeg/node and the packaged script uses ffmpeg (via the llm-frames npm library) to extract frames and produce a JPEG grid — the declared binaries and npm dependency align with this purpose. No unrelated credentials or unusual tools are requested.

✓ Instruction Scope

SKILL.md instructs running the provided node script, parsing its JSON output, and using the platform 'read' tool to inject the produced jpg. The script only reads the provided video file, checks its size, extracts frames, writes a single grid JPEG to the system tmpdir, and emits metadata — it does not access other files, environment variables, or external endpoints.

ℹ Install Mechanism

Install is standard: npm install (pulls llm-frames from public npm) and an optional brew/apt ffmpeg install. This is expected for the task but carries normal supply-chain risk from an npm dependency; the package-lock includes an integrity hash for llm-frames.

✓ Credentials

No environment variables, secrets, or external credentials are requested. The skill does not require unrelated permissions or configuration paths.

✓ Persistence & Privilege

The skill is configured with always:false and does not attempt to modify other skills or global agent settings. It writes ephemeral output to the OS tmpdir (one JPEG per run) and exits; no background services or persistent privileges are requested.

How to use

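Make sure OpenClaw is installed (local or Docker deployment)
In the chat box, enter the install command: /install see-video
After installation, invoke the skill by name or trigger it with /see-video
Provide the inputs described in the skill's parameter documentation to get structured output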

Version history

v1.0.0

Initial release: video frame extraction for multimodal LLM context injection

Metadata

Slug see-video

Version 1.0.0

License MIT-0

Total installs 0

Current installs 0

Versions 1

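FAQ

What is see-video?

Use when the user sends a video file or asks about video content. Extracts frames and injects them as an image grid directly into the LLM context — no proxy... It is an AI agent skill plugin for Claude Code / OpenClaw and has been downloaded 105 times so far.

How do I install see-video?

Run the command "/install see-video" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is required.

Is see-video free?

Yes. see-video is completely free and licensed under MIT-0, so you can download, install, and use it freely.

Which platforms does see-video support?

see-video is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed see-video?

It is developed and maintained by john-ver (@john-ver). The current version is v1.0.0.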
