Video Reader
by Qianke Meng · GitHub · v4.1.1 · MIT-0
122 Downloads · 1 Star · 0 Active Installs · 2 Versions
Install in OpenClaw
/install video-reader
Description
Tool-driven video question answering with frame extraction, sub-agent analysis, and audio transcription
Usage Guidance
This skill appears to implement a real video QA system, but there are important mismatches and operational risks you should consider before installing or running it:
- Credentials & env vars: The skill bundle and docs mention both local Whisper (faster-whisper) and remote transcription APIs, but videoarm_audio.py currently requires WHISPER_API_KEY (and will call a remote transcription endpoint) unless you run a local whisper server. Do NOT paste your OpenAI/Anthropic/Groq API keys into environment variables for this skill until you confirm whether the skill will use a local model or an external API. Prefer testing first with a non-sensitive account or an isolated environment.
- Missing declared requirements: The manifest declares no required binaries or env vars, yet the code needs ffmpeg (required), optionally yt-dlp (for downloads), and Python packages (opencv, faster-whisper). Install and test in a sandbox or VM and run videoarm-doctor to verify dependencies before giving the skill access to important files.
- Local file writes & cleanup: The skill creates ~/.videoarm and writes logs and cached videos; its cleaner can delete ~/.openclaw/workspace/tmp — that could remove other OpenClaw workspace files. If you care about other workspace data, avoid running the cleaner with broad flags or inspect the cleaner code first.
- Data exposure via sub-agents: The orchestrator spawns sub-agents and writes frame-grid images to workspace tmp for those sub-agents to read. If the video contains sensitive content you do not want shared with remote models, ensure the sub-agent/image tools operate locally and that no remote vision/transcription endpoints are configured.
What to do next:
1. Inspect videoarm_audio.py and videoarm_local_whisper to confirm whether transcription runs locally or requires an API key in your deployment.
2. Run videoarm-doctor in a safe environment to see what dependencies are missing.
3. If you must provide API keys, create scoped/test keys and run in an isolated account.
4. If you want to use only local models, confirm the local server path and disable WHISPER_API_KEY/BASE_URL.
5. Consider running the skill inside a disposable container/VM to validate behavior and filesystem changes before using it on your regular workstation.
Capability Analysis
Type: OpenClaw Skill
Name: video-reader
Version: 4.1.1
The VideoARM skill implements a sophisticated video analysis pipeline that utilizes several high-risk capabilities, including spawning sub-agents via `sessions_spawn` for image analysis and executing shell commands through `subprocess` for video processing with `ffmpeg` and `yt-dlp` (e.g., in `videoarm_cli/videoarm_audio.py` and `videoarm_cli/videoarm_download.py`). Notably, the package includes a local Whisper transcription service (`videoarm_local_whisper/server.py`) and a setup script (`videoarm_local_whisper/setup.py`) that establishes persistence on macOS by installing a `launchd` agent. While these features are plausibly necessary for the stated purpose of tool-driven video QA and are documented, the combination of persistence, local server execution, and broad sub-agent orchestration constitutes a significant security footprint without explicit safeguards against common risks like argument injection or unauthorized persistence.
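One common safeguard against the argument-injection risk mentioned above is to pass `ffmpeg` an argument list (never a shell string) and to resolve user-supplied paths to absolute form so they cannot be parsed as options. The sketch below is illustrative only: `build_ffmpeg_frame_cmd` is a hypothetical helper, not a function from the skill's code.

```python
from pathlib import Path


def build_ffmpeg_frame_cmd(video_path: str, timestamp_s: float, out_path: str) -> list[str]:
    """Build an ffmpeg argv list that extracts one frame (hypothetical helper)."""
    if timestamp_s < 0:
        raise ValueError("timestamp must be non-negative")
    # An absolute path starts with "/" (or a drive letter), so a file named
    # "-y.mp4" can no longer be mistaken for a command-line option.
    src = Path(video_path).resolve()
    dst = Path(out_path).resolve()
    return [
        "ffmpeg",
        "-nostdin",                    # do not read from the caller's stdin
        "-ss", f"{timestamp_s:.3f}",   # seek position in seconds
        "-i", str(src),
        "-frames:v", "1",              # write a single video frame
        str(dst),
    ]
```

The returned list would be handed to `subprocess.run(cmd, check=True)` without `shell=True`, so no shell ever interprets the file name.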
Capability Assessment
Purpose & Capability
The name/description (video question answering with frame extraction and transcription) matches the code and runtime instructions: tools for download, metadata, frame extraction, and audio transcription are present. However the skill metadata declares no required binaries or environment variables while the code clearly expects external binaries (ffmpeg, optional yt-dlp) and reads multiple environment variables (WHISPER_API_KEY, WHISPER_BASE_URL, WHISPER_MODEL, VISION_API_KEY/OPENAI_API_KEY, ANTHROPIC_API_KEY). That mismatch between declared requirements and actual dependencies is unexpected and should be resolved before trusting the skill.
Instruction Scope
SKILL.md confines runtime actions to video download/inspect/extract/transcribe and spawning sub-agents to analyze image grids; it instructs the agent to use /tmp/videoarm_memory.json as single source-of-truth and to spawn isolated sub-agents via sessions_spawn. It does not instruct reading arbitrary system files or exfiltration endpoints. The memory file usage and sub-agent dispatch are explicit and scoped to the skill's purpose.
Install Mechanism
There is no install spec in the skill manifest (instruction-only), but the bundle includes a full Python package (pyproject.toml, CLI scripts, requirements). That means the package will not be auto-installed by the platform; manual installation is required to get dependencies (opencv, faster-whisper, ffmpeg, yt-dlp). This is reasonable but increases the chance users will miss required system binaries or optional components. No suspicious remote download URLs or archive extraction were found in the install artifacts.
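Because nothing is installed automatically, a quick preflight check can show which of the dependencies named above are present. This is a minimal sketch that is independent of the skill's own `videoarm-doctor`; the Python import names (`cv2` for opencv, `faster_whisper` for faster-whisper) are assumptions based on how those packages are normally imported.

```python
import importlib.util
import shutil

REQUIRED_BINARIES = ["ffmpeg"]
OPTIONAL_BINARIES = ["yt-dlp"]
# Assumed import names: opencv is usually imported as cv2,
# faster-whisper as faster_whisper.
PYTHON_MODULES = ["cv2", "faster_whisper"]


def check_dependencies() -> dict[str, bool]:
    """Map each dependency name to True if it is found on this machine."""
    status: dict[str, bool] = {}
    for name in REQUIRED_BINARIES + OPTIONAL_BINARIES:
        # shutil.which returns None when the binary is not on PATH.
        status[name] = shutil.which(name) is not None
    for module in PYTHON_MODULES:
        # find_spec returns None when a top-level module is not installed.
        status[module] = importlib.util.find_spec(module) is not None
    return status


if __name__ == "__main__":
    for dep, found in check_dependencies().items():
        print(f"{dep}: {'ok' if found else 'MISSING'}")
```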
Credentials
The manifest lists no required environment variables or primary credential, yet the code and docs read/expect multiple credential-like env vars (WHISPER_API_KEY, WHISPER_BASE_URL, VISION_API_KEY/OPENAI_API_KEY, ANTHROPIC_API_KEY, HTTPS_PROXY, VIDEOARM_SESSION_ID). In particular, videoarm_audio.py currently requires WHISPER_API_KEY and will return an error if it is not set, contradicting README statements about local faster-whisper working without API keys. Asking for API keys or base URLs (and implicitly supporting OpenAI/Anthropic/Groq endpoints) is reasonable for optional cloud transcription/vision backends, but the skill's manifest does not declare these needs and the code will attempt network API calls when an API key/base URL is supplied — so do not provide secrets until you confirm which backend (local vs remote) will be used.
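Before running the skill, it can help to see which of these variables are already set in the current shell, without printing their values. The following sketch assumes only the variable names listed in this review; `audit_env` is a hypothetical helper.

```python
import os

# Variable names taken from this review; the skill may read others.
CREDENTIAL_VARS = [
    "WHISPER_API_KEY",
    "WHISPER_BASE_URL",
    "WHISPER_MODEL",
    "VISION_API_KEY",
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "HTTPS_PROXY",
    "VIDEOARM_SESSION_ID",
]


def audit_env(environ=None) -> dict[str, bool]:
    """Report which variables are set and non-empty, never their values."""
    env = os.environ if environ is None else environ
    return {name: bool(env.get(name)) for name in CREDENTIAL_VARS}


if __name__ == "__main__":
    for name, is_set in audit_env().items():
        print(f"{name}: {'set' if is_set else 'unset'}")
```

If a key such as `WHISPER_API_KEY` shows as set, the skill may use it for remote calls, so unset it in the test shell when only local transcription is intended.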
Persistence & Privilege
The skill writes logs and cache under ~/.videoarm and creates files under ~/.openclaw/workspace/tmp and /tmp/videoarm_memory.json. The provided cleaning tool (videoarm-clean) can delete files in ~/.openclaw/workspace/tmp and the VideoARM memory file; that may remove other workspace artifacts if run with broad arguments. The skill does not set always:true and does not modify other skills' configs, but its file I/O footprint in user home and OpenClaw workspace is significant and could affect other local agent state if cleaning tools are used carelessly.
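To see what a broad cleanup could touch, the workspace tmp directory can be listed first without deleting anything. This is a read-only sketch, not part of `videoarm-clean`; `list_cleanup_candidates` is a hypothetical helper and the directory path is the one named in this review.

```python
from pathlib import Path


def list_cleanup_candidates(tmp_dir: Path) -> list[Path]:
    """Return every file under tmp_dir, sorted, without deleting anything."""
    if not tmp_dir.is_dir():
        return []
    return sorted(p for p in tmp_dir.rglob("*") if p.is_file())


if __name__ == "__main__":
    workspace_tmp = Path.home() / ".openclaw" / "workspace" / "tmp"
    for path in list_cleanup_candidates(workspace_tmp):
        print(path)
```

Any file in the output that does not belong to VideoARM is a reason to back up the directory or narrow the cleaner's arguments before running it.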
How to Use
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat: /install video-reader
- After installation, invoke the skill by name or use /video-reader
- Provide required inputs per the skill's parameter spec and get structured output
Version History
v4.1.1
Fixed 3x2 grid layout with frame numbers, support multiple grids for 30/60+ frames
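As an illustration of how a fixed 3x2 layout leads to multiple grids, the sketch below splits a list of frame numbers into groups of six (one group per grid). It is an assumption about the batching logic, not the skill's actual code, and `split_into_grids` is a hypothetical name.

```python
def split_into_grids(frame_numbers: list[int], cells_per_grid: int = 6) -> list[list[int]]:
    """Chunk frame numbers into consecutive groups, one group per 3x2 grid."""
    return [
        frame_numbers[i:i + cells_per_grid]
        for i in range(0, len(frame_numbers), cells_per_grid)
    ]


grids = split_into_grids(list(range(30)))
print(len(grids))  # 30 frames at 6 per grid gives 5 grids
```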
v4.1.0
Summary: Major redesign—introduces a tool-driven orchestrator architecture for video question answering with strict memory management and sub-agent dispatch.
- Rebranded as "videoarm": a video QA orchestrator that never analyzes images directly, but dispatches sub-agents for all frame and audio analysis.
- Enforces strict read/write of a single memory file (`/tmp/videoarm_memory.json`) as the source of truth on each turn; forbids reliance on prior tool outputs in conversation history.
- Sub-agent pattern standardized: main agent extracts frames or audio, spawns stateless sub-agents for analysis (scene captions, targeted questions), and writes all results to memory.
- Improved reproducibility and architecture: tool outputs go only to memory, allowing context to be fully rebuilt each turn and enabling parallel or isolated sub-agent analysis.
- Clarified use of toolset: defined pipelines for downloading, metadata, frame extraction, audio transcription, and sub-agent dispatch for both visual and audio QA.
- Documented recommended strategies, decision-making, and detailed memory file structure for consistent, traceable workflows.
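The single-memory-file pattern described above can be sketched as a read, update, and atomic write cycle. The listing does not show the real memory file structure, so the keys used here are hypothetical and the helper names (`load_memory`, `save_memory`, `record_result`) are illustrative only.

```python
import json
import os
import tempfile
from pathlib import Path

MEMORY_PATH = Path("/tmp/videoarm_memory.json")


def load_memory(path: Path = MEMORY_PATH) -> dict:
    """Read the memory file, or return an empty dict if it does not exist."""
    if not path.exists():
        return {}
    with path.open("r", encoding="utf-8") as fh:
        return json.load(fh)


def save_memory(memory: dict, path: Path = MEMORY_PATH) -> None:
    """Write via a temp file in the same directory, then rename atomically."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(memory, fh, indent=2)
    os.replace(tmp_name, path)


def record_result(key: str, value, path: Path = MEMORY_PATH) -> dict:
    """Reload memory, store one result under key, and persist it."""
    memory = load_memory(path)
    memory[key] = value  # key names are hypothetical
    save_memory(memory, path)
    return memory
```

Reloading the file on every call, rather than caching it in the conversation, mirrors the rule that the memory file is the only source of truth on each turn.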
Frequently Asked Questions
What is Video Reader?
Tool-driven video question answering with frame extraction, sub-agent analysis, and audio transcription. It is an AI Agent Skill for Claude Code / OpenClaw, with 122 downloads so far.
How do I install Video Reader?
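Run "/install video-reader" in the OpenClaw or Claude Code chat. The manifest has no install spec, so the bundled Python package and its dependencies (ffmpeg, opencv, faster-whisper, and optionally yt-dlp) must be installed manually.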
Is Video Reader free?
Yes, Video Reader is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does Video Reader support?
Video Reader runs anywhere OpenClaw / Claude Code is available. The setup script for the bundled local Whisper service installs a launchd agent, which applies to macOS only.
Who created Video Reader?
It is built and maintained by Qianke Meng (@qiankemeng); the current version is v4.1.1.