Downloads: 178
Stars: 0
Active Installs: 0
Versions: 1
Install in OpenClaw
/install tom-video-understanding
Description
Local video comprehension skill. Use ffmpeg to extract audio and frames, FunASR for speech recognition, and qwen3-vl for image understanding.
Usage Guidance
This skill appears to do what it says: extract audio/frames and run local models. Before using it, ensure: (1) you have ffmpeg, conda, and Ollama installed and you trust the sources from which models will be downloaded (model downloads will use network and can be large); (2) if you plan to use the optional cloud LLM step, confirm which endpoint and credentials you'll use and avoid sending sensitive video/audio to unknown cloud services; (3) update the Windows-specific ModelScope cache path and any example mirrors (README's OLLAMA_BASE_URL example is a placeholder) to suit your environment; (4) verify Ollama's model provenance (qwen3-vl) and FunASR model identifiers before pulling. If the skill ever asked to read unrelated system files, request unrelated credentials, or included opaque download URLs or install scripts, treat it as suspicious and do not run without deeper review.
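The prerequisite check in point (1) can be automated before the skill is run. The following is a minimal sketch, not part of the skill bundle: the function name `missing_tools` is hypothetical, and the executable names (in particular `ollama` for the Ollama CLI) are assumptions about how the tools appear on PATH.

```python
import shutil
from typing import Callable, Iterable, Optional

# Tools named in the guidance above. The executable names are assumptions
# (e.g. the Ollama CLI is assumed to be on PATH as "ollama").
REQUIRED_TOOLS = ("ffmpeg", "conda", "ollama")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the names of the tools that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing prerequisites:", ", ".join(missing))
    else:
        print("All prerequisites found on PATH.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.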
Capability Analysis
Type: OpenClaw Skill
Name: tom-video-understanding
Version: 1.0.0
The skill bundle provides legitimate instructions for local video processing using ffmpeg, FunASR, and Ollama. While SKILL.md contains hardcoded environment paths (e.g., C:/Users/TOM/.cache/modelscope) and executes shell commands for media extraction, these actions are directly aligned with the stated purpose of video understanding and lack any indicators of malicious intent or data exfiltration.
Capability Assessment
Purpose & Capability
The name/description (local video comprehension) matches the instructions: ffmpeg for audio/frames extraction, FunASR for Chinese ASR, and qwen3-vl via Ollama for image understanding. None of the required actions or tools appear unrelated to the stated purpose.
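For the image-understanding step, a request to a local Ollama server might look like the sketch below. It assumes Ollama's standard `/api/generate` endpoint on the default local address and uses the `qwen3-vl` model name from the description; the helper names `build_frame_request` and `describe_frame` and the default prompt are hypothetical, and the actual SKILL.md may invoke the model differently.

```python
import base64
import json
import os
import urllib.request

# Assumption: Ollama listens on its default local address unless overridden.
OLLAMA_BASE_URL = os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434")


def build_frame_request(image_bytes: bytes, prompt: str, model: str = "qwen3-vl") -> dict:
    """Build the JSON body for a single-image, non-streaming generate call."""
    return {
        "model": model,
        "prompt": prompt,
        # Ollama expects images as base64-encoded strings.
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def describe_frame(image_path: str, prompt: str = "Describe this video frame.") -> str:
    """Send one extracted frame to the local model and return its text answer."""
    with open(image_path, "rb") as f:
        payload = build_frame_request(f.read(), prompt)
    request = urllib.request.Request(
        f"{OLLAMA_BASE_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Because the request targets localhost by default, frames stay on the machine unless `OLLAMA_BASE_URL` is pointed at a remote host.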
Instruction Scope
SKILL.md confines actions to extracting audio/frames, running FunASR in a conda env, and querying a local Ollama model. It does reference a specific ModelScope cache path (C:/Users/TOM/.cache/modelscope) and suggests copying files if paths contain Chinese characters — this is Windows- and user-specific and may need adjustment. The doc also allows optional "Summary/Analysis → Cloud LLM API (if needed)", which would send derived data externally if used; that is outside the local-only flow and should be considered separately.
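One way to adjust the hardcoded cache path is to derive it from the current user's home directory and to detect paths that contain non-ASCII characters. This is a sketch under assumptions: the helper names are hypothetical, and `~/.cache/modelscope` is chosen here only because it mirrors the layout of the path quoted above.

```python
import os
from pathlib import Path


def default_modelscope_cache(home=None) -> Path:
    """Return a per-user cache directory instead of a hardcoded one."""
    base = Path(home) if home is not None else Path.home()
    return base / ".cache" / "modelscope"


def has_non_ascii(path) -> bool:
    """True if the path contains characters outside ASCII (e.g. Chinese)."""
    return not str(path).isascii()


def configure_cache(env=os.environ, home=None) -> str:
    """Set MODELSCOPE_CACHE only if the user has not already set it."""
    env.setdefault("MODELSCOPE_CACHE", str(default_modelscope_cache(home)))
    return env["MODELSCOPE_CACHE"]
```

A caller could use `has_non_ascii(video_path)` to decide whether to copy the input to an ASCII-only location first, as the skill's instructions suggest.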
Install Mechanism
This is instruction-only with no install spec or packaged downloads in the skill itself. That reduces risk. Note: models (FunASR/ModelScope and qwen3-vl) and Ollama are expected to be pulled/downloaded at runtime by the user, which involves network activity and large binary downloads but is not performed by the skill bundle itself.
Credentials
The skill declares no required env vars or credentials. The instructions set MODELSCOPE_CACHE to a specific, user-named path (C:/Users/TOM/...), which is an implementation detail and not a request for secrets, but it may reveal or assume a specific user environment. The skill also mentions optionally using a cloud LLM for summaries; that would require credentials and configuration supplied by the user, which the skill itself does not request.
Persistence & Privilege
The skill does not request always-on presence and makes no claims to modify other skills or system-wide configs. It is user-invocable and can be run locally as-needed.
How to Use
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat: /install tom-video-understanding
- After installation, invoke the skill by name or use /tom-video-understanding
- Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
Initial release of the video-understanding skill.
- Enables local video content comprehension using ffmpeg, FunASR, and qwen3-vl.
- Extracts audio and key frames from videos via ffmpeg commands.
- Performs local Chinese speech recognition with FunASR.
- Provides detailed image understanding for video frames using qwen3-vl through Ollama.
- Outlines a step-by-step workflow and key prerequisites for setup and usage.
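The extraction step listed above can be sketched as two ffmpeg invocations. The exact commands in SKILL.md are not reproduced on this page, so the choices below are assumptions: 16 kHz mono WAV output for the speech recognizer, fixed-rate frame sampling (one frame per second) rather than true key-frame detection, and the hypothetical helper names `audio_cmd`, `frames_cmd`, and `extract`.

```python
import subprocess
from pathlib import Path


def audio_cmd(video, wav, sample_rate=16000):
    """ffmpeg arguments that drop the video stream and write mono WAV audio."""
    return ["ffmpeg", "-y", "-i", str(video),
            "-vn", "-ac", "1", "-ar", str(sample_rate), str(wav)]


def frames_cmd(video, out_dir, fps=1):
    """ffmpeg arguments that sample frames at a fixed rate into numbered JPEGs."""
    pattern = Path(out_dir) / "frame_%04d.jpg"
    return ["ffmpeg", "-y", "-i", str(video), "-vf", f"fps={fps}", str(pattern)]


def extract(video, work_dir):
    """Run both extraction commands; raises CalledProcessError if ffmpeg fails."""
    work = Path(work_dir)
    (work / "frames").mkdir(parents=True, exist_ok=True)
    subprocess.run(audio_cmd(video, work / "audio.wav"), check=True)
    subprocess.run(frames_cmd(video, work / "frames"), check=True)
```

Passing the arguments as a list, rather than a shell string, avoids quoting problems with file names that contain spaces or non-ASCII characters.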
Frequently Asked Questions
What is Tom Video Understanding?
Local video comprehension skill. Use ffmpeg to extract audio and frames, FunASR for speech recognition, and qwen3-vl for image understanding. It is an AI Agent Skill for Claude Code / OpenClaw, with 178 downloads so far.
How do I install Tom Video Understanding?
Run "/install tom-video-understanding" in the OpenClaw or Claude Code chat to install it in one step. The install itself needs no extra setup, but running the skill requires ffmpeg, conda, and Ollama on your machine.
Is Tom Video Understanding free?
Yes, Tom Video Understanding is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does Tom Video Understanding support?
Tom Video Understanding runs anywhere OpenClaw / Claude Code is available. Note that its instructions reference a Windows-specific ModelScope cache path, which may need adjusting on other systems.
Who created Tom Video Understanding?
It is built and maintained by TOMUIV (@tomuiv); the current version is v1.0.0.