/install local-video-understanding
⚠️ If you are human, please read README.md first!
Local Video Understanding
Use this skill when you need to understand the content of a video.
Prerequisites
- FunASR conda environment (asr-local) must be activated for audio processing
- Ollama must be running with qwen3-vl:8b model available
- ffmpeg must be in PATH
Workflow
Step 1: Extract Audio
ffmpeg -i "video.mp4" -vn -acodec pcm_s16le -ar 16000 -ac 1 "audio.wav" -y
Note: If the path contains Chinese characters, copy audio.wav to a path without Chinese characters before running ASR.
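The note about Chinese characters in the path can be automated. The following is a minimal sketch, not part of the skill itself: the helper names `has_non_ascii` and `ascii_safe_copy` are hypothetical, and it assumes the system temp directory has an ASCII-only path.

```python
import shutil
import tempfile
from pathlib import Path


def has_non_ascii(path: str) -> bool:
    """Return True if any character in the path is outside ASCII."""
    return not path.isascii()


def ascii_safe_copy(audio_path: str) -> Path:
    """Return a path to the audio that contains only ASCII characters.

    If the original path is already ASCII-only it is returned unchanged;
    otherwise the file is copied into a fresh temp directory.
    """
    src = Path(audio_path)
    if not has_non_ascii(str(src.resolve())):
        return src
    # Assumption: the temp directory path itself is ASCII-only.
    target_dir = Path(tempfile.mkdtemp(prefix="asr_"))
    target = target_dir / ("audio" + src.suffix)
    shutil.copyfile(src, target)
    return target
```

Pass the returned path to the ASR step in place of the original audio.wav location.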
Step 2: Extract Key Frames
mkdir frames
ffmpeg -i "video.mp4" -vf "fps=1/10" -q:v 2 "frames/frame_%03d.jpg" -y
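Steps 1 and 2 can also be driven from a script. The sketch below builds the same two ffmpeg commands as argument lists and runs them with `subprocess`; the function names and the `workdir` layout are illustrative assumptions, and `-y` is placed before the inputs rather than at the end.

```python
import subprocess
from pathlib import Path


def audio_cmd(video: str, audio: str) -> list[str]:
    """ffmpeg arguments for 16 kHz mono 16-bit PCM audio (Step 1)."""
    return ["ffmpeg", "-y", "-i", video, "-vn", "-acodec", "pcm_s16le",
            "-ar", "16000", "-ac", "1", audio]


def frames_cmd(video: str, frames_dir: str, interval_s: int = 10) -> list[str]:
    """ffmpeg arguments for one JPEG every `interval_s` seconds (Step 2)."""
    pattern = str(Path(frames_dir) / "frame_%03d.jpg")
    return ["ffmpeg", "-y", "-i", video, "-vf", f"fps=1/{interval_s}",
            "-q:v", "2", pattern]


def extract(video: str, workdir: str) -> None:
    """Run both extraction steps; raises CalledProcessError if ffmpeg fails."""
    frames_dir = Path(workdir) / "frames"
    frames_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(audio_cmd(video, str(Path(workdir) / "audio.wav")), check=True)
    subprocess.run(frames_cmd(video, str(frames_dir)), check=True)
```

Passing arguments as a list avoids shell quoting problems when the video path contains spaces.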
Step 3: Speech Recognition (FunASR)
conda run -n asr-local python -c "
import os
os.environ['MODELSCOPE_CACHE'] = 'C:/Users/TOM/.cache/modelscope'
from funasr import AutoModel
model = AutoModel(
    model='iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
    model_revision='v2.0.4',
    disable_update=True,
    ncpu=4
)
result = model.generate(input='AUDIO_PATH')
print(result)
"
Step 4: Image Understanding (qwen3-vl)
ollama run qwen3-vl:8b "Describe this image in detail: /path/to/frame.jpg"
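For many frames, calling the model over Ollama's local HTTP API may be more convenient than the CLI. The sketch below assumes Ollama is listening on its default address (`http://localhost:11434`) and uses the `/api/generate` endpoint, which accepts base64-encoded images in an `images` field; the function names are illustrative.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_payload(image_bytes: bytes,
                  prompt: str = "Describe this image in detail.",
                  model: str = "qwen3-vl:8b") -> dict:
    """Build a non-streaming generate request carrying one image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def describe_frame(path: str) -> str:
    """Send one frame to the local Ollama server and return its description."""
    with open(path, "rb") as f:
        payload = build_payload(f.read())
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling `describe_frame` on each file in frames/ yields one description per key frame.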
Step 5: Combine Results
- Audio transcription → FunASR (local, Chinese speech recognition)
- Key frames → qwen3-vl:8b via Ollama (local image understanding)
- Summary/Analysis → Cloud LLM API (if needed)
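One way to combine the two sources is to merge them into a single text prompt for the summarizing LLM. This is a sketch under stated assumptions: it assumes the FunASR result is a list of dicts that each carry a `text` key, and that frame descriptions are available as `(frame_number, description)` pairs. Both function names are hypothetical.

```python
def transcript_from_result(result) -> str:
    """Join the 'text' fields of a FunASR-style result list (assumed shape)."""
    parts = []
    for item in result or []:
        text = item.get("text", "") if isinstance(item, dict) else ""
        if text.strip():
            parts.append(text.strip())
    return "\n".join(parts)


def build_summary_prompt(transcript: str, frame_notes: list[tuple[int, str]]) -> str:
    """Merge the transcript and per-frame descriptions into one prompt."""
    lines = ["Transcript:", transcript.strip() or "(no speech detected)",
             "", "Key frames:"]
    # Sort by frame number so the descriptions follow the video's order.
    for index, note in sorted(frame_notes):
        lines.append(f"- frame_{index:03d}: {note.strip()}")
    lines += ["", "Summarize the video using both sources."]
    return "\n".join(lines)
```

The resulting string can be sent to whichever cloud LLM API is used for the summary step.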
Important Notes
- Image reading via the Read tool does NOT provide image understanding; always use qwen3-vl
- For Chinese audio, FunASR is preferred over Whisper
- Check for existing subtitle files (.txt, .srt, .vtt) before running ASR
- Modelscope cache for FunASR models is at C:/Users/TOM/.cache/modelscope
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat: /install local-video-understanding
- After installation, invoke the skill by name or use /local-video-understanding
- Provide required inputs per the skill's parameter spec and get structured output
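The Important Notes recommend checking for existing subtitle files before running ASR. A minimal sketch of that check follows; it assumes a subtitle file sits next to the video and shares its base name, which the skill does not specify. The name `find_subtitles` is hypothetical.

```python
from pathlib import Path

# Extensions named in the notes, in an assumed order of preference.
SUBTITLE_EXTS = (".srt", ".vtt", ".txt")


def find_subtitles(video_path: str) -> list[Path]:
    """Return existing sibling files like video.srt, video.vtt, video.txt."""
    video = Path(video_path)
    candidates = [video.with_suffix(ext) for ext in SUBTITLE_EXTS]
    return [c for c in candidates if c.is_file()]
```

If the returned list is non-empty, the transcript can be read from the first match and Step 3 skipped.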
What is Local Video Understanding?
Local Video Understanding is a video comprehension skill that runs locally. It uses ffmpeg to extract audio and frames, FunASR for speech recognition, and qwen3-vl for image understanding. It is an AI Agent Skill for Claude Code / OpenClaw, with 102 downloads so far.
How do I install Local Video Understanding?
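Run "/install local-video-understanding" in the OpenClaw or Claude Code chat to install the skill in one step. The tools listed under Prerequisites (the FunASR conda environment, Ollama with qwen3-vl:8b, and ffmpeg) must already be set up.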
Is Local Video Understanding free?
Yes, Local Video Understanding is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does Local Video Understanding support?
Local Video Understanding is cross-platform and runs anywhere OpenClaw / Claude Code is available.
Who created Local Video Understanding?
It is built and maintained by TOMUIV (@tomuiv); the current version is v1.0.2.