tomuiv

Local Video Understanding

Author: TOMUIV · GitHub · v1.0.2 · MIT-0
cross-platform ✓ Security check passed
Total downloads: 102 · Favorites: 1 · Current installs: 0 · Versions: 2
Install in OpenClaw
/install local-video-understanding
Description
Local video comprehension skill. Use ffmpeg to extract audio and frames, FunASR for speech recognition, and qwen3-vl for image understanding.
Usage Instructions (SKILL.md)

⚠️ If you are human, please read README.md first!


Local Video Understanding

Use this skill when you need to understand the content of a video.

Prerequisites

  • FunASR conda environment (asr-local) must be activated for audio processing
  • Ollama must be running with qwen3-vl:8b model available
  • ffmpeg must be in PATH

Workflow

Step 1: Extract Audio

ffmpeg -y -i "video.mp4" -vn -acodec pcm_s16le -ar 16000 -ac 1 "audio.wav"

Note: If the path contains Chinese characters, copy audio.wav to a path without Chinese characters before running ASR.
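
The note above can be automated. The following is a minimal sketch, not part of the original skill: the function name ensure_ascii_path and the fixed file name audio_input are hypothetical, and it treats any non-ASCII character (not only Chinese) as a reason to copy. If no work_dir is given it uses a temporary directory, which is assumed to be ASCII-only itself; on a machine where the temp directory contains Chinese characters, pass an explicit work_dir.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def ensure_ascii_path(audio_path: str, work_dir: Optional[str] = None) -> str:
    """Return a path to the audio file that contains only ASCII characters.

    An ASCII-only path is returned unchanged; otherwise the file is copied
    into work_dir (or a fresh temporary directory) under an ASCII name.
    """
    src = Path(audio_path)
    if str(src).isascii():
        return str(src)
    target_dir = Path(work_dir) if work_dir else Path(tempfile.mkdtemp(prefix="asr_"))
    target_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical fixed name; the original suffix is kept so the format stays recognizable.
    dst = target_dir / f"audio_input{src.suffix}"
    shutil.copyfile(src, dst)
    return str(dst)
```

The returned path can then be used in place of AUDIO_PATH in Step 3.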

Step 2: Extract Key Frames

mkdir frames
ffmpeg -y -i "video.mp4" -vf "fps=1/10" -q:v 2 "frames/frame_%03d.jpg"
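
With fps=1/10, ffmpeg writes roughly one frame per 10 seconds of video, numbered from frame_001.jpg. The sketch below shows one way to build the same command from Python and to map a frame file name back to an approximate position in the video. The helper names frame_extract_cmd and approx_frame_time are hypothetical, and the timestamp is only an estimate: the fps filter picks frames by rounding, so the actual capture time can differ by a few seconds.

```python
from typing import List


def frame_extract_cmd(video: str, out_dir: str = "frames", interval_s: int = 10) -> List[str]:
    """Build the Step 2 ffmpeg argument list: one frame every interval_s seconds."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return [
        "ffmpeg", "-y", "-i", video,
        "-vf", f"fps=1/{interval_s}",
        "-q:v", "2",
        f"{out_dir}/frame_%03d.jpg",
    ]


def approx_frame_time(frame_name: str, interval_s: int = 10) -> int:
    """Approximate timestamp in seconds for a file name like 'frame_003.jpg'."""
    index = int(frame_name.rsplit("_", 1)[1].split(".")[0])  # numbering starts at 1
    return (index - 1) * interval_s
```

The list can be passed to subprocess.run(cmd, check=True) once the output directory exists.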

Step 3: Speech Recognition (FunASR)

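conda run -n asr-local python -c "
import os
# Cache directory used by the author; change it to a directory that exists on your machine.
os.environ['MODELSCOPE_CACHE'] = 'C:/Users/TOM/.cache/modelscope'
from funasr import AutoModel
# Chinese Paraformer ASR model for 16 kHz audio (downloaded on first use).
model = AutoModel(
    model='iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
    model_revision='v2.0.4',
    disable_update=True,
    ncpu=4
)
# Replace AUDIO_PATH with the path to the WAV file from Step 1.
result = model.generate(input='AUDIO_PATH')
print(result)
"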

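The Step 3 command prints the raw result object. As a rough sketch, the transcript text could be pulled out as shown here. This assumes the result is a list of dictionaries that each carry a 'text' field, which is the shape FunASR commonly returns; check the output of your FunASR version before relying on it. The function name transcript_from_result is hypothetical.

```python
from typing import Any, Dict, List


def transcript_from_result(result: List[Dict[str, Any]]) -> str:
    """Join the 'text' field of every entry, skipping entries with no text."""
    parts = [str(item.get("text", "")).strip() for item in result]
    return "\n".join(p for p in parts if p)
```
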
Step 4: Image Understanding (qwen3-vl)

ollama run qwen3-vl:8b "Describe this image in detail: /path/to/frame.jpg"
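
For scripted use, the same model can be reached through Ollama's local REST API instead of the CLI. The sketch below assumes Ollama is listening on its default address http://localhost:11434 and uses the /api/generate endpoint with a base64-encoded image. The function names build_generate_payload and describe_image are hypothetical, and the timeout is an arbitrary choice.

```python
import base64
import json
import urllib.request


def build_generate_payload(image_path: str, prompt: str, model: str = "qwen3-vl:8b") -> dict:
    """Build the JSON body for /api/generate with one base64-encoded image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return {"model": model, "prompt": prompt, "images": [image_b64], "stream": False}


def describe_image(image_path: str,
                   prompt: str = "Describe this image in detail.",
                   host: str = "http://localhost:11434") -> str:
    """Send one frame to a running Ollama server and return the model's text answer."""
    payload = build_generate_payload(image_path, prompt)
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())["response"]
```

Calling describe_image("frames/frame_001.jpg") requires the Ollama server to be running with the model already pulled.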

Step 5: Combine Results

  • Audio transcription → FunASR (local, Chinese speech recognition)
  • Key frames → qwen3-vl:8b via Ollama (local image understanding)
  • Summary/Analysis → Cloud LLM API (if needed)
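
One possible way to combine the two outputs is to merge the transcript and the per-frame descriptions into a single text block, which can be read directly or passed to an LLM for summarization. This is only a sketch: the function name build_summary_prompt and the layout of the text are assumptions, and frame descriptions are keyed by their approximate timestamp in seconds.

```python
from typing import Dict


def build_summary_prompt(transcript: str, frame_descriptions: Dict[int, str]) -> str:
    """Merge the ASR transcript and timestamped frame descriptions into one text block."""
    lines = ["Transcript:", transcript.strip() or "(no speech recognized)", "", "Key frames:"]
    for seconds in sorted(frame_descriptions):
        minutes, secs = divmod(seconds, 60)
        lines.append(f"[{minutes:02d}:{secs:02d}] {frame_descriptions[seconds].strip()}")
    return "\n".join(lines)
```

Keeping this step as plain text means the summary stage can stay local or go to a cloud API, depending on how sensitive the video is.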

Important Notes

  • Reading an image with the Read tool does NOT provide image understanding; always use qwen3-vl
  • For Chinese audio, FunASR is preferred over Whisper
  • Check for existing subtitle files (.txt, .srt, .vtt) before running ASR
  • The ModelScope cache for FunASR models is at C:/Users/TOM/.cache/modelscope on the author's machine; change it to a directory that exists for your user
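
The subtitle check in the notes above could look like the sketch below. It assumes a subtitle file sits next to the video and shares its base name (for example video.mp4 and video.srt); the function name find_subtitles is hypothetical, and files stored elsewhere or named differently will not be found.

```python
from pathlib import Path
from typing import List

SUBTITLE_SUFFIXES = (".srt", ".vtt", ".txt")


def find_subtitles(video_path: str) -> List[Path]:
    """Return existing files next to the video that share its stem and have a subtitle suffix."""
    video = Path(video_path)
    candidates = [video.with_suffix(suffix) for suffix in SUBTITLE_SUFFIXES]
    return [p for p in candidates if p.is_file()]
```

If the returned list is non-empty, Step 3 can be skipped and the subtitle text used as the transcript.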
Security Recommendations
This skill appears coherent for local video processing, but review the following before installing or using it:
  1. It runs local commands (ffmpeg, conda python, ollama) and reads/writes files (audio.wav, frames/*); run it only on machines you control.
  2. Models are auto-downloaded on first use, so ensure you trust the model sources and have disk/network capacity.
  3. Update the MODELSCOPE_CACHE path in the Python snippet to a directory that exists for your user instead of the hard-coded C:/Users/TOM path.
  4. The README mentions using a cloud LLM for summaries; avoid sending sensitive video data to cloud services unless you understand and accept the privacy implications.
  5. If you need higher assurance, verify the exact FunASR and Ollama model sources and pull them manually before running.
Functional Analysis
Type: OpenClaw Skill · Name: local-video-understanding · Version: 1.0.2
The skill provides a workflow for local video analysis using ffmpeg, FunASR, and Ollama. SKILL.md contains a hardcoded user-specific path (C:/Users/TOM/.cache/modelscope), a configuration artifact from the author's environment, but there is no evidence of malicious intent, data exfiltration, or unauthorized execution. The instructions are consistent with the stated purpose of video comprehension.
Capability Assessment
Purpose & Capability
The name/description (local video understanding) matches the instructions: ffmpeg for extraction, FunASR for ASR, and qwen3-vl via Ollama for image understanding. Required tools mentioned in the README/SKILL.md are exactly what the workflow needs.
Instruction Scope
Instructions are concrete and narrowly scoped to extracting audio/frames, running FunASR in a conda env, and calling Ollama for image understanding. They do reference local files and paths (frames, audio.wav) and set a MODELSCOPE_CACHE path inside the Python snippet (hard-coded Windows path). They also mention optionally using a 'Cloud LLM API' for summaries without specifying which service—this could lead to data being sent off-device if the operator chooses to do so.
Install Mechanism
This is an instruction-only skill with no install spec or downloaded archives. The README notes that models are auto-downloaded on first use (FunASR/ModelScope and pulling qwen3-vl via Ollama), which is expected for local models but requires network access and disk space.
Credentials
No environment variables or credentials are declared. The SKILL.md does set MODELSCOPE_CACHE inside the Python snippet to a specific Windows user path (C:/Users/TOM/.cache/modelscope), which is odd and non-portable but not a secret-exfiltration pattern. The workflow may require internet for initial model downloads and the README suggests possible later use of a cloud LLM for summaries—this is the main privacy-related consideration.
Persistence & Privilege
The skill does not request always:true or any elevated/persistent platform privileges, nor does it modify other skills' configs. It is user-invocable and relies on local binaries and environments.
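How to Use
  1. Make sure OpenClaw is installed (locally or via Docker).
  2. Enter the install command in the chat box: /install local-video-understanding
  3. Once installed, invoke the skill by name or trigger it with /local-video-understanding
  4. Provide the required inputs as described in the skill's parameter notes to get structured output.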
Version History
v1.0.2
Fix README structure to match actual files
v1.0.1
Add human warning, proper casing
Metadata
Slug local-video-understanding
Version 1.0.2
License MIT-0
Total installs 0
Current installs 0
Version count 2
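FAQ

What is Local Video Understanding?

Local video comprehension skill. Use ffmpeg to extract audio and frames, FunASR for speech recognition, and qwen3-vl for image understanding. It is an AI agent skill plugin for Claude Code / OpenClaw, with 102 total downloads to date.

How do I install Local Video Understanding?

Run the command /install local-video-understanding in the OpenClaw or Claude Code chat box to install it in one step. The skill itself needs no further configuration, but the prerequisites listed in SKILL.md (ffmpeg, the FunASR conda environment, and Ollama with qwen3-vl:8b) must already be set up.

Is Local Video Understanding free?

Yes. Local Video Understanding is completely free and released under the MIT-0 license, so it can be downloaded, installed, and used without restriction.

Which platforms does Local Video Understanding support?

Local Video Understanding is listed as cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed. Note that the Python snippet in SKILL.md hard-codes a Windows cache path, which must be adjusted on other systems.

Who develops Local Video Understanding?

It is developed and maintained by TOMUIV (@tomuiv). The current version is v1.0.2.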
