← 返回 Skills 市场
guitu917

Screen Vision

作者 guitu917 · GitHub ↗ · v1.1.0 · MIT-0
cross-platform ⚠ suspicious
156
总下载
0
收藏
3
当前安装
8
版本数
在 OpenClaw 中安装
/install ai-screen-vision
功能描述
AI screen vision and desktop computer control skill for OpenClaw. Let your AI agent see the screen, understand UI elements, and autonomously perform mouse an...
使用说明 (SKILL.md)

Screen Vision

Control the desktop visually: screenshot → AI vision analysis → execute actions → loop until done.

Quick Start

1. Setup (one-time)

Detect platform and install dependencies:

bash scripts/setup/setup-linux.sh --headless   # Linux server (no desktop)
bash scripts/setup/setup-linux.sh --desktop     # Linux with desktop
bash scripts/setup/setup-mac.sh                 # macOS
python scripts/setup/setup-win.py          # Windows

2. Configure API

Copy config.example.json to config.json and fill in your vision API credentials. You must set baseUrl, apiKey, and model — supports any OpenAI-compatible API.

{
  "vision": {
    "baseUrl": "https://api.siliconflow.cn/v1",
    "apiKey": "sk-your-key",
    "model": "Qwen/Qwen3-VL-32B"
  }
}

Environment variables also work: SV_VISION_API_KEY, SV_VISION_BASE_URL, SV_VISION_MODEL. See references/API_CONFIG.md for all supported providers and detailed setup.

3. Usage

The skill operates through a screenshot-analyze-action loop:

  1. Take screenshotbash scripts/platform/screenshot.sh [output_path] [display]
  2. Analyze with AIpython3 scripts/vision/analyze.py --image \x3Cpath> --task "\x3Ctask>"
  3. Execute actionpython3 scripts/platform/execute.py --action \x3Ctype> [options]
  4. Full task looppython3 scripts/core/run_task.py --task "\x3Ctask>"

Architecture

User task → run_task.py (orchestrator)
  ├── screenshot.sh (capture screen)
  ├── diff_check.py (detect changes, skip if unchanged → saves tokens)
  ├── analyze.py (send screenshot + task to vision API)
  ├── safety_check.py (block dangerous operations)
  ├── execute.py (xdotool/cliclick/pyautogui)
  └── loop until done or timeout

Platform Tools

Platform Screenshot Mouse/Keyboard Notes
Linux scrot xdotool Headless: XFCE4 + VNC
macOS screencapture cliclick Needs Accessibility permission
Windows pyautogui pyautogui No extra setup needed

See references/PLATFORM_GUIDE.md for platform-specific commands.

Vision Providers

Supports any OpenAI-compatible vision API. You choose the provider and model.

Recommended Models

Model Provider Cost/Task Quality
Qwen3-VL-32B SiliconFlow Low ★★★★
GLM-4V-Plus Zhipu BigModel Low ★★★★
GPT-5.4-Mini OpenAI / relays Medium ★★★★★
GPT-5.4 CUA OpenAI High ★★★★★
Llama 3.2 Vision Ollama (local) Free ★★

See references/API_CONFIG.md for per-provider configuration examples.

No defaults are hardcoded — you must configure your own API credentials before use.

Action Types

  • click — Click at (x, y). Supports left/right/double-click.
  • type — Type text string.
  • key — Press a key (Return, Tab, Escape, etc.).
  • scroll — Scroll up or down.
  • drag — Drag from (x1,y1) to (x2,y2).
  • wait — Wait for screen to update.
  • done — Task complete.
  • failed — Cannot complete task.

Safety

  • Blocked: rm -rf, format disk, shutdown, drop database, etc.
  • Confirmation required: delete, sudo, payment-related operations
  • Limits: max 5 minutes, max 100 actions per task
  • Logging: all screenshots saved to /tmp/screen-vision/logs/
  • Auto-stop on error or API failure

Examples

See references/EXAMPLES.md for usage examples.

Config

Variable Default Description
SV_VISION_API_KEY Vision API key
SV_VISION_BASE_URL API endpoint (required)
SV_VISION_MODEL Vision model name (required)
SV_DISPLAY :1 X11 display (Linux)
SV_MAX_DURATION 5 Max task duration (min)
SV_MAX_ACTIONS 100 Max actions per task
SV_SCREENSHOT_INTERVAL 1.0 Seconds between screenshots
安全使用建议
What to consider before installing: - Metadata mismatch: the skill requires a vision API key (SV_VISION_* or config.json) even though registry metadata listed no env vars; expect to provide and store an API token locally. - Network exposure: the headless setup creates VNC + noVNC and sets a default VNC password 'screen123' and may run vncserver with '-localhost no', which allows remote access. Do NOT run headless/noVNC on a public server without changing the password and restricting access (firewall, SSH tunnel, or localhost-only proxy). - Privileged install: the setup scripts call apt/yum/dnf and write /usr/local/bin — you will need sudo and the installer changes system state. Review scripts before running, and prefer installing inside an isolated VM/container if possible. - Sensitive data handling: screenshots (potentially containing passwords and private information) are saved to /tmp/screen-vision/logs/ and full images are uploaded (base64) to whichever vision API you configure. If you care about privacy, run a local model provider (ollama/local) or avoid sending screenshots to external services. - Safety checks are heuristic: blocked/confirm rules are regex-based and act on the action text/reason that comes from the model; these can be bypassed by crafted responses. Do not grant this skill uncontrolled autonomous access on sensitive machines. - Recommendations: inspect and modify install scripts (change VNC password, remove '-localhost no', restrict noVNC binding to localhost or disable noVNC), run in an isolated environment, use a local vision provider if you want to avoid sending screenshots externally, and ensure the API key is stored securely (not world-readable). If you are not comfortable reviewing and hardening these scripts, avoid installing on production or internet-exposed hosts.
功能分析
Type: OpenClaw Skill Name: ai-screen-vision Version: 1.1.0 The skill provides extensive 'Computer Use' capabilities, including screen capture, input simulation, and remote desktop access, which are inherently high-risk. A significant security concern is found in `scripts/setup/setup-linux.sh`, which installs a VNC server and sets a hardcoded default password ('screen123'), potentially allowing unauthorized remote access. Additionally, `scripts/vision/analyze.py` transmits system screenshots to external APIs, and while the endpoint is user-configurable, this facilitates data exfiltration of sensitive on-screen information. Although the skill includes a `safety_check.py` to blacklist dangerous commands, the combination of broad system control and weak default security configurations warrants a suspicious classification.
能力标签
cryptocan-make-purchases
能力评估
Purpose & Capability
The SKILL.md and code clearly require vision API credentials (baseUrl, apiKey, model) and write/read config.json under ~/.openclaw/... or /etc, but the registry metadata declared 'Required env vars: none' and 'Required config paths: none' — this is an explicit metadata mismatch. The skill legitimately needs an API key and local display access for its stated purpose, but the metadata omission is misleading. The skill also requires installation of system packages and may create system services/scripts (sv-start/sv-stop) which is consistent with headless operation but elevates the system footprint beyond a simple instruction-only helper.
Instruction Scope
Runtime instructions and scripts perform full-screen capture, encode and send screenshots (base64) to an external vision API, run an analyze->execute loop which can drive xdotool/cliclick/pyautogui, and save all screenshots to /tmp/screen-vision/logs/. The safety check relies on regex matching of action text/reason produced by the model; because actions are derived from an external LLM/vision model, a malicious or malformed response could bypass intent. The SKILL.md also documents starting a headless XFCE + VNC + noVNC stack which exposes a remote desktop — that expands scope to remote-access surface beyond local automation.
Install Mechanism
Although there is no remote arbitrary binary download, the included install/setup scripts run package manager installs (apt/yum/dnf), pip installs, create files under /usr/local/bin (sv-start/sv-stop), write VNC configuration (~/.vnc) and may configure noVNC/websockify. The setup script sets a default VNC password ('screen123') and runs vncserver with '-localhost no' allowing non-local connections — this is a risky default. The install requires sudo for system packages and writes system-level scripts, so it has substantial install-time impact.
Credentials
The skill legitimately needs a vision API key/baseUrl/model (config.json or env SV_VISION_*). That is proportionate to its purpose. However the skill stores/sources credentials from config.json (~/.openclaw/.../config.json) and environment variables; this was not reflected in registry metadata (declared none). The skill does not request unrelated cloud credentials, but it does create and store screenshots and VNC password files locally which you should consider sensitive.
Persistence & Privilege
The skill does not set always:true, but its installer creates persistent system artifacts: installs packages, writes /usr/local/bin scripts, config files under the user's home and potentially /etc, and can start a VNC/noVNC server that listens on network ports. Those artifacts persist beyond a single invocation and can expose a desktop over the network with a weak default password. Autonomous invocation is allowed by default (disable-model-invocation is false) — combined with network-exposed VNC this increases blast radius.
如何使用
  1. 确保已安装 OpenClaw(本地或 Docker 部署)
  2. 在对话框中输入安装命令:/install ai-screen-vision
  3. 安装完成后,直接呼叫该 Skill 的名称或使用 /ai-screen-vision 触发
  4. 根据 Skill 的参数说明提供必要输入,即可获得结构化输出
版本历史
v1.1.0
test
v1.0.6
Fix: setup-win.py GBK encoding, README version, SKILL.md .ps1->.py
v1.0.5
Fix: remove __pycache__ from package, add .clawhubignore
v1.0.4
Bug fix: analyze.py syntax error (extra brace), run_task.py deep_merge import error
v1.0.3
Fix: added setup-win.py (Python), replaced .ps1, updated PLATFORM_GUIDE and install.sh
v1.0.2
Fix: updated SKILL.md description with CN/EN search keywords, added Windows setup script, added README.md, fixed execute permissions
v1.0.1
ai-screen-vision 1.0.1 Changelog - SKILL.md description and metadata streamlined for brevity and clarity. - Examples and trigger list shortened; technical details and keywords removed from skill description. - No changes to functionality or code; documentation cleanup only.
v1.0.0
v1.0.0 - AI screen vision and desktop control skill for OpenClaw. Linux/macOS/Windows. Screenshot-analyze-action loop with GPT-5.4-Mini vision. Smart diff detection saves tokens. Safety mechanisms block dangerous operations. One-click install.
元数据
Slug ai-screen-vision
版本 1.1.0
许可证 MIT-0
累计安装 3
当前安装数 3
历史版本数 8
常见问题

Screen Vision 是什么?

AI screen vision and desktop computer control skill for OpenClaw. Let your AI agent see the screen, understand UI elements, and autonomously perform mouse an... 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件,目前累计下载 156 次。

如何安装 Screen Vision?

在 OpenClaw 或 Claude Code 对话框中运行命令「/install ai-screen-vision」即可一键安装,无需额外配置。

Screen Vision 是免费的吗?

是的,Screen Vision 完全免费,采用 MIT-0 许可证,可自由下载、安装和使用。

Screen Vision 支持哪些平台?

Screen Vision 跨平台运行,可在任意部署了 OpenClaw / Claude Code 的环境中使用(cross-platform)。

谁开发了 Screen Vision?

由 guitu917(@guitu917)开发并维护,当前版本 v1.1.0。

💬 留言讨论