功能描述

AI screen vision and desktop computer control skill for OpenClaw. Let your AI agent see the screen, understand UI elements, and autonomously perform mouse an...

使用说明 (SKILL.md)

Screen Vision

Name: Screen Vision
Author: guitu917

Control the desktop visually: screenshot → AI vision analysis → execute actions → loop until done.

Quick Start

1. Setup (one-time)

Detect platform and install dependencies:

bash scripts/setup/setup-linux.sh --headless   # Linux server (no desktop)
bash scripts/setup/setup-linux.sh --desktop     # Linux with desktop
bash scripts/setup/setup-mac.sh                 # macOS
python scripts/setup/setup-win.py          # Windows

2. Configure API

Copy config.example.json to config.json and fill in your vision API credentials. You must set baseUrl, apiKey, and model — supports any OpenAI-compatible API.

{
  "vision": {
    "baseUrl": "https://api.siliconflow.cn/v1",
    "apiKey": "sk-your-key",
    "model": "Qwen/Qwen3-VL-32B"
  }
}

Environment variables also work: SV_VISION_API_KEY, SV_VISION_BASE_URL, SV_VISION_MODEL. See references/API_CONFIG.md for all supported providers and detailed setup.

3. Usage

The skill operates through a screenshot-analyze-action loop:

Take screenshot → bash scripts/platform/screenshot.sh [output_path] [display]
Analyze with AI → python3 scripts/vision/analyze.py --image \x3Cpath> --task "\x3Ctask>"
Execute action → python3 scripts/platform/execute.py --action \x3Ctype> [options]
Full task loop → python3 scripts/core/run_task.py --task "\x3Ctask>"

Architecture

User task → run_task.py (orchestrator)
  ├── screenshot.sh (capture screen)
  ├── diff_check.py (detect changes, skip if unchanged → saves tokens)
  ├── analyze.py (send screenshot + task to vision API)
  ├── safety_check.py (block dangerous operations)
  ├── execute.py (xdotool/cliclick/pyautogui)
  └── loop until done or timeout

Platform Tools

Platform	Screenshot	Mouse/Keyboard	Notes
Linux	scrot	xdotool	Headless: XFCE4 + VNC
macOS	screencapture	cliclick	Needs Accessibility permission
Windows	pyautogui	pyautogui	No extra setup needed

See references/PLATFORM_GUIDE.md for platform-specific commands.

Vision Providers

Supports any OpenAI-compatible vision API. You choose the provider and model.

Recommended Models

Model	Provider	Cost/Task	Quality
Qwen3-VL-32B	SiliconFlow	Low	★★★★
GLM-4V-Plus	Zhipu BigModel	Low	★★★★
GPT-5.4-Mini	OpenAI / relays	Medium	★★★★★
GPT-5.4 CUA	OpenAI	High	★★★★★
Llama 3.2 Vision	Ollama (local)	Free	★★

See references/API_CONFIG.md for per-provider configuration examples.

No defaults are hardcoded — you must configure your own API credentials before use.

Action Types

click — Click at (x, y). Supports left/right/double-click.
type — Type text string.
key — Press a key (Return, Tab, Escape, etc.).
scroll — Scroll up or down.
drag — Drag from (x1,y1) to (x2,y2).
wait — Wait for screen to update.
done — Task complete.
failed — Cannot complete task.

Safety

Blocked: rm -rf, format disk, shutdown, drop database, etc.
Confirmation required: delete, sudo, payment-related operations
Limits: max 5 minutes, max 100 actions per task
Logging: all screenshots saved to /tmp/screen-vision/logs/
Auto-stop on error or API failure

Examples

See references/EXAMPLES.md for usage examples.

Config

Variable	Default	Description
`SV_VISION_API_KEY`	—	Vision API key
`SV_VISION_BASE_URL`	—	API endpoint (required)
`SV_VISION_MODEL`	—	Vision model name (required)
`SV_DISPLAY`	`:1`	X11 display (Linux)
`SV_MAX_DURATION`	`5`	Max task duration (min)
`SV_MAX_ACTIONS`	`100`	Max actions per task
`SV_SCREENSHOT_INTERVAL`	`1.0`	Seconds between screenshots

安全使用建议

What to consider before installing: - Metadata mismatch: the skill requires a vision API key (SV_VISION_* or config.json) even though registry metadata listed no env vars; expect to provide and store an API token locally. - Network exposure: the headless setup creates VNC + noVNC and sets a default VNC password 'screen123' and may run vncserver with '-localhost no', which allows remote access. Do NOT run headless/noVNC on a public server without changing the password and restricting access (firewall, SSH tunnel, or localhost-only proxy). - Privileged install: the setup scripts call apt/yum/dnf and write /usr/local/bin — you will need sudo and the installer changes system state. Review scripts before running, and prefer installing inside an isolated VM/container if possible. - Sensitive data handling: screenshots (potentially containing passwords and private information) are saved to /tmp/screen-vision/logs/ and full images are uploaded (base64) to whichever vision API you configure. If you care about privacy, run a local model provider (ollama/local) or avoid sending screenshots to external services. - Safety checks are heuristic: blocked/confirm rules are regex-based and act on the action text/reason that comes from the model; these can be bypassed by crafted responses. Do not grant this skill uncontrolled autonomous access on sensitive machines. - Recommendations: inspect and modify install scripts (change VNC password, remove '-localhost no', restrict noVNC binding to localhost or disable noVNC), run in an isolated environment, use a local vision provider if you want to avoid sending screenshots externally, and ensure the API key is stored securely (not world-readable). If you are not comfortable reviewing and hardening these scripts, avoid installing on production or internet-exposed hosts.

功能分析

Type: OpenClaw Skill Name: ai-screen-vision Version: 1.1.0 The skill provides extensive 'Computer Use' capabilities, including screen capture, input simulation, and remote desktop access, which are inherently high-risk. A significant security concern is found in `scripts/setup/setup-linux.sh`, which installs a VNC server and sets a hardcoded default password ('screen123'), potentially allowing unauthorized remote access. Additionally, `scripts/vision/analyze.py` transmits system screenshots to external APIs, and while the endpoint is user-configurable, this facilitates data exfiltration of sensitive on-screen information. Although the skill includes a `safety_check.py` to blacklist dangerous commands, the combination of broad system control and weak default security configurations warrants a suspicious classification.

能力标签

cryptocan-make-purchases

能力评估

⚠ Purpose & Capability

The SKILL.md and code clearly require vision API credentials (baseUrl, apiKey, model) and write/read config.json under ~/.openclaw/... or /etc, but the registry metadata declared 'Required env vars: none' and 'Required config paths: none' — this is an explicit metadata mismatch. The skill legitimately needs an API key and local display access for its stated purpose, but the metadata omission is misleading. The skill also requires installation of system packages and may create system services/scripts (sv-start/sv-stop) which is consistent with headless operation but elevates the system footprint beyond a simple instruction-only helper.

⚠ Instruction Scope

Runtime instructions and scripts perform full-screen capture, encode and send screenshots (base64) to an external vision API, run an analyze->execute loop which can drive xdotool/cliclick/pyautogui, and save all screenshots to /tmp/screen-vision/logs/. The safety check relies on regex matching of action text/reason produced by the model; because actions are derived from an external LLM/vision model, a malicious or malformed response could bypass intent. The SKILL.md also documents starting a headless XFCE + VNC + noVNC stack which exposes a remote desktop — that expands scope to remote-access surface beyond local automation.

⚠ Install Mechanism

Although there is no remote arbitrary binary download, the included install/setup scripts run package manager installs (apt/yum/dnf), pip installs, create files under /usr/local/bin (sv-start/sv-stop), write VNC configuration (~/.vnc) and may configure noVNC/websockify. The setup script sets a default VNC password ('screen123') and runs vncserver with '-localhost no' allowing non-local connections — this is a risky default. The install requires sudo for system packages and writes system-level scripts, so it has substantial install-time impact.

ℹ Credentials

The skill legitimately needs a vision API key/baseUrl/model (config.json or env SV_VISION_*). That is proportionate to its purpose. However the skill stores/sources credentials from config.json (~/.openclaw/.../config.json) and environment variables; this was not reflected in registry metadata (declared none). The skill does not request unrelated cloud credentials, but it does create and store screenshots and VNC password files locally which you should consider sensitive.

⚠ Persistence & Privilege

The skill does not set always:true, but its installer creates persistent system artifacts: installs packages, writes /usr/local/bin scripts, config files under the user's home and potentially /etc, and can start a VNC/noVNC server that listens on network ports. Those artifacts persist beyond a single invocation and can expose a desktop over the network with a weak default password. Autonomous invocation is allowed by default (disable-model-invocation is false) — combined with network-exposed VNC this increases blast radius.

版本历史

v1.1.0

test

v1.0.6

Fix: setup-win.py GBK encoding, README version, SKILL.md .ps1->.py

v1.0.5

Fix: remove __pycache__ from package, add .clawhubignore

v1.0.4

Bug fix: analyze.py syntax error (extra brace), run_task.py deep_merge import error

v1.0.3

Fix: added setup-win.py (Python), replaced .ps1, updated PLATFORM_GUIDE and install.sh

v1.0.2

Fix: updated SKILL.md description with CN/EN search keywords, added Windows setup script, added README.md, fixed execute permissions

v1.0.1

ai-screen-vision 1.0.1 Changelog - SKILL.md description and metadata streamlined for brevity and clarity. - Examples and trigger list shortened; technical details and keywords removed from skill description. - No changes to functionality or code; documentation cleanup only.

v1.0.0

v1.0.0 - AI screen vision and desktop control skill for OpenClaw. Linux/macOS/Windows. Screenshot-analyze-action loop with GPT-5.4-Mini vision. Smart diff detection saves tokens. Safety mechanisms block dangerous operations. One-click install.

元数据

Slug ai-screen-vision

版本 1.1.0

许可证 MIT-0

累计安装 3

当前安装数 3

历史版本数 8

常见问题

Screen Vision 是什么？

AI screen vision and desktop computer control skill for OpenClaw. Let your AI agent see the screen, understand UI elements, and autonomously perform mouse an... 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件，目前累计下载 156 次。

如何安装 Screen Vision？

在 OpenClaw 或 Claude Code 对话框中运行命令「/install ai-screen-vision」即可一键安装，无需额外配置。

Screen Vision 是免费的吗？

是的，Screen Vision 完全免费，采用 MIT-0 许可证，可自由下载、安装和使用。

Screen Vision 支持哪些平台？

Screen Vision 跨平台运行，可在任意部署了 OpenClaw / Claude Code 的环境中使用（cross-platform）。

谁开发了 Screen Vision？

由 guitu917（@guitu917）开发并维护，当前版本 v1.1.0。

Screen Vision

Screen Vision

Quick Start

1. Setup (one-time)

2. Configure API

3. Usage

Architecture

Platform Tools

Vision Providers

Recommended Models

Action Types

Safety

Examples

Config

Screen Vision 是什么？

如何安装 Screen Vision？

Screen Vision 是免费的吗？

Screen Vision 支持哪些平台？

谁开发了 Screen Vision？

💬 留言讨论