Description

AI screen vision and desktop computer control skill for OpenClaw. Let your AI agent see the screen, understand UI elements, and autonomously perform mouse an...

README (SKILL.md)

Screen Vision

Name: Screen Vision
Author: guitu917

Control the desktop visually: screenshot → AI vision analysis → execute actions → loop until done.

Quick Start

1. Setup (one-time)

Detect platform and install dependencies:

bash scripts/setup/setup-linux.sh --headless   # Linux server (no desktop)
bash scripts/setup/setup-linux.sh --desktop     # Linux with desktop
bash scripts/setup/setup-mac.sh                 # macOS
python scripts/setup/setup-win.py          # Windows

2. Configure API

Copy config.example.json to config.json and fill in your vision API credentials. You must set baseUrl, apiKey, and model — supports any OpenAI-compatible API.

{
  "vision": {
    "baseUrl": "https://api.siliconflow.cn/v1",
    "apiKey": "sk-your-key",
    "model": "Qwen/Qwen3-VL-32B"
  }
}

Environment variables also work: SV_VISION_API_KEY, SV_VISION_BASE_URL, SV_VISION_MODEL. See references/API_CONFIG.md for all supported providers and detailed setup.

3. Usage

The skill operates through a screenshot-analyze-action loop:

Take screenshot → bash scripts/platform/screenshot.sh [output_path] [display]
Analyze with AI → python3 scripts/vision/analyze.py --image \x3Cpath> --task "\x3Ctask>"
Execute action → python3 scripts/platform/execute.py --action \x3Ctype> [options]
Full task loop → python3 scripts/core/run_task.py --task "\x3Ctask>"

Architecture

User task → run_task.py (orchestrator)
  ├── screenshot.sh (capture screen)
  ├── diff_check.py (detect changes, skip if unchanged → saves tokens)
  ├── analyze.py (send screenshot + task to vision API)
  ├── safety_check.py (block dangerous operations)
  ├── execute.py (xdotool/cliclick/pyautogui)
  └── loop until done or timeout

Platform Tools

Platform	Screenshot	Mouse/Keyboard	Notes
Linux	scrot	xdotool	Headless: XFCE4 + VNC
macOS	screencapture	cliclick	Needs Accessibility permission
Windows	pyautogui	pyautogui	No extra setup needed

See references/PLATFORM_GUIDE.md for platform-specific commands.

Vision Providers

Supports any OpenAI-compatible vision API. You choose the provider and model.

Recommended Models

Model	Provider	Cost/Task	Quality
Qwen3-VL-32B	SiliconFlow	Low	★★★★
GLM-4V-Plus	Zhipu BigModel	Low	★★★★
GPT-5.4-Mini	OpenAI / relays	Medium	★★★★★
GPT-5.4 CUA	OpenAI	High	★★★★★
Llama 3.2 Vision	Ollama (local)	Free	★★

See references/API_CONFIG.md for per-provider configuration examples.

No defaults are hardcoded — you must configure your own API credentials before use.

Action Types

click — Click at (x, y). Supports left/right/double-click.
type — Type text string.
key — Press a key (Return, Tab, Escape, etc.).
scroll — Scroll up or down.
drag — Drag from (x1,y1) to (x2,y2).
wait — Wait for screen to update.
done — Task complete.
failed — Cannot complete task.

Safety

Blocked: rm -rf, format disk, shutdown, drop database, etc.
Confirmation required: delete, sudo, payment-related operations
Limits: max 5 minutes, max 100 actions per task
Logging: all screenshots saved to /tmp/screen-vision/logs/
Auto-stop on error or API failure

Examples

See references/EXAMPLES.md for usage examples.

Config

Variable	Default	Description
`SV_VISION_API_KEY`	—	Vision API key
`SV_VISION_BASE_URL`	—	API endpoint (required)
`SV_VISION_MODEL`	—	Vision model name (required)
`SV_DISPLAY`	`:1`	X11 display (Linux)
`SV_MAX_DURATION`	`5`	Max task duration (min)
`SV_MAX_ACTIONS`	`100`	Max actions per task
`SV_SCREENSHOT_INTERVAL`	`1.0`	Seconds between screenshots

Usage Guidance

What to consider before installing: - Metadata mismatch: the skill requires a vision API key (SV_VISION_* or config.json) even though registry metadata listed no env vars; expect to provide and store an API token locally. - Network exposure: the headless setup creates VNC + noVNC and sets a default VNC password 'screen123' and may run vncserver with '-localhost no', which allows remote access. Do NOT run headless/noVNC on a public server without changing the password and restricting access (firewall, SSH tunnel, or localhost-only proxy). - Privileged install: the setup scripts call apt/yum/dnf and write /usr/local/bin — you will need sudo and the installer changes system state. Review scripts before running, and prefer installing inside an isolated VM/container if possible. - Sensitive data handling: screenshots (potentially containing passwords and private information) are saved to /tmp/screen-vision/logs/ and full images are uploaded (base64) to whichever vision API you configure. If you care about privacy, run a local model provider (ollama/local) or avoid sending screenshots to external services. - Safety checks are heuristic: blocked/confirm rules are regex-based and act on the action text/reason that comes from the model; these can be bypassed by crafted responses. Do not grant this skill uncontrolled autonomous access on sensitive machines. - Recommendations: inspect and modify install scripts (change VNC password, remove '-localhost no', restrict noVNC binding to localhost or disable noVNC), run in an isolated environment, use a local vision provider if you want to avoid sending screenshots externally, and ensure the API key is stored securely (not world-readable). If you are not comfortable reviewing and hardening these scripts, avoid installing on production or internet-exposed hosts.

Capability Analysis

Type: OpenClaw Skill Name: ai-screen-vision Version: 1.1.0 The skill provides extensive 'Computer Use' capabilities, including screen capture, input simulation, and remote desktop access, which are inherently high-risk. A significant security concern is found in `scripts/setup/setup-linux.sh`, which installs a VNC server and sets a hardcoded default password ('screen123'), potentially allowing unauthorized remote access. Additionally, `scripts/vision/analyze.py` transmits system screenshots to external APIs, and while the endpoint is user-configurable, this facilitates data exfiltration of sensitive on-screen information. Although the skill includes a `safety_check.py` to blacklist dangerous commands, the combination of broad system control and weak default security configurations warrants a suspicious classification.

Capability Tags

cryptocan-make-purchases

Capability Assessment

⚠ Purpose & Capability

The SKILL.md and code clearly require vision API credentials (baseUrl, apiKey, model) and write/read config.json under ~/.openclaw/... or /etc, but the registry metadata declared 'Required env vars: none' and 'Required config paths: none' — this is an explicit metadata mismatch. The skill legitimately needs an API key and local display access for its stated purpose, but the metadata omission is misleading. The skill also requires installation of system packages and may create system services/scripts (sv-start/sv-stop) which is consistent with headless operation but elevates the system footprint beyond a simple instruction-only helper.

⚠ Instruction Scope

Runtime instructions and scripts perform full-screen capture, encode and send screenshots (base64) to an external vision API, run an analyze->execute loop which can drive xdotool/cliclick/pyautogui, and save all screenshots to /tmp/screen-vision/logs/. The safety check relies on regex matching of action text/reason produced by the model; because actions are derived from an external LLM/vision model, a malicious or malformed response could bypass intent. The SKILL.md also documents starting a headless XFCE + VNC + noVNC stack which exposes a remote desktop — that expands scope to remote-access surface beyond local automation.

⚠ Install Mechanism

Although there is no remote arbitrary binary download, the included install/setup scripts run package manager installs (apt/yum/dnf), pip installs, create files under /usr/local/bin (sv-start/sv-stop), write VNC configuration (~/.vnc) and may configure noVNC/websockify. The setup script sets a default VNC password ('screen123') and runs vncserver with '-localhost no' allowing non-local connections — this is a risky default. The install requires sudo for system packages and writes system-level scripts, so it has substantial install-time impact.

ℹ Credentials

The skill legitimately needs a vision API key/baseUrl/model (config.json or env SV_VISION_*). That is proportionate to its purpose. However the skill stores/sources credentials from config.json (~/.openclaw/.../config.json) and environment variables; this was not reflected in registry metadata (declared none). The skill does not request unrelated cloud credentials, but it does create and store screenshots and VNC password files locally which you should consider sensitive.

⚠ Persistence & Privilege

The skill does not set always:true, but its installer creates persistent system artifacts: installs packages, writes /usr/local/bin scripts, config files under the user's home and potentially /etc, and can start a VNC/noVNC server that listens on network ports. Those artifacts persist beyond a single invocation and can expose a desktop over the network with a weak default password. Autonomous invocation is allowed by default (disable-model-invocation is false) — combined with network-exposed VNC this increases blast radius.

Version History

v1.1.0

test

v1.0.6

Fix: setup-win.py GBK encoding, README version, SKILL.md .ps1->.py

v1.0.5

Fix: remove __pycache__ from package, add .clawhubignore

v1.0.4

Bug fix: analyze.py syntax error (extra brace), run_task.py deep_merge import error

v1.0.3

Fix: added setup-win.py (Python), replaced .ps1, updated PLATFORM_GUIDE and install.sh

v1.0.2

Fix: updated SKILL.md description with CN/EN search keywords, added Windows setup script, added README.md, fixed execute permissions

v1.0.1

ai-screen-vision 1.0.1 Changelog - SKILL.md description and metadata streamlined for brevity and clarity. - Examples and trigger list shortened; technical details and keywords removed from skill description. - No changes to functionality or code; documentation cleanup only.

v1.0.0

v1.0.0 - AI screen vision and desktop control skill for OpenClaw. Linux/macOS/Windows. Screenshot-analyze-action loop with GPT-5.4-Mini vision. Smart diff detection saves tokens. Safety mechanisms block dangerous operations. One-click install.

Metadata

Slug ai-screen-vision

Version 1.1.0

License MIT-0

All-time Installs 3

Active Installs 3

Total Versions 8

Frequently Asked Questions

What is Screen Vision?

AI screen vision and desktop computer control skill for OpenClaw. Let your AI agent see the screen, understand UI elements, and autonomously perform mouse an... It is an AI Agent Skill for Claude Code / OpenClaw, with 156 downloads so far.

How do I install Screen Vision?

Run "/install ai-screen-vision" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Screen Vision free?

Yes, Screen Vision is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Screen Vision support?

Screen Vision is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).

Who created Screen Vision?

It is built and maintained by guitu917 (@guitu917); the current version is v1.1.0.

More Skills

Screen Vision

Screen Vision

Quick Start

1. Setup (one-time)

2. Configure API

3. Usage

Architecture

Platform Tools

Vision Providers

Recommended Models

Action Types

Safety

Examples

Config

What is Screen Vision?

How do I install Screen Vision?

Is Screen Vision free?

Which platforms does Screen Vision support?

Who created Screen Vision?

💬 Comments