← Back to Skills Marketplace
guitu917

Screen Vision

by guitu917 · GitHub ↗ · v1.1.0 · MIT-0
cross-platform ⚠ suspicious
156
Downloads
0
Stars
3
Active Installs
8
Versions
Install in OpenClaw
/install ai-screen-vision
Description
AI screen vision and desktop computer control skill for OpenClaw. Let your AI agent see the screen, understand UI elements, and autonomously perform mouse an...
README (SKILL.md)

Screen Vision

Control the desktop visually: screenshot → AI vision analysis → execute actions → loop until done.

Quick Start

1. Setup (one-time)

Detect platform and install dependencies:

bash scripts/setup/setup-linux.sh --headless   # Linux server (no desktop)
bash scripts/setup/setup-linux.sh --desktop     # Linux with desktop
bash scripts/setup/setup-mac.sh                 # macOS
python scripts/setup/setup-win.py          # Windows

2. Configure API

Copy config.example.json to config.json and fill in your vision API credentials. You must set baseUrl, apiKey, and model — supports any OpenAI-compatible API.

{
  "vision": {
    "baseUrl": "https://api.siliconflow.cn/v1",
    "apiKey": "sk-your-key",
    "model": "Qwen/Qwen3-VL-32B"
  }
}

Environment variables also work: SV_VISION_API_KEY, SV_VISION_BASE_URL, SV_VISION_MODEL. See references/API_CONFIG.md for all supported providers and detailed setup.

3. Usage

The skill operates through a screenshot-analyze-action loop:

  1. Take screenshotbash scripts/platform/screenshot.sh [output_path] [display]
  2. Analyze with AIpython3 scripts/vision/analyze.py --image \x3Cpath> --task "\x3Ctask>"
  3. Execute actionpython3 scripts/platform/execute.py --action \x3Ctype> [options]
  4. Full task looppython3 scripts/core/run_task.py --task "\x3Ctask>"

Architecture

User task → run_task.py (orchestrator)
  ├── screenshot.sh (capture screen)
  ├── diff_check.py (detect changes, skip if unchanged → saves tokens)
  ├── analyze.py (send screenshot + task to vision API)
  ├── safety_check.py (block dangerous operations)
  ├── execute.py (xdotool/cliclick/pyautogui)
  └── loop until done or timeout

Platform Tools

Platform Screenshot Mouse/Keyboard Notes
Linux scrot xdotool Headless: XFCE4 + VNC
macOS screencapture cliclick Needs Accessibility permission
Windows pyautogui pyautogui No extra setup needed

See references/PLATFORM_GUIDE.md for platform-specific commands.

Vision Providers

Supports any OpenAI-compatible vision API. You choose the provider and model.

Recommended Models

Model Provider Cost/Task Quality
Qwen3-VL-32B SiliconFlow Low ★★★★
GLM-4V-Plus Zhipu BigModel Low ★★★★
GPT-5.4-Mini OpenAI / relays Medium ★★★★★
GPT-5.4 CUA OpenAI High ★★★★★
Llama 3.2 Vision Ollama (local) Free ★★

See references/API_CONFIG.md for per-provider configuration examples.

No defaults are hardcoded — you must configure your own API credentials before use.

Action Types

  • click — Click at (x, y). Supports left/right/double-click.
  • type — Type text string.
  • key — Press a key (Return, Tab, Escape, etc.).
  • scroll — Scroll up or down.
  • drag — Drag from (x1,y1) to (x2,y2).
  • wait — Wait for screen to update.
  • done — Task complete.
  • failed — Cannot complete task.

Safety

  • Blocked: rm -rf, format disk, shutdown, drop database, etc.
  • Confirmation required: delete, sudo, payment-related operations
  • Limits: max 5 minutes, max 100 actions per task
  • Logging: all screenshots saved to /tmp/screen-vision/logs/
  • Auto-stop on error or API failure

Examples

See references/EXAMPLES.md for usage examples.

Config

Variable Default Description
SV_VISION_API_KEY Vision API key
SV_VISION_BASE_URL API endpoint (required)
SV_VISION_MODEL Vision model name (required)
SV_DISPLAY :1 X11 display (Linux)
SV_MAX_DURATION 5 Max task duration (min)
SV_MAX_ACTIONS 100 Max actions per task
SV_SCREENSHOT_INTERVAL 1.0 Seconds between screenshots
Usage Guidance
What to consider before installing: - Metadata mismatch: the skill requires a vision API key (SV_VISION_* or config.json) even though registry metadata listed no env vars; expect to provide and store an API token locally. - Network exposure: the headless setup creates VNC + noVNC and sets a default VNC password 'screen123' and may run vncserver with '-localhost no', which allows remote access. Do NOT run headless/noVNC on a public server without changing the password and restricting access (firewall, SSH tunnel, or localhost-only proxy). - Privileged install: the setup scripts call apt/yum/dnf and write /usr/local/bin — you will need sudo and the installer changes system state. Review scripts before running, and prefer installing inside an isolated VM/container if possible. - Sensitive data handling: screenshots (potentially containing passwords and private information) are saved to /tmp/screen-vision/logs/ and full images are uploaded (base64) to whichever vision API you configure. If you care about privacy, run a local model provider (ollama/local) or avoid sending screenshots to external services. - Safety checks are heuristic: blocked/confirm rules are regex-based and act on the action text/reason that comes from the model; these can be bypassed by crafted responses. Do not grant this skill uncontrolled autonomous access on sensitive machines. - Recommendations: inspect and modify install scripts (change VNC password, remove '-localhost no', restrict noVNC binding to localhost or disable noVNC), run in an isolated environment, use a local vision provider if you want to avoid sending screenshots externally, and ensure the API key is stored securely (not world-readable). If you are not comfortable reviewing and hardening these scripts, avoid installing on production or internet-exposed hosts.
Capability Analysis
Type: OpenClaw Skill Name: ai-screen-vision Version: 1.1.0 The skill provides extensive 'Computer Use' capabilities, including screen capture, input simulation, and remote desktop access, which are inherently high-risk. A significant security concern is found in `scripts/setup/setup-linux.sh`, which installs a VNC server and sets a hardcoded default password ('screen123'), potentially allowing unauthorized remote access. Additionally, `scripts/vision/analyze.py` transmits system screenshots to external APIs, and while the endpoint is user-configurable, this facilitates data exfiltration of sensitive on-screen information. Although the skill includes a `safety_check.py` to blacklist dangerous commands, the combination of broad system control and weak default security configurations warrants a suspicious classification.
Capability Tags
cryptocan-make-purchases
Capability Assessment
Purpose & Capability
The SKILL.md and code clearly require vision API credentials (baseUrl, apiKey, model) and write/read config.json under ~/.openclaw/... or /etc, but the registry metadata declared 'Required env vars: none' and 'Required config paths: none' — this is an explicit metadata mismatch. The skill legitimately needs an API key and local display access for its stated purpose, but the metadata omission is misleading. The skill also requires installation of system packages and may create system services/scripts (sv-start/sv-stop) which is consistent with headless operation but elevates the system footprint beyond a simple instruction-only helper.
Instruction Scope
Runtime instructions and scripts perform full-screen capture, encode and send screenshots (base64) to an external vision API, run an analyze->execute loop which can drive xdotool/cliclick/pyautogui, and save all screenshots to /tmp/screen-vision/logs/. The safety check relies on regex matching of action text/reason produced by the model; because actions are derived from an external LLM/vision model, a malicious or malformed response could bypass intent. The SKILL.md also documents starting a headless XFCE + VNC + noVNC stack which exposes a remote desktop — that expands scope to remote-access surface beyond local automation.
Install Mechanism
Although there is no remote arbitrary binary download, the included install/setup scripts run package manager installs (apt/yum/dnf), pip installs, create files under /usr/local/bin (sv-start/sv-stop), write VNC configuration (~/.vnc) and may configure noVNC/websockify. The setup script sets a default VNC password ('screen123') and runs vncserver with '-localhost no' allowing non-local connections — this is a risky default. The install requires sudo for system packages and writes system-level scripts, so it has substantial install-time impact.
Credentials
The skill legitimately needs a vision API key/baseUrl/model (config.json or env SV_VISION_*). That is proportionate to its purpose. However the skill stores/sources credentials from config.json (~/.openclaw/.../config.json) and environment variables; this was not reflected in registry metadata (declared none). The skill does not request unrelated cloud credentials, but it does create and store screenshots and VNC password files locally which you should consider sensitive.
Persistence & Privilege
The skill does not set always:true, but its installer creates persistent system artifacts: installs packages, writes /usr/local/bin scripts, config files under the user's home and potentially /etc, and can start a VNC/noVNC server that listens on network ports. Those artifacts persist beyond a single invocation and can expose a desktop over the network with a weak default password. Autonomous invocation is allowed by default (disable-model-invocation is false) — combined with network-exposed VNC this increases blast radius.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install ai-screen-vision
  3. After installation, invoke the skill by name or use /ai-screen-vision
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.1.0
test
v1.0.6
Fix: setup-win.py GBK encoding, README version, SKILL.md .ps1->.py
v1.0.5
Fix: remove __pycache__ from package, add .clawhubignore
v1.0.4
Bug fix: analyze.py syntax error (extra brace), run_task.py deep_merge import error
v1.0.3
Fix: added setup-win.py (Python), replaced .ps1, updated PLATFORM_GUIDE and install.sh
v1.0.2
Fix: updated SKILL.md description with CN/EN search keywords, added Windows setup script, added README.md, fixed execute permissions
v1.0.1
ai-screen-vision 1.0.1 Changelog - SKILL.md description and metadata streamlined for brevity and clarity. - Examples and trigger list shortened; technical details and keywords removed from skill description. - No changes to functionality or code; documentation cleanup only.
v1.0.0
v1.0.0 - AI screen vision and desktop control skill for OpenClaw. Linux/macOS/Windows. Screenshot-analyze-action loop with GPT-5.4-Mini vision. Smart diff detection saves tokens. Safety mechanisms block dangerous operations. One-click install.
Metadata
Slug ai-screen-vision
Version 1.1.0
License MIT-0
All-time Installs 3
Active Installs 3
Total Versions 8
Frequently Asked Questions

What is Screen Vision?

AI screen vision and desktop computer control skill for OpenClaw. Let your AI agent see the screen, understand UI elements, and autonomously perform mouse an... It is an AI Agent Skill for Claude Code / OpenClaw, with 156 downloads so far.

How do I install Screen Vision?

Run "/install ai-screen-vision" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Screen Vision free?

Yes, Screen Vision is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Screen Vision support?

Screen Vision is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).

Who created Screen Vision?

It is built and maintained by guitu917 (@guitu917); the current version is v1.1.0.

💬 Comments