/install ai-screen-vision
Screen Vision
Control the desktop visually: screenshot → AI vision analysis → execute actions → loop until done.
Quick Start
1. Setup (one-time)
Detect platform and install dependencies:
bash scripts/setup/setup-linux.sh --headless # Linux server (no desktop)
bash scripts/setup/setup-linux.sh --desktop # Linux with desktop
bash scripts/setup/setup-mac.sh # macOS
python scripts/setup/setup-win.py # Windows
2. Configure API
Copy config.example.json to config.json and fill in your vision API credentials.
You must set baseUrl, apiKey, and model — supports any OpenAI-compatible API.
{
"vision": {
"baseUrl": "https://api.siliconflow.cn/v1",
"apiKey": "sk-your-key",
"model": "Qwen/Qwen3-VL-32B"
}
}
Environment variables also work: SV_VISION_API_KEY, SV_VISION_BASE_URL, SV_VISION_MODEL.
See references/API_CONFIG.md for all supported providers and detailed setup.
3. Usage
The skill operates through a screenshot-analyze-action loop:
- Take screenshot →
bash scripts/platform/screenshot.sh [output_path] [display] - Analyze with AI →
python3 scripts/vision/analyze.py --image \x3Cpath> --task "\x3Ctask>" - Execute action →
python3 scripts/platform/execute.py --action \x3Ctype> [options] - Full task loop →
python3 scripts/core/run_task.py --task "\x3Ctask>"
Architecture
User task → run_task.py (orchestrator)
├── screenshot.sh (capture screen)
├── diff_check.py (detect changes, skip if unchanged → saves tokens)
├── analyze.py (send screenshot + task to vision API)
├── safety_check.py (block dangerous operations)
├── execute.py (xdotool/cliclick/pyautogui)
└── loop until done or timeout
Platform Tools
| Platform | Screenshot | Mouse/Keyboard | Notes |
|---|---|---|---|
| Linux | scrot | xdotool | Headless: XFCE4 + VNC |
| macOS | screencapture | cliclick | Needs Accessibility permission |
| Windows | pyautogui | pyautogui | No extra setup needed |
See references/PLATFORM_GUIDE.md for platform-specific commands.
Vision Providers
Supports any OpenAI-compatible vision API. You choose the provider and model.
Recommended Models
| Model | Provider | Cost/Task | Quality |
|---|---|---|---|
| Qwen3-VL-32B | SiliconFlow | Low | ★★★★ |
| GLM-4V-Plus | Zhipu BigModel | Low | ★★★★ |
| GPT-5.4-Mini | OpenAI / relays | Medium | ★★★★★ |
| GPT-5.4 CUA | OpenAI | High | ★★★★★ |
| Llama 3.2 Vision | Ollama (local) | Free | ★★ |
See references/API_CONFIG.md for per-provider configuration examples.
No defaults are hardcoded — you must configure your own API credentials before use.
Action Types
click— Click at (x, y). Supports left/right/double-click.type— Type text string.key— Press a key (Return, Tab, Escape, etc.).scroll— Scroll up or down.drag— Drag from (x1,y1) to (x2,y2).wait— Wait for screen to update.done— Task complete.failed— Cannot complete task.
Safety
- Blocked: rm -rf, format disk, shutdown, drop database, etc.
- Confirmation required: delete, sudo, payment-related operations
- Limits: max 5 minutes, max 100 actions per task
- Logging: all screenshots saved to
/tmp/screen-vision/logs/ - Auto-stop on error or API failure
Examples
See references/EXAMPLES.md for usage examples.
Config
| Variable | Default | Description |
|---|---|---|
SV_VISION_API_KEY |
— | Vision API key |
SV_VISION_BASE_URL |
— | API endpoint (required) |
SV_VISION_MODEL |
— | Vision model name (required) |
SV_DISPLAY |
:1 |
X11 display (Linux) |
SV_MAX_DURATION |
5 |
Max task duration (min) |
SV_MAX_ACTIONS |
100 |
Max actions per task |
SV_SCREENSHOT_INTERVAL |
1.0 |
Seconds between screenshots |
- 确保已安装 OpenClaw(本地或 Docker 部署)
- 在对话框中输入安装命令:
/install ai-screen-vision - 安装完成后,直接呼叫该 Skill 的名称或使用
/ai-screen-vision触发 - 根据 Skill 的参数说明提供必要输入,即可获得结构化输出
Screen Vision 是什么?
AI screen vision and desktop computer control skill for OpenClaw. Let your AI agent see the screen, understand UI elements, and autonomously perform mouse an... 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件,目前累计下载 156 次。
如何安装 Screen Vision?
在 OpenClaw 或 Claude Code 对话框中运行命令「/install ai-screen-vision」即可一键安装,无需额外配置。
Screen Vision 是免费的吗?
是的,Screen Vision 完全免费,采用 MIT-0 许可证,可自由下载、安装和使用。
Screen Vision 支持哪些平台?
Screen Vision 跨平台运行,可在任意部署了 OpenClaw / Claude Code 的环境中使用(cross-platform)。
谁开发了 Screen Vision?
由 guitu917(@guitu917)开发并维护,当前版本 v1.1.0。