Desktop Agent Ops
Use this skill as a main-agent operating manual for desktop GUI tasks.
MANDATORY: Auto-setup gate (FIRST ACTION, every time)
python3 <SKILL_DIR>/scripts/first_run_setup.py --check
If "ready": false, run setup (installs EVERYTHING automatically):
python3 <SKILL_DIR>/scripts/first_run_setup.py
Auto-installs on first run:
- Platform detection (macOS / Windows / Linux)
- cliclick + tesseract (macOS via brew; on Linux an install guide is printed)
- OCR language packs auto-detected from system locale (中文→chi_sim, 日本語→jpn, etc.)
- Python venv + pillow, pyautogui, pytesseract, opencv-python, numpy (via uv or pip)
- OS permissions (Screen Recording, Accessibility, Automation) with auto-open System Settings
- Smoke test (screenshot + mouse move verification)
After setup, set $PY for ALL subsequent calls:
PY=<output.env.DESKTOP_AGENT_OPS_PYTHON>
Do NOT proceed if setup is not ready.
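The gate can be scripted. The sketch below assumes that `first_run_setup.py --check` prints a single JSON object with a top-level `ready` flag and an `env.DESKTOP_AGENT_OPS_PYTHON` entry, which is what the `"ready": false` and `output.env.DESKTOP_AGENT_OPS_PYTHON` references above suggest. The exact output shape is not confirmed here, and `interpreter_from_report` is a hypothetical helper name.

```python
import json


def interpreter_from_report(report_text: str) -> str:
    """Return the interpreter path from a setup report, or raise if not ready.

    Assumed (not confirmed) report shape:
    {"ready": true, "env": {"DESKTOP_AGENT_OPS_PYTHON": "/path/to/python"}}
    """
    report = json.loads(report_text)
    if report.get("ready") is not True:
        raise RuntimeError("setup is not ready; run first_run_setup.py without --check")
    python_path = report.get("env", {}).get("DESKTOP_AGENT_OPS_PYTHON")
    if not python_path:
        raise RuntimeError("report is missing env.DESKTOP_AGENT_OPS_PYTHON")
    return python_path


# Illustrative report text; a real run would capture the script's stdout instead.
sample = '{"ready": true, "env": {"DESKTOP_AGENT_OPS_PYTHON": "/tmp/venv/bin/python"}}'
print(interpreter_from_report(sample))
```

Raising instead of returning a default keeps the "do NOT proceed" rule enforceable: a caller cannot accidentally continue with a missing interpreter.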
Core Execution Loop
Every desktop task follows this loop. No exceptions.
1. auto-setup gate ← run once per session
2. init task context ← create isolated temp directory
3. FOCUS the target app ← bring app to front, confirm frontmost
4. GET window bounds ← know exact position and size
5. CAPTURE that window ← screenshot ONLY the target window
6. ANALYZE the capture ← read screenshot or run OCR
7. LOCATE target via OCR ← find text/button within window bounds
8. VERIFY before acting ← move cursor, screenshot with cursor, confirm
9. EXECUTE one action ← click, type, scroll, press key
10. CAPTURE again ← screenshot to see result
11. VERIFY the result ← did the UI change as expected?
12. → if more steps, go to 5
13. CLEANUP ← remove task temp directory
Key principles:
- One action at a time. Never chain blind actions.
- Always verify after each action. If verification fails, recapture and retry — do NOT guess.
- Always work within a specific window. Never click based on full-screen assumptions.
Window-Scoped Targeting (THE CORRECT WAY)
NEVER do OCR or clicking on a full-screen screenshot. Always scope to the target app window.
The 6-Step Pipeline
┌─────────────────────────────────────────────────────────┐
│ Step 1: FOCUS the target app │
│ $PY desktop_ops.py focus-app --name "AppName" │
│ → brings app to front │
├─────────────────────────────────────────────────────────┤
│ Step 2: GET window bounds │
│ $PY desktop_ops.py front-window-bounds --app "AppName"│
│ → {x, y, width, height} in logical coordinates │
├─────────────────────────────────────────────────────────┤
│ Step 3: CAPTURE only that window │
│ $PY desktop_ops.py capture-region --x X --y Y │
│ --width W --height H --output /tmp/window.png │
├─────────────────────────────────────────────────────────┤
│ Step 4: OCR within the window │
│ $PY ocr_text.py --app "AppName" --python $PY │
│ → abs_box coordinates are INSIDE the window │
├─────────────────────────────────────────────────────────┤
│ Step 5: VERIFY before clicking │
│ $PY desktop_ops.py move --x TX --y TY │
│ $PY desktop_ops.py screenshot --with-cursor │
│ → confirm cursor is on the right element │
├─────────────────────────────────────────────────────────┤
│ Step 6: CLICK only if verified │
│ $PY desktop_ops.py click --x TX --y TY │
│ $PY desktop_ops.py screenshot → verify result │
└─────────────────────────────────────────────────────────┘
Shortcut (RECOMMENDED for most targeting):
$PY scripts/target_resolver.py --app "AppName" --text "按钮文字" --python $PY
This single command: focuses app → gets bounds → OCR within window → returns best_candidate with {x, y, within_window}.
Why window-scoped matters:
| Approach | Risk |
|---|---|
| ❌ Full-screen OCR | "搜索" in WeChat AND Chrome → clicks wrong app |
| ✅ Window-scoped | "搜索" ONLY in WeChat window → correct click |
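A candidate coordinate can also be checked against the window bounds before any click. This is a minimal sketch, assuming bounds arrive as the `{x, y, width, height}` mapping in logical coordinates shown in Step 2; `point_in_window` is a hypothetical helper and not part of the skill's scripts.

```python
def point_in_window(x: float, y: float, bounds: dict) -> bool:
    """True when (x, y) lies inside bounds = {x, y, width, height}."""
    left, top = bounds["x"], bounds["y"]
    right = left + bounds["width"]
    bottom = top + bounds["height"]
    return left <= x < right and top <= y < bottom


# Illustrative bounds for a target window.
window = {"x": 100, "y": 50, "width": 800, "height": 600}
print(point_in_window(450, 520, window))   # True: inside the target window
print(point_in_window(1200, 520, window))  # False: would land on another app
```

The resolver already reports `within_window`, so this check is a second line of defense for coordinates that were computed or adjusted by hand.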
Failure Recovery (CRITICAL)
When something fails, follow these rules:
OCR finds nothing
- Re-focus the app: focus-app --name "AppName"
- Re-get bounds: front-window-bounds --app "AppName" (window may have moved/resized)
- Take a fresh screenshot and read it visually
- Try a different region label (e.g. content_area instead of bottom_input)
- Try lowering OCR confidence: --min-conf 30
Click doesn't work
- Screenshot with cursor to check cursor position
- The window may have moved — re-get bounds
- Try clicking a few pixels offset from the OCR center
- Check if a dialog/popup is blocking the target
App state changed (login screen, dialog, etc.)
- ALWAYS re-get window bounds after any major UI change
- ALWAYS re-run OCR after navigation or state change
- Never reuse old coordinates — they may be stale
General retry rule
- Maximum 3 retries per action
- Each retry must recapture fresh state
- If 3 retries fail, report the failure with screenshots and stop
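The retry rule can be sketched as a small loop. This is an illustration under stated assumptions, not the skill's own implementation: `capture`, `act`, and `verify` are hypothetical callables supplied by the caller, and "maximum 3 retries" is read here as three attempts in total.

```python
from typing import Callable, Optional


def act_with_retries(
    capture: Callable[[], object],
    act: Callable[[object], None],
    verify: Callable[[], bool],
    max_retries: int = 3,
) -> Optional[int]:
    """Run capture -> act -> verify up to max_retries times.

    Returns the 1-based attempt number that verified, or None if all failed.
    """
    for attempt in range(1, max_retries + 1):
        state = capture()  # fresh state on every attempt, never reused
        act(state)         # exactly one action per attempt
        if verify():
            return attempt
    return None            # caller should report the failure and stop


# Stand-in callables: verification passes once the action has run twice.
calls = {"capture": 0, "act": 0}


def fake_capture() -> int:
    calls["capture"] += 1
    return calls["capture"]


def fake_act(state: object) -> None:
    calls["act"] += 1


def fake_verify() -> bool:
    return calls["act"] >= 2


result = act_with_retries(fake_capture, fake_act, fake_verify)
print(result)  # 2
```

Because `capture` is called inside the loop, stale coordinates from a previous attempt cannot leak into the next one.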
Generalization: How to Apply This to ANY App
The pipeline works for any desktop application. Here is how to reason about new apps:
Step-by-step for ANY new app:
- Identify the app name exactly as it appears in the system (e.g. "Google Chrome", "微信", "System Settings")
- Focus and get bounds — this tells you the window's exact position
- Screenshot the window — look at what's on screen
- Identify the target — what text, button, or area do you need to interact with?
- Use OCR to find it: target_resolver.py --app "AppName" --text "target text"
- Verify and click
Common patterns across apps:
| Task | How to do it |
|---|---|
| Click a button | OCR find text → verify → click |
| Type in a field | OCR find field label → click field → type --text |
| Search for something | OCR find search box → click → type query → press return |
| Scroll a list | Get window bounds → scroll at window center with --x --y |
| Switch between apps | focus-app --name "OtherApp" → re-get bounds |
| Handle a dialog | Screenshot → OCR for dialog buttons → click appropriate one |
| Navigate menus | Click menu item → wait → screenshot → OCR new menu → click |
| Select from dropdown | Click dropdown → wait → OCR options → click selection |
| Read screen content | OCR the window → extract all text boxes |
| Verify an action | Screenshot before and after → compare or OCR for expected text |
App-specific adaptations:
| App type | Special considerations |
|---|---|
| Chat apps (WeChat, Slack, etc.) | Verify conversation title before typing; use insert-newline for multi-line; verify send mechanism |
| Browsers (Chrome, Safari, etc.) | Address bar at top; content area varies; may need to handle tabs |
| System Settings | Deep navigation; panels change; re-get bounds after each navigation |
| File managers (Finder, Explorer) | Sidebar + content area; double-click to open; path bar for navigation |
| Editors (VS Code, TextEdit, etc.) | Tab bar + editor area; use hotkeys for save/undo; type in editor area |
Text Input and Send Rules
Typing text
$PY scripts/desktop_ops.py type --text "your message"
- Uses clipboard paste as primary method on all platforms — reliable for all languages including CJK
  - macOS: AppleScript set the clipboard to + Cmd+V (single osascript call)
  - Windows: PowerShell Set-Clipboard + Ctrl+V (falls back to clip.exe)
  - Linux: xclip + Ctrl+V
- First click on the input field to focus it before typing
Multi-line messages
$PY scripts/desktop_ops.py type --text "first line"
$PY scripts/desktop_ops.py insert-newline
$PY scripts/desktop_ops.py type --text "second line"
- Use insert-newline for literal line breaks
- Do NOT use \n in type --text; it may trigger send in some apps
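A multi-line string can be turned into the documented command sequence mechanically. The sketch below only builds argument lists and does not run them; `typing_commands` is a hypothetical helper, and the default `py` and `script` values are placeholders for `$PY` and the script path.

```python
from typing import List


def typing_commands(message: str, py: str = "python3",
                    script: str = "scripts/desktop_ops.py") -> List[List[str]]:
    """Split a message on newlines into type / insert-newline command lines."""
    commands: List[List[str]] = []
    for index, line in enumerate(message.split("\n")):
        if index > 0:
            # Every line break becomes an explicit insert-newline call.
            commands.append([py, script, "insert-newline"])
        if line:
            # Empty segments need no type call; the newline already covers them.
            commands.append([py, script, "type", "--text", line])
    return commands


for command in typing_commands("first line\nsecond line"):
    print(" ".join(command))
```

Each list can then be passed to `subprocess.run(command, check=True)` one at a time, with a verification step between calls.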
Sending a message
- Preferred: Look for a visible send button (e.g., 发送) via OCR, then click it
- Alternative: Use press --key return ONLY when the app is verified to use Enter-to-send
- Never guess which send method to use; verify first
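The send rules reduce to a three-way decision. This is a hedged sketch: it assumes the send-button candidate is a mapping with the `x`, `y`, and `within_window` fields that `target_resolver.py` is described as returning, and `send_command` is a hypothetical helper that only picks arguments for `desktop_ops.py`.

```python
from typing import List, Optional


def send_command(send_button: Optional[dict],
                 enter_to_send_verified: bool) -> Optional[List[str]]:
    """Pick the desktop_ops.py arguments for sending, or None if nothing is safe."""
    if send_button and send_button.get("within_window"):
        # Preferred: click the visible send button found inside the window.
        return ["click", "--x", str(send_button["x"]), "--y", str(send_button["y"])]
    if enter_to_send_verified:
        # Alternative: Enter, only when the app is known to send on Enter.
        return ["press", "--key", "return"]
    return None  # neither method is verified: stop and re-inspect the UI


print(send_command({"x": 820, "y": 610, "within_window": True}, False))
print(send_command(None, True))
print(send_command(None, False))
```

Returning `None` rather than defaulting to Enter makes the "never guess" rule explicit in the control flow.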
Backend priority (macOS)
| Operation | Primary | Fallback |
|---|---|---|
| type | Clipboard paste | cliclick (ASCII only) |
| press | AppleScript key code | cliclick kp: |
| hotkey | cliclick kd:/t:/ku: | pyautogui |
| click | cliclick | pyautogui |
Important: cliclick kp:return is NOT recognized by WeChat; always use AppleScript for key press.
Important: cliclick t: silently drops CJK characters; always use clipboard paste for text input.
DPI / HiDPI / Retina (All Platforms)
Handled automatically. No manual DPI work needed.
| Platform | Common scales | Detection method |
|---|---|---|
| macOS Retina | 2.0x | screenshot pixels ÷ logical screen bounds |
| Windows HiDPI | 1.25x, 1.5x, 2.0x | screenshot pixels ÷ pyautogui.size() |
| Linux X11 | 1.0x, 1.5x, 2.0x | screenshot pixels ÷ pyautogui.size() |
OCR output: box = logical (use for mouse), pixel_box = raw pixels, dpi_scale = factor.
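The relationship between the three OCR fields can be illustrated with a short conversion. This is a sketch under assumptions: boxes are taken to be `(x, y, width, height)` tuples in screen coordinates, which is not confirmed above, and both helpers are hypothetical. The scripts already emit the logical `box`, so this only shows how `pixel_box` and `dpi_scale` relate to it.

```python
from typing import Sequence, Tuple


def pixel_to_logical(pixel_box: Sequence[float], dpi_scale: float) -> Tuple[float, ...]:
    """Divide every raw-pixel value by the scale factor to get logical units."""
    if dpi_scale <= 0:
        raise ValueError("dpi_scale must be positive")
    return tuple(value / dpi_scale for value in pixel_box)


def box_center(box: Sequence[float]) -> Tuple[float, float]:
    """Center of an (x, y, width, height) box: the point to move or click at."""
    x, y, width, height = box
    return (x + width / 2, y + height / 2)


# A 2.0x Retina capture: raw pixels are twice the logical coordinates.
logical = pixel_to_logical((860, 1020, 80, 40), 2.0)
print(logical)              # (430.0, 510.0, 40.0, 20.0)
print(box_center(logical))  # (450.0, 520.0)
```

Passing raw pixel coordinates to `move` or `click` on a 2.0x display would land the cursor at twice the intended offset, which is why only the logical `box` should drive the mouse.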
CLI Quick Reference (EXACT parameter names)
CRITICAL: Use EXACTLY these names. Do NOT guess.
desktop_ops.py
$PY scripts/desktop_ops.py screenshot [--output PATH] [--x X --y Y --width W --height H] [--with-cursor]
$PY scripts/desktop_ops.py capture-region --x X --y Y --width W --height H [--output PATH] [--with-cursor]
$PY scripts/desktop_ops.py frontmost
$PY scripts/desktop_ops.py list-apps
$PY scripts/desktop_ops.py front-window-bounds [--app NAME]
$PY scripts/desktop_ops.py focus-app --name "App Name"
$PY scripts/desktop_ops.py move --x X --y Y [--duration SECONDS]
$PY scripts/desktop_ops.py click [--x X --y Y] [--button left|right|middle]
$PY scripts/desktop_ops.py double-click [--x X --y Y] [--button left|right|middle]
$PY scripts/desktop_ops.py drag --x1 X1 --y1 Y1 --x2 X2 --y2 Y2 [--duration SEC] [--button left]
$PY scripts/desktop_ops.py scroll --amount N [--x X --y Y] [--direction vertical|horizontal]
$PY scripts/desktop_ops.py mouse-position
$PY scripts/desktop_ops.py press --key KEY
$PY scripts/desktop_ops.py type --text "text to type"
$PY scripts/desktop_ops.py insert-newline [--count N]
$PY scripts/desktop_ops.py hotkey --keys cmd c
$PY scripts/desktop_ops.py screen-size
$PY scripts/desktop_ops.py pixel-color --x X --y Y
ocr_text.py
$PY scripts/ocr_text.py --app "AppName" --python $PY [--region-label LABEL] [--lang auto]
$PY scripts/ocr_text.py --image /path/to/capture.png --python $PY [--lang auto]
target_resolver.py
$PY scripts/target_resolver.py --app "AppName" --text "text" --python $PY
$PY scripts/target_resolver.py --app "AppName" --template /path/icon.png --python $PY
$PY scripts/target_resolver.py --app "AppName" --text "text" --region-label LABEL --python $PY
task_context.py / cleanup_task.py
$PY scripts/task_context.py init --task-id "my-task" # aliases: create, --name
$PY scripts/task_context.py show --task-id "my-task"
$PY scripts/cleanup_task.py --task-id "my-task"
window_regions.py
$PY scripts/window_regions.py --window-x X --window-y Y --window-width W --window-height H [--label LABEL]
Labels: top_search, left_sidebar, left_sidebar_top, title_header, content_area, toolbar_row, bottom_input, primary_action
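Since region labels must match exactly, a caller can validate them before invoking any script. The sketch below hard-codes the eight labels listed above; `check_region_label` is a hypothetical helper, and the close-match hint comes from Python's standard `difflib`.

```python
import difflib

# The labels listed in the CLI reference for window_regions.py.
REGION_LABELS = (
    "top_search", "left_sidebar", "left_sidebar_top", "title_header",
    "content_area", "toolbar_row", "bottom_input", "primary_action",
)


def check_region_label(label: str) -> str:
    """Return label unchanged if it is a documented region label, else raise."""
    if label in REGION_LABELS:
        return label
    hint = difflib.get_close_matches(label, REGION_LABELS, n=1)
    suggestion = f"; did you mean {hint[0]!r}?" if hint else ""
    raise ValueError(f"unknown region label {label!r}{suggestion}")


print(check_region_label("bottom_input"))
```

Failing early on a misspelled label is cheaper than discovering it through an empty OCR result and a wasted retry.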
Workflow Examples
Example 1: Click a button by text (any app)
1. $PY first_run_setup.py --check → ready: true
2. $PY task_context.py init --task-id "click-button"
3. $PY desktop_ops.py focus-app --name "AppName"
4. $PY desktop_ops.py front-window-bounds --app "AppName" → {x, y, w, h}
5. $PY target_resolver.py --app "AppName" --text "OK" --python $PY
→ best_candidate: {x:450, y:520, within_window:true}
6. $PY desktop_ops.py move --x 450 --y 520
7. $PY desktop_ops.py screenshot --with-cursor → verify cursor on "OK"
8. $PY desktop_ops.py click --x 450 --y 520
9. $PY desktop_ops.py screenshot → verify result
10. $PY cleanup_task.py --task-id "click-button"
Example 2: Type and search
1. $PY desktop_ops.py focus-app --name "Safari"
2. $PY target_resolver.py --app "Safari" --text "Search" --region-label top_search --python $PY
→ {x:300, y:80, within_window:true}
3. $PY desktop_ops.py click --x 300 --y 80
4. $PY desktop_ops.py type --text "hello world"
5. $PY desktop_ops.py press --key return
6. $PY desktop_ops.py screenshot → verify search results
Example 3: Send a chat message (WeChat, Slack, etc.)
1. $PY desktop_ops.py focus-app --name "WeChat"
2. $PY desktop_ops.py front-window-bounds --app "WeChat"
3. # Navigate to the right conversation (OCR sidebar or search)
4. $PY target_resolver.py --app "WeChat" --text "ContactName" --region-label left_sidebar --python $PY
5. $PY desktop_ops.py click --x <found_x> --y <found_y>
6. # Verify conversation is open
7. $PY desktop_ops.py screenshot → confirm conversation title
8. # Click the input field
9. $PY target_resolver.py --app "WeChat" --text "" --region-label bottom_input --python $PY
OR: click at the bottom center of the window
10. $PY desktop_ops.py type --text "Hello!"
11. # Send: prefer visible send button; if not available, use press --key return
12. $PY target_resolver.py --app "WeChat" --text "发送" --python $PY
IF found: $PY desktop_ops.py click --x <x> --y <y>
ELSE (only if the app is verified to use Enter-to-send): $PY desktop_ops.py press --key return
13. $PY desktop_ops.py screenshot → verify message sent
Example 4: Scroll a list and find an item
1. $PY desktop_ops.py focus-app --name "AppName"
2. $PY desktop_ops.py front-window-bounds --app "AppName" → {x:100, y:50, w:800, h:600}
3. # Scroll down in the window center
$PY desktop_ops.py scroll --amount -5 --x 500 --y 350
4. $PY desktop_ops.py screenshot → check if target visible
5. $PY target_resolver.py --app "AppName" --text "target item" --python $PY
6. IF not found: scroll more and retry (max 5 scrolls)
7. IF found: click it
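The scroll coordinates in step 3 are the center of the window from step 2. A minimal sketch of that arithmetic follows, assuming bounds use the `width` and `height` keys from the pipeline description; `window_center` and `scroll_args` are hypothetical helpers that only build arguments for `desktop_ops.py scroll`.

```python
from typing import List, Tuple


def window_center(bounds: dict) -> Tuple[int, int]:
    """Center point of a window given {x, y, width, height} in logical units."""
    return (bounds["x"] + bounds["width"] // 2,
            bounds["y"] + bounds["height"] // 2)


def scroll_args(bounds: dict, amount: int) -> List[str]:
    """Arguments for one scroll step at the window center."""
    cx, cy = window_center(bounds)
    return ["scroll", "--amount", str(amount), "--x", str(cx), "--y", str(cy)]


bounds = {"x": 100, "y": 50, "width": 800, "height": 600}
print(window_center(bounds))    # (500, 350)
print(scroll_args(bounds, -5))
```

Recomputing the center from fresh bounds before each scroll keeps the pointer inside the window even if it moved between attempts.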
Example 5: Handle an unexpected dialog
1. # During any operation, if the expected UI doesn't match:
2. $PY desktop_ops.py screenshot → examine what's on screen
3. # If a dialog is visible, OCR it:
$PY ocr_text.py --app "AppName" --python $PY
4. # Find and click the appropriate button (OK, Cancel, Allow, etc.)
$PY target_resolver.py --app "AppName" --text "OK" --python $PY
5. $PY desktop_ops.py click --x <x> --y <y>
6. # After dialog is dismissed, re-get window bounds and continue
$PY desktop_ops.py front-window-bounds --app "AppName"
Reference Documents
Load as needed:
| Document | When to read |
|---|---|
| references/workflow.md | Core 8-step closed loop |
| references/platform-macos.md | macOS-specific tools and permissions |
| references/platform-windows.md | Windows setup |
| references/platform-linux.md | Linux X11/Wayland setup |
| references/operation-patterns.md | Reusable task templates |
| references/validation-patterns.md | Two-stage validation |
| references/precise-targeting.md | 5-layer precision targeting |
| references/target-providers.md | Provider ordering and fallback contract |
| references/coordinate-reconstruction.md | Rebuild click coordinates from screenshot evidence |
| references/chat-app-macos.md | Chat app workflow |
| references/app-wechat-desktop.md | Cross-platform WeChat guidance |
| references/cleanup-rules.md | Cleanup timing and scope |
| references/collaboration-rules.md | When multi-agent collaboration is justified |
| references/example-cases.md | Repeatable task examples |
| references/reproducible-setup.md | Host bring-up checklist |
Scope
Use this skill for: chat apps, browsers, file managers, editors, office apps, system settings, any closed desktop software with no usable API.
Hard Rules
- Always run auto-setup gate first
- Always use EXACT parameter names from CLI reference — never guess
- Always scope OCR to the target app window — NEVER full-screen OCR
- Always: focus-app → front-window-bounds → OCR within window → verify → act
- Always pass --python $PY to ocr_text.py and target_resolver.py
- Always verify coordinates are within window bounds before clicking
- Always re-get window bounds after any UI state change (login, dialog, navigation)
- Use insert-newline for line breaks; never use \n in type --text
- For send actions: prefer visible send button; use press --key return only when verified
- One action at a time; verify after each
- Maximum 3 retries per action; each retry must recapture fresh state
- Cleanup is mandatory at task end
- If verification fails, recapture and rebuild — do not retry blindly
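Installation
- Make sure OpenClaw is installed (local or Docker deployment)
- Enter the install command in the chat box: /install desktop-agent-ops
- After installation, call the skill by name or trigger it with /desktop-agent-ops
- Provide the required inputs according to the skill's parameter description to get structured output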
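What is Desktop Agent Ops?
Execute cross-platform desktop tasks through a packaged desktop automation skill that guides the main agent to observe the screen, focus apps and windows, ca... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 273 downloads to date.
How do I install Desktop Agent Ops?
Run the command "/install desktop-agent-ops" in the OpenClaw or Claude Code chat box for a one-step install; no additional configuration is needed.
Is Desktop Agent Ops free?
Yes. Desktop Agent Ops is completely free under the MIT-0 license and can be freely downloaded, installed, and used.
Which platforms does Desktop Agent Ops support?
Desktop Agent Ops is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed (macOS, Windows, Linux).
Who develops Desktop Agent Ops?
It is developed and maintained by TRIP (@appergb); the current version is v1.0.3.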