Description

Execute cross-platform desktop tasks through a packaged desktop automation skill that guides the main agent to observe the screen, focus apps and windows, ca...

Usage (SKILL.md)

Desktop Agent Ops

Name: Desktop Agent Ops
Author: appergb

Use this skill as a main-agent operating manual for desktop GUI tasks.

MANDATORY: Auto-setup gate (FIRST ACTION, every time)

python3 <SKILL_DIR>/scripts/first_run_setup.py --check

If "ready": false, run setup (installs EVERYTHING automatically):

python3 <SKILL_DIR>/scripts/first_run_setup.py

Auto-installs on first run:

Platform detection (macOS / Windows / Linux)
cliclick + tesseract (macOS via brew; Linux guide printed)
OCR language packs auto-detected from system locale (中文→chi_sim, 日本語→jpn, etc.)
Python venv + pillow, pyautogui, pytesseract, opencv-python, numpy (via uv or pip)
OS permissions (Screen Recording, Accessibility, Automation) with auto-open System Settings
Smoke test (screenshot + mouse move verification)

After setup, set $PY for ALL subsequent calls:

PY=<output.env.DESKTOP_AGENT_OPS_PYTHON>

Do NOT proceed if setup is not ready.
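
A minimal sketch of the gate decision in Python. It assumes the --check output is a JSON object with a top-level "ready" flag and the interpreter path under env.DESKTOP_AGENT_OPS_PYTHON, as the placeholders above suggest; the real output shape is not confirmed here, and parse_setup_status is a hypothetical helper, not part of the skill.

```python
import json


def parse_setup_status(check_output: str):
    """Return (ready, python_path) from the JSON printed by --check.

    Assumed shape (not confirmed by the skill):
    {"ready": bool, "env": {"DESKTOP_AGENT_OPS_PYTHON": str}}
    """
    status = json.loads(check_output)
    python_path = (status.get("env") or {}).get("DESKTOP_AGENT_OPS_PYTHON")
    # Fail closed: without an interpreter path there is nothing to assign to $PY.
    if not python_path:
        return False, None
    return status.get("ready") is True, python_path


sample = '{"ready": true, "env": {"DESKTOP_AGENT_OPS_PYTHON": "/tmp/venv/bin/python"}}'
ready, py = parse_setup_status(sample)
print(ready, py)
```

If ready comes back False, the caller would run the setup script and check again before doing anything else.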

Core Execution Loop

Every desktop task follows this loop. No exceptions.

 1. auto-setup gate           ← run once per session
 2. init task context          ← create isolated temp directory
 3. FOCUS the target app       ← bring app to front, confirm frontmost
 4. GET window bounds          ← know exact position and size
 5. CAPTURE that window        ← screenshot ONLY the target window
 6. ANALYZE the capture        ← read screenshot or run OCR
 7. LOCATE target via OCR      ← find text/button within window bounds
 8. VERIFY before acting       ← move cursor, screenshot with cursor, confirm
 9. EXECUTE one action         ← click, type, scroll, press key
10. CAPTURE again              ← screenshot to see result
11. VERIFY the result          ← did the UI change as expected?
12. → if more steps, go to 5
13. CLEANUP                    ← remove task temp directory

Key principles:

One action at a time. Never chain blind actions.
Always verify after each action. If verification fails, recapture and retry — do NOT guess.
Always work within a specific window. Never click based on full-screen assumptions.

Window-Scoped Targeting (THE CORRECT WAY)

NEVER do OCR or clicking on a full-screen screenshot. Always scope to the target app window.

The 6-Step Pipeline

┌─────────────────────────────────────────────────────────┐
│ Step 1: FOCUS the target app                            │
│   $PY desktop_ops.py focus-app --name "AppName"         │
│   → brings app to front                                 │
├─────────────────────────────────────────────────────────┤
│ Step 2: GET window bounds                               │
│   $PY desktop_ops.py front-window-bounds --app "AppName"│
│   → {x, y, width, height} in logical coordinates        │
├─────────────────────────────────────────────────────────┤
│ Step 3: CAPTURE only that window                        │
│   $PY desktop_ops.py capture-region --x X --y Y         │
│     --width W --height H --output /tmp/window.png       │
├─────────────────────────────────────────────────────────┤
│ Step 4: OCR within the window                           │
│   $PY ocr_text.py --app "AppName" --python $PY          │
│   → abs_box coordinates are INSIDE the window           │
├─────────────────────────────────────────────────────────┤
│ Step 5: VERIFY before clicking                          │
│   $PY desktop_ops.py move --x TX --y TY                 │
│   $PY desktop_ops.py screenshot --with-cursor            │
│   → confirm cursor is on the right element              │
├─────────────────────────────────────────────────────────┤
│ Step 6: CLICK only if verified                          │
│   $PY desktop_ops.py click --x TX --y TY                │
│   $PY desktop_ops.py screenshot → verify result          │
└─────────────────────────────────────────────────────────┘

Shortcut (RECOMMENDED for most targeting):

$PY scripts/target_resolver.py --app "AppName" --text "按钮文字" --python $PY

This single command: focuses app → gets bounds → OCR within window → returns best_candidate with {x, y, within_window}.
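
A sketch of how a caller might consume the resolver result. It assumes the command prints JSON with a best_candidate object carrying x, y, and within_window, as described above; the surrounding structure is an assumption, and click_point is a hypothetical helper.

```python
import json


def click_point(resolver_output: str):
    """Return (x, y) for the best candidate, or None if it must not be clicked.

    Assumed shape: {"best_candidate": {"x": int, "y": int, "within_window": bool}}
    """
    data = json.loads(resolver_output)
    candidate = data.get("best_candidate")
    # Refuse to act when nothing was found or the hit lies outside the window.
    if not candidate or not candidate.get("within_window"):
        return None
    return int(candidate["x"]), int(candidate["y"])


output = '{"best_candidate": {"x": 450, "y": 520, "within_window": true}}'
print(click_point(output))
```

A None result means: recapture and re-resolve rather than clicking.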

Why window-scoped matters:

Approach	Risk
❌ Full-screen OCR	"搜索" in WeChat AND Chrome → clicks wrong app
✅ Window-scoped	"搜索" ONLY in WeChat window → correct click
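
The table can be illustrated with a small bounds filter. This is a sketch under assumptions: bounds use the {x, y, width, height} logical-coordinate shape returned by front-window-bounds, OCR hits are represented as dicts with a center x and y, and both helper names are hypothetical.

```python
def within_bounds(x, y, bounds):
    """True if logical point (x, y) lies inside the window rectangle."""
    return (
        bounds["x"] <= x < bounds["x"] + bounds["width"]
        and bounds["y"] <= y < bounds["y"] + bounds["height"]
    )


def scope_to_window(candidates, bounds):
    """Keep only OCR hits whose center falls inside the target window."""
    return [c for c in candidates if within_bounds(c["x"], c["y"], bounds)]


wechat_bounds = {"x": 100, "y": 50, "width": 800, "height": 600}
hits = [
    {"text": "搜索", "x": 180, "y": 90},   # inside the WeChat window
    {"text": "搜索", "x": 1400, "y": 70},  # same text, but in another app's window
]
print(scope_to_window(hits, wechat_bounds))
```

Only the first hit survives the filter, which is the difference between the two rows of the table.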

Failure Recovery (CRITICAL)

When something fails, follow these rules:

OCR finds nothing

Re-focus the app: focus-app --name "AppName"
Re-get bounds: front-window-bounds --app "AppName" (window may have moved/resized)
Take a fresh screenshot and read it visually
Try a different region label (e.g. content_area instead of bottom_input)
Try lowering OCR confidence: --min-conf 30

Click doesn't work

Screenshot with cursor to check cursor position
The window may have moved — re-get bounds
Try clicking a few pixels offset from the OCR center
Check if a dialog/popup is blocking the target

App state changed (login screen, dialog, etc.)

ALWAYS re-get window bounds after any major UI change
ALWAYS re-run OCR after navigation or state change
Never reuse old coordinates — they may be stale

General retry rule

Maximum 3 retries per action
Each retry must recapture fresh state
If 3 retries fail, report the failure with screenshots and stop
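
The retry rule can be sketched as a small loop. Two assumptions: "3 retries" is read as one initial attempt plus up to three retries, and recapture, act, and verify are caller-supplied callables (hypothetical names) that wrap the real focus, bounds, OCR, and screenshot commands.

```python
def run_with_retries(recapture, act, verify, max_retries=3):
    """Try an action, recapturing fresh state before every attempt.

    Returns (succeeded, attempts_made). On failure the caller should
    report with screenshots and stop.
    """
    attempts = 0
    for _ in range(1 + max_retries):
        state = recapture()  # fresh bounds / screenshot / OCR, never reused
        act(state)
        attempts += 1
        if verify():
            return True, attempts
    return False, attempts


# Demo with stand-ins: verification fails twice, then succeeds.
results = iter([False, False, True])
captures = []
ok, attempts = run_with_retries(
    recapture=lambda: captures.append("fresh") or len(captures),
    act=lambda state: None,
    verify=lambda: next(results),
)
print(ok, attempts, len(captures))
```

Because recapture runs inside the loop, every attempt works from fresh state rather than stale coordinates.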

Generalization: How to Apply This to ANY App

The pipeline works for any desktop application. Here is how to reason about new apps:

Step-by-step for ANY new app:

Identify the app name exactly as it appears in the system (e.g. "Google Chrome", "微信", "System Settings")
Focus and get bounds — this tells you the window's exact position
Screenshot the window — look at what's on screen
Identify the target — what text, button, or area do you need to interact with?
Use OCR to find it — target_resolver.py --app "AppName" --text "target text"
Verify and click

Common patterns across apps:

Task	How to do it
Click a button	OCR find text → verify → click
Type in a field	OCR find field label → click field → `type --text`
Search for something	OCR find search box → click → type query → press return
Scroll a list	Get window bounds → scroll at window center with `--x --y`
Switch between apps	`focus-app --name "OtherApp"` → re-get bounds
Handle a dialog	Screenshot → OCR for dialog buttons → click appropriate one
Navigate menus	Click menu item → wait → screenshot → OCR new menu → click
Select from dropdown	Click dropdown → wait → OCR options → click selection
Read screen content	OCR the window → extract all text boxes
Verify an action	Screenshot before and after → compare or OCR for expected text

App-specific adaptations:

App type	Special considerations
Chat apps (WeChat, Slack, etc.)	Verify conversation title before typing; use `insert-newline` for multi-line; verify send mechanism
Browsers (Chrome, Safari, etc.)	Address bar at top; content area varies; may need to handle tabs
System Settings	Deep navigation; panels change; re-get bounds after each navigation
File managers (Finder, Explorer)	Sidebar + content area; double-click to open; path bar for navigation
Editors (VS Code, TextEdit, etc.)	Tab bar + editor area; use hotkeys for save/undo; type in editor area

Text Input and Send Rules

Typing text

$PY scripts/desktop_ops.py type --text "your message"

Uses clipboard paste as primary method on all platforms — reliable for all languages including CJK
macOS: AppleScript "set the clipboard to" + Cmd+V (single osascript call)
Windows: PowerShell Set-Clipboard + Ctrl+V (falls back to clip.exe)
Linux: xclip + Ctrl+V
First click on the input field to focus it before typing

Multi-line messages

$PY scripts/desktop_ops.py type --text "first line"
$PY scripts/desktop_ops.py insert-newline
$PY scripts/desktop_ops.py type --text "second line"

Use insert-newline for literal line breaks
Do NOT use \n in type --text — it may trigger send in some apps
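
A sketch of how a multi-line message could be split into the command sequence shown above, so that no newline character ever reaches type --text. The helper name typing_commands is hypothetical; it only builds subcommand argument lists and does not run anything.

```python
def typing_commands(message: str):
    """Split a message into desktop_ops.py subcommand argument lists.

    Each line becomes a `type --text` call; each line break becomes a
    separate `insert-newline` call.
    """
    commands = []
    for index, line in enumerate(message.split("\n")):
        if index > 0:
            commands.append(["insert-newline"])
        if line:  # skip empty lines: there is nothing to type
            commands.append(["type", "--text", line])
    return commands


for command in typing_commands("first line\nsecond line"):
    print(command)
```

Each returned list would be appended to `$PY scripts/desktop_ops.py` and executed one at a time, with verification between actions.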

Sending a message

Preferred: Look for a visible send button (e.g., 发送) via OCR, then click it
Alternative: Use press --key return ONLY when the app is verified to use Enter-to-send
Never guess which send method to use — verify first

Backend priority (macOS)

Operation	Primary	Fallback
`type`	Clipboard paste	cliclick (ASCII only)
`press`	AppleScript `key code`	cliclick `kp:`
`hotkey`	cliclick `kd:/t:/ku:`	pyautogui
`click`	cliclick	pyautogui

Important: cliclick kp:return is NOT recognized by WeChat. Always use AppleScript for key press.
Important: cliclick t: silently drops CJK characters. Always use clipboard paste for text input.

DPI / HiDPI / Retina (All Platforms)

Handled automatically. No manual DPI work needed.

Platform	Common scales	Detection method
macOS Retina	2.0x	screenshot pixels ÷ logical screen bounds
Windows HiDPI	1.25x, 1.5x, 2.0x	screenshot pixels ÷ pyautogui.size()
Linux X11	1.0x, 1.5x, 2.0x	screenshot pixels ÷ pyautogui.size()

OCR output: box = logical (use for mouse), pixel_box = raw pixels, dpi_scale = factor.
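
For orientation, the relationship between the three fields can be sketched as below. The skill already performs this conversion, so the code is illustrative only; it assumes boxes are dicts with x, y, width, and height keys, which the document does not confirm, and both helper names are hypothetical.

```python
def to_logical(pixel_box, dpi_scale):
    """Convert a raw-pixel box to logical coordinates by dividing by the scale."""
    return {
        key: round(pixel_box[key] / dpi_scale)
        for key in ("x", "y", "width", "height")
    }


def box_center(box):
    """Center of a logical box: the point a mouse action would target."""
    return box["x"] + box["width"] // 2, box["y"] + box["height"] // 2


# On a 2.0x Retina display, screenshot pixels are twice the logical size.
pixel_box = {"x": 900, "y": 1040, "width": 120, "height": 40}
box = to_logical(pixel_box, dpi_scale=2.0)
print(box, box_center(box))
```

Passing pixel_box values straight to move or click on a 2.0x display would land at twice the intended offset, which is why box is the field to use for mouse actions.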

CLI Quick Reference (EXACT parameter names)

CRITICAL: Use EXACTLY these names. Do NOT guess.

desktop_ops.py

$PY scripts/desktop_ops.py screenshot [--output PATH] [--x X --y Y --width W --height H] [--with-cursor]
$PY scripts/desktop_ops.py capture-region --x X --y Y --width W --height H [--output PATH] [--with-cursor]
$PY scripts/desktop_ops.py frontmost
$PY scripts/desktop_ops.py list-apps
$PY scripts/desktop_ops.py front-window-bounds [--app NAME]
$PY scripts/desktop_ops.py focus-app --name "App Name"
$PY scripts/desktop_ops.py move --x X --y Y [--duration SECONDS]
$PY scripts/desktop_ops.py click [--x X --y Y] [--button left|right|middle]
$PY scripts/desktop_ops.py double-click [--x X --y Y] [--button left|right|middle]
$PY scripts/desktop_ops.py drag --x1 X1 --y1 Y1 --x2 X2 --y2 Y2 [--duration SEC] [--button left]
$PY scripts/desktop_ops.py scroll --amount N [--x X --y Y] [--direction vertical|horizontal]
$PY scripts/desktop_ops.py mouse-position
$PY scripts/desktop_ops.py press --key KEY
$PY scripts/desktop_ops.py type --text "text to type"
$PY scripts/desktop_ops.py insert-newline [--count N]
$PY scripts/desktop_ops.py hotkey --keys cmd c
$PY scripts/desktop_ops.py screen-size
$PY scripts/desktop_ops.py pixel-color --x X --y Y
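
One way to avoid mistyped parameter names is to build argument lists in one place. The sketch below is a hypothetical wrapper, not part of the skill: it maps Python keyword names to the flags listed above (underscores become hyphens) and covers only single-value and boolean flags, so multi-value options such as hotkey --keys cmd c are out of its scope.

```python
def desktop_ops_argv(python, subcommand, **options):
    """Build an argv list for scripts/desktop_ops.py from keyword options."""
    argv = [python, "scripts/desktop_ops.py", subcommand]
    for name, value in options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            argv.append(flag)  # boolean switch, e.g. --with-cursor
        elif value is not None and value is not False:
            argv.extend([flag, str(value)])
    return argv


print(desktop_ops_argv("python3", "click", x=450, y=520, button="left"))
print(desktop_ops_argv("python3", "screenshot", with_cursor=True))
```

The resulting list can be passed to subprocess.run without shell quoting concerns; replace "python3" with the interpreter path stored in $PY.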

ocr_text.py

$PY scripts/ocr_text.py --app "AppName" --python $PY [--region-label LABEL] [--lang auto]
$PY scripts/ocr_text.py --image /path/to/capture.png --python $PY [--lang auto]

target_resolver.py

$PY scripts/target_resolver.py --app "AppName" --text "text" --python $PY
$PY scripts/target_resolver.py --app "AppName" --template /path/icon.png --python $PY
$PY scripts/target_resolver.py --app "AppName" --text "text" --region-label LABEL --python $PY

task_context.py / cleanup_task.py

$PY scripts/task_context.py init --task-id "my-task"   # aliases: create, --name
$PY scripts/task_context.py show --task-id "my-task"
$PY scripts/cleanup_task.py --task-id "my-task"

window_regions.py

$PY scripts/window_regions.py --window-x X --window-y Y --window-width W --window-height H [--label LABEL]

Labels: top_search, left_sidebar, left_sidebar_top, title_header, content_area, toolbar_row, bottom_input, primary_action

Workflow Examples

Example 1: Click a button by text (any app)

1. $PY first_run_setup.py --check                           → ready: true
2. $PY task_context.py init --task-id "click-button"
3. $PY desktop_ops.py focus-app --name "AppName"
4. $PY desktop_ops.py front-window-bounds --app "AppName"    → {x, y, w, h}
5. $PY target_resolver.py --app "AppName" --text "OK" --python $PY
   → best_candidate: {x:450, y:520, within_window:true}
6. $PY desktop_ops.py move --x 450 --y 520
7. $PY desktop_ops.py screenshot --with-cursor               → verify cursor on "OK"
8. $PY desktop_ops.py click --x 450 --y 520
9. $PY desktop_ops.py screenshot                             → verify result
10. $PY cleanup_task.py --task-id "click-button"

Example 2: Type and search

1. $PY desktop_ops.py focus-app --name "Safari"
2. $PY target_resolver.py --app "Safari" --text "Search" --region-label top_search --python $PY
   → {x:300, y:80, within_window:true}
3. $PY desktop_ops.py click --x 300 --y 80
4. $PY desktop_ops.py type --text "hello world"
5. $PY desktop_ops.py press --key return
6. $PY desktop_ops.py screenshot                             → verify search results

Example 3: Send a chat message (WeChat, Slack, etc.)

1. $PY desktop_ops.py focus-app --name "WeChat"
2. $PY desktop_ops.py front-window-bounds --app "WeChat"
3. # Navigate to the right conversation (OCR sidebar or search)
4. $PY target_resolver.py --app "WeChat" --text "ContactName" --region-label left_sidebar --python $PY
5. $PY desktop_ops.py click --x <found_x> --y <found_y>
6. # Verify conversation is open
7. $PY desktop_ops.py screenshot → confirm conversation title
8. # Click the input field
9. $PY target_resolver.py --app "WeChat" --text "" --region-label bottom_input --python $PY
   OR: click at the bottom center of the window
10. $PY desktop_ops.py type --text "Hello!"
11. # Send: prefer visible send button; if not available, use press --key return
12. $PY target_resolver.py --app "WeChat" --text "发送" --python $PY
    IF found: $PY desktop_ops.py click --x <x> --y <y>
    ELSE: $PY desktop_ops.py press --key return
13. $PY desktop_ops.py screenshot → verify message sent

Example 4: Scroll a list and find an item

1. $PY desktop_ops.py focus-app --name "AppName"
2. $PY desktop_ops.py front-window-bounds --app "AppName"   → {x:100, y:50, w:800, h:600}
3. # Scroll down in the window center
   $PY desktop_ops.py scroll --amount -5 --x 500 --y 350
4. $PY desktop_ops.py screenshot                             → check if target visible
5. $PY target_resolver.py --app "AppName" --text "target item" --python $PY
6. IF not found: scroll more and retry (max 5 scrolls)
7. IF found: click it
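
The scroll point and the bounded search in this example can be sketched as follows. Assumptions: bounds use the width and height keys from front-window-bounds (the example abbreviates them as w and h), and locate and scroll are caller-supplied callables (hypothetical names) wrapping target_resolver.py and the scroll command.

```python
def window_center(bounds):
    """Logical center of a window, used as the scroll position."""
    return (
        bounds["x"] + bounds["width"] // 2,
        bounds["y"] + bounds["height"] // 2,
    )


def find_by_scrolling(locate, scroll, max_scrolls=5):
    """Look for a target, scrolling between attempts, up to max_scrolls times."""
    for scrolls_done in range(max_scrolls + 1):
        found = locate()  # fresh OCR after every scroll
        if found is not None:
            return found
        if scrolls_done < max_scrolls:
            scroll()
    return None


bounds = {"x": 100, "y": 50, "width": 800, "height": 600}
print(window_center(bounds))
```

For the bounds in step 2 this yields the (500, 350) used in step 3.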

Example 5: Handle an unexpected dialog

1. # During any operation, if the expected UI doesn't match:
2. $PY desktop_ops.py screenshot → examine what's on screen
3. # If a dialog is visible, OCR it:
   $PY ocr_text.py --app "AppName" --python $PY
4. # Find and click the appropriate button (OK, Cancel, Allow, etc.)
   $PY target_resolver.py --app "AppName" --text "OK" --python $PY
5. $PY desktop_ops.py click --x <x> --y <y>
6. # After dialog is dismissed, re-get window bounds and continue
   $PY desktop_ops.py front-window-bounds --app "AppName"

Reference Documents

Load as needed:

Document	When to read
`references/workflow.md`	Core 8-step closed loop
`references/platform-macos.md`	macOS-specific tools and permissions
`references/platform-windows.md`	Windows setup
`references/platform-linux.md`	Linux X11/Wayland setup
`references/operation-patterns.md`	Reusable task templates
`references/validation-patterns.md`	Two-stage validation
`references/precise-targeting.md`	5-layer precision targeting
`references/target-providers.md`	Provider ordering and fallback contract
`references/coordinate-reconstruction.md`	Rebuild click coordinates from screenshot evidence
`references/chat-app-macos.md`	Chat app workflow
`references/app-wechat-desktop.md`	Cross-platform WeChat guidance
`references/cleanup-rules.md`	Cleanup timing and scope
`references/collaboration-rules.md`	When multi-agent collaboration is justified
`references/example-cases.md`	Repeatable task examples
`references/reproducible-setup.md`	Host bring-up checklist

Scope

Use this skill for: chat apps, browsers, file managers, editors, office apps, system settings, any closed desktop software with no usable API.

Hard Rules

Always run auto-setup gate first
Always use EXACT parameter names from CLI reference — never guess
Always scope OCR to the target app window — NEVER full-screen OCR
Always: focus-app → front-window-bounds → OCR within window → verify → act
Always pass --python $PY to ocr_text.py and target_resolver.py
Always verify coordinates are within window bounds before clicking
Always re-get window bounds after any UI state change (login, dialog, navigation)
Use insert-newline for line breaks; never use \n in type --text
For send actions: prefer visible send button; use press --key return only when verified
One action at a time; verify after each
Maximum 3 retries per action; each retry must recapture fresh state
Cleanup is mandatory at task end
If verification fails, recapture and rebuild — do not retry blindly

Security Recommendations

This skill appears to do what it says (screen capture, OCR, focus apps, mouse/keyboard actions). Before installing or running it:

Audit the installer script (scripts/first_run_setup.py) and any bootstrap scripts for shell commands or network calls you don't expect. The package will run installs and create a venv on first run.
Expect and approve OS permission prompts (Accessibility, Screen Recording, Automation). Granting these allows the skill to observe the screen and drive input; treat that as granting powerful local access.
Because the skill can operate chat apps, do not run it on machines containing sensitive accounts or private conversations until you've tested it in a safe sandbox account.
Run first_run_setup.py and the smoke tests in a controlled environment (VM or throwaway account) first to confirm exactly what is installed and what the smoke test does.
Search the bundled scripts for any outbound network activity or telemetry (HTTP, sockets, uploads). If present, verify endpoints and purpose before proceeding.
If you are uncomfortable running an automatic installer, create an explicit Python virtualenv yourself, point the skill at it, and install dependencies manually after review.

Capability Assessment

✓ Purpose & Capability

The name/description (cross‑platform desktop GUI automation) aligns with the requested binaries (python3, cliclick/xdotool) and with the included scripts (capture, OCR, window bounds, click/type helpers). The references and workflows focus on chat apps and other desktop targets, which justifies OCR and input tooling.

ℹ Instruction Scope

SKILL.md instructs the agent to run the bundled scripts to capture screenshots, run OCR, focus windows, and generate input events — all of which are necessary for the stated purpose. It mandates an initial auto-setup step that can install tooling, create a venv, open OS permission dialogs, and run smoke tests. Those actions are within scope but are invasive (screen recording, accessibility, automatic installs) and can control arbitrary desktop apps, including sending messages if the agent follows chat workflows. There is no guidance to contact external endpoints in the instructions.

⚠ Install Mechanism

The registry shows no formal installer spec, but SKILL.md and the included scripts perform a local auto-setup (first_run_setup.py) that installs system binaries (cliclick, tesseract via brew on macOS), Python dependencies (venv + pip/uv installs), and OCR language packs. Auto-install behavior is supported by the package itself (scripts are included and non-empty). This is coherent with the skill's goals but raises the usual risks of running an arbitrary bundled installer script without auditing it first (it will write to disk, may run shell commands, and open system settings).

✓ Credentials

The skill does not request credentials, environment variables, or external API tokens. That matches expectations for a purely local desktop automation tool. No unrelated secrets or config paths are declared.

ℹ Persistence & Privilege

The skill is not marked always:true and does not request special platform-level config paths in the manifest. However the scripts are designed to create an external venv and temporary task directories and to prompt for OS permissions (Accessibility, Screen Recording, Automation). Those are necessary for GUI automation but are high-impact permissions and require explicit user grant at the OS level.

Version History

v1.0.3

v1.0.3: Full skill bundle with scripts and references. 7.6x faster, CJK fix, Enter-to-send fix, 12 bug fixes, 8 new example cases

v1.0.2

desktop-agent-ops 1.0.2

Improved auto-setup: automates environment checks, installs dependencies, and verifies permissions for macOS, Windows, and Linux.
Introduced a mandatory setup gate to ensure all prerequisites and system permissions are in place before execution.
Described a strict, step-by-step execution loop with built-in verification and per-step recovery guidance.
Standardized all GUI operations to be window-scoped for accurate targeting, preventing cross-app mistakes.
Added clear recovery procedures for OCR failures, clicks, state changes, and retries.
Provided generalized instructions and patterns for controlling any desktop app across platforms.

Metadata

Slug desktop-agent-ops

Version 1.0.3

License MIT-0

Total installs 0

Current installs 0

Past versions 2

FAQ