Description

Desktop automation via native OS accessibility trees using the agent-desktop CLI. Use when an AI agent needs to observe, interact with, or automate desktop a...

Usage (SKILL.md)

agent-desktop

Name: Agent Desktop
Author: lahfir

CLI tool enabling AI agents to observe and control desktop applications via native OS accessibility trees.

Core principle: agent-desktop is NOT an AI agent. It is a tool that AI agents invoke. It outputs structured JSON with ref-based element identifiers. The observation-action loop lives in the calling agent.

Installation

npm install -g agent-desktop
# or
bun install -g --trust agent-desktop

Requires macOS 12+ with Accessibility permission granted to your terminal.

Reference Files

Detailed documentation is split into focused reference files. Read them as needed:

Reference	Contents
`references/commands-observation.md`	snapshot, find, get, is, screenshot, list-surfaces — all flags, output examples
`references/commands-interaction.md`	click, type, set-value, select, toggle, scroll, drag, keyboard, mouse — choosing the right command
`references/commands-system.md`	launch, close, windows, clipboard, wait, batch, status, permissions, version
`references/workflows.md`	12 common patterns: forms, menus, dialogs, scroll-find, drag-drop, async wait, anti-patterns
`references/macos.md`	macOS permissions/TCC, AX API internals, smart activation chain, surfaces, Notification Center, troubleshooting

The Observe-Act Loop (Progressive Skeleton Traversal)

Use progressive skeleton traversal as the default approach. It explores the UI in two phases, a shallow skeleton overview followed by targeted drill-downs into regions of interest, which cuts token consumption by 78-96% for dense apps.

1. SKELETON → agent-desktop snapshot --skeleton --app "App" -i --compact
   Parse the overview. Identify the region containing your target.
   Regions show children_count (e.g., "Sidebar" with children_count: 42).
   Named containers at truncation boundary have refs for drill-down.

2. DRILL    → agent-desktop snapshot --root @e3 -i --compact
   Expand the target region. Now you see its interactive elements.

3. ACT      → agent-desktop click @e12  (or type, select, toggle...)

4. VERIFY   → agent-desktop snapshot --root @e3 -i --compact
   Re-drill the same region to confirm the state change.
   Scoped invalidation: only @e3's subtree refs are replaced.

5. REPEAT   → Continue drilling other regions or acting as needed.
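
The loop above can be sketched from the calling agent's side as a sequence of command lines. This is a minimal illustration, not part of agent-desktop: the helper name `loop_commands` is hypothetical, and in a real agent the region ref and target ref would come from parsing the previous command's JSON output rather than being passed in up front. Only flags shown in this document are used.

```python
import shlex
from typing import List


def loop_commands(app: str, region_ref: str, target_ref: str) -> List[List[str]]:
    """Return argv lists for one skeleton -> drill -> act -> verify pass.

    Hypothetical helper: it only assembles the commands listed above.
    """
    snapshot = ["agent-desktop", "snapshot"]
    flags = ["-i", "--compact"]
    return [
        snapshot + ["--skeleton", "--app", app] + flags,  # 1. SKELETON
        snapshot + ["--root", region_ref] + flags,        # 2. DRILL
        ["agent-desktop", "click", target_ref],           # 3. ACT
        snapshot + ["--root", region_ref] + flags,        # 4. VERIFY (same as DRILL)
    ]


if __name__ == "__main__":
    # Print the commands with shell quoting so they can be copied verbatim.
    for argv in loop_commands("App", "@e3", "@e12"):
        print(shlex.join(argv))
```

Building argv lists instead of shell strings avoids quoting problems when an app name contains spaces.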

When to skip skeleton and use full snapshot instead:

Simple apps with few elements (Finder, Calculator, TextEdit)
You already know the exact element name — use find instead
Surface snapshots (menus, sheets, alerts) — these are already focused

When skeleton shines:

Dense Electron apps (Slack, VS Code, Discord, Notion)
Any app where full snapshot exceeds ~50 refs
Multi-region workflows (sidebar + main content + toolbar)

Ref System

Refs assigned depth-first: @e1, @e2, @e3...
Only interactive elements get refs: button, textfield, checkbox, link, menuitem, tab, slider, combobox, treeitem, cell
In skeleton mode, named/described containers at truncation boundary also get refs (drill-down targets with empty available_actions)
Static text, groups, containers remain in tree for context but have no ref
Refs are deterministic within a snapshot but NOT stable across snapshots if the UI changed
After any action that changes the UI, re-drill the affected region or re-snapshot
Scoped invalidation: re-drilling --root @e3 only replaces refs from @e3's previous drill — refs from other regions and the skeleton itself are preserved
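
To make scoped invalidation concrete, here is a toy model an agent could keep of which refs it still considers usable. It is a sketch of the behaviour described above, not agent-desktop's internal refmap: the class `RefTracker` and its methods are hypothetical, and treating a fresh skeleton as invalidating every earlier ref is a conservative assumption.

```python
from typing import Dict, Iterable, Set


class RefTracker:
    """Toy bookkeeping for refs held by the calling agent (hypothetical)."""

    def __init__(self) -> None:
        self.skeleton: Set[str] = set()
        self.drills: Dict[str, Set[str]] = {}

    def record_skeleton(self, refs: Iterable[str]) -> None:
        # Assumption: a new skeleton snapshot makes all earlier refs unusable.
        self.skeleton = set(refs)
        self.drills = {}

    def record_drill(self, root: str, refs: Iterable[str]) -> None:
        # Re-drilling a root replaces only the refs from that root's
        # previous drill; other regions and the skeleton are untouched.
        self.drills[root] = set(refs)

    def is_live(self, ref: str) -> bool:
        return ref in self.skeleton or any(ref in s for s in self.drills.values())
```

For example, after drilling `@e3` and `@e4` and then re-drilling `@e3`, the refs from the `@e4` drill remain live while those from the first `@e3` drill do not.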

JSON Output Contract

Every command returns a JSON envelope on stdout:

Success: { "version": "1.0", "ok": true, "command": "snapshot", "data": { ... } }
Error: { "version": "1.0", "ok": false, "command": "click", "error": { "code": "STALE_REF", "message": "...", "suggestion": "..." } }

Exit codes: 0 success, 1 structured error, 2 argument error.
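
A calling agent might consume this contract roughly as follows. This is a sketch under assumptions: `Result`, `parse_envelope`, and `run_agent_desktop` are hypothetical names, and only the envelope fields shown above are read. `check=True` is deliberately not used, because exit code 1 still carries a structured error on stdout.

```python
import json
import subprocess
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Result:
    ok: bool
    command: str
    data: Any = None
    error_code: Optional[str] = None
    message: Optional[str] = None
    suggestion: Optional[str] = None


def parse_envelope(stdout: str) -> Result:
    """Parse the JSON envelope printed on stdout into a Result."""
    env = json.loads(stdout)
    if env.get("ok"):
        return Result(ok=True, command=env.get("command", ""), data=env.get("data"))
    err = env.get("error") or {}
    return Result(
        ok=False,
        command=env.get("command", ""),
        error_code=err.get("code"),
        message=err.get("message"),
        suggestion=err.get("suggestion"),
    )


def run_agent_desktop(args: List[str]) -> Result:
    """Run the CLI and parse its envelope (requires agent-desktop on PATH)."""
    proc = subprocess.run(
        ["agent-desktop", *args], capture_output=True, text=True
    )
    return parse_envelope(proc.stdout)
```

A call such as `run_agent_desktop(["click", "@e5"])` would then return a `Result` whose `error_code` and `suggestion` can drive recovery.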

Error Codes

Code	Meaning	Recovery
`PERM_DENIED`	Accessibility permission not granted	Grant in System Settings > Privacy > Accessibility
`ELEMENT_NOT_FOUND`	Ref not in current refmap	Re-run snapshot, use fresh ref
`APP_NOT_FOUND`	App not running	Launch it first
`ACTION_FAILED`	AX action rejected	Try alternative approach or coordinate-based click
`ACTION_NOT_SUPPORTED`	Element can't do this	Use different command
`STALE_REF`	Ref from old snapshot	Re-run snapshot
`WINDOW_NOT_FOUND`	No matching window	Check app name, use list-windows
`TIMEOUT`	Wait condition not met	Increase --timeout
`INVALID_ARGS`	Bad arguments	Check command syntax
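
The table maps naturally onto a lookup in the calling agent. The sketch below is illustrative only: the error codes come from the table, but the action labels on the right and the `next_step` helper are hypothetical names chosen for this example.

```python
# Error code -> recovery step, following the table above.
# The step labels are hypothetical; a real agent would map them to handlers.
RECOVERY = {
    "PERM_DENIED": "grant_accessibility_permission",
    "ELEMENT_NOT_FOUND": "resnapshot",
    "APP_NOT_FOUND": "launch_app",
    "ACTION_FAILED": "try_alternative_or_coordinates",
    "ACTION_NOT_SUPPORTED": "use_different_command",
    "STALE_REF": "resnapshot",
    "WINDOW_NOT_FOUND": "list_windows",
    "TIMEOUT": "increase_timeout",
    "INVALID_ARGS": "check_command_syntax",
}


def next_step(error_code: str) -> str:
    """Return the recovery step for an error code, or 'report' if unknown."""
    return RECOVERY.get(error_code, "report")
```

Falling back to a generic step for unknown codes keeps the agent from crashing if a later version of the CLI introduces codes not listed here.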

Command Quick Reference (53 commands)

Observation

agent-desktop snapshot --skeleton --app "App" -i --compact  # Skeleton overview (preferred)
agent-desktop snapshot --root @e3 -i --compact              # Drill into region
agent-desktop snapshot --app "App" -i                       # Full tree (simple apps)
agent-desktop snapshot --app "App" --surface menu -i        # Surface snapshot
agent-desktop screenshot --app "App" out.png                # PNG screenshot
agent-desktop find --app "App" --role button                # Search elements
agent-desktop get @e1 --property text                       # Read element property
agent-desktop is @e1 --property enabled                     # Check element state
agent-desktop list-surfaces --app "App"                     # Available surfaces

Interaction

agent-desktop click @e5                         # Click element
agent-desktop double-click @e3                  # Double-click
agent-desktop triple-click @e2                  # Triple-click (select line)
agent-desktop right-click @e5                   # Right-click (context menu)
agent-desktop type @e2 "hello"                  # Type text into element
agent-desktop set-value @e2 "new value"         # Set value directly
agent-desktop clear @e2                         # Clear element value
agent-desktop focus @e2                         # Set keyboard focus
agent-desktop select @e4 "Option B"             # Select dropdown option
agent-desktop toggle @e6                        # Toggle checkbox/switch
agent-desktop check @e6                         # Idempotent check
agent-desktop uncheck @e6                       # Idempotent uncheck
agent-desktop expand @e7                        # Expand disclosure
agent-desktop collapse @e7                      # Collapse disclosure
agent-desktop scroll @e1 --direction down       # Scroll element
agent-desktop scroll-to @e8                     # Scroll into view

Keyboard & Mouse

agent-desktop press cmd+c                       # Key combo
agent-desktop press return --app "App"          # Targeted key press
agent-desktop key-down shift                    # Hold key
agent-desktop key-up shift                      # Release key
agent-desktop hover @e5                         # Cursor to element
agent-desktop hover --xy 500,300                # Cursor to coordinates
agent-desktop drag --from @e1 --to @e5          # Drag between elements
agent-desktop mouse-click --xy 500,300          # Click at coordinates
agent-desktop mouse-move --xy 100,200           # Move cursor
agent-desktop mouse-down --xy 100,200           # Press mouse button
agent-desktop mouse-up --xy 300,400             # Release mouse button

App & Window

agent-desktop launch "System Settings"          # Launch and wait
agent-desktop close-app "TextEdit"              # Quit gracefully
agent-desktop close-app "TextEdit" --force      # Force kill
agent-desktop list-windows --app "Finder"       # List windows
agent-desktop list-apps                         # List running GUI apps
agent-desktop focus-window --app "Finder"       # Bring to front
agent-desktop resize-window --app "App" --width 800 --height 600
agent-desktop move-window --app "App" --x 0 --y 0
agent-desktop minimize --app "App"
agent-desktop maximize --app "App"
agent-desktop restore --app "App"

Notifications

agent-desktop list-notifications                # List all notifications
agent-desktop list-notifications --app "Slack"  # Filter by app
agent-desktop list-notifications --text "deploy" --limit 5  # Filter by text
agent-desktop dismiss-notification 1            # Dismiss by index
agent-desktop dismiss-all-notifications         # Dismiss all
agent-desktop dismiss-all-notifications --app "Slack"  # Dismiss all from app
agent-desktop notification-action 1 "Reply" --expected-app Slack   # Click action (with NC reorder guard)

Clipboard

agent-desktop clipboard-get                     # Read clipboard
agent-desktop clipboard-set "text"              # Write to clipboard
agent-desktop clipboard-clear                   # Clear clipboard

Wait

agent-desktop wait 1000                         # Pause 1 second
agent-desktop wait --element @e5 --timeout 5000 # Wait for element
agent-desktop wait --window "Title"             # Wait for window
agent-desktop wait --text "Done" --app "App"    # Wait for text
agent-desktop wait --menu --app "App"           # Wait for context menu
agent-desktop wait --menu-closed --app "App"    # Wait for menu dismissal
agent-desktop wait --notification --app "App"   # Wait for new notification
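
Since the suggested recovery for `TIMEOUT` is to increase `--timeout`, an agent could prepare a short escalation schedule for an element wait. This is a sketch with assumptions: `wait_attempts` is a hypothetical helper, the doubling factor is arbitrary, and the timeout is assumed to be in milliseconds (consistent with `wait 1000` pausing one second).

```python
from typing import List


def wait_attempts(
    ref: str, first_timeout_ms: int = 5000, attempts: int = 3, factor: int = 2
) -> List[List[str]]:
    """Return argv lists for successive waits with a growing --timeout."""
    return [
        [
            "agent-desktop", "wait",
            "--element", ref,
            "--timeout", str(first_timeout_ms * factor ** i),
        ]
        for i in range(attempts)
    ]
```

The agent would run the first command and move to the next only when the envelope reports `TIMEOUT`.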

System

agent-desktop status                            # Health check
agent-desktop permissions                       # Check permission
agent-desktop permissions --request             # Trigger permission dialog
agent-desktop version --json                    # Version info
agent-desktop batch '[...]' --stop-on-error     # Batch commands

Key Principles for Agents

Skeleton first, drill second. Start with --skeleton -i --compact for dense apps. Drill into regions with --root @ref. Full snapshot only for simple apps.
Use -i --compact flags. Filters to interactive elements and collapses empty wrappers, minimizing tokens.
Refs are ephemeral. Re-drill the affected region after any UI-changing action. Scoped invalidation keeps other refs intact.
Prefer refs over coordinates. click @e5 > mouse-click --xy 500,300.
Use wait for async UI. After launch/dialog triggers, wait for expected state.
Check permissions first. Run permissions on first use.
Handle errors. Parse error.code and follow error.suggestion.
Use find for targeted searches. Faster than any snapshot when you know role/name.
Use surfaces for overlays. snapshot --surface menu for menus, --surface sheet for dialogs. Never --skeleton for surfaces — they're already focused.
Batch for performance. Multiple commands in one invocation.

Security Recommendations

This skill appears functional and internally consistent for macOS desktop automation, but exercise caution before installing and running it:

1) Do not blindly run `npm install -g agent-desktop` unless you verify the package source (check the npm page, GitHub repo, publisher, and recent release assets). The registry metadata here has no homepage or source listed.
2) Understand Accessibility (TCC) implications: granting Accessibility to a terminal gives processes launched from that terminal broad control and visibility over your desktop. Prefer granting permission only to a dedicated, minimal terminal app or use an isolated/test account or VM.
3) The CLI exposes screenshots, clipboard contents, and notifications; treat any agent that can call this skill as able to read sensitive data.
4) If you want to proceed, inspect the npm package contents (or the project's repository) before installing, run it in an isolated environment first, and restrict which terminal binary you enable in System Settings.

If you cannot verify the package source, do not install.

Functional Analysis

Type: OpenClaw Skill
Name: agent-desktop
Version: 0.1.10

The agent-desktop skill bundle provides an AI agent with extensive control over the host's desktop environment, including the ability to capture screenshots (screenshot), read/set clipboard content (clipboard-get/set), and monitor system notifications (list-notifications). While these capabilities are directly aligned with the stated purpose of desktop automation and the documentation (e.g., SKILL.md, references/macos.md) focuses on legitimate technical patterns like 'Progressive Skeleton Traversal', they constitute high-risk system access. No explicit evidence of malicious intent or data exfiltration was found, but the broad scope of sensitive data accessible via the agent-desktop CLI justifies a suspicious classification.

Capability Assessment

✓ Purpose & Capability

Name, description, and SKILL.md consistently describe a macOS accessibility-based desktop automation CLI (snapshot, click, type, clipboard, notifications, screenshots). Required capabilities (Accessibility permission, ability to read UI trees and manipulate controls) match the stated purpose; there are no unrelated environment variables or extraneous dependencies declared.

ℹ Instruction Scope

The instructions stay within the stated domain (observe-act loop, progressive skeleton traversal, many commands for interaction and observation). However the command surface includes high-sensitivity operations: taking screenshots (returned as base64), reading clipboard contents, and listing/acting on notifications. Those are expected for a desktop automation tool but are privacy-sensitive; the SKILL.md does not place limits on what the calling agent should do with returned data (e.g., transmit it externally).

⚠ Install Mechanism

The skill is instruction-only but tells users to run `npm install -g agent-desktop` (or bun). The registry metadata lacks a homepage/source URL, so there is no provenance to verify the npm package or its publisher. Installing a global npm package runs arbitrary code on the machine — this is a non-trivial risk unless you verify the package source and contents first.

⚠ Credentials

No environment variables or credentials are requested (good). However the tool requires granting macOS Accessibility (TCC) permission to your terminal application — this grants any code run from that terminal the ability to control and read UI across apps. Combined with the tool's ability to read clipboard, notifications, and screenshots, this is broad and sensitive access; make sure you understand which terminal app you grant and that you trust the installed CLI and the calling agent.

✓ Persistence & Privilege

The skill does not request permanent inclusion (always:false), does not declare system-wide config writes, and is user-invocable. There is no explicit persistence or privilege escalation requested by the skill files provided.

Version History

v0.1.10

Release v0.1.13

v0.1.9

Release v0.1.12

v0.1.8

agent-desktop v0.1.8
- Added detailed SKILL.md documentation covering installation, usage principles, command reference, error codes, and recovery steps.
- Provides an overview of 54 CLI commands across observation, interaction, keyboard/mouse, app/window management, notifications, and clipboard.
- Clarifies ref-based element identification and the recommended observe-act loop for reliable automation.
- Documents macOS support (Phase 1) and notes plans for Windows and Linux compatibility.
- Explains JSON output format and structured error handling for all commands.

Metadata

Slug agent-desktop

Version 0.1.10

License MIT-0

Total installs 2

Current installs 2

Past versions 3

FAQ

What is Agent Desktop?

Desktop automation via native OS accessibility trees using the agent-desktop CLI. Use when an AI agent needs to observe, interact with, or automate desktop a... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 542 cumulative downloads to date.

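How do I install Agent Desktop?

Run the command "/install agent-desktop" in the OpenClaw or Claude Code chat box to install the skill in one step. The skill itself needs no extra configuration, but the agent-desktop CLI it invokes must be installed separately (via npm or bun) and the terminal must be granted Accessibility permission.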

Is Agent Desktop free?

Yes. Agent Desktop is completely free and released under the MIT-0 license, so it can be downloaded, installed, and used freely.

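Which platforms does Agent Desktop support?

The skill can be installed in any environment where OpenClaw / Claude Code is deployed, but the agent-desktop CLI it drives currently requires macOS 12+. Windows and Linux support is planned.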

Who develops Agent Desktop?

It is developed and maintained by lahfir (@lahfir). The current version is v0.1.10.
