/install clawdcursor

If a human can do it on a screen, you can too. No API? No integration? No problem.

## USE AS A FALLBACK — NOT FIRST CHOICE

Before reaching for any ClawdCursor tool, ask:
- Is there a native API? (Gmail API, GitHub API, Slack API) → use the API
- Is there a CLI? (`git`, `npm`, `curl`) → use the CLI
- Can you edit the file directly? → do that
- Is there a browser automation layer? (Playwright, Puppeteer) → use that

None of the above work? Now use ClawdCursor. It's for the last mile.

---

## Modes at a Glance

| Mode | Command | Brain | Tools available |
|------|---------|-------|----------------|
| `serve` | `clawdcursor serve` | You (REST client) | All 42 tools via HTTP |
| `mcp` | `clawdcursor mcp` | You (MCP client) | All 42 tools via MCP stdio |
| `start` | `clawdcursor start` | Built-in LLM pipeline | All 42 tools + autonomous agent |

In `serve` and `mcp` modes, you reason and ClawdCursor acts. There is no built-in LLM: you call tools, interpret results, and decide the next steps.
\r
---\r
\r
## Connecting

### Option A — REST (`clawdcursor serve`)

```bash
clawdcursor serve   # starts on http://127.0.0.1:3847
```
\r
All POST endpoints require: `Authorization: Bearer <token>` (token saved to `~/.clawdcursor/token`)
\r
```\r
GET /tools → all tool schemas (OpenAI function-calling format)\r
POST /execute/{name} → run a tool: {"param": "value"}\r
GET /health → {"status":"ok","version":"0.7.5"}\r
GET /docs → full documentation\r
```\r
\r
Example:\r
```\r
POST /execute/get_windows {}\r
POST /execute/mouse_click {"x": 640, "y": 400}\r
POST /execute/type_text {"text": "hello world"}\r
```\r
\r
If the server isn't running, start it yourself — don't ask the user:\r
```bash\r
clawdcursor serve\r
# wait 2 seconds, then verify: GET /health\r
```\r
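
A minimal Python sketch of the REST calls above, using only the standard library. It assumes the default address and token path given in this section. The `Content-Type` header, the helper names (`read_token`, `build_execute_request`, `execute`), and the idea that every reply decodes as JSON are illustrative assumptions, not confirmed ClawdCursor behavior.

```python
import json
import urllib.request
from pathlib import Path
from typing import Any, Dict, Optional

BASE_URL = "http://127.0.0.1:3847"  # default address of `clawdcursor serve`
TOKEN_PATH = Path.home() / ".clawdcursor" / "token"


def read_token(path: Path = TOKEN_PATH) -> str:
    """Read the bearer token that ClawdCursor saves on disk."""
    return path.read_text(encoding="utf-8").strip()


def build_execute_request(
    name: str, params: Dict[str, Any], token: str
) -> urllib.request.Request:
    """Build (but do not send) a POST /execute/{name} request."""
    return urllib.request.Request(
        url=f"{BASE_URL}/execute/{name}",
        data=json.dumps(params).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            # Assumption: the server expects a JSON content type.
            "Content-Type": "application/json",
        },
        method="POST",
    )


def execute(
    name: str, params: Optional[Dict[str, Any]] = None, timeout: float = 30.0
) -> Any:
    """Send the request and decode the reply, assumed here to be JSON."""
    request = build_execute_request(name, params or {}, read_token())
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `execute("mouse_click", {"x": 640, "y": 400})` would send the second request from the example above.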
\r
### Option B — MCP (`clawdcursor mcp`)\r
\r
```json\r
{\r
"mcpServers": {\r
"clawdcursor": {\r
"command": "clawdcursor",\r
"args": ["mcp"]\r
}\r
}\r
}\r
```\r
\r
Works with Claude Code, Cursor, Windsurf, Zed, or any MCP-compatible client. All 42 tools are exposed identically.\r
\r
### Option C — Autonomous agent (`clawdcursor start`)\r
\r
```\r
POST /task {"task": "Open Notepad and write Hello"} → submit task\r
GET /status → {"status": "acting"} | "idle" | "waiting_confirm"\r
POST /confirm {"approved": true} → approve safety-gated action\r
POST /abort → stop current task\r
```\r
\r
Use the `delegate_to_agent` tool to submit tasks from within MCP/REST sessions. It requires `clawdcursor start` to be running on port 3847.
\r
**Polling pattern:**\r
```\r
POST /task {"task": "...", "returnPartial": true}\r
→ poll GET /status every 2s:\r
"acting" → still running, keep polling\r
"waiting_confirm" → STOP. Ask user → POST /confirm {"approved": true}\r
"idle" → done, check GET /task-logs for result\r
→ if 60s+ with no progress: POST /abort, retry with simpler phrasing\r
```\r
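
The polling pattern can be sketched as a small loop. This is one possible shape, not ClawdCursor's own client: the callables (`get_status`, `ask_user`, `confirm`, `abort`) are hypothetical wrappers you would write around `GET /status`, your own user prompt, `POST /confirm`, and `POST /abort`. Since only the status string is observed here, "no progress" is approximated as time elapsed since the start or since the last confirmation. Aborting when the user declines is also an assumption, because the documentation only shows the approving call.

```python
import time
from typing import Callable


def run_until_idle(
    get_status: Callable[[], str],
    ask_user: Callable[[], bool],
    confirm: Callable[[], None],
    abort: Callable[[], None],
    poll_seconds: float = 2.0,
    stall_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the agent is idle. Returns "done" or "aborted"."""
    last_progress = clock()
    while True:
        status = get_status()
        if status == "idle":
            return "done"
        if status == "waiting_confirm":
            # Never self-approve: the decision always comes from the user.
            if not ask_user():
                abort()
                return "aborted"
            confirm()
            last_progress = clock()
        elif clock() - last_progress >= stall_seconds:
            abort()
            return "aborted"
        sleep(poll_seconds)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.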
\r
**returnPartial mode** — send `{"returnPartial": true}` with POST /task:\r
ClawdCursor skips Stage 3 (expensive vision) and returns control to you if Stage 2 fails:\r
```json\r
{"partial": true, "stepsCompleted": [...], "context": "got stuck on dialog"}\r
```\r
You finish the task with MCP tools, then call POST /learn to save what worked.\r
\r
**POST /learn — adaptive learning:**\r
After completing a task with your own tool calls, teach ClawdCursor for next time:\r
```json\r
POST /learn\r
{\r
"processName": "EXCEL",\r
"task": "create table with headers",\r
"actions": [\r
{"action": "key", "description": "Ctrl+Home to go to A1"},\r
{"action": "type", "description": "Type header name"},\r
{"action": "key", "description": "Tab to next column"}\r
],\r
"shortcuts": {"next_cell": "Tab", "next_row": "Enter"},\r
"tips": ["Use Tab between columns, Enter between rows"]\r
}\r
```\r
This enriches the app's guide JSON. Stage 2 reads it on the next run — no vision fallback needed.\r
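
If you record your own tool calls as you go, the `/learn` body can be assembled from that record. The sketch below uses the field names from the JSON example above. The helper name `build_learn_payload` is hypothetical, as is the assumption that empty `shortcuts` and `tips` are acceptable when there is nothing to add.

```python
from typing import Dict, List, Optional, Tuple


def build_learn_payload(
    process_name: str,
    task: str,
    steps: List[Tuple[str, str]],
    shortcuts: Optional[Dict[str, str]] = None,
    tips: Optional[List[str]] = None,
) -> dict:
    """Assemble a POST /learn body from (action, description) pairs."""
    if not steps:
        raise ValueError("record at least one step before calling /learn")
    return {
        "processName": process_name,
        "task": task,
        "actions": [
            {"action": action, "description": description}
            for action, description in steps
        ],
        # Assumption: empty containers are fine when nothing was learned.
        "shortcuts": dict(shortcuts or {}),
        "tips": list(tips or []),
    }
```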
\r
---\r
\r
## The Universal Loop\r
\r
Every GUI task follows the same pattern regardless of transport:\r
\r
```\r
1. ORIENT → read_screen() or get_windows() see what's open and focused\r
2. ACT → smart_click() / smart_type() / key_press() do the thing\r
3. VERIFY → check return value → window state → text check → screenshot\r
4. REPEAT → until done\r
```\r
\r
### Verification (cheapest to most expensive)\r
\r
1. **Tool return value** — every tool reports success/failure. Check it first.\r
2. **Window state** — `get_active_window()`, `get_windows()` — did a dialog appear? Did the title change?\r
3. **Text check** — `read_screen()` or `smart_read()` — is the expected text visible?\r
4. **Screenshot** — `desktop_screenshot()` — only when text methods fail. Costs the most.\r
5. **Negative check** — look for error dialogs, wrong window, unchanged screen.\r
\r
**Always verify** after: sends, saves, deletes, form submissions.\r
**Skip verification** for: mid-sequence keystrokes, scrolling.\r
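
The cheapest-first ordering can be sketched as a short ladder that stops at the first conclusive answer. The three-valued result (`True`, `False`, `None` for inconclusive) and the `verify` helper are illustrative conventions, not part of ClawdCursor. Each check would be a small function of your own wrapping one of the tools listed above.

```python
from typing import Callable, Iterable, Optional, Tuple

# A check returns True (confirmed), False (failed) or None (inconclusive).
Check = Tuple[str, Callable[[], Optional[bool]]]


def verify(checks: Iterable[Check]) -> Tuple[bool, str]:
    """Run checks in the given order and stop at the first conclusive one.

    Pass the checks cheapest first so that a screenshot is only taken
    when the return value, window state and text checks could not decide.
    """
    for name, check in checks:
        outcome = check()
        if outcome is not None:
            return outcome, name
    return False, "inconclusive"
```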
\r
---\r
\r
## Tool Decision Trees\r
\r
### Perception — always start here\r
\r
```\r
read_screen() → FIRST. Accessibility tree: buttons, inputs, text, with coords.\r
Fast, structured, works on native apps.\r
ocr_read_screen() → When a11y tree is empty (canvas UIs, image-based apps).\r
smart_read() → Combines OCR + a11y. Good first call when unsure.\r
desktop_screenshot() → LAST RESORT. Only when you need pixel-level visual detail.\r
desktop_screenshot_region(x,y,w,h) → Zoomed crop when you need detail in one area.\r
```\r
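
As a rough sketch, the perception order above can be written as a fallback chain. `call_tool(name, params)` stands in for whichever transport you use (REST or MCP) and is hypothetical. So is treating a falsy result as "empty", because the actual result shapes are not specified here.

```python
from typing import Any, Callable, Dict, Tuple

ToolCaller = Callable[[str, Dict[str, Any]], Any]

# Cheapest first, screenshot last, as in the decision tree above.
PERCEPTION_ORDER = ("read_screen", "ocr_read_screen", "desktop_screenshot")


def perceive(
    call_tool: ToolCaller,
    is_empty: Callable[[Any], bool] = lambda result: not result,
) -> Tuple[str, Any]:
    """Return (tool_name, result) from the first tool that yields content."""
    result = None
    for tool in PERCEPTION_ORDER:
        result = call_tool(tool, {})
        if not is_empty(result):
            return tool, result
    # Every tool came back empty: report the last one tried.
    return PERCEPTION_ORDER[-1], result
```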
\r
### Clicking\r
\r
```\r
smart_click("Save") → FIRST. Finds by label/text via OCR + a11y, clicks.\r
Pass processId to target the right window.\r
invoke_element(name="Save") → When you know the exact automation ID from read_screen.\r
cdp_click(text="Submit") → Browser elements. Requires cdp_connect() first.\r
mouse_click(x, y) → LAST RESORT. Raw coordinates from a screenshot.\r
```\r
\r
### Typing\r
\r
```\r
smart_type("Email", "user@example.com") → FIRST. Finds field by label, focuses, types.
cdp_type(label="Email", text="…") → Browser inputs. Requires cdp_connect() first.\r
type_text("hello") → Clipboard paste into whatever is focused.\r
Use after manually focusing with smart_click.\r
```\r
\r
### Browser / CDP\r
\r
```\r
1. navigate_browser(url) → opens URL, auto-enables CDP\r
2. cdp_connect() → connect to browser DevTools Protocol\r
3. cdp_page_context() → list interactive elements on page\r
4. cdp_read_text() → extract DOM text (returns empty on canvas apps → use OCR)\r
5. cdp_click(text="…") → click by visible text\r
6. cdp_type(label, text) → fill input by label\r
7. cdp_evaluate(script) → run JavaScript in page context\r
8. cdp_scroll(direction, amount) → scroll page by amount px via DOM (not mouse wheel)
9. cdp_list_tabs() → list all open tabs\r
10. cdp_switch_tab(target) → switch to a specific tab\r
```\r
\r
If CDP isn't connected, switch tabs with keyboard:\r
```\r
key_press("ctrl+1") → tab 1\r
key_press("ctrl+tab") → next tab\r
key_press("ctrl+shift+tab") → previous tab\r
```\r
\r
### Window Management\r
\r
```\r
get_windows() → list all open windows (use to find PIDs)\r
get_active_window() → what's in the foreground right now\r
focus_window(processName="Discord") → bring to front (auto-minimizes phantom off-screen windows)\r
minimize_window(processName="calc") → minimize a window — 1 call, cross-platform\r
also accepts: processId, title\r
```\r
\r
**Rule:** Always `focus_window()` before `key_press()` or `type_text()`. Keystrokes go to whatever has focus; if that is your terminal, they land there instead of in the target app.
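
One way to make the focus rule hard to forget is to wrap it in a helper. In this sketch `call_tool(name, params)` is a hypothetical transport wrapper. The parameter names `processName` and `text` are the ones shown elsewhere in this document.

```python
from typing import Any, Callable, Dict

ToolCaller = Callable[[str, Dict[str, Any]], Any]


def type_into(call_tool: ToolCaller, process_name: str, text: str) -> Any:
    """Focus the target window first, then paste text into it."""
    call_tool("focus_window", {"processName": process_name})
    return call_tool("type_text", {"text": text})
```

Calling `type_into(call_tool, "notepad", "Hello")` always issues `focus_window` before `type_text`, so the text cannot end up in whichever window happened to be active.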
\r
### Canvas apps (Google Docs, Figma, Notion)\r
\r
DOM has no readable text. Pattern:\r
```\r
ocr_read_screen() → read content (DOM extraction fails)\r
mouse_click(x, y) → click into the canvas area\r
type_text("your text") → clipboard paste works even on canvas\r
```\r
\r
---\r
\r
## Quick Patterns\r
\r
**Open app and type:**\r
```\r
open_app("notepad") → wait(2) → smart_read() → type_text("Hello") → smart_read()\r
```\r
\r
**Read a webpage:**\r
```\r
navigate_browser(url) → wait(3) → cdp_connect() → cdp_read_text()\r
```\r
\r
**Fill a web form:**\r
```\r
cdp_connect() → cdp_type("Email", "user@example.com") → cdp_type("Password", "…") → cdp_click("Submit")
```\r
\r
**Cross-app copy/paste:**\r
```\r
focus_window("Chrome") → key_press("ctrl+a") → key_press("ctrl+c")\r
→ read_clipboard() → focus_window("Notepad") → type_text(clipboard)\r
```\r
\r
**Send email via Outlook:**\r
```\r
open_app("outlook") → wait(2) → smart_click("New Email")\r
→ mouse_click(to_field_x, to_field_y) → type_text("recipient@example.com") → key_press("Tab")
→ mouse_click(subject_x, subject_y) → type_text("Subject") → key_press("Tab")\r
→ mouse_click(body_x, body_y) → type_text("Body text")\r
→ mouse_click(send_x, send_y)\r
```\r
\r
**Autonomous complex task (requires `clawdcursor start`):**\r
```\r
delegate_to_agent("Open Gmail, find latest email from Stripe, forward to user@example.com")
→ poll GET /status every 2s\r
→ if waiting_confirm: ask user → POST /confirm {"approved": true}\r
→ if idle: task done\r
```\r
\r
---\r
\r
## Full Tool Reference (42 tools)\r
\r
Speed: ⚡ Free/instant · 🔵 Cheap · 🟡 Moderate · 🔴 Vision (expensive)\r
\r
### Perception (6)\r
| Tool | What it does | When |\r
|------|-------------|------|\r
| `read_screen` | A11y tree — buttons, inputs, text, coords | ⚡ Default first read |\r
| `smart_read` | OCR + a11y combined | 🔵 When unsure which to use |\r
| `ocr_read_screen` | Raw OCR text with bounding boxes | 🔵 Canvas UIs, empty a11y trees |\r
| `desktop_screenshot` | Full screen image (1280px wide) | ⚡ Last resort visual check |\r
| `desktop_screenshot_region` | Zoomed crop of specific area | ⚡ Fine-grained visual detail |\r
| `get_screen_size` | Screen dimensions and DPI | ⚡ Coordinate calculations |\r
\r
### Mouse (7)\r
| Tool | What it does | When |\r
|------|-------------|------|\r
| `smart_click` | Find element by text/label, click | 🔵 First choice for clicking |\r
| `mouse_click` | Left click at (x, y) | ⚡ Last resort |\r
| `mouse_double_click` | Double click at (x, y) | ⚡ Open files, select words |\r
| `mouse_right_click` | Right click at (x, y) | ⚡ Context menus |\r
| `mouse_hover` | Move cursor without clicking | ⚡ Hover menus |\r
| `mouse_scroll` | Scroll at position (physical mouse wheel) | ⚡ Scroll content |\r
| `mouse_drag` | Drag from start to end — accepts `startX/startY/endX/endY` or `x1/y1/x2/y2` | ⚡ Resize, select ranges |\r
\r
### Keyboard (5)\r
| Tool | What it does | When |\r
|------|-------------|------|\r
| `smart_type` | Find input by label, focus it, type | 🔵 First choice for form fields |\r
| `type_text` | Clipboard paste into focused element | ⚡ After manually focusing |\r
| `key_press` | Send key combo (`ctrl+s`, `Return`, `alt+tab`) | ⚡ After focus_window |\r
| `shortcuts_list` | List keyboard shortcuts for current app | ⚡ Before reaching for mouse |\r
| `shortcuts_execute` | Run a named shortcut (fuzzy match) | ⚡ Save, copy, paste, undo |\r
\r
### Window Management (5)\r
| Tool | What it does | When |\r
|------|-------------|------|\r
| `get_windows` | List all open windows with PIDs and bounds | ⚡ Situational awareness |\r
| `get_active_window` | Current foreground window | ⚡ Check current focus |\r
| `get_focused_element` | Element with keyboard focus | ⚡ Debug wrong-field typing |\r
| `focus_window` | Bring window to front (auto-clears off-screen phantoms) | ⚡ Always before key_press |\r
| `minimize_window` | Minimize by processName, processId, or title | ⚡ Clear focus stealers |\r
\r
### UI Elements (2)\r
| Tool | What it does | When |\r
|------|-------------|------|\r
| `find_element` | Search UI tree by name or type | ⚡ Find automation IDs |\r
| `invoke_element` | Invoke element by automation ID or name | ⚡ When ID known from read_screen |\r
\r
### Clipboard (2)\r
| Tool | What it does | When |\r
|------|-------------|------|\r
| `read_clipboard` | Read clipboard text | ⚡ After copy operations |\r
| `write_clipboard` | Write text to clipboard | ⚡ Before paste operations |\r
\r
### Browser / CDP (11)\r
| Tool | What it does | When |\r
|------|-------------|------|\r
| `cdp_connect` | Connect to browser DevTools Protocol | ⚡ First step for any browser task |\r
| `cdp_page_context` | List interactive elements on page | ⚡ After connect |\r
| `cdp_read_text` | Extract DOM text | ⚡ Read page content |\r
| `cdp_click` | Click by CSS selector or visible text | ⚡ Browser clicks |\r
| `cdp_type` | Type into input by label or selector | ⚡ Browser form filling |\r
| `cdp_select_option` | Select dropdown option | ⚡ Select elements |\r
| `cdp_evaluate` | Run JavaScript in page context | ⚡ Custom queries |\r
| `cdp_scroll` | Scroll page via DOM (`direction`, `amount` px) | ⚡ DOM-level scroll |\r
| `cdp_wait_for_selector` | Wait for element to appear | ⚡ After navigation/AJAX |\r
| `cdp_list_tabs` | List all browser tabs | ⚡ When on wrong tab |\r
| `cdp_switch_tab` | Switch to a tab by title or index | ⚡ After cdp_list_tabs |\r
\r
### Orchestration (4)\r
| Tool | What it does | When |\r
|------|-------------|------|\r
| `open_app` | Launch application by name | ⚡ First step for desktop tasks |\r
| `navigate_browser` | Open URL (auto-enables CDP) | ⚡ First step for browser tasks |\r
| `wait` | Pause N seconds | ⚡ After opening apps, let UI render |\r
| `delegate_to_agent` | Send task to built-in autonomous agent | 🟡 Complex multi-step tasks (requires `clawdcursor start`) |\r
\r
---\r
\r
## Provider Setup (agent mode only)\r
\r
| Provider | Setup | Cost |\r
|----------|-------|------|\r
| **Ollama** (local) | `ollama pull qwen2.5:7b && ollama serve` | $0 — fully offline, no data leaves machine |\r
| **Any cloud** | Set env var: `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `GEMINI_API_KEY`, `MOONSHOT_API_KEY`, etc. | Varies |\r
| **OpenClaw users** | Auto-detected from `~/.openclaw/agents/main/auth-profiles.json` | No extra setup |\r
\r
Run `clawdcursor doctor` to auto-detect and validate providers.\r
\r
---\r
\r
## Security\r
\r
- **Network isolation:** Binds to `127.0.0.1` only. Verify with `netstat -an | findstr 3847` (Windows) or `netstat -an | grep 3847` (macOS/Linux). The listener should be bound to `127.0.0.1`, never `0.0.0.0`.
- **Ollama:** 100% offline. Screenshots stay in RAM, never leave the machine.\r
- **Cloud providers:** Screenshots/text sent only to your configured provider. No telemetry, no analytics, no third-party logging.\r
- **Token auth:** All mutating POST endpoints require `Authorization: Bearer <token>`. Token at `~/.clawdcursor/token`.
- **Safety tiers:** Auto / Preview / Confirm. Agents must **never self-approve Confirm actions**.\r
\r
---\r
\r
## Coordinate System\r
\r
All mouse tools use **image-space coordinates** from a 1280px-wide viewport — matching screenshots from `desktop_screenshot`. DPI scaling is handled automatically. Do not pre-scale coordinates.\r
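
Coordinates read off a `desktop_screenshot` can be passed to the mouse tools unchanged. The sketch below covers only the other case, where a position comes from outside ClawdCursor in physical screen pixels. It assumes the 1280px image is a uniform downscale of the full screen width. The helper name and that scaling assumption are illustrative, not documented behavior.

```python
from typing import Tuple

IMAGE_WIDTH = 1280  # width of the image-space viewport described above


def screen_to_image(x: float, y: float, screen_width: float) -> Tuple[int, int]:
    """Map a physical-pixel position to 1280px-wide image space.

    Assumption: both axes are scaled by the same factor.
    """
    if screen_width <= 0:
        raise ValueError("screen_width must be positive")
    scale = IMAGE_WIDTH / screen_width
    return round(x * scale), round(y * scale)
```

On a 2560px-wide screen the scale factor is 0.5, so the physical point (1280, 800) maps to (640, 400).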
\r
---\r
\r
## Safety\r
\r
| Tier | Actions | Behavior |\r
|------|---------|----------|\r
| 🟢 Auto | Navigation, reading, opening apps | Runs immediately |\r
| 🟡 Preview | Typing, form filling | Logged |\r
| 🔴 Confirm | Send, delete, purchase | Pauses — **always ask user first** |\r
\r
- **Never self-approve Confirm actions.**\r
- `Alt+F4` and `Ctrl+Alt+Delete` are blocked.\r
- Server binds to `127.0.0.1` only.\r
- First run requires explicit user consent for desktop control.\r
\r
---\r
\r
## Error Recovery\r
\r
| Problem | Fix |\r
|---------|-----|\r
| Port 3847 not responding | `clawdcursor serve` — wait 2s — `GET /health` |\r
| 401 Unauthorized | Token changed — read `~/.clawdcursor/token` and use fresh value |\r
| CDP not available | Chrome must be open. `navigate_browser(url)` auto-enables it. |\r
| CDP on wrong tab | `cdp_list_tabs()` → `cdp_switch_tab(target)` |\r
| `focus_window` fails | `get_windows()` to confirm title/processName, then retry |\r
| `smart_click` can't find element | `read_screen()` for coords → `mouse_click(x, y)` |\r
| `key_press` goes to wrong window | You skipped `focus_window` — always focus first |\r
| `cdp_read_text` returns empty | Canvas app — use `ocr_read_screen()` instead |\r
| Same action fails 3+ times | Try a completely different approach |\r
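
The last row of the table can be sketched as a loop over strategies: give each approach a bounded number of attempts, then move on to a different one. The helper `first_working` and the boolean success convention are illustrative. Each strategy would be a function of your own, for example one calling `smart_click` and another calling `read_screen` followed by `mouse_click`.

```python
from typing import Callable, Iterable, Optional, Tuple

Strategy = Tuple[str, Callable[[], bool]]


def first_working(
    strategies: Iterable[Strategy], max_attempts: int = 3
) -> Optional[str]:
    """Try each strategy up to max_attempts times, in order.

    Returns the name of the first strategy that succeeds, or None if
    every strategy used up its attempts.
    """
    for name, attempt in strategies:
        for _ in range(max_attempts):
            if attempt():
                return name
    return None
```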
\r
---\r
\r
## Platform Support\r
\r
| Platform | A11y | OCR | CDP |\r
|----------|------|-----|-----|\r
| Windows (x64/ARM64) | PowerShell + .NET UIA | Windows.Media.Ocr | Chrome/Edge |\r
| macOS (Intel/Apple Silicon) | JXA + System Events | Apple Vision | Chrome/Edge |\r
| Linux (x64/ARM64) | AT-SPI | Tesseract | Chrome/Edge |\r
\r
**macOS:** Grant Accessibility in System Settings → Privacy → Accessibility.\r
**Linux:** `sudo apt install tesseract-ocr` for OCR support.\r
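
---

- Make sure OpenClaw is installed (local or Docker deployment).
- Enter the install command in the chat box: `/install clawdcursor`
- Once installed, invoke the Skill by name or trigger it with `/clawdcursor`.
- Provide the inputs described in the Skill's parameter documentation to get structured output.

**What is ClawdCursor?**
OS-level desktop automation tool server. 42 tools for controlling any application on Windows, macOS, and Linux. Model-agnostic — works with any AI that can d... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 142 downloads to date.

**How do I install ClawdCursor?**
Run `/install clawdcursor` in the OpenClaw or Claude Code chat box for a one-step install. No extra configuration is needed.

**Is ClawdCursor free?**
Yes. ClawdCursor is completely free under the MIT-0 license and can be downloaded, installed, and used freely.

**Which platforms does ClawdCursor support?**
ClawdCursor is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

**Who develops ClawdCursor?**
It is developed and maintained by AmrDab (@amrdab). The current version is v0.7.5.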