Description

Generic macOS desktop control using AppleScript for app and window semantics plus screenshot, OCR, mouse, and keyboard workflows.

Usage (SKILL.md)

Preface

Name: MacOS Desktop Control
Author: kd-oauth

Chinese is explicitly supported, including text input and text recognition, so Chinese-language users can rely on it.

macos-desktop-control

This skill controls the macOS desktop through a small, explicit pipeline with a clear split between semantic app control and visual UI control.

Features

🖥️ App and window control

✅ Activate an app by name or bundle path
✅ Check whether an app is running
✅ Read the current frontmost app
✅ Read front window title, count windows, and list window titles

📸 Screenshot and image operations

✅ Capture the current screen as a logical-resolution screenshot
✅ Initialize screenshot-to-click calibration for macOS Retina displays
✅ Crop a known rectangular region from an image
✅ Reuse calibration data when a workflow must mix logical and raw screenshots

🎯 Visual target location

✅ Locate text by OCR on screenshots
✅ Locate templates by OpenCV image matching
✅ Constrain later actions to coordinates derived from a screenshot

⌨️ Mouse and keyboard control

✅ Move the mouse in logical screen coordinates
✅ Left click, right click, double click, and drag
✅ Read current mouse position
✅ Enter text by clipboard paste, press keys, and send hotkeys
✅ Hold and release keys explicitly when needed

🛡️ Safety and scope

✅ Use logical coordinates as the default working convention
✅ Keep app-specific UI semantics out of this skill
✅ Keep AppleScript usage limited to app and window semantics, not deep UI scripting
✅ Keep pyautogui.FAILSAFE = True so moving to the top-left corner aborts automation

The pipeline runs in this order:

Use AppleScript for app and window semantics
Initialize coordinate mapping
Capture the screen
Locate targets by OCR or OpenCV image matching
Execute mouse and keyboard actions with Python

Design boundary

This skill intentionally does not include AppleScript UI scripting.

Use AppleScript for:

opening or activating apps
reading frontmost app state
reading window titles and counts

Use screenshot-guided OCR/OpenCV plus pyautogui for:

clicking UI targets
typing into custom-drawn interfaces
interacting with chat rows, images, canvases, or other visually defined targets

This boundary keeps the skill predictable. AppleScript is used where semantic macOS state is strong, and pyautogui is used where direct UI manipulation is more reliable.

Why initialization is needed

On macOS, screenshot coordinates and click coordinates may use different coordinate systems.

screencapture images usually use pixel coordinates.
Mouse automation tools often use macOS screen coordinates, also called point coordinates.
On Retina displays, one point is commonly equal to two pixels.
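The relationship between the two coordinate systems can be sketched as a pair of conversions. This is a minimal illustration, not the skill's own code: the function names are hypothetical, and the default scale of 2.0 assumes a typical Retina display.

```python
def pixel_to_point(x_px: float, y_px: float,
                   scale_x: float = 2.0, scale_y: float = 2.0) -> tuple[float, float]:
    """Convert raw screenshot pixel coordinates to logical point coordinates."""
    return x_px / scale_x, y_px / scale_y


def point_to_pixel(x_pt: float, y_pt: float,
                   scale_x: float = 2.0, scale_y: float = 2.0) -> tuple[float, float]:
    """Convert logical point coordinates back to raw screenshot pixel coordinates."""
    return x_pt * scale_x, y_pt * scale_y


# A target found at pixel (1000, 600) in a raw Retina screenshot
# corresponds to the point (500, 300) that a mouse click would use.
print(pixel_to_point(1000, 600))
```

Clicking at the raw pixel position instead of the converted point would land twice as far from the top-left corner as intended, which is the failure the calibration step guards against.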

This skill writes the coordinate mapping result to a JSON file, so later steps can reuse it without recalculating.

Initialization behavior in the current version:

the skill auto-initializes on first use when the calibration file does not exist
it does not re-run mapping on every invocation
if /tmp/macos_desktop_control/calibration.json already exists, the existing calibration is reused

Default calibration file:

/tmp/macos_desktop_control/calibration.json

Directory layout

macos-desktop-control/
  SKILL.md
  requirements.txt
  scripts/
    calibration.py
    init_coordinate_mapping.py
    capture_screen.py
    crop_image.py
    locate_text_ocr.py
    locate_image_opencv.py
    mouse.py
    keyboard.py
    applescript_app.py
    applescript_window.py

Requirements

Install Python dependencies:

pip install -r requirements.txt

OCR uses Apple Vision through PyObjC, so no separate Tesseract install is required.

On macOS, grant the terminal or runtime app these permissions:

Screen Recording
Accessibility

1. Initialize coordinate mapping

The first version handles Retina screens by comparing screenshot pixel size with the logical screen size used by pyautogui.

You can still run initialization manually:

python scripts/init_coordinate_mapping.py

But in normal use, the skill now performs lazy initialization automatically on first use if the calibration file is missing.

Example output:

{
  "screen_width_points": 1512,
  "screen_height_points": 982,
  "screenshot_width_pixels": 3024,
  "screenshot_height_pixels": 1964,
  "scale_x": 2.0,
  "scale_y": 2.0,
  "mode": "retina"
}

Later scripts read this file automatically.
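The example output above could be derived roughly as follows. This is a sketch under assumptions: `build_calibration` is a hypothetical helper, the sizes are passed in rather than measured, and the `"standard"` mode label is invented for illustration (only `"retina"` appears in the documented output).

```python
import json


def build_calibration(screen_points: tuple[int, int],
                      screenshot_pixels: tuple[int, int]) -> dict:
    """Compare logical screen size with screenshot pixel size to get scale factors."""
    w_pt, h_pt = screen_points
    w_px, h_px = screenshot_pixels
    scale_x = w_px / w_pt
    scale_y = h_px / h_pt
    # "retina" matches the documented example; "standard" is a hypothetical label.
    mode = "retina" if scale_x > 1.0 or scale_y > 1.0 else "standard"
    return {
        "screen_width_points": w_pt,
        "screen_height_points": h_pt,
        "screenshot_width_pixels": w_px,
        "screenshot_height_pixels": h_px,
        "scale_x": scale_x,
        "scale_y": scale_y,
        "mode": mode,
    }


calibration = build_calibration((1512, 982), (3024, 1964))
print(json.dumps(calibration, indent=2))
```

In a real run, the logical size would come from something like `pyautogui.size()` and the pixel size from the dimensions of a raw screenshot.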

Current lazy-init behavior:

capture_screen.py
mouse.py
locate_text_ocr.py
locate_image_opencv.py

These scripts first check whether /tmp/macos_desktop_control/calibration.json exists. If not, they auto-generate it once and then continue.
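The check-then-generate behavior described above might look like the following sketch. The helper name `ensure_calibration` and the `compute` callback are assumptions for illustration; the actual scripts may structure this differently.

```python
import json
from pathlib import Path
from typing import Callable

DEFAULT_PATH = Path("/tmp/macos_desktop_control/calibration.json")


def ensure_calibration(compute: Callable[[], dict],
                       path: Path = DEFAULT_PATH) -> dict:
    """Reuse an existing calibration file, or create it once if it is missing."""
    if path.exists():
        # An existing file is trusted as-is; mapping is not re-run.
        return json.loads(path.read_text(encoding="utf-8"))
    calibration = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(calibration, indent=2), encoding="utf-8")
    return calibration
```

Because an existing file is always reused, deleting the calibration file is the way to force a fresh mapping, for example after changing display resolution.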

2. Capture screen

Capture the current screen and resize the image into the logical coordinate system used by pyautogui.position() and pyautogui.click().

This skill's default convention is:

default screenshot is logical
default recognition result coordinates are logical
default mouse action coordinates are logical
default crop operations should use a logical screenshot
only use calibration conversion when a workflow explicitly mixes logical screenshots with raw pixel screenshots

python scripts/capture_screen.py --output /tmp/macos_desktop_control/screen_logical.png

Core idea:

import pyautogui

img = pyautogui.screenshot()
screen_w, screen_h = pyautogui.size()

# Resize screenshot to the coordinate system used by pyautogui.position() / click().
img = img.resize((screen_w, screen_h))
img.save("screen_logical.png")

3. Crop image regions

When a higher-level skill already knows a target rectangle, crop it directly instead of re-opening previews or re-running visual search.

By default, crop from a logical screenshot so the crop rectangle stays in the same coordinate system as recognition and mouse targeting. Only crop from a raw Retina or pixel screenshot when there is a specific reason to preserve raw pixels, and in that case convert coordinates first using calibration data.

python scripts/crop_image.py \
  --image /tmp/macos_desktop_control/screen_logical.png \
  --x1 400 --y1 300 --x2 700 --y2 650 \
  --output /tmp/macos_desktop_control/crop.png

Use this for:

extracting a detected chat image thumbnail
saving a button or dialog region for later analysis
debugging screenshot-to-action pipelines
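When a crop must come from a raw pixel screenshot, the logical rectangle has to be scaled first. A minimal sketch of that conversion, assuming the scale factors from the calibration file; the function name is hypothetical:

```python
def logical_rect_to_pixel_rect(x1: float, y1: float, x2: float, y2: float,
                               scale_x: float = 2.0,
                               scale_y: float = 2.0) -> tuple[int, int, int, int]:
    """Scale a logical crop rectangle into raw screenshot pixel coordinates."""
    if x2 <= x1 or y2 <= y1:
        raise ValueError("crop rectangle must have positive width and height")
    return (round(x1 * scale_x), round(y1 * scale_y),
            round(x2 * scale_x), round(y2 * scale_y))


# The logical rectangle from the crop example, mapped onto a 2x Retina screenshot.
print(logical_rect_to_pixel_rect(400, 300, 700, 650))
```

The resulting tuple is in (left, upper, right, lower) order, which is the box format Pillow's `Image.crop` accepts.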

4. Locate targets

There are two supported strategies.

Locate by OCR text

python scripts/locate_text_ocr.py \
  --image /tmp/macos_desktop_control/screen_logical.png \
  --text "确定"

You can also constrain OCR to a specific screen region when the same text may appear in multiple places:

python scripts/locate_text_ocr.py \
  --image /tmp/macos_desktop_control/screen_logical.png \
  --text "会话" \
  --x1 0 --y1 120 --x2 520 --y2 1107

The script prints the center point of the best matched Apple Vision OCR box. When a region is provided, the search runs only inside that rectangle, but the returned coordinates are still in full-screen logical coordinates.
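Returning full-screen coordinates from a region-limited search comes down to adding the region's origin back to the match position. The sketch below assumes the OCR box has already been converted to top-left-origin logical units relative to the searched region; the function name and box layout are illustrative, not the script's actual interface.

```python
def box_center_in_screen(box: tuple[float, float, float, float],
                         region_origin: tuple[float, float] = (0, 0)) -> tuple[float, float]:
    """Return the full-screen center of a box found inside a search region.

    box is (left, top, width, height) relative to the region's top-left corner.
    region_origin is the region's (x1, y1) in full-screen logical coordinates.
    """
    left, top, width, height = box
    origin_x, origin_y = region_origin
    return origin_x + left + width / 2, origin_y + top + height / 2


# A match 40 units right and 10 units down inside a region starting at (0, 120).
print(box_center_in_screen((40, 10, 60, 20), region_origin=(0, 120)))
```

The returned point can be passed straight to a click without any further offset arithmetic.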

Locate by OpenCV image matching

python scripts/locate_image_opencv.py \
  --image /tmp/macos_desktop_control/screen_logical.png \
  --template ./target_button.png \
  --threshold 0.8

The script prints the center point of the matched template.
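How a template match becomes a click point can be sketched without OpenCV itself. The assumption here is that the matcher yields a best score and the top-left corner of the best match, as `cv2.minMaxLoc` does on the result of `cv2.matchTemplate`; whether the script uses exactly those calls is not stated in this document, and `match_center` is a hypothetical name.

```python
from typing import Optional


def match_center(top_left: tuple[int, int],
                 template_size: tuple[int, int],
                 score: float,
                 threshold: float = 0.8) -> Optional[tuple[float, float]]:
    """Return the center of the matched template, or None if the score is too low."""
    if score < threshold:
        return None
    x, y = top_left
    width, height = template_size
    return x + width / 2, y + height / 2


# A 40x20 template matched at top-left (480, 290) with score 0.93.
print(match_center((480, 290), (40, 20), 0.93))
```

Returning `None` below the threshold lets the caller stop instead of clicking on a weak, probably wrong match.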

5. Mouse actions

Use Python and pyautogui to control the mouse in logical screen coordinates.

Single click

python scripts/mouse.py --action click --x 500 --y 300

Move only

python scripts/mouse.py --action move --x 500 --y 300 --duration 0.2

Double click

python scripts/mouse.py --action double-click --x 500 --y 300

Right click

python scripts/mouse.py --action right-click --x 500 --y 300

Drag

python scripts/mouse.py --action drag --x 500 --y 300 --to-x 800 --to-y 500 --duration 0.3

Read current mouse position

python scripts/mouse.py --action position

You can also pipe the result from a locate script:

python scripts/locate_image_opencv.py \
  --image /tmp/macos_desktop_control/screen_logical.png \
  --template ./target_button.png \
| python scripts/mouse.py --stdin --action click

Stdin accepts either x y text or JSON like {"x": 500, "y": 300}.
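A parser that accepts both stdin formats could look like this. It is a sketch of the described behavior, not the code in `mouse.py`; `parse_point` is a hypothetical name.

```python
import json


def parse_point(text: str) -> tuple[int, int]:
    """Parse 'x y' text or a JSON object with 'x' and 'y' keys into a point."""
    text = text.strip()
    if text.startswith("{"):
        data = json.loads(text)
        return int(data["x"]), int(data["y"])
    parts = text.split()
    if len(parts) != 2:
        raise ValueError(f"expected 'x y' or JSON, got: {text!r}")
    # Go through float so inputs like '500.0' are accepted too.
    return int(float(parts[0])), int(float(parts[1]))


print(parse_point("500 300"))
print(parse_point('{"x": 500, "y": 300}'))
```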

6. Keyboard actions

Use Python and pyautogui to paste text or trigger shortcuts.

Important practical note:

this skill uses clipboard paste for all text entry by default, including English
this avoids input-method issues with Chinese, English, and mixed-language text
do not use simulated typing for text entry in this skill

Paste text

python scripts/keyboard.py --action paste --text "我是OpenClaw"

Paste from stdin

printf '我是OpenClaw' | python scripts/keyboard.py --action paste --stdin

Default input rule for this skill:

use clipboard paste for all text input by default, including English
click the verified input field first, then paste with command v
do not use simulated typing for text entry in this skill

Press one key

python scripts/keyboard.py --action press --key enter

Press a hotkey

python scripts/keyboard.py --action hotkey --keys command v

Recommended paste workflow when text fidelity matters:

copy the exact text into the clipboard, preferably via python scripts/keyboard.py --action paste
click the verified input field
let the script send command v to paste
verify visually before pressing enter if sending would be externally visible
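The copy-then-paste step can be sketched as below. This assumes the clipboard is filled with `pbcopy` and the paste is sent with a hotkey function such as `pyautogui.hotkey`; `paste_text` and its parameters are hypothetical, and both dependencies are injected so the logic can be exercised without touching the real clipboard.

```python
import subprocess
from typing import Callable


def paste_text(text: str,
               hotkey: Callable[..., None],
               run: Callable = subprocess.run) -> None:
    """Copy text to the macOS clipboard, then send command+v to paste it."""
    # pbcopy reads clipboard contents from stdin. Encoding as UTF-8 is an
    # assumption that the runtime's locale is UTF-8, which keeps Chinese intact.
    run(["pbcopy"], input=text.encode("utf-8"), check=True)
    hotkey("command", "v")
```

On a Mac this would be called as `paste_text("我是OpenClaw", hotkey=pyautogui.hotkey)` after the input field has been clicked and verified.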

Hold and release keys

python scripts/keyboard.py --action key-down --key shift
python scripts/keyboard.py --action key-up --key shift

7. AppleScript app control

Use AppleScript when the task is semantic macOS control rather than visual targeting.

Good fits:

open or activate an app
check whether an app is running
read the current frontmost app

Open by app name

python scripts/applescript_app.py --action open --app "微信"

Open by bundle path

python scripts/applescript_app.py --action open --path "/Applications/微信.app"

Activate an app

python scripts/applescript_app.py --action activate --app "微信"

Check whether an app is running

python scripts/applescript_app.py --action is-running --app "微信"

Get the current frontmost app

python scripts/applescript_app.py --action frontmost-app
python scripts/applescript_app.py --action frontmost-app --json-pretty

8. AppleScript window inspection

Use AppleScript window inspection when you need app-level UI state without relying on OCR.

Good fits:

read the front window title
count windows for a process
list window titles for a process

Read the front window title

python scripts/applescript_window.py --action title --app "微信"

Count windows

python scripts/applescript_window.py --action count --app "微信"

List window titles

python scripts/applescript_window.py --action list --app "微信"
python scripts/applescript_window.py --action title --app "微信" --json-pretty

9. When to use AppleScript vs desktop vision

Prefer AppleScript for:

opening or activating apps
reading window titles
checking the frontmost app
simple app and process state queries

Do not add AppleScript UI scripting here for button clicks or deep accessibility-tree automation. That path is intentionally excluded from this skill.

Prefer screenshot + OCR/OpenCV + pyautogui for:

buttons or labels that only exist visually
apps with weak or unstable accessibility hierarchies
targets inside custom-drawn UIs such as chat rows, images, or canvas content
direct manipulation such as clicking, dragging, and typing into app surfaces

When the same text may appear in multiple places, do not search the full screen by default. Constrain OCR to the intended region first, then click using the returned full-screen logical coordinates.

A practical sequence is often:

AppleScript activates the app
AppleScript reads window or process state
screenshot-based vision finds the target
mouse or keyboard automation performs the action
AppleScript or a fresh screenshot verifies the result

10. Recommended flow

python scripts/applescript_app.py --action activate --app "微信"
python scripts/applescript_window.py --action title --app "微信"
python scripts/init_coordinate_mapping.py
python scripts/capture_screen.py
python scripts/locate_text_ocr.py --text "确定"
python scripts/mouse.py --action click --x 500 --y 300
python scripts/keyboard.py --action press --key enter
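The flow above clicks a fixed point. A variant that clicks wherever OCR finds the text can be assembled from the flags documented earlier. This sketch only builds the shell command strings; `build_flow` is a hypothetical helper, and piping `locate_text_ocr.py` into `mouse.py --stdin` is assumed to work the same way as the documented `locate_image_opencv.py` pipe.

```python
import shlex

SCREEN = "/tmp/macos_desktop_control/screen_logical.png"


def build_flow(app: str, text: str, image: str = SCREEN) -> list[str]:
    """Build shell commands: activate the app, capture, then locate and click."""
    q = shlex.quote  # quote user-supplied values so the shell treats them as one word
    return [
        f"python scripts/applescript_app.py --action activate --app {q(app)}",
        f"python scripts/capture_screen.py --output {q(image)}",
        f"python scripts/locate_text_ocr.py --image {q(image)} --text {q(text)}"
        " | python scripts/mouse.py --stdin --action click",
    ]


for command in build_flow("微信", "确定"):
    print(command)
```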

Notes

Version 1 assumes a Retina display and a single primary screen.
Treat logical screenshots as the default working surface for this skill.
Treat recognition output coordinates as logical unless a script explicitly says otherwise.
Treat mouse and keyboard targeting as logical by default.
Treat crop rectangles as logical by default, and prefer cropping from a logical screenshot.
If another skill mixes logical screenshots with raw Retina or pixel screenshots, use calibration conversion deliberately. Do not assume logical bounds match raw pixel bounds 1:1.
Keep this skill focused on generic desktop primitives. App-specific UI semantics, business rules, and event pipelines should stay in the higher-level app skill.
All click, drag, move, and keyboard actions use Python / pyautogui.
AppleScript support in this skill is limited to app control and window inspection.
For safety, keep pyautogui.FAILSAFE = True; moving the mouse to the top-left corner aborts automation.

Safety recommendations

This skill appears to do what it says: local macOS desktop automation using AppleScript, screenshots, Vision OCR, OpenCV, and pyautogui. Before installing or enabling it:

1) Only grant Screen Recording and Accessibility to runtimes you trust (these permissions let the tool read screen pixels and control input).
2) Review requirements.txt and install dependencies in a controlled Python environment (virtualenv).
3) Note it will create /tmp/macos_desktop_control/calibration.json on first run; delete that file if you want to reset calibration.
4) Be aware that the skill can paste/click/type anywhere; do not run it with sensitive apps open unless you trust the skill/source.
5) If you need stronger assurance, run the scripts locally from source (inspect them yourself) rather than giving an agent blanket autonomous invocation; consider restricting autonomous use or testing in a non-critical account/session.

Functional analysis

Type: OpenClaw Skill
Name: desktop-control-for-macos
Version: 1.0.13

The skill provides high-risk desktop control primitives for macOS, including mouse/keyboard automation (mouse.py, keyboard.py), screen capture (capture_screen.py), and application/window management via AppleScript (applescript_app.py, applescript_window.py). While these capabilities are aligned with the stated purpose of desktop automation, they grant broad system access. Additionally, the AppleScript wrappers are vulnerable to injection because they use f-strings to construct scripts from arguments (e.g., --app) without sanitization. No evidence of intentional malice, such as data exfiltration or persistence, was found.
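One way to reduce the injection risk noted above is to escape argument values before they are placed inside an AppleScript string literal. The following is a minimal sketch, not a patch to the skill: `applescript_quote` and `activate_script` are hypothetical names, and the escaping covers only backslashes and double quotes.

```python
def applescript_quote(value: str) -> str:
    """Return value as an AppleScript string literal with quotes and backslashes escaped."""
    # Escape backslashes first so the backslashes added for quotes are not doubled.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def activate_script(app_name: str) -> str:
    """Build an activate command without letting app_name break out of the literal."""
    return f"tell application {applescript_quote(app_name)} to activate"


print(activate_script("微信"))
print(activate_script('Notes" to quit --'))
```

An alternative that avoids string building altogether is to keep the script text fixed and pass values as extra `osascript` arguments, which arrive in the script's `on run argv` handler.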

Capability assessment

✓ Purpose & Capability

The name/description match the included files and declared behavior: AppleScript wrappers (applescript_app/window), screenshot capture, Vision OCR, OpenCV template matching, pyautogui-based mouse/keyboard control, and coordinate calibration. Required macOS permissions (Screen Recording, Accessibility) and listed Python dependencies (pyautogui, Pillow, opencv-python, pyobjc Vision/Quartz) are consistent with the stated purpose.

✓ Instruction Scope

SKILL.md instructs running the included Python scripts and describes the calibration/usage pipeline. Scripts only reference local files (default /tmp/macos_desktop_control/*), call osascript for window/app metadata, use pbcopy for clipboard, and perform OCR/OpenCV locally. There are no instructions to read unrelated system configs, sweep user data, or send data to external endpoints.

✓ Install Mechanism

No remote download/install steps are present. This is an instruction-plus-scripts skill with a requirements.txt for pip packages. Installing via pip is the expected method; dependencies are typical and traceable on PyPI or Apple frameworks (pyobjc).

✓ Credentials

The skill declares no environment variables or credentials. It does require macOS Screen Recording and Accessibility permissions to function, which is appropriate for a desktop automation tool. It writes a calibration file to /tmp/macos_desktop_control/calibration.json (explained in SKILL.md). There are no requests for unrelated secrets or config paths.

ℹ Persistence & Privilege

The skill does not request 'always' or system-wide config changes. It can autonomously run actions (normal for skills) and will write a calibration file in /tmp. Functionally it has high capability because it can click, type, and paste — so treat it as powerful desktop automation that requires Accessibility permissions; exercise usual caution when granting those permissions.

Version history

v1.0.13

Version 1.0.13
- Introduces automatic ("lazy") initialization of calibration: the skill now auto-generates the screen calibration file on first use if it does not exist, removing the need for manual pre-initialization.
- Calibration file at /tmp/macos_desktop_control/calibration.json is reused if already present.
- Scripts that require calibration (e.g., capture_screen.py, mouse.py, locate_text_ocr.py, locate_image_opencv.py) will trigger auto-initialization when needed.
- Slightly expands usage documentation and clarifies default behaviors for new initialization logic.
- No code changes detected; update is documentation-only.

v1.0.12

Remove the description related to non-Retina displays.

v1.0.11

- Added a prominent introductory note highlighting Chinese compatibility for text input/recognition.
- All other technical features and documentation remain unchanged.

v1.0.10

Changed keyboard text entry to clipboard copy-and-paste only, for compatibility with Chinese input.

v1.0.9

- Added skill metadata block (`name` and `description`) to SKILL.md.
- Fixed the AppleScript frontmost-app action.

v1.0.8

Added compatibility support for Chinese text input.

v1.0.7

Changelog for version 1.0.7
- Updated documentation to clarify features by adding icon bullets to each feature set for improved readability.
- No code or logic changes; only the SKILL.md file was modified.
- All features and usage remain unchanged.

v1.0.6

No functional changes in this release. Documentation (SKILL.md) has been updated:
- Added a new "Features" section summarizing capabilities for easier overview
- Reorganized documentation to clarify groups of actions (app/window, visual control, keyboard, mouse, safety)
- No changes to scripts, APIs, or functionality

v1.0.5

- Added image region cropping support with new script `scripts/crop_image.py`
- Documented usage for cropping logical screen regions to extract thumbnails, buttons, dialogs, etc.
- Updated directory structure and recommended flow to include the crop image functionality

v1.0.4

**AppleScript app and window control added for semantic macOS automation.**
- Added scripts/applescript_app.py and scripts/applescript_window.py for app-level and window-level control using AppleScript.
- Expanded documentation to clarify skill boundaries, explaining when to use AppleScript versus screen-based automation.
- Listed and documented new actions: open, activate, check if running, and get frontmost app for apps; get title, count, and list windows.
- Directory layout and usage flow updated to include new AppleScript capabilities.

v1.0.2

Version 1.0.2
- Added meta information file (_meta.json) for the skill.
- Updated documentation: OCR now uses Apple Vision via PyObjC, removing the need for separate Tesseract installation.
- Clarified in SKILL.md that OCR relies on Apple Vision, and updated related usage notes.
- No code changes; this version only updates metadata and documentation for improved clarity and ease of use.

v1.0.1

Enhance mouse control functionality and add keyboard control.

v1.0.0

init

Metadata

Slug desktop-control-for-macos

Version 1.0.13

License MIT-0

Total installs 0

Current installs 0

Historical versions 13

FAQ

What is MacOS Desktop Control?

Generic macOS desktop control using AppleScript for app and window semantics plus screenshot, OCR, mouse, and keyboard workflows. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 262 cumulative downloads to date.

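How do I install MacOS Desktop Control?

Run the command "/install desktop-control-for-macos" in an OpenClaw or Claude Code conversation to install it in one step. The install command itself needs no extra configuration; the Python dependencies and the macOS Screen Recording and Accessibility permissions listed under Requirements still apply before the scripts can run.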

Is MacOS Desktop Control free?

Yes. MacOS Desktop Control is completely free and is released under the MIT-0 license, so it can be downloaded, installed, and used freely.

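Which platforms does MacOS Desktop Control support?

The listing tags the skill as cross-platform, meaning it can be installed in any environment where OpenClaw / Claude Code is deployed. The scripts themselves target macOS: they depend on AppleScript, Apple Vision OCR through PyObjC, and the macOS Screen Recording and Accessibility permissions.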

Who develops MacOS Desktop Control?

It is developed and maintained by KD-oauth (@kd-oauth). The current version is v1.0.13.
