/install peekaboo-linux-gnome-wayland-computer-use
peekaboo-linux-gnome-wayland-computer-use
The Linux/Wayland counterpart to macOS computer-use — "see and poke the screen of a box for troubleshooting or automation." On macOS that's trivial (CGWindowList + CGEvent + AX tree). On GNOME/Wayland it's locked down by design, but it can be made to work end-to-end:
- See →
gnome-screenshot(re-enabled on GNOME 49+ via a small GNOME extension). - Act →
ydotool(writes to/dev/uinput, no root). - Target elements → AT-SPI WINDOW-relative extents + the Window Calls extension's
window origin, composed into real absolute click coordinates (works where naive AT-SPI returns
(0,0)). - Hands-on (optional) → GNOME Remote Desktop in Remote-Login mode for a human at an RDP client.
Both the "see" and "act" sides work natively (an agent process running ON the box) and remotely (from another machine over SSH). A native agent is a single actor — it can screenshot, analyze, then ydotool-click in one loop. No second machine, no human, no RDP roundtrip.
Context tags below — [NATIVE] = an agent process running ON the GNOME box (sharing the desktop
session of user@1000); [REMOTE] = on another machine, reaching the box over SSH.
Verified on Ubuntu 26.04 LTS + GNOME Shell 50 (Wayland), NVIDIA + AMD multi-GPU, multi-monitor. Substitute your own host/user where the examples use
USER@HOSTand/run/user/UID.
Capability matrix
| Capability | Native (on box) | Remote (from another machine) |
|---|---|---|
| Screenshot | yes — gnome-screenshot -f out.png (extension unlocks it) |
yes — same over SSH, then scp the file |
| Inject input (mouse/keys) | yes — ydotool via /dev/uinput (no root) |
yes — over SSH, same ydotool |
| Click element by role/name | yes — locate-element (AT-SPI WINDOW extents + Window Calls origin) |
yes — same over SSH |
| Full see→reason→act loop | yes — single actor, no RDP | yes — drive over SSH |
| Live interactive desktop | n/a | yes — RDP client (optional, for a human) |
Target environment assumptions
- A GNOME/Wayland box with a logged-in graphical session (the GUI session must be active on a seat).
- For remote use: SSH access (key-based recommended), and the box's session env exported in
non-login shells —
XDG_RUNTIME_DIR=/run/user/UID,DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/UID/bus,WAYLAND_DISPLAY=wayland-0,DISPLAY=:0. (ReplaceUIDwith the session user's numeric id, e.g.1000.) - If agents run under
systemd --userwith the full graphical session env + lingering enabled, they inherit all of the above and survive SSH disconnects. gnome-screenshotgrabs the full virtual desktop (all monitors stitched). On multi-4K setups that's huge — use-w(focused window) or the per-display helper below.
CAPABILITY A — Screenshot (the key unlock)
GNOME 49+ revoked gnome-screenshot's access to the private Shell screenshot API, so on a stock
box every headless capture path is dead (see "why" below). The fix is the allow-gnome-screenshot
extension (https://github.com/siddhpant/allow-gnome-screenshot, GNOME 49/50): it re-adds
org.gnome.Screenshot to the Shell's _senderChecker._allowlistMap, so the gnome-screenshot
binary specifically regains access. It is scoped to that one sender — it does NOT enable global
unsafe_mode.
[NATIVE] — just call it (session env already present):
gnome-screenshot -f /tmp/shot.png # full virtual desktop (all monitors)
gnome-screenshot -w -f /tmp/shot.png # focused window only (smaller)
gnome-screenshot -a -f /tmp/shot.png # interactive area (needs a human at the screen — avoid headless)
# then hand /tmp/shot.png to your image-analysis step
[REMOTE] — over SSH, then pull it back:
ssh USER@HOST '
export XDG_RUNTIME_DIR=/run/user/UID DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/UID/bus
gnome-screenshot -f /tmp/shot.png'
scp USER@HOST:/tmp/shot.png /tmp/ # then analyze it
Sanity-check the capture — a 4K full-desktop PNG is ~15-20 MB; a tiny/near-empty file or an
all-black image means the extension got disabled or the session dropped to the GDM greeter. Quick
non-black check:
python3 -c "from PIL import Image; im=Image.open('/tmp/shot.png').convert('RGB'); d=list(im.getdata()); print(sum(1 for p in d[::997] if p!=(0,0,0)))" → should be well >0.
Installing the extension from scratch
# [NATIVE or over SSH with the session env exported]
sudo apt-get install -y gnome-screenshot
D=~/.local/share/gnome-shell/extensions/[email protected]
git clone --depth 1 https://github.com/siddhpant/allow-gnome-screenshot.git /tmp/ags
mkdir -p "$D" && cp -r /tmp/ags/[email protected]/* "$D/"
sudo systemctl restart gdm3 # reload the Shell so it discovers the extension (see WARNING below)
# after the session returns:
export XDG_RUNTIME_DIR=/run/user/UID DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/UID/bus
gnome-extensions enable [email protected]
gnome-extensions info [email protected] | grep State # expect ACTIVE
Per-display capture (capture ONE monitor, not the stitched canvas)
gnome-screenshot -f grabs the entire virtual desktop. On multi-monitor setups that's a huge,
mostly-empty canvas. To get a single clean display, use the bundled screenshot-display helper
(scripts/screenshot-display.py). It reads live monitor geometry from Mutter DisplayConfig and
crops the full capture to the chosen monitor.
screenshot-display --list # enumerate monitors + geometry
screenshot-display --monitor primary --out /tmp/p.png # primary (default)
screenshot-display --monitor DP-4 --out /tmp/m.png # by connector name
screenshot-display --monitor 2 --out /tmp/m.png # by index (0-based, sorted L->R, T->B)
screenshot-display --monitor active --out /tmp/w.png # the FOCUSED WINDOW (best "being worked on" signal)
Each per-monitor crop is a clean single-display image — feed it straight to your vision step.
[REMOTE]: run screenshot-display over SSH (export the session env), then scp the output back.
Window targeting — screenshot a specific program / the display it's on
Shell.Eval/Introspect are locked, but the Window Calls extension
(https://github.com/ickyicky/window-calls, GNOME 45-50) exposes window list + geometry over D-Bus
at /org/gnome/Shell/Extensions/Windows. The screenshot-display helper uses it to find and crop
to a window or its monitor:
screenshot-display --list-windows # wm_class, id, focus, monitor, title
screenshot-display --window focused --out /tmp/w.png # crop to focused window's exact rect
screenshot-display --window "wm_class=firefox" --out /tmp/w.png # match by wm_class substring
screenshot-display --window "title=Settings" --out /tmp/w.png # match by title substring
screenshot-display --window-display focused --out /tmp/d.png # the whole MONITOR that window sits on
screenshot-display --window-display "wm_class=Code" --out /tmp/d.png # monitor a matched window is on
Flow: List → pick window (focused / wm_class / title) → Details gives x,y,width,height,monitor
→ helper crops the full capture to the window rect, or to whichever monitor's rect contains the
window's center. (Details works; GetFrameBounds is broken in this ext version — the helper
doesn't use it.)
Full setup/enablement guide to replicate on any GNOME/Wayland box:
references/peekaboo-and-rdp-setup-ubuntu-gnome-wayland.md.
CAPABILITY B — Input injection (ydotool)
ydotool writes to /dev/uinput below the display server (the input analog of the screenshot
unlock). Install it, run ydotoold as a persistent user service (socket
/run/user/UID/.ydotool_socket), and ACL /dev/uinput to your user (no root needed at runtime).
[NATIVE]:
export YDOTOOL_SOCKET=/run/user/UID/.ydotool_socket
ydotool mousemove -a 960 540 # absolute pixel coords (no element targeting)
ydotool click 0xC0 # left click down+up
ydotool type "hello"
ydotool key 28:1 28:0 # Enter (keycode:press / keycode:release)
[REMOTE]:
ssh USER@HOST '
export XDG_RUNTIME_DIR=/run/user/UID YDOTOOL_SOCKET=/run/user/UID/.ydotool_socket
ydotool mousemove -a 700 400; ydotool click 0xC0; ydotool type "hi"'
If ydotoold.service is down — systemctl --user enable --now ydotoold.service. For element
targeting (not blind coords) use the locate-element helper below.
CAPABILITY B+ — Click an element by role/name (SOM via hybrid coords)
You do NOT need blind pixel guessing or a vision round-trip for most clicks. The bundled
locate-element helper (scripts/locate-element.py) resolves a UI element to a real absolute
coordinate and clicks/types it.
Why it works when AT-SPI looks broken: on Wayland, getExtents(SCREEN) returns (0,0) for
GTK4/Electron (no global coordinates by design), so naive AT-SPI clicking fails. But
getExtents(WINDOW) (widget-relative) is populated, and Window Calls gives the window's
screen origin from Mutter's privileged view. The helper composes them:
abs = window_origin + window_relative_extents + size/2. Proven on GTK4 apps.
locate-element --app gnome-text-editor --list # list elements + computed ABS coords
locate-element --app gnome-text-editor --role text --click # click first text widget
locate-element --app org.gnome.Nautilus --name "New Folder" --click # click by visible label
locate-element --app gnome-text-editor --role text --click --type "hello" # click then type
locate-element --window-id 1367000216 --role "push button" --name Open --click # by explicit window id
locate-element --app firefox --name Reload --print # just print "X Y", act yourself
--appmatches the AT-SPI app name AND wm_class loosely (normalizesorg.gnome.TextEditor↔gnome-text-editor).--role/--nameare case-insensitive substring filters;--index Npicks the Nth match.- Needs
toolkit-accessibility=true(gsettings set org.gnome.desktop.interface toolkit-accessibility true), Window Calls active, and ydotoold running.
Vision fallback (only when an element zeros even WINDOW extents — custom-drawn canvases, some
Electron internals): locate-element exits non-zero and prints the fallback recipe —
screenshot-display --window 'wm_class=X' --out /tmp/w.png → analyze for pixel coords →
ydotool mousemove -a -x X -y Y.
Deep dive on the technique, the gotchas, and why Newton isn't a substitute yet:
references/wayland-atspi-hybrid-som.md.
The native loop (single actor — the whole point)
[NATIVE agent on the box]
1. gnome-screenshot -f /tmp/s.png
2. analyze /tmp/s.png → "the Save button is at ~(1820, 640)"
3. ydotool mousemove -a 1820 640 && ydotool click 0xC0
4. gnome-screenshot again → verify it landed
No second machine, no human, no RDP. For element targeting prefer locate-element over blind coords.
Window actions (focus / move / resize / close)
Window Calls also does focus/move/resize/close verbs (window id from --list-windows):
WIN=/org/gnome/Shell/Extensions/Windows
gdbus call --session --dest org.gnome.Shell --object-path $WIN --method $WIN.Activate \x3Cid> # focus/raise
gdbus call --session --dest org.gnome.Shell --object-path $WIN --method $WIN.Move \x3Cid> \x3Cx> \x3Cy>
gdbus call --session --dest org.gnome.Shell --object-path $WIN --method $WIN.Resize \x3Cid> \x3Cw> \x3Ch>
gdbus call --session --dest org.gnome.Shell --object-path $WIN --method $WIN.Maximize \x3Cid>
gdbus call --session --dest org.gnome.Shell --object-path $WIN --method $WIN.Close \x3Cid>
Flow: Activate \x3Cid> → screenshot-display --window focused → ydotool interact.
CAPABILITY C — Live interactive desktop via RDP (optional, for a human)
For a person to drive the desktop hands-on from another machine, use GNOME Remote Desktop in Remote Login mode (system/headless, fresh negotiated display — NOT Desktop Sharing, which mirrors the physical multi-monitor desktop and re-hits the huge-canvas problem).
- Mode:
grdctl --system(system servicegnome-remote-desktop.service, enabled at boot). - Listener:
*:3389— always LAN-scope it in your firewall, never any-source. - TLS cert at
/var/lib/gnome-remote-desktop/rdp-tls.{crt,key}. - Credentials: a device user + password over NLA. Setting the device cred = the Linux user's password makes both stages (NLA device prompt, then GDM login) take the same credential.
Connect from a client (Microsoft Remote Desktop / Windows App, or any RDP client) → HOST → device
user/pass → GDM → sign in as the Linux user.
⚠️ #1 RDP gotcha — credential changes need a daemon restart. grdctl --system rdp set-credentials
writes the keyfile but the running daemon does NOT hot-reload it into its NTLM SAM. Until you
sudo systemctl restart gnome-remote-desktop.service, NLA fails for every client:
ntlm_fetch_ntlm_v2_hash: Could not find user in SAM database → SEC_E_NO_CREDENTIALS →
transport_accept_nla: client authentication failure. Meanwhile grdctl status shows the correct
creds (it reads the file). Clients report this as FREERDP_ERROR_CONNECT_TRANSPORT_FAILED (RoyalTS)
or ERRCONNECT_AUTHENTICATION_FAILED (FreeRDP) — looks like transport/cert, is actually stale creds.
Fix: set creds (password via STDIN) → restart the service → verify journald shows no SAM/
NO_CREDENTIALS errors. Server requires NLA (HYBRID_REQUIRED_BY_SERVER) — expected.
Diagnosing any RDP auth failure: read the SERVER's journald, not the client error string.
Client errors all collapse different server causes into the same opaque message. Full server-log→cause
map, the +auth-only//sec:tls client probes, the cert/listener sanity checks, AND the
empty-server-log case (tcpdump proves client-vs-network — e.g. a GUI client completing TCP then
sending zero RDP bytes = client-side engine fault, fixed by recreating the connection, not touching
the server): references/rdp-auth-diagnosis-server-side.md.
For automated capture you do NOT need RDP — Capability A is simpler. RDP is for a human.
Full setup steps: references/peekaboo-and-rdp-setup-ubuntu-gnome-wayland.md (Step 7).
Why the stock capture paths fail (so you don't waste time re-testing)
gnome-screenshotWITHOUT the extension → exit 0 but no file (AccessDeniedto the private API under the hood).- Raw
gdbus/busctlcall toorg.gnome.Shell.Screenshot→AccessDenied. The extension allowlists thegnome-screenshotbinary's D-Bus name, not arbitrary callers — so capture must go through thegnome-screenshotbinary, not a hand-rolled gdbus call. xdg-desktop-portalScreenshot→ no Response even with the PermissionStore granted (forced interactive dialog).org.gnome.Mutter.ScreenCast.CreateSession→Session creation inhibited(gated to grd itself).ffmpeg kmsgrab(KMS framebuffer) → "No usable planes" on NVIDIA proprietary KMS./dev/fb0→ all-black (GNOME doesn't scan out to fb0).grim→ wlroots-only, fails on Mutter. Don'tapt install grim.enable unsafe_modevia gdb injection into gnome-shell → CRASHES the Shell, drops the desktop to the GDM greeter. Do NOT use the gdb method. The extension is the safe, scoped, persistent unlock.
No usable screen-coordinate accessibility on Wayland — AT-SPI getExtents(SCREEN) is (0,0) on
GTK4/Electron. BUT this is solved by Capability B+ (locate-element): WINDOW-relative AT-SPI
extents + Window Calls window origin → real click coords. Use locate-element for "click element
labeled Save"; fall back to vision-locate only where even WINDOW extents are zero.
Pitfalls
- The screenshot unlock is binary-scoped — only the
gnome-screenshotbinary is allowlisted. A raw gdbus Screenshot call still fails. Always shell out tognome-screenshot. - Reloading the Shell on Wayland = restarting GDM (
sudo systemctl restart gdm3). There's no in-placeAlt+F2 rover SSH. The live GUI session is briefly torn down, so don't do it mid-GUI-task. If autologin is enabled the desktop recovers in ~12s;systemd --userservices survive if lingering is on. Verify recovery:loginctl list-sessionsshows aseat0Type=wayland Active=yessession. - Never use the gdb
unsafe_modeinjection — it crashes gnome-shell. Use the extension. - The
allow-gnome-screenshotextension can silently flip to INACTIVE (its allowlist hook only re-arms on a fresh Shell init). Symptom:gnome-screenshot -fexits 0 but writes no file. A disable/enable cycle is NOT enough — it needs a Shell reload (sudo systemctl restart gdm3) to re-arm. If you depend on capture in an automated loop, verify the file exists each time and reload-on-miss. (Window Calls can flip too — re-enable after any GDM restart.) gnome-screenshot -a(interactive area) needs a human at the screen — useless headless; use-f(full) or-w(window).- Full-desktop shots are huge (multi-4K). Prefer
-w, or downscale before analysis. - ydotool needs
ydotooldalive +YDOTOOL_SOCKET/XDG_RUNTIME_DIR. Re-enable if it dies. Rawydotoolis blind absolute pixels — preferlocate-element(Capability B+) for element targeting. - A detached SSH session isn't fully in the graphical session — export
XDG_RUNTIME_DIR=/run/user/UID(+ DBUS) explicitly. A nativesystemd --userservice already has it. - Always LAN-scope RDP (3389) and SSH (22) in your firewall — never any-source.
- 确保已安装 OpenClaw(本地或 Docker 部署)
- 在对话框中输入安装命令:
/install peekaboo-linux-gnome-wayland-computer-use - 安装完成后,直接呼叫该 Skill 的名称或使用
/peekaboo-linux-gnome-wayland-computer-use触发 - 根据 Skill 的参数说明提供必要输入,即可获得结构化输出
Peekaboo Linux Gnome Wayland Computer Use 是什么?
See and control GNOME Wayland desktops using gnome-screenshot, ydotool, Window Calls, AT-SPI hybrid element coordinates, and optional GNOME Remote Desktop. T... 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件,目前累计下载 83 次。
如何安装 Peekaboo Linux Gnome Wayland Computer Use?
在 OpenClaw 或 Claude Code 对话框中运行命令「/install peekaboo-linux-gnome-wayland-computer-use」即可一键安装,无需额外配置。
Peekaboo Linux Gnome Wayland Computer Use 是免费的吗?
是的,Peekaboo Linux Gnome Wayland Computer Use 完全免费,采用 MIT-0 许可证,可自由下载、安装和使用。
Peekaboo Linux Gnome Wayland Computer Use 支持哪些平台?
Peekaboo Linux Gnome Wayland Computer Use 跨平台运行,可在任意部署了 OpenClaw / Claude Code 的环境中使用(darwin, linux)。
谁开发了 Peekaboo Linux Gnome Wayland Computer Use?
由 Kyzcreig(@kyzcreig)开发并维护,当前版本 v1.0.2。