alfredjamesli

GUI Agent

by AlfredJamesLi · GitHub ↗ · v1.0.1 · MIT-0
cross-platform ⚠ suspicious
189 Downloads · 2 Stars · 0 Active Installs · 2 Versions
Install in OpenClaw
/install gui-claw
Description
GUI automation via visual detection. Clicking, typing, reading content, navigating menus, filling forms — all through screenshot → detect → act workflow. Sup...
README (SKILL.md)

GUI Agent

STEP 0: Activate Platform (MANDATORY FIRST STEP)

Before any GUI operation, run:

python3 {baseDir}/scripts/activate.py

This detects your OS, sets up the correct action commands, and outputs platform context. After running, {baseDir}/actions/_actions.yaml contains your platform's commands.
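
The detection step can be pictured with a minimal sketch like the one below. It is not the contents of activate.py: the mapping and the file names `macos.yaml` and `linux.yaml` are hypothetical, chosen only because the version history mentions action definitions for macOS and Linux.

```python
import platform

# Hypothetical mapping from platform.system() values to per-OS action files.
# The real file names used by activate.py are not shown in this README.
ACTION_FILES = {
    "Darwin": "macos.yaml",
    "Linux": "linux.yaml",
}


def pick_action_file(system_name=None):
    """Return the (hypothetical) action file for the given or current OS."""
    name = system_name if system_name is not None else platform.system()
    try:
        return ACTION_FILES[name]
    except KeyError:
        # Fail loudly instead of guessing commands for an unknown platform.
        raise RuntimeError(f"unsupported platform: {name}") from None
```

Calling `pick_action_file()` with no argument uses the running machine's OS; passing a name such as `"Linux"` makes the choice easy to test.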

Workflow

OBSERVE → LEARN → ACT → VERIFY → SAVE
  1. OBSERVE — Take screenshot → run OCR + detector → understand current state → read {baseDir}/skills/gui-observe/SKILL.md

  2. LEARN — First time with an app? Save components to memory → read {baseDir}/skills/gui-learn/SKILL.md; learn_from_screenshot() auto-outputs app tips if available

  3. ACT — Pick target → execute using _actions.yaml commands → verify → read {baseDir}/skills/gui-act/SKILL.md; read {baseDir}/actions/_actions.yaml for available commands

  4. VERIFY — Screenshot again → confirm action succeeded

  5. SAVE — Record state transitions to memory → read {baseDir}/skills/gui-memory/SKILL.md for memory structure
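
The five phases above can be sketched as a single function that wires them together. This is an illustration only: `run_step` and all of its callable parameters are hypothetical stand-ins, not functions shipped with the skill, and the success check is whatever the caller supplies.

```python
def run_step(observe, act, succeeded, save, target, learn=None):
    """Run one OBSERVE -> LEARN -> ACT -> VERIFY -> SAVE pass.

    Every argument except `target` is a caller-supplied callable
    (hypothetical stand-ins for the sub-skills). Returns (ok, phases_run).
    """
    phases = []

    before = observe()              # OBSERVE: capture the current state
    phases.append("observe")

    if learn is not None:           # LEARN: only when the app is new
        learn(before)
        phases.append("learn")

    act(target, before)             # ACT: the action is based on what was observed
    phases.append("act")

    after = observe()               # VERIFY: observe again and compare
    ok = succeeded(before, after)
    phases.append("verify")

    if ok:                          # SAVE: record the transition only on success
        save(before, target, after)
        phases.append("save")
    return ok, phases
```

Keeping the phases as injected callables means the loop can be exercised with fakes before any real screenshot or click is involved.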

Core Rules

  • Coordinates from detection only — OCR or GPA-GUI-Detector, NEVER from guessing
  • Look before you act — every action must be justified by what you observed
  • image tool = understanding only — use it to decide WHAT to click, get WHERE from OCR/detector
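
As a sketch of the first rule, the snippet below derives a click point only from a detected bounding box and returns nothing when the target was not detected. The `Detection` type, its `(x, y, width, height)` box layout, and `click_point` are assumptions made for illustration; the actual detector output format is described in the sub-skills, not here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Detection:
    """Hypothetical OCR/detector result: a label plus an (x, y, w, h) box."""
    label: str
    box: Tuple[int, int, int, int]


def click_point(detections: List[Detection], wanted: str) -> Optional[Tuple[int, int]]:
    """Return the center of the first detection whose label matches `wanted`.

    Returns None when nothing matches, so the caller must re-observe
    rather than fall back to a guessed coordinate.
    """
    key = wanted.strip().lower()
    for det in detections:
        if det.label.strip().lower() == key:
            x, y, w, h = det.box
            return (x + w // 2, y + h // 2)
    return None
```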

Sub-Skills Reference

Sub-Skill                      When to read
skills/gui-observe/SKILL.md    Before screenshots or detection
skills/gui-learn/SKILL.md      Before learning a new app
skills/gui-act/SKILL.md        Before any click/type action
skills/gui-memory/SKILL.md     For memory structure details
skills/gui-workflow/SKILL.md   For multi-step navigation
skills/gui-setup/SKILL.md      For first-time machine setup
skills/gui-report/SKILL.md     For task performance reporting
Usage Guidance
This package is broadly coherent for GUI automation but has several items you should verify before installing or running:
  1. Inspect scripts/setup.sh, scripts/gui_action.py, scripts/backends/http_remote.py and scripts/backends/ssh_remote (if present) to understand what is sent to remote hosts and whether screenshots/inputs could be exfiltrated.
  2. Review skills/gui-report/scripts/tracker.py: it reads ~/.openclaw/.../sessions/sessions.json to collect token/session info and will write logs and a .tracker_state.json file; decide whether that access is acceptable.
  3. Run any installation or the setup script in an isolated environment (throwaway VM or container) first; the setup will create ~/gui-agent-env and download large models into your home.
  4. If you will use remote control (--remote), restrict the endpoints to trusted hosts and audit the remote server implementation; remote endpoints can execute clicks/typing and receive screenshots.
  5. Do not grant accessibility or elevated permissions until you confirm the exact commands the skill will run; after testing, remove permissions you do not trust.
  6. If unsure, ask the author for a minimal install/run checklist or a signed release; consider code review by a trusted party before enabling this in a production agent.
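
For the advice on restricting remote endpoints, one possible guard is a host allowlist checked before a URL is ever passed to --remote. The sketch below is an assumption-laden example, not part of the skill: `TRUSTED_HOSTS` and `is_trusted_remote` are hypothetical names, and the exact URL forms the skill accepts for HTTP and SSH are not documented here.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; replace with the hosts you actually control.
TRUSTED_HOSTS = frozenset({"127.0.0.1", "localhost"})


def is_trusted_remote(url, trusted=TRUSTED_HOSTS):
    """Return True only if `url` uses an expected scheme and an allowlisted host."""
    parts = urlsplit(url)
    if parts.scheme not in {"http", "https", "ssh"}:
        return False
    host = parts.hostname  # already lowercased; None if the URL has no host
    # Exact match only, so "localhost.evil.example" does not pass as "localhost".
    return host is not None and host in trusted
```

An exact-match allowlist is deliberately strict: suffix or substring checks are easy to bypass with look-alike hostnames.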
Capability Analysis
Type: OpenClaw Skill
Name: gui-claw
Version: 1.0.1

The gui-claw skill bundle is a legitimate and highly sophisticated GUI automation framework designed for local and remote desktop interaction. It utilizes YOLO-based object detection (GPA-GUI-Detector), OCR (Apple Vision/EasyOCR), and template matching to perceive screen states, which are then managed through a structured memory system in app_memory.py. While the bundle includes powerful capabilities such as remote command execution via http_remote.py and clipboard manipulation in platform_input.py, these features are strictly aligned with its stated purpose of GUI automation and benchmarking (e.g., OSWorld). No evidence of malicious intent, data exfiltration, or unauthorized persistence was found.
Capability Assessment
Purpose & Capability
Name/description align with the included code: screenshot → detect → act workflow, OCR, visual memory, local and remote backends (HTTP/SSH). Heavy ML deps and a setup script are proportionate to the stated detection/OCR features.
Instruction Scope
Runtime instructions ask the operator to run scripts (activate.py, setup.sh) that detect platform, create venvs, download models, and produce actions/_actions.yaml. The code (gui_action.py + backends) supports --remote <URL> (HTTP/SSH) which will send/receive commands/screenshots to arbitrary hosts. Tracker and memory code read/write files under the user's home and OpenClaw workspace (e.g., ~/.openclaw sessions, memory/apps), so the skill accesses data outside its own directory without declaring that scope.
Install Mechanism
No registry install spec is declared (instruction-only), but scripts/setup.sh and README instruct the user to create a home venv, install heavy packages (PyTorch/YOLO/etc.) and clone models from HuggingFace — these are expected but are intrusive (large downloads, system package installs). The install flow relies on network downloads from public sources (GitHub/HuggingFace).
Credentials
The skill declares no required env vars/credentials, yet code reads OpenClaw session files (~/.openclaw/.../sessions.json) to extract token/session info, and reads/writes memory under user/home (~/GPA-GUI-Detector, ~/gui-agent-env, skill memory, logs). Accessing session/token data and user memory is sensitive; these accesses aren't documented as required credentials in the metadata.
Persistence & Privilege
The skill does not set always:true and does not request platform-wide privileges explicitly. However, setup.sh and other scripts create persistent artifacts in the user's home (venv, downloaded models, memory directories, logs, actions/_actions.yaml), and tracker auto-saves/rotates session state — these are persistent changes that the user should review before running.
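
To review those persistent changes, a small check run before and after setup can show which home-directory artifacts exist. This sketch covers only the two top-level paths named in this assessment (~/gui-agent-env and ~/GPA-GUI-Detector); the helper name `existing_artifacts` is hypothetical, and other locations such as memory directories and logs are not covered because their exact paths are not given.

```python
from pathlib import Path

# Top-level home-directory artifacts named in the assessment above.
ARTIFACT_NAMES = ("gui-agent-env", "GPA-GUI-Detector")


def existing_artifacts(home=None):
    """Return the artifact names that currently exist under `home`.

    `home` defaults to the current user's home directory; passing another
    directory makes the check usable inside a throwaway VM or a test.
    """
    base = Path(home) if home is not None else Path.home()
    return [name for name in ARTIFACT_NAMES if (base / name).exists()]
```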
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install gui-claw
  3. After installation, invoke the skill by name or use /gui-claw
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.1
gui-claw 1.0.1
  • Major expansion of documentation and benchmarks, including detailed design principles, workflow descriptions, and visual method guidance.
  • Added OS-specific action definitions for Linux and macOS.
  • Introduced platform detection and setup scripts.
  • Expanded memory/app metadata coverage for multiple desktop apps.
  • Initial support for both macOS and Linux automated GUI actions.
v1.0.0
Initial release (v1.0.0)
  • Vision-based GUI automation skill for macOS using GPA-GUI-Detector + OCR
  • Detection-first design: all click coordinates from detectors, never from LLM estimation
  • Visual memory system: component templates, activity-based forgetting, state identification
  • State graph navigation: automatic transition recording, BFS path planning
  • Hierarchical verification: template matching → full detection → VLM fallback
  • OSWorld Chrome domain benchmark: 97.8% success rate (45/46 tasks)
  • Sub-skills: gui-observe, gui-act, gui-learn, gui-memory, gui-workflow, gui-report, gui-setup
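
The "BFS path planning" item refers to breadth-first search over recorded state transitions. A generic sketch of that technique follows; the graph layout (state name mapped to `{action: next_state}`) and the function name `shortest_path` are assumptions for illustration, not the skill's actual memory format.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search over a state graph.

    `graph` maps a state name to {action: next_state} (hypothetical layout).
    Returns the shortest list of actions from `start` to `goal`,
    an empty list if they are the same state, or None if unreachable.
    """
    if start == goal:
        return []
    came_from = {start: None}       # state -> (previous_state, action)
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for action, nxt in graph.get(state, {}).items():
            if nxt in came_from:
                continue            # already reached by a path at least as short
            came_from[nxt] = (state, action)
            if nxt == goal:
                # Walk back to the start, collecting actions in reverse.
                path = []
                node = goal
                while came_from[node] is not None:
                    prev, act = came_from[node]
                    path.append(act)
                    node = prev
                return path[::-1]
            queue.append(nxt)
    return None
```

Because BFS explores states in order of distance, the first time the goal is reached the action sequence is one with the fewest steps.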
Metadata
Slug gui-claw
Version 1.0.1
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 2
Frequently Asked Questions

What is GUI Agent?

GUI Agent is an AI Agent Skill for Claude Code / OpenClaw that automates GUIs via visual detection: clicking, typing, reading content, navigating menus, and filling forms, all through a screenshot → detect → act workflow. It has 189 downloads so far.

How do I install GUI Agent?

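Run "/install gui-claw" in the OpenClaw or Claude Code chat to install it. Before the first GUI operation, run the activation script from STEP 0 of the README; first-time machine setup is covered in skills/gui-setup/SKILL.md.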

Is GUI Agent free?

Yes, GUI Agent is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does GUI Agent support?

GUI Agent is tagged cross-platform. The v1.0.1 release notes list automated GUI actions for macOS and Linux; v1.0.0 targeted macOS only.

Who created GUI Agent?

It is built and maintained by AlfredJamesLi (@alfredjamesli); the current version is v1.0.1.
