GUI Agent

Name: GUI Agent
Author: alfredjamesli

by AlfredJamesLi · GitHub ↗ · v1.0.1 · MIT-0

cross-platform ⚠ suspicious

189 Downloads

Install in OpenClaw

/install gui-claw

Description

GUI automation via visual detection. Clicking, typing, reading content, navigating menus, filling forms — all through screenshot → detect → act workflow. Sup...

README (SKILL.md)

GUI Agent

STEP 0: Activate Platform (MANDATORY FIRST STEP)

Before any GUI operation, run:

python3 {baseDir}/scripts/activate.py

This detects your OS, sets up the correct action commands, and outputs platform context. After running, {baseDir}/actions/_actions.yaml contains your platform's commands.
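
The README does not show what activate.py does internally. As a rough illustration of the "detect OS, then select platform commands" idea, a minimal sketch might look like the following. The file names `macos.yaml` and `linux.yaml` and the function `pick_action_file` are hypothetical, chosen only for this example.

```python
import platform

# Hypothetical mapping from OS name to an action-definition file.
# The real file names used by activate.py are not shown in this README.
ACTION_FILES = {
    "Darwin": "macos.yaml",
    "Linux": "linux.yaml",
}


def pick_action_file(system_name: str) -> str:
    """Return the action file for an OS name as reported by platform.system()."""
    try:
        return ACTION_FILES[system_name]
    except KeyError:
        raise RuntimeError(f"unsupported platform: {system_name}") from None
```

Calling `pick_action_file(platform.system())` selects the entry for the current machine and fails loudly on a platform with no action definitions.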

Workflow

OBSERVE → LEARN → ACT → VERIFY → SAVE

OBSERVE — Take screenshot → run OCR + detector → understand current state → read {baseDir}/skills/gui-observe/SKILL.md
LEARN — First time with an app? Save components to memory → read {baseDir}/skills/gui-learn/SKILL.md → learn_from_screenshot() auto-outputs app tips if available
ACT — Pick target → execute using _actions.yaml commands → verify → read {baseDir}/skills/gui-act/SKILL.md → read {baseDir}/actions/_actions.yaml for available commands
VERIFY — Screenshot again → confirm action succeeded
SAVE — Record state transitions to memory → read {baseDir}/skills/gui-memory/SKILL.md for memory structure
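
The five phases above can be sketched as a single cycle. This is only an illustration under stated assumptions: `run_cycle` and its callback parameters are hypothetical names, the state is modelled as a plain dict, and "the state changed" stands in for whatever success check the real skill performs.

```python
from typing import Callable, Optional

State = dict  # placeholder for whatever the OCR + detector output looks like


def run_cycle(
    observe: Callable[[], State],
    act: Callable[[State], None],
    save: Callable[[State, State], None],
    learn: Optional[Callable[[State], None]] = None,
) -> bool:
    """Run one OBSERVE → LEARN → ACT → VERIFY → SAVE cycle.

    Returns True when the observed state changed after acting.
    """
    before = observe()         # OBSERVE: screenshot + OCR + detector
    if learn is not None:      # LEARN: only needed for an unfamiliar app
        learn(before)
    act(before)                # ACT: the target is chosen from the observed state
    after = observe()          # VERIFY: look again after acting
    changed = after != before  # assumption: a changed state means success
    if changed:
        save(before, after)    # SAVE: record the state transition
    return changed
```

Passing the phases in as callables keeps the loop independent of any particular screenshot or detection backend.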

Core Rules

Coordinates from detection only — OCR or GPA-GUI-Detector, NEVER from guessing
Look before you act — every action must be justified by what you observed
image tool = understanding only — use it to decide WHAT to click, get WHERE from OCR/detector
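
To make the first rule concrete, here is a small sketch of deriving a click point purely from a detection result. The `Detection` record and its field layout (top-left corner plus width and height, in pixels) are assumptions for illustration; the actual output format of the OCR and GPA-GUI-Detector stages is not shown here.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Detection:
    """Hypothetical detection record: a label plus a pixel bounding box."""
    label: str
    x: int       # left edge
    y: int       # top edge
    width: int
    height: int


def click_point(det: Detection) -> Tuple[int, int]:
    """Return the centre of the detected bounding box."""
    return (det.x + det.width // 2, det.y + det.height // 2)


def find_target(detections: Iterable[Detection], label: str) -> Tuple[int, int]:
    """Return the click point of the first detection with the given label.

    Raises LookupError instead of falling back to a guessed coordinate.
    """
    for det in detections:
        if det.label == label:
            return click_point(det)
    raise LookupError(f"no detection labelled {label!r}; observe again")
```

Raising when nothing matches, and never returning a default coordinate, is what enforces "never from guessing" in this sketch.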

Sub-Skills Reference

Sub-Skill	When to read
`skills/gui-observe/SKILL.md`	Before screenshots or detection
`skills/gui-learn/SKILL.md`	Before learning a new app
`skills/gui-act/SKILL.md`	Before any click/type action
`skills/gui-memory/SKILL.md`	For memory structure details
`skills/gui-workflow/SKILL.md`	For multi-step navigation
`skills/gui-setup/SKILL.md`	For first-time machine setup
`skills/gui-report/SKILL.md`	For task performance reporting

Usage Guidance

This package is broadly coherent for GUI automation, but you should verify several items before installing or running it:

1) Inspect scripts/setup.sh, scripts/gui_action.py, scripts/backends/http_remote.py and scripts/backends/ssh_remote (if present) to understand what is sent to remote hosts and whether screenshots or inputs could be exfiltrated.
2) Review skills/gui-report/scripts/tracker.py: it reads ~/.openclaw/.../sessions/sessions.json to collect token/session info and writes logs and a .tracker_state.json file. Decide whether that access is acceptable.
3) Run any installation or the setup script in an isolated environment (throwaway VM or container) first. The setup creates ~/gui-agent-env and downloads large models into your home directory.
4) If you will use remote control (--remote), restrict the endpoints to trusted hosts and audit the remote server implementation. Remote endpoints can execute clicks and typing and receive screenshots.
5) Do not grant accessibility or elevated permissions until you confirm the exact commands the skill will run. After testing, remove permissions you do not trust.
6) If unsure, ask the author for a minimal install/run checklist or a signed release, and consider code review by a trusted party before enabling this in a production agent.
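
For the advice on restricting --remote endpoints, one possible approach is to check a URL against an allowlist before it is ever passed to the skill. This is a sketch, not part of the package: `TRUSTED_HOSTS` and `is_trusted_remote` are hypothetical names, and the accepted schemes are assumed from the HTTP/SSH backends mentioned on this page.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; replace with the hosts you actually control.
TRUSTED_HOSTS = frozenset({"127.0.0.1", "localhost"})

# Schemes assumed from the HTTP and SSH backends described on this page.
ALLOWED_SCHEMES = frozenset({"http", "https", "ssh"})


def is_trusted_remote(url: str, trusted=TRUSTED_HOSTS) -> bool:
    """Return True only if the URL uses an allowed scheme and a trusted host."""
    parts = urlsplit(url)
    # hostname is lowercased and excludes any user info and port
    return parts.scheme in ALLOWED_SCHEMES and parts.hostname in trusted
```

Comparing the parsed hostname exactly, not by substring, avoids accepting look-alike hosts such as `localhost.example.com`.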

Capability Analysis

Type: OpenClaw Skill
Name: gui-claw
Version: 1.0.1

The gui-claw skill bundle is a legitimate and highly sophisticated GUI automation framework designed for local and remote desktop interaction. It utilizes YOLO-based object detection (GPA-GUI-Detector), OCR (Apple Vision/EasyOCR), and template matching to perceive screen states, which are then managed through a structured memory system in app_memory.py. While the bundle includes powerful capabilities such as remote command execution via http_remote.py and clipboard manipulation in platform_input.py, these features are strictly aligned with its stated purpose of GUI automation and benchmarking (e.g., OSWorld). No evidence of malicious intent, data exfiltration, or unauthorized persistence was found.

Capability Assessment

✓ Purpose & Capability

Name/description align with the included code: screenshot → detect → act workflow, OCR, visual memory, local and remote backends (HTTP/SSH). Heavy ML deps and a setup script are proportionate to the stated detection/OCR features.

⚠ Instruction Scope

Runtime instructions ask the operator to run scripts (activate.py, setup.sh) that detect platform, create venvs, download models, and produce actions/_actions.yaml. The code (gui_action.py + backends) supports --remote <URL> (HTTP/SSH) which will send/receive commands/screenshots to arbitrary hosts. Tracker and memory code read/write files under the user's home and OpenClaw workspace (e.g., ~/.openclaw sessions, memory/apps), so the skill accesses data outside its own directory without declaring that scope.

ℹ Install Mechanism

No registry install spec is declared (instruction-only), but scripts/setup.sh and README instruct the user to create a home venv, install heavy packages (PyTorch/YOLO/etc.) and clone models from HuggingFace — these are expected but are intrusive (large downloads, system package installs). The install flow relies on network downloads from public sources (GitHub/HuggingFace).

⚠ Credentials

The skill declares no required env vars/credentials, yet code reads OpenClaw session files (~/.openclaw/.../sessions.json) to extract token/session info, and reads/writes memory under user/home (~/GPA-GUI-Detector, ~/gui-agent-env, skill memory, logs). Accessing session/token data and user memory is sensitive; these accesses aren't documented as required credentials in the metadata.

ℹ Persistence & Privilege

The skill does not set always:true and does not request platform-wide privileges explicitly. However, setup.sh and other scripts create persistent artifacts in the user's home (venv, downloaded models, memory directories, logs, actions/_actions.yaml), and tracker auto-saves/rotates session state — these are persistent changes that the user should review before running.
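
When reviewing what a trial run left behind, a short check for the home-directory artifacts named in this assessment can help. The sketch below only covers the two directory names that appear on this page (gui-agent-env and GPA-GUI-Detector); `find_artifacts` is a hypothetical helper, and other locations such as logs or memory directories are not listed because their exact paths are not given here.

```python
from pathlib import Path
from typing import List, Sequence

# Directory names taken from the assessment text on this page.
ARTIFACT_NAMES = ("gui-agent-env", "GPA-GUI-Detector")


def find_artifacts(home: Path, names: Sequence[str] = ARTIFACT_NAMES) -> List[Path]:
    """Return the artifact paths that currently exist under the given home directory."""
    return [home / name for name in names if (home / name).exists()]
```

Running `find_artifacts(Path.home())` before and after setup shows which of these directories the setup created.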

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install gui-claw
After installation, invoke the skill by name or use /gui-claw
Provide required inputs per the skill's parameter spec and get structured output

Version History

v1.0.1

gui-claw 1.0.1
- Major expansion of documentation and benchmarks, including detailed design principles, workflow descriptions, and visual method guidance.
- Added OS-specific action definitions for Linux and macOS.
- Introduced platform detection and setup scripts.
- Expanded memory/app metadata coverage for multiple desktop apps.
- Initial support for both macOS and Linux automated GUI actions.

v1.0.0

Initial release (v1.0.0)
- Vision-based GUI automation skill for macOS using GPA-GUI-Detector + OCR
- Detection-first design: all click coordinates from detectors, never from LLM estimation
- Visual memory system: component templates, activity-based forgetting, state identification
- State graph navigation: automatic transition recording, BFS path planning
- Hierarchical verification: template matching → full detection → VLM fallback
- OSWorld Chrome domain benchmark: 97.8% success rate (45/46 tasks)
- Sub-skills: gui-observe, gui-act, gui-learn, gui-memory, gui-workflow, gui-report, gui-setup
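
The v1.0.0 notes mention BFS path planning over a graph of recorded state transitions. The skill's own graph format is not shown on this page, so the following is a generic breadth-first search sketch that assumes a hypothetical structure: a dict mapping each state name to a dict of `{action: next_state}`.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical graph shape: state -> {action label -> resulting state}
Graph = Dict[str, Dict[str, str]]


def plan_path(graph: Graph, start: str, goal: str) -> Optional[List[str]]:
    """Return the shortest list of actions from start to goal, or None if unreachable."""
    if start == goal:
        return []
    parent = {start: None}  # state -> (previous state, action taken), None for start
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for action, nxt in graph.get(state, {}).items():
            if nxt in parent:
                continue  # already reached by a path that is no longer
            parent[nxt] = (state, action)
            if nxt == goal:
                # Walk back to the start, collecting actions in reverse order.
                actions = []
                step = parent[nxt]
                while step is not None:
                    prev, act = step
                    actions.append(act)
                    step = parent[prev]
                return actions[::-1]
            queue.append(nxt)
    return None
```

Because BFS explores states in order of distance from the start, the first path that reaches the goal uses the fewest recorded transitions.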

Metadata

Slug gui-claw

Version 1.0.1

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 2

Frequently Asked Questions

What is GUI Agent?

GUI Agent is an AI agent skill for Claude Code / OpenClaw that performs GUI automation via visual detection: clicking, typing, reading content, navigating menus, and filling forms, all through a screenshot → detect → act workflow. It has 189 downloads so far.

How do I install GUI Agent?

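Run "/install gui-claw" in the OpenClaw or Claude Code chat to install it. Before any GUI operation, the README requires running python3 {baseDir}/scripts/activate.py, and first-time machine setup is covered in skills/gui-setup/SKILL.md.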

Is GUI Agent free?

Yes, GUI Agent is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does GUI Agent support?

GUI Agent is tagged cross-platform. Version 1.0.1 ships OS-specific action definitions for macOS and Linux.

Who created GUI Agent?

It is built and maintained by AlfredJamesLi (@alfredjamesli); the current version is v1.0.1.
