Chapter 21

Server Tools in Practice: The web_search / web_fetch / code_execution Trio

Chapter 21: Computer Use: Complete Guide to Desktop Automation and GUI Operations

21.1 What Is Computer Use?

Computer Use is a revolutionary capability developed by Anthropic for Claude. It allows the model to observe screen state through screenshots and control computer interfaces by simulating mouse clicks, keyboard input, scrolling, and other actions. This means Claude can operate any software with a graphical interface without requiring that software to provide an API.

Compared to traditional RPA (Robotic Process Automation) tools, Computer Use's core advantages are:

Understanding interface semantics: Claude understands what buttons mean and what forms are for, not just coordinates
Error recovery capability: When encountering unexpected pop-ups or interface changes, Claude can reason about how to handle them
Natural language instructions: No recording needed — just describe the task in natural language

Enabling Computer Use requires the betas=["computer-use-2025-01-24"] parameter.

21.2 Core Tool Definitions

Computer Use provides three built-in tool types:

ComputerTool (Desktop Operations Tool)

computer_tool = {
    "type": "computer_20250124",
    "name": "computer",
    "display_width_px": 1920,
    "display_height_px": 1080,
    "display_number": 1  # X11 display number (optional)
}

Action types supported by ComputerTool:

action	Description	Required parameters
`screenshot`	Capture the current screen	None
`left_click`	Left mouse click	`coordinate: [x, y]`
`right_click`	Right mouse click	`coordinate: [x, y]`
`double_click`	Double click	`coordinate: [x, y]`
`middle_click`	Middle button click	`coordinate: [x, y]`
`left_click_drag`	Drag	`coordinate: [x, y]`, `start_coordinate: [x, y]`
`type`	Type text	`text: str`
`key`	Press key	`text: str` (xdotool format)
`scroll`	Scroll	`coordinate: [x, y]`, `direction: up/down/left/right`, `amount: int`
`mouse_move`	Move mouse	`coordinate: [x, y]`
`cursor_position`	Get current mouse position	None

TextEditorTool (Text Editing Tool)

text_editor_tool = {
    "type": "text_editor_20250124",
    "name": "str_replace_editor"
}

Commands supported by TextEditorTool:

command	Description
`view`	View file contents
`create`	Create a new file
`str_replace`	Replace a string in the file
`insert`	Insert content after a specified line
`undo_edit`	Undo the last edit

BashTool (Command Line Tool)

bash_tool = {
    "type": "bash_20250124",
    "name": "bash"
}

BashTool executes commands in a persistent shell session, maintaining state across calls (environment variables, current directory, etc.).

21.3 Complete Computer Use Implementation

Basic Implementation Framework

import anthropic
import base64
import subprocess

client = anthropic.Anthropic()

COMPUTER_USE_TOOLS = [
    {
        "type": "computer_20250124",
        "name": "computer",
        "display_width_px": 1920,
        "display_height_px": 1080
    },
    {
        "type": "text_editor_20250124",
        "name": "str_replace_editor"
    },
    {
        "type": "bash_20250124",
        "name": "bash"
    }
]

def take_screenshot() -> str:
    """Capture screen and return base64-encoded PNG"""
    subprocess.run(["scrot", "/tmp/screenshot.png"], check=True)
    with open("/tmp/screenshot.png", "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")

def execute_computer_action(action: str, **kwargs) -> dict:
    """Execute a desktop action"""
    import pyautogui
    
    if action == "screenshot":
        screenshot_data = take_screenshot()
        return {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": screenshot_data
            }
        }
    
    elif action == "left_click":
        x, y = kwargs["coordinate"]
        pyautogui.click(x, y)
        return {"result": f"Left-clicked ({x}, {y})"}
    
    elif action == "right_click":
        x, y = kwargs["coordinate"]
        pyautogui.rightClick(x, y)
        return {"result": f"Right-clicked ({x}, {y})"}
    
    elif action == "double_click":
        x, y = kwargs["coordinate"]
        pyautogui.doubleClick(x, y)
        return {"result": f"Double-clicked ({x}, {y})"}
    
    elif action == "type":
        text = kwargs["text"]
        pyautogui.typewrite(text, interval=0.05)
        return {"result": f"Typed: {text[:50]}"}
    
    elif action == "key":
        key = kwargs["text"]
        key_map = {
            "Return": "enter",
            "Escape": "esc",
            "BackSpace": "backspace",
            "Tab": "tab",
            "super": "win"
        }
        mapped = key_map.get(key, key)
        pyautogui.hotkey(*mapped.split("+"))
        return {"result": f"Pressed key: {key}"}
    
    elif action == "scroll":
        x, y = kwargs["coordinate"]
        direction = kwargs["direction"]
        amount = kwargs.get("amount", 3)
        pyautogui.moveTo(x, y)
        pyautogui.scroll(amount if direction == "up" else -amount)
        return {"result": f"Scrolled {direction} at ({x}, {y}) by {amount}"}
    
    elif action == "mouse_move":
        x, y = kwargs["coordinate"]
        pyautogui.moveTo(x, y)
        return {"result": f"Mouse moved to ({x}, {y})"}
    
    elif action == "left_click_drag":
        start = kwargs["start_coordinate"]
        end = kwargs["coordinate"]
        pyautogui.drag(end[0] - start[0], end[1] - start[1], duration=0.5, button='left')
        return {"result": f"Dragged from {start} to {end}"}
    
    return {"error": f"Unknown action: {action}"}

def execute_bash_command(command: str) -> str:
    """Execute a bash command"""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=30
    )
    output = result.stdout
    if result.stderr:
        output += f"\nSTDERR: {result.stderr}"
    return output

def execute_text_editor(command: str, path: str = "",
                        old_str: str = None, new_str: str = None,
                        insert_line: int = None) -> str:
    """Execute text editor operations"""
    if command == "view":
        with open(path, 'r', encoding='utf-8') as f:
            return f.read()
    elif command == "create":
        with open(path, 'w', encoding='utf-8') as f:
            f.write(new_str or "")
        return f"File created: {path}"
    elif command == "str_replace":
        with open(path, 'r', encoding='utf-8') as f:
            content = f.read()
        if old_str not in content:
            return "Error: string to replace not found"
        with open(path, 'w', encoding='utf-8') as f:
            f.write(content.replace(old_str, new_str, 1))
        return "Replacement successful"
    elif command == "insert":
        with open(path, 'r', encoding='utf-8') as f:
            lines = f.readlines()
        lines.insert(insert_line, new_str + '\n')
        with open(path, 'w', encoding='utf-8') as f:
            f.writelines(lines)
        return f"Inserted after line {insert_line}"
    return f"Unknown command: {command}"

Complete Tool Call Loop

def process_tool_call(tool_name: str, tool_input: dict) -> str:
    """Process a tool call and return result content"""
    
    if tool_name == "computer":
        action = tool_input["action"]
        kwargs = {k: v for k, v in tool_input.items() if k != "action"}
        result = execute_computer_action(action, **kwargs)
        
        if action == "screenshot":
            return [result]  # Image content as list
        return result.get("result", result.get("error", "Action completed"))
    
    elif tool_name == "bash":
        return execute_bash_command(tool_input["command"])
    
    elif tool_name == "str_replace_editor":
        return execute_text_editor(
            command=tool_input["command"],
            path=tool_input.get("path", ""),
            old_str=tool_input.get("old_str"),
            new_str=tool_input.get("new_str"),
            insert_line=tool_input.get("insert_line")
        )
    
    return "Unknown tool"


def run_computer_use_agent(task: str, system_prompt: str = "") -> str:
    """Run a Computer Use Agent"""
    
    default_system = """You are an AI assistant that can use a computer to complete tasks.
You have the following tools:
- computer: take screenshots, click, type, scroll, and other desktop operations
- bash: run command-line commands
- str_replace_editor: view and edit files

Guidelines:
1. Take a screenshot first to understand the current screen state
2. Take a screenshot after actions to confirm results
3. When errors occur, understand the cause before retrying
4. Take a final screenshot to confirm task completion"""
    
    messages = [{"role": "user", "content": task}]
    
    while True:
        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=4096,
            system=system_prompt or default_system,
            tools=COMPUTER_USE_TOOLS,
            messages=messages,
            betas=["computer-use-2025-01-24"]
        )
        
        if response.stop_reason == "end_turn":
            return ' '.join(b.text for b in response.content if b.type == "text")
        
        if response.stop_reason == "tool_use":
            tool_results = []
            
            for block in response.content:
                if block.type != "tool_use":
                    continue
                
                print(f"[Tool] {block.name}: {block.input.get('action', block.input)}")
                result_content = process_tool_call(block.name, block.input)
                
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result_content if isinstance(result_content, list) else str(result_content)
                })
            
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
        else:
            break
    
    return "Task complete"

21.4 Real-World Scenario Examples

Scenario 1: Auto-Fill a Web Form

task = """
Please complete the following in the browser:
1. Take a screenshot to see the current state
2. Find the username input field and click it
3. Type "[email protected]"
4. Click the password field
5. Type "SecurePass123"
6. Click the login button
7. Take a screenshot to confirm login success
"""
result = run_computer_use_agent(task)

Scenario 2: Batch File Processing

task = """
Please rename all .jpg files in ~/Downloads by adding today's date 
as a prefix in YYYYMMDD_ format.
First list the files, confirm, then execute the rename.
"""
result = run_computer_use_agent(task)

Scenario 3: Desktop Application Operation

task = """
Please open the Calculator app, compute (2024 × 365) + 100,
then take a screenshot showing the result.
"""
result = run_computer_use_agent(task)

21.5 Security Considerations and Best Practices

Sandbox Environment Setup

Computer Use has powerful system control capabilities and must run in a controlled environment:

# Docker container for isolation
# Dockerfile (based on Anthropic's official example)
"""
FROM ubuntu:22.04

RUN apt-get update && apt-get install -y \\
    python3 python3-pip \\
    xvfb x11vnc \\
    firefox-esr \\
    scrot \\
    xdotool

# Create a restricted non-root user
RUN useradd -m -s /bin/bash sandboxuser
USER sandboxuser

ENV DISPLAY=:1
"""

def start_virtual_display(width: int = 1920, height: int = 1080, display_num: int = 1):
    """Start an Xvfb virtual display"""
    import subprocess, os
    subprocess.Popen([
        "Xvfb", f":{display_num}",
        "-screen", "0", f"{width}x{height}x24"
    ])
    os.environ["DISPLAY"] = f":{display_num}"
    print(f"Virtual display started: :{display_num} ({width}x{height})")

Human Confirmation Mechanism

For high-risk operations, request human confirmation before executing:

HIGH_RISK_KEYWORDS = [
    "delete", "format", "shutdown", "send email",
    "submit form", "purchase", "transfer"
]

def confirm_before_risky_action(action_description: str) -> bool:
    """Request confirmation before high-risk operations"""
    is_risky = any(kw in action_description.lower() for kw in HIGH_RISK_KEYWORDS)
    
    if is_risky:
        print(f"\n[HIGH RISK ACTION] {action_description}")
        response = input("Continue? (y/N): ")
        return response.lower() == 'y'
    
    return True

class SafeComputerUseAgent:
    def __init__(self, require_confirmation: bool = True):
        self.require_confirmation = require_confirmation
        self.action_log = []
    
    def process_action_safely(self, tool_name: str, tool_input: dict) -> str:
        self.action_log.append({"tool": tool_name, "input": tool_input})
        
        if self.require_confirmation:
            desc = f"{tool_name}: {tool_input}"
            if not confirm_before_risky_action(desc):
                return "Action cancelled by user"
        
        return process_tool_call(tool_name, tool_input)

21.6 Performance Optimization and Debugging

Screenshot Compression

High-resolution screenshots consume many tokens; compression saves cost:

from PIL import Image
import io

def take_compressed_screenshot(max_size: tuple = (1280, 720)) -> str:
    """Capture and compress a screenshot"""
    subprocess.run(["scrot", "/tmp/screenshot_raw.png"], check=True)
    
    with Image.open("/tmp/screenshot_raw.png") as img:
        img.thumbnail(max_size, Image.LANCZOS)
        buffer = io.BytesIO()
        img.convert("RGB").save(buffer, format="JPEG", quality=85)
        compressed_data = buffer.getvalue()
    
    return base64.standard_b64encode(compressed_data).decode("utf-8")

Inter-Action Delays

GUI operations need time for the interface to respond:

import time

def click_and_wait(x: int, y: int, wait_seconds: float = 0.5):
    """Click and wait for the interface to respond"""
    import pyautogui
    pyautogui.click(x, y)
    time.sleep(wait_seconds)

def wait_for_page_load(max_wait: float = 10.0) -> bool:
    """Wait for page load to complete (based on screenshot comparison)"""
    prev_screenshot = take_screenshot()
    start_time = time.time()
    
    while time.time() - start_time < max_wait:
        time.sleep(1.0)
        current = take_screenshot()
        if current == prev_screenshot:
            return True  # No change means loading complete
        prev_screenshot = current
    
    return False  # Timed out

Debug Mode: Save All Screenshots

import os
from datetime import datetime

class DebugComputerUseAgent:
    def __init__(self, debug_dir: str = "/tmp/computer_use_debug"):
        self.debug_dir = debug_dir
        os.makedirs(debug_dir, exist_ok=True)
        self.screenshot_count = 0
    
    def save_screenshot(self, screenshot_data: str, label: str = "") -> str:
        self.screenshot_count += 1
        timestamp = datetime.now().strftime("%H%M%S")
        filename = f"{self.screenshot_count:03d}_{timestamp}_{label}.png"
        filepath = os.path.join(self.debug_dir, filename)
        
        with open(filepath, "wb") as f:
            f.write(base64.standard_b64decode(screenshot_data))
        
        print(f"Screenshot saved: {filepath}")
        return filepath

21.7 Common Issues and Solutions

Issue 1: Coordinate Precision

Claude's coordinate judgment is based on screenshot analysis and may have slight offsets:

Solutions:
1. Specify the display resolution in the system prompt
2. Take a screenshot to confirm element positions before acting
3. For critical buttons, describe visual characteristics rather than hard-coding coordinates

Issue 2: Dynamic Content Loading

The page may not finish loading immediately after a click:

# Add explicit waits after navigation actions
def navigate_and_wait(url_or_action: str):
    # Execute navigation
    execute_computer_action("key", text="Return")
    # Wait for load
    wait_for_page_load(max_wait=15.0)

Issue 3: Unexpected Pop-ups

Pop-ups interrupt normal operation flow:

Solution: Add instructions to the system prompt about handling common pop-ups:
- Close buttons are typically in the upper right corner
- Visual position of OK/Cancel buttons
- How to handle cookie consent dialogs

Summary

Computer Use upgrades Claude from a "text worker" to a genuine computer operator. Core points:

Three tool types (computer, bash, str_replace_editor) cover GUI operations, command line, and file editing
betas=["computer-use-2025-01-24"] is the required parameter to enable this feature
Security is critical: sandbox environments, human confirmation mechanisms, and action logging are all essential
Screenshot compression and appropriate inter-action delays are key optimizations for production use

The next chapter explores combining Tool Use with Extended Thinking — currently Claude's most powerful reasoning-and-action combination.

Rate this chapter

4.7 / 5 (12 ratings)