Chapter 21: Computer Use: Complete Guide to Desktop Automation and GUI Operations
21.1 What Is Computer Use?
Computer Use is a capability Anthropic developed for Claude. It lets the model observe screen state through screenshots and control a computer by simulating mouse clicks, keyboard input, scrolling, and other actions. Claude can therefore operate software that has a graphical interface even when that software provides no API.
Compared to traditional RPA (Robotic Process Automation) tools, Computer Use's core advantages are:
- Understanding interface semantics: Claude understands what buttons mean and what forms are for, not just coordinates
- Error recovery capability: When encountering unexpected pop-ups or interface changes, Claude can reason about how to handle them
- Natural language instructions: No recording needed; just describe the task in natural language
Enabling Computer Use requires passing betas=["computer-use-2025-01-24"] with the request. In the Python SDK this parameter is accepted by client.beta.messages.create.
21.2 Core Tool Definitions
Computer Use provides three built-in tool types:
ComputerTool (Desktop Operations Tool)
```python
computer_tool = {
    "type": "computer_20250124",
    "name": "computer",
    "display_width_px": 1920,
    "display_height_px": 1080,
    "display_number": 1,  # X11 display number (optional)
}
```
Action types supported by ComputerTool:
| action | Description | Required parameters |
|---|---|---|
| screenshot | Capture the current screen | None |
| left_click | Left mouse click | coordinate: [x, y] |
| right_click | Right mouse click | coordinate: [x, y] |
| double_click | Double click | coordinate: [x, y] |
| middle_click | Middle button click | coordinate: [x, y] |
| left_click_drag | Drag | start_coordinate: [x, y], coordinate: [x, y] |
| type | Type text | text: str |
| key | Press key or key combination | text: str (xdotool format) |
| scroll | Scroll | coordinate: [x, y], scroll_direction: up/down/left/right, scroll_amount: int |
| mouse_move | Move mouse | coordinate: [x, y] |
| cursor_position | Get current mouse position | None |
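The required-parameter column lends itself to a quick check before an action is dispatched. The following is a minimal sketch of one way to do that, not part of any SDK: `REQUIRED_PARAMS` and `validate_action` are hypothetical names, only the coordinate and text parameters are checked, and the bounds check assumes the 1920x1080 display declared in the tool definition.

```python
# Hypothetical helper: required parameters per action, taken from the table above.
REQUIRED_PARAMS = {
    "screenshot": (),
    "cursor_position": (),
    "left_click": ("coordinate",),
    "right_click": ("coordinate",),
    "double_click": ("coordinate",),
    "middle_click": ("coordinate",),
    "mouse_move": ("coordinate",),
    "scroll": ("coordinate",),  # direction and amount are not checked in this sketch
    "left_click_drag": ("coordinate", "start_coordinate"),
    "type": ("text",),
    "key": ("text",),
}


def validate_action(tool_input: dict, width: int = 1920, height: int = 1080) -> list[str]:
    """Return a list of problems found in a computer-tool input (empty if none)."""
    action = tool_input.get("action")
    if action not in REQUIRED_PARAMS:
        return [f"unknown action: {action!r}"]
    problems = []
    for name in REQUIRED_PARAMS[action]:
        if name not in tool_input:
            problems.append(f"missing parameter: {name}")
    for name in ("coordinate", "start_coordinate"):
        if name in tool_input:
            point = tool_input[name]
            if not (isinstance(point, (list, tuple)) and len(point) == 2):
                problems.append(f"{name} must be [x, y]")
                continue
            x, y = point
            if not (0 <= x < width and 0 <= y < height):
                problems.append(
                    f"{name} {list(point)} is outside the {width}x{height} display"
                )
    return problems
```

Returning the problem list as the tool result, instead of raising, gives the model a chance to correct its own request on the next turn.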
TextEditorTool (Text Editing Tool)
```python
text_editor_tool = {
    "type": "text_editor_20250124",
    "name": "str_replace_editor",
}
```
Commands supported by TextEditorTool:
| command | Description |
|---|---|
| view | View file contents |
| create | Create a new file |
| str_replace | Replace a string in the file |
| insert | Insert content after a specified line |
| undo_edit | Undo the last edit |
BashTool (Command Line Tool)
```python
bash_tool = {
    "type": "bash_20250124",
    "name": "bash",
}
```
BashTool executes commands in a persistent shell session, maintaining state across calls (environment variables, current directory, etc.).
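Because the tool is only a schema, the persistence has to be provided by the code that executes the commands. The following is a minimal sketch of one way to keep a single bash process alive between calls; `PersistentShell` is a hypothetical name, bash is assumed to be on the PATH, and there is no timeout handling, so a command that never finishes would block `run`.

```python
import subprocess
import uuid


class PersistentShell:
    """Hypothetical wrapper that keeps one bash process alive across commands."""

    def __init__(self) -> None:
        self.proc = subprocess.Popen(
            ["bash"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
            bufsize=1,  # line-buffered
        )

    def run(self, command: str) -> str:
        """Run a command and return everything it printed."""
        marker = f"__DONE_{uuid.uuid4().hex}__"
        # The echoed marker tells us where this command's output ends.
        self.proc.stdin.write(f"{command}\necho {marker}\n")
        self.proc.stdin.flush()
        chunks = []
        while True:
            line = self.proc.stdout.readline()
            if line == "":  # bash exited
                break
            stripped = line.rstrip("\n")
            if stripped.endswith(marker):
                # Keep any output that preceded the marker on the same line.
                chunks.append(stripped[: -len(marker)])
                break
            chunks.append(line)
        return "".join(chunks)

    def close(self) -> None:
        self.proc.stdin.close()
        self.proc.wait(timeout=5)
```

With this approach an `export` or `cd` issued in one call is still in effect for the next, which is the behavior the tool description promises.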
21.3 Complete Computer Use Implementation
Basic Implementation Framework
```python
import base64
import subprocess

import anthropic

client = anthropic.Anthropic()

COMPUTER_USE_TOOLS = [
    {
        "type": "computer_20250124",
        "name": "computer",
        "display_width_px": 1920,
        "display_height_px": 1080,
    },
    {
        "type": "text_editor_20250124",
        "name": "str_replace_editor",
    },
    {
        "type": "bash_20250124",
        "name": "bash",
    },
]


def take_screenshot() -> str:
    """Capture the screen and return a base64-encoded PNG."""
    # scrot 1.x does not replace an existing file unless --overwrite is given.
    subprocess.run(["scrot", "--overwrite", "/tmp/screenshot.png"], check=True)
    with open("/tmp/screenshot.png", "rb") as f:
        return base64.standard_b64encode(f.read()).decode("utf-8")
```
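```python
def execute_computer_action(action: str, **kwargs) -> dict:
    """Execute a desktop action and return a result dict."""
    import pyautogui  # imported lazily because it needs a running display

    if action == "screenshot":
        screenshot_data = take_screenshot()
        return {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": screenshot_data,
            },
        }
    elif action == "left_click":
        x, y = kwargs["coordinate"]
        pyautogui.click(x, y)
        return {"result": f"Left-clicked ({x}, {y})"}
    elif action == "right_click":
        x, y = kwargs["coordinate"]
        pyautogui.rightClick(x, y)
        return {"result": f"Right-clicked ({x}, {y})"}
    elif action == "double_click":
        x, y = kwargs["coordinate"]
        pyautogui.doubleClick(x, y)
        return {"result": f"Double-clicked ({x}, {y})"}
    elif action == "middle_click":
        x, y = kwargs["coordinate"]
        pyautogui.middleClick(x, y)
        return {"result": f"Middle-clicked ({x}, {y})"}
    elif action == "type":
        text = kwargs["text"]
        pyautogui.write(text, interval=0.05)
        return {"result": f"Typed: {text[:50]}"}
    elif action == "key":
        key = kwargs["text"]
        # Translate xdotool key names to pyautogui names one key at a time,
        # so combinations such as "ctrl+Return" are mapped correctly.
        key_map = {
            "Return": "enter",
            "Escape": "esc",
            "BackSpace": "backspace",
            "Tab": "tab",
            "super": "win",
        }
        keys = [key_map.get(k, k.lower()) for k in key.split("+")]
        pyautogui.hotkey(*keys)
        return {"result": f"Pressed key: {key}"}
    elif action == "scroll":
        x, y = kwargs["coordinate"]
        # computer_20250124 sends scroll_direction / scroll_amount; the
        # shorter names are accepted as a fallback.
        direction = kwargs.get("scroll_direction", kwargs.get("direction", "down"))
        amount = kwargs.get("scroll_amount", kwargs.get("amount", 3))
        pyautogui.moveTo(x, y)
        if direction in ("up", "down"):
            pyautogui.scroll(amount if direction == "up" else -amount)
        else:
            pyautogui.hscroll(amount if direction == "right" else -amount)
        return {"result": f"Scrolled {direction} at ({x}, {y}) by {amount}"}
    elif action == "mouse_move":
        x, y = kwargs["coordinate"]
        pyautogui.moveTo(x, y)
        return {"result": f"Mouse moved to ({x}, {y})"}
    elif action == "left_click_drag":
        start = kwargs["start_coordinate"]
        end = kwargs["coordinate"]
        # Move to the start point first; drag() alone is relative to wherever
        # the cursor currently is.
        pyautogui.moveTo(start[0], start[1])
        pyautogui.dragTo(end[0], end[1], duration=0.5, button="left")
        return {"result": f"Dragged from {start} to {end}"}
    elif action == "cursor_position":
        x, y = pyautogui.position()
        return {"result": f"Cursor at ({x}, {y})"}
    return {"error": f"Unknown action: {action}"}
```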
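```python
def execute_bash_command(command: str) -> str:
    """Execute a bash command in a fresh shell and return its output."""
    # Note: every call starts a new shell, so the working directory and
    # environment variables do not carry over between calls.
    try:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=30
        )
    except subprocess.TimeoutExpired:
        return "Error: command timed out after 30 seconds"
    output = result.stdout
    if result.stderr:
        output += f"\nSTDERR: {result.stderr}"
    return output
```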
```python
from typing import Optional


def execute_text_editor(command: str, path: str = "",
                        old_str: Optional[str] = None,
                        new_str: Optional[str] = None,
                        insert_line: Optional[int] = None) -> str:
    """Execute text editor operations."""
    if command == "view":
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    elif command == "create":
        # The file content is expected in new_str.
        with open(path, "w", encoding="utf-8") as f:
            f.write(new_str or "")
        return f"File created: {path}"
    elif command == "str_replace":
        with open(path, "r", encoding="utf-8") as f:
            content = f.read()
        if not old_str or old_str not in content:
            return "Error: string to replace not found"
        with open(path, "w", encoding="utf-8") as f:
            # Replace only the first occurrence; a missing new_str deletes old_str.
            f.write(content.replace(old_str, new_str or "", 1))
        return "Replacement successful"
    elif command == "insert":
        if insert_line is None:
            return "Error: insert_line is required"
        with open(path, "r", encoding="utf-8") as f:
            lines = f.readlines()
        # Index n in a 0-based list is the position right after line n.
        lines.insert(insert_line, (new_str or "") + "\n")
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(lines)
        return f"Inserted after line {insert_line}"
    return f"Unknown command: {command}"
```
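The editor function above does not handle the undo_edit command listed in the TextEditorTool table. One way to support it is to snapshot a file before every modifying command. The following is a minimal sketch under that assumption; `EditHistory` and its methods are hypothetical names, and the history lives only in memory.

```python
from collections import defaultdict


class EditHistory:
    """Hypothetical per-file undo stack for the text editor commands."""

    def __init__(self) -> None:
        self._snapshots = defaultdict(list)

    def snapshot(self, path: str) -> None:
        """Record the current contents of path; call this before each edit."""
        with open(path, "r", encoding="utf-8") as f:
            self._snapshots[path].append(f.read())

    def undo(self, path: str) -> str:
        """Restore the most recent snapshot of path, if there is one."""
        if not self._snapshots[path]:
            return f"Error: no edit history for {path}"
        with open(path, "w", encoding="utf-8") as f:
            f.write(self._snapshots[path].pop())
        return f"Last edit to {path} undone"
```

A dispatcher would call `snapshot(path)` before str_replace and insert, and route undo_edit to `undo(path)`.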
Complete Tool Call Loop
```python
from typing import Union


def process_tool_call(tool_name: str, tool_input: dict) -> Union[str, list]:
    """Process a tool call and return result content (text, or a list of content blocks)."""
    if tool_name == "computer":
        action = tool_input["action"]
        kwargs = {k: v for k, v in tool_input.items() if k != "action"}
        result = execute_computer_action(action, **kwargs)
        if action == "screenshot":
            return [result]  # Image content as list
        return result.get("result", result.get("error", "Action completed"))
    elif tool_name == "bash":
        return execute_bash_command(tool_input["command"])
    elif tool_name == "str_replace_editor":
        return execute_text_editor(
            command=tool_input["command"],
            path=tool_input.get("path", ""),
            old_str=tool_input.get("old_str"),
            # The create command sends its content as file_text rather than new_str.
            new_str=tool_input.get("new_str", tool_input.get("file_text")),
            insert_line=tool_input.get("insert_line"),
        )
    return "Unknown tool"
```
```python
def run_computer_use_agent(task: str, system_prompt: str = "") -> str:
    """Run a Computer Use agent until the model stops requesting tools."""
    default_system = """You are an AI assistant that can use a computer to complete tasks.
You have the following tools:
- computer: take screenshots, click, type, scroll, and other desktop operations
- bash: run command-line commands
- str_replace_editor: view and edit files
Guidelines:
1. Take a screenshot first to understand the current screen state
2. Take a screenshot after actions to confirm results
3. When errors occur, understand the cause before retrying
4. Take a final screenshot to confirm task completion"""

    messages = [{"role": "user", "content": task}]

    while True:
        # The betas parameter is accepted on the beta Messages endpoint.
        response = client.beta.messages.create(
            model="claude-opus-4-5",
            max_tokens=4096,
            system=system_prompt or default_system,
            tools=COMPUTER_USE_TOOLS,
            messages=messages,
            betas=["computer-use-2025-01-24"],
        )

        if response.stop_reason == "end_turn":
            return " ".join(b.text for b in response.content if b.type == "text")

        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type != "tool_use":
                    continue
                print(f"[Tool] {block.name}: {block.input.get('action', block.input)}")
                result_content = process_tool_call(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result_content
                    if isinstance(result_content, list)
                    else str(result_content),
                })
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
        else:
            break  # any other stop reason, such as max_tokens

    return f"Stopped before completion (stop_reason={response.stop_reason})"
```
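The loop above resends every earlier screenshot on each request, so the prompt grows with every step. A common mitigation is to keep only the most recent screenshots and replace older ones with a short text stub. The following is a minimal sketch of that idea; `prune_old_screenshots` and the stub text are hypothetical, and it assumes tool results are stored as plain dicts in user messages, as in the loop above.

```python
def prune_old_screenshots(messages: list, keep_last: int = 3) -> list:
    """Replace all but the newest keep_last screenshot results with a text stub."""
    # Collect (message index, block index) of every tool_result carrying an image.
    image_slots = []
    for mi, message in enumerate(messages):
        content = message.get("content")
        if message.get("role") != "user" or not isinstance(content, list):
            continue
        for bi, block in enumerate(content):
            if not (isinstance(block, dict) and block.get("type") == "tool_result"):
                continue
            inner = block.get("content")
            if isinstance(inner, list) and any(
                isinstance(item, dict) and item.get("type") == "image" for item in inner
            ):
                image_slots.append((mi, bi))

    stale = image_slots[:-keep_last] if keep_last > 0 else image_slots
    pruned = [dict(m) for m in messages]  # shallow copies; originals stay untouched
    for mi, bi in stale:
        blocks = list(pruned[mi]["content"])
        blocks[bi] = {**blocks[bi], "content": "[screenshot omitted to save tokens]"}
        pruned[mi]["content"] = blocks
    return pruned
```

Passing `prune_old_screenshots(messages)` as the `messages` argument of each request, while keeping the full list for logging, is one way to use it.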
21.4 Real-World Scenario Examples
Scenario 1: Auto-Fill a Web Form
```python
task = """
Please complete the following in the browser:
1. Take a screenshot to see the current state
2. Find the username input field and click it
3. Type "[email protected]"
4. Click the password field
5. Type "SecurePass123"
6. Click the login button
7. Take a screenshot to confirm login success
"""
result = run_computer_use_agent(task)
```
Scenario 2: Batch File Processing
```python
task = """
Please rename all .jpg files in ~/Downloads by adding today's date
as a prefix in YYYYMMDD_ format.
First list the files, confirm, then execute the rename.
"""
result = run_computer_use_agent(task)
```
Scenario 3: Desktop Application Operation
```python
task = """
Please open the Calculator app, compute (2024 × 365) + 100,
then take a screenshot showing the result.
"""
result = run_computer_use_agent(task)
```
21.5 Security Considerations and Best Practices
Sandbox Environment Setup
Computer Use has powerful system control capabilities and must run in a controlled environment:
```dockerfile
# Docker container for isolation
# Dockerfile (based on Anthropic's official example)
FROM ubuntu:22.04

RUN apt-get update && apt-get install -y \
    python3 python3-pip \
    xvfb x11vnc \
    firefox-esr \
    scrot \
    xdotool

# Create a restricted non-root user
RUN useradd -m -s /bin/bash sandboxuser
USER sandboxuser
ENV DISPLAY=:1
```
```python
import os
import subprocess


def start_virtual_display(width: int = 1920, height: int = 1080, display_num: int = 1):
    """Start an Xvfb virtual display."""
    subprocess.Popen([
        "Xvfb", f":{display_num}",
        "-screen", "0", f"{width}x{height}x24",
    ])
    os.environ["DISPLAY"] = f":{display_num}"
    print(f"Virtual display started: :{display_num} ({width}x{height})")
```
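`subprocess.Popen` returns as soon as the Xvfb process is spawned, which can be before the display accepts connections. The following is a minimal sketch of a readiness check, assuming the X server creates its Unix socket at the conventional /tmp/.X11-unix/X<n> path; `wait_for_display` and its `socket_dir` parameter are hypothetical.

```python
import os
import time


def wait_for_display(display_num: int = 1, timeout: float = 5.0,
                     socket_dir: str = "/tmp/.X11-unix") -> bool:
    """Poll until the X server's Unix socket for the display appears."""
    socket_path = os.path.join(socket_dir, f"X{display_num}")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(socket_path):
            return True
        time.sleep(0.1)
    # One last check in case the socket appeared during the final sleep.
    return os.path.exists(socket_path)
```

Calling `wait_for_display(1)` right after `start_virtual_display()` and aborting on `False` avoids taking a first screenshot against a display that does not exist yet.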
Human Confirmation Mechanism
For high-risk operations, request human confirmation before executing:
```python
HIGH_RISK_KEYWORDS = [
    "delete", "format", "shutdown", "send email",
    "submit form", "purchase", "transfer",
]


def confirm_before_risky_action(action_description: str) -> bool:
    """Request confirmation before high-risk operations."""
    is_risky = any(kw in action_description.lower() for kw in HIGH_RISK_KEYWORDS)
    if is_risky:
        print(f"\n[HIGH RISK ACTION] {action_description}")
        response = input("Continue? (y/N): ")
        return response.lower() == "y"
    return True


class SafeComputerUseAgent:
    def __init__(self, require_confirmation: bool = True):
        self.require_confirmation = require_confirmation
        self.action_log = []

    def process_action_safely(self, tool_name: str, tool_input: dict) -> str:
        self.action_log.append({"tool": tool_name, "input": tool_input})
        if self.require_confirmation:
            desc = f"{tool_name}: {tool_input}"
            if not confirm_before_risky_action(desc):
                return "Action cancelled by user"
        return process_tool_call(tool_name, tool_input)
```
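The `action_log` list above exists only in memory, so it disappears if the agent crashes. The following is a minimal sketch of appending each action to a JSON Lines file instead; `summarize_input`, `append_action_log`, and the file layout are assumptions for illustration.

```python
import json
import time


def summarize_input(tool_input: dict, max_len: int = 200) -> dict:
    """Shorten long string values so the log stays readable."""
    summary = {}
    for key, value in tool_input.items():
        if isinstance(value, str) and len(value) > max_len:
            summary[key] = value[:max_len] + f"... ({len(value)} chars)"
        else:
            summary[key] = value
    return summary


def append_action_log(log_path: str, tool_name: str, tool_input: dict, outcome: str) -> None:
    """Append one action as a single JSON line to log_path."""
    entry = {
        "timestamp": time.time(),
        "tool": tool_name,
        "input": summarize_input(tool_input),
        "outcome": outcome,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
```

Opening the file in append mode for every entry keeps earlier records intact even if a later action brings the process down.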
21.6 Performance Optimization and Debugging
Screenshot Compression
High-resolution screenshots consume many tokens; downscaling and compressing them before upload reduces cost:
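```python
import base64
import io
import subprocess

from PIL import Image


def take_compressed_screenshot(max_size: tuple = (1280, 720)) -> str:
    """Capture a screenshot, downscale it, and return base64-encoded JPEG data.

    The result is JPEG, so the image block that carries it must use
    media_type "image/jpeg" rather than "image/png".
    """
    # scrot 1.x does not replace an existing file unless --overwrite is given.
    subprocess.run(["scrot", "--overwrite", "/tmp/screenshot_raw.png"], check=True)
    with Image.open("/tmp/screenshot_raw.png") as img:
        img.thumbnail(max_size, Image.LANCZOS)  # shrinks in place, keeps aspect ratio
        buffer = io.BytesIO()
        img.convert("RGB").save(buffer, format="JPEG", quality=85)
        compressed_data = buffer.getvalue()
    return base64.standard_b64encode(compressed_data).decode("utf-8")
```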
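Downscaling has a side effect worth spelling out: if the model sees a 1280x720 image, the coordinates it returns will most likely refer to that image and not to the 1920x1080 screen. The following is a minimal sketch of mapping such a point back to screen pixels; `scale_to_screen` is a hypothetical helper, and it assumes the screenshot covers the whole screen with no letterboxing.

```python
def scale_to_screen(x: int, y: int,
                    image_size: tuple[int, int],
                    screen_size: tuple[int, int]) -> tuple[int, int]:
    """Map a point in the downscaled screenshot to real screen pixels."""
    image_w, image_h = image_size
    screen_w, screen_h = screen_size
    screen_x = round(x * screen_w / image_w)
    screen_y = round(y * screen_h / image_h)
    # Clamp so rounding can never push the point off-screen.
    return (
        min(max(screen_x, 0), screen_w - 1),
        min(max(screen_y, 0), screen_h - 1),
    )
```

For example, the center of a 1280x720 screenshot, (640, 360), maps to (960, 540) on a 1920x1080 display.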
Inter-Action Delays
GUI operations need time for the interface to respond:
```python
import time


def click_and_wait(x: int, y: int, wait_seconds: float = 0.5):
    """Click and wait for the interface to respond."""
    import pyautogui
    pyautogui.click(x, y)
    time.sleep(wait_seconds)


def wait_for_page_load(max_wait: float = 10.0) -> bool:
    """Wait for page load to complete (based on screenshot comparison)."""
    prev_screenshot = take_screenshot()
    start_time = time.time()
    while time.time() - start_time < max_wait:
        time.sleep(1.0)
        current = take_screenshot()
        if current == prev_screenshot:
            return True  # No change means loading complete
        prev_screenshot = current
    return False  # Timed out
```
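A single pair of identical screenshots can be a coincidence, for instance when both frames are captured while a request is still in flight. The following is a minimal sketch of a stricter variant that requires several consecutive identical captures; `wait_until_stable` is a hypothetical name, and the `capture` and `sleep` parameters are injected so the logic can be exercised without a display.

```python
import hashlib
import time
from typing import Callable


def wait_until_stable(capture: Callable[[], str], required_stable: int = 2,
                      interval: float = 1.0, max_wait: float = 10.0,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True once required_stable consecutive captures match the previous one."""
    # Compare short digests instead of holding full base64 screenshots.
    previous = hashlib.sha256(capture().encode("utf-8")).hexdigest()
    stable_count = 0
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        sleep(interval)
        current = hashlib.sha256(capture().encode("utf-8")).hexdigest()
        if current == previous:
            stable_count += 1
            if stable_count >= required_stable:
                return True
        else:
            stable_count = 0
            previous = current
    return False
```

In the agent it could be called as `wait_until_stable(take_screenshot)`. Screens with a blinking cursor or a clock may never compare equal, which is a limitation this sketch shares with exact comparison in general.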
Debug Mode: Save All Screenshots
```python
import base64
import os
from datetime import datetime


class DebugComputerUseAgent:
    def __init__(self, debug_dir: str = "/tmp/computer_use_debug"):
        self.debug_dir = debug_dir
        os.makedirs(debug_dir, exist_ok=True)
        self.screenshot_count = 0

    def save_screenshot(self, screenshot_data: str, label: str = "") -> str:
        """Decode a base64 screenshot and save it with a numbered, timestamped name."""
        self.screenshot_count += 1
        timestamp = datetime.now().strftime("%H%M%S")
        filename = f"{self.screenshot_count:03d}_{timestamp}_{label}.png"
        filepath = os.path.join(self.debug_dir, filename)
        with open(filepath, "wb") as f:
            f.write(base64.standard_b64decode(screenshot_data))
        print(f"Screenshot saved: {filepath}")
        return filepath
```
21.7 Common Issues and Solutions
Issue 1: Coordinate Precision
Claude's coordinate judgment is based on screenshot analysis and may have slight offsets:
Solutions:
1. Specify the display resolution in the system prompt
2. Take a screenshot to confirm element positions before acting
3. For critical buttons, describe visual characteristics rather than hard-coding coordinates
Issue 2: Dynamic Content Loading
The page may not finish loading immediately after a click:
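```python
# Add explicit waits after navigation actions
def navigate_and_wait(url: str):
    """Type a URL, press Return, and wait for the page to settle."""
    # Execute navigation (assumes the browser's address bar already has focus)
    execute_computer_action("type", text=url)
    execute_computer_action("key", text="Return")
    # Wait for load
    wait_for_page_load(max_wait=15.0)
```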
Issue 3: Unexpected Pop-ups
Pop-ups interrupt normal operation flow:
Solution: Add instructions to the system prompt about handling common pop-ups:
- Close buttons are typically in the upper right corner
- Visual position of OK/Cancel buttons
- How to handle cookie consent dialogs
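Both this advice and the earlier suggestion to state the display resolution come down to extending the system prompt. The following is a minimal sketch of assembling such a prompt; `build_system_prompt` and the wording of the generated lines are assumptions, not a prescribed format.

```python
def build_system_prompt(base_prompt: str, width: int, height: int,
                        popup_rules: list[str]) -> str:
    """Append the display size and pop-up handling rules to a base system prompt."""
    lines = [
        base_prompt.rstrip(),
        "",
        f"The display resolution is {width}x{height} pixels.",
    ]
    if popup_rules:
        lines.append("")
        lines.append("If an unexpected pop-up appears:")
        lines.extend(f"- {rule}" for rule in popup_rules)
    return "\n".join(lines)
```

The result can be passed as the `system_prompt` argument of `run_computer_use_agent`.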
Summary
Computer Use upgrades Claude from a "text worker" to a genuine computer operator. Core points:
- Three tool types (computer, bash, str_replace_editor) cover GUI operations, command line, and file editing
- betas=["computer-use-2025-01-24"] is the required parameter to enable this feature
- Security is critical: sandbox environments, human confirmation mechanisms, and action logging are all essential
- Screenshot compression and appropriate inter-action delays are key optimizations for production use
The next chapter explores combining Tool Use with Extended Thinking, currently Claude's most powerful reasoning-and-action combination.