Description

Scrape any webpage using text-based DOM manipulation and export structured data to CSV. Controls a local browser via Playwright + Alibaba Page-Agent. No exte...

README (SKILL.md)

Auto Scraping to CSV — Page-Agent Bridge

Name: Auto Scraping to CSV
Author: science-prof-robot

Scrape any webpage using text-based DOM manipulation and export structured data to CSV. Controls a local browser via Playwright + Alibaba Page-Agent. No external LLM required — Claude acts as the host model.

When to Use

Data extraction: "Extract all product names and prices from the listing"
Table scraping: "Get the top 10 rows from the pricing table"
News aggregation: "Scrape latest blog posts with titles, dates, and URLs"
Form & workflow testing: "Fill the signup form with a test email address and submit"
UI verification: "Verify the dashboard shows 3 items in the table"
End-to-end journeys: "Login → add item to cart → checkout → confirm order"
Regression testing: Re-run natural language test scripts after deploys

How It Works

Claude (Host Model)
    ↕  HTTP
Bridge Server (Node.js + Playwright)
    ↕  page.evaluate()
Browser (Chromium) ← Page-Agent injected

Bridge launches a local Chromium browser via Playwright
Page-Agent is injected as an IIFE script from CDN into the target page
Page-Agent indexes the DOM and generates a simplified text representation of interactive elements with numeric indices:
```
[5]<button>Submit</button>
[12]<input placeholder="Email" type="email"/>
```
Claude receives the text state, decides the next action, and instructs the bridge to execute it
Loop continues until the task is complete or max steps reached
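
To make the indexed text format concrete, here is a minimal sketch that turns those lines into a lookup table. The line format is inferred only from the two sample lines above, not from a Page-Agent specification, and `parse_indexed_elements` and `find_by_text` are illustrative names.

```python
import re

# Matches the start of lines such as: [5]<button>Submit</button>
# Assumption: every interactive element is rendered as "[index]<tag ...".
ELEMENT_LINE = re.compile(r"^\s*\[(\d+)\]<([A-Za-z][\w-]*)")


def parse_indexed_elements(dom_text):
    """Map each numeric index to its tag name and the raw line."""
    elements = {}
    for line in dom_text.splitlines():
        match = ELEMENT_LINE.match(line)
        if match:
            index, tag = int(match.group(1)), match.group(2)
            elements[index] = {"tag": tag, "line": line.strip()}
    return elements


def find_by_text(elements, needle):
    """Return the indices whose line contains `needle`, ignoring case."""
    needle = needle.lower()
    return [i for i, el in sorted(elements.items()) if needle in el["line"].lower()]


sample = '[5]<button>Submit</button>\n[12]<input placeholder="Email" type="email"/>'
elements = parse_indexed_elements(sample)
print(find_by_text(elements, "submit"))  # prints [5]
```

The returned index is what an action such as `clickElement` expects in its `index` parameter.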

Key Design Decisions

Decision	Rationale
Text-based DOM	No screenshots, no vision model needed. Faster and cheaper.
Host model	Claude is the reasoning engine. No OpenAI/Qwen API key needed.
HTTP bridge	Playwright runs in Node.js; Claude communicates via simple HTTP.
Turn-based loop	Compatible with Claude Code's chat interaction model.
CDN injection	No npm install of page-agent needed; auto-updates to latest.
CSV export	Built-in workflow to convert scraped JSON data to CSV files.

First-Time Setup

1. Install Playwright

npm install -D playwright
npx playwright install chromium

2. Place the Bridge Script

After installing this skill, copy the bundled bridge script to .claude/agents/:

cp .claude/skills/auto-scraping-to-csv/page-agent-bridge.mjs .claude/agents/

3. Start the Bridge

In a separate terminal (the bridge must stay running):

node .claude/agents/page-agent-bridge.mjs

Default port: 9876. Custom port:

node .claude/agents/page-agent-bridge.mjs 8888

You should see:

🚀  Page-Agent Bridge (Host Model) running on http://localhost:9876

4. Verify Health

curl http://localhost:9876/health

Expected: { "status": "ok", "sessions": 0, "maxSessions": 5 }
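
A script can run the same check before opening a session. This is a sketch only: it assumes `maxSessions` is a hard cap on concurrent sessions, which the field name suggests but the text does not confirm. `has_capacity` and `fetch_health` are illustrative names.

```python
import json
import urllib.request


def has_capacity(health):
    """True when the bridge reports ok and still has a free session slot."""
    return (
        health.get("status") == "ok"
        and health.get("sessions", 0) < health.get("maxSessions", 0)
    )


def fetch_health(base_url="http://localhost:9876"):
    """GET /health and decode the JSON body (requires a running bridge)."""
    with urllib.request.urlopen(base_url + "/health", timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Using the expected response shown above instead of a live call:
print(has_capacity({"status": "ok", "sessions": 0, "maxSessions": 5}))  # prints True
```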

Workflow

Phase 1: Initialize Session

curl -X POST http://localhost:9876/sessions \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "headless": false}'

Response:

{ "id": "a1b2c3d4", "url": "https://example.com" }

Phase 2: Observe → Think → Act Loop

Step 2a. Observe (fetch DOM state)

curl http://localhost:9876/sessions/a1b2c3d4/state

Step 2b. Think (Claude decides)

Based on the content text, identify the target element index and choose an action.

Step 2c. Act (execute action)

curl -X POST http://localhost:9876/sessions/a1b2c3d4/act \
  -H "Content-Type: application/json" \
  -d '{"action": "clickElement", "params": {"index": 5}}'

Repeat observe → act until complete.
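
The three phases can also be driven from a small script. The sketch below uses only the endpoints shown in this workflow. The `decide` callback is a hypothetical stand-in for the step where Claude picks the next action, and the code assumes every bridge reply is either JSON or empty.

```python
import json
import urllib.request

BRIDGE = "http://localhost:9876"  # default port from the setup section


def build_request(method, path, body=None):
    """Build (but do not send) a JSON request for the bridge."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data else {}
    return urllib.request.Request(BRIDGE + path, data=data, headers=headers, method=method)


def call(method, path, body=None):
    """Send the request; assumes the reply is JSON or empty."""
    with urllib.request.urlopen(build_request(method, path, body), timeout=60) as resp:
        raw = resp.read()
    return json.loads(raw.decode("utf-8")) if raw.strip() else None


def run_task(url, decide, max_steps=10, send=call):
    """Observe, let `decide` choose an action, act; stop when it returns None."""
    session_id = send("POST", "/sessions", {"url": url, "headless": True})["id"]
    try:
        for _ in range(max_steps):
            state = send("GET", f"/sessions/{session_id}/state")
            action = decide(state)
            if action is None:
                break
            send("POST", f"/sessions/{session_id}/act", action)
    finally:
        # Always close the session, even if a step raised.
        send("DELETE", f"/sessions/{session_id}")
```

Passing `send` as a parameter keeps the loop testable without a running bridge; by default it talks to `localhost:9876`.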

Phase 3: Close Session

curl -X DELETE http://localhost:9876/sessions/a1b2c3d4

Or stop the bridge:

curl -X POST http://localhost:9876/shutdown

Scraping to CSV Workflow

Step 1: Navigate and Get DOM State

Start a session on your target URL and fetch the DOM state:

curl -X POST http://localhost:9876/sessions \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/products", "headless": true}'

Step 2: Extract Structured Data via JavaScript

Use the executeJavascript action to extract data from the page:

cat > /tmp/extract.json << 'EOF'
{"action": "executeJavascript", "params": {"script": "const items = Array.from(document.querySelectorAll('.product')).map(el => ({name: el.querySelector('.title').textContent.trim(), price: el.querySelector('.price').textContent.trim(), url: el.querySelector('a').href})); return JSON.stringify(items);"}}
EOF

curl -X POST http://localhost:9876/sessions/SESSION_ID/act \
  -H "Content-Type: application/json" \
  -d @/tmp/extract.json
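
Hand-escaping quotes inside a JSON-wrapped script is error-prone, so it can help to generate the request body instead. The sketch below is one way to do that. `build_extract_payload` is an illustrative helper, not part of the bridge, and unlike the one-liner above it returns an empty string when a child selector matches nothing rather than throwing.

```python
import json


def build_extract_payload(item_selector, fields):
    """Return an executeJavascript body that maps each matched item to a row.

    `fields` maps a column name to (CSS selector, DOM property),
    for example {"name": (".title", "textContent")}.
    """
    # json.dumps output is valid JavaScript literal syntax, so selectors
    # containing quotes are escaped correctly.
    script = (
        "const fields = " + json.dumps(fields) + ";"
        "const items = Array.from(document.querySelectorAll("
        + json.dumps(item_selector) + ")).map(el => {"
        "const row = {};"
        "for (const [name, [sel, prop]] of Object.entries(fields)) {"
        "const node = el.querySelector(sel);"
        "row[name] = node ? String(node[prop] ?? '').trim() : '';"
        "}"
        "return row;"
        "});"
        "return JSON.stringify(items);"
    )
    return {"action": "executeJavascript", "params": {"script": script}}


payload = build_extract_payload(
    ".product",
    {"name": (".title", "textContent"), "price": (".price", "textContent"), "url": ("a", "href")},
)
print(json.dumps(payload))
```

Redirecting the printed output to `/tmp/extract.json` produces a file that the `curl -d @/tmp/extract.json` call can send unchanged.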

Step 3: Convert JSON to CSV

Option A — Python (recommended)

python3 << 'PYEOF'
import csv, json, re

# The bridge returns: "✅ Executed JavaScript. Result: [{...}, {...}]"
# Extract the JSON array from the message
# Raw string so backslashes in the pasted text are not treated as Python escapes
msg = r"""PASTE_BRIDGE_RESPONSE_HERE"""
# re.DOTALL lets the array span several lines
match = re.search(r'Result: (\[.*\])', msg, re.DOTALL)
data = json.loads(match.group(1)) if match else []
if data:
    with open('output.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=list(data[0].keys()))
        writer.writeheader()
        writer.writerows(data)
    print(f"Wrote {len(data)} rows to output.csv")
else:
    print("No rows found in the bridge response")
PYEOF

Option B — jq + csvkit

# Install csvkit: pip install csvkit
# Pipe the JSON array from the bridge response into in2csv (csvkit's JSON-to-CSV converter)
echo '[{"name":"A","price":"$10"},{"name":"B","price":"$20"}]' | \
  in2csv -f json > output.csv

Option C — Node.js (no extra deps)

const fs = require('fs');
const data = JSON.parse(fs.readFileSync('data.json', 'utf8'));
const headers = Object.keys(data[0]);
// Quote every field and double any embedded quotes; String() also handles numbers
const quote = value => `"${String(value ?? '').replace(/"/g, '""')}"`;
const csv = [
  headers.join(','),
  ...data.map(row => headers.map(h => quote(row[h])).join(','))
].join('\n');
fs.writeFileSync('output.csv', csv);
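
All three options take the header from the first row, so a column that only appears in later rows is either dropped or rejected. The sketch below shows one way around that, assuming the message text has the `Result: [...]` shape described in Option A. `rows_from_message` and `rows_to_csv` are illustrative names.

```python
import csv
import io
import json


def rows_from_message(message):
    """Decode the JSON array that follows 'Result: ' in a bridge message."""
    marker = "Result: "
    start = message.find(marker)
    if start == -1:
        raise ValueError("no 'Result: ' marker in bridge message")
    # raw_decode parses one JSON value and ignores any text after it.
    rows, _ = json.JSONDecoder().raw_decode(message, start + len(marker))
    if not isinstance(rows, list):
        raise ValueError("expected a JSON array of row objects")
    return rows


def rows_to_csv(rows):
    """Render rows as CSV text; the header is the union of keys, first seen first."""
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


msg = '✅ Executed JavaScript. Result: [{"name":"A","price":"$10"},{"name":"B, large","stock":"yes"}]'
print(rows_to_csv(rows_from_message(msg)), end="")
```

Missing values are written as empty cells, and the `csv` module quotes fields that contain commas.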

Complete Example: Scrape Anthropic News to CSV

# 1. Start bridge (in separate terminal)
node .claude/agents/page-agent-bridge.mjs

# 2. Create session
curl -s -X POST http://localhost:9876/sessions \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.anthropic.com/news", "headless": true}'
# → { "id": "abc123" }

# 3. Extract news data
cat > /tmp/extract.json << 'EOF'
{"action": "executeJavascript", "params": {"script": "const items = []; document.querySelectorAll('a[href*=\"/news/\"]').forEach(a => { const href = a.href; if (!href.includes('anthropic.com/news/')) return; const h2 = a.querySelector('h2, h3'); const title = h2 ? h2.textContent.trim() : ''; const time = a.querySelector('time'); const date = time ? time.textContent.trim() : ''; if (title && date) items.push({title, date, url: href}); }); return JSON.stringify(items.slice(0, 15));"}}
EOF

curl -s -X POST http://localhost:9876/sessions/abc123/act \
  -H "Content-Type: application/json" -d @/tmp/extract.json

# 4. Convert to CSV (see Python script above)

# 5. Close session
curl -X DELETE http://localhost:9876/sessions/abc123

Available Actions

Action	Params	Description
`getBrowserState`	—	Refresh DOM tree and return full page state
`clickElement`	`{ index: number }`	Click the interactive element at index
`inputText`	`{ index: number, text: string }`	Click then type into input element
`selectOption`	`{ index: number, optionText: string }`	Select dropdown option by visible text
`scroll`	`{ down?, num_pages?, pixels?, index? }`	Scroll vertically
`scrollHorizontally`	`{ right?, pixels, index? }`	Scroll horizontally
`executeJavascript`	`{ script: string }`	Run arbitrary JS in page context (async/await supported)
`wait`	`{ seconds: number }`	Pause execution
`cleanUpHighlights`	—	Remove all Page-Agent visual highlights
`updateTree`	—	Re-index the DOM manually
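
A small builder can catch a missing parameter before the request reaches the bridge. The sketch below mirrors the table above, treating parameters marked with `?` as optional. Sending an empty `params` object for parameterless actions is an assumption, and `make_action` is an illustrative name.

```python
# Required parameters per action, taken from the table above.
REQUIRED_PARAMS = {
    "getBrowserState": set(),
    "clickElement": {"index"},
    "inputText": {"index", "text"},
    "selectOption": {"index", "optionText"},
    "scroll": set(),
    "scrollHorizontally": {"pixels"},
    "executeJavascript": {"script"},
    "wait": {"seconds"},
    "cleanUpHighlights": set(),
    "updateTree": set(),
}


def make_action(action, **params):
    """Return a body for POST /sessions/:id/act, or raise on a bad request."""
    if action not in REQUIRED_PARAMS:
        raise ValueError(f"unknown action: {action}")
    missing = REQUIRED_PARAMS[action] - set(params)
    if missing:
        raise ValueError(f"{action} is missing params: {sorted(missing)}")
    return {"action": action, "params": params}


print(make_action("clickElement", index=5))
```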

Natural Language Commands

When this skill is active, Claude accepts natural language commands:

`/scrape-to-csv <url> <description>`

General scraping task with CSV export.

/scrape-to-csv https://example.com/products
  "Extract all product names, prices, and availability. Save as CSV."

`/scrape-table <url> <selector>`

Extract a specific HTML table.

/scrape-table https://example.com/pricing ".pricing-table"

`/scrape-news <url>`

Extract news/blog articles with titles, dates, and URLs.

/scrape-news https://www.anthropic.com/news

`/test-frontend <url> <task>`

Test forms, workflows, or UI interactions.

/test-frontend https://staging.example.com/signup
  "Fill the form with test data, submit, and verify welcome page"

Output Format

Claude produces a structured markdown report:

## Scraping Report — example.com/products
**Session:** a1b2c3d4 | **Duration:** 4.2s | **Rows:** 12

### Task
Extract all product names, prices, and availability. Save as CSV.

### Execution Log

| Step | Action | Target | Result |
|------|--------|--------|--------|
| 1 | getBrowserState | — | 24 interactive elements found |
| 2 | executeJavascript | products | ✅ 12 items extracted |
| 3 | — | — | ✅ CSV written: 12 rows |

### Sample Data

| name | price | availability |
|------|-------|-------------|
| Widget A | $19.99 | In stock |
| Widget B | $29.99 | Out of stock |

### File
`./output.csv` — 12 rows, 3 columns

Troubleshooting

Bridge won't start

Error: Cannot find module 'playwright'

Fix: npm install -D playwright && npx playwright install chromium

Browser page is blank

Cause: Page didn't finish loading before Page-Agent injection.
Fix: The bridge already uses waitUntil: 'networkidle'. For SPAs, add a wait action after navigation.

Element index not found

Cause: DOM tree stale; element was added after last updateTree().
Fix: Call getBrowserState (which refreshes the tree) before acting.
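
Because indices can change when the tree is rebuilt, retrying a stale index blindly may hit the wrong element. The sketch below re-reads the state and re-chooses the action on every attempt. It rests on two assumptions: that `GET /sessions/:id/state` refreshes the tree the same way `getBrowserState` does, and that a failed action comes back as `"success": false` rather than as an HTTP error. `send` and `choose_action` are hypothetical callables supplied by the caller.

```python
def act_with_fresh_index(send, session_id, choose_action, attempts=2):
    """Observe, choose, act; repeat with a fresh state if the action fails."""
    result = {"success": False, "message": "not attempted"}
    for _ in range(attempts):
        state = send("GET", f"/sessions/{session_id}/state")  # fresh indices
        action = choose_action(state)  # resolve the index again each time
        result = send("POST", f"/sessions/{session_id}/act", action)
        if result.get("success"):
            break
    return result
```

`send(method, path, body=None)` is expected to return the decoded JSON reply, and `choose_action(state)` to return an action body such as `{"action": "clickElement", "params": {"index": 7}}`.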

CORS errors in browser console

Cause: Page-Agent IIFE loaded from CDN on a strict CSP page.
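Fix: The bridge injects via page.addScriptTag({ url: CDN_URL }). A script tag added this way is still subject to the page's Content Security Policy, so a strict CSP can block it. Playwright's bypassCSP context option is the usual workaround, but using it would mean editing the bridge script.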

Headless vs headed mode

Headed (headless: false): You can watch the browser. Good for debugging.
Headless (headless: true): Faster, good for CI.

Comparison with Other Tools

Tool	DOM Type	LLM Required	Speed	Best For
Page-Agent Bridge	Text	Host (Claude)	Fast	Precise UI tasks, forms, data extraction
`/browse` (gstack)	Visual + DOM	Host (Claude)	Medium	General QA, screenshots, visual checks
Playwright E2E	Code	None	Fastest	Repeatable CI suites, regression
Browser-Use	Text	External API	Medium	Complex multi-page research
Scrapy	Code	None	Fast	Large-scale crawling, pipelines

Use this skill when:

You want natural language scraping commands
You don't want to write CSS selectors or XPath
You need quick one-off data extraction to CSV
You're iterating on frontend behavior and need verification
You want structured text evidence (DOM snapshots) instead of screenshots

Bridge API Reference

`POST /sessions`

Launch a new browser session.

Body:

{ "url": "https://example.com", "headless": false, "viewport": { "width": 1280, "height": 720 } }

Response: { "id": "abc123", "url": "https://example.com" }

`GET /sessions/:id/state`

Get current browser state including simplified DOM text.

Response: BrowserState object with url, title, header, content, footer.

`POST /sessions/:id/act`

Execute a Page-Agent action.

Body:

{ "action": "executeJavascript", "params": { "script": "return document.title;" } }

Response: { "success": true, "message": "✅ Executed JavaScript. Result: ..." }

`POST /sessions/:id/navigate`

Navigate to a new URL within the same session.

Body: { "url": "https://example.com/other" }

`DELETE /sessions/:id`

Close the browser tab and session.

`POST /shutdown`

Stop the bridge server and close all sessions.

`GET /health`

Health check. Returns { "status": "ok", "sessions": 0, "maxSessions": 5 }.

Skill: auto-scraping-to-csv v1.0.0 | Bridge: page-agent-bridge.mjs | Powered by Alibaba Page-Agent + Playwright

Usage Guidance

Use this only if you are comfortable running a local browser-control server. Keep it off except during active scraping, avoid using it with real accounts or purchases unless you manually review each step, and prefer a version that pins the Page-Agent script and protects the bridge with an auth token and restricted CORS.

Capability Analysis

Type: OpenClaw Skill
Name: auto-scraping-to-csv
Version: 1.0.0

The skill implements a local HTTP bridge (page-agent-bridge.mjs) that lacks authentication and exposes an 'executeJavascript' endpoint, allowing arbitrary code execution within the browser context. It also dynamically injects a script from a third-party CDN (jsdelivr.net), which introduces a supply chain risk. While these features are aligned with the stated scraping purpose, the lack of access controls and reliance on external unverified code represent significant security vulnerabilities.

Capability Tags

crypto
requires-sensitive-credentials

Capability Assessment

⚠ Purpose & Capability

Browser automation is central to the skill, but the provided bridge exposes broad actions including clicks, text input, navigation, and arbitrary JavaScript execution, which can go beyond CSV scraping into form submission or account-changing workflows.

⚠ Instruction Scope

The workflow describes an observe-think-act loop and includes high-impact browser actions, but it does not define approval checkpoints, read-only defaults, or limits for submissions, checkout flows, or arbitrary script execution.

⚠ Install Mechanism

The local Playwright install is expected, but Page-Agent is fetched at runtime from a CDN using @latest, so the executable code injected into pages can change without review or pinning.

⚠ Credentials

A local browser bridge is proportionate for this purpose, but the HTTP server has wildcard CORS and no visible authentication before executing actions.

ℹ Persistence & Privilege

The bridge is user-started and stoppable, not hidden or auto-started, but the instructions say it must stay running in a separate terminal, so users should shut it down after use.

Version History

v1.0.0

- Initial release of auto-scraping-to-csv.
- Scrape any webpage using text-based DOM manipulation via Playwright and Alibaba Page-Agent, controlled by Claude as the host model.
- No external LLM or OpenAI/Qwen API key required; only Node.js and Playwright.
- Export structured data to CSV using built-in workflows and convenient code snippets.
- Designed for data extraction, table scraping, UI verification, and automated workflow testing.
- Local setup instructions provided for bridge server and end-to-end usage.

Metadata

Slug auto-scraping-to-csv

Version 1.0.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is Auto Scraping to CSV?

Scrape any webpage using text-based DOM manipulation and export structured data to CSV. Controls a local browser via Playwright + Alibaba Page-Agent. No exte... It is an AI Agent Skill for Claude Code / OpenClaw, with 50 downloads so far.

How do I install Auto Scraping to CSV?

Run "/install auto-scraping-to-csv" in the OpenClaw or Claude Code chat to install the skill in one step. The bridge itself still needs Playwright and the steps under First-Time Setup.

Is Auto Scraping to CSV free?

Yes, Auto Scraping to CSV is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Auto Scraping to CSV support?

Auto Scraping to CSV is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Auto Scraping to CSV?

It is built and maintained by Science-Prof-Robot (@science-prof-robot); the current version is v1.0.0.
