science-prof-robot

Auto Scraping to CSV

by Science-Prof-Robot · GitHub ↗ · v1.0.0 · MIT-0
cross-platform · ⚠ suspicious · 50 downloads · 0 stars · 0 active installs · 1 version
Install in OpenClaw
/install auto-scraping-to-csv
README (SKILL.md)

Auto Scraping to CSV — Page-Agent Bridge

Scrape any webpage using text-based DOM manipulation and export structured data to CSV. Controls a local browser via Playwright + Alibaba Page-Agent. No external LLM required — Claude acts as the host model.

When to Use

  • Data extraction: "Extract all product names and prices from the listing"
  • Table scraping: "Get the top 10 rows from the pricing table"
  • News aggregation: "Scrape latest blog posts with titles, dates, and URLs"
  • Form & workflow testing: "Fill the signup form with a test email address and submit"
  • UI verification: "Verify the dashboard shows 3 items in the table"
  • End-to-end journeys: "Login → add item to cart → checkout → confirm order"
  • Regression testing: Re-run natural language test scripts after deploys

How It Works

Claude (Host Model)
    ↕  HTTP
Bridge Server (Node.js + Playwright)
    ↕  page.evaluate()
Browser (Chromium) ← Page-Agent injected
  1. Bridge launches a local Chromium browser via Playwright
  2. Page-Agent is injected as an IIFE script from CDN into the target page
  3. Page-Agent indexes the DOM and generates a simplified text representation of interactive elements with numeric indices:
    [5]<button>Submit</button>
    [12]<input placeholder="Email" type="email"/>
    
  4. Claude receives the text state, decides the next action, and instructs the bridge to execute it
  5. Loop continues until the task is complete or max steps reached
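
The indexed text state from step 3 can be turned into a lookup table before choosing an action. The following is a minimal sketch that assumes every interactive element appears on its own line in the shape of the two sample lines above; the real Page-Agent output may contain other line shapes, and the helper name `parse_indexed_elements` is hypothetical.

```python
import re

# Assumed line shape, inferred only from the two sample lines above:
#   [<index>]<tag attributes>text</tag>   or   [<index>]<tag attributes/>
LINE_RE = re.compile(r"^\s*\[(\d+)\]<(\w+)([^>]*?)/?>(?:(.*?)</\2>)?\s*$")
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_indexed_elements(dom_text):
    """Map each numeric index to its tag, attributes, and inner text."""
    elements = {}
    for line in dom_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip plain text and anything that is not an indexed element
        index, tag, raw_attrs, text = match.groups()
        elements[int(index)] = {
            "tag": tag,
            "attrs": dict(ATTR_RE.findall(raw_attrs)),
            "text": (text or "").strip(),
        }
    return elements


sample = '[5]<button>Submit</button>\n[12]<input placeholder="Email" type="email"/>'
elements = parse_indexed_elements(sample)
```

With the sample input, `elements[5]` describes the Submit button and `elements[12]` the email input, so an action such as `clickElement` can be given the matching index.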

Key Design Decisions

| Decision | Rationale |
|----------|-----------|
| Text-based DOM | No screenshots, no vision model needed. Faster and cheaper. |
| Host model | Claude is the reasoning engine. No OpenAI/Qwen API key needed. |
| HTTP bridge | Playwright runs in Node.js; Claude communicates via simple HTTP. |
| Turn-based loop | Compatible with Claude Code's chat interaction model. |
| CDN injection | No npm install of page-agent needed; auto-updates to latest. |
| CSV export | Built-in workflow to convert scraped JSON data to CSV files. |

First-Time Setup

1. Install Playwright

npm install -D playwright
npx playwright install chromium

2. Place the Bridge Script

After installing this skill, copy the bundled bridge script to .claude/agents/:

cp .claude/skills/auto-scraping-to-csv/page-agent-bridge.mjs .claude/agents/

3. Start the Bridge

In a separate terminal (the bridge must stay running):

node .claude/agents/page-agent-bridge.mjs

Default port: 9876. Custom port:

node .claude/agents/page-agent-bridge.mjs 8888

You should see:

🚀  Page-Agent Bridge (Host Model) running on http://localhost:9876

4. Verify Health

curl http://localhost:9876/health

Expected: { "status": "ok", "sessions": 0, "maxSessions": 5 }
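
The same health check can be scripted before opening a session. This is a minimal sketch using only the Python standard library; it assumes that `sessions` counts open sessions and `maxSessions` is the cap, which is inferred from the field names rather than stated by the bridge. The function names are hypothetical.

```python
import json
import urllib.request


def fetch_health(base_url="http://localhost:9876", timeout=5):
    """GET /health and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.load(resp)


def has_capacity(health):
    """True when the bridge reports ok and appears to have a free session slot."""
    return (
        health.get("status") == "ok"
        and health.get("sessions", 0) < health.get("maxSessions", 0)
    )
```

Calling `has_capacity(fetch_health())` while the bridge is running should return `True` for the expected response shown above.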


Workflow

Phase 1: Initialize Session

curl -X POST http://localhost:9876/sessions \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "headless": false}'

Response:

{ "id": "a1b2c3d4", "url": "https://example.com" }

Phase 2: Observe → Think → Act Loop

Step 2a. Observe (fetch DOM state)

curl http://localhost:9876/sessions/a1b2c3d4/state

Step 2b. Think (Claude decides)

Based on the content text, identify the target element index and choose an action.

Step 2c. Act (execute action)

curl -X POST http://localhost:9876/sessions/a1b2c3d4/act \
  -H "Content-Type: application/json" \
  -d '{"action": "clickElement", "params": {"index": 5}}'

Repeat observe → act until complete.
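
The three phases can also be driven from a script instead of curl. The sketch below uses only the endpoints and bodies shown in this document; `build_request`, `send`, and `run_steps` are hypothetical helper names, and the fixed list of steps stands in for the decisions Claude would normally make after reading each state.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9876"  # default port from the setup section


def build_request(method, path, body=None):
    """Build (but do not send) a request for the bridge, with a JSON body if given."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(
        f"{BASE_URL}{path}", data=data, headers=headers, method=method
    )


def send(request, timeout=30):
    """Send a request; return decoded JSON, or the raw text if the body is not JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return body


def run_steps(url, steps, headless=True):
    """Open a session, observe before each action, and always close the session."""
    session = send(build_request("POST", "/sessions", {"url": url, "headless": headless}))
    session_id = session["id"]
    results = []
    try:
        for action, params in steps:
            send(build_request("GET", f"/sessions/{session_id}/state"))  # observe
            body = {"action": action, "params": params}
            results.append(send(build_request("POST", f"/sessions/{session_id}/act", body)))
    finally:
        send(build_request("DELETE", f"/sessions/{session_id}"))  # close session
    return results
```

For example, `run_steps("https://example.com", [("clickElement", {"index": 5})])` mirrors the curl calls above, provided the bridge is running.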

Phase 3: Close Session

curl -X DELETE http://localhost:9876/sessions/a1b2c3d4

Or stop the bridge:

curl -X POST http://localhost:9876/shutdown

Scraping to CSV Workflow

Step 1: Navigate and Get DOM State

Start a session on your target URL and fetch the DOM state:

curl -X POST http://localhost:9876/sessions \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/products", "headless": true}'

Step 2: Extract Structured Data via JavaScript

Use the executeJavascript action to extract data from the page:

cat > /tmp/extract.json << 'EOF'
{"action": "executeJavascript", "params": {"script": "const items = Array.from(document.querySelectorAll('.product')).map(el => ({name: el.querySelector('.title').textContent.trim(), price: el.querySelector('.price').textContent.trim(), url: el.querySelector('a').href})); return JSON.stringify(items);"}}
EOF

curl -X POST http://localhost:9876/sessions/SESSION_ID/act \
  -H "Content-Type: application/json" \
  -d @/tmp/extract.json
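
Writing the script by hand inside a JSON string is error-prone because of nested quotes. A minimal sketch of generating the same kind of payload with `json.dumps` follows; the helper name `build_extract_payload` and its `fields` argument are hypothetical, and unlike the hand-written script it returns an empty string when a child element is missing instead of throwing.

```python
import json


def build_extract_payload(item_selector, fields):
    """Build an executeJavascript action that maps each matched item to an object.

    fields maps an output column name to a (child selector, DOM property) pair.
    """
    getters = ", ".join(
        f"{json.dumps(name)}: pick(el, {json.dumps(sel)}, {json.dumps(prop)})"
        for name, (sel, prop) in fields.items()
    )
    script = (
        # pick() guards against items that lack the requested child element
        "const pick = (el, sel, prop) => { const n = el.querySelector(sel); "
        "return n ? String(n[prop] ?? '').trim() : ''; }; "
        f"const items = Array.from(document.querySelectorAll({json.dumps(item_selector)}))"
        f".map(el => ({{{getters}}})); "
        "return JSON.stringify(items);"
    )
    return json.dumps({"action": "executeJavascript", "params": {"script": script}})


payload = build_extract_payload(
    ".product",
    {
        "name": (".title", "textContent"),
        "price": (".price", "textContent"),
        "url": ("a", "href"),
    },
)
```

The resulting `payload` string can be written to `/tmp/extract.json` and posted with the curl command above.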

Step 3: Convert JSON to CSV

Option A — Python (recommended)

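python3 << 'PYEOF'
import csv
import json
import re

# The bridge returns: "✅ Executed JavaScript. Result: [{...}, {...}]"
# Extract the JSON array from the message
msg = """PASTE_BRIDGE_RESPONSE_HERE"""
match = re.search(r'Result: (\[.*\])', msg, re.DOTALL)
if not match:
    raise SystemExit("No JSON array found after 'Result: ' in the bridge response")
data = json.loads(match.group(1))
if not data:
    raise SystemExit("The bridge returned an empty array; nothing to write")
# Column order follows the keys of the first row
with open('output.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=list(data[0].keys()))
    writer.writeheader()
    writer.writerows(data)
print(f"Wrote {len(data)} rows to output.csv")
PYEOF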

Option B — jq + csvkit

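# Install csvkit: pip install csvkit (its JSON-to-CSV converter is in2csv)
# Assumes the /act response was saved as response.json, for example with curl -o response.json
# jq pulls out the message, sed strips everything up to "Result: ", in2csv writes the CSV
jq -r '.message' response.json | sed 's/^.*Result: //' | \
  in2csv -f json > output.csv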

Option C — Node.js (no extra deps)

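const fs = require('fs');
// data.json holds the JSON array extracted from the bridge response
const data = JSON.parse(fs.readFileSync('data.json', 'utf8'));
const headers = Object.keys(data[0]);
// Quote every field and double embedded quotes; String() also handles numbers and null
const csv = [
  headers.join(','),
  ...data.map(row => headers.map(h => `"${String(row[h] ?? '').replace(/"/g, '""')}"`).join(','))
].join('\n');
fs.writeFileSync('output.csv', csv);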

Complete Example: Scrape Anthropic News to CSV

# 1. Start bridge (in separate terminal)
node .claude/agents/page-agent-bridge.mjs

# 2. Create session
curl -s -X POST http://localhost:9876/sessions \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.anthropic.com/news", "headless": true}'
# → { "id": "abc123" }

# 3. Extract news data
cat > /tmp/extract.json << 'EOF'
{"action": "executeJavascript", "params": {"script": "const items = []; document.querySelectorAll('a[href*=\"/news/\"]').forEach(a => { const href = a.href; if (!href.includes('anthropic.com/news/')) return; const h2 = a.querySelector('h2, h3'); const title = h2 ? h2.textContent.trim() : ''; const time = a.querySelector('time'); const date = time ? time.textContent.trim() : ''; if (title && date) items.push({title, date, url: href}); }); return JSON.stringify(items.slice(0, 15));"}}
EOF

curl -s -X POST http://localhost:9876/sessions/abc123/act \
  -H "Content-Type: application/json" -d @/tmp/extract.json

# 4. Convert to CSV (see Python script above)

# 5. Close session
curl -X DELETE http://localhost:9876/sessions/abc123

Available Actions

| Action | Params | Description |
|--------|--------|-------------|
| getBrowserState | (none) | Refresh DOM tree and return full page state |
| clickElement | { index: number } | Click the interactive element at index |
| inputText | { index: number, text: string } | Click then type into input element |
| selectOption | { index: number, optionText: string } | Select dropdown option by visible text |
| scroll | { down?, num_pages?, pixels?, index? } | Scroll vertically |
| scrollHorizontally | { right?, pixels, index? } | Scroll horizontally |
| executeJavascript | { script: string } | Run arbitrary JS in page context (async/await supported) |
| wait | { seconds: number } | Pause execution |
| cleanUpHighlights | (none) | Remove all Page-Agent visual highlights |
| updateTree | (none) | Re-index the DOM manually |
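
A request can be checked against this table before it is sent to `/act`. The sketch below transcribes the required and optional parameters from the table (a trailing `?` is read as optional); `ACTION_PARAMS` and `validate_action` are hypothetical names, and the bridge itself may validate differently.

```python
# action -> (required params, optional params), transcribed from the table above
ACTION_PARAMS = {
    "getBrowserState": (set(), set()),
    "clickElement": ({"index"}, set()),
    "inputText": ({"index", "text"}, set()),
    "selectOption": ({"index", "optionText"}, set()),
    "scroll": (set(), {"down", "num_pages", "pixels", "index"}),
    "scrollHorizontally": ({"pixels"}, {"right", "index"}),
    "executeJavascript": ({"script"}, set()),
    "wait": ({"seconds"}, set()),
    "cleanUpHighlights": (set(), set()),
    "updateTree": (set(), set()),
}


def validate_action(action, params=None):
    """Return a list of problems; an empty list means the request looks well formed."""
    if action not in ACTION_PARAMS:
        return [f"unknown action: {action}"]
    given = set(params or {})
    required, optional = ACTION_PARAMS[action]
    problems = [f"missing required param: {name}" for name in sorted(required - given)]
    problems += [
        f"unexpected param: {name}" for name in sorted(given - required - optional)
    ]
    return problems
```

For instance, `validate_action("inputText", {"index": 12})` reports that `text` is missing, while `validate_action("clickElement", {"index": 5})` returns an empty list.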

Natural Language Commands

When this skill is active, Claude accepts natural language commands:

/scrape-to-csv <url> <description>

General scraping task with CSV export.

/scrape-to-csv https://example.com/products
  "Extract all product names, prices, and availability. Save as CSV."

/scrape-table <url> <selector>

Extract a specific HTML table.

/scrape-table https://example.com/pricing ".pricing-table"

/scrape-news <url>

Extract news/blog articles with titles, dates, and URLs.

/scrape-news https://www.anthropic.com/news

/test-frontend <url> <task>

Test forms, workflows, or UI interactions.

/test-frontend https://staging.example.com/signup
  "Fill the form with test data, submit, and verify welcome page"

Output Format

Claude produces a structured markdown report:

## Scraping Report — example.com/products
**Session:** a1b2c3d4 | **Duration:** 4.2s | **Rows:** 12

### Task
Extract all product names, prices, and availability. Save as CSV.

### Execution Log

| Step | Action | Target | Result |
|------|--------|--------|--------|
| 1 | getBrowserState | — | 24 interactive elements found |
| 2 | executeJavascript | products | ✅ 12 items extracted |
| 3 | — | — | ✅ CSV written: 12 rows |

### Sample Data

| name | price | availability |
|------|-------|-------------|
| Widget A | $19.99 | In stock |
| Widget B | $29.99 | Out of stock |

### File
`./output.csv` — 12 rows, 3 columns
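
The "Sample Data" table in a report like this can be produced from the scraped rows. This is a minimal sketch, assuming the rows are a list of dictionaries that share the keys of the first row; `markdown_table` is a hypothetical helper and not part of the skill.

```python
def markdown_table(rows, limit=2):
    """Render the first `limit` rows as a pipe table, using the first row's keys as headers."""
    if not rows:
        return ""
    headers = list(rows[0].keys())
    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "|".join("---" for _ in headers) + "|",
    ]
    for row in rows[:limit]:
        # Escape pipes so a cell value cannot break the table layout
        cells = [str(row.get(h, "")).replace("|", "\\|") for h in headers]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


rows = [
    {"name": "Widget A", "price": "$19.99", "availability": "In stock"},
    {"name": "Widget B", "price": "$29.99", "availability": "Out of stock"},
    {"name": "Widget C", "price": "$39.99", "availability": "In stock"},
]
table = markdown_table(rows)
```

With the default `limit=2`, only the first two widgets appear in `table`, matching the sample above.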

Troubleshooting

Bridge won't start

Error: Cannot find module 'playwright'

Fix: npm install -D playwright && npx playwright install chromium

Browser page is blank

Cause: Page didn't finish loading before Page-Agent injection.
Fix: The bridge already uses waitUntil: 'networkidle'. For SPAs, add a wait action after navigation.

Element index not found

Cause: DOM tree stale; element was added after last updateTree().
Fix: Call getBrowserState (which refreshes the tree) before acting.
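
One way to apply this fix in a script is to re-resolve the index from a fresh state immediately before every action. The sketch below is written against three caller-supplied functions so it stays independent of the HTTP layer; all names are hypothetical, and it assumes the action result carries the `success` field shown in the API reference.

```python
def act_with_fresh_index(get_state, find_index, act, action, extra_params=None, attempts=3):
    """Refresh the state, look the element up again, then act; retry if it is missing.

    get_state()         -> current DOM text (for example via GET /sessions/:id/state)
    find_index(state)   -> index of the wanted element, or None if it is not present
    act(action, params) -> bridge response as a dict with a "success" field
    """
    for _ in range(attempts):
        state = get_state()  # a fresh state means fresh indices
        index = find_index(state)
        if index is None:
            continue  # element not rendered yet; observe again
        result = act(action, {"index": index, **(extra_params or {})})
        if result.get("success"):
            return result
    raise LookupError(f"element for {action} not found after {attempts} attempts")
```

A caller would pass small wrappers around the bridge calls, plus a `find_index` function that searches the state text for the element it wants.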

CORS errors in browser console

Cause: Page-Agent IIFE loaded from CDN on a strict CSP page.
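Fix: The bridge injects via page.addScriptTag({ url: CDN_URL }). A script tag added this way is still subject to the page's Content-Security-Policy, so injection can fail on strict pages; creating the browser context with Playwright's bypassCSP: true option is the usual workaround and would require editing the bridge script.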

Headless vs headed mode

  • Headed (headless: false): You can watch the browser. Good for debugging.
  • Headless (headless: true): Faster, good for CI.

Comparison with Other Tools

| Tool | DOM Type | LLM Required | Speed | Best For |
|------|----------|--------------|-------|----------|
| Page-Agent Bridge | Text | Host (Claude) | Fast | Precise UI tasks, forms, data extraction |
| /browse (gstack) | Visual + DOM | Host (Claude) | Medium | General QA, screenshots, visual checks |
| Playwright E2E | Code | None | Fastest | Repeatable CI suites, regression |
| Browser-Use | Text | External API | Medium | Complex multi-page research |
| Scrapy | Code | None | Fast | Large-scale crawling, pipelines |

Use this skill when:

  • You want natural language scraping commands
  • You don't want to write CSS selectors or XPath
  • You need quick one-off data extraction to CSV
  • You're iterating on frontend behavior and need verification
  • You want structured text evidence (DOM snapshots) instead of screenshots

Bridge API Reference

POST /sessions

Launch a new browser session.

Body:

{ "url": "https://example.com", "headless": false, "viewport": { "width": 1280, "height": 720 } }

Response: { "id": "abc123", "url": "https://example.com" }

GET /sessions/:id/state

Get current browser state including simplified DOM text.

Response: BrowserState object with url, title, header, content, footer.

POST /sessions/:id/act

Execute a Page-Agent action.

Body:

{ "action": "executeJavascript", "params": { "script": "return document.title;" } }

Response: { "success": true, "message": "✅ Executed JavaScript. Result: ..." }
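
Because the result is embedded in the `message` string, a caller has to split it back out. The following is a minimal sketch that assumes the text after `Result: ` is either JSON (as when the script returns `JSON.stringify(...)`) or a plain value; `extract_result` is a hypothetical helper name.

```python
import json

RESULT_MARKER = "Result: "


def extract_result(response):
    """Return the value after 'Result: ' in an /act response.

    JSON text is decoded; anything else is returned as the raw string.
    Returns None when the message carries no result at all.
    """
    if not response.get("success"):
        raise RuntimeError(response.get("message", "action failed"))
    _, found, tail = response.get("message", "").partition(RESULT_MARKER)
    if not found:
        return None
    try:
        return json.loads(tail)
    except json.JSONDecodeError:
        return tail  # for example a bare page title


rows = extract_result(
    {"success": True, "message": '✅ Executed JavaScript. Result: [{"name": "A", "price": "$10"}]'}
)
```

Here `rows` is a list of dictionaries that can be passed straight to `csv.DictWriter`.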

POST /sessions/:id/navigate

Navigate to a new URL within the same session.

Body: { "url": "https://example.com/other" }

DELETE /sessions/:id

Close the browser tab and session.

POST /shutdown

Stop the bridge server and close all sessions.

GET /health

Health check. Returns { "status": "ok", "sessions": 0, "maxSessions": 5 }.


Skill: auto-scraping-to-csv v1.0.0 | Bridge: page-agent-bridge.mjs | Powered by Alibaba Page-Agent + Playwright

Usage Guidance
Use this only if you are comfortable running a local browser-control server. Keep it off except during active scraping, avoid using it with real accounts or purchases unless you manually review each step, and prefer a version that pins the Page-Agent script and protects the bridge with an auth token and restricted CORS.
Capability Analysis
Type: OpenClaw Skill · Name: auto-scraping-to-csv · Version: 1.0.0
The skill implements a local HTTP bridge (page-agent-bridge.mjs) that lacks authentication and exposes an 'executeJavascript' endpoint, allowing arbitrary code execution within the browser context. It also dynamically injects a script from a third-party CDN (jsdelivr.net), which introduces a supply chain risk. While these features are aligned with the stated scraping purpose, the lack of access controls and reliance on external unverified code represent significant security vulnerabilities.
Capability Tags
crypto, requires-sensitive-credentials
Capability Assessment
Purpose & Capability
Browser automation is central to the skill, but the provided bridge exposes broad actions including clicks, text input, navigation, and arbitrary JavaScript execution, which can go beyond CSV scraping into form submission or account-changing workflows.
Instruction Scope
The workflow describes an observe-think-act loop and includes high-impact browser actions, but it does not define approval checkpoints, read-only defaults, or limits for submissions, checkout flows, or arbitrary script execution.
Install Mechanism
The local Playwright install is expected, but Page-Agent is fetched at runtime from a CDN using @latest, so the executable code injected into pages can change without review or pinning.
Credentials
A local browser bridge is proportionate for this purpose, but the HTTP server has wildcard CORS and no visible authentication before executing actions.
Persistence & Privilege
The bridge is user-started and stoppable, not hidden or auto-started, but the instructions say it must stay running in a separate terminal, so users should shut it down after use.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install auto-scraping-to-csv
  3. After installation, invoke the skill by name or use /auto-scraping-to-csv
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
- Initial release of auto-scraping-to-csv.
- Scrape any webpage using text-based DOM manipulation via Playwright and Alibaba Page-Agent, controlled by Claude as the host model.
- No external LLM or OpenAI/Qwen API key required; only Node.js and Playwright.
- Export structured data to CSV using built-in workflows and convenient code snippets.
- Designed for data extraction, table scraping, UI verification, and automated workflow testing.
- Local setup instructions provided for bridge server and end-to-end usage.
Metadata
Slug auto-scraping-to-csv
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is Auto Scraping to CSV?

Scrape any webpage using text-based DOM manipulation and export structured data to CSV. Controls a local browser via Playwright + Alibaba Page-Agent. No exte... It is an AI Agent Skill for Claude Code / OpenClaw, with 50 downloads so far.

How do I install Auto Scraping to CSV?

Run "/install auto-scraping-to-csv" in the OpenClaw or Claude Code chat to install the skill. Before first use, complete the steps under First-Time Setup: install Playwright, copy the bridge script, and start the bridge.

Is Auto Scraping to CSV free?

Yes, Auto Scraping to CSV is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Auto Scraping to CSV support?

Auto Scraping to CSV is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Auto Scraping to CSV?

It is built and maintained by Science-Prof-Robot (@science-prof-robot); the current version is v1.0.0.
