
Auto Scraping to CSV

Author: Science-Prof-Robot · GitHub · v1.0.0 · MIT-0
cross-platform ⚠ suspicious
Total downloads: 50
Favorites: 0
Current installs: 0
Versions: 1
Install in OpenClaw
/install auto-scraping-to-csv
Description
Scrape any webpage using text-based DOM manipulation and export structured data to CSV. Controls a local browser via Playwright + Alibaba Page-Agent. No exte...
Usage instructions (SKILL.md)

Auto Scraping to CSV — Page-Agent Bridge

Scrape any webpage using text-based DOM manipulation and export structured data to CSV. Controls a local browser via Playwright + Alibaba Page-Agent. No external LLM required — Claude acts as the host model.

When to Use

  • Data extraction: "Extract all product names and prices from the listing"
  • Table scraping: "Get the top 10 rows from the pricing table"
  • News aggregation: "Scrape latest blog posts with titles, dates, and URLs"
  • Form & workflow testing: "Fill the signup form with a test email address and submit"
  • UI verification: "Verify the dashboard shows 3 items in the table"
  • End-to-end journeys: "Login → add item to cart → checkout → confirm order"
  • Regression testing: Re-run natural language test scripts after deploys

How It Works

Claude (Host Model)
    ↕  HTTP
Bridge Server (Node.js + Playwright)
    ↕  page.evaluate()
Browser (Chromium) ← Page-Agent injected
  1. Bridge launches a local Chromium browser via Playwright
  2. Page-Agent is injected as an IIFE script from CDN into the target page
  3. Page-Agent indexes the DOM and generates a simplified text representation of interactive elements with numeric indices:
    [5]<button>Submit</button>
    [12]<input placeholder="Email" type="email"/>
    
  4. Claude receives the text state, decides the next action, and instructs the bridge to execute it
  5. Loop continues until the task is complete or max steps reached
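
To make step 3 concrete, here is a minimal Python sketch of how a host could read that indexed text back into a lookup table. The line format is inferred only from the two sample lines above; the `parse_indexed_dom` name and the regular expression are illustrative assumptions, not part of Page-Agent.

```python
import re

# Matches lines such as: [5]<button>Submit</button>
# The line format is inferred from the sample above, not from a Page-Agent spec.
INDEXED_LINE = re.compile(r"^\s*\[(\d+)\](<(\w+).*)$")


def parse_indexed_dom(text: str) -> dict:
    """Map each numeric index to its tag name and the markup that follows it."""
    elements = {}
    for line in text.splitlines():
        match = INDEXED_LINE.match(line.rstrip())
        if match is None:
            continue  # plain text or a blank line, not an indexed element
        elements[int(match.group(1))] = {"tag": match.group(3), "markup": match.group(2)}
    return elements


sample = '[5]<button>Submit</button>\n[12]<input placeholder="Email" type="email"/>'
elements = parse_indexed_dom(sample)
```

With a table like this, picking "the Submit button" becomes a lookup that yields index 5, which is the value an action such as clickElement expects.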

Key Design Decisions

| Decision | Rationale |
|----------|-----------|
| Text-based DOM | No screenshots, no vision model needed. Faster and cheaper. |
| Host model | Claude is the reasoning engine. No OpenAI/Qwen API key needed. |
| HTTP bridge | Playwright runs in Node.js; Claude communicates via simple HTTP. |
| Turn-based loop | Compatible with Claude Code's chat interaction model. |
| CDN injection | No npm install of page-agent needed; auto-updates to latest. |
| CSV export | Built-in workflow to convert scraped JSON data to CSV files. |

First-Time Setup

1. Install Playwright

npm install -D playwright
npx playwright install chromium

2. Place the Bridge Script

After installing this skill, copy the bundled bridge script to .claude/agents/:

cp .claude/skills/auto-scraping-to-csv/page-agent-bridge.mjs .claude/agents/

3. Start the Bridge

In a separate terminal (the bridge must stay running):

node .claude/agents/page-agent-bridge.mjs

Default port: 9876. Custom port:

node .claude/agents/page-agent-bridge.mjs 8888

You should see:

🚀  Page-Agent Bridge (Host Model) running on http://localhost:9876

4. Verify Health

curl http://localhost:9876/health

Expected: { "status": "ok", "sessions": 0, "maxSessions": 5 }
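
The same check can be scripted. The sketch below assumes only the response shape shown above; `parse_health` and `check_bridge` are hypothetical helper names, and treating `maxSessions - sessions` as the number of free slots is an interpretation of those two fields rather than documented behavior.

```python
import json
import urllib.error
import urllib.request


def parse_health(body: str) -> dict:
    """Summarise a /health response body as readiness plus free session slots."""
    info = json.loads(body)
    free = int(info.get("maxSessions", 0)) - int(info.get("sessions", 0))
    return {"ok": info.get("status") == "ok", "free_slots": max(free, 0)}


def check_bridge(base_url: str = "http://localhost:9876", timeout: float = 2.0) -> dict:
    """Call GET /health; report not-ok instead of raising when the bridge is down."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return parse_health(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        return {"ok": False, "free_slots": 0}


summary = parse_health('{ "status": "ok", "sessions": 0, "maxSessions": 5 }')
```

Calling `check_bridge()` before creating a session gives a quick way to tell "bridge not started" apart from "bridge full".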


Workflow

Phase 1: Initialize Session

curl -X POST http://localhost:9876/sessions \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "headless": false}'

Response:

{ "id": "a1b2c3d4", "url": "https://example.com" }

Phase 2: Observe → Think → Act Loop

Step 2a. Observe (fetch DOM state)

curl http://localhost:9876/sessions/a1b2c3d4/state

Step 2b. Think (Claude decides)

Based on the content text, identify the target element index and choose an action.

Step 2c. Act (execute action)

curl -X POST http://localhost:9876/sessions/a1b2c3d4/act \
  -H "Content-Type: application/json" \
  -d '{"action": "clickElement", "params": {"index": 5}}'

Repeat observe → act until complete.
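
The control flow of this phase can be sketched in a few lines of Python. Everything here is illustrative: `run_loop` and the three callables are hypothetical names, and the stand-in functions replace the real bridge calls and the model's decision so the sketch runs offline.

```python
from typing import Callable, Optional


def run_loop(
    observe: Callable[[], str],
    decide: Callable[[str], Optional[dict]],
    act: Callable[[dict], str],
    max_steps: int = 10,
) -> list:
    """Repeat observe -> decide -> act until decide() returns None or max_steps is hit."""
    log = []
    for _ in range(max_steps):
        state = observe()        # e.g. GET /sessions/:id/state
        action = decide(state)   # the host model's choice; None means "done"
        if action is None:
            break
        log.append(act(action))  # e.g. POST /sessions/:id/act
    return log


# Stand-ins for the bridge and the model (assumptions for the demo only).
clicked = []


def fake_observe() -> str:
    return "" if clicked else "[5]<button>Submit</button>"


def fake_decide(state: str) -> Optional[dict]:
    if "[5]" in state:
        return {"action": "clickElement", "params": {"index": 5}}
    return None


def fake_act(action: dict) -> str:
    clicked.append(action["params"]["index"])
    return f"clicked {action['params']['index']}"


log = run_loop(fake_observe, fake_decide, fake_act)
```

The `max_steps` cap matters in practice: without it, a page that never reaches the expected state would keep the loop running indefinitely.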

Phase 3: Close Session

curl -X DELETE http://localhost:9876/sessions/a1b2c3d4

Or stop the bridge:

curl -X POST http://localhost:9876/shutdown

Scraping to CSV Workflow

Step 1: Navigate and Get DOM State

Start a session on your target URL and fetch the DOM state:

curl -X POST http://localhost:9876/sessions \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/products", "headless": true}'

Step 2: Extract Structured Data via JavaScript

Use the executeJavascript action to extract data from the page:

cat > /tmp/extract.json << 'EOF'
{"action": "executeJavascript", "params": {"script": "const items = Array.from(document.querySelectorAll('.product')).map(el => ({name: el.querySelector('.title').textContent.trim(), price: el.querySelector('.price').textContent.trim(), url: el.querySelector('a').href})); return JSON.stringify(items);"}}
EOF

curl -X POST http://localhost:9876/sessions/SESSION_ID/act \
  -H "Content-Type: application/json" \
  -d @/tmp/extract.json
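
Hand-escaping JavaScript inside a JSON string is easy to get wrong. As a sketch, the request body can instead be generated in Python with `json.dumps`. The `build_extract_payload` helper and its field-to-selector mapping are hypothetical; the selectors come from the example above. Unlike that example, the generated script uses optional chaining so that a missing child element yields an empty string instead of throwing.

```python
import json


def build_extract_payload(item_selector: str, fields: dict) -> str:
    """Return the JSON body for an executeJavascript action.

    fields maps an output column name to a CSS selector inside each item.
    """
    # json.dumps emits valid JavaScript string literals, so quotes in selectors are safe.
    props = ", ".join(
        f"{json.dumps(name)}: (el.querySelector({json.dumps(sel)})?.textContent ?? '').trim()"
        for name, sel in fields.items()
    )
    script = (
        f"const items = Array.from(document.querySelectorAll({json.dumps(item_selector)}))"
        f".map(el => ({{{props}}})); return JSON.stringify(items);"
    )
    return json.dumps({"action": "executeJavascript", "params": {"script": script}})


body = build_extract_payload(".product", {"name": ".title", "price": ".price"})
```

Writing `body` to `/tmp/extract.json` gives a file that can be posted with the same curl command.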

Step 3: Convert JSON to CSV

Option A — Python (recommended)

python3 << 'PYEOF'
import csv, json, re

# The bridge returns: "✅ Executed JavaScript. Result: [{...}, {...}]"
# Extract the JSON array from the message
msg = """PASTE_BRIDGE_RESPONSE_HERE"""
# re.DOTALL lets the match span line breaks inside the pasted response
match = re.search(r'Result: (\[.*\])', msg, re.DOTALL)
data = json.loads(match.group(1)) if match else []
if data:
    # utf-8 keeps non-ASCII text intact regardless of the system locale
    with open('output.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)
    print(f"Wrote {len(data)} rows to output.csv")
else:
    print("No rows found in the bridge response")
PYEOF

Option B — jq + csvkit

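# Install csvkit: pip install csvkit (it provides in2csv; json2csv is not a csvkit command)
# jq keeps only the name and price keys; in2csv converts the JSON array to CSV
# -I disables type inference so values such as "$10" are written unchanged
echo '[{"name":"A","price":"$10"},{"name":"B","price":"$20"}]' | \
  jq '[.[] | {name, price}]' | \
  in2csv -f json -I > output.csv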

Option C — Node.js (no extra deps)

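const fs = require('fs');
// data.json holds the JSON array returned by the bridge
const data = JSON.parse(fs.readFileSync('data.json', 'utf8'));
const headers = Object.keys(data[0]);
// Quote every field and double any embedded quotes; String() also handles numbers
const quote = value => `"${String(value ?? '').replace(/"/g, '""')}"`;
const csv = [
  headers.join(','),
  ...data.map(row => headers.map(h => quote(row[h])).join(','))
].join('\n');
fs.writeFileSync('output.csv', csv);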
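
Options A and C take the column list from the first row only, so a row with extra or missing keys either raises an error or loses data. A minimal sketch of one way around that, assuming the `Result: [...]` message format described in Option A; `bridge_message_to_csv` is a hypothetical helper name.

```python
import csv
import io
import json
import re


def bridge_message_to_csv(message: str) -> str:
    """Convert the JSON array after 'Result: ' in a bridge message into CSV text."""
    match = re.search(r"Result: (\[.*\])", message, re.DOTALL)
    if match is None:
        raise ValueError("no JSON array found after 'Result: '")
    rows = json.loads(match.group(1))
    # Union of keys in first-seen order, so no column is dropped.
    columns = list(dict.fromkeys(key for row in rows for key in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


message = '✅ Executed JavaScript. Result: [{"name": "A", "price": "$10"}, {"name": "B", "url": "/b"}]'
csv_text = bridge_message_to_csv(message)
```

Missing values are written as empty cells via `restval`, which keeps every row aligned with the header.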

Complete Example: Scrape Anthropic News to CSV

# 1. Start bridge (in separate terminal)
node .claude/agents/page-agent-bridge.mjs

# 2. Create session
curl -s -X POST http://localhost:9876/sessions \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.anthropic.com/news", "headless": true}'
# → { "id": "abc123" }

# 3. Extract news data
cat > /tmp/extract.json << 'EOF'
{"action": "executeJavascript", "params": {"script": "const items = []; document.querySelectorAll('a[href*=\"/news/\"]').forEach(a => { const href = a.href; if (!href.includes('anthropic.com/news/')) return; const h2 = a.querySelector('h2, h3'); const title = h2 ? h2.textContent.trim() : ''; const time = a.querySelector('time'); const date = time ? time.textContent.trim() : ''; if (title && date) items.push({title, date, url: href}); }); return JSON.stringify(items.slice(0, 15));"}}
EOF

curl -s -X POST http://localhost:9876/sessions/abc123/act \
  -H "Content-Type: application/json" -d @/tmp/extract.json

# 4. Convert to CSV (see Python script above)

# 5. Close session
curl -X DELETE http://localhost:9876/sessions/abc123

Available Actions

| Action | Params | Description |
|--------|--------|-------------|
| getBrowserState | (none) | Refresh DOM tree and return full page state |
| clickElement | { index: number } | Click the interactive element at index |
| inputText | { index: number, text: string } | Click then type into input element |
| selectOption | { index: number, optionText: string } | Select dropdown option by visible text |
| scroll | { down?, num_pages?, pixels?, index? } | Scroll vertically |
| scrollHorizontally | { right?, pixels, index? } | Scroll horizontally |
| executeJavascript | { script: string } | Run arbitrary JS in page context (async/await supported) |
| wait | { seconds: number } | Pause execution |
| cleanUpHighlights | (none) | Remove all Page-Agent visual highlights |
| updateTree | (none) | Re-index the DOM manually |
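
A client can catch malformed requests before they reach the bridge. The sketch below transcribes the parameter names from the table above, treating names marked with `?` as optional. The `validate_action` helper is hypothetical, and whether the bridge itself rejects unknown parameters is not stated in this document.

```python
from typing import Optional

# action name -> (required params, optional params), transcribed from the table.
ACTIONS = {
    "getBrowserState": (set(), set()),
    "clickElement": ({"index"}, set()),
    "inputText": ({"index", "text"}, set()),
    "selectOption": ({"index", "optionText"}, set()),
    "scroll": (set(), {"down", "num_pages", "pixels", "index"}),
    "scrollHorizontally": ({"pixels"}, {"right", "index"}),
    "executeJavascript": ({"script"}, set()),
    "wait": ({"seconds"}, set()),
    "cleanUpHighlights": (set(), set()),
    "updateTree": (set(), set()),
}


def validate_action(action: str, params: Optional[dict] = None) -> list:
    """Return a list of problems; an empty list means the request looks well-formed."""
    if action not in ACTIONS:
        return [f"unknown action: {action}"]
    required, optional = ACTIONS[action]
    given = set(params or {})
    problems = [f"missing param: {name}" for name in sorted(required - given)]
    problems += [f"unexpected param: {name}" for name in sorted(given - required - optional)]
    return problems


problems = validate_action("inputText", {"index": 12})
```

Here `problems` reports the missing `text` parameter, which is cheaper to discover locally than through a failed browser action.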

Natural Language Commands

When this skill is active, Claude accepts natural language commands:

/scrape-to-csv <url> <description>

General scraping task with CSV export.

/scrape-to-csv https://example.com/products
  "Extract all product names, prices, and availability. Save as CSV."

/scrape-table <url> <selector>

Extract a specific HTML table.

/scrape-table https://example.com/pricing ".pricing-table"

/scrape-news <url>

Extract news/blog articles with titles, dates, and URLs.

/scrape-news https://www.anthropic.com/news

/test-frontend <url> <task>

Test forms, workflows, or UI interactions.

/test-frontend https://staging.example.com/signup
  "Fill the form with test data, submit, and verify welcome page"

Output Format

Claude produces a structured markdown report:

## Scraping Report — example.com/products
**Session:** a1b2c3d4 | **Duration:** 4.2s | **Rows:** 12

### Task
Extract all product names, prices, and availability. Save as CSV.

### Execution Log

| Step | Action | Target | Result |
|------|--------|--------|--------|
| 1 | getBrowserState | — | 24 interactive elements found |
| 2 | executeJavascript | products | ✅ 12 items extracted |
| 3 | — | — | ✅ CSV written: 12 rows |

### Sample Data

| name | price | availability |
|------|-------|-------------|
| Widget A | $19.99 | In stock |
| Widget B | $29.99 | Out of stock |

### File
`./output.csv` — 12 rows, 3 columns

Troubleshooting

Bridge won't start

Error: Cannot find module 'playwright'

Fix: npm install -D playwright && npx playwright install chromium

Browser page is blank

Cause: Page didn't finish loading before Page-Agent injection.
Fix: The bridge already uses waitUntil: 'networkidle'. For SPAs, add a wait action after navigation.

Element index not found

Cause: DOM tree stale; element was added after last updateTree().
Fix: Call getBrowserState (which refreshes the tree) before acting.

CORS errors in browser console

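Cause: Page-Agent IIFE loaded from CDN on a page with a strict Content-Security-Policy (CSP).
Fix: The bridge injects via page.addScriptTag({ url: CDN_URL }). A strict CSP can still block a script added this way; if it does, creating the Playwright browser context with bypassCSP: true is the usual workaround.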

Headless vs headed mode

  • Headed (headless: false): You can watch the browser. Good for debugging.
  • Headless (headless: true): Faster, good for CI.

Comparison with Other Tools

| Tool | DOM Type | LLM Required | Speed | Best For |
|------|----------|--------------|-------|----------|
| Page-Agent Bridge | Text | Host (Claude) | Fast | Precise UI tasks, forms, data extraction |
| /browse (gstack) | Visual + DOM | Host (Claude) | Medium | General QA, screenshots, visual checks |
| Playwright E2E | Code | None | Fastest | Repeatable CI suites, regression |
| Browser-Use | Text | External API | Medium | Complex multi-page research |
| Scrapy | Code | None | Fast | Large-scale crawling, pipelines |

Use this skill when:

  • You want natural language scraping commands
  • You don't want to write CSS selectors or XPath
  • You need quick one-off data extraction to CSV
  • You're iterating on frontend behavior and need verification
  • You want structured text evidence (DOM snapshots) instead of screenshots

Bridge API Reference

POST /sessions

Launch a new browser session.

Body:

{ "url": "https://example.com", "headless": false, "viewport": { "width": 1280, "height": 720 } }

Response: { "id": "abc123", "url": "https://example.com" }

GET /sessions/:id/state

Get current browser state including simplified DOM text.

Response: BrowserState object with url, title, header, content, footer.

POST /sessions/:id/act

Execute a Page-Agent action.

Body:

{ "action": "executeJavascript", "params": { "script": "return document.title;" } }

Response: { "success": true, "message": "✅ Executed JavaScript. Result: ..." }

POST /sessions/:id/navigate

Navigate to a new URL within the same session.

Body: { "url": "https://example.com/other" }

DELETE /sessions/:id

Close the browser tab and session.

POST /shutdown

Stop the bridge server and close all sessions.

GET /health

Health check. Returns { "status": "ok", "sessions": 0, "maxSessions": 5 }.
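
For scripted use, the endpoints above map naturally onto a small request builder. This is a sketch using only the Python standard library; the `BridgeRequests` class and `send` function are hypothetical names, and omitting `params` for actions that take none is an assumption about what the bridge accepts.

```python
import json
import urllib.request
from typing import Optional


class BridgeRequests:
    """Builds (but does not send) urllib requests for the endpoints listed above."""

    def __init__(self, base_url: str = "http://localhost:9876") -> None:
        self.base_url = base_url.rstrip("/")

    def _request(self, method: str, path: str, body: Optional[dict] = None):
        data = None if body is None else json.dumps(body).encode("utf-8")
        headers = {"Content-Type": "application/json"} if data is not None else {}
        return urllib.request.Request(
            self.base_url + path, data=data, headers=headers, method=method
        )

    def health(self):
        return self._request("GET", "/health")

    def create_session(self, url: str, headless: bool = True):
        return self._request("POST", "/sessions", {"url": url, "headless": headless})

    def state(self, session_id: str):
        return self._request("GET", f"/sessions/{session_id}/state")

    def act(self, session_id: str, action: str, params: Optional[dict] = None):
        body = {"action": action}
        if params is not None:  # assumption: params may be omitted when empty
            body["params"] = params
        return self._request("POST", f"/sessions/{session_id}/act", body)

    def navigate(self, session_id: str, url: str):
        return self._request("POST", f"/sessions/{session_id}/navigate", {"url": url})

    def close(self, session_id: str):
        return self._request("DELETE", f"/sessions/{session_id}")

    def shutdown(self):
        return self._request("POST", "/shutdown")


def send(request: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send a built request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


api = BridgeRequests()
click = api.act("a1b2c3d4", "clickElement", {"index": 5})
```

Keeping request construction separate from sending makes the builder easy to test without a running bridge; `send(click)` performs the actual call.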


Skill: auto-scraping-to-csv v1.0.0 | Bridge: page-agent-bridge.mjs | Powered by Alibaba Page-Agent + Playwright

Security recommendations
Use this only if you are comfortable running a local browser-control server. Keep it off except during active scraping, avoid using it with real accounts or purchases unless you manually review each step, and prefer a version that pins the Page-Agent script and protects the bridge with an auth token and restricted CORS.
Functional analysis
Type: OpenClaw Skill · Name: auto-scraping-to-csv · Version: 1.0.0
The skill implements a local HTTP bridge (page-agent-bridge.mjs) that lacks authentication and exposes an 'executeJavascript' endpoint, allowing arbitrary code execution within the browser context. It also dynamically injects a script from a third-party CDN (jsdelivr.net), which introduces a supply chain risk. While these features are aligned with the stated scraping purpose, the lack of access controls and the reliance on external unverified code represent significant security vulnerabilities.
Capability tags
crypto, requires-sensitive-credentials
Capability assessment
Purpose & Capability
Browser automation is central to the skill, but the provided bridge exposes broad actions including clicks, text input, navigation, and arbitrary JavaScript execution, which can go beyond CSV scraping into form submission or account-changing workflows.
Instruction Scope
The workflow describes an observe-think-act loop and includes high-impact browser actions, but it does not define approval checkpoints, read-only defaults, or limits for submissions, checkout flows, or arbitrary script execution.
Install Mechanism
The local Playwright install is expected, but Page-Agent is fetched at runtime from a CDN using @latest, so the executable code injected into pages can change without review or pinning.
Credentials
A local browser bridge is proportionate for this purpose, but the HTTP server has wildcard CORS and no visible authentication before executing actions.
Persistence & Privilege
The bridge is user-started and stoppable, not hidden or auto-started, but the instructions say it must stay running in a separate terminal, so users should shut it down after use.
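
To illustrate what the "Credentials" finding asks for, here is a sketch of a per-request check combining a shared token with an origin allow-list. It is written in Python for illustration only: the bridge itself is a Node.js script, and the `is_request_allowed` function, the `Authorization: Bearer` scheme, and the allow-list are all assumptions rather than features of the current bridge.

```python
import hmac
from typing import Mapping


def is_request_allowed(headers: Mapping[str, str], token: str, allowed_origins: frozenset) -> bool:
    """Accept only requests that carry the shared token and, if present, an allowed Origin."""
    supplied = headers.get("Authorization", "")
    expected = f"Bearer {token}"
    # compare_digest avoids leaking how much of the token matched via timing.
    if not hmac.compare_digest(supplied.encode("utf-8"), expected.encode("utf-8")):
        return False
    origin = headers.get("Origin")
    # Command-line clients such as curl send no Origin header; browser fetch/XHR
    # calls made from another site do, so those must be on the allow-list.
    return origin is None or origin in allowed_origins


allowed = frozenset({"http://localhost:3000"})
ok = is_request_allowed({"Authorization": "Bearer s3cret"}, "s3cret", allowed)
```

A check of this kind, applied before any action runs, would stop an arbitrary web page open in the user's browser from driving the bridge.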
How to use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install auto-scraping-to-csv
  3. Once installed, invoke the Skill by name or trigger it with /auto-scraping-to-csv
  4. Provide the inputs described in the Skill's parameter notes to get structured output
Version history
v1.0.0
- Initial release of auto-scraping-to-csv.
- Scrape any webpage using text-based DOM manipulation via Playwright and Alibaba Page-Agent, controlled by Claude as the host model.
- No external LLM or OpenAI/Qwen API key required; only Node.js and Playwright.
- Export structured data to CSV using built-in workflows and convenient code snippets.
- Designed for data extraction, table scraping, UI verification, and automated workflow testing.
- Local setup instructions provided for bridge server and end-to-end usage.
Metadata
Slug: auto-scraping-to-csv
Version: 1.0.0
License: MIT-0
Total installs: 0
Current installs: 0
Versions: 1
FAQ

What is Auto Scraping to CSV?

Scrape any webpage using text-based DOM manipulation and export structured data to CSV. Controls a local browser via Playwright + Alibaba Page-Agent. No exte... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 50 total downloads so far.

How do I install Auto Scraping to CSV?

Run the command "/install auto-scraping-to-csv" in the OpenClaw or Claude Code chat box to install it in one step; no extra configuration is needed.

Is Auto Scraping to CSV free?

Yes. Auto Scraping to CSV is completely free and released under the MIT-0 license; you may download, install, and use it freely.

Which platforms does Auto Scraping to CSV support?

Auto Scraping to CSV is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Auto Scraping to CSV?

It is developed and maintained by Science-Prof-Robot (@science-prof-robot); the current version is v1.0.0.
