
Data Scraper

Name: Data Scraper
Author: mupengi-bot

Version: v1.0.0

cross-platform ⚠ suspicious

Total downloads: 1427

Install in OpenClaw

/install data-scraper

Description

Web page data collection and structured text extraction

Usage (SKILL.md)

data-scraper

Web Data Scraper — Extract structured data from web pages using curl + parsing. Lightweight, no browser required. Supports HTML-to-text, table extraction, price monitoring, and batch scraping.

When to Use

Extract text content from web pages (articles, blogs, docs)
Scrape product prices, reviews, or listings
Monitor pages for changes (price drops, new content)
Batch-collect data from multiple URLs
Convert HTML tables to structured formats (JSON/CSV)

Quick Start

# Extract readable text from URL
data-scraper fetch "https://example.com/article"

# Extract specific elements
data-scraper extract "https://example.com" --selector "h2, .price"

# Monitor for changes
data-scraper watch "https://example.com/product" --interval 3600

Extraction Modes

Text Mode (default)

Fetches page and extracts readable content, stripping HTML tags, scripts, and styles. Similar to reader mode.

data-scraper fetch URL
# Output: clean markdown text
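
As an illustration of what "stripping HTML tags, scripts, and styles" can involve, here is a minimal Python sketch built on the standard library's `html.parser`. It is not the skill's own implementation; the names `TextExtractor` and `html_to_text` are hypothetical, and it emits plain text rather than markdown.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the readable text of an HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.text()
```

A real reader-mode extractor would also drop navigation, ads, and other page chrome; this sketch only removes markup.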

Selector Mode

Target specific CSS selectors for precise extraction.

data-scraper extract URL --selector ".product-title, .price, .rating"
# Output: matched elements as structured data
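
To make the idea of selector-based extraction concrete, the following Python sketch matches a deliberately tiny subset of CSS: comma-separated tag names and single `.class` selectors. Everything here (`parse_selectors`, `SelectorExtractor`, `select`) is a hypothetical illustration using only the standard library, not the skill's code; full CSS selector support would need a dedicated HTML parsing library.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def parse_selectors(selector):
    """Split 'h2, .price' into [('tag', 'h2'), ('class', 'price')]."""
    rules = []
    for part in selector.split(","):
        part = part.strip()
        if not part:
            continue
        if part.startswith("."):
            rules.append(("class", part[1:]))
        else:
            rules.append(("tag", part.lower()))
    return rules


class SelectorExtractor(HTMLParser):
    def __init__(self, selector):
        super().__init__(convert_charrefs=True)
        self._rules = parse_selectors(selector)
        self._stack = []  # (tag, index into results or None) per open element
        self.results = []

    def _match(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for kind, value in self._rules:
            if kind == "tag" and tag == value:
                return value
            if kind == "class" and value in classes:
                return "." + value
        return None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never hold text and have no end tag
        matched = self._match(tag, attrs)
        index = None
        if matched is not None:
            self.results.append({"selector": matched, "text": ""})
            index = len(self.results) - 1
        self._stack.append((tag, index))

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag name.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        # Text belongs to every matched element that is currently open.
        for _, index in self._stack:
            if index is not None:
                self.results[index]["text"] += data


def select(html, selector):
    parser = SelectorExtractor(selector)
    parser.feed(html)
    parser.close()
    return [
        {"selector": r["selector"], "text": " ".join(r["text"].split())}
        for r in parser.results
    ]
```

Descendant combinators, IDs, and attribute selectors are intentionally out of scope for this sketch.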

Table Mode

Extract HTML tables into structured formats.

data-scraper table URL --index 0
# Output: JSON array of row objects (header → value mapping)
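
The "header → value mapping" described above can be sketched in Python as follows. This is a hedged illustration using the standard library only; `TableParser` and `table_to_records` are hypothetical names, the first row is assumed to be the header, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace so cells compare cleanly.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_to_records(html, index=0):
    """Return the table at position `index` as a list of {header: value} dicts."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    rows = parser.tables[index]
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]
```

Passing the result to `json.dumps` yields the JSON array of row objects that the table mode describes.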

Link Mode

Extract all links from a page with optional filtering.

data-scraper links URL --filter "*.pdf"
# Output: filtered list of absolute URLs
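
A link mode of this kind needs two steps: resolve each `href` against the page URL, then apply the glob filter. The Python sketch below shows one way to do that with the standard library; `LinkCollector` and `extract_links` are hypothetical names, and glob matching against the full absolute URL is an assumption about how `--filter` behaves.

```python
from fnmatch import fnmatchcase
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Gather the href of every <a> element as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # urljoin resolves relative paths and leaves absolute URLs alone.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url, pattern=None):
    """Return absolute links, optionally keeping only those matching a glob."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    if pattern is None:
        return collector.links
    return [url for url in collector.links if fnmatchcase(url, pattern)]
```

`fnmatchcase` is used instead of `fnmatch` so the match is case-sensitive on every platform.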

Batch Scraping

# Scrape multiple URLs
data-scraper batch urls.txt --output results/

# With rate limiting
data-scraper batch urls.txt --delay 2000 --output results/

urls.txt format:

https://site1.com/page
https://site2.com/page
https://site3.com/page
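
The batch flow above (read one URL per line, fetch each, pause between requests) can be sketched in Python like this. It is only an illustration: `read_url_list` and `batch_fetch` are hypothetical names, the `fetch` callable is supplied by the caller, and treating `--delay` as milliseconds is an assumption based on the value 2000 in the example.

```python
import time


def read_url_list(lines):
    """Return the non-empty lines of a urls.txt-style input, stripped."""
    return [line.strip() for line in lines if line.strip()]


def batch_fetch(urls, fetch, delay_ms=0, sleep=time.sleep):
    """Fetch each URL in order, pausing `delay_ms` between requests.

    `fetch` is any callable taking a URL and returning its content.
    A failure on one URL is recorded and does not stop the batch.
    """
    results = {}
    for position, url in enumerate(urls):
        if position > 0 and delay_ms > 0:
            sleep(delay_ms / 1000)
        try:
            results[url] = fetch(url)
        except Exception as exc:  # keep going; report the failure per URL
            results[url] = f"ERROR: {exc}"
    return results
```

For example, `batch_fetch(read_url_list(open("urls.txt")), fetch=my_fetch, delay_ms=2000)` would process the file shown above, where `my_fetch` stands for whatever download function is in use.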

Change Monitoring

# Watch for changes, alert on diff
data-scraper watch URL --selector ".price" --interval 3600

# Compare with previous snapshot
data-scraper diff URL

Stores timestamped snapshots in data-scraper/snapshots/. Sends alerts via notification-hub when changes are detected.
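
Comparing a new fetch against the previous snapshot can be done with a cheap hash check followed by a line diff. The Python sketch below illustrates that idea with the standard library; `fingerprint` and `diff_snapshots` are hypothetical names, and snapshot storage and notification delivery are left out.

```python
import difflib
import hashlib


def fingerprint(text):
    """Return a stable SHA-256 hex digest of the snapshot text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_snapshots(old, new):
    """Return unified-diff lines, or [] when the snapshots are identical."""
    if fingerprint(old) == fingerprint(new):
        return []
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )
```

A watcher loop would call `diff_snapshots` once per interval and raise an alert only when the returned list is non-empty.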

Output Formats

Format	Flag	Use Case
Text	`--format text`	Reading, summarization
JSON	`--format json`	Data processing
CSV	`--format csv`	Spreadsheets
Markdown	`--format md`	Documentation
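
As a rough picture of how one set of extracted records could be rendered in each of the four formats, here is a Python sketch. The `render` function is hypothetical, and the exact layout of each format (for instance, `key: value` lines for text) is an assumption rather than documented behavior.

```python
import csv
import io
import json


def render(records, fmt="json"):
    """Render a list of flat dicts as json, csv, md, or text."""
    if fmt == "json":
        return json.dumps(records, indent=2, ensure_ascii=False)
    if fmt not in ("csv", "md", "text"):
        raise ValueError(f"unsupported format: {fmt}")
    if not records:
        return ""
    fields = list(records[0])  # column order follows the first record
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)
        return buffer.getvalue()
    if fmt == "md":
        header = "| " + " | ".join(fields) + " |"
        rule = "| " + " | ".join("---" for _ in fields) + " |"
        body = [
            "| " + " | ".join(str(r.get(f, "")) for f in fields) + " |"
            for r in records
        ]
        return "\n".join([header, rule, *body])
    # fmt == "text": one "key: value" line per field, blank line between records
    return "\n\n".join(
        "\n".join(f"{key}: {value}" for key, value in r.items()) for r in records
    )
```

Using the `csv` module rather than manual string joins keeps commas and quotes inside values correctly escaped.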

Headers & Auth

# Custom headers
data-scraper fetch URL --header "Authorization: Bearer TOKEN"

# Cookie-based auth
data-scraper fetch URL --cookie "session=abc123"

# User-Agent override
data-scraper fetch URL --ua "Mozilla/5.0..."
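
For comparison, the same three options map onto ordinary HTTP request headers. The Python sketch below builds (but does not send) a request with `urllib.request.Request`; `build_request` and its parameters are hypothetical names chosen for illustration.

```python
from urllib.request import Request


def build_request(url, bearer_token=None, cookie=None, user_agent=None):
    """Build an HTTP request carrying optional auth, cookie, and UA headers."""
    request = Request(url)
    if bearer_token:
        request.add_header("Authorization", f"Bearer {bearer_token}")
    if cookie:
        request.add_header("Cookie", cookie)
    if user_agent:
        request.add_header("User-Agent", user_agent)
    return request
```

The request can then be sent with `urllib.request.urlopen(request, timeout=30)`. Tokens and session cookies are credentials, so they are better read from the environment than written into scripts or logs.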

Rate Limiting & Ethics

Default: 1 request per second per domain
Respects robots.txt when --polite flag is set
Configurable delay between requests
Stops on 429 (Too Many Requests) and backs off
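
The two politeness rules above, honoring robots.txt and spacing requests per domain, can be sketched in Python with `urllib.robotparser`. This is a hedged illustration, not the skill's code: `is_allowed` and `DomainRateLimiter` are hypothetical names, and the robots.txt text is assumed to have been downloaded already.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Check a URL against already-downloaded robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class DomainRateLimiter:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Sleep if needed before requesting `url`; return seconds waited."""
        host = urlsplit(url).netloc
        waited = 0.0
        last = self._last_request.get(host)
        if last is not None:
            remaining = self._min_interval - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last_request[host] = self._clock()
        return waited
```

Calling `limiter.wait(url)` before each fetch gives the default of one request per second per domain, while requests to different hosts are not delayed by each other.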

Error Handling

Error	Behavior
404	Log and skip
403/401	Warn about auth requirement
429	Exponential backoff (max 3 retries)
Timeout	Retry once with longer timeout
SSL error	Warn, option to proceed with `--insecure`
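
The 404 and 429 rows of the table can be sketched in Python as a small retry wrapper. `HttpError` and `fetch_with_backoff` are hypothetical names, the `fetch` callable is supplied by the caller, and the one-second base delay is an assumption; the timeout and SSL rows are not covered here.

```python
import time


class HttpError(Exception):
    """Raised by a fetch callable when the server returns an error status."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def fetch_with_backoff(url, fetch, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on 429 with exponentially growing delays.

    Returns None for a 404 (skip), re-raises any other error, and re-raises
    the 429 once `max_retries` retries have been used up.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except HttpError as exc:
            if exc.status == 404:
                return None  # caller can log and skip this URL
            if exc.status != 429 or attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Doubling the delay after each 429 gives the server progressively more room to recover before the next attempt.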

Integration

web-claude: Use as fallback when web_fetch isn't enough
competitor-watch: Feed scraped data into competitor analysis
seo-audit: Scrape competitor pages for SEO comparison
performance-tracker: Collect social metrics from public profiles

Security Recommendations

This skill's documentation promises a full-featured scraping tool, but the only runnable file is a minimal curl + HTML-strip script. It does not implement selectors, table parsing, batch jobs, monitoring, robots.txt handling, or notification integration, and its only JSON/CSV output is a small event file. Before installing or using it: (1) treat it as a lightweight fetcher, not the advertised full scraper; (2) inspect and test run.sh in a safe sandbox to confirm its behavior; (3) if you need selector, table, or monitoring features, ask the author to implement them or look for a different skill that actually does; (4) be cautious about running it against sites where scraping is disallowed, because the script does not enforce politeness or legal rules; (5) verify that any required tools (lynx, jq) are installed, and handle output safely to avoid accidental data leakage.

Functional Analysis

Type: OpenClaw Skill
Name: data-scraper
Version: 1.0.0

The skill is designed for web scraping, but the `run.sh` script contains a critical shell injection vulnerability. The `$URL` variable is directly interpolated into a `curl` command without sanitization, allowing an attacker to execute arbitrary commands if they can control the URL input. This vulnerability is also demonstrated in the `GUIDE.md` implementation examples. While the skill's stated purpose and ethical guidelines suggest benign intent, this severe flaw makes it suspicious.
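
One common way to avoid this class of bug is to validate the URL and pass it to `curl` as its own argument, so that no shell ever parses it. The Python sketch below illustrates the pattern; it is not a patch for `run.sh` (which is not reproduced here), and `validate_url`, `curl_argv`, and `fetch` are hypothetical names.

```python
import subprocess
from urllib.parse import urlsplit


def validate_url(url):
    """Accept only whitespace-free http(s) URLs with a host; raise otherwise."""
    if any(ch.isspace() for ch in url):
        raise ValueError("URL must not contain whitespace")
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError("only http:// and https:// URLs are accepted")
    return url


def curl_argv(url):
    """Build the curl command as an argument list, never as a shell string."""
    return [
        "curl",
        "--silent",
        "--show-error",
        "--location",
        "--max-time", "30",
        "--",  # end of options: the URL can never be read as a flag
        validate_url(url),
    ]


def fetch(url):
    """Run curl without a shell and return the response body as text."""
    completed = subprocess.run(
        curl_argv(url), capture_output=True, text=True, check=True
    )
    return completed.stdout
```

Because `subprocess.run` receives a list and `shell=False` is the default, characters such as `;`, `$(...)`, or backticks in the URL reach `curl` as literal text instead of being executed.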

Capability Assessment

⚠ Purpose & Capability

SKILL.md and GUIDE.md describe many features: selector mode, table extraction, batch scraping, watch/diff/monitoring, rate limiting, robots.txt respect, headers/cookies, JSON/CSV output, integrations/notification-hub. The only executable provided (run.sh) implements a minimal fetch: curl the URL, optionally run lynx or sed to strip tags, print to stdout, and write a small event file. There is no selector parsing, table mode, batch processing, monitoring loop, robots.txt handling, retries/backoff beyond curl failure handling, or integrations. The breadth of declared features is disproportionate to the actual code.

⚠ Instruction Scope

The SKILL.md/GUIDE.md instruct agents to do things (batch scraping, create snapshots, alert via notification-hub, use jq for JSON construction, respect --polite flag) that are not implemented by run.sh. The docs effectively give a to-do list of behaviors that would require additional binaries/tools (jq, lynx, selector-capable HTML parsers) and more complex logic; the runtime instructions are therefore ambiguous and could lead an agent to attempt operations that will fail or be implemented inconsistently by invoking ad-hoc shell pipelines.

✓ Install Mechanism

There is no install spec and no network downloads or packaged dependencies. The only included code is a simple shell script. This is low-risk from an install/execution distribution perspective (no external archives or installers).

✓ Credentials

The skill requests no environment variables, credentials, or config paths. The script uses WORKSPACE/EVENTS_DIR/MEMORY_DIR environment variables (with sane defaults) to write an event file; this is proportional to its stated behavior of producing a local event. No secrets or unrelated credentials are requested.

✓ Persistence & Privilege

The `always` setting is false and model invocation is allowed (the default). The script writes event files into a workspace events directory and otherwise prints to stdout; it does not modify other skills or system-wide config. No elevated persistence is requested.

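How to Use

Make sure OpenClaw is installed (locally or via Docker)
Enter the install command in the chat box: /install data-scraper
Once installed, invoke the Skill by name or trigger it with /data-scraper
Provide the required inputs described in the Skill's parameter documentation to get structured output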

Version History

v1.0.0

Initial release of data-scraper.
- Extract structured data from web pages using curl, no browser required.
- Supports HTML-to-text conversion, table extraction, price monitoring, and batch scraping.
- Multiple extraction modes: readable text, CSS selectors, tables, and link lists.
- Change monitoring with snapshots, diffing, and notifications.
- Flexible output formats: text, JSON, CSV, and Markdown.
- Customizable headers, cookies, rate limiting, and robots.txt respect with `--polite`.
- Integration with related skills for broader data workflows.

Metadata

Slug: data-scraper
Version: 1.0.0
License: not specified
Total installs: 8
Current installs: 8
Version count: 1

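FAQ

What is Data Scraper?

Web page data collection and structured text extraction. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 1427 total downloads to date.

How do I install Data Scraper?

Run the command "/install data-scraper" in an OpenClaw or Claude Code chat to install it in one step; no additional configuration is required.

Is Data Scraper free?

Yes. Data Scraper is completely free (free and open source) and can be freely downloaded, installed, and used.

Which platforms does Data Scraper support?

Data Scraper is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Data Scraper?

It is developed and maintained by mupengi-bot (@mupengi-bot). The current version is v1.0.0.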