Data Scraper

Author: mupengi-bot · GitHub · v1.0.0
cross-platform ⚠ suspicious
Total downloads: 1427
Favorites: 1
Current installs: 8
Versions: 1
Install in OpenClaw
/install data-scraper
Description
Web page data collection and structured text extraction
Usage (SKILL.md)

data-scraper

Web Data Scraper — Extract structured data from web pages using curl + parsing. Lightweight, no browser required. Supports HTML-to-text, table extraction, price monitoring, and batch scraping.

When to Use

  • Extract text content from web pages (articles, blogs, docs)
  • Scrape product prices, reviews, or listings
  • Monitor pages for changes (price drops, new content)
  • Batch-collect data from multiple URLs
  • Convert HTML tables to structured formats (JSON/CSV)

Quick Start

# Extract readable text from URL
data-scraper fetch "https://example.com/article"

# Extract specific elements
data-scraper extract "https://example.com" --selector "h2, .price"

# Monitor for changes
data-scraper watch "https://example.com/product" --interval 3600

Extraction Modes

Text Mode (default)

Fetches page and extracts readable content, stripping HTML tags, scripts, and styles. Similar to reader mode.

data-scraper fetch URL
# Output: clean markdown text
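
To make the "strip tags, scripts, and styles" step concrete, here is a minimal sketch using Python's standard html.parser. It is an illustration only, not the skill's implementation; the names TextExtractor and html_to_text are hypothetical, and it emits plain text rather than markdown.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Hypothetical helper: return the readable text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A real reader-mode extractor would also drop navigation and boilerplate blocks; this sketch only removes markup and non-visible code.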

Selector Mode

Target specific CSS selectors for precise extraction.

data-scraper extract URL --selector ".product-title, .price, .rating"
# Output: matched elements as structured data

Table Mode

Extract HTML tables into structured formats.

data-scraper table URL --index 0
# Output: JSON array of row objects (header → value mapping)
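
The header-to-value mapping described above can be sketched with the standard library alone. This is an assumption-laden illustration, not the skill's code: TableParser and table_to_records are hypothetical names, the first row is assumed to be the header, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects every <table> as a list of rows, each a list of cell strings."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def table_to_records(html, index=0):
    """Return the table at `index` as a list of {header: value} dicts."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    rows = parser.tables[index]
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]
```

Passing the result to json.dumps would yield the JSON array shape mentioned in the output comment.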

Link Mode

Extract all links from a page with optional filtering.

data-scraper links URL --filter "*.pdf"
# Output: filtered list of absolute URLs
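
One way to produce a filtered list of absolute URLs is sketched below. How the real --filter flag matches is not documented here, so the choice to apply the glob to the URL path only is an assumption, as are the names LinkParser and extract_links.

```python
from fnmatch import fnmatchcase
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html, base_url, pattern="*"):
    """Return de-duplicated absolute URLs whose path matches the glob."""
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    links = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        # Assumption: match against the path so query strings do not interfere.
        path = urlparse(absolute).path
        if fnmatchcase(path, pattern) and absolute not in links:
            links.append(absolute)
    return links
```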

Batch Scraping

# Scrape multiple URLs
data-scraper batch urls.txt --output results/

# With rate limiting
data-scraper batch urls.txt --delay 2000 --output results/

urls.txt format:

https://site1.com/page
https://site2.com/page
https://site3.com/page
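
A batch run over such a file amounts to a loop with a pause between requests. The sketch below is hypothetical: read_url_list and run_batch are illustrative names, the fetch function is supplied by the caller, and the delay is treated as milliseconds, which the flag description above does not state.

```python
import time


def read_url_list(text):
    """Return the non-blank lines of a urls.txt-style string."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def run_batch(urls, fetch, delay_ms=0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing delay_ms between requests.

    Failures are recorded instead of raised, so one bad URL does not
    end the whole batch.
    """
    results = {}
    for i, url in enumerate(urls):
        if i > 0 and delay_ms > 0:
            sleep(delay_ms / 1000)  # assumption: the delay is in milliseconds
        try:
            results[url] = fetch(url)
        except Exception as exc:
            results[url] = exc
    return results
```

Injecting fetch and sleep keeps the loop testable without network access or real waiting.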

Change Monitoring

# Watch for changes, alert on diff
data-scraper watch URL --selector ".price" --interval 3600

# Compare with previous snapshot
data-scraper diff URL

Stores snapshots in data-scraper/snapshots/ with timestamps. Alerts via notification-hub when changes detected.
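
Comparing a page against its previous snapshot can be sketched with hashlib and difflib. This is an illustration under assumptions, not the skill's implementation: the function names are hypothetical, and snapshot storage and notification are left out.

```python
import difflib
import hashlib


def content_fingerprint(text):
    """Return a stable hash for cheaply detecting any change."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_snapshots(previous, current):
    """Return unified-diff lines, or an empty list if nothing changed."""
    if content_fingerprint(previous) == content_fingerprint(current):
        return []
    return list(
        difflib.unified_diff(
            previous.splitlines(),
            current.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )
```

For a price watch, a non-empty result would be the trigger for an alert; lines starting with "-" and "+" show the old and new values.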

Output Formats

Format   | Flag          | Use Case
Text     | --format text | Reading, summarization
JSON     | --format json | Data processing
CSV      | --format csv  | Spreadsheets
Markdown | --format md   | Documentation

Headers & Auth

# Custom headers
data-scraper fetch URL --header "Authorization: Bearer TOKEN"

# Cookie-based auth
data-scraper fetch URL --cookie "session=abc123"

# User-Agent override
data-scraper fetch URL --ua "Mozilla/5.0..."
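
For reference, the three options above correspond to ordinary HTTP request headers. The sketch below builds, but does not send, such a request with urllib; build_request is a hypothetical helper, and the mapping of each flag to a header is an assumption about what the documented flags are meant to do.

```python
from urllib.request import Request


def build_request(url, headers=(), cookie=None, user_agent=None):
    """Build (but do not send) a request carrying the given headers."""
    request = Request(url)
    for line in headers:
        # Split "Name: value" on the first colon only.
        name, _, value = line.partition(":")
        request.add_header(name.strip(), value.strip())
    if cookie:
        request.add_header("Cookie", cookie)
    if user_agent:
        request.add_header("User-Agent", user_agent)
    return request
```

Tokens and session cookies passed this way are secrets; reading them from the environment rather than typing them on the command line keeps them out of shell history.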

Rate Limiting & Ethics

  • Default: 1 request per second per domain
  • Respects robots.txt when --polite flag is set
  • Configurable delay between requests
  • On 429 (Too Many Requests), stops sending requests and backs off before retrying
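
The per-domain limit and the robots.txt check listed above can be sketched as follows. This is a hedged illustration, not the skill's code: DomainRateLimiter and is_allowed are hypothetical names, and the robots.txt text is assumed to have been downloaded already.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class DomainRateLimiter:
    """Enforces a minimum interval between requests to the same host."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}

    def wait(self, url):
        """Block until a request to this URL's host is permitted."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_request.get(host)
        if last is not None:
            remaining = self.min_interval - (now - last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request[host] = now


def is_allowed(robots_txt, url, user_agent="*"):
    """Check a URL against already-downloaded robots.txt text."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)
```

Calling limiter.wait(url) before each request keeps traffic to any one host at or below one request per min_interval seconds, while other hosts are unaffected.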

Error Handling

Error     | Behavior
404       | Log and skip
403/401   | Warn about auth requirement
429       | Exponential backoff (max 3 retries)
Timeout   | Retry once with longer timeout
SSL error | Warn, option to proceed with --insecure
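
The 429 row describes exponential backoff with at most three retries. A minimal sketch of that policy follows; the base delay of one second and the doubling factor are assumptions, and TooManyRequests and fetch_with_backoff are hypothetical names.

```python
import time


class TooManyRequests(Exception):
    """Raised by the caller's fetch function on an HTTP 429 response."""


def fetch_with_backoff(fetch, url, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Retry fetch(url) on 429, doubling the wait before each retry."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except TooManyRequests:
            if attempt == max_retries:
                raise  # retries exhausted; let the caller decide what to do
            sleep(base_delay * 2 ** attempt)
```

With the defaults, the waits are 1, 2, and 4 seconds, after which the error propagates.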

Integration

  • web-claude: Use as fallback when web_fetch isn't enough
  • competitor-watch: Feed scraped data into competitor analysis
  • seo-audit: Scrape competitor pages for SEO comparison
  • performance-tracker: Collect social metrics from public profiles
Security Recommendations
This skill's documentation promises a full-featured scraping tool, but the only runnable file is a minimal curl + HTML-strip script. It does not implement selectors, table parsing, batch jobs, monitoring, robots.txt handling, notification integration, or JSON/CSV output beyond a small event file. Before installing or using it:
  1. Treat it as a lightweight fetcher, not the advertised full scraper.
  2. Inspect and test-run run.sh in a safe sandbox to confirm its behavior.
  3. If you need selector, table, or monitoring features, ask the author to implement them or look for a different skill that actually does.
  4. Be cautious about running it against sites where scraping is disallowed; the script does not enforce politeness or legal rules.
  5. Consider adding or verifying any required tools (lynx, jq) and safe output handling to avoid accidental data leakage.
Functional Analysis
Type: OpenClaw Skill
Name: data-scraper
Version: 1.0.0
The skill is designed for web scraping, but the `run.sh` script contains a critical shell injection vulnerability. The `$URL` variable is directly interpolated into a `curl` command without sanitization, allowing an attacker to execute arbitrary commands if they can control the URL input. This vulnerability is also demonstrated in the `GUIDE.md` implementation examples. While the skill's stated purpose and ethical guidelines suggest benign intent, this severe flaw makes it suspicious.
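
A common mitigation for this class of bug is to validate the URL and pass it to curl as a separate argument rather than through a shell string. The sketch below illustrates the idea; it is not a patch for run.sh, and build_curl_argv is a hypothetical helper.

```python
from urllib.parse import urlparse


def build_curl_argv(url):
    """Return an argv list for curl, rejecting anything that is not http(s)."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"refusing to fetch non-HTTP(S) URL: {url!r}")
    if any(ch.isspace() for ch in url):
        raise ValueError("URL must not contain whitespace")
    # "--" ends option parsing, so the URL can never be read as a curl flag.
    return ["curl", "--silent", "--show-error", "--location", "--", url]
```

Running the list with subprocess.run(argv) and without shell=True means no shell ever interprets characters such as ;, $( ), or backticks in the URL. In a shell script, the comparable precaution is to quote the variable as "$URL" and never build the command with eval.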
Capability Assessment
Purpose & Capability
SKILL.md and GUIDE.md describe many features: selector mode, table extraction, batch scraping, watch/diff/monitoring, rate limiting, robots.txt respect, headers/cookies, JSON/CSV output, integrations/notification-hub. The only executable provided (run.sh) implements a minimal fetch: curl the URL, optionally run lynx or sed to strip tags, print to stdout, and write a small event file. There is no selector parsing, table mode, batch processing, monitoring loop, robots.txt handling, retries/backoff beyond curl failure handling, or integrations. The breadth of declared features is disproportionate to the actual code.
Instruction Scope
The SKILL.md/GUIDE.md instruct agents to do things (batch scraping, create snapshots, alert via notification-hub, use jq for JSON construction, respect --polite flag) that are not implemented by run.sh. The docs effectively give a to-do list of behaviors that would require additional binaries/tools (jq, lynx, selector-capable HTML parsers) and more complex logic; the runtime instructions are therefore ambiguous and could lead an agent to attempt operations that will fail or be implemented inconsistently by invoking ad-hoc shell pipelines.
Install Mechanism
There is no install spec and no network downloads or packaged dependencies. The only included code is a simple shell script. This is low-risk from an install/execution distribution perspective (no external archives or installers).
Credentials
The skill requests no environment variables, credentials, or config paths. The script uses WORKSPACE/EVENTS_DIR/MEMORY_DIR environment variables (with sane defaults) to write an event file; this is proportional to its stated behavior of producing a local event. No secrets or unrelated credentials are requested.
Persistence & Privilege
The `always` setting is false, and model invocation is allowed (the default). The script writes event files into a workspace events directory and otherwise prints to stdout; it does not modify other skills or system-wide config. No elevated persistence is requested.
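How to Use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install data-scraper
  3. Once installed, invoke the Skill by name or trigger it with /data-scraper
  4. Provide the required inputs as described in the Skill's parameter documentation to get structured output
Version History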
v1.0.0
Initial release of data-scraper.
  • Extract structured data from web pages using curl, no browser required.
  • Supports HTML-to-text conversion, table extraction, price monitoring, and batch scraping.
  • Multiple extraction modes: readable text, CSS selectors, tables, and link lists.
  • Change monitoring with snapshots, diffing, and notifications.
  • Flexible output formats: text, JSON, CSV, and Markdown.
  • Customizable headers, cookies, rate limiting, and robots.txt respect with `--polite`.
  • Integration with related skills for broader data workflows.
Metadata
Slug: data-scraper
Version: 1.0.0
License: (none listed)
Total installs: 8
Current installs: 8
Versions: 1
FAQ

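What is Data Scraper?

Web page data collection and structured text extraction. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 1427 total downloads so far.

How do I install Data Scraper?

Run the command "/install data-scraper" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is required.

Is Data Scraper free?

Yes. Data Scraper is completely free (free and open source) and can be freely downloaded, installed, and used.

Which platforms does Data Scraper support?

Data Scraper is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Data Scraper?

It is developed and maintained by mupengi-bot (@mupengi-bot); the current version is v1.0.0.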
