Data Scraper

by mupengi-bot · v1.0.0

cross-platform ⚠ suspicious

Downloads: 1427

Install in OpenClaw

/install data-scraper

Description

Web page data collection and structured text extraction

README (SKILL.md)

data-scraper

Web Data Scraper — Extract structured data from web pages using curl + parsing. Lightweight, no browser required. Supports HTML-to-text, table extraction, price monitoring, and batch scraping.

When to Use

Extract text content from web pages (articles, blogs, docs)
Scrape product prices, reviews, or listings
Monitor pages for changes (price drops, new content)
Batch-collect data from multiple URLs
Convert HTML tables to structured formats (JSON/CSV)

Quick Start

# Extract readable text from URL
data-scraper fetch "https://example.com/article"

# Extract specific elements
data-scraper extract "https://example.com" --selector "h2, .price"

# Monitor for changes
data-scraper watch "https://example.com/product" --interval 3600

Extraction Modes

Text Mode (default)

Fetches the page and extracts its readable content, stripping HTML tags, scripts, and styles. Similar to a browser's reader mode.

data-scraper fetch URL
# Output: clean markdown text
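
As a rough illustration of what text mode involves, here is a minimal sketch of tag stripping with `tr` and `sed`, in the spirit of the curl + sed fallback that the Capability Assessment further down attributes to run.sh. The function name `strip_tags` is hypothetical and the code is not taken from run.sh. A regex pass like this only approximates a real HTML parser: it does not decode entities, produce markdown, or match uppercase tag names.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not taken from run.sh): HTML on stdin, plain text on stdout.
strip_tags() {
  local end=$'\001'  # placeholder byte that marks the end of a script/style block
  # Join lines first so multi-line <script>/<style> blocks can be removed in one pass.
  tr '\n' ' ' |
    sed -E \
      -e "s/<\/(script|style)[^>]*>/${end}/g" \
      -e "s/<(script|style)[^>]*>[^${end}]*${end}//g" \
      -e "s/${end}//g" \
      -e 's/<[^>]+>/ /g' \
      -e 's/[[:space:]]+/ /g' \
      -e 's/^ //; s/ $//'
}
```

It could be fed from a fetch such as `curl --fail --silent --location -- "$url" | strip_tags`, where `$url` is whatever page you are testing against.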

Selector Mode

Target specific CSS selectors for precise extraction.

data-scraper extract URL --selector ".product-title, .price, .rating"
# Output: matched elements as structured data

Table Mode

Extract HTML tables into structured formats.

data-scraper table URL --index 0
# Output: JSON array of row objects (header → value mapping)

Link Mode

Extract all links from a page with optional filtering.

data-scraper links URL --filter "*.pdf"
# Output: filtered list of absolute URLs
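
To make the filtering step concrete, the following sketch pulls double-quoted `href` values out of HTML and keeps those matching a shell glob such as `*.pdf`. The function name `extract_links` is hypothetical. Unlike the output described above, this sketch does not resolve relative links into absolute URLs, and it ignores single-quoted or unquoted attributes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: HTML on stdin, one matching href value per line on stdout.
extract_links() {
  local pattern=${1:-*} link
  grep -oE 'href="[^"]*"' |
    sed -E 's/^href="//; s/"$//' |
    while IFS= read -r link; do
      # An unquoted right-hand side makes [[ ]] treat $pattern as a glob.
      if [[ $link == $pattern ]]; then
        printf '%s\n' "$link"
      fi
    done
}
```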

Batch Scraping

# Scrape multiple URLs
data-scraper batch urls.txt --output results/

# With rate limiting
data-scraper batch urls.txt --delay 2000 --output results/

urls.txt format:

https://site1.com/page
https://site2.com/page
https://site3.com/page
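
A batch run over a file in this format can be sketched as a plain read loop with a pause between requests. This is an assumption about how such a loop might look, not the skill's implementation: `fetch_one`, `batch_fetch`, and the `page-N.txt` naming are hypothetical, and the delay is applied after every request rather than per domain. Fractional `sleep` values work with GNU and BSD `sleep` but are not guaranteed by POSIX.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a batch loop; none of these names come from run.sh.

# Download one URL to stdout. Redefine this function to swap in another fetcher.
fetch_one() {
  curl --fail --silent --show-error --location -- "$1"
}

# batch_fetch LIST OUTDIR [DELAY_MS]: fetch each URL in LIST into OUTDIR/page-N.txt,
# pausing DELAY_MS milliseconds after each request. Prints the number of URLs tried.
batch_fetch() {
  local list=$1 outdir=$2 delay_ms=${3:-1000}
  local n=0 url delay_s
  # Convert milliseconds to fractional seconds for sleep.
  delay_s=$(awk -v ms="$delay_ms" 'BEGIN { printf "%.3f", ms / 1000 }')
  mkdir -p -- "$outdir"
  while IFS= read -r url || [ -n "$url" ]; do
    [ -n "$url" ] || continue   # skip blank lines
    n=$((n + 1))
    # stdin is redirected so the fetcher cannot consume the URL list.
    if ! fetch_one "$url" < /dev/null > "$outdir/page-$n.txt"; then
      printf 'failed, skipping: %s\n' "$url" >&2
    fi
    sleep "$delay_s"
  done < "$list"
  printf '%d\n' "$n"
}
```

A call such as `batch_fetch urls.txt results/ 2000` would correspond to the `--delay 2000` example above.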

Change Monitoring

# Watch for changes, alert on diff
data-scraper watch URL --selector ".price" --interval 3600

# Compare with previous snapshot
data-scraper diff URL

Stores snapshots in data-scraper/snapshots/ with timestamps. Alerts via notification-hub when changes are detected.
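
The snapshot-and-compare idea can be sketched as follows. Everything beyond the `data-scraper/snapshots/` directory is an assumption: `check_snapshot`, the `latest` file, and the timestamp format are hypothetical, and the notification-hub alert is left out. `NAME` should be a safe directory name (for example `product-price`), not a raw URL.

```shell
#!/usr/bin/env bash
# Hypothetical snapshot helper; the layout inside the snapshot directory is assumed.
SNAP_DIR=${SNAP_DIR:-data-scraper/snapshots}

# check_snapshot NAME: read page content from stdin, compare it with the last
# stored copy for NAME, and print "first", "unchanged", or "changed".
check_snapshot() {
  local name=$1 dir new status stamp
  dir="$SNAP_DIR/$name"
  mkdir -p -- "$dir"
  new=$(mktemp)
  cat > "$new"
  if [ ! -f "$dir/latest" ]; then
    status=first
  elif cmp -s "$new" "$dir/latest"; then
    rm -f -- "$new"
    printf 'unchanged\n'
    return 0
  else
    status=changed
    # Show the difference on stderr; diff exits 1 when files differ, which is expected.
    diff -u "$dir/latest" "$new" >&2 || true
  fi
  stamp=$(date -u +%Y%m%dT%H%M%SZ)  # one-second resolution, enough for hourly checks
  cp -- "$new" "$dir/$stamp"
  mv -- "$new" "$dir/latest"
  printf '%s\n' "$status"
}
```

Keeping a `latest` copy next to the timestamped files makes the comparison a single `cmp` call instead of a directory sort.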

Output Formats

| Format | Flag | Use Case |
| --- | --- | --- |
| Text | `--format text` | Reading, summarization |
| JSON | `--format json` | Data processing |
| CSV | `--format csv` | Spreadsheets |
| Markdown | `--format md` | Documentation |

Headers & Auth

# Custom headers
data-scraper fetch URL --header "Authorization: Bearer TOKEN"

# Cookie-based auth
data-scraper fetch URL --cookie "session=abc123"

# User-Agent override
data-scraper fetch URL --ua "Mozilla/5.0..."

Rate Limiting & Ethics

Default: 1 request per second per domain
Respects robots.txt when the --polite flag is set
Configurable delay between requests
On 429 (Too Many Requests), pauses and retries with backoff

Error Handling

| Error | Behavior |
| --- | --- |
| 404 | Log and skip |
| 403/401 | Warn about auth requirement |
| 429 | Exponential backoff (max 3 retries) |
| Timeout | Retry once with longer timeout |
| SSL error | Warn, option to proceed with `--insecure` |
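
The 429 behavior can be sketched as a retry loop that doubles its wait each time. This is an illustration under stated assumptions, not the skill's code: `fetch_status` and `fetch_with_backoff` are hypothetical names, the 1-second starting delay is assumed, and the sketch does not read a `Retry-After` header.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the 429 handling; function names are illustrative.

# Print the HTTP status code for URL ($1) and save the body to OUT ($2).
fetch_status() {
  curl --silent --location --output "$2" --write-out '%{http_code}' -- "$1"
}

# fetch_with_backoff URL OUT [MAX_RETRIES] [FIRST_DELAY_S]
# Prints the final status code; returns 0 only for a 2xx response.
fetch_with_backoff() {
  local url=$1 out=$2 max_retries=${3:-3} delay=${4:-1}
  local attempt=0 status
  while :; do
    status=$(fetch_status "$url" "$out") || status=000  # 000: no HTTP response
    if [ "$status" != 429 ]; then
      printf '%s\n' "$status"
      case $status in
        2??) return 0 ;;
        *)   return 1 ;;
      esac
    fi
    if [ "$attempt" -ge "$max_retries" ]; then
      printf '429\n'
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))      # 1s, 2s, 4s with the defaults
    attempt=$((attempt + 1))
  done
}
```

With the defaults, a server that keeps answering 429 is contacted four times in total: the first request plus three retries.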

Integration

web-claude: Use as fallback when web_fetch isn't enough
competitor-watch: Feed scraped data into competitor analysis
seo-audit: Scrape competitor pages for SEO comparison
performance-tracker: Collect social metrics from public profiles

Usage Guidance

This skill's documentation promises a full-featured scraping tool, but the only runnable file is a minimal curl + HTML-strip script. It does not implement selectors, table parsing, batch jobs, monitoring, robots.txt handling, notification integration, or JSON/CSV output beyond a small event file. Before installing or using it:

(1) Treat it as a lightweight fetcher, not the advertised full scraper.
(2) Inspect and test run.sh in a safe sandbox to confirm its behavior.
(3) If you need selector, table, or monitoring features, ask the author to add them or look for a different skill that actually implements them.
(4) Be cautious about running it against sites where scraping is disallowed; the script does not enforce politeness or legal rules.
(5) Consider adding or verifying any required tools (lynx, jq) and safe output handling to avoid accidental data leakage.

Capability Analysis

Type: OpenClaw Skill
Name: data-scraper
Version: 1.0.0

The skill is designed for web scraping, but the `run.sh` script contains a critical shell injection vulnerability. The `$URL` variable is directly interpolated into a `curl` command without sanitization, allowing an attacker to execute arbitrary commands if they can control the URL input. This vulnerability is also demonstrated in the `GUIDE.md` implementation examples. While the skill's stated purpose and ethical guidelines suggest benign intent, this severe flaw makes it suspicious.
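
Since the listing does not show run.sh itself, the following is not a patch but a sketch of the kind of handling that avoids this class of bug. The names `is_http_url` and `safe_fetch` are hypothetical. The core protection is that the URL reaches `curl` as one quoted argument and never passes through `eval` or `sh -c`; the scheme check and `--proto` restriction are extra defense.

```shell
#!/usr/bin/env bash
# Hypothetical hardening sketch; this is not the contents of run.sh.

# Succeed only for http:// or https:// URLs without whitespace or control characters.
is_http_url() {
  case $1 in
    *[[:space:]]* | *[[:cntrl:]]*) return 1 ;;
    http://?* | https://?*) return 0 ;;
    *) return 1 ;;
  esac
}

safe_fetch() {
  local url=$1
  if ! is_http_url "$url"; then
    printf 'refusing to fetch: %s\n' "$url" >&2
    return 2
  fi
  # The URL is passed as a single quoted argument, so the shell never re-parses it.
  # "--" ends option parsing; --proto limits which schemes curl will use.
  curl --fail --silent --show-error --location \
    --proto '=http,https' --proto-redir '=http,https' -- "$url"
}
```

A URL such as `https://example.com/$(id)` still passes the check, but because it is never re-evaluated by a shell it is sent to `curl` literally rather than executed.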

Capability Assessment

⚠ Purpose & Capability

SKILL.md and GUIDE.md describe many features: selector mode, table extraction, batch scraping, watch/diff/monitoring, rate limiting, robots.txt respect, headers/cookies, JSON/CSV output, integrations/notification-hub. The only executable provided (run.sh) implements a minimal fetch: curl the URL, optionally run lynx or sed to strip tags, print to stdout, and write a small event file. There is no selector parsing, table mode, batch processing, monitoring loop, robots.txt handling, retries/backoff beyond curl failure handling, or integrations. The breadth of declared features is disproportionate to the actual code.

⚠ Instruction Scope

The SKILL.md/GUIDE.md instruct agents to do things (batch scraping, create snapshots, alert via notification-hub, use jq for JSON construction, respect --polite flag) that are not implemented by run.sh. The docs effectively give a to-do list of behaviors that would require additional binaries/tools (jq, lynx, selector-capable HTML parsers) and more complex logic; the runtime instructions are therefore ambiguous and could lead an agent to attempt operations that will fail or be implemented inconsistently by invoking ad-hoc shell pipelines.

✓ Install Mechanism

There is no install spec and no network downloads or packaged dependencies. The only included code is a simple shell script. This is low-risk from an install/execution distribution perspective (no external archives or installers).

✓ Credentials

The skill requests no environment variables, credentials, or config paths. The script uses WORKSPACE/EVENTS_DIR/MEMORY_DIR environment variables (with sane defaults) to write an event file; this is proportional to its stated behavior of producing a local event. No secrets or unrelated credentials are requested.

✓ Persistence & Privilege

The `always` setting is false and model invocation is allowed (the default). The script writes event files into a workspace events directory and otherwise prints to stdout; it does not modify other skills or system-wide config. No elevated persistence is requested.

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install data-scraper
After installation, invoke the skill by name or use /data-scraper
Provide required inputs per the skill's parameter spec and get structured output

Version History

v1.0.0

Initial release of data-scraper.

- Extract structured data from web pages using curl, no browser required.
- Supports HTML-to-text conversion, table extraction, price monitoring, and batch scraping.
- Multiple extraction modes: readable text, CSS selectors, tables, and link lists.
- Change monitoring with snapshots, diffing, and notifications.
- Flexible output formats: text, JSON, CSV, and Markdown.
- Customizable headers, cookies, rate limiting, and robots.txt respect with `--polite`.
- Integration with related skills for broader data workflows.

Metadata

Slug data-scraper

Version 1.0.0

License —

All-time Installs 8

Active Installs 8

Total Versions 1

Frequently Asked Questions

What is Data Scraper?

Web page data collection and structured text extraction. It is an AI Agent Skill for Claude Code / OpenClaw, with 1427 downloads so far.

How do I install Data Scraper?

Run "/install data-scraper" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Data Scraper free?

Yes, Data Scraper is completely free (open-source). You can download, install and use it at no cost.

Which platforms does Data Scraper support?

Data Scraper is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Data Scraper?

It is built and maintained by mupengi-bot (@mupengi-bot); the current version is v1.0.0.
