
Smart Scraper

Name: Smart Scraper
Author: jlacroix82

by jlacroix82 · GitHub ↗ · v0.1.0 · MIT-0

cross-platform ⚠ suspicious

Install in OpenClaw

/install smart-scraper-web

Description

Extract structured data from websites. Tables, lists, prices, articles, metadata. HTML parsing with caching. Zero external dependencies.

README (SKILL.md)

Smart Scraper 🕷️

Stop copying data by hand. Start extracting it automatically.

The Problem

Web content is everywhere but inaccessible to agents. web_fetch gets raw HTML, but you need structure — tables, prices, lists, article text — to make it useful.

Smart Scraper turns raw HTML into structured data with one command.

Quick Start

Extract everything from a page

node skills/smart-scraper/smart-scraper.js --extract https://example.com

Returns title, headings, paragraphs, links, tables, lists, prices, images, and metadata.

Extract tables only

node skills/smart-scraper/smart-scraper.js --extract --table https://example.com/pricing

Extract lists only

node skills/smart-scraper/smart-scraper.js --extract --list https://example.com/blog

Extract prices

node skills/smart-scraper/smart-scraper.js --extract --price https://example.com/products

Extract article content

node skills/smart-scraper/smart-scraper.js --extract --article https://example.com/blog/post

Parse raw HTML

node skills/smart-scraper/smart-scraper.js --parse "<html>...</html>"

Status overview

node skills/smart-scraper/smart-scraper.js --status

Features

HTML Parsing

Title extraction
Heading hierarchy (h1-h6)
Paragraph extraction (filters short fragments)
Link extraction with text
Image extraction with alt text
Metadata/meta tag extraction
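
To illustrate what regex-based title and heading extraction can look like, here is a minimal sketch. The function names (stripTags, extractTitle, extractHeadings) and the length bounds are assumptions made for illustration, not code taken from smart-scraper.js.

```javascript
// Hypothetical helpers; names and bounds are illustrative assumptions.

// Remove tags and collapse whitespace so only readable text remains.
function stripTags(html) {
  return html.replace(/<[^>]{0,2000}>/g, " ").replace(/\s+/g, " ").trim();
}

// Return the <title> text, or null when the page has none.
function extractTitle(html) {
  const match = /<title[^>]{0,200}>([\s\S]{0,500}?)<\/title>/i.exec(html);
  return match ? stripTags(match[1]) : null;
}

// Collect h1-h6 headings in document order, keeping each heading's level.
function extractHeadings(html) {
  const pattern = /<h([1-6])\b[^>]{0,500}>([\s\S]{0,2000}?)<\/h\1>/gi;
  const headings = [];
  let match;
  while ((match = pattern.exec(html)) !== null) {
    headings.push({ level: Number(match[1]), text: stripTags(match[2]) });
  }
  return headings;
}
```

Every quantifier is bounded, in line with the bounded-regex approach listed under Security.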

Table Extraction

Full table structure with rows and cells
Handles th and td elements
Strips nested HTML from cells
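
As a rough sketch of the table handling described above, the following walks tables, rows, and cells with bounded patterns and strips nested markup from each cell. The helper names and the size bounds are assumptions; the real script may structure this differently.

```javascript
// Hypothetical helpers; not taken from smart-scraper.js.

// Remove tags and collapse whitespace so only cell text remains.
function stripTags(html) {
  return html.replace(/<[^>]{0,2000}>/g, " ").replace(/\s+/g, " ").trim();
}

// Return every table as an array of rows, each row an array of cell strings.
function extractTables(html) {
  const tables = [];
  const tablePattern = /<table\b[^>]{0,500}>([\s\S]{0,200000}?)<\/table>/gi;
  let tableMatch;
  while ((tableMatch = tablePattern.exec(html)) !== null) {
    const rows = [];
    const rowPattern = /<tr\b[^>]{0,500}>([\s\S]{0,50000}?)<\/tr>/gi;
    let rowMatch;
    while ((rowMatch = rowPattern.exec(tableMatch[1])) !== null) {
      const cells = [];
      // Matches both <th> and <td>; \1 requires the same closing tag.
      const cellPattern = /<t([hd])\b[^>]{0,500}>([\s\S]{0,10000}?)<\/t\1>/gi;
      let cellMatch;
      while ((cellMatch = cellPattern.exec(rowMatch[1])) !== null) {
        cells.push(stripTags(cellMatch[2]));
      }
      rows.push(cells);
    }
    tables.push(rows);
  }
  return tables;
}
```

Because the matching is lazy and regex-based, a table nested inside another table would be cut at the first closing tag, which is one consequence of not using a full DOM parser.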

List Extraction

Both ordered and unordered lists
List item text extraction
Preserves list structure
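
One way list extraction along these lines could work is sketched below. The output shape ({ ordered, items }) and the helper names are assumptions chosen for illustration.

```javascript
// Hypothetical helpers; not taken from smart-scraper.js.

// Remove tags and collapse whitespace so only item text remains.
function stripTags(html) {
  return html.replace(/<[^>]{0,2000}>/g, " ").replace(/\s+/g, " ").trim();
}

// Return each <ul>/<ol> as { ordered, items }, in document order.
function extractLists(html) {
  const lists = [];
  const listPattern = /<(ul|ol)\b[^>]{0,500}>([\s\S]{0,100000}?)<\/\1>/gi;
  let listMatch;
  while ((listMatch = listPattern.exec(html)) !== null) {
    const items = [];
    const itemPattern = /<li\b[^>]{0,500}>([\s\S]{0,10000}?)<\/li>/gi;
    let itemMatch;
    while ((itemMatch = itemPattern.exec(listMatch[2])) !== null) {
      items.push(stripTags(itemMatch[1]));
    }
    lists.push({ ordered: listMatch[1].toLowerCase() === "ol", items });
  }
  return lists;
}
```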

Price Detection

Matches USD ($), EUR (€), GBP (£), JPY (¥) formats
Handles comma-separated thousands (e.g., $1,234.56)
Returns raw price strings
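
The pattern below is a minimal sketch of currency-symbol price matching with optional comma-separated thousands. The exact regex used by the skill is not shown in this listing, so the pattern and the function name findPrices are assumptions.

```javascript
// Hypothetical pattern: a currency symbol, then either comma-grouped digits
// or a plain digit run, then an optional 1-2 digit decimal part.
const PRICE_PATTERN = /[$€£¥]\s?(?:\d{1,3}(?:,\d{3}){1,6}|\d{1,12})(?:\.\d{1,2})?/g;

// Return the raw price strings found in a piece of text.
function findPrices(text) {
  return text.match(PRICE_PATTERN) || [];
}

console.log(findPrices("Basic $9.99, Pro $1,234.56, EU €99 and JP ¥5,000"));
```

The strings come back unparsed, which matches the "raw price strings" behavior; converting them to numbers would need locale-aware handling that this sketch leaves out.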

Article Mode

Focuses on heading + paragraph structure
Shows first 5 paragraphs as preview
Ideal for blog posts and documentation
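
A sketch of how an article preview could be assembled: collect paragraphs, drop short fragments, and keep the first five. The 40-character minimum is an assumed threshold (the listing does not state the real one), and the function names are hypothetical.

```javascript
// Hypothetical helpers; the minLength default is an assumption.

// Remove tags and collapse whitespace so only paragraph text remains.
function stripTags(html) {
  return html.replace(/<[^>]{0,2000}>/g, " ").replace(/\s+/g, " ").trim();
}

// Return the number of usable paragraphs and a preview of the first few.
function extractArticlePreview(html, { previewCount = 5, minLength = 40 } = {}) {
  const paragraphs = [];
  const pattern = /<p\b[^>]{0,500}>([\s\S]{0,20000}?)<\/p>/gi;
  let match;
  while ((match = pattern.exec(html)) !== null) {
    const text = stripTags(match[1]);
    if (text.length >= minLength) paragraphs.push(text); // skip short fragments
  }
  return { total: paragraphs.length, preview: paragraphs.slice(0, previewCount) };
}
```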

Caching

5-minute TTL on fetched pages
LRU eviction: max 50 entries or 10MB
Reduces redundant network calls
Cache stats via --status
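
The TTL and LRU behavior described above can be sketched with a Map, whose insertion order doubles as the recency order. This is an illustrative in-memory version under assumed names (PageCache, and an injectable now clock for testing); it does not reproduce the skill's file-backed cache.

```javascript
// Hypothetical in-memory cache: 5-minute TTL, LRU eviction by entry count.
class PageCache {
  constructor({ ttlMs = 5 * 60 * 1000, maxEntries = 50, now = Date.now } = {}) {
    this.ttlMs = ttlMs;
    this.maxEntries = maxEntries;
    this.now = now;
    this.entries = new Map(); // oldest first, most recently used last
  }

  // Return the cached body, or null when missing or older than the TTL.
  get(url) {
    const entry = this.entries.get(url);
    if (!entry) return null;
    if (this.now() - entry.storedAt > this.ttlMs) {
      this.entries.delete(url);
      return null;
    }
    // Re-insert so this entry becomes the most recently used.
    this.entries.delete(url);
    this.entries.set(url, entry);
    return entry.body;
  }

  // Store a body, then evict least recently used entries over the limit.
  set(url, body) {
    this.entries.delete(url);
    this.entries.set(url, { body, storedAt: this.now() });
    while (this.entries.size > this.maxEntries) {
      const oldestUrl = this.entries.keys().next().value;
      this.entries.delete(oldestUrl);
    }
  }
}
```

Reading an entry refreshes its recency but not its storedAt timestamp, so a frequently read page still expires five minutes after it was fetched.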

Configuration

Cache stored in: memory/scraper-cache/cache.json

Override data directory:

--dir /path/to/data

Security

URL validation — only http/https to public hosts; blocks file://, gopher://, data:, localhost, private IPs, cloud metadata endpoints
Redirect limit — max 5 redirects to prevent loops and SSRF
Rate limiting — 100ms minimum between requests
Bounded regex — all patterns have {0,N} limits to prevent ReDoS
Cache eviction — LRU with 50-entry / 10MB limits
No eval, no execSync, no command injection — pure parsing, no shell interaction
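
For readers who want a concrete picture of the URL validation item, here is a minimal sketch using Node's global URL class. The function names are hypothetical and the checks are not exhaustive: hostnames that resolve to private addresses through DNS, and IPv4-mapped IPv6 literals, are not covered here.

```javascript
// Hypothetical validator; illustrates the idea, not the skill's actual code.

// True for loopback, private (RFC 1918), and link-local IPv4 literals.
function isPrivateIPv4(host) {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 0 || a === 10 || a === 127 ||
    (a === 169 && b === 254) || // link-local, includes 169.254.169.254 metadata
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}

// Allow only http/https URLs whose host is not obviously internal.
function isAllowedUrl(rawUrl) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // not parseable as a URL
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  const host = url.hostname.toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost")) return false;
  if (host.startsWith("[")) {
    // IPv6 literal: block loopback, unspecified, unique-local, link-local.
    const v6 = host.slice(1, -1);
    if (v6 === "::1" || v6 === "::" || /^f[cd]/.test(v6) || /^fe[89ab]/.test(v6)) {
      return false;
    }
  }
  return !isPrivateIPv4(host);
}
```

A check like this only protects a request if it runs on every URL that is actually fetched, including each redirect target.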

Agent Protocol

When extracting web content:

Extract everything first — --extract <url> for a full overview
Target specific data — --extract --table/list/price/article for focused extraction
Parse raw HTML — --parse when you already have HTML from another tool
Check cache — --status to monitor cache usage
Combine with API Gateway — Use API Gateway for authenticated or rate-limited sites

Limitations

Regex-based HTML parsing (not a full DOM parser)
No JavaScript execution (SPA content not supported)
Basic price detection (regex-based, not ML)
15-second fetch timeout per page
Only http/https URLs to public hosts (no file://, localhost, private IPs, cloud metadata)
Max 5 redirects per request
Rate limited to 1 request per 100ms
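
The 100ms spacing between requests can be pictured with a small slot-reservation helper. This is a sketch under assumed names (createRateLimiter, reserveSlot) with an injectable clock; the skill's own mechanism is not shown in this listing.

```javascript
// Hypothetical limiter: each call reserves the next free slot and returns
// how many milliseconds the caller should wait before sending its request.
function createRateLimiter(minIntervalMs = 100, now = Date.now) {
  let nextAllowedAt = 0;
  return function reserveSlot() {
    const current = now();
    const startAt = Math.max(current, nextAllowedAt);
    nextAllowedAt = startAt + minIntervalMs;
    return startAt - current;
  };
}
```

A caller would typically wait with something like `await new Promise((resolve) => setTimeout(resolve, reserveSlot()))` before each fetch.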

Comparison

Tool	Structure	Tables	Prices	Articles	Caching
web_fetch	Raw HTML	❌	❌	❌	❌
Puppeteer	✅	✅	✅	✅	❌
Smart Scraper	✅	✅	✅	✅	✅

Smart Scraper gives you structured extraction + caching with zero dependencies.

Design Principles

Zero setup — Works immediately, no config needed
No dependencies — Pure Node.js http/https, no npm packages
Structured output — Returns parsed data, not raw HTML
Cached — Reduces redundant fetches automatically
Multi-mode — Extract everything or target specific data types

Usage Guidance

Install only if you are comfortable running a network scraper in your agent environment. Avoid sensitive, authenticated, internal, or attacker-controlled URLs until redirect targets are revalidated and cache behavior is clarified; clear the cache after use if page contents or URLs may be sensitive.

Capability Assessment

ℹ Purpose & Capability

The stated purpose is structured extraction from user-supplied web pages, and the code implements HTTP/HTTPS fetching, parsing, and local caching consistent with that purpose.

⚠ Instruction Scope

The instructions and audit claim public-host SSRF protections, but the code validates only the initial URL and follows redirect targets without re-validating them.
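
One possible way to close this gap is to run the same validation on every hop rather than only on the initial URL. The sketch below assumes two injected functions, fetchOnce (performs a single request without following redirects and resolves to { status, location, body }) and isAllowedUrl (the public-host check); both names and the response shape are hypothetical.

```javascript
// Hypothetical redirect loop that re-validates each target before fetching.
async function fetchWithValidatedRedirects(
  startUrl,
  { fetchOnce, isAllowedUrl, maxRedirects = 5 }
) {
  let current = startUrl;
  for (let hop = 0; hop <= maxRedirects; hop++) {
    // Validate the initial URL and every redirect target alike.
    if (!isAllowedUrl(current)) {
      throw new Error(`Blocked URL: ${current}`);
    }
    const response = await fetchOnce(current);
    const isRedirect =
      response.status >= 300 && response.status < 400 && response.location;
    if (!isRedirect) return { url: current, ...response };
    // Resolve relative Location headers against the current URL.
    current = new URL(response.location, current).href;
  }
  throw new Error(`Too many redirects (max ${maxRedirects})`);
}
```

With this shape, a public URL that redirects to an internal address is rejected before any request to that address is made.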

✓ Install Mechanism

The artifact is a small Node.js script plus documentation and a static comparison page; there are no package dependencies, install hooks, or hidden setup steps.

⚠ Credentials

Network access is expected for a scraper, but redirect-based SSRF can make an apparently public URL cause requests to internal services in the agent's runtime environment.

ℹ Persistence & Privilege

The skill writes a local cache under memory/scraper-cache/cache.json, which is disclosed and bounded by entry count and TTL, though the advertised 10MB size limit is defined but not actually enforced in the code.
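
To show what enforcing a byte budget might involve, the sketch below evicts the oldest entries until the cached bodies fit within a limit. It assumes the cache is a Map ordered from least to most recently used, with each value holding a body string; that layout and the function name are assumptions, not the skill's actual data structure.

```javascript
// Hypothetical size enforcement for a Map<url, { body }> ordered oldest first.
// Returns the URLs that were evicted.
function evictToByteBudget(entries, maxBytes = 10 * 1024 * 1024) {
  let totalBytes = 0;
  for (const { body } of entries.values()) {
    totalBytes += Buffer.byteLength(body, "utf8");
  }
  const evicted = [];
  for (const [url, { body }] of entries) {
    if (totalBytes <= maxBytes) break;
    totalBytes -= Buffer.byteLength(body, "utf8");
    entries.delete(url); // deleting during Map iteration is safe in JavaScript
    evicted.push(url);
  }
  return evicted;
}
```

Sizes are measured in UTF-8 bytes rather than string length, since pages with non-ASCII text take more space on disk than their character count suggests.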

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install smart-scraper-web
After installation, invoke the skill by name or use /smart-scraper-web
Provide required inputs per the skill's parameter spec and get structured output

Version History

v0.1.0

- Initial release of smart-scraper: extract structured data from websites (tables, lists, prices, articles, metadata) with a single command.
- Supports flexible extraction modes: extract everything, or target tables, lists, prices, or article content.
- Provides caching with 5-minute TTL, LRU eviction (max 50 entries or 10MB), and status overview.
- Security features include URL validation, redirect and rate limits, bounded regex, and strict command isolation (no shell execution).
- No dependencies: runs on pure Node.js http/https, with in-memory and file-backed caching.

Metadata

Slug smart-scraper-web

Version 0.1.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is Smart Scraper?

Smart Scraper is an AI Agent Skill for Claude Code / OpenClaw that extracts structured data from websites: tables, lists, prices, articles, and metadata. It parses HTML with caching and has zero external dependencies. It has 0 downloads so far.

How do I install Smart Scraper?

Run "/install smart-scraper-web" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Smart Scraper free?

Yes, Smart Scraper is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Smart Scraper support?

Smart Scraper is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Smart Scraper?

It is built and maintained by jlacroix82 (@jlacroix82); the current version is v0.1.0.
