Smart Scraper
/install smart-scraper-web
Stop copying data by hand. Start extracting it automatically.
The Problem
Web content is everywhere but inaccessible to agents. web_fetch gets raw HTML, but you need structure — tables, prices, lists, article text — to make it useful.
Smart Scraper turns raw HTML into structured data with one command.
Quick Start
Extract everything from a page
node skills/smart-scraper/smart-scraper.js --extract https://example.com
Returns title, headings, paragraphs, links, tables, lists, prices, images, and metadata.
Extract tables only
node skills/smart-scraper/smart-scraper.js --extract --table https://example.com/pricing
Extract lists only
node skills/smart-scraper/smart-scraper.js --extract --list https://example.com/blog
Extract prices
node skills/smart-scraper/smart-scraper.js --extract --price https://example.com/products
Extract article content
node skills/smart-scraper/smart-scraper.js --extract --article https://example.com/blog/post
Parse raw HTML
node skills/smart-scraper/smart-scraper.js --parse "<html>...</html>"
Status overview
node skills/smart-scraper/smart-scraper.js --status
Features
HTML Parsing
- Title extraction
- Heading hierarchy (h1-h6)
- Paragraph extraction (filters short fragments)
- Link extraction with text
- Image extraction with alt text
- Metadata/meta tag extraction
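To make the regex-based approach concrete, here is a minimal sketch covering a subset of the features above (title, headings, paragraphs, links). The function name parseBasics, the 20-character paragraph threshold, and the specific patterns are assumptions made for illustration; they are not taken from Smart Scraper's source.

```javascript
// Illustrative sketch only: parseBasics and its patterns are hypothetical,
// not Smart Scraper's actual implementation.
function parseBasics(html, minParagraphLength = 20) {
  // Remove nested tags and collapse whitespace inside extracted text.
  const stripTags = (s) =>
    s.replace(/<[^>]{0,500}>/g, "").replace(/\s+/g, " ").trim();

  const titleMatch = /<title(?:\s[^>]{0,200})?>([\s\S]{0,500}?)<\/title>/i.exec(html);
  const title = titleMatch ? stripTags(titleMatch[1]) : null;

  // h1-h6, keeping the level so the hierarchy can be rebuilt later.
  const headings = [];
  for (const m of html.matchAll(/<h([1-6])(?:\s[^>]{0,200})?>([\s\S]{0,1000}?)<\/h\1>/gi)) {
    headings.push({ level: Number(m[1]), text: stripTags(m[2]) });
  }

  // Paragraphs shorter than the (assumed) threshold are treated as fragments.
  const paragraphs = [];
  for (const m of html.matchAll(/<p(?:\s[^>]{0,200})?>([\s\S]{0,5000}?)<\/p>/gi)) {
    const text = stripTags(m[1]);
    if (text.length >= minParagraphLength) paragraphs.push(text);
  }

  // Links with double-quoted href attributes and their visible text.
  const links = [];
  for (const m of html.matchAll(/<a\s[^>]{0,500}?href="([^"]{0,2000})"[^>]{0,500}>([\s\S]{0,1000}?)<\/a>/gi)) {
    links.push({ href: m[1], text: stripTags(m[2]) });
  }

  return { title, headings, paragraphs, links };
}
```

Every quantifier in the sketch is bounded, which mirrors the ReDoS precaution described under Security. Single-quoted or unquoted href values are deliberately not handled.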
Table Extraction
- Full table structure with rows and cells
- Handles th and td elements
- Strips nested HTML from cells
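As a rough illustration of the behavior listed above, the following sketch pulls tables into arrays of rows and cells, treating th and td alike and stripping nested markup. The name extractTables and the size limits in the patterns are assumptions, and nested tables are not handled.

```javascript
// Illustrative sketch only: extractTables is a hypothetical helper,
// not Smart Scraper's actual implementation.
function extractTables(html) {
  const stripTags = (s) =>
    s.replace(/<[^>]{0,500}>/g, "").replace(/\s+/g, " ").trim();

  const tables = [];
  for (const t of html.matchAll(/<table(?:\s[^>]{0,500})?>([\s\S]{0,200000}?)<\/table>/gi)) {
    const rows = [];
    for (const r of t[1].matchAll(/<tr(?:\s[^>]{0,500})?>([\s\S]{0,20000}?)<\/tr>/gi)) {
      const cells = [];
      // Matches <th> and <td> but not <thead> or <tbody>.
      for (const c of r[1].matchAll(/<t([hd])(?:\s[^>]{0,500})?>([\s\S]{0,5000}?)<\/t\1>/gi)) {
        cells.push(stripTags(c[2]));
      }
      rows.push(cells);
    }
    tables.push(rows);
  }
  return tables; // tables[tableIndex][rowIndex][cellIndex]
}
```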
List Extraction
- Both ordered and unordered lists
- List item text extraction
- Preserves list structure
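A comparable sketch for lists, assuming a hypothetical extractLists helper that records whether each list is ordered and returns the item text in document order. Nested lists are not handled here.

```javascript
// Illustrative sketch only: extractLists is a hypothetical helper,
// not Smart Scraper's actual implementation.
function extractLists(html) {
  const stripTags = (s) =>
    s.replace(/<[^>]{0,500}>/g, "").replace(/\s+/g, " ").trim();

  const lists = [];
  for (const m of html.matchAll(/<(ul|ol)(?:\s[^>]{0,500})?>([\s\S]{0,100000}?)<\/\1>/gi)) {
    const items = [];
    // <li ...> only; the pattern does not match <link>.
    for (const li of m[2].matchAll(/<li(?:\s[^>]{0,500})?>([\s\S]{0,5000}?)<\/li>/gi)) {
      items.push(stripTags(li[1]));
    }
    lists.push({ ordered: m[1].toLowerCase() === "ol", items });
  }
  return lists;
}
```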
Price Detection
- Matches USD ($), EUR (€), GBP (£), JPY (¥) formats
- Handles comma-separated thousands (e.g., $1,234.56)
- Returns raw price strings
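One way to match the formats above is a single bounded pattern over the four currency symbols. The pattern below is an assumption written for illustration, not the tool's own regex, and it returns the raw matched strings without converting them to numbers.

```javascript
// Illustrative sketch only: PRICE_RE and findPrices are hypothetical.
// A currency symbol, then either comma-grouped thousands or a plain
// run of digits, then an optional one- or two-digit decimal part.
const PRICE_RE = /[$€£¥]\s?(?:\d{1,3}(?:,\d{3}){1,6}|\d{1,12})(?:\.\d{1,2})?/g;

function findPrices(text) {
  return text.match(PRICE_RE) ?? [];
}

console.log(findPrices("Was $1,234.56, now €999 or £12.50; ¥1500"));
// → [ '$1,234.56', '€999', '£12.50', '¥1500' ]
```

Prices written with a trailing symbol or a currency code (for example "999 EUR") would be missed by this pattern.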
Article Mode
- Focuses on heading + paragraph structure
- Shows first 5 paragraphs as preview
- Ideal for blog posts and documentation
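The preview behavior can be pictured as a small reduction over already-parsed content. The input shape ({ headings, paragraphs }) and the name articlePreview are assumptions for this sketch; the tool's real output format is not documented here.

```javascript
// Illustrative sketch only: the parsed shape and articlePreview are hypothetical.
function articlePreview(parsed, maxParagraphs = 5) {
  return {
    headings: parsed.headings,
    // Only the first few paragraphs are shown as a preview.
    preview: parsed.paragraphs.slice(0, maxParagraphs),
    // How many paragraphs were left out of the preview.
    remaining: Math.max(0, parsed.paragraphs.length - maxParagraphs),
  };
}
```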
Caching
- 5-minute TTL on fetched pages
- LRU eviction: max 50 entries or 10MB
- Reduces redundant network calls
- Cache stats via --status
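The interaction between the TTL and the two LRU limits can be sketched with an in-memory Map, whose insertion order doubles as recency order. The class name PageCache, its method names, and the injectable clock are assumptions; the real cache is persisted to cache.json, which this sketch does not attempt.

```javascript
// Illustrative sketch only: PageCache is hypothetical and in-memory,
// not Smart Scraper's actual cache implementation.
class PageCache {
  constructor({ ttlMs = 5 * 60 * 1000, maxEntries = 50,
                maxBytes = 10 * 1024 * 1024, now = Date.now } = {}) {
    this.ttlMs = ttlMs;
    this.maxEntries = maxEntries;
    this.maxBytes = maxBytes;
    this.now = now;       // injectable clock, useful for testing
    this.map = new Map(); // oldest (least recently used) key comes first
    this.bytes = 0;
  }

  get(url) {
    const entry = this.map.get(url);
    if (!entry) return undefined;
    if (this.now() - entry.storedAt > this.ttlMs) {
      this.delete(url); // expired
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    this.map.delete(url);
    this.map.set(url, entry);
    return entry.body;
  }

  set(url, body) {
    this.delete(url);
    const size = Buffer.byteLength(body, "utf8");
    this.map.set(url, { body, size, storedAt: this.now() });
    this.bytes += size;
    // Evict least recently used entries until both limits hold.
    while (this.map.size > this.maxEntries || this.bytes > this.maxBytes) {
      this.delete(this.map.keys().next().value);
    }
  }

  delete(url) {
    const entry = this.map.get(url);
    if (entry) {
      this.bytes -= entry.size;
      this.map.delete(url);
    }
  }

  stats() {
    return { entries: this.map.size, bytes: this.bytes };
  }
}
```

In this sketch a single body larger than maxBytes is evicted immediately after insertion, so oversized pages are simply never cached.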
Configuration
Cache stored in: memory/scraper-cache/cache.json
Override data directory:
--dir /path/to/data
Security
- URL validation — only http/https to public hosts; blocks file://, gopher://, data:, localhost, private IPs, cloud metadata endpoints
- Redirect limit — max 5 redirects to prevent loops and SSRF
- Rate limiting — 100ms minimum between requests
- Bounded regex — all patterns have {0,N} limits to prevent ReDoS
- Cache eviction — LRU with 50-entry / 10MB limits
- No eval, no execSync, no command injection — pure parsing, no shell interaction
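The URL validation rule could look roughly like the sketch below, built on Node's WHATWG URL parser and net.isIPv4 / net.isIPv6. The function names and the exact blocked ranges are assumptions, and the list is not exhaustive. In particular, the sketch checks only the literal hostname and does not resolve DNS, so a public-looking name that resolves to a private address would pass.

```javascript
// Illustrative sketch only: validateUrl and isPrivateIPv4 are hypothetical,
// not Smart Scraper's actual checks.
const net = require("node:net");

function isPrivateIPv4(ip) {
  const [a, b] = ip.split(".").map(Number);
  return (
    a === 0 || a === 10 || a === 127 ||   // "this" network, private, loopback
    (a === 169 && b === 254) ||           // link-local, incl. 169.254.169.254 metadata
    (a === 172 && b >= 16 && b <= 31) ||  // private
    (a === 192 && b === 168)              // private
  );
}

function validateUrl(input) {
  let url;
  try {
    url = new URL(input);
  } catch {
    return { ok: false, reason: "unparseable URL" };
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    return { ok: false, reason: "scheme not allowed" };
  }
  // URL.hostname keeps the brackets around IPv6 literals; strip them.
  const host = url.hostname.replace(/^\[|\]$/g, "").toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost")) {
    return { ok: false, reason: "localhost" };
  }
  if (net.isIPv4(host) && isPrivateIPv4(host)) {
    return { ok: false, reason: "private IPv4 address" };
  }
  // Loopback, unique-local (fc00::/7), and link-local (fe80::/10) IPv6.
  if (net.isIPv6(host) && (host === "::1" || /^f[cd]/.test(host) || /^fe[89ab]/.test(host))) {
    return { ok: false, reason: "private IPv6 address" };
  }
  return { ok: true };
}
```

A redirect-following fetcher would need to run the same check on every hop, since a public URL can redirect to an internal one.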
Agent Protocol
When extracting web content:
- Extract everything first — --extract <url> for a full overview
- Target specific data — --extract with --table, --list, --price, or --article for focused extraction
- Parse raw HTML — --parse when you already have HTML from another tool
- Check cache — --status to monitor cache usage
- Combine with API Gateway — Use API Gateway for authenticated or rate-limited sites
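For an agent driving the CLI from Node, the protocol above maps onto a small argument builder. The script path and flags come from the Quick Start commands; the helper names buildArgs and runScraper and the options object are assumptions. Because the output format is not specified here, the sketch returns stdout untouched.

```javascript
// Illustrative sketch only: buildArgs and runScraper are hypothetical wrappers
// around the documented CLI, not part of Smart Scraper itself.
const { execFile } = require("node:child_process");

const SCRIPT = "skills/smart-scraper/smart-scraper.js";
const FOCUS_FLAGS = new Set(["table", "list", "price", "article"]);

function buildArgs({ url, focus, html, status } = {}) {
  if (status) return [SCRIPT, "--status"];
  if (html !== undefined) return [SCRIPT, "--parse", html];
  if (!url) throw new Error("url, html, or status is required");
  const args = [SCRIPT, "--extract"];
  if (focus !== undefined) {
    if (!FOCUS_FLAGS.has(focus)) throw new Error(`unknown focus: ${focus}`);
    args.push(`--${focus}`);
  }
  args.push(url);
  return args;
}

// execFile passes arguments directly to the process, with no shell involved.
function runScraper(options) {
  return new Promise((resolve, reject) => {
    execFile("node", buildArgs(options), (err, stdout) =>
      err ? reject(err) : resolve(stdout)
    );
  });
}
```

A call such as runScraper({ url: "https://example.com/pricing", focus: "table" }) would run the same command shown under "Extract tables only".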
Limitations
- Regex-based HTML parsing (not a full DOM parser)
- No JavaScript execution (SPA content not supported)
- Basic price detection (regex-based, not ML)
- 15-second fetch timeout per page
- Only http/https URLs to public hosts (no file://, localhost, private IPs, cloud metadata)
- Max 5 redirects per request
- Rate limited to 1 request per 100ms
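The 100ms spacing between requests can be sketched as a limiter that hands each caller a wait time. The name createRateLimiter, its reserve/waitTurn split, and the injectable clock are assumptions made so the logic can be checked without real timers.

```javascript
// Illustrative sketch only: createRateLimiter is hypothetical,
// not Smart Scraper's actual rate limiter.
function createRateLimiter(minGapMs = 100, now = Date.now) {
  let nextAllowed = 0; // earliest timestamp at which the next request may start

  // Returns how many milliseconds the caller must wait before its request.
  // The slot is reserved synchronously, so concurrent callers queue up in order.
  function reserve() {
    const current = now();
    const wait = Math.max(0, nextAllowed - current);
    nextAllowed = current + wait + minGapMs;
    return wait;
  }

  // Convenience wrapper: resolves once the caller's reserved slot has arrived.
  async function waitTurn() {
    const wait = reserve();
    if (wait > 0) await new Promise((resolve) => setTimeout(resolve, wait));
  }

  return { reserve, waitTurn };
}
```

A fetch loop would call await limiter.waitTurn() before each request; three back-to-back calls would be spaced 0, 100, and 200 ms from the first.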
Comparison
| Tool | Structure | Tables | Prices | Articles | Caching |
|---|---|---|---|---|---|
| web_fetch | Raw HTML | ❌ | ❌ | ❌ | ❌ |
| Puppeteer | ✅ | ✅ | ✅ | ✅ | ❌ |
| Smart Scraper | ✅ | ✅ | ✅ | ✅ | ✅ |
Smart Scraper gives you structured extraction + caching with zero dependencies.
Design Principles
- Zero setup — Works immediately, no config needed
- No dependencies — Pure Node.js http/https, no npm packages
- Structured output — Returns parsed data, not raw HTML
- Cached — Reduces redundant fetches automatically
- Multi-mode — Extract everything or target specific data types
Installation
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat: /install smart-scraper-web
- After installation, invoke the skill by name or use /smart-scraper-web
- Provide required inputs per the skill's parameter spec and get structured output
What is Smart Scraper?
Smart Scraper extracts structured data from websites: tables, lists, prices, articles, and metadata. It parses HTML with built-in caching and has zero external dependencies. It is an AI Agent Skill for Claude Code / OpenClaw, with 0 downloads so far.
How do I install Smart Scraper?
Run "/install smart-scraper-web" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.
Is Smart Scraper free?
Yes, Smart Scraper is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does Smart Scraper support?
Smart Scraper is cross-platform and runs anywhere OpenClaw / Claude Code is available.
Who created Smart Scraper?
It is built and maintained by jlacroix82 (@jlacroix82); the current version is v0.1.0.