Smart Scraper
/install smart-scraper-web
Smart Scraper 🕷️
Stop copying data by hand. Start extracting it automatically.
The Problem
Web content is everywhere but inaccessible to agents. web_fetch gets raw HTML, but you need structure — tables, prices, lists, article text — to make it useful.
Smart Scraper turns raw HTML into structured data with one command.
Quick Start
Extract everything from a page
node skills/smart-scraper/smart-scraper.js --extract https://example.com
Returns title, headings, paragraphs, links, tables, lists, prices, images, and metadata.
Extract tables only
node skills/smart-scraper/smart-scraper.js --extract --table https://example.com/pricing
Extract lists only
node skills/smart-scraper/smart-scraper.js --extract --list https://example.com/blog
Extract prices
node skills/smart-scraper/smart-scraper.js --extract --price https://example.com/products
Extract article content
node skills/smart-scraper/smart-scraper.js --extract --article https://example.com/blog/post
Parse raw HTML
node skills/smart-scraper/smart-scraper.js --parse "<html>...</html>"
Status overview
node skills/smart-scraper/smart-scraper.js --status
Features
HTML Parsing
- Title extraction
- Heading hierarchy (h1-h6)
- Paragraph extraction (filters short fragments)
- Link extraction with text
- Image extraction with alt text
- Metadata/meta tag extraction
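As a rough illustration of what regex-based parsing of this kind can look like, here is a minimal sketch covering the title, headings, paragraphs, and links. It is not the skill's actual code: the names `stripTags` and `parseBasics`, the 20-character paragraph threshold, and every `{0,N}` bound are assumptions chosen for the example.

```javascript
// Hypothetical sketch; names, bounds, and the paragraph threshold are assumptions.

// Remove tags and collapse whitespace in an HTML fragment.
function stripTags(html) {
  return html.replace(/<[^>]{0,500}>/g, " ").replace(/\s+/g, " ").trim();
}

function parseBasics(html) {
  const titleMatch = /<title(?:\s[^>]{0,200})?>([\s\S]{0,500}?)<\/title>/i.exec(html);

  // Headings h1-h6; the backreference \1 makes the closing tag match the opening level.
  const headings = [];
  const headingRe = /<h([1-6])(?:\s[^>]{0,200})?>([\s\S]{0,1000}?)<\/h\1>/gi;
  for (const m of html.matchAll(headingRe)) {
    headings.push({ level: Number(m[1]), text: stripTags(m[2]) });
  }

  // Paragraphs; fragments shorter than 20 characters are dropped.
  const paragraphs = [];
  const paraRe = /<p(?:\s[^>]{0,200})?>([\s\S]{0,5000}?)<\/p>/gi;
  for (const m of html.matchAll(paraRe)) {
    const text = stripTags(m[1]);
    if (text.length >= 20) paragraphs.push(text);
  }

  // Links with a double-quoted href, plus their visible text.
  const links = [];
  const linkRe = /<a\s[^>]{0,500}?href="([^"]{0,2000})"[^>]{0,500}>([\s\S]{0,1000}?)<\/a>/gi;
  for (const m of html.matchAll(linkRe)) {
    links.push({ href: m[1], text: stripTags(m[2]) });
  }

  return {
    title: titleMatch ? stripTags(titleMatch[1]) : null,
    headings,
    paragraphs,
    links,
  };
}
```

A pattern set like this misses single-quoted or unquoted attributes and malformed markup, which is the usual trade-off of avoiding a DOM parser.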
Table Extraction
- Full table structure with rows and cells
- Handles th and td elements
- Strips nested HTML from cells
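The table behaviour described above could be approximated as follows. This is a sketch under assumptions, not the skill's implementation: `extractTables` is a hypothetical name, the size bounds are arbitrary, and nested tables are not handled.

```javascript
// Hypothetical sketch; the function name and the {0,N} bounds are assumptions.
function extractTables(html) {
  // Strip nested tags from a cell and normalise whitespace.
  const clean = (s) => s.replace(/<[^>]{0,500}>/g, " ").replace(/\s+/g, " ").trim();

  const tableRe = /<table(?:\s[^>]{0,500})?>([\s\S]{0,200000}?)<\/table>/gi;
  const rowRe = /<tr(?:\s[^>]{0,500})?>([\s\S]{0,50000}?)<\/tr>/gi;
  const cellRe = /<(th|td)(?:\s[^>]{0,500})?>([\s\S]{0,10000}?)<\/\1>/gi;

  const tables = [];
  for (const t of html.matchAll(tableRe)) {
    const rows = [];
    for (const r of t[1].matchAll(rowRe)) {
      const cells = [];
      for (const c of r[1].matchAll(cellRe)) {
        cells.push(clean(c[2])); // th and td are treated the same way
      }
      if (cells.length > 0) rows.push(cells);
    }
    tables.push(rows);
  }
  return tables; // array of tables, each an array of rows, each an array of cell strings
}
```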
List Extraction
- Both ordered and unordered lists
- List item text extraction
- Preserves list structure
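List extraction can follow the same pattern. The sketch below is illustrative only: `extractLists` and the `ordered` flag in its output are assumed names, and nested lists are not handled.

```javascript
// Hypothetical sketch; names, output shape, and bounds are assumptions.
function extractLists(html) {
  const clean = (s) => s.replace(/<[^>]{0,500}>/g, " ").replace(/\s+/g, " ").trim();

  // \1 ties the closing tag to the opening one, so <ul> closes with </ul>.
  const listRe = /<(ul|ol)(?:\s[^>]{0,500})?>([\s\S]{0,100000}?)<\/\1>/gi;
  const itemRe = /<li(?:\s[^>]{0,500})?>([\s\S]{0,10000}?)<\/li>/gi;

  const lists = [];
  for (const l of html.matchAll(listRe)) {
    const items = [];
    for (const i of l[2].matchAll(itemRe)) {
      items.push(clean(i[1]));
    }
    lists.push({ ordered: l[1].toLowerCase() === "ol", items });
  }
  return lists;
}
```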
Price Detection
- Matches USD ($), EUR (€), GBP (£), JPY (¥) formats
- Handles comma-separated thousands (e.g., $1,234.56)
- Returns raw price strings
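A bounded pattern along these lines would cover the formats listed above. The exact regex the skill uses is not shown in this document, so `PRICE_RE` and `detectPrices` are assumptions for illustration.

```javascript
// Hypothetical sketch; the pattern and its bounds are assumptions, not the skill's regex.
// A currency symbol, then either comma-grouped thousands or plain digits,
// then an optional one- or two-digit decimal part.
const PRICE_RE = /[$€£¥]\s?(?:\d{1,3}(?:,\d{3}){1,6}|\d{1,12})(?:\.\d{1,2})?/g;

function detectPrices(text) {
  // match() with a global regex returns every match, or null when there are none.
  return text.match(PRICE_RE) ?? [];
}

console.log(detectPrices("Basic $9, Pro $1,234.56, EU €19.99, UK £5 and JP ¥1200."));
// → [ '$9', '$1,234.56', '€19.99', '£5', '¥1200' ]
```

Because the result is the raw matched string, any conversion to a number or currency code is left to the caller.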
Article Mode
- Focuses on heading + paragraph structure
- Shows first 5 paragraphs as preview
- Ideal for blog posts and documentation
Caching
- 5-minute TTL on fetched pages
- LRU eviction: max 50 entries or 10MB
- Reduces redundant network calls
- Cache stats via --status
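The caching policy above (TTL plus LRU eviction under an entry limit and a byte limit) can be sketched in memory as follows. The skill itself persists its cache to a JSON file, so this is a simplified illustration; `PageCache` and its option names are hypothetical.

```javascript
// Hypothetical sketch of a TTL + LRU cache; class and option names are assumptions.
class PageCache {
  constructor({
    ttlMs = 5 * 60 * 1000,
    maxEntries = 50,
    maxBytes = 10 * 1024 * 1024,
    now = Date.now, // injectable clock, useful for testing
  } = {}) {
    this.ttlMs = ttlMs;
    this.maxEntries = maxEntries;
    this.maxBytes = maxBytes;
    this.now = now;
    this.entries = new Map(); // Map insertion order doubles as recency order
    this.totalBytes = 0;
  }

  get(url) {
    const entry = this.entries.get(url);
    if (!entry) return undefined;
    if (this.now() - entry.storedAt > this.ttlMs) {
      this.delete(url); // expired
      return undefined;
    }
    // Re-insert so this entry becomes the most recently used.
    this.entries.delete(url);
    this.entries.set(url, entry);
    return entry.body;
  }

  set(url, body) {
    this.delete(url);
    const bytes = Buffer.byteLength(body, "utf8");
    this.entries.set(url, { body, bytes, storedAt: this.now() });
    this.totalBytes += bytes;
    // Evict least recently used entries until both limits hold.
    while (this.entries.size > this.maxEntries || this.totalBytes > this.maxBytes) {
      this.delete(this.entries.keys().next().value);
    }
  }

  delete(url) {
    const entry = this.entries.get(url);
    if (entry) {
      this.totalBytes -= entry.bytes;
      this.entries.delete(url);
    }
  }

  stats() {
    return { entries: this.entries.size, bytes: this.totalBytes };
  }
}
```

Note that in this sketch a single page larger than `maxBytes` is evicted immediately after being stored, so oversized pages are simply never cached.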
Configuration
Cache stored in: memory/scraper-cache/cache.json
Override data directory:
--dir /path/to/data
Security
- URL validation: only http/https to public hosts; blocks file://, gopher://, data:, localhost, private IPs, cloud metadata endpoints
- Redirect limit: max 5 redirects to prevent loops and SSRF
- Rate limiting: 100ms minimum between requests
- Bounded regex: all patterns have {0,N} limits to prevent ReDoS
- Cache eviction: LRU with 50-entry / 10MB limits
- No eval, no execSync, no command injection: pure parsing, no shell interaction
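To make the URL validation rule concrete, here is a minimal sketch of such a check. It is an assumption-laden illustration rather than the skill's code: `validateUrl` and `isPrivateIPv4` are hypothetical names, the blocked ranges are a common baseline, and the check looks only at the literal hostname (it does not resolve DNS, and it does not cover IPv4-mapped IPv6 addresses).

```javascript
// Hypothetical sketch; function names and the exact blocked ranges are assumptions.
const net = require("node:net");

function isPrivateIPv4(ip) {
  const [a, b] = ip.split(".").map(Number);
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||                          // loopback
    (a === 169 && b === 254) ||           // link-local, includes 169.254.169.254 metadata
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}

function validateUrl(input) {
  let url;
  try {
    url = new URL(input);
  } catch {
    return { ok: false, reason: "unparseable URL" };
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    return { ok: false, reason: "scheme not allowed" };
  }
  // URL.hostname keeps the brackets around IPv6 literals; strip them.
  const host = url.hostname.replace(/^\[|\]$/g, "").toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost")) {
    return { ok: false, reason: "localhost" };
  }
  if (net.isIPv4(host) && isPrivateIPv4(host)) {
    return { ok: false, reason: "private or reserved IPv4 address" };
  }
  // ::1 loopback, :: unspecified, fc00::/7 unique local, fe80::/10 link-local.
  if (net.isIPv6(host) && (host === "::1" || host === "::" || /^f[cd]/.test(host) || /^fe[89ab]/.test(host))) {
    return { ok: false, reason: "private or reserved IPv6 address" };
  }
  return { ok: true };
}
```

For the redirect limit to help against SSRF, a check like this would need to run again on every redirect target, not only on the initial URL.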
Agent Protocol
When extracting web content:
- Extract everything first: --extract <url> for a full overview
- Target specific data: --extract --table/list/price/article for focused extraction
- Parse raw HTML: --parse when you already have HTML from another tool
- Check cache: --status to monitor cache usage
- Combine with API Gateway: use API Gateway for authenticated or rate-limited sites
Limitations
- Regex-based HTML parsing (not a full DOM parser)
- No JavaScript execution (SPA content not supported)
- Basic price detection (regex-based, not ML)
- 15-second fetch timeout per page
- Only http/https URLs to public hosts (no file://, localhost, private IPs, cloud metadata)
- Max 5 redirects per request
- Rate limited to 1 request per 100ms
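The 100ms spacing between requests can be enforced with a small reservation-style limiter. The sketch below shows one way to do it and is not taken from the skill: `createRateLimiter` and `throttled` are hypothetical helpers.

```javascript
// Hypothetical sketch; helper names are assumptions.

// Returns a function that reserves the next send slot and reports
// how many milliseconds the caller should wait before using it.
function createRateLimiter(minIntervalMs = 100, now = Date.now) {
  let nextAllowed = 0;
  return function reserve() {
    const t = now();
    const startAt = Math.max(t, nextAllowed);
    nextAllowed = startAt + minIntervalMs;
    return startAt - t;
  };
}

// Run an async task once its reserved slot arrives.
async function throttled(reserve, task) {
  const waitMs = reserve();
  if (waitMs > 0) {
    await new Promise((resolve) => setTimeout(resolve, waitMs));
  }
  return task();
}
```

Reserving the slot before waiting means concurrent callers queue up 100ms apart instead of all firing together once the interval has elapsed.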
Comparison
| Tool | Structure | Tables | Prices | Articles | Caching |
|---|---|---|---|---|---|
| web_fetch | Raw HTML | ❌ | ❌ | ❌ | ❌ |
| Puppeteer | ✅ | ✅ | ✅ | ✅ | ❌ |
| Smart Scraper | ✅ | ✅ | ✅ | ✅ | ✅ |
Smart Scraper gives you structured extraction + caching with zero dependencies.
Design Principles
- Zero setup — Works immediately, no config needed
- No dependencies — Pure Node.js http/https, no npm packages
- Structured output — Returns parsed data, not raw HTML
- Cached — Reduces redundant fetches automatically
- Multi-mode — Extract everything or target specific data types
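Installation
- Make sure OpenClaw is installed (locally or via Docker)
- Enter the install command in the chat box: /install smart-scraper-web
- Once installation finishes, call the Skill by name or trigger it with /smart-scraper-web
- Provide the inputs described in the Skill's parameter documentation to get structured output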
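What is Smart Scraper?
Extract structured data from websites. Tables, lists, prices, articles, metadata. HTML parsing with caching. Zero external dependencies. It is an AI agent Skill plugin for Claude Code / OpenClaw, with 0 downloads to date.
How do I install Smart Scraper?
Run the command "/install smart-scraper-web" in the OpenClaw or Claude Code chat box for a one-step install; no additional configuration is needed.
Is Smart Scraper free?
Yes. Smart Scraper is completely free and released under the MIT-0 license, so it can be downloaded, installed, and used freely.
Which platforms does Smart Scraper support?
Smart Scraper is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.
Who develops Smart Scraper?
It is developed and maintained by jlacroix82 (@jlacroix82). The current version is v0.1.0.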