Smart Scraper

Author: jlacroix82 · GitHub · v0.1.0 · MIT-0
cross-platform ⚠ suspicious
Total downloads: 0 · Favorites: 2 · Current installs: 0 · Versions: 1
Install in OpenClaw
/install smart-scraper-web
Description
Extract structured data from websites. Tables, lists, prices, articles, metadata. HTML parsing with caching. Zero external dependencies.
Usage (SKILL.md)

Smart Scraper 🕷️

Stop copying data by hand. Start extracting it automatically.

The Problem

Web content is everywhere but inaccessible to agents. web_fetch gets raw HTML, but you need structure — tables, prices, lists, article text — to make it useful.

Smart Scraper turns raw HTML into structured data with one command.

Quick Start

Extract everything from a page

node skills/smart-scraper/smart-scraper.js --extract https://example.com

Returns title, headings, paragraphs, links, tables, lists, prices, images, and metadata.

Extract tables only

node skills/smart-scraper/smart-scraper.js --extract --table https://example.com/pricing

Extract lists only

node skills/smart-scraper/smart-scraper.js --extract --list https://example.com/blog

Extract prices

node skills/smart-scraper/smart-scraper.js --extract --price https://example.com/products

Extract article content

node skills/smart-scraper/smart-scraper.js --extract --article https://example.com/blog/post

Parse raw HTML

node skills/smart-scraper/smart-scraper.js --parse "<html>...</html>"

Status overview

node skills/smart-scraper/smart-scraper.js --status

Features

HTML Parsing

  • Title extraction
  • Heading hierarchy (h1-h6)
  • Paragraph extraction (filters short fragments)
  • Link extraction with text
  • Image extraction with alt text
  • Metadata/meta tag extraction
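
As a rough illustration of the regex-based approach listed above, here is a minimal sketch that pulls the title, headings, and links out of an HTML string. The function names (`stripTags`, `extractBasics`), the output shape, and the specific length bounds are assumptions made for this example. They are not taken from Smart Scraper's source. The link pattern only handles double-quoted `href` attributes.

```javascript
// Illustrative sketch only; not Smart Scraper's actual implementation.
// All quantifiers are bounded so matching time stays predictable.

// Remove nested tags and collapse whitespace.
function stripTags(html) {
  return html.replace(/<[^>]{0,1000}>/g, "").replace(/\s+/g, " ").trim();
}

function extractBasics(html) {
  const titleMatch = html.match(/<title\b[^>]{0,200}>([\s\S]{0,1000}?)<\/title>/i);

  // The backreference \1 makes sure <h2> is closed by </h2>, not </h3>.
  const headingRe = /<h([1-6])\b[^>]{0,500}>([\s\S]{0,2000}?)<\/h\1>/gi;
  const headings = [...html.matchAll(headingRe)].map((m) => ({
    level: Number(m[1]),
    text: stripTags(m[2]),
  }));

  // Only double-quoted href values are handled in this sketch.
  const linkRe = /<a\b[^>]{0,500}?\bhref\s{0,5}=\s{0,5}"([^"]{0,2000})"[^>]{0,500}>([\s\S]{0,2000}?)<\/a>/gi;
  const links = [...html.matchAll(linkRe)].map((m) => ({
    href: m[1],
    text: stripTags(m[2]),
  }));

  return {
    title: titleMatch ? stripTags(titleMatch[1]) : null,
    headings,
    links,
  };
}
```

A regex pass like this is cheap and dependency-free, but it can misread malformed or deeply nested markup, which is the trade-off noted under Limitations.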

Table Extraction

  • Full table structure with rows and cells
  • Handles th and td elements
  • Strips nested HTML from cells
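
The table behavior described above (rows and cells, `th` and `td`, nested HTML stripped) could be sketched roughly as follows. `extractTables` and `stripTags` are hypothetical names, and the array-of-rows output format is an assumption. The real tool's output shape is not documented here.

```javascript
// Illustrative sketch only; not Smart Scraper's actual implementation.
const TABLE_RE = /<table\b[^>]{0,500}>([\s\S]{0,200000}?)<\/table>/gi;
const ROW_RE = /<tr\b[^>]{0,500}>([\s\S]{0,50000}?)<\/tr>/gi;
const CELL_RE = /<(th|td)\b[^>]{0,500}>([\s\S]{0,10000}?)<\/\1>/gi;

// Remove nested tags and collapse whitespace inside a cell.
function stripTags(html) {
  return html.replace(/<[^>]{0,1000}>/g, "").replace(/\s+/g, " ").trim();
}

// Returns one array per table; each table is an array of rows,
// and each row is an array of cell strings.
function extractTables(html) {
  const tables = [];
  for (const tableMatch of html.matchAll(TABLE_RE)) {
    const rows = [];
    for (const rowMatch of tableMatch[1].matchAll(ROW_RE)) {
      const cells = [...rowMatch[1].matchAll(CELL_RE)].map((c) => stripTags(c[2]));
      if (cells.length > 0) rows.push(cells);
    }
    tables.push(rows);
  }
  return tables;
}
```

This sketch does not handle tables nested inside other tables, since the lazy match stops at the first closing `</table>` tag.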

List Extraction

  • Both ordered and unordered lists
  • List item text extraction
  • Preserves list structure

Price Detection

  • Matches USD ($), EUR (€), GBP (£), JPY (¥) formats
  • Handles comma-separated thousands (e.g., $1,234.56)
  • Returns raw price strings
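
To make the price formats above concrete, here is a minimal sketch of a bounded pattern that matches a currency symbol followed by either comma-grouped thousands or plain digits, with optional decimals. The pattern and the name `extractPrices` are assumptions for illustration. The regex Smart Scraper actually ships may differ.

```javascript
// Illustrative sketch only; not Smart Scraper's actual pattern.
// Symbol, optional space, then comma-grouped thousands (1,234) or plain
// digits (1234), then up to two optional decimal places.
const PRICE_RE = /[$€£¥]\s?(?:\d{1,3}(?:,\d{3}){1,4}|\d{1,12})(?:\.\d{1,2})?/g;

// Returns the raw matched strings, in document order.
function extractPrices(text) {
  return text.match(PRICE_RE) ?? [];
}

console.log(extractPrices("Basic $9.99, Pro $1,234.56, EU €49 and ¥1,000."));
```

Because the matches are returned as raw strings, converting them to numbers (and deciding how to treat each currency) is left to the caller.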

Article Mode

  • Focuses on heading + paragraph structure
  • Shows first 5 paragraphs as preview
  • Ideal for blog posts and documentation

Caching

  • 5-minute TTL on fetched pages
  • LRU eviction: max 50 entries or 10MB
  • Reduces redundant network calls
  • Cache stats via --status
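
The caching rules above (a TTL plus LRU eviction under an entry limit and a byte limit) can be sketched in memory with a `Map`, which iterates keys in insertion order. The class name `PageCache`, its option names, and the choice to measure size as UTF-8 byte length are assumptions for this example. The file-backed storage in cache.json is not modeled.

```javascript
// Illustrative sketch only; not Smart Scraper's actual cache.
class PageCache {
  constructor({ ttlMs = 5 * 60 * 1000, maxEntries = 50, maxBytes = 10 * 1024 * 1024 } = {}) {
    this.ttlMs = ttlMs;
    this.maxEntries = maxEntries;
    this.maxBytes = maxBytes;
    this.entries = new Map(); // insertion order: least recently used first
    this.bytes = 0;
  }

  get(url, now = Date.now()) {
    const entry = this.entries.get(url);
    if (!entry) return undefined;
    if (now - entry.storedAt > this.ttlMs) {
      this.delete(url); // expired entries are dropped on access
      return undefined;
    }
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(url);
    this.entries.set(url, entry);
    return entry.body;
  }

  set(url, body, now = Date.now()) {
    this.delete(url);
    const size = Buffer.byteLength(body, "utf8");
    this.entries.set(url, { body, size, storedAt: now });
    this.bytes += size;
    // Evict least recently used entries until both limits hold.
    while (this.entries.size > this.maxEntries || this.bytes > this.maxBytes) {
      const oldestKey = this.entries.keys().next().value;
      this.delete(oldestKey);
    }
  }

  delete(url) {
    const entry = this.entries.get(url);
    if (entry) {
      this.bytes -= entry.size;
      this.entries.delete(url);
    }
  }
}
```

Passing `now` explicitly keeps the class easy to test. In normal use both `get` and `set` fall back to `Date.now()`.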

Configuration

Cache stored in: memory/scraper-cache/cache.json

Override data directory:

--dir /path/to/data

Security

  • URL validation — only http/https to public hosts; blocks file://, gopher://, data:, localhost, private IPs, cloud metadata endpoints
  • Redirect limit — max 5 redirects to prevent loops and SSRF
  • Rate limiting — 100ms minimum between requests
  • Bounded regex — all patterns have {0,N} limits to prevent ReDoS
  • Cache eviction — LRU with 50-entry / 10MB limits
  • No eval, no execSync, no command injection — pure parsing, no shell interaction
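
One way the URL validation and redirect limit above might be combined is sketched below. The helper names (`isBlockedIPv4`, `isAllowedUrl`, `nextRedirectTarget`) are hypothetical, and the blocked ranges are a minimal assumed set. The sketch applies the same check to every redirect hop, which is a design choice for this example and not a description of the shipped code. It inspects only the literal hostname, so a public name that resolves to a private address would still pass.

```javascript
// Illustrative sketch only; not Smart Scraper's actual validation.
const net = require("node:net");

// True for unspecified, loopback, private, and link-local IPv4 ranges.
// Link-local covers the 169.254.169.254 cloud metadata address.
function isBlockedIPv4(ip) {
  const [a, b] = ip.split(".").map(Number);
  return (
    a === 0 || a === 10 || a === 127 ||
    (a === 169 && b === 254) ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}

function isAllowedUrl(raw) {
  let url;
  try {
    url = new URL(raw);
  } catch {
    return false; // not a parseable absolute URL
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  // Strip IPv6 brackets and a trailing dot before comparing.
  const host = url.hostname.replace(/^\[|\]$/g, "").replace(/\.$/, "").toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost")) return false;
  if (net.isIP(host) === 4) return !isBlockedIPv4(host);
  if (net.isIP(host) === 6) return false; // conservative: reject IPv6 literals
  return true;
}

// Resolve a Location header against the current URL and re-run the same
// check, so every hop is validated and the hop count stays bounded.
function nextRedirectTarget(currentUrl, location, hops, maxHops = 5) {
  if (hops >= maxHops) throw new Error("too many redirects");
  const target = new URL(location, currentUrl).toString();
  if (!isAllowedUrl(target)) throw new Error("redirect target not allowed: " + target);
  return target;
}
```

A fetch loop would call `nextRedirectTarget` each time it receives a 3xx response, passing the number of hops followed so far.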

Agent Protocol

When extracting web content:

  1. Extract everything first: --extract <url> for a full overview
  2. Target specific data: --extract --table/list/price/article for focused extraction
  3. Parse raw HTML: --parse when you already have HTML from another tool
  4. Check cache: --status to monitor cache usage
  5. Combine with API Gateway: use API Gateway for authenticated or rate-limited sites

Limitations

  • Regex-based HTML parsing (not a full DOM parser)
  • No JavaScript execution (SPA content not supported)
  • Basic price detection (regex-based, not ML)
  • 15-second fetch timeout per page
  • Only http/https URLs to public hosts (no file://, localhost, private IPs, cloud metadata)
  • Max 5 redirects per request
  • Rate limited to 1 request per 100ms
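
The 100ms minimum gap between requests listed above could be enforced with a small reservation helper like the one below. `createRateLimiter` and `reserve` are hypothetical names for this sketch, and the approach (reserving the next free slot and returning the delay) is one assumed design among several.

```javascript
// Illustrative sketch only; not Smart Scraper's actual rate limiter.
function createRateLimiter(minGapMs = 100) {
  let nextAllowedAt = 0; // earliest time the next request may start

  // Reserves the next slot and returns how many ms the caller should wait.
  return function reserve(now = Date.now()) {
    const startAt = Math.max(now, nextAllowedAt);
    nextAllowedAt = startAt + minGapMs;
    return startAt - now;
  };
}
```

A caller would wait for the returned delay before sending, for example with `await new Promise((resolve) => setTimeout(resolve, reserve()))`.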

Comparison

Tool Structure Tables Prices Articles Caching
web_fetch Raw HTML
Puppeteer
Smart Scraper

Smart Scraper gives you structured extraction + caching with zero dependencies.

Design Principles

  1. Zero setup — Works immediately, no config needed
  2. No dependencies — Pure Node.js http/https, no npm packages
  3. Structured output — Returns parsed data, not raw HTML
  4. Cached — Reduces redundant fetches automatically
  5. Multi-mode — Extract everything or target specific data types
Safe Usage Advice
Install only if you are comfortable running a network scraper in your agent environment. Avoid sensitive, authenticated, internal, or attacker-controlled URLs until redirect targets are revalidated and cache behavior is clarified; clear the cache after use if page contents or URLs may be sensitive.
Capability Assessment
Purpose & Capability
The stated purpose is structured extraction from user-supplied web pages, and the code implements HTTP/HTTPS fetching, parsing, and local caching consistent with that purpose.
Instruction Scope
The instructions and audit claim public-host SSRF protections, but the code validates only the initial URL and follows redirect targets without re-validating them.
Install Mechanism
The artifact is a small Node.js script plus documentation and a static comparison page; there are no package dependencies, install hooks, or hidden setup steps.
Credentials
Network access is expected for a scraper, but redirect-based SSRF can make an apparently public URL cause requests to internal services in the agent's runtime environment.
Persistence & Privilege
The skill writes a local cache under memory/scraper-cache/cache.json, which is disclosed and bounded by entry count and TTL, though the advertised 10MB size limit is defined but not actually enforced in the code.
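How to Use
  1. Make sure OpenClaw is installed (locally or via Docker).
  2. Enter the install command in the chat box: /install smart-scraper-web
  3. Once installed, invoke the Skill by name or trigger it with /smart-scraper-web
  4. Provide the inputs described in the Skill's parameter documentation to get structured output.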
Version History
v0.1.0
- Initial release of smart-scraper: extract structured data from websites (tables, lists, prices, articles, metadata) with a single command.
- Supports flexible extraction modes: extract everything, or target tables, lists, prices, or article content.
- Provides caching with a 5-minute TTL, LRU eviction (max 50 entries or 10MB), and a status overview.
- Security features include URL validation, redirect and rate limits, bounded regex, and strict command isolation (no shell execution).
- No dependencies: runs on pure Node.js http/https, with in-memory and file-backed caching.
Metadata
Slug: smart-scraper-web
Version: 0.1.0
License: MIT-0
Total installs: 0
Current installs: 0
Versions: 1
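FAQ

What is Smart Scraper?

Extract structured data from websites. Tables, lists, prices, articles, metadata. HTML parsing with caching. Zero external dependencies. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 0 total downloads so far.

How do I install Smart Scraper?

Run the command "/install smart-scraper-web" in the OpenClaw or Claude Code chat box to install it in one step; no extra configuration is needed.

Is Smart Scraper free?

Yes. Smart Scraper is completely free and released under the MIT-0 license, so you can download, install, and use it freely.

Which platforms does Smart Scraper support?

Smart Scraper is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.

Who developed Smart Scraper?

It is developed and maintained by jlacroix82 (@jlacroix82); the current version is v0.1.0.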