cjstate

Smart Web Scraper (智能网页爬虫)

by CJstate · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ⚠ suspicious
Downloads: 109 · Stars: 0 · Active Installs: 0 · Versions: 1
Install in OpenClaw
/install xh-smart-scraper
Description
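Smart web data collector. Automatically recognizes page structure, batch-scrapes data from list, table, and detail pages, and supports export to JSON/CSV/Excel. Includes built-in adaptation to anti-scraping measures.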
Usage Guidance
This skill contains plausible scraper code (Puppeteer + Cheerio), and its npm install step pulls in Puppeteer, which downloads Chromium. However, the README and metadata overstate its capabilities: proxy pools, retries, database writes, and randomized anti-bot strategies are advertised but not implemented. Before installing or using it:
  1. Review scraper.js yourself, or run it in a sandboxed environment.
  2. Avoid running npm install or the scraper as root; the code launches Chromium with --no-sandbox, which turns off the browser sandbox.
  3. If you need proxy or database features, expect to modify the code and add secure credential handling.
  4. Respect legal and robots.txt constraints for your scraping targets.
If you want a fully featured scraper, ask the author for clarification or for a version that actually implements the advertised features and documents how credentials and configuration are provided.
Capability Analysis
Type: OpenClaw Skill
Name: xh-smart-scraper
Version: 1.0.0
The skill is a standard web scraper implementation using Puppeteer and Cheerio. The code in scraper.js performs legitimate data extraction and file export (JSON/CSV/Excel) based on user-provided configurations, with no evidence of data exfiltration, malicious execution, or prompt injection in SKILL.md.
Capability Assessment
Purpose & Capability
Name/description promise: auto-recognition, anti-bot adaptations, proxy pool support, automatic retries, and database direct storage. The code implements basic Puppeteer fetching, Cheerio parsing, simple file export, and a static random User-Agent list. It does NOT implement proxy pool usage, DB storage, retry logic, or true randomized delays despite these appearing in the documentation—this is a mismatch between stated purpose and actual capability.
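For readers weighing how much work the missing pieces would take, here is a minimal sketch of retry logic with a randomized delay between attempts. It is not taken from scraper.js; the names `randomDelay` and `withRetry`, the option names, and the default values are all hypothetical.

```javascript
// Hypothetical helpers for illustration; none of these names exist in scraper.js.

// Resolve after a random wait between minMs and maxMs milliseconds.
function randomDelay(minMs, maxMs) {
  const ms = minMs + Math.random() * (maxMs - minMs);
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Call task(attemptNumber) up to `attempts` times.
// A randomized pause is inserted between failed attempts.
// If every attempt fails, the last error is rethrown.
async function withRetry(task, { attempts = 3, minMs = 500, maxMs = 1500 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < attempts) {
        await randomDelay(minMs, maxMs);
      }
    }
  }
  throw lastError;
}
```

With Puppeteer, a page load could then be wrapped as `await withRetry(() => page.goto(url))`, assuming `page` and `url` are already defined by the caller.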
Instruction Scope
SKILL.md instructs npm install and running scraper.js (consistent). However the documentation advertises features (IP proxy pool, DB direct store, configurable randomized delays/retries) that the runtime instructions/code do not actually support. The runtime code reads a local config file and writes outputs to local files (JSON/CSV/Excel) only — it does not access external endpoints other than the target URLs, nor does it read environment variables or other system config.
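The local-file export described above can be pictured with a small sketch of the CSV case. This is an assumption about the general shape of such an export, not the code in scraper.js; the function name `toCsv` and the sample field names are hypothetical.

```javascript
// Hypothetical helper: turn an array of flat records into CSV text.
function toCsv(rows) {
  if (rows.length === 0) return "";

  // Column order: every key seen across all rows, in first-seen order.
  const headers = [...new Set(rows.flatMap((row) => Object.keys(row)))];

  // Quote any field containing a comma, a double quote, or a line break,
  // and double embedded quotes (RFC 4180 style). Missing values become "".
  const escapeField = (value) => {
    const text = value === null || value === undefined ? "" : String(value);
    return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
  };

  const lines = [headers, ...rows.map((row) => headers.map((h) => row[h]))];
  return lines.map((line) => line.map(escapeField).join(",")).join("\r\n");
}

// Example records with fields like those the listing mentions (title, price, author).
const sample = [
  { title: "Widget, large", price: 10 },
  { title: 'The "best" book', author: "X" },
];
console.log(toCsv(sample));
```

The resulting string could be written to disk with `fs.writeFileSync`; keeping the conversion as a pure function makes it easy to check without touching the file system.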
Install Mechanism
No explicit install spec in registry (instruction-only), but package.json depends on puppeteer (which will download Chromium during npm install). This is expected for a scraper but increases install size and can pull large binaries. No external, untrusted download URLs; standard npm dependencies are used.
Credentials
Requires no environment variables or credentials in metadata, which matches the code. However the documentation claims proxy pool and DB direct-storage features that typically require credentials/config; those are not requested or implemented—this mismatch can mislead users about what secrets/config are needed and may result in attempts to add credentials later without clear handling in the code.
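If proxy or database support were added later, one common approach is to read the secrets from environment variables and mask them before logging. The sketch below assumes that approach; the variable names `SCRAPER_PROXY_URL` and `SCRAPER_DB_URL` and both function names are hypothetical and are not read by the current code.

```javascript
// Hypothetical: read optional proxy and database URLs from the environment.
// The variable names are illustrative; the skill does not define them.
function loadSecrets(env = process.env) {
  const config = { proxyUrl: null, dbUrl: null };
  if (env.SCRAPER_PROXY_URL) {
    // new URL() throws on malformed input, so bad values fail early.
    config.proxyUrl = new URL(env.SCRAPER_PROXY_URL).toString();
  }
  if (env.SCRAPER_DB_URL) {
    config.dbUrl = new URL(env.SCRAPER_DB_URL).toString();
  }
  return config;
}

// Mask the username and password of a URL so it can be logged safely.
function redact(urlString) {
  const url = new URL(urlString);
  if (url.username) url.username = "***";
  if (url.password) url.password = "***";
  return url.toString();
}
```

Passing `env` as a parameter keeps the function testable and makes explicit which settings the tool expects, which is the documentation gap noted above.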
Persistence & Privilege
Does not request persistent/always-on privilege. It is user-invocable and not set to always: true. The skill only runs when invoked and writes output files to disk, which is expected behavior for a CLI scraper.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install xh-smart-scraper
  3. After installation, invoke the skill by name or use /xh-smart-scraper
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
- Initial release of Smart Web Scraper (智能网页数据采集器)
- Features intelligent structure recognition for list, table, and detail pages
- Automatically extracts key fields such as titles, prices, and authors
- Supports anti-crawling strategies: User-Agent rotation, request delay, proxy pool (optional), and auto-retry
- Exports data in JSON, CSV, Excel, and supports direct database storage (MySQL/MongoDB)
- Provides command-line and config file usage with sample scenarios
Metadata
Slug xh-smart-scraper
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is Smart Web Scraper (智能网页爬虫)?

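A smart web data collector. It automatically recognizes page structure, batch-scrapes data from list, table, and detail pages, and supports export to JSON/CSV/Excel, with built-in adaptation to anti-scraping measures. It is an AI Agent Skill for Claude Code / OpenClaw, with 109 downloads so far.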

How do I install Smart Web Scraper?

Run "/install xh-smart-scraper" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Smart Web Scraper free?

Yes, Smart Web Scraper is completely free, licensed under MIT-0. You can download, install, and use it at no cost.

Which platforms does Smart Web Scraper support?

Smart Web Scraper is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Smart Web Scraper?

It is built and maintained by CJstate (@cjstate); the current version is v1.0.0.
