
Smart Web Scraper (智能网页爬虫)

Name: Smart Web Scraper (智能网页爬虫)
Author: cjstate

by CJstate · GitHub ↗ · v1.0.0 · MIT-0

cross-platform ⚠ suspicious

Downloads: 109

Install in OpenClaw

/install xh-smart-scraper

Description

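Smart web data collector. Automatically recognizes page structure, batch-scrapes data from list, table, and detail pages, and supports export to JSON/CSV/Excel. Includes built-in adaptation to anti-scraping measures.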

Usage Guidance

This skill contains plausible scraper code (Puppeteer + Cheerio), and its npm install step pulls in Puppeteer, which downloads Chromium. However, the README and metadata overstate its capabilities: proxy pools, retries, database writes, and randomized anti-bot strategies are advertised but not implemented. Before installing or using it: (1) review scraper.js yourself or run it in a sandboxed environment; (2) avoid running npm install as root; Puppeteer/Chromium can require special flags when run as root, and the code already passes --no-sandbox, which disables Chromium's sandbox; (3) if you need proxy or DB features, expect to modify the code and add secure credential handling; (4) respect the legal and robots.txt constraints of the sites you scrape. If you want a fully featured scraper, request clarification or a version that actually implements the advertised features and documents how credentials and configuration are provided.

Capability Analysis

Type: OpenClaw Skill
Name: xh-smart-scraper
Version: 1.0.0

The skill is a standard web scraper implementation using Puppeteer and Cheerio. The code in scraper.js performs legitimate data extraction and file export (JSON/CSV/Excel) based on user-provided configurations, with no evidence of data exfiltration, malicious execution, or prompt injection in SKILL.md.

Capability Assessment

⚠ Purpose & Capability

The name and description promise auto-recognition, anti-bot adaptations, proxy pool support, automatic retries, and direct database storage. The code implements basic Puppeteer fetching, Cheerio parsing, simple file export, and random selection from a static User-Agent list. It does NOT implement proxy pool usage, DB storage, retry logic, or true randomized delays, even though these appear in the documentation. This is a mismatch between stated purpose and actual capability.
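
Since retries and randomized delays are advertised but missing, a reader adapting the code might start from something like the following. This is a minimal sketch in plain Node.js, not code taken from scraper.js; the names randomDelayMs, withRetry, defaultSleep, and all option fields are hypothetical.

```javascript
// Hypothetical helpers; nothing here comes from the skill's scraper.js.

// Default sleep: resolve after `ms` milliseconds.
function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Pick an integer delay in [minMs, maxMs) so requests are not evenly spaced.
// `rng` is injectable so the function can be tested deterministically.
function randomDelayMs(minMs, maxMs, rng = Math.random) {
  return Math.floor(minMs + rng() * (maxMs - minMs));
}

// Run `task` up to `maxAttempts` times, sleeping a random interval
// after each failed attempt. Rethrows the last error if all attempts fail.
async function withRetry(task, options = {}) {
  const {
    maxAttempts = 3,
    minDelayMs = 500,
    maxDelayMs = 2000,
    sleep = defaultSleep,
    rng = Math.random,
  } = options;

  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        await sleep(randomDelayMs(minDelayMs, maxDelayMs, rng));
      }
    }
  }
  throw lastError;
}
```

A page fetch could then be wrapped as withRetry(() => fetchPage(url)), where fetchPage stands in for whatever function performs the Puppeteer navigation. Injecting sleep and rng keeps the helper testable without real waiting.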

⚠ Instruction Scope

SKILL.md instructs npm install and running scraper.js (consistent). However the documentation advertises features (IP proxy pool, DB direct store, configurable randomized delays/retries) that the runtime instructions/code do not actually support. The runtime code reads a local config file and writes outputs to local files (JSON/CSV/Excel) only — it does not access external endpoints other than the target URLs, nor does it read environment variables or other system config.
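
For readers auditing the export path, the CSV step amounts to serializing an array of scraped records to text. The sketch below shows one conventional way to do that, quoting fields that contain commas, double quotes, or line breaks. It is an illustration under assumed names (escapeCsvField, toCsv) and an assumed record shape, not the skill's actual exporter.

```javascript
// Hypothetical helpers; the skill's real export code may differ.

// Quote a field only when it contains a comma, a double quote, or a line break.
// Embedded double quotes are doubled, as is conventional for CSV.
function escapeCsvField(value) {
  const text = value === null || value === undefined ? "" : String(value);
  if (/[",\r\n]/.test(text)) {
    return '"' + text.replace(/"/g, '""') + '"';
  }
  return text;
}

// Serialize an array of plain objects to CSV text.
// Columns are the union of all keys, in first-seen order;
// fields missing from a record are written as empty.
function toCsv(records) {
  const columns = [];
  for (const record of records) {
    for (const key of Object.keys(record)) {
      if (!columns.includes(key)) columns.push(key);
    }
  }
  const lines = [columns.map(escapeCsvField).join(",")];
  for (const record of records) {
    lines.push(columns.map((column) => escapeCsvField(record[column])).join(","));
  }
  return lines.join("\r\n");
}
```

Writing the returned string to disk (for example with fs.writeFileSync) is the only side effect such an exporter needs, which is consistent with the local-files-only behavior described above.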

ℹ Install Mechanism

No explicit install spec in registry (instruction-only), but package.json depends on puppeteer (which will download Chromium during npm install). This is expected for a scraper but increases install size and can pull large binaries. No external, untrusted download URLs; standard npm dependencies are used.

⚠ Credentials

Requires no environment variables or credentials in metadata, which matches the code. However the documentation claims proxy pool and DB direct-storage features that typically require credentials/config; those are not requested or implemented—this mismatch can mislead users about what secrets/config are needed and may result in attempts to add credentials later without clear handling in the code.
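
If proxy or database support is added later, one common way to keep secrets out of the config file is to read them from environment variables and redact them before logging. The following is a minimal sketch under that assumption; the variable names SCRAPER_PROXY_URL, SCRAPER_PROXY_USER, and SCRAPER_PROXY_PASS and the functions loadProxySettings and redactForLog are hypothetical, and the skill as published reads no environment variables at all.

```javascript
// Hypothetical names throughout; not part of the published skill.

// Read optional proxy settings from an environment-like object.
// Returns null when no proxy URL is configured, so the feature stays off by default.
function loadProxySettings(env) {
  const url = env.SCRAPER_PROXY_URL;
  if (!url) return null;

  const username = env.SCRAPER_PROXY_USER || "";
  const password = env.SCRAPER_PROXY_PASS || "";
  // Reject half-configured credentials instead of silently ignoring them.
  if ((username === "") !== (password === "")) {
    throw new Error("Set both SCRAPER_PROXY_USER and SCRAPER_PROXY_PASS, or neither");
  }
  return { url, username, password };
}

// Return a copy of the settings that is safe to print in logs.
function redactForLog(settings) {
  if (settings === null) return null;
  return { ...settings, password: settings.password ? "***" : "" };
}
```

In a real run the call would be loadProxySettings(process.env); passing the environment in as a parameter keeps the function easy to test and makes the dependency on secrets explicit.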

✓ Persistence & Privilege

Does not request persistent/always-on privilege. It is user-invocable and not set to always: true. The skill only runs when invoked and writes output files to disk, which is expected behavior for a CLI scraper.

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install xh-smart-scraper
After installation, invoke the skill by name or use /xh-smart-scraper
Provide required inputs per the skill's parameter spec and get structured output

Version History

v1.0.0

- Initial release of Smart Web Scraper (智能网页数据采集器)
- Features intelligent structure recognition for list, table, and detail pages
- Automatically extracts key fields such as titles, prices, and authors
- Supports anti-crawling strategies: User-Agent rotation, request delay, proxy pool (optional), and auto-retry
- Exports data in JSON, CSV, Excel, and supports direct database storage (MySQL/MongoDB)
- Provides command-line and config file usage with sample scenarios

Metadata

Slug xh-smart-scraper

Version 1.0.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is 智能网页爬虫?

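A smart web data collector. It automatically recognizes page structure, batch-scrapes data from list, table, and detail pages, and supports export to JSON/CSV/Excel, with built-in adaptation to anti-scraping measures. It is an AI Agent Skill for Claude Code / OpenClaw, with 109 downloads so far.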

How do I install 智能网页爬虫?

Run "/install xh-smart-scraper" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is 智能网页爬虫 free?

Yes, 智能网页爬虫 is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does 智能网页爬虫 support?

智能网页爬虫 is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created 智能网页爬虫?

It is built and maintained by CJstate (@cjstate); the current version is v1.0.0.
