
Web Scraping Tool Selection Strategy

by wenbozhao279-code · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ✓ Security Clean
220 Downloads · 0 Stars · 0 Active Installs · 1 Version
Install in OpenClaw
/install web-scraping-tool-selection-strategy
Description
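How to choose the right web scraping tool for data collection. Use this skill when the user mentions web scraping, data collection, crawlers, automated testing, browser automation, website monitoring, competitor analysis, price monitoring, review scraping, social media data analysis, e-commerce data collection, scraping Xiaohongshu/Zhihu/JD.com/Taobao/1688, structured data extraction, anti-scraping bypass, browser reuse, API scraping, real-time data monitoring, and similar scenarios. Includes opencli...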
README (SKILL.md)

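Web Scraping Tool Selection Strategy

Build an efficient web data collection strategy that maximizes scraping success rate and data quality by choosing the right tool for each site.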

When to use this skill

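  • When you need to scrape data from different websites but are unsure which tool to use
  • When you are scraping complex websites whose anti-scraping mechanisms have to be worked around
  • When you need structured data output or fast API-level access
  • When you want to reuse a logged-in browser session to scrape private data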

Steps

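  1. Prefer opencli for platforms that have an adapter

    • For platforms with official adapters, such as Xiaohongshu, Zhihu, Weibo, and Bilibili, use opencli <platform> <action> --limit <number> -f json
    • Example: opencli xiaohongshu search "关键词" --limit 3 -f json (replace 关键词 with your search keyword)
    • Why: it returns structured JSON output, is fast and stable, and includes complete fields such as author, title, like count, and publish time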
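  2. Use playwright-cli as the fallback

    • For complex e-commerce sites such as JD.com, Taobao, 1688, Douyin, and Pinduoduo, use playwright-cli goto "<URL>"
    • Example: playwright-cli goto "https://item.jd.com/44541018110.html#comment"
    • Why: it can reuse a logged-in Chrome browser session, get past anti-scraping mechanisms, and handle dynamically loaded content and data that is visible only after login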
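  3. Choose the tool according to the platform

    • Social media platforms (Xiaohongshu/Zhihu/Weibo/Bilibili) → prefer opencli
    • E-commerce platforms (JD.com/Taobao/1688/Douyin/Pinduoduo) → use playwright-cli
    • Why: opencli has adapters optimized for specific platforms, while playwright-cli provides a general browser-level solution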
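  4. Verify tool connectivity and status

    • Test that the tools run correctly before starting a real scraping job
    • Check that the Chrome browser is connected properly
    • Why: this avoids connection failures during a demo or in production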
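The platform-to-tool rule in step 3 can be sketched as a small shell function. This is a minimal illustration, not part of the skill: `select_tool` is a hypothetical helper, and the platform labels are illustrative (only `xiaohongshu` and `zhihu` appear as opencli adapter names in this README). Sending unknown platforms to playwright-cli is one reading of its role as the fallback in step 2.

```shell
# Hypothetical helper: map a platform label to the recommended tool.
# Labels other than xiaohongshu and zhihu are assumptions made for illustration.
select_tool() {
  case "$1" in
    xiaohongshu|zhihu|weibo|bilibili)
      # Social media platforms with an opencli adapter
      echo "opencli" ;;
    jd|taobao|1688|douyin|pinduoduo)
      # Complex e-commerce sites
      echo "playwright-cli" ;;
    *)
      # Unknown platform: use the general browser-level fallback
      echo "playwright-cli" ;;
  esac
}

select_tool zhihu   # prints: opencli
select_tool 1688    # prints: playwright-cli
```

Keeping the mapping in one `case` statement makes it easy to extend when a new adapter becomes available.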

Pitfalls and solutions

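❌ Blindly using a single tool → cannot adapt to the different anti-scraping mechanisms and page structures of different sites → ✅ Choose the tool according to the platform
❌ Ignoring the logged-in browser session → misses data that is visible only after login and adds extra login verification steps → ✅ Prefer reusing a logged-in Chrome tab
❌ Not distinguishing API-level from browser-level scraping → low efficiency or inaccurate data → ✅ Use opencli for structured data and playwright-cli for complex pages
❌ No tool status check → unexpected failures during a demo → ✅ Run a minimal check before the demo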

Key code and configuration

# opencli: Xiaohongshu search example
opencli xiaohongshu search "宠物猫" --limit 3 -f json

# opencli: Zhihu hot list example
opencli zhihu hot --limit 5 -f json

# playwright-cli: JD.com review scraping example
playwright-cli goto "https://item.jd.com/44541018110.html#comment"

# playwright-cli: 1688 supply-chain search example
playwright-cli goto "https://s.1688.com/selloffer/offer_search.htm?keywords=静脉曲张袜"
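The opencli examples above all follow the pattern `opencli <platform> <action> [query] --limit <number> -f json`. As a hedged sketch, the hypothetical helper `build_opencli_cmd` below assembles that command line and prints it without running it, so the command can be reviewed first. It uses only the arguments and flags that appear in this README, and it does not escape double quotes inside the query.

```shell
# Hypothetical helper: print (do not run) an opencli command line.
# Usage: build_opencli_cmd <platform> <action> <limit> [query]
build_opencli_cmd() {
  platform=$1
  action=$2
  limit=$3
  query=${4:-}
  if [ -n "$query" ]; then
    # Actions that take a search keyword, e.g. "search"
    printf 'opencli %s %s "%s" --limit %s -f json\n' "$platform" "$action" "$query" "$limit"
  else
    # Actions without a keyword, e.g. "hot"
    printf 'opencli %s %s --limit %s -f json\n' "$platform" "$action" "$limit"
  fi
}

build_opencli_cmd zhihu hot 5
build_opencli_cmd xiaohongshu search 3 "宠物猫"
```

The two calls print the same Zhihu and Xiaohongshu command lines shown in the examples above.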

Environment and prerequisites

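  • opencli is installed and configured
  • playwright-cli is installed and configured
  • Chrome is installed and accessible to the tools
  • The network connection is stable and can reach the target sites
  • You are logged in to the target site accounts (so that playwright-cli can reuse the login session)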
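A minimal pre-flight check for the first two prerequisites might look like the sketch below. `check_tools` is a hypothetical helper, not the companion validator script; it only confirms that each command is on `PATH` and does not verify the Chrome connection or login state, which still need to be checked by hand.

```shell
# Hypothetical helper: report whether each named command is on PATH.
# Returns non-zero if at least one tool is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

check_tools opencli playwright-cli || echo "install the missing tools before scraping"
```

Running this before a demo gives a quick signal without touching any target site.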

Companion files

  • scripts/web_scraping_validator: tool connectivity validation script
  • references/platform_mapping_table: reference table mapping platforms to tools
Usage Guidance
This skill is a coherent, instruction-only guide for choosing opencli vs playwright-cli. Before using it:
  1. Verify you will manually install and review opencli/playwright-cli from official sources (don’t run unknown installers).
  2. Be cautious about reusing logged-in browser state: don’t give an agent access to your browser profile, cookies, or passwords unless you explicitly trust the environment; doing so can expose private account data.
  3. The SKILL.md mentions companion scripts that aren’t bundled here; inspect any such scripts before running.
  4. Ensure your scraping activities comply with target sites’ terms of service and applicable laws.
  5. Prefer manual review and least-privilege testing (use throwaway accounts or isolated browser profiles) when validating the recommended commands.
Capability Analysis
Type: OpenClaw Skill
Name: web-scraping-tool-selection-strategy
Version: 1.0.0
The skill bundle provides a legitimate strategy for web scraping using 'opencli' and 'playwright-cli' across various Chinese social media and e-commerce platforms (e.g., JD, Xiaohongshu, 1688). The instructions in SKILL.md and the reference guides focus on tool selection, data structure mapping, and browser state reuse to handle anti-scraping mechanisms. No evidence of data exfiltration, malicious execution, or harmful prompt injection was found.
Capability Assessment
Purpose & Capability
The skill's name and description match the instructions: it is a tool-selection strategy between opencli and playwright-cli. It does not request unrelated credentials or binaries. Minor inconsistency: SKILL.md references companion scripts/files (e.g., scripts/web_scraping_validator, references/platform_mapping_table) that are not present in the file manifest—this is a documentation/packaging omission but does not imply malicious behavior.
Instruction Scope
Instructions stay on-topic (how to choose and invoke opencli/playwright-cli). They explicitly recommend reusing logged-in Chrome browser state to access post-login data and to bypass anti-bot measures; while coherent for the stated purpose, this step can expose private account data if performed automatically or without care. The skill does not instruct the agent to read arbitrary system files or exfiltrate data to external endpoints, but following its guidance requires elevated access to a browser profile/session outside the skill's own control.
Install Mechanism
No install spec and no code files to execute — instruction-only skill. This minimizes surface area: nothing is downloaded or written by the skill itself.
Credentials
The skill declares no required env vars or credentials (proportional). However it implicitly depends on user-managed credentials/sessions (logged-in browser state and site accounts). That dependence is reasonable for the guidance given, but users should not hand over browser profiles, cookies, or credentials to untrusted agents.
Persistence & Privilege
The skill is not always-enabled and makes no requests to modify other skills or system configuration. Autonomous invocation is allowed by platform default but the skill does not request elevated persistent privileges.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install web-scraping-tool-selection-strategy
  3. After installation, invoke the skill by name or use /web-scraping-tool-selection-strategy
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
Initial release of web-scraping-tool-selection-strategy:
  • Provides step-by-step guidance for selecting between opencli and playwright-cli for web scraping based on platform and requirements.
  • Includes recommended usage scenarios, tool selection rules, and example commands for common platforms (e.g., Xiaohongshu, Zhihu, JD.com, Taobao, 1688).
  • Lists key pitfalls and solutions to improve scraping success and data quality.
  • Details essential environment prerequisites and companion files for tool validation and platform mapping.
Metadata
Slug web-scraping-tool-selection-strategy
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is Web Scraping Tool Selection Strategy?

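A guide to choosing the right web scraping tool for data collection. Use this skill when the user mentions web scraping, data collection, crawlers, automated testing, browser automation, website monitoring, competitor analysis, price monitoring, review scraping, social media data analysis, e-commerce data collection, scraping Xiaohongshu/Zhihu/JD.com/Taobao/1688, structured data extraction, anti-scraping bypass, browser reuse, API scraping, real-time data monitoring, and similar scenarios. Includes opencli... It is an AI Agent Skill for Claude Code / OpenClaw, with 220 downloads so far.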

How do I install Web Scraping Tool Selection Strategy?

Run "/install web-scraping-tool-selection-strategy" in the OpenClaw or Claude Code chat to install it in one step. The skill itself needs no extra setup, but the opencli and playwright-cli tools it describes must be installed separately.

Is Web Scraping Tool Selection Strategy free?

Yes, Web Scraping Tool Selection Strategy is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Web Scraping Tool Selection Strategy support?

Web Scraping Tool Selection Strategy is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Web Scraping Tool Selection Strategy?

It is built and maintained by wenbozhao279-code (@wenbozhao279-code); the current version is v1.0.0.
