
102 Playwright Scraper Skill

Author: smallKeyboy · v1.0.0 · MIT-0
cross-platform ✓ Security check passed
Total downloads: 103 · Favorites: 0 · Current installs: 0 · Versions: 1
Install in OpenClaw
/install 102-playwright-scraper-skill
Description
Playwright-based web scraping OpenClaw Skill with anti-bot protection. Successfully tested on complex sites like Discuss.com.hk.
Usage Instructions (SKILL.md)

Playwright Scraper Skill

A Playwright-based web scraping OpenClaw Skill with anti-bot protection. Choose the best approach based on the target website's anti-bot level.


🎯 Use Case Matrix

| Target Website | Anti-Bot Level | Recommended Method | Script |
|---|---|---|---|
| Regular Sites | Low | web_fetch tool | N/A (built-in) |
| Dynamic Sites | Medium | Playwright Simple | scripts/playwright-simple.js |
| Cloudflare Protected | High | Playwright Stealth | scripts/playwright-stealth.js |
| YouTube | Special | deep-scraper | Install separately |
| Reddit | Special | reddit-scraper | Install separately |

📦 Installation

cd playwright-scraper-skill
npm install
npx playwright install chromium

🚀 Quick Start

1️⃣ Simple Sites (No Anti-Bot)

Use OpenClaw's built-in web_fetch tool:

# Invoke directly in OpenClaw
Hey, fetch me the content from https://example.com

2️⃣ Dynamic Sites (Requires JavaScript)

Use Playwright Simple:

node scripts/playwright-simple.js "https://example.com"

Example output:

{
  "url": "https://example.com",
  "title": "Example Domain",
  "content": "...",
  "elapsedSeconds": "3.45"
}
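
For orientation, a script that produces output of this shape could look roughly like the sketch below. It is not the bundled scripts/playwright-simple.js: the helper names (buildResult, scrapeSimple), the 'networkidle' wait, the 3000 ms default, and the use of document.body.innerText are all assumptions made for illustration.

```javascript
// Hypothetical sketch; the real scripts/playwright-simple.js may differ.

// Build the JSON result in the shape shown above (pure function, easy to test).
function buildResult(url, title, content, startedAtMs, finishedAtMs) {
  return {
    url,
    title,
    content,
    // Two-decimal string, matching the "3.45" style in the example output.
    elapsedSeconds: ((finishedAtMs - startedAtMs) / 1000).toFixed(2),
  };
}

// Load a page with plain Playwright and return the result object.
async function scrapeSimple(url, waitMs = 3000) {
  // Required lazily so buildResult can be used without Playwright installed.
  const { chromium } = require('playwright');
  const startedAtMs = Date.now();
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle' });
    await page.waitForTimeout(waitMs); // extra time for late JavaScript rendering
    const title = await page.title();
    const content = await page.evaluate(() => document.body.innerText);
    return buildResult(url, title, content, startedAtMs, Date.now());
  } finally {
    // Always release the browser, even when navigation fails.
    await browser.close();
  }
}
```

Calling scrapeSimple("https://example.com") and printing the result with JSON.stringify(result, null, 2) would give output in the format above.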

3️⃣ Anti-Bot Protected Sites (Cloudflare etc.)

Use Playwright Stealth:

node scripts/playwright-stealth.js "https://m.discuss.com.hk/#hot"

Features:

  • Hide automation markers (navigator.webdriver = false)
  • Realistic User-Agent (iPhone, Android)
  • Random delays to mimic human behavior
  • Screenshot and HTML saving support
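
The User-Agent and random-delay features listed above can be sketched as a few small helpers. Everything here is illustrative: the function names, the delay bounds, the viewport size, and the two User-Agent strings are assumptions, not values taken from scripts/playwright-stealth.js.

```javascript
// Hypothetical helpers; names and values are illustrative assumptions.

// Example mobile User-Agent strings (placeholders; substitute current real ones).
const USER_AGENTS = [
  'Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1',
  'Mozilla/5.0 (Linux; Android 14; Pixel 8) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36',
];

// Random integer delay in [minMs, maxMs]; rng is injectable for testing.
function randomDelayMs(minMs, maxMs, rng = Math.random) {
  return minMs + Math.floor(rng() * (maxMs - minMs + 1));
}

// Pick one User-Agent from the pool.
function pickUserAgent(rng = Math.random) {
  return USER_AGENTS[Math.floor(rng() * USER_AGENTS.length)];
}

// Options object suitable for Playwright's browser.newContext().
function buildContextOptions(rng = Math.random) {
  return {
    userAgent: pickUserAgent(rng),
    viewport: { width: 390, height: 844 }, // phone-sized viewport
    isMobile: true,
    hasTouch: true,
  };
}

// Pause for a random interval between actions.
function humanPause(minMs = 500, maxMs = 2000) {
  return new Promise((resolve) => setTimeout(resolve, randomDelayMs(minMs, maxMs)));
}
```

A scraper would pass buildContextOptions() to browser.newContext() and call await humanPause() between navigation, scrolling, and extraction steps.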

4️⃣ YouTube Video Transcripts

Use deep-scraper (install separately):

# Install deep-scraper skill
npx clawhub install deep-scraper

# Use it
cd skills/deep-scraper
node assets/youtube_handler.js "https://www.youtube.com/watch?v=VIDEO_ID"

📖 Script Descriptions

scripts/playwright-simple.js

  • Use Case: Regular dynamic websites
  • Speed: Fast (3-5 seconds)
  • Anti-Bot: None
  • Output: JSON (title, content, URL)

scripts/playwright-stealth.js

  • Use Case: Sites with Cloudflare or anti-bot protection
  • Speed: Medium (5-20 seconds)
  • Anti-Bot: Medium-High (hides automation, realistic UA)
  • Output: JSON + Screenshot + HTML file
  • Verified: 100% success on Discuss.com.hk

🎓 Best Practices

1. Try web_fetch First

If the site doesn't have dynamic loading, use OpenClaw's web_fetch tool—it's fastest.

2. Need JavaScript? Use Playwright Simple

If you need to wait for JavaScript rendering, use playwright-simple.js.

3. Getting Blocked? Use Stealth

If you encounter 403 or Cloudflare challenges, use playwright-stealth.js.

4. Special Sites Need Specialized Skills

  • YouTube → deep-scraper
  • Reddit → reddit-scraper
  • Twitter → bird skill
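
The decision order in these four practices can be written down as a small routing function. This is only a sketch: chooseMethod, its option names, and the domain-to-skill mapping are hypothetical, and the skill itself does not ship such a helper.

```javascript
// Hypothetical helper encoding the decision order above.
// The domain names are assumptions; the document only names the sites.
const SPECIAL_SITES = {
  'youtube.com': 'deep-scraper',
  'reddit.com': 'reddit-scraper',
  'twitter.com': 'bird',
};

function chooseMethod(url, { needsJavaScript = false, blocked = false } = {}) {
  const host = new URL(url).hostname.replace(/^www\./, '');
  // Practice 4: special sites go to their dedicated skills.
  for (const [domain, skill] of Object.entries(SPECIAL_SITES)) {
    if (host === domain || host.endsWith('.' + domain)) return skill;
  }
  // Practice 3: a 403 or Cloudflare challenge calls for the stealth script.
  if (blocked) return 'scripts/playwright-stealth.js';
  // Practice 2: JavaScript-rendered pages need a real browser.
  if (needsJavaScript) return 'scripts/playwright-simple.js';
  // Practice 1: otherwise the built-in tool is the fastest option.
  return 'web_fetch';
}
```

For example, chooseMethod("https://example.com", { blocked: true }) returns "scripts/playwright-stealth.js".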

🔧 Customization

All scripts support environment variables:

# Set screenshot path
SCREENSHOT_PATH=/path/to/screenshot.png node scripts/playwright-stealth.js URL

# Set wait time (milliseconds)
WAIT_TIME=10000 node scripts/playwright-simple.js URL

# Enable headful mode (show browser)
HEADLESS=false node scripts/playwright-stealth.js URL

# Save HTML
SAVE_HTML=true node scripts/playwright-stealth.js URL

# Custom User-Agent
USER_AGENT="Mozilla/5.0 ..." node scripts/playwright-stealth.js URL
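
Inside a script, these variables would typically be read once at startup. The sketch below shows one way to do that; loadConfig and every default value in it (5000 ms, ./screenshot.png) are assumptions, since the document does not state the scripts' actual defaults.

```javascript
// Hypothetical config loader; defaults are assumptions, not the scripts' real ones.
function loadConfig(env = process.env) {
  const waitTime = Number.parseInt(env.WAIT_TIME ?? '', 10);
  return {
    // Headless unless HEADLESS is explicitly "false".
    headless: env.HEADLESS !== 'false',
    // Fall back to 5000 ms when WAIT_TIME is missing or not a number.
    waitTimeMs: Number.isNaN(waitTime) ? 5000 : waitTime,
    screenshotPath: env.SCREENSHOT_PATH || './screenshot.png',
    // HTML is saved only when SAVE_HTML is exactly "true".
    saveHtml: env.SAVE_HTML === 'true',
    // undefined lets Playwright use its default User-Agent.
    userAgent: env.USER_AGENT || undefined,
  };
}
```

Taking env as a parameter keeps the function testable without touching the real process environment.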

📊 Performance Comparison

Method Speed Anti-Bot Success Rate (Discuss.com.hk)
web_fetch ⚡ Fastest ❌ None 0%
Playwright Simple 🚀 Fast ⚠️ Low 20%
Playwright Stealth ⏱️ Medium ✅ Medium 100%
Puppeteer Stealth ⏱️ Medium ✅ Medium-High ~80%
Crawlee (deep-scraper) 🐢 Slow ❌ Detected 0%
Chaser (Rust) ⏱️ Medium ❌ Detected 0%

🛡️ Anti-Bot Techniques Summary

Lessons learned from our testing:

✅ Effective Anti-Bot Measures

  1. Hide navigator.webdriver — Essential
  2. Realistic User-Agent — Use real devices (iPhone, Android)
  3. Mimic Human Behavior — Random delays, scrolling
  4. Avoid Framework Signatures — Crawlee, Selenium are easily detected
  5. Use addInitScript (Playwright) — Inject before page load
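
Measures 1 and 5 combine naturally: the navigator.webdriver override is registered with addInitScript so that it runs before any of the page's own scripts. The sketch below shows the general pattern only; patchNavigator, buildInitScript, and applyStealth are hypothetical names, and the bundled stealth script may apply its patches differently.

```javascript
// Hypothetical sketch of hiding navigator.webdriver via an init script.

// Runs inside the page: make `webdriver` report false on the given navigator.
function patchNavigator(nav) {
  Object.defineProperty(nav, 'webdriver', {
    get: () => false,
    configurable: true,
  });
}

// Serialize the patch so it can be injected as raw script content.
function buildInitScript() {
  return `(${patchNavigator.toString()})(navigator);`;
}

// Register the patch on a Playwright page or browser context.
// addInitScript evaluates the script in each new document before the page's own scripts run.
async function applyStealth(pageOrContext) {
  await pageOrContext.addInitScript({ content: buildInitScript() });
}
```

Calling applyStealth(context) before context.newPage() applies the patch to every page opened in that context.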

❌ Ineffective Anti-Bot Measures

  1. Only changing User-Agent — Not enough
  2. Using high-level frameworks (Crawlee) — More easily detected
  3. Docker isolation — Doesn't help with Cloudflare

🔍 Troubleshooting

Issue: 403 Forbidden

Solution: Use playwright-stealth.js

Issue: Cloudflare Challenge Page

Solution:

  1. Increase wait time (10-15 seconds, e.g. WAIT_TIME=10000)
  2. Try headful mode with HEADLESS=false (it sometimes has a higher success rate)
  3. Consider using proxy IPs
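
To automate the steps above, a scraper first has to recognize that it received a challenge page and then escalate. The sketch below is a heuristic under stated assumptions: the title markers are strings commonly seen on Cloudflare interstitials rather than anything the skill documents, and looksLikeChallenge, nextAttempt, and the useProxy flag are hypothetical (proxy support is listed only as a future improvement).

```javascript
// Hypothetical heuristic; the markers below are assumptions and may need tuning.
const CHALLENGE_TITLE_MARKERS = ['just a moment', 'attention required'];

// Guess whether a response is a block or challenge page rather than real content.
function looksLikeChallenge(status, title) {
  if (status === 403 || status === 503) return true;
  const lower = (title || '').toLowerCase();
  return CHALLENGE_TITLE_MARKERS.some((marker) => lower.includes(marker));
}

// Escalation plan following the three steps above:
// longer wait, then headful mode, then a proxy.
function nextAttempt(attemptNumber) {
  const plan = [
    { waitTimeMs: 10000, headless: true, useProxy: false },
    { waitTimeMs: 15000, headless: false, useProxy: false },
    { waitTimeMs: 15000, headless: false, useProxy: true },
  ];
  return plan[attemptNumber] ?? null; // null means "give up"
}
```

A retry loop would call nextAttempt(0), nextAttempt(1), and so on, stopping as soon as looksLikeChallenge returns false or the plan returns null.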

Issue: Blank Page

Solution:

  1. Increase waitForTimeout
  2. Use waitUntil: 'networkidle' or 'domcontentloaded'
  3. Check if login is required
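
The first two suggestions can be combined into a single loading routine that tries the stricter wait condition first and falls back to the looser one. This is a minimal sketch, assuming a Playwright Page object; loadWithFallback and its default timeouts are hypothetical and not part of the bundled scripts.

```javascript
// Hypothetical sketch; works with any object exposing Playwright's Page methods.
async function loadWithFallback(page, url, { timeoutMs = 30000, extraWaitMs = 5000 } = {}) {
  let waitUntil = 'networkidle';
  try {
    // Wait until the network has gone quiet; pages that poll constantly can time out here.
    await page.goto(url, { waitUntil, timeout: timeoutMs });
  } catch (err) {
    // On any navigation error (typically a timeout), retry with the earlier DOM-ready event.
    waitUntil = 'domcontentloaded';
    await page.goto(url, { waitUntil, timeout: timeoutMs });
  }
  // Give late JavaScript rendering extra time before reading the DOM.
  await page.waitForTimeout(extraWaitMs);
  const html = await page.content();
  return { waitUntil, html };
}
```

The returned waitUntil value records which strategy succeeded, which helps when deciding whether a site needs a longer WAIT_TIME.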

📝 Memory & Experience

2026-02-07 Discuss.com.hk Test Conclusions

  • ✅ Pure Playwright + Stealth succeeded (5s, 200 OK)
  • ❌ Crawlee (deep-scraper) failed (403)
  • ❌ Chaser (Rust) failed (Cloudflare)
  • ❌ Puppeteer standard failed (403)

Best Solution: Pure Playwright + anti-bot techniques (framework-independent)


🚧 Future Improvements

  • Add proxy IP rotation
  • Implement cookie management (maintain login state)
  • Add CAPTCHA handling (2captcha / Anti-Captcha)
  • Batch scraping (parallel URLs)
  • Integration with OpenClaw's browser tool

Security Recommendations
This skill appears to do what it says: run Playwright scrapers (including techniques to evade bot detection). Before installing, consider: (1) legal/ethical risk — evading anti-bot protections and scraping some sites may violate terms of service or local law; (2) resource impact — Playwright will download Chromium and run headful browsers (disk and RAM usage); (3) network & privacy — scraped data and screenshots are saved locally by default; if you later add CAPTCHA-solving or proxy modules, those may require third-party API keys and could introduce external data flows. Recommended precautions: run first in an isolated environment (container/VM), inspect/verify the scripts yourself (they are short and readable), and avoid supplying any sensitive credentials. If you need CAPTCHA or proxy support, vet those integrations and providers separately.
Functional Analysis
Type: OpenClaw Skill
Name: 102-playwright-scraper-skill
Version: 1.0.0

The skill bundle provides a legitimate Playwright-based web scraping toolset with standard anti-bot evasion techniques. The scripts (scripts/playwright-simple.js and scripts/playwright-stealth.js) perform expected scraping tasks such as content extraction, screenshot capture, and HTML saving, with no evidence of data exfiltration, malicious execution, or prompt injection.
Capability Assessment
Purpose & Capability
Name/description match the included scripts and docs. The repo implements Playwright simple and stealth scrapers, and the requested actions (npm install, npx playwright install chromium, running node scripts) are exactly what a Playwright scraper needs.
Instruction Scope
SKILL.md and scripts focus on navigation, DOM extraction, screenshots and optional HTML saving. They include anti-bot evasion (hide navigator.webdriver, UA spoofing, human-like delays) which aligns with the declared purpose. The docs mention future integration for proxies and CAPTCHA solving (2captcha/Anti-Captcha) but these are not implemented in the provided files. No instructions were found that read unrelated system files or transmit data to hidden endpoints.
Install Mechanism
There is no custom install spec in registry metadata; the package is distributed with package.json and JS files and instructs users to run `npm install` and `npx playwright install chromium`. This pulls Playwright from the npm registry and downloads Chromium via Playwright's official installer — expected for this functionality. No arbitrary personal URLs or archive extracts were used.
Credentials
The skill declares no required environment variables or credentials. Scripts support optional env vars (HEADLESS, WAIT_TIME, SCREENSHOT_PATH, SAVE_HTML, USER_AGENT) that are reasonable for configuration. No secrets or unrelated service tokens are requested.
Persistence & Privilege
Registry flags are default (always:false, user-invocable:true, model-invocation enabled). The skill does not request persistent platform privileges or modify other skills. It writes optional local outputs (screenshots, HTML) to file paths supplied or defaulted, which is expected behavior for a scraper.
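How to Use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install 102-playwright-scraper-skill
  3. Once installed, invoke the Skill by name or trigger it with /102-playwright-scraper-skill
  4. Provide the required inputs according to the Skill's parameter documentation to get structured output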
Version History
v1.0.0
Playwright Scraper Skill v1.2.0:
  • Added comprehensive usage guide and script descriptions, including detailed guidance for various website anti-bot levels.
  • Introduced a use case matrix and performance comparison chart for choosing the best scraping method.
  • Documented anti-bot protection strategies employed in the provided scripts.
  • Included troubleshooting steps, environment variable customization, and best practices for higher scraping success.
  • Outlined future improvement plans and provided helpful external references.
Metadata
Slug 102-playwright-scraper-skill
Version 1.0.0
License MIT-0
Total installs 1
Current installs 0
Number of versions 1
FAQ

What is 102 Playwright Scraper Skill?

Playwright-based web scraping OpenClaw Skill with anti-bot protection. Successfully tested on complex sites like Discuss.com.hk. It is an AI Agent Skill plugin for Claude Code / OpenClaw and has been downloaded 103 times to date.

How do I install 102 Playwright Scraper Skill?

Run the command "/install 102-playwright-scraper-skill" in the OpenClaw or Claude Code chat box for a one-step install; no additional configuration is needed.

Is 102 Playwright Scraper Skill free?

Yes. 102 Playwright Scraper Skill is completely free and released under the MIT-0 license; it can be freely downloaded, installed, and used.

Which platforms does 102 Playwright Scraper Skill support?

102 Playwright Scraper Skill is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed 102 Playwright Scraper Skill?

It is developed and maintained by smallKeyboy (@smallkeyboy); the current version is v1.0.0.
