
Web Scraper

Name: Web Scraper
Author: ericlooi504

By ericlooi504 · GitHub · v1.0.1 · MIT-0

cross-platform ⚠ suspicious


Install in OpenClaw

/install python-web-scraper

Description

Python web scraping toolkit for data extraction, pagination handling, anti-blocking techniques, Selenium for JavaScript-heavy sites, and structured output (J...

Usage (SKILL.md)

Web Scraper

Overview

Python web scraping toolkit for data extraction, pagination handling, anti-blocking techniques, Selenium for JS-heavy sites, and structured output. Covers ethical scraping practices. Use when Codex needs to extract data from websites, handle pagination, bypass simple anti-bot measures, or scrape JavaScript-rendered content.

Quick Start

Prerequisites

pip install requests beautifulsoup4 lxml
# For JS-heavy sites:
pip install selenium webdriver-manager

Basic scrape

# Extract all links from a page
python3 scripts/scrape-basic.py https://example.com \
  --selector "a[href]" --attr href --output links.json --pretty

# Extract text from articles
python3 scripts/scrape-basic.py https://news.ycombinator.com \
  --selector ".titleline a" --output hn.txt

Paginated scrape

# Numbered-URL pagination (e.g. ?page=1, ?page=2; this example uses page-1.html)
python3 scripts/scrape-pagination.py https://books.toscrape.com/catalogue/page-1.html \
  --selector "h3 a" --attr title --max-pages 5

# Next-link detection
python3 scripts/scrape-pagination.py https://quotes.toscrape.com \
  --selector "span.text" --max-pages 3
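
How next-link detection works is not spelled out here. The following is a minimal sketch of one way it could be done with the standard library only. The names `NextLinkFinder` and `find_next_url`, and the two heuristics (an anchor with `rel="next"`, or an anchor inside an `<li class="next">`), are assumptions for illustration and not necessarily what scripts/scrape-pagination.py does.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkFinder(HTMLParser):
    """Remember the href of the first anchor that looks like a 'next page' link."""

    def __init__(self):
        super().__init__()
        self.next_href = None
        self._in_next_li = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            # Heuristic 1 (assumed): <li class="next"> wraps the next link
            self._in_next_li = "next" in (attrs.get("class") or "").split()
        elif tag == "a" and self.next_href is None:
            # Heuristic 2 (assumed): the anchor itself carries rel="next"
            rel = (attrs.get("rel") or "").split()
            if "next" in rel or self._in_next_li:
                self.next_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_next_li = False


def find_next_url(html, base_url):
    """Return the absolute URL of the next page, or None on the last page."""
    finder = NextLinkFinder()
    finder.feed(html)
    if not finder.next_href:
        return None
    return urljoin(base_url, finder.next_href)
```

A crawl loop would call `find_next_url` on each fetched page and stop when it returns None or when the page limit is reached.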

JavaScript-rendered pages (Selenium)

python3 scripts/scrape-with-selenium.py https://example.com \
  --selector ".dynamic-content" --wait 5 --output data.json

Common Scenarios

Anti-blocking techniques

Rotate User-Agents and add delays to avoid 429/blocking:

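import random
import time

# Example pool only; replace with the User-Agent strings you intend to send.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

headers = {
    "User-Agent": random.choice(USER_AGENTS),  # pick a different agent per request
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com/",
}
time.sleep(random.uniform(1.0, 3.0))  # random 1-3 s delay between requests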

For more aggressive blocking, set cookies, reuse a session, or route requests through a proxy.
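
When a server does answer 429, waiting before retrying is usually more effective than changing headers. The following is a minimal sketch of that idea, assuming `session` is a `requests.Session()` or any object with a compatible `get` method. The function names `backoff_delay` and `fetch_with_backoff` and their default values are hypothetical, not part of this skill's scripts.

```python
import random
import time


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Return the seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # Retry-After given as a number of seconds
            return min(float(retry_after), cap)
        except ValueError:
            pass  # Retry-After can also be an HTTP date; fall back to backoff
    jitter = random.uniform(0, 0.5)
    return min(base * (2 ** attempt) + jitter, cap)


def fetch_with_backoff(session, url, max_retries=3, sleep=time.sleep, **kwargs):
    """GET `url`, waiting and retrying while the server answers 429."""
    kwargs.setdefault("timeout", 10)
    for attempt in range(max_retries + 1):
        resp = session.get(url, **kwargs)
        # Return on any non-429 status, or when retries are exhausted
        if resp.status_code != 429 or attempt == max_retries:
            return resp
        sleep(backoff_delay(attempt, resp.headers.get("Retry-After")))
```

Typical use would be `fetch_with_backoff(requests.Session(), url, headers=headers)`. The `sleep` parameter exists only so the waiting can be replaced in tests.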

Handle JavaScript sites without Selenium

First check: is the data embedded in the page source?

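import json
import re

import requests

html = requests.get("https://example.com", timeout=10).text  # example URL
# Look for JSON data assigned inside a <script> tag
match = re.search(r'window\.__INITIAL_STATE__\s*=\s*({.*?});', html, re.DOTALL)
if match:
    data = json.loads(match.group(1))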

Many SPAs (React/Vue) embed data in script tags — Selenium may be unnecessary.
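
A non-greedy regex such as `({.*?});` stops at the first `};` it meets, which can cut the JSON short when a string value happens to contain those characters. A sketch of a sturdier variant is below: it locates the assignment with a regex and then lets `json.JSONDecoder.raw_decode` read exactly one JSON value from that position. The helper name `extract_initial_state` is hypothetical, and the marker `window.__INITIAL_STATE__` is site-specific.

```python
import json
import re


def extract_initial_state(html, marker="window.__INITIAL_STATE__"):
    """Return the JSON value assigned to `marker` in `html`, or None."""
    assignment = re.search(re.escape(marker) + r"\s*=\s*", html)
    if assignment is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index
        # and ignores whatever follows it (";", "</script>", ...)
        value, _end = json.JSONDecoder().raw_decode(html, assignment.end())
    except json.JSONDecodeError:
        return None  # the assigned value is not valid JSON
    return value
```

This only works when the embedded state is strict JSON; a JavaScript object literal with unquoted keys still needs a different approach.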

Handle login-protected pages

# Option 1: Export cookies from your browser
# In the browser console, run document.cookie (HttpOnly cookies are not visible there),
# or use a cookie-export extension such as EditThisCookie
# Option 2: Log in with a requests Session and save its cookies
python3 -c "
import json, requests
s = requests.Session()
s.post('https://example.com/login', data={'user': '...', 'pass': '...'})
with open('cookies.txt', 'w') as f:
    json.dump(s.cookies.get_dict(), f)  # JSON so the file can be reloaded later
"

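Saved session cookies are account credentials, so the file deserves some care. The following is a minimal sketch, assuming the cookies are stored as a JSON object of name/value pairs. The helpers `save_cookies` and `load_cookies` are hypothetical names, and the owner-only file mode takes effect on POSIX systems when the file is newly created.

```python
import json
import os


def save_cookies(path, cookies):
    """Write a dict of cookies as JSON, readable by the owner only."""
    # Mode 0o600 applies when the file is created; an existing file keeps its mode
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(cookies, f)


def load_cookies(path):
    """Read back the dict written by save_cookies."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

With requests, the loaded dict can be applied to a new session via `session.cookies.update(load_cookies(path))`. Delete the file with `os.remove(path)` once the scrape is finished.
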
Output formatting

Scripts output JSON by default. Convert to CSV:

# JSON → CSV using Python's csv module
python3 scripts/scrape-basic.py https://example.com -s "tr" -o data.json --pretty
python3 -c "
import json, csv
with open('data.json') as f:
    data = json.load(f)
with open('data.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['item'])  # single-column header
    for d in data:
        w.writerow([d])   # one scraped item per row
"

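The one-column conversion above suits a flat list of strings. If the JSON is instead a list of objects (an assumption here, since the exact output shape of the scripts is not documented in this section), `csv.DictWriter` can produce one column per key. The helper name `records_to_csv` is hypothetical.

```python
import csv
import io


def records_to_csv(records):
    """Render a list of dicts as CSV text, one column per key."""
    # Collect column names in first-seen order across all records
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    if not fieldnames:
        return ""
    buffer = io.StringIO(newline="")
    # restval fills in keys that a given record does not have
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

To write a file, pass the returned text to `open('data.csv', 'w', newline='').write(...)`.
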
Ethics & Legal

Always check robots.txt first: https://example.com/robots.txt
Respect the Crawl-delay directive
Identify yourself in the User-Agent header with contact info
Never scrape login-protected content, personal data, or copyrighted material
Add delays (1-3 s minimum) between requests; don't hammer servers
Check the ToS; some sites explicitly ban scraping
For public data (news, blogs, directories): generally fine with proper rate limiting
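
The robots.txt and Crawl-delay checks can be automated with the standard library's `urllib.robotparser`. The following is a minimal sketch. The helper name `robots_policy` is hypothetical, and it takes the robots.txt text as a string so the logic can be tried without network access.

```python
from urllib.robotparser import RobotFileParser


def robots_policy(robots_txt, user_agent, url):
    """Return (allowed, crawl_delay) for `user_agent` fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(user_agent, url)
    delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay is set
    return allowed, delay
```

Against a live site, call `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` instead of `parse`, then skip any URL that is not allowed and sleep for at least the returned delay between requests.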

Resources

scripts/scrape-basic.py — Single page scrape with CSS selectors, JSON/CSV/text output
scripts/scrape-pagination.py — Paginated scrape (URL params + next-link detection)
scripts/scrape-with-selenium.py — Selenium-based scrape for JS-heavy sites with scroll
references/anti-blocking.md — Detailed anti-blocking and proxy strategies

Security recommendations

Install or use this only if you intentionally need a web-scraping toolkit and can ensure the target use is authorized. Avoid giving it browser cookies, passwords, or private account sessions; do not use the anti-bot, proxy, CAPTCHA, or webdriver-bypass guidance unless permitted by the site owner. Pin and review dependencies, and run scraping in an isolated environment.

Functional analysis

Type: OpenClaw Skill
Name: python-web-scraper
Version: 1.0.1

The skill bundle is a standard web scraping toolkit providing scripts for basic, paginated, and Selenium-based data extraction. The code in scripts/scrape-basic.py, scripts/scrape-pagination.py, and scripts/scrape-with-selenium.py follows best practices for scraping, including rate limiting and user-agent rotation, and lacks any indicators of data exfiltration, unauthorized execution, or malicious prompt injection.

Capability tags

requires-sensitive-credentials

Capability assessment

⚠ Purpose & Capability

The included scripts match a web-scraping toolkit, but the documented scope extends from public-page scraping into anti-bot evasion, CAPTCHA handling, proxies, and login/session-cookie use.

⚠ Instruction Scope

The skill gives steps for login-protected pages while also saying not to scrape login-protected content, creating unclear boundaries for when the agent should use credentials or sessions.

ℹ Install Mechanism

There is no install spec; setup is via user-directed pip commands, and the Selenium helper uses webdriver-manager to obtain ChromeDriver at runtime.

⚠ Credentials

Arbitrary URL fetching and file output are expected for scraping, but using browser cookies, credentials, proxies, and bot-detection bypass techniques is higher-impact than simple public data extraction.

⚠ Persistence & Privilege

The documentation includes writing session cookies to a local cookies.txt file and does not provide clear retention, protection, or cleanup guidance for that sensitive account material.

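How to use

Make sure OpenClaw is installed (locally or via Docker)
Enter the install command in the chat box: /install python-web-scraper
Once installed, invoke the Skill by name or trigger it with /python-web-scraper
Provide the inputs required by the Skill's parameter documentation to get structured output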

Version history

v1.0.1

Fix: --output - now correctly prints to stdout instead of creating a '-' file

v1.0.0

Initial release: basic scraping, pagination, Selenium support, anti-blocking strategies, multiple output formats

Metadata

Slug python-web-scraper

Version 1.0.1

License MIT-0

Total installs 0

Current installs 0

Versions 2

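FAQ

What is Web Scraper?

Python web scraping toolkit for data extraction, pagination handling, anti-blocking techniques, Selenium for JavaScript-heavy sites, and structured output (J... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 15 downloads to date.

How do I install Web Scraper?

Run the command "/install python-web-scraper" in the OpenClaw or Claude Code chat box for one-step installation; no extra configuration is needed.

Is Web Scraper free?

Yes. Web Scraper is completely free and released under the MIT-0 license; you can download, install, and use it freely.

Which platforms does Web Scraper support?

Web Scraper is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.

Who develops Web Scraper?

It is developed and maintained by ericlooi504 (@ericlooi504); the current version is v1.0.1.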
