Web Scraper

by ericlooi504 · GitHub ↗ · v1.0.1 · MIT-0
cross-platform · ⚠ suspicious
Downloads: 15 · Stars: 0 · Active Installs: 0 · Versions: 2
Install in OpenClaw
/install python-web-scraper
Description
Python web scraping toolkit for data extraction, pagination handling, anti-blocking techniques, Selenium for JavaScript-heavy sites, and structured output (J...
README (SKILL.md)

Web Scraper

Overview

Python web scraping toolkit for data extraction, pagination handling, anti-blocking techniques, Selenium for JS-heavy sites, and structured output. Covers ethical scraping practices. Use when Codex needs to extract data from websites, handle pagination, bypass simple anti-bot measures, or scrape JavaScript-rendered content.

Quick Start

Prerequisites

pip install requests beautifulsoup4 lxml
# For JS-heavy sites:
pip install selenium webdriver-manager

Basic scrape

# Extract all links from a page
python3 scripts/scrape-basic.py https://example.com \
  --selector "a[href]" --attr href --output links.json --pretty

# Extract text from articles
python3 scripts/scrape-basic.py https://news.ycombinator.com \
  --selector ".titleline a" --output hn.txt

Paginated scrape

# URL-based pagination (?page=1, ?page=2; in this example the page number is part of the path)
python3 scripts/scrape-pagination.py https://books.toscrape.com/catalogue/page-1.html \
  --selector "h3 a" --attr title --max-pages 5

# Next-link detection
python3 scripts/scrape-pagination.py https://quotes.toscrape.com \
  --selector "span.text" --max-pages 3
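
Next-link detection can be sketched with the standard library alone. This is an illustration under assumptions, not the code in scripts/scrape-pagination.py: the names NextLinkParser and find_next_url are hypothetical, and the sketch assumes the site marks its next-page link either with rel="next" or by placing it inside an element whose class is "next".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Record the href of the first link that looks like a "next page" link."""

    def __init__(self):
        super().__init__()
        self.next_href = None
        self._inside = None  # tag name of the enclosing class="next" element, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        rels = (attrs.get("rel") or "").split()
        if tag == "a" and attrs.get("href") and self.next_href is None:
            # Accept <a rel="next">, <a class="next">, or any link nested in class="next"
            if "next" in rels or "next" in classes or self._inside:
                self.next_href = attrs["href"]
        elif "next" in classes and self._inside is None:
            self._inside = tag

    def handle_endtag(self, tag):
        if tag == self._inside:
            self._inside = None


def find_next_url(html, base_url):
    """Return the absolute URL of the next page, or None when no next link is found."""
    parser = NextLinkParser()
    parser.feed(html)
    if parser.next_href is None:
        return None
    return urljoin(base_url, parser.next_href)
```

A crawl loop would call find_next_url on each fetched page and stop when it returns None or when the page limit (the role --max-pages plays above) is reached.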

JavaScript-rendered pages (Selenium)

python3 scripts/scrape-with-selenium.py https://example.com \
  --selector ".dynamic-content" --wait 5 --output data.json

Common Scenarios

Anti-blocking techniques

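Rotate User-Agents and add random delays between requests to reduce the chance of HTTP 429 (Too Many Requests) responses or blocks:

import random
import time

# Pool of User-Agent strings to rotate through (example values; substitute your own)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

headers = {
    "User-Agent": random.choice(USER_AGENTS),
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com/",
}
time.sleep(random.uniform(1.0, 3.0))  # random delay between requests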

For aggressive blocking: set cookies, use sessions, or add a proxy.
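
When a 429 arrives despite the delays, a retry can wait longer each time and honor the server's Retry-After header if one is sent. The helper below is a minimal sketch of that idea. The name backoff_delay and its parameters (base, cap) are hypothetical and not part of the bundled scripts.

```python
import random


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based), never more than `cap`."""
    if retry_after is not None:
        try:
            # Retry-After is usually a number of seconds
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # it may also be an HTTP date; fall back to exponential backoff
    ceiling = min(cap, base * (2 ** attempt))
    # Jitter: pick a value in the upper half of the window so clients do not retry in lockstep
    return random.uniform(ceiling / 2, ceiling)
```

With requests, this could be used as time.sleep(backoff_delay(attempt, response.headers.get("Retry-After"))) before re-sending the request.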

Handle JavaScript sites without Selenium

First check: is the data embedded in the page source?

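import json
import re

# `html` holds the page source, e.g. html = requests.get(url).text
# Look for JSON data in <script> tags
match = re.search(r'window\.__INITIAL_STATE__\s*=\s*({.*?});', html, re.DOTALL)
if match:
    # The non-greedy match ends at the first "};", so it breaks if a string value contains "};"
    data = json.loads(match.group(1))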

Many SPAs (React/Vue) embed data in script tags — Selenium may be unnecessary.
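
If the embedded JSON is large or contains braces and semicolons inside strings, a regex that guesses where the object ends becomes fragile. One alternative, sketched here with a hypothetical helper name (extract_embedded_json), is to locate the assignment with a regex and let json.JSONDecoder.raw_decode find the end of the value. The variable name window.__INITIAL_STATE__ is carried over from the snippet above; real sites use different names.

```python
import json
import re


def extract_embedded_json(html, marker="window.__INITIAL_STATE__"):
    """Return the JSON value assigned to `marker` in the page source, or None."""
    match = re.search(re.escape(marker) + r"\s*=\s*", html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index
        # and ignores whatever follows it (the trailing ";" and </script>)
        value, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return value
```

This only works when the assigned value is strict JSON. A JavaScript object literal with unquoted keys or trailing commas will fail to parse and the function returns None.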

Handle login-protected pages

# Option 1: Export cookies from browser
# In browser console: document.cookie or use EditThisCookie extension
# Option 2: Use requests Session
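python3 -c "
import json, requests
s = requests.Session()
# Form field names ('user', 'pass') are placeholders; use the names from the site's login form
resp = s.post('https://example.com/login', data={'user': '...', 'pass': '...'})
resp.raise_for_status()
# Save cookies as JSON so they can be read back with json.load
with open('cookies.txt', 'w') as f:
    json.dump(s.cookies.get_dict(), f)
"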

Output formatting

Scripts output JSON by default. Convert to CSV:

# JSON → CSV using Python's csv module
python3 scripts/scrape-basic.py https://example.com --selector "tr" --output data.json --pretty
python3 -c "
import json, csv
with open('data.json') as f:
    data = json.load(f)
with open('data.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['item'])
    for d in data:
        w.writerow([d])
"
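
The one-liner above assumes the JSON file is a flat list of strings. If the records were dictionaries instead (an assumption; the output shape of the bundled scripts is not shown here), csv.DictWriter can build the header from the keys. The function name records_to_csv is hypothetical.

```python
import csv
import io


def records_to_csv(records):
    """Serialize a list of dicts to CSV text; the header is the union of keys in first-seen order."""
    fieldnames = []
    for rec in records:
        for key in rec:
            if key not in fieldnames:
                fieldnames.append(key)
    buf = io.StringIO(newline="")
    # restval fills in fields that a record does not have
    writer = csv.DictWriter(buf, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()
```

The csv module handles quoting, so values that contain commas or quotes stay in a single column.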

Ethics & Legal

  • Always check robots.txt first: https://example.com/robots.txt
  • Respect the Crawl-delay directive
  • Identify yourself in User-Agent with contact info
  • Never scrape login-protected content, personal data, or copyrighted material
  • Add delays (1-3s minimum) between requests — don't hammer servers
  • Check the ToS; some sites explicitly ban scraping
  • For public data (news, blogs, directories): generally fine with proper rate limiting
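
The robots.txt and Crawl-delay checks from the list above can be automated with the standard library's urllib.robotparser. The sketch below parses an inline example file so that it runs without network access. The robots.txt content, the agent string, and the helper name load_rules are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (hypothetical; fetch the real file from the target site)
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""


def load_rules(robots_text):
    """Build a RobotFileParser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
agent = "my-scraper/1.0 (+mailto:you@example.com)"  # identify yourself with contact info

allowed = rules.can_fetch(agent, "https://example.com/blog/")
blocked = rules.can_fetch(agent, "https://example.com/private/data")
delay = rules.crawl_delay(agent)  # None when the file sets no Crawl-delay
```

Against a live site, calling set_url with the robots.txt URL followed by read() replaces the inline text.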

Resources

  • scripts/scrape-basic.py — Single page scrape with CSS selectors, JSON/CSV/text output
  • scripts/scrape-pagination.py — Paginated scrape (URL params + next-link detection)
  • scripts/scrape-with-selenium.py — Selenium-based scrape for JS-heavy sites with scroll
  • references/anti-blocking.md — Detailed anti-blocking and proxy strategies
Usage Guidance
Install or use this only if you intentionally need a web-scraping toolkit and can ensure the target use is authorized. Avoid giving it browser cookies, passwords, or private account sessions; do not use the anti-bot, proxy, CAPTCHA, or webdriver-bypass guidance unless permitted by the site owner. Pin and review dependencies, and run scraping in an isolated environment.
Capability Analysis
Type: OpenClaw Skill · Name: python-web-scraper · Version: 1.0.1
The skill bundle is a standard web scraping toolkit providing scripts for basic, paginated, and Selenium-based data extraction. The code in scripts/scrape-basic.py, scripts/scrape-pagination.py, and scripts/scrape-with-selenium.py follows best practices for scraping, including rate limiting and user-agent rotation, and lacks any indicators of data exfiltration, unauthorized execution, or malicious prompt injection.
Capability Tags
requires-sensitive-credentials
Capability Assessment
Purpose & Capability
The included scripts match a web-scraping toolkit, but the documented scope extends from public-page scraping into anti-bot evasion, CAPTCHA handling, proxies, and login/session-cookie use.
Instruction Scope
The skill gives steps for login-protected pages while also saying not to scrape login-protected content, creating unclear boundaries for when the agent should use credentials or sessions.
Install Mechanism
There is no install spec; setup is via user-directed pip commands, and the Selenium helper uses webdriver-manager to obtain ChromeDriver at runtime.
Credentials
Arbitrary URL fetching and file output are expected for scraping, but using browser cookies, credentials, proxies, and bot-detection bypass techniques is higher-impact than simple public data extraction.
Persistence & Privilege
The documentation includes writing session cookies to a local cookies.txt file and does not provide clear retention, protection, or cleanup guidance for that sensitive account material.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install python-web-scraper
  3. After installation, invoke the skill by name or use /python-web-scraper
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.1
Fix: passing "-" to --output now prints to stdout instead of creating a file named "-"
v1.0.0
Initial release: basic scraping, pagination, Selenium support, anti-blocking strategies, multiple output formats
Metadata
Slug: python-web-scraper
Version: 1.0.1
License: MIT-0
All-time Installs: 0
Active Installs: 0
Total Versions: 2
Frequently Asked Questions

What is Web Scraper?

Python web scraping toolkit for data extraction, pagination handling, anti-blocking techniques, Selenium for JavaScript-heavy sites, and structured output (J... It is an AI Agent Skill for Claude Code / OpenClaw, with 15 downloads so far.

How do I install Web Scraper?

Run "/install python-web-scraper" in the OpenClaw or Claude Code chat to install the skill. The Python dependencies listed under Prerequisites are installed separately with pip.

Is Web Scraper free?

Yes, Web Scraper is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Web Scraper support?

Web Scraper is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Web Scraper?

It is built and maintained by ericlooi504 (@ericlooi504); the current version is v1.0.1.
