Web Scraper

Name: Web Scraper
Author: ericlooi504

by ericlooi504 · GitHub ↗ · v1.0.1 · MIT-0

cross-platform ⚠ suspicious

Install in OpenClaw

/install python-web-scraper

Description

Python web scraping toolkit for data extraction, pagination handling, anti-blocking techniques, Selenium for JavaScript-heavy sites, and structured output (J...

README (SKILL.md)

Web Scraper

Overview

Python web scraping toolkit for data extraction, pagination handling, anti-blocking techniques, Selenium for JS-heavy sites, and structured output. Covers ethical scraping practices. Use when Codex needs to extract data from websites, handle pagination, bypass simple anti-bot measures, or scrape JavaScript-rendered content.

Quick Start

Prerequisites

pip install requests beautifulsoup4 lxml
# For JS-heavy sites:
pip install selenium webdriver-manager

Basic scrape

# Extract all links from a page
python3 scripts/scrape-basic.py https://example.com \
  --selector "a[href]" --attr href --output links.json --pretty

# Extract text from articles
python3 scripts/scrape-basic.py https://news.ycombinator.com \
  --selector ".titleline a" --output hn.txt

Paginated scrape

# Numbered-page pagination (page-1.html, page-2.html; likewise ?page=1, ?page=2)
python3 scripts/scrape-pagination.py https://books.toscrape.com/catalogue/page-1.html \
  --selector "h3 a" --attr title --max-pages 5

# Next-link detection
python3 scripts/scrape-pagination.py https://quotes.toscrape.com \
  --selector "span.text" --max-pages 3
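
Next-link detection can be sketched with the standard library alone. This is an illustrative sketch, not the actual code of scripts/scrape-pagination.py: the names NextLinkParser, find_next_url, and iter_pages are hypothetical, and it assumes the site marks its next link with rel="next" or a class of "next" on the link or its parent element.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Records the href of the first link that looks like a 'next page' link."""

    def __init__(self):
        super().__init__()
        self.next_href = None
        self._next_tag = None  # tag name of an enclosing class="next" element

    def handle_starttag(self, tag, attrs):
        if self.next_href is not None:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        rels = (attrs.get("rel") or "").split()
        if tag == "a" and attrs.get("href") and (
            "next" in rels or "next" in classes or self._next_tag
        ):
            self.next_href = attrs["href"]
        elif "next" in classes:
            self._next_tag = tag

    def handle_endtag(self, tag):
        if tag == self._next_tag:
            self._next_tag = None


def find_next_url(html, base_url):
    """Return the absolute URL of the next page, or None if no next link is found."""
    parser = NextLinkParser()
    parser.feed(html)
    if parser.next_href is None:
        return None
    return urljoin(base_url, parser.next_href)


def iter_pages(fetch, start_url, max_pages):
    """Yield (url, html) pairs, following next links for at most max_pages pages.

    `fetch` is any callable that takes a URL and returns the page source.
    """
    url, seen = start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch(url)
        yield url, html
        url = find_next_url(html, url)
```

The `seen` set guards against pagination loops, and `max_pages` plays the same role as the --max-pages flag in the commands above.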

JavaScript-rendered pages (Selenium)

python3 scripts/scrape-with-selenium.py https://example.com \
  --selector ".dynamic-content" --wait 5 --output data.json

Common Scenarios

Anti-blocking techniques

Rotate User-Agents and add delays to avoid 429/blocking:

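import random
import time

# Pool of User-Agent strings to rotate through (placeholders; supply your own)
USER_AGENTS = [
    "<user-agent string 1>",
    "<user-agent string 2>",
]

headers = {
    "User-Agent": random.choice(USER_AGENTS),
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com/",
}
time.sleep(random.uniform(1.0, 3.0))  # random delay between requests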

For aggressive blocking: set cookies, use sessions, or add a proxy.
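
When a server does answer with 429, backing off is usually more effective than retrying at the same pace. The following is a minimal sketch, assuming the server may send a Retry-After header given in seconds; backoff_delay and fetch_with_retry are hypothetical helper names, and `get` stands for any callable that returns an object with .status_code and .headers, such as requests.Session().get.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff with jitter: a random wait of up to base * 2**attempt seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def fetch_with_retry(get, url, max_attempts=4, sleep=time.sleep):
    """Call get(url) until the status is not 429/503 or attempts run out."""
    response = None
    for attempt in range(max_attempts):
        response = get(url)
        if response.status_code not in (429, 503):
            break
        if attempt < max_attempts - 1:
            # Prefer the server's own hint when it is a plain number of seconds.
            retry_after = response.headers.get("Retry-After", "")
            if retry_after.isdigit():
                sleep(float(retry_after))
            else:
                sleep(backoff_delay(attempt))
    return response
```

The `sleep` parameter is injectable only so the helper can be exercised without real waiting; in normal use the default time.sleep applies.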

Handle JavaScript sites without Selenium

First check: is the data embedded in the page source?

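import json
import re

# Look for JSON data in <script> tags; `html` is the page source as a string
# (for example, response.text from requests).
# Note: the non-greedy pattern stops at the first "};", so it can truncate nested data.
match = re.search(r'window\.__INITIAL_STATE__\s*=\s*({.*?});', html, re.DOTALL)
if match:
    data = json.loads(match.group(1))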

Many SPAs (React/Vue) embed data in script tags — Selenium may be unnecessary.
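
A regex that ends at the first "};" can cut the embedded object short when the data is nested or contains that sequence inside a string. One way around this, sketched here under the assumption that the embedded state is strict JSON (not a JavaScript object literal), is to locate the assignment with a regex and let json.JSONDecoder.raw_decode find where the object ends. The function name extract_embedded_json is hypothetical.

```python
import json
import re


def extract_embedded_json(html, variable="window.__INITIAL_STATE__"):
    """Return the JSON value assigned to `variable` in the page source, or None."""
    match = re.search(re.escape(variable) + r"\s*=\s*", html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value and ignores whatever follows it.
        data, _length = json.JSONDecoder().raw_decode(html[match.end():])
    except json.JSONDecodeError:
        return None
    return data
```

For example, `extract_embedded_json(response.text)` would return a dict when the page assigns a JSON object to window.__INITIAL_STATE__, and None when the variable is absent or the value is not valid JSON.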

Handle login-protected pages

# Option 1: Export cookies from the browser
# In the browser console: document.cookie (HttpOnly cookies are not visible there), or use the EditThisCookie extension
# Option 2: Log in with a requests Session and save its cookies as JSON
python3 -c "
import json
import requests
s = requests.Session()
r = s.post('https://example.com/login', data={'user': '...', 'pass': '...'})
r.raise_for_status()
with open('cookies.txt', 'w') as f:
    json.dump(s.cookies.get_dict(), f)
"

Output formatting

Scripts output JSON by default. Convert to CSV:

# JSON → CSV using Python's csv module
python3 scripts/scrape-basic.py https://example.com --selector "tr" --output data.json --pretty
python3 -c "
import json, csv
with open('data.json') as f:
    data = json.load(f)
with open('data.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['item'])
    for d in data:
        w.writerow([d])
"

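The conversion above writes a single "item" column, which fits a flat list of strings. If the scraped data is instead a list of objects, csv.DictWriter can produce one column per key. This is a sketch under that assumption; records_to_csv is a hypothetical helper, and the shape of the records is not something the scripts are confirmed to emit.

```python
import csv
import io


def records_to_csv(records, out):
    """Write a list of dicts to the file-like object `out` as CSV."""
    # Collect column names in first-seen order across all records.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    # restval fills in cells for records that lack a given key.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(records)


if __name__ == "__main__":
    buffer = io.StringIO()
    records_to_csv([{"title": "A", "price": 1}, {"title": "B"}], buffer)
    print(buffer.getvalue())
```

When writing to a real file, open it with newline='' as in the snippet above so the csv module controls line endings.
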
Ethics & Legal

Always check robots.txt first: https://example.com/robots.txt
Respect the Crawl-delay directive
Identify yourself in User-Agent with contact info
Never scrape login-protected content, personal data, or copyrighted material
Add delays (1-3s minimum) between requests — don't hammer servers
Check the ToS; some sites explicitly ban scraping
For public data (news, blogs, directories): generally fine with proper rate limiting
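
The robots.txt and Crawl-delay checks in the list above can be automated with Python's standard urllib.robotparser. This is a minimal sketch: load_robots and allowed_delay are hypothetical helper names, and the 1-second fallback simply mirrors the minimum delay suggested above.

```python
from urllib.robotparser import RobotFileParser


def load_robots(robots_txt):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_delay(parser, user_agent, url, default_delay=1.0):
    """Return the seconds to wait before fetching `url`, or None if it is disallowed."""
    if not parser.can_fetch(user_agent, url):
        return None
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default_delay
```

To read a live file instead of a string, RobotFileParser also offers set_url() followed by read(), which downloads and parses the robots.txt at that URL.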

Resources

scripts/scrape-basic.py — Single page scrape with CSS selectors, JSON/CSV/text output
scripts/scrape-pagination.py — Paginated scrape (URL params + next-link detection)
scripts/scrape-with-selenium.py — Selenium-based scrape for JS-heavy sites with scroll
references/anti-blocking.md — Detailed anti-blocking and proxy strategies

Usage Guidance

Install or use this only if you intentionally need a web-scraping toolkit and can ensure the target use is authorized. Avoid giving it browser cookies, passwords, or private account sessions; do not use the anti-bot, proxy, CAPTCHA, or webdriver-bypass guidance unless permitted by the site owner. Pin and review dependencies, and run scraping in an isolated environment.

Capability Analysis

Type: OpenClaw Skill
Name: python-web-scraper
Version: 1.0.1

The skill bundle is a standard web scraping toolkit providing scripts for basic, paginated, and Selenium-based data extraction. The code in scripts/scrape-basic.py, scripts/scrape-pagination.py, and scripts/scrape-with-selenium.py follows best practices for scraping, including rate limiting and user-agent rotation, and lacks any indicators of data exfiltration, unauthorized execution, or malicious prompt injection.

Capability Tags

requires-sensitive-credentials

Capability Assessment

⚠ Purpose & Capability

The included scripts match a web-scraping toolkit, but the documented scope extends from public-page scraping into anti-bot evasion, CAPTCHA handling, proxies, and login/session-cookie use.

⚠ Instruction Scope

The skill gives steps for login-protected pages while also saying not to scrape login-protected content, creating unclear boundaries for when the agent should use credentials or sessions.

ℹ Install Mechanism

There is no install spec; setup is via user-directed pip commands, and the Selenium helper uses webdriver-manager to obtain ChromeDriver at runtime.

⚠ Credentials

Arbitrary URL fetching and file output are expected for scraping, but using browser cookies, credentials, proxies, and bot-detection bypass techniques is higher-impact than simple public data extraction.

⚠ Persistence & Privilege

The documentation includes writing session cookies to a local cookies.txt file and does not provide clear retention, protection, or cleanup guidance for that sensitive account material.

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install python-web-scraper
After installation, invoke the skill by name or use /python-web-scraper
Provide required inputs per the skill's parameter spec and get structured output

Version History

v1.0.1

Fix: --output - now correctly prints to stdout instead of creating a '-' file

v1.0.0

Initial release: basic scraping, pagination, Selenium support, anti-blocking strategies, multiple output formats

Metadata

Slug python-web-scraper

Version 1.0.1

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 2

Frequently Asked Questions

What is Web Scraper?

Python web scraping toolkit for data extraction, pagination handling, anti-blocking techniques, Selenium for JavaScript-heavy sites, and structured output (J... It is an AI Agent Skill for Claude Code / OpenClaw, with 15 downloads so far.

How do I install Web Scraper?

Run "/install python-web-scraper" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Web Scraper free?

Yes, Web Scraper is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Web Scraper support?

Web Scraper is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Web Scraper?

It is built and maintained by ericlooi504 (@ericlooi504); the current version is v1.0.1.
