← 返回 Skills 市场
1kalin

Web Scraping & Data Extraction Engine

作者 1kalin · GitHub ↗ · v1.0.0
cross-platform ⚠ suspicious
1398
总下载
0
收藏
5
当前安装
1
版本数
在 OpenClaw 中安装
/install afrexai-web-scraping-engine
功能描述
Complete web scraping methodology — legal compliance, architecture design, anti-detection, data pipelines, and production operations. Use when building scrap...
使用说明 (SKILL.md)

Web Scraping & Data Extraction Engine

Quick Health Check (Run First)

Score your scraping operation (2 points each):

Signal Healthy Unhealthy
Legal compliance robots.txt checked, ToS reviewed Scraping blindly
Architecture Tool matches site complexity Using Puppeteer for static HTML
Anti-detection Rotation, delays, fingerprint diversity Single IP, no delays
Data quality Validation + dedup pipeline Raw dumps, no cleaning
Error handling Retry logic, circuit breakers Crashes on first 403
Monitoring Success rates tracked, alerts set No visibility
Storage Structured, deduplicated, versioned Flat files, duplicates
Scheduling Appropriate frequency, off-peak Hammering during business hours

Score: /16 → 12+: Production-ready | 8-11: Needs work | \x3C8: Stop and redesign


Phase 1: Legal & Ethical Foundation

Pre-Scrape Compliance Checklist

compliance_brief:
  target_domain: ""
  date_assessed: ""
  
  robots_txt:
    checked: false
    target_paths_allowed: false
    crawl_delay_specified: ""
    ai_bot_rules: ""  # Many sites now block AI crawlers specifically
    
  terms_of_service:
    reviewed: false
    scraping_mentioned: false
    scraping_prohibited: false
    api_available: false
    api_sufficient: false
    
  data_classification:
    type: ""  # public-factual | public-personal | behind-auth | copyrighted
    contains_pii: false
    pii_types: []  # name, email, phone, address, photo
    gdpr_applies: false  # EU residents' data
    ccpa_applies: false  # California residents' data
    
  legal_risk: ""  # low | medium | high | do-not-scrape
  decision: ""  # proceed | use-api | request-permission | abandon
  justification: ""

Legal Landscape Quick Reference

Scenario Risk Level Key Case Law
Public data, no login, robots.txt allows LOW hiQ v. LinkedIn (2022)
Public data, robots.txt disallows MEDIUM Meta v. Bright Data (2024)
Behind authentication HIGH Van Buren v. US (2021), CFAA
Personal data without consent HIGH GDPR Art. 6, CCPA §1798.100
Republishing copyrighted content HIGH Copyright Act §106
Price/product comparison LOW eBay v. Bidder's Edge (fair use)
Academic/research use LOW-MEDIUM Varies by jurisdiction
Bypassing anti-bot measures HIGH CFAA "exceeds authorized access"

Decision Rules

  1. API exists and covers your needs? → Use the API. Always.
  2. robots.txt disallows your target? → Respect it unless you have written permission.
  3. Data behind login? → Do not scrape without explicit authorization.
  4. Contains PII? → GDPR/CCPA compliance required before collection.
  5. Copyrighted content? → Extract facts/data points only, never full content.
  6. Site explicitly prohibits scraping? → Request permission or find alternative source.

AI Crawler Considerations (2025+)

Many sites now specifically block AI-related crawlers:

# Common AI bot blocks in robots.txt
User-agent: GPTBot
User-agent: ChatGPT-User
User-agent: Google-Extended
User-agent: CCBot
User-agent: anthropic-ai
User-agent: ClaudeBot
User-agent: Bytespider
User-agent: PerplexityBot

Rule: If collecting data for AI training, check for these specific blocks.


Phase 2: Architecture Decision

Tool Selection Matrix

Tool/Approach Best For Speed JS Support Complexity Cost
HTTP client (requests/axios) Static HTML, APIs ⚡⚡⚡ Low Free
Beautiful Soup / Cheerio Static HTML parsing ⚡⚡⚡ Low Free
Scrapy Large-scale structured crawling ⚡⚡⚡ Plugin Medium Free
Playwright / Puppeteer JS-rendered, SPAs, interactions Medium Free
Selenium Legacy, browser automation High Free
Crawlee Hybrid (HTTP + browser fallback) ⚡⚡ Medium Free
Firecrawl / ScrapingBee Managed, anti-bot bypass ⚡⚡ Low Paid
Bright Data / Oxylabs Enterprise, proxy + browser ⚡⚡ Low Paid

Decision Tree

Is the content in the initial HTML source?
├── YES → Is the site structure consistent?
│   ├── YES → Static scraper (requests + BeautifulSoup/Cheerio)
│   └── NO → Scrapy with custom parsers
└── NO → Does the page require user interaction?
    ├── YES → Playwright/Puppeteer with interaction scripts
    └── NO → Playwright in non-interactive mode
        └── At scale (>10K pages)? → Crawlee (hybrid mode)
            └── Heavy anti-bot? → Managed service (Firecrawl/ScrapingBee)

Architecture Brief YAML

scraping_project:
  name: ""
  objective: ""  # What data, why, how often
  
  targets:
    - domain: ""
      pages_estimated: 0
      rendering: "static" | "javascript" | "spa"
      anti_bot: "none" | "basic" | "cloudflare" | "advanced"
      rate_limit: ""  # requests per second safe limit
      
  tool_selected: ""
  justification: ""
  
  data_schema:
    fields: []
    output_format: ""  # json | csv | database
    
  schedule:
    frequency: ""  # once | hourly | daily | weekly
    preferred_time: ""  # off-peak for target timezone
    
  infrastructure:
    proxy_needed: false
    proxy_type: ""  # residential | datacenter | mobile
    storage: ""
    monitoring: ""

Phase 3: Request Engineering

HTTP Request Best Practices

# Python example — production request pattern
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()

# Retry strategy
retry = Retry(
    total=3,
    backoff_factor=1,      # 1s, 2s, 4s
    status_forcelist=[429, 500, 502, 503, 504],
    respect_retry_after_header=True
)
session.mount("https://", HTTPAdapter(max_retries=retry))

# Realistic headers
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Cache-Control": "no-cache",
})

Header Rotation Strategy

Rotate these to avoid fingerprinting:

Header Rotation Pool Size Notes
User-Agent 20-50 real browser UAs Match OS distribution
Accept-Language 5-10 locale combos Match proxy geo
Sec-Ch-Ua Match User-Agent Chrome/Edge/Brave
Referer Vary per request Previous page or search engine

Rate Limiting Rules

Site Type Safe Delay Aggressive (risky)
Small business site 5-10 seconds 2-3 seconds
Medium site 2-5 seconds 1-2 seconds
Large platform (Amazon, etc.) 3-5 seconds 1 second
API endpoint Per API docs Never exceed
robots.txt crawl-delay Respect exactly Never below

Rules:

  1. Always respect Crawl-delay in robots.txt
  2. Add random jitter (±30%) to avoid pattern detection
  3. Slow down during business hours for smaller sites
  4. Respect Retry-After headers — they mean it
  5. Watch for 429s — back off exponentially (2x each time)

Phase 4: Parsing & Extraction

CSS Selector Strategy (Priority Order)

  1. Data attributes[data-product-id], [data-price] (most stable)
  2. Semantic IDs#product-title, #price (stable but can change)
  3. ARIA attributes[aria-label="Price"] (accessibility, fairly stable)
  4. Semantic HTMLarticle, main, nav (structural, stable)
  5. Class names.product-card (can change with redesigns)
  6. XPath position//div[3]/span[2] (FRAGILE — last resort)

Extraction Patterns

Structured data first — Check before writing CSS selectors:

# 1. Check JSON-LD (best source — structured, clean)
import json
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
for script in soup.find_all('script', type='application/ld+json'):
    data = json.loads(script.string)
    # Often contains: Product, Article, Organization, etc.

# 2. Check Open Graph meta tags
og_title = soup.find('meta', property='og:title')
og_price = soup.find('meta', property='product:price:amount')

# 3. Check microdata
items = soup.find_all(itemtype=True)

# 4. Fall back to CSS selectors only if above are empty

Table extraction pattern:

import pandas as pd

# Quick table extraction
tables = pd.read_html(html)  # Returns list of DataFrames

# For complex tables with merged cells
def extract_table(soup, selector):
    table = soup.select_one(selector)
    headers = [th.get_text(strip=True) for th in table.select('thead th')]
    rows = []
    for tr in table.select('tbody tr'):
        cells = [td.get_text(strip=True) for td in tr.select('td')]
        rows.append(dict(zip(headers, cells)))
    return rows

Pagination handling:

# Pattern 1: Next button
while True:
    # ... scrape current page ...
    next_link = soup.select_one('a.next-page, [rel="next"], .pagination .next a')
    if not next_link or not next_link.get('href'):
        break
    url = urljoin(base_url, next_link['href'])
    
# Pattern 2: API pagination (infinite scroll sites)
page = 1
while True:
    resp = session.get(f"{api_url}?page={page}&limit=50")
    data = resp.json()
    if not data.get('results'):
        break
    # ... process results ...
    page += 1

# Pattern 3: Cursor-based
cursor = None
while True:
    params = {"limit": 50}
    if cursor:
        params["cursor"] = cursor
    resp = session.get(api_url, params=params)
    data = resp.json()
    # ... process ...
    cursor = data.get('next_cursor')
    if not cursor:
        break

JavaScript-Rendered Content

# Playwright pattern for JS-rendered pages
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        user_agent="Mozilla/5.0 ...",
    )
    page = context.new_page()
    
    # Block unnecessary resources (speed + stealth)
    page.route("**/*.{png,jpg,jpeg,gif,svg,woff,woff2}", 
               lambda route: route.abort())
    
    page.goto(url, wait_until="networkidle")
    
    # Wait for specific content (better than arbitrary sleep)
    page.wait_for_selector('[data-product-id]', timeout=10000)
    
    # Extract after JS rendering
    content = page.content()
    # ... parse with BeautifulSoup/Cheerio ...
    
    browser.close()

Phase 5: Anti-Detection & Stealth

Detection Signals (What Sites Check)

Signal Detection Method Mitigation
IP reputation IP blacklists, datacenter ranges Residential proxies
Request rate Requests/min from same IP Rate limiting + jitter
TLS fingerprint JA3/JA4 hash matching Use real browser or curl-impersonate
Browser fingerprint Canvas, WebGL, fonts Playwright with stealth plugin
JavaScript challenges Cloudflare Turnstile, hCaptcha Managed browser services
Cookie/session behavior Missing cookies, no history Full session management
Navigation pattern Direct URL hits, no referrer Simulate natural browsing
Mouse/keyboard events No interaction telemetry Event simulation (Playwright)
Header consistency Mismatched headers vs UA Header sets that match

Proxy Strategy

proxy_strategy:
  # Tier 1: Free/Datacenter (for non-protected sites)
  basic:
    type: "datacenter"
    cost: "$1-5/GB"
    success_rate: "60-80%"
    use_for: "APIs, small sites, no anti-bot"
    
  # Tier 2: Residential (for most protected sites)
  standard:
    type: "residential"
    cost: "$5-15/GB"
    success_rate: "90-95%"
    use_for: "Cloudflare, major platforms"
    rotation: "per-request or sticky 10min"
    
  # Tier 3: Mobile/ISP (for maximum stealth)
  premium:
    type: "mobile"
    cost: "$15-30/GB"
    success_rate: "95-99%"
    use_for: "Aggressive anti-bot, social media"
    
  rules:
    - Start with cheapest tier, escalate only on blocks
    - Match proxy geo to target audience geo
    - Rotate on 403/429, not every request
    - Use sticky sessions for multi-page scrapes
    - Monitor proxy health — remove slow/blocked IPs

Playwright Stealth Configuration

# Essential stealth for Playwright
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        args=[
            '--disable-blink-features=AutomationControlled',
            '--disable-features=IsolateOrigins,site-per-process',
        ]
    )
    context = browser.new_context(
        viewport={"width": 1920, "height": 1080},
        locale="en-US",
        timezone_id="America/New_York",
        geolocation={"latitude": 40.7128, "longitude": -74.0060},
        permissions=["geolocation"],
    )
    
    # Remove automation indicators
    page = context.new_page()
    page.add_init_script("""
        Object.defineProperty(navigator, 'webdriver', {get: () => undefined});
        Object.defineProperty(navigator, 'plugins', {get: () => [1, 2, 3]});
    """)

Cloudflare Bypass Decision

Cloudflare detected?
├── JS Challenge only → Playwright with stealth + residential proxy
├── Turnstile CAPTCHA → Managed service (ScrapingBee/Bright Data)
├── Under Attack Mode → Wait, try later, or managed service
└── WAF blocking → Different approach needed
    ├── Check for API endpoints (network tab)
    ├── Check for mobile app API
    └── Consider if data is available elsewhere

Phase 6: Data Pipeline & Quality

Data Validation Rules

# Validation pattern — validate BEFORE storing
from dataclasses import dataclass, field
from typing import Optional
import re
from datetime import datetime

@dataclass
class ScrapedProduct:
    url: str
    title: str
    price: Optional[float]
    currency: str = "USD"
    scraped_at: str = field(default_factory=lambda: datetime.utcnow().isoformat())
    
    def validate(self) -> list[str]:
        errors = []
        if not self.url.startswith('http'):
            errors.append("Invalid URL")
        if not self.title or len(self.title) \x3C 3:
            errors.append("Title too short or missing")
        if self.price is not None and self.price \x3C 0:
            errors.append("Negative price")
        if self.price is not None and self.price > 1_000_000:
            errors.append("Price suspiciously high — verify")
        if self.currency not in ("USD", "EUR", "GBP", "BTC"):
            errors.append(f"Unknown currency: {self.currency}")
        return errors

Deduplication Strategy

Method When to Use Implementation
URL-based Pages with unique URLs Hash the canonical URL
Content hash Same URL, changing content MD5/SHA256 of key fields
Fuzzy matching Near-duplicate detection Jaccard similarity > 0.85
Composite key Multi-field uniqueness Hash(domain + product_id + variant)
import hashlib

def dedup_key(item: dict, fields: list[str]) -> str:
    """Generate dedup key from selected fields."""
    values = "|".join(str(item.get(f, "")) for f in fields)
    return hashlib.sha256(values.encode()).hexdigest()

# Usage
seen = set()
for item in scraped_items:
    key = dedup_key(item, ["url", "product_id"])
    if key not in seen:
        seen.add(key)
        clean_items.append(item)

Data Cleaning Pipeline

Raw HTML → Parse → Extract → Validate → Clean → Deduplicate → Store
                                ↓
                          Quarantine (failed validation)

Common cleaning operations:

Problem Solution
HTML entities (&) html.unescape()
Extra whitespace " ".join(text.split())
Unicode issues unicodedata.normalize('NFKD', text)
Price in text ("$49.99") Regex: r'[\$£€]?([\d,]+\.?\d*)'
Date formats vary dateutil.parser.parse() with dayfirst flag
Relative URLs urllib.parse.urljoin(base, relative)
Encoding issues chardet.detect() then decode

Phase 7: Storage & Export

Storage Decision Guide

Volume Frequency Query Needs Recommendation
\x3C10K records One-time None JSON/CSV files
\x3C10K records Recurring Simple lookups SQLite
10K-1M records Recurring Complex queries PostgreSQL
1M+ records Continuous Analytics PostgreSQL + partitioning
Append-only logs Continuous Time-series ClickHouse / TimescaleDB

SQLite Pattern (Most Common)

import sqlite3
import json
from datetime import datetime

def init_db(path="scraper_data.db"):
    conn = sqlite3.connect(path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS items (
            id INTEGER PRIMARY KEY,
            url TEXT UNIQUE,
            data JSON NOT NULL,
            scraped_at TEXT DEFAULT (datetime('now')),
            updated_at TEXT,
            checksum TEXT
        )
    """)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_url ON items(url)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_scraped ON items(scraped_at)")
    return conn

def upsert(conn, url, data, checksum):
    conn.execute("""
        INSERT INTO items (url, data, checksum) VALUES (?, ?, ?)
        ON CONFLICT(url) DO UPDATE SET
            data = excluded.data,
            updated_at = datetime('now'),
            checksum = excluded.checksum
        WHERE items.checksum != excluded.checksum
    """, (url, json.dumps(data), checksum))
    conn.commit()

Export Formats

# CSV export
import csv
def to_csv(items, path, fields):
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(items)

# JSON Lines (best for large datasets — streaming)
def to_jsonl(items, path):
    with open(path, 'w') as f:
        for item in items:
            f.write(json.dumps(item) + '\
')

# Incremental export (only new/changed since last export)
def export_since(conn, last_export_time):
    cursor = conn.execute(
        "SELECT data FROM items WHERE scraped_at > ? OR updated_at > ?",
        (last_export_time, last_export_time)
    )
    return [json.loads(row[0]) for row in cursor]

Phase 8: Error Handling & Resilience

Error Classification

HTTP Code Meaning Action
200 Success Process normally
301/302 Redirect Follow (max 5 hops)
403 Forbidden/blocked Rotate proxy, slow down
404 Not found Log, skip, mark URL dead
429 Rate limited Respect Retry-After, back off 2x
500-504 Server error Retry 3x with backoff
Connection timeout Network issue Retry with different proxy
SSL error Certificate issue Log, investigate, skip

Circuit Breaker Pattern

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=300):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.last_failure = 0
        self.state = "closed"  # closed | open | half-open
    
    def record_failure(self):
        self.failures += 1
        self.last_failure = time.time()
        if self.failures >= self.threshold:
            self.state = "open"
            # Alert: "Circuit open — too many failures"
    
    def record_success(self):
        self.failures = 0
        self.state = "closed"
    
    def can_proceed(self):
        if self.state == "closed":
            return True
        if self.state == "open":
            if time.time() - self.last_failure > self.reset_timeout:
                self.state = "half-open"
                return True  # Try one request
            return False
        return True  # half-open: allow attempt

Checkpoint & Resume

import json
from pathlib import Path

class Checkpointer:
    def __init__(self, path="checkpoint.json"):
        self.path = Path(path)
        self.state = self._load()
    
    def _load(self):
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {"completed_urls": [], "last_page": 0, "cursor": None}
    
    def save(self):
        self.path.write_text(json.dumps(self.state))
    
    def is_done(self, url):
        return url in self.state["completed_urls"]
    
    def mark_done(self, url):
        self.state["completed_urls"].append(url)
        if len(self.state["completed_urls"]) % 50 == 0:
            self.save()  # Periodic save

Phase 9: Monitoring & Operations

Scraper Health Dashboard

dashboard:
  real_time:
    - metric: "requests_per_minute"
      alert_if: "> 60 for small sites"
    - metric: "success_rate"
      alert_if: "\x3C 90%"
    - metric: "avg_response_time_ms"
      alert_if: "> 5000"
    - metric: "blocked_rate"
      alert_if: "> 10%"
      
  per_run:
    - metric: "pages_scraped"
    - metric: "items_extracted"
    - metric: "items_validated"
    - metric: "items_deduplicated"
    - metric: "new_items"
    - metric: "updated_items"
    - metric: "errors_by_type"
    - metric: "run_duration"
    - metric: "proxy_cost"
    
  weekly:
    - metric: "data_freshness"
      description: "% of records updated in last 7 days"
    - metric: "site_structure_changes"
      description: "Selectors that stopped matching"
    - metric: "total_cost"
      description: "Proxy + compute + storage"

Breakage Detection

Sites redesign. Selectors break. Detect it early:

def health_check(results: list[dict], expected_fields: list[str]) -> dict:
    """Check if scraper is still extracting correctly."""
    total = len(results)
    if total == 0:
        return {"status": "CRITICAL", "message": "Zero results — likely broken"}
    
    field_coverage = {}
    for field in expected_fields:
        filled = sum(1 for r in results if r.get(field))
        coverage = filled / total
        field_coverage[field] = coverage
        
    issues = []
    for field, coverage in field_coverage.items():
        if coverage \x3C 0.5:
            issues.append(f"{field}: {coverage:.0%} fill rate (expected >50%)")
    
    if issues:
        return {"status": "WARNING", "issues": issues}
    return {"status": "OK", "field_coverage": field_coverage}

Operational Runbook

Daily:

  • Check success rate per target domain
  • Review error logs for new patterns
  • Verify data freshness

Weekly:

  • Compare extraction counts vs baseline (>20% drop = investigate)
  • Review proxy spend
  • Spot-check 10 random records for accuracy

Monthly:

  • Full selector validation against live pages
  • Review legal compliance (robots.txt changes, ToS updates)
  • Cost optimization review
  • Prune dead URLs from queue

Phase 10: Common Scraping Patterns

Pattern 1: E-commerce Price Monitor

use_case: "Track competitor prices daily"
tool: "requests + BeautifulSoup"
schedule: "Daily at 03:00 UTC (off-peak)"
targets: ["competitor-a.com/products", "competitor-b.com/api"]
data:
  - product_id
  - product_name
  - price
  - currency
  - in_stock
  - scraped_at
storage: "SQLite with price history"
alerts: "Price change > 10% → notify"

Pattern 2: Job Board Aggregator

use_case: "Aggregate job listings from multiple boards"
tool: "Scrapy with per-site spiders"
schedule: "Every 6 hours"
targets: ["board-a.com", "board-b.com", "board-c.com"]
data:
  - title
  - company
  - location
  - salary_range
  - posted_date
  - url
  - source
dedup: "Hash(title + company + location)"
storage: "PostgreSQL"

Pattern 3: News & Content Monitor

use_case: "Monitor industry news mentions"
tool: "requests + RSS feeds (preferred) + web fallback"
schedule: "Every 30 minutes"
approach:
  1: "RSS/Atom feeds (fastest, cleanest)"
  2: "Google News RSS for topic"
  3: "Direct scraping if no feed"
data:
  - headline
  - source
  - url
  - published_at
  - snippet
  - sentiment
alerts: "Keyword match → immediate notification"

Pattern 4: Social Media Intelligence

use_case: "Track brand mentions and sentiment"
tool: "Official APIs (always) + web search fallback"
rules:
  - NEVER scrape social platforms directly — use APIs
  - Twitter/X: Official API ($100/mo basic)
  - Reddit: Official API (free tier available)
  - LinkedIn: No scraping (aggressive legal action)
  - Instagram: Official API only (Meta Business)
fallback: "Brave/Google search for public mentions"

Pattern 5: Real Estate Listings

use_case: "Track property listings and prices"
tool: "Playwright (most listing sites are JS-heavy)"
schedule: "Daily"
challenges:
  - Heavy JavaScript rendering
  - Anti-bot measures (Cloudflare common)
  - Frequent layout changes
  - Map-based results
approach: "API endpoint discovery via network tab first"

Phase 11: Scaling Strategies

Concurrency Architecture

Single machine (small scale):
├── asyncio + aiohttp (Python) → 50-200 concurrent requests
├── Worker pool (ThreadPoolExecutor) → 10-50 threads
└── Scrapy reactor → Built-in concurrency

Multi-machine (large scale):
├── URL queue: Redis / RabbitMQ / SQS
├── Workers: Multiple Scrapy/custom workers
├── Results: Shared PostgreSQL / S3
└── Coordinator: Celery / custom scheduler

Cost Optimization

Lever Impact How
Static > Browser 10-50x cheaper Always try HTTP first
Block images/CSS/fonts 60-80% bandwidth saved Route filtering
Cache DNS Minor but cumulative Local DNS cache
Compress responses 50-70% bandwidth Accept-Encoding: gzip, br
Smart scheduling Avoid redundant scrapes Change detection before full re-scrape
Proxy tier matching 3-10x cost difference Don't use residential for easy sites

Phase 12: Advanced Patterns

API Discovery (Network Tab Mining)

Before building a scraper, check if the site has hidden API endpoints:

  1. Open DevTools → Network tab
  2. Filter by XHR/Fetch
  3. Navigate the site, click load-more, filter/sort
  4. Look for JSON responses — these are your goldmine
  5. Most SPAs load data via REST/GraphQL APIs

Common hidden API patterns:

  • /api/v1/products?page=1&limit=20
  • /graphql with query parameters
  • /_next/data/... (Next.js data routes)
  • /wp-json/wp/v2/posts (WordPress)

Headless Browser Optimization

# Minimize browser resource usage
context = browser.new_context(
    viewport={"width": 1280, "height": 720},
    java_script_enabled=True,  # Only if needed
    has_touch=False,
    is_mobile=False,
)

# Block resource types you don't need
page.route("**/*", lambda route: (
    route.abort() if route.request.resource_type in 
    ["image", "stylesheet", "font", "media"] 
    else route.continue_()
))

Scraping Behind Authentication

# When authorized to scrape behind login
# ALWAYS use session-based auth, never store passwords in code

# Pattern: Login once, reuse session
session = requests.Session()
login_resp = session.post("https://example.com/login", data={
    "username": os.environ["SCRAPE_USER"],
    "password": os.environ["SCRAPE_PASS"],
})
assert login_resp.ok, "Login failed"

# Session cookies are now stored — use for subsequent requests
data_resp = session.get("https://example.com/api/data")

Change Detection (Avoid Redundant Scrapes)

import hashlib

def has_changed(url, session, last_etag=None, last_modified=None):
    """Check if page changed without downloading full content."""
    headers = {}
    if last_etag:
        headers["If-None-Match"] = last_etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    
    resp = session.head(url, headers=headers)
    
    if resp.status_code == 304:
        return False, resp.headers.get("ETag"), resp.headers.get("Last-Modified")
    
    return True, resp.headers.get("ETag"), resp.headers.get("Last-Modified")

Quality Scoring Rubric (0-100)

Dimension Weight What to Assess
Legal compliance 20% robots.txt, ToS, PII handling, audit trail
Data quality 20% Validation, accuracy, completeness, freshness
Resilience 15% Error handling, retries, circuit breakers, checkpointing
Anti-detection 15% Proxy rotation, fingerprint diversity, rate limiting
Architecture 10% Right tool selection, clean code, modularity
Monitoring 10% Success rates, breakage detection, alerting
Performance 5% Speed, cost efficiency, resource usage
Documentation 5% Runbook, schema docs, legal assessment

Grading: 90+ Excellent | 75-89 Good | 60-74 Needs work | \x3C60 Redesign


10 Common Mistakes

# Mistake Fix
1 No robots.txt check Always check first — it's your legal defense
2 Fixed delays (no jitter) Add ±30% random jitter to all delays
3 No data validation Validate every field before storing
4 Using browser for static HTML HTTP client is 10-50x faster and cheaper
5 Single IP, no rotation Proxy rotation for any serious scraping
6 No breakage detection Monitor extraction counts and field fill rates
7 Storing raw HTML only Extract + structure immediately
8 No checkpoint/resume Long scrapes must be resumable
9 Ignoring structured data JSON-LD/microdata is cleaner than CSS selectors
10 Scraping when API exists Always check for API first

5 Edge Cases

  1. Single-page apps (React/Vue/Angular): Must use browser rendering OR find the underlying API (network tab). Prefer API discovery — it's faster and more reliable.

  2. Infinite scroll: Intercept the XHR/fetch calls that load more content. Simulate scrolling only as last resort. The API endpoint usually accepts page or offset params.

  3. CAPTCHAs: If you're hitting CAPTCHAs, you're scraping too aggressively. Slow down first. If CAPTCHAs persist: managed services (2Captcha, Anti-Captcha) or rethink approach.

  4. Dynamic class names (CSS modules, Tailwind): Use data attributes, ARIA labels, or text content selectors instead. [data-testid="price"] survives redesigns. .sc-bdVTJa does not.

  5. Multi-language sites: Detect language via html[lang] attribute. Set Accept-Language header to get desired locale. Watch for different URL structures (/en/, /de/, subdomains).


Natural Language Commands

  1. "Check if I can scrape [URL]" → Run compliance checklist (robots.txt, ToS, data type)
  2. "What tool should I use for [site]?" → Analyze site rendering, anti-bot, recommend tool
  3. "Build a scraper for [description]" → Full architecture brief + code pattern
  4. "My scraper is getting blocked" → Anti-detection diagnostic + proxy/stealth recommendations
  5. "Extract [data] from [URL]" → Check structured data first, then CSS selectors
  6. "Monitor [site] for changes" → Change detection + scheduling + alerting setup
  7. "How do I handle pagination on [site]?" → Identify pagination type + code pattern
  8. "Scrape at scale ([N] pages)" → Concurrency architecture + cost estimate
  9. "Clean and store this scraped data" → Validation + dedup + storage recommendation
  10. "Is my scraper healthy?" → Run health check + breakage detection
  11. "Find the API behind [site]" → Network tab mining guide + common patterns
  12. "Set up price monitoring for [competitors]" → Full e-commerce monitor pattern
安全使用建议
This skill is a comprehensive how-to for building scrapers and includes detailed techniques for evading anti-bot measures. That makes it helpful for legitimate engineering but also increases legal and ethical risk. Before installing: (1) Confirm you have a lawful use case and get legal review for target jurisdictions; (2) prefer using official APIs where available and follow robots.txt/ToS; (3) do not rely on this to bypass authentication or protections—doing so may violate laws (CFAA, GDPR, etc.); (4) do not provide the agent with credentials or broad network access unless you trust the agent and audit its actions; (5) if you proceed, restrict the agent’s permissions, enable logging/alerts, and avoid giving it the ability to autonomously execute network operations or provision third-party services without human-in-the-loop approval. If you need only compliance/architecture guidance, consider extracting those sections and avoiding the anti-detection recommendations.
功能分析
Type: OpenClaw Skill Name: afrexai-web-scraping-engine Version: 1.0.0 The skill bundle provides a comprehensive methodology for web scraping, including legal compliance, architecture, anti-detection, data pipelines, and operations. While it emphasizes ethical and legal scraping, it demonstrates powerful capabilities such as extensive network access, file I/O (SQLite, CSV, JSONL), browser automation with Playwright (including JavaScript injection for stealth), and accessing environment variables for authentication (`os.environ['SCRAPE_USER']`, `os.environ['SCRAPE_PASS']`). The 'anti-detection & stealth' techniques, though explained as necessary for effective scraping, are dual-use and could be repurposed. The instruction for the AI agent to 'Open DevTools → Network tab' for API discovery implies significant control over the browser environment. These capabilities, while plausibly needed for the stated purpose, are high-risk and could be exploited if the agent is given malicious instructions by a user, leading to a 'suspicious' classification rather than 'benign'. There is no clear evidence of intentional malicious behavior within the skill bundle itself, such as exfiltration to unauthorized endpoints or installation of backdoors.
能力评估
Purpose & Capability
Name and description align with the SKILL.md content: comprehensive guidance for building scrapers, architecture choices, data pipelines, and anti-detection strategies. No declared credentials or installs are required, which is consistent with an instruction-only methodology skill.
Instruction Scope
The instructions go beyond benign best-practices and include explicit anti-detection tactics (proxy rotation, fingerprint diversity, stealth configs, Cloudflare bypass references and managed 'anti-bot bypass' providers). That content can facilitate evading site protections and accessing data behind defenses; this raises legal and ethical risk even if the skill frames compliance checks first.
Install Mechanism
Instruction-only skill with no install spec and no code files — nothing is written to disk or downloaded by the skill itself, which lowers technical supply-chain risk.
Credentials
The skill declares no required environment variables or credentials (proportionate for a methodology document). However, it repeatedly recommends third-party proxy and scraping services (Bright Data, Oxylabs, ScrapingBee, etc.) that in practice require credentials and billing; those are not declared in the skill metadata, so users must be mindful to supply and protect such secrets outside the skill.
Persistence & Privilege
always:false and no special privileges requested. Autonomous invocation is allowed (platform default) — combined with the anti-detection guidance this increases potential misuse, but the skill itself does not request persistent or cross-skill configuration changes.
如何使用
  1. 确保已安装 OpenClaw(本地或 Docker 部署)
  2. 在对话框中输入安装命令:/install afrexai-web-scraping-engine
  3. 安装完成后,直接呼叫该 Skill 的名称或使用 /afrexai-web-scraping-engine 触发
  4. 根据 Skill 的参数说明提供必要输入,即可获得结构化输出
版本历史
v1.0.0
Initial release of the Web Scraping & Data Extraction Engine: - Provides a comprehensive methodology covering legal compliance, scraper architecture, anti-detection techniques, data pipelines, and operational best practices. - Includes a quick health check scoring system to assess the production readiness of scraping projects. - Offers detailed legal guidance, decision rules, and risk assessment based on current regulations and case law. - Presents an architecture decision matrix and decision tree for optimal tool selection based on site complexity and anti-bot measures. - Shares request engineering best practices, including header rotation, rate limiting, and retry strategies. - Gives practical guidance for data extraction, error handling, monitoring, storage, and scheduling for scalable web data collection.
元数据
Slug afrexai-web-scraping-engine
版本 1.0.0
许可证
累计安装 5
当前安装数 5
历史版本数 1
常见问题

Web Scraping & Data Extraction Engine 是什么?

Complete web scraping methodology — legal compliance, architecture design, anti-detection, data pipelines, and production operations. Use when building scrap... 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件,目前累计下载 1398 次。

如何安装 Web Scraping & Data Extraction Engine?

在 OpenClaw 或 Claude Code 对话框中运行命令「/install afrexai-web-scraping-engine」即可一键安装,无需额外配置。

Web Scraping & Data Extraction Engine 是免费的吗?

是的,Web Scraping & Data Extraction Engine 完全免费(开源免费),可自由下载、安装和使用。

Web Scraping & Data Extraction Engine 支持哪些平台?

Web Scraping & Data Extraction Engine 跨平台运行,可在任意部署了 OpenClaw / Claude Code 的环境中使用(cross-platform)。

谁开发了 Web Scraping & Data Extraction Engine?

由 1kalin(@1kalin)开发并维护,当前版本 v1.0.0。

💬 留言讨论