
Scrapling Web Scraping

Author: nidhov01 · GitHub · v0.4.2 · MIT-0
cross-platform ✓ Security check passed
Total downloads: 232
Favorites: 0
Current installs: 0
Versions: 1
Install in OpenClaw
/install nidhov01-scrapling-web-scraping
Description
Advanced web scraping with Scrapling — MCP-native guidance for extraction, crawling, and anti-bot handling. Use via mcporter (MCP) to call the `scrapling` MC...
Usage Instructions (SKILL.md)

Scrapling MCP — Web Scraping Guidance

Guidance Layer + MCP Integration
Use this skill for strategy and patterns. For execution, call Scrapling's MCP server via mcporter.

Quick Start (MCP)

1. Install Scrapling with MCP support

# Quote the extras so shells such as zsh do not treat the brackets as a glob
pip install "scrapling[mcp]"
# Or for full features:
pip install "scrapling[mcp,playwright]"
python -m playwright install chromium

2. Add to OpenClaw MCP config

{
  "mcpServers": {
    "scrapling": {
      "command": "python",
      "args": ["-m", "scrapling.mcp"]
    }
  }
}
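
The entry above can also be merged into an existing config by script rather than by hand. This is a minimal sketch, assuming the config is a JSON file with a top-level `mcpServers` key as shown above; `add_mcp_server` and `update_config_file` are hypothetical helpers, not part of OpenClaw or Scrapling, and the file path is whatever your setup uses.

```python
import json
from pathlib import Path


def add_mcp_server(config: dict, name: str, command: str, args: list) -> dict:
    """Return a copy of config with one server entry added under "mcpServers"."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers[name] = {"command": command, "args": list(args)}
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path) -> None:
    """Read, merge, and rewrite a config file (hypothetical path, supplied by the caller)."""
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    updated = add_mcp_server(existing, "scrapling", "python", ["-m", "scrapling.mcp"])
    path.write_text(json.dumps(updated, indent=2) + "\n", encoding="utf-8")


if __name__ == "__main__":
    # Show the merged result for an empty starting config.
    merged = add_mcp_server({}, "scrapling", "python", ["-m", "scrapling.mcp"])
    print(json.dumps(merged, indent=2))
```

Merging into a copy keeps any servers that are already configured, which hand-editing can easily clobber.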

3. Call via mcporter

mcporter call scrapling fetch_page --url "https://example.com"

Execution vs Guidance

Task                   | Tool       | Example
-----------------------|------------|------------------------------------------------------------
Fetch a page           | mcporter   | mcporter call scrapling fetch_page --url URL
Extract with CSS       | mcporter   | mcporter call scrapling css_select --selector ".title::text"
Which fetcher to use?  | This skill | See "Fetcher Selection Guide" below
Anti-bot strategy?     | This skill | See "Anti-Bot Escalation Ladder"
Complex crawl patterns?| This skill | See "Spider Recipes"

Fetcher Selection Guide

┌─────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│   Fetcher       │────▶│ DynamicFetcher   │────▶│ StealthyFetcher  │
│   (HTTP)        │     │ (Browser/JS)     │     │ (Anti-bot)       │
└─────────────────┘     └──────────────────┘     └──────────────────┘
     Fastest              JS-rendered               Cloudflare, 
     Static pages         SPAs, React/Vue          Turnstile, etc.

Decision Tree

  1. Static HTML? → Fetcher (10-100x faster)
  2. Need JS execution? → DynamicFetcher
  3. Getting blocked? → StealthyFetcher
  4. Complex session? → Use Session variants
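
The decision tree can be written down as a small function. This is only a sketch of the selection logic: `choose_fetcher` is a hypothetical helper that returns the class names used in this guide as strings and does not import or call Scrapling.

```python
def choose_fetcher(needs_js: bool, blocked: bool, needs_session: bool = False) -> str:
    """Map the decision tree to a fetcher class name (names as used in this guide)."""
    if blocked:
        base = "Stealthy"   # step 3: getting blocked
    elif needs_js:
        base = "Dynamic"    # step 2: needs JS execution
    else:
        base = ""           # step 1: static HTML

    if needs_session:
        # Step 4: session variants named in this guide are
        # FetcherSession, DynamicSession, and StealthySession.
        return f"{base or 'Fetcher'}Session"
    return f"{base}Fetcher"


if __name__ == "__main__":
    print(choose_fetcher(needs_js=False, blocked=False))
    print(choose_fetcher(needs_js=True, blocked=False, needs_session=True))
```

Checking `blocked` first reflects the ordering of the diagram: the stealthy fetcher is the last stop regardless of whether the page also needs JavaScript.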

MCP Fetch Modes

  • fetch_page — HTTP fetcher
  • fetch_dynamic — Browser-based with Playwright
  • fetch_stealthy — Anti-bot bypass mode

Anti-Bot Escalation Ladder

Level 1: Polite HTTP

# MCP call: fetch_page with options
{
  "url": "https://example.com",
  "headers": {"User-Agent": "..."},
  "delay": 2.0
}

Level 2: Session Persistence

# Use sessions for cookie/state across requests
FetcherSession(impersonate="chrome")  # TLS fingerprint spoofing

Level 3: Stealth Mode

# MCP: fetch_stealthy
StealthyFetcher.fetch(
    url,
    headless=True,
    solve_cloudflare=True,  # Auto-solve Turnstile
    network_idle=True
)

Level 4: Proxy Rotation

See references/proxy-rotation.md
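
The ladder amounts to "try the cheapest level first and move up only when blocked". A minimal sketch of that control flow follows. It assumes each level is wrapped in a callable returning `(status_code, body)` and that 403, 429, and 503 mean "blocked"; both are illustrative assumptions, and the stand-in fetchers below do not touch the network or Scrapling.

```python
from typing import Callable, Optional, Sequence, Tuple

# A fetch attempt takes a URL and returns (status_code, body).
FetchFn = Callable[[str], Tuple[int, str]]

# Assumption for this sketch: these statuses are treated as "blocked".
BLOCKED_STATUSES = {403, 429, 503}


def fetch_with_escalation(
    url: str, ladder: Sequence[Tuple[str, FetchFn]]
) -> Tuple[Optional[str], Optional[str]]:
    """Try each level in order and stop at the first one that is not blocked.

    Returns (level_name, body), or (None, None) if every level was blocked.
    """
    for level_name, fetch in ladder:
        status, body = fetch(url)
        if status not in BLOCKED_STATUSES:
            return level_name, body
    return None, None


if __name__ == "__main__":
    # Stand-in fetchers for illustration only; real ones would wrap the MCP tools.
    def polite_http(url: str) -> Tuple[int, str]:
        return 403, ""

    def stealth(url: str) -> Tuple[int, str]:
        return 200, "<html>ok</html>"

    ladder = [("polite_http", polite_http), ("stealth", stealth)]
    print(fetch_with_escalation("https://example.com", ladder))
```

Because higher levels are only called after a lower one fails, the browser-based modes are started only for the sites that need them.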

Adaptive Scraping (Anti-Fragile)

Scrapling can survive website redesigns using adaptive selectors:

# First run — save fingerprints
products = page.css('.product', auto_save=True)

# Later runs — auto-relocate if DOM changed
products = page.css('.product', adaptive=True)

MCP usage:

mcporter call scrapling css_select \
  --selector ".product" \
  --adaptive true \
  --auto-save true

Spider Framework (Large Crawls)

When to use Spiders vs direct fetching:

  • Spider: 10+ pages, concurrency needed, resume capability, proxy rotation
  • Direct: 1-5 pages, quick extraction, simple flow

Basic Spider Pattern

from scrapling.spiders import Spider, Response

class ProductSpider(Spider):
    name = "products"
    start_urls = ["https://example.com/products"]
    concurrent_requests = 10
    download_delay = 1.0
    
    async def parse(self, response: Response):
        for product in response.css('.product'):
            yield {
                "name": product.css('h2::text').get(),
                "price": product.css('.price::text').get(),
                "url": response.url
            }
        
        # Follow pagination
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

# Run with resume capability
result = ProductSpider(crawldir="./crawl_data").start()
result.items.to_jsonl("products.jsonl")

Advanced: Multi-Session Spider

from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]
    
    def configure_sessions(self, manager):
        manager.add("fast", FetcherSession(impersonate="chrome"))
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
    
    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            if "/protected/" in link:
                yield Request(link, sid="stealth")
            else:
                yield Request(link, sid="fast")

Spider Features

  • Pause/Resume: crawldir parameter saves checkpoints
  • Streaming: async for item in spider.stream() for real-time processing
  • Auto-retry: Configurable retry on blocked requests
  • Export: Built-in to_json(), to_jsonl()

CLI & Interactive Shell

Terminal Extraction (No Code)

# Extract to markdown
scrapling extract get 'https://example.com' content.md

# Extract specific element
scrapling extract get 'https://example.com' content.txt \
  --css-selector '.article' \
  --impersonate 'chrome'

# Stealth mode
scrapling extract stealthy-fetch 'https://protected.com' content.md \
  --no-headless \
  --solve-cloudflare

Interactive Shell

scrapling shell

# Inside shell:
>>> page = Fetcher.get('https://example.com')
>>> page.css('h1::text').get()
>>> page.find_all('div', class_='item')

Parser API (Beyond CSS/XPath)

BeautifulSoup-Style Methods

import re

# Find by attributes
page.find_all('div', {'class': 'product', 'data-id': True})
page.find_all('div', class_='product', id=re.compile(r'item-\d+'))

# Text search
page.find_by_text('Add to Cart', tag='button')
page.find_by_regex(r'\$\d+\.\d{2}')

# Navigation
first = page.css('.product')[0]
parent = first.parent
siblings = first.next_siblings
children = first.children

# Similarity
similar = first.find_similar()  # Find visually/structurally similar elements
below = first.below_elements()  # Elements below in DOM
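
The price pattern `\$\d+\.\d{2}` used in the text search is an ordinary regular expression, so it can be tried out with the standard `re` module before it is handed to the parser. The sample text here is made up, and the snippet does not use Scrapling.

```python
import re

# A dollar sign, one or more digits, a dot, then exactly two digits.
PRICE_PATTERN = re.compile(r"\$\d+\.\d{2}")

sample_text = "Widget $19.99 (was $24.50), shipping $5"
prices = PRICE_PATTERN.findall(sample_text)
print(prices)
```

Only amounts written with two decimal places match, so "$5" in the sample is skipped; loosen the pattern if the target site omits cents.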

Auto-Generated Selectors

# Get robust selector for any element
element = page.css('.product')[0]
selector = element.auto_css_selector()  # Returns stable CSS path
xpath = element.auto_xpath()

Proxy Rotation

from scrapling.fetchers import FetcherSession
from scrapling.spiders import ProxyRotator

# Cyclic rotation
rotator = ProxyRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://user:pass@proxy3:8080"
], strategy="cyclic")

# Use with any session
with FetcherSession(proxy=rotator.next()) as session:
    page = session.get('https://example.com')
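
For readers unfamiliar with the term, "cyclic" rotation means handing out proxies round-robin and wrapping around at the end of the list. The sketch below shows that behaviour using only the standard library. `CyclicProxyPool` is a hypothetical stand-in written for illustration; it is not Scrapling's `ProxyRotator` and makes no claim about how that class is implemented.

```python
from itertools import cycle
from typing import Iterable, Iterator


class CyclicProxyPool:
    """Hand out proxies in round-robin order (illustrative stand-in only)."""

    def __init__(self, proxies: Iterable[str]) -> None:
        self._proxies = list(proxies)
        if not self._proxies:
            raise ValueError("at least one proxy is required")
        self._cycle: Iterator[str] = cycle(self._proxies)

    def next(self) -> str:
        # Calls the built-in next() on the underlying cycle iterator.
        return next(self._cycle)


if __name__ == "__main__":
    pool = CyclicProxyPool(["http://proxy1:8080", "http://proxy2:8080"])
    for _ in range(3):
        print(pool.next())
```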

Common Recipes

Pagination Patterns

# Page numbers
for page_num in range(1, 11):
    url = f"https://example.com/products?page={page_num}"
    ...

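# Next button (inside a Spider.parse callback; each response is checked once,
# because looping on the same response would never terminate)
if next_page := response.css('.next a::attr(href)').get():
    yield response.follow(next_page)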

# Infinite scroll (DynamicFetcher)
with DynamicSession() as session:
    page = session.fetch(url)
    page.scroll_to_bottom()
    items = page.css('.item').getall()
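
For the page-number pattern, building the URLs with `urllib.parse` instead of string formatting keeps any existing query parameters intact. This is a stdlib-only sketch; `page_urls` and the default parameter name `page` are assumptions chosen for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(base_url: str, first: int, last: int, param: str = "page") -> list:
    """Return base_url once per page number, replacing any existing page parameter."""
    parts = urlsplit(base_url)
    # Keep every query parameter except the one we are about to set.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != param]
    urls = []
    for number in range(first, last + 1):
        new_query = urlencode(query + [(param, str(number))])
        urls.append(urlunsplit(parts._replace(query=new_query)))
    return urls


if __name__ == "__main__":
    for url in page_urls("https://example.com/products?sort=price", 1, 3):
        print(url)
```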

Login Sessions

with StealthySession(headless=False) as session:
    # Login
    login_page = session.fetch('https://example.com/login')
    login_page.fill('input[name="username"]', 'user')
    login_page.fill('input[name="password"]', 'pass')
    login_page.click('button[type="submit"]')
    
    # Now session has cookies
    protected_page = session.fetch('https://example.com/dashboard')

Next.js Data Extraction

# Extract JSON from __NEXT_DATA__
import json
import re

next_data = json.loads(
    re.search(
        r'__NEXT_DATA__" type="application/json">(.*?)</script>',
        page.html_content,
        re.S
    ).group(1)
)
props = next_data['props']['pageProps']
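
The extraction above raises an `AttributeError` when the page has no `__NEXT_DATA__` tag, because `re.search` returns `None`. A slightly more defensive sketch is shown below, run against a made-up HTML string so it works without fetching anything. `extract_next_data` is a hypothetical helper, and the pattern assumes the script tag's attributes appear in the order shown.

```python
import json
import re

# Assumes the attribute order id, then type, as in the sample below.
NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    re.S,
)


def extract_next_data(html: str):
    """Return the parsed __NEXT_DATA__ payload, or None if the tag is absent."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


# Made-up page content for illustration.
sample_html = (
    '<html><body><div id="__next"></div>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"title": "Hello"}}}'
    "</script></body></html>"
)

data = extract_next_data(sample_html)
print(data["props"]["pageProps"]["title"])
```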

Output Formats

# JSON (pretty)
result.items.to_json('output.json')

# JSONL (streaming, one per line)
result.items.to_jsonl('output.jsonl')

# Python objects
for item in result.items:
    print(item['title'])
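
JSONL is simply one JSON object per line, which is why it suits streaming: items can be appended and read back one at a time. The stdlib-only sketch below illustrates the format. `write_jsonl` and `read_jsonl` are hypothetical helpers and are not how Scrapling's `to_jsonl()` is implemented.

```python
import io
import json
from typing import Iterable, Iterator, Mapping, TextIO


def write_jsonl(items: Iterable[Mapping], fp: TextIO) -> int:
    """Write one JSON object per line and return the number of items written."""
    count = 0
    for item in items:
        fp.write(json.dumps(item, ensure_ascii=False) + "\n")
        count += 1
    return count


def read_jsonl(fp: TextIO) -> Iterator[dict]:
    """Yield one parsed object per non-empty line."""
    for line in fp:
        if line.strip():
            yield json.loads(line)


if __name__ == "__main__":
    buffer = io.StringIO()
    write_jsonl([{"title": "A"}, {"title": "B"}], buffer)
    print(buffer.getvalue(), end="")
```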

Performance Tips

  1. Use HTTP fetcher when possible: 10-100x faster than browser
  2. Impersonate browsers: impersonate='chrome' for TLS fingerprinting
  3. HTTP/3 support: FetcherSession(http3=True)
  4. Limit resources: disable_resources=True in Dynamic/Stealthy
  5. Connection pooling: reuse sessions across requests

Guardrails (Always)

  • Only scrape content you're authorized to access
  • Respect robots.txt and ToS
  • Add delays (download_delay) for large crawls
  • Don't bypass paywalls or authentication without permission
  • Never scrape personal/sensitive data
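
The robots.txt guardrail can be checked in code with Python's standard `urllib.robotparser`. The sketch below parses an inline, made-up robots.txt so that it runs offline; in practice the file would be fetched from the target site first. The user agent name `my-crawler` is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
allowed_public = parser.can_fetch("my-crawler", "https://example.com/products")
allowed_private = parser.can_fetch("my-crawler", "https://example.com/private/data")
delay = parser.crawl_delay("my-crawler")

print(allowed_public, allowed_private, delay)
```

The crawl delay reported here is a natural lower bound for the `download_delay` setting mentioned above.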

References

  • references/mcp-setup.md — Detailed MCP configuration
  • references/anti-bot.md — Anti-bot handling strategies
  • references/proxy-rotation.md — Proxy setup and rotation
  • references/spider-recipes.md — Advanced crawling patterns
  • references/api-reference.md — Quick API reference
  • references/links.md — Official docs links

Scripts

  • scripts/scrapling_scrape.py — Quick one-off extraction
  • scripts/scrapling_smoke_test.py — Test connectivity and anti-bot indicators
Security Recommendations
This package appears coherent with a web-scraping helper, but review these before installing or running:
  1. Legal/ethical: only scrape sites you are authorized to access. The skill documents anti-bot and proxy techniques that can be abused; do not use them to evade protections.
  2. Package provenance: the SKILL references a GitHub repo. Verify the upstream project and author before pip installing, and check that the package version matches the skill metadata.
  3. Proxy credentials: proxy examples may include user:pass. Never store secrets in plaintext or share them with the skill unless you understand where they are used.
  4. Isolation: run scrapling and Playwright in an isolated environment (sandbox/VM/CI), because the scripts will fetch arbitrary URLs and write files (crawldir, downloads).
  5. Review inconsistencies: the included _meta.json shows a different ownerId/version than the registry metadata. Confirm which source/version you trust.
If you need higher assurance, inspect the upstream scrapling code on the referenced GitHub and run the helper scripts locally in a controlled environment before enabling the skill for autonomous use.
Functional Analysis
Type: OpenClaw Skill
Name: nidhov01-scrapling-web-scraping
Version: 0.4.2
The skill bundle provides a legitimate and well-documented integration for the Scrapling web scraping library via the Model Context Protocol (MCP). It includes helper scripts (scrapling_scrape.py and scrapling_smoke_test.py) for one-off extractions and connectivity testing, along with extensive documentation on anti-bot strategies and spider patterns. The bundle includes explicit ethical guardrails in SKILL.md, instructing the agent to respect robots.txt and avoid scraping sensitive personal data. No evidence of malicious intent, data exfiltration, or unauthorized execution was found.
Capability Assessment
Purpose & Capability
Name/description match the contents: the SKILL.md, reference docs, and scripts all focus on scrapling usage, MCP setup, fetcher selection, anti-bot escalation, proxy rotation, and spider recipes. The included Python helpers import scrapling.fetchers as expected. There are no unrelated environment variables, binaries, or surprising install requirements declared.
Instruction Scope
Runtime instructions and examples show network activity (fetch_page, fetch_dynamic, fetch_stealthy), proxy rotation, and anti-bot bypass features — all coherent with a scraping skill. The docs explicitly advise authorization and include 'do not bypass paywalls' guidance. Note: the instructions and scripts will fetch arbitrary URLs and write crawl/download dirs (e.g., crawldir, downloads), so running them can retrieve remote HTML/assets and store them locally; this is expected but worth being aware of.
Install Mechanism
This is instruction-only (no install spec). The README suggests installing via pip (pip install scrapling[...] and playwright). That is a standard, traceable install mechanism; the skill itself does not download arbitrary remote archives or run unusual installers.
Credentials
The skill does not declare required environment variables or credentials. Docs show an optional PYTHONPATH env in an MCP config example and proxy examples that may include user:pass proxies (these are examples only). There is no unexplained request for tokens/keys in package metadata.
Persistence & Privilege
always is false and the skill does not request any elevated platform privileges. It does not modify other skills' configurations. The skill will run code when invoked (including network fetches) but does not demand permanent inclusion or special privileges.
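How to Use
  1. Make sure OpenClaw is installed (locally or via Docker).
  2. Enter the install command in the chat box: /install nidhov01-scrapling-web-scraping
  3. Once installed, invoke the Skill by name or trigger it with /nidhov01-scrapling-web-scraping
  4. Provide the inputs described in the Skill's parameter documentation to get structured output.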
Version History
v0.4.2
Fork - 2026.3.16
Metadata
Slug nidhov01-scrapling-web-scraping
Version 0.4.2
License MIT-0
Total installs 0
Current installs 0
Version count 1
FAQ

What is Scrapling Web Scraping?

Advanced web scraping with Scrapling: MCP-native guidance for extraction, crawling, and anti-bot handling. Use via mcporter (MCP) to call the `scrapling` MC... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 232 total downloads to date.

How do I install Scrapling Web Scraping?

Run the command "/install nidhov01-scrapling-web-scraping" in the OpenClaw or Claude Code chat box for one-step installation; no additional configuration is required.

Is Scrapling Web Scraping free?

Yes. Scrapling Web Scraping is completely free under the MIT-0 license and can be freely downloaded, installed, and used.

Which platforms does Scrapling Web Scraping support?

Scrapling Web Scraping is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops Scrapling Web Scraping?

It is developed and maintained by nidhov01 (@nidhov01); the current version is v0.4.2.
