Description

Web scraping framework supporting anti-bot bypass, adaptive parsing, session and proxy management, large-scale crawling, and dynamic content extraction.

README (SKILL.md)

Scrapling

Name: Nmb Scrapling
Author: superowl

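An adaptive web scraping framework that gets past anti-bot protection, crawls at scale, and keeps working when a site is redesigned.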

Installation

# Basic install (parser only)
pip install scrapling

# Full install (fetchers and browsers included)
pip install "scrapling[all]"
scrapling install

# Or install individual features
pip install "scrapling[fetchers]"  # fetching
pip install "scrapling[ai]"        # MCP server
pip install "scrapling[shell]"     # interactive shell

Quick Start

Basic fetching

from scrapling.fetchers import Fetcher

page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()
print(quotes)

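Bypassing anti-bot protection (Cloudflare, etc.)

from scrapling.fetchers import StealthyFetcher

# Solve Cloudflare Turnstile automatically
page = StealthyFetcher.fetch(
    'https://example.com',  # placeholder: replace with the target site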
    headless=True,
    solve_cloudflare=True
)
data = page.css('.content::text').getall()

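Dynamic pages (JS rendering)

from scrapling.fetchers import DynamicFetcher

# Full browser rendering
page = DynamicFetcher.fetch(
    'https://example.com',  # placeholder: replace with a JS-rendered (SPA) site
    headless=True,
    network_idle=True  # wait until network requests have finished
)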

Fetcher Types

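Fetcher	Use case	Notes
`Fetcher`	Plain HTTP requests	Fastest; suited to static pages
`StealthyFetcher`	Stealth mode	Bypasses anti-bot checks, including Cloudflare
`DynamicFetcher`	Browser mode	Renders JavaScript; suited to SPA pages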
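
The table above can be read as a small decision rule. The sketch below encodes it as a plain function. `choose_fetcher` and its two boolean parameters are hypothetical names introduced for illustration and are not part of Scrapling. It also assumes that the stealth fetcher, which the examples here run with `headless=True`, can render JavaScript as well, so anti-bot protection takes priority.

```python
def choose_fetcher(needs_js: bool, has_antibot: bool) -> str:
    """Return the name of the fetcher class suggested by the table.

    Hypothetical helper: it only returns a class name as a string and
    does not import or call Scrapling.
    """
    if has_antibot:
        # Anti-bot protection (e.g. Cloudflare) calls for the stealth fetcher.
        return "StealthyFetcher"
    if needs_js:
        # JS-rendered or SPA pages need a real browser.
        return "DynamicFetcher"
    # Static pages: plain HTTP is the fastest option.
    return "Fetcher"
```

Starting with the cheapest fetcher that works and escalating only when a page fails to load is usually the more economical order.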

Element Selection

page = Fetcher.get('https://example.com')

# CSS selectors
items = page.css('.item')
title = page.css('h1::text').get()
titles = page.css('h2::text').getall()

# XPath
items = page.xpath('//div[@class="item"]')

# BeautifulSoup-style lookups
items = page.find_all('div', class_='item')
items = page.find_by_text('keyword', tag='div')

# Chained selection
quote_text = page.css('.quote')[0].css('.text::text').get()

# Navigation
first = page.css('.item')[0]
parent = first.parent
sibling = first.next_sibling
similar = first.find_similar()  # find elements similar to this one

Session Management

import asyncio

from scrapling.fetchers import AsyncStealthySession, StealthySession

# Keep one session open so cookies are reused across requests
with StealthySession(headless=True, solve_cloudflare=True) as session:
    page1 = session.fetch('https://example.com/login')
    page2 = session.fetch('https://example.com/dashboard')  # same cookies, so a login persists

# Async session: `async with` has to run inside a coroutine
async def main():
    async with AsyncStealthySession(headless=True) as session:
        page = await session.fetch('https://example.com')

asyncio.run(main())

Building Spiders (large-scale crawling)

from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "products"
    start_urls = ["https://shop.example.com/"]
    concurrent_requests = 10  # number of concurrent requests

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {
                "title": item.css('h2::text').get(),
                "price": item.css('.price::text').get(),
            }

        # Follow pagination
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

# Run
result = MySpider().start()
print(f"Scraped {len(result.items)} items")

# Export
result.items.to_json("output.json")
result.items.to_jsonl("output.jsonl")

Pause and resume

# Set a crawl directory to enable pause/resume
MySpider(crawldir="./crawl_data").start()

# Press Ctrl+C to pause; run again to continue from the checkpoint
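
To make the idea behind a crawl directory concrete, here is a minimal conceptual sketch of checkpointing. It is not Scrapling's implementation: the `Checkpoint` class, the `state.json` file name, and the state layout are all assumptions made for illustration. The point is only that the pending queue and the set of seen URLs are written to disk, so a later run can pick up where the previous one stopped.

```python
import json
from collections import deque
from pathlib import Path


class Checkpoint:
    """Persist pending and seen URLs so a crawl can resume after a stop."""

    def __init__(self, crawldir: str):
        self.path = Path(crawldir) / "state.json"  # hypothetical file name
        self.pending = deque()
        self.seen = set()
        if self.path.exists():
            # A previous run left state behind: restore it.
            state = json.loads(self.path.read_text(encoding="utf-8"))
            self.pending = deque(state["pending"])
            self.seen = set(state["seen"])

    def add(self, url: str) -> None:
        # Queue a URL only the first time it is encountered.
        if url not in self.seen:
            self.seen.add(url)
            self.pending.append(url)

    def pop(self) -> str:
        # Take the oldest pending URL (FIFO order).
        return self.pending.popleft()

    def save(self) -> None:
        self.path.parent.mkdir(parents=True, exist_ok=True)
        state = {"pending": list(self.pending), "seen": sorted(self.seen)}
        self.path.write_text(json.dumps(state), encoding="utf-8")
```

Calling `save()` periodically, and again when the process is interrupted, bounds how much work a restart has to repeat.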

Mixing multiple sessions

from scrapling.spiders import Spider, Request
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSpider(Spider):
    name = "multi"

    def configure_sessions(self, manager):
        # Plain requests: fast
        manager.add("fast", FetcherSession(impersonate="chrome"))
        # Stealth requests: bypass anti-bot checks
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response):
        for link in response.css('a::attr(href)').getall():
            if "protected" in link:
                yield Request(link, sid="stealth")  # use the stealth session
            else:
                yield Request(link, sid="fast")      # use the fast session
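
Links scraped from `href` attributes are often relative, so it can help to resolve them before routing. The sketch below uses only the standard library. `route_link` is a hypothetical helper, and it applies the same "protected" rule as the spider above, except that it checks only the URL path rather than the whole link.

```python
from urllib.parse import urljoin, urlparse


def route_link(base_url: str, href: str) -> tuple[str, str]:
    """Resolve href against base_url and pick a session id for it."""
    absolute = urljoin(base_url, href)
    # Paths containing "protected" go to the stealth session, the rest to the fast one.
    sid = "stealth" if "protected" in urlparse(absolute).path else "fast"
    return absolute, sid
```

Inside `parse`, the returned pair could then be passed on as the request URL and its `sid`.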

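Adaptive Parsing

Automatically relocate elements after a site redesign:

# First crawl: save the element's characteristics
products = page.css('.product', auto_save=True)

# After a redesign: adaptive=True relocates the element automatically
products = page.css('.product', adaptive=True)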
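
One way to picture adaptive relocation is as a similarity search: store a fingerprint of the element on the first crawl, then pick the closest match on the redesigned page. The sketch below illustrates that idea with `difflib` from the standard library. It is not Scrapling's actual algorithm, and the dict-based element format, `fingerprint`, and `relocate` are all hypothetical.

```python
from difflib import SequenceMatcher


def fingerprint(element: dict) -> str:
    """Flatten an element's tag, attributes, and text into one string."""
    attrs = " ".join(f"{k}={v}" for k, v in sorted(element["attrs"].items()))
    return f'{element["tag"]} {attrs} {element["text"]}'


def relocate(saved: dict, candidates: list[dict]) -> dict:
    """Return the candidate whose fingerprint is most similar to the saved one."""
    target = fingerprint(saved)
    return max(
        candidates,
        key=lambda el: SequenceMatcher(None, target, fingerprint(el)).ratio(),
    )


# Fingerprint saved on the first crawl (made-up data).
saved = {"tag": "div", "attrs": {"class": "product"}, "text": "Blue Widget $9.99"}

# Elements found after a redesign: the class name and tag have changed.
redesigned_page = [
    {"tag": "nav", "attrs": {"class": "menu"}, "text": "Home About"},
    {"tag": "article", "attrs": {"class": "product-card"}, "text": "Blue Widget $9.99"},
]

match = relocate(saved, redesigned_page)
print(match["attrs"]["class"])
```

A real implementation would likely weigh more signals, such as position in the tree and sibling structure, and would need a similarity threshold to avoid returning a poor match.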

Proxy Rotation

from scrapling.fetchers import StealthyFetcher, ProxyRotator

proxies = ProxyRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
])

page = StealthyFetcher.fetch(
    'https://example.com',
    proxy=proxies.next()
)
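
For readers unfamiliar with rotation, the simplest strategy is round-robin: hand out proxies in a fixed repeating order. The sketch below shows that strategy with the standard library only. `RoundRobin` and `next_proxy` are hypothetical names, and this is not a description of how `ProxyRotator` works internally.

```python
from itertools import cycle


class RoundRobin:
    """Hand out proxies in a fixed repeating order."""

    def __init__(self, proxies: list[str]):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = cycle(proxies)

    def next_proxy(self) -> str:
        # Advance to the next proxy, wrapping around at the end of the list.
        return next(self._cycle)
```

Round-robin spreads requests evenly but does not react to failures, so a production setup would usually also drop or deprioritize proxies that start returning errors.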

CLI Commands

# Interactive shell
scrapling shell

# Fetch directly (no code required)
scrapling extract get 'https://example.com' output.md
scrapling extract stealthy-fetch 'https://protected.com' output.html --solve-cloudflare

# Install browsers
scrapling install
scrapling install --force

MCP Server (AI integration)

Let Claude or Cursor call Scrapling directly to scrape data:

pip install "scrapling[ai]"

# Start the MCP server
scrapling mcp

Add it to the Claude Desktop config:

{
  "mcpServers": {
    "scrapling": {
      "command": "scrapling",
      "args": ["mcp"]
    }
  }
}
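
If the client config already lists other MCP servers, the entry above has to be merged in rather than pasted over the file. The sketch below does that with the standard `json` module. `add_scrapling_server` is a hypothetical helper, and the config file location is deliberately left to the caller because it varies by platform and client.

```python
import json
from pathlib import Path


def add_scrapling_server(config_path: str) -> dict:
    """Add the "scrapling" entry to an MCP client config, keeping other servers."""
    path = Path(config_path)
    # Start from the existing config if there is one, else from an empty object.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    servers["scrapling"] = {"command": "scrapling", "args": ["mcp"]}
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

Backing up the original file before running something like this is a sensible precaution.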

Common Use Cases

E-commerce price comparison

from scrapling.fetchers import StealthyFetcher

page = StealthyFetcher.fetch('https://item.jd.com/12345.html', headless=True)
price = page.css('.price::text').get()
title = page.css('.sku-name::text').get()

Job listings

from scrapling.spiders import Spider, Response

class JobsSpider(Spider):
    name = "jobs"
    start_urls = ["https://www.zhipin.com/job_detail/?query=Python"]

    async def parse(self, response: Response):
        for job in response.css('.job-list li'):
            yield {
                "title": job.css('.job-name::text').get(),
                "salary": job.css('.salary::text').get(),
                "company": job.css('.company-name::text').get(),
            }

Competitor monitoring

from scrapling.fetchers import Fetcher

def check_competitor(url):
    page = Fetcher.get(url)
    return {
        "products": len(page.css('.product')),
        "price_range": page.css('.price::text').getall(),
        "updated": page.css('.update-time::text').get(),
    }
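
In the function above, `price_range` holds raw strings, so a numeric range still has to be derived from them. A minimal sketch of that step follows. `price_bounds` is a hypothetical helper, and it assumes that commas are thousands separators and that each string contains at most one price.

```python
import re
from typing import Optional


def price_bounds(price_texts: list[str]) -> Optional[tuple[float, float]]:
    """Return (min, max) of the numeric prices found in the scraped strings."""
    prices = []
    for text in price_texts:
        # Drop thousands separators, then take the first number in the string.
        match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
        if match:
            prices.append(float(match.group()))
    # No parsable price at all: signal that with None instead of raising.
    return (min(prices), max(prices)) if prices else None
```

Sites that use a comma as the decimal mark would need a different normalization step.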

Tips

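Test before scaling: debug selectors in scrapling shell first
Set concurrency sensibly: keep concurrent_requests moderate, since high values invite bans
Reuse sessions: use a Session to keep login state and cookies
Pause and resume: always set crawldir for long-running crawls
Respect robots.txt: crawl in compliance with site rules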
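
The robots.txt tip can be checked in code with Python's standard `urllib.robotparser`. The sketch below parses rules from a string so that it runs offline. The `allowed` helper, the `my-crawler` user agent, and the sample rules are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules supplied as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules: everything under /private/ is off limits to all crawlers.
rules = """\
User-agent: *
Disallow: /private/
"""

print(allowed(rules, "my-crawler", "https://example.com/private/data"))
print(allowed(rules, "my-crawler", "https://example.com/public/page"))
```

Against a live site, `RobotFileParser.set_url()` followed by `read()` fetches the real file instead of a hardcoded string.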

References

Official documentation: https://scrapling.readthedocs.io
GitHub: https://github.com/D4Vinci/Scrapling

Usage Guidance

This bundle appears coherent for web scraping, but before installing or running anything:
(1) Inspect the actual 'scrapling' PyPI package and its source (the SKILL.md instructs pip install of that package, which is external code).
(2) If you plan to run the MCP server or allow autonomous runs, consider running in a sandbox/container and monitor network traffic because scraped data may be sent off-host.
(3) Do not provide credentials (API keys, AWS, etc.) unless you verify the package's requirements.
(4) Review license and terms for scraping target sites and ensure you have permission and comply with robots.txt and laws.
(5) If you want stronger assurance, ask the skill author for the package's source repository or a signed release before installation.

Capability Analysis

Type: OpenClaw Skill
Name: nmb-scrapling
Version: 1.0.0

The skill bundle provides a wrapper and documentation for the 'scrapling' web scraping library, including features for anti-bot bypass and dynamic content rendering. The Python script 'scripts/scrape.py' and the instructions in 'SKILL.md' are consistent with the stated purpose of web scraping and do not contain any evidence of data exfiltration, malicious execution, or prompt injection attacks.

Capability Assessment

✓ Purpose & Capability

Name, description, SKILL.md, and the included scrape.py all align: this is a web-scraping helper that relies on a 'scrapling' Python package and supports stealth/dynamic fetchers, sessions, proxies, and a CLI. The code and prose request no unrelated system access or credentials.

ℹ Instruction Scope

SKILL.md instructs installing and running the external 'scrapling' package and optionally starting an MCP server for AI integration. The runtime instructions and example code operate only on URLs and local outputs; they do not instruct reading unrelated local files or exfiltrating secrets. However, the MCP server and 'collect data for AI training/RAG' guidance mean the skill may be used to send scraped data off-host if the installed package or operator config does so.

ℹ Install Mechanism

The registry has no install spec (instruction-only), so the platform won't install binaries automatically. SKILL.md explicitly tells users to pip install 'scrapling' (and extras). Installing a third-party PyPI package is common but introduces standard supply-chain risk: the package is external and not vetted by this bundle. The included script itself does not download additional code or call unknown endpoints.

✓ Credentials

The skill declares no required environment variables, credentials, or config paths and the code does not access environment secrets. This is proportionate to a scraping tool.

✓ Persistence & Privilege

always:false and user-invocable:true. The skill does not request forced persistent presence or modify other skills/config. Autonomous invocation is allowed by default but is not combined with other high-risk indicators here.

Version History

v1.0.0

Initial release of nmb-scrapling:
- Provides a comprehensive web scraping framework capable of bypassing anti-bot measures (e.g., Cloudflare) and adapting to website structure changes
- Supports multiple fetching modes: standard HTTP, stealthy anti-bot, and dynamic (JavaScript-rendered) pages
- Offers large-scale crawling with spider and session management, including pause/resume and multi-session strategies
- Includes adaptive parsing to auto-recover after site changes and proxy rotation features
- Provides CLI tools for interactive usage and direct extraction, plus integration with MCP server for AI-assisted workflows
- Detailed examples and best practices for a range of use cases (e-commerce, jobs, competitor monitoring, etc.)

Metadata

Slug nmb-scrapling

Version 1.0.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is Nmb Scrapling?

Web scraping framework supporting anti-bot bypass, adaptive parsing, session and proxy management, large-scale crawling, and dynamic content extraction. It is an AI Agent Skill for Claude Code / OpenClaw, with 198 downloads so far.

How do I install Nmb Scrapling?

Run "/install nmb-scrapling" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Nmb Scrapling free?

Yes, Nmb Scrapling is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Nmb Scrapling support?

Nmb Scrapling is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Nmb Scrapling?

It is built and maintained by Superowl (@superowl); the current version is v1.0.0.
