Description

Web scraping framework supporting anti-bot bypass, adaptive parsing, session and proxy management, large-scale crawling, and dynamic content extraction.

Usage (SKILL.md)

Scrapling

Name: Nmb Scrapling
Author: superowl

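An adaptive web scraping framework that gets past anti-bot protection, handles large-scale crawls, and keeps working when a site is redesigned.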

Installation

# Basic install (parser only)
pip install scrapling

# Full install (fetchers and browsers included)
pip install "scrapling[all]"
scrapling install

# Or install individual features
pip install "scrapling[fetchers]"  # fetching
pip install "scrapling[ai]"        # MCP server
pip install "scrapling[shell]"     # interactive shell

Quick Start

Basic fetching

from scrapling.fetchers import Fetcher

page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()
print(quotes)

Bypassing anti-bot protection (Cloudflare, etc.)

from scrapling.fetchers import StealthyFetcher

# Solves Cloudflare Turnstile automatically
page = StealthyFetcher.fetch(
    'https://example.com',  # placeholder for the target site
    headless=True,
    solve_cloudflare=True
)
data = page.css('.content::text').getall()

Dynamic pages (JS rendering)

from scrapling.fetchers import DynamicFetcher

# Full browser rendering
page = DynamicFetcher.fetch(
    'https://example.com',  # placeholder for a SPA site
    headless=True,
    network_idle=True  # wait until network requests have finished
)

Fetcher Types

Fetcher	Purpose	Notes
`Fetcher`	Plain HTTP requests	Fastest; suited to static pages
`StealthyFetcher`	Stealth mode	Gets past anti-bot protection and Cloudflare
`DynamicFetcher`	Browser mode	JS rendering; SPA pages
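
The table above can be read as a simple decision rule. A minimal sketch, assuming a hypothetical helper `choose_fetcher` (not part of Scrapling) and assuming that anti-bot protection takes priority over JS rendering when both apply:

```python
def choose_fetcher(needs_js: bool, has_antibot: bool) -> str:
    """Return the name of the fetcher class suggested by the table above.

    Hypothetical helper for illustration only; it is not part of Scrapling.
    """
    if has_antibot:
        return "StealthyFetcher"   # anti-bot / Cloudflare protected pages
    if needs_js:
        return "DynamicFetcher"    # JS-rendered pages and SPAs
    return "Fetcher"               # plain HTTP, fastest for static pages


print(choose_fetcher(needs_js=False, has_antibot=False))
```

Starting with the cheapest option and escalating only when a page requires it keeps crawls fast.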

Element Selection

page = Fetcher.get('https://example.com')

# CSS selectors
items = page.css('.item')
title = page.css('h1::text').get()
titles = page.css('h2::text').getall()

# XPath
items = page.xpath('//div[@class="item"]')

# BeautifulSoup style
items = page.find_all('div', class_='item')
items = page.find_by_text('keyword', tag='div')

# Chained selection
quote_text = page.css('.quote')[0].css('.text::text').get()

# Navigation
first = page.css('.item')[0]
parent = first.parent
sibling = first.next_sibling
similar = first.find_similar()  # find similar elements

Session Management

from scrapling.fetchers import FetcherSession, StealthySession

# Keep a session alive (cookies are reused)
with StealthySession(headless=True, solve_cloudflare=True) as session:
    page1 = session.fetch('https://example.com/login')
    page2 = session.fetch('https://example.com/dashboard')  # already logged in

# Async session
import asyncio
from scrapling.fetchers import AsyncStealthySession

async def main():
    async with AsyncStealthySession(headless=True) as session:
        page = await session.fetch('https://example.com')

asyncio.run(main())

Building Spiders (large-scale crawling)

from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "products"
    start_urls = ["https://shop.example.com/"]
    concurrent_requests = 10  # number of concurrent requests

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {
                "title": item.css('h2::text').get(),
                "price": item.css('.price::text').get(),
            }

        # Pagination
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page)

# Run
result = MySpider().start()
print(f"Crawled {len(result.items)} items")

# Export
result.items.to_json("output.json")
result.items.to_jsonl("output.jsonl")
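
The two export formats differ in layout: a JSON file holds a single array, while JSON Lines holds one object per line, which is easier to stream or append to. A sketch using only the standard library; the sample items are made up, and this is not Scrapling's exporter:

```python
import json

# Made-up items standing in for scraped results
items = [
    {"title": "Widget", "price": "9.99"},
    {"title": "Gadget", "price": "19.50"},
]

# JSON: the whole result set as one array
as_json = json.dumps(items, ensure_ascii=False)

# JSON Lines: one serialized object per line
as_jsonl = "\n".join(json.dumps(item, ensure_ascii=False) for item in items)

# JSONL can be read back one line at a time
parsed = [json.loads(line) for line in as_jsonl.splitlines()]
```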

Pausing and resuming

# Set a crawl directory to enable pause/resume
MySpider(crawldir="./crawl_data").start()

# Press Ctrl+C to pause; run again to resume from the checkpoint
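
To make the idea of resuming concrete, here is a minimal sketch of checkpointing a crawl queue to disk. The file name `state.json`, its layout, and the helper functions are all assumptions for illustration; Scrapling's actual on-disk format may differ.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(crawldir: Path, pending: list, seen: set) -> None:
    """Persist the pending queue and the set of visited URLs (hypothetical format)."""
    crawldir.mkdir(parents=True, exist_ok=True)
    state = {"pending": pending, "seen": sorted(seen)}
    (crawldir / "state.json").write_text(json.dumps(state), encoding="utf-8")


def load_checkpoint(crawldir: Path, start_urls: list) -> tuple:
    """Resume from a saved state if one exists, otherwise start fresh."""
    path = crawldir / "state.json"
    if not path.exists():
        return list(start_urls), set()
    state = json.loads(path.read_text(encoding="utf-8"))
    return state["pending"], set(state["seen"])


with tempfile.TemporaryDirectory() as tmp:
    crawldir = Path(tmp) / "crawl_data"
    pending, seen = load_checkpoint(crawldir, ["https://shop.example.com/"])
    # Pretend the first page was crawled and page 2 was discovered before an interruption
    seen.add(pending.pop(0))
    pending.append("https://shop.example.com/page/2")
    save_checkpoint(crawldir, pending, seen)
    # A second run picks up where the first one stopped
    resumed_pending, resumed_seen = load_checkpoint(crawldir, ["https://shop.example.com/"])
```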

Mixing multiple sessions

from scrapling.spiders import Spider, Request
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSpider(Spider):
    name = "multi"

    def configure_sessions(self, manager):
        # Plain requests: fast
        manager.add("fast", FetcherSession(impersonate="chrome"))
        # Stealth requests: gets past anti-bot protection
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response):
        for link in response.css('a::attr(href)').getall():
            if "protected" in link:
                yield Request(link, sid="stealth")  # use the stealth session
            else:
                yield Request(link, sid="fast")      # use the fast session

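Adaptive Parsing

Elements are relocated automatically after a site redesign:

# First crawl: save the elements' characteristics
products = page.css('.product', auto_save=True)

# After the site changes, adaptive=True relocates the elements automatically
products = page.css('.product', adaptive=True)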
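
One way to picture relocation is as a similarity search: store a fingerprint of the element on the first crawl, then pick the most similar candidate after the redesign. The sketch below uses Jaccard similarity over made-up fingerprints. It is a conceptual illustration only, not Scrapling's actual algorithm, and the `fingerprint` and `similarity` helpers are hypothetical.

```python
def fingerprint(tag: str, classes: list, text: str) -> set:
    """Build a token set describing an element (hypothetical fingerprint)."""
    tokens = {f"tag:{tag}"}
    tokens |= {f"class:{c}" for c in classes}
    tokens |= {f"word:{w.lower()}" for w in text.split()}
    return tokens


def similarity(a: set, b: set) -> float:
    """Jaccard similarity: shared tokens divided by all tokens."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Saved on the first crawl, when the element matched '.product'
saved = fingerprint("div", ["product"], "Blue Widget $9.99")

# Candidates after a redesign that renamed the class and changed the tag
candidates = {
    "nav": fingerprint("nav", ["menu"], "Home Shop Contact"),
    "card": fingerprint("article", ["product-card"], "Blue Widget $9.99"),
    "footer": fingerprint("div", ["footer"], "Copyright Example Shop"),
}

best = max(candidates, key=lambda name: similarity(saved, candidates[name]))
print(best)
```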

Proxy Rotation

from scrapling.fetchers import StealthyFetcher, ProxyRotator

proxies = ProxyRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
])

page = StealthyFetcher.fetch(
    'https://example.com',
    proxy=proxies.next()
)
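
For intuition, cyclic rotation just hands out proxies in order and wraps around at the end of the list. A standard-library sketch of that behaviour; this illustrates round-robin rotation in general and is not the implementation of `ProxyRotator`:

```python
from itertools import cycle

# cycle() repeats the list endlessly, giving round-robin order
proxy_pool = cycle(["http://proxy1:8080", "http://proxy2:8080"])

# Three consecutive requests would use proxy1, proxy2, then proxy1 again
picked = [next(proxy_pool) for _ in range(3)]
```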

CLI Commands

# Interactive shell
scrapling shell

# Fetch directly (no code required)
scrapling extract get 'https://example.com' output.md
scrapling extract stealthy-fetch 'https://protected.com' output.html --solve-cloudflare

# Install browsers
scrapling install
scrapling install --force

MCP Server (AI integration)

Lets Claude/Cursor call Scrapling directly to scrape data:

pip install "scrapling[ai]"

# Start the MCP server
scrapling mcp

Add it to the Claude Desktop config:

{
  "mcpServers": {
    "scrapling": {
      "command": "scrapling",
      "args": ["mcp"]
    }
  }
}
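
If the config file already lists other MCP servers, the entry should be merged in rather than pasted over them. A minimal sketch, assuming a hypothetical helper `add_scrapling_server` and a made-up existing entry named `other`:

```python
import json


def add_scrapling_server(config: dict) -> dict:
    """Return a copy of the config with the scrapling entry added (hypothetical helper)."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers["scrapling"] = {"command": "scrapling", "args": ["mcp"]}
    merged["mcpServers"] = servers
    return merged


# Made-up existing config with one unrelated server
existing = {"mcpServers": {"other": {"command": "other-tool", "args": []}}}
updated = add_scrapling_server(existing)
print(json.dumps(updated, indent=2))
```

Working on a copy leaves the original dictionary untouched, so the result can be inspected before it is written back to disk.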

Common Use Cases

E-commerce price comparison

from scrapling.fetchers import StealthyFetcher

page = StealthyFetcher.fetch('https://item.jd.com/12345.html', headless=True)
price = page.css('.price::text').get()
title = page.css('.sku-name::text').get()

Job listings

from scrapling.spiders import Spider, Response

class JobsSpider(Spider):
    name = "jobs"
    start_urls = ["https://www.zhipin.com/job_detail/?query=Python"]

    async def parse(self, response: Response):
        for job in response.css('.job-list li'):
            yield {
                "title": job.css('.job-name::text').get(),
                "salary": job.css('.salary::text').get(),
                "company": job.css('.company-name::text').get(),
            }

Competitor monitoring

from scrapling.fetchers import Fetcher

def check_competitor(url):
    page = Fetcher.get(url)
    return {
        "products": len(page.css('.product')),
        "price_range": page.css('.price::text').getall(),
        "updated": page.css('.update-time::text').get(),
    }
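
The `price_range` value above is a list of raw text strings, so comparing prices needs a parsing step. A sketch of one, assuming a hypothetical helper `price_bounds` and made-up sample strings; real price formats vary by site and locale:

```python
import re
from typing import List, Optional, Tuple


def price_bounds(raw_prices: List[str]) -> Optional[Tuple[float, float]]:
    """Extract numeric prices and return (lowest, highest), or None if none parse."""
    values = []
    for raw in raw_prices:
        # Drop thousands separators, then take the first decimal number
        match = re.search(r"\d+(?:\.\d+)?", raw.replace(",", ""))
        if match:
            values.append(float(match.group()))
    return (min(values), max(values)) if values else None


bounds = price_bounds(["¥1,299.00", " $49.5 ", "N/A"])
print(bounds)
```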

Tips

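Test before scaling: debug selectors with scrapling shell
Set concurrency sensibly: keep concurrent_requests modest; setting it too high can get you blocked
Reuse sessions: use a Session to keep login state and cookies
Resumable crawls: always set crawldir for long-running crawls
Respect robots.txt: crawl in compliance with site rules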
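
The robots.txt tip can be checked in code with Python's standard `urllib.robotparser`. The sketch below parses an inline, made-up robots.txt so it runs offline; the user agent name `my-crawler` is also an assumption. In practice the rules would be loaded from the target site.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules; a real crawl would fetch the site's own robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Check each URL before requesting it
allowed = parser.can_fetch("my-crawler", "https://example.com/products")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data")
```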

References

Official documentation: https://scrapling.readthedocs.io
GitHub: https://github.com/D4Vinci/Scrapling

Security recommendations

This bundle appears coherent for web scraping, but before installing or running anything:

(1) Inspect the actual 'scrapling' PyPI package and its source (the SKILL.md instructs pip install of that package; it is external code).
(2) If you plan to run the MCP server or allow autonomous runs, consider running in a sandbox/container and monitor network traffic because scraped data may be sent off-host.
(3) Do not provide credentials (API keys, AWS, etc.) unless you verify the package's requirements.
(4) Review license and terms for scraping target sites and ensure you have permission and comply with robots.txt and laws.
(5) If you want stronger assurance, ask the skill author for the package's source repository or a signed release before installation.

Functional analysis

Type: OpenClaw Skill
Name: nmb-scrapling
Version: 1.0.0

The skill bundle provides a wrapper and documentation for the 'scrapling' web scraping library, including features for anti-bot bypass and dynamic content rendering. The Python script 'scripts/scrape.py' and the instructions in 'SKILL.md' are consistent with the stated purpose of web scraping and do not contain any evidence of data exfiltration, malicious execution, or prompt injection attacks.

Capability assessment

✓ Purpose & Capability

Name, description, SKILL.md, and the included scrape.py all align: this is a web-scraping helper that relies on a 'scrapling' Python package and supports stealth/dynamic fetchers, sessions, proxies, and a CLI. The code and prose request no unrelated system access or credentials.

ℹ Instruction Scope

SKILL.md instructs installing and running the external 'scrapling' package and optionally starting an MCP server for AI integration. The runtime instructions and example code operate only on URLs and local outputs; they do not instruct reading unrelated local files or exfiltrating secrets. However, the MCP server and 'collect data for AI training/RAG' guidance mean the skill may be used to send scraped data off-host if the installed package or operator config does so.

ℹ Install Mechanism

The registry has no install spec (instruction-only), so the platform won't install binaries automatically. SKILL.md explicitly tells users to pip install 'scrapling' (and extras). Installing a third-party PyPI package is common but introduces standard supply-chain risk: the package is external and not vetted by this bundle. The included script itself does not download additional code or call unknown endpoints.

✓ Credentials

The skill declares no required environment variables, credentials, or config paths and the code does not access environment secrets. This is proportionate to a scraping tool.

✓ Persistence & Privilege

always:false and user-invocable:true. The skill does not request forced persistent presence or modify other skills/config. Autonomous invocation is allowed by default but is not combined with other high-risk indicators here.

Version history

v1.0.0

Initial release of nmb-scrapling:
- Provides a comprehensive web scraping framework capable of bypassing anti-bot measures (e.g., Cloudflare) and adapting to website structure changes
- Supports multiple fetching modes: standard HTTP, stealthy anti-bot, and dynamic (JavaScript-rendered) pages
- Offers large-scale crawling with spider and session management, including pause/resume and multi-session strategies
- Includes adaptive parsing to auto-recover after site changes and proxy rotation features
- Provides CLI tools for interactive usage and direct extraction, plus integration with MCP server for AI-assisted workflows
- Detailed examples and best practices for a range of use cases (e-commerce, jobs, competitor monitoring, etc.)

Metadata

Slug nmb-scrapling

Version 1.0.0

License MIT-0

Total installs 0

Current installs 0

Number of past versions 1

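FAQ

What is Nmb Scrapling?

Web scraping framework supporting anti-bot bypass, adaptive parsing, session and proxy management, large-scale crawling, and dynamic content extraction. It is an AI Agent Skill plugin for Claude Code / OpenClaw and has been downloaded 198 times to date.

How do I install Nmb Scrapling?

Run the command "/install nmb-scrapling" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is needed.

Is Nmb Scrapling free?

Yes. Nmb Scrapling is completely free and released under the MIT-0 license; you can download, install, and use it freely.

Which platforms does Nmb Scrapling support?

Nmb Scrapling is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops Nmb Scrapling?

It is developed and maintained by Superowl (@superowl); the current version is v1.0.0.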
