Total downloads: 198 · Favorites: 0 · Current installs: 0 · Versions: 1
Install in OpenClaw
/install nmb-scrapling
Description
Web scraping framework supporting anti-bot bypass, adaptive parsing, session and proxy management, large-scale crawling, and dynamic content extraction.
Usage (SKILL.md)
Scrapling
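An adaptive web scraping framework that gets past anti-bot protection, crawls at scale, and keeps working when a site is redesigned.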
Installation
# Basic install (parser only)
pip install scrapling
# Full install (including fetchers and browsers)
pip install "scrapling[all]"
scrapling install
# Or install features individually
pip install "scrapling[fetchers]" # fetching
pip install "scrapling[ai]" # MCP server
pip install "scrapling[shell]" # interactive shell
Quick Start
Basic fetching
from scrapling.fetchers import Fetcher
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()
print(quotes)
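Bypassing anti-bot protection (Cloudflare, etc.)
from scrapling.fetchers import StealthyFetcher
# Solve Cloudflare Turnstile automatically
page = StealthyFetcher.fetch(
    'https://target-site.example',  # placeholder: replace with the target site
    headless=True,
    solve_cloudflare=True
)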
data = page.css('.content::text').getall()
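Dynamic pages (JS-rendered)
from scrapling.fetchers import DynamicFetcher
# Full browser rendering
page = DynamicFetcher.fetch(
    'https://spa-site.example',  # placeholder: replace with the SPA's URL
    headless=True,
    network_idle=True  # wait until network requests have finished
)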
Fetcher Types
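| Fetcher | Use | Characteristics |
|---|---|---|
| Fetcher | Plain HTTP requests | Fastest; suited to static pages |
| StealthyFetcher | Stealth mode | Bypasses anti-bot protection, including Cloudflare |
| DynamicFetcher | Browser mode | JS rendering; SPA pages |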
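The table can be read as a simple decision order. The helper below is a hypothetical sketch: `pick_fetcher` and its two flags are illustrative and not part of Scrapling. It assumes that stealth takes priority when a page is both protected and JS-rendered, and it only returns the name of the class to import.

```python
def pick_fetcher(needs_js: bool = False, anti_bot: bool = False) -> str:
    """Return the name of the fetcher class suggested by the table.

    Hypothetical helper for illustration; not part of Scrapling.
    """
    if anti_bot:
        # Protected sites: the table lists StealthyFetcher for anti-bot bypass.
        return "StealthyFetcher"
    if needs_js:
        # JS-rendered or SPA pages: the table lists DynamicFetcher.
        return "DynamicFetcher"
    # Static pages: plain HTTP is listed as the fastest option.
    return "Fetcher"


print(pick_fetcher())               # static page
print(pick_fetcher(needs_js=True))  # SPA
print(pick_fetcher(anti_bot=True))  # protected site
```

Starting with the cheapest option and escalating only when a page fails to load or render keeps crawls fast.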
Element Selection
page = Fetcher.get('https://example.com')
# CSS selectors
items = page.css('.item')
title = page.css('h1::text').get()
titles = page.css('h2::text').getall()
# XPath
items = page.xpath('//div[@class="item"]')
# BeautifulSoup style
items = page.find_all('div', class_='item')
items = page.find_by_text('keyword', tag='div')
# Chained selection
quote_text = page.css('.quote')[0].css('.text::text').get()
# Navigation
first = page.css('.item')[0]
parent = first.parent
sibling = first.next_sibling
similar = first.find_similar()  # find similar elements
Session Management
from scrapling.fetchers import FetcherSession, StealthySession
# Persistent session (cookies are reused across requests)
with StealthySession(headless=True, solve_cloudflare=True) as session:
    page1 = session.fetch('https://example.com/login')
    page2 = session.fetch('https://example.com/dashboard')  # same session, so cookies set earlier are sent again
# Async session (must run inside a coroutine)
import asyncio
from scrapling.fetchers import AsyncStealthySession
async def main():
    async with AsyncStealthySession(headless=True) as session:
        page = await session.fetch('https://example.com')
asyncio.run(main())
Building Spiders (large-scale crawling)
from scrapling.spiders import Spider, Response
class MySpider(Spider):
    name = "products"
    start_urls = ["https://shop.example.com/"]
    concurrent_requests = 10  # number of concurrent requests
    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {
                "title": item.css('h2::text').get(),
                "price": item.css('.price::text').get(),
            }
        # Pagination
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page)
# Run
result = MySpider().start()
print(f"Scraped {len(result.items)} items")
# Export
result.items.to_json("output.json")
result.items.to_jsonl("output.jsonl")
Pause and resume
# Set a crawl directory to enable pause/resume
MySpider(crawldir="./crawl_data").start()
# Press Ctrl+C to pause; run again to continue from the checkpoint
Mixing multiple sessions
from scrapling.spiders import Spider, Request
from scrapling.fetchers import FetcherSession, AsyncStealthySession
class MultiSpider(Spider):
    name = "multi"
    def configure_sessions(self, manager):
        # Plain requests: fast
        manager.add("fast", FetcherSession(impersonate="chrome"))
        # Stealth requests: for anti-bot protected pages
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
    async def parse(self, response):
        for link in response.css('a::attr(href)').getall():
            if "protected" in link:
                yield Request(link, sid="stealth")  # use the stealth session
            else:
                yield Request(link, sid="fast")  # use the fast session
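Adaptive Parsing
Elements are relocated automatically after a site redesign:
# First crawl: save the element's signature
products = page.css('.product', auto_save=True)
# After the redesign: adaptive=True relocates the element automatically
products = page.css('.product', adaptive=True)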
Proxy Rotation
from scrapling.fetchers import StealthyFetcher, ProxyRotator
proxies = ProxyRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
])
page = StealthyFetcher.fetch(
    'https://example.com',
    proxy=proxies.next()
)
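To make the idea of rotation concrete, here is a minimal standard-library sketch of round-robin selection. `RoundRobinProxies` is a hypothetical stand-in written for illustration; it is not Scrapling's `ProxyRotator`, whose internals and rotation strategies may differ.

```python
from itertools import cycle


class RoundRobinProxies:
    """Hypothetical stand-in that hands out proxies in a repeating cycle."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy URL is required")
        self._pool = cycle(list(proxies))

    def next(self) -> str:
        # Each call returns the next proxy, wrapping around at the end.
        return next(self._pool)


rotator = RoundRobinProxies(["http://proxy1:8080", "http://proxy2:8080"])
picked = [rotator.next() for _ in range(3)]
print(picked)
```

With two proxies, the third request reuses the first proxy, so load is spread evenly across the pool.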
CLI Commands
# Interactive shell
scrapling shell
# Fetch directly (no code required)
scrapling extract get 'https://example.com' output.md
scrapling extract stealthy-fetch 'https://protected.com' output.html --solve-cloudflare
# Install browsers
scrapling install
scrapling install --force
MCP Server (AI integration)
Let Claude or Cursor call Scrapling directly to scrape data:
pip install "scrapling[ai]"
# Start the MCP server
scrapling mcp
Add it to the Claude Desktop config:
{
  "mcpServers": {
    "scrapling": {
      "command": "scrapling",
      "args": ["mcp"]
    }
  }
}
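If the config file already lists other MCP servers, the entry should be merged in, not pasted over them. The sketch below works on an in-memory dict only; `add_scrapling_server` and the `other-tool` entry are hypothetical, and locating, reading and writing the real config file is left out because its path depends on the installation.

```python
import json


def add_scrapling_server(config: dict) -> dict:
    """Return a copy of `config` with the Scrapling MCP entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    # Same command and args as the JSON snippet above.
    servers["scrapling"] = {"command": "scrapling", "args": ["mcp"]}
    merged["mcpServers"] = servers
    return merged


# Hypothetical existing config with one unrelated server.
existing = {"mcpServers": {"other": {"command": "other-tool", "args": []}}}
updated = add_scrapling_server(existing)
print(json.dumps(updated, indent=2))
```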
Common Use Cases
E-commerce price comparison
from scrapling.fetchers import StealthyFetcher
page = StealthyFetcher.fetch('https://item.jd.com/12345.html', headless=True)
price = page.css('.price::text').get()
title = page.css('.sku-name::text').get()
Job listings
from scrapling.spiders import Spider, Response
class JobsSpider(Spider):
    name = "jobs"
    start_urls = ["https://www.zhipin.com/job_detail/?query=Python"]
    async def parse(self, response: Response):
        for job in response.css('.job-list li'):
            yield {
                "title": job.css('.job-name::text').get(),
                "salary": job.css('.salary::text').get(),
                "company": job.css('.company-name::text').get(),
            }
Competitor monitoring
from scrapling.fetchers import Fetcher
import json
def check_competitor(url):
    page = Fetcher.get(url)
    return {
        "products": len(page.css('.product')),
        "price_range": page.css('.price::text').getall(),
        "updated": page.css('.update-time::text').get(),
    }
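Monitoring usually means comparing the latest snapshot with an earlier one. The sketch below shows one way to do that with plain dicts shaped like the return value of `check_competitor`; `diff_snapshots` is a hypothetical helper and the two snapshots are made-up sample values, not scraped data.

```python
import json


def diff_snapshots(old: dict, new: dict) -> dict:
    """Return the fields whose values changed between two snapshots."""
    return {
        key: {"old": old.get(key), "new": new.get(key)}
        for key in sorted(set(old) | set(new))
        if old.get(key) != new.get(key)
    }


# Made-up sample snapshots for illustration only.
previous = {"products": 10, "price_range": ["$5", "$9"], "updated": "2024-01-01"}
current = {"products": 12, "price_range": ["$5", "$9"], "updated": "2024-01-02"}
changes = diff_snapshots(previous, current)
print(json.dumps(changes, indent=2))
```

Storing each snapshot as JSON between runs would let the same comparison work across separate invocations.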
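Tips
- Test before scaling up: debug selectors with scrapling shell
- Set concurrency sensibly: keep concurrent_requests modest, since too high a value is likely to get you blocked
- Reuse sessions: use a Session to keep login state and cookies
- Pause and resume: always set crawldir for long-running crawls
- Respect robots.txt: scrape in compliance with site rules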
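The robots.txt tip can be checked in code before a URL is queued. This is a minimal sketch using Python's standard `urllib.robotparser`; `build_robots_checker` is a hypothetical helper, and the rules are an inline sample instead of a file fetched from a real site. Whether Scrapling performs such a check itself is not stated here, so treat this as a separate pre-flight step.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str, user_agent: str = "*"):
    """Return a function that reports whether a URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return allowed


# Inline sample rules; a real crawl would download the site's robots.txt.
sample_rules = """\
User-agent: *
Disallow: /private/
"""
allowed = build_robots_checker(sample_rules)
print(allowed("https://example.com/products"))
print(allowed("https://example.com/private/data"))
```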
References
- Official documentation: https://scrapling.readthedocs.io
- GitHub: https://github.com/D4Vinci/Scrapling
Security recommendations
This bundle appears coherent for web scraping, but before installing or running anything: (1) Inspect the actual 'scrapling' PyPI package and its source (the SKILL.md instructs pip install of that package — it is external code). (2) If you plan to run the MCP server or allow autonomous runs, consider running in a sandbox/container and monitor network traffic because scraped data may be sent off-host. (3) Do not provide credentials (API keys, AWS, etc.) unless you verify the package's requirements. (4) Review license and terms for scraping target sites and ensure you have permission and comply with robots.txt and laws. (5) If you want stronger assurance, ask the skill author for the package's source repository or a signed release before installation.
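As a small first step toward point (1), the metadata of whatever package is actually installed can be read with the standard library. This sketch only reports what the package declares about itself; it does not verify the source or the release. `installed_package_info` is a hypothetical helper, and the metadata fields may be empty for some packages.

```python
from importlib import metadata


def installed_package_info(name: str):
    """Return basic metadata for an installed package, or None if absent."""
    try:
        dist = metadata.distribution(name)
    except metadata.PackageNotFoundError:
        return None
    meta = dist.metadata
    return {
        "name": meta.get("Name"),
        "version": dist.version,
        "home_page": meta.get("Home-page"),
        "license": meta.get("License"),
    }


# Prints None when the package is not installed in this environment.
print(installed_package_info("scrapling"))
```

Comparing the reported version and project URL against the official repository gives a quick sanity check before running anything.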
Functional analysis
Type: OpenClaw Skill
Name: nmb-scrapling
Version: 1.0.0
The skill bundle provides a wrapper and documentation for the 'scrapling' web scraping library, including features for anti-bot bypass and dynamic content rendering. The Python script 'scripts/scrape.py' and the instructions in 'SKILL.md' are consistent with the stated purpose of web scraping and do not contain any evidence of data exfiltration, malicious execution, or prompt injection attacks.
Capability assessment
Purpose & Capability
Name, description, SKILL.md, and the included scrape.py all align: this is a web-scraping helper that relies on a 'scrapling' Python package and supports stealth/dynamic fetchers, sessions, proxies, and a CLI. The code and prose request no unrelated system access or credentials.
Instruction Scope
SKILL.md instructs installing and running the external 'scrapling' package and optionally starting an MCP server for AI integration. The runtime instructions and example code operate only on URLs and local outputs; they do not instruct reading unrelated local files or exfiltrating secrets. However, the MCP server and 'collect data for AI training/RAG' guidance mean the skill may be used to send scraped data off-host if the installed package or operator config does so.
Install Mechanism
The registry has no install spec (instruction-only), so platform won't install binaries automatically. SKILL.md explicitly tells users to pip install 'scrapling' (and extras). Installing a third‑party PyPI package is common but introduces standard supply-chain risk: the package is external and not vetted by this bundle. The included script itself does not download additional code or call unknown endpoints.
Credentials
The skill declares no required environment variables, credentials, or config paths and the code does not access environment secrets. This is proportionate to a scraping tool.
Persistence & Privilege
always:false and user-invocable:true. The skill does not request forced persistent presence or modify other skills/config. Autonomous invocation is allowed by default but is not combined with other high-risk indicators here.
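How to use
- Make sure OpenClaw is installed (locally or via Docker)
- Enter the install command in the chat box: /install nmb-scrapling
- Once installed, call the Skill by name or trigger it with /nmb-scrapling
- Provide the inputs described in the Skill's parameter notes to get structured output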
Version history
v1.0.0
Initial release of nmb-scrapling:
- Provides a comprehensive web scraping framework capable of bypassing anti-bot measures (e.g., Cloudflare) and adapting to website structure changes
- Supports multiple fetching modes: standard HTTP, stealthy anti-bot, and dynamic (JavaScript-rendered) pages
- Offers large-scale crawling with spider and session management, including pause/resume and multi-session strategies
- Includes adaptive parsing to auto-recover after site changes and proxy rotation features
- Provides CLI tools for interactive usage and direct extraction, plus integration with MCP server for AI-assisted workflows
- Detailed examples and best practices for a range of use cases (e-commerce, jobs, competitor monitoring, etc.)
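FAQ
What is Nmb Scrapling?
Web scraping framework supporting anti-bot bypass, adaptive parsing, session and proxy management, large-scale crawling, and dynamic content extraction. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 198 total downloads to date.
How do I install Nmb Scrapling?
Run the command "/install nmb-scrapling" in the OpenClaw or Claude Code chat box for a one-step install; no extra configuration is needed.
Is Nmb Scrapling free?
Yes. Nmb Scrapling is completely free and released under the MIT-0 license; it can be downloaded, installed and used freely.
Which platforms does Nmb Scrapling support?
Nmb Scrapling is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.
Who develops Nmb Scrapling?
It is developed and maintained by Superowl (@superowl); the current version is v1.0.0.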