← Back to Skills Marketplace
smseow001

Crawl Pro

by SMS · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ⚠ suspicious
40
Downloads
0
Stars
0
Active Installs
1
Versions
Install in OpenClaw
/install crawl-pro
Description
爬虫专项工作流 - 基于 Scrapling 框架的生产级采集。自适应元素定位、反爬绕过、并发规模爬取、代理轮换。
README (SKILL.md)

Crawl Pro - 爬虫专项工作流

基于 Scrapling 框架的生产级采集


核心思想

** Scrapling 核心能力:**

  1. 自适应元素定位 — 网站改版无需改代码
  2. 反爬绕过 — Cloudflare 等自动绕过
  3. 并发规模 — 断点续爬、代理轮换
  4. MCP 集成 — AI 辅助智能抓取

触发词

爬虫模式 / crawl pro / 抓取 / 采集中


工作流阶段

1️⃣ 环境与请求方式选择

## 🌍 环境选择

### 目标网站分析
- 反爬级别:[低/中/高/Cloudflare]
- 动态内容:[是/否]

### Fetcher 选择
| 场景 | Fetcher | 说明 |
|------|---------|------|
| 静态页面 | Fetcher | 轻量,高并发 |
| 反爬站点 | StealthyFetcher | 指纹伪装 |
| JS动态渲染 | DynamicFetcher | Playwright |

### 安装
pip install scrapling

2️⃣ 基础请求与解析

## 📡 请求示例

### 基础请求
```python
from scrapling import Fetcher
fetcher = Fetcher()
response = fetcher.get('https://www.example.com')

元素提取

# CSS 选择器
products = response.css('.product')

# 开启自适应(网站改版自动重定位)
products = response.css('.product', adaptive=True)

# 首次存入 SQLite,后续自动重定位
products = response.css('.product', auto_save=True)

XPath 也支持

items = response.xpath('//div[@class="product"]')

---

### 3️⃣ 自适应元素定位(核心)

🎯 自适应定位

原理

  • 框架通过相似度算法(Levenshtein 距离)自动匹配
  • 0.3 秒内完成重定位
  • 无需修改代码

使用

# 首次爬取
products = response.css('.product', auto_save=True)

# 网站改版后
products = response.css('.product', adaptive=True)
# 自动找到新位置的元素!

---

### 4️⃣ 并发规模爬取

⚡ 并发爬取

定义 Spider

from scrapling import Spider

class MySpider(Spider):
    start_urls = ['https://example.com/page1', 'https://example.com/page2']
    
    async def parse(self, response):
        products = response.css('.product', adaptive=True)
        for product in products:
            yield {
                'name': product.css('.name').text(),
                'price': product.css('.price').text()
            }

配置并发

spider.run(
    concurrent_requests=8,  # 并发数
    download_delay=3,           # 请求延迟(秒)
    auto_save=True,            # 断点续爬
    export='jsonl'            # 导出格式
)

断点续爬

  • 按 Ctrl+C 中断自动保存 Checkpoint
  • 重启后从上次位置继续

5️⃣ 反爬与代理轮换

## 🛡️ 反爬策略

### 代理轮换
```python
# 循环轮换代理
spider = MySpider(proxy_rotator='cyclic')

# 或内置代理池

反爬配置

参数 说明
headless=False 可见浏览器
network_idle=True 等待网络空闲
retry=3 重试次数

StealthyFetcher(深度伪装)

from scrapling import StealthyFetcher
fetcher = StealthyFetcher.adaptive = True
response = fetcher.fetch('https://cloudflare-site.com', headless=True)

---

### 6️⃣ 数据输出

💾 数据输出

实时流式输���

async for item in spider.stream():
    print(item)  # 实时处理

导出格式

spider.run(export='jsonl')   # JSONL
spider.run(export='json')    # JSON
spider.run(export='csv')     # CSV

自定义 Pipeline

class MyPipeline:
    async def process(self, item):
        # 自定义处理
        await save_to_db(item)

---

### 7️⃣ AI 集成(MCP)

🤖 AI 集成

MCP Server

from scrapling import MCP_SERVER
MCP_SERVER.start()

AI 智能抓取

  • 先用 MCP 提取数据
  • 再调用 AI 分析
  • 降低 Token 消耗

---

## 性能调优

| 网站类型 | 并发数 | 请求延迟 |
|----------|--------|----------|
| 普通网站 | 5-10 | 2-5秒 |
| 反爬站点 | 2-3 | 5-10秒 |
| 动态网站 | 1-2 | 10秒+ |

---

## 快速指令表

| 需求 | 命令 |
|------|------|
| 分析目标站 | `分析网站[URL]的反爬级别` |
| 写爬虫 | `写一个爬虫:抓取[URL]的[内容]` |
| 自适应 | `用adaptive模式抓取` |
| 大规模 | `并发爬取[URL],8并发` |

---

## 输出规范

### 完整爬虫模板
```python
# 导入
from scrapling import Spider, Fetcher

# 定义
class ProductSpider(Spider):
    start_urls = ['https://example.com']
    
    async def parse(self, response):
        items = response.css('.item', adaptive=True)
        for item in items:
            yield {'title': item.css('h2').text()}

# 运行
if __name__ == '__main__':
    fetcher = Fetcher()
    spider = ProductSpider(fetcher=fetcher)
    spider.run(concurrent_requests=5, export='jsonl')

Crawl Pro | 基于 Scrapling 的生产级爬虫工作流

Usage Guidance
Install or use this skill only if you intentionally need authorized web scraping. Avoid using the stealth, Cloudflare-bypass, or proxy-rotation guidance against sites where you lack permission, verify and pin the Scrapling dependency, run crawlers in an isolated environment, and enable MCP only with clear local access controls.
Capability Assessment
Purpose & Capability
The scraping purpose is coherent and disclosed, but the stated capability includes bypassing anti-bot protections such as Cloudflare and using stealth/proxy techniques, which is materially higher-risk than ordinary data extraction.
Instruction Scope
The examples give broad scraping patterns, concurrency, stealth fetchers, and proxy rotation without clear authorization, target-scope, or compliance boundaries for when bypass-oriented crawling is appropriate.
Install Mechanism
There is no install spec or bundled code, but SKILL.md instructs users to install an unpinned third-party Python package with `pip install scrapling`; this is user-directed and purpose-aligned, but users should verify it.
Credentials
No credentials or environment variables are requested. Network access, browser automation, proxies, and MCP integration are disclosed as part of the scraping workflow, but they should be scoped carefully.
Persistence & Privilege
The skill documents checkpointing and SQLite-backed adaptive selector storage. This appears purpose-aligned, but users should understand where crawl state and extracted data are stored and how to remove it.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install crawl-pro
  3. After installation, invoke the skill by name or use /crawl-pro
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
Initial release of Crawl Pro: a production-grade scraping workflow based on Scrapling. - Supports adaptive element location, anti-bot bypass, scalable concurrent crawling, and proxy rotation. - Easily selects environment and fetcher for static, dynamic, and anti-bot sites. - Built-in checkpointing for resumable crawls and streamlined concurrent spider definitions. - Exports data in multiple formats (JSONL, JSON, CSV) with real-time streaming and pipeline customization. - Integrates AI (MCP server) for smarter scraping and reduced token usage.
Metadata
Slug crawl-pro
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is Crawl Pro?

爬虫专项工作流 - 基于 Scrapling 框架的生产级采集。自适应元素定位、反爬绕过、并发规模爬取、代理轮换。 It is an AI Agent Skill for Claude Code / OpenClaw, with 40 downloads so far.

How do I install Crawl Pro?

Run "/install crawl-pro" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Crawl Pro free?

Yes, Crawl Pro is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Crawl Pro support?

Crawl Pro is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).

Who created Crawl Pro?

It is built and maintained by SMS (@smseow001); the current version is v1.0.0.

💬 Comments