Crawl4AI Web Crawler

Author: OpenLark · v1.0.1 · MIT-0
Tags: cross-platform, ⚠ suspicious
Total downloads: 80 · Favorites: 0 · Current installs: 0 · Versions: 2
Install in OpenClaw:
/install crawl4ai-web-crawler
Description
Use Crawl4AI for web scraping and content extraction. Use when users need to scrape web content, extract structured data, convert web pages to Markdown, perf...
Usage Instructions (SKILL.md)

Crawl4AI Web Crawler

Crawl4AI is an open-source, LLM-friendly web crawler on GitHub that converts web pages into clean Markdown or structured JSON, ideal for RAG, AI Agents, and data pipelines.

For detailed API parameters, see references/api-reference.md.

Trigger Words

"scrape," "crawl," "extract webpage," "convert webpage to markdown," "structured extraction," etc.

Installation

pip install -U crawl4ai
crawl4ai-setup          # Automatically installs the Playwright browser
crawl4ai-doctor         # Verifies the installation

If the browser installation fails, run manually:

python -m playwright install --with-deps chromium

Core Architecture

Three core classes:

Class             Purpose
AsyncWebCrawler   Main async crawler class; manages the browser lifecycle
BrowserConfig     Browser settings (headless, UA, proxy, viewport, etc.)
CrawlerRunConfig  Per-crawl settings (cache, extraction strategy, JS, screenshots, etc.)

Basic Usage

Simplest Crawl

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)  # LLM-ready Markdown

asyncio.run(main())

Crawl with Configuration

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

browser_cfg = BrowserConfig(headless=True, verbose=True)
run_cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,     # BYPASS skips the cache; other modes: ENABLED, DISABLED, READ_ONLY, WRITE_ONLY
    css_selector="main.article",     # Only extract the specified area
    word_count_threshold=10,         # Filter out short text blocks
    screenshot=True,                 # Take a screenshot
)

async def main():
    # "async with" is only valid inside a coroutine, so the crawl lives in main()
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_cfg)
        print(result.markdown)
        if result.screenshot:
            # result.screenshot is a base64-encoded string
            print(f"Screenshot: {len(result.screenshot)} base64 characters")

asyncio.run(main())
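Since result.screenshot is described above as base64 data, it usually needs decoding before it can be viewed. The following is a minimal sketch using only the standard library; `save_screenshot` is a hypothetical helper name, not part of Crawl4AI, and the output file extension is left to the caller.

```python
import base64
from pathlib import Path


def save_screenshot(screenshot_b64: str, path: str) -> int:
    """Decode a base64-encoded screenshot and write the raw bytes to `path`.

    Hypothetical helper (not a Crawl4AI API). Returns the number of bytes written.
    """
    raw = base64.b64decode(screenshot_b64)
    Path(path).write_bytes(raw)
    return len(raw)
```

Inside the crawl coroutine this could be called as save_screenshot(result.screenshot, "page.png") once result.screenshot is known to be non-empty.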

Command Line Tool

# Basic crawl
crwl https://example.com -o markdown

# Deep crawl (BFS, up to 10 pages)
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10

# LLM extraction
crwl https://example.com/products -q "Extract all product prices"

Markdown Generation

Using Content Filters

Raw Markdown is generated by default. Use DefaultMarkdownGenerator + content filters to get cleaner output:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# Method 1: PruningContentFilter, density-based pruning
md_gen = DefaultMarkdownGenerator(
    content_filter=PruningContentFilter(
        threshold=0.48,           # 0-1; blocks scoring below this are pruned, so higher values prune more
        threshold_type="fixed",   # "fixed" or "dynamic"
        min_word_threshold=0
    )
)

# Method 2 (alternative; overwrites md_gen from Method 1): BM25ContentFilter, query relevance-based filtering
md_gen = DefaultMarkdownGenerator(
    content_filter=BM25ContentFilter(
        user_query="machine learning",  # Keywords to focus on
        bm25_threshold=1.0
    )
)

run_cfg = CrawlerRunConfig(markdown_generator=md_gen)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="...", config=run_cfg)
        print(len(result.markdown.raw_markdown))   # Raw MD
        print(len(result.markdown.fit_markdown))   # Filtered MD

asyncio.run(main())
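To make the idea of query-relevance filtering more concrete, here is a small standalone sketch of BM25 scoring over text blocks followed by a threshold cut. It is an illustration of the general technique only, not Crawl4AI's implementation: the function names, the tokenizer, and the k1/b defaults are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(blocks: list[str], query: str, k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each text block against the query with the Okapi BM25 formula."""
    docs = [tokenize(block) for block in blocks]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: number of blocks that contain each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def filter_blocks(blocks: list[str], query: str, threshold: float) -> list[str]:
    """Keep only the blocks whose BM25 score reaches the threshold."""
    scored = zip(blocks, bm25_scores(blocks, query))
    return [block for block, score in scored if score >= threshold]
```

In this sketch a block that shares no terms with the query scores 0 and is dropped by any positive threshold, which is the intuition behind raising or lowering bm25_threshold.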

Structured Data Extraction

CSS/XPath Extraction (No LLM Required, Fast and Free)

import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, JsonCssExtractionStrategy

schema = {
    "name": "Articles",
    "baseSelector": "article.post",     # Container for repeating elements
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
    ]
}

run_cfg = CrawlerRunConfig(
    extraction_strategy=JsonCssExtractionStrategy(schema)
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/blog", config=run_cfg)
        data = json.loads(result.extracted_content)  # extracted_content is a JSON string
        print(data)  # [{"title": "...", "url": "...", "image": "..."}, ...]

asyncio.run(main())
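Attributes such as href and src are often relative in page markup, so extracted records may need a post-processing step before use. A minimal sketch, assuming the extraction result is a JSON list of objects with the "url" and "image" fields from the schema above; `absolutize_links` is a hypothetical helper, not a Crawl4AI function.

```python
import json
from urllib.parse import urljoin


def absolutize_links(extracted_content: str, base_url: str,
                     keys: tuple[str, ...] = ("url", "image")) -> list[dict]:
    """Parse the extraction JSON and resolve relative links against base_url.

    Hypothetical helper; fields that are missing or empty are left untouched.
    """
    items = json.loads(extracted_content)
    for item in items:
        for key in keys:
            if item.get(key):
                item[key] = urljoin(base_url, item[key])
    return items
```

It would be called with the page URL that was crawled, for example absolutize_links(result.extracted_content, result.url).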

Auto-Generate Schema (one-time LLM cost, then reuse for free):

from crawl4ai import JsonCssExtractionStrategy, LLMConfig

schema = JsonCssExtractionStrategy.generate_schema(
    html="<div class='product'>...",
    llm_config=LLMConfig(provider="openai/gpt-4o", api_token="your-key")
    # Or use a local model: LLMConfig(provider="ollama/llama3.3", api_token=None)
)
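The "one-time LLM cost, then reuse for free" pattern amounts to caching the generated schema. A minimal sketch of that idea follows, assuming the generated schema is a JSON-serializable dict; `load_or_generate_schema` and the cache file path are hypothetical names introduced for illustration.

```python
import json
from pathlib import Path
from typing import Callable


def load_or_generate_schema(cache_path: str, generate: Callable[[], dict]) -> dict:
    """Return the cached schema if the file exists; otherwise generate and cache it.

    `generate` is any zero-argument callable that returns the schema dict,
    so the (paid) LLM call only happens on a cache miss.
    """
    path = Path(cache_path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    schema = generate()
    path.write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return schema
```

The generate argument could wrap the generate_schema call shown above in a lambda, so later runs read the schema from disk instead of calling the LLM again.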

LLM Extraction (Suitable for Unstructured Content)

import asyncio
import json
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMExtractionStrategy, LLMConfig

class Product(BaseModel):
    name: str = Field(..., description="Product name")
    price: str = Field(..., description="Price as string")
    description: str = Field(..., description="Short description")

llm_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",     # Also supports ollama/llama3, anthropic/claude-3, etc.
        api_token="your-api-key"
    ),
    schema=Product.model_json_schema(),
    extraction_type="schema",              # "schema" or "block"
    instruction="Extract all product objects with name, price, and description.",
    chunk_token_threshold=1000,            # Auto-chunk when exceeding this token count
    overlap_rate=0.1,                      # 10% overlap between chunks
    apply_chunking=True,
    input_format="markdown",               # "markdown" | "html" | "fit_markdown"
    extra_args={"temperature": 0.0, "max_tokens": 800}
)

run_cfg = CrawlerRunConfig(extraction_strategy=llm_strategy)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/products", config=run_cfg)
        data = json.loads(result.extracted_content)  # extracted_content is a JSON string
        print(data)
        llm_strategy.show_usage()  # Print token usage statistics

asyncio.run(main())
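The chunk_token_threshold and overlap_rate settings describe a sliding window over the input. The sketch below shows that general idea on a plain token list; it is not Crawl4AI's chunking code, and `chunk_with_overlap` is a hypothetical name used only for this illustration.

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int, overlap_rate: float) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share overlap_rate of their tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_rate < 1:
        raise ValueError("overlap_rate must be in [0, 1)")
    overlap = int(chunk_size * overlap_rate)
    step = max(1, chunk_size - overlap)  # always advance by at least one token
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks
```

With chunk_size=10 and overlap_rate=0.1, each window repeats the final token of the previous one, which helps keep a record from being cut cleanly in half at a chunk boundary.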

Extraction Strategy Selection Guide

Scenario                                              Strategy
Repeating lists (products, articles, search results)  JsonCssExtractionStrategy
Unstructured text requiring AI understanding          LLMExtractionStrategy
High-frequency crawling of the same site              Generate Schema with LLM first, then extract via CSS

Dynamic Page Handling

run_cfg = CrawlerRunConfig(
    js_code=[                          # JS executed on the page
        "window.scrollTo(0, document.body.scrollHeight)",
        "await new Promise(r => setTimeout(r, 2000))",
    ],
    wait_for="css:.content-loaded",     # Wait for a specific element to appear
    delay_before_return_html=2.0,       # Additional wait in seconds before returning
)

Batch Crawling

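import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

urls = ["https://example.com/page1", "https://example.com/page2"]  # add more URLs as needed
run_cfg = CrawlerRunConfig()

async def main():
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls, config=run_cfg)
        for result in results:
            if result.success:
                print(result.markdown[:200])

asyncio.run(main())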

arun_many() automatically handles rate limiting, memory monitoring, and concurrency control.
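For batch runs it can help to separate successes from failures before processing. A minimal sketch, relying only on the url, success, and error_message fields listed under CrawlResult Key Fields; `summarize_results` is a hypothetical helper, and the demo uses SimpleNamespace stand-ins rather than real crawl results.

```python
from types import SimpleNamespace


def summarize_results(results) -> dict:
    """Group result objects into successful URLs and failed URLs with their errors."""
    ok = [r.url for r in results if r.success]
    failed = {r.url: r.error_message for r in results if not r.success}
    return {"ok": ok, "failed": failed}


# Stand-in objects that mimic the three fields used above.
demo_results = [
    SimpleNamespace(url="https://example.com/page1", success=True, error_message=None),
    SimpleNamespace(url="https://example.com/page2", success=False, error_message="timeout"),
]
demo_summary = summarize_results(demo_results)
print(demo_summary)
```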

Browser Management

browser_cfg = BrowserConfig(
    browser_type="chromium",       # "chromium" | "firefox" | "webkit"
    headless=True,
    viewport_width=1920,
    viewport_height=1080,
    user_agent="Mozilla/5.0 ...",
    proxy="http://user:pass@proxy:8080",
    use_managed_browser=True,      # Use an existing browser instance
    user_data_dir="/path/to/profile",  # Persistent profile (to retain login state)
)

Deep Crawl (Site-Level Crawling)

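import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter

# Path rules go through filter_chain; BFSDeepCrawlStrategy does not take
# include_paths/exclude_paths arguments (check this against your installed version).
deep_crawl = BFSDeepCrawlStrategy(
    max_depth=3,                    # Maximum depth
    max_pages=50,                   # Maximum number of pages
    filter_chain=FilterChain([
        URLPatternFilter(patterns=["*/docs/*"]),                # Only crawl specified paths
        URLPatternFilter(patterns=["*/blog/*"], reverse=True),  # Exclude specified paths
    ]),
)

run_cfg = CrawlerRunConfig(deep_crawl_strategy=deep_crawl)

async def main():
    async with AsyncWebCrawler() as crawler:
        # With a deep crawl strategy, arun() returns a list of results
        results = await crawler.arun(url="https://example.com", config=run_cfg)
        for r in results:
            print(f"{r.url} → {len(r.markdown)} chars")

asyncio.run(main())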
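To show how glob-style path rules such as /docs/* and /blog/* narrow a crawl, here is a standalone sketch of include/exclude matching on the URL path. It is a simplified illustration using the standard library, not the matching logic Crawl4AI uses; `path_allowed` is a hypothetical name.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def path_allowed(url: str, include: list[str], exclude: list[str]) -> bool:
    """Return True if the URL path passes the exclude rules and, if any are given, the include rules."""
    path = urlparse(url).path or "/"
    # Exclusions win over inclusions.
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return False
    # An empty include list means "allow everything not excluded".
    return not include or any(fnmatchcase(path, pattern) for pattern in include)
```

Checking candidate URLs this way before a run is a cheap way to confirm that the patterns select the intended part of a site.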

Docker Deployment

docker pull unclecode/crawl4ai:latest
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest

# Dashboard: http://localhost:11235/dashboard
# Playground: http://localhost:11235/playground

Python Client:

import requests

resp = requests.post("http://localhost:11235/crawl",
    json={"urls": ["https://example.com"], "priority": 10})

task_id = resp.json()["task_id"]
result = requests.get(f"http://localhost:11235/task/{task_id}")
print(result.json())
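A submitted task may not be finished when it is first queried, so a client typically polls until it is. The sketch below shows one way to do that with a timeout. It assumes the task response is a JSON object with a "status" field that eventually becomes "completed" or "failed"; those field and status names are assumptions to verify against your server version, and `wait_for_task` is a hypothetical helper.

```python
import time
from typing import Callable


def wait_for_task(fetch_status: Callable[[], dict],
                  timeout: float = 60.0, interval: float = 1.0) -> dict:
    """Call fetch_status() until it reports a terminal status or the timeout expires.

    The "status" key and the "completed"/"failed" values are assumptions
    about the response shape, not confirmed API details.
    """
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch_status()
        if payload.get("status") in ("completed", "failed"):
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish before the timeout")
        time.sleep(interval)
```

With the client above, fetch_status could be a lambda that performs the requests.get call on the task URL and returns its .json() payload.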

CrawlResult Key Fields

result.url              # Final URL (after any redirects)
result.html             # Raw HTML
result.cleaned_html     # Cleaned HTML
result.markdown         # Markdown formatted output (contains raw_markdown and fit_markdown)
result.extracted_content # JSON string returned by the extraction strategy
result.screenshot       # Base64 screenshot
result.media            # Image/video information
result.links            # Internal and external link information
result.success          # Whether the crawl was successful
result.error_message    # Error message

FAQ

Playwright browser not installed:

python -m playwright install --with-deps chromium

Cache issues causing stale data to be returned: Set cache_mode=CacheMode.BYPASS to skip the cache.

Dynamic content not loading: Use wait_for="css:selector" to wait for the target element, or js_code to execute scrolling.

Out of memory (batch crawling): Reduce the concurrency level; arun_many() automatically monitors memory and adapts.

Anti-bot / detection: Enable use_managed_browser=True in BrowserConfig or configure a proxy.

Security Recommendations
Review before installing. Use a sandbox or virtual environment, pin trusted dependency versions when possible, set strict crawl limits, and avoid using your main browser profile or personal cookies. For private or authenticated pages, use a temporary account/profile and confirm whether content will be sent to an external LLM provider.
Capability Tags
requires-sensitive-credentials
Capability Assessment
Purpose & Capability
The stated purpose—using Crawl4AI for scraping, crawling, Markdown conversion, and structured extraction—matches the documented examples and API reference.
Instruction Scope
The instructions expose powerful crawler options, including deep crawling, JavaScript execution, anti-detection behavior, and reuse of authenticated browser state, but do not clearly require explicit user approval or scoping before using those higher-risk options.
Install Mechanism
The skill is instruction-only, but SKILL.md tells users to install an unpinned PyPI package and Playwright browser components. This is purpose-aligned for a browser crawler, but users should trust the package source before installing.
Credentials
For a web crawler, external website access and optional LLM-provider use are expected, but the reference also includes browser profile, cookie, and auth-state options that could access logged-in/private pages without clear containment guidance.
Persistence & Privilege
The reference documents reusable browser sessions, storage state, cache modes, and browser profile paths. These can persist or reuse authentication state and should be tightly scoped.
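How to Use
  1. Make sure OpenClaw is installed (locally or via Docker).
  2. Enter the install command in the chat box: /install crawl4ai-web-crawler
  3. Once installed, invoke the Skill by name or trigger it with /crawl4ai-web-crawler
  4. Provide the inputs required by the Skill's parameter description to get structured output.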
Version History
v1.0.1
- Renamed and rebranded the skill to "crawl4ai-web-crawler" with updated purpose and trigger words.
- Replaced the prior RAGFlow documentation with a comprehensive Crawl4AI usage guide, including installation, core classes, and real-world examples.
- Added API reference pointer: references/api-reference.md.
- Removed prior references (architecture.md, cli-reference.md, and deployment.md) to focus the documentation on Crawl4AI features and usage.
v1.0.0
- Initial release of the crawl4ai-web-crawler skill under the name "ragflow".
- Provides guidance for deploying, configuring, managing, and troubleshooting the RAGFlow open-source Retrieval-Augmented Generation (RAG) engine.
- Includes detailed Docker deployment and quick-start instructions, system prerequisites, and configuration file references.
- Covers CLI usage for managing datasets, documents, agents, and chats.
- Offers troubleshooting tips, architecture overview, and links to documentation, support channels, and source code.
Metadata
Slug: crawl4ai-web-crawler
Version: 1.0.1
License: MIT-0
Total installs: 0
Current installs: 0
Versions: 2
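Frequently Asked Questions

What is Crawl4AI Web Crawler?

Use Crawl4AI for web scraping and content extraction. Use when users need to scrape web content, extract structured data, convert web pages to Markdown, perf... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 80 total downloads to date.

How do I install Crawl4AI Web Crawler?

Run the command "/install crawl4ai-web-crawler" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is needed.

Is Crawl4AI Web Crawler free?

Yes. Crawl4AI Web Crawler is completely free, released under the MIT-0 license, and can be freely downloaded, installed, and used.

Which platforms does Crawl4AI Web Crawler support?

Crawl4AI Web Crawler is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops Crawl4AI Web Crawler?

It is developed and maintained by OpenLark (@openlark); the current version is v1.0.1.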