Description

Use Crawl4AI for web scraping and content extraction. Use when users need to scrape web content, extract structured data, convert web pages to Markdown, perf...

Usage Instructions (SKILL.md)

Crawl4AI Web Crawler

Name: Crawl4AI Web Crawler
Author: openlark

Crawl4AI is an open-source, LLM-friendly web crawler, hosted on GitHub, that converts web pages into clean Markdown or structured JSON. It is well suited to RAG, AI agents, and data pipelines.

For detailed API parameters, see references/api-reference.md.

Trigger Words

"scrape," "crawl," "extract webpage," "convert webpage to markdown," "structured extraction," etc.

Installation

pip install -U crawl4ai
crawl4ai-setup          # Automatically installs the Playwright browser
crawl4ai-doctor         # Verifies the installation

If the browser installation fails, run manually:

python -m playwright install --with-deps chromium

Core Architecture

Three core classes:

Class	Purpose
`AsyncWebCrawler`	Main async crawler class, manages the browser lifecycle
`BrowserConfig`	Browser settings (headless, UA, proxy, viewport, etc.)
`CrawlerRunConfig`	Per-crawl settings (cache, extraction strategy, JS, screenshots, etc.)

Basic Usage

Simplest Crawl

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)  # LLM-ready Markdown

asyncio.run(main())

Crawl with Configuration

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

browser_cfg = BrowserConfig(headless=True, verbose=True)
run_cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,     # BYPASS skips the cache; other modes: ENABLED, WRITE_ONLY, READ_ONLY
    css_selector="main.article",     # Only extract the specified area
    word_count_threshold=10,         # Filter out short text blocks
    screenshot=True,                 # Take a screenshot
)

async def main():
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_cfg)
        print(result.markdown)
        if result.screenshot:
            # result.screenshot is a base64-encoded string
            print(f"Screenshot: {len(result.screenshot)} base64 characters")

asyncio.run(main())

Command Line Tool

# Basic crawl
crwl https://example.com -o markdown

# Deep crawl (BFS, up to 10 pages)
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10

# LLM extraction
crwl https://example.com/products -q "Extract all product prices"

Markdown Generation

Using Content Filters

Raw Markdown is generated by default. Use DefaultMarkdownGenerator + content filters to get cleaner output:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# Method 1: PruningContentFilter, density-based pruning
md_gen = DefaultMarkdownGenerator(
    content_filter=PruningContentFilter(
        threshold=0.48,           # 0-1; the higher the value, the more is pruned
        threshold_type="fixed",   # "fixed" or "dynamic"
        min_word_threshold=0
    )
)

# Method 2: BM25ContentFilter, query relevance-based filtering
# (reassigning md_gen here replaces Method 1; keep only the one you need)
md_gen = DefaultMarkdownGenerator(
    content_filter=BM25ContentFilter(
        user_query="machine learning",  # Keywords to focus on
        bm25_threshold=1.0
    )
)

run_cfg = CrawlerRunConfig(markdown_generator=md_gen)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="...", config=run_cfg)
        print(len(result.markdown.raw_markdown))   # Raw MD
        print(len(result.markdown.fit_markdown))   # Filtered MD

asyncio.run(main())
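
To make the role of a relevance threshold such as bm25_threshold concrete, here is a minimal, standard-library sketch of Okapi BM25 scoring over text blocks. It is not Crawl4AI's implementation; the function names `bm25_scores` and `filter_blocks`, the whitespace tokenizer, and the `k1`/`b` defaults are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(blocks, query, k1=1.5, b=0.75):
    """Score each text block against the query with plain Okapi BM25."""
    docs = [block.lower().split() for block in blocks]
    terms = query.lower().split()
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many blocks contain each query term.
    df = {t: sum(1 for d in docs if t in d) for t in terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


def filter_blocks(blocks, query, threshold):
    """Keep only blocks whose score reaches the threshold."""
    scored = zip(blocks, bm25_scores(blocks, query))
    return [block for block, score in scored if score >= threshold]


blocks = [
    "Machine learning models learn patterns from data",
    "Subscribe to our newsletter for weekly updates",
    "Cookie policy and terms of service",
]
kept = filter_blocks(blocks, "machine learning", threshold=1.0)
print(kept)
```

In this toy setup, raising the threshold keeps fewer blocks, and blocks that share no terms with the query score zero.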

Structured Data Extraction

CSS/XPath Extraction (No LLM Required, Fast and Free)

import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, JsonCssExtractionStrategy

schema = {
    "name": "Articles",
    "baseSelector": "article.post",     # Container for repeating elements
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
    ]
}

run_cfg = CrawlerRunConfig(
    extraction_strategy=JsonCssExtractionStrategy(schema)
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/blog", config=run_cfg)
        data = json.loads(result.extracted_content)
        print(data)  # [{"title": "...", "url": "...", "image": "..."}, ...]

asyncio.run(main())
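
Extracted href and src values are often relative. A small post-processing step, sketched below with only the standard library, can resolve them against the page URL. The helper name `absolutize` and the sample JSON string are hypothetical; the sample only mimics the record shape shown in the comment above.

```python
import json
from urllib.parse import urljoin


def absolutize(records, base_url, url_fields=("url", "image")):
    """Return copies of the records with relative URL fields resolved."""
    cleaned = []
    for record in records:
        record = dict(record)  # copy so the input list is left untouched
        for field in url_fields:
            if record.get(field):
                record[field] = urljoin(base_url, record[field])
        cleaned.append(record)
    return cleaned


# Stand-in for result.extracted_content (a JSON string).
extracted_content = '[{"title": "Hello", "url": "/posts/hello", "image": "img/a.png"}]'
rows = absolutize(json.loads(extracted_content), "https://example.com/blog")
print(rows)
```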

Auto-Generate Schema (one-time LLM cost, then reuse for free):

from crawl4ai import JsonCssExtractionStrategy, LLMConfig

schema = JsonCssExtractionStrategy.generate_schema(
    html="<div class='product'>...",
    llm_config=LLMConfig(provider="openai/gpt-4o", api_token="your-key")
    # Or use a local model: LLMConfig(provider="ollama/llama3.3", api_token=None)
)
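
One way to get the "pay once, reuse for free" behavior is to cache the generated schema on disk. The sketch below assumes the schema is a JSON-serializable dict; `load_or_create_schema` and `fake_generate` are hypothetical names, and `fake_generate` stands in for the real LLM-backed call so the example runs offline.

```python
import json
import tempfile
from pathlib import Path


def load_or_create_schema(path, generate):
    """Load a cached schema, or call generate() once and store the result."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    schema = generate()
    path.write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return schema


calls = []


def fake_generate():
    # Placeholder for the one-time schema generation call.
    calls.append(1)
    return {"name": "Products", "baseSelector": "div.product", "fields": []}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "product_schema.json"
    first = load_or_create_schema(cache_file, fake_generate)
    second = load_or_create_schema(cache_file, fake_generate)

print(len(calls))  # generate() ran only on the first call
```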

LLM Extraction (Suitable for Unstructured Content)

import asyncio
import json
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMExtractionStrategy, LLMConfig

class Product(BaseModel):
    name: str = Field(..., description="Product name")
    price: str = Field(..., description="Price as string")
    description: str = Field(..., description="Short description")

llm_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",     # Also supports ollama/llama3, anthropic/claude-3, etc.
        api_token="your-api-key"
    ),
    schema=Product.model_json_schema(),
    extraction_type="schema",              # "schema" or "block"
    instruction="Extract all product objects with name, price, and description.",
    chunk_token_threshold=1000,            # Auto-chunk when exceeding this token count
    overlap_rate=0.1,                      # 10% overlap between chunks
    apply_chunking=True,
    input_format="markdown",               # "markdown" | "html" | "fit_markdown"
    extra_args={"temperature": 0.0, "max_tokens": 800}
)

run_cfg = CrawlerRunConfig(extraction_strategy=llm_strategy)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/products", config=run_cfg)
        data = json.loads(result.extracted_content)
        print(data)
        llm_strategy.show_usage()  # Print token usage statistics

asyncio.run(main())
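
The interaction between chunk_token_threshold and overlap_rate can be pictured with a generic sliding-window splitter. This is only a sketch of the general technique, not the library's actual chunking code; `chunk_with_overlap` is a hypothetical helper and the token list is synthetic.

```python
def chunk_with_overlap(tokens, chunk_size, overlap_rate):
    """Split tokens into windows of chunk_size that share a fraction of tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    overlap = int(chunk_size * overlap_rate)
    step = max(1, chunk_size - overlap)  # always advance by at least one token
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


tokens = [f"t{i}" for i in range(2500)]
chunks = chunk_with_overlap(tokens, chunk_size=1000, overlap_rate=0.1)
print([len(c) for c in chunks])
```

With 2,500 tokens, a window of 1,000, and a 10% overlap, this yields three chunks, and each chunk starts with the last 100 tokens of the previous one.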

Extraction Strategy Selection Guide

Scenario	Strategy
Repeating lists (products, articles, search results)	`JsonCssExtractionStrategy`
Unstructured text requiring AI understanding	`LLMExtractionStrategy`
High-frequency crawling of the same site	Generate Schema with LLM first, then extract via CSS

Dynamic Page Handling

run_cfg = CrawlerRunConfig(
    js_code=[                          # JS executed on the page
        "window.scrollTo(0, document.body.scrollHeight)",
        "await new Promise(r => setTimeout(r, 2000))",
    ],
    wait_for="css:.content-loaded",     # Wait for a specific element to appear
    delay_before_return_html=2.0,       # Additional wait in seconds before returning
)

Batch Crawling

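import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

urls = ["https://example.com/page1", "https://example.com/page2"]  # ... add more URLs
run_cfg = CrawlerRunConfig()  # Or reuse any run config defined earlier

async def main():
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls, config=run_cfg)
        for result in results:
            if result.success:
                print(result.markdown[:200])

asyncio.run(main())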

arun_many() automatically handles rate limiting, memory monitoring, and concurrency control.
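
After a batch run it is usually worth separating successes from failures instead of silently skipping the latter. The sketch below uses `SimpleNamespace` objects as stand-ins for crawl results, relying only on the `url`, `success`, and `error_message` fields; `split_results` is a hypothetical helper.

```python
from types import SimpleNamespace


def split_results(results):
    """Return (successful results, {url: error message} for failures)."""
    succeeded = [r for r in results if r.success]
    failed = {r.url: r.error_message for r in results if not r.success}
    return succeeded, failed


# Stand-ins for the objects returned by a batch crawl.
results = [
    SimpleNamespace(url="https://example.com/page1", success=True,
                    error_message=None, markdown="# Page 1"),
    SimpleNamespace(url="https://example.com/page2", success=False,
                    error_message="timeout", markdown=None),
]
succeeded, failed = split_results(results)
print(len(succeeded), failed)
```

The `failed` mapping can then drive a retry pass or be written to a log.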

Browser Management

browser_cfg = BrowserConfig(
    browser_type="chromium",       # "chromium" | "firefox" | "webkit"
    headless=True,
    viewport_width=1920,
    viewport_height=1080,
    user_agent="Mozilla/5.0 ...",
    proxy="http://user:pass@proxy:8080",
    use_managed_browser=True,      # Use an existing browser instance
    user_data_dir="/path/to/profile",  # Persistent profile (to retain login state)
)

Deep Crawl (Site-Level Crawling)

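import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter

# Path scoping goes through filter_chain; verify the filter arguments against your installed version
deep_crawl = BFSDeepCrawlStrategy(
    max_depth=3,                    # Maximum depth
    max_pages=50,                   # Maximum number of pages
    filter_chain=FilterChain([
        URLPatternFilter(patterns=["*/docs/*"]),                # Only crawl specified paths
        URLPatternFilter(patterns=["*/blog/*"], reverse=True),  # Exclude specified paths
    ]),
)

run_cfg = CrawlerRunConfig(deep_crawl_strategy=deep_crawl)

async def main():
    async with AsyncWebCrawler() as crawler:
        # With a deep crawl strategy, arun() returns a list of results
        results = await crawler.arun(url="https://example.com", config=run_cfg)
        for r in results:
            print(f"{r.url} → {len(r.markdown)} chars")

asyncio.run(main())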

Docker Deployment

docker pull unclecode/crawl4ai:latest
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest

# Dashboard: http://localhost:11235/dashboard
# Playground: http://localhost:11235/playground

Python Client:

import requests

resp = requests.post("http://localhost:11235/crawl",
    json={"urls": ["https://example.com"], "priority": 10})

task_id = resp.json()["task_id"]
result = requests.get(f"http://localhost:11235/task/{task_id}")
print(result.json())
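
A task submitted this way may not be finished on the first GET, so a client would typically poll. The helper below is a generic polling sketch with an injected fetch function so it runs without a server. The `status` field and the `"completed"` value are assumptions about the response shape, not confirmed API details, and `poll_until` is a hypothetical name.

```python
import time


def poll_until(fetch, is_done, interval=1.0, timeout=60.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call fetch() until is_done(payload) is true or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        payload = fetch()
        if is_done(payload):
            return payload
        if clock() >= deadline:
            raise TimeoutError("task did not finish before the timeout")
        sleep(interval)


# Fake server responses; a real client would issue the GET request inside fetch().
responses = iter([
    {"status": "pending"},
    {"status": "pending"},
    {"status": "completed", "result": {"markdown": "# Example"}},
])
final = poll_until(
    fetch=lambda: next(responses),
    is_done=lambda payload: payload.get("status") == "completed",
    interval=0,
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(final["status"])
```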

CrawlResult Key Fields

result.url              # Final URL (after any redirects)
result.html             # Raw HTML
result.cleaned_html     # Cleaned HTML
result.markdown         # Markdown formatted output (contains raw_markdown and fit_markdown)
result.extracted_content # JSON string returned by the extraction strategy
result.screenshot       # Base64 screenshot
result.media            # Image/video information
result.links            # Internal and external link information
result.success          # Whether the crawl was successful
result.error_message    # Error message
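
Since result.screenshot is a base64 string, saving it to disk means decoding it first. The sketch below uses a fabricated payload in place of a real screenshot; `save_screenshot` is a hypothetical helper, and the `.png` extension is an assumption about the image format.

```python
import base64
import tempfile
from pathlib import Path


def save_screenshot(b64_data, path):
    """Decode a base64 screenshot string and write the raw bytes to path."""
    raw = base64.b64decode(b64_data)
    Path(path).write_bytes(raw)
    return len(raw)


# Fabricated stand-in for result.screenshot: a PNG signature plus padding bytes.
fake_screenshot = base64.b64encode(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8).decode("ascii")

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "page.png"
    written = save_screenshot(fake_screenshot, target)
    header = target.read_bytes()[:8]

print(written)
```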

FAQ

Playwright browser not installed:

python -m playwright install --with-deps chromium

Cache issues causing stale data to be returned: Set cache_mode=CacheMode.BYPASS to skip the cache.

Dynamic content not loading: Use wait_for="css:selector" to wait for the target element, or js_code to execute scrolling.

Out of memory (batch crawling): Reduce the concurrency level; arun_many() automatically monitors memory and adapts.

Anti-bot / detection: Enable use_managed_browser=True in BrowserConfig or configure a proxy.
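
For the out-of-memory case, one simple way to reduce pressure is to split a long URL list into smaller batches and crawl them one after another. The sketch below shows only the batching step; `batched` is a hypothetical helper, and each batch could then be passed to arun_many() in turn.

```python
def batched(urls, size):
    """Split a list of URLs into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [urls[i:i + size] for i in range(0, len(urls), size)]


urls = [f"https://example.com/page{i}" for i in range(1, 8)]
batches = batched(urls, 3)
print([len(batch) for batch in batches])
```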

Safe Usage Recommendations

Review before installing. Use a sandbox or virtual environment, pin trusted dependency versions when possible, set strict crawl limits, and avoid using your main browser profile or personal cookies. For private or authenticated pages, use a temporary account/profile and confirm whether content will be sent to an external LLM provider.
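
A strict crawl limit can be enforced in code before any URL is handed to the crawler. The sketch below is one possible scope check built on the standard library; `in_scope`, the host allowlist, and the path prefixes are hypothetical and would need to match your own targets.

```python
from urllib.parse import urlparse


def in_scope(url, allowed_hosts, allowed_prefixes=("/",)):
    """Allow only http(s) URLs on approved hosts and under approved path prefixes."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    if host not in allowed_hosts:
        return False
    path = parts.path or "/"
    return any(path.startswith(prefix) for prefix in allowed_prefixes)


allowed_hosts = {"docs.example.com"}
candidates = [
    "https://docs.example.com/docs/intro",
    "https://docs.example.com/blog/news",
    "https://evil.test/docs/intro",
    "file:///etc/passwd",
]
approved = [u for u in candidates if in_scope(u, allowed_hosts, ("/docs/",))]
print(approved)
```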

Capability Tags

requires-sensitive-credentials

Capability Assessment

ℹ Purpose & Capability

The stated purpose—using Crawl4AI for scraping, crawling, Markdown conversion, and structured extraction—matches the documented examples and API reference.

⚠ Instruction Scope

The instructions expose powerful crawler options, including deep crawling, JavaScript execution, anti-detection behavior, and reuse of authenticated browser state, but do not clearly require explicit user approval or scoping before using those higher-risk options.

ℹ Install Mechanism

The skill is instruction-only, but SKILL.md tells users to install an unpinned PyPI package and Playwright browser components. This is purpose-aligned for a browser crawler, but users should trust the package source before installing.

⚠ Credentials

For a web crawler, external website access and optional LLM-provider use are expected, but the reference also includes browser profile, cookie, and auth-state options that could access logged-in/private pages without clear containment guidance.

⚠ Persistence & Privilege

The reference documents reusable browser sessions, storage state, cache modes, and browser profile paths. These can persist or reuse authentication state and should be tightly scoped.

Version History

v1.0.1

- Renamed and rebranded the skill to "crawl4ai-web-crawler" with updated purpose and trigger words.
- Replaced the prior RAGFlow documentation with a comprehensive Crawl4AI usage guide, including installation, core classes, and real-world examples.
- Added API reference pointer: references/api-reference.md.
- Removed prior references (architecture.md, cli-reference.md, and deployment.md) to focus the documentation on Crawl4AI features and usage.

v1.0.0

- Initial release of the crawl4ai-web-crawler skill under the name "ragflow".
- Provides guidance for deploying, configuring, managing, and troubleshooting the RAGFlow open-source Retrieval-Augmented Generation (RAG) engine.
- Includes detailed Docker deployment and quick-start instructions, system prerequisites, and configuration file references.
- Covers CLI usage for managing datasets, documents, agents, and chats.
- Offers troubleshooting tips, architecture overview, and links to documentation, support channels, and source code.

Metadata

Slug crawl4ai-web-crawler

Version 1.0.1

License MIT-0

Total installs 0

Current installs 0

Number of versions 2

Frequently Asked Questions