openlark

Crawl4AI Web Crawler

by OpenLark · GitHub ↗ · v1.0.1 · MIT-0
cross-platform · ⚠ suspicious
Downloads: 80 · Stars: 0 · Active Installs: 0 · Versions: 2
Install in OpenClaw
/install crawl4ai-web-crawler
Description
Use Crawl4AI for web scraping and content extraction. Use when users need to scrape web content, extract structured data, convert web pages to Markdown, perf...
README (SKILL.md)

Crawl4AI Web Crawler

Crawl4AI is an open-source, LLM-friendly web crawler (hosted on GitHub) that converts web pages into clean Markdown or structured JSON, which makes it a good fit for RAG, AI agents, and data pipelines.

For detailed API parameters, see references/api-reference.md.

Trigger Words

"scrape," "crawl," "extract webpage," "convert webpage to markdown," "structured extraction," etc.

Installation

pip install -U crawl4ai
crawl4ai-setup          # Automatically installs the Playwright browser
crawl4ai-doctor         # Verifies the installation

If the browser installation fails, run manually:

python -m playwright install --with-deps chromium

Core Architecture

Three core classes:

| Class | Purpose |
| --- | --- |
| AsyncWebCrawler | Main async crawler class; manages the browser lifecycle |
| BrowserConfig | Browser settings (headless, UA, proxy, viewport, etc.) |
| CrawlerRunConfig | Per-crawl settings (cache, extraction strategy, JS, screenshots, etc.) |

Basic Usage

Simplest Crawl

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)  # LLM-ready Markdown

asyncio.run(main())

Crawl with Configuration

import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

browser_cfg = BrowserConfig(headless=True, verbose=True)
run_cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,     # BYPASS skips the cache; other modes: ENABLED, WRITE_ONLY, READ_ONLY
    css_selector="main.article",     # Only extract the specified area
    word_count_threshold=10,         # Filter out short text blocks
    screenshot=True,                 # Take a screenshot
)

async def main():
    # "async with" and "await" are only valid inside a coroutine
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_cfg)
        print(result.markdown)
        if result.screenshot:
            print(f"Screenshot: {len(result.screenshot)} base64 characters")

asyncio.run(main())
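The screenshot comes back as a base64 string, so it has to be decoded before it can be opened as an image. A minimal sketch, assuming the payload is standard base64; the helper name `save_screenshot` and the output path are hypothetical and not part of Crawl4AI:

```python
import base64
from pathlib import Path


def save_screenshot(b64_data: str, path: str) -> int:
    """Decode a base64 screenshot string and write the bytes to disk.

    Hypothetical helper, not a Crawl4AI API. Returns the number of bytes written.
    """
    raw = base64.b64decode(b64_data)
    Path(path).write_bytes(raw)
    return len(raw)
```

Inside the crawl above this would be called as `save_screenshot(result.screenshot, "page.png")` after checking that `result.screenshot` is not empty.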

Command Line Tool

# Basic crawl
crwl https://example.com -o markdown

# Deep crawl (BFS, up to 10 pages)
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10

# LLM extraction
crwl https://example.com/products -q "Extract all product prices"

Markdown Generation

Using Content Filters

Raw Markdown is generated by default. Combine DefaultMarkdownGenerator with a content filter to get cleaner output:

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# Method 1: PruningContentFilter (density-based pruning)
pruning_md_gen = DefaultMarkdownGenerator(
    content_filter=PruningContentFilter(
        threshold=0.48,           # 0-1; blocks scoring below this are dropped, so higher values prune more
        threshold_type="fixed",   # "fixed" or "dynamic"
        min_word_threshold=0
    )
)

# Method 2: BM25ContentFilter (query relevance-based filtering)
bm25_md_gen = DefaultMarkdownGenerator(
    content_filter=BM25ContentFilter(
        user_query="machine learning",  # Keywords to focus on
        bm25_threshold=1.0
    )
)

# Pick one generator per run; swap in bm25_md_gen to use Method 2
run_cfg = CrawlerRunConfig(markdown_generator=pruning_md_gen)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="...", config=run_cfg)
        print(len(result.markdown.raw_markdown))   # Raw MD
        print(len(result.markdown.fit_markdown))   # Filtered MD

asyncio.run(main())
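When a filter is too aggressive, `fit_markdown` can come back empty, so downstream code may want a fallback to `raw_markdown`. A minimal sketch under that assumption; `best_markdown` is a hypothetical helper, and only the two attribute names are taken from the example above:

```python
from types import SimpleNamespace


def best_markdown(markdown_result) -> str:
    """Return fit_markdown when it has content, otherwise raw_markdown.

    Hypothetical helper, not a Crawl4AI API.
    """
    fit = getattr(markdown_result, "fit_markdown", None)
    if fit and fit.strip():
        return fit
    return getattr(markdown_result, "raw_markdown", "") or ""


# Stand-in object so the sketch runs without performing a crawl.
demo = SimpleNamespace(raw_markdown="# Title\n\nBody text", fit_markdown="")
print(best_markdown(demo))
```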

Structured Data Extraction

CSS/XPath Extraction (No LLM Required, Fast and Free)

import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, JsonCssExtractionStrategy

schema = {
    "name": "Articles",
    "baseSelector": "article.post",     # Container for repeating elements
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
    ]
}

run_cfg = CrawlerRunConfig(
    extraction_strategy=JsonCssExtractionStrategy(schema)
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/blog", config=run_cfg)
        data = json.loads(result.extracted_content)
        print(data)  # [{"title": "...", "url": "...", "image": "..."}, ...]

asyncio.run(main())
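Extracted `href` and `src` values are often relative, and some containers may not match every selector. The sketch below post-processes the JSON string under those assumptions; `normalize_records` is a hypothetical helper, and the field names mirror the example schema above:

```python
import json
from urllib.parse import urljoin


def normalize_records(extracted_content: str, base_url: str) -> list[dict]:
    """Parse the JSON string, drop rows without a title, and absolutize URLs."""
    rows = json.loads(extracted_content)
    cleaned = []
    for row in rows:
        title = (row.get("title") or "").strip()
        if not title:
            continue  # skip containers where the title selector matched nothing
        cleaned.append({
            "title": title,
            "url": urljoin(base_url, row.get("url") or ""),
            "image": urljoin(base_url, row["image"]) if row.get("image") else None,
        })
    return cleaned
```

Passing the page URL as `base_url` resolves links relative to the page they were found on.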

Auto-Generate Schema (one-time LLM cost, then reuse for free):

from crawl4ai import JsonCssExtractionStrategy, LLMConfig

schema = JsonCssExtractionStrategy.generate_schema(
    html="<div class='product'>...",
    llm_config=LLMConfig(provider="openai/gpt-4o", api_token="your-key")
    # Or use a local model: LLMConfig(provider="ollama/llama3.3", api_token=None)
)
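To actually get the "reuse for free" part, the generated schema has to be stored somewhere. One possible approach is a small file cache, sketched below. It assumes the schema is a JSON-serializable dict; `load_or_create_schema` and the `generate` callback are hypothetical names, with the callback standing in for the `generate_schema` call:

```python
import json
from pathlib import Path
from typing import Callable


def load_or_create_schema(cache_path: str, generate: Callable[[], dict]) -> dict:
    """Return the cached schema if the file exists; otherwise generate and cache it."""
    path = Path(cache_path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    schema = generate()  # the only place an LLM call would happen
    path.write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return schema
```

The callback keeps the helper independent of any provider: wrap the schema-generation call in a `lambda` and it only runs on a cache miss.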

LLM Extraction (Suitable for Unstructured Content)

import asyncio
import json
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMConfig, LLMExtractionStrategy

class Product(BaseModel):
    name: str = Field(..., description="Product name")
    price: str = Field(..., description="Price as string")
    description: str = Field(..., description="Short description")

llm_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",     # Also supports ollama/llama3, anthropic/claude-3, etc.
        api_token="your-api-key"
    ),
    schema=Product.model_json_schema(),
    extraction_type="schema",              # "schema" or "block"
    instruction="Extract all product objects with name, price, and description.",
    chunk_token_threshold=1000,            # Auto-chunk when exceeding this token count
    overlap_rate=0.1,                      # 10% overlap between chunks
    apply_chunking=True,
    input_format="markdown",               # "markdown" | "html" | "fit_markdown"
    extra_args={"temperature": 0.0, "max_tokens": 800}
)

run_cfg = CrawlerRunConfig(extraction_strategy=llm_strategy)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/products", config=run_cfg)
        data = json.loads(result.extracted_content)
        print(data)
        llm_strategy.show_usage()  # Print token usage statistics

asyncio.run(main())
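The interaction between `chunk_token_threshold` and `overlap_rate` is easier to see on a toy input. The sketch below is only an illustration of overlapping windows: it counts words rather than tokens and is not Crawl4AI's actual chunking code. `chunk_with_overlap` is a hypothetical name:

```python
def chunk_with_overlap(words: list[str], chunk_size: int, overlap_rate: float) -> list[list[str]]:
    """Split words into windows of chunk_size that share about overlap_rate of their length."""
    if chunk_size <= 0 or not 0 <= overlap_rate < 1:
        raise ValueError("chunk_size must be positive and overlap_rate in [0, 1)")
    overlap = int(chunk_size * overlap_rate)
    step = chunk_size - overlap  # at least 1 because overlap < chunk_size
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the input
    return chunks


words = [f"w{i}" for i in range(25)]
for chunk in chunk_with_overlap(words, chunk_size=10, overlap_rate=0.2):
    print(chunk[0], "...", chunk[-1])
```

With 25 words, a window of 10, and 20% overlap, each window starts 8 words after the previous one, so neighbouring chunks share two words of context.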

Extraction Strategy Selection Guide

| Scenario | Strategy |
| --- | --- |
| Repeating lists (products, articles, search results) | JsonCssExtractionStrategy |
| Unstructured text requiring AI understanding | LLMExtractionStrategy |
| High-frequency crawling of the same site | Generate Schema with LLM first, then extract via CSS |

Dynamic Page Handling

run_cfg = CrawlerRunConfig(
    js_code=[                          # JS executed on the page
        "window.scrollTo(0, document.body.scrollHeight)",
        "await new Promise(r => setTimeout(r, 2000))",
    ],
    wait_for="css:.content-loaded",     # Wait for a specific element to appear
    delay_before_return_html=2.0,       # Additional wait in seconds before returning
)

Batch Crawling

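import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

urls = ["https://example.com/page1", "https://example.com/page2"]  # Add more URLs as needed
run_cfg = CrawlerRunConfig()  # Defaults; any per-crawl config from the earlier sections works here

async def main():
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls, config=run_cfg)
        for result in results:
            if result.success:
                print(result.markdown[:200])

asyncio.run(main())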

arun_many() automatically handles rate limiting, memory monitoring, and concurrency control.
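In a batch, some URLs will usually fail, and it helps to collect those separately instead of silently skipping them. A minimal sketch that relies only on the `url`, `success`, and `error_message` fields; `split_results` is a hypothetical helper and the demo objects are stand-ins for real crawl results:

```python
from types import SimpleNamespace


def split_results(results):
    """Partition crawl results into (successful URLs, {failed URL: error message})."""
    ok, failed = [], {}
    for r in results:
        if r.success:
            ok.append(r.url)
        else:
            failed[r.url] = r.error_message or "unknown error"
    return ok, failed


# Stand-in objects carrying only the fields used above.
demo = [
    SimpleNamespace(url="https://example.com/page1", success=True, error_message=None),
    SimpleNamespace(url="https://example.com/page2", success=False, error_message="timeout"),
]
ok, failed = split_results(demo)
print(f"{len(ok)} succeeded, {len(failed)} failed")
```

The failed URLs can then be retried in a second, smaller batch.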

Browser Management

browser_cfg = BrowserConfig(
    browser_type="chromium",       # "chromium" | "firefox" | "webkit"
    headless=True,
    viewport_width=1920,
    viewport_height=1080,
    user_agent="Mozilla/5.0 ...",
    proxy="http://user:pass@proxy:8080",
    use_managed_browser=True,      # Use an existing browser instance
    user_data_dir="/path/to/profile",  # Persistent profile (to retain login state)
)
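Because `user_data_dir` retains login state, a throwaway directory is a safer default than a real browser profile. The sketch below creates one and removes it afterwards, using only the standard library; `temporary_profile` is a hypothetical helper name:

```python
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_profile(prefix: str = "crawl-profile-"):
    """Yield a fresh, empty directory and delete it when the block exits."""
    path = tempfile.mkdtemp(prefix=prefix)
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)
```

Usage would look like `with temporary_profile() as profile_dir:` followed by passing `user_data_dir=profile_dir` to `BrowserConfig`, so cookies written during the crawl disappear with the directory.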

Deep Crawl (Site-Level Crawling)

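import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter

# Path rules go through filter_chain; BFSDeepCrawlStrategy does not appear to accept
# include_paths/exclude_paths directly (verify against references/api-reference.md)
deep_crawl = BFSDeepCrawlStrategy(
    max_depth=3,                    # Maximum depth
    max_pages=50,                   # Maximum number of pages
    filter_chain=FilterChain([
        URLPatternFilter(patterns=["*/docs/*"]),                # Only crawl specified paths
        URLPatternFilter(patterns=["*/blog/*"], reverse=True),  # Exclude specified paths
    ]),
)

run_cfg = CrawlerRunConfig(deep_crawl_strategy=deep_crawl)

async def main():
    async with AsyncWebCrawler() as crawler:
        # With a deep crawl strategy, arun() returns a list of results
        results = await crawler.arun(url="https://example.com", config=run_cfg)
        for r in results:
            print(f"{r.url} → {len(r.markdown)} chars")

asyncio.run(main())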
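Glob-style path rules can be checked offline before starting a long crawl. The sketch below illustrates include/exclude matching with the standard library only; it is not the library's filter implementation, and `path_allowed` is a hypothetical helper:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse


def path_allowed(url: str, include: list[str], exclude: list[str]) -> bool:
    """Glob-match the URL path: exclusions win, then at least one include must match."""
    path = urlparse(url).path or "/"
    if any(fnmatch(path, pattern) for pattern in exclude):
        return False
    # An empty include list means "everything not excluded".
    return not include or any(fnmatch(path, pattern) for pattern in include)


for candidate in ["https://example.com/docs/intro", "https://example.com/blog/post"]:
    print(candidate, path_allowed(candidate, ["/docs/*"], ["/blog/*"]))
```

Running a sample of known URLs through a check like this is a cheap way to confirm that the patterns select the intended part of a site.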

Docker Deployment

docker pull unclecode/crawl4ai:latest
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest

# Dashboard: http://localhost:11235/dashboard
# Playground: http://localhost:11235/playground

Python Client:

import requests

resp = requests.post("http://localhost:11235/crawl",
    json={"urls": ["https://example.com"], "priority": 10})

task_id = resp.json()["task_id"]
result = requests.get(f"http://localhost:11235/task/{task_id}")
print(result.json())
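A task submitted this way may not be finished by the time of the first GET, so a client typically polls. The sketch below shows one way to do that with a timeout. It assumes the task payload has a "status" field whose terminal values are "completed" or "failed"; that is an assumption, so check what your server version actually returns. `poll_task` is a hypothetical helper:

```python
import time
from typing import Callable


def poll_task(fetch: Callable[[], dict], interval: float = 1.0, timeout: float = 60.0) -> dict:
    """Call fetch() until the task reports a terminal status or the timeout elapses.

    Assumes a "status" field with terminal values "completed" or "failed".
    """
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch()
        if payload.get("status") in ("completed", "failed"):
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish before the timeout")
        time.sleep(interval)
```

With the client above, `fetch` could be `lambda: requests.get(f"http://localhost:11235/task/{task_id}").json()`. Injecting the callable keeps the polling logic testable without a running container.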

CrawlResult Key Fields

result.url              # Final URL (after any redirects)
result.html             # Raw HTML
result.cleaned_html     # Cleaned HTML
result.markdown         # Markdown formatted output (contains raw_markdown and fit_markdown)
result.extracted_content # JSON string returned by the extraction strategy
result.screenshot       # Base64 screenshot
result.media            # Image/video information
result.links            # Internal and external link information
result.success          # Whether the crawl was successful
result.error_message    # Error message
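As an example of working with these fields, the sketch below counts external links per domain. It assumes `result.links` is a dict with "internal" and "external" lists whose items carry an "href" key; treat that shape as an assumption and confirm it in references/api-reference.md. `external_domains` is a hypothetical helper:

```python
from collections import Counter
from urllib.parse import urlparse


def external_domains(links: dict) -> Counter:
    """Count external links per domain.

    Assumes links looks like {"internal": [...], "external": [{"href": ...}, ...]}.
    """
    counts = Counter()
    for item in links.get("external", []):
        host = urlparse(item.get("href", "")).netloc
        if host:
            counts[host] += 1
    return counts
```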

FAQ

Playwright browser not installed:

python -m playwright install --with-deps chromium

Cache issues causing stale data to be returned: Set cache_mode=CacheMode.BYPASS to skip the cache.

Dynamic content not loading: Use wait_for="css:selector" to wait for the target element, or js_code to execute scrolling.

Out of memory (batch crawling): Reduce the concurrency level; arun_many() automatically monitors memory and adapts.

Anti-bot / detection: Enable use_managed_browser=True in BrowserConfig or configure a proxy.

Usage Guidance
Review before installing. Use a sandbox or virtual environment, pin trusted dependency versions when possible, set strict crawl limits, and avoid using your main browser profile or personal cookies. For private or authenticated pages, use a temporary account/profile and confirm whether content will be sent to an external LLM provider.
Capability Tags
requires-sensitive-credentials
Capability Assessment
Purpose & Capability
The stated purpose—using Crawl4AI for scraping, crawling, Markdown conversion, and structured extraction—matches the documented examples and API reference.
Instruction Scope
The instructions expose powerful crawler options, including deep crawling, JavaScript execution, anti-detection behavior, and reuse of authenticated browser state, but do not clearly require explicit user approval or scoping before using those higher-risk options.
Install Mechanism
The skill is instruction-only, but SKILL.md tells users to install an unpinned PyPI package and Playwright browser components. This is purpose-aligned for a browser crawler, but users should trust the package source before installing.
Credentials
For a web crawler, external website access and optional LLM-provider use are expected, but the reference also includes browser profile, cookie, and auth-state options that could access logged-in/private pages without clear containment guidance.
Persistence & Privilege
The reference documents reusable browser sessions, storage state, cache modes, and browser profile paths. These can persist or reuse authentication state and should be tightly scoped.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install crawl4ai-web-crawler
  3. After installation, invoke the skill by name or use /crawl4ai-web-crawler
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.1
- Renamed and rebranded the skill to "crawl4ai-web-crawler" with updated purpose and trigger words.
- Replaced the prior RAGFlow documentation with a comprehensive Crawl4AI usage guide, including installation, core classes, and real-world examples.
- Added API reference pointer: references/api-reference.md.
- Removed prior references (architecture.md, cli-reference.md, and deployment.md) to focus the documentation on Crawl4AI features and usage.
v1.0.0
- Initial release of the crawl4ai-web-crawler skill under the name "ragflow".
- Provides guidance for deploying, configuring, managing, and troubleshooting the RAGFlow open-source Retrieval-Augmented Generation (RAG) engine.
- Includes detailed Docker deployment and quick-start instructions, system prerequisites, and configuration file references.
- Covers CLI usage for managing datasets, documents, agents, and chats.
- Offers troubleshooting tips, architecture overview, and links to documentation, support channels, and source code.
Metadata
Slug crawl4ai-web-crawler
Version 1.0.1
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 2
Frequently Asked Questions

What is Crawl4AI Web Crawler?

Use Crawl4AI for web scraping and content extraction. Use when users need to scrape web content, extract structured data, convert web pages to Markdown, perf... It is an AI Agent Skill for Claude Code / OpenClaw, with 80 downloads so far.

How do I install Crawl4AI Web Crawler?

Run "/install crawl4ai-web-crawler" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Crawl4AI Web Crawler free?

Yes, Crawl4AI Web Crawler is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Crawl4AI Web Crawler support?

Crawl4AI Web Crawler is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Crawl4AI Web Crawler?

It is built and maintained by OpenLark (@openlark); the current version is v1.0.1.
