Crawl4AI Web Crawler
Crawl4AI is an open-source, LLM-friendly web crawler, hosted on GitHub, that converts web pages into clean Markdown or structured JSON. It is well suited to RAG, AI agents, and data pipelines.
For detailed API parameters, see references/api-reference.md.
Trigger Words
"scrape," "crawl," "extract webpage," "convert webpage to markdown," "structured extraction," etc.
Installation
pip install -U crawl4ai
crawl4ai-setup # Automatically installs the Playwright browser
crawl4ai-doctor # Verifies the installation
If the browser installation fails, run manually:
python -m playwright install --with-deps chromium
Core Architecture
Three core classes:
| Class | Purpose |
|---|---|
| AsyncWebCrawler | Main async crawler class; manages the browser lifecycle |
| BrowserConfig | Browser settings (headless, UA, proxy, viewport, etc.) |
| CrawlerRunConfig | Per-crawl settings (cache, extraction strategy, JS, screenshots, etc.) |
Basic Usage
Simplest Crawl
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)  # LLM-ready Markdown

asyncio.run(main())
Crawl with Configuration
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

browser_cfg = BrowserConfig(headless=True, verbose=True)
run_cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,   # BYPASS skips the cache; other modes: ENABLED, WRITE_ONLY, READ_ONLY
    css_selector="main.article",   # Only extract the specified area
    word_count_threshold=10,       # Filter out short text blocks
    screenshot=True,               # Take a screenshot
)

async def main():
    # "async with" is only valid inside a coroutine, so wrap the crawl in main()
    async with AsyncWebCrawler(config=browser_cfg) as crawler:
        result = await crawler.arun(url="https://example.com", config=run_cfg)
        print(result.markdown)
        if result.screenshot:
            print(f"Screenshot: {len(result.screenshot)} base64 characters")

asyncio.run(main())
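Because result.screenshot arrives as a base64 string, writing it to disk needs a decode step first. The following is a minimal sketch using only the standard library; `save_screenshot` is a hypothetical helper, not part of Crawl4AI, and the output path is whatever you choose.

```python
import base64
from pathlib import Path


def save_screenshot(b64_data: str, path: str) -> int:
    """Decode a base64 screenshot string and write the bytes to `path`.

    Hypothetical helper (not a Crawl4AI API). Returns the number of
    bytes written so the caller can sanity-check the result.
    """
    raw = base64.b64decode(b64_data)
    Path(path).write_bytes(raw)
    return len(raw)
```

Inside the crawl above, this could be called as `save_screenshot(result.screenshot, "page.png")`, assuming the screenshot field is populated.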
Command Line Tool
# Basic crawl
crwl https://example.com -o markdown
# Deep crawl (BFS, up to 10 pages)
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10
# LLM extraction
crwl https://example.com/products -q "Extract all product prices"
Markdown Generation
Using Content Filters
Raw Markdown is generated by default. Use DefaultMarkdownGenerator + content filters to get cleaner output:
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# Method 1: PruningContentFilter (density-based pruning)
md_gen = DefaultMarkdownGenerator(
    content_filter=PruningContentFilter(
        threshold=0.48,           # 0-1; blocks scoring below this are removed, so higher values prune more
        threshold_type="fixed",   # "fixed" or "dynamic"
        min_word_threshold=0
    )
)

# Method 2: BM25ContentFilter (query relevance-based filtering)
# Reassigning md_gen replaces Method 1; keep only the one you need.
md_gen = DefaultMarkdownGenerator(
    content_filter=BM25ContentFilter(
        user_query="machine learning",   # Keywords to focus on
        bm25_threshold=1.0
    )
)

run_cfg = CrawlerRunConfig(markdown_generator=md_gen)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="...", config=run_cfg)
        print(len(result.markdown.raw_markdown))   # Raw MD
        print(len(result.markdown.fit_markdown))   # Filtered MD

asyncio.run(main())
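To build intuition for what a pruning threshold does, here is a toy sketch of threshold-based block filtering. It is not Crawl4AI's actual scoring: the `Block` type, the link-density score, and the sample blocks are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    text: str        # all visible text in the block
    link_text: str   # the portion of that text that sits inside links


def score(block: Block) -> float:
    """Toy density score: the share of the text that is NOT link text."""
    if not block.text:
        return 0.0
    return 1.0 - len(block.link_text) / len(block.text)


def prune(blocks: list[Block], threshold: float = 0.48, min_words: int = 0) -> list[Block]:
    """Keep blocks whose score reaches `threshold` and that have enough words."""
    return [
        b for b in blocks
        if score(b) >= threshold and len(b.text.split()) >= min_words
    ]


blocks = [
    Block("Home About Contact", "Home About Contact"),             # nav bar: all links
    Block("Crawl4AI converts pages into Markdown for LLMs.", ""),  # body text: no links
    Block("Read more on our blog", "our blog"),                    # mixed content
]

# A moderate threshold drops only the navigation bar; a stricter one
# also drops the mixed block.
print([b.text for b in prune(blocks, threshold=0.48)])
print([b.text for b in prune(blocks, threshold=0.70)])
```

In this sketch, raising the threshold removes more blocks, which is the behavior to keep in mind when tuning the real filter.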
Structured Data Extraction
CSS/XPath Extraction (No LLM Required, Fast and Free)
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, JsonCssExtractionStrategy

schema = {
    "name": "Articles",
    "baseSelector": "article.post",   # Container for repeating elements
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
    ]
}

run_cfg = CrawlerRunConfig(
    extraction_strategy=JsonCssExtractionStrategy(schema)
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/blog", config=run_cfg)
        data = json.loads(result.extracted_content)
        print(data)  # [{"title": "...", "url": "...", "image": "..."}, ...]

asyncio.run(main())
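The href and src values captured by a schema like the one above are often relative paths. A small post-processing step can resolve them against the page URL. This is a sketch using only the standard library; `absolutize` is a hypothetical helper, and the field names "url" and "image" simply mirror the example schema.

```python
import json
from urllib.parse import urljoin


def absolutize(extracted_content: str, base_url: str, keys=("url", "image")) -> list[dict]:
    """Parse the extraction JSON and resolve relative links against `base_url`.

    Hypothetical helper: `keys` lists the fields to treat as URLs.
    """
    items = json.loads(extracted_content)
    for item in items:
        for key in keys:
            if item.get(key):  # skip missing or empty values
                item[key] = urljoin(base_url, item[key])
    return items


sample = json.dumps([{"title": "A", "url": "/post/1", "image": "img/a.png"}])
print(absolutize(sample, "https://example.com/blog/"))
```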
Auto-Generate Schema (one-time LLM cost, then reuse for free):
from crawl4ai import JsonCssExtractionStrategy, LLMConfig

schema = JsonCssExtractionStrategy.generate_schema(
    html="<div class='product'>...",
    llm_config=LLMConfig(provider="openai/gpt-4o", api_token="your-key")
    # Or use a local model: LLMConfig(provider="ollama/llama3.3", api_token=None)
)
LLM Extraction (Suitable for Unstructured Content)
import asyncio
import json
from pydantic import BaseModel, Field
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, LLMExtractionStrategy, LLMConfig

class Product(BaseModel):
    name: str = Field(..., description="Product name")
    price: str = Field(..., description="Price as string")
    description: str = Field(..., description="Short description")

llm_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",   # Also supports ollama/llama3, anthropic/claude-3, etc.
        api_token="your-api-key"
    ),
    schema=Product.model_json_schema(),
    extraction_type="schema",            # "schema" or "block"
    instruction="Extract all product objects with name, price, and description.",
    chunk_token_threshold=1000,          # Auto-chunk when exceeding this token count
    overlap_rate=0.1,                    # 10% overlap between chunks
    apply_chunking=True,
    input_format="markdown",             # "markdown" | "html" | "fit_markdown"
    extra_args={"temperature": 0.0, "max_tokens": 800}
)
run_cfg = CrawlerRunConfig(extraction_strategy=llm_strategy)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/products", config=run_cfg)
        data = json.loads(result.extracted_content)
        print(data)
        llm_strategy.show_usage()  # Print token usage statistics

asyncio.run(main())
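The chunk_token_threshold and overlap_rate settings describe a sliding window over the input. The sketch below shows the general idea with whitespace-separated words standing in for tokens. It is an illustration under that assumption, not Crawl4AI's own chunker, and `chunk_words` is a hypothetical name.

```python
def chunk_words(words: list[str], chunk_size: int, overlap_rate: float) -> list[list[str]]:
    """Split `words` into windows of `chunk_size` that overlap by `overlap_rate`.

    Words stand in for tokens here; a real tokenizer would count differently.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    overlap = int(chunk_size * overlap_rate)   # words shared by neighbouring chunks
    step = max(1, chunk_size - overlap)        # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):   # this window reached the end
            break
    return chunks


words = [str(i) for i in range(25)]
for chunk in chunk_words(words, chunk_size=10, overlap_rate=0.1):
    print(chunk[0], "...", chunk[-1], f"({len(chunk)} words)")
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of sending some text to the model twice.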
Extraction Strategy Selection Guide
| Scenario | Strategy |
|---|---|
| Repeating lists (products, articles, search results) | JsonCssExtractionStrategy |
| Unstructured text requiring AI understanding | LLMExtractionStrategy |
| High-frequency crawling of the same site | Generate Schema with LLM first, then extract via CSS |
Dynamic Page Handling
run_cfg = CrawlerRunConfig(
    js_code=[                              # JS executed on the page
        "window.scrollTo(0, document.body.scrollHeight)",
        "await new Promise(r => setTimeout(r, 2000))",
    ],
    wait_for="css:.content-loaded",        # Wait for a specific element to appear
    delay_before_return_html=2.0,          # Additional wait in seconds before returning
)
Batch Crawling
# Reuses the imports and run_cfg from the earlier examples
urls = ["https://example.com/page1", "https://example.com/page2", ...]

async def main():
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls, config=run_cfg)
        for result in results:
            if result.success:
                print(result.markdown[:200])

asyncio.run(main())
arun_many() automatically handles rate limiting, memory monitoring, and concurrency control.
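For readers unfamiliar with what concurrency control means in an async crawler, here is a minimal sketch of the general pattern using asyncio.Semaphore. It is not how arun_many() is implemented; `fetch_all` and the `fetch` callable are hypothetical stand-ins.

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrency=3):
    """Run `fetch(url)` for every URL with at most `max_concurrency` in flight.

    Returns (url, value, error) tuples in input order; one failing URL
    does not cancel the others.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url):
        async with semaphore:              # blocks while the limit is reached
            try:
                return url, await fetch(url), None
            except Exception as exc:       # record the failure, keep going
                return url, None, exc

    return await asyncio.gather(*(guarded(u) for u in urls))


async def fake_fetch(url):
    """Stand-in for a real crawl call; sleeps briefly and returns a length."""
    await asyncio.sleep(0.01)
    return len(url)


print(asyncio.run(fetch_all(["a", "bb", "ccc"], fake_fetch, max_concurrency=2)))
```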
Browser Management
browser_cfg = BrowserConfig(
    browser_type="chromium",             # "chromium" | "firefox" | "webkit"
    headless=True,
    viewport_width=1920,
    viewport_height=1080,
    user_agent="Mozilla/5.0 ...",
    proxy="http://user:pass@proxy:8080",
    use_managed_browser=True,            # Managed browser mode; pair with user_data_dir for a persistent profile
    user_data_dir="/path/to/profile",    # Persistent profile (to retain login state)
)
Deep Crawl (Site-Level Crawling)
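import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter

# Path rules are passed through filter_chain; the glob patterns here are illustrative
path_filters = FilterChain([
    URLPatternFilter(patterns=["*/docs/*"]),                 # Only crawl specified paths
    URLPatternFilter(patterns=["*/blog/*"], reverse=True),   # Exclude specified paths
])
deep_crawl = BFSDeepCrawlStrategy(
    max_depth=3,                # Maximum depth
    max_pages=50,               # Maximum number of pages
    filter_chain=path_filters,
)
run_cfg = CrawlerRunConfig(deep_crawl_strategy=deep_crawl)

async def main():
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun(url="https://example.com", config=run_cfg)
        for r in results:
            print(f"{r.url} → {len(r.markdown)} chars")

asyncio.run(main())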
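To make the interaction between max_depth and max_pages concrete, the following sketch runs a breadth-first traversal over an in-memory link graph. It illustrates the general BFS technique only: the `links` dictionary replaces real page fetching, and `bfs_crawl_order` is a hypothetical function, not Crawl4AI's implementation.

```python
from collections import deque


def bfs_crawl_order(start, links, max_depth=3, max_pages=50):
    """Return URLs in breadth-first order, honouring depth and page limits.

    `links` maps each URL to the URLs it links to (a stand-in for crawling).
    The start page counts as depth 0 in this sketch.
    """
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:            # do not follow links past the limit
            continue
        for nxt in links.get(url, []):
            if nxt not in seen:           # visit each URL at most once
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order


links = {
    "/": ["/a", "/b"],
    "/a": ["/a1"],
    "/b": ["/a1", "/b1"],
    "/a1": ["/deep"],
}
print(bfs_crawl_order("/", links, max_depth=2))
print(bfs_crawl_order("/", links, max_depth=2, max_pages=3))
```

Breadth-first order means pages close to the start URL are collected first, so a small max_pages budget is spent on the shallowest pages.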
Docker Deployment
docker pull unclecode/crawl4ai:latest
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest
# Dashboard: http://localhost:11235/dashboard
# Playground: http://localhost:11235/playground
Python Client:
import requests
resp = requests.post("http://localhost:11235/crawl",
json={"urls": ["https://example.com"], "priority": 10})
task_id = resp.json()["task_id"]
result = requests.get(f"http://localhost:11235/task/{task_id}")
print(result.json())
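A task submitted this way may not be finished when it is first queried, so a client usually polls until the task reaches a terminal state. The sketch below shows one way to do that. The "status" field and its "completed" and "failed" values are assumptions about the response payload, and `wait_for_task` is a hypothetical helper; check the server's actual response format before relying on it.

```python
import time


def wait_for_task(fetch_status, task_id, interval=1.0, timeout=60.0, sleep=time.sleep):
    """Poll `fetch_status(task_id)` until it reports a terminal state.

    `fetch_status` is any callable returning the task payload as a dict.
    ASSUMPTION: the payload carries a "status" key whose terminal values
    are "completed" or "failed".
    """
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch_status(task_id)
        if payload.get("status") in ("completed", "failed"):
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} still {payload.get('status')!r}")
        sleep(interval)                    # wait before the next poll
```

With the client above, `fetch_status` could be `lambda tid: requests.get(f"http://localhost:11235/task/{tid}").json()`.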
CrawlResult Key Fields
result.url # Final URL (after any redirects)
result.html # Raw HTML
result.cleaned_html # Cleaned HTML
result.markdown # Markdown formatted output (contains raw_markdown and fit_markdown)
result.extracted_content # JSON string returned by the extraction strategy
result.screenshot # Base64 screenshot
result.media # Image/video information
result.links # Internal and external link information
result.success # Whether the crawl was successful
result.error_message # Error message
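A common pattern is to check success first and fall back to error_message otherwise. The helper below sketches that using the fields listed above. `summarize` is a hypothetical function, and the "internal" key on result.links is an assumption about that field's layout; a SimpleNamespace stands in for a real CrawlResult.

```python
from types import SimpleNamespace


def summarize(result) -> str:
    """Build a one-line summary from a CrawlResult-like object."""
    if not result.success:
        return f"FAILED {result.url}: {result.error_message}"
    markdown = str(result.markdown or "")
    # ASSUMPTION: links is a dict with an "internal" list
    internal = len((result.links or {}).get("internal", []))
    return f"OK {result.url}: {len(markdown)} markdown chars, {internal} internal links"


# Stand-in object for demonstration; a real crawl would supply `result`.
demo = SimpleNamespace(
    success=True,
    url="https://example.com",
    markdown="# Hi",
    links={"internal": ["/a", "/b"], "external": []},
    error_message=None,
)
print(summarize(demo))
```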
FAQ
Playwright browser not installed:
python -m playwright install --with-deps chromium
Cache issues causing stale data to be returned:
Set cache_mode=CacheMode.BYPASS to skip the cache.
Dynamic content not loading:
Use wait_for="css:selector" to wait for the target element, or js_code to execute scrolling.
Out of memory (batch crawling):
Reduce the concurrency level; arun_many() automatically monitors memory and adapts.
Anti-bot / detection:
Enable use_managed_browser=True in BrowserConfig or configure a proxy.
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat: /install crawl4ai-web-crawler
- After installation, invoke the skill by name or use /crawl4ai-web-crawler
- Provide required inputs per the skill's parameter spec and get structured output
What is Crawl4AI Web Crawler?
Use Crawl4AI for web scraping and content extraction. Use when users need to scrape web content, extract structured data, convert web pages to Markdown, perf... It is an AI Agent Skill for Claude Code / OpenClaw, with 80 downloads so far.
How do I install Crawl4AI Web Crawler?
Run "/install crawl4ai-web-crawler" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.
Is Crawl4AI Web Crawler free?
Yes, Crawl4AI Web Crawler is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does Crawl4AI Web Crawler support?
Crawl4AI Web Crawler is cross-platform and runs anywhere OpenClaw or Claude Code is available.
Who created Crawl4AI Web Crawler?
It is built and maintained by OpenLark (@openlark); the current version is v1.0.1.