Description

Expert guide for building web scrapers and crawlers using Crawlee (JavaScript/TypeScript and Python). Use this skill whenever the user wants to: scrape a web...

Usage Instructions (SKILL.md)

Crawlee Skill

Name: Crawlee
Author: yash-kavaiya

Crawlee is a production-grade web scraping and browser automation library for JavaScript/TypeScript (Node.js 16+) and Python (3.10+). It handles anti-blocking, proxies, session management, storage, and concurrency out of the box.

Docs: https://crawlee.dev/js/docs | https://crawlee.dev/python/docs
GitHub: https://github.com/apify/crawlee

1. Choose Your Crawler

JavaScript / TypeScript

Crawler	When to Use	Renders JavaScript
`CheerioCrawler`	Fast HTML parsing, no JS rendering needed	❌
`HttpCrawler`	Raw HTTP responses, custom parsing	❌
`JSDOMCrawler`	DOM manipulation without full browser	❌
`PlaywrightCrawler`	Modern headless browser (Chromium/Firefox/WebKit)	✅
`PuppeteerCrawler`	Chromium/Chrome headless automation	✅
`AdaptivePlaywrightCrawler`	Auto-detects if JS rendering is needed	Auto
`BasicCrawler`	Custom HTTP logic from scratch	❌

Rule of thumb: Start with CheerioCrawler. Upgrade to PlaywrightCrawler only when JS rendering is required.
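One way to apply this rule of thumb is to check whether the data you want is already present in the raw HTML response. The sketch below is plain Python and not part of Crawlee; the function name `needs_browser` and the sample markup are hypothetical.

```python
def needs_browser(raw_html: str, expected_text: str) -> bool:
    """Return True when the expected text is missing from the raw HTML.

    If the text only appears after client-side JavaScript runs, an HTTP-based
    crawler will not see it, so a browser-based crawler is the safer choice.
    """
    return expected_text not in raw_html


# Server-rendered page: the price is already in the markup.
static_page = '<html><body><span class="price">$19.99</span></body></html>'
# Client-rendered page: the markup only holds an empty mount point.
dynamic_page = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'

static_needs_browser = needs_browser(static_page, '$19.99')
dynamic_needs_browser = needs_browser(dynamic_page, '$19.99')
```

In practice you would fetch the page once with a plain HTTP client and run this check against a value you can see in the rendered page.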

Python

Crawler	When to Use
`BeautifulSoupCrawler`	HTML parsing with BeautifulSoup (fast, no JS)
`ParselCrawler`	CSS/XPath selectors, Scrapy-style (fast, no JS)
`PlaywrightCrawler`	Full browser automation (Chromium/Firefox/WebKit)
`AdaptivePlaywrightCrawler`	Auto HTTP vs browser decision

2. Installation

JavaScript

# Recommended: use the CLI
npx crawlee create my-crawler
cd my-crawler && npm install

# Or manually:
npm install crawlee

# For Playwright:
npm install crawlee playwright
npx playwright install

# For Puppeteer:
npm install crawlee puppeteer

Add to package.json (the examples use ES module imports and top-level await):

{ "type": "module" }

Python

pip install crawlee

# With BeautifulSoup:
pip install 'crawlee[beautifulsoup]'

# With Playwright:
pip install 'crawlee[playwright]'
playwright install

3. Core Concepts

The Two Questions Every Crawler Answers

Where to go? → Request objects in a RequestQueue
What to do there? → requestHandler function (JS) / decorated handler (Python)

Key Classes (JS)

Request — A single URL + metadata to crawl
RequestQueue — Dynamic, deduplicated queue of URLs
Dataset — Append-only structured result storage (like a table)
KeyValueStore — Blob storage for screenshots, PDFs, state
ProxyConfiguration — Manages proxy rotation
SessionPool — Manages browser sessions + cookies
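To make the roles of the queue, the handler, and the dataset concrete, here is a minimal standard-library sketch of a crawl loop. It is a conceptual model only, not how Crawlee is implemented; the in-memory `SITE` mapping and the `crawl` function are hypothetical stand-ins for real HTTP requests.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (title, outgoing links).
SITE = {
    'https://example.com/': ('Home', ['https://example.com/a', 'https://example.com/b']),
    'https://example.com/a': ('Page A', ['https://example.com/', 'https://example.com/b']),
    'https://example.com/b': ('Page B', ['https://example.com/a']),
}


def crawl(start_url: str, max_requests: int = 100) -> list[dict]:
    queue = deque([start_url])   # "Where to go?"
    seen = {start_url}           # deduplication: each URL is queued only once
    dataset: list[dict] = []     # append-only results

    while queue and len(dataset) < max_requests:
        url = queue.popleft()
        title, links = SITE[url]
        # "What to do there?": extract data, then enqueue newly found links.
        dataset.append({'url': url, 'title': title})
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return dataset


results = crawl('https://example.com/')
```

Each of the three pages is visited once even though they link to each other, which is the behavior the deduplicated queue provides.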

4. Quick Start Examples

JavaScript — CheerioCrawler (Recommended Start)

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
  async requestHandler({ $, request, enqueueLinks, log }) {
    const title = $('title').text();
    log.info(`Title of ${request.loadedUrl}: ${title}`);

    await Dataset.pushData({ url: request.loadedUrl, title });

    // Enqueue all links found on this page
    await enqueueLinks();
  },
  maxRequestsPerCrawl: 100, // Safety limit
});

await crawler.run(['https://example.com']);

JavaScript — PlaywrightCrawler

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
  // headless: false, // Uncomment to see the browser
  async requestHandler({ page, request, enqueueLinks, log }) {
    const title = await page.title();
    log.info(`${request.loadedUrl}: ${title}`);
    await Dataset.pushData({ url: request.loadedUrl, title });
    await enqueueLinks();
  },
});

await crawler.run(['https://example.com']);

Python — BeautifulSoupCrawler

import asyncio
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main() -> None:
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=50)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        title = context.soup.title.string if context.soup.title else None
        context.log.info(f'Processing {context.request.url}: {title}')
        await context.push_data({'url': context.request.url, 'title': title})
        await context.enqueue_links()

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())

Python — PlaywrightCrawler

import asyncio
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler(headless=True, browser_type='chromium')

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext) -> None:
        title = await context.page.title()
        await context.push_data({'url': context.request.url, 'title': title})
        await context.enqueue_links()

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())

5. Routing — Handling Multiple Page Types

Use labels + router to handle different kinds of pages (list pages, detail pages, etc.).

JavaScript

import { PlaywrightCrawler, Dataset } from 'crawlee';
import { router } from './routes.js';

const crawler = new PlaywrightCrawler({ requestHandler: router });

await crawler.run([{ url: 'https://shop.example.com', label: 'START' }]);

// routes.js
import { createPlaywrightRouter } from 'crawlee';

export const router = createPlaywrightRouter();

router.addHandler('START', async ({ page, enqueueLinks }) => {
  await enqueueLinks({ selector: 'a.category', label: 'CATEGORY' });
});

router.addHandler('CATEGORY', async ({ page, enqueueLinks }) => {
  await enqueueLinks({ selector: 'a.product', label: 'DETAIL' });
  // Enqueue next page
  const next = await page.$('a.next-page');
  if (next) await enqueueLinks({ selector: 'a.next-page', label: 'CATEGORY' });
});

router.addDefaultHandler(async ({ page, request, pushData }) => {
  // DETAIL pages
  const title = await page.title();
  const price = await page.$eval('.price', el => el.textContent);
  await pushData({ url: request.url, title, price });
});

Python

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

crawler = BeautifulSoupCrawler()

@crawler.router.handler('CATEGORY')
async def category_handler(context: BeautifulSoupCrawlingContext) -> None:
    await context.enqueue_links(selector='a.product', label='DETAIL')

@crawler.router.default_handler
async def detail_handler(context: BeautifulSoupCrawlingContext) -> None:
    # Guard against pages without a <title> element
    title = context.soup.title.string if context.soup.title else None
    await context.push_data({'url': context.request.url, 'title': title})
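The label-based dispatch used in these examples can be sketched without any library. The `LabelRouter` class below is a hypothetical illustration of the pattern (a handler per label, with a default handler as the fallback), not Crawlee's router.

```python
from typing import Callable, Optional

Handler = Callable[[dict], str]


class LabelRouter:
    """Dispatch a request to the handler registered for its label."""

    def __init__(self) -> None:
        self._handlers: dict[str, Handler] = {}
        self._default: Optional[Handler] = None

    def handler(self, label: str) -> Callable[[Handler], Handler]:
        def register(func: Handler) -> Handler:
            self._handlers[label] = func
            return func
        return register

    def default_handler(self, func: Handler) -> Handler:
        self._default = func
        return func

    def dispatch(self, request: dict) -> str:
        # Labels without a dedicated handler fall back to the default handler.
        func = self._handlers.get(request.get('label'), self._default)
        if func is None:
            raise LookupError(f"No handler for label {request.get('label')!r}")
        return func(request)


router = LabelRouter()


@router.handler('CATEGORY')
def category(request: dict) -> str:
    return f"category page: {request['url']}"


@router.default_handler
def detail(request: dict) -> str:
    return f"detail page: {request['url']}"


category_result = router.dispatch({'url': 'https://shop.example.com/c/1', 'label': 'CATEGORY'})
detail_result = router.dispatch({'url': 'https://shop.example.com/p/9', 'label': 'DETAIL'})
```

As in the examples above, requests labeled `DETAIL` have no dedicated handler and are processed by the default one.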

6. Enqueuing Links

JavaScript — `enqueueLinks()`

// Enqueue all links on page
await enqueueLinks();

// Filter by glob pattern
await enqueueLinks({ globs: ['https://example.com/products/**'] });

// Filter by regex
await enqueueLinks({ regexps: [/\/product\/\d+/] });

// Enqueue only specific selector
await enqueueLinks({ selector: 'a.pagination', label: 'LIST' });

// Enqueue with custom label and transformations
await enqueueLinks({
  selector: 'a.item',
  label: 'DETAIL',
  transformRequestFunction: (req) => {
    req.userData.scrapedAt = new Date().toISOString();
    return req;
  },
});

Python

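import re

# Enqueue all links on the page
await context.enqueue_links()
# Enqueue only links matching a selector, with a label
await context.enqueue_links(selector='a.product', label='DETAIL')
# Enqueue only links whose URL matches a regular expression
await context.enqueue_links(include=[re.compile(r'/products/\d+')])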
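Conceptually, link filtering resolves each href against the page URL, keeps the URLs that match a pattern, and drops duplicates. The following standard-library sketch shows that idea; `select_links` is a hypothetical helper and makes no claim about how Crawlee implements filtering internally.

```python
import re
from urllib.parse import urljoin


def select_links(base_url: str, hrefs: list[str], include: list[re.Pattern]) -> list[str]:
    """Resolve hrefs against the page URL and keep those matching any pattern."""
    selected: list[str] = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        matches = any(pattern.search(absolute) for pattern in include)
        if matches and absolute not in selected:
            selected.append(absolute)
    return selected


links = select_links(
    'https://example.com/catalog/',
    ['/products/42', '../products/7', '/about', '/products/42'],
    [re.compile(r'/products/\d+')],
)
```

Here `/about` is filtered out and the repeated `/products/42` link is kept only once.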

7. Storage

Dataset (structured results)

// JS — Write
await Dataset.pushData({ url, title, price });
await Dataset.pushData([item1, item2, item3]); // batch write

// JS — Read / Export
const dataset = await Dataset.open();
await dataset.exportToCSV('results'); // saves to KV store
await dataset.exportToJSON('results');

// Iterate over stored items
await dataset.forEach(async (item) => { console.log(item); });

# Python — Write
await context.push_data({'url': url, 'title': title})

# Python — Read / Export
from crawlee.storages import Dataset
dataset = await Dataset.open()
await dataset.export_to(key='results', content_type='csv')

Data is saved to ./storage/datasets/default/*.json by default.

KeyValueStore (blobs, screenshots, state)

// JS
await KeyValueStore.setValue('OUTPUT', { results: [...] });
const value = await KeyValueStore.getValue('OUTPUT');

// Save a screenshot
const store = await KeyValueStore.open();
await store.setValue('screenshot', await page.screenshot(), { contentType: 'image/png' });

# Python
from crawlee.storages import KeyValueStore
kvs = await KeyValueStore.open()
await kvs.set_value('result', {'data': 'value'})
value = await kvs.get_value('result')

Storage location

./storage/
  datasets/default/     # Dataset rows as JSON files
  key_value_stores/default/  # KV store entries
  request_queues/default/    # Request queue state

Override with env var: CRAWLEE_STORAGE_DIR=/path/to/storage
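Because dataset rows are plain JSON files, they can be inspected after a run with the standard library alone. The sketch below assumes the directory layout shown above; `load_dataset` is a hypothetical helper, and skipping files whose names start with `__` is an assumption intended to ignore any metadata files stored alongside the items.

```python
import json
import os
from pathlib import Path
from typing import Optional


def load_dataset(storage_dir: Optional[str] = None, dataset_id: str = 'default') -> list:
    """Read every JSON item file of a dataset, in file-name order."""
    root = Path(storage_dir or os.environ.get('CRAWLEE_STORAGE_DIR', './storage'))
    dataset_path = root / 'datasets' / dataset_id
    items = []
    for item_file in sorted(dataset_path.glob('*.json')):
        if item_file.name.startswith('__'):
            continue  # assumption: double-underscore files hold metadata, not items
        items.append(json.loads(item_file.read_text(encoding='utf-8')))
    return items
```

Calling `load_dataset()` with no arguments reads from `CRAWLEE_STORAGE_DIR` if it is set and from `./storage` otherwise.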

8. Proxy Management

// JS — Basic proxy rotation
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    'http://user:password@proxy-1.example.com:8000', // placeholder credentials and hosts
    'http://user:password@proxy-2.example.com:8000',
  ],
});

const crawler = new CheerioCrawler({
  proxyConfiguration,
  useSessionPool: true,
  persistCookiesPerSession: true,
  async requestHandler({ proxyInfo, request }) {
    console.log('Using proxy:', proxyInfo?.url);
  },
});

// JS — Tiered proxies (smart cost/reliability balancing)
const proxyConfiguration = new ProxyConfiguration({
  tieredProxyUrls: [
    [null],                              // Tier 0: no proxy (cheapest)
    ['http://cheap-datacenter-proxy'],   // Tier 1: datacenter
    ['http://expensive-residential'],    // Tier 2: residential (most reliable)
  ],
});
// Crawlee auto-escalates tiers when blocking is detected, then drops back when clear

# Python
from crawlee.proxy_configuration import ProxyConfiguration

proxy_configuration = ProxyConfiguration(
    proxy_urls=['http://proxy1.com/', 'http://proxy2.com/'],
)
crawler = BeautifulSoupCrawler(
    proxy_configuration=proxy_configuration,
    use_session_pool=True,
)
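The tiered-proxy idea (start cheap, escalate when blocked, step back down once requests succeed again) can be modeled in a few lines. This is a simplified sketch under stated assumptions: the per-domain tracking, the `successes_to_step_down` threshold, and the `TieredProxySelector` class are all hypothetical and do not describe Crawlee's actual algorithm.

```python
from typing import Optional


class TieredProxySelector:
    """Pick a proxy tier per domain: escalate on blocks, step down after successes."""

    def __init__(self, tiers: list, successes_to_step_down: int = 3) -> None:
        self.tiers = tiers
        self.successes_to_step_down = successes_to_step_down
        self._tier_index: dict[str, int] = {}
        self._success_streak: dict[str, int] = {}

    def proxy_for(self, domain: str) -> Optional[str]:
        tier = self.tiers[self._tier_index.get(domain, 0)]
        return tier[0]  # a fuller version would rotate within the tier

    def report(self, domain: str, blocked: bool) -> None:
        index = self._tier_index.get(domain, 0)
        if blocked:
            # Escalate to the next (more reliable, more expensive) tier.
            self._tier_index[domain] = min(index + 1, len(self.tiers) - 1)
            self._success_streak[domain] = 0
            return
        streak = self._success_streak.get(domain, 0) + 1
        if streak >= self.successes_to_step_down and index > 0:
            # Enough clean responses: try the cheaper tier again.
            self._tier_index[domain] = index - 1
            streak = 0
        self._success_streak[domain] = streak


selector = TieredProxySelector([
    [None],                             # tier 0: no proxy
    ['http://cheap-datacenter-proxy'],  # tier 1
    ['http://expensive-residential'],   # tier 2
])
first = selector.proxy_for('example.com')
selector.report('example.com', blocked=True)
after_block = selector.proxy_for('example.com')
for _ in range(3):
    selector.report('example.com', blocked=False)
after_recovery = selector.proxy_for('example.com')
```

The design choice to track tiers per domain keeps one heavily protected site from forcing expensive proxies onto every other site in the same crawl.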

9. Session Management

Sessions tie together cookies, proxy IPs, and headers to simulate a consistent user identity.

// JS
const crawler = new CheerioCrawler({
  useSessionPool: true,         // Enable (default: true)
  persistCookiesPerSession: true,
  sessionPoolOptions: { maxPoolSize: 100 },

  async requestHandler({ session, $ }) {
    const title = $('title').text();
    if (title === 'Access Denied') {
      session?.retire();  // Mark this IP+cookie combo as blocked
    } else if (title === 'Slow') {
      session?.markBad(); // Penalize but don't retire
    }
    // session.markGood() is called automatically on success
  },
});

# Python
from crawlee.sessions import SessionPool

crawler = BeautifulSoupCrawler(
    use_session_pool=True,
    session_pool=SessionPool(max_pool_size=100),
)

@crawler.router.default_handler
async def handler(context: BeautifulSoupCrawlingContext) -> None:
    title = context.soup.title.string if context.soup.title else ''
    if title == 'Access Denied':
        context.session.retire()

10. Avoiding Blocks

// JS — Playwright with fingerprint rotation (built-in, zero config needed)
const crawler = new PlaywrightCrawler({
  // Fingerprints automatically randomized by default in Playwright/Puppeteer crawlers
  // headless: false,  // Use headful for harder targets
  async requestHandler({ page }) {
    // Add realistic delays
    await page.waitForTimeout(1000 + Math.random() * 2000);
  },
});

// Use got-scraping for HTTP (built into CheerioCrawler/HttpCrawler)
// It automatically sets realistic headers and TLS fingerprints

Anti-blocking checklist:

✅ Use CheerioCrawler — it uses got-scraping which mimics real browser HTTP
✅ Enable useSessionPool: true with a proxyConfiguration
✅ Use tiered proxies for automatic failover
✅ Set maxRequestsPerMinute to avoid rate limits
✅ For browser crawlers — fingerprints are rotated automatically
✅ Use persistCookiesPerSession: true
✅ Retire sessions on blocks: session.retire()

11. Concurrency & Scaling

// JS
const crawler = new CheerioCrawler({
  maxConcurrency: 50,         // Max parallel requests (default: 200)
  minConcurrency: 1,          // Don't set too high!
  maxRequestsPerMinute: 120,  // Rate limit
  maxRequestsPerCrawl: 1000,  // Total request cap (safety)
  requestHandlerTimeoutSecs: 30,
});

# Python
from crawlee import ConcurrencySettings

crawler = BeautifulSoupCrawler(
    concurrency_settings=ConcurrencySettings(
        max_concurrency=50,
        max_tasks_per_minute=120,
    ),
    max_requests_per_crawl=1000,
)

Scaling notes:

Crawlee auto-scales concurrency based on CPU/memory
Don't set minConcurrency high — it can crash under load
maxRequestsPerMinute is smoother than raw concurrency throttling
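To see why a per-minute limit gives smoother traffic than a concurrency cap, consider the arithmetic: 120 requests per minute works out to one request every 60 / 120 = 0.5 seconds, whereas a concurrency cap alone lets a whole batch start at once. The helper below is a hypothetical illustration of even spacing, not Crawlee's scheduler.

```python
def request_start_times(count: int, max_per_minute: int) -> list:
    """Earliest start time (in seconds) of each request when spaced evenly."""
    interval = 60.0 / max_per_minute
    return [i * interval for i in range(count)]


# 120 requests per minute: one request every 0.5 s.
schedule = request_start_times(count=5, max_per_minute=120)
```

With only `maxConcurrency: 50` and no rate limit, up to 50 requests could start at the same moment, which is the burst pattern that rate limiters on target sites tend to notice.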

12. Configuration & Environment Variables

Env Variable	Default	Purpose
`CRAWLEE_STORAGE_DIR`	`./storage`	Storage root directory
`CRAWLEE_DEFAULT_DATASET_ID`	`default`	Override default dataset ID
`CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID`	`default`	Override default KVS ID
`CRAWLEE_DEFAULT_REQUEST_QUEUE_ID`	`default`	Override default queue ID
`CRAWLEE_PURGE_ON_START`	`true`	Clear storage before each run
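If you write tooling around a crawler (for example a script that post-processes results), it can resolve the same variables with the defaults from the table. This is a minimal sketch; `load_storage_settings` is a hypothetical helper, and the way the boolean is parsed is an assumption.

```python
import os


def load_storage_settings(env=None) -> dict:
    """Resolve storage settings from environment variables with the documented defaults."""
    env = os.environ if env is None else env
    return {
        'storage_dir': env.get('CRAWLEE_STORAGE_DIR', './storage'),
        'dataset_id': env.get('CRAWLEE_DEFAULT_DATASET_ID', 'default'),
        'key_value_store_id': env.get('CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID', 'default'),
        'request_queue_id': env.get('CRAWLEE_DEFAULT_REQUEST_QUEUE_ID', 'default'),
        # Assumption: only an explicit "false" or "0" disables purging.
        'purge_on_start': env.get('CRAWLEE_PURGE_ON_START', 'true').lower() not in ('false', '0'),
    }


defaults = load_storage_settings({})
custom = load_storage_settings({
    'CRAWLEE_STORAGE_DIR': '/data/crawlee',
    'CRAWLEE_PURGE_ON_START': 'false',
})
```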

// JS — Programmatic configuration
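import { CheerioCrawler, Configuration } from 'crawlee';

const config = new Configuration({
  // Storage root directory (equivalent to CRAWLEE_STORAGE_DIR)
  storageClientOptions: { localDataDirectory: '/data/crawlee' },
  persistStateIntervalMillis: 30_000,
});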

const crawler = new CheerioCrawler({ /* ... */ }, config);

13. Docker Deployment

FROM apify/actor-node-playwright-chrome:20

COPY package*.json ./
RUN npm ci --omit=dev

COPY . ./

CMD ["node", "src/main.js"]

For Cheerio (smaller image):

FROM apify/actor-node:20

14. Common Patterns

Pagination

// JS — Enqueue next page
router.addHandler('LIST', async ({ page, enqueueLinks }) => {
  await enqueueLinks({ selector: '.product', label: 'DETAIL' });
  const hasNext = await page.$('a.next');
  if (hasNext) await enqueueLinks({ selector: 'a.next', label: 'LIST' });
});

Downloading Files

// JS — Save to KeyValueStore
const { body } = await sendRequest({ responseType: 'buffer' });
await KeyValueStore.setValue('file.pdf', body, { contentType: 'application/pdf' });

Taking Screenshots

// JS — Playwright
async requestHandler({ page, request }) {
  const screenshot = await page.screenshot({ fullPage: true });
  await KeyValueStore.setValue(
    `screenshot-${Date.now()}`,
    screenshot,
    { contentType: 'image/png' }
  );
}

Shared State Across Handlers

// JS — useState()
async requestHandler({ useState }) {
  const state = await useState({ count: 0 });
  state.count++;
  console.log('Total processed:', state.count);
}

Error Handling & Retries

// JS
const crawler = new CheerioCrawler({
  maxRequestRetries: 3, // Retry failed requests up to 3 times
  failedRequestHandler: async ({ request, error }) => {
    console.error(`Failed: ${request.url}`, error.message);
    await Dataset.pushData({ url: request.url, error: error.message });
  },
});

# Python
from crawlee.crawlers import BasicCrawlingContext, BeautifulSoupCrawler

crawler = BeautifulSoupCrawler(max_request_retries=3)

@crawler.failed_request_handler
async def on_failed(context: BasicCrawlingContext, error: Exception) -> None:
    context.log.error(f'Failed {context.request.url}: {error}')
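The retry semantics (one initial attempt plus up to `max_request_retries` retries, then the failed-request handler) can be sketched in plain Python. `process_with_retries` and the sample handlers are hypothetical; real crawlers also re-queue requests and may rotate sessions between attempts, which this sketch leaves out.

```python
from typing import Callable


def process_with_retries(
    url: str,
    handler: Callable[[str], dict],
    on_failed: Callable[[str, Exception], None],
    max_request_retries: int = 3,
):
    """Call handler up to 1 + max_request_retries times; report the final error."""
    last_error = None
    for _ in range(max_request_retries + 1):
        try:
            return handler(url)
        except Exception as error:  # broad on purpose: any handler error triggers a retry
            last_error = error
    on_failed(url, last_error)
    return None


attempts = {'count': 0}
failures = []


def flaky_handler(url: str) -> dict:
    # Fails twice, then succeeds on the third attempt.
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise ConnectionError('temporary failure')
    return {'url': url, 'title': 'ok'}


def always_fails(url: str) -> dict:
    raise TimeoutError('timed out')


def record_failure(url: str, error: Exception) -> None:
    failures.append((url, str(error)))


result = process_with_retries('https://example.com', flaky_handler, record_failure)
process_with_retries('https://example.com/broken', always_fails, record_failure)
```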

Sitemap Crawling

import { CheerioCrawler } from 'crawlee';
import { Sitemap } from '@crawlee/utils';

const { urls } = await Sitemap.load('https://example.com/sitemap.xml');
const crawler = new CheerioCrawler({ /* ... */ });
await crawler.run(urls);
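For reference, a sitemap is an XML document whose `<loc>` elements hold the URLs to crawl. The sketch below extracts them with the Python standard library; `parse_sitemap` and the sample document are hypothetical, and the sketch does not handle sitemap index files or compressed sitemaps.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}


def parse_sitemap(xml_text: str) -> list:
    """Return the <loc> URLs listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall('.//sm:loc', SITEMAP_NS) if loc.text]


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/products/1</loc></url>
</urlset>"""

urls = parse_sitemap(sample)
```

The resulting list can be passed to a crawler's `run()` call as the set of start URLs.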

Run as Web Server

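import { randomUUID } from 'node:crypto';
import { createServer } from 'node:http';
import { CheerioCrawler } from 'crawlee';

const server = createServer(async (req, res) => {
  const url = new URL(req.url, 'http://localhost').searchParams.get('url');
  if (!url) {
    // Reject requests that do not say which page to scrape
    res.statusCode = 400;
    res.end(JSON.stringify({ error: 'Missing "url" query parameter' }));
    return;
  }
  const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 1,
    async requestHandler({ $ }) {
      res.end(JSON.stringify({ title: $('title').text() }));
    },
  });
  // A unique key keeps the request queue from skipping a URL it has already handled
  await crawler.run([{ url, uniqueKey: randomUUID() }]);
});
server.listen(3000);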

15. TypeScript Support

import { CheerioCrawler, CheerioCrawlingContext, Dataset } from 'crawlee';

interface Product {
  url: string;
  title: string;
  price: number;
}

const crawler = new CheerioCrawler({
  async requestHandler({ $, request }: CheerioCrawlingContext) {
    const title = $('h1').text();
    const price = parseFloat($('.price').text().replace('$', ''));
    await Dataset.pushData<Product>({ url: request.url, title, price });
  },
});

16. Cloud Deployment (Apify Platform)

import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

const input = await Actor.getInput();
const { startUrls } = input;

const crawler = new CheerioCrawler({
  async requestHandler({ $, request }) {
    await Actor.pushData({ url: request.url, title: $('title').text() });
  },
});

await crawler.run(startUrls);
await Actor.exit();

Deploy with: apify push

17. Debugging Tips

// Enable verbose logging
import { log, LogLevel } from 'crawlee';
log.setLevel(LogLevel.DEBUG);

// Run headful (browser crawlers only)
const crawler = new PlaywrightCrawler({
  headless: false,
  // ...
});

// Limit requests while developing
const crawler = new CheerioCrawler({
  maxRequestsPerCrawl: 10,
  // ...
});

18. Reference Files

For advanced topics, see:

references/js-api.md — Full JS API quick reference
references/python-api.md — Full Python API quick reference

Both language docs: https://crawlee.dev

Security Recommendations

This skill is a documentation/guide for using the Crawlee libraries and appears internally consistent. Before using: (1) be aware the examples install packages (npm/pip) and Playwright which download browser binaries and require network access; only run those commands on systems you control. (2) If you plan to supply proxy URLs they may include credentials—treat them as sensitive. (3) Web scraping can raise legal and ethical issues; check robots.txt and the target site's Terms of Service and applicable law. (4) The skill is instruction-only (it won’t run code by itself), but the agent may recommend commands to execute; review any suggested shell commands before running. (5) If you’re concerned about the skill being suggested too often, note it is configured to trigger for many loosely related phrases—consider limiting invocation scope or confirm before acting.

Functional Analysis

Type: OpenClaw Skill
Name: crawlee
Version: 1.0.0

The skill bundle provides comprehensive and legitimate documentation for the Crawlee web scraping library in both JavaScript and Python. It contains standard installation commands, API references, and code examples for common scraping tasks like handling proxies, sessions, and data storage. No indicators of malicious intent, data exfiltration, or prompt injection were found across SKILL.md or the reference files.

Capability Assessment

✓ Purpose & Capability

The name/description match the provided content (detailed JS/Python guidance for Crawlee). There are no unexpected required binaries, env vars, or config paths.

✓ Instruction Scope

SKILL.md contains step-by-step installation and usage examples for Crawlee (npm/pip/playwright installs, example crawlers, API refs). It does not instruct the agent to read unrelated system files, exfiltrate secrets, or contact hidden endpoints. Note: the doc explicitly tells the agent to trigger for many loosely related user phrases, which affects when the skill will be suggested but does not change its technical scope.

✓ Install Mechanism

This is an instruction-only skill (no install spec). It recommends standard package installs (npm, pip, playwright install) which is expected for this content. Nothing in the skill pulls arbitrary archives or personal servers.

✓ Credentials

The skill declares no required environment variables or credentials. It documents optional proxy configuration (which naturally may carry credentials when used) but does not request unrelated secrets.

✓ Persistence & Privilege

always is false and the skill does not request persistent system-wide privileges. Autonomous invocation is allowed (platform default) but is not combined with other concerning privileges.

Version History

v1.0.0

Initial release: Expert guide for building web scrapers and crawlers using Crawlee (JS/TS and Python)

Metadata

Slug: crawlee

Version: 1.0.0

License: MIT-0

Total installs: 2

Current installs: 2

Number of versions: 1

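FAQ

What is Crawlee?

Expert guide for building web scrapers and crawlers using Crawlee (JavaScript/TypeScript and Python). Use this skill whenever the user wants to: scrape a web... It is an AI Agent Skill plugin for Claude Code / OpenClaw and has been downloaded 451 times in total.

How do I install Crawlee?

Run the command "/install crawlee" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is needed.

Is Crawlee free?

Yes. Crawlee is completely free under the MIT-0 license and can be downloaded, installed, and used freely.

Which platforms does Crawlee support?

Crawlee is cross-platform and works in any environment where OpenClaw / Claude Code is deployed.

Who developed Crawlee?

It is developed and maintained by Yash Kavaiya (@yash-kavaiya); the current version is v1.0.0.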
