← 返回 Skills 市场
yash-kavaiya

Crawlee

作者 Yash Kavaiya · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ✓ 安全检测通过
451
总下载
0
收藏
2
当前安装
1
版本数
在 OpenClaw 中安装
/install crawlee
功能描述
Expert guide for building web scrapers and crawlers using Crawlee (JavaScript/TypeScript and Python). Use this skill whenever the user wants to: scrape a web...
使用说明 (SKILL.md)

Crawlee Skill

Crawlee is a production-grade web scraping and browser automation library for JavaScript/TypeScript (Node.js 16+) and Python (3.10+). It handles anti-blocking, proxies, session management, storage, and concurrency out of the box.

Docs: https://crawlee.dev/js/docs | https://crawlee.dev/python/docs
GitHub: https://github.com/apify/crawlee


1. Choose Your Crawler

JavaScript / TypeScript

Crawler When to Use JS Required
CheerioCrawler Fast HTML parsing, no JS rendering needed
HttpCrawler Raw HTTP responses, custom parsing
JSDOMCrawler DOM manipulation without full browser
PlaywrightCrawler Modern headless browser (Chromium/Firefox/WebKit)
PuppeteerCrawler Chromium/Chrome headless automation
AdaptivePlaywrightCrawler Auto-detects if JS rendering is needed Auto
BasicCrawler Custom HTTP logic from scratch

Rule of thumb: Start with CheerioCrawler. Upgrade to PlaywrightCrawler only when JS rendering is required.

Python

Crawler When to Use
BeautifulSoupCrawler HTML parsing with BeautifulSoup (fast, no JS)
ParselCrawler CSS/XPath selectors, Scrapy-style (fast, no JS)
PlaywrightCrawler Full browser automation (Chromium/Firefox/WebKit)
AdaptivePlaywrightCrawler Auto HTTP vs browser decision

2. Installation

JavaScript

# Recommended: use the CLI
npx crawlee create my-crawler
cd my-crawler && npm install

# Or manually:
npm install crawlee

# For Playwright:
npm install crawlee playwright
npx playwright install

# For Puppeteer:
npm install crawlee puppeteer

Add to package.json:

{ "type": "module" }

Python

pip install crawlee

# With BeautifulSoup:
pip install 'crawlee[beautifulsoup]'

# With Playwright:
pip install 'crawlee[playwright]'
playwright install

3. Core Concepts

The Two Questions Every Crawler Answers

  1. Where to go?Request objects in a RequestQueue
  2. What to do there?requestHandler function (JS) / decorated handler (Python)

Key Classes (JS)

  • Request — A single URL + metadata to crawl
  • RequestQueue — Dynamic, deduplicated queue of URLs
  • Dataset — Append-only structured result storage (like a table)
  • KeyValueStore — Blob storage for screenshots, PDFs, state
  • ProxyConfiguration — Manages proxy rotation
  • SessionPool — Manages browser sessions + cookies

4. Quick Start Examples

JavaScript — CheerioCrawler (Recommended Start)

import { CheerioCrawler, Dataset } from 'crawlee';

const crawler = new CheerioCrawler({
  async requestHandler({ $, request, enqueueLinks, log }) {
    const title = $('title').text();
    log.info(`Title of ${request.loadedUrl}: ${title}`);

    await Dataset.pushData({ url: request.loadedUrl, title });

    // Enqueue all links found on this page
    await enqueueLinks();
  },
  maxRequestsPerCrawl: 100, // Safety limit
});

await crawler.run(['https://example.com']);

JavaScript — PlaywrightCrawler

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
  // headless: false, // Uncomment to see the browser
  async requestHandler({ page, request, enqueueLinks, log }) {
    const title = await page.title();
    log.info(`${request.loadedUrl}: ${title}`);
    await Dataset.pushData({ url: request.loadedUrl, title });
    await enqueueLinks();
  },
});

await crawler.run(['https://example.com']);

Python — BeautifulSoupCrawler

import asyncio
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

async def main() -> None:
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=50)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        title = context.soup.title.string if context.soup.title else None
        context.log.info(f'Processing {context.request.url}: {title}')
        await context.push_data({'url': context.request.url, 'title': title})
        await context.enqueue_links()

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())

Python — PlaywrightCrawler

import asyncio
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler(headless=True, browser_type='chromium')

    @crawler.router.default_handler
    async def handler(context: PlaywrightCrawlingContext) -> None:
        title = await context.page.title()
        await context.push_data({'url': context.request.url, 'title': title})
        await context.enqueue_links()

    await crawler.run(['https://example.com'])

if __name__ == '__main__':
    asyncio.run(main())

5. Routing — Handling Multiple Page Types

Use labels + router to handle different kinds of pages (list pages, detail pages, etc.).

JavaScript

import { PlaywrightCrawler, Dataset } from 'crawlee';
import { router } from './routes.js';

const crawler = new PlaywrightCrawler({ requestHandler: router });

await crawler.run([{ url: 'https://shop.example.com', label: 'START' }]);
// routes.js
import { createPlaywrightRouter } from 'crawlee';

export const router = createPlaywrightRouter();

router.addHandler('START', async ({ page, enqueueLinks }) => {
  await enqueueLinks({ selector: 'a.category', label: 'CATEGORY' });
});

router.addHandler('CATEGORY', async ({ page, enqueueLinks }) => {
  await enqueueLinks({ selector: 'a.product', label: 'DETAIL' });
  // Enqueue next page
  const next = await page.$('a.next-page');
  if (next) await enqueueLinks({ selector: 'a.next-page', label: 'CATEGORY' });
});

router.addDefaultHandler(async ({ page, request, pushData }) => {
  // DETAIL pages
  const title = await page.title();
  const price = await page.$eval('.price', el => el.textContent);
  await pushData({ url: request.url, title, price });
});

Python

from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

crawler = BeautifulSoupCrawler()

@crawler.router.handler('CATEGORY')
async def category_handler(context: BeautifulSoupCrawlingContext) -> None:
    await context.enqueue_links(selector='a.product', label='DETAIL')

@crawler.router.default_handler
async def detail_handler(context: BeautifulSoupCrawlingContext) -> None:
    title = context.soup.title.string
    await context.push_data({'url': context.request.url, 'title': title})

6. Enqueuing Links

JavaScript — enqueueLinks()

// Enqueue all links on page
await enqueueLinks();

// Filter by glob pattern
await enqueueLinks({ globs: ['https://example.com/products/**'] });

// Filter by regex
await enqueueLinks({ regexps: [/\/product\/\d+/] });

// Enqueue only specific selector
await enqueueLinks({ selector: 'a.pagination', label: 'LIST' });

// Enqueue with custom label and transformations
await enqueueLinks({
  selector: 'a.item',
  label: 'DETAIL',
  transformRequestFunction: (req) => {
    req.userData.scrapedAt = new Date().toISOString();
    return req;
  },
});

Python

await context.enqueue_links()
await context.enqueue_links(selector='a.product', label='DETAIL')
await context.enqueue_links(include=[re.compile(r'/products/\d+')])

7. Storage

Dataset (structured results)

// JS — Write
await Dataset.pushData({ url, title, price });
await Dataset.pushData([item1, item2, item3]); // batch write

// JS — Read / Export
const dataset = await Dataset.open();
await dataset.exportToCSV('results'); // saves to KV store
await dataset.exportToJSON('results');

for await (const item of dataset) { console.log(item); }
# Python — Write
await context.push_data({'url': url, 'title': title})

# Python — Read / Export
from crawlee.storages import Dataset
dataset = await Dataset.open()
await dataset.export_to(key='results', content_type='csv')

Data is saved to ./storage/datasets/default/*.json by default.

KeyValueStore (blobs, screenshots, state)

// JS
await KeyValueStore.setValue('OUTPUT', { results: [...] });
const value = await KeyValueStore.getValue('OUTPUT');

// Save a screenshot
const store = await KeyValueStore.open();
await store.setValue('screenshot', await page.screenshot(), { contentType: 'image/png' });
# Python
from crawlee.storages import KeyValueStore
kvs = await KeyValueStore.open()
await kvs.set_value('result', {'data': 'value'})
value = await kvs.get_value('result')

Storage location

./storage/
  datasets/default/     # Dataset rows as JSON files
  key_value_stores/default/  # KV store entries
  request_queues/default/    # Request queue state

Override with env var: CRAWLEE_STORAGE_DIR=/path/to/storage


8. Proxy Management

// JS — Basic proxy rotation
import { ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    'http://user:[email protected]:8000',
    'http://user:[email protected]:8000',
  ],
});

const crawler = new CheerioCrawler({
  proxyConfiguration,
  useSessionPool: true,
  persistCookiesPerSession: true,
  async requestHandler({ proxyInfo, request }) {
    console.log('Using proxy:', proxyInfo?.url);
  },
});
// JS — Tiered proxies (smart cost/reliability balancing)
const proxyConfiguration = new ProxyConfiguration({
  tieredProxyUrls: [
    [null],                              // Tier 0: no proxy (cheapest)
    ['http://cheap-datacenter-proxy'],   // Tier 1: datacenter
    ['http://expensive-residential'],    // Tier 2: residential (most reliable)
  ],
});
// Crawlee auto-escalates tiers when blocking is detected, then drops back when clear
# Python
from crawlee.proxy_configuration import ProxyConfiguration

proxy_configuration = ProxyConfiguration(
    proxy_urls=['http://proxy1.com/', 'http://proxy2.com/'],
)
crawler = BeautifulSoupCrawler(
    proxy_configuration=proxy_configuration,
    use_session_pool=True,
)

9. Session Management

Sessions tie together cookies, proxy IPs, and headers to simulate a consistent user identity.

// JS
const crawler = new CheerioCrawler({
  useSessionPool: true,         // Enable (default: true)
  persistCookiesPerSession: true,
  sessionPoolOptions: { maxPoolSize: 100 },

  async requestHandler({ session, $ }) {
    const title = $('title').text();
    if (title === 'Access Denied') {
      session?.retire();  // Mark this IP+cookie combo as blocked
    } else if (title === 'Slow') {
      session?.markBad(); // Penalize but don't retire
    }
    // session.markGood() is called automatically on success
  },
});
# Python
from crawlee.sessions import SessionPool

crawler = BeautifulSoupCrawler(
    use_session_pool=True,
    session_pool=SessionPool(max_pool_size=100),
)

@crawler.router.default_handler
async def handler(context: BeautifulSoupCrawlingContext) -> None:
    title = context.soup.title.string if context.soup.title else ''
    if title == 'Access Denied':
        context.session.retire()

10. Avoiding Blocks

// JS — Playwright with fingerprint rotation (built-in, zero config needed)
const crawler = new PlaywrightCrawler({
  // Fingerprints automatically randomized by default in Playwright/Puppeteer crawlers
  // headless: false,  // Use headful for harder targets
  async requestHandler({ page }) {
    // Add realistic delays
    await page.waitForTimeout(1000 + Math.random() * 2000);
  },
});

// Use got-scraping for HTTP (built into CheerioCrawler/HttpCrawler)
// It automatically sets realistic headers and TLS fingerprints

Anti-blocking checklist:

  • ✅ Use CheerioCrawler — it uses got-scraping which mimics real browser HTTP
  • ✅ Enable useSessionPool: true with a proxyConfiguration
  • ✅ Use tiered proxies for automatic failover
  • ✅ Set maxRequestsPerMinute to avoid rate limits
  • ✅ For browser crawlers — fingerprints are rotated automatically
  • ✅ Use persistCookiesPerSession: true
  • ✅ Retire sessions on blocks: session.retire()

11. Concurrency & Scaling

// JS
const crawler = new CheerioCrawler({
  maxConcurrency: 50,         // Max parallel requests (default: 200)
  minConcurrency: 1,          // Don't set too high!
  maxRequestsPerMinute: 120,  // Rate limit
  maxRequestsPerCrawl: 1000,  // Total request cap (safety)
  requestHandlerTimeoutSecs: 30,
});
# Python
from crawlee import ConcurrencySettings

crawler = BeautifulSoupCrawler(
    concurrency_settings=ConcurrencySettings(
        max_concurrency=50,
        max_tasks_per_minute=120,
    ),
    max_requests_per_crawl=1000,
)

Scaling notes:

  • Crawlee auto-scales concurrency based on CPU/memory
  • Don't set minConcurrency high — it can crash under load
  • maxRequestsPerMinute is smoother than raw concurrency throttling

12. Configuration & Environment Variables

Env Variable Default Purpose
CRAWLEE_STORAGE_DIR ./storage Storage root directory
CRAWLEE_DEFAULT_DATASET_ID default Override default dataset ID
CRAWLEE_DEFAULT_KEY_VALUE_STORE_ID default Override default KVS ID
CRAWLEE_DEFAULT_REQUEST_QUEUE_ID default Override default queue ID
CRAWLEE_PURGE_ON_START true Clear storage before each run
// JS — Programmatic configuration
import { Configuration } from 'crawlee';

const config = new Configuration({
  storageDir: '/data/crawlee',
  persistStateIntervalMillis: 30_000,
});

const crawler = new CheerioCrawler({ /* ... */ }, config);

13. Docker Deployment

FROM apify/actor-node-playwright-chrome:20

COPY package*.json ./
RUN npm ci --only=prod

COPY . ./

CMD ["node", "src/main.js"]

For Cheerio (smaller image):

FROM apify/actor-node:20

14. Common Patterns

Pagination

// JS — Enqueue next page
router.addHandler('LIST', async ({ page, enqueueLinks }) => {
  await enqueueLinks({ selector: '.product', label: 'DETAIL' });
  const hasNext = await page.$('a.next');
  if (hasNext) await enqueueLinks({ selector: 'a.next', label: 'LIST' });
});

Downloading Files

// JS — Save to KeyValueStore
const { body } = await sendRequest({ responseType: 'buffer' });
await KeyValueStore.setValue('file.pdf', body, { contentType: 'application/pdf' });

Taking Screenshots

// JS — Playwright
async requestHandler({ page, request }) {
  const screenshot = await page.screenshot({ fullPage: true });
  await KeyValueStore.setValue(
    `screenshot-${Date.now()}`,
    screenshot,
    { contentType: 'image/png' }
  );
}

Shared State Across Handlers

// JS — useState()
async requestHandler({ useState }) {
  const state = await useState({ count: 0 });
  state.count++;
  console.log('Total processed:', state.count);
}

Error Handling & Retries

// JS
const crawler = new CheerioCrawler({
  maxRequestRetries: 3, // Retry failed requests up to 3 times
  failedRequestHandler: async ({ request, error }) => {
    console.error(`Failed: ${request.url}`, error.message);
    await Dataset.pushData({ url: request.url, error: error.message });
  },
});
# Python
crawler = BeautifulSoupCrawler(max_request_retries=3)

@crawler.failed_request_handler
async def on_failed(context: BasicCrawlingContext, error: Exception) -> None:
    context.log.error(f'Failed {context.request.url}: {error}')

Sitemap Crawling

import { CheerioCrawler } from 'crawlee';
import { Sitemap } from '@crawlee/utils';

const { urls } = await Sitemap.load('https://example.com/sitemap.xml');
const crawler = new CheerioCrawler({ /* ... */ });
await crawler.run(urls);

Run as Web Server

import { CheerioCrawler } from 'crawlee';
import { createServer } from 'http';

const server = createServer(async (req, res) => {
  const url = new URL(req.url, 'http://localhost').searchParams.get('url');
  const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: 1,
    async requestHandler({ $ }) {
      res.end(JSON.stringify({ title: $('title').text() }));
    },
  });
  await crawler.run([url]);
});
server.listen(3000);

15. TypeScript Support

import { CheerioCrawler, CheerioCrawlingContext, Dataset } from 'crawlee';

interface Product {
  url: string;
  title: string;
  price: number;
}

const crawler = new CheerioCrawler({
  async requestHandler({ $, request }: CheerioCrawlingContext) {
    const title = $('h1').text();
    const price = parseFloat($('.price').text().replace('$', ''));
    await Dataset.pushData\x3CProduct>({ url: request.url, title, price });
  },
});

16. Cloud Deployment (Apify Platform)

import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

const input = await Actor.getInput();
const { startUrls } = input;

const crawler = new CheerioCrawler({
  async requestHandler({ $, request }) {
    await Actor.pushData({ url: request.url, title: $('title').text() });
  },
});

await crawler.run(startUrls);
await Actor.exit();

Deploy with: apify push


17. Debugging Tips

// Enable verbose logging
import { Log } from 'crawlee';
Log.setLevel(Log.LEVELS.DEBUG);

// Run headful (browser crawlers only)
const crawler = new PlaywrightCrawler({
  headless: false,
  // ...
});

// Limit requests while developing
const crawler = new CheerioCrawler({
  maxRequestsPerCrawl: 10,
  // ...
});

18. Reference Files

For advanced topics, see:

  • references/js-api.md — Full JS API quick reference
  • references/python-api.md — Full Python API quick reference

Both language docs: https://crawlee.dev

安全使用建议
This skill is a documentation/guide for using the Crawlee libraries and appears internally consistent. Before using: (1) be aware the examples install packages (npm/pip) and Playwright which download browser binaries and require network access; only run those commands on systems you control. (2) If you plan to supply proxy URLs they may include credentials—treat them as sensitive. (3) Web scraping can raise legal and ethical issues; check robots.txt and the target site's Terms of Service and applicable law. (4) The skill is instruction-only (it won’t run code by itself), but the agent may recommend commands to execute; review any suggested shell commands before running. (5) If you’re concerned about the skill being suggested too often, note it is configured to trigger for many loosely related phrases—consider limiting invocation scope or confirm before acting.
功能分析
Type: OpenClaw Skill Name: crawlee Version: 1.0.0 The skill bundle provides comprehensive and legitimate documentation for the Crawlee web scraping library in both JavaScript and Python. It contains standard installation commands, API references, and code examples for common scraping tasks like handling proxies, sessions, and data storage. No indicators of malicious intent, data exfiltration, or prompt injection were found across SKILL.md or the reference files.
能力评估
Purpose & Capability
The name/description match the provided content (detailed JS/Python guidance for Crawlee). There are no unexpected required binaries, env vars, or config paths.
Instruction Scope
SKILL.md contains step-by-step installation and usage examples for Crawlee (npm/pip/playwright installs, example crawlers, API refs). It does not instruct the agent to read unrelated system files, exfiltrate secrets, or contact hidden endpoints. Note: the doc explicitly tells the agent to trigger for many loosely related user phrases, which affects when the skill will be suggested but does not change its technical scope.
Install Mechanism
This is an instruction-only skill (no install spec). It recommends standard package installs (npm, pip, playwright install) which is expected for this content. Nothing in the skill pulls arbitrary archives or personal servers.
Credentials
The skill declares no required environment variables or credentials. It documents optional proxy configuration (which naturally may carry credentials when used) but does not request unrelated secrets.
Persistence & Privilege
always is false and the skill does not request persistent system-wide privileges. Autonomous invocation is allowed (platform default) but is not combined with other concerning privileges.
如何使用
  1. 确保已安装 OpenClaw(本地或 Docker 部署)
  2. 在对话框中输入安装命令:/install crawlee
  3. 安装完成后,直接呼叫该 Skill 的名称或使用 /crawlee 触发
  4. 根据 Skill 的参数说明提供必要输入,即可获得结构化输出
版本历史
v1.0.0
Initial release: Expert guide for building web scrapers and crawlers using Crawlee (JS/TS and Python)
元数据
Slug crawlee
版本 1.0.0
许可证 MIT-0
累计安装 2
当前安装数 2
历史版本数 1
常见问题

Crawlee 是什么?

Expert guide for building web scrapers and crawlers using Crawlee (JavaScript/TypeScript and Python). Use this skill whenever the user wants to: scrape a web... 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件,目前累计下载 451 次。

如何安装 Crawlee?

在 OpenClaw 或 Claude Code 对话框中运行命令「/install crawlee」即可一键安装,无需额外配置。

Crawlee 是免费的吗?

是的,Crawlee 完全免费,采用 MIT-0 许可证,可自由下载、安装和使用。

Crawlee 支持哪些平台?

Crawlee 跨平台运行,可在任意部署了 OpenClaw / Claude Code 的环境中使用(cross-platform)。

谁开发了 Crawlee?

由 Yash Kavaiya(@yash-kavaiya)开发并维护,当前版本 v1.0.0。

💬 留言讨论