/install node-crawler
\r \r
Node Crawler (crawler package)\r
\r
crawler is a Node.js web spider library: internal queue + configurable\r
connection pool + per-domain rate limiting + automatic retries + proxy\r
rotation + charset detection + server-side Cheerio (jQuery-style) HTML\r
parsing. Built on got, supports HTTP/2.\r
\r
Prerequisites\r
\r
- Node.js >= 22\r
- Pure ESM:
import Crawler from "crawler"\r - If the codebase must use CommonJS, install
crawler@beta\r \r
npm install crawler\r
```\r
\r
## When to use\r
\r
This skill is for **production-grade, large-scale crawling**. Reach for it\r
when the task is substantial:\r
\r
- Scraping **many pages** (dozens to millions) with structured data extraction\r
- **Batch-downloading** files — images, PDFs, archives — with retry and resume\r
(resume logic is developer-implemented via `userParams` and file existence checks)\r
- **Long-running spiders** that need rate limiting, retries, and connection pooling\r
- **Multi-step workflows** — pagination, link following, cascading crawls\r
- **Proxy rotation**, charset detection, HTTP/2 — infrastructure a real\r
production crawler depends on\r
\r
### When NOT to use\r
\r
- A **single page** or one-off request → `curl` is far lighter.\r
Spinning up a Crawler instance for 1-2 pages is overkill.\r
- Pages requiring **JavaScript rendering** → use the `agent-browser` skill\r
(Playwright/Puppeteer) instead\r
- Simple **API data fetching** → `fetch` / `got` with JSON parsing\r
\r
## API\r
\r
### Core decision: Queue vs Send\r
\r
| Approach | When |\r
|----------|------|\r
| **`crawler.add()` + `'drain'` event** | Most cases. Goes through the queue, pool, rate limiter, retries, proxy rotation |\r
| **`crawler.send()`** | One-off requests. Bypasses queue, rate limiter, `preRequest`, `'request'` event |\r
\r
## Basic usage: Queue mode\r
\r
```js\r
import Crawler from "crawler";\r
\r
const c = new Crawler({\r
maxConnections: 10,\r
callback: (error, res, done) => {\r
if (error) {\r
console.error(error);\r
} else {\r
const $ = res.$; // Cheerio instance (enabled by default)\r
console.log($("title").text());\r
}\r
done(); // REQUIRED: releases the connection slot, or the crawler deadlocks\r
},\r
});\r
\r
c.on("drain", () => console.log("All done"));\r
\r
c.add("https://example.com");\r
c.add(["https://a.com", "https://b.com"]);\r
c.add({ url: "https://c.com", jQuery: false,\r
callback: (e, res, done) => { /* custom callback */ done(); } });\r
```\r
\r
Two most critical rules:\r
\r
1. Call `done()` from every branch of every queue callback, including the\r
`if (error)` branch\r
2. The crawler is finished when the `'drain'` event fires, **not** when\r
`add()` returns. `add()` merely enqueues tasks\r
\r
## Rate limiting and retries\r
\r
```js\r
const c = new Crawler({\r
rateLimit: 1000, // minimum gap between requests >=1000ms (forces maxConnections=1)\r
retries: 2, // default: 2\r
retryInterval: 3000,// ms to wait before retrying\r
timeout: 20000, // request timeout in ms\r
callback: (e, res, done) => { /* ... */ done(); },\r
});\r
```\r
\r
## Data extraction with Cheerio\r
\r
Cheerio is enabled by default. Use jQuery selectors to extract data:\r
\r
```js\r
callback: (e, res, done) => {\r
const $ = res.$;\r
const titles = $("h2.title").map((i, el) => $(el).text().trim()).get();\r
const links = $("a").map((i, el) => $(el).attr("href")).get();\r
done();\r
}\r
```\r
\r
## Binary file download\r
\r
```js\r
import fs from "fs";\r
const c = new Crawler({\r
encoding: null, // keep body as Buffer\r
jQuery: false, // skip Cheerio parsing\r
callback: (err, res, done) => {\r
if (!err) fs.writeFileSync(res.options.userParams.filename, res.body);\r
done();\r
},\r
});\r
c.add({ url: "https://host/file.png", userParams: { filename: "file.png" } });\r
```\r
\r
## Proxy rotation\r
\r
Use the `'schedule'` event for dynamic assignment (preferred over using the\r
`proxies` array):\r
\r
```js\r
c.on("schedule", options => { options.proxy = "http://proxy:port"; });\r
c.on("request", options => { options.searchParams = { t: Date.now() }; });\r
```\r
\r
Different proxies can have different rate limiters:\r
\r
```js\r
c.add({ url: "...", rateLimiterId: 1, proxy: "http://p1:port" });\r
c.add({ url: "...", rateLimiterId: 2, proxy: "http://p2:port" });\r
```\r
\r
## HTTP/2\r
\r
```js\r
c.add({ url: "https://...", http2: true, callback: (e, res, done) => { done(); } });\r
```\r
\r
When using Charles or self-signed certs, add `rejectUnauthorized: false`.\r
\r
## Passing context data\r
\r
Use `userParams` to attach data; read it back in the callback via\r
`res.options.userParams`. Do **not** attach custom fields directly on the\r
options object.\r
\r
## Gotchas\r
\r
- ❌ Forgetting `done()` in the `if (error)` branch → crawler deadlocks\r
- ❌ Writing `console.log("done")` right after `add()` → listen for `'drain'` instead\r
- ❌ Setting `maxConnections > 1` when `rateLimit > 0` → it gets overridden to 1\r
- ❌ Expecting `send()` to trigger `preRequest` or `'request'` → `send()` bypasses all queue mechanics\r
- ❌ POST form data via `body` → v2 requires `form`\r
- ❌ Binary download without `encoding: null` → corrupt output\r
\r
## Options reference\r
\r
The complete options table is in [references/options.md](references/options.md).\r
\r
## Code examples\r
\r
Full, runnable examples for every scenario are in [references/examples.md](references/examples.md):\r
\r
- Basic queue crawling\r
- Rate limiting\r
- Cheerio data extraction\r
- Binary download\r
- Direct requests\r
- HTTP/2\r
- Proxy rotation\r
- preRequest hooks\r
- Full spider (pagination + extraction + following links)\r
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat:
/install node-crawler - After installation, invoke the skill by name or use
/node-crawler - Provide required inputs per the skill's parameter spec and get structured output
What is Web Crawler?
Node.js web crawler for production-grade, large-scale tasks — NOT for simple one-off requests. Use only when: bulk scraping, batch downloading, multi-page cr... It is an AI Agent Skill for Claude Code / OpenClaw, with 44 downloads so far.
How do I install Web Crawler?
Run "/install node-crawler" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.
Is Web Crawler free?
Yes, Web Crawler is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does Web Crawler support?
Web Crawler is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).
Who created Web Crawler?
It is built and maintained by Mike Chen (@mike442144); the current version is v1.0.0.