Node Crawler
Total downloads: 44
Favorites: 0
Current installs: 0
Versions: 1
Install in OpenClaw
/install node-crawler
Description
Node.js web crawler for production-grade, large-scale tasks — NOT for simple one-off requests. Use only when: bulk scraping, batch downloading, multi-page cr...
Usage (SKILL.md)
# Node Crawler (`crawler` package)

`crawler` is a Node.js web spider library: internal queue + configurable
connection pool + per-domain rate limiting + automatic retries + proxy
rotation + charset detection + server-side Cheerio (jQuery-style) HTML
parsing. Built on `got`, supports HTTP/2.

## Prerequisites

- Node.js >= 22
- Pure ESM: `import Crawler from "crawler"`
- If the codebase must use CommonJS, install `crawler@beta`

```bash
npm install crawler
```

## When to use

This skill is for **production-grade, large-scale crawling**. Reach for it
when the task is substantial:

- Scraping **many pages** (dozens to millions) with structured data extraction
- **Batch-downloading** files (images, PDFs, archives) with retry and resume
  (resume logic is developer-implemented via `userParams` and file existence checks)
- **Long-running spiders** that need rate limiting, retries, and connection pooling
- **Multi-step workflows**: pagination, link following, cascading crawls
- **Proxy rotation**, charset detection, HTTP/2: infrastructure a real
  production crawler depends on

### When NOT to use

- A **single page** or one-off request → `curl` is far lighter.
  Spinning up a Crawler instance for 1-2 pages is overkill.
- Pages requiring **JavaScript rendering** → use the `agent-browser` skill
  (Playwright/Puppeteer) instead
- Simple **API data fetching** → `fetch` / `got` with JSON parsing

## API

### Core decision: Queue vs Send

| Approach | When |
|----------|------|
| **`crawler.add()` + `'drain'` event** | Most cases. Goes through the queue, pool, rate limiter, retries, proxy rotation |
| **`crawler.send()`** | One-off requests. Bypasses queue, rate limiter, `preRequest`, `'request'` event |

## Basic usage: Queue mode

```js
import Crawler from "crawler";

const c = new Crawler({
  maxConnections: 10,
  callback: (error, res, done) => {
    if (error) {
      console.error(error);
    } else {
      const $ = res.$; // Cheerio instance (enabled by default)
      console.log($("title").text());
    }
    done(); // REQUIRED: releases the connection slot, or the crawler deadlocks
  },
});

// Fires once the queue is empty and every task has called done()
c.on("drain", () => console.log("All done"));

c.add("https://example.com");
c.add(["https://a.com", "https://b.com"]);
c.add({
  url: "https://c.com",
  jQuery: false,
  callback: (e, res, done) => { /* custom callback */ done(); },
});
```

The two most critical rules:

1. Call `done()` from every branch of every queue callback, including the
   `if (error)` branch.
2. The crawler is finished when the `'drain'` event fires, **not** when
   `add()` returns. `add()` merely enqueues tasks.
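
The two rules above can also be enforced mechanically. The following is a minimal sketch, not part of the `crawler` API: `withDone` and `untilDrained` are hypothetical helper names, `withDone` assumes a synchronous handler, and `untilDrained` assumes only that the object exposes `on(event, listener)` as the crawler does above.

```js
// Hypothetical helper (not part of crawler): wraps a synchronous handler so
// that done() runs exactly once, whether the handler returns or throws.
function withDone(handler) {
  return (error, res, done) => {
    try {
      handler(error, res);
    } catch (err) {
      console.error("callback failed:", err);
    } finally {
      done(); // always release the connection slot
    }
  };
}

// Hypothetical helper: returns a Promise that resolves on the 'drain' event.
function untilDrained(emitter) {
  return new Promise((resolve) => emitter.on("drain", resolve));
}
```

A possible usage is `callback: withDone((error, res) => { /* ... */ })`, with `const finished = untilDrained(c)` created before the first `add()` and awaited after the last one.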

## Rate limiting and retries

```js
const c = new Crawler({
  rateLimit: 1000,     // at least 1000 ms between requests (forces maxConnections to 1)
  retries: 2,          // default: 2
  retryInterval: 3000, // ms to wait before retrying
  timeout: 20000,      // request timeout in ms
  callback: (e, res, done) => { /* ... */ done(); },
});
```
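
To get a feel for how these settings add up, here is a rough back-of-the-envelope sketch. Both functions are hypothetical helpers, and they assume the simplest possible model: requests run one at a time, spaced by `rateLimit`, and a failing URL times out on every attempt and waits `retryInterval` before each retry. Real runs depend on response times and on how the library schedules retries.

```js
// Lower bound on total run time when requests are spaced by rateLimit ms.
function minQueueMs(urlCount, rateLimit) {
  return Math.max(0, urlCount - 1) * rateLimit;
}

// Upper bound on time spent on one URL that fails every attempt by timing out.
function worstCaseMsPerUrl({ retries, retryInterval, timeout }) {
  const attempts = retries + 1; // first try plus retries
  return attempts * timeout + retries * retryInterval;
}

console.log(minQueueMs(500, 1000)); // 499000
console.log(worstCaseMsPerUrl({ retries: 2, retryInterval: 3000, timeout: 20000 })); // 66000
```

Under this model, 500 URLs at `rateLimit: 1000` need at least about 8 minutes, and a single dead URL can hold its slot for up to 66 seconds with the values from the example above.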

## Data extraction with Cheerio

Cheerio is enabled by default. Use jQuery selectors to extract data:

```js
callback: (e, res, done) => {
  const $ = res.$;
  // Text of every <h2 class="title">, trimmed
  const titles = $("h2.title").map((i, el) => $(el).text().trim()).get();
  // href of every <a>; values may be relative URLs
  const links = $("a").map((i, el) => $(el).attr("href")).get();
  done();
}
```
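
Extracted `href` values are often relative, duplicated, or not HTTP links at all, so they usually need cleaning before being passed back to `add()`. The sketch below shows one way to do that with the standard `URL` class. `absoluteLinks` is a hypothetical helper name, and dropping the fragment is an assumption about what counts as "the same page".

```js
// Hypothetical helper: resolve hrefs against the page URL, keep only
// http(s) links, drop #fragments, and remove duplicates (order preserved).
function absoluteLinks(hrefs, baseUrl) {
  const seen = new Set();
  for (const href of hrefs) {
    if (!href) continue;
    let url;
    try {
      url = new URL(href, baseUrl);
    } catch {
      continue; // skip hrefs that cannot be parsed
    }
    if (url.protocol !== "http:" && url.protocol !== "https:") continue;
    url.hash = "";
    seen.add(url.href);
  }
  return [...seen];
}

console.log(absoluteLinks(["/a", "b#top", "mailto:x@y.z", "/a"], "https://example.com/dir/page"));
// [ 'https://example.com/a', 'https://example.com/dir/b' ]
```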

## Binary file download

```js
import fs from "fs";

const c = new Crawler({
  encoding: null, // keep body as Buffer
  jQuery: false,  // skip Cheerio parsing
  callback: (err, res, done) => {
    if (err) {
      console.error(err);
    } else {
      // filename comes from the userParams attached to the task below
      fs.writeFileSync(res.options.userParams.filename, res.body);
    }
    done();
  },
});

c.add({ url: "https://host/file.png", userParams: { filename: "file.png" } });
```
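
Resume behavior for batch downloads is left to the developer. One possible approach, sketched below, is to skip any URL whose target file already exists before queueing it. `buildDownloadTasks` is a hypothetical helper, the filename is assumed to be the last path segment of the URL, and `exists` is a caller-supplied predicate (for example `fs.existsSync`).

```js
// Hypothetical helper: turn URLs into download tasks, skipping files that
// are already present according to the exists(filename) predicate.
function buildDownloadTasks(urls, exists) {
  const tasks = [];
  for (const url of urls) {
    // Assumption: the last path segment is a usable filename
    const filename = new URL(url).pathname.split("/").pop();
    if (!filename || exists(filename)) continue;
    tasks.push({ url, userParams: { filename } });
  }
  return tasks;
}

const alreadyHave = new Set(["a.png"]);
const tasks = buildDownloadTasks(
  ["https://host/a.png", "https://host/img/b.png"],
  (name) => alreadyHave.has(name)
);
console.log(tasks.length); // 1
```

The resulting array can be passed to `c.add(tasks)`. A file left half-written by an interrupted run would still be treated as present, so a stricter check (size or checksum) may be needed in practice.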

## Proxy rotation

Use the `'schedule'` event for dynamic assignment (preferred over using the
`proxies` array):

```js
// Assign a proxy when the task is scheduled
c.on("schedule", options => { options.proxy = "http://proxy:port"; });
// Adjust options just before the request is sent, e.g. add a cache-busting param
c.on("request", options => { options.searchParams = { t: Date.now() }; });
```
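
The handler above always assigns the same proxy. A minimal round-robin sketch follows. `createProxyRotator` is a hypothetical helper and the proxy URLs are placeholders. It relies only on the handler receiving a mutable `options` object, as in the example above.

```js
// Hypothetical helper: returns a 'schedule' handler that cycles through
// the given proxies in order, one per scheduled task.
function createProxyRotator(proxies) {
  if (proxies.length === 0) {
    throw new Error("at least one proxy is required");
  }
  let next = 0;
  return (options) => {
    options.proxy = proxies[next];
    next = (next + 1) % proxies.length;
  };
}

const rotate = createProxyRotator(["http://p1:8080", "http://p2:8080"]);
const first = {}, second = {}, third = {};
rotate(first);
rotate(second);
rotate(third);
console.log(first.proxy, second.proxy, third.proxy);
// http://p1:8080 http://p2:8080 http://p1:8080
```

It would be attached with `c.on("schedule", rotate)`.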

Different proxies can have different rate limiters:

```js
c.add({ url: "...", rateLimiterId: 1, proxy: "http://p1:port" });
c.add({ url: "...", rateLimiterId: 2, proxy: "http://p2:port" });
```
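
For a long URL list, tasks like these can be generated rather than written by hand. The sketch below is one way to do it, assuming one limiter per proxy with IDs starting at 1 as in the example above. `tasksPerProxy` is a hypothetical helper and the proxy URLs are placeholders.

```js
// Hypothetical helper: spread URLs across proxies and give each proxy
// its own rateLimiterId (1-based, matching the proxy's position).
function tasksPerProxy(urls, proxies) {
  return urls.map((url, i) => {
    const slot = i % proxies.length;
    return { url, rateLimiterId: slot + 1, proxy: proxies[slot] };
  });
}

const spread = tasksPerProxy(
  ["https://a.com", "https://b.com", "https://c.com"],
  ["http://p1:8080", "http://p2:8080"]
);
console.log(spread.map((t) => t.rateLimiterId)); // [ 1, 2, 1 ]
```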
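
## HTTP/2

```js
c.add({ url: "https://...", http2: true, callback: (e, res, done) => { done(); } });
```

When routing traffic through a debugging proxy such as Charles, or when the
server uses a self-signed certificate, add `rejectUnauthorized: false`. This
turns off TLS certificate verification for the request.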

## Passing context data

Use `userParams` to attach data; read it back in the callback via
`res.options.userParams`. Do **not** attach custom fields directly on the
options object.
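
As a small illustration of that round trip, the sketch below builds a task carrying context and reads it back from a response-shaped object. `pageTask` and `contextOf` are hypothetical helpers, and the `category` / `page` fields are made-up example data.

```js
// Hypothetical helper: attach context under userParams, never on options itself.
function pageTask(url, context) {
  return { url, userParams: { ...context } };
}

// Hypothetical helper: read the context back inside a callback.
function contextOf(res) {
  return res.options.userParams ?? {};
}

const task = pageTask("https://example.com/list?page=2", { category: "books", page: 2 });
// Simulates the res object a callback receives: res.options carries the task options
const fakeRes = { options: task };
console.log(contextOf(fakeRes).page); // 2
```

Inside a real callback this would read as `const { category, page } = contextOf(res);` before calling `done()`.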

## Gotchas

- ❌ Forgetting `done()` in the `if (error)` branch → crawler deadlocks
- ❌ Writing `console.log("done")` right after `add()` → listen for `'drain'` instead
- ❌ Setting `maxConnections > 1` when `rateLimit > 0` → it gets overridden to 1
- ❌ Expecting `send()` to trigger `preRequest` or `'request'` → `send()` bypasses all queue mechanics
- ❌ POST form data via `body` → v2 requires `form`
- ❌ Binary download without `encoding: null` → corrupt output

## Options reference

The complete options table is in [references/options.md](references/options.md).

## Code examples

Full, runnable examples for every scenario are in [references/examples.md](references/examples.md):

- Basic queue crawling
- Rate limiting
- Cheerio data extraction
- Binary download
- Direct requests
- HTTP/2
- Proxy rotation
- preRequest hooks
- Full spider (pagination + extraction + following links)
Security recommendations
Before installing, treat it as a coding reference for large crawling jobs: review the npm package you install, avoid crawling sites without permission, keep rate limits reasonable, and be careful with examples that disable TLS verification or write downloaded files.
Capability assessment
Purpose & Capability
The documented capabilities are bulk crawling, rate limiting, retries, proxy use, Cheerio parsing, and file downloads, all matching the stated web-crawler purpose.
Instruction Scope
Instructions are scoped to when to use the crawler, when not to use it, and how to call its API; no prompt overrides, hidden commands, or unrelated agent instructions were found.
Install Mechanism
It asks users to install the external npm package `crawler`, which is disclosed and purpose-aligned but still depends on normal npm package trust.
Credentials
Network access, proxy configuration, and local output writes are expected for large-scale crawling, but users should ensure they have permission to crawl target sites and set conservative rate limits.
Persistence & Privilege
No autonomous persistence, privilege escalation, credential harvesting, or background service installation is present; long-running behavior is limited to user-created crawler processes.
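How to use
- Make sure OpenClaw is installed (locally or via Docker).
- Enter the install command in the chat box: `/install node-crawler`
- Once installation finishes, invoke the Skill by name or trigger it with `/node-crawler`.
- Provide the inputs described in the Skill's parameter documentation to get structured output.
Version history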
v1.0.0
node-crawler v1.0.0
- Initial release of a robust Node.js crawler for large-scale, production use.
- Supports bulk scraping, batch downloads, multi-step spiders, connection pooling, proxy rotation, rate limiting, and automatic retries.
- Pure ESM; requires Node.js 22+. Install via `npm install crawler`.
- Integrates server-side Cheerio (jQuery selectors) out of the box.
- Provides configurable queueing, pool sizes, and event-driven API (`add()`, `drain`, `schedule`, etc.).
- Not suitable for single requests or JavaScript-heavy pages; intended for batch/multi-page crawling scenarios.
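FAQ
What is Web Crawler?
Node.js web crawler for production-grade, large-scale tasks, NOT for simple one-off requests. Use only when: bulk scraping, batch downloading, multi-page cr... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 44 total downloads so far.
How do I install Web Crawler?
Run the command `/install node-crawler` in the OpenClaw or Claude Code chat box to install it in one step; no extra configuration is needed.
Is Web Crawler free?
Yes. Web Crawler is completely free and released under the MIT-0 license, so it can be downloaded, installed, and used freely.
Which platforms does Web Crawler support?
Web Crawler is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.
Who developed Web Crawler?
It is developed and maintained by Mike Chen (@mike442144); the current version is v1.0.0.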