mike442144

Web Crawler

Author: Mike Chen · GitHub · v1.0.0 · MIT-0
cross-platform ✓ Security check passed
Total downloads: 44 · Favorites: 0 · Current installs: 0 · Versions: 1
Install in OpenClaw
/install node-crawler
Description
Node.js web crawler for production-grade, large-scale tasks — NOT for simple one-off requests. Use only when: bulk scraping, batch downloading, multi-page cr...
Usage (SKILL.md)

# Node Crawler (`crawler` package)

`crawler` is a Node.js web spider library. It combines an internal queue, a configurable connection pool, per-domain rate limiting, automatic retries, proxy rotation, charset detection, and server-side Cheerio (jQuery-style) HTML parsing. It is built on `got` and supports HTTP/2.

## Prerequisites

- Node.js >= 22
- Pure ESM: `import Crawler from "crawler"`
- If the codebase must use CommonJS, install `crawler@beta`

```bash
npm install crawler
```

## When to use

This skill is for **production-grade, large-scale crawling**. Reach for it
when the task is substantial:

- Scraping **many pages** (dozens to millions) with structured data extraction
- **Batch-downloading** files such as images, PDFs, and archives, with retry and resume
  (resume logic is developer-implemented via `userParams` and file existence checks)
- **Long-running spiders** that need rate limiting, retries, and connection pooling
- **Multi-step workflows**: pagination, link following, cascading crawls
- **Proxy rotation**, charset detection, HTTP/2: infrastructure that a real
  production crawler depends on
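
The resume logic mentioned in the list can be sketched as a filter that drops files already on disk before anything is queued. This is a minimal sketch, not part of the `crawler` API: `buildDownloadTasks` is a hypothetical helper, and `exists` is an injected predicate (in a real run it could be `fs.existsSync`).

```js
// Hypothetical helper: turns { url, filename } pairs into crawler tasks,
// skipping every file that `exists` reports as already downloaded.
function buildDownloadTasks(files, exists) {
  return files
    .filter(({ filename }) => !exists(filename)) // skip files from a previous run
    .map(({ url, filename }) => ({ url, userParams: { filename } }));
}

// Stand-in for the file system: pretend a.png was saved by an earlier run.
const onDisk = new Set(["a.png"]);

const tasks = buildDownloadTasks(
  [
    { url: "https://host/a.png", filename: "a.png" },
    { url: "https://host/b.png", filename: "b.png" },
  ],
  (name) => onDisk.has(name),
);

console.log(tasks); // only the b.png task remains
```

The resulting array is shaped like the option objects passed to `add()` elsewhere in this document, so it should be possible to queue it directly; treat that as an assumption to verify against the options reference.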

### When NOT to use

- A **single page** or one-off request → `curl` is far lighter.
  Spinning up a Crawler instance for 1-2 pages is overkill.
- Pages requiring **JavaScript rendering** → use the `agent-browser` skill
  (Playwright/Puppeteer) instead
- Simple **API data fetching** → `fetch` / `got` with JSON parsing

## API

### Core decision: Queue vs Send

| Approach | When |
|----------|------|
| **`crawler.add()` + `'drain'` event** | Most cases. Goes through the queue, pool, rate limiter, retries, proxy rotation |
| **`crawler.send()`** | One-off requests. Bypasses queue, rate limiter, `preRequest`, `'request'` event |

## Basic usage: Queue mode

```js
import Crawler from "crawler";

const c = new Crawler({
  maxConnections: 10,
  callback: (error, res, done) => {
    if (error) {
      console.error(error);
    } else {
      const $ = res.$;  // Cheerio instance (enabled by default)
      console.log($("title").text());
    }
    done();  // REQUIRED: releases the connection slot, or the crawler deadlocks
  },
});

c.on("drain", () => console.log("All done"));

c.add("https://example.com");
c.add(["https://a.com", "https://b.com"]);
c.add({ url: "https://c.com", jQuery: false,
        callback: (e, res, done) => { /* custom callback */ done(); } });
```

The two most critical rules:

1. Call `done()` from every branch of every queue callback, including the
   `if (error)` branch.
2. The crawler is finished when the `'drain'` event fires, **not** when
   `add()` returns. `add()` merely enqueues tasks.
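
One way to make the first rule hard to break is to wrap the callback so that `done()` sits in a `finally` block. The sketch below is an illustration, not part of the `crawler` API: `withDone` is a hypothetical helper, and it only covers synchronous handlers (an async handler would need the returned promise awaited before `done()`).

```js
// Hypothetical wrapper: runs `handler`, then always calls `done()`,
// even when the handler throws.
function withDone(handler) {
  return (error, res, done) => {
    try {
      handler(error, res);
    } catch (err) {
      console.error("callback failed:", err.message);
    } finally {
      done(); // always release the connection slot
    }
  };
}

// Simulate two invocations without a real crawler: one failing, one succeeding.
let doneCalls = 0;
const countDone = () => { doneCalls += 1; };

const callback = withDone((error, res) => {
  if (error) throw error; // even this path still reaches done()
  console.log("got response", res.label);
});

callback(new Error("boom"), null, countDone);
callback(null, { label: "ok" }, countDone);

console.log(doneCalls);
```

The wrapped function has the same `(error, res, done)` shape as the `callback` option shown in the queue-mode example.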

## Rate limiting and retries

```js
const c = new Crawler({
  rateLimit: 1000,     // at least 1000 ms between requests (forces maxConnections = 1)
  retries: 2,          // default: 2
  retryInterval: 3000, // ms to wait before retrying
  timeout: 20000,      // request timeout in ms
  callback: (e, res, done) => { /* ... */ done(); },
});
```
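
Because `rateLimit` enforces a minimum gap between requests, it also sets a floor on how long a crawl can take. The arithmetic below is a rough sketch under that single assumption; `minCrawlTimeMs` is a hypothetical helper, and it ignores response time, retries, and `retryInterval`, all of which only add to the total.

```js
// Lower bound on crawl duration: n requests need at least n - 1 gaps.
function minCrawlTimeMs(requestCount, rateLimitMs) {
  if (requestCount <= 1) return 0;
  return (requestCount - 1) * rateLimitMs;
}

// Example: 3600 URLs behind a single limiter at rateLimit: 1000.
const minutes = minCrawlTimeMs(3600, 1000) / 60000;
console.log(`at least ${minutes.toFixed(1)} minutes`);
```

If that floor is too high, spreading tasks over several limiters with `rateLimiterId` (see the proxy rotation section) is the lever this document describes.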

## Data extraction with Cheerio

Cheerio is enabled by default. Use jQuery selectors to extract data:

```js
callback: (e, res, done) => {
  const $ = res.$;
  const titles = $("h2.title").map((i, el) => $(el).text().trim()).get();
  const links  = $("a").map((i, el) => $(el).attr("href")).get();
  done();
}
```
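
Extracted `href` values are often relative, so they usually need to be resolved before being queued again. A minimal sketch using the standard `URL` class follows; `absoluteLinks` is a hypothetical helper, and `baseUrl` is assumed to be the URL of the page the links came from, supplied by the caller.

```js
// Hypothetical helper: resolves hrefs against the page URL, keeps only
// http(s) links, strips fragments, and removes duplicates.
function absoluteLinks(hrefs, baseUrl) {
  const seen = new Set();
  for (const href of hrefs) {
    try {
      const url = new URL(href, baseUrl);
      if (url.protocol === "http:" || url.protocol === "https:") {
        url.hash = ""; // "#section" variants point at the same page
        seen.add(url.href);
      }
    } catch {
      // ignore hrefs that cannot be parsed as URLs
    }
  }
  return [...seen];
}

const resolved = absoluteLinks(
  ["/about", "news?page=2", "mailto:x@example.com", "/about#team"],
  "https://example.com/blog/",
);
console.log(resolved);
```

The `links` array from the callback above could be passed in as `hrefs`.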

## Binary file download

```js
import fs from "fs";
import Crawler from "crawler";

const c = new Crawler({
  encoding: null,     // keep body as Buffer
  jQuery: false,      // skip Cheerio parsing
  callback: (err, res, done) => {
    if (err) {
      console.error(err);
    } else {
      // the target filename travels with the task via userParams
      fs.writeFileSync(res.options.userParams.filename, res.body);
    }
    done();
  },
});
c.add({ url: "https://host/file.png", userParams: { filename: "file.png" } });
```
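
When many files are queued, the `filename` in `userParams` is typically derived from the URL. The sketch below shows one way to do that with the standard `URL` class; `filenameFromUrl` and the `download.bin` fallback are hypothetical and not part of the `crawler` API.

```js
// Hypothetical helper: takes the last path segment of a URL and replaces
// characters that are unsafe in file names.
function filenameFromUrl(rawUrl, fallback = "download.bin") {
  const { pathname } = new URL(rawUrl);
  const last = pathname.split("/").filter(Boolean).pop();
  if (!last) return fallback; // e.g. "https://host/" has no file segment
  let decoded = last;
  try {
    decoded = decodeURIComponent(last); // turn %20 and similar back into characters
  } catch {
    // keep the raw segment if the percent-encoding is malformed
  }
  return decoded.replace(/[^\w.-]/g, "_");
}

const task = {
  url: "https://host/img/file%20one.png?x=1",
  userParams: { filename: filenameFromUrl("https://host/img/file%20one.png?x=1") },
};
console.log(task.userParams.filename);
```

Different URLs can map to the same name, so a real downloader would likely add a directory or hash prefix.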

## Proxy rotation

Use the `'schedule'` event for dynamic assignment (preferred over using the
`proxies` array):

```js
c.on("schedule", options => { options.proxy = "http://proxy:port"; });
c.on("request",  options => { options.searchParams = { t: Date.now() }; });
```

Different proxies can have different rate limiters:

```js
c.add({ url: "...", rateLimiterId: 1, proxy: "http://p1:port" });
c.add({ url: "...", rateLimiterId: 2, proxy: "http://p2:port" });
```
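
A round-robin picker is one simple way to pair each task with a proxy and a matching limiter. This is a sketch under assumptions: `createProxyRotator`, the proxy addresses, and the target URLs are hypothetical; only the `proxy` and `rateLimiterId` option names come from the snippet above.

```js
// Hypothetical helper: cycles through the proxy list and uses the proxy's
// index as its rateLimiterId, so each proxy is throttled independently.
function createProxyRotator(proxies) {
  if (proxies.length === 0) throw new Error("at least one proxy is required");
  let next = 0;
  return () => {
    const index = next;
    next = (next + 1) % proxies.length;
    return { proxy: proxies[index], rateLimiterId: index };
  };
}

const pick = createProxyRotator(["http://p1:8080", "http://p2:8080"]);

const urls = [
  "https://example.com/1",
  "https://example.com/2",
  "https://example.com/3",
];
const tasks = urls.map((url) => ({ url, ...pick() }));
console.log(tasks);
```

The same `pick()` call could also be used inside a `'schedule'` handler to assign `options.proxy` at dispatch time instead of at enqueue time.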

## HTTP/2

```js
c.add({ url: "https://...", http2: true, callback: (e, res, done) => { done(); } });
```

When routing traffic through Charles or using self-signed certificates, add `rejectUnauthorized: false`.

## Passing context data

Use `userParams` to attach data; read it back in the callback via
`res.options.userParams`. Do **not** attach custom fields directly on the
options object.
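
The round trip can be illustrated without a network call by handing the callback a stand-in response. In this sketch `makeTask`, `onPage`, and the `category` / `page` fields are hypothetical; the fake `res` only mimics the `res.options.userParams` path described above.

```js
// Hypothetical helper: keeps all custom context under userParams.
function makeTask(url, context) {
  return { url, userParams: { ...context } };
}

const collected = [];

// Callback in the usual (error, res, done) shape.
function onPage(error, res, done) {
  if (!error) {
    const { category, page } = res.options.userParams;
    collected.push(`${category}#${page}`);
  }
  done();
}

const task = makeTask("https://example.com/list?page=2", { category: "books", page: 2 });

// Stand-in for the crawler: invoke the callback with a fake response
// whose `options` is the task that was queued.
onPage(null, { options: task }, () => {});
console.log(collected);
```

With a real crawler instance, `task` would be passed to `add()` and `onPage` supplied as the `callback` option.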

## Gotchas

- ❌ Forgetting `done()` in the `if (error)` branch → crawler deadlocks
- ❌ Writing `console.log("done")` right after `add()` → listen for `'drain'` instead
- ❌ Setting `maxConnections > 1` when `rateLimit > 0` → it gets overridden to 1
- ❌ Expecting `send()` to trigger `preRequest` or `'request'` → `send()` bypasses all queue mechanics
- ❌ POST form data via `body` → v2 requires `form`
- ❌ Binary download without `encoding: null` → corrupt output

## Options reference

The complete options table is in [references/options.md](references/options.md).

## Code examples

Full, runnable examples for every scenario are in [references/examples.md](references/examples.md):

- Basic queue crawling
- Rate limiting
- Cheerio data extraction
- Binary download
- Direct requests
- HTTP/2
- Proxy rotation
- preRequest hooks
- Full spider (pagination + extraction + following links)
Security recommendations
Before installing, treat it as a coding reference for large crawling jobs: review the npm package you install, avoid crawling sites without permission, keep rate limits reasonable, and be careful with examples that disable TLS verification or write downloaded files.
Capability assessment
Purpose & Capability
The documented capabilities are bulk crawling, rate limiting, retries, proxy use, Cheerio parsing, and file downloads, all matching the stated web-crawler purpose.
Instruction Scope
Instructions are scoped to when to use the crawler, when not to use it, and how to call its API; no prompt overrides, hidden commands, or unrelated agent instructions were found.
Install Mechanism
It asks users to install the external npm package `crawler`, which is disclosed and purpose-aligned but still depends on normal npm package trust.
Credentials
Network access, proxy configuration, and local output writes are expected for large-scale crawling, but users should ensure they have permission to crawl target sites and set conservative rate limits.
Persistence & Privilege
No autonomous persistence, privilege escalation, credential harvesting, or background service installation is present; long-running behavior is limited to user-created crawler processes.
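How to use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install node-crawler
  3. Once installed, invoke the Skill by name or trigger it with /node-crawler
  4. Provide the required input according to the Skill's parameter description to get structured output
Version history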
v1.0.0
node-crawler v1.0.0
- Initial release of a robust Node.js crawler for large-scale, production use.
- Supports bulk scraping, batch downloads, multi-step spiders, connection pooling, proxy rotation, rate limiting, and automatic retries.
- Pure ESM; requires Node.js 22+. Install via `npm install crawler`.
- Integrates server-side Cheerio (jQuery selectors) out of the box.
- Provides configurable queueing, pool sizes, and event-driven API (`add()`, `drain`, `schedule`, etc.).
- Not suitable for single requests or JavaScript-heavy pages; intended for batch/multi-page crawling scenarios.
Metadata
Slug: node-crawler
Version: 1.0.0
License: MIT-0
Total installs: 0
Current installs: 0
Version count: 1
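FAQ

What is Web Crawler?

Node.js web crawler for production-grade, large-scale tasks, NOT for simple one-off requests. Use only when: bulk scraping, batch downloading, multi-page cr... It is an AI Agent Skill plugin for Claude Code / OpenClaw and currently has 44 total downloads.

How do I install Web Crawler?

Run the command "/install node-crawler" in the OpenClaw or Claude Code chat box to install it in one step; no extra configuration is needed.

Is Web Crawler free?

Yes. Web Crawler is completely free and released under the MIT-0 license; it can be downloaded, installed, and used freely.

Which platforms does Web Crawler support?

Web Crawler is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.

Who developed Web Crawler?

It is developed and maintained by Mike Chen (@mike442144); the current version is v1.0.0.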
