mike442144

Web Crawler

by Mike Chen · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ✓ Security Clean
44
Downloads
0
Stars
0
Active Installs
1
Versions
Install in OpenClaw
/install node-crawler
Description
Node.js web crawler for production-grade, large-scale tasks — NOT for simple one-off requests. Use only when: bulk scraping, batch downloading, multi-page cr...
README (SKILL.md)

# Node Crawler (`crawler` package)

`crawler` is a Node.js web spider library: internal queue + configurable
connection pool + per-domain rate limiting + automatic retries + proxy
rotation + charset detection + server-side Cheerio (jQuery-style) HTML
parsing. Built on `got`, supports HTTP/2.

## Prerequisites

- Node.js >= 22
- Pure ESM: `import Crawler from "crawler"`
- If the codebase must use CommonJS, install `crawler@beta`

```bash
npm install crawler
```

## When to use

This skill is for **production-grade, large-scale crawling**. Reach for it
when the task is substantial:

- Scraping **many pages** (dozens to millions) with structured data extraction
- **Batch-downloading** files (images, PDFs, archives) with retry and resume
  (resume logic is developer-implemented via `userParams` and file existence checks)
- **Long-running spiders** that need rate limiting, retries, and connection pooling
- **Multi-step workflows**: pagination, link following, cascading crawls
- **Proxy rotation**, charset detection, HTTP/2: infrastructure a real
  production crawler depends on
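
The resume logic mentioned above is left to the developer. A minimal sketch of one way to do it, assuming each task carries its target path in `userParams.filename`; `pendingDownloads` and its injectable `exists` parameter are hypothetical names, not part of the `crawler` API:

```js
import fs from "node:fs";

// Hypothetical helper: keep only the tasks whose target file is not on disk
// yet, so a restarted job skips downloads that already completed.
// `exists` is injectable for testing and defaults to fs.existsSync.
function pendingDownloads(tasks, exists = fs.existsSync) {
  return tasks.filter((task) => !exists(task.userParams.filename));
}

const tasks = [
  { url: "https://host/a.png", userParams: { filename: "a.png" } },
  { url: "https://host/b.png", userParams: { filename: "b.png" } },
];

// Pass the result to c.add(...) on a configured Crawler instance.
const todo = pendingDownloads(tasks);
console.log(`${todo.length} of ${tasks.length} downloads still pending`);
```

An existence check cannot tell a complete file from one truncated by a crash. Writing to a temporary name and renaming on success is one way to guard against that.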

### When NOT to use

- A **single page** or one-off request → `curl` is far lighter.
  Spinning up a Crawler instance for 1-2 pages is overkill.
- Pages requiring **JavaScript rendering** → use the `agent-browser` skill
  (Playwright/Puppeteer) instead
- Simple **API data fetching** → `fetch` / `got` with JSON parsing

## API

### Core decision: Queue vs Send

| Approach | When |
|----------|------|
| **`crawler.add()` + `'drain'` event** | Most cases. Goes through the queue, pool, rate limiter, retries, proxy rotation |
| **`crawler.send()`** | One-off requests. Bypasses queue, rate limiter, `preRequest`, `'request'` event |
## Basic usage: Queue mode

```js
import Crawler from "crawler";

const c = new Crawler({
  maxConnections: 10,
  callback: (error, res, done) => {
    if (error) {
      console.error(error);
    } else {
      const $ = res.$;  // Cheerio instance (enabled by default)
      console.log($("title").text());
    }
    done();  // REQUIRED: releases the connection slot, or the crawler deadlocks
  },
});

c.on("drain", () => console.log("All done"));

c.add("https://example.com");
c.add(["https://a.com", "https://b.com"]);
c.add({ url: "https://c.com", jQuery: false,
        callback: (e, res, done) => { /* custom callback */ done(); } });
```

The two most critical rules:

1. Call `done()` from every branch of every queue callback, including the
   `if (error)` branch.
2. The crawler is finished when the `'drain'` event fires, **not** when
   `add()` returns. `add()` merely enqueues tasks.
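
One way to make the first rule hard to break is to centralize the `done()` call. The sketch below is an illustration, not something the `crawler` package provides; `withDone` is a hypothetical wrapper name:

```js
// Hypothetical wrapper: runs `handler(error, res)` and always calls `done()`,
// even when the handler throws or returns early.
function withDone(handler) {
  return (error, res, done) => {
    try {
      handler(error, res);
    } catch (err) {
      console.error("callback failed:", err);
    } finally {
      done(); // releases the connection slot in every branch
    }
  };
}

// Usage sketch: new Crawler({ callback: withDone((error, res) => { ... }) })
const callback = withDone((error, res) => {
  if (error) throw error; // done() still runs
  console.log(res.body.length);
});
```

This covers synchronous handlers only. If the handler awaits I/O, `done()` has to be called after that work settles instead.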

## Rate limiting and retries

```js
import Crawler from "crawler";

const c = new Crawler({
  rateLimit: 1000,     // minimum gap of 1000 ms between requests (forces maxConnections to 1)
  retries: 2,          // number of retries (default: 2)
  retryInterval: 3000, // ms to wait before retrying
  timeout: 20000,      // request timeout in ms
  callback: (e, res, done) => { /* ... */ done(); },
});
```
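
A minimum gap between requests puts a floor on how long a job takes. A back-of-the-envelope sketch, assuming a single rate limiter and ignoring response time and retries; `minDurationMs` is a hypothetical helper, not part of `crawler`:

```js
// With one limiter, n requests need at least (n - 1) gaps of `rateLimit` ms.
function minDurationMs(requestCount, rateLimit) {
  if (requestCount <= 0) return 0;
  return (requestCount - 1) * rateLimit;
}

const ms = minDurationMs(10_000, 1000);
console.log(`${(ms / 3_600_000).toFixed(2)} hours minimum`); // 2.78 hours minimum
```

If that floor is too high, splitting the work across several rate limiters is the documented lever (see `rateLimiterId` under Proxy rotation).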
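
## Data extraction with Cheerio

Cheerio is enabled by default. Use jQuery selectors to extract data:

```js
// Fragment: goes inside the Crawler options object
callback: (error, res, done) => {
  if (error) {
    console.error(error);
    return done();
  }
  const $ = res.$;
  // Trimmed text of every <h2 class="title">
  const titles = $("h2.title").map((i, el) => $(el).text().trim()).get();
  // href attribute of every link (may be relative)
  const links  = $("a").map((i, el) => $(el).attr("href")).get();
  done();
}
```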
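
Extracted `href` values are often relative or duplicated, so they usually need normalizing before being queued again. A minimal sketch using the standard `URL` class; `resolveLinks` is a hypothetical helper, and the base URL is assumed to be the address of the page the links came from:

```js
// Hypothetical helper: turn raw href values into unique absolute http(s) URLs.
function resolveLinks(hrefs, baseUrl) {
  const seen = new Set();
  for (const href of hrefs) {
    try {
      const url = new URL(href, baseUrl);
      if (url.protocol !== "http:" && url.protocol !== "https:") continue;
      url.hash = ""; // fragments point to the same document
      seen.add(url.href);
    } catch {
      // skip hrefs that cannot be parsed as URLs
    }
  }
  return [...seen];
}

const hrefs = ["/a", "b?x=1", "/a#top", "mailto:me@example.com"];
console.log(resolveLinks(hrefs, "https://example.com/docs/"));
// two URLs: https://example.com/a and https://example.com/docs/b?x=1
```

The returned array can be passed straight to `c.add()`, which accepts an array of URLs.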

## Binary file download

```js
import fs from "fs";
import Crawler from "crawler";

const c = new Crawler({
  encoding: null,     // keep body as Buffer
  jQuery: false,      // skip Cheerio parsing
  callback: (err, res, done) => {
    if (err) {
      console.error(err);
    } else {
      // userParams carries the target filename set in add()
      fs.writeFileSync(res.options.userParams.filename, res.body);
    }
    done();
  },
});
c.add({ url: "https://host/file.png", userParams: { filename: "file.png" } });
```
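
For batch downloads the filename usually has to be derived from each URL rather than typed by hand. A minimal sketch, assuming the last path segment is a reasonable name; `filenameFromUrl` and the `download.bin` fallback are hypothetical choices:

```js
// Hypothetical helper: derive a local file name from the last path segment of a URL.
function filenameFromUrl(url, fallback = "download.bin") {
  const segment = new URL(url).pathname.split("/").pop() ?? "";
  let name;
  try {
    name = decodeURIComponent(segment);
  } catch {
    name = segment; // malformed percent-encoding: keep the raw segment
  }
  // Keep a conservative character set so the name cannot contain path separators.
  name = name.replace(/[^A-Za-z0-9._-]/g, "_");
  return name === "" || name === "." || name === ".." ? fallback : name;
}

const urls = ["https://host/img/file.png", "https://host/a%20b.pdf?dl=1", "https://host/"];
const tasks = urls.map((url) => ({ url, userParams: { filename: filenameFromUrl(url) } }));
console.log(tasks.map((t) => t.userParams.filename));
// file.png, a_b.pdf, download.bin
```

Different URLs can map to the same name (for example two `index.html` files), so a real job would likely add a directory prefix or a hash of the URL.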

## Proxy rotation

Use the `'schedule'` event for dynamic assignment (preferred over using the
`proxies` array):

```js
c.on("schedule", options => { options.proxy = "http://proxy:port"; });
c.on("request",  options => { options.searchParams = { t: Date.now() }; });
```
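
The handler above assigns one fixed proxy. A round-robin picker is one simple way to actually rotate; this is a sketch with placeholder proxy URLs, and `roundRobin` is a hypothetical helper rather than part of `crawler`:

```js
// Hypothetical helper: returns a function that cycles through `proxies` in order.
function roundRobin(proxies) {
  if (proxies.length === 0) throw new Error("at least one proxy is required");
  let index = 0;
  return () => {
    const proxy = proxies[index];
    index = (index + 1) % proxies.length;
    return proxy;
  };
}

// Placeholder proxy URLs.
const nextProxy = roundRobin(["http://p1:8080", "http://p2:8080", "http://p3:8080"]);

// Wiring sketch, on an existing Crawler instance `c`:
//   c.on("schedule", (options) => { options.proxy = nextProxy(); });
console.log(nextProxy(), nextProxy(), nextProxy(), nextProxy());
// http://p1:8080 http://p2:8080 http://p3:8080 http://p1:8080
```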

Different proxies can have different rate limiters:

```js
c.add({ url: "...", rateLimiterId: 1, proxy: "http://p1:port" });
c.add({ url: "...", rateLimiterId: 2, proxy: "http://p2:port" });
```
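
For more than a handful of URLs, the proxy and `rateLimiterId` pairing can be generated. A minimal sketch, assuming one limiter per proxy is the desired layout; `assignProxies` is a hypothetical helper and the proxy URLs are placeholders:

```js
// Hypothetical helper: spread URLs across proxies and give each proxy its own limiter id.
function assignProxies(urls, proxies) {
  if (proxies.length === 0) throw new Error("at least one proxy is required");
  return urls.map((url, i) => {
    const slot = i % proxies.length;
    return { url, proxy: proxies[slot], rateLimiterId: slot + 1 };
  });
}

const tasks = assignProxies(
  ["https://example.com/1", "https://example.com/2", "https://example.com/3"],
  ["http://p1:8080", "http://p2:8080"],
);
// Each task object has the same shape as the add() calls above.
console.log(tasks.map((t) => `${t.url} via ${t.proxy} (limiter ${t.rateLimiterId})`));
```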

## HTTP/2

```js
c.add({ url: "https://...", http2: true, callback: (e, res, done) => { done(); } });
```

When debugging through an intercepting proxy such as Charles, or against
self-signed certificates, add `rejectUnauthorized: false` (this disables TLS
certificate verification).

## Passing context data

Use `userParams` to attach data; read it back in the callback via
`res.options.userParams`. Do **not** attach custom fields directly on the
options object.
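
The round trip can be sketched without a network request by calling the callback with a stand-in response object. `makeTask`, `handlePage`, and the `category`/`depth` fields are hypothetical names chosen for illustration:

```js
// Hypothetical helper: wrap a URL and its context into a task object for add().
function makeTask(url, context) {
  return { url, userParams: { ...context } };
}

// Callback that reads the context back from res.options.userParams.
function handlePage(error, res, done) {
  if (!error) {
    const { category, depth } = res.options.userParams;
    console.log(`${category} page at depth ${depth}`);
  }
  done();
}

// Simulated invocation: a real run would use c.add(task) with handlePage as the callback.
const task = makeTask("https://example.com/list?page=2", { category: "books", depth: 1 });
handlePage(null, { options: task }, () => {});
// prints "books page at depth 1"
```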

## Gotchas

- ❌ Forgetting `done()` in the `if (error)` branch → crawler deadlocks
- ❌ Writing `console.log("done")` right after `add()` → listen for `'drain'` instead
- ❌ Setting `maxConnections > 1` when `rateLimit > 0` → it gets overridden to 1
- ❌ Expecting `send()` to trigger `preRequest` or `'request'` → `send()` bypasses all queue mechanics
- ❌ POST form data via `body` → v2 requires `form`
- ❌ Binary download without `encoding: null` → corrupt output

## Options reference

The complete options table is in [references/options.md](references/options.md).

## Code examples

Full, runnable examples for every scenario are in [references/examples.md](references/examples.md):

- Basic queue crawling
- Rate limiting
- Cheerio data extraction
- Binary download
- Direct requests
- HTTP/2
- Proxy rotation
- preRequest hooks
- Full spider (pagination + extraction + following links)

Usage Guidance
Before installing, treat it as a coding reference for large crawling jobs: review the npm package you install, avoid crawling sites without permission, keep rate limits reasonable, and be careful with examples that disable TLS verification or write downloaded files.
Capability Assessment
Purpose & Capability
The documented capabilities are bulk crawling, rate limiting, retries, proxy use, Cheerio parsing, and file downloads, all matching the stated web-crawler purpose.
Instruction Scope
Instructions are scoped to when to use the crawler, when not to use it, and how to call its API; no prompt overrides, hidden commands, or unrelated agent instructions were found.
Install Mechanism
It asks users to install the external npm package `crawler`, which is disclosed and purpose-aligned but still depends on normal npm package trust.
Credentials
Network access, proxy configuration, and local output writes are expected for large-scale crawling, but users should ensure they have permission to crawl target sites and set conservative rate limits.
Persistence & Privilege
No autonomous persistence, privilege escalation, credential harvesting, or background service installation is present; long-running behavior is limited to user-created crawler processes.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install node-crawler
  3. After installation, invoke the skill by name or use /node-crawler
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
node-crawler v1.0.0
  • Initial release of a robust Node.js crawler for large-scale, production use.
  • Supports bulk scraping, batch downloads, multi-step spiders, connection pooling, proxy rotation, rate limiting, and automatic retries.
  • Pure ESM; requires Node.js 22+. Install via `npm install crawler`.
  • Integrates server-side Cheerio (jQuery selectors) out of the box.
  • Provides configurable queueing, pool sizes, and event-driven API (`add()`, `drain`, `schedule`, etc.).
  • Not suitable for single requests or JavaScript-heavy pages; intended for batch/multi-page crawling scenarios.
Metadata
Slug node-crawler
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is Web Crawler?

Node.js web crawler for production-grade, large-scale tasks — NOT for simple one-off requests. Use only when: bulk scraping, batch downloading, multi-page cr... It is an AI Agent Skill for Claude Code / OpenClaw, with 44 downloads so far.

How do I install Web Crawler?

Run "/install node-crawler" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Web Crawler free?

Yes, Web Crawler is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Web Crawler support?

Web Crawler is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Web Crawler?

It is built and maintained by Mike Chen (@mike442144); the current version is v1.0.0.
