Web Crawler

Name: Web Crawler
Author: mike442144

by Mike Chen · GitHub ↗ · v1.0.0 · MIT-0

cross-platform ✓ Security Clean

Install in OpenClaw

/install node-crawler

Description

Node.js web crawler for production-grade, large-scale tasks — NOT for simple one-off requests. Use only when: bulk scraping, batch downloading, multi-page cr...

README (SKILL.md)

# Node Crawler (`crawler` package)

`crawler` is a Node.js web spider library: internal queue + configurable
connection pool + per-domain rate limiting + automatic retries + proxy
rotation + charset detection + server-side Cheerio (jQuery-style) HTML
parsing. Built on `got`, supports HTTP/2.

## Prerequisites

- Node.js >= 22
- Pure ESM: `import Crawler from "crawler"`
- If the codebase must use CommonJS, install `crawler@beta`

```bash
npm install crawler
```

## When to use

This skill is for **production-grade, large-scale crawling**. Reach for it
when the task is substantial:

- Scraping **many pages** (dozens to millions) with structured data extraction
- **Batch-downloading** files — images, PDFs, archives — with retry and resume
  (resume logic is developer-implemented via `userParams` and file existence checks)
- **Long-running spiders** that need rate limiting, retries, and connection pooling
- **Multi-step workflows** — pagination, link following, cascading crawls
- **Proxy rotation**, charset detection, HTTP/2 — infrastructure a real
  production crawler depends on

### When NOT to use

- A **single page** or one-off request → `curl` is far lighter.
  Spinning up a Crawler instance for 1-2 pages is overkill.
- Pages requiring **JavaScript rendering** → use the `agent-browser` skill
  (Playwright/Puppeteer) instead
- Simple **API data fetching** → `fetch` / `got` with JSON parsing

## API

### Core decision: Queue vs Send

| Approach | When |
|----------|------|
| **`crawler.add()` + `'drain'` event** | Most cases. Goes through the queue, pool, rate limiter, retries, proxy rotation |
| **`crawler.send()`** | One-off requests. Bypasses queue, rate limiter, `preRequest`, `'request'` event |

## Basic usage: Queue mode

```js
import Crawler from "crawler";

const c = new Crawler({
  maxConnections: 10,
  callback: (error, res, done) => {
    if (error) {
      console.error(error);
    } else {
      const $ = res.$;  // Cheerio instance (enabled by default)
      console.log($("title").text());
    }
    done();  // REQUIRED: releases the connection slot, or the crawler deadlocks
  },
});

c.on("drain", () => console.log("All done"));

c.add("https://example.com");
c.add(["https://a.com", "https://b.com"]);
c.add({ url: "https://c.com", jQuery: false,
        callback: (e, res, done) => { /* custom callback */ done(); } });
```

The two most critical rules:

1. Call `done()` from every branch of every queue callback, including the
   `if (error)` branch.
2. The crawler is finished when the `'drain'` event fires, **not** when
   `add()` returns. `add()` merely enqueues tasks.

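One way to make rule 1 harder to violate is to wrap each callback so that `done()` runs on every path, including when the handler throws. The sketch below is an assumption rather than part of the `crawler` API: `withDone` is a hypothetical helper name, and it only fits handlers that finish synchronously.

```js
// Hypothetical helper (not part of the crawler package): wraps a synchronous
// handler so that done() is called exactly once on every code path.
function withDone(handler) {
  return (error, res, done) => {
    try {
      handler(error, res);
    } catch (err) {
      // A throwing handler must not leave the connection slot held.
      console.error("handler failed:", err.message);
    } finally {
      done();
    }
  };
}

// Usage sketch: pass the wrapped function wherever a queue callback is
// expected, e.g. new Crawler({ callback }).
const callback = withDone((error, res) => {
  if (error) throw error; // even this path still reaches done()
  console.log(res.body.length);
});
```

A handler that does asynchronous work (for example, awaiting a database write) would need a variant that calls `done()` only after that work settles.
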
## Rate limiting and retries

```js
const c = new Crawler({
  rateLimit: 1000,     // minimum gap between requests >=1000ms (forces maxConnections=1)
  retries: 2,          // default: 2
  retryInterval: 3000, // ms to wait before retrying
  timeout: 20000,      // request timeout in ms
  callback: (e, res, done) => { /* ... */ done(); },
});
```

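Because `rateLimit` serializes requests, it also puts a floor on how long a crawl can take. The sketch below is a back-of-the-envelope estimate, not a `crawler` API: `minCrawlDurationMs` is a hypothetical name, and the figure ignores response time, retries, and any additional rate limiters.

```js
// Rough lower bound on wall-clock time for tasks sharing one rate limiter.
// Assumes each request starts at least rateLimitMs after the previous one.
function minCrawlDurationMs(taskCount, rateLimitMs) {
  if (taskCount <= 1) return 0;
  return (taskCount - 1) * rateLimitMs;
}

// 10,000 pages at one request per second cannot finish in under ~2.8 hours.
const floorMs = minCrawlDurationMs(10_000, 1000);
console.log(`${Math.round(floorMs / 60_000)} minutes at minimum`);
```
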
## Data extraction with Cheerio

Cheerio is enabled by default. Use jQuery selectors to extract data:

```js
callback: (e, res, done) => {
  const $ = res.$;
  const titles = $("h2.title").map((i, el) => $(el).text().trim()).get();
  const links  = $("a").map((i, el) => $(el).attr("href")).get();
  done();
}
```

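The `href` values collected this way are raw attribute strings, so they are often relative, duplicated, or not HTTP links at all. Before feeding them back into `add()`, it can help to normalize them. The sketch below uses only the standard `URL` class; `normalizeLinks` is a hypothetical helper, and the page URL passed as `baseUrl` is assumed to be known to the caller.

```js
// Hypothetical helper: turn raw href values into absolute, de-duplicated
// http(s) URLs, resolved against the URL of the page they came from.
function normalizeLinks(hrefs, baseUrl) {
  const seen = new Set();
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, baseUrl);
    } catch {
      continue; // skip hrefs that cannot be parsed
    }
    if (url.protocol !== "http:" && url.protocol !== "https:") continue;
    url.hash = ""; // fragments point at the same document
    seen.add(url.href);
  }
  return [...seen];
}

const normalized = normalizeLinks(
  ["/about", "page2.html#top", "mailto:hi@example.com", "https://example.com/about"],
  "https://example.com/blog/index.html",
);
console.log(normalized);
// → [ 'https://example.com/about', 'https://example.com/blog/page2.html' ]
```
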
## Binary file download

```js
import fs from "fs";
const c = new Crawler({
  encoding: null,     // keep body as Buffer
  jQuery: false,      // skip Cheerio parsing
  callback: (err, res, done) => {
    if (!err) fs.writeFileSync(res.options.userParams.filename, res.body);
    done();
  },
});
c.add({ url: "https://host/file.png", userParams: { filename: "file.png" } });
```

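Resuming an interrupted batch is left to the developer, using `userParams` and file existence checks. A minimal sketch of that idea follows; the helper names are hypothetical, and the existence check is injected so that real code can pass `fs.existsSync` while the example stays free of disk access.

```js
// Hypothetical helpers for resumable batch downloads; names are illustrative.

// Derive a local filename from the last path segment of the URL.
function filenameFromUrl(url) {
  const last = new URL(url).pathname.split("/").filter(Boolean).pop();
  return last || "index";
}

// Build task objects only for files that are not on disk yet.
// `exists` is a predicate such as fs.existsSync.
function pendingDownloads(urls, exists) {
  return urls
    .map((url) => ({ url, userParams: { filename: filenameFromUrl(url) } }))
    .filter((task) => !exists(task.userParams.filename));
}

// Simulate a previous run that already saved a.png.
const onDisk = new Set(["a.png"]);
const pending = pendingDownloads(
  ["https://host/img/a.png", "https://host/img/b.png"],
  (name) => onDisk.has(name),
);
console.log(pending.map((task) => task.url)); // → [ 'https://host/img/b.png' ]
```

Note that a file left half-written by a crash would count as finished under this check. Writing to a temporary name and renaming on success is one way around that.
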
## Proxy rotation

Use the `'schedule'` event for dynamic assignment (preferred over using the
`proxies` array):

```js
c.on("schedule", options => { options.proxy = "http://proxy:port"; });
c.on("request",  options => { options.searchParams = { t: Date.now() }; });
```

Different proxies can have different rate limiters:

```js
c.add({ url: "...", rateLimiterId: 1, proxy: "http://p1:port" });
c.add({ url: "...", rateLimiterId: 2, proxy: "http://p2:port" });
```

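When the proxy list is known up front, the per-proxy limiter pairing above can be generated rather than written by hand. The sketch below spreads URLs across proxies round-robin and reuses the slot index as `rateLimiterId`; `assignProxies` is a hypothetical helper, and the proxy addresses are placeholders.

```js
// Hypothetical helper: spread URLs across proxies round-robin and give each
// proxy its own rate limiter. Proxy URLs below are placeholders.
function assignProxies(urls, proxies) {
  if (proxies.length === 0) throw new Error("at least one proxy is required");
  return urls.map((url, i) => {
    const slot = i % proxies.length;
    return { url, proxy: proxies[slot], rateLimiterId: slot };
  });
}

const proxiedTasks = assignProxies(
  ["https://example.com/1", "https://example.com/2", "https://example.com/3"],
  ["http://p1:8080", "http://p2:8080"],
);
console.log(proxiedTasks.map((task) => task.proxy));
// → [ 'http://p1:8080', 'http://p2:8080', 'http://p1:8080' ]
```

The resulting objects have the same shape as the `add()` calls shown above, so they could be enqueued one by one or, assuming `add()` accepts an array of option objects as it does for URL strings, in a single call.
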
## HTTP/2

```js
c.add({ url: "https://...", http2: true, callback: (e, res, done) => { done(); } });
```

When debugging through an intercepting proxy such as Charles, or when the
target uses a self-signed certificate, add `rejectUnauthorized: false`.

## Passing context data

Use `userParams` to attach data; read it back in the callback via
`res.options.userParams`. Do **not** attach custom fields directly on the
options object.

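As an illustration of that round trip, the sketch below builds paginated tasks that each carry a category and page number, then reads them back the way a callback would. The function names and the `example.com` URLs are hypothetical, and the `res` object is a stand-in that mimics only the `options.userParams` shape described above.

```js
// Hypothetical example: carry a category and page number with each task.
function buildTasks(category, pageCount) {
  const tasks = [];
  for (let page = 1; page <= pageCount; page += 1) {
    tasks.push({
      url: `https://example.com/${category}?page=${page}`, // placeholder URL
      userParams: { category, page },
    });
  }
  return tasks;
}

// Inside a callback, the same data is read from res.options.userParams.
function describeTask(res) {
  const { category, page } = res.options.userParams;
  return `${category} page ${page}`;
}

const pageTasks = buildTasks("books", 2);
// Stand-in for the response object a real callback would receive.
console.log(describeTask({ options: pageTasks[1] })); // → books page 2
```
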
## Gotchas

- ❌ Forgetting `done()` in the `if (error)` branch → crawler deadlocks
- ❌ Writing `console.log("done")` right after `add()` → listen for `'drain'` instead
- ❌ Setting `maxConnections > 1` when `rateLimit > 0` → it gets overridden to 1
- ❌ Expecting `send()` to trigger `preRequest` or `'request'` → `send()` bypasses all queue mechanics
- ❌ POST form data via `body` → v2 requires `form`
- ❌ Binary download without `encoding: null` → corrupt output

## Options reference

The complete options table is in [references/options.md](references/options.md).

## Code examples

Full, runnable examples for every scenario are in [references/examples.md](references/examples.md):

- Basic queue crawling
- Rate limiting
- Cheerio data extraction
- Binary download
- Direct requests
- HTTP/2
- Proxy rotation
- preRequest hooks
- Full spider (pagination + extraction + following links)
Usage Guidance

Before installing, treat it as a coding reference for large crawling jobs: review the npm package you install, avoid crawling sites without permission, keep rate limits reasonable, and be careful with examples that disable TLS verification or write downloaded files.

Capability Assessment

✓ Purpose & Capability

The documented capabilities are bulk crawling, rate limiting, retries, proxy use, Cheerio parsing, and file downloads, all matching the stated web-crawler purpose.

✓ Instruction Scope

Instructions are scoped to when to use the crawler, when not to use it, and how to call its API; no prompt overrides, hidden commands, or unrelated agent instructions were found.

ℹ Install Mechanism

It asks users to install the external npm package `crawler`, which is disclosed and purpose-aligned but still depends on normal npm package trust.

ℹ Credentials

Network access, proxy configuration, and local output writes are expected for large-scale crawling, but users should ensure they have permission to crawl target sites and set conservative rate limits.

✓ Persistence & Privilege

No autonomous persistence, privilege escalation, credential harvesting, or background service installation is present; long-running behavior is limited to user-created crawler processes.

How to Use

1. Make sure OpenClaw is installed (local or Docker).
2. Run the install command in chat: /install node-crawler
3. After installation, invoke the skill by name or use /node-crawler
4. Provide the required inputs per the skill's parameter spec and get structured output.

Version History

v1.0.0

node-crawler v1.0.0

- Initial release of a robust Node.js crawler for large-scale, production use.
- Supports bulk scraping, batch downloads, multi-step spiders, connection pooling, proxy rotation, rate limiting, and automatic retries.
- Pure ESM; requires Node.js 22+. Install via `npm install crawler`.
- Integrates server-side Cheerio (jQuery selectors) out of the box.
- Provides configurable queueing, pool sizes, and event-driven API (`add()`, `drain`, `schedule`, etc.).
- Not suitable for single requests or JavaScript-heavy pages; intended for batch/multi-page crawling scenarios.

Metadata

Slug node-crawler

Version 1.0.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is Web Crawler?

Node.js web crawler for production-grade, large-scale tasks — NOT for simple one-off requests. Use only when: bulk scraping, batch downloading, multi-page cr... It is an AI Agent Skill for Claude Code / OpenClaw, with 44 downloads so far.

How do I install Web Crawler?

Run "/install node-crawler" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Web Crawler free?

Yes, Web Crawler is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Web Crawler support?

Web Crawler is cross-platform and runs anywhere OpenClaw or Claude Code is available.

Who created Web Crawler?

It is built and maintained by Mike Chen (@mike442144); the current version is v1.0.0.