
Crawlee Web Scraper

Name: Crawlee Web Scraper
Author: bryantegomoh

Author: Bryan Tegomoh, MD, MPH · v1.0.0 · MIT-0

cross-platform ✓ Security check passed

Total downloads: 187

Install in OpenClaw

/install crawlee-web-scraper

Description

Resilient web scraper with bot-detection evasion using the Crawlee library. Use when web_fetch is blocked by rate limits or bot detection. Supports single UR...

Usage instructions (SKILL.md)

crawlee-web-scraper

Drop-in replacement for web_fetch when sites block automated requests. Crawlee handles session management, retry logic, and bot-detection evasion automatically.

Scripts

crawlee_fetch.py — main scraper; accepts a single URL or a file of URLs; returns JSON
crawlee_http.py — library helper; tries requests first, falls back to Crawlee on 403/429/503
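
The fallback behavior described for crawlee_http.py can be sketched roughly as follows. This is a minimal illustration, not the skill's actual code: `fetch_with_fallback_sketch`, `Response`, and the injected `primary` and `fallback` callables are hypothetical names, and only the 403/429/503 trigger list comes from the description above.

```python
from typing import Callable, NamedTuple

# Status codes that trigger the fallback, per the script description.
FALLBACK_STATUSES = frozenset({403, 429, 503})


class Response(NamedTuple):
    """Hypothetical stand-in for a response object (status code and body)."""
    status_code: int
    text: str


Fetcher = Callable[[str], Response]


def fetch_with_fallback_sketch(url: str, primary: Fetcher, fallback: Fetcher) -> Response:
    """Try the cheap fetcher first; use the fallback only on a blocking status."""
    resp = primary(url)
    if resp.status_code in FALLBACK_STATUSES:
        return fallback(url)
    return resp


# Fake fetchers keep the example free of network access.
def blocked(url: str) -> Response:
    return Response(403, "")


def crawlee_like(url: str) -> Response:
    return Response(200, "<html>ok</html>")


result = fetch_with_fallback_sketch("https://example.com", blocked, crawlee_like)
print(result.status_code)
```

Passing the two fetchers in as arguments keeps the decision logic testable without touching the network. The real helper presumably wires in requests and Crawlee directly.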

Usage

# Single URL, return HTML preview
python3 scripts/crawlee_fetch.py --url "https://example.com"

# Single URL, extract text (strips HTML tags)
python3 scripts/crawlee_fetch.py --url "https://example.com" --extract-text

# Bulk scrape from file
python3 scripts/crawlee_fetch.py --urls-file urls.txt --output results.json

Library usage

from crawlee_http import fetch_with_fallback

resp = fetch_with_fallback("https://example.com")
print(resp.status_code, resp.text[:500])

Output

JSON array with one object per URL:

[
  {
    "url": "https://example.com",
    "status": 200,
    "fetched_at": "2026-01-01T00:00:00Z",
    "length": 12345,
    "text": "Page content..."
  }
]
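
Downstream code can consume this array with the standard json module. The sketch below assumes the five field names shown in the example above. `summarize` is a hypothetical helper, the second sample record is made up for illustration, and treating any status outside 200-299 as a failure is an assumption rather than documented behavior.

```python
import json

# Sample payload: the first record mirrors the documented example,
# the second is invented here to illustrate a blocked fetch.
SAMPLE = """[
  {"url": "https://example.com", "status": 200,
   "fetched_at": "2026-01-01T00:00:00Z", "length": 12345,
   "text": "Page content..."},
  {"url": "https://example.org", "status": 403,
   "fetched_at": "2026-01-01T00:00:01Z", "length": 0, "text": ""}
]"""


def summarize(raw: str) -> dict:
    """Split records into succeeded and failed URLs and total the lengths."""
    records = json.loads(raw)
    ok = [r["url"] for r in records if 200 <= r["status"] < 300]
    failed = [r["url"] for r in records if not 200 <= r["status"] < 300]
    total_length = sum(r["length"] for r in records)
    return {"ok": ok, "failed": failed, "total_length": total_length}


summary = summarize(SAMPLE)
print(summary)
```

For a real run, the string passed to `summarize` would be the contents of the file written via --output (results.json in the bulk example).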

Installation

pip install crawlee requests

When to use

web_fetch returns 403 / 429 / empty
Bulk scraping 10+ URLs
Sites using Cloudflare or similar bot protection

Security recommendations

This skill appears to be what it says: a Crawlee-based fallback scraper. Before installing, be aware: (1) it requires 'pip install crawlee requests' — Crawlee may install or later download browser tooling (Playwright or similar) which can add network activity and disk artifacts; (2) the scripts will perform HTTP requests to any URL you provide (so don’t give it URLs containing secrets, credentials, or private tokens); (3) scraping sites may violate terms of service or legal rules—use responsibly; (4) the fallback uses a subprocess with a 30s timeout and caps extracted text (10k chars) — adjust if you need longer fetches. If you need stricter controls, run this in an isolated environment and audit installed Python packages (or pin package versions) before use.
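
To make point (4) concrete, here is a minimal sketch of a subprocess call with a timeout and a cap on captured text. The 30-second and 10,000-character values come from the advisory above. `run_capped` and the stand-in child command are hypothetical and do not reproduce the skill's code.

```python
import subprocess
import sys
from typing import List, Optional

TIMEOUT_SECONDS = 30     # timeout mentioned in the advisory
MAX_TEXT_CHARS = 10_000  # text cap mentioned in the advisory


def run_capped(cmd: List[str], timeout: float = TIMEOUT_SECONDS,
               limit: int = MAX_TEXT_CHARS) -> Optional[str]:
    """Run cmd; return at most `limit` characters of stdout, or None on timeout."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None
    return proc.stdout[:limit]


# Stand-in child process: prints 20,000 characters instead of scraping a page.
demo_cmd = [sys.executable, "-c", "print('x' * 20000, end='')"]
output = run_capped(demo_cmd)
print(len(output))
```

Raising either constant is the kind of adjustment the advisory has in mind for slow pages or long documents.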

Functional analysis

Type: OpenClaw Skill
Name: crawlee-web-scraper
Version: 1.0.0

The skill is a legitimate web scraping utility designed to bypass bot detection using the Crawlee library. It consists of a main scraper (crawlee_fetch.py) and a helper (crawlee_http.py) that provides a fallback mechanism from standard requests to Crawlee. The code uses safe subprocess execution (passing arguments as a list) and performs standard file and network operations consistent with its stated purpose without any signs of malicious intent or prompt injection.
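
Passing arguments as a list matters because no shell ever parses the URL. The sketch below illustrates the general technique with a stand-in child process that echoes its first argument. It is not the skill's code, and `hostile_url` is an invented example.

```python
import subprocess
import sys

# A URL containing shell metacharacters; under shell=True these could be
# interpreted as extra commands.
hostile_url = "https://example.com/?q=1;echo INJECTED && ls"

# Stand-in for scripts/crawlee_fetch.py: a child that echoes its first argument.
child = [sys.executable, "-c", "import sys; print(sys.argv[1], end='')"]

# With a list and no shell, the URL reaches the child as one literal argument.
proc = subprocess.run(child + [hostile_url], capture_output=True, text=True, check=True)
received = proc.stdout
print(received == hostile_url)
```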

Capability assessment

✓ Purpose & Capability

Name/description (Crawlee-based scraper) matches the delivered artifacts: two Python scripts that use requests and Crawlee to fetch pages and a SKILL.md describing exactly that. No unrelated credentials, binaries, or config paths are requested.

✓ Instruction Scope

SKILL.md and the scripts are specific and scoped: they document usage, install (pip install crawlee requests), and show that fetching is targeted at user-supplied URLs. The code only reads a provided URLs file, runs a subprocess to call the included script, and returns JSON. There are no instructions to read unrelated system files, environment variables, or to transmit data to unexpected remote endpoints.

ℹ Install Mechanism

There is no install spec beyond the SKILL.md recommendation 'pip install crawlee requests'. Using pip is expected for a Python library, but installing Crawlee may pull in additional runtime dependencies (Playwright/browser components), which can download browser binaries at install or first-run time. This is typical for headless-browser scrapers, but it means extra network traffic and disk usage.
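
One way to audit what pip actually installed is to query package metadata from the standard library. The sketch below is an assumption-laden illustration: `installed_versions` is a hypothetical helper, and playwright is listed only because the note above mentions it as a possible extra dependency.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["crawlee", "requests", "playwright"])
for name, version in report.items():
    # Lines in name==version form can be copied into a requirements file to pin.
    print(f"{name}=={version}" if version else f"{name}: not installed")
```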

✓ Credentials

The skill declares no required environment variables or credentials and the code does not read secrets or unrelated env vars. All requests are to user-provided target URLs, which is proportionate to a scraping tool.

✓ Persistence & Privilege

The skill does not request always: true and is user-invocable. It does not modify other skills or system-wide agent settings. Autonomous invocation is allowed by default, but it is not combined with any other red flags.

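How to use

Make sure OpenClaw is installed (locally or via Docker)
Enter the install command in the chat box: /install crawlee-web-scraper
Once installation finishes, call the Skill by name or trigger it with /crawlee-web-scraper
Provide the required inputs as described in the Skill's parameter documentation to get structured output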

Version history

v1.0.0

Initial release of crawlee-web-scraper.
- Provides resilient web scraping with evasion for bot detection and rate limits using Crawlee.
- Supports both single URLs and bulk file input for scraping.
- Implements automatic fallback: tries regular requests, then uses Crawlee on 403/429/503 errors.
- Returns standardized JSON output per URL with metadata and extracted content.
- Drop-in replacement for web_fetch, with simple command-line and Python library usage.

Metadata

Slug crawlee-web-scraper

Version 1.0.0

License MIT-0

Total installs 2

Current installs 1

Version count 1
