← 返回 Skills 市场

Web Retrieval

Name: Web Retrieval
Author: seanford

作者 Sean Ford · GitHub ↗ · v1.0.0 · MIT-0

cross-platform ✓ 安全检测通过

总下载

当前安装

版本数

在 OpenClaw 中安装

/install web-retrieval

功能描述

Expert web fetching and crawling using Scrapling. Use for any web_fetch task, JS-rendered pages, anti-bot sites, bulk URL fetching, or site crawling. Preferr...

使用说明 (SKILL.md)

Web Retrieval — Scrapling Expert

Local web fetching via Scrapling. Three fetchers, two scripts, one job: get the page content reliably.

Fetcher Selection Guide

Always start with the cheapest fetcher that works. Escalate only if needed.

Fetcher	CLI mode	Speed	Use when
`Fetcher` (curl_cffi)	`get`	Fast (~1s)	Static HTML, APIs, most public pages. Impersonates Chrome.
`DynamicFetcher`	`fetch`	Medium (~5s)	JS-rendered pages, SPAs, pages that need browser execution
`StealthyFetcher`	`stealthy`	Slow (~10s)	Cloudflare, heavy anti-bot, fingerprint detection

Decision tree:

Try get first — it handles 80% of pages
If content is just a title or empty → escalate to fetch
If blocked/Cloudflare detected → escalate to stealthy
If still blocked → add --solve-cloudflare and/or --wait 3000

Fetch Script

FETCH="python3 $SKILL_DIR/scripts/fetch"

# Basic fetch (auto-escalates through modes)
$FETCH https://example.com

# Force specific mode
$FETCH https://example.com --mode stealthy

# Extract specific content with CSS selector
$FETCH https://example.com -s "article.main-content"

# Wait for JS-rendered content
$FETCH https://spa.example.com --mode fetch --wait 3000 --wait-selector ".content"

# Save to file
$FETCH https://example.com /tmp/output.md

# Plain text output
$FETCH https://example.com --text

# Raw HTML (for link extraction, parsing)
$FETCH https://example.com --html

# Cloudflare bypass
$FETCH https://protected.example.com --mode stealthy --solve-cloudflare

# Fast (no images/fonts/media)
$FETCH https://example.com --no-resources

# Network idle wait (good for dashboards)
$FETCH https://example.com --mode fetch --network-idle

Crawl Script

CRAWL="python3 $SKILL_DIR/scripts/crawl"

# Fetch a flat list of URLs from file → one .md per URL in output dir
$CRAWL --urls-file /tmp/urls.txt --output-dir /tmp/results/

# Fetch list → single JSON file
$CRAWL --urls-file /tmp/urls.txt --output-json /tmp/results.json

# Crawl a site 2 levels deep (same domain only)
$CRAWL https://docs.example.com --depth 2 --output-dir /tmp/docs/

# Spider with checkpoint (resume if interrupted)
$CRAWL https://large-site.com --depth 3 --checkpoint-dir /tmp/checkpoint/

# Crawl with URL filter (only pages matching pattern)
$CRAWL https://docs.openclaw.ai --depth 2 --allowed-pattern "/docs/" --output-dir /tmp/

# Use stealth mode for crawl
$CRAWL --urls-file /tmp/urls.txt --mode stealthy --output-json /tmp/out.json

Python API (for sub-agents / scripts)

from scrapling import Fetcher, StealthyFetcher, DynamicFetcher

# Static fetch with browser impersonation
page = Fetcher().get("https://example.com", stealthy_headers=True)
text = page.get_all_text(ignore_tags=("script", "style"))
links = [a.attrib.get("href") for a in page.css("a[href]")]
title = page.css_first("h1").text

# Dynamic (JS-rendered)
async with DynamicFetcher() as f:
    page = await f.async_fetch("https://spa.example.com")

# Stealthy
page = StealthyFetcher().fetch("https://cloudflare-site.com", wait=2000)

# CSS selector extraction
results = page.css("div.article-body p")  # returns list of elements
first = page.css_first("h1").text

# Response properties
page.status  # HTTP status
page.url     # final URL (after redirects)
page.html    # raw HTML string
page.find("div", {"class": "content"})  # BeautifulSoup-style

Scrapling Spider (site crawl with full control)

from scrapling.spiders import Spider, Request
from scrapling import Fetcher

class DocsCrawler(Spider):
    start_urls = ["https://docs.example.com"]

    async def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    async def parse(self, response):
        # Yield scraped data
        yield {
            "url": response.url,
            "title": response.css_first("h1").text if response.css_first("h1") else "",
            "body": response.get_all_text(ignore_tags=("script", "style")),
        }
        # Follow links
        for link in response.css("a[href]"):
            href = link.attrib.get("href", "")
            if href.startswith("/") or "docs.example.com" in href:
                yield Request(response.urljoin(href), callback=self.parse)

# Run (checkpoint-enabled)
spider = DocsCrawler(crawldir="/tmp/crawl-checkpoint/")
result = spider.start()
print(f"Scraped {result.stats.items_scraped} pages")
items = list(result.items)

Key Scrapling CSS/Response Methods

Method	Description
`page.css("selector")`	All matching elements
`page.css_first("selector")`	First match or None
`el.text`	Text content of element
`el.attrib["href"]`	Attribute value
`page.get_all_text()`	Full page text (strips scripts/styles)
`page.html`	Raw HTML
`page.find(tag, attrs)`	BeautifulSoup-style find
`page.urljoin(href)`	Resolve relative URL
`response.url`	Final URL after redirects
`response.status`	HTTP status code

Output Formats

All CLI commands support three output formats via file extension:

.md — Markdown (default, best for LLM consumption)
.txt — Plain text
.html — Raw HTML (use for link extraction or further parsing)

Tips

JS pages with lazy loading: use --wait 2000 + --network-idle
Dynamic content: --wait-selector ".target-class" waits until element appears
Rate limiting: add --wait 1000 between requests in crawl mode
Cloudflare 403/503: --mode stealthy --solve-cloudflare
Missing content: try --mode fetch --network-idle before escalating to stealthy
CSS selectors: use for targeted extraction to reduce noise in output
Deep research crawls: use --checkpoint-dir so crawl survives interruption

安全使用建议

Install this only if you want agents to make direct web requests and optionally crawl sites from your machine. Use it only for URLs and domains you are allowed to access, be careful with stealth or Cloudflare-bypass modes, and choose output paths that will not overwrite important files or retain sensitive fetched content longer than intended.

能力评估

✓ Purpose & Capability

The stated purpose is reliable web retrieval with static, dynamic, stealth, and crawl modes, and the scripts implement URL fetching, selector extraction, site crawling, and result export consistent with that purpose.

ℹ Instruction Scope

The instructions broadly prefer this skill over web_fetch and include stealth and Cloudflare-bypass options; those capabilities are disclosed and purpose-aligned, but users should apply site authorization, compliance, and rate-limit judgment.

✓ Install Mechanism

The package contains documentation and two non-executable Python script files, with no install-time execution, startup hook, package installer, or hidden setup behavior observed.

ℹ Credentials

Outbound network requests, browser-like fetching, optional proxy use, and same-domain crawling are proportionate for a web retrieval tool, but they can reveal access patterns to target sites.

ℹ Persistence & Privilege

The scripts can write fetched content to user-specified files, directories, JSON outputs, and temporary files; this is disclosed, user-directed, and not paired with privilege escalation or background persistence.

如何使用

确保已安装 OpenClaw（本地或 Docker 部署）
在对话框中输入安装命令：/install web-retrieval
安装完成后，直接呼叫该 Skill 的名称或使用 /web-retrieval 触发
根据 Skill 的参数说明提供必要输入，即可获得结构化输出

版本历史

v1.0.0

- Initial release of web-retrieval skill powered by Scrapling. - Supports static, dynamic (JS-rendered), and stealthy web fetching modes for robust page retrieval. - Provides scripts for single-page fetches and multi-URL/site crawling with checkpoint and resume. - CLI, Python API, and Spider interfaces available with documentation and usage tips. - Improved support for anti-bot sites and Cloudflare bypass with stealth mode.

元数据

Slug web-retrieval

版本 1.0.0

许可证 MIT-0

累计安装 0

当前安装数 0

历史版本数 1

常见问题

Web Retrieval 是什么？

Expert web fetching and crawling using Scrapling. Use for any web_fetch task, JS-rendered pages, anti-bot sites, bulk URL fetching, or site crawling. Preferr... 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件，目前累计下载 60 次。

如何安装 Web Retrieval？

在 OpenClaw 或 Claude Code 对话框中运行命令「/install web-retrieval」即可一键安装，无需额外配置。

Web Retrieval 是免费的吗？

是的，Web Retrieval 完全免费，采用 MIT-0 许可证，可自由下载、安装和使用。

Web Retrieval 支持哪些平台？

Web Retrieval 跨平台运行，可在任意部署了 OpenClaw / Claude Code 的环境中使用（cross-platform）。

谁开发了 Web Retrieval？

由 Sean Ford（@seanford）开发并维护，当前版本 v1.0.0。