← 返回 Skills 市场
seanford

Web Retrieval

作者 Sean Ford · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ✓ 安全检测通过
60
总下载
0
收藏
0
当前安装
1
版本数
在 OpenClaw 中安装
/install web-retrieval
功能描述
Expert web fetching and crawling using Scrapling. Use for any web_fetch task, JS-rendered pages, anti-bot sites, bulk URL fetching, or site crawling. Preferr...
使用说明 (SKILL.md)

Web Retrieval — Scrapling Expert

Local web fetching via Scrapling. Three fetchers, two scripts, one job: get the page content reliably.

Fetcher Selection Guide

Always start with the cheapest fetcher that works. Escalate only if needed.

Fetcher CLI mode Speed Use when
Fetcher (curl_cffi) get Fast (~1s) Static HTML, APIs, most public pages. Impersonates Chrome.
DynamicFetcher fetch Medium (~5s) JS-rendered pages, SPAs, pages that need browser execution
StealthyFetcher stealthy Slow (~10s) Cloudflare, heavy anti-bot, fingerprint detection

Decision tree:

  1. Try get first — it handles 80% of pages
  2. If content is just a title or empty → escalate to fetch
  3. If blocked/Cloudflare detected → escalate to stealthy
  4. If still blocked → add --solve-cloudflare and/or --wait 3000

Fetch Script

FETCH="python3 $SKILL_DIR/scripts/fetch"

# Basic fetch (auto-escalates through modes)
$FETCH https://example.com

# Force specific mode
$FETCH https://example.com --mode stealthy

# Extract specific content with CSS selector
$FETCH https://example.com -s "article.main-content"

# Wait for JS-rendered content
$FETCH https://spa.example.com --mode fetch --wait 3000 --wait-selector ".content"

# Save to file
$FETCH https://example.com /tmp/output.md

# Plain text output
$FETCH https://example.com --text

# Raw HTML (for link extraction, parsing)
$FETCH https://example.com --html

# Cloudflare bypass
$FETCH https://protected.example.com --mode stealthy --solve-cloudflare

# Fast (no images/fonts/media)
$FETCH https://example.com --no-resources

# Network idle wait (good for dashboards)
$FETCH https://example.com --mode fetch --network-idle

Crawl Script

CRAWL="python3 $SKILL_DIR/scripts/crawl"

# Fetch a flat list of URLs from file → one .md per URL in output dir
$CRAWL --urls-file /tmp/urls.txt --output-dir /tmp/results/

# Fetch list → single JSON file
$CRAWL --urls-file /tmp/urls.txt --output-json /tmp/results.json

# Crawl a site 2 levels deep (same domain only)
$CRAWL https://docs.example.com --depth 2 --output-dir /tmp/docs/

# Spider with checkpoint (resume if interrupted)
$CRAWL https://large-site.com --depth 3 --checkpoint-dir /tmp/checkpoint/

# Crawl with URL filter (only pages matching pattern)
$CRAWL https://docs.openclaw.ai --depth 2 --allowed-pattern "/docs/" --output-dir /tmp/

# Use stealth mode for crawl
$CRAWL --urls-file /tmp/urls.txt --mode stealthy --output-json /tmp/out.json

Python API (for sub-agents / scripts)

from scrapling import Fetcher, StealthyFetcher, DynamicFetcher

# Static fetch with browser impersonation
page = Fetcher().get("https://example.com", stealthy_headers=True)
text = page.get_all_text(ignore_tags=("script", "style"))
links = [a.attrib.get("href") for a in page.css("a[href]")]
title = page.css_first("h1").text

# Dynamic (JS-rendered)
async with DynamicFetcher() as f:
    page = await f.async_fetch("https://spa.example.com")

# Stealthy
page = StealthyFetcher().fetch("https://cloudflare-site.com", wait=2000)

# CSS selector extraction
results = page.css("div.article-body p")  # returns list of elements
first = page.css_first("h1").text

# Response properties
page.status  # HTTP status
page.url     # final URL (after redirects)
page.html    # raw HTML string
page.find("div", {"class": "content"})  # BeautifulSoup-style

Scrapling Spider (site crawl with full control)

from scrapling.spiders import Spider, Request
from scrapling import Fetcher

class DocsCrawler(Spider):
    start_urls = ["https://docs.example.com"]

    async def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    async def parse(self, response):
        # Yield scraped data
        yield {
            "url": response.url,
            "title": response.css_first("h1").text if response.css_first("h1") else "",
            "body": response.get_all_text(ignore_tags=("script", "style")),
        }
        # Follow links
        for link in response.css("a[href]"):
            href = link.attrib.get("href", "")
            if href.startswith("/") or "docs.example.com" in href:
                yield Request(response.urljoin(href), callback=self.parse)

# Run (checkpoint-enabled)
spider = DocsCrawler(crawldir="/tmp/crawl-checkpoint/")
result = spider.start()
print(f"Scraped {result.stats.items_scraped} pages")
items = list(result.items)

Key Scrapling CSS/Response Methods

Method Description
page.css("selector") All matching elements
page.css_first("selector") First match or None
el.text Text content of element
el.attrib["href"] Attribute value
page.get_all_text() Full page text (strips scripts/styles)
page.html Raw HTML
page.find(tag, attrs) BeautifulSoup-style find
page.urljoin(href) Resolve relative URL
response.url Final URL after redirects
response.status HTTP status code

Output Formats

All CLI commands support three output formats via file extension:

  • .md — Markdown (default, best for LLM consumption)
  • .txt — Plain text
  • .html — Raw HTML (use for link extraction or further parsing)

Tips

  • JS pages with lazy loading: use --wait 2000 + --network-idle
  • Dynamic content: --wait-selector ".target-class" waits until element appears
  • Rate limiting: add --wait 1000 between requests in crawl mode
  • Cloudflare 403/503: --mode stealthy --solve-cloudflare
  • Missing content: try --mode fetch --network-idle before escalating to stealthy
  • CSS selectors: use for targeted extraction to reduce noise in output
  • Deep research crawls: use --checkpoint-dir so crawl survives interruption
安全使用建议
Install this only if you want agents to make direct web requests and optionally crawl sites from your machine. Use it only for URLs and domains you are allowed to access, be careful with stealth or Cloudflare-bypass modes, and choose output paths that will not overwrite important files or retain sensitive fetched content longer than intended.
能力评估
Purpose & Capability
The stated purpose is reliable web retrieval with static, dynamic, stealth, and crawl modes, and the scripts implement URL fetching, selector extraction, site crawling, and result export consistent with that purpose.
Instruction Scope
The instructions broadly prefer this skill over web_fetch and include stealth and Cloudflare-bypass options; those capabilities are disclosed and purpose-aligned, but users should apply site authorization, compliance, and rate-limit judgment.
Install Mechanism
The package contains documentation and two non-executable Python script files, with no install-time execution, startup hook, package installer, or hidden setup behavior observed.
Credentials
Outbound network requests, browser-like fetching, optional proxy use, and same-domain crawling are proportionate for a web retrieval tool, but they can reveal access patterns to target sites.
Persistence & Privilege
The scripts can write fetched content to user-specified files, directories, JSON outputs, and temporary files; this is disclosed, user-directed, and not paired with privilege escalation or background persistence.
如何使用
  1. 确保已安装 OpenClaw(本地或 Docker 部署)
  2. 在对话框中输入安装命令:/install web-retrieval
  3. 安装完成后,直接呼叫该 Skill 的名称或使用 /web-retrieval 触发
  4. 根据 Skill 的参数说明提供必要输入,即可获得结构化输出
版本历史
v1.0.0
- Initial release of web-retrieval skill powered by Scrapling. - Supports static, dynamic (JS-rendered), and stealthy web fetching modes for robust page retrieval. - Provides scripts for single-page fetches and multi-URL/site crawling with checkpoint and resume. - CLI, Python API, and Spider interfaces available with documentation and usage tips. - Improved support for anti-bot sites and Cloudflare bypass with stealth mode.
元数据
Slug web-retrieval
版本 1.0.0
许可证 MIT-0
累计安装 0
当前安装数 0
历史版本数 1
常见问题

Web Retrieval 是什么?

Expert web fetching and crawling using Scrapling. Use for any web_fetch task, JS-rendered pages, anti-bot sites, bulk URL fetching, or site crawling. Preferr... 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件,目前累计下载 60 次。

如何安装 Web Retrieval?

在 OpenClaw 或 Claude Code 对话框中运行命令「/install web-retrieval」即可一键安装,无需额外配置。

Web Retrieval 是免费的吗?

是的,Web Retrieval 完全免费,采用 MIT-0 许可证,可自由下载、安装和使用。

Web Retrieval 支持哪些平台?

Web Retrieval 跨平台运行,可在任意部署了 OpenClaw / Claude Code 的环境中使用(cross-platform)。

谁开发了 Web Retrieval?

由 Sean Ford(@seanford)开发并维护,当前版本 v1.0.0。

💬 留言讨论