← 返回 Skills 市场
60
总下载
0
收藏
0
当前安装
1
版本数
在 OpenClaw 中安装
/install web-retrieval
功能描述
Expert web fetching and crawling using Scrapling. Use for any web_fetch task, JS-rendered pages, anti-bot sites, bulk URL fetching, or site crawling. Preferr...
使用说明 (SKILL.md)
Web Retrieval — Scrapling Expert
Local web fetching via Scrapling. Three fetchers, two scripts, one job: get the page content reliably.
Fetcher Selection Guide
Always start with the cheapest fetcher that works. Escalate only if needed.
| Fetcher | CLI mode | Speed | Use when |
|---|---|---|---|
Fetcher (curl_cffi) |
get |
Fast (~1s) | Static HTML, APIs, most public pages. Impersonates Chrome. |
DynamicFetcher |
fetch |
Medium (~5s) | JS-rendered pages, SPAs, pages that need browser execution |
StealthyFetcher |
stealthy |
Slow (~10s) | Cloudflare, heavy anti-bot, fingerprint detection |
Decision tree:
- Try
getfirst — it handles 80% of pages - If content is just a title or empty → escalate to
fetch - If blocked/Cloudflare detected → escalate to
stealthy - If still blocked → add
--solve-cloudflareand/or--wait 3000
Fetch Script
FETCH="python3 $SKILL_DIR/scripts/fetch"
# Basic fetch (auto-escalates through modes)
$FETCH https://example.com
# Force specific mode
$FETCH https://example.com --mode stealthy
# Extract specific content with CSS selector
$FETCH https://example.com -s "article.main-content"
# Wait for JS-rendered content
$FETCH https://spa.example.com --mode fetch --wait 3000 --wait-selector ".content"
# Save to file
$FETCH https://example.com /tmp/output.md
# Plain text output
$FETCH https://example.com --text
# Raw HTML (for link extraction, parsing)
$FETCH https://example.com --html
# Cloudflare bypass
$FETCH https://protected.example.com --mode stealthy --solve-cloudflare
# Fast (no images/fonts/media)
$FETCH https://example.com --no-resources
# Network idle wait (good for dashboards)
$FETCH https://example.com --mode fetch --network-idle
Crawl Script
CRAWL="python3 $SKILL_DIR/scripts/crawl"
# Fetch a flat list of URLs from file → one .md per URL in output dir
$CRAWL --urls-file /tmp/urls.txt --output-dir /tmp/results/
# Fetch list → single JSON file
$CRAWL --urls-file /tmp/urls.txt --output-json /tmp/results.json
# Crawl a site 2 levels deep (same domain only)
$CRAWL https://docs.example.com --depth 2 --output-dir /tmp/docs/
# Spider with checkpoint (resume if interrupted)
$CRAWL https://large-site.com --depth 3 --checkpoint-dir /tmp/checkpoint/
# Crawl with URL filter (only pages matching pattern)
$CRAWL https://docs.openclaw.ai --depth 2 --allowed-pattern "/docs/" --output-dir /tmp/
# Use stealth mode for crawl
$CRAWL --urls-file /tmp/urls.txt --mode stealthy --output-json /tmp/out.json
Python API (for sub-agents / scripts)
from scrapling import Fetcher, StealthyFetcher, DynamicFetcher
# Static fetch with browser impersonation
page = Fetcher().get("https://example.com", stealthy_headers=True)
text = page.get_all_text(ignore_tags=("script", "style"))
links = [a.attrib.get("href") for a in page.css("a[href]")]
title = page.css_first("h1").text
# Dynamic (JS-rendered)
async with DynamicFetcher() as f:
page = await f.async_fetch("https://spa.example.com")
# Stealthy
page = StealthyFetcher().fetch("https://cloudflare-site.com", wait=2000)
# CSS selector extraction
results = page.css("div.article-body p") # returns list of elements
first = page.css_first("h1").text
# Response properties
page.status # HTTP status
page.url # final URL (after redirects)
page.html # raw HTML string
page.find("div", {"class": "content"}) # BeautifulSoup-style
Scrapling Spider (site crawl with full control)
from scrapling.spiders import Spider, Request
from scrapling import Fetcher
class DocsCrawler(Spider):
start_urls = ["https://docs.example.com"]
async def start_requests(self):
for url in self.start_urls:
yield Request(url, callback=self.parse)
async def parse(self, response):
# Yield scraped data
yield {
"url": response.url,
"title": response.css_first("h1").text if response.css_first("h1") else "",
"body": response.get_all_text(ignore_tags=("script", "style")),
}
# Follow links
for link in response.css("a[href]"):
href = link.attrib.get("href", "")
if href.startswith("/") or "docs.example.com" in href:
yield Request(response.urljoin(href), callback=self.parse)
# Run (checkpoint-enabled)
spider = DocsCrawler(crawldir="/tmp/crawl-checkpoint/")
result = spider.start()
print(f"Scraped {result.stats.items_scraped} pages")
items = list(result.items)
Key Scrapling CSS/Response Methods
| Method | Description |
|---|---|
page.css("selector") |
All matching elements |
page.css_first("selector") |
First match or None |
el.text |
Text content of element |
el.attrib["href"] |
Attribute value |
page.get_all_text() |
Full page text (strips scripts/styles) |
page.html |
Raw HTML |
page.find(tag, attrs) |
BeautifulSoup-style find |
page.urljoin(href) |
Resolve relative URL |
response.url |
Final URL after redirects |
response.status |
HTTP status code |
Output Formats
All CLI commands support three output formats via file extension:
.md— Markdown (default, best for LLM consumption).txt— Plain text.html— Raw HTML (use for link extraction or further parsing)
Tips
- JS pages with lazy loading: use
--wait 2000+--network-idle - Dynamic content:
--wait-selector ".target-class"waits until element appears - Rate limiting: add
--wait 1000between requests in crawl mode - Cloudflare 403/503:
--mode stealthy --solve-cloudflare - Missing content: try
--mode fetch --network-idlebefore escalating to stealthy - CSS selectors: use for targeted extraction to reduce noise in output
- Deep research crawls: use
--checkpoint-dirso crawl survives interruption
安全使用建议
Install this only if you want agents to make direct web requests and optionally crawl sites from your machine. Use it only for URLs and domains you are allowed to access, be careful with stealth or Cloudflare-bypass modes, and choose output paths that will not overwrite important files or retain sensitive fetched content longer than intended.
能力评估
Purpose & Capability
The stated purpose is reliable web retrieval with static, dynamic, stealth, and crawl modes, and the scripts implement URL fetching, selector extraction, site crawling, and result export consistent with that purpose.
Instruction Scope
The instructions broadly prefer this skill over web_fetch and include stealth and Cloudflare-bypass options; those capabilities are disclosed and purpose-aligned, but users should apply site authorization, compliance, and rate-limit judgment.
Install Mechanism
The package contains documentation and two non-executable Python script files, with no install-time execution, startup hook, package installer, or hidden setup behavior observed.
Credentials
Outbound network requests, browser-like fetching, optional proxy use, and same-domain crawling are proportionate for a web retrieval tool, but they can reveal access patterns to target sites.
Persistence & Privilege
The scripts can write fetched content to user-specified files, directories, JSON outputs, and temporary files; this is disclosed, user-directed, and not paired with privilege escalation or background persistence.
如何使用
- 确保已安装 OpenClaw(本地或 Docker 部署)
- 在对话框中输入安装命令:
/install web-retrieval - 安装完成后,直接呼叫该 Skill 的名称或使用
/web-retrieval触发 - 根据 Skill 的参数说明提供必要输入,即可获得结构化输出
版本历史
v1.0.0
- Initial release of web-retrieval skill powered by Scrapling.
- Supports static, dynamic (JS-rendered), and stealthy web fetching modes for robust page retrieval.
- Provides scripts for single-page fetches and multi-URL/site crawling with checkpoint and resume.
- CLI, Python API, and Spider interfaces available with documentation and usage tips.
- Improved support for anti-bot sites and Cloudflare bypass with stealth mode.
元数据
常见问题
Web Retrieval 是什么?
Expert web fetching and crawling using Scrapling. Use for any web_fetch task, JS-rendered pages, anti-bot sites, bulk URL fetching, or site crawling. Preferr... 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件,目前累计下载 60 次。
如何安装 Web Retrieval?
在 OpenClaw 或 Claude Code 对话框中运行命令「/install web-retrieval」即可一键安装,无需额外配置。
Web Retrieval 是免费的吗?
是的,Web Retrieval 完全免费,采用 MIT-0 许可证,可自由下载、安装和使用。
Web Retrieval 支持哪些平台?
Web Retrieval 跨平台运行,可在任意部署了 OpenClaw / Claude Code 的环境中使用(cross-platform)。
谁开发了 Web Retrieval?
由 Sean Ford(@seanford)开发并维护,当前版本 v1.0.0。
推荐 Skills