seanford

Web Retrieval

by Sean Ford · GitHub ↗ · v1.0.0 · MIT-0
cross-platform · ✓ Security Clean · 60 downloads · 0 stars · 0 active installs · 1 version
Install in OpenClaw
/install web-retrieval
Description
Expert web fetching and crawling using Scrapling. Use for any web_fetch task, JS-rendered pages, anti-bot sites, bulk URL fetching, or site crawling. Preferr...
README (SKILL.md)

Web Retrieval — Scrapling Expert

Local web fetching via Scrapling. Three fetchers, two scripts, one job: get the page content reliably.

Fetcher Selection Guide

Always start with the cheapest fetcher that works. Escalate only if needed.

| Fetcher | CLI mode | Speed | Use when |
| --- | --- | --- | --- |
| Fetcher (curl_cffi) | get | Fast (~1s) | Static HTML, APIs, most public pages. Impersonates Chrome. |
| DynamicFetcher | fetch | Medium (~5s) | JS-rendered pages, SPAs, pages that need browser execution |
| StealthyFetcher | stealthy | Slow (~10s) | Cloudflare, heavy anti-bot, fingerprint detection |

Decision tree:

  1. Try get first — it handles 80% of pages
  2. If content is just a title or empty → escalate to fetch
  3. If blocked/Cloudflare detected → escalate to stealthy
  4. If still blocked → add --solve-cloudflare and/or --wait 3000
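The decision tree above could be expressed in Python roughly as follows. This is a minimal sketch, not the skill's actual auto-escalation logic: `fetch_with_escalation`, `looks_empty`, the 200-character threshold, and the blocked-status set are all hypothetical, and each fetcher is passed in as a plain callable returning `(status, text)` so the sketch does not depend on Scrapling's API. For simplicity it moves to the next mode on either an empty page or a block, rather than distinguishing the two cases.

```python
from typing import Callable, Iterable, Optional, Tuple

# Each fetcher is a (mode_name, callable) pair; the callable takes a URL and
# returns (http_status, extracted_text). These are stand-ins, not Scrapling calls.
FetchFn = Callable[[str], Tuple[int, str]]

BLOCKED_STATUSES = {403, 503}  # typical anti-bot responses (assumption)


def looks_empty(text: str, min_chars: int = 200) -> bool:
    """Heuristic for 'just a title or empty'; the threshold is an assumption."""
    return len(text.strip()) < min_chars


def fetch_with_escalation(
    url: str, fetchers: Iterable[Tuple[str, FetchFn]]
) -> Optional[Tuple[str, str]]:
    """Try fetchers cheapest-first; return (mode, text) for the first usable result."""
    for mode, fetch in fetchers:
        status, text = fetch(url)
        if status in BLOCKED_STATUSES or looks_empty(text):
            continue  # escalate to the next, more expensive mode
        return mode, text
    return None  # every mode failed; the caller may retry with extra flags
```

Passing the fetchers in order (`get`, `fetch`, `stealthy`) keeps the cheapest-first policy in one place and makes the escalation easy to test with stubs.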

Fetch Script

FETCH="python3 $SKILL_DIR/scripts/fetch"

# Basic fetch (auto-escalates through modes)
$FETCH https://example.com

# Force specific mode
$FETCH https://example.com --mode stealthy

# Extract specific content with CSS selector
$FETCH https://example.com -s "article.main-content"

# Wait for JS-rendered content
$FETCH https://spa.example.com --mode fetch --wait 3000 --wait-selector ".content"

# Save to file
$FETCH https://example.com /tmp/output.md

# Plain text output
$FETCH https://example.com --text

# Raw HTML (for link extraction, parsing)
$FETCH https://example.com --html

# Cloudflare bypass
$FETCH https://protected.example.com --mode stealthy --solve-cloudflare

# Fast (no images/fonts/media)
$FETCH https://example.com --no-resources

# Network idle wait (good for dashboards)
$FETCH https://example.com --mode fetch --network-idle
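When the fetch script is driven from Python rather than a shell, one option is to assemble the argument list programmatically. The sketch below uses only flags that appear in the examples above; `build_fetch_cmd` and its parameter names are hypothetical, and the script path is whatever `$SKILL_DIR/scripts/fetch` resolves to on your machine.

```python
import shlex
from typing import List, Optional


def build_fetch_cmd(
    script: str,
    url: str,
    mode: Optional[str] = None,
    selector: Optional[str] = None,
    wait_ms: Optional[int] = None,
    output: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for the fetch script using flags shown above."""
    cmd = ["python3", script, url]
    if output:
        cmd.append(output)  # positional output path, as in "Save to file"
    if mode:
        cmd += ["--mode", mode]  # get | fetch | stealthy
    if selector:
        cmd += ["-s", selector]
    if wait_ms is not None:
        cmd += ["--wait", str(wait_ms)]
    return cmd


cmd = build_fetch_cmd(
    "scripts/fetch",
    "https://example.com",
    mode="stealthy",
    selector="article.main-content",
)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

The resulting list can be handed to `subprocess.run(cmd, capture_output=True, text=True)`; passing a list instead of a single string avoids shell quoting problems with selectors that contain spaces or brackets.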

Crawl Script

CRAWL="python3 $SKILL_DIR/scripts/crawl"

# Fetch a flat list of URLs from file → one .md per URL in output dir
$CRAWL --urls-file /tmp/urls.txt --output-dir /tmp/results/

# Fetch list → single JSON file
$CRAWL --urls-file /tmp/urls.txt --output-json /tmp/results.json

# Crawl a site 2 levels deep (same domain only)
$CRAWL https://docs.example.com --depth 2 --output-dir /tmp/docs/

# Spider with checkpoint (resume if interrupted)
$CRAWL https://large-site.com --depth 3 --checkpoint-dir /tmp/checkpoint/

# Crawl with URL filter (only pages matching pattern)
$CRAWL https://docs.openclaw.ai --depth 2 --allowed-pattern "/docs/" --output-dir /tmp/

# Use stealth mode for crawl
$CRAWL --urls-file /tmp/urls.txt --mode stealthy --output-json /tmp/out.json
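The `--urls-file` examples take a flat list of URLs. A small helper can prepare such a file from URLs gathered elsewhere. This sketch assumes the file holds one URL per line, which the source does not state explicitly; `write_urls_file` is a hypothetical name.

```python
from pathlib import Path
from typing import Iterable, List


def write_urls_file(urls: Iterable[str], path: str) -> List[str]:
    """Write unique http(s) URLs to `path`, one per line, preserving order."""
    seen = set()
    kept = []
    for raw in urls:
        url = raw.strip()
        if not url.startswith(("http://", "https://")) or url in seen:
            continue  # skip blanks, non-HTTP entries, and duplicates
        seen.add(url)
        kept.append(url)
    Path(path).write_text("\n".join(kept) + "\n", encoding="utf-8")
    return kept
```

Deduplicating before the crawl avoids fetching the same page twice, which matters most in the slower `stealthy` mode.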

Python API (for sub-agents / scripts)

from scrapling import Fetcher, StealthyFetcher, DynamicFetcher

# Static fetch with browser impersonation
page = Fetcher().get("https://example.com", stealthy_headers=True)
text = page.get_all_text(ignore_tags=("script", "style"))
links = [a.attrib.get("href") for a in page.css("a[href]")]
title = page.css_first("h1").text

# Dynamic (JS-rendered); async, so run this inside a coroutine (e.g. via asyncio.run)
async with DynamicFetcher() as f:
    page = await f.async_fetch("https://spa.example.com")

# Stealthy
page = StealthyFetcher().fetch("https://cloudflare-site.com", wait=2000)

# CSS selector extraction
results = page.css("div.article-body p")  # returns list of elements
first = page.css_first("h1").text

# Response properties
page.status  # HTTP status
page.url     # final URL (after redirects)
page.html    # raw HTML string
page.find("div", {"class": "content"})  # BeautifulSoup-style
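Because `css_first` returns `None` when nothing matches, an expression like `page.css_first("h1").text` raises `AttributeError` on pages without an `h1`. A small guard, sketched below, avoids that. `first_text` is a hypothetical helper; it relies only on the `css_first` method and `.text` attribute shown above.

```python
from typing import Any


def first_text(page: Any, selector: str, default: str = "") -> str:
    """Return the stripped text of the first match, or `default` if none."""
    element = page.css_first(selector)
    if element is None:
        return default  # no match: avoid AttributeError on None.text
    return str(element.text).strip()
```

With this in place, `title = first_text(page, "h1")` yields an empty string instead of an exception when the heading is missing.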

Scrapling Spider (site crawl with full control)

from scrapling.spiders import Spider, Request

class DocsCrawler(Spider):
    start_urls = ["https://docs.example.com"]

    async def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    async def parse(self, response):
        # Yield scraped data
        yield {
            "url": response.url,
            "title": response.css_first("h1").text if response.css_first("h1") else "",
            "body": response.get_all_text(ignore_tags=("script", "style")),
        }
        # Follow links
        for link in response.css("a[href]"):
            href = link.attrib.get("href", "")
            if href.startswith("/") or "docs.example.com" in href:
                yield Request(response.urljoin(href), callback=self.parse)

# Run (checkpoint-enabled)
spider = DocsCrawler(crawldir="/tmp/crawl-checkpoint/")
result = spider.start()
print(f"Scraped {result.stats.items_scraped} pages")
items = list(result.items)
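The link filter in `parse` uses a prefix and substring test, which also accepts protocol-relative links such as `//cdn.example.com/x` and any URL that merely contains `docs.example.com` in its query string. A stricter same-host check can be sketched with the standard library alone; `same_host_url` is a hypothetical helper, not part of Scrapling.

```python
from typing import Optional
from urllib.parse import urljoin, urlparse


def same_host_url(base_url: str, href: str) -> Optional[str]:
    """Resolve href against base_url; return it only if it stays on the same host."""
    absolute = urljoin(base_url, href)
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https"):
        return None  # skip mailto:, javascript:, and similar schemes
    if parsed.netloc != urlparse(base_url).netloc:
        return None  # different host, including look-alike URLs
    return absolute.split("#", 1)[0]  # drop fragments so anchors don't duplicate pages
```

Inside a spider callback, this could replace the `startswith`/`in` test: call `same_host_url(response.url, href)` and yield a `Request` only when the result is not `None`.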

Key Scrapling CSS/Response Methods

| Method | Description |
| --- | --- |
| page.css("selector") | All matching elements |
| page.css_first("selector") | First match or None |
| el.text | Text content of element |
| el.attrib["href"] | Attribute value |
| page.get_all_text() | Full page text (strips scripts/styles) |
| page.html | Raw HTML |
| page.find(tag, attrs) | BeautifulSoup-style find |
| page.urljoin(href) | Resolve relative URL |
| response.url | Final URL after redirects |
| response.status | HTTP status code |

Output Formats

All CLI commands support three output formats, selected by the output file's extension:

  • .md — Markdown (default, best for LLM consumption)
  • .txt — Plain text
  • .html — Raw HTML (use for link extraction or further parsing)
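Extension-based selection of this kind can be sketched as a simple lookup. This illustrates the idea only and is not the scripts' actual code: `format_for_path` and the format labels are hypothetical, and falling back to Markdown for unknown extensions is an assumption based on Markdown being listed as the default.

```python
from pathlib import Path

# Extension-to-format table taken from the list above; the format names are
# labels chosen for this sketch, not identifiers from the skill's scripts.
FORMATS = {".md": "markdown", ".txt": "text", ".html": "html"}


def format_for_path(path: str, default: str = "markdown") -> str:
    """Pick an output format from the file extension, case-insensitively."""
    return FORMATS.get(Path(path).suffix.lower(), default)
```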

Tips

  • JS pages with lazy loading: use --wait 2000 + --network-idle
  • Dynamic content: --wait-selector ".target-class" waits until element appears
  • Rate limiting: add --wait 1000 between requests in crawl mode
  • Cloudflare 403/503: --mode stealthy --solve-cloudflare
  • Missing content: try --mode fetch --network-idle before escalating to stealthy
  • CSS selectors: use for targeted extraction to reduce noise in output
  • Deep research crawls: use --checkpoint-dir so crawl survives interruption
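For code that calls the Python API directly instead of the crawl script, the rate-limiting tip can be approximated with a fixed pause between requests. This is a minimal sketch: `fetch_politely` is a hypothetical helper, the fetch function is injected as a plain callable, and a fixed delay is the simplest possible policy rather than anything the skill prescribes.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch URLs sequentially, pausing `delay_s` seconds between requests."""
    results: Dict[str, str] = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_s)  # no pause before the first request
        results[url] = fetch(url)
    return results
```

Injecting `sleep` keeps the helper testable without real waiting and leaves room to swap in jittered or adaptive delays later.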
Usage Guidance
Install this only if you want agents to make direct web requests and optionally crawl sites from your machine. Use it only for URLs and domains you are allowed to access, be careful with stealth or Cloudflare-bypass modes, and choose output paths that will not overwrite important files or retain sensitive fetched content longer than intended.
Capability Assessment
Purpose & Capability
The stated purpose is reliable web retrieval with static, dynamic, stealth, and crawl modes, and the scripts implement URL fetching, selector extraction, site crawling, and result export consistent with that purpose.
Instruction Scope
The instructions broadly prefer this skill over web_fetch and include stealth and Cloudflare-bypass options; those capabilities are disclosed and purpose-aligned, but users should apply site authorization, compliance, and rate-limit judgment.
Install Mechanism
The package contains documentation and two non-executable Python script files, with no install-time execution, startup hook, package installer, or hidden setup behavior observed.
Credentials
Outbound network requests, browser-like fetching, optional proxy use, and same-domain crawling are proportionate for a web retrieval tool, but they can reveal access patterns to target sites.
Persistence & Privilege
The scripts can write fetched content to user-specified files, directories, JSON outputs, and temporary files; this is disclosed, user-directed, and not paired with privilege escalation or background persistence.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install web-retrieval
  3. After installation, invoke the skill by name or use /web-retrieval
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
- Initial release of web-retrieval skill powered by Scrapling.
- Supports static, dynamic (JS-rendered), and stealthy web fetching modes for robust page retrieval.
- Provides scripts for single-page fetches and multi-URL/site crawling with checkpoint and resume.
- CLI, Python API, and Spider interfaces available with documentation and usage tips.
- Improved support for anti-bot sites and Cloudflare bypass with stealth mode.
Metadata
Slug web-retrieval
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is Web Retrieval?

Web Retrieval is an AI agent skill for Claude Code / OpenClaw that fetches and crawls web pages using Scrapling, including JS-rendered pages, anti-bot sites, bulk URL lists, and whole-site crawls. It has 60 downloads so far.

How do I install Web Retrieval?

Run "/install web-retrieval" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Web Retrieval free?

Yes, Web Retrieval is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Web Retrieval support?

Web Retrieval is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Web Retrieval?

It is built and maintained by Sean Ford (@seanford); the current version is v1.0.0.
