← Back to Skills Marketplace

Web Retrieval

Name: Web Retrieval
Author: seanford

by Sean Ford · GitHub ↗ · v1.0.0 · MIT-0

cross-platform ✓ Security Clean

Downloads

Stars

Active Installs

Versions

Install in OpenClaw

/install web-retrieval

Description

Expert web fetching and crawling using Scrapling. Use for any web_fetch task, JS-rendered pages, anti-bot sites, bulk URL fetching, or site crawling. Preferr...

README (SKILL.md)

Web Retrieval — Scrapling Expert

Local web fetching via Scrapling. Three fetchers, two scripts, one job: get the page content reliably.

Fetcher Selection Guide

Always start with the cheapest fetcher that works. Escalate only if needed.

Fetcher	CLI mode	Speed	Use when
`Fetcher` (curl_cffi)	`get`	Fast (~1s)	Static HTML, APIs, most public pages. Impersonates Chrome.
`DynamicFetcher`	`fetch`	Medium (~5s)	JS-rendered pages, SPAs, pages that need browser execution
`StealthyFetcher`	`stealthy`	Slow (~10s)	Cloudflare, heavy anti-bot, fingerprint detection

Decision tree:

Try get first — it handles 80% of pages
If content is just a title or empty → escalate to fetch
If blocked/Cloudflare detected → escalate to stealthy
If still blocked → add --solve-cloudflare and/or --wait 3000

Fetch Script

FETCH="python3 $SKILL_DIR/scripts/fetch"

# Basic fetch (auto-escalates through modes)
$FETCH https://example.com

# Force specific mode
$FETCH https://example.com --mode stealthy

# Extract specific content with CSS selector
$FETCH https://example.com -s "article.main-content"

# Wait for JS-rendered content
$FETCH https://spa.example.com --mode fetch --wait 3000 --wait-selector ".content"

# Save to file
$FETCH https://example.com /tmp/output.md

# Plain text output
$FETCH https://example.com --text

# Raw HTML (for link extraction, parsing)
$FETCH https://example.com --html

# Cloudflare bypass
$FETCH https://protected.example.com --mode stealthy --solve-cloudflare

# Fast (no images/fonts/media)
$FETCH https://example.com --no-resources

# Network idle wait (good for dashboards)
$FETCH https://example.com --mode fetch --network-idle

Crawl Script

CRAWL="python3 $SKILL_DIR/scripts/crawl"

# Fetch a flat list of URLs from file → one .md per URL in output dir
$CRAWL --urls-file /tmp/urls.txt --output-dir /tmp/results/

# Fetch list → single JSON file
$CRAWL --urls-file /tmp/urls.txt --output-json /tmp/results.json

# Crawl a site 2 levels deep (same domain only)
$CRAWL https://docs.example.com --depth 2 --output-dir /tmp/docs/

# Spider with checkpoint (resume if interrupted)
$CRAWL https://large-site.com --depth 3 --checkpoint-dir /tmp/checkpoint/

# Crawl with URL filter (only pages matching pattern)
$CRAWL https://docs.openclaw.ai --depth 2 --allowed-pattern "/docs/" --output-dir /tmp/

# Use stealth mode for crawl
$CRAWL --urls-file /tmp/urls.txt --mode stealthy --output-json /tmp/out.json

Python API (for sub-agents / scripts)

from scrapling import Fetcher, StealthyFetcher, DynamicFetcher

# Static fetch with browser impersonation
page = Fetcher().get("https://example.com", stealthy_headers=True)
text = page.get_all_text(ignore_tags=("script", "style"))
links = [a.attrib.get("href") for a in page.css("a[href]")]
title = page.css_first("h1").text

# Dynamic (JS-rendered)
async with DynamicFetcher() as f:
    page = await f.async_fetch("https://spa.example.com")

# Stealthy
page = StealthyFetcher().fetch("https://cloudflare-site.com", wait=2000)

# CSS selector extraction
results = page.css("div.article-body p")  # returns list of elements
first = page.css_first("h1").text

# Response properties
page.status  # HTTP status
page.url     # final URL (after redirects)
page.html    # raw HTML string
page.find("div", {"class": "content"})  # BeautifulSoup-style

Scrapling Spider (site crawl with full control)

from scrapling.spiders import Spider, Request
from scrapling import Fetcher

class DocsCrawler(Spider):
    start_urls = ["https://docs.example.com"]

    async def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse)

    async def parse(self, response):
        # Yield scraped data
        yield {
            "url": response.url,
            "title": response.css_first("h1").text if response.css_first("h1") else "",
            "body": response.get_all_text(ignore_tags=("script", "style")),
        }
        # Follow links
        for link in response.css("a[href]"):
            href = link.attrib.get("href", "")
            if href.startswith("/") or "docs.example.com" in href:
                yield Request(response.urljoin(href), callback=self.parse)

# Run (checkpoint-enabled)
spider = DocsCrawler(crawldir="/tmp/crawl-checkpoint/")
result = spider.start()
print(f"Scraped {result.stats.items_scraped} pages")
items = list(result.items)

Key Scrapling CSS/Response Methods

Method	Description
`page.css("selector")`	All matching elements
`page.css_first("selector")`	First match or None
`el.text`	Text content of element
`el.attrib["href"]`	Attribute value
`page.get_all_text()`	Full page text (strips scripts/styles)
`page.html`	Raw HTML
`page.find(tag, attrs)`	BeautifulSoup-style find
`page.urljoin(href)`	Resolve relative URL
`response.url`	Final URL after redirects
`response.status`	HTTP status code

Output Formats

All CLI commands support three output formats via file extension:

.md — Markdown (default, best for LLM consumption)
.txt — Plain text
.html — Raw HTML (use for link extraction or further parsing)

Tips

JS pages with lazy loading: use --wait 2000 + --network-idle
Dynamic content: --wait-selector ".target-class" waits until element appears
Rate limiting: add --wait 1000 between requests in crawl mode
Cloudflare 403/503: --mode stealthy --solve-cloudflare
Missing content: try --mode fetch --network-idle before escalating to stealthy
CSS selectors: use for targeted extraction to reduce noise in output
Deep research crawls: use --checkpoint-dir so crawl survives interruption

Usage Guidance

Install this only if you want agents to make direct web requests and optionally crawl sites from your machine. Use it only for URLs and domains you are allowed to access, be careful with stealth or Cloudflare-bypass modes, and choose output paths that will not overwrite important files or retain sensitive fetched content longer than intended.

Capability Assessment

✓ Purpose & Capability

The stated purpose is reliable web retrieval with static, dynamic, stealth, and crawl modes, and the scripts implement URL fetching, selector extraction, site crawling, and result export consistent with that purpose.

ℹ Instruction Scope

The instructions broadly prefer this skill over web_fetch and include stealth and Cloudflare-bypass options; those capabilities are disclosed and purpose-aligned, but users should apply site authorization, compliance, and rate-limit judgment.

✓ Install Mechanism

The package contains documentation and two non-executable Python script files, with no install-time execution, startup hook, package installer, or hidden setup behavior observed.

ℹ Credentials

Outbound network requests, browser-like fetching, optional proxy use, and same-domain crawling are proportionate for a web retrieval tool, but they can reveal access patterns to target sites.

ℹ Persistence & Privilege

The scripts can write fetched content to user-specified files, directories, JSON outputs, and temporary files; this is disclosed, user-directed, and not paired with privilege escalation or background persistence.

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install web-retrieval
After installation, invoke the skill by name or use /web-retrieval
Provide required inputs per the skill's parameter spec and get structured output

Version History

v1.0.0

- Initial release of web-retrieval skill powered by Scrapling. - Supports static, dynamic (JS-rendered), and stealthy web fetching modes for robust page retrieval. - Provides scripts for single-page fetches and multi-URL/site crawling with checkpoint and resume. - CLI, Python API, and Spider interfaces available with documentation and usage tips. - Improved support for anti-bot sites and Cloudflare bypass with stealth mode.

Metadata

Slug web-retrieval

Version 1.0.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is Web Retrieval?

Expert web fetching and crawling using Scrapling. Use for any web_fetch task, JS-rendered pages, anti-bot sites, bulk URL fetching, or site crawling. Preferr... It is an AI Agent Skill for Claude Code / OpenClaw, with 60 downloads so far.

How do I install Web Retrieval?

Run "/install web-retrieval" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Web Retrieval free?

Yes, Web Retrieval is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Web Retrieval support?

Web Retrieval is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).

Who created Web Retrieval?

It is built and maintained by Sean Ford (@seanford); the current version is v1.0.0.

More Skills