Sitemap Content Scraper

Name: Sitemap Content Scraper
Author: quareth

by gunes alcan · GitHub ↗ · v1.0.2 · MIT-0

cross-platform ✓ Security Clean

Downloads: 141

Install in OpenClaw

/install sitemap-content-scraper

Description

Discover website sitemaps from robots.txt and common sitemap locations, choose the right sitemap or content family such as docs, blog, help center, academy,...
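The discovery step described above can be sketched roughly as follows. This is a minimal illustration, not the code in the bundled discover_sitemaps.py: the function names and the list of "common" sitemap paths are assumptions made for the example, and fetching robots.txt is left out so the snippet runs offline.

```python
from urllib.parse import urljoin

# Assumed list of conventional locations; the skill's real list may differ.
COMMON_SITEMAP_PATHS = ("/sitemap.xml", "/sitemap_index.xml")


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Collect the URLs named by `Sitemap:` lines in a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def candidate_sitemaps(base_url: str, robots_txt: str) -> list[str]:
    """Combine robots.txt entries with common locations, keeping order."""
    candidates = sitemaps_from_robots(robots_txt)
    candidates += [urljoin(base_url, path) for path in COMMON_SITEMAP_PATHS]
    return list(dict.fromkeys(candidates))  # de-duplicate, order preserved
```

Entries declared in robots.txt come first because they are stated by the site itself; the common locations are only fallbacks to probe.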

Usage Guidance

This skill appears to do what it says: it runs the included Python scripts to discover public sitemaps and fetch pages, then writes Markdown files to the destination folder you choose. Before running:

(1) Inspect the bundled scripts and run them in a sandbox or container if you are cautious.
(2) Only target public http/https hosts and avoid internal/private hostnames, as advised.
(3) Choose an output directory you control and confirm the agent asks before writing outside that area.
(4) Be aware that the scraper performs arbitrary HTTP requests, so don't point it at services where requests could trigger actions or costs.

Capability Analysis

Type: OpenClaw Skill
Name: sitemap-content-scraper
Version: 1.0.2

The sitemap-content-scraper skill is a well-implemented tool for converting public website content into Markdown. It features robust security controls, specifically addressing SSRF (Server-Side Request Forgery) by validating that all target URLs and redirect targets resolve to global, public IP addresses in both discover_sitemaps.py and scrape_sitemap.py. It also prevents path traversal attacks by slugifying URL segments before writing files to the local filesystem. The SKILL.md instructions are strictly aligned with the stated purpose, with no evidence of malicious prompt injection or data exfiltration.
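To illustrate the path traversal defence mentioned above, here is a hedged sketch of slugifying URL segments before building an output path. The helper names (`slugify`, `output_path_for`), the `index` fallback, and the folder layout are assumptions for the example and may not match what scrape_sitemap.py actually does.

```python
import re
from pathlib import Path
from urllib.parse import unquote, urlsplit


def slugify(segment: str) -> str:
    """Reduce a URL segment to lowercase letters, digits, and hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", unquote(segment).lower()).strip("-")


def output_path_for(url: str, out_dir: Path) -> Path:
    """Map a page URL to a Markdown path that stays inside out_dir."""
    parts = urlsplit(url)
    slugs = [slugify(seg) for seg in parts.path.split("/")]
    # Segments such as "", "." and ".." slugify to "" and are dropped here,
    # so no component of the result can climb out of out_dir.
    slugs = [s for s in slugs if s] or ["index"]
    host = slugify(parts.hostname or "") or "site"
    return out_dir.joinpath(host, *slugs).with_suffix(".md")
```

Because every path component is rebuilt from a restricted character set, separators and dot segments (including percent-encoded ones) never reach the filesystem call.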

Capability Assessment

✓ Purpose & Capability

Name/description match the included Python scripts (discover_sitemaps.py and scrape_sitemap.py). Required runtime (python3) and no credentials/config paths are consistent with a public-site sitemap discovery and scraping tool.

✓ Instruction Scope

SKILL.md restricts activity to public http/https targets and instructs running the included scripts; the scripts perform network requests and write files to a user-specified output directory as expected. The SKILL.md guardrails (reject localhost/private IPs, avoid auth/cookies, ask before writing outside working area) align with the script behavior.
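The "reject localhost/private IPs" guardrail could look something like the sketch below. It is an assumption-laden illustration rather than the skill's own code: the function names are hypothetical, and the `resolver` parameter is added only so the check can be exercised without real DNS lookups.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_ip(address: str) -> bool:
    """Return True only for globally routable IP addresses."""
    try:
        ip = ipaddress.ip_address(address.split("%", 1)[0])  # drop IPv6 zone id
    except ValueError:
        return False
    return ip.is_global


def resolve_host(hostname: str) -> list[str]:
    """Resolve a hostname to all of its addresses via the system resolver."""
    return [info[4][0] for info in socket.getaddrinfo(hostname, None)]


def is_public_url(url: str, resolver=resolve_host) -> bool:
    """Accept only http/https URLs whose host resolves solely to public IPs."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        addresses = resolver(parts.hostname)
    except OSError:  # includes socket.gaierror for unresolvable hosts
        return False
    return bool(addresses) and all(is_public_ip(a) for a in addresses)
```

Requiring every resolved address to be public, rather than any one of them, avoids accepting hosts that mix public and internal records. A check like this is still subject to DNS changing between validation and the actual request, so it is best repeated at request time.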

✓ Install Mechanism

No install spec (instruction-only) and only a dependency on python3. The skill bundles the scraper scripts rather than downloading external code at runtime, avoiding high-risk remote installs.

✓ Credentials

No environment variables, credentials, or unrelated binaries are requested. The scripts access network and local filesystem as required by a scraper; nothing asks for unrelated secrets or broad system config access.

✓ Persistence & Privilege

The skill is user-invocable and not always-enabled; it does not request persistent privileges or attempt to modify other skills or global agent configuration.

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install sitemap-content-scraper
After installation, invoke the skill by name or use /sitemap-content-scraper
Provide required inputs per the skill's parameter spec and get structured output

Version History

v1.0.2

- Adds stricter guardrails to enforce public-only targets using both hostname resolution and redirect-target checks at request time.
- No code changes; SKILL.md documentation updated to reflect improved public content enforcement.
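A redirect-target check of the kind described for v1.0.2 might be sketched as below. This is only an illustration under stated assumptions: `fetch_head` and `is_allowed` are hypothetical callables supplied by the caller (for example, an HTTP client with automatic redirects disabled and a public-address check), and the hop limit of 5 is arbitrary.

```python
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)


def resolve_redirects(url, fetch_head, is_allowed, max_hops=5):
    """Follow redirects manually, validating every hop before requesting it.

    fetch_head(url) must return (status_code, location_header_or_None).
    is_allowed(url) must return True only for acceptable (public) targets.
    """
    for _ in range(max_hops + 1):
        if not is_allowed(url):
            raise ValueError(f"blocked non-public target: {url}")
        status, location = fetch_head(url)
        if status in REDIRECT_CODES and location:
            url = urljoin(url, location)  # handle relative Location values
            continue
        return url
    raise ValueError("too many redirects")
```

Validating inside the loop means a public page that redirects to an internal address is rejected before any request is sent to that address.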

v1.0.1

- Tightened environment requirement: Now requires only python3 (not python, py, or python3 fallbacks).
- Updated guardrails: Now explicitly rejects localhost, private IPs, and internal-only hostnames, and prohibits use of authentication headers, cookies, or tokens.
- Simplified workflow: Interpreter command resolution steps removed; always uses python3.
- Documentation updated to reflect new security and environment requirements.

v1.0.0

Initial release of Sitemap Content Scraper.
- Discovers website sitemaps automatically from robots.txt and common locations.
- Lets you select and filter by content family (e.g., docs, blog, help center) before scraping.
- Scrapes sitemap-listed public pages into Markdown files, outputting traceable, folder-structured content.
- Generates a manifest.json reporting scraped, skipped, and failed pages.
- Respects sitemap scope and public content boundaries; does not crawl ad hoc or capture private/user-specific pages.
- Warns about extraction quality on JavaScript-heavy sites.
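As a rough sketch of the content-family selection and manifest reporting listed above: the example below groups sitemap URLs by their first path segment and summarises per-page outcomes. Treating the first path segment as the "family" and the exact manifest layout are assumptions for illustration; the skill's real manifest.json schema is not documented here.

```python
import json
import xml.etree.ElementTree as ET
from collections import defaultdict
from urllib.parse import urlsplit

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Extract every <loc> value from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


def group_by_family(urls: list[str]) -> dict[str, list[str]]:
    """Group URLs by first path segment, e.g. 'docs' or 'blog' (assumed rule)."""
    families = defaultdict(list)
    for url in urls:
        segments = [s for s in urlsplit(url).path.split("/") if s]
        families[segments[0] if segments else "(root)"].append(url)
    return dict(families)


def manifest_json(results: dict[str, str]) -> str:
    """Summarise url -> 'scraped' | 'skipped' | 'failed' as a JSON string."""
    manifest = {"scraped": [], "skipped": [], "failed": []}
    for url, status in results.items():
        manifest[status].append(url)
    return json.dumps(manifest, indent=2)
```

Grouping before scraping lets the user pick one family, such as only the docs pages, instead of fetching everything the sitemap lists.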

Metadata

Slug: sitemap-content-scraper
Version: 1.0.2
License: MIT-0
All-time Installs: 0
Active Installs: 0
Total Versions: 3

Frequently Asked Questions

What is Sitemap Content Scraper?

Discover website sitemaps from robots.txt and common sitemap locations, choose the right sitemap or content family such as docs, blog, help center, academy,... It is an AI Agent Skill for Claude Code / OpenClaw, with 141 downloads so far.

How do I install Sitemap Content Scraper?

Run "/install sitemap-content-scraper" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Sitemap Content Scraper free?

Yes, Sitemap Content Scraper is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Sitemap Content Scraper support?

Sitemap Content Scraper is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Sitemap Content Scraper?

It is built and maintained by gunes alcan (@quareth); the current version is v1.0.2.
