carlosdelfino

Rss Sitemap

by Carlos Delfino · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ✓ Security Clean
45 downloads · 0 stars · 0 active installs · 1 version
Install in OpenClaw
/install rss-sitemap
Description
Discover website URLs, feed entries, and latest publications by checking sitemap.xml, sitemaps.xml, atom.xml, and rss.xml before crawling a specific site. Us...
README (SKILL.md)

RSS Sitemap

Overview

Use this skill to bootstrap site discovery from the site's own machine-readable indexes before doing general crawling. For any task that targets a specific website, first look for sitemap, Atom, and RSS resources and use them to find the latest publications or guide the crawl.

Workflow

  1. Normalize the target site to an origin such as https://example.com.
  2. Run the bundled preprocessor through the OpenClaw exec tool when Node.js 18+ is available. exec is the shell tool name; do not require a separate bash tool:
    node skills/rss-sitemap/scripts/preprocess-rss-sitemap.js --site https://example.com --output /tmp/rss-sitemap.json
    
  3. Probe these root resources first when running manually:
    • /sitemap.xml
    • /sitemaps.xml
    • /atom.xml
    • /rss.xml
  4. If available, also inspect /robots.txt for Sitemap: directives and include those sitemap URLs.
  5. Fetch only resources that return a successful HTTP response and XML-like content.
  6. Parse XML with a real parser when possible. Avoid ad hoc regex parsing except for quick triage.
  7. Use discovered URLs or entries as the crawl frontier before falling back to regular page crawling.
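
The manual steps above can be sketched in plain Node.js as follows. This is a minimal illustration, not the bundled script's implementation: the helper names (normalizeOrigin, candidateUrls, sitemapsFromRobots, looksLikeXml) are hypothetical, and the "XML-like" check is one possible reading of step 5.

```javascript
// Hypothetical helpers; names are illustrative and not taken from the bundled script.
const WELL_KNOWN_PATHS = ["/sitemap.xml", "/sitemaps.xml", "/atom.xml", "/rss.xml"];

// Reduce any page URL to its origin, e.g. https://example.com/blog?x=1 -> https://example.com
function normalizeOrigin(site) {
  const withScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(site) ? site : `https://${site}`;
  return new URL(withScheme).origin;
}

// Build the list of root resources to probe for a site.
function candidateUrls(site) {
  const origin = normalizeOrigin(site);
  return WELL_KNOWN_PATHS.map((path) => origin + path);
}

// Pull sitemap URLs out of robots.txt text ("Sitemap:" is matched case-insensitively).
function sitemapsFromRobots(robotsText) {
  const urls = [];
  for (const line of robotsText.split(/\r?\n/)) {
    const match = /^\s*sitemap\s*:\s*(\S+)/i.exec(line);
    if (match) urls.push(match[1]);
  }
  return urls;
}

// Accept only 2xx responses whose content type mentions XML or whose body opens with an XML prolog.
function looksLikeXml(status, contentType, body) {
  if (status < 200 || status > 299) return false;
  if (/\bxml\b/i.test(contentType ?? "")) return true;
  return body.trimStart().startsWith("<?xml");
}
```

The line-based matching here applies only to robots.txt, which is a plain-text format; sitemap and feed bodies should still go through an XML parser as step 6 recommends.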

Bundled Tool

Use scripts/preprocess-rss-sitemap.js for deterministic pre-crawl discovery. It has no npm dependencies and uses Node's built-in fetch, so it requires Node.js 18 or newer for URL fetching.

Common commands:

node skills/rss-sitemap/scripts/preprocess-rss-sitemap.js --site https://example.com
node skills/rss-sitemap/scripts/preprocess-rss-sitemap.js --url https://example.com/sitemap.xml --url https://example.com/feed.xml
node skills/rss-sitemap/scripts/preprocess-rss-sitemap.js --file ./sitemap.xml --file ./feed.xml
node skills/rss-sitemap/scripts/preprocess-rss-sitemap.js --site https://example.com --max-depth 2 --output /tmp/rss-sitemap.json

The script outputs JSON with:

  • resources: probed XML or robots resources, HTTP status, content type, detected kind, and entry count.
  • entries: normalized sitemap URLs, RSS items, or Atom entries with source provenance.

For latest-publication requests, sort entries by the best available date:

  1. RSS pubDate
  2. Atom updated
  3. Atom published
  4. Sitemap lastmod

If entries do not include dates, prefer RSS or Atom feed order before sitemap order because feeds usually list newest content first.
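
The date priority above could be applied roughly as follows. This is a sketch under assumptions: the field names pubDate, updated, published, and lastmod mirror the XML element names and may not match the keys in the script's actual JSON output, and the helper names are hypothetical. Date.parse is only guaranteed to handle ISO 8601 strings; parsing of RSS-style dates such as "Tue, 02 Jan 2024 10:00:00 GMT" relies on Node's engine accepting that format.

```javascript
// Field names mirror the XML elements; the exact keys in the script's JSON are an assumption.
const DATE_KEYS = ["pubDate", "updated", "published", "lastmod"];

// Return the first parseable date in priority order as a timestamp, or null if none parses.
function bestDate(entry) {
  for (const key of DATE_KEYS) {
    const time = Date.parse(entry[key] ?? "");
    if (!Number.isNaN(time)) return time;
  }
  return null;
}

// Newest first; undated entries keep their original relative order at the end.
function sortByRecency(entries) {
  return entries
    .map((entry, index) => ({ entry, index, time: bestDate(entry) }))
    .sort((a, b) => {
      if (a.time !== null && b.time !== null) return b.time - a.time || a.index - b.index;
      if (a.time !== null) return -1;
      if (b.time !== null) return 1;
      return a.index - b.index;
    })
    .map((wrapped) => wrapped.entry);
}
```

Keeping undated entries in their original order preserves feed order, which matters when the input lists feed entries ahead of sitemap URLs.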

If the script fails because the site blocks requests, needs JavaScript, or requires authentication, use the available web scraping/search/browser tools for fetching, then apply the same parsing and crawl strategy.

Required tools:

  • OpenClaw exec enabled for host script execution.
  • Node.js 18+ for remote URL discovery with the bundled script.
  • Any available HTTP, scraping, search, or browser tool when Node fetch cannot access the target site.

Parsing Rules

For sitemaps:

  • Treat <sitemapindex> as a list of nested sitemaps; recursively fetch each <loc>.
  • Treat <urlset> as crawlable page URLs; extract <loc> and keep useful metadata such as <lastmod>, <changefreq>, and <priority> when present.
  • De-duplicate URLs after canonicalizing obvious variants such as fragments.
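
A minimal sketch of the de-duplication rule, assuming the URLs have already been extracted from the sitemap by an XML parser. The function name dedupeUrls is hypothetical, and skipping values that are not absolute URLs is a choice made for this example rather than documented behavior of the bundled script.

```javascript
// Drop the fragment from each URL and de-duplicate while preserving first-seen order.
function dedupeUrls(urls) {
  const seen = new Set();
  const result = [];
  for (const raw of urls) {
    let canonical;
    try {
      const parsed = new URL(raw);
      parsed.hash = ""; // removes "#fragment" entirely
      canonical = parsed.href; // href also lowercases the scheme and host
    } catch {
      continue; // skip values that are not absolute URLs
    }
    if (!seen.has(canonical)) {
      seen.add(canonical);
      result.push(canonical);
    }
  }
  return result;
}
```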

For RSS feeds:

  • Extract each <item> with title, link, guid, pubDate, and description when present.
  • Prefer link as the crawl URL; fall back to guid only if it is URL-like.

For Atom feeds:

  • Extract each <entry> with title, id, updated, published, summary, and link.
  • Prefer <link rel="alternate" href="...">; otherwise use the first URL-like href.
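
The link-selection rules for RSS items and Atom entries might look like this once an XML parser has produced plain objects. The object shapes ({ link, guid } for RSS, an array of { rel, href } for Atom) and the helper names are assumptions for illustration. "URL-like" is read narrowly here as an absolute http(s) URL, and an Atom link with no rel attribute is treated as rel="alternate", which is the default in the Atom format.

```javascript
// True for absolute http(s) URLs; a deliberately narrow reading of "URL-like".
function isUrlLike(value) {
  try {
    const { protocol } = new URL(String(value).trim());
    return protocol === "http:" || protocol === "https:";
  } catch {
    return false;
  }
}

// RSS: prefer <link>, fall back to <guid> only when it looks like a URL.
function rssCrawlUrl(item) {
  if (isUrlLike(item.link)) return item.link.trim();
  if (isUrlLike(item.guid)) return item.guid.trim();
  return null;
}

// Atom: prefer rel="alternate" (or a missing rel), otherwise the first URL-like href.
// `links` is an array of { rel, href } objects already extracted by an XML parser.
function atomCrawlUrl(links) {
  const usable = links.filter((link) => isUrlLike(link.href));
  const alternate = usable.find((link) => (link.rel ?? "alternate") === "alternate");
  const chosen = alternate ?? usable[0];
  return chosen ? chosen.href.trim() : null;
}
```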

Crawl Strategy

  • Prefer newest or most relevant entries when the user asks for recent content.
  • For "latest publications", "recent posts", "new articles", or equivalent requests, use RSS/Atom first and return dated entries in descending order when dates are available.
  • Prefer sitemap URLs when the user asks for broad site coverage.
  • Keep feed and sitemap provenance with each discovered URL so later summaries can explain where a URL came from.
  • If none of the well-known resources exist, state that discovery fell back to normal crawling or search.
  • Respect robots, rate limits, authentication boundaries, and user instructions before expanding a crawl.
Usage Guidance
Install this only if you are comfortable allowing the agent to run the bundled Node script and make outbound HTTP requests from the host. Keep exec approvals scoped to this script, prefer one-time approval unless repeated use is needed, and avoid using it on localhost, private IP ranges, or sensitive internal domains unless that access is intentional.
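One way to act on this guidance is a pre-flight check that refuses obviously local targets before the script runs. The sketch below is hypothetical and is not part of the skill: it inspects only the literal hostname, performs no DNS lookup, and so cannot catch a public name that resolves to a private address.

```javascript
// Hypothetical pre-flight check; inspects only the literal hostname, with no DNS lookup.
function isSensitiveTarget(site) {
  const { hostname } = new URL(site);
  if (hostname === "localhost" || hostname.endsWith(".localhost")) return true;
  if (hostname === "[::1]") return true; // IPv6 loopback
  const octets = hostname.split(".").map(Number);
  const isIpv4 =
    octets.length === 4 && octets.every((n) => Number.isInteger(n) && n >= 0 && n <= 255);
  if (!isIpv4) return false;
  const [a, b] = octets;
  return (
    a === 10 || // 10.0.0.0/8
    a === 127 || // loopback
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12
    (a === 192 && b === 168) || // 192.168.0.0/16
    (a === 169 && b === 254) // link-local
  );
}
```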
Capability Assessment
Purpose & Capability
The stated purpose is to discover URLs and recent publications from sitemap, Atom, RSS, and robots.txt resources, and the bundled script does exactly that by fetching and parsing those resources.
Instruction Scope
The skill tells agents to use OpenClaw exec and outbound fetching for user-specified sites; that is disclosed and purpose-aligned, though operators should avoid using it against private or sensitive network targets.
Install Mechanism
No package dependencies or installer behavior were found; metadata declares Node as the required binary. The README includes broader OpenClaw exec and WhatsApp configuration examples, but they are documented setup steps rather than automatic install actions.
Credentials
Host exec plus network fetch is proportionate for a local preprocessor crawler, but it gives the agent network reach from the host environment and should be governed by normal allowlist and approval policy.
Persistence & Privilege
The script itself has no persistence, background worker, credential use, or privilege escalation. The README describes optional persistent approval for the specific Node command pattern, which users should review before enabling.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install rss-sitemap
  3. After installation, invoke the skill by name or use /rss-sitemap
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
Initial release of the rss-sitemap skill.
  • Discovers website URLs, feed entries, and recent publications via sitemap.xml, sitemaps.xml, atom.xml, and rss.xml before crawling.
  • Bundles a Node.js 18+ script for deterministic pre-crawl discovery and parsing of sitemap, RSS, and Atom resources.
  • Prioritizes site-provided XML indexes and feeds for recency and coverage before blind crawling.
  • Outputs normalized URLs and entries with resource provenance; sorts entries using publication dates when available.
  • Falls back to crawling and scraping tools if XML resources are unavailable or access is blocked.
Metadata
Slug rss-sitemap
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is Rss Sitemap?

Discover website URLs, feed entries, and latest publications by checking sitemap.xml, sitemaps.xml, atom.xml, and rss.xml before crawling a specific site. Us... It is an AI Agent Skill for Claude Code / OpenClaw, with 45 downloads so far.

How do I install Rss Sitemap?

Run "/install rss-sitemap" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Rss Sitemap free?

Yes, Rss Sitemap is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Rss Sitemap support?

Rss Sitemap is cross-platform and runs anywhere OpenClaw or Claude Code is available.

Who created Rss Sitemap?

It is built and maintained by Carlos Delfino (@carlosdelfino); the current version is v1.0.0.
