carlosdelfino

Rss Sitemap

Author: Carlos Delfino · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ✓ Security check passed
Total downloads: 45
Favorites: 0
Current installs: 0
Versions: 1
Install in OpenClaw
/install rss-sitemap
Description
Discover website URLs, feed entries, and latest publications by checking sitemap.xml, sitemaps.xml, atom.xml, and rss.xml before crawling a specific site. Us...
Usage (SKILL.md)

RSS Sitemap

Overview

Use this skill to bootstrap site discovery from the site's own machine-readable indexes before doing general crawling. For any task that targets a specific website, first look for sitemap, Atom, and RSS resources and use them to find the latest publications or guide the crawl.

Workflow

  1. Normalize the target site to an origin such as https://example.com.
  2. Run the bundled preprocessor through the OpenClaw exec tool when Node.js 18+ is available. exec is the shell tool name; do not require a separate bash tool:
    node skills/rss-sitemap/scripts/preprocess-rss-sitemap.js --site https://example.com --output /tmp/rss-sitemap.json
    
  3. Probe these root resources first when running manually:
    • /sitemap.xml
    • /sitemaps.xml
    • /atom.xml
    • /rss.xml
  4. If available, also inspect /robots.txt for Sitemap: directives and include those sitemap URLs.
  5. Fetch only resources that return a successful HTTP response and XML-like content.
  6. Parse XML with a real parser when possible. Avoid ad hoc regex parsing except for quick triage.
  7. Use discovered URLs or entries as the crawl frontier before falling back to regular page crawling.
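
The manual probing steps above can be sketched roughly as follows. This is a minimal illustration, not the bundled script's actual implementation: the function names (`normalizeOrigin`, `candidateUrls`, `sitemapsFromRobots`, `looksLikeXml`, `probe`) are hypothetical, and the "XML-like" check is an assumed heuristic based on the content type and the first tag of the body.

```javascript
// Well-known root resources listed in the workflow above.
const WELL_KNOWN_PATHS = ["/sitemap.xml", "/sitemaps.xml", "/atom.xml", "/rss.xml"];

// Reduce any URL on the site to its origin, e.g. https://example.com.
function normalizeOrigin(site) {
  return new URL(site).origin;
}

// Build the list of root resources to probe for a site.
function candidateUrls(site) {
  const origin = normalizeOrigin(site);
  return WELL_KNOWN_PATHS.map((path) => origin + path);
}

// Collect the URLs named by "Sitemap:" directives in a robots.txt body.
function sitemapsFromRobots(robotsText) {
  const urls = [];
  for (const line of robotsText.split(/\r?\n/)) {
    const match = /^\s*sitemap\s*:\s*(\S+)/i.exec(line);
    if (match) urls.push(match[1]);
  }
  return urls;
}

// Assumed heuristic for "XML-like content": an XML media type, or a body
// that opens with an XML declaration or a known sitemap/feed root element.
function looksLikeXml(contentType, body) {
  if (/\bxml\b/i.test(contentType || "")) return true;
  return /^\s*<(\?xml|rss|feed|urlset|sitemapindex)\b/i.test(body || "");
}

// Probe the candidates and keep only successful, XML-like responses.
// Relies on the global fetch available in Node.js 18+.
async function probe(site) {
  const found = [];
  for (const url of candidateUrls(site)) {
    try {
      const response = await fetch(url);
      if (!response.ok) continue;
      const body = await response.text();
      if (looksLikeXml(response.headers.get("content-type"), body)) {
        found.push({ url, status: response.status, body });
      }
    } catch {
      // Network errors are treated as "resource not available".
    }
  }
  return found;
}
```

Calling `probe("https://example.com")` would return the subset of the four root resources that responded successfully; sitemap URLs returned by `sitemapsFromRobots` could be appended to the candidate list in the same way.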

Bundled Tool

Use scripts/preprocess-rss-sitemap.js for deterministic pre-crawl discovery. It has no npm dependencies and uses Node's built-in fetch, so it requires Node.js 18 or newer for URL fetching.

Common commands:

node skills/rss-sitemap/scripts/preprocess-rss-sitemap.js --site https://example.com
node skills/rss-sitemap/scripts/preprocess-rss-sitemap.js --url https://example.com/sitemap.xml --url https://example.com/feed.xml
node skills/rss-sitemap/scripts/preprocess-rss-sitemap.js --file ./sitemap.xml --file ./feed.xml
node skills/rss-sitemap/scripts/preprocess-rss-sitemap.js --site https://example.com --max-depth 2 --output /tmp/rss-sitemap.json

The script outputs JSON with:

  • resources: probed XML or robots resources, HTTP status, content type, detected kind, and entry count.
  • entries: normalized sitemap URLs, RSS items, or Atom entries with source provenance.
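
As a rough sketch of consuming that output, the snippet below loads a report file and counts its two top-level arrays. Only the `resources` and `entries` keys come from the description above; the helper names `loadReport` and `summarize` are hypothetical, and the sketch assumes the file written via `--output` contains a single JSON object.

```javascript
const fs = require("fs");

// Read and parse a report previously written with --output.
function loadReport(path) {
  return JSON.parse(fs.readFileSync(path, "utf8"));
}

// Count probed resources and discovered entries, tolerating missing keys.
function summarize(report) {
  const resources = Array.isArray(report.resources) ? report.resources : [];
  const entries = Array.isArray(report.entries) ? report.entries : [];
  return { resourceCount: resources.length, entryCount: entries.length };
}
```

For example, `summarize(loadReport("/tmp/rss-sitemap.json"))` would give a quick check that discovery found anything before the entries are used as a crawl frontier.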

For latest-publication requests, sort entries by the best available date:

  1. RSS pubDate
  2. Atom updated
  3. Atom published
  4. Sitemap lastmod

If entries do not include dates, prefer RSS or Atom feed order before sitemap order because feeds usually list newest content first.
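
The date priority above could be applied along these lines. This is a sketch under assumptions: it supposes each entry object exposes its date under a property named after the source element (`pubDate`, `updated`, `published`, `lastmod`), which may not match the script's actual field names. Undated entries keep their input order and sort after dated ones, so passing feed entries before sitemap entries preserves the feed-first preference.

```javascript
// Date fields in priority order; the property names are assumptions.
const DATE_FIELDS = ["pubDate", "updated", "published", "lastmod"];

// Return the first parseable date as a timestamp, or null if none parses.
function bestDate(entry) {
  for (const field of DATE_FIELDS) {
    const time = Date.parse(entry[field]);
    if (!Number.isNaN(time)) return time;
  }
  return null;
}

// Sort newest first; undated entries go last in their original order.
function sortNewestFirst(entries) {
  return entries
    .map((entry, index) => ({ entry, index, time: bestDate(entry) }))
    .sort((a, b) => {
      if (a.time !== null && b.time !== null && a.time !== b.time) {
        return b.time - a.time;
      }
      if (a.time !== null && b.time === null) return -1;
      if (a.time === null && b.time !== null) return 1;
      return a.index - b.index;
    })
    .map((item) => item.entry);
}
```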

If the script fails because the site blocks requests, needs JavaScript, or requires authentication, use the available web scraping/search/browser tools for fetching, then apply the same parsing and crawl strategy.

Required tools:

  • OpenClaw exec enabled for host script execution.
  • Node.js 18+ for remote URL discovery with the bundled script.
  • Any available HTTP, scraping, search, or browser tool when Node fetch cannot access the target site.

Parsing Rules

For sitemaps:

  • Treat <sitemapindex> as a list of nested sitemaps; recursively fetch each <loc>.
  • Treat <urlset> as crawlable page URLs; extract <loc> and keep useful metadata such as <lastmod>, <changefreq>, and <priority> when present.
  • De-duplicate URLs after canonicalizing obvious variants such as fragments.
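
One possible way to de-duplicate after canonicalizing is sketched below. It handles only the fragment case named above and relies on the built-in `URL` class, which also lowercases the host; the helper names are hypothetical, and skipping unparseable values is a choice made for this sketch.

```javascript
// Strip the fragment so that /a and /a#top compare as the same page.
function canonicalize(rawUrl) {
  const url = new URL(rawUrl);
  url.hash = "";
  return url.href;
}

// Keep the first occurrence of each canonical URL; skip unparseable values.
function dedupeUrls(urls) {
  const seen = new Set();
  const unique = [];
  for (const raw of urls) {
    let key;
    try {
      key = canonicalize(raw);
    } catch {
      continue;
    }
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(key);
    }
  }
  return unique;
}
```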

For RSS feeds:

  • Extract each <item> with title, link, guid, pubDate, and description when present.
  • Prefer link as the crawl URL; fall back to guid only if it is URL-like.
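
The link-versus-guid rule might look like this once an item has been parsed into a plain object. The object shape (`link` and `guid` as strings) and the helper names are assumptions, and "URL-like" is interpreted here as an absolute http or https URL.

```javascript
// Interpret "URL-like" as an absolute http(s) URL; this is an assumption.
function isUrlLike(value) {
  if (typeof value !== "string") return false;
  try {
    const url = new URL(value.trim());
    return url.protocol === "http:" || url.protocol === "https:";
  } catch {
    return false;
  }
}

// Prefer link; fall back to guid only when it looks like a URL.
function rssCrawlUrl(item) {
  if (typeof item.link === "string" && item.link.trim() !== "") {
    return item.link.trim();
  }
  if (isUrlLike(item.guid)) return item.guid.trim();
  return null;
}
```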

For Atom feeds:

  • Extract each <entry> with title, id, updated, published, summary, and link.
  • Prefer <link rel="alternate" href="...">; otherwise use the first URL-like href.
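
A sketch of the Atom link preference, assuming each entry's link elements have already been parsed into `{ rel, href }` objects (a hypothetical shape). A link with no rel is treated as an alternate link, which is the default relation in the Atom format; relative hrefs are not resolved in this sketch.

```javascript
// Interpret "URL-like" as an absolute http(s) URL; this is an assumption.
function isUrlLike(value) {
  if (typeof value !== "string") return false;
  try {
    const url = new URL(value.trim());
    return url.protocol === "http:" || url.protocol === "https:";
  } catch {
    return false;
  }
}

// links: array of { rel, href } objects, one per link element of an entry.
function atomCrawlUrl(links) {
  // Prefer rel="alternate"; a missing rel defaults to "alternate" in Atom.
  const alternate = links.find(
    (link) => (link.rel == null || link.rel === "alternate") && isUrlLike(link.href)
  );
  if (alternate) return alternate.href.trim();
  // Otherwise take the first href that looks like a URL.
  const fallback = links.find((link) => isUrlLike(link.href));
  return fallback ? fallback.href.trim() : null;
}
```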

Crawl Strategy

  • Prefer newest or most relevant entries when the user asks for recent content.
  • For "latest publications", "recent posts", "new articles", or equivalent requests, use RSS/Atom first and return dated entries in descending order when dates are available.
  • Prefer sitemap URLs when the user asks for broad site coverage.
  • Keep feed and sitemap provenance with each discovered URL so later summaries can explain where a URL came from.
  • If none of the well-known resources exist, state that discovery fell back to normal crawling or search.
  • Respect robots, rate limits, authentication boundaries, and user instructions before expanding a crawl.
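
The feed-versus-sitemap preference above could be expressed as a simple frontier ordering. This sketch assumes each entry carries a `kind` property with the values "rss", "atom", or "sitemap" (hypothetical names for the provenance the strategy asks to keep), and it detects recency requests with a small keyword pattern that is only illustrative.

```javascript
// Illustrative keywords that suggest the user wants recent content.
const RECENCY_HINTS = /\b(latest|recent|new|newest)\b/i;

function wantsRecent(request) {
  return RECENCY_HINTS.test(request);
}

// Put feed entries first for recency requests, sitemap entries first for
// broad coverage; the original order is kept within each group.
function orderFrontier(entries, request) {
  const feedsFirst = wantsRecent(request);
  const rank = (entry) => {
    const isFeed = entry.kind === "rss" || entry.kind === "atom";
    return isFeed === feedsFirst ? 0 : 1;
  };
  return entries
    .map((entry, index) => ({ entry, index }))
    .sort((a, b) => rank(a.entry) - rank(b.entry) || a.index - b.index)
    .map((item) => item.entry);
}
```

Because entries are reordered rather than rewritten, any provenance fields they carry remain attached for later summaries.
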
Safe usage recommendations
Install this only if you are comfortable allowing the agent to run the bundled Node script and make outbound HTTP requests from the host. Keep exec approvals scoped to this script, prefer one-time approval unless repeated use is needed, and avoid using it on localhost, private IP ranges, or sensitive internal domains unless that access is intentional.
Capability assessment
Purpose & Capability
The stated purpose is to discover URLs and recent publications from sitemap, Atom, RSS, and robots.txt resources, and the bundled script does exactly that by fetching and parsing those resources.
Instruction Scope
The skill tells agents to use OpenClaw exec and outbound fetching for user-specified sites; that is disclosed and purpose-aligned, though operators should avoid using it against private or sensitive network targets.
Install Mechanism
No package dependencies or installer behavior were found; metadata declares Node as the required binary. The README includes broader OpenClaw exec and WhatsApp configuration examples, but they are documented setup steps rather than automatic install actions.
Credentials
Host exec plus network fetch is proportionate for a local preprocessor crawler, but it gives the agent network reach from the host environment and should be governed by normal allowlist and approval policy.
Persistence & Privilege
The script itself has no persistence, background worker, credential use, or privilege escalation. The README describes optional persistent approval for the specific Node command pattern, which users should review before enabling.
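How to use
  1. Make sure OpenClaw is installed (locally or via Docker).
  2. Enter the install command in the chat box: /install rss-sitemap
  3. Once installation finishes, invoke the Skill by name or trigger it with /rss-sitemap.
  4. Provide the required inputs according to the Skill's parameter description to get structured output.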
Version history
v1.0.0
Initial release of rss-sitemap skill.
  • Discovers website URLs, feed entries, and recent publications via sitemap.xml, sitemaps.xml, atom.xml, and rss.xml before crawling.
  • Bundles a Node.js 18+ script for deterministic pre-crawl discovery and parsing of sitemap, RSS, and Atom resources.
  • Prioritizes site-provided XML indexes and feeds for recency and coverage before blind crawling.
  • Outputs normalized URLs and entries with resource provenance; sorts entries using publication dates when available.
  • Falls back to crawling and scraping tools if XML resources are unavailable or access is blocked.
Metadata
Slug: rss-sitemap
Version: 1.0.0
License: MIT-0
Total installs: 0
Current installs: 0
Version count: 1
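FAQ

What is Rss Sitemap?

Discover website URLs, feed entries, and latest publications by checking sitemap.xml, sitemaps.xml, atom.xml, and rss.xml before crawling a specific site. Us... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 45 total downloads so far.

How do I install Rss Sitemap?

Run the command "/install rss-sitemap" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is required.

Is Rss Sitemap free?

Yes. Rss Sitemap is completely free and is released under the MIT-0 license, so it can be downloaded, installed, and used freely.

Which platforms does Rss Sitemap support?

Rss Sitemap is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Rss Sitemap?

It is developed and maintained by Carlos Delfino (@carlosdelfino); the current version is v1.0.0.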
