quareth

Sitemap Content Scraper

Author: gunes alcan · GitHub · v1.0.2 · MIT-0
cross-platform ✓ Security check passed
Total downloads: 141
Favorites: 2
Current installs: 0
Versions: 3
Install in OpenClaw
/install sitemap-content-scraper
Description
Discover website sitemaps from robots.txt and common sitemap locations, choose the right sitemap or content family such as docs, blog, help center, academy,...
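The discovery step described above can be sketched as follows. This is a minimal illustration, not the code in discover_sitemaps.py: the function names and the fallback list `COMMON_SITEMAP_PATHS` are assumptions made for the example. It reads the `Sitemap:` lines that robots.txt may declare and falls back to a couple of conventional locations when none are present.

```python
from urllib.parse import urljoin

# Hypothetical fallback locations; the skill's actual list is not shown here.
COMMON_SITEMAP_PATHS = ("/sitemap.xml", "/sitemap_index.xml")


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs named by `Sitemap:` lines in a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split on the first colon only,
        # because the URL itself contains colons.
        line = line.split("#", 1)[0].strip()
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def candidate_sitemaps(base_url: str, robots_txt: str) -> list[str]:
    """Prefer robots.txt entries; otherwise fall back to common locations."""
    declared = sitemaps_from_robots(robots_txt)
    if declared:
        return declared
    return [urljoin(base_url, path) for path in COMMON_SITEMAP_PATHS]
```

The fallback candidates are only guesses at where a sitemap might live, so each one would still need to be fetched and checked before use.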
Safe usage recommendations
This skill appears to do what it says: it will run the included Python scripts to discover public sitemaps and fetch pages, then write Markdown files to the destination folder you choose. Before running: (1) inspect the bundled scripts (they are included) and run them in a sandbox or container if you are cautious; (2) only target public http/https hosts and avoid internal/private hostnames as advised; (3) choose an output directory you control and confirm the agent asks before writing outside that area; (4) be aware the scraper performs arbitrary HTTP requests (so don't point it at services where requests could trigger actions or costs).
Functional analysis
Type: OpenClaw Skill · Name: sitemap-content-scraper · Version: 1.0.2
The sitemap-content-scraper skill is a well-implemented tool for converting public website content into Markdown. It features robust security controls, specifically addressing SSRF (Server-Side Request Forgery) by validating that all target URLs and redirect targets resolve to global, public IP addresses in both discover_sitemaps.py and scrape_sitemap.py. Additionally, it prevents path traversal attacks by slugifying URL segments before writing files to the local filesystem, and the SKILL.md instructions are strictly aligned with the stated purpose without any evidence of malicious prompt injection or data exfiltration.
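To illustrate why slugifying URL segments blocks path traversal, here is a minimal sketch. It is not the code in scrape_sitemap.py; `slugify` and `output_path` are hypothetical names, and the exact slug rules the skill uses may differ. The idea is that once every segment is reduced to letters, digits, and hyphens, a segment such as `..` can no longer climb out of the output directory.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def slugify(segment: str) -> str:
    """Reduce one URL path segment to lowercase letters, digits, and hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", segment.lower()).strip("-")
    # A segment made only of unsafe characters (for example "..") becomes
    # empty; substitute a fixed placeholder name (an assumption here).
    return slug or "index"


def output_path(url: str, out_dir: Path) -> Path:
    """Map a page URL to a Markdown file that stays inside out_dir."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    parts = [slugify(s) for s in segments] or ["index"]
    return out_dir.joinpath(*parts).with_suffix(".md")
```

With this sketch, a URL path of `/../../etc/passwd` maps to `out/index/index/etc/passwd.md` under an output directory named `out`, rather than to a file outside it.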
Capability assessment
Purpose & Capability
Name/description match the included Python scripts (discover_sitemaps.py and scrape_sitemap.py). Required runtime (python3) and no credentials/config paths are consistent with a public-site sitemap discovery and scraping tool.
Instruction Scope
SKILL.md restricts activity to public http/https targets and instructs running the included scripts; the scripts perform network requests and write files to a user-specified output directory as expected. The SKILL.md guardrails (reject localhost/private IPs, avoid auth/cookies, ask before writing outside working area) align with the script behavior.
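The guardrail that rejects localhost and private IPs can be sketched with the standard library alone. This is an assumption about the general approach, not the skill's actual code; `all_global` and `is_public_url` are hypothetical names. The sketch resolves the hostname and accepts the URL only if every resolved address is globally routable.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def all_global(addresses: list[str]) -> bool:
    """True only if the list is non-empty and every address is globally routable."""
    return bool(addresses) and all(
        ipaddress.ip_address(a).is_global for a in addresses
    )


def is_public_url(url: str) -> bool:
    """Accept only http/https URLs whose host resolves solely to global IPs."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False
    # sockaddr[0] is the IP string; strip any IPv6 zone id such as "%eth0".
    return all_global([info[4][0].split("%", 1)[0] for info in infos])
```

To cover redirects as well, the same check would be repeated on each redirect target before following it, instead of letting the HTTP client follow redirects automatically.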
Install Mechanism
No install spec (instruction-only) and only a dependency on python3. The skill bundles the scraper scripts rather than downloading external code at runtime, avoiding high-risk remote installs.
Credentials
No environment variables, credentials, or unrelated binaries are requested. The scripts access network and local filesystem as required by a scraper; nothing asks for unrelated secrets or broad system config access.
Persistence & Privilege
The skill is user-invocable and not always-enabled; it does not request persistent privileges or attempt to modify other skills or global agent configuration.
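How to use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install sitemap-content-scraper
  3. Once installed, invoke the Skill by name or trigger it with /sitemap-content-scraper
  4. Provide the required inputs according to the Skill's parameter description to get structured output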
Version history
v1.0.2
- Adds stricter guardrails to enforce public-only targets using both hostname resolution and redirect-target checks at request time.
- No code changes; SKILL.md documentation updated to reflect improved public content enforcement.
v1.0.1
- Tightened environment requirement: Now requires only python3 (not python, py, or python3 fallbacks).
- Updated guardrails: Now explicitly rejects localhost, private IPs, and internal-only hostnames, and prohibits use of authentication headers, cookies, or tokens.
- Simplified workflow: Interpreter command resolution steps removed; always uses python3.
- Documentation updated to reflect new security and environment requirements.
v1.0.0
Initial release of Sitemap Content Scraper.
- Discovers website sitemaps automatically from robots.txt and common locations.
- Lets you select and filter by content family (e.g., docs, blog, help center) before scraping.
- Scrapes sitemap-listed public pages into Markdown files, outputting traceable, folder-structured content.
- Generates a manifest.json reporting scraped, skipped, and failed pages.
- Respects sitemap scope and public content boundaries; does not crawl ad hoc or capture private/user-specific pages.
- Warns about extraction quality on JavaScript-heavy sites.
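As a rough picture of what a manifest reporting scraped, skipped, and failed pages might contain, here is a small sketch. The real manifest.json schema is not documented on this page, so every field name below (`summary`, `pages`, `status`, `reason`, `file`) is an assumption chosen for illustration.

```python
import json
from collections import Counter


def build_manifest(results: list[dict]) -> dict:
    """Summarize per-page results into counts plus the full page list."""
    counts = Counter(r["status"] for r in results)
    return {
        "summary": {s: counts.get(s, 0) for s in ("scraped", "skipped", "failed")},
        "pages": results,
    }


# Hypothetical per-page results, one for each status.
results = [
    {"url": "https://example.com/docs/intro", "status": "scraped", "file": "docs/intro.md"},
    {"url": "https://example.com/docs/login", "status": "skipped", "reason": "not public"},
    {"url": "https://example.com/docs/old", "status": "failed", "reason": "HTTP 404"},
]
manifest = build_manifest(results)
print(json.dumps(manifest, indent=2))
```

Keeping a reason next to each skipped or failed page makes it possible to trace later why a URL listed in the sitemap has no Markdown file.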
Metadata
Slug sitemap-content-scraper
Version 1.0.2
License MIT-0
Total installs 0
Current installs 0
Version count 3
Frequently asked questions

What is Sitemap Content Scraper?

Discover website sitemaps from robots.txt and common sitemap locations, choose the right sitemap or content family such as docs, blog, help center, academy,... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 141 total downloads to date.

How do I install Sitemap Content Scraper?

Run the command "/install sitemap-content-scraper" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is required.

Is Sitemap Content Scraper free?

Yes. Sitemap Content Scraper is completely free and released under the MIT-0 license; you can download, install, and use it freely.

Which platforms does Sitemap Content Scraper support?

Sitemap Content Scraper is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Sitemap Content Scraper?

It is developed and maintained by gunes alcan (@quareth); the current version is v1.0.2.
