drowning-in-codes

hn-crawler

Author: proanimer · v1.0.0 · MIT-0
cross-platform ✓ Security check passed
Total downloads: 110
Favorites: 0
Current installs: 0
Versions: 1
Install in OpenClaw
/install hn-crawler-cn
Description
Crawls news from https://hn.aimaker.dev/ and runs the full crawl -> extract -> organize -> summarize pipeline. Invoke when the user wants to crawl news from hn.aimaker.dev or process web content through the full pipeline.
Usage (SKILL.md)

HN News Crawler Skill

This Skill crawls news content from https://hn.aimaker.dev/ and turns the raw data into a structured summary report through a complete processing pipeline.

Workflow

The pipeline has four stages:

┌─────────┐    ┌──────────┐    ┌──────────┐    ┌───────────┐
│  Crawl  │ -> │ Extract  │ -> │ Organize │ -> │ Summarize │
└─────────┘    └──────────┘    └──────────┘    └───────────┘

1. Crawl

  • Script: scripts/crawl.py
  • Function: fetches the page's raw HTML with an HTTP request
  • Output: data/raw/hn_aimaker_<timestamp>.html
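
The crawl stage can be sketched as follows. This is a minimal illustration, not the contents of scripts/crawl.py: the helper names (`raw_output_path`, `fetch_html`, `save_raw_html`) and the timestamp format `%Y%m%d_%H%M%S` are assumptions, and the standard library's `urllib` is used here so the sketch has no third-party dependencies.

```python
from datetime import datetime
from pathlib import Path
from urllib.request import Request, urlopen


def raw_output_path(output_dir: str, now: datetime) -> Path:
    # Timestamp format is an assumption; the README only shows "<timestamp>".
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / "raw" / f"hn_aimaker_{stamp}.html"


def fetch_html(url: str, timeout: float = 30.0) -> str:
    # Plain HTTP GET with an explicit timeout (hypothetical User-Agent string).
    request = Request(url, headers={"User-Agent": "hn-crawler-sketch"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def save_raw_html(html: str, output_dir: str, now: datetime) -> Path:
    # Create data/raw/ on demand and write the page unchanged.
    path = raw_output_path(output_dir, now)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path
```

Keeping the path construction separate from the network call makes the file-naming rule easy to check without touching the site.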

2. Extract

  • Script: scripts/extract.py
  • Function: parses the HTML and extracts each article's title, link, summary, publication time, and related fields
  • Output: data/extracted/articles_<timestamp>.json
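
A rough sketch of the extraction idea follows. The page markup is not documented here, so the `<a class="title">` selector is purely hypothetical, as are the names `TitleLinkParser` and `extract_articles`. The sketch uses the standard library's `html.parser` so that it runs without extra packages; the actual script may use a different parser.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collects text and href of <a class="title"> elements (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.articles = []
        self._href = None  # set while inside a matching <a>
        self._text = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "a" and "title" in classes:
            self._href = attr_map.get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            self.articles.append({"title": title, "url": self._href})
            self._href = None


def extract_articles(html):
    parser = TitleLinkParser()
    parser.feed(html)
    parser.close()
    return parser.articles
```

Links that do not carry the assumed class are ignored, which keeps navigation links out of the article list.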

3. Organize

  • Script: scripts/organize.py
  • Function: cleans, deduplicates, categorizes, and formats the extracted data
  • Output: data/organized/articles_organized_<timestamp>.json
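
One way the cleaning and deduplication step might look is sketched below. The rules are assumptions made for illustration (collapse whitespace in titles, drop entries without a title or URL, treat URLs that differ only by a trailing slash as duplicates, and fall back to an "Uncategorized" label); the function name `organize` is hypothetical.

```python
def organize(articles):
    seen = set()
    cleaned = []
    for article in articles:
        # Collapse runs of whitespace and trim both ends of the title.
        title = " ".join(str(article.get("title", "")).split())
        url = str(article.get("url", "")).strip()
        if not title or not url:
            continue  # incomplete entries are dropped
        key = url.rstrip("/")  # assumed dedup key: URL without trailing slash
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({
            **article,
            "title": title,
            "url": url,
            "category": article.get("category") or "Uncategorized",
        })
    return cleaned
```

The first occurrence of each URL wins, so the original ordering of the extracted list is preserved.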

4. Summarize

  • Script: scripts/summarize.py
  • Function: generates a summary report, including hot-topic statistics and trend analysis
  • Output: data/summary/summary_<timestamp>.md
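
As an illustration of the kind of report this stage could produce, the sketch below counts articles per category and lists the highest-scoring ones as Markdown. The report layout and the name `build_summary` are assumptions; the real script's statistics and trend analysis may differ.

```python
from collections import Counter


def build_summary(articles, top_n=3):
    # Count how many articles fall into each category.
    counts = Counter(a.get("category", "Uncategorized") for a in articles)
    # Rank by score, highest first, and keep the top_n entries.
    top = sorted(articles, key=lambda a: a.get("score", 0), reverse=True)[:top_n]

    lines = ["# Summary", "", f"Total articles: {len(articles)}", ""]
    lines += ["## Articles per category", ""]
    lines += [f"- {category}: {count}" for category, count in counts.most_common()]
    lines += ["", "## Top articles by score", ""]
    lines += [f"- [{a['title']}]({a['url']}) ({a.get('score', 0)})" for a in top]
    return "\n".join(lines) + "\n"
```

The function returns the report as a string, so the caller decides where to write it, for example under data/summary/.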

Quick start

Install dependencies

cd .trae/skills/hn-crawler
pip install -r scripts/requirements.txt

Run the full pipeline

# Option 1: run each stage in turn
python scripts/crawl.py
python scripts/extract.py
python scripts/organize.py
python scripts/summarize.py

# Option 2: run the whole pipeline with one command
python scripts/run_pipeline.py
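
For orientation, a one-command runner of this kind can be sketched with `subprocess.run`. This is an assumed outline rather than the actual run_pipeline.py; the names `STAGES`, `stage_commands`, and `run_pipeline` are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Stage scripts in execution order, as listed in the workflow.
STAGES = ["crawl.py", "extract.py", "organize.py", "summarize.py"]


def stage_commands(scripts_dir="scripts"):
    # Reuse the current interpreter so every stage runs in the same environment.
    return [[sys.executable, str(Path(scripts_dir) / name)] for name in STAGES]


def run_pipeline(scripts_dir="scripts"):
    for command in stage_commands(scripts_dir):
        # check=True raises CalledProcessError, stopping at the first failed stage.
        subprocess.run(command, check=True)
```

Passing the command as a list (without `shell=True`) avoids shell interpretation of the arguments.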

Directory structure

.trae/skills/hn-crawler/
├── SKILL.md                    # This file
├── scripts/
│   ├── requirements.txt        # Python dependencies
│   ├── crawl.py               # Crawl script
│   ├── extract.py             # Extract script
│   ├── organize.py            # Organize script
│   ├── summarize.py           # Summarize script
│   └── run_pipeline.py        # Runs the full pipeline in one command
└── data/                      # Output directory (created automatically)
    ├── raw/                   # Raw HTML
    ├── extracted/             # Extracted JSON data
    ├── organized/             # Organized data
    └── summary/               # Summary reports

Data format

Extracted article format (JSON)

{
  "articles": [
    {
      "title": "Article title",
      "url": "https://example.com/article",
      "summary": "Article summary",
      "published_at": "2024-01-15T10:30:00",
      "source": "hn.aimaker.dev",
      "category": "AI",
      "score": 150
    }
  ],
  "metadata": {
    "crawled_at": "2024-01-15T12:00:00",
    "total_count": 30
  }
}
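
A small loader that checks a file against the structure above might look like this. Treating every field in the example as required is an assumption, and `load_articles` and `REQUIRED_FIELDS` are hypothetical names.

```python
import json
from datetime import datetime

# Assumption: every field shown in the example record is required.
REQUIRED_FIELDS = ("title", "url", "summary", "published_at",
                   "source", "category", "score")


def load_articles(text):
    payload = json.loads(text)
    articles = payload["articles"]
    for article in articles:
        missing = [field for field in REQUIRED_FIELDS if field not in article]
        if missing:
            raise ValueError(f"article missing fields: {missing}")
        # Raises ValueError if the timestamp is not ISO 8601.
        datetime.fromisoformat(article["published_at"])
    return articles
```

Validating the timestamp with `datetime.fromisoformat` catches malformed dates early, before the organize and summarize stages rely on them.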

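Configuration options

Each script accepts the following settings as environment variables or command-line arguments:

  • TARGET_URL: target URL (default: https://hn.aimaker.dev/)
  • OUTPUT_DIR: output directory (default: data/)
  • TIMEOUT: request timeout (default: 30 seconds)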
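
Reading these settings from the environment, with the documented defaults, could be sketched as follows. The `Settings` dataclass and `load_settings` function are hypothetical names, and command-line parsing is left out because the flag names are not documented here.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    target_url: str
    output_dir: str
    timeout: float  # seconds


def load_settings(env=None):
    # Accept any mapping so the function can be exercised without touching os.environ.
    env = os.environ if env is None else env
    return Settings(
        target_url=env.get("TARGET_URL", "https://hn.aimaker.dev/"),
        output_dir=env.get("OUTPUT_DIR", "data/"),
        timeout=float(env.get("TIMEOUT", "30")),
    )
```

For example, running a script with `TIMEOUT=5` in the environment would yield `timeout == 5.0`, while the other two fields keep their defaults.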

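Notes

  1. Follow the site's robots.txt and crawling policy
  2. Set a reasonable interval between requests to avoid putting load on the server
  3. Crawled data is for personal study and research only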
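
The first two notes can be put into practice roughly as shown below, using the standard library's `urllib.robotparser`. The helper names `allowed_by_robots` and `polite_delays` and the one-second default delay are assumptions for illustration; the robots.txt text is passed in as a string, so fetching it is left to the caller.

```python
import time
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    # Parse robots.txt content that has already been downloaded.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def polite_delays(urls, delay_seconds=1.0, sleep=time.sleep):
    # Yield each URL, pausing between consecutive requests (not before the first).
    for index, url in enumerate(urls):
        if index:
            sleep(delay_seconds)
        yield url
```

The `sleep` parameter is injectable so the pacing logic can be checked without actually waiting.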
Security recommendations
This skill appears internally consistent for crawling and processing hn.aimaker.dev content. Before installing or running:
  1. Inspect the code locally (you already have the files); there are syntax/typing bugs (e.g., in organize.py) that must be fixed for the pipeline to run.
  2. Follow robots.txt and rate-limit requests to avoid abusive crawling.
  3. When running pip install -r requirements.txt, review which packages and versions will be installed (PyPI packages are common but carry supply-chain risk).
  4. Run the skill in a sandbox or non-critical environment first (it writes files to data/).
  5. If you need higher assurance, request the full, untruncated source for final review, or ask the author to provide a fixed release with tests and an explicit provenance/homepage.
Functional analysis
Type: OpenClaw Skill
Name: hn-crawler-cn
Version: 1.0.0
The skill bundle implements a standard web crawling and data processing pipeline for the website hn.aimaker.dev. The scripts (crawl.py, extract.py, organize.py, summarize.py) use well-known libraries such as requests and BeautifulSoup to fetch and parse content, and run_pipeline.py orchestrates the workflow using subprocess.run in a safe manner. There is no evidence of data exfiltration, malicious execution, or prompt injection intended to subvert the agent's behavior.
Capability assessment
Purpose & Capability
Name/description match the provided scripts and SKILL.md. The package contains crawl/extract/organize/summarize scripts and a run_pipeline orchestrator which all operate on the stated site (default TARGET_URL is https://hn.aimaker.dev/). There are no unrelated required binaries or environment variables.
Instruction Scope
SKILL.md and the scripts limit actions to HTTP GET requests to the target site, parsing HTML, local file read/write under data/, and generating summaries. Declared environment variables (TARGET_URL, OUTPUT_DIR, TIMEOUT) are used. The code does not reference other system credentials, config paths, or external endpoints beyond normal HTTP requests to the target URL. Note: some source files (organize.py) contain syntax/typing errors that will prevent successful execution until fixed; this is a functionality issue rather than a security misdirection.
Install Mechanism
There is no automated install spec; SKILL.md instructs the user to run pip install -r requirements.txt. Installing packages from PyPI is normal but carries the usual supply-chain risk (verify package versions and trust). No downloads from arbitrary URLs or archive extraction steps are present in the skill itself.
Credentials
The skill does not request credentials or secrets. The only environment variables used (TARGET_URL, OUTPUT_DIR, TIMEOUT) are proportional and documented. Scripts operate on local output directories and do not exfiltrate data to unlisted remote endpoints.
Persistence & Privilege
The skill is not marked always:true and does not attempt to modify other skills or system-level agent configuration. It does not request permanent presence or elevated privileges.
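How to use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install hn-crawler-cn
  3. Once installed, call the Skill by name or trigger it with /hn-crawler-cn
  4. Provide the inputs described in the Skill's parameter notes to get structured output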
Version history
v1.0.0
Initial release of the hn-crawler skill.
  • Implements a complete pipeline to crawl, extract, organize, and summarize news from https://hn.aimaker.dev/.
  • Provides Python scripts for each processing stage and a one-click pipeline runner.
  • Outputs structured JSON, cleaned/organized data, and Markdown summary reports.
  • Configurable via environment variables and CLI arguments for target URL, output directory, and timeout.
  • Includes detailed documentation on workflow, data format, and usage.
Metadata
Slug: hn-crawler-cn
Version: 1.0.0
License: MIT-0
Total installs: 0
Current installs: 0
Version count: 1
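FAQ

What is hn-crawler?

It crawls news from https://hn.aimaker.dev/ and runs the full crawl -> extract -> organize -> summarize pipeline. Invoke it when the user wants to crawl news from hn.aimaker.dev or process web content through the full pipeline. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 110 total downloads to date.

How do I install hn-crawler?

Run the command "/install hn-crawler-cn" in the OpenClaw or Claude Code chat box to install it in one step; no extra configuration is needed.

Is hn-crawler free?

Yes. hn-crawler is completely free and released under the MIT-0 license, so it can be downloaded, installed, and used freely.

Which platforms does hn-crawler support?

hn-crawler is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops hn-crawler?

It is developed and maintained by proanimer (@drowning-in-codes); the current version is v1.0.0.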
