
hn-crawler

Name: hn-crawler
Author: drowning-in-codes

by proanimer · v1.0.0 · MIT-0

cross-platform ✓ Security Clean

Downloads: 110

Install in OpenClaw

/install hn-crawler-cn

Description

Crawls news from https://hn.aimaker.dev/ and runs the full crawl -> extract -> organize -> summarize pipeline. Invoke when the user wants to crawl news from hn.aimaker.dev or process web content through the full pipeline.

README (SKILL.md)

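HN News Crawler Skill

This skill crawls news content from https://hn.aimaker.dev/ and runs it through a complete processing pipeline that turns the raw data into a structured summary report.

Workflow

The pipeline has four stages: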

┌─────────┐    ┌──────────┐    ┌──────────┐    ┌───────────┐
│  Crawl  │ -> │ Extract  │ -> │ Organize │ -> │ Summarize │
└─────────┘    └──────────┘    └──────────┘    └───────────┘

1. Crawl

Script: scripts/crawl.py
Purpose: fetches the page's raw HTML over HTTP
Output: data/raw/hn_aimaker_<timestamp>.html
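
As a rough illustration of this stage, the sketch below fetches a page and writes it under data/raw/. It is not the skill's actual crawl.py: it uses the standard library's urllib instead of requests to stay self-contained, and the function names, User-Agent string, and timestamp format are all assumptions.

```python
from datetime import datetime
from pathlib import Path
from urllib.request import Request, urlopen


def raw_output_path(output_dir: str, now: datetime) -> Path:
    """Build <output_dir>/raw/hn_aimaker_<timestamp>.html (timestamp format is assumed)."""
    stamp = now.strftime("%Y%m%d_%H%M%S")
    return Path(output_dir) / "raw" / f"hn_aimaker_{stamp}.html"


def crawl(url: str = "https://hn.aimaker.dev/",
          output_dir: str = "data",
          timeout: float = 30.0) -> Path:
    """Fetch the page over HTTP and save the raw HTML bytes to disk."""
    # Hypothetical User-Agent; identify your crawler honestly.
    request = Request(url, headers={"User-Agent": "hn-crawler-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        html = response.read()
    target = raw_output_path(output_dir, datetime.now())
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(html)
    return target
```

Keeping the path construction in its own function makes the naming scheme testable without touching the network.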

2. Extract

Script: scripts/extract.py
Purpose: parses the HTML and extracts each article's title, link, summary, and publication time
Output: data/extracted/articles_<timestamp>.json
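
To make the extraction step concrete, here is a minimal sketch that pulls title/URL pairs out of anchor tags. The page's real markup is not documented here, so treating every `<a href>` as an article is purely an assumption, and the sketch uses the standard library's html.parser rather than BeautifulSoup. `LinkExtractor` and `extract_articles` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import Optional


class LinkExtractor(HTMLParser):
    """Collect {"title", "url"} records from every <a href="..."> element."""

    def __init__(self) -> None:
        super().__init__()
        self.articles: list = []
        self._href: Optional[str] = None
        self._text: list = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href leave _href as None and are skipped.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = "".join(self._text).strip()
            if title:
                self.articles.append({"title": title, "url": self._href})
            self._href = None


def extract_articles(html: str) -> list:
    parser = LinkExtractor()
    parser.feed(html)
    return parser.articles
```

A real extractor would target the specific elements that hold summaries and publication times once the page structure is known.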

3. Organize

Script: scripts/organize.py
Purpose: cleans, de-duplicates, categorizes, and formats the extracted data
Output: data/organized/articles_organized_<timestamp>.json
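
One plausible shape for the cleaning and de-duplication logic is sketched below. The rules are assumptions rather than the behavior of the shipped organize.py: duplicates are detected by URL, titles have whitespace collapsed, a missing category falls back to a hypothetical "uncategorized" label, and results are sorted by score.

```python
def organize(articles: list) -> list:
    """Clean whitespace, drop entries without a URL, and de-duplicate by URL."""
    seen = set()
    organized = []
    for article in articles:
        url = (article.get("url") or "").strip()
        if not url or url in seen:
            continue  # skip records with no URL and repeats of an earlier URL
        seen.add(url)
        cleaned = dict(article)
        cleaned["url"] = url
        # Collapse runs of whitespace inside the title.
        cleaned["title"] = " ".join((article.get("title") or "").split())
        cleaned.setdefault("category", "uncategorized")
        organized.append(cleaned)
    # Highest score first; a missing score sorts as 0.
    organized.sort(key=lambda a: a.get("score") or 0, reverse=True)
    return organized
```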

4. Summarize

Script: scripts/summarize.py
Purpose: generates a summary report, including hot-topic statistics and trend analysis
Output: data/summary/summary_<timestamp>.md
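
A summary report of this kind could be assembled roughly as follows. This is a sketch under assumptions, not the skill's summarize.py: "hot topics" is approximated by counting articles per category and listing the highest-scoring items, and the report layout and the name `build_summary` are invented for illustration.

```python
from collections import Counter


def build_summary(articles: list, top_n: int = 3) -> str:
    """Render a small Markdown report: per-category counts and top-scored articles."""
    counts = Counter(a.get("category", "uncategorized") for a in articles)
    top = sorted(articles, key=lambda a: a.get("score") or 0, reverse=True)[:top_n]

    lines = ["# Summary", "", f"Total articles: {len(articles)}", ""]
    lines += ["## Articles per category", ""]
    for category, count in counts.most_common():
        lines.append(f"- {category}: {count}")
    lines += ["", "## Top articles by score", ""]
    for a in top:
        lines.append(f"- [{a['title']}]({a['url']}) (score: {a.get('score', 0)})")
    return "\n".join(lines) + "\n"
```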

Quick Start

Install dependencies

cd .trae/skills/hn-crawler
pip install -r scripts/requirements.txt

Run the full pipeline

# Option 1: run each stage individually
python scripts/crawl.py
python scripts/extract.py
python scripts/organize.py
python scripts/summarize.py

# Option 2: run the whole pipeline in one step
python scripts/run_pipeline.py
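
For orientation, a one-step runner along these lines would chain the four stages with subprocess.run, which the capability analysis on this page says run_pipeline.py uses. The sketch below is an assumption about its shape, not the shipped file; the injectable `runner` parameter is a hypothetical addition that makes the ordering testable without executing anything.

```python
import subprocess
import sys
from typing import Callable

STAGES = ("crawl.py", "extract.py", "organize.py", "summarize.py")


def run_pipeline(scripts_dir: str = "scripts",
                 runner: Callable = subprocess.run) -> None:
    """Run each stage in order and stop at the first failure."""
    for script in STAGES:
        # A list of arguments (no shell=True) avoids shell injection issues.
        command = [sys.executable, f"{scripts_dir}/{script}"]
        print("Running:", " ".join(command))
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a later stage never runs on a failed stage's output.
        runner(command, check=True)
```

Calling `run_pipeline()` from the skill root would then be equivalent to Option 1 above, with an early stop if any stage fails.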

Directory Structure

.trae/skills/hn-crawler/
├── SKILL.md                    # This file
├── scripts/
│   ├── requirements.txt        # Python dependencies
│   ├── crawl.py               # Crawl script
│   ├── extract.py             # Extract script
│   ├── organize.py            # Organize script
│   ├── summarize.py           # Summarize script
│   └── run_pipeline.py        # Runs the full pipeline in one step
└── data/                      # Output directory (created automatically)
    ├── raw/                   # Raw HTML
    ├── extracted/             # Extracted JSON data
    ├── organized/             # Organized data
    └── summary/               # Summary reports

Data Format

Extracted article format (JSON)

{
  "articles": [
    {
      "title": "Article title",
      "url": "https://example.com/article",
      "summary": "Article summary",
      "published_at": "2024-01-15T10:30:00",
      "source": "hn.aimaker.dev",
      "category": "AI",
      "score": 150
    }
  ],
  "metadata": {
    "crawled_at": "2024-01-15T12:00:00",
    "total_count": 30
  }
}
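
A consumer of this file might load it into typed records as in the sketch below. The `Article` dataclass and `parse_extracted` are hypothetical names; the field list mirrors the sample above, and which fields are optional is an assumption.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Article:
    title: str
    url: str
    summary: str = ""
    published_at: Optional[str] = None  # ISO 8601 string, as in the sample
    source: str = ""
    category: str = ""
    score: int = 0


def parse_extracted(text: str) -> list:
    """Parse the extracted-articles JSON into Article records."""
    payload = json.loads(text)
    # Article(**item) raises TypeError on unexpected keys, which surfaces
    # format drift early instead of silently ignoring it.
    return [Article(**item) for item in payload["articles"]]
```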

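Configuration Options

The scripts accept the following settings as environment variables or command-line arguments:

TARGET_URL: target URL (default: https://hn.aimaker.dev/)
OUTPUT_DIR: output directory (default: data/)
TIMEOUT: request timeout (default: 30 seconds)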
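
Reading these three settings from the environment with the documented defaults could look like the sketch below. `Settings` and `load_settings` are hypothetical names, and the command-line flag names are not documented here, so only the environment-variable path is shown.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    target_url: str
    output_dir: str
    timeout: float  # seconds


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read TARGET_URL, OUTPUT_DIR, and TIMEOUT, falling back to the documented defaults."""
    return Settings(
        target_url=env.get("TARGET_URL", "https://hn.aimaker.dev/"),
        output_dir=env.get("OUTPUT_DIR", "data/"),
        timeout=float(env.get("TIMEOUT", "30")),
    )
```

For example, running `TIMEOUT=10 python scripts/crawl.py` would shorten the timeout for a single run, assuming the script reads the variable this way.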

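Notes

Respect the site's robots.txt and crawling policy.
Set a reasonable interval between requests to avoid putting load on the server.
Crawled data is for personal study and research only.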
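
The first two notes can be enforced in code. The sketch below is one way to do it with the standard library's urllib.robotparser; it is not part of the skill, and `PoliteFetchGate`, the user-agent string, and the 2-second default interval are illustrative assumptions.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class PoliteFetchGate:
    """Check robots.txt rules and enforce a minimum delay between requests."""

    def __init__(self, rules: RobotFileParser, user_agent: str,
                 min_interval: float = 2.0) -> None:
        self.rules = rules
        self.user_agent = user_agent
        self.min_interval = min_interval
        self._last_request = None

    def wait_for(self, url: str) -> bool:
        """Return False if the URL is disallowed; otherwise sleep if needed and return True."""
        if not self.rules.can_fetch(self.user_agent, url):
            return False
        if self._last_request is not None:
            remaining = self.min_interval - (time.monotonic() - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
        self._last_request = time.monotonic()
        return True
```

A crawler would call `wait_for(url)` before each request and skip the URL when it returns False.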

Usage Guidance

This skill appears internally consistent for crawling and processing hn.aimaker.dev content. Before installing or running:

1) Inspect the code locally (you already have the files); there are syntax/typing bugs (e.g., in organize.py) that must be fixed for the pipeline to run.
2) Follow robots.txt and rate-limit requests to avoid abusive crawling.
3) When running pip install -r requirements.txt, review which packages and versions will be installed (PyPI packages are common but carry supply-chain risk).
4) Run the skill in a sandbox or non-critical environment first (it writes files to data/).
5) If you need higher assurance, request the full, untruncated source for final review or ask the author to provide a fixed release with tests and an explicit provenance/homepage.

Capability Analysis

Type: OpenClaw Skill
Name: hn-crawler-cn
Version: 1.0.0

The skill bundle implements a standard web crawling and data processing pipeline for the website hn.aimaker.dev. The scripts (crawl.py, extract.py, organize.py, summarize.py) use well-known libraries like requests and BeautifulSoup to fetch and parse content, and run_pipeline.py orchestrates the workflow using subprocess.run in a safe manner. There is no evidence of data exfiltration, malicious execution, or prompt injection intended to subvert the agent's behavior.

Capability Assessment

✓ Purpose & Capability

Name/description match the provided scripts and SKILL.md. The package contains crawl/extract/organize/summarize scripts and a run_pipeline orchestrator which all operate on the stated site (default TARGET_URL is https://hn.aimaker.dev/). There are no unrelated required binaries or environment variables.

ℹ Instruction Scope

SKILL.md and the scripts limit actions to HTTP GET requests to the target site, parsing HTML, local file read/write under data/, and generating summaries. Declared environment variables (TARGET_URL, OUTPUT_DIR, TIMEOUT) are used. The code does not reference other system credentials, config paths, or external endpoints beyond normal HTTP requests to the target URL. Note: some source files (organize.py) contain syntax/typing errors that will prevent successful execution until fixed; this is a functionality issue rather than a security misdirection.

ℹ Install Mechanism

There is no automated install spec; SKILL.md instructs the user to run pip install -r requirements.txt. Installing packages from PyPI is normal but carries the usual supply-chain risk (verify package versions and trust). No downloads from arbitrary URLs or archive extraction steps are present in the skill itself.

✓ Credentials

The skill does not request credentials or secrets. The only environment variables used (TARGET_URL, OUTPUT_DIR, TIMEOUT) are proportional and documented. Scripts operate on local output directories and do not exfiltrate data to unlisted remote endpoints.

✓ Persistence & Privilege

The skill is not marked always:true and does not attempt to modify other skills or system-level agent configuration. It does not request permanent presence or elevated privileges.

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install hn-crawler-cn
After installation, invoke the skill by name or use /hn-crawler-cn
Provide required inputs per the skill's parameter spec and get structured output

Version History

v1.0.0

Initial release of hn-crawler skill.
- Implements a complete pipeline to crawl, extract, organize, and summarize news from https://hn.aimaker.dev/.
- Provides Python scripts for each processing stage and a one-click pipeline runner.
- Outputs structured JSON, cleaned/organized data, and markdown summary reports.
- Configurable via environment variables and CLI arguments for target URL, output directory, and timeout.
- Includes detailed documentation on workflow, data format, and usage.

Metadata

Slug hn-crawler-cn

Version 1.0.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is hn-crawler?

hn-crawler crawls news from https://hn.aimaker.dev/ and runs the full crawl -> extract -> organize -> summarize pipeline. Invoke it when the user wants to crawl news from hn.aimaker.dev or process web content through the full pipeline. It is an AI Agent Skill for Claude Code / OpenClaw, with 110 downloads so far.

How do I install hn-crawler?

Run "/install hn-crawler-cn" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is hn-crawler free?

Yes, hn-crawler is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does hn-crawler support?

hn-crawler is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created hn-crawler?

It is built and maintained by proanimer (@drowning-in-codes); the current version is v1.0.0.
