
Daily News Search and Smart Digest

Author: zigu-creator · GitHub · v1.0.0 · MIT-0
cross-platform ⚠ suspicious
Total downloads: 32 · Favorites: 1 · Current installs: 0 · Versions: 1
Install in OpenClaw
/install news-digest-v1
Description
Automatically scrape, process, and generate daily news digests from 42 Chinese news sources. Covers industry dynamics, policy updates, economy, tech, energy,...
Usage Instructions (SKILL.md)

News Digest - Daily News Summary

Automated 3-stage pipeline for Chinese news aggregation and digest generation.

Quick Start

python scripts/news_digest_v2/run_all_stages.py

Output: .news-digest-out.md (workspace) + 新闻摘要_YYYYMMDD_HHMMSS.txt (desktop)

Architecture

Stage 1:   Fetch     →  Scrape 42 websites → Filter → Save to SQLite DB
Stage 2:   Process   →  Deduplicate (≥90% similarity) → Tag keywords
Stage 2.5: LLM       →  Batch LLM summarization (optional, requires API key)
Stage 3:   Output    →  Read LLM summaries (fallback to rule summaries) → Save to files
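The Stage 3 fallback can be pictured as a small selection step. The sketch below is an illustration only: `choose_summary` is a hypothetical helper name, and it assumes that a missing or blank LLM summary means Stage 2.5 was skipped or failed for that article.

```python
from typing import Optional


def choose_summary(llm_summary: Optional[str], rule_summary: str) -> str:
    """Prefer the Stage 2.5 LLM summary; fall back to the rule-based summary."""
    # Assumption: None or whitespace-only means no usable LLM summary exists.
    if llm_summary and llm_summary.strip():
        return llm_summary.strip()
    return rule_summary.strip()
```

With this shape, Stage 3 never needs to know whether an API key was configured; it only checks what is stored for each article.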

Setup

Prerequisites

  • Python 3.8+ with: requests, beautifulsoup4
  • SQLite (built-in)

Initialize Database

The database stores articles and configuration. Default path: news_digest_v2/news.db (relative to scripts directory).

Override with environment variable: NEWS_DIGEST_DB=/your/path/news.db

Then seed the database with monitored websites and system keywords using SQL insertion into monitor_websites and system_keywords tables.
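As a rough sketch of the two steps above, the following resolves the database path and seeds both tables. The function names `resolve_db_path` and `seed` are hypothetical; the column names follow the INSERT statements shown later in this document, and the idempotent behavior assumes the tables already exist with UNIQUE constraints on `name` and `keyword`.

```python
import os
import sqlite3


def resolve_db_path(default: str = "news_digest_v2/news.db") -> str:
    """Return NEWS_DIGEST_DB when set and non-empty, else the documented default."""
    return os.environ.get("NEWS_DIGEST_DB") or default


def seed(conn: sqlite3.Connection, websites, keywords) -> None:
    """Insert (name, url, selector, category) and (keyword, category, weight) rows."""
    # INSERT OR IGNORE skips rows that would violate a UNIQUE constraint,
    # so running the seed twice does not create duplicates.
    conn.executemany(
        "INSERT OR IGNORE INTO monitor_websites "
        "(name, url, selector, category, enabled) VALUES (?, ?, ?, ?, 1)",
        websites,
    )
    conn.executemany(
        "INSERT OR IGNORE INTO system_keywords "
        "(keyword, category, weight, enabled) VALUES (?, ?, ?, 1)",
        keywords,
    )
    conn.commit()
```

A typical call would be `seed(sqlite3.connect(resolve_db_path()), [("示例网站", "https://example.com", "a", "财经")], [("新能源", "core", 5)])`.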

Core Database Tables

| Table | Purpose |
| --- | --- |
| articles | Scraped news articles (title, content, URL, date, keywords, duplicate flag) |
| monitor_websites | 42 monitored websites (name, URL, CSS selector, category, enabled) |
| system_keywords | Keywords for relevance scoring (core vs auxiliary, with weight) |

Usage

Full Pipeline

python scripts/news_digest_v2/run_all_stages.py

Takes ~5 minutes (network-bound, 42 websites).

Entry Point (PowerShell wrapper)

For OpenClaw or automated integration, create a wrapper script that:

  1. Runs the pipeline
  2. Reads the output file
  3. Sends to your preferred messaging platform
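The three steps above could look like the sketch below. It is written in Python rather than PowerShell so that it matches the rest of the project; `run_and_deliver` and the `send` callable are hypothetical, and `send` stands in for whatever messaging integration you use.

```python
import subprocess
import sys
from pathlib import Path
from typing import Callable, Sequence


def run_and_deliver(
    command: Sequence[str],
    output_file: Path,
    send: Callable[[str], None],
    timeout: int = 600,
) -> bool:
    """Run the pipeline, read its output file, and pass the text to `send`."""
    # Step 1: run the pipeline; subprocess.run raises TimeoutExpired on timeout.
    result = subprocess.run(list(command), timeout=timeout)
    if result.returncode != 0 or not output_file.exists():
        return False
    # Step 2: read the digest. Step 3: hand it to the messaging callable.
    send(output_file.read_text(encoding="utf-8"))
    return True
```

For example, `run_and_deliver([sys.executable, "scripts/news_digest_v2/run_all_stages.py"], Path(".news-digest-out.md"), print)` would print the digest instead of sending it, which is a convenient dry run.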

Cron Job Example

schedule: "0 20 * * *"  # Daily 20:00
payload:
  run: python scripts/news_digest_v2/run_all_stages.py
  then: read .news-digest-out.md and send to messaging
timeout: 600  # 10 minutes

Output Format

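【Source:Title】
Summary text (smart paragraph selection, at most 300 characters, including key data and core facts)
发布时间:YYYY-MM-DD (publication date)
原文链接:http://... (original link)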
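A minimal sketch of rendering one entry in the layout above. `format_entry` is a hypothetical name; the Chinese labels and full-width colons are copied from the template, and the 300-character limit is assumed to be enforced earlier in the pipeline.

```python
def format_entry(source: str, title: str, summary: str, publish_date: str, url: str) -> str:
    """Render one digest entry in the documented four-line layout."""
    return (
        f"【{source}:{title}】\n"
        f"{summary}\n"
        f"发布时间:{publish_date}\n"  # publication date label
        f"原文链接:{url}"  # original link label
    )
```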

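Summary Quality Assurance

Automatic Filtering of Incomplete Sentences

  • Summary ends with a comma, enumeration comma (、), semicolon, colon, or similar → fall back to the previous full stop (。)
  • No full stop anywhere in the text (the whole passage is fragmentary) → discard, no output
  • Truncation loses more than 40% of the information → drop the whole passage; better nothing than a poor summary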
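The three rules above can be sketched as one function. This is not the code in formatter.py: `trim_incomplete` is a hypothetical name, information loss is approximated by character count, and only 。 is treated as a sentence terminator.

```python
from typing import Optional

DANGLING_ENDINGS = tuple(",、;:,;:")  # comma, enumeration comma, semicolon, colon
SENTENCE_END = "。"
MAX_LOSS_RATIO = 0.40


def trim_incomplete(summary: str) -> Optional[str]:
    """Return a cleaned summary, or None when it should be discarded."""
    text = summary.strip()
    if SENTENCE_END not in text:
        return None  # no full stop anywhere: fragmentary, drop it
    if not text.endswith(DANGLING_ENDINGS):
        return text  # already ends cleanly
    trimmed = text[: text.rfind(SENTENCE_END) + 1]
    # Assumption: "information loss" is the share of characters cut off.
    if 1 - len(trimmed) / len(text) > MAX_LOSS_RATIO:
        return None
    return trimmed
```

Returning `None` rather than an empty string keeps "discard" distinct from "empty but valid", so callers can skip the article explicitly.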

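All Tutorial/Guide Content Is Filtered

  • Title or content contains "教程" (tutorial), "指南" (guide), "攻略" (walkthrough), "手把手" (hands-on), "从零开始" (from scratch), etc. → automatically excluded
  • Scientific illustration / PS tutorials / Illustrator tutorials → automatically excluded
  • See the tutorial keyword list under the social category in rules_config.py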
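A keyword filter of this kind can be as simple as a substring check. The sketch below uses only the five keywords quoted above; the real list lives in rules_config.py, and `is_tutorial` is a hypothetical name.

```python
# Subset of the tutorial keywords quoted in this document (not the full list).
TUTORIAL_KEYWORDS = ("教程", "指南", "攻略", "手把手", "从零开始")


def is_tutorial(title: str, content: str) -> bool:
    """Return True when the title or content contains any tutorial keyword."""
    haystack = f"{title}\n{content}"
    return any(keyword in haystack for keyword in TUTORIAL_KEYWORDS)
```

Plain substring matching is deliberately broad, so a policy article whose title contains 指南 would also be excluded under this sketch.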

Key Features

Smart Summary Extraction (fetcher.py → extract_brief_summary)

Not simple truncation. Each paragraph is scored by:

  • Position: Lead paragraph +10, top-3 +5 (inverted pyramid journalism)
  • Data density: Numbers × 1.5
  • Signal words: 印发/发布/宣布/决定/完成/启动 (+2 each)
  • Entity density: Organizations, locations (+1 each)
  • Completeness: Full sentence ending +3

Then filtered: removes image captions, journalist bylines, ads, subtitles, boilerplate.

Truncation guard: if truncating loses more than 40% of the information, the whole paragraph is dropped.
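To make the scoring weights concrete, here is a hedged sketch of a per-paragraph score. It is not the body of `extract_brief_summary`. It assumes the lead and top-3 bonuses are mutually exclusive, that "+2 each" counts every occurrence of a signal word, and that entities arrive as a caller-supplied list, since the source does not say how they are detected.

```python
import re
from typing import Sequence

SIGNAL_WORDS = ("印发", "发布", "宣布", "决定", "完成", "启动")
SENTENCE_ENDINGS = ("。", "!", "?")


def score_paragraph(text: str, index: int, entities: Sequence[str] = ()) -> float:
    """Score one paragraph; `index` is its zero-based position in the article."""
    score = 0.0
    if index == 0:
        score += 10  # lead paragraph
    elif index < 3:
        score += 5  # remaining top-3 paragraphs
    score += 1.5 * len(re.findall(r"\d+(?:\.\d+)?", text))  # data density
    score += 2 * sum(text.count(word) for word in SIGNAL_WORDS)
    score += sum(1 for entity in entities if entity in text)
    if text.rstrip().endswith(SENTENCE_ENDINGS):
        score += 3  # completeness bonus
    return score
```

The highest-scoring paragraph would then be the candidate summary, before the caption, byline, and boilerplate filters are applied.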

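Summary Post-processing (formatter.py → clean_summary)

  • Dateline and reporter byline cleanup (precompiled regexes; supports 新华社 (Xinhua), 中新网 (China News Service website), 财联社 (Cailian Press), etc.)
  • Incomplete sentence filtering: ends with a comma, enumeration comma, or semicolon → fall back to the previous full stop
  • No full stop in the text → discard (no fragmentary output)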

Filtering Rules (rules_config.py)

Excluded topics: entertainment, social news, violence, crime cases, health/wellness, education, automotive consumer news, science popularization (科普类), animal/archaeology news.

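Tutorial content (all filtered): 教程 (tutorial), 指南 (guide), 攻略 (walkthrough), 入门 (getting started), 自学 (self-study), 从零开始 (from scratch), 手把手 (hands-on), 保姆级教程 (step-by-step tutorial), 怎么做 (how to do it), 如何使用 (how to use), 操作步骤 (operating steps), 图文教程 (illustrated tutorial), 视频教程 (video tutorial), 科研绘图 (scientific illustration), PS教程 (PS tutorial), Illustrator, AI教程 (AI tutorial), 钢笔工具 (pen tool), 高斯模糊 (Gaussian blur), 路径查找器 (Pathfinder), etc.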

Invalid keywords: clickbait patterns, advertising, webpage navigation elements.

See scripts/news_digest_v2/rules_config.py for full lists.

Deduplication (similarity.py)

  • Jaccard similarity on keyword sets
  • Threshold: ≥90% → mark as duplicate
  • Only one version appears in output
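Jaccard similarity is the size of the intersection divided by the size of the union of two sets. The sketch below applies it to keyword sets with the 0.90 threshold; `mark_duplicates` is a hypothetical helper, and treating two empty sets as dissimilar is an assumption so that articles without keywords are not merged.

```python
from typing import Iterable, List, Sequence


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Return |A ∩ B| / |A ∪ B| for two keyword collections."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # assumption: no keywords means no evidence of duplication
    return len(set_a & set_b) / len(set_a | set_b)


def mark_duplicates(keyword_sets: Sequence[Iterable[str]], threshold: float = 0.90) -> List[bool]:
    """Flag each article that is >= threshold similar to an earlier kept one."""
    kept_indexes: List[int] = []
    flags: List[bool] = []
    for index, keywords in enumerate(keyword_sets):
        is_dup = any(jaccard(keywords, keyword_sets[k]) >= threshold for k in kept_indexes)
        flags.append(is_dup)
        if not is_dup:
            kept_indexes.append(index)
    return flags
```

Comparing only against kept articles means the first version seen is the one that survives into the output.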

Date Filtering

  • Normal: within 3 days
  • Holidays: within 7 days
  • No date → discard
  • URLs containing a year more than one year in the past → skip
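The date rules above might be expressed as follows. Both function names are hypothetical, "within N days" is read as inclusive, future dates are rejected, and the URL check assumes the year appears as a standalone `20YY` or `20YYMMDD` path segment.

```python
import re
from datetime import date
from typing import Optional

YEAR_IN_URL = re.compile(r"(?<!\d)(20\d{2})(?:[01]\d[0-3]\d)?(?!\d)")


def within_window(published: Optional[date], today: date, is_holiday: bool = False) -> bool:
    """Keep articles from the last 3 days, or the last 7 days during holidays."""
    if published is None:
        return False  # no date: discard
    window_days = 7 if is_holiday else 3
    age_days = (today - published).days
    return 0 <= age_days <= window_days


def url_is_stale(url: str, today: date) -> bool:
    """Return True when the newest year found in the URL is older than last year."""
    years = [int(year) for year in YEAR_IN_URL.findall(url)]
    return bool(years) and max(years) < today.year - 1
```

Checking the URL first is cheap and lets the fetcher skip a download for pages that would fail the date filter anyway.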

Configuration

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| NEWS_DIGEST_DB | news_digest_v2/news.db | SQLite database path |
| NEWS_DIGEST_LLM_API_KEY | (empty) | LLM API key for Stage 2.5 summarization |
| NEWS_DIGEST_LLM_BASE_URL | (empty) | LLM API base URL |
| NEWS_DIGEST_LLM_MODEL | qwen-plus | LLM model name |

If LLM env vars are not set, Stage 2.5 is silently skipped and rule-based summaries are used instead.
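One way to implement the skip decision is to load the LLM settings into a single optional value. This sketch assumes that both the API key and the base URL must be present for Stage 2.5 to run; `load_llm_config` is a hypothetical name, not a function from config.py.

```python
import os
from typing import Dict, Mapping, Optional


def load_llm_config(env: Mapping[str, str] = os.environ) -> Optional[Dict[str, str]]:
    """Return LLM settings, or None when Stage 2.5 should be skipped."""
    api_key = env.get("NEWS_DIGEST_LLM_API_KEY", "").strip()
    base_url = env.get("NEWS_DIGEST_LLM_BASE_URL", "").strip()
    if not api_key or not base_url:
        return None  # assumption: both values are required
    model = env.get("NEWS_DIGEST_LLM_MODEL", "").strip() or "qwen-plus"
    return {"api_key": api_key, "base_url": base_url, "model": model}
```

Passing `env` as a parameter makes the function easy to test with a plain dictionary instead of the real process environment.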

Add/Remove Websites

Edit monitor_websites table:

INSERT INTO monitor_websites (name, url, selector, category, enabled)
VALUES ('示例网站', 'https://example.com', 'a', '财经', 1);

Customize Keywords

Edit system_keywords table:

INSERT INTO system_keywords (keyword, category, weight, enabled)
VALUES ('新能源', 'core', 5, 1);

Adjust Output

In config.py:

  • MAX_OUTPUT_COUNT = 35 (max articles per digest)
  • SIMILARITY_THRESHOLD = 0.90

Files

news-digest/
├── SKILL.md
└── scripts/
    └── news_digest_v2/
        ├── __init__.py
        ├── config.py              # DB path, websites, keywords, holidays, LLM config
        ├── database.py            # SQLite operations
        ├── fetcher.py             # Web scraping + smart summary extraction
        ├── filters.py             # Content filtering logic
        ├── formatter.py           # Output formatting + incomplete sentence handling
        ├── rules_config.py        # Exclusion rules, keywords, dateline patterns
        ├── similarity.py          # Jaccard deduplication
        ├── stage1_fetch.py        # Stage 1 entry (fetch)
        ├── stage2_process.py      # Stage 2 entry (dedup + keywords)
        ├── stage2_5_llm_summary.py # Stage 2.5 (LLM batch summarization)
        ├── stage3_output.py       # Stage 3 entry (read + format + save)
        └── run_all_stages.py      # Full pipeline entry

Performance Notes

  • ~5 minutes for full 42-website scrape (network I/O bound)
  • Some sites may fail (SSL issues, 521 errors, 404s) — pipeline continues
  • Recommended cron timeout: 600 seconds
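  • The database is append-only and is never cleared. New articles are inserted with URL-based deduplication (INSERT OR IGNORE); old articles are kept.
  • Duplicate articles are flagged with is_duplicate = 1, not deleted.
  • The database grows by roughly 30-50 rows per day; periodic cleanup is recommended (optional).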
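For the optional cleanup, a minimal sketch is shown below. The skill does not document a retention policy, so `prune_articles` and the 90-day default are assumptions, and the comparison only works if `publish_date` is stored as ISO `YYYY-MM-DD` text.

```python
import sqlite3
from datetime import date, timedelta


def prune_articles(conn: sqlite3.Connection, today: date, keep_days: int = 90) -> int:
    """Delete articles published more than `keep_days` ago; return rows removed."""
    cutoff = (today - timedelta(days=keep_days)).isoformat()
    # ISO dates sort lexicographically, so a plain text comparison is enough.
    cursor = conn.execute("DELETE FROM articles WHERE publish_date < ?", (cutoff,))
    conn.commit()
    return cursor.rowcount
```

Back up the database file before the first run, since deleted rows can no longer take part in URL-based deduplication.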
Safe Usage Recommendations
Review the HTTPS verification fallback before relying on the digest, especially on untrusted networks. If you enable LLM summarization, provide only a trusted HTTPS base URL and a limited API key. Set up cron or messaging delivery only if you are comfortable with the digest being generated or sent automatically.
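One way to act on the base URL advice is to validate the value before any request is made. This is a sketch, not part of the skill: `is_trusted_base_url` and the `allowed_hosts` allowlist are hypothetical.

```python
from typing import Collection
from urllib.parse import urlparse


def is_trusted_base_url(url: str, allowed_hosts: Collection[str]) -> bool:
    """Accept only HTTPS URLs whose host is on an explicit allowlist."""
    parsed = urlparse(url)
    # urlparse lowercases the hostname; it is None when the URL has no host.
    return parsed.scheme == "https" and parsed.hostname in allowed_hosts
```

For example, `is_trusted_base_url(os.environ["NEWS_DIGEST_LLM_BASE_URL"], {"api.example.com"})` could gate Stage 2.5, with `api.example.com` replaced by your provider's host.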
Functional Analysis
Type: OpenClaw Skill
Name: news-digest-v1
Version: 1.0.0

The skill bundle implements a news aggregation pipeline with several high-risk behaviors that, while functional, introduce security vulnerabilities. Specifically, 'fetcher.py' explicitly disables SSL certificate verification ('verify=False') as a fallback mechanism, exposing the agent to Man-in-the-Middle (MITM) attacks. Additionally, 'stage3_output.py' and 'formatter.py' perform file system writes to the user's Desktop directory, which is an intrusive operation. The 'stage2_5_llm_summary.py' script also handles sensitive API keys from environment variables and transmits data to a configurable external endpoint without validation, posing a risk of credential exposure.
Capability Tags
requires-sensitive-credentials
Capability Assessment
Purpose & Capability
The included code is broadly coherent with the stated purpose: scraping Chinese news sites, filtering/deduplicating articles, optionally summarizing with an LLM, and writing digest files.
Instruction Scope
The LLM summarization stage places scraped article text directly into the model prompt; this is expected for summarization, but untrusted webpage text could influence the generated summaries.
Install Mechanism
There is no install spec or remote installer shown. The skill documents Python prerequisites but does not show automatic package installation or remote script execution.
Credentials
Network scraping is expected, but the HTTPS fallback with certificate verification disabled is an unsafe implementation choice; the optional LLM stage also uses a bearer API key that users should protect.
Persistence & Privilege
The artifacts do not install a background service or scheduler. The cron and messaging workflow are presented as user-configured examples rather than automatic persistence.
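How to Use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install news-digest-v1
  3. After installation, invoke the Skill by name or trigger it with /news-digest-v1
  4. Provide the inputs the Skill's parameter description asks for to get structured output
Version History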
v1.0.0
## v1.0.0 - Initial Release

Automatically scrapes news websites from custom sources and generates a daily news digest. Covers the domains you specify.

### Post-install Configuration (Required)

**1. Install dependencies**

```bash
pip install requests beautifulsoup4
```

**2. Initialize the database**

```bash
# Create the database and table schema
python -c "
import sqlite3
db = sqlite3.connect('news.db')
db.execute('''CREATE TABLE IF NOT EXISTS articles (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    source TEXT NOT NULL,
    publish_date TEXT NOT NULL,
    summary TEXT,
    content TEXT,
    url TEXT UNIQUE NOT NULL,
    keywords TEXT,
    is_duplicate INTEGER DEFAULT 0,
    similarity_score REAL DEFAULT 0,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)''')
db.execute('''CREATE TABLE IF NOT EXISTS monitor_websites (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE NOT NULL,
    url TEXT NOT NULL,
    selector TEXT DEFAULT 'a',
    category TEXT,
    priority INTEGER DEFAULT 3,
    enabled INTEGER DEFAULT 1
)''')
db.execute('''CREATE TABLE IF NOT EXISTS system_keywords (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    keyword TEXT UNIQUE NOT NULL,
    category TEXT,
    weight INTEGER DEFAULT 1,
    enabled INTEGER DEFAULT 1
)''')
db.execute('CREATE INDEX IF NOT EXISTS idx_publish_date ON articles(publish_date)')
db.execute('CREATE INDEX IF NOT EXISTS idx_keywords ON articles(keywords)')
db.commit()
db.close()
print('OK: news.db created')
"
```

**3. Add monitored websites (example)**

```bash
python -c "
import sqlite3
db = sqlite3.connect('news.db')
sites = [
    ('中国经济网', 'http://www.ce.cn/', 'a', '财经', 1),
    # Add the websites you want to monitor...
]
for name, url, sel, cat, pri in sites:
    db.execute('INSERT OR IGNORE INTO monitor_websites (name, url, selector, category, priority) VALUES (?,?,?,?,?)', (name, url, sel, cat, pri))
db.commit()
db.close()
print(f'OK: added {len(sites)} websites')
"
```

**4. Add keywords (example)**

```bash
python -c "
import sqlite3
db = sqlite3.connect('news.db')
kws = [
    ('市场', 'auxiliary', 2),
    ('企业', 'auxiliary', 2),
]
for kw, cat, w in kws:
    db.execute('INSERT OR IGNORE INTO system_keywords (keyword, category, weight) VALUES (?,?,?)', (kw, cat, w))
db.commit()
db.close()
print(f'OK: added {len(kws)} keywords')
"
```

**5. Run**

```bash
python scripts/news_digest_v2/run_all_stages.py
```

### Optional Configuration

| Environment variable | Default | Description |
| --- | --- | --- |
| NEWS_DIGEST_DB | news.db | Database path |
| NEWS_DIGEST_LLM_API_KEY | (empty) | LLM API key (optional; enables smart summarization) |
| NEWS_DIGEST_LLM_BASE_URL | (empty) | LLM API base URL |
| NEWS_DIGEST_LLM_MODEL | qwen-plus | LLM model name |

### Features

  • 🕷️ Smart website scraping (supports per-site parsers)
  • 🧹 Filtering of 6 major categories of invalid content (entertainment, social news, popular science, tutorials, etc.)
  • 🔄 Automatic deduplication at ≥90% similarity
  • ✍️ Smart summary extraction (not simple truncation; scored by information density)
  • 🤖 Optional LLM batch summarization (produces more professional summaries)
  • 📄 Compact output format (source + title + summary + link)
Metadata
Slug: news-digest-v1
Version: 1.0.0
License: MIT-0
Total installs: 0
Current installs: 0
Historical versions: 1
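FAQ

What is Daily News Search and Smart Digest (每日新闻搜索与智能摘要)?

Automatically scrape, process, and generate daily news digests from 42 Chinese news sources. Covers industry dynamics, policy updates, economy, tech, energy,... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 32 total downloads so far.

How do I install Daily News Search and Smart Digest?

Run the command "/install news-digest-v1" in the OpenClaw or Claude Code chat box to install it in one step. The v1.0.0 release notes then list required post-install configuration (dependencies and database initialization).

Is Daily News Search and Smart Digest free?

Yes. It is completely free, released under the MIT-0 license, and can be freely downloaded, installed, and used.

Which platforms does Daily News Search and Smart Digest support?

It is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Daily News Search and Smart Digest?

It is developed and maintained by zigu-creator (@zigu-creator); the current version is v1.0.0.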
