Description

Automatically scrape, process, and generate daily news digests from 42 Chinese news sources. Covers industry dynamics, policy updates, economy, tech, energy,...

Usage Instructions (SKILL.md)

News Digest - Daily News Summary

Name: 每日新闻搜索与智能摘要 (Daily News Search and Smart Digest)
Author: zigu-creator

Automated 3-stage pipeline for Chinese news aggregation and digest generation.

Quick Start

python scripts/news_digest_v2/run_all_stages.py

Output: .news-digest-out.md (workspace) + 新闻摘要_YYYYMMDD_HHMMSS.txt (desktop)

Architecture

Stage 1:   Fetch     →  Scrape 42 websites → Filter → Save to SQLite DB
Stage 2:   Process   →  Deduplicate (≥90% similarity) → Tag keywords
Stage 2.5: LLM       →  Batch LLM summarization (optional, requires API key)
Stage 3:   Output    →  Read LLM summaries (fallback to rule summaries) → Save to files

Setup

Prerequisites

Python 3.8+ with: requests, beautifulsoup4
SQLite (built-in)

Initialize Database

The database stores articles and configuration. Default path: news_digest_v2/news.db (relative to scripts directory).

Override with environment variable: NEWS_DIGEST_DB=/your/path/news.db

Then seed the database with the websites to monitor and the system keywords by inserting rows into the monitor_websites and system_keywords tables.

Core Database Tables

Table	Purpose
`articles`	Scraped news articles (title, content, URL, date, keywords, duplicate flag)
`monitor_websites`	42 monitored websites (name, URL, CSS selector, category, enabled)
`system_keywords`	Keywords for relevance scoring (core vs auxiliary, with weight)

Usage

Full Pipeline

python scripts/news_digest_v2/run_all_stages.py

Takes ~5 minutes (network-bound, 42 websites).

Entry Point (PowerShell wrapper)

For OpenClaw or automated integration, create a wrapper script that:

Runs the pipeline
Reads the output file
Sends to your preferred messaging platform
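
The three wrapper steps above can be sketched in Python as follows. This is a minimal illustration rather than the skill's own wrapper: `run_pipeline` and `send_digest` are hypothetical names, and `send_digest` only prints, standing in for whatever messaging client you use. The 600-second timeout mirrors the cron recommendation in this document.

```python
import subprocess
import sys
from pathlib import Path

# Paths taken from this document; adjust if your workspace differs.
PIPELINE = "scripts/news_digest_v2/run_all_stages.py"
OUTPUT_FILE = Path(".news-digest-out.md")


def run_pipeline(cmd, output_file, timeout=600):
    """Run cmd, then return the digest text, or None if the run failed."""
    try:
        result = subprocess.run(cmd, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0 or not output_file.exists():
        return None
    return output_file.read_text(encoding="utf-8")


def send_digest(text):
    """Hypothetical delivery hook: replace with your messaging platform's client."""
    print(text)


# Typical call (not executed here):
#   digest = run_pipeline([sys.executable, PIPELINE], OUTPUT_FILE)
#   if digest is not None:
#       send_digest(digest)
```

Returning None instead of raising keeps the caller's decision simple: nothing is sent when the pipeline times out, exits non-zero, or leaves no output file.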

Cron Job Example

schedule: "0 20 * * *"  # Daily 20:00
payload:
  run: python scripts/news_digest_v2/run_all_stages.py
  then: read .news-digest-out.md and send to messaging
timeout: 600  # 10 minutes

Output Format

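【来源：标题】
摘要内容（智能选段，300字以内，包含关键数据和核心事实）
发布时间：YYYY-MM-DD
原文链接：http://...

The labels are emitted in Chinese: 【来源：标题】 is [Source: Title], 摘要内容 is the summary body (selected paragraphs, at most 300 characters, including key data and core facts), 发布时间 is the publication date, and 原文链接 is the link to the original article.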

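Summary Quality Assurance

Incomplete sentences are filtered automatically:

Summary ends with a comma, enumeration comma (、), semicolon, colon, or similar → fall back and truncate at the previous full stop (。)
No full stop anywhere in the text (the whole passage is a fragment) → discard it and output nothing
Truncation would lose more than 40% of the content → drop the whole passage; no summary is preferred over a poor one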

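All tutorial/guide content is filtered:

Title or content contains "教程" (tutorial), "指南" (guide), "攻略" (walkthrough), "手把手" (hands-on), "从零开始" (from scratch), etc. → automatically excluded
Scientific figure drawing / PS tutorials / Illustrator tutorials → automatically excluded
See the tutorial keyword list under the social category in rules_config.py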
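
A keyword filter of this kind can be sketched as below. The function name `is_tutorial` is hypothetical and the tuple holds only the five keywords quoted above; the real list lives in rules_config.py and is not reproduced here.

```python
# Illustrative subset of the tutorial keywords named in this document;
# the full list is in rules_config.py.
TUTORIAL_KEYWORDS = ("教程", "指南", "攻略", "手把手", "从零开始")


def is_tutorial(title, content, keywords=TUTORIAL_KEYWORDS):
    """Return True if the title or content contains any tutorial keyword."""
    text = f"{title}\n{content}"
    return any(kw in text for kw in keywords)
```

Plain substring matching is used because the rule is described as "contains"; it will also exclude articles that merely mention one of these words.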

Key Features

Smart Summary Extraction (fetcher.py → extract_brief_summary)

Not simple truncation. Each paragraph is scored by:

Position: Lead paragraph +10, top-3 +5 (inverted pyramid journalism)
Data density: Numbers × 1.5
Signal words: 印发 (issue) / 发布 (release) / 宣布 (announce) / 决定 (decide) / 完成 (complete) / 启动 (launch) (+2 each)
Entity density: Organizations, locations (+1 each)
Completeness: Full sentence ending +3

Then filtered: removes image captions, journalist bylines, ads, subtitles, boilerplate.

Truncation protection: if truncation would lose more than 40% of the content → drop the whole passage.
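
The scoring weights listed above could be combined roughly as follows. This is a sketch under several assumptions, not the body of extract_brief_summary: the lead bonus and the top-3 bonus are treated as mutually exclusive, signal words are counted per occurrence, numbers are matched with a simple regex, and entities are supplied by the caller because the document does not say how organizations and locations are detected.

```python
import re

SIGNAL_WORDS = ("印发", "发布", "宣布", "决定", "完成", "启动")
SENTENCE_ENDINGS = ("。", "！", "？")  # assumed set of full-sentence endings


def score_paragraph(text, index, entities=()):
    """Score one paragraph; index is its 0-based position in the article."""
    score = 0.0
    # Position: lead paragraph +10, otherwise top-3 +5 (assumed exclusive).
    if index == 0:
        score += 10
    elif index < 3:
        score += 5
    # Data density: each number contributes 1.5.
    score += 1.5 * len(re.findall(r"\d+(?:\.\d+)?", text))
    # Signal words: +2 per occurrence (assumption: not per distinct word).
    score += 2 * sum(text.count(word) for word in SIGNAL_WORDS)
    # Entity density: +1 per caller-supplied entity found in the text.
    score += sum(1 for entity in entities if entity in text)
    # Completeness: +3 when the paragraph ends on a full sentence.
    if text.rstrip().endswith(SENTENCE_ENDINGS):
        score += 3
    return score
```

For example, a lead paragraph containing one number, one signal word, one known entity, and a closing full stop scores 10 + 1.5 + 2 + 1 + 3 = 17.5.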

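Summary Post-Processing (formatter.py → clean_summary)

Dateline and journalist byline cleanup (precompiled regexes; supports 新华社 (Xinhua), 中新网 (China News Service), 财联社 (Cailian Press), etc.)
Incomplete sentence filtering: ends with a comma, enumeration comma, or semicolon → fall back to the previous full stop
No full stop anywhere in the text → discard (fragmentary content is not output)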
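
The incomplete-sentence rules, together with the 40% loss limit stated elsewhere in this document, can be sketched like this. `trim_incomplete` is a hypothetical name, not the actual clean_summary; it assumes loss is measured in characters and that the full stop in question is the Chinese 。.

```python
# Trailing punctuation that marks an unfinished sentence (assumed set).
DANGLING = ("，", "、", "；", "：", ",", ";", ":")


def trim_incomplete(summary, max_loss=0.40):
    """Return a cleaned summary, or None when it should be discarded."""
    text = summary.strip()
    # No full stop anywhere: the passage is a fragment, so drop it.
    if "。" not in text:
        return None
    # Already ends cleanly: nothing to trim.
    if not text.endswith(DANGLING):
        return text
    # Fall back to the last full stop.
    kept = text[: text.rfind("。") + 1]
    loss = 1 - len(kept) / len(text)
    # Too much would be lost: prefer no summary over a poor one.
    return kept if loss <= max_loss else None
```

A caller would skip any article for which the function returns None.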

Filtering Rules (rules_config.py)

Excluded topics: entertainment, social news, violence, crime cases, health/wellness, education, automotive consumer news, science popularization (科普类), animal/archaeology news.

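Tutorial content (all filtered): 教程 (tutorial), 指南 (guide), 攻略 (walkthrough), 入门 (getting started), 自学 (self-study), 从零开始 (from scratch), 手把手 (hands-on), 保姆级教程 (step-by-step tutorial), 怎么做 (how to do it), 如何使用 (how to use), 操作步骤 (operating steps), 图文教程 (illustrated tutorial), 视频教程 (video tutorial), 科研绘图 (scientific figure drawing), PS教程 (PS tutorial), Illustrator, AI教程 (AI tutorial), 钢笔工具 (pen tool), 高斯模糊 (Gaussian blur), 路径查找器 (Pathfinder), etc.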

Invalid keywords: clickbait patterns, advertising, webpage navigation elements.

See scripts/news_digest_v2/rules_config.py for full lists.

Deduplication (similarity.py)

Jaccard similarity on keyword sets
Threshold: ≥90% → mark as duplicate
Only one version appears in output
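
Jaccard similarity on keyword sets is the size of the intersection divided by the size of the union. The sketch below shows the calculation and a first-seen-wins duplicate pass; it is an illustration, not the contents of similarity.py. The function names are hypothetical, and treating two empty keyword sets as non-duplicates is an assumption.

```python
def jaccard(a, b):
    """Jaccard similarity of two keyword collections, in the range 0.0 to 1.0."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumption: articles without keywords are not duplicates
    return len(a & b) / len(a | b)


def mark_duplicates(keyword_lists, threshold=0.90):
    """Return one is_duplicate flag (0 or 1) per article, keeping the first seen."""
    kept = []   # indexes of articles accepted so far
    flags = []
    for index, keywords in enumerate(keyword_lists):
        is_dup = any(
            jaccard(keywords, keyword_lists[i]) >= threshold for i in kept
        )
        flags.append(1 if is_dup else 0)
        if not is_dup:
            kept.append(index)
    return flags
```

Comparing each article only against already-kept ones is what lets a single version survive in the output.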

Date Filtering

Normal: within 3 days
Holidays: within 7 days
No date → discard
Old URLs (year in the URL more than 1 year old) → skip
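
These date rules might look like the sketch below. Both function names are hypothetical, and several details are assumptions: future-dated articles are rejected, the holiday flag is passed in by the caller, and the URL year is taken from the first `/20xx` segment in the path.

```python
import re
from datetime import date, timedelta
from typing import Optional


def date_in_window(publish_date: Optional[date], today: date,
                   is_holiday: bool = False) -> bool:
    """Keep articles from the last 3 days (7 on holidays); drop undated ones."""
    if publish_date is None:
        return False
    window = timedelta(days=7 if is_holiday else 3)
    return timedelta(0) <= today - publish_date <= window


def url_is_old(url: str, today: date) -> bool:
    """Return True when the URL carries a year more than one year in the past."""
    match = re.search(r"/(20\d{2})", url)  # assumed URL pattern
    if match is None:
        return False  # no year in the URL: cannot judge, so do not skip
    return int(match.group(1)) < today.year - 1
```

Checking the URL first avoids fetching pages that are obviously stale.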

Configuration

Environment Variables

Variable	Default	Description
`NEWS_DIGEST_DB`	`news_digest_v2/news.db`	SQLite database path
`NEWS_DIGEST_LLM_API_KEY`	(empty)	LLM API key for Stage 2.5 summarization
`NEWS_DIGEST_LLM_BASE_URL`	(empty)	LLM API base URL
`NEWS_DIGEST_LLM_MODEL`	`qwen-plus`	LLM model name

If LLM env vars are not set, Stage 2.5 is silently skipped and rule-based summaries are used instead.
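
Reading these variables could be sketched as follows. `load_llm_config` is a hypothetical helper, not a function from config.py. It assumes both the API key and the base URL must be present for Stage 2.5 to run. The `https://` check is an addition in this sketch that follows the safety advice in this document; the skill is not stated to perform it.

```python
import os


def load_llm_config(env=os.environ):
    """Return LLM settings, or None when Stage 2.5 should be skipped."""
    api_key = env.get("NEWS_DIGEST_LLM_API_KEY", "")
    base_url = env.get("NEWS_DIGEST_LLM_BASE_URL", "")
    model = env.get("NEWS_DIGEST_LLM_MODEL", "qwen-plus")
    # Assumption: key and base URL are both required.
    if not api_key or not base_url:
        return None
    # Extra guard added in this sketch: refuse non-HTTPS endpoints.
    if not base_url.startswith("https://"):
        return None
    return {"api_key": api_key, "base_url": base_url, "model": model}
```

When the function returns None, the caller would fall back to the rule-based summaries.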

Add/Remove Websites

Edit monitor_websites table:

INSERT INTO monitor_websites (name, url, selector, category, enabled)
VALUES ('示例网站', 'https://example.com', 'a', '财经', 1);

Customize Keywords

Edit system_keywords table:

INSERT INTO system_keywords (keyword, category, weight, enabled)
VALUES ('新能源', 'core', 5, 1);

Adjust Output

In config.py:

MAX_OUTPUT_COUNT = 35 (max articles per digest)
SIMILARITY_THRESHOLD = 0.90

Files

news-digest/
├── SKILL.md
└── scripts/
    └── news_digest_v2/
        ├── __init__.py
        ├── config.py              # DB path, websites, keywords, holidays, LLM config
        ├── database.py            # SQLite operations
        ├── fetcher.py             # Web scraping + smart summary extraction
        ├── filters.py             # Content filtering logic
        ├── formatter.py           # Output formatting + incomplete sentence handling
        ├── rules_config.py        # Exclusion rules, keywords, dateline patterns
        ├── similarity.py          # Jaccard deduplication
        ├── stage1_fetch.py        # Stage 1 entry (fetch)
        ├── stage2_process.py      # Stage 2 entry (dedup + keywords)
        ├── stage2_5_llm_summary.py # Stage 2.5 (LLM batch summarization)
        ├── stage3_output.py       # Stage 3 entry (read + format + save)
        └── run_all_stages.py      # Full pipeline entry

Performance Notes

~5 minutes for full 42-website scrape (network I/O bound)
Some sites may fail (SSL issues, 521 errors, 404s) — pipeline continues
Recommended cron timeout: 600 seconds
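The database is append-only and is never cleared. New articles are inserted with URL-based deduplication (INSERT OR IGNORE); old articles are kept.
Duplicate articles are marked is_duplicate = 1 rather than deleted.
The database grows by roughly 30-50 rows per day; periodic cleanup is recommended (optional).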
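
The incremental insert and an optional cleanup can be sketched with the standard sqlite3 module. The table here is a reduced version of the articles schema given in the v1.0.0 release notes, the helper names are hypothetical, and the cleanup assumes publish_date is stored as YYYY-MM-DD text so that string comparison orders dates correctly.

```python
import sqlite3


def create_schema(db):
    """Create a reduced articles table (subset of the documented columns)."""
    db.execute(
        """CREATE TABLE IF NOT EXISTS articles (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            title TEXT NOT NULL,
            source TEXT NOT NULL,
            publish_date TEXT NOT NULL,
            url TEXT UNIQUE NOT NULL,
            is_duplicate INTEGER DEFAULT 0
        )"""
    )


def insert_articles(db, rows):
    """Insert (title, source, publish_date, url) rows; return how many were new."""
    before = db.total_changes
    db.executemany(
        "INSERT OR IGNORE INTO articles (title, source, publish_date, url) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    db.commit()
    return db.total_changes - before


def purge_older_than(db, cutoff_date):
    """Optional cleanup: delete rows published before cutoff_date (YYYY-MM-DD)."""
    cursor = db.execute(
        "DELETE FROM articles WHERE publish_date < ?", (cutoff_date,)
    )
    db.commit()
    return cursor.rowcount
```

Because url is UNIQUE, re-running the fetch on the same day inserts nothing for articles already stored.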

Safe Usage Recommendations

Review the HTTPS verification fallback before relying on the digest, especially on untrusted networks. If you enable LLM summarization, provide only a trusted HTTPS base URL and a limited API key. Set up cron or messaging delivery only if you are comfortable with the digest being generated or sent automatically.

Functional Analysis

Type: OpenClaw Skill
Name: news-digest-v1
Version: 1.0.0

The skill bundle implements a news aggregation pipeline with several high-risk behaviors that, while functional, introduce security vulnerabilities. Specifically, 'fetcher.py' explicitly disables SSL certificate verification ('verify=False') as a fallback mechanism, exposing the agent to Man-in-the-Middle (MITM) attacks. Additionally, 'stage3_output.py' and 'formatter.py' perform file system writes to the user's Desktop directory, which is an intrusive operation. The 'stage2_5_llm_summary.py' script also handles sensitive API keys from environment variables and transmits data to a configurable external endpoint without validation, posing a risk of credential exposure.

Capability Tags

requires-sensitive-credentials

Capability Assessment

✓ Purpose & Capability

The included code is broadly coherent with the stated purpose: scraping Chinese news sites, filtering/deduplicating articles, optionally summarizing with an LLM, and writing digest files.

ℹ Instruction Scope

The LLM summarization stage places scraped article text directly into the model prompt; this is expected for summarization, but untrusted webpage text could influence the generated summaries.

✓ Install Mechanism

There is no install spec or remote installer shown. The skill documents Python prerequisites but does not show automatic package installation or remote script execution.

⚠ Credentials

Network scraping is expected, but the HTTPS fallback with certificate verification disabled is an unsafe implementation choice; the optional LLM stage also uses a bearer API key that users should protect.

✓ Persistence & Privilege

The artifacts do not install a background service or scheduler. The cron and messaging workflow are presented as user-configured examples rather than automatic persistence.

Version History

v1.0.0

## v1.0.0 - Initial release

Automatically scrapes news websites from user-defined sources and generates a daily news digest. Covers the domains you specify.

### Post-install configuration (required)

**1. Install dependencies**

```bash
pip install requests beautifulsoup4
```

**2. Initialize the database**

```bash
# Create the database and table schema
python -c "
import sqlite3
db = sqlite3.connect('news.db')
db.execute('''CREATE TABLE IF NOT EXISTS articles (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL,
    source TEXT NOT NULL,
    publish_date TEXT NOT NULL,
    summary TEXT,
    content TEXT,
    url TEXT UNIQUE NOT NULL,
    keywords TEXT,
    is_duplicate INTEGER DEFAULT 0,
    similarity_score REAL DEFAULT 0,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)''')
db.execute('''CREATE TABLE IF NOT EXISTS monitor_websites (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE NOT NULL,
    url TEXT NOT NULL,
    selector TEXT DEFAULT 'a',
    category TEXT,
    priority INTEGER DEFAULT 3,
    enabled INTEGER DEFAULT 1
)''')
db.execute('''CREATE TABLE IF NOT EXISTS system_keywords (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    keyword TEXT UNIQUE NOT NULL,
    category TEXT,
    weight INTEGER DEFAULT 1,
    enabled INTEGER DEFAULT 1
)''')
db.execute('CREATE INDEX IF NOT EXISTS idx_publish_date ON articles(publish_date)')
db.execute('CREATE INDEX IF NOT EXISTS idx_keywords ON articles(keywords)')
db.commit()
db.close()
print('OK: news.db created')
"
```

**3. Add websites to monitor (example)**

```bash
python -c "
import sqlite3
db = sqlite3.connect('news.db')
# Each tuple: name, url, selector, category, priority
sites = [
    ('中国经济网', 'http://www.ce.cn/', 'a', '财经', 1),
    # Add the websites you want to monitor...
]
for name, url, sel, cat, pri in sites:
    db.execute('INSERT OR IGNORE INTO monitor_websites (name, url, selector, category, priority) VALUES (?,?,?,?,?)', (name, url, sel, cat, pri))
db.commit()
db.close()
print(f'OK: added {len(sites)} websites')
"
```

**4. Add keywords (example)**

```bash
python -c "
import sqlite3
db = sqlite3.connect('news.db')
# Each tuple: keyword, category, weight
kws = [
    ('市场', 'auxiliary', 2),
    ('企业', 'auxiliary', 2),
]
for kw, cat, w in kws:
    db.execute('INSERT OR IGNORE INTO system_keywords (keyword, category, weight) VALUES (?,?,?)', (kw, cat, w))
db.commit()
db.close()
print(f'OK: added {len(kws)} keywords')
"
```

**5. Run**

```bash
python scripts/news_digest_v2/run_all_stages.py
```

Optional configuration

Environment Variable	Default	Description
NEWS_DIGEST_DB	news.db	Database path
NEWS_DIGEST_LLM_API_KEY	(empty)	LLM API key (optional, enables smart summarization)
NEWS_DIGEST_LLM_BASE_URL	(empty)	LLM API base URL
NEWS_DIGEST_LLM_MODEL	qwen-plus	LLM model name

Features

• 🕷️ Smart website scraping (supports per-site parsers)
• 🧹 Filtering of 6 major categories of invalid content (entertainment, social news, popular science, tutorials, etc.)
• 🔄 Automatic deduplication at ≥90% similarity
• ✍️ Smart summary extraction (not simple truncation; based on information-density scoring)
• 🤖 Optional LLM batch summarization (produces more professional summaries)
• 📄 Compact output format (source + title + summary + link)

Metadata

Slug news-digest-v1

Version 1.0.0

License MIT-0

Total installs 0

Current installs 0

Number of historical versions 1

FAQ

What is 每日新闻搜索与智能摘要 (Daily News Search and Smart Digest)?

Automatically scrape, process, and generate daily news digests from 42 Chinese news sources. Covers industry dynamics, policy updates, economy, tech, energy,... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 32 cumulative downloads to date.

How do I install 每日新闻搜索与智能摘要?

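Run the command "/install news-digest-v1" in the OpenClaw or Claude Code chat box for one-step installation. Before the first run, the dependency installation and database initialization described in the v1.0.0 release notes still have to be completed.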

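Is 每日新闻搜索与智能摘要 free?

Yes. 每日新闻搜索与智能摘要 is completely free and released under the MIT-0 license; it can be freely downloaded, installed, and used.

Which platforms does 每日新闻搜索与智能摘要 support?

每日新闻搜索与智能摘要 runs cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed 每日新闻搜索与智能摘要?

It is developed and maintained by zigu-creator (@zigu-creator); the current version is v1.0.0.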
