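Description

Automatically parses an arxiv paper or PDF supplied by the user, generates a structured summary, updates the concept and resource pages in the knowledge base, and syncs the Feishu table.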

Usage (SKILL.md)

Skill: ingest_paper (paper-kb document ingestion)

Name: Ingest Paper
Author: myd2002

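Purpose

Analyze an arxiv paper the user sends, or a PDF document the user uploads, and store it in that user's dedicated Gitea knowledge base: generate a structured summary page (summaries/), update the cross-document concept pages (concepts/) and research resource pages (resources/), sync the Feishu Bitable (多维表格), and reply to the user.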

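Trigger conditions

Activate when (any one of the following holds):

The message contains an arxiv link and expresses an intent to store it ("存", "入库", "保存", "记录", "加到知识库", and similar).
The user uploaded a PDF file and expressed an intent to store it.
The message contains only an arxiv link, or only an uploaded file, with no stated intent → first ask "需要我把这篇存入你的知识库吗？" and proceed only after the user confirms.
In the previous turn the duplicate check flagged a likely duplicate, and the user replied "覆盖" / "是" / "继续存".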

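Do NOT activate when:

The user is asking a question or searching for literature ("有没有……的论文") → hand off to query_papers.
The user is not registered (the init_user check returns registered=false) → run init_user first.
The link the user sent is not an arxiv link (GitHub, news, etc.) → tell the user that only arxiv links and PDF uploads are currently supported.
The user explicitly says not to store it, or cancels.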
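
As a rough illustration of the trigger rules above, the following Python sketch classifies a message as ingest, ask_first, or ignore. The regular expression, the keyword tuples, and the decide function are assumptions made for illustration; they are not part of the skill's scripts. The sketch also leaves out the registration check, the hand-off to query_papers, and the reply for non-arxiv links, all of which need context that a keyword match cannot provide.

```python
import re

# Assumed pattern: matches new-style IDs (2301.12345) and old-style IDs (hep-th/9901001).
ARXIV_LINK = re.compile(r"arxiv\.org/(?:abs|pdf)/([\w.-]+(?:/\d{7})?)", re.IGNORECASE)

# Storage-intent keywords quoted in the trigger list; the real list is open-ended.
STORE_WORDS = ("存", "入库", "保存", "记录", "加到知识库")
# Hypothetical cancel phrases, checked first because "不要存" also contains "存".
CANCEL_WORDS = ("不要存", "取消")


def decide(message: str, has_pdf: bool) -> str:
    """Return "ingest", "ask_first", or "ignore" for one incoming message."""
    has_arxiv = ARXIV_LINK.search(message) is not None
    if not (has_arxiv or has_pdf):
        return "ignore"
    if any(word in message for word in CANCEL_WORDS):
        return "ignore"
    if any(word in message for word in STORE_WORDS):
        return "ingest"
    # A bare link or a bare upload: ask before storing anything.
    return "ask_first"
```

A plain substring match is deliberately naive; in practice the agent judges intent from the whole conversation, and this only shows the order in which the rules apply.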

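Prerequisites

current_user_open_id: taken from sender in the message context and passed to every script as --open_id.
The root directory of this Skill must contain a .env file (GITEA_URL / GITEA_ADMIN_TOKEN / GITEA_BOT_USERNAME).
The user must already be registered (the init_user check returns registered=true). The research_direction stored in the user record is needed in the analysis step.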

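Temporary file path conventions (important)

All intermediate files go under /tmp/paperkb/:

arxiv PDF: /tmp/paperkb/arxiv_{arxiv_id}.pdf (with any / in the ID replaced by _)
Extracted text: same name as the PDF, with a .txt extension
User-uploaded PDF: saved as /tmp/paperkb/upload_{original file name}.pdf
Page drafts you generate: /tmp/paperkb/draft_summary.md, /tmp/paperkb/draft_concept_{name}.md, and so on

If the PDF path is lost partway through execution: for an arxiv paper, rebuild the path from the arxiv_id using the rule above; if the rebuilt path does not exist, run fetch_arxiv.py again.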
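
The path rule above is mechanical enough to sketch in a few lines of Python. The helper names (arxiv_pdf_path, text_path_for, needs_refetch) are hypothetical and only illustrate the convention; the skill's own scripts may build these paths differently.

```python
from pathlib import Path

TMP_DIR = Path("/tmp/paperkb")


def arxiv_pdf_path(arxiv_id: str) -> Path:
    """Rebuild the PDF path from an arxiv ID, replacing "/" with "_"."""
    return TMP_DIR / f"arxiv_{arxiv_id.replace('/', '_')}.pdf"


def text_path_for(pdf_path: Path) -> Path:
    """The extracted text sits next to the PDF, with a .txt extension."""
    return pdf_path.with_suffix(".txt")


def needs_refetch(arxiv_id: str) -> bool:
    """True when the rebuilt path does not exist, so fetch_arxiv.py should run again."""
    return not arxiv_pdf_path(arxiv_id).exists()
```

For example, the old-style ID hep-th/9901001 maps to /tmp/paperkb/arxiv_hep-th_9901001.pdf.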

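Full workflow

Step 1: Fetch the document

A. arxiv link:

python3 scripts/fetch_arxiv.py --url "<link or ID sent by the user>"

On success the output contains arxiv_id, title, authors, published, primary_category, abstract, and pdf_path. On failure, relay message to the user and stop the workflow.

B. User-uploaded PDF: save the attachment to /tmp/paperkb/upload_{file name}.pdf and note the path. There is no title yet at this point; the AI extracts it during the analysis in Step 3.

Step 2: Extract the full text

python3 scripts/process_pdf.py --pdf_path "<PDF path>"

The output contains text_path (path of the full-text .txt file), truncated (whether the text was truncated), and head (the first 600 characters). On failure (for example, a purely scanned document), relay message to the user and stop the workflow.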

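Step 3: AI analysis (you do this yourself; no script is called)

Use the file tool to read the full text at text_path and analyze it as specified below. Before analyzing, fetch the user's research_direction (recorded at registration).

The analysis output must be strict JSON (for your internal use; do not send it to the user directly):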

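{
  "doc_type": "论文 | 调研报告 | 会议纪要 | 技术文档 | 其他",
  "title_zh": "Chinese title (translate it when the original title is English; for an uploaded PDF, extract the title from the content)",
  "title_original": "original title",
  "brief": "one-sentence description (within 50 characters, used in the catalog)",
  "summary": "overview in Chinese, 200-300 characters: what was done, with what method, what was concluded, and why it matters",
  "research_question": "the core problem this document sets out to solve (1-2 sentences)",
  "methods": ["key method point 1", "point 2", "..."],
  "conclusions": ["main conclusion 1 (with key figures where possible)", "..."],
  "sections": [{"name": "section name", "points": "2-3 sentences of key points for that section"}],
  "keywords": ["keyword 1", "... 5-8 in total"],
  "relevance": {"score": 8, "reason": "explain the score with reference to the user's research direction「{research_direction}」"},
  "concepts": ["abstract concepts extracted from the document, e.g. 力控制, 模仿学习"],
  "resources": [{"name": "specific resource name", "type": "数据集|开源项目|工具|硬件",
                 "note": "how the document uses or evaluates it"}]
}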

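Requirements:

Write every field value in Chinese.
relevance.score measures how relevant the document is to the user's research direction (1-10); it is not a quality score for the document.
concepts lists abstract concepts only; resources lists only specific resources that the document actually uses or evaluates. A resource that is merely mentioned by name does not count.
Uploaded-PDF case: title_zh becomes the file name used later, so it must be extracted accurately from the content.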
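
Because the analysis is described as strict JSON, a quick structural check before moving on can catch a malformed result early. The sketch below is one possible check, not something the skill ships: validate_analysis is a hypothetical helper, and treating the score as an integer and measuring the 50-character limit with len() are assumptions.

```python
import json

REQUIRED_KEYS = {
    "doc_type", "title_zh", "title_original", "brief", "summary",
    "research_question", "methods", "conclusions", "sections",
    "keywords", "relevance", "concepts", "resources",
}
RESOURCE_TYPES = {"数据集", "开源项目", "工具", "硬件"}


def validate_analysis(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the JSON looks usable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top level must be a JSON object"]

    problems = []
    missing = REQUIRED_KEYS - set(data)
    if missing:
        problems.append("missing keys: " + ", ".join(sorted(missing)))

    relevance = data.get("relevance")
    score = relevance.get("score") if isinstance(relevance, dict) else None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 10:
        problems.append("relevance.score must be an integer from 1 to 10")

    keywords = data.get("keywords")
    if not isinstance(keywords, list) or not 5 <= len(keywords) <= 8:
        problems.append("keywords must be a list of 5-8 items")

    brief = data.get("brief")
    if not isinstance(brief, str) or len(brief) > 50:
        problems.append("brief must be a string of at most 50 characters")

    for res in data.get("resources") or []:
        if not isinstance(res, dict) or res.get("type") not in RESOURCE_TYPES:
            problems.append(f"resource has an unknown type: {res!r}")
    return problems
```

The check covers structure only. Whether a concept is abstract enough, or whether a resource was actually used rather than merely mentioned, still needs the judgement described in the requirements.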

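Step 4: Duplicate check

python3 scripts/check_duplicate.py --open_id <open_id> \
    --title "<title_zh>" --arxiv_id "<arxiv_id; omit this flag if there is none>" \
    --text_path "<text_path>"

duplicate: true → tell the user the document already exists (give the title and time from existing) and ask "是否覆盖？". If the user confirms, continue from Step 5 (add --force when saving); otherwise stop.
possible_duplicate: true → tell the user the document looks like a duplicate of 《existing.title》 (similarity: similarity), and likewise ask whether to continue.
duplicate: false and no possible_duplicate → continue directly.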
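
The three outcomes above amount to a small decision function. The sketch below assumes the script's JSON has already been parsed into a dict with the field names mentioned in this step (duplicate, possible_duplicate, existing, similarity); next_action and its return labels are hypothetical.

```python
def next_action(result: dict) -> tuple[str, str]:
    """Map a check_duplicate result to (action, title of the existing document)."""
    existing = result.get("existing") or {}
    title = existing.get("title", "")
    if result.get("duplicate"):
        # Exact duplicate: ask whether to overwrite, then save with --force.
        return "ask_overwrite", title
    if result.get("possible_duplicate"):
        # Likely duplicate: show the similarity and ask whether to continue.
        return "ask_continue", title
    return "continue", ""
```

Checking duplicate before possible_duplicate keeps the stronger signal in charge if a result ever carries both flags.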

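Step 5: Plan concepts and resources (you do this yourself)

First read the catalog of existing concepts and resources:

python3 scripts/kb_read.py --open_id <open_id> --list all

Based on the concepts / resources from Step 3 and the existing catalog, decide for each item whether to create (new page), update (existing page), or skip (mention it in the summary only, without a page).

Planning rules (follow strictly):

When the knowledge base holds 3 or fewer documents, create at most 2-3 new concept pages per document; better too few than too many.
When a name matches or the meaning overlaps with an existing page, always update the existing page instead of creating a new one.
Do not create a concept page for the document's own topic (that is the summary's job).
Resource pages follow the same logic: update what exists and create only what is new. A document typically adds 0-5 new resource pages.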
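
The mechanical part of these rules (update on a name match, cap the number of new pages) can be sketched as follows. plan_pages is a hypothetical helper; it matches exact names only and assumes the candidates arrive in priority order. Detecting overlapping meanings and excluding the document's own topic still call for judgement.

```python
def plan_pages(candidates: list[str], existing: set[str], max_new: int) -> dict[str, str]:
    """Assign "update", "create", or "skip" to each candidate page name."""
    plan: dict[str, str] = {}
    created = 0
    for name in candidates:
        if name in plan:
            continue  # ignore repeated candidates
        if name in existing:
            plan[name] = "update"
        elif created < max_new:
            plan[name] = "create"
            created += 1
        else:
            # Over the cap: mention it in the summary only, without a page.
            plan[name] = "skip"
    return plan
```

With max_new=2 and an existing page 力控制, the candidates 力控制, 模仿学习, 扩散策略, 触觉感知 come out as update, create, create, skip.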

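Step 6: Generate and save concept pages / resource pages

For each create item, write a Markdown draft from the template to /tmp/paperkb/draft_concept_{name}.md, then run:

python3 scripts/save_page.py --open_id <open_id> --kind concept \
    --name "<concept name>" --file "<draft path>" --brief "<one-sentence definition>"

For each update item, read the old content first:

python3 scripts/kb_read.py --open_id <open_id> --read "concepts/<concept name>"

Merge the new document's information into the page by rewriting it naturally (not by simply appending), keep the original structure, write the result to a draft, and save it with the same save_page.py command (a page with the same name is overwritten automatically).

For resource pages, use --kind resource and add --resource_type.

Concept page template: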

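# <concept name>

## 定义
<a clear definition in Chinese, 2-3 sentences>

## 相关文档中的论述
- **[[summaries/<document title>]]**：<how this document treats the concept and what it concludes, 1-2 sentences>
(One entry per related document; merge the new document's entry into the list rather than appending it at the end.)

## 方法对比
(Write this section only when several documents use different methods; use a table with the columns 方法|优点|缺点|来源.)

## 关联概念
[[concepts/<other concept>]] (link only to pages that actually exist)

## 矛盾与待解决问题
(Record conflicting conclusions across documents here; if there are none, write "暂无".)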

Resource page template:

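# <resource name>

## 类型
数据集 / 开源项目 / 工具 / 硬件

## 简介
<2-3 sentences>

## 在哪些文档中被使用
- **[[summaries/<document title>]]**：<how it was used and how well it worked>

## 获取方式
<the link or address given in the document; if there is none, write "文档未提供">

## 相关资源
[[resources/<other resource>]] (link only to pages that actually exist)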

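Step 7: Generate the final summary page (the two-step rewrite is completed here)

All concept and resource pages are now settled. Whitelist = the pages created or updated in this run + the existing pages listed by kb_read. Generate the final summary from the template; its [[wikilinks]] may point only to whitelisted pages, and any concept outside the whitelist appears as plain text.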
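
One way to enforce the whitelist rule on a finished draft is a single regular-expression pass, sketched below. enforce_whitelist is a hypothetical helper, and it assumes whitelist entries are written exactly as the link targets appear (for example concepts/力控制).

```python
import re

# Matches [[target]] where the target contains no square brackets.
WIKILINK = re.compile(r"\[\[([^\[\]]+)\]\]")


def enforce_whitelist(markdown: str, whitelist: set[str]) -> str:
    """Keep whitelisted wikilinks; turn every other link into plain text."""
    def replace(match: re.Match) -> str:
        target = match.group(1)
        if target in whitelist:
            return match.group(0)
        # Drop the brackets and the directory prefix, keep the page name.
        return target.rsplit("/", 1)[-1]

    return WIKILINK.sub(replace, markdown)
```

Running it as a last pass over /tmp/paperkb/draft_summary.md would catch links to pages that were planned but never saved.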

summary page template:

---
标题: <title_zh>
原文标题: <title_original>
类型: <doc_type>
来源: <arxiv / 上传PDF>
arxiv_id: <fill in if available>
作者: <comma-separated; omit this line if an uploaded PDF has no author information>
发表时间: <published>
关键词: [<comma-separated keywords>]
相关性评分: <score>
存入时间: <today's date>
---

# <title_zh>

## 一句话总结
<brief>

## 中文综述
<summary>

## 研究问题
<research_question>

## 核心方法
- <one bullet per item in methods>

## 主要结论
- <one bullet per item in conclusions>

## 章节要点
- **<section name>**：<key points>

## 与我研究方向的相关性
评分：<score>/10
理由：<reason>

## 关键概念
[[concepts/<concept 1>]] [[concepts/<concept 2>]] (whitelisted pages only)

## 科研资源
[[resources/<resource 1>]] (whitelisted pages only; omit this section if there are none)

Write it to /tmp/paperkb/draft_summary.md, then save:

python3 scripts/save_paper.py --open_id <open_id> \
    --title "<title_zh>" \
    --summary_file /tmp/paperkb/draft_summary.md \
    --doc_type "<doc_type>" \
    --keywords "<comma-separated keywords>" \
    --score <score> \
    --brief "<brief>" \
    --arxiv_id "<fill in if available>" \
    --pdf_path "<PDF path>" \
    --text_path "<text_path>" \
    [--force]   # overwrite mode only

The output contains summary_url, pdf_url, and repo_url.

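Step 8: Sync the Feishu Bitable (optional; failure does not block the workflow)

When the user record has non-empty feishu_app_token and feishu_table_id, call feishu_bitable_app_table_record (action: create):

app_token: <the user's feishu_app_token>
table_id:  <the user's feishu_table_id>
fields: {
  "标题": "<title_zh>",
  "类型": "<doc_type>",
  "关键词": "<comma-separated string>",
  "相关性评分": <score, as a number>,
  "存入时间": <current timestamp in milliseconds, as a number>,
  "Gitea链接": {"link": "<summary_url>", "text": "<title_zh>"},
  "arxiv_id": "<fill in if available; otherwise an empty string>"
}

These two fields are empty (Feishu was not enabled when the user was initialized) → skip this step.
The call fails → retry at most once; if it still fails, skip the step and note in the reply "飞书表格同步失败，不影响知识库". Never interrupt the workflow or discard already-saved content because of this.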
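
The field payload and the retry-once rule can be sketched in Python as below. build_fields and call_with_one_retry are hypothetical helpers; the call argument stands in for the actual feishu_bitable_app_table_record tool call, which the agent makes itself, and joining keywords with a plain comma is an assumption about the expected string format.

```python
import time


def build_fields(title_zh, doc_type, keywords, score, summary_url, arxiv_id="", now=None):
    """Build the fields object for one record; now is in seconds since the epoch."""
    now = time.time() if now is None else now
    return {
        "标题": title_zh,
        "类型": doc_type,
        "关键词": ",".join(keywords),
        "相关性评分": score,                # a number, not a string
        "存入时间": int(now * 1000),        # millisecond timestamp
        "Gitea链接": {"link": summary_url, "text": title_zh},
        "arxiv_id": arxiv_id or "",
    }


def call_with_one_retry(call) -> bool:
    """Run call(); on failure retry once. Return False if both attempts fail."""
    for _attempt in range(2):
        try:
            call()
            return True
        except Exception:
            continue  # first failure falls through to the single retry
    return False
```

Returning a flag rather than raising keeps the sync failure from interrupting the workflow: the caller only adds the failure note to the reply when the result is False.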

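Step 9: Reply to the user

✅ 已存入知识库！

📄 <title_zh>
🏷 类型：<doc_type>｜🔑 关键词：<the first few keywords>
⭐ 与你研究方向的相关性：<score>/10 — <reason, condensed to one sentence>
🧠 概念页更新：<新建X个/更新Y个, followed by the page names>
🛠 资源页：<same format as above; omit this line if there are none>
🔗 查看详情：<summary_url>

In overwrite mode, change the opening line to "✅ 已覆盖更新！".
Uploaded-PDF case: append the line 💡 建议把本地 PDF 重命名为：<title_zh>.pdf
When the Feishu sync is skipped or fails, append a note saying so.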

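General error-handling rules

Every script prints single-line JSON. When success is false, relay message to the user; do not dump the raw error on them.
Non-JSON content on stdout means the script itself failed: extract the key information, tell the user, and suggest contacting the administrator.
Never silently swallow a failure at any step. Tell the user which step failed and whether it can be retried.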
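
These rules suggest handling every script's stdout the same way. The sketch below is one possible reading: parse_script_output and the shape of its return value are assumptions, while success and message are the field names used in the rules above.

```python
import json


def parse_script_output(stdout: str) -> dict:
    """Classify a script's stdout as success, reported failure, or abnormal output."""
    lines = [line for line in stdout.splitlines() if line.strip()]
    abnormal = {
        "ok": False,
        "abnormal": True,
        "user_message": "The script produced unexpected output; please contact the administrator.",
    }
    if len(lines) != 1:
        return abnormal  # the contract is exactly one line of JSON
    try:
        data = json.loads(lines[0])
    except json.JSONDecodeError:
        return abnormal
    if not isinstance(data, dict):
        return abnormal
    if data.get("success") is False:
        # Reported failure: relay the script's own message, not a raw traceback.
        return {"ok": False, "abnormal": False, "user_message": data.get("message", "")}
    return {"ok": True, "abnormal": False, "data": data}
```

Separating abnormal output from a reported failure lets the reply say whether a retry makes sense or the administrator needs to look at the script.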

Security recommendations

Install only if you trust the publisher and the configured Gitea/Feishu environment. Use HTTPS, avoid a full site-admin token if a narrower token can work, confirm that uploaded PDFs may be stored in Gitea and that metadata may be synced to Feishu, and pin or review dependency versions before deployment.

Capability assessment

ℹ Purpose & Capability

PDF/arxiv ingestion, text extraction, AI summarization, duplicate checks, Gitea knowledge-base writes, and optional Feishu metadata sync are coherent with the stated paper-kb purpose.

ℹ Instruction Scope

Activation is user-directed and includes confirmation for ambiguous uploads or duplicates, but Feishu sync runs automatically when user Feishu tokens are present and there is no explicit per-run data-sharing notice.

⚠ Install Mechanism

setup.sh installs Python dependencies from broad version ranges and creates a .env for a Gitea admin token; it does not install background services or persistence.

⚠ Credentials

The skill requires a Gitea site-admin token, and the sample/default Gitea URLs use plain HTTP to external IP addresses. That scope is broad, and it exposes a high-value credential and uploaded research content to network/security risk.

⚠ Persistence & Privilege

The skill persists summaries, PDFs, catalog entries, indexes, logs, and concept/resource pages to a remote Gitea repository using administrator-backed API access; no destructive behavior was found, but the authority is high impact.

Version history

v1.0.0

- Initial release of ingest-paper, automating the storage of arxiv links or uploaded PDFs into each user's Gitea knowledge base.
- Extracts full text, generates a structured summary page, updates inter-document concept and resource pages, then syncs with Feishu multidimensional table if enabled.
- Includes robust duplicate detection and user confirmation flow.
- Carefully plans and updates concept/resource pages to avoid duplication and ensure quality.
- Provides user feedback after completion, with handling for errors and optional Feishu sync.

Metadata

Slug ingest-paper

Version 1.0.0

License MIT-0

Total installs 0

Current installs 0

Number of versions 1

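FAQ

What is Ingest Paper?

It automatically parses an arxiv paper or PDF supplied by the user, generates a structured summary, updates the concept and resource pages in the knowledge base, and syncs the Feishu table. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 46 downloads to date.

How do I install Ingest Paper?

Run the command "/install ingest-paper" in the OpenClaw or Claude Code chat box to install it in one step. The skill's root directory must still contain the .env file described in SKILL.md.

Is Ingest Paper free?

Yes. Ingest Paper is completely free and is released under the MIT-0 license, so it can be downloaded, installed, and used freely.

Which platforms does Ingest Paper support?

Ingest Paper is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops Ingest Paper?

It is developed and maintained by myd2002 (@myd2002); the current version is v1.0.0.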
