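Description

Atomic RAG Knowledge Base Builder - help an AI truly learn a book rather than merely skim it. Specialized for science, engineering, agriculture, and medicine, with methodology extraction. The author bills it as the best open-source skill on the web for building a dedicated knowledge base.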

Usage (SKILL.md)

📚 Atomic RAG Knowledge Base Building Skill

Name: Atomic RAG Knowledge Base Builder (原子化RAG知识库构建器)
Author: simonstang

Atomic Knowledge Base Builder for RAG

Produced by Xue Lai Xue Qu Learning Society (学来学去学习社)

"Help an AI truly learn a book, not just skim it."

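Skill Overview

Skill name: atomic-rag-builder
Version: v1.0.0
Category: AI-Programming / Knowledge-Management
Tags: rag, knowledge-base, pdf, vector-db, atomic, learning

Introduction

This skill builds high-quality RAG (retrieval-augmented generation) knowledge bases from PDF documents. Unlike the conventional "hard cut" approach, which splits text mechanically by character count, it takes an "atomic" approach: knowledge is broken into the smallest usable units so that the AI can genuinely understand the material, connect ideas across it, and generalize from one case to others.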

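Core pain points addressed

Pain point 1: the AI loses the thread ("no head, no tail")

Symptom: conventional vector stores hard-cut text every 600-800 characters, slicing a complete knowledge point in half
Result: the AI's answers are fragmented, and it can only produce correct-sounding platitudes
Fix: split by knowledge completeness and preserve the logical links between contexts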
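
To illustrate the difference between the two splitting strategies, here is a minimal sketch. It is not the skill's actual implementation: `hard_cut` and `split_by_paragraph` are hypothetical helper names, and treating each blank-line-separated paragraph as one complete knowledge unit is a simplifying assumption.

```python
def hard_cut(text: str, size: int) -> list[str]:
    """Fixed-size splitting: cut every `size` characters, ignoring meaning."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def split_by_paragraph(text: str, max_chars: int) -> list[str]:
    """Pack whole paragraphs into chunks without ever splitting a paragraph.

    Assumption: a blank-line-separated paragraph is one knowledge unit.
    A paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


doc = (
    "Definition: a limit is ...\n\n"
    "Theorem: if f is continuous ...\n\n"
    "Proof: take epsilon ..."
)
print(hard_cut(doc, 30))            # boundaries fall mid-sentence
print(split_by_paragraph(doc, 60))  # boundaries fall between paragraphs
```

The paragraph-aware version trades uniform chunk size for intact units, so a definition or proof is never separated from its own text.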

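Pain point 2: U-shaped attention loss

Symptom: a large model with a 1-million-token context remembers mainly the first 10% and the last 10% of its input
Result: the core methodology in the middle 80% is ignored
Fix: after atomization, only the relevant atoms are loaded, so matching is precise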
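
Loading only the relevant atoms amounts to ranking stored atoms against the query and keeping the top few. The sketch below assumes each atom carries an `embedding` vector and that cosine similarity is the ranking measure; neither the function names nor the toy three-dimensional vectors come from the skill itself.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_atoms(query_vec: list[float], atoms: list[dict], k: int = 2) -> list[dict]:
    """Return the k atoms whose embeddings are most similar to the query."""
    ranked = sorted(
        atoms,
        key=lambda atom: cosine(query_vec, atom["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Toy data: real embeddings would come from an embedding model.
atoms = [
    {"atom_id": "quadratic_formula", "embedding": [0.9, 0.1, 0.0]},
    {"atom_id": "chain_rule", "embedding": [0.1, 0.9, 0.0]},
    {"atom_id": "completing_square", "embedding": [0.8, 0.2, 0.1]},
]
query = [1.0, 0.0, 0.0]
print([atom["atom_id"] for atom in top_k_atoms(query, atoms)])
```

Only the selected atoms would then be placed in the prompt, which keeps the context short regardless of how large the source book is.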

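Pain point 3: knowledge ≠ capability

Symptom: the AI has read many books but cannot solve problems
Result: it can only restate concepts and cannot guide practice
Fix: distill methodology and package knowledge as executable capabilities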

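Pain point 4: special handling for science, engineering, agriculture, and medicine books

Symptom: these books contain many formulas, figures, tables, and derivations
Result: conventional OCR extracts only the text and loses the most essential formulas and figure relationships
Fix: dedicated parsers handle formulas, figures and tables, code, and derivation steps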

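Use cases

Personal knowledge base: turn books you have read into retrievable knowledge atoms
Enterprise knowledge management: atomize SOPs, manuals, and training material for AI to call on
Educational content: atomize textbooks and question banks to enable personalized learning
Research material: atomize papers and patents to extract core methods and conclusions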

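Core methodology

The five-step atomization method

Step 1: Format conversion (eliminate visual blind spots)
Step 2: Semantic segmentation (by knowledge completeness)
Step 3: Methodology extraction (drop the stories, keep the methods)
Step 4: Metadata extraction (multi-dimensional tags)
Step 5: Vectorized storage (ready for retrieval)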
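
The five steps can be sketched as a small pipeline. This is an illustrative skeleton under stated assumptions, not the skill's code: Step 1 is assumed to have already produced plain text, segmentation is by blank lines, methodology extraction is reduced to a regex for numbered lines, and `toy_embed` is a stand-in for a real embedding model. All function names are hypothetical.

```python
import re
from typing import Callable

# Matches lines such as "1. Do X", "2) Do Y", or "Step 3: Do Z".
STEP_PATTERN = re.compile(
    r"^\s*(?:Step\s*)?\d+[.):]\s*(.+)$", re.IGNORECASE | re.MULTILINE
)


def segment(text: str) -> list[str]:
    """Step 2 (simplified): one segment per blank-line-separated block."""
    return [block.strip() for block in text.split("\n\n") if block.strip()]


def extract_steps(segment_text: str) -> list[str]:
    """Step 3 (simplified): keep only lines that look like numbered steps."""
    return [match.strip() for match in STEP_PATTERN.findall(segment_text)]


def build_atoms(
    text: str, source: str, embed: Callable[[str], list[float]]
) -> list[dict]:
    """Steps 2-5: segment, extract methodology, tag metadata, embed."""
    atoms = []
    for index, seg in enumerate(segment(text)):
        atoms.append({
            "atom_id": f"{source}-{index}",
            "title": seg.splitlines()[0],
            "content": seg,
            "metadata": {"source": source},                 # Step 4
            "methodology": {"steps": extract_steps(seg)},   # Step 3
            "embedding": embed(seg),                        # Step 5
        })
    return atoms


def toy_embed(text: str) -> list[float]:
    """Stand-in for an embedding model: [length, digit count]."""
    return [float(len(text)), float(sum(c.isdigit() for c in text))]


text = (
    "Solving a quadratic\n1. Compute the discriminant\n2. Apply the formula\n\n"
    "History\nThe method dates back centuries."
)
for atom in build_atoms(text, "demo", toy_embed):
    print(atom["atom_id"], atom["methodology"]["steps"])
```

Passing `embed` as a parameter keeps the pipeline independent of any particular embedding provider, which also makes it testable offline.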

Standard atom format

{
  "atom_id": "unique_identifier",
  "type": "knowledge_type",
  "title": "core concept / question / method",
  "content": "core content",
  "metadata": {
    "source": "source",
    "page": 10,
    "chapter": "Chapter X",
    "difficulty": 1-5,
    "prerequisites": ["prerequisite knowledge"],
    "related_atoms": ["related atoms"]
  },
  "methodology": {
    "steps": ["step 1", "step 2"],
    "key_points": ["key point"],
    "common_mistakes": ["common mistake"],
    "verification": "verification method"
  },
  "embedding": [0.12, -0.45, ...]
}
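
The template above is a schema outline rather than strict JSON (`1-5` denotes a range and `...` an elision). One way to enforce it in code is a pair of dataclasses that validate on construction. The class names `Atom` and `AtomMetadata` are assumptions made for this sketch; only the field names come from the template.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AtomMetadata:
    source: str
    page: int
    chapter: str
    difficulty: int  # integer in the range 1..5, per the template
    prerequisites: list[str] = field(default_factory=list)
    related_atoms: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not 1 <= self.difficulty <= 5:
            raise ValueError("difficulty must be between 1 and 5")


@dataclass
class Atom:
    atom_id: str
    type: str
    title: str
    content: str
    metadata: AtomMetadata
    methodology: dict = field(default_factory=dict)
    embedding: list[float] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize to strict JSON; nested dataclasses become objects."""
        return json.dumps(asdict(self), ensure_ascii=False)


atom = Atom(
    atom_id="math-001",
    type="method",
    title="Quadratic formula",
    content="Roots of ax^2 + bx + c = 0 ...",
    metadata=AtomMetadata(source="demo.pdf", page=10, chapter="Chapter 2", difficulty=2),
    methodology={"steps": ["Compute the discriminant", "Apply the formula"]},
    embedding=[0.12, -0.45],
)
print(atom.to_json())
```

Validating `difficulty` at construction time catches malformed atoms before they reach the vector store.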

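🎓 Specialized handling for science, engineering, agriculture, and medicine

Domain	Special handling
Mathematics	LaTeX formula extraction, proof-step recognition, theorem and definition annotation
Physics	Physical model extraction, formula derivations, applicability-condition annotation
Chemistry	Reaction equation recognition, reaction mechanism extraction, condition-parameter recording
Medicine	Diagnostic logic extraction, treatment plan recording, differential diagnosis annotation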
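
As an example of what the mathematics row might involve, the sketch below pulls LaTeX formulas out of text. It assumes an upstream step has already produced text with `$...$` and `$$...$$` delimiters, which the source does not specify; `extract_formulas` is a hypothetical name and the regexes handle only these two delimiter styles.

```python
import re

# Display math first, so "$$...$$" is not misread as two inline formulas.
DISPLAY_MATH = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)
INLINE_MATH = re.compile(r"\$([^$]+)\$")


def extract_formulas(text: str) -> dict[str, list[str]]:
    """Return display and inline LaTeX formulas found in `text`."""
    display = [formula.strip() for formula in DISPLAY_MATH.findall(text)]
    # Remove display blocks before scanning for inline formulas.
    remainder = DISPLAY_MATH.sub(" ", text)
    inline = [formula.strip() for formula in INLINE_MATH.findall(remainder)]
    return {"display": display, "inline": inline}


sample = (
    r"The roots of $ax^2+bx+c=0$ are "
    r"$$x=\frac{-b\pm\sqrt{b^2-4ac}}{2a}$$ when $a \neq 0$."
)
print(extract_formulas(sample))
```

Keeping formulas as separate fields lets them be stored on the atom verbatim instead of being mangled by plain-text OCR.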

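📊 Performance metrics

Metric	Target	Description
Atom extraction completeness	>95%	No knowledge points are lost
Methodology extraction accuracy	>90%	Executable methods are correctly identified
Retrieval recall	>85%	Relevant knowledge can be retrieved
Processing speed	50 pages/minute	PDF processing throughput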
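
The listing does not say how retrieval recall is measured. A common definition, assumed here, is recall@k: the fraction of atoms labeled relevant for a query that appear among the top k retrieved results. The function name and the sample IDs are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant atom IDs found in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = set(retrieved[:k]) & relevant
    return len(hits) / len(relevant)


retrieved_ids = ["atom-3", "atom-7", "atom-1", "atom-9"]
relevant_ids = {"atom-3", "atom-1", "atom-5"}
print(recall_at_k(retrieved_ids, relevant_ids, k=3))
```

Averaging this value over a labeled set of test queries gives a single number that can be compared against the >85% target.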

🚀 Usage example

from atomic_rag import AtomicRAGBuilder, MultiRecallRAG

# Build the knowledge base from a PDF
builder = AtomicRAGBuilder(domain="math")
atoms = builder.process_pdf("高等数学.pdf")
builder.store_to_vector_db(atoms, collection_name="math_kb")

# Ask a question through the RAG interface
rag = MultiRecallRAG()
answer = rag.ask("How do you solve a quadratic equation in one variable?")

📄 License

MIT License - free to use, contributions welcome!

Made with ❤️ by the Xue Lai Xue Qu AI team (学来学去AI团队)

Security Recommendations

This skill largely implements a reasonable PDF→RAG pipeline, but several red flags mean you should be careful before installing or running it:
- Check package.json: it contains an apparent GitHub token embedded in the repository URL. Treat that as a leaked secret: do not reuse it, and avoid trusting it. If you plan to use this code, remove/rotate the token and confirm it is not active.
- Expect to provide external credentials at runtime: an OpenAI (or another embedding provider) API key, and any vector DB credentials (Chroma/Milvus/Pinecone). The skill does not declare these, but the code will call embedding/vector services.
- System binaries are required but not declared: OCR (pytesseract) usually needs Tesseract installed on the host; pdf2image and PDF processing may need poppler. Install these in a controlled environment before running.
- Run in an isolated environment first (sandbox/VM/container) and inspect network activity: the code will perform network calls to embedding providers and possibly vector DBs. Monitor outbound connections and avoid processing sensitive documents until you confirm where data is sent.
- Review and test with non-sensitive PDFs: verify which external endpoints are contacted and what metadata/contents are transmitted (embeddings are generated by sending text to an embedding API).
- If you will use this for medical content, be aware this code extracts diagnostic/treatment steps. Ensure compliance with applicable regulations and have domain experts validate outputs.
Given the evidence (missing declared env vars, system deps, and an embedded token), treat this skill as suspicious until you fix/confirm the issues above.

Functional Analysis

Type: OpenClaw Skill
Name: atomic-rag-knowledge-base
Version: 1.0.0

The skill bundle provides a functional RAG (Retrieval-Augmented Generation) framework specialized for STEM document processing. However, it contains a critical security vulnerability: a hardcoded GitHub Personal Access Token (PAT) is exposed in the 'repository.url' field of the 'package.json' file. While this appears to be an unintentional credential leak by the developer rather than an intentional attack against the user, the presence of hardcoded secrets is a high-risk indicator. The core logic in 'builder.py' and 'rag.py' uses standard libraries (LangChain, pdfplumber) for its stated purpose without evidence of data exfiltration or malicious execution.
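
A check for this kind of leak can be automated. The sketch below is one possible approach, not the scanner used for this analysis: it parses a package.json string and reports whether the repository URL carries userinfo (a username, token, or password before the `@`). The function name and the placeholder URLs are hypothetical.

```python
import json
from urllib.parse import urlsplit


def has_embedded_credentials(package_json_text: str) -> bool:
    """True if the package.json repository URL contains userinfo."""
    data = json.loads(package_json_text)
    repo = data.get("repository")
    # "repository" may be an object with a "url" key or a bare string.
    url = repo.get("url", "") if isinstance(repo, dict) else (repo or "")
    parts = urlsplit(url)
    return bool(parts.username or parts.password)


leaky = '{"repository": {"type": "git", "url": "https://PLACEHOLDER_TOKEN@github.com/example/repo.git"}}'
clean = '{"repository": {"type": "git", "url": "https://github.com/example/repo.git"}}'
print(has_embedded_credentials(leaky), has_embedded_credentials(clean))
```

A check like this fits naturally into a pre-publish hook, so a token never reaches the registry in the first place.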

Capability Assessment

ℹ Purpose & Capability

Name, README, SKILL.md and the Python code implement a PDF→atomic-RAG pipeline (OCR, semantic chunking, domain processors, embeddings, storing to Chroma/Milvus). The requested libraries and processors are coherent with the stated purpose. However package.json contains a GitHub token-like string in the repository URL and requirements include several heavy components (Milvus, Pinecone, Chroma, OCR/system deps) that are plausible but not declared in the skill metadata/manifest.

⚠ Instruction Scope

SKILL.md demonstrates running builder.process_pdf() and storing to vector DBs, but the instructions do not mention required API keys (e.g., OpenAI for embeddings), vector DB credentials, or required system binaries (tesseract, poppler). The runtime code will read local PDF files (expected) and call external services (embedding provider, vector DBs) — those network operations are not documented in SKILL.md or the skill manifest. The omission grants the skill broad implicit network access and unspecified credential use.

ℹ Install Mechanism

There is no formal install spec in the registry (instruction-only style), but a requirements.txt and package.json are included, indicating Python dependencies. That itself is fine, but package.json contains an apparent GitHub personal access token embedded in the repository URL (exposes a secret-like string). Also some Python packages (pdf2image, pytesseract) require system-level binaries (poppler, tesseract) which are not declared as required binaries.

⚠ Credentials

The skill metadata lists no required environment variables or primary credential, yet the code uses LangChain's OpenAIEmbeddings (which requires an embedding provider API key at runtime, commonly OPENAI_API_KEY) and supports vector stores (Chroma, Milvus, Pinecone) that typically need credentials or endpoints. Additionally, the package.json contains a token-like string that is unrelated to the declared environment requirements. The manifest under-declares sensitive external credentials and system dependencies.
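
Until the manifest declares these requirements, a preflight check can surface them before any document is processed. This is a sketch under assumptions: `OPENAI_API_KEY`, Tesseract, and poppler are named in the analysis above, while the choice of `pdftoppm` as the poppler binary to probe and the `preflight` function itself are additions made for illustration.

```python
import os
import shutil
from typing import Callable, Mapping, Optional

REQUIRED_ENV = ["OPENAI_API_KEY"]            # embedding provider key
REQUIRED_BINARIES = ["tesseract", "pdftoppm"]  # OCR engine, poppler tool


def preflight(
    env: Optional[Mapping[str, str]] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return a list of missing prerequisites (empty if all are present)."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_ENV:
        if not env.get(name):
            problems.append(f"missing environment variable: {name}")
    for binary in REQUIRED_BINARIES:
        if which(binary) is None:
            problems.append(f"missing system binary: {binary}")
    return problems


for problem in preflight():
    print(problem)
```

The `env` and `which` parameters exist so the check can be exercised in tests without touching the real environment; vector DB credentials would need analogous entries depending on the store in use.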

✓ Persistence & Privilege

The skill is not marked always:true, is user-invocable, and does not modify other skills or system-wide agent settings. It writes data to user-specified outputs (JSON, vector DB) only when explicitly invoked. No elevated persistence or cross-skill modification behavior was detected.

Version History

v1.0.0

atomic-rag-knowledge-base v1.0.0
- Initial release of the atomic-rag-knowledge-base skill.
- Provides a tool for building retrieval-augmented generation (RAG) knowledge bases from PDF documents.
- Implements "atomic" knowledge units for improved context integrity, addressing key issues with conventional chunking methods.
- Features specialized processing for STEM and medical content, including formula and method extraction.
- Supports use cases in personal/enterprise knowledge management, education, and research.

Metadata

Slug atomic-rag-knowledge-base

Version 1.0.0

License MIT-0

Total installs 1

Current installs 1

Number of versions 1

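FAQ

What is the Atomic RAG Knowledge Base Builder?

Atomic RAG Knowledge Base Builder - help an AI truly learn a book rather than merely skim it. Specialized for science, engineering, agriculture, and medicine, with methodology extraction. The author bills it as the best open-source skill on the web for building a dedicated knowledge base. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 98 cumulative downloads to date.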

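How do I install the Atomic RAG Knowledge Base Builder?

Run the command "/install atomic-rag-knowledge-base" in an OpenClaw or Claude Code chat to install it in one step. The listing states that no additional configuration is needed, although the security analysis above notes that running the skill requires an embedding-provider API key, vector DB credentials, and system binaries such as Tesseract and poppler.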

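Is the Atomic RAG Knowledge Base Builder free?

Yes. The Atomic RAG Knowledge Base Builder is completely free and released under the MIT-0 license, so it can be downloaded, installed, and used freely.

Which platforms does the Atomic RAG Knowledge Base Builder support?

The Atomic RAG Knowledge Base Builder is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.

Who develops the Atomic RAG Knowledge Base Builder?

It is developed and maintained by SimonsTang (@simonstang). The current version is v1.0.0.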
