Description

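A pipeline for archiving and retrieving large documents. It converts Word/PDF/TXT/Markdown documents, chunks them, optionally enriches them with an LLM, and outputs structured Markdown and indexes suitable for storing in Obsidian or a knowledge base. Trigger phrases: 读大文档 (read a large document), 归档文档 (archive a document), junyi-doc-reader, doc-reader, 文档索引 (document index), 帮我读这个PDF (help me read this PDF), 把文档存到Obsid...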

README (SKILL.md)

junyi-doc-reader

Name: Junyi Doc Reader
Author: xuanranc

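A pipeline for archiving and retrieving large documents. It safely converts large documents to structured Markdown, builds a chunk index, and can optionally use an LLM to distill summaries.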

When to Use

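The user asks to archive, read, or index a large document (Word, PDF, TXT, Markdown)
The user asks to save a document to Obsidian
The document exceeds the context window limit and must be processed in chunks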

Supported Formats

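Format	Converter	Notes
.docx	pandoc	Recommended format; requires pandoc
.pdf	pdftotext	Requires poppler; scanned PDFs are not yet supported
.txt	Read directly	Encoding detected automatically (UTF-8/GBK)
.md	No conversion	Goes straight to chunking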

Feishu cloud documents: Feishu links are not supported directly. Export the document from Feishu as Word or PDF first, then process it.
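The format table above can be read as a dispatch on file extension. The following is a minimal sketch under stated assumptions: `conversion_command` and `decode_text` are hypothetical helper names, and the pandoc and pdftotext flags shown are one reasonable choice, not necessarily what the skill's converter script uses.

```python
from pathlib import Path


def conversion_command(src: Path, dst: Path):
    """Return the external command for converting src, or None if none is needed.

    Hypothetical helper: the flags are illustrative, not taken from the skill.
    """
    ext = src.suffix.lower()
    if ext == ".docx":
        return ["pandoc", str(src), "-t", "markdown", "-o", str(dst)]
    if ext == ".pdf":
        return ["pdftotext", "-layout", str(src), str(dst)]
    if ext in (".txt", ".md"):
        return None  # read directly, no external converter involved
    raise ValueError(f"unsupported format: {ext or src.name}")


def decode_text(raw: bytes) -> str:
    """Try UTF-8 first, then GBK, the two encodings listed for .txt input."""
    for encoding in ("utf-8", "gbk"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode input as UTF-8 or GBK")
```

Trying UTF-8 before GBK matters because GBK will accept many byte sequences that are not valid UTF-8, so the stricter codec should get the first attempt.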

Three Modes

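Mode	Description	Requires API
`archive-only`	Conversion + chunking + archiving of the original text	No
`archive+index`	The above + structured index	No
`archive+index+insights`	The above + LLM summaries/keywords/classification	Yes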

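Automatic downgrade rules:

DOC_READER_API_KEY not set → archive-only
DOC_READER_ALLOW_EXTERNAL=false (default) → documents are not sent to the LLM
API failure → completed artifacts are kept and the pipeline continues in a downgraded mode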
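One way to picture the downgrade rules is as a small pure function. This is a sketch only: `resolve_mode` is a hypothetical name, and the choice to fall back to `archive+index` when a key is present but external transmission is disabled is an assumption, since the rules above only say that documents are not sent to the LLM in that case.

```python
from typing import Optional

MODES = ("archive-only", "archive+index", "archive+index+insights")


def resolve_mode(requested: str, api_key: Optional[str], allow_external: bool) -> str:
    """Return the mode that would actually run (illustrative, not the skill's code)."""
    if requested not in MODES:
        raise ValueError(f"unknown mode: {requested}")
    if requested != "archive+index+insights":
        return requested  # modes without LLM calls need no downgrade
    if not api_key:
        return "archive-only"  # stated rule: no API key means archive-only
    if not allow_external:
        # Assumption: the non-LLM indexing step can still run locally.
        return "archive+index"
    return requested
```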

Usage

Single Command

python3 scripts/pipeline.py <input_file> --output <output_dir> \
  [--mode archive-only|archive+index|archive+index+insights] \
  [--split-by year|topic|chapter|none]

The script path is relative to the skill directory: ~/.openclaw/workspace/skills/junyi-doc-reader/
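For readers who want to see how the documented flags fit together, the command line above could be parsed roughly as follows. This is a sketch with `argparse`: the defaults (`None` for `--mode`, `"none"` for `--split-by`) are assumptions, and the real `pipeline.py` may define its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented CLI; defaults are assumptions, not the skill's."""
    parser = argparse.ArgumentParser(prog="pipeline.py")
    parser.add_argument("input_file", help="document to archive")
    parser.add_argument("--output", required=True, help="output directory")
    parser.add_argument(
        "--mode",
        choices=["archive-only", "archive+index", "archive+index+insights"],
        default=None,  # assumed: left unset so the mode can be derived from the environment
    )
    parser.add_argument(
        "--split-by",
        choices=["year", "topic", "chapter", "none"],
        default="none",  # assumed default
    )
    return parser
```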

Example

# Basic archiving
python3 ~/.openclaw/workspace/skills/junyi-doc-reader/scripts/pipeline.py \
  /path/to/document.docx \
  --output /path/to/obsidian/vault/document-name/

# With LLM enrichment + one file per chapter
DOC_READER_API_KEY="sk-xxx" DOC_READER_ALLOW_EXTERNAL=true \
python3 ~/.openclaw/workspace/skills/junyi-doc-reader/scripts/pipeline.py \
  /path/to/document.pdf \
  --output /path/to/obsidian/vault/document-name/ \
  --mode archive+index+insights \
  --split-by chapter

Environment Variables

Variable	Description	Default
`DOC_READER_API_KEY`	LLM API key	(none)
`DOC_READER_API_URL`	API endpoint	`https://api.openai.com/v1/chat/completions`
`DOC_READER_MODEL`	Model name	`claude-haiku-4-5-20251001`
`DOC_READER_ALLOW_EXTERNAL`	Whether documents may be sent to an external API	`false`
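The variables in the table could be collected into a single configuration object along these lines. `ReaderConfig` and `load_config` are hypothetical names, and treating only the string `true` (case-insensitive) as enabling external transmission is an assumption about how the flag is parsed.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ReaderConfig:
    api_key: Optional[str]
    api_url: str
    model: str
    allow_external: bool


def load_config(env: Mapping[str, str] = os.environ) -> ReaderConfig:
    """Read the documented variables, falling back to the documented defaults."""
    return ReaderConfig(
        api_key=env.get("DOC_READER_API_KEY") or None,
        api_url=env.get(
            "DOC_READER_API_URL", "https://api.openai.com/v1/chat/completions"
        ),
        model=env.get("DOC_READER_MODEL", "claude-haiku-4-5-20251001"),
        # Assumption: anything other than "true" keeps transmission disabled.
        allow_external=env.get("DOC_READER_ALLOW_EXTERNAL", "false").strip().lower()
        == "true",
    )
```

Passing the environment as a parameter keeps the function easy to exercise with a plain dictionary.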

Output Structure

output_dir/
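├── manifest.json          # Task metadata
├── source.md              # Full original text as Markdown
├── ROOT_INDEX.md          # Global navigation index
├── chunks.jsonl           # Chunk data (machine-readable)
├── processing_report.md   # Processing report
├── converted.md           # Intermediate conversion result
├── state.json             # State file (used for crash recovery)
├── parts/                 # Split files (generated only with --split-by)
│   ├── 2024.md
│   └── 2025.md
└── indexes/               # Layered indexes (insights mode only)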
    ├── by-year.md
    └── by-topic.md

Key Files for Agent Use

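ROOT_INDEX.md: read this first to understand the document structure
chunks.jsonl: precise retrieval and location; one JSON chunk per line
source.md: use when full-text search is needed
manifest.json: check processing status and warnings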
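An agent following the reading order above might first check which key files exist, then load the index and manifest. The sketch below uses hypothetical helpers (`missing_key_files`, `load_overview`) and makes no assumption about the fields inside manifest.json; it simply parses the file as JSON.

```python
import json
from pathlib import Path

KEY_FILES = ("ROOT_INDEX.md", "chunks.jsonl", "source.md", "manifest.json")


def missing_key_files(output_dir: Path) -> list:
    """List the key files that are absent from output_dir."""
    return [name for name in KEY_FILES if not (output_dir / name).is_file()]


def load_overview(output_dir: Path) -> dict:
    """Load the navigation index text and the parsed manifest."""
    return {
        "root_index": (output_dir / "ROOT_INDEX.md").read_text(encoding="utf-8"),
        "manifest": json.loads(
            (output_dir / "manifest.json").read_text(encoding="utf-8")
        ),
    }
```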

chunks.jsonl Format

{"chunk_id": "ch-0001", "heading_path": ["Chapter 1", "Introduction"], "char_start": 0, "char_end": 4500, "text": "..."}

Enriched chunks additionally have: summary, key_points, keywords, classification, confidence.
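Because each line is an independent JSON object, the file can be streamed and filtered without loading the whole document. A minimal sketch, assuming only the fields shown in the format above; `parse_chunks` and `find_chunks` are hypothetical helper names.

```python
import json
from typing import Iterable, Iterator, List, Optional


def parse_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one chunk dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def find_chunks(
    chunks: Iterable[dict],
    heading: Optional[str] = None,
    keyword: Optional[str] = None,
) -> List[dict]:
    """Keep chunks whose heading_path contains heading and whose text contains keyword."""
    results = []
    for chunk in chunks:
        if heading is not None and heading not in chunk.get("heading_path", []):
            continue
        if keyword is not None and keyword not in chunk.get("text", ""):
            continue
        results.append(chunk)
    return results
```

With a real output directory, the lines would come from `open("chunks.jsonl", encoding="utf-8")`, and the `char_start`/`char_end` offsets of each hit can be used to locate the passage in source.md.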

Crash Recovery

The pipeline automatically saves progress to state.json. If a run is interrupted, rerun the same command to resume from the last completed step.
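The resume behaviour can be illustrated with a small checkpointing loop. This is a sketch of the general technique, not the skill's implementation: the `completed` key and the `run_with_checkpoints` function are hypothetical, and the actual layout of state.json is not documented here.

```python
import json
from pathlib import Path
from typing import Callable, List, Sequence, Tuple


def run_with_checkpoints(
    steps: Sequence[Tuple[str, Callable[[], None]]], state_path: Path
) -> List[str]:
    """Run steps in order, skipping those already recorded; return the names run now."""
    state = {"completed": []}
    if state_path.exists():
        state = json.loads(state_path.read_text(encoding="utf-8"))
    ran = []
    for name, func in steps:
        if name in state["completed"]:
            continue  # finished in an earlier run
        func()  # an exception here leaves the earlier checkpoints intact
        state["completed"].append(name)
        # Write to a temporary file, then swap it in, so an interruption
        # during the write does not leave a truncated state file.
        tmp = state_path.with_suffix(".tmp")
        tmp.write_text(json.dumps(state), encoding="utf-8")
        tmp.replace(state_path)
        ran.append(name)
    return ran
```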

Dependencies

Python 3.9+
pandoc (for .docx; brew install pandoc)
poppler (for .pdf; brew install poppler)
No third-party Python packages required (uses stdlib urllib)
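Before starting a run, it can help to check that the tool for the input format is on the PATH. The sketch below uses `shutil.which`; `REQUIRED_TOOLS`, `missing_tool`, and `python_ok` are hypothetical names, and the lookup function is injectable only so the check can be exercised without the tools installed.

```python
import shutil
import sys

# Extension to required executable, following the dependency list above.
REQUIRED_TOOLS = {".docx": "pandoc", ".pdf": "pdftotext"}


def missing_tool(extension: str, which=shutil.which):
    """Return the name of the missing executable, or None if nothing is missing."""
    tool = REQUIRED_TOOLS.get(extension.lower())
    if tool is None or which(tool):
        return None
    return tool


def python_ok(version=sys.version_info) -> bool:
    """Check the documented minimum of Python 3.9."""
    return tuple(version[:2]) >= (3, 9)
```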

Agent Workflow

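Confirm the path of the file the user wants to process and the target directory
Check that the file format is supported
Choose the mode based on whether an API key is configured
Run python3 scripts/pipeline.py once to complete all steps
Check manifest.json to confirm the status
Report to the user: how many chunks were processed, which files were generated, and any warnings
To write to Obsidian, copy the contents of output_dir to the target vault path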
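The run and copy steps of this workflow could be scripted roughly as follows. Both helpers (`build_command`, `publish_to_vault`) are hypothetical; the sketch only assembles the documented command line and copies the output directory, and it does not inspect manifest.json because its fields are not specified here.

```python
import shutil
import sys
from pathlib import Path
from typing import List, Optional


def build_command(
    skill_dir: Path,
    input_file: Path,
    output_dir: Path,
    mode: Optional[str] = None,
    split_by: Optional[str] = None,
) -> List[str]:
    """Assemble the documented pipeline invocation as an argument list."""
    cmd = [
        sys.executable,
        str(skill_dir / "scripts" / "pipeline.py"),
        str(input_file),
        "--output",
        str(output_dir),
    ]
    if mode:
        cmd += ["--mode", mode]
    if split_by:
        cmd += ["--split-by", split_by]
    return cmd


def publish_to_vault(output_dir: Path, vault_dir: Path) -> List[str]:
    """Copy the pipeline output into the vault and return the top-level names copied."""
    shutil.copytree(output_dir, vault_dir, dirs_exist_ok=True)
    return sorted(p.name for p in vault_dir.iterdir())
```

The list returned by `build_command` can be passed to `subprocess.run(cmd, check=True)`; using an argument list rather than a shell string avoids quoting problems with paths that contain spaces or non-ASCII characters.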

Usage Guidance

This skill is internally consistent with its stated purpose. Two practical things to check before using it:

(1) Enrichment mode will send chunks of your document to whatever API endpoint and key you configure. By default DOC_READER_ALLOW_EXTERNAL is false, so enrichment is disabled unless you explicitly set DOC_READER_ALLOW_EXTERNAL=true and supply DOC_READER_API_KEY. Only enable that for non-sensitive documents or when you trust the target LLM provider.

(2) Confirm the API endpoint and model: the default URL is an OpenAI-style endpoint but the default model name looks like a Claude model, so set DOC_READER_API_URL and DOC_READER_MODEL to values that match your provider.

Also, the converter steps may call pandoc and pdftotext (poppler), which are optional system dependencies. The scripts write state.json, manifest.json, converted.md, chunks.jsonl, and other output files into your chosen output_dir; review those files and the target Obsidian vault path before copying. If you want to avoid any network transmission, leave DOC_READER_API_KEY unset and keep DOC_READER_ALLOW_EXTERNAL=false (the pipeline will downgrade to archive-only or archive+index mode).

Capability Analysis

Type: OpenClaw Skill
Name: junyi-doc-reader
Version: 1.0.0

The junyi-doc-reader skill bundle provides a structured pipeline for converting, chunking, and indexing large documents (PDF, Word, Markdown). It utilizes standard system utilities like `pandoc` and `pdftotext` via safe `subprocess` calls and performs optional LLM enrichment through Python's `urllib` library. The code includes robust state management for crash recovery and explicitly gates external data transmission to LLM APIs behind a `DOC_READER_ALLOW_EXTERNAL` environment variable, showing no signs of malicious intent or unauthorized data exfiltration.

Capability Assessment

✓ Purpose & Capability

Name/description (document conversion, chunking, optional LLM enrichment, Obsidian output) align with the provided scripts (converter, chunker, enricher, assembler, pipeline). Optional system binaries (pandoc, pdftotext/poppler) are appropriate for converting .docx/.pdf. No unexpected services or credentials are required by default.

ℹ Instruction Scope

SKILL.md and pipeline instructions stay within the stated purpose: they read the supplied input file, convert it, split into chunks, optionally call an external LLM, and write outputs into the specified output_dir. One important behavior: enrichment mode will transmit document chunks to the configured API endpoint (DOC_READER_API_URL) when DOC_READER_ALLOW_EXTERNAL=true and an API key is provided — the README and code do document this, but users should note this explicit external data transmission.

✓ Install Mechanism

No install spec; the skill is instruction+script only (no downloads or installers). The Python scripts use only the stdlib for network calls. System dependencies (pandoc, pdftotext/poppler) are optional and are standard tools for document conversion.

ℹ Credentials

No required env vars by registry metadata; enrichment requires DOC_READER_API_KEY and optional DOC_READER_API_URL/DOC_READER_MODEL and DOC_READER_ALLOW_EXTERNAL. Those env vars are proportionate to optional LLM enrichment. Minor inconsistency to be aware of: default DOC_READER_API_URL is an OpenAI-compatible endpoint while the default DOC_READER_MODEL string references a 'claude' style model — this is a configuration mismatch that requires the user to set correct API_URL/MODEL for their provider.

✓ Persistence & Privilege

The skill does not request forced/always-enabled execution. It writes state.json, manifest.json and output files inside the user-specified output_dir for crash recovery and auditing — expected behavior for a pipeline. It does not modify system-wide agent settings or other skills.

Version History

v1.0.0

v1.0.0: Large-document archiving and retrieval pipeline; supports Word/PDF/TXT/MD, three run modes (archive-only/+index/+insights), chunk indexing, and crash recovery

Metadata

Slug junyi-doc-reader

Version 1.0.0

License MIT-0

All-time Installs 0

Active Installs 0

Total Versions 1

Frequently Asked Questions

What is Junyi Doc Reader?

A pipeline for archiving and retrieving large documents: it converts Word/PDF/TXT/Markdown documents, chunks them, optionally enriches them with an LLM, and outputs structured Markdown and indexes suitable for Obsidian or a knowledge base. It is an AI Agent Skill for Claude Code / OpenClaw, with 215 downloads so far.

How do I install Junyi Doc Reader?

Run "/install junyi-doc-reader" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Junyi Doc Reader free?

Yes, Junyi Doc Reader is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Junyi Doc Reader support?

Junyi Doc Reader is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Junyi Doc Reader?

It is built and maintained by XuanranC (@xuanranc); the current version is v1.0.0.
