← Back to Skills Marketplace
xuanranc

Junyi Doc Reader

by XuanranC · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ✓ Security Clean
215
Downloads
0
Stars
0
Active Installs
1
Versions
Install in OpenClaw
/install junyi-doc-reader
Description
大文档归档与检索管线。将 Word/PDF/TXT/Markdown 文档转换、分块、可选 LLM 增强,输出结构化 Markdown 和索引,适合存入 Obsidian 或知识库。触发词:读大文档、归档文档、junyi-doc-reader、doc-reader、文档索引、帮我读这个PDF、把文档存到Obsid...
README (SKILL.md)

junyi-doc-reader

大文档归档与检索管线。将大文档安全地转为结构化 Markdown,生成分块索引,可选 LLM 提炼摘要。

When to Use

  • 用户要求归档/阅读/索引一个大文档(Word、PDF、TXT、Markdown)
  • 用户要求把文档存到 Obsidian
  • 文档超过 context 窗口限制,需要分块处理

Supported Formats

格式 转换工具 备注
.docx pandoc 推荐格式,需安装 pandoc
.pdf pdftotext 需安装 poppler,扫描件暂不支持
.txt 直接读取 自动检测编码(UTF-8/GBK)
.md 跳过转换 直接进入分块

飞书云文档:不直接支持飞书链接。请先在飞书中导出为 Word 或 PDF 再处理。

Three Modes

模式 说明 需要 API
archive-only 转换 + 分块 + 原文归档
archive+index 上述 + 结构化索引
archive+index+insights 上述 + LLM 摘要/关键词/分类

自动降级规则:

  • 未设 DOC_READER_API_KEY → archive-only
  • DOC_READER_ALLOW_EXTERNAL=false(默认)→ 不外发文档给 LLM
  • API 失败 → 保留已完成产物,降级继续

Usage

Single Command

python3 scripts/pipeline.py \x3Cinput_file> --output \x3Coutput_dir> \
  [--mode archive-only|archive+index|archive+index+insights] \
  [--split-by year|topic|chapter|none]

脚本路径相对于 skill 目录: ~/.openclaw/workspace/skills/junyi-doc-reader/

Example

# 基础归档
python3 ~/.openclaw/workspace/skills/junyi-doc-reader/scripts/pipeline.py \
  /path/to/document.docx \
  --output /path/to/obsidian/vault/文档名/

# 带 LLM 增强 + 按章节分文件
DOC_READER_API_KEY="sk-xxx" DOC_READER_ALLOW_EXTERNAL=true \
python3 ~/.openclaw/workspace/skills/junyi-doc-reader/scripts/pipeline.py \
  /path/to/document.pdf \
  --output /path/to/obsidian/vault/文档名/ \
  --mode archive+index+insights \
  --split-by chapter

Environment Variables

变量 说明 默认值
DOC_READER_API_KEY LLM API 密钥 (无)
DOC_READER_API_URL API endpoint https://api.openai.com/v1/chat/completions
DOC_READER_MODEL 模型名 claude-haiku-4-5-20251001
DOC_READER_ALLOW_EXTERNAL 是否允许外发文档 false

Output Structure

output_dir/
├── manifest.json          # 任务元数据
├── source.md              # 完整原文 Markdown
├── ROOT_INDEX.md          # 全局导航目录
├── chunks.jsonl           # 分块数据(机器可读)
├── processing_report.md   # 处理报告
├── converted.md           # 中间转换结果
├── state.json             # 状态文件(用于断点恢复)
├── parts/                 # 分文件(仅 --split-by 时生成)
│   ├── 2024.md
│   └── 2025.md
└── indexes/               # 分层索引(仅 insights 模式)
    ├── by-year.md
    └── by-topic.md

Key Files for Agent Use

  • ROOT_INDEX.md — 先读这个了解文档结构
  • chunks.jsonl — 精确检索定位,每行一个 JSON chunk
  • source.md — 需要全文搜索时使用
  • manifest.json — 查看处理状态和警告

chunks.jsonl Format

{"chunk_id": "ch-0001", "heading_path": ["第一章", "引言"], "char_start": 0, "char_end": 4500, "text": "..."}

Enriched chunks additionally have: summary, key_points, keywords, classification, confidence.

Crash Recovery

Pipeline 自动保存进度到 state.json。如果中断,重新运行相同命令即可从上次完成的步骤恢复。

Dependencies

  • Python 3.9+
  • pandoc(处理 .docx,brew install pandoc
  • poppler(处理 .pdf,brew install poppler
  • 无第三方 Python 包依赖(使用 stdlib urllib)

Agent Workflow

  1. 确认用户要处理的文件路径和目标目录
  2. 检查文件格式是否支持
  3. 根据是否配置了 API key 确定模式
  4. 运行 python3 scripts/pipeline.py 一次完成所有步骤
  5. 检查 manifest.json 确认状态
  6. 向用户报告:处理了多少块、生成了哪些文件、有无警告
  7. 如需写入 Obsidian,将 output_dir 内容复制到 vault 目标路径
Usage Guidance
This skill is internally consistent with its stated purpose. Two practical things to check before using it: (1) Enrichment mode will send chunks of your document to whatever API endpoint and key you configure — by default DOC_READER_ALLOW_EXTERNAL is false, so enrichment is disabled unless you explicitly set DOC_READER_ALLOW_EXTERNAL=true and supply DOC_READER_API_KEY. Only enable that for non-sensitive documents or when you trust the target LLM provider. (2) Confirm the API endpoint and model: the default URL is an OpenAI-style endpoint but the default model name looks like a Claude model — set DOC_READER_API_URL and DOC_READER_MODEL to values that match your provider. Also: converter steps may call pandoc/pdftotext (poppler) which are optional system dependencies. The scripts write state.json, manifest.json, converted.md, chunks.jsonl, and other output files into your chosen output_dir — review those files and the target Obsidian vault path before copying. If you want to avoid any network transmission, leave DOC_READER_API_KEY unset and keep DOC_READER_ALLOW_EXTERNAL=false (the pipeline will downgrade to archive-only or archive+index modes).
Capability Analysis
Type: OpenClaw Skill Name: junyi-doc-reader Version: 1.0.0 The junyi-doc-reader skill bundle provides a structured pipeline for converting, chunking, and indexing large documents (PDF, Word, Markdown). It utilizes standard system utilities like `pandoc` and `pdftotext` via safe `subprocess` calls and performs optional LLM enrichment through Python's `urllib` library. The code includes robust state management for crash recovery and explicitly gates external data transmission to LLM APIs behind a `DOC_READER_ALLOW_EXTERNAL` environment variable, showing no signs of malicious intent or unauthorized data exfiltration.
Capability Assessment
Purpose & Capability
Name/description (document conversion, chunking, optional LLM enrichment, Obsidian output) align with the provided scripts (converter, chunker, enricher, assembler, pipeline). Optional system binaries (pandoc, pdftotext/poppler) are appropriate for converting .docx/.pdf. No unexpected services or credentials are required by default.
Instruction Scope
SKILL.md and pipeline instructions stay within the stated purpose: they read the supplied input file, convert it, split into chunks, optionally call an external LLM, and write outputs into the specified output_dir. One important behavior: enrichment mode will transmit document chunks to the configured API endpoint (DOC_READER_API_URL) when DOC_READER_ALLOW_EXTERNAL=true and an API key is provided — the README and code do document this, but users should note this explicit external data transmission.
Install Mechanism
No install spec; the skill is instruction+script only (no downloads or installers). The Python scripts use only the stdlib for network calls. System dependencies (pandoc, pdftotext/poppler) are optional and are standard tools for document conversion.
Credentials
No required env vars by registry metadata; enrichment requires DOC_READER_API_KEY and optional DOC_READER_API_URL/DOC_READER_MODEL and DOC_READER_ALLOW_EXTERNAL. Those env vars are proportionate to optional LLM enrichment. Minor inconsistency to be aware of: default DOC_READER_API_URL is an OpenAI-compatible endpoint while the default DOC_READER_MODEL string references a 'claude' style model — this is a configuration mismatch that requires the user to set correct API_URL/MODEL for their provider.
Persistence & Privilege
The skill does not request forced/always-enabled execution. It writes state.json, manifest.json and output files inside the user-specified output_dir for crash recovery and auditing — expected behavior for a pipeline. It does not modify system-wide agent settings or other skills.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install junyi-doc-reader
  3. After installation, invoke the skill by name or use /junyi-doc-reader
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
v1.0.0: 大文档归档与检索管线,支持 Word/PDF/TXT/MD,三档运行模式(archive-only/+index/+insights),分块索引,断点恢复
Metadata
Slug junyi-doc-reader
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is Junyi Doc Reader?

大文档归档与检索管线。将 Word/PDF/TXT/Markdown 文档转换、分块、可选 LLM 增强,输出结构化 Markdown 和索引,适合存入 Obsidian 或知识库。触发词:读大文档、归档文档、junyi-doc-reader、doc-reader、文档索引、帮我读这个PDF、把文档存到Obsid... It is an AI Agent Skill for Claude Code / OpenClaw, with 215 downloads so far.

How do I install Junyi Doc Reader?

Run "/install junyi-doc-reader" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Junyi Doc Reader free?

Yes, Junyi Doc Reader is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Junyi Doc Reader support?

Junyi Doc Reader is cross-platform and runs anywhere OpenClaw / Claude Code is available (cross-platform).

Who created Junyi Doc Reader?

It is built and maintained by XuanranC (@xuanranc); the current version is v1.0.0.

💬 Comments