Universal Document Ingestion Router

Name: Universal Document Ingestion Router
Author: hollis9087

Version v0.1.1 · License MIT-0

cross-platform ✓ Security check passed

Install in OpenClaw

/install universal-document-ingestion-router

Description

Document parsing and knowledge-base import router.

Usage (SKILL.md)

Universal Document Ingestion Router

Use this skill whenever a task involves document parsing, document ingestion, knowledge-base import preparation, or routing files to suitable parsers.

Short alias: doc-router.

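Chinese trigger phrases (English glosses in parentheses):

文档解析 (document parsing)
文件解析 (file parsing)
知识库入库前处理 (pre-processing before knowledge-base import)
把文件放进知识库 (put the files into the knowledge base)
研报解析 (research report parsing)
PDF/Word/PPT/Excel/图片解析 (PDF/Word/PPT/Excel/image parsing)
文档摄取 (document ingestion)
文档转知识库格式 (convert documents to knowledge-base format)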

Strict Scope

This skill does only the following:

Classify the file or document unit.
Choose or recommend the right local parser.
Run parser adapters when available.
Emit standardized parsed output.

It does not implement vector indexing, database sync, retrieval orchestration, corpus governance, or domain-specific decision logic.

When Agents Should Remember This Skill

Agents should consider this skill automatically when building or modifying systems that need to ingest files into a knowledge base, including:

investment post-deal document management systems
research report retrieval systems
investment decision support systems
file upload pipelines
document search features
RAG corpus construction workflows
batch parsing jobs for PDF, Word, PPT, Excel, CSV, Markdown, text, HTML, or images

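If the user says anything like "把这些文件集成到知识库" (integrate these files into the knowledge base), "解析这些文件" (parse these files), "做文档入库" (ingest documents), "研报内容检索" (search research report content), or "系统需要读取上传的文档" (the system needs to read uploaded documents), use this skill as the front-end classifier/router before downstream indexing.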

CLI

Run from this skill directory or use the script path directly:

python scripts/document_classifier_router.py capabilities
python scripts/document_classifier_router.py classify --input path/to/file.pdf
python scripts/document_classifier_router.py parse --input path/to/file.pdf --output out/parsed
python scripts/document_classifier_router.py batch --input-dir path/to/files --output out/batch --copy-sources
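
For callers that drive the CLI from another program, the four invocations above can be assembled as argument lists. This is a minimal sketch that uses only the subcommands and flags shown above; `build_command` and `SCRIPT` are hypothetical helper names, not part of the skill.

```python
# Script path as shown in the commands above, relative to the skill directory.
SCRIPT = "scripts/document_classifier_router.py"
SUBCOMMANDS = {"capabilities", "classify", "parse", "batch"}


def build_command(subcommand, input_path=None, output=None, copy_sources=False):
    """Return an argv list for one of the four documented subcommands."""
    if subcommand not in SUBCOMMANDS:
        raise ValueError(f"unknown subcommand: {subcommand}")
    argv = ["python", SCRIPT, subcommand]
    if subcommand == "capabilities":
        return argv
    if input_path is None:
        raise ValueError(f"{subcommand} needs an input path")
    # batch takes a directory via --input-dir; classify and parse take --input.
    flag = "--input-dir" if subcommand == "batch" else "--input"
    argv += [flag, str(input_path)]
    if output is not None:
        argv += ["--output", str(output)]
    if copy_sources:
        # The examples above show --copy-sources only with batch.
        if subcommand != "batch":
            raise ValueError("--copy-sources is only shown for batch")
        argv.append("--copy-sources")
    return argv
```

The resulting list can be passed to a process runner such as `subprocess.run` without shell quoting.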

Outputs

document.json: canonical parsed manifest, always emitted for parse attempts.
document.md: readable normalized content when extraction succeeds.
chunks.jsonl: retrieval-ready chunks when chunking is enabled.
tables/: only when reliable tables are extracted.
batch_summary.json: emitted by batch mode.
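
A downstream step may want to check which of these artifacts a parse actually produced before indexing anything. The sketch below assumes the artifacts sit directly under the output directory of a single parse. It does not read any field of document.json, because the manifest layout is not described here. `inspect_output` is a hypothetical helper name.

```python
import json
from pathlib import Path

# Artifact names taken from the list above.
ARTIFACTS = ["document.json", "document.md", "chunks.jsonl", "tables"]


def inspect_output(out_dir):
    """Report which documented artifacts exist and load the JSON ones."""
    out = Path(out_dir)
    present = {name: (out / name).exists() for name in ARTIFACTS}
    manifest = None
    if present["document.json"]:
        manifest = json.loads((out / "document.json").read_text(encoding="utf-8"))
    chunks = []
    if present["chunks.jsonl"]:
        # JSON Lines: one JSON value per non-empty line.
        with (out / "chunks.jsonl").open(encoding="utf-8") as fh:
            chunks = [json.loads(line) for line in fh if line.strip()]
    return present, manifest, chunks
```

Because document.md, chunks.jsonl, and tables/ are conditional, a caller should treat their absence as normal and consult the manifest for the outcome.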

Parser Routing

Text PDF: markitdown, fallback pymupdf, fallback pypdf.
Scanned PDF or image: PaddleOCR; if it is not installed, emit a dependency recommendation.
DOCX: markitdown, fallback python-docx.
PPTX: markitdown, fallback python-pptx.
XLSX/CSV: openpyxl or built-in CSV extraction.
Legacy .doc/.ppt/.xls: needs LibreOffice; recommend installing it when it is unavailable.
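
The fallback chains above can be pictured as an ordered lookup that returns the first installed parser. This is an illustrative sketch, not the script's actual implementation. The kind labels, `ROUTES`, `IMPORT_NAMES`, `is_installed`, and `choose_parser` are hypothetical names. The import names are the modules these packages usually install under, which is an assumption. XLSX and CSV are split into separate entries as one reading of the list, and the legacy formats are left out.

```python
import importlib.util

# Fallback chains transcribed from the routing list above.
ROUTES = {
    "text_pdf": ["markitdown", "pymupdf", "pypdf"],
    "scanned_pdf_or_image": ["paddleocr"],
    "docx": ["markitdown", "python-docx"],
    "pptx": ["markitdown", "python-pptx"],
    "xlsx": ["openpyxl"],
    "csv": ["builtin-csv"],
}

# Assumed import names for each package.
IMPORT_NAMES = {
    "markitdown": "markitdown",
    "pymupdf": "fitz",
    "pypdf": "pypdf",
    "paddleocr": "paddleocr",
    "python-docx": "docx",
    "python-pptx": "pptx",
    "openpyxl": "openpyxl",
}


def is_installed(parser):
    """Check whether the parser's module can be found, without importing it."""
    if parser == "builtin-csv":
        return True  # the csv module ships with Python
    return importlib.util.find_spec(IMPORT_NAMES[parser]) is not None


def choose_parser(kind, available=is_installed):
    """Return the first usable parser for kind, else recommend the preferred one."""
    chain = ROUTES[kind]
    for parser in chain:
        if available(parser):
            return {"parser": parser, "recommend": None}
    return {"parser": None, "recommend": chain[0]}
```

Passing `available` as a parameter keeps the routing decision testable without installing any of the parsers.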

Safety

Never overwrite or modify source files.
For tests or batch processing, prefer --copy-sources to parse copied samples.
Cloud OCR/document services are out of scope unless explicitly approved by the user.
If extraction quality is poor, mark the result blocked_or_failed or record warnings rather than reporting success.
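
A caller that wants the same protection outside batch mode could copy each file to a staging directory and parse the copy. The sketch below shows that idea only; `stage_copy` is a hypothetical helper and does not describe how --copy-sources is implemented.

```python
import shutil
from pathlib import Path


def stage_copy(source, staging_dir):
    """Copy source into staging_dir and return the copy's path.

    The original file is only read, never written.
    """
    src = Path(source)
    dest_dir = Path(staging_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    # Refuse to stage a file onto itself.
    if dest.resolve() == src.resolve():
        raise ValueError("staging directory must differ from the source directory")
    shutil.copy2(src, dest)  # copy2 also preserves timestamps where possible
    return dest
```

The returned path is what would then be handed to the parse step, so any parser side effects land on the copy.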

Cross-Agent Use

This skill is intentionally a plain CLI script with JSON output so OpenClaw, Hermes, Codex, Claude Code, or any other agent can call it through a shell/process runner without OpenClaw-specific APIs.

For agents that do not load skills by name, use the short alias doc-router and point them to:

skills/universal-document-ingestion-router/scripts/document_classifier_router.py
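
An agent without skill loading could call the script through an ordinary process runner. The sketch below assumes the script prints a single JSON document to stdout and exits non-zero on failure; neither detail is confirmed here. `run_json` and `capabilities` are hypothetical helper names.

```python
import json
import subprocess
import sys

# Script path as given above.
SCRIPT = "skills/universal-document-ingestion-router/scripts/document_classifier_router.py"


def run_json(argv, timeout=300):
    """Run a command and parse its stdout as JSON; raise if the command fails."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    if proc.returncode != 0:
        raise RuntimeError(f"command failed ({proc.returncode}): {proc.stderr.strip()}")
    return json.loads(proc.stdout)


def capabilities(script=SCRIPT):
    """Call the documented capabilities subcommand with the current interpreter."""
    return run_json([sys.executable, script, "capabilities"])
```

Using an argument list instead of a shell string avoids quoting problems with file names that contain spaces.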

References

Read references/development-report.md for implementation/test results and references/architecture.md for the boundary and adapter model.

Safe usage advice

Install this only if you want an agent to parse local documents into knowledge-base-ready outputs. When using batch mode, point it at a narrow directory and review outputs before indexing, because parsed content and optional source copies may include private document data.

Capability assessment

✓ Purpose & Capability

The stated purpose is document classification, parser routing, and parsed-output generation; the included Python CLI performs those functions on user-supplied files and directories.

ℹ Instruction Scope

The skill encourages automatic use for document ingestion and search workflows, which is broad, but the triggers remain tied to document parsing and knowledge-base preparation rather than unrelated tasks.

✓ Install Mechanism

The artifact contains markdown references and one Python script; there are no install hooks, package-manager commands, or hidden setup steps.

ℹ Credentials

It uses exec and local parser libraries to read documents, write parsed outputs, and optionally copy sources during batch runs; this is proportionate for ingestion but should be scoped to intended input directories.

✓ Persistence & Privilege

The script writes only requested output artifacts under the provided output path and shows no background workers, privilege escalation, credential/session access, or persistent modification of agent behavior.

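How to use

Make sure OpenClaw is installed (locally or via Docker).
In the chat box, enter the install command: /install universal-document-ingestion-router
Once installation finishes, invoke the Skill by name or trigger it with /universal-document-ingestion-router
Provide the inputs described in the Skill's parameter documentation to get structured output.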

Version history

v0.1.1

Improve trigger guidance with doc-router alias and agent integration notes

v0.1.0

Initial lightweight classifier/router with batch mode and standardized parsed outputs

Metadata

Slug universal-document-ingestion-router

Version 0.1.1

License MIT-0

Total installs 0

Current installs 0

Number of versions 2

FAQ