kounlong

Doc to JSON

Author: 梁辉盛 · GitHub · v1.0.0 · MIT-0
cross-platform ⚠ suspicious
Total downloads: 57
Favorites: 0
Current installs: 0
Versions: 1
Install in OpenClaw
/install doc-to-json
Description
Convert documents (docx, doc, PDF, xlsx, xls) to structured JSON via MinerU. Full pipeline: file to mineru-open-api extract to Markdown then to JSON. Use whe...
Usage (SKILL.md)

Doc to JSON

Convert office documents to structured JSON using MinerU as the extraction engine.

Supported Formats

  • .doc / .docx — Word documents
  • .pdf — PDF files
  • .xlsx / .xls — Excel spreadsheets

Prerequisites

  • mineru-open-api CLI must be installed (v0.5+)
  • MINERU_TOKEN environment variable must be set
  • Check: mineru-open-api version
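
The prerequisites above can also be checked from Python before starting a conversion. This is a minimal sketch, assuming only what the list states: the CLI is on PATH as mineru-open-api and the token is in MINERU_TOKEN. The helper name check_prerequisites is hypothetical and not part of the skill's scripts, and the sketch does not verify the v0.5+ version requirement.

```python
import os
import shutil


def check_prerequisites(env=None, which=shutil.which):
    """Return a list of problems; an empty list means both prerequisites are met.

    check_prerequisites is a hypothetical helper, not part of the skill.
    """
    env = os.environ if env is None else env
    problems = []
    # shutil.which returns None when the executable is not on PATH.
    if which("mineru-open-api") is None:
        problems.append("mineru-open-api CLI not found on PATH")
    # Treat an empty token the same as a missing one.
    if not env.get("MINERU_TOKEN"):
        problems.append("MINERU_TOKEN environment variable is not set")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("missing:", problem)
```

The env and which parameters exist only so the check can be exercised without touching the real environment.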

Quick Usage

# Full pipeline: document -> MinerU Markdown -> JSON
python3 scripts/doc_to_json.py /path/to/file.docx -o output.json

# Keep temp files for debugging
python3 scripts/doc_to_json.py /path/to/file.pdf -o out.json --keep-temp

Manual Two-Step Pipeline

If the full pipeline script fails, run steps manually:

Step 1: MinerU Extract

export MINERU_TOKEN="your_token"
mineru-open-api extract input_file.pdf -o /tmp/mineru_out/

Output: .md file in the output directory.
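
Step 1 can also be driven from Python. The sketch below reuses the exact command shown above and passes it as an argument list. Because the name of the generated .md file is not documented here, it searches the output directory instead of assuming one. The function names are hypothetical, and MINERU_TOKEN is assumed to be set in the environment the child process inherits.

```python
import subprocess
from pathlib import Path


def build_extract_command(input_file, out_dir):
    """Build the argv list for the Step 1 command shown above."""
    return ["mineru-open-api", "extract", str(input_file), "-o", str(out_dir)]


def find_markdown(out_dir):
    """Return every .md file under out_dir, sorted by path."""
    return sorted(Path(out_dir).rglob("*.md"))


def extract(input_file, out_dir):
    """Run Step 1 and return the Markdown files it produced."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the CLI exits with a non-zero code.
    subprocess.run(build_extract_command(input_file, out_dir), check=True)
    return find_markdown(out_dir)
```

A call such as extract("input_file.pdf", "/tmp/mineru_out/") would return the list of Markdown paths to hand to Step 2.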

Step 2: Markdown -> JSON

python3 scripts/markdown_to_json.py /tmp/mineru_out/output.md -o output.json

JSON Structure

The output JSON preserves:

  • Metadata fields — course name, code, credits, hours, etc. (extracted from plain text)
  • Heading hierarchy — 一、二、三... sections become nested keys
  • Tables — stored as array of arrays (row cells), keyed as "表格"
  • Numbered lists — stored as array of strings under section title
  • Paragraph text — merged into "text" field per section
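
To make the list above concrete, here is an illustrative shape for one converted document. It is a hypothetical sketch, not output from a real run: only the "表格" and "text" keys come from the description above, while the metadata key names, section titles, and values are invented for illustration.

```python
import json

# Hypothetical example of the output shape. Only "表格" and "text" are key
# names taken from the description; everything else is illustrative.
example = {
    "course_name": "Example Course",  # metadata field (key name assumed)
    "credits": "3",                   # metadata field (key name assumed)
    "一、Course Objectives": {         # heading becomes a nested key
        "text": "Merged paragraph text for this section.",
        "表格": [                      # table as an array of row arrays
            ["Week", "Topic"],
            ["1", "Introduction"],
        ],
    },
    # A numbered list stored as an array of strings under the section title.
    "二、Assessment": ["Homework 40%", "Final exam 60%"],
}

# ensure_ascii=False keeps Chinese keys such as "表格" readable in the output.
serialized = json.dumps(example, ensure_ascii=False, indent=2)
print(serialized)
```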

For Knowledge Base Preparation

After JSON conversion, common next steps:

  1. Chunk by section — split the JSON into per-section documents for embedding
  2. Table extraction — convert "表格" arrays to flattened rows for database import
  3. Metadata extraction — pull course code, name, etc. as document metadata
  4. Embedding — feed cleaned text chunks into vector database

See references/kb-prep.md for detailed KB preparation patterns.
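
Steps 1 to 3 above could be sketched as follows. This is not the content of references/kb-prep.md. It assumes the JSON shape described under JSON Structure (sections as nested dicts with optional "text" and "表格" keys, lists as arrays of strings, metadata as top-level strings) and further assumes that the first row of each table is its header. All function names are hypothetical.

```python
def flatten_table(table):
    """Turn [[header...], [row...], ...] into a list of dicts keyed by header.

    Assumes the first row is the header row.
    """
    if not table:
        return []
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


def chunk_sections(doc, path=()):
    """Yield one chunk per section that carries text, a list, or a table."""
    for key, value in doc.items():
        here = path + (key,)
        if isinstance(value, dict):
            text = value.get("text")
            table = value.get("表格")
            if text or table:
                yield {
                    "section": " / ".join(here),
                    "text": text or "",
                    "rows": flatten_table(table or []),
                }
            # Recurse into nested sections, skipping the table itself.
            children = {
                k: v for k, v in value.items()
                if k != "表格" and isinstance(v, (dict, list))
            }
            yield from chunk_sections(children, here)
        elif isinstance(value, list):
            # A numbered list stored directly under the section title.
            yield {
                "section": " / ".join(here),
                "text": "\n".join(str(item) for item in value),
                "rows": [],
            }


def extract_metadata(doc):
    """Collect top-level string fields (course code, name, ...) as metadata."""
    return {k: v for k, v in doc.items() if isinstance(v, str)}
```

Each chunk's "text" is what would be passed to an embedding model in step 4, with the result of extract_metadata attached as document metadata.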

Security Recommendations
What to consider before installing or running this skill:

  • The skill will send documents to MinerU via the mineru-open-api CLI. That means your documents (including any sensitive content) may be transmitted to MinerU's servers. Only proceed if you trust MinerU and are comfortable with that data flow.
  • The manifest does not declare the required MINERU_TOKEN or mineru-open-api binary; this is an inconsistency. Treat the missing declaration as a red flag: confirm with the author or registry why those requirements were omitted.
  • If you must use it: obtain MINERU_TOKEN only from a trusted source and avoid using production secrets. Consider testing with non-sensitive files first.
  • If you need stronger guarantees: inspect or install the mineru-open-api CLI from its official source (verify signatures/URLs), or prefer a local/offline extractor if you cannot trust remote processing.
  • Mitigations: run the tool in an isolated environment (sandbox/VM), monitor outbound network traffic when the CLI runs, and verify the mineru-open-api CLI source code or release channel before supplying your token.

If the registry is updated to explicitly declare MINERU_TOKEN and mineru-open-api as requirements and provides an official upstream URL for the CLI, the inconsistency concern would be resolved and my confidence would increase.
Functional Analysis
Type: OpenClaw Skill Name: doc-to-json Version: 1.0.0 The skill bundle provides a legitimate pipeline for converting various document formats (PDF, DOCX, XLSX) into structured JSON using the MinerU extraction engine. The core logic in `scripts/doc_to_json.py` and `scripts/markdown_to_json.py` focuses on executing the MinerU CLI and parsing the resulting Markdown via regular expressions. No evidence of data exfiltration, malicious execution, or prompt injection was found; the use of subprocesses is handled safely using argument lists to prevent shell injection.
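
The point about argument lists can be illustrated in isolation. The sketch below is not taken from the skill's scripts. It passes a value containing shell metacharacters to a child Python process as a single argv element and shows that the child receives it literally, because no shell is involved. The helper name echo_argument is hypothetical.

```python
import subprocess
import sys


def echo_argument(value):
    """Pass value to a child process as one argv element; return what it saw."""
    result = subprocess.run(
        # A list (not a string) and no shell=True: value is never parsed by a shell.
        [sys.executable, "-c", "import sys; print(sys.argv[1])", value],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.rstrip("\n")


if __name__ == "__main__":
    # The semicolon and the second command stay part of the one argument.
    print(echo_argument("report; echo injected.pdf"))
```

Had the same value been interpolated into a command string run with shell=True, the shell would have treated the text after the semicolon as a second command.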
Capability Assessment
Purpose & Capability
Name/description promise: convert documents to JSON via MinerU. The SKILL.md and included scripts clearly require the mineru-open-api CLI and a MINERU_TOKEN. However the registry metadata lists no required binaries and no required environment variables. That is an internal inconsistency: a MinerU token and CLI are necessary to perform the described extraction but are not declared in the manifest.
Instruction Scope
The runtime instructions and scripts stay within the stated purpose: they call the mineru-open-api CLI to produce Markdown, then parse the Markdown into JSON locally. The scripts parse headings, tables, lists and metadata — no other system files are read and no unexpected external endpoints are referenced in the code itself. However the mineru-open-api CLI will contact MinerU's servers (not shown in the package), so documents and their content will be transmitted to that external service when the CLI runs.
Install Mechanism
This skill is instruction-only (no install spec). That lowers installer risk, but it also means the manifest does not install the required mineru-open-api CLI; users must install it themselves. The absence of an install specification for the external CLI is coherent but increases the chance of mismatches (user may not realize they need to install and trust a third-party CLI).
Credentials
The scripts and SKILL.md require MINERU_TOKEN (and pass it to the mineru-open-api CLI), but the registry metadata lists no required environment variables and no primary credential. Requesting a service token for an external extraction service is reasonable for the skill's purpose. The problem is that the manifest omits that requirement, which is inconsistent with what the skill actually needs and reduces transparency about the secrets it uses.
Persistence & Privilege
The skill does not request persistent or elevated privileges: always is false, it does not modify other skills or global agent config, and it does not persist credentials itself. Temp files are cleaned up by default (unless --keep-temp is used).
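
The cleanup behaviour described above (temp files removed by default, kept with --keep-temp) is commonly implemented with a try/finally around a temporary directory. The sketch below shows that pattern under stated assumptions. It is not the code in scripts/doc_to_json.py, and the function name, the keep_temp parameter, and the directory prefix are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def run_with_temp_dir(work, keep_temp=False):
    """Call work(temp_dir) and return (result, temp_dir).

    The directory is deleted afterwards unless keep_temp is True, even if
    work raises an exception.
    """
    temp_dir = Path(tempfile.mkdtemp(prefix="doc_to_json_"))
    try:
        return work(temp_dir), temp_dir
    finally:
        if not keep_temp:
            shutil.rmtree(temp_dir, ignore_errors=True)
```

With keep_temp=True the caller is responsible for removing the returned directory once debugging is done.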
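How to Use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install doc-to-json
  3. Once installed, invoke the Skill by name or trigger it with /doc-to-json
  4. Provide the inputs described in the Skill's parameter documentation to get structured output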
Version History
v1.0.0
Initial release: doc/docx/PDF/xlsx/xls to JSON via MinerU pipeline, with Markdown parser and KB preparation helpers
Metadata
Slug doc-to-json
Version 1.0.0
License MIT-0
Total installs 0
Current installs 0
Versions 1
FAQ

What is Doc to JSON?

Convert documents (docx, doc, PDF, xlsx, xls) to structured JSON via MinerU. Full pipeline: file to mineru-open-api extract to Markdown then to JSON. Use whe... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 57 total downloads so far.

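How do I install Doc to JSON?

Run the command /install doc-to-json in the OpenClaw or Claude Code chat box to install it in one step. The install itself needs no extra configuration, but per SKILL.md, running the skill requires the mineru-open-api CLI and the MINERU_TOKEN environment variable.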

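Is Doc to JSON free?

Yes. Doc to JSON is completely free, released under the MIT-0 license, and can be freely downloaded, installed, and used.

Which platforms does Doc to JSON support?

Doc to JSON is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.

Who develops Doc to JSON?

It is developed and maintained by 梁辉盛 (@kounlong); the current version is v1.0.0.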
