kounlong

Doc to JSON

by 梁辉盛 · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ⚠ suspicious
57 Downloads · 0 Stars · 0 Active Installs · 1 Version
Install in OpenClaw
/install doc-to-json
Description
Convert documents (docx, doc, PDF, xlsx, xls) to structured JSON via MinerU. Full pipeline: file to mineru-open-api extract to Markdown then to JSON. Use whe...
README (SKILL.md)

Doc to JSON

Convert office documents to structured JSON using MinerU as the extraction engine.

Supported Formats

  • .doc / .docx — Word documents
  • .pdf — PDF files
  • .xlsx / .xls — Excel spreadsheets

Prerequisites

  • mineru-open-api CLI must be installed (v0.5+)
  • MINERU_TOKEN environment variable must be set
  • Check: mineru-open-api version
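
The two prerequisites above can also be checked from Python before starting a run. This is a minimal sketch, not part of the skill: the function name check_prerequisites is hypothetical, and it only confirms that the CLI is on PATH and that the token variable is non-empty. It does not verify the v0.5+ version requirement.

```python
import os
import shutil


def check_prerequisites(cli_name="mineru-open-api", token_var="MINERU_TOKEN", env=None):
    """Return a list of problems; an empty list means both prerequisites look satisfied."""
    env = os.environ if env is None else env
    problems = []
    # shutil.which returns None when the executable is not found on PATH.
    if shutil.which(cli_name) is None:
        problems.append(f"{cli_name} not found on PATH")
    # Treat an unset or empty token as missing.
    if not env.get(token_var):
        problems.append(f"{token_var} is not set")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("missing:", problem)
```

Returning a list instead of raising lets a caller report every missing prerequisite at once.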

Quick Usage

# Full pipeline: document -> MinerU Markdown -> JSON
python3 scripts/doc_to_json.py /path/to/file.docx -o output.json

# Keep temp files for debugging
python3 scripts/doc_to_json.py /path/to/file.pdf -o out.json --keep-temp

Manual Two-Step Pipeline

If the full pipeline script fails, run steps manually:

Step 1: MinerU Extract

export MINERU_TOKEN="your_token"
mineru-open-api extract input_file.pdf -o /tmp/mineru_out/

Output: .md file in the output directory.

Step 2: Markdown -> JSON

python3 scripts/markdown_to_json.py /tmp/mineru_out/output.md -o output.json
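
The two manual steps can be chained from Python. The sketch below is an illustration, not the skill's doc_to_json.py: the helper names are hypothetical, the commands simply mirror the two shown above, and the recursive search for the .md file is an assumption about how the CLI lays out its output directory. Commands are passed as argument lists (no shell), so file names with spaces or shell metacharacters are not interpreted.

```python
import subprocess
from pathlib import Path


def build_extract_command(input_file, out_dir):
    # Step 1 as an argument list, mirroring the documented CLI call.
    return ["mineru-open-api", "extract", str(input_file), "-o", str(out_dir)]


def build_convert_command(markdown_file, output_json):
    # Step 2; the script path is relative to the skill's root directory.
    return ["python3", "scripts/markdown_to_json.py", str(markdown_file), "-o", str(output_json)]


def find_markdown(out_dir):
    """Return the first .md file under out_dir (searched recursively), or None."""
    matches = sorted(Path(out_dir).rglob("*.md"))
    return matches[0] if matches else None


def run_pipeline(input_file, out_dir, output_json):
    """Run both steps; MINERU_TOKEN must already be set in the environment."""
    subprocess.run(build_extract_command(input_file, out_dir), check=True)
    markdown_file = find_markdown(out_dir)
    if markdown_file is None:
        raise FileNotFoundError(f"no .md file found in {out_dir}")
    subprocess.run(build_convert_command(markdown_file, output_json), check=True)
    return markdown_file
```

With check=True, a non-zero exit from either command raises subprocess.CalledProcessError, so a failed extraction stops the run before Step 2.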

JSON Structure

The output JSON preserves:

  • Metadata fields — course name, code, credits, hours, etc. (extracted from plain text)
  • Heading hierarchy — 一、二、三... sections become nested keys
  • Tables — stored as array of arrays (row cells), keyed as "表格"
  • Numbered lists — stored as array of strings under section title
  • Paragraph text — merged into "text" field per section
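
To make the table shape concrete, here is a rough sketch of how a Markdown pipe table could become an array of arrays stored under "表格" (Chinese for "table"). It is not the skill's actual parser: parse_markdown_table is a hypothetical helper, it assumes the extracted Markdown uses pipe tables, it ignores escaped pipes, and placing "表格" next to "text" inside one section is an assumption about the layout.

```python
import json


def parse_markdown_table(lines):
    """Convert Markdown pipe-table lines into a list of rows (lists of cell strings)."""
    rows = []
    for line in lines:
        stripped = line.strip()
        if not stripped.startswith("|"):
            continue  # not a table row
        inner = stripped[1:]
        if inner.endswith("|"):
            inner = inner[:-1]
        cells = [cell.strip() for cell in inner.split("|")]
        # Skip the header separator row, e.g. | --- | :---: |
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows


sample = [
    "| Week | Topic |",
    "| --- | --- |",
    "| 1 | Introduction |",
]

# Hypothetical section: merged paragraph text plus one table.
section = {"text": "Schedule overview.", "表格": parse_markdown_table(sample)}
print(json.dumps(section, ensure_ascii=False, indent=2))
```

Passing ensure_ascii=False keeps the "表格" key readable in the output instead of escaping it to \u sequences.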

For Knowledge Base Preparation

After JSON conversion, common next steps:

  1. Chunk by section — split the JSON into per-section documents for embedding
  2. Table extraction — convert "表格" arrays to flattened rows for database import
  3. Metadata extraction — pull course code, name, etc. as document metadata
  4. Embedding — feed cleaned text chunks into vector database

See references/kb-prep.md for detailed KB preparation patterns.
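
As an illustration of steps 1 and 2, the sketch below walks a nested JSON document, emits one chunk per section that has a "text" field, and flattens a table into row dictionaries. It is not taken from references/kb-prep.md: the function names, the sample document, and the assumption that the first table row is a header are all hypothetical.

```python
def chunk_by_section(doc, path=()):
    """Yield one chunk per nested section that carries non-empty "text"."""
    for key, value in doc.items():
        if not isinstance(value, dict):
            continue  # metadata fields, lists, and tables are not sections
        section_path = path + (key,)
        text = value.get("text")
        if isinstance(text, str) and text.strip():
            yield {"section": " / ".join(section_path), "text": text.strip()}
        # Recurse into subsections nested under this heading.
        yield from chunk_by_section(value, section_path)


def flatten_table(rows):
    """Turn an array-of-arrays table into a list of dicts, assuming row 0 is the header."""
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical document in the shape described under "JSON Structure".
doc = {
    "course_name": "Example Course",
    "一、Overview": {"text": "Intro.", "1.1 Goals": {"text": "Goals text."}},
    "二、Schedule": {"表格": [["Week", "Topic"], ["1", "Intro"]]},
}

chunks = list(chunk_by_section(doc))
table_rows = flatten_table(doc["二、Schedule"]["表格"])
```

Keeping the full heading path in each chunk preserves context that would otherwise be lost when sections are embedded separately.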

Usage Guidance
What to consider before installing or running this skill:

  • The skill will send documents to MinerU via the mineru-open-api CLI. That means your documents (including any sensitive content) may be transmitted to MinerU's servers. Only proceed if you trust MinerU and are comfortable with that data flow.
  • The manifest does not declare the required MINERU_TOKEN or mineru-open-api binary; this is an inconsistency. Treat the missing declaration as a red flag: confirm with the author or registry why those requirements were omitted.
  • If you must use it: obtain MINERU_TOKEN only from a trusted source and avoid using production secrets. Consider testing with non-sensitive files first.
  • If you need stronger guarantees: inspect or install the mineru-open-api CLI from its official source (verify signatures/URLs), or prefer a local/offline extractor if you cannot trust remote processing.
  • Mitigations: run the tool in an isolated environment (sandbox/VM), monitor outbound network traffic when the CLI runs, and verify the mineru-open-api CLI source code or release channel before supplying your token.

If the registry is updated to explicitly declare MINERU_TOKEN and mineru-open-api as requirements and provides an official upstream URL for the CLI, the inconsistency concern would be resolved and my confidence would increase.
Capability Analysis
Type: OpenClaw Skill
Name: doc-to-json
Version: 1.0.0
The skill bundle provides a legitimate pipeline for converting various document formats (PDF, DOCX, XLSX) into structured JSON using the MinerU extraction engine. The core logic in `scripts/doc_to_json.py` and `scripts/markdown_to_json.py` focuses on executing the MinerU CLI and parsing the resulting Markdown via regular expressions. No evidence of data exfiltration, malicious execution, or prompt injection was found; the use of subprocesses is handled safely using argument lists to prevent shell injection.
Capability Assessment
Purpose & Capability
Name/description promise: convert documents to JSON via MinerU. The SKILL.md and included scripts clearly require the mineru-open-api CLI and a MINERU_TOKEN. However the registry metadata lists no required binaries and no required environment variables. That is an internal inconsistency: a MinerU token and CLI are necessary to perform the described extraction but are not declared in the manifest.
Instruction Scope
The runtime instructions and scripts stay within the stated purpose: they call the mineru-open-api CLI to produce Markdown, then parse the Markdown into JSON locally. The scripts parse headings, tables, lists and metadata — no other system files are read and no unexpected external endpoints are referenced in the code itself. However the mineru-open-api CLI will contact MinerU's servers (not shown in the package), so documents and their content will be transmitted to that external service when the CLI runs.
Install Mechanism
This skill is instruction-only (no install spec). That lowers installer risk, but it also means the manifest does not install the required mineru-open-api CLI; users must install it themselves. The absence of an install specification for the external CLI is coherent but increases the chance of mismatches (user may not realize they need to install and trust a third-party CLI).
Credentials
The scripts and SKILL.md require MINERU_TOKEN (and pass it to the mineru-open-api CLI), but the registry metadata lists no required environment variables and no primary credential. Requesting a service token for an external extraction service is reasonable for the skill's purpose; the problem is that the manifest omits that requirement, which is inconsistent with what the skill actually needs. This omission reduces transparency about what secrets the skill requires.
Persistence & Privilege
The skill does not request persistent or elevated privileges: its always setting is false, it does not modify other skills or global agent config, and it does not persist credentials itself. Temp files are cleaned up by default (unless --keep-temp is used).
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install doc-to-json
  3. After installation, invoke the skill by name or use /doc-to-json
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
Initial release: doc/docx/PDF/xlsx/xls to JSON via MinerU pipeline, with Markdown parser and KB preparation helpers
Metadata
Slug doc-to-json
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is Doc to JSON?

Convert documents (docx, doc, PDF, xlsx, xls) to structured JSON via MinerU. Full pipeline: file to mineru-open-api extract to Markdown then to JSON. Use whe... It is an AI Agent Skill for Claude Code / OpenClaw, with 57 downloads so far.

How do I install Doc to JSON?

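Run "/install doc-to-json" in the OpenClaw or Claude Code chat to install the skill in one step. The install does not set up the prerequisites: the mineru-open-api CLI (v0.5+) must be installed separately and the MINERU_TOKEN environment variable must be set.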

Is Doc to JSON free?

Yes, Doc to JSON is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Doc to JSON support?

Doc to JSON is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Doc to JSON?

It is built and maintained by 梁辉盛 (@kounlong); the current version is v1.0.0.
