decrystal

ade-mineru-api-skills

by decrystal666 · GitHub ↗ · v1.2.0 · MIT-0
cross-platform ⚠ suspicious
281 Downloads · 0 Stars · 0 Active Installs · 3 Versions
Install in OpenClaw
/install ade-mineru-api-skills
Description
MinerU document extraction CLI that converts PDFs, images, and web pages into Markdown, HTML, LaTeX, or DOCX via the MinerU API. Supports single/batch extrac...
README (SKILL.md)

Document Extraction with mineru

Installation

Linux / macOS

curl -fsSL https://cdn-mineru.openxlab.org.cn/open-api-cli/install.sh | sh

Windows (PowerShell)

irm https://cdn-mineru.openxlab.org.cn/open-api-cli/install.ps1 | iex

Verify installation

mineru version
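
The install one-liners above pipe a remote script straight into a shell. For readers who would rather read the script first, the download and the run can be split into two steps. This is a minimal sketch, assuming curl is available; `fetch_installer` and the destination filename are hypothetical and not part of the MinerU tooling.

```shell
# Hypothetical helper (not part of the mineru CLI): save the installer to a
# file instead of piping it into sh, so it can be reviewed before it runs.
fetch_installer() {
  installer_url=$1
  installer_dest=$2
  # -f fails on HTTP errors, -s is quiet, -S still shows errors, -L follows redirects.
  curl -fsSL "$installer_url" -o "$installer_dest" || return 1
  printf 'Saved installer to %s\n' "$installer_dest"
}
```

A typical use would be `fetch_installer https://cdn-mineru.openxlab.org.cn/open-api-cli/install.sh ./mineru-install.sh`, then reading the file (for example with `less`), and only then running `sh ./mineru-install.sh`.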

Authentication

Before using, configure your API token (get one from https://mineru.net):

mineru auth                    # Interactive token setup
export MINERU_TOKEN="your-token"  # Or set via environment variable

Token resolution order: --token flag > MINERU_TOKEN env > ~/.mineru/config.yaml.
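
The resolution order can be mirrored in a small pre-flight check that reports which source would supply the token without ever printing the token itself. This is only a sketch: `token_source` is a hypothetical helper, and treating an empty MINERU_TOKEN as unset is an assumption.

```shell
# Hypothetical helper: report which source would supply the token, following
# the documented order: --token flag > MINERU_TOKEN env > ~/.mineru/config.yaml.
# Assumption: an empty MINERU_TOKEN is treated the same as an unset one.
token_source() {
  flag_token=${1:-}
  if [ -n "$flag_token" ]; then
    echo "flag"
  elif [ -n "${MINERU_TOKEN:-}" ]; then
    echo "env"
  elif [ -f "${HOME}/.mineru/config.yaml" ]; then
    echo "config"
  else
    echo "none"
  fi
}
```

The CLI's own `mineru auth --show` remains the authoritative way to see the source it actually used; the helper is only useful for checking the environment before the CLI is invoked.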

Supported input formats

The extract command accepts the following input types:

  • PDF (.pdf): primary use case, supports scanned and digital PDFs
  • Images (.png, .jpg, .jpeg, .webp, .gif, .bmp): use --ocr for best results on scanned content
  • DOCX (.docx): Microsoft Word documents
  • URLs — remote files are downloaded automatically

The crawl command accepts any HTTP/HTTPS URL and extracts web page content.

Default behavior

  • Table recognition: ON by default. Tables in documents are extracted and converted to Markdown tables. Use --no-table to disable.
  • Formula recognition: ON by default. Mathematical formulas are extracted as LaTeX. Use --no-formula to disable.
  • Language: defaults to ch (Chinese). Use --language en for English documents.
  • Model: auto-selected. Use --model vlm for complex layouts, --model pipeline for speed.

Quick start

mineru extract report.pdf                        # PDF → Markdown to stdout
mineru extract report.pdf -o ./out/              # Save to ./out/
mineru extract report.pdf -f md,docx -o ./out/   # Multiple formats (docx requires -o)
mineru crawl https://example.com/article         # Web page → Markdown

Core workflow

  1. Authenticate: mineru auth or set MINERU_TOKEN
  2. Extract: mineru extract <file-or-url> for documents
  3. Crawl: mineru crawl <url> for web pages
  4. Check results: output goes to stdout (default) or -o directory

Commands

extract — Document extraction

Convert PDFs, images, and other documents to Markdown or other formats.

mineru extract report.pdf                         # Markdown to stdout
mineru extract report.pdf -f html                 # HTML to stdout
mineru extract report.pdf -o ./out/               # Save to directory
mineru extract report.pdf -o ./out/ -f md,docx    # Multiple formats
mineru extract *.pdf -o ./results/                # Batch extract
mineru extract --list files.txt -o ./results/     # Batch from file list
mineru extract https://example.com/doc.pdf        # Extract from URL
cat doc.pdf | mineru extract --stdin -o ./out/    # From stdin

extract flags

Flag           Short  Default      Description
--output       -o     (stdout)     Output path (file or directory)
--format       -f     md           Output formats: md, json, html, latex, docx (comma-separated)
--model               (auto)       Model: vlm, pipeline, html
--ocr                 false        Enable OCR for scanned documents
--no-formula          false        Disable formula recognition
--no-table            false        Disable table recognition
--language            ch           Document language
--pages               (all)        Page range, e.g. 1-10,15
--timeout             300/1800     Timeout in seconds (single/batch)
--list                             Read input list from file (one path per line)
--stdin-list          false        Read input list from stdin
--stdin               false        Read file content from stdin
--stdin-name          stdin.pdf    Filename hint for stdin mode
--concurrency         0            Batch concurrency (0 = server default)
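
The --stdin-list flag makes it possible to drive a batch from any command that prints paths. The sketch below feeds every PDF under a directory to a single batch run. `extract_all_pdfs` is a hypothetical wrapper, and the one-path-per-line input format is assumed to match what --list expects.

```shell
# Hypothetical wrapper: batch-extract every PDF under a directory by piping
# the paths to --stdin-list (one path per line is assumed, as with --list).
extract_all_pdfs() {
  src_dir=$1
  out_dir=$2
  # Batch mode requires -o, so the output directory is mandatory here.
  find "$src_dir" -type f -name '*.pdf' | mineru extract --stdin-list -o "$out_dir"
}
```

Called as `extract_all_pdfs ./papers ./results`, this avoids shell glob limits on very large directories. Paths containing newlines would break the one-per-line convention.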

crawl — Web page extraction

Fetch web pages and convert to Markdown.

mineru crawl https://example.com/article              # Markdown to stdout
mineru crawl https://example.com/article -f html      # HTML to stdout
mineru crawl https://example.com/article -o ./out/     # Save to file
mineru crawl url1 url2 -o ./pages/                     # Batch crawl
mineru crawl --list urls.txt -o ./pages/               # Batch from file list

crawl flags

Flag           Short  Default      Description
--output       -o     (stdout)     Output path
--format       -f     md           Output formats: md, json, html (comma-separated)
--timeout             300/1800     Timeout in seconds (single/batch)
--list                             Read URL list from file (one per line)
--stdin-list          false        Read URL list from stdin
--concurrency         0            Batch concurrency

auth — Authentication management

mineru auth              # Interactive token setup
mineru auth --verify     # Verify current token is valid
mineru auth --show       # Show current token source and masked value

status — Async task status

Query the status of a previously submitted extraction task.

mineru status <task-id>                       # Check status once
mineru status <task-id> --wait                # Wait for completion
mineru status <task-id> --wait -o ./out/      # Wait and download results
mineru status <task-id> --wait --timeout 600  # Custom timeout

status flags

Flag           Short  Default      Description
--wait                false        Wait for task completion
--output       -o                  Download results to directory when done
--timeout             300          Max wait time in seconds

version — Version info

mineru version    # Show version, commit, build date, Go version, OS/arch

Global flags

These flags apply to all commands:

Flag           Short  Description
--token               API token (overrides env and config)
--base-url            API base URL (for private deployments)
--verbose      -v     Verbose mode, print HTTP details

Output behavior

  • No -o flag: result goes to stdout; status/progress messages go to stderr
  • With -o flag: result saved to file/directory; progress messages on stderr
  • Batch mode: requires -o to specify output directory
  • Binary formats (docx): cannot output to stdout, must use -o
  • Markdown output includes extracted images saved alongside the .md file
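
These rules can be checked before a command is built. The sketch below succeeds when an invocation needs -o: more than one input (batch mode), the binary docx format, or more than one format at once (stdout carries a single text format, per the agent guidelines in this README). `needs_output_flag` is a hypothetical helper, not a mineru feature.

```shell
# Hypothetical pre-flight check: returns 0 when -o is required, 1 otherwise.
# $1 = comma-separated formats (as passed to -f), $2 = number of inputs.
needs_output_flag() {
  formats=$1
  input_count=$2
  # Batch mode requires an output directory.
  [ "$input_count" -gt 1 ] && return 0
  case $formats in
    *,*) return 0 ;;    # stdout carries only one format at a time
    docx) return 0 ;;   # binary format cannot be written to stdout
  esac
  return 1
}
```

For example, `needs_output_flag md,docx 1` succeeds, which signals that `-o` should be added before the command is run.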

Examples

Single PDF extraction

mineru extract report.pdf -o ./output/
# Output: ./output/report.md + ./output/images/

Extract with OCR and specific pages

mineru extract scanned.pdf --ocr --pages "1-5" -o ./out/

Multi-format output

mineru extract paper.pdf -f md,html,docx -o ./out/
# Output: ./out/paper.md, ./out/paper.html, ./out/paper.docx

Batch processing from file list

# files.txt contains one path per line
mineru extract --list files.txt -o ./results/

Extract to LaTeX

mineru extract paper.pdf -f latex -o ./out/
# Output: ./out/paper.tex

English document with specific language

mineru extract english-report.pdf --language en -o ./out/

Extract Word document to Markdown

mineru extract resume.docx -o ./out/
# Output: ./out/resume.md

Pipe workflow

# Download and extract in one pipeline
curl -sL https://example.com/doc.pdf | mineru extract --stdin --stdin-name doc.pdf

Web crawling

mineru crawl https://example.com/docs/guide -o ./docs/

Batch crawl with URL list

printf '%s\n' "https://example.com/page1" "https://example.com/page2" | mineru crawl --stdin-list -o ./pages/

Use with other tools

# Extract and pipe to another tool
mineru extract report.pdf | wc -w              # Word count
mineru extract report.pdf | grep "keyword"     # Search content
mineru extract report.pdf -f json | jq '.[]'   # Parse structured output

Agent guidelines

When using this skill on behalf of the user:

  • Always ask for the file path if the user didn't specify one. Never guess or fabricate a filename.
  • Quote file paths that contain spaces or special characters with double quotes in commands. Example: mineru extract "report 01.pdf", NOT mineru extract report 01.pdf.
  • Don't run commands blindly on errors: if the user asks "提取失败了怎么办" ("What should I do if extraction failed?"), explain the exit code and troubleshooting steps instead of re-running the command.
  • Installation questions such as "mineru 怎么安装" ("How do I install mineru?") should be answered with the install instructions, not by running mineru extract.
  • DOCX as input is supported: if the user asks "这个 Word 文档能转 Markdown 吗" ("Can this Word document be converted to Markdown?"), use mineru extract file.docx.
  • Table extraction — tables are extracted by default as part of the Markdown output. There is no "tables only" mode; the full document is always extracted.
  • For stdout mode (no -o), only one text format can be output at a time. If the user wants multiple formats, suggest adding -o.

Default output directory

When the user does NOT specify an output path (-o), the agent MUST generate a default output directory to prevent file overwrites. Use:

~/MinerU-Skill/<name>_<hash>/

Naming rules:

  • <name>: derived from the source, then sanitized for safe directory names.
    • For URLs: last path segment (e.g. https://arxiv.org/pdf/2509.22186 → 2509.22186)
    • For local files: filename without extension (e.g. report.pdf → report)
    • Sanitization: replace shell-unsafe characters (space, (, ), [, ], &, ', ", !, #, $, `) with _. Collapse consecutive _ into one. Keep alphanumeric, -, _, ., and CJK characters.
  • <hash>: first 6 characters of the MD5 hash of the full original source path or URL (before sanitization). This ensures:
    • Different URLs with similar basenames get unique directories
    • Re-running the same source reuses the same directory (idempotent)

Examples:

Source                              <name>         Output directory
https://arxiv.org/pdf/2509.22186    2509.22186     ~/MinerU-Skill/2509.22186_a3f2b1/
https://arxiv.org/pdf/2509.200      2509.200       ~/MinerU-Skill/2509.200_c7e9d4/
./report.pdf                        report         ~/MinerU-Skill/report_8b1a3f/
./report 01.pdf                     report_01      ~/MinerU-Skill/report_01_f4a1c2/
./My Doc (final).pdf                My_Doc_final   ~/MinerU-Skill/My_Doc_final_b9e3d7/
./个人简介.docx                      个人简介        ~/MinerU-Skill/个人简介_d2a8f5/

How the agent should generate the hash:

echo -n "https://arxiv.org/pdf/2509.22186" | md5sum | cut -c1-6

Or on macOS:

echo -n "https://arxiv.org/pdf/2509.22186" | md5 | cut -c1-6

When the user specifies -o: use the user's path as-is, do NOT override with the default directory.
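
Put together, the naming rules for the default directory might be implemented as below. This is a sketch under stated assumptions, not the skill's own code: all three function names are hypothetical, and trimming a leading or trailing `_` is inferred from the `My Doc (final).pdf` → `My_Doc_final` example rather than stated in the rules.

```shell
# Hypothetical helpers sketching the default-directory naming rules.
# Assumption: a leading/trailing "_" is trimmed, as the "My_Doc_final"
# example suggests; the written rules do not say so explicitly.
default_out_name() {
  src=$1
  case $src in
    http://*|https://*)
      name=${src%/}; name=${name##*/} ;;    # URL: last path segment
    *)
      name=${src##*/}; name=${name%.*} ;;   # file: name without extension
  esac
  unsafe=' ()[]&"!#$`'"'"
  # Replace each unsafe character with "_", then collapse runs of "_".
  name=$(printf '%s' "$name" | tr "$unsafe" '_' | tr -s '_')
  name=${name#_}
  name=${name%_}
  printf '%s\n' "$name"
}

source_hash() {
  # Hash the original, unsanitized source; md5sum on Linux, md5 on macOS.
  if command -v md5sum >/dev/null 2>&1; then
    printf '%s' "$1" | md5sum | cut -c1-6
  else
    printf '%s' "$1" | md5 | cut -c1-6
  fi
}

default_out_dir() {
  printf '%s/MinerU-Skill/%s_%s/\n' "$HOME" "$(default_out_name "$1")" "$(source_hash "$1")"
}
```

Because the hash depends only on the source string, `default_out_dir ./report.pdf` returns the same directory on every run, which is the idempotency the rules ask for. `printf '%s'` is used instead of `echo -n` because some shells print the `-n` literally.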

Exit codes

Code  Meaning                                Recovery
0     Success
1     General API or unknown error           Check network connectivity; retry; use --verbose for details
2     Invalid parameters / usage error       Check command syntax and flag values
3     Authentication error                   Run mineru auth to reconfigure token, or check token expiration
4     File too large or page limit exceeded  Split the file or use --pages to extract a subset
5     Extraction failed                      The document may be corrupted or unsupported; try a different --model
6     Timeout                                Increase with --timeout; large files may need 600+ seconds
7     Quota exceeded                         Check API quota at https://mineru.net; wait or upgrade plan
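
When a command fails, the exit code can be turned into the matching recovery hint instead of being retried blindly. A minimal sketch follows; `explain_exit_code` is a hypothetical helper whose messages simply restate the table.

```shell
# Hypothetical helper: map a mineru exit code to the recovery hint
# documented in the exit code table.
explain_exit_code() {
  case $1 in
    0) echo "Success" ;;
    1) echo "General API or unknown error: check network connectivity, retry, or use --verbose" ;;
    2) echo "Invalid parameters: check command syntax and flag values" ;;
    3) echo "Authentication error: run mineru auth or check token expiration" ;;
    4) echo "File too large or page limit exceeded: split the file or use --pages" ;;
    5) echo "Extraction failed: the document may be corrupted or unsupported; try a different --model" ;;
    6) echo "Timeout: increase with --timeout; large files may need 600+ seconds" ;;
    7) echo "Quota exceeded: check API quota at https://mineru.net" ;;
    *) echo "Unknown exit code: $1" ;;
  esac
}
```

One way to use it is `mineru extract report.pdf -o ./out/ || explain_exit_code $?`, which prints the hint only when the extraction fails.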

Troubleshooting

  • "no API token found": Run mineru auth or set MINERU_TOKEN env variable
  • Timeout on large files: Increase with --timeout 600 (seconds)
  • Batch fails partially: Check stderr for per-file status; succeeded files are still saved
  • Binary format to stdout: Use -o flag; docx cannot stream to stdout
  • Private deployment: Use --base-url https://your-server.com/api
  • Extraction quality is poor: Try --model vlm for complex layouts, or --ocr for scanned documents
  • Formula not recognized: Ensure --no-formula is NOT set; try --model vlm for better formula support
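
The timeout advice can be automated for unattended runs. The sketch below retries exactly once with --timeout 600 when the first attempt exits with code 6, and returns every other exit code unchanged. `extract_with_longer_timeout` is a hypothetical wrapper, and retrying only once is a design choice rather than documented CLI behavior.

```shell
# Hypothetical wrapper: retry once with a longer timeout when the first
# attempt exits with code 6 (Timeout). Other failures are returned as-is.
extract_with_longer_timeout() {
  input=$1
  out_dir=$2
  rc=0
  mineru extract "$input" -o "$out_dir" || rc=$?
  if [ "$rc" -eq 6 ]; then
    echo "Timed out; retrying with --timeout 600" >&2
    rc=0
    mineru extract "$input" -o "$out_dir" --timeout 600 || rc=$?
  fi
  return "$rc"
}
```

Capturing the status with `|| rc=$?` keeps the wrapper usable in scripts that run under `set -e`.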

Notes

  • All status/progress messages go to stderr; only document content goes to stdout
  • Batch mode automatically polls the API with exponential backoff
  • Token is stored in ~/.mineru/config.yaml after mineru auth
  • The CLI wraps the MinerU Open SDK (github.com/OpenDataLab/mineru-open-sdk)
Usage Guidance
This wrapper appears to do what it says (drive the mineru CLI), but take these precautions before installing or running it:
  1. Do not blindly run 'curl | sh' or 'irm | iex'. First fetch the install script and inspect its contents, or prefer an install from a vetted package or GitHub release.
  2. Verify the installer host (cdn-mineru.openxlab.org.cn) and prefer official release channels if available.
  3. MINERU_TOKEN is needed for operation but is not declared in the registry metadata. Treat this token like a secret and only provide it to the official mineru CLI.
  4. Inspect ~/.mineru/config.yaml after auth to see how tokens are stored.
  5. If you are uncomfortable with the remote install, install the mineru binary manually from a verified source or run the CLI in an isolated environment (container/VM).
Capability Analysis
Type: OpenClaw Skill
Name: ade-mineru-api-skills
Version: 1.2.0
The skill bundle facilitates document extraction via the MinerU API but uses high-risk installation patterns, specifically 'curl | sh' and PowerShell 'irm | iex' from a remote CDN (SKILL.md). It also includes complex instructions for the AI agent to execute shell pipelines (e.g., using md5sum and cut) to generate local directory paths. While these behaviors appear aligned with the tool's functional purpose, the combination of remote script execution and agent-driven shell logic for file system management represents a significant security risk and attack surface.
Capability Assessment
Purpose & Capability
The skill is an instruction-only wrapper around the 'mineru' CLI and only needs that binary and an API token to do document extraction and crawling — that matches the description. Minor inconsistency: SKILL.md requires/mentions MINERU_TOKEN and ~/.mineru/config.yaml for auth but the registry metadata lists no required environment variables or primary credential.
Instruction Scope
The runtime instructions are focused on extraction/crawl workflows (PDFs, images, URLs) which is expected. However the SKILL.md explicitly tells users to install software by piping a remote script to sh/iex (curl | sh and irm | iex). That instruction expands the attack surface because it executes arbitrary remote code on the host unless the user inspects it first.
Install Mechanism
There is no packaged install (e.g., GitHub release or package manager) described; the documented install downloads and executes scripts from https://cdn-mineru.openxlab.org.cn/open-api-cli/install.sh (and a PowerShell equivalent). This is a direct-download install from an unfamiliar domain and uses the high-risk pattern of piping remote script output into a shell/PowerShell.
Credentials
Only a MinerU API token (MINERU_TOKEN or --token flag; config at ~/.mineru/config.yaml) is needed for the tool to function, which is proportional to the purpose. However the published metadata did not declare MINERU_TOKEN as a required env var or primary credential — an omission that reduces transparency about what secrets the skill needs.
Persistence & Privilege
The skill is not always-enabled, does not request system config paths beyond the mineru config file, and is instruction-only. It does recommend installing a CLI which may persist on disk, but the skill itself does not request persistent elevated privileges.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install ade-mineru-api-skills
  3. After installation, invoke the skill by name or use /ade-mineru-api-skills
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.2.0
  • No file changes were detected in this release.
  • Documentation (SKILL.md) remains unchanged, with no updates to commands, flags, or usage instructions.
  • This is most likely a metadata or version bump without any functional modifications.
v1.1.0
  • New simplified installation for Linux/macOS and Windows using a unified install script or PowerShell script from a new CDN location.
  • Previous manual per-architecture binary download instructions have been removed; installation is now streamlined by OS.
  • Added Windows installation instructions with a dedicated PowerShell script.
  • No change to core usage, commands, or CLI flags.
v1.0.0
Initial release of mineru-cli: a CLI for document extraction via MinerU API.
  • Extracts text from PDFs, images, DOCX, and web pages; supports OCR, tables, formulas.
  • Converts to Markdown, HTML, LaTeX, DOCX, or JSON; batch and piped workflows supported.
  • Includes web crawling, async task management, and flexible authentication.
  • Simple installation for macOS (Intel/Apple Silicon) and Linux (x86_64/ARM64) platforms.
Metadata
Slug ade-mineru-api-skills
Version 1.2.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 3
Frequently Asked Questions

What is ade-mineru-api-skills?

ade-mineru-api-skills is an AI Agent Skill for Claude Code / OpenClaw that wraps the MinerU document extraction CLI, which converts PDFs, images, and web pages into Markdown, HTML, LaTeX, or DOCX via the MinerU API. It has 281 downloads so far.

How do I install ade-mineru-api-skills?

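Run "/install ade-mineru-api-skills" in the OpenClaw or Claude Code chat to install the skill in one step. The skill drives the mineru CLI, so the CLI and an API token (see Installation and Authentication in the README) must also be set up before use.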

Is ade-mineru-api-skills free?

Yes, ade-mineru-api-skills is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does ade-mineru-api-skills support?

ade-mineru-api-skills is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created ade-mineru-api-skills?

It is built and maintained by decrystal666 (@decrystal); the current version is v1.2.0.
