
Markitdown-Skill-for-non-multimodal-agent

Name: Markitdown-Skill-for-non-multimodal-agent
Author: keepyaoung

By Tommy, Joon Shin · GitHub ↗ · v1.0.0 · MIT-0

cross-platform ⚠ suspicious


Install in OpenClaw

/install markitdown-skill-for-non-multimodal-agent

Description

Use when a NON-multimodal agent (a text-only LLM backend that cannot read attachments) receives a document — PDF, Word (docx), PowerPoint (pptx), Excel (xlsx...

Usage instructions (SKILL.md)

markitdown — let a text-only agent read documents

A non-multimodal OpenClaw agent has no eyes: its backend is a plain-text API, so it cannot open a PDF / Word / Excel / PowerPoint attachment at all. This skill turns those files into Markdown the model can read.

Two layers, and you almost always only need the first:

Free layer — DEFAULT. The markitdown MCP server converts text-bearing documents (PDF, docx, pptx, xlsx, html, csv, json, xml, epub, zip) to Markdown locally. No API key. No per-call cost. This handles the vast majority of attachments, because most documents store real text.
OCR layer — OPT-IN. Scanned PDFs (photographed pages), standalone image files, and images embedded inside documents contain no extractable text — the only way to read them is to have a vision model look. This layer is OFF unless OPENAI_API_KEY is set, and it bills per image.

The cost line is simple: pulling existing text out of a file is free; asking a model to look at a picture costs money.

When to use

A document attachment arrives (Slack file, email attachment, a path the user gives you) whose extension or MIME is one of:

pdf · docx · pptx · xlsx · xls · epub · html · htm · csv · json · xml · zip · md

…and you need its content (to answer about it, summarize it, quote it, or save it).

For plain images (png · jpg · jpeg · gif · webp · tiff): only useful if the OCR layer is on. With no key, a text-only agent simply cannot read an image — say so rather than guessing.

Do not use this for a file the agent can already read as plain text in the prompt.
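
The routing rules above can be sketched as a small lookup. This is only an illustration: the function name and the four return labels are hypothetical, and the extension sets are copied from the lists above, which end with "…" and so may not be exhaustive.

```python
from pathlib import Path

# Extensions copied from the lists above; treat them as a starting point.
DOCUMENT_EXTS = {
    "pdf", "docx", "pptx", "xlsx", "xls", "epub",
    "html", "htm", "csv", "json", "xml", "zip", "md",
}
IMAGE_EXTS = {"png", "jpg", "jpeg", "gif", "webp", "tiff"}


def route_attachment(filename: str, ocr_enabled: bool) -> str:
    """Return one of the hypothetical labels: convert, ocr, unreadable, skip."""
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext in DOCUMENT_EXTS:
        return "convert"  # free layer applies
    if ext in IMAGE_EXTS:
        # Images are only readable when the OCR layer is on.
        return "ocr" if ocr_enabled else "unreadable"
    return "skip"  # outside the scope of this skill
```

An "unreadable" result is the case where the agent should tell the user it cannot read the image rather than guess.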

Setup (operator, one time)

Free layer — the MCP server

Run the server over stdio with no install using uvx:

uvx markitdown-mcp

To register the server with the agent, add an entry like this to its MCP server configuration:

{
  "mcpServers": {
    "markitdown": {
      "command": "uvx",
      "args": ["markitdown-mcp"]
    }
  }
}

This exposes one tool: convert_to_markdown(uri), where uri is any http:, https:, file:, or data: URI. That is the whole free layer.
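
For local files, the uri argument is a file: URI. A minimal sketch of building one with the Python standard library follows; the helper name to_file_uri is hypothetical and not part of markitdown.

```python
from pathlib import Path


def to_file_uri(path: str) -> str:
    """Turn an absolute local path into a file: URI (hypothetical helper)."""
    p = Path(path)
    if not p.is_absolute():
        # Path.as_uri() also rejects relative paths; fail early with a clearer message.
        raise ValueError(f"expected an absolute path, got {path!r}")
    # as_uri() percent-encodes characters such as spaces.
    return p.as_uri()
```

Using as_uri() rather than string concatenation keeps paths with spaces or non-ASCII characters valid as URIs.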

The MCP server runs with the privileges of its process and can read any file that user can read. Keep it bound to local/stdio use only.

OCR layer — the CLI (optional)

The MCP server cannot OCR — it never wires up a vision client, so even with plugins enabled it silently returns text-only output. OCR runs through the CLI instead. Install the fork (which ships the markitdown-ocr plugin) plus an OpenAI client:

pip install "git+https://github.com/Self-made-Orange/markitdown.git#subdirectory=packages/markitdown[all]"
pip install "git+https://github.com/Self-made-Orange/markitdown.git#subdirectory=packages/markitdown-ocr"
pip install openai

Then export a key (any OpenAI-compatible endpoint works):

export OPENAI_API_KEY=sk-...
export MARKITDOWN_OCR_MODEL=gpt-4o-mini   # cheapest vision model; optional

With no OPENAI_API_KEY, the plugin still loads but OCR is skipped — you fall back to the free converter automatically. So the OCR layer is genuinely zero-cost until someone opts in.
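
An agent-side wrapper can apply the same opt-in rule before it ever shells out. The sketch below assumes an empty key should count as unset, and the function name ocr_settings is hypothetical; only the two environment variable names and the gpt-4o-mini default come from the setup above.

```python
import os
from typing import Mapping, Optional


def ocr_settings(env: Optional[Mapping[str, str]] = None) -> Optional[dict]:
    """Return OCR settings, or None when the OCR layer is off."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        return None  # no key: stay on the free layer
    # Mirrors the shell default ${MARKITDOWN_OCR_MODEL:-gpt-4o-mini}:
    # an unset or empty variable falls back to gpt-4o-mini.
    return {"model": env.get("MARKITDOWN_OCR_MODEL") or "gpt-4o-mini"}
```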

Flow (per attachment)

Get the absolute path. The downloaded attachment's absolute path is already provided by the runtime (e.g. MediaPaths). Build a file://<absolute-path> URI.
- ⚠️ Convert in the same turn the file arrives: downloads live in a temp dir and may be GC'd next turn.
Free convert (always try first). Call convert_to_markdown("file://<abspath>") on the markitdown MCP server. For normal documents you are done: read or store the Markdown.
Decide if OCR is needed. OCR only matters when:
- the file is a standalone image, or
- the free conversion came back empty / whitespace-only / a few stray characters (a tell-tale of a scanned PDF — pages are images, not text).
If neither is true, stop. Don't spend a vision call on a document that already gave you text.
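
The OCR decision above could be expressed as a single predicate. This is a sketch, not the skill's own logic: the function name is hypothetical, and the min_chars threshold of 20 is an assumed stand-in for "a few stray characters" that should be tuned per deployment.

```python
from pathlib import Path

IMAGE_EXTS = {"png", "jpg", "jpeg", "gif", "webp", "tiff"}


def needs_ocr(filename: str, markdown: str, min_chars: int = 20) -> bool:
    """True when the free conversion is unlikely to have captured the content."""
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext in IMAGE_EXTS:
        return True  # standalone image: there is no text layer to extract
    # Count non-whitespace characters only; a scanned PDF typically
    # yields an empty or near-empty result.
    visible = "".join(markdown.split())
    return len(visible) < min_chars  # min_chars is an assumed threshold
```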

OCR (only if needed AND OPENAI_API_KEY is set). Shell out to the CLI:

markitdown "<abspath>" --use-plugins --llm-client openai --llm-model "${MARKITDOWN_OCR_MODEL:-gpt-4o-mini}" -o "<out>.md"

Or, with no global install, one-shot via uvx:

uvx --from "git+https://github.com/Self-made-Orange/markitdown.git#subdirectory=packages/markitdown[all]" \
    --with "git+https://github.com/Self-made-Orange/markitdown.git#subdirectory=packages/markitdown-ocr" \
    --with openai \
    markitdown "<abspath>" --use-plugins --llm-client openai --llm-model "${MARKITDOWN_OCR_MODEL:-gpt-4o-mini}"
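
When the agent shells out from Python, building the argument list explicitly avoids shell quoting problems with unusual filenames. The sketch below only assembles the command from the flags shown above; the helper name build_ocr_command is hypothetical, and running it assumes the fork's markitdown CLI is already on PATH.

```python
import os
from typing import List, Mapping, Optional


def build_ocr_command(
    path: str,
    out_path: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Assemble the OCR CLI invocation as an argv list (no shell involved)."""
    env = os.environ if env is None else env
    model = env.get("MARKITDOWN_OCR_MODEL") or "gpt-4o-mini"
    cmd = [
        "markitdown", path,
        "--use-plugins",
        "--llm-client", "openai",
        "--llm-model", model,
    ]
    if out_path:
        cmd += ["-o", out_path]
    return cmd


# Example run (not executed here):
#   import subprocess
#   subprocess.run(build_ocr_command("/tmp/scan.pdf", "/tmp/scan.md"), check=True)
```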

OCR-extracted text is wrapped inline as *[Image OCR] … [End OCR]*, interleaved in reading order, so document structure is preserved.
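
If downstream code needs to tell OCR-derived text apart from natively extracted text, the markers can be matched with a regular expression. This sketch assumes the markers appear literally as *[Image OCR] and [End OCR]* with the OCR text between them; check a real output before relying on the exact pattern.

```python
import re
from typing import List

# Assumed literal marker layout, based on the description above.
OCR_SPAN = re.compile(r"\*\[Image OCR\](.*?)\[End OCR\]\*", re.DOTALL)


def extract_ocr_spans(markdown: str) -> List[str]:
    """Return the text of each OCR span, in document order."""
    return [span.strip() for span in OCR_SPAN.findall(markdown)]
```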

If OPENAI_API_KEY is NOT set and the content is image-only: do not pretend. Tell the user the file is image-based and reading it needs the optional OCR layer (an OpenAI-compatible key), and stop.

Use or persist. One-off question → read the Markdown and answer; no need to save. Worth keeping → write it to your knowledge store with provenance (original filename, source, date).
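
One way to keep provenance attached to saved Markdown is a small header block at the top of the stored file. The front-matter layout and the field names below are assumptions chosen for illustration; the skill does not prescribe a storage format.

```python
from datetime import date


def with_provenance(markdown: str, filename: str, source: str, converted_on: date) -> str:
    """Prepend a provenance header (hypothetical layout) to converted Markdown."""
    header = "\n".join([
        "---",
        f"original_filename: {filename}",
        f"source: {source}",
        f"date: {converted_on.isoformat()}",
        "---",
    ])
    # Values are written as-is; filenames with special characters
    # would need quoting in a real front-matter parser.
    return f"{header}\n\n{markdown}"
```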

Cost & model notes

Free layer: $0. Local text extraction, no network model.
OCR layer: one vision API call per image (and one per page for fully scanned PDFs, rendered at 300 DPI). With gpt-4o-mini this is roughly a fraction of a cent per image — cheap, but not zero, and it scales with image count. Pick a small vision model unless you need fidelity.
The OCR layer is the reason this fork exists: it gives a text-only agent a way to "see" images, on demand, without making the whole agent multimodal.
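
The scaling rule above is plain arithmetic: one call per image plus one per scanned page. The sketch below leaves the per-call price as a caller-supplied parameter, because the actual price depends on the model and provider and is not stated here.

```python
def estimate_ocr_calls(images: int, scanned_pages: int) -> int:
    """One vision call per image, plus one per page of a fully scanned PDF."""
    return images + scanned_pages


def estimate_ocr_cost(calls: int, price_per_call_usd: float) -> float:
    """price_per_call_usd is an assumption the caller must supply."""
    return calls * price_per_call_usd
```

For example, a 12-page scanned PDF plus 3 standalone images comes to 15 vision calls.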

Gotchas

MCP ≠ OCR. Do not set MARKITDOWN_ENABLE_PLUGINS=true on the server expecting OCR — the server passes no llm_client, so it silently skips OCR. OCR is CLI-only.
Path access. Both the file:// input and any output path must be inside the server/agent's allowed root, or the call is blocked.
Encrypted / corrupt files can fail conversion. Report the failure plainly; for PDFs you can retry with a dedicated PDF tool if available.
Don't OCR what already has text. The "Decide if OCR is needed" check in the flow exists to avoid burning vision calls on ordinary documents.
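
The path-access gotcha can be checked before making the call. This is a minimal sketch assuming the allowed root is known to the agent as a directory path; the function name is hypothetical and the server's own enforcement may differ.

```python
from pathlib import Path


def is_within_root(path: str, allowed_root: str) -> bool:
    """True when path resolves to allowed_root or somewhere beneath it."""
    root = Path(allowed_root).resolve()
    # resolve() collapses ".." segments and follows symlinks,
    # so traversal out of the root is detected.
    target = Path(path).resolve()
    return target == root or root in target.parents
```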

Supported formats

Free (local): PDF, PowerPoint, Word, Excel, HTML, CSV, JSON, XML, EPUB, ZIP (iterates contents), plus text formats.
OCR-enhanced (key required): scanned PDFs, standalone images, and images embedded in PDF/DOCX/PPTX/XLSX.

Built on microsoft/markitdown; OCR layer from the Self-made-Orange/markitdown fork (packages/markitdown-ocr).

Security recommendations

Install only if you trust this publisher and need ClawHub maintainer workflows. Before using `autoreview`, prefer `--no-yolo` or `AUTOREVIEW_YOLO=0` unless you intentionally want a nested reviewer with full local authority. Treat moderation, email, and production migration commands as high-impact operations and require explicit targets, dry runs, confirmations, and audit verification.

Capability tags

requires-sensitive-credentials

Capability assessment

ℹ Purpose & Capability

The skills are purpose-aligned for ClawHub code review, moderation, PR maintenance, Convex setup, and migrations; several workflows legitimately involve high-impact staff or production actions.

⚠ Instruction Scope

The autoreview helper discloses that it runs nested `codex review` with `--dangerously-bypass-approvals-and-sandbox --sandbox danger-full-access` by default, which grants broad execution authority for a review task even though an opt-out exists.

✓ Install Mechanism

The reviewed skill files are static instructions plus one helper script; I found no skill install-time persistence, hidden post-install behavior, or automatic credential collection in the skill artifacts.

⚠ Credentials

Full-access nested agent execution is not tightly contained to the stated read-oriented review purpose; fallback reviewers may also receive generated diffs, which is expected for review but should be treated as code-sharing with external tools.

ℹ Persistence & Privilege

Moderation and migration skills can affect users, packages, orgs, emails, and production data, but they include explicit target, reason, dry-run, backup, confirmation, verification, and audit-log guidance; no automatic persistence was found.

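How to use

Make sure OpenClaw is installed (locally or via Docker).
In the chat box, enter the install command: /install markitdown-skill-for-non-multimodal-agent
Once installed, invoke the Skill by name or trigger it with /markitdown-skill-for-non-multimodal-agent
Provide the inputs described in the Skill's parameter notes to get structured output.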

Version history

v1.0.0

Initial public release of markitdown-skill-for-non-multimodal-agent.
- Enables text-only agents to read content from document attachments (PDF, Word, Excel, PowerPoint, etc.) by converting them to Markdown via a local MCP server.
- Optional OCR layer extracts text from scanned PDFs and images using an OpenAI-compatible vision model if API key is provided.
- Free layer handles most document formats without API keys or extra cost.
- Designed to skip conversion for plain text files and images unless OCR is enabled.
- Setup requires local markitdown MCP server; OCR requires additional CLI tools and a vision model API key.

Metadata

Slug markitdown-skill-for-non-multimodal-agent

Version 1.0.0

License MIT-0

Total installs 0

Current installs 0

Versions 1

FAQ