keepyaoung

Markitdown-Skill-for-non-multimodal-agent

By Tommy, Joon Shin · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ⚠ suspicious
Total downloads: 37 · Favorites: 0 · Current installs: 0 · Versions: 1
Install in OpenClaw
/install markitdown-skill-for-non-multimodal-agent
Description
Use when a NON-multimodal agent (a text-only LLM backend that cannot read attachments) receives a document — PDF, Word (docx), PowerPoint (pptx), Excel (xlsx...
Usage (SKILL.md)

markitdown — let a text-only agent read documents

A non-multimodal OpenClaw agent has no eyes: its backend is a plain-text API, so it cannot open a PDF / Word / Excel / PowerPoint attachment at all. This skill turns those files into Markdown the model can read.

Two layers, and you almost always only need the first:

  • Free layer — DEFAULT. The markitdown MCP server converts text-bearing documents (PDF, docx, pptx, xlsx, html, csv, json, xml, epub, zip) to Markdown locally. No API key. No per-call cost. This handles the vast majority of attachments, because most documents store real text.
  • OCR layer — OPT-IN. Scanned PDFs (photographed pages), standalone image files, and images embedded inside documents contain no extractable text — the only way to read them is to have a vision model look. This layer is OFF unless OPENAI_API_KEY is set, and it bills per image.

The cost line is simple: pulling existing text out of a file is free; asking a model to look at a picture costs money.

When to use

A document attachment arrives (Slack file, email attachment, a path the user gives you) whose extension or MIME is one of:

pdf · docx · pptx · xlsx · xls · epub · html · htm · csv · json · xml · zip · md

…and you need its content (to answer about it, summarize it, quote it, or save it).

For plain images (png · jpg · jpeg · gif · webp · tiff): only useful if the OCR layer is on. With no key, a text-only agent simply cannot read an image — say so rather than guessing.

Do not use this for a file the agent can already read as plain text in the prompt.
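
The routing rule above can be sketched as a small lookup. This is only an illustration: the extension sets are copied from the lists in this section, while the function name and the returned labels ("convert", "ocr", "unreadable", "skip") are hypothetical and not part of the skill.

```python
from pathlib import PurePath

# Extension sets copied from the "When to use" lists; labels are illustrative.
DOCUMENT_EXTS = {
    "pdf", "docx", "pptx", "xlsx", "xls", "epub",
    "html", "htm", "csv", "json", "xml", "zip", "md",
}
IMAGE_EXTS = {"png", "jpg", "jpeg", "gif", "webp", "tiff"}


def classify_attachment(filename: str, ocr_enabled: bool) -> str:
    """Decide what to do with an attachment based on its extension.

    Returns one of:
      "convert"    - send it to the free converter
      "ocr"        - plain image and the OCR layer is on
      "unreadable" - plain image but no OCR; tell the user instead of guessing
      "skip"       - not a format this skill is meant for
    """
    ext = PurePath(filename).suffix.lower().lstrip(".")
    if ext in DOCUMENT_EXTS:
        return "convert"
    if ext in IMAGE_EXTS:
        return "ocr" if ocr_enabled else "unreadable"
    return "skip"
```

A real agent would probably also check the MIME type, since extensions can be missing or wrong.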


Setup (operator, one time)

Free layer — the MCP server

Run the server over stdio with no install using uvx:

uvx markitdown-mcp

Register it in your OpenClaw / MCP client config:

{
  "mcpServers": {
    "markitdown": {
      "command": "uvx",
      "args": ["markitdown-mcp"]
    }
  }
}

This exposes one tool: convert_to_markdown(uri), where uri is any http:, https:, file:, or data: URI. That is the whole free layer.
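
For orientation, here is roughly what a call to that tool looks like on the wire. An MCP client library normally builds this message for you (after the initialize handshake), so treat this as a sketch of the JSON-RPC `tools/call` shape rather than something you need to write. The helper name `build_convert_request` is hypothetical; the tool name and the `uri` argument come from the signature above.

```python
import json

# URI schemes the tool is described as accepting.
ALLOWED_SCHEMES = ("http:", "https:", "file:", "data:")


def build_convert_request(uri: str, request_id: int = 1) -> str:
    """Serialize an MCP tools/call request for convert_to_markdown."""
    if not uri.lower().startswith(ALLOWED_SCHEMES):
        raise ValueError(f"unsupported URI scheme: {uri!r}")
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "convert_to_markdown",
            "arguments": {"uri": uri},
        },
    }
    return json.dumps(message)
```

Rejecting unknown schemes before the call is a choice made in this sketch, not documented server behavior.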

The MCP server runs with the privileges of its process and can read any file that user can read. Keep it bound to local/stdio use only.

OCR layer — the CLI (optional)

The MCP server cannot OCR — it never wires up a vision client, so even with plugins enabled it silently returns text-only output. OCR runs through the CLI instead. Install the fork (which ships the markitdown-ocr plugin) plus an OpenAI client:

pip install "git+https://github.com/Self-made-Orange/markitdown.git#subdirectory=packages/markitdown[all]"
pip install "git+https://github.com/Self-made-Orange/markitdown.git#subdirectory=packages/markitdown-ocr"
pip install openai

Then export a key (any OpenAI-compatible endpoint works):

export OPENAI_API_KEY=sk-...
export MARKITDOWN_OCR_MODEL=gpt-4o-mini   # cheapest vision model; optional

With no OPENAI_API_KEY, the plugin still loads but OCR is skipped — you fall back to the free converter automatically. So the OCR layer is genuinely zero-cost until someone opts in.
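
An agent wrapper can apply the same opt-in rule before it ever shells out. The sketch below is an assumption about how you might gate the call on your side; it does not describe the plugin's internals. It reads the two environment variables named above, and the function name `ocr_model` is hypothetical.

```python
import os
from typing import Mapping, Optional

DEFAULT_OCR_MODEL = "gpt-4o-mini"  # same default as the shell examples


def ocr_model(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the vision model to use, or None when the OCR layer is off.

    The OCR layer counts as off when OPENAI_API_KEY is unset or blank.
    An unset or empty MARKITDOWN_OCR_MODEL falls back to the default,
    mirroring the shell form ${MARKITDOWN_OCR_MODEL:-gpt-4o-mini}.
    """
    if not env.get("OPENAI_API_KEY", "").strip():
        return None
    return env.get("MARKITDOWN_OCR_MODEL") or DEFAULT_OCR_MODEL
```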


Flow (per attachment)

  1. Get the absolute path. The downloaded attachment's absolute path is already provided by the runtime (e.g. MediaPaths). Build a file://<absolute-path> URI.

    • ⚠️ Convert in the same turn the file arrives — downloads live in a temp dir and may be GC'd next turn.
  2. Free convert (always try first). Call convert_to_markdown("file://<abspath>") on the markitdown MCP server. For normal documents you are done: read or store the Markdown.

  3. Decide if OCR is needed. OCR only matters when:

    • the file is a standalone image, or
    • the free conversion came back empty / whitespace-only / a few stray characters (a tell-tale of a scanned PDF — pages are images, not text).

    If neither is true, stop. Don't spend a vision call on a document that already gave you text.

  4. OCR (only if needed AND OPENAI_API_KEY is set). Shell out to the CLI:

    markitdown "<abspath>" --use-plugins --llm-client openai --llm-model "${MARKITDOWN_OCR_MODEL:-gpt-4o-mini}" -o "<out>.md"
    

    Or, with no global install, one-shot via uvx:

    uvx --from "git+https://github.com/Self-made-Orange/markitdown.git#subdirectory=packages/markitdown[all]" \
        --with "git+https://github.com/Self-made-Orange/markitdown.git#subdirectory=packages/markitdown-ocr" \
        --with openai \
        markitdown "<abspath>" --use-plugins --llm-client openai --llm-model "${MARKITDOWN_OCR_MODEL:-gpt-4o-mini}"
    

    OCR-extracted text is wrapped inline as *[Image OCR] … [End OCR]*, interleaved in reading order, so document structure is preserved.

    If OPENAI_API_KEY is NOT set and the content is image-only: do not pretend. Tell the user the file is image-based and reading it needs the optional OCR layer (an OpenAI-compatible key), and stop.

  5. Use or persist. One-off question → read the Markdown and answer; no need to save. Worth keeping → write it to your knowledge store with provenance (original filename, source, date).
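
The decision logic in steps 1 to 4 can be sketched as a few pure functions. Everything here is illustrative: the function names, the step labels, and in particular the `MIN_TEXT_CHARS` threshold for "a few stray characters" are assumptions, not values the skill defines. The CLI flags are the ones shown in step 4.

```python
from pathlib import Path
from typing import List, Optional

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".tiff"}
MIN_TEXT_CHARS = 20  # assumed cutoff for "empty / a few stray characters"


def to_file_uri(abs_path: str) -> str:
    """Step 1: build a file:// URI (raises ValueError for relative paths)."""
    return Path(abs_path).as_uri()


def needs_ocr(abs_path: str, markdown: str) -> bool:
    """Step 3: OCR matters for standalone images or near-empty conversions."""
    if Path(abs_path).suffix.lower() in IMAGE_EXTS:
        return True
    visible = "".join(markdown.split())  # drop all whitespace
    return len(visible) < MIN_TEXT_CHARS


def build_ocr_command(abs_path: str, model: str,
                      out_path: Optional[str] = None) -> List[str]:
    """Step 4: argument list for the OCR-enabled CLI call."""
    cmd = ["markitdown", abs_path, "--use-plugins",
           "--llm-client", "openai", "--llm-model", model]
    if out_path:
        cmd += ["-o", out_path]
    return cmd


def next_step(abs_path: str, markdown: str, api_key_set: bool) -> str:
    """Return "done", "ocr", or "tell_user" after the free conversion."""
    if not needs_ocr(abs_path, markdown):
        return "done"
    return "ocr" if api_key_set else "tell_user"
```

Passing the command as a list (for example to `subprocess.run`) avoids shell-quoting problems with paths that contain spaces.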


Cost & model notes

  • Free layer: $0. Local text extraction, no network model.
  • OCR layer: one vision API call per image (and one per page for fully scanned PDFs, rendered at 300 DPI). With gpt-4o-mini this is roughly a fraction of a cent per image — cheap, but not zero, and it scales with image count. Pick a small vision model unless you need fidelity.
  • The OCR layer is the reason this fork exists: it gives a text-only agent a way to "see" images, on demand, without making the whole agent multimodal.
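
To make "scales with image count" concrete, here is a back-of-the-envelope estimator. It only counts calls the way the bullets above describe (one per image, one per scanned page). The per-call price is a parameter you must supply from your provider's current pricing; no price is assumed here, and both function names are hypothetical.

```python
def estimate_vision_calls(standalone_images: int = 0,
                          embedded_images: int = 0,
                          scanned_pdf_pages: int = 0) -> int:
    """One vision call per image, plus one per page of a fully scanned PDF."""
    counts = (standalone_images, embedded_images, scanned_pdf_pages)
    if any(n < 0 for n in counts):
        raise ValueError("counts must be non-negative")
    return sum(counts)


def estimate_cost_usd(calls: int, price_per_call_usd: float) -> float:
    """Multiply call count by a caller-supplied per-call price."""
    return calls * price_per_call_usd
```

For example, a 40-page scanned PDF plus 2 loose screenshots is 42 vision calls, whereas a 40-page text PDF with no images is 0.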

Gotchas

  • MCP ≠ OCR. Do not set MARKITDOWN_ENABLE_PLUGINS=true on the server expecting OCR — the server passes no llm_client, so it silently skips OCR. OCR is CLI-only.
  • Path access. Both the file:// input and any output path must be inside the server/agent's allowed root, or the call is blocked.
  • Encrypted / corrupt files can fail conversion. Report the failure plainly; for PDFs you can retry with a dedicated PDF tool if available.
  • Don't OCR what already has text. Step 3's check exists to avoid burning vision calls on ordinary documents.
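
A pre-flight check for the path-access gotcha might look like the following. How the server or agent actually enforces its allowed root is not specified here, so this is only a client-side sketch; `is_within_root` is a hypothetical helper.

```python
from pathlib import Path


def is_within_root(path: str, root: str) -> bool:
    """True if `path` resolves to `root` itself or somewhere beneath it.

    resolve() collapses ".." segments and follows symlinks, so a path like
    <root>/../elsewhere/file.pdf is correctly reported as outside the root.
    """
    resolved = Path(path).resolve()
    root_resolved = Path(root).resolve()
    return resolved == root_resolved or root_resolved in resolved.parents
```

Checking both the input path and the output path before calling the converter gives a clearer error than a blocked call.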

Supported formats

Free (local): PDF, PowerPoint, Word, Excel, HTML, CSV, JSON, XML, EPUB, ZIP (iterates contents), plus text formats. OCR-enhanced (key required): scanned PDFs, standalone images, and images embedded in PDF/DOCX/PPTX/XLSX.


Built on microsoft/markitdown; OCR layer from the Self-made-Orange/markitdown fork (packages/markitdown-ocr).

Security recommendations
Install only if you trust this publisher and need ClawHub maintainer workflows. Before using `autoreview`, prefer `--no-yolo` or `AUTOREVIEW_YOLO=0` unless you intentionally want a nested reviewer with full local authority. Treat moderation, email, and production migration commands as high-impact operations and require explicit targets, dry runs, confirmations, and audit verification.
Capability tags
requires-sensitive-credentials
Capability assessment
Purpose & Capability
The skills are purpose-aligned for ClawHub code review, moderation, PR maintenance, Convex setup, and migrations; several workflows legitimately involve high-impact staff or production actions.
Instruction Scope
The autoreview helper discloses that it runs nested `codex review` with `--dangerously-bypass-approvals-and-sandbox --sandbox danger-full-access` by default, which grants broad execution authority for a review task even though an opt-out exists.
Install Mechanism
The reviewed skill files are static instructions plus one helper script; I found no skill install-time persistence, hidden post-install behavior, or automatic credential collection in the skill artifacts.
Credentials
Full-access nested agent execution is not tightly contained to the stated read-oriented review purpose; fallback reviewers may also receive generated diffs, which is expected for review but should be treated as code-sharing with external tools.
Persistence & Privilege
Moderation and migration skills can affect users, packages, orgs, emails, and production data, but they include explicit target, reason, dry-run, backup, confirmation, verification, and audit-log guidance; no automatic persistence was found.
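How to use
  1. Make sure OpenClaw is installed (locally or via Docker).
  2. Enter the install command in the chat box: /install markitdown-skill-for-non-multimodal-agent
  3. Once installed, invoke the Skill by name or trigger it with /markitdown-skill-for-non-multimodal-agent
  4. Provide the inputs described in the Skill's parameter notes to get structured output.
Version history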
v1.0.0
Initial public release of markitdown-skill-for-non-multimodal-agent.
  • Enables text-only agents to read content from document attachments (PDF, Word, Excel, PowerPoint, etc.) by converting them to Markdown via a local MCP server.
  • Optional OCR layer extracts text from scanned PDFs and images using an OpenAI-compatible vision model if an API key is provided.
  • Free layer handles most document formats without API keys or extra cost.
  • Skips conversion for plain-text files, and for images unless OCR is enabled.
  • Setup requires a local markitdown MCP server; OCR requires additional CLI tools and a vision model API key.
Metadata
Slug markitdown-skill-for-non-multimodal-agent
Version 1.0.0
License MIT-0
Total installs 0
Current installs 0
Versions 1
FAQ

What is Markitdown-Skill-for-non-multimodal-agent?

Use when a NON-multimodal agent (a text-only LLM backend that cannot read attachments) receives a document — PDF, Word (docx), PowerPoint (pptx), Excel (xlsx... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 37 total downloads so far.

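How do I install Markitdown-Skill-for-non-multimodal-agent?

Run the command "/install markitdown-skill-for-non-multimodal-agent" in the OpenClaw or Claude Code chat box to install the skill in one step. The markitdown MCP server described under Setup still has to be registered once by the operator.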

Is Markitdown-Skill-for-non-multimodal-agent free?

Yes. Markitdown-Skill-for-non-multimodal-agent is completely free under the MIT-0 license and can be freely downloaded, installed, and used.

Which platforms does Markitdown-Skill-for-non-multimodal-agent support?

Markitdown-Skill-for-non-multimodal-agent is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Markitdown-Skill-for-non-multimodal-agent?

It is developed and maintained by Tommy, Joon Shin (@keepyaoung). The current version is v1.0.0.
