← 返回 Skills 市场
195
总下载
0
收藏
0
当前安装
1
版本数
在 OpenClaw 中安装
/install everything2markdown-cn
功能描述
使用 Microsoft MarkItDown 将各种文档格式(PDF/DOCX/PPTX/XLSX/图片/音频等)转换为 Markdown,专为 AGENT 和 LLM 工作流优化
使用说明 (SKILL.md)
Everything2Markdown - 万物转 Markdown
专为 AGENT 和 LLM 工作流优化的文档转换工具,基于 Microsoft MarkItDown。
核心特性
- ✅ 多格式支持: PDF, DOCX, PPTX, XLSX, EPUB, HTML
- ✅ 富媒体处理: 图片(OCR), 音频(转录), YouTube 链接
- ✅ 结构保留: 标题层级、表格、列表、链接
- ✅ 元数据提取: 作者、创建时间、页数等
- ✅ AGENT 优化: 输出适合 LLM 处理的干净 Markdown
支持的格式
| 格式 | 扩展名 | 说明 |
|---|---|---|
| 完整文本提取,保留结构 | ||
| Word | .docx, .doc | 保留标题和表格 |
| PowerPoint | .pptx, .ppt | 逐页转换 |
| Excel | .xlsx, .xls | 工作表转表格 |
| EPUB | .epub | 电子书格式 |
| HTML | .html, .htm | 网页转换 |
| 图片 | .png, .jpg | OCR 文字提取 |
| 音频 | .mp3, .wav | 语音转文字 |
| YouTube | URL | 字幕和元数据 |
快速开始
转换单个文件
markitdown document.pdf -o output.md
批量转换目录
# 转换目录下所有 PDF 和 DOCX
find . -name "*.pdf" -o -name "*.docx" | while read f; do
out="${f%.*}.md"
markitdown "$f" -o "$out"
echo "✓ 已转换: $f → $out"
done
高级选项
# 详细输出
markitdown document.pdf -o output.md --verbose
# 保留临时文件
markitdown document.pdf -o output.md --keep-temp
# 指定编码
markitdown document.pdf -o output.md --encoding utf-8
Python API
基础用法
from markitdown import MarkItDown
# 初始化
md = MarkItDown()
# 转换文件
result = md.convert("document.pdf")
# 访问内容
print(result.text_content) # Markdown 文本
print(result.metadata) # 文档元数据
高级 API
from markitdown import MarkItDown
# 自定义配置
md = MarkItDown(
enable_plugins=True
)
# 带选项转换
result = md.convert(
"document.pdf",
keep_formatting=True,
extract_images=True
)
# 访问结构化数据
print(f"标题: {result.metadata.get('title')}")
print(f"作者: {result.metadata.get('author')}")
print(f"页数: {result.metadata.get('pages')}")
print(f"字数: {len(result.text_content.split())}")
AGENT 工作流最佳实践
1. 文档预处理管道
import re
from markitdown import MarkItDown
def preprocess_for_llm(file_path):
"""
为 LLM 预处理文档。
清理噪声,规范化格式,提取结构。
"""
# 转换为 Markdown
md = MarkItDown()
result = md.convert(file_path)
text = result.text_content
# 清理过度格式化
text = re.sub(r'\*{4,}', '***', text) # 限制星号
text = re.sub(r'\-{4,}', '---', text) # 规范分隔线
text = re.sub(r'\
{4,}', '\
\
\
', text) # 限制空行
# 规范化标题
text = re.sub(r'^#{7,}', '######', text, flags=re.MULTILINE)
return {
'content': text,
'metadata': result.metadata,
'original_length': len(result.text_content),
'processed_length': len(text)
}
2. 结构化提取
import re
from dataclasses import dataclass
from typing import List, Optional
@dataclass
class Section:
level: int
title: str
content: str
start_line: int
end_line: int
def extract_sections(md_text: str) -> List[Section]:
"""
从 Markdown 提取层级章节。
保留文档结构用于 AGENT 处理。
"""
lines = md_text.split('\
')
sections = []
# 查找所有标题
header_pattern = re.compile(r'^(#{1,6})\s+(.+)$')
headers = []
for i, line in enumerate(lines):
match = header_pattern.match(line)
if match:
level = len(match.group(1))
title = match.group(2).strip()
headers.append({'level': level, 'title': title, 'line': i})
# 提取章节
for i, header in enumerate(headers):
start_line = header['line']
end_line = headers[i + 1]['line'] if i + 1 \x3C len(headers) else len(lines)
content = '\
'.join(lines[start_line + 1:end_line]).strip()
sections.append(Section(
level=header['level'],
title=header['title'],
content=content,
start_line=start_line,
end_line=end_line
))
return sections
3. RAG 优化分块
from typing import List, Dict
import hashlib
def chunk_for_rag(
md_text: str,
max_chunk_size: int = 1500,
overlap: int = 200,
preserve_headers: bool = True
) -> List[Dict]:
"""
分块 Markdown 用于最优 RAG 检索。
保留语义边界并提供丰富元数据。
"""
chunks = []
current_chunk = []
current_size = 0
chunk_index = 0
# 按自然边界分割
paragraphs = md_text.split('\
\
')
for para in paragraphs:
para = para.strip()
if not para:
continue
para_size = len(para)
# 检查是否超过限制
if current_size + para_size > max_chunk_size and current_chunk:
# 保存当前块
chunk_text = '\
\
'.join(current_chunk)
chunks.append(create_chunk_dict(
chunk_text, chunk_index, md_text
))
# 重叠开始新块
if overlap > 0:
overlap_text = '\
\
'.join(current_chunk[-2:]) if len(current_chunk) >= 2 else chunk_text[-overlap:]
current_chunk = [overlap_text, para]
current_size = len(overlap_text) + para_size
else:
current_chunk = [para]
current_size = para_size
chunk_index += 1
else:
current_chunk.append(para)
current_size += para_size + 2 # +2 for newlines
# 别忘了最后一块
if current_chunk:
chunk_text = '\
\
'.join(current_chunk)
chunks.append(create_chunk_dict(
chunk_text, chunk_index, md_text
))
return chunks
def create_chunk_dict(text: str, index: int, source: str) -> Dict:
"""创建带元数据的块字典。"""
return {
'index': index,
'text': text,
'length': len(text),
'hash': hashlib.md5(text.encode()).hexdigest()[:8],
'source_length': len(source),
'percent_start': 0 if index == 0 else round((sum(len(c) for c in source[:text]) / len(source)) * 100, 1)
}
4. 文档分析管道
from typing import Dict, Any
from dataclasses import dataclass, asdict
@dataclass
class DocumentAnalysis:
"""AGENT 消费的完整文档分析。"""
file_path: str
file_type: str
word_count: int
char_count: int
section_count: int
heading_levels: Dict[int, int] # level -> count
has_tables: bool
has_images: bool
has_links: bool
estimated_reading_time: int # minutes
summary: str
keywords: list
metadata: Dict[str, Any]
class DocumentAnalyzer:
"""AGENT 工作流的文档分析器。"""
def __init__(self):
self.md = MarkItDown()
def analyze(self, file_path: str) -> DocumentAnalysis:
"""完整文档分析。"""
# 转换为 Markdown
result = self.md.convert(file_path)
text = result.text_content
# 基础统计
word_count = len(text.split())
char_count = len(text)
# 章节分析
sections = extract_sections(text)
section_count = len(sections)
# 标题层级
heading_levels = {}
for s in sections:
heading_levels[s.level] = heading_levels.get(s.level, 0) + 1
# 特性检测
has_tables = '|' in text and '---' in text
has_images = '![' in text
has_links = 'http' in text or '[' in text
# 阅读时间(平均 200 字/分钟)
reading_time = max(1, word_count // 200)
# 提取关键词(简单方法)
words = re.findall(r'\b[A-Za-z][a-z]{4,}\b', text)
word_freq = {}
for w in words:
w = w.lower()
if w not in ['would', 'could', 'should', 'there', 'their', 'about']:
word_freq[w] = word_freq.get(w, 0) + 1
keywords = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)[:10]
# 生成摘要
summary = self._generate_summary(text, sections)
return DocumentAnalysis(
file_path=file_path,
file_type=file_path.split('.')[-1].lower(),
word_count=word_count,
char_count=char_count,
section_count=section_count,
heading_levels=heading_levels,
has_tables=has_tables,
has_images=has_images,
has_links=has_links,
estimated_reading_time=reading_time,
summary=summary,
keywords=[k[0] for k in keywords],
metadata=result.metadata
)
def _generate_summary(self, text: str, sections: List[Section]) -> str:
"""生成文档摘要。"""
if not sections:
return text[:500] + "..." if len(text) > 500 else text
# 获取主要章节
main_sections = [s for s in sections if s.level \x3C= 2]
if main_sections:
section_list = ", ".join([s.title for s in main_sections[:5]])
return f"文档包含: {section_list}"
return text[:300] + "..." if len(text) > 300 else text
高级功能
处理图片(OCR)
# 提取图片中的文字
markitdown image.png --ocr-enabled
处理音频(转录)
# 音频转文字
markitdown recording.mp3 --speech-transcription
处理 YouTube
# 提取 YouTube 视频字幕和元数据
markitdown "https://youtube.com/watch?v=..."
故障排除
安装问题
# 确保安装完整依赖
pip install 'markitdown[all]'
# 或使用 Conda
conda install -c conda-forge markitdown
转换失败
# 启用详细日志
markitdown file.pdf -o out.md --verbose
# 检查文件权限
file file.pdf
ls -la file.pdf
中文乱码
# 确保使用 UTF-8
export LANG=en_US.UTF-8
markitdown chinese.pdf -o output.md
MCP 服务器集成
MarkItDown 提供 MCP (Model Context Protocol) 服务器,可用于 Claude Desktop 等工具:
pip install markitdown-mcp
配置 Claude Desktop claude_desktop_config.json:
{
"mcpServers": {
"markitdown": {
"command": "python",
"args": ["-m", "markitdown_mcp"]
}
}
}
相关链接
提示: 此 Skill 专为 AGENT 和 LLM 工作流优化,输出的 Markdown 格式干净、结构清晰,非常适合 RAG 系统和 AI 处理管道使用。
安全使用建议
This skill appears to do what it says: wrap Microsoft MarkItDown to convert documents to Markdown. Before installing, note that: (1) SKILL.md expects python3/pip3 and runs pip3 install 'markitdown[all]' — ensure you want that package and its extras installed (it can pull OCR, transcription, and media tooling). (2) Features like YouTube subtitle extraction and audio transcription will require network access and possibly third-party APIs or binaries (e.g., ffmpeg); review MarkItDown's docs to understand remote calls and dependencies. (3) Because the skill is instruction-only (no bundled code here), review the markitdown package source / PyPI page for provenance and trustworthiness. Recommended precautions: install into an isolated virtual environment, review the package's homepage/repository and dependencies, and avoid running it on highly sensitive documents unless you confirm where data may be transmitted or cached.
功能分析
Type: OpenClaw Skill
Name: everything2markdown-cn
Version: 1.0.0
The skill is a legitimate wrapper for the Microsoft MarkItDown library, providing documentation and code examples for converting various document formats to Markdown. It contains standard installation commands (pip3 install) and Python utility functions for document analysis and RAG preprocessing without any signs of malicious intent, data exfiltration, or prompt injection (SKILL.md, README.md).
能力评估
Purpose & Capability
The skill's stated purpose (convert many document formats to Markdown using Microsoft MarkItDown) matches the runtime instructions and examples. However, there is a small internal inconsistency: the registry summary lists no required binaries/env, while the SKILL.md metadata and install step expect python3 and pip3 and installing the markitdown Python package. This is likely an authoring/packaging omission rather than malicious intent.
Instruction Scope
SKILL.md instructions and code samples focus on converting local files (single or batch), preprocessing Markdown, section extraction, and RAG chunking. There are no instructions to read unrelated system files, to collect unrelated environment variables, or to send converted content to unknown endpoints. The only notable behavior is handling YouTube links and audio transcription which implies network access and use of third-party services (expected for those features).
Install Mechanism
The install step uses pip3 install 'markitdown[all]'. Installing a feature-complete extras set can pull many dependencies (OCR, transcription, media tooling like ffmpeg, etc.), increasing the attack surface and requiring network access to PyPI. This is a standard package install (PyPI) rather than an arbitrary download, but you should be aware it may install native binaries or extra packages with their own behavior.
Credentials
The skill does not request any environment variables, secrets, or credential files. None of the instructions reference hidden env vars or credentials. This is proportionate for a document-conversion tool.
Persistence & Privilege
The skill is instruction-only, has no always:true flag, and does not request persistent/global privileges or modify other skills. It can be invoked by the agent (normal default) but does not request elevated or permanent presence.
如何使用
- 确保已安装 OpenClaw(本地或 Docker 部署)
- 在对话框中输入安装命令:
/install everything2markdown-cn - 安装完成后,直接呼叫该 Skill 的名称或使用
/everything2markdown-cn触发 - 根据 Skill 的参数说明提供必要输入,即可获得结构化输出
版本历史
v1.0.0
everything2markdown-cn v1.0.0
- 首次发布:支持将多种文档格式批量或单文件转换为结构化 Markdown,专为 AGENT 与 LLM 工作流优化
- 提供文档预处理、结构化章节提取与 RAG 分块等多种 Python 处理范例
- 完整支持 PDF、DOCX、PPTX、XLSX、EPUB、HTML、图片(OCR)、音频(转录)、YouTube 字幕等常见文档类型
- 输出保留标题、表格、列表、链接等结构及元数据信息,适合自动化管道接入
- 简洁 API 设计,易于集成至自动化、数据分析、检索增强生成 (RAG) 等场景
元数据
常见问题
everything to markdown 中文版 是什么?
使用 Microsoft MarkItDown 将各种文档格式(PDF/DOCX/PPTX/XLSX/图片/音频等)转换为 Markdown,专为 AGENT 和 LLM 工作流优化. 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件,目前累计下载 195 次。
如何安装 everything to markdown 中文版?
在 OpenClaw 或 Claude Code 对话框中运行命令「/install everything2markdown-cn」即可一键安装,无需额外配置。
everything to markdown 中文版 是免费的吗?
是的,everything to markdown 中文版 完全免费,采用 MIT-0 许可证,可自由下载、安装和使用。
everything to markdown 中文版 支持哪些平台?
everything to markdown 中文版 跨平台运行,可在任意部署了 OpenClaw / Claude Code 的环境中使用(cross-platform)。
谁开发了 everything to markdown 中文版?
由 xsm0826(@xsm0826)开发并维护,当前版本 v1.0.0。
推荐 Skills