Total downloads: 102
Favorites: 0
Current installs: 0
Versions: 1
Install in OpenClaw
/install everything2markdown
Description
Convert almost anything (PDF, DOCX, PPTX, XLSX, images, audio, YouTube, etc.) to Markdown using Microsoft MarkItDown. Optimized for AGENT and LLM workflows.
Usage Instructions (SKILL.md)
Everything2Markdown - Convert Anything to Markdown
A powerful document conversion tool based on Microsoft MarkItDown, specifically optimized for AGENT and LLM workflows.
Core Features
- ✅ Universal Support: PDF, DOCX, PPTX, XLSX, EPUB, HTML, CSV, JSON, XML
- ✅ Rich Media: Image OCR, audio transcription, YouTube subtitle extraction
- ✅ Structure Preservation: Headings, tables, lists, links maintained
- ✅ Metadata Extraction: Author, creation date, page count, etc.
- ✅ AGENT Optimized: Clean Markdown output perfect for LLM processing
Supported Formats
| Format | Extension | Notes |
|---|---|---|
| PDF | .pdf | Full text extraction with structure |
| Word | .docx, .doc | Preserves headings and tables |
| PowerPoint | .pptx, .ppt | Slide-by-slide conversion |
| Excel | .xlsx, .xls | Sheet-to-table conversion |
| EPUB | .epub | E-book format |
| HTML | .html, .htm | Web page conversion |
| Images | .png, .jpg, .gif | OCR text extraction |
| Audio | .mp3, .wav, .m4a | Speech-to-text |
| Archives | .zip | Iterates contents |
| YouTube | URL | Subtitle and metadata |
Quick Start
Single File Conversion
markitdown document.pdf -o output.md
Batch Conversion
# Convert all PDFs in the current directory
for f in *.pdf; do
  markitdown "$f" -o "${f%.pdf}.md"
done
# Or recurse with find (PDF and DOCX files)
find . \( -name "*.pdf" -o -name "*.docx" \) | while IFS= read -r f; do
  out="${f%.*}.md"
  markitdown "$f" -o "$out"
  echo "✓ Converted: $f → $out"
done
Advanced Options
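# Option names vary between MarkItDown releases; confirm each flag with
# `markitdown --help` before relying on it.
# Verbose output
markitdown document.pdf -o output.md --verbose
# Keep intermediate files
markitdown document.pdf -o output.md --keep-temp
# Specify encoding
markitdown document.pdf -o output.md --encoding utf-8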
Python API
Basic Usage
from markitdown import MarkItDown

# Initialize the converter
md = MarkItDown()

# Convert a file
result = md.convert("document.pdf")

# Access the output
print(result.text_content)  # Markdown text
print(result.title)         # Document title, or None if not detected
Advanced API
from markitdown import MarkItDown

# Enable third-party plugins (off by default)
md = MarkItDown(enable_plugins=True)

# Custom converters are registered on the instance rather than passed to
# the constructor. MyCustomConverter is a placeholder for your own
# DocumentConverter subclass:
# md.register_converter(MyCustomConverter())

# Convert a document
result = md.convert("document.pdf")

# The result exposes the Markdown text and, when detected, a title
print(f"Title: {result.title}")
print(f"Word count: {len(result.text_content.split())}")
AGENT Workflow Best Practices
1. Document Preprocessing Pipeline
import re
from markitdown import MarkItDown

def preprocess_for_llm(file_path):
    """
    Preprocess a document for LLM consumption.
    Cleans noise and normalizes formatting.
    """
    # Convert to Markdown
    md = MarkItDown()
    result = md.convert(file_path)
    text = result.text_content

    # Clean excessive formatting
    text = re.sub(r'\*{4,}', '***', text)      # Limit runs of asterisks
    text = re.sub(r'-{4,}', '---', text)       # Normalize horizontal rules
    text = re.sub(r'\n{4,}', '\n\n\n', text)   # Limit blank lines

    # Cap heading depth at six levels
    text = re.sub(r'^#{7,}', '######', text, flags=re.MULTILINE)

    return {
        'content': text,
        'metadata': {'title': result.title},  # the result object carries a title, not a metadata dict
        'original_length': len(result.text_content),
        'processed_length': len(text),
    }
2. Structured Section Extraction
import re
from dataclasses import dataclass
from typing import List

@dataclass
class Section:
    level: int
    title: str
    content: str
    start_line: int
    end_line: int

def extract_sections(md_text: str) -> List[Section]:
    """
    Extract sections from Markdown, one per heading.
    Each section runs from its heading to the next heading of any level.
    """
    lines = md_text.split('\n')
    sections = []

    # Find all ATX headings (# through ######)
    header_pattern = re.compile(r'^(#{1,6})\s+(.+)$')
    headers = []
    for i, line in enumerate(lines):
        match = header_pattern.match(line)
        if match:
            level = len(match.group(1))
            title = match.group(2).strip()
            headers.append({'level': level, 'title': title, 'line': i})

    # Slice the text between consecutive headings
    for i, header in enumerate(headers):
        start_line = header['line']
        end_line = headers[i + 1]['line'] if i + 1 < len(headers) else len(lines)
        content = '\n'.join(lines[start_line + 1:end_line]).strip()
        sections.append(Section(
            level=header['level'],
            title=header['title'],
            content=content,
            start_line=start_line,
            end_line=end_line,
        ))
    return sections
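One caveat with line-based heading detection is that a `# comment` inside a fenced code block looks like a level-1 heading. The sketch below shows one way to skip fenced regions before matching. `find_headings` is a hypothetical helper name, and the sketch assumes fences open and close with three or more backticks or tildes at the start of a line.

```python
import re
from typing import List, Tuple

HEADING_RE = re.compile(r'^(#{1,6})\s+(.+)$')
FENCE_RE = re.compile(r'^\s{0,3}(`{3,}|~{3,})')

def find_headings(md_text: str) -> List[Tuple[int, int, str]]:
    """Return (line_number, level, title) for headings outside code fences."""
    headings = []
    open_fence = None  # fence character of the block we are inside, if any
    for i, line in enumerate(md_text.split('\n')):
        fence_match = FENCE_RE.match(line)
        if fence_match:
            char = fence_match.group(1)[0]
            if open_fence is None:
                open_fence = char      # entering a fenced block
            elif char == open_fence:
                open_fence = None      # leaving it
            continue
        if open_fence is None:
            m = HEADING_RE.match(line)
            if m:
                headings.append((i, len(m.group(1)), m.group(2).strip()))
    return headings

# Build the fence marker programmatically to keep this example readable
FENCE = "`" * 3
sample = f"# Title\n\n{FENCE}bash\n# not a heading\n{FENCE}\n\n## Setup\ntext"
print(find_headings(sample))
```

The shell comment inside the fence is ignored, so only the two real headings are reported.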
3. RAG-Optimized Chunking
import hashlib
from typing import Dict, List

def chunk_for_rag(
    md_text: str,
    max_chunk_size: int = 1500,
    overlap: int = 200,
    preserve_headers: bool = True  # reserved; not used by this implementation
) -> List[Dict]:
    """
    Chunk Markdown for RAG retrieval.
    Splits on paragraph boundaries and attaches metadata to each chunk.
    """
    chunks = []
    current_chunk = []
    current_size = 0
    chunk_index = 0

    # Split on blank lines (paragraph boundaries)
    paragraphs = md_text.split('\n\n')
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue
        para_size = len(para)

        # Flush the current chunk if this paragraph would exceed the limit
        if current_size + para_size > max_chunk_size and current_chunk:
            chunk_text = '\n\n'.join(current_chunk)
            chunks.append(create_chunk_dict(chunk_text, chunk_index, md_text))
            chunk_index += 1

            # Start the new chunk with the tail of the previous one as overlap.
            # Carrying whole paragraphs forward would make each chunk contain
            # all earlier ones, so the overlap is capped at `overlap` characters.
            if overlap > 0:
                overlap_text = chunk_text[-overlap:]
                current_chunk = [overlap_text, para]
                current_size = len(overlap_text) + 2 + para_size
            else:
                current_chunk = [para]
                current_size = para_size
        else:
            current_chunk.append(para)
            current_size += para_size + 2  # +2 for the joining newlines

    # Emit the final chunk
    if current_chunk:
        chunk_text = '\n\n'.join(current_chunk)
        chunks.append(create_chunk_dict(chunk_text, chunk_index, md_text))
    return chunks

def create_chunk_dict(text: str, index: int, source: str) -> Dict:
    """Create a chunk dictionary with metadata."""
    # Locate the chunk in the source by its opening characters; None means
    # the chunk could not be matched (for example, after whitespace stripping).
    pos = source.find(text[:80])
    percent_start = round(pos / len(source) * 100, 1) if pos >= 0 and source else None
    return {
        'index': index,
        'text': text,
        'length': len(text),
        'hash': hashlib.md5(text.encode()).hexdigest()[:8],  # short ID, not for security
        'source_length': len(source),
        'percent_start': percent_start,
    }
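The `preserve_headers` parameter suggests that chunks should keep their heading context. A minimal sketch of one way to do that: split the document at headings first, chunk each section separately, and prefix every chunk with its heading. `split_by_heading` and `chunks_with_heading` are hypothetical names, and the sketch assumes ATX headings and does not treat fenced code specially.

```python
import re
from typing import List, Tuple

HEADING_RE = re.compile(r'^#{1,6}\s+.+$')

def split_by_heading(md_text: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) pairs; text before the first heading gets ''."""
    pairs = []
    heading, body = '', []
    for line in md_text.split('\n'):
        if HEADING_RE.match(line):
            # Close the previous section if it has a heading or any text
            if heading or any(l.strip() for l in body):
                pairs.append((heading, '\n'.join(body).strip()))
            heading, body = line.strip(), []
        else:
            body.append(line)
    if heading or any(l.strip() for l in body):
        pairs.append((heading, '\n'.join(body).strip()))
    return pairs

def chunks_with_heading(md_text: str, max_chunk_size: int = 1500) -> List[str]:
    """Chunk each section separately and prefix every chunk with its heading."""
    chunks = []
    for heading, body in split_by_heading(md_text):
        current, size = [], 0
        for para in filter(None, (p.strip() for p in body.split('\n\n'))):
            # Flush before the size limit is exceeded
            if current and size + len(para) > max_chunk_size:
                chunks.append('\n\n'.join(filter(None, [heading] + current)))
                current, size = [], 0
            current.append(para)
            size += len(para) + 2
        if current or heading:
            chunks.append('\n\n'.join(filter(None, [heading] + current)))
    return chunks

doc = "# A\n\nalpha\n\nbeta\n\n## B\n\ngamma"
for chunk in chunks_with_heading(doc, max_chunk_size=8):
    print(repr(chunk))
```

Because no chunk spans two sections, a retrieved chunk always carries the heading it belongs to, at the cost of some smaller chunks at section ends.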
4. Document Analysis Pipeline
import re
from dataclasses import dataclass
from typing import Any, Dict, List

from markitdown import MarkItDown

# Section and extract_sections come from "2. Structured Section Extraction"

@dataclass
class DocumentAnalysis:
    """Complete document analysis for AGENT consumption."""
    file_path: str
    file_type: str
    word_count: int
    char_count: int
    section_count: int
    heading_levels: Dict[int, int]  # level -> count
    has_tables: bool
    has_images: bool
    has_links: bool
    estimated_reading_time: int  # minutes
    summary: str
    keywords: List[str]
    metadata: Dict[str, Any]

class DocumentAnalyzer:
    """Analyze documents for AGENT workflows."""

    # Common words excluded from keyword counts
    STOP_WORDS = {'would', 'could', 'should', 'there', 'their', 'about'}

    def __init__(self):
        self.md = MarkItDown()

    def analyze(self, file_path: str) -> DocumentAnalysis:
        """Complete document analysis."""
        # Convert to Markdown
        result = self.md.convert(file_path)
        text = result.text_content

        # Basic stats
        word_count = len(text.split())
        char_count = len(text)

        # Section analysis
        sections = extract_sections(text)
        heading_levels = {}
        for s in sections:
            heading_levels[s.level] = heading_levels.get(s.level, 0) + 1

        # Feature detection (heuristic)
        has_tables = '|' in text and '---' in text
        has_images = '![' in text
        has_links = bool(re.search(r'https?://|\[[^\]]+\]\([^)]+\)', text))

        # Reading time in minutes (average 200 wpm)
        reading_time = max(1, word_count // 200)

        # Extract keywords (simple frequency count of words with 5+ letters)
        word_freq = {}
        for w in re.findall(r'\b[A-Za-z][a-z]{4,}\b', text):
            w = w.lower()
            if w not in self.STOP_WORDS:
                word_freq[w] = word_freq.get(w, 0) + 1
        keywords = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)[:10]

        # Generate summary
        summary = self._generate_summary(text, sections)

        return DocumentAnalysis(
            file_path=file_path,
            file_type=file_path.split('.')[-1].lower(),
            word_count=word_count,
            char_count=char_count,
            section_count=len(sections),
            heading_levels=heading_levels,
            has_tables=has_tables,
            has_images=has_images,
            has_links=has_links,
            estimated_reading_time=reading_time,
            summary=summary,
            keywords=[k[0] for k in keywords],
            metadata={'title': result.title},  # the result object carries a title, not a metadata dict
        )

    def _generate_summary(self, text: str, sections: List[Section]) -> str:
        """Generate a brief document summary."""
        if not sections:
            return text[:500] + "..." if len(text) > 500 else text
        # Summarize by listing the top-level sections
        main_sections = [s for s in sections if s.level <= 2]
        if main_sections:
            section_list = ", ".join([s.title for s in main_sections[:5]])
            return f"Document covers: {section_list}"
        return text[:300] + "..." if len(text) > 300 else text
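An analysis result is usually handed to an agent as structured data. The sketch below shows one way to serialize a dataclass such as `DocumentAnalysis` to JSON with `dataclasses.asdict`. `AnalysisSummary` and `to_json` are hypothetical names; the trimmed dataclass stands in for the full one so the example runs on its own.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List

@dataclass
class AnalysisSummary:
    """Trimmed, hypothetical stand-in for DocumentAnalysis."""
    file_path: str
    word_count: int
    heading_levels: Dict[int, int] = field(default_factory=dict)
    keywords: List[str] = field(default_factory=list)

def to_json(analysis) -> str:
    """Serialize any dataclass instance to a JSON string."""
    return json.dumps(asdict(analysis), ensure_ascii=False, indent=2)

summary = AnalysisSummary("report.pdf", 1200, {1: 1, 2: 4}, ["markdown", "convert"])
payload = to_json(summary)
print(payload)
```

JSON object keys are always strings, so the integer heading levels come back as `"1"` and `"2"` after a round trip; convert them back with `int()` if the consumer needs numeric levels.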
Integration with OpenClaw
Skill Definition
# SKILL.md
name: doc2markdown
metadata:
emoji: 📝
requires:
python_packages: [markitdown]
install:
- command: pip3 install 'markitdown[all]'
Tool Usage
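# In an OpenClaw agent
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("/path/to/document.pdf")
# Use in a RAG pipeline: chunk_for_rag is defined in "3. RAG-Optimized Chunking";
# store_in_vector_db is a placeholder for your own vector store client
chunks = chunk_for_rag(result.text_content)
store_in_vector_db(chunks)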
Performance Tips
- Batch Processing: Process multiple files in parallel
- Memory Management: For large documents, use streaming
- Caching: Cache converted documents to avoid reprocessing
- Selective Conversion: Only convert needed sections
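The batch-processing and caching tips can be combined. The following is a minimal sketch, not part of the skill itself: it keys a cache on the file's content hash and converts several files concurrently with a thread pool. The names `file_digest`, `convert_cached`, and `convert_many` are hypothetical, and the conversion function is passed in as a parameter so the sketch does not depend on any particular converter.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable

def file_digest(path: Path) -> str:
    """Hash file contents so the cache key changes when the file changes."""
    h = hashlib.sha256()
    with path.open('rb') as fh:
        # Read in 1 MiB blocks to keep memory use flat for large files
        for block in iter(lambda: fh.read(1 << 20), b''):
            h.update(block)
    return h.hexdigest()

def convert_cached(path: Path, cache_dir: Path, convert: Callable[[Path], str]) -> str:
    """Return cached Markdown for `path`, converting only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / f"{file_digest(path)}.md"
    if cached.exists():
        return cached.read_text(encoding='utf-8')
    markdown = convert(path)
    cached.write_text(markdown, encoding='utf-8')
    return markdown

def convert_many(paths: Iterable[Path], cache_dir: Path,
                 convert: Callable[[Path], str], workers: int = 4) -> Dict[Path, str]:
    """Convert several files concurrently, reusing cached results."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: convert_cached(p, cache_dir, convert), paths)
        return dict(zip(paths, results))
```

With MarkItDown, the `convert` argument could be something like `lambda p: MarkItDown().convert(str(p)).text_content`. Creating one converter per call is a cautious assumption here, since the source does not say whether a shared instance is safe to use across threads.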
Security Considerations
- Sanitize file paths before processing
- Validate file types to prevent injection
- Handle sensitive content appropriately
- Consider file size limits to prevent DoS
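The path, file-type, and size checks above can be sketched as a single gate that runs before any conversion. This is an illustration under stated assumptions: `validate_input`, `ALLOWED_SUFFIXES`, and `MAX_BYTES` are hypothetical names, the suffix list and the 50 MiB limit are placeholders to tune, and a suffix check alone does not prove what a file actually contains.

```python
from pathlib import Path

# Illustrative allow-list and size limit; adjust for your deployment
ALLOWED_SUFFIXES = {'.pdf', '.docx', '.pptx', '.xlsx', '.epub',
                    '.html', '.htm', '.csv', '.json', '.xml'}
MAX_BYTES = 50 * 1024 * 1024

def validate_input(path_str: str, base_dir: str) -> Path:
    """Resolve path_str under base_dir and reject unsafe or unsupported files."""
    base = Path(base_dir).resolve()
    path = (base / path_str).resolve()

    # Reject paths that escape the base directory (e.g. via '..' or symlinks)
    try:
        path.relative_to(base)
    except ValueError:
        raise ValueError(f"path escapes base directory: {path_str}") from None

    if not path.is_file():
        raise ValueError(f"not a regular file: {path_str}")
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported file type: {path.suffix or '(none)'}")
    if path.stat().st_size > MAX_BYTES:
        raise ValueError(f"file exceeds {MAX_BYTES} bytes")
    return path
```

Resolving the path before the containment check means symlinks and `..` segments are normalized first, so the comparison is made on the real location of the file.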
Troubleshooting
| Issue | Solution |
|---|---|
| ImportError | Run pip install 'markitdown[all]' |
| OCR fails | Install Tesseract: apt install tesseract-ocr |
| Audio fails | Install ffmpeg: apt install ffmpeg |
| Memory error | Process in smaller chunks |
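Before debugging OCR or audio failures, it can help to confirm that the external tools named in the table are actually on PATH. A minimal sketch using the standard library's `shutil.which`; `locate_tools` is a hypothetical helper name, and the default tool list simply mirrors the table above.

```python
import shutil
from typing import Dict, Iterable, Optional

def locate_tools(names: Iterable[str] = ('tesseract', 'ffmpeg')) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path on PATH, or None if not found."""
    return {name: shutil.which(name) for name in names}

for tool, location in locate_tools().items():
    print(f"{tool}: {location or 'NOT FOUND'}")
```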
License
MIT License - See LICENSE file for details.
Security Recommendations
This skill largely does what it says, but take these precautions before installing or running it:
1. Resolve the metadata mismatch: the registry lists no requirements, but SKILL.md expects python3, pip3, and markitdown. Treat SKILL.md as authoritative.
2. Run pip installs in an isolated environment (virtualenv or container) so that package install scripts and native dependencies don't affect your system.
3. Audit the markitdown package (PyPI/GitHub) before installing: check its dependencies, whether it requires system binaries (ffmpeg, tesseract, yt-dlp), and what post-install actions it performs.
4. Avoid running the batch conversion commands in directories that contain sensitive data. The find-based example scans recursively and converts every matching file beneath the current directory.
5. Expect network access for the YouTube and transcription features. If you need to restrict network or resource usage, run the skill in a constrained environment.
For higher assurance, ask the skill author to update the registry metadata to declare the required system binaries and to provide provenance (the official package name on PyPI and a verified homepage or repository).
Functional Analysis
Type: OpenClaw Skill
Name: everything2markdown
Version: 1.0.0
The skill bundle provides a comprehensive wrapper and documentation for the Microsoft MarkItDown library, intended for converting various document formats to Markdown for LLM processing. All code snippets in SKILL.md are legitimate utility functions for text preprocessing, RAG chunking, and document analysis, and the installation process uses the official 'markitdown' package.
Capability Assessment
Purpose & Capability
The stated purpose (convert PDFs, Office files, images, audio, YouTube, etc. to Markdown) matches the instructions and examples which call out the markitdown package and CLI. However the registry metadata lists no required binaries or packages while the SKILL.md itself declares python3/pip3 and markitdown (and an install command). This metadata mismatch is an incoherence that should be resolved. Also some features (OCR, audio transcription, YouTube extraction) commonly rely on system binaries (e.g., ffmpeg, tesseract, yt-dlp) that are not declared, which may cause surprising runtime failures or require additional privileges.
Instruction Scope
SKILL.md instructs the agent to install markitdown via pip and to run markitdown on files, including batch loops and find-based scans. That behavior is consistent with the skill's purpose, but the batch examples will iterate through and convert any matching files in the agent's working directories — potentially processing sensitive local files if the agent is run in a home or repo directory. The instructions also implicitly require network access for YouTube extraction and possibly for transcription models; they do not request or access secrets. No instructions request unrelated system files or credentials.
Install Mechanism
The SKILL.md contains an install block that runs pip3 install 'markitdown[all]'. As an instruction-only skill, the agent may execute that command at runtime. Installing from PyPI is common, but pip installs execute arbitrary package code and can pull many transitive dependencies. The install is not a download from an unknown/personal URL, which lowers risk, but the '[all]' extras likely bring heavy native and optional deps and may require system binaries not declared in the registry metadata.
Credentials
The skill does not request environment variables, credentials, or config paths. SKILL.md's metadata also lists env: []. There is no request for unrelated secrets or tokens.
Persistence & Privilege
always is false and there is no install spec in the registry (the SKILL.md contains an optional install command). The skill does not request persistent elevation or modify other skills. Autonomous invocation is allowed (default), which is normal; this combined with other notes increases the need for caution but is not itself a problem.
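How to Use
- Make sure OpenClaw is installed (locally or via Docker).
- Enter the install command in the chat box: /install everything2markdown
- Once installed, invoke the Skill by name or trigger it with /everything2markdown
- Provide the inputs described in the Skill's parameter documentation to get structured output.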
Version History
v1.0.0
new
FAQ
What is everything to markdown?
It converts almost anything (PDF, DOCX, PPTX, XLSX, images, audio, YouTube, etc.) to Markdown using Microsoft MarkItDown and is optimized for AGENT and LLM workflows. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 102 downloads to date.
How do I install everything to markdown?
Run the command /install everything2markdown in the OpenClaw or Claude Code chat box for a one-step install; no additional configuration is required.
Is everything to markdown free?
Yes. everything to markdown is completely free and is released under the MIT-0 license; you can download, install, and use it freely.
Which platforms does everything to markdown support?
everything to markdown is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.
Who develops everything to markdown?
It is developed and maintained by xsm0826 (@xsm0826). The current version is v1.0.0.