Description

Convert almost anything (PDF, DOCX, PPTX, XLSX, images, audio, YouTube, etc.) to Markdown using Microsoft MarkItDown. Optimized for AGENT and LLM workflows.

Usage (SKILL.md)

Everything2Markdown - Convert Anything to Markdown

Name: everything to markdown
Author: xsm0826

A powerful document conversion tool based on Microsoft MarkItDown, specifically optimized for AGENT and LLM workflows.

Core Features

✅ Universal Support: PDF, DOCX, PPTX, XLSX, EPUB, HTML, CSV, JSON, XML
✅ Rich Media: Image OCR, audio transcription, YouTube subtitle extraction
✅ Structure Preservation: Headings, tables, lists, links maintained
✅ Metadata Extraction: Author, creation date, page count, etc.
✅ AGENT Optimized: Clean Markdown output perfect for LLM processing

Supported Formats

Format	Extension	Notes
PDF	.pdf	Full text extraction with structure
Word	.docx, .doc	Preserves headings and tables
PowerPoint	.pptx, .ppt	Slide-by-slide conversion
Excel	.xlsx, .xls	Sheet-to-table conversion
EPUB	.epub	E-book format
HTML	.html, .htm	Web page conversion
Images	.png, .jpg, .gif	OCR text extraction
Audio	.mp3, .wav, .m4a	Speech-to-text
Archives	.zip	Iterates contents
YouTube	URL	Subtitle and metadata

Quick Start

Single File Conversion

markitdown document.pdf -o output.md

Batch Conversion

# Convert all PDFs in directory
for f in *.pdf; do
  markitdown "$f" -o "${f%.pdf}.md"
done

# Or with find (recurses into subdirectories)
find . \( -name "*.pdf" -o -name "*.docx" \) -print | while IFS= read -r f; do
  out="${f%.*}.md"
  markitdown "$f" -o "$out"
  echo "✓ Converted: $f → $out"
done

Advanced Options

# Flag availability varies by MarkItDown version; check `markitdown --help`
# Verbose output
markitdown document.pdf -o output.md --verbose

# Keep intermediate files
markitdown document.pdf -o output.md --keep-temp

# Specify encoding
markitdown document.pdf -o output.md --encoding utf-8

Python API

Basic Usage

from markitdown import MarkItDown

# Initialize
md = MarkItDown()

# Convert file
result = md.convert("document.pdf")

# Access content
print(result.text_content)  # Markdown text
print(getattr(result, "metadata", {}))  # Document metadata, when the installed version provides it

Advanced API

from markitdown import MarkItDown

# Custom configuration
md = MarkItDown(enable_plugins=True)

# MyCustomConverter stands for your own DocumentConverter subclass
md.register_converter(MyCustomConverter())

# Convert with options (keyword arguments a converter does not recognize
# may simply be ignored, so verify these against your installed version)
result = md.convert(
    "document.pdf",
    keep_formatting=True,
    extract_images=True
)

# Access structured data (falls back to an empty dict if no metadata is exposed)
metadata = getattr(result, "metadata", None) or {}
print(f"Title: {metadata.get('title')}")
print(f"Author: {metadata.get('author')}")
print(f"Pages: {metadata.get('pages')}")
print(f"Word count: {len(result.text_content.split())}")

AGENT Workflow Best Practices

1. Document Preprocessing Pipeline

import re
from markitdown import MarkItDown

def preprocess_for_llm(file_path):
    """
    Preprocess document for optimal LLM consumption.
    Cleans noise, normalizes formatting, extracts structure.
    """
    # Convert to markdown
    md = MarkItDown()
    result = md.convert(file_path)
    
    text = result.text_content
    
    # Clean excessive formatting
    text = re.sub(r'\*{4,}', '***', text)  # Limit asterisks
    text = re.sub(r'\-{4,}', '---', text)  # Normalize horizontal rules
    text = re.sub(r'\n{4,}', '\n\n\n', text)  # Limit blank lines
    
    # Normalize headings
    text = re.sub(r'^#{7,}', '######', text, flags=re.MULTILINE)
    
    return {
        'content': text,
        'metadata': getattr(result, 'metadata', {}),  # empty if the result exposes no metadata
        'original_length': len(result.text_content),
        'processed_length': len(text)
    }
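
The cleanup rules can be tried on a small string without converting a file. This is a minimal sketch: `clean_markdown` is a hypothetical helper name (not part of MarkItDown), and it assumes the blank-line rule is meant to collapse runs of four or more newlines down to three.

```python
import re

def clean_markdown(text: str) -> str:
    """Apply the same cleanup rules as preprocess_for_llm to a plain string."""
    text = re.sub(r'\*{4,}', '***', text)       # cap runs of asterisks at three
    text = re.sub(r'-{4,}', '---', text)        # shorten long horizontal rules
    text = re.sub(r'\n{4,}', '\n\n\n', text)    # at most two blank lines in a row
    text = re.sub(r'^#{7,}', '######', text, flags=re.MULTILINE)  # h6 is the deepest heading
    return text

sample = "Title\n\n\n\n\n-------\n******bold\n######## too deep"
cleaned = clean_markdown(sample)
print(cleaned)
```

Keeping the rules in a pure function like this makes them easy to unit-test separately from the conversion step.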

2. Structured Section Extraction

import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Section:
    level: int
    title: str
    content: str
    start_line: int
    end_line: int

def extract_sections(md_text: str) -> List[Section]:
    """
    Extract hierarchical sections from Markdown.
    Preserves document structure for AGENT processing.
    """
    lines = md_text.split('\n')
    sections = []
    
    # Find all headers
    header_pattern = re.compile(r'^(#{1,6})\s+(.+)$')
    headers = []
    
    for i, line in enumerate(lines):
        match = header_pattern.match(line)
        if match:
            level = len(match.group(1))
            title = match.group(2).strip()
            headers.append({'level': level, 'title': title, 'line': i})
    
    # Extract sections
    for i, header in enumerate(headers):
        start_line = header['line']
        end_line = headers[i + 1]['line'] if i + 1 < len(headers) else len(lines)
        
        content = '\n'.join(lines[start_line + 1:end_line]).strip()
        
        sections.append(Section(
            level=header['level'],
            title=header['title'],
            content=content,
            start_line=start_line,
            end_line=end_line
        ))
    
    return sections
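
`extract_sections` returns a flat list, so a section does not know which headings it sits under. One way to recover that context, sketched below, is to walk the `(level, title)` pairs with a stack. `heading_paths` is a hypothetical helper introduced here for illustration; its input could be built as `[(s.level, s.title) for s in sections]`.

```python
from typing import List, Tuple

def heading_paths(headings: List[Tuple[int, str]]) -> List[str]:
    """Turn (level, title) pairs into 'Parent > Child' breadcrumb strings."""
    stack: List[Tuple[int, str]] = []   # currently open ancestors, shallowest first
    paths = []
    for level, title in headings:
        # Close every heading at the same or a deeper level
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        paths.append(" > ".join(t for _, t in stack))
    return paths

paths = heading_paths([(1, "Guide"), (2, "Install"), (3, "Linux"), (2, "Usage")])
print(paths)
```

Prefixing each section's content with its breadcrumb is one option for giving an LLM the surrounding context of an isolated section.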

3. RAG-Optimized Chunking

from typing import List, Dict
import hashlib

def chunk_for_rag(
    md_text: str,
    max_chunk_size: int = 1500,
    overlap: int = 200,
    preserve_headers: bool = True
) -> List[Dict]:
    """
    Chunk Markdown for optimal RAG retrieval.
    Preserves semantic boundaries and provides rich metadata.
    """
    chunks = []
    current_chunk = []
    current_size = 0
    chunk_index = 0
    
    # Split by natural boundaries
    paragraphs = md_text.split('\n\n')
    
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue
        
        para_size = len(para)
        
        # Check if adding this paragraph exceeds limit
        if current_size + para_size > max_chunk_size and current_chunk:
            # Save current chunk
            chunk_text = '\n\n'.join(current_chunk)
            chunks.append(create_chunk_dict(
                chunk_text, chunk_index, md_text
            ))
            
            # Start new chunk with overlap
            if overlap > 0:
                overlap_text = '\n\n'.join(current_chunk[-2:]) if len(current_chunk) >= 2 else chunk_text[-overlap:]
                current_chunk = [overlap_text, para]
                current_size = len(overlap_text) + para_size
            else:
                current_chunk = [para]
                current_size = para_size
            
            chunk_index += 1
        else:
            current_chunk.append(para)
            current_size += para_size + 2  # +2 for newlines
    
    # Don't forget the last chunk
    if current_chunk:
        chunk_text = '\n\n'.join(current_chunk)
        chunks.append(create_chunk_dict(
            chunk_text, chunk_index, md_text
        ))
    
    return chunks

def create_chunk_dict(text: str, index: int, source: str) -> Dict:
    """Create a chunk dictionary with metadata."""
    # Approximate the chunk's position by locating its first paragraph in the
    # source; fall back to 0 when it cannot be found verbatim.
    start = max(source.find(text.split('\n\n')[0]), 0)
    return {
        'index': index,
        'text': text,
        'length': len(text),
        'hash': hashlib.md5(text.encode()).hexdigest()[:8],
        'source_length': len(source),
        'percent_start': round(start / len(source) * 100, 1) if source else 0.0
    }
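
The `preserve_headers` parameter is accepted by `chunk_for_rag` but never used in its body. One possible reading of it, sketched here as an assumption rather than the author's intent, is to record the nearest preceding heading for each paragraph so that it can travel with the chunk. `tag_paragraphs_with_heading` is a hypothetical name.

```python
import re
from typing import Dict, List

HEADING = re.compile(r'^(#{1,6})\s+(.+)$')

def tag_paragraphs_with_heading(md_text: str) -> List[Dict[str, str]]:
    """Pair each paragraph with the most recent heading that precedes it."""
    tagged = []
    current_heading = ""          # empty until the first heading is seen
    for para in md_text.split('\n\n'):
        para = para.strip()
        if not para:
            continue
        # Only a paragraph that is a single heading line matches; a heading
        # followed directly by text (no blank line) stays inside its paragraph.
        match = HEADING.match(para)
        if match:
            current_heading = match.group(2).strip()
            continue
        tagged.append({'heading': current_heading, 'text': para})
    return tagged

doc = "# Intro\n\nFirst para.\n\n## Setup\n\nInstall it.\n\nRun it."
tagged = tag_paragraphs_with_heading(doc)
print(tagged)
```

The heading could then be stored as chunk metadata or prepended to the chunk text before embedding.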

4. Document Analysis Pipeline

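import re
from typing import Any, Dict, List
from dataclasses import dataclass, asdict

from markitdown import MarkItDown
# Section and extract_sections come from "2. Structured Section Extraction" above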

@dataclass
class DocumentAnalysis:
    """Complete document analysis for AGENT consumption."""
    file_path: str
    file_type: str
    word_count: int
    char_count: int
    section_count: int
    heading_levels: Dict[int, int]  # level -> count
    has_tables: bool
    has_images: bool
    has_links: bool
    estimated_reading_time: int  # minutes
    summary: str
    keywords: list
    metadata: Dict[str, Any]

class DocumentAnalyzer:
    """Analyze documents for AGENT workflows."""
    
    def __init__(self):
        self.md = MarkItDown()
    
    def analyze(self, file_path: str) -> DocumentAnalysis:
        """Complete document analysis."""
        # Convert to markdown
        result = self.md.convert(file_path)
        text = result.text_content
        
        # Basic stats
        word_count = len(text.split())
        char_count = len(text)
        
        # Section analysis
        sections = extract_sections(text)
        section_count = len(sections)
        
        # Heading levels
        heading_levels = {}
        for s in sections:
            heading_levels[s.level] = heading_levels.get(s.level, 0) + 1
        
        # Features detection
        has_tables = '|' in text and '---' in text
        has_images = '![' in text
        has_links = 'http' in text or '[' in text
        
        # Reading time (average 200 wpm)
        reading_time = max(1, word_count // 200)
        
        # Extract keywords (simple approach)
        words = re.findall(r'\b[A-Za-z][a-z]{4,}\b', text)
        word_freq = {}
        for w in words:
            w = w.lower()
            if w not in ['would', 'could', 'should', 'there', 'their', 'about']:
                word_freq[w] = word_freq.get(w, 0) + 1
        keywords = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)[:10]
        
        # Generate summary
        summary = self._generate_summary(text, sections)
        
        return DocumentAnalysis(
            file_path=file_path,
            file_type=file_path.split('.')[-1].lower(),
            word_count=word_count,
            char_count=char_count,
            section_count=section_count,
            heading_levels=heading_levels,
            has_tables=has_tables,
            has_images=has_images,
            has_links=has_links,
            estimated_reading_time=reading_time,
            summary=summary,
            keywords=[k[0] for k in keywords],
            metadata=getattr(result, 'metadata', {})  # empty if the result exposes no metadata
        )
    
    def _generate_summary(self, text: str, sections: List[Section]) -> str:
        """Generate a brief document summary."""
        if not sections:
            return text[:500] + "..." if len(text) > 500 else text
        
        # Get main sections
        main_sections = [s for s in sections if s.level <= 2]
        if main_sections:
            section_list = ", ".join([s.title for s in main_sections[:5]])
            return f"Document covers: {section_list}"
        
        return text[:300] + "..." if len(text) > 300 else text
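
The pipeline imports `asdict` without using it; a likely purpose is handing the analysis to an agent as JSON. A minimal sketch follows, using a cut-down stand-in dataclass (`MiniAnalysis`, hypothetical) so it runs without MarkItDown installed. `to_json` is likewise an illustrative name.

```python
import json
from dataclasses import dataclass, asdict
from typing import Dict, List

@dataclass
class MiniAnalysis:
    """Cut-down stand-in for DocumentAnalysis, for illustration only."""
    file_path: str
    word_count: int
    heading_levels: Dict[int, int]
    keywords: List[str]

def to_json(analysis) -> str:
    """Serialize a dataclass instance to JSON; default=str covers values such as dates."""
    return json.dumps(asdict(analysis), ensure_ascii=False, default=str)

report = MiniAnalysis("report.pdf", 1200, {1: 1, 2: 4}, ["markdown", "convert"])
payload = to_json(report)
print(payload)
```

Note that JSON object keys are always strings, so the integer keys of `heading_levels` come back as `"1"` and `"2"` after a round trip.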

Integration with OpenClaw

Skill Definition

# SKILL.md
name: doc2markdown
metadata:
  emoji: 📝
  requires:
    python_packages: [markitdown]
  install:
    - command: pip3 install 'markitdown[all]'

Tool Usage

# In OpenClaw agent
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("/path/to/document.pdf")

# Use in RAG pipeline
chunks = chunk_for_rag(result.text_content)
store_in_vector_db(chunks)

Performance Tips

Batch Processing: Process multiple files in parallel
Memory Management: For large documents, use streaming
Caching: Cache converted documents to avoid reprocessing
Selective Conversion: Only convert needed sections
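
The caching and batch tips can be combined in a small wrapper. The sketch below takes the converter as a plain callable so it does not depend on any particular MarkItDown API; `convert_cached`, `convert_many`, and the cache layout (one `.md` file per SHA-256 of the input bytes) are assumptions made for illustration.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable

def convert_cached(path: Path, convert: Callable[[Path], str], cache_dir: Path) -> str:
    """Return cached Markdown for a file, converting only when its bytes are new."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    cached = cache_dir / f"{digest}.md"
    if cached.exists():
        return cached.read_text(encoding="utf-8")
    markdown = convert(path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached.write_text(markdown, encoding="utf-8")
    return markdown

def convert_many(paths: Iterable[Path], convert: Callable[[Path], str],
                 cache_dir: Path, workers: int = 4) -> Dict[Path, str]:
    """Convert several files concurrently, reusing the cache for repeats."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: convert_cached(p, convert, cache_dir), paths)
        return dict(zip(paths, results))
```

With MarkItDown, `convert` could be something like `lambda p: MarkItDown().convert(str(p)).text_content`. Whether a single `MarkItDown` instance can safely be shared across threads is not stated in the source, so creating one per call is the cautious choice. Hashing reads the whole file into memory, which may not suit very large inputs.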

Security Considerations

Sanitize file paths before processing
Validate file types to prevent injection
Handle sensitive content appropriately
Consider file size limits to prevent DoS
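
These checks can be gathered into one gatekeeper function that runs before any conversion. The following is a sketch under stated assumptions: `validate_input`, the suffix allowlist, and the 50 MiB cap are illustrative choices, not limits imposed by MarkItDown.

```python
from pathlib import Path

ALLOWED_SUFFIXES = {".pdf", ".docx", ".pptx", ".xlsx", ".epub", ".html", ".csv"}
MAX_BYTES = 50 * 1024 * 1024  # illustrative 50 MiB cap, not a MarkItDown limit

def validate_input(path: str, root: str) -> Path:
    """Return the resolved path if it is safe to convert, else raise ValueError."""
    base = Path(root).resolve()
    candidate = (base / path).resolve()      # resolves '..' and symlinks
    try:
        candidate.relative_to(base)          # raises ValueError if outside base
    except ValueError:
        raise ValueError(f"path escapes the allowed directory: {path}")
    if candidate.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported file type: {candidate.suffix}")
    if not candidate.is_file():
        raise ValueError(f"not a regular file: {path}")
    if candidate.stat().st_size > MAX_BYTES:
        raise ValueError(f"file exceeds {MAX_BYTES} bytes")
    return candidate
```

A suffix check only filters by name; it does not prove the file's actual content type, so it complements rather than replaces running the converter in a restricted environment.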

Troubleshooting

Issue	Solution
ImportError	Ensure `pip install 'markitdown[all]'`
OCR fails	Install Tesseract: `apt install tesseract-ocr`
Audio fails	Install ffmpeg: `apt install ffmpeg`
Memory error	Process in smaller chunks
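
Missing system programs can be detected up front instead of at conversion time. This sketch checks `PATH` with the standard library; the feature-to-program mapping mirrors the table above, and `missing_binaries` is a hypothetical helper name.

```python
import shutil
from typing import Dict, List, Optional

# Feature -> external programs, taken from the troubleshooting table
FEATURE_BINARIES = {"ocr": ["tesseract"], "audio": ["ffmpeg"]}

def missing_binaries(features: Optional[Dict[str, List[str]]] = None) -> Dict[str, List[str]]:
    """Map each feature to the required programs that are not on PATH."""
    if features is None:
        features = FEATURE_BINARIES
    missing = {}
    for feature, programs in features.items():
        absent = [p for p in programs if shutil.which(p) is None]
        if absent:
            missing[feature] = absent
    return missing

print(missing_binaries())
```

An empty dict means every listed program was found; otherwise the result names what to install for each feature.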

License

MIT License - See LICENSE file for details.

Security Recommendations

This skill largely does what it says, but take these precautions before installing or running it:

1) Resolve the metadata mismatch: the registry lists no requirements, but SKILL.md expects python3, pip3 and markitdown. Treat SKILL.md as authoritative.
2) Run pip installs in an isolated environment (virtualenv, container) so package install scripts and native dependencies don't affect your system.
3) Audit the markitdown package (PyPI/GitHub) before installing: check its dependencies, whether it requires system binaries (ffmpeg, tesseract, yt-dlp), and what post-install actions it performs.
4) Avoid running the batch conversion commands in directories that contain sensitive data. The find-based example scans recursively and converts every matching file under the working directory.
5) Expect network access for YouTube and transcription features; if you need to restrict network or resource usage, run in a constrained environment.

If you want higher assurance, ask the skill author to update the registry metadata to declare required system binaries and to provide provenance (official package name on PyPI and a verified homepage/repository).

Functional Analysis

Type: OpenClaw Skill
Name: everything2markdown
Version: 1.0.0

The skill bundle provides a comprehensive wrapper and documentation for the Microsoft MarkItDown library, intended for converting various document formats to Markdown for LLM processing. All code snippets in SKILL.md are legitimate utility functions for text preprocessing, RAG chunking, and document analysis, and the installation process uses the official 'markitdown' package.

Capability Assessment

ℹ Purpose & Capability

The stated purpose (convert PDFs, Office files, images, audio, YouTube, etc. to Markdown) matches the instructions and examples which call out the markitdown package and CLI. However the registry metadata lists no required binaries or packages while the SKILL.md itself declares python3/pip3 and markitdown (and an install command). This metadata mismatch is an incoherence that should be resolved. Also some features (OCR, audio transcription, YouTube extraction) commonly rely on system binaries (e.g., ffmpeg, tesseract, yt-dlp) that are not declared, which may cause surprising runtime failures or require additional privileges.

ℹ Instruction Scope

SKILL.md instructs the agent to install markitdown via pip and to run markitdown on files, including batch loops and find-based scans. That behavior is consistent with the skill's purpose, but the batch examples will iterate through and convert any matching files in the agent's working directories — potentially processing sensitive local files if the agent is run in a home or repo directory. The instructions also implicitly require network access for YouTube extraction and possibly for transcription models; they do not request or access secrets. No instructions request unrelated system files or credentials.

ℹ Install Mechanism

The SKILL.md contains an install block that runs pip3 install 'markitdown[all]'. As an instruction-only skill, the agent may execute that command at runtime. Installing from PyPI is common, but pip installs execute arbitrary package code and can pull many transitive dependencies. The install is not a download from an unknown/personal URL, which lowers risk, but the '[all]' extras likely bring heavy native and optional deps and may require system binaries not declared in the registry metadata.

✓ Credentials

The skill does not request environment variables, credentials, or config paths. SKILL.md's metadata also lists env: []. There is no request for unrelated secrets or tokens.

✓ Persistence & Privilege

always is false and there is no install spec in the registry (the SKILL.md contains an optional install command). The skill does not request persistent elevation or modify other skills. Autonomous invocation is allowed (default), which is normal; this combined with other notes increases the need for caution but is not itself a problem.

Version History

v1.0.0

new

Metadata

Slug everything2markdown

Version 1.0.0

License MIT-0

Total installs 0

Current installs 0

Number of released versions 1

FAQ

What is everything to markdown?

Convert almost anything (PDF, DOCX, PPTX, XLSX, images, audio, YouTube, etc.) to Markdown using Microsoft MarkItDown. Optimized for AGENT and LLM workflows. It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 102 cumulative downloads to date.

How do I install everything to markdown?

Run the command "/install everything2markdown" in the OpenClaw or Claude Code chat box to install it in one step; no additional configuration is required.

Is everything to markdown free?

Yes. everything to markdown is completely free and released under the MIT-0 license; you can download, install, and use it freely.

Which platforms does everything to markdown support?

everything to markdown is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.

Who develops everything to markdown?

It is developed and maintained by xsm0826 (@xsm0826); the current version is v1.0.0.
