everything to markdown

by xsm0826 · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ⚠ suspicious
102 Downloads · 0 Stars · 0 Active Installs · 1 Version
Install in OpenClaw
/install everything2markdown
Description
Convert almost anything (PDF, DOCX, PPTX, XLSX, images, audio, YouTube, etc.) to Markdown using Microsoft MarkItDown. Optimized for AGENT and LLM workflows.
README (SKILL.md)

Everything2Markdown - Convert Anything to Markdown

A powerful document conversion tool based on Microsoft MarkItDown, specifically optimized for AGENT and LLM workflows.

Core Features

  • Universal Support: PDF, DOCX, PPTX, XLSX, EPUB, HTML, CSV, JSON, XML
  • Rich Media: Image OCR, audio transcription, YouTube subtitle extraction
  • Structure Preservation: Headings, tables, lists, links maintained
  • Metadata Extraction: Author, creation date, page count, etc.
  • AGENT Optimized: Clean Markdown output perfect for LLM processing

Supported Formats

| Format | Extension | Notes |
|---|---|---|
| PDF | .pdf | Full text extraction with structure |
| Word | .docx, .doc | Preserves headings and tables |
| PowerPoint | .pptx, .ppt | Slide-by-slide conversion |
| Excel | .xlsx, .xls | Sheet-to-table conversion |
| EPUB | .epub | E-book format |
| HTML | .html, .htm | Web page conversion |
| Images | .png, .jpg, .gif | OCR text extraction |
| Audio | .mp3, .wav, .m4a | Speech-to-text |
| Archives | .zip | Iterates contents |
| YouTube | URL | Subtitle and metadata |
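
The table above can double as an allowlist when an agent decides whether a file is worth handing to the converter. The following is a minimal sketch, assuming the extensions listed in this README match your installed MarkItDown version; `SUPPORTED_EXTENSIONS` and `is_supported` are illustrative names, not part of MarkItDown.

```python
from pathlib import Path

# Extensions copied from the table above (assumption: adjust for your setup).
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".pptx", ".ppt", ".xlsx", ".xls",
    ".epub", ".html", ".htm", ".png", ".jpg", ".gif",
    ".mp3", ".wav", ".m4a", ".zip",
}


def is_supported(path: str) -> bool:
    """Return True if the file extension appears in the table (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported("Report.PDF"))
print(is_supported("notes.txt"))
```

Only the final suffix is checked, so a name such as `archive.tar.gz` is treated as `.gz` and rejected.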

Quick Start

Single File Conversion

markitdown document.pdf -o output.md

Batch Conversion

# Convert all PDFs in directory
for f in *.pdf; do
  markitdown "$f" -o "${f%.pdf}.md"
done

# Or with find
find . \( -name "*.pdf" -o -name "*.docx" \) -print | while IFS= read -r f; do
  out="${f%.*}.md"
  markitdown "$f" -o "$out"
  echo "✓ Converted: $f → $out"
done
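
The same batch plan can be computed in Python before anything is converted, which makes it easier to review exactly which files would be touched. This is a sketch under stated assumptions: `plan_batch` is a hypothetical helper, and it only lists source and output paths without calling MarkItDown.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def plan_batch(
    root: str, extensions: Iterable[str] = (".pdf", ".docx")
) -> List[Tuple[Path, Path]]:
    """Return (source, output) path pairs for every matching file under root."""
    wanted = {ext.lower() for ext in extensions}
    pairs = []
    for src in sorted(Path(root).rglob("*")):
        if src.is_file() and src.suffix.lower() in wanted:
            # Same naming rule as the shell loop: swap the extension for .md
            pairs.append((src, src.with_suffix(".md")))
    return pairs


if __name__ == "__main__":
    # Dry run: print the plan for the current directory
    for src, out in plan_batch("."):
        print(f"{src} -> {out}")
```

Each pair can then be passed to `markitdown "$src" -o "$out"` or to `MarkItDown().convert(...)` once the list looks right.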

Advanced Options

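# The flags below are not verified against the current MarkItDown CLI;
# confirm each one with `markitdown --help` before relying on it.

# Verbose output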
markitdown document.pdf -o output.md --verbose

# Keep intermediate files
markitdown document.pdf -o output.md --keep-temp

# Specify encoding
markitdown document.pdf -o output.md --encoding utf-8

Python API

Basic Usage

from markitdown import MarkItDown

# Initialize
md = MarkItDown()

# Convert file
result = md.convert("document.pdf")

# Access content
print(result.text_content)  # Markdown text
print(getattr(result, "metadata", None))  # Document metadata, if this MarkItDown version exposes it

Advanced API

from markitdown import MarkItDown
from markitdown.converters import DocumentConverter

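# Custom configuration.
# MyCustomConverter is a placeholder for your own DocumentConverter subclass.
# The custom_converters argument and the keep_formatting / extract_images options
# further down are illustrative; check them against your installed MarkItDown version.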
md = MarkItDown(
    enable_plugins=True,
    custom_converters=[MyCustomConverter()]
)

# Convert with options
result = md.convert(
    "document.pdf",
    keep_formatting=True,
    extract_images=True
)

# Access structured data (metadata may be absent, depending on the MarkItDown version)
meta = getattr(result, "metadata", None) or {}
print(f"Title: {meta.get('title')}")
print(f"Author: {meta.get('author')}")
print(f"Pages: {meta.get('pages')}")
print(f"Word count: {len(result.text_content.split())}")

AGENT Workflow Best Practices

1. Document Preprocessing Pipeline

import re
from markitdown import MarkItDown

def preprocess_for_llm(file_path):
    """
    Preprocess document for optimal LLM consumption.
    Cleans noise, normalizes formatting, extracts structure.
    """
    # Convert to markdown
    md = MarkItDown()
    result = md.convert(file_path)
    
    text = result.text_content
    
    # Clean excessive formatting
    text = re.sub(r'\*{4,}', '***', text)  # Limit asterisks
    text = re.sub(r'\-{4,}', '---', text)  # Normalize horizontal rules
    text = re.sub(r'\n{4,}', '\n\n\n', text)  # Limit blank lines
    
    # Normalize headings
    text = re.sub(r'^#{7,}', '######', text, flags=re.MULTILINE)
    
    return {
        'content': text,
        'metadata': getattr(result, 'metadata', None),  # may be absent, depending on the MarkItDown version
        'original_length': len(result.text_content),
        'processed_length': len(text)
    }
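
To see what the cleanup rules do without installing anything, the four substitutions can be run on an in-memory string. This is a small sketch; `clean_markdown` and the sample text are illustrative and not part of the skill.

```python
import re


def clean_markdown(text: str) -> str:
    """Apply the four normalizations from the pipeline to a plain string."""
    text = re.sub(r'\*{4,}', '***', text)       # cap runs of asterisks at three
    text = re.sub(r'-{4,}', '---', text)        # shorten long horizontal rules
    text = re.sub(r'\n{4,}', '\n\n\n', text)    # cap consecutive newlines at three
    text = re.sub(r'^#{7,}', '######', text, flags=re.MULTILINE)  # max heading depth is 6
    return text


sample = "Title\n\n\n\n\n*****bold*****\n--------\n######## Too deep"
print(clean_markdown(sample))
```

One side effect worth knowing: the horizontal-rule pattern also shortens table separator rows such as `|--------|` to `|---|`, which remains a valid separator but changes the raw text.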

2. Structured Section Extraction

import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Section:
    level: int
    title: str
    content: str
    start_line: int
    end_line: int

def extract_sections(md_text: str) -> List[Section]:
    """
    Extract hierarchical sections from Markdown.
    Preserves document structure for AGENT processing.
    """
    lines = md_text.split('\n')
    sections = []
    
    # Find all headers
    header_pattern = re.compile(r'^(#{1,6})\s+(.+)$')
    headers = []
    
    for i, line in enumerate(lines):
        match = header_pattern.match(line)
        if match:
            level = len(match.group(1))
            title = match.group(2).strip()
            headers.append({'level': level, 'title': title, 'line': i})
    
    # Extract sections
    for i, header in enumerate(headers):
        start_line = header['line']
        end_line = headers[i + 1]['line'] if i + 1 < len(headers) else len(lines)
        
        content = '\n'.join(lines[start_line + 1:end_line]).strip()
        
        sections.append(Section(
            level=header['level'],
            title=header['title'],
            content=content,
            start_line=start_line,
            end_line=end_line
        ))
    
    return sections
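
The header pattern above matches any line that starts with `#`, including comment lines inside fenced code blocks, which converted documents often contain. A possible refinement, sketched here with the hypothetical helper `heading_outline`, tracks whether the scan is inside a fence and skips those lines.

```python
import re
from typing import List, Tuple

HEADER = re.compile(r'^(#{1,6})\s+(.+)$')
FENCE = '`' * 3  # the three-backtick code fence marker


def heading_outline(md_text: str) -> List[Tuple[int, str]]:
    """Return (level, title) pairs, ignoring '#' lines inside fenced code blocks."""
    outline = []
    in_fence = False
    for line in md_text.split('\n'):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a fenced block
            continue
        if in_fence:
            continue
        match = HEADER.match(line)
        if match:
            outline.append((len(match.group(1)), match.group(2).strip()))
    return outline


doc = "# Intro\ntext\n" + FENCE + "\n# not a heading\n" + FENCE + "\n## Details"
print(heading_outline(doc))
```

The sketch handles only backtick fences; tilde fences and indented code blocks would need extra cases.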

3. RAG-Optimized Chunking

from typing import List, Dict
import hashlib

def chunk_for_rag(
    md_text: str,
    max_chunk_size: int = 1500,
    overlap: int = 200,
    preserve_headers: bool = True
) -> List[Dict]:
    """
    Chunk Markdown for optimal RAG retrieval.
    Preserves semantic boundaries and provides rich metadata.
    """
    chunks = []
    current_chunk = []
    current_size = 0
    chunk_index = 0
    
    # Split by natural boundaries
    paragraphs = md_text.split('\n\n')
    
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue
        
        para_size = len(para)
        
        # Check if adding this paragraph exceeds limit
        if current_size + para_size > max_chunk_size and current_chunk:
            # Save current chunk
            chunk_text = '\n\n'.join(current_chunk)
            chunks.append(create_chunk_dict(
                chunk_text, chunk_index, md_text
            ))
            
            # Start new chunk with overlap
            if overlap > 0:
                overlap_text = '\n\n'.join(current_chunk[-2:]) if len(current_chunk) >= 2 else chunk_text[-overlap:]
                current_chunk = [overlap_text, para]
                current_size = len(overlap_text) + para_size
            else:
                current_chunk = [para]
                current_size = para_size
            
            chunk_index += 1
        else:
            current_chunk.append(para)
            current_size += para_size + 2  # +2 for newlines
    
    # Don't forget the last chunk
    if current_chunk:
        chunk_text = '\n\n'.join(current_chunk)
        chunks.append(create_chunk_dict(
            chunk_text, chunk_index, md_text
        ))
    
    return chunks

def create_chunk_dict(text: str, index: int, source: str) -> Dict:
    """Create a chunk dictionary with metadata."""
    return {
        'index': index,
        'text': text,
        'length': len(text),
        'hash': hashlib.md5(text.encode()).hexdigest()[:8],
        'source_length': len(source),
        # Approximate start position in the source, as a percentage (0.0 if not found)
        'percent_start': round(max(source.find(text[:50]), 0) / max(len(source), 1) * 100, 1)
    }
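
The `preserve_headers` parameter is accepted by `chunk_for_rag` but never used in its body. One way heading context could be carried along is sketched below; this is an assumption about the intent, not the author's implementation, and `chunk_with_headings` is a hypothetical name. It omits overlap to keep the idea visible.

```python
import re
from typing import Dict, List

HEADING = re.compile(r'^#{1,6}\s+\S')


def chunk_with_headings(md_text: str, max_chunk_size: int = 1500) -> List[Dict]:
    """Group paragraphs into chunks and record the heading each chunk starts under."""
    chunks: List[Dict] = []
    current: List[str] = []
    size = 0
    heading = ''        # most recent heading seen so far
    chunk_heading = ''  # heading in effect where the current chunk starts

    for para in re.split(r'\n{2,}', md_text):
        para = para.strip()
        if not para:
            continue
        if HEADING.match(para):
            heading = para.split('\n', 1)[0]
        # Close the current chunk if this paragraph would overflow it
        if current and size + len(para) > max_chunk_size:
            chunks.append({'index': len(chunks), 'heading': chunk_heading,
                           'text': '\n\n'.join(current)})
            current, size = [], 0
        if not current:
            chunk_heading = heading
        current.append(para)
        size += len(para) + 2  # +2 for the blank line that joins paragraphs

    if current:
        chunks.append({'index': len(chunks), 'heading': chunk_heading,
                       'text': '\n\n'.join(current)})
    return chunks
```

Storing the heading as a separate field lets a retriever show or embed it alongside the chunk text without duplicating it inside the text itself.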

4. Document Analysis Pipeline

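import re
from dataclasses import dataclass, asdict
from typing import Any, Dict, List

from markitdown import MarkItDown

# Uses Section and extract_sections from "2. Structured Section Extraction" above.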

@dataclass
class DocumentAnalysis:
    """Complete document analysis for AGENT consumption."""
    file_path: str
    file_type: str
    word_count: int
    char_count: int
    section_count: int
    heading_levels: Dict[int, int]  # level -> count
    has_tables: bool
    has_images: bool
    has_links: bool
    estimated_reading_time: int  # minutes
    summary: str
    keywords: list
    metadata: Dict[str, Any]

class DocumentAnalyzer:
    """Analyze documents for AGENT workflows."""
    
    def __init__(self):
        self.md = MarkItDown()
    
    def analyze(self, file_path: str) -> DocumentAnalysis:
        """Complete document analysis."""
        # Convert to markdown
        result = self.md.convert(file_path)
        text = result.text_content
        
        # Basic stats
        word_count = len(text.split())
        char_count = len(text)
        
        # Section analysis
        sections = extract_sections(text)
        section_count = len(sections)
        
        # Heading levels
        heading_levels = {}
        for s in sections:
            heading_levels[s.level] = heading_levels.get(s.level, 0) + 1
        
        # Features detection
        has_tables = '|' in text and '---' in text
        has_images = '![' in text
        has_links = 'http' in text or '[' in text
        
        # Reading time (average 200 wpm)
        reading_time = max(1, word_count // 200)
        
        # Extract keywords (simple approach)
        words = re.findall(r'\b[A-Za-z][a-z]{4,}\b', text)
        word_freq = {}
        for w in words:
            w = w.lower()
            if w not in ['would', 'could', 'should', 'there', 'their', 'about']:
                word_freq[w] = word_freq.get(w, 0) + 1
        keywords = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)[:10]
        
        # Generate summary
        summary = self._generate_summary(text, sections)
        
        return DocumentAnalysis(
            file_path=file_path,
            file_type=file_path.split('.')[-1].lower(),
            word_count=word_count,
            char_count=char_count,
            section_count=section_count,
            heading_levels=heading_levels,
            has_tables=has_tables,
            has_images=has_images,
            has_links=has_links,
            estimated_reading_time=reading_time,
            summary=summary,
            keywords=[k[0] for k in keywords],
            metadata=getattr(result, 'metadata', None) or {}  # may be absent, depending on the MarkItDown version
        )
    
    def _generate_summary(self, text: str, sections: List[Section]) -> str:
        """Generate a brief document summary."""
        if not sections:
            return text[:500] + "..." if len(text) > 500 else text
        
        # Get main sections
        main_sections = [s for s in sections if s.level <= 2]
        if main_sections:
            section_list = ", ".join([s.title for s in main_sections[:5]])
            return f"Document covers: {section_list}"
        
        return text[:300] + "..." if len(text) > 300 else text
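
The keyword step in `analyze` can be tried on its own. The sketch below swaps the manual frequency dictionary for `collections.Counter` while keeping the same regular expression and stopword list; `top_keywords` is an illustrative name.

```python
import re
from collections import Counter
from typing import List

STOPWORDS = {'would', 'could', 'should', 'there', 'their', 'about'}


def top_keywords(text: str, limit: int = 10) -> List[str]:
    """Return the most frequent words of five or more letters, excluding stopwords."""
    words = (w.lower() for w in re.findall(r'\b[A-Za-z][a-z]{4,}\b', text))
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(limit)]


print(top_keywords("Tables about tables: parsing tables and parsing lists should work."))
# → ['tables', 'parsing', 'lists']
```

Because the pattern requires lowercase letters after the first character, acronyms such as PDF and mixed-case names such as MarkItDown are never counted.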

Integration with OpenClaw

Skill Definition

# SKILL.md
name: doc2markdown
metadata:
  emoji: 📝
  requires:
    python_packages: [markitdown]
  install:
    - command: pip3 install 'markitdown[all]'

Tool Usage

# In OpenClaw agent
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("/path/to/document.pdf")

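# Use in RAG pipeline.
# chunk_for_rag is defined in "3. RAG-Optimized Chunking" above;
# store_in_vector_db is a placeholder for your own vector-store writer.
chunks = chunk_for_rag(result.text_content)
store_in_vector_db(chunks)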

Performance Tips

  1. Batch Processing: Process multiple files in parallel
  2. Memory Management: For large documents, use streaming
  3. Caching: Cache converted documents to avoid reprocessing
  4. Selective Conversion: Only convert needed sections
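
The caching tip can be sketched as a small wrapper keyed by the file's content hash, so an unchanged file is converted only once. This is one possible approach rather than anything the skill ships; `convert_cached`, the cache directory layout, and the injected `convert` callable are assumptions.

```python
import hashlib
from pathlib import Path
from typing import Callable


def convert_cached(path: str, cache_dir: str, convert: Callable[[str], str]) -> str:
    """Return Markdown for path, reusing a cached copy keyed by the file's SHA-256."""
    # Note: read_bytes loads the whole file into memory to hash it
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    cache_file = Path(cache_dir) / f"{digest}.md"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    markdown = convert(path)  # the caller supplies the actual converter
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(markdown, encoding="utf-8")
    return markdown
```

With MarkItDown installed, the converter could be passed as `lambda p: MarkItDown().convert(p).text_content`. Keying on content rather than file name means renamed copies hit the cache and edited files miss it.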

Security Considerations

  • Sanitize file paths before processing
  • Validate file types to prevent injection
  • Handle sensitive content appropriately
  • Consider file size limits to prevent DoS
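
The first, second, and fourth points above can be combined into a single pre-flight check. The following is a sketch under stated assumptions: `validate_input`, the extension allowlist, and the 50 MiB cap are illustrative choices, not values defined by the skill or by MarkItDown.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx"}  # illustrative allowlist
MAX_BYTES = 50 * 1024 * 1024                              # illustrative 50 MiB cap


def validate_input(path: str, base_dir: str) -> Path:
    """Return the resolved path, or raise ValueError if it fails a check."""
    base = Path(base_dir).resolve()
    candidate = (base / path).resolve()
    try:
        # Rejects "../" traversal and absolute paths outside base_dir
        candidate.relative_to(base)
    except ValueError:
        raise ValueError("path escapes the allowed directory") from None
    if candidate.suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {candidate.suffix}")
    if not candidate.is_file():
        raise ValueError("not a regular file")
    if candidate.stat().st_size > MAX_BYTES:
        raise ValueError("file exceeds the size limit")
    return candidate
```

Checking the extension is only a first filter; it does not confirm that the file contents match the claimed type.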

Troubleshooting

| Issue | Solution |
|---|---|
| ImportError | Run pip install 'markitdown[all]' |
| OCR fails | Install Tesseract: apt install tesseract-ocr |
| Audio fails | Install ffmpeg: apt install ffmpeg |
| Memory error | Process in smaller chunks |
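
Before converting images or audio, an agent can check whether the system tools named in the table are on PATH. This is a minimal sketch; `missing_tools` is a hypothetical helper, and the default tool names are taken from the table above.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str] = ("tesseract", "ffmpeg")) -> List[str]:
    """Return the names of tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing system tools:", ", ".join(missing))
```

An empty list means every named tool was found; it does not guarantee the installed versions are compatible.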

License

MIT License - See LICENSE file for details.

Usage Guidance
This skill largely does what it says, but take these precautions before installing or running it:
  1. Resolve the metadata mismatch: the registry lists no requirements, but SKILL.md expects python3, pip3 and markitdown. Treat SKILL.md as authoritative.
  2. Run pip installs in an isolated environment (virtualenv, container) so package install scripts and native dependencies don't affect your system.
  3. Audit the markitdown package (PyPI/GitHub) before installing. Check its dependencies, whether it requires system binaries (ffmpeg, tesseract, yt-dlp), and what post-install actions it performs.
  4. Avoid running batch conversion commands in directories containing sensitive data. The find-based example scans recursively and converts every matching file under the directory it is run from.
  5. Expect network access for YouTube and transcription features. If you need to restrict network or resource usage, run in a constrained environment.
If you want higher assurance, ask the skill author to update the registry metadata to declare required system binaries and to provide provenance (official package name on PyPI and a verified homepage/repository).
Capability Analysis
Type: OpenClaw Skill
Name: everything2markdown
Version: 1.0.0
The skill bundle provides a comprehensive wrapper and documentation for the Microsoft MarkItDown library, intended for converting various document formats to Markdown for LLM processing. All code snippets in SKILL.md are legitimate utility functions for text preprocessing, RAG chunking, and document analysis, and the installation process uses the official 'markitdown' package.
Capability Assessment
Purpose & Capability
The stated purpose (convert PDFs, Office files, images, audio, YouTube, etc. to Markdown) matches the instructions and examples which call out the markitdown package and CLI. However the registry metadata lists no required binaries or packages while the SKILL.md itself declares python3/pip3 and markitdown (and an install command). This metadata mismatch is an incoherence that should be resolved. Also some features (OCR, audio transcription, YouTube extraction) commonly rely on system binaries (e.g., ffmpeg, tesseract, yt-dlp) that are not declared, which may cause surprising runtime failures or require additional privileges.
Instruction Scope
SKILL.md instructs the agent to install markitdown via pip and to run markitdown on files, including batch loops and find-based scans. That behavior is consistent with the skill's purpose, but the batch examples will iterate through and convert any matching files in the agent's working directories — potentially processing sensitive local files if the agent is run in a home or repo directory. The instructions also implicitly require network access for YouTube extraction and possibly for transcription models; they do not request or access secrets. No instructions request unrelated system files or credentials.
Install Mechanism
The SKILL.md contains an install block that runs pip3 install 'markitdown[all]'. As an instruction-only skill, the agent may execute that command at runtime. Installing from PyPI is common, but pip installs execute arbitrary package code and can pull many transitive dependencies. The install is not a download from an unknown/personal URL, which lowers risk, but the '[all]' extras likely bring heavy native and optional deps and may require system binaries not declared in the registry metadata.
Credentials
The skill does not request environment variables, credentials, or config paths. SKILL.md's metadata also lists env: []. There is no request for unrelated secrets or tokens.
Persistence & Privilege
always is false and there is no install spec in the registry (the SKILL.md contains an optional install command). The skill does not request persistent elevation or modify other skills. Autonomous invocation is allowed (default), which is normal; this combined with other notes increases the need for caution but is not itself a problem.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install everything2markdown
  3. After installation, invoke the skill by name or use /everything2markdown
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
new
Metadata
Slug everything2markdown
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is everything to markdown?

Convert almost anything (PDF, DOCX, PPTX, XLSX, images, audio, YouTube, etc.) to Markdown using Microsoft MarkItDown. Optimized for AGENT and LLM workflows. It is an AI Agent Skill for Claude Code / OpenClaw, with 102 downloads so far.

How do I install everything to markdown?

Run "/install everything2markdown" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is everything to markdown free?

Yes, everything to markdown is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does everything to markdown support?

everything to markdown is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created everything to markdown?

It is built and maintained by xsm0826 (@xsm0826); the current version is v1.0.0.
