Office Document Extractor

Name: Office Document Extractor
Author: michealxie001

Version: v1.0.1 · License: MIT-0

cross-platform ✓ Security check passed

Install in OpenClaw

/install office-doc-extractor

Description

Convert Microsoft Office documents (DOCX, XLSX, PPTX) to Markdown without any external dependencies. Use when the user needs to extract text from Word docume...

Usage Instructions (SKILL.md)

Office Document Extractor

Zero-dependency converter for Microsoft Office documents. Extracts text and structure from DOCX, XLSX, and PPTX files into clean Markdown.

Quick Start

# Single file
python3 scripts/main.py report.docx -o report.md

# Batch convert a directory
python3 scripts/main.py ./documents --batch -o ./markdown

Supported Formats

Format	Extension	Output
Word	.docx	Headings, paragraphs
Excel	.xlsx	Tables (one per sheet)
PowerPoint	.pptx	Slides as sections

How It Works

DOCX: Parses the ZIP archive's XML directly using Python's zipfile and xml.etree
XLSX: Uses bundled openpyxl (pure Python, no C extensions)
PPTX: Parses the ZIP archive's slide XML directly

No external commands, no network calls, no pip install required.
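
The DOCX path described above can be illustrated with a minimal sketch. This is not the skill's actual docx_extractor.py; the function name `docx_to_markdown` and the rule that a paragraph style named `Heading<N>` becomes a Markdown heading are assumptions made for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def docx_to_markdown(source):
    """Hypothetical helper: convert a .docx path or file object to Markdown."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Join every text run in the paragraph into one string
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t")).strip()
        if not text:
            continue
        # Assumption: styles named "Heading1", "Heading2", ... mark headings
        style = para.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W_NS}}}val", "") if style is not None else ""
        if style_id.startswith("Heading") and style_id[7:].isdigit():
            level = min(int(style_id[7:]), 6)
            lines.append("#" * level + " " + text)
        else:
            lines.append(text)
    return "\n\n".join(lines)
```

Only `zipfile` and `xml.etree` are used, which matches the standard-library-only approach stated for DOCX. Real documents also contain tables, lists, and localized style names that this sketch ignores.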

Usage

Single File

python3 scripts/main.py <input_file> [-o <output.md>]

Auto-detects format from file extension. If -o is omitted, outputs to <input>.md.
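
Extension-based detection could look roughly like the sketch below. The name `detect_format` and the error message are hypothetical; how main.py actually dispatches is not shown in this listing.

```python
from pathlib import Path

# Extensions taken from the Supported Formats table
FORMATS = {".docx": "docx", ".xlsx": "xlsx", ".pptx": "pptx"}


def detect_format(path):
    """Hypothetical helper: map a file name to 'docx', 'xlsx', or 'pptx'."""
    suffix = Path(path).suffix.lower()  # case-insensitive match (an assumption)
    try:
        return FORMATS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None
```

Matching on the lowercased suffix means `Budget.XLSX` is treated like `budget.xlsx`. The extension alone says nothing about the file's real contents, so a renamed file would still be routed by name.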

Batch Conversion

python3 scripts/main.py <input_directory> --batch [-o <output_directory>]

Converts all .docx, .xlsx, .pptx files in the directory. Results saved to markdown_output/ by default.
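
As a rough illustration of the file-discovery step in batch mode, the sketch below lists the supported files in one directory. `find_office_files` is a hypothetical name, and the choice not to descend into subdirectories is an assumption; the listing does not say whether the real CLI recurses.

```python
from pathlib import Path

# Extensions the skill says it converts in batch mode
SUPPORTED = {".docx", ".xlsx", ".pptx"}


def find_office_files(directory):
    """Hypothetical helper: supported files directly inside `directory`, sorted."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )
```

Listing the candidates before converting makes it easy to review exactly which documents a batch run would touch.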

Resources

scripts/

main.py — Unified CLI for single-file and batch conversion
docx_extractor.py — DOCX → Markdown (standard library only)
xlsx_extractor.py — XLSX → Markdown tables (bundled openpyxl)
pptx_extractor.py — PPTX → Markdown (standard library only)
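
For orientation, a standard-library PPTX pass in the spirit of pptx_extractor.py might resemble the sketch below. It is not the bundled script: `pptx_to_markdown` is a hypothetical name, the `## Slide N` heading format is assumed, and slides are ordered by the number in their file name, which approximates but does not guarantee presentation order.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace that holds slide text runs
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def pptx_to_markdown(source):
    """Hypothetical helper: one Markdown section per slide."""
    sections = []
    with zipfile.ZipFile(source) as archive:
        # Sort numerically so slide10 comes after slide2
        slides = sorted(
            (int(m.group(1)), name)
            for name in archive.namelist()
            if (m := SLIDE_RE.match(name))
        )
        for number, name in slides:
            root = ET.fromstring(archive.read(name))
            lines = []
            for para in root.iter(f"{{{A_NS}}}p"):
                text = "".join(
                    t.text or "" for t in para.iter(f"{{{A_NS}}}t")
                ).strip()
                if text:
                    lines.append(text)
            sections.append("\n\n".join([f"## Slide {number}", *lines]))
    return "\n\n".join(sections)
```

The authoritative slide order lives in ppt/presentation.xml and its relationships file; a fuller implementation would read that instead of relying on file names.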

Bundled Dependencies

openpyxl/ — Pure Python Excel library (v3.1.5)
et_xmlfile/ — openpyxl dependency (pure Python)
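
The bundled openpyxl supplies the cell values; turning them into a Markdown table is a separate step. The sketch below shows one way that step could work without importing openpyxl. The name `rows_to_markdown`, the treatment of the first row as the header, and the escaping rules are assumptions, not the behavior of xlsx_extractor.py.

```python
def rows_to_markdown(rows):
    """Hypothetical helper: render rows (first row = header) as a Markdown table."""
    rows = [list(r) for r in rows]
    if not rows:
        return ""
    width = max(len(r) for r in rows)

    def cell(value):
        # Empty cells become blank; pipes and newlines would break the table
        text = "" if value is None else str(value)
        return text.replace("|", "\\|").replace("\n", " ")

    def line(row):
        # Pad short rows so every line has the same number of columns
        padded = row + [None] * (width - len(row))
        return "| " + " | ".join(cell(v) for v in padded) + " |"

    header, *body = rows
    separator = "| " + " | ".join("---" for _ in range(width)) + " |"
    return "\n".join([line(header), separator, *(line(r) for r in body)])
```

With openpyxl, a worksheet's `iter_rows(values_only=True)` yields tuples of cell values that could be passed straight to a function like this, once per sheet.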

Limitations

Does not extract images or embedded objects (text only)
Does not preserve complex formatting (colors, fonts, layouts)
Does not handle encrypted/password-protected files
No OCR for scanned documents (use OpenClaw's native pdf tool for that)
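
One way to spot files the converter cannot read is to check the container type before parsing. Password-encrypted Office documents are stored in an OLE2 compound file rather than a plain ZIP archive, as are legacy .doc/.xls/.ppt files. The sketch below is illustrative only; `classify_container` is a hypothetical helper, and the listing does not say whether the skill performs such a check.

```python
import zipfile

# First 8 bytes of an OLE2 compound file (encrypted OOXML or legacy binary formats)
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def classify_container(path):
    """Hypothetical helper: return 'zip', 'ole', or 'unknown' for a file."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head == OLE_MAGIC:
        return "ole"      # not plain OOXML; possibly password-protected
    if zipfile.is_zipfile(path):
        return "zip"      # normal DOCX/XLSX/PPTX container
    return "unknown"
```

A result of `ole` does not prove the file is encrypted, only that it is not a ZIP-based document the extractors described here could open.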

Why This Skill?

Existing markitdown-based skills require pip install or external CLI tools, which trigger ClawHub security warnings. This skill is 100% self-contained: install it and use it immediately, even offline.

Security Recommendations

This looks consistent with an offline document converter. Before installing, be comfortable running the bundled Python code, use it only on documents you intend to extract, and remember that the generated Markdown may contain sensitive or untrusted text.

Functional Analysis

Type: OpenClaw Skill
Name: office-doc-extractor
Version: 1.0.1

The office-doc-extractor skill is a functional tool designed to convert DOCX, XLSX, and PPTX files into Markdown. It uses the Python standard library (zipfile, xml.etree) for Word and PowerPoint files and includes a bundled version of the openpyxl library for Excel files to maintain its 'zero-dependency' claim. The code logic in scripts/main.py, scripts/docx_extractor.py, and scripts/xlsx_extractor.py is transparent and strictly aligned with the stated purpose. No evidence of data exfiltration, network activity, or malicious prompt injection was found.

Capability Assessment

ℹ Purpose & Capability

The documented purpose, CLI, and visible source align: it converts DOCX/XLSX/PPTX files to Markdown and writes local output files. Users should note that converted Markdown can contain the full text of private documents.

✓ Instruction Scope

The instructions are user-directed examples for running the converter; there is no evidence of hidden goal changes, forced autonomous execution, or prompt-injection style instructions.

ℹ Install Mechanism

There is no install spec or network download, but the skill is executed as local Python code and bundles openpyxl/et_xmlfile. The registry source is unknown and no homepage is provided, so dependency provenance is less transparent.

ℹ Credentials

Local file reads and Markdown writes are proportionate to document conversion. Batch mode can process all supported files in a selected directory, so users should scope input and output paths carefully.

ℹ Persistence & Privilege

No credentials, privileged APIs, background workers, or ongoing persistence are shown. The only persistence evidenced is user-directed creation of Markdown output files.

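How to Use

Make sure OpenClaw is installed (locally or via Docker)
Enter the install command in the chat box: /install office-doc-extractor
Once installed, invoke the Skill by name or trigger it with /office-doc-extractor
Provide the inputs described in the Skill's parameter documentation to get structured output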

Version History

v1.0.1

Fix: Removed __pycache__, repackaged clean build

v1.0.0

- Initial release of office-doc-extractor: convert DOCX, XLSX, and PPTX files to Markdown using a pure Python, zero-dependency approach.
- Supports extraction of text and structure: Word headings/paragraphs, Excel tables, and PowerPoint slides.
- Works offline: no pip installs, subprocess calls, or network access required.
- Includes unified CLI for both single-file and batch directory conversion.
- Bundles pure Python openpyxl and et_xmlfile for Excel support.

Metadata

Slug office-doc-extractor

Version 1.0.1

License MIT-0

Total installs 0

Current installs 0

Versions 2

FAQ