DOCX Toolkit
/install docx-toolkit
A complete toolkit for processing Microsoft Word documents (.docx and legacy .doc formats).
Capabilities
1. Text + Table Extraction (.docx)
python3 {baseDir}/scripts/extract_text.py input.docx output.txt
Extracts all paragraphs and tables with structure preserved. Tables are formatted as pipe-delimited rows for easy parsing.
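To make the output format concrete, here is a minimal sketch of the idea using only the Python standard library. It is not the toolkit's extract_text.py (which, per the Dependencies section, relies on python-docx). The function name extract_lines, the helper _text, and the exact " | " separator are assumptions for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _text(node):
    """Concatenate every text run (w:t) found under an element."""
    return "".join(t.text or "" for t in node.iter(W + "t"))


def extract_lines(docx_path):
    """Return body paragraphs and table rows in document order.

    Hypothetical helper: each table row becomes one pipe-delimited line.
    """
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    lines = []
    for child in root.find(W + "body"):
        if child.tag == W + "p":  # top-level paragraph
            lines.append(_text(child))
        elif child.tag == W + "tbl":  # top-level table
            for row in child.findall(W + "tr"):
                cells = []
                for cell in row.findall(W + "tc"):
                    # a cell may hold several paragraphs; join them with spaces
                    parts = [_text(p) for p in cell.iter(W + "p")]
                    cells.append(" ".join(parts).strip())
                lines.append(" | ".join(cells))
    return lines
```

Calling "\n".join(extract_lines("input.docx")) yields text in which every table row sits on its own line, so a downstream parser can split each row on the pipe character.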
2. Text Extraction (Legacy .doc)
python3 {baseDir}/scripts/extract_doc_text.py input.doc output.txt
Handles legacy OLE2 .doc format using olefile. Extracts Unicode text from the WordDocument stream.
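As a rough illustration of what reading the WordDocument stream can look like, the sketch below opens the OLE2 container with olefile and pulls readable UTF-16LE runs out of the raw bytes. This is a crude heuristic under stated assumptions, not the toolkit's extract_doc_text.py: it ignores the Word piece table, so it may emit stray characters and will miss text stored in 8-bit form. The names utf16_text_runs, extract_doc_text, and min_chars are hypothetical.

```python
import re


def utf16_text_runs(data, min_chars=4):
    """Heuristically find readable text runs in bytes decoded as UTF-16LE.

    Assumption: the text of interest is stored as UTF-16LE. Control
    characters (other than tab, LF, CR) end a run; short runs are dropped.
    """
    text = data.decode("utf-16-le", errors="ignore")
    pattern = r"[^\x00-\x08\x0b\x0c\x0e-\x1f\ufffd]{%d,}" % min_chars
    return re.findall(pattern, text)


def extract_doc_text(path):
    """Read the WordDocument stream of a legacy .doc file and return its text runs."""
    import olefile  # third-party: pip3 install olefile

    ole = olefile.OleFileIO(path)
    try:
        data = ole.openstream("WordDocument").read()
    finally:
        ole.close()
    # Word marks paragraph ends with CR; normalise to LF for plain-text output
    return "\n".join(utf16_text_runs(data)).replace("\r", "\n")
```

The whole stream is read into memory in one call, which is consistent with the note that very large .doc files can need significant RAM.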
3. Image Extraction (.docx)
python3 {baseDir}/scripts/extract_images.py input.docx output_dir/
Extracts all embedded images with:
- Automatic deduplication (MD5 hash comparison)
- Size filtering (skips tiny icons <5KB by default)
- Sequential renaming (img_001.png, img_002.jpg, etc.)
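The three behaviours above can be sketched with the standard library alone, since a .docx file is a ZIP archive that stores embedded images under word/media/. This is an illustrative sketch, not the toolkit's extract_images.py. The function name extract_images and the min_bytes parameter are assumptions, with the default taken from the 5KB threshold mentioned above.

```python
import hashlib
import os
import zipfile


def extract_images(docx_path, out_dir, min_bytes=5 * 1024):
    """Copy embedded images out of a .docx, skipping tiny files and exact duplicates.

    Returns the list of file names written, in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    seen = set()  # MD5 digests of images already written
    saved = []
    with zipfile.ZipFile(docx_path) as zf:
        for name in sorted(zf.namelist()):
            if not name.startswith("word/media/"):
                continue
            data = zf.read(name)
            if len(data) < min_bytes:  # size filter: skip small icons
                continue
            digest = hashlib.md5(data).hexdigest()
            if digest in seen:  # deduplication: identical bytes already saved
                continue
            seen.add(digest)
            ext = os.path.splitext(name)[1].lower()  # keep the original format
            out_name = "img_%03d%s" % (len(saved) + 1, ext)
            with open(os.path.join(out_dir, out_name), "wb") as fh:
                fh.write(data)
            saved.append(out_name)
    return saved
```

Hashing the raw bytes only catches byte-identical copies, which is why re-encoded or resized versions of the same picture still pass through.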
4. Image Compression
python3 {baseDir}/scripts/resize_images.py input_dir/ output_dir/ [--max-width 1024]
Batch resize/compress images for API processing (saves 50-70% on vision API costs).
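A minimal sketch of the width-capped resize, assuming Pillow is installed. It is not the toolkit's resize_images.py and does not reproduce whatever compression settings that script applies. The names scaled_size and resize_dir are hypothetical, and the default of 1024 mirrors the --max-width example above.

```python
import os


def scaled_size(width, height, max_width=1024):
    """Return (width, height) shrunk to fit max_width, keeping the aspect ratio."""
    if width <= max_width:
        return width, height  # never upscale
    scale = max_width / width
    return max_width, max(1, round(height * scale))


def resize_dir(in_dir, out_dir, max_width=1024):
    """Resize every readable image in in_dir and write it to out_dir."""
    from PIL import Image  # third-party: pip3 install Pillow

    os.makedirs(out_dir, exist_ok=True)
    for name in sorted(os.listdir(in_dir)):
        src = os.path.join(in_dir, name)
        if not os.path.isfile(src):
            continue
        try:
            with Image.open(src) as img:
                size = scaled_size(img.width, img.height, max_width)
                out = img if size == img.size else img.resize(size)
                out.save(os.path.join(out_dir, name))  # format inferred from extension
        except OSError:
            continue  # skip files Pillow cannot read or write
```

For example, a 2048x1536 image capped at 1024 comes out at 1024x768, while an 800x600 image is left at its original size.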
Dependencies
- Python 3.6+
- python-docx: for .docx processing
- olefile: for legacy .doc processing
- Pillow: for image resizing (optional, only needed for resize script)
Install:
pip3 install python-docx olefile Pillow
Use Cases
- Document analysis: Extract text for AI review/summarization
- Migration: Pull content from Word docs into other formats
- Image audit: Extract and review all embedded images
- Cost optimization: Compress images before sending to vision APIs
- Batch processing: Process multiple documents in a pipeline
Notes
- Large .doc files (>200MB) may require significant RAM for olefile processing
- Image extraction preserves original format (png/jpg/gif/etc.)
- Deduplication catches exact duplicates; near-duplicates still pass through
- CJK (Chinese/Japanese/Korean) text is fully supported in both extractors
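Installation
- Make sure OpenClaw is installed (local or Docker deployment)
- Enter the install command in the chat box: /install docx-toolkit
- Once installation finishes, invoke the Skill by name or trigger it with /docx-toolkit
- Provide the inputs described in the Skill's parameter documentation to get structured output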
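What is DOCX Toolkit?
Extract text, tables, and images from .docx and legacy .doc files. Handles large documents, CJK text, and complex table structures. Includes deduplication an... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 603 downloads to date.
How do I install DOCX Toolkit?
Run the command /install docx-toolkit in the OpenClaw or Claude Code chat box to install it in one step; no extra configuration is needed.
Is DOCX Toolkit free?
Yes. DOCX Toolkit is completely free and open source, and can be downloaded, installed, and used freely.
Which platforms does DOCX Toolkit support?
DOCX Toolkit is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.
Who developed DOCX Toolkit?
It is developed and maintained by Shihao Jiang (Zac) (@zacjiang). The current version is v1.0.0.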