Docx Toolkit

by onlyloveher · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ✓ Security Clean
Downloads: 106 · Stars: 0 · Active Installs: 0 · Versions: 1
Install in OpenClaw
/install docx-toolkit-zhouli
Description
Extract text, tables, and images from .docx and legacy .doc files. Handles large documents, CJK text, and complex table structures. Includes deduplication an...
README (SKILL.md)

DOCX Toolkit

A complete toolkit for processing Microsoft Word documents (.docx and legacy .doc formats).

Capabilities

1. Text + Table Extraction (.docx)

python3 {baseDir}/scripts/extract_text.py input.docx output.txt

Extracts all paragraphs and tables with structure preserved. Tables are formatted as pipe-delimited rows for easy parsing.
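
To illustrate the pipe-delimited table output described above, here is a minimal sketch using python-docx. It is not the bundled extract_text.py: the names format_table_row and extract_text are hypothetical, and this version emits all paragraphs first and then all tables instead of interleaving them in document order.

```python
def format_table_row(cells):
    """Join cell texts into one pipe-delimited line.

    Whitespace inside a cell (including line breaks) is collapsed to single
    spaces so that each table row stays on one output line.
    """
    cleaned = [" ".join(str(cell).split()) for cell in cells]
    return " | ".join(cleaned)


def extract_text(docx_path):
    """Return paragraph text followed by pipe-delimited table rows.

    Assumption: python-docx is installed. It is imported lazily so the
    formatting helper above can be used without it.
    """
    from docx import Document  # provided by the python-docx package

    doc = Document(docx_path)
    # Keep non-empty paragraphs only.
    lines = [p.text for p in doc.paragraphs if p.text.strip()]
    for table in doc.tables:
        for row in table.rows:
            lines.append(format_table_row(cell.text for cell in row.cells))
    return "\n".join(lines)
```

A consumer can split each table line on " | " to recover the cells, which works as long as the cell text itself does not contain that separator.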

2. Text Extraction (Legacy .doc)

python3 {baseDir}/scripts/extract_doc_text.py input.doc output.txt

Handles legacy OLE2 .doc format using olefile. Extracts Unicode text from the WordDocument stream.

3. Image Extraction (.docx)

python3 {baseDir}/scripts/extract_images.py input.docx output_dir/

Extracts all embedded images with:

  • Automatic deduplication (MD5 hash comparison)
  • Size filtering (skips tiny icons <5KB by default)
  • Sequential renaming (img_001.png, img_002.jpg, etc.)
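
The three steps above can be sketched with the standard library alone, since a .docx file is a ZIP archive whose embedded images normally live under word/media/. This is an illustrative sketch, not the bundled extract_images.py: the function name extract_images and the min_bytes parameter are hypothetical, and the 5 KB default simply mirrors the threshold mentioned above.

```python
import hashlib
import zipfile
from pathlib import Path, PurePosixPath


def extract_images(docx_path, out_dir, min_bytes=5 * 1024):
    """Copy unique, non-tiny images out of a .docx and return the new paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    seen_hashes = set()  # MD5 digests of images already written
    written = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in sorted(archive.namelist()):
            # Embedded media is stored under word/media/ in the archive.
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            data = archive.read(name)
            if len(data) < min_bytes:
                continue  # size filter: skip tiny icons
            digest = hashlib.md5(data).hexdigest()
            if digest in seen_hashes:
                continue  # exact duplicate of an earlier image
            seen_hashes.add(digest)
            # Keep the original extension, number the files sequentially.
            suffix = PurePosixPath(name).suffix.lower() or ".bin"
            target = out / f"img_{len(written) + 1:03d}{suffix}"
            target.write_bytes(data)
            written.append(target)
    return written
```

Hashing the raw bytes only catches byte-identical files; a re-encoded or resized copy of the same picture produces a different digest and is kept.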

4. Image Compression

python3 {baseDir}/scripts/resize_images.py input_dir/ output_dir/ [--max-width 1024]

Batch resize/compress images for API processing (saves 50-70% on vision API costs).
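
As a rough sketch of what a --max-width resize involves, the following computes an aspect-preserving target size and applies it with Pillow. It is an assumption about the general technique, not the bundled resize_images.py: target_size and resize_image are hypothetical names, and whether the real script upscales small images or re-encodes formats is not stated in this README.

```python
def target_size(width, height, max_width=1024):
    """Return (width, height) scaled down so width <= max_width.

    Images already narrow enough are returned unchanged (no upscaling).
    """
    if width <= max_width:
        return width, height
    scale = max_width / width
    return max_width, max(1, round(height * scale))


def resize_image(src_path, dst_path, max_width=1024):
    """Resize one image file, keeping its aspect ratio.

    Assumption: Pillow is installed. It is imported lazily because the
    README lists it as optional.
    """
    from PIL import Image

    with Image.open(src_path) as img:
        new_size = target_size(img.width, img.height, max_width)
        if new_size != img.size:
            img = img.resize(new_size, Image.LANCZOS)
        # The output format is inferred from the dst_path extension.
        img.save(dst_path)
```

For example, a 2048x1000 image is reduced to 1024x500, while an 800x600 image is left at its original size.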

Dependencies

  • Python 3.6+
  • python-docx — for .docx processing
  • olefile — for legacy .doc processing
  • Pillow — for image resizing (optional, only needed for resize script)

Install:

pip3 install python-docx olefile Pillow

Use Cases

  • Document analysis: Extract text for AI review/summarization
  • Migration: Pull content from Word docs into other formats
  • Image audit: Extract and review all embedded images
  • Cost optimization: Compress images before sending to vision APIs
  • Batch processing: Process multiple documents in a pipeline

Notes

  • Large .doc files (>200MB) may require significant RAM for olefile processing
  • Image extraction preserves original format (png/jpg/gif/etc.)
  • Deduplication catches exact duplicates; near-duplicates still pass through
  • CJK (Chinese/Japanese/Korean) text is fully supported in both extractors
Usage Guidance
This skill appears coherent, and its code is consistent with the described functionality. Still, take normal precautions before running the bundled scripts: run them in a sandboxed or isolated environment (or at least a virtualenv), install the Python dependencies from official PyPI sources, and avoid processing sensitive or confidential documents unless you trust the environment, since extracted text and images are written to disk. The skill requires no credentials and its code contains no network endpoints, but package metadata and source provenance are limited (registry/source marked unknown), so prefer running it locally with restricted permissions and inspect the files before use.
Capability Analysis
Type: OpenClaw Skill
Name: docx-toolkit-zhouli
Version: 1.0.0

The docx-toolkit is a legitimate utility for extracting text, tables, and images from Microsoft Word documents (.docx and .doc). The scripts (extract_text.py, extract_doc_text.py, extract_images.py, and resize_images.py) use standard libraries like python-docx, olefile, and Pillow to perform their stated functions. While the image extraction script includes logic to categorize images based on surrounding text context (e.g., identifying contracts or certificates), this behavior is consistent with the toolkit's stated purpose of document analysis and does not exhibit signs of malicious intent or data exfiltration.
Capability Assessment
Purpose & Capability
Name/description (docx/doc extraction, image dedup/compression) matches the included scripts and declared Python dependencies (python-docx, olefile, Pillow). There are no unrelated credentials, binaries, or config paths requested.
Instruction Scope
SKILL.md instructs the agent to run local Python scripts on user-provided files and directories. The scripts operate on the given input files, read/write local output paths, and do not reference external endpoints, extra environment variables, or unrelated system files.
Install Mechanism
No install spec is provided (instruction-only with bundled scripts). Dependencies are standard Python packages installable via pip. No remote downloads or archive extraction from third-party URLs are performed by the skill itself.
Credentials
The skill declares no required environment variables or credentials. The scripts use only local filesystem access and in-memory processing; requested resources (Python packages) are proportional to the task.
Persistence & Privilege
always:false and no indication the skill attempts to persist or modify other skills or system-wide agent settings. It does not request elevated or permanent platform privileges.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install docx-toolkit-zhouli
  3. After installation, invoke the skill by name or use /docx-toolkit-zhouli
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
Initial release of docx-toolkit, a comprehensive tool for extracting content from Word documents.

  • Extracts text, tables (with structure), and images from both .docx and legacy .doc files
  • Supports large documents and complex/CJK (Chinese, Japanese, Korean) text
  • Automatic deduplication and filtering for extracted images
  • Includes batch image resize/compression to reduce vision API costs
  • Simple command-line usage with support for pipelines and automation
Metadata
Slug docx-toolkit-zhouli
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is Docx Toolkit?

Extract text, tables, and images from .docx and legacy .doc files. Handles large documents, CJK text, and complex table structures. Includes deduplication an... It is an AI Agent Skill for Claude Code / OpenClaw, with 106 downloads so far.

How do I install Docx Toolkit?

Run "/install docx-toolkit-zhouli" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is Docx Toolkit free?

Yes, Docx Toolkit is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Docx Toolkit support?

Docx Toolkit is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created Docx Toolkit?

It is built and maintained by onlyloveher (@onlyloveher); the current version is v1.0.0.
