
DOCX Toolkit

Name: DOCX Toolkit
Author: zacjiang

by Shihao Jiang (Zac) · GitHub ↗ · v1.0.0

cross-platform ✓ Security Clean

Downloads: 603

Install in OpenClaw

/install docx-toolkit

Description

Extract text, tables, and images from .docx and legacy .doc files. Handles large documents, CJK text, and complex table structures. Includes deduplication an...

README (SKILL.md)

DOCX Toolkit

A complete toolkit for processing Microsoft Word documents (.docx and legacy .doc formats).

Capabilities

1. Text + Table Extraction (.docx)

python3 {baseDir}/scripts/extract_text.py input.docx output.txt

Extracts all paragraphs and tables with structure preserved. Tables are formatted as pipe-delimited rows for easy parsing.
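
The bundled script's source is not shown here, so the following is only a minimal sketch of how paragraphs and pipe-delimited table rows could be produced with python-docx. The function names `format_row` and `extract_docx` are hypothetical, and this sketch lists all paragraphs first and all tables afterwards rather than interleaving them in document order.

```python
def format_row(cells):
    """Join cell texts into one pipe-delimited line.

    Whitespace inside a cell (including line breaks) is collapsed to single
    spaces so that each table row stays on one output line.
    """
    cleaned = [" ".join(str(c).split()) for c in cells]
    return "| " + " | ".join(cleaned) + " |"


def extract_docx(path):
    """Return paragraph text followed by table rows (hypothetical helper)."""
    # python-docx is imported lazily so format_row works without it installed.
    from docx import Document

    doc = Document(path)
    lines = [p.text for p in doc.paragraphs if p.text.strip()]
    for table in doc.tables:
        for row in table.rows:
            lines.append(format_row(cell.text for cell in row.cells))
    return "\n".join(lines)
```

A row holding the cells `Name` and `Qty` would come out as `| Name | Qty |`, which splits cleanly on the pipe character when parsed later.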

2. Text Extraction (Legacy .doc)

python3 {baseDir}/scripts/extract_doc_text.py input.doc output.txt

Handles legacy OLE2 .doc format using olefile. Extracts Unicode text from the WordDocument stream.
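
As a rough illustration only, the sketch below reads the WordDocument stream with olefile and keeps runs of printable UTF-16LE text. How the bundled script actually locates text is not documented here; `read_word_stream`, `unicode_runs`, and the `min_chars` threshold are assumptions, and this crude approach can emit stray characters where binary data happens to decode as text.

```python
import re


def read_word_stream(path):
    """Return the raw bytes of the WordDocument stream (hypothetical helper)."""
    # olefile is imported lazily so unicode_runs works without it installed.
    import olefile

    if not olefile.isOleFile(path):
        raise ValueError("not an OLE2 file: %s" % path)
    ole = olefile.OleFileIO(path)
    try:
        if not ole.exists("WordDocument"):
            raise ValueError("no WordDocument stream in %s" % path)
        return ole.openstream("WordDocument").read()
    finally:
        ole.close()


def unicode_runs(raw, min_chars=4):
    """Decode bytes as UTF-16LE and keep runs of printable characters.

    Control characters (except tab and newline) end a run, so each Word
    paragraph mark (carriage return) starts a new entry.
    """
    text = raw.decode("utf-16-le", errors="ignore")
    pattern = r"[^\x00-\x08\x0b-\x1f\x7f-\x9f\ufffd]{%d,}" % min_chars
    return [run.strip() for run in re.findall(pattern, text) if run.strip()]
```

Calling `unicode_runs(read_word_stream("input.doc"))` would give a list of candidate text fragments that could then be joined with newlines and written to the output file.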

3. Image Extraction (.docx)

python3 {baseDir}/scripts/extract_images.py input.docx output_dir/

Extracts all embedded images with:

Automatic deduplication (MD5 hash comparison)
Size filtering (skips tiny icons <5KB by default)
Sequential renaming (img_001.png, img_002.jpg, etc.)
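
The three steps above can be sketched with the standard library alone, since a .docx file is a ZIP archive whose embedded images normally live under `word/media/`. This is an illustrative sketch, not the bundled script: the function name `extract_images` and the `min_bytes` parameter are assumptions.

```python
import hashlib
import zipfile
from pathlib import Path


def extract_images(docx_path, out_dir, min_bytes=5 * 1024):
    """Copy unique, sufficiently large images out of a .docx archive.

    Returns the list of file names written, in extraction order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    seen = set()      # MD5 digests of images already written
    written = []
    with zipfile.ZipFile(docx_path) as zf:
        for name in sorted(zf.namelist()):
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            data = zf.read(name)
            if len(data) < min_bytes:
                continue  # size filter: skip tiny icons
            digest = hashlib.md5(data).hexdigest()
            if digest in seen:
                continue  # exact duplicate of an earlier image
            seen.add(digest)
            # Keep the original extension, number files sequentially.
            ext = Path(name).suffix.lower() or ".bin"
            target = out / "img_{:03d}{}".format(len(written) + 1, ext)
            target.write_bytes(data)
            written.append(target.name)
    return written
```

Because the comparison is on a hash of the raw bytes, only byte-identical copies are dropped; a re-encoded or resized copy of the same picture would still be written.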

4. Image Compression

python3 {baseDir}/scripts/resize_images.py input_dir/ output_dir/ [--max-width 1024]

Batch resize/compress images for API processing (saves 50-70% on vision API costs).
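
A minimal sketch of width-capped resizing with Pillow follows. It is not the bundled script: `target_size` and `resize_dir` are hypothetical names, only PNG and JPEG files are handled, and the sketch always writes to a separate output directory instead of overwriting the inputs.

```python
from pathlib import Path


def target_size(width, height, max_width=1024):
    """Return (width, height) scaled down to max_width, keeping aspect ratio.

    Images that are already narrow enough are left at their original size.
    """
    if width <= max_width:
        return (width, height)
    scale = max_width / width
    return (max_width, max(1, round(height * scale)))


def resize_dir(in_dir, out_dir, max_width=1024):
    """Resize every PNG/JPEG in in_dir into out_dir (hypothetical helper)."""
    # Pillow is imported lazily so target_size works without it installed.
    from PIL import Image

    src, dst = Path(in_dir), Path(out_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(src.iterdir()):
        if path.suffix.lower() not in {".png", ".jpg", ".jpeg"}:
            continue
        with Image.open(path) as img:
            size = target_size(img.size[0], img.size[1], max_width)
            resized = img if size == img.size else img.resize(size, Image.LANCZOS)
            resized.save(dst / path.name)
```

With the default cap, a 2048x1000 image becomes 1024x500, while an 800x600 image is copied at its original dimensions.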

Dependencies

Python 3.6+
python-docx — for .docx processing
olefile — for legacy .doc processing
Pillow — for image resizing (optional, only needed for resize script)

Install:

pip3 install python-docx olefile Pillow

Use Cases

Document analysis: Extract text for AI review/summarization
Migration: Pull content from Word docs into other formats
Image audit: Extract and review all embedded images
Cost optimization: Compress images before sending to vision APIs
Batch processing: Process multiple documents in a pipeline
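
For the batch case, one possible approach is to build one command per document and run them in sequence. The sketch below assumes the script names and argument order shown under Capabilities; the helper `build_commands`, the `.txt` output naming, and the example paths are hypothetical.

```python
import subprocess
from pathlib import Path

# Script to use for each extension, as listed under Capabilities.
SCRIPTS = {".docx": "extract_text.py", ".doc": "extract_doc_text.py"}


def build_commands(base_dir, in_dir, out_dir):
    """Return one argv list per Word document found in in_dir."""
    commands = []
    for path in sorted(Path(in_dir).iterdir()):
        script = SCRIPTS.get(path.suffix.lower())
        if script is None:
            continue  # not a Word document
        # Append ".txt" to the full name so a.doc and a.docx do not collide.
        target = Path(out_dir) / (path.name + ".txt")
        commands.append(
            ["python3", str(Path(base_dir) / "scripts" / script), str(path), str(target)]
        )
    return commands


def run_all(base_dir, in_dir, out_dir):
    """Run each extraction command, stopping at the first failure."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_commands(base_dir, in_dir, out_dir):
        subprocess.run(cmd, check=True)
```

Here `base_dir` stands in for the `{baseDir}` placeholder used in the commands above; its actual value depends on where the skill is installed.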

Notes

Large .doc files (>200MB) may require significant RAM for olefile processing
Image extraction preserves original format (png/jpg/gif/etc.)
Deduplication catches exact duplicates; near-duplicates still pass through
CJK (Chinese/Japanese/Korean) text is fully supported in both extractors

Usage Guidance

This skill appears to do what it claims: local extraction of text, tables, and images from Word files. Before using it on sensitive content, consider the following. Run it in a sandbox or isolated environment when processing untrusted documents. Expect the scripts to write files to the specified output_dir, and note that resize_images overwrites files in place when no output directory is given. Very large legacy .doc files may use a lot of RAM. Image extraction can pull out sensitive items (IDs, certificates), so review outputs before uploading them anywhere. Image classification is heuristic and language-specific, so it may mislabel images. No network exfiltration or secret usage was observed in the code. If you require stronger assurance, inspect the bundled scripts locally or run them in a container.

Capability Analysis

Type: OpenClaw Skill
Name: docx-toolkit
Version: 1.0.0

The docx-toolkit is a legitimate set of utilities for extracting text, tables, and images from Microsoft Word documents (.docx and legacy .doc). The scripts (extract_text.py, extract_doc_text.py, extract_images.py, and resize_images.py) use standard libraries like python-docx, olefile, and Pillow to perform their stated functions. While extract_images.py contains logic to categorize images based on sensitive keywords (e.g., contracts, certificates, and IDs), this is used to generate a local manifest for processing efficiency and does not involve data exfiltration or unauthorized access. No malicious patterns, such as network calls, obfuscation, or prompt injection, were detected.

Capability Assessment

✓ Purpose & Capability

Name/description match the included scripts: extract_text.py, extract_doc_text.py, extract_images.py, and resize_images.py. Declared Python libraries (python-docx, olefile, Pillow) are appropriate for the stated functionality. No unrelated binaries, env vars, or external services are requested.

ℹ Instruction Scope

SKILL.md only instructs running the included scripts on local files and directories. The scripts read input document files, write extracted text/images to an output directory, and optionally write a JSON manifest. This is within scope. Notes: extract_doc_text reads raw OLE streams and may use significant RAM for very large .doc files; resize_images will overwrite files if output_dir is omitted; classify_by_context uses heuristic keyword matching (mostly Chinese keywords) and can misclassify. The scripts do not contact external endpoints or read environment variables.

✓ Install Mechanism

No install spec is provided (instruction-only), and the code is bundled with the skill. Dependencies are normal Python packages installable via pip. No downloads from arbitrary URLs or archive extraction are present.

✓ Credentials

The skill requests no environment variables, credentials, or special config paths. All required resources are local files and standard Python packages, which is proportionate to the functionality.

✓ Persistence & Privilege

The skill is not always-enabled and does not request persistent or elevated platform privileges. It does not alter other skills' configuration or require platform-wide settings.

How to Use

Make sure OpenClaw is installed (local or Docker)
Run the install command in chat: /install docx-toolkit
After installation, invoke the skill by name or use /docx-toolkit
Provide required inputs per the skill's parameter spec and get structured output

Version History

v1.0.0

Initial release: extract text, tables, images from .docx/.doc with CJK support

Metadata

Slug docx-toolkit

Version 1.0.0

License —

All-time Installs 6

Active Installs 6

Total Versions 1

Frequently Asked Questions

What is DOCX Toolkit?

Extract text, tables, and images from .docx and legacy .doc files. Handles large documents, CJK text, and complex table structures. Includes deduplication an... It is an AI Agent Skill for Claude Code / OpenClaw, with 603 downloads so far.

How do I install DOCX Toolkit?

Run "/install docx-toolkit" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is DOCX Toolkit free?

Yes, DOCX Toolkit is completely free (open-source). You can download, install and use it at no cost.

Which platforms does DOCX Toolkit support?

DOCX Toolkit is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created DOCX Toolkit?

It is built and maintained by Shihao Jiang (Zac) (@zacjiang); the current version is v1.0.0.
