longjf25

Document Sanitizer

by juanfenglong · GitHub ↗ · v1.5.1 · MIT-0
cross-platform ✓ Security Clean
229 Downloads · 0 Stars · 0 Active Installs · 8 Versions
Install in OpenClaw
/install document-sanitizer
Description
Batch desensitize docx/xlsx files via keyword and regex rules, with one-click reversible restoration. Replace sensitive terms (company names, personal info,...
README (SKILL.md)

# document-sanitizer

## Skill Description

### Overview

This skill batch-desensitizes Word (.docx) and Excel (.xlsx) files using exact keyword replacement and dynamic regex replacement, and provides one-click reversible restoration.

### Key Features

| Feature | Description |
|------|-------------------|
| Exact Match | Configure keyword pairs; each keyword is replaced with a `[]`-marked placeholder (e.g. "白云" → `[黑水]`) |
| Regex Dynamic | Matches phone numbers, ID card numbers, email addresses, and similar values, and auto-generates placeholders such as `[RED_手机号_1]` |
| Filename Sanitization | Enabled by default; output files automatically use the sanitized filename |
| Unified Record | All mappings accumulate in `_sanitize_record.json`, so documents can be restored no matter how they were modified |
| One-Click Restore | Reads the record, replaces in reverse, restores filenames and content, and checks for residual placeholders |
| Legacy Format Conversion | When `.doc`/`.xls` files are detected, prompts to convert them to `.docx`/`.xlsx` automatically |

---

## Prerequisites

### Python Dependencies

```bash
pip install python-docx openpyxl
```

### Optional (Legacy Format Conversion)

```bash
pip install xlrd pywin32
```

- `.doc → .docx` conversion requires Windows with Microsoft Word (it uses Word COM automation)
- `.xls → .xlsx` conversion uses xlrd + openpyxl (cross-platform)

---

## Usage

### 1. Prepare Config File

Create `_sanitize_config.json` in your workspace:

```json
{
  "exact_rules": [
    {"pattern": "白云", "replacement": "黑水"},
    {"pattern": "南方", "replacement": "北风"},
    {"pattern": "广州", "replacement": "镇北"}
  ],
  "regex_rules": [
    {"pattern": "1[3-9]\\d{9}", "label": "手机号"},
    {"pattern": "\\d{6}(?:19|20)\\d{2}(?:0[1-9]|1[0-2])(?:0[1-9]|[12]\\d|3[01])\\d{3}[\\dXx]", "label": "身份证号"},
    {"pattern": "[a-zA-Z0-9._%+\\-]+@[a-zA-Z0-9.\\-]+\\.[a-zA-Z]{2,}", "label": "邮箱"}
  ]
}
```

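To make the two rule types concrete, here is a minimal Python sketch of how a config like the one above could be applied to a plain string. It is not the skill's actual `sanitize.py`: the function name `sanitize_text`, the per-label numbering, and the reuse of one placeholder for repeated values are assumptions based on the placeholder examples under Key Features.

```python
import re


def sanitize_text(text, config, mapping):
    """Apply exact rules, then regex rules; record placeholder -> original in mapping."""
    # Exact rules: wrap the replacement in [] so it can be located again on restore.
    for rule in config.get("exact_rules", []):
        placeholder = f"[{rule['replacement']}]"
        if rule["pattern"] in text:
            text = text.replace(rule["pattern"], placeholder)
            mapping[placeholder] = rule["pattern"]

    # Regex rules: each distinct matched value gets a numbered placeholder per label.
    for rule in config.get("regex_rules", []):
        label = rule["label"]
        pattern = re.compile(rule["pattern"])
        # Reverse view of the mapping, so a repeated value reuses its placeholder.
        known = {original: ph for ph, original in mapping.items()}

        def repl(match):
            value = match.group(0)
            if value not in known:
                n = sum(1 for ph in mapping if ph.startswith(f"[RED_{label}_")) + 1
                known[value] = f"[RED_{label}_{n}]"
                mapping[known[value]] = value
            return known[value]

        text = pattern.sub(repl, text)
    return text


# Hypothetical in-memory config mirroring part of _sanitize_config.json.
config = {
    "exact_rules": [{"pattern": "白云", "replacement": "黑水"}],
    "regex_rules": [{"pattern": r"1[3-9]\d{9}", "label": "手机号"}],
}
mapping = {}
print(sanitize_text("白云物流 电话 13828417396", config, mapping))
print(mapping)
```

The `mapping` dictionary filled in here plays the role of the `mapping` field in `_sanitize_record.json`: it is keyed by the sanitized value so that restoration only needs a reverse lookup.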
### 2. Run Sanitization

```bash
# Basic sanitization (filenames are sanitized by default)
python <skill_dir>/scripts/sanitize.py sanitize <workspace>

# Do not sanitize filenames
python <skill_dir>/scripts/sanitize.py sanitize <workspace> --no-rename

# Automatically convert legacy .doc/.xls files
python <skill_dir>/scripts/sanitize.py sanitize <workspace> --auto-convert
```

### 3. Restore Documents

```bash
python <skill_dir>/scripts/sanitize.py restore <workspace>
```

---

## CLI Arguments

| Argument | Description |
|----------------|-------------------|
| `sanitize <workspace>` | Run sanitization |
| `restore <workspace>` | Restore documents |
| `--no-rename` | Skip filename sanitization |
| `--auto-convert` | Auto-convert legacy .doc/.xls files without asking for confirmation |

---

## Output Directories

| Path | Description |
|-----------------|-------------------|
| `_sanitized_output/` | Sanitized files |
| `_restored_output/` | Restored files |
| `_sanitize_record.json` | Unified record (accumulative mapping) |
| `_sanitize_config.json` | Sanitization rules config |

---

## Record Structure

```json
{
  "version": 2,
  "created_at": "2026-04-08 16:02:07",
  "last_updated": "2026-04-08 16:02:07",
  "mapping": {
    "黑水": "白云",
    "[RED_手机号_1]": "13828417396"
  },
  "filename_mapping": {
    "黑水物流文档.docx": "白云物流文档.docx"
  },
  "runs": [
    {"timestamp": "...", "files_processed": ["..."]}
  ]
}
```

**Core Principle**: As long as the `mapping` (sanitized value → original value) is complete, documents can be restored by reverse replacement, no matter how many times they have been modified.

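The reverse replacement described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the skill's own code: the helper names `restore_text` and `residual_placeholders` are hypothetical, the `[...]` pattern used for the residual check is a guess at what counts as a placeholder, and the second mapping entry is made-up sample data. Keys are applied longest first so that a short key such as "黑水" cannot consume part of a longer key such as "黑水集团".

```python
import re


def restore_text(text, mapping):
    """Replace sanitized values with their originals, longest keys first."""
    for key in sorted(mapping, key=len, reverse=True):
        text = text.replace(key, mapping[key])
    return text


def residual_placeholders(text):
    """Return any [..] placeholders still present after restoration (assumed format)."""
    return re.findall(r"\[[^\[\]]+\]", text)


# Hypothetical mapping in the same shape as the "mapping" field of the record.
mapping = {
    "黑水": "白云",
    "黑水集团": "白云控股",
    "[RED_手机号_1]": "13828417396",
}
restored = restore_text("黑水集团 / 黑水 / [RED_手机号_1]", mapping)
print(restored)
print(residual_placeholders(restored))
```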
---

## Examples

### Example 1: Basic Sanitization

```bash
# 1. Create the config file _sanitize_config.json
# 2. Run sanitization
python sanitize.py sanitize ./my-docs

# Output (the last line reports the sanitized output directory):
# [RENAME] 白云物流文档.docx → 黑水物流文档.docx
# [1/1] 白云物流文档.docx [OK]
# 脱敏输出目录: ./my-docs/_sanitized_output
```

### Example 2: Sanitize then Restore

```bash
# Sanitize
python sanitize.py sanitize ./my-docs

# Restore (can be run repeatedly; also works on documents modified after sanitization)
python sanitize.py restore ./my-docs

# Output (the last line means "no residual placeholders"):
# [RENAME] 黑水物流文档.docx → 白云物流文档.docx
# [OK] 无残留占位符
```

### Example 3: Directory with Legacy Files

```bash
# Auto-convert .doc/.xls files, then sanitize
python sanitize.py sanitize ./my-docs --auto-convert
```

---

## Technical Notes

1. **Run Splitting**: Word splits text across multiple `<w:r>` elements. The script merges all `w:t` text first, applies replacements, then writes the result back, so keywords that span runs are still matched.

2. **Reverse Order**: Restoration replaces keys in descending order of length, so a short key cannot match a substring of a longer key.

3. **Scope**: Only .docx/.xlsx are supported. Legacy .doc/.xls files must be converted first (auto-conversion is available).

4. **Original Safety**: Original files are never modified; all operations happen in the output directories.

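The run-splitting note can be illustrated with a small standard-library sketch that works directly on WordprocessingML. It only assumes the merge, replace, write-back sequence described above; the function name `replace_across_runs` is hypothetical, and writing the whole result into the first `w:t` while emptying the others is one possible write-back strategy, not necessarily the one the script uses. That strategy applies the first run's formatting to the replaced text.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by w:p, w:r and w:t elements.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def replace_across_runs(paragraph, old, new):
    """Merge all w:t text in a paragraph, replace, and write the result back."""
    texts = paragraph.findall(f".//{{{W}}}t")
    if not texts:
        return
    merged = "".join(t.text or "" for t in texts)
    if old not in merged:
        return
    # Put the replaced string into the first w:t and empty the remaining ones.
    texts[0].text = merged.replace(old, new)
    for t in texts[1:]:
        t.text = ""


# A paragraph in which the keyword "白云" is split across two runs.
xml = (
    f'<w:p xmlns:w="{W}">'
    "<w:r><w:t>白</w:t></w:r><w:r><w:t>云物流</w:t></w:r>"
    "</w:p>"
)
p = ET.fromstring(xml)
replace_across_runs(p, "白云", "[黑水]")
print("".join(t.text for t in p.iter(f"{{{W}}}t")))
```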
---

## Error Handling

| Error | Solution |
|-------------|-------------------|
| "未找到配置文件" (config not found) | Create `_sanitize_config.json` in the workspace |
| "python-docx import error" | Run `pip install python-docx` |
| "openpyxl import error" | Run `pip install openpyxl` |
| ".doc 转换失败" (.doc conversion failed) | Ensure Windows and Microsoft Word are installed |
| "残留占位符" (residual placeholders) | Check that the sanitization record is complete |

Usage Guidance
This skill appears to do exactly what it says and runs locally, without network exfiltration or secret requests. Before installing/using:
  1. Review and secure the _sanitize_record.json file that the tool writes. It contains the original sensitive values required for restoration, so store/encrypt it or remove it when not needed.
  2. If you enable --auto-convert, the script will execute conversion helper scripts from ~/.workbuddy/skills/doc_xls2docx_xlsx/scripts; verify those scripts are legitimate on your machine.
  3. Test the tool on non-production/sample documents first to confirm behavior (renaming, output locations).
  4. Keep the required Python packages up-to-date and run in a restricted workspace to avoid accidental processing of unrelated files.
Capability Analysis
Type: OpenClaw Skill · Name: document-sanitizer · Version: 1.5.1
The document-sanitizer skill is a legitimate utility for batch-redacting sensitive information from Word and Excel files. It uses standard libraries (python-docx, openpyxl) to perform regex and keyword replacements and maintains a local mapping file (_sanitize_record.json) to allow for reversible restoration. The script sanitize.py includes logic to call external conversion scripts for legacy formats via subprocess.run, but it does so without shell=True and uses specific paths, showing no signs of malicious intent, data exfiltration, or prompt injection.
Capability Assessment
Purpose & Capability
Name/description (batch desensitize docx/xlsx with reversible restore) matches the provided code and SKILL.md. The only out-of-band dependency is that legacy-format conversion calls scripts under ~/.workbuddy/skills/doc_xls2docx_xlsx/scripts — this is coherent with optional auto-conversion but means the skill expects another local skill to be installed.
Instruction Scope
SKILL.md instructs scanning a workspace/raw folder, producing outputs under _sanitized_output/_restored_output and a unified _sanitize_record.json. The runtime script implements these actions and does not attempt to read unrelated system config or environment variables. It uses subprocess to invoke optional local conversion scripts when converting .doc/.xls files.
Install Mechanism
There is no install spec; this is an instruction+script skill bundled with Python code. No network downloads, no package install automation in the skill itself. Required Python libraries are listed in SKILL.md (python-docx, openpyxl, optional xlrd/pywin32), which is proportionate.
Credentials
The skill requests no environment variables or credentials. It reads/writes files inside the workspace and will read an optional config file in the workspace. It also references the user's home directory (~/.workbuddy) to find optional conversion scripts — this is justified by the optional conversion feature but is worth attention.
Persistence & Privilege
The skill is not always-enabled and does not request elevated privileges. It writes a persistent record file (_sanitize_record.json) into the workspace that accumulates sanitized value → original value mappings; this is normal for reversible sanitization but increases the sensitivity of the workspace.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install document-sanitizer
  3. After installation, invoke the skill by name or use /document-sanitizer
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.5.1
Scan only the raw folder; fix filename restoration in subdirectories
v1.5.0
Keyword sanitization now uses [] markers throughout; improved residual placeholder detection
v1.4.0
Keyword sanitization uses -- markers, e.g. --黑水--
v1.3.0
Bump version to 1.3.0
v1.0.0
Initial release: keyword and regex based document desensitization with one-click reversible restoration for .docx/.xlsx files
v1.2.1
v1.2.1: bilingual display name
v1.2.0
v1.2.0: Clarify restore workflow - default output to _restored_output/, filename restoration, directory structure preservation
v1.1.0
v1.1.0: English path names, filename sanitization (--rename), --target subdirectory support, security hardening (path traversal, ReDoS, size limits)
Metadata
Slug document-sanitizer
Version 1.5.1
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 8
Frequently Asked Questions

What is Document Sanitizer?

Batch desensitize docx/xlsx files via keyword and regex rules, with one-click reversible restoration. Replace sensitive terms (company names, personal info,... It is an AI Agent Skill for Claude Code / OpenClaw, with 229 downloads so far.

How do I install Document Sanitizer?

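Run "/install document-sanitizer" in the OpenClaw or Claude Code chat to install it in one step. The skill does not install its Python dependencies itself, so install python-docx and openpyxl with pip as described under Prerequisites.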

Is Document Sanitizer free?

Yes, Document Sanitizer is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does Document Sanitizer support?

Document Sanitizer is cross-platform and runs anywhere OpenClaw / Claude Code is available. The one exception is the optional .doc → .docx conversion, which requires Windows with Microsoft Word.

Who created Document Sanitizer?

It is built and maintained by juanfenglong (@longjf25); the current version is v1.5.1.
