Document Sanitizer
/install document-sanitizer
# document-sanitizer

## Skill Description

### Overview

This skill batch-sanitizes Word (.docx) and Excel (.xlsx) files using exact keyword replacement and regex-based dynamic replacement, and supports one-click restoration.

### Key Features

| Feature | Description |
|---------|-------------|
| Exact Match | Configure keyword pairs; each keyword is replaced with a placeholder marked by `[]` (e.g. "白云" → `[黑水]`) |
| Regex Dynamic | Matches mobile numbers, ID card numbers, email addresses, and similar values, and auto-generates placeholders such as `[RED_手机号_1]` |
| Filename Sanitization | Enabled by default; output files are written under their sanitized file names |
| Unified Record | All mappings accumulate in `_sanitize_record.json`, so documents can be restored however they were modified afterwards |
| One-Click Restore | Reads the record and replaces in reverse, restoring file names and content, then checks for residual placeholders |
| Legacy Format Conversion | When `.doc`/`.xls` files are detected, offers to convert them to `.docx`/`.xlsx` automatically |

---

## Prerequisites

### Python Dependencies

```bash
pip install python-docx openpyxl
```

### Optional (Legacy Format Conversion)

```bash
pip install xlrd pywin32
```

- `.doc → .docx` conversion requires Windows with Microsoft Word (it uses Word COM automation)
- `.xls → .xlsx` conversion uses xlrd + openpyxl (cross-platform)

---

## Usage

### 1. Prepare Config File

Create `_sanitize_config.json` in your workspace:

```json
{
  "exact_rules": [
    {"pattern": "白云", "replacement": "黑水"},
    {"pattern": "南方", "replacement": "北风"},
    {"pattern": "广州", "replacement": "镇北"}
  ],
  "regex_rules": [
    {"pattern": "1[3-9]\\d{9}", "label": "手机号"},
    {"pattern": "\\d{6}(?:19|20)\\d{2}(?:0[1-9]|1[0-2])(?:0[1-9]|[12]\\d|3[01])\\d{3}[\\dXx]", "label": "身份证号"},
    {"pattern": "[a-zA-Z0-9._%+\\-]+@[a-zA-Z0-9.\\-]+\\.[a-zA-Z]{2,}", "label": "邮箱"}
  ]
}
```
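To make the two rule types concrete, here is a minimal sketch of how a config like this might be applied to a single string. It is not the skill's actual `sanitize.py`: the function name `sanitize_text`, the in-memory `mapping` dict, and the choice to wrap exact replacements in `[]` (as the feature table describes) are assumptions made for illustration.

```python
import re


def sanitize_text(text, config, mapping):
    """Apply exact rules, then regex rules (hypothetical helper).

    `mapping` collects placeholder -> original value and is updated in place,
    so passing the same dict across calls accumulates entries.
    """
    # Exact rules: each keyword becomes a bracketed placeholder (assumed format).
    for rule in config.get("exact_rules", []):
        placeholder = f"[{rule['replacement']}]"
        if rule["pattern"] in text:
            text = text.replace(rule["pattern"], placeholder)
            mapping[placeholder] = rule["pattern"]

    # Reverse lookup so a value seen before reuses its existing placeholder.
    reverse = {original: placeholder for placeholder, original in mapping.items()}

    # Regex rules: each distinct match gets a numbered placeholder per label.
    for rule in config.get("regex_rules", []):
        prefix = f"[RED_{rule['label']}_"
        counter = sum(1 for key in mapping if key.startswith(prefix))

        def substitute(match):
            nonlocal counter
            value = match.group(0)
            if value not in reverse:
                counter += 1
                placeholder = f"{prefix}{counter}]"
                reverse[value] = placeholder
                mapping[placeholder] = value
            return reverse[value]

        text = re.sub(rule["pattern"], substitute, text)
    return text
```

One thing the sketch makes visible: the mobile pattern `1[3-9]\d{9}` has no boundaries, so it can also match 11 digits inside a longer digit string such as an 18-digit ID number. In this sketch, listing the ID rule first or adding lookarounds such as `(?<!\d)` and `(?!\d)` to the pattern would avoid that.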

### 2. Run Sanitization

```bash
# Basic sanitization (file names are sanitized by default)
python <skill_dir>/scripts/sanitize.py sanitize <workspace>

# Keep the original file names
python <skill_dir>/scripts/sanitize.py sanitize <workspace> --no-rename

# Auto-convert legacy .doc/.xls files
python <skill_dir>/scripts/sanitize.py sanitize <workspace> --auto-convert
```

### 3. Restore Documents

```bash
python <skill_dir>/scripts/sanitize.py restore <workspace>
```

---

## CLI Arguments

| Argument | Description |
|----------|-------------|
| `sanitize <workspace>` | Run sanitization |
| `restore <workspace>` | Restore documents |
| `--no-rename` | Skip filename sanitization |
| `--auto-convert` | Auto-convert `.doc`/`.xls` files without asking for confirmation |

---

## Output Directories

| Directory / File | Description |
|------------------|-------------|
| `_sanitized_output/` | Sanitized files |
| `_restored_output/` | Restored files |
| `_sanitize_record.json` | Unified record (accumulative mapping) |
| `_sanitize_config.json` | Sanitization rules config |

---

## Record Structure

```json
{
  "version": 2,
  "created_at": "2026-04-08 16:02:07",
  "last_updated": "2026-04-08 16:02:07",
  "mapping": {
    "黑水": "白云",
    "[RED_手机号_1]": "13828417396"
  },
  "filename_mapping": {
    "黑水物流文档.docx": "白云物流文档.docx"
  },
  "runs": [
    {"timestamp": "...", "files_processed": ["..."]}
  ]
}
```
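The accumulative behaviour of the record could look roughly like the sketch below: each run merges its new entries into the existing file rather than overwriting it, which is what keeps older placeholders restorable. The helper name `update_record` and its parameters are hypothetical; only the field names are taken from the structure shown above.

```python
import json
from datetime import datetime
from pathlib import Path


def update_record(record_path, new_mapping, new_filenames, files_processed):
    """Merge one run's results into the unified record (hypothetical helper)."""
    path = Path(record_path)
    now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

    if path.exists():
        record = json.loads(path.read_text(encoding="utf-8"))
    else:
        # First run: start an empty record with the fields shown above.
        record = {
            "version": 2,
            "created_at": now,
            "last_updated": now,
            "mapping": {},
            "filename_mapping": {},
            "runs": [],
        }

    # Accumulate instead of replacing, so earlier mappings are never lost.
    record["mapping"].update(new_mapping)
    record["filename_mapping"].update(new_filenames)
    record["runs"].append({"timestamp": now, "files_processed": list(files_processed)})
    record["last_updated"] = now

    # ensure_ascii=False keeps Chinese keys readable in the JSON file.
    path.write_text(json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8")
    return record
```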

**Core Principle**: As long as the `mapping` (sanitized value → original value) is complete, documents can be restored by reverse replacement no matter how many times they have been modified, provided the sanitized values themselves still appear in the text unchanged.

---

## Examples

### Example 1: Basic Sanitization

```bash
# 1. Create the config file _sanitize_config.json
# 2. Run sanitization
python sanitize.py sanitize ./my-docs

# Output ("脱敏输出目录" = sanitized output directory):
# [RENAME] 白云物流文档.docx → 黑水物流文档.docx
# [1/1] 白云物流文档.docx [OK]
# 脱敏输出目录: ./my-docs/_sanitized_output
```

### Example 2: Sanitize then Restore

```bash
# Sanitize
python sanitize.py sanitize ./my-docs

# Restore (can be run repeatedly; also works on documents modified after sanitization)
python sanitize.py restore ./my-docs

# Output ("无残留占位符" = no residual placeholders):
# [RENAME] 黑水物流文档.docx → 白云物流文档.docx
# [OK] 无残留占位符
```
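The restore step in Example 2 can be sketched as a reverse replacement over the recorded `mapping`, followed by a check for placeholders that were not restored. This is an illustrative sketch, not the script's code: `restore_text` is a hypothetical name, and the residual check assumes regex placeholders always have the `[RED_<label>_<n>]` shape seen in the record.

```python
import re

# Assumed placeholder shape, based on examples such as [RED_手机号_1].
PLACEHOLDER = re.compile(r"\[RED_[^\[\]]+_\d+\]")


def restore_text(text, mapping):
    """Replace sanitized values with originals and report leftover placeholders."""
    # Longest keys first, so a short key never rewrites part of a longer one.
    for key in sorted(mapping, key=len, reverse=True):
        text = text.replace(key, mapping[key])
    residual = PLACEHOLDER.findall(text)
    return text, residual


restored, residual = restore_text(
    "黑水河 and 黑水, call [RED_手机号_1] or [RED_手机号_2]",
    {"黑水": "白云", "黑水河": "珠江", "[RED_手机号_1]": "13828417396"},
)
print(restored)  # 珠江 and 白云, call 13828417396 or [RED_手机号_2]
print(residual)  # ['[RED_手机号_2]']
```

The made-up key `黑水河` shows why the ordering matters: if the shorter key `黑水` were replaced first, `黑水河` would become `白云河` and its own mapping entry would never match.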

### Example 3: Directory with Legacy Files

```bash
# Auto-convert .doc/.xls files, then sanitize
python sanitize.py sanitize ./my-docs --auto-convert
```

---

## Technical Notes

1. **Run Splitting**: Word splits text across multiple `<w:r>` elements. The script merges all `w:t` text first, applies replacements, then writes the result back, so keywords that span runs are still matched.

2. **Reverse Order**: Restoration replaces keys in descending order of length, so a short key cannot match a substring of a longer key.

3. **Scope**: Only .docx/.xlsx are supported. Legacy .doc/.xls files must be converted first (auto-conversion is available).

4. **Original Safety**: Original files are never modified; all operations happen in the output directories.

---
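The run-splitting note under Technical Notes can be illustrated with plain strings standing in for the text of each run. The sketch assumes one possible write-back strategy (the whole merged result goes into the first run and the others are emptied); the source only says the text is written back, so the real script may distribute it differently. `replace_across_runs` is a hypothetical name.

```python
def replace_across_runs(run_texts, old, new):
    """Merge run texts, replace, and write the result back (assumed strategy)."""
    if not run_texts:
        return []
    merged = "".join(run_texts)
    replaced = merged.replace(old, new)
    if replaced == merged:
        # Nothing matched: leave every run exactly as it was.
        return list(run_texts)
    # Put the full result in the first run and blank the rest, so the
    # paragraph text is correct even when the keyword straddled two runs.
    return [replaced] + [""] * (len(run_texts) - 1)


# "白云" is split across the first two runs, so a per-run search would miss it.
print(replace_across_runs(["白", "云物流", "文档"], "白云", "[黑水]"))
# ['[黑水]物流文档', '', '']
```

With python-docx, the input list could come from `[run.text for run in paragraph.runs]`, and the results could be assigned back to each `run.text`. Note that under this write-back strategy the text takes on the first run's formatting.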

## Error Handling

| Error | Solution |
|-------|----------|
| "未找到配置文件" (config not found) | Create `_sanitize_config.json` in the workspace |
| "python-docx import error" | Run `pip install python-docx` |
| "openpyxl import error" | Run `pip install openpyxl` |
| ".doc 转换失败" (.doc conversion failed) | Ensure Windows and Microsoft Word are installed |
| "残留占位符" (residual placeholders) | Check that the sanitization record is complete |

- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat: `/install document-sanitizer`
- After installation, invoke the skill by name or use `/document-sanitizer`
- Provide required inputs per the skill's parameter spec and get structured output

What is Document Sanitizer?
Batch desensitize docx/xlsx files via keyword and regex rules, with one-click reversible restoration. Replace sensitive terms (company names, personal info,... It is an AI Agent Skill for Claude Code / OpenClaw, with 229 downloads so far.
How do I install Document Sanitizer?
Run "/install document-sanitizer" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.
Is Document Sanitizer free?
Yes, Document Sanitizer is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does Document Sanitizer support?
Document Sanitizer runs anywhere OpenClaw / Claude Code is available. The one exception is `.doc → .docx` conversion, which requires Windows with Microsoft Word.
Who created Document Sanitizer?
It is built and maintained by juanfenglong (@longjf25); the current version is v1.5.1.