longjf25

Document Sanitizer

Author: juanfenglong · GitHub · v1.5.1 · MIT-0
cross-platform ✓ Security check passed
Total downloads: 229 · Favorites: 0 · Current installs: 0 · Versions: 8
Install in OpenClaw
/install document-sanitizer
Description
Batch desensitize docx/xlsx files via keyword and regex rules, with one-click reversible restoration. Replace sensitive terms (company names, personal info,...
Usage Instructions (SKILL.md)

# document-sanitizer

## Skill Description

### Overview

This skill batch-desensitizes Word (.docx) and Excel (.xlsx) files using exact keyword replacement and dynamic regex replacement, with one-click reversible restoration.

### Key Features

| Feature | Description |
|---------|-------------|
| Exact Match | Configure keyword pairs; each keyword is replaced with a `[]`-marked placeholder (e.g. "白云" → `[黑水]`) |
| Regex Dynamic | Matches phone numbers, ID card numbers, email addresses, etc., and auto-generates placeholders such as `[RED_手机号_1]` |
| Filename Sanitization | Enabled by default; output files automatically use the sanitized filename |
| Unified Record | All mappings accumulate in `_sanitize_record.json`, so documents can be restored no matter how they were modified |
| One-Click Restore | Reads the record and replaces in reverse, restoring filenames and content and checking for residual placeholders |
| Legacy Format Conversion | When `.doc`/`.xls` files are detected, prompts to convert them to `.docx`/`.xlsx` automatically |

---

## Prerequisites

### Python Dependencies

```bash
pip install python-docx openpyxl
```

### Optional (Legacy Format Conversion)

```bash
pip install xlrd pywin32
```

- `.doc → .docx` conversion requires Windows + Microsoft Word (it uses Word COM automation)
- `.xls → .xlsx` conversion uses xlrd + openpyxl (cross-platform)

---

## Usage

### 1. Prepare Config File

Create `_sanitize_config.json` in your workspace:

```json
{
  "exact_rules": [
    {"pattern": "白云", "replacement": "黑水"},
    {"pattern": "南方", "replacement": "北风"},
    {"pattern": "广州", "replacement": "镇北"}
  ],
  "regex_rules": [
    {"pattern": "1[3-9]\\d{9}", "label": "手机号"},
    {"pattern": "\\d{6}(?:19|20)\\d{2}(?:0[1-9]|1[0-2])(?:0[1-9]|[12]\\d|3[01])\\d{3}[\\dXx]", "label": "身份证号"},
    {"pattern": "[a-zA-Z0-9._%+\\-]+@[a-zA-Z0-9.\\-]+\\.[a-zA-Z]{2,}", "label": "邮箱"}
  ]
}
```
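
To make the two rule types concrete, here is a minimal sketch of how a config like the one above could be applied to a single string. It is not the code in `sanitize.py`: the function name `sanitize_text`, the `[]` wrapping of exact replacements (taken from the Key Features table), and the per-label numbering scheme are all assumptions made for illustration.

```python
import re


def sanitize_text(text, config, mapping=None):
    """Apply exact rules, then regex rules. Returns (sanitized_text, mapping).

    `mapping` maps placeholder -> original value (hypothetical helper,
    not the skill's real implementation).
    """
    mapping = {} if mapping is None else mapping

    # Exact rules: wrap the configured replacement in [] markers.
    for rule in config.get("exact_rules", []):
        placeholder = f"[{rule['replacement']}]"
        if rule["pattern"] in text:
            text = text.replace(rule["pattern"], placeholder)
            mapping[placeholder] = rule["pattern"]

    # Regex rules: one numbered placeholder per distinct matched value.
    for rule in config.get("regex_rules", []):
        label = rule["label"]

        def to_placeholder(match, label=label):
            value = match.group(0)
            prefix = f"[RED_{label}_"
            for placeholder, original in mapping.items():
                if original == value and placeholder.startswith(prefix):
                    return placeholder  # reuse the placeholder for a repeated value
            count = sum(p.startswith(prefix) for p in mapping)
            placeholder = f"{prefix}{count + 1}]"
            mapping[placeholder] = value
            return placeholder

        text = re.sub(rule["pattern"], to_placeholder, text)

    return text, mapping


config = {
    "exact_rules": [{"pattern": "白云", "replacement": "黑水"}],
    "regex_rules": [{"pattern": r"1[3-9]\d{9}", "label": "手机号"}],
}
sanitized, mapping = sanitize_text("白云物流 联系电话 13828417396", config)
print(sanitized)  # [黑水]物流 联系电话 [RED_手机号_1]
print(mapping)
```

One thing the sketch makes visible: regex rules run in order, so a broad pattern listed first (such as the 11-digit phone pattern) can match inside a longer value (such as an 18-digit ID number). It is worth testing rule order on sample documents.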

### 2. Run Sanitization

```bash
# Basic sanitization (filenames are sanitized by default)
python <skill_dir>/scripts/sanitize.py sanitize <workspace>

# Skip filename sanitization
python <skill_dir>/scripts/sanitize.py sanitize <workspace> --no-rename

# Auto-convert legacy .doc/.xls files
python <skill_dir>/scripts/sanitize.py sanitize <workspace> --auto-convert
```

### 3. Restore Documents

```bash
python <skill_dir>/scripts/sanitize.py restore <workspace>
```

---

## CLI Arguments

| Argument | Description |
|----------|-------------|
| `sanitize <workspace>` | Run sanitization |
| `restore <workspace>` | Restore documents |
| `--no-rename` | Skip filename sanitization |
| `--auto-convert` | Auto-convert legacy .doc/.xls files without prompting for confirmation |

---

## Output Directories

| Path | Description |
|------|-------------|
| `_sanitized_output/` | Sanitized files |
| `_restored_output/` | Restored files |
| `_sanitize_record.json` | Unified sanitization record (mappings accumulate across runs) |
| `_sanitize_config.json` | Sanitization rules config |

---

## Record Structure

```json
{
  "version": 2,
  "created_at": "2026-04-08 16:02:07",
  "last_updated": "2026-04-08 16:02:07",
  "mapping": {
    "黑水": "白云",
    "[RED_手机号_1]": "13828417396"
  },
  "filename_mapping": {
    "黑水物流文档.docx": "白云物流文档.docx"
  },
  "runs": [
    {"timestamp": "...", "files_processed": ["..."]}
  ]
}
```

**Core Principle**: As long as the `mapping` (sanitized value → original value) is complete, documents can be restored by reverse replacement no matter how many times they have been modified.
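
The reverse replacement can be sketched in a few lines. This is an illustration under assumptions, not the script's actual restore code: `restore_text` is a hypothetical name, the record below is a trimmed example with an invented extra key (`黑水物流`), and the residual check only looks for `[RED_..._N]`-style placeholders.

```python
import json
import re


def restore_text(text, mapping):
    """Replace sanitized values with originals, longest key first.

    Returns (restored_text, residual_placeholders). Hypothetical helper.
    """
    for key in sorted(mapping, key=len, reverse=True):
        text = text.replace(key, mapping[key])
    # Anything that still looks like a regex placeholder was not in the mapping.
    residual = re.findall(r"\[RED_[^\[\]]+\]", text)
    return text, residual


# Trimmed, illustrative record; a real one lives in _sanitize_record.json.
record = json.loads("""
{
  "mapping": {
    "黑水": "白云",
    "黑水物流": "白云速运",
    "[RED_手机号_1]": "13828417396"
  }
}
""")

restored, residual = restore_text(
    "黑水物流 致电 [RED_手机号_1] 或 [RED_手机号_2]", record["mapping"]
)
print(restored)  # 白云速运 致电 13828417396 或 [RED_手机号_2]
print(residual)  # ['[RED_手机号_2]']
```

Sorting by length matters here: if `黑水` were replaced before `黑水物流`, the longer key would no longer be found and the text would be restored to the wrong original.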

---

## Examples

### Example 1: Basic Sanitization

```bash
# 1. Create the config file _sanitize_config.json
# 2. Run sanitization
python sanitize.py sanitize ./my-docs

# Output:
# [RENAME] 白云物流文档.docx → 黑水物流文档.docx
# [1/1] 白云物流文档.docx [OK]
# 脱敏输出目录: ./my-docs/_sanitized_output   (sanitized output directory)
```

### Example 2: Sanitize then Restore

```bash
# Sanitize
python sanitize.py sanitize ./my-docs

# Restore (can be run repeatedly; also works on documents modified after sanitization)
python sanitize.py restore ./my-docs

# Output:
# [RENAME] 黑水物流文档.docx → 白云物流文档.docx
# [OK] 无残留占位符   (no residual placeholders)
```

### Example 3: Directory with Legacy Files

```bash
# Auto-convert .doc/.xls files, then sanitize
python sanitize.py sanitize ./my-docs --auto-convert
```

---

## Technical Notes

1. **Run Splitting**: Word splits text across multiple `<w:r>` elements. The script merges all `w:t` text first, applies replacements, then writes the result back, so keywords that span runs are still matched.

2. **Reverse Order**: Restoration replaces keys in descending order of length, so a short key cannot match a substring of a longer key first.

3. **Scope**: Only .docx/.xlsx are supported. Legacy .doc/.xls files must be converted first (auto-conversion is available).

4. **Original Safety**: Original files are never modified; all operations happen in the output directories.
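
The run-splitting issue in note 1 can be shown without any Word library by treating a paragraph as a list of run texts. The sketch below is a simplification under assumptions: `replace_across_runs` is a hypothetical name, and writing the merged result into the first run is one possible write-back strategy, not necessarily the one the script uses.

```python
def replace_across_runs(run_texts, old, new):
    """Merge run texts, replace `old` with `new`, and write the result back.

    The merged text goes into the first run and the remaining runs are
    emptied, so the number of runs stays the same.
    """
    if not run_texts:
        return []
    merged = "".join(run_texts)
    if old not in merged:
        return list(run_texts)  # nothing matched; leave the runs untouched
    replaced = merged.replace(old, new)
    return [replaced] + [""] * (len(run_texts) - 1)


# "白云" is split across two runs, so a per-run search would miss it.
runs = ["联系白", "云物流"]
print(replace_across_runs(runs, "白云", "[黑水]"))  # ['联系[黑水]物流', '']
```

The trade-off of this particular strategy is that character formatting carried by the later runs collapses into the first run's formatting for that paragraph.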

---

## Error Handling

| Error | Solution |
|-------|----------|
| "未找到配置文件" (config not found) | Create `_sanitize_config.json` in the workspace |
| "python-docx import error" | Run: `pip install python-docx` |
| "openpyxl import error" | Run: `pip install openpyxl` |
| ".doc 转换失败" (.doc conversion failed) | Ensure Windows and Microsoft Word are installed |
| "残留占位符" (residual placeholders) | Check that the sanitization record is complete |
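
The two import errors in the table can be caught before running the tool. The following is an optional preflight sketch, not part of the skill: `missing_packages` and `REQUIRED` are hypothetical names, and the only facts relied on are that python-docx is imported as `docx` and openpyxl as `openpyxl`.

```python
import importlib.util

# Import name -> pip package name.
REQUIRED = {"docx": "python-docx", "openpyxl": "openpyxl"}


def missing_packages(required=REQUIRED):
    """Return the pip names of required modules that cannot be found."""
    return [
        pip_name
        for module, pip_name in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Missing packages, run: pip install " + " ".join(missing))
else:
    print("All required packages are installed.")
```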

Security Recommendations
This skill appears to do exactly what it says and runs locally, without network exfiltration or secret requests. Before installing/using: 1) Review and secure the _sanitize_record.json file that the tool writes — it contains the original sensitive values required for restoration, so store/encrypt it or remove it when not needed. 2) If you enable --auto-convert, the script will execute conversion helper scripts from ~/.workbuddy/skills/doc_xls2docx_xlsx/scripts; verify those scripts are legitimate on your machine. 3) Test the tool on non-production/sample documents first to confirm behavior (renaming, output locations). 4) Keep the required Python packages up-to-date and run in a restricted workspace to avoid accidental processing of unrelated files.
Functional Analysis
Type: OpenClaw Skill Name: document-sanitizer Version: 1.5.1 The document-sanitizer skill is a legitimate utility for batch-redacting sensitive information from Word and Excel files. It uses standard libraries (python-docx, openpyxl) to perform regex and keyword replacements and maintains a local mapping file (_sanitize_record.json) to allow for reversible restoration. The script sanitize.py includes logic to call external conversion scripts for legacy formats via subprocess.run, but it does so without shell=True and uses specific paths, showing no signs of malicious intent, data exfiltration, or prompt injection.
Capability Assessment
Purpose & Capability
Name/description (batch desensitize docx/xlsx with reversible restore) matches the provided code and SKILL.md. The only out-of-band dependency is that legacy-format conversion calls scripts under ~/.workbuddy/skills/doc_xls2docx_xlsx/scripts — this is coherent with optional auto-conversion but means the skill expects another local skill to be installed.
Instruction Scope
SKILL.md instructs scanning a workspace/raw folder, producing outputs under _sanitized_output/_restored_output and a unified _sanitize_record.json. The runtime script implements these actions and does not attempt to read unrelated system config or environment variables. It uses subprocess to invoke optional local conversion scripts when converting .doc/.xls files.
Install Mechanism
There is no install spec; this is an instruction+script skill bundled with Python code. No network downloads, no package install automation in the skill itself. Required Python libraries are listed in SKILL.md (python-docx, openpyxl, optional xlrd/pywin32), which is proportionate.
Credentials
The skill requests no environment variables or credentials. It reads/writes files inside the workspace and will read an optional config file in the workspace. It also references the user's home directory (~/.workbuddy) to find optional conversion scripts — this is justified by the optional conversion feature but is worth attention.
Persistence & Privilege
The skill is not always-enabled and does not request elevated privileges. It writes a persistent record file (_sanitize_record.json) into the workspace that accumulates original data → sanitized mappings; this is normal for reversible sanitization but increases the sensitivity of the workspace.
How to Use
  1. Make sure OpenClaw is installed (locally or via Docker)
  2. Enter the install command in the chat box: /install document-sanitizer
  3. Once installed, invoke the Skill by name or trigger it with /document-sanitizer
  4. Provide the required inputs according to the Skill's parameter documentation to get structured output
Version History
v1.5.1
Scan only the raw folder; fix filename restoration in subdirectories
v1.5.0
Keyword sanitization now uses [] markers; improved residual placeholder detection
v1.4.0
Keyword sanitization uses -- markers, e.g. --黑水--
v1.3.0
Bump version to 1.3.0
v1.0.0
Initial release: keyword and regex based document desensitization with one-click reversible restoration for .docx/.xlsx files
v1.2.1
v1.2.1: bilingual display name
v1.2.0
v1.2.0: Clarify restore workflow - default output to _restored_output/, filename restoration, directory structure preservation
v1.1.0
v1.1.0: English path names, filename sanitization (--rename), --target subdirectory support, security hardening (path traversal, ReDoS, size limits)
Metadata
Slug document-sanitizer
Version 1.5.1
License MIT-0
Total installs 0
Current installs 0
Versions 8
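FAQ

What is Document Sanitizer?

Batch desensitize docx/xlsx files via keyword and regex rules, with one-click reversible restoration. Replace sensitive terms (company names, personal info,... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 229 total downloads to date.

How do I install Document Sanitizer?

Run the command "/install document-sanitizer" in the OpenClaw or Claude Code chat box for a one-click install. No extra configuration is needed beyond the Python packages listed in SKILL.md.

Is Document Sanitizer free?

Yes. Document Sanitizer is completely free and released under the MIT-0 license; you can download, install, and use it freely.

Which platforms does Document Sanitizer support?

Document Sanitizer is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed. The optional `.doc → .docx` conversion requires Windows + Microsoft Word.

Who developed Document Sanitizer?

It is developed and maintained by juanfenglong (@longjf25); the current version is v1.5.1.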
