功能描述

EN: Production-grade data cleaning across heterogeneous sources (CSV/Excel/JSON/Parquet/SQL dumps/log files). Profiles schemas, detects encoding/delimiter, n...

使用说明 (SKILL.md)

Multi-Source Data Cleaner · 多源数据清洗

Name: multi-source-data-cleaner-pro
Author: boboy-j

Drop a folder of CSVs, Excels, and JSONs from 5 different teams; get back a single clean table, a deduplication report, and a data-quality scorecard. No manual schema mapping required.

把 5 个部门各种格式的 CSV/Excel/JSON 一起扔进来，自动给你一张干净统一表、去重报告、数据质量评分。无需手工配字段映射。

🎯 When to Use · 何时使用

Trigger keywords (中文): 清洗数据、数据清洗、合并数据、去重、缺失值、字段对齐、schema 合并、数据质量、数据预处理、ETL

Trigger keywords (EN): clean data, data cleaning, deduplicate, missing values, schema reconcile, ETL, data quality, profile dataset

Supported sources:

格式 / Format	说明
CSV / TSV	Auto-detect encoding (UTF-8/GBK/BIG5), delimiter, quote char, header row
Excel (.xlsx/.xls/.xlsm)	Multi-sheet, merged cells, formula values
JSON / JSONL / NDJSON	Nested structures auto-flattened
Parquet / Feather	Native columnar reading
SQL dumps (.sql)	MySQL / PostgreSQL INSERT extraction
Log files	Pattern-detected structured lines

Do NOT use when:

Input is unstructured free text (use NLP extraction skills first)
Input is binary/proprietary format with no parser (Adobe Indesign, custom CAD, etc.)
User wants real-time streaming cleaning (this is batch-oriented)

📋 Cleaning Pipeline · 清洗流程

Step 1: Source profiling · 源剖析

python3 scripts/profile.py --input \x3Cfile-or-dir> --out profile.json

For each source produces:

File format, encoding, line endings
Schema (columns, inferred types, null rates, cardinality)
Sample rows
Quality flags: encoding mismatches, type inconsistencies, suspicious patterns

Step 2: Type inference & normalization · 类型推断与归一

scripts/normalize_types.py standardizes:

Numbers: thousands separators, scientific notation, currency symbols → numeric
Dates: 50+ formats (2024-03-15, 2024/3/15, 15 Mar 2024, 民国113年3月15日, Excel serial) → ISO 8601
Booleans: Y/N/是/否/0/1/true/false/T/F/✓/✗ → boolean
Phone numbers: normalize to E.164
Chinese names: full-width / half-width normalization
IDs: zero-padding, prefix detection

Step 3: Missing value handling · 缺失值处理

Per-column strategy (configurable in templates/missing_strategy.json):

drop_row — drop rows where this column is null
mean|median|mode — statistical imputation (with imputation flag column)
constant:\x3Cvalue> — fill with literal
forward_fill — for time-series
interpolate — linear/spline for numeric series
keep_null — preserve as null (default for unknown)

Critical rule: every imputed value gets a sidecar \x3Ccol>_imputed boolean column so downstream analysis can distinguish original vs. imputed data.

Step 4: Schema reconciliation · Schema 合并

scripts/reconcile_schema.py aligns columns across sources using:

Exact name match
Fuzzy match (Levenshtein + Chinese pinyin)
Type compatibility check
User-supplied mapping override (--mapping mapping.yaml)

Outputs a crosswalk.json documenting every column mapping for audit.

Step 5: Fuzzy deduplication · 模糊去重

scripts/dedup.py uses configurable blocking + record linkage:

Blocking keys to narrow candidates (e.g. first 3 chars of name + phone last 4)
Similarity scoring: Jaro-Winkler for names, token-set for addresses, exact for IDs
Threshold-based merge with conflict resolution rules (newest wins / longest non-null / authoritative source priority)

Reports merge groups for human review before commit.

Step 6: PII handling · 隐私字段处理

Per CLEANER_PII_POLICY:

keep — leave as-is (use only with explicit user authorization)
mask — partial mask (王*三, 138****5678, 4400****1234)
drop — remove column entirely

Auto-detection of common PII: 姓名、身份证号、手机号、邮箱、地址、银行卡号、IP、车牌号。

Step 7: Data quality report · 数据质量报告

python3 scripts/quality_report.py --input cleaned.parquet --out dq_report.md

Six dimensions (per DAMA-DMBOK):

Completeness (完整性)
Accuracy (准确性, sample validation)
Consistency (一致性, cross-column rules)
Timeliness (时效性)
Uniqueness (唯一性, dedup outcome)
Validity (有效性, regex/range checks)

Each scored 0-100 with drill-down detail.

📤 Output Format · 输出格式

output/
├── cleaned.parquet              # main clean dataset (or .csv if requested)
├── crosswalk.json               # source → target schema mapping
├── dedup_groups.json            # merged record groups for review
├── dq_report.md                 # human-readable data quality report
├── dq_report.json               # machine-readable DQ metrics
├── audit/
│   ├── per_source_profile.json
│   ├── imputation_log.csv
│   └── pii_actions.log
└── provenance.csv               # row-level lineage: which source each row came from

⚠️ Safety & Compliance · 安全合规

No silent data loss — every drop/merge/impute action logged in audit/.
Imputation flags mandatory — imputed values marked so they cannot masquerade as originals.
PII default mask — unless user explicitly authorizes keep, PII is masked.
Reversibility — original sources never modified; cleaning is non-destructive.
Dedup human-in-the-loop — fuzzy merges above threshold but below 0.95 confidence flagged for review, not auto-committed.
No external network calls — all processing local; no data leaves the workspace.

不静默丢数据，所有删除/合并/填充均记录到 audit/；填充值带标志列防止假冒原值；隐私字段默认脱敏；原始文件不修改；模糊去重低置信度合并强制人工复核；不向外部上传任何数据。

🚀 Usage Examples · 使用示例

Example 1: Clean a single messy CSV

python3 scripts/run_pipeline.py \
  --input sales_q1.csv \
  --output-dir ./cleaned_q1/ \
  --pii-policy mask

Example 2: Merge 3 source CSVs into unified customer table

python3 scripts/run_pipeline.py \
  --input ./customer_sources/ \
  --output-dir ./unified_customers/ \
  --dedup-keys name,phone \
  --priority-source crm_export.csv

Example 3: Schema reconcile with manual mapping override

python3 scripts/run_pipeline.py \
  --input ./multi_team_data/ \
  --mapping mapping.yaml \
  --output-dir ./unified/

mapping.yaml:

target_schema:
  customer_id: { aliases: [客户ID, cust_id, ClientID, 编号] }
  phone:        { aliases: [手机, 联系电话, Mobile, tel] }
  signup_date:  { aliases: [注册日期, 开户日期, CreatedAt], type: date }

Example 4: Quality scan only (read-only audit)

python3 scripts/profile.py --input ./suspicious_dataset/ --out dq_audit.md --read-only

🧪 Testing · 测试

cd tests && python3 -m pytest -v

Fixtures include:

Encoding test set (UTF-8 BOM, GBK, BIG5, Latin1)
12 date format variants
Schema-drift simulation across 5 source files
Synthetic dedup dataset (10k records with controlled duplication)
PII regression suite

📚 References · 参考资料

DAMA-DMBOK Data Quality dimensions
Fellegi-Sunter probabilistic record linkage
Jaro-Winkler distance for fuzzy match
pandas, pyarrow, recordlinkage library docs

🏷️ Tags · 标签

data ETL data-cleaning dedup schema-reconcile data-quality 数据清洗 多源整合 去重 数据质量

安全使用建议

Review before installing if you will process customer, employee, medical, financial, or other sensitive datasets. Use a protected output directory, avoid `--pii-policy=keep` unless you have explicit authorization, and inspect or delete audit files such as `audit/per_source_profile.json` because they may contain raw sample values.

能力评估

ℹ Purpose & Capability

The stated purpose is batch data cleaning, profiling, deduplication, and reporting, which matches the included Python scripts at a basic level; however, the documentation advertises broader source-format support than the scripts actually implement.

ℹ Instruction Scope

Instructions are scoped to user-provided datasets and local cleaning/reporting workflows, with no prompt override behavior or unrelated commands found; the concern is that privacy effects are not fully described.

✓ Install Mechanism

The skill declares only python3, has no dependency-install step, and the static scan and dependency registry checks were clean.

✓ Credentials

Artifact review found local file reads from the user-selected input path and local writes to the user-selected output directory, with no network calls, credential reads, or external upload behavior.

⚠ Persistence & Privilege

The pipeline persistently writes cleaned data, quality reports, and audit/profile artifacts; `--pii-policy=keep` can preserve raw personal data, and the profile audit is generated from raw rows before masking or dropping, so sample values may retain PII even under safer output policies.

版本历史

v1.0.0

**Initial release of multi-source-data-cleaner: unified, production-grade data cleaning for heterogeneous data sources.** - Supports CSV, Excel, JSON, Parquet, SQL dumps, and log files, auto-detecting schema, encoding, delimiter, and structure. - Cleans and unifies datasets: automatic type normalization, robust handling of missing values, fuzzy deduplication, and cross-source schema reconciliation. - Outputs a main clean dataset plus a detailed data quality report, deduplication review logs, schema mapping, and row-level provenance/audit trails. - Built-in privacy (PII) handling with configurable masking, logging every sensitive data operation for compliance. - Non-destructive processing with strong safety and human-in-the-loop review for low-confidence merges.

元数据

Slug multi-source-data-cleaner-pro

版本 1.0.0

许可证 MIT-0

累计安装 0

当前安装数 0

历史版本数 1

常见问题

multi-source-data-cleaner-pro 是什么？

EN: Production-grade data cleaning across heterogeneous sources (CSV/Excel/JSON/Parquet/SQL dumps/log files). Profiles schemas, detects encoding/delimiter, n... 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件，目前累计下载 62 次。

如何安装 multi-source-data-cleaner-pro？

在 OpenClaw 或 Claude Code 对话框中运行命令「/install multi-source-data-cleaner-pro」即可一键安装，无需额外配置。

multi-source-data-cleaner-pro 是免费的吗？

是的，multi-source-data-cleaner-pro 完全免费，采用 MIT-0 许可证，可自由下载、安装和使用。

multi-source-data-cleaner-pro 支持哪些平台？

multi-source-data-cleaner-pro 跨平台运行，可在任意部署了 OpenClaw / Claude Code 的环境中使用（cross-platform）。

谁开发了 multi-source-data-cleaner-pro？

由 boboy（@boboy-j）开发并维护，当前版本 v1.0.0。

multi-source-data-cleaner-pro