← 返回 Skills 市场
boboy-j

multi-source-data-cleaner-pro

作者 boboy · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ⚠ suspicious
62
总下载
0
收藏
0
当前安装
1
版本数
在 OpenClaw 中安装
/install multi-source-data-cleaner-pro
功能描述
EN: Production-grade data cleaning across heterogeneous sources (CSV/Excel/JSON/Parquet/SQL dumps/log files). Profiles schemas, detects encoding/delimiter, n...
使用说明 (SKILL.md)

Multi-Source Data Cleaner · 多源数据清洗

Drop a folder of CSVs, Excels, and JSONs from 5 different teams; get back a single clean table, a deduplication report, and a data-quality scorecard. No manual schema mapping required.

把 5 个部门各种格式的 CSV/Excel/JSON 一起扔进来,自动给你一张干净统一表、去重报告、数据质量评分。无需手工配字段映射。


🎯 When to Use · 何时使用

Trigger keywords (中文): 清洗数据、数据清洗、合并数据、去重、缺失值、字段对齐、schema 合并、数据质量、数据预处理、ETL

Trigger keywords (EN): clean data, data cleaning, deduplicate, missing values, schema reconcile, ETL, data quality, profile dataset

Supported sources:

格式 / Format 说明
CSV / TSV Auto-detect encoding (UTF-8/GBK/BIG5), delimiter, quote char, header row
Excel (.xlsx/.xls/.xlsm) Multi-sheet, merged cells, formula values
JSON / JSONL / NDJSON Nested structures auto-flattened
Parquet / Feather Native columnar reading
SQL dumps (.sql) MySQL / PostgreSQL INSERT extraction
Log files Pattern-detected structured lines

Do NOT use when:

  • Input is unstructured free text (use NLP extraction skills first)
  • Input is binary/proprietary format with no parser (Adobe Indesign, custom CAD, etc.)
  • User wants real-time streaming cleaning (this is batch-oriented)

📋 Cleaning Pipeline · 清洗流程

Step 1: Source profiling · 源剖析

python3 scripts/profile.py --input \x3Cfile-or-dir> --out profile.json

For each source produces:

  • File format, encoding, line endings
  • Schema (columns, inferred types, null rates, cardinality)
  • Sample rows
  • Quality flags: encoding mismatches, type inconsistencies, suspicious patterns

Step 2: Type inference & normalization · 类型推断与归一

scripts/normalize_types.py standardizes:

  • Numbers: thousands separators, scientific notation, currency symbols → numeric
  • Dates: 50+ formats (2024-03-15, 2024/3/15, 15 Mar 2024, 民国113年3月15日, Excel serial) → ISO 8601
  • Booleans: Y/N/是/否/0/1/true/false/T/F/✓/✗ → boolean
  • Phone numbers: normalize to E.164
  • Chinese names: full-width / half-width normalization
  • IDs: zero-padding, prefix detection

Step 3: Missing value handling · 缺失值处理

Per-column strategy (configurable in templates/missing_strategy.json):

  • drop_row — drop rows where this column is null
  • mean|median|mode — statistical imputation (with imputation flag column)
  • constant:\x3Cvalue> — fill with literal
  • forward_fill — for time-series
  • interpolate — linear/spline for numeric series
  • keep_null — preserve as null (default for unknown)

Critical rule: every imputed value gets a sidecar \x3Ccol>_imputed boolean column so downstream analysis can distinguish original vs. imputed data.

Step 4: Schema reconciliation · Schema 合并

scripts/reconcile_schema.py aligns columns across sources using:

  • Exact name match
  • Fuzzy match (Levenshtein + Chinese pinyin)
  • Type compatibility check
  • User-supplied mapping override (--mapping mapping.yaml)

Outputs a crosswalk.json documenting every column mapping for audit.

Step 5: Fuzzy deduplication · 模糊去重

scripts/dedup.py uses configurable blocking + record linkage:

  • Blocking keys to narrow candidates (e.g. first 3 chars of name + phone last 4)
  • Similarity scoring: Jaro-Winkler for names, token-set for addresses, exact for IDs
  • Threshold-based merge with conflict resolution rules (newest wins / longest non-null / authoritative source priority)

Reports merge groups for human review before commit.

Step 6: PII handling · 隐私字段处理

Per CLEANER_PII_POLICY:

  • keep — leave as-is (use only with explicit user authorization)
  • mask — partial mask (王*三, 138****5678, 4400****1234)
  • drop — remove column entirely

Auto-detection of common PII: 姓名、身份证号、手机号、邮箱、地址、银行卡号、IP、车牌号。

Step 7: Data quality report · 数据质量报告

python3 scripts/quality_report.py --input cleaned.parquet --out dq_report.md

Six dimensions (per DAMA-DMBOK):

  • Completeness (完整性)
  • Accuracy (准确性, sample validation)
  • Consistency (一致性, cross-column rules)
  • Timeliness (时效性)
  • Uniqueness (唯一性, dedup outcome)
  • Validity (有效性, regex/range checks)

Each scored 0-100 with drill-down detail.


📤 Output Format · 输出格式

output/
├── cleaned.parquet              # main clean dataset (or .csv if requested)
├── crosswalk.json               # source → target schema mapping
├── dedup_groups.json            # merged record groups for review
├── dq_report.md                 # human-readable data quality report
├── dq_report.json               # machine-readable DQ metrics
├── audit/
│   ├── per_source_profile.json
│   ├── imputation_log.csv
│   └── pii_actions.log
└── provenance.csv               # row-level lineage: which source each row came from

⚠️ Safety & Compliance · 安全合规

  1. No silent data loss — every drop/merge/impute action logged in audit/.
  2. Imputation flags mandatory — imputed values marked so they cannot masquerade as originals.
  3. PII default mask — unless user explicitly authorizes keep, PII is masked.
  4. Reversibility — original sources never modified; cleaning is non-destructive.
  5. Dedup human-in-the-loop — fuzzy merges above threshold but below 0.95 confidence flagged for review, not auto-committed.
  6. No external network calls — all processing local; no data leaves the workspace.

不静默丢数据,所有删除/合并/填充均记录到 audit/;填充值带标志列防止假冒原值;隐私字段默认脱敏;原始文件不修改;模糊去重低置信度合并强制人工复核;不向外部上传任何数据。


🚀 Usage Examples · 使用示例

Example 1: Clean a single messy CSV

python3 scripts/run_pipeline.py \
  --input sales_q1.csv \
  --output-dir ./cleaned_q1/ \
  --pii-policy mask

Example 2: Merge 3 source CSVs into unified customer table

python3 scripts/run_pipeline.py \
  --input ./customer_sources/ \
  --output-dir ./unified_customers/ \
  --dedup-keys name,phone \
  --priority-source crm_export.csv

Example 3: Schema reconcile with manual mapping override

python3 scripts/run_pipeline.py \
  --input ./multi_team_data/ \
  --mapping mapping.yaml \
  --output-dir ./unified/

mapping.yaml:

target_schema:
  customer_id: { aliases: [客户ID, cust_id, ClientID, 编号] }
  phone:        { aliases: [手机, 联系电话, Mobile, tel] }
  signup_date:  { aliases: [注册日期, 开户日期, CreatedAt], type: date }

Example 4: Quality scan only (read-only audit)

python3 scripts/profile.py --input ./suspicious_dataset/ --out dq_audit.md --read-only

🧪 Testing · 测试

cd tests && python3 -m pytest -v

Fixtures include:

  • Encoding test set (UTF-8 BOM, GBK, BIG5, Latin1)
  • 12 date format variants
  • Schema-drift simulation across 5 source files
  • Synthetic dedup dataset (10k records with controlled duplication)
  • PII regression suite

📚 References · 参考资料

  • DAMA-DMBOK Data Quality dimensions
  • Fellegi-Sunter probabilistic record linkage
  • Jaro-Winkler distance for fuzzy match
  • pandas, pyarrow, recordlinkage library docs

🏷️ Tags · 标签

data ETL data-cleaning dedup schema-reconcile data-quality 数据清洗 多源整合 去重 数据质量

安全使用建议
Review before installing if you will process customer, employee, medical, financial, or other sensitive datasets. Use a protected output directory, avoid `--pii-policy=keep` unless you have explicit authorization, and inspect or delete audit files such as `audit/per_source_profile.json` because they may contain raw sample values.
能力评估
Purpose & Capability
The stated purpose is batch data cleaning, profiling, deduplication, and reporting, which matches the included Python scripts at a basic level; however, the documentation advertises broader source-format support than the scripts actually implement.
Instruction Scope
Instructions are scoped to user-provided datasets and local cleaning/reporting workflows, with no prompt override behavior or unrelated commands found; the concern is that privacy effects are not fully described.
Install Mechanism
The skill declares only python3, has no dependency-install step, and the static scan and dependency registry checks were clean.
Credentials
Artifact review found local file reads from the user-selected input path and local writes to the user-selected output directory, with no network calls, credential reads, or external upload behavior.
Persistence & Privilege
The pipeline persistently writes cleaned data, quality reports, and audit/profile artifacts; `--pii-policy=keep` can preserve raw personal data, and the profile audit is generated from raw rows before masking or dropping, so sample values may retain PII even under safer output policies.
如何使用
  1. 确保已安装 OpenClaw(本地或 Docker 部署)
  2. 在对话框中输入安装命令:/install multi-source-data-cleaner-pro
  3. 安装完成后,直接呼叫该 Skill 的名称或使用 /multi-source-data-cleaner-pro 触发
  4. 根据 Skill 的参数说明提供必要输入,即可获得结构化输出
版本历史
v1.0.0
**Initial release of multi-source-data-cleaner: unified, production-grade data cleaning for heterogeneous data sources.** - Supports CSV, Excel, JSON, Parquet, SQL dumps, and log files, auto-detecting schema, encoding, delimiter, and structure. - Cleans and unifies datasets: automatic type normalization, robust handling of missing values, fuzzy deduplication, and cross-source schema reconciliation. - Outputs a main clean dataset plus a detailed data quality report, deduplication review logs, schema mapping, and row-level provenance/audit trails. - Built-in privacy (PII) handling with configurable masking, logging every sensitive data operation for compliance. - Non-destructive processing with strong safety and human-in-the-loop review for low-confidence merges.
元数据
Slug multi-source-data-cleaner-pro
版本 1.0.0
许可证 MIT-0
累计安装 0
当前安装数 0
历史版本数 1
常见问题

multi-source-data-cleaner-pro 是什么?

EN: Production-grade data cleaning across heterogeneous sources (CSV/Excel/JSON/Parquet/SQL dumps/log files). Profiles schemas, detects encoding/delimiter, n... 它是一个面向 Claude Code / OpenClaw 的 AI Agent Skill 插件,目前累计下载 62 次。

如何安装 multi-source-data-cleaner-pro?

在 OpenClaw 或 Claude Code 对话框中运行命令「/install multi-source-data-cleaner-pro」即可一键安装,无需额外配置。

multi-source-data-cleaner-pro 是免费的吗?

是的,multi-source-data-cleaner-pro 完全免费,采用 MIT-0 许可证,可自由下载、安装和使用。

multi-source-data-cleaner-pro 支持哪些平台?

multi-source-data-cleaner-pro 跨平台运行,可在任意部署了 OpenClaw / Claude Code 的环境中使用(cross-platform)。

谁开发了 multi-source-data-cleaner-pro?

由 boboy(@boboy-j)开发并维护,当前版本 v1.0.0。

💬 留言讨论