tujinsama

data-cleaning-claw

by Ricky · GitHub ↗ · v1.0.0 · MIT-0
cross-platform ✓ Security Clean
116
Downloads
0
Stars
0
Active Installs
1
Versions
Install in OpenClaw
/install data-cleaning-claw
Description
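Automated data-cleaning skill. It handles dirty data, duplicate records, and ad noise, and outputs high-quality clean data. Trigger scenarios: the user mentions "clean the data", "deduplicate", "data cleanup", "dirty data", "data standardization", "unify formats", "denoise", "data preprocessing", "data quality", or "missing-value handling", or uploads a CSV/Excel/JSON file and asks for it to be cleaned. Supports: Excel/CSV...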
README (SKILL.md)

Automated Data-Cleaning Shrimp

A dedicated skill for cleaning dirty data. Core script: scripts/data_clean.py

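Workflow

1. Receive the data

Users can provide data in either of these ways:

  • Upload a file (CSV / Excel / JSON) → save it to the workspace and record its path
  • Paste the data directly → write it to a temporary CSV file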
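
For the pasted-data path, a minimal sketch of writing the text to a temporary CSV file might look like the following. The helper name save_pasted_text and its workspace parameter are hypothetical; the skill does not document how the agent performs this step.

```python
import tempfile
from pathlib import Path
from typing import Optional


def save_pasted_text(text: str, workspace: Optional[str] = None) -> Path:
    """Write pasted tabular text to a temporary .csv file and return its path.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # delete=False keeps the file on disk so the cleaning step can read it later.
    with tempfile.NamedTemporaryFile(
        mode="w",
        suffix=".csv",
        dir=workspace,
        delete=False,
        encoding="utf-8",
        newline="",
    ) as handle:
        # Normalize surrounding whitespace and end with a single newline.
        handle.write(text.strip() + "\n")
        return Path(handle.name)
```

The returned path can then be passed as the input file in step 3.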

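2. Confirm the cleaning requirements

If the user has not specified them, ask the following (all in one message is fine):

  • Which cleaning operations are needed (deduplication / missing values / format standardization / HTML noise removal / data validation)?
  • Are there key fields for deduplication (for example, phone number or email)?
  • Which output format (CSV/Excel/JSON)?

If the user says "clean everything" or "just clean this for me", run all rules by default: strip-html,deduplicate,fill-missing,standardize,validate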
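
The default-to-all behaviour can be sketched as a small helper that turns the user's answer into a value for --rules. The function resolve_rules is hypothetical, and emitting rules in the documented order is an assumption; the README does not say whether order matters to the script.

```python
from typing import Iterable, Optional

# Rule names exactly as listed in this README.
ALL_RULES = ["strip-html", "deduplicate", "fill-missing", "standardize", "validate"]


def resolve_rules(requested: Optional[Iterable[str]] = None) -> str:
    """Return a comma-separated --rules value; no request means every rule."""
    requested = list(requested or [])
    if not requested:
        return ",".join(ALL_RULES)
    unknown = [rule for rule in requested if rule not in ALL_RULES]
    if unknown:
        raise ValueError(f"unknown rules: {unknown}")
    # Assumption: keep the documented order regardless of how the user listed them.
    return ",".join(rule for rule in ALL_RULES if rule in requested)
```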

3. Run the cleaning

Run the script with exec:

python3 ~/.openclaw/skills/data-cleaning-claw/scripts/data_clean.py \
  --input "<input file path>" \
  --output "<output file path>" \
  --rules "strip-html,deduplicate,fill-missing,standardize,validate" \
  --key-fields "<optional: key fields for deduplication, comma-separated>"

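Available rules (--rules parameter, comma-separated):

  • strip-html: remove HTML tags and ad noise
  • deduplicate: remove duplicates (all fields by default; --key-fields specifies the key fields)
  • fill-missing: fill missing values (numeric → median, text → "未知", i.e. "unknown")
  • standardize: standardize formats (date, amount, and phone columns are detected automatically)
  • validate: validate the data; abnormal rows are flagged in a _数据质量标记 (data-quality flag) column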
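
To make the deduplicate and fill-missing rules concrete, here is a standard-library sketch of the behaviour described above. It is not the code in scripts/data_clean.py (which uses pandas and may differ in detail); the function names, the keep-first policy, and the treatment of empty strings as missing text are assumptions.

```python
from statistics import median


def deduplicate(rows, key_fields=None):
    """Keep the first row for each key; with no key_fields, compare all fields."""
    seen = set()
    kept = []
    for row in rows:
        fields = key_fields or sorted(row)
        key = tuple((field, row.get(field)) for field in fields)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


def fill_missing(rows, numeric_fields, text_fields, text_default="未知"):
    """Fill numeric gaps with the column median and text gaps with a placeholder."""
    filled = [dict(row) for row in rows]  # copy so the input is left untouched
    for field in numeric_fields:
        present = [r[field] for r in filled if r.get(field) is not None]
        if not present:
            continue  # nothing to compute a median from
        mid = median(present)
        for r in filled:
            if r.get(field) is None:
                r[field] = mid
    for field in text_fields:
        for r in filled:
            if r.get(field) in (None, ""):
                r[field] = text_default
    return filled
```

For example, deduplicating on a phone field keeps only the first row per phone number, and a missing numeric value between 10 and 30 is filled with their median.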

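Optional parameters for forcing column types:

  • --date-fields: force the listed columns to be treated as dates (comma-separated column names)
  • --amount-fields: force the listed columns to be treated as amounts
  • --phone-fields: force the listed columns to be treated as phone numbers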
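
The documented flags can be assembled into an argument list programmatically. The sketch below only uses the flags named in this README; build_command itself is a hypothetical helper, and it assumes the skill is installed at the path shown in the command above.

```python
from pathlib import Path

# Install location taken from the command shown in step 3.
SCRIPT = Path.home() / ".openclaw/skills/data-cleaning-claw/scripts/data_clean.py"


def build_command(input_path, output_path, rules, key_fields=None,
                  date_fields=None, amount_fields=None, phone_fields=None):
    """Return an argv list for data_clean.py; optional flags are added only if set."""
    cmd = [
        "python3", str(SCRIPT),
        "--input", str(input_path),
        "--output", str(output_path),
        "--rules", ",".join(rules),
    ]
    optional = {
        "--key-fields": key_fields,
        "--date-fields": date_fields,
        "--amount-fields": amount_fields,
        "--phone-fields": phone_fields,
    }
    for flag, names in optional.items():
        if names:
            cmd += [flag, ",".join(names)]
    return cmd
```

Passing the list to subprocess.run(cmd, check=True) avoids shell quoting problems with paths or column names that contain spaces.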

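4. Output the results

The script automatically generates two files:

  • <output>: the cleaned data file
  • <output>.report.json: the cleaning report (number of rows removed, per-column processing details)

Show the user a summary of the cleaning report and send them the cleaned file.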
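
A minimal sketch of locating and summarizing the report follows. The report's JSON schema is not documented here, so the code only lists whatever top-level entries it finds; both function names are hypothetical.

```python
import json
from pathlib import Path


def report_path_for(output_path):
    """The report sits next to the cleaned file as <output>.report.json."""
    return Path(str(output_path) + ".report.json")


def summarize_report(output_path):
    """Return one 'key: value' line per top-level entry in the report."""
    text = report_path_for(output_path).read_text(encoding="utf-8")
    report = json.loads(text)
    if not isinstance(report, dict):
        # Schema is an unknown here; fall back to the raw JSON.
        return [json.dumps(report, ensure_ascii=False)]
    return [f"{key}: {value}" for key, value in report.items()]
```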

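References

Read these when you need the detailed rules:

  • references/cleaning-rules.md: detailed rules for deduplication, missing values, and format standardization
  • references/noise-patterns.md: recognition templates for HTML noise, ad copy, and invalid characters
  • references/data-types.md: recognition regexes for dates, amounts, phone numbers, and emails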

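Notes

  • Before cleaning, remind the user to back up the original data (if the data is important)
  • The default policy is to flag abnormal values (by adding a _数据质量标记 column) rather than delete them, so they remain available for manual review
  • Script dependencies: pandas numpy openpyxl beautifulsoup4. If any are missing, first run pip install pandas numpy openpyxl beautifulsoup4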
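
Before running the script, the dependencies can be checked without importing them. This is a sketch with a hypothetical helper name; the one non-obvious detail is that the beautifulsoup4 package is imported as bs4.

```python
from importlib.util import find_spec

# pip package name -> importable module name
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "openpyxl": "openpyxl",
    "beautifulsoup4": "bs4",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose modules cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if find_spec(module) is None
    ]
```

If missing_packages() returns a non-empty list, those names can be passed to pip install, ideally inside a virtual environment rather than a shared system interpreter.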
Usage Guidance
This skill appears to do exactly what it says: local data cleaning via a bundled Python script. Before using it, (1) back up original data; (2) avoid uploading highly sensitive or regulated PII (full ID numbers, raw bank card numbers, health records) unless you inspect and trust the environment; (3) run the script in a controlled environment where you install the listed Python packages rather than auto-installing into a shared system interpreter; (4) supply output paths inside a workspace to avoid accidental overwriting of system files; and (5) review the included script yourself if you have concerns (it is short and performs only local file I/O and transformation). If you need the agent to operate autonomously on files with sensitive data, consider additional safeguards or manual approval steps.
Capability Analysis
Type: OpenClaw Skill Name: data-cleaning-claw Version: 1.0.0 The skill bundle is a legitimate tool for automated data cleaning, including deduplication, format standardization, and HTML noise removal. The core logic in 'scripts/data_clean.py' uses standard data processing libraries (pandas, numpy, BeautifulSoup) and contains no indicators of data exfiltration, malicious execution, or unauthorized access. The instructions in 'SKILL.md' are well-defined and align strictly with the stated purpose of processing user-provided datasets.
Capability Assessment
Purpose & Capability
The name/description (data cleaning: dedupe, fill-missing, standardize, strip HTML, validate) match the included script (scripts/data_clean.py) and reference documents. No unrelated binaries, environment variables, or external services are required.
Instruction Scope
Runtime instructions tell the agent to run the included Python script on files provided by the user and to save outputs and a .report.json; the script reads/writes local files and performs only data-cleaning operations. Important privacy note: the skill is designed to process personal data (phones, IDs, emails, bankcards) and applies masking rules, but it still reads and writes original data and report files — users should avoid uploading highly sensitive or regulated data without review. There is no network exfiltration or external endpoint referenced in the instructions or code.
Install Mechanism
No install spec is provided (instruction-only plus a local script file). The SKILL.md and script list pip dependencies (pandas, numpy, openpyxl, beautifulsoup4); installing these via pip is standard but will modify the Python environment if performed.
Credentials
The skill requests no environment variables, credentials, or config paths. The code only accesses input/output file paths provided at runtime (user-supplied) and local filesystem. The lack of credential requests is proportional to the stated purpose.
Persistence & Privilege
always is false and the skill does not request elevated or permanent agent/system privileges. It does not modify other skills or system-wide configuration. It runs as a local script when invoked.
How to Use
  1. Make sure OpenClaw is installed (local or Docker)
  2. Run the install command in chat: /install data-cleaning-claw
  3. After installation, invoke the skill by name or use /data-cleaning-claw
  4. Provide required inputs per the skill's parameter spec and get structured output
Version History
v1.0.0
Initial publish
Metadata
Slug data-cleaning-claw
Version 1.0.0
License MIT-0
All-time Installs 0
Active Installs 0
Total Versions 1
Frequently Asked Questions

What is data-cleaning-claw?

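An automated data-cleaning skill. It handles dirty data, duplicate records, and ad noise, and outputs high-quality clean data. Trigger scenarios: the user mentions "clean the data", "deduplicate", "data cleanup", "dirty data", "data standardization", "unify formats", "denoise", "data preprocessing", "data quality", or "missing-value handling", or uploads a CSV/Excel/JSON file and asks for it to be cleaned. Supports: Excel/CSV... It is an AI Agent Skill for Claude Code / OpenClaw, with 116 downloads so far.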

How do I install data-cleaning-claw?

Run "/install data-cleaning-claw" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.

Is data-cleaning-claw free?

Yes, data-cleaning-claw is completely free, licensed under MIT-0. You can download, install and use it at no cost.

Which platforms does data-cleaning-claw support?

data-cleaning-claw is cross-platform and runs anywhere OpenClaw / Claude Code is available.

Who created data-cleaning-claw?

It is built and maintained by Ricky (@tujinsama); the current version is v1.0.0.
