
Clean CSV Toolkit

Author: gopendrasharma89-tech · GitHub · v0.1.0 · MIT-0
cross-platform ✓ Security check passed
Total downloads: 74 · Favorites: 1 · Current installs: 0 · Versions: 1
Install in OpenClaw
/install clean-csv-toolkit
Description
Local CSV / TSV / JSONL inspection and cleanup toolkit. Profile a tabular file (row count, auto-detected column types, nulls, distincts, samples), validate i...
Usage instructions (SKILL.md)

clean-csv-toolkit

v0.1.0

A small honest toolkit for the work agents end up doing constantly: read a CSV someone sent you, work out what's in it, clean it up, and forward only the safe rows downstream. Built on Python 3 standard library only. No pandas, no numpy, no pip installs, no remote calls.

What this skill does

  • scripts/inspect.py — profile a .csv / .tsv / .jsonl file: row count, auto-detected column types (int, float, bool, date, datetime, string, empty), null counts per column, distinct value counts (capped), three sample values per column, file size, and detected encoding.
  • scripts/validate.py — check the file against a small JSON schema (required columns, per-column type, min/max, enum, regex, unique). Exits 0/1 so it slots into CI.
  • scripts/dedupe.py — remove duplicate rows by full-row match or by key columns. Optional --keep first|last, --case-insensitive, --trim, and a JSONL report of every removed row.
  • scripts/diff.py — compare two files by key column(s) and classify every row as added / removed / changed / unchanged, with a per-column before/after diff for changed rows.
  • scripts/convert.py — convert between CSV, TSV, JSON Lines, JSON array, and GitHub-flavored Markdown table.
  • scripts/check_deps.sh — verify python3 is available.

What this skill does not do

  • It does not call any LLM, web service, or remote API.
  • It does not load a full dataframe into memory just to do simple structural work; the helpers stream rows where possible.
  • It does not write outside the input/output paths the caller provides.
  • It does not do statistical analysis (mean, percentile, correlation). For that, use a dataframe library.
  • It does not parse Excel files (.xls / .xlsx). Export to CSV first.

Required dependencies

bash scripts/check_deps.sh

Only python3 is required. The skill uses csv, json, re, pathlib, argparse, datetime, collections — all stdlib.

Workflows

1. Profile an unknown CSV

python3 scripts/inspect.py customers.csv

Output:

file:      /path/customers.csv
size:      284 B (284 bytes)
encoding:  utf-8
kind:      csv
rows:      5
columns:   6

  #  name                          type           nulls   null%    distinct  sample
----------------------------------------------------------------------------------------------------
  1  id                            int                0    0.00           5  '1', '2', '3'
  2  email                         string             0    0.00           5  '[email protected]', ...
  3  name                          string             0    0.00           5  'Alice', 'Bob', 'Carol'
  4  amount                        float              1   20.00           4  '42.50', '100.00', '7.25'
  5  status                        string             0    0.00           3  'approved', 'pending', ...
  6  signup_date                   date               0    0.00           5  '2025-01-15', ...

Pass --json for machine-readable output that pipes into other tools.

The script auto-detects the dialect (CSV vs TSV vs JSON Lines) and a sensible encoding (utf-8, utf-8-sig, cp1252, latin-1). Type inference takes up to 1000 non-empty values per column and picks the most specific type that fits all of them.
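The inference rule described above ("the most specific type that fits all sampled values") can be sketched roughly as follows. This is an illustrative approximation, not the code in scripts/inspect.py: the regex shapes, the accepted bool spellings, the ISO-style date formats, and the name infer_type are all assumptions.

```python
import re
from datetime import datetime
from itertools import islice

# Hypothetical shape patterns; the real script may accept other spellings.
_INT = re.compile(r"[+-]?\d+")
_FLOAT = re.compile(r"[+-]?(?:\d+\.\d*|\.\d+|\d+)(?:[eE][+-]?\d+)?")
_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
_DATETIME = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}")


def _parses(value, shape, parser):
    """True if value has the expected shape and the parser accepts it."""
    if not shape.fullmatch(value):
        return False
    try:
        parser(value)
        return True
    except ValueError:
        return False


# Ordered from most to least specific; "string" is the fallback.
CHECKS = [
    ("int", lambda v: _INT.fullmatch(v) is not None),
    ("float", lambda v: _FLOAT.fullmatch(v) is not None),
    ("bool", lambda v: v.lower() in {"true", "false"}),
    ("date", lambda v: _parses(v, _DATE, lambda s: datetime.strptime(s, "%Y-%m-%d"))),
    ("datetime", lambda v: _parses(v, _DATETIME, datetime.fromisoformat)),
]


def infer_type(values, sample_limit=1000):
    """Pick the first type whose check passes for every sampled non-empty value."""
    non_empty = (v.strip() for v in values if v and v.strip())
    sample = list(islice(non_empty, sample_limit))
    if not sample:
        return "empty"
    for name, check in CHECKS:
        if all(check(v) for v in sample):
            return name
    return "string"
```

Because the checks are shape-based, a value such as "1,234.56" matches none of the numeric patterns and the column falls through to string, which is consistent with the limitation noted later in this document.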

2. Validate against a schema

Write a schema.json:

{
  "required_columns": ["id", "email", "amount", "status"],
  "columns": {
    "id":     {"type": "int", "required": true, "unique": true, "min": 1},
    "email":  {"type": "string", "required": true, "regex": ".+@.+\\..+"},
    "amount": {"type": "float", "min": 0, "max": 100000},
    "status": {"type": "string", "enum": ["pending", "approved", "rejected"]},
    "signup_date": {"type": "date"}
  }
}

Then:

python3 scripts/validate.py customers.csv --schema schema.json

A clean file exits 0 with verdict: pass. A bad file exits 1 with a detailed error table:

   row  column                  kind                    detail
------------------------------------------------------------------------------------------------
     2  email                   regex_mismatch          value did not match regex | value='not-an-email'
     2  amount                  bad_type                value does not match type 'float' | value='abc'
     3  amount                  below_min               value -50.0 < min 0 | value='-50.00'
     3  status                  not_in_enum             value not in allowed set | value='unknown_status'
     4  id                      duplicate_unique        value already seen earlier in this column | value='1'

Pass --json for a structured report and --max-errors N to cap collection on huge files.
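To make the per-column rules concrete, here is a minimal sketch of how such a schema could be applied row by row. It is not the logic of scripts/validate.py: the function name validate_rows, the 1-based data-row numbering, the use of a full-string regex match, and the error kinds missing_required and above_max are assumptions (the other kinds mirror the table above). Only int, float, and string are modelled.

```python
import csv
import io
import re


def _coerce(value, type_name):
    """Parse value as type_name, raising ValueError if it does not fit."""
    if type_name == "int":
        return int(value)
    if type_name == "float":
        return float(value)
    return value  # "string", plus any type this sketch does not model


def validate_rows(rows, schema):
    """Return a list of (row_number, column, kind) tuples."""
    rules = schema["columns"]
    seen = {col: set() for col, rule in rules.items() if rule.get("unique")}
    errors = []
    for row_no, row in enumerate(rows, start=1):
        for col, rule in rules.items():
            value = (row.get(col) or "").strip()
            if value == "":
                if rule.get("required"):
                    errors.append((row_no, col, "missing_required"))
                continue
            try:
                parsed = _coerce(value, rule.get("type", "string"))
            except ValueError:
                errors.append((row_no, col, "bad_type"))
                continue
            if isinstance(parsed, (int, float)):
                if "min" in rule and parsed < rule["min"]:
                    errors.append((row_no, col, "below_min"))
                if "max" in rule and parsed > rule["max"]:
                    errors.append((row_no, col, "above_max"))
            if "enum" in rule and value not in rule["enum"]:
                errors.append((row_no, col, "not_in_enum"))
            if "regex" in rule and not re.fullmatch(rule["regex"], value):
                errors.append((row_no, col, "regex_mismatch"))
            if col in seen:
                if value in seen[col]:
                    errors.append((row_no, col, "duplicate_unique"))
                seen[col].add(value)
    return errors


# Made-up sample data for illustration only.
SAMPLE_CSV = """id,email,amount,status
1,a@example.com,42.50,approved
2,not-an-email,abc,pending
3,c@example.com,-50.00,unknown_status
1,d@example.com,7.25,pending
"""

SCHEMA = {
    "columns": {
        "id": {"type": "int", "required": True, "unique": True, "min": 1},
        "email": {"type": "string", "required": True, "regex": r".+@.+\..+"},
        "amount": {"type": "float", "min": 0, "max": 100000},
        "status": {"type": "string", "enum": ["pending", "approved", "rejected"]},
    }
}

ERRORS = validate_rows(csv.DictReader(io.StringIO(SAMPLE_CSV)), SCHEMA)
```

A caller that wants the documented exit-code behaviour could finish with something like sys.exit(1 if ERRORS else 0).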

3. Remove duplicates

By full-row match (any two rows identical in every column):

python3 scripts/dedupe.py messy.csv clean.csv

By a key column (only one canonical row per id):

python3 scripts/dedupe.py messy.csv clean.csv --key id \
  --removed-report removed.jsonl

--keep first (default) keeps the earlier-occurring row; --keep last keeps the later one, which is useful when later rows are corrections. --case-insensitive and --trim normalise key values before comparison, so keys that differ only in letter case or surrounding whitespace collapse to one row.

The --removed-report writes one JSON object per removed row, with the original 1-based row index, the key tuple that was duplicated, and the full row, so the dedup decision is auditable.
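The keep-first / keep-last behaviour and the audit report can be pictured with a short sketch. The function name dedupe, its parameters, and the report field names (row, key, data) are hypothetical; the actual script may structure its report differently, and applying the normalisation to full-row matches is an assumption.

```python
import json


def dedupe(rows, key=None, keep="first", trim=False, case_insensitive=False):
    """Return (kept_rows, removed_report) for a list of dict rows."""

    def norm(value):
        value = value or ""
        if trim:
            value = value.strip()
        if case_insensitive:
            value = value.casefold()
        return value

    winners = {}  # normalised key tuple -> (1-based row index, row)
    removed = []
    for idx, row in enumerate(rows, start=1):
        columns = key or list(row)  # no key given: compare the full row
        k = tuple(norm(row.get(c)) for c in columns)
        if k not in winners:
            winners[k] = (idx, row)
        elif keep == "first":
            removed.append({"row": idx, "key": list(k), "data": row})
        else:  # keep == "last": the earlier winner is the one removed
            old_idx, old_row = winners[k]
            removed.append({"row": old_idx, "key": list(k), "data": old_row})
            winners[k] = (idx, row)
    # Emit survivors in their original file order.
    kept = [row for _, row in sorted(winners.values(), key=lambda pair: pair[0])]
    return kept, removed


def report_lines(removed):
    """Serialise the report as JSON Lines, one object per removed row."""
    return "\n".join(json.dumps(entry) for entry in removed)
```

Only the comparison key is normalised here; the rows that are kept retain their original, untrimmed values.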

4. Diff two files

python3 scripts/diff.py customers_old.csv customers_new.csv --key id

Output:

added:      1
removed:    1
changed:    1

--- ADDED (1) ---
  + 6
--- REMOVED (1) ---
  - 4
--- CHANGED (1) ---
  ~ 2
      amount: '100.00' -> '150.00'
      status: 'pending' -> 'approved'

Multi-column keys are supported: --key customer_id,date. The exit code is 0 if the files are identical and 1 if they differ, so this also works as a CI guard ("fail the build if the snapshot file changed").
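A key-based diff of this kind boils down to indexing both files by the key tuple and comparing the two indexes. The sketch below illustrates the idea under that assumption; diff_rows and exit_code are hypothetical names, not functions exported by scripts/diff.py, and the sample rows are made up to mirror the output shown above.

```python
def diff_rows(old_rows, new_rows, key):
    """Classify rows as added / removed / changed / unchanged by key tuple."""

    def index(rows):
        return {tuple(row[k] for k in key): row for row in rows}

    old, new = index(old_rows), index(new_rows)
    result = {
        "added": [k for k in new if k not in old],
        "removed": [k for k in old if k not in new],
        "changed": {},
        "unchanged": [],
    }
    for k, before in old.items():
        after = new.get(k)
        if after is None:
            continue  # already listed under "removed"
        columns = dict.fromkeys([*before, *after])  # ordered union of columns
        delta = {
            c: (before.get(c), after.get(c))
            for c in columns
            if before.get(c) != after.get(c)
        }
        if delta:
            result["changed"][k] = delta
        else:
            result["unchanged"].append(k)
    return result


def exit_code(result):
    """0 when nothing differs, 1 otherwise."""
    return 1 if (result["added"] or result["removed"] or result["changed"]) else 0


OLD = [
    {"id": "1", "amount": "42.50", "status": "approved"},
    {"id": "2", "amount": "100.00", "status": "pending"},
    {"id": "4", "amount": "7.25", "status": "pending"},
]
NEW = [
    {"id": "1", "amount": "42.50", "status": "approved"},
    {"id": "2", "amount": "150.00", "status": "approved"},
    {"id": "6", "amount": "9.99", "status": "pending"},
]
RESULT = diff_rows(OLD, NEW, key=["id"])
```

Both files are held in memory as dicts in this sketch, which matches the memory trade-off the Performance section describes for diff.py.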

5. Convert between formats

python3 scripts/convert.py data.csv data.jsonl       # row -> JSON Lines
python3 scripts/convert.py data.jsonl data.csv       # back
python3 scripts/convert.py data.csv data.json --pretty
python3 scripts/convert.py data.csv data.md          # GitHub-flavored table
python3 scripts/convert.py data.tsv data.csv         # delimiter change

Output format is picked from the extension. Allowed extensions: .csv, .tsv, .jsonl, .json, .md. The Markdown writer escapes | and \ in cell values so the table stays well-formed.
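The escaping rule for the Markdown writer is small enough to show in full. This is a sketch of one way to do it, with md_cell and to_markdown as hypothetical helper names; it also folds newlines into spaces, as described under Known limitations.

```python
def md_cell(value):
    """Escape a cell so it cannot break a GitHub-flavored Markdown table row."""
    text = "" if value is None else str(value)
    # Escape backslashes first so the backslash added for "|" is not doubled.
    text = text.replace("\\", "\\\\").replace("|", "\\|")
    # Flatten multi-line cells onto a single line.
    return " ".join(text.splitlines())


def to_markdown(columns, rows):
    """Render dict rows as a Markdown table with a header and separator row."""
    lines = [
        "| " + " | ".join(md_cell(c) for c in columns) + " |",
        "| " + " | ".join("---" for _ in columns) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(md_cell(row.get(c)) for c in columns) + " |")
    return "\n".join(lines)
```

The order of the two replace calls matters: escaping the pipe before the backslash would turn "|" into "\\|", which renders as a literal backslash followed by a column break.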

Exit codes

Code  Meaning
0     success / validation pass / files identical
1     validation fail / files differ / no rows in input
2     bad arguments / unsafe path / missing input / unsupported extension / schema malformed

This 0/1/2 split is consistent across all five scripts, so they slot into shell pipelines cleanly:

python3 scripts/validate.py incoming.csv --schema schema.json \
  && python3 scripts/dedupe.py incoming.csv clean.csv --key id \
  && python3 scripts/inspect.py clean.csv

Safety properties

  • Pure Python 3 standard library. No third-party dependencies.
  • No subprocess calls. No shell invocation.
  • All file paths are validated against a strict allowlist regex that rejects shell metacharacters (;, |, &, >, <, $, `, backslash-newline, etc.).
  • Scripts only read the input paths the caller provides and write to the output paths the caller provides. No temp files outside the system's tempdir.
  • All inputs and outputs use UTF-8 by default; CSV reads auto-fall-back through utf-8-sig, cp1252, and latin-1 when the file's encoding is non-UTF-8.
  • Deterministic: the same input produces the same output every time.
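An allowlist check of the kind mentioned in the list above might look like the sketch below. The exact character set used by the skill is not shown in this document, so the pattern here is purely an assumption, as are the names is_safe_path and require_safe_path; the exit status 2 follows the exit-code table.

```python
import re
import sys

# Hypothetical allowlist: letters, digits, and a handful of path characters.
# Anything else (";", "|", "&", ">", "<", "$", "`", newlines, ...) is rejected.
_SAFE_PATH = re.compile(r"[A-Za-z0-9_./\\: -]+")


def is_safe_path(path):
    """True only if every character of the path is on the allowlist."""
    return bool(_SAFE_PATH.fullmatch(path))


def require_safe_path(path):
    """Exit with status 2 (bad arguments / unsafe path) when the check fails."""
    if not is_safe_path(path):
        print(f"unsafe path: {path!r}", file=sys.stderr)
        raise SystemExit(2)
    return path
```

An allowlist is used rather than a blocklist so that any character not explicitly permitted is rejected by default.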

Performance

  • inspect.py profiles 10,000 rows in well under one second on a single core (single-pass streaming read).
  • Most scripts stream rows and do not load the entire file into memory. The exceptions are dedupe.py and diff.py, which build an in-memory dict keyed by row identity; this is fine for hundreds of thousands of rows on a typical laptop.
  • No background threads, no process pool, no caching.

Known limitations

  • Type inference uses regex-shape matching, not locale-aware parsing. "1,234.56" is detected as string, not float. Re-export with a different number format if you need different inference.
  • The Markdown writer flattens multi-line cells to single lines (newlines become spaces).
  • JSON Lines input must have one JSON object per line. Multi-line JSON arrays are not supported; use the regular CSV/JSONL pipeline.

License

MIT. See LICENSE.

Safe usage advice
This skill appears safe for local CSV-style cleanup. Before installing, note that it can read whichever local file path is supplied and can write output/report files; avoid using it on highly sensitive data unless you are comfortable with sample values or row contents appearing in the agent output.
Functional analysis
Type: OpenClaw Skill · Name: clean-csv-toolkit · Version: 0.1.0
The clean-csv-toolkit is a well-implemented set of local data processing utilities using only the Python standard library. It includes proactive security measures, such as a path validation regex in scripts/_common.py designed to prevent shell metacharacter injection. The code contains no network calls, no subprocess execution, and no evidence of data exfiltration or prompt injection, strictly adhering to its stated purpose of offline CSV/JSONL manipulation.
Capability tags
crypto, can-make-purchases
Capability assessment
The stated purpose is local CSV/TSV/JSONL inspection, validation, deduplication, diffing, and conversion, and the included scripts implement those tasks with Python standard-library code. The registry capability signals for crypto/purchases are not supported by the reviewed source.
Instruction Scope
The instructions are clear and user-directed, but profiling and diff outputs can include actual cell values from the selected files.
Install Mechanism
There is no install spec or dependency download; the only setup helper checks that python3 is present.
Credentials
The scripts operate on local paths supplied as arguments and may create parent directories and write output or report files. This is proportionate to the toolkit purpose but users should choose paths carefully.
Persistence & Privilege
No credentials, environment variables, background services, autonomous persistence, or remote APIs are present; only explicit local output files persist.
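How to use
  1. Make sure OpenClaw is installed (locally or via Docker).
  2. Enter the install command in the chat box: /install clean-csv-toolkit
  3. Once installed, invoke the Skill by name or trigger it with /clean-csv-toolkit.
  4. Provide the inputs described in the Skill's parameter documentation to get structured output.
Version history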
v0.1.0
v0.1.0 initial release. Local CSV/TSV/JSONL toolkit. Five scripts: inspect.py profiles a tabular file with auto-detected column types (int/float/bool/date/datetime/string), null counts, distincts, samples, encoding, and dialect. validate.py checks a file against a small JSON schema (required columns, per-column type, min/max, enum, regex, unique). dedupe.py removes duplicates by full-row or key columns with optional --keep first/last, --case-insensitive, --trim, and JSONL removed-rows report. diff.py compares two files by key column(s) and classifies rows as added/removed/changed/unchanged with per-column before/after for changed rows. convert.py converts between csv/tsv/jsonl/json/md. Pure Python 3 standard library, no pandas, no numpy, no subprocess, no remote calls. Consistent 0/1/2 exit codes across all scripts. 26 end-to-end tests covering CSV/TSV/JSONL inputs, schema validation, full-row and keyed dedup, file diffing, round-trip format conversion, 10k-row performance, error paths.
Metadata
Slug: clean-csv-toolkit
Version: 0.1.0
License: MIT-0
Total installs: 0
Current installs: 0
Versions: 1
FAQ

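What is Clean CSV Toolkit?

Local CSV / TSV / JSONL inspection and cleanup toolkit. Profile a tabular file (row count, auto-detected column types, nulls, distincts, samples), validate i... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 74 total downloads to date.

How do I install Clean CSV Toolkit?

Run the command "/install clean-csv-toolkit" in the OpenClaw or Claude Code chat box for a one-step install; no additional configuration is needed.

Is Clean CSV Toolkit free?

Yes. Clean CSV Toolkit is completely free and is released under the MIT-0 license, so it can be downloaded, installed, and used freely.

Which platforms does Clean CSV Toolkit support?

Clean CSV Toolkit is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who developed Clean CSV Toolkit?

It is developed and maintained by gopendrasharma89-tech (@gopendrasharma89-tech); the current version is v0.1.0.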
