功能描述

Local CSV / TSV / JSONL inspection and cleanup toolkit. Profile a tabular file (row count, auto-detected column types, nulls, distincts, samples), validate i...

使用说明 (SKILL.md)

clean-csv-toolkit

Name: Clean CSV Toolkit
Author: gopendrasharma89-tech

v0.1.0

A small honest toolkit for the work agents end up doing constantly: read a CSV someone sent you, work out what's in it, clean it up, and forward only the safe rows downstream. Built on Python 3 standard library only. No pandas, no numpy, no pip installs, no remote calls.

What this skill does

scripts/inspect.py — profile a .csv / .tsv / .jsonl file: row count, auto-detected column types (int, float, bool, date, datetime, string, empty), null counts per column, distinct value counts (capped), three sample values per column, file size, and detected encoding.
scripts/validate.py — check the file against a small JSON schema (required columns, per-column type, min/max, enum, regex, unique). Exits 0/1 so it slots into CI.
scripts/dedupe.py — remove duplicate rows by full-row match or by key columns. Optional --keep first|last, --case-insensitive, --trim, and a JSONL report of every removed row.
scripts/diff.py — compare two files by key column(s) and classify every row as added / removed / changed / unchanged, with a per-column before/after diff for changed rows.
scripts/convert.py — convert between CSV, TSV, JSON Lines, JSON array, and GitHub-flavored Markdown table.
scripts/check_deps.sh — verify python3 is available.

What this skill does not do

It does not call any LLM, web service, or remote API.
It does not load a full dataframe into memory just to do simple structural work; the helpers stream rows where possible.
It does not write outside the input/output paths the caller provides.
It does not do statistical analysis (mean, percentile, correlation). For that, use a dataframe library.
It does not parse Excel files (.xls / .xlsx). Export to CSV first.

Required dependencies

bash scripts/check_deps.sh

Only python3 is required. The skill uses csv, json, re, pathlib, argparse, datetime, collections — all stdlib.

Workflows

1. Profile an unknown CSV

python3 scripts/inspect.py customers.csv

Output:

file:      /path/customers.csv
size:      284 B (284 bytes)
encoding:  utf-8
kind:      csv
rows:      5
columns:   6

  #  name                          type           nulls   null%    distinct  sample
----------------------------------------------------------------------------------------------------
  1  id                            int                0    0.00           5  '1', '2', '3'
  2  email                         string             0    0.00           5  '[email protected]', ...
  3  name                          string             0    0.00           5  'Alice', 'Bob', 'Carol'
  4  amount                        float              1   20.00           4  '42.50', '100.00', '7.25'
  5  status                        string             0    0.00           3  'approved', 'pending', ...
  6  signup_date                   date               0    0.00           5  '2025-01-15', ...

Pass --json for machine-readable output that pipes into other tools.

The script auto-detects the dialect (CSV vs TSV vs JSON Lines) and a sensible encoding (utf-8, utf-8-sig, cp1252, latin-1). Type inference takes up to 1000 non-empty values per column and picks the most specific type that fits all of them.

2. Validate against a schema

Write a schema.json:

{
  "required_columns": ["id", "email", "amount", "status"],
  "columns": {
    "id":     {"type": "int", "required": true, "unique": true, "min": 1},
    "email":  {"type": "string", "required": true, "regex": ".+@.+\\..+"},
    "amount": {"type": "float", "min": 0, "max": 100000},
    "status": {"type": "string", "enum": ["pending", "approved", "rejected"]},
    "signup_date": {"type": "date"}
  }
}

Then:

python3 scripts/validate.py customers.csv --schema schema.json

A clean file exits 0 with verdict: pass. A bad file exits 1 with a detailed error table:

   row  column                  kind                    detail
------------------------------------------------------------------------------------------------
     2  email                   regex_mismatch          value did not match regex | value='not-an-email'
     2  amount                  bad_type                value does not match type 'float' | value='abc'
     3  amount                  below_min               value -50.0 \x3C min 0 | value='-50.00'
     3  status                  not_in_enum             value not in allowed set | value='unknown_status'
     4  id                      duplicate_unique        value already seen earlier in this column | value='1'

Pass --json for a structured report and --max-errors N to cap collection on huge files.

3. Remove duplicates

By full-row match (any two rows identical in every column):

python3 scripts/dedupe.py messy.csv clean.csv

By a key column (only one canonical row per id):

python3 scripts/dedupe.py messy.csv clean.csv --key id \
  --removed-report removed.jsonl

--keep first (default) keeps the earlier-occurring row; --keep last keeps the later one — useful when later rows are corrections. --case-insensitive and --trim normalise key values before comparison so " [email protected]" and "[email protected]" collapse to one row.

The --removed-report writes one JSON object per removed row, with the original 1-based row index, the key tuple that was duplicated, and the full row, so the dedup decision is auditable.

4. Diff two files

python3 scripts/diff.py customers_old.csv customers_new.csv --key id

Output:

added:      1
removed:    1
changed:    1

--- ADDED (1) ---
  + 6
--- REMOVED (1) ---
  - 4
--- CHANGED (1) ---
  ~ 2
      amount: '100.00' -> '150.00'
      status: 'pending' -> 'approved'

Multi-column keys are supported: --key customer_id,date. Exit codes are 0 if the files are identical on the key columns, 1 if they differ — so this also works as a CI guard ("fail the build if the snapshot file changed").

5. Convert between formats

python3 scripts/convert.py data.csv data.jsonl       # row -> JSON Lines
python3 scripts/convert.py data.jsonl data.csv       # back
python3 scripts/convert.py data.csv data.json --pretty
python3 scripts/convert.py data.csv data.md          # GitHub-flavored table
python3 scripts/convert.py data.tsv data.csv         # delimiter change

Output format is picked from the extension. Allowed extensions: .csv, .tsv, .jsonl, .json, .md. The Markdown writer escapes | and \ in cell values so the table stays well-formed.

Exit codes

Code	Meaning
0	success / validation pass / files identical
1	validation fail / files differ / no rows in input
2	bad arguments / unsafe path / missing input / unsupported extension / schema malformed

This 0/1/2 split is consistent across all five scripts, so they slot into shell pipelines cleanly:

python3 scripts/validate.py incoming.csv --schema schema.json \
  && python3 scripts/dedupe.py incoming.csv clean.csv --key id \
  && python3 scripts/inspect.py clean.csv

Safety properties

Pure Python 3 standard library. No third-party dependencies.
No subprocess calls. No shell invocation.
All file paths are validated against a strict allowlist regex that rejects shell metacharacters (;, |, &, >, \x3C, $, `, backslash-newline, etc.).
Scripts only read the input paths the caller provides and write to the output paths the caller provides. No temp files outside the system's tempdir.
All inputs and outputs use UTF-8 by default; CSV reads auto-fall-back through utf-8-sig, cp1252, and latin-1 when the file's encoding is non-UTF-8.
Deterministic: the same input produces the same output every time.

Performance

inspect.py profiles 10,000 rows in well under one second on a single core (single-pass streaming read).
All scripts stream rows; they do not load the entire file into memory for processing. The exception is dedupe.py and diff.py, which build an in-memory dict keyed by row identity — fine for hundreds of thousands of rows on a typical laptop.
No background threads, no process pool, no caching.

Known limitations

Type inference uses regex-shape matching, not locale-aware parsing. "1,234.56" is detected as string, not float. Re-export with a different number format if you need different inference.
The Markdown writer flattens multi-line cells to single lines (newlines become spaces).
JSON Lines input must have one JSON object per line. Multi-line JSON arrays are not supported; use the regular CSV/JSONL pipeline.

License

MIT. See LICENSE.

安全使用建议

This skill appears safe for local CSV-style cleanup. Before installing, note that it can read whichever local file path is supplied and can write output/report files; avoid using it on highly sensitive data unless you are comfortable with sample values or row contents appearing in the agent output.

功能分析

Type: OpenClaw Skill Name: clean-csv-toolkit Version: 0.1.0 The clean-csv-toolkit is a well-implemented set of local data processing utilities using only the Python standard library. It includes proactive security measures, such as a path validation regex in scripts/_common.py designed to prevent shell metacharacter injection. The code contains no network calls, no subprocess execution, and no evidence of data exfiltration or prompt injection, strictly adhering to its stated purpose of offline CSV/JSONL manipulation.

能力标签

cryptocan-make-purchases

能力评估

✓ Purpose & Capability

The stated purpose is local CSV/TSV/JSONL inspection, validation, deduplication, diffing, and conversion, and the included scripts implement those tasks with Python standard-library code. The registry capability signals for crypto/purchases are not supported by the reviewed source.

ℹ Instruction Scope

The instructions are clear and user-directed, but profiling and diff outputs can include actual cell values from the selected files.

✓ Install Mechanism

There is no install spec or dependency download; the only setup helper checks that python3 is present.

ℹ Credentials

The scripts operate on local paths supplied as arguments and may create parent directories and write output or report files. This is proportionate to the toolkit purpose but users should choose paths carefully.

✓ Persistence & Privilege

No credentials, environment variables, background services, autonomous persistence, or remote APIs are present; only explicit local output files persist.

版本历史

v0.1.0

v0.1.0 initial release. Local CSV/TSV/JSONL toolkit. Five scripts: inspect.py profiles a tabular file with auto-detected column types (int/float/bool/date/datetime/string), null counts, distincts, samples, encoding, and dialect. validate.py checks a file against a small JSON schema (required columns, per-column type, min/max, enum, regex, unique). dedupe.py removes duplicates by full-row or key columns with optional --keep first/last, --case-insensitive, --trim, and JSONL removed-rows report. diff.py compares two files by key column(s) and classifies rows as added/removed/changed/unchanged with per-column before/after for changed rows. convert.py converts between csv/tsv/jsonl/json/md. Pure Python 3 standard library, no pandas, no numpy, no subprocess, no remote calls. Consistent 0/1/2 exit codes across all scripts. 26 end-to-end tests covering CSV/TSV/JSONL inputs, schema validation, full-row and keyed dedup, file diffing, round-trip format conversion, 10k-row performance, error paths.

元数据

Slug clean-csv-toolkit

版本 0.1.0

许可证 MIT-0

累计安装 0

当前安装数 0

历史版本数 1

常见问题