Clean Text Toolkit

Author: gopendrasharma89-tech · v0.1.0 · MIT-0
cross-platform ⚠ pending
Total downloads: 72
Favorites: 1
Current installs: 0
Versions: 1
Install in OpenClaw
/install clean-text-toolkit
Description
Local text cleanup and inspection toolkit. Extract structured items (URLs, emails, phones, IPs, dates, hashtags, money), redact PII with custom placeholders...
Usage (SKILL.md)

clean-text-toolkit

v0.1.0

A small, honest local toolkit for the work agents end up doing constantly: read some text someone sent you, find the structured bits, clean it up, redact the secrets, and forward it downstream. Built on Python 3 standard library only. No pandas, no nltk, no pip installs, no remote calls.

This skill is the companion to clean-csv-toolkit: that one handles structured tabular data, this one handles unstructured text.

What this skill does

  • scripts/extract.py — pull structured items out of any text file. Kinds: url, email, phone, ipv4, ipv6, hashtag, mention, hex-color, money, iso-date. Output to stdout (one-per-line or JSON), or to a .txt / .json / .jsonl file. Optional --unique, --sort, --with-line (prefix with the source line number).
  • scripts/normalize.py — clean up messy text. Chainable transforms applied in command-line order: --trim, --collapse-spaces, --strip-blank, --to-unix, --to-crlf, --dehyphenate (rejoin OCR/PDF hyphenated line-breaks), --unsmart (smart quotes / em-dashes → ASCII), --strip-bom, --strip-zwsp (zero-width spaces and joiners), --tabs-to-spaces N, --spaces-to-tabs N, --lower / --upper / --title, --normalize-unicode NFC|NFD|NFKC|NFKD.
  • scripts/redact.py — anonymize text by replacing PII-like patterns with placeholder tokens. Kinds: email, phone, ipv4, ipv6, url, credit-card (with Luhn validation to suppress false positives), ssn-us, uuid, hex-token (32+ hex chars, typical for tokens / hashes), aws-access-key (AKIA…), jwt (three base64url segments with the eyJ header). --keep-counts makes the same value always get the same placeholder; --preserve-length pads/truncates the placeholder to the original length.
  • scripts/lines.py — line-oriented utilities. --op count | dedupe | sort | shuffle | head | tail. count, head, and tail stream the input. dedupe and sort hold every line in memory (O(N) in the number of lines), which is fine on a laptop for about 1 million short lines. --case-insensitive, --keep first|last, --numeric, --reverse, --seed for deterministic shuffles.
  • scripts/wordcount.py — word / character / line / sentence statistics. Optional --top N for most-frequent words, --stopwords PATH, --min-length N, --ignore-case, --regex PATTERN (default [A-Za-z']+).
  • scripts/diff_text.py — three-mode text diff using stdlib difflib. --mode unified (default), --mode side (custom two-column layout), --mode html (writes a full HTML file with red/green coloring). --ignore-case, --ignore-whitespace, --context N.
  • scripts/check_deps.sh — verify python3 is available.
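
To make the redact.py options above concrete, here is a minimal sketch of how numbered placeholders, --keep-counts, and --preserve-length could interact. It is not the toolkit's code: the email regex, the function name redact_emails, the lowercase kind label, and the "*" padding character are all assumptions made for illustration.

```python
import re
from typing import Dict

# Illustrative email pattern only; the toolkit's real regex is not shown here.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str,
                  template: str = "[REDACTED_{kind}_{i}]",
                  keep_counts: bool = False,
                  preserve_length: bool = False,
                  pad_char: str = "*") -> str:
    """Replace each email with a numbered placeholder token (hypothetical helper)."""
    seen: Dict[str, int] = {}  # value -> index already assigned to it
    counter = 0

    def replace(match: "re.Match[str]") -> str:
        nonlocal counter
        value = match.group(0)
        if keep_counts and value in seen:
            index = seen[value]  # repeated value: reuse its index
        else:
            counter += 1
            index = counter
            seen[value] = index
        token = template.format(kind="email", i=index)
        if preserve_length:
            # Truncate or pad so the token is exactly as long as the original.
            token = token[:len(value)].ljust(len(value), pad_char)
        return token

    return EMAIL_RE.sub(replace, text)


print(redact_emails("a@x.com wrote to b@y.org, then a@x.com replied", keep_counts=True))
```

With keep_counts enabled, both occurrences of a@x.com map to index 1, so a reader of the redacted text can still tell that the same address appears twice.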

What this skill does not do

  • It does not call any LLM, web service, or remote API.
  • It does not load entire files into memory unless an operation truly needs the whole file (full-content normalization, sort-and-write, diff). Streaming-friendly operations (extract, lines --op count|head|tail, wordcount for chars/lines counters) read one line at a time.
  • It does not write outside the input/output paths the caller provides.

Quick start

1. Pull every email out of a log file

python3 scripts/extract.py app.log --kind email --unique --sort
python3 scripts/extract.py app.log --kind email --output emails.txt --unique

2. Find every URL and tag it with the source line

python3 scripts/extract.py article.md --kind url --with-line
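
For readers curious how a single streaming pass with line tagging might work, the following is a rough sketch rather than the script's actual implementation. The URL pattern, the function name extract_with_line, and the "line:match" output format are assumptions.

```python
import re
from typing import Iterable, Iterator, Tuple

# Deliberately loose, illustrative URL pattern; the toolkit's own regex may differ.
URL_RE = re.compile(r"https?://[^\s<>]+")


def extract_with_line(lines: Iterable[str], pattern=URL_RE) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, match) pairs while holding only one line at a time."""
    for lineno, line in enumerate(lines, start=1):
        for match in pattern.finditer(line):
            # Drop trailing sentence punctuation that is rarely part of the URL.
            yield lineno, match.group(0).rstrip(".,;:!?)")


sample = [
    "See https://example.com/docs for details.\n",
    "No links here.\n",
    "Mirrors: http://a.example/x and https://b.example/y\n",
]
for lineno, url in extract_with_line(sample):
    print(f"{lineno}:{url}")
```

Because the function accepts any iterable of lines, an open file object can be passed directly, which keeps memory use flat regardless of file size.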

3. Clean up a messy OCR dump

python3 scripts/normalize.py scanned.txt clean.txt \
    --strip-bom --to-unix --dehyphenate --collapse-spaces \
    --unsmart --strip-blank --normalize-unicode NFC

The transforms run in the order you list them on the command line.
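
Order can change the result. The sketch below illustrates this with three simplified stand-ins for --to-unix, --dehyphenate, and --collapse-spaces. The function bodies and the TRANSFORMS table are assumptions for illustration, not the script's real code.

```python
import re
from typing import Callable, Dict, List


def to_unix(text: str) -> str:
    # Convert CRLF and bare CR line endings to LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def dehyphenate(text: str) -> str:
    # Rejoin words split as "exam-\nple"; assumes LF endings and a lowercase continuation.
    return re.sub(r"(\w)-\n([a-z])", r"\1\2", text)


def collapse_spaces(text: str) -> str:
    # Squeeze runs of spaces and tabs into one space, leaving newlines alone.
    return re.sub(r"[ \t]+", " ", text)


TRANSFORMS: Dict[str, Callable[[str], str]] = {
    "to-unix": to_unix,
    "dehyphenate": dehyphenate,
    "collapse-spaces": collapse_spaces,
}


def apply_in_order(text: str, names: List[str]) -> str:
    """Apply the named transforms one after another, in the order given."""
    for name in names:
        text = TRANSFORMS[name](text)
    return text


raw = "exam-\r\nple   text"
print(repr(apply_in_order(raw, ["to-unix", "dehyphenate", "collapse-spaces"])))
print(repr(apply_in_order(raw, ["dehyphenate", "to-unix", "collapse-spaces"])))
```

In this sketch, running dehyphenate before to-unix leaves the word split, because the hyphen is followed by a carriage return rather than a line feed. That is one reason to put line-ending fixes early in the chain.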

4. Redact PII before sharing a transcript

python3 scripts/redact.py transcript.txt safe.txt
# default kinds = all
# default placeholder = [REDACTED_{kind}_{i}]
# Only redact emails and phones, give the same email the same placeholder
python3 scripts/redact.py transcript.txt safe.txt \
    --kinds email,phone --keep-counts
# Custom template
python3 scripts/redact.py log.txt safe.txt \
    --token-template "<<{kind}#{i}>>"
# Pad placeholder to match original length (for fixed-width layouts)
python3 scripts/redact.py log.txt safe.txt --preserve-length

Credit-card matches are validated against the Luhn checksum, which rejects roughly nine out of ten random digit runs, so most arbitrary 16-digit strings don't trigger a false positive.
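
The Luhn check itself is a short, well-known algorithm. A minimal sketch follows; the function name luhn_valid and the 13 to 19 digit length bounds are assumptions, and the toolkit's own implementation may differ in detail.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 13 <= len(digits) <= 19:  # assumed plausible card-number lengths
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # widely used test number
print(luhn_valid("1234 5678 9012 3456"))  # arbitrary digits
```

The first call prints True and the second prints False, so a pattern match on the second string would be discarded before redaction.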

5. Line utilities

# Quick file stats
python3 scripts/lines.py haystack.txt --op count

# Drop duplicates, case-insensitive
python3 scripts/lines.py users.txt --op dedupe --case-insensitive --output unique.txt

# Numeric sort (so "100" > "23" > "7")
python3 scripts/lines.py scores.txt --op sort --numeric --reverse

# Deterministic shuffle
python3 scripts/lines.py prompts.txt --op shuffle --seed 42

# Look at the head and tail of a multi-gig log
python3 scripts/lines.py huge.log --op head -n 20
python3 scripts/lines.py huge.log --op tail -n 20
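
As a rough illustration of the dedupe and shuffle operations above, the sketch below uses only the standard library. The function names, the use of casefold() for case-insensitive keys, and the choice to place a kept "last" line at its final position are assumptions, not confirmed behavior of lines.py.

```python
import random
from typing import Dict, Iterable, List


def dedupe(lines: Iterable[str], case_insensitive: bool = False,
           keep: str = "first") -> List[str]:
    """Drop duplicate lines, keeping either the first or the last occurrence."""
    chosen: Dict[str, str] = {}  # dicts preserve insertion order
    for line in lines:
        key = line.casefold() if case_insensitive else line
        if keep == "last":
            chosen.pop(key, None)  # re-insert so the last occurrence sets the position
            chosen[key] = line
        elif key not in chosen:
            chosen[key] = line
    return list(chosen.values())


def shuffle_with_seed(lines: Iterable[str], seed: int) -> List[str]:
    """Shuffle a copy of the lines with a private, seeded generator."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    return shuffled


names = ["Bob", "alice", "bob", "Alice"]
print(dedupe(names, case_insensitive=True))
print(dedupe(names, case_insensitive=True, keep="last"))
print(shuffle_with_seed(names, 42) == shuffle_with_seed(names, 42))
```

Using a dedicated random.Random(seed) instance instead of the module-level functions keeps the shuffle reproducible even if other code touches the global random state.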

6. Word counts

# Basic stats
python3 scripts/wordcount.py essay.txt

# Top words with stopwords filter
python3 scripts/wordcount.py essay.txt --top 20 --ignore-case --stopwords stop.txt

# Machine-readable output
python3 scripts/wordcount.py essay.txt --top 10 --json > stats.json
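
The top-N count can be approximated in a few lines with collections.Counter. This sketch uses the documented default pattern [A-Za-z']+; the function name top_words and its parameters are illustrative stand-ins for the command-line flags, and it assumes the stopword list is already lowercase when case is ignored.

```python
import re
from collections import Counter
from typing import FrozenSet, List, Tuple

WORD_RE = re.compile(r"[A-Za-z']+")  # the default --regex quoted in this document


def top_words(text: str, n: int, ignore_case: bool = False,
              stopwords: FrozenSet[str] = frozenset(),
              min_length: int = 1) -> List[Tuple[str, int]]:
    """Return the n most frequent words as (word, count) pairs."""
    words = WORD_RE.findall(text)
    if ignore_case:
        words = [word.lower() for word in words]
    counts = Counter(
        word for word in words
        if len(word) >= min_length and word not in stopwords
    )
    # most_common orders ties by first appearance, so output is deterministic.
    return counts.most_common(n)


print(top_words("The cat and the hat. The end.", 2,
                ignore_case=True, stopwords=frozenset({"and"})))
```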

7. Text diff

# Standard unified diff
python3 scripts/diff_text.py before.txt after.txt

# Side-by-side
python3 scripts/diff_text.py before.txt after.txt --mode side

# HTML report (colorized) for sharing
python3 scripts/diff_text.py before.txt after.txt --mode html --output diff.html

# Whitespace-insensitive compare
python3 scripts/diff_text.py before.txt after.txt --ignore-whitespace
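
Since the script builds on difflib, the unified mode can be approximated as below. This is a sketch under assumptions: the function name unified is hypothetical, and normalizing lines before comparison (so the diff shows normalized text) is one possible way to implement the ignore options, not necessarily the one diff_text.py uses.

```python
import difflib
from typing import List


def unified(before: str, after: str, context: int = 3,
            ignore_case: bool = False, ignore_whitespace: bool = False) -> List[str]:
    """Return unified-diff lines; an empty list means no differences."""
    def normalize(line: str) -> str:
        if ignore_whitespace:
            line = " ".join(line.split())  # collapse runs and trim the ends
        if ignore_case:
            line = line.lower()
        return line

    a = [normalize(line) for line in before.splitlines()]
    b = [normalize(line) for line in after.splitlines()]
    return list(difflib.unified_diff(a, b, fromfile="before", tofile="after",
                                     n=context, lineterm=""))


for row in unified("alpha\nbeta\n", "alpha\nBETA  \ngamma\n"):
    print(row)
```

An empty result maps naturally onto the "files identical" case, and a non-empty one onto "files differ".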

Exit codes

Code  Meaning
0     success / one or more matches / files identical
1     zero matches / zero redactions / files differ / empty input
2     bad arguments / unsafe path / missing input / unknown kind / bad regex / unsupported output extension

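This 0 / 1 / 2 split is consistent across all six scripts, so they slot into shell pipelines cleanly. Note that && stops the chain on any non-zero code, so a step that exits 1 (for example, zero redactions) also halts the steps after it: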

# Normalize, then redact, then count words in one shot
python3 scripts/normalize.py raw.txt clean.txt --to-unix --dehyphenate \
  && python3 scripts/redact.py clean.txt safe.txt \
  && python3 scripts/wordcount.py safe.txt --top 10
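
A caller that wants to treat exit code 1 as "nothing found" rather than as a failure can inspect the codes directly. The sketch below shows one way to do that from Python on the caller's side. The labels, the function names, and the fake_step stand-in (which only exits with a chosen code so the example runs without the toolkit) are all assumptions.

```python
import subprocess
import sys
from typing import List, Sequence

MEANINGS = {0: "ok", 1: "empty", 2: "usage-error"}


def run_step(argv: Sequence[str]) -> str:
    """Run one command and translate its exit code into a label."""
    completed = subprocess.run(list(argv), check=False)
    return MEANINGS.get(completed.returncode, "unexpected")


def run_pipeline(steps: Sequence[Sequence[str]]) -> List[str]:
    """Run steps in order; stop only on a real error, not on exit code 1."""
    outcomes: List[str] = []
    for argv in steps:
        outcome = run_step(argv)
        outcomes.append(outcome)
        if outcome not in ("ok", "empty"):
            break
    return outcomes


def fake_step(code: int) -> List[str]:
    # Stand-in command that just exits with the given code.
    return [sys.executable, "-c", f"raise SystemExit({code})"]


print(run_pipeline([fake_step(0), fake_step(1), fake_step(0)]))
print(run_pipeline([fake_step(0), fake_step(2), fake_step(0)]))
```

In real use, each step would be a command such as ["python3", "scripts/redact.py", "clean.txt", "safe.txt"].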

Safety properties

  • Pure Python 3 standard library. No third-party dependencies, no pip install.
  • No subprocess calls. No shell invocation.
  • All file paths are validated against a strict allowlist regex that rejects shell metacharacters (;, |, &, >, <, $, `, etc.). This is the same safe_path() helper that powers clean-csv-toolkit.
  • Scripts only read the input paths the caller provides and write to the output paths the caller provides.
  • All inputs and outputs default to UTF-8; reads fall back through utf-8-sig, cp1252, latin-1 if needed. Writes are always UTF-8.
  • Deterministic where it matters: shuffle --seed N is reproducible; extract and wordcount always emit results in the same order for a given input.
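
An allowlist check of the kind described in the path-validation bullet can be sketched as follows. The exact character set and the ValueError are assumptions; the real safe_path() in _common.py is not shown in this document and may be stricter or looser.

```python
import re

# Assumed allowlist: letters, digits, underscore, dot, slash, and hyphen.
# Anything else (;, |, &, >, <, $, backtick, spaces, ...) is rejected.
SAFE_PATH_RE = re.compile(r"[A-Za-z0-9_./-]+")


def safe_path(path: str) -> str:
    """Return the path unchanged if every character is on the allowlist."""
    if not SAFE_PATH_RE.fullmatch(path):
        raise ValueError(f"unsafe path: {path!r}")
    return path


print(safe_path("logs/app-2024.log"))
for candidate in ["out.txt; rm -rf /", "$HOME/notes.txt", "a|b"]:
    try:
        safe_path(candidate)
    except ValueError as err:
        print(err)
```

An allowlist rejects anything not explicitly permitted, so it does not need to enumerate every dangerous character the way a blocklist would.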

Performance

  • lines.py --op dedupe processes 100,000 short lines (500 distinct) in ~0.06 s.
  • lines.py --op sort processes 100,000 lines in ~0.10 s.
  • extract.py scans the file in a single streaming pass — memory does not grow with file size.

Known limitations

  • The PII patterns are pragmatic heuristics, not strict RFC validators. The email regex accepts local@host.tld shapes but does not validate that host.tld resolves. phone accepts three telltale formats (+<digits>, (XXX) XXX-XXXX, XXX-XXX-XXXX / XXX XXX XXXX) so it doesn't grab IPs, dates, or credit-card numbers, but it will miss exotic local formats.
  • credit-card uses the Luhn checksum, but hex-token (and similar high-recall patterns) intentionally over-match; review the count before sharing redacted output publicly.
  • diff_text.py --mode html produces the standard difflib.HtmlDiff markup, which embeds inline styles. The file is portable but the styling is not customizable.
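
To show why a small set of explicit shapes avoids collisions with IPs, dates, and card numbers, here is an illustrative regex built from the phone formats listed above. It is a reconstruction for explanation only: the 7 to 15 digit range for the international form and the digit lookarounds are assumptions, not the toolkit's actual pattern.

```python
import re

PHONE_RE = re.compile(
    r"(?<!\d)(?:"
    r"\+\d{7,15}"              # international: plus sign, then digits
    r"|\(\d{3}\) \d{3}-\d{4}"  # (XXX) XXX-XXXX
    r"|\d{3}-\d{3}-\d{4}"      # XXX-XXX-XXXX
    r"|\d{3} \d{3} \d{4}"      # XXX XXX XXXX
    r")(?!\d)"
)

text = (
    "Call +14155552671 or (415) 555-2671 or 415-555-2671 or 415 555 2671. "
    "Server 192.168.1.10 on 2024-01-15, card 4111 1111 1111 1111."
)
print(PHONE_RE.findall(text))
```

The IP address, the ISO date, and the spaced card number never fit a 3-3-4 grouping, so none of them match even though they contain plenty of digits.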

v0.1.0 changes

  • First public release of clean-text-toolkit.
  • Six scripts: extract.py, normalize.py, redact.py, lines.py, wordcount.py, diff_text.py.
  • Shared _common.py with safe_path, read_text, iter_lines, and write_text helpers (mirrors the design of clean-csv-toolkit/scripts/_common.py).
  • Bug fixed during development: initial phone regex was too greedy and matched IPs / ISO dates / credit-card-with-spaces; tightened to three explicit shapes (international, parenthesized, 3-3-4 dashed) that don't collide with those other patterns. Tested against a mixed-content fixture with 5 valid phones and 3 confusable non-phones.
  • Zero third-party dependencies; works on any system that ships Python 3.
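
The encoding fallback attributed to the shared read helper can be pictured as a simple loop over candidate encodings. The sketch below is a guess at the shape of such a helper. The name decode_with_fallback and its return value are hypothetical, and the real read_text in _common.py may behave differently.

```python
from typing import Sequence, Tuple

# Order taken from this document: UTF-8 first, then the listed fallbacks.
FALLBACKS = ("utf-8", "utf-8-sig", "cp1252", "latin-1")


def decode_with_fallback(data: bytes,
                         encodings: Sequence[str] = FALLBACKS) -> Tuple[str, str]:
    """Return (text, encoding_used), trying each encoding in turn."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Not reached with the default list: latin-1 maps every byte to a character.
    raise ValueError("no candidate encoding could decode the data")


print(decode_with_fallback("café".encode("utf-8")))
print(decode_with_fallback(b"caf\xe9"))
```

Because latin-1 accepts any byte sequence, it works as a last resort that never fails, at the cost of possibly producing the wrong characters for text that was really in another encoding.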

Pairs well with

  • clean-csv-toolkit — same author, same design philosophy (pure stdlib, exit-code contract, safe-path policy), for structured tabular data.
  • openclaw-prompt-shield — pair extract.py --kind email,url with prompt-shield's redaction pipeline to scrub user-supplied text before passing it to an LLM.

License

MIT

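Capability tags
crypto · can-make-purchases
How to use
  1. Make sure OpenClaw is installed (locally or via Docker).
  2. Enter the install command in the chat box: /install clean-text-toolkit
  3. Once installation finishes, invoke the Skill by name or trigger it with /clean-text-toolkit.
  4. Provide the inputs described in the Skill's parameter documentation to get structured output.
Version history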
v0.1.0
v0.1.0 initial release. Six scripts in pure stdlib: extract.py (URL/email/phone/IP/hashtag/money/iso-date), normalize.py (chainable transforms: BOM/CRLF/smart-quotes/whitespace/tabs/case/Unicode NFC/dehyphenate), redact.py (PII anonymization with Luhn-validated credit-card detection and custom placeholder templates), lines.py (count/dedupe/sort/shuffle/head/tail), wordcount.py (stats + top-N with stopwords), diff_text.py (unified/side/html diffs). Shares the safe-path policy and 0/1/2 exit-code contract with clean-csv-toolkit. Streams where possible; 100k lines deduped in 60ms.
Metadata
Slug clean-text-toolkit
Version 0.1.0
License MIT-0
Total installs 0
Current installs 0
Version count 1
FAQ

What is Clean Text Toolkit?

Local text cleanup and inspection toolkit. Extract structured items (URLs, emails, phones, IPs, dates, hashtags, money), redact PII with custom placeholders... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 72 total downloads so far.

How do I install Clean Text Toolkit?

Run the command "/install clean-text-toolkit" in the OpenClaw or Claude Code chat box for a one-step install; no extra configuration is needed.

Is Clean Text Toolkit free?

Yes. Clean Text Toolkit is completely free and released under the MIT-0 license, so it can be downloaded, installed, and used freely.

Which platforms does Clean Text Toolkit support?

Clean Text Toolkit is cross-platform and can be used in any environment where OpenClaw / Claude Code is deployed.

Who develops Clean Text Toolkit?

It is developed and maintained by gopendrasharma89-tech (@gopendrasharma89-tech); the current version is v0.1.0.