Clean Text Toolkit
/install clean-text-toolkit
clean-text-toolkit
v0.1.0
A small, honest local toolkit for the work agents end up doing constantly: read some text someone sent you, find the structured bits, clean it up, redact the secrets, and forward it downstream. Built on Python 3 standard library only. No pandas, no nltk, no pip installs, no remote calls.
This skill is the companion to clean-csv-toolkit: that one handles structured tabular data, this one handles unstructured text.
What this skill does
- scripts/extract.py: pull structured items out of any text file. Kinds: url, email, phone, ipv4, ipv6, hashtag, mention, hex-color, money, iso-date. Output to stdout (one-per-line or JSON), or to a .txt / .json / .jsonl file. Optional --unique, --sort, --with-line (prefix with the source line number).
- scripts/normalize.py: clean up messy text. Chainable transforms applied in command-line order: --trim, --collapse-spaces, --strip-blank, --to-unix, --to-crlf, --dehyphenate (rejoin OCR/PDF hyphenated line-breaks), --unsmart (smart quotes / em-dashes → ASCII), --strip-bom, --strip-zwsp (zero-width spaces and joiners), --tabs-to-spaces N, --spaces-to-tabs N, --lower / --upper / --title, --normalize-unicode NFC|NFD|NFKC|NFKD.
- scripts/redact.py: anonymize text by replacing PII-like patterns with placeholder tokens. Kinds: email, phone, ipv4, ipv6, url, credit-card (with Luhn validation to suppress false positives), ssn-us, uuid, hex-token (32+ hex chars, typical for tokens / hashes), aws-access-key (AKIA…), jwt (three base64url segments with the eyJ header). --keep-counts makes the same value always get the same placeholder; --preserve-length pads/truncates the placeholder to the original length.
- scripts/lines.py: line-oriented utilities. --op count | dedupe | sort | shuffle | head | tail. Streams count, head, tail. dedupe and sort are O(N) memory in the number of lines, but each line is small so 1 M lines is fine on a laptop. --case-insensitive, --keep first|last, --numeric, --reverse, --seed for deterministic shuffles.
- scripts/wordcount.py: word / character / line / sentence statistics. Optional --top N for most-frequent words, --stopwords PATH, --min-length N, --ignore-case, --regex PATTERN (default [A-Za-z']+).
- scripts/diff_text.py: three-mode text diff using stdlib difflib. --mode unified (default), --mode side (custom two-column layout), --mode html (writes a full HTML file with red/green coloring). --ignore-case, --ignore-whitespace, --context N.
- scripts/check_deps.sh: verify python3 is available.
What this skill does not do
- It does not call any LLM, web service, or remote API.
- It does not load entire files into memory unless an operation truly needs the whole file (full-content normalization, sort-and-write, diff). Streaming-friendly operations (extract, lines --op count|head|tail, wordcount for chars/lines counters) read one line at a time.
- It does not write outside the input/output paths the caller provides.
Quick start
1. Pull every email out of a log file
python3 scripts/extract.py app.log --kind email --unique --sort
python3 scripts/extract.py app.log --kind email --output emails.txt --unique
2. Find every URL and tag it with the source line
python3 scripts/extract.py article.md --kind url --with-line
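To make the --unique and --with-line options concrete, here is a minimal sketch of line-by-line extraction. The regex, the function name extract_emails, and the "lineno:value" output shape are all assumptions for illustration; the patterns and output format in extract.py may differ.

```python
import re

# Hypothetical, pragmatic email pattern (not the one shipped in extract.py).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(lines, unique=False, with_line=False):
    """Scan an iterable of lines once and collect email-like matches."""
    seen = set()
    results = []
    for lineno, line in enumerate(lines, start=1):
        for match in EMAIL_RE.finditer(line):
            value = match.group(0)
            if unique:
                if value in seen:
                    continue  # skip values already reported
                seen.add(value)
            # Assumed output shape for --with-line: "<line number>:<value>"
            results.append(f"{lineno}:{value}" if with_line else value)
    return results


if __name__ == "__main__":
    sample = ["contact a@b.com today", "cc c@d.org and a@b.com"]
    print(extract_emails(sample, unique=True))
    print(extract_emails(sample, with_line=True))
```

Because the loop consumes one line at a time, it also works on an open file object without reading the whole file first.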
3. Clean up a messy OCR dump
python3 scripts/normalize.py scanned.txt clean.txt \
--strip-bom --to-unix --dehyphenate --collapse-spaces \
--unsmart --strip-blank --normalize-unicode NFC
The transforms run in the order you list them on the command line.
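Why the order matters can be seen in a small sketch. The transform functions below are simplified stand-ins written for this example (the names mirror the flags, but the implementations are assumptions, not the code in normalize.py).

```python
import re


def to_unix(text):
    # Convert CRLF and lone CR line endings to LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def dehyphenate(text):
    # Rejoin a word split across an LF line break: "exam-\nple" -> "example".
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def collapse_spaces(text):
    # Squeeze runs of spaces and tabs into a single space.
    return re.sub(r"[ \t]+", " ", text)


# Hypothetical registry keyed by flag name (without the leading dashes).
TRANSFORMS = {
    "to-unix": to_unix,
    "dehyphenate": dehyphenate,
    "collapse-spaces": collapse_spaces,
}


def apply_transforms(text, names):
    """Apply the named transforms strictly in the order given."""
    for name in names:
        text = TRANSFORMS[name](text)
    return text


if __name__ == "__main__":
    raw = "exam-\r\nple   text"
    print(repr(apply_transforms(raw, ["to-unix", "dehyphenate", "collapse-spaces"])))
    print(repr(apply_transforms(raw, ["dehyphenate", "to-unix", "collapse-spaces"])))
```

In this sketch, running dehyphenate before to-unix leaves the hyphen in place, because the line break is still CRLF when the dehyphenation pattern looks for a bare LF.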
4. Redact PII before sharing a transcript
python3 scripts/redact.py transcript.txt safe.txt
# default kinds = all
# default placeholder = [REDACTED_{kind}_{i}]
# Only redact emails and phones, give the same email the same placeholder
python3 scripts/redact.py transcript.txt safe.txt \
--kinds email,phone --keep-counts
# Custom template
python3 scripts/redact.py log.txt safe.txt \
--token-template "<<{kind}#{i}>>"
# Pad placeholder to match original length (for fixed-width layouts)
python3 scripts/redact.py log.txt safe.txt --preserve-length
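The effect of --keep-counts and a custom template can be sketched as follows. This is an illustration under assumptions: the email regex, the function name redact_emails, the lowercase kind value, and numbering from 1 are hypothetical and may not match redact.py.

```python
import re

# Hypothetical email pattern used only for this example.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text, template="[REDACTED_{kind}_{i}]", keep_counts=False):
    """Replace each email-like match with a placeholder built from the template."""
    assigned = {}  # original value -> placeholder already issued
    counter = 0

    def replace(match):
        nonlocal counter
        value = match.group(0)
        if keep_counts and value in assigned:
            return assigned[value]  # same value, same placeholder
        counter += 1
        token = template.format(kind="email", i=counter)
        assigned[value] = token
        return token

    return EMAIL_RE.sub(replace, text)


if __name__ == "__main__":
    text = "a@b.com wrote to c@d.org, then a@b.com replied"
    print(redact_emails(text))
    print(redact_emails(text, keep_counts=True))
    print(redact_emails(text, template="<<{kind}#{i}>>", keep_counts=True))
```

Stable placeholders keep a redacted transcript readable: a reader can still tell that the first and third address were the same person without seeing the address itself.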
Credit-card matches are validated against the Luhn checksum so 16 random digits in a row don't trigger a false positive.
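For reference, the Luhn check itself is short. The sketch below is a generic implementation of the standard algorithm, not the code from redact.py; the function name luhn_valid is an assumption.

```python
def luhn_valid(candidate):
    """Return True if the digits in candidate pass the Luhn checksum."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


if __name__ == "__main__":
    print(luhn_valid("4111 1111 1111 1111"))  # widely used test number
    print(luhn_valid("4111 1111 1111 1112"))  # last digit changed
```

A random 16-digit run passes this check only about one time in ten, which is why it is useful as a false-positive filter rather than as proof that a number is a real card.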
5. Line utilities
# Quick file stats
python3 scripts/lines.py haystack.txt --op count
# Drop duplicates, case-insensitive
python3 scripts/lines.py users.txt --op dedupe --case-insensitive --output unique.txt
# Numeric sort (so "100" > "23" > "7")
python3 scripts/lines.py scores.txt --op sort --numeric --reverse
# Deterministic shuffle
python3 scripts/lines.py prompts.txt --op shuffle --seed 42
# Look at the head and tail of a multi-gig log
python3 scripts/lines.py huge.log --op head -n 20
python3 scripts/lines.py huge.log --op tail -n 20
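A rough sketch of how the streaming operations and dedupe could work with the standard library alone. The function names and the exact meaning of "keep last" (the last occurrence stays at its own position) are assumptions for illustration, not the behavior documented for lines.py.

```python
from collections import deque
from itertools import islice


def head(lines, n):
    # Stops reading after n lines, so the rest of the input is never touched.
    return list(islice(lines, n))


def tail(lines, n):
    # A bounded deque keeps only the most recent n lines in memory.
    return list(deque(lines, maxlen=n))


def count(lines):
    return sum(1 for _ in lines)


def dedupe(lines, case_insensitive=False, keep="first"):
    """Drop duplicate lines; memory grows with the number of distinct lines."""
    items = list(lines)
    if keep == "last":
        # Assumed semantics: keep each value's final occurrence in place.
        items.reverse()
    seen = set()
    kept = []
    for line in items:
        key = line.casefold() if case_insensitive else line
        if key not in seen:
            seen.add(key)
            kept.append(line)
    if keep == "last":
        kept.reverse()
    return kept


if __name__ == "__main__":
    names = ["Bob", "alice", "bob", "Alice"]
    print(dedupe(names, case_insensitive=True))
    print(dedupe(names, case_insensitive=True, keep="last"))
    print(tail(iter(range(1_000_000)), 3))
```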
6. Word counts
# Basic stats
python3 scripts/wordcount.py essay.txt
# Top words with stopwords filter
python3 scripts/wordcount.py essay.txt --top 20 --ignore-case --stopwords stop.txt
# Machine-readable output
python3 scripts/wordcount.py essay.txt --top 10 --json > stats.json
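The --top, --ignore-case, --stopwords, and --min-length options map naturally onto collections.Counter. The sketch below is an assumption about how such counting can be done, using the default pattern [A-Za-z']+ quoted earlier; the function name top_words is hypothetical.

```python
import re
from collections import Counter


def top_words(text, n, ignore_case=False, stopwords=(), min_length=1,
              pattern=r"[A-Za-z']+"):
    """Return the n most frequent words as (word, count) pairs."""
    words = re.findall(pattern, text)
    stop = set(stopwords)
    if ignore_case:
        words = [word.lower() for word in words]
        stop = {word.lower() for word in stop}
    counts = Counter(
        word for word in words
        if len(word) >= min_length and word not in stop
    )
    return counts.most_common(n)


if __name__ == "__main__":
    essay = "The cat and the hat. The end."
    print(top_words(essay, 2, ignore_case=True, stopwords={"and"}))
```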
7. Text diff
# Standard unified diff
python3 scripts/diff_text.py before.txt after.txt
# Side-by-side
python3 scripts/diff_text.py before.txt after.txt --mode side
# HTML report (colorized) for sharing
python3 scripts/diff_text.py before.txt after.txt --mode html --output diff.html
# Whitespace-insensitive compare
python3 scripts/diff_text.py before.txt after.txt --ignore-whitespace
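Since the script is described as built on difflib, a unified diff with the ignore options can be sketched like this. The normalization step is a simplification chosen for the example (it diffs the normalized lines, so the output shows normalized text); diff_text.py may handle this differently.

```python
import difflib


def unified(before, after, context=3, ignore_case=False, ignore_whitespace=False):
    """Return unified-diff lines for two strings; empty list if they match."""

    def normalize(line):
        if ignore_whitespace:
            line = " ".join(line.split())  # collapse and trim whitespace
        if ignore_case:
            line = line.lower()
        return line

    a = [normalize(line) for line in before.splitlines()]
    b = [normalize(line) for line in after.splitlines()]
    return list(difflib.unified_diff(
        a, b, fromfile="before", tofile="after", n=context, lineterm=""
    ))


if __name__ == "__main__":
    for line in unified("alpha\nbeta\n", "alpha\ngamma\n"):
        print(line)
    print(unified("a   b", "a b", ignore_whitespace=True))
```

An empty result corresponds to the "files identical" case in the exit-code table below.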
Exit codes
| Code | Meaning |
|---|---|
| 0 | success / one or more matches / files identical |
| 1 | zero matches / zero redactions / files differ / empty input |
| 2 | bad arguments / unsafe path / missing input / unknown kind / bad regex / unsupported output extension |
This 0 / 1 / 2 split is consistent across all six scripts so they slot into shell pipelines cleanly:
# Normalize, then redact, then count words in one shot
python3 scripts/normalize.py raw.txt clean.txt --to-unix --dehyphenate \
&& python3 scripts/redact.py clean.txt safe.txt \
&& python3 scripts/wordcount.py safe.txt --top 10
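One consequence of the table: in a shell && chain, exit code 1 (for example, zero redactions) also stops the chain, even though nothing went wrong. A caller that wants to continue on 0 or 1 and stop only on 2 could use a small wrapper like the sketch below. This is caller-side code written for illustration (the function name run_steps is hypothetical); it is not part of the toolkit, which itself makes no subprocess calls.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each argv list in order; stop only when a step exits with 2 or more.

    Returns the list of exit codes for the steps that actually ran.
    """
    codes = []
    for argv in steps:
        code = subprocess.run(argv).returncode
        codes.append(code)
        if code >= 2:
            break  # bad arguments, unsafe path, etc.: do not continue
    return codes


if __name__ == "__main__":
    # Stand-in commands that only set an exit code; swap in the real
    # python3 scripts/... invocations when the toolkit is installed.
    demo = [
        [sys.executable, "-c", "import sys; sys.exit(1)"],  # "zero redactions"
        [sys.executable, "-c", "import sys; sys.exit(0)"],  # still runs
    ]
    print(run_steps(demo))
```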
Safety properties
- Pure Python 3 standard library. No third-party dependencies, no pip install.
- No subprocess calls. No shell invocation.
- All file paths are validated against a strict allowlist regex that rejects shell metacharacters (;, |, &, >, <, $, `, etc.). The same safe_path() helper that powers clean-csv-toolkit.
- Scripts only read the input paths the caller provides and write to the output paths the caller provides.
- All inputs and outputs default to UTF-8; reads fall back through utf-8-sig, cp1252, latin-1 if needed. Writes are always UTF-8.
- Deterministic where it matters: shuffle --seed N is reproducible; extract and wordcount always emit results in the same order for a given input.
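Two of these properties, the path allowlist and the encoding fallback, can be sketched briefly. Everything below is an assumption for illustration: the allowed character set, the error type, and the function decode_with_fallback are hypothetical, and the real safe_path() and read_text helpers in _common.py may be stricter or structured differently.

```python
import re

# Hypothetical allowlist: letters, digits, underscore, dot, slash, hyphen, space.
# Anything else (; | & > < $ ` and so on) is rejected outright.
SAFE_PATH_RE = re.compile(r"[A-Za-z0-9_./ -]+")


def safe_path(path):
    """Return path unchanged if every character is on the allowlist."""
    if not SAFE_PATH_RE.fullmatch(path):
        raise ValueError(f"unsafe path: {path!r}")
    return path


def decode_with_fallback(data):
    """Decode bytes by trying encodings in the order listed above.

    Note: plain utf-8 also accepts BOM-prefixed bytes (leaving U+FEFF in
    the text), so in this sketch the utf-8-sig step never changes the
    outcome; it is kept only to mirror the documented order.
    """
    for encoding in ("utf-8", "utf-8-sig", "cp1252"):
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so it cannot fail.
    return data.decode("latin-1")


if __name__ == "__main__":
    print(safe_path("data/input.txt"))
    print(decode_with_fallback(b"caf\xe9"))  # not valid UTF-8, decodes as cp1252
    try:
        safe_path("notes.txt; rm -rf /")
    except ValueError as exc:
        print(exc)
```

An allowlist is the safer design here: rather than enumerating dangerous characters, it accepts only a known-good set, so an unanticipated metacharacter is rejected by default.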
Performance
- lines.py --op dedupe processes 100,000 short lines (500 distinct) in ~0.06 s.
- lines.py --op sort processes 100,000 lines in ~0.10 s.
- extract.py scans the file in a single streaming pass, so memory does not grow with file size.
Known limitations
- The PII patterns are pragmatic heuristics, not strict RFC validators. The email regex accepts user@host.tld shapes but does not validate that host.tld resolves. phone accepts three telltale formats (+<digits>, (XXX) XXX-XXXX, XXX-XXX-XXXX / XXX XXX XXXX) so it doesn't grab IPs, dates, or credit-card numbers, but it will miss exotic local formats.
- credit-card uses the Luhn checksum, but hex-token (and similar high-recall patterns) intentionally over-match; review the count before sharing redacted output publicly.
- diff_text.py --mode html produces the standard difflib.HtmlDiff markup, which embeds inline styles. The file is portable but the styling is not customizable.
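The three phone shapes described above can be approximated with one verbose regex. This is a sketch under assumptions: the 7 to 15 digit bound on the international form and the boundary lookarounds are choices made for this example, not the pattern shipped in the toolkit.

```python
import re

PHONE_RE = re.compile(
    r"""
    (?<![\w.])                      # not glued to a preceding word char or dot
    (?:
        \+\d{7,15}                  # international: plus sign, then digits
      | \(\d{3}\)\ \d{3}-\d{4}      # parenthesized area code
      | \d{3}-\d{3}-\d{4}           # 3-3-4 with dashes
      | \d{3}\ \d{3}\ \d{4}         # 3-3-4 with spaces
    )
    (?!\d)                          # do not stop in the middle of a digit run
    """,
    re.VERBOSE,
)


def find_phones(text):
    return [match.group(0) for match in PHONE_RE.finditer(text)]


if __name__ == "__main__":
    mixed = (
        "Call +14155550123 or (415) 555-0123, 415-555-0123, 415 555 0123; "
        "ip 192.168.1.10 date 2024-01-15 card 4111 1111 1111 1111"
    )
    for phone in find_phones(mixed):
        print(phone)
```

Requiring explicit shapes trades recall for precision: IP addresses, ISO dates, and space-separated card numbers in the sample are skipped, and so would be any local format outside these shapes.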
v0.1.0 changes
- First public release of clean-text-toolkit.
- Six scripts: extract.py, normalize.py, redact.py, lines.py, wordcount.py, diff_text.py.
- Shared _common.py with safe_path, read_text, iter_lines, and write_text helpers (mirrors the design of clean-csv-toolkit/scripts/_common.py).
- Bug fixed during development: initial phone regex was too greedy and matched IPs / ISO dates / credit-card-with-spaces; tightened to three explicit shapes (international, parenthesized, 3-3-4 dashed) that don't collide with those other patterns. Tested against a mixed-content fixture with 5 valid phones and 3 confusable non-phones.
- Zero third-party dependencies; works on any system that ships Python 3.
Pairs well with
- clean-csv-toolkit: same author, same design philosophy (pure stdlib, exit-code contract, safe-path policy), for structured tabular data.
- openclaw-prompt-shield: pair extract.py --kind email,url with prompt-shield's redaction pipeline to scrub user-supplied text before passing it to an LLM.
License
MIT
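- Make sure OpenClaw is installed (locally or via Docker).
- Enter the install command in the chat box: /install clean-text-toolkit
- Once installed, call the Skill by name or trigger it with /clean-text-toolkit
- Provide the inputs described in the Skill's parameter documentation to get structured output.
What is Clean Text Toolkit?
Local text cleanup and inspection toolkit. Extract structured items (URLs, emails, phones, IPs, dates, hashtags, money), redact PII with custom placeholders... It is an AI Agent Skill plugin for Claude Code / OpenClaw, with 72 downloads to date.
How do I install Clean Text Toolkit?
Run the command "/install clean-text-toolkit" in the OpenClaw or Claude Code chat box for a one-step install; no extra configuration is needed.
Is Clean Text Toolkit free?
Yes. Clean Text Toolkit is completely free and released under the MIT-0 license; it can be freely downloaded, installed, and used.
Which platforms does Clean Text Toolkit support?
Clean Text Toolkit is cross-platform and runs in any environment where OpenClaw / Claude Code is deployed.
Who developed Clean Text Toolkit?
Developed and maintained by gopendrasharma89-tech (@gopendrasharma89-tech); the current version is v0.1.0.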