Clean Text Toolkit
/install clean-text-toolkit
v0.1.0
A small, honest local toolkit for the work agents end up doing constantly: read some text someone sent you, find the structured bits, clean it up, redact the secrets, and forward it downstream. Built on Python 3 standard library only. No pandas, no nltk, no pip installs, no remote calls.
This skill is the companion to clean-csv-toolkit: that one handles structured tabular data, this one handles unstructured text.
What this skill does
- scripts/extract.py: pull structured items out of any text file. Kinds: url, email, phone, ipv4, ipv6, hashtag, mention, hex-color, money, iso-date. Output to stdout (one-per-line or JSON), or to a .txt / .json / .jsonl file. Optional --unique, --sort, --with-line (prefix with the source line number).
- scripts/normalize.py: clean up messy text. Chainable transforms applied in command-line order: --trim, --collapse-spaces, --strip-blank, --to-unix, --to-crlf, --dehyphenate (rejoin OCR/PDF hyphenated line-breaks), --unsmart (smart quotes / em-dashes → ASCII), --strip-bom, --strip-zwsp (zero-width spaces and joiners), --tabs-to-spaces N, --spaces-to-tabs N, --lower / --upper / --title, --normalize-unicode NFC|NFD|NFKC|NFKD.
- scripts/redact.py: anonymize text by replacing PII-like patterns with placeholder tokens. Kinds: email, phone, ipv4, ipv6, url, credit-card (with Luhn validation to suppress false positives), ssn-us, uuid, hex-token (32+ hex chars, typical for tokens / hashes), aws-access-key (AKIA…), jwt (three base64url segments with the eyJ header). --keep-counts makes the same value always get the same placeholder; --preserve-length pads/truncates the placeholder to the original length.
- scripts/lines.py: line-oriented utilities. --op count | dedupe | sort | shuffle | head | tail. Streams count, head, tail. dedupe and sort are O(N) memory in the number of lines, but each line is small so 1 M lines is fine on a laptop. --case-insensitive, --keep first|last, --numeric, --reverse, --seed for deterministic shuffles.
- scripts/wordcount.py: word / character / line / sentence statistics. Optional --top N for most-frequent words, --stopwords PATH, --min-length N, --ignore-case, --regex PATTERN (default [A-Za-z']+).
- scripts/diff_text.py: three-mode text diff using stdlib difflib. --mode unified (default), --mode side (custom two-column layout), --mode html (writes a full HTML file with red/green coloring). --ignore-case, --ignore-whitespace, --context N.
- scripts/check_deps.sh: verify python3 is available.
What this skill does not do
- It does not call any LLM, web service, or remote API.
- It does not load entire files into memory unless an operation truly needs the whole file (full-content normalization, sort-and-write, diff). Streaming-friendly operations (extract, lines --op count|head|tail, wordcount for chars/lines counters) read one line at a time.
- It does not write outside the input/output paths the caller provides.
Quick start
1. Pull every email out of a log file
python3 scripts/extract.py app.log --kind email --unique --sort
python3 scripts/extract.py app.log --kind email --output emails.txt --unique
2. Find every URL and tag it with the source line
python3 scripts/extract.py article.md --kind url --with-line
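To make the streaming, line-tagged extraction concrete, here is a minimal sketch in plain Python. The regexes and the function name `extract_with_line` are illustrative assumptions; the patterns that extract.py actually uses are not shown in this README.

```python
import re
from typing import Iterable, Iterator, Tuple

# Illustrative patterns only (assumed, not the toolkit's real regexes).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_with_line(lines: Iterable[str], pattern: re.Pattern) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, match) pairs while reading one line at a time."""
    for lineno, line in enumerate(lines, start=1):
        for match in pattern.finditer(line):
            yield lineno, match.group(0)


def unique_sorted(matches: Iterable[Tuple[int, str]]) -> list:
    """Rough equivalent of --unique --sort: drop line numbers, dedupe, sort."""
    return sorted({value for _, value in matches})
```

Because `extract_with_line` is a generator over any iterable of lines, passing an open file object keeps memory flat regardless of file size.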
3. Clean up a messy OCR dump
python3 scripts/normalize.py scanned.txt clean.txt \
--strip-bom --to-unix --dehyphenate --collapse-spaces \
--unsmart --strip-blank --normalize-unicode NFC
The transforms run in the order you list them on the command line.
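Why order matters is easiest to see in code. The sketch below applies a few transforms in the order given; the function bodies and the `TRANSFORMS` registry are assumptions written for illustration, not the implementation inside normalize.py.

```python
import re
from typing import Callable, Dict, List


def to_unix(text: str) -> str:
    # Convert CRLF and lone CR line endings to LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def dehyphenate(text: str) -> str:
    # Rejoin "exam-\nple" style breaks; only when lowercase letters surround the break.
    return re.sub(r"([a-z])-\n([a-z])", r"\1\2", text)


def collapse_spaces(text: str) -> str:
    # Squeeze runs of spaces/tabs into a single space.
    return re.sub(r"[ \t]{2,}", " ", text)


def unsmart(text: str) -> str:
    # Map curly quotes and long dashes to ASCII stand-ins.
    table = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"',
             "\u2013": "-", "\u2014": "--"}
    return text.translate(str.maketrans(table))


# Hypothetical registry keyed by flag name (without the leading dashes).
TRANSFORMS: Dict[str, Callable[[str], str]] = {
    "to-unix": to_unix,
    "dehyphenate": dehyphenate,
    "collapse-spaces": collapse_spaces,
    "unsmart": unsmart,
}


def apply_in_order(text: str, names: List[str]) -> str:
    """Apply the named transforms left to right, like flags on a command line."""
    for name in names:
        text = TRANSFORMS[name](text)
    return text
```

In this sketch, running dehyphenate before to-unix on CRLF input leaves the hyphen in place, because the pattern looks for a hyphen directly followed by a line feed. That is the kind of interaction the ordering rule lets you control.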
4. Redact PII before sharing a transcript
python3 scripts/redact.py transcript.txt safe.txt
# default kinds = all
# default placeholder = [REDACTED_{kind}_{i}]
# Only redact emails and phones, give the same email the same placeholder
python3 scripts/redact.py transcript.txt safe.txt \
--kinds email,phone --keep-counts
# Custom template
python3 scripts/redact.py log.txt safe.txt \
--token-template "<<{kind}#{i}>>"
# Pad placeholder to match original length (for fixed-width layouts)
python3 scripts/redact.py log.txt safe.txt --preserve-length
Credit-card matches are validated against the Luhn checksum, so most runs of 16 random digits don't trigger a false positive (roughly one random digit string in ten still passes the check).
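For readers unfamiliar with the Luhn check, here is a small sketch of how a candidate match could be validated before it is replaced. The candidate regex, the `redact_cards` name, and the 1-based counter are assumptions for illustration; only the placeholder template shape comes from the defaults listed above.

```python
import re
from typing import Dict


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


# Assumed candidate shape: 13 to 19 digits, optionally separated by spaces or dashes.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def redact_cards(text: str, template: str = "[REDACTED_{kind}_{i}]") -> str:
    """Replace Luhn-valid card numbers; the same number gets the same placeholder."""
    seen: Dict[str, str] = {}

    def repl(match: "re.Match") -> str:
        digits = re.sub(r"\D", "", match.group(0))
        if not luhn_ok(digits):
            return match.group(0)          # leave non-card digit runs untouched
        if digits not in seen:
            seen[digits] = template.format(kind="credit-card", i=len(seen) + 1)
        return seen[digits]

    return CARD_RE.sub(repl, text)
```

Keying the `seen` map on the bare digits means "4111 1111 1111 1111" and "4111-1111-1111-1111" share one placeholder, which is one plausible reading of what --keep-counts is for.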
5. Line utilities
# Quick file stats
python3 scripts/lines.py haystack.txt --op count
# Drop duplicates, case-insensitive
python3 scripts/lines.py users.txt --op dedupe --case-insensitive --output unique.txt
# Numeric sort (so "100" > "23" > "7")
python3 scripts/lines.py scores.txt --op sort --numeric --reverse
# Deterministic shuffle
python3 scripts/lines.py prompts.txt --op shuffle --seed 42
# Look at the head and tail of a multi-gig log
python3 scripts/lines.py huge.log --op head -n 20
python3 scripts/lines.py huge.log --op tail -n 20
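The memory behaviour of these operations follows from how they can be written with the standard library. The sketch below is an assumption-labeled illustration, not the code in lines.py: the function names and the ordering chosen for keep="last" are mine.

```python
import random
from collections import deque
from typing import Dict, Iterable, List


def dedupe(lines: Iterable[str], case_insensitive: bool = False, keep: str = "first") -> List[str]:
    """Drop duplicate lines; needs memory proportional to the distinct lines."""
    key = str.casefold if case_insensitive else (lambda s: s)
    if keep == "first":
        seen = set()
        out: List[str] = []
        for line in lines:
            k = key(line)
            if k not in seen:
                seen.add(k)
                out.append(line)
        return out
    # keep == "last": keep the final spelling, ordered by last occurrence (assumed).
    latest: Dict[str, str] = {}
    for line in lines:
        k = key(line)
        latest.pop(k, None)   # re-insert so dict order tracks the last occurrence
        latest[k] = line
    return list(latest.values())


def tail(lines: Iterable[str], n: int) -> List[str]:
    """Stream the input and remember only the final n lines."""
    return list(deque(lines, maxlen=n))


def shuffled(lines: Iterable[str], seed: int) -> List[str]:
    """Shuffle with a private RNG so the same seed gives the same order."""
    out = list(lines)
    random.Random(seed).shuffle(out)
    return out
```

Using `random.Random(seed)` rather than the module-level functions keeps the shuffle reproducible without touching global random state.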
6. Word counts
# Basic stats
python3 scripts/wordcount.py essay.txt
# Top words with stopwords filter
python3 scripts/wordcount.py essay.txt --top 20 --ignore-case --stopwords stop.txt
# Machine-readable output
python3 scripts/wordcount.py essay.txt --top 10 --json > stats.json
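A rough picture of what a --top N count involves, using `collections.Counter` and the documented default word pattern. The function name `top_words` and the tie-breaking rule (count descending, then alphabetical) are assumptions chosen to give stable output; wordcount.py may order ties differently.

```python
import re
from collections import Counter
from typing import FrozenSet, List, Tuple

WORD_RE = re.compile(r"[A-Za-z']+")  # the default --regex pattern from the feature list


def top_words(text: str, n: int, ignore_case: bool = True,
              stopwords: FrozenSet[str] = frozenset(),
              min_length: int = 1) -> List[Tuple[str, int]]:
    """Return the n most frequent words as (word, count) pairs."""
    words = WORD_RE.findall(text)
    if ignore_case:
        words = [w.lower() for w in words]
    counts = Counter(w for w in words if len(w) >= min_length and w not in stopwords)
    # Sort by count descending, then alphabetically, so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]
```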
7. Text diff
# Standard unified diff
python3 scripts/diff_text.py before.txt after.txt
# Side-by-side
python3 scripts/diff_text.py before.txt after.txt --mode side
# HTML report (colorized) for sharing
python3 scripts/diff_text.py before.txt after.txt --mode html --output diff.html
# Whitespace-insensitive compare
python3 scripts/diff_text.py before.txt after.txt --ignore-whitespace
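Since the diffs are built on stdlib difflib, the unified and HTML modes can be approximated in a few lines. This is a sketch of the underlying library calls under assumed file labels, not the code in diff_text.py, and it leaves out the custom side-by-side layout and the ignore options.

```python
import difflib


def unified(before: str, after: str, context: int = 3) -> str:
    """Unified diff of two strings; empty string when they are identical."""
    a = before.splitlines(keepends=True)
    b = after.splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(a, b, fromfile="before.txt", tofile="after.txt", n=context)
    )


def html_report(before: str, after: str, context: int = 3) -> str:
    """Full HTML page produced by difflib.HtmlDiff, showing only changed regions."""
    return difflib.HtmlDiff().make_file(
        before.splitlines(), after.splitlines(),
        "before.txt", "after.txt",
        context=True, numlines=context,
    )
```

An empty result from `unified` maps naturally onto the "files identical" case in the exit-code table.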
Exit codes
| Code | Meaning |
|---|---|
| 0 | success / one or more matches / files identical |
| 1 | zero matches / zero redactions / files differ / empty input |
| 2 | bad arguments / unsafe path / missing input / unknown kind / bad regex / unsupported output extension |
This 0 / 1 / 2 split is consistent across all six scripts so they slot into shell pipelines cleanly:
# Normalize, then redact, then count words in one shot
python3 scripts/normalize.py raw.txt clean.txt --to-unix --dehyphenate \
&& python3 scripts/redact.py clean.txt safe.txt \
&& python3 scripts/wordcount.py safe.txt --top 10
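One consequence of the table above: because exit code 1 covers "zero redactions", a chain joined with && stops when redact.py finds nothing to redact. A caller that wants to continue past "nothing found" but still stop on real errors could tell the two apart, as in this sketch. The `run_pipeline` helper and the status labels are hypothetical names, not part of the toolkit.

```python
import subprocess
import sys
from typing import List, Sequence

PYTHON = sys.executable  # run the scripts with the current interpreter

# Exit-code contract from the table above.
STATUS = {0: "ok", 1: "empty", 2: "error"}


def run_pipeline(steps: Sequence[Sequence[str]]) -> List[str]:
    """Run each argv in order; keep going after "empty", stop at "error"."""
    statuses: List[str] = []
    for argv in steps:
        code = subprocess.run(list(argv)).returncode
        status = STATUS.get(code, "error")  # treat unknown codes as errors
        statuses.append(status)
        if status == "error":
            break
    return statuses
```

A step would be passed as an argv list such as `[PYTHON, "scripts/redact.py", "clean.txt", "safe.txt"]`. Whether a later step still has a usable input file after an "empty" result depends on each script's behaviour, so check that before relying on this pattern.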
Safety properties
- Pure Python 3 standard library. No third-party dependencies, no pip install.
- No subprocess calls. No shell invocation.
- All file paths are validated against a strict allowlist regex that rejects shell metacharacters (;, |, &, >, <, $, `, etc.). The same safe_path() helper that powers clean-csv-toolkit.
- Scripts only read the input paths the caller provides and write to the output paths the caller provides.
- All inputs and outputs default to UTF-8; reads fall back through utf-8-sig, cp1252, latin-1 if needed. Writes are always UTF-8.
- Deterministic where it matters: shuffle --seed N is reproducible; extract and wordcount always emit results in the same order for a given input.
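The path allowlist and the encoding fallback are both simple to picture. The sketch below is a guess at the general shape only: the allowed character set, the function bodies, and the name `decode_with_fallback` are assumptions, and the real safe_path() in _common.py may be stricter or accept a different set.

```python
import re
from pathlib import Path

# Hypothetical allowlist: letters, digits, and common path punctuation.
# Anything else (; | & > < $ ` and so on) fails the match.
_SAFE_PATH_RE = re.compile(r"[A-Za-z0-9_./\\: -]+")


def safe_path(raw: str) -> Path:
    """Return a Path if every character is on the allowlist, else raise ValueError."""
    if not _SAFE_PATH_RE.fullmatch(raw):
        raise ValueError(f"unsafe path: {raw!r}")
    return Path(raw)


# utf-8-sig is tried first here because it also decodes BOM-less UTF-8;
# the toolkit's actual order may differ.
_ENCODINGS = ("utf-8-sig", "cp1252", "latin-1")


def decode_with_fallback(data: bytes) -> str:
    """Decode bytes with the first encoding that accepts them."""
    for encoding in _ENCODINGS:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this line is not expected to run.
    raise ValueError("could not decode input")
```

An allowlist fails closed: a character nobody thought about is rejected by default, which is the safer direction for a tool that handles caller-supplied paths.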
Performance
- lines.py --op dedupe processes 100,000 short lines (500 distinct) in ~0.06 s.
- lines.py --op sort processes 100,000 lines in ~0.10 s.
- extract.py scans the file in a single streaming pass; memory does not grow with file size.
Known limitations
- The PII patterns are pragmatic heuristics, not strict RFC validators. The email regex accepts user@host.tld shapes but does not validate that host.tld resolves. phone accepts three telltale formats (+<digits>, (XXX) XXX-XXXX, XXX-XXX-XXXX / XXX XXX XXXX) so it doesn't grab IPs, dates, or credit-card numbers, but it will miss exotic local formats.
- credit-card uses the Luhn checksum, but hex-token (and similar high-recall patterns) intentionally over-match; review the count before sharing redacted output publicly.
- diff_text.py --mode html produces the standard difflib.HtmlDiff markup, which embeds its styles in the file. The file is portable but the styling is not customizable.
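To show how three explicit phone shapes can avoid colliding with IPs, ISO dates, and spaced card numbers, here is one possible pattern. It is a sketch written for this README, not the regex shipped in the toolkit; the 7 to 15 digit bound on the international form and the lookaround guards are assumptions.

```python
import re

PHONE_RE = re.compile(
    r"""
    (?<![\w.])                      # not glued to a preceding word char or dot
    (?:
        \+\d{7,15}                  # international: plus sign, then digits
      | \(\d{3}\)\ \d{3}-\d{4}      # parenthesized area code
      | \d{3}([- ])\d{3}\1\d{4}     # 3-3-4 with one consistent separator
    )
    (?!\w|[.-]\d)                   # not followed by more digits or a dotted/dashed digit
    """,
    re.VERBOSE,
)


def find_phones(text: str) -> list:
    """Return every phone-shaped match in reading order."""
    return [m.group(0) for m in PHONE_RE.finditer(text)]
```

The backreference forces the same separator in both gaps, and the lookarounds stop the 3-3-4 shape from matching inside a longer digit run such as a card number.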
v0.1.0 changes
- First public release of clean-text-toolkit.
- Six scripts: extract.py, normalize.py, redact.py, lines.py, wordcount.py, diff_text.py.
- Shared _common.py with safe_path, read_text, iter_lines, and write_text helpers (mirrors the design of clean-csv-toolkit/scripts/_common.py).
- Bug fixed during development: the initial phone regex was too greedy and matched IPs / ISO dates / credit-card-with-spaces; tightened to three explicit shapes (international, parenthesized, 3-3-4 dashed) that don't collide with those other patterns. Tested against a mixed-content fixture with 5 valid phones and 3 confusable non-phones.
- Zero third-party dependencies; works on any system that ships Python 3.
Pairs well with
- clean-csv-toolkit: same author, same design philosophy (pure stdlib, exit-code contract, safe-path policy), for structured tabular data.
- openclaw-prompt-shield: pair extract.py --kind email,url with prompt-shield's redaction pipeline to scrub user-supplied text before passing it to an LLM.
License
MIT
- Make sure OpenClaw is installed (local or Docker)
- Run the install command in chat: /install clean-text-toolkit
- After installation, invoke the skill by name or use /clean-text-toolkit
- Provide required inputs per the skill's parameter spec and get structured output
What is Clean Text Toolkit?
Local text cleanup and inspection toolkit. Extract structured items (URLs, emails, phones, IPs, dates, hashtags, money), redact PII with custom placeholders... It is an AI Agent Skill for Claude Code / OpenClaw, with 72 downloads so far.
How do I install Clean Text Toolkit?
Run "/install clean-text-toolkit" in the OpenClaw or Claude Code chat to install it in one step — no extra setup required.
Is Clean Text Toolkit free?
Yes, Clean Text Toolkit is completely free, licensed under MIT-0. You can download, install and use it at no cost.
Which platforms does Clean Text Toolkit support?
Clean Text Toolkit is cross-platform and runs anywhere OpenClaw / Claude Code is available.
Who created Clean Text Toolkit?
It is built and maintained by gopendrasharma89-tech (@gopendrasharma89-tech); the current version is v0.1.0.