文本三剑客
第4章:文本三剑客深度实战
grep 正则表达式完全指南(BRE/ERE/PCRE),awk 结构化文本处理(字段分隔、内置变量、BEGIN/END、关联数组),sed 流编辑器(地址寻址、替换、删除、插入、多行操作),以及 tr/diff/patch/sort/uniq 等辅助工具的实战用法。
grep 完全指南
grep(Global Regular Expression Print)是 Linux 文本搜索的核心工具。它支持三种正则引擎:基本正则(BRE)、扩展正则(ERE,通过 -E 启用)和 Perl 兼容正则(PCRE,通过 -P 启用)。
常用选项速查
- -i:忽略大小写
- -v:反向匹配(输出不匹配的行)
- -r / -R:递归搜索目录
- -l:只输出匹配的文件名(不显示匹配内容)
- -n:显示匹配行的行号
- -c:只输出匹配行的计数
- -o:只输出匹配部分(每个匹配单独一行)
- -A N:显示匹配行后 N 行(After)
- -B N:显示匹配行前 N 行(Before)
- -C N:显示匹配行前后各 N 行(Context)
- --color:高亮显示匹配部分
- -w:全词匹配(匹配完整单词)
- -x:全行匹配(整行完全匹配)
- -e:指定多个模式(逻辑 OR)
- -f:从文件读取模式列表
# 基本搜索
grep "error" /var/log/syslog
grep -i "error" /var/log/syslog # 忽略大小写
grep -v "DEBUG" app.log # 过滤掉 DEBUG 行
grep -n "fatal" app.log # 显示行号
grep -c "404" access.log # count lines containing 404 (not total occurrences)
# 递归搜索
grep -r "TODO" ./src/ # 搜索目录
grep -rl "FIXME" ./src/ # 只显示文件名
grep -rn "panic" ./ --include="*.go" # 只搜索 Go 文件
# 上下文行
grep -A3 "ERROR" app.log # 错误行及其后 3 行
grep -B2 "FATAL" app.log # 错误行及其前 2 行
grep -C2 "exception" app.log # 前后各 2 行
# 多模式
grep -e "error" -e "warn" -e "fatal" app.log
# 只输出匹配内容
grep -o "[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}" access.log | sort -u
BRE vs ERE:正则引擎差异
BRE(基本正则)是 grep 默认引擎,特殊字符需要反斜杠转义才有特殊含义。ERE(grep -E 或 egrep)则相反,特殊字符直接有含义,无需转义。
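{{if eq .Lang "zh"}}
| Feature | BRE (grep) | ERE (grep -E) |
|---|---|---|
| Grouping | `\(abc\)` | `(abc)` |
| One or more | `\+` (GNU extension) | `+` |
| Zero or one | `\?` (GNU extension) | `?` |
| Alternation | backslash + pipe (GNU extension) | pipe |
| Exact repeat | `\{3\}` | `{3}` |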
# BRE 写法(默认)
grep "colou\?r" file.txt # 匹配 color 或 colour
grep "\(error\|warn\)" app.log # 匹配 error 或 warn
# ERE 写法(-E 更简洁)
grep -E "colou?r" file.txt
grep -E "error|warn|fatal" app.log
grep -E "^[0-9]{4}-[0-9]{2}-[0-9]{2}" access.log # 日期格式行
grep -E "\b[A-Z]{2,}\b" doc.txt # 全大写单词
# 常用 ERE 元字符
# . 任意单个字符
# * 零次或多次
# + 一次或多次
# ? 零次或一次
# ^ 行首
# $ 行尾
# [] 字符集合
# [^] 否定字符集合
# \b 单词边界
# \w 单词字符 [a-zA-Z0-9_]
# \s 空白字符
PCRE:Perl 兼容正则(grep -P)
PCRE 提供了最强大的正则语法,包括 \d \s \w、懒惰匹配(?)、前瞻/后顾断言等。
# PCRE 专用语法(需要 -P)
grep -P "\d{4}-\d{2}-\d{2}" log.txt # 日期(\d = [0-9])
grep -P "\s+" file.txt # 一个或多个空白
grep -P "https?://\S+" page.html # URL 提取(-o 配合)
# 懒惰匹配(非贪婪,尽可能少匹配)
grep -oP "<.+?>" index.html # extract HTML tags
# 前瞻断言(lookahead):匹配后面跟着某内容的位置
grep -P "foo(?=bar)" file.txt # 匹配后面跟 bar 的 foo
grep -P "foo(?!bar)" file.txt # 匹配后面不跟 bar 的 foo
# 后顾断言(lookbehind)
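grep -P "(?<=foo)bar" file.txt # bar preceded by foo
grep -P "(?<!foo)bar" file.txt # bar NOT preceded by foo
**The power of -o:** The -o flag makes grep print only the matched text instead of the whole line, one match per line. This turns grep from a "line filter" into an "extractor": combined with -P patterns, it can pull structured data (IPs, URLs, numbers) directly out of logs.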
## awk 完整教程
awk 是一门完整的数据处理语言,不只是"文本处理工具"。它的核心思想是:逐行读取输入,按分隔符切分字段,对每行执行用户定义的操作。
### 工作原理与内置变量
- **$0**:整行内容
- **$1, $2, ... $NF**:第1、2、...最后一个字段
- **NF**:当前行的字段总数
- **NR**:当前行的行号(跨文件递增)
- **FNR**:当前文件中的行号(每个文件从1开始)
- **FS**:输入字段分隔符(默认:空白)
- **OFS**:输出字段分隔符(默认:空格)
- **RS**:输入记录分隔符(默认:换行符)
- **ORS**:输出记录分隔符(默认:换行符)
- **FILENAME**:当前处理的文件名
```bash
# 基本用法:打印第1和第3字段
awk '{print $1, $3}' file.txt
# 指定分隔符:冒号分割的 /etc/passwd
awk -F: '{print $1, $3}' /etc/passwd # 用户名和 UID
awk -F: '{print $1":"$3}' /etc/passwd # 自定义输出格式
# OFS:修改输出分隔符
awk -F: 'BEGIN{OFS=","} {print $1,$3,$7}' /etc/passwd
# 行号和字段数
awk '{print NR, NF, $0}' file.txt
# 最后一个字段
awk '{print $NF}' file.txt
# 倒数第二个字段
awk '{print $(NF-1)}' file.txt
```
条件过滤与模式匹配
# 正则匹配:打印包含 error 的行
awk '/error/' app.log
awk '/error/{print NR": "$0}' app.log # 带行号
# 反向匹配
awk '!/debug/' app.log
# 数值条件
awk -F: '$3 >= 1000' /etc/passwd # UID >= 1000 的用户
awk '{if ($5 > 100) print $1, $5}' data.txt
# 字符串比较
awk '$2 == "FAILED"' results.txt
awk '$1 ~ /^192\.168\./' access.log # IP 以 192.168. 开头
# 范围模式(从匹配行到结束标记)
awk '/START/,/END/' file.txt # 打印 START 到 END 之间的行
awk 'NR==10,NR==20' file.txt # 打印第10到第20行
BEGIN 和 END 块
# BEGIN:处理任何行之前执行
# END:处理完所有行之后执行
awk -F, 'BEGIN{total=0} {total+=$5} END{print "Total:", total}' sales.csv
# 统计行数、列数
awk 'BEGIN{print "File analysis:"} {lines++} END{print "Lines:", lines}' data.txt
# 格式化报告
awk -F, 'BEGIN{
printf "%-20s %10s %10s\n", "Name", "Sales", "Commission"
printf "%-20s %10s %10s\n", "----", "-----", "----------"
}
{
commission = $2 * 0.1
printf "%-20s %10.2f %10.2f\n", $1, $2, commission
}
END{
print "Processing complete."
}' sales.csv
字符串函数
# length:字符串长度
awk '{print length($0), $0}' file.txt
# substr:子字符串(从第N位开始取M个字符)
awk '{print substr($1, 1, 4)}' dates.txt
# index:查找子串位置(0表示不存在)
awk '{print index($0, "error")}' log.txt
# split:分割字符串为数组
awk '{n=split($1, arr, ":"); for(i=1;i<=n;i++) print arr[i]}' file.txt
# Slow requests (last field > 1000 ms)
awk '$NF > 1000 {print $0}' timing.log
**awk 的核心优势:**awk 原生支持关联数组(即其他语言的 map/dict),这让统计频率、分组求和等任务在一行命令内就能完成,无需写脚本文件。对于结构化文本(日志、CSV、/etc/passwd),awk 的效率远超 Python 脚本。
sed 完整教程
sed(Stream EDitor)是流编辑器,逐行读取输入,按照脚本中的命令修改后输出。它的强大之处在于非交互式批量修改文件,以及精确的地址寻址能力。
基本替换语法
# 基本替换:s/旧/新/
sed 's/foo/bar/' file.txt # 每行替换第一个 foo
sed 's/foo/bar/g' file.txt # 每行替换所有 foo
sed 's/foo/bar/2' file.txt # 每行替换第二个 foo
sed 's/foo/bar/gi' file.txt # 全部替换,忽略大小写
# 原地修改文件(注意:-i 在 macOS 和 Linux 行为不同)
sed -i 's/old/new/g' file.txt # Linux
sed -i '' 's/old/new/g' file.txt # macOS(-i 需要空字符串参数)
# 多个表达式(-e)
sed -e 's/foo/bar/g' -e 's/baz/qux/g' file.txt
# 使用不同分隔符(当模式含斜杠时特别有用)
sed 's|/usr/local|/opt|g' paths.txt
sed 's#https://old.com#https://new.com#g' urls.txt
地址寻址
sed 最强大的特性之一是可以对特定行或行范围执行命令,而不是全文处理。
# 行号地址
sed '5s/foo/bar/' file.txt # 只修改第5行
sed '5,10s/foo/bar/g' file.txt # 修改第5到第10行
sed '5,$s/foo/bar/g' file.txt # 从第5行到最后一行
# 正则地址
sed '/^#/d' config.txt # 删除注释行
sed '/error/s/warn/ERROR/g' log # 只在包含 error 的行中替换
# 否定地址(!)
sed '/pattern/!d' file.txt # 只保留匹配行(等效于 grep)
sed '1!d' file.txt # 只保留第1行
# 步进地址(每N行)
sed '0~2d' file.txt # 删除偶数行(GNU sed)
sed '1~2d' file.txt # 删除奇数行
删除、插入与追加命令
# d 命令:删除匹配行
sed '/^$/d' file.txt # 删除空行
sed '/^#/d' file.txt # 删除注释行
sed '/^\s*$/d' file.txt # 删除仅含空白的行
sed '1d' file.txt # 删除第1行(去掉标题行)
sed '$d' file.txt # 删除最后一行
# i 命令:在匹配行前插入
sed '/^server/i\# nginx server block' nginx.conf
# a 命令:在匹配行后追加
sed '/^worker_processes/a\worker_rlimit_nofile 65535;' nginx.conf
# c 命令:替换整行
sed '/^Port 22/c\Port 2222' /etc/ssh/sshd_config
# p 命令:打印(常配合 -n 只打印匹配行)
sed -n '10,20p' file.txt # 打印第10到20行(类似 head/tail 组合)
sed -n '/error/p' app.log # 只打印匹配行(等效于 grep)
多行操作
# N 命令:读取下一行追加到模式空间
# 合并相邻两行(去掉两行之间的换行)
sed 'N;s/\n/ /' file.txt
# 删除文件中的 Windows 换行符(CRLF -> LF)
sed 's/\r$//' file.txt
# 或者
sed -i 's/\r//' file.txt
# 删除连续空行(压缩为单个空行)
sed '/^$/{
N
/^\n$/D
}' file.txt
# Extract the lines between an opening and a closing HTML tag (body is an illustrative tag name)
sed -n '/<body>/,/<\/body>/p' index.html
sed 实战案例
# 批量替换配置文件中的域名
sed -i 's/old-domain\.com/new-domain.com/g' /etc/nginx/sites-enabled/*.conf
# 删除 HTML 标签(简单场景)
sed 's/<[^>]*>//g' page.html
# 在文件第一行前插入版权声明
sed -i '1i\# Copyright 2026 YiteAI. All rights reserved.' *.sh
# 提取两个标记之间的内容
sed -n '/BEGIN CERTIFICATE/,/END CERTIFICATE/p' cert.pem
# 给每行添加行号
sed = file.txt | sed 'N;s/\n/\t/'
# 反转文件(tac 的 sed 实现)
sed -n '1!G;h;$p' file.txt
# macOS 与 Linux -i 差异
# Linux:sed -i 's/a/b/' file.txt
# macOS:sed -i '' 's/a/b/' file.txt
# 跨平台写法:先备份再处理
sed -i.bak 's/a/b/' file.txt # 两个平台都支持(生成 .bak 备份)
**macOS 的 -i 陷阱:**macOS 的 sed 来自 BSD,-i 后面必须跟一个备份后缀参数(可以是空字符串 '')。Linux 的 GNU sed 则 -i 后面直接跟文件名。跨平台脚本建议用 sed -i.bak 的形式,或安装 GNU sed(brew install gnu-sed)。
其他文本处理工具
tr:字符级转换
tr 比 sed 更底层,它在字符级别进行替换、删除和压缩操作,速度极快。
# 大小写转换
echo "Hello World" | tr 'a-z' 'A-Z'
echo "Hello World" | tr 'A-Z' 'a-z'
# 删除字符(-d)
echo "He110 W0r1d" | tr -d '0-9' # 删除所有数字
echo "remove spaces" | tr -d ' ' # 删除所有空格
cat file.txt | tr -d '\r' # 删除 Windows 换行符
# 压缩重复字符(-s squeezing)
echo "Hello    World" | tr -s ' ' # squeeze multiple spaces into one
echo "aaabbbccc" | tr -s 'a-z' # 压缩连续重复字母
# 字符替换
echo "2026-04-25" | tr '-' '/' # 将横线替换为斜杠
echo "a:b:c" | tr ':' '\n' # 将冒号替换为换行(分行输出)
# 生成随机密码(结合 /dev/urandom)
cat /dev/urandom | tr -dc 'a-zA-Z0-9' | head -c 16
sort 与 uniq:排序去重组合
# sort 选项
sort file.txt # 字典序排序
sort -n numbers.txt # 数字排序
sort -rn numbers.txt # 逆序数字排序
sort -t, -k2,2 data.csv # sort by field 2 only, comma delimiter
sort -k2,2n -k1,1 data.txt # 先按第2字段数字排,再按第1字段字典排
sort -u names.txt # 排序并去重
sort -V versions.txt # 版本号排序(GNU sort)
# uniq 选项(必须先排序)
sort file.txt | uniq # 去重
sort file.txt | uniq -c # 计数(每行出现次数)
sort file.txt | uniq -d # 只显示重复行
sort file.txt | uniq -u # 只显示唯一行(未重复)
sort file.txt | uniq -i # 忽略大小写去重
# 经典组合:统计最频繁出现的内容
sort access.log | uniq -c | sort -rn | head -20
# 统计 IP 访问频率
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10
diff 与 patch:文件比较与补丁
# 基本比较
diff file1.txt file2.txt
# Unified 格式(最常用,patch 工具可直接使用)
diff -u original.txt modified.txt
# 生成补丁文件
diff -u original.conf new.conf > changes.patch
# 应用补丁
patch original.conf < changes.patch
# strings: only print sequences of length >= 8
strings -n 8 suspicious.exe
# 安全分析:检查可执行文件中的可疑字符串
strings suspicious.exe | grep -E "http|cmd|exec|passwd"
# 从固件/镜像中提取版本信息
strings firmware.bin | grep -i "version\|release"
jq:JSON 处理简介
# 安装
sudo apt install jq
# 格式化 JSON(美化输出)
cat data.json | jq '.'
curl -s https://api.example.com/data | jq '.'
# 提取字段
echo '{"name":"Alice","age":30}' | jq '.name'
# "Alice"
# 数组操作
echo '[1,2,3,4,5]' | jq '.[]' # 展开数组
echo '[1,2,3]' | jq 'map(. * 2)' # 数组映射
# 过滤
cat users.json | jq '.[] | select(.age > 25) | .name'
# 组合多字段
cat data.json | jq '.[] | {name: .name, email: .email}'
# 从 API 响应中提取并格式化
curl -s "https://api.github.com/repos/torvalds/linux/commits?per_page=5" | \
jq '.[] | {sha: .sha[0:8], message: .commit.message[0:60]}'
**文本三剑客的协作:**grep/awk/sed 最强的状态是组合使用。典型流水线:用 grep 过滤出相关行 → 用 awk 切割字段和统计 → 用 sed 格式化输出。每个工具做自己最擅长的事,通过管道串联,比单独使用任何一个工具都更强大。
{{else}}
Chapter 4: Text Processing Power Tools
Complete grep regex guide (BRE/ERE/PCRE), awk structured text processing (field separation, built-in variables, BEGIN/END, associative arrays), sed stream editor (addressing, substitution, deletion, insertion, multiline), and practical usage of tr/diff/patch/sort/uniq.
grep Complete Guide
grep (Global Regular Expression Print) is the core Linux text search tool. It supports three regex engines: Basic Regular Expressions (BRE), Extended Regular Expressions (ERE, via -E), and Perl-Compatible Regular Expressions (PCRE, via -P).
Common Options Quick Reference
- -i: Case-insensitive search
- -v: Invert match (output non-matching lines)
- -r / -R: Recursive directory search
- -l: Output matching filenames only (no content)
- -n: Show line numbers for matches
- -c: Output count of matching lines only
- -o: Output only the matched portion (one match per line)
- -A N: Show N lines After the match
- -B N: Show N lines Before the match
- -C N: Show N lines of Context (before and after)
- --color: Highlight matched portions
- -w: Match whole words only
- -x: Match whole lines only
- -e: Specify multiple patterns (logical OR)
- -f: Read patterns from a file
# Basic search
grep "error" /var/log/syslog
grep -i "error" /var/log/syslog # case-insensitive
grep -v "DEBUG" app.log # filter out DEBUG lines
grep -n "fatal" app.log # show line numbers
grep -c "404" access.log # count lines containing 404 (not total occurrences)
# Recursive search
grep -r "TODO" ./src/ # search directory
grep -rl "FIXME" ./src/ # filenames only
grep -rn "panic" ./ --include="*.go" # only Go files
# Context lines
grep -A3 "ERROR" app.log # match + 3 lines after
grep -B2 "FATAL" app.log # match + 2 lines before
grep -C2 "exception" app.log # 2 lines before and after
# Multiple patterns
grep -e "error" -e "warn" -e "fatal" app.log
# Extract only matched parts
grep -o "[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}" access.log | sort -u
BRE vs ERE: Regex Engine Differences
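BRE (Basic Regular Expressions) is grep's default syntax. In BRE, `.`, `*`, `^`, `$`, and `[]` are special as written, but grouping, repetition intervals, and alternation must be prefixed with a backslash. ERE (grep -E; the egrep alias is deprecated) treats those operators as special without a backslash. The BRE forms `\+`, `\?`, and `\|` are GNU extensions rather than POSIX.
| Feature | BRE (grep) | ERE (grep -E) |
|---|---|---|
| Grouping | `\(abc\)` | `(abc)` |
| One or more | `\+` (GNU extension) | `+` |
| Zero or one | `\?` (GNU extension) | `?` |
| Alternation | backslash + pipe (GNU extension) | pipe |
| Exact repeat | `\{3\}` | `{3}` |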
# BRE style (default)
grep "colou\?r" file.txt # matches "color" or "colour"
grep "\(error\|warn\)" app.log # matches error or warn
# ERE style (-E is more readable)
grep -E "colou?r" file.txt
grep -E "error|warn|fatal" app.log
grep -E "^[0-9]{4}-[0-9]{2}-[0-9]{2}" access.log # date-format lines
grep -E "\b[A-Z]{2,}\b" doc.txt # all-uppercase words
# Common ERE metacharacters (\b, \w, and \s are GNU extensions, not POSIX ERE):
# . any single character
# * zero or more
# + one or more
# ? zero or one
# ^ start of line
# $ end of line
# [] character class
# [^] negated character class
# \b word boundary
# \w word character [a-zA-Z0-9_]
# \s whitespace
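As a quick check of the BRE/ERE correspondence, the sketch below runs the same two patterns in both syntaxes against a small made-up sample and compares the results. The sample words and variable names are illustrative only, and because `\?` and `\|` in BRE are GNU extensions, it assumes GNU grep.

```shell
# Sample input (made up for illustration)
sample=$(printf '%s\n' "color" "colour" "colr" "error: disk" "warn: cpu" "info: ok")

# Optional character: BRE writes \? where ERE writes a bare ?
bre_opt=$(printf '%s\n' "$sample" | grep 'colou\?r')
ere_opt=$(printf '%s\n' "$sample" | grep -E 'colou?r')

# Grouping and alternation: BRE writes \( \| \) where ERE writes ( | )
bre_alt=$(printf '%s\n' "$sample" | grep '^\(error\|warn\)')
ere_alt=$(printf '%s\n' "$sample" | grep -E '^(error|warn)')

printf 'optional match:\n%s\n' "$ere_opt"
printf 'alternation match:\n%s\n' "$ere_alt"
```

Both syntaxes should select the same lines; only the escaping differs, which is why -E is usually easier to read once a pattern has more than one group.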
PCRE: Perl-Compatible Regex (grep -P)
PCRE provides the most powerful regex syntax, including \d \s \w, lazy matching, and lookahead/lookbehind assertions.
# PCRE-specific syntax (requires -P)
grep -P "\d{4}-\d{2}-\d{2}" log.txt # date (\d = [0-9])
grep -P "\s+" file.txt # one or more whitespace
grep -P "https?://\S+" page.html # URL extraction (use with -o)
# Lazy matching (non-greedy)
grep -oP "<.+?>" index.html # extract HTML tags
# Lookahead: match position followed by something
grep -P "foo(?=bar)" file.txt # foo followed by bar
grep -P "foo(?!bar)" file.txt # foo NOT followed by bar
# Lookbehind: match position preceded by something
grep -P "(?<=foo)bar" file.txt # bar preceded by foo
grep -P "(?<!foo)bar" file.txt # bar NOT preceded by foo
**The power of -o:** The -o flag makes grep print only the matched text instead of the whole line, one match per line. This turns grep from a "line filter" into an "extractor": combined with -P patterns, it can pull structured data (IPs, URLs, numbers) directly out of logs.
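To make the "extractor" idea concrete, here is a minimal sketch that pulls every IPv4-looking token out of a few made-up log lines. The log content and variable names are assumptions for illustration, and it uses -E rather than -P so that it does not depend on PCRE support being compiled in.

```shell
# Made-up access log lines for illustration
log=$(cat <<'EOF'
10.0.0.1 - - "GET /index.html" 200
10.0.0.2 - - "GET /missing" 404
10.0.0.1 - - "POST /login" 302 via 192.168.1.5
EOF
)

# -o prints each match on its own line, so one input line can yield several results
ips=$(printf '%s\n' "$log" | grep -oE '[0-9]{1,3}(\.[0-9]{1,3}){3}')

# Unique addresses, sorted
unique_ips=$(printf '%s\n' "$ips" | sort -u)

printf '%s\n' "$unique_ips"
```

Note that the pattern only checks the shape of an address (four dot-separated groups of one to three digits); it does not validate that each group is at most 255.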
## awk Complete Tutorial
awk is a complete data processing language, not just a "text tool." Its core idea: read input line by line, split fields by a separator, and execute user-defined operations on each line.
### How awk Works and Built-in Variables
- **$0**: Entire line
- **$1, $2, ... $NF**: Field 1, 2, ... last field
- **NF**: Number of fields in current line
- **NR**: Current line number (increments across files)
- **FNR**: Line number within current file (resets per file)
- **FS**: Input field separator (default: whitespace)
- **OFS**: Output field separator (default: space)
- **RS**: Input record separator (default: newline)
- **ORS**: Output record separator (default: newline)
- **FILENAME**: Current filename being processed
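The difference between NR and FNR is easiest to see with more than one input file. The following sketch creates two throwaway files (their names and contents are made up) and prints the built-in variables for each record, then shows FS and OFS working together.

```shell
# Two small throwaway files (contents are made up)
tmpdir=$(mktemp -d)
printf 'alice 30\nbob 25\n' > "$tmpdir/a.txt"
printf 'carol 41\n' > "$tmpdir/b.txt"

# NR keeps counting across files; FNR restarts at 1 for each file
report=$(cd "$tmpdir" && awk '{print FILENAME, NR, FNR, NF, $1}' a.txt b.txt)
printf '%s\n' "$report"

# FS/OFS: read colon-separated input, write comma-separated output
converted=$(printf 'root:x:0\n' | awk 'BEGIN{FS=":"; OFS=","} {print $1, $3}')
printf '%s\n' "$converted"

rm -rf "$tmpdir"
```

OFS is only inserted where `print` arguments are separated by commas; concatenating fields with spaces in the awk source (`print $1 $3`) joins them with no separator at all.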
```bash
# Basic: print fields 1 and 3
awk '{print $1, $3}' file.txt
# Custom separator: colon-delimited /etc/passwd
awk -F: '{print $1, $3}' /etc/passwd # username and UID
awk -F: '{print $1":"$3}' /etc/passwd # custom output format
# OFS: change output separator
awk -F: 'BEGIN{OFS=","} {print $1,$3,$7}' /etc/passwd
# Line numbers and field count
awk '{print NR, NF, $0}' file.txt
# Last field
awk '{print $NF}' file.txt
# Second-to-last field
awk '{print $(NF-1)}' file.txt
```
Conditional Filtering and Pattern Matching
# Regex match: print lines containing "error"
awk '/error/' app.log
awk '/error/{print NR": "$0}' app.log # with line numbers
# Inverted match
awk '!/debug/' app.log
# Numeric conditions
awk -F: '$3 >= 1000' /etc/passwd # users with UID >= 1000
awk '{if ($5 > 100) print $1, $5}' data.txt
# String comparison
awk '$2 == "FAILED"' results.txt
awk '$1 ~ /^192\.168\./' access.log # IPs starting with 192.168.
# Range pattern (between start and end markers)
awk '/START/,/END/' file.txt # print lines from START to END
awk 'NR==10,NR==20' file.txt # print lines 10 to 20
BEGIN and END Blocks
# BEGIN: runs before processing any line
# END: runs after processing all lines
awk -F, 'BEGIN{total=0} {total+=$5} END{print "Total:", total}' sales.csv
# Count lines
awk 'BEGIN{print "Analysis:"} {lines++} END{print "Lines:", lines}' data.txt
# Formatted report
awk -F, 'BEGIN{
printf "%-20s %10s %10s\n", "Name", "Sales", "Commission"
printf "%-20s %10s %10s\n", "----", "-----", "----------"
}
{
commission = $2 * 0.1
printf "%-20s %10.2f %10.2f\n", $1, $2, commission
}
END{
print "Done."
}' sales.csv
String Functions
# length: string length
awk '{print length($0), $0}' file.txt
# substr: substring (start at position N, take M chars)
awk '{print substr($1, 1, 4)}' dates.txt
# index: find substring position (0 = not found)
awk '{print index($0, "error")}' log.txt
# split: split string into array
awk '{n=split($1, arr, ":"); for(i=1;i<=n;i++) print arr[i]}' file.txt
# Slow requests (last field > 1000 ms)
awk '$NF > 1000 {print $0}' timing.log
awk's core advantage: awk natively supports associative arrays (equivalent to maps/dicts in other languages). This lets you count frequencies, group-sum, and pivot data in a single command — no script file needed. For structured text (logs, CSV, /etc/passwd), awk is far faster to write than a Python script.
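A minimal sketch of the two most common associative-array idioms, counting and group-summing, might look like this. The three-column log format (client IP, status, bytes) and the variable names are assumptions made up for the example.

```shell
# Made-up log: client IP, HTTP status, response bytes
log=$(cat <<'EOF'
10.0.0.1 200 512
10.0.0.2 404 128
10.0.0.1 200 2048
10.0.0.3 500 64
10.0.0.2 200 256
EOF
)

# Frequency count: count[] is keyed by the status string
status_counts=$(printf '%s\n' "$log" |
  awk '{count[$2]++} END{for (s in count) print s, count[s]}' | sort)

# Group sum: total bytes per client IP
bytes_per_ip=$(printf '%s\n' "$log" |
  awk '{bytes[$1] += $3} END{for (ip in bytes) print ip, bytes[ip]}' | sort)

printf '%s\n' "$status_counts"
printf '%s\n' "$bytes_per_ip"
```

The iteration order of `for (key in array)` is unspecified in awk, which is why both results are piped through sort before being used.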
sed Complete Tutorial
sed (Stream EDitor) reads input line by line, modifies it according to script commands, and outputs the result. Its power lies in non-interactive batch file modification and precise line addressing.
Basic Substitution Syntax
# Basic substitution: s/old/new/
sed 's/foo/bar/' file.txt # replace first foo per line
sed 's/foo/bar/g' file.txt # replace all foo per line
sed 's/foo/bar/2' file.txt # replace second foo per line
sed 's/foo/bar/gi' file.txt # replace all, case-insensitive (the i flag is a GNU extension)
# In-place editing (NOTE: -i differs between macOS and Linux)
sed -i 's/old/new/g' file.txt # Linux
sed -i '' 's/old/new/g' file.txt # macOS (-i requires an empty string)
# Multiple expressions (-e)
sed -e 's/foo/bar/g' -e 's/baz/qux/g' file.txt
# Alternative delimiters (useful when pattern contains slashes)
sed 's|/usr/local|/opt|g' paths.txt
sed 's#https://old.com#https://new.com#g' urls.txt
Address Ranges
One of sed's most powerful features is applying commands to specific lines or line ranges, not just the entire file.
# Line number addresses
sed '5s/foo/bar/' file.txt # only modify line 5
sed '5,10s/foo/bar/g' file.txt # modify lines 5 through 10
sed '5,$s/foo/bar/g' file.txt # from line 5 to end of file
# Regex addresses
sed '/^#/d' config.txt # delete comment lines
sed '/error/s/warn/ERROR/g' log # substitute only on lines containing "error"
# Negation (!)
sed '/pattern/!d' file.txt # keep only matching lines (like grep)
sed '1!d' file.txt # keep only line 1
# Step addressing (every N lines)
sed '0~2d' file.txt # delete even-numbered lines (GNU sed)
sed '1~2d' file.txt # delete odd-numbered lines
Delete, Insert, and Append
# d: delete matching lines
sed '/^$/d' file.txt # delete blank lines
sed '/^#/d' file.txt # delete comment lines
sed '/^\s*$/d' file.txt # delete whitespace-only lines
sed '1d' file.txt # delete first line (strip header)
sed '$d' file.txt # delete last line
# i: insert before matching line
sed '/^server/i\# nginx server block' nginx.conf
# a: append after matching line
sed '/^worker_processes/a\worker_rlimit_nofile 65535;' nginx.conf
# c: replace entire matching line
sed '/^Port 22/c\Port 2222' /etc/ssh/sshd_config
# p: print (often used with -n to print only matches)
sed -n '10,20p' file.txt # print lines 10 to 20
sed -n '/error/p' app.log # print matching lines (like grep)
Multiline Operations
# N: append next line to pattern space
# Join two adjacent lines (remove newline between them)
sed 'N;s/\n/ /' file.txt
# Remove Windows line endings (CRLF -> LF)
sed 's/\r$//' file.txt
# or
sed -i 's/\r//' file.txt
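# Compress runs of blank lines into a single blank line
# (D deletes only up to the first newline and restarts the cycle, so one blank line survives)
sed '/^$/{
N
/^\n$/D
}' file.txt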
# Extract the lines between an opening and a closing HTML tag (body is an illustrative tag name)
sed -n '/<body>/,/<\/body>/p' index.html
sed in Practice
# Batch replace domain in config files
sed -i 's/old-domain\.com/new-domain.com/g' /etc/nginx/sites-enabled/*.conf
# Strip HTML tags (simple cases)
sed 's/<[^>]*>//g' page.html
# Insert copyright header at line 1 of all shell scripts
sed -i '1i\# Copyright 2026 YiteAI. All rights reserved.' *.sh
# Extract content between markers
sed -n '/BEGIN CERTIFICATE/,/END CERTIFICATE/p' cert.pem
# Add line numbers to a file
sed = file.txt | sed 'N;s/\n/\t/'
# Cross-platform -i (works on both Linux and macOS)
sed -i.bak 's/a/b/' file.txt # creates a .bak backup on both platforms
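**The macOS -i trap:** macOS ships BSD sed, where -i requires a backup-suffix argument (an empty string '' means no backup). GNU sed on Linux treats the suffix as optional and only accepts it attached to the flag (-i.bak), so `sed -i '' ...` does not work there. For cross-platform scripts, use sed -i.bak, or install GNU sed on macOS with brew install gnu-sed (Homebrew installs it as gsed).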
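Another way to sidestep the -i difference entirely is to write to a temporary file and copy the result back. The helper below is a sketch under that assumption; `sed_inplace` is a hypothetical name, not a standard command, and the sample file content is made up.

```shell
# Hypothetical helper: apply a sed script to a file without relying on -i
sed_inplace() {
  script=$1
  file=$2
  tmp=$(mktemp) || return 1
  # Write the result to a temp file first; touch the original only on success
  if sed "$script" "$file" > "$tmp"; then
    cat "$tmp" > "$file"   # overwrite contents, keeping the original's permissions
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}

# Usage on a throwaway file
demo=$(mktemp)
printf 'listen 80;\nserver_name old.example;\n' > "$demo"
sed_inplace 's/old\.example/new.example/' "$demo"
result=$(cat "$demo")
rm -f "$demo"
printf '%s\n' "$result"
```

This trades the convenience of -i for portability: it uses only options that BSD and GNU sed share, at the cost of one extra copy of the file.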
Other Text Processing Tools
tr: Character-Level Translation
# Case conversion
echo "Hello World" | tr 'a-z' 'A-Z'
echo "Hello World" | tr 'A-Z' 'a-z'
# Delete characters (-d)
echo "He110 W0r1d" | tr -d '0-9' # delete all digits
echo "remove spaces" | tr -d ' ' # delete all spaces
cat file.txt | tr -d '\r' # remove Windows line endings
# Squeeze repeated characters (-s)
echo "Hello    World" | tr -s ' ' # compress multiple spaces to one
echo "aaabbbccc" | tr -s 'a-z' # squeeze repeated lowercase letters
# Character substitution
echo "2026-04-25" | tr '-' '/' # replace dashes with slashes
echo "a:b:c" | tr ':' '\n' # replace colons with newlines
# Generate a random 16-character password (LC_ALL=C avoids "illegal byte sequence" errors on macOS)
LC_ALL=C tr -dc 'a-zA-Z0-9' < /dev/urandom | head -c 16
sort and uniq: Sorting and Deduplication
# sort options
sort file.txt # lexicographic sort
sort -n numbers.txt # numeric sort
sort -rn numbers.txt # reverse numeric sort
sort -t, -k2,2 data.csv # sort by field 2 only, comma delimiter
sort -k2,2n -k1,1 data.txt # sort by field 2 numerically, then field 1
sort -u names.txt # sort and deduplicate
sort -V versions.txt # version-aware sort (GNU sort)
# uniq options (input must be sorted first)
sort file.txt | uniq # deduplicate
sort file.txt | uniq -c # count occurrences
sort file.txt | uniq -d # show only duplicate lines
sort file.txt | uniq -u # show only unique (non-repeated) lines
sort file.txt | uniq -i # case-insensitive dedup
# Classic pipeline: top 20 most frequent entries
sort access.log | uniq -c | sort -rn | head -20
# Top IPs by request count
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10
diff and patch: File Comparison and Patching
# Basic comparison
diff file1.txt file2.txt
# Unified format (most common; compatible with patch)
diff -u original.txt modified.txt
# Generate a patch file
diff -u original.conf new.conf > changes.patch
# Apply a patch
patch original.conf < changes.patch
# strings: only print sequences of length >= 8
strings -n 8 suspicious.exe
# Security analysis: look for suspicious strings in an executable
strings suspicious.exe | grep -E "http|cmd|exec|passwd"
# Extract version info from firmware/image
strings firmware.bin | grep -i "version\|release"
jq: JSON Processing Introduction
# Install
sudo apt install jq
# Pretty-print JSON
cat data.json | jq '.'
curl -s https://api.example.com/data | jq '.'
# Extract a field
echo '{"name":"Alice","age":30}' | jq '.name'
# "Alice"
# Array operations
echo '[1,2,3,4,5]' | jq '.[]' # expand array
echo '[1,2,3]' | jq 'map(. * 2)' # map over array
# Filter
cat users.json | jq '.[] | select(.age > 25) | .name'
# Select multiple fields
cat data.json | jq '.[] | {name: .name, email: .email}'
# Extract from GitHub API
curl -s "https://api.github.com/repos/torvalds/linux/commits?per_page=5" | \
jq '.[] | {sha: .sha[0:8], message: .commit.message[0:60]}'
The three tools working together: grep/awk/sed are most powerful in combination. A typical pipeline: use grep to filter relevant lines → awk to split fields and aggregate → sed to format output. Each tool does what it does best, connected through pipes — far more powerful than any single tool alone.
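The pipeline described above can be sketched as follows: grep selects the ERROR lines, awk counts them per user, and sed formats the final report. The log format, user names, and output wording are all made up for illustration.

```shell
# Made-up application log
log=$(cat <<'EOF'
2026-04-25 10:00:01 INFO  user=alice action=login
2026-04-25 10:00:05 ERROR user=bob action=upload
2026-04-25 10:01:12 ERROR user=alice action=download
2026-04-25 10:02:40 WARN  user=carol action=login
2026-04-25 10:03:03 ERROR user=bob action=delete
EOF
)

# 1. grep: keep only ERROR lines
# 2. awk:  strip the "user=" prefix from field 4 and count errors per user
# 3. sort: highest count first, ties broken by user name
# 4. sed:  rewrite "COUNT USER" as "USER: COUNT error(s)"
summary=$(printf '%s\n' "$log" |
  grep ' ERROR ' |
  awk '{sub(/^user=/, "", $4); n[$4]++} END{for (u in n) print n[u], u}' |
  sort -k1,1nr -k2,2 |
  sed 's/^\([0-9]*\) \(.*\)$/\2: \1 error(s)/')

printf '%s\n' "$summary"
```

awk alone could do the filtering and formatting as well; splitting the work this way mainly keeps each stage short enough to test on its own by cutting the pipeline at any `|`.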
{{end}}