Chapter 4: Text Processing Power Tools

This chapter covers grep and its regular-expression syntaxes (BRE/ERE/PCRE); structured text processing with awk (field separators, built-in variables, BEGIN/END, associative arrays); the sed stream editor (addressing, substitution, deletion, insertion, multiline operations); and practical use of supporting tools such as tr, sort, uniq, diff, patch, strings, and jq.

grep Complete Guide

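grep (commonly expanded as Global Regular Expression Print) is the core Linux text search tool. It supports three regex syntaxes: Basic Regular Expressions (BRE, the default), Extended Regular Expressions (ERE, via -E), and Perl-Compatible Regular Expressions (PCRE, via -P). Note that -P is a GNU grep feature and may be missing from other implementations, such as the BSD grep shipped with macOS.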

Common Options Quick Reference

-i: Case-insensitive search
-v: Invert match (output non-matching lines)
-r / -R: Recursive directory search (in GNU grep, -R also follows symbolic links)
-l: Output matching filenames only (no content)
-n: Show line numbers for matches
-c: Output count of matching lines only
-o: Output only the matched portion (one match per line)
-A N: Show N lines After the match
-B N: Show N lines Before the match
-C N: Show N lines of Context (before and after)
--color: Highlight matched portions
-w: Match whole words only
-x: Match whole lines only
-e PATTERN: Add a pattern; repeat -e to match any of several patterns (logical OR)
-f: Read patterns from a file

# Basic search
grep "error" /var/log/syslog
grep -i "error" /var/log/syslog    # case-insensitive
grep -v "DEBUG" app.log             # filter out DEBUG lines
grep -n "fatal" app.log             # show line numbers
grep -c "404" access.log            # count lines containing "404"

# Recursive search
grep -r "TODO" ./src/               # search directory
grep -rl "FIXME" ./src/             # filenames only
grep -rn "panic" ./ --include="*.go"  # only Go files

# Context lines
grep -A3 "ERROR" app.log    # match + 3 lines after
grep -B2 "FATAL" app.log    # match + 2 lines before
grep -C2 "exception" app.log  # 2 lines before and after

# Multiple patterns
grep -e "error" -e "warn" -e "fatal" app.log

# Extract only matched parts
grep -o "[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}" access.log | sort -u
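
Note that -c counts matching lines, not individual matches. The difference can be sketched as follows; the sample lines are made up for illustration and written to a throwaway file created with mktemp.

```bash
# Hypothetical sample: "404" appears three times, spread over two lines.
tmp=$(mktemp)
printf '%s\n' 'GET /a 404 retry 404' 'GET /b 200' 'GET /c 404' > "$tmp"

# -c reports the number of lines that contain at least one match.
line_count=$(grep -c "404" "$tmp")

# -o prints each match on its own line, so wc -l counts individual matches.
# tr strips the padding that some wc implementations add.
match_count=$(grep -o "404" "$tmp" | wc -l | tr -d ' ')

echo "lines=$line_count matches=$match_count"   # prints: lines=2 matches=3
rm -f "$tmp"
```

When a log line can contain the pattern more than once, the -o form is the one that answers "how many times did it occur".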

BRE vs ERE: Regex Engine Differences

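BRE (Basic Regular Expressions) is grep's default syntax. In BRE the characters ?, +, {, }, |, ( and ) are literal and need a backslash to become operators. ERE (grep -E; the older egrep command is deprecated) is the opposite: these characters are operators by default and need a backslash to be matched literally. The BRE forms \?, \+ and \| are GNU extensions rather than POSIX.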

Feature	BRE (grep)	ERE (grep -E)
Grouping	\(abc\)	(abc)
One or more	\+	+
Zero or one	\?	?
Alternation	\|	|
Exact repeat	\{3\}	{3}

# BRE style (default)
grep "colou\?r" file.txt           # matches "color" or "colour"
grep "\(error\|warn\)" app.log     # matches error or warn

# ERE style (-E is more readable)
grep -E "colou?r" file.txt
grep -E "error|warn|fatal" app.log
grep -E "^[0-9]{4}-[0-9]{2}-[0-9]{2}" access.log  # date-format lines
grep -E "\b[A-Z]{2,}\b" doc.txt    # all-uppercase words

# Common ERE metacharacters:
# .    any single character
# *    zero or more
# +    one or more
# ?    zero or one
# ^    start of line
# $    end of line
# []   character class
# [^]  negated character class
# \b   word boundary (GNU extension)
# \w   word character [a-zA-Z0-9_] (GNU extension)
# \s   whitespace (GNU extension)
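
To make the BRE/ERE difference concrete, here is a small sketch that runs the same pattern through both syntaxes. The three sample lines are invented for the demonstration.

```bash
# Made-up sample lines for illustration.
tmp=$(mktemp)
printf '%s\n' 'a+b' 'aab' 'ab' > "$tmp"

# BRE: an unescaped + is a literal plus sign, so only the line "a+b" matches.
bre=$(grep 'a+b' "$tmp")

# ERE: + means "one or more of the preceding item", so "aab" and "ab" match,
# while "a+b" does not (no "a" is directly followed by "b").
ere=$(grep -E 'a+b' "$tmp")

printf 'BRE:\n%s\nERE:\n%s\n' "$bre" "$ere"
rm -f "$tmp"
```

The pattern text is identical in both commands; only the -E switch changes how the + is interpreted.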

PCRE: Perl-Compatible Regex (grep -P)

PCRE offers the richest syntax of the three, including the shorthand classes \d, \s and \w, lazy (non-greedy) quantifiers such as *?, and lookahead/lookbehind assertions.

# PCRE-specific syntax (requires -P)
grep -P "\d{4}-\d{2}-\d{2}" log.txt        # date (\d = [0-9])
grep -P "\s+" file.txt                       # one or more whitespace
grep -P "https?://\S+" page.html            # URL extraction (use with -o)

# Lazy matching (non-greedy)
grep -oP '<.*?>' index.html                 # extract HTML tags

# Lookahead: match position followed by something
# (single quotes stop interactive bash from treating ! as history expansion)
grep -P 'foo(?=bar)' file.txt               # foo followed by bar
grep -P 'foo(?!bar)' file.txt               # foo NOT followed by bar

# Lookbehind: match position preceded by something
grep -P '(?<=foo)bar' file.txt              # bar preceded by foo
grep -P '(?<!foo)bar' file.txt              # bar NOT preceded by foo

**The power of -o:** The -o flag makes grep output only the matched portion rather than the whole line, one match per line. This transforms grep from a "line filter" into an "extractor." Combined with the richer patterns that -P allows, it lets you extract structured data (IPs, URLs, numbers) directly from logs.
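
As a rough illustration of -o combined with lookaround assertions, the sketch below pulls status codes and timings out of two log lines. The log format (status=NNN, time=NNms) is hypothetical, and the commands assume a grep built with PCRE support (GNU grep -P).

```bash
# Hypothetical log lines; the status= and time= fields are assumptions.
tmp=$(mktemp)
printf '%s\n' \
  'GET /index.html status=200 time=12ms' \
  'GET /missing status=404 time=7ms' > "$tmp"

# (?<=status=) requires "status=" before the digits without including it.
codes=$(grep -oP '(?<=status=)\d+' "$tmp")

# (?=ms) requires "ms" after the digits without including it.
times=$(grep -oP '\d+(?=ms)' "$tmp")

echo "$codes"
echo "$times"
rm -f "$tmp"
```

Because the assertions consume no characters, -o prints only the digits, ready to be piped into sort, uniq, or awk.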


## awk Complete Tutorial


awk is a complete data processing language, not just a "text tool." Its core idea: read input line by line, split fields by a separator, and execute user-defined operations on each line.


### How awk Works and Built-in Variables


- **$0**: Entire line
- **$1, $2, ... $NF**: Field 1, 2, ... last field
- **NF**: Number of fields in current line
- **NR**: Current line number (increments across files)
- **FNR**: Line number within current file (resets per file)
- **FS**: Input field separator (default: whitespace)
- **OFS**: Output field separator (default: space)
- **RS**: Input record separator (default: newline)
- **ORS**: Output record separator (default: newline)
- **FILENAME**: Current filename being processed


```bash
# Basic: print fields 1 and 3
awk '{print $1, $3}' file.txt

# Custom separator: colon-delimited /etc/passwd
awk -F: '{print $1, $3}' /etc/passwd    # username and UID
awk -F: '{print $1":"$3}' /etc/passwd  # custom output format

# OFS: change output separator
awk -F: 'BEGIN{OFS=","} {print $1,$3,$7}' /etc/passwd

# Line numbers and field count
awk '{print NR, NF, $0}' file.txt

# Last field
awk '{print $NF}' file.txt

# Second-to-last field
awk '{print $(NF-1)}' file.txt
```
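
The difference between NR and FNR only shows up when awk reads more than one file. A minimal sketch, using two throwaway files whose names and contents are made up:

```bash
# Two small demo files in a temporary directory (hypothetical contents).
dir=$(mktemp -d)
printf '%s\n' 'alpha' 'beta' > "$dir/a.txt"
printf '%s\n' 'gamma' 'delta' > "$dir/b.txt"

# NR keeps counting across files; FNR restarts at 1 for each new file.
out=$(cd "$dir" && awk '{print FILENAME, NR, FNR, $0}' a.txt b.txt)
echo "$out"
# a.txt 1 1 alpha
# a.txt 2 2 beta
# b.txt 3 1 gamma
# b.txt 4 2 delta
rm -rf "$dir"
```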

Conditional Filtering and Pattern Matching

# Regex match: print lines containing "error"
awk '/error/' app.log
awk '/error/{print NR": "$0}' app.log   # with line numbers

# Inverted match
awk '!/debug/' app.log

# Numeric conditions
awk -F: '$3 >= 1000' /etc/passwd        # users with UID >= 1000
awk '{if ($5 > 100) print $1, $5}' data.txt

# String comparison
awk '$2 == "FAILED"' results.txt
awk '$1 ~ /^192\.168\./' access.log     # IPs starting with 192.168.

# Range pattern (between start and end markers)
awk '/START/,/END/' file.txt            # print lines from START to END
awk 'NR==10,NR==20' file.txt           # print lines 10 to 20

BEGIN and END Blocks

# BEGIN: runs before processing any line
# END: runs after processing all lines
awk -F, 'BEGIN{total=0} {total+=$2} END{print "Total:", total}' sales.csv   # -F, splits on commas; field 2 is Sales

# Count lines
awk 'BEGIN{print "Analysis:"} {lines++} END{print "Lines:", lines}' data.txt

# Formatted report
awk -F, 'BEGIN{
    printf "%-20s %10s %10s\n", "Name", "Sales", "Commission"
    printf "%-20s %10s %10s\n", "----", "-----", "----------"
}
{
    commission = $2 * 0.1
    printf "%-20s %10.2f %10.2f\n", $1, $2, commission
}
END{
    print "Done."
}' sales.csv
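
To see BEGIN, the per-line action, and END working together, here is a sketch that prints a header, echoes each row, and finishes with a total and an average. The two-column name,sales layout and the sample figures are assumptions made for the example.

```bash
# Hypothetical CSV layout: name,sales (no header row).
tmp=$(mktemp)
printf '%s\n' 'Alice,1200.50' 'Bob,800' 'Carol,999.50' > "$tmp"

report=$(awk -F, '
BEGIN { printf "%-10s %10s\n", "Name", "Sales" }   # runs once, before any input
      { total += $2; printf "%-10s %10.2f\n", $1, $2 }
END   { printf "%-10s %10.2f\n", "Total", total    # runs once, after all input
        if (NR > 0) printf "%-10s %10.2f\n", "Average", total / NR }
' "$tmp")
echo "$report"
rm -f "$tmp"
```

In the END block NR still holds the number of records read, which is why it can serve as the divisor for the average; the NR > 0 guard avoids a division by zero on empty input.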

String Functions

# length: string length
awk '{print length($0), $0}' file.txt

# substr: substring (start at position N, take M chars)
awk '{print substr($1, 1, 4)}' dates.txt

# index: find substring position (0 = not found)
awk '{print index($0, "error")}' log.txt

# split: split string into array
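awk '{n=split($1, arr, ":"); for(i=1;i<=n;i++) print arr[i]}' file.txt

# Filter slow requests (response time in the last field > 1000 ms)
awk '$NF > 1000 {print $0}' timing.log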

awk's core advantage: awk natively supports associative arrays (equivalent to maps/dicts in other languages). This lets you count frequencies, sum by group, and pivot data in a single command, with no script file needed. For structured text (logs, CSV, /etc/passwd), an awk one-liner is usually much quicker to write than an equivalent Python script.
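
A minimal sketch of counting and summing by group with associative arrays follows. The log layout (client IP in field 1, response size in the last field) and the array names hits and bytes are assumptions chosen for illustration.

```bash
# Hypothetical access log: client IP first, response bytes last.
tmp=$(mktemp)
printf '%s\n' \
  '10.0.0.1 GET /a 200 512' \
  '10.0.0.2 GET /b 200 128' \
  '10.0.0.1 GET /c 404 64' > "$tmp"

# hits[] and bytes[] are associative arrays keyed by IP address.
# for-in iteration order is unspecified, so the output is piped through sort.
summary=$(awk '
  { hits[$1]++; bytes[$1] += $NF }
  END { for (ip in hits) print ip, hits[ip], bytes[ip] }
' "$tmp" | sort)
echo "$summary"
# 10.0.0.1 2 576
# 10.0.0.2 1 128
rm -f "$tmp"
```

Array elements spring into existence on first use with an empty value that behaves as 0 in arithmetic, so no initialization is needed.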

sed Complete Tutorial

sed (Stream EDitor) reads input line by line, modifies it according to script commands, and outputs the result. Its power lies in non-interactive batch file modification and precise line addressing.

Basic Substitution Syntax

# Basic substitution: s/old/new/
sed 's/foo/bar/' file.txt          # replace first foo per line
sed 's/foo/bar/g' file.txt         # replace all foo per line
sed 's/foo/bar/2' file.txt         # replace second foo per line
sed 's/foo/bar/gi' file.txt        # replace all, case-insensitive (the i flag is not POSIX; GNU sed supports it)

# In-place editing (NOTE: -i differs between macOS and Linux)
sed -i 's/old/new/g' file.txt      # Linux
sed -i '' 's/old/new/g' file.txt   # macOS (-i requires an empty string)

# Multiple expressions (-e)
sed -e 's/foo/bar/g' -e 's/baz/qux/g' file.txt

# Alternative delimiters (useful when pattern contains slashes)
sed 's|/usr/local|/opt|g' paths.txt
sed 's#https://old.com#https://new.com#g' urls.txt
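
The effect of the substitution flags is easiest to see side by side on a single line of input. A small sketch:

```bash
line='foo foo foo'

first=$(printf '%s\n' "$line" | sed 's/foo/bar/')     # first match only
all=$(printf '%s\n' "$line" | sed 's/foo/bar/g')      # every match
second=$(printf '%s\n' "$line" | sed 's/foo/bar/2')   # second match only

printf '%s\n' "$first" "$all" "$second"
# bar foo foo
# bar bar bar
# foo bar foo
```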

Address Ranges

One of sed's most powerful features is applying commands to specific lines or line ranges, not just the entire file.

# Line number addresses
sed '5s/foo/bar/' file.txt          # only modify line 5
sed '5,10s/foo/bar/g' file.txt     # modify lines 5 through 10
sed '5,$s/foo/bar/g' file.txt      # from line 5 to end of file

# Regex addresses
sed '/^#/d' config.txt              # delete comment lines
sed '/error/s/warn/ERROR/g' log     # substitute only on lines containing "error"

# Negation (!)
sed '/pattern/!d' file.txt          # keep only matching lines (like grep)
sed '1!d' file.txt                  # keep only line 1

# Step addressing, first~step (GNU sed extension)
sed '0~2d' file.txt                 # delete even-numbered lines
sed '1~2d' file.txt                 # delete odd-numbered lines
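
The address forms above can be compared on one small input. The sketch below sticks to addresses that work in both GNU and BSD sed; the five-line config file is invented for the demo.

```bash
# Hypothetical config file used only for this demo.
tmp=$(mktemp)
printf '%s\n' '# comment' 'port=80' 'host=a' '# another' 'port=81' > "$tmp"

by_number=$(sed -n '2,3p' "$tmp")     # line-number range: lines 2 and 3
by_regex=$(sed '/^#/d' "$tmp")        # regex address: drop comment lines
negated=$(sed '/^port=/!d' "$tmp")    # negation: delete everything except port= lines
scoped=$(sed '2,3s/=/ = /' "$tmp")    # substitute only within lines 2-3

printf '%s\n---\n' "$by_number" "$by_regex" "$negated" "$scoped"
rm -f "$tmp"
```

In the last command, line 5 (port=81) is left untouched because it falls outside the 2,3 range even though it contains an equals sign.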

Delete, Insert, and Append

# d: delete matching lines
sed '/^$/d' file.txt                # delete blank lines
sed '/^#/d' file.txt                # delete comment lines
sed '/^[[:space:]]*$/d' file.txt    # delete blank and whitespace-only lines (portable form of \s)
sed '1d' file.txt                   # delete first line (strip header)
sed '$d' file.txt                   # delete last line

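# i: insert before matching line
# (the one-line i\text, a\text and c\text forms are GNU sed syntax; POSIX sed expects a newline after the backslash)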
sed '/^server/i\# nginx server block' nginx.conf

# a: append after matching line
sed '/^worker_processes/a\worker_rlimit_nofile 65535;' nginx.conf

# c: replace entire matching line
sed '/^Port 22/c\Port 2222' /etc/ssh/sshd_config

# p: print (often used with -n to print only matches)
sed -n '10,20p' file.txt            # print lines 10 to 20
sed -n '/error/p' app.log           # print matching lines (like grep)

Multiline Operations

# N: append next line to pattern space
# Join two adjacent lines (remove newline between them)
sed 'N;s/\n/ /' file.txt

# Remove Windows line endings (CRLF -> LF)
sed 's/\r$//' file.txt
# or
sed -i 's/\r//' file.txt

# Compress multiple blank lines into one (similar to cat -s)
sed '/^$/N;/\n$/D' file.txt

# Print a multi-line HTML element, from its opening tag through its closing tag
# (<table> is only an example tag)
sed -n '/<table/,/<\/table>/p' index.html
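
Squeezing runs of blank lines is a good test of the multiline commands, because a version that deletes both lines after N removes an even-length run entirely. One way to keep exactly one blank line per run, sketched here with N and D on made-up input:

```bash
# Input: "a", two blank lines, "b", three blank lines, "c".
# /^$/N   on a blank line, append the next line to the pattern space
# /\n$/D  if that next line is blank too, drop the first blank and rerun
#         the script on what remains, without reading new input
squeezed=$(printf 'a\n\n\nb\n\n\n\nc\n' | sed '/^$/N;/\n$/D')
printf '%s\n' "$squeezed"
```

Each run of blank lines, whether two or three long, comes out as a single blank line between the text lines.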

sed in Practice

# Batch replace domain in config files
sed -i 's/old-domain\.com/new-domain.com/g' /etc/nginx/sites-enabled/*.conf

# Strip HTML tags (simple cases)
sed 's/<[^>]*>//g' page.html

# Add a copyright notice after line 1 of all shell scripts (keeps the shebang on the first line)
sed -i '1a\# Copyright 2026 YiteAI. All rights reserved.' *.sh

# Extract content between markers
sed -n '/BEGIN CERTIFICATE/,/END CERTIFICATE/p' cert.pem

# Add line numbers to a file
sed = file.txt | sed 'N;s/\n/\t/'

# Reverse a file (a sed implementation of tac)
sed -n '1!G;h;$p' file.txt

# Cross-platform -i (works on both Linux and macOS)
sed -i.bak 's/a/b/' file.txt   # creates a .bak backup on both platforms

The macOS -i trap: macOS ships BSD sed, where -i requires a backup-suffix argument (which may be the empty string ''). GNU sed on Linux treats the suffix as optional and accepts it only when attached directly to -i, so a plain -i is followed straight by the script. For cross-platform scripts, use sed -i.bak, or install GNU sed on macOS (brew install gnu-sed).
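
A sketch of the portable pattern: edit in place with an attached suffix, check the result, then delete the backup. The file contents and domain names are placeholders.

```bash
# Throwaway file for the demo (placeholder contents).
tmp=$(mktemp)
printf '%s\n' 'server old.example.com' > "$tmp"

# The suffix is attached to -i, a form accepted by both GNU sed and BSD sed.
sed -i.bak 's/old\.example\.com/new.example.com/' "$tmp"

result=$(cat "$tmp")        # edited file
backup=$(cat "$tmp.bak")    # untouched copy of the original
echo "edited: $result"
echo "backup: $backup"

rm -f "$tmp" "$tmp.bak"     # remove the backup once the edit is verified
```

Keeping the backup until the edit has been checked also gives a cheap rollback path when the substitution turns out to be wrong.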

Other Text Processing Tools

tr: Character-Level Translation

tr works at a lower level than sed: it translates, deletes, and squeezes individual characters, and it is very fast.

# Case conversion
echo "Hello World" | tr 'a-z' 'A-Z'
echo "Hello World" | tr 'A-Z' 'a-z'

# Delete characters (-d)
echo "He110 W0r1d" | tr -d '0-9'    # delete all digits
echo "remove spaces" | tr -d ' '     # delete all spaces
tr -d '\r' < file.txt                # remove carriage returns (Windows line endings)

# Squeeze repeated characters (-s)
echo "Hello   World" | tr -s ' '    # compress multiple spaces to one
echo "aaabbbccc" | tr -s 'a-z'      # squeeze repeated lowercase letters

# Character substitution
echo "2026-04-25" | tr '-' '/'      # replace dashes with slashes
echo "a:b:c" | tr ':' '\n'          # replace colons with newlines

# Generate a random 16-character password (LC_ALL=C makes tr treat the input as raw bytes)
LC_ALL=C tr -dc 'a-zA-Z0-9' < /dev/urandom | head -c 16; echo

sort and uniq: Sorting and Deduplication

# sort options
sort file.txt                    # lexicographic sort
sort -n numbers.txt              # numeric sort
sort -rn numbers.txt             # reverse numeric sort
sort -t, -k2,2 data.csv          # sort by field 2 only, comma delimiter (-k2 alone means field 2 to end of line)
sort -k2,2n -k1,1 data.txt       # sort by field 2 numerically, then field 1
sort -u names.txt                # sort and deduplicate
sort -V versions.txt             # version-aware sort (GNU sort)

# uniq options (uniq only collapses adjacent duplicates, so sort the input first)
sort file.txt | uniq             # deduplicate
sort file.txt | uniq -c          # count occurrences
sort file.txt | uniq -d          # show only duplicate lines
sort file.txt | uniq -u          # show only unique (non-repeated) lines
sort file.txt | uniq -i          # case-insensitive dedup

# Classic pipeline: top 20 most frequent entries
sort access.log | uniq -c | sort -rn | head -20

# Top IPs by request count
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10
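
The same sort | uniq -c | sort pattern, combined with tr, gives a word-frequency counter. A minimal sketch on a made-up sentence:

```bash
text='the cat and the dog and THE bird'

# tr lowercases, then puts one word per line; sort groups identical words;
# uniq -c counts each group; the second sort orders by count (descending)
# and then by word; awk swaps the columns to "word count".
top=$(printf '%s\n' "$text" \
  | tr 'A-Z' 'a-z' \
  | tr -s ' ' '\n' \
  | sort \
  | uniq -c \
  | sort -k1,1nr -k2 \
  | awk '{print $2, $1}')
echo "$top"
# the 3
# and 2
# bird 1
# cat 1
# dog 1
```

Without the first sort, uniq -c would count "the" three separate times, because the occurrences are not adjacent in the input.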

diff and patch: File Comparison and Patching

# Basic comparison
diff file1.txt file2.txt

# Unified format (most common; compatible with patch)
diff -u original.txt modified.txt

# Generate a patch file
diff -u original.conf new.conf > changes.patch

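# Apply a patch
patch original.conf < changes.patch

strings: Extracting Printable Text from Binaries

# Only report strings of length >= 8
strings -n 8 firmware.bin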

# Security analysis: look for suspicious strings in an executable
strings suspicious.exe | grep -E "http|cmd|exec|passwd"

# Extract version info from firmware/image
strings firmware.bin | grep -i "version\|release"

jq: JSON Processing Introduction

# Install (Debian/Ubuntu)
sudo apt install jq

# Pretty-print JSON
cat data.json | jq '.'
curl -s https://api.example.com/data | jq '.'

# Extract a field
echo '{"name":"Alice","age":30}' | jq '.name'
# "Alice"

# Array operations
echo '[1,2,3,4,5]' | jq '.[]'           # expand array
echo '[1,2,3]' | jq 'map(. * 2)'        # map over array

# Filter
cat users.json | jq '.[] | select(.age > 25) | .name'

# Select multiple fields
cat data.json | jq '.[] | {name: .name, email: .email}'

# Extract from GitHub API
curl -s "https://api.github.com/repos/torvalds/linux/commits?per_page=5" | \
  jq '.[] | {sha: .sha[0:8], message: .commit.message[0:60]}'

The three tools working together: grep, awk, and sed are most powerful in combination. A typical pipeline uses grep to filter the relevant lines, awk to split fields and aggregate, and sed to format the output. Each tool does what it does best, and connecting them with pipes achieves more than any one of them alone.
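
Such a pipeline might look like the sketch below. The three-column log format (level, service, response time in ms) and the output wording are assumptions made up for the example.

```bash
# Hypothetical log format: level, service name, response time in ms.
tmp=$(mktemp)
printf '%s\n' \
  'ERROR api 120' \
  'INFO  api 15' \
  'ERROR db 300' \
  'ERROR api 80' > "$tmp"

# grep keeps the ERROR lines, awk counts and averages per service,
# sort fixes the order, and sed rewrites each row into a readable form.
report=$(grep '^ERROR' "$tmp" \
  | awk '{ n[$2]++; ms[$2] += $3 } END { for (s in n) print s, n[s], ms[s] / n[s] }' \
  | sort \
  | sed 's/^\([^ ]*\) \([^ ]*\) \(.*\)$/\1: errors=\2 avg_ms=\3/')
echo "$report"
# api: errors=2 avg_ms=100
# db: errors=1 avg_ms=300
rm -f "$tmp"
```

awk could do the filtering and formatting on its own, but splitting the work keeps each stage short and easy to test in isolation.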
