第4章:文本三剑客深度实战

grep 正则表达式完全指南(BRE/ERE/PCRE),awk 结构化文本处理(字段分隔、内置变量、BEGIN/END、关联数组),sed 流编辑器(地址寻址、替换、删除、插入、多行操作),以及 tr/diff/patch/sort/uniq 等辅助工具的实战用法。

grep 完全指南

grep(Global Regular Expression Print)是 Linux 文本搜索的核心工具。它支持三种正则引擎:基本正则(BRE)、扩展正则(ERE,通过 -E 启用)和 Perl 兼容正则(PCRE,通过 -P 启用)。

常用选项速查

# 基本搜索
grep "error" /var/log/syslog
grep -i "error" /var/log/syslog    # 忽略大小写
grep -v "DEBUG" app.log             # 过滤掉 DEBUG 行
grep -n "fatal" app.log             # 显示行号
grep -c "404" access.log            # 统计包含 404 的行数(不是出现次数)

# 递归搜索
grep -r "TODO" ./src/               # 搜索目录
grep -rl "FIXME" ./src/             # 只显示文件名
grep -rn "panic" ./ --include="*.go"  # 只搜索 Go 文件

# 上下文行
grep -A3 "ERROR" app.log    # 错误行及其后 3 行
grep -B2 "FATAL" app.log    # 错误行及其前 2 行
grep -C2 "exception" app.log  # 前后各 2 行

# 多模式
grep -e "error" -e "warn" -e "fatal" app.log

# 只输出匹配内容
grep -o "[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}" access.log | sort -u

BRE vs ERE:正则引擎差异

BRE(基本正则)是 grep 的默认引擎:?、+、|、()、{} 这些字符默认按字面匹配,加反斜杠后才有特殊含义(其中 \?、\+、\| 是 GNU 扩展)。ERE(grep -E;egrep 别名已被标记为过时)则相反,这些字符直接有特殊含义,加反斜杠才表示字面字符。

{{if eq .Lang "zh"}}
功能          BRE (grep)    ERE (grep -E)
分组          \(abc\)       (abc)
一次或多次    \+            +
零次或一次    \?            ?
多选(或)    \|            |
精确重复      \{3\}         {3}

# BRE 写法(默认)
grep "colou\?r" file.txt          # 匹配 color 或 colour
grep "\(error\|warn\)" app.log    # 匹配 error 或 warn

# ERE 写法(-E 更简洁)
grep -E "colou?r" file.txt
grep -E "error|warn|fatal" app.log
grep -E "^[0-9]{4}-[0-9]{2}-[0-9]{2}" access.log   # 日期格式行
grep -E "\b[A-Z]{2,}\b" doc.txt    # 全大写单词

# 常用 ERE 元字符
# .    任意单个字符
# *    零次或多次
# +    一次或多次
# ?    零次或一次
# ^    行首
# $    行尾
# []   字符集合
# [^]  否定字符集合
# \b   单词边界(GNU 扩展)
# \w   单词字符 [a-zA-Z0-9_](GNU 扩展)
# \s   空白字符(GNU 扩展)

PCRE:Perl 兼容正则(grep -P)

PCRE 提供了最强大的正则语法,包括 \d、\s、\w 等简写字符类,懒惰(非贪婪)量词(如 *?、+?),以及前瞻/后顾断言。

# PCRE 专用语法(需要 -P)
grep -P "\d{4}-\d{2}-\d{2}" log.txt        # 日期(\d = [0-9])
grep -P "\s+" file.txt                       # 一个或多个空白
grep -P "https?://\S+" page.html            # URL 提取(-o 配合)

# 懒惰匹配(非贪婪,尽可能少匹配)
grep -oP "<.+?>" index.html                  # HTML 标签提取

# 前瞻断言(lookahead):匹配后面跟着某内容的位置
grep -P "foo(?=bar)" file.txt               # 匹配后面跟 bar 的 foo
grep -P "foo(?!bar)" file.txt               # 匹配后面不跟 bar 的 foo

# 后顾断言(lookbehind)
grep -P "(?<=foo)bar" file.txt              # 匹配前面是 foo 的 bar
grep -P "(?<!foo)bar" file.txt              # 匹配前面不是 foo 的 bar

**-o 选项的力量:**-o 让 grep 只输出匹配部分而非整行,每个匹配单独一行。这使 grep 从"行过滤器"变成了"提取器",结合 -P 的复杂正则,可以直接从日志中提取 IP、URL、数字等结构化信息。
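下面是"提取器"思路的一个最小示意。示例日志的内容是为演示而虚构的;这里用 -E 而不是 -P,只是为了在没有 PCRE 支持的 grep 上也能运行。

```bash
# 示例日志:内容仅为演示而虚构
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 GET /index.html 200
10.0.0.2 GET /login 404
10.0.0.1 POST /login 200
EOF

# -o 让每个匹配单独占一行,sort -u 再去重
ips=$(grep -oE '^[0-9]{1,3}(\.[0-9]{1,3}){3}' "$log" | sort -u)
echo "$ips"

# 状态码:只取每行末尾的三位数字,再统计频次
codes=$(grep -oE '[0-9]{3}$' "$log" | sort | uniq -c | sort -rn)
echo "$codes"
rm -f "$log"
```

这里 grep 只负责"取出片段",去重和计数交给 sort 与 uniq。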


## awk 完整教程


awk 是一门完整的数据处理语言,不只是"文本处理工具"。它的核心思想是:逐行读取输入,按分隔符切分字段,对每行执行用户定义的操作。


### 工作原理与内置变量


- **$0**:整行内容
- **$1, $2, ... $NF**:第1、2、...最后一个字段
- **NF**:当前行的字段总数
- **NR**:当前行的行号(跨文件递增)
- **FNR**:当前文件中的行号(每个文件从1开始)
- **FS**:输入字段分隔符(默认:空白)
- **OFS**:输出字段分隔符(默认:空格)
- **RS**:输入记录分隔符(默认:换行符)
- **ORS**:输出记录分隔符(默认:换行符)
- **FILENAME**:当前处理的文件名
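
NR 与 FNR 的区别可以用两个小文件直观地看出来。以下只是一个示意,文件名和内容均为假设:

```bash
# 两个示例文件:名称和内容仅为演示
dir=$(mktemp -d)
printf 'a b\nc d e\n' > "$dir/one.txt"
printf 'x\n' > "$dir/two.txt"

# NR 跨文件连续计数,FNR 在每个文件开头重置为 1,
# NF 是当前行的字段数
report=$(cd "$dir" && awk '{print FILENAME, NR, FNR, NF}' one.txt two.txt)
echo "$report"
rm -rf "$dir"
```

处理到 two.txt 的第一行时,NR 为 3,而 FNR 回到 1。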


```bash
# 基本用法:打印第1和第3字段
awk '{print $1, $3}' file.txt

# 指定分隔符:冒号分割的 /etc/passwd
awk -F: '{print $1, $3}' /etc/passwd    # 用户名和 UID
awk -F: '{print $1":"$3}' /etc/passwd  # 自定义输出格式

# OFS:修改输出分隔符
awk -F: 'BEGIN{OFS=","} {print $1,$3,$7}' /etc/passwd

# 行号和字段数
awk '{print NR, NF, $0}' file.txt

# 最后一个字段
awk '{print $NF}' file.txt

# 倒数第二个字段
awk '{print $(NF-1)}' file.txt
```

条件过滤与模式匹配

# 正则匹配:打印包含 error 的行
awk '/error/' app.log
awk '/error/{print NR": "$0}' app.log   # 带行号

# 反向匹配
awk '!/debug/' app.log

# 数值条件
awk -F: '$3 >= 1000' /etc/passwd         # UID >= 1000 的用户
awk '{if ($5 > 100) print $1, $5}' data.txt

# 字符串比较
awk '$2 == "FAILED"' results.txt
awk '$1 ~ /^192\.168\./' access.log      # IP 以 192.168. 开头

# 范围模式(从匹配行到结束标记)
awk '/START/,/END/' file.txt             # 打印 START 到 END 之间的行
awk 'NR==10,NR==20' file.txt            # 打印第10到第20行

BEGIN 和 END 块

# BEGIN:处理任何行之前执行
# END:处理完所有行之后执行
# CSV 需用 -F, 指定逗号分隔,否则 $5 按空白切分
awk -F, 'BEGIN{total=0} {total+=$5} END{print "Total:", total}' sales.csv

# 统计行数
awk 'BEGIN{print "File analysis:"} {lines++} END{print "Lines:", lines}' data.txt

# 格式化报告
awk -F, 'BEGIN{
    printf "%-20s %10s %10s\n", "Name", "Sales", "Commission"
    printf "%-20s %10s %10s\n", "----", "-----", "----------"
}
{
    commission = $2 * 0.1
    printf "%-20s %10.2f %10.2f\n", $1, $2, commission
}
END{
    print "Processing complete."
}' sales.csv

字符串函数

# length:字符串长度
awk '{print length($0), $0}' file.txt

# substr:子字符串(从第N位开始取M个字符)
awk '{print substr($1, 1, 4)}' dates.txt

# index:查找子串位置(0表示不存在)
awk '{print index($0, "error")}' log.txt

# split:分割字符串为数组
awk '{n=split($1, arr, ":"); for(i=1;i<=n;i++) print arr[i]}' file.txt

# 筛选最后一个字段大于 1000 的行(例如耗时 > 1000ms)
awk '$NF > 1000 {print $0}' timing.log

**awk 的核心优势:**awk 原生支持关联数组(即其他语言的 map/dict),这让统计频率、分组求和等任务在一行命令内就能完成,无需写脚本文件。对于结构化文本(日志、CSV、/etc/passwd),用 awk 写一行命令通常比写一个 Python 脚本快得多。
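
关联数组的典型用法大致如下。示例数据("区域 金额")是虚构的,count 和 sum 这两个数组名也只是示意:

```bash
# 示例数据:每行 "区域 金额",仅为演示
data=$(mktemp)
cat > "$data" <<'EOF'
east 10
west 5
east 7
north 3
west 1
EOF

# count[] 和 sum[] 是以第 1 字段为键的关联数组。
# for (k in arr) 的遍历顺序不确定,所以用 sort 固定输出顺序
summary=$(awk '{count[$1]++; sum[$1]+=$2}
               END{for (k in count) print k, count[k], sum[k]}' "$data" | sort)
echo "$summary"
rm -f "$data"
```

每行输出依次是键、出现次数和金额合计。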

sed 完整教程

sed(Stream EDitor)是流编辑器,逐行读取输入,按照脚本中的命令修改后输出。它的强大之处在于非交互式批量修改文件,以及精确的地址寻址能力。

基本替换语法

# 基本替换:s/旧/新/
sed 's/foo/bar/' file.txt          # 每行替换第一个 foo
sed 's/foo/bar/g' file.txt         # 每行替换所有 foo
sed 's/foo/bar/2' file.txt         # 每行替换第二个 foo
sed 's/foo/bar/gi' file.txt        # 全部替换,忽略大小写

# 原地修改文件(注意:-i 在 macOS 和 Linux 行为不同)
sed -i 's/old/new/g' file.txt      # Linux
sed -i '' 's/old/new/g' file.txt   # macOS(-i 需要空字符串参数)

# 多个表达式(-e)
sed -e 's/foo/bar/g' -e 's/baz/qux/g' file.txt

# 使用不同分隔符(当模式含斜杠时特别有用)
sed 's|/usr/local|/opt|g' paths.txt
sed 's#https://old.com#https://new.com#g' urls.txt

地址寻址

sed 最强大的特性之一是可以对特定行或行范围执行命令,而不是全文处理。

# 行号地址
sed '5s/foo/bar/' file.txt          # 只修改第5行
sed '5,10s/foo/bar/g' file.txt     # 修改第5到第10行
sed '5,$s/foo/bar/g' file.txt      # 从第5行到最后一行

# 正则地址
sed '/^#/d' config.txt              # 删除注释行
sed '/error/s/warn/ERROR/g' log     # 只在包含 error 的行中替换

# 否定地址(!)
sed '/pattern/!d' file.txt          # 只保留匹配行(等效于 grep)
sed '1!d' file.txt                  # 只保留第1行

# 步进地址(每N行)
sed '0~2d' file.txt                 # 删除偶数行(GNU sed)
sed '1~2d' file.txt                 # 删除奇数行

删除、插入与追加命令

# d 命令:删除匹配行
sed '/^$/d' file.txt                # 删除空行
sed '/^#/d' file.txt                # 删除注释行
sed '/^\s*$/d' file.txt             # 删除仅含空白的行
sed '1d' file.txt                   # 删除第1行(去掉标题行)
sed '$d' file.txt                   # 删除最后一行

# i 命令:在匹配行前插入
sed '/^server/i\# nginx server block' nginx.conf

# a 命令:在匹配行后追加
sed '/^worker_processes/a\worker_rlimit_nofile 65535;' nginx.conf

# c 命令:替换整行
sed '/^Port 22/c\Port 2222' /etc/ssh/sshd_config

# p 命令:打印(常配合 -n 只打印匹配行)
sed -n '10,20p' file.txt            # 打印第10到20行(类似 head/tail 组合)
sed -n '/error/p' app.log           # 只打印匹配行(等效于 grep)

多行操作

# N 命令:读取下一行追加到模式空间
# 合并相邻两行(去掉两行之间的换行)
sed 'N;s/\n/ /' file.txt

# 删除文件中的 Windows 换行符(CRLF -> LF)
sed 's/\r$//' file.txt
# 或者
sed -i 's/\r//' file.txt

# 删除连续空行(压缩为单个空行)
sed '/^$/{
N
/^\n$/D
}' file.txt

# 实战:打印从起始标签到结束标签之间的所有行(此处以 <div> 为例)
sed -n '/<div>/,/<\/div>/p' index.html

sed 实战案例

# 批量替换配置文件中的域名
sed -i 's/old-domain\.com/new-domain.com/g' /etc/nginx/sites-enabled/*.conf

# 删除 HTML 标签(简单场景)
sed 's/<[^>]*>//g' page.html

# 在文件第一行前插入版权声明
sed -i '1i\# Copyright 2026 YiteAI. All rights reserved.' *.sh

# 提取两个标记之间的内容
sed -n '/BEGIN CERTIFICATE/,/END CERTIFICATE/p' cert.pem

# 给每行添加行号
sed = file.txt | sed 'N;s/\n/\t/'

# 反转文件(tac 的 sed 实现)
sed -n '1!G;h;$p' file.txt

# macOS 与 Linux -i 差异
# Linux:sed -i 's/a/b/' file.txt
# macOS:sed -i '' 's/a/b/' file.txt
# 跨平台写法:先备份再处理
sed -i.bak 's/a/b/' file.txt   # 两个平台都支持(生成 .bak 备份)

**macOS 的 -i 陷阱:**macOS 的 sed 来自 BSD,-i 后面必须跟一个备份后缀参数(可以是空字符串 '')。GNU sed 的备份后缀是可选的,而且必须紧贴 -i(如 -i.bak);如果写成 -i '',GNU sed 会把 '' 当作脚本本身。跨平台脚本建议用 sed -i.bak 的形式,或安装 GNU sed(brew install gnu-sed)。
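
如果不想生成 .bak 备份,也可以在脚本里先探测 sed 的类型。下面是一个示意写法:sed_inplace 是假设的函数名,用 --version 区分 GNU/BSD 只是一种常见的经验做法,并非官方接口。

```bash
# sed_inplace 是假设的辅助函数名,不是标准命令。
# GNU sed 接受 --version;BSD sed(macOS)会报错退出,
# 借此判断当前 sed 的类型
sed_inplace() {
    if sed --version >/dev/null 2>&1; then
        sed -i "$@"        # GNU:不带后缀参数
    else
        sed -i '' "$@"     # BSD:显式给出空的备份后缀
    fi
}

tmp=$(mktemp)
echo "listen 80" > "$tmp"
sed_inplace 's/80/8080/' "$tmp"
result=$(cat "$tmp")
echo "$result"
rm -f "$tmp"
```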

其他文本处理工具

tr:字符级转换

tr 比 sed 更底层,它在字符级别进行替换、删除和压缩操作,速度极快。

# 大小写转换
echo "Hello World" | tr 'a-z' 'A-Z'
echo "Hello World" | tr 'A-Z' 'a-z'

# 删除字符(-d)
echo "He110 W0r1d" | tr -d '0-9'    # 删除所有数字
echo "remove spaces" | tr -d ' '     # 删除所有空格
cat file.txt | tr -d '\r'            # 删除 Windows 换行符

# 压缩重复字符(-s squeezing)
echo "Hello   World" | tr -s ' '    # 多个空格压缩为一个
echo "aaabbbccc" | tr -s 'a-z'      # 压缩连续重复字母

# 字符替换
echo "2026-04-25" | tr '-' '/'      # 将横线替换为斜杠
echo "a:b:c" | tr ':' '\n'          # 将冒号替换为换行(分行输出)

# 生成随机密码(结合 /dev/urandom)
LC_ALL=C tr -dc 'a-zA-Z0-9' < /dev/urandom | head -c 16   # LC_ALL=C 避免 macOS 上的非法字节序列报错

sort 与 uniq:排序去重组合

# sort 选项
sort file.txt                    # 字典序排序
sort -n numbers.txt              # 数字排序
sort -rn numbers.txt             # 逆序数字排序
sort -t, -k2,2 data.csv          # 只按第2字段排序,逗号分隔(-k2 会一直比较到行尾)
sort -k2,2n -k1,1 data.txt       # 先按第2字段数字排,再按第1字段字典排
sort -u names.txt                # 排序并去重
sort -V versions.txt             # 版本号排序(GNU sort)

# uniq 选项(必须先排序)
sort file.txt | uniq             # 去重
sort file.txt | uniq -c          # 计数(每行出现次数)
sort file.txt | uniq -d          # 只显示重复行
sort file.txt | uniq -u          # 只显示唯一行(未重复)
sort -f file.txt | uniq -i       # 忽略大小写去重(排序时也要用 -f 忽略大小写)

# 经典组合:统计最频繁出现的内容
sort access.log | uniq -c | sort -rn | head -20

# 统计 IP 访问频率
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10
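
"必须先排序"的原因是 uniq 只比较相邻行。下面用一小段虚构输入做个对比示意:

```bash
# 未排序的输入,重复行互不相邻(示例数据)
input=$(printf 'b\na\nb\na\nb\n')

# 不排序时,uniq 看不到相邻的重复行,什么也去不掉
unsorted=$(printf '%s\n' "$input" | uniq | wc -l | tr -d ' ')

# 排序后相同的行挨在一起,才会被合并
sorted=$(printf '%s\n' "$input" | sort | uniq | wc -l | tr -d ' ')

echo "uniq alone: $unsorted lines, sort | uniq: $sorted lines"
```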

diff 与 patch:文件比较与补丁

# 基本比较
diff file1.txt file2.txt

# Unified 格式(最常用,patch 工具可直接使用)
diff -u original.txt modified.txt

# 生成补丁文件
diff -u original.conf new.conf > changes.patch

# 应用补丁
patch original.conf < changes.patch

strings:提取二进制文件中的字符串

# 只输出长度 >= 8 的字符串
strings -n 8 firmware.bin

# 安全分析:检查可执行文件中的可疑字符串
strings suspicious.exe | grep -E "http|cmd|exec|passwd"

# 从固件/镜像中提取版本信息
strings firmware.bin | grep -i "version\|release"

jq:JSON 处理简介

# 安装
sudo apt install jq

# 格式化 JSON(美化输出)
cat data.json | jq '.'
curl -s https://api.example.com/data | jq '.'

# 提取字段
echo '{"name":"Alice","age":30}' | jq '.name'
# "Alice"

# 数组操作
echo '[1,2,3,4,5]' | jq '.[]'           # 展开数组
echo '[1,2,3]' | jq 'map(. * 2)'        # 数组映射

# 过滤
cat users.json | jq '.[] | select(.age > 25) | .name'

# 组合多字段
cat data.json | jq '.[] | {name: .name, email: .email}'

# 从 API 响应中提取并格式化
curl -s "https://api.github.com/repos/torvalds/linux/commits?per_page=5" | \
  jq '.[] | {sha: .sha[0:8], message: .commit.message[0:60]}'

**文本三剑客的协作:**grep/awk/sed 最强的状态是组合使用。典型流水线:用 grep 过滤出相关行 → 用 awk 切割字段和统计 → 用 sed 格式化输出。每个工具做自己最擅长的事,通过管道串联,比单独使用任何一个工具都更强大。
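
上述流水线可以大致写成下面这样。日志格式和内容是为演示而虚构的,字段位置需按实际日志调整:

```bash
# 示例应用日志:格式和内容均为虚构
log=$(mktemp)
cat > "$log" <<'EOF'
2026-04-25 INFO  api  started
2026-04-25 ERROR db   timeout
2026-04-25 ERROR api  bad request
2026-04-25 DEBUG api  ping
2026-04-25 ERROR db   connection lost
EOF

# grep 过滤,awk 按第 3 字段分组计数,
# sort 固定顺序,sed 整理最终输出格式
out=$(grep 'ERROR' "$log" \
    | awk '{n[$3]++} END{for (s in n) print s, n[s]}' \
    | sort \
    | sed 's/^\([^ ]*\) \(.*\)$/\1: \2 error(s)/')
echo "$out"
rm -f "$log"
```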

{{else}}

Chapter 4: Text Processing Power Tools

Complete grep regex guide (BRE/ERE/PCRE), awk structured text processing (field separation, built-in variables, BEGIN/END, associative arrays), sed stream editor (addressing, substitution, deletion, insertion, multiline), and practical usage of tr/diff/patch/sort/uniq.

grep Complete Guide

grep (Global Regular Expression Print) is the core Linux text search tool. It supports three regex engines: Basic Regular Expressions (BRE), Extended Regular Expressions (ERE, via -E), and Perl-Compatible Regular Expressions (PCRE, via -P).

Common Options Quick Reference

# Basic search
grep "error" /var/log/syslog
grep -i "error" /var/log/syslog    # case-insensitive
grep -v "DEBUG" app.log             # filter out DEBUG lines
grep -n "fatal" app.log             # show line numbers
grep -c "404" access.log            # count lines containing 404 (not total occurrences)

# Recursive search
grep -r "TODO" ./src/               # search directory
grep -rl "FIXME" ./src/             # filenames only
grep -rn "panic" ./ --include="*.go"  # only Go files

# Context lines
grep -A3 "ERROR" app.log    # match + 3 lines after
grep -B2 "FATAL" app.log    # match + 2 lines before
grep -C2 "exception" app.log  # 2 lines before and after

# Multiple patterns
grep -e "error" -e "warn" -e "fatal" app.log

# Extract only matched parts
grep -o "[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}\.[0-9]\{1,3\}" access.log | sort -u

BRE vs ERE: Regex Engine Differences

BRE (Basic Regular Expressions) is grep's default engine: the characters ?, +, |, (), and {} are literal unless preceded by a backslash (\?, \+, and \| are GNU extensions). ERE (grep -E; the egrep alias is deprecated) is the opposite: these characters are special by default, and a backslash makes them literal.

Feature        BRE (grep)    ERE (grep -E)
Grouping       \(abc\)       (abc)
One or more    \+            +
Zero or one    \?            ?
Alternation    \|            |
Exact repeat   \{3\}         {3}

# BRE style (default)
grep "colou\?r" file.txt           # matches "color" or "colour"
grep "\(error\|warn\)" app.log     # matches error or warn

# ERE style (-E is more readable)
grep -E "colou?r" file.txt
grep -E "error|warn|fatal" app.log
grep -E "^[0-9]{4}-[0-9]{2}-[0-9]{2}" access.log  # date-format lines
grep -E "\b[A-Z]{2,}\b" doc.txt    # all-uppercase words

# Common ERE metacharacters:
# .    any single character
# *    zero or more
# +    one or more
# ?    zero or one
# ^    start of line
# $    end of line
# []   character class
# [^]  negated character class
# \b   word boundary (GNU extension)
# \w   word character [a-zA-Z0-9_] (GNU extension)
# \s   whitespace (GNU extension)

PCRE: Perl-Compatible Regex (grep -P)

PCRE provides the most powerful regex syntax, including \d \s \w, lazy matching, and lookahead/lookbehind assertions.

# PCRE-specific syntax (requires -P)
grep -P "\d{4}-\d{2}-\d{2}" log.txt        # date (\d = [0-9])
grep -P "\s+" file.txt                       # one or more whitespace
grep -P "https?://\S+" page.html            # URL extraction (use with -o)

# Lazy matching (non-greedy)
grep -oP "<.+?>" index.html                  # extract HTML tags

# Lookahead: match position followed by something
grep -P "foo(?=bar)" file.txt               # foo followed by bar
grep -P "foo(?!bar)" file.txt               # foo NOT followed by bar

# Lookbehind: match position preceded by something
grep -P "(?<=foo)bar" file.txt              # bar preceded by foo
grep -P "(?<!foo)bar" file.txt              # bar NOT preceded by foo

**The power of -o:** The -o flag makes grep output only the matched portion rather than the whole line, one match per line. This transforms grep from a "line filter" into an "extractor." Combined with -P's complex patterns, you can extract structured data (IPs, URLs, numbers) directly from logs.
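As a minimal sketch of the extractor idea, the snippet below builds a tiny sample log and pulls out only the matched parts. The log contents are invented for illustration, and it uses -E rather than -P only so that it also runs on a grep built without PCRE support.

```bash
# Sample log; the contents are invented for illustration only.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 GET /index.html 200
10.0.0.2 GET /login 404
10.0.0.1 POST /login 200
EOF

# -o prints each match on its own line; sort -u then deduplicates.
ips=$(grep -oE '^[0-9]{1,3}(\.[0-9]{1,3}){3}' "$log" | sort -u)
echo "$ips"

# Status codes: take the trailing three digits of each line, then count.
codes=$(grep -oE '[0-9]{3}$' "$log" | sort | uniq -c | sort -rn)
echo "$codes"
rm -f "$log"
```

Here grep only pulls out the fragments; deduplication and counting are left to sort and uniq.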


## awk Complete Tutorial


awk is a complete data processing language, not just a "text tool." Its core idea: read input line by line, split fields by a separator, and execute user-defined operations on each line.


### How awk Works and Built-in Variables


- **$0**: Entire line
- **$1, $2, ... $NF**: Field 1, 2, ... last field
- **NF**: Number of fields in current line
- **NR**: Current line number (increments across files)
- **FNR**: Line number within current file (resets per file)
- **FS**: Input field separator (default: whitespace)
- **OFS**: Output field separator (default: space)
- **RS**: Input record separator (default: newline)
- **ORS**: Output record separator (default: newline)
- **FILENAME**: Current filename being processed
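
The difference between NR and FNR is easiest to see with two small files. The following is only a sketch; the file names and contents are assumptions made up for the demo.

```bash
# Two small sample files; names and contents are illustrative only.
dir=$(mktemp -d)
printf 'a b\nc d e\n' > "$dir/one.txt"
printf 'x\n' > "$dir/two.txt"

# NR keeps counting across files, FNR restarts at 1 for each file,
# and NF is the field count of the current line.
report=$(cd "$dir" && awk '{print FILENAME, NR, FNR, NF}' one.txt two.txt)
echo "$report"
rm -rf "$dir"
```

On the first line of two.txt, NR is 3 while FNR is back to 1.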


```bash
# Basic: print fields 1 and 3
awk '{print $1, $3}' file.txt

# Custom separator: colon-delimited /etc/passwd
awk -F: '{print $1, $3}' /etc/passwd    # username and UID
awk -F: '{print $1":"$3}' /etc/passwd  # custom output format

# OFS: change output separator
awk -F: 'BEGIN{OFS=","} {print $1,$3,$7}' /etc/passwd

# Line numbers and field count
awk '{print NR, NF, $0}' file.txt

# Last field
awk '{print $NF}' file.txt

# Second-to-last field
awk '{print $(NF-1)}' file.txt
```

Conditional Filtering and Pattern Matching

# Regex match: print lines containing "error"
awk '/error/' app.log
awk '/error/{print NR": "$0}' app.log   # with line numbers

# Inverted match
awk '!/debug/' app.log

# Numeric conditions
awk -F: '$3 >= 1000' /etc/passwd        # users with UID >= 1000
awk '{if ($5 > 100) print $1, $5}' data.txt

# String comparison
awk '$2 == "FAILED"' results.txt
awk '$1 ~ /^192\.168\./' access.log     # IPs starting with 192.168.

# Range pattern (between start and end markers)
awk '/START/,/END/' file.txt            # print lines from START to END
awk 'NR==10,NR==20' file.txt           # print lines 10 to 20

BEGIN and END Blocks

# BEGIN: runs before processing any line
# END: runs after processing all lines
# CSV input needs -F, so that $5 is the fifth comma-separated column
awk -F, 'BEGIN{total=0} {total+=$5} END{print "Total:", total}' sales.csv

# Count lines
awk 'BEGIN{print "Analysis:"} {lines++} END{print "Lines:", lines}' data.txt

# Formatted report
awk -F, 'BEGIN{
    printf "%-20s %10s %10s\n", "Name", "Sales", "Commission"
    printf "%-20s %10s %10s\n", "----", "-----", "----------"
}
{
    commission = $2 * 0.1
    printf "%-20s %10.2f %10.2f\n", $1, $2, commission
}
END{
    print "Done."
}' sales.csv

String Functions

# length: string length
awk '{print length($0), $0}' file.txt

# substr: substring (start at position N, take M chars)
awk '{print substr($1, 1, 4)}' dates.txt

# index: find substring position (0 = not found)
awk '{print index($0, "error")}' log.txt

# split: split string into array
awk '{n=split($1, arr, ":"); for(i=1;i<=n;i++) print arr[i]}' file.txt

# Print lines whose last field is greater than 1000
awk '$NF > 1000 {print $0}' timing.log

awk's core advantage: awk natively supports associative arrays (equivalent to maps/dicts in other languages). This lets you count frequencies, group-sum, and pivot data in a single command — no script file needed. For structured text (logs, CSV, /etc/passwd), awk is far faster to write than a Python script.
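
A typical use of associative arrays might look like the sketch below. The sample data ("region amount" per line) is invented, and the array names count and sum are just illustrative choices.

```bash
# Sample data: "region amount" per line (made up for illustration).
data=$(mktemp)
cat > "$data" <<'EOF'
east 10
west 5
east 7
north 3
west 1
EOF

# count[] and sum[] are associative arrays keyed by the first field.
# "for (k in arr)" iterates in unspecified order, so sort the output.
summary=$(awk '{count[$1]++; sum[$1]+=$2}
               END{for (k in count) print k, count[k], sum[k]}' "$data" | sort)
echo "$summary"
rm -f "$data"
```

Each output line is the key, how many times it appeared, and the total of its amounts.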

sed Complete Tutorial

sed (Stream EDitor) reads input line by line, modifies it according to script commands, and outputs the result. Its power lies in non-interactive batch file modification and precise line addressing.

Basic Substitution Syntax

# Basic substitution: s/old/new/
sed 's/foo/bar/' file.txt          # replace first foo per line
sed 's/foo/bar/g' file.txt         # replace all foo per line
sed 's/foo/bar/2' file.txt         # replace second foo per line
sed 's/foo/bar/gi' file.txt        # replace all, case-insensitive

# In-place editing (NOTE: -i differs between macOS and Linux)
sed -i 's/old/new/g' file.txt      # Linux
sed -i '' 's/old/new/g' file.txt   # macOS (-i requires an empty string)

# Multiple expressions (-e)
sed -e 's/foo/bar/g' -e 's/baz/qux/g' file.txt

# Alternative delimiters (useful when pattern contains slashes)
sed 's|/usr/local|/opt|g' paths.txt
sed 's#https://old.com#https://new.com#g' urls.txt

Address Ranges

One of sed's most powerful features is applying commands to specific lines or line ranges, not just the entire file.

# Line number addresses
sed '5s/foo/bar/' file.txt          # only modify line 5
sed '5,10s/foo/bar/g' file.txt     # modify lines 5 through 10
sed '5,$s/foo/bar/g' file.txt      # from line 5 to end of file

# Regex addresses
sed '/^#/d' config.txt              # delete comment lines
sed '/error/s/warn/ERROR/g' log     # substitute only on lines containing "error"

# Negation (!)
sed '/pattern/!d' file.txt          # keep only matching lines (like grep)
sed '1!d' file.txt                  # keep only line 1

# Step addressing (every N lines)
sed '0~2d' file.txt                 # delete even-numbered lines (GNU sed)
sed '1~2d' file.txt                 # delete odd-numbered lines

Delete, Insert, and Append

# d: delete matching lines
sed '/^$/d' file.txt                # delete blank lines
sed '/^#/d' file.txt                # delete comment lines
sed '/^\s*$/d' file.txt             # delete whitespace-only lines
sed '1d' file.txt                   # delete first line (strip header)
sed '$d' file.txt                   # delete last line

# i: insert before matching line
sed '/^server/i\# nginx server block' nginx.conf

# a: append after matching line
sed '/^worker_processes/a\worker_rlimit_nofile 65535;' nginx.conf

# c: replace entire matching line
sed '/^Port 22/c\Port 2222' /etc/ssh/sshd_config

# p: print (often used with -n to print only matches)
sed -n '10,20p' file.txt            # print lines 10 to 20
sed -n '/error/p' app.log           # print matching lines (like grep)

Multiline Operations

# N: append next line to pattern space
# Join two adjacent lines (remove newline between them)
sed 'N;s/\n/ /' file.txt

# Remove Windows line endings (CRLF -> LF)
sed 's/\r$//' file.txt
# or
sed -i 's/\r//' file.txt

# Compress multiple blank lines into one
sed '/^$/{
N
/^\n$/D
}' file.txt

# Print every line from an opening tag through its closing tag (<div> used as an example)
sed -n '/<div>/,/<\/div>/p' index.html

sed in Practice

# Batch replace domain in config files
sed -i 's/old-domain\.com/new-domain.com/g' /etc/nginx/sites-enabled/*.conf

# Strip HTML tags (simple cases)
sed 's/<[^>]*>//g' page.html

# Insert copyright header at line 1 of all shell scripts
sed -i '1i\# Copyright 2026 YiteAI. All rights reserved.' *.sh

# Extract content between markers
sed -n '/BEGIN CERTIFICATE/,/END CERTIFICATE/p' cert.pem

# Add line numbers to a file
sed = file.txt | sed 'N;s/\n/\t/'

# Cross-platform -i (works on both Linux and macOS)
sed -i.bak 's/a/b/' file.txt   # creates a .bak backup on both platforms

The macOS -i trap: macOS sed (BSD) requires a backup suffix argument after -i (it can be an empty string ''). In GNU sed on Linux the suffix is optional and must be attached directly to -i (as in -i.bak), so a separate '' would be read as the script itself. For cross-platform scripts, use sed -i.bak, or install GNU sed on macOS: brew install gnu-sed.
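
If leaving .bak files behind is not acceptable, a script can probe which sed it is running first. The following is a sketch under assumptions: sed_inplace is a hypothetical helper name, and using --version to tell GNU from BSD is a common heuristic rather than an official interface.

```bash
# sed_inplace is a hypothetical helper name, not a standard command.
# GNU sed accepts --version; BSD sed (macOS) rejects it, which is a
# common heuristic for telling the two apart.
sed_inplace() {
    if sed --version >/dev/null 2>&1; then
        sed -i "$@"        # GNU: no suffix argument
    else
        sed -i '' "$@"     # BSD: explicit empty backup suffix
    fi
}

tmp=$(mktemp)
echo "listen 80" > "$tmp"
sed_inplace 's/80/8080/' "$tmp"
result=$(cat "$tmp")
echo "$result"
rm -f "$tmp"
```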

Other Text Processing Tools

tr: Character-Level Translation

# Case conversion
echo "Hello World" | tr 'a-z' 'A-Z'
echo "Hello World" | tr 'A-Z' 'a-z'

# Delete characters (-d)
echo "He110 W0r1d" | tr -d '0-9'    # delete all digits
echo "remove spaces" | tr -d ' '     # delete all spaces
cat file.txt | tr -d '\r'            # remove Windows line endings

# Squeeze repeated characters (-s)
echo "Hello   World" | tr -s ' '    # compress multiple spaces to one
echo "aaabbbccc" | tr -s 'a-z'      # squeeze repeated lowercase letters

# Character substitution
echo "2026-04-25" | tr '-' '/'      # replace dashes with slashes
echo "a:b:c" | tr ':' '\n'          # replace colons with newlines

# Generate random password
LC_ALL=C tr -dc 'a-zA-Z0-9' < /dev/urandom | head -c 16   # LC_ALL=C avoids "illegal byte sequence" errors on macOS

sort and uniq: Sorting and Deduplication

# sort options
sort file.txt                    # lexicographic sort
sort -n numbers.txt              # numeric sort
sort -rn numbers.txt             # reverse numeric sort
sort -t, -k2,2 data.csv          # sort by field 2 only, comma delimiter (-k2 alone compares to end of line)
sort -k2,2n -k1,1 data.txt       # sort by field 2 numerically, then field 1
sort -u names.txt                # sort and deduplicate
sort -V versions.txt             # version-aware sort (GNU sort)

# uniq options (input must be sorted first)
sort file.txt | uniq             # deduplicate
sort file.txt | uniq -c          # count occurrences
sort file.txt | uniq -d          # show only duplicate lines
sort file.txt | uniq -u          # show only unique (non-repeated) lines
sort -f file.txt | uniq -i       # case-insensitive dedup (sort must fold case too)

# Classic pipeline: top 20 most frequent entries
sort access.log | uniq -c | sort -rn | head -20

# Top IPs by request count
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10
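
The reason input must be sorted first is that uniq only compares adjacent lines. A small comparison on invented input illustrates this:

```bash
# Unsorted input with non-adjacent duplicates (sample data).
input=$(printf 'b\na\nb\na\nb\n')

# Without sort, uniq sees no adjacent repeats and removes nothing.
unsorted=$(printf '%s\n' "$input" | uniq | wc -l | tr -d ' ')

# After sort, equal lines sit next to each other and collapse.
sorted=$(printf '%s\n' "$input" | sort | uniq | wc -l | tr -d ' ')

echo "uniq alone: $unsorted lines, sort | uniq: $sorted lines"
```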

diff and patch: File Comparison and Patching

# Basic comparison
diff file1.txt file2.txt

# Unified format (most common; compatible with patch)
diff -u original.txt modified.txt

# Generate a patch file
diff -u original.conf new.conf > changes.patch

# Apply a patch
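patch original.conf < changes.patch

strings: Extracting Text from Binaries

# Only print strings of length >= 8
strings -n 8 firmware.bin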

# Security analysis: look for suspicious strings in an executable
strings suspicious.exe | grep -E "http|cmd|exec|passwd"

# Extract version info from firmware/image
strings firmware.bin | grep -i "version\|release"

jq: JSON Processing Introduction

# Install
sudo apt install jq

# Pretty-print JSON
cat data.json | jq '.'
curl -s https://api.example.com/data | jq '.'

# Extract a field
echo '{"name":"Alice","age":30}' | jq '.name'
# "Alice"

# Array operations
echo '[1,2,3,4,5]' | jq '.[]'           # expand array
echo '[1,2,3]' | jq 'map(. * 2)'        # map over array

# Filter
cat users.json | jq '.[] | select(.age > 25) | .name'

# Select multiple fields
cat data.json | jq '.[] | {name: .name, email: .email}'

# Extract from GitHub API
curl -s "https://api.github.com/repos/torvalds/linux/commits?per_page=5" | \
  jq '.[] | {sha: .sha[0:8], message: .commit.message[0:60]}'

The three tools working together: grep/awk/sed are most powerful in combination. A typical pipeline: use grep to filter relevant lines → awk to split fields and aggregate → sed to format output. Each tool does what it does best, connected through pipes — far more powerful than any single tool alone.
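
The pipeline described above could be sketched roughly as follows. The log format and contents are invented for the demo, so field positions would need adjusting for a real log.

```bash
# Sample application log; format and contents are invented.
log=$(mktemp)
cat > "$log" <<'EOF'
2026-04-25 INFO  api  started
2026-04-25 ERROR db   timeout
2026-04-25 ERROR api  bad request
2026-04-25 DEBUG api  ping
2026-04-25 ERROR db   connection lost
EOF

# grep filters, awk counts per value of the 3rd field, sort makes
# the order stable, and sed shapes the final text.
out=$(grep 'ERROR' "$log" \
    | awk '{n[$3]++} END{for (s in n) print s, n[s]}' \
    | sort \
    | sed 's/^\([^ ]*\) \(.*\)$/\1: \2 error(s)/')
echo "$out"
rm -f "$log"
```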

{{end}}
