Chapter 3

File Operations Mastery

Algorithm  Extension        Ratio    Speed      Best For
gzip       .tar.gz / .tgz   Medium   Fast       Daily use, best compatibility
bzip2      .tar.bz2         High     Slow       Source releases (legacy)
xz         .tar.xz          Highest  Very slow  Kernel/package releases, storage-first
zstd       .tar.zst         High     Very fast  Modern default: best speed/ratio balance

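**Recommendation:** Prefer zstd for new projects. Developed at Facebook, zstd at its higher levels approaches xz compression ratios while typically decompressing faster than gzip. Since Linux 5.9 the kernel can also compress its own image with zstd.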

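rsync: Incremental Sync and Remote Backup

rsync is the standard tool for production backups. By default it skips files whose size and modification time match on both ends, and for files that did change it uses rolling checksums to send only the differing blocks, which greatly reduces network traffic.

Key Options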

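# Local directory sync (a trailing slash on src/ means "the directory's contents"; no slash means "the directory itself")
rsync -avz /var/www/html/ /backup/html/

# Dry run first, then run for real once the output looks right
rsync -avz --dry-run /var/www/ /backup/www/
rsync -avz /var/www/ /backup/www/

# Mirror sync (extra files on the destination are deleted)
rsync -avz --delete /var/www/ /backup/www/

# Remote sync (rsync over SSH)
rsync -avz -e ssh /local/data/ user@remote:/backup/data/

# Custom SSH port
rsync -avz -e "ssh -p 2222" /local/ user@host:/remote/

# Rate-limited transfer (--bwlimit is in KiB/s; 1024 is roughly 1 MB/s) to avoid saturating bandwidth
rsync -avz --bwlimit=1024 /data/ user@host:/data/

# Exclude several patterns
rsync -avz \
  --exclude='*.log' \
  --exclude='cache/' \
  --exclude='.git/' \
  /var/www/ /backup/www/

# Incremental copy: only files modified within the last day (1440 minutes).
# rsync has no age filter of its own, so GNU find builds the file list.
find /data/ -type f -mmin -1440 -printf '%P\0' |
  rsync -av --from0 --files-from=- /data/ /backup/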

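Production Backup Script Example

#!/bin/bash
# backup.sh: daily incremental backup script

set -euo pipefail

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
SRC="/var/www"
BACKUP_ROOT="/mnt/backup"
DEST="${BACKUP_ROOT}/${TIMESTAMP}"
LATEST="${BACKUP_ROOT}/latest"
LOG="/var/log/backup.log"

echo "[${TIMESTAMP}] Starting backup..." | tee -a "$LOG"

# --link-dest gives snapshot-style incremental backups
# (only changed files take up new space)
rsync -avz --delete \
  --link-dest="${LATEST}" \
  --exclude='*.tmp' \
  --exclude='cache/' \
  "${SRC}/" "${DEST}/" 2>&1 | tee -a "$LOG"

# rsync -a copies the source directory's mtime onto DEST; reset it to "now"
# so the age-based pruning below reflects when the backup was taken
touch "${DEST}"

# Update the 'latest' symlink (-n replaces the link rather than following it)
ln -sfn "${DEST}" "${LATEST}"

echo "[$(date +%Y%m%d_%H%M%S)] Backup complete: ${DEST}" | tee -a "$LOG"

# Keep the last 30 days of backups, delete older ones
find "${BACKUP_ROOT}" -mindepth 1 -maxdepth 1 -type d -name "20*" \
  -mtime +30 -exec rm -rf {} +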

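**How --link-dest works:** For each file that is identical to its copy in the previous backup, rsync creates a hard link instead of copying the data. Every backup is therefore a "full snapshot", yet only changed files consume additional disk space. The total depends on how much the data churns; for slowly changing data, 30 daily backups might use only 2-3x the storage of the original.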

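xargs: Batch Processing Power

xargs converts standard input into command-line arguments. It fills the gap left by pipes, which feed a command's stdin but cannot supply its arguments.

# Basic usage: pass find results to rm (null-delimited, so any filename is safe)
find /tmp -name "*.tmp" -print0 | xargs -0 -r rm -f

# -I{} placeholder: insert the argument in the middle of the command
find . -name "*.log" -print0 | xargs -0 -I{} cp {} /backup/

# Filenames with spaces: pair find -print0 with xargs -0
find . -name "*.txt" -print0 | xargs -0 wc -l

# -P runs in parallel (4 processes at a time)
find . -name "*.jpg" -print0 | xargs -0 -P4 -I{} convert {} -resize 800x {}-resized.jpg

# -n sets how many arguments each invocation receives (2 here)
echo "a b c d e f" | xargs -n2 echo

# Find files containing a string and count their lines (GNU grep -Z emits null-delimited names)
grep -rlZ "TODO" ./src | xargs -0 -r wc -l

# Delete many files without hitting "argument list too long" (-r: do nothing on empty input)
find /var/log -name "*.gz" -mtime +90 -print0 | xargs -0 -r rm -f

# Interactive confirmation (-p asks before each invocation)
find . -name "*.bak" -print0 | xargs -0 -p rm

**Filenames with spaces need -print0 + -0:** By default xargs splits its input on blanks and newlines and treats quotes and backslashes specially, so such filenames break. Always use the find -print0 | xargs -0 combination for real file paths.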

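watch: Periodic Real-time Monitoring

watch re-runs a command at a fixed interval and redraws the screen, which makes it a simple tool for monitoring system state live.

# Refresh disk usage every 2 seconds
watch -n 2 df -h

# Highlight changes (-d)
watch -d -n 1 'ss -tnp'

# Monitor processes (a simplified top)
watch -n 1 'ps aux --sort=-%cpu | head -15'

# Monitor the number of entries in a directory
# (ls -A prints one name per line when piped and no "total" line)
watch -n 5 'ls -A /var/spool/mail/ | wc -l'

# Watch the nginx access log line count
watch -n 2 'wc -l /var/log/nginx/access.log'

# Monitor network connection stats
watch -n 2 'ss -s'

# Hide the header bar (--no-title)
watch --no-title -n 1 uptime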

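inotifywait: Real-time File Change Detection

inotifywait uses the Linux kernel's inotify interface to watch filesystem events and reacts as soon as files are created, modified, or deleted. It is a building block for automated deployments and configuration hot-reload.

# Install
sudo apt install inotify-tools

# Continuously monitor a directory (-m keep running, -r recursive, -e events)
inotifywait -m -r -e create,modify,delete /etc/nginx/

# Monitor specific events with formatted output
inotifywait -m -r \
  --format '%T %w%f %e' \
  --timefmt '%Y-%m-%d %H:%M:%S' \
  -e create,modify,delete \
  /var/www/html/

# Auto-reload nginx when its config changes. Watch the directory rather than
# the file: editors often replace the file, which drops a watch on the old inode.
inotifywait -m -e close_write,moved_to --format '%e %f' /etc/nginx/ |
while read -r event file; do
  [ "$file" = "nginx.conf" ] || continue
  echo "Config changed: $file ($event)"
  nginx -t && systemctl reload nginx
done

# Auto-sync: trigger rsync when the local directory changes
# (--format puts the event first so paths containing spaces survive read)
inotifywait -m -r -e create,modify,delete --format '%e %w%f' /var/www/html/ |
while read -r event path; do
  echo "[$event] $path"
  rsync -az /var/www/html/ user@remote:/var/www/html/
done

**inotify kernel limits:** The per-user watch limit is in /proc/sys/fs/inotify/max_user_watches. It was historically 8192; kernels since 5.11 scale the default with available memory, so check the current value first. For large code repositories, raise it with echo 524288 | sudo tee /proc/sys/fs/inotify/max_user_watches, and persist it by adding fs.inotify.max_user_watches=524288 to /etc/sysctl.conf.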

File Lookup Tools Compared

Linux has several tools for locating commands and files. They work differently and serve different purposes, and they are often confused.

Command  Search Scope               Speed              Notes
which    PATH variable              Instant            Finds executable path in PATH
type     Shell built-in             Instant            Distinguishes builtins/aliases/functions
whereis  Fixed path list            Fast               Finds binary, man page, source
locate   Database index             Very fast          Requires updatedb, new files have delay
find     Real-time filesystem scan  Slow (large dirs)  Most powerful, supports complex criteria

# which: find executable location
which python3
# /usr/bin/python3

# type: classify a command (builtin/alias/function/file)
type ls
# ls is aliased to `ls --color=auto'
type cd
# cd is a shell builtin

# whereis: find binary, man page, and source at once
whereis nginx
# nginx: /usr/sbin/nginx /etc/nginx /usr/share/man/man8/nginx.8.gz

# locate: database-based search, very fast
sudo updatedb           # update the database first
locate "*.conf" | grep nginx

# find: real-time precise search
find /etc -name "*.conf" -mtime -7   # configs modified in the last 7 days
find /var/log -size +100M            # log files over 100MB
find . -perm /4000                   # files with the SUID bit (security audit)

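Text Processing Foundation Tools

The tools below are the building blocks of text-processing pipelines and become more powerful when combined with grep/awk/sed (covered in detail in Chapter 4).

# wc: count lines/words/bytes
wc -l access.log          # line count
wc -w document.txt        # word count
wc -c binary.dat          # byte count

# sort
sort -n numbers.txt        # numeric sort
sort -rn numbers.txt       # reverse numeric sort
sort -t: -k3,3n /etc/passwd   # sort by field 3 (UID) numerically, delimiter :
sort -u names.txt          # sort and deduplicate

# uniq: deduplicate (only adjacent lines, so sort first)
sort access.log | uniq -c | sort -rn | head -20   # top 20 most frequent lines

# cut: field extraction
cut -d: -f1 /etc/passwd    # first column (username) with : delimiter
cut -c1-10 file.txt        # first 10 characters of each line

# paste: horizontal file merge
paste file1.txt file2.txt  # merge the two files column-wise, Tab-separated
paste -d, file1.txt file2.txt  # merge with comma delimiter

# tee: output to both screen and file
ls -la | tee listing.txt                    # display and save
echo "start" | tee -a build.log             # append mode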

{{else}}

Chapter 3: File Operations Mastery

Deep options and safe usage of cp/mv/rm, complete tar guide (gzip/bzip2/xz/zstd comparison), rsync incremental sync and remote backup, xargs batch processing, watch periodic monitoring, inotifywait real-time file change detection.

cp Deep Dive

cp is the most commonly used file copy command, but most users ignore its powerful options. Mastering these options lets you copy files more precisely and safely.

Core Options Explained

# Archive copy: preserve all attributes (great for server migrations)
cp -a /var/www/html/ /backup/html-20260425/

# Only copy files newer than destination (incremental)
cp -u /src/*.conf /dst/

# Auto-backup before overwriting: the old file is kept as nginx.conf.bak
# (--suffix applies to simple backups; numbered backups are named .~1~, .~2~, ...)
cp --backup=simple --suffix=.bak nginx.conf /etc/nginx/nginx.conf

# Copy directory following all symlinks (copies real files)
cp -rL /opt/app/ /backup/app/

# Copy directory preserving symlinks as-is (-a already implies recursion)
cp -a /opt/app/ /backup/app/

-a vs -p difference: -p preserves only mode, ownership, and timestamps. -a is shorthand for -dR --preserve=all: it also recurses, copies symlinks as symlinks (-d), and keeps extended attributes (xattr) where the filesystem supports them. Use -a for backups, -p for everyday copies.
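A small sketch of that difference, run in a scratch directory. It assumes GNU cp, and the file names are made up for illustration: a symlink copied with -p is dereferenced into a regular file, while -a keeps it a symlink.

```shell
# Scratch area; every name below is illustrative only
demo=$(mktemp -d)
mkdir "$demo/src" "$demo/with_p" "$demo/with_a"
echo "data" > "$demo/src/real.txt"
ln -s real.txt "$demo/src/link.txt"

# -p keeps mode/ownership/timestamps but follows the symlink,
# so the destination is a regular file holding the target's content
cp -p "$demo/src/link.txt" "$demo/with_p/"

# -a implies -d, so the symlink itself is copied
# (it dangles here because real.txt was not copied alongside it)
cp -a "$demo/src/link.txt" "$demo/with_a/"

ls -l "$demo/with_p" "$demo/with_a"
```

The listing should show a plain file under with_p and a link.txt -> real.txt entry under with_a.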

mv and Renaming

mv is atomic on the same filesystem (it only rewrites a directory entry; no data moves), but across filesystems it degrades to cp + rm. Moving large files across disks is I/O-heavy and, if interrupted, leaves a partial copy at the destination.

# Move file (same partition: instant; cross-partition: actual copy)
mv largefile.tar.gz /mnt/backup/

# Rename
mv old-name.txt new-name.txt

# Prompt before overwriting
mv -i source.txt target.txt

# Never overwrite existing destination
mv -n draft.txt production.txt

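# Batch rename: change .jpeg to .jpg (Perl-based rename, as packaged on Debian/Ubuntu;
# the util-linux rename found on other distros takes different arguments)
rename 's/\.jpeg$/.jpg/' *.jpeg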

# Batch rename with bash loop (no extra tools needed)
for f in *.log.1; do mv "$f" "${f%.1}.old"; done

Risk of cross-filesystem mv with large files: mv across disks equals cp + rm. If interrupted by a power cut or Ctrl-C, the source file is untouched but the destination is incomplete. For large cross-disk transfers, use rsync --remove-source-files and verify after completion.
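Where rsync is not available, the same copy, verify, then delete idea can be sketched with coreutils alone. The helper name safe_move and the .partial suffix are assumptions made for this example, and DEST is expected to be a file path rather than a directory.

```shell
# safe_move SRC DEST: copy, verify byte-for-byte, then delete the source.
# Hypothetical helper for illustration; not a standard command.
safe_move() {
  local src=$1 dest=$2
  local partial="${dest}.partial"

  # Copy under a temporary name so an interrupted run never leaves
  # a truncated file at the final path
  cp -p -- "$src" "$partial" || return 1

  # Remove the source only after the copy is verified identical
  if cmp -s -- "$src" "$partial"; then
    mv -- "$partial" "$dest" && rm -f -- "$src"
  else
    rm -f -- "$partial"
    return 1
  fi
}
```

Usage would look like safe_move largefile.tar.gz /mnt/backup/largefile.tar.gz. The final mv happens within the destination filesystem, so it is a rename and not a second copy.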

rm Safety Practices

rm -rf is one of the most dangerous commands in Linux. No recycle bin, no confirmation, no undo.

Safer Alternatives

# Install trash-cli
sudo apt install trash-cli   # Debian/Ubuntu

# Safe delete (recoverable)
trash-put /tmp/old-logs/

# List trash contents
trash-list

# Restore a file
trash-restore

# Empty trash
trash-empty

# Confirm each file before deletion
rm -ri ./temp-dir/

# Remove only empty directories (including empty subdirs)
find . -type d -empty -delete

Production rule: Before rm -rf in a script, always print the path or use echo rm -rf to simulate. Never write rm -rf "$VAR/" without first validating VAR is non-empty — an empty variable turns this into rm -rf /.
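One way to encode that rule in a script is a small guard function. This is a sketch, and remove_dir and BUILD_DIR are hypothetical names; the checks are the ones described above: non-empty, not the root, and printed before acting.

```shell
# remove_dir DIR: refuse to run rm -rf on an empty, root, or non-directory path.
# Hypothetical helper for illustration.
remove_dir() {
  local target=${1:-}

  if [ -z "$target" ] || [ "$target" = "/" ]; then
    echo "refusing to remove '${target}'" >&2
    return 1
  fi
  if [ ! -d "$target" ]; then
    echo "not a directory: ${target}" >&2
    return 1
  fi

  echo "removing ${target}"     # print the path before acting
  rm -rf -- "$target"
}

# Inline alternative: ${VAR:?} aborts the command if VAR is unset or empty
# rm -rf -- "${BUILD_DIR:?BUILD_DIR is not set}/"
```

The ${VAR:?message} form is plain POSIX parameter expansion, so it works even in scripts that do not use set -u.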

mkdir and Directory Operations

# Recursively create nested directories
mkdir -p /opt/app/{logs,conf,data,tmp}

# install -d: create directory and set permissions in one step
install -d -m 755 -o www-data -g www-data /var/www/html

# View directory tree
tree -L 2 /opt/app/

# Show file sizes
tree -sh /opt/app/

# Show only directories
tree -d /etc/

touch — More Than Creating Files

touch isn't just for creating empty files. It can precisely control file timestamps, which matters for Makefiles and build systems.

# Create empty file
touch newfile.txt

# Create multiple files
touch file{01..10}.txt

# Update timestamps to now (file must exist; -c skips creation)
touch -c existing.txt

# Update only access time (atime)
touch -a logfile.txt

# Update only modification time (mtime)
touch -m config.conf

# Set a specific timestamp
touch -d "2026-01-01 00:00:00" archive.tar.gz

# Copy timestamps from one file to another
touch -r reference.txt target.txt
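Why timestamps matter to build systems can be shown with a short sketch. It assumes GNU touch (for -d) and a shell whose test supports -nt; needs_rebuild is a hypothetical helper that mirrors the basic rule make applies, rebuilding when the source is newer than the target.

```shell
demo=$(mktemp -d)
touch -d "2020-01-01 00:00:00" "$demo/main.c"   # source
touch -d "2020-01-02 00:00:00" "$demo/main.o"   # object built a day later

# needs_rebuild SRC OBJ: true when the source is newer than the object
needs_rebuild() { [ "$1" -nt "$2" ]; }

if needs_rebuild "$demo/main.c" "$demo/main.o"; then echo "rebuild"; else echo "up to date"; fi

# Touching the source makes it newer than the object, forcing a rebuild
touch "$demo/main.c"
if needs_rebuild "$demo/main.c" "$demo/main.o"; then echo "rebuild"; else echo "up to date"; fi
```

The first check reports the object as up to date; after touch, the second reports that a rebuild is needed.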

tar Complete Guide

tar (Tape ARchive) is the core Linux archiving tool. It doesn't compress by itself — compression is done by external programs (gzip/bzip2/xz/zstd), invoked via the -z/-j/-J/--zstd flags.
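The split between archiving and compressing can be made visible by doing the two steps by hand. A minimal sketch in a scratch directory, with illustrative file names:

```shell
demo=$(mktemp -d)
mkdir "$demo/site"
echo "hello" > "$demo/site/index.html"

# One step: tar runs gzip for you (-z)
tar -czf "$demo/one-step.tar.gz" -C "$demo" site

# Two steps: tar only archives to stdout, gzip compresses the stream
tar -cf - -C "$demo" site | gzip > "$demo/two-step.tar.gz"

# Both archives list the same members
tar -tzf "$demo/one-step.tar.gz"
tar -tzf "$demo/two-step.tar.gz"
```

The same pipe pattern works with any compressor that reads stdin and writes stdout.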

Common Operations Quick Reference

# Create gzip archive (most common)
tar -czf archive.tar.gz /path/to/dir/

# Create with verbose output (list each file)
tar -czvf archive.tar.gz /path/to/dir/

# Extract to current directory
tar -xzf archive.tar.gz

# Extract to a specific directory
tar -xzf archive.tar.gz -C /opt/

# List archive contents without extracting
tar -tzf archive.tar.gz

# Extract a single file from the archive
tar -xzf archive.tar.gz path/inside/archive/file.conf

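# Exclude specific directories and patterns
# (list --exclude options before the paths; some GNU tar releases ignore excludes given after them)
tar -czf backup.tar.gz \
    --exclude='/var/www/cache' \
    --exclude='*.log' \
    --exclude='.git' \
    /var/www/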

# bzip2 compression (higher ratio, slower)
tar -cjf archive.tar.bz2 /path/to/dir/

# xz compression (best ratio, slowest)
tar -cJf archive.tar.xz /path/to/dir/

# zstd compression (modern: fast + good ratio)
tar --zstd -cf archive.tar.zst /path/to/dir/

# Incremental backup: only pack files newer than given date
tar -czf incremental.tar.gz \
    --newer-mtime="2026-04-01" /var/data/

Compression Algorithm Comparison

Algorithm  Extension        Ratio    Speed      Best For
gzip       .tar.gz / .tgz   Medium   Fast       Daily use, best compatibility
bzip2      .tar.bz2         High     Slow       Source releases (legacy)
xz         .tar.xz          Highest  Very slow  Kernel/package releases, storage-first
zstd       .tar.zst         High     Very fast  Modern default: best speed/ratio balance

Recommendation: Choose zstd for new projects. Developed at Facebook, zstd at its higher levels approaches xz compression ratios while typically decompressing faster than gzip. Since Linux 5.9 the kernel can also compress its own image with zstd.
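Ratios depend heavily on the data, so it is worth measuring on your own files. The sketch below compresses one sample with whichever of the four tools are installed and prints the resulting sizes; the sample is synthetic, highly repetitive text, so treat its numbers as an illustration of the method only.

```shell
# Build a synthetic sample (repetitive text compresses unusually well)
sample=$(mktemp)
for i in $(seq 1 2000); do
  echo "line $i: the quick brown fox jumps over the lazy dog"
done > "$sample"

report="original $(wc -c < "$sample")"
for tool in gzip bzip2 xz zstd; do
  if command -v "$tool" >/dev/null 2>&1; then
    # -c writes the compressed stream to stdout; wc -c counts its bytes
    size=$("$tool" -c "$sample" | wc -c)
    report="$report
$tool $size"
  fi
done

echo "$report"
rm -f "$sample"
```

Swapping the synthetic sample for a real tarball and wrapping each call in time gives a rough speed comparison as well.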

rsync: Incremental Sync and Remote Backup

rsync is the standard tool for production backups. By default it skips files whose size and modification time match on both ends, and for files that did change it uses rolling checksums to send only the differing blocks, which drastically reduces network overhead.

Key Options

# Local directory sync (trailing slash on src = "contents of dir", no slash = "the dir itself")
rsync -avz /var/www/html/ /backup/html/

# Dry run first, then execute for real
rsync -avz --dry-run /var/www/ /backup/www/
rsync -avz /var/www/ /backup/www/

# Mirror sync (delete files at destination not in source)
rsync -avz --delete /var/www/ /backup/www/

# Remote sync over SSH
rsync -avz -e ssh /local/data/ user@remote:/backup/data/

# Custom SSH port
rsync -avz -e "ssh -p 2222" /local/ user@host:/remote/

# Rate-limited transfer (--bwlimit is in KiB/s; 1024 is roughly 1 MB/s) to avoid saturating bandwidth
rsync -avz --bwlimit=1024 /data/ user@host:/data/

# Exclude multiple patterns
rsync -avz \
  --exclude='*.log' \
  --exclude='cache/' \
  --exclude='.git/' \
  /var/www/ /backup/www/

Production Backup Script Example

#!/bin/bash
# backup.sh — daily incremental backup script

set -euo pipefail

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
SRC="/var/www"
BACKUP_ROOT="/mnt/backup"
DEST="${BACKUP_ROOT}/${TIMESTAMP}"
LATEST="${BACKUP_ROOT}/latest"
LOG="/var/log/backup.log"

echo "[${TIMESTAMP}] Starting backup..." | tee -a "$LOG"

# --link-dest enables snapshot-style incremental backups
# (only changed files consume new disk space)
rsync -avz --delete \
  --link-dest="${LATEST}" \
  --exclude='*.tmp' \
  --exclude='cache/' \
  "${SRC}/" "${DEST}/" 2>&1 | tee -a "$LOG"

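# rsync -a copies the source directory's mtime onto DEST; reset it to "now"
# so the age-based pruning below reflects when the backup was taken
touch "${DEST}"

# Update the 'latest' symlink (-n replaces the link rather than following it)
ln -sfn "${DEST}" "${LATEST}"

echo "[$(date +%Y%m%d_%H%M%S)] Backup complete: ${DEST}" | tee -a "$LOG"

# Keep last 30 days of backups, delete older ones
find "${BACKUP_ROOT}" -mindepth 1 -maxdepth 1 -type d -name "20*" \
  -mtime +30 -exec rm -rf {} +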

How --link-dest works: for each file that is identical to its copy in the previous backup, rsync creates a hard link instead of copying the data. Every backup is therefore a "full snapshot", yet only changed files consume additional disk space. The total depends on how much the data churns; for slowly changing data, 30 daily backups might use only 2-3x the storage of the original.
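The hard-link mechanism can be reproduced by hand without rsync. The sketch below builds two tiny "snapshots" with ln, mimicking what --link-dest does for unchanged files; the directory and file names are illustrative.

```shell
demo=$(mktemp -d)
mkdir -p "$demo/day1" "$demo/day2"
echo "unchanged" > "$demo/day1/static.html"
echo "v1"        > "$demo/day1/app.conf"

# Unchanged file: hard-link it into the new snapshot (no extra data blocks)
ln "$demo/day1/static.html" "$demo/day2/static.html"
# Changed file: store a fresh copy
echo "v2" > "$demo/day2/app.conf"

# -ef is true when both names refer to the same inode
[ "$demo/day1/static.html" -ef "$demo/day2/static.html" ] && echo "static.html: shared"
[ "$demo/day1/app.conf" -ef "$demo/day2/app.conf" ] || echo "app.conf: separate copies"

# Deleting the old snapshot does not affect the new one
rm -rf "$demo/day1"
cat "$demo/day2/static.html"
```

Because the data survives as long as at least one link to it remains, old snapshots can be pruned in any order without damaging newer ones.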

xargs: Batch Processing Power

xargs converts standard input into command-line arguments. It fills the gap left by pipes, which feed a command's stdin but cannot supply its arguments.

# Basic: pass find results to rm (null-delimited, so any filename is safe)
find /tmp -name "*.tmp" -print0 | xargs -0 -r rm -f

# -I{} placeholder: insert argument at a specific position
find . -name "*.log" -print0 | xargs -0 -I{} cp {} /backup/

# Handle filenames with spaces: use find -print0
find . -name "*.txt" -print0 | xargs -0 wc -l

# -P parallel execution (4 processes simultaneously)
find . -name "*.jpg" -print0 | xargs -0 -P4 -I{} convert {} -resize 800x {}-resized.jpg

# -n: number of arguments per invocation (2 at a time)
echo "a b c d e f" | xargs -n2 echo

# Find files with TODO and count lines (GNU grep -Z emits null-delimited names)
grep -rlZ "TODO" ./src | xargs -0 -r wc -l

# Safely delete many files (avoids "argument list too long"; -r: do nothing on empty input)
find /var/log -name "*.gz" -mtime +90 -print0 | xargs -0 -r rm -f

# Interactive confirmation (-p prompts before each execution)
find . -name "*.bak" -print0 | xargs -0 -p rm

Filenames with spaces require -print0 + -0: by default xargs splits its input on blanks and newlines and treats quotes and backslashes specially, so such filenames break. Always use the find -print0 | xargs -0 combination when working with real file paths.
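The failure is easy to reproduce in a scratch directory. In this sketch, with illustrative file names, the same two files are counted as three arguments with whitespace splitting and as two with null delimiters.

```shell
demo=$(mktemp -d)
touch "$demo/plain.txt" "$demo/with space.txt"

# Whitespace splitting: "with space.txt" is broken into two arguments
unsafe_count=$(find "$demo" -name "*.txt" | xargs -n1 echo | wc -l)

# Null-delimited: each path stays one argument
safe_count=$(find "$demo" -name "*.txt" -print0 | xargs -0 -n1 echo | wc -l)

echo "unsafe: $unsafe_count arguments, safe: $safe_count arguments"
```

With rm in place of echo, the unsafe form would try to delete paths that do not exist, or worse, unrelated files that happen to match a fragment.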

watch: Periodic Real-time Monitoring

watch repeatedly executes a command at a fixed interval and refreshes the screen — a simple but powerful real-time monitoring tool.

# Refresh disk usage every 2 seconds
watch -n 2 df -h

# Highlight changes (-d)
watch -d -n 1 'ss -tnp'

# Monitor processes (simplified top alternative)
watch -n 1 'ps aux --sort=-%cpu | head -15'

# Monitor file count in a directory
# (ls -A prints one name per line when piped and no "total" line)
watch -n 5 'ls -A /var/spool/mail/ | wc -l'

# Watch nginx access log line count
watch -n 2 'wc -l /var/log/nginx/access.log'

# Monitor network connection stats
watch -n 2 'ss -s'

# Hide the header bar
watch --no-title -n 1 uptime

inotifywait: Real-time File Change Detection

inotifywait uses the Linux kernel's inotify interface to watch filesystem events, triggering immediately when files are created, modified, or deleted. It's a foundation for automated deployments and hot-reload configuration systems.

# Install
sudo apt install inotify-tools

# Continuously monitor a directory (-m continuous, -r recursive, -e events)
inotifywait -m -r -e create,modify,delete /etc/nginx/

# Monitor with formatted output
inotifywait -m -r \
  --format '%T %w%f %e' \
  --timefmt '%Y-%m-%d %H:%M:%S' \
  -e create,modify,delete \
  /var/www/html/

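# Auto-reload nginx when its config changes. Watch the directory rather than
# the file: editors often replace the file, which drops a watch on the old inode.
inotifywait -m -e close_write,moved_to --format '%e %f' /etc/nginx/ |
while read -r event file; do
  [ "$file" = "nginx.conf" ] || continue
  echo "Config changed: $file ($event)"
  nginx -t && systemctl reload nginx
done

# Auto-sync: trigger rsync when local directory changes
# (--format puts the event first so paths containing spaces survive read)
inotifywait -m -r -e create,modify,delete --format '%e %w%f' /var/www/html/ |
while read -r event path; do
  echo "[$event] $path"
  rsync -az /var/www/html/ user@remote:/var/www/html/
done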

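inotify kernel limits: the per-user watch limit is in /proc/sys/fs/inotify/max_user_watches. It was historically 8192; kernels since 5.11 scale the default with available memory, so check the current value first. For large codebases, raise it with echo 524288 | sudo tee /proc/sys/fs/inotify/max_user_watches, and persist it by adding fs.inotify.max_user_watches=524288 to /etc/sysctl.conf.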

File Lookup Tools Compared

Command  Search Scope               Speed              Notes
which    PATH variable              Instant            Finds executable path in PATH only
type     Shell built-in             Instant            Distinguishes builtins/aliases/functions/executables
whereis  Fixed path list            Fast               Finds binary, man page, and source simultaneously
locate   Database index             Very fast          Requires updatedb; new files have a delay
find     Real-time filesystem scan  Slow (large dirs)  Most powerful; supports time/permission/size criteria

# which: find executable location
which python3
# /usr/bin/python3

# type: classify a command
type ls
# ls is aliased to `ls --color=auto'
type cd
# cd is a shell builtin

# whereis: find binary, man page, and source at once
whereis nginx
# nginx: /usr/sbin/nginx /etc/nginx /usr/share/man/man8/nginx.8.gz

# locate: database-based search, very fast
sudo updatedb           # update the database first
locate "*.conf" | grep nginx

# find: real-time precise search
find /etc -name "*.conf" -mtime -7    # configs modified in last 7 days
find /var/log -size +100M             # log files over 100MB
find . -perm /4000                    # files with SUID bit (security audit)

Text Processing Foundation Tools

# wc: count lines/words/bytes
wc -l access.log          # line count
wc -w document.txt        # word count
wc -c binary.dat          # byte count

# sort
sort -n numbers.txt        # numeric sort
sort -rn numbers.txt       # reverse numeric sort
sort -t: -k3,3n /etc/passwd   # sort by field 3 (UID) numerically, delimiter : (a bare -k3 would sort from field 3 to end of line)
sort -u names.txt          # sort and deduplicate

# uniq: deduplicate (requires sorted input)
sort access.log | uniq -c | sort -rn | head -20   # top 20 most frequent lines

# cut: field extraction
cut -d: -f1 /etc/passwd    # first column (username) with : delimiter
cut -c1-10 file.txt        # first 10 characters of each line

# paste: horizontal file merge
paste file1.txt file2.txt        # merge columns with tab
paste -d, file1.txt file2.txt   # merge with comma delimiter

# tee: output to both screen and file
ls -la | tee listing.txt                    # display and save
echo "start" | tee -a build.log             # append mode
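
The reason uniq needs sorted input is that it only collapses adjacent duplicate lines. A short sketch with made-up log lines; top_lines is a hypothetical helper wrapping the sort | uniq -c | sort -rn pipeline:

```shell
# uniq only collapses adjacent duplicates
printf 'GET /a\nGET /b\nGET /a\n' | uniq | wc -l          # the two "GET /a" lines are not adjacent
printf 'GET /a\nGET /b\nGET /a\n' | sort | uniq | wc -l   # sorting makes them adjacent first

# top_lines N: the N most frequent lines on stdin (default 20)
top_lines() { sort | uniq -c | sort -rn | head -n "${1:-20}"; }
printf 'GET /a\nGET /b\nGET /a\n' | top_lines 1
```

The first pipeline still reports three lines, the second reports two, and top_lines 1 prints the count 2 next to GET /a.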

{{end}}
