Chapter 16

Container Internals

There is no magic beneath Docker and Kubernetes: containers are a combination of Linux kernel features. This chapter takes apart six namespace types (pid/net/mnt/uts/ipc/user), cgroup v2 resource limits, overlay filesystem layering, root filesystem switching with chroot and pivot_root, seccomp syscall filtering, and Linux capabilities. It then ties everything together in a complete shell script that builds a container by hand, showing what runc and containerd do under the hood.

1. Why Container Isolation is Needed

Running multiple services directly on a single host carries fundamental risks: any process can see all other processes (ps aux), shares the same network stack (port conflicts, traffic sniffing), and accesses the same filesystem (config files, key leakage). The kernel provides isolation through these mechanisms:

Isolation Need	Linux Mechanism	Effect
Process visibility	PID namespace	Container sees only its own process tree
Network stack	Network namespace	Independent NIC, IP, routes, iptables
Filesystem view	Mount namespace + pivot_root	Container has its own root filesystem
Hostname	UTS namespace	Container can have its own hostname
IPC	IPC namespace	Isolates System V IPC and message queues
User privileges	User namespace	Unprivileged user mapped to container root
CPU/Memory/IO	cgroup v2	Hard resource limits, prevents starvation
Syscalls	seccomp	Only whitelisted syscalls permitted

2. Six Linux Namespace Types

PID Namespace — Process Isolation

A PID namespace creates an independent process ID space. The first process in the new namespace has PID 1 (acting as the container's init), but it has a different, real PID on the host. Processes inside the new namespace cannot see processes in the parent namespace, while the parent namespace can see every process in its descendant namespaces.

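# Run bash in a new PID namespace (requires root or a user namespace)
unshare --pid --fork --mount-proc bash

# Inside the new shell, list processes
ps aux
# Only two processes are visible: this bash and ps
# PID 1 = bash (the init of the new namespace)

# On the host, the same process has its real PID
# PID 1 in the new namespace might be PID 12345 on the host

# /proc and PID namespaces
# Each PID namespace needs its own /proc mount.
# --mount-proc above already does this; without it, mount /proc manually:
mount -t proc proc /proc

# List all namespaces of a process
ls -la /proc/1/ns/
# lrwxrwxrwx 1 root root 0  pid -> pid:[4026531836]
# lrwxrwxrwx 1 root root 0  net -> net:[4026531992]
# ...
# Same number = same namespace; different number = isolated

# Show the PID namespace ID of a process
# (pgrep -o picks the oldest match, so exactly one PID is substituted)
readlink /proc/$(pgrep -o nginx)/ns/pid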
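
The namespace links under /proc can also be compared from a script. The sketch below uses a hypothetical helper, same_ns, which is not part of any standard tool. It assumes a Linux /proc filesystem and that the caller is allowed to read the ns links of both processes.

```shell
#!/usr/bin/env bash
# same_ns PID_A PID_B TYPE
# Hypothetical helper: succeeds (exit 0) when both processes are in the same
# namespace of the given TYPE (pid, net, mnt, uts, ipc, user).
# Returns 2 when a link cannot be read (missing process or no permission).
same_ns() {
  local a b
  a=$(readlink "/proc/$1/ns/$3") || return 2
  b=$(readlink "/proc/$2/ns/$3") || return 2
  [ "$a" = "$b" ]
}

# A process always shares every namespace with itself
if same_ns "$$" "$$" pid; then
  echo "same pid namespace"
fi
```

Comparing a host shell with a containerized process, for example same_ns $$ 12345 net, returns non-zero once the container has its own network namespace.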

Network Namespace — Network Stack Isolation

# ip netns: manage named network namespaces
ip netns add mycontainer          # create a named netns
ip netns list                     # list all named netns
ip netns exec mycontainer ip a    # run a command inside the netns
# ip netns delete mycontainer     # delete it (done in the cleanup step below)

# A new netns contains only loopback and cannot reach the outside world
ip netns exec mycontainer ip link show
# 1: lo: <LOOPBACK> mtu 65536 ... state DOWN (down by default)

# veth pair: a "virtual cable" connecting two namespaces
# Create the pair (veth0 stays on the host, veth1 moves into the container netns)
ip link add veth0 type veth peer name veth1
ip link set veth1 netns mycontainer

# Configure IP addresses
ip addr add 10.0.0.1/24 dev veth0
ip link set veth0 up

ip netns exec mycontainer ip addr add 10.0.0.2/24 dev veth1
ip netns exec mycontainer ip link set veth1 up
ip netns exec mycontainer ip link set lo up

# Test connectivity
ping -c 2 10.0.0.2
ip netns exec mycontainer ping -c 2 10.0.0.1

# Set up NAT so the container can reach external networks
# (assumes eth0 is the host's uplink; IP forwarding must be enabled)
sysctl -w net.ipv4.ip_forward=1
iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -o eth0 -j MASQUERADE
ip netns exec mycontainer ip route add default via 10.0.0.1

# Clean up (deleting the netns also removes the veth pair)
ip netns delete mycontainer
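
Names such as veth0 are short, but generated names can exceed the kernel's limit: Linux interface names may be at most 15 characters. The following is a minimal sketch of a pre-flight check; valid_ifname is a hypothetical helper, not an iproute2 command, and it only covers the length, slash, and whitespace rules.

```shell
#!/usr/bin/env bash
# valid_ifname NAME
# Hypothetical pre-flight check for names passed to "ip link add".
# Linux limits interface names to 15 characters (IFNAMSIZ is 16 including
# the terminating NUL) and forbids '/' and whitespace.
valid_ifname() {
  local name=$1
  [ -n "$name" ] || return 1
  [ "${#name}" -le 15 ] || return 1
  case $name in
    */* | *[[:space:]]*) return 1 ;;
  esac
  return 0
}

for name in veth0 veth1 veth-h-mc-deadbeef; do
  if valid_ifname "$name"; then
    echo "$name: ok"
  else
    echo "$name: invalid"      # veth-h-mc-deadbeef is 18 characters long
  fi
done
```

Running the check before ip link add gives a clearer error than the failure the kernel reports for an over-long name.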

Mount / UTS / IPC Namespace

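# Mount namespace: isolates filesystem mount points
# Mounts made in a new mount namespace do not affect the host
unshare --mount bash
mount --bind /tmp/myroot /mnt/test
# On the host, this mount does not appear under /mnt/test

# UTS namespace: independent hostname and domainname
unshare --uts bash
hostname mycontainer          # does not affect the host
hostname                      # → mycontainer
# In another terminal, the host's hostname still has its original value

# IPC namespace: isolates System V IPC (shared memory, semaphores, message queues)
unshare --ipc bash
ipcs                          # list IPC resources (empty in a new namespace)
# System V message queues, shared memory segments, and semaphores are fully isolated between namespaces

# Check whether two processes share a namespace
# If both ns files resolve to the same inode, the processes share that namespace
# (-L follows the symlink; without it stat reports the inode of the link itself)
stat -L -c '%i' /proc/self/ns/uts
stat -L -c '%i' /proc/1/ns/uts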

User Namespace — User Privilege Mapping

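# A user namespace lets a non-root user hold root privileges inside the container
# How it works: UIDs/GIDs inside the container are mapped to an ordinary user on the host

# As a regular user (uid=1000), create a user namespace
unshare --user --map-root-user bash
whoami    # → root (inside the new namespace)
id        # → uid=0(root) gid=0(root) groups=0(root)
# On the host, this process still runs as uid=1000

# UID/GID mapping file
cat /proc/self/uid_map
# Format: container-UID  host-UID  length
# 0         1000      1
# Meaning: uid 0 in the container maps to uid 1000 on the host, for a range of 1

# Larger UID ranges require entries in /etc/subuid and /etc/subgid
cat /etc/subuid
# alice:100000:65536   # alice may use 100000-165535 as subordinate UIDs

# Set a full mapping with newuidmap/newgidmap (PID is the target process ID)
newuidmap PID 0 1000 1 1 100000 65536
# container UID 0 → host UID 1000
# container UIDs 1-65536 → host UIDs 100000-165535

# Rootless containers are built on user namespaces
# Podman runs rootless when invoked as a non-root user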
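
The uid_map format can be read as simple range arithmetic. The sketch below translates a container UID into the host UID it maps to; map_uid is a hypothetical helper that reads map lines (container-UID, host-UID, length) on stdin, so it can be fed either sample text or /proc/<pid>/uid_map.

```shell
#!/usr/bin/env bash
# map_uid INSIDE_UID
# Hypothetical helper: reads uid_map lines on stdin and prints the host UID
# that INSIDE_UID maps to. Fails (exit 1) if the UID is not covered.
map_uid() {
  local target=$1 inside outside count
  while read -r inside outside count; do
    if [ "$target" -ge "$inside" ] && [ "$target" -lt $((inside + count)) ]; then
      echo $((outside + target - inside))
      return 0
    fi
  done
  return 1
}

# The mapping written by the newuidmap example above
MAP='0 1000 1
1 100000 65536'

echo "$MAP" | map_uid 0       # → 1000
echo "$MAP" | map_uid 1000    # → 100999
```

A UID outside every range (for example 70000 with the map above) has no host counterpart; files owned by unmapped IDs typically show up as the overflow user nobody (65534).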

3. unshare in Practice: Creating a Fully Isolated Environment

# Create a shell with full isolation (container-like)
# --fork: run in a child process (the first process in the PID namespace becomes PID 1)
# --pid: new PID namespace
# --net: new network namespace
# --mount: new mount namespace
# --uts: new UTS namespace
# --ipc: new IPC namespace
# --user --map-root-user: new user namespace, current user mapped to root

unshare \
  --pid \
  --fork \
  --net \
  --mount \
  --uts \
  --ipc \
  --user \
  --map-root-user \
  /bin/bash

# Once inside, configure the isolated environment
hostname isolated-container       # set the container hostname
mount -t proc proc /proc          # mount proc (so ps works correctly)

# Verify the isolation
hostname                           # → isolated-container (differs from the host)
ps aux                             # → only processes inside the container
ip a                               # → only lo (no host interfaces)
ls -l /proc/1/ns/                  # → namespace inodes, different from the host's

# Look at the process from the host
# ps aux | grep bash
# The host shows the real PID and UID

4. nsenter: Entering a Running Container's Namespace

nsenter runs a command inside one or more namespaces of a running process without restarting that process. It is the core tool for debugging container network and filesystem issues, and it relies on the same setns(2) system call that docker exec uses to join a container.

# Find the container process's PID as seen from the host
# Assume PID 1 inside the container is PID 12345 on the host
CPID=12345

# Enter all of the container's namespaces (roughly docker exec -it <container> sh)
nsenter -t $CPID --pid --net --mount --uts --ipc /bin/bash

# Enter only the network namespace (network debugging)
nsenter -t $CPID --net ip a
nsenter -t $CPID --net ss -tlnp
nsenter -t $CPID --net ping 8.8.8.8

# Enter only the mount namespace (inspect the filesystem)
nsenter -t $CPID --mount ls /
nsenter -t $CPID --mount cat /etc/resolv.conf

# Practical case: the container has no debugging tools (busybox/distroless images)
# Enter its network namespace from the host and capture with the host's tcpdump
nsenter -t $CPID --net -- tcpdump -i eth0 -w /tmp/cap.pcap

# Get the container PID from Docker
CPID=$(docker inspect --format '{{.State.Pid}}' mycontainer)
nsenter -t $CPID --net ip route show

# Enter with a clean environment and an explicit PATH
# (env -i drops the inherited variables, so PATH is set from scratch)
nsenter -t $CPID --pid --net --mount -- env -i PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin /bin/bash

5. cgroup v2: Resource Isolation and Limits

cgroup (control group) is the Linux kernel's resource management mechanism. It applies hard limits on CPU, memory, I/O, and process count to groups of processes. cgroup v2 is a redesign of v1 around a single unified hierarchy; it has been in the kernel since Linux 4.5 and is the default on modern distributions such as Fedora 31+, Ubuntu 21.10+, and Debian 11+.

# Check whether cgroup v2 is in use
mount | grep cgroup
# cgroup2 on /sys/fs/cgroup type cgroup2 ...  ← v2
# cgroup on /sys/fs/cgroup/cpu type cgroup ...  ← v1 (multiple mount points)

# cgroup v2 root directory
ls /sys/fs/cgroup/
# cgroup.controllers  cgroup.subtree_control  system.slice  user.slice ...

# Show the cgroup of the current process
cat /proc/self/cgroup
# 0::/user.slice/user-1000.slice/session-1.scope

# Show the available controllers
cat /sys/fs/cgroup/cgroup.controllers
# cpuset cpu io memory hugetlb pids rdma misc

# === Create and use a cgroup ===

# Create a new cgroup (a subdirectory under the root cgroup is enough)
mkdir /sys/fs/cgroup/mycontainer

# Enable controllers (declared in the parent's cgroup.subtree_control)
echo "+cpu +memory +pids +io" > /sys/fs/cgroup/cgroup.subtree_control

# Set the CPU quota (cpu.max format: quota period, in microseconds)
# 50000/100000 = 50% of one CPU (at most 50 ms per 100 ms period)
echo "50000 100000" > /sys/fs/cgroup/mycontainer/cpu.max

# Set the memory limit (bytes)
echo $((512 * 1024 * 1024)) > /sys/fs/cgroup/mycontainer/memory.max   # 512 MiB
# memory.swap.max limits swap separately from memory.max; 0 disables swap
echo 0 > /sys/fs/cgroup/mycontainer/memory.swap.max

# Set the maximum number of processes
echo 100 > /sys/fs/cgroup/mycontainer/pids.max

# Set I/O limits (io.max format: MAJOR:MINOR rbps=X wbps=X riops=X wiops=X)
# Look up the device numbers first: ls -la /dev/sda → 8, 0
echo "8:0 rbps=10485760 wbps=10485760" > /sys/fs/cgroup/mycontainer/io.max  # 10 MiB/s

# Add a process to the cgroup (write its PID to cgroup.procs)
echo $$ > /sys/fs/cgroup/mycontainer/cgroup.procs   # adds the current shell

# Verify: list the processes in the cgroup
cat /sys/fs/cgroup/mycontainer/cgroup.procs

# Live statistics
cat /sys/fs/cgroup/mycontainer/cpu.stat
cat /sys/fs/cgroup/mycontainer/memory.current
cat /sys/fs/cgroup/mycontainer/memory.stat

# Test the memory limit (triggers an OOM kill)
# stress --vm 1 --vm-bytes 600M    # tries to allocate 600M, above the limit

# systemd and cgroups
# Every service/scope has a corresponding cgroup
systemctl status nginx
# Shows: CGroup: /system.slice/nginx.service/...
# Resource limits can be set directly through systemd
systemctl set-property nginx.service CPUQuota=50% MemoryMax=512M
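
The cpu.max arithmetic is easy to get wrong by a factor of ten, so it can help to wrap it. The two functions below are hypothetical helpers (not part of any cgroup tooling) that convert between a CPU percentage and the "quota period" string; they assume 100 means one full CPU and a default period of 100000 microseconds.

```shell
#!/usr/bin/env bash
# cpu_max_for_percent PERCENT [PERIOD_US]
# Prints "<quota> <period>" for cpu.max. 100 means one full CPU,
# 250 means two and a half CPUs. Default period: 100000 µs (100 ms).
cpu_max_for_percent() {
  local percent=$1 period=${2:-100000}
  echo "$(( percent * period / 100 )) $period"
}

# cpu_max_to_percent "<quota> <period>"
# Converts a cpu.max value back to a percentage; "max" means unlimited.
cpu_max_to_percent() {
  local quota period
  read -r quota period <<< "$1"
  if [ "$quota" = "max" ]; then
    echo "unlimited"
  else
    echo "$(( quota * 100 / period ))"
  fi
}

cpu_max_for_percent 50            # → 50000 100000
cpu_max_for_percent 25 1000000    # → 250000 1000000
cpu_max_to_percent "max 100000"   # → unlimited
```

The output of cpu_max_for_percent can be written straight into a cgroup, for example cpu_max_for_percent 50 > /sys/fs/cgroup/mycontainer/cpu.max.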

6. chroot and pivot_root: Root Filesystem Switching

chroot — Simple Isolation and Its Limits

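# Build a minimal chroot environment
ROOTFS=/tmp/minichroot
mkdir -p $ROOTFS/{bin,lib,lib64,dev,proc,sys,etc}

# Copy bash and a few tools
cp /bin/bash $ROOTFS/bin/
cp /bin/ls $ROOTFS/bin/
cp /bin/ps $ROOTFS/bin/

# Inspect the shared library dependencies
ldd /bin/bash
# linux-vdso.so.1 → provided by the kernel, nothing to copy
# libtinfo.so.6 → /lib/x86_64-linux-gnu/libtinfo.so.6
# libc.so.6 → /lib/x86_64-linux-gnu/libc.so.6

# Copy the libraries of every binary (ls and ps need their own too),
# keeping each library at its original path inside the chroot
for bin in /bin/bash /bin/ls /bin/ps; do
  ldd "$bin" | grep -o '/[^ ]*' | xargs -I{} cp --parents {} "$ROOTFS"
done

# Mount the required pseudo-filesystems
mount -t proc proc $ROOTFS/proc
mount --bind /dev $ROOTFS/dev

# Enter the chroot
chroot $ROOTFS /bin/bash
ls /                  # shows the directory tree under $ROOTFS
ps aux                # needs the /proc mount to work

# Security problem: a process with root privileges can escape a chroot
# How: keep a file descriptor open on a directory, chroot into a subdirectory
# (which leaves that descriptor outside the new root), then use
# fchdir(fd) + chdir("..") to climb out of the jail step by step
# chroot therefore cannot be used on its own for security isolation

# Clean up (after exiting the chroot shell)
umount $ROOTFS/proc
umount $ROOTFS/dev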

pivot_root — Secure Root Filesystem Switch

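# pivot_root is the approach runc uses:
# 1. Mount the new root at some directory
# 2. Use pivot_root to make it the real /
# 3. Unmount the old root (cutting every link to the host filesystem)

# Work inside a private mount namespace so the host is unaffected
unshare --mount bash

# Assumes $NEWROOT already contains a root filesystem (for example an unpacked busybox)
NEWROOT=/tmp/newroot
mkdir -p $NEWROOT $NEWROOT/.oldroot

# pivot_root requires the new root to be a mount point:
# bind-mount it onto itself (or mount an overlayfs there)
mount --bind $NEWROOT $NEWROOT

# Switch the root filesystem:
# pivot_root <new root> <temporary mount point for the old root>
cd $NEWROOT
pivot_root . .oldroot

# / is now $NEWROOT
# /.oldroot is the old host root filesystem

# Mount a fresh /proc
mount -t proc proc /proc

# Unmount the old root (full isolation)
umount -l /.oldroot
rmdir /.oldroot

# The host filesystem is no longer reachable
# This is why pivot_root + mount namespace is much safer than chroot

# runc's pivot_root flow (simplified):
# 1. unshare(CLONE_NEWNS)  # new mount namespace
# 2. mount overlayfs as the new root
# 3. pivot_root(newroot, newroot/.pivot)
# 4. umount /.pivot
# 5. continue configuring the remaining namespaces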
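
pivot_root fails unless the new root is a mount point, which is why the walkthrough bind-mounts the directory onto itself first. A minimal sketch of that pre-check follows; is_mountpoint is a hypothetical helper that reads /proc/self/mountinfo, and it assumes an absolute, canonical path without spaces.

```shell
#!/usr/bin/env bash
# is_mountpoint PATH
# Hypothetical helper: succeeds when PATH is listed as a mount point in
# /proc/self/mountinfo (field 5). PATH must be absolute and canonical;
# paths containing spaces are not handled because the kernel escapes them.
is_mountpoint() {
  awk -v target="$1" '$5 == target { found = 1 } END { exit !found }' \
    /proc/self/mountinfo
}

if is_mountpoint /; then
  echo "/ is a mount point"
fi
```

On systems with util-linux, mountpoint -q <dir> performs an equivalent check.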

7. Overlay Filesystem: Image Layering Principle

The overlay (union) filesystem "stacks" multiple directories into a unified view: the lower layer (lowerdir) is read-only, the upper layer (upperdir) is writable, and all modifications go into upperdir without affecting lowerdir. This is the core principle behind Docker image layering: each RUN/COPY instruction generates a read-only layer, and a writable layer is added on top when a container starts.

# Create the directory layout for the overlay
OVDIR=/tmp/overlay-demo
mkdir -p $OVDIR/{lower1,lower2,upper,work,merged}

# Write base files into the read-only layers
echo "from lower1" > $OVDIR/lower1/file1.txt
echo "from lower2" > $OVDIR/lower2/file2.txt
echo "original"   > $OVDIR/lower1/shared.txt

# Mount the overlay filesystem
# lowerdir: colon-separated, the leftmost entry has the highest priority (lower2 > lower1)
mount -t overlay overlay \
  -o "lowerdir=$OVDIR/lower2:$OVDIR/lower1,upperdir=$OVDIR/upper,workdir=$OVDIR/work" \
  $OVDIR/merged

# Verify the merged view
ls $OVDIR/merged
# file1.txt  file2.txt  shared.txt  (from the two lower layers)

cat $OVDIR/merged/file1.txt   # → from lower1
cat $OVDIR/merged/file2.txt   # → from lower2

# Modify a file in merged (copy-on-write, CoW)
echo "modified" > $OVDIR/merged/shared.txt

# Afterwards a copy of shared.txt appears in the upper layer
cat $OVDIR/upper/shared.txt   # → modified
cat $OVDIR/lower1/shared.txt  # → original (the source file is untouched)

# Delete a file in merged (creates a whiteout file)
rm $OVDIR/merged/file1.txt
ls -l $OVDIR/upper/
# c--------- 1 root root 0, 0 ... file1.txt  ← whiteout (character device 0,0)
# It means "deleted in this layer"; file1.txt still exists in the lower layer but is hidden

# Docker uses the overlay2 driver (in-kernel support, no FUSE needed)
# Show the overlayfs paths Docker uses for a container
docker inspect mycontainer | grep -A5 GraphDriver

# Where Docker stores image layers
ls /var/lib/docker/overlay2/
# Each directory is one layer (diff/ merged/ link/ lower/ work/)

# Unmount
umount $OVDIR/merged
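
The lookup order can be modelled without mounting anything. The sketch below is a simplified simulation, not how the kernel implements overlayfs: resolve_layer is a hypothetical function that walks the layers top-first (upperdir, then the lowerdir entries from left to right) and reports which one provides a file.

```shell
#!/usr/bin/env bash
# resolve_layer NAME LAYER...
# Simplified model of an overlay lookup: layers are given top-first.
# Prints the layer that provides NAME. A character device in a layer is
# treated as a whiteout; real overlayfs also requires device number 0,0.
resolve_layer() {
  local name=$1 layer
  shift
  for layer in "$@"; do
    if [ -c "$layer/$name" ]; then
      return 1                      # deleted in this layer
    fi
    if [ -e "$layer/$name" ]; then
      echo "$layer"
      return 0
    fi
  done
  return 1                          # not present in any layer
}

demo=$(mktemp -d)
mkdir -p "$demo"/{upper,lower2,lower1}
echo "original"    > "$demo/lower1/shared.txt"
echo "modified"    > "$demo/upper/shared.txt"
echo "from lower2" > "$demo/lower2/file2.txt"

resolve_layer shared.txt "$demo/upper" "$demo/lower2" "$demo/lower1"   # prints the upper directory
resolve_layer file2.txt  "$demo/upper" "$demo/lower2" "$demo/lower1"   # prints the lower2 directory
rm -rf "$demo"
```

The first match wins, which is why the copy in upper hides the original in lower1 without modifying it.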

8. seccomp: Syscall Filtering

seccomp (secure computing mode) allows a process to declare which system calls it uses, with the kernel denying all others. This reduces the container escape attack surface: even if container code is compromised, attackers cannot call dangerous syscalls like ptrace, mount, or kexec_load.

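# Check whether the kernel supports seccomp
grep CONFIG_SECCOMP /boot/config-$(uname -r)
# CONFIG_SECCOMP=y
# CONFIG_SECCOMP_FILTER=y

# Check the seccomp state of the current process
grep Seccomp /proc/self/status
# Seccomp: 0    (0 = disabled, 1 = STRICT mode, 2 = FILTER mode)

# SECCOMP_MODE_STRICT: only the 4 syscalls read/write/_exit/sigreturn are allowed
# SECCOMP_MODE_FILTER: allow/deny lists expressed as BPF rules

# Use strace to find the syscalls a program really makes (first step towards an allow list)
# Summary rows are right-aligned, so match on a numeric first column and skip the total row
strace -c -f -o /tmp/strace.out ./myapp
awk '$1 ~ /^[0-9.]+$/ && $NF != "total" { print $NF }' /tmp/strace.out

# Look up syscall numbers with the libseccomp tool
scmp_sys_resolver openat
scmp_sys_resolver write

# Docker's default seccomp profile is built into the daemon
# (source: profiles/seccomp/default.json in the moby repository)
# It allows most of the 300+ syscalls and blocks the dangerous ones:
# Blocked: kexec_load, create_module, init_module, delete_module
# Blocked: mount (unless --privileged or CAP_SYS_ADMIN is granted)
# Blocked: ptrace on kernels older than 4.8 (newer kernels allow it)
# Blocked: clone with CLONE_NEWUSER (prevents user namespace escapes)

# Run a container with a custom seccomp profile
docker run --security-opt seccomp=/path/to/custom-profile.json myimage

# Disable seccomp (not recommended, debugging only)
docker run --security-opt seccomp=unconfined myimage

# Use bpftrace to watch processes installing seccomp filters via seccomp(2)
# (this tracepoint fires on the seccomp syscall itself, not on denied syscalls)
bpftrace -e 'tracepoint:syscalls:sys_enter_seccomp { printf("pid=%d comm=%s\n", pid, comm); }'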
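
A custom profile for --security-opt seccomp=... is a JSON file with a default action and a list of rules. The sketch below generates a deny-by-default skeleton from a list of syscall names; make_seccomp_profile is a hypothetical helper, and the four syscalls in the example are placeholders. A real container needs a far longer allow list (execve, openat, mmap, and so on) before it will even start.

```shell
#!/usr/bin/env bash
# make_seccomp_profile SYSCALL...
# Hypothetical helper: prints a Docker-style seccomp profile that returns
# an error for every syscall except the ones named on the command line.
make_seccomp_profile() {
  local names="" n
  for n in "$@"; do
    names="${names:+$names, }\"$n\""
  done
  cat <<EOF
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    { "names": [$names], "action": "SCMP_ACT_ALLOW" }
  ]
}
EOF
}

# Placeholder allow list, for illustration only
make_seccomp_profile read write exit_group rt_sigreturn
```

In practice the list passed to the helper would come from the strace summary above, and the output would be redirected to the file given to --security-opt seccomp=.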

9. Linux Capabilities in Containers

The traditional Unix permission model only has two levels: "root (omnipotent)" and "regular user." Linux capabilities split root's super-privileges into approximately 40 independent abilities. Processes only need to declare the capabilities they require, limiting damage if compromised.

# List all capabilities
man capabilities | grep "CAP_" | head -30

# Common capabilities (container-related):
# CAP_NET_BIND_SERVICE  bind ports below 1024 (needed by nginx)
# CAP_NET_ADMIN         configure interfaces and routes (needed for container network setup)
# CAP_SYS_PTRACE        ptrace other processes (debugging, usually dropped)
# CAP_SYS_ADMIN         a large set of admin operations (nearly equal to root, avoid granting it)
# CAP_CHOWN             change file ownership
# CAP_SETUID/SETGID     switch UID/GID
# CAP_KILL              send signals to any process
# CAP_NET_RAW           use RAW sockets (needed by ping)

# Show the capabilities of the current process
grep Cap /proc/self/status
# CapInh: 0000000000000000  inheritable capabilities
# CapPrm: 0000000000000000  permitted capabilities (bitmap)
# CapEff: 0000000000000000  effective capabilities
# CapBnd: 000001ffffffffff  capability bounding set (upper limit)

# Decode a capability bitmap
capsh --decode=000001ffffffffff

# Docker default capabilities (adjust with --cap-add/--cap-drop)
# Included by default: CHOWN, DAC_OVERRIDE, FOWNER, FSETID, KILL, SETGID, SETUID,
#           SETPCAP, NET_BIND_SERVICE, NET_RAW, SYS_CHROOT, MKNOD, AUDIT_WRITE,
#           SETFCAP
# Not included by default: SYS_ADMIN, NET_ADMIN, SYS_PTRACE

# Add or drop capabilities at run time
docker run --cap-add NET_ADMIN myimage          # add (e.g. to change routes with ip route)
docker run --cap-drop NET_RAW myimage           # drop (disables ping)
# Least privilege: the stock nginx image also needs CHOWN/SETGID/SETUID to drop to its worker user
docker run --cap-drop ALL --cap-add NET_BIND_SERVICE --cap-add CHOWN --cap-add SETGID --cap-add SETUID nginx

# Verify with capsh
capsh --print | grep Current
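
When capsh is not installed (common in minimal images), a bitmap from /proc/<pid>/status can be tested with shell arithmetic. This is a small sketch: has_cap is a hypothetical helper, the capability numbers are taken from linux/capability.h, and 00000000a80425fb is the bitmap that corresponds to the 14 default Docker capabilities listed above.

```shell
#!/usr/bin/env bash
# has_cap HEXMASK BIT
# Succeeds when capability number BIT is set in HEXMASK (the format used
# by the Cap* lines of /proc/<pid>/status). Numbers come from
# linux/capability.h, e.g. CAP_NET_ADMIN=12, CAP_NET_RAW=13, CAP_SYS_ADMIN=21.
has_cap() {
  local mask=$1 bit=$2
  (( (16#$mask >> bit) & 1 ))
}

DOCKER_DEFAULT=00000000a80425fb
for entry in NET_ADMIN:12 NET_RAW:13 SYS_ADMIN:21; do
  name=${entry%%:*}
  bit=${entry##*:}
  if has_cap "$DOCKER_DEFAULT" "$bit"; then
    echo "CAP_$name: granted"
  else
    echo "CAP_$name: not granted"
  fi
done
# → CAP_NET_ADMIN: not granted
# → CAP_NET_RAW: granted
# → CAP_SYS_ADMIN: not granted
```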

10. Manually Building a Container: Complete Shell Script

Combining all techniques from this chapter, build a busybox container using pure shell: namespace isolation, overlayfs filesystem, cgroup resource limits, and veth network connectivity. This is a simplified version of what runc does.

#!/usr/bin/env bash
# mini-container.sh: build a container by hand (requires root, Linux 5.x+)
# Usage: sudo ./mini-container.sh

set -euo pipefail

# Keep the ID short: the veth names derived from it must fit in 15 characters
CONTAINER_ID="mc-$(head -c2 /dev/urandom | xxd -p)"
BASE_DIR="/tmp/minicontainer/$CONTAINER_ID"
BUSYBOX_IMAGE="/tmp/busybox-rootfs"    # prepared in advance, or created in step 1

log() { echo "[$(date +%T)] $*"; }
cleanup() {
  log "Cleaning up $CONTAINER_ID ..."
  umount "$BASE_DIR/merged" 2>/dev/null || true
  ip link delete "veth-h-$CONTAINER_ID" 2>/dev/null || true
  ip netns delete "ns-$CONTAINER_ID" 2>/dev/null || true
  iptables -t nat -D POSTROUTING -s "10.200.0.0/24" -o eth0 -j MASQUERADE 2>/dev/null || true
  rmdir /sys/fs/cgroup/"$CONTAINER_ID" 2>/dev/null || true
  rm -rf "$BASE_DIR"
  log "Done."
}
trap cleanup EXIT

# === Step 1: prepare the busybox rootfs ===
if [ ! -d "$BUSYBOX_IMAGE" ]; then
  log "Preparing busybox rootfs ..."
  mkdir -p "$BUSYBOX_IMAGE"
  # Export the busybox filesystem with Docker (if Docker is available)
  if command -v docker &>/dev/null; then
    docker export "$(docker create busybox)" | tar -C "$BUSYBOX_IMAGE" -xf -
  else
    # Otherwise download the static busybox binary
    mkdir -p "$BUSYBOX_IMAGE/bin"
    wget -qO "$BUSYBOX_IMAGE/bin/busybox" \
      "https://busybox.net/downloads/binaries/1.35.0-x86_64-linux-musl/busybox"
    chmod +x "$BUSYBOX_IMAGE/bin/busybox"
    # Create symlinks for the applets used in this script and in the container shell
    for cmd in sh ls ps cat echo mount umount rmdir hostname ip ping; do
      ln -sf busybox "$BUSYBOX_IMAGE/bin/$cmd"
    done
    mkdir -p "$BUSYBOX_IMAGE"/{proc,sys,dev,tmp,etc}
    echo "nameserver 8.8.8.8" > "$BUSYBOX_IMAGE/etc/resolv.conf"
    echo "root:x:0:0:root:/root:/bin/sh" > "$BUSYBOX_IMAGE/etc/passwd"
  fi
fi

# === Step 2: create the overlayfs filesystem ===
log "Setting up overlayfs for $CONTAINER_ID ..."
mkdir -p "$BASE_DIR"/{upper,work,merged}

mount -t overlay overlay \
  -o "lowerdir=$BUSYBOX_IMAGE,upperdir=$BASE_DIR/upper,workdir=$BASE_DIR/work" \
  "$BASE_DIR/merged"

# Prepare mount points for the pseudo-filesystems inside merged
mkdir -p "$BASE_DIR/merged"/{proc,sys,dev}

# === Step 3: create the network namespace and veth pair ===
log "Setting up network for $CONTAINER_ID ..."

NETNS="ns-$CONTAINER_ID"
VETH_HOST="veth-h-$CONTAINER_ID"
VETH_CONT="veth-c-$CONTAINER_ID"
HOST_IP="10.200.0.1"
CONT_IP="10.200.0.2"

ip netns add "$NETNS"
ip link add "$VETH_HOST" type veth peer name "$VETH_CONT"
ip link set "$VETH_CONT" netns "$NETNS"

# Configure the host side
ip addr add "$HOST_IP/24" dev "$VETH_HOST" 2>/dev/null || true
ip link set "$VETH_HOST" up

# Configure the container side (inside the netns)
ip netns exec "$NETNS" ip addr add "$CONT_IP/24" dev "$VETH_CONT"
ip netns exec "$NETNS" ip link set "$VETH_CONT" up
ip netns exec "$NETNS" ip link set lo up
ip netns exec "$NETNS" ip route add default via "$HOST_IP"

# Enable forwarding and NAT so the container can reach external networks
# (assumes eth0 is the host's uplink interface)
sysctl -qw net.ipv4.ip_forward=1 2>/dev/null || true
iptables -t nat -A POSTROUTING -s "10.200.0.0/24" -o eth0 -j MASQUERADE 2>/dev/null || true

log "Container IP: $CONT_IP, Host IP: $HOST_IP"

# === Step 4: create the cgroup and set resource limits ===
log "Configuring cgroup for $CONTAINER_ID ..."

CG_PATH="/sys/fs/cgroup/$CONTAINER_ID"
mkdir -p "$CG_PATH"

# Enable controllers (the parent cgroup must allow them)
echo "+cpu +memory +pids" > /sys/fs/cgroup/cgroup.subtree_control 2>/dev/null || true

# CPU: at most 25% of one CPU (250 ms per 1000 ms period)
echo "250000 1000000" > "$CG_PATH/cpu.max" 2>/dev/null || true

# Memory: at most 128 MiB, with swap disabled so the limit is a hard cap
echo $((128 * 1024 * 1024)) > "$CG_PATH/memory.max" 2>/dev/null || true
echo 0 > "$CG_PATH/memory.swap.max" 2>/dev/null || true

# pids: at most 50 processes
echo 50 > "$CG_PATH/pids.max" 2>/dev/null || true

log "Resource limits: CPU=25%, Memory=128MiB, PIDs=50"

# === Step 5: start the container process (combining all namespaces) ===
log "Starting container $CONTAINER_ID ..."
log "Root filesystem: $BASE_DIR/merged"
log "cgroup: $CG_PATH"
log "Type 'exit' in the container to stop."
echo "---"

NEWROOT="$BASE_DIR/merged"

# Namespaces used:
# ip netns exec: join the network namespace created in step 3
# --pid --fork: new PID namespace; the forked child becomes PID 1
# --mount: new mount namespace (keeps the host unaffected)
# --uts: new hostname namespace
# --ipc: new IPC namespace
#
# The subshell first moves itself into the cgroup, so every process it
# starts inherits the limits. It runs in the foreground so that the
# container shell keeps the terminal as its stdin.
(
  echo "$BASHPID" > "$CG_PATH/cgroup.procs" 2>/dev/null || true

  exec ip netns exec "$NETNS" unshare \
    --pid \
    --fork \
    --mount \
    --uts \
    --ipc \
    -- \
    /bin/bash -c "
      set -e

      # Set the hostname
      hostname '$CONTAINER_ID'

      # Mount /proc for the new PID namespace and expose the host's /dev
      mount -t proc proc '$NEWROOT/proc'
      mount --bind /dev '$NEWROOT/dev'

      # Switch the root with pivot_root (safer than chroot).
      # The overlay mount is already a mount point, as pivot_root requires;
      # bind-mounting it again here would hide the /proc and /dev mounts above.
      mkdir -p '$NEWROOT/.oldroot'
      cd '$NEWROOT'
      pivot_root . .oldroot

      # Detach the old root
      umount -l /.oldroot
      rmdir /.oldroot

      # Enter the container shell
      echo '=== Welcome to MiniContainer ==='
      echo '=== Container ID: $CONTAINER_ID ==='
      exec /bin/sh
    "
) || true

log "Container $CONTAINER_ID exited."

Script Notes: This script demonstrates the core principles of container implementation. In production, use runc/crun (OCI standard runtimes) or containerd. runc is the default runtime for Docker/Kubernetes and implements the complete OCI (Open Container Initiative) specification, including: config.json spec file, full seccomp/capabilities support, rootless containers (user namespace), and cgroup management. Run with: runc run mycontainer.

Container Technology Summary: A complete container = namespace isolation (cannot see host processes/network/filesystem) + cgroup limits (cannot exhaust CPU/memory) + overlayfs (image layering, copy-on-write) + pivot_root (secure root filesystem switch) + seccomp (only permitted syscalls) + capabilities (least-privilege root). Docker, Podman, and containerd are all high-level tools built on top of these kernel primitives.
