Chapter 40

Build a Simple Container

The word "container" has been overloaded. Many engineers think of containers as some kind of lightweight virtual machine — a Docker image containing an operating system, processes running inside, mysteriously isolated from each other. This understanding is not entirely wrong, but it obscures a more important truth: containers are not new technology. They are a composition of features already present in the Linux kernel.

This chapter removes the mystery. Starting from Linux namespaces and cgroups, we will use Go to implement a minimal container runtime that can isolate processes, mount a filesystem, and limit memory and CPU usage. Along the way you will see what Docker and Kubernetes are doing underneath and which part of the Linux kernel to investigate when a container misbehaves.

Level 1 · What Containers Really Are: Not Virtual Machines

Virtual Machine vs. Container: The Fundamental Difference

A virtual machine (VM) uses a Hypervisor to emulate complete hardware: the CPU instruction set, memory controller, network card, disk controller. The Guest OS runs on emulated hardware and knows nothing about the underlying physical machine. This is true isolation — two VMs can run completely different OS kernels (Linux and Windows simultaneously on one physical machine).

Containers share the host machine's Linux kernel. Processes inside a container are ordinary processes on the host, simply made to "not see" other processes, "not see" other filesystems, "not see" other network interfaces — via Linux namespace mechanisms. The cgroup mechanism limits how much CPU time and memory they can use.

Virtual Machine Architecture:
┌─────────────────────────────────────────┐
│  App   │   App   │   App               │
├────────┴─────────┴─────────────────────┤
│              Guest OS                   │
├─────────────────────────────────────────┤
│              Hypervisor                 │
├─────────────────────────────────────────┤
│         Host OS / Physical Hardware     │
└─────────────────────────────────────────┘

Container Architecture:
┌────────────┬────────────┬──────────────┐
│ Container  │ Container  │  Container   │
│ (ns+cg)    │ (ns+cg)    │  (ns+cg)     │
├────────────┴────────────┴──────────────┤
│           Host Linux Kernel            │
├────────────────────────────────────────┤
│             Physical Hardware          │
└────────────────────────────────────────┘

This architectural difference produces fundamental performance differences: container startup time is measured in milliseconds (process start), VM startup time is measured in seconds (OS boot). Container memory overhead equals the process's memory usage; VM memory overhead includes a complete Guest OS (typically hundreds of MB).

Why Containers Are "Processes"

After running docker run -d nginx, execute ps aux | grep nginx on the host — you will see the nginx process appear in the host's process list (just with a different PID). This nginx process:

Uses the host machine's CPU clock
Uses the host machine's physical memory
Makes system calls through the host machine's kernel

It is simply "wrapped" by namespaces and cgroups — given a restricted "worldview."

The Two Core Mechanisms: Namespace and Cgroup

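Linux Namespace (first introduced in 2002 with the mount namespace in Linux 2.4.19; the other types followed during 2.6.x, user namespaces were completed in 3.8, and the cgroup namespace arrived in 4.6): controls what a process can "see." A process started in a new namespace gets its own independent view of that resource and cannot see or interfere with the corresponding resources in other namespaces.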

Linux Cgroup (Control Groups, introduced in 2008, 2.6.24): controls how much resource a process can "use." Sets upper limits on CPU, memory, disk I/O, and the number of processes.

Without namespaces: a process can see all other processes (ps aux shows the entire machine), can operate on all files, can use any port. Without cgroups: a process can consume all the machine's memory (the OOM killer will kill other processes) and monopolize all CPU.

namespace + cgroup = container isolation + resource limits.
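
To see these namespaces on a running system, the kernel exposes one symbolic link per namespace under /proc/<pid>/ns. The sketch below reads the links for the current process; two processes share a namespace exactly when the inode numbers match. The helper name parseNsLink is made up for this illustration.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNsLink splits a /proc/<pid>/ns/* link target such as
// "pid:[4026531836]" into the namespace type and its inode number.
func parseNsLink(link string) (string, uint64, error) {
	kind, rest, ok := strings.Cut(link, ":[")
	if !ok || !strings.HasSuffix(rest, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", link)
	}
	inode, err := strconv.ParseUint(strings.TrimSuffix(rest, "]"), 10, 64)
	if err != nil {
		return "", 0, fmt.Errorf("bad inode in %q: %w", link, err)
	}
	return kind, inode, nil
}

func main() {
	for _, ns := range []string{"pid", "mnt", "net", "uts", "ipc"} {
		link, err := os.Readlink("/proc/self/ns/" + ns)
		if err != nil {
			fmt.Printf("%-4s unavailable: %v\n", ns, err)
			continue
		}
		kind, inode, err := parseNsLink(link)
		if err != nil {
			fmt.Printf("%-4s %v\n", ns, err)
			continue
		}
		fmt.Printf("%-4s inode %d\n", kind, inode)
	}
}
```

Running this once on the host and once inside a container should print different inode numbers for every namespace the container does not share with the host.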

Why Go Is Well-Suited for Container Runtimes

Go's standard library provides direct access to Linux system calls (the syscall package) and convenient process management (os/exec). Go programs compile to static binaries (no external dependencies), suitable for running inside containers (which may not have a complete C runtime). runc — Docker's default container runtime — is written in Go.

Level 2 · Linux Namespace and Cgroup Internals

The Seven Linux Namespaces

Namespace	Constant	Isolates	Typical Use
PID	`CLONE_NEWPID`	Process ID space	Processes inside the container see PIDs starting from 1
Network	`CLONE_NEWNET`	Network interfaces, routing tables, firewall rules	Container has an independent network stack
Mount	`CLONE_NEWNS`	Filesystem mount points	Container has an independent filesystem view
UTS	`CLONE_NEWUTS`	Hostname and domain name	Container has an independent hostname
IPC	`CLONE_NEWIPC`	Message queues, shared memory, semaphores	IPC resources isolated between containers
User	`CLONE_NEWUSER`	UID/GID mapping	Root inside the container maps to a regular user on the host
Cgroup	`CLONE_NEWCGROUP`	Cgroup root directory view	Container cannot see the host's cgroup hierarchy

In Go, specify which namespaces to create by setting Cloneflags in syscall.SysProcAttr:

cmd := exec.Command("bash")
cmd.SysProcAttr = &syscall.SysProcAttr{
    Cloneflags: syscall.CLONE_NEWUTS |
                syscall.CLONE_NEWPID |
                syscall.CLONE_NEWNS  |
                syscall.CLONE_NEWNET |
                syscall.CLONE_NEWIPC,
}
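
When debugging, it can help to turn a Cloneflags value back into names. The following sketch does that with the numeric values from <linux/sched.h>, written out by hand so it does not depend on which constants a particular Go package exports. The function name describeCloneFlags is an assumption of this example.

```go
package main

import "fmt"

// Namespace-related clone(2) flags and their values from <linux/sched.h>.
var nsFlags = []struct {
	bit  uintptr
	name string
}{
	{0x00020000, "CLONE_NEWNS"},
	{0x02000000, "CLONE_NEWCGROUP"},
	{0x04000000, "CLONE_NEWUTS"},
	{0x08000000, "CLONE_NEWIPC"},
	{0x10000000, "CLONE_NEWUSER"},
	{0x20000000, "CLONE_NEWPID"},
	{0x40000000, "CLONE_NEWNET"},
}

// describeCloneFlags lists the namespace flags set in mask,
// in the order of the table above.
func describeCloneFlags(mask uintptr) []string {
	var names []string
	for _, f := range nsFlags {
		if mask&f.bit != 0 {
			names = append(names, f.name)
		}
	}
	return names
}

func main() {
	// UTS | PID | NS | NET | IPC, the combination used in the snippet above.
	mask := uintptr(0x04000000 | 0x20000000 | 0x00020000 | 0x40000000 | 0x08000000)
	fmt.Printf("%#x -> %v\n", mask, describeCloneFlags(mask))
}
```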

PID Namespace in Depth

A PID namespace creates an independent process ID space. The first process in the namespace (the container's init process) has PID 1 within that namespace, but has a different PID on the host (say, 4782).

PID 1 has special meaning in Linux: it is the "foster home" for orphaned processes (when a process exits, its children are reparented to PID 1). More importantly, the kernel does not apply default signal actions to PID 1: a SIGTERM sent to it is ignored unless the process has installed its own handler. This is a common cause of Docker containers failing to stop gracefully (the entry process does not handle SIGTERM, or does not forward the signal to its child processes), so docker stop waits out its timeout and then sends SIGKILL.

Process tree as seen inside the container:

PID 1: /init or /bin/sh  (the container's entry process)
  PID 2: nginx worker
  PID 3: nginx worker

As seen from the host:

PID 4782: /bin/sh
  PID 4783: nginx worker
  PID 4784: nginx worker
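
The two views can be correlated from the host: since Linux 4.1, /proc/<pid>/status contains an NSpid line that lists the process's PID in each nested PID namespace, outermost first. The sketch below parses that line; the function name nsPIDs is hypothetical, and the sample values in the comment are the ones used in the trees above.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// nsPIDs extracts the NSpid line from the contents of /proc/<pid>/status.
// For the container init process above it would return [4782 1].
func nsPIDs(status string) ([]int, error) {
	for _, line := range strings.Split(status, "\n") {
		if !strings.HasPrefix(line, "NSpid:") {
			continue
		}
		var pids []int
		for _, field := range strings.Fields(strings.TrimPrefix(line, "NSpid:")) {
			n, err := strconv.Atoi(field)
			if err != nil {
				return nil, fmt.Errorf("bad NSpid entry %q: %w", field, err)
			}
			pids = append(pids, n)
		}
		return pids, nil
	}
	return nil, fmt.Errorf("no NSpid line found")
}

func main() {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read status:", err)
		return
	}
	pids, err := nsPIDs(string(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("PIDs from outermost to innermost namespace:", pids)
}
```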

Mount Namespace and pivot_root

Mount namespace isolates the filesystem mount point view, but that is only the first step. The key to giving a container an independent filesystem is pivot_root (or the traditional chroot):

chroot (simple version): changes a process's root directory (/). The process cannot see files outside the new root. But chroot has security issues: a privileged process can escape to the real root directory using chroot combined with relative paths.

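pivot_root (production version): swaps the root mount of the mount namespace itself rather than one process's idea of /, which closes the chroot escape. It makes the new root /, moves the old root to a subdirectory of the new root, and the runtime then unmounts that subdirectory so the host filesystem is no longer reachable.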

// Simplified version using chroot to demonstrate the concept.
// Production container runtimes use pivot_root.
func setupRootFS(newroot string) error {
    // Bind mount: make newroot an independent mount point (prerequisite for pivot_root)
    if err := syscall.Mount(newroot, newroot, "", syscall.MS_BIND|syscall.MS_REC, ""); err != nil {
        return fmt.Errorf("bind mount: %w", err)
    }
    // chroot to the new root
    if err := syscall.Chroot(newroot); err != nil {
        return fmt.Errorf("chroot: %w", err)
    }
    return os.Chdir("/")
}
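
For comparison, here is a rough sketch of the pivot_root sequence described above. It is not taken from runc; the function name pivotRoot and the .old_root directory name are choices made for this example. It assumes the caller already runs in its own mount namespace with CAP_SYS_ADMIN and has made "/" a private mount.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// pivotRoot makes newroot the root of the calling process's mount namespace
// and detaches the old root afterwards.
func pivotRoot(newroot string) error {
	// pivot_root requires newroot to be a mount point; bind-mounting the
	// directory onto itself satisfies that.
	if err := syscall.Mount(newroot, newroot, "", syscall.MS_BIND|syscall.MS_REC, ""); err != nil {
		return fmt.Errorf("bind mount %s: %w", newroot, err)
	}
	// The old root is parked in a directory below the new root.
	putOld := filepath.Join(newroot, ".old_root")
	if err := os.MkdirAll(putOld, 0700); err != nil {
		return fmt.Errorf("create put_old: %w", err)
	}
	if err := syscall.PivotRoot(newroot, putOld); err != nil {
		return fmt.Errorf("pivot_root: %w", err)
	}
	if err := os.Chdir("/"); err != nil {
		return fmt.Errorf("chdir /: %w", err)
	}
	// Lazily unmount the old root so the host filesystem is unreachable.
	if err := syscall.Unmount("/.old_root", syscall.MNT_DETACH); err != nil {
		return fmt.Errorf("unmount old root: %w", err)
	}
	return os.Remove("/.old_root")
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: pivot <newroot> (run inside a new mount namespace)")
		return
	}
	if err := pivotRoot(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Println("root switched")
}
```

Unlike chroot, nothing of the old root remains mounted after the final unmount, so there is no directory left to escape to.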

Cgroup v2: The Modern Interface for Resource Limits

Cgroup v2 (Linux 4.5+, now the default on modern distributions) exposes its interface through the /sys/fs/cgroup filesystem. Each cgroup is a directory; write to specific files to set limits.

Key resource control files:

/sys/fs/cgroup/
└── mycontainer/          # cgroup directory created for the container
    ├── cgroup.procs      # write a PID to add the process to this cgroup
    ├── memory.max        # memory hard limit in bytes, or "max" for unlimited
    ├── memory.current    # current memory usage (read-only)
    ├── cpu.max           # CPU limit: "quota period" e.g. "50000 100000" = 50%
    ├── cpu.stat          # CPU usage statistics (read-only)
    └── pids.max          # maximum number of processes

Go code to operate cgroups:

const cgroupBase = "/sys/fs/cgroup"

func createCgroup(name string) (string, error) {
    cgPath := filepath.Join(cgroupBase, name)
    if err := os.MkdirAll(cgPath, 0755); err != nil {
        return "", fmt.Errorf("create cgroup dir: %w", err)
    }
    return cgPath, nil
}

// setMemoryLimit sets the container's memory ceiling in bytes.
// Processes that exceed this limit trigger the OOM killer (container processes first).
func setMemoryLimit(cgPath string, limitBytes int64) error {
    return os.WriteFile(
        filepath.Join(cgPath, "memory.max"),
        []byte(strconv.FormatInt(limitBytes, 10)), 0644)
}

// setCPULimit sets the CPU limit.
// quota/period = CPU fraction. e.g. quota=50000, period=100000 means 50% CPU.
func setCPULimit(cgPath string, quota, period int) error {
    return os.WriteFile(filepath.Join(cgPath, "cpu.max"),
        []byte(fmt.Sprintf("%d %d", quota, period)), 0644)
}

// addProcessToCgroup adds a process to the cgroup; from this moment it is resource-limited.
func addProcessToCgroup(cgPath string, pid int) error {
    return os.WriteFile(filepath.Join(cgPath, "cgroup.procs"),
        []byte(strconv.Itoa(pid)), 0644)
}
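
Users usually think in "CPU cores" and "megabytes" rather than quota/period pairs. The helpers below are an illustrative sketch of how such values could be rendered into the strings that cpu.max and memory.max expect; the names cpuMax and memoryMax are invented for this example.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// cpuMax renders a cpu.max value for a limit expressed in CPU cores.
// cores <= 0 means "no limit", which cgroup v2 spells "max".
func cpuMax(cores float64, periodUsec int) string {
	if cores <= 0 {
		return fmt.Sprintf("max %d", periodUsec)
	}
	quota := int(math.Round(cores * float64(periodUsec)))
	return fmt.Sprintf("%d %d", quota, periodUsec)
}

// memoryMax renders a memory.max value; 0 or less means unlimited.
func memoryMax(limitBytes int64) string {
	if limitBytes <= 0 {
		return "max"
	}
	return strconv.FormatInt(limitBytes, 10)
}

func main() {
	fmt.Println(cpuMax(0.5, 100000))  // half of one CPU
	fmt.Println(cpuMax(1.5, 100000))  // one and a half CPUs
	fmt.Println(memoryMax(128 << 20)) // 128 MiB
}
```

A quota larger than the period is valid: it lets the cgroup use more than one CPU's worth of time per period on a multi-core machine.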

OverlayFS: The Secret Behind Container Filesystem Layers

Docker images use a layered filesystem; OverlayFS is the most common implementation. OverlayFS merges multiple "layers" into a unified view:

Container's filesystem view (OverlayFS merged result):
  /etc/nginx/nginx.conf   ← from upperdir (container's writable layer, if modified)
  /usr/sbin/nginx         ← from lowerdir (image layer, read-only)
  /bin/bash               ← from lowerdir (base image layer, read-only)

Mount command:
mount -t overlay overlay \
  -o lowerdir=/image/layer2:/image/layer1:/image/base,\
     upperdir=/container/rw,\
     workdir=/container/work \
  /container/merged

lowerdir: read-only layers (image layers). Multiple layers are separated by :, and the leftmost has the highest priority.
upperdir: writable layer (all container modifications go here). When a container deletes a file, OverlayFS creates a whiteout file (a special marker) in upperdir.
workdir: OverlayFS's internal working directory; must be on the same filesystem as upperdir.
merged: the combined view, mounted for the container to use.

Copy-on-Write (CoW): the first time a container modifies a file from lowerdir, OverlayFS automatically copies the file into upperdir, then modifies the upperdir copy. The original file in lowerdir is untouched.
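
The same mount can be issued from Go with syscall.Mount. The sketch below builds the option string from a list of layers and passes it as the mount data; overlayOptions and mountOverlay are names made up for this example, and the paths are the ones from the mount command above.

```go
package main

import (
	"fmt"
	"strings"
	"syscall"
)

// overlayOptions builds the data string for an overlay mount.
// lowers must be ordered from the topmost layer to the base layer.
func overlayOptions(lowers []string, upper, work string) string {
	return fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s",
		strings.Join(lowers, ":"), upper, work)
}

// mountOverlay mounts the merged view at merged. It needs root (or a user
// namespace on recent kernels) and all directories must already exist.
func mountOverlay(lowers []string, upper, work, merged string) error {
	opts := overlayOptions(lowers, upper, work)
	if err := syscall.Mount("overlay", merged, "overlay", 0, opts); err != nil {
		return fmt.Errorf("mount overlay on %s: %w", merged, err)
	}
	return nil
}

func main() {
	lowers := []string{"/image/layer2", "/image/layer1", "/image/base"}
	// Only print the options here; calling mountOverlay requires privileges.
	fmt.Println(overlayOptions(lowers, "/container/rw", "/container/work"))
}
```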

Level 3 · Building a Minimal Container Runtime from Scratch

Project Structure

minicontainer/
├── main.go          # Entry: parse command, dispatch to parent or child mode
├── container.go     # Container lifecycle management
├── cgroup.go        # Cgroup operations
├── network.go       # Network configuration (veth pair)
├── rootfs.go        # Filesystem preparation (OverlayFS/chroot)
└── Makefile

Step 1: Entry Point and the Self-Invoking Process Pattern

Container runtimes have a special challenge: they need to run "initialization code" inside the new namespaces before exec'ing the user-specified command. In C this code runs in the child between clone() and exec(). Go cannot do this safely: the Go runtime is multithreaded (its scheduler and GC threads are already running before main()), a forked child keeps only the calling thread, and os/exec offers no hook for running Go code between fork and exec.

The solution is self-invoking (having the runtime spawn a copy of itself as the "in-container init process"), passing configuration via environment variables. The child process completes initialization inside the namespace, then execs the user command.

// main.go
package main

import (
	"fmt"
	"os"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: minicontainer run <command> [args...]")
		os.Exit(1)
	}
	switch os.Args[1] {
	case "run":
		// Parent mode: create namespaces, start child process
		if err := runParent(os.Args[2:]); err != nil {
			fmt.Fprintf(os.Stderr, "error: %v\n", err)
			os.Exit(1)
		}
	case "_child":
		// Child mode: initialize inside the new namespace, then exec user command.
		// Triggered by runParent via Cmd.Start(); users should never call this directly.
		if err := runChild(); err != nil {
			fmt.Fprintf(os.Stderr, "child error: %v\n", err)
			os.Exit(1)
		}
	default:
		fmt.Fprintf(os.Stderr, "usage: minicontainer run <command> [args...]\n")
		os.Exit(1)
	}
}

Step 2: Parent Process — Create Namespaces and Start the Child

// container.go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
	"syscall"
	"time"
)

type ContainerConfig struct {
	Command   []string
	RootFS    string
	MemLimit  int64  // bytes, 0 = unlimited
	CPUQuota  int    // microseconds
	CPUPeriod int    // microseconds, default 100000 (100ms)
	Hostname  string
}

// runParent runs on the host side. It:
// 1. Parses configuration
// 2. Creates new namespaces (via Cloneflags)
// 3. Starts the "child process" (the container's init process)
// 4. Configures cgroups (after child starts, before child finishes init)
// 5. Waits for the child to exit
func runParent(args []string) error {
	cfg := &ContainerConfig{
		Command:   args,
		RootFS:    "./rootfs",
		MemLimit:  128 * 1024 * 1024, // 128MB default
		CPUQuota:  50000,              // 50% CPU
		CPUPeriod: 100000,             // 100ms period
		Hostname:  "container",
	}

	env := append(os.Environ(),
		// Simplified: joining with "," breaks arguments that contain a comma.
		"MC_COMMAND="+strings.Join(cfg.Command, ","),
		"MC_ROOTFS="+cfg.RootFS,
		"MC_HOSTNAME="+cfg.Hostname,
		// runChild polls /tmp/mc-ready-$PARENT_PID, so it must know our PID.
		fmt.Sprintf("PARENT_PID=%d", os.Getpid()),
	)

	self, err := os.Executable()
	if err != nil { return err }

	cmd := exec.Command(self, "_child")
	cmd.Stdin  = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.Env = env

	// Critical: set Cloneflags to create new namespaces.
	// These flags correspond to the clone(2) syscall's arguments.
	// The child process lives in the new namespaces from the moment it is created.
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | // independent hostname
			syscall.CLONE_NEWPID |          // independent PID space (child is PID 1 inside)
			syscall.CLONE_NEWNS  |          // independent mount namespace
			syscall.CLONE_NEWNET |          // independent network namespace
			syscall.CLONE_NEWIPC,           // independent IPC namespace
		// Note: CLONE_NEWUSER requires additional UID/GID mapping (rootless containers)
	}

	if err := cmd.Start(); err != nil {
		return fmt.Errorf("start container process: %w", err)
	}

	// Configure cgroup on the host side after the child starts.
	// Must complete before the child finishes initialization.
	cgroupName := fmt.Sprintf("minicontainer-%d", cmd.Process.Pid)
	if err := setupCgroup(cgroupName, cmd.Process.Pid, cfg); err != nil {
		cmd.Process.Kill()
		cmd.Wait() // reap the killed child
		return fmt.Errorf("setup cgroup: %w", err)
	}
	// Remove the cgroup directory once the container has exited and it is empty.
	defer cleanupCgroup(cgroupBase + "/" + cgroupName)

	// Signal the child that cgroup setup is complete.
	// Simplified: use a temp file named after this (parent) process, matching
	// the PARENT_PID passed to the child. Production: use a pipe or sync fd.
	readyFile := fmt.Sprintf("/tmp/mc-ready-%d", os.Getpid())
	if err := os.WriteFile(readyFile, []byte("ready"), 0644); err != nil {
		cmd.Process.Kill()
		cmd.Wait()
		return fmt.Errorf("signal child: %w", err)
	}
	defer os.Remove(readyFile)

	return cmd.Wait()
}

func setupCgroup(name string, pid int, cfg *ContainerConfig) error {
	cgPath, err := createCgroup(name)
	if err != nil { return err }
	if cfg.MemLimit > 0 {
		if err := setMemoryLimit(cgPath, cfg.MemLimit); err != nil {
			return fmt.Errorf("set memory limit: %w", err)
		}
	}
	if cfg.CPUQuota > 0 {
		if err := setCPULimit(cgPath, cfg.CPUQuota, cfg.CPUPeriod); err != nil {
			return fmt.Errorf("set cpu limit: %w", err)
		}
	}
	return addProcessToCgroup(cgPath, pid)
}

Step 3: Child Process — Initialize Inside the Namespace

// runChild runs inside the new namespaces (created by the parent via Cloneflags).
// Responsibilities:
// 1. Wait for the parent to finish cgroup configuration
// 2. Set the hostname (UTS namespace)
// 3. Prepare the filesystem (chroot)
// 4. Mount virtual filesystems (/proc, /sys, /dev)
// 5. exec the user's command (replacing the current process)
func runChild() error {
	// Wait for the parent to complete cgroup setup (poll temp file)
	parentPID := os.Getenv("PARENT_PID")
	readyFile := fmt.Sprintf("/tmp/mc-ready-%s", parentPID)
	for i := 0; i < 100; i++ {
		if _, err := os.Stat(readyFile); err == nil { break }
		time.Sleep(10 * time.Millisecond)
	}

	// Set hostname in the isolated UTS namespace (does not affect the host)
	hostname := os.Getenv("MC_HOSTNAME")
	if hostname == "" { hostname = "container" }
	if err := syscall.Sethostname([]byte(hostname)); err != nil {
		return fmt.Errorf("set hostname: %w", err)
	}

	// Prepare the root filesystem
	rootfs := os.Getenv("MC_ROOTFS")
	if rootfs == "" { rootfs = "./rootfs" }
	if err := setupRootFS(rootfs); err != nil {
		return fmt.Errorf("setup rootfs: %w", err)
	}

	// Mount required virtual filesystems
	if err := mountVirtualFS(); err != nil {
		return fmt.Errorf("mount virtual fs: %w", err)
	}

	// Retrieve the command to run
	commandStr := os.Getenv("MC_COMMAND")
	if commandStr == "" { return fmt.Errorf("MC_COMMAND not set") }
	args := strings.Split(commandStr, ",")

	path, err := exec.LookPath(args[0])
	if err != nil { return fmt.Errorf("command not found: %s", args[0]) }

	// exec replaces the current process with the user's command.
	// This process (PID 1 inside the container) becomes the user's program.
	// syscall.Exec does not return on success.
	return syscall.Exec(path, args, os.Environ())
}

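func setupRootFS(newroot string) error {
	abs, err := filepath.Abs(newroot)
	if err != nil { return err }
	// Make every mount private first. On systemd hosts "/" is a shared mount,
	// so without this step the mounts made below would propagate to the host.
	if err := syscall.Mount("", "/", "", syscall.MS_REC|syscall.MS_PRIVATE, ""); err != nil {
		return fmt.Errorf("make / private: %w", err)
	}
	if err := syscall.Mount(abs, abs, "", syscall.MS_BIND|syscall.MS_REC, ""); err != nil {
		return fmt.Errorf("bind mount rootfs: %w", err)
	}
	if err := syscall.Chroot(abs); err != nil {
		return fmt.Errorf("chroot: %w", err)
	}
	return os.Chdir("/")
}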

// mountVirtualFS mounts /proc, /sys, and /dev inside the container.
// These filesystems do not correspond to real disk data; they expose kernel internal
// state to userspace. /proc shows processes in the current PID namespace.
func mountVirtualFS() error {
	type mountSpec struct {
		target, fstype string
		flags          uintptr
		data           string
	}
	mounts := []mountSpec{
		// /proc: process information. Tools like ps and top depend on this.
		// After mounting, /proc shows only processes in the current PID namespace.
		{"/proc", "proc", syscall.MS_NOEXEC | syscall.MS_NOSUID | syscall.MS_NODEV, ""},
		// /sys: kernel and device information; cgroup v2 uses /sys/fs/cgroup.
		{"/sys", "sysfs", syscall.MS_NOEXEC | syscall.MS_NOSUID | syscall.MS_NODEV, ""},
		// /dev: device files. Use tmpfs (not real devices) to prevent container
		//       from accessing host devices.
		{"/dev", "tmpfs", syscall.MS_NOSUID | syscall.MS_STRICTATIME, "mode=755"},
		// /dev/pts: pseudo-terminals (required for interactive shells, SSH)
		{"/dev/pts", "devpts", syscall.MS_NOSUID | syscall.MS_NOEXEC, "newinstance,ptmxmode=0666"},
	}
	for _, m := range mounts {
		os.MkdirAll(m.target, 0755)
		if err := syscall.Mount("none", m.target, m.fstype, m.flags, m.data); err != nil {
			return fmt.Errorf("mount %s: %w", m.target, err)
		}
	}
	// Standard symlinks for stdin/stdout/stderr
	os.Symlink("/proc/self/fd/0", "/dev/stdin")
	os.Symlink("/proc/self/fd/1", "/dev/stdout")
	os.Symlink("/proc/self/fd/2", "/dev/stderr")
	return nil
}
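
Passing the command through MC_COMMAND as a comma-joined string breaks as soon as an argument contains a comma (sh -c "echo a,b", for instance). One possible alternative, sketched here, is to encode the argument list as JSON before placing it in the environment. The helper names encodeCommand and decodeCommand are invented for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// encodeCommand turns an argument vector into a string that is safe to
// store in an environment variable.
func encodeCommand(args []string) (string, error) {
	b, err := json.Marshal(args)
	if err != nil {
		return "", fmt.Errorf("encode command: %w", err)
	}
	return string(b), nil
}

// decodeCommand reverses encodeCommand and rejects an empty command.
func decodeCommand(s string) ([]string, error) {
	var args []string
	if err := json.Unmarshal([]byte(s), &args); err != nil {
		return nil, fmt.Errorf("decode command: %w", err)
	}
	if len(args) == 0 {
		return nil, fmt.Errorf("decode command: empty argument list")
	}
	return args, nil
}

func main() {
	enc, err := encodeCommand([]string{"/bin/sh", "-c", "echo a,b"})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("MC_COMMAND=" + enc)
	args, err := decodeCommand(enc)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("%d arguments, last one: %q\n", len(args), args[len(args)-1])
}
```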

Step 4: Cgroup Operations

// cgroup.go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

const cgroupBase = "/sys/fs/cgroup"

func createCgroup(name string) (string, error) {
	cgPath := filepath.Join(cgroupBase, name)
	if err := os.MkdirAll(cgPath, 0755); err != nil {
		return "", fmt.Errorf("mkdir %s: %w", cgPath, err)
	}
	return cgPath, nil
}

func setMemoryLimit(cgPath string, limitBytes int64) error {
	// memory.max: hard memory ceiling.
	// When exceeded, the kernel OOM killer terminates processes in this cgroup first.
	return os.WriteFile(filepath.Join(cgPath, "memory.max"),
		[]byte(strconv.FormatInt(limitBytes, 10)), 0644)
}

func setCPULimit(cgPath string, quota, period int) error {
	// cpu.max format: "quota period" (microseconds)
	// "50000 100000" means use at most 50ms per 100ms window = 50% CPU
	// "max 100000" means no CPU limit
	return os.WriteFile(filepath.Join(cgPath, "cpu.max"),
		[]byte(fmt.Sprintf("%d %d", quota, period)), 0644)
}

func setPIDLimit(cgPath string, maxPIDs int) error {
	// pids.max: prevents fork bombs by capping the total process count
	return os.WriteFile(filepath.Join(cgPath, "pids.max"),
		[]byte(strconv.Itoa(maxPIDs)), 0644)
}

func addProcessToCgroup(cgPath string, pid int) error {
	// Writing to cgroup.procs adds the process (and its threads) to this cgroup.
	// All future child processes inherit the cgroup membership automatically.
	return os.WriteFile(filepath.Join(cgPath, "cgroup.procs"),
		[]byte(strconv.Itoa(pid)), 0644)
}

// getCgroupStats reads resource usage statistics from a cgroup.
func getCgroupStats(cgPath string) (memUsageBytes int64, cpuUsageNs int64, err error) {
	memData, err := os.ReadFile(filepath.Join(cgPath, "memory.current"))
	if err != nil { return 0, 0, err }
	memUsageBytes, _ = strconv.ParseInt(strings.TrimSpace(string(memData)), 10, 64)

	cpuData, err := os.ReadFile(filepath.Join(cgPath, "cpu.stat"))
	if err != nil { return memUsageBytes, 0, err }
	for _, line := range strings.Split(string(cpuData), "\n") {
		if strings.HasPrefix(line, "usage_usec ") {
			parts := strings.Fields(line)
			if len(parts) == 2 {
				usec, _ := strconv.ParseInt(parts[1], 10, 64)
				cpuUsageNs = usec * 1000
			}
		}
	}
	return memUsageBytes, cpuUsageNs, nil
}

// cleanupCgroup removes the cgroup directory.
// The cgroup must have no processes before it can be removed.
func cleanupCgroup(cgPath string) error {
	return os.Remove(cgPath) // rmdir (only removes empty directories)
}
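
When a container dies unexpectedly, the cgroup itself records whether the memory limit was the cause: the memory.events file contains counters such as oom_kill. Like cpu.stat, it uses the "key value" per line layout, so one small parser covers both. This is a sketch; parseFlatKeyed is a hypothetical helper name and the sample content is made up.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFlatKeyed parses cgroup v2 "flat keyed" files (one "key value" pair
// per line), such as memory.events or cpu.stat.
func parseFlatKeyed(content string) (map[string]int64, error) {
	out := make(map[string]int64)
	for _, line := range strings.Split(content, "\n") {
		if strings.TrimSpace(line) == "" {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) != 2 {
			return nil, fmt.Errorf("malformed line %q", line)
		}
		v, err := strconv.ParseInt(fields[1], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("bad value in %q: %w", line, err)
		}
		out[fields[0]] = v
	}
	return out, nil
}

func main() {
	// Sample memory.events content; a real runtime would read
	// <cgroup path>/memory.events instead.
	sample := "low 0\nhigh 0\nmax 3\noom 1\noom_kill 1\n"
	events, err := parseFlatKeyed(sample)
	if err != nil {
		fmt.Println(err)
		return
	}
	if events["oom_kill"] > 0 {
		fmt.Printf("the OOM killer fired %d time(s) in this cgroup\n", events["oom_kill"])
	}
}
```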

Step 5: Pulling and Unpacking OCI Image Layers

// rootfs.go
package main

import (
	"archive/tar"
	"compress/gzip"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"
	"strings"
)

type OCIManifest struct {
	Layers []struct {
		Digest string `json:"digest"` // format: sha256:<hex>
		Size   int    `json:"size"`
	} `json:"layers"`
}

// PullAndUnpack fetches image layers from Docker Hub and extracts them to destDir.
// Simplified: public images only, anonymous pull token, no digest verification.
// image must include its namespace, e.g. "library/alpine" for official images.
func PullAndUnpack(image, tag, destDir string) error {
	tokenURL := fmt.Sprintf(
		"https://auth.docker.io/token?service=registry.docker.io&scope=repository:%s:pull",
		image)
	token, err := fetchToken(tokenURL)
	if err != nil { return fmt.Errorf("fetch token: %w", err) }

	manifestURL := fmt.Sprintf("https://registry-1.docker.io/v2/%s/manifests/%s", image, tag)
	manifest, err := fetchManifest(manifestURL, token)
	if err != nil { return fmt.Errorf("fetch manifest: %w", err) }
	if len(manifest.Layers) == 0 {
		// A multi-arch index has no "layers" field; resolving it is not handled here.
		return fmt.Errorf("manifest for %s:%s lists no layers", image, tag)
	}

	// Extract layers in order (base first, most recent last)
	for i, layer := range manifest.Layers {
		fmt.Printf("Pulling layer %d/%d: %s\n", i+1, len(manifest.Layers), layer.Digest)
		layerURL := fmt.Sprintf(
			"https://registry-1.docker.io/v2/%s/blobs/%s", image, layer.Digest)
		if err := downloadAndExtract(layerURL, token, destDir); err != nil {
			return fmt.Errorf("extract layer %s: %w", layer.Digest, err)
		}
	}
	return nil
}

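// downloadAndExtract downloads a tar.gz layer and extracts it to destDir,
// applying whiteout entries (.wh.<name>, the image-layer deletion markers).
// Not handled in this simplified version: opaque whiteouts (.wh..wh..opq),
// hard links, and file ownership.
func downloadAndExtract(url, token, destDir string) error {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil { return err }
	req.Header.Set("Authorization", "Bearer "+token)
	resp, err := http.DefaultClient.Do(req)
	if err != nil { return err }
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("GET %s: %s", url, resp.Status)
	}

	gz, err := gzip.NewReader(resp.Body)
	if err != nil { return err }
	defer gz.Close()

	root := filepath.Clean(destDir)
	tr := tar.NewReader(gz)
	for {
		hdr, err := tr.Next()
		if err == io.EOF { break }
		if err != nil { return err }

		target := filepath.Join(root, hdr.Name)
		// Reject entries such as "../../etc/passwd" that would escape destDir.
		if target != root && !strings.HasPrefix(target, root+string(os.PathSeparator)) {
			return fmt.Errorf("unsafe path in layer: %s", hdr.Name)
		}
		base := filepath.Base(hdr.Name)

		// Whiteout files (.wh.<filename>) mark deletions in this layer
		if strings.HasPrefix(base, ".wh.") {
			deleted := filepath.Join(filepath.Dir(target), strings.TrimPrefix(base, ".wh."))
			os.RemoveAll(deleted)
			continue
		}

		switch hdr.Typeflag {
		case tar.TypeDir:
			if err := os.MkdirAll(target, hdr.FileInfo().Mode()); err != nil { return err }
		case tar.TypeReg:
			f, err := os.OpenFile(target, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, hdr.FileInfo().Mode())
			if err != nil { return err }
			_, copyErr := io.Copy(f, tr)
			f.Close()
			if copyErr != nil { return copyErr }
		case tar.TypeSymlink:
			os.Remove(target) // a later layer may replace an existing entry
			if err := os.Symlink(hdr.Linkname, target); err != nil { return err }
		}
	}
	return nil
}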

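func fetchToken(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil { return "", err }
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("token request: %s", resp.Status)
	}
	var result struct{ Token string `json:"token"` }
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return "", fmt.Errorf("decode token: %w", err)
	}
	return result.Token, nil
}

func fetchManifest(url, token string) (*OCIManifest, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil { return nil, err }
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Accept", "application/vnd.docker.distribution.manifest.v2+json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil { return nil, err }
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("manifest request: %s", resp.Status)
	}
	var m OCIManifest
	if err := json.NewDecoder(resp.Body).Decode(&m); err != nil {
		return nil, fmt.Errorf("decode manifest: %w", err)
	}
	return &m, nil
}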

Running Example

# Prepare a minimal rootfs using Alpine Linux
mkdir -p rootfs
docker export $(docker create alpine) | tar -C rootfs -xf -

# Build and run
go build -o minicontainer .

# Start a shell inside the container
sudo ./minicontainer run /bin/sh

# Verify isolation inside the container:
# $ hostname
# container
# $ ps aux
#   PID USER  COMMAND
#     1 root  /bin/sh     # PID is 1 inside the container
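# /proc/meminfo still reports the host's total memory: the mainline kernel does
# not virtualize it per cgroup (tools such as LXCFS exist for that). To check the
# limit, read the cgroup file on the host (substitute the container's host PID):
# $ cat /sys/fs/cgroup/minicontainer-<pid>/memory.max
# 134217728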

Level 4 · Advanced: Networking, Seccomp, Rootless Containers, and the OCI Spec

Container Networking: veth Pairs and Bridges

Network isolation is provided by the Network namespace, but after isolation the container cannot reach the outside world. A veth pair (virtual Ethernet pair) connects the container namespace to the host:

Host Network namespace:
  docker0 (bridge, 172.17.0.1/16)
    └── veth0 (host side)

Container Network namespace:
  eth0 (container side, 172.17.0.2/16) ← paired with host's veth0

A veth pair is like both ends of a patch cable: packets entering one end emerge from the other. The host-side veth0 is attached to the docker0 bridge; the container-side eth0 serves as the container's NIC. NAT rules (iptables -t nat -A POSTROUTING) allow the container to reach external networks.

// network.go (skeleton)
// Assumes the docker0 bridge already exists and that the host has no interface
// named eth0 (the peer briefly lives in the host namespace before it is moved).
func setupContainerNetwork(containerPID int, ip, gateway string) error {
	pid := strconv.Itoa(containerPID)
	steps := [][]string{
		// 1. Create veth pair (requires root or CAP_NET_ADMIN)
		{"ip", "link", "add", "veth0", "type", "veth", "peer", "name", "eth0"},
		// 2. Move eth0 into the container's network namespace
		{"ip", "link", "set", "eth0", "netns", pid},
		// 3. Attach veth0 to the docker0 bridge
		{"ip", "link", "set", "veth0", "master", "docker0"},
		{"ip", "link", "set", "veth0", "up"},
		// 4. Configure and bring up the interfaces inside the container namespace (via nsenter)
		{"nsenter", "-t", pid, "-n", "ip", "addr", "add", ip + "/16", "dev", "eth0"},
		{"nsenter", "-t", pid, "-n", "ip", "link", "set", "lo", "up"},
		{"nsenter", "-t", pid, "-n", "ip", "link", "set", "eth0", "up"},
		{"nsenter", "-t", pid, "-n", "ip", "route", "add", "default", "via", gateway},
	}
	for _, s := range steps {
		if out, err := exec.Command(s[0], s[1:]...).CombinedOutput(); err != nil {
			return fmt.Errorf("%s: %w (%s)", strings.Join(s, " "), err, out)
		}
	}
	return nil
}

Seccomp: System Call Filtering

Even with namespaces and cgroups, a process inside a container can still call any Linux syscall. Some are dangerous: ptrace (can attach to arbitrary processes), mount (can modify the mount table), reboot (shuts down the system).

Seccomp (Secure Computing Mode) lets you define a syscall whitelist (or blacklist); the kernel rejects calls not on the list:

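// Using libseccomp-golang for seccomp filters (production-grade)
import seccomp "github.com/seccomp/libseccomp-golang"

func applySeccompFilter() error {
	// Default action: any syscall without an explicit rule fails with EPERM.
	filter, err := seccomp.NewFilter(seccomp.ActErrno.SetReturnCode(int16(syscall.EPERM)))
	if err != nil { return err }

	// Whitelist: only allow necessary syscalls.
	// Illustrative only: a real program also needs execve, openat, brk, and more.
	allowed := []string{"read", "write", "open", "close", "mmap", "exit_group"}
	for _, name := range allowed {
		// Resolves the syscall number for the native architecture.
		call, err := seccomp.GetSyscallFromName(name)
		if err != nil { return fmt.Errorf("resolve syscall %s: %w", name, err) }
		if err := filter.AddRule(call, seccomp.ActAllow); err != nil {
			return fmt.Errorf("allow %s: %w", name, err)
		}
	}
	return filter.Load()
}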

Docker's default seccomp profile allows roughly 300 syscalls and blocks about 44 dangerous ones, including reboot and kexec_load (ptrace is blocked as well on kernels older than 4.8).
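
Whether a filter is actually in force can be checked from outside the process: /proc/<pid>/status has a Seccomp field with the value 0 (disabled), 1 (strict mode), or 2 (filter mode). The sketch below decodes that field; the function name seccompMode is an assumption of this example.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// seccompMode extracts the Seccomp field from the contents of
// /proc/<pid>/status and returns a readable name for it.
func seccompMode(status string) (string, error) {
	for _, line := range strings.Split(status, "\n") {
		if !strings.HasPrefix(line, "Seccomp:") {
			continue
		}
		switch strings.TrimSpace(strings.TrimPrefix(line, "Seccomp:")) {
		case "0":
			return "disabled", nil
		case "1":
			return "strict", nil
		case "2":
			return "filter", nil
		default:
			return "", fmt.Errorf("unknown seccomp mode in %q", line)
		}
	}
	return "", fmt.Errorf("no Seccomp line found")
}

func main() {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read status:", err)
		return
	}
	mode, err := seccompMode(string(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("seccomp mode of this process:", mode)
}
```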

Linux Capabilities: Fine-Grained Privilege Control

The traditional Unix permission model is binary: root (omnipotent) or non-root (restricted). Linux Capabilities decompose root's power into roughly 40 independent capabilities:

Capability	Meaning
`CAP_NET_BIND_SERVICE`	Bind to ports below 1024
`CAP_NET_ADMIN`	Network management (create veth, modify routes)
`CAP_SYS_ADMIN`	Broad administrative operations (mount, sethostname, etc.)
`CAP_MKNOD`	Create special device files
`CAP_CHOWN`	Change file ownership

When containerizing, drop all unnecessary capabilities and keep only what is required:

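cmd.SysProcAttr = &syscall.SysProcAttr{
    Cloneflags: ...,
    // AmbientCaps lists capabilities that stay raised across exec, which
    // matters once the process no longer runs as root.
    // Keep: NET_BIND_SERVICE (allow binding to low ports)
    // AmbientCaps alone drops nothing: SYS_ADMIN, NET_ADMIN, MKNOD, and other
    // dangerous capabilities must be removed from the bounding set separately.
    AmbientCaps: []uintptr{10}, // 10 = CAP_NET_BIND_SERVICE in <linux/capability.h>
}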
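
To confirm which capabilities a containerized process really holds, read the CapEff line of /proc/<pid>/status: it is a hexadecimal bitmask in which bit N stands for capability number N. The sketch below decodes such a mask; the helper names are made up, and the sample value 00000000a80425fb is one commonly reported for containers started with Docker's default settings, used here only as an illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Capability numbers from <linux/capability.h>.
const (
	capNetBindService = 10
	capNetAdmin       = 12
	capSysAdmin       = 21
)

// capEff extracts the effective capability mask from the contents of
// /proc/<pid>/status.
func capEff(status string) (uint64, error) {
	for _, line := range strings.Split(status, "\n") {
		if !strings.HasPrefix(line, "CapEff:") {
			continue
		}
		hexMask := strings.TrimSpace(strings.TrimPrefix(line, "CapEff:"))
		return strconv.ParseUint(hexMask, 16, 64)
	}
	return 0, fmt.Errorf("no CapEff line found")
}

// hasCap reports whether capability number c is set in mask.
func hasCap(mask uint64, c uint) bool {
	return mask&(uint64(1)<<c) != 0
}

func main() {
	mask, err := capEff("Name:\tnginx\nCapEff:\t00000000a80425fb\n")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("NET_BIND_SERVICE:", hasCap(mask, capNetBindService))
	fmt.Println("NET_ADMIN:       ", hasCap(mask, capNetAdmin))
	fmt.Println("SYS_ADMIN:       ", hasCap(mask, capSysAdmin))
}
```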

Rootless Containers

Normal containers require root privileges (to create namespaces, configure veth pairs, operate cgroups). Rootless containers use the User namespace to let non-root users run containers:

cmd.SysProcAttr = &syscall.SysProcAttr{
    Cloneflags: syscall.CLONE_NEWUSER | ...,
    // UID/GID mapping: uid=0 (root) inside the container maps to uid=1000
    // (an ordinary user) on the host
    UidMappings: []syscall.SysProcIDMap{
        {ContainerID: 0, HostID: os.Getuid(), Size: 1},
    },
    GidMappings: []syscall.SysProcIDMap{
        {ContainerID: 0, HostID: os.Getgid(), Size: 1},
    },
}

The User namespace is the most powerful and most complex namespace: it allows an unprivileged user to have root privileges within the namespace, but on the host that "root" holds only the privileges of the ordinary user it maps to. Podman's rootless mode is built on exactly this mechanism.
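
Under the hood, each mapping becomes one "container-id host-id size" line in /proc/<pid>/uid_map (and gid_map). The sketch below renders such lines and translates a container UID to the host UID it corresponds to. The type idMap and both function names are invented for this example, and the second range (100000, 65536 IDs) is an assumed subordinate-ID range of the kind usually configured in /etc/subuid and installed with the newuidmap helper.

```go
package main

import (
	"fmt"
	"strings"
)

// idMap mirrors one line of /proc/<pid>/uid_map.
type idMap struct {
	ContainerID, HostID, Size int
}

// mapFileContent renders the mappings in the format of uid_map/gid_map.
func mapFileContent(maps []idMap) string {
	var b strings.Builder
	for _, m := range maps {
		fmt.Fprintf(&b, "%d %d %d\n", m.ContainerID, m.HostID, m.Size)
	}
	return b.String()
}

// toHostID translates an ID seen inside the container to the host ID.
// The second result is false when the ID is not covered by any mapping.
func toHostID(maps []idMap, id int) (int, bool) {
	for _, m := range maps {
		if id >= m.ContainerID && id < m.ContainerID+m.Size {
			return m.HostID + (id - m.ContainerID), true
		}
	}
	return 0, false
}

func main() {
	maps := []idMap{
		{ContainerID: 0, HostID: 1000, Size: 1},       // container root = host uid 1000
		{ContainerID: 1, HostID: 100000, Size: 65536}, // assumed subordinate range
	}
	fmt.Print(mapFileContent(maps))
	for _, id := range []int{0, 33, 70000} {
		if host, ok := toHostID(maps, id); ok {
			fmt.Printf("container uid %d -> host uid %d\n", id, host)
		} else {
			fmt.Printf("container uid %d is unmapped\n", id)
		}
	}
}
```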

Comparison with runc

Our minimal container runtime is roughly 500 lines; runc is approximately 100,000 lines. Where is the gap?

Feature	Our Implementation	runc
PID/Mount/UTS namespace	Supported	Supported
Network namespace	Skeleton	Creates or joins the namespace; veth/bridge setup is left to the caller
User namespace (rootless)	Skeleton	Full
Cgroup v1/v2	v2 basics	Full v1 + v2
OCI Runtime Spec compliance	No	Full
Seccomp	Skeleton	Full (BPF programs)
Capabilities	Not implemented	Full
Network plugins (CNI)	None	None (CNI plugins are invoked by higher layers such as containerd or CRI-O)
pivot_root vs chroot	chroot	pivot_root
State persistence	None	Full state machine

runc is the reference implementation of the OCI Runtime Specification, using config.json to define container configuration. Kubernetes calls containerd via CRI (Container Runtime Interface); containerd in turn calls runc.

The OCI Runtime Spec defines a standard JSON configuration file (config.json) that specifies namespaces, cgroups, mounts, capabilities, seccomp filters, and hooks. Any OCI-compliant runtime (runc, crun, kata-containers) must accept this format:

{
  "ociVersion": "1.0.2",
  "process": {
    "args": ["/bin/sh"],
    "capabilities": {
      "bounding": ["CAP_NET_BIND_SERVICE"]
    }
  },
  "linux": {
    "namespaces": [
      {"type": "pid"},
      {"type": "network"},
      {"type": "mount"},
      {"type": "uts"}
    ],
    "resources": {
      "memory": {"limit": 134217728},
      "cpu": {"quota": 50000, "period": 100000}
    },
    "seccomp": { ... }
  }
}

Understanding this spec is the key to understanding how Kubernetes pod specs translate all the way down to kernel-level configurations.
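
To connect the spec back to the runtime built in this chapter, the sketch below decodes the subset of config.json shown above and prints the values our runtime would need: the command, the namespaces to create, and the contents of memory.max and cpu.max. The struct ociSpec covers only these fields and is an assumption of this example, not the full schema from the specification; the seccomp section is left out of the sample.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ociSpec models only the config.json fields used in this chapter.
type ociSpec struct {
	Process struct {
		Args []string `json:"args"`
	} `json:"process"`
	Linux struct {
		Namespaces []struct {
			Type string `json:"type"`
		} `json:"namespaces"`
		Resources struct {
			Memory struct {
				Limit int64 `json:"limit"`
			} `json:"memory"`
			CPU struct {
				Quota  int64  `json:"quota"`
				Period uint64 `json:"period"`
			} `json:"cpu"`
		} `json:"resources"`
	} `json:"linux"`
}

const sampleConfig = `{
  "ociVersion": "1.0.2",
  "process": {"args": ["/bin/sh"]},
  "linux": {
    "namespaces": [{"type": "pid"}, {"type": "network"}, {"type": "mount"}, {"type": "uts"}],
    "resources": {
      "memory": {"limit": 134217728},
      "cpu": {"quota": 50000, "period": 100000}
    }
  }
}`

// loadSpec decodes a config.json document into ociSpec.
func loadSpec(data []byte) (*ociSpec, error) {
	var spec ociSpec
	if err := json.Unmarshal(data, &spec); err != nil {
		return nil, fmt.Errorf("parse config.json: %w", err)
	}
	return &spec, nil
}

func main() {
	spec, err := loadSpec([]byte(sampleConfig))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("command:   ", spec.Process.Args)
	for _, ns := range spec.Linux.Namespaces {
		fmt.Println("namespace: ", ns.Type)
	}
	res := spec.Linux.Resources
	fmt.Println("memory.max:", res.Memory.Limit)
	fmt.Printf("cpu.max:    %d %d\n", res.CPU.Quota, res.CPU.Period)
}
```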

Chapter Summary

This chapter revealed the essence of containers: containers are not virtual machines but a composition of existing Linux kernel mechanisms (namespace + cgroup).

Namespaces: control what a process can "see" (PID space, network stack, filesystem, hostname)
Cgroup v2: limit how much resource a process can "use" via the /sys/fs/cgroup filesystem interface
OverlayFS: implement layered image storage and copy-on-write via lowerdir/upperdir/merged
Self-invoking process pattern: solves Go's inability to run arbitrary code after clone() in a child process

In the code implementation we built: namespace-isolated process launch, cgroup resource limiting, chroot filesystem switching, /proc /sys /dev virtual filesystem mounting, and OCI image layer download and unpacking.

The advanced section covered container networking (veth pairs + bridge + NAT), system call filtering (seccomp with BPF), fine-grained privilege control (Linux capabilities), rootless containers (User namespace + UID mapping), a feature comparison with runc, and the OCI Runtime Specification.

Understanding all of this gives you a clear mental model of Docker's core mechanism: Docker is essentially a toolchain built around these Linux kernel features, providing higher-level abstractions for image management, network plugins, and log drivers. When a container has a problem, the root cause is almost always in one of three layers: namespace configuration, cgroup limits, or filesystem mounting — and now you know exactly where to look.
