Chapter 27

Virtualization and Containers

Have you ever wondered how a single physical server can run dozens of "independent computers" at the same time? Or why a Docker container feels walled off from the host system, yet fundamentally shares the same operating system kernel?

Think of an office building. Virtualization is like subdividing the building into fully independent suites—each with its own door lock, its own HVAC controls, its own circuit breaker. They're completely isolated and don't interfere with each other. Containers are more like partition-divided workstations inside one large open-plan floor. Each team has its own territory, but the hallways, elevators, and restrooms (the OS kernel) are shared. Partitions are far lighter than separate rooms—they go up fast and move easily—but the soundproofing isn't as good.

Core Concepts

Why Virtualization Exists

Before the 2000s, a company that bought a server typically ran one application on it: one machine for email, one for the database, one for the website. Most of the time, CPU utilization hovered between 5% and 15%, so 85% to 95% of each machine's processing capacity sat idle. This was massive waste.

The core idea of virtualization: make one physical machine simultaneously pretend to be several machines. Each "fake machine" (a virtual machine) believes it has exclusive use of the hardware. It can install its own operating system and run its own processes. But the underlying hardware is shared.

Three Virtualization Approaches

Full Virtualization:
  The guest OS has no idea it's inside a VM.
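  The Hypervisor intercepts privileged instructions (by trapping them or, on older x86 CPUs, by rewriting them with binary translation) and emulates the hardware in software.
  Pro: No modification to guest OS required
  Con: High performance overhead (every intercepted instruction is handled in software by the Hypervisor)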

Paravirtualization:
  The guest OS is modified to know it's in a VM.
  It uses "hypercalls" to request the Hypervisor directly, avoiding unnecessary traps.
  Used by Xen.
  Pro: Good performance
  Con: Requires kernel modification—closed-source OSes like Windows can't use it

Hardware-Assisted Virtualization:
  The CPU itself provides two execution modes (Intel's names shown):
    VMX Root mode  (the Hypervisor runs here)
    VMX Non-Root mode (the VMs run here)
  Sensitive instructions and events automatically trigger a VM Exit; hardware handles
  the switch, so the Hypervisor no longer has to intercept them in software.
  Implemented by Intel VT-x and AMD-V.
  Pro: Near-native performance, no guest OS modification needed
  This is today's mainstream approach

Intel VT-x adds a "virtual machine mode" to the CPU. When a guest OS executes a privileged instruction, the CPU hardware automatically pauses the VM and notifies the Hypervisor to handle it, then resumes—this cycle is called a VM Exit / VM Entry. It's far faster than pure software emulation.
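
A quick way to see whether a machine offers hardware-assisted virtualization is to look at the CPU flags the kernel reports. The sketch below assumes Linux on x86, where /proc/cpuinfo lists vmx for Intel VT-x and svm for AMD-V; the helper name hw_virt_support and the CPUINFO override are made up for this example.

```shell
#!/bin/sh
# Report which hardware virtualization extension the CPU advertises.
# Assumption: Linux on x86, where /proc/cpuinfo has a "flags" line per
# logical CPU (vmx = Intel VT-x, svm = AMD-V).
# hw_virt_support and CPUINFO are illustrative names, not a standard tool.
cpuinfo_file="${CPUINFO:-/proc/cpuinfo}"

hw_virt_support() {
    # $1: path to a cpuinfo-style file
    if grep -qw vmx "$1" 2>/dev/null; then
        echo "Intel VT-x"
    elif grep -qw svm "$1" 2>/dev/null; then
        echo "AMD-V"
    else
        echo "none reported"
    fi
}

hw_virt_support "$cpuinfo_file"
```

If this prints "none reported" inside a virtual machine, that does not necessarily mean the physical CPU lacks the feature: a hypervisor usually hides these flags from its guests unless nested virtualization is enabled.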

Hypervisor Type 1 vs Type 2

Type 1 (Bare-Metal Hypervisor):
┌─────────────────────────────────────────┐
│  VM1          VM2           VM3         │  ← Virtual Machines
│  (Linux)      (Windows)     (FreeBSD)   │
├─────────────────────────────────────────┤
│              Hypervisor                 │  ← Runs directly on hardware
│     (VMware ESXi / KVM / Hyper-V)       │
├─────────────────────────────────────────┤
│              Physical Hardware          │
└─────────────────────────────────────────┘
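No general-purpose host OS sits underneath; the Hypervisor itself controls the hardware. (KVM is a borderline case: it is a module that turns the Linux kernel into the hypervisor.) Best performance. Standard in data centers.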

Type 2 (Hosted Hypervisor):
┌─────────────────────────────────────────┐
│  VM1          VM2                       │  ← Virtual Machines
│  (Linux)      (Windows)                 │
├─────────────────────────────────────────┤
│       Hypervisor (runs as an app)       │
│   (VMware Workstation / VirtualBox)     │
├─────────────────────────────────────────┤
│           Host Operating System         │  ← Ordinary OS
├─────────────────────────────────────────┤
│              Physical Hardware          │
└─────────────────────────────────────────┘
Runs on top of a regular OS. Convenient for developers on laptops. Lower performance.

Containers vs VMs: The Fundamental Difference

Virtual machines achieve hardware-level isolation. Each VM has its own virtual CPU, virtual memory, and virtual disk—and can run an entirely different OS kernel.

Containers achieve process-level isolation. All containers share the same Linux kernel as the host. What's isolated is each container's view of the system, using two key Linux kernel mechanisms:

Namespaces give each container an independent "window" onto the system:

Namespace types:
  PID namespace   → processes inside appear to start at PID 1
                    (on the host the same processes have ordinary, usually larger, PIDs)
  NET namespace   → container has its own network interfaces,
                    routing table, and iptables rules
  MNT namespace   → container has its own filesystem mount points
  UTS namespace   → container has its own hostname
  IPC namespace   → container has its own inter-process communication resources
  USER namespace  → root inside the container can be mapped to an unprivileged host user
                    (Docker leaves this remapping off by default)
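
Each of these namespaces is visible from user space. The sketch below assumes a Linux system where /proc/<pid>/ns/ holds one symbolic link per namespace type; show_namespaces is a made-up helper name, and it needs no special privileges when pointed at your own shell.

```shell
#!/bin/sh
# Print the namespace identifiers of a process (default: this shell).
# Assumption: Linux, where each link under /proc/<pid>/ns/ resolves to
# something like "pid:[4026531836]". Two processes share a namespace
# exactly when the bracketed numbers match.
# show_namespaces is an illustrative helper name.
show_namespaces() {
    target="${1:-$$}"
    for ns in pid net mnt uts ipc user; do
        link="/proc/$target/ns/$ns"
        if [ -L "$link" ]; then
            printf '%-5s %s\n' "$ns" "$(readlink "$link")"
        fi
    done
}

show_namespaces
```

Run it once on the host and once inside a container: every namespace the container does not share with the host shows a different number.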

cgroups (control groups) enforce resource limits:

cgroups can limit:
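  CPU usage        (e.g., at most 50% of 2 cores, i.e. one core's worth of CPU time)
  Memory ceiling   (e.g., at most 512 MB)
  Disk I/O bandwidth
  Network bandwidth (indirectly: cgroups only classify a group's traffic; the traffic-control layer, tc, enforces the rate)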

Namespaces tell the container "this is the world you see." cgroups tell the kernel "these are the resource limits for this group." Together they create container isolation.
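
To make the limits concrete, here is a small sketch of how the "50% of 2 cores" and "512 MB" examples translate into the numbers the kernel stores. It assumes cgroup v2 semantics, where cpu.max holds "<quota_us> <period_us>" and memory.max holds a byte count; cpu_max and mem_max are hypothetical helper names, and nothing here writes to a real cgroup.

```shell
#!/bin/sh
# Translate human-friendly limits into the values cgroup v2 control files hold.
# Assumptions: cgroup v2, where cpu.max is "<quota_us> <period_us>" and
# memory.max is a byte count. 100000 us is the kernel's default period.
# cpu_max and mem_max are illustrative helper names.
PERIOD_US=100000

# cpu_max <cores> <percent>: CPU-time quota for <percent>% of <cores> cores.
cpu_max() {
    echo "$(( $1 * $2 * $PERIOD_US / 100 )) $PERIOD_US"
}

# mem_max <mebibytes>: memory limit in bytes.
mem_max() {
    echo "$(( $1 * 1024 * 1024 ))"
}

echo "cpu.max    = $(cpu_max 2 50)"    # prints: cpu.max    = 100000 100000
echo "memory.max = $(mem_max 512)"     # prints: memory.max = 536870912
```

A quota equal to the period means the group may use one full core's worth of CPU time in every 100 ms window, no matter how many cores its threads are spread across.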

Docker's Overlay Filesystem

Docker images are layered. Each layer records only the delta relative to the layer below it. Multiple containers can share the same base layers, saving enormous amounts of disk space.

Docker image layer structure:
┌─────────────────────────────────────┐  ← Container writable layer
│  Your running app (writes go here)  │  Unique to each container; survives a stop,
│                                     │  discarded when the container is removed
├─────────────────────────────────────┤
│  App layer (pip install requests)   │
├─────────────────────────────────────┤
│  Python layer (Python 3.11)         │  ← Image read-only layers
├─────────────────────────────────────┤  Shared across many containers
│  Ubuntu base layer                  │
└─────────────────────────────────────┘

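OverlayFS merges these layers and presents them to the container
as a single coherent filesystem. Only the topmost writable layer
actually stores changes: when a container modifies a file that lives
in a read-only layer, the file is first copied up into the writable
layer, and the image layers themselves are never altered.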
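
The lookup order behind that merged view can be modeled in a few lines of shell. This is a simplified illustration, not OverlayFS itself: it ignores whiteouts, directory merging, and copy-up, and the function name overlay_lookup and the directory names are made up for the example.

```shell
#!/bin/sh
# Model of how a layered filesystem resolves a path: search the writable
# upper layer first, then each read-only layer from top to bottom.
# Simplified illustration only (no whiteouts, no directory merging,
# no copy-up). overlay_lookup and the layer names are made up.
overlay_lookup() {
    path="$1"; shift
    for layer in "$@"; do              # layers are passed top-most first
        if [ -e "$layer/$path" ]; then
            echo "$layer/$path"
            return 0
        fi
    done
    return 1                           # not present in any layer
}

demo=$(mktemp -d)
mkdir -p "$demo/upper" "$demo/app" "$demo/python" "$demo/ubuntu"
echo "base"     > "$demo/ubuntu/motd"
echo "python"   > "$demo/python/version"
echo "modified" > "$demo/upper/motd"   # the container changed this file

# The upper layer's copy of motd hides the one in the base layer.
cat "$(overlay_lookup motd    "$demo/upper" "$demo/app" "$demo/python" "$demo/ubuntu")"
# version exists only in a read-only layer, so it is served from there.
cat "$(overlay_lookup version "$demo/upper" "$demo/app" "$demo/python" "$demo/ubuntu")"
```

The two cat commands print "modified" and then "python". On a real system the kernel does this merging for you; a manual overlay mount (as root) has the form mount -t overlay overlay -o lowerdir=...,upperdir=...,workdir=... followed by the mount point.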

Network Virtualization: veth Pairs and Bridges

Each Docker container lives in its own network namespace. Docker uses veth pairs (virtual Ethernet pairs) to connect the container's network interface to the host:

Host machine:
┌─────────────────────────────────────────────────────┐
│  docker0 (virtual bridge, acts like a switch)       │
│      │                                              │
│    veth0 (host end)                                 │
│      │  veth pair (virtual cable)                   │
│      │         ┌──────────────────────────────────┐ │
│      └─────────┤ veth1 (eth0 inside Container A)  │ │
│                │ own NET namespace                │ │
│                │ IP: 172.17.0.2                   │ │
│                └──────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘

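A veth pair behaves like a virtual network cable with an interface at each end:
packets sent into one end emerge from the other.
The host end plugs into the docker0 bridge like a cable into a switch;
the other end is moved into the container's network namespace, where it appears as eth0.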
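
The same wiring can be reproduced by hand with the ip tool from iproute2. The sketch below only prints the commands by default; set DRY_RUN=0 and run it as root on a Linux host to execute them. The names demo-ns, br-demo, veth-host, and veth-ctr are invented for this example (Docker uses docker0 and generated veth names), and the steps mirror the diagram above rather than Docker's exact internal sequence.

```shell
#!/bin/sh
# Wire a network namespace to a bridge with a veth pair.
# By default this only PRINTS the commands. Set DRY_RUN=0 and run as root
# on a Linux host with iproute2 installed to actually execute them.
# demo-ns, br-demo, veth-host and veth-ctr are made-up names.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

connect_namespace() {
    run ip netns add demo-ns                          # the "container"
    run ip link add br-demo type bridge               # stand-in for docker0
    run ip link set br-demo up
    run ip link add veth-host type veth peer name veth-ctr
    run ip link set veth-ctr netns demo-ns            # move one end inside
    run ip link set veth-host master br-demo          # plug the other into the bridge
    run ip link set veth-host up
    run ip netns exec demo-ns ip addr add 172.17.0.2/16 dev veth-ctr
    run ip netns exec demo-ns ip link set veth-ctr up
}

connect_namespace
```

If Docker is already running on the machine, pick a different subnet than 172.17.0.0/16 before executing for real, and clean up afterwards with ip netns del demo-ns and ip link del br-demo.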

The Limits of Container Security Isolation

Containers are not virtual machines—their isolation is incomplete:

Shared kernel: All containers share the host kernel. A kernel vulnerability (such as Dirty COW, CVE-2016-5195, or Dirty Pipe, CVE-2022-0847) can be exploited by malicious code inside a container to escalate privileges on the host and break out of isolation entirely.
Container escape: Several escape vulnerabilities have appeared in container runtimes over the years (e.g., the runc vulnerability CVE-2019-5736), where an attacker inside a container gained root access on the host.
Running as root: Unless user-namespace remapping is enabled (Docker leaves it off by default), root inside a container is the same UID 0 as root on the host, so any escape lands with full host privileges. Running the process as an unprivileged user shrinks that risk considerably.

For this reason, in high-security multi-tenant environments, data centers often still use virtual machines to isolate different tenants—with containers running inside those VMs. AWS Fargate, for example, gives each task its own MicroVM.

Hands-On

On Linux, feel the effect of namespaces directly from the command line:

# Create a new PID namespace and run bash inside it
sudo unshare --pid --fork --mount-proc bash

# Inside this new bash, PID numbering starts at 1
ps aux
# You'll see this bash itself has PID 1!
# The thousands of processes on the host are completely invisible here

# Exit
exit

Check cgroup resource limits on a Docker container:

# Start a container with a 256 MB memory limit
docker run -d --memory=256m --name=test-limit nginx

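# Inspect the corresponding cgroup memory limit file (Linux, cgroup v1 layout)
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes
# Output: 268435456  (= 256 × 1024 × 1024)
# On cgroup v2 hosts the file is named memory.max and its path depends on the
# cgroup driver; docker inspect reports the same limit on either version:
docker inspect --format '{{.HostConfig.Memory}}' test-limit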

🔬 Going Deeper

gVisor and Kata Containers: Finding the Middle Ground

To address the shared-kernel security weakness of containers without paying the full cost of virtual machines, the industry has explored two approaches. Google's open-source gVisor inserts a "sandbox kernel" between the container and the host kernel: a user-space kernel written in Go that intercepts the container's system calls and handles most of them itself. Malicious code inside the container then attacks the sandbox first and must break out of it before it can reach the host kernel. Kata Containers takes the opposite approach: it gives each container a genuinely separate kernel inside an ultra-lightweight MicroVM (KVM-accelerated), achieving near-VM isolation strength while starting up in roughly 100 milliseconds in favorable configurations, far faster than a traditional VM.

eBPF: A Programmable Kernel Replacing Container Middleware

In recent years, eBPF (extended Berkeley Packet Filter) has been reshaping how container networking and security are implemented. eBPF lets you load verified bytecode into a running kernel, without modifying or recompiling the kernel, to trace system calls, inspect network packets, and implement load balancing. Because the program runs inside the kernel at the hook point, it avoids handing each packet or event to a user-space process. Cilium, a Kubernetes networking plugin built on eBPF, can replace the iptables-based service routing of kube-proxy, and many large clusters have adopted it for better performance at scale and better observability.

Recommended Resources

Docker Deep Dive (Nigel Poulton): fast, practical, and conversational; the best entry-level book for understanding Docker and container concepts; most effective when you follow along hands-on
Linux manual pages namespaces(7) and cgroups(7): the reference documentation from the Linux man-pages project; each namespace and cgroup type is explained here; run man namespaces to start
Systems Performance (Brendan Gregg): a deep dive into Linux kernel performance analysis, including cgroups, eBPF, and much more; the desk reference for serious systems engineers
