Virtualization and Containers
Have you ever wondered how a single physical server can run dozens of "independent computers" at the same time? Or why a Docker container feels walled off from the host system, yet fundamentally shares the same operating system kernel?
Think of an office building. Virtualization is like subdividing the building into fully independent suites, each with its own door lock, its own HVAC controls, its own circuit breaker. They're completely isolated and don't interfere with each other. Containers are more like partition-divided workstations inside one large open-plan floor. Each team has its own territory, but the hallways, elevators, and restrooms (the OS kernel) are shared. Partitions are far lighter than separate rooms (they go up fast and move easily), but the soundproofing isn't as good.
Core Concepts
Why Virtualization Exists
Before the 2000s, a company that bought a server typically ran one application on it: one machine for email, one for the database, one for the website. Most of the time, CPU utilization hovered between 5% and 15%, so roughly 85% to 95% of the processing capacity sat idle. This was massive waste.
The core idea of virtualization: make one physical machine simultaneously pretend to be several machines. Each "fake machine" (a virtual machine) believes it has exclusive use of the hardware. It can install its own operating system and run its own processes. But the underlying hardware is shared.
Three Virtualization Approaches
Full Virtualization:
The guest OS has no idea it's inside a VM.
The Hypervisor intercepts all privileged instructions and simulates hardware.
Pro: No modification to guest OS required
Con: High performance overhead (every privileged instruction traps to the Hypervisor)
Paravirtualization:
The guest OS is modified to know it's in a VM.
It uses "hypercalls" to call into the Hypervisor directly, avoiding unnecessary traps.
Used by Xen.
Pro: Good performance
Con: Requires kernel modification, so closed-source OSes like Windows can't use it
Hardware-Assisted Virtualization:
The CPU itself provides two execution modes (named here in Intel's terms):
VMX Root mode (the Hypervisor runs here)
VMX Non-Root mode (the VMs run here)
Privileged instructions automatically trigger a VM Exit: hardware handles the switch,
no software emulation needed. Implemented by Intel VT-x and AMD-V.
Pro: Near-native performance, no guest OS modification needed
This is today's mainstream approach
Intel VT-x adds a "virtual machine mode" to the CPU. When a guest OS executes a privileged instruction, the CPU hardware automatically pauses the VM and notifies the Hypervisor to handle it, then resumes the VM. This cycle is called a VM Exit / VM Entry. It's far faster than pure software emulation.
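Whether a machine offers hardware-assisted virtualization can be read from the CPU flag list on Linux. The sketch below assumes the flag names Linux conventionally reports in /proc/cpuinfo (vmx for Intel VT-x, svm for AMD-V); the helper name virt_support is made up for illustration.

```shell
# Classify a CPU by its virtualization flag.
# Reads /proc/cpuinfo-style text on stdin and prints one line.
virt_support() {
    # Take the first "flags" line; "|| true" keeps an empty result non-fatal.
    flags=$(grep -m1 '^flags' || true)
    case " $flags " in
        *" vmx "*) echo "Intel VT-x" ;;
        *" svm "*) echo "AMD-V" ;;
        *)         echo "none reported" ;;
    esac
}

# Check the real CPU when /proc/cpuinfo is available.
if [ -r /proc/cpuinfo ]; then
    virt_support < /proc/cpuinfo
fi
```

Inside a VM this often prints "none reported", because the flag is hidden from guests unless nested virtualization is enabled. Non-x86 machines use different field names and will also fall through to that case.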
Hypervisor Type 1 vs Type 2
Type 1 (Bare-Metal Hypervisor):
┌─────────────────────────────────────────┐
│   VM1         VM2          VM3          │  ← Virtual Machines
│ (Linux)    (Windows)    (FreeBSD)       │
├─────────────────────────────────────────┤
│               Hypervisor                │  ← Runs directly on hardware
│      (VMware ESXi / KVM / Hyper-V)      │
├─────────────────────────────────────────┤
│            Physical Hardware            │
└─────────────────────────────────────────┘
No separate host OS: the Hypervisor is the OS (KVM does this by turning the Linux kernel itself into the hypervisor). Best performance. Standard in data centers.
Type 2 (Hosted Hypervisor):
┌─────────────────────────────────────────┐
│        VM1              VM2             │  ← Virtual Machines
│      (Linux)         (Windows)          │
├─────────────────────────────────────────┤
│       Hypervisor (runs as an app)       │
│    (VMware Workstation / VirtualBox)    │
├─────────────────────────────────────────┤
│          Host Operating System          │  ← Ordinary OS
├─────────────────────────────────────────┤
│            Physical Hardware            │
└─────────────────────────────────────────┘
Runs on top of a regular OS. Convenient for developers on laptops. Lower performance.
Containers vs VMs: The Fundamental Difference
Virtual machines achieve hardware-level isolation. Each VM has its own virtual CPU, virtual memory, and virtual disk, and can run an entirely different OS kernel.
Containers achieve process-level isolation. All containers share the same Linux kernel as the host. What's isolated is each container's view of the system, using two key Linux kernel mechanisms:
Namespaces give each container an independent "window" onto the system:
Namespace types:
PID namespace  → processes inside appear to start at PID 1
                 (on the host they have ordinary, usually much larger, PIDs)
NET namespace  → container has its own network interfaces,
                 routing table, and iptables rules
MNT namespace  → container has its own filesystem mount points
UTS namespace  → container has its own hostname
IPC namespace  → container has its own inter-process communication resources
USER namespace → root inside the container can map to an unprivileged host user
                 (Docker leaves this remapping off unless userns-remap is configured)
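Namespace membership is visible from the host as well. A minimal sketch, assuming a Linux /proc filesystem: every process has symlinks under /proc/<pid>/ns whose targets identify its namespaces. The helper names ns_id and same_namespace are made up for illustration.

```shell
# Print the namespace identifier of one type for a process.
# Each entry under /proc/<pid>/ns is a symlink such as "pid:[4026531836]";
# two processes share a namespace exactly when these values match.
ns_id() {
    # usage: ns_id <type> [pid]
    readlink "/proc/${2:-self}/ns/$1"
}

same_namespace() {
    # usage: same_namespace <type> <pid-a> <pid-b>
    [ "$(ns_id "$1" "$2")" = "$(ns_id "$1" "$3")" ]
}

# List the namespaces this shell currently lives in.
for type in pid net mnt uts ipc user; do
    printf '%-5s %s\n' "$type" "$(ns_id "$type" $$)"
done
```

Running same_namespace pid $$ <pid-of-a-container-process> as root on a Docker host would be expected to fail, since the container process sits in a different PID namespace from the shell.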
cgroups (control groups) enforce resource limits:
cgroups can limit:
CPU usage (e.g., at most 50% of 2 cores)
Memory ceiling (e.g., at most 512 MB)
Disk I/O bandwidth
Network traffic (indirectly: cgroups tag a group's packets so that tc or eBPF can shape bandwidth)
Namespaces tell the container "this is the world you see." cgroups tell the kernel "these are the resource limits for this group." Together they create container isolation.
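To make the limits concrete, here is a small sketch that converts the two examples above into the values a cgroup v2 host expects in its cpu.max and memory.max files. The helper names and the group name "demo" are hypothetical, and the commented-out commands assume root plus the cpu and memory controllers being enabled for child groups.

```shell
# cpu.max holds "<quota> <period>" in microseconds: the group may use
# <quota> microseconds of CPU time in every <period>.
cpu_max_value() {
    # usage: cpu_max_value <cores> <percent> [period_us]
    period=${3:-100000}
    echo "$(( $1 * $2 * period / 100 )) $period"
}

# memory.max is a plain byte count.
mem_max_value() {
    # usage: mem_max_value <mebibytes>
    echo $(( $1 * 1024 * 1024 ))
}

cpu_max_value 2 50    # 50% of 2 cores -> prints "100000 100000"
mem_max_value 512     # 512 MB         -> prints "536870912"

# Applying the limits by hand (root only, cgroup v2):
#   mkdir /sys/fs/cgroup/demo
#   cpu_max_value 2 50 > /sys/fs/cgroup/demo/cpu.max
#   mem_max_value 512  > /sys/fs/cgroup/demo/memory.max
#   echo $$            > /sys/fs/cgroup/demo/cgroup.procs
```

"50% of 2 cores" works out to one full core's worth of time per period, which is why the quota equals the period.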
Docker's Overlay Filesystem
Docker images are layered. Each layer records only the delta relative to the layer below it. Multiple containers can share the same base layers, saving enormous amounts of disk space.
Docker image layer structure:
┌───────────────────────────────────┐  ← Container writable layer
│ Your running app (writes go here) │    Unique to each running container;
│                                   │    discarded when the container is removed
├───────────────────────────────────┤
│ App layer (pip install requests)  │
├───────────────────────────────────┤
│ Python layer (Python 3.11)        │  ← Image read-only layers
├───────────────────────────────────┤    Shared across many containers
│ Ubuntu base layer                 │
└───────────────────────────────────┘
OverlayFS merges these layers and presents them to the container
as a single coherent filesystem. Only the topmost writable layer
actually stores changes.
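The lookup rule behind that merged view can be imitated with ordinary directories. This is a simulation of the idea, not OverlayFS itself: the function overlay_lookup and the directory names are invented for illustration, and real OverlayFS additionally copies a lower-layer file up before modifying it and records deletions as whiteout entries.

```shell
# Resolve a path top-down: the first layer that contains it wins.
overlay_lookup() {
    # usage: overlay_lookup <relative-path> <top-layer> [<lower-layer>...]
    path=$1
    shift
    for layer in "$@"; do
        if [ -e "$layer/$path" ]; then
            echo "$layer/$path"
            return 0
        fi
    done
    return 1
}

# Throwaway directories standing in for two image layers and a writable layer.
demo=$(mktemp -d)
mkdir -p "$demo/base" "$demo/app" "$demo/upper"
echo "from base" > "$demo/base/os-release"
echo "from base" > "$demo/base/app.conf"
echo "from app"  > "$demo/app/app.conf"

view() { overlay_lookup "$1" "$demo/upper" "$demo/app" "$demo/base"; }

cat "$(view os-release)"   # only the base layer has it
cat "$(view app.conf)"     # the app layer shadows the base copy

# A write from the container lands in the writable layer only.
echo "edited in container" > "$demo/upper/app.conf"
cat "$(view app.conf)"     # now served from the writable layer
cat "$demo/app/app.conf"   # the image layer is untouched

rm -rf "$demo"
```

With a real kernel overlay the same arrangement would be mounted as root with something like mount -t overlay overlay -o lowerdir=app:base,upperdir=upper,workdir=work merged, where the leftmost lowerdir entry is the topmost read-only layer.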
Network Virtualization: veth Pairs and Bridges
Each Docker container lives in its own network namespace. Docker uses veth pairs (virtual Ethernet pairs) to connect the container's network interface to the host:
Host machine:
┌─────────────────────────────────────────────────────┐
│                                                     │
│  docker0 (virtual bridge, acts like a switch)       │
│     │                                               │
│   veth0 ──────────── veth1 (eth0 inside container)  │
│                      ┌────────────────┐             │
│                      │ Container A    │             │
│                      │ eth0           │             │
│                      │ IP: 172.17.0.2 │             │
│                      └────────────────┘             │
└─────────────────────────────────────────────────────┘
A veth pair is two virtual network interfaces joined by a virtual cable:
packets sent into one end emerge from the other.
The host end plugs into the docker0 bridge like a cable into a switch.
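The same wiring can be built by hand with the iproute2 ip tool. The sketch below only prints the commands unless DRY_RUN=0 is set, because running them needs root. Every name and address in it (demo-ns, demo-br0, veth-host, veth-ctr, 172.17.0.2/16) is a hypothetical stand-in, and a named network namespace is used in place of the anonymous one Docker would create.

```shell
# Print each command; execute it only when DRY_RUN=0 (needs root and iproute2).
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

wire_container() {
    # usage: wire_container <netns> <bridge> <host-veth> <ctr-veth> <cidr>
    run ip netns add "$1"                              # the "container" network namespace
    run ip link add "$2" type bridge                   # the virtual switch (docker0's role)
    run ip link add "$3" type veth peer name "$4"      # both ends of the virtual cable
    run ip link set "$4" netns "$1"                    # move one end into the namespace
    run ip link set "$3" master "$2"                   # plug the host end into the bridge
    run ip netns exec "$1" ip link set "$4" name eth0  # rename the inner end to eth0
    run ip netns exec "$1" ip addr add "$5" dev eth0
    run ip netns exec "$1" ip link set eth0 up
    run ip link set "$3" up
    run ip link set "$2" up
}

wire_container demo-ns demo-br0 veth-host veth-ctr 172.17.0.2/16
```

Reading the printed commands top to bottom mirrors the diagram: create the namespace and the bridge, create the pair, push one end inside, and attach the other end to the bridge.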
The Limits of Container Security Isolation
Containers are not virtual machines, and their isolation is incomplete:
- Shared kernel: All containers share the host kernel. A kernel vulnerability (like Dirty COW, CVE-2016-5195, or Dirty Pipe, CVE-2022-0847) can be exploited by malicious code inside a container to break out of isolation entirely
- Namespace escape: Several "container escape" vulnerabilities have appeared over the years (e.g., the runc vulnerability CVE-2019-5736), where an attacker inside a container gained root access on the host
- Running as root: If a process inside a container runs as root, even with USER namespace isolation, the risk profile is significantly worse than running as an unprivileged user
For this reason, in high-security multi-tenant environments, data centers often still use virtual machines to isolate different tenants, with containers running inside those VMs. AWS Fargate, for example, gives each task its own MicroVM.
Hands-On
On Linux, feel the effect of namespaces directly from the command line:
# Create a new PID namespace and run bash inside it
sudo unshare --pid --fork --mount-proc bash
# Inside this new bash, PID numbering starts at 1
ps aux
# You'll see this bash itself has PID 1!
# The thousands of processes on the host are completely invisible here
# Exit
exit
Check cgroup resource limits on a Docker container:
# Start a container with a 256 MB memory limit
docker run -d --memory=256m --name=test-limit nginx
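# Inspect the corresponding cgroup memory limit file (Linux, cgroup v1 layout)
cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes
# Output: 268435456 (= 256 × 1024 × 1024)
# On cgroup v2 hosts the file is named memory.max and sits under a different path.
# Either way, the configured limit can be read back from Docker itself:
docker inspect --format '{{.HostConfig.Memory}}' test-limit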
🔬 Going Deeper
gVisor and Kata Containers: Finding the Middle Ground
To address the shared-kernel security weakness of containers without paying the full cost of virtual machines, the industry has explored two approaches. Google's open-source gVisor inserts a "sandbox kernel" between the container and the host kernel: a user-space kernel written in Go that intercepts the container's system calls. Malicious code running inside the container has to compromise the sandbox first, and can reach the host kernel only through the much narrower interface the sandbox itself uses. Kata Containers takes the opposite approach: it gives each container a genuinely separate kernel inside an ultra-lightweight MicroVM (KVM-accelerated), achieving near-VM isolation strength while starting up in roughly 100 milliseconds, far faster than a traditional VM.
eBPF: A Programmable Kernel Replacing Container Middleware
In recent years, eBPF (extended Berkeley Packet Filter) has been reshaping how container networking and security are implemented. eBPF lets you inject verified bytecode into a running kernel, without modifying or recompiling the kernel, to intercept system calls, observe network packets, and implement load balancing. The programs run inside the kernel, so handling an event does not require a round trip to user space. Cilium, a Kubernetes networking plugin built on eBPF, has replaced traditional iptables-based routing in many large clusters, delivering dramatically better performance and observability.
Recommended Resources
- Docker Deep Dive (Nigel Poulton): fast, practical, and conversational; the best entry-level book for understanding Docker and container concepts; most effective when you follow along hands-on
- Linux manual pages namespaces(7) and cgroups(7): the official kernel documentation; everything about each namespace and cgroup type is explained here; run man namespaces to start
- Systems Performance (Brendan Gregg): a deep dive into Linux kernel performance analysis, including cgroups, eBPF, and much more; the desk reference for serious systems engineers