Chapter 20

B-Tree, B+Tree and LSM-Tree

Chapter 20: B-Tree, B+Tree and LSM-Tree

Open a MySQL client and run EXPLAIN SELECT * FROM users WHERE id = 42. You will see const in the type column—meaning InnoDB located the row with a single B+Tree lookup. Behind this "single lookup" lies a multi-way search tree with a height of just 3-4 levels, where each node occupies exactly one 16KB disk page.

Why not a binary search tree? Why not a hash table? Why not a red-black tree? The answers to all these questions point to the same fundamental tension: a 100,000x gap between memory access speed and disk I/O speed. A random memory access takes about 100ns; a random SSD read takes about 100μs; a random HDD seek takes about 10ms. When data exceeds memory capacity, any data structure design must prioritize "minimizing disk I/O operations" above all else.

When Rudolf Bayer and Edward McCreight invented the B-Tree in 1972, they targeted precisely this problem. Their paper title was direct and precise: "Organization and Maintenance of Large Ordered Indexes." By increasing the "width" of each node (storing multiple keys and child pointers), a B-Tree compresses tree height to log_m(n) levels (where m is the branching factor, typically in the hundreds to thousands), limiting the disk I/O needed to find any record to 3-4 operations.

The B+Tree is a variant of the B-Tree where all data resides only in leaf nodes, and leaves are linked by a doubly-linked list—making range queries extremely efficient. MySQL InnoDB, PostgreSQL, and Oracle all use B+Tree for their index structures.

But B+Tree is not a silver bullet. When write load far exceeds reads (log systems, time-series databases), B+Tree's random write amplification becomes a bottleneck. Patrick O'Neil's 1996 LSM-Tree (Log-Structured Merge-Tree) converts random writes to sequential writes through a "write to memory → batch flush to disk → background merge" strategy, achieving orders-of-magnitude improvement in write throughput. LevelDB, RocksDB, Cassandra, and HBase are all based on LSM-Tree.

This chapter starts from the basic definition of B-Tree, dives deep into B+Tree's practical application in databases, then covers LSM-Tree's write-optimized design, and finally discusses why Redis chose skip lists over B+Tree. You will see that choosing a data structure is never about "which is better," but about "which costs less given the hardware constraints and access patterns."

Level 1 · What You Need to Know

20.1 B-Tree: Multi-way Search Tree

20.1.1 From Binary to Multi-way

In previous chapters we studied binary search trees (BST) and balanced BSTs (AVL, red-black trees). Their core property: each node has at most two children, and search takes O(log₂n) time.

For 1 million records, a binary tree has height ≈ log₂(1,000,000) ≈ 20. If the tree fits entirely in memory, 20 comparisons are trivial. But if each node resides on disk, that means 20 disk I/O operations—at 10ms per HDD seek, one lookup takes 200ms. Unacceptable.

The solution is to increase each node's "branching factor." If each node can store 1000 keys and 1001 child pointers, then the same 1 million records produce a tree of height log₁₀₀₁(1,000,000) ≈ 2. With the root node typically cached in memory, only 1-2 disk I/O operations are needed to find any record.

This is the core idea of B-Tree: trade node width for tree height.

20.1.2 Formal Definition of B-Tree

An order-m B-Tree (m is the maximum number of children per node) satisfies:

Each node has at most m children
Non-root nodes have at least ⌈m/2⌉ children (guaranteeing at least half-full)
Root has at least 2 children (unless it is a leaf)
A node with k children contains k-1 keys
All leaves are at the same level (perfectly balanced)
Keys within a node are sorted

class BTreeNode:
    """B-Tree node"""
    def __init__(self, leaf=False):
        self.keys = []       # Sorted list of keys
        self.children = []   # Child pointers
        self.leaf = leaf     # Whether this is a leaf


class BTree:
    """B-Tree implementation (order m)"""
    def __init__(self, m):
        self.root = BTreeNode(leaf=True)
        self.m = m  # Maximum children per node
        self.max_keys = m - 1
        self.min_keys = (m + 1) // 2 - 1

    def search(self, node, key):
        """Search for key in B-Tree"""
        i = 0
        # Find position within current node
        while i < len(node.keys) and key > node.keys[i]:
            i += 1

        # Found
        if i < len(node.keys) and key == node.keys[i]:
            return (node, i)

        # Not found and at leaf
        if node.leaf:
            return None

        # Recurse to child (this corresponds to one disk I/O)
        return self.search(node.children[i], key)

    def insert(self, key):
        """Insert a key"""
        root = self.root
        if len(root.keys) == self.max_keys:
            new_root = BTreeNode(leaf=False)
            new_root.children.append(self.root)
            self._split_child(new_root, 0)
            self.root = new_root
        self._insert_non_full(self.root, key)

    def _insert_non_full(self, node, key):
        """Insert into a non-full node"""
        i = len(node.keys) - 1

        if node.leaf:
            node.keys.append(None)
            while i >= 0 and key < node.keys[i]:
                node.keys[i + 1] = node.keys[i]
                i -= 1
            node.keys[i + 1] = key
        else:
            while i >= 0 and key < node.keys[i]:
                i -= 1
            i += 1
            if len(node.children[i].keys) == self.max_keys:
                self._split_child(node, i)
                if key > node.keys[i]:
                    i += 1
            self._insert_non_full(node.children[i], key)

    def _split_child(self, parent, idx):
        """Split parent's idx-th child"""
        child = parent.children[idx]
        mid = len(child.keys) // 2

        new_node = BTreeNode(leaf=child.leaf)
        new_node.keys = child.keys[mid + 1:]

        if not child.leaf:
            new_node.children = child.children[mid + 1:]
            child.children = child.children[:mid + 1]

        promote_key = child.keys[mid]
        child.keys = child.keys[:mid]

        parent.keys.insert(idx, promote_key)
        parent.children.insert(idx + 1, new_node)

20.1.3 Visualizing B-Tree Search

Consider an order-5 B-Tree storing keys [1, 2, ..., 20]:

                    [8]
                   /    \
          [3, 5]          [12, 15, 18]
         /  |  \         /   |    |   \
    [1,2] [4] [6,7]  [9,10,11] [13,14] [16,17] [19,20]

Searching for key=14:

Root [8]: 14 > 8, go right (1 disk I/O)
Internal [12, 15, 18]: 12 < 14 < 15, follow second pointer (1 disk I/O)
Leaf [13, 14]: found 14 (1 disk I/O)

Total: 3 disk I/O operations.

20.1.4 Why Not Red-Black Trees or Hash Tables?

This is a classic interview question. Let's analyze from first principles:

Data Structure	Lookup I/O	Range Query	Ordering	Disk-friendly
Red-black tree	O(log₂n) ≈ 20	Requires inorder traversal	Ordered	Poor (scattered nodes)
Hash table	O(1) average	Not supported	Unordered	Poor (resize costly)
B+Tree	O(log_m n) ≈ 3	Leaf linked list scan	Ordered	Excellent (node=page)

Problems with red-black trees:

Too tall: 1M records need ~20 levels = ~20 disk I/O
Nodes too small: each stores one key, cannot utilize 4KB/16KB disk pages
Scattered nodes: adjacent keys may be physically far apart on disk

Problems with hash tables:

No range queries: WHERE age BETWEEN 20 AND 30 requires full scan
No sorting: ORDER BY needs additional sort step
Costly resizing: rehash redistributes all data
Worst-case O(n) with hash collisions

Why B+Tree wins:

Extremely low tree height: even 1 billion records fit in 4 levels
Node size = disk page size: one I/O reads a complete node
Excellent sequential access: leaf linked list for range scans
Prefetch-friendly: OS readahead can preload adjacent pages

20.2 B+Tree: The Foundation of Database Indexes

20.2.1 Key Differences Between B+Tree and B-Tree

B+Tree is a B-Tree variant with two critical improvements:

All data resides only in leaf nodes: internal nodes store only keys (as signposts)
Leaf nodes are linked in a doubly-linked list: enabling efficient range scans

B-Tree (data distributed across all nodes):
        [10|data, 20|data]
       /        |         \
  [5|data]  [15|data]  [25|data, 30|data]

B+Tree (data only in leaves):
           [10, 20]              ← Internal: keys only, no data
          /    |    \
    [5|d]→[10|d, 15|d]→[20|d, 25|d, 30|d]  ← Leaves: data + linked list

Why "data only in leaves" is better:

Internal nodes without data means:

More keys per internal node → larger branching factor → shorter tree → fewer I/O
With a 16KB page, 8-byte key, 6-byte pointer:
- B-Tree internal node: with 100-byte data per entry, ~145 keys per page
- B+Tree internal node: no data overhead, ~1170 keys per page
B+Tree's branching factor is 8x that of B-Tree, reducing tree height by at least one level

Why the "leaf linked list" matters:

# Range query: SELECT * FROM users WHERE age BETWEEN 20 AND 30
# B+Tree: locate leaf with age=20, then scan sequentially until age>30
# B-Tree: requires inorder traversal jumping between levels

def range_query_bplus(root, low, high):
    """B+Tree range query"""
    # Step 1: locate starting leaf (O(log n) I/O)
    leaf = find_leaf(root, low)

    results = []
    # Step 2: sequential scan along leaf linked list (sequential I/O)
    while leaf is not None:
        for key, data in leaf.entries:
            if key > high:
                return results
            if key >= low:
                results.append(data)
        leaf = leaf.next_leaf
    return results

20.2.2 Complete B+Tree Search Process

Let's simulate an InnoDB B+Tree lookup with concrete numbers:

Page size: 16KB
Primary key: bigint (8 bytes), average row: 200 bytes
Internal node: 8-byte key + 6-byte pointer = 14 bytes → ~1170 pointers per page
Leaf node: 200-byte row → ~80 rows per page

Capacity:

2-level B+Tree: 1170 × 80 = 93,600 rows
3-level B+Tree: 1170 × 1170 × 80 = 109 million rows
4-level B+Tree: 1170³ × 80 = 128 billion rows

For most business tables (< 100M rows), 3 disk I/O operations locate any row. With the root and second level cached in Buffer Pool, actual disk I/O may be just 1 operation.

class BPlusTreeNode:
    """B+Tree node"""
    def __init__(self, leaf=False):
        self.keys = []
        self.leaf = leaf
        if leaf:
            self.data = []
            self.next_leaf = None
            self.prev_leaf = None
        else:
            self.children = []


class BPlusTree:
    """Simplified B+Tree implementation"""
    def __init__(self, order):
        self.root = BPlusTreeNode(leaf=True)
        self.order = order

    def search(self, key):
        """Point lookup"""
        node = self.root
        while not node.leaf:
            i = self._find_position(node.keys, key)
            node = node.children[i]
        i = self._find_position(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.data[i]
        return None

    def range_search(self, low, high):
        """Range lookup"""
        node = self.root
        while not node.leaf:
            i = self._find_position(node.keys, low)
            node = node.children[i]

        results = []
        while node is not None:
            for i, key in enumerate(node.keys):
                if key > high:
                    return results
                if key >= low:
                    results.append((key, node.data[i]))
            node = node.next_leaf
        return results

    def insert(self, key, data):
        """Insert key-data pair"""
        node = self.root
        path = []
        while not node.leaf:
            i = self._find_position(node.keys, key)
            path.append((node, i))
            node = node.children[i]

        i = self._find_position(node.keys, key)
        node.keys.insert(i, key)
        node.data.insert(i, data)

        if len(node.keys) > self.order:
            self._split_leaf(node, path)

    def _split_leaf(self, leaf, path):
        """Split a leaf node"""
        mid = len(leaf.keys) // 2
        new_leaf = BPlusTreeNode(leaf=True)

        new_leaf.keys = leaf.keys[mid:]
        new_leaf.data = leaf.data[mid:]
        leaf.keys = leaf.keys[:mid]
        leaf.data = leaf.data[:mid]

        new_leaf.next_leaf = leaf.next_leaf
        if leaf.next_leaf:
            leaf.next_leaf.prev_leaf = new_leaf
        leaf.next_leaf = new_leaf
        new_leaf.prev_leaf = leaf

        promote_key = new_leaf.keys[0]
        self._insert_into_parent(leaf, promote_key, new_leaf, path)

    def _insert_into_parent(self, left, key, right, path):
        """Insert key into parent node"""
        if not path:
            new_root = BPlusTreeNode(leaf=False)
            new_root.keys = [key]
            new_root.children = [left, right]
            self.root = new_root
            return

        parent, idx = path.pop()
        parent.keys.insert(idx, key)
        parent.children.insert(idx + 1, right)

        if len(parent.keys) > self.order:
            self._split_internal(parent, path)

    def _split_internal(self, node, path):
        """Split an internal node"""
        mid = len(node.keys) // 2
        promote_key = node.keys[mid]

        new_node = BPlusTreeNode(leaf=False)
        new_node.keys = node.keys[mid + 1:]
        new_node.children = node.children[mid + 1:]
        node.keys = node.keys[:mid]
        node.children = node.children[:mid + 1]

        self._insert_into_parent(node, promote_key, new_node, path)

    def _find_position(self, keys, target):
        """Binary search for position"""
        lo, hi = 0, len(keys)
        while lo < hi:
            mid_idx = (lo + hi) // 2
            if keys[mid_idx] < target:
                lo = mid_idx + 1
            else:
                hi = mid_idx
        return lo

20.2.3 Insertion and Splitting

B+Tree insertion follows a "bottom-up split" strategy:

Locate leaf: follow root-to-leaf path
Insert into leaf: place key-data pair in sorted position
Check overflow: if leaf exceeds order, split
Propagate split: may propagate upward, worst case to root

Critical splitting rules:

Leaf split: middle key is copied up to parent (data stays in leaf)
Internal split: middle key is moved up to parent (internal nodes don't store data)

This distinction ensures all data is always findable in leaf nodes.

20.2.4 Deletion and Merging

Deletion is the inverse of insertion:

Locate and delete: remove target key from leaf
Check underflow: if keys fall below ⌈order/2⌉, handle it
Try borrowing: take a key from sibling (rotation)
Merge: if sibling is also minimal, merge two nodes into one

20.3 B+Tree and Disk I/O: The Deep Connection

20.3.1 Cost Model of Disk Access

Understanding why B+Tree is efficient requires understanding disk I/O costs:

Storage Hierarchy (typical 2024 values):
┌─────────────────────────┬──────────┬───────────┐
│ Storage Medium          │ Latency  │ Bandwidth │
├─────────────────────────┼──────────┼───────────┤
│ L1 Cache                │ 1ns      │ 1TB/s     │
│ L2 Cache                │ 4ns      │ 500GB/s   │
│ L3 Cache                │ 12ns     │ 200GB/s   │
│ DRAM                    │ 100ns    │ 50GB/s    │
│ NVMe SSD (random 4KB)  │ 100μs    │ 7GB/s     │
│ SATA SSD (random 4KB)  │ 200μs    │ 500MB/s   │
│ HDD (random 4KB)       │ 10ms     │ 200MB/s   │
└─────────────────────────┴──────────┴───────────┘

Key insights:

DRAM → SSD latency jump: 1000x
DRAM → HDD latency jump: 100,000x
But for sequential reads, the SSD/HDD bandwidth gap is only 10-35x

This explains two design principles:

Minimize random I/O count (B+Tree's low height)
Convert random I/O to sequential I/O (leaf linked list, LSM-Tree)

20.3.2 Why Node Size Equals Page Size

Operating systems perform disk I/O in "pages" (Linux default 4KB, databases typically 16KB for InnoDB or 8KB for PostgreSQL).

B+Tree nodes are sized equal to disk pages because:

One node = one I/O: regardless of keys in the node, cost is one disk read
Maximize each I/O: reading 16KB per operation, don't waste it on tiny nodes
Filesystem alignment: avoid nodes spanning two pages (which doubles I/O)

# InnoDB page capacity calculation
PAGE_SIZE = 16 * 1024  # 16KB

# Internal node fan-out
key_size = 8            # bigint primary key
pointer_size = 6        # InnoDB uses 6-byte page numbers
header_overhead = 120   # Page header/trailer fixed overhead

usable_space = PAGE_SIZE - header_overhead
keys_per_page = usable_space // (key_size + pointer_size)
# ~1170 keys, i.e., fan-out ~1170

# Leaf node record count
row_size = 200          # Assume average row 200 bytes
rows_per_leaf = usable_space // row_size
# ~80 rows

# Height vs capacity
def capacity(height, fan_out, rows_per_leaf):
    """Maximum rows for given B+Tree height"""
    return (fan_out ** (height - 1)) * rows_per_leaf

print(f"2 levels: {capacity(2, 1170, 80):,} rows")     # 93,600
print(f"3 levels: {capacity(3, 1170, 80):,} rows")     # 109,512,000
print(f"4 levels: {capacity(4, 1170, 80):,} rows")     # 128,129,040,000

20.4 Common Misconceptions

Misconception 1: "The B in B-Tree stands for Binary"

Wrong. The B is most commonly interpreted as "Balanced" or "Boeing" (Bayer and McCreight worked at Boeing Scientific Research Labs), or Bayer's initial. Bayer himself never gave an official explanation.

Misconception 2: "Red-black trees are worse than B-Trees"

This is context-dependent. Red-black trees are better in pure-memory scenarios—each comparison involves one key, and cache line utilization is higher. Linux kernel's process scheduler and Java's TreeMap use red-black trees. B-Tree/B+Tree's advantage is specifically for disk/large-data scenarios.

Misconception 3: "Hash indexes are completely useless"

MySQL's MEMORY engine and InnoDB's Adaptive Hash Index both use hashing. For equality lookups (WHERE id = 42), hash indexes with O(1) are indeed faster than B+Tree's O(log n). But hashing doesn't support range queries or sorting, so it can't serve as a general-purpose index structure.

Level 2 · How It Works Under the Hood

20.5 B+Tree in MySQL InnoDB

20.5.1 Clustered Index vs Secondary Index

InnoDB uses B+Tree to organize data, but with two distinct index types:

Clustered Index:

Each InnoDB table has exactly one clustered index
Leaf nodes store complete row data
Uses the primary key by default; if none exists, InnoDB picks a unique non-null index or generates a hidden ROW_ID
Physical storage order matches logical primary key order

Secondary Index:

A table can have multiple secondary indexes
Leaf nodes store indexed column values + primary key value (not full row)
Lookups via secondary index require a "bookmark lookup" (use PK to fetch from clustered index)

Clustered index structure (primary key id):
Root: [50, 100]
         ↓ (id < 50)
Leaf: [id=1, name="Alice", age=25] → [id=2, name="Bob", age=30] → ...

Secondary index structure (indexed column: name):
Root: ["John", "Mike"]
         ↓ (name < "John")
Leaf: [name="Alice", id=1] → [name="Bob", id=2] → ...
                     ↑ Only stores PK, needs bookmark lookup

Cost of bookmark lookups:

# Query: SELECT * FROM users WHERE name = 'Alice'
# With secondary index on name:

# Step 1: Search secondary index B+Tree for name='Alice'
#         → Get primary key id=1 (2-3 I/O)

# Step 2: Use id=1 to search clustered index B+Tree
#         → Get full row (1-2 I/O, root usually in memory)

# Total: 3-5 I/O

# With covering index (index contains all needed columns):
# SELECT id, name FROM users WHERE name = 'Alice'
# Secondary index leaf already has (name, id), no bookmark lookup needed
# Total: 2-3 I/O

20.5.2 Covering Indexes and Composite Indexes

A covering index contains all columns needed by a query, eliminating bookmark lookups:

-- Create composite index
CREATE INDEX idx_name_age ON users(name, age);

-- This query uses covering index (Extra: Using index)
SELECT name, age FROM users WHERE name = 'Alice';

-- This query needs bookmark lookup (email not in index)
SELECT name, age, email FROM users WHERE name = 'Alice';

Composite index B+Tree follows the leftmost prefix principle:

Composite index (name, age) B+Tree:
Leaf nodes sorted by (name, age):
[("Alice", 25, id=1)] → [("Alice", 30, id=5)] → [("Bob", 20, id=3)] → ...

Can use index:
✓ WHERE name = 'Alice'                    (leftmost prefix)
✓ WHERE name = 'Alice' AND age = 25       (full prefix)
✓ WHERE name = 'Alice' AND age > 20       (prefix + range)
✓ ORDER BY name, age                      (index order)

Cannot use index:
✗ WHERE age = 25                          (skips name)
✗ ORDER BY age, name                      (wrong order)

20.5.3 Physical Structure of an InnoDB Page

Internal layout of a 16KB InnoDB page:

┌──────────────────────────────────────────────┐
│ File Header (38 bytes)                        │ ← Page number, checksum, prev/next page
├──────────────────────────────────────────────┤
│ Page Header (56 bytes)                        │ ← Record count, free space pointer
├──────────────────────────────────────────────┤
│ Infimum & Supremum Records (26 bytes)         │ ← Virtual min/max records
├──────────────────────────────────────────────┤
│ User Records                                  │ ← Actual data rows (singly linked list)
│ ┌─────────────────────────────────────────┐  │
│ │ Record 1 → Record 2 → Record 3 → ...   │  │
│ └─────────────────────────────────────────┘  │
├──────────────────────────────────────────────┤
│ Free Space                                    │ ← Unused space
├──────────────────────────────────────────────┤
│ Page Directory (variable)                     │ ← Sparse directory for in-page search
├──────────────────────────────────────────────┤
│ File Trailer (8 bytes)                        │ ← Checksum (consistency with header)
└──────────────────────────────────────────────┘

In-page search uses the Page Directory (sparse index), grouping records into slots of 4-8 records. Lookup binary-searches the directory to find the group, then sequentially scans within the group.

20.6 LSM-Tree: Write-Optimized Data Structure

20.6.1 B+Tree's Write Amplification Problem

B+Tree excels at reads, but writes have a fatal flaw—write amplification:

Modifying one row requires:
- Reading the 16KB page from disk
- Modifying a few dozen bytes
- Writing the entire 16KB page back
If node splitting occurs, multiple pages must be written
With WAL (Write-Ahead Log), one modification's actual write volume can be 10-30x the original data

When writes dominate (log collection, IoT sensors, time-series databases), B+Tree's write amplification severely limits throughput.

20.6.2 LSM-Tree Core Concept

Patrick O'Neil's 1996 paper "The Log-Structured Merge-Tree (LSM-Tree)" presented a key insight:

If sequential disk writes are 100x faster than random writes, just make all writes sequential.

LSM-Tree write path:

Write request → MemTable (in-memory sorted structure) → Full → Flush as SSTable (sorted disk file)
                                                                        ↓
                                                                Background Compaction (merge SSTables)

Detailed steps:

Write to MemTable: New data goes to an in-memory sorted structure (usually red-black tree or skip list), with WAL for durability
MemTable becomes Immutable: When MemTable reaches threshold (e.g., 64MB), freeze it
Flush to SSTable: Immutable MemTable is sequentially written to disk as an SSTable (Sorted String Table)
Compaction: Background threads periodically merge SSTables, eliminating duplicate keys

import bisect
import time
import os


class MemTable:
    """In-memory sorted table (simplified as sorted list)"""
    def __init__(self, max_size=1000):
        self.entries = []  # [(key, value, timestamp)]
        self.max_size = max_size
        self.size = 0

    def put(self, key, value):
        """Write/update"""
        entry = (key, value, time.time())
        idx = bisect.bisect_left(self.entries, (key,))
        if idx < len(self.entries) and self.entries[idx][0] == key:
            self.entries[idx] = entry
        else:
            self.entries.insert(idx, entry)
            self.size += 1

    def get(self, key):
        """Lookup"""
        idx = bisect.bisect_left(self.entries, (key,))
        if idx < len(self.entries) and self.entries[idx][0] == key:
            return self.entries[idx][1]
        return None

    def is_full(self):
        return self.size >= self.max_size

    def flush_to_sstable(self, filename):
        """Sequential write to disk, producing an SSTable"""
        with open(filename, 'w') as f:
            for key, value, ts in self.entries:
                f.write(f"{key}\t{value}\t{ts}\n")
        return SSTable(filename)


class SSTable:
    """Sorted file on disk (simplified)"""
    def __init__(self, filename):
        self.filename = filename
        self.index = self._build_sparse_index()

    def _build_sparse_index(self):
        """Build sparse index (record position every N entries)"""
        index = {}
        with open(self.filename, 'r') as f:
            offset = 0
            for line in f:
                key = line.split('\t')[0]
                if len(index) == 0 or hash(key) % 16 == 0:
                    index[key] = offset
                offset += len(line)
        return index

    def get(self, key):
        """Lookup in SSTable (requires I/O)"""
        with open(self.filename, 'r') as f:
            for line in f:
                parts = line.strip().split('\t')
                if parts[0] == key:
                    return parts[1]
                if parts[0] > key:
                    break
        return None


class LSMTree:
    """Simplified LSM-Tree implementation"""
    def __init__(self, data_dir="lsm_data"):
        self.memtable = MemTable()
        self.immutable_memtables = []
        self.sstables = []  # On-disk SSTables (newest first)
        self.data_dir = data_dir
        self.sstable_counter = 0

    def put(self, key, value):
        """Write"""
        self.memtable.put(key, value)
        if self.memtable.is_full():
            self._flush()

    def get(self, key):
        """Read: check memory first, then disk"""
        # 1. Current MemTable
        result = self.memtable.get(key)
        if result is not None:
            return result

        # 2. Immutable MemTables (newest first)
        for imm in reversed(self.immutable_memtables):
            result = imm.get(key)
            if result is not None:
                return result

        # 3. SSTables (newest first)
        for sst in self.sstables:
            result = sst.get(key)
            if result is not None:
                return result

        return None

    def _flush(self):
        """Flush MemTable to SSTable"""
        self.immutable_memtables.append(self.memtable)
        self.memtable = MemTable()

        imm = self.immutable_memtables.pop(0)
        filename = f"{self.data_dir}/sst_{self.sstable_counter:06d}.dat"
        self.sstable_counter += 1
        sst = imm.flush_to_sstable(filename)
        self.sstables.insert(0, sst)

    def compact(self):
        """Merge SSTables (simplified Level Compaction)"""
        if len(self.sstables) < 4:
            return
        sst1 = self.sstables.pop()
        sst2 = self.sstables.pop()
        merged = self._merge_sstables(sst1, sst2)
        self.sstables.append(merged)

    def _merge_sstables(self, sst1, sst2):
        """Merge two SSTables"""
        filename = f"{self.data_dir}/sst_{self.sstable_counter:06d}.dat"
        self.sstable_counter += 1

        entries = {}
        for sst in [sst2, sst1]:  # sst1 is newer, overwrites
            with open(sst.filename, 'r') as f:
                for line in f:
                    parts = line.strip().split('\t')
                    entries[parts[0]] = (parts[1], parts[2])

        with open(filename, 'w') as f:
            for key in sorted(entries.keys()):
                value, ts = entries[key]
                f.write(f"{key}\t{value}\t{ts}\n")

        os.remove(sst1.filename)
        os.remove(sst2.filename)
        return SSTable(filename)

20.6.3 LevelDB / RocksDB Leveled Compaction

Google's LevelDB (designed by Jeff Dean and Sanjay Ghemawat) organizes SSTables into multiple levels:

Level 0: SSTable_a, SSTable_b, SSTable_c    ← Direct flush from MemTable
         (key ranges may overlap)

Level 1: SSTable_1, SSTable_2, SSTable_3    ← No key overlap within level
         [0-100]     [101-200]   [201-300]

Level 2: SSTable_A, SSTable_B, ..., SSTable_F
         [0-50]     [51-100]    ...  [251-300]

Level 3: ... (each level is 10x the capacity of the previous)

Level Compaction rules:

Level 0 files may have overlapping key ranges (direct flushes)
Level 1+ files have non-overlapping key ranges within each level
When a level exceeds its size threshold, pick a file and merge with overlapping files in the next level
Each level is 10x larger than the previous

Read amplification vs write amplification tradeoff:

LSM-Tree cost model:
┌────────────┬──────────────────────────────────────────┐
│ Operation  │ Cost                                      │
├────────────┼──────────────────────────────────────────┤
│ Write      │ O(1) memory write + O(1) WAL seq write   │
│            │ Write amp ≈ 10-30x (compaction cost)      │
├────────────┼──────────────────────────────────────────┤
│ Point read │ Worst case: check every level O(levels)   │
│            │ Optimization: Bloom Filter skips files     │
├────────────┼──────────────────────────────────────────┤
│ Range read │ Merge results from multiple levels        │
└────────────┴──────────────────────────────────────────┘

B+Tree cost model:
┌────────────┬──────────────────────────────────────────┐
│ Operation  │ Cost                                      │
├────────────┼──────────────────────────────────────────┤
│ Write      │ O(log n) random I/O + WAL                 │
│            │ Write amp ≈ 10-30x (page-granularity)     │
├────────────┼──────────────────────────────────────────┤
│ Point read │ O(log n) ≈ 2-4 I/O                       │
├────────────┼──────────────────────────────────────────┤
│ Range read │ Locate + sequential leaf scan             │
└────────────┴──────────────────────────────────────────┘

20.6.4 Bloom Filter Optimization for Reads

LSM-Tree's read disadvantage is searching multiple levels. Bloom Filters quickly determine "a key is definitely NOT in this SSTable," skipping unnecessary disk reads.

import mmh3  # MurmurHash3


class BloomFilter:
    """Bloom filter"""
    def __init__(self, expected_items, false_positive_rate=0.01):
        import math
        self.size = int(-expected_items * math.log(false_positive_rate)
                        / (math.log(2) ** 2))
        self.num_hashes = int((self.size / expected_items) * math.log(2))
        self.bit_array = [0] * self.size

    def add(self, key):
        """Add key"""
        for i in range(self.num_hashes):
            idx = mmh3.hash(key, i) % self.size
            self.bit_array[idx] = 1

    def might_contain(self, key):
        """Query (may have false positives, never false negatives)"""
        for i in range(self.num_hashes):
            idx = mmh3.hash(key, i) % self.size
            if self.bit_array[idx] == 0:
                return False  # Definitely not present
        return True  # Possibly present


class SSTableWithBloom(SSTable):
    """SSTable with Bloom Filter"""
    def __init__(self, filename, bloom_filter):
        super().__init__(filename)
        self.bloom = bloom_filter

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None  # Skip disk I/O
        return super().get(key)

RocksDB attaches a Bloom Filter to each SSTable (default 10 bits/key, ~1% false positive rate). This means 99% of miss queries complete with zero disk I/O.

20.7 Redis Skip List: Why Not B+Tree

Redis Sorted Sets (ZSET) use skip lists instead of B+Tree. Antirez (Redis creator Salvatore Sanfilippo) gave these reasons:

Pure in-memory: All Redis data is in memory, B+Tree's disk I/O advantage disappears
Implementation simplicity: Skip list code is ~1/3 the size of B+Tree, easier to maintain
Efficient range queries: Skip list's forward pointers naturally support range scans
Concurrency-friendly: Skip lists allow fine-grained locking; B+Tree splits require locking entire paths
Memory flexibility: Skip list nodes can be scattered; no need for pre-allocated contiguous blocks

import random


class SkipListNode:
    """Skip list node"""
    def __init__(self, key, value, level):
        self.key = key
        self.value = value
        self.forward = [None] * (level + 1)
        self.span = [0] * (level + 1)


class SkipList:
    """Skip list implementation (mimicking Redis ZSET)"""
    MAX_LEVEL = 32
    P = 0.25  # Redis uses 1/4 promotion probability

    def __init__(self):
        self.header = SkipListNode(None, None, self.MAX_LEVEL)
        self.level = 0
        self.length = 0

    def _random_level(self):
        """Randomly determine new node's level"""
        level = 0
        while random.random() < self.P and level < self.MAX_LEVEL:
            level += 1
        return level

    def insert(self, key, value):
        """Insert element"""
        update = [None] * (self.MAX_LEVEL + 1)
        current = self.header

        for i in range(self.level, -1, -1):
            while (current.forward[i] is not None and
                   current.forward[i].key < key):
                current = current.forward[i]
            update[i] = current

        new_level = self._random_level()
        if new_level > self.level:
            for i in range(self.level + 1, new_level + 1):
                update[i] = self.header
            self.level = new_level

        new_node = SkipListNode(key, value, new_level)
        for i in range(new_level + 1):
            new_node.forward[i] = update[i].forward[i]
            update[i].forward[i] = new_node

        self.length += 1

    def search(self, key):
        """Search"""
        current = self.header
        for i in range(self.level, -1, -1):
            while (current.forward[i] is not None and
                   current.forward[i].key < key):
                current = current.forward[i]
        current = current.forward[0]
        if current is not None and current.key == key:
            return current.value
        return None

    def range_by_score(self, min_score, max_score):
        """Range query (simulating ZRANGEBYSCORE)"""
        current = self.header
        for i in range(self.level, -1, -1):
            while (current.forward[i] is not None and
                   current.forward[i].key < min_score):
                current = current.forward[i]
        current = current.forward[0]

        results = []
        while current is not None and current.key <= max_score:
            results.append((current.key, current.value))
            current = current.forward[0]
        return results

Skip List vs B+Tree comparison:

Feature	Skip List	B+Tree
Best scenario	Pure memory	Disk storage
Implementation	Simple	Complex
Lookup time	O(log n) expected	O(log_m n) deterministic
Range query	Bottom-level linked list	Leaf linked list
Space overhead	Avg 1.33 pointers/node (P=0.25)	Half-full guarantee
Concurrency	Lock-free/fine-grained	Needs latch coupling
Disk I/O	Poor (scattered nodes)	Excellent (node=page)

Level 3 · What the Specifications Define

20.8 History and Design Decisions of B-Tree

20.8.1 Bayer & McCreight 1972

Rudolf Bayer and Edward McCreight published "Organization and Maintenance of Large Ordered Indexes" in 1972 (Acta Informatica), first proposing the B-Tree. They were working at Boeing Scientific Research Labs at the time.

Core contributions of the paper:

Defined B-Tree structure and properties
Proved all operations (search, insert, delete) are O(log n)
Analyzed upper bounds on disk I/O
Presented algorithms for node splitting and merging

Why called "B"-Tree? Bayer never gave a definitive explanation. Common interpretations include Boeing, Balanced, Bayer, and Broad. Douglas Comer wrote in his 1979 survey "The Ubiquitous B-Tree": "The origin of 'B-tree' has never been explained by the authors."

Historical context: In 1972, storage technology meant drums and disks with millisecond access times. Main memory (core memory) was measured in KB. The core constraint when designing B-Tree was: how to maintain large ordered indexes on extremely slow external storage? This question remains relevant 50+ years later—though SSDs are 100x faster than HDDs, they're still 1000x slower than DRAM.

20.8.2 Evolution of B+Tree

B+Tree has no single "inventor" paper. It evolved through database system practice:

1973: Knuth discussed B-Tree variants in "The Art of Computer Programming" Vol. 3
1977: IBM's System R project (origin of SQL) used B+Tree-like structures
1979: Douglas Comer's survey systematically summarized B-Tree variants including B+Tree

Two key design decisions of B+Tree over B-Tree:

Decision 1: Data only in leaves

Rationale: No data in internal nodes → smaller → more keys per page → larger fan-out → shorter tree → fewer I/O
Cost: Point lookups always traverse to leaf (B-Tree might find it earlier)
Tradeoff: For database workloads, reducing one level outweighs occasional saved I/O

Decision 2: Leaf linked list

Rationale: Range queries need only "locate + sequential scan," no backtracking to parent
Cost: Extra pointer space (typically negligible)
Implementation: Modern systems use doubly-linked lists for both ASC and DESC scans

20.8.3 O'Neil 1996 LSM-Tree Paper

Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil published "The Log-Structured Merge-Tree (LSM-Tree)" in 1996 (Acta Informatica).

Core arguments:

For write-intensive workloads, B-Tree's random write efficiency is extremely poor
Converting all writes to sequential (write to memory first, batch flush) dramatically improves write throughput
Trading some read degradation (checking multiple components) for orders-of-magnitude write improvement

O'Neil's original LSM-Tree had only two components (C₀ in memory, C₁ on disk). Google's BigTable paper (2006) later generalized LSM-Tree to multi-level structures, and LevelDB/RocksDB refined leveled compaction strategies.

Theoretical analysis:

Assume:

B = disk block size
M = size of C₀ in memory
Sequential write bandwidth = W_seq
Random write latency = T_rand

B+Tree write throughput: W_btree = 1 / T_rand (one random I/O per write) LSM-Tree write throughput: W_lsm ≈ W_seq / (write_amplification)

When W_seq >> 1/T_rand (always true on HDD), LSM-Tree write throughput far exceeds B+Tree. On SSDs, the sequential/random gap narrows, but LSM-Tree's write amplification consumes SSD P/E cycles—a new dimension of tradeoff.

20.8.4 Mathematical Relationship Between Disk I/O and B-Tree

Let:

n = total records
m = B-Tree order (max children)
h = tree height

Precise height bounds:

h ≤ log_{⌈m/2⌉}((n+1)/2) + 1   (max height, nodes half-full)
h ≥ log_m(n+1)                    (min height, nodes full)

For InnoDB typical parameters (m ≈ 1170, n = 100 million):

Min height: log₁₁₇₀(100,000,001) ≈ 2.6 → 3 levels
Max height: log₅₈₅(50,000,001) ≈ 2.78 → 3 levels

Regardless of fill factor, 3 levels suffice. This is why DBAs say "InnoDB's B+Tree is almost always 3 levels tall."

Buffer Pool impact:

InnoDB's Buffer Pool (default 128MB, production typically 50-80% of physical memory) caches hot pages:

Root node: nearly always in Buffer Pool (100% hit rate)
Second-level internal nodes: 1170 pages × 16KB = 18.3MB, almost all in memory
Third-level leaf nodes: 1170² ≈ 1.37M pages × 16KB ≈ 21.4GB, cached by access frequency

Conclusion: for hot data, actual disk I/O may be only 0-1 operations.

Level 4 · Edge Cases and Pitfalls

20.9 Connections to Other Books on This Site

This chapter relates closely to:

High Performance MySQL - Indexing — Practical InnoDB B+Tree index techniques: choosing index columns, analyzing EXPLAIN output, common index invalidation scenarios
High Performance MySQL - Architecture — InnoDB storage engine architecture, including Buffer Pool, Change Buffer, Redo Log and their B+Tree interactions
Redis Deep Dive — Skip list implementation in Redis ZSET, and other data structure choices

20.10 How to Answer "Why Does MySQL Use B+Tree" in Interviews

This is one of the most frequently asked backend interview questions. Here's a structured answer framework:

Layer 1: Disk I/O constraint "Database data lives on disk, where random I/O is slow (SSD ~100μs, HDD ~10ms). We need a data structure with extremely low height so each lookup requires minimal disk I/O."

Layer 2: Why not red-black tree/AVL "Red-black trees are binary—100M records need ~27 levels. Each level means one I/O, so 27 disk accesses per lookup is unacceptable. B+Tree nodes can have thousands of branches; 100M records need only 3 levels."

Layer 3: Why not hash table "Hash tables don't support range queries (BETWEEN), sorting (ORDER BY), or leftmost prefix matching. Databases need these operations. Though InnoDB has an Adaptive Hash Index as supplement."

Layer 4: Why B+Tree over B-Tree "B+Tree internal nodes store no data, so each page holds more keys, making the tree shorter. Plus leaf nodes form a linked list—range queries just locate the start and scan sequentially."

Bonus: Mention Buffer Pool "In practice, because of the Buffer Pool, the root and upper internal nodes are almost always in memory. Actual disk I/O is typically 0-1 operations."

20.11 B+Tree vs LSM-Tree Selection Guide

Dimension	B+Tree	LSM-Tree
Read/write ratio	Read-heavy	Write-heavy
Latency profile	Stable (2-4 I/O each time)	Low write latency, occasional read spikes
Space amplification	Low (in-page fragmentation)	Medium (multi-version data)
Write amplification	Medium (page-granularity)	High (compaction rewrites)
Read amplification	Low (direct lookup)	High (multi-level search)
Typical systems	MySQL, PostgreSQL, Oracle	RocksDB, Cassandra, HBase
Use cases	OLTP, transactions	Logs, time-series, write-intensive

Practical selection advice:

Traditional OLTP (e-commerce, social, payments): 99% of the time choose MySQL/PostgreSQL (B+Tree)
Log analytics, monitoring metrics, IoT data: consider ClickHouse/TDengine (columnar + LSM variants)
High write throughput KV store: consider RocksDB (LSM-Tree)
TiDB's storage layer TiKV uses RocksDB (LSM-Tree) but provides MySQL-compatible SQL interface above—this is "LSM-Tree implementing B+Tree semantics" as an engineering solution

20.12 Common Interview Questions and Traps

Q: Are B+Tree leaf nodes a singly or doubly linked list? A: InnoDB uses a doubly linked list. File Header contains FIL_PAGE_PREV and FIL_PAGE_NEXT fields. This supports both forward scans (ASC) and backward scans (DESC).

Q: How much data can one InnoDB B+Tree store? A: Assuming bigint PK (8B), 1KB row size:

Internal fan-out ≈ 16KB / (8+6)B ≈ 1170
Rows per leaf ≈ 16KB / 1KB ≈ 16
3 levels: 1170 × 16 = 18,720 rows
4 levels: 1170² × 16 ≈ 21.9M rows
But with 100B rows, 160 rows/leaf, 3 levels store 1170 × 160 ≈ 187M rows

Q: Why did MongoDB previously use B-Tree instead of B+Tree? A: MongoDB's WiredTiger engine actually uses a B+Tree variant. The earlier MMAPV1 engine used a B-Tree-like structure because it directly memory-mapped files. Since MongoDB 4.0, WiredTiger is the default—essentially B+Tree.

Q: What problems does LSM-Tree's Compaction cause? A:

Write amplification: Data is repeatedly read and rewritten, consuming SSD lifespan
Space amplification: During compaction, both old and new copies exist simultaneously
Latency spikes: Large compactions consume I/O bandwidth, affecting foreground reads/writes

Q: How does RocksDB solve LSM-Tree's read performance issues? A:

Bloom Filter: Each SSTable has a Bloom Filter, skipping files without the target key
Block Cache: Caches hot data blocks
Prefix Bloom Filter: Bloom Filter on key prefixes for prefix queries
Compression: Reduces actual I/O volume (LZ4/Snappy/ZSTD)

Chapter Summary

This chapter addressed the core problem of "efficiently organizing ordered data on external storage," introducing three key data structures:

B-Tree: Increases node width to reduce height, bringing disk I/O from O(log₂n) down to O(log_m n)
B+Tree: Builds on B-Tree by storing data only in leaves and linking them—becoming the database index standard
LSM-Tree: Converts random writes to sequential writes, trading read performance for write throughput in write-intensive scenarios

They are not "replacement" relationships but optimal solutions for different workloads. Understanding their design tradeoffs is the watershed between "using a database" and "understanding a database."

Rate this chapter

4.8 / 5 (10 ratings)