Chapter 20

B-Tree, B+Tree and LSM-Tree

Chapter 20: B-Tree, B+Tree and LSM-Tree

Open a MySQL client and run EXPLAIN SELECT * FROM users WHERE id = 42. You will see const in the type column—meaning InnoDB located the row with a single B+Tree lookup. Behind this "single lookup" lies a multi-way search tree with a height of just 3-4 levels, where each node occupies exactly one 16KB disk page.

Why not a binary search tree? Why not a hash table? Why not a red-black tree? The answers to all these questions point to the same fundamental tension: a 100,000x gap between memory access speed and disk I/O speed. A random memory access takes about 100ns; a random SSD read takes about 100μs; a random HDD seek takes about 10ms. When data exceeds memory capacity, any data structure design must prioritize "minimizing disk I/O operations" above all else.

When Rudolf Bayer and Edward McCreight invented the B-Tree in 1972, they targeted precisely this problem. Their paper title was direct and precise: "Organization and Maintenance of Large Ordered Indexes." By increasing the "width" of each node (storing multiple keys and child pointers), a B-Tree compresses tree height to log_m(n) levels (where m is the branching factor, typically in the hundreds to thousands), limiting the disk I/O needed to find any record to 3-4 operations.

The B+Tree is a variant of the B-Tree where all data resides only in leaf nodes, and leaves are linked by a doubly-linked list—making range queries extremely efficient. MySQL InnoDB, PostgreSQL, and Oracle all use B+Tree for their index structures.

But B+Tree is not a silver bullet. When write load far exceeds reads (log systems, time-series databases), B+Tree's random write amplification becomes a bottleneck. Patrick O'Neil's 1996 LSM-Tree (Log-Structured Merge-Tree) converts random writes to sequential writes through a "write to memory → batch flush to disk → background merge" strategy, achieving orders-of-magnitude improvement in write throughput. LevelDB, RocksDB, Cassandra, and HBase are all based on LSM-Tree.

This chapter starts from the basic definition of B-Tree, dives deep into B+Tree's practical application in databases, then covers LSM-Tree's write-optimized design, and finally discusses why Redis chose skip lists over B+Tree. You will see that choosing a data structure is never about "which is better," but about "which costs less given the hardware constraints and access patterns."


Level 1 · What You Need to Know

20.1 B-Tree: Multi-way Search Tree

20.1.1 From Binary to Multi-way

In previous chapters we studied binary search trees (BST) and balanced BSTs (AVL, red-black trees). Their core property: each node has at most two children, and search takes O(log₂n) time.

For 1 million records, a binary tree has height ≈ log₂(1,000,000) ≈ 20. If the tree fits entirely in memory, 20 comparisons are trivial. But if each node resides on disk, that means 20 disk I/O operations—at 10ms per HDD seek, one lookup takes 200ms. Unacceptable.

The solution is to increase each node's "branching factor." If each node can store 1000 keys and 1001 child pointers, then the same 1 million records produce a tree of height log₁₀₀₁(1,000,000) ≈ 2. With the root node typically cached in memory, only 1-2 disk I/O operations are needed to find any record.

This is the core idea of B-Tree: trade node width for tree height.

20.1.2 Formal Definition of B-Tree

An order-m B-Tree (m is the maximum number of children per node) satisfies:

  1. Each node has at most m children
  2. Non-root nodes have at least ⌈m/2⌉ children (guaranteeing at least half-full)
  3. Root has at least 2 children (unless it is a leaf)
  4. A node with k children contains k-1 keys
  5. All leaves are at the same level (perfectly balanced)
  6. Keys within a node are sorted
class BTreeNode:
    """B-Tree node"""
    def __init__(self, leaf=False):
        self.keys = []       # Sorted list of keys
        self.children = []   # Child pointers
        self.leaf = leaf     # Whether this is a leaf


class BTree:
    """B-Tree implementation (order m)"""
    def __init__(self, m):
        self.root = BTreeNode(leaf=True)
        self.m = m  # Maximum children per node
        self.max_keys = m - 1
        self.min_keys = (m + 1) // 2 - 1

    def search(self, node, key):
        """Search for key in B-Tree"""
        i = 0
        # Find position within current node
        while i < len(node.keys) and key > node.keys[i]:
            i += 1

        # Found
        if i < len(node.keys) and key == node.keys[i]:
            return (node, i)

        # Not found and at leaf
        if node.leaf:
            return None

        # Recurse to child (this corresponds to one disk I/O)
        return self.search(node.children[i], key)

    def insert(self, key):
        """Insert a key"""
        root = self.root
        if len(root.keys) == self.max_keys:
            new_root = BTreeNode(leaf=False)
            new_root.children.append(self.root)
            self._split_child(new_root, 0)
            self.root = new_root
        self._insert_non_full(self.root, key)

    def _insert_non_full(self, node, key):
        """Insert into a non-full node"""
        i = len(node.keys) - 1

        if node.leaf:
            node.keys.append(None)
            while i >= 0 and key < node.keys[i]:
                node.keys[i + 1] = node.keys[i]
                i -= 1
            node.keys[i + 1] = key
        else:
            while i >= 0 and key < node.keys[i]:
                i -= 1
            i += 1
            if len(node.children[i].keys) == self.max_keys:
                self._split_child(node, i)
                if key > node.keys[i]:
                    i += 1
            self._insert_non_full(node.children[i], key)

    def _split_child(self, parent, idx):
        """Split parent's idx-th child"""
        child = parent.children[idx]
        mid = len(child.keys) // 2

        new_node = BTreeNode(leaf=child.leaf)
        new_node.keys = child.keys[mid + 1:]

        if not child.leaf:
            new_node.children = child.children[mid + 1:]
            child.children = child.children[:mid + 1]

        promote_key = child.keys[mid]
        child.keys = child.keys[:mid]

        parent.keys.insert(idx, promote_key)
        parent.children.insert(idx + 1, new_node)

Consider an order-5 B-Tree storing keys [1, 2, ..., 20]:

                    [8]
                   /    \
          [3, 5]          [12, 15, 18]
         /  |  \         /   |    |   \
    [1,2] [4] [6,7]  [9,10,11] [13,14] [16,17] [19,20]

Searching for key=14:

  1. Root [8]: 14 > 8, go right (1 disk I/O)
  2. Internal [12, 15, 18]: 12 < 14 < 15, follow second pointer (1 disk I/O)
  3. Leaf [13, 14]: found 14 (1 disk I/O)

Total: 3 disk I/O operations.

20.1.4 Why Not Red-Black Trees or Hash Tables?

This is a classic interview question. Let's analyze from first principles:

Data Structure Lookup I/O Range Query Ordering Disk-friendly
Red-black tree O(log₂n) ≈ 20 Requires inorder traversal Ordered Poor (scattered nodes)
Hash table O(1) average Not supported Unordered Poor (resize costly)
B+Tree O(log_m n) ≈ 3 Leaf linked list scan Ordered Excellent (node=page)

Problems with red-black trees:

Problems with hash tables:

Why B+Tree wins:

20.2 B+Tree: The Foundation of Database Indexes

20.2.1 Key Differences Between B+Tree and B-Tree

B+Tree is a B-Tree variant with two critical improvements:

  1. All data resides only in leaf nodes: internal nodes store only keys (as signposts)
  2. Leaf nodes are linked in a doubly-linked list: enabling efficient range scans
B-Tree (data distributed across all nodes):
        [10|data, 20|data]
       /        |         \
  [5|data]  [15|data]  [25|data, 30|data]

B+Tree (data only in leaves):
           [10, 20]              ← Internal: keys only, no data
          /    |    \
    [5|d]→[10|d, 15|d]→[20|d, 25|d, 30|d]  ← Leaves: data + linked list

Why "data only in leaves" is better:

Internal nodes without data means:

Why the "leaf linked list" matters:

# Range query: SELECT * FROM users WHERE age BETWEEN 20 AND 30
# B+Tree: locate leaf with age=20, then scan sequentially until age>30
# B-Tree: requires inorder traversal jumping between levels

def range_query_bplus(root, low, high):
    """B+Tree range query"""
    # Step 1: locate starting leaf (O(log n) I/O)
    leaf = find_leaf(root, low)

    results = []
    # Step 2: sequential scan along leaf linked list (sequential I/O)
    while leaf is not None:
        for key, data in leaf.entries:
            if key > high:
                return results
            if key >= low:
                results.append(data)
        leaf = leaf.next_leaf
    return results

20.2.2 Complete B+Tree Search Process

Let's simulate an InnoDB B+Tree lookup with concrete numbers:

Capacity:

For most business tables (< 100M rows), 3 disk I/O operations locate any row. With the root and second level cached in Buffer Pool, actual disk I/O may be just 1 operation.

class BPlusTreeNode:
    """B+Tree node"""
    def __init__(self, leaf=False):
        self.keys = []
        self.leaf = leaf
        if leaf:
            self.data = []
            self.next_leaf = None
            self.prev_leaf = None
        else:
            self.children = []


class BPlusTree:
    """Simplified B+Tree implementation"""
    def __init__(self, order):
        self.root = BPlusTreeNode(leaf=True)
        self.order = order

    def search(self, key):
        """Point lookup"""
        node = self.root
        while not node.leaf:
            i = self._find_position(node.keys, key)
            node = node.children[i]
        i = self._find_position(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.data[i]
        return None

    def range_search(self, low, high):
        """Range lookup"""
        node = self.root
        while not node.leaf:
            i = self._find_position(node.keys, low)
            node = node.children[i]

        results = []
        while node is not None:
            for i, key in enumerate(node.keys):
                if key > high:
                    return results
                if key >= low:
                    results.append((key, node.data[i]))
            node = node.next_leaf
        return results

    def insert(self, key, data):
        """Insert key-data pair"""
        node = self.root
        path = []
        while not node.leaf:
            i = self._find_position(node.keys, key)
            path.append((node, i))
            node = node.children[i]

        i = self._find_position(node.keys, key)
        node.keys.insert(i, key)
        node.data.insert(i, data)

        if len(node.keys) > self.order:
            self._split_leaf(node, path)

    def _split_leaf(self, leaf, path):
        """Split a leaf node"""
        mid = len(leaf.keys) // 2
        new_leaf = BPlusTreeNode(leaf=True)

        new_leaf.keys = leaf.keys[mid:]
        new_leaf.data = leaf.data[mid:]
        leaf.keys = leaf.keys[:mid]
        leaf.data = leaf.data[:mid]

        new_leaf.next_leaf = leaf.next_leaf
        if leaf.next_leaf:
            leaf.next_leaf.prev_leaf = new_leaf
        leaf.next_leaf = new_leaf
        new_leaf.prev_leaf = leaf

        promote_key = new_leaf.keys[0]
        self._insert_into_parent(leaf, promote_key, new_leaf, path)

    def _insert_into_parent(self, left, key, right, path):
        """Insert key into parent node"""
        if not path:
            new_root = BPlusTreeNode(leaf=False)
            new_root.keys = [key]
            new_root.children = [left, right]
            self.root = new_root
            return

        parent, idx = path.pop()
        parent.keys.insert(idx, key)
        parent.children.insert(idx + 1, right)

        if len(parent.keys) > self.order:
            self._split_internal(parent, path)

    def _split_internal(self, node, path):
        """Split an internal node"""
        mid = len(node.keys) // 2
        promote_key = node.keys[mid]

        new_node = BPlusTreeNode(leaf=False)
        new_node.keys = node.keys[mid + 1:]
        new_node.children = node.children[mid + 1:]
        node.keys = node.keys[:mid]
        node.children = node.children[:mid + 1]

        self._insert_into_parent(node, promote_key, new_node, path)

    def _find_position(self, keys, target):
        """Binary search for position"""
        lo, hi = 0, len(keys)
        while lo < hi:
            mid_idx = (lo + hi) // 2
            if keys[mid_idx] < target:
                lo = mid_idx + 1
            else:
                hi = mid_idx
        return lo

20.2.3 Insertion and Splitting

B+Tree insertion follows a "bottom-up split" strategy:

  1. Locate leaf: follow root-to-leaf path
  2. Insert into leaf: place key-data pair in sorted position
  3. Check overflow: if leaf exceeds order, split
  4. Propagate split: may propagate upward, worst case to root

Critical splitting rules:

This distinction ensures all data is always findable in leaf nodes.

20.2.4 Deletion and Merging

Deletion is the inverse of insertion:

  1. Locate and delete: remove target key from leaf
  2. Check underflow: if keys fall below ⌈order/2⌉, handle it
  3. Try borrowing: take a key from sibling (rotation)
  4. Merge: if sibling is also minimal, merge two nodes into one

20.3 B+Tree and Disk I/O: The Deep Connection

20.3.1 Cost Model of Disk Access

Understanding why B+Tree is efficient requires understanding disk I/O costs:

Storage Hierarchy (typical 2024 values):
┌─────────────────────────┬──────────┬───────────┐
│ Storage Medium          │ Latency  │ Bandwidth │
├─────────────────────────┼──────────┼───────────┤
│ L1 Cache                │ 1ns      │ 1TB/s     │
│ L2 Cache                │ 4ns      │ 500GB/s   │
│ L3 Cache                │ 12ns     │ 200GB/s   │
│ DRAM                    │ 100ns    │ 50GB/s    │
│ NVMe SSD (random 4KB)  │ 100μs    │ 7GB/s     │
│ SATA SSD (random 4KB)  │ 200μs    │ 500MB/s   │
│ HDD (random 4KB)       │ 10ms     │ 200MB/s   │
└─────────────────────────┴──────────┴───────────┘

Key insights:

This explains two design principles:

  1. Minimize random I/O count (B+Tree's low height)
  2. Convert random I/O to sequential I/O (leaf linked list, LSM-Tree)

20.3.2 Why Node Size Equals Page Size

Operating systems perform disk I/O in "pages" (Linux default 4KB, databases typically 16KB for InnoDB or 8KB for PostgreSQL).

B+Tree nodes are sized equal to disk pages because:

# InnoDB page capacity calculation
PAGE_SIZE = 16 * 1024  # 16KB

# Internal node fan-out
key_size = 8            # bigint primary key
pointer_size = 6        # InnoDB uses 6-byte page numbers
header_overhead = 120   # Page header/trailer fixed overhead

usable_space = PAGE_SIZE - header_overhead
keys_per_page = usable_space // (key_size + pointer_size)
# ~1170 keys, i.e., fan-out ~1170

# Leaf node record count
row_size = 200          # Assume average row 200 bytes
rows_per_leaf = usable_space // row_size
# ~80 rows

# Height vs capacity
def capacity(height, fan_out, rows_per_leaf):
    """Maximum rows for given B+Tree height"""
    return (fan_out ** (height - 1)) * rows_per_leaf

print(f"2 levels: {capacity(2, 1170, 80):,} rows")     # 93,600
print(f"3 levels: {capacity(3, 1170, 80):,} rows")     # 109,512,000
print(f"4 levels: {capacity(4, 1170, 80):,} rows")     # 128,129,040,000

20.4 Common Misconceptions

Misconception 1: "The B in B-Tree stands for Binary"

Wrong. The B is most commonly interpreted as "Balanced" or "Boeing" (Bayer and McCreight worked at Boeing Scientific Research Labs), or Bayer's initial. Bayer himself never gave an official explanation.

Misconception 2: "Red-black trees are worse than B-Trees"

This is context-dependent. Red-black trees are better in pure-memory scenarios—each comparison involves one key, and cache line utilization is higher. Linux kernel's process scheduler and Java's TreeMap use red-black trees. B-Tree/B+Tree's advantage is specifically for disk/large-data scenarios.

Misconception 3: "Hash indexes are completely useless"

MySQL's MEMORY engine and InnoDB's Adaptive Hash Index both use hashing. For equality lookups (WHERE id = 42), hash indexes with O(1) are indeed faster than B+Tree's O(log n). But hashing doesn't support range queries or sorting, so it can't serve as a general-purpose index structure.


Level 2 · How It Works Under the Hood

20.5 B+Tree in MySQL InnoDB

20.5.1 Clustered Index vs Secondary Index

InnoDB uses B+Tree to organize data, but with two distinct index types:

Clustered Index:

Secondary Index:

Clustered index structure (primary key id):
Root: [50, 100]
         ↓ (id < 50)
Leaf: [id=1, name="Alice", age=25] → [id=2, name="Bob", age=30] → ...

Secondary index structure (indexed column: name):
Root: ["John", "Mike"]
         ↓ (name < "John")
Leaf: [name="Alice", id=1] → [name="Bob", id=2] → ...
                     ↑ Only stores PK, needs bookmark lookup

Cost of bookmark lookups:

# Query: SELECT * FROM users WHERE name = 'Alice'
# With secondary index on name:

# Step 1: Search secondary index B+Tree for name='Alice'
#         → Get primary key id=1 (2-3 I/O)

# Step 2: Use id=1 to search clustered index B+Tree
#         → Get full row (1-2 I/O, root usually in memory)

# Total: 3-5 I/O

# With covering index (index contains all needed columns):
# SELECT id, name FROM users WHERE name = 'Alice'
# Secondary index leaf already has (name, id), no bookmark lookup needed
# Total: 2-3 I/O

20.5.2 Covering Indexes and Composite Indexes

A covering index contains all columns needed by a query, eliminating bookmark lookups:

-- Create composite index
CREATE INDEX idx_name_age ON users(name, age);

-- This query uses covering index (Extra: Using index)
SELECT name, age FROM users WHERE name = 'Alice';

-- This query needs bookmark lookup (email not in index)
SELECT name, age, email FROM users WHERE name = 'Alice';

Composite index B+Tree follows the leftmost prefix principle:

Composite index (name, age) B+Tree:
Leaf nodes sorted by (name, age):
[("Alice", 25, id=1)] → [("Alice", 30, id=5)] → [("Bob", 20, id=3)] → ...

Can use index:
✓ WHERE name = 'Alice'                    (leftmost prefix)
✓ WHERE name = 'Alice' AND age = 25       (full prefix)
✓ WHERE name = 'Alice' AND age > 20       (prefix + range)
✓ ORDER BY name, age                      (index order)

Cannot use index:
✗ WHERE age = 25                          (skips name)
✗ ORDER BY age, name                      (wrong order)

20.5.3 Physical Structure of an InnoDB Page

Internal layout of a 16KB InnoDB page:

┌──────────────────────────────────────────────┐
│ File Header (38 bytes)                        │ ← Page number, checksum, prev/next page
├──────────────────────────────────────────────┤
│ Page Header (56 bytes)                        │ ← Record count, free space pointer
├──────────────────────────────────────────────┤
│ Infimum & Supremum Records (26 bytes)         │ ← Virtual min/max records
├──────────────────────────────────────────────┤
│ User Records                                  │ ← Actual data rows (singly linked list)
│ ┌─────────────────────────────────────────┐  │
│ │ Record 1 → Record 2 → Record 3 → ...   │  │
│ └─────────────────────────────────────────┘  │
├──────────────────────────────────────────────┤
│ Free Space                                    │ ← Unused space
├──────────────────────────────────────────────┤
│ Page Directory (variable)                     │ ← Sparse directory for in-page search
├──────────────────────────────────────────────┤
│ File Trailer (8 bytes)                        │ ← Checksum (consistency with header)
└──────────────────────────────────────────────┘

In-page search uses the Page Directory (sparse index), grouping records into slots of 4-8 records. Lookup binary-searches the directory to find the group, then sequentially scans within the group.

20.6 LSM-Tree: Write-Optimized Data Structure

20.6.1 B+Tree's Write Amplification Problem

B+Tree excels at reads, but writes have a fatal flaw—write amplification:

  1. Modifying one row requires:
    • Reading the 16KB page from disk
    • Modifying a few dozen bytes
    • Writing the entire 16KB page back
  2. If node splitting occurs, multiple pages must be written
  3. With WAL (Write-Ahead Log), one modification's actual write volume can be 10-30x the original data

When writes dominate (log collection, IoT sensors, time-series databases), B+Tree's write amplification severely limits throughput.

20.6.2 LSM-Tree Core Concept

Patrick O'Neil's 1996 paper "The Log-Structured Merge-Tree (LSM-Tree)" presented a key insight:

If sequential disk writes are 100x faster than random writes, just make all writes sequential.

LSM-Tree write path:

Write request → MemTable (in-memory sorted structure) → Full → Flush as SSTable (sorted disk file)
                                                                        ↓
                                                                Background Compaction (merge SSTables)

Detailed steps:

  1. Write to MemTable: New data goes to an in-memory sorted structure (usually red-black tree or skip list), with WAL for durability
  2. MemTable becomes Immutable: When MemTable reaches threshold (e.g., 64MB), freeze it
  3. Flush to SSTable: Immutable MemTable is sequentially written to disk as an SSTable (Sorted String Table)
  4. Compaction: Background threads periodically merge SSTables, eliminating duplicate keys
import bisect
import time
import os


class MemTable:
    """In-memory sorted table (simplified as sorted list)"""
    def __init__(self, max_size=1000):
        self.entries = []  # [(key, value, timestamp)]
        self.max_size = max_size
        self.size = 0

    def put(self, key, value):
        """Write/update"""
        entry = (key, value, time.time())
        idx = bisect.bisect_left(self.entries, (key,))
        if idx < len(self.entries) and self.entries[idx][0] == key:
            self.entries[idx] = entry
        else:
            self.entries.insert(idx, entry)
            self.size += 1

    def get(self, key):
        """Lookup"""
        idx = bisect.bisect_left(self.entries, (key,))
        if idx < len(self.entries) and self.entries[idx][0] == key:
            return self.entries[idx][1]
        return None

    def is_full(self):
        return self.size >= self.max_size

    def flush_to_sstable(self, filename):
        """Sequential write to disk, producing an SSTable"""
        with open(filename, 'w') as f:
            for key, value, ts in self.entries:
                f.write(f"{key}\t{value}\t{ts}\n")
        return SSTable(filename)


class SSTable:
    """Sorted file on disk (simplified)"""
    def __init__(self, filename):
        self.filename = filename
        self.index = self._build_sparse_index()

    def _build_sparse_index(self):
        """Build sparse index (record position every N entries)"""
        index = {}
        with open(self.filename, 'r') as f:
            offset = 0
            for line in f:
                key = line.split('\t')[0]
                if len(index) == 0 or hash(key) % 16 == 0:
                    index[key] = offset
                offset += len(line)
        return index

    def get(self, key):
        """Lookup in SSTable (requires I/O)"""
        with open(self.filename, 'r') as f:
            for line in f:
                parts = line.strip().split('\t')
                if parts[0] == key:
                    return parts[1]
                if parts[0] > key:
                    break
        return None


class LSMTree:
    """Simplified LSM-Tree implementation"""
    def __init__(self, data_dir="lsm_data"):
        self.memtable = MemTable()
        self.immutable_memtables = []
        self.sstables = []  # On-disk SSTables (newest first)
        self.data_dir = data_dir
        self.sstable_counter = 0

    def put(self, key, value):
        """Write"""
        self.memtable.put(key, value)
        if self.memtable.is_full():
            self._flush()

    def get(self, key):
        """Read: check memory first, then disk"""
        # 1. Current MemTable
        result = self.memtable.get(key)
        if result is not None:
            return result

        # 2. Immutable MemTables (newest first)
        for imm in reversed(self.immutable_memtables):
            result = imm.get(key)
            if result is not None:
                return result

        # 3. SSTables (newest first)
        for sst in self.sstables:
            result = sst.get(key)
            if result is not None:
                return result

        return None

    def _flush(self):
        """Flush MemTable to SSTable"""
        self.immutable_memtables.append(self.memtable)
        self.memtable = MemTable()

        imm = self.immutable_memtables.pop(0)
        filename = f"{self.data_dir}/sst_{self.sstable_counter:06d}.dat"
        self.sstable_counter += 1
        sst = imm.flush_to_sstable(filename)
        self.sstables.insert(0, sst)

    def compact(self):
        """Merge SSTables (simplified Level Compaction)"""
        if len(self.sstables) < 4:
            return
        sst1 = self.sstables.pop()
        sst2 = self.sstables.pop()
        merged = self._merge_sstables(sst1, sst2)
        self.sstables.append(merged)

    def _merge_sstables(self, sst1, sst2):
        """Merge two SSTables"""
        filename = f"{self.data_dir}/sst_{self.sstable_counter:06d}.dat"
        self.sstable_counter += 1

        entries = {}
        for sst in [sst2, sst1]:  # sst1 is newer, overwrites
            with open(sst.filename, 'r') as f:
                for line in f:
                    parts = line.strip().split('\t')
                    entries[parts[0]] = (parts[1], parts[2])

        with open(filename, 'w') as f:
            for key in sorted(entries.keys()):
                value, ts = entries[key]
                f.write(f"{key}\t{value}\t{ts}\n")

        os.remove(sst1.filename)
        os.remove(sst2.filename)
        return SSTable(filename)

20.6.3 LevelDB / RocksDB Leveled Compaction

Google's LevelDB (designed by Jeff Dean and Sanjay Ghemawat) organizes SSTables into multiple levels:

Level 0: SSTable_a, SSTable_b, SSTable_c    ← Direct flush from MemTable
         (key ranges may overlap)

Level 1: SSTable_1, SSTable_2, SSTable_3    ← No key overlap within level
         [0-100]     [101-200]   [201-300]

Level 2: SSTable_A, SSTable_B, ..., SSTable_F
         [0-50]     [51-100]    ...  [251-300]

Level 3: ... (each level is 10x the capacity of the previous)

Level Compaction rules:

Read amplification vs write amplification tradeoff:

LSM-Tree cost model:
┌────────────┬──────────────────────────────────────────┐
│ Operation  │ Cost                                      │
├────────────┼──────────────────────────────────────────┤
│ Write      │ O(1) memory write + O(1) WAL seq write   │
│            │ Write amp ≈ 10-30x (compaction cost)      │
├────────────┼──────────────────────────────────────────┤
│ Point read │ Worst case: check every level O(levels)   │
│            │ Optimization: Bloom Filter skips files     │
├────────────┼──────────────────────────────────────────┤
│ Range read │ Merge results from multiple levels        │
└────────────┴──────────────────────────────────────────┘

B+Tree cost model:
┌────────────┬──────────────────────────────────────────┐
│ Operation  │ Cost                                      │
├────────────┼──────────────────────────────────────────┤
│ Write      │ O(log n) random I/O + WAL                 │
│            │ Write amp ≈ 10-30x (page-granularity)     │
├────────────┼──────────────────────────────────────────┤
│ Point read │ O(log n) ≈ 2-4 I/O                       │
├────────────┼──────────────────────────────────────────┤
│ Range read │ Locate + sequential leaf scan             │
└────────────┴──────────────────────────────────────────┘

20.6.4 Bloom Filter Optimization for Reads

LSM-Tree's read disadvantage is searching multiple levels. Bloom Filters quickly determine "a key is definitely NOT in this SSTable," skipping unnecessary disk reads.

import mmh3  # MurmurHash3


class BloomFilter:
    """Bloom filter"""
    def __init__(self, expected_items, false_positive_rate=0.01):
        import math
        self.size = int(-expected_items * math.log(false_positive_rate)
                        / (math.log(2) ** 2))
        self.num_hashes = int((self.size / expected_items) * math.log(2))
        self.bit_array = [0] * self.size

    def add(self, key):
        """Add key"""
        for i in range(self.num_hashes):
            idx = mmh3.hash(key, i) % self.size
            self.bit_array[idx] = 1

    def might_contain(self, key):
        """Query (may have false positives, never false negatives)"""
        for i in range(self.num_hashes):
            idx = mmh3.hash(key, i) % self.size
            if self.bit_array[idx] == 0:
                return False  # Definitely not present
        return True  # Possibly present


class SSTableWithBloom(SSTable):
    """SSTable with Bloom Filter"""
    def __init__(self, filename, bloom_filter):
        super().__init__(filename)
        self.bloom = bloom_filter

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None  # Skip disk I/O
        return super().get(key)

RocksDB attaches a Bloom Filter to each SSTable (default 10 bits/key, ~1% false positive rate). This means 99% of miss queries complete with zero disk I/O.

20.7 Redis Skip List: Why Not B+Tree

Redis Sorted Sets (ZSET) use skip lists instead of B+Tree. Antirez (Redis creator Salvatore Sanfilippo) gave these reasons:

  1. Pure in-memory: All Redis data is in memory, B+Tree's disk I/O advantage disappears
  2. Implementation simplicity: Skip list code is ~1/3 the size of B+Tree, easier to maintain
  3. Efficient range queries: Skip list's forward pointers naturally support range scans
  4. Concurrency-friendly: Skip lists allow fine-grained locking; B+Tree splits require locking entire paths
  5. Memory flexibility: Skip list nodes can be scattered; no need for pre-allocated contiguous blocks
import random


class SkipListNode:
    """Skip list node"""
    def __init__(self, key, value, level):
        self.key = key
        self.value = value
        self.forward = [None] * (level + 1)
        self.span = [0] * (level + 1)


class SkipList:
    """Skip list implementation (mimicking Redis ZSET)"""
    MAX_LEVEL = 32
    P = 0.25  # Redis uses 1/4 promotion probability

    def __init__(self):
        self.header = SkipListNode(None, None, self.MAX_LEVEL)
        self.level = 0
        self.length = 0

    def _random_level(self):
        """Randomly determine new node's level"""
        level = 0
        while random.random() < self.P and level < self.MAX_LEVEL:
            level += 1
        return level

    def insert(self, key, value):
        """Insert element"""
        update = [None] * (self.MAX_LEVEL + 1)
        current = self.header

        for i in range(self.level, -1, -1):
            while (current.forward[i] is not None and
                   current.forward[i].key < key):
                current = current.forward[i]
            update[i] = current

        new_level = self._random_level()
        if new_level > self.level:
            for i in range(self.level + 1, new_level + 1):
                update[i] = self.header
            self.level = new_level

        new_node = SkipListNode(key, value, new_level)
        for i in range(new_level + 1):
            new_node.forward[i] = update[i].forward[i]
            update[i].forward[i] = new_node

        self.length += 1

    def search(self, key):
        """Search"""
        current = self.header
        for i in range(self.level, -1, -1):
            while (current.forward[i] is not None and
                   current.forward[i].key < key):
                current = current.forward[i]
        current = current.forward[0]
        if current is not None and current.key == key:
            return current.value
        return None

    def range_by_score(self, min_score, max_score):
        """Range query (simulating ZRANGEBYSCORE)"""
        current = self.header
        for i in range(self.level, -1, -1):
            while (current.forward[i] is not None and
                   current.forward[i].key < min_score):
                current = current.forward[i]
        current = current.forward[0]

        results = []
        while current is not None and current.key <= max_score:
            results.append((current.key, current.value))
            current = current.forward[0]
        return results

Skip List vs B+Tree comparison:

Feature Skip List B+Tree
Best scenario Pure memory Disk storage
Implementation Simple Complex
Lookup time O(log n) expected O(log_m n) deterministic
Range query Bottom-level linked list Leaf linked list
Space overhead Avg 1.33 pointers/node (P=0.25) Half-full guarantee
Concurrency Lock-free/fine-grained Needs latch coupling
Disk I/O Poor (scattered nodes) Excellent (node=page)

Level 3 · What the Specifications Define

20.8 History and Design Decisions of B-Tree

20.8.1 Bayer & McCreight 1972

Rudolf Bayer and Edward McCreight published "Organization and Maintenance of Large Ordered Indexes" in 1972 (Acta Informatica), first proposing the B-Tree. They were working at Boeing Scientific Research Labs at the time.

Core contributions of the paper:

  1. Defined B-Tree structure and properties
  2. Proved all operations (search, insert, delete) are O(log n)
  3. Analyzed upper bounds on disk I/O
  4. Presented algorithms for node splitting and merging

Why called "B"-Tree? Bayer never gave a definitive explanation. Common interpretations include Boeing, Balanced, Bayer, and Broad. Douglas Comer wrote in his 1979 survey "The Ubiquitous B-Tree": "The origin of 'B-tree' has never been explained by the authors."

Historical context: In 1972, storage technology meant drums and disks with millisecond access times. Main memory (core memory) was measured in KB. The core constraint when designing B-Tree was: how to maintain large ordered indexes on extremely slow external storage? This question remains relevant 50+ years later—though SSDs are 100x faster than HDDs, they're still 1000x slower than DRAM.

20.8.2 Evolution of B+Tree

B+Tree has no single "inventor" paper. It evolved through database system practice:

Two key design decisions of B+Tree over B-Tree:

Decision 1: Data only in leaves

Decision 2: Leaf linked list

20.8.3 O'Neil 1996 LSM-Tree Paper

Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil published "The Log-Structured Merge-Tree (LSM-Tree)" in 1996 (Acta Informatica).

Core arguments:

  1. For write-intensive workloads, B-Tree's random write efficiency is extremely poor
  2. Converting all writes to sequential (write to memory first, batch flush) dramatically improves write throughput
  3. Trading some read degradation (checking multiple components) for orders-of-magnitude write improvement

O'Neil's original LSM-Tree had only two components (C₀ in memory, C₁ on disk). Google's BigTable paper (2006) later generalized LSM-Tree to multi-level structures, and LevelDB/RocksDB refined leveled compaction strategies.

Theoretical analysis:

Assume:

B+Tree write throughput: W_btree = 1 / T_rand (one random I/O per write) LSM-Tree write throughput: W_lsm ≈ W_seq / (write_amplification)

When W_seq >> 1/T_rand (always true on HDD), LSM-Tree write throughput far exceeds B+Tree. On SSDs, the sequential/random gap narrows, but LSM-Tree's write amplification consumes SSD P/E cycles—a new dimension of tradeoff.

20.8.4 Mathematical Relationship Between Disk I/O and B-Tree

Let:

Precise height bounds:

h ≤ log_{⌈m/2⌉}((n+1)/2) + 1   (max height, nodes half-full)
h ≥ log_m(n+1)                    (min height, nodes full)

For InnoDB typical parameters (m ≈ 1170, n = 100 million):

Regardless of fill factor, 3 levels suffice. This is why DBAs say "InnoDB's B+Tree is almost always 3 levels tall."

Buffer Pool impact:

InnoDB's Buffer Pool (default 128MB, production typically 50-80% of physical memory) caches hot pages:

Conclusion: for hot data, actual disk I/O may be only 0-1 operations.


Level 4 · Edge Cases and Pitfalls

20.9 Connections to Other Books on This Site

This chapter relates closely to:

20.10 How to Answer "Why Does MySQL Use B+Tree" in Interviews

This is one of the most frequently asked backend interview questions. Here's a structured answer framework:

Layer 1: Disk I/O constraint "Database data lives on disk, where random I/O is slow (SSD ~100μs, HDD ~10ms). We need a data structure with extremely low height so each lookup requires minimal disk I/O."

Layer 2: Why not red-black tree/AVL "Red-black trees are binary—100M records need ~27 levels. Each level means one I/O, so 27 disk accesses per lookup is unacceptable. B+Tree nodes can have thousands of branches; 100M records need only 3 levels."

Layer 3: Why not hash table "Hash tables don't support range queries (BETWEEN), sorting (ORDER BY), or leftmost prefix matching. Databases need these operations. Though InnoDB has an Adaptive Hash Index as supplement."

Layer 4: Why B+Tree over B-Tree "B+Tree internal nodes store no data, so each page holds more keys, making the tree shorter. Plus leaf nodes form a linked list—range queries just locate the start and scan sequentially."

Bonus: Mention Buffer Pool "In practice, because of the Buffer Pool, the root and upper internal nodes are almost always in memory. Actual disk I/O is typically 0-1 operations."

20.11 B+Tree vs LSM-Tree Selection Guide

Dimension B+Tree LSM-Tree
Read/write ratio Read-heavy Write-heavy
Latency profile Stable (2-4 I/O each time) Low write latency, occasional read spikes
Space amplification Low (in-page fragmentation) Medium (multi-version data)
Write amplification Medium (page-granularity) High (compaction rewrites)
Read amplification Low (direct lookup) High (multi-level search)
Typical systems MySQL, PostgreSQL, Oracle RocksDB, Cassandra, HBase
Use cases OLTP, transactions Logs, time-series, write-intensive

Practical selection advice:

20.12 Common Interview Questions and Traps

Q: Are B+Tree leaf nodes a singly or doubly linked list? A: InnoDB uses a doubly linked list. File Header contains FIL_PAGE_PREV and FIL_PAGE_NEXT fields. This supports both forward scans (ASC) and backward scans (DESC).

Q: How much data can one InnoDB B+Tree store? A: Assuming bigint PK (8B), 1KB row size:

Q: Why did MongoDB previously use B-Tree instead of B+Tree? A: MongoDB's WiredTiger engine actually uses a B+Tree variant. The earlier MMAPV1 engine used a B-Tree-like structure because it directly memory-mapped files. Since MongoDB 4.0, WiredTiger is the default—essentially B+Tree.

Q: What problems does LSM-Tree's Compaction cause? A:

  1. Write amplification: Data is repeatedly read and rewritten, consuming SSD lifespan
  2. Space amplification: During compaction, both old and new copies exist simultaneously
  3. Latency spikes: Large compactions consume I/O bandwidth, affecting foreground reads/writes

Q: How does RocksDB solve LSM-Tree's read performance issues? A:

  1. Bloom Filter: Each SSTable has a Bloom Filter, skipping files without the target key
  2. Block Cache: Caches hot data blocks
  3. Prefix Bloom Filter: Bloom Filter on key prefixes for prefix queries
  4. Compression: Reduces actual I/O volume (LZ4/Snappy/ZSTD)

Chapter Summary

This chapter addressed the core problem of "efficiently organizing ordered data on external storage," introducing three key data structures:

  1. B-Tree: Increases node width to reduce height, bringing disk I/O from O(log₂n) down to O(log_m n)
  2. B+Tree: Builds on B-Tree by storing data only in leaves and linking them—becoming the database index standard
  3. LSM-Tree: Converts random writes to sequential writes, trading read performance for write throughput in write-intensive scenarios

They are not "replacement" relationships but optimal solutions for different workloads. Understanding their design tradeoffs is the watershed between "using a database" and "understanding a database."

Rate this chapter
4.8  / 5  (10 ratings)

💬 Comments