B-Tree, B+Tree and LSM-Tree
Chapter 20: B-Tree, B+Tree and LSM-Tree
Open a MySQL client and run EXPLAIN SELECT * FROM users WHERE id = 42. You will see const in the type columnโmeaning InnoDB located the row with a single B+Tree lookup. Behind this "single lookup" lies a multi-way search tree with a height of just 3-4 levels, where each node occupies exactly one 16KB disk page.
Why not a binary search tree? Why not a hash table? Why not a red-black tree? The answers to all these questions point to the same fundamental tension: a 100,000x gap between memory access speed and disk I/O speed. A random memory access takes about 100ns; a random SSD read takes about 100ฮผs; a random HDD seek takes about 10ms. When data exceeds memory capacity, any data structure design must prioritize "minimizing disk I/O operations" above all else.
When Rudolf Bayer and Edward McCreight invented the B-Tree in 1972, they targeted precisely this problem. Their paper title was direct and precise: "Organization and Maintenance of Large Ordered Indexes." By increasing the "width" of each node (storing multiple keys and child pointers), a B-Tree compresses tree height to log_m(n) levels (where m is the branching factor, typically in the hundreds to thousands), limiting the disk I/O needed to find any record to 3-4 operations.
The B+Tree is a variant of the B-Tree where all data resides only in leaf nodes, and leaves are linked by a doubly-linked listโmaking range queries extremely efficient. MySQL InnoDB, PostgreSQL, and Oracle all use B+Tree for their index structures.
But B+Tree is not a silver bullet. When write load far exceeds reads (log systems, time-series databases), B+Tree's random write amplification becomes a bottleneck. Patrick O'Neil's 1996 LSM-Tree (Log-Structured Merge-Tree) converts random writes to sequential writes through a "write to memory โ batch flush to disk โ background merge" strategy, achieving orders-of-magnitude improvement in write throughput. LevelDB, RocksDB, Cassandra, and HBase are all based on LSM-Tree.
This chapter starts from the basic definition of B-Tree, dives deep into B+Tree's practical application in databases, then covers LSM-Tree's write-optimized design, and finally discusses why Redis chose skip lists over B+Tree. You will see that choosing a data structure is never about "which is better," but about "which costs less given the hardware constraints and access patterns."
Level 1 ยท What You Need to Know
20.1 B-Tree: Multi-way Search Tree
20.1.1 From Binary to Multi-way
In previous chapters we studied binary search trees (BST) and balanced BSTs (AVL, red-black trees). Their core property: each node has at most two children, and search takes O(logโn) time.
For 1 million records, a binary tree has height โ logโ(1,000,000) โ 20. If the tree fits entirely in memory, 20 comparisons are trivial. But if each node resides on disk, that means 20 disk I/O operationsโat 10ms per HDD seek, one lookup takes 200ms. Unacceptable.
The solution is to increase each node's "branching factor." If each node can store 1000 keys and 1001 child pointers, then the same 1 million records produce a tree of height logโโโโ(1,000,000) โ 2. With the root node typically cached in memory, only 1-2 disk I/O operations are needed to find any record.
This is the core idea of B-Tree: trade node width for tree height.
20.1.2 Formal Definition of B-Tree
An order-m B-Tree (m is the maximum number of children per node) satisfies:
- Each node has at most m children
- Non-root nodes have at least โm/2โ children (guaranteeing at least half-full)
- Root has at least 2 children (unless it is a leaf)
- A node with k children contains k-1 keys
- All leaves are at the same level (perfectly balanced)
- Keys within a node are sorted
class BTreeNode:
"""B-Tree node"""
def __init__(self, leaf=False):
self.keys = [] # Sorted list of keys
self.children = [] # Child pointers
self.leaf = leaf # Whether this is a leaf
class BTree:
"""B-Tree implementation (order m)"""
def __init__(self, m):
self.root = BTreeNode(leaf=True)
self.m = m # Maximum children per node
self.max_keys = m - 1
self.min_keys = (m + 1) // 2 - 1
def search(self, node, key):
"""Search for key in B-Tree"""
i = 0
# Find position within current node
while i < len(node.keys) and key > node.keys[i]:
i += 1
# Found
if i < len(node.keys) and key == node.keys[i]:
return (node, i)
# Not found and at leaf
if node.leaf:
return None
# Recurse to child (this corresponds to one disk I/O)
return self.search(node.children[i], key)
def insert(self, key):
"""Insert a key"""
root = self.root
if len(root.keys) == self.max_keys:
new_root = BTreeNode(leaf=False)
new_root.children.append(self.root)
self._split_child(new_root, 0)
self.root = new_root
self._insert_non_full(self.root, key)
def _insert_non_full(self, node, key):
"""Insert into a non-full node"""
i = len(node.keys) - 1
if node.leaf:
node.keys.append(None)
while i >= 0 and key < node.keys[i]:
node.keys[i + 1] = node.keys[i]
i -= 1
node.keys[i + 1] = key
else:
while i >= 0 and key < node.keys[i]:
i -= 1
i += 1
if len(node.children[i].keys) == self.max_keys:
self._split_child(node, i)
if key > node.keys[i]:
i += 1
self._insert_non_full(node.children[i], key)
def _split_child(self, parent, idx):
"""Split parent's idx-th child"""
child = parent.children[idx]
mid = len(child.keys) // 2
new_node = BTreeNode(leaf=child.leaf)
new_node.keys = child.keys[mid + 1:]
if not child.leaf:
new_node.children = child.children[mid + 1:]
child.children = child.children[:mid + 1]
promote_key = child.keys[mid]
child.keys = child.keys[:mid]
parent.keys.insert(idx, promote_key)
parent.children.insert(idx + 1, new_node)
20.1.3 Visualizing B-Tree Search
Consider an order-5 B-Tree storing keys [1, 2, ..., 20]:
[8]
/ \
[3, 5] [12, 15, 18]
/ | \ / | | \
[1,2] [4] [6,7] [9,10,11] [13,14] [16,17] [19,20]
Searching for key=14:
- Root [8]: 14 > 8, go right (1 disk I/O)
- Internal [12, 15, 18]: 12 < 14 < 15, follow second pointer (1 disk I/O)
- Leaf [13, 14]: found 14 (1 disk I/O)
Total: 3 disk I/O operations.
20.1.4 Why Not Red-Black Trees or Hash Tables?
This is a classic interview question. Let's analyze from first principles:
| Data Structure | Lookup I/O | Range Query | Ordering | Disk-friendly |
|---|---|---|---|---|
| Red-black tree | O(logโn) โ 20 | Requires inorder traversal | Ordered | Poor (scattered nodes) |
| Hash table | O(1) average | Not supported | Unordered | Poor (resize costly) |
| B+Tree | O(log_m n) โ 3 | Leaf linked list scan | Ordered | Excellent (node=page) |
Problems with red-black trees:
- Too tall: 1M records need ~20 levels = ~20 disk I/O
- Nodes too small: each stores one key, cannot utilize 4KB/16KB disk pages
- Scattered nodes: adjacent keys may be physically far apart on disk
Problems with hash tables:
- No range queries:
WHERE age BETWEEN 20 AND 30requires full scan - No sorting:
ORDER BYneeds additional sort step - Costly resizing: rehash redistributes all data
- Worst-case O(n) with hash collisions
Why B+Tree wins:
- Extremely low tree height: even 1 billion records fit in 4 levels
- Node size = disk page size: one I/O reads a complete node
- Excellent sequential access: leaf linked list for range scans
- Prefetch-friendly: OS readahead can preload adjacent pages
20.2 B+Tree: The Foundation of Database Indexes
20.2.1 Key Differences Between B+Tree and B-Tree
B+Tree is a B-Tree variant with two critical improvements:
- All data resides only in leaf nodes: internal nodes store only keys (as signposts)
- Leaf nodes are linked in a doubly-linked list: enabling efficient range scans
B-Tree (data distributed across all nodes):
[10|data, 20|data]
/ | \
[5|data] [15|data] [25|data, 30|data]
B+Tree (data only in leaves):
[10, 20] โ Internal: keys only, no data
/ | \
[5|d]โ[10|d, 15|d]โ[20|d, 25|d, 30|d] โ Leaves: data + linked list
Why "data only in leaves" is better:
Internal nodes without data means:
- More keys per internal node โ larger branching factor โ shorter tree โ fewer I/O
- With a 16KB page, 8-byte key, 6-byte pointer:
- B-Tree internal node: with 100-byte data per entry, ~145 keys per page
- B+Tree internal node: no data overhead, ~1170 keys per page
- B+Tree's branching factor is 8x that of B-Tree, reducing tree height by at least one level
Why the "leaf linked list" matters:
# Range query: SELECT * FROM users WHERE age BETWEEN 20 AND 30
# B+Tree: locate leaf with age=20, then scan sequentially until age>30
# B-Tree: requires inorder traversal jumping between levels
def range_query_bplus(root, low, high):
"""B+Tree range query"""
# Step 1: locate starting leaf (O(log n) I/O)
leaf = find_leaf(root, low)
results = []
# Step 2: sequential scan along leaf linked list (sequential I/O)
while leaf is not None:
for key, data in leaf.entries:
if key > high:
return results
if key >= low:
results.append(data)
leaf = leaf.next_leaf
return results
20.2.2 Complete B+Tree Search Process
Let's simulate an InnoDB B+Tree lookup with concrete numbers:
- Page size: 16KB
- Primary key: bigint (8 bytes), average row: 200 bytes
- Internal node: 8-byte key + 6-byte pointer = 14 bytes โ ~1170 pointers per page
- Leaf node: 200-byte row โ ~80 rows per page
Capacity:
- 2-level B+Tree: 1170 ร 80 = 93,600 rows
- 3-level B+Tree: 1170 ร 1170 ร 80 = 109 million rows
- 4-level B+Tree: 1170ยณ ร 80 = 128 billion rows
For most business tables (< 100M rows), 3 disk I/O operations locate any row. With the root and second level cached in Buffer Pool, actual disk I/O may be just 1 operation.
class BPlusTreeNode:
"""B+Tree node"""
def __init__(self, leaf=False):
self.keys = []
self.leaf = leaf
if leaf:
self.data = []
self.next_leaf = None
self.prev_leaf = None
else:
self.children = []
class BPlusTree:
"""Simplified B+Tree implementation"""
def __init__(self, order):
self.root = BPlusTreeNode(leaf=True)
self.order = order
def search(self, key):
"""Point lookup"""
node = self.root
while not node.leaf:
i = self._find_position(node.keys, key)
node = node.children[i]
i = self._find_position(node.keys, key)
if i < len(node.keys) and node.keys[i] == key:
return node.data[i]
return None
def range_search(self, low, high):
"""Range lookup"""
node = self.root
while not node.leaf:
i = self._find_position(node.keys, low)
node = node.children[i]
results = []
while node is not None:
for i, key in enumerate(node.keys):
if key > high:
return results
if key >= low:
results.append((key, node.data[i]))
node = node.next_leaf
return results
def insert(self, key, data):
"""Insert key-data pair"""
node = self.root
path = []
while not node.leaf:
i = self._find_position(node.keys, key)
path.append((node, i))
node = node.children[i]
i = self._find_position(node.keys, key)
node.keys.insert(i, key)
node.data.insert(i, data)
if len(node.keys) > self.order:
self._split_leaf(node, path)
def _split_leaf(self, leaf, path):
"""Split a leaf node"""
mid = len(leaf.keys) // 2
new_leaf = BPlusTreeNode(leaf=True)
new_leaf.keys = leaf.keys[mid:]
new_leaf.data = leaf.data[mid:]
leaf.keys = leaf.keys[:mid]
leaf.data = leaf.data[:mid]
new_leaf.next_leaf = leaf.next_leaf
if leaf.next_leaf:
leaf.next_leaf.prev_leaf = new_leaf
leaf.next_leaf = new_leaf
new_leaf.prev_leaf = leaf
promote_key = new_leaf.keys[0]
self._insert_into_parent(leaf, promote_key, new_leaf, path)
def _insert_into_parent(self, left, key, right, path):
"""Insert key into parent node"""
if not path:
new_root = BPlusTreeNode(leaf=False)
new_root.keys = [key]
new_root.children = [left, right]
self.root = new_root
return
parent, idx = path.pop()
parent.keys.insert(idx, key)
parent.children.insert(idx + 1, right)
if len(parent.keys) > self.order:
self._split_internal(parent, path)
def _split_internal(self, node, path):
"""Split an internal node"""
mid = len(node.keys) // 2
promote_key = node.keys[mid]
new_node = BPlusTreeNode(leaf=False)
new_node.keys = node.keys[mid + 1:]
new_node.children = node.children[mid + 1:]
node.keys = node.keys[:mid]
node.children = node.children[:mid + 1]
self._insert_into_parent(node, promote_key, new_node, path)
def _find_position(self, keys, target):
"""Binary search for position"""
lo, hi = 0, len(keys)
while lo < hi:
mid_idx = (lo + hi) // 2
if keys[mid_idx] < target:
lo = mid_idx + 1
else:
hi = mid_idx
return lo
20.2.3 Insertion and Splitting
B+Tree insertion follows a "bottom-up split" strategy:
- Locate leaf: follow root-to-leaf path
- Insert into leaf: place key-data pair in sorted position
- Check overflow: if leaf exceeds order, split
- Propagate split: may propagate upward, worst case to root
Critical splitting rules:
- Leaf split: middle key is copied up to parent (data stays in leaf)
- Internal split: middle key is moved up to parent (internal nodes don't store data)
This distinction ensures all data is always findable in leaf nodes.
20.2.4 Deletion and Merging
Deletion is the inverse of insertion:
- Locate and delete: remove target key from leaf
- Check underflow: if keys fall below โorder/2โ, handle it
- Try borrowing: take a key from sibling (rotation)
- Merge: if sibling is also minimal, merge two nodes into one
20.3 B+Tree and Disk I/O: The Deep Connection
20.3.1 Cost Model of Disk Access
Understanding why B+Tree is efficient requires understanding disk I/O costs:
Storage Hierarchy (typical 2024 values):
โโโโโโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโฌโโโโโโโโโโโโ
โ Storage Medium โ Latency โ Bandwidth โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโค
โ L1 Cache โ 1ns โ 1TB/s โ
โ L2 Cache โ 4ns โ 500GB/s โ
โ L3 Cache โ 12ns โ 200GB/s โ
โ DRAM โ 100ns โ 50GB/s โ
โ NVMe SSD (random 4KB) โ 100ฮผs โ 7GB/s โ
โ SATA SSD (random 4KB) โ 200ฮผs โ 500MB/s โ
โ HDD (random 4KB) โ 10ms โ 200MB/s โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโดโโโโโโโโโโโโ
Key insights:
- DRAM โ SSD latency jump: 1000x
- DRAM โ HDD latency jump: 100,000x
- But for sequential reads, the SSD/HDD bandwidth gap is only 10-35x
This explains two design principles:
- Minimize random I/O count (B+Tree's low height)
- Convert random I/O to sequential I/O (leaf linked list, LSM-Tree)
20.3.2 Why Node Size Equals Page Size
Operating systems perform disk I/O in "pages" (Linux default 4KB, databases typically 16KB for InnoDB or 8KB for PostgreSQL).
B+Tree nodes are sized equal to disk pages because:
- One node = one I/O: regardless of keys in the node, cost is one disk read
- Maximize each I/O: reading 16KB per operation, don't waste it on tiny nodes
- Filesystem alignment: avoid nodes spanning two pages (which doubles I/O)
# InnoDB page capacity calculation
PAGE_SIZE = 16 * 1024 # 16KB
# Internal node fan-out
key_size = 8 # bigint primary key
pointer_size = 6 # InnoDB uses 6-byte page numbers
header_overhead = 120 # Page header/trailer fixed overhead
usable_space = PAGE_SIZE - header_overhead
keys_per_page = usable_space // (key_size + pointer_size)
# ~1170 keys, i.e., fan-out ~1170
# Leaf node record count
row_size = 200 # Assume average row 200 bytes
rows_per_leaf = usable_space // row_size
# ~80 rows
# Height vs capacity
def capacity(height, fan_out, rows_per_leaf):
"""Maximum rows for given B+Tree height"""
return (fan_out ** (height - 1)) * rows_per_leaf
print(f"2 levels: {capacity(2, 1170, 80):,} rows") # 93,600
print(f"3 levels: {capacity(3, 1170, 80):,} rows") # 109,512,000
print(f"4 levels: {capacity(4, 1170, 80):,} rows") # 128,129,040,000
20.4 Common Misconceptions
Misconception 1: "The B in B-Tree stands for Binary"
Wrong. The B is most commonly interpreted as "Balanced" or "Boeing" (Bayer and McCreight worked at Boeing Scientific Research Labs), or Bayer's initial. Bayer himself never gave an official explanation.
Misconception 2: "Red-black trees are worse than B-Trees"
This is context-dependent. Red-black trees are better in pure-memory scenariosโeach comparison involves one key, and cache line utilization is higher. Linux kernel's process scheduler and Java's TreeMap use red-black trees. B-Tree/B+Tree's advantage is specifically for disk/large-data scenarios.
Misconception 3: "Hash indexes are completely useless"
MySQL's MEMORY engine and InnoDB's Adaptive Hash Index both use hashing. For equality lookups (WHERE id = 42), hash indexes with O(1) are indeed faster than B+Tree's O(log n). But hashing doesn't support range queries or sorting, so it can't serve as a general-purpose index structure.
Level 2 ยท How It Works Under the Hood
20.5 B+Tree in MySQL InnoDB
20.5.1 Clustered Index vs Secondary Index
InnoDB uses B+Tree to organize data, but with two distinct index types:
Clustered Index:
- Each InnoDB table has exactly one clustered index
- Leaf nodes store complete row data
- Uses the primary key by default; if none exists, InnoDB picks a unique non-null index or generates a hidden ROW_ID
- Physical storage order matches logical primary key order
Secondary Index:
- A table can have multiple secondary indexes
- Leaf nodes store indexed column values + primary key value (not full row)
- Lookups via secondary index require a "bookmark lookup" (use PK to fetch from clustered index)
Clustered index structure (primary key id):
Root: [50, 100]
โ (id < 50)
Leaf: [id=1, name="Alice", age=25] โ [id=2, name="Bob", age=30] โ ...
Secondary index structure (indexed column: name):
Root: ["John", "Mike"]
โ (name < "John")
Leaf: [name="Alice", id=1] โ [name="Bob", id=2] โ ...
โ Only stores PK, needs bookmark lookup
Cost of bookmark lookups:
# Query: SELECT * FROM users WHERE name = 'Alice'
# With secondary index on name:
# Step 1: Search secondary index B+Tree for name='Alice'
# โ Get primary key id=1 (2-3 I/O)
# Step 2: Use id=1 to search clustered index B+Tree
# โ Get full row (1-2 I/O, root usually in memory)
# Total: 3-5 I/O
# With covering index (index contains all needed columns):
# SELECT id, name FROM users WHERE name = 'Alice'
# Secondary index leaf already has (name, id), no bookmark lookup needed
# Total: 2-3 I/O
20.5.2 Covering Indexes and Composite Indexes
A covering index contains all columns needed by a query, eliminating bookmark lookups:
-- Create composite index
CREATE INDEX idx_name_age ON users(name, age);
-- This query uses covering index (Extra: Using index)
SELECT name, age FROM users WHERE name = 'Alice';
-- This query needs bookmark lookup (email not in index)
SELECT name, age, email FROM users WHERE name = 'Alice';
Composite index B+Tree follows the leftmost prefix principle:
Composite index (name, age) B+Tree:
Leaf nodes sorted by (name, age):
[("Alice", 25, id=1)] โ [("Alice", 30, id=5)] โ [("Bob", 20, id=3)] โ ...
Can use index:
โ WHERE name = 'Alice' (leftmost prefix)
โ WHERE name = 'Alice' AND age = 25 (full prefix)
โ WHERE name = 'Alice' AND age > 20 (prefix + range)
โ ORDER BY name, age (index order)
Cannot use index:
โ WHERE age = 25 (skips name)
โ ORDER BY age, name (wrong order)
20.5.3 Physical Structure of an InnoDB Page
Internal layout of a 16KB InnoDB page:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ File Header (38 bytes) โ โ Page number, checksum, prev/next page
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Page Header (56 bytes) โ โ Record count, free space pointer
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Infimum & Supremum Records (26 bytes) โ โ Virtual min/max records
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ User Records โ โ Actual data rows (singly linked list)
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ Record 1 โ Record 2 โ Record 3 โ ... โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Free Space โ โ Unused space
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Page Directory (variable) โ โ Sparse directory for in-page search
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ File Trailer (8 bytes) โ โ Checksum (consistency with header)
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
In-page search uses the Page Directory (sparse index), grouping records into slots of 4-8 records. Lookup binary-searches the directory to find the group, then sequentially scans within the group.
20.6 LSM-Tree: Write-Optimized Data Structure
20.6.1 B+Tree's Write Amplification Problem
B+Tree excels at reads, but writes have a fatal flawโwrite amplification:
- Modifying one row requires:
- Reading the 16KB page from disk
- Modifying a few dozen bytes
- Writing the entire 16KB page back
- If node splitting occurs, multiple pages must be written
- With WAL (Write-Ahead Log), one modification's actual write volume can be 10-30x the original data
When writes dominate (log collection, IoT sensors, time-series databases), B+Tree's write amplification severely limits throughput.
20.6.2 LSM-Tree Core Concept
Patrick O'Neil's 1996 paper "The Log-Structured Merge-Tree (LSM-Tree)" presented a key insight:
If sequential disk writes are 100x faster than random writes, just make all writes sequential.
LSM-Tree write path:
Write request โ MemTable (in-memory sorted structure) โ Full โ Flush as SSTable (sorted disk file)
โ
Background Compaction (merge SSTables)
Detailed steps:
- Write to MemTable: New data goes to an in-memory sorted structure (usually red-black tree or skip list), with WAL for durability
- MemTable becomes Immutable: When MemTable reaches threshold (e.g., 64MB), freeze it
- Flush to SSTable: Immutable MemTable is sequentially written to disk as an SSTable (Sorted String Table)
- Compaction: Background threads periodically merge SSTables, eliminating duplicate keys
import bisect
import time
import os
class MemTable:
"""In-memory sorted table (simplified as sorted list)"""
def __init__(self, max_size=1000):
self.entries = [] # [(key, value, timestamp)]
self.max_size = max_size
self.size = 0
def put(self, key, value):
"""Write/update"""
entry = (key, value, time.time())
idx = bisect.bisect_left(self.entries, (key,))
if idx < len(self.entries) and self.entries[idx][0] == key:
self.entries[idx] = entry
else:
self.entries.insert(idx, entry)
self.size += 1
def get(self, key):
"""Lookup"""
idx = bisect.bisect_left(self.entries, (key,))
if idx < len(self.entries) and self.entries[idx][0] == key:
return self.entries[idx][1]
return None
def is_full(self):
return self.size >= self.max_size
def flush_to_sstable(self, filename):
"""Sequential write to disk, producing an SSTable"""
with open(filename, 'w') as f:
for key, value, ts in self.entries:
f.write(f"{key}\t{value}\t{ts}\n")
return SSTable(filename)
class SSTable:
"""Sorted file on disk (simplified)"""
def __init__(self, filename):
self.filename = filename
self.index = self._build_sparse_index()
def _build_sparse_index(self):
"""Build sparse index (record position every N entries)"""
index = {}
with open(self.filename, 'r') as f:
offset = 0
for line in f:
key = line.split('\t')[0]
if len(index) == 0 or hash(key) % 16 == 0:
index[key] = offset
offset += len(line)
return index
def get(self, key):
"""Lookup in SSTable (requires I/O)"""
with open(self.filename, 'r') as f:
for line in f:
parts = line.strip().split('\t')
if parts[0] == key:
return parts[1]
if parts[0] > key:
break
return None
class LSMTree:
"""Simplified LSM-Tree implementation"""
def __init__(self, data_dir="lsm_data"):
self.memtable = MemTable()
self.immutable_memtables = []
self.sstables = [] # On-disk SSTables (newest first)
self.data_dir = data_dir
self.sstable_counter = 0
def put(self, key, value):
"""Write"""
self.memtable.put(key, value)
if self.memtable.is_full():
self._flush()
def get(self, key):
"""Read: check memory first, then disk"""
# 1. Current MemTable
result = self.memtable.get(key)
if result is not None:
return result
# 2. Immutable MemTables (newest first)
for imm in reversed(self.immutable_memtables):
result = imm.get(key)
if result is not None:
return result
# 3. SSTables (newest first)
for sst in self.sstables:
result = sst.get(key)
if result is not None:
return result
return None
def _flush(self):
"""Flush MemTable to SSTable"""
self.immutable_memtables.append(self.memtable)
self.memtable = MemTable()
imm = self.immutable_memtables.pop(0)
filename = f"{self.data_dir}/sst_{self.sstable_counter:06d}.dat"
self.sstable_counter += 1
sst = imm.flush_to_sstable(filename)
self.sstables.insert(0, sst)
def compact(self):
"""Merge SSTables (simplified Level Compaction)"""
if len(self.sstables) < 4:
return
sst1 = self.sstables.pop()
sst2 = self.sstables.pop()
merged = self._merge_sstables(sst1, sst2)
self.sstables.append(merged)
def _merge_sstables(self, sst1, sst2):
"""Merge two SSTables"""
filename = f"{self.data_dir}/sst_{self.sstable_counter:06d}.dat"
self.sstable_counter += 1
entries = {}
for sst in [sst2, sst1]: # sst1 is newer, overwrites
with open(sst.filename, 'r') as f:
for line in f:
parts = line.strip().split('\t')
entries[parts[0]] = (parts[1], parts[2])
with open(filename, 'w') as f:
for key in sorted(entries.keys()):
value, ts = entries[key]
f.write(f"{key}\t{value}\t{ts}\n")
os.remove(sst1.filename)
os.remove(sst2.filename)
return SSTable(filename)
20.6.3 LevelDB / RocksDB Leveled Compaction
Google's LevelDB (designed by Jeff Dean and Sanjay Ghemawat) organizes SSTables into multiple levels:
Level 0: SSTable_a, SSTable_b, SSTable_c โ Direct flush from MemTable
(key ranges may overlap)
Level 1: SSTable_1, SSTable_2, SSTable_3 โ No key overlap within level
[0-100] [101-200] [201-300]
Level 2: SSTable_A, SSTable_B, ..., SSTable_F
[0-50] [51-100] ... [251-300]
Level 3: ... (each level is 10x the capacity of the previous)
Level Compaction rules:
- Level 0 files may have overlapping key ranges (direct flushes)
- Level 1+ files have non-overlapping key ranges within each level
- When a level exceeds its size threshold, pick a file and merge with overlapping files in the next level
- Each level is 10x larger than the previous
Read amplification vs write amplification tradeoff:
LSM-Tree cost model:
โโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Operation โ Cost โ
โโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Write โ O(1) memory write + O(1) WAL seq write โ
โ โ Write amp โ 10-30x (compaction cost) โ
โโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Point read โ Worst case: check every level O(levels) โ
โ โ Optimization: Bloom Filter skips files โ
โโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Range read โ Merge results from multiple levels โ
โโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
B+Tree cost model:
โโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Operation โ Cost โ
โโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Write โ O(log n) random I/O + WAL โ
โ โ Write amp โ 10-30x (page-granularity) โ
โโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Point read โ O(log n) โ 2-4 I/O โ
โโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ Range read โ Locate + sequential leaf scan โ
โโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
20.6.4 Bloom Filter Optimization for Reads
LSM-Tree's read disadvantage is searching multiple levels. Bloom Filters quickly determine "a key is definitely NOT in this SSTable," skipping unnecessary disk reads.
import mmh3 # MurmurHash3
class BloomFilter:
"""Bloom filter"""
def __init__(self, expected_items, false_positive_rate=0.01):
import math
self.size = int(-expected_items * math.log(false_positive_rate)
/ (math.log(2) ** 2))
self.num_hashes = int((self.size / expected_items) * math.log(2))
self.bit_array = [0] * self.size
def add(self, key):
"""Add key"""
for i in range(self.num_hashes):
idx = mmh3.hash(key, i) % self.size
self.bit_array[idx] = 1
def might_contain(self, key):
"""Query (may have false positives, never false negatives)"""
for i in range(self.num_hashes):
idx = mmh3.hash(key, i) % self.size
if self.bit_array[idx] == 0:
return False # Definitely not present
return True # Possibly present
class SSTableWithBloom(SSTable):
"""SSTable with Bloom Filter"""
def __init__(self, filename, bloom_filter):
super().__init__(filename)
self.bloom = bloom_filter
def get(self, key):
if not self.bloom.might_contain(key):
return None # Skip disk I/O
return super().get(key)
RocksDB attaches a Bloom Filter to each SSTable (default 10 bits/key, ~1% false positive rate). This means 99% of miss queries complete with zero disk I/O.
20.7 Redis Skip List: Why Not B+Tree
Redis Sorted Sets (ZSET) use skip lists instead of B+Tree. Antirez (Redis creator Salvatore Sanfilippo) gave these reasons:
- Pure in-memory: All Redis data is in memory, B+Tree's disk I/O advantage disappears
- Implementation simplicity: Skip list code is ~1/3 the size of B+Tree, easier to maintain
- Efficient range queries: Skip list's forward pointers naturally support range scans
- Concurrency-friendly: Skip lists allow fine-grained locking; B+Tree splits require locking entire paths
- Memory flexibility: Skip list nodes can be scattered; no need for pre-allocated contiguous blocks
import random
class SkipListNode:
"""Skip list node"""
def __init__(self, key, value, level):
self.key = key
self.value = value
self.forward = [None] * (level + 1)
self.span = [0] * (level + 1)
class SkipList:
"""Skip list implementation (mimicking Redis ZSET)"""
MAX_LEVEL = 32
P = 0.25 # Redis uses 1/4 promotion probability
def __init__(self):
self.header = SkipListNode(None, None, self.MAX_LEVEL)
self.level = 0
self.length = 0
def _random_level(self):
"""Randomly determine new node's level"""
level = 0
while random.random() < self.P and level < self.MAX_LEVEL:
level += 1
return level
def insert(self, key, value):
"""Insert element"""
update = [None] * (self.MAX_LEVEL + 1)
current = self.header
for i in range(self.level, -1, -1):
while (current.forward[i] is not None and
current.forward[i].key < key):
current = current.forward[i]
update[i] = current
new_level = self._random_level()
if new_level > self.level:
for i in range(self.level + 1, new_level + 1):
update[i] = self.header
self.level = new_level
new_node = SkipListNode(key, value, new_level)
for i in range(new_level + 1):
new_node.forward[i] = update[i].forward[i]
update[i].forward[i] = new_node
self.length += 1
def search(self, key):
"""Search"""
current = self.header
for i in range(self.level, -1, -1):
while (current.forward[i] is not None and
current.forward[i].key < key):
current = current.forward[i]
current = current.forward[0]
if current is not None and current.key == key:
return current.value
return None
def range_by_score(self, min_score, max_score):
"""Range query (simulating ZRANGEBYSCORE)"""
current = self.header
for i in range(self.level, -1, -1):
while (current.forward[i] is not None and
current.forward[i].key < min_score):
current = current.forward[i]
current = current.forward[0]
results = []
while current is not None and current.key <= max_score:
results.append((current.key, current.value))
current = current.forward[0]
return results
Skip List vs B+Tree comparison:
| Feature | Skip List | B+Tree |
|---|---|---|
| Best scenario | Pure memory | Disk storage |
| Implementation | Simple | Complex |
| Lookup time | O(log n) expected | O(log_m n) deterministic |
| Range query | Bottom-level linked list | Leaf linked list |
| Space overhead | Avg 1.33 pointers/node (P=0.25) | Half-full guarantee |
| Concurrency | Lock-free/fine-grained | Needs latch coupling |
| Disk I/O | Poor (scattered nodes) | Excellent (node=page) |
Level 3 ยท What the Specifications Define
20.8 History and Design Decisions of B-Tree
20.8.1 Bayer & McCreight 1972
Rudolf Bayer and Edward McCreight published "Organization and Maintenance of Large Ordered Indexes" in 1972 (Acta Informatica), first proposing the B-Tree. They were working at Boeing Scientific Research Labs at the time.
Core contributions of the paper:
- Defined B-Tree structure and properties
- Proved all operations (search, insert, delete) are O(log n)
- Analyzed upper bounds on disk I/O
- Presented algorithms for node splitting and merging
Why called "B"-Tree? Bayer never gave a definitive explanation. Common interpretations include Boeing, Balanced, Bayer, and Broad. Douglas Comer wrote in his 1979 survey "The Ubiquitous B-Tree": "The origin of 'B-tree' has never been explained by the authors."
Historical context: In 1972, storage technology meant drums and disks with millisecond access times. Main memory (core memory) was measured in KB. The core constraint when designing B-Tree was: how to maintain large ordered indexes on extremely slow external storage? This question remains relevant 50+ years laterโthough SSDs are 100x faster than HDDs, they're still 1000x slower than DRAM.
20.8.2 Evolution of B+Tree
B+Tree has no single "inventor" paper. It evolved through database system practice:
- 1973: Knuth discussed B-Tree variants in "The Art of Computer Programming" Vol. 3
- 1977: IBM's System R project (origin of SQL) used B+Tree-like structures
- 1979: Douglas Comer's survey systematically summarized B-Tree variants including B+Tree
Two key design decisions of B+Tree over B-Tree:
Decision 1: Data only in leaves
- Rationale: No data in internal nodes โ smaller โ more keys per page โ larger fan-out โ shorter tree โ fewer I/O
- Cost: Point lookups always traverse to leaf (B-Tree might find it earlier)
- Tradeoff: For database workloads, reducing one level outweighs occasional saved I/O
Decision 2: Leaf linked list
- Rationale: Range queries need only "locate + sequential scan," no backtracking to parent
- Cost: Extra pointer space (typically negligible)
- Implementation: Modern systems use doubly-linked lists for both ASC and DESC scans
20.8.3 O'Neil 1996 LSM-Tree Paper
Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil published "The Log-Structured Merge-Tree (LSM-Tree)" in 1996 (Acta Informatica).
Core arguments:
- For write-intensive workloads, B-Tree's random write efficiency is extremely poor
- Converting all writes to sequential (write to memory first, batch flush) dramatically improves write throughput
- Trading some read degradation (checking multiple components) for orders-of-magnitude write improvement
O'Neil's original LSM-Tree had only two components (Cโ in memory, Cโ on disk). Google's BigTable paper (2006) later generalized LSM-Tree to multi-level structures, and LevelDB/RocksDB refined leveled compaction strategies.
Theoretical analysis:
Assume:
- B = disk block size
- M = size of Cโ in memory
- Sequential write bandwidth = W_seq
- Random write latency = T_rand
B+Tree write throughput: W_btree = 1 / T_rand (one random I/O per write) LSM-Tree write throughput: W_lsm โ W_seq / (write_amplification)
When W_seq >> 1/T_rand (always true on HDD), LSM-Tree write throughput far exceeds B+Tree. On SSDs, the sequential/random gap narrows, but LSM-Tree's write amplification consumes SSD P/E cyclesโa new dimension of tradeoff.
20.8.4 Mathematical Relationship Between Disk I/O and B-Tree
Let:
- n = total records
- m = B-Tree order (max children)
- h = tree height
Precise height bounds:
h โค log_{โm/2โ}((n+1)/2) + 1 (max height, nodes half-full)
h โฅ log_m(n+1) (min height, nodes full)
For InnoDB typical parameters (m โ 1170, n = 100 million):
- Min height: logโโโโ(100,000,001) โ 2.6 โ 3 levels
- Max height: logโ โโ (50,000,001) โ 2.78 โ 3 levels
Regardless of fill factor, 3 levels suffice. This is why DBAs say "InnoDB's B+Tree is almost always 3 levels tall."
Buffer Pool impact:
InnoDB's Buffer Pool (default 128MB, production typically 50-80% of physical memory) caches hot pages:
- Root node: nearly always in Buffer Pool (100% hit rate)
- Second-level internal nodes: 1170 pages ร 16KB = 18.3MB, almost all in memory
- Third-level leaf nodes: 1170ยฒ โ 1.37M pages ร 16KB โ 21.4GB, cached by access frequency
Conclusion: for hot data, actual disk I/O may be only 0-1 operations.
Level 4 ยท Edge Cases and Pitfalls
20.9 Connections to Other Books on This Site
This chapter relates closely to:
- High Performance MySQL - Indexing โ Practical InnoDB B+Tree index techniques: choosing index columns, analyzing EXPLAIN output, common index invalidation scenarios
- High Performance MySQL - Architecture โ InnoDB storage engine architecture, including Buffer Pool, Change Buffer, Redo Log and their B+Tree interactions
- Redis Deep Dive โ Skip list implementation in Redis ZSET, and other data structure choices
20.10 How to Answer "Why Does MySQL Use B+Tree" in Interviews
This is one of the most frequently asked backend interview questions. Here's a structured answer framework:
Layer 1: Disk I/O constraint "Database data lives on disk, where random I/O is slow (SSD ~100ฮผs, HDD ~10ms). We need a data structure with extremely low height so each lookup requires minimal disk I/O."
Layer 2: Why not red-black tree/AVL "Red-black trees are binaryโ100M records need ~27 levels. Each level means one I/O, so 27 disk accesses per lookup is unacceptable. B+Tree nodes can have thousands of branches; 100M records need only 3 levels."
Layer 3: Why not hash table "Hash tables don't support range queries (BETWEEN), sorting (ORDER BY), or leftmost prefix matching. Databases need these operations. Though InnoDB has an Adaptive Hash Index as supplement."
Layer 4: Why B+Tree over B-Tree "B+Tree internal nodes store no data, so each page holds more keys, making the tree shorter. Plus leaf nodes form a linked listโrange queries just locate the start and scan sequentially."
Bonus: Mention Buffer Pool "In practice, because of the Buffer Pool, the root and upper internal nodes are almost always in memory. Actual disk I/O is typically 0-1 operations."
20.11 B+Tree vs LSM-Tree Selection Guide
| Dimension | B+Tree | LSM-Tree |
|---|---|---|
| Read/write ratio | Read-heavy | Write-heavy |
| Latency profile | Stable (2-4 I/O each time) | Low write latency, occasional read spikes |
| Space amplification | Low (in-page fragmentation) | Medium (multi-version data) |
| Write amplification | Medium (page-granularity) | High (compaction rewrites) |
| Read amplification | Low (direct lookup) | High (multi-level search) |
| Typical systems | MySQL, PostgreSQL, Oracle | RocksDB, Cassandra, HBase |
| Use cases | OLTP, transactions | Logs, time-series, write-intensive |
Practical selection advice:
- Traditional OLTP (e-commerce, social, payments): 99% of the time choose MySQL/PostgreSQL (B+Tree)
- Log analytics, monitoring metrics, IoT data: consider ClickHouse/TDengine (columnar + LSM variants)
- High write throughput KV store: consider RocksDB (LSM-Tree)
- TiDB's storage layer TiKV uses RocksDB (LSM-Tree) but provides MySQL-compatible SQL interface aboveโthis is "LSM-Tree implementing B+Tree semantics" as an engineering solution
20.12 Common Interview Questions and Traps
Q: Are B+Tree leaf nodes a singly or doubly linked list? A: InnoDB uses a doubly linked list. File Header contains FIL_PAGE_PREV and FIL_PAGE_NEXT fields. This supports both forward scans (ASC) and backward scans (DESC).
Q: How much data can one InnoDB B+Tree store? A: Assuming bigint PK (8B), 1KB row size:
- Internal fan-out โ 16KB / (8+6)B โ 1170
- Rows per leaf โ 16KB / 1KB โ 16
- 3 levels: 1170 ร 16 = 18,720 rows
- 4 levels: 1170ยฒ ร 16 โ 21.9M rows
- But with 100B rows, 160 rows/leaf, 3 levels store 1170 ร 160 โ 187M rows
Q: Why did MongoDB previously use B-Tree instead of B+Tree? A: MongoDB's WiredTiger engine actually uses a B+Tree variant. The earlier MMAPV1 engine used a B-Tree-like structure because it directly memory-mapped files. Since MongoDB 4.0, WiredTiger is the defaultโessentially B+Tree.
Q: What problems does LSM-Tree's Compaction cause? A:
- Write amplification: Data is repeatedly read and rewritten, consuming SSD lifespan
- Space amplification: During compaction, both old and new copies exist simultaneously
- Latency spikes: Large compactions consume I/O bandwidth, affecting foreground reads/writes
Q: How does RocksDB solve LSM-Tree's read performance issues? A:
- Bloom Filter: Each SSTable has a Bloom Filter, skipping files without the target key
- Block Cache: Caches hot data blocks
- Prefix Bloom Filter: Bloom Filter on key prefixes for prefix queries
- Compression: Reduces actual I/O volume (LZ4/Snappy/ZSTD)
Chapter Summary
This chapter addressed the core problem of "efficiently organizing ordered data on external storage," introducing three key data structures:
- B-Tree: Increases node width to reduce height, bringing disk I/O from O(logโn) down to O(log_m n)
- B+Tree: Builds on B-Tree by storing data only in leaves and linking themโbecoming the database index standard
- LSM-Tree: Converts random writes to sequential writes, trading read performance for write throughput in write-intensive scenarios
They are not "replacement" relationships but optimal solutions for different workloads. Understanding their design tradeoffs is the watershed between "using a database" and "understanding a database."