From Algorithms to System Design
Chapter 42: From Algorithms to System Design
You spent 40 chapters mastering data structures and algorithms. Now the interviewer says: "Design a Twitter-like system." Suddenly your algorithm knowledge seems disconnected.
That's because most people treat "algorithm interviews" and "system design interviews" as completely different domains. They're not โ every core decision in system design is backed by a data structure or algorithm underneath. Consistent hashing is the distributed version of a hash table, B+Tree indexing is an engineering optimization of binary search trees, Bloom filters are probabilistic set membership queries, and priority scheduling in message queues is a heap application.
This chapter does four things: first, it shows how algorithms from earlier chapters manifest in real systems; second, it introduces new algorithms for distributed scenarios (Raft, CRDT, DHT); third, it creates cross-references with the Redis/Kafka/MySQL books on this site; fourth, it teaches you how to precisely reference algorithm knowledge in system design interviews for bonus points.
Level 1 ยท What You Need to Know
1.1 Algorithms Directly Manifested in System Design
Every system design decision has an algorithm behind it. Here are the most common mappings:
| System Component | Underlying Algorithm/DS | Book Chapter | Engineering Scenario |
|---|---|---|---|
| Load Balancing | Consistent Hashing | Ch 8 | Nginx, Memcached node assignment |
| Database Index | B+Tree, Skip List | Ch 20, 6 | MySQL InnoDB, Redis ZSet |
| Cache Eviction | LRU/LFU (Hash + Doubly-Linked List) | Ch 4, 8 | Redis maxmemory-policy |
| Deduplication | Bloom Filter | Ch 9 | Crawler URL dedup, spam filter |
| Search Engine | Inverted Index (Hash + Sorted List) | Ch 8, 11 | Elasticsearch |
| Message Priority | Heap / Priority Queue | Ch 12 | Kafka priority queue, task scheduling |
| Routing Table | Trie | Ch 10 | IP routing longest prefix match |
| Leaderboard | Skip List / Red-Black Tree | Ch 6, 21 | Redis ZRANGEBYSCORE |
| Rate Limiting | Sliding Window / Token Bucket | Ch 2 | API Gateway throttling |
| Distributed ID | Snowflake (Bit Manipulation) | Ch 31 | Twitter Snowflake |
Consistent Hashing (Reference: Chapter 8)
Scenario: You have N cache servers and need to decide which machine stores each key.
Naive approach: server = hash(key) % N. Problem: when adding or removing a server, almost all keys must be remapped โ meaning nearly all cache becomes invalid (cache stampede).
What does consistent hashing solve? When nodes change, on average only K/N keys need remapping (K = total keys, N = node count).
import hashlib
from bisect import bisect_right
class ConsistentHash:
"""Consistent hash ring implementation
Core idea:
1. Organize hash value space as a ring (0 ~ 2^32-1)
2. Each node occupies multiple positions on the ring (virtual nodes)
3. Each key finds its nearest node clockwise
Why virtual nodes (detailed in Ch 8):
- With only 1 position per physical node, load distribution is highly uneven
- With 150-200 virtual nodes, std deviation drops to 5-10% of mean
- This is the core contribution of Karger et al. (1997)
"""
def __init__(self, nodes: list[str], replicas: int = 150):
self.replicas = replicas
self.ring = []
self.hash_to_node = {}
for node in nodes:
self.add_node(node)
def _hash(self, key: str) -> int:
"""Use MD5 for uniform hash distribution"""
digest = hashlib.md5(key.encode()).hexdigest()
return int(digest[:8], 16)
def add_node(self, node: str):
"""Add physical node โ place replicas virtual nodes on ring"""
for i in range(self.replicas):
virtual_key = f"{node}#VN{i}"
h = self._hash(virtual_key)
self.ring.append(h)
self.hash_to_node[h] = node
self.ring.sort()
def remove_node(self, node: str):
"""Remove physical node โ remove all its virtual nodes"""
for i in range(self.replicas):
virtual_key = f"{node}#VN{i}"
h = self._hash(virtual_key)
self.ring.remove(h)
del self.hash_to_node[h]
def get_node(self, key: str) -> str:
"""Given a key, find responsible node (first clockwise)
Algorithm: binary search โ O(log(N * replicas))
"""
if not self.ring:
return None
h = self._hash(key)
idx = bisect_right(self.ring, h) % len(self.ring)
return self.hash_to_node[self.ring[idx]]
def demo_migration():
"""Demonstrate key migration when nodes change"""
nodes = [f"server-{i}" for i in range(5)]
ch = ConsistentHash(nodes)
assignments_before = {}
for i in range(10000):
key = f"user:{i}"
assignments_before[key] = ch.get_node(key)
ch.add_node("server-5")
migrated = 0
for i in range(10000):
key = f"user:{i}"
if ch.get_node(key) != assignments_before[key]:
migrated += 1
# Theoretical: 10000 / 6 โ 1667 (only the new node's share migrates)
print(f"Migrated keys: {migrated} / 10000 ({migrated/100:.1f}%)")
# Actual output ~16-20%, close to theoretical 1/6 โ 16.7%
Bloom Filter (Reference: Chapter 9)
Scenario: A crawler system has 1 billion crawled URLs. Each new URL needs to be checked against the existing set. A HashSet storing 1 billion URLs requires ~60GB memory. A Bloom filter needs only ~1.2GB (at 1% false positive rate).
import mmh3
from bitarray import bitarray
class BloomFilter:
"""Bloom filter implementation
Mathematical foundation (detailed in Ch 9):
- m = -(n * ln(p)) / (ln(2))ยฒ optimal bit count
- k = (m/n) * ln(2) optimal hash function count
- where n = expected items, p = target false positive rate
Space advantage: HashSet needs O(n * element_size)
Bloom filter needs only O(n) bits, independent of element size
Trade-offs:
- False Positives: "exists" might be wrong
- No False Negatives: "doesn't exist" is always correct
- No deletion support (Counting Bloom Filter can, but doubles space)
"""
def __init__(self, expected_items: int, fp_rate: float = 0.01):
import math
self.size = int(-(expected_items * math.log(fp_rate)) / (math.log(2) ** 2))
self.num_hashes = int((self.size / expected_items) * math.log(2))
self.bit_array = bitarray(self.size)
self.bit_array.setall(0)
def add(self, item: str):
"""Add element: set corresponding bit for each hash function to 1"""
for i in range(self.num_hashes):
idx = mmh3.hash(item, i) % self.size
self.bit_array[idx] = 1
def contains(self, item: str) -> bool:
"""Query: all hash bits are 1 โ might exist; any is 0 โ definitely absent"""
for i in range(self.num_hashes):
idx = mmh3.hash(item, i) % self.size
if not self.bit_array[idx]:
return False
return True
class WebCrawler:
"""Crawler with Bloom filter deduplication"""
def __init__(self, expected_urls: int = 1_000_000_000):
self.bloom = BloomFilter(expected_urls, fp_rate=0.001)
def should_crawl(self, url: str) -> bool:
if self.bloom.contains(url):
return False
return True
def mark_crawled(self, url: str):
self.bloom.add(url)
B+Tree Index (Reference: Chapter 20)
Scenario: MySQL executing SELECT * FROM users WHERE age BETWEEN 25 AND 35 โ how to efficiently find all matching rows?
class BPlusTreeNode:
"""B+Tree node โ simplified implementation showing core concepts
Why databases choose B+Tree over Red-Black Tree/AVL?
1. Disk-friendly: each node is a disk page (typically 16KB), minimizing I/O
2. Efficient range queries: leaf nodes form a sorted linked list
3. High fanout: internal nodes store only keys (not data)
โ Tree height is extremely low (3-4 levels can index billions of rows)
Comparison with BST:
- BST: height O(logโ n) โ 1 billion rows needs 30 comparisons โ 30 disk I/Os
- B+Tree (order=1000): height O(logโโโโ n) โ 1 billion rows needs only 3 I/Os!
"""
def __init__(self, is_leaf: bool = False, order: int = 4):
self.is_leaf = is_leaf
self.keys = []
self.children = []
self.next_leaf = None
self.order = order
def range_query_explanation():
"""B+Tree range query process
SQL: SELECT * FROM users WHERE age BETWEEN 25 AND 35
Steps:
1. From root, binary search to locate leaf node for age=25 โ O(log_B n) I/Os
2. Find first position >= 25 within leaf node
3. Scan along leaf linked list until age > 35 โ O(result_size / page_size) I/Os
Key advantages:
- Location cost is minimal (3-4 I/Os)
- Range scan is sequential I/O (SSD sequential read 3GB/s vs random 100MB/s)
"""
pass
1.2 From Single-Machine to Distributed Algorithms
When your system scales from one machine to many, single-machine "certainties" disappear:
| Single-Machine Assumption | Distributed Reality |
|---|---|
| Operations are atomic | Network can drop/delay/reorder packets |
| Unified clock | Each machine's clock drifts differently |
| No failures | Any node can crash at any time |
| Shared memory | Data must travel over network |
These changes gave birth to distributed algorithms. Here are the three you're most likely to encounter in system design interviews:
Raft Consensus Algorithm (brief introduction):
class RaftState:
"""Raft consensus algorithm state machine model
Problem solved: How do multiple nodes agree on "the order of log entries"?
Source: Ongaro & Ousterhout, "In Search of an Understandable Consensus Algorithm"
(USENIX ATC 2014) โ specifically designed as an understandable Paxos alternative
Core roles:
- Leader: receives client requests, replicates log to Followers
- Follower: passively receives log entries
- Candidate: competes to become Leader during election
Key invariants:
- At most one Leader per term
- Leader's log contains all committed entries
- Majority agreement suffices for commitment
"""
FOLLOWER = "follower"
CANDIDATE = "candidate"
LEADER = "leader"
def __init__(self, node_id: int, total_nodes: int):
self.node_id = node_id
self.state = self.FOLLOWER
self.current_term = 0
self.voted_for = None
self.log = []
self.commit_index = 0
self.majority = total_nodes // 2 + 1
def request_vote(self, candidate_term: int, candidate_id: int,
last_log_index: int, last_log_term: int) -> bool:
"""Vote RPC: decide whether to vote for candidate
Conditions:
1. candidate_term >= my term
2. Haven't voted this term (or already voted for this candidate)
3. Candidate's log is at least as up-to-date (compare term first, then index)
"""
if candidate_term < self.current_term:
return False
if candidate_term > self.current_term:
self.current_term = candidate_term
self.state = self.FOLLOWER
self.voted_for = None
if self.voted_for is not None and self.voted_for != candidate_id:
return False
my_last_term = self.log[-1][0] if self.log else 0
my_last_index = len(self.log)
log_is_up_to_date = (
candidate_term > my_last_term or
(candidate_term == my_last_term and last_log_index >= my_last_index)
)
if log_is_up_to_date:
self.voted_for = candidate_id
return True
return False
1.3 Template for Referencing Algorithms in System Design Interviews
How to precisely reference algorithm knowledge during interviews:
# Interviewer: "How would you design a distributed cache system?"
#
# Your response framework:
SYSTEM_DESIGN_WITH_ALGORITHMS = """
1. Data Sharding
โ Consistent hashing (Karger et al., 1997)
โ 150 virtual nodes per physical node for uniform distribution
โ Node changes only require migrating 1/N of data
2. Cache Eviction
โ LRU policy: hash table + doubly-linked list, O(1) get/put
โ If frequency matters: LFU or W-TinyLFU (Caffeine's implementation)
3. Hot Key Detection
โ Count-Min Sketch (probabilistic DS, Ch 9)
โ Estimate per-key access frequency with O(1) space
4. Cache Penetration Protection
โ Bloom filter to check if key might exist in database
โ Definitely-absent keys return immediately without DB query
5. Consistency Guarantee
โ Strong consistency needed: Raft replicated log
โ Eventual consistency acceptable: gossip protocol + vector clocks
"""
1.4 Common Mistakes
Mistake 1: Confusing algorithm application levels
# โ Wrong: "Use red-black tree for database index"
# Red-black tree is an in-memory structure; disk should use B+Tree
# โ Wrong: "Use DFS for service discovery"
# DFS is a graph traversal algorithm; service discovery uses gossip or registries
# โ
Correct mappings:
CORRECT_MAPPINGS = {
"In-memory index": "Red-black tree, skip list (Redis)",
"On-disk index": "B+Tree, LSM-Tree",
"In-memory cache eviction": "LRU (hash + doubly-linked list)",
"Distributed routing": "Consistent hashing",
"Approximate counting": "HyperLogLog",
"Approximate membership": "Bloom filter",
"Distributed consensus": "Raft / Paxos",
}
Mistake 2: Over-engineering small-scale systems
When the interviewer says "100K users," you don't need consistent hashing โ a single Redis instance suffices. Algorithm and data structure choices must match the scale.
Level 2 ยท How It Works
2.1 Deep Dive into Distributed Algorithms
Raft Engineering Implementation Details
Raft is used by etcd (Kubernetes' underlying storage), CockroachDB, and TiKV (TiDB's storage engine). Let's analyze its core mechanisms:
Mathematical guarantee of election mechanism:
import random
class RaftElection:
"""Raft election key design decisions
Why randomized timeout?
- If all nodes start elections simultaneously, Leader may never be elected (livelock)
- Random timeout [150ms, 300ms] ensures one node likely times out first
- Math: if timeout range >> network RTT, conflict probability โ 0
Timeout range selection (Ongaro's paper suggests):
- Lower bound > average network RTT (avoid frequent elections)
- Upper bound < acceptable unavailability duration
- Range = upper - lower >> average RTT (reduce conflicts)
"""
def __init__(self, heartbeat_interval: int = 50):
self.heartbeat_interval = heartbeat_interval
self.election_timeout_range = (
heartbeat_interval * 3,
heartbeat_interval * 6
)
def random_election_timeout(self) -> int:
return random.randint(*self.election_timeout_range)
def simulate_election(self, num_nodes: int = 5,
network_delay_ms: int = 5) -> dict:
"""Simulate one election round
Best case: 1 RTT to complete (150 + 5 = 155ms)
Worst case: multiple conflicts (rare, expected < 2 rounds)
"""
timeouts = [self.random_election_timeout() for _ in range(num_nodes)]
first_candidate = min(range(num_nodes), key=lambda i: timeouts[i])
election_start = timeouts[first_candidate]
vote_collection_time = election_start + network_delay_ms
second_timeout = sorted(timeouts)[1]
conflict = (second_timeout - election_start) < network_delay_ms
return {
"winner": first_candidate,
"time_ms": vote_collection_time,
"conflict": conflict,
"rounds": 2 if conflict else 1,
}
CRDT: Conflict-free Replicated Data Types
class GCounter:
"""G-Counter (Grow-only Counter) โ simplest CRDT
Source: Shapiro et al., "A comprehensive study of Convergent and
Commutative Replicated Data Types" (INRIA Technical Report, 2011)
Core idea:
- Each node maintains its own counter
- Global value = sum of all node counters
- Merge operation = take max of each node's counter
Why CRDT?
- Traditional: either locks (poor perf) or consensus (high latency)
- CRDT: lock-free, consensus-free, eventually consistent
- Cost: only supports specific operations (monotonic increase/decrease/add/remove)
"""
def __init__(self, node_id: str, num_nodes: int):
self.node_id = node_id
self.counters = {}
def increment(self, amount: int = 1):
self.counters[self.node_id] = self.counters.get(self.node_id, 0) + amount
def value(self) -> int:
return sum(self.counters.values())
def merge(self, other: 'GCounter'):
"""Merge: take max for each node
Key properties (why it's conflict-free):
- Commutative: merge(A, B) == merge(B, A)
- Associative: merge(merge(A, B), C) == merge(A, merge(B, C))
- Idempotent: merge(A, A) == A
These three properties guarantee eventual consistency regardless of message order
"""
for node_id, count in other.counters.items():
self.counters[node_id] = max(self.counters.get(node_id, 0), count)
class PNCounter:
"""PN-Counter (supports increment and decrement)
Idea: two G-Counters โ one for increments, one for decrements
value = P.value() - N.value()
"""
def __init__(self, node_id: str, num_nodes: int):
self.p = GCounter(node_id, num_nodes)
self.n = GCounter(node_id, num_nodes)
def increment(self, amount: int = 1):
self.p.increment(amount)
def decrement(self, amount: int = 1):
self.n.increment(amount)
def value(self) -> int:
return self.p.value() - self.n.value()
def merge(self, other: 'PNCounter'):
self.p.merge(other.p)
self.n.merge(other.n)
class LWWRegister:
"""Last-Writer-Wins Register โ timestamp-based CRDT
Application: concurrent updates to a single field in distributed DBs
Limitations:
- Depends on clock synchronization (NTP accuracy typically 1-10ms)
- If two writes happen within clock error, result is non-deterministic
- Solution: Hybrid Logical Clock (HLC) โ Kulkarni et al. (2014)
"""
def __init__(self, node_id: str):
self.node_id = node_id
self.value = None
self.timestamp = 0
def write(self, value, timestamp: int):
if timestamp > self.timestamp:
self.value = value
self.timestamp = timestamp
def merge(self, other: 'LWWRegister'):
if other.timestamp > self.timestamp:
self.value = other.value
self.timestamp = other.timestamp
Distributed Hash Table (DHT)
class KademliaNode:
"""Kademlia DHT โ distributed hash table for P2P networks
Source: Maymounkov & Mazieres, "Kademlia: A Peer-to-peer Information
System Based on the XOR Metric" (IPTPS 2002)
Used by BitTorrent, IPFS, Ethereum
Core ideas:
1. Each node has a 160-bit ID
2. Distance = XOR โ satisfies triangle inequality without physical proximity
3. Each node maintains 160 "k-buckets"; bucket i stores nodes at distance [2^i, 2^(i+1))
4. Lookup: each step halves distance โ O(log n) steps to find target
vs Consistent Hashing:
- Consistent hashing: needs centralized ring info โ suitable for data centers
- DHT: fully decentralized, each node knows O(log n) others โ suitable for P2P
"""
def __init__(self, node_id: int, k: int = 20):
self.node_id = node_id
self.k = k
self.buckets = [[] for _ in range(160)]
self.data = {}
def distance(self, id1: int, id2: int) -> int:
return id1 ^ id2
def bucket_index(self, other_id: int) -> int:
dist = self.distance(self.node_id, other_id)
if dist == 0:
return 0
return dist.bit_length() - 1
def find_closest_nodes(self, target_id: int, count: int = 20) -> list:
"""Find count closest nodes to target_id in routing table
Core Kademlia operation: returned nodes are closer to target each time
โ guarantees O(log n) convergence
"""
all_nodes = []
for bucket in self.buckets:
all_nodes.extend(bucket)
all_nodes.sort(key=lambda n: self.distance(n, target_id))
return all_nodes[:count]
2.2 Deep Correspondence Between Algorithms and System Components
LSM-Tree vs B+Tree: Write-Optimized vs Read-Optimized
class LSMTreeConcept:
"""LSM-Tree (Log-Structured Merge Tree) concept
Source: O'Neil et al., "The Log-Structured Merge-Tree (LSM-Tree)" (1996)
Used by LevelDB, RocksDB, Cassandra, HBase
Core idea:
1. Writes go to memory first (MemTable, usually skip list or red-black tree)
2. When MemTable is full, flush to disk as SSTable (Sorted String Table)
3. Background periodic compaction merges multiple SSTables
Comparison with B+Tree:
โโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโ
โ โ B+Tree โ LSM-Tree โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโผโโโโโโโโโโโโโโโโโโโโโโค
โ Write amplif โ High (random disk) โ Low (sequential) โ
โ Read amplif โ Low (1 lookup) โ High (check layers) โ
โ Space amplif โ Low โ Medium (versions) โ
โ Best for โ Read-heavy (MySQL) โ Write-heavy (logs) โ
โโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโดโโโโโโโโโโโโโโโโโโโโโโ
"""
def __init__(self, memtable_size: int = 4 * 1024 * 1024):
self.memtable = {}
self.memtable_size = memtable_size
self.sstables = []
def put(self, key: str, value: str):
"""Write: WAL (omitted) then MemTable
Time: O(log n) skip list insertion
Disk I/O: only WAL sequential append (1 write)
"""
self.memtable[key] = value
if len(self.memtable) >= self.memtable_size:
self._flush()
def get(self, key: str) -> str:
"""Read: check MemTable โ then SSTables newest to oldest
Worst case: check all levels โ use Bloom filter to skip irrelevant SSTables
"""
if key in self.memtable:
return self.memtable[key]
for sstable in self.sstables:
if key in sstable:
return sstable[key]
return None
def _flush(self):
sstable = dict(sorted(self.memtable.items()))
self.sstables.insert(0, sstable)
self.memtable = {}
Skip List in Redis
import random
class SkipListNode:
def __init__(self, key: float, value: str, level: int):
self.key = key
self.value = value
self.forward = [None] * (level + 1)
self.span = [0] * (level + 1)
class SkipList:
"""Skip List โ underlying implementation of Redis ZSET
Why Redis chose skip list over red-black tree? (antirez's explanation)
1. Simpler: skip list ~100 LOC vs red-black tree 300+ LOC
2. Range operations naturally efficient: ZRANGEBYSCORE just finds start, then scans
3. Controllable memory: average 1.33 pointers/node (p=1/4)
4. Comparable performance: O(log n) find/insert/delete
In system design interviews:
"Redis leaderboards use skip lists. Expected height is O(log n),
both point queries and range queries are O(log n + k). Compared to
red-black trees, skip lists don't need in-order traversal for ranges โ
they scan along the bottom-level linked list directly."
"""
MAX_LEVEL = 32
P = 0.25
def __init__(self):
self.header = SkipListNode(float('-inf'), '', self.MAX_LEVEL)
self.level = 0
self.length = 0
def _random_level(self) -> int:
"""Geometric distribution: P(level=k) = p^(k-1) * (1-p)
Expected level = 1/(1-p) = 1.33 when p=0.25
"""
level = 0
while random.random() < self.P and level < self.MAX_LEVEL:
level += 1
return level
def insert(self, key: float, value: str):
"""Insert โ O(log n) expected time"""
update = [None] * (self.MAX_LEVEL + 1)
current = self.header
for i in range(self.level, -1, -1):
while current.forward[i] and current.forward[i].key < key:
current = current.forward[i]
update[i] = current
new_level = self._random_level()
if new_level > self.level:
for i in range(self.level + 1, new_level + 1):
update[i] = self.header
self.level = new_level
new_node = SkipListNode(key, value, new_level)
for i in range(new_level + 1):
new_node.forward[i] = update[i].forward[i]
update[i].forward[i] = new_node
self.length += 1
def range_by_score(self, min_score: float, max_score: float) -> list:
"""Range query โ O(log n + k), k = result size
This is the core logic of Redis ZRANGEBYSCORE
"""
current = self.header
for i in range(self.level, -1, -1):
while current.forward[i] and current.forward[i].key < min_score:
current = current.forward[i]
current = current.forward[0]
result = []
while current and current.key <= max_score:
result.append((current.key, current.value))
current = current.forward[0]
return result
2.3 Cross-References with Other Books on This Site
The technical books on this site (dev.yiteai.com) form a knowledge network:
| Algorithm/Concept | Related Book | Related Section | Connection |
|---|---|---|---|
| Consistent Hashing | "Redis In Depth" | Cluster Sharding | Redis Cluster uses 16384 hash slots |
| Bloom Filter | "Redis In Depth" | Data Structures | Redis 4.0+ built-in BF.ADD/BF.EXISTS |
| B+Tree | "High Performance MySQL" | Indexing | InnoDB clustered index structure |
| LSM-Tree | "High Performance MySQL" | MyRocks | RocksDB storage engine core |
| Priority Queue/Heap | "Kafka In Depth" | Message Scheduling | Delayed messages, priority queues |
| Raft Consensus | "Redis In Depth" | Sentinel/Cluster | Redis Sentinel uses Raft-like leader election |
| CRDT | "Redis In Depth" | Active-Active | Redis Enterprise conflict resolution |
2.4 Data Structures in Message Queues
class DelayQueue:
"""Delay message queue โ based on min-heap
Scenarios:
- Auto-cancel orders unpaid after 30 minutes
- Scheduled push notifications
- Retry backoff (1s, 2s, 4s, 8s...)
In Kafka (detailed in the Kafka book on this site):
- Kafka doesn't natively support delay messages
- Common approach: multiple delay topics (delay_1s, delay_5s, delay_30m...)
- Or use Hierarchical Timing Wheel
"""
def __init__(self):
self.heap = []
self.counter = 0
def put(self, message: str, delay_seconds: float):
import time, heapq
execute_at = time.time() + delay_seconds
heapq.heappush(self.heap, (execute_at, self.counter, message))
self.counter += 1
def poll(self) -> str:
import time, heapq
if not self.heap:
return None
if self.heap[0][0] <= time.time():
_, _, message = heapq.heappop(self.heap)
return message
return None
class TimingWheel:
"""Timing Wheel โ used by Kafka/Netty for timers
Source: Varghese & Lauck, "Hashed and Hierarchical Timing Wheels" (1996)
Why not a heap?
- Heap insert/delete: O(log n)
- Timing wheel insert/delete: O(1)
- With millions of timers, the difference is massive
Core idea (clock analogy):
- Second hand: 60 slots, 1 second each
- Minute hand: 60 slots, 1 minute each
- Hour hand: 24 slots, 1 hour each
Tasks placed in appropriate slots; "hand" advances one slot per tick.
When higher-level slot is reached, tasks cascade down to lower level.
"""
def __init__(self, tick_ms: int = 1, wheel_size: int = 60):
self.tick_ms = tick_ms
self.wheel_size = wheel_size
self.buckets = [[] for _ in range(wheel_size)]
self.current_tick = 0
self.overflow_wheel = None
def add_task(self, task_id: str, delay_ticks: int):
"""Add timed task โ O(1)"""
if delay_ticks < self.wheel_size:
bucket_idx = (self.current_tick + delay_ticks) % self.wheel_size
self.buckets[bucket_idx].append((task_id, delay_ticks))
else:
if self.overflow_wheel is None:
self.overflow_wheel = TimingWheel(
self.tick_ms * self.wheel_size,
self.wheel_size
)
self.overflow_wheel.add_task(task_id, delay_ticks // self.wheel_size)
def advance(self):
"""Advance one tick โ O(tasks in current bucket)"""
self.current_tick = (self.current_tick + 1) % self.wheel_size
expired_tasks = self.buckets[self.current_tick]
self.buckets[self.current_tick] = []
return expired_tasks
Level 3 ยท How the Standard Defines It
3.1 CAP Theorem and Algorithm Choice
CAP Theorem (Brewer, 2000; proven by Gilbert & Lynch, 2002) states a distributed system cannot simultaneously guarantee all three:
- Consistency: all nodes see the same data at the same time
- Availability: every request receives a non-error response
- Partition tolerance: system continues operating during network partitions
Since network partitions are inevitable (P must be chosen), the real choice is CP vs AP:
CAP_ALGORITHM_MAPPING = {
"CP (Consistency Priority)": {
"algorithms": "Raft / Paxos / ZAB",
"behavior": "Minority partition stops serving during network split",
"systems": ["etcd", "ZooKeeper", "CockroachDB", "Spanner"],
"use_cases": "Financial transactions, distributed locks, config management",
},
"AP (Availability Priority)": {
"algorithms": "Gossip / CRDT / Last-Writer-Wins",
"behavior": "All nodes continue serving, but data may be inconsistent",
"systems": ["Cassandra", "DynamoDB", "Redis Cluster (some scenarios)"],
"use_cases": "Social media, shopping carts, DNS",
},
}
3.2 System Design Interview Scoring Dimensions
Based on Meta system design interviewer talks (InfoQ 2023) and the framework from Designing Data-Intensive Applications (Kleppmann, 2017):
| Dimension | Weight | Algorithm Connection |
|---|---|---|
| Requirements Analysis | 15% | Estimate QPS/storage โ determines needed complexity |
| High-Level Design | 25% | Component selection โ data structure choices underneath |
| Deep Dive | 35% | Detail questions โ requires explaining underlying algorithms |
| Trade-off Discussion | 25% | Trade-off analysis โ time/space complexity trade-offs |
Algorithm reference example in Deep Dive:
# Interviewer: "How do you keep the leaderboard updated in real-time?"
#
# Your answer:
"""
Use Redis ZSET (Sorted Set).
Underlying implementation is skip list + hash table:
- Skip list: maintains score ordering, supports O(log n) range queries
- Hash table: O(1) single key lookup
Operation complexity:
- ZADD (update score): O(log n)
- ZRANK (get rank): O(log n) โ via span accumulation
- ZRANGEBYSCORE (get top-k): O(log n + k)
Why not MySQL ORDER BY?
- Each query needs full table sort or index scan
- Under high concurrency (e.g., live-streaming tip leaderboard), MySQL can't handle it
- Redis skip list updates are atomic, naturally supports concurrency
Capacity estimation:
- 10M user leaderboard โ ZSET ~200MB
- If partitioned by game, 100K per partition โ single ZSET ~2MB
"""
3.3 Classic Distributed System Theorems
| Theorem/Algorithm | Authors | Year | Core Content | Interview Reference |
|---|---|---|---|---|
| FLP Impossibility | Fischer, Lynch, Paterson | 1985 | Cannot guarantee consensus with even 1 failure in async systems | Why timeouts are needed |
| CAP Theorem | Brewer (Gilbert & Lynch proof) | 2000/2002 | CP vs AP choice | Foundation of all SD trade-offs |
| Paxos | Lamport | 1989/1998 | First proven-correct consensus algorithm | Background; explain why Raft is more practical |
| Raft | Ongaro & Ousterhout | 2014 | Understandable consensus | Strong consistency scenarios |
| Gossip | Demers et al. | 1987 | Epidemic-style propagation | AP system dissemination |
| Vector Clock | Lamport / Fidge | 1978/1988 | Determine causal relationships | Resolve concurrent conflicts |
| Consistent Hashing | Karger et al. | 1997 | Minimize remapping | Distributed cache/storage sharding |
3.4 Consistency Model Spectrum
From strongest to weakest, with corresponding implementation algorithms:
CONSISTENCY_SPECTRUM = {
"Linearizability": {
"guarantee": "Operations appear to execute atomically at a single point in time",
"algorithm": "Raft/Paxos + all reads/writes through Leader",
"cost": "High latency (every write needs majority confirmation)",
"systems": "Spanner, CockroachDB",
},
"Sequential Consistency": {
"guarantee": "All operations have a global order; each process preserves its order",
"algorithm": "Total Order Broadcast",
"cost": "Medium latency",
"systems": "ZooKeeper (writes)",
},
"Causal Consistency": {
"guarantee": "Causally related operations are ordered; concurrent ops may be unordered",
"algorithm": "Vector clocks / version vectors",
"cost": "Lower latency, allows some concurrency",
"systems": "MongoDB (w:majority + causal sessions)",
},
"Eventual Consistency": {
"guarantee": "If no new writes, all replicas eventually converge to same value",
"algorithm": "Gossip + CRDT / Last-Writer-Wins",
"cost": "Lowest latency, but may read stale data",
"systems": "DynamoDB, Cassandra, DNS",
},
}
3.5 Back-of-Envelope Estimation in System Design
This is an essential interview skill โ it directly determines which algorithms and data structures you choose:
# Jeff Dean's "Numbers Every Programmer Should Know" (2012, continuously updated)
LATENCY_NUMBERS = {
"L1 cache reference": 0.5, # ns
"L2 cache reference": 7, # ns
"Main memory reference": 100, # ns
"SSD random read": 150_000, # ns = 150 ยตs
"HDD random read": 10_000_000, # ns = 10 ms
"Network round trip (same DC)": 500_000, # ns = 0.5 ms
"Network round trip (cross-city)": 50_000_000, # ns = 50 ms
}
def estimate_scale() -> dict:
"""System design capacity estimation template
Talk through it during interview:
'DAU 10M, 10 requests/user/day โ QPS = 10M * 10 / 86400 โ 1200'
'Peak at 3x โ 3600 QPS โ single Redis handles easily'
"""
estimation = {
"DAU": 10_000_000,
"actions_per_user_per_day": 10,
"total_daily_requests": 100_000_000,
"average_QPS": 100_000_000 / 86400, # โ 1157
"peak_QPS": 1157 * 3, # โ 3472
"tweet_size_bytes": 280,
"new_tweets_per_day": 10_000_000,
"daily_data_growth": 10_000_000 * 280, # โ 2.8 GB
"yearly_data": 2.8 * 365, # โ 1 TB
}
if estimation["peak_QPS"] < 10000:
estimation["cache_strategy"] = "Single Redis sufficient (100K+ QPS)"
elif estimation["peak_QPS"] < 100000:
estimation["cache_strategy"] = "Redis Cluster (3-5 nodes)"
else:
estimation["cache_strategy"] = "Multi-level cache (local + Redis Cluster)"
return estimation
Level 4 ยท Edge Cases and Pitfalls
4.1 Common Algorithm Mistakes in System Design Interviews
Mistake 1: Choosing the wrong "level" of data structure
# โ "Use red-black tree for database index"
# Problem: RBT is in-memory, one I/O per node
# 1M rows, height 20 โ 20 random disk I/Os โ 200ms (HDD)
#
# โ
"Use B+Tree for database index"
# B+Tree order=1000, 1M rows, height 2 โ 2 I/Os โ 20ms
# โ "Use B+Tree for in-memory database index"
# Problem: B+Tree optimizes for disk (large nodes reduce I/O)
# In memory, no I/O cost; B+Tree's cache line utilization is worse
#
# โ
"Use skip list for in-memory index" (Redis's choice)
# Or "ART (Adaptive Radix Tree)" (modern in-memory DB choice)
Mistake 2: Confusing exact vs approximate scenarios
# Scenario: Count unique visitors (UV)
#
# โ "Use HashSet to store all user IDs"
# Problem: 1B users ร 8 bytes/ID = 8GB memory
#
# โ
Analysis: Does UV need to be exact?
# - For billing (ad billing) โ need exact โ use bitmap + sharding
# - For analytics dashboard โ 2% error acceptable โ HyperLogLog (12KB memory!)
# Redis HyperLogLog
"""
PFADD page_views:2024-01-01 "user:12345"
PFADD page_views:2024-01-01 "user:67890"
PFCOUNT page_views:2024-01-01 # Get UV estimate
# Standard error 0.81%, fixed 12KB memory, independent of cardinality!
"""
Mistake 3: Ignoring the "hot key problem"
# Scenario: Consistent hashing distributes cache
#
# Problem: A celebrity posts โ billions of requests for one key โ node overload
# Consistent hashing can't solve hotspots (it ensures uniform distribution,
# not uniform access)
#
# Solutions:
HOT_KEY_SOLUTIONS = {
"Local cache": "After hot key detection, cache locally on app servers (short TTL)",
"Key spreading": "Add suffixes to hot key (key_1, key_2, ...key_N), spread across N nodes",
"Read replicas": "Dedicated read-only replicas for hot keys",
}
# How to detect hot keys?
# โ Count-Min Sketch (probabilistic DS, Chapter 9)
# โ Estimate per-key frequency with O(1) space at gateway layer
Mistake 4: Pursuing strong consistency in AP systems
# Interviewer: "Design a distributed shopping cart"
#
# โ "Use Raft for strong consistency of cart data"
# Problems:
# - Shopping cart doesn't need strong consistency
# - Raft makes minority partition unavailable โ users can't modify cart
#
# โ
"Use CRDT (or LWW) for eventual consistency, AP priority"
# - All nodes continue serving during network partition
# - Auto-merge after partition heals (CRDT conflict resolution)
# - Edge case: deleted item might reappear (acceptable, user just deletes again)
4.2 How to "Showcase" Algorithm Knowledge in System Design
Strategy 1: Natural introduction, not forced
# โ Forced: "Here I'll use consistent hashing. Consistent hashing was proposed
# by Karger in 1997..." (Interviewer: I know what it is, just tell me how you'll use it)
# โ
Natural: "Data sharding via consistent hashing, 150 virtual nodes per physical node.
# This way node changes only require migrating ~1/N of data, avoiding cache stampede.
# I can explain why 150 virtual nodes if you'd like to dive deeper."
Strategy 2: Use algorithms to explain design decisions
# Interviewer: "Why Redis ZSET instead of MySQL sorting?"
#
# Good answer:
"""
Two reasons, both related to underlying data structures:
1. Write performance:
- ZSET uses skip list, O(log n) score update, pure memory operation
- MySQL index update needs B+Tree node splits, involves disk I/O
- At 10K+ QPS leaderboard, ZSET latency < 1ms, MySQL might > 10ms
2. Range queries:
- ZRANGEBYSCORE scans along skip list bottom-level list, O(log n + k)
- MySQL ORDER BY LIMIT with large offset must skip preceding rows
- For deep pagination, ZSET cursor mode via ZRANGEBYSCORE is more efficient
"""
Strategy 3: Proactively present trade-offs
# Interviewer: "How to implement globally unique IDs?"
UNIQUE_ID_TRADEOFFS = {
"UUID v4": {
"algorithm": "128-bit random number",
"pros": "No coordination, fully decentralized",
"cons": "Unordered (B+Tree index splits frequently), 128-bit too long",
"good_for": "Scenarios not requiring ordering",
},
"Snowflake": {
"algorithm": "timestamp(41bit) + machine_id(10bit) + sequence(12bit) = 64bit",
"pros": "Trend-increasing (B+Tree friendly), 64-bit compact",
"cons": "Clock-dependent (clock skew causes duplicates)",
"good_for": "Most internet scenarios",
},
"DB Auto-increment": {
"algorithm": "Centralized counter",
"pros": "Strictly increasing, simple",
"cons": "Single point bottleneck, poor scalability",
"good_for": "Small-scale systems, order numbers",
},
}
4.3 Mental Model Shift: Algorithm Interview to System Design
| Algorithm Interview Thinking | System Design Thinking |
|---|---|
| Pursue optimal solution | Pursue "good enough" + clear trade-offs |
| Focus on single-machine performance | Focus on distributed scalability |
| Correctness first | Availability/consistency/partition trade-offs |
| Exact answers | Order-of-magnitude estimates |
| Code implementation | Architecture diagrams + component selection |
| Edge cases | Failure modes (node crash, network partition, disk full) |
4.4 Real System Case Studies
Case 1: Google Bigtable's Bloom Filters
"""
Bigtable (Chang et al., 2006) read path:
1. Client requests a row key
2. Check MemTable (memory) โ O(log n) skip list lookup
3. If not in MemTable, need to check multiple SSTables on disk
4. Each SSTable has a Bloom filter:
- BF says "absent" โ skip this SSTable (saves one disk I/O)
- BF says "maybe present" โ read this SSTable
5. Experimental data: Bloom filters reduced disk accesses by 10x
Impact of algorithm choice:
- Without BF: each read checks N SSTables โ N random I/Os
- With BF: expected 1 SSTable check โ 1 random I/O
- Cost: ~10 bits/key extra storage per SSTable for Bloom filter
"""
Case 2: Kafka's Timing Wheel
"""
Kafka manages millions of delayed operations (timeout requests, delayed ACKs).
Naive approach: priority queue (heap)
- Insert O(log n), delete-min O(log n)
- n = millions โ 20 comparisons per operation โ high CPU overhead
Kafka's approach: Hierarchical Timing Wheel
- Insert O(1), expiry check O(1)
- Handles long delays via overflow to higher-level wheels
Performance comparison (Kafka official benchmark):
- Heap: 1M timers, throughput ~500K ops/sec
- Timing wheel: 1M timers, throughput ~5M ops/sec (10x improvement)
"""
Case 3: Redis Cluster Hash Slots vs Consistent Hashing
"""
Redis Cluster uses "hash slots" instead of traditional consistent hashing:
- Fixed 16384 slots
- CRC16(key) % 16384 โ slot number
- Each node is responsible for some slots
Why not consistent hash ring? (antirez's explanation)
1. Hash slots are easier to manage: moving a slot = moving a group of keys
2. Config info is more compact: 16384 bits = 2KB describes entire cluster allocation
3. Consistent hashing virtual node management is complex
This is a classic engineering trade-off:
- Academically, consistent hashing is more elegant
- Engineering-wise, hash slots are simpler and more controllable
- Performance difference is negligible at Redis scale
"""
4.5 "Hybrid" Questions: System Design + Algorithm Code
Recent top-tier interviews feature "hybrid" questions requiring algorithm code within system design:
class SlidingWindowRateLimiter:
"""Sliding window rate limiter
System design considerations:
- Deploy at API Gateway layer
- Use Redis for window data (shared across instances)
- Time window granularity: 1s vs 1min vs adaptive
Algorithm considerations:
- Fixed window: simple but has boundary burst issue
- Sliding log: precise but memory-heavy (stores every request timestamp)
- Sliding window counter: compromise (this implementation)
"""
def __init__(self, max_requests: int, window_seconds: int):
self.max_requests = max_requests
self.window_seconds = window_seconds
self.requests = {}
def allow_request(self, user_id: str, current_time: float) -> bool:
"""Sliding window counter algorithm"""
window_start = current_time - self.window_seconds
if user_id not in self.requests:
self.requests[user_id] = []
self.requests[user_id] = [
(ts, count) for ts, count in self.requests[user_id]
if ts > window_start
]
total = sum(count for _, count in self.requests[user_id])
if total >= self.max_requests:
return False
current_bucket = int(current_time)
for i, (ts, count) in enumerate(self.requests[user_id]):
if int(ts) == current_bucket:
self.requests[user_id][i] = (ts, count + 1)
return True
self.requests[user_id].append((current_time, 1))
return True
class TokenBucket:
"""Token bucket โ alternative rate limiting approach
vs Sliding Window:
- Token bucket allows bursts (bucket can be consumed instantly when full)
- Sliding window strictly limits total within window
Use cases:
- API rate limiting (allow brief bursts) โ token bucket
- Anti-abuse (strict frequency limit) โ sliding window
"""
def __init__(self, rate: float, capacity: int):
self.rate = rate
self.capacity = capacity
self.tokens = capacity
self.last_refill_time = 0
def allow_request(self, current_time: float) -> bool:
elapsed = current_time - self.last_refill_time
self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
self.last_refill_time = current_time
if self.tokens >= 1:
self.tokens -= 1
return True
return False
Chapter Summary
| Key Point | Details |
|---|---|
| Algorithms are system foundations | Every system component has a DS/algorithm underneath |
| Consistent Hashing | Distributed cache sharding; node changes migrate only 1/N data |
| Bloom Filter | 1/50th memory for "might exist" queries; reduces disk I/O |
| B+Tree vs LSM | Read-optimized vs write-optimized; match to business scenario |
| Raft | CP system consensus foundation; guarantees log consistency |
| CRDT | AP system conflict resolution; lock-free, consensus-free |
| Interview strategy | Naturally introduce algorithms, explain decisions, proactively discuss trade-offs |
Next chapter preview: Chapter 43 provides a complete learning path and practice guide, from beginner to competition level.