Chapter 42

From Algorithms to System Design

Chapter 42: From Algorithms to System Design

You spent 40 chapters mastering data structures and algorithms. Now the interviewer says: "Design a Twitter-like system." Suddenly your algorithm knowledge seems disconnected.

That's because most people treat "algorithm interviews" and "system design interviews" as completely different domains. They're not — every core decision in system design is backed by a data structure or algorithm underneath. Consistent hashing is the distributed version of a hash table, B+Tree indexing is an engineering optimization of binary search trees, Bloom filters are probabilistic set membership queries, and priority scheduling in message queues is a heap application.

This chapter does four things: first, it shows how algorithms from earlier chapters manifest in real systems; second, it introduces new algorithms for distributed scenarios (Raft, CRDT, DHT); third, it creates cross-references with the Redis/Kafka/MySQL books on this site; fourth, it teaches you how to precisely reference algorithm knowledge in system design interviews for bonus points.


Level 1 · What You Need to Know

1.1 Algorithms Directly Manifested in System Design

Every system design decision has an algorithm behind it. Here are the most common mappings:

System Component Underlying Algorithm/DS Book Chapter Engineering Scenario
Load Balancing Consistent Hashing Ch 8 Nginx, Memcached node assignment
Database Index B+Tree, Skip List Ch 20, 6 MySQL InnoDB, Redis ZSet
Cache Eviction LRU/LFU (Hash + Doubly-Linked List) Ch 4, 8 Redis maxmemory-policy
Deduplication Bloom Filter Ch 9 Crawler URL dedup, spam filter
Search Engine Inverted Index (Hash + Sorted List) Ch 8, 11 Elasticsearch
Message Priority Heap / Priority Queue Ch 12 Kafka priority queue, task scheduling
Routing Table Trie Ch 10 IP routing longest prefix match
Leaderboard Skip List / Red-Black Tree Ch 6, 21 Redis ZRANGEBYSCORE
Rate Limiting Sliding Window / Token Bucket Ch 2 API Gateway throttling
Distributed ID Snowflake (Bit Manipulation) Ch 31 Twitter Snowflake

Consistent Hashing (Reference: Chapter 8)

Scenario: You have N cache servers and need to decide which machine stores each key.

Naive approach: server = hash(key) % N. Problem: when adding or removing a server, almost all keys must be remapped — meaning nearly all cache becomes invalid (cache stampede).

What does consistent hashing solve? When nodes change, on average only K/N keys need remapping (K = total keys, N = node count).

import hashlib
from bisect import bisect_right

class ConsistentHash:
    """Consistent hash ring implementation
    
    Core idea:
    1. Organize hash value space as a ring (0 ~ 2^32-1)
    2. Each node occupies multiple positions on the ring (virtual nodes)
    3. Each key finds its nearest node clockwise
    
    Why virtual nodes (detailed in Ch 8):
    - With only 1 position per physical node, load distribution is highly uneven
    - With 150-200 virtual nodes, std deviation drops to 5-10% of mean
    - This is the core contribution of Karger et al. (1997)
    """
    
    def __init__(self, nodes: list[str], replicas: int = 150):
        self.replicas = replicas
        self.ring = []
        self.hash_to_node = {}
        
        for node in nodes:
            self.add_node(node)
    
    def _hash(self, key: str) -> int:
        """Use MD5 for uniform hash distribution"""
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest[:8], 16)
    
    def add_node(self, node: str):
        """Add physical node → place replicas virtual nodes on ring"""
        for i in range(self.replicas):
            virtual_key = f"{node}#VN{i}"
            h = self._hash(virtual_key)
            self.ring.append(h)
            self.hash_to_node[h] = node
        self.ring.sort()
    
    def remove_node(self, node: str):
        """Remove physical node → remove all its virtual nodes"""
        for i in range(self.replicas):
            virtual_key = f"{node}#VN{i}"
            h = self._hash(virtual_key)
            self.ring.remove(h)
            del self.hash_to_node[h]
    
    def get_node(self, key: str) -> str:
        """Given a key, find responsible node (first clockwise)
        
        Algorithm: binary search → O(log(N * replicas))
        """
        if not self.ring:
            return None
        h = self._hash(key)
        idx = bisect_right(self.ring, h) % len(self.ring)
        return self.hash_to_node[self.ring[idx]]


def demo_migration():
    """Demonstrate key migration when nodes change"""
    nodes = [f"server-{i}" for i in range(5)]
    ch = ConsistentHash(nodes)
    
    assignments_before = {}
    for i in range(10000):
        key = f"user:{i}"
        assignments_before[key] = ch.get_node(key)
    
    ch.add_node("server-5")
    
    migrated = 0
    for i in range(10000):
        key = f"user:{i}"
        if ch.get_node(key) != assignments_before[key]:
            migrated += 1
    
    # Theoretical: 10000 / 6 ≈ 1667 (only the new node's share migrates)
    print(f"Migrated keys: {migrated} / 10000 ({migrated/100:.1f}%)")
    # Actual output ~16-20%, close to theoretical 1/6 ≈ 16.7%

Bloom Filter (Reference: Chapter 9)

Scenario: A crawler system has 1 billion crawled URLs. Each new URL needs to be checked against the existing set. A HashSet storing 1 billion URLs requires ~60GB memory. A Bloom filter needs only ~1.2GB (at 1% false positive rate).

import mmh3
from bitarray import bitarray

class BloomFilter:
    """Bloom filter implementation
    
    Mathematical foundation (detailed in Ch 9):
    - m = -(n * ln(p)) / (ln(2))²  optimal bit count
    - k = (m/n) * ln(2)            optimal hash function count
    - where n = expected items, p = target false positive rate
    
    Space advantage: HashSet needs O(n * element_size)
    Bloom filter needs only O(n) bits, independent of element size
    
    Trade-offs:
    - False Positives: "exists" might be wrong
    - No False Negatives: "doesn't exist" is always correct
    - No deletion support (Counting Bloom Filter can, but doubles space)
    """
    
    def __init__(self, expected_items: int, fp_rate: float = 0.01):
        import math
        self.size = int(-(expected_items * math.log(fp_rate)) / (math.log(2) ** 2))
        self.num_hashes = int((self.size / expected_items) * math.log(2))
        self.bit_array = bitarray(self.size)
        self.bit_array.setall(0)
    
    def add(self, item: str):
        """Add element: set corresponding bit for each hash function to 1"""
        for i in range(self.num_hashes):
            idx = mmh3.hash(item, i) % self.size
            self.bit_array[idx] = 1
    
    def contains(self, item: str) -> bool:
        """Query: all hash bits are 1 → might exist; any is 0 → definitely absent"""
        for i in range(self.num_hashes):
            idx = mmh3.hash(item, i) % self.size
            if not self.bit_array[idx]:
                return False
        return True


class WebCrawler:
    """Crawler with Bloom filter deduplication"""
    
    def __init__(self, expected_urls: int = 1_000_000_000):
        self.bloom = BloomFilter(expected_urls, fp_rate=0.001)
    
    def should_crawl(self, url: str) -> bool:
        if self.bloom.contains(url):
            return False
        return True
    
    def mark_crawled(self, url: str):
        self.bloom.add(url)

B+Tree Index (Reference: Chapter 20)

Scenario: MySQL executing SELECT * FROM users WHERE age BETWEEN 25 AND 35 — how to efficiently find all matching rows?

class BPlusTreeNode:
    """B+Tree node — simplified implementation showing core concepts
    
    Why databases choose B+Tree over Red-Black Tree/AVL?
    1. Disk-friendly: each node is a disk page (typically 16KB), minimizing I/O
    2. Efficient range queries: leaf nodes form a sorted linked list
    3. High fanout: internal nodes store only keys (not data)
       → Tree height is extremely low (3-4 levels can index billions of rows)
    
    Comparison with BST:
    - BST: height O(log₂ n) → 1 billion rows needs 30 comparisons → 30 disk I/Os
    - B+Tree (order=1000): height O(log₁₀₀₀ n) → 1 billion rows needs only 3 I/Os!
    """
    
    def __init__(self, is_leaf: bool = False, order: int = 4):
        self.is_leaf = is_leaf
        self.keys = []
        self.children = []
        self.next_leaf = None
        self.order = order


def range_query_explanation():
    """B+Tree range query process
    
    SQL: SELECT * FROM users WHERE age BETWEEN 25 AND 35
    
    Steps:
    1. From root, binary search to locate leaf node for age=25  → O(log_B n) I/Os
    2. Find first position >= 25 within leaf node
    3. Scan along leaf linked list until age > 35             → O(result_size / page_size) I/Os
    
    Key advantages:
    - Location cost is minimal (3-4 I/Os)
    - Range scan is sequential I/O (SSD sequential read 3GB/s vs random 100MB/s)
    """
    pass

1.2 From Single-Machine to Distributed Algorithms

When your system scales from one machine to many, single-machine "certainties" disappear:

Single-Machine Assumption Distributed Reality
Operations are atomic Network can drop/delay/reorder packets
Unified clock Each machine's clock drifts differently
No failures Any node can crash at any time
Shared memory Data must travel over network

These changes gave birth to distributed algorithms. Here are the three you're most likely to encounter in system design interviews:

Raft Consensus Algorithm (brief introduction):

class RaftState:
    """Raft consensus algorithm state machine model
    
    Problem solved: How do multiple nodes agree on "the order of log entries"?
    
    Source: Ongaro & Ousterhout, "In Search of an Understandable Consensus Algorithm"
    (USENIX ATC 2014) — specifically designed as an understandable Paxos alternative
    
    Core roles:
    - Leader: receives client requests, replicates log to Followers
    - Follower: passively receives log entries
    - Candidate: competes to become Leader during election
    
    Key invariants:
    - At most one Leader per term
    - Leader's log contains all committed entries
    - Majority agreement suffices for commitment
    """
    
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"
    
    def __init__(self, node_id: int, total_nodes: int):
        self.node_id = node_id
        self.state = self.FOLLOWER
        self.current_term = 0
        self.voted_for = None
        self.log = []
        self.commit_index = 0
        self.majority = total_nodes // 2 + 1
    
    def request_vote(self, candidate_term: int, candidate_id: int,
                     last_log_index: int, last_log_term: int) -> bool:
        """Vote RPC: decide whether to vote for candidate
        
        Conditions:
        1. candidate_term >= my term
        2. Haven't voted this term (or already voted for this candidate)
        3. Candidate's log is at least as up-to-date (compare term first, then index)
        """
        if candidate_term < self.current_term:
            return False
        
        if candidate_term > self.current_term:
            self.current_term = candidate_term
            self.state = self.FOLLOWER
            self.voted_for = None
        
        if self.voted_for is not None and self.voted_for != candidate_id:
            return False
        
        my_last_term = self.log[-1][0] if self.log else 0
        my_last_index = len(self.log)
        
        log_is_up_to_date = (
            candidate_term > my_last_term or
            (candidate_term == my_last_term and last_log_index >= my_last_index)
        )
        
        if log_is_up_to_date:
            self.voted_for = candidate_id
            return True
        return False

1.3 Template for Referencing Algorithms in System Design Interviews

How to precisely reference algorithm knowledge during interviews:

# Interviewer: "How would you design a distributed cache system?"
# 
# Your response framework:
SYSTEM_DESIGN_WITH_ALGORITHMS = """
1. Data Sharding
   → Consistent hashing (Karger et al., 1997)
   → 150 virtual nodes per physical node for uniform distribution
   → Node changes only require migrating 1/N of data

2. Cache Eviction
   → LRU policy: hash table + doubly-linked list, O(1) get/put
   → If frequency matters: LFU or W-TinyLFU (Caffeine's implementation)

3. Hot Key Detection
   → Count-Min Sketch (probabilistic DS, Ch 9)
   → Estimate per-key access frequency with O(1) space

4. Cache Penetration Protection
   → Bloom filter to check if key might exist in database
   → Definitely-absent keys return immediately without DB query

5. Consistency Guarantee
   → Strong consistency needed: Raft replicated log
   → Eventual consistency acceptable: gossip protocol + vector clocks
"""

1.4 Common Mistakes

Mistake 1: Confusing algorithm application levels

# ❌ Wrong: "Use red-black tree for database index"
# Red-black tree is an in-memory structure; disk should use B+Tree

# ❌ Wrong: "Use DFS for service discovery"
# DFS is a graph traversal algorithm; service discovery uses gossip or registries

# ✅ Correct mappings:
CORRECT_MAPPINGS = {
    "In-memory index": "Red-black tree, skip list (Redis)",
    "On-disk index": "B+Tree, LSM-Tree",
    "In-memory cache eviction": "LRU (hash + doubly-linked list)",
    "Distributed routing": "Consistent hashing",
    "Approximate counting": "HyperLogLog",
    "Approximate membership": "Bloom filter",
    "Distributed consensus": "Raft / Paxos",
}

Mistake 2: Over-engineering small-scale systems

When the interviewer says "100K users," you don't need consistent hashing — a single Redis instance suffices. Algorithm and data structure choices must match the scale.


Level 2 · How It Works

2.1 Deep Dive into Distributed Algorithms

Raft Engineering Implementation Details

Raft is used by etcd (Kubernetes' underlying storage), CockroachDB, and TiKV (TiDB's storage engine). Let's analyze its core mechanisms:

Mathematical guarantee of election mechanism:

import random

class RaftElection:
    """Raft election key design decisions
    
    Why randomized timeout?
    - If all nodes start elections simultaneously, Leader may never be elected (livelock)
    - Random timeout [150ms, 300ms] ensures one node likely times out first
    - Math: if timeout range >> network RTT, conflict probability → 0
    
    Timeout range selection (Ongaro's paper suggests):
    - Lower bound > average network RTT (avoid frequent elections)
    - Upper bound < acceptable unavailability duration
    - Range = upper - lower >> average RTT (reduce conflicts)
    """
    
    def __init__(self, heartbeat_interval: int = 50):
        self.heartbeat_interval = heartbeat_interval
        self.election_timeout_range = (
            heartbeat_interval * 3,
            heartbeat_interval * 6
        )
    
    def random_election_timeout(self) -> int:
        return random.randint(*self.election_timeout_range)
    
    def simulate_election(self, num_nodes: int = 5,
                          network_delay_ms: int = 5) -> dict:
        """Simulate one election round
        
        Best case: 1 RTT to complete (150 + 5 = 155ms)
        Worst case: multiple conflicts (rare, expected < 2 rounds)
        """
        timeouts = [self.random_election_timeout() for _ in range(num_nodes)]
        first_candidate = min(range(num_nodes), key=lambda i: timeouts[i])
        election_start = timeouts[first_candidate]
        vote_collection_time = election_start + network_delay_ms
        
        second_timeout = sorted(timeouts)[1]
        conflict = (second_timeout - election_start) < network_delay_ms
        
        return {
            "winner": first_candidate,
            "time_ms": vote_collection_time,
            "conflict": conflict,
            "rounds": 2 if conflict else 1,
        }

CRDT: Conflict-free Replicated Data Types

class GCounter:
    """G-Counter (Grow-only Counter) — simplest CRDT
    
    Source: Shapiro et al., "A comprehensive study of Convergent and 
    Commutative Replicated Data Types" (INRIA Technical Report, 2011)
    
    Core idea:
    - Each node maintains its own counter
    - Global value = sum of all node counters
    - Merge operation = take max of each node's counter
    
    Why CRDT?
    - Traditional: either locks (poor perf) or consensus (high latency)
    - CRDT: lock-free, consensus-free, eventually consistent
    - Cost: only supports specific operations (monotonic increase/decrease/add/remove)
    """
    
    def __init__(self, node_id: str, num_nodes: int):
        self.node_id = node_id
        self.counters = {}
    
    def increment(self, amount: int = 1):
        self.counters[self.node_id] = self.counters.get(self.node_id, 0) + amount
    
    def value(self) -> int:
        return sum(self.counters.values())
    
    def merge(self, other: 'GCounter'):
        """Merge: take max for each node
        
        Key properties (why it's conflict-free):
        - Commutative: merge(A, B) == merge(B, A)
        - Associative: merge(merge(A, B), C) == merge(A, merge(B, C))
        - Idempotent: merge(A, A) == A
        
        These three properties guarantee eventual consistency regardless of message order
        """
        for node_id, count in other.counters.items():
            self.counters[node_id] = max(self.counters.get(node_id, 0), count)


class PNCounter:
    """PN-Counter (supports increment and decrement)
    
    Idea: two G-Counters — one for increments, one for decrements
    value = P.value() - N.value()
    """
    
    def __init__(self, node_id: str, num_nodes: int):
        self.p = GCounter(node_id, num_nodes)
        self.n = GCounter(node_id, num_nodes)
    
    def increment(self, amount: int = 1):
        self.p.increment(amount)
    
    def decrement(self, amount: int = 1):
        self.n.increment(amount)
    
    def value(self) -> int:
        return self.p.value() - self.n.value()
    
    def merge(self, other: 'PNCounter'):
        self.p.merge(other.p)
        self.n.merge(other.n)


class LWWRegister:
    """Last-Writer-Wins Register — timestamp-based CRDT
    
    Application: concurrent updates to a single field in distributed DBs
    
    Limitations:
    - Depends on clock synchronization (NTP accuracy typically 1-10ms)
    - If two writes happen within clock error, result is non-deterministic
    - Solution: Hybrid Logical Clock (HLC) — Kulkarni et al. (2014)
    """
    
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.value = None
        self.timestamp = 0
    
    def write(self, value, timestamp: int):
        if timestamp > self.timestamp:
            self.value = value
            self.timestamp = timestamp
    
    def merge(self, other: 'LWWRegister'):
        if other.timestamp > self.timestamp:
            self.value = other.value
            self.timestamp = other.timestamp

Distributed Hash Table (DHT)

class KademliaNode:
    """Kademlia DHT — distributed hash table for P2P networks
    
    Source: Maymounkov & Mazieres, "Kademlia: A Peer-to-peer Information 
    System Based on the XOR Metric" (IPTPS 2002)
    
    Used by BitTorrent, IPFS, Ethereum
    
    Core ideas:
    1. Each node has a 160-bit ID
    2. Distance = XOR — satisfies triangle inequality without physical proximity
    3. Each node maintains 160 "k-buckets"; bucket i stores nodes at distance [2^i, 2^(i+1))
    4. Lookup: each step halves distance → O(log n) steps to find target
    
    vs Consistent Hashing:
    - Consistent hashing: needs centralized ring info → suitable for data centers
    - DHT: fully decentralized, each node knows O(log n) others → suitable for P2P
    """
    
    def __init__(self, node_id: int, k: int = 20):
        self.node_id = node_id
        self.k = k
        self.buckets = [[] for _ in range(160)]
        self.data = {}
    
    def distance(self, id1: int, id2: int) -> int:
        return id1 ^ id2
    
    def bucket_index(self, other_id: int) -> int:
        dist = self.distance(self.node_id, other_id)
        if dist == 0:
            return 0
        return dist.bit_length() - 1
    
    def find_closest_nodes(self, target_id: int, count: int = 20) -> list:
        """Find count closest nodes to target_id in routing table
        
        Core Kademlia operation: returned nodes are closer to target each time
        → guarantees O(log n) convergence
        """
        all_nodes = []
        for bucket in self.buckets:
            all_nodes.extend(bucket)
        all_nodes.sort(key=lambda n: self.distance(n, target_id))
        return all_nodes[:count]

2.2 Deep Correspondence Between Algorithms and System Components

LSM-Tree vs B+Tree: Write-Optimized vs Read-Optimized

class LSMTreeConcept:
    """LSM-Tree (Log-Structured Merge Tree) concept
    
    Source: O'Neil et al., "The Log-Structured Merge-Tree (LSM-Tree)" (1996)
    Used by LevelDB, RocksDB, Cassandra, HBase
    
    Core idea:
    1. Writes go to memory first (MemTable, usually skip list or red-black tree)
    2. When MemTable is full, flush to disk as SSTable (Sorted String Table)
    3. Background periodic compaction merges multiple SSTables
    
    Comparison with B+Tree:
    ┌──────────────┬─────────────────────┬─────────────────────┐
    │              │ B+Tree              │ LSM-Tree            │
    ├──────────────┼─────────────────────┼─────────────────────┤
    │ Write amplif │ High (random disk)  │ Low (sequential)    │
    │ Read amplif  │ Low (1 lookup)      │ High (check layers) │
    │ Space amplif │ Low                 │ Medium (versions)   │
    │ Best for     │ Read-heavy (MySQL)  │ Write-heavy (logs)  │
    └──────────────┴─────────────────────┴─────────────────────┘
    """
    
    def __init__(self, memtable_size: int = 4 * 1024 * 1024):
        self.memtable = {}
        self.memtable_size = memtable_size
        self.sstables = []
    
    def put(self, key: str, value: str):
        """Write: WAL (omitted) then MemTable
        Time: O(log n) skip list insertion
        Disk I/O: only WAL sequential append (1 write)
        """
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_size:
            self._flush()
    
    def get(self, key: str) -> str:
        """Read: check MemTable → then SSTables newest to oldest
        Worst case: check all levels → use Bloom filter to skip irrelevant SSTables
        """
        if key in self.memtable:
            return self.memtable[key]
        for sstable in self.sstables:
            if key in sstable:
                return sstable[key]
        return None
    
    def _flush(self):
        sstable = dict(sorted(self.memtable.items()))
        self.sstables.insert(0, sstable)
        self.memtable = {}

Skip List in Redis

import random

class SkipListNode:
    def __init__(self, key: float, value: str, level: int):
        self.key = key
        self.value = value
        self.forward = [None] * (level + 1)
        self.span = [0] * (level + 1)

class SkipList:
    """Skip List — underlying implementation of Redis ZSET
    
    Why Redis chose skip list over red-black tree? (antirez's explanation)
    1. Simpler: skip list ~100 LOC vs red-black tree 300+ LOC
    2. Range operations naturally efficient: ZRANGEBYSCORE just finds start, then scans
    3. Controllable memory: average 1.33 pointers/node (p=1/4)
    4. Comparable performance: O(log n) find/insert/delete
    
    In system design interviews:
    "Redis leaderboards use skip lists. Expected height is O(log n),
    both point queries and range queries are O(log n + k). Compared to
    red-black trees, skip lists don't need in-order traversal for ranges —
    they scan along the bottom-level linked list directly."
    """
    
    MAX_LEVEL = 32
    P = 0.25
    
    def __init__(self):
        self.header = SkipListNode(float('-inf'), '', self.MAX_LEVEL)
        self.level = 0
        self.length = 0
    
    def _random_level(self) -> int:
        """Geometric distribution: P(level=k) = p^(k-1) * (1-p)
        Expected level = 1/(1-p) = 1.33 when p=0.25
        """
        level = 0
        while random.random() < self.P and level < self.MAX_LEVEL:
            level += 1
        return level
    
    def insert(self, key: float, value: str):
        """Insert — O(log n) expected time"""
        update = [None] * (self.MAX_LEVEL + 1)
        current = self.header
        
        for i in range(self.level, -1, -1):
            while current.forward[i] and current.forward[i].key < key:
                current = current.forward[i]
            update[i] = current
        
        new_level = self._random_level()
        if new_level > self.level:
            for i in range(self.level + 1, new_level + 1):
                update[i] = self.header
            self.level = new_level
        
        new_node = SkipListNode(key, value, new_level)
        for i in range(new_level + 1):
            new_node.forward[i] = update[i].forward[i]
            update[i].forward[i] = new_node
        
        self.length += 1
    
    def range_by_score(self, min_score: float, max_score: float) -> list:
        """Range query — O(log n + k), k = result size
        This is the core logic of Redis ZRANGEBYSCORE
        """
        current = self.header
        for i in range(self.level, -1, -1):
            while current.forward[i] and current.forward[i].key < min_score:
                current = current.forward[i]
        
        current = current.forward[0]
        
        result = []
        while current and current.key <= max_score:
            result.append((current.key, current.value))
            current = current.forward[0]
        
        return result

2.3 Cross-References with Other Books on This Site

The technical books on this site (dev.yiteai.com) form a knowledge network:

Algorithm/Concept Related Book Related Section Connection
Consistent Hashing "Redis In Depth" Cluster Sharding Redis Cluster uses 16384 hash slots
Bloom Filter "Redis In Depth" Data Structures Redis 4.0+ built-in BF.ADD/BF.EXISTS
B+Tree "High Performance MySQL" Indexing InnoDB clustered index structure
LSM-Tree "High Performance MySQL" MyRocks RocksDB storage engine core
Priority Queue/Heap "Kafka In Depth" Message Scheduling Delayed messages, priority queues
Raft Consensus "Redis In Depth" Sentinel/Cluster Redis Sentinel uses Raft-like leader election
CRDT "Redis In Depth" Active-Active Redis Enterprise conflict resolution

2.4 Data Structures in Message Queues

class DelayQueue:
    """Delay message queue — based on min-heap
    
    Scenarios:
    - Auto-cancel orders unpaid after 30 minutes
    - Scheduled push notifications
    - Retry backoff (1s, 2s, 4s, 8s...)
    
    In Kafka (detailed in the Kafka book on this site):
    - Kafka doesn't natively support delay messages
    - Common approach: multiple delay topics (delay_1s, delay_5s, delay_30m...)
    - Or use Hierarchical Timing Wheel
    """
    
    def __init__(self):
        self.heap = []
        self.counter = 0
    
    def put(self, message: str, delay_seconds: float):
        import time, heapq
        execute_at = time.time() + delay_seconds
        heapq.heappush(self.heap, (execute_at, self.counter, message))
        self.counter += 1
    
    def poll(self) -> str:
        import time, heapq
        if not self.heap:
            return None
        if self.heap[0][0] <= time.time():
            _, _, message = heapq.heappop(self.heap)
            return message
        return None


class TimingWheel:
    """Timing Wheel — used by Kafka/Netty for timers
    
    Source: Varghese & Lauck, "Hashed and Hierarchical Timing Wheels" (1996)
    
    Why not a heap?
    - Heap insert/delete: O(log n)
    - Timing wheel insert/delete: O(1)
    - With millions of timers, the difference is massive
    
    Core idea (clock analogy):
    - Second hand: 60 slots, 1 second each
    - Minute hand: 60 slots, 1 minute each
    - Hour hand: 24 slots, 1 hour each
    
    Tasks placed in appropriate slots; "hand" advances one slot per tick.
    When higher-level slot is reached, tasks cascade down to lower level.
    """
    
    def __init__(self, tick_ms: int = 1, wheel_size: int = 60):
        self.tick_ms = tick_ms
        self.wheel_size = wheel_size
        self.buckets = [[] for _ in range(wheel_size)]
        self.current_tick = 0
        self.overflow_wheel = None
    
    def add_task(self, task_id: str, delay_ticks: int):
        """Add timed task — O(1)"""
        if delay_ticks < self.wheel_size:
            bucket_idx = (self.current_tick + delay_ticks) % self.wheel_size
            self.buckets[bucket_idx].append((task_id, delay_ticks))
        else:
            if self.overflow_wheel is None:
                self.overflow_wheel = TimingWheel(
                    self.tick_ms * self.wheel_size,
                    self.wheel_size
                )
            self.overflow_wheel.add_task(task_id, delay_ticks // self.wheel_size)
    
    def advance(self):
        """Advance one tick — O(tasks in current bucket)"""
        self.current_tick = (self.current_tick + 1) % self.wheel_size
        expired_tasks = self.buckets[self.current_tick]
        self.buckets[self.current_tick] = []
        return expired_tasks

Level 3 · How the Standard Defines It

3.1 CAP Theorem and Algorithm Choice

CAP Theorem (Brewer, 2000; proven by Gilbert & Lynch, 2002) states a distributed system cannot simultaneously guarantee all three:

Since network partitions are inevitable (P must be chosen), the real choice is CP vs AP:

CAP_ALGORITHM_MAPPING = {
    "CP (Consistency Priority)": {
        "algorithms": "Raft / Paxos / ZAB",
        "behavior": "Minority partition stops serving during network split",
        "systems": ["etcd", "ZooKeeper", "CockroachDB", "Spanner"],
        "use_cases": "Financial transactions, distributed locks, config management",
    },
    "AP (Availability Priority)": {
        "algorithms": "Gossip / CRDT / Last-Writer-Wins",
        "behavior": "All nodes continue serving, but data may be inconsistent",
        "systems": ["Cassandra", "DynamoDB", "Redis Cluster (some scenarios)"],
        "use_cases": "Social media, shopping carts, DNS",
    },
}

3.2 System Design Interview Scoring Dimensions

Based on Meta system design interviewer talks (InfoQ 2023) and the framework from Designing Data-Intensive Applications (Kleppmann, 2017):

Dimension Weight Algorithm Connection
Requirements Analysis 15% Estimate QPS/storage → determines needed complexity
High-Level Design 25% Component selection → data structure choices underneath
Deep Dive 35% Detail questions → requires explaining underlying algorithms
Trade-off Discussion 25% Trade-off analysis → time/space complexity trade-offs

Algorithm reference example in Deep Dive:

# Interviewer: "How do you keep the leaderboard updated in real-time?"
#
# Your answer:
"""
Use Redis ZSET (Sorted Set).

Underlying implementation is skip list + hash table:
- Skip list: maintains score ordering, supports O(log n) range queries
- Hash table: O(1) single key lookup

Operation complexity:
- ZADD (update score): O(log n)
- ZRANK (get rank): O(log n) — via span accumulation
- ZRANGEBYSCORE (get top-k): O(log n + k)

Why not MySQL ORDER BY?
- Each query needs full table sort or index scan
- Under high concurrency (e.g., live-streaming tip leaderboard), MySQL can't handle it
- Redis skip list updates are atomic, naturally supports concurrency

Capacity estimation:
- 10M user leaderboard → ZSET ~200MB
- If partitioned by game, 100K per partition → single ZSET ~2MB
"""

3.3 Classic Distributed System Theorems

Theorem/Algorithm Authors Year Core Content Interview Reference
FLP Impossibility Fischer, Lynch, Paterson 1985 Cannot guarantee consensus with even 1 failure in async systems Why timeouts are needed
CAP Theorem Brewer (Gilbert & Lynch proof) 2000/2002 CP vs AP choice Foundation of all SD trade-offs
Paxos Lamport 1989/1998 First proven-correct consensus algorithm Background; explain why Raft is more practical
Raft Ongaro & Ousterhout 2014 Understandable consensus Strong consistency scenarios
Gossip Demers et al. 1987 Epidemic-style propagation AP system dissemination
Vector Clock Lamport / Fidge 1978/1988 Determine causal relationships Resolve concurrent conflicts
Consistent Hashing Karger et al. 1997 Minimize remapping Distributed cache/storage sharding

3.4 Consistency Model Spectrum

From strongest to weakest, with corresponding implementation algorithms:

CONSISTENCY_SPECTRUM = {
    "Linearizability": {
        "guarantee": "Operations appear to execute atomically at a single point in time",
        "algorithm": "Raft/Paxos + all reads/writes through Leader",
        "cost": "High latency (every write needs majority confirmation)",
        "systems": "Spanner, CockroachDB",
    },
    "Sequential Consistency": {
        "guarantee": "All operations have a global order; each process preserves its order",
        "algorithm": "Total Order Broadcast",
        "cost": "Medium latency",
        "systems": "ZooKeeper (writes)",
    },
    "Causal Consistency": {
        "guarantee": "Causally related operations are ordered; concurrent ops may be unordered",
        "algorithm": "Vector clocks / version vectors",
        "cost": "Lower latency, allows some concurrency",
        "systems": "MongoDB (w:majority + causal sessions)",
    },
    "Eventual Consistency": {
        "guarantee": "If no new writes, all replicas eventually converge to same value",
        "algorithm": "Gossip + CRDT / Last-Writer-Wins",
        "cost": "Lowest latency, but may read stale data",
        "systems": "DynamoDB, Cassandra, DNS",
    },
}

3.5 Back-of-Envelope Estimation in System Design

This is an essential interview skill — it directly determines which algorithms and data structures you choose:

# Jeff Dean's "Numbers Every Programmer Should Know" (2012, continuously updated)
LATENCY_NUMBERS = {
    "L1 cache reference":           0.5,      # ns
    "L2 cache reference":           7,        # ns
    "Main memory reference":        100,      # ns
    "SSD random read":              150_000,  # ns = 150 µs
    "HDD random read":              10_000_000, # ns = 10 ms
    "Network round trip (same DC)":  500_000,  # ns = 0.5 ms
    "Network round trip (cross-city)": 50_000_000, # ns = 50 ms
}

def estimate_scale() -> dict:
    """System design capacity estimation template
    
    Talk through it during interview:
    'DAU 10M, 10 requests/user/day → QPS = 10M * 10 / 86400 ≈ 1200'
    'Peak at 3x → 3600 QPS → single Redis handles easily'
    """
    estimation = {
        "DAU": 10_000_000,
        "actions_per_user_per_day": 10,
        "total_daily_requests": 100_000_000,
        "average_QPS": 100_000_000 / 86400,  # ≈ 1157
        "peak_QPS": 1157 * 3,  # ≈ 3472
        "tweet_size_bytes": 280,
        "new_tweets_per_day": 10_000_000,
        "daily_data_growth": 10_000_000 * 280,  # ≈ 2.8 GB
        "yearly_data": 2.8 * 365,  # ≈ 1 TB
    }
    
    if estimation["peak_QPS"] < 10000:
        estimation["cache_strategy"] = "Single Redis sufficient (100K+ QPS)"
    elif estimation["peak_QPS"] < 100000:
        estimation["cache_strategy"] = "Redis Cluster (3-5 nodes)"
    else:
        estimation["cache_strategy"] = "Multi-level cache (local + Redis Cluster)"
    
    return estimation

Level 4 · Edge Cases and Pitfalls

4.1 Common Algorithm Mistakes in System Design Interviews

Mistake 1: Choosing the wrong "level" of data structure

# ❌ "Use red-black tree for database index"
# Problem: RBT is in-memory, one I/O per node
# 1M rows, height 20 → 20 random disk I/Os → 200ms (HDD)
# 
# ✅ "Use B+Tree for database index"
# B+Tree order=1000, 1M rows, height 2 → 2 I/Os → 20ms

# ❌ "Use B+Tree for in-memory database index"
# Problem: B+Tree optimizes for disk (large nodes reduce I/O)
# In memory, no I/O cost; B+Tree's cache line utilization is worse
# 
# ✅ "Use skip list for in-memory index" (Redis's choice)
# Or "ART (Adaptive Radix Tree)" (modern in-memory DB choice)

Mistake 2: Confusing exact vs approximate scenarios

# Scenario: Count unique visitors (UV)
# 
# ❌ "Use HashSet to store all user IDs"
# Problem: 1B users × 8 bytes/ID = 8GB memory
# 
# ✅ Analysis: Does UV need to be exact?
# - For billing (ad billing) → need exact → use bitmap + sharding
# - For analytics dashboard → 2% error acceptable → HyperLogLog (12KB memory!)

# Redis HyperLogLog
"""
PFADD page_views:2024-01-01 "user:12345"
PFADD page_views:2024-01-01 "user:67890"
PFCOUNT page_views:2024-01-01  # Get UV estimate
# Standard error 0.81%, fixed 12KB memory, independent of cardinality!
"""

Mistake 3: Ignoring the "hot key problem"

# Scenario: Consistent hashing distributes cache
# 
# Problem: A celebrity posts → billions of requests for one key → node overload
# Consistent hashing can't solve hotspots (it ensures uniform distribution,
# not uniform access)
# 
# Solutions:
HOT_KEY_SOLUTIONS = {
    "Local cache": "After hot key detection, cache locally on app servers (short TTL)",
    "Key spreading": "Add suffixes to hot key (key_1, key_2, ...key_N), spread across N nodes",
    "Read replicas": "Dedicated read-only replicas for hot keys",
}

# How to detect hot keys?
# → Count-Min Sketch (probabilistic DS, Chapter 9)
# → Estimate per-key frequency with O(1) space at gateway layer

Mistake 4: Pursuing strong consistency in AP systems

# Interviewer: "Design a distributed shopping cart"
# 
# ❌ "Use Raft for strong consistency of cart data"
# Problems:
# - Shopping cart doesn't need strong consistency
# - Raft makes minority partition unavailable → users can't modify cart
# 
# ✅ "Use CRDT (or LWW) for eventual consistency, AP priority"
# - All nodes continue serving during network partition
# - Auto-merge after partition heals (CRDT conflict resolution)
# - Edge case: deleted item might reappear (acceptable, user just deletes again)

4.2 How to "Showcase" Algorithm Knowledge in System Design

Strategy 1: Natural introduction, not forced

# ❌ Forced: "Here I'll use consistent hashing. Consistent hashing was proposed
# by Karger in 1997..." (Interviewer: I know what it is, just tell me how you'll use it)

# ✅ Natural: "Data sharding via consistent hashing, 150 virtual nodes per physical node.
# This way node changes only require migrating ~1/N of data, avoiding cache stampede.
# I can explain why 150 virtual nodes if you'd like to dive deeper."

Strategy 2: Use algorithms to explain design decisions

# Interviewer: "Why Redis ZSET instead of MySQL sorting?"
# 
# Good answer:
"""
Two reasons, both related to underlying data structures:

1. Write performance:
   - ZSET uses skip list, O(log n) score update, pure memory operation
   - MySQL index update needs B+Tree node splits, involves disk I/O
   - At 10K+ QPS leaderboard, ZSET latency < 1ms, MySQL might > 10ms

2. Range queries:
   - ZRANGEBYSCORE scans along skip list bottom-level list, O(log n + k)
   - MySQL ORDER BY LIMIT with large offset must skip preceding rows
   - For deep pagination, ZSET cursor mode via ZRANGEBYSCORE is more efficient
"""

Strategy 3: Proactively present trade-offs

# Interviewer: "How to implement globally unique IDs?"
UNIQUE_ID_TRADEOFFS = {
    "UUID v4": {
        "algorithm": "128-bit random number",
        "pros": "No coordination, fully decentralized",
        "cons": "Unordered (B+Tree index splits frequently), 128-bit too long",
        "good_for": "Scenarios not requiring ordering",
    },
    "Snowflake": {
        "algorithm": "timestamp(41bit) + machine_id(10bit) + sequence(12bit) = 64bit",
        "pros": "Trend-increasing (B+Tree friendly), 64-bit compact",
        "cons": "Clock-dependent (clock skew causes duplicates)",
        "good_for": "Most internet scenarios",
    },
    "DB Auto-increment": {
        "algorithm": "Centralized counter",
        "pros": "Strictly increasing, simple",
        "cons": "Single point bottleneck, poor scalability",
        "good_for": "Small-scale systems, order numbers",
    },
}

4.3 Mental Model Shift: Algorithm Interview to System Design

Algorithm Interview Thinking System Design Thinking
Pursue optimal solution Pursue "good enough" + clear trade-offs
Focus on single-machine performance Focus on distributed scalability
Correctness first Availability/consistency/partition trade-offs
Exact answers Order-of-magnitude estimates
Code implementation Architecture diagrams + component selection
Edge cases Failure modes (node crash, network partition, disk full)

4.4 Real System Case Studies

Case 1: Google Bigtable's Bloom Filters

"""
Bigtable (Chang et al., 2006) read path:

1. Client requests a row key
2. Check MemTable (memory) — O(log n) skip list lookup
3. If not in MemTable, need to check multiple SSTables on disk
4. Each SSTable has a Bloom filter:
   - BF says "absent" → skip this SSTable (saves one disk I/O)
   - BF says "maybe present" → read this SSTable
5. Experimental data: Bloom filters reduced disk accesses by 10x

Impact of algorithm choice:
- Without BF: each read checks N SSTables → N random I/Os
- With BF: expected 1 SSTable check → 1 random I/O
- Cost: ~10 bits/key extra storage per SSTable for Bloom filter
"""

Case 2: Kafka's Timing Wheel

"""
Kafka manages millions of delayed operations (timeout requests, delayed ACKs).

Naive approach: priority queue (heap)
- Insert O(log n), delete-min O(log n)
- n = millions → 20 comparisons per operation → high CPU overhead

Kafka's approach: Hierarchical Timing Wheel
- Insert O(1), expiry check O(1)
- Handles long delays via overflow to higher-level wheels

Performance comparison (Kafka official benchmark):
- Heap: 1M timers, throughput ~500K ops/sec
- Timing wheel: 1M timers, throughput ~5M ops/sec (10x improvement)
"""

Case 3: Redis Cluster Hash Slots vs Consistent Hashing

"""
Redis Cluster uses "hash slots" instead of traditional consistent hashing:
- Fixed 16384 slots
- CRC16(key) % 16384 → slot number
- Each node is responsible for some slots

Why not consistent hash ring? (antirez's explanation)
1. Hash slots are easier to manage: moving a slot = moving a group of keys
2. Config info is more compact: 16384 bits = 2KB describes entire cluster allocation
3. Consistent hashing virtual node management is complex

This is a classic engineering trade-off:
- Academically, consistent hashing is more elegant
- Engineering-wise, hash slots are simpler and more controllable
- Performance difference is negligible at Redis scale
"""

4.5 "Hybrid" Questions: System Design + Algorithm Code

Recent top-tier interviews feature "hybrid" questions requiring algorithm code within system design:

class SlidingWindowRateLimiter:
    """Sliding window rate limiter
    
    System design considerations:
    - Deploy at API Gateway layer
    - Use Redis for window data (shared across instances)
    - Time window granularity: 1s vs 1min vs adaptive
    
    Algorithm considerations:
    - Fixed window: simple but has boundary burst issue
    - Sliding log: precise but memory-heavy (stores every request timestamp)
    - Sliding window counter: compromise (this implementation)
    """
    
    def __init__(self, max_requests: int, window_seconds: int):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.requests = {}
    
    def allow_request(self, user_id: str, current_time: float) -> bool:
        """Sliding window counter algorithm"""
        window_start = current_time - self.window_seconds
        
        if user_id not in self.requests:
            self.requests[user_id] = []
        
        self.requests[user_id] = [
            (ts, count) for ts, count in self.requests[user_id]
            if ts > window_start
        ]
        
        total = sum(count for _, count in self.requests[user_id])
        
        if total >= self.max_requests:
            return False
        
        current_bucket = int(current_time)
        for i, (ts, count) in enumerate(self.requests[user_id]):
            if int(ts) == current_bucket:
                self.requests[user_id][i] = (ts, count + 1)
                return True
        
        self.requests[user_id].append((current_time, 1))
        return True


class TokenBucket:
    """Token bucket — alternative rate limiting approach
    
    vs Sliding Window:
    - Token bucket allows bursts (bucket can be consumed instantly when full)
    - Sliding window strictly limits total within window
    
    Use cases:
    - API rate limiting (allow brief bursts) → token bucket
    - Anti-abuse (strict frequency limit) → sliding window
    """
    
    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill_time = 0
    
    def allow_request(self, current_time: float) -> bool:
        elapsed = current_time - self.last_refill_time
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill_time = current_time
        
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

Chapter Summary

Key Point Details
Algorithms are system foundations Every system component has a DS/algorithm underneath
Consistent Hashing Distributed cache sharding; node changes migrate only 1/N data
Bloom Filter 1/50th memory for "might exist" queries; reduces disk I/O
B+Tree vs LSM Read-optimized vs write-optimized; match to business scenario
Raft CP system consensus foundation; guarantees log consistency
CRDT AP system conflict resolution; lock-free, consensus-free
Interview strategy Naturally introduce algorithms, explain decisions, proactively discuss trade-offs

Next chapter preview: Chapter 43 provides a complete learning path and practice guide, from beginner to competition level.

Rate this chapter
4.7  / 5  (3 ratings)

💬 Comments