Chapter 39

Production Algorithms: Rate Limiting and Scheduling

Chapter 39: Production Algorithms — Rate Limiting and Scheduling

Over the previous thirty-eight chapters we studied virtually every classic data structure and algorithm. But in real production environments, there is a class of algorithms that almost every backend engineer encounters yet rarely appears in traditional textbooks—Rate Limiting and Load Balancing / Scheduling.

Rate limiting solves the core problem: when request volume exceeds system capacity, how to selectively reject some requests to protect the system from crashing. Scheduling solves the core problem: when multiple backend servers can handle requests, how to distribute requests to the most appropriate server to maximize throughput and minimize latency.

These algorithms typically appear in the system design portion of interviews, but their underlying data structures and algorithmic ideas can be implemented precisely in code—and indeed they have exact code implementations in Nginx, Redis, API Gateways, and other systems. This chapter dissects them through a rigorous algorithmic lens.

Level 1 · What You Need to Know

39.1 Token Bucket Algorithm

Problem Definition

Given a system with maximum allowed request rate rate per second and maximum burst size burst, design a rate limiter that decides whether to allow each incoming request.

Algorithm Concept

Imagine a bucket with capacity burst. Tokens are added to the bucket at a steady rate of rate per second. Each request needs to consume one token to pass:

If there are tokens in the bucket: consume one, allow the request
If no tokens available: reject the request

The elegance of the token bucket is that it allows short bursts of traffic (consuming accumulated tokens all at once) while ensuring the long-term average rate never exceeds rate.

This algorithm was first proposed by Turner (1986, "New Directions in Communications") and later widely adopted in network traffic shaping. IETF RFC 2697 and RFC 2698 define token bucket-based traffic policing standards.

Why No Background Thread to Add Tokens?

The naive implementation would spawn a thread that adds one token every 1/rate seconds. In production, we use lazy evaluation: only compute "how many tokens should have been added since last time" when a request arrives.

import time


class TokenBucket:
    """
    Token Bucket Rate Limiter
    
    Core formula:
    current_tokens = min(burst, last_tokens + (current_time - last_time) * rate)
    
    Why lazy evaluation?
    1. No background thread needed (saves resources)
    2. Higher precision (floating-point timestamps vs fixed intervals)
    3. No thread synchronization issues
    """
    
    def __init__(self, rate: float, burst: int):
        """
        Args:
            rate: Token fill rate (tokens/second)
            burst: Bucket capacity (maximum burst size)
        """
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # Start with full bucket
        self.last_time = time.time()
    
    def allow(self) -> bool:
        """Determine if current request is allowed"""
        now = time.time()
        
        # Calculate tokens that should have been added since last check
        elapsed = now - self.last_time
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last_time = now
        
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        else:
            return False
    
    def allow_n(self, n: int) -> bool:
        """Check if n tokens can be consumed at once (batch requests)"""
        now = time.time()
        elapsed = now - self.last_time
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last_time = now
        
        if self.tokens >= n:
            self.tokens -= n
            return True
        else:
            return False

Parameter Meanings

Parameter	Meaning	Typical Value
rate	Long-term average allowed rate	100 req/s
burst	Maximum instantaneous burst allowed	200 (2-second buffer)

Why Start with a Full Bucket?

This is a design choice. A full bucket means the system can immediately handle a burst after startup. An empty bucket would require waiting burst/rate seconds to reach maximum capacity—potentially causing unnecessary rejections after a service restart.

Token Bucket vs Fixed Window Counter

The simplest rate limiting is "at most N requests per second" (fixed window counter), but it has a fatal flaw:

Timeline: |--- Second 1 ---|--- Second 2 ---|
Requests: ......[99 at end][99 at start]......

With a limit of 100 req/s, if 99 requests come in the last 0.5s of second 1 and 99 in the first 0.5s of second 2, actually 198 requests pass within 1 second—nearly double the limit!

Token bucket has no "window boundary" problem because it is based on continuous time.

39.2 Leaky Bucket Algorithm

Algorithm Concept

Imagine a bucket with a fixed-size hole at the bottom. Water (requests) is poured in at any rate:

Water leaks out at a fixed rate (processing requests), regardless of input rate
If the bucket is full (queue full), new water overflows (request rejected)

The leaky bucket's core property is strictly constant output rate—even with bursty input, output is uniform.

import time
from collections import deque


class LeakyBucket:
    """
    Leaky Bucket Rate Limiter
    
    Key difference from Token Bucket:
    - Token Bucket: allows bursts (pass if tokens available)
    - Leaky Bucket: strictly uniform output (requests queued, processed at fixed rate)
    
    Leaky bucket suits scenarios needing "smoothed traffic" like network packet sending.
    Token bucket suits "allow bursts but limit average rate" like API rate limiting.
    """
    
    def __init__(self, rate: float, capacity: int):
        """
        Args:
            rate: Leak rate (items/second), i.e., processing rate
            capacity: Bucket capacity (max queue length)
        """
        self.rate = rate
        self.capacity = capacity
        self.water = 0.0  # Current water level
        self.last_time = time.time()
    
    def allow(self) -> bool:
        """Is the request allowed to enter the bucket (queue for processing)?"""
        now = time.time()
        elapsed = now - self.last_time
        
        # Calculate water that has leaked out
        leaked = elapsed * self.rate
        self.water = max(0.0, self.water - leaked)
        self.last_time = now
        
        if self.water < self.capacity:
            self.water += 1.0
            return True  # Request enters queue
        else:
            return False  # Bucket full, rejected


class LeakyBucketQueue:
    """
    Leaky Bucket with actual queue (requests buffered then processed uniformly)
    
    Use case: message queue consumers, task schedulers
    """
    
    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.queue = deque()
        self.last_leak_time = time.time()
    
    def enqueue(self, request) -> bool:
        """Attempt to enqueue a request"""
        if len(self.queue) >= self.capacity:
            return False  # Queue full, rejected
        self.queue.append(request)
        return True
    
    def dequeue(self):
        """Dequeue at leaky bucket rate (called by consumer thread)"""
        now = time.time()
        elapsed = now - self.last_leak_time
        
        if elapsed >= 1.0 / self.rate and self.queue:
            self.last_leak_time = now
            return self.queue.popleft()
        return None

Token Bucket vs Leaky Bucket Comparison

Dimension	Token Bucket	Leaky Bucket
Burst handling	Allows bursts (pass if tokens available)	No bursts (strictly uniform output)
Output rate	Variable (can exceed average during bursts)	Constant (always equals rate)
Implementation	Simple (one counter + timestamp)	Medium (needs queue or counter)
Use case	API rate limiting, short bursts OK	Traffic shaping, smooth flow needed
Math model	Like bank deposits (accumulate + spend)	Like constant-speed conveyor belt

Why Does Nginx Use Leaky Bucket While Redis Uses Token Bucket?

Nginx's limit_req module uses the leaky bucket model because as a reverse proxy, Nginx needs to smooth the request rate reaching backend servers, preventing bursts from overwhelming them.

Redis's redis_cell module (GCRA algorithm) and many API Gateways use the token bucket model because for user API calls, allowing short bursts provides better user experience (users should not be throttled just because they happened to send two requests within milliseconds).

39.3 Sliding Window Rate Limiting

Problem Definition

Limit "at most N requests in any consecutive T seconds." This is more precise than fixed windows (no boundary problem) and more intuitive than token buckets (directly expresses "N times per T seconds" semantics).

Method 1: Sliding Window Log (Exact but Memory-Heavy)

Record every request's timestamp. When checking, count how many timestamps fall within the last T seconds.

import time
from collections import deque


class SlidingWindowLog:
    """
    Sliding Window Log Rate Limiter
    
    Principle: maintain a timestamp queue. For each request:
    1. Remove all timestamps older than T seconds
    2. If queue length < N, allow and record timestamp
    3. Otherwise reject
    
    Pros: Exact, no window boundary issues
    Cons: Each user needs up to N timestamps stored, memory O(N * users)
    """
    
    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.timestamps = deque()
    
    def allow(self) -> bool:
        now = time.time()
        
        # Remove timestamps outside the window
        while self.timestamps and self.timestamps[0] <= now - self.window_seconds:
            self.timestamps.popleft()
        
        if len(self.timestamps) < self.max_requests:
            self.timestamps.append(now)
            return True
        else:
            return False

Method 2: Sliding Window Counter (Approximate but Memory-Efficient)

Divide time into small sub-windows (e.g., a 1-second window into ten 100ms slots), each maintaining a counter.

import time


class SlidingWindowCounter:
    """
    Sliding Window Counter Rate Limiter (approximate algorithm)
    
    Idea: divide the large window into multiple small slots, store each slot's count
    in a circular array. When checking, sum all slots within the current window.
    
    Precision depends on slot count: more slots = more precise but more memory.
    This is the algorithm variant used by Cloudflare in their edge rate limiting.
    """
    
    def __init__(self, max_requests: int, window_seconds: float, num_slots: int = 10):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.num_slots = num_slots
        self.slot_duration = window_seconds / num_slots
        
        self.slots = [0] * num_slots
        self.slot_timestamps = [0.0] * num_slots
        self.current_slot = 0
    
    def _get_slot_index(self, timestamp: float) -> int:
        """Calculate slot index from timestamp"""
        return int(timestamp / self.slot_duration) % self.num_slots
    
    def allow(self) -> bool:
        now = time.time()
        current_idx = self._get_slot_index(now)
        
        # Clear expired slots
        self._clean_expired(now)
        
        # Sum all slots in window
        total = sum(self.slots)
        
        if total < self.max_requests:
            self.slots[current_idx] += 1
            self.slot_timestamps[current_idx] = now
            return True
        else:
            return False
    
    def _clean_expired(self, now: float) -> None:
        """Clear slots outside the window"""
        window_start = now - self.window_seconds
        for i in range(self.num_slots):
            if self.slot_timestamps[i] < window_start:
                self.slots[i] = 0

Method 3: Two-Window Weighted Approximation (Industry Standard)

A minimalist but effective approximation method, widely used:

import time


class SlidingWindowApprox:
    """
    Two-window weighted approximation
    
    Idea:
    - Maintain counts for current window and previous window
    - Approximate current sliding window count = prev_count * overlap_ratio + curr_count
    
    Example: window = 1 minute
    Current time is 40% through current window
    Approximation = prev_count * 0.6 + curr_count * 1.0
    
    Described in Cloudflare's blog (2017) as their production rate limiting algorithm.
    Maximum error approximately 12.5% (provable), precise enough for most scenarios.
    """
    
    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        
        self.prev_count = 0
        self.curr_count = 0
        self.window_start = time.time()
    
    def allow(self) -> bool:
        now = time.time()
        elapsed = now - self.window_start
        
        # Check if window needs rotation
        if elapsed >= self.window_seconds:
            windows_passed = int(elapsed / self.window_seconds)
            if windows_passed == 1:
                self.prev_count = self.curr_count
            else:
                self.prev_count = 0
            self.curr_count = 0
            self.window_start += windows_passed * self.window_seconds
            elapsed = now - self.window_start
        
        # Calculate weighted approximation
        weight = 1.0 - elapsed / self.window_seconds  # Previous window overlap ratio
        approx_count = self.prev_count * weight + self.curr_count
        
        if approx_count < self.max_requests:
            self.curr_count += 1
            return True
        else:
            return False

39.4 Comparison of Three Rate Limiting Algorithms

Dimension	Token Bucket	Leaky Bucket	Sliding Window Log	Sliding Window Counter
Precision	Exact	Exact	Exact	Approximate
Allows bursts	Yes	No	Yes (any distribution within window)	Yes
Memory	O(1)	O(1) or O(queue)	O(N * users)	O(slots * users)
Implementation	Low	Low-Medium	Low	Medium
Use case	API limiting	Traffic shaping	Small-scale exact limiting	Large-scale distributed limiting

39.5 Load Balancing Algorithms

When multiple backend servers are available, how do you choose which one receives a request?

Round Robin

class RoundRobin:
    """
    Simplest load balancing: distribute in circular order
    
    Pros: Simple, perfectly fair
    Cons: Ignores server capacity differences, ignores current load
    """
    
    def __init__(self, servers: list):
        self.servers = servers
        self.index = 0
    
    def next_server(self) -> str:
        server = self.servers[self.index % len(self.servers)]
        self.index += 1
        return server

Weighted Round Robin

class WeightedRoundRobin:
    """
    Weighted Round Robin: distribute requests by weight
    
    Nginx's Smooth Weighted Round Robin algorithm:
    Avoids consecutively sending requests to the same high-weight server.
    
    Algorithm by Nginx developer phxc:
    1. Each server has weight (configured) and current_weight (dynamic)
    2. Each selection: all servers current_weight += weight
    3. Select server with maximum current_weight
    4. Selected server: current_weight -= total_weight
    
    Effect: weights {A:5, B:1, C:1} yields sequence A,A,B,A,C,A,A not A,A,A,A,A,B,C
    """
    
    def __init__(self, servers: list, weights: list):
        self.servers = servers
        self.weights = weights
        self.current_weights = [0] * len(servers)
        self.total_weight = sum(weights)
    
    def next_server(self) -> str:
        # All servers: current_weight += weight
        for i in range(len(self.servers)):
            self.current_weights[i] += self.weights[i]
        
        # Select maximum
        max_idx = 0
        for i in range(1, len(self.servers)):
            if self.current_weights[i] > self.current_weights[max_idx]:
                max_idx = i
        
        # Selected server subtracts total_weight
        self.current_weights[max_idx] -= self.total_weight
        
        return self.servers[max_idx]

Least Connections

import heapq


class LeastConnections:
    """
    Least Connections: assign request to server with fewest active connections
    
    Use case: when request processing times vary widely (some 10ms, some 10s)
    Round robin in such scenarios causes slow requests to pile up on one server
    
    Implementation: min-heap maintaining (active_connections, server)
    """
    
    def __init__(self, servers: list):
        self.heap = [(0, i, s) for i, s in enumerate(servers)]
        heapq.heapify(self.heap)
    
    def acquire(self) -> str:
        """Acquire a server (connection count +1)"""
        conns, idx, server = heapq.heappop(self.heap)
        heapq.heappush(self.heap, (conns + 1, idx, server))
        return server
    
    def release(self, server: str) -> None:
        """Release a connection (connection count -1)"""
        new_heap = []
        for conns, idx, s in self.heap:
            if s == server:
                new_heap.append((max(0, conns - 1), idx, s))
            else:
                new_heap.append((conns, idx, s))
        self.heap = new_heap
        heapq.heapify(self.heap)

Consistent Hashing

import hashlib
import bisect


class ConsistentHash:
    """
    Consistent Hashing: map both keys and servers onto a ring
    
    Proposed by Karger et al. (1997, "Consistent Hashing and Random Trees") at MIT.
    
    Core advantage: adding/removing servers only remaps 1/N of keys
    (vs modulo hashing hash(key) % N where nearly all keys remap when N changes)
    
    Virtual nodes: each physical server maps to multiple virtual nodes
    Solves load imbalance (few physical nodes may distribute unevenly on hash ring)
    """
    
    def __init__(self, servers: list, num_replicas: int = 150):
        self.num_replicas = num_replicas
        self.ring = []          # Sorted hash values
        self.ring_map = {}      # hash value -> server name
        
        for server in servers:
            self.add_server(server)
    
    def _hash(self, key: str) -> int:
        """Generate hash using MD5 (production may use xxhash for speed)"""
        return int(hashlib.md5(key.encode()).hexdigest(), 16)
    
    def add_server(self, server: str) -> None:
        """Add server (and its virtual nodes) to hash ring"""
        for i in range(self.num_replicas):
            virtual_key = f"{server}#{i}"
            h = self._hash(virtual_key)
            bisect.insort(self.ring, h)
            self.ring_map[h] = server
    
    def remove_server(self, server: str) -> None:
        """Remove server from hash ring"""
        for i in range(self.num_replicas):
            virtual_key = f"{server}#{i}"
            h = self._hash(virtual_key)
            self.ring.remove(h)
            del self.ring_map[h]
    
    def get_server(self, key: str) -> str:
        """Get server for a given key"""
        if not self.ring:
            return None
        
        h = self._hash(key)
        # Find first node >= h clockwise on the ring
        idx = bisect.bisect_left(self.ring, h)
        if idx == len(self.ring):
            idx = 0  # Wrap around
        
        return self.ring_map[self.ring[idx]]

Why Are Virtual Node Counts Typically 100-200?

Karger et al.'s analysis shows that with $k$ virtual nodes, load imbalance is approximately $O(\sqrt{\log n / k})$, where $n$ is the number of keys. With $k = 150$, server load deviation stays within 5-10% for most practical workloads. More virtual nodes mean more uniformity but slower lookups (bisect on longer arrays); 150 is a common engineering sweet spot.

39.6 Priority Scheduling

In task scheduling, different tasks have different priorities. Higher-priority tasks should be processed first.

import heapq
import time
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    priority: int                          # Priority (lower number = higher priority)
    timestamp: float = field(compare=True) # Same priority uses FIFO
    name: str = field(compare=False)       # Task name (not compared)
    payload: object = field(compare=False, default=None)


class PriorityScheduler:
    """
    Priority Scheduler
    
    Core data structure: min-heap
    Supports:
    - Submit task (with priority)
    - Get highest priority task
    - Priority boost/demotion (via mark-delete + re-insert)
    
    Anti-starvation mechanism: aging
    Long-waiting low-priority tasks gradually get priority boost
    """
    
    def __init__(self):
        self.heap = []
    
    def submit(self, name: str, priority: int, payload=None) -> None:
        """Submit a task"""
        task = Task(priority=priority, timestamp=time.time(), name=name, payload=payload)
        heapq.heappush(self.heap, task)
    
    def next_task(self) -> Task:
        """Get next task to execute"""
        if self.heap:
            return heapq.heappop(self.heap)
        return None
    
    def apply_aging(self, aging_seconds: float = 60.0, priority_boost: int = 1) -> None:
        """
        Aging mechanism: boost priority of tasks waiting longer than aging_seconds
        
        Why aging? Pure priority scheduling causes low-priority task "starvation"
        (never gets executed). OS process schedulers (like Linux CFS) use similar mechanisms.
        """
        now = time.time()
        new_heap = []
        for task in self.heap:
            wait_time = now - task.timestamp
            if wait_time > aging_seconds:
                boosted_priority = max(0, task.priority - priority_boost)
                new_task = Task(priority=boosted_priority, timestamp=task.timestamp,
                              name=task.name, payload=task.payload)
                new_heap.append(new_task)
            else:
                new_heap.append(task)
        self.heap = new_heap
        heapq.heapify(self.heap)

Level 2 · How It Works Internally

39.7 Mathematical Model of Token Bucket

Continuous Time Model

Let $b(t)$ be the token count at time $t$, $r$ the fill rate, $B$ the bucket capacity.

$$b(t) = \min\left(B, , b(t_0) + r \cdot (t - t_0)\right)$$

where $t_0$ is the last token consumption time.

Burst Duration

If the bucket is full ($b = B$) and requests arrive at rate $R_{burst}$ (where $R_{burst} > r$), how long can the burst last?

$$T_{burst} = \frac{B}{R_{burst} - r}$$

During the burst, net consumption is $R_{burst} - r$ tokens per second (consuming $R_{burst}$ but replenishing at $r$).

Example: $r = 100$ req/s, $B = 500$, $R_{burst} = 600$ req/s $$T_{burst} = \frac{500}{600 - 100} = 1 \text{ second}$$

During this 1 second, 600 requests can pass (not just 100), then steady-state returns to 100 req/s.

Connection to Queuing Theory

A token bucket limiter can be modeled as an M/D/1/K queuing system (Poisson arrivals, deterministic service, single server, finite queue K = burst). When arrival rate $\lambda > r$, the drop probability is:

$$P_{drop} \approx 1 - \frac{r}{\lambda} \quad (\text{steady-state approximation when } \lambda \gg r)$$

39.8 GCRA Algorithm (Generic Cell Rate Algorithm)

GCRA (Generic Cell Rate Algorithm) is the rate limiting algorithm defined in ATM network standards, equivalent to token bucket but with different mathematical formulation. Redis's redis_cell module implements GCRA.

Core idea: maintain a "Theoretical Arrival Time" (TAT). Each request's TAT is the "earliest time it should arrive." If actual arrival is earlier than TAT, the request is too frequent and gets throttled.

import time


class GCRA:
    """
    GCRA (Generic Cell Rate Algorithm)
    
    Equivalent to token bucket but expressed in "time" rather than "token count."
    
    Core variable: TAT (Theoretical Arrival Time)
    - If now >= TAT: allow, TAT = now + emission_interval
    - If now < TAT but TAT - now <= burst_tolerance: allow, TAT = TAT + emission_interval
    - If TAT - now > burst_tolerance: reject
    
    emission_interval = 1 / rate (minimum interval between requests)
    burst_tolerance = burst * emission_interval (maximum allowed "earliness")
    
    Defined by ITU-T I.371 standard (1993).
    """
    
    def __init__(self, rate: float, burst: int):
        self.emission_interval = 1.0 / rate
        self.burst_tolerance = burst * self.emission_interval
        self.tat = 0.0  # Theoretical Arrival Time
    
    def allow(self) -> bool:
        now = time.time()
        
        new_tat = max(self.tat, now) + self.emission_interval
        
        if new_tat - now > self.burst_tolerance + self.emission_interval:
            return False
        
        self.tat = new_tat
        return True

GCRA's advantage: only one variable TAT (one timestamp), extremely small memory. Redis's CL.THROTTLE command is GCRA-based, storing only one timestamp per key.

39.9 Load Balance Proof for Consistent Hashing

Theorem (Karger et al., 1997): In consistent hashing with $n$ servers and $m$ keys:

Expected load per server is $m/n$
With $k$ virtual nodes, each server's load falls within $[m/n \cdot (1-\epsilon), m/n \cdot (1+\epsilon)]$ with probability at least $1 - 2/n^2$, where $\epsilon = O(\sqrt{\log n / k})$

Key Migration When Adding/Removing Nodes

When adding a new node, only $m/(n+1)$ keys need to migrate from other nodes to the new one in expectation. This is $n$ times better than modulo hashing ($m \cdot n/(n+1)$ keys need remapping).

Virtual Nodes vs Consistency Trade-off

More virtual nodes → more uniform load → but adding/removing nodes involves more "fragmented" migrations (small amounts from many different nodes).

Production choices:

Amazon DynamoDB: 150-200 virtual nodes per node
Cassandra: 256 virtual nodes per node (default num_tokens)
Redis Cluster: no virtual nodes; directly allocates 16384 slots to nodes (essentially coarse-grained virtual nodes)

39.10 Mathematical Proof of Smooth Weighted Round Robin

Theorem: Nginx's smooth weighted round robin algorithm, within one complete weight cycle ($\sum w_i$ selections), selects each server $i$ exactly $w_i$ times.

Proof sketch:

Let total weight $W = \sum_{i=1}^{n} w_i$. Each selection round:

All servers: $current_weight_i += w_i$
Select maximum $current_weight_{max}$
Selected server: $current_weight_{max} -= W$

After one complete cycle ($W$ selections), each server's current_weight net change is: $w_i \times W - w_i \times W = 0$ (added $W$ times $w_i$, selected $w_i$ times subtracting $W$ each). The system returns to initial state, with each server selected exactly $w_i$ times.

Smoothness Guarantee

Key property: high-weight servers are never selected consecutively. For example, weights {A:5, B:1, C:1} yields A B A C A A A not A A A A A B C.

This is guaranteed by "subtract total_weight after selection": once a server is selected, its current_weight drops by $W$, making it unlikely to be maximum again in subsequent rounds.

39.11 Latency Analysis of Load Balancing Algorithms

Using queuing theory to analyze performance differences:

Assumptions: N servers, each with processing rate $\mu$, total arrival rate $\lambda < N\mu$.

Round Robin Average Response Time:

If all request processing times are identical (deterministic service), Round Robin is optimal—equivalent to an M/D/1 queue (arrival rate $\lambda/N$, service rate $\mu$).

With variable processing times (exponential distribution), Round Robin degenerates to N independent M/M/1 queues:

$$E[T_{RR}] = \frac{1}{\mu - \lambda/N}$$

Join-Shortest-Queue (JSQ) Advantage:

With variable processing times, JSQ approaches the performance of a single M/M/N queue:

$$E[T_{JSQ}] \approx \frac{1}{\mu} + \frac{\rho^N}{N\mu(1-\rho)} \quad (\text{where } \rho = \lambda/(N\mu))$$

JSQ outperforms Round Robin by an order of magnitude under high load—because it uses "current queue length" information for smarter decisions. But JSQ in distributed environments requires real-time knowledge of all servers' queue lengths, with high communication overhead.

Power of Two Choices:

Mitzenmacher (2001, "The Power of Two Choices in Randomized Load Balancing") proved a remarkable result: randomly selecting 2 from N servers, then choosing the less-loaded one—this simple strategy reduces maximum queue length from $O(\log N / \log\log N)$ (pure random) to $O(\log\log N)$ (two choices), an exponential improvement!

import random


class PowerOfTwoChoices:
    """
    Power of Two Choices Load Balancing
    
    Randomly pick two servers, choose the less-loaded one.
    Mathematically proven to be exponentially better than pure random
    (max queue length O(log log N)).
    
    This is the theoretical basis for Nginx's "random two" strategy.
    """
    
    def __init__(self, servers: list):
        self.servers = servers
        self.loads = {s: 0 for s in servers}
    
    def acquire(self) -> str:
        s1, s2 = random.sample(self.servers, 2)
        chosen = s1 if self.loads[s1] <= self.loads[s2] else s2
        self.loads[chosen] += 1
        return chosen
    
    def release(self, server: str) -> None:
        self.loads[server] = max(0, self.loads[server] - 1)

Level 3 · What the Specifications Define

39.12 Rate Limiting Implementation in Nginx

Nginx's ngx_http_limit_req_module implements the leaky bucket algorithm. Core configuration:

# Define rate limit zone (shared memory)
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

# Apply rate limiting
location /api/ {
    limit_req zone=api burst=20 nodelay;
}

Parameter Analysis:

rate=10r/s: Leaky bucket drain rate, 10 requests per second
burst=20: Bucket capacity, queue up to 20 requests
nodelay: Queued requests are forwarded immediately (effectively token bucket behavior)
$binary_remote_addr: Rate limit per client IP (each IP has independent bucket)

Internal Implementation (Nginx Source Analysis):

Nginx uses a red-black tree (per-zone) to store each key's rate limit state (last access time + excess count), using shared memory (ngx_slab) across worker processes.

// Simplified core logic (ngx_http_limit_req_module.c)
// excess represents "overflow count in bucket" (lazy calculation)
excess = lr->excess - rate * ms_since_last / 1000;
if (excess < 0) excess = 0;

if (excess + 1000 <= burst * 1000) {
    // Allow: update excess
    lr->excess = excess + 1000;
    lr->last = now;
    return NGX_OK;
} else {
    // Reject: return 503
    return NGX_HTTP_SERVICE_UNAVAILABLE;
}

Nginx uses integer arithmetic (multiply by 1000) to avoid floating-point, critical in hot paths.

Meaning of nodelay

Without nodelay: requests exceeding rate but within burst are delayed (queued), waiting until the bucket drains enough. This is pure leaky bucket behavior.

With nodelay: as long as burst bucket is not full, all requests forward immediately—but burst recovery still follows rate. The effect equals a token bucket: allows bursts, but long-term average stays at rate.

39.13 Rate Limiting Implementation in Redis

Approach 1: Fixed Window (Simplest)

-- Fixed window rate limit (Lua script for atomicity)
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = tonumber(redis.call('GET', key) or "0")
if current >= limit then
    return 0  -- Reject
end

redis.call('INCR', key)
if current == 0 then
    redis.call('EXPIRE', key, window)
end
return 1  -- Allow

Approach 2: Sliding Window (Sorted Set)

-- Sliding window rate limit (using sorted set)
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

-- Remove records outside window
redis.call('ZREMRANGEBYSCORE', key, 0, now - window * 1000)

-- Count
local count = redis.call('ZCARD', key)
if count >= limit then
    return 0  -- Reject
end

-- Add current request
redis.call('ZADD', key, now, now .. math.random())
redis.call('PEXPIRE', key, window * 1000)
return 1  -- Allow

Approach 3: redis_cell Module (GCRA)

-- CL.THROTTLE command (redis_cell module)
-- CL.THROTTLE key burst rate period [quantity]
-- Example: max 30 requests per 60 seconds, allow burst of 10
CL.THROTTLE user:123 10 30 60 1

-- Returns:
-- 1) 0 (allowed) or 1 (denied)
-- 2) Total capacity (burst + 1)
-- 3) Remaining capacity
-- 4) If denied, seconds until retry
-- 5) Seconds until bucket fully recovers

redis_cell's advantage: one command completes rate limit check + state update atomically; returns rich metadata (remaining quota, retry time) for client backoff strategies.

39.14 Multi-Layer Rate Limiting in API Gateways

Production rate limiting is typically multi-layered:

User request → CDN layer → API Gateway → Service layer → Database connection pool
                ↓              ↓              ↓              ↓
           DDoS protection  User/API-level  Service-level   Resource protection
           (L3/L4)          (L7)            (Circuit breaker) (Semaphore)

Each Layer Has Different Responsibilities:

Layer	Granularity	Algorithm Choice	Typical Implementation
CDN/Edge	IP / Geography	Fixed window + blocklist	Cloudflare, AWS WAF
API Gateway	User / API Key / Route	Token bucket / Sliding window	Kong, Envoy, Nginx
Service-to-service	Service identity / Method	Adaptive limiting	Sentinel, Hystrix
Database	Connections / Query complexity	Semaphore / Queuing	Connection Pool

39.15 Google SRE's Rate Limiting Practice

Google SRE describes their rate limiting philosophy in Site Reliability Engineering (Beyer et al., 2016, O'Reilly):

Core Principle: Client-Side Adaptive Throttling

Traditional rate limiting is server-side rejection. Google's approach has clients proactively reduce requests:

import random


class AdaptiveThrottle:
    """
    Google SRE Client-Side Adaptive Throttling
    
    Idea: Client tracks recent request success rate.
    If server starts rejecting, client proactively reduces send rate.
    
    Core formula:
    rejection_probability = max(0, (requests - K * accepts) / (requests + 1))
    
    Where:
    - requests: total requests in recent window
    - accepts: requests accepted by server in recent window
    - K: multiplier (typically K=2), meaning "don't throttle while success rate > 1/K"
    
    K=2 means: as long as >50% of requests succeed, client doesn't throttle.
    When success rate drops below 50%, client starts probabilistically dropping requests.
    
    Advantages:
    1. No server-side threshold configuration needed
    2. Automatically adapts to backend capacity changes
    3. Graceful degradation (not sudden 100% to 0%)
    """
    
    def __init__(self, K: float = 2.0):
        self.K = K
        self.requests = 0
        self.accepts = 0
    
    def should_send(self) -> bool:
        """Client decides whether to send request"""
        rejection_prob = max(0.0, (self.requests - self.K * self.accepts) / (self.requests + 1))
        
        if random.random() < rejection_prob:
            return False  # Proactively drop
        return True
    
    def record_request(self) -> None:
        self.requests += 1
    
    def record_accept(self) -> None:
        self.accepts += 1
    
    def decay(self, factor: float = 0.5) -> None:
        """Periodically decay counters (sliding window effect)"""
        self.requests = int(self.requests * factor)
        self.accepts = int(self.accepts * factor)

Why K=2?

K=1: Start throttling with any rejection—too sensitive
K=2: Allow up to 50% failure before throttling—balanced sensitivity and stability
K=3: Allow 67% failure before throttling—too sluggish

Google uses K=2 in production because it achieves good balance between "fast reaction" and "avoiding false triggers."

Comparison with Traditional Rate Limiting

Dimension	Traditional Server-Side	Google Client-Side Adaptive
Configuration	Requires preset thresholds	No configuration (self-adaptive)
Adaptability	Static threshold, needs update on scale	Automatically follows backend capacity
Latency	Request reaches server before rejection	Client drops directly, saves round trip
Cascading failure	Upstream still sends heavy traffic	Upstream proactively reduces pressure
Complexity	Server-side implementation	Requires client library support

39.16 Combined Rate Limiting in Microservices

Production systems typically combine multiple rate limiting strategies:

class CompositeRateLimiter:
    """
    Composite Rate Limiter: multi-dimensional limiting
    
    Example: An API has simultaneous limits:
    - Global: 10000 req/s (protect entire service)
    - Per user: 100 req/s (fairness)
    - Per user per endpoint: 10 req/s (prevent single endpoint abuse)
    - Per IP: 1000 req/s (DDoS prevention)
    
    Requests must pass ALL dimension checks to proceed.
    """
    
    def __init__(self):
        self.global_limiter = TokenBucket(rate=10000, burst=15000)
        self.user_limiters = {}
        self.ip_limiters = {}
    
    def allow(self, user_id: str, ip: str, endpoint: str) -> bool:
        if not self.global_limiter.allow():
            return False
        
        if user_id not in self.user_limiters:
            self.user_limiters[user_id] = TokenBucket(rate=100, burst=200)
        if not self.user_limiters[user_id].allow():
            return False
        
        if ip not in self.ip_limiters:
            self.ip_limiters[ip] = TokenBucket(rate=1000, burst=2000)
        if not self.ip_limiters[ip].allow():
            return False
        
        return True

39.17 Challenges in Distributed Rate Limiting

Problem: Multiple API Gateway instances each maintain independent counters. User limit 100 req/s, 5 Gateway instances, user can actually send 500 req/s.

Approach 1: Centralized (Redis)

All instances share one Redis counter. Precise but adds one Redis round-trip per request (~0.5-1ms).

Approach 2: Quota Sharding

Each Gateway gets 100/5 = 20 req/s quota. Simple but uneven (some Gateways receive more of a user's requests).

Approach 3: Local Limiting + Periodic Sync

Each Gateway rate-limits locally (rough quota), syncs actual usage to center every 1-5 seconds and adjusts quota.

class DistributedRateLimiter:
    """
    Distributed Rate Limiter: local bucket + periodic Redis sync
    
    Balances precision and latency:
    - Local decision: 0 latency
    - Periodic sync: 1 Redis interaction per second, adjusts local quota
    
    Error: at most rate * sync_interval * num_instances deviation
    Example: 100 req/s * 1s * 5 instances = at most 500 extra requests
    Acceptable for most scenarios (rate limiting need not be absolutely precise)
    """
    
    def __init__(self, total_rate: float, num_instances: int, sync_interval: float = 1.0):
        self.local_rate = total_rate / num_instances
        self.local_bucket = TokenBucket(rate=self.local_rate, burst=int(self.local_rate * 2))
        self.sync_interval = sync_interval
        self.local_count = 0
    
    def allow(self) -> bool:
        if self.local_bucket.allow():
            self.local_count += 1
            return True
        return False

Envoy's Local + Global Rate Limiting Design:

Envoy proxy has two layers:

Local Rate Limit Filter: each Envoy instance independently rate-limits, no external dependency
Global Rate Limit Service: calls external gRPC service (like ratelimit project) for global judgment

Requests pass local limiting first (quickly reject obviously over-limit), then global limiting (precise judgment). This layered design achieves excellent balance between latency and precision.

Level 4 · Edge Cases and Pitfalls

39.18 How to Answer Rate Limiting in System Design Interviews

Typical Scenario: "Design a URL shortener/chat system/payment system—how do you handle traffic control?"

Answer Framework (5 Steps):

Step 1: Clarify Rate Limiting Goals

Protect what? (backend services, database, third-party APIs)
Limit what? (total requests, per-user, per-IP)
Limit to what degree? (estimate from system capacity)

Step 2: Choose Rate Limiting Algorithm

User-friendly (allow bursts): Token bucket
Need smooth traffic: Leaky bucket
Need precise "N times per T seconds" semantics: Sliding window

Step 3: Determine Rate Limiting Location

Client → CDN → Load Balancer → API Gateway → Service → Database
        L3/L4    L4/L7          L7 user-level  Service   Connection pool

Earlier rejection wastes fewer resources. But earlier layers lack user identity information.

Step 4: Handle Rate Limit Responses

HTTP 429 Too Many Requests
Retry-After header (tell client when to retry)
X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset headers

Step 5: Discuss Distributed Rate Limiting

Centralized vs distributed (Redis vs local)
Precision vs latency trade-off
"Soft" limiting: not sudden 100% rejection, but gradual degradation

39.19 Rate Limiting Anti-Patterns

Anti-pattern 1: Too Coarse Granularity

Only global limiting (e.g., "whole service 10000 QPS"), no per-user limiting. Consequence: one malicious user exhausts all quota, affecting every normal user.

Anti-pattern 2: No Meaningful Error Responses

Returning 500 Internal Server Error instead of 429 Too Many Requests. Consequence: client cannot distinguish "service failure" from "rate limited," cannot implement targeted backoff.

Anti-pattern 3: Hardcoded Rate Limit Configuration

# Anti-pattern
RATE_LIMIT = 100  # Hardcoded

# Correct approach
RATE_LIMIT = config.get("rate_limit", default=100)  # Dynamically adjustable

Consequence: cannot quickly respond to traffic changes (requires code deployment to change thresholds).

Anti-pattern 4: No Exponential Backoff on Rate-Limited Retries

Client immediately retries infinitely after being rate-limited, creating a "retry storm"—more rate-limited requests generate more retry traffic, vicious cycle.

Correct approach:

import random
import time


def request_with_backoff(url, max_retries=5):
    """Retry with exponential backoff"""
    for attempt in range(max_retries):
        response = send_request(url)
        
        if response.status_code == 429:
            # Exponential backoff + jitter
            wait = min(2 ** attempt + random.uniform(0, 1), 30)
            time.sleep(wait)
        else:
            return response
    
    raise Exception("Max retries exceeded")

Why add jitter? If 100 clients are rate-limited simultaneously, without jitter they all retry at exactly the same time (e.g., all retry after 1s), causing "thundering herd." Jitter spreads retry times.

39.20 Rate Limiting vs Circuit Breaking

Dimension	Rate Limiting	Circuit Breaking
Trigger condition	Request rate exceeds threshold	Downstream error rate exceeds threshold
Protects	Self	Downstream service
Behavior	Reject excess requests	Temporarily cut all requests to downstream
Recovery	Auto-recovers when rate drops	Half-open state (try small number of test requests)
Algorithm	Token bucket/Leaky bucket/Sliding window	State machine (Closed→Open→Half-Open)

They often work together: rate limiting protects you from being overwhelmed by upstream; circuit breaking protects downstream from being overwhelmed by you.

import time
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"        # Normal: requests forwarded
    OPEN = "open"            # Broken: all requests fail immediately
    HALF_OPEN = "half_open"  # Testing: allow few requests to test


class CircuitBreaker:
    """
    Circuit Breaker
    
    State transitions:
    CLOSED -> OPEN: error rate exceeds threshold (e.g., 50%)
    OPEN -> HALF_OPEN: after recovery_timeout
    HALF_OPEN -> CLOSED: test request succeeds
    HALF_OPEN -> OPEN: test request fails
    
    Netflix Hystrix (2012) pioneered this pattern, now widely adopted by
    Resilience4j, Sentinel, and similar frameworks.
    """
    
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = 0
    
    def allow(self) -> bool:
        if self.state == CircuitState.CLOSED:
            return True
        elif self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time >= self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
                return True
            return False
        else:  # HALF_OPEN
            return True
    
    def record_success(self) -> None:
        if self.state == CircuitState.HALF_OPEN:
            self.state = CircuitState.CLOSED
            self.failure_count = 0
    
    def record_failure(self) -> None:
        self.failure_count += 1
        self.last_failure_time = time.time()
        
        if self.state == CircuitState.HALF_OPEN:
            self.state = CircuitState.OPEN
        elif self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

39.21 Production Case Study: Multi-Dimensional Rate Limiting System

Requirements: Design rate limiting for a SaaS API platform:

Free users: 10 req/min, paid users: 1000 req/min
Global limit: 50000 req/s
Per IP: 100 req/s
Specific endpoints (e.g., /upload): 5 req/min additional limit
Clear error messages and retry guidance on rate limit

from dataclasses import dataclass
from typing import Optional
import time


@dataclass
class RateLimitResult:
    allowed: bool
    limit: int
    remaining: int
    reset_at: float
    retry_after: Optional[float] = None
    
    def to_headers(self) -> dict:
        """Convert to HTTP response headers"""
        headers = {
            'X-RateLimit-Limit': str(self.limit),
            'X-RateLimit-Remaining': str(max(0, self.remaining)),
            'X-RateLimit-Reset': str(int(self.reset_at)),
        }
        if self.retry_after is not None:
            headers['Retry-After'] = str(int(self.retry_after))
        return headers


class MultiDimensionRateLimiter:
    """
    Multi-Dimension Rate Limiter
    
    Design:
    1. Check from strictest (most easily triggered) to most lenient
    2. Short-circuit on first rejection
    3. Only allow if ALL layers pass
    4. Return info from the tightest layer (help user understand bottleneck)
    """
    
    def __init__(self):
        self.global_bucket = TokenBucket(rate=50000, burst=75000)
        self.ip_buckets = {}
        self.user_buckets = {}
        self.endpoint_buckets = {}
    
    def check(self, user_id: str, user_tier: str, ip: str, endpoint: str) -> RateLimitResult:
        now = time.time()
        
        # Layer 1: Global
        if not self.global_bucket.allow():
            return RateLimitResult(
                allowed=False, limit=50000, remaining=0,
                reset_at=now + 1, retry_after=1.0
            )
        
        # Layer 2: IP-level
        if ip not in self.ip_buckets:
            self.ip_buckets[ip] = TokenBucket(rate=100, burst=200)
        if not self.ip_buckets[ip].allow():
            return RateLimitResult(
                allowed=False, limit=100, remaining=0,
                reset_at=now + 1, retry_after=1.0
            )
        
        # Layer 3: User-level (by tier)
        user_rate = 1000 if user_tier == 'paid' else 10
        user_burst = user_rate * 2
        if user_id not in self.user_buckets:
            self.user_buckets[user_id] = TokenBucket(rate=user_rate / 60, burst=user_burst // 60)
        if not self.user_buckets[user_id].allow():
            return RateLimitResult(
                allowed=False, limit=user_rate, remaining=0,
                reset_at=now + 60, retry_after=60.0 / user_rate
            )
        
        # Layer 4: Specific endpoint
        if endpoint == '/upload':
            ep_key = f"{user_id}:{endpoint}"
            if ep_key not in self.endpoint_buckets:
                self.endpoint_buckets[ep_key] = TokenBucket(rate=5.0/60, burst=5)
            if not self.endpoint_buckets[ep_key].allow():
                return RateLimitResult(
                    allowed=False, limit=5, remaining=0,
                    reset_at=now + 60, retry_after=12.0
                )
        
        return RateLimitResult(
            allowed=True, limit=user_rate, remaining=user_rate - 1,
            reset_at=now + 60
        )

39.22 Scheduling Algorithms in Real Operating Systems

Linux Completely Fair Scheduler (CFS)

CFS, developed by Ingo Molnar (2007), uses a red-black tree to maintain all runnable processes, sorted by "virtual runtime" (vruntime). Each scheduling decision selects the process with minimum vruntime.

Core data structure: red-black tree (rb_tree), keyed by vruntime.

class SimpleCFS:
    """
    Simplified CFS Scheduler Model
    
    Core idea: give each process fair CPU time.
    "Fair" defined as: all processes' vruntimes should be approximately equal.
    
    vruntime growth rate is inversely proportional to weight:
    High-priority process vruntime grows slowly -> gets scheduled more
    Low-priority process vruntime grows quickly -> gets scheduled less
    """
    
    def __init__(self):
        import sortedcontainers
        self.tree = sortedcontainers.SortedList(key=lambda p: p['vruntime'])
    
    def add_process(self, pid: int, nice: int = 0):
        """Add process (lower nice = higher priority)"""
        weight = self._nice_to_weight(nice)
        process = {
            'pid': pid,
            'vruntime': self._min_vruntime(),
            'weight': weight,
            'nice': nice
        }
        self.tree.add(process)
    
    def schedule(self) -> dict:
        """Select next process to run"""
        if not self.tree:
            return None
        return self.tree[0]  # Minimum vruntime (leftmost in red-black tree, O(1))
    
    def tick(self, running_process: dict, elapsed_ns: int):
        """Clock interrupt: update running process's vruntime"""
        self.tree.remove(running_process)
        delta_vruntime = elapsed_ns * 1024 / running_process['weight']
        running_process['vruntime'] += delta_vruntime
        self.tree.add(running_process)
    
    def _min_vruntime(self) -> float:
        if self.tree:
            return self.tree[0]['vruntime']
        return 0.0
    
    def _nice_to_weight(self, nice: int) -> int:
        """Nice value to weight mapping (simplified)"""
        return max(1, int(1024 * (1.25 ** (-nice))))

Why CFS uses a red-black tree rather than a heap: it needs O(log n) insertion and deletion (processes block/wake, removing/adding from tree), while a heap can only get minimum in O(1) but requires O(n) to find an arbitrary element for deletion.

Chapter Summary

Rate limiting and scheduling are the most direct applications of data structures and algorithms in production environments. They are not "interview flights of fancy"—every backend service has a rate limiter protecting it, and every operating system kernel runs a scheduler.

Core takeaways:

Token Bucket = burst-allowing rate limiting, one counter + lazy evaluation
Leaky Bucket = strictly uniform rate limiting, essentially a fixed-rate queue
Sliding Window = precise "N times per T seconds" semantics, log method is exact but memory-heavy
Consistent Hashing = minimize key migration on node changes, virtual nodes ensure uniformity
Power of Two Choices = simple yet exponentially better than pure random

The core industrial lesson: there is no perfect rate limiting algorithm, only the right combination for your scenario. Trading off between latency, precision, memory, and implementation complexity is the essence of system design.

Rate this chapter

4.8 / 5 (3 ratings)