Production Algorithms: Rate Limiting and Scheduling
Chapter 39: Production Algorithms โ Rate Limiting and Scheduling
Over the previous thirty-eight chapters we studied virtually every classic data structure and algorithm. But in real production environments, there is a class of algorithms that almost every backend engineer encounters yet rarely appears in traditional textbooksโRate Limiting and Load Balancing / Scheduling.
Rate limiting solves the core problem: when request volume exceeds system capacity, how to selectively reject some requests to protect the system from crashing. Scheduling solves the core problem: when multiple backend servers can handle requests, how to distribute requests to the most appropriate server to maximize throughput and minimize latency.
These algorithms typically appear in the system design portion of interviews, but their underlying data structures and algorithmic ideas can be implemented precisely in codeโand indeed they have exact code implementations in Nginx, Redis, API Gateways, and other systems. This chapter dissects them through a rigorous algorithmic lens.
Level 1 ยท What You Need to Know
39.1 Token Bucket Algorithm
Problem Definition
Given a system with maximum allowed request rate rate per second and maximum burst size burst, design a rate limiter that decides whether to allow each incoming request.
Algorithm Concept
Imagine a bucket with capacity burst. Tokens are added to the bucket at a steady rate of rate per second. Each request needs to consume one token to pass:
- If there are tokens in the bucket: consume one, allow the request
- If no tokens available: reject the request
The elegance of the token bucket is that it allows short bursts of traffic (consuming accumulated tokens all at once) while ensuring the long-term average rate never exceeds rate.
This algorithm was first proposed by Turner (1986, "New Directions in Communications") and later widely adopted in network traffic shaping. IETF RFC 2697 and RFC 2698 define token bucket-based traffic policing standards.
Why No Background Thread to Add Tokens?
The naive implementation would spawn a thread that adds one token every 1/rate seconds. In production, we use lazy evaluation: only compute "how many tokens should have been added since last time" when a request arrives.
import time
class TokenBucket:
"""
Token Bucket Rate Limiter
Core formula:
current_tokens = min(burst, last_tokens + (current_time - last_time) * rate)
Why lazy evaluation?
1. No background thread needed (saves resources)
2. Higher precision (floating-point timestamps vs fixed intervals)
3. No thread synchronization issues
"""
def __init__(self, rate: float, burst: int):
"""
Args:
rate: Token fill rate (tokens/second)
burst: Bucket capacity (maximum burst size)
"""
self.rate = rate
self.burst = burst
self.tokens = float(burst) # Start with full bucket
self.last_time = time.time()
def allow(self) -> bool:
"""Determine if current request is allowed"""
now = time.time()
# Calculate tokens that should have been added since last check
elapsed = now - self.last_time
self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
self.last_time = now
if self.tokens >= 1.0:
self.tokens -= 1.0
return True
else:
return False
def allow_n(self, n: int) -> bool:
"""Check if n tokens can be consumed at once (batch requests)"""
now = time.time()
elapsed = now - self.last_time
self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
self.last_time = now
if self.tokens >= n:
self.tokens -= n
return True
else:
return False
Parameter Meanings
| Parameter | Meaning | Typical Value |
|---|---|---|
| rate | Long-term average allowed rate | 100 req/s |
| burst | Maximum instantaneous burst allowed | 200 (2-second buffer) |
Why Start with a Full Bucket?
This is a design choice. A full bucket means the system can immediately handle a burst after startup. An empty bucket would require waiting burst/rate seconds to reach maximum capacityโpotentially causing unnecessary rejections after a service restart.
Token Bucket vs Fixed Window Counter
The simplest rate limiting is "at most N requests per second" (fixed window counter), but it has a fatal flaw:
Timeline: |--- Second 1 ---|--- Second 2 ---|
Requests: ......[99 at end][99 at start]......
With a limit of 100 req/s, if 99 requests come in the last 0.5s of second 1 and 99 in the first 0.5s of second 2, actually 198 requests pass within 1 secondโnearly double the limit!
Token bucket has no "window boundary" problem because it is based on continuous time.
39.2 Leaky Bucket Algorithm
Algorithm Concept
Imagine a bucket with a fixed-size hole at the bottom. Water (requests) is poured in at any rate:
- Water leaks out at a fixed rate (processing requests), regardless of input rate
- If the bucket is full (queue full), new water overflows (request rejected)
The leaky bucket's core property is strictly constant output rateโeven with bursty input, output is uniform.
import time
from collections import deque
class LeakyBucket:
"""
Leaky Bucket Rate Limiter
Key difference from Token Bucket:
- Token Bucket: allows bursts (pass if tokens available)
- Leaky Bucket: strictly uniform output (requests queued, processed at fixed rate)
Leaky bucket suits scenarios needing "smoothed traffic" like network packet sending.
Token bucket suits "allow bursts but limit average rate" like API rate limiting.
"""
def __init__(self, rate: float, capacity: int):
"""
Args:
rate: Leak rate (items/second), i.e., processing rate
capacity: Bucket capacity (max queue length)
"""
self.rate = rate
self.capacity = capacity
self.water = 0.0 # Current water level
self.last_time = time.time()
def allow(self) -> bool:
"""Is the request allowed to enter the bucket (queue for processing)?"""
now = time.time()
elapsed = now - self.last_time
# Calculate water that has leaked out
leaked = elapsed * self.rate
self.water = max(0.0, self.water - leaked)
self.last_time = now
if self.water < self.capacity:
self.water += 1.0
return True # Request enters queue
else:
return False # Bucket full, rejected
class LeakyBucketQueue:
"""
Leaky Bucket with actual queue (requests buffered then processed uniformly)
Use case: message queue consumers, task schedulers
"""
def __init__(self, rate: float, capacity: int):
self.rate = rate
self.capacity = capacity
self.queue = deque()
self.last_leak_time = time.time()
def enqueue(self, request) -> bool:
"""Attempt to enqueue a request"""
if len(self.queue) >= self.capacity:
return False # Queue full, rejected
self.queue.append(request)
return True
def dequeue(self):
"""Dequeue at leaky bucket rate (called by consumer thread)"""
now = time.time()
elapsed = now - self.last_leak_time
if elapsed >= 1.0 / self.rate and self.queue:
self.last_leak_time = now
return self.queue.popleft()
return None
Token Bucket vs Leaky Bucket Comparison
| Dimension | Token Bucket | Leaky Bucket |
|---|---|---|
| Burst handling | Allows bursts (pass if tokens available) | No bursts (strictly uniform output) |
| Output rate | Variable (can exceed average during bursts) | Constant (always equals rate) |
| Implementation | Simple (one counter + timestamp) | Medium (needs queue or counter) |
| Use case | API rate limiting, short bursts OK | Traffic shaping, smooth flow needed |
| Math model | Like bank deposits (accumulate + spend) | Like constant-speed conveyor belt |
Why Does Nginx Use Leaky Bucket While Redis Uses Token Bucket?
Nginx's limit_req module uses the leaky bucket model because as a reverse proxy, Nginx needs to smooth the request rate reaching backend servers, preventing bursts from overwhelming them.
Redis's redis_cell module (GCRA algorithm) and many API Gateways use the token bucket model because for user API calls, allowing short bursts provides better user experience (users should not be throttled just because they happened to send two requests within milliseconds).
39.3 Sliding Window Rate Limiting
Problem Definition
Limit "at most N requests in any consecutive T seconds." This is more precise than fixed windows (no boundary problem) and more intuitive than token buckets (directly expresses "N times per T seconds" semantics).
Method 1: Sliding Window Log (Exact but Memory-Heavy)
Record every request's timestamp. When checking, count how many timestamps fall within the last T seconds.
import time
from collections import deque
class SlidingWindowLog:
"""
Sliding Window Log Rate Limiter
Principle: maintain a timestamp queue. For each request:
1. Remove all timestamps older than T seconds
2. If queue length < N, allow and record timestamp
3. Otherwise reject
Pros: Exact, no window boundary issues
Cons: Each user needs up to N timestamps stored, memory O(N * users)
"""
def __init__(self, max_requests: int, window_seconds: float):
self.max_requests = max_requests
self.window_seconds = window_seconds
self.timestamps = deque()
def allow(self) -> bool:
now = time.time()
# Remove timestamps outside the window
while self.timestamps and self.timestamps[0] <= now - self.window_seconds:
self.timestamps.popleft()
if len(self.timestamps) < self.max_requests:
self.timestamps.append(now)
return True
else:
return False
Method 2: Sliding Window Counter (Approximate but Memory-Efficient)
Divide time into small sub-windows (e.g., a 1-second window into ten 100ms slots), each maintaining a counter.
import time
class SlidingWindowCounter:
"""
Sliding Window Counter Rate Limiter (approximate algorithm)
Idea: divide the large window into multiple small slots, store each slot's count
in a circular array. When checking, sum all slots within the current window.
Precision depends on slot count: more slots = more precise but more memory.
This is the algorithm variant used by Cloudflare in their edge rate limiting.
"""
def __init__(self, max_requests: int, window_seconds: float, num_slots: int = 10):
self.max_requests = max_requests
self.window_seconds = window_seconds
self.num_slots = num_slots
self.slot_duration = window_seconds / num_slots
self.slots = [0] * num_slots
self.slot_timestamps = [0.0] * num_slots
self.current_slot = 0
def _get_slot_index(self, timestamp: float) -> int:
"""Calculate slot index from timestamp"""
return int(timestamp / self.slot_duration) % self.num_slots
def allow(self) -> bool:
now = time.time()
current_idx = self._get_slot_index(now)
# Clear expired slots
self._clean_expired(now)
# Sum all slots in window
total = sum(self.slots)
if total < self.max_requests:
self.slots[current_idx] += 1
self.slot_timestamps[current_idx] = now
return True
else:
return False
def _clean_expired(self, now: float) -> None:
"""Clear slots outside the window"""
window_start = now - self.window_seconds
for i in range(self.num_slots):
if self.slot_timestamps[i] < window_start:
self.slots[i] = 0
Method 3: Two-Window Weighted Approximation (Industry Standard)
A minimalist but effective approximation method, widely used:
import time
class SlidingWindowApprox:
"""
Two-window weighted approximation
Idea:
- Maintain counts for current window and previous window
- Approximate current sliding window count = prev_count * overlap_ratio + curr_count
Example: window = 1 minute
Current time is 40% through current window
Approximation = prev_count * 0.6 + curr_count * 1.0
Described in Cloudflare's blog (2017) as their production rate limiting algorithm.
Maximum error approximately 12.5% (provable), precise enough for most scenarios.
"""
def __init__(self, max_requests: int, window_seconds: float):
self.max_requests = max_requests
self.window_seconds = window_seconds
self.prev_count = 0
self.curr_count = 0
self.window_start = time.time()
def allow(self) -> bool:
now = time.time()
elapsed = now - self.window_start
# Check if window needs rotation
if elapsed >= self.window_seconds:
windows_passed = int(elapsed / self.window_seconds)
if windows_passed == 1:
self.prev_count = self.curr_count
else:
self.prev_count = 0
self.curr_count = 0
self.window_start += windows_passed * self.window_seconds
elapsed = now - self.window_start
# Calculate weighted approximation
weight = 1.0 - elapsed / self.window_seconds # Previous window overlap ratio
approx_count = self.prev_count * weight + self.curr_count
if approx_count < self.max_requests:
self.curr_count += 1
return True
else:
return False
39.4 Comparison of Three Rate Limiting Algorithms
| Dimension | Token Bucket | Leaky Bucket | Sliding Window Log | Sliding Window Counter |
|---|---|---|---|---|
| Precision | Exact | Exact | Exact | Approximate |
| Allows bursts | Yes | No | Yes (any distribution within window) | Yes |
| Memory | O(1) | O(1) or O(queue) | O(N * users) | O(slots * users) |
| Implementation | Low | Low-Medium | Low | Medium |
| Use case | API limiting | Traffic shaping | Small-scale exact limiting | Large-scale distributed limiting |
39.5 Load Balancing Algorithms
When multiple backend servers are available, how do you choose which one receives a request?
Round Robin
class RoundRobin:
"""
Simplest load balancing: distribute in circular order
Pros: Simple, perfectly fair
Cons: Ignores server capacity differences, ignores current load
"""
def __init__(self, servers: list):
self.servers = servers
self.index = 0
def next_server(self) -> str:
server = self.servers[self.index % len(self.servers)]
self.index += 1
return server
Weighted Round Robin
class WeightedRoundRobin:
"""
Weighted Round Robin: distribute requests by weight
Nginx's Smooth Weighted Round Robin algorithm:
Avoids consecutively sending requests to the same high-weight server.
Algorithm by Nginx developer phxc:
1. Each server has weight (configured) and current_weight (dynamic)
2. Each selection: all servers current_weight += weight
3. Select server with maximum current_weight
4. Selected server: current_weight -= total_weight
Effect: weights {A:5, B:1, C:1} yields sequence A,A,B,A,C,A,A not A,A,A,A,A,B,C
"""
def __init__(self, servers: list, weights: list):
self.servers = servers
self.weights = weights
self.current_weights = [0] * len(servers)
self.total_weight = sum(weights)
def next_server(self) -> str:
# All servers: current_weight += weight
for i in range(len(self.servers)):
self.current_weights[i] += self.weights[i]
# Select maximum
max_idx = 0
for i in range(1, len(self.servers)):
if self.current_weights[i] > self.current_weights[max_idx]:
max_idx = i
# Selected server subtracts total_weight
self.current_weights[max_idx] -= self.total_weight
return self.servers[max_idx]
Least Connections
import heapq
class LeastConnections:
"""
Least Connections: assign request to server with fewest active connections
Use case: when request processing times vary widely (some 10ms, some 10s)
Round robin in such scenarios causes slow requests to pile up on one server
Implementation: min-heap maintaining (active_connections, server)
"""
def __init__(self, servers: list):
self.heap = [(0, i, s) for i, s in enumerate(servers)]
heapq.heapify(self.heap)
def acquire(self) -> str:
"""Acquire a server (connection count +1)"""
conns, idx, server = heapq.heappop(self.heap)
heapq.heappush(self.heap, (conns + 1, idx, server))
return server
def release(self, server: str) -> None:
"""Release a connection (connection count -1)"""
new_heap = []
for conns, idx, s in self.heap:
if s == server:
new_heap.append((max(0, conns - 1), idx, s))
else:
new_heap.append((conns, idx, s))
self.heap = new_heap
heapq.heapify(self.heap)
Consistent Hashing
import hashlib
import bisect
class ConsistentHash:
"""
Consistent Hashing: map both keys and servers onto a ring
Proposed by Karger et al. (1997, "Consistent Hashing and Random Trees") at MIT.
Core advantage: adding/removing servers only remaps 1/N of keys
(vs modulo hashing hash(key) % N where nearly all keys remap when N changes)
Virtual nodes: each physical server maps to multiple virtual nodes
Solves load imbalance (few physical nodes may distribute unevenly on hash ring)
"""
def __init__(self, servers: list, num_replicas: int = 150):
self.num_replicas = num_replicas
self.ring = [] # Sorted hash values
self.ring_map = {} # hash value -> server name
for server in servers:
self.add_server(server)
def _hash(self, key: str) -> int:
"""Generate hash using MD5 (production may use xxhash for speed)"""
return int(hashlib.md5(key.encode()).hexdigest(), 16)
def add_server(self, server: str) -> None:
"""Add server (and its virtual nodes) to hash ring"""
for i in range(self.num_replicas):
virtual_key = f"{server}#{i}"
h = self._hash(virtual_key)
bisect.insort(self.ring, h)
self.ring_map[h] = server
def remove_server(self, server: str) -> None:
"""Remove server from hash ring"""
for i in range(self.num_replicas):
virtual_key = f"{server}#{i}"
h = self._hash(virtual_key)
self.ring.remove(h)
del self.ring_map[h]
def get_server(self, key: str) -> str:
"""Get server for a given key"""
if not self.ring:
return None
h = self._hash(key)
# Find first node >= h clockwise on the ring
idx = bisect.bisect_left(self.ring, h)
if idx == len(self.ring):
idx = 0 # Wrap around
return self.ring_map[self.ring[idx]]
Why Are Virtual Node Counts Typically 100-200?
Karger et al.'s analysis shows that with $k$ virtual nodes, load imbalance is approximately $O(\sqrt{\log n / k})$, where $n$ is the number of keys. With $k = 150$, server load deviation stays within 5-10% for most practical workloads. More virtual nodes mean more uniformity but slower lookups (bisect on longer arrays); 150 is a common engineering sweet spot.
39.6 Priority Scheduling
In task scheduling, different tasks have different priorities. Higher-priority tasks should be processed first.
import heapq
import time
from dataclasses import dataclass, field
@dataclass(order=True)
class Task:
priority: int # Priority (lower number = higher priority)
timestamp: float = field(compare=True) # Same priority uses FIFO
name: str = field(compare=False) # Task name (not compared)
payload: object = field(compare=False, default=None)
class PriorityScheduler:
"""
Priority Scheduler
Core data structure: min-heap
Supports:
- Submit task (with priority)
- Get highest priority task
- Priority boost/demotion (via mark-delete + re-insert)
Anti-starvation mechanism: aging
Long-waiting low-priority tasks gradually get priority boost
"""
def __init__(self):
self.heap = []
def submit(self, name: str, priority: int, payload=None) -> None:
"""Submit a task"""
task = Task(priority=priority, timestamp=time.time(), name=name, payload=payload)
heapq.heappush(self.heap, task)
def next_task(self) -> Task:
"""Get next task to execute"""
if self.heap:
return heapq.heappop(self.heap)
return None
def apply_aging(self, aging_seconds: float = 60.0, priority_boost: int = 1) -> None:
"""
Aging mechanism: boost priority of tasks waiting longer than aging_seconds
Why aging? Pure priority scheduling causes low-priority task "starvation"
(never gets executed). OS process schedulers (like Linux CFS) use similar mechanisms.
"""
now = time.time()
new_heap = []
for task in self.heap:
wait_time = now - task.timestamp
if wait_time > aging_seconds:
boosted_priority = max(0, task.priority - priority_boost)
new_task = Task(priority=boosted_priority, timestamp=task.timestamp,
name=task.name, payload=task.payload)
new_heap.append(new_task)
else:
new_heap.append(task)
self.heap = new_heap
heapq.heapify(self.heap)
Level 2 ยท How It Works Internally
39.7 Mathematical Model of Token Bucket
Continuous Time Model
Let $b(t)$ be the token count at time $t$, $r$ the fill rate, $B$ the bucket capacity.
$$b(t) = \min\left(B, , b(t_0) + r \cdot (t - t_0)\right)$$
where $t_0$ is the last token consumption time.
Burst Duration
If the bucket is full ($b = B$) and requests arrive at rate $R_{burst}$ (where $R_{burst} > r$), how long can the burst last?
$$T_{burst} = \frac{B}{R_{burst} - r}$$
During the burst, net consumption is $R_{burst} - r$ tokens per second (consuming $R_{burst}$ but replenishing at $r$).
Example: $r = 100$ req/s, $B = 500$, $R_{burst} = 600$ req/s $$T_{burst} = \frac{500}{600 - 100} = 1 \text{ second}$$
During this 1 second, 600 requests can pass (not just 100), then steady-state returns to 100 req/s.
Connection to Queuing Theory
A token bucket limiter can be modeled as an M/D/1/K queuing system (Poisson arrivals, deterministic service, single server, finite queue K = burst). When arrival rate $\lambda > r$, the drop probability is:
$$P_{drop} \approx 1 - \frac{r}{\lambda} \quad (\text{steady-state approximation when } \lambda \gg r)$$
39.8 GCRA Algorithm (Generic Cell Rate Algorithm)
GCRA (Generic Cell Rate Algorithm) is the rate limiting algorithm defined in ATM network standards, equivalent to token bucket but with different mathematical formulation. Redis's redis_cell module implements GCRA.
Core idea: maintain a "Theoretical Arrival Time" (TAT). Each request's TAT is the "earliest time it should arrive." If actual arrival is earlier than TAT, the request is too frequent and gets throttled.
import time
class GCRA:
"""
GCRA (Generic Cell Rate Algorithm)
Equivalent to token bucket but expressed in "time" rather than "token count."
Core variable: TAT (Theoretical Arrival Time)
- If now >= TAT: allow, TAT = now + emission_interval
- If now < TAT but TAT - now <= burst_tolerance: allow, TAT = TAT + emission_interval
- If TAT - now > burst_tolerance: reject
emission_interval = 1 / rate (minimum interval between requests)
burst_tolerance = burst * emission_interval (maximum allowed "earliness")
Defined by ITU-T I.371 standard (1993).
"""
def __init__(self, rate: float, burst: int):
self.emission_interval = 1.0 / rate
self.burst_tolerance = burst * self.emission_interval
self.tat = 0.0 # Theoretical Arrival Time
def allow(self) -> bool:
now = time.time()
new_tat = max(self.tat, now) + self.emission_interval
if new_tat - now > self.burst_tolerance + self.emission_interval:
return False
self.tat = new_tat
return True
GCRA's advantage: only one variable TAT (one timestamp), extremely small memory. Redis's CL.THROTTLE command is GCRA-based, storing only one timestamp per key.
39.9 Load Balance Proof for Consistent Hashing
Theorem (Karger et al., 1997): In consistent hashing with $n$ servers and $m$ keys:
- Expected load per server is $m/n$
- With $k$ virtual nodes, each server's load falls within $[m/n \cdot (1-\epsilon), m/n \cdot (1+\epsilon)]$ with probability at least $1 - 2/n^2$, where $\epsilon = O(\sqrt{\log n / k})$
Key Migration When Adding/Removing Nodes
When adding a new node, only $m/(n+1)$ keys need to migrate from other nodes to the new one in expectation. This is $n$ times better than modulo hashing ($m \cdot n/(n+1)$ keys need remapping).
Virtual Nodes vs Consistency Trade-off
More virtual nodes โ more uniform load โ but adding/removing nodes involves more "fragmented" migrations (small amounts from many different nodes).
Production choices:
- Amazon DynamoDB: 150-200 virtual nodes per node
- Cassandra: 256 virtual nodes per node (default
num_tokens) - Redis Cluster: no virtual nodes; directly allocates 16384 slots to nodes (essentially coarse-grained virtual nodes)
39.10 Mathematical Proof of Smooth Weighted Round Robin
Theorem: Nginx's smooth weighted round robin algorithm, within one complete weight cycle ($\sum w_i$ selections), selects each server $i$ exactly $w_i$ times.
Proof sketch:
Let total weight $W = \sum_{i=1}^{n} w_i$. Each selection round:
- All servers: $current_weight_i += w_i$
- Select maximum $current_weight_{max}$
- Selected server: $current_weight_{max} -= W$
After one complete cycle ($W$ selections), each server's current_weight net change is: $w_i \times W - w_i \times W = 0$ (added $W$ times $w_i$, selected $w_i$ times subtracting $W$ each). The system returns to initial state, with each server selected exactly $w_i$ times.
Smoothness Guarantee
Key property: high-weight servers are never selected consecutively. For example, weights {A:5, B:1, C:1} yields A B A C A A A not A A A A A B C.
This is guaranteed by "subtract total_weight after selection": once a server is selected, its current_weight drops by $W$, making it unlikely to be maximum again in subsequent rounds.
39.11 Latency Analysis of Load Balancing Algorithms
Using queuing theory to analyze performance differences:
Assumptions: N servers, each with processing rate $\mu$, total arrival rate $\lambda < N\mu$.
Round Robin Average Response Time:
If all request processing times are identical (deterministic service), Round Robin is optimalโequivalent to an M/D/1 queue (arrival rate $\lambda/N$, service rate $\mu$).
With variable processing times (exponential distribution), Round Robin degenerates to N independent M/M/1 queues:
$$E[T_{RR}] = \frac{1}{\mu - \lambda/N}$$
Join-Shortest-Queue (JSQ) Advantage:
With variable processing times, JSQ approaches the performance of a single M/M/N queue:
$$E[T_{JSQ}] \approx \frac{1}{\mu} + \frac{\rho^N}{N\mu(1-\rho)} \quad (\text{where } \rho = \lambda/(N\mu))$$
JSQ outperforms Round Robin by an order of magnitude under high loadโbecause it uses "current queue length" information for smarter decisions. But JSQ in distributed environments requires real-time knowledge of all servers' queue lengths, with high communication overhead.
Power of Two Choices:
Mitzenmacher (2001, "The Power of Two Choices in Randomized Load Balancing") proved a remarkable result: randomly selecting 2 from N servers, then choosing the less-loaded oneโthis simple strategy reduces maximum queue length from $O(\log N / \log\log N)$ (pure random) to $O(\log\log N)$ (two choices), an exponential improvement!
import random
class PowerOfTwoChoices:
"""
Power of Two Choices Load Balancing
Randomly pick two servers, choose the less-loaded one.
Mathematically proven to be exponentially better than pure random
(max queue length O(log log N)).
This is the theoretical basis for Nginx's "random two" strategy.
"""
def __init__(self, servers: list):
self.servers = servers
self.loads = {s: 0 for s in servers}
def acquire(self) -> str:
s1, s2 = random.sample(self.servers, 2)
chosen = s1 if self.loads[s1] <= self.loads[s2] else s2
self.loads[chosen] += 1
return chosen
def release(self, server: str) -> None:
self.loads[server] = max(0, self.loads[server] - 1)
Level 3 ยท What the Specifications Define
39.12 Rate Limiting Implementation in Nginx
Nginx's ngx_http_limit_req_module implements the leaky bucket algorithm. Core configuration:
# Define rate limit zone (shared memory)
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
# Apply rate limiting
location /api/ {
limit_req zone=api burst=20 nodelay;
}
Parameter Analysis:
rate=10r/s: Leaky bucket drain rate, 10 requests per secondburst=20: Bucket capacity, queue up to 20 requestsnodelay: Queued requests are forwarded immediately (effectively token bucket behavior)$binary_remote_addr: Rate limit per client IP (each IP has independent bucket)
Internal Implementation (Nginx Source Analysis):
Nginx uses a red-black tree (per-zone) to store each key's rate limit state (last access time + excess count), using shared memory (ngx_slab) across worker processes.
// Simplified core logic (ngx_http_limit_req_module.c)
// excess represents "overflow count in bucket" (lazy calculation)
excess = lr->excess - rate * ms_since_last / 1000;
if (excess < 0) excess = 0;
if (excess + 1000 <= burst * 1000) {
// Allow: update excess
lr->excess = excess + 1000;
lr->last = now;
return NGX_OK;
} else {
// Reject: return 503
return NGX_HTTP_SERVICE_UNAVAILABLE;
}
Nginx uses integer arithmetic (multiply by 1000) to avoid floating-point, critical in hot paths.
Meaning of nodelay
Without nodelay: requests exceeding rate but within burst are delayed (queued), waiting until the bucket drains enough. This is pure leaky bucket behavior.
With nodelay: as long as burst bucket is not full, all requests forward immediatelyโbut burst recovery still follows rate. The effect equals a token bucket: allows bursts, but long-term average stays at rate.
39.13 Rate Limiting Implementation in Redis
Approach 1: Fixed Window (Simplest)
-- Fixed window rate limit (Lua script for atomicity)
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local current = tonumber(redis.call('GET', key) or "0")
if current >= limit then
return 0 -- Reject
end
redis.call('INCR', key)
if current == 0 then
redis.call('EXPIRE', key, window)
end
return 1 -- Allow
Approach 2: Sliding Window (Sorted Set)
-- Sliding window rate limit (using sorted set)
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
-- Remove records outside window
redis.call('ZREMRANGEBYSCORE', key, 0, now - window * 1000)
-- Count
local count = redis.call('ZCARD', key)
if count >= limit then
return 0 -- Reject
end
-- Add current request
redis.call('ZADD', key, now, now .. math.random())
redis.call('PEXPIRE', key, window * 1000)
return 1 -- Allow
Approach 3: redis_cell Module (GCRA)
-- CL.THROTTLE command (redis_cell module)
-- CL.THROTTLE key burst rate period [quantity]
-- Example: max 30 requests per 60 seconds, allow burst of 10
CL.THROTTLE user:123 10 30 60 1
-- Returns:
-- 1) 0 (allowed) or 1 (denied)
-- 2) Total capacity (burst + 1)
-- 3) Remaining capacity
-- 4) If denied, seconds until retry
-- 5) Seconds until bucket fully recovers
redis_cell's advantage: one command completes rate limit check + state update atomically; returns rich metadata (remaining quota, retry time) for client backoff strategies.
39.14 Multi-Layer Rate Limiting in API Gateways
Production rate limiting is typically multi-layered:
User request โ CDN layer โ API Gateway โ Service layer โ Database connection pool
โ โ โ โ
DDoS protection User/API-level Service-level Resource protection
(L3/L4) (L7) (Circuit breaker) (Semaphore)
Each Layer Has Different Responsibilities:
| Layer | Granularity | Algorithm Choice | Typical Implementation |
|---|---|---|---|
| CDN/Edge | IP / Geography | Fixed window + blocklist | Cloudflare, AWS WAF |
| API Gateway | User / API Key / Route | Token bucket / Sliding window | Kong, Envoy, Nginx |
| Service-to-service | Service identity / Method | Adaptive limiting | Sentinel, Hystrix |
| Database | Connections / Query complexity | Semaphore / Queuing | Connection Pool |
39.15 Google SRE's Rate Limiting Practice
Google SRE describes their rate limiting philosophy in Site Reliability Engineering (Beyer et al., 2016, O'Reilly):
Core Principle: Client-Side Adaptive Throttling
Traditional rate limiting is server-side rejection. Google's approach has clients proactively reduce requests:
import random
class AdaptiveThrottle:
"""
Google SRE Client-Side Adaptive Throttling
Idea: Client tracks recent request success rate.
If server starts rejecting, client proactively reduces send rate.
Core formula:
rejection_probability = max(0, (requests - K * accepts) / (requests + 1))
Where:
- requests: total requests in recent window
- accepts: requests accepted by server in recent window
- K: multiplier (typically K=2), meaning "don't throttle while success rate > 1/K"
K=2 means: as long as >50% of requests succeed, client doesn't throttle.
When success rate drops below 50%, client starts probabilistically dropping requests.
Advantages:
1. No server-side threshold configuration needed
2. Automatically adapts to backend capacity changes
3. Graceful degradation (not sudden 100% to 0%)
"""
def __init__(self, K: float = 2.0):
self.K = K
self.requests = 0
self.accepts = 0
def should_send(self) -> bool:
"""Client decides whether to send request"""
rejection_prob = max(0.0, (self.requests - self.K * self.accepts) / (self.requests + 1))
if random.random() < rejection_prob:
return False # Proactively drop
return True
def record_request(self) -> None:
self.requests += 1
def record_accept(self) -> None:
self.accepts += 1
def decay(self, factor: float = 0.5) -> None:
"""Periodically decay counters (sliding window effect)"""
self.requests = int(self.requests * factor)
self.accepts = int(self.accepts * factor)
Why K=2?
- K=1: Start throttling with any rejectionโtoo sensitive
- K=2: Allow up to 50% failure before throttlingโbalanced sensitivity and stability
- K=3: Allow 67% failure before throttlingโtoo sluggish
Google uses K=2 in production because it achieves good balance between "fast reaction" and "avoiding false triggers."
Comparison with Traditional Rate Limiting
| Dimension | Traditional Server-Side | Google Client-Side Adaptive |
|---|---|---|
| Configuration | Requires preset thresholds | No configuration (self-adaptive) |
| Adaptability | Static threshold, needs update on scale | Automatically follows backend capacity |
| Latency | Request reaches server before rejection | Client drops directly, saves round trip |
| Cascading failure | Upstream still sends heavy traffic | Upstream proactively reduces pressure |
| Complexity | Server-side implementation | Requires client library support |
39.16 Combined Rate Limiting in Microservices
Production systems typically combine multiple rate limiting strategies:
class CompositeRateLimiter:
"""
Composite Rate Limiter: multi-dimensional limiting
Example: An API has simultaneous limits:
- Global: 10000 req/s (protect entire service)
- Per user: 100 req/s (fairness)
- Per user per endpoint: 10 req/s (prevent single endpoint abuse)
- Per IP: 1000 req/s (DDoS prevention)
Requests must pass ALL dimension checks to proceed.
"""
def __init__(self):
self.global_limiter = TokenBucket(rate=10000, burst=15000)
self.user_limiters = {}
self.ip_limiters = {}
def allow(self, user_id: str, ip: str, endpoint: str) -> bool:
if not self.global_limiter.allow():
return False
if user_id not in self.user_limiters:
self.user_limiters[user_id] = TokenBucket(rate=100, burst=200)
if not self.user_limiters[user_id].allow():
return False
if ip not in self.ip_limiters:
self.ip_limiters[ip] = TokenBucket(rate=1000, burst=2000)
if not self.ip_limiters[ip].allow():
return False
return True
39.17 Challenges in Distributed Rate Limiting
Problem: Multiple API Gateway instances each maintain independent counters. User limit 100 req/s, 5 Gateway instances, user can actually send 500 req/s.
Approach 1: Centralized (Redis)
All instances share one Redis counter. Precise but adds one Redis round-trip per request (~0.5-1ms).
Approach 2: Quota Sharding
Each Gateway gets 100/5 = 20 req/s quota. Simple but uneven (some Gateways receive more of a user's requests).
Approach 3: Local Limiting + Periodic Sync
Each Gateway rate-limits locally (rough quota), syncs actual usage to center every 1-5 seconds and adjusts quota.
class DistributedRateLimiter:
"""
Distributed Rate Limiter: local bucket + periodic Redis sync
Balances precision and latency:
- Local decision: 0 latency
- Periodic sync: 1 Redis interaction per second, adjusts local quota
Error: at most rate * sync_interval * num_instances deviation
Example: 100 req/s * 1s * 5 instances = at most 500 extra requests
Acceptable for most scenarios (rate limiting need not be absolutely precise)
"""
def __init__(self, total_rate: float, num_instances: int, sync_interval: float = 1.0):
self.local_rate = total_rate / num_instances
self.local_bucket = TokenBucket(rate=self.local_rate, burst=int(self.local_rate * 2))
self.sync_interval = sync_interval
self.local_count = 0
def allow(self) -> bool:
if self.local_bucket.allow():
self.local_count += 1
return True
return False
Envoy's Local + Global Rate Limiting Design:
Envoy proxy has two layers:
- Local Rate Limit Filter: each Envoy instance independently rate-limits, no external dependency
- Global Rate Limit Service: calls external gRPC service (like
ratelimitproject) for global judgment
Requests pass local limiting first (quickly reject obviously over-limit), then global limiting (precise judgment). This layered design achieves excellent balance between latency and precision.
Level 4 ยท Edge Cases and Pitfalls
39.18 How to Answer Rate Limiting in System Design Interviews
Typical Scenario: "Design a URL shortener/chat system/payment systemโhow do you handle traffic control?"
Answer Framework (5 Steps):
Step 1: Clarify Rate Limiting Goals
- Protect what? (backend services, database, third-party APIs)
- Limit what? (total requests, per-user, per-IP)
- Limit to what degree? (estimate from system capacity)
Step 2: Choose Rate Limiting Algorithm
- User-friendly (allow bursts): Token bucket
- Need smooth traffic: Leaky bucket
- Need precise "N times per T seconds" semantics: Sliding window
Step 3: Determine Rate Limiting Location
Client โ CDN โ Load Balancer โ API Gateway โ Service โ Database
L3/L4 L4/L7 L7 user-level Service Connection pool
Earlier rejection wastes fewer resources. But earlier layers lack user identity information.
Step 4: Handle Rate Limit Responses
- HTTP 429 Too Many Requests
Retry-Afterheader (tell client when to retry)X-RateLimit-Limit,X-RateLimit-Remaining,X-RateLimit-Resetheaders
Step 5: Discuss Distributed Rate Limiting
- Centralized vs distributed (Redis vs local)
- Precision vs latency trade-off
- "Soft" limiting: not sudden 100% rejection, but gradual degradation
39.19 Rate Limiting Anti-Patterns
Anti-pattern 1: Too Coarse Granularity
Only global limiting (e.g., "whole service 10000 QPS"), no per-user limiting. Consequence: one malicious user exhausts all quota, affecting every normal user.
Anti-pattern 2: No Meaningful Error Responses
Returning 500 Internal Server Error instead of 429 Too Many Requests. Consequence: client cannot distinguish "service failure" from "rate limited," cannot implement targeted backoff.
Anti-pattern 3: Hardcoded Rate Limit Configuration
# Anti-pattern
RATE_LIMIT = 100 # Hardcoded
# Correct approach
RATE_LIMIT = config.get("rate_limit", default=100) # Dynamically adjustable
Consequence: cannot quickly respond to traffic changes (requires code deployment to change thresholds).
Anti-pattern 4: No Exponential Backoff on Rate-Limited Retries
Client immediately retries infinitely after being rate-limited, creating a "retry storm"โmore rate-limited requests generate more retry traffic, vicious cycle.
Correct approach:
import random
import time
def request_with_backoff(url, max_retries=5):
"""Retry with exponential backoff"""
for attempt in range(max_retries):
response = send_request(url)
if response.status_code == 429:
# Exponential backoff + jitter
wait = min(2 ** attempt + random.uniform(0, 1), 30)
time.sleep(wait)
else:
return response
raise Exception("Max retries exceeded")
Why add jitter? If 100 clients are rate-limited simultaneously, without jitter they all retry at exactly the same time (e.g., all retry after 1s), causing "thundering herd." Jitter spreads retry times.
39.20 Rate Limiting vs Circuit Breaking
| Dimension | Rate Limiting | Circuit Breaking |
|---|---|---|
| Trigger condition | Request rate exceeds threshold | Downstream error rate exceeds threshold |
| Protects | Self | Downstream service |
| Behavior | Reject excess requests | Temporarily cut all requests to downstream |
| Recovery | Auto-recovers when rate drops | Half-open state (try small number of test requests) |
| Algorithm | Token bucket/Leaky bucket/Sliding window | State machine (ClosedโOpenโHalf-Open) |
They often work together: rate limiting protects you from being overwhelmed by upstream; circuit breaking protects downstream from being overwhelmed by you.
import time
from enum import Enum
class CircuitState(Enum):
CLOSED = "closed" # Normal: requests forwarded
OPEN = "open" # Broken: all requests fail immediately
HALF_OPEN = "half_open" # Testing: allow few requests to test
class CircuitBreaker:
"""
Circuit Breaker
State transitions:
CLOSED -> OPEN: error rate exceeds threshold (e.g., 50%)
OPEN -> HALF_OPEN: after recovery_timeout
HALF_OPEN -> CLOSED: test request succeeds
HALF_OPEN -> OPEN: test request fails
Netflix Hystrix (2012) pioneered this pattern, now widely adopted by
Resilience4j, Sentinel, and similar frameworks.
"""
def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0):
self.state = CircuitState.CLOSED
self.failure_count = 0
self.failure_threshold = failure_threshold
self.recovery_timeout = recovery_timeout
self.last_failure_time = 0
def allow(self) -> bool:
if self.state == CircuitState.CLOSED:
return True
elif self.state == CircuitState.OPEN:
if time.time() - self.last_failure_time >= self.recovery_timeout:
self.state = CircuitState.HALF_OPEN
return True
return False
else: # HALF_OPEN
return True
def record_success(self) -> None:
if self.state == CircuitState.HALF_OPEN:
self.state = CircuitState.CLOSED
self.failure_count = 0
def record_failure(self) -> None:
self.failure_count += 1
self.last_failure_time = time.time()
if self.state == CircuitState.HALF_OPEN:
self.state = CircuitState.OPEN
elif self.failure_count >= self.failure_threshold:
self.state = CircuitState.OPEN
39.21 Production Case Study: Multi-Dimensional Rate Limiting System
Requirements: Design rate limiting for a SaaS API platform:
- Free users: 10 req/min, paid users: 1000 req/min
- Global limit: 50000 req/s
- Per IP: 100 req/s
- Specific endpoints (e.g., /upload): 5 req/min additional limit
- Clear error messages and retry guidance on rate limit
from dataclasses import dataclass
from typing import Optional
import time
@dataclass
class RateLimitResult:
allowed: bool
limit: int
remaining: int
reset_at: float
retry_after: Optional[float] = None
def to_headers(self) -> dict:
"""Convert to HTTP response headers"""
headers = {
'X-RateLimit-Limit': str(self.limit),
'X-RateLimit-Remaining': str(max(0, self.remaining)),
'X-RateLimit-Reset': str(int(self.reset_at)),
}
if self.retry_after is not None:
headers['Retry-After'] = str(int(self.retry_after))
return headers
class MultiDimensionRateLimiter:
"""
Multi-Dimension Rate Limiter
Design:
1. Check from strictest (most easily triggered) to most lenient
2. Short-circuit on first rejection
3. Only allow if ALL layers pass
4. Return info from the tightest layer (help user understand bottleneck)
"""
def __init__(self):
self.global_bucket = TokenBucket(rate=50000, burst=75000)
self.ip_buckets = {}
self.user_buckets = {}
self.endpoint_buckets = {}
def check(self, user_id: str, user_tier: str, ip: str, endpoint: str) -> RateLimitResult:
now = time.time()
# Layer 1: Global
if not self.global_bucket.allow():
return RateLimitResult(
allowed=False, limit=50000, remaining=0,
reset_at=now + 1, retry_after=1.0
)
# Layer 2: IP-level
if ip not in self.ip_buckets:
self.ip_buckets[ip] = TokenBucket(rate=100, burst=200)
if not self.ip_buckets[ip].allow():
return RateLimitResult(
allowed=False, limit=100, remaining=0,
reset_at=now + 1, retry_after=1.0
)
# Layer 3: User-level (by tier)
user_rate = 1000 if user_tier == 'paid' else 10
user_burst = user_rate * 2
if user_id not in self.user_buckets:
self.user_buckets[user_id] = TokenBucket(rate=user_rate / 60, burst=user_burst // 60)
if not self.user_buckets[user_id].allow():
return RateLimitResult(
allowed=False, limit=user_rate, remaining=0,
reset_at=now + 60, retry_after=60.0 / user_rate
)
# Layer 4: Specific endpoint
if endpoint == '/upload':
ep_key = f"{user_id}:{endpoint}"
if ep_key not in self.endpoint_buckets:
self.endpoint_buckets[ep_key] = TokenBucket(rate=5.0/60, burst=5)
if not self.endpoint_buckets[ep_key].allow():
return RateLimitResult(
allowed=False, limit=5, remaining=0,
reset_at=now + 60, retry_after=12.0
)
return RateLimitResult(
allowed=True, limit=user_rate, remaining=user_rate - 1,
reset_at=now + 60
)
39.22 Scheduling Algorithms in Real Operating Systems
Linux Completely Fair Scheduler (CFS)
CFS, developed by Ingo Molnar (2007), uses a red-black tree to maintain all runnable processes, sorted by "virtual runtime" (vruntime). Each scheduling decision selects the process with minimum vruntime.
Core data structure: red-black tree (rb_tree), keyed by vruntime.
class SimpleCFS:
"""
Simplified CFS Scheduler Model
Core idea: give each process fair CPU time.
"Fair" defined as: all processes' vruntimes should be approximately equal.
vruntime growth rate is inversely proportional to weight:
High-priority process vruntime grows slowly -> gets scheduled more
Low-priority process vruntime grows quickly -> gets scheduled less
"""
def __init__(self):
import sortedcontainers
self.tree = sortedcontainers.SortedList(key=lambda p: p['vruntime'])
def add_process(self, pid: int, nice: int = 0):
"""Add process (lower nice = higher priority)"""
weight = self._nice_to_weight(nice)
process = {
'pid': pid,
'vruntime': self._min_vruntime(),
'weight': weight,
'nice': nice
}
self.tree.add(process)
def schedule(self) -> dict:
"""Select next process to run"""
if not self.tree:
return None
return self.tree[0] # Minimum vruntime (leftmost in red-black tree, O(1))
def tick(self, running_process: dict, elapsed_ns: int):
"""Clock interrupt: update running process's vruntime"""
self.tree.remove(running_process)
delta_vruntime = elapsed_ns * 1024 / running_process['weight']
running_process['vruntime'] += delta_vruntime
self.tree.add(running_process)
def _min_vruntime(self) -> float:
if self.tree:
return self.tree[0]['vruntime']
return 0.0
def _nice_to_weight(self, nice: int) -> int:
"""Nice value to weight mapping (simplified)"""
return max(1, int(1024 * (1.25 ** (-nice))))
Why CFS uses a red-black tree rather than a heap: it needs O(log n) insertion and deletion (processes block/wake, removing/adding from tree), while a heap can only get minimum in O(1) but requires O(n) to find an arbitrary element for deletion.
Chapter Summary
Rate limiting and scheduling are the most direct applications of data structures and algorithms in production environments. They are not "interview flights of fancy"โevery backend service has a rate limiter protecting it, and every operating system kernel runs a scheduler.
Core takeaways:
- Token Bucket = burst-allowing rate limiting, one counter + lazy evaluation
- Leaky Bucket = strictly uniform rate limiting, essentially a fixed-rate queue
- Sliding Window = precise "N times per T seconds" semantics, log method is exact but memory-heavy
- Consistent Hashing = minimize key migration on node changes, virtual nodes ensure uniformity
- Power of Two Choices = simple yet exponentially better than pure random
The core industrial lesson: there is no perfect rate limiting algorithm, only the right combination for your scenario. Trading off between latency, precision, memory, and implementation complexity is the essence of system design.