Bloom Filters and Probabilistic Data Structures
Chapter 9: Bloom Filters and Probabilistic Data Structures
Your web crawler has fetched 1 billion URLs. For every new link discovered, you need to determine "has this been crawled before?" Store them in a HashSet? At 100 bytes average per URL, that's 100GB of RAM just for storage. Query a database? Tens of thousands of disk I/O operations per second won't keep up with the crawl rate.
A Bloom filter gives a stunning answer: with just 1.2GB of memory (about 10 bits per URL), it can tell you whether a URL "possibly exists" or "definitely does not exist" with a false positive rate of 0.1%. The trade-off? It will occasionally say "exists" when the URL was never inserted (false positive), but it will never say "does not exist" when the URL actually is present (no false negatives).
This idea of trading a small amount of space for approximate answers spawned an entire family: Bloom filters, Count-Min Sketch, HyperLogLog, Cuckoo Filters... They are collectively called Probabilistic Data Structures. This chapter will dive deep into each member, from mathematical foundations to engineering practice.
Level 1 - What You Need to Know
9.1 The Core Idea of Bloom Filters
The Bloom filter was invented by Burton Howard Bloom in 1970. Its core idea can be summarized in one sentence:
A bit array plus multiple hash functions enables extremely space-efficient set membership testing. Query results are either "possibly exists" or "definitely does not exist."
Why "possibly exists" instead of "definitely exists"? Because different elements' hash values may collide—multiple distinct elements can set the same group of bits to 1, making an element that was never inserted appear to "exist." But if any hash position is 0, the element was definitely never inserted.
Intuitive Understanding
Imagine a bit array of m bits, initially all zeros. You have k independent hash functions, each mapping input to the range [0, m-1].
- Insert: Compute k hash values for the element, set the corresponding k positions in the bit array to 1.
- Query: Compute k hash values for the element, check if all k corresponding positions are 1. All 1s → "possibly exists"; any 0 → "definitely does not exist."
Insert "apple":
h1("apple") = 3, h2("apple") = 7, h3("apple") = 11
Bit array: 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0
↑ ↑ ↑ ↑
idx: 3 7 11
Insert "banana":
h1("banana") = 1, h2("banana") = 7, h3("banana") = 14
Bit array: 0 1 0 1 0 0 0 1 0 0 0 1 0 0 1 0
↑ ↑
idx:1 idx:14
Query "cherry":
h1("cherry") = 3, h2("cherry") = 5, h3("cherry") = 14
Position 3=1 ✓, Position 5=0 ✗ → definitely does not exist
Query "grape":
h1("grape") = 1, h2("grape") = 7, h3("grape") = 11
Position 1=1 ✓, Position 7=1 ✓, Position 11=1 ✓ → possibly exists (false positive!)
This is the essence of a Bloom filter: sacrifice a tiny amount of accuracy for enormous space savings.
9.2 Python Implementation of a Complete Bloom Filter
import math
import mmh3 # MurmurHash3, high-performance non-cryptographic hash
class BloomFilter:
"""
Bloom Filter implementation.
Parameters:
expected_items (int): Expected number of elements to insert (n)
false_positive_rate (float): Target false positive rate (epsilon), e.g., 0.01 for 1%
"""
def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
self.n = expected_items
self.fp_rate = false_positive_rate
# Compute optimal bit array size m
# m = -n * ln(epsilon) / (ln2)^2
self.m = self._optimal_bit_count(expected_items, false_positive_rate)
# Compute optimal number of hash functions k
# k = (m/n) * ln2
self.k = self._optimal_hash_count(self.m, expected_items)
# Bit array, implemented using bytearray
self.bit_array = bytearray(math.ceil(self.m / 8))
self.count = 0 # Number of inserted elements
@staticmethod
def _optimal_bit_count(n: int, fp_rate: float) -> int:
"""Compute optimal bit array size"""
m = -n * math.log(fp_rate) / (math.log(2) ** 2)
return int(math.ceil(m))
@staticmethod
def _optimal_hash_count(m: int, n: int) -> int:
"""Compute optimal number of hash functions"""
k = (m / n) * math.log(2)
return int(math.ceil(k))
def _get_hash_positions(self, item: str) -> list:
"""
Generate k hash positions using double hashing technique.
Technical note: We don't need k truly independent hash functions.
Just two hash values h1, h2 suffice to generate k positions:
hash_i = (h1 + i * h2) % m
This technique comes from Kirsch & Mitzenmacher (2006),
"Less Hashing, Same Performance: Building a Better Bloom Filter."
"""
h1 = mmh3.hash(item, seed=0) % self.m
h2 = mmh3.hash(item, seed=42) % self.m
positions = []
for i in range(self.k):
pos = (h1 + i * h2) % self.m
positions.append(pos)
return positions
def _set_bit(self, position: int):
"""Set the bit at specified position to 1"""
byte_idx = position // 8
bit_idx = position % 8
self.bit_array[byte_idx] |= (1 << bit_idx)
def _get_bit(self, position: int) -> bool:
"""Get the bit value at specified position"""
byte_idx = position // 8
bit_idx = position % 8
return bool(self.bit_array[byte_idx] & (1 << bit_idx))
def add(self, item: str):
"""Insert an element"""
for pos in self._get_hash_positions(item):
self._set_bit(pos)
self.count += 1
def __contains__(self, item: str) -> bool:
"""
Query whether an element possibly exists.
Returns True: element possibly exists (with fp_rate probability of being wrong)
Returns False: element definitely does not exist
"""
return all(self._get_bit(pos) for pos in self._get_hash_positions(item))
@property
def current_false_positive_rate(self) -> float:
"""Calculate actual false positive rate based on current insertion count"""
# (1 - e^(-kn/m))^k
exponent = -self.k * self.count / self.m
return (1 - math.exp(exponent)) ** self.k
def __repr__(self):
return (
f"BloomFilter(m={self.m} bits, k={self.k} hashes, "
f"n={self.count}/{self.n}, "
f"actual_fp≈{self.current_false_positive_rate:.6f})"
)
# Usage example
if __name__ == "__main__":
# Create a Bloom filter expecting 1 million items with 1% false positive rate
bf = BloomFilter(expected_items=1_000_000, false_positive_rate=0.01)
print(f"Bit array size: {bf.m:,} bits = {bf.m / 8 / 1024 / 1024:.2f} MB")
print(f"Number of hash functions: {bf.k}")
# Output: Bit array size: 9,585,059 bits = 1.14 MB
# Output: Number of hash functions: 7
# Insert data
for i in range(100_000):
bf.add(f"url_{i}")
# Query existing elements
print(f"'url_42' in bf: {'url_42' in bf}") # True (correct)
print(f"'url_99999' in bf: {'url_99999' in bf}") # True (correct)
# Query non-existent elements
false_positives = 0
test_count = 100_000
for i in range(test_count):
if f"not_exist_{i}" in bf:
false_positives += 1
print(f"Measured false positive rate: {false_positives/test_count:.4f}")
# Approximately 0.0001 (because only 100K inserted, far below 1M capacity)
print(bf)
Key Observations
- 1 million URLs in just 1.14 MB. A Python set storing 1 million URLs (50 chars average) requires at least 100MB+.
- Number of hash functions k=7. More is not always better, nor is fewer—there is an optimum.
- False positive rate increases with insertion count. The design-time expected_items is a critical parameter; exceeding it causes dramatic degradation.
9.3 False Positive Rate Formula and Optimal Parameters
A Bloom filter has three core parameters in relation:
- m: bit array size (number of bits)
- n: expected number of inserted elements
- k: number of hash functions
- epsilon: false positive rate
False positive rate formula:
$$\varepsilon = \left(1 - e^{-kn/m}\right)^k$$
Optimal number of hash functions:
$$k_{opt} = \frac{m}{n} \cdot \ln 2 \approx 0.693 \cdot \frac{m}{n}$$
Optimal bit array size:
$$m = -\frac{n \cdot \ln \varepsilon}{(\ln 2)^2} \approx 1.44 \cdot n \cdot \log_2\frac{1}{\varepsilon}$$
import math
def bloom_filter_params(n: int, fp_rate: float):
"""Given element count and target FP rate, compute optimal parameters"""
# Optimal m
m = -n * math.log(fp_rate) / (math.log(2) ** 2)
m = int(math.ceil(m))
# Optimal k
k = (m / n) * math.log(2)
k = int(math.ceil(k))
# Bits per element
bits_per_element = m / n
print(f"Expected elements: {n:,}")
print(f"Target FP rate: {fp_rate}")
print(f"Bit array size: {m:,} bits = {m/8/1024/1024:.2f} MB")
print(f"Hash functions: {k}")
print(f"Bits per element: {bits_per_element:.1f} bits/element")
return m, k
# Parameter comparison at different FP rates
print("=== 1% false positive rate ===")
bloom_filter_params(10_000_000, 0.01)
# m=95,850,584 bits=11.44 MB, k=7, 9.6 bits/elem
print("\n=== 0.1% false positive rate ===")
bloom_filter_params(10_000_000, 0.001)
# m=143,775,877 bits=17.15 MB, k=10, 14.4 bits/elem
print("\n=== 0.01% false positive rate ===")
bloom_filter_params(10_000_000, 0.0001)
# m=191,701,169 bits=22.87 MB, k=14, 19.2 bits/elem
Practical rule: Every 10x reduction in false positive rate costs approximately 4.8 additional bits per element (i.e., m/n increases by about 4.8).
9.4 Typical Application Scenarios
Scenario 1: Cache Penetration Protection
"""
Problem: Malicious requests query large numbers of non-existent keys,
all requests penetrate the cache and hit the database directly.
Solution: Add a Bloom filter before the cache layer;
keys not in the filter are rejected immediately.
"""
class CacheWithBloomFilter:
def __init__(self, all_valid_keys):
# Add all valid keys to the Bloom filter
self.bloom = BloomFilter(
expected_items=len(all_valid_keys),
false_positive_rate=0.001 # 0.1% FP rate
)
for key in all_valid_keys:
self.bloom.add(key)
self.cache = {} # Simulated cache layer
self.db = {} # Simulated database layer
def get(self, key: str):
# Layer 1: Bloom filter interception
if key not in self.bloom:
return None # Definitely doesn't exist, return immediately
# Layer 2: Check cache
if key in self.cache:
return self.cache[key]
# Layer 3: Check database (very few requests reach here)
value = self.db.get(key)
if value is not None:
self.cache[key] = value
return value
Effect: Assuming malicious requests all query non-existent keys, the Bloom filter blocks 99.9% of them (0.1% FP rate means only one in a thousand fake requests penetrates to the database).
Scenario 2: Web Crawler URL Deduplication
"""
1 billion URL deduplication:
- HashSet approach: ~100GB memory (infeasible)
- Database approach: too slow
- Bloom filter: ~1.2GB memory (10 bits/URL, 0.1% FP rate)
"""
class WebCrawler:
def __init__(self, expected_urls=1_000_000_000):
self.seen_urls = BloomFilter(
expected_items=expected_urls,
false_positive_rate=0.001
)
def should_crawl(self, url: str) -> bool:
"""Determine whether this URL should be crawled"""
if url in self.seen_urls:
return False # Possibly already crawled, skip
# Mark as crawled
self.seen_urls.add(url)
return True
def crawl(self, seed_urls: list):
queue = list(seed_urls)
while queue:
url = queue.pop(0)
if self.should_crawl(url):
# Fetch page and extract new links
new_links = self._fetch_and_extract(url)
queue.extend(new_links)
Trade-off: A 0.1% FP rate means approximately 1 million URLs will be incorrectly skipped (1 billion x 0.1%). For a large-scale crawler, this is perfectly acceptable.
Scenario 3: Spam Filtering
"""
Google Chrome originally used Bloom filters to store malicious URL blacklists.
Benefit: No need to download the complete list of malicious URLs to the client—
a compact Bloom filter (a few MB) can cover millions of malicious URLs.
"""
class SpamFilter:
def __init__(self, known_spam_addresses):
self.spam_bloom = BloomFilter(
expected_items=len(known_spam_addresses),
false_positive_rate=0.0001 # 0.01% FP rate
)
for addr in known_spam_addresses:
self.spam_bloom.add(addr.lower())
def is_spam(self, sender: str) -> str:
"""
Returns: "definitely_not_spam" or "possibly_spam"
"""
if sender.lower() not in self.spam_bloom:
return "definitely_not_spam"
else:
# Possibly spam, needs further inspection
return "possibly_spam"
9.5 Limitations of Bloom Filters
- Cannot delete elements: Setting a bit back from 1 to 0 may affect other elements' queries (multiple elements share bits).
- Cannot enumerate inserted elements: The Bloom filter stores only hash information; original data is lost.
- Fixed capacity: Must determine expected element count at creation time; exceeding it causes rapid FP rate degradation.
- Only supports membership queries: Cannot store associated values (unlike hash tables which store key-value pairs).
Level 2 - How It Works Under the Hood
9.6 Mathematical Derivation of Bloom Filter False Positive Rate
Let us derive the false positive rate formula from first principles.
Assumption: k hash functions map elements uniformly at random into a bit array of m bits.
Step 1: When inserting one element, the probability that a specific bit is not set to 1 by any single hash function is:
$$P(\text{bit is 0 after 1 hash}) = 1 - \frac{1}{m}$$
None of k hash functions set that bit:
$$P(\text{bit is 0 after 1 element}) = \left(1 - \frac{1}{m}\right)^k$$
Step 2: After inserting n elements, the probability that the bit remains 0:
$$P(\text{bit is 0 after n elements}) = \left(1 - \frac{1}{m}\right)^{kn}$$
Using the limit $(1-1/m)^m \to e^{-1}$ (for large m):
$$P(\text{bit is 0}) \approx e^{-kn/m}$$
Step 3: The probability that the bit is 1:
$$P(\text{bit is 1}) = 1 - e^{-kn/m}$$
Step 4: A false positive occurs when all k hash positions are 1. Assuming independence between bits (this is an approximation, but Mitzenmacher & Upfal 2005 proved it is extremely accurate in practice):
$$\varepsilon = \left(1 - e^{-kn/m}\right)^k$$
Deriving the optimal k:
Taking the logarithm of epsilon:
$$\ln \varepsilon = k \cdot \ln\left(1 - e^{-kn/m}\right)$$
Let $p = e^{-kn/m}$ (the probability that a bit is 0), then:
$$\ln \varepsilon = k \cdot \ln(1-p)$$
To minimize epsilon, we minimize $k \cdot \ln(1-p)$. Taking the derivative with respect to k and setting it to zero, we find that the minimum occurs when $p = 1/2$ (i.e., each bit is set to 1 with probability exactly 1/2).
From $p = e^{-kn/m} = 1/2$, we solve:
$$k = \frac{m}{n} \cdot \ln 2$$
At this optimum, the false positive rate is:
$$\varepsilon_{min} = (1/2)^k = (1/2)^{(m/n)\ln 2} = 2^{-(m/n)\ln 2}$$
Inversely, given a target false positive rate epsilon, the required bit array size is:
$$m = -\frac{n \ln \varepsilon}{(\ln 2)^2} = \frac{n \ln(1/\varepsilon)}{(\ln 2)^2} \approx 1.44 \cdot n \cdot \log_2\frac{1}{\varepsilon}$$
import math
def verify_formula(m, n, k):
"""Verify consistency between theoretical formula and actual false positive rate"""
# Theoretical false positive rate
theoretical_fp = (1 - math.exp(-k * n / m)) ** k
# Simulation experiment
import random
bit_array = [0] * m
# Insert n random elements
for _ in range(n):
for _ in range(k):
pos = random.randint(0, m - 1)
bit_array[pos] = 1
# Test 10000 non-existent elements
false_positives = 0
tests = 10000
for _ in range(tests):
if all(bit_array[random.randint(0, m - 1)] == 1 for _ in range(k)):
false_positives += 1
actual_fp = false_positives / tests
print(f"Theoretical FP rate: {theoretical_fp:.6f}")
print(f"Measured FP rate: {actual_fp:.6f}")
print(f"Relative error: {abs(theoretical_fp - actual_fp) / theoretical_fp * 100:.1f}%")
# m=10000, n=1000, k=7 (near optimal)
verify_formula(10000, 1000, 7)
9.7 Count-Min Sketch: Frequency Estimation
Bloom filters answer "does element exist?" But what if we want to ask "how many times has this element appeared?" For example: access frequency of each IP in network traffic, or word frequency in text.
Count-Min Sketch (Cormode & Muthukrishnan, 2005, "An Improved Data Stream Summary: The Count-Min Sketch and its Applications") solves exactly this problem.
Core Idea
Count-Min Sketch is a two-dimensional counter array (d rows x w columns) paired with d hash functions.
- Update: To increment element x's count by c, compute d hash values and add c to the corresponding column in each row.
- Query: For element x's frequency, take the minimum value across the d corresponding columns.
import mmh3
import math
class CountMinSketch:
"""
Count-Min Sketch: probabilistic data structure for frequency estimation
Guarantee: estimated value >= true value (only overestimates, never underestimates)
Error bound: P(estimate - true_value > epsilon * N) < delta
where N is the total frequency sum of all elements
Parameters:
epsilon: error parameter, estimation error won't exceed epsilon * N
delta: failure probability, probability of error exceeding epsilon * N is less than delta
"""
def __init__(self, epsilon: float = 0.001, delta: float = 0.01):
# w = ceil(e/epsilon), d = ceil(ln(1/delta))
self.w = int(math.ceil(math.e / epsilon))
self.d = int(math.ceil(math.log(1.0 / delta)))
self.table = [[0] * self.w for _ in range(self.d)]
self.total_count = 0
def _hash(self, item: str, row: int) -> int:
"""Hash function for row `row`"""
return mmh3.hash(item, seed=row) % self.w
def add(self, item: str, count: int = 1):
"""Increment an element's count"""
self.total_count += count
for row in range(self.d):
col = self._hash(item, row)
self.table[row][col] += count
def estimate(self, item: str) -> int:
"""
Estimate an element's frequency.
Return value >= true frequency (collisions only increase counts).
Taking the minimum across rows minimizes overestimation.
"""
min_count = float('inf')
for row in range(self.d):
col = self._hash(item, row)
min_count = min(min_count, self.table[row][col])
return min_count
@property
def memory_usage_bytes(self) -> int:
"""Estimated memory usage (4 bytes per counter)"""
return self.w * self.d * 4
# Example: tracking website visit frequency
cms = CountMinSketch(epsilon=0.001, delta=0.01)
print(f"Table size: {cms.d} x {cms.w} = {cms.d * cms.w:,} counters")
print(f"Memory usage: {cms.memory_usage_bytes / 1024:.1f} KB")
# Simulate access logs
import random
access_log = []
# Simulate power-law distribution: a few IPs with many visits
for i in range(100):
freq = int(10000 / (i + 1)) # Zipf distribution
access_log.extend([f"192.168.1.{i}"] * freq)
random.shuffle(access_log)
# Count using CMS
exact_counts = {}
for ip in access_log:
cms.add(ip)
exact_counts[ip] = exact_counts.get(ip, 0) + 1
# Verify accuracy
for ip in ["192.168.1.0", "192.168.1.9", "192.168.1.99"]:
exact = exact_counts.get(ip, 0)
estimated = cms.estimate(ip)
error = estimated - exact
print(f"{ip}: exact={exact}, estimated={estimated}, error={error}")
Why Take the Minimum?
Each hash function may have collisions (other elements mapping to the same column), and collisions only increase counts. Taking the minimum of d rows selects the row with the fewest collisions, effectively reducing overestimation.
Error guarantee (from Cormode & Muthukrishnan 2005):
$$P\left[\hat{f}(x) - f(x) > \varepsilon \cdot N\right] < \delta$$
Where $\hat{f}(x)$ is the estimated frequency, $f(x)$ is the true frequency, and $N$ is the total element count.
9.8 HyperLogLog: Cardinality Estimation
"How many distinct visitors did this website have today?" This is a cardinality estimation problem. Exact computation requires storing all unique visitor IDs (potentially tens of GB). HyperLogLog uses only about 12KB to estimate cardinalities in the hundreds of millions, with a standard error of approximately 0.81%.
Fundamental Principle
HyperLogLog (Flajolet, Fusy, Gandouet & Meunier, 2007, "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm") is based on an elegant probabilistic observation:
If you flip a coin, the probability of seeing k consecutive heads is $1/2^k$. If the longest run of consecutive heads you observe across many experiments is k, then you've performed approximately $2^k$ experiments.
Applying this to hash values: hash elements into uniformly distributed bit strings, then observe the maximum number of leading zeros to estimate the number of distinct elements.
import mmh3
import math
class HyperLogLog:
"""
HyperLogLog cardinality estimation algorithm
Uses 2^p registers (buckets), each storing the maximum observed leading zeros + 1.
Standard error ≈ 1.04 / sqrt(2^p)
Parameters:
p: precision parameter (4 <= p <= 16), uses 2^p buckets
p=14 gives 16384 buckets, standard error ≈ 0.81%
"""
def __init__(self, p: int = 14):
assert 4 <= p <= 16, "p must be between 4 and 16"
self.p = p
self.m = 1 << p # Number of buckets = 2^p
self.registers = [0] * self.m
# alpha_m correction factor
if self.m == 16:
self.alpha = 0.673
elif self.m == 32:
self.alpha = 0.697
elif self.m == 64:
self.alpha = 0.709
else:
self.alpha = 0.7213 / (1 + 1.079 / self.m)
def _hash(self, item: str) -> int:
"""Return 64-bit hash value"""
h = mmh3.hash128(item, signed=False)
return h & ((1 << 64) - 1)
def _leading_zeros(self, hash_value: int, start_bit: int) -> int:
"""Count leading zeros starting from start_bit, plus 1"""
count = 1
for i in range(start_bit, 64):
if hash_value & (1 << (63 - i)):
break
count += 1
return count
def add(self, item: str):
"""Add an element"""
h = self._hash(item)
# First p bits determine bucket index
bucket_idx = h >> (64 - self.p)
# Remaining bits used for leading zeros
w = self._leading_zeros(h, self.p)
# Keep the maximum
self.registers[bucket_idx] = max(self.registers[bucket_idx], w)
def estimate(self) -> int:
"""Estimate cardinality"""
# Harmonic mean
indicator = sum(2.0 ** (-r) for r in self.registers)
raw_estimate = self.alpha * self.m * self.m / indicator
# Small cardinality correction
if raw_estimate <= 2.5 * self.m:
zeros = self.registers.count(0)
if zeros > 0:
# Linear Counting
raw_estimate = self.m * math.log(self.m / zeros)
# Large cardinality correction (when exceeding 2^32 / 30)
elif raw_estimate > (1 << 32) / 30.0:
raw_estimate = -(1 << 32) * math.log(1 - raw_estimate / (1 << 32))
return int(raw_estimate)
@property
def memory_usage_bytes(self) -> int:
"""Each register needs 6 bits (max value 64), simplified to 1 byte here"""
return self.m
def merge(self, other: 'HyperLogLog'):
"""Merge two HLLs (take max per bucket), supports distributed computation"""
assert self.p == other.p, "Cannot merge HLLs with different precision"
for i in range(self.m):
self.registers[i] = max(self.registers[i], other.registers[i])
# Demonstration
hll = HyperLogLog(p=14)
print(f"Number of buckets: {hll.m:,}")
print(f"Memory usage: {hll.memory_usage_bytes:,} bytes ≈ {hll.memory_usage_bytes/1024:.1f} KB")
print(f"Theoretical std error: {1.04 / math.sqrt(hll.m):.4f} = {1.04 / math.sqrt(hll.m) * 100:.2f}%")
# Add different numbers of elements and check accuracy
import random
for true_cardinality in [1000, 10000, 100000, 1000000]:
hll = HyperLogLog(p=14)
for i in range(true_cardinality):
hll.add(f"user_{i}")
estimated = hll.estimate()
error = abs(estimated - true_cardinality) / true_cardinality * 100
print(f"True cardinality: {true_cardinality:>10,} | "
f"Estimated: {estimated:>10,} | "
f"Error: {error:.2f}%")
HyperLogLog in Redis
Redis natively supports HyperLogLog using a fixed 12KB memory footprint:
import redis
r = redis.Redis()
# PFADD: Add elements to HLL
r.pfadd("visitors:2024-01-15", "user_1", "user_2", "user_3")
r.pfadd("visitors:2024-01-15", "user_2", "user_4") # user_2 is duplicate
# PFCOUNT: Estimate cardinality
count = r.pfcount("visitors:2024-01-15")
print(f"Today's unique visitors: {count}") # Output: 4
# PFMERGE: Merge multiple HLLs (union cardinality)
r.pfmerge("visitors:week", "visitors:2024-01-15", "visitors:2024-01-16")
weekly_count = r.pfcount("visitors:week")
print(f"This week's unique visitors: {weekly_count}")
Redis HLL implementation details (Antirez, Redis source code):
- Uses p=14 (16384 buckets)
- Each bucket uses 6 bits, totaling 16384 x 6 / 8 = 12,288 bytes, approximately 12KB
- Standard error approximately 0.81%
- Uses sparse representation for small cardinalities (more memory efficient), switches to dense representation for large cardinalities
9.9 Cuckoo Filter: Bloom Filter with Deletion Support
The biggest pain point of Bloom filters is the inability to delete elements. Cuckoo Filter (Fan, Andersen, Kaminsky & Mitzenmacher, 2014, "Cuckoo Filter: Practically Better Than Bloom") solves this problem.
Core Idea
The Cuckoo Filter is named after Cuckoo Hashing:
- Each bucket can store multiple fingerprints (short fragments of element hash values).
- Each element has two candidate bucket positions.
- During insertion, if both positions are full, randomly evict an existing fingerprint to its alternate position (like a cuckoo bird pushing others from the nest).
- Deletion simply removes the corresponding fingerprint from the bucket.
import mmh3
import random
class CuckooFilter:
"""
Cuckoo Filter: probabilistic set data structure with deletion support
Advantages over Bloom filters:
1. Supports deletion
2. Faster queries (only check 2 positions)
3. Better space efficiency at the same FP rate (when epsilon < 3%)
"""
def __init__(self, capacity: int, bucket_size: int = 4,
fingerprint_bits: int = 8, max_kicks: int = 500):
"""
Parameters:
capacity: number of buckets
bucket_size: number of fingerprints per bucket
fingerprint_bits: bits per fingerprint (determines FP rate ≈ 2*bucket_size/2^fingerprint_bits)
max_kicks: maximum number of relocations
"""
self.capacity = capacity
self.bucket_size = bucket_size
self.fingerprint_bits = fingerprint_bits
self.max_kicks = max_kicks
self.buckets = [[] for _ in range(capacity)]
self.count = 0
def _fingerprint(self, item: str) -> int:
"""Compute element fingerprint"""
h = mmh3.hash(item, seed=0) & ((1 << self.fingerprint_bits) - 1)
# Fingerprint cannot be 0 (0 indicates empty slot)
return h if h != 0 else 1
def _hash1(self, item: str) -> int:
"""First bucket position"""
return mmh3.hash(item, seed=42) % self.capacity
def _hash2(self, index: int, fingerprint: int) -> int:
"""
Second bucket position: partial-key cuckoo hashing
i2 = i1 XOR hash(fingerprint)
This design ensures: knowing either position and the fingerprint,
the other position can be computed.
"""
fp_hash = mmh3.hash(str(fingerprint), seed=100) % self.capacity
return (index ^ fp_hash) % self.capacity
def insert(self, item: str) -> bool:
"""
Insert an element. Returns True on success, False if filter is full.
"""
fp = self._fingerprint(item)
i1 = self._hash1(item)
i2 = self._hash2(i1, fp)
# Try both buckets
if len(self.buckets[i1]) < self.bucket_size:
self.buckets[i1].append(fp)
self.count += 1
return True
if len(self.buckets[i2]) < self.bucket_size:
self.buckets[i2].append(fp)
self.count += 1
return True
# Both buckets full, start kicking
idx = random.choice([i1, i2])
for _ in range(self.max_kicks):
# Randomly select a fingerprint to evict
victim_idx = random.randint(0, self.bucket_size - 1)
fp, self.buckets[idx][victim_idx] = self.buckets[idx][victim_idx], fp
# Evicted fingerprint goes to its alternate position
idx = self._hash2(idx, fp)
if len(self.buckets[idx]) < self.bucket_size:
self.buckets[idx].append(fp)
self.count += 1
return True
# Exceeded max kicks, consider filter full
return False
def __contains__(self, item: str) -> bool:
"""Query whether element possibly exists"""
fp = self._fingerprint(item)
i1 = self._hash1(item)
i2 = self._hash2(i1, fp)
return fp in self.buckets[i1] or fp in self.buckets[i2]
def delete(self, item: str) -> bool:
"""
Delete an element. Returns True on successful deletion.
Warning: Only elements that were actually inserted should be deleted.
Deleting a never-inserted element may cause false negatives!
"""
fp = self._fingerprint(item)
i1 = self._hash1(item)
i2 = self._hash2(i1, fp)
if fp in self.buckets[i1]:
self.buckets[i1].remove(fp)
self.count -= 1
return True
if fp in self.buckets[i2]:
self.buckets[i2].remove(fp)
self.count -= 1
return True
return False
# Demonstrate deletion capability
cf = CuckooFilter(capacity=10000, bucket_size=4, fingerprint_bits=12)
# Insert
for i in range(5000):
cf.insert(f"item_{i}")
print(f"After insert: 'item_42' in cf = {'item_42' in cf}") # True
# Delete
cf.delete("item_42")
print(f"After delete: 'item_42' in cf = {'item_42' in cf}") # False
# False positive rate test
false_positives = 0
tests = 10000
for i in range(tests):
if f"not_exist_{i}" in cf:
false_positives += 1
print(f"False positive rate: {false_positives/tests:.4f}")
# Theoretical FP rate ≈ 2 * bucket_size / 2^fingerprint_bits = 8/4096 ≈ 0.002
Cuckoo Filter vs Bloom Filter Comparison
| Feature | Bloom Filter | Cuckoo Filter |
|---|---|---|
| Deletion | Not supported | Supported |
| Query speed | Check k positions | Only check 2 positions |
| Space efficiency (epsilon < 3%) | Worse | Better |
| Space efficiency (epsilon > 3%) | Better | Worse |
| Insertion (at full load) | Always O(1) | May require kicks |
| Implementation complexity | Simple | Moderate |
Level 3 - What the Specifications Define
9.10 Burton Bloom's 1970 Original Paper
Burton Howard Bloom published "Space/Time Trade-offs in Hash Coding with Allowable Errors" in Communications of the ACM, Vol. 13, No. 7, July 1970, pp. 422-426. This 5-page paper defined the entire field.
Original Motivation
Bloom's original problem was: how to determine whether an English word is in a dictionary (for spell checking or hyphenation point lookup) with limited memory? Computers of that era had extremely limited RAM, and a complete dictionary required too much space.
Bloom's key insight: If we allow a small fraction of false positives (judging words not in the dictionary as being in the dictionary), we can solve the membership problem using far less space than the original data requires.
Core Contributions of the Paper
- Proposed the bit vector + multiple hashes architecture: This design remains fundamentally unchanged over 50 years later.
- Analyzed optimal parameter selection: Provided the false positive rate formula and optimal hash function count.
- Articulated the fundamental trade-off: More memory leads to lower FP rate; more hash functions leads to slower speed but lower FP rate (up to an optimum, after which FP rate actually increases).
Historical Impact
Bloom Filters were invented to save disk accesses (disk seeks were extremely slow in that era), and later became widely applied in:
- Akamai CDN (determining whether a URL is cached)
- Google BigTable (reducing unnecessary disk reads)
- Apache Cassandra (determining whether an SSTable might contain a key)
- Bitcoin SPV nodes (BIP 37, lightweight transaction filtering)
9.11 Information-Theoretic Lower Bound
Why does a Bloom filter need $1.44 \cdot n \cdot \log_2(1/\varepsilon)$ bits? This is not coincidence—it is close to the information-theoretic limit.
Information-Theoretic Analysis
Consider a data structure that answers "is element x in set S?" The set S has n elements drawn from a universe of size U. What is the information-theoretic lower bound for this data structure?
If we allow false positive rate epsilon, then for any element not in S, the data structure must distinguish "in S" from "not in S," allowing epsilon error rate.
Carter, Floyd, Gill, Markowsky & Wegman (1978) proved:
Theorem: Any data structure supporting set membership queries with false positive rate at most epsilon requires at least $n \cdot \log_2(1/\varepsilon)$ bits.
A Bloom filter uses $1.44 \cdot n \cdot \log_2(1/\varepsilon)$ bits—44% more than the lower bound. This 1.44 factor comes from:
$$\frac{1}{(\ln 2)^2} = \frac{1}{0.4805} \approx 2.0814$$
And $\frac{m}{n \cdot \log_2(1/\varepsilon)} = \frac{-\ln\varepsilon / (\ln 2)^2}{\log_2(1/\varepsilon)} = \frac{\ln(1/\varepsilon) / (\ln 2)^2}{\ln(1/\varepsilon) / \ln 2} = \frac{1}{\ln 2} \approx 1.44$
In other words, Bloom filter space efficiency is 1/1.44 ≈ 69.3% of the information-theoretic optimum.
More Space-Efficient Data Structures
Are there data structures that achieve the information-theoretic lower bound? Yes:
- Compressed Bloom Filter (Mitzenmacher, 2002): Approaches the lower bound by compressing the bit array (requires communication scenarios).
- Ribbon Filter (Dillinger & Walzer, 2021): Near-optimal, space utilization exceeding 99%.
- Xor Filter (Graf & Lemire, 2020): Uses approximately 1.23 bits/element x log2(1/epsilon), very close to the lower bound.
# Comparing space efficiency of different data structures
import math
def space_comparison(n, epsilon):
"""Compare theoretical space requirements of different data structures"""
info_lower_bound = n * math.log2(1/epsilon) # Information-theoretic lower bound
bloom_space = 1.44 * n * math.log2(1/epsilon) # Standard Bloom filter
xor_space = 1.23 * n * math.log2(1/epsilon) # Xor Filter
print(f"n={n:,}, epsilon={epsilon}")
print(f" Info-theoretic lower bound: {info_lower_bound/8/1024/1024:.2f} MB (100%)")
print(f" Bloom Filter: {bloom_space/8/1024/1024:.2f} MB ({bloom_space/info_lower_bound*100:.1f}%)")
print(f" Xor Filter: {xor_space/8/1024/1024:.2f} MB ({xor_space/info_lower_bound*100:.1f}%)")
print()
space_comparison(10_000_000, 0.01)
space_comparison(10_000_000, 0.001)
space_comparison(1_000_000_000, 0.01)
9.12 Theoretical Guarantees of Count-Min Sketch
Cormode & Muthukrishnan's 2005 paper "An Improved Data Stream Summary: The Count-Min Sketch and its Applications" provides rigorous theoretical guarantees.
Parameter Selection
Given error parameter epsilon and failure probability delta:
- Width $w = \lceil e/\varepsilon \rceil$
- Depth $d = \lceil \ln(1/\delta) \rceil$
- Space complexity: $O(1/\varepsilon \cdot \log(1/\delta))$
Error Guarantee
For any element x:
$$f(x) \leq \hat{f}(x) \leq f(x) + \varepsilon \cdot N$$
Where $\hat{f}(x)$ is the estimated frequency, $f(x)$ is the true frequency, and $N$ is the total frequency of all elements. This guarantee holds with probability at least $1-\delta$.
Why One-Sided Error?
Count-Min Sketch only overestimates frequencies, never underestimates. This is because collisions can only increase counter values. This property is extremely useful in many applications—for example, if CMS says an IP's access frequency doesn't exceed a threshold, then it definitely doesn't.
The Heavy Hitter Problem
A classic application of Count-Min Sketch is finding Heavy Hitters: elements whose frequency exceeds a threshold in a data stream.
class HeavyHitterDetector:
"""
Detect heavy hitters (elements with frequency exceeding phi * N) using Count-Min Sketch
"""
def __init__(self, phi: float = 0.01, epsilon: float = 0.001, delta: float = 0.01):
"""
Parameters:
phi: heavy hitter threshold (elements with frequency exceeding phi * N)
epsilon: CMS error parameter (should be much smaller than phi)
delta: failure probability
"""
self.phi = phi
self.cms = CountMinSketch(epsilon=epsilon, delta=delta)
self.candidates = set() # Candidate heavy hitters
def add(self, item: str):
self.cms.add(item)
# If estimated frequency exceeds threshold, add to candidate set
threshold = self.phi * self.cms.total_count
if self.cms.estimate(item) >= threshold:
self.candidates.add(item)
def get_heavy_hitters(self) -> list:
"""Return all possible heavy hitters with their estimated frequencies"""
threshold = self.phi * self.cms.total_count
results = []
for item in self.candidates:
est = self.cms.estimate(item)
if est >= threshold:
results.append((item, est))
return sorted(results, key=lambda x: -x[1])
9.13 Theoretical Foundations of HyperLogLog
From Flajolet-Martin to HyperLogLog
The evolution of cardinality estimation algorithms:
- Flajolet-Martin (1985): "Probabilistic Counting Algorithms for Data Base Applications." Uses bit pattern observations; a single estimator has high variance.
- LogLog (2003): Durand & Flajolet. Uses m buckets with arithmetic mean, O(log log n) space per bucket.
- SuperLogLog (2003): Truncates outliers before averaging.
- HyperLogLog (2007): Flajolet, Fusy, Gandouet & Meunier. Uses harmonic mean instead of arithmetic mean, improving accuracy to standard error $1.04/\sqrt{m}$.
Why Harmonic Mean Instead of Arithmetic Mean?
Given m bucket values $M_1, M_2, \ldots, M_m$:
-
Arithmetic mean estimator: $E_{AM} = 2^{\frac{1}{m}\sum M_i}$
- Problem: Extremely sensitive to outlier-large values (one abnormally large bucket severely overestimates)
-
Harmonic mean estimator: $E_{HM} = \frac{\alpha_m \cdot m^2}{\sum 2^{-M_i}}$
- Advantage: Naturally "down-weights" large values ($2^{-M_i}$ approaches 0 for large $M_i$)
- This is the core improvement of HyperLogLog
Correction Factor alpha_m
The correction factor $\alpha_m$ eliminates estimator bias:
$$\alpha_m = \left(m \int_0^\infty \left(\log_2\frac{2+u}{1+u}\right)^m du\right)^{-1}$$
For large m, $\alpha_m \to 1/(2\ln 2) \approx 0.7213$.
In Redis with p=14, $\alpha_{16384}$ evaluates to 0.7213/(1+1.079/16384) ≈ 0.72125.
9.14 Redis HyperLogLog Implementation
Redis's HyperLogLog implementation (written by Salvatore Sanfilippo, a.k.a. Antirez) has several important engineering optimizations:
Sparse vs Dense Representation
Redis uses two encoding schemes:
-
Sparse encoding (small cardinalities): Uses run-length encoding (RLE) to compress consecutive zero-value registers. When most registers are 0, sparse representation is far smaller than 12KB.
-
Dense encoding (large cardinalities): 16384 registers, each 6 bits, totaling 12,288 bytes. Automatically converts from sparse to dense when sparse encoding exceeds a threshold (server.hll_sparse_max_bytes, default 3000 bytes) or register values exceed 32.
# Simplified model of Redis HLL sparse encoding
class RedisHLLSparse:
"""
Simplified model of Redis HLL sparse encoding
Three opcodes:
- ZERO(len): len consecutive zero-value registers (1 byte, len <= 64)
- XZERO(len): longer consecutive zero-value registers (2 bytes, len <= 16384)
- VAL(value, len): len consecutive same non-zero value registers (1 byte, value <= 32, len <= 4)
"""
def __init__(self):
# Initial state: 16384 zero-value registers = one XZERO(16384) opcode
self.encoding = [("XZERO", 16384)]
self.is_dense = False
def estimated_memory(self):
"""Estimate current memory usage"""
total = 0
for op in self.encoding:
if op[0] == "ZERO":
total += 1
elif op[0] == "XZERO":
total += 2
elif op[0] == "VAL":
total += 1
return total
Bias Correction
Redis implements the improved correction strategy proposed by Google (Heule, Nunkesser & Hall, 2013, "HyperLogLog in Practice"):
- Small cardinalities (estimate <= 5m): Uses precomputed bias correction tables with interpolation.
- Medium cardinalities: Uses raw HyperLogLog estimate directly.
- Large cardinalities (approaching 2^64): Uses logarithmic correction.
Level 4 - Edge Cases and Pitfalls
9.15 Deep Analysis of Why Bloom Filters Cannot Delete
Why deletion is impossible:
Consider two elements A and B whose hash positions overlap:
- A's hash positions: {3, 7, 11}
- B's hash positions: {5, 7, 14}
Position 7 is shared. If we delete A (clear positions 3, 7, 11 to zero), clearing position 7 causes queries for B to return False—B actually exists, but we say it doesn't. This introduces false negatives, destroying the Bloom filter's core guarantee of "if it says not present, it's definitely not present."
Solution Comparison
| Approach | Mechanism | Pros | Cons |
|---|---|---|---|
| Counting Bloom Filter | Replace each bit with counter | Simple and intuitive | 3-4x more space |
| Cuckoo Filter | Fingerprints + cuckoo hashing | Good space efficiency | More complex implementation |
| Quotient Filter | Quotient-remainder storage | Cache-friendly | Clustering issues |
| Rebuild | Periodically rebuild from source | No extra space | Rebuild overhead |
class CountingBloomFilter:
"""
Counting Bloom Filter: each bit becomes a 4-bit counter
Supports deletion, but space overhead is 4x that of standard Bloom filter.
Counter overflow (>15) stops incrementing (saturating count), which could
cause false negatives but is extremely rare in practice.
"""
def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
import math
self.m = int(math.ceil(-expected_items * math.log(false_positive_rate)
/ (math.log(2) ** 2)))
self.k = int(math.ceil((self.m / expected_items) * math.log(2)))
# 4-bit counters, 2 counters share 1 byte
self.counters = bytearray(math.ceil(self.m / 2))
def _get_counter(self, pos: int) -> int:
byte_idx = pos // 2
if pos % 2 == 0:
return self.counters[byte_idx] & 0x0F
else:
return (self.counters[byte_idx] >> 4) & 0x0F
def _set_counter(self, pos: int, value: int):
byte_idx = pos // 2
if pos % 2 == 0:
self.counters[byte_idx] = (self.counters[byte_idx] & 0xF0) | (value & 0x0F)
else:
self.counters[byte_idx] = (self.counters[byte_idx] & 0x0F) | ((value & 0x0F) << 4)
def _hash_positions(self, item: str) -> list:
import mmh3
h1 = mmh3.hash(item, seed=0) % self.m
h2 = mmh3.hash(item, seed=42) % self.m
return [(h1 + i * h2) % self.m for i in range(self.k)]
def add(self, item: str):
for pos in self._hash_positions(item):
val = self._get_counter(pos)
if val < 15: # Saturating count, prevent overflow
self._set_counter(pos, val + 1)
def remove(self, item: str) -> bool:
"""Delete element. Returns False if element doesn't exist."""
positions = self._hash_positions(item)
# First check existence
if not all(self._get_counter(pos) > 0 for pos in positions):
return False
# Decrement all counters
for pos in positions:
val = self._get_counter(pos)
if val < 15: # Don't decrement saturated counters (safety measure)
self._set_counter(pos, val - 1)
return True
def __contains__(self, item: str) -> bool:
return all(self._get_counter(pos) > 0 for pos in self._hash_positions(item))
9.16 Interview Question: Design a URL Deduplication System
Problem: Design a system that determines whether a URL about to be crawled has already been fetched. Expected URL volume: 10 billion. Requirements: low latency, scalable.
Analysis Framework
Step 1: Clarify Requirements
- 10 billion URLs, average length 100 bytes
- Exact storage requires ~1TB RAM, infeasible
- Very low false positive rate acceptable (missing some URLs is okay)
- No deletion needed (crawled URLs won't become "uncrawled")
- Must be distributed (1TB on single machine unrealistic)
Step 2: Approach Comparison
import math
def memory_analysis():
"""Memory analysis for different approaches"""
n = 10_000_000_000 # 10 billion URLs
# Approach 1: HashSet
# Each URL hash 64-bit + overhead, approximately 16 bytes/element
hashset_mem = n * 16 / (1024**3)
print(f"HashSet: {hashset_mem:.0f} GB")
# Approach 2: Bloom filter (0.1% FP rate)
bloom_bits = -n * math.log(0.001) / (math.log(2)**2)
bloom_mem = bloom_bits / 8 / (1024**3)
print(f"Bloom Filter (0.1% FP): {bloom_mem:.1f} GB")
# Approach 3: Bloom filter (1% FP rate)
bloom_bits_1 = -n * math.log(0.01) / (math.log(2)**2)
bloom_mem_1 = bloom_bits_1 / 8 / (1024**3)
print(f"Bloom Filter (1% FP): {bloom_mem_1:.1f} GB")
memory_analysis()
# HashSet: 149 GB
# Bloom Filter (0.1% FP): 17.9 GB
# Bloom Filter (1% FP): 11.9 GB
Step 3: Distributed Architecture
┌─────────────────────────────────────────────────────┐
│ URL Deduplication System Architecture │
├─────────────────────────────────────────────────────┤
│ │
│ New URL ──→ Hash(URL) % N ──→ Route to node │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Node 0 │ │ Node 1 │ │ Node N │ │
│ │ BF 2GB │ │ BF 2GB │ │ BF 2GB │ │
│ │(1.25B URL)│ │(1.25B URL)│ │(1.25B URL)│ │
│ └──────────┘ └──────────┘ └──────────┘ │
│ │
│ Persistence: Periodically write bit arrays to │
│ disk/object storage │
│ Recovery: Load bit arrays from persistence │
│ on startup │
└─────────────────────────────────────────────────────┘
Step 4: Complete Answer Framework
"""
Interview Answer Framework:
1. Capacity Estimation
- 10 billion URLs, 0.1% FP rate
- Bloom filter: approximately 18 GB
- Split across 8 nodes, approximately 2.2 GB each
2. Sharding Strategy
- Hash URL modulo N to determine target node
- Consistent hashing (if nodes need dynamic scaling)
3. Persistence
- Periodic checkpoint of bit arrays to disk
- Incremental log: record each insert operation
- Recovery: load checkpoint + replay incremental log
4. Fault Tolerance
- Primary-replica replication: one replica per node
- Recover from checkpoint after bit array loss
5. Scaling
- Approach A: Layered Bloom filters (old data in BF1, new in BF2)
- Approach B: Consistent hashing + migration
6. Monitoring
- Monitor actual FP rate (sampling verification)
- Alert when element count approaches design capacity
7. Optimizations
- Local memory + Redis combination (hot URL local cache)
- Layered filtering: check local small BF first, then remote large BF on miss
"""
9.17 False Positive Rate and Memory Trade-offs in Engineering
In real systems, Bloom filter parameter selection must consider multiple dimensions:
def engineering_tradeoffs():
"""Parameter selection decision tree for engineering"""
scenarios = [
{
"scenario": "CDN URL cache lookup",
"n": 100_000_000,
"fp_cost": "One additional origin fetch (low cost)",
"recommended_fp": 0.01,
"rationale": "FP only means an extra origin fetch; 1% is perfectly acceptable"
},
{
"scenario": "Malicious URL blacklist",
"n": 10_000_000,
"fp_cost": "Legitimate site flagged as malicious (user complaints)",
"recommended_fp": 0.0001,
"rationale": "FP = false block, bad UX, need extremely low FP rate"
},
{
"scenario": "Database key existence check",
"n": 50_000_000,
"fp_cost": "One unnecessary disk read",
"recommended_fp": 0.001,
"rationale": "Disk IO cost is moderate; 0.1% is a good balance"
},
{
"scenario": "Deduplication (crawler/messaging)",
"n": 1_000_000_000,
"fp_cost": "Missing some URLs/messages",
"recommended_fp": 0.001,
"rationale": "Missing a few URLs is acceptable, but too many affects coverage"
}
]
for s in scenarios:
m = -s["n"] * math.log(s["recommended_fp"]) / (math.log(2)**2)
k = int(math.ceil((m/s["n"]) * math.log(2)))
mem_mb = m / 8 / 1024 / 1024
print(f"Scenario: {s['scenario']}")
print(f" Elements: {s['n']:,}")
print(f" FP rate: {s['recommended_fp']}")
print(f" Memory: {mem_mb:.1f} MB")
print(f" Hash functions: {k}")
print(f" Rationale: {s['rationale']}")
print()
engineering_tradeoffs()
Common Pitfalls
Pitfall 1: Exceeding Design Capacity
def capacity_overflow_demo():
"""Demonstrate FP rate degradation when exceeding capacity"""
bf = BloomFilter(expected_items=10000, false_positive_rate=0.01)
for filled_ratio in [0.5, 1.0, 1.5, 2.0, 3.0]:
# Recreate and fill
bf2 = BloomFilter(expected_items=10000, false_positive_rate=0.01)
actual_items = int(10000 * filled_ratio)
for i in range(actual_items):
bf2.add(f"item_{i}")
# Test FP rate
fp = 0
tests = 10000
for i in range(tests):
if f"not_exist_{i}" in bf2:
fp += 1
print(f"Fill ratio {filled_ratio:.0%} ({actual_items:,} items): "
f"measured FP rate = {fp/tests:.4f}")
# Typical output:
# Fill ratio 50%: measured FP rate = 0.0001
# Fill ratio 100%: measured FP rate = 0.0098 (close to design value 1%)
# Fill ratio 150%: measured FP rate = 0.0500 (5%!)
# Fill ratio 200%: measured FP rate = 0.1400 (14%!)
# Fill ratio 300%: measured FP rate = 0.3700 (37%! nearly useless)
Conclusion: At 150% capacity, FP rate jumps from 1% to 5%; at 200%, it reaches 14%. Design capacity must include headroom. Recommended practice: set design capacity to 1.5-2x the expected maximum.
Pitfall 2: Poor Hash Function Quality
"""
Bad example: using simple hash functions
"""
# Bad hash function: non-uniform bit distribution
def bad_hash(item, seed):
return (hash(item) + seed) % 1000 # hash() + offset is NOT independent hashing!
# Good hash function: double hashing technique
def good_hash(item, i, m):
import mmh3
h1 = mmh3.hash(item, seed=0)
h2 = mmh3.hash(item, seed=42)
return (h1 + i * h2) % m # Kirsch-Mitzenmacher (2006)
Why doesn't hash(item) + seed work? Because this is merely a simple offset—different seeds produce highly correlated hash values (constant difference), violating the Bloom filter's assumption that hash functions are "independent." Kirsch & Mitzenmacher 2006 proved that double hashing $g_i(x) = h_1(x) + i \cdot h_2(x)$ theoretically performs identically to k fully independent hash functions.
Pitfall 3: Thread Safety for Concurrent Writes
import threading
class ThreadSafeBloomFilter:
"""
Thread-safe Bloom filter
Note: A standard Bloom filter's add operation is not atomic
(it needs to set k bits, and intermediate states may be observed by other threads).
However, this doesn't affect correctness—worst case is a temporary false negative
that resolves once add completes.
The real problem is concurrent writes to the bit array may cause lost updates
(read-modify-write is not atomic).
"""
def __init__(self, expected_items, false_positive_rate=0.01):
self._bf = BloomFilter(expected_items, false_positive_rate)
self._lock = threading.Lock()
def add(self, item: str):
# Approach 1: Global lock (simple but poor performance)
with self._lock:
self._bf.add(item)
def __contains__(self, item: str) -> bool:
# Queries don't need locking (read-only)
# May see partially-updated state (safe for Bloom filters)
return item in self._bf
# Higher-performance approach: striped locks
class StripedBloomFilter:
"""Use striped locks to reduce contention"""
def __init__(self, expected_items, false_positive_rate=0.01, num_stripes=16):
self._bf = BloomFilter(expected_items, false_positive_rate)
self._stripes = [threading.Lock() for _ in range(num_stripes)]
def _get_stripe(self, position: int) -> threading.Lock:
return self._stripes[position % len(self._stripes)]
def add(self, item: str):
positions = self._bf._get_hash_positions(item)
# Acquire locks in order to avoid deadlock
unique_stripes = sorted(set(p % len(self._stripes) for p in positions))
for s in unique_stripes:
self._stripes[s].acquire()
try:
self._bf.add(item)
finally:
for s in unique_stripes:
self._stripes[s].release()
9.18 Redis Bloom Filter Module (RedisBloom)
Redis does not natively support Bloom filters, but the RedisBloom module (formerly ReBloom) provides a production-grade implementation.
import redis
# Requires Redis with RedisBloom module loaded
r = redis.Redis()
# BF.RESERVE: Create a Bloom filter
# BF.RESERVE key error_rate capacity [EXPANSION expansion] [NONSCALING]
r.execute_command('BF.RESERVE', 'urls:seen', 0.001, 10000000)
# Creates a Bloom filter with 10M capacity and 0.1% FP rate
# BF.ADD: Add element
r.execute_command('BF.ADD', 'urls:seen', 'https://example.com/page1')
# BF.MADD: Batch add
r.execute_command('BF.MADD', 'urls:seen',
'https://example.com/page2',
'https://example.com/page3')
# BF.EXISTS: Query
exists = r.execute_command('BF.EXISTS', 'urls:seen', 'https://example.com/page1')
print(f"exists: {exists}") # 1 (possibly exists)
not_exists = r.execute_command('BF.EXISTS', 'urls:seen', 'https://never-added.com')
print(f"not_exists: {not_exists}") # 0 (definitely does not exist)
# BF.MEXISTS: Batch query
results = r.execute_command('BF.MEXISTS', 'urls:seen',
'https://example.com/page1',
'https://never-added.com')
print(f"batch results: {results}") # [1, 0]
# BF.INFO: View filter information
info = r.execute_command('BF.INFO', 'urls:seen')
print(f"info: {info}")
# Output includes: Capacity, Size, Number of filters, Number of items inserted, Expansion rate
RedisBloom Auto-Scaling Mechanism
"""
RedisBloom's EXPANSION parameter enables auto-scaling:
When the first sub-filter fills up, automatically creates a second sub-filter
with capacity = original * expansion. Queries must check all sub-filters
(if any returns "exists", the result is "exists").
Costs:
- Slower queries (need to check multiple sub-filters)
- Slightly increased FP rate
- But avoids dramatic FP rate degradation from exceeding capacity
Default expansion=2, meaning capacity doubles on each expansion.
"""
# Create a scalable Bloom filter
r.execute_command('BF.RESERVE', 'scalable:bf',
0.01, # Initial FP rate
1000000, # Initial capacity
'EXPANSION', 2) # Expansion factor
# If scaling is not needed (fixed capacity, better performance)
r.execute_command('BF.RESERVE', 'fixed:bf',
0.01,
10000000,
'NONSCALING') # BF.ADD returns error when full
Production Best Practices
"""
Redis Bloom Filter production best practices:
1. Capacity Planning
- Use 2x peak data volume estimate as capacity
- Or use EXPANSION for auto-scaling
2. FP Rate Selection
- Cache penetration: 0.01 (1%)
- URL deduplication: 0.001 (0.1%)
- Security blacklist: 0.0001 (0.01%)
3. Persistence
- RedisBloom data persists with Redis RDB/AOF
- Large filters' RDB saves may cause latency spikes
4. Memory Estimation
- BF.DEBUG shows actual memory usage
- Rule of thumb: n * (-ln(epsilon) / (ln2)^2) / 8 bytes
5. Monitoring Metrics
- Inserted elements vs design capacity
- Actual FP rate (periodic sampling tests)
- Number of sub-filters (expansion count)
"""
class BloomFilterManager:
"""Redis Bloom Filter manager"""
def __init__(self, redis_client, key_prefix="bf:"):
self.r = redis_client
self.prefix = key_prefix
def create_if_not_exists(self, name: str, error_rate: float,
capacity: int, expansion: int = 2):
"""Safe creation (idempotent)"""
key = f"{self.prefix}{name}"
try:
self.r.execute_command('BF.INFO', key)
except redis.ResponseError:
self.r.execute_command('BF.RESERVE', key,
error_rate, capacity,
'EXPANSION', expansion)
def check_health(self, name: str) -> dict:
"""Check filter health status"""
key = f"{self.prefix}{name}"
info = self.r.execute_command('BF.INFO', key)
# Parse info (format varies by version)
info_dict = {}
for i in range(0, len(info), 2):
info_dict[info[i].decode()] = info[i+1]
capacity = info_dict.get('Capacity', 0)
inserted = info_dict.get('Number of items inserted', 0)
fill_ratio = inserted / capacity if capacity > 0 else 0
status = "healthy"
if fill_ratio > 0.9:
status = "critical" # Near full
elif fill_ratio > 0.7:
status = "warning" # Approaching capacity
return {
"key": key,
"capacity": capacity,
"inserted": inserted,
"fill_ratio": fill_ratio,
"status": status
}
9.19 A Unified View of Probabilistic Data Structures
Bloom filters, Count-Min Sketch, and HyperLogLog appear to solve different problems, but they share a core philosophy:
Use hashing to map data into compact summary structures, sacrificing a small amount of precision for orders-of-magnitude space savings.
| Data Structure | Question Answered | Error Type | Space Complexity |
|---|---|---|---|
| Bloom Filter | x in S? | False positives | O(n) bits |
| Count-Min Sketch | freq(x) = ? | Overestimates only | O(1/epsilon * log(1/delta)) |
| HyperLogLog | S | = ? | |
| Cuckoo Filter | x in S? + delete | False positives | O(n) bits |
| MinHash | Jaccard(A,B) = ? | +/- relative error | O(1/epsilon^2) |
Their shared properties:
- One-way: Data can be inserted but is compressed; original data is unrecoverable.
- Mergeable: Two summaries can be merged into one (supports distributed computation).
- Error-tunable: Users can increase space to reduce error.
- Stream-processable: Requires only a single pass over data; no random access needed.
These properties make probabilistic data structures particularly suitable for big data stream processing scenarios—data volumes too large for exact storage, but approximate answers are sufficient for making correct engineering decisions.
Chapter Summary
| Concept | Core Mechanism | Space | Error | Typical Applications |
|---|---|---|---|---|
| Bloom Filter | Bit array + k hashes | ~10 bits/elem @1% FP | False positives | Cache penetration, URL dedup |
| Count-Min Sketch | d x w counter matrix | O(1/epsilon * log(1/delta)) | Overestimates only | Traffic stats, hotspot detection |
| HyperLogLog | Leading zeros observation | ~12KB (Redis) | +/-0.81% | Unique visitor counts |
| Cuckoo Filter | Fingerprints + cuckoo hash | ~12 bits/elem @0.1% FP | False positives | Scenarios requiring deletion |
Core insight: Finding the optimal engineering balance between precision and space. When you face "TB-scale data with membership/frequency/cardinality queries," probabilistic data structures are almost always the right answer.