Chapter 9

Bloom Filters and Probabilistic Data Structures

Chapter 9: Bloom Filters and Probabilistic Data Structures

Your web crawler has fetched 1 billion URLs. For every new link discovered, you need to determine "has this been crawled before?" Store them in a HashSet? At 100 bytes average per URL, that's 100GB of RAM just for storage. Query a database? Tens of thousands of disk I/O operations per second won't keep up with the crawl rate.

A Bloom filter gives a stunning answer: with just 1.2GB of memory (about 10 bits per URL), it can tell you whether a URL "possibly exists" or "definitely does not exist" with a false positive rate of 0.1%. The trade-off? It will occasionally say "exists" when the URL was never inserted (false positive), but it will never say "does not exist" when the URL actually is present (no false negatives).

This idea of trading a small amount of space for approximate answers spawned an entire family: Bloom filters, Count-Min Sketch, HyperLogLog, Cuckoo Filters... They are collectively called Probabilistic Data Structures. This chapter will dive deep into each member, from mathematical foundations to engineering practice.


Level 1 - What You Need to Know

9.1 The Core Idea of Bloom Filters

The Bloom filter was invented by Burton Howard Bloom in 1970. Its core idea can be summarized in one sentence:

A bit array plus multiple hash functions enables extremely space-efficient set membership testing. Query results are either "possibly exists" or "definitely does not exist."

Why "possibly exists" instead of "definitely exists"? Because different elements' hash values may collideโ€”multiple distinct elements can set the same group of bits to 1, making an element that was never inserted appear to "exist." But if any hash position is 0, the element was definitely never inserted.

Intuitive Understanding

Imagine a bit array of m bits, initially all zeros. You have k independent hash functions, each mapping input to the range [0, m-1].

Insert "apple":
  h1("apple") = 3, h2("apple") = 7, h3("apple") = 11

  Bit array:  0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0
              โ†‘     โ†‘           โ†‘           โ†‘
            idx:   3           7          11

Insert "banana":
  h1("banana") = 1, h2("banana") = 7, h3("banana") = 14

  Bit array:  0 1 0 1 0 0 0 1 0 0 0 1 0 0 1 0
                 โ†‘                         โ†‘
               idx:1                     idx:14

Query "cherry":
  h1("cherry") = 3, h2("cherry") = 5, h3("cherry") = 14
  Position 3=1 โœ“, Position 5=0 โœ— โ†’ definitely does not exist

Query "grape":
  h1("grape") = 1, h2("grape") = 7, h3("grape") = 11
  Position 1=1 โœ“, Position 7=1 โœ“, Position 11=1 โœ“ โ†’ possibly exists (false positive!)

This is the essence of a Bloom filter: sacrifice a tiny amount of accuracy for enormous space savings.

9.2 Python Implementation of a Complete Bloom Filter

import math
import mmh3  # MurmurHash3, high-performance non-cryptographic hash


class BloomFilter:
    """
    Bloom Filter implementation.
    
    Parameters:
        expected_items (int): Expected number of elements to insert (n)
        false_positive_rate (float): Target false positive rate (epsilon), e.g., 0.01 for 1%
    """
    
    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        self.n = expected_items
        self.fp_rate = false_positive_rate
        
        # Compute optimal bit array size m
        # m = -n * ln(epsilon) / (ln2)^2
        self.m = self._optimal_bit_count(expected_items, false_positive_rate)
        
        # Compute optimal number of hash functions k
        # k = (m/n) * ln2
        self.k = self._optimal_hash_count(self.m, expected_items)
        
        # Bit array, implemented using bytearray
        self.bit_array = bytearray(math.ceil(self.m / 8))
        self.count = 0  # Number of inserted elements
    
    @staticmethod
    def _optimal_bit_count(n: int, fp_rate: float) -> int:
        """Compute optimal bit array size"""
        m = -n * math.log(fp_rate) / (math.log(2) ** 2)
        return int(math.ceil(m))
    
    @staticmethod
    def _optimal_hash_count(m: int, n: int) -> int:
        """Compute optimal number of hash functions"""
        k = (m / n) * math.log(2)
        return int(math.ceil(k))
    
    def _get_hash_positions(self, item: str) -> list:
        """
        Generate k hash positions using double hashing technique.
        
        Technical note: We don't need k truly independent hash functions.
        Just two hash values h1, h2 suffice to generate k positions:
        hash_i = (h1 + i * h2) % m
        
        This technique comes from Kirsch & Mitzenmacher (2006),
        "Less Hashing, Same Performance: Building a Better Bloom Filter."
        """
        h1 = mmh3.hash(item, seed=0) % self.m
        h2 = mmh3.hash(item, seed=42) % self.m
        
        positions = []
        for i in range(self.k):
            pos = (h1 + i * h2) % self.m
            positions.append(pos)
        return positions
    
    def _set_bit(self, position: int):
        """Set the bit at specified position to 1"""
        byte_idx = position // 8
        bit_idx = position % 8
        self.bit_array[byte_idx] |= (1 << bit_idx)
    
    def _get_bit(self, position: int) -> bool:
        """Get the bit value at specified position"""
        byte_idx = position // 8
        bit_idx = position % 8
        return bool(self.bit_array[byte_idx] & (1 << bit_idx))
    
    def add(self, item: str):
        """Insert an element"""
        for pos in self._get_hash_positions(item):
            self._set_bit(pos)
        self.count += 1
    
    def __contains__(self, item: str) -> bool:
        """
        Query whether an element possibly exists.
        
        Returns True: element possibly exists (with fp_rate probability of being wrong)
        Returns False: element definitely does not exist
        """
        return all(self._get_bit(pos) for pos in self._get_hash_positions(item))
    
    @property
    def current_false_positive_rate(self) -> float:
        """Calculate actual false positive rate based on current insertion count"""
        # (1 - e^(-kn/m))^k
        exponent = -self.k * self.count / self.m
        return (1 - math.exp(exponent)) ** self.k
    
    def __repr__(self):
        return (
            f"BloomFilter(m={self.m} bits, k={self.k} hashes, "
            f"n={self.count}/{self.n}, "
            f"actual_fpโ‰ˆ{self.current_false_positive_rate:.6f})"
        )


# Usage example
if __name__ == "__main__":
    # Create a Bloom filter expecting 1 million items with 1% false positive rate
    bf = BloomFilter(expected_items=1_000_000, false_positive_rate=0.01)
    print(f"Bit array size: {bf.m:,} bits = {bf.m / 8 / 1024 / 1024:.2f} MB")
    print(f"Number of hash functions: {bf.k}")
    # Output: Bit array size: 9,585,059 bits = 1.14 MB
    # Output: Number of hash functions: 7
    
    # Insert data
    for i in range(100_000):
        bf.add(f"url_{i}")
    
    # Query existing elements
    print(f"'url_42' in bf: {'url_42' in bf}")   # True (correct)
    print(f"'url_99999' in bf: {'url_99999' in bf}")  # True (correct)
    
    # Query non-existent elements
    false_positives = 0
    test_count = 100_000
    for i in range(test_count):
        if f"not_exist_{i}" in bf:
            false_positives += 1
    
    print(f"Measured false positive rate: {false_positives/test_count:.4f}")
    # Approximately 0.0001 (because only 100K inserted, far below 1M capacity)
    print(bf)

Key Observations

  1. 1 million URLs in just 1.14 MB. A Python set storing 1 million URLs (50 chars average) requires at least 100MB+.
  2. Number of hash functions k=7. More is not always better, nor is fewerโ€”there is an optimum.
  3. False positive rate increases with insertion count. The design-time expected_items is a critical parameter; exceeding it causes dramatic degradation.

9.3 False Positive Rate Formula and Optimal Parameters

A Bloom filter has three core parameters in relation:

False positive rate formula:

$$\varepsilon = \left(1 - e^{-kn/m}\right)^k$$

Optimal number of hash functions:

$$k_{opt} = \frac{m}{n} \cdot \ln 2 \approx 0.693 \cdot \frac{m}{n}$$

Optimal bit array size:

$$m = -\frac{n \cdot \ln \varepsilon}{(\ln 2)^2} \approx 1.44 \cdot n \cdot \log_2\frac{1}{\varepsilon}$$

import math

def bloom_filter_params(n: int, fp_rate: float):
    """Given element count and target FP rate, compute optimal parameters"""
    # Optimal m
    m = -n * math.log(fp_rate) / (math.log(2) ** 2)
    m = int(math.ceil(m))
    
    # Optimal k
    k = (m / n) * math.log(2)
    k = int(math.ceil(k))
    
    # Bits per element
    bits_per_element = m / n
    
    print(f"Expected elements: {n:,}")
    print(f"Target FP rate: {fp_rate}")
    print(f"Bit array size: {m:,} bits = {m/8/1024/1024:.2f} MB")
    print(f"Hash functions: {k}")
    print(f"Bits per element: {bits_per_element:.1f} bits/element")
    return m, k

# Parameter comparison at different FP rates
print("=== 1% false positive rate ===")
bloom_filter_params(10_000_000, 0.01)
# m=95,850,584 bits=11.44 MB, k=7, 9.6 bits/elem

print("\n=== 0.1% false positive rate ===")
bloom_filter_params(10_000_000, 0.001)
# m=143,775,877 bits=17.15 MB, k=10, 14.4 bits/elem

print("\n=== 0.01% false positive rate ===")
bloom_filter_params(10_000_000, 0.0001)
# m=191,701,169 bits=22.87 MB, k=14, 19.2 bits/elem

Practical rule: Every 10x reduction in false positive rate costs approximately 4.8 additional bits per element (i.e., m/n increases by about 4.8).

9.4 Typical Application Scenarios

Scenario 1: Cache Penetration Protection

"""
Problem: Malicious requests query large numbers of non-existent keys,
all requests penetrate the cache and hit the database directly.
Solution: Add a Bloom filter before the cache layer;
keys not in the filter are rejected immediately.
"""

class CacheWithBloomFilter:
    def __init__(self, all_valid_keys):
        # Add all valid keys to the Bloom filter
        self.bloom = BloomFilter(
            expected_items=len(all_valid_keys),
            false_positive_rate=0.001  # 0.1% FP rate
        )
        for key in all_valid_keys:
            self.bloom.add(key)
        
        self.cache = {}  # Simulated cache layer
        self.db = {}     # Simulated database layer
    
    def get(self, key: str):
        # Layer 1: Bloom filter interception
        if key not in self.bloom:
            return None  # Definitely doesn't exist, return immediately
        
        # Layer 2: Check cache
        if key in self.cache:
            return self.cache[key]
        
        # Layer 3: Check database (very few requests reach here)
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value
        return value

Effect: Assuming malicious requests all query non-existent keys, the Bloom filter blocks 99.9% of them (0.1% FP rate means only one in a thousand fake requests penetrates to the database).

Scenario 2: Web Crawler URL Deduplication

"""
1 billion URL deduplication:
- HashSet approach: ~100GB memory (infeasible)
- Database approach: too slow
- Bloom filter: ~1.2GB memory (10 bits/URL, 0.1% FP rate)
"""

class WebCrawler:
    def __init__(self, expected_urls=1_000_000_000):
        self.seen_urls = BloomFilter(
            expected_items=expected_urls,
            false_positive_rate=0.001
        )
    
    def should_crawl(self, url: str) -> bool:
        """Determine whether this URL should be crawled"""
        if url in self.seen_urls:
            return False  # Possibly already crawled, skip
        
        # Mark as crawled
        self.seen_urls.add(url)
        return True
    
    def crawl(self, seed_urls: list):
        queue = list(seed_urls)
        while queue:
            url = queue.pop(0)
            if self.should_crawl(url):
                # Fetch page and extract new links
                new_links = self._fetch_and_extract(url)
                queue.extend(new_links)

Trade-off: A 0.1% FP rate means approximately 1 million URLs will be incorrectly skipped (1 billion x 0.1%). For a large-scale crawler, this is perfectly acceptable.

Scenario 3: Spam Filtering

"""
Google Chrome originally used Bloom filters to store malicious URL blacklists.
Benefit: No need to download the complete list of malicious URLs to the clientโ€”
a compact Bloom filter (a few MB) can cover millions of malicious URLs.
"""

class SpamFilter:
    def __init__(self, known_spam_addresses):
        self.spam_bloom = BloomFilter(
            expected_items=len(known_spam_addresses),
            false_positive_rate=0.0001  # 0.01% FP rate
        )
        for addr in known_spam_addresses:
            self.spam_bloom.add(addr.lower())
    
    def is_spam(self, sender: str) -> str:
        """
        Returns: "definitely_not_spam" or "possibly_spam"
        """
        if sender.lower() not in self.spam_bloom:
            return "definitely_not_spam"
        else:
            # Possibly spam, needs further inspection
            return "possibly_spam"

9.5 Limitations of Bloom Filters

  1. Cannot delete elements: Setting a bit back from 1 to 0 may affect other elements' queries (multiple elements share bits).
  2. Cannot enumerate inserted elements: The Bloom filter stores only hash information; original data is lost.
  3. Fixed capacity: Must determine expected element count at creation time; exceeding it causes rapid FP rate degradation.
  4. Only supports membership queries: Cannot store associated values (unlike hash tables which store key-value pairs).

Level 2 - How It Works Under the Hood

9.6 Mathematical Derivation of Bloom Filter False Positive Rate

Let us derive the false positive rate formula from first principles.

Assumption: k hash functions map elements uniformly at random into a bit array of m bits.

Step 1: When inserting one element, the probability that a specific bit is not set to 1 by any single hash function is:

$$P(\text{bit is 0 after 1 hash}) = 1 - \frac{1}{m}$$

None of k hash functions set that bit:

$$P(\text{bit is 0 after 1 element}) = \left(1 - \frac{1}{m}\right)^k$$

Step 2: After inserting n elements, the probability that the bit remains 0:

$$P(\text{bit is 0 after n elements}) = \left(1 - \frac{1}{m}\right)^{kn}$$

Using the limit $(1-1/m)^m \to e^{-1}$ (for large m):

$$P(\text{bit is 0}) \approx e^{-kn/m}$$

Step 3: The probability that the bit is 1:

$$P(\text{bit is 1}) = 1 - e^{-kn/m}$$

Step 4: A false positive occurs when all k hash positions are 1. Assuming independence between bits (this is an approximation, but Mitzenmacher & Upfal 2005 proved it is extremely accurate in practice):

$$\varepsilon = \left(1 - e^{-kn/m}\right)^k$$

Deriving the optimal k:

Taking the logarithm of epsilon:

$$\ln \varepsilon = k \cdot \ln\left(1 - e^{-kn/m}\right)$$

Let $p = e^{-kn/m}$ (the probability that a bit is 0), then:

$$\ln \varepsilon = k \cdot \ln(1-p)$$

To minimize epsilon, we minimize $k \cdot \ln(1-p)$. Taking the derivative with respect to k and setting it to zero, we find that the minimum occurs when $p = 1/2$ (i.e., each bit is set to 1 with probability exactly 1/2).

From $p = e^{-kn/m} = 1/2$, we solve:

$$k = \frac{m}{n} \cdot \ln 2$$

At this optimum, the false positive rate is:

$$\varepsilon_{min} = (1/2)^k = (1/2)^{(m/n)\ln 2} = 2^{-(m/n)\ln 2}$$

Inversely, given a target false positive rate epsilon, the required bit array size is:

$$m = -\frac{n \ln \varepsilon}{(\ln 2)^2} = \frac{n \ln(1/\varepsilon)}{(\ln 2)^2} \approx 1.44 \cdot n \cdot \log_2\frac{1}{\varepsilon}$$

import math

def verify_formula(m, n, k):
    """Verify consistency between theoretical formula and actual false positive rate"""
    # Theoretical false positive rate
    theoretical_fp = (1 - math.exp(-k * n / m)) ** k
    
    # Simulation experiment
    import random
    bit_array = [0] * m
    
    # Insert n random elements
    for _ in range(n):
        for _ in range(k):
            pos = random.randint(0, m - 1)
            bit_array[pos] = 1
    
    # Test 10000 non-existent elements
    false_positives = 0
    tests = 10000
    for _ in range(tests):
        if all(bit_array[random.randint(0, m - 1)] == 1 for _ in range(k)):
            false_positives += 1
    
    actual_fp = false_positives / tests
    print(f"Theoretical FP rate: {theoretical_fp:.6f}")
    print(f"Measured FP rate: {actual_fp:.6f}")
    print(f"Relative error: {abs(theoretical_fp - actual_fp) / theoretical_fp * 100:.1f}%")

# m=10000, n=1000, k=7 (near optimal)
verify_formula(10000, 1000, 7)

9.7 Count-Min Sketch: Frequency Estimation

Bloom filters answer "does element exist?" But what if we want to ask "how many times has this element appeared?" For example: access frequency of each IP in network traffic, or word frequency in text.

Count-Min Sketch (Cormode & Muthukrishnan, 2005, "An Improved Data Stream Summary: The Count-Min Sketch and its Applications") solves exactly this problem.

Core Idea

Count-Min Sketch is a two-dimensional counter array (d rows x w columns) paired with d hash functions.

import mmh3
import math


class CountMinSketch:
    """
    Count-Min Sketch: probabilistic data structure for frequency estimation
    
    Guarantee: estimated value >= true value (only overestimates, never underestimates)
    Error bound: P(estimate - true_value > epsilon * N) < delta
    where N is the total frequency sum of all elements
    
    Parameters:
        epsilon: error parameter, estimation error won't exceed epsilon * N
        delta: failure probability, probability of error exceeding epsilon * N is less than delta
    """
    
    def __init__(self, epsilon: float = 0.001, delta: float = 0.01):
        # w = ceil(e/epsilon), d = ceil(ln(1/delta))
        self.w = int(math.ceil(math.e / epsilon))
        self.d = int(math.ceil(math.log(1.0 / delta)))
        self.table = [[0] * self.w for _ in range(self.d)]
        self.total_count = 0
    
    def _hash(self, item: str, row: int) -> int:
        """Hash function for row `row`"""
        return mmh3.hash(item, seed=row) % self.w
    
    def add(self, item: str, count: int = 1):
        """Increment an element's count"""
        self.total_count += count
        for row in range(self.d):
            col = self._hash(item, row)
            self.table[row][col] += count
    
    def estimate(self, item: str) -> int:
        """
        Estimate an element's frequency.
        Return value >= true frequency (collisions only increase counts).
        Taking the minimum across rows minimizes overestimation.
        """
        min_count = float('inf')
        for row in range(self.d):
            col = self._hash(item, row)
            min_count = min(min_count, self.table[row][col])
        return min_count
    
    @property
    def memory_usage_bytes(self) -> int:
        """Estimated memory usage (4 bytes per counter)"""
        return self.w * self.d * 4


# Example: tracking website visit frequency
cms = CountMinSketch(epsilon=0.001, delta=0.01)
print(f"Table size: {cms.d} x {cms.w} = {cms.d * cms.w:,} counters")
print(f"Memory usage: {cms.memory_usage_bytes / 1024:.1f} KB")

# Simulate access logs
import random
access_log = []
# Simulate power-law distribution: a few IPs with many visits
for i in range(100):
    freq = int(10000 / (i + 1))  # Zipf distribution
    access_log.extend([f"192.168.1.{i}"] * freq)

random.shuffle(access_log)

# Count using CMS
exact_counts = {}
for ip in access_log:
    cms.add(ip)
    exact_counts[ip] = exact_counts.get(ip, 0) + 1

# Verify accuracy
for ip in ["192.168.1.0", "192.168.1.9", "192.168.1.99"]:
    exact = exact_counts.get(ip, 0)
    estimated = cms.estimate(ip)
    error = estimated - exact
    print(f"{ip}: exact={exact}, estimated={estimated}, error={error}")

Why Take the Minimum?

Each hash function may have collisions (other elements mapping to the same column), and collisions only increase counts. Taking the minimum of d rows selects the row with the fewest collisions, effectively reducing overestimation.

Error guarantee (from Cormode & Muthukrishnan 2005):

$$P\left[\hat{f}(x) - f(x) > \varepsilon \cdot N\right] < \delta$$

Where $\hat{f}(x)$ is the estimated frequency, $f(x)$ is the true frequency, and $N$ is the total element count.

9.8 HyperLogLog: Cardinality Estimation

"How many distinct visitors did this website have today?" This is a cardinality estimation problem. Exact computation requires storing all unique visitor IDs (potentially tens of GB). HyperLogLog uses only about 12KB to estimate cardinalities in the hundreds of millions, with a standard error of approximately 0.81%.

Fundamental Principle

HyperLogLog (Flajolet, Fusy, Gandouet & Meunier, 2007, "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm") is based on an elegant probabilistic observation:

If you flip a coin, the probability of seeing k consecutive heads is $1/2^k$. If the longest run of consecutive heads you observe across many experiments is k, then you've performed approximately $2^k$ experiments.

Applying this to hash values: hash elements into uniformly distributed bit strings, then observe the maximum number of leading zeros to estimate the number of distinct elements.

import mmh3
import math


class HyperLogLog:
    """
    HyperLogLog cardinality estimation algorithm
    
    Uses 2^p registers (buckets), each storing the maximum observed leading zeros + 1.
    Standard error โ‰ˆ 1.04 / sqrt(2^p)
    
    Parameters:
        p: precision parameter (4 <= p <= 16), uses 2^p buckets
           p=14 gives 16384 buckets, standard error โ‰ˆ 0.81%
    """
    
    def __init__(self, p: int = 14):
        assert 4 <= p <= 16, "p must be between 4 and 16"
        self.p = p
        self.m = 1 << p  # Number of buckets = 2^p
        self.registers = [0] * self.m
        
        # alpha_m correction factor
        if self.m == 16:
            self.alpha = 0.673
        elif self.m == 32:
            self.alpha = 0.697
        elif self.m == 64:
            self.alpha = 0.709
        else:
            self.alpha = 0.7213 / (1 + 1.079 / self.m)
    
    def _hash(self, item: str) -> int:
        """Return 64-bit hash value"""
        h = mmh3.hash128(item, signed=False)
        return h & ((1 << 64) - 1)
    
    def _leading_zeros(self, hash_value: int, start_bit: int) -> int:
        """Count leading zeros starting from start_bit, plus 1"""
        count = 1
        for i in range(start_bit, 64):
            if hash_value & (1 << (63 - i)):
                break
            count += 1
        return count
    
    def add(self, item: str):
        """Add an element"""
        h = self._hash(item)
        # First p bits determine bucket index
        bucket_idx = h >> (64 - self.p)
        # Remaining bits used for leading zeros
        w = self._leading_zeros(h, self.p)
        # Keep the maximum
        self.registers[bucket_idx] = max(self.registers[bucket_idx], w)
    
    def estimate(self) -> int:
        """Estimate cardinality"""
        # Harmonic mean
        indicator = sum(2.0 ** (-r) for r in self.registers)
        raw_estimate = self.alpha * self.m * self.m / indicator
        
        # Small cardinality correction
        if raw_estimate <= 2.5 * self.m:
            zeros = self.registers.count(0)
            if zeros > 0:
                # Linear Counting
                raw_estimate = self.m * math.log(self.m / zeros)
        
        # Large cardinality correction (when exceeding 2^32 / 30)
        elif raw_estimate > (1 << 32) / 30.0:
            raw_estimate = -(1 << 32) * math.log(1 - raw_estimate / (1 << 32))
        
        return int(raw_estimate)
    
    @property
    def memory_usage_bytes(self) -> int:
        """Each register needs 6 bits (max value 64), simplified to 1 byte here"""
        return self.m
    
    def merge(self, other: 'HyperLogLog'):
        """Merge two HLLs (take max per bucket), supports distributed computation"""
        assert self.p == other.p, "Cannot merge HLLs with different precision"
        for i in range(self.m):
            self.registers[i] = max(self.registers[i], other.registers[i])


# Demonstration
hll = HyperLogLog(p=14)
print(f"Number of buckets: {hll.m:,}")
print(f"Memory usage: {hll.memory_usage_bytes:,} bytes โ‰ˆ {hll.memory_usage_bytes/1024:.1f} KB")
print(f"Theoretical std error: {1.04 / math.sqrt(hll.m):.4f} = {1.04 / math.sqrt(hll.m) * 100:.2f}%")

# Add different numbers of elements and check accuracy
import random
for true_cardinality in [1000, 10000, 100000, 1000000]:
    hll = HyperLogLog(p=14)
    for i in range(true_cardinality):
        hll.add(f"user_{i}")
    
    estimated = hll.estimate()
    error = abs(estimated - true_cardinality) / true_cardinality * 100
    print(f"True cardinality: {true_cardinality:>10,} | "
          f"Estimated: {estimated:>10,} | "
          f"Error: {error:.2f}%")

HyperLogLog in Redis

Redis natively supports HyperLogLog using a fixed 12KB memory footprint:

import redis

r = redis.Redis()

# PFADD: Add elements to HLL
r.pfadd("visitors:2024-01-15", "user_1", "user_2", "user_3")
r.pfadd("visitors:2024-01-15", "user_2", "user_4")  # user_2 is duplicate

# PFCOUNT: Estimate cardinality
count = r.pfcount("visitors:2024-01-15")
print(f"Today's unique visitors: {count}")  # Output: 4

# PFMERGE: Merge multiple HLLs (union cardinality)
r.pfmerge("visitors:week", "visitors:2024-01-15", "visitors:2024-01-16")
weekly_count = r.pfcount("visitors:week")
print(f"This week's unique visitors: {weekly_count}")

Redis HLL implementation details (Antirez, Redis source code):

9.9 Cuckoo Filter: Bloom Filter with Deletion Support

The biggest pain point of Bloom filters is the inability to delete elements. Cuckoo Filter (Fan, Andersen, Kaminsky & Mitzenmacher, 2014, "Cuckoo Filter: Practically Better Than Bloom") solves this problem.

Core Idea

The Cuckoo Filter is named after Cuckoo Hashing:

  1. Each bucket can store multiple fingerprints (short fragments of element hash values).
  2. Each element has two candidate bucket positions.
  3. During insertion, if both positions are full, randomly evict an existing fingerprint to its alternate position (like a cuckoo bird pushing others from the nest).
  4. Deletion simply removes the corresponding fingerprint from the bucket.
import mmh3
import random


class CuckooFilter:
    """
    Cuckoo Filter: probabilistic set data structure with deletion support
    
    Advantages over Bloom filters:
    1. Supports deletion
    2. Faster queries (only check 2 positions)
    3. Better space efficiency at the same FP rate (when epsilon < 3%)
    """
    
    def __init__(self, capacity: int, bucket_size: int = 4, 
                 fingerprint_bits: int = 8, max_kicks: int = 500):
        """
        Parameters:
            capacity: number of buckets
            bucket_size: number of fingerprints per bucket
            fingerprint_bits: bits per fingerprint (determines FP rate โ‰ˆ 2*bucket_size/2^fingerprint_bits)
            max_kicks: maximum number of relocations
        """
        self.capacity = capacity
        self.bucket_size = bucket_size
        self.fingerprint_bits = fingerprint_bits
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(capacity)]
        self.count = 0
    
    def _fingerprint(self, item: str) -> int:
        """Compute element fingerprint"""
        h = mmh3.hash(item, seed=0) & ((1 << self.fingerprint_bits) - 1)
        # Fingerprint cannot be 0 (0 indicates empty slot)
        return h if h != 0 else 1
    
    def _hash1(self, item: str) -> int:
        """First bucket position"""
        return mmh3.hash(item, seed=42) % self.capacity
    
    def _hash2(self, index: int, fingerprint: int) -> int:
        """
        Second bucket position: partial-key cuckoo hashing
        i2 = i1 XOR hash(fingerprint)
        
        This design ensures: knowing either position and the fingerprint,
        the other position can be computed.
        """
        fp_hash = mmh3.hash(str(fingerprint), seed=100) % self.capacity
        return (index ^ fp_hash) % self.capacity
    
    def insert(self, item: str) -> bool:
        """
        Insert an element. Returns True on success, False if filter is full.
        """
        fp = self._fingerprint(item)
        i1 = self._hash1(item)
        i2 = self._hash2(i1, fp)
        
        # Try both buckets
        if len(self.buckets[i1]) < self.bucket_size:
            self.buckets[i1].append(fp)
            self.count += 1
            return True
        if len(self.buckets[i2]) < self.bucket_size:
            self.buckets[i2].append(fp)
            self.count += 1
            return True
        
        # Both buckets full, start kicking
        idx = random.choice([i1, i2])
        for _ in range(self.max_kicks):
            # Randomly select a fingerprint to evict
            victim_idx = random.randint(0, self.bucket_size - 1)
            fp, self.buckets[idx][victim_idx] = self.buckets[idx][victim_idx], fp
            
            # Evicted fingerprint goes to its alternate position
            idx = self._hash2(idx, fp)
            if len(self.buckets[idx]) < self.bucket_size:
                self.buckets[idx].append(fp)
                self.count += 1
                return True
        
        # Exceeded max kicks, consider filter full
        return False
    
    def __contains__(self, item: str) -> bool:
        """Query whether element possibly exists"""
        fp = self._fingerprint(item)
        i1 = self._hash1(item)
        i2 = self._hash2(i1, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]
    
    def delete(self, item: str) -> bool:
        """
        Delete an element. Returns True on successful deletion.
        
        Warning: Only elements that were actually inserted should be deleted.
        Deleting a never-inserted element may cause false negatives!
        """
        fp = self._fingerprint(item)
        i1 = self._hash1(item)
        i2 = self._hash2(i1, fp)
        
        if fp in self.buckets[i1]:
            self.buckets[i1].remove(fp)
            self.count -= 1
            return True
        if fp in self.buckets[i2]:
            self.buckets[i2].remove(fp)
            self.count -= 1
            return True
        return False


# Demonstrate deletion capability
cf = CuckooFilter(capacity=10000, bucket_size=4, fingerprint_bits=12)

# Insert
for i in range(5000):
    cf.insert(f"item_{i}")

print(f"After insert: 'item_42' in cf = {'item_42' in cf}")  # True

# Delete
cf.delete("item_42")
print(f"After delete: 'item_42' in cf = {'item_42' in cf}")  # False

# False positive rate test
false_positives = 0
tests = 10000
for i in range(tests):
    if f"not_exist_{i}" in cf:
        false_positives += 1
print(f"False positive rate: {false_positives/tests:.4f}")
# Theoretical FP rate โ‰ˆ 2 * bucket_size / 2^fingerprint_bits = 8/4096 โ‰ˆ 0.002

Cuckoo Filter vs Bloom Filter Comparison

Feature Bloom Filter Cuckoo Filter
Deletion Not supported Supported
Query speed Check k positions Only check 2 positions
Space efficiency (epsilon < 3%) Worse Better
Space efficiency (epsilon > 3%) Better Worse
Insertion (at full load) Always O(1) May require kicks
Implementation complexity Simple Moderate

Level 3 - What the Specifications Define

9.10 Burton Bloom's 1970 Original Paper

Burton Howard Bloom published "Space/Time Trade-offs in Hash Coding with Allowable Errors" in Communications of the ACM, Vol. 13, No. 7, July 1970, pp. 422-426. This 5-page paper defined the entire field.

Original Motivation

Bloom's original problem was: how to determine whether an English word is in a dictionary (for spell checking or hyphenation point lookup) with limited memory? Computers of that era had extremely limited RAM, and a complete dictionary required too much space.

Bloom's key insight: If we allow a small fraction of false positives (judging words not in the dictionary as being in the dictionary), we can solve the membership problem using far less space than the original data requires.

Core Contributions of the Paper

  1. Proposed the bit vector + multiple hashes architecture: This design remains fundamentally unchanged over 50 years later.
  2. Analyzed optimal parameter selection: Provided the false positive rate formula and optimal hash function count.
  3. Articulated the fundamental trade-off: More memory leads to lower FP rate; more hash functions leads to slower speed but lower FP rate (up to an optimum, after which FP rate actually increases).

Historical Impact

Bloom Filters were invented to save disk accesses (disk seeks were extremely slow in that era), and later became widely applied in:

9.11 Information-Theoretic Lower Bound

Why does a Bloom filter need $1.44 \cdot n \cdot \log_2(1/\varepsilon)$ bits? This is not coincidenceโ€”it is close to the information-theoretic limit.

Information-Theoretic Analysis

Consider a data structure that answers "is element x in set S?" The set S has n elements drawn from a universe of size U. What is the information-theoretic lower bound for this data structure?

If we allow false positive rate epsilon, then for any element not in S, the data structure must distinguish "in S" from "not in S," allowing epsilon error rate.

Carter, Floyd, Gill, Markowsky & Wegman (1978) proved:

Theorem: Any data structure supporting set membership queries with false positive rate at most epsilon requires at least $n \cdot \log_2(1/\varepsilon)$ bits.

A Bloom filter uses $1.44 \cdot n \cdot \log_2(1/\varepsilon)$ bitsโ€”44% more than the lower bound. This 1.44 factor comes from:

$$\frac{1}{(\ln 2)^2} = \frac{1}{0.4805} \approx 2.0814$$

And $\frac{m}{n \cdot \log_2(1/\varepsilon)} = \frac{-\ln\varepsilon / (\ln 2)^2}{\log_2(1/\varepsilon)} = \frac{\ln(1/\varepsilon) / (\ln 2)^2}{\ln(1/\varepsilon) / \ln 2} = \frac{1}{\ln 2} \approx 1.44$

In other words, Bloom filter space efficiency is 1/1.44 โ‰ˆ 69.3% of the information-theoretic optimum.

More Space-Efficient Data Structures

Are there data structures that achieve the information-theoretic lower bound? Yes:

# Comparing space efficiency of different data structures
import math

def space_comparison(n, epsilon):
    """Compare theoretical space requirements of different data structures"""
    info_lower_bound = n * math.log2(1/epsilon)  # Information-theoretic lower bound
    bloom_space = 1.44 * n * math.log2(1/epsilon)  # Standard Bloom filter
    xor_space = 1.23 * n * math.log2(1/epsilon)    # Xor Filter
    
    print(f"n={n:,}, epsilon={epsilon}")
    print(f"  Info-theoretic lower bound: {info_lower_bound/8/1024/1024:.2f} MB (100%)")
    print(f"  Bloom Filter:               {bloom_space/8/1024/1024:.2f} MB ({bloom_space/info_lower_bound*100:.1f}%)")
    print(f"  Xor Filter:                 {xor_space/8/1024/1024:.2f} MB ({xor_space/info_lower_bound*100:.1f}%)")
    print()

space_comparison(10_000_000, 0.01)
space_comparison(10_000_000, 0.001)
space_comparison(1_000_000_000, 0.01)

9.12 Theoretical Guarantees of Count-Min Sketch

Cormode & Muthukrishnan's 2005 paper "An Improved Data Stream Summary: The Count-Min Sketch and its Applications" provides rigorous theoretical guarantees.

Parameter Selection

Given error parameter epsilon and failure probability delta:

Error Guarantee

For any element x:

$$f(x) \leq \hat{f}(x) \leq f(x) + \varepsilon \cdot N$$

Where $\hat{f}(x)$ is the estimated frequency, $f(x)$ is the true frequency, and $N$ is the total frequency of all elements. This guarantee holds with probability at least $1-\delta$.

Why One-Sided Error?

Count-Min Sketch only overestimates frequencies, never underestimates. This is because collisions can only increase counter values. This property is extremely useful in many applicationsโ€”for example, if CMS says an IP's access frequency doesn't exceed a threshold, then it definitely doesn't.

The Heavy Hitter Problem

A classic application of Count-Min Sketch is finding Heavy Hitters: elements whose frequency exceeds a threshold in a data stream.

class HeavyHitterDetector:
    """
    Detect heavy hitters (elements with frequency exceeding phi * N) using Count-Min Sketch
    """
    
    def __init__(self, phi: float = 0.01, epsilon: float = 0.001, delta: float = 0.01):
        """
        Parameters:
            phi: heavy hitter threshold (elements with frequency exceeding phi * N)
            epsilon: CMS error parameter (should be much smaller than phi)
            delta: failure probability
        """
        self.phi = phi
        self.cms = CountMinSketch(epsilon=epsilon, delta=delta)
        self.candidates = set()  # Candidate heavy hitters
    
    def add(self, item: str):
        self.cms.add(item)
        # If estimated frequency exceeds threshold, add to candidate set
        threshold = self.phi * self.cms.total_count
        if self.cms.estimate(item) >= threshold:
            self.candidates.add(item)
    
    def get_heavy_hitters(self) -> list:
        """Return all possible heavy hitters with their estimated frequencies"""
        threshold = self.phi * self.cms.total_count
        results = []
        for item in self.candidates:
            est = self.cms.estimate(item)
            if est >= threshold:
                results.append((item, est))
        return sorted(results, key=lambda x: -x[1])

9.13 Theoretical Foundations of HyperLogLog

From Flajolet-Martin to HyperLogLog

The evolution of cardinality estimation algorithms:

  1. Flajolet-Martin (1985): "Probabilistic Counting Algorithms for Data Base Applications." Uses bit pattern observations; a single estimator has high variance.
  2. LogLog (2003): Durand & Flajolet. Uses m buckets with arithmetic mean, O(log log n) space per bucket.
  3. SuperLogLog (2003): Truncates outliers before averaging.
  4. HyperLogLog (2007): Flajolet, Fusy, Gandouet & Meunier. Uses harmonic mean instead of arithmetic mean, improving accuracy to standard error $1.04/\sqrt{m}$.

Why Harmonic Mean Instead of Arithmetic Mean?

Given m bucket values $M_1, M_2, \ldots, M_m$:

Correction Factor alpha_m

The correction factor $\alpha_m$ eliminates estimator bias:

$$\alpha_m = \left(m \int_0^\infty \left(\log_2\frac{2+u}{1+u}\right)^m du\right)^{-1}$$

For large m, $\alpha_m \to 1/(2\ln 2) \approx 0.7213$.

In Redis with p=14, $\alpha_{16384}$ evaluates to 0.7213/(1+1.079/16384) โ‰ˆ 0.72125.

9.14 Redis HyperLogLog Implementation

Redis's HyperLogLog implementation (written by Salvatore Sanfilippo, a.k.a. Antirez) has several important engineering optimizations:

Sparse vs Dense Representation

Redis uses two encoding schemes:

  1. Sparse encoding (small cardinalities): Uses run-length encoding (RLE) to compress consecutive zero-value registers. When most registers are 0, sparse representation is far smaller than 12KB.

  2. Dense encoding (large cardinalities): 16384 registers, each 6 bits, totaling 12,288 bytes. Automatically converts from sparse to dense when sparse encoding exceeds a threshold (server.hll_sparse_max_bytes, default 3000 bytes) or register values exceed 32.

# Simplified model of Redis HLL sparse encoding
class RedisHLLSparse:
    """
    Simplified model of Redis HLL sparse encoding
    
    Three opcodes:
    - ZERO(len): len consecutive zero-value registers (1 byte, len <= 64)
    - XZERO(len): longer consecutive zero-value registers (2 bytes, len <= 16384)
    - VAL(value, len): len consecutive same non-zero value registers (1 byte, value <= 32, len <= 4)
    """
    
    def __init__(self):
        # Initial state: 16384 zero-value registers = one XZERO(16384) opcode
        self.encoding = [("XZERO", 16384)]
        self.is_dense = False
    
    def estimated_memory(self):
        """Estimate current memory usage"""
        total = 0
        for op in self.encoding:
            if op[0] == "ZERO":
                total += 1
            elif op[0] == "XZERO":
                total += 2
            elif op[0] == "VAL":
                total += 1
        return total

Bias Correction

Redis implements the improved correction strategy proposed by Google (Heule, Nunkesser & Hall, 2013, "HyperLogLog in Practice"):


Level 4 - Edge Cases and Pitfalls

9.15 Deep Analysis of Why Bloom Filters Cannot Delete

Why deletion is impossible:

Consider two elements A and B whose hash positions overlap:

Position 7 is shared. If we delete A (clear positions 3, 7, 11 to zero), clearing position 7 causes queries for B to return Falseโ€”B actually exists, but we say it doesn't. This introduces false negatives, destroying the Bloom filter's core guarantee of "if it says not present, it's definitely not present."

Solution Comparison

Approach Mechanism Pros Cons
Counting Bloom Filter Replace each bit with counter Simple and intuitive 3-4x more space
Cuckoo Filter Fingerprints + cuckoo hashing Good space efficiency More complex implementation
Quotient Filter Quotient-remainder storage Cache-friendly Clustering issues
Rebuild Periodically rebuild from source No extra space Rebuild overhead
class CountingBloomFilter:
    """
    Counting Bloom Filter: each bit becomes a 4-bit counter
    
    Supports deletion, but space overhead is 4x that of standard Bloom filter.
    Counter overflow (>15) stops incrementing (saturating count), which could
    cause false negatives but is extremely rare in practice.
    """
    
    def __init__(self, expected_items: int, false_positive_rate: float = 0.01):
        import math
        self.m = int(math.ceil(-expected_items * math.log(false_positive_rate) 
                               / (math.log(2) ** 2)))
        self.k = int(math.ceil((self.m / expected_items) * math.log(2)))
        # 4-bit counters, 2 counters share 1 byte
        self.counters = bytearray(math.ceil(self.m / 2))
    
    def _get_counter(self, pos: int) -> int:
        byte_idx = pos // 2
        if pos % 2 == 0:
            return self.counters[byte_idx] & 0x0F
        else:
            return (self.counters[byte_idx] >> 4) & 0x0F
    
    def _set_counter(self, pos: int, value: int):
        byte_idx = pos // 2
        if pos % 2 == 0:
            self.counters[byte_idx] = (self.counters[byte_idx] & 0xF0) | (value & 0x0F)
        else:
            self.counters[byte_idx] = (self.counters[byte_idx] & 0x0F) | ((value & 0x0F) << 4)
    
    def _hash_positions(self, item: str) -> list:
        import mmh3
        h1 = mmh3.hash(item, seed=0) % self.m
        h2 = mmh3.hash(item, seed=42) % self.m
        return [(h1 + i * h2) % self.m for i in range(self.k)]
    
    def add(self, item: str):
        for pos in self._hash_positions(item):
            val = self._get_counter(pos)
            if val < 15:  # Saturating count, prevent overflow
                self._set_counter(pos, val + 1)
    
    def remove(self, item: str) -> bool:
        """Delete element. Returns False if element doesn't exist."""
        positions = self._hash_positions(item)
        # First check existence
        if not all(self._get_counter(pos) > 0 for pos in positions):
            return False
        # Decrement all counters
        for pos in positions:
            val = self._get_counter(pos)
            if val < 15:  # Don't decrement saturated counters (safety measure)
                self._set_counter(pos, val - 1)
        return True
    
    def __contains__(self, item: str) -> bool:
        return all(self._get_counter(pos) > 0 for pos in self._hash_positions(item))

9.16 Interview Question: Design a URL Deduplication System

Problem: Design a system that determines whether a URL about to be crawled has already been fetched. Expected URL volume: 10 billion. Requirements: low latency, scalable.

Analysis Framework

Step 1: Clarify Requirements

Step 2: Approach Comparison

import math

def memory_analysis():
    """Memory analysis for different approaches"""
    n = 10_000_000_000  # 10 billion URLs
    
    # Approach 1: HashSet
    # Each URL hash 64-bit + overhead, approximately 16 bytes/element
    hashset_mem = n * 16 / (1024**3)
    print(f"HashSet: {hashset_mem:.0f} GB")
    
    # Approach 2: Bloom filter (0.1% FP rate)
    bloom_bits = -n * math.log(0.001) / (math.log(2)**2)
    bloom_mem = bloom_bits / 8 / (1024**3)
    print(f"Bloom Filter (0.1% FP): {bloom_mem:.1f} GB")
    
    # Approach 3: Bloom filter (1% FP rate)
    bloom_bits_1 = -n * math.log(0.01) / (math.log(2)**2)
    bloom_mem_1 = bloom_bits_1 / 8 / (1024**3)
    print(f"Bloom Filter (1% FP): {bloom_mem_1:.1f} GB")

memory_analysis()
# HashSet: 149 GB
# Bloom Filter (0.1% FP): 17.9 GB
# Bloom Filter (1% FP): 11.9 GB

Step 3: Distributed Architecture

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚           URL Deduplication System Architecture       โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                     โ”‚
โ”‚   New URL โ”€โ”€โ†’ Hash(URL) % N โ”€โ”€โ†’ Route to node       โ”‚
โ”‚                                                     โ”‚
โ”‚   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”        โ”‚
โ”‚   โ”‚  Node 0  โ”‚  โ”‚  Node 1  โ”‚  โ”‚  Node N  โ”‚        โ”‚
โ”‚   โ”‚  BF 2GB  โ”‚  โ”‚  BF 2GB  โ”‚  โ”‚  BF 2GB  โ”‚        โ”‚
โ”‚   โ”‚(1.25B URL)โ”‚  โ”‚(1.25B URL)โ”‚  โ”‚(1.25B URL)โ”‚       โ”‚
โ”‚   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜        โ”‚
โ”‚                                                     โ”‚
โ”‚   Persistence: Periodically write bit arrays to     โ”‚
โ”‚                disk/object storage                   โ”‚
โ”‚   Recovery: Load bit arrays from persistence        โ”‚
โ”‚             on startup                              โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Step 4: Complete Answer Framework

"""
Interview Answer Framework:

1. Capacity Estimation
   - 10 billion URLs, 0.1% FP rate
   - Bloom filter: approximately 18 GB
   - Split across 8 nodes, approximately 2.2 GB each

2. Sharding Strategy
   - Hash URL modulo N to determine target node
   - Consistent hashing (if nodes need dynamic scaling)

3. Persistence
   - Periodic checkpoint of bit arrays to disk
   - Incremental log: record each insert operation
   - Recovery: load checkpoint + replay incremental log

4. Fault Tolerance
   - Primary-replica replication: one replica per node
   - Recover from checkpoint after bit array loss

5. Scaling
   - Approach A: Layered Bloom filters (old data in BF1, new in BF2)
   - Approach B: Consistent hashing + migration

6. Monitoring
   - Monitor actual FP rate (sampling verification)
   - Alert when element count approaches design capacity

7. Optimizations
   - Local memory + Redis combination (hot URL local cache)
   - Layered filtering: check local small BF first, then remote large BF on miss
"""

9.17 False Positive Rate and Memory Trade-offs in Engineering

In real systems, Bloom filter parameter selection must consider multiple dimensions:

def engineering_tradeoffs():
    """Parameter selection decision tree for engineering"""
    
    scenarios = [
        {
            "scenario": "CDN URL cache lookup",
            "n": 100_000_000,
            "fp_cost": "One additional origin fetch (low cost)",
            "recommended_fp": 0.01,
            "rationale": "FP only means an extra origin fetch; 1% is perfectly acceptable"
        },
        {
            "scenario": "Malicious URL blacklist",
            "n": 10_000_000,
            "fp_cost": "Legitimate site flagged as malicious (user complaints)",
            "recommended_fp": 0.0001,
            "rationale": "FP = false block, bad UX, need extremely low FP rate"
        },
        {
            "scenario": "Database key existence check",
            "n": 50_000_000,
            "fp_cost": "One unnecessary disk read",
            "recommended_fp": 0.001,
            "rationale": "Disk IO cost is moderate; 0.1% is a good balance"
        },
        {
            "scenario": "Deduplication (crawler/messaging)",
            "n": 1_000_000_000,
            "fp_cost": "Missing some URLs/messages",
            "recommended_fp": 0.001,
            "rationale": "Missing a few URLs is acceptable, but too many affects coverage"
        }
    ]
    
    for s in scenarios:
        m = -s["n"] * math.log(s["recommended_fp"]) / (math.log(2)**2)
        k = int(math.ceil((m/s["n"]) * math.log(2)))
        mem_mb = m / 8 / 1024 / 1024
        
        print(f"Scenario: {s['scenario']}")
        print(f"  Elements: {s['n']:,}")
        print(f"  FP rate: {s['recommended_fp']}")
        print(f"  Memory: {mem_mb:.1f} MB")
        print(f"  Hash functions: {k}")
        print(f"  Rationale: {s['rationale']}")
        print()

engineering_tradeoffs()

Common Pitfalls

Pitfall 1: Exceeding Design Capacity

def capacity_overflow_demo():
    """Demonstrate FP rate degradation when exceeding capacity"""
    bf = BloomFilter(expected_items=10000, false_positive_rate=0.01)
    
    for filled_ratio in [0.5, 1.0, 1.5, 2.0, 3.0]:
        # Recreate and fill
        bf2 = BloomFilter(expected_items=10000, false_positive_rate=0.01)
        actual_items = int(10000 * filled_ratio)
        for i in range(actual_items):
            bf2.add(f"item_{i}")
        
        # Test FP rate
        fp = 0
        tests = 10000
        for i in range(tests):
            if f"not_exist_{i}" in bf2:
                fp += 1
        
        print(f"Fill ratio {filled_ratio:.0%} ({actual_items:,} items): "
              f"measured FP rate = {fp/tests:.4f}")

# Typical output:
# Fill ratio 50%: measured FP rate = 0.0001
# Fill ratio 100%: measured FP rate = 0.0098  (close to design value 1%)
# Fill ratio 150%: measured FP rate = 0.0500  (5%!)
# Fill ratio 200%: measured FP rate = 0.1400  (14%!)
# Fill ratio 300%: measured FP rate = 0.3700  (37%! nearly useless)

Conclusion: At 150% capacity, FP rate jumps from 1% to 5%; at 200%, it reaches 14%. Design capacity must include headroom. Recommended practice: set design capacity to 1.5-2x the expected maximum.

Pitfall 2: Poor Hash Function Quality

"""
Bad example: using simple hash functions
"""

# Bad hash function: non-uniform bit distribution
def bad_hash(item, seed):
    return (hash(item) + seed) % 1000  # hash() + offset is NOT independent hashing!

# Good hash function: double hashing technique
def good_hash(item, i, m):
    import mmh3
    h1 = mmh3.hash(item, seed=0)
    h2 = mmh3.hash(item, seed=42)
    return (h1 + i * h2) % m  # Kirsch-Mitzenmacher (2006)

Why doesn't hash(item) + seed work? Because this is merely a simple offsetโ€”different seeds produce highly correlated hash values (constant difference), violating the Bloom filter's assumption that hash functions are "independent." Kirsch & Mitzenmacher 2006 proved that double hashing $g_i(x) = h_1(x) + i \cdot h_2(x)$ theoretically performs identically to k fully independent hash functions.

Pitfall 3: Thread Safety for Concurrent Writes

import threading

class ThreadSafeBloomFilter:
    """
    Thread-safe Bloom filter
    
    Note: A standard Bloom filter's add operation is not atomic
    (it needs to set k bits, and intermediate states may be observed by other threads).
    However, this doesn't affect correctnessโ€”worst case is a temporary false negative
    that resolves once add completes.
    
    The real problem is concurrent writes to the bit array may cause lost updates
    (read-modify-write is not atomic).
    """
    
    def __init__(self, expected_items, false_positive_rate=0.01):
        self._bf = BloomFilter(expected_items, false_positive_rate)
        self._lock = threading.Lock()
    
    def add(self, item: str):
        # Approach 1: Global lock (simple but poor performance)
        with self._lock:
            self._bf.add(item)
    
    def __contains__(self, item: str) -> bool:
        # Queries don't need locking (read-only)
        # May see partially-updated state (safe for Bloom filters)
        return item in self._bf


# Higher-performance approach: striped locks
class StripedBloomFilter:
    """Use striped locks to reduce contention"""
    
    def __init__(self, expected_items, false_positive_rate=0.01, num_stripes=16):
        self._bf = BloomFilter(expected_items, false_positive_rate)
        self._stripes = [threading.Lock() for _ in range(num_stripes)]
    
    def _get_stripe(self, position: int) -> threading.Lock:
        return self._stripes[position % len(self._stripes)]
    
    def add(self, item: str):
        positions = self._bf._get_hash_positions(item)
        # Acquire locks in order to avoid deadlock
        unique_stripes = sorted(set(p % len(self._stripes) for p in positions))
        for s in unique_stripes:
            self._stripes[s].acquire()
        try:
            self._bf.add(item)
        finally:
            for s in unique_stripes:
                self._stripes[s].release()

9.18 Redis Bloom Filter Module (RedisBloom)

Redis does not natively support Bloom filters, but the RedisBloom module (formerly ReBloom) provides a production-grade implementation.

import redis

# Requires Redis with RedisBloom module loaded
r = redis.Redis()

# BF.RESERVE: Create a Bloom filter
# BF.RESERVE key error_rate capacity [EXPANSION expansion] [NONSCALING]
r.execute_command('BF.RESERVE', 'urls:seen', 0.001, 10000000)
# Creates a Bloom filter with 10M capacity and 0.1% FP rate

# BF.ADD: Add element
r.execute_command('BF.ADD', 'urls:seen', 'https://example.com/page1')

# BF.MADD: Batch add
r.execute_command('BF.MADD', 'urls:seen', 
                  'https://example.com/page2',
                  'https://example.com/page3')

# BF.EXISTS: Query
exists = r.execute_command('BF.EXISTS', 'urls:seen', 'https://example.com/page1')
print(f"exists: {exists}")  # 1 (possibly exists)

not_exists = r.execute_command('BF.EXISTS', 'urls:seen', 'https://never-added.com')
print(f"not_exists: {not_exists}")  # 0 (definitely does not exist)

# BF.MEXISTS: Batch query
results = r.execute_command('BF.MEXISTS', 'urls:seen',
                           'https://example.com/page1',
                           'https://never-added.com')
print(f"batch results: {results}")  # [1, 0]

# BF.INFO: View filter information
info = r.execute_command('BF.INFO', 'urls:seen')
print(f"info: {info}")
# Output includes: Capacity, Size, Number of filters, Number of items inserted, Expansion rate

RedisBloom Auto-Scaling Mechanism

"""
RedisBloom's EXPANSION parameter enables auto-scaling:

When the first sub-filter fills up, automatically creates a second sub-filter
with capacity = original * expansion. Queries must check all sub-filters
(if any returns "exists", the result is "exists").

Costs:
- Slower queries (need to check multiple sub-filters)
- Slightly increased FP rate
- But avoids dramatic FP rate degradation from exceeding capacity

Default expansion=2, meaning capacity doubles on each expansion.
"""

# Create a scalable Bloom filter
r.execute_command('BF.RESERVE', 'scalable:bf', 
                  0.01,        # Initial FP rate
                  1000000,     # Initial capacity
                  'EXPANSION', 2)  # Expansion factor

# If scaling is not needed (fixed capacity, better performance)
r.execute_command('BF.RESERVE', 'fixed:bf',
                  0.01,
                  10000000,
                  'NONSCALING')  # BF.ADD returns error when full

Production Best Practices

"""
Redis Bloom Filter production best practices:

1. Capacity Planning
   - Use 2x peak data volume estimate as capacity
   - Or use EXPANSION for auto-scaling

2. FP Rate Selection
   - Cache penetration: 0.01 (1%)
   - URL deduplication: 0.001 (0.1%)
   - Security blacklist: 0.0001 (0.01%)

3. Persistence
   - RedisBloom data persists with Redis RDB/AOF
   - Large filters' RDB saves may cause latency spikes

4. Memory Estimation
   - BF.DEBUG shows actual memory usage
   - Rule of thumb: n * (-ln(epsilon) / (ln2)^2) / 8 bytes

5. Monitoring Metrics
   - Inserted elements vs design capacity
   - Actual FP rate (periodic sampling tests)
   - Number of sub-filters (expansion count)
"""

class BloomFilterManager:
    """Redis Bloom Filter manager"""
    
    def __init__(self, redis_client, key_prefix="bf:"):
        self.r = redis_client
        self.prefix = key_prefix
    
    def create_if_not_exists(self, name: str, error_rate: float, 
                             capacity: int, expansion: int = 2):
        """Safe creation (idempotent)"""
        key = f"{self.prefix}{name}"
        try:
            self.r.execute_command('BF.INFO', key)
        except redis.ResponseError:
            self.r.execute_command('BF.RESERVE', key, 
                                  error_rate, capacity, 
                                  'EXPANSION', expansion)
    
    def check_health(self, name: str) -> dict:
        """Check filter health status"""
        key = f"{self.prefix}{name}"
        info = self.r.execute_command('BF.INFO', key)
        
        # Parse info (format varies by version)
        info_dict = {}
        for i in range(0, len(info), 2):
            info_dict[info[i].decode()] = info[i+1]
        
        capacity = info_dict.get('Capacity', 0)
        inserted = info_dict.get('Number of items inserted', 0)
        fill_ratio = inserted / capacity if capacity > 0 else 0
        
        status = "healthy"
        if fill_ratio > 0.9:
            status = "critical"  # Near full
        elif fill_ratio > 0.7:
            status = "warning"   # Approaching capacity
        
        return {
            "key": key,
            "capacity": capacity,
            "inserted": inserted,
            "fill_ratio": fill_ratio,
            "status": status
        }

9.19 A Unified View of Probabilistic Data Structures

Bloom filters, Count-Min Sketch, and HyperLogLog appear to solve different problems, but they share a core philosophy:

Use hashing to map data into compact summary structures, sacrificing a small amount of precision for orders-of-magnitude space savings.

Data Structure Question Answered Error Type Space Complexity
Bloom Filter x in S? False positives O(n) bits
Count-Min Sketch freq(x) = ? Overestimates only O(1/epsilon * log(1/delta))
HyperLogLog S = ?
Cuckoo Filter x in S? + delete False positives O(n) bits
MinHash Jaccard(A,B) = ? +/- relative error O(1/epsilon^2)

Their shared properties:

  1. One-way: Data can be inserted but is compressed; original data is unrecoverable.
  2. Mergeable: Two summaries can be merged into one (supports distributed computation).
  3. Error-tunable: Users can increase space to reduce error.
  4. Stream-processable: Requires only a single pass over data; no random access needed.

These properties make probabilistic data structures particularly suitable for big data stream processing scenariosโ€”data volumes too large for exact storage, but approximate answers are sufficient for making correct engineering decisions.


Chapter Summary

Concept Core Mechanism Space Error Typical Applications
Bloom Filter Bit array + k hashes ~10 bits/elem @1% FP False positives Cache penetration, URL dedup
Count-Min Sketch d x w counter matrix O(1/epsilon * log(1/delta)) Overestimates only Traffic stats, hotspot detection
HyperLogLog Leading zeros observation ~12KB (Redis) +/-0.81% Unique visitor counts
Cuckoo Filter Fingerprints + cuckoo hash ~12 bits/elem @0.1% FP False positives Scenarios requiring deletion

Core insight: Finding the optimal engineering balance between precision and space. When you face "TB-scale data with membership/frequency/cardinality queries," probabilistic data structures are almost always the right answer.

Rate this chapter
4.7  / 5  (44 ratings)

๐Ÿ’ฌ Comments