Chapter 8

Hash Tables: The Cost of O(1)

Chapter 8: Hash Tables — The Cost of O(1)

You're searching for username "alice" in a database of one million records. With an array, you scan all one million entries. With a sorted array and binary search, about 20 comparisons. With a hash table, on average just 1.

O(1) lookup sounds like magic. But magic has a price — extra memory, hash collisions, resize overhead, and worst-case degradation. This chapter shows you exactly how hash tables work, what the costs are, and when not to use them.


Level 1 · What You Need to Know

8.1 The Core Idea

The idea behind a hash table is dead simple: use a function to map keys to array indices, then access directly.

# Simplest hash table: modulo as hash function
class SimpleHashTable:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.buckets = [None] * capacity

    def _hash(self, key):
        return hash(key) % self.capacity

    def put(self, key, value):
        idx = self._hash(key)
        self.buckets[idx] = (key, value)  # Warning: no collision handling!

    def get(self, key):
        idx = self._hash(key)
        pair = self.buckets[idx]
        if pair and pair[0] == key:
            return pair[1]
        return None

Fatal flaw: if two different keys map to the same index (hash collision), the second overwrites the first.

8.2 Collision Resolution: Chaining

The most intuitive fix: each bucket holds a list instead of a single element.

class ChainedHashTable:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.size = 0
        self.buckets = [[] for _ in range(capacity)]

    def _hash(self, key):
        return hash(key) % self.capacity

    def put(self, key, value):
        idx = self._hash(key)
        bucket = self.buckets[idx]
        for i, (k, v) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1
        if self.size / self.capacity > 0.75:
            self._resize(self.capacity * 2)

    def get(self, key):
        idx = self._hash(key)
        for k, v in self.buckets[idx]:
            if k == key:
                return v
        return None

    def delete(self, key):
        idx = self._hash(key)
        bucket = self.buckets[idx]
        for i, (k, v) in enumerate(bucket):
            if k == key:
                bucket.pop(i)
                self.size -= 1
                return True
        return False

    def _resize(self, new_capacity):
        old_buckets = self.buckets
        self.capacity = new_capacity
        self.buckets = [[] for _ in range(new_capacity)]
        self.size = 0
        for bucket in old_buckets:
            for k, v in bucket:
                self.put(k, v)

Load factor = number of elements / number of buckets. When it exceeds a threshold (typically 0.75), the table resizes.

8.3 Collision Resolution: Open Addressing

Alternative strategy: when a collision occurs, probe for the next empty slot in the array itself.

class OpenAddressHashTable:
    EMPTY = object()
    DELETED = object()  # Tombstone marker

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.size = 0
        self.keys = [self.EMPTY] * capacity
        self.values = [None] * capacity

    def _hash(self, key):
        return hash(key) % self.capacity

    def _probe(self, key):
        """Linear probing: find key's slot or first empty slot"""
        idx = self._hash(key)
        first_deleted = -1
        for _ in range(self.capacity):
            if self.keys[idx] is self.EMPTY:
                return (first_deleted if first_deleted >= 0 else idx), False
            if self.keys[idx] is self.DELETED:
                if first_deleted < 0:
                    first_deleted = idx
            elif self.keys[idx] == key:
                return idx, True
            idx = (idx + 1) % self.capacity
        return (first_deleted if first_deleted >= 0 else -1), False

    def put(self, key, value):
        if self.size * 3 >= self.capacity * 2:  # load factor > 2/3
            self._resize(self.capacity * 2)
        idx, found = self._probe(key)
        if not found:
            self.size += 1
        self.keys[idx] = key
        self.values[idx] = value

    def get(self, key):
        idx, found = self._probe(key)
        if found:
            return self.values[idx]
        return None

    def delete(self, key):
        idx, found = self._probe(key)
        if found:
            self.keys[idx] = self.DELETED  # Can't set to EMPTY
            self.values[idx] = None
            self.size -= 1
            return True
        return False

    def _resize(self, new_capacity):
        old_keys, old_values = self.keys, self.values
        self.capacity = new_capacity
        self.keys = [self.EMPTY] * new_capacity
        self.values = [None] * new_capacity
        self.size = 0
        for i, k in enumerate(old_keys):
            if k is not self.EMPTY and k is not self.DELETED:
                self.put(k, old_values[i])

Why use tombstones (DELETED) instead of setting to EMPTY? If A and B collide (B is placed after A), deleting A and setting it to EMPTY would cause a search for B to stop early at A's position, wrongly concluding B doesn't exist. The tombstone tells the probing algorithm: "someone lived here — keep looking."

8.4 Chaining vs. Open Addressing

Feature Chaining Open Addressing
Memory layout Non-contiguous (linked nodes) Contiguous (all in array)
Cache friendliness Poor Good
Load factor > 1 Yes No (must be < 1)
Deletion Simple Requires tombstones
Implementation Simple More complex
Use case General purpose Memory-compact, cache-sensitive

Which does Python dict use? Open addressing. CPython prioritizes compact memory layout and cache performance. Java's HashMap uses chaining (linked list per bucket, converting to red-black tree when length > 8).

8.5 Python dict Tips

d = {}
d["key"] = "value"            # O(1) insert
val = d.get("key", "default") # O(1) lookup, returns default if missing
"key" in d                     # O(1) existence check
del d["key"]                   # O(1) delete

# Counter pattern
from collections import Counter
counts = Counter("abracadabra")
# Counter({'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1})

# defaultdict avoids KeyError
from collections import defaultdict
graph = defaultdict(list)
graph["a"].append("b")  # No need to check if "a" exists first

# Dict comprehension
squares = {x: x**2 for x in range(10)}

# Ordering: Python 3.7+ dicts preserve insertion order
d = {"b": 2, "a": 1, "c": 3}
list(d.keys())  # ['b', 'a', 'c'] — insertion order preserved

8.6 set: A Hash Table Without Values

Python's set is a hash table that stores only keys, no values.

s = set()
s.add(42)           # O(1)
42 in s             # O(1)
s.remove(42)        # O(1), raises KeyError if missing
s.discard(42)       # O(1), no error if missing

# Set operations
a = {1, 2, 3}
b = {2, 3, 4}
a | b  # Union {1, 2, 3, 4}
a & b  # Intersection {2, 3}
a - b  # Difference {1}
a ^ b  # Symmetric difference {1, 4}

# Deduplication
unique = list(set(items))  # Note: loses order
# Order-preserving dedup
seen = set()
unique = [x for x in items if x not in seen and not seen.add(x)]

Level 2 · How It Works

8.7 Hash Function Design

A good hash function needs two properties:

  1. Uniform distribution: Different keys should map to different buckets as much as possible
  2. Fast computation: The hash function itself must be quick

Integer hashing: The simplest approach is key % capacity, but if capacity is a power of 2, modulo is equivalent to taking the lowest bits, discarding high-bit information. Many implementations first "mix" the bits, spreading high-bit information into low bits.

Python hashes small integers to themselves: hash(42) == 42. For strings and large objects, Python uses SipHash (since Python 3.4), a cryptographically secure hash function that defends against hash collision attacks (HashDoS).

# Python integer hashing
hash(42)    # 42
hash(-1)    # -2 (special case: -1 means error in CPython internals)
hash(0)     # 0

# String hash changes every startup (random seed)
hash("hello")  # Different each run (HashDoS defense)

Java's perturbation function (HashMap.hash):

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

XORs the high 16 bits into the low 16 bits, ensuring that even when the array is small (only low bits matter), high-bit information contributes.

8.8 CPython dict Internals

CPython 3.6+ uses a clever compact layout proposed by Raymond Hettinger in 2012.

Old layout (Python < 3.6):

[hash|key|value, hash|key|value, —, hash|key|value, —, —, —, —]

Problem: empty slots waste memory, and each empty slot still occupies full hash + key + value space.

New layout (Python 3.6+):

indices:  [1, —, 0, —, 2, —, —, —]   ← small integer array (1/2/4/8 bytes each)
entries:  [(hash₀, key₀, val₀),        ← compact array, insertion order
           (hash₁, key₁, val₁),
           (hash₂, key₂, val₂)]

Benefits:

  1. Each index entry is only 1-8 bytes (not a full hash+key+value triple)
  2. Entries array is compact with no holes, naturally preserving insertion order
  3. 20-25% memory savings

Probe sequence: CPython doesn't use simple linear probing. It uses perturbation probing:

# CPython probe sequence (pseudocode)
perturb = hash_value
idx = perturb % capacity
while slot_occupied:
    idx = (5 * idx + perturb + 1) % capacity
    perturb >>= 5  # gradually decays to (5*idx+1) % capacity

This sequence guarantees:

8.9 Resize (Rehash) Mechanism

When the load factor exceeds 2/3, CPython triggers a resize:

  1. Allocate new indices + entries arrays (double or 4x capacity)
  2. Walk the old entries, recomputing each key's position
  3. Insert into the new arrays
  4. Free the old arrays

Key point: Rehash is O(n), but amortized over all insertions it's O(1) (identical analysis to dynamic array resizing — see Chapter 1 amortized analysis).

Shrinking: Python dict never auto-shrinks. Even after deleting many elements, memory isn't reclaimed. To force reclamation:

d = {k: v for k, v in d.items()}  # Creates a right-sized new dict

8.10 Consistent Hashing

Traditional hashing: server = hash(key) % N. When N changes (adding/removing servers), almost all keys need remapping.

Consistent hashing (David Karger et al., MIT, 1997) solves this:

import hashlib
import bisect

class ConsistentHash:
    def __init__(self, nodes, replicas=150):
        self.replicas = replicas
        self.ring = []
        self.ring_map = {}
        for node in nodes:
            self.add_node(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.replicas):
            h = self._hash(f"{node}:{i}")
            self.ring.append(h)
            self.ring_map[h] = node
        self.ring.sort()

    def remove_node(self, node):
        for i in range(self.replicas):
            h = self._hash(f"{node}:{i}")
            self.ring.remove(h)
            del self.ring_map[h]

    def get_node(self, key):
        if not self.ring:
            return None
        h = self._hash(key)
        idx = bisect.bisect_right(self.ring, h) % len(self.ring)
        return self.ring_map[self.ring[idx]]

ch = ConsistentHash(["server-1", "server-2", "server-3"])
print(ch.get_node("user:alice"))
# Adding server-4 only migrates ~1/4 of keys (not all)

Why virtual nodes (replicas)? With only 3 physical nodes, only 3 points on the hash ring — extremely uneven distribution. Creating 150 virtual nodes per physical node puts 450 points on the ring, ensuring balanced load.

Applications: Redis Cluster, Memcached, CDN load balancing, distributed database sharding.


Level 3 · What the Spec Says

8.11 Theoretical Foundations

Perfect Hashing: If all keys are known in advance, you can construct a hash function with zero collisions. Fredman, Komlós, and Szemerédi proved this possible in 1984 (FKS perfect hashing) with O(n) space and O(1) worst-case lookup.

The idea: two-level hashing. The first level distributes n keys into n buckets. The second level constructs a collision-free mini hash table per bucket. Szemerédi's result guarantees the second level uses O(n) total space.

Cuckoo Hashing (Pagh & Rodler, 2001): Uses two hash functions h₁ and h₂. Each key can be in one of two positions. On insertion, if both positions are occupied, "kick out" one occupant (like a cuckoo pushing eggs from a nest). The evicted key then moves to its other position.

# Cuckoo hashing insertion pseudocode
def cuckoo_insert(key):
    for _ in range(MAX_KICKS):
        pos1 = h1(key) % capacity
        if table1[pos1] is empty:
            table1[pos1] = key
            return
        key, table1[pos1] = table1[pos1], key  # kick out
        pos2 = h2(key) % capacity
        if table2[pos2] is empty:
            table2[pos2] = key
            return
        key, table2[pos2] = table2[pos2], key  # kick again
    rehash_with_new_functions()
    cuckoo_insert(key)

Advantage: O(1) worst-case lookup (check exactly two positions). Disadvantage: insertion may trigger chain evictions.

Robin Hood Hashing (Pedro Celis, 1986): An open-addressing variant. During probing, if the current element's "displacement" (distance from its ideal position) is less than the element being inserted, swap them. This equalizes displacement across all elements, reducing worst-case lookup length. Rust's early HashMap used this strategy.

8.12 SipHash and HashDoS Defense

In 2011, Alexander Klink and Julian Wälde demonstrated at 28C3 an attack: by crafting strings that all hash to the same bucket, web frameworks' hash tables degrade to O(n²), crashing servers with minimal requests (HashDoS).

Python, Ruby, Perl, and PHP subsequently introduced randomized hash seeds. Python 3.4+ defaults to SipHash-2-4 (Jean-Philippe Aumasson and Daniel J. Bernstein, 2012), a cryptographic short-input hash function:

import sys
print(sys.hash_info.algorithm)  # 'siphash24'

# Within the same process, hash is deterministic
assert hash("hello") == hash("hello")

# But between processes, hashes differ
# python3 -c "print(hash('hello'))"  → different each time

8.13 Universal Hashing

Carter and Wegman introduced universal hash families in 1979: a collection H of hash functions such that for a randomly chosen h from H and any two distinct keys x ≠ y, the collision probability Pr[h(x) = h(y)] ≤ 1/m (where m is the number of buckets).

This is a powerful guarantee: no matter how adversarial the input, randomly choosing a hash function ensures optimal expected collisions.

A simple universal hash family for integer keys:

h_{a,b}(x) = ((a * x + b) mod p) mod m

where p is a prime larger than all key values, a ∈ {1, ..., p-1} and b ∈ {0, ..., p-1} are chosen randomly.


Level 4 · Edge Cases and Pitfalls

8.14 Mutable Objects Cannot Be Dict Keys

Python dicts require keys to be hashable — they must implement __hash__ and the value must never change. Lists are not hashable:

d = {}
d[[1, 2]] = "value"  # TypeError: unhashable type: 'list'
d[(1, 2)] = "value"  # OK — tuples are hashable

# But tuples containing lists aren't hashable either
d[([1], [2])] = "value"  # TypeError: unhashable type: 'list'

Why? If a key could be modified, its hash value would change, making it unfindable at its original position. This is a fundamental correctness requirement for hash tables.

Custom objects as keys — implement both __hash__ and __eq__:

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __hash__(self):
        return hash((self.x, self.y))

    def __eq__(self, other):
        return isinstance(other, Point) and self.x == other.x and self.y == other.y

d = {Point(1, 2): "A"}
print(d[Point(1, 2)])  # "A"

Rule: If two objects are __eq__ equal, their __hash__ must be identical. The reverse is not required (different objects can share a hash — that's just a collision).

8.15 Modifying a Dict During Iteration

d = {"a": 1, "b": 2, "c": 3}

# Wrong: modifying size during iteration
for key in d:
    if d[key] == 2:
        del d[key]  # RuntimeError: dictionary changed size during iteration

# Correct: collect keys to delete first
to_delete = [k for k, v in d.items() if v == 2]
for k in to_delete:
    del d[k]

# Or create a new dict with comprehension
d = {k: v for k, v in d.items() if v != 2}

8.16 Worst-Case Scenarios

Interview point: Hash tables average O(1), but worst case is O(n). When does worst case occur?

  1. All keys collide into one bucket: Chaining degrades to linked list O(n); open addressing probes the entire array O(n)
  2. Adversarial input (HashDoS): Attacker crafts collision keys (see SipHash in Level 3)
  3. Momentary stall during resize: Rehash is O(n). While amortized O(1), a single operation can be slow

Redis's progressive rehash solves problem #3: instead of rehashing all at once, Redis migrates a small batch of keys during each subsequent read/write operation. This prevents latency spikes, at the cost of maintaining two hash tables simultaneously during the rehash period.

See the Redis book on this site, Chapter 6: "Hash Table and Progressive Rehash."

8.17 High-Frequency Interview Problems

Problem 1: Two Sum (LeetCode #1)

def two_sum(nums, target):
    seen = {}
    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:
            return [seen[complement], i]
        seen[num] = i

Core insight: a hash map turns O(n²) nested loops into O(n) single pass.

Problem 2: Group Anagrams (LeetCode #49)

from collections import defaultdict

def group_anagrams(strs):
    groups = defaultdict(list)
    for s in strs:
        key = tuple(sorted(s))  # sorted form as hash key
        groups[key].append(s)
    return list(groups.values())

# Optimized: use character frequency as key (avoids sorting)
def group_anagrams_v2(strs):
    groups = defaultdict(list)
    for s in strs:
        count = [0] * 26
        for c in s:
            count[ord(c) - ord('a')] += 1
        groups[tuple(count)].append(s)
    return list(groups.values())

Problem 3: Longest Consecutive Sequence (LeetCode #128)

def longest_consecutive(nums):
    num_set = set(nums)
    max_len = 0
    for num in num_set:
        if num - 1 not in num_set:  # only start from sequence beginnings
            length = 1
            while num + length in num_set:
                length += 1
            max_len = max(max_len, length)
    return max_len

Looks like O(n²) (nested loops), but it's actually O(n). The inner while loop executes at most n times total across all outer iterations (each element is visited by the inner loop at most once).

Problem 4: LRU Cache (LeetCode #146)

Hash table + doubly linked list for O(1) get and put. This classic "design data structure" problem is fully implemented in Chapter 38.


Chapter Summary

Concept Key Point
Hash table core Hash function + array → O(1) average lookup
Chaining Linked list per bucket, simple but cache-unfriendly
Open addressing Probe for next empty slot, cache-friendly but complex deletion
CPython dict Open addressing + compact layout + perturbation probing + SipHash
Load factor Controls resize timing; CPython threshold is 2/3
Consistent hashing Virtual nodes + hash ring; adding/removing nodes migrates few keys
Worst case O(n); mitigated by randomized or perfect hashing

Exercises:

  1. Implement a hash table supporting put, get, delete using Cuckoo Hashing.
  2. Why is Python's frozenset hashable but set is not?
  3. Design a hash function that maps similar strings (e.g., "abc" and "abd") to different buckets. (Hint: polynomial hashing)
  4. In consistent hashing, when a node goes down, where do its virtual nodes' keys migrate? Is the distribution even?

Next chapter: Chapter 9 introduces Bloom filters and probabilistic data structures — when you don't need 100% accuracy, how to achieve approximate queries with minimal memory.

Rate this chapter
4.8  / 5  (50 ratings)

💬 Comments