Hash Tables: The Cost of O(1)
Chapter 8: Hash Tables โ The Cost of O(1)
You're searching for username "alice" in a database of one million records. With an array, you scan all one million entries. With a sorted array and binary search, about 20 comparisons. With a hash table, on average just 1.
O(1) lookup sounds like magic. But magic has a price โ extra memory, hash collisions, resize overhead, and worst-case degradation. This chapter shows you exactly how hash tables work, what the costs are, and when not to use them.
Level 1 ยท What You Need to Know
8.1 The Core Idea
The idea behind a hash table is dead simple: use a function to map keys to array indices, then access directly.
# Simplest hash table: modulo as hash function
class SimpleHashTable:
def __init__(self, capacity=8):
self.capacity = capacity
self.buckets = [None] * capacity
def _hash(self, key):
return hash(key) % self.capacity
def put(self, key, value):
idx = self._hash(key)
self.buckets[idx] = (key, value) # Warning: no collision handling!
def get(self, key):
idx = self._hash(key)
pair = self.buckets[idx]
if pair and pair[0] == key:
return pair[1]
return None
Fatal flaw: if two different keys map to the same index (hash collision), the second overwrites the first.
8.2 Collision Resolution: Chaining
The most intuitive fix: each bucket holds a list instead of a single element.
class ChainedHashTable:
def __init__(self, capacity=8):
self.capacity = capacity
self.size = 0
self.buckets = [[] for _ in range(capacity)]
def _hash(self, key):
return hash(key) % self.capacity
def put(self, key, value):
idx = self._hash(key)
bucket = self.buckets[idx]
for i, (k, v) in enumerate(bucket):
if k == key:
bucket[i] = (key, value)
return
bucket.append((key, value))
self.size += 1
if self.size / self.capacity > 0.75:
self._resize(self.capacity * 2)
def get(self, key):
idx = self._hash(key)
for k, v in self.buckets[idx]:
if k == key:
return v
return None
def delete(self, key):
idx = self._hash(key)
bucket = self.buckets[idx]
for i, (k, v) in enumerate(bucket):
if k == key:
bucket.pop(i)
self.size -= 1
return True
return False
def _resize(self, new_capacity):
old_buckets = self.buckets
self.capacity = new_capacity
self.buckets = [[] for _ in range(new_capacity)]
self.size = 0
for bucket in old_buckets:
for k, v in bucket:
self.put(k, v)
Load factor = number of elements / number of buckets. When it exceeds a threshold (typically 0.75), the table resizes.
8.3 Collision Resolution: Open Addressing
Alternative strategy: when a collision occurs, probe for the next empty slot in the array itself.
class OpenAddressHashTable:
EMPTY = object()
DELETED = object() # Tombstone marker
def __init__(self, capacity=8):
self.capacity = capacity
self.size = 0
self.keys = [self.EMPTY] * capacity
self.values = [None] * capacity
def _hash(self, key):
return hash(key) % self.capacity
def _probe(self, key):
"""Linear probing: find key's slot or first empty slot"""
idx = self._hash(key)
first_deleted = -1
for _ in range(self.capacity):
if self.keys[idx] is self.EMPTY:
return (first_deleted if first_deleted >= 0 else idx), False
if self.keys[idx] is self.DELETED:
if first_deleted < 0:
first_deleted = idx
elif self.keys[idx] == key:
return idx, True
idx = (idx + 1) % self.capacity
return (first_deleted if first_deleted >= 0 else -1), False
def put(self, key, value):
if self.size * 3 >= self.capacity * 2: # load factor > 2/3
self._resize(self.capacity * 2)
idx, found = self._probe(key)
if not found:
self.size += 1
self.keys[idx] = key
self.values[idx] = value
def get(self, key):
idx, found = self._probe(key)
if found:
return self.values[idx]
return None
def delete(self, key):
idx, found = self._probe(key)
if found:
self.keys[idx] = self.DELETED # Can't set to EMPTY
self.values[idx] = None
self.size -= 1
return True
return False
def _resize(self, new_capacity):
old_keys, old_values = self.keys, self.values
self.capacity = new_capacity
self.keys = [self.EMPTY] * new_capacity
self.values = [None] * new_capacity
self.size = 0
for i, k in enumerate(old_keys):
if k is not self.EMPTY and k is not self.DELETED:
self.put(k, old_values[i])
Why use tombstones (DELETED) instead of setting to EMPTY? If A and B collide (B is placed after A), deleting A and setting it to EMPTY would cause a search for B to stop early at A's position, wrongly concluding B doesn't exist. The tombstone tells the probing algorithm: "someone lived here โ keep looking."
8.4 Chaining vs. Open Addressing
| Feature | Chaining | Open Addressing |
|---|---|---|
| Memory layout | Non-contiguous (linked nodes) | Contiguous (all in array) |
| Cache friendliness | Poor | Good |
| Load factor > 1 | Yes | No (must be < 1) |
| Deletion | Simple | Requires tombstones |
| Implementation | Simple | More complex |
| Use case | General purpose | Memory-compact, cache-sensitive |
Which does Python dict use? Open addressing. CPython prioritizes compact memory layout and cache performance. Java's HashMap uses chaining (linked list per bucket, converting to red-black tree when length > 8).
8.5 Python dict Tips
d = {}
d["key"] = "value" # O(1) insert
val = d.get("key", "default") # O(1) lookup, returns default if missing
"key" in d # O(1) existence check
del d["key"] # O(1) delete
# Counter pattern
from collections import Counter
counts = Counter("abracadabra")
# Counter({'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1})
# defaultdict avoids KeyError
from collections import defaultdict
graph = defaultdict(list)
graph["a"].append("b") # No need to check if "a" exists first
# Dict comprehension
squares = {x: x**2 for x in range(10)}
# Ordering: Python 3.7+ dicts preserve insertion order
d = {"b": 2, "a": 1, "c": 3}
list(d.keys()) # ['b', 'a', 'c'] โ insertion order preserved
8.6 set: A Hash Table Without Values
Python's set is a hash table that stores only keys, no values.
s = set()
s.add(42) # O(1)
42 in s # O(1)
s.remove(42) # O(1), raises KeyError if missing
s.discard(42) # O(1), no error if missing
# Set operations
a = {1, 2, 3}
b = {2, 3, 4}
a | b # Union {1, 2, 3, 4}
a & b # Intersection {2, 3}
a - b # Difference {1}
a ^ b # Symmetric difference {1, 4}
# Deduplication
unique = list(set(items)) # Note: loses order
# Order-preserving dedup
seen = set()
unique = [x for x in items if x not in seen and not seen.add(x)]
Level 2 ยท How It Works
8.7 Hash Function Design
A good hash function needs two properties:
- Uniform distribution: Different keys should map to different buckets as much as possible
- Fast computation: The hash function itself must be quick
Integer hashing: The simplest approach is key % capacity, but if capacity is a power of 2, modulo is equivalent to taking the lowest bits, discarding high-bit information. Many implementations first "mix" the bits, spreading high-bit information into low bits.
Python hashes small integers to themselves: hash(42) == 42. For strings and large objects, Python uses SipHash (since Python 3.4), a cryptographically secure hash function that defends against hash collision attacks (HashDoS).
# Python integer hashing
hash(42) # 42
hash(-1) # -2 (special case: -1 means error in CPython internals)
hash(0) # 0
# String hash changes every startup (random seed)
hash("hello") # Different each run (HashDoS defense)
Java's perturbation function (HashMap.hash):
static final int hash(Object key) {
int h;
return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
XORs the high 16 bits into the low 16 bits, ensuring that even when the array is small (only low bits matter), high-bit information contributes.
8.8 CPython dict Internals
CPython 3.6+ uses a clever compact layout proposed by Raymond Hettinger in 2012.
Old layout (Python < 3.6):
[hash|key|value, hash|key|value, โ, hash|key|value, โ, โ, โ, โ]
Problem: empty slots waste memory, and each empty slot still occupies full hash + key + value space.
New layout (Python 3.6+):
indices: [1, โ, 0, โ, 2, โ, โ, โ] โ small integer array (1/2/4/8 bytes each)
entries: [(hashโ, keyโ, valโ), โ compact array, insertion order
(hashโ, keyโ, valโ),
(hashโ, keyโ, valโ)]
Benefits:
- Each index entry is only 1-8 bytes (not a full hash+key+value triple)
- Entries array is compact with no holes, naturally preserving insertion order
- 20-25% memory savings
Probe sequence: CPython doesn't use simple linear probing. It uses perturbation probing:
# CPython probe sequence (pseudocode)
perturb = hash_value
idx = perturb % capacity
while slot_occupied:
idx = (5 * idx + perturb + 1) % capacity
perturb >>= 5 # gradually decays to (5*idx+1) % capacity
This sequence guarantees:
- Early iterations use all bits of the hash value (via perturb)
- Later iterations degrade to a special linear congruential sequence
- When capacity is a power of 2,
5*idx+1visits every position
8.9 Resize (Rehash) Mechanism
When the load factor exceeds 2/3, CPython triggers a resize:
- Allocate new indices + entries arrays (double or 4x capacity)
- Walk the old entries, recomputing each key's position
- Insert into the new arrays
- Free the old arrays
Key point: Rehash is O(n), but amortized over all insertions it's O(1) (identical analysis to dynamic array resizing โ see Chapter 1 amortized analysis).
Shrinking: Python dict never auto-shrinks. Even after deleting many elements, memory isn't reclaimed. To force reclamation:
d = {k: v for k, v in d.items()} # Creates a right-sized new dict
8.10 Consistent Hashing
Traditional hashing: server = hash(key) % N. When N changes (adding/removing servers), almost all keys need remapping.
Consistent hashing (David Karger et al., MIT, 1997) solves this:
import hashlib
import bisect
class ConsistentHash:
def __init__(self, nodes, replicas=150):
self.replicas = replicas
self.ring = []
self.ring_map = {}
for node in nodes:
self.add_node(node)
def _hash(self, key):
return int(hashlib.md5(key.encode()).hexdigest(), 16)
def add_node(self, node):
for i in range(self.replicas):
h = self._hash(f"{node}:{i}")
self.ring.append(h)
self.ring_map[h] = node
self.ring.sort()
def remove_node(self, node):
for i in range(self.replicas):
h = self._hash(f"{node}:{i}")
self.ring.remove(h)
del self.ring_map[h]
def get_node(self, key):
if not self.ring:
return None
h = self._hash(key)
idx = bisect.bisect_right(self.ring, h) % len(self.ring)
return self.ring_map[self.ring[idx]]
ch = ConsistentHash(["server-1", "server-2", "server-3"])
print(ch.get_node("user:alice"))
# Adding server-4 only migrates ~1/4 of keys (not all)
Why virtual nodes (replicas)? With only 3 physical nodes, only 3 points on the hash ring โ extremely uneven distribution. Creating 150 virtual nodes per physical node puts 450 points on the ring, ensuring balanced load.
Applications: Redis Cluster, Memcached, CDN load balancing, distributed database sharding.
Level 3 ยท What the Spec Says
8.11 Theoretical Foundations
Perfect Hashing: If all keys are known in advance, you can construct a hash function with zero collisions. Fredman, Komlรณs, and Szemerรฉdi proved this possible in 1984 (FKS perfect hashing) with O(n) space and O(1) worst-case lookup.
The idea: two-level hashing. The first level distributes n keys into n buckets. The second level constructs a collision-free mini hash table per bucket. Szemerรฉdi's result guarantees the second level uses O(n) total space.
Cuckoo Hashing (Pagh & Rodler, 2001): Uses two hash functions hโ and hโ. Each key can be in one of two positions. On insertion, if both positions are occupied, "kick out" one occupant (like a cuckoo pushing eggs from a nest). The evicted key then moves to its other position.
# Cuckoo hashing insertion pseudocode
def cuckoo_insert(key):
for _ in range(MAX_KICKS):
pos1 = h1(key) % capacity
if table1[pos1] is empty:
table1[pos1] = key
return
key, table1[pos1] = table1[pos1], key # kick out
pos2 = h2(key) % capacity
if table2[pos2] is empty:
table2[pos2] = key
return
key, table2[pos2] = table2[pos2], key # kick again
rehash_with_new_functions()
cuckoo_insert(key)
Advantage: O(1) worst-case lookup (check exactly two positions). Disadvantage: insertion may trigger chain evictions.
Robin Hood Hashing (Pedro Celis, 1986): An open-addressing variant. During probing, if the current element's "displacement" (distance from its ideal position) is less than the element being inserted, swap them. This equalizes displacement across all elements, reducing worst-case lookup length. Rust's early HashMap used this strategy.
8.12 SipHash and HashDoS Defense
In 2011, Alexander Klink and Julian Wรคlde demonstrated at 28C3 an attack: by crafting strings that all hash to the same bucket, web frameworks' hash tables degrade to O(nยฒ), crashing servers with minimal requests (HashDoS).
Python, Ruby, Perl, and PHP subsequently introduced randomized hash seeds. Python 3.4+ defaults to SipHash-2-4 (Jean-Philippe Aumasson and Daniel J. Bernstein, 2012), a cryptographic short-input hash function:
- Generates a random 128-bit seed at startup
- Same string produces different hashes in different processes
- Attackers cannot predict hash values, so cannot craft collisions
import sys
print(sys.hash_info.algorithm) # 'siphash24'
# Within the same process, hash is deterministic
assert hash("hello") == hash("hello")
# But between processes, hashes differ
# python3 -c "print(hash('hello'))" โ different each time
8.13 Universal Hashing
Carter and Wegman introduced universal hash families in 1979: a collection H of hash functions such that for a randomly chosen h from H and any two distinct keys x โ y, the collision probability Pr[h(x) = h(y)] โค 1/m (where m is the number of buckets).
This is a powerful guarantee: no matter how adversarial the input, randomly choosing a hash function ensures optimal expected collisions.
A simple universal hash family for integer keys:
h_{a,b}(x) = ((a * x + b) mod p) mod m
where p is a prime larger than all key values, a โ {1, ..., p-1} and b โ {0, ..., p-1} are chosen randomly.
Level 4 ยท Edge Cases and Pitfalls
8.14 Mutable Objects Cannot Be Dict Keys
Python dicts require keys to be hashable โ they must implement __hash__ and the value must never change. Lists are not hashable:
d = {}
d[[1, 2]] = "value" # TypeError: unhashable type: 'list'
d[(1, 2)] = "value" # OK โ tuples are hashable
# But tuples containing lists aren't hashable either
d[([1], [2])] = "value" # TypeError: unhashable type: 'list'
Why? If a key could be modified, its hash value would change, making it unfindable at its original position. This is a fundamental correctness requirement for hash tables.
Custom objects as keys โ implement both __hash__ and __eq__:
class Point:
def __init__(self, x, y):
self.x = x
self.y = y
def __hash__(self):
return hash((self.x, self.y))
def __eq__(self, other):
return isinstance(other, Point) and self.x == other.x and self.y == other.y
d = {Point(1, 2): "A"}
print(d[Point(1, 2)]) # "A"
Rule: If two objects are __eq__ equal, their __hash__ must be identical. The reverse is not required (different objects can share a hash โ that's just a collision).
8.15 Modifying a Dict During Iteration
d = {"a": 1, "b": 2, "c": 3}
# Wrong: modifying size during iteration
for key in d:
if d[key] == 2:
del d[key] # RuntimeError: dictionary changed size during iteration
# Correct: collect keys to delete first
to_delete = [k for k, v in d.items() if v == 2]
for k in to_delete:
del d[k]
# Or create a new dict with comprehension
d = {k: v for k, v in d.items() if v != 2}
8.16 Worst-Case Scenarios
Interview point: Hash tables average O(1), but worst case is O(n). When does worst case occur?
- All keys collide into one bucket: Chaining degrades to linked list O(n); open addressing probes the entire array O(n)
- Adversarial input (HashDoS): Attacker crafts collision keys (see SipHash in Level 3)
- Momentary stall during resize: Rehash is O(n). While amortized O(1), a single operation can be slow
Redis's progressive rehash solves problem #3: instead of rehashing all at once, Redis migrates a small batch of keys during each subsequent read/write operation. This prevents latency spikes, at the cost of maintaining two hash tables simultaneously during the rehash period.
See the Redis book on this site, Chapter 6: "Hash Table and Progressive Rehash."
8.17 High-Frequency Interview Problems
Problem 1: Two Sum (LeetCode #1)
def two_sum(nums, target):
seen = {}
for i, num in enumerate(nums):
complement = target - num
if complement in seen:
return [seen[complement], i]
seen[num] = i
Core insight: a hash map turns O(nยฒ) nested loops into O(n) single pass.
Problem 2: Group Anagrams (LeetCode #49)
from collections import defaultdict
def group_anagrams(strs):
groups = defaultdict(list)
for s in strs:
key = tuple(sorted(s)) # sorted form as hash key
groups[key].append(s)
return list(groups.values())
# Optimized: use character frequency as key (avoids sorting)
def group_anagrams_v2(strs):
groups = defaultdict(list)
for s in strs:
count = [0] * 26
for c in s:
count[ord(c) - ord('a')] += 1
groups[tuple(count)].append(s)
return list(groups.values())
Problem 3: Longest Consecutive Sequence (LeetCode #128)
def longest_consecutive(nums):
num_set = set(nums)
max_len = 0
for num in num_set:
if num - 1 not in num_set: # only start from sequence beginnings
length = 1
while num + length in num_set:
length += 1
max_len = max(max_len, length)
return max_len
Looks like O(nยฒ) (nested loops), but it's actually O(n). The inner while loop executes at most n times total across all outer iterations (each element is visited by the inner loop at most once).
Problem 4: LRU Cache (LeetCode #146)
Hash table + doubly linked list for O(1) get and put. This classic "design data structure" problem is fully implemented in Chapter 38.
Chapter Summary
| Concept | Key Point |
|---|---|
| Hash table core | Hash function + array โ O(1) average lookup |
| Chaining | Linked list per bucket, simple but cache-unfriendly |
| Open addressing | Probe for next empty slot, cache-friendly but complex deletion |
| CPython dict | Open addressing + compact layout + perturbation probing + SipHash |
| Load factor | Controls resize timing; CPython threshold is 2/3 |
| Consistent hashing | Virtual nodes + hash ring; adding/removing nodes migrates few keys |
| Worst case | O(n); mitigated by randomized or perfect hashing |
Exercises:
- Implement a hash table supporting
put,get,deleteusing Cuckoo Hashing. - Why is Python's
frozensethashable butsetis not? - Design a hash function that maps similar strings (e.g., "abc" and "abd") to different buckets. (Hint: polynomial hashing)
- In consistent hashing, when a node goes down, where do its virtual nodes' keys migrate? Is the distribution even?
Next chapter: Chapter 9 introduces Bloom filters and probabilistic data structures โ when you don't need 100% accuracy, how to achieve approximate queries with minimal memory.