Tries and Aho-Corasick Automaton
Chapter 13: Tries and Aho-Corasick Automaton
In the domain of text processing, one class of problems recurs perpetually: given a set of keywords, how do we efficiently locate them within a body of text? Search engine autocomplete, IDE code suggestions, network firewall content filtering, DNA sequence pattern matching—all these scenarios rely on the same family of data structures: the Trie (prefix tree).
When we further combine the Trie with the failure-link concept from the KMP algorithm, we obtain the Aho-Corasick automaton (AC automaton)—an algorithm capable of simultaneously matching multiple pattern strings in O(n + m + z) time, where n is the text length, m is the total length of all patterns, and z is the number of matches. This algorithm was proposed by Alfred Aho and Margaret Corasick at Bell Labs in 1975 and remains the industrial standard for multi-pattern matching to this day.
This chapter begins with the most fundamental Trie operations, progressively deepening into the complete implementation of the AC automaton, and finally grounding the theory with real interview problems and engineering case studies.
Level 1 · What You Need to Know
13.1 The Basic Concept of Trie
Trie (pronounced "try", derived from retrieval) is a multi-way tree structure where each path from root to a marked node represents a string. Unlike hash tables, Tries natively support prefix queries—this is their core advantage.
Imagine a Trie storing ["apple", "app", "apt", "bat", "bad"]:
root
/ \
a b
/ \
p a
/ \ / \
p t t d
|
l
|
e
Each node represents a character (strictly speaking, a character on an edge), and the path from the root to any marked node forms a complete word.
Why not just use a hash table? Hash table lookup for a single key is O(L) (L being the string length), but it cannot answer questions like "what are all the words starting with 'ap'?" A hash table would need to iterate over all keys and check each prefix, yielding O(N·L) complexity; a Trie only needs to traverse to the 'a' → 'p' node and then explore its subtree, with complexity O(P + K) where P is the prefix length and K is the number of matching results.
13.2 Basic Trie Implementation
class TrieNode:
def __init__(self):
self.children = {} # char -> TrieNode
self.is_end = False # marks whether this is the end of a complete word
class Trie:
def __init__(self):
self.root = TrieNode()
def insert(self, word: str) -> None:
"""Insert a word. Time complexity: O(L), L = len(word)"""
node = self.root
for char in word:
if char not in node.children:
node.children[char] = TrieNode()
node = node.children[char]
node.is_end = True
def search(self, word: str) -> bool:
"""Exact lookup. Time complexity: O(L)"""
node = self._find_node(word)
return node is not None and node.is_end
def starts_with(self, prefix: str) -> bool:
"""Prefix query. Time complexity: O(P), P = len(prefix)"""
return self._find_node(prefix) is not None
def _find_node(self, prefix: str):
"""Find the node corresponding to a prefix"""
node = self.root
for char in prefix:
if char not in node.children:
return None
node = node.children[char]
return node
Usage example:
trie = Trie()
trie.insert("apple")
trie.insert("app")
trie.insert("application")
print(trie.search("app")) # True
print(trie.search("ap")) # False (not a complete word)
print(trie.starts_with("ap")) # True
print(trie.starts_with("b")) # False
Complexity analysis:
| Operation | Time Complexity | Space Complexity |
|---|---|---|
| Insert | O(L) | O(L) (worst case, no shared prefix) |
| Search | O(L) | O(1) |
| Prefix query | O(P) | O(1) |
| Build entire tree | O(N·L_avg) | O(N·L_avg) (worst case) |
Where L is word length, P is prefix length, N is number of words, and L_avg is average word length.
13.3 Autocomplete Implementation
Autocomplete is the most classic application of Tries. When a user types a prefix, we need to return all words starting with that prefix:
class AutoCompleteTrie(Trie):
def autocomplete(self, prefix: str, limit: int = 10) -> list:
"""Return all words starting with prefix, up to limit results"""
node = self._find_node(prefix)
if node is None:
return []
results = []
self._dfs_collect(node, prefix, results, limit)
return results
def _dfs_collect(self, node, current_word, results, limit):
"""DFS to collect all complete words"""
if len(results) >= limit:
return
if node.is_end:
results.append(current_word)
for char in sorted(node.children.keys()): # lexicographic order
self._dfs_collect(
node.children[char],
current_word + char,
results,
limit
)
Weighted autocomplete (ranked by popularity):
class WeightedTrieNode:
def __init__(self):
self.children = {}
self.is_end = False
self.weight = 0 # search frequency/popularity
class WeightedAutoComplete:
def __init__(self):
self.root = WeightedTrieNode()
def insert(self, word: str, weight: int) -> None:
node = self.root
for char in word:
if char not in node.children:
node.children[char] = WeightedTrieNode()
node = node.children[char]
node.is_end = True
node.weight = weight
def top_k_suggestions(self, prefix: str, k: int = 5) -> list:
"""Return top-k suggestions by weight"""
import heapq
node = self._find_node(prefix)
if node is None:
return []
# Use a min-heap to maintain top-k
heap = [] # (weight, word)
self._collect_all(node, prefix, heap, k)
# Return in descending weight order
return [word for _, word in sorted(heap, reverse=True)]
def _collect_all(self, node, current, heap, k):
import heapq
if node.is_end:
if len(heap) < k:
heapq.heappush(heap, (node.weight, current))
elif node.weight > heap[0][0]:
heapq.heapreplace(heap, (node.weight, current))
for char, child in node.children.items():
self._collect_all(child, current + char, heap, k)
def _find_node(self, prefix):
node = self.root
for char in prefix:
if char not in node.children:
return None
node = node.children[char]
return node
13.4 Compressed Trie (Patricia Tree / Radix Tree)
A problem with standard Tries is that when a path has only single-child nodes, substantial space is wasted. For example, storing "romane", "romanus", "romulus"—the path from 'r' to 'rom' has nodes each with only one child.
Patricia Tree (Practical Algorithm to Retrieve Information Coded in Alphanumeric, proposed by Donald R. Morrison in 1968) compresses these single-chain paths into a single edge:
Standard Trie: Compressed Trie (Patricia Tree):
r rom
| / \
o an ulus
| / \
m e us
/ \
a u
| |
n l
/ \ |
e u u
| s
s
class PatriciaNode:
def __init__(self, label=""):
self.label = label # edge label (can be multiple characters)
self.children = {} # first character -> PatriciaNode
self.is_end = False
class PatriciaTrie:
def __init__(self):
self.root = PatriciaNode()
def insert(self, word: str) -> None:
node = self.root
i = 0
while i < len(word):
char = word[i]
if char not in node.children:
# Insert remaining portion as a new edge
new_node = PatriciaNode(word[i:])
new_node.is_end = True
node.children[char] = new_node
return
child = node.children[char]
label = child.label
# Find longest common prefix between word[i:] and label
j = 0
while j < len(label) and i + j < len(word) and label[j] == word[i + j]:
j += 1
if j == len(label):
# Label fully matched, continue downward
i += j
node = child
else:
# Need to split the node
split_node = PatriciaNode(label[:j])
node.children[char] = split_node
# Original child becomes child of split_node
child.label = label[j:]
split_node.children[label[j]] = child
# If word has remaining characters, create new branch
if i + j < len(word):
new_node = PatriciaNode(word[i + j:])
new_node.is_end = True
split_node.children[word[i + j]] = new_node
else:
split_node.is_end = True
return
node.is_end = True
def search(self, word: str) -> bool:
node = self.root
i = 0
while i < len(word):
char = word[i]
if char not in node.children:
return False
child = node.children[char]
label = child.label
if not word[i:i+len(label)] == label:
return False
i += len(label)
node = child
return node.is_end
Advantages of Patricia Tree:
- Space: reduced from O(N·L) to O(N), since only non-shared suffixes need storage
- Search speed: actually faster due to fewer pointer dereferences
- Ideal for long strings: URL routing, file path matching, etc.
13.5 Common Mistakes and Pitfalls
Mistake 1: Confusing search and starts_with
# Wrong: forgetting to check the is_end flag
def search(self, word):
node = self._find_node(word)
return node is not None # BUG! "app" would return True in "apple"
# Correct:
def search(self, word):
node = self._find_node(word)
return node is not None and node.is_end
Mistake 2: Failing to reclaim empty nodes during deletion
# Correct deletion implementation
def delete(self, word: str) -> bool:
"""Delete a word, return whether deletion was successful"""
def _delete(node, word, depth):
if depth == len(word):
if not node.is_end:
return False # Word does not exist
node.is_end = False
return len(node.children) == 0 # Can this node be removed?
char = word[depth]
if char not in node.children:
return False
should_delete = _delete(node.children[char], word, depth + 1)
if should_delete:
del node.children[char]
return not node.is_end and len(node.children) == 0
return False
_delete(self.root, word, 0)
Mistake 3: Autocomplete without a limit causing OOM
When a Trie stores millions of words, a very short prefix (like "a") may match hundreds of thousands of results. A limit parameter is essential.
Level 2 · How It Works Under the Hood
13.6 Trie Space Optimization Strategies
13.6.1 Comparing Child Node Storage Approaches
The children field of a Trie node can be implemented in several ways, and the choice directly impacts time and space performance:
Approach 1: Fixed-size array
class TrieNodeArray:
def __init__(self):
# 26 letters, direct index access
self.children = [None] * 26
self.is_end = False
def get_child(self, char):
return self.children[ord(char) - ord('a')]
def set_child(self, char, node):
self.children[ord(char) - ord('a')] = node
- Time: O(1) child access
- Space: each node has a fixed 26-pointer slot (even if most are empty)
- Best for: small character sets (e.g., lowercase letters only), maximum speed
Approach 2: Hash table (dictionary)
class TrieNodeHash:
def __init__(self):
self.children = {} # dynamically allocated
self.is_end = False
- Time: amortized O(1), but with hashing overhead
- Space: only stores actually existing children
- Best for: large character sets (Unicode), sparse Tries
Approach 3: Sorted array (for static Tries)
class TrieNodeSorted:
def __init__(self):
self.keys = [] # sorted character list
self.values = [] # corresponding child nodes
self.is_end = False
def get_child(self, char):
# Binary search
import bisect
idx = bisect.bisect_left(self.keys, char)
if idx < len(self.keys) and self.keys[idx] == char:
return self.values[idx]
return None
- Time: O(log k) lookup, where k is the number of children
- Space: compact, no waste
- Best for: static dictionaries that are rarely modified after construction
Quantitative comparison (storing 100,000 English words, average length 8):
| Implementation | Memory Usage | Insert Time | Search Time |
|---|---|---|---|
| Fixed array (26) | ~100 MB | Fastest | Fastest |
| Hash table | ~40 MB | Medium | Medium |
| Sorted array | ~25 MB | Slowest (element shifting) | Slightly slower (binary search) |
| Compressed Trie | ~15 MB | Medium | Fast |
13.6.2 Double-Array Trie
In scenarios requiring extreme performance such as Chinese word segmentation and Japanese input methods, the Double-Array Trie (Aoe Jun-ichi, 1989) is the industrial-grade choice. It uses two integer arrays base[] and check[] to simulate the entire Trie:
class DoubleArrayTrie:
"""
Core idea of Double-Array Trie:
- base[s] + c = t: transition from state s via character c to state t
- check[t] = s: verify that state t's parent is indeed s
"""
def __init__(self, size=1000000):
self.base = [0] * size
self.check = [-1] * size # -1 means unused
self.base[0] = 1 # root node
def transition(self, state, char_code):
"""Transition from state via char_code"""
t = self.base[state] + char_code
if t < len(self.check) and self.check[t] == state:
return t
return -1 # transition failed
def search(self, word):
state = 0
for ch in word:
code = ord(ch) - ord('a') + 1 # 1-based
state = self.transition(state, code)
if state == -1:
return False
# Check for terminal state (using special marker)
end_state = self.transition(state, 0) # 0 as terminator
return end_state != -1
Advantages of Double-Array Trie:
- Extremely compact memory: only two integer arrays
- Cache-friendly: contiguous memory access
- Search speed: only array index arithmetic, no pointer chasing
Drawbacks include complex construction and difficult dynamic insertion—best suited for static dictionaries (such as word segmentation dictionaries).
13.7 Trie in IP Routing Tables
One of the core tasks of a network router is Longest Prefix Match (LPM): given a destination IP address, find the longest matching prefix in the routing table. This is a natural application for Tries.
class IPRoutingTrie:
"""
Trie implementation for IP routing tables.
Each bit serves as a character (0 or 1), building a binary Trie.
"""
class Node:
def __init__(self):
self.children = [None, None] # 0 and 1
self.next_hop = None # routing next hop
def __init__(self):
self.root = self.Node()
def insert_route(self, prefix: str, prefix_len: int, next_hop: str):
"""
Insert a routing entry.
prefix: binary representation of IP address
prefix_len: prefix length (the /24 in CIDR notation)
next_hop: next hop address
"""
node = self.root
for i in range(prefix_len):
bit = int(prefix[i])
if node.children[bit] is None:
node.children[bit] = self.Node()
node = node.children[bit]
node.next_hop = next_hop
def longest_prefix_match(self, ip_binary: str) -> str:
"""
Longest prefix match.
Returns the next hop of the most specific matching route.
"""
node = self.root
last_match = None
for bit_char in ip_binary:
bit = int(bit_char)
if node.children[bit] is None:
break
node = node.children[bit]
if node.next_hop is not None:
last_match = node.next_hop
return last_match
@staticmethod
def ip_to_binary(ip: str) -> str:
"""Convert IP address to 32-bit binary string"""
parts = ip.split('.')
binary = ''
for part in parts:
binary += format(int(part), '08b')
return binary
# Usage example
router = IPRoutingTrie()
# Insert routing table entries
# 192.168.0.0/16 -> Gateway A
router.insert_route(
IPRoutingTrie.ip_to_binary("192.168.0.0"), 16, "Gateway A"
)
# 192.168.1.0/24 -> Gateway B
router.insert_route(
IPRoutingTrie.ip_to_binary("192.168.1.0"), 24, "Gateway B"
)
# 0.0.0.0/0 -> Default Gateway
router.insert_route(
IPRoutingTrie.ip_to_binary("0.0.0.0"), 0, "Default"
)
# Query
ip = IPRoutingTrie.ip_to_binary("192.168.1.100")
print(router.longest_prefix_match(ip)) # "Gateway B" (matches /24, longer)
ip2 = IPRoutingTrie.ip_to_binary("192.168.2.50")
print(router.longest_prefix_match(ip2)) # "Gateway A" (only matches /16)
Why do routers use Tries instead of hash tables? Because route lookup is fundamentally a prefix matching problem. A packet's destination IP may match multiple routing rules (e.g., /8, /16, /24), and the router needs to find the longest one. Hash tables can only perform exact matches and cannot efficiently handle prefix relationships.
Optimizations in modern routers:
- Multi-bit Trie: examine 4 or 8 bits at a time, reducing tree depth (but increasing node size)
- Level-compressed Trie: compress consecutive single-branch paths
- Hardware TCAM: specialized hardware for parallel matching of all rules, O(1) but expensive
13.8 AC Automaton: The Ultimate Weapon for Multi-Pattern Matching
13.8.1 Problem Definition
Single-pattern matching: find one pattern P in text T (solved by KMP). Multi-pattern matching: simultaneously find multiple patterns P1, P2, ..., Pk in text T.
Naive approach: run KMP separately for each pattern, total complexity O(n·k + m). When k is large (e.g., a sensitive word dictionary with tens of thousands of entries), this is unacceptable.
The AC automaton's core insight is: build all patterns into a Trie, then add failure links on the Trie, so that scanning the text never requires backtracking—this is identical in spirit to KMP, except generalized from one dimension (single pattern) to a tree structure (multiple patterns).
13.8.2 The Three Components of an AC Automaton
- Goto function: the Trie transitions—from current state via a character to the next state
- Failure function: when Goto fails, jump to the state corresponding to the longest proper suffix (analogous to KMP's next/failure array)
- Output function: records all matching patterns at each state
13.8.3 Complete Implementation
from collections import deque
class AhoCorasick:
"""AC Automaton - Multi-pattern string matching"""
class State:
def __init__(self):
self.goto = {} # transition function: char -> state_id
self.failure = 0 # failure link
self.output = [] # output function: patterns matched at this state
def __init__(self):
self.states = [self.State()] # initially only root (id=0)
def _new_state(self):
self.states.append(self.State())
return len(self.states) - 1
def build(self, patterns: list) -> None:
"""
Three-step construction of AC automaton:
1. Build Trie
2. BFS to establish failure pointers
3. Merge outputs
"""
# Step 1: Build Goto (Trie)
for pattern in patterns:
state = 0
for char in pattern:
if char not in self.states[state].goto:
self.states[state].goto[char] = self._new_state()
state = self.states[state].goto[char]
self.states[state].output.append(pattern)
# Step 2 & 3: BFS to build Failure pointers + merge Output
queue = deque()
# Depth-1 nodes: failure points to root
for char, s in self.states[0].goto.items():
self.states[s].failure = 0
queue.append(s)
while queue:
u = queue.popleft()
for char, v in self.states[u].goto.items():
queue.append(v)
# Compute failure pointer for v
f = self.states[u].failure
while f != 0 and char not in self.states[f].goto:
f = self.states[f].failure
self.states[v].failure = (
self.states[f].goto[char] if char in self.states[f].goto else 0
)
# Avoid failure pointing to self
if self.states[v].failure == v:
self.states[v].failure = 0
# Merge output (all patterns on the suffix link chain)
self.states[v].output = (
self.states[v].output +
self.states[self.states[v].failure].output
)
def search(self, text: str) -> list:
"""
Scan text, return all matches [(start_pos, pattern), ...]
Time complexity: O(n + z), n = len(text), z = total matches
"""
results = []
state = 0
for i, char in enumerate(text):
while state != 0 and char not in self.states[state].goto:
state = self.states[state].failure
state = self.states[state].goto.get(char, 0)
for pattern in self.states[state].output:
results.append((i - len(pattern) + 1, pattern))
return results
# Usage example
ac = AhoCorasick()
patterns = ["he", "she", "his", "hers"]
ac.build(patterns)
text = "ahishers"
matches = ac.search(text)
for pos, pattern in matches:
print(f"Position {pos}: '{pattern}'")
# Output:
# Position 1: 'his'
# Position 3: 'she'
# Position 3: 'he' (she contains he)
# Position 4: 'hers'
13.8.4 Visualizing AC Automaton Execution
Using the pattern set {"he", "she", "his", "hers"} as an example, the automaton's state graph (simplified):
State transitions (Goto):
0 --h--> 1 --e--> 2 [output: "he"]
\--r--> 8 --s--> 9 [output: "hers"]
\--i--> 6 --s--> 7 [output: "his"]
0 --s--> 3 --h--> 4 --e--> 5 [output: "she", "he"]
Failure pointers:
State 5 (she) --failure--> State 2 (he)
Because the longest proper suffix of "she" that is also a Trie prefix is "he"
When scanning text "ahishers":
- 'a': root has no 'a' transition, stay at root (state 0)
- 'h': transition to state 1
- 'i': transition to state 6
- 's': transition to state 7, output "his"
- 'h': state 7 has no 'h' transition, follow failure chain back to state 0, then transition to state 1... then to state 4 (since "sh" path exists)
- 'e': transition to state 5, output "she" and "he"
- 'r': from state 5's failure state 2, transition to state 8
- 's': transition to state 9, output "hers"
13.9 Sensitive Word Filtering in Practice
class SensitiveWordFilter:
"""
Sensitive word filter based on AC automaton.
Supports:
- Dynamic add/remove of sensitive words
- Full-width/half-width character normalization
- Case-insensitive matching
- Replacing matches with ***
"""
def __init__(self):
self.ac = None
self.patterns = set()
self._dirty = True # flag indicating rebuild needed
def add_word(self, word: str) -> None:
"""Add a sensitive word"""
normalized = self._normalize(word)
if normalized:
self.patterns.add(normalized)
self._dirty = True
def remove_word(self, word: str) -> None:
"""Remove a sensitive word"""
normalized = self._normalize(word)
self.patterns.discard(normalized)
self._dirty = True
def _normalize(self, text: str) -> str:
"""Normalize text: full-width to half-width, to lowercase"""
result = []
for ch in text:
code = ord(ch)
# Full-width to half-width
if 0xFF01 <= code <= 0xFF5E:
ch = chr(code - 0xFEE0)
elif code == 0x3000:
ch = ' '
result.append(ch.lower())
return ''.join(result)
def _ensure_built(self):
"""Ensure automaton is built"""
if self._dirty:
self.ac = AhoCorasick()
self.ac.build(list(self.patterns))
self._dirty = False
def contains(self, text: str) -> bool:
"""Check whether text contains any sensitive word"""
self._ensure_built()
normalized = self._normalize(text)
return len(self.ac.search(normalized)) > 0
def filter(self, text: str, replacement: str = "***") -> str:
"""Replace sensitive words with specified characters"""
self._ensure_built()
normalized = self._normalize(text)
matches = self.ac.search(normalized)
if not matches:
return text
# Mark positions that need replacement
mask = [False] * len(text)
for start, pattern in matches:
for i in range(start, start + len(pattern)):
mask[i] = True
# Build result
result = []
i = 0
while i < len(text):
if mask[i]:
result.append(replacement)
while i < len(text) and mask[i]:
i += 1
else:
result.append(text[i])
i += 1
return ''.join(result)
def find_all(self, text: str) -> list:
"""Find all sensitive words and their positions"""
self._ensure_built()
normalized = self._normalize(text)
return self.ac.search(normalized)
# Usage example
word_filter = SensitiveWordFilter()
word_filter.add_word("gambling")
word_filter.add_word("drugs")
word_filter.add_word("violence")
text = "This website contains gambling and drugs content"
print(word_filter.contains(text)) # True
print(word_filter.filter(text)) # "This website contains *** and *** content"
print(word_filter.find_all(text)) # [(26, 'gambling'), (38, 'drugs')]
Engineering optimization recommendations:
- Cache-friendly state storage: store all states in a contiguous array rather than a linked list of objects
- Incremental building: for frequently changing word dictionaries, maintain multiple small AC automatons for batch querying
- Skip irrelevant characters: sensitive words may have special characters inserted between them (e.g., "gamblin*g"), preprocess to remove interfering characters
- Multi-level filtering: first use a Bloom filter to quickly exclude texts without sensitive words, then use the AC automaton for precise matching
Level 3 · What the Specification Defines
13.10 Edward Fredkin and the Origin of Trie (1960)
The origin of the name "Trie" is itself an interesting naming dispute in computer science history.
In 1959, Rene de la Briandais presented a paper at the Western Joint Computer Conference titled "File Searching Using Variable Length Keys," describing a tree structure that organizes a dictionary through character-by-character comparison of strings. However, he did not give this structure a dedicated name.
In 1960, Edward Fredkin formally proposed the term "Trie" in his paper "Trie Memory" (Communications of the ACM, Vol. 3, No. 9). Fredkin explained that the word was taken from the middle of "retrieval," implying its core purpose of information retrieval.
The naming controversy: Fredkin himself pronounced "Trie" as "tree" (homophonous with the tree data structure), arguing it emphasizes that it is fundamentally a tree structure. However, most of academia and industry later adopted the pronunciation "try" to avoid confusion with the generic "tree." Donald Knuth discussed this in detail in "The Art of Computer Programming" Vol. 3 (Sorting and Searching, 1973) and favored the "try" pronunciation.
Fredkin's original motivation: while working at Bolt, Beranek and Newman (BBN, which later participated in building ARPANET), he needed to efficiently store and retrieve large numbers of strings. Hash tables, while fast for individual lookups, could not support prefix range queries—a fundamental requirement in information retrieval systems. The Trie structure directly solved this problem.
Fredkin's other contribution: he simultaneously proposed the idea that Tries could be generalized to any radix. When each character has r possible values, each node has at most r children. For binary strings, r=2, which became the binary Trie widely used in IP routing.
13.11 The Birth of the Aho-Corasick Automaton (1975)
Alfred V. Aho and Margaret J. Corasick published their landmark paper "Efficient String Matching: An Aid to Bibliographic Search" in Communications of the ACM (Vol. 18, No. 6) in 1975.
Research context: Bell Labs was developing text processing tools at the time. The Unix fgrep command needed to simultaneously search for multiple fixed strings in a file. If each pattern was searched independently, performance became unacceptable when the number of patterns reached hundreds. Aho and Corasick needed an algorithm that could "scan the text once, matching all patterns simultaneously."
Core innovations:
- Build all patterns into a Trie (which they called the "goto function")
- Add a "failure function" on the Trie—when the current character cannot continue matching in the Trie, jump to the state corresponding to the longest correct suffix. This is the same concept as KMP, but generalized to the multi-pattern case
- Add an "output function"—through failure chain propagation, ensure that passing through a state reports all patterns ending at that position
Complexity proof (key theorem from the paper):
Let n be the text length, m be the total length of all patterns, and z be the number of matches:
- Preprocessing (building the automaton): O(m * sigma), where sigma is the alphabet size. With hash table transitions: O(m)
- Search: O(n + z)
- Space: O(m * sigma) (array implementation) or O(m) (hash table implementation)
Why is search O(n + z)? The key observation is: each character in the text is processed at most once (the total number of failure chain retreats throughout the entire search is O(n), identical to KMP's analysis).
13.12 The Relationship Between AC Automaton and KMP
The KMP algorithm (Knuth-Morris-Pratt, 1977) and the AC automaton share a deep internal connection. Formally:
KMP is a special case of the AC automaton for a single pattern.
| Concept | KMP | AC Automaton |
|---|---|---|
| Structural basis | Single pattern (linear) | Multi-pattern Trie (tree-shaped) |
| Failure handling | next[j]: longest proper prefix-suffix of first j chars | failure[s]: longest proper suffix of state s's string (in the Trie) |
| Preprocessing | O(m) | O(m * sigma) or O(m) |
| Search | O(n) | O(n + z) |
| Publication | Knuth, Morris, Pratt 1977 (actually discovered in 1970) | Aho, Corasick 1975 |
An interesting historical detail: the AC paper (1975) was published earlier than the KMP paper (1977). In reality, the KMP algorithm was discovered in 1970, but the paper underwent lengthy revisions before formal publication. Aho explicitly drew on KMP's failure-jump concept when designing the AC automaton (he was Knuth's colleague and knew of KMP's unpublished work).
The generalization from KMP to AC:
# KMP's next array: for a single pattern
def build_kmp_next(pattern):
"""
next[i] = length of longest proper prefix-suffix of pattern[0:i]
Equivalent to: if mismatch at position i, jump to next[i] to continue
"""
n = len(pattern)
next_arr = [0] * n
j = 0
for i in range(1, n):
while j > 0 and pattern[i] != pattern[j]:
j = next_arr[j - 1]
if pattern[i] == pattern[j]:
j += 1
next_arr[i] = j
return next_arr
# AC's failure pointer: generalized to multi-pattern (tree structure)
# Comparison: KMP's j = next[j-1] corresponds to AC's failure = states[failure].failure
# KMP is backtracking on a chain; AC is backtracking on a tree (but the logic is identical)
Essential unity: Both KMP and AC automaton are special cases of Deterministic Finite Automata (DFA). KMP builds a linear DFA (each state corresponds to a prefix of the pattern), while the AC automaton builds a tree-shaped DFA (each state corresponds to a prefix in the Trie). Both use the failure function to implement DFA state compression—avoiding explicit storage of all possible transitions.
13.13 Introduction to Suffix Trees (Weiner, 1973)
The Suffix Tree is another extremely important member of the Trie family. Peter Weiner presented his paper "Linear Pattern Matching Algorithms" at FOCS (IEEE Symposium on Foundations of Computer Science) in 1973, proposing the first linear-time suffix tree construction algorithm. Knuth called it "the Algorithm of the Year 1973."
What is a suffix tree? For the string S = "banana$" ($ is a terminator), all its suffixes are:
banana$
anana$
nana$
ana$
na$
a$
$
Inserting all these suffixes into a compressed Trie (Patricia Tree) yields the suffix tree.
The power of suffix trees:
| Problem | Complexity |
|---|---|
| Determine if P is a substring of S | O( |
| Count occurrences of P in S | O( |
| Find longest repeated substring of S | O( |
| Find longest common substring of two strings | O( |
| Find k-th smallest suffix (suffix array) | O( |
Evolution of construction algorithms:
- Weiner (1973): first O(n) algorithm, builds right to left, high space overhead
- McCreight (1976): builds left to right, space optimized
- Ukkonen (1995): online algorithm, incremental character-by-character construction, simplest implementation
Relationship to Trie: a suffix tree is essentially a compressed Trie built from all suffixes of a single string. It extends Trie applications from "a collection of short strings" to "substring problems of a single long string."
Why suffix trees have been largely replaced by suffix arrays in practice: The Suffix Array proposed by Manber and Myers (1993) requires only an integer array to store the sorted order of suffixes. Combined with an LCP array, it can solve most problems that suffix trees handle, with far less space overhead (suffix trees require multiple pointers per node). In fields like genomics that process extremely long strings, suffix arrays are the de facto standard.
13.14 Theoretical Complexity Analysis of Tries
Average-case analysis (assuming strings are generated uniformly at random over alphabet Sigma):
Philippe Flajolet and Robert Sedgewick performed precise average-case analysis of Tries in "Analytic Combinatorics" (2009):
- A Trie built from n random strings has approximately n / ln(sigma) nodes on average, where sigma = |Sigma|
- Average height is approximately (2 ln n) / ln sigma
- The average number of comparisons to search for a random string is approximately log_sigma(n)
Worst case: when all strings share a very long common prefix (e.g., "aaa...a1", "aaa...a2"), the Trie degenerates into a nearly linear chain, with height approaching the length of the longest string L_max. This is precisely the raison d'etre of compressed Tries.
Level 4 · Edge Cases and Pitfalls
13.15 LeetCode #208: Implement Trie (Prefix Tree)
Problem: Implement a Trie class supporting insert, search, and startsWith operations.
This is the most fundamental Trie interview problem, directly testing data structure implementation skills:
class Trie:
def __init__(self):
self.children = {}
self.is_end = False
def insert(self, word: str) -> None:
node = self
for ch in word:
if ch not in node.children:
node.children[ch] = Trie()
node = node.children[ch]
node.is_end = True
def search(self, word: str) -> bool:
node = self._search_prefix(word)
return node is not None and node.is_end
def startsWith(self, prefix: str) -> bool:
return self._search_prefix(prefix) is not None
def _search_prefix(self, prefix: str):
node = self
for ch in prefix:
if ch not in node.children:
return None
node = node.children[ch]
return node
Interview key points:
- Explain why you use dict vs fixed array (generality vs performance tradeoff)
- Explain the necessity of
is_end(distinguishing prefixes from complete words) - Discuss thread safety: multi-threaded environments need locks or ConcurrentHashMap
- Clear time/space complexity analysis
13.16 LeetCode #211: Design Add and Search Words Data Structure
Problem: Design a WordDictionary that supports addWord and search, where '.' in search can match any single character.
class WordDictionary:
def __init__(self):
self.children = {}
self.is_end = False
def addWord(self, word: str) -> None:
node = self
for ch in word:
if ch not in node.children:
node.children[ch] = WordDictionary()
node = node.children[ch]
node.is_end = True
def search(self, word: str) -> bool:
return self._search(word, 0)
def _search(self, word: str, index: int) -> bool:
if index == len(word):
return self.is_end
ch = word[index]
if ch == '.':
# Wildcard: try all children
for child in self.children.values():
if child._search(word, index + 1):
return True
return False
else:
if ch not in self.children:
return False
return self.children[ch]._search(word, index + 1)
Complexity analysis:
- addWord: O(L)
- search (no wildcards): O(L)
- search (all wildcards "..."): O(26^L), exponential worst case
Interview follow-ups:
- "How to optimize if there are many wildcards?"—Group words by length, filter mismatched lengths first
- "What if we need to support '*' matching any number of characters?"—Requires backtracking or dynamic programming
- "Can we use a regex engine instead?"—For a fixed dictionary, Trie with preprocessing is faster for queries
13.17 LeetCode #212: Word Search II
Problem: Given an m x n board of characters and a list of words, find all words that exist both in the board and the list. Words on the board must be constructed from adjacent cells (up/down/left/right), and each cell may not be reused.
This is the classic combination of Trie + DFS backtracking:
class Solution:
def findWords(self, board: list, words: list) -> list:
# Step 1: Build all words into a Trie
root = {}
for word in words:
node = root
for ch in word:
node = node.setdefault(ch, {})
node['#'] = word # '#' marks word end, stores complete word
m, n = len(board), len(board[0])
result = []
def dfs(i, j, parent):
ch = board[i][j]
node = parent.get(ch)
if node is None:
return
# Check if we've matched a complete word
if '#' in node:
result.append(node['#'])
del node['#'] # Avoid duplicate additions
# Mark as visited
board[i][j] = '@'
# DFS in four directions
for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
ni, nj = i + di, j + dj
if 0 <= ni < m and 0 <= nj < n and board[ni][nj] != '@':
dfs(ni, nj, node)
# Restore
board[i][j] = ch
# Optimization: prune node if it has no more children
if not node:
del parent[ch]
# Start search from every position
for i in range(m):
for j in range(n):
dfs(i, j, root)
return result
Key optimizations:
- Use Trie instead of per-word search: naive approach does one DFS per word, complexity O(kmn*4^L). With Trie, all words sharing a prefix only need one search
- Pruning: remove a word from Trie after finding it, reducing subsequent search space
- Node cleanup: when all descendants of a Trie node have been matched, delete the node so future searches won't enter this path
Complexity:
- Time: O(mn4*3^(L-1)), where L is the longest word length. First step has 4 directions, subsequent steps only 3 (cannot go back)
- Space: O(k*L) (Trie) + O(L) (recursion stack)
13.18 Trie vs Hash Table: How to Choose
This is a common design discussion question in interviews. The answer depends on specific requirements:
| Dimension | Trie | Hash Table |
|---|---|---|
| Exact lookup | O(L) | O(L) (hash computation is also O(L)) |
| Prefix query | O(P + K), native support | Not supported (must iterate all keys) |
| Sorted traversal | Naturally ordered (DFS = lexicographic) | Unordered |
| Longest prefix match | O(L), native support | Requires multiple lookups |
| Memory usage | Usually larger (one node per character) | Usually smaller |
| Cache performance | Poor (pointer chasing, random access) | Good (contiguous memory) |
| Dynamic insert/delete | O(L) | Amortized O(L) |
| Hash collisions | Don't exist | Exist, worst case O(n) |
| Hotspot issues | Don't exist | Hot key rehashing |
Choose Trie when:
- Prefix matching needed (autocomplete, IP routing)
- Lexicographic traversal needed
- Longest common prefix needed
- Strings share many common prefixes (URLs, file paths)
Choose Hash Table when:
- Only exact lookup needed
- Memory is limited
- Strings share no common prefixes (UUIDs)
- High cache hit rate needed
13.19 Classic Interview Variants
Variant 1: MapSum — Key-Value Mapping
class MapSum:
"""
LeetCode #677: Implement insert(key, val) and sum(prefix).
sum returns the sum of vals for all keys starting with prefix.
"""
def __init__(self):
self.root = {}
self.map = {} # track inserted key -> val
def insert(self, key: str, val: int) -> None:
delta = val - self.map.get(key, 0)
self.map[key] = val
node = self.root
for ch in key:
node = node.setdefault(ch, {'_sum': 0})
node['_sum'] += delta
def sum(self, prefix: str) -> int:
node = self.root
for ch in prefix:
if ch not in node:
return 0
node = node[ch]
return node.get('_sum', 0)
Variant 2: Palindrome Pairs
class Solution:
"""
LeetCode #336: Given a list of unique words, find all pairs (i, j)
such that words[i] + words[j] is a palindrome.
Approach: Insert each word reversed into a Trie, check if remaining
portion is a palindrome during search.
"""
def palindromePairs(self, words: list) -> list:
# Build reversed Trie
root = {}
for idx, word in enumerate(words):
node = root
reversed_word = word[::-1]
for i, ch in enumerate(reversed_word):
# If reversed_word[i:] is a palindrome, record this index
if self._is_palindrome(reversed_word, i, len(reversed_word) - 1):
node.setdefault('_palindrome_indices', []).append(idx)
node = node.setdefault(ch, {})
node['_idx'] = idx
node.setdefault('_palindrome_indices', []).append(idx)
result = []
for idx, word in enumerate(words):
node = root
for i, ch in enumerate(word):
# Case 1: word is longer than some reversed word in Trie
if '_idx' in node and node['_idx'] != idx:
if self._is_palindrome(word, i, len(word) - 1):
result.append([idx, node['_idx']])
if ch not in node:
break
node = node[ch]
else:
# Case 2: word fully matches or is shorter
if '_idx' in node and node['_idx'] != idx:
result.append([idx, node['_idx']])
# Case 3: longer reversed words in Trie with palindrome remainder
for j in node.get('_palindrome_indices', []):
if j != idx and j != node.get('_idx', -1):
result.append([idx, j])
return result
def _is_palindrome(self, s, left, right):
while left < right:
if s[left] != s[right]:
return False
left += 1
right -= 1
return True
13.20 Engineering Considerations in Practice
1. Memory Management
When storing large numbers of strings, Trie's memory overhead can be 3-10x that of hash tables. In Python for example, each TrieNode object costs approximately 100-200 bytes (object header + dict overhead), while a string key in a hash table only needs the string's own memory.
Optimization strategies:
- Use
__slots__to reduce object overhead - Use arrays instead of dicts for child storage (when charset is known and small)
- Consider C extensions (e.g.,
marisa-trie,datrielibraries)
class TrieNodeSlots:
"""Using __slots__ to reduce memory overhead"""
__slots__ = ['children', 'is_end']
def __init__(self):
self.children = {}
self.is_end = False
2. Concurrency Safety
In production, Tries are often accessed by multiple threads (e.g., autocomplete services for search engines). Options:
- Read-write locks: shared lock for reads, exclusive lock for writes
- Copy-on-Write: create new version on write, atomically swap root pointer
- Immutable Trie: no modification after construction, no locking needed
3. Persistence
Large Tries need disk persistence. Common approaches:
- Serialize to compact binary format (e.g.,
marisa-trie's LOUDS encoding) - Use memory-mapped files (mmap) to avoid loading the entire tree into memory
- FST (Finite State Transducer): used by Lucene/Elasticsearch, more compact than Trie
4. Performance Benchmarking
import time
import random
import string
def benchmark_trie_vs_hashset(n_words=100000, word_len=10):
"""Trie vs HashSet performance comparison"""
# Generate random words
words = [
''.join(random.choices(string.ascii_lowercase, k=word_len))
for _ in range(n_words)
]
# Trie construction
trie = Trie()
start = time.time()
for w in words:
trie.insert(w)
trie_build = time.time() - start
# HashSet construction
start = time.time()
hash_set = set(words)
hash_build = time.time() - start
# Trie search
start = time.time()
for w in words[:10000]:
trie.search(w)
trie_search = time.time() - start
# HashSet search
start = time.time()
for w in words[:10000]:
w in hash_set
hash_search = time.time() - start
# Trie prefix query (HashSet cannot do this efficiently)
start = time.time()
for w in words[:10000]:
trie.starts_with(w[:3])
trie_prefix = time.time() - start
print(f"Building {n_words} words:")
print(f" Trie: {trie_build:.3f}s")
print(f" HashSet: {hash_build:.3f}s")
print(f"Searching 10000 times:")
print(f" Trie: {trie_search:.4f}s")
print(f" HashSet: {hash_search:.4f}s")
print(f"Prefix query 10000 times:")
print(f" Trie: {trie_prefix:.4f}s")
print(f" HashSet: N/A (no efficient prefix query support)")
13.21 Summary and Decision Guide
Need prefix matching? ─── Yes ──→ Trie / Patricia Tree
│
No
│
Need multi-pattern matching? ── Yes ──→ AC Automaton
│
No
│
Need substring matching? ── Yes ──→ Suffix Tree / Suffix Array
│
No
│
Only need exact lookup? ── Yes ──→ Hash Table
Key formulas from this chapter:
- Trie lookup: O(L), L = string length
- Trie space (worst case): O(N * L * sigma), sigma = alphabet size
- AC automaton preprocessing: O(m), m = total length of all patterns
- AC automaton search: O(n + z), n = text length, z = number of matches
- Suffix tree construction: O(n) (Ukkonen's algorithm)
- Patricia Tree node count: O(N), N = number of strings
Remember: choosing a data structure is not about "which is better" but about "which better fits your specific requirements." Tries are irreplaceable for prefix-related problems, but if you only need to check whether a string exists in a set, the simplicity and efficiency of hash tables is the wiser choice.