Flexible Matching and Regex Engines
Chapter 33: Flexible Matching and Regex Engines — When "Exact Match" Isn't Enough
The previous chapters (KMP, Rabin-Karp, suffix arrays) all solve exact matching problems: given a pattern P, find all positions where P appears in text T. But real-world search is far more complex:
- A user types "algoritm" (misspelled) and you need to find "algorithm"
- Git needs to compare two file versions and find the shortest edit sequence
- You write
grep "log.*Error"to search logs — that's not a fixed string
This chapter discusses three core problems: fuzzy search (matching with allowed spelling errors), diff comparison (the Myers diff algorithm), and regex engines (how to turn a*b|c into a running program).
Their common thread: matching is no longer a binary "equal or not" judgment, but a search space that needs exploration.
Level 1 · What You Need to Know
1.1 Edit Distance Recap
In Chapter 28 we discussed edit distance (Levenshtein distance) in detail: the minimum number of operations (insert, delete, replace) to transform string A into string B.
Quick recap of the core formula:
dp[i][j] = min(
dp[i-1][j] + 1, # delete A[i]
dp[i][j-1] + 1, # insert B[j]
dp[i-1][j-1] + cost # replace (cost=0 if A[i]==B[j], else 1)
)
Time complexity O(nm), space optimizable to O(min(n,m)).
1.2 Implementing Fuzzy Search
Problem: In a dictionary, find all words within edit distance k of a query word.
Naive approach: Compute edit distance to every dictionary word. With N words of length m and query length n, total time is O(N * n * m). For a 100K word dictionary, this is too slow.
Method 1: BK Tree (Burkhard-Keller Tree, 1973)
BK trees exploit the triangle inequality of edit distance for pruning: if d(q, root) = d, then words within distance k of q must have distance to root in [d-k, d+k].
class BKTree:
"""
BK Tree: accelerates fuzzy search using triangle inequality
Triangle inequality: d(a, c) >= |d(a, b) - d(b, c)|
If d(query, root) = d, then answer word w satisfies:
d(root, w) is in range [d-k, d+k]
Burkhard & Keller, "Some Approaches to Best-Match File Searching", 1973
"""
def __init__(self):
self.root = None
class Node:
def __init__(self, word: str):
self.word = word
self.children = {} # distance -> child node
def add(self, word: str):
if self.root is None:
self.root = self.Node(word)
return
node = self.root
while True:
d = self._edit_distance(word, node.word)
if d == 0:
return # duplicate word
if d in node.children:
node = node.children[d]
else:
node.children[d] = self.Node(word)
return
def search(self, query: str, max_dist: int) -> list[tuple[str, int]]:
"""
Find all words with edit distance <= max_dist to query
Average time complexity much better than O(N)
"""
if self.root is None:
return []
results = []
stack = [self.root]
while stack:
node = stack.pop()
d = self._edit_distance(query, node.word)
if d <= max_dist:
results.append((node.word, d))
# Prune using triangle inequality
for dist, child in node.children.items():
if d - max_dist <= dist <= d + max_dist:
stack.append(child)
return results
@staticmethod
def _edit_distance(a: str, b: str) -> int:
"""Standard edit distance O(nm)"""
n, m = len(a), len(b)
dp = list(range(m + 1))
for i in range(1, n + 1):
prev = dp[0]
dp[0] = i
for j in range(1, m + 1):
temp = dp[j]
if a[i-1] == b[j-1]:
dp[j] = prev
else:
dp[j] = 1 + min(prev, dp[j], dp[j-1])
prev = temp
return dp[m]
# Demo
tree = BKTree()
dictionary = ["algorithm", "altruistic", "algebra", "logarithm",
"align", "alien", "all", "allocate"]
for word in dictionary:
tree.add(word)
results = tree.search("algoritm", max_dist=2)
print(f"Words within distance 2 of 'algoritm':")
for word, dist in sorted(results, key=lambda x: x[1]):
print(f" {word} (distance={dist})")
Method 2: Trie + Edit Distance Automaton
For larger dictionaries, insert all words into a Trie, then synchronize edit distance DP rows with Trie traversal.
class TrieNode:
def __init__(self):
self.children = {}
self.is_word = False
self.word = ""
class FuzzySearchTrie:
"""
Trie + Edit Distance: single DFS finds all fuzzy matches
Idea: DFS the Trie while maintaining the current DP row.
When the minimum value in the DP row exceeds max_dist, prune.
"""
def __init__(self):
self.root = TrieNode()
def insert(self, word: str):
node = self.root
for c in word:
if c not in node.children:
node.children[c] = TrieNode()
node = node.children[c]
node.is_word = True
node.word = word
def search(self, query: str, max_dist: int) -> list[tuple[str, int]]:
"""
Find all words with edit distance <= max_dist to query
Core: DFS the Trie while maintaining DP row
- Current Trie path = "prefix in dictionary"
- DP row column j = edit_distance(query[:j], current_prefix)
"""
results = []
n = len(query)
# Initial DP row: edit_distance("", query[:j]) = j
initial_row = list(range(n + 1))
# DFS
for c, child in self.root.children.items():
self._dfs(child, c, query, initial_row, max_dist, results)
return results
def _dfs(self, node: TrieNode, char: str, query: str,
prev_row: list[int], max_dist: int, results: list):
n = len(query)
# Compute current row
curr_row = [prev_row[0] + 1] # delete current character
for j in range(1, n + 1):
insert_cost = curr_row[j - 1] + 1
delete_cost = prev_row[j] + 1
replace_cost = prev_row[j - 1] + (0 if query[j-1] == char else 1)
curr_row.append(min(insert_cost, delete_cost, replace_cost))
# If word boundary and distance <= max_dist
if node.is_word and curr_row[n] <= max_dist:
results.append((node.word, curr_row[n]))
# Prune: if minimum in current row > max_dist, no need to continue
if min(curr_row) <= max_dist:
for c, child in node.children.items():
self._dfs(child, c, query, curr_row, max_dist, results)
# Demo
trie = FuzzySearchTrie()
for word in ["algorithm", "altruistic", "algebra", "logarithm",
"align", "alien", "all", "allocate"]:
trie.insert(word)
results = trie.search("algoritm", max_dist=2)
print(f"\nTrie fuzzy search 'algoritm' (distance<=2):")
for word, dist in sorted(results, key=lambda x: x[1]):
print(f" {word} (distance={dist})")
1.3 The Myers Diff Algorithm — The Core of Git Diff
When you run git diff, Git needs to find the shortest edit script (SES) between two file versions. This is essentially an edit distance problem, but the unit is "lines" rather than "characters."
Eugene Myers gave an elegant algorithm in his 1986 paper "An O(ND) Difference Algorithm and Its Variations," with time complexity O(ND), where N = n + m (total length of both sequences) and D = edit distance.
Key insight: Most file differences are small (D << N), so O(ND) is typically much better than O(NM).
def myers_diff(a: list[str], b: list[str]) -> list[tuple[str, str]]:
"""
Myers diff algorithm: find shortest edit script between two sequences
Input: a = original sequence, b = target sequence
Output: [(operation, content), ...] where operation is ' '(keep), '-'(delete), '+'(insert)
Time: O((n+m) * D), D = edit distance
Space: O((n+m) * D) (optimizable to O(D^2) with linear-space divide and conquer)
Myers, "An O(ND) Difference Algorithm", Algorithmica, 1986
"""
n, m = len(a), len(b)
# In the edit graph: x-axis is a, y-axis is b
# Diagonal k = x - y
# Goal: from (0,0) to (n,m)
# Moves:
# right = delete a[x] (x++)
# down = insert b[y] (y++)
# diagonal = a[x] == b[y], match (x++, y++, free)
max_d = n + m # Maximum possible edit distance
# v[k] = furthest x-coordinate reachable on diagonal k
offset = max_d
v_size = 2 * max_d + 1
# Save each round's v for backtracking
history = []
v = [-1] * v_size
v[offset + 1] = 0 # Initialize: start from diagonal 1 (virtual start)
for d in range(max_d + 1):
history.append(v[:])
for k in range(-d, d + 1, 2):
# Decide whether to reach diagonal k from above (delete) or left (insert)
if k == -d or (k != d and v[offset + k - 1] < v[offset + k + 1]):
x = v[offset + k + 1] # Move down from diagonal k+1 (insert)
else:
x = v[offset + k - 1] + 1 # Move right from diagonal k-1 (delete)
y = x - k
# Walk along diagonal (match)
while x < n and y < m and a[x] == b[y]:
x += 1
y += 1
v[offset + k] = x
# Reached the end?
if x >= n and y >= m:
return _backtrack(a, b, history, d, offset)
return [] # Should never reach here
def _backtrack(a: list[str], b: list[str], history: list,
d: int, offset: int) -> list[tuple[str, str]]:
"""Backtrack from Myers algorithm history to extract specific edit operations"""
n, m = len(a), len(b)
x, y = n, m
edits = []
for step in range(d, 0, -1):
v = history[step]
v_prev = history[step - 1]
k = x - y
if k == -step or (k != step and v_prev[offset + k - 1] < v_prev[offset + k + 1]):
prev_k = k + 1 # Came from above (insert)
prev_x = v_prev[offset + prev_k]
else:
prev_k = k - 1 # Came from left (delete)
prev_x = v_prev[offset + prev_k] + 1
prev_y = prev_x - prev_k
# Diagonal moves (matches)
while x > prev_x and y > prev_y:
x -= 1
y -= 1
edits.append((' ', a[x]))
# Non-diagonal move
if step > 0:
if prev_k == k + 1:
y -= 1
edits.append(('+', b[y]))
else:
x -= 1
edits.append(('-', a[x]))
# Handle diagonal at d=0
while x > 0 and y > 0:
x -= 1
y -= 1
edits.append((' ', a[x]))
edits.reverse()
return edits
# Demo
a = ["A", "B", "C", "A", "B", "B", "A"]
b = ["C", "B", "A", "B", "A", "C"]
diff = myers_diff(a, b)
print("Myers diff result:")
for op, line in diff:
print(f" {op} {line}")
1.4 Regular Expression Basics
Regular expressions describe string patterns. The core operations are only three:
| Operation | Notation | Meaning |
|---|---|---|
| Concatenation | ab | Match a, then b |
| Alternation | a|b | Match a or b |
| Repetition | a* | Match a zero or more times |
With parentheses for grouping, these three operations can describe all regular languages (Kleene, 1956).
Modern regex engines also support + (one or more), ? (zero or one), . (any character), [abc] (character class), etc., but these are syntactic sugar — they can all be expressed with the three basic operations.
# Syntactic sugar equivalences
# a+ equivalent to aa*
# a? equivalent to (a|epsilon) epsilon = empty string
# a{3} equivalent to aaa
# [abc] equivalent to (a|b|c)
# . equivalent to (a|b|c|...|z|0|...|9|...) alternation of all chars
1.5 Two Implementations of Regex Engines
Method 1: Backtracking
Most programming languages (Python, Java, JavaScript, Ruby) use backtracking regex engines. It's essentially a DFS with backtracking:
def regex_match_backtrack(text: str, pattern: str) -> bool:
"""
Backtracking regex match (simplified, supports only . and *)
Corresponds to LeetCode #10: Regular Expression Matching
Time complexity: worst case O(2^n) (exponential!)
"""
def match(t: int, p: int) -> bool:
if p == len(pattern):
return t == len(text)
first_match = (t < len(text) and
(pattern[p] == '.' or pattern[p] == text[t]))
if p + 1 < len(pattern) and pattern[p + 1] == '*':
return (match(t, p + 2) or # * matches zero times
(first_match and match(t + 1, p))) # * matches once more
return first_match and match(t + 1, p + 1)
return match(0, 0)
# Demo
print(regex_match_backtrack("aab", "c*a*b")) # True
print(regex_match_backtrack("mississippi", "mis*is*p*.")) # False
Method 2: Thompson NFA
Ken Thompson proposed a completely different approach in 1968: compile the regex into an NFA (nondeterministic finite automaton), then simulate all possible NFA states simultaneously.
class ThompsonNFA:
"""
Thompson NFA regex engine
Core idea:
1. Convert regex to NFA (fixed construction rules for each operation)
2. During matching, simultaneously track all possible current states
3. Time guarantee O(nm): n = text length, m = number of states in pattern
Thompson, "Programming Techniques: Regular Expression Search Algorithm",
Communications of the ACM, 1968
"""
class State:
def __init__(self, char=None):
self.char = char # None = epsilon transition, '.' = any char
self.out1 = None # Transition target 1
self.out2 = None # Transition target 2 (for branching)
self.is_match = False
class Fragment:
"""NFA fragment: has a start state and several dangling arrows"""
def __init__(self, start, dangling):
self.start = start
self.dangling = dangling # List of states with unconnected outputs
def __init__(self, pattern: str):
self.start = self._compile(pattern)
def _compile(self, pattern: str):
"""
Compile regex to NFA
Uses classic "postfix expression + stack" method
"""
postfix = self._to_postfix(pattern)
stack = []
for c in postfix:
if c == '.': # Concatenation
f2 = stack.pop()
f1 = stack.pop()
for s in f1.dangling:
s.out1 = f2.start
stack.append(self.Fragment(f1.start, f2.dangling))
elif c == '|': # Alternation
f2 = stack.pop()
f1 = stack.pop()
s = self.State()
s.out1 = f1.start
s.out2 = f2.start
stack.append(self.Fragment(s, f1.dangling + f2.dangling))
elif c == '*': # Repetition
f = stack.pop()
s = self.State()
s.out1 = f.start
s.out2 = None
for ds in f.dangling:
ds.out1 = s
stack.append(self.Fragment(s, [s]))
elif c == '?': # Zero or one
f = stack.pop()
s = self.State()
s.out1 = f.start
s.out2 = None
stack.append(self.Fragment(s, f.dangling + [s]))
elif c == '+': # One or more
f = stack.pop()
s = self.State()
s.out1 = f.start
s.out2 = None
for ds in f.dangling:
ds.out1 = s
stack.append(self.Fragment(f.start, [s]))
else: # Literal character
s = self.State(c)
stack.append(self.Fragment(s, [s]))
if stack:
f = stack.pop()
match_state = self.State()
match_state.is_match = True
for s in f.dangling:
s.out1 = match_state
return f.start
return None
def _to_postfix(self, pattern: str) -> str:
"""
Convert regex to postfix notation
Add explicit concatenation operator, then use shunting-yard algorithm
"""
output = []
for i, c in enumerate(pattern):
output.append(c)
if c not in ('(', '|') and i + 1 < len(pattern):
next_c = pattern[i + 1]
if next_c not in (')', '|', '*', '+', '?'):
output.append('.')
# Simplified: shows concept only
return ''.join(output)
def match(self, text: str) -> bool:
"""
NFA simulation matching
Simultaneously track all possible states
Time: O(n * m), n = len(text), m = NFA state count
"""
if self.start is None:
return False
current_states = self._epsilon_closure({self.start})
for c in text:
next_states = set()
for state in current_states:
if state.char == c or state.char == 'ANY':
if state.out1:
next_states.add(state.out1)
current_states = self._epsilon_closure(next_states)
return any(s.is_match for s in current_states)
def _epsilon_closure(self, states: set) -> set:
"""Compute epsilon closure of a state set"""
closure = set()
stack = list(states)
while stack:
s = stack.pop()
if s in closure:
continue
closure.add(s)
if s.char is None: # epsilon transition
if s.out1 and s.out1 not in closure:
stack.append(s.out1)
if s.out2 and s.out2 not in closure:
stack.append(s.out2)
return closure
1.6 The Fundamental Difference Between the Two Engines
| Property | Backtracking | Thompson NFA |
|---|---|---|
| Time complexity | Worst O(2^n) | Guaranteed O(nm) |
| Backreferences | Yes | No |
| Greedy/Lazy | Yes | Needs extra work |
| Performance (simple patterns) | Fast | Fast |
| Performance (malicious patterns) | Catastrophically slow | Still fast |
| Used by | Python, Java, JS, Ruby | RE2, Rust regex, Go |
Level 2 · How It Works Under the Hood
2.1 The Edit Graph Model of Myers Diff
Myers diff's core is transforming the diff problem into a shortest-path problem on a graph.
Edit Graph:
- An (n+1) x (m+1) grid graph
- Horizontal edge (x,y) -> (x+1,y): delete a[x], cost 1
- Vertical edge (x,y) -> (x,y+1): insert b[y], cost 1
- Diagonal edge (x,y) -> (x+1,y+1): only when a[x] == b[y], cost 0
- Goal: find minimum-cost path from (0,0) to (n,m)
Diagonal representation:
- k = x - y identifies which diagonal we're on
- Start at diagonal k=0
- End point (n,m) is on diagonal k = n-m
- Each horizontal move: k increases by 1
- Each vertical move: k decreases by 1
BFS by layers:
- Layer d = exactly d non-diagonal moves (edit distance = d)
- Start from d=0, each layer tries all reachable diagonals
- On each diagonal, walk as far as possible along the diagonal (free matches)
- Until some diagonal reaches (n,m)
This is why the algorithm is O(ND) — the outer loop runs D times (shortest edit distance), inner loop processes 2d+1 diagonals.
2.2 NFA to DFA Conversion (Subset Construction)
Each step of Thompson NFA matching maintains a set of states. If we pre-enumerate all possible state sets, we get a DFA (deterministic finite automaton):
def nfa_to_dfa(nfa_start, alphabet: set[str]):
"""
Subset Construction (Powerset Construction)
Convert NFA to equivalent DFA
Rabin & Scott, "Finite Automata and Their Decision Problems", 1959
Each DFA state = a subset of NFA states
Worst case: DFA state count = 2^(NFA state count) (exponential blowup)
"""
def epsilon_closure(states):
"""Compute epsilon closure"""
closure = set()
stack = list(states)
while stack:
s = stack.pop()
if id(s) in closure:
continue
closure.add(id(s))
if s.char is None:
if s.out1:
stack.append(s.out1)
if s.out2:
stack.append(s.out2)
return frozenset(closure)
# DFA start state = epsilon closure of NFA start state
start_closure = epsilon_closure({nfa_start})
# BFS to build DFA
dfa_states = {start_closure: 0}
dfa_transitions = {}
queue = [start_closure]
while queue:
current = queue.pop(0)
current_id = dfa_states[current]
for c in alphabet:
next_states = set() # move(current, c)
next_closure = epsilon_closure(next_states) if next_states else frozenset()
if not next_closure:
continue
if next_closure not in dfa_states:
dfa_states[next_closure] = len(dfa_states)
queue.append(next_closure)
dfa_transitions[(current_id, c)] = dfa_states[next_closure]
return dfa_states, dfa_transitions
DFA advantages and disadvantages:
- Advantage: each matching step is a single table lookup, O(n) matching
- Disadvantage: construction may explode exponentially (2^m states)
- Practical approach: lazy DFA construction — only build new DFA states on demand
2.3 RE2's Hybrid Strategy
Google's RE2 regex engine (designed by Russ Cox) uses a hybrid strategy:
- Small regex: Direct NFA simulation (state sets are usually small)
- Frequently used regex: Incrementally build DFA (cache previously seen state sets)
- When DFA is too large: Fall back to NFA simulation with DFA cache cap
class HybridRegexEngine:
"""
Hybrid regex engine (RE2 style)
Approach:
1. Start with NFA simulation
2. Cache observed (current_state_set, input_char) -> next_state_set
3. When cache is large enough, effectively builds a partial DFA
4. Set cache cap to prevent memory explosion
"""
def __init__(self, max_cache_size: int = 10000):
self.max_cache_size = max_cache_size
self.cache = {}
self.cache_hits = 0
self.cache_misses = 0
def step(self, current_states: frozenset, char: str,
nfa_transitions) -> frozenset:
"""Single transition step: check cache or compute"""
key = (current_states, char)
if key in self.cache:
self.cache_hits += 1
return self.cache[key]
self.cache_misses += 1
next_states = set()
for state in current_states:
if state in nfa_transitions and char in nfa_transitions[state]:
next_states.update(nfa_transitions[state][char])
result = frozenset(next_states)
if len(self.cache) < self.max_cache_size:
self.cache[key] = result
return result
2.4 Git Diff Optimization Details
The actual Git diff is not pure Myers algorithm. It applies several optimizations:
1. First strip common prefix and suffix
def optimized_diff(a: list[str], b: list[str]) -> list[tuple[str, str]]:
"""Git-style optimization: strip common prefix and suffix first"""
prefix_len = 0
while prefix_len < len(a) and prefix_len < len(b) and a[prefix_len] == b[prefix_len]:
prefix_len += 1
suffix_len = 0
while (suffix_len < len(a) - prefix_len and
suffix_len < len(b) - prefix_len and
a[-(suffix_len+1)] == b[-(suffix_len+1)]):
suffix_len += 1
# Only run Myers on the differing middle portion
a_mid = a[prefix_len:len(a)-suffix_len if suffix_len else len(a)]
b_mid = b[prefix_len:len(b)-suffix_len if suffix_len else len(b)]
result = []
for line in a[:prefix_len]:
result.append((' ', line))
if a_mid or b_mid:
diff_result = myers_diff(a_mid, b_mid)
result.extend(diff_result)
for line in a[len(a)-suffix_len:] if suffix_len else []:
result.append((' ', line))
return result
2. Linear-space optimization (Hirschberg-style)
Myers' original algorithm needs O(ND) space to save history. Git uses divide-and-conquer to reduce space to O(D^2):
- Run forward and backward Myers search simultaneously from both ends
- Find the middle "snake" (longest diagonal segment)
- Recursively process both halves
3. Patience Diff
Git also offers --patience option using the Patience Diff algorithm:
- First find "unique common lines" (lines appearing exactly once in both A and B)
- Compute LCS on these lines
- Use these "anchors" to split the problem, recurse on each part
Patience Diff often produces more intuitive diffs in practice (won't "align" braces from two different functions).
2.5 Levenshtein Automaton
A Levenshtein automaton is a DFA that accepts all strings within edit distance k of a given word w.
Key property (Schulz & Mihov, 2002):
For fixed k and alphabet size sigma, the Levenshtein automaton has O(n) states (n = |w|). This means we can build the automaton in O(n) time, then use it to match any candidate string of length m in O(m) time.
def build_levenshtein_nfa(word: str, max_dist: int):
"""
Conceptual demonstration of Levenshtein NFA construction
States: (i, e) means "processed first i characters of word, accumulated e edits"
Transitions:
- Match: (i, e) --c--> (i+1, e) if word[i] == c
- Replace: (i, e) --c--> (i+1, e+1) if word[i] != c and e < k
- Delete: (i, e) --epsilon--> (i+1, e+1) skip word[i], e < k
- Insert: (i, e) --c--> (i, e+1) don't consume word, e < k
Accept states: all (n, e) where e <= k
"""
n = len(word)
states = {}
for i in range(n + 1):
for e in range(max_dist + 1):
states[(i, e)] = {}
return states
Level 3 · What the Theory Says
3.1 Thompson (1968): The Origin of Regex Engines
Ken Thompson's paper "Programming Techniques: Regular Expression Search Algorithm" (published in CACM, June 1968) described how to compile regular expressions into machine code for the IBM 7094. This was the first practical regex engine in history.
Thompson's core contributions:
-
NFA construction rules: Each regex operation (concatenation, alternation, closure) has a fixed NFA construction template, each adding at most 2 states. Therefore NFA state count = O(m) (m = pattern length).
-
Parallel simulation: Instead of tracing one path at a time (which leads to backtracking), simultaneously trace all possible states. For each input character, update the entire set of current states.
-
Time guarantee: Each step processes at most m states, each with at most 2 outgoing edges, so each step is O(m). Total time: O(nm).
Thompson's precise construction rules:
1. Single character 'a':
s --a--> e
(One start state s, one edge labeled a, one end state e)
2. Concatenation AB:
A.end --epsilon--> B.start
(A's end state connects to B's start via epsilon edge)
3. Alternation A|B:
epsilon --> A.start
s -<
epsilon --> B.start
A.end --epsilon--> e
B.end --epsilon--> e
(New start s branches to A and B via epsilon,
both ends connect to new end e via epsilon)
4. Closure A*:
epsilon --> A.start
s -< A.end --epsilon--> e
epsilon --> e --epsilon--> A.start (loop back)
(Start can skip A directly to end,
A's end can loop back to A's start)
3.2 Russ Cox (2007): "Regular Expression Matching Can Be Simple and Fast"
Russ Cox published this highly influential article in 2007 (a blog post, not a formal paper, but widely cited), demonstrating a striking comparison:
Matching regex a?^n a^n against text a^n (n optional a's followed by n mandatory a's, matching n a's):
| n | Perl/Python (backtracking) | Thompson NFA |
|---|---|---|
| 10 | 0.0001s | 0.0001s |
| 20 | 0.01s | 0.0001s |
| 25 | 1s | 0.0001s |
| 30 | 60s | 0.0001s |
| 100 | Never finishes | 0.0001s |
Why is backtracking exponential here?
Regex a?a?a?aaa matching "aaa":
- Each
a?has two choices: match or skip - Backtracking tries all 2^n combinations
- Only at the very end can it confirm the best matching plan
Thompson NFA maintains all possible states simultaneously, O(n) per step.
Cox's key argument:
"Perl, PCRE, Python, Ruby, Java, and many other languages have regular expression implementations based on recursive backtracking that are simple but can be exponentially slow. In contrast, Thompson's algorithm guarantees linear time."
This article directly led to Google's RE2 engine.
3.3 Myers (1986): Mathematical Foundation of Diff Algorithms
Eugene Myers rigorously proved in "An O(ND) Difference Algorithm and Its Variations" (Algorithmica, 1986):
Theorem 1: The algorithm finds the shortest edit script in O(ND) time and O(D^2) space, where N = n + m, D = edit distance.
Theorem 2: Using divide-and-conquer (similar to Hirschberg 1975), space can be reduced to O(D^2) while maintaining O(ND) time.
Key lemma: At step d in the edit graph, the furthest reachable point V[d][k] on each diagonal k satisfies:
V[d][k] = max( V[d-1][k-1] + 1, (arrived from above) V[d-1][k+1] (arrived from left) ) + (how far we can walk along diagonal)
This recurrence means the algorithm needs only O(D) space to store current layer's V values.
3.4 Kleene (1956) and Regular Language Equivalence
Stephen Kleene's 1956 paper "Representation of Events in Nerve Nets and Finite Automata" proved the equivalence between regular expressions and finite automata (Kleene's theorem):
Theorem: A language is regular if and only if it can be described by:
- A regular expression
- A DFA
- An NFA
These three representations are interconvertible:
- Regex -> NFA: Thompson construction (1968), O(m) states
- NFA -> DFA: Subset construction (Rabin & Scott, 1959), worst case O(2^m) states
- DFA -> Regex: State elimination, O(2^n) length
Why backreferences take regex beyond regular languages:
Modern "regular expressions" support backreferences \1, \2, etc., which actually exceed the expressive power of regular languages. For example, (a+)\1 matches "aa", "aaaa", "aaaaaa" (even-length a-strings), which is not a regular language (provable by pumping lemma).
The matching problem with backreferences is NP-Complete (Aho, 1990), which is why backtracking is unavoidable for such patterns.
3.5 Ukkonen (1985): The Diagonal Trick for Edit Distance
Esko Ukkonen's "Algorithms for Approximate String Matching" (Information and Control, 1985) gave another O(kn) edit distance algorithm (k = edit distance, n = text length), similar in spirit to Myers 1986 but from a different angle:
- Myers: expand layer by layer by edit distance d, advancing in the edit graph
- Ukkonen: when computing the DP table, only compute a diagonal band of width 2k+1
Both exploit the insight that "when edit distance is small, only a small portion of the search space needs exploration."
Level 4 · Edge Cases and Pitfalls
4.1 Catastrophic Backtracking (ReDoS)
Catastrophic Backtracking or ReDoS (Regular Expression Denial of Service) is a class of security vulnerabilities: crafted inputs can make a regex engine run for exponential time.
Typical problematic patterns:
import re
import time
# Dangerous pattern 1: nested quantifiers
# (a+)+ matching "aaaaX" causes catastrophic backtracking
# because there are countless ways to partition a's into groups
# Dangerous pattern 2: overlapping alternatives
# (a|a)* matching "aaaaX" is equally catastrophic
# Dangerous pattern 3: real-world example
# ^(([a-z])+.)+[A-Z]([a-z])+$ # email validation
def demonstrate_redos():
"""Demonstrate ReDoS attack"""
pattern = re.compile(r'(a+)+b')
for n in [10, 15, 20, 22, 24]:
text = 'a' * n + 'X' # Will never match
start = time.time()
try:
result = pattern.match(text)
except Exception:
pass
elapsed = time.time() - start
print(f"n={n:2d}: {elapsed:.4f}s (result={result})")
if elapsed > 5:
print(" Too slow, stopping test")
break
# demonstrate_redos() # Uncomment to run (warning: may be very slow!)
Root cause of ReDoS:
Backtracking engines must undo all choices made when matching fails. If the pattern has nested quantifiers or overlapping alternatives, the number of choices is exponential.
Defense strategies:
- Use Thompson NFA engines: RE2, Rust regex, Go regex — guarantee linear time
- Static analysis tools: Detect problematic regex patterns
- Set timeouts: Limit maximum regex matching time
- Atomic groups and possessive quantifiers: Prevent backtracking (requires engine support)
# Defense in Python
import signal
class RegexTimeout(Exception):
pass
def timeout_handler(signum, frame):
raise RegexTimeout("Regex match timeout!")
def safe_regex_match(pattern: str, text: str, timeout_sec: float = 1.0):
"""Safe regex matching with timeout"""
signal.signal(signal.SIGALRM, timeout_handler)
signal.setitimer(signal.ITIMER_REAL, timeout_sec)
try:
result = re.match(pattern, text)
signal.setitimer(signal.ITIMER_REAL, 0)
return result
except RegexTimeout:
return None # Timeout, treat as no match
4.2 Interview Problem: Regular Expression Matching (LeetCode #10)
def is_match(s: str, p: str) -> bool:
"""
LeetCode #10: Regular Expression Matching
Implement regex matching supporting '.' and '*'
'.' matches any single character
'*' matches zero or more of the preceding element
Method: Dynamic Programming
dp[i][j] = whether s[:i] matches p[:j]
Time: O(nm)
Space: O(nm), optimizable to O(m)
"""
n, m = len(s), len(p)
dp = [[False] * (m + 1) for _ in range(n + 1)]
dp[0][0] = True
# Initialize: empty string s matches "a*b*c*..." patterns
for j in range(2, m + 1):
if p[j-1] == '*':
dp[0][j] = dp[0][j-2]
for i in range(1, n + 1):
for j in range(1, m + 1):
if p[j-1] == '*':
# Two choices for *:
# 1. Match zero times: skip last two pattern chars
dp[i][j] = dp[i][j-2]
# 2. Match one or more: current char matches char before *
if p[j-2] == '.' or p[j-2] == s[i-1]:
dp[i][j] = dp[i][j] or dp[i-1][j]
elif p[j-1] == '.' or p[j-1] == s[i-1]:
dp[i][j] = dp[i-1][j-1]
return dp[n][m]
# Tests
test_cases = [
("aa", "a", False),
("aa", "a*", True),
("ab", ".*", True),
("aab", "c*a*b", True),
("mississippi", "mis*is*p*.", False),
]
for s, p, expected in test_cases:
result = is_match(s, p)
status = "PASS" if result == expected else "FAIL"
print(f"[{status}] is_match('{s}', '{p}') = {result}")
4.3 Interview Problem: Wildcard Matching (LeetCode #44)
def is_match_wildcard(s: str, p: str) -> bool:
"""
LeetCode #44: Wildcard Matching
'?' matches any single character
'*' matches any sequence of characters (including empty)
Note: '*' here differs from regex '*'!
Regex '*' modifies the preceding character; wildcard '*' independently matches any sequence.
Method 1: Dynamic Programming O(nm)
"""
n, m = len(s), len(p)
dp = [[False] * (m + 1) for _ in range(n + 1)]
dp[0][0] = True
# Initialize: prefix of all *'s can match empty string
for j in range(1, m + 1):
if p[j-1] == '*':
dp[0][j] = dp[0][j-1]
for i in range(1, n + 1):
for j in range(1, m + 1):
if p[j-1] == '*':
dp[i][j] = dp[i][j-1] or dp[i-1][j]
elif p[j-1] == '?' or p[j-1] == s[i-1]:
dp[i][j] = dp[i-1][j-1]
return dp[n][m]
def is_match_wildcard_greedy(s: str, p: str) -> bool:
"""
Method 2: Greedy two-pointer O(nm) worst case, but average O(n+m)
Idea:
- When encountering '*', record position, assume it matches empty first
- If later matching fails, backtrack to last '*', have it match one more char
"""
n, m = len(s), len(p)
si, pi = 0, 0
star_pi, star_si = -1, -1
while si < n:
if pi < m and (p[pi] == '?' or p[pi] == s[si]):
si += 1
pi += 1
elif pi < m and p[pi] == '*':
star_pi = pi
star_si = si
pi += 1
elif star_pi != -1:
star_si += 1
si = star_si
pi = star_pi + 1
else:
return False
while pi < m and p[pi] == '*':
pi += 1
return pi == m
# Tests
test_cases = [
("aa", "a", False),
("aa", "*", True),
("cb", "?a", False),
("adceb", "*a*b", True),
("acdcb", "a*c?b", False),
]
for s, p, expected in test_cases:
result1 = is_match_wildcard(s, p)
result2 = is_match_wildcard_greedy(s, p)
status = "PASS" if result1 == expected else "FAIL"
print(f"[{status}] wildcard('{s}', '{p}') = {result1} (greedy={result2})")
4.4 Edit Distance Variants in Interviews
Variant 1: Delete-only operations (LCS problem)
If the only edit operation is "delete," then minimum deletions to transform A to B = |A| + |B| - 2*LCS(A, B).
Variant 2: One Edit Distance (LeetCode #161)
def is_one_edit_distance(s: str, t: str) -> bool:
"""
Determine if two strings have exactly edit distance 1
O(n) time, O(1) space
"""
n, m = len(s), len(t)
if abs(n - m) > 1:
return False
if n > m:
return is_one_edit_distance(t, s)
# Now n <= m
for i in range(n):
if s[i] != t[i]:
if n == m:
return s[i+1:] == t[i+1:]
else:
return s[i:] == t[i+1:]
return n + 1 == m
# Tests
print(is_one_edit_distance("abc", "ab")) # True (delete)
print(is_one_edit_distance("abc", "adc")) # True (replace)
print(is_one_edit_distance("abc", "abcd")) # True (insert)
print(is_one_edit_distance("abc", "abc")) # False (same)
Variant 3: Edit Distance with Operation Recovery
def edit_distance_with_operations(a: str, b: str) -> tuple[int, list[str]]:
"""
Compute edit distance and return the specific operation sequence
Useful for implementing diff tools' "minimum change suggestions"
"""
n, m = len(a), len(b)
dp = [[0] * (m + 1) for _ in range(n + 1)]
for i in range(n + 1):
dp[i][0] = i
for j in range(m + 1):
dp[0][j] = j
for i in range(1, n + 1):
for j in range(1, m + 1):
if a[i-1] == b[j-1]:
dp[i][j] = dp[i-1][j-1]
else:
dp[i][j] = 1 + min(dp[i-1][j], dp[i][j-1], dp[i-1][j-1])
# Backtrack operations
ops = []
i, j = n, m
while i > 0 or j > 0:
if i > 0 and j > 0 and a[i-1] == b[j-1]:
ops.append(f"keep '{a[i-1]}'")
i -= 1
j -= 1
elif i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1] + 1:
ops.append(f"replace '{a[i-1]}' with '{b[j-1]}'")
i -= 1
j -= 1
elif j > 0 and dp[i][j] == dp[i][j-1] + 1:
ops.append(f"insert '{b[j-1]}'")
j -= 1
else:
ops.append(f"delete '{a[i-1]}'")
i -= 1
ops.reverse()
return dp[n][m], ops
# Demo
dist, ops = edit_distance_with_operations("kitten", "sitting")
print(f"Edit distance: {dist}")
print("Operations:")
for op in ops:
print(f" {op}")
4.5 Regex Engine Selection in Practice
| Engine | Language/Tool | Type | Backreferences | Time Guarantee |
|---|---|---|---|---|
| PCRE | PHP, Nginx | Backtracking | Yes | None (may be exponential) |
| Python re | Python | Backtracking | Yes | None |
| RE2 | C++/Go | NFA/DFA | No | O(nm) |
| Rust regex | Rust | NFA/DFA | No | O(nm) |
| Hyperscan | Intel | DFA + HW acceleration | No | O(n) |
Selection guide:
- If input may come from users/external sources: must use RE2/Rust regex (prevent ReDoS)
- If backreferences needed and input is controlled: Python re is sufficient
- If extreme throughput needed: Hyperscan (network intrusion detection, etc.)
4.6 Comprehensive Comparison: The Full Matching Algorithm Landscape
| Algorithm | Problem | Time | Use Case |
|---|---|---|---|
| KMP/BM | Exact single pattern | O(n+m) | Simple search |
| Aho-Corasick | Exact multi-pattern | O(n+m+z) | Content filtering |
| Rabin-Karp | Exact (multi-pattern) | O(n+m) expected | Plagiarism detection |
| Suffix Array | Text indexing | O(m log n) query | Full-text search |
| Edit Distance | Fuzzy matching | O(nm) | Spell checking |
| Myers diff | Shortest edit script | O(ND) | Version comparison |
| Thompson NFA | Regex matching | O(nm) | Secure search |
| Backtracking | Regex + backreferences | O(2^n) worst | Complex patterns |
4.7 Chapter Summary
This chapter extended the concept of "matching" from exact matching in three directions:
- Fuzzy matching: search allowing edit distance (BK tree, Trie + DP, Levenshtein automaton)
- Diff comparison: finding the shortest edit script between two sequences (Myers diff, core of Git)
- Pattern matching: regex engines (backtracking vs Thompson NFA)
Core lessons:
- Backtracking is simple but dangerous — ReDoS is a real security threat
- Thompson NFA guarantees linear time but doesn't support backreferences
- Myers diff exploits the assumption "most differences are small," making it extremely fast in practice
- Edit distance is O(nm), but many acceleration techniques exist (BK tree, Trie, diagonal optimization)
Understanding these algorithms helps not just with interview problems, but with understanding: how Git works, how search engines handle typos, and why web services can be brought down by regular expressions.