Chapter 33

Flexible Matching and Regex Engines

Chapter 33: Flexible Matching and Regex Engines — When "Exact Match" Isn't Enough

The previous chapters (KMP, Rabin-Karp, suffix arrays) all solve exact matching problems: given a pattern P, find all positions where P appears in text T. But real-world search is far more complex:

A user types "algoritm" (misspelled) and you need to find "algorithm"
Git needs to compare two file versions and find the shortest edit sequence
You write grep "log.*Error" to search logs — that's not a fixed string

This chapter discusses three core problems: fuzzy search (matching with allowed spelling errors), diff comparison (the Myers diff algorithm), and regex engines (how to turn a*b|c into a running program).

Their common thread: matching is no longer a binary "equal or not" judgment, but a search space that needs exploration.

Level 1 · What You Need to Know

1.1 Edit Distance Recap

In Chapter 28 we discussed edit distance (Levenshtein distance) in detail: the minimum number of operations (insert, delete, replace) to transform string A into string B.

Quick recap of the core formula:

dp[i][j] = min(
    dp[i-1][j] + 1,         # delete A[i]
    dp[i][j-1] + 1,         # insert B[j]
    dp[i-1][j-1] + cost     # replace (cost=0 if A[i]==B[j], else 1)
)

Time complexity O(nm), space optimizable to O(min(n,m)).

1.2 Implementing Fuzzy Search

Problem: In a dictionary, find all words within edit distance k of a query word.

Naive approach: Compute edit distance to every dictionary word. With N words of length m and query length n, total time is O(N * n * m). For a 100K word dictionary, this is too slow.

Method 1: BK Tree (Burkhard-Keller Tree, 1973)

BK trees exploit the triangle inequality of edit distance for pruning: if d(q, root) = d, then words within distance k of q must have distance to root in [d-k, d+k].

class BKTree:
    """
    BK Tree: accelerates fuzzy search using triangle inequality
    
    Triangle inequality: d(a, c) >= |d(a, b) - d(b, c)|
    If d(query, root) = d, then answer word w satisfies:
    d(root, w) is in range [d-k, d+k]
    
    Burkhard & Keller, "Some Approaches to Best-Match File Searching", 1973
    """
    
    def __init__(self):
        self.root = None
    
    class Node:
        def __init__(self, word: str):
            self.word = word
            self.children = {}  # distance -> child node
    
    def add(self, word: str):
        if self.root is None:
            self.root = self.Node(word)
            return
        
        node = self.root
        while True:
            d = self._edit_distance(word, node.word)
            if d == 0:
                return  # duplicate word
            if d in node.children:
                node = node.children[d]
            else:
                node.children[d] = self.Node(word)
                return
    
    def search(self, query: str, max_dist: int) -> list[tuple[str, int]]:
        """
        Find all words with edit distance <= max_dist to query
        Average time complexity much better than O(N)
        """
        if self.root is None:
            return []
        
        results = []
        stack = [self.root]
        
        while stack:
            node = stack.pop()
            d = self._edit_distance(query, node.word)
            
            if d <= max_dist:
                results.append((node.word, d))
            
            # Prune using triangle inequality
            for dist, child in node.children.items():
                if d - max_dist <= dist <= d + max_dist:
                    stack.append(child)
        
        return results
    
    @staticmethod
    def _edit_distance(a: str, b: str) -> int:
        """Standard edit distance O(nm)"""
        n, m = len(a), len(b)
        dp = list(range(m + 1))
        for i in range(1, n + 1):
            prev = dp[0]
            dp[0] = i
            for j in range(1, m + 1):
                temp = dp[j]
                if a[i-1] == b[j-1]:
                    dp[j] = prev
                else:
                    dp[j] = 1 + min(prev, dp[j], dp[j-1])
                prev = temp
        return dp[m]


# Demo
tree = BKTree()
dictionary = ["algorithm", "altruistic", "algebra", "logarithm", 
              "align", "alien", "all", "allocate"]
for word in dictionary:
    tree.add(word)

results = tree.search("algoritm", max_dist=2)
print(f"Words within distance 2 of 'algoritm':")
for word, dist in sorted(results, key=lambda x: x[1]):
    print(f"  {word} (distance={dist})")

Method 2: Trie + Edit Distance Automaton

For larger dictionaries, insert all words into a Trie, then synchronize edit distance DP rows with Trie traversal.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False
        self.word = ""


class FuzzySearchTrie:
    """
    Trie + Edit Distance: single DFS finds all fuzzy matches
    
    Idea: DFS the Trie while maintaining the current DP row.
    When the minimum value in the DP row exceeds max_dist, prune.
    """
    
    def __init__(self):
        self.root = TrieNode()
    
    def insert(self, word: str):
        node = self.root
        for c in word:
            if c not in node.children:
                node.children[c] = TrieNode()
            node = node.children[c]
        node.is_word = True
        node.word = word
    
    def search(self, query: str, max_dist: int) -> list[tuple[str, int]]:
        """
        Find all words with edit distance <= max_dist to query
        
        Core: DFS the Trie while maintaining DP row
        - Current Trie path = "prefix in dictionary"
        - DP row column j = edit_distance(query[:j], current_prefix)
        """
        results = []
        n = len(query)
        
        # Initial DP row: edit_distance("", query[:j]) = j
        initial_row = list(range(n + 1))
        
        # DFS
        for c, child in self.root.children.items():
            self._dfs(child, c, query, initial_row, max_dist, results)
        
        return results
    
    def _dfs(self, node: TrieNode, char: str, query: str, 
             prev_row: list[int], max_dist: int, results: list):
        n = len(query)
        
        # Compute current row
        curr_row = [prev_row[0] + 1]  # delete current character
        
        for j in range(1, n + 1):
            insert_cost = curr_row[j - 1] + 1
            delete_cost = prev_row[j] + 1
            replace_cost = prev_row[j - 1] + (0 if query[j-1] == char else 1)
            curr_row.append(min(insert_cost, delete_cost, replace_cost))
        
        # If word boundary and distance <= max_dist
        if node.is_word and curr_row[n] <= max_dist:
            results.append((node.word, curr_row[n]))
        
        # Prune: if minimum in current row > max_dist, no need to continue
        if min(curr_row) <= max_dist:
            for c, child in node.children.items():
                self._dfs(child, c, query, curr_row, max_dist, results)


# Demo
trie = FuzzySearchTrie()
for word in ["algorithm", "altruistic", "algebra", "logarithm", 
             "align", "alien", "all", "allocate"]:
    trie.insert(word)

results = trie.search("algoritm", max_dist=2)
print(f"\nTrie fuzzy search 'algoritm' (distance<=2):")
for word, dist in sorted(results, key=lambda x: x[1]):
    print(f"  {word} (distance={dist})")

1.3 The Myers Diff Algorithm — The Core of Git Diff

When you run git diff, Git needs to find the shortest edit script (SES) between two file versions. This is essentially an edit distance problem, but the unit is "lines" rather than "characters."

Eugene Myers gave an elegant algorithm in his 1986 paper "An O(ND) Difference Algorithm and Its Variations," with time complexity O(ND), where N = n + m (total length of both sequences) and D = edit distance.

Key insight: Most file differences are small (D << N), so O(ND) is typically much better than O(NM).

def myers_diff(a: list[str], b: list[str]) -> list[tuple[str, str]]:
    """
    Myers diff algorithm: find shortest edit script between two sequences
    
    Input: a = original sequence, b = target sequence
    Output: [(operation, content), ...] where operation is ' '(keep), '-'(delete), '+'(insert)
    
    Time: O((n+m) * D), D = edit distance
    Space: O((n+m) * D) (optimizable to O(D^2) with linear-space divide and conquer)
    
    Myers, "An O(ND) Difference Algorithm", Algorithmica, 1986
    """
    n, m = len(a), len(b)
    
    # In the edit graph: x-axis is a, y-axis is b
    # Diagonal k = x - y
    # Goal: from (0,0) to (n,m)
    # Moves:
    #   right = delete a[x] (x++)
    #   down = insert b[y] (y++)
    #   diagonal = a[x] == b[y], match (x++, y++, free)
    
    max_d = n + m  # Maximum possible edit distance
    
    # v[k] = furthest x-coordinate reachable on diagonal k
    offset = max_d
    v_size = 2 * max_d + 1
    
    # Save each round's v for backtracking
    history = []
    
    v = [-1] * v_size
    v[offset + 1] = 0  # Initialize: start from diagonal 1 (virtual start)
    
    for d in range(max_d + 1):
        history.append(v[:])
        
        for k in range(-d, d + 1, 2):
            # Decide whether to reach diagonal k from above (delete) or left (insert)
            if k == -d or (k != d and v[offset + k - 1] < v[offset + k + 1]):
                x = v[offset + k + 1]  # Move down from diagonal k+1 (insert)
            else:
                x = v[offset + k - 1] + 1  # Move right from diagonal k-1 (delete)
            
            y = x - k
            
            # Walk along diagonal (match)
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            
            v[offset + k] = x
            
            # Reached the end?
            if x >= n and y >= m:
                return _backtrack(a, b, history, d, offset)
    
    return []  # Should never reach here


def _backtrack(a: list[str], b: list[str], history: list, 
               d: int, offset: int) -> list[tuple[str, str]]:
    """Backtrack from Myers algorithm history to extract specific edit operations"""
    n, m = len(a), len(b)
    x, y = n, m
    
    edits = []
    
    for step in range(d, 0, -1):
        v = history[step]
        v_prev = history[step - 1]
        k = x - y
        
        if k == -step or (k != step and v_prev[offset + k - 1] < v_prev[offset + k + 1]):
            prev_k = k + 1  # Came from above (insert)
            prev_x = v_prev[offset + prev_k]
        else:
            prev_k = k - 1  # Came from left (delete)
            prev_x = v_prev[offset + prev_k] + 1
        
        prev_y = prev_x - prev_k
        
        # Diagonal moves (matches)
        while x > prev_x and y > prev_y:
            x -= 1
            y -= 1
            edits.append((' ', a[x]))
        
        # Non-diagonal move
        if step > 0:
            if prev_k == k + 1:
                y -= 1
                edits.append(('+', b[y]))
            else:
                x -= 1
                edits.append(('-', a[x]))
    
    # Handle diagonal at d=0
    while x > 0 and y > 0:
        x -= 1
        y -= 1
        edits.append((' ', a[x]))
    
    edits.reverse()
    return edits


# Demo
a = ["A", "B", "C", "A", "B", "B", "A"]
b = ["C", "B", "A", "B", "A", "C"]

diff = myers_diff(a, b)
print("Myers diff result:")
for op, line in diff:
    print(f"  {op} {line}")

1.4 Regular Expression Basics

Regular expressions describe string patterns. The core operations are only three:

Operation	Notation	Meaning
Concatenation	ab	Match a, then b
Alternation	a\|b	Match a or b
Repetition	a*	Match a zero or more times

With parentheses for grouping, these three operations can describe all regular languages (Kleene, 1956).

Modern regex engines also support + (one or more), ? (zero or one), . (any character), [abc] (character class), etc., but these are syntactic sugar — they can all be expressed with the three basic operations.

# Syntactic sugar equivalences
# a+    equivalent to  aa*
# a?    equivalent to  (a|epsilon)  epsilon = empty string
# a{3}  equivalent to  aaa
# [abc] equivalent to  (a|b|c)
# .     equivalent to  (a|b|c|...|z|0|...|9|...)  alternation of all chars

1.5 Two Implementations of Regex Engines

Method 1: Backtracking

Most programming languages (Python, Java, JavaScript, Ruby) use backtracking regex engines. It's essentially a DFS with backtracking:

def regex_match_backtrack(text: str, pattern: str) -> bool:
    """
    Backtracking regex match (simplified, supports only . and *)
    
    Corresponds to LeetCode #10: Regular Expression Matching
    
    Time complexity: worst case O(2^n) (exponential!)
    """
    def match(t: int, p: int) -> bool:
        if p == len(pattern):
            return t == len(text)
        
        first_match = (t < len(text) and 
                      (pattern[p] == '.' or pattern[p] == text[t]))
        
        if p + 1 < len(pattern) and pattern[p + 1] == '*':
            return (match(t, p + 2) or  # * matches zero times
                    (first_match and match(t + 1, p)))  # * matches once more
        
        return first_match and match(t + 1, p + 1)
    
    return match(0, 0)


# Demo
print(regex_match_backtrack("aab", "c*a*b"))   # True
print(regex_match_backtrack("mississippi", "mis*is*p*."))  # False

Method 2: Thompson NFA

Ken Thompson proposed a completely different approach in 1968: compile the regex into an NFA (nondeterministic finite automaton), then simulate all possible NFA states simultaneously.

class ThompsonNFA:
    """
    Thompson NFA regex engine
    
    Core idea:
    1. Convert regex to NFA (fixed construction rules for each operation)
    2. During matching, simultaneously track all possible current states
    3. Time guarantee O(nm): n = text length, m = number of states in pattern
    
    Thompson, "Programming Techniques: Regular Expression Search Algorithm",
    Communications of the ACM, 1968
    """
    
    class State:
        def __init__(self, char=None):
            self.char = char  # None = epsilon transition, '.' = any char
            self.out1 = None  # Transition target 1
            self.out2 = None  # Transition target 2 (for branching)
            self.is_match = False
    
    class Fragment:
        """NFA fragment: has a start state and several dangling arrows"""
        def __init__(self, start, dangling):
            self.start = start
            self.dangling = dangling  # List of states with unconnected outputs
    
    def __init__(self, pattern: str):
        self.start = self._compile(pattern)
    
    def _compile(self, pattern: str):
        """
        Compile regex to NFA
        Uses classic "postfix expression + stack" method
        """
        postfix = self._to_postfix(pattern)
        stack = []
        
        for c in postfix:
            if c == '.':  # Concatenation
                f2 = stack.pop()
                f1 = stack.pop()
                for s in f1.dangling:
                    s.out1 = f2.start
                stack.append(self.Fragment(f1.start, f2.dangling))
                
            elif c == '|':  # Alternation
                f2 = stack.pop()
                f1 = stack.pop()
                s = self.State()
                s.out1 = f1.start
                s.out2 = f2.start
                stack.append(self.Fragment(s, f1.dangling + f2.dangling))
                
            elif c == '*':  # Repetition
                f = stack.pop()
                s = self.State()
                s.out1 = f.start
                s.out2 = None
                for ds in f.dangling:
                    ds.out1 = s
                stack.append(self.Fragment(s, [s]))
                
            elif c == '?':  # Zero or one
                f = stack.pop()
                s = self.State()
                s.out1 = f.start
                s.out2 = None
                stack.append(self.Fragment(s, f.dangling + [s]))
                
            elif c == '+':  # One or more
                f = stack.pop()
                s = self.State()
                s.out1 = f.start
                s.out2 = None
                for ds in f.dangling:
                    ds.out1 = s
                stack.append(self.Fragment(f.start, [s]))
                
            else:  # Literal character
                s = self.State(c)
                stack.append(self.Fragment(s, [s]))
        
        if stack:
            f = stack.pop()
            match_state = self.State()
            match_state.is_match = True
            for s in f.dangling:
                s.out1 = match_state
            return f.start
        
        return None
    
    def _to_postfix(self, pattern: str) -> str:
        """
        Convert regex to postfix notation
        Add explicit concatenation operator, then use shunting-yard algorithm
        """
        output = []
        for i, c in enumerate(pattern):
            output.append(c)
            if c not in ('(', '|') and i + 1 < len(pattern):
                next_c = pattern[i + 1]
                if next_c not in (')', '|', '*', '+', '?'):
                    output.append('.')
        
        # Simplified: shows concept only
        return ''.join(output)
    
    def match(self, text: str) -> bool:
        """
        NFA simulation matching
        Simultaneously track all possible states
        Time: O(n * m), n = len(text), m = NFA state count
        """
        if self.start is None:
            return False
        
        current_states = self._epsilon_closure({self.start})
        
        for c in text:
            next_states = set()
            for state in current_states:
                if state.char == c or state.char == 'ANY':
                    if state.out1:
                        next_states.add(state.out1)
            current_states = self._epsilon_closure(next_states)
        
        return any(s.is_match for s in current_states)
    
    def _epsilon_closure(self, states: set) -> set:
        """Compute epsilon closure of a state set"""
        closure = set()
        stack = list(states)
        
        while stack:
            s = stack.pop()
            if s in closure:
                continue
            closure.add(s)
            
            if s.char is None:  # epsilon transition
                if s.out1 and s.out1 not in closure:
                    stack.append(s.out1)
                if s.out2 and s.out2 not in closure:
                    stack.append(s.out2)
        
        return closure

1.6 The Fundamental Difference Between the Two Engines

Property	Backtracking	Thompson NFA
Time complexity	Worst O(2^n)	Guaranteed O(nm)
Backreferences	Yes	No
Greedy/Lazy	Yes	Needs extra work
Performance (simple patterns)	Fast	Fast
Performance (malicious patterns)	Catastrophically slow	Still fast
Used by	Python, Java, JS, Ruby	RE2, Rust regex, Go

Level 2 · How It Works Under the Hood

2.1 The Edit Graph Model of Myers Diff

Myers diff's core is transforming the diff problem into a shortest-path problem on a graph.

Edit Graph:

An (n+1) x (m+1) grid graph
Horizontal edge (x,y) -> (x+1,y): delete a[x], cost 1
Vertical edge (x,y) -> (x,y+1): insert b[y], cost 1
Diagonal edge (x,y) -> (x+1,y+1): only when a[x] == b[y], cost 0
Goal: find minimum-cost path from (0,0) to (n,m)

Diagonal representation:

k = x - y identifies which diagonal we're on
Start at diagonal k=0
End point (n,m) is on diagonal k = n-m
Each horizontal move: k increases by 1
Each vertical move: k decreases by 1

BFS by layers:

Layer d = exactly d non-diagonal moves (edit distance = d)
Start from d=0, each layer tries all reachable diagonals
On each diagonal, walk as far as possible along the diagonal (free matches)
Until some diagonal reaches (n,m)

This is why the algorithm is O(ND) — the outer loop runs D times (shortest edit distance), inner loop processes 2d+1 diagonals.

2.2 NFA to DFA Conversion (Subset Construction)

Each step of Thompson NFA matching maintains a set of states. If we pre-enumerate all possible state sets, we get a DFA (deterministic finite automaton):

def nfa_to_dfa(nfa_start, alphabet: set[str]):
    """
    Subset Construction (Powerset Construction)
    Convert NFA to equivalent DFA
    
    Rabin & Scott, "Finite Automata and Their Decision Problems", 1959
    
    Each DFA state = a subset of NFA states
    Worst case: DFA state count = 2^(NFA state count) (exponential blowup)
    """
    
    def epsilon_closure(states):
        """Compute epsilon closure"""
        closure = set()
        stack = list(states)
        while stack:
            s = stack.pop()
            if id(s) in closure:
                continue
            closure.add(id(s))
            if s.char is None:
                if s.out1:
                    stack.append(s.out1)
                if s.out2:
                    stack.append(s.out2)
        return frozenset(closure)
    
    # DFA start state = epsilon closure of NFA start state
    start_closure = epsilon_closure({nfa_start})
    
    # BFS to build DFA
    dfa_states = {start_closure: 0}
    dfa_transitions = {}
    queue = [start_closure]
    
    while queue:
        current = queue.pop(0)
        current_id = dfa_states[current]
        
        for c in alphabet:
            next_states = set()  # move(current, c)
            next_closure = epsilon_closure(next_states) if next_states else frozenset()
            
            if not next_closure:
                continue
            
            if next_closure not in dfa_states:
                dfa_states[next_closure] = len(dfa_states)
                queue.append(next_closure)
            
            dfa_transitions[(current_id, c)] = dfa_states[next_closure]
    
    return dfa_states, dfa_transitions

DFA advantages and disadvantages:

Advantage: each matching step is a single table lookup, O(n) matching
Disadvantage: construction may explode exponentially (2^m states)
Practical approach: lazy DFA construction — only build new DFA states on demand

2.3 RE2's Hybrid Strategy

Google's RE2 regex engine (designed by Russ Cox) uses a hybrid strategy:

Small regex: Direct NFA simulation (state sets are usually small)
Frequently used regex: Incrementally build DFA (cache previously seen state sets)
When DFA is too large: Fall back to NFA simulation with DFA cache cap

class HybridRegexEngine:
    """
    Hybrid regex engine (RE2 style)
    
    Approach:
    1. Start with NFA simulation
    2. Cache observed (current_state_set, input_char) -> next_state_set
    3. When cache is large enough, effectively builds a partial DFA
    4. Set cache cap to prevent memory explosion
    """
    
    def __init__(self, max_cache_size: int = 10000):
        self.max_cache_size = max_cache_size
        self.cache = {}
        self.cache_hits = 0
        self.cache_misses = 0
    
    def step(self, current_states: frozenset, char: str, 
             nfa_transitions) -> frozenset:
        """Single transition step: check cache or compute"""
        key = (current_states, char)
        
        if key in self.cache:
            self.cache_hits += 1
            return self.cache[key]
        
        self.cache_misses += 1
        
        next_states = set()
        for state in current_states:
            if state in nfa_transitions and char in nfa_transitions[state]:
                next_states.update(nfa_transitions[state][char])
        
        result = frozenset(next_states)
        
        if len(self.cache) < self.max_cache_size:
            self.cache[key] = result
        
        return result

2.4 Git Diff Optimization Details

The actual Git diff is not pure Myers algorithm. It applies several optimizations:

1. First strip common prefix and suffix

def optimized_diff(a: list[str], b: list[str]) -> list[tuple[str, str]]:
    """Git-style optimization: strip common prefix and suffix first"""
    prefix_len = 0
    while prefix_len < len(a) and prefix_len < len(b) and a[prefix_len] == b[prefix_len]:
        prefix_len += 1
    
    suffix_len = 0
    while (suffix_len < len(a) - prefix_len and 
           suffix_len < len(b) - prefix_len and
           a[-(suffix_len+1)] == b[-(suffix_len+1)]):
        suffix_len += 1
    
    # Only run Myers on the differing middle portion
    a_mid = a[prefix_len:len(a)-suffix_len if suffix_len else len(a)]
    b_mid = b[prefix_len:len(b)-suffix_len if suffix_len else len(b)]
    
    result = []
    for line in a[:prefix_len]:
        result.append((' ', line))
    
    if a_mid or b_mid:
        diff_result = myers_diff(a_mid, b_mid)
        result.extend(diff_result)
    
    for line in a[len(a)-suffix_len:] if suffix_len else []:
        result.append((' ', line))
    
    return result

2. Linear-space optimization (Hirschberg-style)

Myers' original algorithm needs O(ND) space to save history. Git uses divide-and-conquer to reduce space to O(D^2):

Run forward and backward Myers search simultaneously from both ends
Find the middle "snake" (longest diagonal segment)
Recursively process both halves

3. Patience Diff

Git also offers --patience option using the Patience Diff algorithm:

First find "unique common lines" (lines appearing exactly once in both A and B)
Compute LCS on these lines
Use these "anchors" to split the problem, recurse on each part

Patience Diff often produces more intuitive diffs in practice (won't "align" braces from two different functions).

2.5 Levenshtein Automaton

A Levenshtein automaton is a DFA that accepts all strings within edit distance k of a given word w.

Key property (Schulz & Mihov, 2002):

For fixed k and alphabet size sigma, the Levenshtein automaton has O(n) states (n = |w|). This means we can build the automaton in O(n) time, then use it to match any candidate string of length m in O(m) time.

def build_levenshtein_nfa(word: str, max_dist: int):
    """
    Conceptual demonstration of Levenshtein NFA construction
    
    States: (i, e) means "processed first i characters of word, accumulated e edits"
    Transitions:
    - Match: (i, e) --c--> (i+1, e)  if word[i] == c
    - Replace: (i, e) --c--> (i+1, e+1)  if word[i] != c and e < k
    - Delete: (i, e) --epsilon--> (i+1, e+1)  skip word[i], e < k
    - Insert: (i, e) --c--> (i, e+1)  don't consume word, e < k
    
    Accept states: all (n, e) where e <= k
    """
    n = len(word)
    states = {}
    
    for i in range(n + 1):
        for e in range(max_dist + 1):
            states[(i, e)] = {}
    
    return states

Level 3 · What the Theory Says

3.1 Thompson (1968): The Origin of Regex Engines

Ken Thompson's paper "Programming Techniques: Regular Expression Search Algorithm" (published in CACM, June 1968) described how to compile regular expressions into machine code for the IBM 7094. This was the first practical regex engine in history.

Thompson's core contributions:

NFA construction rules: Each regex operation (concatenation, alternation, closure) has a fixed NFA construction template, each adding at most 2 states. Therefore NFA state count = O(m) (m = pattern length).
Parallel simulation: Instead of tracing one path at a time (which leads to backtracking), simultaneously trace all possible states. For each input character, update the entire set of current states.
Time guarantee: Each step processes at most m states, each with at most 2 outgoing edges, so each step is O(m). Total time: O(nm).

Thompson's precise construction rules:

1. Single character 'a':
   s --a--> e
   (One start state s, one edge labeled a, one end state e)

2. Concatenation AB:
   A.end --epsilon--> B.start
   (A's end state connects to B's start via epsilon edge)

3. Alternation A|B:
        epsilon --> A.start
   s -<
        epsilon --> B.start
   
   A.end --epsilon--> e
   B.end --epsilon--> e
   (New start s branches to A and B via epsilon,
    both ends connect to new end e via epsilon)

4. Closure A*:
        epsilon --> A.start
   s -<                   A.end --epsilon--> e
        epsilon --> e           --epsilon--> A.start (loop back)
   (Start can skip A directly to end,
    A's end can loop back to A's start)

3.2 Russ Cox (2007): "Regular Expression Matching Can Be Simple and Fast"

Russ Cox published this highly influential article in 2007 (a blog post, not a formal paper, but widely cited), demonstrating a striking comparison:

Matching regex a?^n a^n against text a^n (n optional a's followed by n mandatory a's, matching n a's):

n	Perl/Python (backtracking)	Thompson NFA
10	0.0001s	0.0001s
20	0.01s	0.0001s
25	1s	0.0001s
30	60s	0.0001s
100	Never finishes	0.0001s

Why is backtracking exponential here?

Regex a?a?a?aaa matching "aaa":

Each a? has two choices: match or skip
Backtracking tries all 2^n combinations
Only at the very end can it confirm the best matching plan

Thompson NFA maintains all possible states simultaneously, O(n) per step.

Cox's key argument:

"Perl, PCRE, Python, Ruby, Java, and many other languages have regular expression implementations based on recursive backtracking that are simple but can be exponentially slow. In contrast, Thompson's algorithm guarantees linear time."

This article directly led to Google's RE2 engine.

3.3 Myers (1986): Mathematical Foundation of Diff Algorithms

Eugene Myers rigorously proved in "An O(ND) Difference Algorithm and Its Variations" (Algorithmica, 1986):

Theorem 1: The algorithm finds the shortest edit script in O(ND) time and O(D^2) space, where N = n + m, D = edit distance.

Theorem 2: Using divide-and-conquer (similar to Hirschberg 1975), space can be reduced to O(D^2) while maintaining O(ND) time.

Key lemma: At step d in the edit graph, the furthest reachable point V[d][k] on each diagonal k satisfies:

V[d][k] = max( V[d-1][k-1] + 1, (arrived from above) V[d-1][k+1] (arrived from left) ) + (how far we can walk along diagonal)

This recurrence means the algorithm needs only O(D) space to store current layer's V values.

3.4 Kleene (1956) and Regular Language Equivalence

Stephen Kleene's 1956 paper "Representation of Events in Nerve Nets and Finite Automata" proved the equivalence between regular expressions and finite automata (Kleene's theorem):

Theorem: A language is regular if and only if it can be described by:

A regular expression
A DFA
An NFA

These three representations are interconvertible:

Regex -> NFA: Thompson construction (1968), O(m) states
NFA -> DFA: Subset construction (Rabin & Scott, 1959), worst case O(2^m) states
DFA -> Regex: State elimination, O(2^n) length

Why backreferences take regex beyond regular languages:

Modern "regular expressions" support backreferences \1, \2, etc., which actually exceed the expressive power of regular languages. For example, (a+)\1 matches "aa", "aaaa", "aaaaaa" (even-length a-strings), which is not a regular language (provable by pumping lemma).

The matching problem with backreferences is NP-Complete (Aho, 1990), which is why backtracking is unavoidable for such patterns.

3.5 Ukkonen (1985): The Diagonal Trick for Edit Distance

Esko Ukkonen's "Algorithms for Approximate String Matching" (Information and Control, 1985) gave another O(kn) edit distance algorithm (k = edit distance, n = text length), similar in spirit to Myers 1986 but from a different angle:

Myers: expand layer by layer by edit distance d, advancing in the edit graph
Ukkonen: when computing the DP table, only compute a diagonal band of width 2k+1

Both exploit the insight that "when edit distance is small, only a small portion of the search space needs exploration."

Level 4 · Edge Cases and Pitfalls

4.1 Catastrophic Backtracking (ReDoS)

Catastrophic Backtracking or ReDoS (Regular Expression Denial of Service) is a class of security vulnerabilities: crafted inputs can make a regex engine run for exponential time.

Typical problematic patterns:

import re
import time

# Dangerous pattern 1: nested quantifiers
# (a+)+ matching "aaaaX" causes catastrophic backtracking
# because there are countless ways to partition a's into groups

# Dangerous pattern 2: overlapping alternatives
# (a|a)* matching "aaaaX" is equally catastrophic

# Dangerous pattern 3: real-world example
# ^(([a-z])+.)+[A-Z]([a-z])+$  # email validation

def demonstrate_redos():
    """Demonstrate ReDoS attack"""
    pattern = re.compile(r'(a+)+b')
    
    for n in [10, 15, 20, 22, 24]:
        text = 'a' * n + 'X'  # Will never match
        start = time.time()
        try:
            result = pattern.match(text)
        except Exception:
            pass
        elapsed = time.time() - start
        print(f"n={n:2d}: {elapsed:.4f}s (result={result})")
        if elapsed > 5:
            print("  Too slow, stopping test")
            break

# demonstrate_redos()  # Uncomment to run (warning: may be very slow!)

Root cause of ReDoS:

Backtracking engines must undo all choices made when matching fails. If the pattern has nested quantifiers or overlapping alternatives, the number of choices is exponential.

Defense strategies:

Use Thompson NFA engines: RE2, Rust regex, Go regex — guarantee linear time
Static analysis tools: Detect problematic regex patterns
Set timeouts: Limit maximum regex matching time
Atomic groups and possessive quantifiers: Prevent backtracking (requires engine support)

# Defense in Python
import signal

class RegexTimeout(Exception):
    pass

def timeout_handler(signum, frame):
    raise RegexTimeout("Regex match timeout!")

def safe_regex_match(pattern: str, text: str, timeout_sec: float = 1.0):
    """Safe regex matching with timeout"""
    signal.signal(signal.SIGALRM, timeout_handler)
    signal.setitimer(signal.ITIMER_REAL, timeout_sec)
    try:
        result = re.match(pattern, text)
        signal.setitimer(signal.ITIMER_REAL, 0)
        return result
    except RegexTimeout:
        return None  # Timeout, treat as no match

4.2 Interview Problem: Regular Expression Matching (LeetCode #10)

def is_match(s: str, p: str) -> bool:
    """
    LeetCode #10: Regular Expression Matching
    
    Implement regex matching supporting '.' and '*'
    '.' matches any single character
    '*' matches zero or more of the preceding element
    
    Method: Dynamic Programming
    dp[i][j] = whether s[:i] matches p[:j]
    
    Time: O(nm)
    Space: O(nm), optimizable to O(m)
    """
    n, m = len(s), len(p)
    
    dp = [[False] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = True
    
    # Initialize: empty string s matches "a*b*c*..." patterns
    for j in range(2, m + 1):
        if p[j-1] == '*':
            dp[0][j] = dp[0][j-2]
    
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if p[j-1] == '*':
                # Two choices for *:
                # 1. Match zero times: skip last two pattern chars
                dp[i][j] = dp[i][j-2]
                # 2. Match one or more: current char matches char before *
                if p[j-2] == '.' or p[j-2] == s[i-1]:
                    dp[i][j] = dp[i][j] or dp[i-1][j]
            elif p[j-1] == '.' or p[j-1] == s[i-1]:
                dp[i][j] = dp[i-1][j-1]
    
    return dp[n][m]


# Tests
test_cases = [
    ("aa", "a", False),
    ("aa", "a*", True),
    ("ab", ".*", True),
    ("aab", "c*a*b", True),
    ("mississippi", "mis*is*p*.", False),
]

for s, p, expected in test_cases:
    result = is_match(s, p)
    status = "PASS" if result == expected else "FAIL"
    print(f"[{status}] is_match('{s}', '{p}') = {result}")

4.3 Interview Problem: Wildcard Matching (LeetCode #44)

def is_match_wildcard(s: str, p: str) -> bool:
    """
    LeetCode #44: Wildcard Matching
    
    '?' matches any single character
    '*' matches any sequence of characters (including empty)
    
    Note: '*' here differs from regex '*'!
    Regex '*' modifies the preceding character; wildcard '*' independently matches any sequence.
    
    Method 1: Dynamic Programming O(nm)
    """
    n, m = len(s), len(p)
    
    dp = [[False] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = True
    
    # Initialize: prefix of all *'s can match empty string
    for j in range(1, m + 1):
        if p[j-1] == '*':
            dp[0][j] = dp[0][j-1]
    
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if p[j-1] == '*':
                dp[i][j] = dp[i][j-1] or dp[i-1][j]
            elif p[j-1] == '?' or p[j-1] == s[i-1]:
                dp[i][j] = dp[i-1][j-1]
    
    return dp[n][m]


def is_match_wildcard_greedy(s: str, p: str) -> bool:
    """
    Method 2: Greedy two-pointer O(nm) worst case, but average O(n+m)
    
    Idea:
    - When encountering '*', record position, assume it matches empty first
    - If later matching fails, backtrack to last '*', have it match one more char
    """
    n, m = len(s), len(p)
    si, pi = 0, 0
    star_pi, star_si = -1, -1
    
    while si < n:
        if pi < m and (p[pi] == '?' or p[pi] == s[si]):
            si += 1
            pi += 1
        elif pi < m and p[pi] == '*':
            star_pi = pi
            star_si = si
            pi += 1
        elif star_pi != -1:
            star_si += 1
            si = star_si
            pi = star_pi + 1
        else:
            return False
    
    while pi < m and p[pi] == '*':
        pi += 1
    
    return pi == m


# Tests
test_cases = [
    ("aa", "a", False),
    ("aa", "*", True),
    ("cb", "?a", False),
    ("adceb", "*a*b", True),
    ("acdcb", "a*c?b", False),
]

for s, p, expected in test_cases:
    result1 = is_match_wildcard(s, p)
    result2 = is_match_wildcard_greedy(s, p)
    status = "PASS" if result1 == expected else "FAIL"
    print(f"[{status}] wildcard('{s}', '{p}') = {result1} (greedy={result2})")

4.4 Edit Distance Variants in Interviews

Variant 1: Delete-only operations (LCS problem)

If the only edit operation is "delete," then minimum deletions to transform A to B = |A| + |B| - 2*LCS(A, B).

Variant 2: One Edit Distance (LeetCode #161)

def is_one_edit_distance(s: str, t: str) -> bool:
    """
    Determine if two strings have exactly edit distance 1
    O(n) time, O(1) space
    """
    n, m = len(s), len(t)
    
    if abs(n - m) > 1:
        return False
    
    if n > m:
        return is_one_edit_distance(t, s)
    
    # Now n <= m
    for i in range(n):
        if s[i] != t[i]:
            if n == m:
                return s[i+1:] == t[i+1:]
            else:
                return s[i:] == t[i+1:]
    
    return n + 1 == m


# Tests
print(is_one_edit_distance("abc", "ab"))    # True (delete)
print(is_one_edit_distance("abc", "adc"))   # True (replace)
print(is_one_edit_distance("abc", "abcd"))  # True (insert)
print(is_one_edit_distance("abc", "abc"))   # False (same)

Variant 3: Edit Distance with Operation Recovery

def edit_distance_with_operations(a: str, b: str) -> tuple[int, list[str]]:
    """
    Compute edit distance and return the specific operation sequence
    Useful for implementing diff tools' "minimum change suggestions"
    """
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i-1] == b[j-1]:
                dp[i][j] = dp[i-1][j-1]
            else:
                dp[i][j] = 1 + min(dp[i-1][j], dp[i][j-1], dp[i-1][j-1])
    
    # Backtrack operations
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i-1] == b[j-1]:
            ops.append(f"keep '{a[i-1]}'")
            i -= 1
            j -= 1
        elif i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1] + 1:
            ops.append(f"replace '{a[i-1]}' with '{b[j-1]}'")
            i -= 1
            j -= 1
        elif j > 0 and dp[i][j] == dp[i][j-1] + 1:
            ops.append(f"insert '{b[j-1]}'")
            j -= 1
        else:
            ops.append(f"delete '{a[i-1]}'")
            i -= 1
    
    ops.reverse()
    return dp[n][m], ops


# Demo
dist, ops = edit_distance_with_operations("kitten", "sitting")
print(f"Edit distance: {dist}")
print("Operations:")
for op in ops:
    print(f"  {op}")

4.5 Regex Engine Selection in Practice

Engine	Language/Tool	Type	Backreferences	Time Guarantee
PCRE	PHP, Nginx	Backtracking	Yes	None (may be exponential)
Python re	Python	Backtracking	Yes	None
RE2	C++/Go	NFA/DFA	No	O(nm)
Rust regex	Rust	NFA/DFA	No	O(nm)
Hyperscan	Intel	DFA + HW acceleration	No	O(n)

Selection guide:

If input may come from users/external sources: must use RE2/Rust regex (prevent ReDoS)
If backreferences needed and input is controlled: Python re is sufficient
If extreme throughput needed: Hyperscan (network intrusion detection, etc.)

4.6 Comprehensive Comparison: The Full Matching Algorithm Landscape

Algorithm	Problem	Time	Use Case
KMP/BM	Exact single pattern	O(n+m)	Simple search
Aho-Corasick	Exact multi-pattern	O(n+m+z)	Content filtering
Rabin-Karp	Exact (multi-pattern)	O(n+m) expected	Plagiarism detection
Suffix Array	Text indexing	O(m log n) query	Full-text search
Edit Distance	Fuzzy matching	O(nm)	Spell checking
Myers diff	Shortest edit script	O(ND)	Version comparison
Thompson NFA	Regex matching	O(nm)	Secure search
Backtracking	Regex + backreferences	O(2^n) worst	Complex patterns

4.7 Chapter Summary

This chapter extended the concept of "matching" from exact matching in three directions:

Fuzzy matching: search allowing edit distance (BK tree, Trie + DP, Levenshtein automaton)
Diff comparison: finding the shortest edit script between two sequences (Myers diff, core of Git)
Pattern matching: regex engines (backtracking vs Thompson NFA)

Core lessons:

Backtracking is simple but dangerous — ReDoS is a real security threat
Thompson NFA guarantees linear time but doesn't support backreferences
Myers diff exploits the assumption "most differences are small," making it extremely fast in practice
Edit distance is O(nm), but many acceleration techniques exist (BK tree, Trie, diagonal optimization)

Understanding these algorithms helps not just with interview problems, but with understanding: how Git works, how search engines handle typos, and why web services can be brought down by regular expressions.

Rate this chapter

4.7 / 5 (3 ratings)