Chapter 31

Multi-Pattern Matching and Aho-Corasick

Chapter 31: Multi-Pattern Matching and Aho-Corasick

The previous chapter mastered single-pattern matching: finding one pattern in a text. But the real world frequently demands searching for multiple patterns simultaneously — content moderation must check against thousands of prohibited terms, intrusion detection systems must detect hundreds of attack signatures, antivirus software must match hundreds of thousands of virus fingerprints.

Running KMP separately for each pattern costs O(k * (n + m)) for k patterns — the more patterns, the slower. Is there a way to scan the text once and match all patterns simultaneously?

The Aho-Corasick automaton is the answer. Its core idea: build all patterns into a Trie, then add failure links similar to KMP's next array. A single scan of the text matches all patterns, with time complexity O(n + m_total + z), where z is the total number of matches.

Level 1 · What You Need to Know

31.1 Why Multi-Pattern Matching

Scenario Comparison

Scenario	# Patterns	Text Length	Single-pattern	AC Automaton
Content moderation	10,000	1,000	10M operations	1K operations
Network IDS	500	1,500	750K operations	1.5K operations
Virus scanning	100,000	10^6	10^11 operations	10^6 operations

The difference is staggering. The AC automaton's scan time is independent of the number of patterns — it depends only on text length and match count.

31.2 AC Automaton = Trie + Failure Links

Three Components

Trie: Insert all patterns into a trie
Failure Links: Similar to KMP's next array — when matching fails, tells us where to jump
Output Links: Records which complete patterns can be reported at each state

Intuitive Understanding

Imagine sliding multiple rulers (patterns) simultaneously across the text. The AC automaton stacks all rulers together (Trie). When one ruler fails to match, instead of restarting from scratch, the failure link tells you which position on another ruler can continue.

31.3 Construction Process (BFS)

Step 1: Build the Trie

class AhoCorasick:
    def __init__(self):
        self.goto = [{}]   # goto[state][char] = next_state
        self.fail = [0]    # failure links
        self.output = [[]] # output patterns at each state
        self.state_count = 1
    
    def _add_pattern(self, pattern: str, index: int):
        """Insert a pattern into the Trie"""
        state = 0
        for ch in pattern:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.output.append([])
                self.goto[state][ch] = self.state_count
                self.state_count += 1
            state = self.goto[state][ch]
        self.output[state].append(index)

Step 2: BFS to Build Failure Links

The failure link's meaning: if we're in state s (representing matched string u), fail[s] points to the Trie node matching the longest proper suffix of u.

    def _build_failure(self):
        """BFS construction of failure links"""
        from collections import deque
        queue = deque()
        
        # Initialize: root's direct children have failure links to root
        for ch, next_state in self.goto[0].items():
            self.fail[next_state] = 0
            queue.append(next_state)
        
        # BFS
        while queue:
            curr = queue.popleft()
            for ch, next_state in self.goto[curr].items():
                queue.append(next_state)
                
                # Find failure link: follow curr's fail chain upward
                fallback = self.fail[curr]
                while fallback != 0 and ch not in self.goto[fallback]:
                    fallback = self.fail[fallback]
                
                self.fail[next_state] = self.goto[fallback].get(ch, 0)
                
                if self.fail[next_state] == next_state:
                    self.fail[next_state] = 0
                
                # Merge outputs: current output + fail target's output
                self.output[next_state] = (
                    self.output[next_state] + 
                    self.output[self.fail[next_state]]
                )
            
    def build(self, patterns: list[str]):
        """Build the complete AC automaton"""
        for i, pattern in enumerate(patterns):
            self._add_pattern(pattern, i)
        self._build_failure()

Why BFS? Computing failure links requires the parent's failure link to be known already. BFS ensures level-by-level processing — shallower nodes are processed first, their failure links ready for deeper nodes to use.

31.4 Matching Process

    def search(self, text: str) -> list[tuple[int, int]]:
        """
        Search text for all patterns
        Returns [(end_position, pattern_index), ...]
        """
        results = []
        state = 0
        
        for i, ch in enumerate(text):
            # No transition for ch: follow fail chain
            while state != 0 and ch not in self.goto[state]:
                state = self.fail[state]
            
            # Attempt transition
            state = self.goto[state].get(ch, 0)
            
            # Check for outputs
            if self.output[state]:
                for pattern_idx in self.output[state]:
                    results.append((i, pattern_idx))
        
        return results

Complete Usage Example

patterns = ["he", "she", "his", "hers"]
ac = AhoCorasick()
ac.build(patterns)

text = "ushers"
matches = ac.search(text)

for end_pos, pattern_idx in matches:
    pattern = patterns[pattern_idx]
    start_pos = end_pos - len(pattern) + 1
    print(f"Found '{pattern}' at position {start_pos}-{end_pos}")

# Output:
# Found 'she' at position 1-3
# Found 'he' at position 2-3
# Found 'hers' at position 2-5

Time Complexity Analysis

Build Trie: O(m_total), where m_total is the sum of all pattern lengths
Build failure links: O(m_total * |Σ|) worst case, typically O(m_total)
Matching: O(n + z), n = text length, z = total matches

Space Complexity: O(m_total * |Σ|) with array implementation, or O(m_total) with dictionary implementation.

31.5 Step-by-Step Walkthrough

Let's trace through pattern set {"he", "she", "his", "hers"} completely.

Trie Structure

Root(0) → 'h' → (1) → 'e' → (2, output "he") → 'r' → (6) → 's' → (7, output "hers")
                     → 'i' → (3) → 's' → (4, output "his")
        → 's' → (5) → 'h' → (8) → 'e' → (9, output "she")

Failure Links

fail[1] = 0   ("h"'s longest proper suffix not in Trie)
fail[2] = 0   ("he"'s proper suffix "e" not in Trie)
fail[3] = 0   ("hi"'s proper suffix "i" not in Trie)
fail[4] = 5   ("his"'s proper suffix "s" is in Trie at state 5)
fail[5] = 0   ("s"'s proper suffix is empty)
fail[6] = 0   ("her"'s proper suffixes "r", "er" not in Trie)
fail[7] = 0   ("hers"'s proper suffixes not in Trie)
fail[8] = 1   ("sh"'s proper suffix "h" is in Trie at state 1)
fail[9] = 2   ("she"'s proper suffix "he" is in Trie at state 2)
             → output[9] merges output[2] = ["he"]
             → so output[9] = ["she", "he"]

Matching text = "ushers"

i=0, ch='u': state=0, no 'u' transition → state stays 0
i=1, ch='s': state=0 → goto[0]['s'] = 5 → state=5, no output
i=2, ch='h': state=5 → goto[5]['h'] = 8 → state=8, no output
i=3, ch='e': state=8 → goto[8]['e'] = 9 → state=9
             output[9] = ["she"(idx=1), "he"(idx=0)]
             → match "she" ending at position 3
             → match "he" ending at position 3
i=4, ch='r': state=9, no 'r' transition
             → follow fail: state = fail[9] = 2
             → goto[2] has 'r'? Yes! goto[2]['r'] = 6
             → state=6, no output
i=5, ch='s': state=6 → goto[6]['s'] = 7 → state=7
             output[7] = ["hers"(idx=3)]
             → match "hers" ending at position 5

Final result: found "she"@1, "he"@2, "hers"@2 in "ushers".

Level 2 · How It Works Under the Hood

31.6 Complete Optimized AC Implementation

The basic implementation above requires following the fail chain step by step on mismatch. In practice, we can precompute all transitions (DFA-style), eliminating runtime fallback:

class AhoCorasickOptimized:
    """Optimized AC automaton: precomputed transitions, no fail chain traversal during matching"""
    
    def __init__(self, patterns: list[str]):
        self.patterns = patterns
        self.num_states = 1
        self.goto_table = [{}]
        self.failure = [0]
        self.output_list = [[]]
        
        self._build(patterns)
    
    def _build(self, patterns: list[str]):
        # Step 1: Build Trie
        for idx, pattern in enumerate(patterns):
            state = 0
            for ch in pattern:
                if ch not in self.goto_table[state]:
                    self.goto_table.append({})
                    self.failure.append(0)
                    self.output_list.append([])
                    self.goto_table[state][ch] = self.num_states
                    self.num_states += 1
                state = self.goto_table[state][ch]
            self.output_list[state].append(idx)
        
        # Step 2: BFS to build failure links + complete transition table
        from collections import deque
        queue = deque()
        
        for ch, s in self.goto_table[0].items():
            self.failure[s] = 0
            queue.append(s)
        
        while queue:
            u = queue.popleft()
            for ch, v in self.goto_table[u].items():
                queue.append(v)
                f = self.failure[u]
                while f != 0 and ch not in self.goto_table[f]:
                    f = self.failure[f]
                self.failure[v] = self.goto_table[f].get(ch, 0)
                if self.failure[v] == v:
                    self.failure[v] = 0
                self.output_list[v] = (
                    self.output_list[v] + self.output_list[self.failure[v]]
                )
    
    def search(self, text: str) -> list[tuple[int, str]]:
        """Search text, returns (start_position, pattern) list"""
        results = []
        state = 0
        
        for i, ch in enumerate(text):
            while state != 0 and ch not in self.goto_table[state]:
                state = self.failure[state]
            state = self.goto_table[state].get(ch, 0)
            
            for idx in self.output_list[state]:
                pattern = self.patterns[idx]
                start = i - len(pattern) + 1
                results.append((start, pattern))
        
        return results

31.7 Sensitive Word Filter: Practical Implementation

Requirement: Given a list of sensitive words and user input text, replace all sensitive words with ***.

class SensitiveWordFilter:
    """AC automaton-based sensitive word filter"""
    
    def __init__(self, sensitive_words: list[str]):
        self.ac = AhoCorasickOptimized(sensitive_words)
    
    def filter_text(self, text: str) -> str:
        """Replace sensitive words with asterisks"""
        matches = self.ac.search(text)
        if not matches:
            return text
        
        # Sort by start position, merge overlapping intervals
        matches.sort(key=lambda x: x[0])
        intervals = []
        for start, pattern in matches:
            end = start + len(pattern)
            if intervals and start <= intervals[-1][1]:
                intervals[-1] = (intervals[-1][0], max(intervals[-1][1], end))
            else:
                intervals.append((start, end))
        
        # Replace
        result = []
        prev_end = 0
        for start, end in intervals:
            result.append(text[prev_end:start])
            result.append('*' * (end - start))
            prev_end = end
        result.append(text[prev_end:])
        
        return ''.join(result)
    
    def contains_sensitive(self, text: str) -> bool:
        """Quick check if text contains any sensitive word"""
        state = 0
        for ch in text:
            while state != 0 and ch not in self.ac.goto_table[state]:
                state = self.ac.failure[state]
            state = self.ac.goto_table[state].get(ch, 0)
            if self.ac.output_list[state]:
                return True
        return False


# Usage example
filter_sys = SensitiveWordFilter(["violence", "gambling", "drugs", "exploit"])
print(filter_sys.filter_text("This article discusses violence and gambling"))
# Output: "This article discusses ******** and ********"

print(filter_sys.contains_sensitive("Normal article content"))
# Output: False

Engineering Considerations

Case-insensitive matching: Convert both patterns and text to lowercase during preprocessing
Unicode support: Python strings natively handle Unicode, but watch for fullwidth/halfwidth conversions
Incremental updates: If the word list changes dynamically, the AC automaton must be rebuilt. Use a timed rebuild strategy.
Evasion detection: Users circumvent filters through spaces, homoglyphs, character substitutions — preprocessing normalization is needed

31.8 Relationship Between AC Automaton and KMP

KMP Is a Special Case of AC Automaton

When the pattern set contains only one pattern, the AC automaton degenerates to KMP:

Trie degenerates to a single chain (the pattern's characters)
Failure links degenerate to KMP's next array
BFS degenerates to linear scanning

Formal Correspondence

KMP	AC Automaton
Pattern character sequence	Root-to-leaf path in Trie
next[j] (prefix function)	fail[state] (failure link)
j's increment/decrement during matching	State transitions and fallbacks in Trie
One pattern → O(n+m)	k patterns → O(n + m_total + z)

Intuition: KMP is "one-dimensional" (falling back along a single chain); AC automaton is "multi-dimensional" (jumping through the entire Trie via failure links).

31.9 DFA Optimization

In high-performance scenarios, the AC matching process can be converted to pure DFA table lookup — precomputing the complete transition for every (state, char) pair, eliminating runtime fail chain traversal:

def build_dfa(ac: AhoCorasickOptimized, charset: str) -> list[dict]:
    """
    Convert AC automaton to complete DFA
    Precompute next state for every (state, char) pair
    """
    num_states = ac.num_states
    dfa = [{} for _ in range(num_states)]
    
    from collections import deque
    queue = deque([0])
    visited = {0}
    
    while queue:
        state = queue.popleft()
        for ch in charset:
            s = state
            while s != 0 and ch not in ac.goto_table[s]:
                s = ac.failure[s]
            next_state = ac.goto_table[s].get(ch, 0)
            dfa[state][ch] = next_state
            if next_state not in visited:
                visited.add(next_state)
                queue.append(next_state)
    
    return dfa

After DFA conversion, matching becomes simple table lookup:

def dfa_search(dfa, output_list, text):
    """DFA-based matching: exactly one table lookup per character"""
    state = 0
    results = []
    for i, ch in enumerate(text):
        state = dfa[state].get(ch, 0)
        if output_list[state]:
            for idx in output_list[state]:
                results.append((i, idx))
    return results

DFA Cost: Space increases from O(m_total) to O(m_total * |Σ|). For ASCII (|Σ|=256) with 100K states, that's 25.6M transition entries — acceptable. For Unicode (|Σ|=65536+), use dictionaries or compressed tables.

Level 3 · What the Theory Says

31.10 Aho & Corasick 1975 Paper

Historical Context

Alfred Aho and Margaret Corasick published "Efficient String Matching: An Aid to Bibliographic Search" (Communications of the ACM, Vol. 18, No. 6, 1975). The paper was motivated by Bell Labs' bibliographic retrieval system — needing to search for multiple keywords simultaneously across large document collections.

Core Contributions

Theorem 1: Given pattern set P = {p_1, ..., p_k}, a deterministic finite automaton M can be constructed in O(sum|p_i|) time such that M processes a text of length n in O(n) time, plus O(z) for reporting matches.
Algorithm Description: The paper precisely describes three phases:
- Goto construction: Insert all patterns into Trie, defining the goto function
- Failure construction: BFS computation of each state's failure transition
- Output merging: Propagating output sets through failure chains
Complexity Proof: Uses amortized analysis to prove linear time — the key insight is that total failure chain length is bounded.

Connection to Unix Tools

Aho is the author of Unix tools egrep and fgrep. fgrep (fixed string grep) internally implements the AC automaton. When you run fgrep -f patterns.txt text.txt, the algorithm described in this chapter executes underneath.

The "A" in awk also refers to Alfred Aho — he is one of the three creators of awk.

31.11 Wu-Manber Algorithm

Motivation

The AC automaton excels when pattern count is large but individual patterns are short. When patterns are long and the alphabet is large, Boyer-Moore-style "skipping" may be faster. The Wu-Manber algorithm (1994) generalizes Boyer-Moore's ideas to multi-pattern matching.

Core Idea

Sun Wu and Udi Manber proposed in "A Fast Algorithm for Multi-Pattern Searching" (Technical Report TR-94-17, University of Arizona):

Take m_min = minimum length among all patterns
Use a hash table (shift table) recording skip distances for text blocks of length B (typically B=2 or 3)
If the current B-character block doesn't match any pattern's suffix, safely skip m_min - B + 1 positions

Comparison with AC Automaton

Property	AC Automaton	Wu-Manber
Best for	Short patterns, many patterns	Long patterns, large alphabet
Worst case	O(n + z)	O(n * m)
Average case	O(n + z)	O(n * B / m_min)
Space	O(m_total)	O(
Real-world use	fgrep, IDS	agrep (approximate matching)

Wu-Manber's Advantage: When pattern length >= 5 and alphabet >= 256, average performance can exceed AC automaton — because it can skip large portions of text. However, its worst case is worse than AC automaton's.

31.12 Correctness Proof of AC Automaton

Lemma 1: Trie Correctness

If text[i..j] exactly equals some pattern p, then starting from the Trie root and feeding characters text[i], text[i+1], ..., text[j], the final state s satisfies: p is in output[s].

Proof: Trie construction guarantees this — each pattern corresponds to a unique root-to-state path.

Lemma 2: Failure Chain Correctness

For any state s (representing string u), failure[s] points to the state representing u's longest proper suffix v such that v is a prefix of some path in the Trie.

Proof: BFS level-order traversal ensures that when computing failure[v], all shallower states' failure links are already correct. Induction completes the proof.

Theorem: AC Automaton Finds All Matches

After running the AC automaton on text, all patterns occurring in text are reported.

Proof sketch:

If pattern p occurs at text[i..j], then after processing text[j], the current state s represents a string having p as suffix
Through transitivity of failure chains, output[s] contains p (directly or via failure chain merging)
Output merging during construction guarantees completeness

31.13 Rigorous Complexity Proof

Construction Phase: O(m_total * |Σ|)

Trie construction is O(m_total) (one insertion per character). BFS for failure links: each state dequeued once, each edge visited once: O(m_total). The while fallback in failure computation needs additional analysis — similar to KMP's amortized argument, total fallback across all states doesn't exceed O(m_total).

Matching Phase: O(n + z)

Outer loop: n iterations (one per text character)
While fallback: similar to KMP's amortized analysis. State "depth" increases by at most 1 per outer iteration (successful transition); while loop decreases depth each time. Total fallback count bounded by n.
Output collection: O(z)

Total time = O(n + z).

Level 4 · Edge Cases and Pitfalls

31.14 AC Automaton in Real Engineering

Web Application Firewall (WAF)

WAFs must detect hundreds of SQL injection, XSS attack signatures in every HTTP request. For example, ModSecurity's core rule set contains hundreds of attack patterns:

attack_patterns = [
    "union select",
    "' or 1=1",
    "<script>",
    "javascript:",
    "../../../etc/passwd",
    "cmd.exe",
    "; drop table",
    "exec xp_",
]

class WAF:
    def __init__(self, signatures: list[str]):
        self.ac = AhoCorasickOptimized([s.lower() for s in signatures])
        self.signatures = signatures
    
    def inspect(self, request_body: str) -> tuple[bool, list[str]]:
        """
        Check request body for attack signatures
        Returns (should_block, matched_signatures)
        """
        matches = self.ac.search(request_body.lower())
        if matches:
            matched_sigs = list(set(sig for _, sig in matches))
            return True, matched_sigs
        return False, []

Intrusion Detection Systems (IDS)

Snort and Suricata use AC automaton variants to detect malicious signatures in network traffic. Requirements:

Multi-gigabit throughput per second
Tens of thousands of rules/signatures
Real-time response (microsecond latency)

These systems typically use DFA-converted AC automata + SIMD acceleration to achieve line-rate processing.

Multi-Pattern Matching Pipeline in IDS

Network packet → Protocol parsing → Content extraction → AC matching → Rule engine → Alert/Block
                                                            ↑
                                                     Precompiled DFA
                                                     (loaded at startup)

31.15 Performance Optimization Techniques

1. Compressed Storage

For large character sets (e.g., UTF-8), full |Σ|-sized arrays are impractical. Common strategies:

Dictionary storage: Natural choice in Python, but has hash overhead
Double-Array Trie: Proposed by Aoe (1989), compact Trie representation with O(1) lookup and minimal memory
Layered mapping: Direct indexing for ASCII 0-127, hash for >127

class CompactAC:
    """AC automaton with character class compression"""
    
    def __init__(self, patterns: list[str]):
        # Analyze charset, build char-to-class mapping
        charset = set()
        for p in patterns:
            charset.update(p)
        
        self.char_map = {ch: i for i, ch in enumerate(sorted(charset))}
        self.num_classes = len(self.char_map)
        
        # Use 2D array instead of dictionary
        # goto_array[state * num_classes + char_class] = next_state
        # Much faster than dictionary lookup
        self._build(patterns)

2. Cache-Friendly Memory Layout

The AC automaton's performance bottleneck is often memory access patterns. Optimization strategies:

Pack frequently co-accessed data (goto, output, failure) contiguously
Number states in BFS order — ensuring adjacent state numbers are also adjacent in Trie, benefiting cache prefetch
Place hot states (near root, frequently accessed) at the front of memory

3. SIMD Acceleration

Modern CPUs' SIMD instructions (SSE4.2's PCMPISTRI/PCMPISTRM) can compare 16 bytes at once:

// Pseudocode: process 16 bytes at a time with SIMD
for each 16-byte block of text:
    if any byte matches interesting character set:
        fall back to AC automaton for that region
    else:
        skip entire block (all bytes are "boring")

This "pre-filter + precise matching" hybrid strategy is widely used in real IDS implementations, improving throughput 5-10x.

4. Failure Link Optimization — Suffix Link Compression

For deep nodes, following the failure chain may pass through many intermediate nodes without outputs. We can skip directly to the nearest ancestor with output — the "dictionary suffix link" optimization:

def build_dict_suffix_links(ac):
    """Precompute dictionary suffix links"""
    dict_link = [0] * ac.num_states
    from collections import deque
    queue = deque()
    
    for ch, s in ac.goto_table[0].items():
        dict_link[s] = 0
        queue.append(s)
    
    while queue:
        u = queue.popleft()
        for ch, v in ac.goto_table[u].items():
            queue.append(v)
            f = ac.failure[v]
            if ac.output_list[f]:
                dict_link[v] = f
            else:
                dict_link[v] = dict_link[f]
    
    return dict_link

31.16 AC Automaton in Interviews

Frequency of Appearance

The AC automaton rarely appears in general interviews (it's an advanced topic), but surfaces in:

System design interviews: designing content moderation systems, URL blacklist detection
Advanced algorithm interviews: LeetCode Hard string matching variants
Backend/infrastructure roles: discussing IDS/WAF implementation principles

How to Demonstrate AC Automaton Knowledge in Interviews

State what it solves: "Multi-pattern matching in O(n + z) time"
Describe core structure: "Trie + failure links, where failure links mean longest suffix match"
Relationship to KMP: "KMP is the single-pattern special case of AC automaton"
Complexity: "Build O(m_total), match O(n + z)"
Applications: "fgrep, IDS/WAF, content moderation"

Related Interview Problems

Design a content moderation system:
- Core: AC automaton
- Extensions: handling evasion (homoglyphs, inserted spaces, character substitution)
- Scale: 100K sensitive words, 100K requests per second
Word Search II (LeetCode #212): Though typically solved with Trie + DFS, the AC automaton represents a more systematic approach.
Implement a simple regex engine: Multi-fixed-string OR matching (a|b|c) is essentially an AC automaton.

31.17 Limitations of AC Automaton

Scenarios Where It Doesn't Apply

Wildcard matching: AC automaton only does exact matching — no * or ? support. Need NFA or other approaches.
Approximate matching: Searching for substrings within edit distance k of a pattern — AC automaton cannot handle directly. Requires combining with DP (e.g., Myers' bit-parallel algorithm).
Very long single pattern: If there's only one extremely long pattern (e.g., 100MB), AC automaton's space overhead is wasteful — use KMP or Boyer-Moore directly.
Dynamic pattern set: If patterns are frequently added/removed, rebuilding the AC automaton each time is costly. Consider managing multiple smaller AC automata in batches.

Relationship to Regex Engines

Fixed string matching: KMP / Boyer-Moore
Multiple fixed strings: AC automaton
With wildcards: NFA simulation
Full regex: NFA → DFA conversion (Thompson construction + subset construction)

The AC automaton is a specialization of regex engines — when all patterns are fixed strings, it's the optimal solution.

31.18 AC Automaton Variants in Real Systems

Commentz-Walter Algorithm (1979)

Combines Boyer-Moore's right-to-left comparison with AC automaton. Achieves large jumps on mismatch. Suited for multi-pattern matching with longer patterns.

Set-BM (Set Boyer-Moore)

Optimized for multiple long patterns. Uses the shortest pattern's length to determine the text scan window, then applies BM skip strategy within the window.

Aho-Corasick + Bit-Parallel

Combines AC automaton with bit-parallel techniques, leveraging CPU word width to process multiple transitions simultaneously. Particularly efficient when total pattern count <= 64.

Choices in Real Products

Product	Algorithm	Reason
Snort 3	AC + Intel Hyperscan	Line-rate needed, hardware acceleration
ClamAV	AC + prefix filtering	Large number of virus signatures
Cloudflare WAF	AC + SIMD	Low-latency requirement
grep (GNU)	BM + AC hybrid	Adaptive single/multi-pattern
Python re	NFA → DFA (pyahocorasick extension)	Generality + performance

Chapter Summary

AC Automaton = Trie + Failure Links: Reduces multi-pattern matching time from O(k * n) to O(n + z), independent of pattern count.
Construction: Insert all patterns into Trie, BFS-compute each node's failure link (pointing to longest suffix match), then merge output chains.
Relationship to KMP: AC automaton generalizes KMP from one dimension (single chain) to multiple dimensions (Trie). Failure links are the multi-pattern version of KMP's next array.
Aho & Corasick 1975 Paper: Established the theoretical foundation for multi-pattern exact matching, directly giving rise to the Unix fgrep tool.
Engineering Practice: WAF, IDS, and content moderation are core application scenarios. Performance optimizations include DFA conversion, double-array Trie, SIMD pre-filtering, and dictionary suffix links.
Interview Essentials: Understand why multi-pattern matching is needed, describe the construction process (Trie → BFS failure → output merging), and know the origin of O(n + m_total + z) complexity.

Rate this chapter

4.6 / 5 (3 ratings)