Chapter 30

String Matching: From Brute Force to KMP

Chapter 30: String Matching โ€” From Brute Force to KMP

"Finding a pattern within a text" โ€” this deceptively simple problem gave rise to one of the most elegant algorithms in computer science. From naive character-by-character comparison to KMP's single-pass, no-backtrack scan of the text, the underlying insight is the profound reuse of mismatch information.

String matching is the core operation behind editor Ctrl+F, the grep command, DNA sequence alignment, and intrusion detection systems. Understanding these algorithms is not just interview preparation โ€” it exemplifies the algorithmic design philosophy of "extracting reusable information from problem structure."


Level 1 ยท What You Need to Know

30.1 Brute Force Matching

Problem Definition

Given a text string text (length n) and a pattern string pattern (length m), find all positions where pattern occurs in text.

Brute Force Approach

Try every starting position i in text (from 0 to n-m), comparing text[i..i+m-1] with pattern[0..m-1] character by character.

def brute_force_search(text: str, pattern: str) -> list[int]:
    """Brute force string matching, returns all match positions"""
    n, m = len(text), len(pattern)
    if m == 0:
        return list(range(n + 1))
    
    positions = []
    for i in range(n - m + 1):
        match = True
        for j in range(m):
            if text[i + j] != pattern[j]:
                match = False
                break
        if match:
            positions.append(i)
    return positions

Time Complexity: Worst case O(n * m). Example: text = "aaaa...a" (n a's), pattern = "aaa...ab" (m-1 a's + b). Every position requires m comparisons before discovering the mismatch.

Why isn't brute force good enough? Consider searching for a 100-character keyword in a 1GB log file. n = 10^9, m = 100, worst case requires 10^11 comparisons. We need an O(n) algorithm.

30.2 The Core Intuition Behind KMP

KMP's Key Observation

When brute force fails at text[i+j] != pattern[j], it shifts the start position by one (i โ†’ i+1) and restarts comparison from scratch. But we already know that text[i..i+j-1] == pattern[0..j-1] โ€” these matched characters contain valuable information!

KMP's central question: When a mismatch occurs, how far can we slide the pattern to the right without missing any potential match?

The answer depends on the length of the longest proper prefix of pattern[0..j-1] that is also a suffix. This is the next array (also called failure function or prefix function).

Intuitive Example

text:    a b a b a c
pattern: a b a b a d
                  โ†‘ mismatch at position 5 (pattern[5]='d' != text[5]='c')

Matched portion: pattern[0..4] = "ababa"
Longest proper prefix == proper suffix of "ababa": "aba" (length 3)

This means: the last 3 characters matched in text ("aba") are also a prefix of pattern
So we can continue comparing from pattern[3], without backtracking the text pointer!

30.3 Building the Next Array

Definition

next[j] (also written as pi[j] or lps[j]) is the length of the longest proper prefix of pattern[0..j] that is also a suffix of pattern[0..j].

Examples:

Construction Algorithm

Building the next array is itself a process of "matching pattern against itself":

def build_next(pattern: str) -> list[int]:
    """
    Build KMP's next array (prefix function)
    next[i] = length of longest proper prefix-suffix of pattern[0..i]
    """
    m = len(pattern)
    next_arr = [0] * m  # next_arr[0] = 0, single char has no proper prefix
    
    j = 0  # length of matched prefix
    for i in range(1, m):
        # Fall back on mismatch
        while j > 0 and pattern[i] != pattern[j]:
            j = next_arr[j - 1]
        
        # Match succeeds
        if pattern[i] == pattern[j]:
            j += 1
        
        next_arr[i] = j
    
    return next_arr

Step-by-Step Walkthrough (pattern = "ababac")

i=0: next[0] = 0 (by definition)
i=1: pattern[1]='b', pattern[0]='a', mismatch โ†’ next[1] = 0
i=2: pattern[2]='a', pattern[0]='a', match โ†’ j=1, next[2] = 1
i=3: pattern[3]='b', pattern[1]='b', match โ†’ j=2, next[3] = 2
i=4: pattern[4]='a', pattern[2]='a', match โ†’ j=3, next[4] = 3
i=5: pattern[5]='c', pattern[3]='b', mismatch
     j = next[2] = 1
     pattern[5]='c', pattern[1]='b', mismatch
     j = next[0] = 0
     pattern[5]='c', pattern[0]='a', mismatch
     next[5] = 0

Result: next = [0, 0, 1, 2, 3, 0]

Time Complexity: O(m). Although there's a while loop, the total increment of j never exceeds m, and total decrement also never exceeds m.

30.4 Complete KMP Matching Process

def kmp_search(text: str, pattern: str) -> list[int]:
    """
    KMP string matching
    Returns all starting positions where pattern occurs in text
    """
    if not pattern:
        return list(range(len(text) + 1))
    
    n, m = len(text), len(pattern)
    next_arr = build_next(pattern)
    
    positions = []
    j = 0  # matched length in pattern
    
    for i in range(n):
        # On mismatch, use next array to fall back pattern pointer
        while j > 0 and text[i] != pattern[j]:
            j = next_arr[j - 1]
        
        # Current character matches
        if text[i] == pattern[j]:
            j += 1
        
        # Complete match found
        if j == m:
            positions.append(i - m + 1)
            j = next_arr[j - 1]  # Continue searching for next match
    
    return positions

Full Matching Walkthrough (text = "ababababac", pattern = "ababac")

next = [0, 0, 1, 2, 3, 0]

i=0: text[0]='a', pattern[0]='a' โ†’ j=1
i=1: text[1]='b', pattern[1]='b' โ†’ j=2
i=2: text[2]='a', pattern[2]='a' โ†’ j=3
i=3: text[3]='b', pattern[3]='b' โ†’ j=4
i=4: text[4]='a', pattern[4]='a' โ†’ j=5
i=5: text[5]='b', pattern[5]='c' โ†’ mismatch!
     j = next[4] = 3
     text[5]='b', pattern[3]='b' โ†’ j=4
i=6: text[6]='a', pattern[4]='a' โ†’ j=5
i=7: text[7]='b', pattern[5]='c' โ†’ mismatch!
     j = next[4] = 3
     text[7]='b', pattern[3]='b' โ†’ j=4
i=8: text[8]='a', pattern[4]='a' โ†’ j=5
i=9: text[9]='c', pattern[5]='c' โ†’ j=6 = m โ†’ match found!
     position = 9 - 6 + 1 = 4
     j = next[5] = 0

Result: [4] (pattern "ababac" occurs at position 4 in text)

Core Properties of KMP

  1. Text pointer i never backtracks: i strictly increases, incrementing by 1 each outer iteration
  2. Time complexity O(n+m): preprocessing O(m), matching O(n)
  3. Online algorithm: text can be streamed in; each character read immediately determines match status

Common Mistakes

  1. Index offset in next array: Some implementations use next[0] = -1 as sentinel, others use lps[0] = 0. Clarify your convention in interviews.
  2. Forgetting to fall back after a match: After finding a match, j should be set to next[j-1] (not 0) โ€” otherwise overlapping matches are missed.
  3. Empty pattern: Behavior when pattern is empty needs explicit definition.

Level 2 ยท How It Works Under the Hood

30.5 Boyer-Moore Algorithm

Boyer-Moore (1977) is the fastest single-pattern matching algorithm in practice. Its core idea: compare right-to-left, using mismatch information to achieve large jumps.

Bad Character Rule

When pattern[j] mismatches text[i+j], look at the "bad character" text[i+j] and find its rightmost occurrence position k in pattern:

def build_bad_char_table(pattern: str) -> dict[str, int]:
    """Bad character table: rightmost position of each char in pattern"""
    table = {}
    for i, ch in enumerate(pattern):
        table[ch] = i
    return table

Good Suffix Rule

When mismatch occurs at pattern[j], the already-matched suffix pattern[j+1..m-1] (the good suffix) is used to determine the shift:

The good suffix preprocessing is more complex, but it guarantees worst-case O(n) matching.

Simplified Boyer-Moore (Bad Character Rule Only)

def boyer_moore_search(text: str, pattern: str) -> list[int]:
    """Boyer-Moore matching (simplified: bad character rule only)"""
    n, m = len(text), len(pattern)
    if m == 0:
        return list(range(n + 1))
    
    bad_char = build_bad_char_table(pattern)
    
    positions = []
    i = 0  # current alignment position in text
    
    while i <= n - m:
        j = m - 1  # compare right to left
        
        while j >= 0 and pattern[j] == text[i + j]:
            j -= 1
        
        if j < 0:
            positions.append(i)
            i += 1  # simplified; can jump further
        else:
            bad_char_pos = bad_char.get(text[i + j], -1)
            shift = max(1, j - bad_char_pos)
            i += shift
    
    return positions

Boyer-Moore's Average Performance

On natural language text, Boyer-Moore averages about n/m comparisons โ€” fewer than the text length! This is because with a large alphabet, the bad character rule frequently skips the entire pattern length. This is why grep uses a Boyer-Moore variant by default.

Boyer-Moore vs KMP

Property KMP Boyer-Moore
Worst case O(n+m) O(n+m) (full version)
Average O(n) O(n/m) (large alphabet)
Comparison direction leftโ†’right rightโ†’left
Best for Streaming input, small alphabet Large text search, large alphabet
Preprocessing O(m) O(m + |ฮฃ|)

30.6 Rabin-Karp Algorithm (Rolling Hash)

Core Idea

Instead of comparing characters one by one, use hash values for quick elimination: if hash(text[i..i+m-1]) != hash(pattern), they definitely don't match; if equal, verify character by character (handling hash collisions).

The key technique is rolling hash: computing hash(text[i+1..i+m]) from hash(text[i..i+m-1]) in O(1).

Polynomial Hash

Treat the string as a base-b number (b typically 31 or 131), taken modulo a large prime p:

$$h(s) = s[0] \cdot b^{m-1} + s[1] \cdot b^{m-2} + ... + s[m-1] \cdot b^0 \mod p$$

Rolling update: $$h(s[i+1..i+m]) = (h(s[i..i+m-1]) - s[i] \cdot b^{m-1}) \cdot b + s[i+m] \mod p$$

Complete Implementation

def rabin_karp_search(text: str, pattern: str) -> list[int]:
    """Rabin-Karp string matching"""
    n, m = len(text), len(pattern)
    if m == 0:
        return list(range(n + 1))
    if m > n:
        return []
    
    base = 131
    mod = 10**9 + 7
    
    # Compute base^(m-1) mod p
    power = pow(base, m - 1, mod)
    
    # Compute pattern hash
    pattern_hash = 0
    for ch in pattern:
        pattern_hash = (pattern_hash * base + ord(ch)) % mod
    
    # Compute hash of first m characters of text
    text_hash = 0
    for i in range(m):
        text_hash = (text_hash * base + ord(text[i])) % mod
    
    positions = []
    
    for i in range(n - m + 1):
        if text_hash == pattern_hash:
            if text[i:i + m] == pattern:
                positions.append(i)
        
        # Rolling hash update
        if i < n - m:
            text_hash = (text_hash - ord(text[i]) * power) % mod
            text_hash = (text_hash * base + ord(text[i + m])) % mod
            text_hash = (text_hash + mod) % mod  # ensure non-negative
    
    return positions

Time Complexity:

Unique Advantages of Rabin-Karp

  1. Multi-pattern matching: maintain hash values for multiple patterns simultaneously
  2. 2D pattern matching: generalizes to searching for a sub-matrix within a matrix
  3. Simple implementation: shorter code than KMP or Boyer-Moore

Handling Hash Collisions

Single hash collision probability is about 1/p. To further reduce false positives:

30.7 Z-Function (Z-array)

Definition

For string s, Z[i] is the length of the longest substring starting from position i that matches a prefix of s. Z[0] is typically defined as 0 or len(s).

Example: s = "aabxaab"

Z[0] = 7 (or 0, by convention)
Z[1] = 1  "a" matches prefix "a"
Z[2] = 0  "b..." doesn't match prefix
Z[3] = 0  "x..." doesn't match prefix
Z[4] = 3  "aab" matches prefix "aab"
Z[5] = 1  "a" matches prefix "a"
Z[6] = 0  "b" doesn't match prefix

O(n) Z-Function Construction

def build_z_array(s: str) -> list[int]:
    """Build Z-function in O(n) time"""
    n = len(s)
    z = [0] * n
    z[0] = n  # by convention
    
    l, r = 0, 0  # [l, r) is the rightmost matching interval
    
    for i in range(1, n):
        if i < r:
            # i is within known matching interval, reuse information
            z[i] = min(r - i, z[i - l])
        
        # Extend by brute force
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        
        # Update rightmost matching interval
        if i + z[i] > r:
            l, r = i, i + z[i]
    
    return z

String Matching via Z-Function

Construct pattern + "$" + text (where $ is a separator not in either string) and compute its Z-array. Positions where Z[i] == m (offset by m+1) are match positions.

def z_search(text: str, pattern: str) -> list[int]:
    """String matching using Z-function"""
    m = len(pattern)
    combined = pattern + "$" + text
    z = build_z_array(combined)
    
    positions = []
    for i in range(m + 1, len(combined)):
        if z[i] == m:
            positions.append(i - m - 1)
    return positions

Z-Function vs KMP's Next Array

The two are essentially equivalent โ€” Z-function describes "how long a prefix match starts at each position," while next array describes "the longest proper prefix-suffix of each prefix." They are inter-convertible, but each has advantages in different contexts:


Level 3 ยท What the Theory Says

30.8 Theoretical Foundations of KMP

Knuth, Morris, Pratt 1977 Paper

Donald Knuth, James Morris, and Vaughan Pratt published "Fast Pattern Matching in Strings" (SIAM Journal on Computing, Vol. 6, No. 2, 1977). The paper proves:

  1. Theorem 1: There exists an algorithm requiring at most n + m - 1 character comparisons in the worst case for matching (excluding next array construction).

  2. Theorem 2: The next array can be constructed in at most 2m - 3 comparisons.

  3. Corollary: KMP's total comparisons never exceed 2n (matching phase) + 2m (preprocessing) = 2(n+m).

KMP and Automata

KMP effectively constructs a simplified version of a deterministic finite automaton (DFA). A full DFA's transition table has size O(m * |ฮฃ|), while KMP uses only an O(m) next array to simulate all DFA transitions โ€” through chain fallback on mismatch.

Knuth noted in the paper that KMP can be viewed as an improvement on Morris and Pratt's earlier work (1970 technical report), combined with Knuth's mathematical analysis of the failure function.

Optimality

The 1977 paper also proves a lower bound: any correct string matching algorithm requires at least n + m - O(1) comparisons in the worst case. KMP's n + m - 1 nearly matches this lower bound.

30.9 Boyer-Moore Theory

Boyer and Moore 1977 Paper

Robert Boyer and J Strother Moore published "A Fast String Searching Algorithm" (Communications of the ACM, Vol. 20, No. 10, 1977) the same year.

Key Theorems:

Intuition for Sub-linear Time

Why can Boyer-Moore be faster than O(n) on average? Because it doesn't need to examine every character of text! When the alphabet is large enough (e.g., ASCII's 256 characters), the probability of the bad character not appearing in pattern is high, allowing each mismatch to skip the entire pattern length.

For m = 10, |ฮฃ| = 256: probability a random character is not in pattern โ‰ˆ (1 - 10/256) โ‰ˆ 0.96, meaning most positions allow a jump of 10 characters.

30.10 Theoretical Foundations of String Hashing

Collision Probability of Polynomial Hash

Given two distinct strings of length m, using a random base b (uniformly chosen from [1, p-1]) and prime modulus p, the probability of hash collision is:

$$\Pr[h(s1) = h(s2)] \leq \frac{m-1}{p}$$

This follows from the Schwartz-Zippel Lemma: a non-zero polynomial of degree at most m-1 over Z_p has at most m-1 roots.

Practical Parameter Choices

Carter-Wegman Universal Hashing

J. Lawrence Carter and Mark Wegman established the theoretical framework of universal hashing in "Universal Classes of Hash Functions" (Journal of Computer and System Sciences, 1979). Polynomial hashing falls within this framework.

Karp and Rabin 1987 Paper

Richard Karp and Michael Rabin formally analyzed rolling hash for string matching in "Efficient Randomized Pattern-Matching Algorithms" (IBM Journal of Research and Development, 1987), proving that with appropriate hash parameters, expected runtime is O(n + m).

30.11 Mathematical Properties of the Prefix Function

Property 1: Values Form a "Chain"

For any position j, the sequence j, next[j-1], next[next[j-1]-1], ... forms a strictly decreasing sequence down to 0. The worst-case number of hops from any position back to 0 is O(m), but the total hops across all positions is O(m) โ€” this is the essence of amortized analysis.

Property 2: Amortized Analysis

The total number of while loop executions in KMP's matching phase does not exceed n. Proof:

Property 3: Periodicity

If next[m-1] > 0 and m % (m - next[m-1]) == 0, then pattern has period length m - next[m-1].

Example: pattern = "abcabc", next[5] = 3, period = 6 - 3 = 3. Pattern is indeed "abc" repeated twice.

This property is directly used in LeetCode #459 (Repeated Substring Pattern).


Level 4 ยท Edge Cases and Pitfalls

30.12 Interview Deep Dive: Implement strStr() (LeetCode #28)

Problem

Implement strStr(): return the first occurrence position of pattern in text, or -1 if not found.

Interview Strategy

  1. Write brute force first: demonstrate problem understanding; O(n*m) brute force is only 5-6 lines
  2. If interviewer asks for optimization: implement KMP, focus on explaining the next array's meaning
  3. If time permits: discuss trade-offs between algorithms
def str_str(haystack: str, needle: str) -> int:
    """LeetCode 28: Implement strStr()"""
    if not needle:
        return 0
    
    m = len(needle)
    next_arr = [0] * m
    j = 0
    for i in range(1, m):
        while j > 0 and needle[i] != needle[j]:
            j = next_arr[j - 1]
        if needle[i] == needle[j]:
            j += 1
        next_arr[i] = j
    
    j = 0
    for i in range(len(haystack)):
        while j > 0 and haystack[i] != needle[j]:
            j = next_arr[j - 1]
        if haystack[i] == needle[j]:
            j += 1
        if j == m:
            return i - m + 1
    
    return -1

Common Follow-up Questions:

30.13 Interview Deep Dive: Repeated Substring Pattern (LeetCode #459)

Problem

Determine if string s can be formed by repeating a substring multiple times. E.g., "abab" โ†’ True ("ab" repeated twice).

Method 1: KMP's Next Array

def repeated_substring_pattern(s: str) -> bool:
    """Use KMP next array to detect repeated substring"""
    n = len(s)
    next_arr = [0] * n
    j = 0
    for i in range(1, n):
        while j > 0 and s[i] != s[j]:
            j = next_arr[j - 1]
        if s[i] == s[j]:
            j += 1
        next_arr[i] = j
    
    period = n - next_arr[n - 1]
    return next_arr[n - 1] > 0 and n % period == 0

Why does this work? If s has period p (s = t * k, |t| = p), then s[p:] == s[:n-p], meaning s's first n-p characters form a suffix of s. This implies next[n-1] >= n - p. The longest prefix-suffix length next[n-1] yields the shortest period n - next[n-1].

Method 2: String Concatenation Trick

def repeated_substring_pattern_v2(s: str) -> bool:
    """String concatenation method"""
    doubled = (s + s)[1:-1]  # remove first and last characters
    return s in doubled

Intuition: If s = "abab", then s+s = "abababab", removing first and last gives "bababa". If s is a repeated string, it will still be found in the trimmed s+s; otherwise it won't.

30.14 Interview Deep Dive: Shortest Palindrome (LeetCode #214)

Problem

Add the fewest characters possible to the front of string s to make it a palindrome.

Analysis

Key observation: find s's longest palindrome prefix. If the longest palindrome prefix has length k, prepend s[k:][::-1].

KMP Solution

Construct s + "#" + reverse(s) and compute its next array. The last entry gives the length of s's longest palindrome prefix.

def shortest_palindrome(s: str) -> str:
    """LeetCode 214: Shortest Palindrome"""
    if not s:
        return s
    
    rev_s = s[::-1]
    combined = s + "#" + rev_s
    
    n = len(combined)
    next_arr = [0] * n
    j = 0
    for i in range(1, n):
        while j > 0 and combined[i] != combined[j]:
            j = next_arr[j - 1]
        if combined[i] == combined[j]:
            j += 1
        next_arr[i] = j
    
    longest_palindrome_prefix = next_arr[-1]
    suffix_to_add = rev_s[:len(s) - longest_palindrome_prefix]
    return suffix_to_add + s

Why does this work?

In s + "#" + rev_s, the last entry of next array gives k = length of longest prefix of combined (which is a prefix of s) that equals a suffix of combined (which is a suffix of rev_s = a reversed prefix of s).

"A prefix of s equals the reverse of a prefix of s" โ€” that's exactly the definition of a palindrome prefix! So k is the length of s's longest palindrome prefix.

Example: s = "aacecaaa"

30.15 Engineering Practice for String Algorithms

Algorithm Selection in Real Systems

Scenario Recommended Reason
Editor find (Ctrl+F) Boyer-Moore variant Large alphabet, user patterns are long
grep/ripgrep Aho-Corasick + SIMD Multi-pattern + hardware acceleration
DNA sequence matching KMP or suffix array Small alphabet (4 chars), exact matching
Network content filtering Rabin-Karp multi-hash Need to match multiple patterns simultaneously
Database LIKE queries Brute force (short patterns) Patterns usually short; setup overhead exceeds benefit

Python's Built-in Implementation

Python's str.find() and in operator use a modified Boyer-Moore-Horspool algorithm (CPython implementation in Objects/stringlib/fastsearch.h). For short patterns (< 6 characters), it degrades to brute force since preprocessing overhead isn't worthwhile.

Interview Recommendations

  1. Default to brute force first โ€” concise and correct
  2. If optimization is requested, KMP is the safest choice (deterministic O(n+m), moderate code length)
  3. Mention Boyer-Moore as the practically faster alternative
  4. Use Rabin-Karp when multi-pattern matching is needed or interviewer hints "hashing"

Chapter Summary

  1. Brute Force: O(n*m), simple but insufficient for large data.

  2. KMP Algorithm: O(n+m). Core is the next array โ€” recording each prefix's longest proper prefix-suffix, enabling no-backtrack matching. Proposed by Knuth, Morris, Pratt (1977).

  3. Boyer-Moore: Fastest in practice, average O(n/m). Compares right-to-left, leveraging bad character and good suffix rules for large jumps.

  4. Rabin-Karp: Rolling hash achieves O(1) window sliding, expected O(n+m). Strengths: multi-pattern matching and simple implementation.

  5. Z-Function: Equivalent to KMP but sometimes more intuitive. Z[i] directly gives the prefix match length starting at position i.

  6. Interview Essentials: Master KMP's next array construction and matching process, be able to write the code from scratch, and explain the amortized O(n+m) proof.

Rate this chapter
4.7  / 5  (3 ratings)

๐Ÿ’ฌ Comments