String Matching: From Brute Force to KMP
Chapter 30: String Matching โ From Brute Force to KMP
"Finding a pattern within a text" โ this deceptively simple problem gave rise to one of the most elegant algorithms in computer science. From naive character-by-character comparison to KMP's single-pass, no-backtrack scan of the text, the underlying insight is the profound reuse of mismatch information.
String matching is the core operation behind editor Ctrl+F, the grep command, DNA sequence alignment, and intrusion detection systems. Understanding these algorithms is not just interview preparation โ it exemplifies the algorithmic design philosophy of "extracting reusable information from problem structure."
Level 1 ยท What You Need to Know
30.1 Brute Force Matching
Problem Definition
Given a text string text (length n) and a pattern string pattern (length m), find all positions where pattern occurs in text.
Brute Force Approach
Try every starting position i in text (from 0 to n-m), comparing text[i..i+m-1] with pattern[0..m-1] character by character.
def brute_force_search(text: str, pattern: str) -> list[int]:
"""Brute force string matching, returns all match positions"""
n, m = len(text), len(pattern)
if m == 0:
return list(range(n + 1))
positions = []
for i in range(n - m + 1):
match = True
for j in range(m):
if text[i + j] != pattern[j]:
match = False
break
if match:
positions.append(i)
return positions
Time Complexity: Worst case O(n * m). Example: text = "aaaa...a" (n a's), pattern = "aaa...ab" (m-1 a's + b). Every position requires m comparisons before discovering the mismatch.
Why isn't brute force good enough? Consider searching for a 100-character keyword in a 1GB log file. n = 10^9, m = 100, worst case requires 10^11 comparisons. We need an O(n) algorithm.
30.2 The Core Intuition Behind KMP
KMP's Key Observation
When brute force fails at text[i+j] != pattern[j], it shifts the start position by one (i โ i+1) and restarts comparison from scratch. But we already know that text[i..i+j-1] == pattern[0..j-1] โ these matched characters contain valuable information!
KMP's central question: When a mismatch occurs, how far can we slide the pattern to the right without missing any potential match?
The answer depends on the length of the longest proper prefix of pattern[0..j-1] that is also a suffix. This is the next array (also called failure function or prefix function).
Intuitive Example
text: a b a b a c
pattern: a b a b a d
โ mismatch at position 5 (pattern[5]='d' != text[5]='c')
Matched portion: pattern[0..4] = "ababa"
Longest proper prefix == proper suffix of "ababa": "aba" (length 3)
This means: the last 3 characters matched in text ("aba") are also a prefix of pattern
So we can continue comparing from pattern[3], without backtracking the text pointer!
30.3 Building the Next Array
Definition
next[j] (also written as pi[j] or lps[j]) is the length of the longest proper prefix of pattern[0..j] that is also a suffix of pattern[0..j].
Examples:
- pattern = "abab": lps = [0, 0, 1, 2]
- pattern = "aaaa": lps = [0, 1, 2, 3]
Construction Algorithm
Building the next array is itself a process of "matching pattern against itself":
def build_next(pattern: str) -> list[int]:
"""
Build KMP's next array (prefix function)
next[i] = length of longest proper prefix-suffix of pattern[0..i]
"""
m = len(pattern)
next_arr = [0] * m # next_arr[0] = 0, single char has no proper prefix
j = 0 # length of matched prefix
for i in range(1, m):
# Fall back on mismatch
while j > 0 and pattern[i] != pattern[j]:
j = next_arr[j - 1]
# Match succeeds
if pattern[i] == pattern[j]:
j += 1
next_arr[i] = j
return next_arr
Step-by-Step Walkthrough (pattern = "ababac")
i=0: next[0] = 0 (by definition)
i=1: pattern[1]='b', pattern[0]='a', mismatch โ next[1] = 0
i=2: pattern[2]='a', pattern[0]='a', match โ j=1, next[2] = 1
i=3: pattern[3]='b', pattern[1]='b', match โ j=2, next[3] = 2
i=4: pattern[4]='a', pattern[2]='a', match โ j=3, next[4] = 3
i=5: pattern[5]='c', pattern[3]='b', mismatch
j = next[2] = 1
pattern[5]='c', pattern[1]='b', mismatch
j = next[0] = 0
pattern[5]='c', pattern[0]='a', mismatch
next[5] = 0
Result: next = [0, 0, 1, 2, 3, 0]
Time Complexity: O(m). Although there's a while loop, the total increment of j never exceeds m, and total decrement also never exceeds m.
30.4 Complete KMP Matching Process
def kmp_search(text: str, pattern: str) -> list[int]:
"""
KMP string matching
Returns all starting positions where pattern occurs in text
"""
if not pattern:
return list(range(len(text) + 1))
n, m = len(text), len(pattern)
next_arr = build_next(pattern)
positions = []
j = 0 # matched length in pattern
for i in range(n):
# On mismatch, use next array to fall back pattern pointer
while j > 0 and text[i] != pattern[j]:
j = next_arr[j - 1]
# Current character matches
if text[i] == pattern[j]:
j += 1
# Complete match found
if j == m:
positions.append(i - m + 1)
j = next_arr[j - 1] # Continue searching for next match
return positions
Full Matching Walkthrough (text = "ababababac", pattern = "ababac")
next = [0, 0, 1, 2, 3, 0]
i=0: text[0]='a', pattern[0]='a' โ j=1
i=1: text[1]='b', pattern[1]='b' โ j=2
i=2: text[2]='a', pattern[2]='a' โ j=3
i=3: text[3]='b', pattern[3]='b' โ j=4
i=4: text[4]='a', pattern[4]='a' โ j=5
i=5: text[5]='b', pattern[5]='c' โ mismatch!
j = next[4] = 3
text[5]='b', pattern[3]='b' โ j=4
i=6: text[6]='a', pattern[4]='a' โ j=5
i=7: text[7]='b', pattern[5]='c' โ mismatch!
j = next[4] = 3
text[7]='b', pattern[3]='b' โ j=4
i=8: text[8]='a', pattern[4]='a' โ j=5
i=9: text[9]='c', pattern[5]='c' โ j=6 = m โ match found!
position = 9 - 6 + 1 = 4
j = next[5] = 0
Result: [4] (pattern "ababac" occurs at position 4 in text)
Core Properties of KMP
- Text pointer i never backtracks: i strictly increases, incrementing by 1 each outer iteration
- Time complexity O(n+m): preprocessing O(m), matching O(n)
- Online algorithm: text can be streamed in; each character read immediately determines match status
Common Mistakes
- Index offset in next array: Some implementations use next[0] = -1 as sentinel, others use lps[0] = 0. Clarify your convention in interviews.
- Forgetting to fall back after a match: After finding a match, j should be set to
next[j-1](not 0) โ otherwise overlapping matches are missed. - Empty pattern: Behavior when pattern is empty needs explicit definition.
Level 2 ยท How It Works Under the Hood
30.5 Boyer-Moore Algorithm
Boyer-Moore (1977) is the fastest single-pattern matching algorithm in practice. Its core idea: compare right-to-left, using mismatch information to achieve large jumps.
Bad Character Rule
When pattern[j] mismatches text[i+j], look at the "bad character" text[i+j] and find its rightmost occurrence position k in pattern:
- If k < j: shift pattern right by j - k, aligning pattern[k] with the bad character
- If k > j or bad character doesn't appear in pattern: shift pattern past the bad character
def build_bad_char_table(pattern: str) -> dict[str, int]:
"""Bad character table: rightmost position of each char in pattern"""
table = {}
for i, ch in enumerate(pattern):
table[ch] = i
return table
Good Suffix Rule
When mismatch occurs at pattern[j], the already-matched suffix pattern[j+1..m-1] (the good suffix) is used to determine the shift:
- If the good suffix appears elsewhere in pattern (preceded by a different character), align to that occurrence
- If some suffix of the good suffix is a prefix of pattern, align to that prefix
The good suffix preprocessing is more complex, but it guarantees worst-case O(n) matching.
Simplified Boyer-Moore (Bad Character Rule Only)
def boyer_moore_search(text: str, pattern: str) -> list[int]:
"""Boyer-Moore matching (simplified: bad character rule only)"""
n, m = len(text), len(pattern)
if m == 0:
return list(range(n + 1))
bad_char = build_bad_char_table(pattern)
positions = []
i = 0 # current alignment position in text
while i <= n - m:
j = m - 1 # compare right to left
while j >= 0 and pattern[j] == text[i + j]:
j -= 1
if j < 0:
positions.append(i)
i += 1 # simplified; can jump further
else:
bad_char_pos = bad_char.get(text[i + j], -1)
shift = max(1, j - bad_char_pos)
i += shift
return positions
Boyer-Moore's Average Performance
On natural language text, Boyer-Moore averages about n/m comparisons โ fewer than the text length! This is because with a large alphabet, the bad character rule frequently skips the entire pattern length. This is why grep uses a Boyer-Moore variant by default.
Boyer-Moore vs KMP
| Property | KMP | Boyer-Moore |
|---|---|---|
| Worst case | O(n+m) | O(n+m) (full version) |
| Average | O(n) | O(n/m) (large alphabet) |
| Comparison direction | leftโright | rightโleft |
| Best for | Streaming input, small alphabet | Large text search, large alphabet |
| Preprocessing | O(m) | O(m + |ฮฃ|) |
30.6 Rabin-Karp Algorithm (Rolling Hash)
Core Idea
Instead of comparing characters one by one, use hash values for quick elimination: if hash(text[i..i+m-1]) != hash(pattern), they definitely don't match; if equal, verify character by character (handling hash collisions).
The key technique is rolling hash: computing hash(text[i+1..i+m]) from hash(text[i..i+m-1]) in O(1).
Polynomial Hash
Treat the string as a base-b number (b typically 31 or 131), taken modulo a large prime p:
$$h(s) = s[0] \cdot b^{m-1} + s[1] \cdot b^{m-2} + ... + s[m-1] \cdot b^0 \mod p$$
Rolling update: $$h(s[i+1..i+m]) = (h(s[i..i+m-1]) - s[i] \cdot b^{m-1}) \cdot b + s[i+m] \mod p$$
Complete Implementation
def rabin_karp_search(text: str, pattern: str) -> list[int]:
"""Rabin-Karp string matching"""
n, m = len(text), len(pattern)
if m == 0:
return list(range(n + 1))
if m > n:
return []
base = 131
mod = 10**9 + 7
# Compute base^(m-1) mod p
power = pow(base, m - 1, mod)
# Compute pattern hash
pattern_hash = 0
for ch in pattern:
pattern_hash = (pattern_hash * base + ord(ch)) % mod
# Compute hash of first m characters of text
text_hash = 0
for i in range(m):
text_hash = (text_hash * base + ord(text[i])) % mod
positions = []
for i in range(n - m + 1):
if text_hash == pattern_hash:
if text[i:i + m] == pattern:
positions.append(i)
# Rolling hash update
if i < n - m:
text_hash = (text_hash - ord(text[i]) * power) % mod
text_hash = (text_hash * base + ord(text[i + m])) % mod
text_hash = (text_hash + mod) % mod # ensure non-negative
return positions
Time Complexity:
- Expected O(n + m): with good hash function, collisions are rare
- Worst case O(n * m): all positions collide (extremely unlikely)
Unique Advantages of Rabin-Karp
- Multi-pattern matching: maintain hash values for multiple patterns simultaneously
- 2D pattern matching: generalizes to searching for a sub-matrix within a matrix
- Simple implementation: shorter code than KMP or Boyer-Moore
Handling Hash Collisions
Single hash collision probability is about 1/p. To further reduce false positives:
- Use double hashing (two different base/mod pairs)
- Collision probability drops to 1/(p1 * p2) โ 10^-18, practically impossible
30.7 Z-Function (Z-array)
Definition
For string s, Z[i] is the length of the longest substring starting from position i that matches a prefix of s. Z[0] is typically defined as 0 or len(s).
Example: s = "aabxaab"
Z[0] = 7 (or 0, by convention)
Z[1] = 1 "a" matches prefix "a"
Z[2] = 0 "b..." doesn't match prefix
Z[3] = 0 "x..." doesn't match prefix
Z[4] = 3 "aab" matches prefix "aab"
Z[5] = 1 "a" matches prefix "a"
Z[6] = 0 "b" doesn't match prefix
O(n) Z-Function Construction
def build_z_array(s: str) -> list[int]:
"""Build Z-function in O(n) time"""
n = len(s)
z = [0] * n
z[0] = n # by convention
l, r = 0, 0 # [l, r) is the rightmost matching interval
for i in range(1, n):
if i < r:
# i is within known matching interval, reuse information
z[i] = min(r - i, z[i - l])
# Extend by brute force
while i + z[i] < n and s[z[i]] == s[i + z[i]]:
z[i] += 1
# Update rightmost matching interval
if i + z[i] > r:
l, r = i, i + z[i]
return z
String Matching via Z-Function
Construct pattern + "$" + text (where $ is a separator not in either string) and compute its Z-array. Positions where Z[i] == m (offset by m+1) are match positions.
def z_search(text: str, pattern: str) -> list[int]:
"""String matching using Z-function"""
m = len(pattern)
combined = pattern + "$" + text
z = build_z_array(combined)
positions = []
for i in range(m + 1, len(combined)):
if z[i] == m:
positions.append(i - m - 1)
return positions
Z-Function vs KMP's Next Array
The two are essentially equivalent โ Z-function describes "how long a prefix match starts at each position," while next array describes "the longest proper prefix-suffix of each prefix." They are inter-convertible, but each has advantages in different contexts:
- Z-function is more intuitive: directly gives prefix match length at each position
- Next array is better for online processing: can be updated character by character with streaming input
Level 3 ยท What the Theory Says
30.8 Theoretical Foundations of KMP
Knuth, Morris, Pratt 1977 Paper
Donald Knuth, James Morris, and Vaughan Pratt published "Fast Pattern Matching in Strings" (SIAM Journal on Computing, Vol. 6, No. 2, 1977). The paper proves:
-
Theorem 1: There exists an algorithm requiring at most n + m - 1 character comparisons in the worst case for matching (excluding next array construction).
-
Theorem 2: The next array can be constructed in at most 2m - 3 comparisons.
-
Corollary: KMP's total comparisons never exceed 2n (matching phase) + 2m (preprocessing) = 2(n+m).
KMP and Automata
KMP effectively constructs a simplified version of a deterministic finite automaton (DFA). A full DFA's transition table has size O(m * |ฮฃ|), while KMP uses only an O(m) next array to simulate all DFA transitions โ through chain fallback on mismatch.
Knuth noted in the paper that KMP can be viewed as an improvement on Morris and Pratt's earlier work (1970 technical report), combined with Knuth's mathematical analysis of the failure function.
Optimality
The 1977 paper also proves a lower bound: any correct string matching algorithm requires at least n + m - O(1) comparisons in the worst case. KMP's n + m - 1 nearly matches this lower bound.
30.9 Boyer-Moore Theory
Boyer and Moore 1977 Paper
Robert Boyer and J Strother Moore published "A Fast String Searching Algorithm" (Communications of the ACM, Vol. 20, No. 10, 1977) the same year.
Key Theorems:
- With bad character rule alone: worst case is O(n * m)
- With good suffix rule added: worst case drops to O(n + m) (proven by Galil, 1979 for the complete version)
- On uniformly distributed random text: average comparisons are O(n/m)
Intuition for Sub-linear Time
Why can Boyer-Moore be faster than O(n) on average? Because it doesn't need to examine every character of text! When the alphabet is large enough (e.g., ASCII's 256 characters), the probability of the bad character not appearing in pattern is high, allowing each mismatch to skip the entire pattern length.
For m = 10, |ฮฃ| = 256: probability a random character is not in pattern โ (1 - 10/256) โ 0.96, meaning most positions allow a jump of 10 characters.
30.10 Theoretical Foundations of String Hashing
Collision Probability of Polynomial Hash
Given two distinct strings of length m, using a random base b (uniformly chosen from [1, p-1]) and prime modulus p, the probability of hash collision is:
$$\Pr[h(s1) = h(s2)] \leq \frac{m-1}{p}$$
This follows from the Schwartz-Zippel Lemma: a non-zero polynomial of degree at most m-1 over Z_p has at most m-1 roots.
Practical Parameter Choices
- p = 10^9 + 7: collision probability โ m/10^9, safe for m <= 10^5
- p = 10^9 + 9: another commonly used prime
- Double hash: (p1, p2) = (10^9+7, 10^9+9), collision probability โ 10^-18
Carter-Wegman Universal Hashing
J. Lawrence Carter and Mark Wegman established the theoretical framework of universal hashing in "Universal Classes of Hash Functions" (Journal of Computer and System Sciences, 1979). Polynomial hashing falls within this framework.
Karp and Rabin 1987 Paper
Richard Karp and Michael Rabin formally analyzed rolling hash for string matching in "Efficient Randomized Pattern-Matching Algorithms" (IBM Journal of Research and Development, 1987), proving that with appropriate hash parameters, expected runtime is O(n + m).
30.11 Mathematical Properties of the Prefix Function
Property 1: Values Form a "Chain"
For any position j, the sequence j, next[j-1], next[next[j-1]-1], ... forms a strictly decreasing sequence down to 0. The worst-case number of hops from any position back to 0 is O(m), but the total hops across all positions is O(m) โ this is the essence of amortized analysis.
Property 2: Amortized Analysis
The total number of while loop executions in KMP's matching phase does not exceed n. Proof:
- j (match length) increases by at most 1 per outer iteration
- Total increment of j โค n
- Each while loop iteration decreases j by at least 1
- Total decrement of j โค total increment โค n
- Therefore total while loop executions โค n
Property 3: Periodicity
If next[m-1] > 0 and m % (m - next[m-1]) == 0, then pattern has period length m - next[m-1].
Example: pattern = "abcabc", next[5] = 3, period = 6 - 3 = 3. Pattern is indeed "abc" repeated twice.
This property is directly used in LeetCode #459 (Repeated Substring Pattern).
Level 4 ยท Edge Cases and Pitfalls
30.12 Interview Deep Dive: Implement strStr() (LeetCode #28)
Problem
Implement strStr(): return the first occurrence position of pattern in text, or -1 if not found.
Interview Strategy
- Write brute force first: demonstrate problem understanding; O(n*m) brute force is only 5-6 lines
- If interviewer asks for optimization: implement KMP, focus on explaining the next array's meaning
- If time permits: discuss trade-offs between algorithms
def str_str(haystack: str, needle: str) -> int:
"""LeetCode 28: Implement strStr()"""
if not needle:
return 0
m = len(needle)
next_arr = [0] * m
j = 0
for i in range(1, m):
while j > 0 and needle[i] != needle[j]:
j = next_arr[j - 1]
if needle[i] == needle[j]:
j += 1
next_arr[i] = j
j = 0
for i in range(len(haystack)):
while j > 0 and haystack[i] != needle[j]:
j = next_arr[j - 1]
if haystack[i] == needle[j]:
j += 1
if j == m:
return i - m + 1
return -1
Common Follow-up Questions:
- "Why is the while loop in next array construction O(m)?" โ Amortized: total increment and decrement of j both bounded by m
- "When to choose KMP vs Boyer-Moore?" โ Small alphabet (DNA: ACGT) โ KMP; large alphabet โ BM
- "What if text is streamed?" โ KMP naturally supports streaming; BM doesn't (requires right-to-left comparison)
30.13 Interview Deep Dive: Repeated Substring Pattern (LeetCode #459)
Problem
Determine if string s can be formed by repeating a substring multiple times. E.g., "abab" โ True ("ab" repeated twice).
Method 1: KMP's Next Array
def repeated_substring_pattern(s: str) -> bool:
"""Use KMP next array to detect repeated substring"""
n = len(s)
next_arr = [0] * n
j = 0
for i in range(1, n):
while j > 0 and s[i] != s[j]:
j = next_arr[j - 1]
if s[i] == s[j]:
j += 1
next_arr[i] = j
period = n - next_arr[n - 1]
return next_arr[n - 1] > 0 and n % period == 0
Why does this work? If s has period p (s = t * k, |t| = p), then s[p:] == s[:n-p], meaning s's first n-p characters form a suffix of s. This implies next[n-1] >= n - p. The longest prefix-suffix length next[n-1] yields the shortest period n - next[n-1].
Method 2: String Concatenation Trick
def repeated_substring_pattern_v2(s: str) -> bool:
"""String concatenation method"""
doubled = (s + s)[1:-1] # remove first and last characters
return s in doubled
Intuition: If s = "abab", then s+s = "abababab", removing first and last gives "bababa". If s is a repeated string, it will still be found in the trimmed s+s; otherwise it won't.
30.14 Interview Deep Dive: Shortest Palindrome (LeetCode #214)
Problem
Add the fewest characters possible to the front of string s to make it a palindrome.
Analysis
Key observation: find s's longest palindrome prefix. If the longest palindrome prefix has length k, prepend s[k:][::-1].
KMP Solution
Construct s + "#" + reverse(s) and compute its next array. The last entry gives the length of s's longest palindrome prefix.
def shortest_palindrome(s: str) -> str:
"""LeetCode 214: Shortest Palindrome"""
if not s:
return s
rev_s = s[::-1]
combined = s + "#" + rev_s
n = len(combined)
next_arr = [0] * n
j = 0
for i in range(1, n):
while j > 0 and combined[i] != combined[j]:
j = next_arr[j - 1]
if combined[i] == combined[j]:
j += 1
next_arr[i] = j
longest_palindrome_prefix = next_arr[-1]
suffix_to_add = rev_s[:len(s) - longest_palindrome_prefix]
return suffix_to_add + s
Why does this work?
In s + "#" + rev_s, the last entry of next array gives k = length of longest prefix of combined (which is a prefix of s) that equals a suffix of combined (which is a suffix of rev_s = a reversed prefix of s).
"A prefix of s equals the reverse of a prefix of s" โ that's exactly the definition of a palindrome prefix! So k is the length of s's longest palindrome prefix.
Example: s = "aacecaaa"
- rev_s = "aaacecaa"
- combined = "aacecaaa#aaacecaa"
- Last next value = 7 (prefix "aacecaa" is a palindrome)
- Characters to prepend = rev_s[:8-7] = "a"
- Result = "a" + "aacecaaa" = "aaacecaaa"
30.15 Engineering Practice for String Algorithms
Algorithm Selection in Real Systems
| Scenario | Recommended | Reason |
|---|---|---|
| Editor find (Ctrl+F) | Boyer-Moore variant | Large alphabet, user patterns are long |
| grep/ripgrep | Aho-Corasick + SIMD | Multi-pattern + hardware acceleration |
| DNA sequence matching | KMP or suffix array | Small alphabet (4 chars), exact matching |
| Network content filtering | Rabin-Karp multi-hash | Need to match multiple patterns simultaneously |
| Database LIKE queries | Brute force (short patterns) | Patterns usually short; setup overhead exceeds benefit |
Python's Built-in Implementation
Python's str.find() and in operator use a modified Boyer-Moore-Horspool algorithm (CPython implementation in Objects/stringlib/fastsearch.h). For short patterns (< 6 characters), it degrades to brute force since preprocessing overhead isn't worthwhile.
Interview Recommendations
- Default to brute force first โ concise and correct
- If optimization is requested, KMP is the safest choice (deterministic O(n+m), moderate code length)
- Mention Boyer-Moore as the practically faster alternative
- Use Rabin-Karp when multi-pattern matching is needed or interviewer hints "hashing"
Chapter Summary
-
Brute Force: O(n*m), simple but insufficient for large data.
-
KMP Algorithm: O(n+m). Core is the next array โ recording each prefix's longest proper prefix-suffix, enabling no-backtrack matching. Proposed by Knuth, Morris, Pratt (1977).
-
Boyer-Moore: Fastest in practice, average O(n/m). Compares right-to-left, leveraging bad character and good suffix rules for large jumps.
-
Rabin-Karp: Rolling hash achieves O(1) window sliding, expected O(n+m). Strengths: multi-pattern matching and simple implementation.
-
Z-Function: Equivalent to KMP but sometimes more intuitive. Z[i] directly gives the prefix match length starting at position i.
-
Interview Essentials: Master KMP's next array construction and matching process, be able to write the code from scratch, and explain the amortized O(n+m) proof.