Chapter 32

Suffix Arrays and Suffix Automata

Chapter 32: Suffix Arrays and Suffix Automata — The "Ultimate Index" for Strings

Given a 500-page book, you need to find every occurrence of the word "algorithm." You could scan from beginning to end (O(n)), or you could flip to the index at the back (O(log n)). A suffix array is the "index page" for a string — it sorts all suffixes in lexicographic order, enabling binary search to find any pattern of length m in O(m log n) time.

But what if you need more than "where does this pattern appear"? What if you need "how many distinct substrings does this string have" or "what's the longest repeated substring"? That's where the Suffix Automaton (SAM) comes in — a finite state machine that encodes all substring information in O(n) space.

These two structures are the "ultimate weapons" of string algorithms. KMP and Rabin-Karp solve "find one pattern in a text," while suffix arrays and suffix automata solve "ask anything about all substrings of this string."

Level 1 · What You Need to Know

1.1 The Concept of Suffixes

All suffixes of the string s = "banana":

Index i	Suffix s[i:]
0	banana
1	anana
2	nana
3	ana
4	na
5	a

The Suffix Array (SA) is the array of starting positions after sorting these suffixes lexicographically:

After sorting: a(5), ana(3), anana(1), banana(0), na(4), nana(2)

So SA = [5, 3, 1, 0, 4, 2].

Why is this useful? Every substring is a prefix of some suffix. If suffixes are sorted, you can binary search for any substring.

def build_suffix_array_naive(s: str) -> list[int]:
    """
    Naive suffix array construction: O(n^2 log n)
    Generate all suffixes, sort lexicographically, return starting positions
    """
    n = len(s)
    # Generate (suffix, starting_position) pairs
    suffixes = [(s[i:], i) for i in range(n)]
    # Sort lexicographically
    suffixes.sort(key=lambda x: x[0])
    # Return sorted starting positions
    return [pos for _, pos in suffixes]


def search_pattern(text: str, pattern: str, sa: list[int]) -> list[int]:
    """
    Search for a pattern in text using suffix array + binary search
    Time complexity: O(m log n), m = len(pattern), n = len(text)
    """
    n = len(text)
    m = len(pattern)
    
    # Find left boundary: first suffix >= pattern
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        suffix = text[sa[mid]:]
        if suffix[:m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    
    # Find right boundary: first suffix > pattern
    lo, hi = left, n
    while lo < hi:
        mid = (lo + hi) // 2
        suffix = text[sa[mid]:]
        if suffix[:m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    right = lo
    
    return sorted(sa[left:right])


# Demo
text = "banana"
sa = build_suffix_array_naive(text)
print(f"Suffix array: {sa}")  # [5, 3, 1, 0, 4, 2]

positions = search_pattern(text, "ana", sa)
print(f"'ana' found at positions: {positions}")  # [1, 3]

1.2 The LCP Array — Common Prefix Lengths Between Adjacent Suffixes

The LCP (Longest Common Prefix) array: LCP[i] is the length of the longest common prefix between the i-th and (i-1)-th suffixes in sorted order.

For "banana" with SA = [5, 3, 1, 0, 4, 2]:

rank	SA[rank]	Suffix	LCP
0	5	a	—
1	3	ana	1
2	1	anana	3
3	0	banana	0
4	4	na	0
5	2	nana	2

LCP = [−, 1, 3, 0, 0, 2]

The power of the LCP array:

Longest repeated substring = max(LCP) = 3 ("ana")
Number of distinct substrings = n(n+1)/2 − sum(LCP) = 21 − 6 = 15

def build_lcp_array(text: str, sa: list[int]) -> list[int]:
    """
    Kasai's algorithm: O(n) LCP array construction
    
    Key insight: if suffix(sa[rank-1]) and suffix(sa[rank]) have LCP = k,
    then suffix(sa[rank-1]+1) and suffix(sa[rank]+1) have LCP >= k-1
    """
    n = len(text)
    # rank[i] = position of suffix i in the sorted order
    rank = [0] * n
    for i in range(n):
        rank[sa[i]] = i
    
    lcp = [0] * n
    k = 0  # Current LCP value
    
    for i in range(n):
        if rank[i] == 0:
            k = 0
            continue
        # sa[rank[i] - 1] is the starting position of the previous suffix in sorted order
        j = sa[rank[i] - 1]
        # Compare characters one by one
        while i + k < n and j + k < n and text[i + k] == text[j + k]:
            k += 1
        lcp[rank[i]] = k
        # Key: next round's LCP is at least k-1
        if k > 0:
            k -= 1
    
    return lcp


# Demo
text = "banana"
sa = build_suffix_array_naive(text)
lcp = build_lcp_array(text, sa)
print(f"LCP array: {lcp}")  # [0, 1, 3, 0, 0, 2]

# Longest repeated substring
max_lcp = max(lcp)
max_idx = lcp.index(max_lcp)
print(f"Longest repeated substring: '{text[sa[max_idx]:sa[max_idx]+max_lcp]}'")  # "ana"

# Number of distinct substrings
n = len(text)
total_substrings = n * (n + 1) // 2
distinct_substrings = total_substrings - sum(lcp)
print(f"Distinct substrings: {distinct_substrings}")  # 15

1.3 Why Kasai's Algorithm is O(n)

The algorithm has a while loop doing character-by-character comparison inside a for loop. Shouldn't it be O(n^2)?

Proof sketch: The variable k can increase at most n times total throughout all iterations (since k <= n), and each iteration of the outer loop decreases k by at most 1. Therefore, total increments of k <= n + n = 2n, meaning the while loop executes at most 2n times total. This is a classic amortized analysis argument.

1.4 Common Applications

Problem	Solution
Does text contain pattern P?	SA + binary search, O(m log n)
Longest repeated substring	max(LCP)
Number of distinct substrings	n(n+1)/2 − sum(LCP)
Longest common substring of two strings	Concatenate s1 + '#' + s2, build SA, find cross-boundary LCP
k-th smallest substring	Prefix sums on LCP array

1.5 Common Mistakes

Mistake 1: Forgetting the separator when concatenating strings

# Wrong: direct concatenation
combined = s1 + s2  # "abc" + "bcd" = "abcbcd"
# Suffix "cbcd" crosses the boundary, producing incorrect LCP

# Correct: use a character not in the alphabet
combined = s1 + '\x00' + s2  # '\x00' is smaller than any letter

Mistake 2: Comparing entire suffixes instead of prefixes during binary search

# Wrong: compare entire suffix
if text[sa[mid]:] < pattern:  # O(n) per comparison

# Correct: compare only first m characters
if text[sa[mid]:sa[mid]+m] < pattern:  # O(m) per comparison

Level 2 · How It Works Under the Hood

2.1 The Doubling Method for Suffix Array Construction

The naive method's bottleneck is that each comparison during sorting takes O(n) time. The doubling method's core idea: first sort by the 1st character, then by the first 2 characters, then by the first 4 characters... each round's comparison can leverage the previous round's results in O(1) time.

This is the idea proposed by Karp, Miller, Rosenberg (1972) and later refined by Manber and Myers (1993) into the standard suffix array construction algorithm.

def build_suffix_array_doubling(s: str) -> list[int]:
    """
    Doubling method for suffix array: O(n log^2 n)
    
    Idea:
    - Round 0: sort each suffix by its first character
    - Round k: sort each suffix by its first 2^k characters
    - Round k's sort key = (rank from round k-1 at position i,
                            rank from round k-1 at position i + 2^(k-1))
    
    Since round k-1 correctly sorted the first 2^(k-1) characters,
    round k only needs to combine two "halves" into a pair for sorting.
    """
    n = len(s)
    # Initialize: sort by single character's ASCII value
    sa = list(range(n))
    rank = [ord(c) for c in s]
    tmp = [0] * n
    
    k = 1  # Current comparison length
    while k < n:
        # Sort key: (rank[i], rank[i+k]), use -1 if i+k is out of bounds
        def compare(a, b):
            if rank[a] != rank[b]:
                return rank[a] - rank[b]
            ra = rank[a + k] if a + k < n else -1
            rb = rank[b + k] if b + k < n else -1
            return ra - rb
        
        import functools
        sa.sort(key=functools.cmp_to_key(compare))
        
        # Recompute ranks
        tmp[sa[0]] = 0
        for i in range(1, n):
            if compare(sa[i], sa[i-1]) == 0:
                tmp[sa[i]] = tmp[sa[i-1]]
            else:
                tmp[sa[i]] = tmp[sa[i-1]] + 1
            
        rank = tmp[:]
        
        # If all ranks are distinct, sorting is complete
        if rank[sa[-1]] == n - 1:
            break
        
        k *= 2
    
    return sa

Time complexity analysis:

Outer loop: O(log n) rounds (k doubles each round)
Each round's sort: O(n log n)
Total: O(n log^2 n)

If radix sort replaces comparison sort, each round becomes O(n), yielding O(n log n) total.

2.2 The SA-IS Algorithm

SA-IS (Suffix Array — Induced Sorting) was proposed by Nong, Zhang, and Chan in 2009. It achieves linear time through an elegant divide-and-conquer strategy.

Step 1: Classify suffixes as S-type and L-type

If s[i:] < s[i+1:] (lexicographically), suffix(i) is S-type (Smaller)
If s[i:] > s[i+1:] (lexicographically), suffix(i) is L-type (Larger)
The last character (usually '$', the smallest) is always S-type

Simple rules:

If s[i] < s[i+1], position i is S-type
If s[i] > s[i+1], position i is L-type
If s[i] == s[i+1], position i has the same type as i+1

def classify_suffixes(s: str) -> list[bool]:
    """
    Classify each suffix as S-type (True) or L-type (False)
    Single right-to-left scan, O(n)
    """
    n = len(s)
    is_s_type = [False] * n
    is_s_type[n - 1] = True  # Last is always S-type ('$' is smallest)
    
    for i in range(n - 2, -1, -1):
        if s[i] < s[i + 1]:
            is_s_type[i] = True
        elif s[i] > s[i + 1]:
            is_s_type[i] = False
        else:
            is_s_type[i] = is_s_type[i + 1]
    
    return is_s_type

Step 2: Find LMS (Left-Most S-type) suffixes

An LMS suffix has an L-type to its left and is itself S-type. There are at most n/2 LMS suffixes.

def find_lms_positions(is_s_type: list[bool]) -> list[int]:
    """Find all LMS positions"""
    lms = []
    for i in range(1, len(is_s_type)):
        if is_s_type[i] and not is_s_type[i - 1]:
            lms.append(i)
    return lms

Step 3: Induced Sorting

This is SA-IS's core — if we know the relative order of LMS suffixes, we can derive all suffixes' order in O(n) time:

Place LMS suffixes into correct buckets (bucketed by first character)
Left-to-right scan of SA: induce L-type suffix positions
Right-to-left scan of SA: induce S-type suffix positions

Step 4: Recursion

If LMS suffixes can't be uniquely ordered (identical LMS substrings exist), recurse. The subproblem size is at most n/2, so T(n) = T(n/2) + O(n) = O(n).

2.3 Suffix Automaton (SAM)

A Suffix Automaton is a directed acyclic graph (DAG) that accepts exactly all suffixes of string s. More importantly, every path from the initial state corresponds to a distinct substring of s.

Key properties:

Number of states <= 2n-1
Number of transitions <= 3n-4
Construction time: O(n)

class SuffixAutomaton:
    """
    Suffix Automaton (SAM)
    
    Each state represents an "equivalence class" — a set of substrings
    with identical endpos sets.
    endpos(t) = the set of all ending positions of substring t in s.
    
    Example s = "abcbc":
    - endpos("bc") = {2, 4}
    - endpos("c") = {2, 4}
    - "bc" and "c" belong to the same equivalence class (same endpos)
    """
    
    class State:
        def __init__(self):
            self.len = 0      # Length of longest substring in this class
            self.link = -1    # Suffix link (parent in suffix link tree)
            self.trans = {}   # Transitions: character -> state id
            self.cnt = 0      # How many suffixes pass through this state
    
    def __init__(self):
        self.states = [self.State()]  # states[0] = initial state
        self.states[0].len = 0
        self.states[0].link = -1
        self.last = 0  # State reached after extending the last character
    
    def extend(self, c: str):
        """
        Online construction: add one character c
        Amortized O(1) per character
        """
        cur = len(self.states)
        self.states.append(self.State())
        self.states[cur].len = self.states[self.last].len + 1
        self.states[cur].cnt = 1
        
        # Walk up suffix links, adding transitions
        p = self.last
        while p != -1 and c not in self.states[p].trans:
            self.states[p].trans[c] = cur
            p = self.states[p].link
        
        if p == -1:
            # Case 1: walked all the way to root without finding c-transition
            self.states[cur].link = 0
        else:
            q = self.states[p].trans[c]
            if self.states[p].len + 1 == self.states[q].len:
                # Case 2: no split needed
                self.states[cur].link = q
            else:
                # Case 3: need to clone state q
                clone = len(self.states)
                self.states.append(self.State())
                self.states[clone].len = self.states[p].len + 1
                self.states[clone].link = self.states[q].link
                self.states[clone].trans = self.states[q].trans.copy()
                self.states[clone].cnt = 0
                
                # Redirect q and cur to point to clone
                while p != -1 and self.states[p].trans.get(c) == q:
                    self.states[p].trans[c] = clone
                    p = self.states[p].link
                
                self.states[q].link = clone
                self.states[cur].link = clone
        
        self.last = cur
    
    def count_distinct_substrings(self) -> int:
        """
        Distinct substring count = sum of (len - len(link)) for all states
        Each state contributes len - parent_len distinct substrings
        """
        total = 0
        for i in range(1, len(self.states)):
            parent_len = self.states[self.states[i].link].len
            total += self.states[i].len - parent_len
        return total
    
    def longest_repeated_substring(self, s: str) -> str:
        """
        Longest repeated substring: find the state with cnt >= 2 and maximum len
        Requires topological sort to compute cnt for each state
        """
        n_states = len(self.states)
        # Sort by len (bucket sort)
        max_len = max(st.len for st in self.states)
        bucket = [0] * (max_len + 1)
        for st in self.states:
            bucket[st.len] += 1
        for i in range(1, max_len + 1):
            bucket[i] += bucket[i - 1]
        
        order = [0] * n_states
        for i in range(n_states):
            bucket[self.states[i].len] -= 1
            order[bucket[self.states[i].len]] = i
        
        # Propagate cnt from longest to shortest
        for i in range(n_states - 1, -1, -1):
            idx = order[i]
            if self.states[idx].link >= 0:
                self.states[self.states[idx].link].cnt += self.states[idx].cnt
        
        # Find longest with cnt >= 2
        best_len = 0
        best_state = -1
        for i in range(1, n_states):
            if self.states[i].cnt >= 2 and self.states[i].len > best_len:
                best_len = self.states[i].len
                best_state = i
        
        if best_state == -1:
            return ""
        
        # Find the actual substring in s
        for i in range(len(s) - best_len + 1):
            sub = s[i:i + best_len]
            if s.find(sub, i + 1) != -1:
                return sub
        return ""


# Demo
s = "abcbc"
sam = SuffixAutomaton()
for c in s:
    sam.extend(c)

print(f"Distinct substrings: {sam.count_distinct_substrings()}")  # 12
print(f"Longest repeated substring: '{sam.longest_repeated_substring(s)}'")  # "bc"

2.4 SAM State Semantics — The endpos Equivalence Classes

Each SAM state represents an endpos equivalence class. endpos(t) is the set of all ending positions where substring t occurs in s.

Example s = "aabbaab":

endpos("a") = {0, 1, 4, 5}
endpos("aa") = {1, 5}
endpos("aab") = {2, 6}
endpos("ab") = {2, 6}

Notice "aab" and "ab" have the same endpos! They belong to the same equivalence class.

Properties of endpos equivalence classes (Blumer et al., 1985):

For substrings u and v in the same class (|u| <= |v|), u is a suffix of v
Classes form a tree (suffix link tree), where parent's endpos is a superset of child's
There are at most 2n-1 equivalence classes

This means the SAM's suffix link tree is actually the "reverse suffix tree"!

2.5 Suffix Array vs. Suffix Tree vs. Suffix Automaton

Property	Suffix Array	Suffix Tree	Suffix Automaton
Space	O(n) integers	O(n) but large constant	O(n) states+transitions
Build time	O(n) SA-IS	O(n) Ukkonen's	O(n) online
Pattern search	O(m log n)	O(m)	O(m)
Implementation complexity	Medium	High	Medium
Actual memory	5n~9n bytes	20n~40n bytes	10n~15n bytes
Best for	Competitions, engineering	Theoretical analysis	Substring counting

Practical advice:

Competitive programming: suffix array is most common (shorter code, fewer bugs)
Substring counting/longest repeated: SAM is more convenient
Engineering (full-text search): suffix array + LCP (memory-friendly)

Level 3 · What the Theory Says

3.1 Karp, Miller, Rosenberg (1972): The Origin of Doubling

In their paper "Rapid Identification of Repeated Patterns in Strings, Trees and Arrays," Karp, Miller, and Rosenberg proposed an elegant idea:

To compare two substrings of length 2k, you only need to compare their two "halves" of length k. If we've already assigned names (numbers) to all length-k substrings (same name for identical substrings), then a length-2k substring can be uniquely identified by the pair (name_of_first_half, name_of_second_half).

This is the "doubling" idea. It influenced not just suffix arrays but also:

LCA (Lowest Common Ancestor) via binary lifting
Sparse Tables
Fast string hash comparison

The paper's precise formulation:

Define name_k(i) as the "name" of the substring starting at position i with length 2^k. Then:

name_0(i) = character code of s[i]
name_{k+1}(i) = rank of (name_k(i), name_k(i + 2^k)) among all such pairs

The final suffix array is the permutation given by name_{ceil(log n)}.

3.2 Manber & Myers (1993): Birth of the Suffix Array

In "Suffix Arrays: A New Method for On-Line String Searches," Manber and Myers formally defined the suffix array and gave O(n log n) construction and O(m + log n) search algorithms.

Their key contributions:

Defined the suffix array and LCP array interface: Showed that these two structures can replace suffix trees for most problems, with much less space overhead.
Search optimization: Naive binary search is O(m log n); they used LCP information to accelerate to O(m + log n).

The technique: maintain LCP values lcp_L and lcp_R between the search interval [L, R] and the pattern. When comparing middle element M, skip min(lcp_L, lcp_R) characters.

def search_manber_myers(text: str, pattern: str, sa: list[int], 
                         lcp: list[int]) -> tuple[int, int]:
    """
    Manber-Myers accelerated search: O(m + log n)
    Uses known LCP information to skip character comparisons
    Simplified version showing the concept
    """
    n = len(text)
    m = len(pattern)
    
    def lcp_with_pattern(idx: int) -> tuple[int, int]:
        """Returns (lcp_length, comparison_result)"""
        pos = sa[idx]
        k = 0
        while k < m and pos + k < n:
            if pattern[k] < text[pos + k]:
                return k, -1
            elif pattern[k] > text[pos + k]:
                return k, 1
            k += 1
        if k == m:
            return k, 0  # pattern is a prefix of this suffix
        return k, -1  # suffix is shorter than pattern
    
    # Standard binary search for left boundary (simplified)
    lo, hi = 0, n - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        _, cmp = lcp_with_pattern(mid)
        if cmp <= 0:
            hi = mid - 1
        else:
            lo = mid + 1
    left = lo
    
    # Find right boundary
    lo, hi = left, n - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        lcp_len, cmp = lcp_with_pattern(mid)
        if lcp_len == m:
            lo = mid + 1
        elif cmp > 0:
            lo = mid + 1
        else:
            hi = mid - 1
    right = hi
    
    return left, right

3.3 SA-IS (Nong, Zhang, Chan, 2009): The Linear Time Breakthrough

The paper "Two Efficient Algorithms for Linear Time Suffix Array Construction" gives the cleanest linear-time suffix array algorithm to date.

Formal description of SA-IS:

Input: String S[0..n-1], alphabet size sigma Output: Suffix array SA[0..n-1]

Algorithm SA-IS(S, n, sigma):
1. Classify: right-to-left scan, label each suffix S-type or L-type
2. Find LMS: mark all LMS positions (S-type with L-type neighbor to the left)
3. First induced sort:
   a. Place LMS suffixes at bucket tails in appearance order
   b. Left-to-right scan SA, induce L-type suffixes (place at bucket heads)
   c. Right-to-left scan SA, induce S-type suffixes (place at bucket tails)
4. Name LMS substrings: compare adjacent LMS substrings for equality
5. If duplicate names exist:
   a. Construct reduced problem S1 (sequence of LMS substring names)
   b. Recursively call SA-IS(S1, n1, sigma1)
   c. Use recursive result to determine correct LMS order
6. Second induced sort: redo step 3 with correct LMS ordering
7. Return SA

Time complexity proof:

Steps 1-4, 6: all O(n)
Step 5 recursion: subproblem size n1 <= n/2 (LMS positions have gap >= 2)
Total: T(n) <= T(n/2) + cn = O(n)

Why SA-IS is faster than DC3/Skew:

DC3 (Karkkäinen & Sanders, 2003) splits suffixes into mod-3 groups, recursing on 2n/3 of the problem. SA-IS recurses on n/2 with smaller constants (no merge step needed). In practice, SA-IS is 2-3x faster.

3.4 Blumer et al. (1985): Theoretical Foundation of SAM

In "The Smallest Automaton Recognizing the Subwords of a Text," Blumer et al. proved key theorems about suffix automata:

Theorem 1 (State count upper bound): The SAM of a length-n string has at most 2n-1 states (achieved when s = abbb...b).

Theorem 2 (Transition count upper bound): At most 3n-4 transitions (achieved when s = abbb...bbc).

Theorem 3 (endpos equivalence class property): Two substrings u, v belong to the same class iff endpos(u) = endpos(v). Classes form a tree (suffix link tree), with leaves corresponding to each position in the string.

3.5 Correctness Proof of Online SAM Construction

The online construction adds one character at a time. The key is proving three cases cover all possibilities:

Let s be the string before adding character c, and let last be the current state (representing the equivalence class of the entire string s).

Case 1: Walking up from last along suffix links, no state has a c-transition.

Character c has never appeared in s. New state cur's suffix link points to root.

Case 2: We reach state p with a c-transition to q, and len(q) = len(p) + 1.

q's longest substring is exactly p's longest substring plus c. New state cur's suffix link points directly to q.

Case 3: We reach state p with a c-transition to q, but len(q) > len(p) + 1.

q's equivalence class needs splitting. Clone q as clone with clone.len = p.len + 1, then redirect relevant transitions.

Amortized O(n) proof: Although each extend may walk up multiple suffix links, each extend adds at most 2 states and 3 transitions. The suffix link tree's total depth is O(n), and each transition is redirected at most once. Total time: O(n).

Level 4 · Edge Cases and Pitfalls

4.1 Suffix Array Applications in Competitive Programming

Problem 1: Longest Common Substring (Two Strings)

Given strings A and B, find their longest common substring.

def longest_common_substring(a: str, b: str) -> str:
    """
    Method: concatenate A + '#' + B, build SA and LCP array
    Answer = maximum LCP where adjacent suffixes belong to different strings
    
    Time: O(n + m) with SA-IS
    """
    separator = '#'  # Must not appear in A or B
    combined = a + separator + b
    n = len(combined)
    len_a = len(a)
    
    sa = build_suffix_array_naive(combined)  # Use SA-IS in practice
    lcp = build_lcp_array(combined, sa)
    
    best_len = 0
    best_pos = 0
    
    for i in range(1, n):
        # Check if sa[i] and sa[i-1] belong to different strings
        in_a_curr = sa[i] < len_a
        in_a_prev = sa[i - 1] < len_a
        
        if in_a_curr != in_a_prev and lcp[i] > best_len:
            best_len = lcp[i]
            best_pos = sa[i]
    
    return combined[best_pos:best_pos + best_len]


# Demo
print(longest_common_substring("ABABC", "BABCBA"))  # "BABC"

Problem 2: Distinct Substring Count (SAM solution)

def count_distinct_substrings_sam(s: str) -> int:
    """
    Count distinct substrings in O(n) using SAM
    Each state contributes len(state) - len(link(state)) substrings
    """
    sam = SuffixAutomaton()
    for c in s:
        sam.extend(c)
    return sam.count_distinct_substrings()


# Verify: consistent with suffix array method
s = "banana"
# SAM method
print(f"SAM: {count_distinct_substrings_sam(s)}")

# SA + LCP method
sa = build_suffix_array_naive(s)
lcp = build_lcp_array(s, sa)
n = len(s)
print(f"SA+LCP: {n * (n + 1) // 2 - sum(lcp)}")

Problem 3: k-th Smallest Substring

def kth_smallest_substring(s: str, k: int) -> str:
    """
    Find the k-th lexicographically smallest distinct substring
    
    Method: SA + LCP, traverse suffixes in SA order
    SA[i]'s suffix contributes len(s) - SA[i] - LCP[i] "new" substrings
    Accumulate until exceeding k
    """
    n = len(s)
    sa = build_suffix_array_naive(s)
    lcp = build_lcp_array(s, sa)
    
    count = 0
    for i in range(n):
        # New substrings from suffix sa[i]:
        # lengths from lcp[i]+1 to n-sa[i]
        new_substrings = n - sa[i] - lcp[i]
        
        if count + new_substrings >= k:
            # The k-th one is in this suffix
            length = lcp[i] + (k - count)
            return s[sa[i]:sa[i] + length]
        
        count += new_substrings
    
    return ""  # k exceeds total substring count


# Demo
s = "banana"
for k in range(1, 6):
    print(f"Substring #{k}: '{kth_smallest_substring(s, k)}'")

4.2 Real-World Engineering Applications

Full-Text Search Engines (e.g., Google Code Search)

Google's Code Search used suffix arrays as its underlying index. Russ Cox described this system in his 2012 blog post "Regular Expression Matching with a Trigram Index":

Build suffix arrays (or trigram index approximation) over the codebase
During search, use SA to locate candidate positions
Use regex engine for precise matching verification

Genome Alignment (BWA, Bowtie)

Modern genome alignment tools (BWA, Bowtie2) are built on the FM-Index, which is based on the BWT (Burrows-Wheeler Transform) — essentially a transformation of the suffix array:

BWT[i] = S[SA[i] - 1] (the character preceding each suffix in SA order)

FM-Index advantages:

Compressed storage (human genome 3GB -> FM-Index ~1.5GB)
Exact matching in O(m) without decompression

Data Compression (bzip2)

bzip2 uses BWT as a key compression step:

Compute the string's BWT
BWT clusters similar contexts together
Apply Move-to-Front + Huffman encoding

4.3 Knowledge Relevant to Interviews

Interviews rarely ask for direct suffix array/SAM implementation (too complex), but these variants may appear:

Variant 1: Longest Repeated Substring (LeetCode #1044)

Interviews typically expect O(n log n) binary search + Rabin-Karp, not suffix arrays. But knowing the O(n) SA solution demonstrates deeper understanding.

def longest_repeated_substring_binary_search(s: str) -> str:
    """
    Binary search + Rabin-Karp: O(n log n)
    Binary search on answer length, use rolling hash to check for repeats
    """
    n = len(s)
    MOD = (1 << 61) - 1
    BASE = 131
    
    def has_repeat(length: int) -> str:
        """Check if a repeated substring of given length exists"""
        if length == 0:
            return ""
        
        power = pow(BASE, length, MOD)
        
        # Rolling hash
        h = 0
        for i in range(length):
            h = (h * BASE + ord(s[i])) % MOD
        
        seen = {h: [0]}  # hash -> [starting positions]
        
        for i in range(1, n - length + 1):
            h = (h * BASE - ord(s[i - 1]) * power + ord(s[i + length - 1])) % MOD
            h = (h + MOD) % MOD
            
            if h in seen:
                # Verify on hash collision
                for prev in seen[h]:
                    if s[prev:prev + length] == s[i:i + length]:
                        return s[i:i + length]
                seen[h].append(i)
            else:
                seen[h] = [i]
        
        return ""
    
    # Binary search on length
    lo, hi = 0, n - 1
    result = ""
    
    while lo <= hi:
        mid = (lo + hi) // 2
        found = has_repeat(mid)
        if found:
            result = found
            lo = mid + 1
        else:
            hi = mid - 1
    
    return result

Variant 2: Repeated Substring Pattern (LeetCode #459)

Determine if string s can be formed by repeating a substring multiple times.

def repeated_substring_pattern(s: str) -> bool:
    """
    Classic trick: if s + s (with first and last characters removed) contains s,
    then s has a repeating structure.
    
    Why? If s = t * k (t repeated k times), then s + s = t * 2k.
    After removing first and last characters, it still contains a length-n run of t's.
    """
    doubled = (s + s)[1:-1]
    return s in doubled

4.4 Performance Comparison Experiment

import time
import random
import string

def benchmark_suffix_array_methods():
    """
    Compare actual performance of different suffix array construction methods
    """
    sizes = [1000, 5000, 10000, 50000]
    
    for n in sizes:
        s = ''.join(random.choices(string.ascii_lowercase, k=n))
        
        # Naive O(n^2 log n)
        start = time.time()
        sa_naive = build_suffix_array_naive(s)
        t_naive = time.time() - start
        
        # Doubling O(n log^2 n)
        start = time.time()
        sa_doubling = build_suffix_array_doubling(s)
        t_doubling = time.time() - start
        
        print(f"n={n:6d}: naive={t_naive:.4f}s, doubling={t_doubling:.4f}s")
        assert sa_naive == sa_doubling, "Results inconsistent!"

# benchmark_suffix_array_methods()  # Uncomment to run

4.5 Common Implementation Pitfalls

Pitfall 1: Memory management in suffix automata

SAM states can reach 2n-1, each storing a transition dictionary. For large alphabets (e.g., Unicode), memory can explode.

# Problem: each state uses dict for transitions, expensive for 26 letters
class State:
    trans: dict  # Python dict: at least 64 bytes each

# Optimization: for small alphabets, use arrays
class StateOptimized:
    trans: list  # [None] * 26, fixed size

Pitfall 2: RMQ preprocessing on LCP array

Many suffix array applications need fast "minimum LCP between SA[i] and SA[j]" queries (Range Minimum Query). Scanning O(n) each time defeats the purpose.

Standard approach: preprocess LCP array with a Sparse Table for O(1) RMQ.

import math

class SparseTable:
    """Sparse Table: O(n log n) preprocessing, O(1) range minimum query"""
    
    def __init__(self, arr: list[int]):
        n = len(arr)
        k = int(math.log2(n)) + 1 if n > 0 else 0
        self.table = [[0] * n for _ in range(k)]
        self.table[0] = arr[:]
        
        for j in range(1, k):
            for i in range(n - (1 << j) + 1):
                self.table[j][i] = min(
                    self.table[j-1][i],
                    self.table[j-1][i + (1 << (j-1))]
                )
        
        self.log = [0] * (n + 1)
        for i in range(2, n + 1):
            self.log[i] = self.log[i // 2] + 1
    
    def query(self, l: int, r: int) -> int:
        """Query min(arr[l..r]), O(1)"""
        if l > r:
            l, r = r, l
        j = self.log[r - l + 1]
        return min(self.table[j][l], self.table[j][r - (1 << j) + 1])

Pitfall 3: Sentinel characters in suffix arrays

Many algorithms require appending a "sentinel" (usually '$' or '\x00') that's smaller than all alphabet characters. Forgetting it causes comparison errors or out-of-bounds access.

def build_sa_with_sentinel(s: str) -> list[int]:
    """Correct approach: add sentinel"""
    s = s + '\x00'  # Sentinel ensures no two suffixes are identical
    sa = build_suffix_array_naive(s)
    return sa[1:]  # sa[0] is always the sentinel suffix

4.6 Recommended Learning Path

Beginner: Thoroughly understand naive suffix array + Kasai LCP
Intermediate: Implement the doubling method (understand KMR idea)
Competitive programmer: Memorize SA-IS template or use Python's sort() + doubling
Research/Engineering: Understand FM-Index, compressed suffix arrays

4.7 Chapter Summary

Concept	Key Numbers
Suffix array space	O(n) integers
SA-IS build time	O(n)
Binary search time	O(m log n) -> O(m + log n) (Manber-Myers)
LCP array build	O(n) (Kasai's algorithm)
SAM state count	<= 2n-1
SAM transition count	<= 3n-4
Distinct substring count	n(n+1)/2 - sum(LCP)

Suffix arrays and suffix automata represent the pinnacle of string algorithms. They reduce "any question about substrings" to "build a structure, then query." Understanding them helps you not only solve problems but also comprehend the underlying design of search engine indexes, genome alignment tools, and data compression systems.

Rate this chapter

4.8 / 5 (3 ratings)