Union-Find: The Connectivity Powerhouse
Chapter 15: Union-Find โ The Connectivity Powerhouse
Imagine you have a map with hundreds of cities and some roads. Someone keeps asking you: "Can city A reach city B?" If you run BFS/DFS for every query, that's O(V+E) per query. But what if there are a million such queries?
Union-Find (also called Disjoint Set Union, or DSU) is the data structure designed exactly for this class of problems. It supports two core operations:
- Find(x): Find the representative of the set containing x
- Union(x, y): Merge the sets containing x and y
With path compression and union by rank, both operations run in amortized O(ฮฑ(n)) time, where ฮฑ is the inverse Ackermann function โ a function that grows so incredibly slowly that for all practical inputs (even if n = the number of atoms in the universe), ฮฑ(n) โค 4. In other words, each Union-Find operation is essentially constant time.
In this chapter, we'll start from the naive implementation, progressively add optimizations, understand why it's so fast, and then apply it to graph theory and interview problems.
Level 1 ยท What You Need to Know
15.1 The Starting Point: Dynamic Connectivity
Consider this scenario: you manage a social network where users can become friends. You need to frequently answer: "Are users A and B in the same friend circle?" (Not necessarily direct friends โ transitive connections count.)
More abstractly, we have n elements, initially each in its own set. We need to support:
- Union: Merge two sets into one
- Find/Connected: Determine whether two elements are in the same set
This is the Dynamic Connectivity Problem.
Using an array where each element stores its component ID? Sure, but merging two sets requires traversing the entire array to update IDs โ O(n). Using an adjacency list with BFS? Each query costs O(V+E). The elegance of Union-Find is that it achieves near-O(1) for both operations using a remarkably simple tree structure.
15.2 Core Idea: Representing Sets as Trees
The fundamental idea is:
- Each set is represented as a tree
- The root of the tree is the representative of that set
- Two elements are in the same set if and only if they have the same root
- Merging two sets = making one tree's root point to the other tree's root
We use a parent[] array to represent these trees: parent[i] stores the parent of node i. If parent[i] == i, then i is a root.
15.3 The Naive Implementation
class UnionFind:
def __init__(self, n):
self.parent = list(range(n)) # Initially, each element is its own parent
def find(self, x):
"""Find the root of x"""
while self.parent[x] != x:
x = self.parent[x]
return x
def union(self, x, y):
"""Merge the sets containing x and y"""
root_x = self.find(x)
root_y = self.find(y)
if root_x != root_y:
self.parent[root_x] = root_y
def connected(self, x, y):
"""Check if x and y are in the same set"""
return self.find(x) == self.find(y)
What's the problem? This naive version can degenerate into a linked list! If you execute union(0,1), union(1,2), union(2,3), ..., the tree height becomes O(n), making each find O(n).
15.4 Optimization 1: Path Compression
The idea behind path compression is simple: since we only care about the root, during find we can directly attach every node on the path to the root. Next time we query any of these nodes, we reach the root in one step.
def find(self, x):
"""Find with path compression"""
if self.parent[x] != x:
self.parent[x] = self.find(self.parent[x]) # Recursively point x directly to root
return self.parent[x]
One line of recursion achieves full path compression. After execution, every node from x to the root points directly to the root.
If you're worried about recursion depth (Python has a default limit of 1000), here's the iterative version:
def find(self, x):
"""Iterative path compression"""
root = x
while self.parent[root] != root:
root = self.parent[root]
# Path compression: point all nodes from x to root directly to root
while self.parent[x] != root:
next_x = self.parent[x]
self.parent[x] = root
x = next_x
return root
15.5 Optimization 2: Union by Rank
Path compression is already powerful, but we can also optimize the union operation: always attach the shorter tree under the root of the taller tree, so the merged tree's height doesn't increase.
We use rank to approximate height (path compression changes actual height, but rank is only updated during unions โ it's an upper bound on height).
class UnionFind:
def __init__(self, n):
self.parent = list(range(n))
self.rank = [0] * n # rank[i] is an upper bound on the height of tree rooted at i
def find(self, x):
if self.parent[x] != x:
self.parent[x] = self.find(self.parent[x])
return self.parent[x]
def union(self, x, y):
root_x = self.find(x)
root_y = self.find(y)
if root_x == root_y:
return False # Already in the same set
# Union by rank: attach shorter tree under taller tree
if self.rank[root_x] < self.rank[root_y]:
self.parent[root_x] = root_y
elif self.rank[root_x] > self.rank[root_y]:
self.parent[root_y] = root_x
else:
self.parent[root_y] = root_x
self.rank[root_x] += 1 # Only increment rank when both ranks are equal
return True
def connected(self, x, y):
return self.find(x) == self.find(y)
Why does rank only increase when both ranks are equal? Because a tree of rank h has at least 2^h nodes (provable by induction). When merging trees of different heights, attaching the shorter one under the taller one doesn't change the overall height. Only when two trees of equal height merge does the height increase by 1.
15.6 Complete Implementation with Component Counting
In many applications, we need to know "how many distinct connected components exist." Just maintain a counter that decrements on each successful union:
class UnionFind:
def __init__(self, n):
self.parent = list(range(n))
self.rank = [0] * n
self.count = n # Initially n components
def find(self, x):
if self.parent[x] != x:
self.parent[x] = self.find(self.parent[x])
return self.parent[x]
def union(self, x, y):
root_x = self.find(x)
root_y = self.find(y)
if root_x == root_y:
return False
if self.rank[root_x] < self.rank[root_y]:
self.parent[root_x] = root_y
elif self.rank[root_x] > self.rank[root_y]:
self.parent[root_y] = root_x
else:
self.parent[root_y] = root_x
self.rank[root_x] += 1
self.count -= 1 # Successful merge: one fewer component
return True
def connected(self, x, y):
return self.find(x) == self.find(y)
def get_count(self):
return self.count
15.7 Usage Example
# 5 nodes, numbered 0-4
uf = UnionFind(5)
print(uf.get_count()) # 5, each node is its own component
uf.union(0, 1) # {0,1}, {2}, {3}, {4}
uf.union(2, 3) # {0,1}, {2,3}, {4}
print(uf.get_count()) # 3
print(uf.connected(0, 1)) # True
print(uf.connected(0, 2)) # False
uf.union(1, 3) # {0,1,2,3}, {4}
print(uf.connected(0, 2)) # True
print(uf.get_count()) # 2
15.8 Union by Size Variant
Sometimes we need to know the size of each component. In such cases, "union by size" is more natural than "union by rank":
class UnionFindBySize:
def __init__(self, n):
self.parent = list(range(n))
self.size = [1] * n # size[i] = size of set rooted at i
self.count = n
def find(self, x):
if self.parent[x] != x:
self.parent[x] = self.find(self.parent[x])
return self.parent[x]
def union(self, x, y):
root_x = self.find(x)
root_y = self.find(y)
if root_x == root_y:
return False
# Attach smaller set under larger set
if self.size[root_x] < self.size[root_y]:
self.parent[root_x] = root_y
self.size[root_y] += self.size[root_x]
else:
self.parent[root_y] = root_x
self.size[root_x] += self.size[root_y]
self.count -= 1
return True
def get_size(self, x):
return self.size[self.find(x)]
Union by size and union by rank have the same time complexity: O(ฮฑ(n)). The difference is that union by size can answer "how large is this set?" in O(1).
15.9 Union-Find vs Other Approaches
| Approach | Union Time | Find/Connected Time | Space |
|---|---|---|---|
| Array labeling | O(n) | O(1) | O(n) |
| Adjacency list + BFS | O(1) | O(V+E) | O(V+E) |
| Naive Union-Find | O(n) worst | O(n) worst | O(n) |
| Path compression + rank | O(ฮฑ(n)) | O(ฮฑ(n)) | O(n) |
Since ฮฑ(n) โค 4 for all practical inputs, the last row is effectively O(1).
15.10 When to Use Union-Find
Union-Find is ideal when:
- You only need connectivity, not the actual path
- Edges are added dynamically (add-only โ standard Union-Find doesn't support deletion)
- Both offline and online processing work
- You need efficient set merging
Not suitable when:
- You need shortest paths โ use BFS/Dijkstra
- You need edge deletion โ use Link-Cut Tree or offline reverse processing
- You need to enumerate all elements in a set โ Union-Find doesn't directly support this
Level 2 ยท How It Works Under the Hood
15.11 Complexity Analysis: Why It's Nearly O(1)
Path Compression Only
With only path compression (no rank/size-based union), m operations on n elements take O(m log n) total time, amortized O(log n) per operation.
Intuition: Path compression makes trees increasingly flat. The first find on a deep path is expensive, but it simultaneously flattens every node on that path directly to the root. Subsequent finds on those nodes become O(1). This is classic amortized analysis โ occasional expensive operations pay for future cheap operations.
Union by Rank Only
With only union by rank (no path compression), tree height is guaranteed O(log n).
Proof: By induction, a tree of rank r has at least 2^r nodes.
- Base: rank = 0, tree has only the root, 1 = 2^0 nodes โ
- Inductive step: A tree of rank r can only be formed by merging two trees of rank r-1. By the inductive hypothesis, each has at least 2^(r-1) nodes. After merging: at least 2^(r-1) + 2^(r-1) = 2^r nodes โ
Therefore rank โค logโ(n), so find is O(log n).
Path Compression + Union by Rank
When both optimizations are combined, m operations take O(m ยท ฮฑ(n)) total time, where ฮฑ is the inverse Ackermann function.
15.12 The Inverse Ackermann Function ฮฑ(n)
The Ackermann function A(m, n) grows astronomically fast:
A(0, n) = n + 1
A(m, 0) = A(m-1, 1) (m > 0)
A(m, n) = A(m-1, A(m, n-1)) (m > 0, n > 0)
Some values:
- A(1, n) = n + 2
- A(2, n) = 2n + 3
- A(3, n) = 2^(n+3) - 3
- A(4, 1) = 2^(2^(2^(65536))) - 3 โ an astronomically large number
- A(4, 2) โ cannot even be written down
The inverse Ackermann function ฮฑ(n) = min{k : A(k, k) โฅ n}. Since A grows so fast, ฮฑ grows incredibly slowly:
| n | ฮฑ(n) |
|---|---|
| 1-2 | 0 |
| 3 | 1 |
| 4-7 | 2 |
| 8-2047 | 3 |
| 2048 to A(4,1) โ 2^65536 | 4 |
| A(4,1) to A(5,1) | 5 |
Since the number of atoms in the observable universe is approximately 10^80 โ 2^265, far less than 2^65536 = A(4,1), for all practical inputs ฮฑ(n) โค 4.
What does this mean? Union-Find with path compression and union by rank needs at most 4 extra steps per operation (in the amortized sense). This is why we say it's "essentially O(1)."
15.13 The Core Idea of Tarjan's Proof
Robert Tarjan proved the O(m ยท ฮฑ(n)) upper bound in 1975. His proof approach (greatly simplified):
- Define a potential function ฮฆ based on each node's "level" in the tree
- Partition nodes into "blocks" by rank, where block sizes increase according to the Ackermann function
- Show that path compression either decreases potential (banking savings for future operations) or has low actual cost itself
- Apply amortized analysis via the potential method to obtain total cost O(m ยท ฮฑ(n))
For the lower bound, Fredman and Saks (1989) proved that in the cell-probe model, any data structure supporting Union and Find requires ฮฉ(m ยท ฮฑ(n)) time in the worst case for m operations. This means path compression + union by rank is asymptotically optimal.
15.14 Weighted Union-Find
Standard Union-Find only answers "are two elements in the same set?" But some problems require maintaining relative relationships between elements (e.g., relative distances, relative weights). Weighted Union-Find is designed for this.
Core idea: Besides parent, maintain a weight array where weight[x] represents the "distance" (or other relational value) from x to parent[x]. By accumulating weights along the find path, we can determine the total weight from any node to the root.
class WeightedUnionFind:
def __init__(self, n):
self.parent = list(range(n))
self.rank = [0] * n
self.weight = [0] * n # weight[x] = weight from x to parent[x]
def find(self, x):
"""Return root of x, with path compression and weight update"""
if self.parent[x] != x:
root = self.find(self.parent[x])
self.weight[x] += self.weight[self.parent[x]] # Accumulate weights
self.parent[x] = root
return self.parent[x]
def union(self, x, y, w):
"""
Merge x and y with weight(x) - weight(y) = w
i.e., the weight difference of x relative to y is w
"""
root_x = self.find(x)
root_y = self.find(y)
if root_x == root_y:
return # Already in same set; could check consistency
# After find: weight[x] = distance from x to root_x
# After find: weight[y] = distance from y to root_y
# We want root_x โ root_y with weight = w + weight[y] - weight[x]
if self.rank[root_x] < self.rank[root_y]:
self.parent[root_x] = root_y
self.weight[root_x] = w + self.weight[y] - self.weight[x]
elif self.rank[root_x] > self.rank[root_y]:
self.parent[root_y] = root_x
self.weight[root_y] = -w + self.weight[x] - self.weight[y]
else:
self.parent[root_x] = root_y
self.weight[root_x] = w + self.weight[y] - self.weight[x]
self.rank[root_y] += 1
def query(self, x, y):
"""Query weight(x) - weight(y); requires x and y in the same set"""
if self.find(x) != self.find(y):
return None # Not in same set
return self.weight[x] - self.weight[y]
Typical applications of Weighted Union-Find:
- LeetCode 399 - Evaluate Division: Given a/b = 2, b/c = 3, find a/c. Model division relationships as multiplicative weights.
- Food Chain Problem (NOI 2001): Three species A, B, C form a food chain (A eats B, B eats C, C eats A). Use weight mod 3 to represent relative relationships.
15.15 Kruskal's Minimum Spanning Tree Algorithm
The most classic graph theory application of Union-Find is Kruskal's algorithm for finding the Minimum Spanning Tree (MST).
Algorithm idea:
- Sort all edges by weight in ascending order
- For each edge (u, v, w): if u and v are not in the same component, add this edge (union u and v)
- Repeat until n-1 edges are added (n = number of vertices)
def kruskal(n, edges):
"""
n: number of vertices
edges: [(weight, u, v), ...]
Returns MST edges and total weight
"""
edges.sort() # Sort by weight
uf = UnionFind(n)
mst = []
total_weight = 0
for weight, u, v in edges:
if not uf.connected(u, v):
uf.union(u, v)
mst.append((u, v, weight))
total_weight += weight
if len(mst) == n - 1:
break # Found n-1 edges, MST complete
return mst, total_weight
Time complexity: Sorting O(E log E) + iterating edges with Union/Find O(E ยท ฮฑ(V)). Total: O(E log E), since sorting dominates.
Why is Kruskal correct? This is a classic application of the greedy paradigm. Its correctness is guaranteed by the Cut Property: for any cut of the graph (partition of vertices into two groups), the minimum-weight edge crossing the cut belongs to some MST. Each time Kruskal adds an edge, it's the minimum-weight edge connecting two different components โ exactly the cut property in action.
For deeper discussion of minimum spanning trees, see Chapter 18.
15.16 Variants of Path Compression
Besides full path compression (all nodes point directly to root), there are several variants:
Path Splitting: Make every node on the path point to its grandparent.
def find_path_splitting(self, x):
while self.parent[x] != x:
self.parent[x], x = self.parent[self.parent[x]], self.parent[x]
return x
Path Halving: Every other node on the path points to its grandparent.
def find_path_halving(self, x):
while self.parent[x] != x:
self.parent[x] = self.parent[self.parent[x]]
x = self.parent[x]
return x
All three variants (full compression, path splitting, path halving) have the same O(m ยท ฮฑ(n)) amortized complexity in theory. In practice, path halving is often fastest because it modifies only half the pointers, reducing memory writes.
15.17 Space Optimization
For consecutively numbered nodes, Union-Find needs just two arrays (parent and rank/size) โ strictly O(n) space.
For non-integer keys (e.g., strings), use a hash map instead:
class UnionFindMap:
def __init__(self):
self.parent = {}
self.rank = {}
def add(self, x):
if x not in self.parent:
self.parent[x] = x
self.rank[x] = 0
def find(self, x):
if self.parent[x] != x:
self.parent[x] = self.find(self.parent[x])
return self.parent[x]
def union(self, x, y):
self.add(x)
self.add(y)
root_x = self.find(x)
root_y = self.find(y)
if root_x == root_y:
return False
if self.rank[root_x] < self.rank[root_y]:
self.parent[root_x] = root_y
elif self.rank[root_x] > self.rank[root_y]:
self.parent[root_y] = root_x
else:
self.parent[root_y] = root_x
self.rank[root_x] += 1
return True
Level 3 ยท What the Specification Says
15.18 History: Galler and Fischer's 1964 Paper
The invention of Union-Find traces back to Bernard A. Galler and Michael J. Fischer, who published "An Improved Equivalence Algorithm" in 1964 (Communications of the ACM, Vol. 7, No. 5, pp. 301-303).
Their motivation came from FORTRAN compilers handling EQUIVALENCE statements. FORTRAN's EQUIVALENCE statement allows programmers to declare that two variables share the same memory. The compiler needs to maintain the transitive closure of this equivalence relation โ if A is equivalent to B, and B is equivalent to C, then A is equivalent to C.
Galler and Fischer proposed representing equivalence classes as trees, and introduced union by size as an optimization. They proved that m operations on n elements take O(m log n) total time. This was a major improvement over the previous O(mn) methods.
Original quote (Galler & Fischer, 1964):
"The improved algorithm which we describe here represents each class of equivalent identifiers as a tree structure, where the root of each tree serves as the canonical representative of its class."
This is one of the most concise and elegant data structure inventions in computer science history: just two pages long, core code under 20 lines, yet solving a practical and important problem.
15.19 Tarjan's 1975 Breakthrough
Robert Endre Tarjan (yes, the same Tarjan who invented the strongly connected components algorithm) published his landmark paper "Efficiency of a Good But Not Linear Set Union Algorithm" in 1975 (Journal of the ACM, Vol. 22, No. 2, pp. 215-225).
Tarjan's contributions:
-
Introduced formal analysis of path compression: While the idea of path compression existed before, Tarjan was the first to rigorously analyze its complexity.
-
Proved the O(m ยท ฮฑ(n)) upper bound: This was a stunning result. People knew union by rank gives O(m log n), and that path compression is fast in practice, but no one could prove exactly how fast the combination is. Tarjan used an ingenious potential method, defining a hierarchical structure based on the Ackermann function.
-
Conjectured optimality: Tarjan conjectured that O(m ยท ฮฑ(n)) is optimal, but could not prove the lower bound at the time.
The key tool โ the Ackermann function: Tarjan's choice of the Ackermann function was not arbitrary. His analysis requires a fast-growing function to create "layers" โ partitioning node ranks into increasingly large blocks. The super-exponential growth of the Ackermann function provides exactly the right granularity for the layering.
15.20 The Lower Bound: Fredman-Saks 1989
Michael Fredman and Michael Saks proved in their 1989 paper "The Cell Probe Complexity of Dynamic Data Structures" (Proceedings of STOC, pp. 345-354):
In the cell-probe model, any data structure supporting m Union-Find operations requires ฮฉ(m ยท ฮฑ(n)) time in the worst case.
This means Union-Find with path compression + union by rank is asymptotically optimal โ you cannot invent a data structure that is asymptotically faster for the dynamic connectivity problem in the worst case.
The cell-probe model is extremely powerful (it only counts memory accesses; computation is free), so lower bounds in this model apply to all weaker models (including the RAM model).
15.21 Union-Find in Percolation
Percolation is a classic model in statistical physics and a beautiful application of Union-Find.
Problem description: Consider an nรn grid where each cell is "open" (passable) with probability p and "closed" with probability 1-p. Question: Is there a connected path from top to bottom?
Key concept โ percolation threshold: There exists a critical probability p* โ 0.5927 (for the 2D square lattice) such that when p > p*, there is almost certainly a path from top to bottom, and when p < p*, there is almost certainly not. This is a phase transition phenomenon.
Solving with Union-Find:
import random
class Percolation:
def __init__(self, n):
self.n = n
self.grid = [[False] * n for _ in range(n)]
# n*n cells + 2 virtual nodes (top source, bottom sink)
self.uf = UnionFind(n * n + 2)
self.top = n * n # Virtual top node
self.bottom = n * n + 1 # Virtual bottom node
self.open_count = 0
def _index(self, row, col):
return row * self.n + col
def open(self, row, col):
if self.grid[row][col]:
return
self.grid[row][col] = True
self.open_count += 1
# If in first row, connect to virtual top
if row == 0:
self.uf.union(self._index(row, col), self.top)
# If in last row, connect to virtual bottom
if row == self.n - 1:
self.uf.union(self._index(row, col), self.bottom)
# Connect to adjacent open cells
for dr, dc in [(-1,0), (1,0), (0,-1), (0,1)]:
nr, nc = row + dr, col + dc
if 0 <= nr < self.n and 0 <= nc < self.n and self.grid[nr][nc]:
self.uf.union(self._index(row, col), self._index(nr, nc))
def percolates(self):
"""Does a path exist from top to bottom?"""
return self.uf.connected(self.top, self.bottom)
def estimate_threshold(n, trials=1000):
"""Monte Carlo estimation of percolation threshold"""
total = 0
for _ in range(trials):
perc = Percolation(n)
sites = [(r, c) for r in range(n) for c in range(n)]
random.shuffle(sites)
for r, c in sites:
perc.open(r, c)
if perc.percolates():
total += perc.open_count / (n * n)
break
return total / trials
This algorithm is used by Princeton's Robert Sedgewick as a classic programming assignment in his Algorithms course. It beautifully demonstrates how Union-Find powers scientific simulations โ each cell opening requires only constant union operations, and the entire simulation runs in O(nยฒ ยท ฮฑ(nยฒ)) time.
15.22 Academic Variants of Union-Find
Rollback Union-Find (Persistent Union-Find):
Standard Union-Find merges are irreversible. But in some algorithms (e.g., divide-and-conquer for edge deletion problems), we need to "undo" the last merge. With union by rank (without path compression), we can use a stack to record each modification and restore them in reverse order when rolling back.
class RollbackUnionFind:
def __init__(self, n):
self.parent = list(range(n))
self.rank = [0] * n
self.history = [] # Record all modifications
def find(self, x):
# Note: NO path compression! Otherwise rollback is impossible
while self.parent[x] != x:
x = self.parent[x]
return x
def union(self, x, y):
root_x = self.find(x)
root_y = self.find(y)
if root_x == root_y:
self.history.append(None) # Mark: no actual merge
return False
if self.rank[root_x] < self.rank[root_y]:
root_x, root_y = root_y, root_x
# Record state before modification
self.history.append((root_y, self.parent[root_y], root_x, self.rank[root_x]))
self.parent[root_y] = root_x
if self.rank[root_x] == self.rank[root_y]:
self.rank[root_x] += 1
return True
def rollback(self):
"""Undo the most recent union operation"""
record = self.history.pop()
if record is None:
return # That union didn't actually merge
root_y, old_parent, root_x, old_rank = record
self.parent[root_y] = old_parent
self.rank[root_x] = old_rank
Note that rollback Union-Find cannot use path compression (a single find might modify many nodes, making simple rollback impossible), so its complexity is O(log n) rather than O(ฮฑ(n)).
Offline edge deletion: If a problem involves edge deletion, a common trick is "time reversal" โ convert deletions into additions and process from back to front using standard Union-Find. This avoids complex persistent data structures.
15.23 Relationship with Other Data Structures
Relationship with BFS/DFS on graphs: Union-Find can be seen as "preprocessing" for graph connectivity queries. If the edge set is fixed, one DFS marking components in O(V+E) suffices. Union-Find's advantage is when the edge set changes dynamically โ new edges are continuously added.
Relationship with MST algorithms: Kruskal uses Union-Find; Prim uses a priority queue. Both have similar time complexity, but Kruskal is better for sparse graphs (E close to V), while Prim is better for dense graphs (E close to Vยฒ).
Relationship with Link-Cut Trees: Union-Find only supports merging, not splitting. Link-Cut Trees (Sleator & Tarjan, 1983) support both dynamic linking and cutting, but each operation costs O(log n) with larger constants. If you don't need deletion, Union-Find is the better choice.
Level 4 ยท Edge Cases and Pitfalls
15.24 Interview Problem 1: Number of Provinces (LeetCode 547)
Problem: Given n cities with isConnected[i][j] = 1 indicating cities i and j are directly connected, find the number of provinces (connected components).
Analysis: This is the most direct application of Union-Find โ iterate over all edges, merge connected cities, and return the component count.
def findCircleNum(isConnected):
n = len(isConnected)
uf = UnionFind(n)
for i in range(n):
for j in range(i + 1, n): # Only traverse upper triangle
if isConnected[i][j] == 1:
uf.union(i, j)
return uf.get_count()
Time complexity: O(nยฒ ยท ฮฑ(n)), iterating the adjacency matrix. Space complexity: O(n).
Common mistakes:
- Forgetting to only traverse the upper (or lower) triangle, causing redundant processing. Doesn't affect correctness but wastes time.
- Forgetting that
isConnected[i][i] = 1doesn't need processing (a city is always connected to itself).
Alternative: DFS/BFS also works โ traverse each unvisited node and DFS to mark the entire component. Same O(nยฒ) time, but Union-Find has the advantage of cleaner code and flexibility for dynamic edge additions.
15.25 Interview Problem 2: Redundant Connection (LeetCode 684)
Problem: A tree with one extra edge becomes a graph. Find the redundant edge (if multiple answers exist, return the one appearing last in input).
Analysis: A tree has n nodes and n-1 edges. The problem gives n edges (one extra). Find the edge that creates a cycle.
Key insight: Process edges in input order; the first edge whose endpoints are already connected forms the cycle. Since the problem guarantees exactly one redundant edge, there's exactly one such edge.
def findRedundantConnection(edges):
n = len(edges)
uf = UnionFind(n + 1) # Nodes numbered from 1
for u, v in edges:
if not uf.union(u, v):
return [u, v] # Both endpoints already connected โ this edge is redundant
return [] # Won't reach here
Why is this correct? A tree is an acyclic connected graph. Processing edges in order, earlier edges form a forest (acyclic). When an edge (u, v) is encountered where u and v are already in the same tree, this edge creates a cycle โ it's the redundant one.
Follow-up: What about directed graphs? (LeetCode 685 - Redundant Connection II) This is significantly more complex, requiring case analysis: whether a node has in-degree 2, whether a directed cycle exists.
15.26 Interview Problem 3: Satisfiability of Equality Equations (LeetCode 990)
Problem: Given a set of equations and inequalities (e.g., ["a==b", "b!=c", "c==a"]), determine if there exists an assignment satisfying all constraints.
Analysis:
- Equation
a==bmeans a and b must be in the same set - Inequality
a!=bmeans a and b must be in different sets
Strategy: First process all equations (union), then check all inequalities for conflicts.
def equationsPossible(equations):
uf = UnionFind(26) # 26 letters
# First pass: process all equations
for eq in equations:
if eq[1] == '=': # "x==y"
x = ord(eq[0]) - ord('a')
y = ord(eq[3]) - ord('a')
uf.union(x, y)
# Second pass: check all inequalities
for eq in equations:
if eq[1] == '!': # "x!=y"
x = ord(eq[0]) - ord('a')
y = ord(eq[3]) - ord('a')
if uf.connected(x, y):
return False # Contradiction! x and y are already in the same set
return True
Why process equations first, then inequalities? Because equality is transitive (a==b, b==c โ a==c). We need to establish all transitive relationships before checking inequalities for contradictions. Interleaving might miss transitive relationships.
Follow-up: What if inequalities are also transitive? Then simple Union-Find won't work โ you'd need 2-SAT or constraint propagation.
15.27 Interview Problem 4: Largest Component Size by Common Factor (LeetCode 952)
Problem: Given array nums, if nums[i] and nums[j] share a common factor greater than 1, connect them with an edge. Find the largest connected component size.
Brute force: Check GCD for all pairs: O(nยฒ log(max)). Too slow.
Optimized approach: Instead of checking GCD for every pair, factorize each number and put the number and its factors in the same set.
def largestComponentSize(nums):
max_val = max(nums)
uf = UnionFind(max_val + 1)
for num in nums:
# Factorize num and merge num with all its factors
d = 2
while d * d <= num:
if num % d == 0:
uf.union(num, d)
uf.union(num, num // d)
d += 1
# Count how many nums elements are in each component
from collections import Counter
count = Counter(uf.find(num) for num in nums)
return max(count.values())
Why is this correct? If nums[i] and nums[j] share a common factor p, then nums[i] and p are in the same set, and nums[j] and p are in the same set, so nums[i] and nums[j] are in the same set. Transitivity is automatically guaranteed by Union-Find.
Time complexity: Factorizing each number is O(โnum), total O(n ยท โmax_val). Much better than O(nยฒ).
Space note: The implementation above creates a UnionFind of size max_val + 1. If numbers are large but few, use the hash map version instead.
Follow-up: What if values reach 10^9? Use a sieve for prime factorization preprocessing, or only factorize numbers in nums and use index mapping to compress the space.
15.28 Common Pitfalls with Union-Find
Pitfall 1: Comparing parents instead of roots
# WRONG!
if uf.parent[x] == uf.parent[y]: # Compares direct parents, not roots!
...
# CORRECT
if uf.find(x) == uf.find(y):
...
Pitfall 2: Merging original nodes instead of roots
# WRONG!
def union(self, x, y):
self.parent[x] = y # Directly connects x to y, breaking tree structure
# CORRECT
def union(self, x, y):
root_x = self.find(x)
root_y = self.find(y)
self.parent[root_x] = root_y # Connect roots
Pitfall 3: Off-by-one with 1-indexed nodes
Many graph problems use nodes numbered 1 to n. Create UnionFind with size n + 1, or subtract 1 from all indices.
Pitfall 4: Python recursion depth
Python's default recursion limit is 1000. For deep trees (e.g., 10^5 nodes in a chain before path compression kicks in), recursive find will overflow. Solutions:
- Use iterative find
- Or
import sys; sys.setrecursionlimit(200000)(not recommended โ may actually stack overflow)
Pitfall 5: Assuming Union-Find supports edge deletion
Union-Find only supports merging, not splitting. If the problem involves edge deletion, either:
- Process offline in reverse (convert deletions to additions)
- Use Link-Cut Tree
- Use rollback Union-Find + divide and conquer
15.29 Recognition Patterns for Interviews
When you encounter these signals in an interview, consider Union-Find:
| Signal | Example |
|---|---|
| "connected", "reachable", "same group" | Number of Provinces, Number of Islands |
| "grouping", "classify", "equivalence" | Equality Equations, Account Merge |
| "common factor", "shared property" | Largest Component by Common Factor |
| "redundant edge", "cycle" | Redundant Connection |
| "minimum spanning tree", "minimum cost to connect" | Min Cost to Connect All Points |
| "earliest time fully connected" | Earliest Moment When Everyone Becomes Friends |
Interview template:
class UnionFind:
def __init__(self, n):
self.parent = list(range(n))
self.rank = [0] * n
self.count = n
def find(self, x):
if self.parent[x] != x:
self.parent[x] = self.find(self.parent[x])
return self.parent[x]
def union(self, x, y):
rx, ry = self.find(x), self.find(y)
if rx == ry: return False
if self.rank[rx] < self.rank[ry]: rx, ry = ry, rx
self.parent[ry] = rx
if self.rank[rx] == self.rank[ry]: self.rank[rx] += 1
self.count -= 1
return True
def connected(self, x, y):
return self.find(x) == self.find(y)
This template is about 20 lines. In an interview, you should be able to write it from memory in under 2 minutes. Practice until it becomes muscle memory.
15.30 Comprehensive Comparison and Summary
| Dimension | Union-Find | BFS/DFS | Adjacency List |
|---|---|---|---|
| Connectivity check | O(ฮฑ(n)) โ O(1) | O(V+E) | O(V+E) |
| Dynamic edge addition | O(ฮฑ(n)) | Requires re-traversal | O(1) add, slow query |
| Edge deletion | Not supported | Requires re-traversal | O(1) delete, slow query |
| Find actual path | Not supported | Supported | Supported |
| Space | O(n) | O(V+E) | O(V+E) |
| Code complexity | Very low (20 lines) | Medium | Medium |
One-sentence summary: Union-Find is the optimal data structure for dynamic connectivity โ O(ฮฑ(n)) near-constant time, O(n) space, and 20 lines of code. Its limitations are that it only supports merging (not splitting) and can only determine connectivity (not produce paths). In interviews, once you identify the "dynamic merging + connectivity query" pattern, Union-Find is your go-to weapon.