MySQL Index Optimization Guide

Indexes are the single most impactful lever for MySQL performance. A well-designed index can reduce query time from seconds to milliseconds; a poorly designed one can cripple write throughput. This chapter starts from the physical B+Tree storage structure, dives deep into composite index design methodology, covers 12 common failure scenarios, and shows how to estimate the size of a planned index.

Target Version: This chapter is based on MySQL 8.0 / 8.4 LTS (InnoDB engine). Features specific to 8.0+ (descending indexes, invisible indexes, functional indexes, histograms) are explicitly noted. Differences with 5.7 and earlier are also mentioned.

1. B+Tree Data Structure

1.1 Why B+Tree Over Other Structures?

A database index is fundamentally a data structure problem: given a key, locate the value as fast as possible. Let's compare the candidates:

| Data Structure | Lookup | Range Query | Disk I/O | Best For |
|---|---|---|---|---|
| Hash Table | O(1) | Not supported | Random | MEMORY engine equality lookups |
| Binary Search Tree | O(log n) avg, O(n) worst | In-order traversal | One I/O per level | Small in-memory datasets |
| AVL / Red-Black Tree | O(log n) | In-order traversal | Height ≈ log2(n) (too tall) | In-memory indexes (Java TreeMap) |
| B-Tree | O(log_m n) | Requires backtracking | One I/O per level, very few levels | Filesystem indexes |
| B+Tree | O(log_m n) | Leaf linked-list scan | 2-4 I/O ops | Relational databases |
| LSM-Tree | O(log n) multi-level | Merge required | Write-optimized | RocksDB / LevelDB |
| Skip List | O(log n) | Linked-list sequential | Not disk-friendly | Redis ZSET |

Why B+Tree wins for databases:

  1. Extremely low tree height: With 16KB pages and ~14 bytes per internal entry (an 8-byte BIGINT key plus a 6-byte child-page pointer), each internal node has a fanout of ~1170. A 3-level B+Tree addresses 1170^2 ≈ 1.37 million leaf pages; at ~16 rows per leaf page (assuming ~1KB rows), that is roughly 20 million rows. Four levels reaches 1170^3 ≈ 1.6 billion leaf pages -- tens of billions of rows. Locating any row in most tables therefore costs only 2-4 page reads.
  2. Doubly-linked leaf nodes: B+Tree leaf nodes are connected via forward and backward pointers, enabling efficient range scans (BETWEEN, >, ORDER BY) by simply following the linked list after finding the starting leaf.
  3. Internal nodes store only keys: Unlike B-Trees, B+Tree internal nodes don't store data rows -- only keys and child-page pointers. This means higher fanout, shorter trees, and fewer I/O operations.
  4. All data at leaf level: Every lookup traverses to the leaf level, making path length consistent and performance predictable.
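The height arithmetic above can be checked in a few lines. This is a back-of-envelope sketch: the 8-byte key, 6-byte child pointer, and ~1KB row size are round-number assumptions, not exact InnoDB accounting.

```python
PAGE_SIZE = 16 * 1024      # default innodb_page_size
ENTRY_SIZE = 8 + 6         # assumed: 8-byte BIGINT key + 6-byte child pointer
ROW_SIZE = 1024            # assumed: ~1KB per row in the clustered leaf

fanout = PAGE_SIZE // ENTRY_SIZE       # children per internal page
rows_per_leaf = PAGE_SIZE // ROW_SIZE  # rows per leaf page

def max_rows(levels):
    # (levels - 1) internal levels multiply the fanout; leaves hold the rows
    return fanout ** (levels - 1) * rows_per_leaf

print(fanout)        # 1170
print(max_rows(3))   # 21902400 -> ~20 million rows in a 3-level tree
print(max_rows(4))   # ~25.6 billion rows in a 4-level tree
```

Under these assumptions a 3-level tree already covers ~20 million rows, which is why most OLTP lookups finish in 2-4 page reads.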

1.2 InnoDB B+Tree Physical Structure

InnoDB uses pages as the minimum I/O unit, defaulting to 16KB (innodb_page_size). The physical structure of a B+Tree index:

                         [Root Page]
                        /     |     \
             [Internal]   [Internal]   [Internal]
             /  |  \       /  |  \       /  |  \
        [Leaf] [Leaf] [Leaf] [Leaf] [Leaf] [Leaf] [Leaf]
          |----->|----->|----->|----->|----->|----->|
          |<-----|<-----|<-----|<-----|<-----|<-----|
     Leaf nodes connected via doubly-linked list (sorted by key)

Each page's internal structure:

+--------------------------------------------------+
| Page Header (38 bytes)                           |
|  - Page Number, Previous Page, Next Page         |
|  - Page Type, Space ID, LSN                      |
+--------------------------------------------------+
| Infimum Record (virtual minimum record)          |
| Supremum Record (virtual maximum record)         |
+--------------------------------------------------+
| User Records (actual index records, sorted)      |
|  Record 1 -> Record 2 -> Record 3 -> ...         |
|  (singly-linked via next_record offset)          |
+--------------------------------------------------+
| Free Space                                       |
+--------------------------------------------------+
| Page Directory (slot array for binary search)    |
+--------------------------------------------------+
| Page Trailer (8 bytes, checksum)                 |
+--------------------------------------------------+

Intra-page Lookup Process

Within a page, records form a singly-linked list sorted by key. To avoid O(n) traversal, InnoDB maintains a Page Directory -- an array of "slots," each pointing to a record in a group (typically 4-8 records). Lookup first binary-searches the slots, then sequentially scans the small group.
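The two-phase lookup (binary search over slots, then a short linear scan) can be sketched with a toy page. The fixed group size of 4 is a simplifying assumption; real InnoDB groups hold a variable 1-8 records, tracked by each owner record's n_owned field.

```python
import bisect

records = list(range(0, 80, 2))    # 40 sorted keys in one toy page
GROUP = 4                          # assumed records per directory slot

# Each slot stores the LAST key of its group (the group's "owner" record)
slots = [records[min(i + GROUP - 1, len(records) - 1)]
         for i in range(0, len(records), GROUP)]

def page_lookup(key):
    s = bisect.bisect_left(slots, key)      # 1. binary search the slot array
    if s == len(slots):
        return None                         # key beyond all records in page
    start = s * GROUP                       # 2. linear scan the small group
    for k in records[start:start + GROUP]:
        if k == key:
            return k
    return None

print(page_lookup(10))   # 10 (found)
print(page_lookup(7))    # None (absent)
```

The binary search costs O(log slots) comparisons and the final scan touches at most a handful of records, which is why intra-page lookups are effectively free compared to the page I/O itself.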

1.3 B+Tree vs B-Tree: Key Differences

| Feature | B-Tree | B+Tree (used by InnoDB) |
|---|---|---|
| Data storage | All nodes (including internal) store data | Only leaf nodes store data |
| Leaf linked list | None | Doubly-linked list |
| Range queries | Requires full in-order traversal | Sequential leaf scan from start point |
| Internal node fanout | Smaller (stores data too) | Larger (keys + pointers only) |
| Lookup path length | Variable (may find data mid-tree) | Fixed (always reaches leaf level) |
| Cache friendliness | Moderate | Better (small internal nodes fit in Buffer Pool) |

1.4 Clustered Index

InnoDB stores table data itself organized as a B+Tree keyed by the primary key. This B+Tree -- with complete row data in its leaf nodes -- is the Clustered Index.

Key insight: In InnoDB, the table IS the clustered index and the clustered index IS the table. They are the same data. The .ibd file contains data organized as a primary-key B+Tree.

Primary key selection rules (InnoDB's automatic behavior):

  1. If a PRIMARY KEY is defined, use it
  2. If no PRIMARY KEY, use the first NOT NULL UNIQUE KEY
  3. If neither exists, InnoDB auto-generates a hidden 6-byte DB_ROW_ID column as the clustered index key
Always define an explicit primary key! The hidden DB_ROW_ID is globally auto-incremented (not per-table), creating a global mutex bottleneck under high-concurrency inserts. You also cannot use it in WHERE clauses, wasting the clustered index capability entirely.

1.5 Secondary Index

Every index other than the clustered index is a secondary index (also called non-clustered index). Its B+Tree leaf nodes store the indexed column values + the corresponding row's primary key value.

Clustered Index (PRIMARY KEY = id):

                [Root: ..., 50, 100, ...]
               /           |            \
  [Leaf: id=1..49]  [Leaf: id=50..99]  [Leaf: id=100..149]

  Each leaf stores full rows:
    id=1, name='Alice', email='alice@...'
    id=2, name='Bob',   email='bob@...'

Secondary Index (INDEX idx_name (name)):

                [Root: ..., 'John', 'Mike', ...]
               /              |               \
  [Leaf: 'A'..'J']   [Leaf: 'J'..'M']   [Leaf: 'M'..'Z']

  Each leaf stores:
    name='Alice', id=1
    name='Bob',   id=2
    (index columns + primary key)

Bookmark Lookup (Back to Clustered Index)

After finding the primary key via a secondary index, if the query needs columns not in the secondary index, InnoDB must look up the full row in the clustered index. This is called a bookmark lookup (or "back to table" lookup).

-- Assume INDEX idx_name (name)
SELECT id, name, email FROM users WHERE name = 'Alice';

-- Execution:
-- 1. Search idx_name B+Tree for name='Alice' -> get id=1
-- 2. Take id=1 to clustered index (PRIMARY KEY) B+Tree
-- 3. Find full row at leaf: id=1, name='Alice', email='[email protected]'
-- 4. Return result

-- This requires 2 B+Tree lookups (2 random I/O paths)
Cost of bookmark lookups: Each row requires an independent random I/O to the clustered index. If the secondary index scan returns 1000 rows, worst case requires 1000 additional random I/Os. This is why the optimizer may choose a full table scan over an index when selectivity is low -- sequential scanning can be faster than thousands of random I/Os.

1.6 Page Splits and Merges

Page Split

When inserting a record into a full leaf page, InnoDB must split the page into two:

  1. Allocate a new empty page
  2. Move approximately 50% of records to the new page
  3. Update the parent node's pointers and keys
  4. If the parent is also full, recursively split upward (rare)
UUID primary keys cause disaster: Random UUIDs (UUID()) as primary keys insert at random positions, triggering page splits on nearly every insert. Auto-increment keys always append to the last leaf page, causing almost no splits. Under heavy writes, UUID primary keys can be 3-5x slower than auto-increment.

Primary key type comparison:

| PK Type | Split Frequency | Space Utilization | Distributed-Friendly |
|---|---|---|---|
| AUTO_INCREMENT | Very low | ~95% | Poor (needs coordination) |
| UUID v4 (random) | Very high | ~50-70% | Good |
| UUID v7 / ULID (ordered) | Low | ~90% | Good |
| Snowflake ID | Low | ~90% | Good |
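A toy simulation makes the utilization gap between sequential and random keys visible. The tiny page capacity and the 50/50 split model are simplifying assumptions; the rightmost-insert shortcut (leaving the old page full and starting a fresh one) mirrors InnoDB's append optimization for monotonically increasing keys.

```python
import bisect
import random

CAPACITY = 4  # assumed tiny page size so splits happen quickly

def insert(pages, k):
    """Insert k into the sorted leaf pages; return 1 if a split occurred."""
    i = next((j for j, p in enumerate(pages) if not p or k <= p[-1]),
             len(pages) - 1)
    page = pages[i]
    if len(page) < CAPACITY:
        bisect.insort(page, k)
        return 0
    if i == len(pages) - 1 and k > page[-1]:
        pages.append([k])            # rightmost insert: old page stays full
        return 1
    mid = CAPACITY // 2              # mid-tree insert: classic 50/50 split
    left, right = page[:mid], page[mid:]
    bisect.insort(left if k <= left[-1] else right, k)
    pages[i:i + 1] = [left, right]
    return 1

def load(keys):
    pages = [[]]
    splits = sum(insert(pages, k) for k in keys)
    fill = sum(map(len, pages)) / (len(pages) * CAPACITY)
    return fill, splits

random.seed(7)
seq_fill, _ = load(range(1000))                        # auto-increment style
rnd_fill, _ = load(random.sample(range(1000), 1000))   # random-UUID style
print(f"sequential fill: {seq_fill:.0%}, random fill: {rnd_fill:.0%}")
```

Sequential loading leaves every page full, while random-order loading settles around the ~70% fill that half-splits produce -- the same effect that inflates a UUIDv4-keyed table.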

Page Merge

When a page's utilization drops below ~50% (controlled by MERGE_THRESHOLD) due to deletions, InnoDB attempts to merge adjacent pages, freeing empty pages.

-- Observe page merge behavior
SELECT * FROM INFORMATION_SCHEMA.INNODB_METRICS
WHERE NAME LIKE '%index_page_merge%';

-- MySQL 5.7.6+: set merge threshold per index via index comment
ALTER TABLE orders DROP INDEX idx_status,
  ADD INDEX idx_status (status) COMMENT 'MERGE_THRESHOLD=30';

1.7 Index Key Storage Format

Index Record Format:

+------------------+------------------+-------------------+
| Variable-length  | NULL flags       | Record header     |
| field lengths    | (1 bit per       | (5 bytes)         |
| (stored reverse) | nullable col)    |  - delete_flag    |
|                  |                  |  - min_rec_flag   |
|                  |                  |  - n_owned        |
|                  |                  |  - heap_no        |
|                  |                  |  - record_type    |
|                  |                  |  - next_record    |
+------------------+------------------+-------------------+
| Col1 | Col2 | ... | PK value (secondary) / Row data (clustered) |
+------+------+-----+---------------------------------------------+

Storage overhead key points:

  • INT = fixed 4 bytes, BIGINT = fixed 8 bytes -- very compact as index columns
  • VARCHAR(255) = 1 byte length + actual content bytes. If NULLable, add NULL flag bits
  • Every secondary index record appends the PK columns, so shorter PKs = smaller secondary indexes
  • CHAR(255) CHARSET utf8mb4 can occupy up to 255 * 4 = 1020 bytes per value -- catastrophic waste as an index column

2. MySQL Index Types

2.1 PRIMARY KEY

CREATE TABLE users (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (id)
) ENGINE=InnoDB;
  • Only one per table
  • All columns must be NOT NULL
  • In InnoDB, this IS the clustered index
  • Automatically unique

2.2 UNIQUE KEY

ALTER TABLE users ADD UNIQUE KEY uk_email (email);
  • Guarantees uniqueness among non-NULL values; multiple NULL rows are allowed (NULL is never equal to NULL in uniqueness checks)
  • Lookup performance nearly identical to INDEX -- only difference is stopping after first match
  • Adds uniqueness check overhead on INSERT/UPDATE

2.3 INDEX / KEY (Regular Secondary Index)

ALTER TABLE orders ADD INDEX idx_user_id (user_id);
  • Most common index type
  • Allows duplicate values
  • Supports equality, range, sorting, and grouping

2.4 FULLTEXT Index

ALTER TABLE articles ADD FULLTEXT INDEX ft_content (title, body)
  WITH PARSER ngram; -- needed for CJK text

SELECT * FROM articles
WHERE MATCH(title, body) AGAINST('MySQL optimization' IN BOOLEAN MODE);
  • InnoDB FULLTEXT uses an inverted index structure
  • Best for keyword search, not prefix or exact matching
  • For large-scale search, consider Elasticsearch or Meilisearch

2.5 SPATIAL Index

CREATE TABLE places (
  id BIGINT PRIMARY KEY AUTO_INCREMENT,
  name VARCHAR(100),
  location POINT NOT NULL SRID 4326,
  SPATIAL INDEX sp_location (location)
) ENGINE=InnoDB;
  • Uses R-Tree data structure (not B+Tree)
  • Column must be NOT NULL with SRID declared
  • InnoDB has supported SPATIAL indexes since MySQL 5.7; SRID-restricted columns (used here) require 8.0

2.6 Covering Index

Not a separate index type, but a query optimization effect: when all columns needed by a query are contained in the index, no bookmark lookup is needed. See Section 5.

2.7 Prefix Index

-- Index only first 20 characters of VARCHAR(255)
ALTER TABLE users ADD INDEX idx_email_prefix (email(20));

-- Determine optimal prefix length:
SELECT
  COUNT(DISTINCT LEFT(email, 10)) / COUNT(*) AS sel_10,
  COUNT(DISTINCT LEFT(email, 15)) / COUNT(*) AS sel_15,
  COUNT(DISTINCT LEFT(email, 20)) / COUNT(*) AS sel_20,
  COUNT(DISTINCT email) / COUNT(*) AS sel_full
FROM users;
Prefix index limitations: Cannot be used as a covering index (incomplete values), cannot be used for ORDER BY or GROUP BY. Only use when the column is very long and storage is a concern.
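The same selectivity measurement can be run offline on an exported sample of the column. This is a toy sketch: the email list and the prefix_selectivity helper are illustrative stand-ins for the COUNT(DISTINCT LEFT(...)) query above.

```python
def prefix_selectivity(values, n=None):
    """Share of distinct values after truncating each to n characters."""
    cut = {v[:n] if n else v for v in values}
    return len(cut) / len(values)

emails = ["alice@example.com", "alicia@example.com", "bob@example.org",
          "carol@example.com", "caroline@test.net", "dave@test.net"]

for n in (3, 5, 10, None):
    print(n, round(prefix_selectivity(emails, n), 2))
```

Selectivity climbs with prefix length and plateaus; pick the shortest prefix whose selectivity is close to the full column's, since every extra byte widens the index.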

2.8 Composite (Multi-Column) Index

ALTER TABLE orders ADD INDEX idx_user_status_created
  (user_id, status, created_at);

Composite indexes are the core of index optimization. Multiple column values are concatenated into a single key in the B+Tree. See Section 3.

2.9 Descending Index -- MySQL 8.0+

-- MySQL 5.7: DESC keyword was parsed but ignored (all stored ASC)
-- MySQL 8.0+: truly supports descending storage

ALTER TABLE events ADD INDEX idx_time_score (created_at DESC, score ASC);

SELECT * FROM events ORDER BY created_at DESC, score ASC LIMIT 20;
-- EXPLAIN Extra: Using index (no filesort)
When do you need descending indexes? Only when ORDER BY has multiple columns with different directions (one ASC, one DESC). If all columns share the same direction, a single-direction index works via forward or backward scanning.

2.10 Invisible Index -- MySQL 8.0+

-- Make index invisible: optimizer ignores it, but index is still maintained
ALTER TABLE orders ALTER INDEX idx_status INVISIBLE;

-- After observing no performance regression, safely drop
ALTER TABLE orders DROP INDEX idx_status;

-- If problems arise, instantly restore
ALTER TABLE orders ALTER INDEX idx_status VISIBLE;

-- Even invisible indexes can be force-tested
SET optimizer_switch = 'use_invisible_indexes=on';

2.11 Functional Index -- MySQL 8.0.13+

-- WHERE YEAR(created_at) = 2026 can't use a regular index
-- Solution: create a functional index
ALTER TABLE orders ADD INDEX idx_year ((YEAR(created_at)));

SELECT * FROM orders WHERE YEAR(created_at) = 2026;
-- EXPLAIN: type=ref, key=idx_year

-- JSON field functional index
ALTER TABLE products ADD INDEX idx_brand
  ((CAST(attributes->>'$.brand' AS CHAR(50) CHARSET utf8mb4)));
Implementation: MySQL 8.0 functional indexes are internally hidden virtual generated columns + regular indexes.

3. Leftmost Prefix Rule

A composite index's B+Tree sorts entries by declared column order sequentially. The index is only usable when the query conditions start from the leftmost column and match continuously. This is the Leftmost Prefix Rule.

Given composite index:

INDEX idx_abc (a, b, c)

B+Tree sort order: First by a; within same a, by b; within same (a,b), by c.

idx_abc (a, b, c) leaf node data (illustration):

| a=1,b=1,c=1 | -> | a=1,b=1,c=3 | -> | a=1,b=2,c=1 | -> | a=1,b=2,c=5 | ->
| a=2,b=1,c=2 | -> | a=2,b=1,c=4 | -> | a=2,b=3,c=1 | -> | a=3,b=1,c=1 |

Observations:
  - Column a is globally sorted: 1, 1, 1, 1, 2, 2, 2, 3
  - Column b is sorted only within the same a value
  - Column c is sorted only within the same (a, b) values
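Python tuples sort lexicographically exactly like composite index keys, which makes the "sorted only within" behavior easy to verify on the illustration's data:

```python
# Composite index entries behave like sorted tuples (a, b, c)
entries = sorted([(1, 1, 1), (1, 1, 3), (1, 2, 1), (1, 2, 5),
                  (2, 1, 2), (2, 1, 4), (2, 3, 1), (3, 1, 1)])

# WHERE a = 1 -> matches form ONE contiguous range: a range scan works
pos_a1 = [i for i, e in enumerate(entries) if e[0] == 1]
print(pos_a1)            # [0, 1, 2, 3] -- contiguous

# WHERE b = 1 (leftmost column skipped) -> matches are scattered
pos_b1 = [i for i, e in enumerate(entries) if e[1] == 1]
print(pos_b1)            # [0, 1, 4, 5, 7] -- no single range exists
```

The scattered positions for b = 1 are precisely why skipping the leftmost column defeats the index: there is no contiguous interval of the B+Tree to scan.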

3.1 12 Query Examples

All examples assume INDEX idx_abc (a, b, c):

1 WHERE a = 1

Uses index. Matches leftmost column. key_len includes only column a.

EXPLAIN SELECT * FROM t WHERE a = 1;
-- type: ref | key: idx_abc | key_len: 4 | ref: const

2 WHERE a = 1 AND b = 2

Uses index. Matches columns a and b.

EXPLAIN SELECT * FROM t WHERE a = 1 AND b = 2;
-- type: ref | key: idx_abc | key_len: 8 | ref: const,const

3 WHERE a = 1 AND b = 2 AND c = 3

Uses index. All three columns matched.

EXPLAIN SELECT * FROM t WHERE a = 1 AND b = 2 AND c = 3;
-- type: ref | key: idx_abc | key_len: 12 | ref: const,const,const

4 WHERE b = 2

Cannot use the index for navigation. The leftmost column a is skipped, and b is not globally sorted in the index. (MySQL 8.0.13+ has a Skip Scan optimization that can occasionally rescue this pattern when a has very few distinct values, but it rarely applies.)

EXPLAIN SELECT * FROM t WHERE b = 2;
-- type: ALL | key: NULL (full table scan)

5 WHERE b = 2 AND c = 3

Cannot use index. Missing leftmost column a.

EXPLAIN SELECT * FROM t WHERE b = 2 AND c = 3;
-- type: ALL | key: NULL

6 WHERE a = 1 AND c = 3

Partially uses index. Uses a for filtering. Since b is skipped, c cannot be used to navigate the B+Tree (c is not sorted across all a=1 rows). However, Index Condition Pushdown (ICP, available since MySQL 5.6) can filter c at the index level to reduce bookmark lookups.

EXPLAIN SELECT * FROM t WHERE a = 1 AND c = 3;
-- type: ref | key: idx_abc | key_len: 4 (only a used)
-- Extra: Using index condition (ICP filters c at index level)

7 WHERE a = 1 ORDER BY b

Uses index. After filtering a=1, b is already sorted within that range. No filesort needed.

EXPLAIN SELECT * FROM t WHERE a = 1 ORDER BY b;
-- type: ref | key: idx_abc | Extra: no filesort

8 WHERE a = 1 ORDER BY c

Partially uses index. a used for filtering, but c is not sorted within a=1 (because b varies), requiring filesort.

EXPLAIN SELECT * FROM t WHERE a = 1 ORDER BY c;
-- Extra: Using filesort

9 WHERE a = 1 AND b = 2 ORDER BY c

Uses index. After a=1 AND b=2, c is sorted within that subset.

EXPLAIN SELECT * FROM t WHERE a = 1 AND b = 2 ORDER BY c;
-- type: ref | key_len: 8 | Extra: no filesort

10 WHERE a > 1 AND b = 2

Partially uses index. a is used for range scan, but the range condition breaks the index for subsequent column b.

EXPLAIN SELECT * FROM t WHERE a > 1 AND b = 2;
-- type: range | key: idx_abc | key_len: 4 (only a)
-- Extra: Using index condition
Range breakpoint rule: In a composite index, after encountering a range condition (>, <, BETWEEN, LIKE 'xxx%'), all subsequent index columns cannot be used for precise lookups. Equality conditions (=, IN) do NOT break the chain.

11 WHERE a IN (1, 2) AND b = 2 AND c = 3

Uses full index. IN is treated as multiple equality conditions. It does NOT break subsequent columns.

EXPLAIN SELECT * FROM t WHERE a IN (1, 2) AND b = 2 AND c = 3;
-- type: range | key: idx_abc | key_len: 12 (all three columns)

12 WHERE a = 1 AND b > 5 AND c = 3

Partially uses index. a (equality) + b (range) used. b's range breaks c. ICP can filter c at index level.

EXPLAIN SELECT * FROM t WHERE a = 1 AND b > 5 AND c = 3;
-- type: range | key_len: 8 (a + b) | Extra: Using index condition

3.2 Leftmost Prefix Summary Table

| WHERE Clause | Index Columns Used | Reason |
|---|---|---|
| a = 1 | a (1 col) | Matches leftmost |
| a = 1 AND b = 2 | a, b (2 cols) | Consecutive from left |
| a = 1 AND b = 2 AND c = 3 | a, b, c (3 cols) | All matched |
| b = 2 | 0 | Skipped leftmost |
| c = 3 | 0 | Skipped leftmost |
| b = 2 AND c = 3 | 0 | Skipped leftmost |
| a = 1 AND c = 3 | a (1) + ICP | b skipped, c via ICP only |
| a > 1 AND b = 2 | a (range) | Range breaks subsequent |
| a = 1 AND b > 5 AND c = 3 | a, b (range) | b range breaks c |
| a IN (1,2) AND b = 2 AND c = 3 | a, b, c (3 cols) | IN does not break |

3.3 Column Order Design Principles

  1. Equality columns first (most important)
  2. Higher selectivity columns first (among equality columns)
  3. Range columns last (range breaks subsequent columns)
  4. ORDER BY / GROUP BY columns right after equality columns (avoid filesort)

4. 12 Index Failure Scenarios

Even with indexes created, these 12 common scenarios cause indexes to be unusable (or the optimizer to bypass them), degrading queries to full table scans. Each includes SQL and EXPLAIN output.

Root principle: Index failure boils down to two causes: (1) the query condition cannot match the B+Tree's sort structure; (2) the optimizer determines using the index is less efficient than a full scan. Understanding these two covers all scenarios.

1 Function on indexed column

-- ❌ Index fails: YEAR() function
SELECT * FROM orders WHERE YEAR(created_at) = 2026;
-- EXPLAIN: type=ALL, key=NULL

-- ✅ Rewrite as range
SELECT * FROM orders
WHERE created_at >= '2026-01-01' AND created_at < '2027-01-01';
-- EXPLAIN: type=range, key=idx_created

-- ✅ MySQL 8.0.13+: functional index
ALTER TABLE orders ADD INDEX idx_year ((YEAR(created_at)));

2 Implicit type conversion

-- phone is VARCHAR(20) with INDEX idx_phone (phone)

-- ❌ Index fails: integer input, MySQL casts phone to number
SELECT * FROM users WHERE phone = 13800138000;

-- ✅ Pass as string
SELECT * FROM users WHERE phone = '13800138000';
-- EXPLAIN: type=ref, key=idx_phone
MySQL conversion rule: When comparing string column with number, MySQL converts the string to number (not vice versa) -- equivalent to applying CAST on the indexed column.

3 LIKE with leading wildcard

-- ❌ Leading wildcard: no search starting point in B+Tree
SELECT * FROM users WHERE name LIKE '%son';
-- EXPLAIN: type=ALL

-- ✅ Trailing wildcard works
SELECT * FROM users WHERE name LIKE 'John%';
-- EXPLAIN: type=range, key=idx_name

4 OR mixing indexed and non-indexed columns

-- ❌ Index fails
SELECT * FROM users WHERE name = 'John' OR remark = 'VIP';
-- EXPLAIN: type=ALL (remark has no index)

-- ✅ Add index to both, enables Index Merge
ALTER TABLE users ADD INDEX idx_remark (remark);
-- Or rewrite as UNION ALL

5 NOT IN / NOT EXISTS / != / <>

-- ❌ Often full scan (depends on data distribution)
SELECT * FROM orders WHERE status != 'completed';

-- ✅ Rewrite as positive condition
SELECT * FROM orders WHERE status IN ('pending', 'processing', 'shipped');

6 Expression on indexed column

-- ❌ Index fails
SELECT * FROM orders WHERE id + 1 = 10;

-- ✅ Rearrange the expression
SELECT * FROM orders WHERE id = 9;
-- EXPLAIN: type=const, rows=1

7 Charset mismatch in JOIN

-- table_a.user_id: utf8mb4_general_ci
-- table_b.user_id: utf8_general_ci

-- ❌ Index fails: charset mismatch
SELECT a.*, b.name
FROM table_a a JOIN table_b b ON a.user_id = b.user_id;
-- EXPLAIN (table_b): type=ALL

-- ✅ Unify charset
ALTER TABLE table_b MODIFY user_id VARCHAR(50)
  CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
This is one of the most insidious production pitfalls. Tables created at different times often have different charsets. The mismatch only surfaces as performance degrades with growing data.

8 Low cardinality -- optimizer chooses full scan

-- gender column with index, but only 2 distinct values
SELECT * FROM users WHERE gender = 'M';
-- ~50% of 1M rows match -> full scan is faster than 500K bookmark lookups

9 ORDER BY direction mismatch

-- INDEX (a ASC, b ASC) can't satisfy ORDER BY a ASC, b DESC
-- ✅ MySQL 8.0+: create descending index
ALTER TABLE t ADD INDEX idx_a_bdesc (a ASC, b DESC);

10 SELECT * when covering index is available

-- ❌ SELECT * forces bookmark lookup
SELECT * FROM orders WHERE user_id = 100 AND status = 'pending';

-- ✅ Select only indexed columns -> covering index
SELECT user_id, status FROM orders WHERE user_id = 100 AND status = 'pending';
-- Extra: Using index

11 Range condition breaks subsequent index columns

-- INDEX (a, b, c): WHERE a = 1 AND b > 10 AND c = 5
-- Only a and b (range) are used; c cannot narrow the search

-- ✅ Reorder: INDEX (a, c, b) if c equality is more selective

12 Small table -- optimizer decides full scan is faster

-- config table with 20 rows
SELECT * FROM config WHERE key_name = 'max_retries';
-- EXPLAIN: type=ALL, rows=20
-- Correct optimizer decision. No fix needed.

4.1 Failure Scenario Quick Reference

| # | Scenario | Root Cause | Fix |
|---|---|---|---|
| 1 | Function: YEAR(col) | B+Tree sorted by raw value | Rewrite as range / functional index |
| 2 | Implicit cast: varchar = 123 | Equivalent to CAST function | Match types |
| 3 | LIKE '%xxx' | No prefix = no start point | FULLTEXT / ES |
| 4 | OR with mixed indexed/non-indexed | Cannot merge scans | Index both / UNION |
| 5 | != / NOT IN | Low selectivity negation | Rewrite as positive IN |
| 6 | Expression: id+1=10 | Equivalent to function | Rearrange |
| 7 | Charset mismatch | Implicit CONVERT | Unify charset/collation |
| 8 | Low cardinality | Lookups costlier than scan | Composite / covering index |
| 9 | Sort direction mismatch | Mixed ASC/DESC | Descending index (8.0+) |
| 10 | SELECT * | Cannot cover | Select only needed columns |
| 11 | Range breaks subsequent | Post-range cols unsorted | Equality first, range last |
| 12 | Small table | Full scan ≤ index lookup | No fix needed |

5. Covering Index Deep Dive

5.1 What Is a Covering Index?

When all columns needed by a query (SELECT, WHERE, ORDER BY, GROUP BY) are contained within a single index, InnoDB can return results directly from the index B+Tree without any bookmark lookup to the clustered index.

-- INDEX idx_cover (user_id, status, amount)

SELECT user_id, status, amount FROM orders
WHERE user_id = 100 AND status = 'pending';

-- EXPLAIN Extra: Using index    <-- KEY INDICATOR!

EXPLAIN Extra column meanings:

  • Using index -- Covering index, no bookmark lookup
  • Using index condition -- ICP, filtered at index level but still needs bookmark lookup
  • NULL (empty Extra) -- Used index for positioning, bookmark lookup required

5.2 Performance Benefits

| Metric | Non-covering | Covering | Benefit |
|---|---|---|---|
| I/O (1000-row result) | 3-4 index + 1000 random bookmark I/Os | 3-4 index + sequential leaf scan | ~99% fewer I/Os |
| Data pages accessed | Index pages + many scattered data pages | Index pages only | Higher Buffer Pool hit rate |
| Data volume read | Full rows (potentially KB each) | Index entries only (bytes each) | 90%+ less data read |

5.3 Index Condition Pushdown (ICP)

Introduced in MySQL 5.6. Without ICP, the storage engine uses only the index prefix columns to locate records, then passes full rows to the Server layer for further filtering. ICP allows the engine to filter using remaining index columns at the index level, reducing bookmark lookups.

-- INDEX idx_abc (a, b, c)
-- Query: WHERE a = 1 AND c = 5 (skipping b)

-- Without ICP: locate all a=1 rows (10000), bookmark lookup each, filter c=5 at server (100 remain)
-- With ICP: locate all a=1 rows, check c=5 AT INDEX LEVEL, only bookmark lookup 100 rows
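The bookmark-lookup saving can be quantified with a toy model. The counts are made-up illustrations matching the comment above, with index entries modeled as plain tuples:

```python
# Entries of idx_abc(a, b, c), modeled as tuples: 100 b-values x 100 c-values
index_entries = [(1, b, c) for b in range(100) for c in range(100)]

# Query: WHERE a = 1 AND c = 5 (b skipped)
a_matches = [e for e in index_entries if e[0] == 1]

# Without ICP: every a=1 entry costs one bookmark lookup; c filtered at server
lookups_without_icp = len(a_matches)

# With ICP: the engine checks c = 5 inside the index before any lookup
lookups_with_icp = sum(1 for e in a_matches if e[2] == 5)

print(lookups_without_icp, lookups_with_icp)   # 10000 vs 100
```

Same result set, 100x fewer random I/Os -- which is all ICP does: it moves a cheap in-index comparison ahead of an expensive clustered-index visit.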

5.4 How to Design Covering Indexes

  1. Identify all columns involved: SELECT + WHERE + ORDER BY + GROUP BY
  2. WHERE equality columns first
  3. ORDER BY / GROUP BY columns next
  4. Display-only SELECT columns last
  5. Assess width tradeoff: Too-wide covering indexes increase write overhead and storage

6. Index Merge

Normally MySQL uses only one index per query. But with certain OR/AND conditions, the optimizer can use multiple indexes simultaneously and merge the results.

6.1 Three Index Merge Algorithms

| Algorithm | Condition | Operation | EXPLAIN Extra |
|---|---|---|---|
| Intersection | AND with separate indexes | Intersect PK sets | Using intersect(idx1,idx2) |
| Union | OR with separate indexes | Union PK sets | Using union(idx1,idx2) |
| Sort-Union | OR with range conditions | Sort then union | Using sort_union(idx1,idx2) |
Index Merge Intersection usually means a missing composite index. If you see Using intersect, creating INDEX (a, b) is almost always more efficient than two separate index lookups + merge.
-- Control Index Merge behavior
SET optimizer_switch = 'index_merge_intersection=off';

-- Via hints
SELECT /*+ NO_INDEX_MERGE(t) */ * FROM t WHERE a = 1 AND b = 2;
SELECT /*+ INDEX_MERGE(t idx_a, idx_b) */ * FROM t WHERE a = 1 OR b = 2;

7. Cardinality & Selectivity

7.1 What Is Cardinality?

Cardinality = estimated number of distinct values in an index column. Selectivity = Cardinality / Total Rows. Higher selectivity (closer to 1) = more effective index.

SHOW INDEX FROM orders;
-- user_id cardinality: 523467 / 9.8M rows = 0.053 (moderate)
-- status cardinality: 5 / 9.8M rows = 0.0000005 (very low)

7.2 How InnoDB Estimates Cardinality

InnoDB uses random sampling: it picks innodb_stats_persistent_sample_pages (default 20) random leaf pages, counts distinct values, and extrapolates.

SHOW VARIABLES LIKE 'innodb_stats%';
-- innodb_stats_persistent_sample_pages = 20
-- innodb_stats_auto_recalc = ON  (recalc when >10% DML changes)

7.3 ANALYZE TABLE

-- Trigger statistics recalculation
ANALYZE TABLE orders;
-- Online operation in InnoDB, does not block reads/writes

-- Increase precision for skewed distributions
ALTER TABLE orders STATS_SAMPLE_PAGES = 100;
ANALYZE TABLE orders;

7.4 Histogram Statistics -- MySQL 8.0+

-- Create histogram (no index needed)
ANALYZE TABLE orders UPDATE HISTOGRAM ON status WITH 16 BUCKETS;
ANALYZE TABLE orders UPDATE HISTOGRAM ON total_amount WITH 64 BUCKETS;

-- View histogram info
SELECT SCHEMA_NAME, TABLE_NAME, COLUMN_NAME,
       JSON_EXTRACT(histogram, '$.histogram-type') AS type,
       JSON_EXTRACT(histogram, '$.number-of-buckets-specified') AS buckets
FROM INFORMATION_SCHEMA.COLUMN_STATISTICS;

Histogram vs Index:

| Aspect | Histogram | Index |
|---|---|---|
| Purpose | Help optimizer estimate row counts | Speed up data lookup |
| Write overhead | None (updated only on ANALYZE) | Every INSERT/UPDATE/DELETE |
| Storage | Minimal (in data dictionary) | Can be very large |

8. Index Design Methodology

8.1 Systematic Design Process

  1. Collect query patterns from slow query log and performance_schema
  2. For each query, identify: equality columns, range columns, JOIN columns, ORDER BY/GROUP BY columns, SELECT columns
  3. Apply the Three-Star System
  4. Merge indexes: if multiple queries can share one composite index via leftmost prefix
  5. Evaluate write overhead
  6. Validate post-deployment with EXPLAIN and monitoring

8.2 Three-Star Index System

From Lahdenmaki & Leach's "Relational Database Index Design and the Optimizers":

| Star | Condition | Meaning |
|---|---|---|
| First ⭐ | Index groups relevant WHERE rows together | Minimize scan range. Equality columns first. |
| Second ⭐⭐ | Index row order matches ORDER BY | Avoid filesort. ORDER BY follows equality cols. |
| Third ⭐⭐⭐ | Index contains all query columns | Covering index. No bookmark lookup. |

-- Query:
SELECT user_id, order_date, total_amount
FROM orders WHERE user_id = 100 AND status = 'completed'
ORDER BY order_date DESC LIMIT 20;

-- ★☆☆ One-star: INDEX (user_id)
-- ★★☆ Two-star: INDEX (user_id, status, order_date DESC)
-- ★★★ Three-star: INDEX (user_id, status, order_date DESC, total_amount)
Three stars isn't always optimal. Pursuing the third star (covering index) means wider indexes, more write overhead, and more storage. For write-heavy tables, two stars may be the better tradeoff.

8.3 Index Overhead Analysis

Write Amplification

  • INSERT: one clustered index write + one write per secondary index
  • DELETE: each secondary index needs a delete marker
  • UPDATE of indexed column: effectively DELETE + INSERT per affected index
-- Table with 5 secondary indexes:
-- One INSERT = 1 clustered + 5 secondary = 6 B+Tree insert operations
-- Plus redo log, undo log, doublewrite buffer...
-- Actual amplification can reach 10-20x

8.4 When NOT to Add an Index

  • Very small tables (<1000 rows): full scan is one I/O
  • Very low selectivity columns (boolean, gender): unless part of composite
  • Frequently bulk-updated columns: maintenance cost may exceed query benefit
  • Write-only tables (audit logs): indexes are pure overhead
  • Already covered by existing index: INDEX (a,b,c) already covers INDEX (a) and INDEX (a,b)

9. Practical Examples

9.1 E-commerce Orders Table (100M rows)

CREATE TABLE orders (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  user_id BIGINT UNSIGNED NOT NULL,
  shop_id INT UNSIGNED NOT NULL,
  order_no VARCHAR(32) NOT NULL,
  status TINYINT NOT NULL DEFAULT 0,
  total_amount DECIMAL(12,2) NOT NULL,
  pay_time DATETIME DEFAULT NULL,
  created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  updated_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (id),
  UNIQUE KEY uk_order_no (order_no),
  INDEX idx_user_status_created (user_id, status, created_at),
  INDEX idx_shop_created (shop_id, created_at),
  INDEX idx_pay_time (pay_time)
) ENGINE=InnoDB;

Deep Pagination Optimization

-- ❌ Deep pagination disaster
SELECT * FROM orders WHERE user_id = 100
ORDER BY created_at DESC LIMIT 20 OFFSET 100000;

-- ✅ Cursor-based pagination (recommended)
SELECT * FROM orders
WHERE user_id = 100
  AND (created_at, id) < ('2026-01-15 10:30:00', 99850)
ORDER BY created_at DESC, id DESC
LIMIT 20;

-- ✅ Deferred join
SELECT o.* FROM orders o
INNER JOIN (
  SELECT id FROM orders
  WHERE user_id = 100
  ORDER BY created_at DESC
  LIMIT 20 OFFSET 100000
) t ON o.id = t.id;

9.2 User Activity Log (Time-Range Queries)

CREATE TABLE activity_log (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  user_id BIGINT UNSIGNED NOT NULL,
  action VARCHAR(50) NOT NULL,
  target_type VARCHAR(30),
  target_id BIGINT UNSIGNED,
  created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (id, created_at),  -- every unique key must include the partitioning column
  INDEX idx_user_created (user_id, created_at)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(created_at)) (
  PARTITION p202601 VALUES LESS THAN (TO_DAYS('2026-02-01')),
  PARTITION p202602 VALUES LESS THAN (TO_DAYS('2026-03-01')),
  PARTITION p_future VALUES LESS THAN MAXVALUE
);

9.3 Social Feed (Complex JOIN + Pagination)

-- Feed query (pull model)
SELECT p.id, p.user_id, p.content, p.like_count, p.created_at
FROM posts p
INNER JOIN follows f ON f.following_id = p.user_id
WHERE f.follower_id = 12345
  AND p.created_at > NOW() - INTERVAL 7 DAY
ORDER BY p.created_at DESC
LIMIT 20;

-- Key: follows PRIMARY KEY (follower_id, following_id) is already optimal
-- posts INDEX (user_id, created_at DESC) perfectly matches

10. Index Monitoring

10.1 Finding Unused Indexes

SELECT * FROM sys.schema_unused_indexes
WHERE object_schema = 'your_db';

-- Note: based on performance_schema, needs sufficient uptime to be accurate
-- UNIQUE KEYs serve data integrity even if never queried

10.2 Finding Redundant Indexes

SELECT * FROM sys.schema_redundant_indexes
WHERE table_schema = 'your_db';

-- Example: idx_user_id (user_id) is redundant with idx_user_status (user_id, status)
-- Also: pt-duplicate-key-checker from Percona Toolkit

10.3 Index Usage Statistics

SELECT OBJECT_SCHEMA, OBJECT_NAME, INDEX_NAME,
       COUNT_READ AS reads, COUNT_WRITE AS writes
FROM performance_schema.table_io_waits_summary_by_index_usage
WHERE OBJECT_SCHEMA = 'your_db' AND INDEX_NAME IS NOT NULL
ORDER BY COUNT_READ DESC;

10.4 Index Fragmentation

-- Check fragmentation
SELECT TABLE_NAME,
  ROUND(DATA_FREE / (DATA_LENGTH + INDEX_LENGTH + DATA_FREE) * 100, 1) AS frag_pct
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'your_db'
ORDER BY DATA_FREE DESC;

-- Rebuild if fragmentation > 20%
ALTER TABLE orders ENGINE=InnoDB; -- online rebuild

11. Index Size Calculator

Estimate storage growth before adding a new index. Enter the table's row count and the index column definitions to calculate the estimated index size.

Supported types: INT, BIGINT, TINYINT, SMALLINT, MEDIUMINT, DATETIME, TIMESTAMP, DATE, FLOAT, DOUBLE, DECIMAL, CHAR(n), VARCHAR(n), BINARY(n), VARBINARY(n).
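The calculator's underlying arithmetic can be sketched as follows. This is a rough model, not exact InnoDB accounting: each secondary-index leaf entry stores the key columns plus the primary-key value, and the per-record overhead constant and fill factor below are assumptions.

```python
# Rough secondary-index size estimate. Constants are approximations:
# InnoDB leaves ~1/16 of each page free by default (MERGE_THRESHOLD aside),
# and per-record overhead (record header, page directory share) is assumed.
TYPE_BYTES = {
    "TINYINT": 1, "SMALLINT": 2, "MEDIUMINT": 3, "INT": 4, "BIGINT": 8,
    "DATE": 3, "TIMESTAMP": 4, "DATETIME": 5, "FLOAT": 4, "DOUBLE": 8,
}
RECORD_OVERHEAD = 13   # assumed per-entry overhead in bytes
PAGE_FILL = 15 / 16    # default fill factor for sequential inserts

def estimate_index_size(row_count, key_columns, pk_bytes=8):
    """Estimate secondary index size in bytes.

    key_columns: list of fixed-length type names from TYPE_BYTES.
    pk_bytes: size of the primary key stored in each leaf entry (BIGINT = 8).
    """
    key_bytes = sum(TYPE_BYTES[t] for t in key_columns)
    entry = key_bytes + pk_bytes + RECORD_OVERHEAD
    return int(row_count * entry / PAGE_FILL)

# e.g. 10M rows, index on (user_id BIGINT, created_at DATETIME), BIGINT PK
size = estimate_index_size(10_000_000, ["BIGINT", "DATETIME"])
print(f"{size / 2**20:.0f} MiB")
```

Variable-length types (VARCHAR, VARBINARY) would need average actual lengths rather than declared maximums, which is why the interactive calculator asks for column information rather than just a type list.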

12. Frequently Asked Questions

Q1: What is the maximum number of indexes per table?

InnoDB supports up to 64 secondary indexes (plus 1 clustered index), with a maximum of 16 columns per index. In practice, more than 5-7 indexes per table warrants a review for redundancy. More indexes = slower writes.

Q2: How do I decide the column order in a composite index?

Core principle: Equality columns -> Sort columns -> Range columns. Among equality columns, place higher-selectivity ones first. Range columns always go last because range conditions break subsequent column usage. For covering indexes, append display-only columns at the end.
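Ranking equality columns by selectivity is simple arithmetic: selectivity = COUNT(DISTINCT col) / COUNT(*). A toy sketch with made-up sample rows (the data and column names are illustrative only):

```python
# Selectivity = distinct values / total rows; among equality columns,
# the higher-selectivity column usually goes first in the composite index.
def selectivity(values):
    return len(set(values)) / len(values)

rows = [
    {"status": s, "user_id": u}
    for s, u in [(1, 101), (1, 102), (2, 103), (1, 104), (2, 101)]
]

cols = ["status", "user_id"]
ranked = sorted(cols, key=lambda c: selectivity([r[c] for r in rows]),
                reverse=True)
print(ranked)   # user_id (4/5 distinct) ranks before status (2/5 distinct)
```

In SQL the same measurement is `SELECT COUNT(DISTINCT col) / COUNT(*) FROM t`; values near 1 indicate a highly selective column.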

Q3: Should I use auto-increment ID or UUID for primary keys?

Auto-increment BIGINT for most cases. Sequential inserts avoid page splits, 8-byte fixed length keeps secondary indexes small, and comparison is fastest. For distributed systems needing globally unique IDs, use UUID v7 / ULID / Snowflake -- time-ordered, balancing uniqueness and sequentiality. Avoid random UUID v4 as primary keys.

Q4: How is EXPLAIN key_len calculated?

key_len shows the number of bytes actually used from the index. Rules: INT = 4, BIGINT = 8, DATETIME = 5, TIMESTAMP = 4, DATE = 3, CHAR(n) = n × charset max bytes (4 for utf8mb4), VARCHAR(n) = n × charset max bytes + 2 (length prefix). Add 1 byte for each NULLable column. From key_len you can work out how many columns of a composite index were actually used.
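The rules above are mechanical enough to recompute by hand or in a few lines. A sketch covering the fixed-length types plus CHAR/VARCHAR under utf8mb4 (the helper and its input format are our own):

```python
# Recompute EXPLAIN key_len from the stated rules (utf8mb4 = 4 bytes/char).
FIXED = {"INT": 4, "BIGINT": 8, "DATETIME": 5, "TIMESTAMP": 4, "DATE": 3}

def key_len(columns):
    """columns: list of (type_name, char_length_or_None, nullable) tuples."""
    total = 0
    for typ, n, nullable in columns:
        if typ in FIXED:
            total += FIXED[typ]
        elif typ == "CHAR":
            total += n * 4
        elif typ == "VARCHAR":
            total += n * 4 + 2          # 2-byte stored length prefix
        else:
            raise ValueError(f"unsupported type: {typ}")
        if nullable:
            total += 1                  # NULL flag byte
    return total

# All three columns of (user_id BIGINT, status INT, created_at DATETIME) used:
print(key_len([("BIGINT", None, False), ("INT", None, False),
               ("DATETIME", None, False)]))   # 17
# A NULLable VARCHAR(50) utf8mb4 column: 50*4 + 2 + 1 = 203
print(key_len([("VARCHAR", 50, True)]))       # 203
```

If EXPLAIN reported key_len = 12 against that three-column index, only BIGINT + INT (8 + 4) were used and the range/sort on created_at fell outside the index.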

Q5: Does adding an index lock the table?

MySQL 8.0 Online DDL supports ALGORITHM=INPLACE, LOCK=NONE for most index operations -- no table lock. Always explicitly specify LOCK=NONE; if unsupported, MySQL will error rather than silently locking.

ALTER TABLE orders ADD INDEX idx_new (col1, col2), ALGORITHM=INPLACE, LOCK=NONE;

Q6: Why does EXPLAIN show possible_keys but key is NULL?

The optimizer's cost estimate concluded that a full table scan is cheaper than using the index. Common causes: (1) low index selectivity, so most rows would match anyway; (2) stale cardinality statistics -- run ANALYZE TABLE to refresh them. FORCE INDEX(idx) can confirm the diagnosis, but use it for debugging only, never in production code.

Q7: How do I handle index fragmentation?

After heavy random INSERT/DELETE, B+Tree page utilization drops. Check: DATA_FREE / (DATA_LENGTH + INDEX_LENGTH + DATA_FREE) from INFORMATION_SCHEMA.TABLES. If >20-30%, rebuild: ALTER TABLE t ENGINE=InnoDB (online) or OPTIMIZE TABLE t. Run during off-peak hours due to I/O impact.
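The fragmentation check is the same arithmetic as the Section 10.4 query; a small sketch applying it to sample statistics (the byte figures are made up for illustration):

```python
# Fragmentation percentage, mirroring the INFORMATION_SCHEMA.TABLES query:
# DATA_FREE / (DATA_LENGTH + INDEX_LENGTH + DATA_FREE) * 100
def frag_pct(data_len, index_len, data_free):
    return round(data_free / (data_len + index_len + data_free) * 100, 1)

def needs_rebuild(data_len, index_len, data_free, threshold=20.0):
    """Flag tables above the rebuild threshold (20-30% is a common cutoff)."""
    return frag_pct(data_len, index_len, data_free) > threshold

# e.g. 6 GiB data, 2 GiB indexes, 3 GiB reclaimable free space
gib = 2**30
print(frag_pct(6 * gib, 2 * gib, 3 * gib))   # 27.3 -> rebuild candidate
```

Note that DATA_FREE also counts pre-allocated tablespace extents, so treat the percentage as a trigger for investigation rather than an exact waste figure.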

Q8: How do I determine if an index is worth keeping?

Combine: (1) sys.schema_unused_indexes -- is it queried? (2) performance_schema -- read vs write ratio; (3) sys.schema_redundant_indexes -- is it covered by another? Before dropping, set it INVISIBLE for a week (MySQL 8.0+).

Q9: What are the differences between InnoDB and MyISAM indexes?

InnoDB uses a clustered index (data and PK index are one structure); secondary index leaves store PK values. MyISAM indexes are all non-clustered; leaves store physical row pointers. MyISAM secondary lookups don't need a second B+Tree traversal, but MyISAM lacks transactions, row locking, and foreign keys. Modern applications should use InnoDB exclusively.

Q10: What indexing features did MySQL 8.0 introduce?

Key features: (1) Descending indexes -- true DESC storage for mixed-direction ORDER BY; (2) Invisible indexes -- safely test dropping indexes; (3) Functional indexes -- index expressions, solving function-on-column failures; (4) Histogram statistics -- better cost estimation without indexes; (5) Hash Join -- dramatically faster indexless JOINs; (6) Index Skip Scan -- skip the leftmost column in certain scenarios.