Chapter 8

Query Optimizer Internals

MySQL Optimizer Internals: Deep Analysis

The MySQL query optimizer is the strategic decision-maker. Understanding its cost model, algorithms, and heuristics explains why queries execute the way they do and how to guide it toward better plans.

1. Optimizer Architecture


Parse Tree (AST)
     ↓ PREPROCESSING
     ├─ sql_select.cc: simplify expressions
     ├─ Remove constant folding (1+1 → 2)
     ├─ Flatten subqueries where possible
     └─ Resolve column references
     ↓ LOGICAL TRANSFORMATION
     ├─ Predicate pushdown (move WHERE into JOINs)
     ├─ IN to EXISTS conversion
     ├─ Derive filter conditions from constants
     └─ Remove redundant conditions
     ↓ COST ESTIMATION
     ├─ Analyze table statistics (ANALYZE TABLE)
     ├─ Build histogram of column value distribution (MySQL 8.0+)
     ├─ Enumerate possible access methods
     │  ├─ Full table scan
     │  ├─ Index range scan
     │  ├─ Index unique lookup
     │  └─ Index merge
     ├─ Enumerate JOIN orders
     ├─ Calculate cost of each combination
     └─ Cache costs for reuse
     ↓ PLAN SELECTION
     ├─ Choose lowest-cost plan
     ├─ Apply optimizer hints if provided
     └─ Generate execution plan (QEP)
     ↓ EXECUTION PLAN (Ready for executor)

2. Cost Model


MySQL Cost Model (MySQL 5.7+):

Disk I/O Cost:
  io_block_read_cost = 1.0  (default)

  Cost to read one block = 1.0
  If table size = 1000 blocks, full scan cost = 1000

Memory Access Cost:
  memory_block_read_cost = 0.25  (default)

  Cost much lower (already in memory)
  Encourages using cached pages

Row Evaluation Cost:
  row_evaluate_cost = 0.1  (default)

  Cost to examine one row
  Predicate evaluation, expression evaluation

Key Comparison Cost:
  key_compare_cost = 0.05  (default)

  Cost to compare against index key
  Lower than row evaluation

EXAMPLE: SELECT * FROM users WHERE id = 1

Plan A: Full Table Scan
├─ Blocks to read: 100
├─ Rows to evaluate: 10,000
├─ Cost = (100 × 1.0) + (10,000 × 0.1) = 1,100

Plan B: Index Lookup on id (PRIMARY)
├─ Index blocks: 1
├─ Key comparisons: log(10,000) ≈ 14
├─ Row lookups: 1
├─ Cost = (1 × 1.0) + (14 × 0.05) + (1 × 0.1) = 1.8

Optimizer chooses Plan B (lower cost)

CUSTOMIZING COST MODEL:

mysql> SELECT * FROM mysql.server_cost;
cost_name                          cost_value
disk_temptable_create_cost         20
disk_temptable_row_cost            0.5
...

-- Adjust for SSD (lower I/O cost)
UPDATE mysql.server_cost
SET cost_value = 0.1
WHERE cost_name = 'io_block_read_cost';

FLUSH OPTIMIZER_COSTS;

3. Table Statistics

3.1 How MySQL Estimates Cardinality


HISTOGRAM (MySQL 8.0+):

Most accurate: Column value distribution

CREATE TABLE users (
  id INT,
  country VARCHAR(2),
  balance DECIMAL(10,2)
);

ANALYZE TABLE users UPDATE HISTOGRAM ON country;

-- See histogram
SELECT SCHEMA_NAME, TABLE_NAME, COLUMN_NAME, HISTOGRAM
FROM INFORMATION_SCHEMA.COLUMN_STATISTICS
WHERE TABLE_NAME = 'users' AND COLUMN_NAME = 'country';

Histogram shows:
- Frequent countries: 'US' (5%), 'CN' (4%), 'IN' (3%), etc.
- Singleton buckets for common values
- Range buckets for uncommon values

SELECTIVITY ESTIMATION:

WHERE country = 'US'
├─ Query: 1,000,000 rows × 5% = 50,000 rows (estimated)
├─ Cost of filter: lower (fewer rows to scan)
└─ May enable different query plan

STATISTICS STALENESS:

-- Check when last ANALYZE
SELECT TABLE_NAME, UPDATE_TIME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'mydb';

-- Auto statistics (MySQL 8.0+)
SET GLOBAL innodb_stats_persistent = ON;
SET GLOBAL innodb_stats_auto_recalc = ON;

-- Manual statistics
ANALYZE TABLE users;

3.2 Access Method Selection


For WHERE country = 'US', optimizer evaluates:

1. FULL TABLE SCAN
   Cost = blocks × io_block_read_cost + rows × row_evaluate_cost
   Cost = 100 + 100,000 × 0.1 = 10,100

2. INDEX RANGE SCAN (if index exists on country)
   With histogram: knows ~50,000 matching rows
   Cost = 10 + (log(1000) × 0.05) + (50,000/100 × 1.0) = 560

3. INDEX MERGE (if multiple indexes exist)
   Cost = cost(index1) + cost(index2) + merge_cost

Optimizer chooses LOWEST COST plan

SARGABLE vs NON-SARGABLE:

SARGABLE (can use index):
WHERE id = 5                      # Simple equality
WHERE id > 10 AND id = '2024-01-01'        # Range comparison

NOT SARGABLE (full scan):
WHERE YEAR(date) = 2024          # Function on column
WHERE id + 1 = 100               # Expression on column
WHERE CONCAT(first, last) = 'John Doe'
WHERE LOWER(country) = 'us'

-- Instead do this:
SELECT * FROM users
WHERE created_at >= '2024-01-01'
  AND created_at < '2025-01-01';  -- SARGABLE

4. JOIN Order Optimization


Problem: With n tables, there are (n-1)! possible JOIN orders

SELECT * FROM users u
JOIN orders o ON u.id = o.user_id
JOIN products p ON o.product_id = p.id
WHERE u.country = 'US';

Possible orders (3! = 6):
1. users → orders → products
2. users → products → orders
3. orders → users → products
4. orders → products → users
5. products → users → orders
6. products → orders → users

Each has different cost!

OPTIMIZER APPROACH:

mysql> SET optimizer_search_depth = 62;  # Default
# For small n, try all (n-1)! orders
# For large n, use heuristics (limit search depth)

With 5 tables:
├─ 4! = 24 plans (exhaustive search)

With 10 tables:
├─ 9! = 362,880 plans (too many!)
├─ optimizer_search_depth limits expansion

GREEDY HEURISTIC:

1. Choose first table with best cost
2. For next table, choose one that reduces result set most
3. Continue until all tables added

Example:
- users (1M rows, 50K after WHERE filter)
- orders (10M rows total, 500K match user_ids)
- products (100K rows)

Greedy order:
1. Start with users (smallest filtered set: 50K)
2. JOIN orders (each user has ~10 orders, total ~500K rows)
3. JOIN products (one product per order, keeps 500K rows)

5. Subquery Optimization


SUBQUERY TRANSFORMATION:

Original: IN subquery
SELECT * FROM orders o
WHERE o.user_id IN (
  SELECT id FROM users WHERE country = 'US'
);

Problem: Subquery executed for each order
Cost: O(orders_count × subquery_cost)

Transformation 1: Convert to JOIN
SELECT DISTINCT o.* FROM orders o
JOIN users u ON o.user_id = u.id
WHERE u.country = 'US';

SUBQUERY MATERIALIZATION (MySQL 5.6+):

Strategy 1: DEPENDENT SUBQUERY
├─ For each order row
├─ Execute subquery checking if user_id exists
└─ Cost: HIGH (subquery runs many times)

Strategy 2: MATERIALIZED
├─ Execute subquery once
├─ Store results in temporary table
├─ For each order, check temp table
└─ Cost: LOWER (subquery runs once)

CONTROLLING SUBQUERY STRATEGY:

-- Force materialization
SELECT /*+ SUBQUERY(MATERIALIZE) */ * FROM orders o
WHERE o.user_id IN (SELECT id FROM users WHERE country = 'US');

-- Force dependent subquery
SELECT /*+ SUBQUERY(INTOEXISTS) */ * FROM orders o
WHERE o.user_id IN (SELECT id FROM users WHERE country = 'US');

6. Histogram and Distribution Analysis


HISTOGRAMS (MySQL 8.0+):

Create histogram on skewed column:
ANALYZE TABLE orders UPDATE HISTOGRAM ON status USING 100 BUCKETS;

View histogram:
SELECT *
FROM INFORMATION_SCHEMA.COLUMN_STATISTICS
WHERE TABLE_NAME = 'orders' AND COLUMN_NAME = 'status';

Example histogram data:
status        | Frequency
pending       | 45% (hot value)
processing    | 30%
completed     | 20%
cancelled     | 5%

When optimizer sees:
WHERE status = 'pending'
├─ Histogram shows 45% of rows
├─ Estimates: 1,000,000 × 45% = 450,000 rows
├─ Calculates better join order
└─ May choose different index strategy

CREATE HISTOGRAM:

-- Singleton buckets (for every distinct value)
ANALYZE TABLE users UPDATE HISTOGRAM ON country USING 1000 BUCKETS;

-- Range buckets (for numeric ranges)
ANALYZE TABLE orders UPDATE HISTOGRAM ON amount USING 100 BUCKETS;

DELETE HISTOGRAM:
ANALYZE TABLE orders DROP HISTOGRAM ON status;

7. EXPLAIN and Optimizer Output


BASIC EXPLAIN:
mysql> EXPLAIN SELECT * FROM users WHERE country = 'US'\G

key_len: 10 (VARCHAR(10) = 10 bytes used from index)
rows: 50000 (estimated rows before filtering)
filtered: 45 (percentage of rows passing WHERE)

Extra:
├─ Using index condition: filter applied at index level
├─ Using index: query served entirely from index (covering index)
├─ Using filesort: ORDER BY requires separate sorting
├─ Using temporary: GROUP BY requires temporary table
└─ Using join buffer: join rows buffered for better I/O

JSON FORMAT (MySQL 5.7+):
EXPLAIN FORMAT=JSON SELECT ...\G
{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "500.50"
    },
    "table": {
      "table_name": "users",
      "access_type": "range",
      "possible_keys": ["idx_country"],
      "key": "idx_country",
      "rows_examined_per_scan": 50000,
      "filtered": 45.0
    }
  }
}

ANALYZING EXPLAIN:
1. Check "type" column
   ├─ ALL: Full table scan (bad!)
   ├─ range: Index range scan (good)
   ├─ ref: Index lookup (very good)
   └─ const: Single row (best)

2. Check "key" column
   ├─ NULL: No index used (often wrong)
   └─ idx_something: Using that index

3. Check "rows" column
   ├─ If too high compared to actual filtered rows
   └─ Statistics outdated: ANALYZE TABLE needed

4. Check "Extra" column
   ├─ No "Using filesort" = good
   ├─ "Using index" = excellent (covering index)
   └─ "Using temporary" = may be slow

8. Optimizer Hints


Use hints to override optimizer when it's wrong:

INDEX HINT (MySQL 5.0+):
SELECT * FROM users USE INDEX (idx_country)
WHERE country = 'US';

Optimizer hint (MySQL 5.7.7+):
SELECT /*+ INDEX(users idx_country) */ * FROM users
WHERE country = 'US';

JOIN_ORDER HINT:
SELECT /*+ JOIN_ORDER(o, u, p) */ * FROM orders o
JOIN users u USING (user_id)
JOIN products p USING (product_id);
-- Forces: orders first, then users, then products

MAX_EXECUTION_TIME:
SELECT /*+ MAX_EXECUTION_TIME(1000) */ * FROM big_table;
-- Kill query if exceeds 1000ms

SET_VAR:
SELECT /*+ SET_VAR(optimizer_search_depth = 1) */ * FROM ...;

NO_INDEX_MERGE:
SELECT /*+ NO_INDEX_MERGE(users) */ * FROM users WHERE ...;
-- Disables index merge

WHEN TO USE HINTS:
✓ Statistics very stale (can't run ANALYZE)
✓ Optimizer consistently wrong despite statistics
✓ Specific query requires special handling
✓ Testing/debugging

✗ Routine optimization (fix with ANALYZE TABLE instead)
✗ Hiding underlying problems (use hints temporarily)
✗ Hints become tech debt (document why!)

9. Key Optimization Takeaways


1. ANALYZE TABLE regularly
   └─ Ensures statistics are current

2. Use histograms (MySQL 8.0+)
   └─ Improves estimates for skewed columns

3. Write SARGABLE queries
   └─ Avoid functions on columns
   └─ Use expressions on constants

4. Monitor EXPLAIN output
   └─ Check for unexpected full scans

5. Create indexes strategically
   └─ High selectivity columns first
   └─ Consider column order for composite indexes

6. Keep statistics fresh
   └─ Schedule regular ANALYZE TABLE
   └─ Auto-update if possible

7. Understand your data
   └─ Know column distributions
   └─ Expect skewed data

8. Test before production
   └─ Run EXPLAIN on queries
   └─ Use EXPLAIN ANALYZE
   └─ Benchmark different plans

9. Use hints sparingly
   └─ Document why hint needed
   └─ Revisit when stats updated

10. Watch optimizer evolution
    └─ MySQL 8.0+ has hypergraph optimizer
    └─ Behavior may change between versions
    └─ Test after upgrades

Conclusion

The MySQL optimizer makes decisions based on statistics, cost models, and heuristics. Understanding these internals lets you write queries that the optimizer understands, diagnose why plans are suboptimal, and guide the optimizer toward better decisions through indexes, statistics, and hints.

Rate this chapter
4.8  / 5  (53 ratings)

💬 Comments