Chapter 2: InnoDB Storage Engine Internals Deep Dive
InnoDB is MySQL's default and most critical storage engine. This chapter delivers a source-code-level deep dive: the architecture overview, Buffer Pool page structure, Redo Log WAL mechanics, Undo Log and MVCC version chains, Doublewrite Buffer torn-page protection, Change Buffer random-I/O optimization, the Adaptive Hash Index, tablespace layout, row format comparison, and finally flush strategies and production monitoring. Every topic includes ASCII architecture diagrams, MySQL source file references, key parameter tuning advice, and real-world incident case studies.
1. InnoDB Architecture Overview
1.1 Architecture Diagram
InnoDB's internal architecture divides into in-memory structures and on-disk structures. The ASCII diagram below shows every core component and data flow:
Source ref: storage/innobase/include/buf0buf.h, storage/innobase/srv/srv0srv.cc
1.2 Query Data Flow
When a SELECT reaches InnoDB, the data flow is:
1.3 Write Data Flow
When an INSERT / UPDATE / DELETE executes, InnoDB's write path is considerably more complex than reads:
1.4 In-Memory vs On-Disk Structures
| Component | Location | Purpose | Key Parameter |
|---|---|---|---|
| Buffer Pool | Memory | Cache data and index pages to reduce disk I/O | innodb_buffer_pool_size |
| Log Buffer | Memory | Temporarily hold redo log records before flushing | innodb_log_buffer_size |
| Adaptive Hash Index | Memory | Auto-build hash indexes on frequently accessed B+Tree pages | innodb_adaptive_hash_index |
| Change Buffer | Memory + Disk | Cache secondary index changes, defer merge | innodb_change_buffer_max_size |
| Redo Log | Disk | Crash recovery, guarantee durability (D in ACID) | innodb_redo_log_capacity |
| Undo Log | Disk (+cached in Buffer Pool) | Transaction rollback + MVCC versioned reads | innodb_undo_tablespaces |
| Doublewrite Buffer | Disk | Prevent partial page writes (torn pages) | innodb_doublewrite |
| System Tablespace | Disk | Data dictionary, change buffer persistence | innodb_data_file_path |
2. Buffer Pool Deep Dive
The Buffer Pool is InnoDB's most critical in-memory structure. Understanding how it works is foundational to understanding InnoDB performance.
2.1 Page Structure (16KB)
All data in InnoDB is stored in 16KB pages. Each page's internal structure is:
Source ref: storage/innobase/include/fil0types.h, storage/innobase/include/page0page.h
The page size is controlled by innodb_page_size (options: 4K/8K/16K/32K/64K, settable only when the instance is initialized). 16KB is a carefully chosen trade-off: too small increases B+Tree depth (more disk seeks), too large wastes memory and increases intra-page fragmentation. For most OLTP workloads, 16KB is optimal.
2.2 LRU Algorithm: Dual-Sublist Design
InnoDB's Buffer Pool does not use a standard LRU (Least Recently Used) algorithm. Instead it employs a modified two-segment LRU to prevent full table scans from evicting hot data.
Why Two Sublists?
Consider a typical production scenario: your Buffer Pool caches the last few hours of hot data (user table, order table index pages). Suddenly someone runs a SELECT * FROM large_table (full table scan), reading hundreds of thousands of data pages. With a standard LRU, these "cold" pages that will never be accessed again push out all the truly "hot" pages, causing subsequent normal queries to hit disk. This is called "Buffer Pool pollution."
InnoDB's solution:
- When a new page is read from disk, it is inserted at the head of the Old sublist (midpoint insertion), not the head of the entire LRU.
- Only when the page is accessed again after staying in the Old sublist for more than innodb_old_blocks_time (default 1000ms) is it promoted to the Young sublist.
- Full table scan pages are typically read and discarded within 1000ms, never entering the Young sublist, thus protecting hot data.
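The midpoint-insertion behavior can be sketched as a toy model (class and parameter names here are invented for illustration; real InnoDB tracks per-block access times with more heuristics):

```python
from collections import OrderedDict

class MidpointLRU:
    """Toy two-sublist LRU: new pages enter the Old sublist; a page is
    promoted to Young only if re-accessed after old_blocks_time seconds."""
    def __init__(self, capacity, old_blocks_time=1.0):
        self.capacity = capacity
        self.old_blocks_time = old_blocks_time
        self.young = OrderedDict()   # page_id -> last access time
        self.old = OrderedDict()     # page_id -> first access time

    def access(self, page_id, now):
        if page_id in self.young:
            self.young[page_id] = now
            self.young.move_to_end(page_id, last=False)    # already hot
        elif page_id in self.old:
            if now - self.old[page_id] >= self.old_blocks_time:
                del self.old[page_id]              # survived long enough:
                self.young[page_id] = now          # promote to Young head
                self.young.move_to_end(page_id, last=False)
        else:
            self.old[page_id] = now                # midpoint insertion
            self.old.move_to_end(page_id, last=False)
            while len(self.young) + len(self.old) > self.capacity:
                tail = self.old if self.old else self.young
                tail.popitem()                     # evict from the tail

pool = MidpointLRU(capacity=10)
pool.access("hot_page", now=0.0)
pool.access("hot_page", now=2.0)                   # re-accessed after 1s: promoted
for i in range(100):                               # a "full table scan"
    pool.access(f"scan_page_{i}", now=3.0)
assert "hot_page" in pool.young                    # hot data survived the scan
```

The scan pages above are each touched once, so none qualifies for promotion, and they evict only each other from the Old tail.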
Key Parameters
| Parameter | Default | Description |
|---|---|---|
| innodb_old_blocks_pct | 37 | Percentage of the Buffer Pool allocated to the Old sublist. Default 37% (~3/8). Range: 5-95. |
| innodb_old_blocks_time | 1000 | Milliseconds a page must stay in the Old sublist before it can be promoted to Young. Set to 0 to disable this protection. |
2.3 Buffer Pool Instances
When the Buffer Pool is large (typically 32GB-128GB in production), a single Buffer Pool mutex becomes a concurrency bottleneck. InnoDB allows splitting the Buffer Pool into multiple independent instances, each with its own LRU list, Free list, Flush list, and mutex.
| Parameter | Default | Recommended |
|---|---|---|
| innodb_buffer_pool_instances | 8 (when pool >= 1GB) | At least 1GB per instance; a 128GB pool → 16 instances. |
| innodb_buffer_pool_chunk_size | 128MB | Unit for online pool resizing. pool_size should be a multiple of chunk_size * instances. |
Note: when innodb_buffer_pool_size < 1GB, innodb_buffer_pool_instances is forced to 1 regardless of your setting, because small Buffer Pools don't suffer from mutex contention.
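Each page belongs to exactly one instance, chosen deterministically from its identity, so a given page always contends on the same instance's mutex and no cross-instance coordination is needed on lookup. A minimal sketch (the hash function here is an invented stand-in; the real mapping is internal to the buffer pool code):

```python
def buffer_pool_instance(space_id: int, page_no: int, n_instances: int) -> int:
    """Map a page to a Buffer Pool instance. Deterministic, so every
    lookup for the same page lands on the same instance."""
    return (space_id * 1_000_003 + page_no) % n_instances  # toy hash

# The same page always maps to the same instance:
a = buffer_pool_instance(5, 42, 16)
b = buffer_pool_instance(5, 42, 16)
assert a == b and 0 <= a < 16
```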
2.4 Buffer Pool Warmup & Dump/Load
After a MySQL restart the Buffer Pool is empty (cold start) and query performance gradually recovers over 30-60 minutes. InnoDB provides automatic dump/load to accelerate warmup:
The dump file only saves (tablespace_id, page_number) pairs, not actual data. At startup InnoDB pre-reads pages from disk using this list. This is why even a 128GB Buffer Pool dump file is typically only a few MB.
2.5 Monitoring the Buffer Pool
1. Buffer pool hit rate: Should stay at 999/1000 or above. Below this suggests the pool is too small or being polluted by large queries.
2. Free buffers: If persistently 0, the pool is full and every new page requires eviction.
3. Modified db pages: If continuously growing, flushing cannot keep up with writes.
4. Pages made young / not young: "not young" far exceeding "young" means the innodb_old_blocks_time protection is working.
Source ref: storage/innobase/buf/buf0buf.cc → buf_page_get_gen(), buf_page_make_young()
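For metric 1, the hit rate can be computed from the Innodb_buffer_pool_read_requests (logical reads) and Innodb_buffer_pool_reads (reads that missed and went to disk) status counters. Sample the counters twice and use the deltas, since the absolute values cover the whole uptime:

```python
def buffer_pool_hit_rate(read_requests_delta: int, disk_reads_delta: int) -> float:
    """Hit rate over a sampling interval: 1 - misses / logical reads."""
    if read_requests_delta == 0:
        return 1.0          # no reads in the interval: treat as perfect
    return 1.0 - disk_reads_delta / read_requests_delta

# 1,000,000 logical reads in the interval, 800 of which went to disk
rate = buffer_pool_hit_rate(1_000_000, 800)
assert rate > 0.999         # meets the 999/1000 guideline above
```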
2.6 Buffer Pool Size Calculator
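A rule-of-thumb calculator can be sketched as follows (a sketch assuming a dedicated database server and the common ~70%-of-RAM guideline; InnoDB itself rounds the configured value to a multiple of chunk_size * instances, which the rounding below mirrors):

```python
CHUNK_SIZE = 128 * 1024 ** 2      # innodb_buffer_pool_chunk_size default (128MB)

def suggest_buffer_pool_size(ram_bytes: int, instances: int, pct: float = 0.70) -> int:
    """Suggest innodb_buffer_pool_size: ~70% of RAM on a dedicated server,
    rounded down to a multiple of chunk_size * instances."""
    unit = CHUNK_SIZE * instances
    target = int(ram_bytes * pct)
    return max(unit, target - target % unit)

# 64GB server with 16 instances -> 44GB (a multiple of 128MB * 16 = 2GB)
size = suggest_buffer_pool_size(64 * 1024 ** 3, instances=16)
assert size == 44 * 1024 ** 3
```

For a 64GB server with 16 instances this yields 44GB, which matches the cheat sheet in section 11.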
2.7 Buffer Pool Internal Lists
The Buffer Pool maintains three critical internal lists. Understanding them is essential for troubleshooting performance issues:
2.8 Page Types
Pages cached in the Buffer Pool are not just data pages. InnoDB defines multiple page types:
| FIL_PAGE_TYPE | Value | Description |
|---|---|---|
| FIL_PAGE_INDEX | 0x45BF | B+Tree index page (clustered or secondary, leaf or internal) |
| FIL_PAGE_RTREE | 0x45BE | R-Tree index page (spatial index) |
| FIL_PAGE_SDI | 0x45BD | Serialized Dictionary Information page (MySQL 8.0) |
| FIL_PAGE_UNDO_LOG | 0x0002 | Undo Log page |
| FIL_PAGE_INODE | 0x0003 | Segment information node page |
| FIL_PAGE_IBUF_FREE_LIST | 0x0004 | Change Buffer free list page |
| FIL_PAGE_TYPE_ALLOCATED | 0x0000 | Allocated but not yet used page |
| FIL_PAGE_TYPE_SYS | 0x0006 | System page (e.g., transaction system page) |
| FIL_PAGE_TYPE_TRX_SYS | 0x0007 | Transaction system data page |
| FIL_PAGE_TYPE_FSP_HDR | 0x0008 | File Space Header page |
| FIL_PAGE_TYPE_XDES | 0x0009 | Extent Descriptor page |
| FIL_PAGE_TYPE_BLOB | 0x000A | Overflow page (stores large column data) |
Source ref: storage/innobase/include/fil0fil.h → enum page_type_t
2.9 Buffer Pool Read-Ahead
InnoDB has two read-ahead mechanisms to reduce I/O wait time:
Linear Read-Ahead
When the number of sequentially accessed pages in an extent (64 contiguous pages = 1MB) exceeds innodb_read_ahead_threshold (default 56), InnoDB asynchronously pre-reads all pages in the next extent. Suits sequential access patterns like full table scans.
Random Read-Ahead
When more than 13 pages from an extent are in the Buffer Pool's Young sublist, InnoDB pre-reads the remaining pages in that extent. Disabled by default (innodb_random_read_ahead=OFF), as it rarely helps in most workloads.
2.10 Buffer Pool Mutex & Latch Contention
Under high concurrency (>1000 QPS), Buffer Pool internal mutexes can become bottlenecks. Key mutexes include:
| Mutex / Latch | Protected Resource | Contention Symptom | Solution |
|---|---|---|---|
| buf_pool->mutex | Buffer Pool instance global lock | Broad contention under highly concurrent reads/writes | Increase innodb_buffer_pool_instances |
| buf_pool->LRU_list_mutex | LRU list | Contention during page eviction/promotion | Increase instances; optimize queries to reduce eviction |
| buf_pool->flush_list_mutex | Flush list | Contention under high write load | Increase instances and innodb_page_cleaners |
| hash_lock (page hash) | Page hash table (checks whether a page is in the pool) | Contention under extreme concurrent point lookups | Partitioned in MySQL 8.0; usually no action needed |
3. Redo Log (WAL) Mechanism
3.1 Write-Ahead Logging Principle
WAL (Write-Ahead Logging) is InnoDB's core mechanism for transaction durability (the D in ACID). The principle is simple: before modifying a data page, first record the modification operation in the Redo Log. If the system crashes, all committed transaction modifications can be recovered by replaying the Redo Log.
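The principle fits in a few lines (a minimal sketch for illustration, not InnoDB's record format or file layout):

```python
import json, os, tempfile

class WalSketch:
    """Minimal WAL sketch: the log record is appended and fsync'd BEFORE
    the in-memory page changes, so a crash can never lose an
    acknowledged modification."""
    def __init__(self, path):
        self.log = open(path, "a+", encoding="utf-8")
        self.pages = {}                      # stand-in for the Buffer Pool

    def modify(self, page_no, value):
        self.log.write(json.dumps({"page": page_no, "value": value}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())          # durability point: log first...
        self.pages[page_no] = value          # ...then mutate the cached page

    def recover(self):
        self.log.seek(0)
        for line in self.log:                # replay in append (LSN) order
            rec = json.loads(line)
            self.pages[rec["page"]] = rec["value"]

with tempfile.TemporaryDirectory() as tmp:
    wal = WalSketch(os.path.join(tmp, "redo.log"))
    wal.modify(7, "row v2")
    wal.pages.clear()                        # simulated crash: memory lost
    wal.recover()                            # replay committed changes
    assert wal.pages[7] == "row v2"
    wal.log.close()
```

The key ordering is the two lines in modify(): the fsync of the log happens before the page mutation becomes the only copy of the change.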
3.2 Redo Log File Structure
Before MySQL 8.0.30, the Redo Log consisted of a fixed number of files (default 2), written in a circular fashion:
3.3 LSN (Log Sequence Number)
LSN is one of InnoDB's most important internal concepts. It is a monotonically increasing 64-bit integer representing the total bytes written to the Redo Log (counted from InnoDB initialization).
| LSN Type | Meaning | Source |
|---|---|---|
| Log sequence number | Current latest LSN (position written to the Log Buffer) | SHOW ENGINE INNODB STATUS |
| Log flushed up to | LSN flushed to the redo log files on disk | SHOW ENGINE INNODB STATUS |
| Pages flushed up to | Dirty pages have been flushed up to this LSN (oldest dirty page LSN in the flush list) | SHOW ENGINE INNODB STATUS |
| Last checkpoint at | LSN of the last checkpoint. Crash recovery replays from here. | SHOW ENGINE INNODB STATUS |
3.4 Checkpoint Mechanism
A checkpoint flushes dirty pages from the Buffer Pool to disk and advances the checkpoint LSN. It serves two purposes:
- Free redo log space: redo log space before the checkpoint LSN can be reused
- Shorten crash recovery time: recovery only needs to replay redo after the checkpoint LSN
Sharp Checkpoint
Flushes all dirty pages at once. Only used during database shutdown (innodb_fast_shutdown=0). It blocks all operations during the flush and is never used at runtime in production.
Fuzzy Checkpoint
Flushes dirty pages incrementally and asynchronously. InnoDB uses Fuzzy Checkpoint at runtime, including LRU flushing, Flush List flushing, and Async/Sync flushing (when redo log nears capacity).
3.5 innodb_flush_log_at_trx_commit: Durability vs Performance
This is arguably one of InnoDB's most important parameters. It controls the redo log flush strategy at transaction commit:
| Value | Performance | Data Safety | Use Case |
|---|---|---|---|
| 1 | Slowest (fsync on every commit) | Safest (zero data loss) | Primary, financial systems, any scenario that cannot lose data |
| 2 | Faster (write to OS cache per commit; OS crash may lose ~1 second) | Safe on a MySQL crash; unsafe on OS crash or power failure | Replicas, bulk imports, scenarios tolerating minor loss |
| 0 | Fastest (write and fsync only about once per second) | Any crash may lose ~1 second | Testing, non-critical data |
Real-world incident: a team ran innodb_flush_log_at_trx_commit=0 on a replica. One day a UPS failure power-cycled the replica. After restart, the replica's GTID position was ~3 seconds behind what the primary had recorded, breaking replication, and the replica had to be rebuilt. Lesson: even for replicas, use 2 instead of 0 when replication consistency matters.
3.6 Redo Log Size Tuning
A redo log that is too small causes frequent checkpoints (redo space cannot be reused until the corresponding dirty pages have been flushed), while one that is too large extends crash recovery time.
| Parameter | Version | Default | Recommended |
|---|---|---|---|
| innodb_log_file_size | < 8.0.30 | 48MB | 1GB-4GB (high write workloads) |
| innodb_log_files_in_group | < 8.0.30 | 2 | 2 |
| innodb_redo_log_capacity | ≥ 8.0.30 | 100MB | 2GB-8GB (replaces the two above) |
Source ref: storage/innobase/log/log0log.cc, storage/innobase/log/log0chkp.cc
3.7 Group Commit & Binlog Coordination
When multiple transactions commit simultaneously, InnoDB uses Group Commit optimization, merging multiple transactions' redo log writes into a single fsync. With binlog enabled, InnoDB uses two-phase commit (2PC) to ensure redo log and binlog consistency:
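The batching idea can be sketched as a toy model (invented names; real InnoDB/binlog group commit adds leader election and separate flush/sync/commit stages):

```python
class GroupCommitSketch:
    """Toy model of group commit: transactions queue their redo, and one
    leader makes the whole batch durable with a single fsync."""
    def __init__(self):
        self.fsync_calls = 0
        self.queue = []              # transactions waiting for durability

    def enqueue(self, trx_id):
        self.queue.append(trx_id)

    def flush_group(self):
        """Leader path: commit the entire queued batch with one fsync."""
        batch, self.queue = self.queue, []
        if batch:
            self.fsync_calls += 1    # one fsync covers every trx in the batch
        return batch

grp = GroupCommitSketch()
for trx in range(100):               # 100 concurrent commits arrive
    grp.enqueue(trx)
committed = grp.flush_group()
assert len(committed) == 100 and grp.fsync_calls == 1
```

This is why commit throughput can far exceed the disk's fsync rate: the cost of one fsync is amortized across every transaction in the group.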
3.8 Redo Log Record Types
The Redo Log records multiple types of page modification operations. Each record contains a type identifier, tablespace ID, page number, and specific modification content. Major types include:
| Record Type | Description | Triggered By |
|---|---|---|
| MLOG_1BYTE / 2BYTE / 4BYTE / 8BYTE | Write 1/2/4/8 bytes at a specific page offset | Page header field updates, etc. |
| MLOG_WRITE_STRING | Write a variable-length byte string to a page | Data changes |
| MLOG_REC_INSERT | Insert a compact-format record | INSERT |
| MLOG_REC_UPDATE_IN_PLACE | In-place record update (same length) | UPDATE (when field lengths are unchanged) |
| MLOG_REC_DELETE | Delete a record (mark as deleted) | DELETE |
| MLOG_PAGE_CREATE | Create a new index page | Page split, etc. |
| MLOG_UNDO_INSERT | Insert an undo record into an undo page | DML operations generating undo |
| MLOG_COMP_PAGE_CREATE | Create a compact-format page | Tablespace page allocation |
| MLOG_MULTI_REC_END | Mini-transaction end marker | At the end of each mtr |
Source ref: storage/innobase/include/mtr0types.h
3.9 Mini-Transaction (mtr) Concept
A Mini-Transaction is InnoDB's internal atomic operation unit, distinct from user-level transactions. A user transaction contains multiple mtrs, and each mtr's writes to the redo log are atomic. For example, a B+Tree page split may involve modifying 3 pages; these 3 modifications must be written to the redo log as a unit: all or nothing.
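The all-or-nothing property can be sketched as a toy model (invented class names; the end marker mirrors MLOG_MULTI_REC_END from section 3.8):

```python
class MiniTransaction:
    """Sketch of mtr redo buffering: records accumulate in a private
    buffer and are appended to the shared log all-or-nothing,
    terminated by an end marker."""
    def __init__(self, redo_log: list):
        self.redo_log = redo_log
        self.records = []

    def log(self, record):
        self.records.append(record)          # private buffer, not yet visible

    def commit(self):
        if self.records:
            self.redo_log.extend(self.records + ["MLOG_MULTI_REC_END"])
        self.records = []

redo = []
mtr = MiniTransaction(redo)
# A page split touching 3 pages: either all 3 records reach the log, or none.
for rec in ("modify page A", "modify page B", "modify page C"):
    mtr.log(rec)
assert redo == []                            # nothing visible before commit
mtr.commit()
assert redo[-1] == "MLOG_MULTI_REC_END" and len(redo) == 4
```

During crash recovery, records after the last end marker are discarded, which is exactly how a half-written mtr is rolled away.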
3.10 Crash Recovery Process
When MySQL restarts after an abnormal shutdown, InnoDB crash recovery proceeds as follows:
1. Read the last checkpoint LSN from the redo log.
2. Scan the redo log forward from the checkpoint LSN, parsing and applying records to rebuild all page changes up to the crash point (redo phase).
3. Roll back transactions that were still uncommitted at crash time using their undo logs (undo phase); this phase can continue in the background after the server starts accepting connections.
4. Undo Log & MVCC
4.1 Undo Log's Dual Mission
The Undo Log serves two critical roles in InnoDB:
- Transaction Rollback: When a transaction executes ROLLBACK, the old values recorded in the Undo Log restore data to the state before the transaction began.
- MVCC: When other transactions need to read a historical version of a row, the version chain in the Undo Log locates the version visible to that transaction.
4.2 Undo Log Storage Structure
Each transaction needs 1 to 2 Undo Logs depending on operation type:
- INSERT Undo Log: Only records the primary key of newly inserted rows; rollback deletes them. Can be released immediately after commit.
- UPDATE Undo Log: Records old values of modified/deleted rows. Cannot be released immediately after commit: it must wait until all transactions that might read the old version finish, after which purge threads clean it up.
4.3 MVCC Version Chain
Every row in InnoDB has two hidden columns:
- DB_TRX_ID (6 bytes): transaction ID of the last transaction that modified this row
- DB_ROLL_PTR (7 bytes): pointer to the previous version of this row in the Undo Log
4.4 ReadView in Detail
ReadView is the core data structure for InnoDB's MVCC consistent reads. When a transaction performs a consistent read (non-locking read), InnoDB creates a ReadView containing:
| Field | Meaning |
|---|---|
| m_creator_trx_id | Transaction ID of the transaction that created this ReadView |
| m_low_limit_id | Next transaction ID to be assigned at ReadView creation (i.e., max_trx_id at that point). Transactions with ID >= this are definitely invisible. |
| m_up_limit_id | Minimum transaction ID in the active list at ReadView creation. Transactions with ID < this are definitely committed and visible. |
| m_ids | List of all active (uncommitted) transaction IDs at ReadView creation. |
Source ref: storage/innobase/include/read0types.h → class ReadView, storage/innobase/row/row0sel.cc
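The visibility rules implied by these fields can be written out directly. A sketch (parameter names mirror the ReadView fields above; the logic follows the table's rules):

```python
def changes_visible(trx_id, creator_trx_id, up_limit_id, low_limit_id, active_ids):
    """Decide whether a row version written by trx_id is visible
    to the ReadView described by the other arguments."""
    if trx_id == creator_trx_id:
        return True                     # our own uncommitted changes
    if trx_id < up_limit_id:
        return True                     # committed before the view was created
    if trx_id >= low_limit_id:
        return False                    # started after the view was created
    return trx_id not in active_ids     # in between: visible iff committed

# View created while trx 90 and 95 were still active; next id to assign: 100
view = dict(creator_trx_id=97, up_limit_id=90, low_limit_id=100,
            active_ids={90, 95})
assert changes_visible(80, **view)       # old committed transaction
assert not changes_visible(95, **view)   # was active, hence uncommitted
assert changes_visible(92, **view)       # between the limits but committed
assert not changes_visible(101, **view)  # started after view creation
```

When a version is not visible, InnoDB follows the row's DB_ROLL_PTR into the undo log and repeats this check on the older version, until a visible one is found.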
4.5 RR vs RC: When the ReadView Is Created
REPEATABLE READ (RR)
ReadView is created at the first SELECT in the transaction and reused throughout the transaction's lifetime. Thus reading the same row multiple times within the transaction always returns the same value, even if other transactions have committed modifications in the meantime.
READ COMMITTED (RC)
ReadView is recreated on every SELECT. Thus two reads of the same row within a transaction may return different values if another transaction committed a modification between the two reads (Non-Repeatable Read).
4.6 Dangers of Long Transactions
A SaaS application had an analytics query transaction that ran for 8 hours without committing. During that time, UPDATE Undo Logs from other transactions could not be purged (because that long transaction's ReadView still needed old versions), causing the undo tablespace to bloat from 200MB to 150GB. Disk space was exhausted and the database crashed.
Prevention:
1. Set innodb_undo_log_truncate=ON (default ON in MySQL 8.0)
2. Monitor information_schema.INNODB_TRX for long-uncommitted transactions
3. Use kill_idle_transaction (Percona Server) or periodically kill transactions that exceed a time threshold
4. Route analytics queries to replicas
4.7 Undo Configuration
| Parameter | Default | Description |
|---|---|---|
| innodb_undo_tablespaces | 2 | Number of separate undo tablespaces. MySQL 8.0 minimum is 2. |
| innodb_undo_log_truncate | ON (8.0) | Automatically reclaim undo tablespaces exceeding innodb_max_undo_log_size. |
| innodb_max_undo_log_size | 1GB | Size threshold that triggers undo tablespace truncation. |
| innodb_rollback_segments | 128 | Number of rollback segments per undo tablespace. |
| innodb_purge_threads | 4 | Background threads for purge operations. Increase to 8 for high-write workloads. |
Source ref: storage/innobase/trx/trx0trx.cc, storage/innobase/trx/trx0purge.cc
4.8 Purge Thread Internals
Purge is InnoDB's background cleanup operation, responsible for two tasks:
- Physically remove rows marked as "delete-marked" (DELETE only sets a logical deletion flag; purge actually removes the row from indexes)
- Clean up UPDATE undo log records no longer needed by any active transaction
4.9 MVCC and Secondary Indexes
MVCC implementation differs fundamentally between clustered indexes and secondary indexes:
Clustered Index
Each row has DB_TRX_ID and DB_ROLL_PTR, enabling direct version visibility checks and version chain traversal.
Secondary Index
Secondary index entries do not have DB_TRX_ID and DB_ROLL_PTR. InnoDB uses PAGE_MAX_TRX_ID in the page header (the maximum trx_id on that page) for a quick check: if PAGE_MAX_TRX_ID < ReadView's m_up_limit_id, all records on the page are visible without returning to the clustered index. Otherwise, it must go back to the clustered index to check each row's version visibility.
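The quick check is plain integer comparison; a sketch using the ReadView field name from section 4.4:

```python
def secondary_page_all_visible(page_max_trx_id: int, up_limit_id: int) -> bool:
    """If every transaction that ever modified this secondary index page
    committed before the view's up-limit, all records on the page are
    visible without a clustered-index lookup."""
    return page_max_trx_id < up_limit_id

assert secondary_page_all_visible(85, 90)        # cheap path, no PK lookup
assert not secondary_page_all_visible(95, 90)    # must go back to the PK
```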
5. Doublewrite Buffer
5.1 The Partial Page Write Problem (Torn Pages)
InnoDB's page size is 16KB, but the OS atomic write unit is typically 4KB (one filesystem block). A single 16KB page write is issued as four separate 4KB I/O operations. If a power failure or system crash occurs between these four I/O operations, only part of the page may have been written: this is a "torn page."
5.2 How Doublewrite Works
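The mechanism can be sketched as a toy model (invented class names; real InnoDB writes the doublewrite area in large sequential chunks and stores the checksum inside each page):

```python
import zlib

def page_checksum(data: bytes) -> int:
    return zlib.crc32(data)

class DoublewriteSketch:
    """Toy torn-page protection: a page is first written to the
    doublewrite area and fsync'd, then written in place. On recovery,
    a page whose content fails its checksum is restored from the
    doublewrite copy."""
    def __init__(self):
        self.dw_area = {}       # page_no -> (data, checksum)
        self.datafile = {}      # page_no -> (data, checksum)

    def flush_page(self, page_no, data: bytes, tear_final_write=False):
        self.dw_area[page_no] = (data, page_checksum(data))   # step 1 (+fsync)
        if tear_final_write:    # simulate a crash mid-way through step 2
            torn = data[: len(data) // 2] + b"\x00" * (len(data) - len(data) // 2)
            self.datafile[page_no] = (torn, page_checksum(data))
        else:
            self.datafile[page_no] = (data, page_checksum(data))

    def recover(self):
        for page_no, (data, crc) in self.datafile.items():
            if page_checksum(data) != crc:                    # torn page found
                self.datafile[page_no] = self.dw_area[page_no]  # intact copy

dw = DoublewriteSketch()
dw.flush_page(1, b"A" * 16384, tear_final_write=True)
dw.recover()
assert dw.datafile[1][0] == b"A" * 16384    # restored from the doublewrite area
```

Note the asymmetry that makes this safe: a crash during the doublewrite write leaves the original page intact; a crash during the in-place write leaves an intact copy in the doublewrite area.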
5.3 Performance Impact & When to Disable
| Aspect | Details |
|---|---|
| Write Amplification | Each dirty page is written twice: once to the Doublewrite Buffer, once to the actual location. Theoretically 2x write I/O, but in practice ~5-10% overhead (because Doublewrite writes are sequential). |
| Safe to Disable When | 1. The filesystem or storage guarantees atomic writes (ZFS, RAID controller with battery-backed cache) 2. Some cloud databases (e.g., Aurora) implement storage-layer atomic writes 3. innodb_page_size set to 4KB (equal to the filesystem block size, so no torn-page risk) |
6. Change Buffer
6.1 What Problem Does the Change Buffer Solve?
When an INSERT/UPDATE/DELETE modifies a secondary index page that is not in the Buffer Pool, InnoDB faces a choice:
Without Change Buffer
Immediately read that secondary index page from disk into the Buffer Pool, modify it, mark as dirty. Problem: this is a random read I/O. For thousands of INSERTs per second, this generates massive random I/O.
With Change Buffer
Cache the modification in the Change Buffer without reading from disk. When the page is eventually read into the Buffer Pool for other queries, merge the cached modifications. This avoids massive random read I/O.
Two kinds of indexes can never use the Change Buffer:
1. The clustered index (primary key): the primary key page must be read on INSERT to determine the insert position
2. Unique indexes: the index page must be read on INSERT to check the uniqueness constraint
6.2 Merge Triggers
- The secondary index page is read into the Buffer Pool by a SELECT query
- Background merge thread executes periodically
- Buffer Pool space is insufficient and Change Buffer pages must be evicted
- At transaction commit (in certain cases)
- During normal server shutdown (full merge)
6.3 Configuration
| Parameter | Default | Description |
|---|---|---|
| innodb_change_buffering | all | Which operations to buffer: none, inserts, deletes, changes (inserts + deletes), purges, all |
| innodb_change_buffer_max_size | 25 | Maximum percentage of the Buffer Pool usable by the Change Buffer. Range: 0-50. |
6.4 When It Helps vs When It Hurts
Effective Scenarios
- Heavy random INSERTs (log tables, secondary indexes on UUID PK tables)
- Secondary index pages not frequently read by SELECTs (write-heavy, read-light)
- Disk I/O is the bottleneck (HDD benefits more than SSD)
Ineffective/Harmful Scenarios
- All indexes are unique (Change Buffer has zero effect)
- Read immediately after write (merge triggers immediately, no different from direct read)
- Buffer Pool large enough to cache all index pages (no need for Change Buffer optimization)
7. Adaptive Hash Index (AHI)
7.1 How It Works
All user data and indexes in InnoDB are stored in B+Trees. A B+Tree lookup traverses from the root node downward level by level. For a 3-4 level B+Tree, each lookup requires 3-4 page accesses (though upper-level pages are usually in the Buffer Pool, there is still CPU overhead).
The Adaptive Hash Index (AHI) is an in-memory hash table that InnoDB automatically builds on frequently accessed B+Tree pages. When InnoDB detects that a certain index prefix is repeatedly searched with the same pattern, it automatically builds a hash index, reducing lookup complexity from O(log n) to O(1).
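The core idea can be sketched as a toy model (the promotion threshold and all names here are invented for illustration; InnoDB's actual heuristics live in btr0sea.cc):

```python
class AdaptiveHashSketch:
    """Toy AHI: once a key has gone through the B+Tree search path often
    enough, cache a direct key -> location mapping for O(1) lookups."""
    PROMOTE_AFTER = 17                     # invented threshold

    def __init__(self, btree_lookup):
        self.btree_lookup = btree_lookup   # O(log n) fallback path
        self.hits = {}
        self.hash_index = {}               # key -> location, O(1) path

    def search(self, key):
        if key in self.hash_index:
            return self.hash_index[key]    # hash hit: no tree descent
        value = self.btree_lookup(key)     # full B+Tree descent
        self.hits[key] = self.hits.get(key, 0) + 1
        if self.hits[key] >= self.PROMOTE_AFTER:
            self.hash_index[key] = value   # hot enough: build a hash entry
        return value

tree_descents = []
def btree_lookup(key):
    tree_descents.append(key)              # count the expensive lookups
    return ("page", 42)

ahi = AdaptiveHashSketch(btree_lookup)
for _ in range(20):
    ahi.search("user_id=42")
assert "user_id=42" in ahi.hash_index
assert len(tree_descents) == 17            # the last 3 searches were O(1)
```

The sketch also shows why AHI helps only repeated point lookups with a stable search pattern: a key that is never repeated never crosses the promotion threshold.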
7.2 Monitoring AHI
The INSERT BUFFER AND ADAPTIVE HASH INDEX section of SHOW ENGINE INNODB STATUS reports hash searches/s versus non-hash searches/s; the share of hash searches in the total is the AHI hit rate.
7.3 AHI Contention Issues
AHI is protected by rw-locks. Under high concurrency, many threads simultaneously accessing AHI can cause rw-lock contention, manifesting as high CPU usage with throughput actually decreasing.
| Parameter | Default | Description |
|---|---|---|
| innodb_adaptive_hash_index | ON | Whether to enable AHI. Can be changed dynamically. |
| innodb_adaptive_hash_index_parts | 8 | Number of AHI partitions. Increasing it reduces rw-lock contention. Range: 1-512. |
7.4 When to Disable AHI
- AHI hit rate consistently below 5% (wasting memory and CPU)
- Heavy btr_search rw-lock waits in the SEMAPHORES section of SHOW ENGINE INNODB STATUS
- Workload is primarily range scans (AHI helps point lookups, not range scans)
- Queries use many different index prefixes randomly (AHI cannot build effective hashes)
8. Tablespace Architecture
8.1 Tablespace Types Overview
8.2 System Tablespace (ibdata1)
The system tablespace is InnoDB's original tablespace, storing:
- Change Buffer persistent data
- Doublewrite Buffer (before MySQL 8.0.20)
- InnoDB internal data dictionary (before MySQL 8.0)
- User table data, if innodb_file_per_table=OFF
The system tablespace grows on demand but never shrinks, even after data is deleted (a real problem when user tables live in it with innodb_file_per_table=OFF). The only solution is logical backup + instance rebuild. Therefore, always keep innodb_file_per_table=ON (the default since MySQL 5.6.6).
8.3 File-per-Table Tablespaces
When innodb_file_per_table=ON, each InnoDB table's data and indexes are stored in its own .ibd file.
| Advantages | Disadvantages |
|---|---|
| DROP TABLE / TRUNCATE TABLE actually reclaims disk space | Filesystem inode limits (relevant with tens of thousands of tables) |
| Can individually migrate tables with ALTER TABLE ... TABLESPACE | Slight storage overhead for each .ibd file's metadata |
| Can place tables on different disks (via symlinks or DATA DIRECTORY) | Finer fsync granularity (per file) may slightly impact write performance |
| Supports COMPRESSED row format | |
8.4 General Tablespaces
General tablespaces allow organizing related tables in a single file (reducing file descriptor usage) or placing them on specific storage devices.
8.5 Temporary Tablespaces
MySQL 8.0 has two types of temporary tablespaces:
- Global Temporary Tablespace (ibtmp1): used for rollback segments of user-created temporary tables (CREATE TEMPORARY TABLE). Automatically rebuilt on restart.
- Session Temporary Tablespaces (#innodb_temp/): added in MySQL 8.0.13, they store user-created temporary table data and internal temporary tables. One set is allocated per session; the space is reclaimed when the session ends.
8.6 Tablespace Internals: Segment, Extent, Page
InnoDB tablespaces organize disk space internally in a three-level hierarchy:
8.7 .ibd File Space Reclamation & Defragmentation
After massive DELETE operations, .ibd files do not automatically shrink. Deleted space is marked as reusable inside the file, but the file's physical size stays the same. If fragmentation is severe, reclaim space by rebuilding the table with OPTIMIZE TABLE or ALTER TABLE ... ENGINE=InnoDB (both rewrite the .ibd file).
9. Row Formats
9.1 Comparison of Four Row Formats
| Row Format | Since | Variable-Length Storage | Large Column Overflow | Recommended |
|---|---|---|---|---|
| REDUNDANT | Earliest | Full length info stored in-row | First 768 bytes stored in-row | No (legacy only) |
| COMPACT | 5.0 | Compact NULL flags and variable-length field lengths | First 768 bytes stored in-row | Usable but not preferred |
| DYNAMIC | Barracuda format; default since 5.7 | Same as COMPACT | Only a 20-byte pointer in-row; all data in overflow pages | Yes (default, recommended) |
| COMPRESSED | 5.5+ | Same as DYNAMIC, plus page-level compression | Same as DYNAMIC | Special cases (cold data) |
9.2 COMPACT/DYNAMIC Row Internal Structure
9.3 Overflow Page Storage (Off-page)
When a large column in a row (TEXT, BLOB, very long VARCHAR) exceeds the page's storage capacity, InnoDB stores part or all of the data in overflow pages. COMPACT and DYNAMIC handle this differently:
COMPACT / REDUNDANT
First 768 bytes stored in-row, remainder in overflow pages. This means even a primary key lookup may load large TEXT data into Buffer Pool data pages, wasting memory.
DYNAMIC
When a column overflows, only a 20-byte pointer (space ID + page number + offset) is stored in-row, with all data in overflow pages. This allows data pages to hold more rows, dramatically improving Buffer Pool efficiency.
Recommendation: use ROW_FORMAT=DYNAMIC (the MySQL 8.0 default). If your table has TEXT/BLOB columns and you need disk space savings (accepting ~10% CPU overhead), consider COMPRESSED. However, in MySQL 8.0, prefer transparent page compression (COMPRESSION='zstd') over the COMPRESSED row format.
9.4 Hidden Columns in InnoDB Rows
Every InnoDB row has up to three hidden columns, critical for MVCC and internal management:
| Hidden Column | Size | Description |
|---|---|---|
| DB_ROW_ID | 6 bytes | Row ID, added only when the table has no explicit primary key and no non-NULL unique index. InnoDB generates a globally incrementing row_id as the internal primary key. This is why you should always define a primary key: otherwise InnoDB uses a globally mutex-protected counter to generate row_id, which becomes a bottleneck under high-concurrency INSERTs. |
| DB_TRX_ID | 6 bytes | Transaction ID of the last transaction that modified this row. Updated by INSERT/UPDATE/DELETE. The core basis for MVCC version visibility checks. |
| DB_ROLL_PTR | 7 bytes | Roll pointer to the previous version of this row in the undo log. Following DB_ROLL_PTR traverses the version chain to find any historical version. 7 bytes = 1 bit (INSERT/UPDATE flag) + 7 bits (rollback segment ID) + 32 bits (page number) + 16 bits (offset within page). |
9.5 COMPRESSED Row Format Deep Dive
The COMPRESSED row format adds page-level zlib compression on top of DYNAMIC. Each 16KB page is compressed and stored in smaller space. However, InnoDB must maintain both compressed and uncompressed copies of pages in memory:
MySQL 8.0 recommends transparent page compression (CREATE TABLE ... COMPRESSION='zstd') over the COMPRESSED row format:
1. Transparent page compression does not need two copies in memory
2. Supports more efficient compression algorithms (zstd instead of zlib)
3. Relies on filesystem sparse file (punch hole) support โ Linux ext4/xfs/btrfs all support it
4. No extra memory overhead on Buffer Pool
5. Downside: requires filesystem punch hole support; actual savings depend on filesystem block alignment
9.6 Row Size Limits
InnoDB has several important row size limits:
| Limit | Value | Description |
|---|---|---|
| MySQL layer row size limit | 65,535 bytes | Sum of all column declared lengths (excluding BLOB/TEXT) cannot exceed this |
| InnoDB in-page row size limit | ~8000 bytes | Approximately half of page size (16KB/2), ensuring at least 2 rows per page (B+Tree requirement) |
| Overflow trigger threshold | Dynamic | When in-row data exceeds ~8000 bytes, the longest variable-length fields are moved to overflow pages. DYNAMIC format keeps only a 20-byte pointer. |
| Maximum columns | 1017 | Maximum 1017 columns per table (limit after including hidden and virtual columns) |
10. Flush Mechanism
10.1 Why Flush?
InnoDB writes first go to the Buffer Pool (in-memory dirty pages), with transactions guaranteed by the Redo Log. But dirty pages must eventually be written back to disk for two reasons:
- Free Buffer Pool space for new pages
- Advance the checkpoint LSN to free redo log space and shorten crash recovery time
10.2 Two Flush Paths
LRU Flushing
When Buffer Pool free pages are insufficient, page cleaner threads scan from the LRU tail, flush dirty pages, and free them. The goal is to ensure enough free pages for new read requests.
Flush List Flushing
Page cleaner threads flush dirty pages in order of their oldest_modification LSN from the Flush List, advancing the checkpoint. The goal is to ensure the redo log does not fill up.
10.3 Key Flush Parameters
| Parameter | Default | Description | Recommended |
|---|---|---|---|
| innodb_io_capacity | 200 | Maximum I/O operations per second for InnoDB background tasks (flushing, change buffer merging). | SSD: 2000-10000, HDD: 200-400 |
| innodb_io_capacity_max | 2000 | I/O operations cap during emergency flushing (redo log near full). | 2-4x innodb_io_capacity |
| innodb_flush_neighbors | 0 (8.0) | Whether to flush neighboring dirty pages. Useful on HDD (reduces random I/O); pointless on SSD. | SSD: 0, HDD: 1 |
| innodb_adaptive_flushing | ON | Adaptive flushing: InnoDB dynamically adjusts the flush rate based on redo log consumption speed. | ON |
| innodb_page_cleaners | 4 | Number of page cleaner threads. Should equal or exceed innodb_buffer_pool_instances. | = buffer_pool_instances |
| innodb_lru_scan_depth | 1024 | Depth of the LRU tail scan per Buffer Pool instance per page cleaner iteration. | Keep default, or lower to 256-512 under I/O pressure |
10.4 Adaptive Flushing Algorithm
When innodb_adaptive_flushing=ON, InnoDB estimates the ideal flush rate from how fast redo log space is being consumed and how high the dirty page percentage is, instead of flushing at a fixed rate.
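A hedged, much-simplified model of that estimate (the real calculation in buf0flu.cc also tracks the average redo generation rate over recent intervals; the function and variable names here are invented):

```python
def desired_flush_rate(io_capacity: int, dirty_pct: float, max_dirty_pct: float,
                       checkpoint_age: int, redo_capacity: int) -> float:
    """Simplified adaptive-flushing model: the flush rate scales with
    whichever pressure is higher, dirty pages or redo log fill."""
    pct_for_dirty = min(dirty_pct / max_dirty_pct, 1.0) * 100
    pct_for_lsn = (checkpoint_age / redo_capacity) * 100
    return io_capacity * max(pct_for_dirty, pct_for_lsn) / 100

# Low pressure -> gentle flushing; redo log nearly full -> near full capacity.
assert desired_flush_rate(2000, 10, 90, 100 << 20, 4 << 30) < 2000
assert desired_flush_rate(2000, 10, 90, 3 << 30, 4 << 30) >= 1500
```

The point of the model is the max(): a large redo log and a high io_capacity both lower these pressure ratios, which is exactly why the incident below was fixed by raising both.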
Real-world incident: a system ran with innodb_io_capacity=200 and innodb_redo_log_capacity=100MB (both too small). During peak traffic the redo log filled up rapidly, InnoDB triggered emergency synchronous flushing, and all user queries blocked for 5-10 seconds. QPS dropped to near zero, then recovered, and the pattern repeated every few minutes. Solution: increase the redo log to 4GB and innodb_io_capacity to 4000 (SSD). The problem disappeared.
Source ref: storage/innobase/buf/buf0flu.cc → buf_flush_page_cleaner_coordinator()
11. InnoDB Configuration Cheat Sheet
Below are the most critical InnoDB parameters with recommended settings (assuming 64GB RAM server, SSD storage):
| Parameter | Default | Recommended (64G/SSD) | Description |
|---|---|---|---|
| innodb_buffer_pool_size | 128MB | 44G | ~70% of total RAM |
| innodb_buffer_pool_instances | 8 | 16 | ~2.75GB per instance |
| innodb_redo_log_capacity | 100MB | 4G | 8.0.30+; older versions: log_file_size * files |
| innodb_log_buffer_size | 16MB | 64MB | Increase for large transactions or high-concurrency writes |
| innodb_flush_log_at_trx_commit | 1 | 1 | Must be 1 on a primary; replicas can use 2 |
| innodb_io_capacity | 200 | 4000 | 50-75% of SSD IOPS |
| innodb_io_capacity_max | 2000 | 8000 | 2x io_capacity |
| innodb_flush_method | fsync | O_DIRECT | Bypass the OS cache, avoiding double buffering |
| innodb_flush_neighbors | 0 | 0 | SSDs do not benefit from neighbor flushing |
| innodb_page_cleaners | 4 | 16 | = buffer_pool_instances |
| innodb_purge_threads | 4 | 4-8 | Increase for heavy UPDATE/DELETE workloads |
| innodb_file_per_table | ON | ON | Always keep ON |
| innodb_doublewrite | ON | ON | Keep ON unless using ZFS or battery-backed RAID cache |
| innodb_adaptive_hash_index | ON | ON | Disable if contention is observed |
| innodb_change_buffer_max_size | 25 | 25 | Set to 0 if there are no secondary indexes or all are unique |
| innodb_undo_tablespaces | 2 | 2-4 | More tablespaces improve concurrency and truncation |
| innodb_undo_log_truncate | ON | ON | Auto-reclaim undo space |
| innodb_print_all_deadlocks | OFF | ON | Log all deadlocks to the error log |
| innodb_deadlock_detect | ON | ON | Consider OFF plus lock-wait timeout for high-concurrency hotspot rows |
12. Monitoring & Troubleshooting
12.1 SHOW ENGINE INNODB STATUS Full Walkthrough
This is InnoDB's most important diagnostic tool. Its output is divided into sections, each covering a different dimension of the engine's state:
(1) SEMAPHORES Section
(2) TRANSACTIONS Section
(3) FILE I/O Section
(4) ROW OPERATIONS Section
12.2 information_schema Monitoring Tables
12.3 InnoDB Tables in performance_schema
| Table | Purpose |
|---|---|
| data_locks | All currently held and waited-for InnoDB locks |
| data_lock_waits | Lock wait relationships: which transaction waits for which |
| events_waits_summary_global_by_event_name | InnoDB mutex/rw-lock wait statistics (search for wait/synch/mutex/innodb) |
| file_summary_by_instance | Per-.ibd-file I/O statistics (read/write bytes, latency) |
| table_io_waits_summary_by_table | Per-table I/O wait statistics |
12.4 Common Troubleshooting Checklist
| Symptom | Possible Cause | Diagnostic | Solution |
|---|---|---|---|
| QPS periodic drops | Redo log full causing sync flush | checkpoint_age approaching redo log capacity | Increase redo log + io_capacity |
| Low Buffer Pool hit rate | Pool too small or full-scan pollution | SHOW ENGINE INNODB STATUS → hit rate | Increase pool size or optimize queries |
| History list keeps growing | Long transaction blocking purge | INNODB_TRX → trx_started | Find and kill the long transaction |
| ibdata1 bloating | innodb_file_per_table=OFF | Check my.cnf | Enable it, then rebuild logically (dump and reload) |
| High CPU but low throughput | AHI rw-lock contention | SEMAPHORES → btr0sea.ic | Disable AHI or raise innodb_adaptive_hash_index_parts |
| Undo tablespace ballooning | Long transaction + heavy concurrent writes | INNODB_TRX + undo file size | Kill the long transaction + truncate undo |
13. Frequently Asked Questions
Q1: What percentage of total RAM should innodb_buffer_pool_size be?
The general rule is 60-80% of total RAM, but it depends on the scenario:
- Dedicated database server (MySQL only): 70-80%
- Shared server (with app, cache, etc.): 50-60%
- Small RAM (≤4GB): 50%
Remaining memory is for: OS file cache, per-connection memory (sort_buffer, join_buffer, thread stack, ~2-10MB per connection), other processes. A common mistake is setting buffer_pool_size to 90%+, causing the OS to swap, which devastates performance.
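The rules of thumb above can be condensed into a tiny sizing helper. This is a Python sketch of the heuristic only (the exact percentages are judgment calls, not InnoDB requirements), not a substitute for measuring your workload:

```python
def suggest_buffer_pool_gb(total_ram_gb, dedicated=True):
    """Rule of thumb from the text: ~75% on a dedicated database server,
    ~55% on a shared box, 50% on small-RAM (<= 4GB) machines."""
    if total_ram_gb <= 4:
        pct = 0.50
    elif dedicated:
        pct = 0.75
    else:
        pct = 0.55
    return round(total_ram_gb * pct)

print(suggest_buffer_pool_gb(64))                   # 48
print(suggest_buffer_pool_gb(64, dedicated=False))  # 35
```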
Q2: Why does InnoDB use B+Tree instead of B-Tree or hash indexes?
B+Tree advantages over B-Tree:
- All data in leaf nodes: non-leaf nodes store only keys and pointers, giving higher fan-out, a shorter tree, and fewer disk I/Os
- Leaf nodes form a doubly-linked list: range scans (BETWEEN, ORDER BY) traverse the list without backtracking
- Stable query performance: all lookups go to leaf level, consistent path length (O(log n))
Hash indexes offer O(1) point lookups but cannot support range scans, sorting, or prefix matching, all of which are extremely common in OLTP. InnoDB uses hashing as a supplement (the AHI), not as the primary structure.
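To make the fan-out argument concrete, here is a back-of-envelope sketch. The figures (roughly 1200 child pointers per internal page and 100 rows per leaf for a 16KB page) are illustrative assumptions, not exact InnoDB constants:

```python
import math

def bplus_tree_height(n_rows, internal_fanout, rows_per_leaf):
    """Levels in the tree: one leaf level holding all rows, plus enough
    internal levels for the fan-out to cover every leaf page."""
    leaves = math.ceil(n_rows / rows_per_leaf)
    if leaves <= 1:
        return 1
    return 1 + math.ceil(math.log(leaves, internal_fanout))

# ~100M rows, ~1200 pointers per internal page, ~100 rows per leaf:
print(bplus_tree_height(100_000_000, 1200, 100))  # 3
```

A three-level tree means a point lookup touches at most 3 pages, and with the upper levels cached in the Buffer Pool, typically a single disk read.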
Q3: Can innodb_flush_log_at_trx_commit=2 and sync_binlog=1 be used together?
Technically yes, but usually pointless. sync_binlog=1 guarantees binlog fsync on every commit, but flush_log_at_trx_commit=2 allows redo log loss. Crash recovery depends on consistency between redo log and binlog: if redo is lost but binlog exists, primary-replica inconsistency results.
Recommended combinations:
- Primary: flush_log_at_trx_commit=1 + sync_binlog=1 (dual-1, safest)
- Replica: flush_log_at_trx_commit=2 + sync_binlog=0 (relaxed, better performance)
Q4: How to determine if the Buffer Pool is large enough?
Three key indicators:
- Buffer Pool hit rate: "Buffer pool hit rate" in SHOW ENGINE INNODB STATUS; should be >= 999/1000 (99.9%).
- Free buffers: if persistently 0 but the hit rate is still high, the pool size is just right; if free=0 and the hit rate is low, increase it.
- Logical vs physical reads: SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'. Compare Innodb_buffer_pool_read_requests (logical reads) with Innodb_buffer_pool_reads (physical reads); the higher the ratio, the better.
Ultimate test: if your entire dataset fits in the Buffer Pool (check total Data_length + Index_length from SHOW TABLE STATUS), set it large enough to hold everything.
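The third indicator reduces to a one-line calculation. A minimal Python sketch (function name is illustrative):

```python
def buffer_pool_hit_rate(read_requests, physical_reads):
    """Fraction of logical reads served from the Buffer Pool,
    given the two Innodb_buffer_pool_read* counters."""
    if read_requests == 0:
        return 1.0
    return 1 - physical_reads / read_requests

# 1,000,000 logical reads, 500 went to disk -> 99.95%, above the 99.9% bar
print(buffer_pool_hit_rate(1_000_000, 500))  # 0.9995
```

In practice, compute this over a time window (deltas between two samples), since the raw counters accumulate from server start and hide recent regressions.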
Q5: Does MVCC completely avoid locks?
No. MVCC only avoids read-write conflicts (reads don't block writes, writes don't block reads), but write-write conflicts still require locks. Specifically:
- Consistent reads (plain SELECT): read historical version via MVCC, no lock
- Current reads (SELECT ... FOR UPDATE / LOCK IN SHARE MODE): locking read, reads latest version
- INSERT / UPDATE / DELETE: first read latest version (current read), then take exclusive lock to modify
Thus MVCC + locking is InnoDB's complete concurrency control mechanism: they complement, not replace, each other.
Q6: Why is innodb_flush_method=O_DIRECT recommended?
With the default fsync method, InnoDB's write path is: Buffer Pool → OS Page Cache → Disk. Data is held twice in memory (InnoDB Buffer Pool + OS Page Cache), wasting memory.
O_DIRECT bypasses the OS Page Cache: Buffer Pool → Disk directly. Benefits:
- Avoid double caching, freeing valuable OS memory
- Reduce OS Page Cache CPU overhead (cache lookups, LRU management)
- Especially effective with large Buffer Pools
Note: O_DIRECT still uses fsync for redo log. In MySQL 8.0.26+ you can use O_DIRECT_NO_FSYNC, which performs better on some filesystems.
Q7: Why does InnoDB need Doublewrite while PostgreSQL doesn't?
They use different strategies for the torn page problem:
- InnoDB: Redo Log records physiological logs (offset + delta within a page). If the page itself is corrupted, redo cannot replay correctly. Thus Doublewrite provides an intact page copy as the replay base.
- PostgreSQL: On the first modification of a page after a checkpoint, it writes a full page image (Full Page Write, FPW) in the WAL. Even if the page is corrupted, the full page can be recovered from the WAL. The cost is larger WAL size.
Each approach has trade-offs: InnoDB's redo log is more compact but requires extra Doublewrite I/O; PostgreSQL's WAL is larger but architecturally simpler.
Q8: How to resize innodb_buffer_pool_size online?
MySQL 5.7+ supports online Buffer Pool resizing without restart:
SET GLOBAL innodb_buffer_pool_size = 48 * 1024 * 1024 * 1024; -- 48GB
Important notes:
- Resizing happens in units of innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances (default 128MB * 8 = 1GB)
- If the value is not a multiple of this unit, MySQL rounds up automatically
- Growing is usually fast; shrinking can be slow (pages must be evicted)
- Resizing does not block queries but may cause brief performance fluctuations
- Monitor progress via SHOW STATUS LIKE 'Innodb_buffer_pool_resize_status'
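The rounding rule from the first two bullets can be sketched in a few lines of Python (defaults assume the stock 128MB chunk size and 8 instances):

```python
def effective_buffer_pool_size(requested_bytes,
                               chunk_size=128 * 2**20,  # 128MB default
                               instances=8):
    """Round the requested size up to a multiple of chunk_size * instances,
    mirroring what MySQL does when applying SET GLOBAL innodb_buffer_pool_size."""
    unit = chunk_size * instances  # 1GB with the defaults
    return -(-requested_bytes // unit) * unit  # ceiling division

GB = 2**30
print(effective_buffer_pool_size(9 * GB) // GB)      # 9
print(effective_buffer_pool_size(9 * GB + 1) // GB)  # 10
```

This is why requesting, say, 44.5GB on a default configuration actually allocates 45GB: anything past a 1GB boundary rounds up to the next one.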
Q9: How does InnoDB solve phantom reads under RR isolation?
InnoDB solves phantom reads under RR through two mechanisms:
- MVCC (snapshot reads): a plain SELECT uses the ReadView from the transaction's start. Even if other transactions insert new rows, the snapshot read cannot see them, so no phantom reads occur.
- Next-Key Lock (current reads): SELECT ... FOR UPDATE / INSERT / UPDATE / DELETE uses current reads. InnoDB uses Next-Key Locks (Record Lock + Gap Lock) to lock gaps in the query range, preventing other transactions from inserting new rows.
Caveat: RR does not perfectly prevent phantom reads in all cases. A classic counterexample: a transaction does a snapshot read (doesn't see new rows), then an UPDATE (current read, affects new rows), then a snapshot read again (may now see them, because the UPDATE changed the new rows' trx_id to the current transaction's ID).
Q10: What major InnoDB architectural improvements were made in MySQL 8.0?
MySQL 8.0's InnoDB improvements are historic:
- Atomic data dictionary: migrated from file-based .frm to InnoDB's transactional data dictionary; DDL operations become atomic (crash-safe)
- Redo Log redesign (8.0.30+): from fixed-size files to a dynamic #innodb_redo/ directory, unified under innodb_redo_log_capacity
- Separate Doublewrite files (8.0.20+): extracted from ibdata1 into standalone .dblwr files, enabling parallel writes
- Instant ADD COLUMN (8.0.12+): certain ALTER TABLE ADD COLUMN operations complete instantly without table rebuild
- Invisible indexes: ALTER TABLE ... ALTER INDEX idx INVISIBLE to test the impact of dropping an index without actually dropping it
- Descending indexes: true descending B+Tree indexes, optimizing ORDER BY ... DESC queries
- Session temporary tablespaces (8.0.15+): per-session temporary tablespace files, reclaimable after disconnect
- Undo Tablespace management: online create/drop undo tablespace, automatic truncation
- Parallel reads (8.0.14+): SELECT COUNT(*) and other full table scans can read multiple pages in parallel
12.5 Key InnoDB Status Variables
Besides SHOW ENGINE INNODB STATUS, these SHOW GLOBAL STATUS variables are most valuable for daily monitoring:
| Variable | Meaning | Healthy Threshold |
|---|---|---|
| Innodb_buffer_pool_read_requests | Total logical read requests to the Buffer Pool | Ratio to reads > 99.9% |
| Innodb_buffer_pool_reads | Buffer Pool misses requiring a disk read | Lower is better |
| Innodb_buffer_pool_pages_dirty | Dirty page count in the Buffer Pool | Should be stable, not continuously growing |
| Innodb_buffer_pool_wait_free | Waits for free pages (Free list empty) | Should be 0 or near 0 |
| Innodb_log_waits | Waits due to the Log Buffer being too small | Should be 0; if > 0, increase log_buffer_size |
| Innodb_os_log_written | Bytes written to the Redo Log | Use to calculate the write rate and size the redo log |
| Innodb_row_lock_time_avg | Average row lock wait time (ms) | < 10ms; > 100ms needs investigation |
| Innodb_row_lock_waits | Row lock wait count | Trend matters more than the absolute value |
| Innodb_rows_read / inserted / updated / deleted | InnoDB-layer row operation statistics | Use for workload profiling |
| Innodb_data_reads / writes | Physical I/O operation count | Compare with logical reads to measure cache efficiency |
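As an example of using Innodb_os_log_written for sizing: sample the counter twice, derive the redo write rate, and apply the common rule of thumb of provisioning roughly an hour of peak redo. A Python sketch (the one-hour headroom is a convention, not an InnoDB requirement):

```python
def redo_capacity_for_peak(os_log_written_t0, os_log_written_t1,
                           interval_seconds, headroom_minutes=60):
    """Estimate innodb_redo_log_capacity from two samples of the
    Innodb_os_log_written counter taken interval_seconds apart."""
    rate = (os_log_written_t1 - os_log_written_t0) / interval_seconds
    return int(rate * headroom_minutes * 60)

MB = 2**20
# 60MB of redo written in 60s -> 1MB/s -> size for ~1 hour -> 3600MB
print(redo_capacity_for_peak(0, 60 * MB, 60) // MB)  # 3600
```

Take the samples during peak load; an average over a quiet day understates the capacity needed to avoid the synchronous-flush stalls described in section 10.4.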
12.6 Prometheus / Grafana Monitoring Recommendations
In production, mysqld_exporter typically exposes InnoDB metrics to Prometheus, visualized with Grafana. Recommended core dashboard panels include:
| Panel | Metrics | Alert Threshold |
|---|---|---|
| Buffer Pool Hit Rate | rate(mysql_global_status_innodb_buffer_pool_read_requests[5m]), rate(mysql_global_status_innodb_buffer_pool_reads[5m]) | Alert if < 99.5% |
| Dirty Pages % | mysql_global_status_innodb_buffer_pool_pages_dirty / mysql_global_status_innodb_buffer_pool_pages_total | Alert if > 75% |
| Checkpoint Age | mysql_global_status_innodb_checkpoint_age | Alert if > 80% of redo capacity |
| History List Length | mysql_global_status_innodb_history_list_length | Alert if > 10000 |
| Row Lock Waits/s | rate(mysql_global_status_innodb_row_lock_waits[5m]) | Alert if > 100/s |
| Long Transactions | mysql_info_schema_innodb_trx (custom query) | Alert if > 300 seconds |
| Redo Log Write Rate | rate(mysql_global_status_innodb_os_log_written[5m]) | Trend monitoring (alert on anomaly) |
| Pending I/O | mysql_global_status_innodb_data_pending_reads, mysql_global_status_innodb_data_pending_writes | Alert if > 0 for 5 minutes |