Chapter 2: InnoDB Storage Engine Internals Deep Dive

Level: Advanced Reading time ~60 min Based on MySQL 8.0 / 8.4

InnoDB is MySQL's default and most critical storage engine. This chapter delivers a source-code-level deep dive from the architecture overview through Buffer Pool page structure, Redo Log WAL mechanics, Undo Log and MVCC version chains, Doublewrite Buffer torn-page protection, Change Buffer random-I/O optimization, Adaptive Hash Index, tablespace layout, row format comparison, all the way to flush strategies and production monitoring. Every topic includes ASCII architecture diagrams, MySQL source file references, key parameter tuning advice, and real-world incident case studies.

1. InnoDB Architecture Overview

1.1 Architecture Diagram

InnoDB's internal architecture divides into in-memory structures and on-disk structures. The ASCII diagram below shows every core component and data flow:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ InnoDB In-Memory Structures โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ Buffer Pool (default 128MB) โ”‚ โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ Data Pages โ”‚ โ”‚Index Pages โ”‚ โ”‚Undo Pagesโ”‚ โ”‚ Change Buf โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ (16KB ea.) โ”‚ โ”‚ (16KB) โ”‚ โ”‚ (16KB) โ”‚ โ”‚ Pages โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ LRU List: [Young Sublist 5/8] โ†โ†’ [Old Sublist 3/8] โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ Free List โ”‚ โ”‚ Flush List โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ Adaptive Hash โ”‚ โ”‚ Log Buffer โ”‚ โ”‚ Change Buffer โ”‚ โ”‚ โ”‚ โ”‚ Index (AHI) โ”‚ โ”‚ (64MB default) โ”‚ โ”‚ (25% of pool max) โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ”‚ Checkpoint โ”‚ WAL Write โ”‚ Merge on read โ–ผ โ–ผ โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ InnoDB On-Disk Structures โ”‚ โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ System โ”‚ โ”‚ Redo Log โ”‚ โ”‚ Undo Tablespaces โ”‚ โ”‚ โ”‚ โ”‚ Tablespace โ”‚ โ”‚ ib_logfile0 โ”‚ โ”‚ undo_001, undo_002 โ”‚ โ”‚ โ”‚ โ”‚ (ibdata1) โ”‚ โ”‚ ib_logfile1 โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ (or #innodb_ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ redo/ 8.0.30+)โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ File-per- โ”‚ โ”‚ Doublewrite Buffer โ”‚ โ”‚ โ”‚ โ”‚ table .ibd โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ (ibdata1 or separate โ”‚ โ”‚ โ”‚ โ”‚ files โ”‚ โ”‚ Temp Tablespaceโ”‚ โ”‚ .dblwr files 8.0.20+) โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ (ibtmp1) โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ General Tablespaces (CREATE TABLESPACE ... ) โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Source ref: storage/innobase/include/buf0buf.h, storage/innobase/srv/srv0srv.cc

1.2 Query Data Flow

When a SELECT reaches InnoDB, the data flow is:

Client SQL โ”‚ โ–ผ MySQL Server Layer (parser โ†’ optimizer โ†’ executor) โ”‚ โ–ผ handler::ha_read_row() InnoDB Handler Interface โ”‚ โ”œโ”€โ”€โ–ถ Check Adaptive Hash Index โ”€โ”€โ–ถ HIT? โ†’ return row โ”‚ MISS โ†“ โ”œโ”€โ”€โ–ถ Check Buffer Pool LRU โ”€โ”€โ”€โ”€โ”€โ”€โ–ถ HIT? โ†’ return page โ”‚ MISS โ†“ โ”œโ”€โ”€โ–ถ Read page from .ibd file (OS read / AIO) โ”‚ โ””โ”€โ”€โ–ถ Place page into Buffer Pool (Old sublist head) โ”‚ โ””โ”€โ”€โ–ถ If accessed again after innodb_old_blocks_time โ†’ move to Young โ”‚ โ–ผ Return row to Server Layer โ†’ Client

1.3 Write Data Flow

When an INSERT / UPDATE / DELETE executes, InnoDB's write path is considerably more complex than reads:

1. BEGIN (implicitly or explicitly) 2. Lock the target row(s) โ†’ record lock / gap lock / next-key lock 3. Write old row image to Undo Log (for rollback + MVCC) 4. Modify the page IN the Buffer Pool (in-memory only) 5. Write redo log record to Log Buffer 6. COMMIT: a. Flush Log Buffer โ†’ Redo Log files on disk (fsync per innodb_flush_log_at_trx_commit) b. Mark transaction committed in redo c. The dirty page stays in Buffer Pool โ†’ will be flushed later 7. Background threads (page cleaner) eventually: a. Write dirty page to Doublewrite Buffer (on disk) b. fsync Doublewrite Buffer c. Write page to actual .ibd file location d. fsync .ibd e. Advance checkpoint LSN
Key Insight: InnoDB's performance secret is converting random writes into sequential writes (the Redo Log is sequentially appended) and deferring actual dirty-page flushing to background threads. This is the core value of WAL (Write-Ahead Logging) โ€” a transaction commit only waits for the sequential fsync of the Redo Log, not the random I/O of every dirty page.

1.4 In-Memory vs On-Disk Structures

Component Location Purpose Key Parameter
Buffer Pool Memory Cache data and index pages to reduce disk I/O innodb_buffer_pool_size
Log Buffer Memory Temporarily hold redo log records before flushing innodb_log_buffer_size
Adaptive Hash Index Memory Auto-build hash indexes on frequently accessed B+Tree pages innodb_adaptive_hash_index
Change Buffer Memory + Disk Cache secondary index changes, defer merge innodb_change_buffer_max_size
Redo Log Disk Crash recovery, guarantee durability (D in ACID) innodb_redo_log_capacity
Undo Log Disk (+cached in Buffer Pool) Transaction rollback + MVCC versioned reads innodb_undo_tablespaces
Doublewrite Buffer Disk Prevent partial page writes (torn pages) innodb_doublewrite
System Tablespace Disk Data dictionary, change buffer persistence innodb_data_file_path

2. Buffer Pool Deep Dive

The Buffer Pool is InnoDB's most critical in-memory structure. Understanding how it works is foundational to understanding InnoDB performance.

2.1 Page Structure (16KB)

All data in InnoDB is stored in 16KB pages. Each page's internal structure is:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ InnoDB Page (16384 bytes) โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ File Header (38 bytes) โ”‚ โ”‚ โ”œโ”€ FIL_PAGE_SPACE_OR_CHKSUM (4) checksum โ”‚ โ”‚ โ”œโ”€ FIL_PAGE_OFFSET (4) page number โ”‚ โ”‚ โ”œโ”€ FIL_PAGE_PREV (4) previous page (B+Tree)โ”‚ โ”‚ โ”œโ”€ FIL_PAGE_NEXT (4) next page (B+Tree) โ”‚ โ”‚ โ”œโ”€ FIL_PAGE_LSN (8) last modified LSN โ”‚ โ”‚ โ”œโ”€ FIL_PAGE_TYPE (2) page type โ”‚ โ”‚ โ”œโ”€ FIL_PAGE_FILE_FLUSH_LSN (8) flush LSN (space 0) โ”‚ โ”‚ โ””โ”€ FIL_PAGE_SPACE_ID (4) tablespace ID โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ Page Header (56 bytes) โ”‚ โ”‚ โ”œโ”€ PAGE_N_DIR_SLOTS (2) slots in page dir โ”‚ โ”‚ โ”œโ”€ PAGE_HEAP_TOP (2) first free byte โ”‚ โ”‚ โ”œโ”€ PAGE_N_HEAP (2) number of records โ”‚ โ”‚ โ”œโ”€ PAGE_FREE (2) free record list head โ”‚ โ”‚ โ”œโ”€ PAGE_GARBAGE (2) deleted bytes โ”‚ โ”‚ โ”œโ”€ PAGE_LAST_INSERT (2) last insert position โ”‚ โ”‚ โ”œโ”€ PAGE_DIRECTION (2) insert direction โ”‚ โ”‚ โ”œโ”€ PAGE_N_DIRECTION (2) consecutive inserts โ”‚ โ”‚ โ”œโ”€ PAGE_N_RECS (2) user record count โ”‚ โ”‚ โ”œโ”€ PAGE_MAX_TRX_ID (8) max txn ID (sec idx) โ”‚ โ”‚ โ”œโ”€ PAGE_LEVEL (2) B+Tree level (0=leaf) โ”‚ โ”‚ โ”œโ”€ PAGE_INDEX_ID (8) index this page for โ”‚ โ”‚ โ””โ”€ ... (segment headers) โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ Infimum Record (13 bytes) โ”‚ โ”‚ Supremum Record (13 bytes) โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ โ”‚ โ”‚ User Records (grows downward โ†“) โ”‚ โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ Record 1โ”‚โ†’โ”‚ Record 2โ”‚โ†’โ”‚ Record 3โ”‚โ†’โ”‚ Record nโ”‚โ†’Supremum โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ”‚ Free Space (grows toward each other) โ”‚ โ”‚ โ”‚ โ”‚ Page Directory (grows upward โ†‘) โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚Slot 0โ”‚Slot 1โ”‚Slot 2โ”‚ ... โ”‚Slot nโ”‚ (2 bytes each) โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ File Trailer (8 bytes) โ”‚ โ”‚ โ”œโ”€ Checksum (4) matches header โ”‚ โ”‚ โ””โ”€ FIL_PAGE_LSN low 4 bytes (4) matches header LSN โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Source ref: storage/innobase/include/fil0types.h, storage/innobase/include/page0page.h

Why 16KB? 16KB is InnoDB's default page size, controlled by innodb_page_size (options: 4K/8K/16K/32K/64K, but only settable at initialization). 16KB is a carefully chosen trade-off: too small increases B+Tree depth (more disk seeks), too large wastes memory and increases intra-page fragmentation. For most OLTP workloads, 16KB is optimal.

2.2 LRU Algorithm: Dual-Sublist Design

InnoDB's Buffer Pool does not use a standard LRU (Least Recently Used) algorithm. Instead it employs a modified two-segment LRU to prevent full table scans from evicting hot data.

Buffer Pool LRU List โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ HEAD (MRU) TAIL (LRU) โ”‚ โ”‚ โ—„โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ Young Sublist (5/8) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บโ”‚โ—„โ”€โ”€ Old (3/8) โ”€โ”€โ–บโ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ [HOT] [HOT] [HOT] ... [WARM] [WARM] โ”‚ [OLD] [OLD] [OLD] โ”‚ โ”‚ โ–ฒ โ”‚ โ–ฒ โ”‚ โ”‚ โ”‚ Page accessed again โ”‚ โ”‚ New page from โ”‚ โ”‚ โ”‚ (after old_blocks_time) โ”‚ โ”‚ disk lands HERE โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ ("midpoint โ”‚ โ”‚ Promotion โ”‚ insertion") โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ Eviction from tail โ”€โ”€โ–บOUT โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Why Two Sublists?

Consider a typical production scenario: your Buffer Pool caches the last few hours of hot data (user table, order table index pages). Suddenly someone runs a SELECT * FROM large_table (full table scan), reading hundreds of thousands of data pages. With a standard LRU, these "cold" pages that will never be accessed again push out all the truly "hot" pages โ€” causing subsequent normal queries to hit disk. This is called "Buffer Pool pollution."

InnoDB's solution:

  1. When a new page is read from disk, it is inserted at the head of the Old sublist (midpoint insertion), not the head of the entire LRU.
  2. Only when the page is accessed again after staying in the Old sublist for more than innodb_old_blocks_time (default 1000ms) will it be promoted to the Young sublist.
  3. Full table scan pages are typically read and discarded within 1000ms, never entering the Young sublist, thus protecting hot data.

Key Parameters

Parameter Default Description
innodb_old_blocks_pct 37 Percentage of Buffer Pool for the Old sublist. Default 37% (~3/8). Range: 5-95.
innodb_old_blocks_time 1000 Milliseconds a page must stay in the Old sublist before it can be promoted to Young. Set to 0 to disable this protection.

2.3 Buffer Pool Instances

When the Buffer Pool is large (typically 32GB-128GB in production), a single Buffer Pool mutex becomes a concurrency bottleneck. InnoDB allows splitting the Buffer Pool into multiple independent instances, each with its own LRU list, Free list, Flush list, and mutex.

Parameter Default Recommended
innodb_buffer_pool_instances 8 (when pool >= 1GB) At least 1GB per instance. 128GB pool โ†’ 16 instances.
innodb_buffer_pool_chunk_size 128MB Unit for online pool resizing. pool_size should be a multiple of chunk_size * instances.
Note: MySQL 8.0 Change: In MySQL 8.0.26+, when innodb_buffer_pool_size < 1GB, innodb_buffer_pool_instances is forced to 1 regardless of your setting. This is because small Buffer Pools don't suffer from mutex contention.

2.4 Buffer Pool Warmup & Dump/Load

After a MySQL restart the Buffer Pool is empty (cold start) and query performance gradually recovers over 30-60 minutes. InnoDB provides automatic dump/load to accelerate warmup:

-- Enable automatic dump on shutdown and load on startup SET GLOBAL innodb_buffer_pool_dump_at_shutdown = ON; SET GLOBAL innodb_buffer_pool_load_at_startup = ON; -- Manual dump (writes to ib_buffer_pool file) SET GLOBAL innodb_buffer_pool_dump_now = ON; -- Check dump/load progress SHOW STATUS LIKE 'Innodb_buffer_pool_dump_status'; SHOW STATUS LIKE 'Innodb_buffer_pool_load_status'; -- Dump file content example (tablespace_id:page_number) -- 0,7 -- 0,8 -- 12,0 -- 12,1

The dump file only saves (tablespace_id, page_number) pairs, not actual data. At startup InnoDB pre-reads pages from disk using this list. This is why even a 128GB Buffer Pool dump file is typically only a few MB.

2.5 Monitoring the Buffer Pool

-- Key metrics from SHOW ENGINE INNODB STATUS (BUFFER POOL AND MEMORY section) ---------------------- BUFFER POOL AND MEMORY ---------------------- Total large memory allocated 137363456 -- Total memory allocated to Buffer Pool Dictionary memory allocated 622013 -- Memory for data dictionary Buffer pool size 8192 -- Total pages (8192 * 16KB = 128MB) Free buffers 1024 -- Free pages not yet used Database pages 7137 -- Pages containing data Old database pages 2614 -- Pages in Old sublist Modified db pages 123 -- Dirty pages awaiting flush Pending reads 0 -- Pages waiting to be read from disk Pending writes: LRU 0, flush list 0, single page 0 Pages made young 18623, not young 2045921 -- Promotions vs blocks 3.47 youngs/s, 412.05 non-youngs/s -- Rate of promotions Pages read 16842, created 5765, written 29412 0.00 reads/s, 0.24 creates/s, 1.86 writes/s Buffer pool hit rate 1000 / 1000 -- 1000/1000 = 100% hit rate (ideal) Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
Production Metrics to Watch:
1. Buffer pool hit rate: Should stay at 999/1000 or above. Below this suggests the pool is too small or being polluted by large queries.
2. Free buffers: If persistently 0, the pool is full and every new page requires eviction.
3. Modified db pages: If continuously growing, flushing cannot keep up with writes.
4. Pages made young / not young: "not young" far exceeding "young" means innodb_old_blocks_time protection is working.

Source ref: storage/innobase/buf/buf0buf.cc โ€” buf_page_get_gen(), buf_page_make_young()

2.6 Buffer Pool Size Calculator

2.7 Buffer Pool Internal Lists

The Buffer Pool maintains three critical internal lists. Understanding them is essential for troubleshooting performance issues:

Buffer Pool Internal Lists: โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ 1. FREE LIST โ”‚ โ”‚ Contains pages that are completely free (never used or โ”‚ โ”‚ evicted). When InnoDB needs a new page, it takes one from โ”‚ โ”‚ here first. When this is empty, LRU eviction kicks in. โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚Freeโ”‚โ†’ โ”‚Freeโ”‚โ†’ โ”‚Freeโ”‚โ†’ โ”‚Freeโ”‚โ†’ NULL โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ”‚ 2. LRU LIST โ”‚ โ”‚ Contains all pages currently in use (clean + dirty). โ”‚ โ”‚ Organized as Young (5/8) + Old (3/8). โ”‚ โ”‚ Used for page eviction when FREE list is empty. โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚Hot โ”‚โ†’ โ”‚Hot โ”‚ โ†’โ†’ โ”‚Old โ”‚โ†’ โ”‚Old โ”‚โ†’ โ”‚Old โ”‚โ†’ evict โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ Young (MRU) โ”€โ”€โ”€โ”€โ”€โ”€โ”‚โ”€โ”€โ”€โ”€โ”€โ”€โ”€ Old (LRU) โ”€โ”€โ”€โ”€โ”€โ”€โ–บ โ”‚ โ”‚ โ”‚ โ”‚ 3. FLUSH LIST โ”‚ โ”‚ Contains only DIRTY pages, ordered by oldest_modification โ”‚ โ”‚ LSN (oldest first). Page cleaner flushes from here to โ”‚ โ”‚ advance the checkpoint. โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚LSN โ”‚โ†’ โ”‚LSN โ”‚โ†’ โ”‚LSN โ”‚โ†’ โ”‚LSN โ”‚โ†’ NULL โ”‚ โ”‚ โ”‚100 โ”‚ โ”‚200 โ”‚ โ”‚300 โ”‚ โ”‚400 โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ oldest โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บ newest โ”‚ โ”‚ โ”‚ โ”‚ A dirty page appears in BOTH the LRU list AND the Flush list. โ”‚ โ”‚ A clean page appears ONLY in the LRU list. โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

2.8 Page Types

Pages cached in the Buffer Pool are not just data pages. InnoDB defines multiple page types:

FIL_PAGE_TYPE Value Description
FIL_PAGE_INDEX0x45BFB+Tree leaf node (stores row data or secondary index leaf)
FIL_PAGE_RTREE0x45BER-Tree index page (spatial index)
FIL_PAGE_SDI0x45BDSerialized Dictionary Information page (MySQL 8.0)
FIL_PAGE_UNDO_LOG0x0002Undo Log page
FIL_PAGE_INODE0x0003Segment information node page
FIL_PAGE_IBUF_FREE_LIST0x0004Change Buffer free list page
FIL_PAGE_TYPE_ALLOCATED0x0000Allocated but not yet used page
FIL_PAGE_TYPE_SYS0x0006System page (e.g., transaction system page)
FIL_PAGE_TYPE_TRX_SYS0x0007Transaction system data page
FIL_PAGE_TYPE_FSP_HDR0x0008File Space Header page
FIL_PAGE_TYPE_XDES0x0009Extent Descriptor page
FIL_PAGE_TYPE_BLOB0x000AOverflow page (stores large columns)

Source ref: storage/innobase/include/fil0fil.h โ€” enum page_type_t

2.9 Buffer Pool Read-Ahead

InnoDB has two read-ahead mechanisms to reduce I/O wait time:

Linear Read-Ahead

When the number of sequentially accessed pages in an extent (64 contiguous pages = 1MB) exceeds innodb_read_ahead_threshold (default 56), InnoDB asynchronously pre-reads all pages in the next extent. Suits sequential access patterns like full table scans.

Random Read-Ahead

When more than 13 pages from an extent are in the Buffer Pool's Young sublist, InnoDB pre-reads the remaining pages in that extent. Disabled by default (innodb_random_read_ahead=OFF), as it rarely helps in most workloads.

-- Monitor read-ahead effectiveness: SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_ahead'; -- Pages pre-read SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_ahead_evicted'; -- Pre-read but never accessed -- If read_ahead_evicted / read_ahead is high (>50%), -- read-ahead is wasting I/O and should be tuned down. -- Increase innodb_read_ahead_threshold (up to 63) to be more conservative.

2.10 Buffer Pool Mutex & Latch Contention

Under high concurrency (>1000 QPS), Buffer Pool internal mutexes can become bottlenecks. Key mutexes include:

Mutex / Latch Protected Resource Contention Symptom Solution
buf_pool->mutex Buffer Pool instance global lock Broad contention on high concurrent R/W Increase innodb_buffer_pool_instances
buf_pool->LRU_list_mutex LRU list Contention during page eviction/promotion Increase instances + optimize queries to reduce eviction
buf_pool->flush_list_mutex Flush list Contention under high write load Increase instances + page_cleaners
hash_lock (page hash) Page hash table (lookup if page is in pool) Contention under extreme concurrent point lookups Partitioned in MySQL 8.0, usually no action needed
-- Diagnose Buffer Pool mutex waits: SELECT EVENT_NAME, COUNT_STAR, SUM_TIMER_WAIT/1000000000 AS total_wait_ms, AVG_TIMER_WAIT/1000000 AS avg_wait_us FROM performance_schema.events_waits_summary_global_by_event_name WHERE EVENT_NAME LIKE 'wait/synch/%/innodb/buf_pool%' OR EVENT_NAME LIKE 'wait/synch/%/innodb/hash_table%' ORDER BY SUM_TIMER_WAIT DESC;

3. Redo Log (WAL) Mechanism

3.1 Write-Ahead Logging Principle

WAL (Write-Ahead Logging) is InnoDB's core mechanism for transaction durability (the D in ACID). The principle is simple: before modifying a data page, first record the modification operation in the Redo Log. If the system crashes, all committed transaction modifications can be recovered by replaying the Redo Log.

WAL Write Flow: Transaction: UPDATE users SET age=30 WHERE id=1 Step 1: Write redo record to Log Buffer (in memory) โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Log Buffer (innodb_log_buffer_size = 64MB) โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”โ”‚ โ”‚ โ”‚ [type=MLOG_REC_UPDATE_IN_PLACE] โ”‚โ”‚ โ”‚ โ”‚ [space=12, page=45, offset=200] โ”‚โ”‚ โ”‚ โ”‚ [old: age=25 โ†’ new: age=30] โ”‚โ”‚ โ”‚ โ”‚ [LSN: 1234567890] โ”‚โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ Step 2: Modify page in Buffer Pool (in memory, page becomes "dirty") Step 3: On COMMIT โ†’ flush Log Buffer to disk redo log files โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” fsync โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Log Buffer โ”‚ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ถ โ”‚ ib_logfile0/1 โ”‚ โ”‚ (memory) โ”‚ โ”‚ (disk, sequential)โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ Step 4: Dirty page will be flushed to .ibd later (async, background)

3.2 Redo Log File Structure

Before MySQL 8.0.30, the Redo Log consisted of a fixed number of files (default 2), written in a circular fashion:

Pre-8.0.30: Fixed redo log files โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ ib_logfile0 ib_logfile1 โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–‘โ–‘โ–‘โ–‘ โ”‚ โ”€โ”€โ”€โ–ถ โ”‚ โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘ โ”‚ โ”€โ”€โ”€โ” โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ–ฒ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ circular โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ”‚ โ–ˆโ–ˆโ–ˆโ–ˆ = written redo records โ”‚ โ”‚ โ–‘โ–‘โ–‘โ–‘ = available space โ”‚ โ”‚ โ”‚ โ”‚ write_lsn โ”€โ”€โ–ถ current write position โ”‚ โ”‚ checkpoint_lsn โ”€โ”€โ–ถ all pages before this are on disk โ”‚ โ”‚ Space between checkpoint_lsn and write_lsn = "active" โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ Post-8.0.30: Dynamic redo log (in #innodb_redo/ directory) โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ #innodb_redo/ โ”‚ โ”‚ โ”œโ”€โ”€ #ib_redo0 (active, being written) โ”‚ โ”‚ โ”œโ”€โ”€ #ib_redo1 (active, being written) โ”‚ โ”‚ โ”œโ”€โ”€ #ib_redo2_tmp (spare, pre-allocated) โ”‚ โ”‚ โ””โ”€โ”€ #ib_redo3_tmp (spare, pre-allocated) โ”‚ โ”‚ โ”‚ โ”‚ Controlled by: innodb_redo_log_capacity (default 100MB)โ”‚ โ”‚ Files auto-created/removed as needed โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

3.3 LSN (Log Sequence Number)

LSN is one of InnoDB's most important internal concepts. It is a monotonically increasing 64-bit integer representing the total bytes written to the Redo Log (counted from InnoDB initialization).

LSN Type Meaning Source
Log sequence number Current latest LSN (position written to Log Buffer) SHOW ENGINE INNODB STATUS
Log flushed up to LSN flushed to disk redo log files SHOW ENGINE INNODB STATUS
Pages flushed up to Dirty pages flushed up to this LSN (oldest dirty page LSN in flush list) SHOW ENGINE INNODB STATUS
Last checkpoint at LSN of last checkpoint. Crash recovery replays from here. SHOW ENGINE INNODB STATUS
LSN relationships (must always hold): Last checkpoint at โ‰ค Pages flushed up to โ‰ค Log flushed up to โ‰ค Log sequence number Example from SHOW ENGINE INNODB STATUS: --- LOG --- Log sequence number 3456789012 Log buffer assigned up to 3456789012 Log buffer completed up to 3456789012 Log written up to 3456789012 Log flushed up to 3456789012 Added dirty pages up to 3456789012 Pages flushed up to 3456780000 Last checkpoint at 3456780000 โ–ฒ Gap = 9012 bytes of redo not yet checkpointed Crash recovery would need to replay these 9012 bytes

3.4 Checkpoint Mechanism

A checkpoint flushes dirty pages from the Buffer Pool to disk and advances the checkpoint LSN. It serves two purposes:

  1. Free redo log space: redo log space before the checkpoint LSN can be reused
  2. Shorten crash recovery time: recovery only needs to replay redo after the checkpoint LSN
Sharp Checkpoint

Flushes all dirty pages at once. Only used during database shutdown (innodb_fast_shutdown=0). Blocks all operations during flush โ€” never used at runtime in production.

Fuzzy Checkpoint

Flushes dirty pages incrementally and asynchronously. InnoDB uses Fuzzy Checkpoint at runtime, including LRU flushing, Flush List flushing, and Async/Sync flushing (when redo log nears capacity).

3.5 innodb_flush_log_at_trx_commit: Durability vs Performance

This is arguably one of InnoDB's most important parameters. It controls the redo log flush strategy at transaction commit:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Value โ”‚ Behavior โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ โ”‚ COMMIT โ†’ write to Log Buffer โ”‚ โ”‚ 0 โ”‚ Background thread flushes to OS cache + fsync every 1 sec โ”‚ โ”‚ โ”‚ โš  Up to 1 second of committed transactions lost on crash โ”‚ โ”‚ โ”‚ โ˜… Fastest โ€” no sync on commit โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ โ”‚ COMMIT โ†’ write to OS cache โ†’ fsync to disk โ”‚ โ”‚ 1 โ”‚ Every single commit triggers an fsync โ”‚ โ”‚ โ”‚ โœ“ ZERO data loss on crash (ACID compliant) โ”‚ โ”‚ โ”‚ โ˜… Slowest โ€” fsync on every commit โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ โ”‚ COMMIT โ†’ write to OS cache (no fsync) โ”‚ โ”‚ 2 โ”‚ Background thread fsyncs every 1 second โ”‚ โ”‚ โ”‚ โš  Data safe on MySQL crash, lost on OS/power crash โ”‚ โ”‚ โ”‚ โ˜… Medium โ€” good for replicas โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
Value Performance Data Safety Use Case
1 Slowest (fsync per commit) Safest (zero data loss) Primary, financial systems, any scenario that cannot lose data
2 Faster (OS crash may lose 1 sec) MySQL-crash safe, OS/power-failure unsafe Replicas, bulk imports, scenarios tolerating minor loss
0 Fastest (no fsync) Any crash may lose 1 second Testing, non-critical data
Production Incident Case: An e-commerce company set innodb_flush_log_at_trx_commit=0 on a replica. One day a UPS failure caused the replica to power-cycle. After restart, the replica's GTID position was ~3 seconds behind what the primary recorded, breaking replication. The replica had to be rebuilt. Lesson: even for replicas, use 2 instead of 0 when replication consistency matters.

3.6 Redo Log Size Tuning

A redo log that is too small causes frequent checkpoints (because the redo space must wait for dirty pages to flush before reusing space), while too large extends crash recovery time.

Parameter Version Default Recommended
innodb_log_file_size < 8.0.30 48MB 1GB-4GB (high write workloads)
innodb_log_files_in_group < 8.0.30 2 2
innodb_redo_log_capacity ≥ 8.0.30 100MB 2GB-8GB (replaces the two above)
-- How to estimate proper redo log size: -- 1. During peak load, check LSN growth over 1 hour: SHOW ENGINE INNODB STATUS; -- note "Log sequence number" = LSN_start -- wait 1 hour SHOW ENGINE INNODB STATUS; -- note "Log sequence number" = LSN_end -- redo_per_hour = LSN_end - LSN_start -- Recommended total redo size = redo_per_hour (covers 1 hour of writes) -- This means checkpoint must only happen once per hour -- Example: LSN grew by 2GB in 1 hour โ†’ set innodb_redo_log_capacity = 2G -- 2. Alternative: check checkpoint age SELECT (SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME='Innodb_redo_log_current_lsn') - (SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME='Innodb_redo_log_checkpoint_lsn') AS checkpoint_age; -- If checkpoint_age frequently exceeds 75% of redo log capacity, -- you need larger redo logs.

Source ref: storage/innobase/log/log0log.cc, storage/innobase/log/log0chkp.cc

3.7 Group Commit & Binlog Coordination

When multiple transactions commit simultaneously, InnoDB uses Group Commit optimization, merging multiple transactions' redo log writes into a single fsync. With binlog enabled, InnoDB uses two-phase commit (2PC) to ensure redo log and binlog consistency:

Two-Phase Commit with Binlog: Phase 1: PREPARE โ”œโ”€ Write redo log with PREPARE state โ””โ”€ fsync redo log Phase 2: COMMIT โ”œโ”€ Write transaction to binlog โ”œโ”€ fsync binlog (group commit: multiple txns in one fsync) โ””โ”€ Write redo log with COMMIT state (no fsync neededโ€”binlog is the authority) Crash Recovery Logic: โ”œโ”€ If redo=PREPARE + binlog=YES โ†’ COMMIT (transaction was successful) โ”œโ”€ If redo=PREPARE + binlog=NO โ†’ ROLLBACK (transaction incomplete) โ””โ”€ If redo=COMMIT โ†’ already committed

3.8 Redo Log Record Types

The Redo Log records multiple types of page modification operations. Each record contains a type identifier, tablespace ID, page number, and specific modification content. Major types include:

Record Type Description Triggered By
MLOG_1BYTE / 2BYTE / 4BYTE / 8BYTE Write 1/2/4/8 bytes to specific page offset Page header field updates, etc.
MLOG_WRITE_STRING Write variable-length byte string to page Data changes
MLOG_REC_INSERT Insert a compact-format record INSERT
MLOG_REC_UPDATE_IN_PLACE In-place record update (same length) UPDATE (when field length unchanged)
MLOG_REC_DELETE Delete record (mark as deleted) DELETE
MLOG_PAGE_CREATE Create a new index page Page split, etc.
MLOG_UNDO_INSERT Insert an undo record into undo page DML operations generating undo
MLOG_COMP_PAGE_CREATE Create compact-format page Tablespace page allocation
MLOG_MULTI_REC_END Mini-transaction end marker At end of each mtr

Source ref: storage/innobase/include/mtr0types.h

3.9 Mini-Transaction (mtr) Concept

A Mini-Transaction is InnoDB's internal atomic operation unit, distinct from user-level transactions. A user transaction contains multiple mtrs, and each mtr's writes to the redo log are atomic. For example, a B+Tree page split may involve modifying 3 pages โ€” these 3 modifications must be written to the redo log as a unit: all or nothing.

User Transaction vs Mini-Transaction: BEGIN; -- User transaction starts INSERT INTO t VALUES (1, 'a'); -- mtr1: modify leaf page + undo page INSERT INTO t VALUES (2, 'b'); -- mtr2: modify leaf page + undo page INSERT INTO t VALUES (3, 'c'); -- mtr3: page split! modifies 3 pages -- 3a: allocate new page -- 3b: move half records to new page -- 3c: update parent page pointer -- These 3 writes are ONE mtr COMMIT; -- User transaction commits In the redo log: [mtr1 records][MLOG_MULTI_REC_END] [mtr2 records][MLOG_MULTI_REC_END] [mtr3 records][MLOG_MULTI_REC_END] ... (commit record)

3.10 Crash Recovery Process

When MySQL restarts after an abnormal shutdown, InnoDB crash recovery proceeds as follows:

Crash Recovery Phases: Phase 1: REDO (Forward Recovery) โ”œโ”€ Read redo log from last checkpoint LSN to end โ”œโ”€ Parse redo records and apply them to data pages โ”‚ โ”œโ”€ If page's LSN >= record's LSN โ†’ skip (already applied) โ”‚ โ””โ”€ If page's LSN < record's LSN โ†’ apply the redo record โ”œโ”€ This restores ALL modifications (both committed and uncommitted) โ””โ”€ After this phase: database pages are in the state at crash time Phase 2: UNDO (Backward Recovery) โ”œโ”€ Scan the undo log for active (uncommitted) transactions at crash time โ”œโ”€ Roll back each uncommitted transaction using undo records โ”œโ”€ This ensures atomicity: uncommitted changes are reversed โ””โ”€ After this phase: only committed data remains Phase 3: Purge โ”œโ”€ Clean up undo records from rolled-back transactions โ””โ”€ Background purge threads handle this after recovery completes Recovery time depends on: - Amount of redo to replay (checkpoint_age at crash) - Number of uncommitted transactions to roll back - I/O speed of the storage device
Recovery Time Estimation: On SSD storage, redo replay speed is ~100-200 MB/s. If checkpoint_age is 4GB, recovery takes approximately 20-40 seconds (redo phase only). The undo phase depends on the size of transactions to roll back. Large uncommitted transactions (e.g., a 2-hour UPDATE ... WHERE) can take longer to roll back than the forward recovery, because undo reverses operations row by row.

4. Undo Log & MVCC

4.1 Undo Log's Dual Mission

The Undo Log serves two critical roles in InnoDB:

  1. Transaction Rollback: When a transaction executes ROLLBACK, the old values recorded in the Undo Log restore data to the state before the transaction began.
  2. MVCC: When other transactions need to read a historical version of a row, the version chain in the Undo Log locates the version visible to that transaction.

4.2 Undo Log Storage Structure

Undo Tablespace Structure: โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Undo Tablespace (undo_001, undo_002, ...) โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ Rollback Segment 1 (max 128 per tablespace) โ”‚ โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ Undo Slot 1 โ”€โ”€โ–ถ Undo Log 1 (one per txn) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ Undo Slot 2 โ”€โ”€โ–ถ Undo Log 2 โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ ... โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ Undo Slot 1024 โ”€โ”€โ–ถ Undo Log 1024 โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ โ”‚ โ”‚ Rollback Segment 2 โ”‚ โ”‚ โ”‚ โ”‚ ... โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ Max concurrent transactions = undo_tablespaces ร— 128 rseg ร— 1024 slots = 2 ร— 128 ร— 1024 = 262,144 (default)

Each transaction needs 1 to 2 Undo Logs depending on operation type:

  • INSERT Undo Log: Only records the primary key of newly inserted rows; rollback deletes them. Can be released immediately after commit.
  • UPDATE Undo Log: Records old values of modified/deleted rows. Cannot be released immediately after commit โ€” must wait until all transactions that might read the old version finish, then purge threads clean up.

4.3 MVCC Version Chain

Every row in InnoDB has two hidden columns:

  • DB_TRX_ID (6 bytes): Transaction ID of the last transaction that modified this row
  • DB_ROLL_PTR (7 bytes): Pointer to the old version of this row in the Undo Log
MVCC Version Chain Example: Current row in Buffer Pool / .ibd file: โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ id=1, name="Charlie", age=30 โ”‚ โ”‚ DB_TRX_ID = 300 โ”‚ โ”‚ DB_ROLL_PTR โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ–ผ Undo Log โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ name="Bob", age=28 โ”‚ โ”‚ DB_TRX_ID = 200 โ”‚ โ”‚ DB_ROLL_PTR โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ–ผ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ name="Alice", age=25 โ”‚ โ”‚ DB_TRX_ID = 100 โ”‚ โ”‚ DB_ROLL_PTR = NULL โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ Transaction 150 (RR isolation) started before trx 200 and 300: โ†’ Sees row version: name="Alice", age=25 (trx 100, committed before 150) Transaction 250 (RR isolation) started after trx 200 committed: โ†’ Sees row version: name="Bob", age=28 (trx 200, committed before 250) Transaction 350 (any isolation) started after trx 300 committed: โ†’ Sees current row: name="Charlie", age=30

4.4 ReadView in Detail

ReadView is the core data structure for InnoDB's MVCC consistent reads. When a transaction performs a consistent read (non-locking read), InnoDB creates a ReadView containing:

Field Meaning
m_creator_trx_id Transaction ID that created this ReadView
m_low_limit_id Next transaction ID to be assigned when ReadView was created (i.e., max_trx_id at that point). Transactions with ID >= this are definitely invisible.
m_up_limit_id Minimum transaction ID in the active list when ReadView was created. Transactions with ID < this are definitely committed and visible.
m_ids List of all active (uncommitted) transaction IDs when ReadView was created.
ReadView Visibility Algorithm (simplified): function is_visible(row_trx_id, read_view): if row_trx_id == read_view.m_creator_trx_id: return true // I modified this row myself if row_trx_id < read_view.m_up_limit_id: return true // Transaction committed before ReadView was created if row_trx_id >= read_view.m_low_limit_id: return false // Transaction started after ReadView was created // m_up_limit_id <= row_trx_id < m_low_limit_id if row_trx_id in read_view.m_ids: return false // Transaction was active (uncommitted) when ReadView was created else: return true // Transaction committed before ReadView was created // (its ID is between up and low, but not in active list) If not visible โ†’ follow DB_ROLL_PTR to undo log, check previous version Repeat until a visible version is found or reach NULL (row didn't exist)

Source ref: storage/innobase/include/read0types.h โ€” class ReadView, storage/innobase/row/row0sel.cc

4.5 RR vs RC: When the ReadView Is Created

REPEATABLE READ (RR)

ReadView is created at the first SELECT in the transaction and reused throughout the transaction's lifetime. Thus reading the same row multiple times within the transaction always returns the same value โ€” even if other transactions have committed modifications.

READ COMMITTED (RC)

ReadView is recreated on every SELECT. Thus two reads of the same row within a transaction may return different values โ€” if another transaction committed a modification between the two reads (Non-Repeatable Read).

Example: RR vs RC difference Initial: users table has (id=1, name="Alice") Timeline: T1 (RR): BEGIN T2: BEGIN T1: SELECT name FROM users WHERE id=1 โ†’ "Alice" T2: UPDATE users SET name="Bob" WHERE id=1 T2: COMMIT T1: SELECT name FROM users WHERE id=1 โ†’ "Alice" (same ReadView!) T3 (RC): BEGIN T4: BEGIN T3: SELECT name FROM users WHERE id=1 โ†’ "Alice" T4: UPDATE users SET name="Bob" WHERE id=1 T4: COMMIT T3: SELECT name FROM users WHERE id=1 โ†’ "Bob" (new ReadView!)

4.6 Dangers of Long Transactions

Production Incident: Undo Log Bloat Causes Disk Exhaustion
A SaaS application had an analytics query transaction that ran for 8 hours without committing. During that time, UPDATE Undo Logs from other transactions could not be purged (because that long transaction's ReadView still needed old versions), causing the undo tablespace to bloat from 200MB to 150GB. Disk space was exhausted and the database crashed.

Prevention:
1. Set innodb_undo_log_truncate=ON (default ON in MySQL 8.0)
2. Monitor information_schema.INNODB_TRX for long-uncommitted transactions
3. Set kill_idle_transaction (Percona) or periodically check timeout transactions
4. Route analytics queries to replicas

4.7 Undo Configuration

Parameter Default Description
innodb_undo_tablespaces 2 Number of separate undo tablespaces. MySQL 8.0 minimum is 2.
innodb_undo_log_truncate ON (8.0) Automatically reclaim undo tablespaces exceeding innodb_max_undo_log_size.
innodb_max_undo_log_size 1GB Size threshold that triggers undo tablespace truncation.
innodb_rollback_segments 128 Number of rollback segments per undo tablespace.
innodb_purge_threads 4 Background threads for purge operations. Increase to 8 for high-write workloads.

Source ref: storage/innobase/trx/trx0trx.cc, storage/innobase/trx/trx0purge.cc

4.8 Purge Thread Internals

Purge is InnoDB's background cleanup operation, responsible for two tasks:

  1. Physically remove rows marked as "delete-marked" (DELETE only sets a logical deletion flag; purge actually removes the row from indexes)
  2. Clean up UPDATE undo log records no longer needed by any active transaction
Purge Process: 1. Purge coordinator thread (1 thread) reads the oldest ReadView across all active transactions 2. Any undo record with trx_id < oldest active ReadView's m_up_limit_id is safe to purge 3. For DELETE operations: โ”œโ”€ Purge removes the delete-marked record from the clustered index โ””โ”€ Purge removes corresponding entries from all secondary indexes 4. For UPDATE operations on indexed columns: โ””โ”€ Purge removes old secondary index entries (the "ghost" entries) 5. After purge: undo log pages can be reused Monitoring purge lag: SHOW ENGINE INNODB STATUS โ†’ "History list length" This = number of undo records NOT yet purged If growing โ†’ purge can't keep up โ†’ increase innodb_purge_threads Or a long transaction is holding back the oldest ReadView

4.9 MVCC and Secondary Indexes

MVCC implementation differs fundamentally between clustered indexes and secondary indexes:

Clustered Index

Each row has DB_TRX_ID and DB_ROLL_PTR, enabling direct version visibility checks and version chain traversal.

Secondary Index

Secondary index entries do not have DB_TRX_ID and DB_ROLL_PTR. InnoDB uses PAGE_MAX_TRX_ID in the page header (the maximum trx_id on that page) for a quick check: if PAGE_MAX_TRX_ID < ReadView's m_up_limit_id, all records on the page are visible without returning to the clustered index. Otherwise, it must go back to the clustered index to check each row's version visibility.

Performance Impact: This explains why in high-concurrency write scenarios, the advantage of covering indexes may be diminished. If a secondary index page's PAGE_MAX_TRX_ID is recent (due to active write transactions), even if all query columns are in the secondary index, InnoDB may still need to go back to the clustered index for version visibility checks. Long transactions exacerbate this problem.

5. Doublewrite Buffer

5.1 The Partial Page Write Problem (Torn Pages)

InnoDB's page size is 16KB, but the OS atomic write is typically 4KB (one filesystem block). A single 16KB page write requires 4 separate 4KB I/O operations. If a power failure or system crash occurs between these 4 I/O operations, only partial data may have been written โ€” this is a "torn page."

Why Can't the Redo Log Fix a Torn Page? The Redo Log records logical modifications to pages (e.g., "at offset 200, change value from 25 to 30"), not complete page contents. If the page itself is corrupted (only 8KB was correctly written), replaying redo operations on a corrupted page produces more errors. The Redo Log assumes the base page state is correct.

5.2 How Doublewrite Works

Doublewrite Buffer Write Flow: Step 1: Collect dirty pages to flush (batch, e.g., 64 pages) Step 2: Write pages sequentially to Doublewrite Buffer (on disk) โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Doublewrite Buffer (2 extents = 2MB total) โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ P1 โ”‚ P2 โ”‚ P3 โ”‚ ... โ”‚ P64 โ”‚ (unused) โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ Step 3: fsync the Doublewrite Buffer Step 4: Write each page to its actual location in .ibd files P1 โ†’ tablespace A, page 45 P2 โ†’ tablespace A, page 892 P3 โ†’ tablespace B, page 17 ... (these are random I/O) Step 5: fsync the .ibd files Crash Recovery: For each page in the .ibd file: 1. Verify page checksum 2. If checksum FAILS โ†’ page is torn โ†’ copy correct version from Doublewrite Buffer 3. Then apply redo log records on top of the good page

5.3 Performance Impact & When to Disable

Aspect Details
Write Amplification Each dirty page is written twice: once to the Doublewrite Buffer, once to the actual location. Theoretically 2x write I/O, but in practice ~5-10% overhead (because Doublewrite writes are sequential).
Safe to Disable When 1. Filesystem supports atomic writes (ZFS, RAID controller with battery-backed cache)
2. Some cloud databases (e.g., Aurora) implement storage-layer atomic writes
3. innodb_page_size set to 4KB (equal to filesystem block size โ€” no torn-page risk)
-- MySQL 8.0.20+: separate doublewrite files -- Old: doublewrite stored in ibdata1 -- New: separate .dblwr files in datadir -- Disable doublewrite (only if your storage guarantees atomic writes!): SET GLOBAL innodb_doublewrite = OFF; -- MySQL 8.0.30+: fine-grained control -- innodb_doublewrite = ON | OFF | DETECT_AND_RECOVER | DETECT_ONLY -- DETECT_AND_RECOVER (default): full protection -- DETECT_ONLY: detect torn pages but don't fix (useful with ZFS)

6. Change Buffer

6.1 What Problem Does the Change Buffer Solve?

When an INSERT/UPDATE/DELETE modifies a secondary index page that is not in the Buffer Pool, InnoDB faces a choice:

Without Change Buffer

Immediately read that secondary index page from disk into the Buffer Pool, modify it, mark as dirty. Problem: this is a random read I/O. For thousands of INSERTs per second, this generates massive random I/O.

With Change Buffer

Cache the modification in the Change Buffer without reading from disk. When the page is eventually read into the Buffer Pool for other queries, merge the cached modifications. This avoids massive random read I/O.

Important Limitation: The Change Buffer only works for secondary indexes (non-unique). It does NOT apply to:
1. Clustered index (primary key) โ€” the primary key page must be read on INSERT to determine insert position
2. Unique indexes โ€” the index page must be read on INSERT to check the uniqueness constraint

6.2 Merge Triggers

  • The secondary index page is read into the Buffer Pool by a SELECT query
  • Background merge thread executes periodically
  • Buffer Pool space is insufficient and Change Buffer pages must be evicted
  • At transaction commit (in certain cases)
  • During normal server shutdown (full merge)

6.3 Configuration

Parameter Default Description
innodb_change_buffering all Which operations to buffer: none, inserts, deletes, changes(ins+del), purges, all
innodb_change_buffer_max_size 25 Maximum percentage of Buffer Pool for Change Buffer. Range: 0-50.

6.4 When It Helps vs When It Hurts

Effective Scenarios
  • Heavy random INSERTs (log tables, secondary indexes on UUID PK tables)
  • Secondary index pages not frequently read by SELECTs (write-heavy, read-light)
  • Disk I/O is the bottleneck (HDD benefits more than SSD)
Ineffective/Harmful Scenarios
  • All indexes are unique (Change Buffer has zero effect)
  • Read immediately after write (merge triggers immediately, no different from direct read)
  • Buffer Pool large enough to cache all index pages (no need for Change Buffer optimization)

7. Adaptive Hash Index (AHI)

7.1 How It Works

All user data and indexes in InnoDB are stored in B+Trees. A B+Tree lookup traverses from the root node downward level by level. For a 3-4 level B+Tree, each lookup requires 3-4 page accesses (though upper-level pages are usually in the Buffer Pool, there is still CPU overhead).

The Adaptive Hash Index (AHI) is an in-memory hash table that InnoDB automatically builds on frequently accessed B+Tree pages. When InnoDB detects that a certain index prefix is repeatedly searched with the same pattern, it automatically builds a hash index, reducing lookup complexity from O(log n) to O(1).

Without AHI: query โ†’ B+Tree root โ†’ level 1 โ†’ level 2 โ†’ leaf page โ†’ row 3-4 page accesses, CPU-intensive comparison at each level With AHI: query โ†’ hash lookup โ†’ directly to leaf page + record offset โ†’ row 1 hash probe, O(1)

7.2 Monitoring AHI

-- From SHOW ENGINE INNODB STATUS: ------------------------------------- INSERT BUFFER AND ADAPTIVE HASH INDEX ------------------------------------- Hash table size 34679, node heap has 1 buffer(s) Hash table size 34679, node heap has 0 buffer(s) ... 2.84 hash searches/s, 3.27 non-hash searches/s โ† AHI hit ratio -- AHI hit rate = hash searches / (hash searches + non-hash searches) -- In this example: 2.84 / (2.84 + 3.27) = 46.5% -- If hit rate is consistently below 3-5%, consider disabling AHI

7.3 AHI Contention Issues

AHI is protected by rw-locks. Under high concurrency, many threads simultaneously accessing AHI can cause rw-lock contention, manifesting as high CPU usage with throughput actually decreasing.

Parameter Default Description
innodb_adaptive_hash_index ON Whether to enable AHI. Can be changed dynamically.
innodb_adaptive_hash_index_parts 8 Number of AHI partitions. Increasing reduces rw-lock contention. Range: 1-512.

7.4 When to Disable AHI

  • AHI hit rate consistently below 5% (wasting memory and CPU)
  • Seeing heavy btr_search rw-lock waits in the SEMAPHORES section of SHOW ENGINE INNODB STATUS
  • Workload is primarily range scans (AHI helps point lookups, not range scans)
  • Queries use many different index prefixes randomly (AHI cannot build effective hashes)

8. Tablespace Architecture

8.1 Tablespace Types Overview

InnoDB Tablespace Hierarchy: datadir/ โ”œโ”€โ”€ ibdata1 โ† System Tablespace โ”œโ”€โ”€ undo_001 โ† Undo Tablespace 1 โ”œโ”€โ”€ undo_002 โ† Undo Tablespace 2 โ”œโ”€โ”€ ibtmp1 โ† Temporary Tablespace โ”œโ”€โ”€ #innodb_temp/ โ† Session Temporary Tablespaces (8.0.15+) โ”‚ โ”œโ”€โ”€ temp_1.ibt โ”‚ โ””โ”€โ”€ temp_2.ibt โ”œโ”€โ”€ #innodb_redo/ โ† Redo Log files (8.0.30+) โ”‚ โ”œโ”€โ”€ #ib_redo0 โ”‚ โ””โ”€โ”€ #ib_redo1 โ”œโ”€โ”€ *.dblwr โ† Doublewrite files (8.0.20+) โ”œโ”€โ”€ mydb/ โ† Database directory โ”‚ โ”œโ”€โ”€ users.ibd โ† File-per-table tablespace โ”‚ โ”œโ”€โ”€ orders.ibd โ”‚ โ””โ”€โ”€ ... โ””โ”€โ”€ general_ts.ibd โ† General Tablespace (optional)

8.2 System Tablespace (ibdata1)

The system tablespace is InnoDB's original tablespace, storing:

  • Change Buffer persistent data
  • Doublewrite Buffer (before MySQL 8.0.20)
  • InnoDB internal data dictionary (before MySQL 8.0)
  • User table data if innodb_file_per_table=OFF
ibdata1 Only Grows, Never Shrinks The system tablespace file can only grow, never shrink automatically. Even after deleting massive amounts of data, ibdata1's size does not decrease. This is because InnoDB reuses freed space internally rather than returning it to the OS. Historically many DBAs encountered ibdata1 ballooning to tens of GB (usually because innodb_file_per_table=OFF). The only solution is logical backup + instance rebuild. Therefore, always keep innodb_file_per_table=ON (default ON since MySQL 5.6.6+).

8.3 File-per-Table Tablespaces

When innodb_file_per_table=ON, each InnoDB table's data and indexes are stored in its own .ibd file.

Advantages Disadvantages
DROP TABLE / TRUNCATE TABLE actually reclaims disk space Filesystem inode limits (relevant with tens of thousands of tables)
Can individually migrate tables with ALTER TABLE ... TABLESPACE Slight storage overhead for each .ibd file's metadata
Can place tables on different disks (via symlinks or DATA DIRECTORY) Finer fsync granularity (per file) may slightly impact write performance
Supports COMPRESSED row format

8.4 General Tablespaces

-- Create a general tablespace CREATE TABLESPACE ts_hot ADD DATAFILE '/fast_ssd/ts_hot.ibd' ENGINE = InnoDB; -- Create table in specific tablespace CREATE TABLE hot_data (id INT PRIMARY KEY, val VARCHAR(100)) TABLESPACE ts_hot; -- Move existing table to the tablespace ALTER TABLE orders TABLESPACE ts_hot;

General tablespaces allow organizing related tables in a single file (reducing file descriptor usage) or placing them on specific storage devices.

8.5 Temporary Tablespaces

MySQL 8.0 has two types of temporary tablespaces:

  • Global Temporary Tablespace (ibtmp1): Used for rollback segments of user-created temporary tables (CREATE TEMPORARY TABLE). Automatically rebuilt on restart.
  • Session Temporary Tablespaces (#innodb_temp/): Added in MySQL 8.0.15+, stores user-created temporary table data and internal temporary tables. One allocated per session; space is reclaimable after session ends.

8.6 Tablespace Internals: Segment, Extent, Page

InnoDB tablespaces organize disk space internally in a three-level hierarchy:

Tablespace Internal Hierarchy: โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Tablespace (.ibd file) โ”‚ โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€ Segment (logical grouping) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ Each index has 2 segments: โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€ Leaf node segment (data) โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€ Non-leaf node segment (internal nodes) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”€ Extent (physical allocation unit) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ 1 extent = 64 contiguous pages = 1 MB โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ (when page_size=16KB) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚Page 0โ”‚Page 1โ”‚Page 2โ”‚...โ”‚Page63โ”‚ (16KB ea) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ For small segments (< 32 pages): โ”‚ โ”‚ โ”‚ โ”‚ Pages are allocated individually from a โ”‚ โ”‚ โ”‚ โ”‚ "fragment" extent (shared with other segments) โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚ โ”‚ โ”‚ โ”‚ Page 0: FSP_HDR (file space header) โ€” tracks extents โ”‚ โ”‚ Page 1: IBUF_BITMAP โ€” change buffer bitmap โ”‚ โ”‚ Page 2: INODE โ€” segment metadata โ”‚ โ”‚ Page 3+: data pages (B+Tree root, etc.) โ”‚ โ”‚ โ”‚ โ”‚ Every 16384 pages (256MB): another XDES page โ”‚ โ”‚ for tracking the next group of extents โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
Why Allocate by Extent? InnoDB uses extents (1MB contiguous space) as the basic allocation unit rather than allocating each page individually. This is because disk sequential R/W performance far exceeds random R/W. When a table grows, allocating contiguous 1MB extents ensures data is physically contiguous on disk, benefiting full table scans and range scans with sequential I/O. For very small tables (< 32 pages = 512KB), page-by-page allocation avoids excessive space waste.

8.7 .ibd File Space Reclamation & Defragmentation

After massive DELETE operations, .ibd files do not automatically shrink. Deleted space is marked as reusable, but the file's physical size stays the same. If fragmentation is severe, reclaim space via:

-- Method 1: OPTIMIZE TABLE (rebuilds the table, reclaims space) OPTIMIZE TABLE large_table; -- For InnoDB, this is equivalent to ALTER TABLE ... ENGINE=InnoDB -- โš  Requires full table copy โ€” blocks writes during operation -- Use pt-online-schema-change or gh-ost for production -- Method 2: ALTER TABLE (null operation that triggers rebuild) ALTER TABLE large_table ENGINE=InnoDB; -- Check fragmentation level: SELECT TABLE_NAME, DATA_LENGTH, DATA_FREE, ROUND(DATA_FREE / DATA_LENGTH * 100, 1) AS frag_pct FROM information_schema.TABLES WHERE TABLE_SCHEMA = 'mydb' AND DATA_FREE > 0 ORDER BY DATA_FREE DESC; -- frag_pct > 20% suggests rebuild may be worthwhile

9. Row Formats

9.1 Comparison of Four Row Formats

Row Format Since Variable-Length Storage Large Column Overflow Recommended
REDUNDANT Earliest Full length info stored in-row First 768 bytes stored in-row No (legacy only)
COMPACT 5.0 Compact NULL flags and variable-length field lengths First 768 bytes stored in-row Usable but not preferred
DYNAMIC 5.7+ Same as COMPACT Only 20-byte pointer in-row, all data in overflow pages Yes (default, recommended)
COMPRESSED 5.5+ Same as DYNAMIC, but with page-level compression Same as DYNAMIC Special cases (cold data)

9.2 COMPACT/DYNAMIC Row Internal Structure

COMPACT/DYNAMIC Row Format: โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ Variable-Length Field Lengths (1-2 bytes each, reverse order) โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ NULL Flags (1 bit per nullable column, rounded to bytes) โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ Record Header (5 bytes) โ”‚ โ”‚ โ”œโ”€ info_flags (4 bits): delete-mark, min-rec, etc. โ”‚ โ”‚ โ”œโ”€ n_owned (4 bits): records owned by this slot in page dir โ”‚ โ”‚ โ”œโ”€ heap_no (13 bits): position in page heap โ”‚ โ”‚ โ”œโ”€ record_type (3 bits): 0=leaf, 1=non-leaf, 2=infimum, 3=sup โ”‚ โ”‚ โ””โ”€ next_record (16 bits): relative offset to next record โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ Hidden Columns: โ”‚ โ”‚ โ”œโ”€ DB_ROW_ID (6 bytes, only if no explicit PK) โ”‚ โ”‚ โ”œโ”€ DB_TRX_ID (6 bytes, transaction ID) โ”‚ โ”‚ โ””โ”€ DB_ROLL_PTR (7 bytes, undo log pointer) โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚ Column 1 data โ”‚ โ”‚ Column 2 data โ”‚ โ”‚ ... โ”‚ โ”‚ Column N data โ”‚ โ”‚ (If column > threshold โ†’ 20-byte pointer to overflow page) โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

9.3 Overflow Page Storage (Off-page)

When a large column in a row (TEXT, BLOB, very long VARCHAR) exceeds the page's storage capacity, InnoDB stores part or all of the data in overflow pages. COMPACT and DYNAMIC handle this differently:

COMPACT / REDUNDANT

First 768 bytes stored in-row, remainder in overflow pages. This means even a primary key lookup may load large TEXT data into Buffer Pool data pages, wasting memory.

DYNAMIC

When a column overflows, only a 20-byte pointer (space ID + page number + offset) is stored in-row, with all data in overflow pages. This allows data pages to hold more rows, dramatically improving Buffer Pool efficiency.

Best Practice: Always use ROW_FORMAT=DYNAMIC (MySQL 8.0 default). If your table has TEXT/BLOB columns and you need disk space savings (accepting ~10% CPU overhead), consider COMPRESSED. However, in MySQL 8.0, prefer transparent page compression (COMPRESSION='zstd') over the COMPRESSED row format.

9.4 Hidden Columns in InnoDB Rows

Every InnoDB row has up to three hidden columns, critical for MVCC and internal management:

Hidden Column Size Description
DB_ROW_ID 6 bytes Row ID, only added when the table has no explicit primary key and no non-NULL unique index. InnoDB generates a globally incrementing row_id as the internal primary key. This is why you should always define a primary key โ€” otherwise InnoDB uses a globally mutex-protected counter to generate row_id, becoming a bottleneck under high-concurrency INSERTs.
DB_TRX_ID 6 bytes Transaction ID of the last transaction that modified this row. Updated by INSERT/UPDATE/DELETE. The core basis for MVCC version visibility checks.
DB_ROLL_PTR 7 bytes Roll pointer, pointing to the previous version of this row in the undo log. Following DB_ROLL_PTR traverses the version chain to find any historical version. 7 bytes = 1 bit (INSERT/UPDATE flag) + 7 bits (rollback segment ID) + 32 bits (page number) + 16 bits (offset within page).

9.5 COMPRESSED Row Format Deep Dive

The COMPRESSED row format adds page-level zlib compression on top of DYNAMIC. Each 16KB page is compressed and stored in smaller space. However, InnoDB must maintain both compressed and uncompressed copies of pages in memory:

COMPRESSED page handling in Buffer Pool: Disk (compressed): Buffer Pool (memory): โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ 8KB on disk โ”‚ โ”€โ”€โ”€โ”€โ”€โ”€โ–ถ โ”‚ 8KB compressed page โ”‚ โ† always present โ”‚ (compressed) โ”‚ โ”‚ + โ”‚ โ”‚ โ”‚ โ”‚ 16KB uncompressed page โ”‚ โ† present when accessed โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ Problems with COMPRESSED: 1. Double memory usage (compressed + uncompressed copies) 2. CPU overhead for compress/decompress 3. Compression failures: if modified page can't fit in original compressed size โ†’ reorganize page โ†’ still can't fit โ†’ page split 4. Modification log: InnoDB maintains a tiny log of modifications on the compressed page to avoid recompressing on every change โ†’ when log is full โ†’ recompress โ†’ potential split
COMPRESSED vs Transparent Page Compression
MySQL 8.0 recommends transparent page compression (CREATE TABLE ... COMPRESSION='zstd') over the COMPRESSED row format:
1. Transparent page compression does not need two copies in memory
2. Supports more efficient compression algorithms (zstd instead of zlib)
3. Relies on filesystem sparse file (punch hole) support โ€” Linux ext4/xfs/btrfs all support it
4. No extra memory overhead on Buffer Pool
5. Downside: requires filesystem punch hole support; actual savings depend on filesystem block alignment

9.6 Row Size Limits

InnoDB has several important row size limits:

Limit Value Description
MySQL layer row size limit 65,535 bytes Sum of all column declared lengths (excluding BLOB/TEXT) cannot exceed this
InnoDB in-page row size limit ~8000 bytes Approximately half of page size (16KB/2), ensuring at least 2 rows per page (B+Tree requirement)
Overflow trigger threshold Dynamic When in-row data exceeds ~8000 bytes, the longest variable-length fields are moved to overflow pages. DYNAMIC format keeps only a 20-byte pointer.
Maximum columns 1017 Maximum 1017 columns per table (limit after including hidden and virtual columns)

10. Flush Mechanism

10.1 Why Flush?

InnoDB writes first go to the Buffer Pool (in-memory dirty pages), with transactions guaranteed by the Redo Log. But dirty pages must eventually be written back to disk for two reasons:

  1. Free Buffer Pool space for new pages
  2. Advance the checkpoint LSN to free redo log space and shorten crash recovery time

10.2 Two Flush Paths

LRU Flushing

When Buffer Pool free pages are insufficient, page cleaner threads scan from the LRU tail, flush dirty pages, and free them. The goal is to ensure enough free pages for new read requests.

Flush List Flushing

Page cleaner threads flush dirty pages in order of their oldest_modification LSN from the Flush List, advancing the checkpoint. The goal is to ensure the redo log does not fill up.

10.3 Key Flush Parameters

Parameter Default Description Recommended
innodb_io_capacity 200 Maximum I/O operations per second for InnoDB background tasks (flushing, merging change buffer). SSD: 2000-10000, HDD: 200-400
innodb_io_capacity_max 2000 I/O operations cap during emergency flush (redo log near full). 2-4x innodb_io_capacity
innodb_flush_neighbors 0 (8.0) Whether to flush neighboring dirty pages. Useful on HDD (reduces random I/O), should be OFF for SSD. SSD: 0, HDD: 1
innodb_adaptive_flushing ON Adaptive flushing: InnoDB dynamically adjusts flush rate based on redo log consumption speed. ON
innodb_page_cleaners 4 Number of page cleaner threads. Should equal or exceed innodb_buffer_pool_instances. = buffer_pool_instances
innodb_lru_scan_depth 1024 Depth of LRU scan per Buffer Pool instance per page cleaner iteration. Keep default or lower to 256-512 (if I/O-pressured)

10.4 Adaptive Flushing Algorithm

When innodb_adaptive_flushing=ON, InnoDB uses the following formula to calculate the ideal flush rate:

Adaptive Flushing Rate Calculation (simplified): checkpoint_age = current_lsn - last_checkpoint_lsn max_checkpoint_age = redo_log_capacity * 0.8125 (async threshold) if checkpoint_age < max_checkpoint_age * 0.75: flush_rate = low (normal pace) elif checkpoint_age < max_checkpoint_age: flush_rate = scaled between io_capacity and io_capacity_max (proportional to how close checkpoint_age is to max) else: flush_rate = io_capacity_max (EMERGENCY: aggressive flush) โš  User queries may stall waiting for redo space! The "adaptive" part: InnoDB also considers the rate of redo log generation (LSN growth speed) to predict future flush needs.
Production Incident: Flushing Storm A company used default innodb_io_capacity=200 and innodb_redo_log_capacity=100MB (both too small). During peak traffic, the redo log filled up rapidly. InnoDB triggered emergency synchronous flushing, blocking all user queries for 5-10 seconds. QPS dropped to near zero, then recovered. This pattern repeated every few minutes. Solution: increase redo log to 4GB, increase innodb_io_capacity to 4000 (SSD). Problem disappeared.

Source ref: storage/innobase/buf/buf0flu.cc โ€” buf_flush_page_cleaner_coordinator()

11. InnoDB Configuration Cheat Sheet

Below are the most critical InnoDB parameters with recommended settings (assuming 64GB RAM server, SSD storage):

Parameter Default Recommended (64G/SSD) Description
innodb_buffer_pool_size 128MB 44G ~70% total RAM
innodb_buffer_pool_instances 8 16 ~2.75GB per instance
innodb_redo_log_capacity 100MB 4G 8.0.30+; older: log_file_size*files
innodb_log_buffer_size 16MB 64MB Increase for large transactions or high-concurrency writes
innodb_flush_log_at_trx_commit 1 1 Must be 1 for primary; replicas can use 2
innodb_io_capacity 200 4000 50-75% of SSD IOPS
innodb_io_capacity_max 2000 8000 2x io_capacity
innodb_flush_method fsync O_DIRECT Bypass OS cache, avoid double buffering
innodb_flush_neighbors 0 0 SSD does not benefit from neighbor flushing
innodb_page_cleaners 4 16 = buffer_pool_instances
innodb_purge_threads 4 4-8 Increase for heavy UPDATE/DELETE workloads
innodb_file_per_table ON ON Always keep ON
innodb_doublewrite ON ON Unless using ZFS or BBWC RAID
innodb_adaptive_hash_index ON ON Disable if contention observed
innodb_change_buffer_max_size 25 25 Set to 0 if no secondary indexes or all unique
innodb_undo_tablespaces 2 2-4 More tablespaces = better concurrency and truncation
innodb_undo_log_truncate ON ON Auto-reclaim undo space
innodb_print_all_deadlocks OFF ON Log all deadlocks to error log
innodb_deadlock_detect ON ON Consider OFF + timeout for high-concurrency hotspot rows
-- Example my.cnf InnoDB section for 64GB SSD server: [mysqld] # Buffer Pool innodb_buffer_pool_size = 44G innodb_buffer_pool_instances = 16 innodb_buffer_pool_dump_at_shutdown = ON innodb_buffer_pool_load_at_startup = ON # Redo Log innodb_redo_log_capacity = 4G innodb_log_buffer_size = 64M innodb_flush_log_at_trx_commit = 1 # I/O innodb_io_capacity = 4000 innodb_io_capacity_max = 8000 innodb_flush_method = O_DIRECT innodb_flush_neighbors = 0 innodb_page_cleaners = 16 # Tablespace innodb_file_per_table = ON innodb_doublewrite = ON innodb_undo_tablespaces = 3 innodb_undo_log_truncate = ON # Concurrency innodb_adaptive_hash_index = ON innodb_adaptive_hash_index_parts = 16 innodb_purge_threads = 4 # Monitoring innodb_print_all_deadlocks = ON innodb_status_output_locks = ON

12. Monitoring & Troubleshooting

12.1 SHOW ENGINE INNODB STATUS Full Walkthrough

This is InnoDB's most important diagnostic tool. The output is divided into sections, each providing different dimensional information:

(1) SEMAPHORES Section

---------- SEMAPHORES ---------- OS WAIT ARRAY INFO: reservation count 12345 --Thread 140234567890 has waited at btr0sea.ic line 92 for 0.5 seconds the semaphore: S-lock on RW-latch at 0x7f... a]writer (thread id 140234567891) has reserved it in mode exclusive number of readers 0, waiters flag 1, lock_word: 0 Key indicators: - "has waited ... for X seconds": threads waiting for locks - btr0sea.ic: AHI-related contention โ†’ consider disabling AHI - buf0buf.cc: Buffer Pool mutex contention โ†’ increase instances - Frequent long waits (>1s) = performance problem

(2) TRANSACTIONS Section

------------ TRANSACTIONS ------------ Trx id counter 12345678 Purge done for trx's n:o < 12345670 undo n:o < 0 History list length 234 โ† Undo records not yet purged If growing โ†’ long transactions or purge lag ---TRANSACTION 12345677, ACTIVE 3600 sec โ† Active for 1 hour! Problem! 2 lock struct(s), heap size 1136, 1 row lock(s) MySQL thread id 42, OS thread handle 140..., query id 999 SELECT * FROM large_table WHERE status = 'active' Key indicators: - History list length: should be low (<1000); growing = purge lag - ACTIVE NNN sec: long-running transactions โ†’ undo bloat risk - lock struct(s) / row lock(s): lock usage per transaction

(3) FILE I/O Section

-------- FILE I/O -------- I/O thread 0 state: waiting for completed aio requests (insert buffer thread) I/O thread 1 state: waiting for completed aio requests (log thread) I/O thread 2 state: waiting for completed aio requests (read thread) ... Pending normal aio reads: [0, 0, 0, 0] โ† Should be 0 or very low Pending normal aio writes: [0, 0, 0, 0] Pending ibuf aio reads: 0 6821 OS file reads, 29412 OS file writes, 8513 OS fsyncs 0.00 reads/s, 0 avg bytes/read, 1.86 writes/s, 0.76 fsyncs/s Key indicators: - High pending reads/writes = I/O subsystem is saturated - fsyncs/s: correlates with commit rate and flush_log_at_trx_commit setting

(4) ROW OPERATIONS Section

-------------- ROW OPERATIONS -------------- 0 queries inside InnoDB, 0 queries in queue 0 read views open inside InnoDB โ† Open read views = MVCC snapshots Process ID=12345, Main thread ID=140..., state: sleeping Number of rows inserted 567890, updated 123456, deleted 78901, read 45678901 0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s

12.2 information_schema Monitoring Tables

-- Buffer Pool statistics per instance SELECT POOL_ID, POOL_SIZE, FREE_BUFFERS, DATABASE_PAGES, OLD_DATABASE_PAGES, MODIFIED_DATABASE_PAGES, HIT_RATE, YOUNG_MAKE_PER_THOUSAND_GETS FROM information_schema.INNODB_BUFFER_POOL_STATS; -- Currently open transactions SELECT trx_id, trx_state, trx_started, TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS duration_sec, trx_rows_locked, trx_rows_modified, trx_query FROM information_schema.INNODB_TRX ORDER BY trx_started ASC; -- Lock waits (who is blocking whom) SELECT r.trx_id AS waiting_trx, r.trx_query AS waiting_query, b.trx_id AS blocking_trx, b.trx_query AS blocking_query FROM performance_schema.data_lock_waits w JOIN information_schema.INNODB_TRX r ON r.trx_id = w.REQUESTING_ENGINE_TRANSACTION_ID JOIN information_schema.INNODB_TRX b ON b.trx_id = w.BLOCKING_ENGINE_TRANSACTION_ID; -- InnoDB metrics (hundreds of counters) SELECT NAME, COUNT, STATUS FROM information_schema.INNODB_METRICS WHERE STATUS = 'enabled' ORDER BY NAME;

12.3 InnoDB Tables in performance_schema

Table Purpose
data_locks All currently held and waited-for InnoDB locks
data_lock_waits Lock wait relationships: which transaction waits for which
events_waits_summary_global_by_event_name InnoDB mutex/rw-lock wait statistics (search for wait/synch/mutex/innodb)
file_summary_by_instance I/O statistics per .ibd file (read/write bytes, latency)
table_io_waits_summary_by_table I/O wait statistics per table
-- Find InnoDB mutex contention hotspots SELECT EVENT_NAME, COUNT_STAR, SUM_TIMER_WAIT/1000000000 AS total_wait_ms FROM performance_schema.events_waits_summary_global_by_event_name WHERE EVENT_NAME LIKE 'wait/synch/mutex/innodb/%' ORDER BY SUM_TIMER_WAIT DESC LIMIT 10; -- Find the most I/O-heavy .ibd files SELECT FILE_NAME, COUNT_READ, SUM_NUMBER_OF_BYTES_READ, COUNT_WRITE, SUM_NUMBER_OF_BYTES_WRITE, SUM_TIMER_READ/1000000000 AS read_wait_ms, SUM_TIMER_WRITE/1000000000 AS write_wait_ms FROM performance_schema.file_summary_by_instance WHERE FILE_NAME LIKE '%.ibd' ORDER BY SUM_TIMER_READ + SUM_TIMER_WRITE DESC LIMIT 10;

12.4 Common Troubleshooting Checklist

Symptom Possible Cause Diagnostic Solution
QPS periodic drops Redo log full causing sync flush checkpoint_age approaching redo log capacity Increase redo log + io_capacity
Low Buffer Pool hit rate Pool too small or full-scan pollution SHOW ENGINE INNODB STATUS โ†’ hit rate Increase pool size or optimize queries
History list keeps growing Long transaction blocking purge INNODB_TRX โ†’ trx_started Find and kill long transaction
ibdata1 bloating innodb_file_per_table=OFF Check my.cnf Enable, then logical rebuild
High CPU but low throughput AHI rw-lock contention SEMAPHORES โ†’ btr0sea.ic Disable AHI or increase parts
Undo tablespace ballooning Long transaction + high concurrent writes INNODB_TRX + undo file size Kill long txn + truncate undo

13. Frequently Asked Questions

Q1: What percentage of total RAM should innodb_buffer_pool_size be?

The general rule is 60-80% of total RAM, but it depends on the scenario:

  • Dedicated database server (MySQL only): 70-80%
  • Shared server (with app, cache, etc.): 50-60%
  • Small RAM (≤4GB): 50%

Remaining memory is for: OS file cache, per-connection memory (sort_buffer, join_buffer, thread stack, ~2-10MB per connection), other processes. A common mistake is setting buffer_pool_size to 90%+, causing the OS to swap, which devastates performance.

Q2: Why does InnoDB use B+Tree instead of B-Tree or hash indexes?

B+Tree advantages over B-Tree:

  • All data in leaf nodes: non-leaf nodes store only keys and pointers, higher fan-out, shorter tree, fewer disk I/O
  • Leaf nodes form a doubly-linked list: range scans (BETWEEN, ORDER BY) traverse the list without backtracking
  • Stable query performance: all lookups go to leaf level, consistent path length (O(log n))

Hash indexes offer O(1) point lookups but cannot support range scans, sorting, or prefix matching โ€” all extremely common in OLTP. InnoDB uses hashing as a supplement (AHI), not the primary structure.

Q3: Can innodb_flush_log_at_trx_commit=2 and sync_binlog=1 be used together?

Technically yes, but usually pointless. sync_binlog=1 guarantees binlog fsync on every commit, but flush_log_at_trx_commit=2 allows redo log loss. Crash recovery depends on consistency between redo log and binlog: if redo is lost but binlog exists, primary-replica inconsistency results.

Recommended combinations:

  • Primary: flush_log_at_trx_commit=1 + sync_binlog=1 (dual-1, safest)
  • Replica: flush_log_at_trx_commit=2 + sync_binlog=0 (relaxed, better performance)
Q4: How to determine if the Buffer Pool is large enough?

Three key indicators:

  1. Buffer Pool hit rate: "Buffer pool hit rate" in SHOW ENGINE INNODB STATUS. Should be >= 999/1000 (99.9%).
  2. Free buffers: If persistently 0 but hit rate is still high, pool size is just right. If free=0 and hit rate is low, increase it.
  3. Pages read vs Innodb_buffer_pool_reads: SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'. Innodb_buffer_pool_read_requests (logical reads) vs Innodb_buffer_pool_reads (physical reads); higher ratio is better.

Ultimate test: if your entire dataset fits in the Buffer Pool (check total Data_length + Index_length from SHOW TABLE STATUS), set it large enough to hold everything.

Q5: Does MVCC completely avoid locks?

No. MVCC only avoids read-write conflicts (reads don't block writes, writes don't block reads), but write-write conflicts still require locks. Specifically:

  • Consistent reads (plain SELECT): read historical version via MVCC, no lock
  • Current reads (SELECT ... FOR UPDATE / LOCK IN SHARE MODE): locking read, reads latest version
  • INSERT / UPDATE / DELETE: first read latest version (current read), then take exclusive lock to modify

Thus MVCC + locking is InnoDB's complete concurrency control mechanism โ€” they complement, not replace, each other.

Q6: Why is innodb_flush_method=O_DIRECT recommended?

With the default fsync method, InnoDB's write path is: Buffer Pool โ†’ OS Page Cache โ†’ Disk. Data is stored twice in memory (InnoDB Buffer Pool + OS Page Cache), wasting memory.

O_DIRECT bypasses the OS Page Cache: Buffer Pool โ†’ Disk directly. Benefits:

  • Avoid double caching, freeing valuable OS memory
  • Reduce OS Page Cache CPU overhead (cache lookups, LRU management)
  • Especially effective with large Buffer Pools

Note: O_DIRECT still uses fsync for redo log. In MySQL 8.0.26+ you can use O_DIRECT_NO_FSYNC, which performs better on some filesystems.

Q7: Why does InnoDB need Doublewrite while PostgreSQL doesn't?

They use different strategies for the torn page problem:

  • InnoDB: Redo Log records physiological logs (offset + delta within a page). If the page itself is corrupted, redo cannot replay correctly. Thus Doublewrite provides an intact page copy as the replay base.
  • PostgreSQL: On the first modification of a page after a checkpoint, it writes a full page image (Full Page Write, FPW) in the WAL. Even if the page is corrupted, the full page can be recovered from the WAL. The cost is larger WAL size.

Each approach has trade-offs: InnoDB's redo log is more compact but requires extra Doublewrite I/O; PostgreSQL's WAL is larger but architecturally simpler.

Q8: How to resize innodb_buffer_pool_size online?

MySQL 5.7+ supports online Buffer Pool resizing without restart:

SET GLOBAL innodb_buffer_pool_size = 48 * 1024 * 1024 * 1024;  -- 48GB

Important notes:

  • Resizing happens in units of innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances (default 128MB * 8 = 1GB)
  • If the value is not a multiple of this unit, MySQL rounds up automatically
  • Growing is usually fast; shrinking can be slow (pages must be evicted)
  • Resizing does not block queries but may cause brief performance fluctuations
  • Monitor progress via SHOW STATUS LIKE 'Innodb_buffer_pool_resize_status'
Q9: How does InnoDB solve phantom reads under RR isolation?

InnoDB solves phantom reads under RR through two mechanisms:

  1. MVCC (snapshot reads): Plain SELECT uses the ReadView from the transaction's start. Even if other transactions insert new rows, the snapshot read cannot see them โ€” thus no phantom reads.
  2. Next-Key Lock (current reads): SELECT ... FOR UPDATE / INSERT / UPDATE / DELETE uses current reads. InnoDB uses Next-Key Locks (Record Lock + Gap Lock) to lock gaps in the query range, preventing other transactions from inserting new rows.

Caveat: RR does not perfectly prevent phantom reads in all cases. A classic counterexample: a transaction does a snapshot read (doesn't see new rows), then UPDATE (current read, affects new rows), then snapshot read again (may now see them โ€” because UPDATE changed the new row's trx_id to the current transaction's ID).

Q10: What major InnoDB architectural improvements were made in MySQL 8.0?

MySQL 8.0's InnoDB improvements are historic:

  • Atomic data dictionary: migrated from file-based .frm to InnoDB's transactional data dictionary; DDL operations become atomic (crash-safe)
  • Redo Log redesign (8.0.30+): from fixed-size files to dynamic #innodb_redo/ directory, unified under innodb_redo_log_capacity
  • Separate Doublewrite files (8.0.20+): extracted from ibdata1 to standalone .dblwr files, enabling parallel writes
  • Instant ADD COLUMN (8.0.12+): certain ALTER TABLE ADD COLUMN operations complete instantly without table rebuild
  • Invisible indexes: ALTER TABLE ... ALTER INDEX idx INVISIBLE to test the impact of dropping an index without actually dropping it
  • Descending indexes: true descending B+Tree indexes, optimizing ORDER BY ... DESC queries
  • Session temporary tablespaces (8.0.15+): per-session temporary tablespace files, reclaimable after disconnect
  • Undo Tablespace management: online create/drop undo tablespace, automatic truncation
  • Parallel reads (8.0.14+): SELECT COUNT(*) and other full table scans can read multiple pages in parallel

12.5 Key InnoDB Status Variables

Besides SHOW ENGINE INNODB STATUS, these SHOW GLOBAL STATUS variables are most valuable for daily monitoring:

Variable Meaning Healthy Threshold
Innodb_buffer_pool_read_requests Total logical read requests to Buffer Pool Ratio to reads > 99.9%
Innodb_buffer_pool_reads Buffer Pool misses requiring disk read Lower is better
Innodb_buffer_pool_pages_dirty Dirty page count in Buffer Pool Should be stable, not continuously growing
Innodb_buffer_pool_wait_free Waits for free pages (Free list empty) Should be 0 or near 0
Innodb_log_waits Waits due to Log Buffer being too small Should be 0; if > 0 increase log_buffer_size
Innodb_os_log_written Bytes written to Redo Log Use to calculate write rate and estimate redo size
Innodb_row_lock_time_avg Average row lock wait time (ms) < 10ms; > 100ms needs investigation
Innodb_row_lock_waits Row lock wait count Trend matters more than absolute value
Innodb_rows_read / inserted / updated / deleted InnoDB layer row operation statistics Use for workload profiling
Innodb_data_reads / writes Physical I/O operation count Compare with logical reads to measure cache efficiency
-- Buffer Pool hit rate calculation: SELECT (SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests') AS logical_reads, (SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads') AS disk_reads, ROUND( (1 - (SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads') / (SELECT VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests') ) * 100, 3 ) AS hit_rate_pct; -- Redo log write rate (bytes per second): -- Record Innodb_os_log_written at two points in time, divide the delta by seconds elapsed -- Quick health check script: SELECT 'Buffer Pool Hit Rate' AS metric, CONCAT(ROUND((1 - v2.val/v1.val)*100, 2), '%') AS value FROM (SELECT VARIABLE_VALUE val FROM performance_schema.global_status WHERE VARIABLE_NAME='Innodb_buffer_pool_read_requests') v1, (SELECT VARIABLE_VALUE val FROM performance_schema.global_status WHERE VARIABLE_NAME='Innodb_buffer_pool_reads') v2 UNION ALL SELECT 'Dirty Page Ratio', CONCAT(ROUND(v1.val/v2.val*100, 1), '%') FROM (SELECT VARIABLE_VALUE val FROM performance_schema.global_status WHERE VARIABLE_NAME='Innodb_buffer_pool_pages_dirty') v1, (SELECT VARIABLE_VALUE val FROM performance_schema.global_status WHERE VARIABLE_NAME='Innodb_buffer_pool_pages_total') v2 UNION ALL SELECT 'Log Waits (should be 0)', VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME='Innodb_log_waits' UNION ALL SELECT 'Wait Free (should be 0)', VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME='Innodb_buffer_pool_wait_free' UNION ALL SELECT 'Avg Row Lock Time (ms)', VARIABLE_VALUE FROM performance_schema.global_status WHERE VARIABLE_NAME='Innodb_row_lock_time_avg';

12.6 Prometheus / Grafana Monitoring Recommendations

In production, mysqld_exporter typically exposes InnoDB metrics to Prometheus, visualized with Grafana. Recommended core dashboard panels include:

Panel Metrics Alert Threshold
Buffer Pool Hit Rate rate(mysql_global_status_innodb_buffer_pool_read_requests[5m])
rate(mysql_global_status_innodb_buffer_pool_reads[5m])
Alert if < 99.5%
Dirty Pages % mysql_global_status_innodb_buffer_pool_pages_dirty / mysql_global_status_innodb_buffer_pool_pages_total Alert if > 75%
Checkpoint Age mysql_global_status_innodb_checkpoint_age Alert if > 80% redo capacity
History List Length mysql_global_status_innodb_history_list_length Alert if > 10000
Row Lock Waits/s rate(mysql_global_status_innodb_row_lock_waits[5m]) Alert if > 100/s
Long Transactions mysql_info_schema_innodb_trx (custom query) Alert if > 300 seconds
Redo Log Write Rate rate(mysql_global_status_innodb_os_log_written[5m]) Trend monitoring (alert on anomaly)
Pending I/O mysql_global_status_innodb_data_pending_reads
mysql_global_status_innodb_data_pending_writes
Alert if > 0 for 5 minutes

12.7 Useful sys Schema Views

-- Top tables by buffer pool usage SELECT object_schema, object_name, allocated, data, pages FROM sys.innodb_buffer_stats_by_table ORDER BY allocated DESC LIMIT 20; -- Top indexes by buffer pool usage SELECT object_schema, object_name, index_name, allocated, data, pages FROM sys.innodb_buffer_stats_by_schema ORDER BY allocated DESC; -- InnoDB lock waits (human-readable) SELECT * FROM sys.innodb_lock_waits\G -- Tables with full table scans (potential buffer pool polluters) SELECT object_schema, object_name, rows_full_scanned, latency FROM sys.schema_tables_with_full_table_scans ORDER BY rows_full_scanned DESC LIMIT 10;
โ† Prev: Architecture Back to Contents Next: Source Code โ†’