Chapter 2: InnoDB Storage Engine Internals Deep Dive
InnoDB is MySQL's default and most critical storage engine. This chapter delivers a source-code-level deep dive: the architecture overview, Buffer Pool page structure, Redo Log WAL mechanics, Undo Log and MVCC version chains, Doublewrite Buffer torn-page protection, Change Buffer random-I/O optimization, the Adaptive Hash Index, tablespace layout, row format comparison, and finally flush strategies and production monitoring. Every topic includes ASCII architecture diagrams, MySQL source file references, key parameter tuning advice, and real-world incident case studies.
1. InnoDB Architecture Overview
1.1 Architecture Diagram
InnoDB's internal architecture divides into in-memory structures and on-disk structures. The ASCII diagram below shows every core component and data flow:
Source ref: storage/innobase/include/buf0buf.h, storage/innobase/srv/srv0srv.cc
1.2 Query Data Flow
When a SELECT reaches InnoDB, the data flow is:
1.3 Write Data Flow
When an INSERT / UPDATE / DELETE executes, InnoDB's write path is considerably more complex than reads:
1.4 In-Memory vs On-Disk Structures
| Component | Location | Purpose | Key Parameter |
|---|---|---|---|
| Buffer Pool | Memory | Cache data and index pages to reduce disk I/O | innodb_buffer_pool_size |
| Log Buffer | Memory | Temporarily hold redo log records before flushing | innodb_log_buffer_size |
| Adaptive Hash Index | Memory | Auto-build hash indexes on frequently accessed B+Tree pages | innodb_adaptive_hash_index |
| Change Buffer | Memory + Disk | Cache secondary index changes, defer merge | innodb_change_buffer_max_size |
| Redo Log | Disk | Crash recovery, guarantee durability (D in ACID) | innodb_redo_log_capacity |
| Undo Log | Disk (+cached in Buffer Pool) | Transaction rollback + MVCC versioned reads | innodb_undo_tablespaces |
| Doublewrite Buffer | Disk | Prevent partial page writes (torn pages) | innodb_doublewrite |
| System Tablespace | Disk | Data dictionary, change buffer persistence | innodb_data_file_path |
2. Buffer Pool Deep Dive
The Buffer Pool is InnoDB's most critical in-memory structure. Understanding how it works is foundational to understanding InnoDB performance.
2.1 Page Structure (16KB)
All data in InnoDB is stored in 16KB pages. Each page's internal structure is:
Source ref: storage/innobase/include/fil0types.h, storage/innobase/include/page0page.h
The page size is controlled by innodb_page_size (options: 4K/8K/16K/32K/64K, settable only when the instance is initialized). 16KB is a carefully chosen trade-off: too small increases B+Tree depth (more disk seeks), too large wastes memory and increases intra-page fragmentation. For most OLTP workloads, 16KB is optimal.
2.2 LRU Algorithm: Dual-Sublist Design
InnoDB's Buffer Pool does not use a standard LRU (Least Recently Used) algorithm. Instead it employs a modified two-segment LRU to prevent full table scans from evicting hot data.
Why Two Sublists?
Consider a typical production scenario: your Buffer Pool caches the last few hours of hot data (user table, order table index pages). Suddenly someone runs a SELECT * FROM large_table (full table scan), reading hundreds of thousands of data pages. With a standard LRU, these "cold" pages that will never be accessed again push out all the truly "hot" pages, causing subsequent normal queries to hit disk. This is called "Buffer Pool pollution."
InnoDB's solution:
- When a new page is read from disk, it is inserted at the head of the Old sublist (midpoint insertion), not the head of the entire LRU.
- Only when the page is accessed again after staying in the Old sublist for more than innodb_old_blocks_time (default 1000ms) is it promoted to the Young sublist.
- Full table scan pages are typically read and discarded within 1000ms, never entering the Young sublist, thus protecting hot data.
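The midpoint-insertion behavior can be sketched as a toy model (class and parameter names here are invented for illustration; real InnoDB tracks per-block access times with more heuristics):

```python
from collections import OrderedDict

class MidpointLRU:
    """Toy two-sublist LRU: new pages enter the Old sublist; a page is
    promoted to Young only if re-accessed after old_blocks_time seconds."""
    def __init__(self, capacity, old_blocks_time=1.0):
        self.capacity = capacity
        self.old_blocks_time = old_blocks_time
        self.young = OrderedDict()   # page_id -> last access time
        self.old = OrderedDict()     # page_id -> first access time

    def access(self, page_id, now):
        if page_id in self.young:
            self.young[page_id] = now
            self.young.move_to_end(page_id, last=False)    # already hot
        elif page_id in self.old:
            if now - self.old[page_id] >= self.old_blocks_time:
                del self.old[page_id]              # survived long enough:
                self.young[page_id] = now          # promote to Young head
                self.young.move_to_end(page_id, last=False)
        else:
            self.old[page_id] = now                # midpoint insertion
            self.old.move_to_end(page_id, last=False)
            while len(self.young) + len(self.old) > self.capacity:
                tail = self.old if self.old else self.young
                tail.popitem()                     # evict from the tail

pool = MidpointLRU(capacity=10)
pool.access("hot_page", now=0.0)
pool.access("hot_page", now=2.0)                   # re-accessed after 1s: promoted
for i in range(100):                               # a "full table scan"
    pool.access(f"scan_page_{i}", now=3.0)
assert "hot_page" in pool.young                    # hot data survived the scan
```

The scan pages above are each touched once, so none qualifies for promotion, and they evict only each other from the Old tail.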
Key Parameters
| Parameter | Default | Description |
|---|---|---|
| innodb_old_blocks_pct | 37 | Percentage of the Buffer Pool allocated to the Old sublist. Default 37% (~3/8). Range: 5-95. |
| innodb_old_blocks_time | 1000 | Milliseconds a page must stay in the Old sublist before it can be promoted to Young. Set to 0 to disable this protection. |
2.3 Buffer Pool Instances
When the Buffer Pool is large (typically 32GB-128GB in production), a single Buffer Pool mutex becomes a concurrency bottleneck. InnoDB allows splitting the Buffer Pool into multiple independent instances, each with its own LRU list, Free list, Flush list, and mutex.
| Parameter | Default | Recommended |
|---|---|---|
| innodb_buffer_pool_instances | 8 (when pool >= 1GB) | At least 1GB per instance; a 128GB pool → 16 instances. |
| innodb_buffer_pool_chunk_size | 128MB | Unit for online pool resizing. pool_size should be a multiple of chunk_size * instances. |
Note: when innodb_buffer_pool_size < 1GB, innodb_buffer_pool_instances is forced to 1 regardless of your setting, because small Buffer Pools don't suffer from mutex contention.
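Each page belongs to exactly one instance, chosen deterministically from its identity, so a given page always contends on the same instance's mutex and no cross-instance coordination is needed on lookup. A minimal sketch (the hash function here is an invented stand-in; the real mapping is internal to the buffer pool code):

```python
def buffer_pool_instance(space_id: int, page_no: int, n_instances: int) -> int:
    """Map a page to a Buffer Pool instance. Deterministic, so every
    lookup for the same page lands on the same instance."""
    return (space_id * 1_000_003 + page_no) % n_instances  # toy hash

# The same page always maps to the same instance:
a = buffer_pool_instance(5, 42, 16)
b = buffer_pool_instance(5, 42, 16)
assert a == b and 0 <= a < 16
```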
2.4 Buffer Pool Warmup & Dump/Load
After a MySQL restart the Buffer Pool is empty (cold start) and query performance gradually recovers over 30-60 minutes. InnoDB provides automatic dump/load to accelerate warmup:
The dump file only saves (tablespace_id, page_number) pairs, not actual data. At startup InnoDB pre-reads pages from disk using this list. This is why even a 128GB Buffer Pool dump file is typically only a few MB.
2.5 Monitoring the Buffer Pool
1. Buffer pool hit rate: Should stay at 999/1000 or above. Below this suggests the pool is too small or being polluted by large queries.
2. Free buffers: If persistently 0, the pool is full and every new page requires eviction.
3. Modified db pages: If continuously growing, flushing cannot keep up with writes.
4. Pages made young / not young: "not young" far exceeding "young" means the innodb_old_blocks_time protection is working.
Source ref: storage/innobase/buf/buf0buf.cc → buf_page_get_gen(), buf_page_make_young()
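For metric 1, the hit rate can be computed from the Innodb_buffer_pool_read_requests (logical reads) and Innodb_buffer_pool_reads (reads that missed and went to disk) status counters. Sample the counters twice and use the deltas, since the absolute values cover the whole uptime:

```python
def buffer_pool_hit_rate(read_requests_delta: int, disk_reads_delta: int) -> float:
    """Hit rate over a sampling interval: 1 - misses / logical reads."""
    if read_requests_delta == 0:
        return 1.0          # no reads in the interval: treat as perfect
    return 1.0 - disk_reads_delta / read_requests_delta

# 1,000,000 logical reads in the interval, 800 of which went to disk
rate = buffer_pool_hit_rate(1_000_000, 800)
assert rate > 0.999         # meets the 999/1000 guideline above
```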
2.6 Buffer Pool Size Calculator
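A rule-of-thumb calculator can be sketched as follows (a sketch assuming a dedicated database server and the common ~70%-of-RAM guideline; InnoDB itself rounds the configured value to a multiple of chunk_size * instances, which the rounding below mirrors):

```python
CHUNK_SIZE = 128 * 1024 ** 2      # innodb_buffer_pool_chunk_size default (128MB)

def suggest_buffer_pool_size(ram_bytes: int, instances: int, pct: float = 0.70) -> int:
    """Suggest innodb_buffer_pool_size: ~70% of RAM on a dedicated server,
    rounded down to a multiple of chunk_size * instances."""
    unit = CHUNK_SIZE * instances
    target = int(ram_bytes * pct)
    return max(unit, target - target % unit)

# 64GB server with 16 instances -> 44GB (a multiple of 128MB * 16 = 2GB)
size = suggest_buffer_pool_size(64 * 1024 ** 3, instances=16)
assert size == 44 * 1024 ** 3
```

For a 64GB server with 16 instances this yields 44GB, which matches the cheat sheet in section 11.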
2.7 Buffer Pool Internal Lists
The Buffer Pool maintains three critical internal lists. Understanding them is essential for troubleshooting performance issues:
2.8 Page Types
Pages cached in the Buffer Pool are not just data pages. InnoDB defines multiple page types:
| FIL_PAGE_TYPE | Value | Description |
|---|---|---|
| FIL_PAGE_INDEX | 0x45BF | B+Tree index page (clustered or secondary, leaf or internal) |
| FIL_PAGE_RTREE | 0x45BE | R-Tree index page (spatial index) |
| FIL_PAGE_SDI | 0x45BD | Serialized Dictionary Information page (MySQL 8.0) |
| FIL_PAGE_UNDO_LOG | 0x0002 | Undo Log page |
| FIL_PAGE_INODE | 0x0003 | Segment information node page |
| FIL_PAGE_IBUF_FREE_LIST | 0x0004 | Change Buffer free list page |
| FIL_PAGE_TYPE_ALLOCATED | 0x0000 | Allocated but not yet used page |
| FIL_PAGE_TYPE_SYS | 0x0006 | System page (e.g., transaction system page) |
| FIL_PAGE_TYPE_TRX_SYS | 0x0007 | Transaction system data page |
| FIL_PAGE_TYPE_FSP_HDR | 0x0008 | File Space Header page |
| FIL_PAGE_TYPE_XDES | 0x0009 | Extent Descriptor page |
| FIL_PAGE_TYPE_BLOB | 0x000A | Overflow page (stores large column data) |
Source ref: storage/innobase/include/fil0fil.h → enum page_type_t
2.9 Buffer Pool Read-Ahead
InnoDB has two read-ahead mechanisms to reduce I/O wait time:
Linear Read-Ahead
When the number of sequentially accessed pages in an extent (64 contiguous pages = 1MB) exceeds innodb_read_ahead_threshold (default 56), InnoDB asynchronously pre-reads all pages in the next extent. Suits sequential access patterns like full table scans.
Random Read-Ahead
When more than 13 pages from an extent are in the Buffer Pool's Young sublist, InnoDB pre-reads the remaining pages in that extent. Disabled by default (innodb_random_read_ahead=OFF), as it rarely helps in most workloads.
2.10 Buffer Pool Mutex & Latch Contention
Under high concurrency (>1000 QPS), Buffer Pool internal mutexes can become bottlenecks. Key mutexes include:
| Mutex / Latch | Protected Resource | Contention Symptom | Solution |
|---|---|---|---|
| buf_pool->mutex | Buffer Pool instance global lock | Broad contention under highly concurrent reads/writes | Increase innodb_buffer_pool_instances |
| buf_pool->LRU_list_mutex | LRU list | Contention during page eviction/promotion | Increase instances; optimize queries to reduce eviction |
| buf_pool->flush_list_mutex | Flush list | Contention under high write load | Increase instances and innodb_page_cleaners |
| hash_lock (page hash) | Page hash table (checks whether a page is in the pool) | Contention under extreme concurrent point lookups | Partitioned in MySQL 8.0; usually no action needed |
3. Redo Log (WAL) Mechanism
3.1 Write-Ahead Logging Principle
WAL (Write-Ahead Logging) is InnoDB's core mechanism for transaction durability (the D in ACID). The principle is simple: before modifying a data page, first record the modification operation in the Redo Log. If the system crashes, all committed transaction modifications can be recovered by replaying the Redo Log.
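The principle fits in a few lines (a minimal sketch for illustration, not InnoDB's record format or file layout):

```python
import json, os, tempfile

class WalSketch:
    """Minimal WAL sketch: the log record is appended and fsync'd BEFORE
    the in-memory page changes, so a crash can never lose an
    acknowledged modification."""
    def __init__(self, path):
        self.log = open(path, "a+", encoding="utf-8")
        self.pages = {}                      # stand-in for the Buffer Pool

    def modify(self, page_no, value):
        self.log.write(json.dumps({"page": page_no, "value": value}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())          # durability point: log first...
        self.pages[page_no] = value          # ...then mutate the cached page

    def recover(self):
        self.log.seek(0)
        for line in self.log:                # replay in append (LSN) order
            rec = json.loads(line)
            self.pages[rec["page"]] = rec["value"]

with tempfile.TemporaryDirectory() as tmp:
    wal = WalSketch(os.path.join(tmp, "redo.log"))
    wal.modify(7, "row v2")
    wal.pages.clear()                        # simulated crash: memory lost
    wal.recover()                            # replay committed changes
    assert wal.pages[7] == "row v2"
    wal.log.close()
```

The key ordering is the two lines in modify(): the fsync of the log happens before the page mutation becomes the only copy of the change.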
3.2 Redo Log File Structure
Before MySQL 8.0.30, the Redo Log consisted of a fixed number of files (default 2), written in a circular fashion:
3.3 LSN (Log Sequence Number)
LSN is one of InnoDB's most important internal concepts. It is a monotonically increasing 64-bit integer representing the total bytes written to the Redo Log (counted from InnoDB initialization).
| LSN Type | Meaning | Source |
|---|---|---|
| Log sequence number | Current latest LSN (position written to the Log Buffer) | SHOW ENGINE INNODB STATUS |
| Log flushed up to | LSN flushed to the redo log files on disk | SHOW ENGINE INNODB STATUS |
| Pages flushed up to | Dirty pages have been flushed up to this LSN (oldest dirty page LSN in the flush list) | SHOW ENGINE INNODB STATUS |
| Last checkpoint at | LSN of the last checkpoint. Crash recovery replays from here. | SHOW ENGINE INNODB STATUS |
3.4 Checkpoint Mechanism
A checkpoint flushes dirty pages from the Buffer Pool to disk and advances the checkpoint LSN. It serves two purposes:
- Free redo log space: redo log space before the checkpoint LSN can be reused
- Shorten crash recovery time: recovery only needs to replay redo after the checkpoint LSN
Sharp Checkpoint
Flushes all dirty pages at once. Only used during database shutdown (innodb_fast_shutdown=0). It blocks all operations during the flush and is never used at runtime in production.
Fuzzy Checkpoint
Flushes dirty pages incrementally and asynchronously. InnoDB uses Fuzzy Checkpoint at runtime, including LRU flushing, Flush List flushing, and Async/Sync flushing (when redo log nears capacity).
3.5 innodb_flush_log_at_trx_commit: Durability vs Performance
This is arguably one of InnoDB's most important parameters. It controls the redo log flush strategy at transaction commit:
| Value | Performance | Data Safety | Use Case |
|---|---|---|---|
| 1 | Slowest (fsync on every commit) | Safest (zero data loss) | Primary, financial systems, any scenario that cannot lose data |
| 2 | Faster (write to OS cache per commit; OS crash may lose ~1 second) | Safe on a MySQL crash; unsafe on OS crash or power failure | Replicas, bulk imports, scenarios tolerating minor loss |
| 0 | Fastest (write and fsync only about once per second) | Any crash may lose ~1 second | Testing, non-critical data |
Real-world incident: a team ran innodb_flush_log_at_trx_commit=0 on a replica. One day a UPS failure power-cycled the replica. After restart, the replica's GTID position was ~3 seconds behind what the primary had recorded, breaking replication, and the replica had to be rebuilt. Lesson: even for replicas, use 2 instead of 0 when replication consistency matters.
3.6 Redo Log Size Tuning
A redo log that is too small causes frequent checkpoints (redo space cannot be reused until the corresponding dirty pages have been flushed), while one that is too large extends crash recovery time.
| Parameter | Version | Default | Recommended |
|---|---|---|---|
| innodb_log_file_size | < 8.0.30 | 48MB | 1GB-4GB (high write workloads) |
| innodb_log_files_in_group | < 8.0.30 | 2 | 2 |
| innodb_redo_log_capacity | ≥ 8.0.30 | 100MB | 2GB-8GB (replaces the two above) |
Source ref: storage/innobase/log/log0log.cc, storage/innobase/log/log0chkp.cc
3.7 Group Commit & Binlog Coordination
When multiple transactions commit simultaneously, InnoDB uses Group Commit optimization, merging multiple transactions' redo log writes into a single fsync. With binlog enabled, InnoDB uses two-phase commit (2PC) to ensure redo log and binlog consistency:
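The batching idea can be sketched as a toy model (invented names; real InnoDB/binlog group commit adds leader election and separate flush/sync/commit stages):

```python
class GroupCommitSketch:
    """Toy model of group commit: transactions queue their redo, and one
    leader makes the whole batch durable with a single fsync."""
    def __init__(self):
        self.fsync_calls = 0
        self.queue = []              # transactions waiting for durability

    def enqueue(self, trx_id):
        self.queue.append(trx_id)

    def flush_group(self):
        """Leader path: commit the entire queued batch with one fsync."""
        batch, self.queue = self.queue, []
        if batch:
            self.fsync_calls += 1    # one fsync covers every trx in the batch
        return batch

grp = GroupCommitSketch()
for trx in range(100):               # 100 concurrent commits arrive
    grp.enqueue(trx)
committed = grp.flush_group()
assert len(committed) == 100 and grp.fsync_calls == 1
```

This is why commit throughput can far exceed the disk's fsync rate: the cost of one fsync is amortized across every transaction in the group.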
3.8 Redo Log Record Types
The Redo Log records multiple types of page modification operations. Each record contains a type identifier, tablespace ID, page number, and specific modification content. Major types include:
| Record Type | Description | Triggered By |
|---|---|---|
| MLOG_1BYTE / 2BYTE / 4BYTE / 8BYTE | Write 1/2/4/8 bytes at a specific page offset | Page header field updates, etc. |
| MLOG_WRITE_STRING | Write a variable-length byte string to a page | Data changes |
| MLOG_REC_INSERT | Insert a compact-format record | INSERT |
| MLOG_REC_UPDATE_IN_PLACE | In-place record update (same length) | UPDATE (when field lengths are unchanged) |
| MLOG_REC_DELETE | Delete a record (mark as deleted) | DELETE |
| MLOG_PAGE_CREATE | Create a new index page | Page split, etc. |
| MLOG_UNDO_INSERT | Insert an undo record into an undo page | DML operations generating undo |
| MLOG_COMP_PAGE_CREATE | Create a compact-format page | Tablespace page allocation |
| MLOG_MULTI_REC_END | Mini-transaction end marker | At the end of each mtr |
Source ref: storage/innobase/include/mtr0types.h
3.9 Mini-Transaction (mtr) Concept
A Mini-Transaction is InnoDB's internal atomic operation unit, distinct from user-level transactions. A user transaction contains multiple mtrs, and each mtr's writes to the redo log are atomic. For example, a B+Tree page split may involve modifying 3 pages; these 3 modifications must be written to the redo log as a unit: all or nothing.
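The all-or-nothing property can be sketched as a toy model (invented class names; the end marker mirrors MLOG_MULTI_REC_END from section 3.8):

```python
class MiniTransaction:
    """Sketch of mtr redo buffering: records accumulate in a private
    buffer and are appended to the shared log all-or-nothing,
    terminated by an end marker."""
    def __init__(self, redo_log: list):
        self.redo_log = redo_log
        self.records = []

    def log(self, record):
        self.records.append(record)          # private buffer, not yet visible

    def commit(self):
        if self.records:
            self.redo_log.extend(self.records + ["MLOG_MULTI_REC_END"])
        self.records = []

redo = []
mtr = MiniTransaction(redo)
# A page split touching 3 pages: either all 3 records reach the log, or none.
for rec in ("modify page A", "modify page B", "modify page C"):
    mtr.log(rec)
assert redo == []                            # nothing visible before commit
mtr.commit()
assert redo[-1] == "MLOG_MULTI_REC_END" and len(redo) == 4
```

During crash recovery, records after the last end marker are discarded, which is exactly how a half-written mtr is rolled away.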
3.10 Crash Recovery Process
When MySQL restarts after an abnormal shutdown, InnoDB crash recovery proceeds as follows:
1. Read the last checkpoint LSN from the redo log.
2. Scan the redo log forward from the checkpoint LSN, parsing and applying records to rebuild all page changes up to the crash point (redo phase).
3. Roll back transactions that were still uncommitted at crash time using their undo logs (undo phase); this phase can continue in the background after the server starts accepting connections.
4. Undo Log & MVCC
4.1 Undo Log's Dual Mission
The Undo Log serves two critical roles in InnoDB:
- Transaction Rollback: When a transaction executes ROLLBACK, the old values recorded in the Undo Log restore data to the state before the transaction began.
- MVCC: When other transactions need to read a historical version of a row, the version chain in the Undo Log locates the version visible to that transaction.
4.2 Undo Log Storage Structure
Each transaction needs 1 to 2 Undo Logs depending on operation type:
- INSERT Undo Log: Only records the primary key of newly inserted rows; rollback deletes them. Can be released immediately after commit.
- UPDATE Undo Log: Records old values of modified/deleted rows. Cannot be released immediately after commit: it must wait until all transactions that might read the old version finish, after which purge threads clean it up.
4.3 MVCC Version Chain
Every row in InnoDB has two hidden columns:
- DB_TRX_ID (6 bytes): transaction ID of the last transaction that modified this row
- DB_ROLL_PTR (7 bytes): pointer to the previous version of this row in the Undo Log
4.4 ReadView in Detail
ReadView is the core data structure for InnoDB's MVCC consistent reads. When a transaction performs a consistent read (non-locking read), InnoDB creates a ReadView containing:
| Field | Meaning |
|---|---|
| m_creator_trx_id | Transaction ID of the transaction that created this ReadView |
| m_low_limit_id | Next transaction ID to be assigned at ReadView creation (i.e., max_trx_id at that point). Transactions with ID >= this are definitely invisible. |
| m_up_limit_id | Minimum transaction ID in the active list at ReadView creation. Transactions with ID < this are definitely committed and visible. |
| m_ids | List of all active (uncommitted) transaction IDs at ReadView creation. |
Source ref: storage/innobase/include/read0types.h → class ReadView, storage/innobase/row/row0sel.cc
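The visibility rules implied by these fields can be written out directly. A sketch (parameter names mirror the ReadView fields above; the logic follows the table's rules):

```python
def changes_visible(trx_id, creator_trx_id, up_limit_id, low_limit_id, active_ids):
    """Decide whether a row version written by trx_id is visible
    to the ReadView described by the other arguments."""
    if trx_id == creator_trx_id:
        return True                     # our own uncommitted changes
    if trx_id < up_limit_id:
        return True                     # committed before the view was created
    if trx_id >= low_limit_id:
        return False                    # started after the view was created
    return trx_id not in active_ids     # in between: visible iff committed

# View created while trx 90 and 95 were still active; next id to assign: 100
view = dict(creator_trx_id=97, up_limit_id=90, low_limit_id=100,
            active_ids={90, 95})
assert changes_visible(80, **view)       # old committed transaction
assert not changes_visible(95, **view)   # was active, hence uncommitted
assert changes_visible(92, **view)       # between the limits but committed
assert not changes_visible(101, **view)  # started after view creation
```

When a version is not visible, InnoDB follows the row's DB_ROLL_PTR into the undo log and repeats this check on the older version, until a visible one is found.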
4.5 RR vs RC: When the ReadView Is Created
REPEATABLE READ (RR)
ReadView is created at the first SELECT in the transaction and reused throughout the transaction's lifetime. Thus reading the same row multiple times within the transaction always returns the same value, even if other transactions have committed modifications in the meantime.
READ COMMITTED (RC)
ReadView is recreated on every SELECT. Thus two reads of the same row within a transaction may return different values if another transaction committed a modification between the two reads (Non-Repeatable Read).
4.6 Dangers of Long Transactions
A SaaS application had an analytics query transaction that ran for 8 hours without committing. During that time, UPDATE Undo Logs from other transactions could not be purged (because that long transaction's ReadView still needed old versions), causing the undo tablespace to bloat from 200MB to 150GB. Disk space was exhausted and the database crashed.
Prevention:
1. Set innodb_undo_log_truncate=ON (default ON in MySQL 8.0)
2. Monitor information_schema.INNODB_TRX for long-uncommitted transactions
3. Use kill_idle_transaction (Percona Server) or periodically kill transactions that exceed a time threshold
4. Route analytics queries to replicas
4.7 Undo Configuration
| Parameter | Default | Description |
|---|---|---|
| innodb_undo_tablespaces | 2 | Number of separate undo tablespaces. MySQL 8.0 minimum is 2. |
| innodb_undo_log_truncate | ON (8.0) | Automatically reclaim undo tablespaces exceeding innodb_max_undo_log_size. |
| innodb_max_undo_log_size | 1GB | Size threshold that triggers undo tablespace truncation. |
| innodb_rollback_segments | 128 | Number of rollback segments per undo tablespace. |
| innodb_purge_threads | 4 | Background threads for purge operations. Increase to 8 for high-write workloads. |
Source ref: storage/innobase/trx/trx0trx.cc, storage/innobase/trx/trx0purge.cc
4.8 Purge Thread Internals
Purge is InnoDB's background cleanup operation, responsible for two tasks:
- Physically remove rows marked as "delete-marked" (DELETE only sets a logical deletion flag; purge actually removes the row from indexes)
- Clean up UPDATE undo log records no longer needed by any active transaction
4.9 MVCC and Secondary Indexes
MVCC implementation differs fundamentally between clustered indexes and secondary indexes:
Clustered Index
Each row has DB_TRX_ID and DB_ROLL_PTR, enabling direct version visibility checks and version chain traversal.
Secondary Index
Secondary index entries do not have DB_TRX_ID and DB_ROLL_PTR. InnoDB uses PAGE_MAX_TRX_ID in the page header (the maximum trx_id on that page) for a quick check: if PAGE_MAX_TRX_ID < ReadView's m_up_limit_id, all records on the page are visible without returning to the clustered index. Otherwise, it must go back to the clustered index to check each row's version visibility.
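The quick check is plain integer comparison; a sketch using the ReadView field name from section 4.4:

```python
def secondary_page_all_visible(page_max_trx_id: int, up_limit_id: int) -> bool:
    """If every transaction that ever modified this secondary index page
    committed before the view's up-limit, all records on the page are
    visible without a clustered-index lookup."""
    return page_max_trx_id < up_limit_id

assert secondary_page_all_visible(85, 90)        # cheap path, no PK lookup
assert not secondary_page_all_visible(95, 90)    # must go back to the PK
```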
5. Doublewrite Buffer
5.1 The Partial Page Write Problem (Torn Pages)
InnoDB's page size is 16KB, but the OS atomic write unit is typically 4KB (one filesystem block). A single 16KB page write is issued as four separate 4KB I/O operations. If a power failure or system crash occurs between these four I/O operations, only part of the page may have been written: this is a "torn page."
5.2 How Doublewrite Works
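The mechanism can be sketched as a toy model (invented class names; real InnoDB writes the doublewrite area in large sequential chunks and stores the checksum inside each page):

```python
import zlib

def page_checksum(data: bytes) -> int:
    return zlib.crc32(data)

class DoublewriteSketch:
    """Toy torn-page protection: a page is first written to the
    doublewrite area and fsync'd, then written in place. On recovery,
    a page whose content fails its checksum is restored from the
    doublewrite copy."""
    def __init__(self):
        self.dw_area = {}       # page_no -> (data, checksum)
        self.datafile = {}      # page_no -> (data, checksum)

    def flush_page(self, page_no, data: bytes, tear_final_write=False):
        self.dw_area[page_no] = (data, page_checksum(data))   # step 1 (+fsync)
        if tear_final_write:    # simulate a crash mid-way through step 2
            torn = data[: len(data) // 2] + b"\x00" * (len(data) - len(data) // 2)
            self.datafile[page_no] = (torn, page_checksum(data))
        else:
            self.datafile[page_no] = (data, page_checksum(data))

    def recover(self):
        for page_no, (data, crc) in self.datafile.items():
            if page_checksum(data) != crc:                    # torn page found
                self.datafile[page_no] = self.dw_area[page_no]  # intact copy

dw = DoublewriteSketch()
dw.flush_page(1, b"A" * 16384, tear_final_write=True)
dw.recover()
assert dw.datafile[1][0] == b"A" * 16384    # restored from the doublewrite area
```

Note the asymmetry that makes this safe: a crash during the doublewrite write leaves the original page intact; a crash during the in-place write leaves an intact copy in the doublewrite area.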
5.3 Performance Impact & When to Disable
| Aspect | Details |
|---|---|
| Write Amplification | Each dirty page is written twice: once to the Doublewrite Buffer, once to the actual location. Theoretically 2x write I/O, but in practice ~5-10% overhead (because Doublewrite writes are sequential). |
| Safe to Disable When | 1. The filesystem or storage guarantees atomic writes (ZFS, RAID controller with battery-backed cache) 2. Some cloud databases (e.g., Aurora) implement storage-layer atomic writes 3. innodb_page_size set to 4KB (equal to the filesystem block size, so no torn-page risk) |
6. Change Buffer
6.1 What Problem Does the Change Buffer Solve?
When an INSERT/UPDATE/DELETE modifies a secondary index page that is not in the Buffer Pool, InnoDB faces a choice:
Without Change Buffer
Immediately read that secondary index page from disk into the Buffer Pool, modify it, mark as dirty. Problem: this is a random read I/O. For thousands of INSERTs per second, this generates massive random I/O.
With Change Buffer
Cache the modification in the Change Buffer without reading from disk. When the page is eventually read into the Buffer Pool for other queries, merge the cached modifications. This avoids massive random read I/O.
Two kinds of indexes can never use the Change Buffer:
1. The clustered index (primary key): the primary key page must be read on INSERT to determine the insert position
2. Unique indexes: the index page must be read on INSERT to check the uniqueness constraint
6.2 Merge Triggers
- The secondary index page is read into the Buffer Pool by a SELECT query
- Background merge thread executes periodically
- Buffer Pool space is insufficient and Change Buffer pages must be evicted
- At transaction commit (in certain cases)
- During normal server shutdown (full merge)
6.3 Configuration
| Parameter | Default | Description |
|---|---|---|
| innodb_change_buffering | all | Which operations to buffer: none, inserts, deletes, changes (inserts + deletes), purges, all |
| innodb_change_buffer_max_size | 25 | Maximum percentage of the Buffer Pool usable by the Change Buffer. Range: 0-50. |
6.4 When It Helps vs When It Hurts
Effective Scenarios
- Heavy random INSERTs (log tables, secondary indexes on UUID PK tables)
- Secondary index pages not frequently read by SELECTs (write-heavy, read-light)
- Disk I/O is the bottleneck (HDD benefits more than SSD)
Ineffective/Harmful Scenarios
- All indexes are unique (Change Buffer has zero effect)
- Read immediately after write (merge triggers immediately, no different from direct read)
- Buffer Pool large enough to cache all index pages (no need for Change Buffer optimization)
7. Adaptive Hash Index (AHI)
7.1 How It Works
All user data and indexes in InnoDB are stored in B+Trees. A B+Tree lookup traverses from the root node downward level by level. For a 3-4 level B+Tree, each lookup requires 3-4 page accesses (though upper-level pages are usually in the Buffer Pool, there is still CPU overhead).
The Adaptive Hash Index (AHI) is an in-memory hash table that InnoDB automatically builds on frequently accessed B+Tree pages. When InnoDB detects that a certain index prefix is repeatedly searched with the same pattern, it automatically builds a hash index, reducing lookup complexity from O(log n) to O(1).
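The core idea can be sketched as a toy model (the promotion threshold and all names here are invented for illustration; InnoDB's actual heuristics live in btr0sea.cc):

```python
class AdaptiveHashSketch:
    """Toy AHI: once a key has gone through the B+Tree search path often
    enough, cache a direct key -> location mapping for O(1) lookups."""
    PROMOTE_AFTER = 17                     # invented threshold

    def __init__(self, btree_lookup):
        self.btree_lookup = btree_lookup   # O(log n) fallback path
        self.hits = {}
        self.hash_index = {}               # key -> location, O(1) path

    def search(self, key):
        if key in self.hash_index:
            return self.hash_index[key]    # hash hit: no tree descent
        value = self.btree_lookup(key)     # full B+Tree descent
        self.hits[key] = self.hits.get(key, 0) + 1
        if self.hits[key] >= self.PROMOTE_AFTER:
            self.hash_index[key] = value   # hot enough: build a hash entry
        return value

tree_descents = []
def btree_lookup(key):
    tree_descents.append(key)              # count the expensive lookups
    return ("page", 42)

ahi = AdaptiveHashSketch(btree_lookup)
for _ in range(20):
    ahi.search("user_id=42")
assert "user_id=42" in ahi.hash_index
assert len(tree_descents) == 17            # the last 3 searches were O(1)
```

The sketch also shows why AHI helps only repeated point lookups with a stable search pattern: a key that is never repeated never crosses the promotion threshold.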
7.2 Monitoring AHI
The INSERT BUFFER AND ADAPTIVE HASH INDEX section of SHOW ENGINE INNODB STATUS reports hash searches/s versus non-hash searches/s; the share of hash searches in the total is the AHI hit rate.
7.3 AHI Contention Issues
AHI is protected by rw-locks. Under high concurrency, many threads simultaneously accessing AHI can cause rw-lock contention, manifesting as high CPU usage with throughput actually decreasing.
| Parameter | Default | Description |
|---|---|---|
| innodb_adaptive_hash_index | ON | Whether to enable AHI. Can be changed dynamically. |
| innodb_adaptive_hash_index_parts | 8 | Number of AHI partitions. Increasing it reduces rw-lock contention. Range: 1-512. |
7.4 When to Disable AHI
- AHI hit rate consistently below 5% (wasting memory and CPU)
- Heavy btr_search rw-lock waits in the SEMAPHORES section of SHOW ENGINE INNODB STATUS
- Workload is primarily range scans (AHI helps point lookups, not range scans)
- Queries use many different index prefixes randomly (AHI cannot build effective hashes)
8. Tablespace Architecture
8.1 Tablespace Types Overview
8.2 System Tablespace (ibdata1)
The system tablespace is InnoDB's original tablespace, storing:
- Change Buffer persistent data
- Doublewrite Buffer (before MySQL 8.0.20)
- InnoDB internal data dictionary (before MySQL 8.0)
- User table data, if innodb_file_per_table=OFF
The system tablespace grows on demand but never shrinks, even after data is deleted (a real problem when user tables live in it with innodb_file_per_table=OFF). The only solution is logical backup + instance rebuild. Therefore, always keep innodb_file_per_table=ON (the default since MySQL 5.6.6).
8.3 File-per-Table Tablespaces
When innodb_file_per_table=ON, each InnoDB table's data and indexes are stored in its own .ibd file.
| Advantages | Disadvantages |
|---|---|
| DROP TABLE / TRUNCATE TABLE actually reclaims disk space | Filesystem inode limits (relevant with tens of thousands of tables) |
| Can individually migrate tables with ALTER TABLE ... TABLESPACE | Slight storage overhead for each .ibd file's metadata |
| Can place tables on different disks (via symlinks or DATA DIRECTORY) | Finer fsync granularity (per file) may slightly impact write performance |
| Supports COMPRESSED row format | |
8.4 General Tablespaces
General tablespaces allow organizing related tables in a single file (reducing file descriptor usage) or placing them on specific storage devices.
8.5 Temporary Tablespaces
MySQL 8.0 has two types of temporary tablespaces:
- Global Temporary Tablespace (ibtmp1): used for rollback segments of user-created temporary tables (CREATE TEMPORARY TABLE). Automatically rebuilt on restart.
- Session Temporary Tablespaces (#innodb_temp/): added in MySQL 8.0.13, they store user-created temporary table data and internal temporary tables. One set is allocated per session; the space is reclaimed when the session ends.
8.6 Tablespace Internals: Segment, Extent, Page
InnoDB tablespaces organize disk space internally in a three-level hierarchy:
8.7 .ibd File Space Reclamation & Defragmentation
After massive DELETE operations, .ibd files do not automatically shrink. Deleted space is marked as reusable inside the file, but the file's physical size stays the same. If fragmentation is severe, reclaim space by rebuilding the table with OPTIMIZE TABLE or ALTER TABLE ... ENGINE=InnoDB (both rewrite the .ibd file).
9. Row Formats
9.1 Comparison of Four Row Formats
| Row Format | Since | Variable-Length Storage | Large Column Overflow | Recommended |
|---|---|---|---|---|
| REDUNDANT | Earliest | Full length info stored in-row | First 768 bytes stored in-row | No (legacy only) |
| COMPACT | 5.0 | Compact NULL flags and variable-length field lengths | First 768 bytes stored in-row | Usable but not preferred |
| DYNAMIC | Barracuda format; default since 5.7 | Same as COMPACT | Only a 20-byte pointer in-row; all data in overflow pages | Yes (default, recommended) |
| COMPRESSED | 5.5+ | Same as DYNAMIC, plus page-level compression | Same as DYNAMIC | Special cases (cold data) |
9.2 COMPACT/DYNAMIC Row Internal Structure
9.3 Overflow Page Storage (Off-page)
When a large column in a row (TEXT, BLOB, very long VARCHAR) exceeds the page's storage capacity, InnoDB stores part or all of the data in overflow pages. COMPACT and DYNAMIC handle this differently:
COMPACT / REDUNDANT
First 768 bytes stored in-row, remainder in overflow pages. This means even a primary key lookup may load large TEXT data into Buffer Pool data pages, wasting memory.
DYNAMIC
When a column overflows, only a 20-byte pointer (space ID + page number + offset) is stored in-row, with all data in overflow pages. This allows data pages to hold more rows, dramatically improving Buffer Pool efficiency.
Recommendation: use ROW_FORMAT=DYNAMIC (the MySQL 8.0 default). If your table has TEXT/BLOB columns and you need disk space savings (accepting ~10% CPU overhead), consider COMPRESSED. However, in MySQL 8.0, prefer transparent page compression (COMPRESSION='zstd') over the COMPRESSED row format.
9.4 Hidden Columns in InnoDB Rows
Every InnoDB row has up to three hidden columns, critical for MVCC and internal management:
| Hidden Column | Size | Description |
|---|---|---|
| DB_ROW_ID | 6 bytes | Row ID, added only when the table has no explicit primary key and no non-NULL unique index. InnoDB generates a globally incrementing row_id as the internal primary key. This is why you should always define a primary key: otherwise InnoDB uses a globally mutex-protected counter to generate row_id, which becomes a bottleneck under high-concurrency INSERTs. |
| DB_TRX_ID | 6 bytes | Transaction ID of the last transaction that modified this row. Updated by INSERT/UPDATE/DELETE. The core basis for MVCC version visibility checks. |
| DB_ROLL_PTR | 7 bytes | Roll pointer to the previous version of this row in the undo log. Following DB_ROLL_PTR traverses the version chain to find any historical version. 7 bytes = 1 bit (INSERT/UPDATE flag) + 7 bits (rollback segment ID) + 32 bits (page number) + 16 bits (offset within page). |
9.5 COMPRESSED Row Format Deep Dive
The COMPRESSED row format adds page-level zlib compression on top of DYNAMIC. Each 16KB page is compressed and stored in smaller space. However, InnoDB must maintain both compressed and uncompressed copies of pages in memory:
MySQL 8.0 recommends transparent page compression (CREATE TABLE ... COMPRESSION='zstd') over the COMPRESSED row format:
1. Transparent page compression does not need two copies in memory
2. Supports more efficient compression algorithms (zstd instead of zlib)
3. Relies on filesystem sparse file (punch hole) support โ Linux ext4/xfs/btrfs all support it
4. No extra memory overhead on Buffer Pool
5. Downside: requires filesystem punch hole support; actual savings depend on filesystem block alignment
9.6 Row Size Limits
InnoDB has several important row size limits:
| Limit | Value | Description |
|---|---|---|
| MySQL layer row size limit | 65,535 bytes | Sum of all column declared lengths (excluding BLOB/TEXT) cannot exceed this |
| InnoDB in-page row size limit | ~8000 bytes | Approximately half of page size (16KB/2), ensuring at least 2 rows per page (B+Tree requirement) |
| Overflow trigger threshold | Dynamic | When in-row data exceeds ~8000 bytes, the longest variable-length fields are moved to overflow pages. DYNAMIC format keeps only a 20-byte pointer. |
| Maximum columns | 1017 | Maximum 1017 columns per table (limit after including hidden and virtual columns) |
10. Flush Mechanism
10.1 Why Flush?
InnoDB writes first go to the Buffer Pool (in-memory dirty pages), with transactions guaranteed by the Redo Log. But dirty pages must eventually be written back to disk for two reasons:
- Free Buffer Pool space for new pages
- Advance the checkpoint LSN to free redo log space and shorten crash recovery time
10.2 Two Flush Paths
LRU Flushing
When Buffer Pool free pages are insufficient, page cleaner threads scan from the LRU tail, flush dirty pages, and free them. The goal is to ensure enough free pages for new read requests.
Flush List Flushing
Page cleaner threads flush dirty pages in order of their oldest_modification LSN from the Flush List, advancing the checkpoint. The goal is to ensure the redo log does not fill up.
10.3 Key Flush Parameters
| Parameter | Default | Description | Recommended |
|---|---|---|---|
| innodb_io_capacity | 200 | Maximum I/O operations per second for InnoDB background tasks (flushing, change buffer merging). | SSD: 2000-10000, HDD: 200-400 |
| innodb_io_capacity_max | 2000 | I/O operations cap during emergency flushing (redo log near full). | 2-4x innodb_io_capacity |
| innodb_flush_neighbors | 0 (8.0) | Whether to flush neighboring dirty pages. Useful on HDD (reduces random I/O); pointless on SSD. | SSD: 0, HDD: 1 |
| innodb_adaptive_flushing | ON | Adaptive flushing: InnoDB dynamically adjusts the flush rate based on redo log consumption speed. | ON |
| innodb_page_cleaners | 4 | Number of page cleaner threads. Should equal or exceed innodb_buffer_pool_instances. | = buffer_pool_instances |
| innodb_lru_scan_depth | 1024 | Depth of the LRU tail scan per Buffer Pool instance per page cleaner iteration. | Keep default, or lower to 256-512 under I/O pressure |
10.4 Adaptive Flushing Algorithm
When innodb_adaptive_flushing=ON, InnoDB estimates the ideal flush rate from how fast redo log space is being consumed and how high the dirty page percentage is, instead of flushing at a fixed rate.
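A hedged, much-simplified model of that estimate (the real calculation in buf0flu.cc also tracks the average redo generation rate over recent intervals; the function and variable names here are invented):

```python
def desired_flush_rate(io_capacity: int, dirty_pct: float, max_dirty_pct: float,
                       checkpoint_age: int, redo_capacity: int) -> float:
    """Simplified adaptive-flushing model: the flush rate scales with
    whichever pressure is higher, dirty pages or redo log fill."""
    pct_for_dirty = min(dirty_pct / max_dirty_pct, 1.0) * 100
    pct_for_lsn = (checkpoint_age / redo_capacity) * 100
    return io_capacity * max(pct_for_dirty, pct_for_lsn) / 100

# Low pressure -> gentle flushing; redo log nearly full -> near full capacity.
assert desired_flush_rate(2000, 10, 90, 100 << 20, 4 << 30) < 2000
assert desired_flush_rate(2000, 10, 90, 3 << 30, 4 << 30) >= 1500
```

The point of the model is the max(): a large redo log and a high io_capacity both lower these pressure ratios, which is exactly why the incident below was fixed by raising both.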
Real-world incident: a system ran with innodb_io_capacity=200 and innodb_redo_log_capacity=100MB (both too small). During peak traffic the redo log filled up rapidly, InnoDB triggered emergency synchronous flushing, and all user queries blocked for 5-10 seconds. QPS dropped to near zero, then recovered, and the pattern repeated every few minutes. Solution: increase the redo log to 4GB and innodb_io_capacity to 4000 (SSD). The problem disappeared.
Source ref: storage/innobase/buf/buf0flu.cc → buf_flush_page_cleaner_coordinator()
11. InnoDB Configuration Cheat Sheet
Below are the most critical InnoDB parameters with recommended settings (assuming 64GB RAM server, SSD storage):
| Parameter | Default | Recommended (64G/SSD) | Description |
|---|---|---|---|
| innodb_buffer_pool_size | 128MB | 44G | ~70% of total RAM |
| innodb_buffer_pool_instances | 8 | 16 | ~2.75GB per instance |
| innodb_redo_log_capacity | 100MB | 4G | 8.0.30+; older versions: log_file_size * files |
| innodb_log_buffer_size | 16MB | 64MB | Increase for large transactions or high-concurrency writes |
| innodb_flush_log_at_trx_commit | 1 | 1 | Must be 1 on a primary; replicas can use 2 |
| innodb_io_capacity | 200 | 4000 | 50-75% of SSD IOPS |
| innodb_io_capacity_max | 2000 | 8000 | 2x io_capacity |
| innodb_flush_method | fsync | O_DIRECT | Bypass the OS cache, avoiding double buffering |
| innodb_flush_neighbors | 0 | 0 | SSDs do not benefit from neighbor flushing |
| innodb_page_cleaners | 4 | 16 | = buffer_pool_instances |
| innodb_purge_threads | 4 | 4-8 | Increase for heavy UPDATE/DELETE workloads |
| innodb_file_per_table | ON | ON | Always keep ON |
| innodb_doublewrite | ON | ON | Keep ON unless using ZFS or battery-backed RAID cache |
| innodb_adaptive_hash_index | ON | ON | Disable if contention is observed |
| innodb_change_buffer_max_size | 25 | 25 | Set to 0 if there are no secondary indexes or all are unique |
| innodb_undo_tablespaces | 2 | 2-4 | More tablespaces improve concurrency and truncation |
| innodb_undo_log_truncate | ON | ON | Auto-reclaim undo space |
| innodb_print_all_deadlocks | OFF | ON | Log all deadlocks to the error log |
| innodb_deadlock_detect | ON | ON | Consider OFF plus lock-wait timeout for high-concurrency hotspot rows |
12. Monitoring & Troubleshooting
12.1 SHOW ENGINE INNODB STATUS Full Walkthrough
This is InnoDB's most important diagnostic tool. Its output is divided into sections, each covering a different dimension of the engine's state:
(1) SEMAPHORES Section
(2) TRANSACTIONS Section
(3) FILE I/O Section
(4) ROW OPERATIONS Section
12.2 information_schema Monitoring Tables
12.3 InnoDB Tables in performance_schema
| Table | Purpose |
|---|---|
| data_locks | All currently held and waited-for InnoDB locks |
| data_lock_waits | Lock wait relationships: which transaction waits for which |
| events_waits_summary_global_by_event_name | InnoDB mutex/rw-lock wait statistics (search for wait/synch/mutex/innodb) |
| file_summary_by_instance | Per-.ibd-file I/O statistics (read/write bytes, latency) |
| table_io_waits_summary_by_table | Per-table I/O wait statistics |
12.4 Common Troubleshooting Checklist
| Symptom | Possible Cause | Diagnostic | Solution |
|---|---|---|---|
| QPS periodic drops | Redo log full causing sync flush | checkpoint_age approaching redo log capacity | Increase redo log + io_capacity |
| Low Buffer Pool hit rate | Pool too small or full-scan pollution | SHOW ENGINE INNODB STATUS → hit rate | Increase pool size or optimize queries |
| History list keeps growing | Long transaction blocking purge | INNODB_TRX → trx_started | Find and kill the long transaction |
| ibdata1 bloating | innodb_file_per_table=OFF | Check my.cnf | Enable it, then rebuild logically (dump and reload) |
| High CPU but low throughput | AHI rw-lock contention | SEMAPHORES → btr0sea.ic | Disable AHI or raise innodb_adaptive_hash_index_parts |
| Undo tablespace ballooning | Long transaction + heavy concurrent writes | INNODB_TRX + undo file size | Kill the long transaction + truncate undo |
13. Frequently Asked Questions
Q1: What percentage of total RAM should innodb_buffer_pool_size be?
The general rule is 60-80% of total RAM, but it depends on the scenario:
- Dedicated database server (MySQL only): 70-80%
- Shared server (with app, cache, etc.): 50-60%
- Small RAM (≤4GB): 50%
Remaining memory is for: OS file cache, per-connection memory (sort_buffer, join_buffer, thread stack, ~2-10MB per connection), other processes. A common mistake is setting buffer_pool_size to 90%+, causing the OS to swap, which devastates performance.
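The rules of thumb above can be condensed into a tiny sizing helper. This is a Python sketch of the heuristic only (the exact percentages are judgment calls, not InnoDB requirements), not a substitute for measuring your workload:

```python
def suggest_buffer_pool_gb(total_ram_gb, dedicated=True):
    """Rule of thumb from the text: ~75% on a dedicated database server,
    ~55% on a shared box, 50% on small-RAM (<= 4GB) machines."""
    if total_ram_gb <= 4:
        pct = 0.50
    elif dedicated:
        pct = 0.75
    else:
        pct = 0.55
    return round(total_ram_gb * pct)

print(suggest_buffer_pool_gb(64))                   # 48
print(suggest_buffer_pool_gb(64, dedicated=False))  # 35
```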
Q2: Why does InnoDB use B+Tree instead of B-Tree or hash indexes?
B+Tree advantages over B-Tree:
- All data in leaf nodes: non-leaf nodes store only keys and pointers, giving higher fan-out, a shorter tree, and fewer disk I/Os
- Leaf nodes form a doubly-linked list: range scans (BETWEEN, ORDER BY) traverse the list without backtracking
- Stable query performance: all lookups go to leaf level, consistent path length (O(log n))
Hash indexes offer O(1) point lookups but cannot support range scans, sorting, or prefix matching, all of which are extremely common in OLTP. InnoDB uses hashing as a supplement (the AHI), not as the primary structure.
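To make the fan-out argument concrete, here is a back-of-envelope sketch. The figures (roughly 1200 child pointers per internal page and 100 rows per leaf for a 16KB page) are illustrative assumptions, not exact InnoDB constants:

```python
import math

def bplus_tree_height(n_rows, internal_fanout, rows_per_leaf):
    """Levels in the tree: one leaf level holding all rows, plus enough
    internal levels for the fan-out to cover every leaf page."""
    leaves = math.ceil(n_rows / rows_per_leaf)
    if leaves <= 1:
        return 1
    return 1 + math.ceil(math.log(leaves, internal_fanout))

# ~100M rows, ~1200 pointers per internal page, ~100 rows per leaf:
print(bplus_tree_height(100_000_000, 1200, 100))  # 3
```

A three-level tree means a point lookup touches at most 3 pages, and with the upper levels cached in the Buffer Pool, typically a single disk read.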
Q3: Can innodb_flush_log_at_trx_commit=2 and sync_binlog=1 be used together?
Technically yes, but usually pointless. sync_binlog=1 guarantees binlog fsync on every commit, but flush_log_at_trx_commit=2 allows redo log loss. Crash recovery depends on consistency between redo log and binlog: if redo is lost but binlog exists, primary-replica inconsistency results.
Recommended combinations:
- Primary: flush_log_at_trx_commit=1 + sync_binlog=1 (dual-1, safest)
- Replica: flush_log_at_trx_commit=2 + sync_binlog=0 (relaxed, better performance)
Q4: How to determine if the Buffer Pool is large enough?
Three key indicators:
- Buffer Pool hit rate: "Buffer pool hit rate" in SHOW ENGINE INNODB STATUS; should be >= 999/1000 (99.9%).
- Free buffers: if persistently 0 but the hit rate is still high, the pool size is just right; if free=0 and the hit rate is low, increase it.
- Logical vs physical reads: SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'. Compare Innodb_buffer_pool_read_requests (logical reads) with Innodb_buffer_pool_reads (physical reads); the higher the ratio, the better.
Ultimate test: if your entire dataset fits in the Buffer Pool (check total Data_length + Index_length from SHOW TABLE STATUS), set it large enough to hold everything.
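The third indicator reduces to a one-line calculation. A minimal Python sketch (function name is illustrative):

```python
def buffer_pool_hit_rate(read_requests, physical_reads):
    """Fraction of logical reads served from the Buffer Pool,
    given the two Innodb_buffer_pool_read* counters."""
    if read_requests == 0:
        return 1.0
    return 1 - physical_reads / read_requests

# 1,000,000 logical reads, 500 went to disk -> 99.95%, above the 99.9% bar
print(buffer_pool_hit_rate(1_000_000, 500))  # 0.9995
```

In practice, compute this over a time window (deltas between two samples), since the raw counters accumulate from server start and hide recent regressions.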
Q5: Does MVCC completely avoid locks?
No. MVCC only avoids read-write conflicts (reads don't block writes, writes don't block reads), but write-write conflicts still require locks. Specifically:
- Consistent reads (plain SELECT): read historical version via MVCC, no lock
- Current reads (SELECT ... FOR UPDATE / LOCK IN SHARE MODE): locking read, reads latest version
- INSERT / UPDATE / DELETE: first read latest version (current read), then take exclusive lock to modify
Thus MVCC + locking is InnoDB's complete concurrency control mechanism: they complement, not replace, each other.
Q6: Why is innodb_flush_method=O_DIRECT recommended?
With the default fsync method, InnoDB's write path is: Buffer Pool → OS Page Cache → Disk. Data is held twice in memory (InnoDB Buffer Pool + OS Page Cache), wasting memory.
O_DIRECT bypasses the OS Page Cache: Buffer Pool → Disk directly. Benefits:
- Avoid double caching, freeing valuable OS memory
- Reduce OS Page Cache CPU overhead (cache lookups, LRU management)
- Especially effective with large Buffer Pools
Note: O_DIRECT still uses fsync for redo log. In MySQL 8.0.26+ you can use O_DIRECT_NO_FSYNC, which performs better on some filesystems.
Q7: Why does InnoDB need Doublewrite while PostgreSQL doesn't?
They use different strategies for the torn page problem:
- InnoDB: Redo Log records physiological logs (offset + delta within a page). If the page itself is corrupted, redo cannot replay correctly. Thus Doublewrite provides an intact page copy as the replay base.
- PostgreSQL: On the first modification of a page after a checkpoint, it writes a full page image (Full Page Write, FPW) in the WAL. Even if the page is corrupted, the full page can be recovered from the WAL. The cost is larger WAL size.
Each approach has trade-offs: InnoDB's redo log is more compact but requires extra Doublewrite I/O; PostgreSQL's WAL is larger but architecturally simpler.
Q8: How to resize innodb_buffer_pool_size online?
MySQL 5.7+ supports online Buffer Pool resizing without restart:
SET GLOBAL innodb_buffer_pool_size = 48 * 1024 * 1024 * 1024; -- 48GB
Important notes:
- Resizing happens in units of innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances (default 128MB * 8 = 1GB)
- If the value is not a multiple of this unit, MySQL rounds up automatically
- Growing is usually fast; shrinking can be slow (pages must be evicted)
- Resizing does not block queries but may cause brief performance fluctuations
- Monitor progress via SHOW STATUS LIKE 'Innodb_buffer_pool_resize_status'
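The rounding rule from the first two bullets can be sketched in a few lines of Python (defaults assume the stock 128MB chunk size and 8 instances):

```python
def effective_buffer_pool_size(requested_bytes,
                               chunk_size=128 * 2**20,  # 128MB default
                               instances=8):
    """Round the requested size up to a multiple of chunk_size * instances,
    mirroring what MySQL does when applying SET GLOBAL innodb_buffer_pool_size."""
    unit = chunk_size * instances  # 1GB with the defaults
    return -(-requested_bytes // unit) * unit  # ceiling division

GB = 2**30
print(effective_buffer_pool_size(9 * GB) // GB)      # 9
print(effective_buffer_pool_size(9 * GB + 1) // GB)  # 10
```

This is why requesting, say, 44.5GB on a default configuration actually allocates 45GB: anything past a 1GB boundary rounds up to the next one.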
Q9: How does InnoDB solve phantom reads under RR isolation?
InnoDB solves phantom reads under RR through two mechanisms:
- MVCC (snapshot reads): a plain SELECT uses the ReadView from the transaction's start. Even if other transactions insert new rows, the snapshot read cannot see them, so no phantom reads occur.
- Next-Key Lock (current reads): SELECT ... FOR UPDATE / INSERT / UPDATE / DELETE uses current reads. InnoDB uses Next-Key Locks (Record Lock + Gap Lock) to lock gaps in the query range, preventing other transactions from inserting new rows.
Caveat: RR does not perfectly prevent phantom reads in all cases. A classic counterexample: a transaction does a snapshot read (doesn't see new rows), then an UPDATE (current read, affects new rows), then a snapshot read again (may now see them, because the UPDATE changed the new rows' trx_id to the current transaction's ID).
Q10: What major InnoDB architectural improvements were made in MySQL 8.0?
MySQL 8.0's InnoDB improvements are historic:
- Atomic data dictionary: migrated from file-based .frm to InnoDB's transactional data dictionary; DDL operations become atomic (crash-safe)
- Redo Log redesign (8.0.30+): from fixed-size files to a dynamic #innodb_redo/ directory, unified under innodb_redo_log_capacity
- Separate Doublewrite files (8.0.20+): extracted from ibdata1 into standalone .dblwr files, enabling parallel writes
- Instant ADD COLUMN (8.0.12+): certain ALTER TABLE ADD COLUMN operations complete instantly without table rebuild
- Invisible indexes: ALTER TABLE ... ALTER INDEX idx INVISIBLE to test the impact of dropping an index without actually dropping it
- Descending indexes: true descending B+Tree indexes, optimizing ORDER BY ... DESC queries
- Session temporary tablespaces (8.0.15+): per-session temporary tablespace files, reclaimable after disconnect
- Undo Tablespace management: online create/drop undo tablespace, automatic truncation
- Parallel reads (8.0.14+): SELECT COUNT(*) and other full table scans can read multiple pages in parallel
12.5 Key InnoDB Status Variables
Besides SHOW ENGINE INNODB STATUS, these SHOW GLOBAL STATUS variables are most valuable for daily monitoring:
| Variable | Meaning | Healthy Threshold |
|---|---|---|
| Innodb_buffer_pool_read_requests | Total logical read requests to the Buffer Pool | Ratio to reads > 99.9% |
| Innodb_buffer_pool_reads | Buffer Pool misses requiring a disk read | Lower is better |
| Innodb_buffer_pool_pages_dirty | Dirty page count in the Buffer Pool | Should be stable, not continuously growing |
| Innodb_buffer_pool_wait_free | Waits for free pages (Free list empty) | Should be 0 or near 0 |
| Innodb_log_waits | Waits due to the Log Buffer being too small | Should be 0; if > 0, increase log_buffer_size |
| Innodb_os_log_written | Bytes written to the Redo Log | Use to calculate the write rate and size the redo log |
| Innodb_row_lock_time_avg | Average row lock wait time (ms) | < 10ms; > 100ms needs investigation |
| Innodb_row_lock_waits | Row lock wait count | Trend matters more than the absolute value |
| Innodb_rows_read / inserted / updated / deleted | InnoDB-layer row operation statistics | Use for workload profiling |
| Innodb_data_reads / writes | Physical I/O operation count | Compare with logical reads to measure cache efficiency |
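As an example of using Innodb_os_log_written for sizing: sample the counter twice, derive the redo write rate, and apply the common rule of thumb of provisioning roughly an hour of peak redo. A Python sketch (the one-hour headroom is a convention, not an InnoDB requirement):

```python
def redo_capacity_for_peak(os_log_written_t0, os_log_written_t1,
                           interval_seconds, headroom_minutes=60):
    """Estimate innodb_redo_log_capacity from two samples of the
    Innodb_os_log_written counter taken interval_seconds apart."""
    rate = (os_log_written_t1 - os_log_written_t0) / interval_seconds
    return int(rate * headroom_minutes * 60)

MB = 2**20
# 60MB of redo written in 60s -> 1MB/s -> size for ~1 hour -> 3600MB
print(redo_capacity_for_peak(0, 60 * MB, 60) // MB)  # 3600
```

Take the samples during peak load; an average over a quiet day understates the capacity needed to avoid the synchronous-flush stalls described in section 10.4.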
12.6 Prometheus / Grafana Monitoring Recommendations
In production, mysqld_exporter typically exposes InnoDB metrics to Prometheus, visualized with Grafana. Recommended core dashboard panels include:
| Panel | Metrics | Alert Threshold |
|---|---|---|
| Buffer Pool Hit Rate | rate(mysql_global_status_innodb_buffer_pool_read_requests[5m]), rate(mysql_global_status_innodb_buffer_pool_reads[5m]) | Alert if < 99.5% |
| Dirty Pages % | mysql_global_status_innodb_buffer_pool_pages_dirty / mysql_global_status_innodb_buffer_pool_pages_total | Alert if > 75% |
| Checkpoint Age | mysql_global_status_innodb_checkpoint_age | Alert if > 80% of redo capacity |
| History List Length | mysql_global_status_innodb_history_list_length | Alert if > 10000 |
| Row Lock Waits/s | rate(mysql_global_status_innodb_row_lock_waits[5m]) | Alert if > 100/s |
| Long Transactions | mysql_info_schema_innodb_trx (custom query) | Alert if > 300 seconds |
| Redo Log Write Rate | rate(mysql_global_status_innodb_os_log_written[5m]) | Trend monitoring (alert on anomaly) |
| Pending I/O | mysql_global_status_innodb_data_pending_reads, mysql_global_status_innodb_data_pending_writes | Alert if > 0 for 5 minutes |