Client SQL
   │
   ▼
MySQL Server Layer (parser → optimizer → executor)
   │
   ▼ handler::ha_read_row()
InnoDB Handler Interface
   │
   ├──▶ Check Adaptive Hash Index ──▶ HIT? → return row
   │        MISS ↓
   ├──▶ Check Buffer Pool LRU ──────▶ HIT? → return page
   │        MISS ↓
   ├──▶ Read page from .ibd file (OS read / AIO)
   │      └──▶ Place page into Buffer Pool (Old sublist head)
   │            └──▶ If accessed again after innodb_old_blocks_time → move to Young
   │
   ▼
Return row to Server Layer → Client
1. BEGIN (implicitly or explicitly)
2. Lock the target row(s) → record lock / gap lock / next-key lock
3. Write old row image to Undo Log (for rollback + MVCC)
4. Modify the page IN the Buffer Pool (in-memory only)
5. Write redo log record to Log Buffer
6. COMMIT:
   a. Flush Log Buffer → Redo Log files on disk (fsync per innodb_flush_log_at_trx_commit)
   b. Mark transaction committed in redo
   c. The dirty page stays in Buffer Pool → will be flushed later
7. Background threads (page cleaner) eventually:
   a. Write dirty page to Doublewrite Buffer (on disk)
   b. fsync Doublewrite Buffer
   c. Write page to actual .ibd file location
   d. fsync .ibd
   e. Advance checkpoint LSN
InnoDB's Buffer Pool does not use the textbook LRU (Least Recently Used) algorithm. Instead it uses a modified two-segment LRU, designed to stop operations such as full table scans from sweeping hot data out of the cache.
Buffer Pool LRU List
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│ HEAD (MRU)                                           TAIL (LRU)     │
│ ◄────────── Young Sublist (5/8) ──────────►│◄──── Old (3/8) ────►   │
│                                            │                        │
│ [HOT] [HOT] [HOT] ... [WARM] [WARM]        │ [OLD] [OLD] [OLD]      │
│   ▲                                        │   ▲                    │
│   │ Page accessed again                    │   │ New page from      │
│   │ (after old_blocks_time)                │   │ disk lands HERE    │
│   └────────────────────────────────────────┘   │ ("midpoint         │
│               Promotion                        │  insertion")       │
│                                                │                    │
│                    Eviction from tail ──► OUT  │                    │
└─────────────────────────────────────────────────────────────────────┘
Why a two-segment LRU?

Consider a typical production scenario: the Buffer Pool holds the last few hours of hot data (index pages of your user and order tables). Then someone runs SELECT * FROM large_table (a full table scan), which reads hundreds of thousands of data pages. Under a standard LRU, these "cold" pages, read once and never touched again, would push every genuinely "hot" page out of the list, and subsequent normal queries would all hit disk. This is what it means for the Buffer Pool to be "swept".

InnoDB's solution:

- A page newly read from disk is inserted at the head of the Old sublist ("midpoint insertion"), not at the head of the whole LRU.
- Only if the page is accessed again after spending more than innodb_old_blocks_time (default 1000ms) in the Old sublist is it promoted to the Young sublist.
- Full-scan pages are usually read and discarded within 1000ms, so they never enter the Young sublist, and the hot data is protected.
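The promotion rule above can be sketched in Python. This is a toy model, not InnoDB's actual implementation: the class name is mine, timestamps are passed in explicitly, and the exact 5/8 vs 3/8 sublist balancing is omitted.

```python
class MidpointLRU:
    """Toy model of InnoDB's two-segment LRU with midpoint insertion."""

    def __init__(self, capacity, old_blocks_ms=1000):
        self.capacity = capacity
        self.old_blocks_ms = old_blocks_ms
        self.young = []        # index 0 = MRU head
        self.old = []          # index 0 = midpoint insertion point
        self.entered_old = {}  # page -> time it entered the Old sublist

    def access(self, page, now_ms):
        if page in self.young:
            self.young.remove(page)
            self.young.insert(0, page)          # already hot: move to MRU head
        elif page in self.old:
            if now_ms - self.entered_old[page] >= self.old_blocks_ms:
                self.old.remove(page)           # survived the window: promote
                del self.entered_old[page]
                self.young.insert(0, page)
            # else: re-accessed too soon (typical of a scan) -> stays in Old
        else:
            # miss: evict from the Old tail if full, then midpoint-insert
            while len(self.young) + len(self.old) >= self.capacity:
                victim = self.old.pop() if self.old else self.young.pop()
                self.entered_old.pop(victim, None)
            self.old.insert(0, page)
            self.entered_old[page] = now_ms
```

Running a burst of scan pages through this model shows the key property: pages touched only within the time window never displace the Young sublist.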
Key parameters

| Parameter | Default | Description |
|---|---|---|
| innodb_old_blocks_pct | 37 | Percentage of the Buffer Pool given to the Old sublist. Default 37% (about 3/8); valid range 5-95. |
| innodb_old_blocks_time | 1000 | Milliseconds a page must remain in the Old sublist before it can be promoted to Young. 0 disables the protection. |
2.3 Buffer Pool Instances

With a large Buffer Pool (commonly 32GB-128GB in production), the mutex on a single Buffer Pool becomes a concurrency bottleneck. InnoDB can split the Buffer Pool into multiple independent instances, each with its own LRU list, Free list, Flush list, and mutex.
| Parameter | Default | Recommendation |
|---|---|---|
| innodb_buffer_pool_instances | 8 (when pool >= 1GB) | At least 1GB per instance; a 128GB pool → 16 instances. |
| innodb_buffer_pool_chunk_size | 128MB | Unit for online pool resizing; pool_size should be a multiple of chunk_size * instances. |
Note a MySQL 8.0 change: in MySQL 8.0.26+, when innodb_buffer_pool_size < 1GB, innodb_buffer_pool_instances is forced to 1 regardless of what you set, because a small Buffer Pool has no mutex contention problem.
2.4 Buffer Pool Warmup and Dump/Load

After a MySQL restart the Buffer Pool is empty (a cold start), and query performance only recovers gradually over 30-60 minutes. InnoDB provides an automatic dump/load mechanism to speed up the warmup:
-- Enable automatic dump on shutdown and load on startup
SET GLOBAL innodb_buffer_pool_dump_at_shutdown = ON;
SET GLOBAL innodb_buffer_pool_load_at_startup = ON;
-- Manual dump (writes to ib_buffer_pool file)
SET GLOBAL innodb_buffer_pool_dump_now = ON;
-- Check dump/load progress
SHOW STATUS LIKE 'Innodb_buffer_pool_dump_status';
SHOW STATUS LIKE 'Innodb_buffer_pool_load_status';

-- Dump file content example (tablespace_id,page_number):
-- 0,7
-- 0,8
-- 12,0
-- 12,1
-- Key metrics from SHOW ENGINE INNODB STATUS (BUFFER POOL AND MEMORY section)
----------------------
BUFFER POOL AND MEMORY
----------------------
Total large memory allocated 137363456   -- Total memory allocated to Buffer Pool
Dictionary memory allocated 622013       -- Memory for data dictionary
Buffer pool size   8192                  -- Total pages (8192 * 16KB = 128MB)
Free buffers       1024                  -- Free pages not yet used
Database pages     7137                  -- Pages containing data
Old database pages 2614                  -- Pages in Old sublist
Modified db pages  123                   -- Dirty pages awaiting flush
Pending reads 0                          -- Pages waiting to be read from disk
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 18623, not young 2045921  -- Promotions vs blocked promotions
3.47 youngs/s, 412.05 non-youngs/s         -- Rate of promotions
Pages read 16842, created 5765, written 29412
0.00 reads/s, 0.24 creates/s, 1.86 writes/s
Buffer pool hit rate 1000 / 1000         -- 1000/1000 = 100% hit rate (ideal)
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
Production metrics to watch:
1. Buffer pool hit rate: should stay at 999/1000 or above; anything lower means the Buffer Pool is too small or a large query is sweeping it.
2. Free buffers: if this stays at 0 for long periods, the Buffer Pool is full and every new page must first evict an old one.
3. Modified db pages: if this keeps growing, flushing cannot keep up with the write rate.
4. Pages made young / not young: not young far exceeding young shows the innodb_old_blocks_time protection is doing its job.
Buffer Pool Internal Lists:
┌─────────────────────────────────────────────────────────────────┐
│ 1. FREE LIST                                                    │
│    Contains pages that are completely free (never used or       │
│    evicted). When InnoDB needs a new page, it takes one from    │
│    here first. When this is empty, LRU eviction kicks in.       │
│    ┌────┐  ┌────┐  ┌────┐  ┌────┐                               │
│    │Free│→ │Free│→ │Free│→ │Free│→ NULL                         │
│    └────┘  └────┘  └────┘  └────┘                               │
│                                                                 │
│ 2. LRU LIST                                                     │
│    Contains all pages currently in use (clean + dirty).         │
│    Organized as Young (5/8) + Old (3/8).                        │
│    Used for page eviction when the FREE list is empty.          │
│    ┌────┐  ┌────┐     ┌────┐  ┌────┐  ┌────┐                    │
│    │Hot │→ │Hot │ →→  │Old │→ │Old │→ │Old │→ evict             │
│    └────┘  └────┘     └────┘  └────┘  └────┘                    │
│    Young (MRU) ──────│─────── Old (LRU) ──────►                 │
│                                                                 │
│ 3. FLUSH LIST                                                   │
│    Contains only DIRTY pages, ordered by oldest_modification    │
│    LSN (oldest first). The page cleaner flushes from here to    │
│    advance the checkpoint.                                      │
│    ┌────┐  ┌────┐  ┌────┐  ┌────┐                               │
│    │LSN │→ │LSN │→ │LSN │→ │LSN │→ NULL                         │
│    │100 │  │200 │  │300 │  │400 │                               │
│    └────┘  └────┘  └────┘  └────┘                               │
│    oldest ────────────────────────► newest                      │
│                                                                 │
│ A dirty page appears in BOTH the LRU list AND the Flush list.   │
│ A clean page appears ONLY in the LRU list.                      │
└─────────────────────────────────────────────────────────────────┘
When more than 13 pages of an extent are in the Buffer Pool's Young sublist, InnoDB prefetches the remaining pages of that extent. This is disabled by default (innodb_random_read_ahead=OFF) because it shows no clear benefit for most workloads.
-- Monitor read-ahead effectiveness:
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_ahead';          -- Pages pre-read
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_ahead_evicted';  -- Pre-read but never accessed

-- If read_ahead_evicted / read_ahead is high (>50%),
-- read-ahead is wasting I/O and should be tuned down.
-- Increase innodb_read_ahead_threshold (up to 63) to be more conservative.
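As a quick sanity check, the waste ratio mentioned in the comments above can be computed directly. A trivial helper; the 50% cutoff is the rule of thumb from the comments, not an InnoDB constant, and the function names are mine:

```python
def read_ahead_waste(pages_pre_read: int, evicted_unaccessed: int) -> float:
    """Fraction of pre-read pages that were evicted before ever being used."""
    if pages_pre_read == 0:
        return 0.0
    return evicted_unaccessed / pages_pre_read

def read_ahead_is_wasteful(pages_pre_read: int, evicted_unaccessed: int,
                           limit: float = 0.5) -> bool:
    # Above the limit, consider raising innodb_read_ahead_threshold.
    return read_ahead_waste(pages_pre_read, evicted_unaccessed) > limit
```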
2.10 Diagnosing Buffer Pool Mutex and Latch Contention

Under high concurrency (>1000 QPS), the Buffer Pool's internal mutexes can become a bottleneck. The main ones are:
| Mutex / Latch | Protected resource | Contention symptom | Mitigation |
|---|---|---|---|
| buf_pool->mutex | Global lock for a Buffer Pool instance | Broad contention under high read/write concurrency | Increase innodb_buffer_pool_instances |
| buf_pool->LRU_list_mutex | LRU list | Contention during page eviction/promotion | More instances + tune queries to reduce eviction |
| buf_pool->flush_list_mutex | Flush list | Contention under heavy write load | More instances + more page_cleaners |
| hash_lock (page hash) | Page hash table (is the page in the pool?) | Contention under very high point-query concurrency | Partitioned in MySQL 8.0; usually no action needed |
-- Diagnose Buffer Pool mutex waits:
SELECT EVENT_NAME, COUNT_STAR,
       SUM_TIMER_WAIT/1000000000 AS total_wait_ms,
       AVG_TIMER_WAIT/1000000 AS avg_wait_us
FROM performance_schema.events_waits_summary_global_by_event_name
WHERE EVENT_NAME LIKE 'wait/synch/%/innodb/buf_pool%'
   OR EVENT_NAME LIKE 'wait/synch/%/innodb/hash_table%'
ORDER BY SUM_TIMER_WAIT DESC;
LSN relationships (must always hold):

  Last checkpoint at ≤ Pages flushed up to ≤ Log flushed up to ≤ Log sequence number

Example from SHOW ENGINE INNODB STATUS:
---
LOG
---
Log sequence number 3456789012
Log buffer assigned up to 3456789012
Log buffer completed up to 3456789012
Log written up to 3456789012
Log flushed up to 3456789012
Added dirty pages up to 3456789012
Pages flushed up to 3456780000
Last checkpoint at  3456780000
                    ▲ Gap = 9012 bytes of redo not yet checkpointed;
                      crash recovery would need to replay these 9012 bytes
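The gap in this example is plain LSN arithmetic; a one-line helper makes the invariant explicit (the function and parameter names are mine, not server variables):

```python
def checkpoint_age(log_sequence_number: int, last_checkpoint_lsn: int) -> int:
    """Bytes of redo that crash recovery would have to replay."""
    if last_checkpoint_lsn > log_sequence_number:
        raise ValueError("LSN ordering violated: checkpoint cannot be ahead")
    return log_sequence_number - last_checkpoint_lsn
```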
3.4 The Checkpoint Mechanism

A checkpoint is the process of flushing dirty pages from the Buffer Pool to disk and advancing the checkpoint LSN. It serves two purposes: it bounds crash-recovery time (only redo written after the checkpoint must be replayed), and it frees redo log space before the checkpoint for reuse.
innodb_flush_log_at_trx_commit values:

┌─────────┬──────────────────────────────────────────────────────────────┐
│ Value   │ Behavior                                                     │
├─────────┼──────────────────────────────────────────────────────────────┤
│         │ COMMIT → write to Log Buffer                                 │
│   0     │ Background thread flushes to OS cache + fsync every 1 sec    │
│         │ ⚠ Up to 1 second of committed transactions lost on crash     │
│         │ ★ Fastest — no sync on commit                                │
├─────────┼──────────────────────────────────────────────────────────────┤
│         │ COMMIT → write to OS cache → fsync to disk                   │
│   1     │ Every single commit triggers an fsync                        │
│         │ ✓ ZERO data loss on crash (ACID compliant)                   │
│         │ ★ Slowest — fsync on every commit                            │
├─────────┼──────────────────────────────────────────────────────────────┤
│         │ COMMIT → write to OS cache (no fsync)                        │
│   2     │ Background thread fsyncs every 1 second                      │
│         │ ⚠ Data safe on MySQL crash, lost on OS/power crash           │
│         │ ★ Medium — good for replicas                                 │
└─────────┴──────────────────────────────────────────────────────────────┘
-- How to estimate proper redo log size:
-- 1. During peak load, check LSN growth over 1 hour:
SHOW ENGINE INNODB STATUS;  -- note "Log sequence number" = LSN_start
-- wait 1 hour
SHOW ENGINE INNODB STATUS;  -- note "Log sequence number" = LSN_end
-- redo_per_hour = LSN_end - LSN_start
-- Recommended total redo size = redo_per_hour (covers 1 hour of writes),
-- so a checkpoint is forced at most about once per hour.
-- Example: LSN grew by 2GB in 1 hour → set innodb_redo_log_capacity = 2G

-- 2. Alternative: check checkpoint age
SELECT
  (SELECT VARIABLE_VALUE FROM performance_schema.global_status
    WHERE VARIABLE_NAME = 'Innodb_redo_log_current_lsn') -
  (SELECT VARIABLE_VALUE FROM performance_schema.global_status
    WHERE VARIABLE_NAME = 'Innodb_redo_log_checkpoint_lsn')
  AS checkpoint_age;
-- If checkpoint_age frequently exceeds 75% of redo log capacity,
-- you need larger redo logs.
Two-Phase Commit with Binlog:

Phase 1: PREPARE
  ├─ Write redo log with PREPARE state
  └─ fsync redo log

Phase 2: COMMIT
  ├─ Write transaction to binlog
  ├─ fsync binlog (group commit: multiple txns in one fsync)
  └─ Write redo log with COMMIT state (no fsync needed — binlog is the authority)

Crash Recovery Logic:
  ├─ If redo=PREPARE + binlog=YES → COMMIT   (transaction was successful)
  ├─ If redo=PREPARE + binlog=NO  → ROLLBACK (transaction incomplete)
  └─ If redo=COMMIT               → already committed
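The recovery decision table above can be written as a small function. This is a sketch of the decision logic only; the enum and state names are mine:

```python
from enum import Enum

class RedoState(Enum):
    PREPARE = 1
    COMMIT = 2

def recovery_decision(redo_state: RedoState, in_binlog: bool) -> str:
    """Outcome for one transaction found during crash recovery."""
    if redo_state is RedoState.COMMIT:
        return "already committed"   # nothing to do
    if in_binlog:
        return "commit"              # binlog is the authority
    return "rollback"                # prepared but never reached the binlog
```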
User Transaction vs Mini-Transaction:

BEGIN;                            -- User transaction starts
INSERT INTO t VALUES (1, 'a');    -- mtr1: modify leaf page + undo page
INSERT INTO t VALUES (2, 'b');    -- mtr2: modify leaf page + undo page
INSERT INTO t VALUES (3, 'c');    -- mtr3: page split! modifies 3 pages
                                  --   3a: allocate new page
                                  --   3b: move half the records to the new page
                                  --   3c: update parent page pointer
                                  --   These 3 writes are ONE mtr
COMMIT;                           -- User transaction commits

In the redo log:
  [mtr1 records][MLOG_MULTI_REC_END]
  [mtr2 records][MLOG_MULTI_REC_END]
  [mtr3 records][MLOG_MULTI_REC_END]
  ... (commit record)
3.10 The Crash Recovery Flow

When MySQL restarts after an abnormal shutdown, InnoDB's crash recovery proceeds in the following steps:
Crash Recovery Phases:

Phase 1: REDO (Forward Recovery)
  ├─ Read redo log from last checkpoint LSN to end
  ├─ Parse redo records and apply them to data pages
  │   ├─ If page's LSN >= record's LSN → skip (already applied)
  │   └─ If page's LSN <  record's LSN → apply the redo record
  ├─ This restores ALL modifications (both committed and uncommitted)
  └─ After this phase: database pages are in the state at crash time

Phase 2: UNDO (Backward Recovery)
  ├─ Scan the undo log for transactions active (uncommitted) at crash time
  ├─ Roll back each uncommitted transaction using undo records
  ├─ This ensures atomicity: uncommitted changes are reversed
  └─ After this phase: only committed data remains

Phase 3: Purge
  ├─ Clean up undo records from rolled-back transactions
  └─ Background purge threads handle this after recovery completes

Recovery time depends on:
  - Amount of redo to replay (checkpoint_age at crash)
  - Number of uncommitted transactions to roll back
  - I/O speed of the storage device
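The per-page LSN comparison in Phase 1 can be sketched like this. These are toy data structures: real redo records describe physical page changes, not whole values, but the skip-or-apply rule is the same:

```python
def apply_redo(pages: dict, records: list) -> dict:
    """Phase-1 forward recovery: apply a redo record only if the page is older.

    pages:   {page_id: (page_lsn, value)}
    records: [(lsn, page_id, new_value)] in log order
    """
    for lsn, page_id, new_value in records:
        page_lsn, _ = pages.get(page_id, (0, None))
        if page_lsn >= lsn:
            continue                       # page reached disk before the crash: skip
        pages[page_id] = (lsn, new_value)  # replay the modification
    return pages
```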
ReadView Visibility Algorithm (simplified):

function is_visible(row_trx_id, read_view):
    if row_trx_id == read_view.m_creator_trx_id:
        return true   // I modified this row myself
    if row_trx_id < read_view.m_up_limit_id:
        return true   // Transaction committed before ReadView was created
    if row_trx_id >= read_view.m_low_limit_id:
        return false  // Transaction started after ReadView was created
    // m_up_limit_id <= row_trx_id < m_low_limit_id
    if row_trx_id in read_view.m_ids:
        return false  // Transaction was active (uncommitted) when ReadView was created
    else:
        return true   // Transaction committed before ReadView was created
                      // (its ID is between up and low, but not in the active list)

If not visible → follow DB_ROLL_PTR to the undo log and check the previous version.
Repeat until a visible version is found or DB_ROLL_PTR is NULL (the row didn't exist).
Source reference: storage/innobase/include/read0types.h (class ReadView), storage/innobase/row/row0sel.cc
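The pseudocode above transcribes directly into runnable Python. Field names follow the ReadView members; the surrounding version-chain traversal is omitted:

```python
def is_visible(row_trx_id: int, creator_trx_id: int,
               up_limit_id: int, low_limit_id: int,
               active_ids: set) -> bool:
    """MVCC visibility check for one row version against one ReadView.

    up_limit_id:  smallest trx id still active when the ReadView was created
    low_limit_id: next trx id to be assigned at creation time
    active_ids:   trx ids active (uncommitted) at creation time
    """
    if row_trx_id == creator_trx_id:
        return True          # my own modification
    if row_trx_id < up_limit_id:
        return True          # committed before the ReadView existed
    if row_trx_id >= low_limit_id:
        return False         # started after the ReadView
    # up_limit_id <= row_trx_id < low_limit_id
    return row_trx_id not in active_ids
```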
Example: RR vs RC difference

Initial: users table has (id=1, name="Alice")

Timeline:
  T1 (RR): BEGIN                T2: BEGIN
  T1: SELECT name FROM users WHERE id=1 → "Alice"
                                T2: UPDATE users SET name="Bob" WHERE id=1
                                T2: COMMIT
  T1: SELECT name FROM users WHERE id=1 → "Alice"  (same ReadView!)

  T3 (RC): BEGIN                T4: BEGIN
  T3: SELECT name FROM users WHERE id=1 → "Alice"
                                T4: UPDATE users SET name="Bob" WHERE id=1
                                T4: COMMIT
  T3: SELECT name FROM users WHERE id=1 → "Bob"    (new ReadView!)
Purge Process:
1. The purge coordinator thread (1 thread) reads the oldest ReadView across all active transactions
2. Any undo record with trx_id < that ReadView's m_up_limit_id is safe to purge
3. For DELETE operations:
   ├─ Purge removes the delete-marked record from the clustered index
   └─ Purge removes corresponding entries from all secondary indexes
4. For UPDATE operations on indexed columns:
   └─ Purge removes old secondary index entries (the "ghost" entries)
5. After purge: undo log pages can be reused

Monitoring purge lag:
  SHOW ENGINE INNODB STATUS → "History list length"
  This = number of undo records NOT yet purged.
  If growing → purge can't keep up → increase innodb_purge_threads,
  or a long transaction is holding back the oldest ReadView.
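Step 2's purge-eligibility rule reduces to a plain comparison (a sketch; the function and variable names are mine):

```python
def purgeable_undo_records(undo_trx_ids, oldest_view_up_limit_id):
    """Undo records that no open ReadView can still need: safe to purge."""
    return [trx_id for trx_id in undo_trx_ids
            if trx_id < oldest_view_up_limit_id]
```

A single long-running transaction pins oldest_view_up_limit_id at an old value, which is exactly why the history list grows under it.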
Doublewrite Buffer Write Flow:

Step 1: Collect dirty pages to flush (batch, e.g., 64 pages)
Step 2: Write the pages sequentially to the Doublewrite Buffer (on disk)
        ┌────────────────────────────────────────────────┐
        │ Doublewrite Buffer (2 extents = 2MB total)     │
        │ ┌─────┬─────┬─────┬─────┬─────┬───────────┐    │
        │ │ P1  │ P2  │ P3  │ ... │ P64 │ (unused)  │    │
        │ └─────┴─────┴─────┴─────┴─────┴───────────┘    │
        └────────────────────────────────────────────────┘
Step 3: fsync the Doublewrite Buffer
Step 4: Write each page to its actual location in the .ibd files
        P1 → tablespace A, page 45
        P2 → tablespace A, page 892
        P3 → tablespace B, page 17
        ... (these are random I/O)
Step 5: fsync the .ibd files

Crash Recovery:
  For each page in the .ibd file:
  1. Verify the page checksum
  2. If the checksum FAILS → the page is torn → copy the correct version
     from the Doublewrite Buffer
  3. Then apply redo log records on top of the good page
-- MySQL 8.0.20+: separate doublewrite files
-- Old: doublewrite stored in ibdata1
-- New: separate .dblwr files in datadir

-- Disable doublewrite (only if your storage guarantees atomic writes!):
-- in my.cnf, restart required:
innodb_doublewrite = OFF
-- MySQL 8.0.30+: fine-grained control
-- innodb_doublewrite = ON | OFF | DETECT_AND_RECOVER | DETECT_ONLY
-- DETECT_AND_RECOVER (default): full protection
-- DETECT_ONLY: detect torn pages but don't fix (useful with ZFS)
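The torn-page recovery in the flow above boils down to a checksum comparison. A minimal model follows, using a toy page layout with a CRC32 header; InnoDB's real checksum algorithms and page format differ:

```python
import zlib

def make_page(payload: bytes) -> bytes:
    """Toy page layout: 4-byte CRC32 header + payload."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload

def is_torn(page: bytes) -> bool:
    """A partial (torn) write leaves the stored checksum stale."""
    stored = int.from_bytes(page[:4], "big")
    return stored != zlib.crc32(page[4:])

def recover_page(ibd_page: bytes, dblwr_page: bytes) -> bytes:
    """Crash recovery: if the .ibd copy is torn, take the doublewrite copy."""
    if not is_torn(ibd_page):
        return ibd_page
    if not is_torn(dblwr_page):
        return dblwr_page
    raise RuntimeError("both copies corrupt: unrecoverable page")
```

Because the doublewrite area is fsynced before the in-place write starts (Steps 3 and 4), at most one of the two copies can be torn at any instant.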
6. Change Buffer
6.1 What Problem Does the Change Buffer Solve?

When an INSERT/UPDATE/DELETE needs to modify a secondary index page that is not in the Buffer Pool, InnoDB faces a choice:
-- Create a general tablespace
CREATE TABLESPACE ts_hot
  ADD DATAFILE '/fast_ssd/ts_hot.ibd'
  ENGINE = InnoDB;

-- Create a table in a specific tablespace
CREATE TABLE hot_data (id INT PRIMARY KEY, val VARCHAR(100))
  TABLESPACE ts_hot;

-- Move an existing table into the tablespace
ALTER TABLE orders TABLESPACE ts_hot;
-- Method 1: OPTIMIZE TABLE (rebuilds the table, reclaims space)
OPTIMIZE TABLE large_table;
-- For InnoDB, this is equivalent to ALTER TABLE ... ENGINE=InnoDB
-- ⚠ Requires full table copy — blocks writes during operation
-- Use pt-online-schema-change or gh-ost for production

-- Method 2: ALTER TABLE (null operation that triggers rebuild)
ALTER TABLE large_table ENGINE=InnoDB;
-- Check fragmentation level:
SELECT TABLE_NAME,
DATA_LENGTH,
DATA_FREE,
ROUND(DATA_FREE / DATA_LENGTH * 100, 1) AS frag_pct
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'mydb' AND DATA_FREE > 0
ORDER BY DATA_FREE DESC;
-- frag_pct > 20% suggests rebuild may be worthwhile
9. Row Formats

9.1 The Four Row Formats Compared
| Row format | Introduced | Variable-length storage | Large-column overflow | Recommended |
|---|---|---|---|---|
| REDUNDANT | Earliest | Full length information stored in the row | First 768 bytes kept in-row | No (legacy compatibility only) |
| COMPACT | 5.0 | Compact NULL bitmap and variable-length field lengths | First 768 bytes kept in-row | Usable, but not first choice |
| DYNAMIC | 5.7+ | Same as COMPACT | Only a 20-byte pointer in-row; all data in overflow pages | Yes (default, recommended) |
| COMPRESSED | 5.5+ | Same as DYNAMIC, with page-level compression | Same as DYNAMIC | Special cases (cold data) |
9.2 Internal Structure of COMPACT/DYNAMIC Rows
COMPACT/DYNAMIC Row Format:
┌──────────────────────────────────────────────────────────────────┐
│ Variable-Length Field Lengths (1-2 bytes each, reverse order)    │
├──────────────────────────────────────────────────────────────────┤
│ NULL Flags (1 bit per nullable column, rounded to bytes)         │
├──────────────────────────────────────────────────────────────────┤
│ Record Header (5 bytes)                                          │
│  ├─ info_flags (4 bits): delete-mark, min-rec, etc.              │
│  ├─ n_owned (4 bits): records owned by this slot in page dir     │
│  ├─ heap_no (13 bits): position in page heap                     │
│  ├─ record_type (3 bits): 0=leaf, 1=non-leaf, 2=infimum, 3=sup   │
│  └─ next_record (16 bits): relative offset to next record        │
├──────────────────────────────────────────────────────────────────┤
│ Hidden Columns:                                                  │
│  ├─ DB_ROW_ID (6 bytes, only if no explicit PK)                  │
│  ├─ DB_TRX_ID (6 bytes, transaction ID)                          │
│  └─ DB_ROLL_PTR (7 bytes, undo log pointer)                      │
├──────────────────────────────────────────────────────────────────┤
│ Column 1 data                                                    │
│ Column 2 data                                                    │
│ ...                                                              │
│ Column N data                                                    │
│ (If column > threshold → 20-byte pointer to overflow page)       │
└──────────────────────────────────────────────────────────────────┘
COMPRESSED page handling in the Buffer Pool:

Disk (compressed):          Buffer Pool (memory):
┌──────────────┐            ┌─────────────────────────┐
│ 8KB on disk  │  ──────▶   │ 8KB compressed page     │ ← always present
│ (compressed) │            │           +             │
│              │            │ 16KB uncompressed page  │ ← present when accessed
└──────────────┘            └─────────────────────────┘

Problems with COMPRESSED:
1. Double memory usage (compressed + uncompressed copies)
2. CPU overhead for compress/decompress
3. Compression failures: if a modified page can't fit in its original
   compressed size → reorganize page → still can't fit → page split
4. Modification log: InnoDB keeps a tiny log of changes on the compressed
   page to avoid recompressing on every change → when the log fills up →
   recompress → potential split
COMPRESSED vs transparent page compression: in MySQL 8.0, transparent page compression (CREATE TABLE ... COMPRESSION='zstd') is recommended over the COMPRESSED row format:

1. Transparent page compression does not keep two copies of each page in memory
2. It supports a more efficient algorithm (zstd rather than zlib)
3. It relies on filesystem sparse-file (punch hole) support, available on Linux ext4/xfs/btrfs
4. It adds no extra Buffer Pool memory overhead
5. Drawbacks: it requires punch-hole support, and the actual savings depend on the filesystem's block alignment
# Example my.cnf InnoDB section for a 64GB SSD server:
[mysqld]
# Buffer Pool
innodb_buffer_pool_size = 44G
innodb_buffer_pool_instances = 16
innodb_buffer_pool_dump_at_shutdown = ON
innodb_buffer_pool_load_at_startup = ON

# Redo Log
innodb_redo_log_capacity = 4G
innodb_log_buffer_size = 64M
innodb_flush_log_at_trx_commit = 1

# I/O
innodb_io_capacity = 4000
innodb_io_capacity_max = 8000
innodb_flush_method = O_DIRECT
innodb_flush_neighbors = 0
innodb_page_cleaners = 16

# Tablespace
innodb_file_per_table = ON
innodb_doublewrite = ON
innodb_undo_tablespaces = 3
innodb_undo_log_truncate = ON

# Concurrency
innodb_adaptive_hash_index = ON
innodb_adaptive_hash_index_parts = 16
innodb_purge_threads = 4

# Monitoring
innodb_print_all_deadlocks = ON
innodb_status_output_locks = ON
12. Monitoring and Troubleshooting

12.1 Reading SHOW ENGINE INNODB STATUS in Full

This is InnoDB's most important diagnostic tool. Its output is divided into sections, each covering a different dimension:

(1) The SEMAPHORES section
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 12345
--Thread 140234567890 has waited at btr0sea.ic line 92 for 0.5 seconds
the semaphore: S-lock on RW-latch at 0x7f...
a writer (thread id 140234567891) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0

Key indicators:
- "has waited ... for X seconds": threads waiting for locks
- btr0sea.ic: AHI-related contention → consider disabling AHI
- buf0buf.cc: Buffer Pool mutex contention → increase instances
- Frequent long waits (>1s) = performance problem
(2) The TRANSACTIONS section
------------
TRANSACTIONS
------------
Trx id counter 12345678
Purge done for trx's n:o < 12345670 undo n:o < 0
History list length 234                   ← Undo records not yet purged;
                                            if growing → long transactions or purge lag
---TRANSACTION 12345677, ACTIVE 3600 sec  ← Active for 1 hour! Problem!
2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 42, OS thread handle 140..., query id 999
SELECT * FROM large_table WHERE status = 'active'

Key indicators:
- History list length: should be low (<1000); growing = purge lag
- ACTIVE NNN sec: long-running transactions → undo bloat risk
- lock struct(s) / row lock(s): lock usage per transaction
(3) The FILE I/O section
--------
FILE I/O
--------
I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)
...
Pending normal aio reads: [0, 0, 0, 0] ← Should be 0 or very low
Pending normal aio writes: [0, 0, 0, 0]
Pending ibuf aio reads: 0
6821 OS file reads, 29412 OS file writes, 8513 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 1.86 writes/s, 0.76 fsyncs/s

Key indicators:
- High pending reads/writes = I/O subsystem is saturated
- fsyncs/s: correlates with commit rate and the flush_log_at_trx_commit setting
(4) The ROW OPERATIONS section
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
0 read views open inside InnoDB ← Open read views = MVCC snapshots
Process ID=12345, Main thread ID=140..., state: sleeping
Number of rows inserted 567890, updated 123456, deleted 78901, read 4567890
0.00 inserts/s, 0.00 updates/s, 0.00 deletes/s, 0.00 reads/s
12.2 information_schema Monitoring Tables
-- Buffer Pool statistics per instance
SELECT POOL_ID, POOL_SIZE, FREE_BUFFERS, DATABASE_PAGES,
OLD_DATABASE_PAGES, MODIFIED_DATABASE_PAGES,
HIT_RATE, YOUNG_MAKE_PER_THOUSAND_GETS
FROM information_schema.INNODB_BUFFER_POOL_STATS;
-- Currently open transactions
SELECT trx_id, trx_state, trx_started,
TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS duration_sec,
trx_rows_locked, trx_rows_modified, trx_query
FROM information_schema.INNODB_TRX
ORDER BY trx_started ASC;
-- Lock waits (who is blocking whom)
SELECT r.trx_id AS waiting_trx,
r.trx_query AS waiting_query,
b.trx_id AS blocking_trx,
b.trx_query AS blocking_query
FROM performance_schema.data_lock_waits w
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.REQUESTING_ENGINE_TRANSACTION_ID
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.BLOCKING_ENGINE_TRANSACTION_ID;
-- InnoDB metrics (hundreds of counters)
SELECT NAME, COUNT, STATUS
FROM information_schema.INNODB_METRICS
WHERE STATUS = 'enabled'
ORDER BY NAME;
-- Find InnoDB mutex contention hotspots
SELECT EVENT_NAME, COUNT_STAR, SUM_TIMER_WAIT/1000000000 AS total_wait_ms
FROM performance_schema.events_waits_summary_global_by_event_name
WHERE EVENT_NAME LIKE 'wait/synch/mutex/innodb/%'
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
-- Find the most I/O-heavy .ibd files
SELECT FILE_NAME,
COUNT_READ, SUM_NUMBER_OF_BYTES_READ,
COUNT_WRITE, SUM_NUMBER_OF_BYTES_WRITE,
SUM_TIMER_READ/1000000000 AS read_wait_ms,
SUM_TIMER_WRITE/1000000000 AS write_wait_ms
FROM performance_schema.file_summary_by_instance
WHERE FILE_NAME LIKE '%.ibd'
ORDER BY SUM_TIMER_READ + SUM_TIMER_WRITE DESC
LIMIT 10;
12.4 Troubleshooting Checklist for Common Problems
| Symptom | Likely cause | How to check | Fix |
|---|---|---|---|
| Periodic sharp QPS drops | Redo log full, forcing synchronous flushing | checkpoint_age near redo log capacity | Larger redo log + raise io_capacity |
| Low Buffer Pool hit rate | Pool too small, or full-table-scan pollution | SHOW ENGINE INNODB STATUS → hit rate | Grow the pool or tune the queries |
| History list keeps growing | Long transactions blocking purge | INNODB_TRX → trx_started | Find and kill the long transactions |
| ibdata1 bloat | innodb_file_per_table=OFF | Check my.cnf | Enable it, then rebuild via logical dump/reload |
| High CPU but low throughput | AHI rw-lock contention | SEMAPHORES → btr0sea.ic | Disable AHI or increase its parts |
| Undo tablespace ballooning | Long transactions + heavy concurrent writes | INNODB_TRX + undo file sizes | Kill long transactions + truncate undo |
13. Frequently Asked Questions (FAQ)

Q1: What fraction of total memory should innodb_buffer_pool_size get?

The general rule is 60-80% of total memory, but it depends on the deployment:

- Dedicated database server (MySQL only): 70-80%
- Shared server (application, cache, etc. alongside): 50-60%
- Small memory (≤4GB): 50%

The remaining memory goes to the OS file cache, per-connection memory (sort_buffer, join_buffer, thread stack: roughly 2-10MB per connection), and other processes. A common mistake is setting buffer_pool_size to 90%+, which pushes the OS into swapping and makes performance collapse instead.
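The sizing rules above can be expressed as a hedged helper. The percentages and the 6MB-per-connection figure are the rules of thumb from this answer, not formulas from MySQL, and the function name is mine:

```python
def suggested_buffer_pool_gb(total_ram_gb: float, dedicated: bool,
                             max_connections: int = 200,
                             per_conn_mb: float = 6.0) -> float:
    """Rule-of-thumb Buffer Pool sizing (a starting point, not a hard rule)."""
    if total_ram_gb <= 4:
        pct = 0.50                   # small hosts: leave room for the OS
    elif dedicated:
        pct = 0.75                   # dedicated DB server: 70-80%
    else:
        pct = 0.55                   # shared host: 50-60%
    candidate = total_ram_gb * pct
    # never starve per-connection memory (sort/join buffers, thread stacks)
    reserve_gb = max_connections * per_conn_mb / 1024
    return round(min(candidate, total_ram_gb - reserve_gb - 1.0), 1)
```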
How to tell whether the Buffer Pool is sized correctly:

- Buffer Pool hit rate: "Buffer pool hit rate" in SHOW ENGINE INNODB STATUS; it should be >= 999/1000 (99.9%).
- Free buffers: if this stays at 0 but the hit rate is still high, the pool is sized about right; if free is 0 and the hit rate is low, grow it.
- Logical vs physical reads: SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'; compare Innodb_buffer_pool_read_requests (logical reads) with Innodb_buffer_pool_reads (physical reads); the higher the ratio, the better.
- The ultimate test: if your entire dataset fits in the Buffer Pool (sum Data_length + Index_length from SHOW TABLE STATUS), size the pool to hold all of it.
Q5: Does MVCC avoid locking entirely?

No. MVCC only removes read-write conflicts (reads don't block writes and writes don't block reads); write-write conflicts still need locks. Specifically:

- Consistent reads (plain SELECT): read a historical version via MVCC, taking no locks
- Current reads (SELECT ... FOR UPDATE / LOCK IN SHARE MODE): locking reads of the latest version
- INSERT / UPDATE / DELETE: read the latest version (a current read), then take an exclusive lock to modify it

So MVCC plus locking together form InnoDB's complete concurrency control; they complement rather than replace each other.
Q6: Why is innodb_flush_method = O_DIRECT recommended?

With the default fsync method, InnoDB's write path is Buffer Pool → OS page cache → disk: the data is held twice in memory (InnoDB Buffer Pool + OS page cache), wasting RAM.

O_DIRECT bypasses the OS page cache and writes straight to disk: Buffer Pool → disk. Benefits:

- Avoids double buffering, freeing OS memory for other uses
- Reduces OS page cache CPU overhead (cache lookups, LRU maintenance)
- Particularly noticeable with a large Buffer Pool

Note: even with O_DIRECT, the redo log still uses fsync. MySQL 8.0.26+ offers O_DIRECT_NO_FSYNC, which performs better on some filesystems.
Besides SHOW ENGINE INNODB STATUS, the following SHOW GLOBAL STATUS variables are the most valuable for day-to-day monitoring:
| Variable | Meaning | Healthy range |
|---|---|---|
| Innodb_buffer_pool_read_requests | Total logical read requests to the Buffer Pool | Ratio against reads > 99.9% |
| Innodb_buffer_pool_reads | Buffer Pool misses requiring a disk read | The lower the better |
| Innodb_buffer_pool_pages_dirty | Dirty pages currently in the Buffer Pool | Should stay stable, not keep growing |
| Innodb_buffer_pool_wait_free | Waits for a free page (Free list empty) | 0 or close to 0 |
| Innodb_log_waits | Waits caused by a too-small Log Buffer | 0; if > 0, increase log_buffer_size |
| Innodb_os_log_written | Bytes written to the Redo Log | Used to compute the write rate and size the redo log |
| Innodb_row_lock_time_avg | Average row-lock wait time (ms) | < 10ms; investigate above 100ms |
| Innodb_row_lock_waits | Number of row-lock waits | The trend matters more than the absolute value |
| Innodb_rows_read / inserted / updated / deleted | Row operation counters at the InnoDB layer | Workload profiling |
| Innodb_data_reads / writes | Physical I/O operations | Compare with logical reads to gauge cache efficiency |
-- Buffer Pool hit rate calculation:
SELECT
(SELECT VARIABLE_VALUE FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests') AS logical_reads,
(SELECT VARIABLE_VALUE FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads') AS disk_reads,
ROUND(
(1 - (SELECT VARIABLE_VALUE FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads') /
(SELECT VARIABLE_VALUE FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests')
) * 100, 3
) AS hit_rate_pct;
-- Redo log write rate (bytes per second):
-- Record Innodb_os_log_written at two points in time,
-- then divide the delta by the seconds elapsed.

-- Quick health check script:
SELECT 'Buffer Pool Hit Rate' AS metric,
       CONCAT(ROUND((1 - v2.val/v1.val)*100, 2), '%') AS value
FROM
(SELECT VARIABLE_VALUE val FROM performance_schema.global_status
WHERE VARIABLE_NAME='Innodb_buffer_pool_read_requests') v1,
(SELECT VARIABLE_VALUE val FROM performance_schema.global_status
WHERE VARIABLE_NAME='Innodb_buffer_pool_reads') v2
UNION ALL
SELECT 'Dirty Page Ratio',
CONCAT(ROUND(v1.val/v2.val*100, 1), '%')
FROM
(SELECT VARIABLE_VALUE val FROM performance_schema.global_status
WHERE VARIABLE_NAME='Innodb_buffer_pool_pages_dirty') v1,
(SELECT VARIABLE_VALUE val FROM performance_schema.global_status
WHERE VARIABLE_NAME='Innodb_buffer_pool_pages_total') v2
UNION ALL
SELECT 'Log Waits (should be 0)', VARIABLE_VALUE
FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Innodb_log_waits'
UNION ALL
SELECT 'Wait Free (should be 0)', VARIABLE_VALUE
FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Innodb_buffer_pool_wait_free'
UNION ALL
SELECT 'Avg Row Lock Time (ms)', VARIABLE_VALUE
FROM performance_schema.global_status
WHERE VARIABLE_NAME='Innodb_row_lock_time_avg';
-- Top tables by buffer pool usage
SELECT object_schema, object_name, allocated, data, pages
FROM sys.innodb_buffer_stats_by_table
ORDER BY allocated DESC
LIMIT 20;
-- Top indexes by buffer pool usage
-- (INNODB_BUFFER_PAGE is expensive to scan on large pools; use sparingly)
SELECT TABLE_NAME, INDEX_NAME,
       COUNT(*) AS pages, SUM(DATA_SIZE) AS data_bytes
FROM information_schema.INNODB_BUFFER_PAGE
GROUP BY TABLE_NAME, INDEX_NAME
ORDER BY pages DESC
LIMIT 20;
-- InnoDB lock waits (human-readable)
SELECT *
FROM sys.innodb_lock_waits\G
-- Tables with full table scans (potential buffer pool polluters)
SELECT object_schema, object_name,
rows_full_scanned, latency
FROM sys.schema_tables_with_full_table_scans
ORDER BY rows_full_scanned DESC
LIMIT 10;