Chapter 20

Replication: Principles & Practice

MySQL Replication Complete Guide

  MySQL 8.0 / 8.4
  ~45 min read
  Source-code depth





  MySQL replication is the cornerstone for building highly available, read-scalable, and disaster-recovery architectures. This chapter starts from the binary log write mechanism, deep-dives into the internals of asynchronous, semi-synchronous, and group replication; walks through GTID and multi-source replication setup step by step; covers parallel replication (MTS) tuning, replication lag root-cause analysis and fixes, failover best practices; and concludes with the full InnoDB Cluster + MySQL Router HA stack and read-write splitting patterns. All content is based on MySQL 8.0/8.4 source-code analysis and production experience.

1. Replication Overview

The core idea behind MySQL replication: the source server writes data changes to the binary log, and replicas pull these logs and replay them. This mechanism was introduced in MySQL 3.23 (2000) and has evolved into multiple modes over two decades.

1.1 What Problems Does Replication Solve?

Use Case Description Typical Topology
Read Scale-Out Distribute read queries across replicas to reduce source load 1 Source + N Replicas
High Availability Fast failover to a replica when the source fails Source + Standby + Witness
Disaster Recovery Cross-datacenter geo-replication Cross-DC async replication
Online Backup Run backups on replicas without impacting source performance Dedicated backup replica
Analytics Run reporting and analytics queries on replicas Analytics replica
Rolling Upgrades Upgrade replicas first, switch over, then upgrade old source Version rotation

1.2 Replication Mode Evolution Timeline

Version Milestone
3.23 (2000) STATEMENT-based async replication introduced
5.1 (2008) ROW-based binlog + MIXED mode
5.5 (2010) Semi-synchronous replication plugin
5.6 (2013) GTID + multi-threaded replica (schema-level parallelism)
5.7 (2015) Group Replication plugin + enhanced semi-sync + logical clock parallelism
8.0 (2018) InnoDB Cluster + MySQL Router + WriteSet-based parallel replication
8.0.22+ Asynchronous connection failover + Replica terminology rename
8.4 (2024) Replica terminology fully replaces Slave; Group Replication enhancements

1.3 Core Terminology

Starting from MySQL 8.0.22+, Source / Replica replaces the old Master / Slave terminology. This chapter uses the new terms throughout, noting old command names for backward compatibility.

New Term Old Term Description
Source Master Server that writes data changes to binlog
Replica Slave Server that reads and replays binlog
Binary Log (binlog) - Log recording all data change events
Relay Log - Local copy of binlog cached on the Replica
I/O Thread (Receiver) I/O Thread Responsible for pulling binlog from Source
SQL Thread (Applier) SQL Thread Responsible for replaying events in relay log
GTID - Global Transaction Identifier (server_uuid:seq_no)

2. Replication Architecture Deep Dive

2.1 Asynchronous Replication

Asynchronous replication is MySQL's default mode. The source commits transactions and returns to the client without waiting for any replica acknowledgment. This means:

Async Replication Internal Flow

## Source side (each transaction commit):
1. InnoDB prepares transaction (redo log flush)
2. Binary log group commit: FLUSH stage → SYNC stage → COMMIT stage
3. Binlog dump thread sends new events to connected replicas
   -- Source code: sql/rpl_binlog_sender.cc → Binlog_sender::send_events()

## Replica side:
4. I/O Receiver thread connects to Source via COM_BINLOG_DUMP_GTID
   -- Source code: sql/rpl_replica.cc → handle_slave_io()
5. Receives binlog events → writes to local Relay Log
   -- Source code: sql/rpl_replica.cc → queue_event()
6. SQL Applier thread reads Relay Log → replays events
   -- Source code: sql/rpl_rli.cc → apply_event_and_update_pos()
7. Updates relay_log.info (or mysql.slave_relay_log_info table) with position

Source refs: sql/binlog.cc (MYSQL_BIN_LOG::ordered_commit), sql/rpl_binlog_sender.cc, sql/rpl_replica.cc

2.2 Semi-Synchronous Replication

Semi-synchronous replication strikes a balance between async and fully synchronous: the source, after committing a transaction, waits for at least one replica to acknowledge receipt of the binlog events before returning to the client. Note that the replica only needs to confirm "received" (written to relay log), not "applied".

Two Acknowledgment Points (AFTER_SYNC vs AFTER_COMMIT)

AFTER_SYNC (Lossless Semi-Sync)

Default since MySQL 5.7. Source waits for Replica ACK after the SYNC stage but before the COMMIT stage.

AFTER_COMMIT (Legacy Semi-Sync)

MySQL 5.5-5.6 behavior. Source waits for ACK only after the COMMIT stage completes.

Semi-Sync Configuration

-- Source: 安装插件 / Install plugin
INSTALL PLUGIN rpl_semi_sync_source SONAME 'semisync_source.so';
SET GLOBAL rpl_semi_sync_source_enabled = 1;
SET GLOBAL rpl_semi_sync_source_timeout = 1000;  -- ms, 超时退化为异步 / fallback to async
SET GLOBAL rpl_semi_sync_source_wait_for_replica_count = 1;
SET GLOBAL rpl_semi_sync_source_wait_point = 'AFTER_SYNC';  -- 无损模式 / lossless

-- Replica: 安装插件 / Install plugin
INSTALL PLUGIN rpl_semi_sync_replica SONAME 'semisync_replica.so';
SET GLOBAL rpl_semi_sync_replica_enabled = 1;
STOP REPLICA; START REPLICA;

-- 验证状态 / Verify status
SHOW STATUS LIKE 'Rpl_semi_sync_source_%';
-- Rpl_semi_sync_source_status = ON       → 半同步生效
-- Rpl_semi_sync_source_no_tx   = 0       → 异步退化次数
-- Rpl_semi_sync_source_yes_tx  = 12345   → 半同步确认事务数

Timeout Fallback: When a replica does not return an ACK within rpl_semi_sync_source_timeout milliseconds, the source automatically falls back to async replication without blocking writes. Semi-sync resumes automatically when the replica catches up. This design prioritizes availability over consistency — during network partitions the source keeps serving, but DBAs must monitor Rpl_semi_sync_source_no_tx.

Source ref: plugin/semisync/semisync_source.cc → ReplSemiSyncMaster::commitTrx()

2.3 Three Modes Compared

Feature Async Semi-Sync (AFTER_SYNC) Group Replication
Data Safety May lose data Acknowledged won't be lost Majority consensus
Write Latency Lowest +1 RTT +consensus protocol overhead
Auto Failover No (external tools needed) No (external tools needed) Yes (built-in)
Conflict Detection No No Certification-based (WriteSet)
Multi-Primary No No Optional (multi-primary mode)
Max Nodes Unlimited Unlimited 9
Best For Read scale, backup, analytics Zero-data-loss primary-replica Fully automated HA cluster

3. Binary Log Formats In Depth

The binary log (binlog) is the data carrier for replication. MySQL supports three binlog formats, each making different trade-offs between determinism, data volume, and compatibility:

3.1 STATEMENT Format

Records the executed SQL statements themselves. This is the oldest format.

-- binlog 中记录的内容 / What's recorded in binlog:
UPDATE orders SET status = 'shipped' WHERE id = 12345;

-- 优势 / Advantages:
-- • 日志量小(一条 SQL 可能影响百万行)
-- • 易于审计和理解

-- 劣势 / Disadvantages:
-- • 非确定性函数问题:NOW(), UUID(), RAND(), FOUND_ROWS()
-- • 存储过程/触发器可能产生不一致
-- • INSERT ... SELECT 需要更多锁(防止不确定性)

-- 危险示例 / Dangerous example:
UPDATE t SET val = RAND() WHERE id > 100 LIMIT 10;
-- Source 和 Replica 选择的行可能不同!
-- Source and Replica may select different rows!

3.2 ROW Format

Records the before-and-after images of every changed row. Default format in MySQL 8.0.

-- binlog 中记录的内容(伪代码)/ What's recorded (pseudo):
-- Table: orders (id=12345)
-- BEFORE: {id:12345, status:'pending', updated_at:'2026-04-19 10:00:00'}
-- AFTER:  {id:12345, status:'shipped', updated_at:'2026-04-20 14:30:00'}

-- 优势 / Advantages:
-- • 完全确定性——任何函数、存储过程都安全
-- • 更少的锁竞争(不需要语句级锁保证确定性)
-- • 精确的行级别数据变更审计

-- 劣势 / Disadvantages:
-- • 日志量可能很大(UPDATE 100万行 = 100万行数据记录)
-- • 不可直接阅读(需要 mysqlbinlog --verbose 解码)

-- 优化选项 / Optimization:
-- binlog_row_image = MINIMAL  → 只记录变更列 + PK(默认 FULL)
-- binlog_row_image = NOBLOB   → 排除未变更的 BLOB 列

3.3 MIXED Format

Uses STATEMENT by default and automatically switches to ROW when MySQL detects potentially unsafe statements. The "unsafe" determination is implemented in decide_logging_format() in sql/sql_parse.cc.

Conditions Triggering ROW Switch

3.4 Format Comparison and Recommendations

Dimension STATEMENT ROW MIXED
Determinism Low (context-dependent) Perfect High (auto-switches)
Log Size Smallest Largest Medium
Readability Directly readable SQL Needs decoding Partially readable
Production Rec. Not recommended Preferred Acceptable

Production Recommendation: Always use binlog_format=ROW (MySQL 8.0 default). With binlog_row_image=FULL (default), you can use mysqlbinlog --verbose for data recovery and auditing. If binlog disk pressure is high, consider binlog_row_image=MINIMAL but note this limits flashback capabilities.

3.5 Binary Log Group Commit

To reduce the number of fsync calls, MySQL batches binlog writes from multiple transactions into a single fsync — this is Group Commit. It proceeds in three stages (implemented in MYSQL_BIN_LOG::ordered_commit()):

## Group Commit 三阶段 / Three stages:

FLUSH stage:   Leader 收集队列中所有等待的事务
               将它们的 binlog events 写入 binlog 文件(write, 不 fsync)
               Leader collects all waiting transactions in the queue
               Writes their binlog events to the binlog file (write, no fsync)

SYNC stage:    对 binlog 文件执行一次 fsync(合并多个事务)
               One fsync call for the binlog file (batching multiple transactions)
               -- 由 binlog_group_commit_sync_delay (μs) 控制等待时间
               -- 由 binlog_group_commit_sync_no_delay_count 控制最小批量大小

COMMIT stage: 按顺序调用各事务的 InnoDB commit
               Calls InnoDB commit for each transaction in order

-- 关键参数 / Key parameters:
binlog_group_commit_sync_delay = 0       -- 默认不额外等待
binlog_group_commit_sync_no_delay_count = 0  -- 不设最小批量
-- 在高并发写入场景下,适当增大 sync_delay 可以提升吞吐量
-- In high-concurrency write scenarios, increasing sync_delay can improve throughput

Source ref: sql/binlog.cc → MYSQL_BIN_LOG::ordered_commit()

4. GTID Replication In Depth

GTID (Global Transaction Identifier) is a global transaction identification mechanism introduced in MySQL 5.6, formatted as server_uuid:transaction_id. Every transaction has a unique identifier across the entire replication topology, greatly simplifying replication management.

4.1 How GTID Works

Each MySQL instance has a unique server_uuid (auto-generated in auto.cnf). When a transaction commits on the source:

  1. A GTID is assigned, e.g., 3E11FA47-71CA-11E1-9E33-C80AA9429562:42
  2. The GTID is written to binlog as the first event of the transaction (Gtid_log_event)
  3. After commit, the GTID is added to the gtid_executed set
  4. When the replica receives this transaction, it first checks whether the GTID is already in its own gtid_executed — if so, it skips it (idempotency)
  5. After execution, the GTID is added to the replica's own gtid_executed
-- 查看 GTID 状态 / Check GTID status
SELECT @@global.gtid_executed\G
-- 输出 / Output:
-- 3E11FA47-71CA-11E1-9E33-C80AA9429562:1-12345,
-- A1B2C3D4-71CA-11E1-9E33-C80AA9429562:1-500

-- 查看已清理的 GTID / View purged GTIDs
SELECT @@global.gtid_purged\G

-- GTID 集合运算(MySQL 8.0 内置函数)/ GTID set operations
SELECT GTID_SUBTRACT(@@global.gtid_executed, '3E11FA47-71CA-11E1-9E33-C80AA9429562:1-12340');
-- 结果:3E11FA47-...:12341-12345 (Replica 落后的事务)

SELECT GTID_SUBSET('3E11FA47-...:1-100', @@global.gtid_executed);
-- 判断一个 GTID 集合是否是另一个的子集 / Check if one GTID set is a subset

4.2 GTID Setup Step by Step (From Scratch)

Step 1: Source Configuration

# /etc/my.cnf (Source)
[mysqld]
server-id                = 1            # 每个实例唯一 / unique per instance
log-bin                  = mysql-bin
binlog-format            = ROW
gtid-mode                = ON
enforce-gtid-consistency = ON
log-replica-updates      = ON           # 链式复制必须 / required for chain replication
binlog-expire-logs-seconds = 604800      # 7天 / 7 days
sync-binlog              = 1
innodb-flush-log-at-trx-commit = 1       # 双1配置 / dual-1 for durability

Step 2: Create Replication User

-- 在 Source 上执行 / Run on Source
CREATE USER 'repl'@'192.168.1.%' IDENTIFIED BY 'StrongPassword123!';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'192.168.1.%';
FLUSH PRIVILEGES;

Step 3: Replica Configuration

# /etc/my.cnf (Replica)
[mysqld]
server-id                = 2            # 与 Source 不同 / different from Source
log-bin                  = mysql-bin    # 级联复制需要 / needed for cascading
binlog-format            = ROW
gtid-mode                = ON
enforce-gtid-consistency = ON
log-replica-updates      = ON
relay-log                = relay-bin
relay-log-recovery       = ON            # 崩溃后自动清理损坏的 relay log
read-only                = ON            # 普通用户只读 / read-only for non-SUPER users
super-read-only          = ON            # SUPER 用户也只读 / read-only even for SUPER
replica-parallel-workers = 4             # 并行复制线程数 / parallel applier threads

Step 4: Initialize Replica Data

# 方法 A: mysqldump(适合小数据集  full_backup.sql

# 在 Replica 上导入 / Import on Replica:
mysql -u root -p < full_backup.sql

# 方法 B: xtrabackup(适合大数据集)
# Method B: xtrabackup (suitable for large datasets)
xtrabackup --backup --target-dir=/backup/full -u root -p
xtrabackup --prepare --target-dir=/backup/full
# 将备份拷贝到 Replica 的 datadir / Copy backup to Replica's datadir
xtrabackup --copy-back --target-dir=/backup/full

# 方法 C: CLONE 插件(MySQL 8.0.17+,最简单)
# Method C: CLONE plugin (MySQL 8.0.17+, simplest)
-- 在 Replica 上执行 / Run on Replica:
INSTALL PLUGIN clone SONAME 'mysql_clone.so';
SET GLOBAL clone_valid_donor_list = 'source_host:3306';
CLONE INSTANCE FROM 'repl'@'source_host':3306 IDENTIFIED BY 'StrongPassword123!';

Step 5: Start Replication

-- 在 Replica 上执行 / Run on Replica
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'source_host',
  SOURCE_PORT = 3306,
  SOURCE_USER = 'repl',
  SOURCE_PASSWORD = 'StrongPassword123!',
  SOURCE_AUTO_POSITION = 1,               -- GTID 自动定位 / GTID auto-positioning
  GET_SOURCE_PUBLIC_KEY = 1;               -- caching_sha2_password 需要

START REPLICA;

-- 验证复制状态 / Verify replication status
SHOW REPLICA STATUS\G
-- 关键检查项 / Key checks:
-- Replica_IO_Running: Yes
-- Replica_SQL_Running: Yes
-- Seconds_Behind_Source: 0
-- Retrieved_Gtid_Set: 与 Source 的 gtid_executed 匹配
-- Executed_Gtid_Set: 持续增长

4.3 GTID vs Traditional Position-Based Replication

Feature Traditional Position GTID
Failover Manual binlog position calculation Auto-positioning (SOURCE_AUTO_POSITION=1)
Switching Source Complex and error-prone Simple: just CHANGE REPLICATION SOURCE TO
Skipping Errors SET GLOBAL sql_replica_skip_counter = 1 Inject empty transaction
Consistency Check Data comparison only GTID set comparison provides quick initial check

Skipping Errant Transactions with GTID

-- 假设需要跳过 Source 的 GTID: 3E11FA47-...:100
-- Suppose we need to skip Source's GTID: 3E11FA47-...:100
STOP REPLICA;
SET GTID_NEXT = '3E11FA47-71CA-11E1-9E33-C80AA9429562:100';
BEGIN; COMMIT;  -- 注入空事务 / inject empty transaction
SET GTID_NEXT = 'AUTOMATIC';
START REPLICA;

enforce_gtid_consistency Restrictions: With GTID enabled, the following are prohibited: (1) DML mixing InnoDB and MyISAM tables in a transaction; (2) CREATE TABLE ... SELECT (before MySQL 8.0.21); (3) CREATE TEMPORARY TABLE inside a transaction. These restrictions ensure each GTID corresponds to a unique atomic transaction. If your application uses these patterns, refactor before enabling GTID.

Source refs: sql/rpl_gtid.h (Gtid_set), sql/rpl_gtid_persist.cc, sql/rpl_replica.cc → get_source_uuid()

5. Multi-Source Replication

Multi-source replication allows a single replica to receive replication data from multiple sources simultaneously. A typical use case is aggregating data from multiple shards into a central node for analytics and reporting. MySQL 5.7+ supports this natively, using named replication channels to distinguish different sources.

5.1 Configuration Steps

-- Replica: 配置通道 A(连接 Source-A)/ Configure channel A (connect to Source-A)
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'source-a.example.com',
  SOURCE_PORT = 3306,
  SOURCE_USER = 'repl',
  SOURCE_PASSWORD = 'PasswordA!',
  SOURCE_AUTO_POSITION = 1
  FOR CHANNEL 'source_a';

-- Replica: 配置通道 B(连接 Source-B)/ Configure channel B (connect to Source-B)
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'source-b.example.com',
  SOURCE_PORT = 3306,
  SOURCE_USER = 'repl',
  SOURCE_PASSWORD = 'PasswordB!',
  SOURCE_AUTO_POSITION = 1
  FOR CHANNEL 'source_b';

-- 启动所有通道 / Start all channels
START REPLICA;
-- 或者单独启动 / Or start individually:
START REPLICA FOR CHANNEL 'source_a';
START REPLICA FOR CHANNEL 'source_b';

-- 查看特定通道状态 / Check specific channel status
SHOW REPLICA STATUS FOR CHANNEL 'source_a'\G
SHOW REPLICA STATUS FOR CHANNEL 'source_b'\G

-- performance_schema 中的通道信息 / Channel info in performance_schema
SELECT * FROM performance_schema.replication_connection_status;
SELECT * FROM performance_schema.replication_applier_status_by_worker;

5.2 Multi-Source Considerations

6. Replication Filters

Replication filters allow replicas to selectively replicate specific databases or tables. Filtering can happen at two levels:

Filter Level Parameter Description
Source (Binlog) binlog-do-db Generate binlog only for specified db (based on current USE db)
Source (Binlog) binlog-ignore-db Ignore binlog for specified db
Replica (SQL Thread) replicate-do-db Apply events only for specified db
Replica (SQL Thread) replicate-ignore-db Ignore events for specified db
Replica (SQL Thread) replicate-do-table Apply events only for specified table
Replica (SQL Thread) replicate-ignore-table Ignore events for specified table
Replica (SQL Thread) replicate-wild-do-table Wildcard table name matching
Replica (SQL Thread) replicate-rewrite-db Database name rewrite (from->to)

Filter Pitfalls: binlog-do-db and replicate-do-db in STATEMENT format are based on the current USE db, not the actual database of the table. For example, running USE app; UPDATE logs.audit SET ...: if the filter is replicate-do-db=logs, this statement will be ignored (because the current db is app). In ROW format, filtering is based on the actual database name of the table, which behaves more intuitively. This is yet another reason to use ROW format.

-- MySQL 8.0.22+ 在线设置过滤器(无需重启)/ Online filter setup (no restart)
STOP REPLICA SQL_THREAD FOR CHANNEL 'source_a';
CHANGE REPLICATION FILTER
  REPLICATE_DO_DB = (orders, inventory)
  FOR CHANNEL 'source_a';
START REPLICA SQL_THREAD FOR CHANNEL 'source_a';

-- 查看当前过滤器 / View current filters
SELECT * FROM performance_schema.replication_applier_filters;
SELECT * FROM performance_schema.replication_applier_global_filters;

7. Parallel Replication (Multi-Threaded Applier)

The single-threaded SQL Applier is the biggest bottleneck for replication lag. To speed up replay, MySQL supports multi-threaded parallel application of relay log transactions. The parallelism granularity has evolved through three generations:

7.1 Parallelism Strategy Comparison

Strategy Version replica_parallel_type Parallelism Granularity Limitations
Database-Based 5.6+ DATABASE Transactions on different DBs run in parallel Falls back to single-thread for single-DB workloads
Logical Clock 5.7+ LOGICAL_CLOCK Transactions within the same Group Commit batch run in parallel Small batches when source concurrency is low
WriteSet-Based 8.0+ LOGICAL_CLOCK + transaction_write_set_extraction Non-conflicting transactions run in parallel (even across batches) Extra memory for WriteSet tracking

7.2 Parallel Replication Configuration

# /etc/my.cnf (Replica)
[mysqld]
# 并行 worker 线程数(推荐 = CPU 核心数 / 2 到核心数)
# Parallel worker threads (recommended = CPU cores / 2 to cores)
replica-parallel-workers       = 16

# 并行类型:LOGICAL_CLOCK(MySQL 8.0 默认)
replica-parallel-type          = LOGICAL_CLOCK

# 保持提交顺序(默认 ON)——保证从库读一致性
# Preserve commit order (default ON) — ensures read consistency on replica
replica-preserve-commit-order  = ON

# Source 端优化:增大 group commit 窗口提升 Replica 并行度
# Source optimization: increase group commit window for better Replica parallelism
binlog-group-commit-sync-delay          = 100    -- μs
binlog-group-commit-sync-no-delay-count = 20

# WriteSet 追踪(Source 端启用)
# WriteSet tracking (enable on Source)
binlog-transaction-dependency-tracking = WRITESET
transaction-write-set-extraction       = XXHASH64

7.3 WriteSet Parallelism Internals

The core idea: if two transactions modify non-overlapping rows (non-overlapping WriteSets), they can safely run in parallel on the Replica, even if they weren't in the same Group Commit batch on the Source.

Implementation:

  1. Source side: each transaction computes a hash set of its modified rows' PK/unique key values (WriteSet)
  2. Records each transaction's last_committed (logical clock number) and sequence_number in binlog
  3. If the current transaction's WriteSet doesn't intersect with the last binlog_transaction_dependency_history_size transactions, last_committed is set to an earlier value — allowing more parallelism
  4. Replica side: transactions with the same last_committed can run in parallel
-- 查看 binlog 中的并行信息 / View parallelism info in binlog
$ mysqlbinlog --verbose mysql-bin.000042 | grep -A2 "last_committed"

-- 输出示例 / Example output:
#260420 14:30:01 ... last_committed=100  sequence_number=105
#260420 14:30:01 ... last_committed=100  sequence_number=106
#260420 14:30:01 ... last_committed=100  sequence_number=107
#260420 14:30:01 ... last_committed=105  sequence_number=108
-- 事务 105, 106, 107 可以并行(last_committed 相同)
-- Transactions 105, 106, 107 can run in parallel (same last_committed)
-- 事务 108 必须等 105 完成后才能开始
-- Transaction 108 must wait for 105 to complete

7.4 Parallel Replication Monitoring

-- 查看各 Worker 线程状态 / View worker thread status
SELECT WORKER_ID, LAST_APPLIED_TRANSACTION,
       APPLYING_TRANSACTION, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE
FROM performance_schema.replication_applier_status_by_worker;

-- 查看协调器状态 / View coordinator status
SELECT * FROM performance_schema.replication_applier_status_by_coordinator;

-- 衡量并行效率:比较 worker 空闲时间 / Measure parallelism efficiency
SELECT WORKER_ID,
       LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP,
       LAST_APPLIED_TRANSACTION_START_APPLY_TIMESTAMP,
       TIMESTAMPDIFF(MICROSECOND,
         LAST_APPLIED_TRANSACTION_START_APPLY_TIMESTAMP,
         LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP) AS apply_us
FROM performance_schema.replication_applier_status_by_worker
WHERE WORKER_ID > 0;

Source refs: sql/rpl_rli_pdb.cc (Slave_worker), sql/rpl_replica.cc → slave_start_workers()

8. Replication Lag Monitoring & Troubleshooting

8.1 Lag Measurement Methods

Method Command/Tool Precision Caveats
Seconds_Behind_Source SHOW REPLICA STATUS Second-level Based on timestamp difference between currently processing event and now. Shows NULL when I/O thread disconnects. May jump during long transactions. Not accurate.
GTID Gap GTID_SUBTRACT() Transaction-level Shows "how many transactions behind" but can't directly convert to time
Heartbeat pt-heartbeat Sub-second Most reliable method. Source writes timestamps periodically, Replica reads and calculates delay
performance_schema Replication tables Microsecond-level MySQL 8.0 native high-precision timestamps
# 在 Source 上创建心跳表并启动守护进程
# Create heartbeat table on Source and start daemon
pt-heartbeat --database=percona --create-table \
  --update --daemonize -h source_host -u monitor -p

# 在 Replica 上监控延迟
# Monitor lag on Replica
pt-heartbeat --database=percona --monitor -h replica_host -u monitor -p
# 输出 / Output:
# 0.05s [  0.03s,  0.04s,  0.03s ]   ← 当前/1m/5m/15m 平均延迟

# 或者检查一次 / Or check once:
pt-heartbeat --database=percona --check -h replica_host -u monitor -p
# 输出: 0.05

Using performance_schema High-Precision Monitoring

-- MySQL 8.0 高精度复制延迟 / MySQL 8.0 high-precision lag
SELECT
  CHANNEL_NAME,
  LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP AS last_apply_ts,
  LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP AS source_commit_ts,
  TIMESTAMPDIFF(MICROSECOND,
    LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
    LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP
  ) / 1000000.0 AS lag_seconds
FROM performance_schema.replication_applier_status_by_worker
WHERE LAST_APPLIED_TRANSACTION != ''
ORDER BY lag_seconds DESC
LIMIT 1;

8.2 Common Lag Causes and Fixes

Cause Diagnosis Solution
Single-threaded SQL replay replica_parallel_workers = 0 or 1 Enable parallel replication (workers = 16, LOGICAL_CLOCK + WRITESET)
Source large transactions Oversized Write_rows events in binlog Split large batch ops into smaller chunks (1000-5000 rows each)
Missing indexes on Replica ROW format: Replica looks up PK row-by-row; full table scan without index Ensure Replica table schema matches Source
Replica disk I/O bottleneck iostat shows disk utilization near 100% Upgrade disk (SSD/NVMe), or use flush=2, sync_binlog=0 on Replica
Query lock contention on Replica SQL Applier waiting for row locks (lock wait timeout) Avoid long transactions/queries on Replica; or use dedicated analytics replica
Network latency / insufficient bandwidth I/O thread's Retrieved_Gtid_Set far behind Source Optimize network / enable binlog compression (binlog_transaction_compression=ON, 8.0.20+)
DDL blocking ALTER TABLE on a large table blocks all subsequent transaction replay Use pt-online-schema-change or gh-ost for online DDL

8.3 Binlog Compression (MySQL 8.0.20+)

-- Source 端启用 / Enable on Source
SET GLOBAL binlog_transaction_compression = ON;
SET GLOBAL binlog_transaction_compression_level_zstd = 3;  -- 1-22, 默认3

-- 监控压缩效果 / Monitor compression effectiveness
SELECT *
FROM performance_schema.binary_log_transaction_compression_stats\G
-- COMPRESSION_TYPE: ZSTD
-- TRANSACTION_COUNT: 50000
-- COMPRESSED_BYTES_COUNTER: 125000000
-- UNCOMPRESSED_BYTES_COUNTER: 500000000   ← 75% 压缩率

9. Failover Procedures

9.1 Planned Switchover

Planned switchover is used during maintenance windows (upgrades, hardware replacement, etc.) and can achieve zero data loss.

## 步骤 1: 停止 Source 写入 / Step 1: Stop writes on Source
SET GLOBAL read_only = ON;
SET GLOBAL super_read_only = ON;

## 步骤 2: 等待 Replica 追上 / Step 2: Wait for Replica to catch up
-- 在 Replica 上检查 / Check on Replica:
SELECT GTID_SUBTRACT(
  (SELECT @@global.gtid_executed),  -- Replica 已执行的
  (SELECT @@global.gtid_executed)   -- 这里应传入 Source 的值
);
-- 当结果为空集时,Replica 已完全追上
-- When result is empty set, Replica is fully caught up

-- 或更简单 / Or simpler:
SHOW REPLICA STATUS\G
-- 等待 Seconds_Behind_Source = 0 且 Retrieved = Executed GTID Set

## 步骤 3: 停止 Replica 的复制 / Step 3: Stop replication on Replica
STOP REPLICA;
RESET REPLICA ALL;  -- 清除复制配置 / Clear replication config

## 步骤 4: 提升 Replica 为新 Source / Step 4: Promote Replica to new Source
SET GLOBAL read_only = OFF;
SET GLOBAL super_read_only = OFF;

## 步骤 5: 将旧 Source 配置为新 Replica / Step 5: Configure old Source as new Replica
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'new_source_host',
  SOURCE_USER = 'repl',
  SOURCE_PASSWORD = '...',
  SOURCE_AUTO_POSITION = 1;
SET GLOBAL read_only = ON;
START REPLICA;

## 步骤 6: 更新应用端连接 / Step 6: Update application connections
-- VIP 漂移 / DNS 切换 / ProxySQL 后端切换
-- VIP drift / DNS switch / ProxySQL backend switch

9.2 Unplanned Failover

Emergency failover when the source suddenly crashes. The key is selecting the replica with the most complete data as the new source:

## 步骤 1: 确认 Source 确实不可恢复 / Step 1: Confirm Source is truly unrecoverable
-- 避免脑裂:确保旧 Source 不会自行恢复后继续接受写入
-- Avoid split-brain: ensure old Source won't recover and accept writes

## 步骤 2: 找到最新的 Replica / Step 2: Find the most up-to-date Replica
-- 在所有 Replica 上执行 / Run on all Replicas:
SELECT @@global.gtid_executed;
-- 选择 gtid_executed 最大(事务最多)的 Replica
-- Pick the Replica with the largest gtid_executed (most transactions)

## 步骤 3: 等待候选 Replica 应用完 relay log / Step 3: Let candidate finish relay log
STOP REPLICA IO_THREAD;
-- 等待 SQL Thread 处理完剩余的 relay log
-- Wait until SQL Thread finishes remaining relay log

## 步骤 4: 提升为新 Source / Step 4: Promote to new Source
STOP REPLICA;
RESET REPLICA ALL;
SET GLOBAL read_only = OFF;
SET GLOBAL super_read_only = OFF;

## 步骤 5: 将其他 Replica 指向新 Source / Step 5: Point other Replicas to new Source
-- GTID 模式下非常简单 / Very simple with GTID:
STOP REPLICA;
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'new_source_host',
  SOURCE_AUTO_POSITION = 1;
START REPLICA;

Errant Transaction Problem: If a Replica has "errant transactions" (GTIDs from writes made directly on the Replica, not present on the Source), these transactions will be pushed to other Replicas after failover, potentially causing inconsistencies. Prevention: (1) Always enable super_read_only=ON; (2) Use GTID_SUBTRACT() to regularly check for errant GTIDs.

-- 检查 Errant GTID / Check for Errant GTIDs
-- 在每个 Replica 上运行:/ Run on each Replica:
SELECT GTID_SUBTRACT(
  @@global.gtid_executed,
  'source_uuid:1-N'  -- 替换为 Source 的 server_uuid 和 GTID 范围
) AS errant_gtids;
-- 如果结果非空,说明有 errant GTID!
-- If result is non-empty, errant GTIDs exist!

9.3 Automated Failover Tools

Tool Maintainer Features
InnoDB Cluster / Group Replication Oracle (Official) Built-in consensus protocol, automatic leader election, no external tools needed
Orchestrator GitHub / Shlomi Noach Topology visualization, intelligent failure detection and switchover, pseudo-GTID support
MHA (Master High Availability) Yoshinori Matsunobu Classic tool, supports non-GTID mode, relay log difference compensation
ProxySQL ProxySQL Inc. Intelligent proxy + failure detection + R/W splitting
Vitess PlanetScale / CNCF Sharding + replication management + auto-failover (YouTube-scale proven)

10. Group Replication

Group Replication (GR), introduced in MySQL 5.7.17+, is a multi-node replication solution based on a Paxos-variant protocol (XCom). Unlike traditional primary-replica setups, GR provides:

10.1 Single-Primary vs Multi-Primary

Multi-Primary

10.2 Certification (Conflict Detection) Internals

When a node wants to commit a transaction:

  1. Transaction executes locally (optimistic execution) but does not commit
  2. Extracts the transaction's WriteSet (hash set of modified rows' PK/UK) and ReadSet
  3. Broadcasts WriteSet + transaction content to all nodes via the XCom protocol
  4. Each node performs Certification: checks if the WriteSet conflicts with recently certified transactions
  5. If no conflict and majority confirms → CERTIFY_OK, commit
  6. If conflict → CERTIFY_FAIL, transaction is rolled back on the originating node

Source ref: plugin/group_replication/src/certifier.cc → Certifier::certify()

10.3 Group Replication Setup (3-Node Single-Primary)

# /etc/my.cnf — 每个节点都需要(修改 server-id 和地址)
# /etc/my.cnf — required on each node (change server-id and address)
[mysqld]
server-id   = 1                     # 每个节点不同: 1, 2, 3
log-bin     = mysql-bin
binlog-format = ROW
gtid-mode   = ON
enforce-gtid-consistency = ON
log-replica-updates  = ON
binlog-checksum = NONE              # GR 要求 / Required by GR

# Group Replication 配置 / Group Replication settings
plugin_load_add = group_replication.so
group_replication_group_name = "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"  # UUID 格式
group_replication_local_address = "node1:33061"   # GR 通信端口(非 MySQL 端口)
group_replication_group_seeds = "node1:33061,node2:33061,node3:33061"
group_replication_start_on_boot = OFF              # 首次手动启动 / Manual start first time
group_replication_single_primary_mode = ON          # Single-Primary 模式
group_replication_bootstrap_group = OFF

# 安全与认证 / Security & authentication
group_replication_recovery_use_ssl = ON
group_replication_ssl_mode = REQUIRED
-- Node 1: 引导集群 / Bootstrap the group
SET GLOBAL group_replication_bootstrap_group = ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group = OFF;

-- Node 2, Node 3: 加入集群 / Join the group
START GROUP_REPLICATION;

-- 验证集群状态 / Verify cluster status
SELECT MEMBER_ID, MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
FROM performance_schema.replication_group_members;
-- +------+-------+------+---------+--------+
-- | ...  | node1 | 3306 | ONLINE  | PRIMARY|
-- | ...  | node2 | 3306 | ONLINE  | SECONDARY|
-- | ...  | node3 | 3306 | ONLINE  | SECONDARY|
-- +------+-------+------+---------+--------+

10.4 Group Replication Limitations

11. InnoDB Cluster

InnoDB Cluster is MySQL's official complete high-availability solution, composed of three components:

Component Role
Group Replication Underlying replication engine, provides data synchronization and automatic failure detection
MySQL Shell Management tool, simplifies cluster configuration and management via AdminAPI
MySQL Router Intelligent proxy, automatically routes read/write requests to the correct node

11.1 Creating Cluster with MySQL Shell

# 使用 MySQL Shell 连接到第一个节点 / Connect to first node
$ mysqlsh root@node1:3306

// 检查实例是否满足 GR 要求 / Check if instance meets GR requirements
mysql-js> dba.checkInstanceConfiguration('root@node1:3306')
// 自动修复配置问题 / Auto-fix configuration issues
mysql-js> dba.configureInstance('root@node1:3306')

// 创建集群(在 node1 上引导)/ Create cluster (bootstrap on node1)
mysql-js> var cluster = dba.createCluster('myCluster')

// 添加其他节点 / Add other nodes
mysql-js> cluster.addInstance('root@node2:3306')
mysql-js> cluster.addInstance('root@node3:3306')

// 查看集群状态 / Check cluster status
mysql-js> cluster.status()
// {
//   "clusterName": "myCluster",
//   "defaultReplicaSet": {
//     "topology": {
//       "node1:3306": { "status": "ONLINE", "memberRole": "PRIMARY" },
//       "node2:3306": { "status": "ONLINE", "memberRole": "SECONDARY" },
//       "node3:3306": { "status": "ONLINE", "memberRole": "SECONDARY" }
//     }
//   }
// }

// 手动切换 Primary / Manual Primary switchover
mysql-js> cluster.setPrimaryInstance('root@node2:3306')

// 从集群中移除节点 / Remove node from cluster
mysql-js> cluster.removeInstance('root@node3:3306')

11.2 InnoDB ClusterSet (Cross-Datacenter)

MySQL 8.0.27+ introduces InnoDB ClusterSet, allowing deployment of multiple InnoDB Cluster replicas across different datacenters, connected via asynchronous replication. The Primary Cluster handles writes; Replica Clusters provide disaster recovery and local reads.

// 创建 ClusterSet / Create ClusterSet
mysql-js> var cs = cluster.createClusterSet('myClusterSet')

// 在远端 DC 创建 Replica Cluster / Create Replica Cluster in remote DC
mysql-js> var replicaCluster = cs.createReplicaCluster(
  'root@dc2-node1:3306', 'dc2Cluster')
mysql-js> replicaCluster.addInstance('root@dc2-node2:3306')
mysql-js> replicaCluster.addInstance('root@dc2-node3:3306')

// 灾难切换:提升 Replica Cluster 为 Primary / DR switchover
mysql-js> cs.setPrimaryCluster('dc2Cluster')

// 查看 ClusterSet 状态 / View ClusterSet status
mysql-js> cs.status({extended: 1})

12. MySQL Router

MySQL Router is a lightweight middleware proxy, typically deployed on the same host as the application server. It automatically retrieves topology information from InnoDB Cluster metadata and routes read/write requests to the correct nodes.

12.1 Bootstrapping Router

# 引导 Router(自动生成配置)/ Bootstrap Router (auto-generates config)
$ mysqlrouter --bootstrap root@node1:3306 \
  --directory /opt/mysqlrouter \
  --conf-use-sockets \
  --account routeruser

# 启动 / Start
$ /opt/mysqlrouter/start.sh

# 默认监听端口 / Default listening ports:
# 6446 — Read/Write (Classic protocol)  → 路由到 Primary
# 6447 — Read-Only (Classic protocol)   → 轮询到 Secondary
# 6448 — Read/Write (X protocol)
# 6449 — Read-Only (X protocol)
# 8443 — REST API

12.2 Router Configuration Details

# /opt/mysqlrouter/mysqlrouter.conf(自动生成,可调整)
[routing:myCluster_rw]
bind_address = 0.0.0.0
bind_port = 6446
destinations = metadata-cache://myCluster/?role=PRIMARY
routing_strategy = first-available
protocol = classic
max_connections = 1024
connection_sharing = 1                  # MySQL 8.0.33+ 连接共享

[routing:myCluster_ro]
bind_address = 0.0.0.0
bind_port = 6447
destinations = metadata-cache://myCluster/?role=SECONDARY
routing_strategy = round-robin-with-fallback
protocol = classic
max_connections = 2048

[metadata_cache:myCluster]
cluster_type = gr
router_id = 1
user = routeruser
metadata_cluster = myCluster
ttl = 0.5                                # 元数据刷新间隔(秒)
auth_cache_ttl = 300
auth_cache_refresh_interval = 2

12.3 Router Monitoring

# REST API 监控 / REST API monitoring
$ curl -s http://localhost:8443/api/20190715/routes | python3 -m json.tool

# 查看活跃连接 / View active connections
$ curl -s http://localhost:8443/api/20190715/routes/myCluster_rw/connections

# 查看后端健康状态 / View backend health
$ curl -s http://localhost:8443/api/20190715/routes/myCluster_rw/destinations

13. Read-Write Splitting Patterns

Read-write splitting is the core strategy for leveraging replication to boost read throughput. Depending on the implementation layer, there are several patterns:

13.1 Implementation Pattern Comparison

Pattern Implementation Advantages Disadvantages
Application Layer Separate read/write datasources in code Most flexible, precise control over which queries go to replicas Invasive, requires code changes
Framework Layer Spring AbstractRoutingDataSource, Django db router Transparent to business code Reads within transactions must go to primary
Proxy Layer ProxySQL / MySQL Router / MaxScale Fully transparent, supports query rewrite and caching Adds one network hop latency
Driver Layer MySQL Connector/J jdbc:mysql:replication:// Zero code changes, driver auto-routes Language/driver specific

13.2 ProxySQL Read-Write Splitting Configuration

-- ProxySQL Admin 接口(默认端口 6032)/ ProxySQL Admin (default port 6032)

-- 添加后端服务器 / Add backend servers
INSERT INTO mysql_servers (hostgroup_id, hostname, port, max_connections, weight)
VALUES
  (10, 'source', 3306, 200, 1000),   -- HG 10: 写组 / Write group
  (20, 'replica1', 3306, 200, 500),  -- HG 20: 读组 / Read group
  (20, 'replica2', 3306, 200, 500);

-- 配置用户 / Configure user
INSERT INTO mysql_users (username, password, default_hostgroup)
VALUES ('app', 'AppPassword!', 10);

-- 读写分离规则 / Read-write splitting rules
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup, apply)
VALUES
  (1, 1, '^SELECT .* FOR UPDATE', 10, 1),       -- FOR UPDATE → 写组
  (2, 1, '^SELECT', 20, 1),                     -- 普通 SELECT → 读组
  (3, 1, '^(INSERT|UPDATE|DELETE|REPLACE)', 10, 1);  -- 写操作 → 写组

-- 配置复制延迟检查 / Configure replication lag check
UPDATE mysql_servers SET max_replication_lag = 1
WHERE hostgroup_id = 20;  -- 延迟 >1s 的从库自动摘除 / Auto-remove replicas with >1s lag

-- 加载配置到运行时 / Load config to runtime
LOAD MYSQL SERVERS TO RUNTIME;
LOAD MYSQL USERS TO RUNTIME;
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
SAVE MYSQL USERS TO DISK;
SAVE MYSQL QUERY RULES TO DISK;

13.3 Consistency Challenges in R/W Splitting

The core challenge of R/W splitting is read consistency issues caused by replication lag: after a user writes data and immediately reads it, they may get stale data from a lagging replica ("read-your-writes inconsistency").

Solutions

Strategy Implementation Use Case
Read-after-write from Primary Force reads to Primary for N seconds after a write Simple, sufficient for most cases
GTID Wait Record GTID after write, call WAIT_FOR_EXECUTED_GTID_SET() before reading from replica Precise, minimizes unnecessary waits
Causal Consistency (MySQL Router 8.0.27+) Router auto-tracks write GTID, routes to replicas that have applied it Zero code changes, integrates with InnoDB Cluster
-- GTID 等待示例 / GTID wait example

-- 写入后,获取当前 GTID 位点 / After write, get current GTID position
-- (应用代码记住这个值 / application code remembers this value)
SELECT @@global.gtid_executed AS write_gtid;

-- 在从库上读取前,等待 GTID 回放到位 / On replica, wait for GTID before reading
SELECT WAIT_FOR_EXECUTED_GTID_SET('3E11FA47-...:1-12345', 1);
-- 第二个参数是超时秒数。返回 0 表示成功,1 表示超时
-- Second parameter is timeout in seconds. Returns 0 on success, 1 on timeout
-- 超时则 fallback 到主库读取 / On timeout, fall back to reading from primary

14. Production Configuration Templates

14.1 Source (Primary) Complete Configuration

# /etc/my.cnf — Source (Primary) 生产配置
# Production configuration for Source (Primary)
[mysqld]
# === 基本标识 / Identity ===
server-id                       = 1
report-host                     = source.example.com
report-port                     = 3306

# === Binary Log / GTID ===
log-bin                         = /var/lib/mysql/binlog/mysql-bin
binlog-format                   = ROW
binlog-row-image                = FULL
binlog-expire-logs-seconds      = 604800       # 7 days
max-binlog-size                 = 256M
sync-binlog                     = 1            # 每次提交 fsync / fsync on every commit
gtid-mode                       = ON
enforce-gtid-consistency        = ON
log-replica-updates             = ON
binlog-checksum                 = CRC32
binlog-transaction-compression  = ON           # 8.0.20+

# === 并行复制优化 / Parallel replication optimization ===
binlog-transaction-dependency-tracking  = WRITESET
transaction-write-set-extraction        = XXHASH64
binlog-group-commit-sync-delay          = 100  # μs
binlog-group-commit-sync-no-delay-count = 20

# === 半同步复制 / Semi-sync replication ===
plugin-load-add                 = semisync_source.so
rpl-semi-sync-source-enabled    = 1
rpl-semi-sync-source-timeout    = 1000         # ms
rpl-semi-sync-source-wait-point = AFTER_SYNC
rpl-semi-sync-source-wait-for-replica-count = 1

# === InnoDB 持久化 / InnoDB durability ===
innodb-flush-log-at-trx-commit  = 1            # 双1 / dual-1
innodb-flush-method             = O_DIRECT

# === 连接与安全 / Connection & security ===
max-connections                 = 500
require-secure-transport        = ON

14.2 Replica Complete Configuration

# /etc/my.cnf — Replica 生产配置
# Production configuration for Replica
[mysqld]
# === 基本标识 / Identity ===
server-id                       = 2
report-host                     = replica1.example.com

# === Binary Log / GTID ===
log-bin                         = /var/lib/mysql/binlog/mysql-bin
binlog-format                   = ROW
gtid-mode                       = ON
enforce-gtid-consistency        = ON
log-replica-updates             = ON

# === Relay Log ===
relay-log                       = /var/lib/mysql/relaylog/relay-bin
relay-log-recovery              = ON           # 崩溃恢复自动清理 / Auto cleanup on crash recovery
relay-log-info-repository       = TABLE
master-info-repository          = TABLE

# === 并行复制 / Parallel replication ===
replica-parallel-workers        = 16
replica-parallel-type           = LOGICAL_CLOCK
replica-preserve-commit-order   = ON

# === 只读 / Read-only ===
read-only                       = ON
super-read-only                 = ON

# === 半同步复制 / Semi-sync replication ===
plugin-load-add                 = semisync_replica.so
rpl-semi-sync-replica-enabled   = 1

# === InnoDB(从库可放宽)/ InnoDB (relaxed for replica) ===
innodb-flush-log-at-trx-commit  = 2            # 放宽以提升性能 / Relaxed for performance
sync-binlog                     = 0            # 从库不需要 fsync binlog
innodb-flush-method             = O_DIRECT

# === 复制延迟保护 / Lag protection ===
slave-net-timeout               = 30           # Source 无响应超时 / Source unresponsive timeout

14.3 Key Monitoring Metrics Summary

Metric Source Healthy Threshold Alert Action
Replication Lag pt-heartbeat / perf_schema < 1s Check parallel repl config, large txns, I/O
Replica I/O Thread SHOW REPLICA STATUS Running: Yes Check network and auth
Replica SQL Thread SHOW REPLICA STATUS Running: Yes Check Last_Error, fix and START REPLICA
Semi-Sync Status Rpl_semi_sync_source_status ON Check Replica connection and ACK timeout
Semi-Sync No Tx Rpl_semi_sync_source_no_tx 0 (or trend not growing) Check network latency and timeout settings
Errant GTIDs GTID_SUBTRACT() Empty set Fix immediately! May cause inconsistency after failover
Binlog Disk Usage SHOW BINARY LOGS Adequate disk space Adjust expire_logs or add disk
GR Member Status replication_group_members All nodes ONLINE Check offline node's network and logs
-- Prometheus mysqld_exporter 复制相关指标 / Replication-related metrics
-- 推荐的 Grafana 告警规则 / Recommended Grafana alert rules:

# 复制延迟告警 / Replication lag alert
ALERT MySQLReplicaLag
  IF mysql_slave_status_seconds_behind_master > 5
  FOR 2m
  LABELS { severity = "warning" }
  ANNOTATIONS { summary = "MySQL replica lag > 5s on  $labels.instance "}}" }

# 复制断开告警 / Replication broken alert
ALERT MySQLReplicaBroken
  IF mysql_slave_status_slave_sql_running == 0
      OR mysql_slave_status_slave_io_running == 0
  FOR 1m
  LABELS { severity = "critical" }
  ANNOTATIONS { summary = "MySQL replication broken on  $labels.instance "}}" }

# 半同步退化告警 / Semi-sync fallback alert
ALERT MySQLSemiSyncOff
  IF mysql_global_status_rpl_semi_sync_master_status == 0
  FOR 1m
  LABELS { severity = "critical" }
  ANNOTATIONS { summary = "Semi-sync disabled on  $labels.instance "}}" }

15. Source Code Reading Guide

Reading the source code is the most authoritative way to understand MySQL replication internals. Below is a navigation map of key source files and entry functions:

15.1 Binary Log Write Path

File Key Function Purpose
sql/binlog.cc MYSQL_BIN_LOG::ordered_commit() Core implementation of Group Commit three stages (FLUSH/SYNC/COMMIT)
sql/binlog.cc MYSQL_BIN_LOG::write_event() Write a single binlog event to the binlog file
sql/log_event.cc Write_rows_log_event::write_row() Encoding of row data events in ROW format
sql/binlog.cc MYSQL_BIN_LOG::new_file_impl() Binlog file rotation logic
sql/rpl_gtid_persist.cc Gtid_table_persistor::save() Persist GTID set to mysql.gtid_executed table

15.2 Replica-Side Replication Threads

File Key Function Purpose
sql/rpl_replica.cc handle_slave_io() I/O Receiver thread main loop: connect to Source, receive binlog, write relay log
sql/rpl_replica.cc handle_slave_sql() SQL Applier thread main loop (single-threaded mode)
sql/rpl_rli_pdb.cc Slave_worker::slave_worker_exec_event() Parallel replication worker thread executing a single event
sql/rpl_replica.cc queue_event() I/O thread writes received events to relay log
sql/rpl_rli.cc apply_event_and_update_pos() Apply event and update position info
sql/rpl_replica.cc slave_start_workers() Start parallel replication coordinator and worker threads

15.3 Binlog Dump Thread (Source Side)

File Key Function Purpose
sql/rpl_binlog_sender.cc Binlog_sender::run() Binlog dump thread main loop
sql/rpl_binlog_sender.cc Binlog_sender::send_events() Read events from binlog and send to Replica
sql/rpl_binlog_sender.cc Binlog_sender::send_heartbeat_event() Send heartbeat events (prevent connection timeout)

15.4 Semi-Synchronous Replication

File Key Function Purpose
plugin/semisync/semisync_source.cc ReplSemiSyncMaster::commitTrx() Core logic for Source waiting for Replica ACK
plugin/semisync/semisync_source.cc ReplSemiSyncMaster::readSlaveReply() Read Replica's ACK response
plugin/semisync/semisync_replica.cc ReplSemiSyncSlave::slaveReply() Replica sends ACK to Source

15.5 Group Replication / Certification

File Key Function Purpose
plugin/group_replication/src/certifier.cc Certifier::certify() Core algorithm for transaction conflict detection
plugin/group_replication/src/gcs_operations.cc Gcs_operations::send_message() Broadcast messages via XCom protocol
plugin/group_replication/src/applier.cc Applier_module::apply_action_packet() Apply consensus-passed transactions
plugin/group_replication/src/member_info.cc Group_member_info::update_member_status() Member status change handling (ONLINE, RECOVERING, OFFLINE, etc.)

16. Replication Troubleshooting Cookbook

16.1 Common Error Codes and Fixes

Error Code Meaning Typical Cause Fix
1062 Duplicate entry Row already exists on Replica (possibly a DELETE was skipped previously) Fix with pt-table-sync, or skip the GTID
1032 Row not found Row missing on Replica (possibly an INSERT was skipped previously) Fix data differences with pt-table-sync
1146 Table doesn't exist Table missing on Replica (replication filter misconfiguration) Export table structure from Source and create on Replica
1236 Cannot read binlog Binlog on Source has been purged or corrupted Rebuild Replica from backup
1205 Lock wait timeout SQL Applier on Replica waiting for lock timeout (long query running on Replica) Kill long transactions on Replica, or set slave_transaction_retries
1756 Cross-engine transaction violates GTID consistency Transaction mixes InnoDB and MyISAM Migrate all tables to InnoDB

16.2 Useful mysqlbinlog Commands

# 查看 binlog 事件摘要 / View binlog event summary
mysqlbinlog --no-defaults --base64-output=DECODE-ROWS --verbose \
  mysql-bin.000042 | head -100

# 按时间范围过滤 / Filter by time range
mysqlbinlog --start-datetime="2026-04-20 10:00:00" \
  --stop-datetime="2026-04-20 10:30:00" \
  mysql-bin.000042

# 按 GTID 过滤 / Filter by GTID
mysqlbinlog --include-gtids="3E11FA47-...:100-110" \
  mysql-bin.000042

# 排除特定 GTID(用于恢复时跳过某些事务)/ Exclude specific GTIDs
mysqlbinlog --exclude-gtids="3E11FA47-...:105" \
  mysql-bin.000042 | mysql -u root -p

# 查看远程 binlog / View remote binlog
mysqlbinlog --read-from-remote-server \
  --host=source_host --port=3306 \
  --user=repl --password \
  mysql-bin.000042

# 解析 ROW 格式的数据变更 / Parse ROW format data changes
mysqlbinlog --base64-output=DECODE-ROWS --verbose --verbose \
  mysql-bin.000042 | grep -A5 "### UPDATE"
# 双 --verbose 会显示列类型信息
# Double --verbose shows column type information

16.3 Data Consistency Verification

# 使用 pt-table-checksum 校验一致性 / Verify consistency with pt-table-checksum
# 在 Source 上运行,会自动计算和比对所有 Replica 的数据校验和
# Run on Source, auto-computes and compares checksums across all Replicas
pt-table-checksum \
  --replicate=percona.checksums \
  --databases=app_db \
  --tables=orders,users,products \
  --chunk-size=5000 \
  --max-lag=2 \
  -h source_host -u monitor -p

# 查看差异报告 / View differences report
pt-table-checksum --replicate=percona.checksums --replicate-check-only \
  -h source_host -u monitor -p

# 使用 pt-table-sync 修复差异 / Fix differences with pt-table-sync
# 先 dry-run 查看会做什么 / First dry-run to see what would happen
pt-table-sync --print \
  --replicate=percona.checksums \
  h=source_host,u=monitor,p=xxx

# 确认无误后执行 / Execute after confirmation
pt-table-sync --execute \
  --replicate=percona.checksums \
  h=source_host,u=monitor,p=xxx

16.4 Replication Topology Visualization

# 使用 Orchestrator 可视化复制拓扑 / Visualize topology with Orchestrator
# Orchestrator 提供 Web UI 和 CLI

# CLI: 查看拓扑 / CLI: View topology
orchestrator-client -c topology -i source_host:3306
# 输出 / Output:
# source_host:3306   [OK,5.7.42,rw,ROW,>>,GTID]
# + replica1:3306    [OK,5.7.42,ro,ROW,>>,GTID]
# + replica2:3306    [OK,5.7.42,ro,ROW,>>,GTID]
# + replica3:3306    [OK,5.7.42,ro,ROW,>>,GTID] (delayed: 3600s)

# CLI: 手动切换 / CLI: Manual switchover
orchestrator-client -c graceful-master-takeover \
  -i source_host:3306 -d replica1:3306

# 或使用 MySQL Shell 查看 InnoDB Cluster 拓扑
# Or use MySQL Shell for InnoDB Cluster topology
mysqlsh -- cluster status --extended=1

16.5 Replication Performance Tuning Checklist

# Check Item Recommended Value Impact
1 replica_parallel_workers 8-16 Parallel transaction replay, reduces lag
2 replica_parallel_type LOGICAL_CLOCK Finer-grained parallelism than DATABASE
3 binlog_transaction_dependency_tracking WRITESET Enables parallelism even with low Source concurrency
4 binlog_group_commit_sync_delay 100-1000 (us) Larger Group Commit batches improve Replica parallelism
5 binlog_transaction_compression ON (8.0.20+) Reduces network transfer and disk I/O by 60-70%
6 Replica Disk NVMe SSD I/O is the most common Replica bottleneck
7 innodb_flush_log_at_trx_commit (Replica) 2 Relax durability on Replica for performance
8 sync_binlog (Replica) 0 Replica doesn't need to fsync binlog on every commit
9 Avoid Large Transactions ≤ 5000 rows per batch Large transactions can't parallelize and block other replays
10 slave_transaction_retries 10 Auto-retry transient errors (deadlocks)

16.6 Replication Health Check Script

-- 一键复制健康检查 / One-click replication health check

-- 1. 基本复制状态 / Basic replication status
SELECT
  CHANNEL_NAME,
  SERVICE_STATE AS io_state,
  LAST_ERROR_NUMBER AS io_err
FROM performance_schema.replication_connection_status;

-- 2. SQL Applier 状态 / SQL Applier status
SELECT
  CHANNEL_NAME,
  SERVICE_STATE AS sql_state,
  LAST_ERROR_NUMBER AS sql_err,
  LAST_ERROR_MESSAGE AS sql_err_msg
FROM performance_schema.replication_applier_status;

-- 3. Worker 线程状态 / Worker thread status
SELECT
  CHANNEL_NAME,
  WORKER_ID,
  LAST_APPLIED_TRANSACTION,
  APPLYING_TRANSACTION,
  LAST_ERROR_NUMBER
FROM performance_schema.replication_applier_status_by_worker
WHERE LAST_ERROR_NUMBER != 0;

-- 4. 复制延迟精确计算 / Precise replication lag
SELECT
  CHANNEL_NAME,
  TIMESTAMPDIFF(SECOND,
    LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
    LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP
  ) AS apply_lag_seconds,
  TIMESTAMPDIFF(SECOND,
    LAST_QUEUED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
    LAST_QUEUED_TRANSACTION_END_QUEUE_TIMESTAMP
  ) AS transport_lag_seconds
FROM performance_schema.replication_applier_status_by_worker
WHERE LAST_APPLIED_TRANSACTION != ''
ORDER BY apply_lag_seconds DESC LIMIT 1;

-- 5. Group Replication 成员状态 / GR member status
SELECT
  MEMBER_HOST,
  MEMBER_PORT,
  MEMBER_STATE,
  MEMBER_ROLE,
  MEMBER_VERSION
FROM performance_schema.replication_group_members;

-- 6. Group Replication 统计 / GR statistics
SELECT
  MEMBER_ID,
  COUNT_TRANSACTIONS_IN_QUEUE AS cert_queue,
  COUNT_TRANSACTIONS_CHECKED AS certified,
  COUNT_CONFLICTS_DETECTED AS conflicts,
  COUNT_TRANSACTIONS_ROWS_VALIDATING AS certifier_rows,
  TRANSACTIONS_COMMITTED_ALL_MEMBERS AS applied_gtids
FROM performance_schema.replication_group_member_stats;

17. Frequently Asked Questions

Q1: Can I revert from GTID to traditional position-based replication?

Yes, but it requires a specific sequence. MySQL supports online GTID mode switching (no downtime required): 1. All nodes: SET GLOBAL gtid_mode = ON_PERMISSIVE (allow mixed) 2. All nodes: SET GLOBAL gtid_mode = OFF_PERMISSIVE (stop generating GTIDs) 3. Wait until all Replicas' Ongoing_anonymous_transaction_count reaches 0 4. All nodes: SET GLOBAL gtid_mode = OFF 5. Reconfigure Replicas for position-based replication However, reverting is strongly discouraged — GTID is superior in every aspect.

Q2: Replication suddenly broke — how to quickly diagnose?

Follow these diagnostic steps: 1. SHOW REPLICA STATUS\G — check Last_IO_Error and Last_SQL_Error 2. I/O thread stopped: usually network issues, authentication failure, or source binlog purged. Check if source is reachable, replication user permissions are correct, SHOW BINARY LOGS to confirm needed binlog still exists 3. SQL thread stopped: usually data conflicts. Common errors: 1062 (duplicate entry), 1032 (row not found), 1146 (table doesn't exist) 4. Check Replica's error log (SHOW VARIABLES LIKE 'log_error') for details 5. Use mysqlbinlog to parse the failing binlog event to understand what operation failed

Q3: How to upgrade from async to semi-sync replication without downtime?

Semi-sync can be enabled online without downtime: 1. Install and enable semi-sync plugin on all Replicas first 2. Restart Replica I/O threads: STOP REPLICA IO_THREAD; START REPLICA IO_THREAD; 3. Install and enable semi-sync plugin on Source 4. Verify: SHOW STATUS LIKE 'Rpl_semi_sync_source_clients' should show connected semi-sync replica count Order matters: Replica first, then Source. If Source is enabled first, it will timeout and fall back to async because no semi-sync replicas are connected.

Q4: How to choose between Group Replication and traditional semi-sync?

The choice depends on your requirements: - Choose Group Replication / InnoDB Cluster: need automatic failover, don't want external tools, <= 9 nodes, same datacenter or low-latency network, can accept the PK-required constraint - Choose semi-sync + external tools: more than 9 nodes, cross-high-latency datacenters, have MyISAM or PK-less tables, already have mature tooling like Orchestrator/MHA, need more flexible topologies (cascading replication, etc.) The trend is that MySQL officially recommends InnoDB Cluster as the standard HA solution, but traditional approaches remain widely used in many large internet companies.

Q5: How to handle error 1032 (row not found) during replication?

Error 1032 means the Replica cannot find the corresponding row when executing UPDATE or DELETE. This usually means Source and Replica data are already inconsistent. Resolution steps: 1. Short-term fix: skip the transaction (inject empty transaction in GTID mode) 2. Root cause analysis: check if someone wrote directly to the Replica, if transactions were previously skipped, if there are errant GTIDs 3. Consistency verification: use pt-table-checksum to check data differences between Source and Replica 4. Data repair: use pt-table-sync to sync inconsistent data from Source to Replica 5. Prevention: ensure super_read_only=ON, regularly run pt-table-checksum

Q6: What to do when binlog takes too much disk space?

Multiple approaches: - Adjust retention: binlog_expire_logs_seconds (default 2592000 = 30 days). Set based on backup strategy: if daily full backup + binlog backup, shorten to 3-7 days - Manual purge: PURGE BINARY LOGS BEFORE '2026-04-15 00:00:00', but first confirm all replicas have processed these binlogs - Enable compression: binlog_transaction_compression=ON (MySQL 8.0.20+), can reduce size by 60-70% - Use MINIMAL row image: binlog_row_image=MINIMAL reduces ROW format data volume (but limits flashback capability) - Separate disk: place binlog on a separate disk/partition to avoid impacting the data directory

Q7: How to set up delayed replication for data recovery?

Delayed replication intentionally keeps a Replica behind the Source by a set amount of time, serving as a "time machine" — if data is accidentally deleted on the Source, you can recover it from the delayed replica before it applies that operation. CHANGE REPLICATION SOURCE TO SOURCE_DELAY = 3600; -- 1 hour delay START REPLICA; Recovery process: after discovering the mistake, immediately STOP REPLICA, then use START REPLICA UNTIL SQL_BEFORE_GTIDS = '...' to replay precisely up to the GTID before the mistake, then export data from this delayed replica for recovery.

Q8: What value should replica_parallel_workers be set to?

There's no universal formula, but here's experiential guidance: - Starting value: half to equal CPU core count (e.g., 8-16 for a 16-core machine) - Observe metrics: check load balance across workers via performance_schema.replication_applier_status_by_worker - More is not always better: coordination overhead between workers means too many (e.g., 128) may reduce throughput due to lock contention - Pair with Source optimization: increasing binlog_group_commit_sync_delay to allow more transactions to parallelize is more effective than just adding workers - Real bottleneck: often the bottleneck isn't thread count but disk I/O or large transactions

Q9: What if MySQL Router goes down? Isn't it a single point of failure?

The recommended deployment for MySQL Router is on the same host as the application (sidecar pattern). This way: - If Router dies = that application instance has issues, load balancer auto-removes it - No additional Router HA mechanism needed - Application connects to Router via localhost, zero network latency If Router must be deployed as a standalone service, add a VIP (Keepalived) or load balancer in front of it, but this pattern is not recommended.

Q10: How to smoothly migrate from traditional replication to InnoDB Cluster?

Migration path: 1. Prerequisites: ensure all tables use InnoDB with primary keys; enable GTID (if not already); binlog format = ROW 2. Upgrade to MySQL 8.0+ (if still on 5.7) 3. Validate on existing Replicas: dba.checkInstanceConfiguration() for each instance 4. Fix all configuration issues with dba.configureInstance() 5. Create Cluster on Source: dba.createCluster('myCluster', {adoptFromGR: false}) 6. Add Replicas one by one: stop replication first, then cluster.addInstance() 7. Deploy MySQL Router and gradually shift traffic The entire process can be done online, but the final switch steps should be performed during a maintenance window.

[← Prev: Locks & Deadlocks](/books/high-performance-mysql/lock-deadlock)
[Back to Contents](/books/high-performance-mysql)
[Next: Query Optimization →](/books/high-performance-mysql/query-optimization)
Rate this chapter
4.8  / 5  (11 ratings)

💬 Comments