Replication: Principles & Practice
MySQL Replication Complete Guide
MySQL 8.0 / 8.4
~45 min read
Source-code depth
MySQL replication is the cornerstone for building highly available, read-scalable, and disaster-recovery architectures. This chapter starts from the binary log write mechanism, deep-dives into the internals of asynchronous, semi-synchronous, and group replication; walks through GTID and multi-source replication setup step by step; covers parallel replication (MTS) tuning, replication lag root-cause analysis and fixes, failover best practices; and concludes with the full InnoDB Cluster + MySQL Router HA stack and read-write splitting patterns. All content is based on MySQL 8.0/8.4 source-code analysis and production experience.
1. Replication Overview
The core idea behind MySQL replication: the source server writes data changes to the binary log, and replicas pull these logs and replay them. This mechanism was introduced in MySQL 3.23 (2000) and has evolved into multiple modes over two decades.
1.1 What Problems Does Replication Solve?
| Use Case | Description | Typical Topology |
|---|---|---|
| Read Scale-Out | Distribute read queries across replicas to reduce source load | 1 Source + N Replicas |
| High Availability | Fast failover to a replica when the source fails | Source + Standby + Witness |
| Disaster Recovery | Cross-datacenter geo-replication | Cross-DC async replication |
| Online Backup | Run backups on replicas without impacting source performance | Dedicated backup replica |
| Analytics | Run reporting and analytics queries on replicas | Analytics replica |
| Rolling Upgrades | Upgrade replicas first, switch over, then upgrade old source | Version rotation |
1.2 Replication Mode Evolution Timeline
| Version | Milestone |
|---|---|
| 3.23 (2000) | STATEMENT-based async replication introduced |
| 5.1 (2008) | ROW-based binlog + MIXED mode |
| 5.5 (2010) | Semi-synchronous replication plugin |
| 5.6 (2013) | GTID + multi-threaded replica (schema-level parallelism) |
| 5.7 (2015) | Group Replication plugin + enhanced semi-sync + logical clock parallelism |
| 8.0 (2018) | InnoDB Cluster + MySQL Router + WriteSet-based parallel replication |
| 8.0.22+ | Asynchronous connection failover + Replica terminology rename |
| 8.4 (2024) | Replica terminology fully replaces Slave; Group Replication enhancements |
1.3 Core Terminology
Starting from MySQL 8.0.22+, Source / Replica replaces the old Master / Slave terminology. This chapter uses the new terms throughout, noting old command names for backward compatibility.
| New Term | Old Term | Description |
|---|---|---|
| Source | Master | Server that writes data changes to binlog |
| Replica | Slave | Server that reads and replays binlog |
| Binary Log (binlog) | - | Log recording all data change events |
| Relay Log | - | Local copy of binlog cached on the Replica |
| I/O Thread (Receiver) | I/O Thread | Responsible for pulling binlog from Source |
| SQL Thread (Applier) | SQL Thread | Responsible for replaying events in relay log |
| GTID | - | Global Transaction Identifier (server_uuid:seq_no) |
2. Replication Architecture Deep Dive
2.1 Asynchronous Replication
Asynchronous replication is MySQL's default mode. The source commits transactions and returns to the client without waiting for any replica acknowledgment. This means:
- Zero performance overhead: Source commit latency is not affected by network round-trips
- Potential data loss: If the source crashes, committed but not-yet-transmitted transactions are lost
- Uncontrolled replication lag: Replicas can fall arbitrarily behind the source
Async Replication Internal Flow
## Source side (each transaction commit):
1. InnoDB prepares transaction (redo log flush)
2. Binary log group commit: FLUSH stage → SYNC stage → COMMIT stage
3. Binlog dump thread sends new events to connected replicas
-- Source code: sql/rpl_binlog_sender.cc → Binlog_sender::send_events()
## Replica side:
4. I/O Receiver thread connects to Source via COM_BINLOG_DUMP_GTID
-- Source code: sql/rpl_replica.cc → handle_slave_io()
5. Receives binlog events → writes to local Relay Log
-- Source code: sql/rpl_replica.cc → queue_event()
6. SQL Applier thread reads Relay Log → replays events
-- Source code: sql/rpl_rli.cc → apply_event_and_update_pos()
7. Updates relay_log.info (or mysql.slave_relay_log_info table) with position
Source refs: sql/binlog.cc (MYSQL_BIN_LOG::ordered_commit), sql/rpl_binlog_sender.cc, sql/rpl_replica.cc
2.2 Semi-Synchronous Replication
Semi-synchronous replication strikes a balance between async and fully synchronous: the source, after committing a transaction, waits for at least one replica to acknowledge receipt of the binlog events before returning to the client. Note that the replica only needs to confirm "received" (written to relay log), not "applied".
Two Acknowledgment Points (AFTER_SYNC vs AFTER_COMMIT)
AFTER_SYNC (Lossless Semi-Sync)
Default since MySQL 5.7. Source waits for Replica ACK after the SYNC stage but before the COMMIT stage.
- Other clients cannot see unacknowledged transactions (phantom read safe)
- After source crash, transactions returned to client are guaranteed on at least one replica
- This is the recommended mode
AFTER_COMMIT (Legacy Semi-Sync)
MySQL 5.5-5.6 behavior. Source waits for ACK only after the COMMIT stage completes.
- Other clients may see unacknowledged transactions
- After source crash, "phantom transactions" are possible — client saw the commit but replica doesn't have it
- Use only when backward compatibility is needed
Semi-Sync Configuration
-- Source: 安装插件 / Install plugin
INSTALL PLUGIN rpl_semi_sync_source SONAME 'semisync_source.so';
SET GLOBAL rpl_semi_sync_source_enabled = 1;
SET GLOBAL rpl_semi_sync_source_timeout = 1000; -- ms, 超时退化为异步 / fallback to async
SET GLOBAL rpl_semi_sync_source_wait_for_replica_count = 1;
SET GLOBAL rpl_semi_sync_source_wait_point = 'AFTER_SYNC'; -- 无损模式 / lossless
-- Replica: 安装插件 / Install plugin
INSTALL PLUGIN rpl_semi_sync_replica SONAME 'semisync_replica.so';
SET GLOBAL rpl_semi_sync_replica_enabled = 1;
STOP REPLICA; START REPLICA;
-- 验证状态 / Verify status
SHOW STATUS LIKE 'Rpl_semi_sync_source_%';
-- Rpl_semi_sync_source_status = ON → 半同步生效
-- Rpl_semi_sync_source_no_tx = 0 → 异步退化次数
-- Rpl_semi_sync_source_yes_tx = 12345 → 半同步确认事务数
Timeout Fallback: When a replica does not return an ACK within
rpl_semi_sync_source_timeoutmilliseconds, the source automatically falls back to async replication without blocking writes. Semi-sync resumes automatically when the replica catches up. This design prioritizes availability over consistency — during network partitions the source keeps serving, but DBAs must monitorRpl_semi_sync_source_no_tx.
Source ref: plugin/semisync/semisync_source.cc → ReplSemiSyncMaster::commitTrx()
2.3 Three Modes Compared
| Feature | Async | Semi-Sync (AFTER_SYNC) | Group Replication |
|---|---|---|---|
| Data Safety | May lose data | Acknowledged won't be lost | Majority consensus |
| Write Latency | Lowest | +1 RTT | +consensus protocol overhead |
| Auto Failover | No (external tools needed) | No (external tools needed) | Yes (built-in) |
| Conflict Detection | No | No | Certification-based (WriteSet) |
| Multi-Primary | No | No | Optional (multi-primary mode) |
| Max Nodes | Unlimited | Unlimited | 9 |
| Best For | Read scale, backup, analytics | Zero-data-loss primary-replica | Fully automated HA cluster |
3. Binary Log Formats In Depth
The binary log (binlog) is the data carrier for replication. MySQL supports three binlog formats, each making different trade-offs between determinism, data volume, and compatibility:
3.1 STATEMENT Format
Records the executed SQL statements themselves. This is the oldest format.
-- binlog 中记录的内容 / What's recorded in binlog:
UPDATE orders SET status = 'shipped' WHERE id = 12345;
-- 优势 / Advantages:
-- • 日志量小(一条 SQL 可能影响百万行)
-- • 易于审计和理解
-- 劣势 / Disadvantages:
-- • 非确定性函数问题:NOW(), UUID(), RAND(), FOUND_ROWS()
-- • 存储过程/触发器可能产生不一致
-- • INSERT ... SELECT 需要更多锁(防止不确定性)
-- 危险示例 / Dangerous example:
UPDATE t SET val = RAND() WHERE id > 100 LIMIT 10;
-- Source 和 Replica 选择的行可能不同!
-- Source and Replica may select different rows!
3.2 ROW Format
Records the before-and-after images of every changed row. Default format in MySQL 8.0.
-- binlog 中记录的内容(伪代码)/ What's recorded (pseudo):
-- Table: orders (id=12345)
-- BEFORE: {id:12345, status:'pending', updated_at:'2026-04-19 10:00:00'}
-- AFTER: {id:12345, status:'shipped', updated_at:'2026-04-20 14:30:00'}
-- 优势 / Advantages:
-- • 完全确定性——任何函数、存储过程都安全
-- • 更少的锁竞争(不需要语句级锁保证确定性)
-- • 精确的行级别数据变更审计
-- 劣势 / Disadvantages:
-- • 日志量可能很大(UPDATE 100万行 = 100万行数据记录)
-- • 不可直接阅读(需要 mysqlbinlog --verbose 解码)
-- 优化选项 / Optimization:
-- binlog_row_image = MINIMAL → 只记录变更列 + PK(默认 FULL)
-- binlog_row_image = NOBLOB → 排除未变更的 BLOB 列
3.3 MIXED Format
Uses STATEMENT by default and automatically switches to ROW when MySQL detects potentially unsafe statements. The "unsafe" determination is implemented in decide_logging_format() in sql/sql_parse.cc.
Conditions Triggering ROW Switch
- Non-deterministic functions:
UUID(),RAND(),SYSDATE(),FOUND_ROWS() - User-defined functions (UDF)
INSERT ... SELECTinvolving AUTO_INCREMENT- DML with triggers or stored procedures
LIMITwithUPDATE/DELETEwithoutORDER BY- Temporary table operations (under certain conditions)
3.4 Format Comparison and Recommendations
| Dimension | STATEMENT | ROW | MIXED |
|---|---|---|---|
| Determinism | Low (context-dependent) | Perfect | High (auto-switches) |
| Log Size | Smallest | Largest | Medium |
| Readability | Directly readable SQL | Needs decoding | Partially readable |
| Production Rec. | Not recommended | Preferred | Acceptable |
Production Recommendation: Always use
binlog_format=ROW(MySQL 8.0 default). Withbinlog_row_image=FULL(default), you can usemysqlbinlog --verbosefor data recovery and auditing. If binlog disk pressure is high, considerbinlog_row_image=MINIMALbut note this limits flashback capabilities.
3.5 Binary Log Group Commit
To reduce the number of fsync calls, MySQL batches binlog writes from multiple transactions into a single fsync — this is Group Commit. It proceeds in three stages (implemented in MYSQL_BIN_LOG::ordered_commit()):
## Group Commit 三阶段 / Three stages:
FLUSH stage: Leader 收集队列中所有等待的事务
将它们的 binlog events 写入 binlog 文件(write, 不 fsync)
Leader collects all waiting transactions in the queue
Writes their binlog events to the binlog file (write, no fsync)
SYNC stage: 对 binlog 文件执行一次 fsync(合并多个事务)
One fsync call for the binlog file (batching multiple transactions)
-- 由 binlog_group_commit_sync_delay (μs) 控制等待时间
-- 由 binlog_group_commit_sync_no_delay_count 控制最小批量大小
COMMIT stage: 按顺序调用各事务的 InnoDB commit
Calls InnoDB commit for each transaction in order
-- 关键参数 / Key parameters:
binlog_group_commit_sync_delay = 0 -- 默认不额外等待
binlog_group_commit_sync_no_delay_count = 0 -- 不设最小批量
-- 在高并发写入场景下,适当增大 sync_delay 可以提升吞吐量
-- In high-concurrency write scenarios, increasing sync_delay can improve throughput
Source ref: sql/binlog.cc → MYSQL_BIN_LOG::ordered_commit()
4. GTID Replication In Depth
GTID (Global Transaction Identifier) is a global transaction identification mechanism introduced in MySQL 5.6, formatted as server_uuid:transaction_id. Every transaction has a unique identifier across the entire replication topology, greatly simplifying replication management.
4.1 How GTID Works
Each MySQL instance has a unique server_uuid (auto-generated in auto.cnf). When a transaction commits on the source:
- A GTID is assigned, e.g.,
3E11FA47-71CA-11E1-9E33-C80AA9429562:42 - The GTID is written to binlog as the first event of the transaction (
Gtid_log_event) - After commit, the GTID is added to the
gtid_executedset - When the replica receives this transaction, it first checks whether the GTID is already in its own
gtid_executed— if so, it skips it (idempotency) - After execution, the GTID is added to the replica's own
gtid_executed
-- 查看 GTID 状态 / Check GTID status
SELECT @@global.gtid_executed\G
-- 输出 / Output:
-- 3E11FA47-71CA-11E1-9E33-C80AA9429562:1-12345,
-- A1B2C3D4-71CA-11E1-9E33-C80AA9429562:1-500
-- 查看已清理的 GTID / View purged GTIDs
SELECT @@global.gtid_purged\G
-- GTID 集合运算(MySQL 8.0 内置函数)/ GTID set operations
SELECT GTID_SUBTRACT(@@global.gtid_executed, '3E11FA47-71CA-11E1-9E33-C80AA9429562:1-12340');
-- 结果:3E11FA47-...:12341-12345 (Replica 落后的事务)
SELECT GTID_SUBSET('3E11FA47-...:1-100', @@global.gtid_executed);
-- 判断一个 GTID 集合是否是另一个的子集 / Check if one GTID set is a subset
4.2 GTID Setup Step by Step (From Scratch)
Step 1: Source Configuration
# /etc/my.cnf (Source)
[mysqld]
server-id = 1 # 每个实例唯一 / unique per instance
log-bin = mysql-bin
binlog-format = ROW
gtid-mode = ON
enforce-gtid-consistency = ON
log-replica-updates = ON # 链式复制必须 / required for chain replication
binlog-expire-logs-seconds = 604800 # 7天 / 7 days
sync-binlog = 1
innodb-flush-log-at-trx-commit = 1 # 双1配置 / dual-1 for durability
Step 2: Create Replication User
-- 在 Source 上执行 / Run on Source
CREATE USER 'repl'@'192.168.1.%' IDENTIFIED BY 'StrongPassword123!';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'192.168.1.%';
FLUSH PRIVILEGES;
Step 3: Replica Configuration
# /etc/my.cnf (Replica)
[mysqld]
server-id = 2 # 与 Source 不同 / different from Source
log-bin = mysql-bin # 级联复制需要 / needed for cascading
binlog-format = ROW
gtid-mode = ON
enforce-gtid-consistency = ON
log-replica-updates = ON
relay-log = relay-bin
relay-log-recovery = ON # 崩溃后自动清理损坏的 relay log
read-only = ON # 普通用户只读 / read-only for non-SUPER users
super-read-only = ON # SUPER 用户也只读 / read-only even for SUPER
replica-parallel-workers = 4 # 并行复制线程数 / parallel applier threads
Step 4: Initialize Replica Data
# 方法 A: mysqldump(适合小数据集 full_backup.sql
# 在 Replica 上导入 / Import on Replica:
mysql -u root -p < full_backup.sql
# 方法 B: xtrabackup(适合大数据集)
# Method B: xtrabackup (suitable for large datasets)
xtrabackup --backup --target-dir=/backup/full -u root -p
xtrabackup --prepare --target-dir=/backup/full
# 将备份拷贝到 Replica 的 datadir / Copy backup to Replica's datadir
xtrabackup --copy-back --target-dir=/backup/full
# 方法 C: CLONE 插件(MySQL 8.0.17+,最简单)
# Method C: CLONE plugin (MySQL 8.0.17+, simplest)
-- 在 Replica 上执行 / Run on Replica:
INSTALL PLUGIN clone SONAME 'mysql_clone.so';
SET GLOBAL clone_valid_donor_list = 'source_host:3306';
CLONE INSTANCE FROM 'repl'@'source_host':3306 IDENTIFIED BY 'StrongPassword123!';
Step 5: Start Replication
-- 在 Replica 上执行 / Run on Replica
CHANGE REPLICATION SOURCE TO
SOURCE_HOST = 'source_host',
SOURCE_PORT = 3306,
SOURCE_USER = 'repl',
SOURCE_PASSWORD = 'StrongPassword123!',
SOURCE_AUTO_POSITION = 1, -- GTID 自动定位 / GTID auto-positioning
GET_SOURCE_PUBLIC_KEY = 1; -- caching_sha2_password 需要
START REPLICA;
-- 验证复制状态 / Verify replication status
SHOW REPLICA STATUS\G
-- 关键检查项 / Key checks:
-- Replica_IO_Running: Yes
-- Replica_SQL_Running: Yes
-- Seconds_Behind_Source: 0
-- Retrieved_Gtid_Set: 与 Source 的 gtid_executed 匹配
-- Executed_Gtid_Set: 持续增长
4.3 GTID vs Traditional Position-Based Replication
| Feature | Traditional Position | GTID |
|---|---|---|
| Failover | Manual binlog position calculation | Auto-positioning (SOURCE_AUTO_POSITION=1) |
| Switching Source | Complex and error-prone | Simple: just CHANGE REPLICATION SOURCE TO |
| Skipping Errors | SET GLOBAL sql_replica_skip_counter = 1 |
Inject empty transaction |
| Consistency Check | Data comparison only | GTID set comparison provides quick initial check |
Skipping Errant Transactions with GTID
-- 假设需要跳过 Source 的 GTID: 3E11FA47-...:100
-- Suppose we need to skip Source's GTID: 3E11FA47-...:100
STOP REPLICA;
SET GTID_NEXT = '3E11FA47-71CA-11E1-9E33-C80AA9429562:100';
BEGIN; COMMIT; -- 注入空事务 / inject empty transaction
SET GTID_NEXT = 'AUTOMATIC';
START REPLICA;
enforce_gtid_consistency Restrictions: With GTID enabled, the following are prohibited: (1) DML mixing InnoDB and MyISAM tables in a transaction; (2)
CREATE TABLE ... SELECT(before MySQL 8.0.21); (3)CREATE TEMPORARY TABLEinside a transaction. These restrictions ensure each GTID corresponds to a unique atomic transaction. If your application uses these patterns, refactor before enabling GTID.
Source refs: sql/rpl_gtid.h (Gtid_set), sql/rpl_gtid_persist.cc, sql/rpl_replica.cc → get_source_uuid()
5. Multi-Source Replication
Multi-source replication allows a single replica to receive replication data from multiple sources simultaneously. A typical use case is aggregating data from multiple shards into a central node for analytics and reporting. MySQL 5.7+ supports this natively, using named replication channels to distinguish different sources.
5.1 Configuration Steps
-- Replica: 配置通道 A(连接 Source-A)/ Configure channel A (connect to Source-A)
CHANGE REPLICATION SOURCE TO
SOURCE_HOST = 'source-a.example.com',
SOURCE_PORT = 3306,
SOURCE_USER = 'repl',
SOURCE_PASSWORD = 'PasswordA!',
SOURCE_AUTO_POSITION = 1
FOR CHANNEL 'source_a';
-- Replica: 配置通道 B(连接 Source-B)/ Configure channel B (connect to Source-B)
CHANGE REPLICATION SOURCE TO
SOURCE_HOST = 'source-b.example.com',
SOURCE_PORT = 3306,
SOURCE_USER = 'repl',
SOURCE_PASSWORD = 'PasswordB!',
SOURCE_AUTO_POSITION = 1
FOR CHANNEL 'source_b';
-- 启动所有通道 / Start all channels
START REPLICA;
-- 或者单独启动 / Or start individually:
START REPLICA FOR CHANNEL 'source_a';
START REPLICA FOR CHANNEL 'source_b';
-- 查看特定通道状态 / Check specific channel status
SHOW REPLICA STATUS FOR CHANNEL 'source_a'\G
SHOW REPLICA STATUS FOR CHANNEL 'source_b'\G
-- performance_schema 中的通道信息 / Channel info in performance_schema
SELECT * FROM performance_schema.replication_connection_status;
SELECT * FROM performance_schema.replication_applier_status_by_worker;
5.2 Multi-Source Considerations
- Schema/table name conflicts: Same schema names on different sources will write to the same schema on the replica. Use replication filters (
REPLICATE_REWRITE_DB) or ensure each source uses different schema names - GTID conflicts: Each source must have a different
server_uuid; GTIDs won't conflict naturally - AUTO_INCREMENT conflicts: Tables with AUTO_INCREMENT on multiple sources may have PK conflicts. Use
auto_increment_incrementandauto_increment_offsetto offset, or use UUID PKs - Channel independence: Each channel has independent I/O and SQL threads; one channel's failure does not affect others
- Repository type: Must use
master_info_repository = TABLEandrelay_log_info_repository = TABLE(TABLE is default in MySQL 8.0)
6. Replication Filters
Replication filters allow replicas to selectively replicate specific databases or tables. Filtering can happen at two levels:
| Filter Level | Parameter | Description |
|---|---|---|
| Source (Binlog) | binlog-do-db |
Generate binlog only for specified db (based on current USE db) |
| Source (Binlog) | binlog-ignore-db |
Ignore binlog for specified db |
| Replica (SQL Thread) | replicate-do-db |
Apply events only for specified db |
| Replica (SQL Thread) | replicate-ignore-db |
Ignore events for specified db |
| Replica (SQL Thread) | replicate-do-table |
Apply events only for specified table |
| Replica (SQL Thread) | replicate-ignore-table |
Ignore events for specified table |
| Replica (SQL Thread) | replicate-wild-do-table |
Wildcard table name matching |
| Replica (SQL Thread) | replicate-rewrite-db |
Database name rewrite (from->to) |
Filter Pitfalls:
binlog-do-dbandreplicate-do-dbin STATEMENT format are based on the currentUSE db, not the actual database of the table. For example, runningUSE app; UPDATE logs.audit SET ...: if the filter isreplicate-do-db=logs, this statement will be ignored (because the current db isapp). In ROW format, filtering is based on the actual database name of the table, which behaves more intuitively. This is yet another reason to use ROW format.
-- MySQL 8.0.22+ 在线设置过滤器(无需重启)/ Online filter setup (no restart)
STOP REPLICA SQL_THREAD FOR CHANNEL 'source_a';
CHANGE REPLICATION FILTER
REPLICATE_DO_DB = (orders, inventory)
FOR CHANNEL 'source_a';
START REPLICA SQL_THREAD FOR CHANNEL 'source_a';
-- 查看当前过滤器 / View current filters
SELECT * FROM performance_schema.replication_applier_filters;
SELECT * FROM performance_schema.replication_applier_global_filters;
7. Parallel Replication (Multi-Threaded Applier)
The single-threaded SQL Applier is the biggest bottleneck for replication lag. To speed up replay, MySQL supports multi-threaded parallel application of relay log transactions. The parallelism granularity has evolved through three generations:
7.1 Parallelism Strategy Comparison
| Strategy | Version | replica_parallel_type |
Parallelism Granularity | Limitations |
|---|---|---|---|---|
| Database-Based | 5.6+ | DATABASE |
Transactions on different DBs run in parallel | Falls back to single-thread for single-DB workloads |
| Logical Clock | 5.7+ | LOGICAL_CLOCK |
Transactions within the same Group Commit batch run in parallel | Small batches when source concurrency is low |
| WriteSet-Based | 8.0+ | LOGICAL_CLOCK + transaction_write_set_extraction |
Non-conflicting transactions run in parallel (even across batches) | Extra memory for WriteSet tracking |
7.2 Parallel Replication Configuration
# /etc/my.cnf (Replica)
[mysqld]
# 并行 worker 线程数(推荐 = CPU 核心数 / 2 到核心数)
# Parallel worker threads (recommended = CPU cores / 2 to cores)
replica-parallel-workers = 16
# 并行类型:LOGICAL_CLOCK(MySQL 8.0 默认)
replica-parallel-type = LOGICAL_CLOCK
# 保持提交顺序(默认 ON)——保证从库读一致性
# Preserve commit order (default ON) — ensures read consistency on replica
replica-preserve-commit-order = ON
# Source 端优化:增大 group commit 窗口提升 Replica 并行度
# Source optimization: increase group commit window for better Replica parallelism
binlog-group-commit-sync-delay = 100 -- μs
binlog-group-commit-sync-no-delay-count = 20
# WriteSet 追踪(Source 端启用)
# WriteSet tracking (enable on Source)
binlog-transaction-dependency-tracking = WRITESET
transaction-write-set-extraction = XXHASH64
7.3 WriteSet Parallelism Internals
The core idea: if two transactions modify non-overlapping rows (non-overlapping WriteSets), they can safely run in parallel on the Replica, even if they weren't in the same Group Commit batch on the Source.
Implementation:
- Source side: each transaction computes a hash set of its modified rows' PK/unique key values (WriteSet)
- Records each transaction's
last_committed(logical clock number) andsequence_numberin binlog - If the current transaction's WriteSet doesn't intersect with the last
binlog_transaction_dependency_history_sizetransactions,last_committedis set to an earlier value — allowing more parallelism - Replica side: transactions with the same
last_committedcan run in parallel
-- 查看 binlog 中的并行信息 / View parallelism info in binlog
$ mysqlbinlog --verbose mysql-bin.000042 | grep -A2 "last_committed"
-- 输出示例 / Example output:
#260420 14:30:01 ... last_committed=100 sequence_number=105
#260420 14:30:01 ... last_committed=100 sequence_number=106
#260420 14:30:01 ... last_committed=100 sequence_number=107
#260420 14:30:01 ... last_committed=105 sequence_number=108
-- 事务 105, 106, 107 可以并行(last_committed 相同)
-- Transactions 105, 106, 107 can run in parallel (same last_committed)
-- 事务 108 必须等 105 完成后才能开始
-- Transaction 108 must wait for 105 to complete
7.4 Parallel Replication Monitoring
-- 查看各 Worker 线程状态 / View worker thread status
SELECT WORKER_ID, LAST_APPLIED_TRANSACTION,
APPLYING_TRANSACTION, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE
FROM performance_schema.replication_applier_status_by_worker;
-- 查看协调器状态 / View coordinator status
SELECT * FROM performance_schema.replication_applier_status_by_coordinator;
-- 衡量并行效率:比较 worker 空闲时间 / Measure parallelism efficiency
SELECT WORKER_ID,
LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP,
LAST_APPLIED_TRANSACTION_START_APPLY_TIMESTAMP,
TIMESTAMPDIFF(MICROSECOND,
LAST_APPLIED_TRANSACTION_START_APPLY_TIMESTAMP,
LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP) AS apply_us
FROM performance_schema.replication_applier_status_by_worker
WHERE WORKER_ID > 0;
Source refs: sql/rpl_rli_pdb.cc (Slave_worker), sql/rpl_replica.cc → slave_start_workers()
8. Replication Lag Monitoring & Troubleshooting
8.1 Lag Measurement Methods
| Method | Command/Tool | Precision | Caveats |
|---|---|---|---|
Seconds_Behind_Source |
SHOW REPLICA STATUS |
Second-level | Based on timestamp difference between currently processing event and now. Shows NULL when I/O thread disconnects. May jump during long transactions. Not accurate. |
| GTID Gap | GTID_SUBTRACT() |
Transaction-level | Shows "how many transactions behind" but can't directly convert to time |
| Heartbeat | pt-heartbeat | Sub-second | Most reliable method. Source writes timestamps periodically, Replica reads and calculates delay |
| performance_schema | Replication tables | Microsecond-level | MySQL 8.0 native high-precision timestamps |
Using pt-heartbeat (Recommended)
# 在 Source 上创建心跳表并启动守护进程
# Create heartbeat table on Source and start daemon
pt-heartbeat --database=percona --create-table \
--update --daemonize -h source_host -u monitor -p
# 在 Replica 上监控延迟
# Monitor lag on Replica
pt-heartbeat --database=percona --monitor -h replica_host -u monitor -p
# 输出 / Output:
# 0.05s [ 0.03s, 0.04s, 0.03s ] ← 当前/1m/5m/15m 平均延迟
# 或者检查一次 / Or check once:
pt-heartbeat --database=percona --check -h replica_host -u monitor -p
# 输出: 0.05
Using performance_schema High-Precision Monitoring
-- MySQL 8.0 高精度复制延迟 / MySQL 8.0 high-precision lag
SELECT
CHANNEL_NAME,
LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP AS last_apply_ts,
LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP AS source_commit_ts,
TIMESTAMPDIFF(MICROSECOND,
LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP
) / 1000000.0 AS lag_seconds
FROM performance_schema.replication_applier_status_by_worker
WHERE LAST_APPLIED_TRANSACTION != ''
ORDER BY lag_seconds DESC
LIMIT 1;
8.2 Common Lag Causes and Fixes
| Cause | Diagnosis | Solution |
|---|---|---|
| Single-threaded SQL replay | replica_parallel_workers = 0 or 1 |
Enable parallel replication (workers = 16, LOGICAL_CLOCK + WRITESET) |
| Source large transactions | Oversized Write_rows events in binlog | Split large batch ops into smaller chunks (1000-5000 rows each) |
| Missing indexes on Replica | ROW format: Replica looks up PK row-by-row; full table scan without index | Ensure Replica table schema matches Source |
| Replica disk I/O bottleneck | iostat shows disk utilization near 100% |
Upgrade disk (SSD/NVMe), or use flush=2, sync_binlog=0 on Replica |
| Query lock contention on Replica | SQL Applier waiting for row locks (lock wait timeout) | Avoid long transactions/queries on Replica; or use dedicated analytics replica |
| Network latency / insufficient bandwidth | I/O thread's Retrieved_Gtid_Set far behind Source | Optimize network / enable binlog compression (binlog_transaction_compression=ON, 8.0.20+) |
| DDL blocking | ALTER TABLE on a large table blocks all subsequent transaction replay | Use pt-online-schema-change or gh-ost for online DDL |
8.3 Binlog Compression (MySQL 8.0.20+)
-- Source 端启用 / Enable on Source
SET GLOBAL binlog_transaction_compression = ON;
SET GLOBAL binlog_transaction_compression_level_zstd = 3; -- 1-22, 默认3
-- 监控压缩效果 / Monitor compression effectiveness
SELECT *
FROM performance_schema.binary_log_transaction_compression_stats\G
-- COMPRESSION_TYPE: ZSTD
-- TRANSACTION_COUNT: 50000
-- COMPRESSED_BYTES_COUNTER: 125000000
-- UNCOMPRESSED_BYTES_COUNTER: 500000000 ← 75% 压缩率
9. Failover Procedures
9.1 Planned Switchover
Planned switchover is used during maintenance windows (upgrades, hardware replacement, etc.) and can achieve zero data loss.
## 步骤 1: 停止 Source 写入 / Step 1: Stop writes on Source
SET GLOBAL read_only = ON;
SET GLOBAL super_read_only = ON;
## 步骤 2: 等待 Replica 追上 / Step 2: Wait for Replica to catch up
-- 在 Replica 上检查 / Check on Replica:
SELECT GTID_SUBTRACT(
(SELECT @@global.gtid_executed), -- Replica 已执行的
(SELECT @@global.gtid_executed) -- 这里应传入 Source 的值
);
-- 当结果为空集时,Replica 已完全追上
-- When result is empty set, Replica is fully caught up
-- 或更简单 / Or simpler:
SHOW REPLICA STATUS\G
-- 等待 Seconds_Behind_Source = 0 且 Retrieved = Executed GTID Set
## 步骤 3: 停止 Replica 的复制 / Step 3: Stop replication on Replica
STOP REPLICA;
RESET REPLICA ALL; -- 清除复制配置 / Clear replication config
## 步骤 4: 提升 Replica 为新 Source / Step 4: Promote Replica to new Source
SET GLOBAL read_only = OFF;
SET GLOBAL super_read_only = OFF;
## 步骤 5: 将旧 Source 配置为新 Replica / Step 5: Configure old Source as new Replica
CHANGE REPLICATION SOURCE TO
SOURCE_HOST = 'new_source_host',
SOURCE_USER = 'repl',
SOURCE_PASSWORD = '...',
SOURCE_AUTO_POSITION = 1;
SET GLOBAL read_only = ON;
START REPLICA;
## 步骤 6: 更新应用端连接 / Step 6: Update application connections
-- VIP 漂移 / DNS 切换 / ProxySQL 后端切换
-- VIP drift / DNS switch / ProxySQL backend switch
9.2 Unplanned Failover
Emergency failover when the source suddenly crashes. The key is selecting the replica with the most complete data as the new source:
## 步骤 1: 确认 Source 确实不可恢复 / Step 1: Confirm Source is truly unrecoverable
-- 避免脑裂:确保旧 Source 不会自行恢复后继续接受写入
-- Avoid split-brain: ensure old Source won't recover and accept writes
## 步骤 2: 找到最新的 Replica / Step 2: Find the most up-to-date Replica
-- 在所有 Replica 上执行 / Run on all Replicas:
SELECT @@global.gtid_executed;
-- 选择 gtid_executed 最大(事务最多)的 Replica
-- Pick the Replica with the largest gtid_executed (most transactions)
## 步骤 3: 等待候选 Replica 应用完 relay log / Step 3: Let candidate finish relay log
STOP REPLICA IO_THREAD;
-- 等待 SQL Thread 处理完剩余的 relay log
-- Wait until SQL Thread finishes remaining relay log
## 步骤 4: 提升为新 Source / Step 4: Promote to new Source
STOP REPLICA;
RESET REPLICA ALL;
SET GLOBAL read_only = OFF;
SET GLOBAL super_read_only = OFF;
## 步骤 5: 将其他 Replica 指向新 Source / Step 5: Point other Replicas to new Source
-- GTID 模式下非常简单 / Very simple with GTID:
STOP REPLICA;
CHANGE REPLICATION SOURCE TO
SOURCE_HOST = 'new_source_host',
SOURCE_AUTO_POSITION = 1;
START REPLICA;
Errant Transaction Problem: If a Replica has "errant transactions" (GTIDs from writes made directly on the Replica, not present on the Source), these transactions will be pushed to other Replicas after failover, potentially causing inconsistencies. Prevention: (1) Always enable
super_read_only=ON; (2) UseGTID_SUBTRACT()to regularly check for errant GTIDs.
-- 检查 Errant GTID / Check for Errant GTIDs
-- 在每个 Replica 上运行:/ Run on each Replica:
SELECT GTID_SUBTRACT(
@@global.gtid_executed,
'source_uuid:1-N' -- 替换为 Source 的 server_uuid 和 GTID 范围
) AS errant_gtids;
-- 如果结果非空,说明有 errant GTID!
-- If result is non-empty, errant GTIDs exist!
9.3 Automated Failover Tools
| Tool | Maintainer | Features |
|---|---|---|
| InnoDB Cluster / Group Replication | Oracle (Official) | Built-in consensus protocol, automatic leader election, no external tools needed |
| Orchestrator | GitHub / Shlomi Noach | Topology visualization, intelligent failure detection and switchover, pseudo-GTID support |
| MHA (Master High Availability) | Yoshinori Matsunobu | Classic tool, supports non-GTID mode, relay log difference compensation |
| ProxySQL | ProxySQL Inc. | Intelligent proxy + failure detection + R/W splitting |
| Vitess | PlanetScale / CNCF | Sharding + replication management + auto-failover (YouTube-scale proven) |
10. Group Replication
Group Replication (GR), introduced in MySQL 5.7.17+, is a multi-node replication solution based on a Paxos-variant protocol (XCom). Unlike traditional primary-replica setups, GR provides:
- Automatic failure detection and leader election: automatic re-election after node failure, no external tools needed
- Transaction conflict detection (Certification): detects write conflicts before commit, preventing dirty writes
- Majority confirmation: transactions must be acknowledged by a majority of nodes before committing
10.1 Single-Primary vs Multi-Primary
Single-Primary (Recommended)
- Only one node accepts writes, others are read-only
- No write conflicts possible, more predictable performance
- Automatic election of new Primary after failure
- Best compatibility with traditional primary-replica applications
- Suitable for 99% of use cases
Multi-Primary
- All nodes can accept writes
- On write conflict, the later-committing transaction is rolled back
- Foreign key cascades and SERIALIZABLE isolation not supported
- Cross-node DDL operations need extra coordination
- Suitable for distributed writes with low conflict probability
10.2 Certification (Conflict Detection) Internals
When a node wants to commit a transaction:
- Transaction executes locally (optimistic execution) but does not commit
- Extracts the transaction's WriteSet (hash set of modified rows' PK/UK) and ReadSet
- Broadcasts WriteSet + transaction content to all nodes via the XCom protocol
- Each node performs Certification: checks if the WriteSet conflicts with recently certified transactions
- If no conflict and majority confirms → CERTIFY_OK, commit
- If conflict → CERTIFY_FAIL, transaction is rolled back on the originating node
Source ref: plugin/group_replication/src/certifier.cc → Certifier::certify()
10.3 Group Replication Setup (3-Node Single-Primary)
# /etc/my.cnf — 每个节点都需要(修改 server-id 和地址)
# /etc/my.cnf — required on each node (change server-id and address)
[mysqld]
server-id = 1 # 每个节点不同: 1, 2, 3
log-bin = mysql-bin
binlog-format = ROW
gtid-mode = ON
enforce-gtid-consistency = ON
log-replica-updates = ON
binlog-checksum = NONE # GR 要求 / Required by GR
# Group Replication 配置 / Group Replication settings
plugin_load_add = group_replication.so
group_replication_group_name = "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee" # UUID 格式
group_replication_local_address = "node1:33061" # GR 通信端口(非 MySQL 端口)
group_replication_group_seeds = "node1:33061,node2:33061,node3:33061"
group_replication_start_on_boot = OFF # 首次手动启动 / Manual start first time
group_replication_single_primary_mode = ON # Single-Primary 模式
group_replication_bootstrap_group = OFF
# 安全与认证 / Security & authentication
group_replication_recovery_use_ssl = ON
group_replication_ssl_mode = REQUIRED
-- Node 1: 引导集群 / Bootstrap the group
SET GLOBAL group_replication_bootstrap_group = ON;
START GROUP_REPLICATION;
SET GLOBAL group_replication_bootstrap_group = OFF;
-- Node 2, Node 3: 加入集群 / Join the group
START GROUP_REPLICATION;
-- 验证集群状态 / Verify cluster status
SELECT MEMBER_ID, MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
FROM performance_schema.replication_group_members;
-- +------+-------+------+---------+--------+
-- | ... | node1 | 3306 | ONLINE | PRIMARY|
-- | ... | node2 | 3306 | ONLINE | SECONDARY|
-- | ... | node3 | 3306 | ONLINE | SECONDARY|
-- +------+-------+------+---------+--------+
10.4 Group Replication Limitations
- Maximum 9 nodes (Paxos protocol limitation)
- All tables must use InnoDB with a primary key or non-null unique key
- No gap lock support (gap locks under REPEATABLE READ don't work in Multi-Primary)
- Large transaction limit: transactions exceeding
group_replication_transaction_size_limit(default 150MB) are rejected - Network sensitive: inter-node latency should be < 10ms (same datacenter or region)
- Write throughput: limited by consensus protocol overhead; single-node async replication has higher write throughput
11. InnoDB Cluster
InnoDB Cluster is MySQL's official complete high-availability solution, composed of three components:
| Component | Role |
|---|---|
| Group Replication | Underlying replication engine, provides data synchronization and automatic failure detection |
| MySQL Shell | Management tool, simplifies cluster configuration and management via AdminAPI |
| MySQL Router | Intelligent proxy, automatically routes read/write requests to the correct node |
11.1 Creating Cluster with MySQL Shell
# 使用 MySQL Shell 连接到第一个节点 / Connect to first node
$ mysqlsh root@node1:3306
// 检查实例是否满足 GR 要求 / Check if instance meets GR requirements
mysql-js> dba.checkInstanceConfiguration('root@node1:3306')
// 自动修复配置问题 / Auto-fix configuration issues
mysql-js> dba.configureInstance('root@node1:3306')
// 创建集群(在 node1 上引导)/ Create cluster (bootstrap on node1)
mysql-js> var cluster = dba.createCluster('myCluster')
// 添加其他节点 / Add other nodes
mysql-js> cluster.addInstance('root@node2:3306')
mysql-js> cluster.addInstance('root@node3:3306')
// 查看集群状态 / Check cluster status
mysql-js> cluster.status()
// {
// "clusterName": "myCluster",
// "defaultReplicaSet": {
// "topology": {
// "node1:3306": { "status": "ONLINE", "memberRole": "PRIMARY" },
// "node2:3306": { "status": "ONLINE", "memberRole": "SECONDARY" },
// "node3:3306": { "status": "ONLINE", "memberRole": "SECONDARY" }
// }
// }
// }
// 手动切换 Primary / Manual Primary switchover
mysql-js> cluster.setPrimaryInstance('root@node2:3306')
// 从集群中移除节点 / Remove node from cluster
mysql-js> cluster.removeInstance('root@node3:3306')
11.2 InnoDB ClusterSet (Cross-Datacenter)
MySQL 8.0.27+ introduces InnoDB ClusterSet, allowing deployment of multiple InnoDB Cluster replicas across different datacenters, connected via asynchronous replication. The Primary Cluster handles writes; Replica Clusters provide disaster recovery and local reads.
// 创建 ClusterSet / Create ClusterSet
mysql-js> var cs = cluster.createClusterSet('myClusterSet')
// 在远端 DC 创建 Replica Cluster / Create Replica Cluster in remote DC
mysql-js> var replicaCluster = cs.createReplicaCluster(
'root@dc2-node1:3306', 'dc2Cluster')
mysql-js> replicaCluster.addInstance('root@dc2-node2:3306')
mysql-js> replicaCluster.addInstance('root@dc2-node3:3306')
// 灾难切换:提升 Replica Cluster 为 Primary / DR switchover
mysql-js> cs.setPrimaryCluster('dc2Cluster')
// 查看 ClusterSet 状态 / View ClusterSet status
mysql-js> cs.status({extended: 1})
12. MySQL Router
MySQL Router is a lightweight middleware proxy, typically deployed on the same host as the application server. It automatically retrieves topology information from InnoDB Cluster metadata and routes read/write requests to the correct nodes.
12.1 Bootstrapping Router
# 引导 Router(自动生成配置)/ Bootstrap Router (auto-generates config)
$ mysqlrouter --bootstrap root@node1:3306 \
--directory /opt/mysqlrouter \
--conf-use-sockets \
--account routeruser
# 启动 / Start
$ /opt/mysqlrouter/start.sh
# 默认监听端口 / Default listening ports:
# 6446 — Read/Write (Classic protocol) → 路由到 Primary
# 6447 — Read-Only (Classic protocol) → 轮询到 Secondary
# 6448 — Read/Write (X protocol)
# 6449 — Read-Only (X protocol)
# 8443 — REST API
12.2 Router Configuration Details
# /opt/mysqlrouter/mysqlrouter.conf(自动生成,可调整)
[routing:myCluster_rw]
bind_address = 0.0.0.0
bind_port = 6446
destinations = metadata-cache://myCluster/?role=PRIMARY
routing_strategy = first-available
protocol = classic
max_connections = 1024
connection_sharing = 1 # MySQL 8.0.33+ 连接共享
[routing:myCluster_ro]
bind_address = 0.0.0.0
bind_port = 6447
destinations = metadata-cache://myCluster/?role=SECONDARY
routing_strategy = round-robin-with-fallback
protocol = classic
max_connections = 2048
[metadata_cache:myCluster]
cluster_type = gr
router_id = 1
user = routeruser
metadata_cluster = myCluster
ttl = 0.5 # 元数据刷新间隔(秒)
auth_cache_ttl = 300
auth_cache_refresh_interval = 2
12.3 Router Monitoring
# REST API 监控 / REST API monitoring
$ curl -s http://localhost:8443/api/20190715/routes | python3 -m json.tool
# 查看活跃连接 / View active connections
$ curl -s http://localhost:8443/api/20190715/routes/myCluster_rw/connections
# 查看后端健康状态 / View backend health
$ curl -s http://localhost:8443/api/20190715/routes/myCluster_rw/destinations
13. Read-Write Splitting Patterns
Read-write splitting is the core strategy for leveraging replication to boost read throughput. Depending on the implementation layer, there are several patterns:
13.1 Implementation Pattern Comparison
| Pattern | Implementation | Advantages | Disadvantages |
|---|---|---|---|
| Application Layer | Separate read/write datasources in code | Most flexible, precise control over which queries go to replicas | Invasive, requires code changes |
| Framework Layer | Spring AbstractRoutingDataSource, Django db router |
Transparent to business code | Reads within transactions must go to primary |
| Proxy Layer | ProxySQL / MySQL Router / MaxScale | Fully transparent, supports query rewrite and caching | Adds one network hop latency |
| Driver Layer | MySQL Connector/J jdbc:mysql:replication:// |
Zero code changes, driver auto-routes | Language/driver specific |
13.2 ProxySQL Read-Write Splitting Configuration
-- ProxySQL Admin 接口(默认端口 6032)/ ProxySQL Admin (default port 6032)
-- 添加后端服务器 / Add backend servers
INSERT INTO mysql_servers (hostgroup_id, hostname, port, max_connections, weight)
VALUES
(10, 'source', 3306, 200, 1000), -- HG 10: 写组 / Write group
(20, 'replica1', 3306, 200, 500), -- HG 20: 读组 / Read group
(20, 'replica2', 3306, 200, 500);
-- 配置用户 / Configure user
INSERT INTO mysql_users (username, password, default_hostgroup)
VALUES ('app', 'AppPassword!', 10);
-- 读写分离规则 / Read-write splitting rules
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup, apply)
VALUES
(1, 1, '^SELECT .* FOR UPDATE', 10, 1), -- FOR UPDATE → 写组
(2, 1, '^SELECT', 20, 1), -- 普通 SELECT → 读组
(3, 1, '^(INSERT|UPDATE|DELETE|REPLACE)', 10, 1); -- 写操作 → 写组
-- 配置复制延迟检查 / Configure replication lag check
UPDATE mysql_servers SET max_replication_lag = 1
WHERE hostgroup_id = 20; -- 延迟 >1s 的从库自动摘除 / Auto-remove replicas with >1s lag
-- 加载配置到运行时 / Load config to runtime
LOAD MYSQL SERVERS TO RUNTIME;
LOAD MYSQL USERS TO RUNTIME;
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;
SAVE MYSQL USERS TO DISK;
SAVE MYSQL QUERY RULES TO DISK;
13.3 Consistency Challenges in R/W Splitting
The core challenge of R/W splitting is read consistency issues caused by replication lag: after a user writes data and immediately reads it, they may get stale data from a lagging replica ("read-your-writes inconsistency").
Solutions
| Strategy | Implementation | Use Case |
|---|---|---|
| Read-after-write from Primary | Force reads to Primary for N seconds after a write | Simple, sufficient for most cases |
| GTID Wait | Record GTID after write, call WAIT_FOR_EXECUTED_GTID_SET() before reading from replica |
Precise, minimizes unnecessary waits |
| Causal Consistency (MySQL Router 8.0.27+) | Router auto-tracks write GTID, routes to replicas that have applied it | Zero code changes, integrates with InnoDB Cluster |
-- GTID 等待示例 / GTID wait example
-- 写入后,获取当前 GTID 位点 / After write, get current GTID position
-- (应用代码记住这个值 / application code remembers this value)
SELECT @@global.gtid_executed AS write_gtid;
-- 在从库上读取前,等待 GTID 回放到位 / On replica, wait for GTID before reading
SELECT WAIT_FOR_EXECUTED_GTID_SET('3E11FA47-...:1-12345', 1);
-- 第二个参数是超时秒数。返回 0 表示成功,1 表示超时
-- Second parameter is timeout in seconds. Returns 0 on success, 1 on timeout
-- 超时则 fallback 到主库读取 / On timeout, fall back to reading from primary
14. Production Configuration Templates
14.1 Source (Primary) Complete Configuration
# /etc/my.cnf — Source (Primary) 生产配置
# Production configuration for Source (Primary)
[mysqld]
# === 基本标识 / Identity ===
server-id = 1
report-host = source.example.com
report-port = 3306
# === Binary Log / GTID ===
log-bin = /var/lib/mysql/binlog/mysql-bin
binlog-format = ROW
binlog-row-image = FULL
binlog-expire-logs-seconds = 604800 # 7 days
max-binlog-size = 256M
sync-binlog = 1 # 每次提交 fsync / fsync on every commit
gtid-mode = ON
enforce-gtid-consistency = ON
log-replica-updates = ON
binlog-checksum = CRC32
binlog-transaction-compression = ON # 8.0.20+
# === 并行复制优化 / Parallel replication optimization ===
binlog-transaction-dependency-tracking = WRITESET
transaction-write-set-extraction = XXHASH64
binlog-group-commit-sync-delay = 100 # μs
binlog-group-commit-sync-no-delay-count = 20
# === 半同步复制 / Semi-sync replication ===
plugin-load-add = semisync_source.so
rpl-semi-sync-source-enabled = 1
rpl-semi-sync-source-timeout = 1000 # ms
rpl-semi-sync-source-wait-point = AFTER_SYNC
rpl-semi-sync-source-wait-for-replica-count = 1
# === InnoDB 持久化 / InnoDB durability ===
innodb-flush-log-at-trx-commit = 1 # 双1 / dual-1
innodb-flush-method = O_DIRECT
# === 连接与安全 / Connection & security ===
max-connections = 500
require-secure-transport = ON
14.2 Replica Complete Configuration
# /etc/my.cnf — Replica 生产配置
# Production configuration for Replica
[mysqld]
# === 基本标识 / Identity ===
server-id = 2
report-host = replica1.example.com
# === Binary Log / GTID ===
log-bin = /var/lib/mysql/binlog/mysql-bin
binlog-format = ROW
gtid-mode = ON
enforce-gtid-consistency = ON
log-replica-updates = ON
# === Relay Log ===
relay-log = /var/lib/mysql/relaylog/relay-bin
relay-log-recovery = ON # 崩溃恢复自动清理 / Auto cleanup on crash recovery
relay-log-info-repository = TABLE
master-info-repository = TABLE
# === 并行复制 / Parallel replication ===
replica-parallel-workers = 16
replica-parallel-type = LOGICAL_CLOCK
replica-preserve-commit-order = ON
# === 只读 / Read-only ===
read-only = ON
super-read-only = ON
# === 半同步复制 / Semi-sync replication ===
plugin-load-add = semisync_replica.so
rpl-semi-sync-replica-enabled = 1
# === InnoDB(从库可放宽)/ InnoDB (relaxed for replica) ===
innodb-flush-log-at-trx-commit = 2 # 放宽以提升性能 / Relaxed for performance
sync-binlog = 0 # 从库不需要 fsync binlog
innodb-flush-method = O_DIRECT
# === 复制延迟保护 / Lag protection ===
slave-net-timeout = 30 # Source 无响应超时 / Source unresponsive timeout
14.3 Key Monitoring Metrics Summary
| Metric | Source | Healthy Threshold | Alert Action |
|---|---|---|---|
| Replication Lag | pt-heartbeat / perf_schema | < 1s | Check parallel repl config, large txns, I/O |
| Replica I/O Thread | SHOW REPLICA STATUS |
Running: Yes | Check network and auth |
| Replica SQL Thread | SHOW REPLICA STATUS |
Running: Yes | Check Last_Error, fix and START REPLICA |
| Semi-Sync Status | Rpl_semi_sync_source_status |
ON | Check Replica connection and ACK timeout |
| Semi-Sync No Tx | Rpl_semi_sync_source_no_tx |
0 (or trend not growing) | Check network latency and timeout settings |
| Errant GTIDs | GTID_SUBTRACT() |
Empty set | Fix immediately! May cause inconsistency after failover |
| Binlog Disk Usage | SHOW BINARY LOGS |
Adequate disk space | Adjust expire_logs or add disk |
| GR Member Status | replication_group_members |
All nodes ONLINE | Check offline node's network and logs |
-- Prometheus mysqld_exporter 复制相关指标 / Replication-related metrics
-- 推荐的 Grafana 告警规则 / Recommended Grafana alert rules:
# 复制延迟告警 / Replication lag alert
ALERT MySQLReplicaLag
IF mysql_slave_status_seconds_behind_master > 5
FOR 2m
LABELS { severity = "warning" }
ANNOTATIONS { summary = "MySQL replica lag > 5s on $labels.instance "}}" }
# 复制断开告警 / Replication broken alert
ALERT MySQLReplicaBroken
IF mysql_slave_status_slave_sql_running == 0
OR mysql_slave_status_slave_io_running == 0
FOR 1m
LABELS { severity = "critical" }
ANNOTATIONS { summary = "MySQL replication broken on $labels.instance "}}" }
# 半同步退化告警 / Semi-sync fallback alert
ALERT MySQLSemiSyncOff
IF mysql_global_status_rpl_semi_sync_master_status == 0
FOR 1m
LABELS { severity = "critical" }
ANNOTATIONS { summary = "Semi-sync disabled on $labels.instance "}}" }
15. Source Code Reading Guide
Reading the source code is the most authoritative way to understand MySQL replication internals. Below is a navigation map of key source files and entry functions:
15.1 Binary Log Write Path
| File | Key Function | Purpose |
|---|---|---|
sql/binlog.cc |
MYSQL_BIN_LOG::ordered_commit() |
Core implementation of Group Commit three stages (FLUSH/SYNC/COMMIT) |
sql/binlog.cc |
MYSQL_BIN_LOG::write_event() |
Write a single binlog event to the binlog file |
sql/log_event.cc |
Write_rows_log_event::write_row() |
Encoding of row data events in ROW format |
sql/binlog.cc |
MYSQL_BIN_LOG::new_file_impl() |
Binlog file rotation logic |
sql/rpl_gtid_persist.cc |
Gtid_table_persistor::save() |
Persist GTID set to mysql.gtid_executed table |
15.2 Replica-Side Replication Threads
| File | Key Function | Purpose |
|---|---|---|
sql/rpl_replica.cc |
handle_slave_io() |
I/O Receiver thread main loop: connect to Source, receive binlog, write relay log |
sql/rpl_replica.cc |
handle_slave_sql() |
SQL Applier thread main loop (single-threaded mode) |
sql/rpl_rli_pdb.cc |
Slave_worker::slave_worker_exec_event() |
Parallel replication worker thread executing a single event |
sql/rpl_replica.cc |
queue_event() |
I/O thread writes received events to relay log |
sql/rpl_rli.cc |
apply_event_and_update_pos() |
Apply event and update position info |
sql/rpl_replica.cc |
slave_start_workers() |
Start parallel replication coordinator and worker threads |
15.3 Binlog Dump Thread (Source Side)
| File | Key Function | Purpose |
|---|---|---|
sql/rpl_binlog_sender.cc |
Binlog_sender::run() |
Binlog dump thread main loop |
sql/rpl_binlog_sender.cc |
Binlog_sender::send_events() |
Read events from binlog and send to Replica |
sql/rpl_binlog_sender.cc |
Binlog_sender::send_heartbeat_event() |
Send heartbeat events (prevent connection timeout) |
15.4 Semi-Synchronous Replication
| File | Key Function | Purpose |
|---|---|---|
plugin/semisync/semisync_source.cc |
ReplSemiSyncMaster::commitTrx() |
Core logic for Source waiting for Replica ACK |
plugin/semisync/semisync_source.cc |
ReplSemiSyncMaster::readSlaveReply() |
Read Replica's ACK response |
plugin/semisync/semisync_replica.cc |
ReplSemiSyncSlave::slaveReply() |
Replica sends ACK to Source |
15.5 Group Replication / Certification
| File | Key Function | Purpose |
|---|---|---|
plugin/group_replication/src/certifier.cc |
Certifier::certify() |
Core algorithm for transaction conflict detection |
plugin/group_replication/src/gcs_operations.cc |
Gcs_operations::send_message() |
Broadcast messages via XCom protocol |
plugin/group_replication/src/applier.cc |
Applier_module::apply_action_packet() |
Apply consensus-passed transactions |
plugin/group_replication/src/member_info.cc |
Group_member_info::update_member_status() |
Member status change handling (ONLINE, RECOVERING, OFFLINE, etc.) |
16. Replication Troubleshooting Cookbook
16.1 Common Error Codes and Fixes
| Error Code | Meaning | Typical Cause | Fix |
|---|---|---|---|
| 1062 | Duplicate entry | Row already exists on Replica (possibly a DELETE was skipped previously) | Fix with pt-table-sync, or skip the GTID |
| 1032 | Row not found | Row missing on Replica (possibly an INSERT was skipped previously) | Fix data differences with pt-table-sync |
| 1146 | Table doesn't exist | Table missing on Replica (replication filter misconfiguration) | Export table structure from Source and create on Replica |
| 1236 | Cannot read binlog | Binlog on Source has been purged or corrupted | Rebuild Replica from backup |
| 1205 | Lock wait timeout | SQL Applier on Replica waiting for lock timeout (long query running on Replica) | Kill long transactions on Replica, or set slave_transaction_retries |
| 1756 | Cross-engine transaction violates GTID consistency | Transaction mixes InnoDB and MyISAM | Migrate all tables to InnoDB |
16.2 Useful mysqlbinlog Commands
# 查看 binlog 事件摘要 / View binlog event summary
mysqlbinlog --no-defaults --base64-output=DECODE-ROWS --verbose \
mysql-bin.000042 | head -100
# 按时间范围过滤 / Filter by time range
mysqlbinlog --start-datetime="2026-04-20 10:00:00" \
--stop-datetime="2026-04-20 10:30:00" \
mysql-bin.000042
# 按 GTID 过滤 / Filter by GTID
mysqlbinlog --include-gtids="3E11FA47-...:100-110" \
mysql-bin.000042
# 排除特定 GTID(用于恢复时跳过某些事务)/ Exclude specific GTIDs
mysqlbinlog --exclude-gtids="3E11FA47-...:105" \
mysql-bin.000042 | mysql -u root -p
# 查看远程 binlog / View remote binlog
mysqlbinlog --read-from-remote-server \
--host=source_host --port=3306 \
--user=repl --password \
mysql-bin.000042
# 解析 ROW 格式的数据变更 / Parse ROW format data changes
mysqlbinlog --base64-output=DECODE-ROWS --verbose --verbose \
mysql-bin.000042 | grep -A5 "### UPDATE"
# 双 --verbose 会显示列类型信息
# Double --verbose shows column type information
16.3 Data Consistency Verification
# 使用 pt-table-checksum 校验一致性 / Verify consistency with pt-table-checksum
# 在 Source 上运行,会自动计算和比对所有 Replica 的数据校验和
# Run on Source, auto-computes and compares checksums across all Replicas
pt-table-checksum \
--replicate=percona.checksums \
--databases=app_db \
--tables=orders,users,products \
--chunk-size=5000 \
--max-lag=2 \
-h source_host -u monitor -p
# 查看差异报告 / View differences report
pt-table-checksum --replicate=percona.checksums --replicate-check-only \
-h source_host -u monitor -p
# 使用 pt-table-sync 修复差异 / Fix differences with pt-table-sync
# 先 dry-run 查看会做什么 / First dry-run to see what would happen
pt-table-sync --print \
--replicate=percona.checksums \
h=source_host,u=monitor,p=xxx
# 确认无误后执行 / Execute after confirmation
pt-table-sync --execute \
--replicate=percona.checksums \
h=source_host,u=monitor,p=xxx
16.4 Replication Topology Visualization
# 使用 Orchestrator 可视化复制拓扑 / Visualize topology with Orchestrator
# Orchestrator 提供 Web UI 和 CLI
# CLI: 查看拓扑 / CLI: View topology
orchestrator-client -c topology -i source_host:3306
# 输出 / Output:
# source_host:3306 [OK,5.7.42,rw,ROW,>>,GTID]
# + replica1:3306 [OK,5.7.42,ro,ROW,>>,GTID]
# + replica2:3306 [OK,5.7.42,ro,ROW,>>,GTID]
# + replica3:3306 [OK,5.7.42,ro,ROW,>>,GTID] (delayed: 3600s)
# CLI: 手动切换 / CLI: Manual switchover
orchestrator-client -c graceful-master-takeover \
-i source_host:3306 -d replica1:3306
# 或使用 MySQL Shell 查看 InnoDB Cluster 拓扑
# Or use MySQL Shell for InnoDB Cluster topology
mysqlsh -- cluster status --extended=1
16.5 Replication Performance Tuning Checklist
| # | Check Item | Recommended Value | Impact |
|---|---|---|---|
| 1 | replica_parallel_workers |
8-16 | Parallel transaction replay, reduces lag |
| 2 | replica_parallel_type |
LOGICAL_CLOCK | Finer-grained parallelism than DATABASE |
| 3 | binlog_transaction_dependency_tracking |
WRITESET | Enables parallelism even with low Source concurrency |
| 4 | binlog_group_commit_sync_delay |
100-1000 (us) | Larger Group Commit batches improve Replica parallelism |
| 5 | binlog_transaction_compression |
ON (8.0.20+) | Reduces network transfer and disk I/O by 60-70% |
| 6 | Replica Disk | NVMe SSD | I/O is the most common Replica bottleneck |
| 7 | innodb_flush_log_at_trx_commit (Replica) |
2 | Relax durability on Replica for performance |
| 8 | sync_binlog (Replica) |
0 | Replica doesn't need to fsync binlog on every commit |
| 9 | Avoid Large Transactions | ≤ 5000 rows per batch | Large transactions can't parallelize and block other replays |
| 10 | slave_transaction_retries |
10 | Auto-retry transient errors (deadlocks) |
16.6 Replication Health Check Script
-- 一键复制健康检查 / One-click replication health check
-- 1. 基本复制状态 / Basic replication status
SELECT
CHANNEL_NAME,
SERVICE_STATE AS io_state,
LAST_ERROR_NUMBER AS io_err
FROM performance_schema.replication_connection_status;
-- 2. SQL Applier 状态 / SQL Applier status
SELECT
CHANNEL_NAME,
SERVICE_STATE AS sql_state,
LAST_ERROR_NUMBER AS sql_err,
LAST_ERROR_MESSAGE AS sql_err_msg
FROM performance_schema.replication_applier_status;
-- 3. Worker 线程状态 / Worker thread status
SELECT
CHANNEL_NAME,
WORKER_ID,
LAST_APPLIED_TRANSACTION,
APPLYING_TRANSACTION,
LAST_ERROR_NUMBER
FROM performance_schema.replication_applier_status_by_worker
WHERE LAST_ERROR_NUMBER != 0;
-- 4. 复制延迟精确计算 / Precise replication lag
SELECT
CHANNEL_NAME,
TIMESTAMPDIFF(SECOND,
LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP
) AS apply_lag_seconds,
TIMESTAMPDIFF(SECOND,
LAST_QUEUED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
LAST_QUEUED_TRANSACTION_END_QUEUE_TIMESTAMP
) AS transport_lag_seconds
FROM performance_schema.replication_applier_status_by_worker
WHERE LAST_APPLIED_TRANSACTION != ''
ORDER BY apply_lag_seconds DESC LIMIT 1;
-- 5. Group Replication 成员状态 / GR member status
SELECT
MEMBER_HOST,
MEMBER_PORT,
MEMBER_STATE,
MEMBER_ROLE,
MEMBER_VERSION
FROM performance_schema.replication_group_members;
-- 6. Group Replication 统计 / GR statistics
SELECT
MEMBER_ID,
COUNT_TRANSACTIONS_IN_QUEUE AS cert_queue,
COUNT_TRANSACTIONS_CHECKED AS certified,
COUNT_CONFLICTS_DETECTED AS conflicts,
COUNT_TRANSACTIONS_ROWS_VALIDATING AS certifier_rows,
TRANSACTIONS_COMMITTED_ALL_MEMBERS AS applied_gtids
FROM performance_schema.replication_group_member_stats;
17. Frequently Asked Questions
Q1: Can I revert from GTID to traditional position-based replication?
Yes, but it requires a specific sequence. MySQL supports online GTID mode switching (no downtime required): 1. All nodes: SET GLOBAL gtid_mode = ON_PERMISSIVE (allow mixed) 2. All nodes: SET GLOBAL gtid_mode = OFF_PERMISSIVE (stop generating GTIDs) 3. Wait until all Replicas' Ongoing_anonymous_transaction_count reaches 0 4. All nodes: SET GLOBAL gtid_mode = OFF 5. Reconfigure Replicas for position-based replication However, reverting is strongly discouraged — GTID is superior in every aspect.
Q2: Replication suddenly broke — how to quickly diagnose?
Follow these diagnostic steps: 1. SHOW REPLICA STATUS\G — check Last_IO_Error and Last_SQL_Error 2. I/O thread stopped: usually network issues, authentication failure, or source binlog purged. Check if source is reachable, replication user permissions are correct, SHOW BINARY LOGS to confirm needed binlog still exists 3. SQL thread stopped: usually data conflicts. Common errors: 1062 (duplicate entry), 1032 (row not found), 1146 (table doesn't exist) 4. Check Replica's error log (SHOW VARIABLES LIKE 'log_error') for details 5. Use mysqlbinlog to parse the failing binlog event to understand what operation failed
Q3: How to upgrade from async to semi-sync replication without downtime?
Semi-sync can be enabled online without downtime: 1. Install and enable semi-sync plugin on all Replicas first 2. Restart Replica I/O threads: STOP REPLICA IO_THREAD; START REPLICA IO_THREAD; 3. Install and enable semi-sync plugin on Source 4. Verify: SHOW STATUS LIKE 'Rpl_semi_sync_source_clients' should show connected semi-sync replica count Order matters: Replica first, then Source. If Source is enabled first, it will timeout and fall back to async because no semi-sync replicas are connected.
Q4: How to choose between Group Replication and traditional semi-sync?
The choice depends on your requirements: - Choose Group Replication / InnoDB Cluster: need automatic failover, don't want external tools, <= 9 nodes, same datacenter or low-latency network, can accept the PK-required constraint - Choose semi-sync + external tools: more than 9 nodes, cross-high-latency datacenters, have MyISAM or PK-less tables, already have mature tooling like Orchestrator/MHA, need more flexible topologies (cascading replication, etc.) The trend is that MySQL officially recommends InnoDB Cluster as the standard HA solution, but traditional approaches remain widely used in many large internet companies.
Q5: How to handle error 1032 (row not found) during replication?
Error 1032 means the Replica cannot find the corresponding row when executing UPDATE or DELETE. This usually means Source and Replica data are already inconsistent. Resolution steps: 1. Short-term fix: skip the transaction (inject empty transaction in GTID mode) 2. Root cause analysis: check if someone wrote directly to the Replica, if transactions were previously skipped, if there are errant GTIDs 3. Consistency verification: use pt-table-checksum to check data differences between Source and Replica 4. Data repair: use pt-table-sync to sync inconsistent data from Source to Replica 5. Prevention: ensure super_read_only=ON, regularly run pt-table-checksum
Q6: What to do when binlog takes too much disk space?
Multiple approaches: - Adjust retention: binlog_expire_logs_seconds (default 2592000 = 30 days). Set based on backup strategy: if daily full backup + binlog backup, shorten to 3-7 days - Manual purge: PURGE BINARY LOGS BEFORE '2026-04-15 00:00:00', but first confirm all replicas have processed these binlogs - Enable compression: binlog_transaction_compression=ON (MySQL 8.0.20+), can reduce size by 60-70% - Use MINIMAL row image: binlog_row_image=MINIMAL reduces ROW format data volume (but limits flashback capability) - Separate disk: place binlog on a separate disk/partition to avoid impacting the data directory
Q7: How to set up delayed replication for data recovery?
Delayed replication intentionally keeps a Replica behind the Source by a set amount of time, serving as a "time machine" — if data is accidentally deleted on the Source, you can recover it from the delayed replica before it applies that operation. CHANGE REPLICATION SOURCE TO SOURCE_DELAY = 3600; -- 1 hour delay START REPLICA; Recovery process: after discovering the mistake, immediately STOP REPLICA, then use START REPLICA UNTIL SQL_BEFORE_GTIDS = '...' to replay precisely up to the GTID before the mistake, then export data from this delayed replica for recovery.
Q8: What value should replica_parallel_workers be set to?
There's no universal formula, but here's experiential guidance: - Starting value: half to equal CPU core count (e.g., 8-16 for a 16-core machine) - Observe metrics: check load balance across workers via performance_schema.replication_applier_status_by_worker - More is not always better: coordination overhead between workers means too many (e.g., 128) may reduce throughput due to lock contention - Pair with Source optimization: increasing binlog_group_commit_sync_delay to allow more transactions to parallelize is more effective than just adding workers - Real bottleneck: often the bottleneck isn't thread count but disk I/O or large transactions
Q9: What if MySQL Router goes down? Isn't it a single point of failure?
The recommended deployment for MySQL Router is on the same host as the application (sidecar pattern). This way: - If Router dies = that application instance has issues, load balancer auto-removes it - No additional Router HA mechanism needed - Application connects to Router via localhost, zero network latency If Router must be deployed as a standalone service, add a VIP (Keepalived) or load balancer in front of it, but this pattern is not recommended.
Q10: How to smoothly migrate from traditional replication to InnoDB Cluster?
Migration path: 1. Prerequisites: ensure all tables use InnoDB with primary keys; enable GTID (if not already); binlog format = ROW 2. Upgrade to MySQL 8.0+ (if still on 5.7) 3. Validate on existing Replicas: dba.checkInstanceConfiguration() for each instance 4. Fix all configuration issues with dba.configureInstance() 5. Create Cluster on Source: dba.createCluster('myCluster', {adoptFromGR: false}) 6. Add Replicas one by one: stop replication first, then cluster.addInstance() 7. Deploy MySQL Router and gradually shift traffic The entire process can be done online, but the final switch steps should be performed during a maintenance window.
[← Prev: Locks & Deadlocks](/books/high-performance-mysql/lock-deadlock)
[Back to Contents](/books/high-performance-mysql)
[Next: Query Optimization →](/books/high-performance-mysql/query-optimization)