Chapter 29
CDC Data Pipeline
CDC: Change Data Capture
CDC captures every row-level change from a database's transaction log and streams it as events to downstream systems in near real-time — the core infrastructure for real-time data pipelines, heterogeneous sync, and microservice decoupling.
What is CDC
| Method | Latency | Capture DELETE | DB Load | Complexity |
|---|---|---|---|---|
| Full poll | Minutes | No | High | Low |
| Timestamp poll | Seconds | No | Medium | Low |
| Triggers | ms | Yes | High (write amp) | High |
| CDC (log parsing) | ms | Yes | Minimal | Medium |
MySQL Binlog Requirements
server_id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_format = ROW # required for CDC
binlog_row_image = FULL # capture full before/after images
gtid_mode = ON
enforce_gtid_consistency = ON
Debezium Quickstart
# Grant CDC permissions
CREATE USER 'debezium'@'%' IDENTIFIED BY 'pass';
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE,
REPLICATION CLIENT ON *.* TO 'debezium'@'%';
# Register connector via Kafka Connect REST API
curl -X POST http://localhost:8083/connectors -H 'Content-Type: application/json' -d '{
"name": "mysql-connector",
"config": {
"connector.class": "io.debezium.connector.mysql.MySqlConnector",
"database.hostname": "mysql", "database.port": "3306",
"database.user": "debezium", "database.password": "pass",
"database.server.id": "184054",
"topic.prefix": "dbserver",
"database.include.list": "mydb",
"database.history.kafka.bootstrap.servers": "kafka:9092",
"database.history.kafka.topic": "schema-changes.mydb"
}
}'
Each event contains: before, after, op (c/u/d/r), ts_ms, and source (gtid, file, pos).
Full Kafka Pipeline
MySQL → Debezium → Kafka → Flink CDC → ClickHouse (OLAP)
→ Elasticsearch (search)
→ Redis (cache invalidation)
Exactly-Once Delivery
CDC natively provides at-least-once delivery. Make consumers idempotent by tracking processed GTID sets. Use Schema Registry with Avro to handle DDL evolution gracefully.