Chapter 31

Cross-Cluster Replication and Disaster Recovery

Why Single-Cluster Deployments Hit a Wall

Production Kafka deployments quickly encounter scenarios that a single cluster cannot address:

Geo-redundancy: A data center-level failure (power loss, network cut, physical disaster) takes the entire cluster offline. Business continuity requires failing over to a standby cluster within seconds to minutes.
Data aggregation: Edge nodes — factory sensors, retail stores, CDN points of presence — generate data locally and need to feed it into a central cluster for unified analysis.
Cluster migration: Moving to new hardware, a new cloud provider, or a new Kafka version requires a zero-downtime migration path.
Compliance and data sovereignty: Regulations like GDPR (EU) or the Chinese Data Security Law mandate that certain data not leave a geographic boundary, while business operations need cross-region access.

Cross-cluster replication is the foundational technology for addressing all of these. This chapter focuses on MirrorMaker 2 (MM2), the Apache open-source replication engine built into Kafka since version 2.4, covering its architecture, three major deployment topologies, and the precise operational steps for disaster recovery failover.

MirrorMaker 2 Architecture

Built on Kafka Connect

MirrorMaker 1 (pre-Kafka 2.4) was a simple standalone process: multiple consumer threads reading from the source cluster, each acting as a producer writing to the destination cluster. Its fundamental problems were statelessness, inability to translate consumer group offsets, no topic metadata synchronization, and poor scalability.

MM2 is a complete architectural redesign built on the Kafka Connect framework:

MirrorSourceConnector: Runs against the source cluster as a consumer, writes data to the destination cluster. This is the primary data-movement connector.
MirrorCheckpointConnector: Periodically reads consumer group offsets from the source cluster, translates them to their destination-cluster equivalents, and writes the mapping to a checkpoint topic.
MirrorHeartbeatConnector: Emits periodic heartbeat records to the destination cluster so you can measure end-to-end replication latency.
MirrorSinkConnector: An alternative sink-side connector (less commonly used in practice).

By building on Kafka Connect's distributed mode, MM2 inherits horizontal scalability (add worker nodes to increase throughput), automatic task assignment and rebalancing, and built-in fault tolerance.

Topic Naming: Source Cluster Alias Prefix

MM2 prefixes destination topic names with the source cluster alias. This is not cosmetic — it is a fundamental mechanism for preventing replication loops in bidirectional (Active-Active) setups:

Source cluster alias: us-east
Source topic: orders
Destination topic: us-east.orders

MM2 recognizes that us-east.orders already originates from us-east and will not replicate it back, breaking any potential loop.

MM2 also automatically synchronizes:

Topic configurations (retention.ms, cleanup.policy, compression.type, etc.)
Partition count (MM2 will increase destination partition count to match the source; it will never decrease it)
It does not synchronize consumer group offsets directly — that requires the Checkpoint mechanism

Offset Translation Mechanism

Source cluster offsets and destination cluster offsets are independent. Source orders partition 0 at offset 100 may correspond to destination us-east.orders partition 0 at offset 95 (some messages may have been compacted or the destination started at a different point).

MM2 maintains this mapping through its Checkpoint mechanism:

Internal topic: us-east.checkpoints.internal (on destination cluster)
Record format:
  Key:   <consumer-group-id> | <topic> | <partition>
  Value: <source-offset> | <destination-offset> | <leader-epoch> | <metadata>

During failover, the RemoteClusterUtils API reads this mapping to translate consumer group offsets from source to destination equivalents, enabling consumers to resume from exactly where they left off.

Complete MM2 Deployment Configuration

Standalone Process Deployment

# mm2.properties — MirrorMaker 2 configuration
# Define cluster aliases and their connection details
clusters = us-east, us-west

us-east.bootstrap.servers = kafka-east-1:9092,kafka-east-2:9092,kafka-east-3:9092
us-west.bootstrap.servers = kafka-west-1:9092,kafka-west-2:9092,kafka-west-3:9092

# Replication direction: us-east → us-west (Active-Passive)
us-east->us-west.enabled = true
us-west->us-east.enabled = false    # Disable for Active-Passive

# Topic filtering (regex)
us-east->us-west.topics = orders.*, payments.*, inventory.*
us-east->us-west.topics.blacklist = .*\.internal, __consumer_offsets

# Consumer group offset synchronization
us-east->us-west.groups = payment-consumer-group, order-processor-group
us-east->us-west.groups.blacklist = console-consumer-.*

# Enable checkpoint-based offset sync
us-east->us-west.sync.group.offsets.enabled = true
us-east->us-west.sync.group.offsets.interval.seconds = 60

# Heartbeat monitoring
us-east->us-west.emit.heartbeats.enabled = true
us-east->us-west.emit.heartbeats.interval.seconds = 5

# Replication factor for replicated topics on destination
us-east->us-west.replication.factor = 3

# Topic naming policy (DefaultReplicationPolicy adds source alias prefix)
# This is critical for Active-Active to prevent replication loops
replication.policy.class = org.apache.kafka.connect.mirror.DefaultReplicationPolicy

# MM2 internal topics (stored on destination cluster)
checkpoints.topic.replication.factor = 3
heartbeats.topic.replication.factor = 3
offset-syncs.topic.replication.factor = 3

# Kafka Connect distributed mode storage topics
config.storage.topic = mm2-configs
offset.storage.topic = mm2-offsets
status.storage.topic = mm2-status
config.storage.replication.factor = 3
offset.storage.replication.factor = 3
status.storage.replication.factor = 3

Start MM2:

connect-mirror-maker.sh mm2.properties

Kubernetes Deployment

# kubernetes/mm2-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mirrormaker2
  namespace: kafka
spec:
  replicas: 3    # 3 workers for load distribution and fault tolerance
  selector:
    matchLabels:
      app: mirrormaker2
  template:
    metadata:
      labels:
        app: mirrormaker2
    spec:
      containers:
        - name: mirrormaker2
          image: apache/kafka:3.7.0
          command:
            - connect-mirror-maker.sh
            - /etc/mm2/mm2.properties
          env:
            - name: KAFKA_HEAP_OPTS
              value: "-Xms512m -Xmx1g"
          volumeMounts:
            - name: mm2-config
              mountPath: /etc/mm2
              readOnly: true
          resources:
            requests:
              cpu: "1"
              memory: "2Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
          livenessProbe:
            httpGet:
              path: /connectors
              port: 8083
            initialDelaySeconds: 60
            periodSeconds: 30
      volumes:
        - name: mm2-config
          configMap:
            name: mm2-config

Monitoring replication lag:

# Check that heartbeats are being received on the destination cluster
kafka-console-consumer.sh \
  --bootstrap-server kafka-west-1:9092 \
  --topic us-east.heartbeats \
  --property print.timestamp=true \
  --max-messages 5

# Check MM2 replication lag via JMX
kafka-jmx.sh \
  --object-name "kafka.connect.mirror:type=MirrorSourceConnector,target=us-west,topic=orders,partition=0" \
  --attributes replication-latency-ms-avg,record-lag

Three Deployment Topologies

Topology 1: Active-Passive (Hot Standby)

Scenario: One primary data center (US-East) serves all production traffic. A standby data center (US-West) maintains a continuously updated replica. When the primary fails, traffic is manually redirected to the standby.

US-East (Primary)              US-West (Standby)
┌─────────────────────┐        ┌─────────────────────┐
│  Producers → Kafka  │──MM2──→│  Kafka (replica)    │
│  Consumers ← Kafka  │        │  Ready to activate  │
└─────────────────────┘        └─────────────────────┘

Advantages:

Simple architecture, no conflict resolution required
Standby cluster can serve read-only analytics queries (reducing primary load)
Straightforward operational model

Disadvantages:

Requires manual intervention to trigger failover (limits RTO)
RPO equals MM2's replication lag (typically seconds to low minutes)
Standby cluster resources are largely idle (cost inefficiency)

Configuration:

# Active-Passive is the default direction setting
us-east->us-west.enabled = true
us-west->us-east.enabled = false

Topology 2: Active-Active (Bidirectional)

Scenario: Both data centers simultaneously serve production traffic. Producers in each region write to their local cluster; MM2 replicates data bidirectionally so consumers in either region can access all data.

US-East (Active)               US-West (Active)
┌─────────────────────┐  MM2   ┌─────────────────────┐
│ Producers → Kafka   │───────→│ Kafka → Consumers   │
│ Consumers ← Kafka   │←───────│ Kafka ← Producers   │
└─────────────────────┘  MM2   └─────────────────────┘

Bidirectional configuration:

us-east->us-west.enabled = true
us-west->us-east.enabled = true

# The alias prefix is essential here:
# Producer writes orders to us-east → MM2 replicates as us-east.orders on us-west
# Producer writes orders to us-west → MM2 replicates as us-west.orders on us-east
# us-east.orders on us-west is NOT replicated back to us-east (loop prevented)

The central challenge: message duplicates and conflicts

In Active-Active mode, the same business event can originate in both clusters independently. A user submitting an order while connected to the US-East cluster generates one orders record; a different user submitting an order via US-West generates another. After replication, both clusters contain both records — which is correct. The challenge is when the same user action could reach both clusters (network routing failure, client retry to a different endpoint).

Resolution strategies:

Globally unique message keys: Use UUIDs or composite keys that include a region identifier (e.g., us-east:order-12345). Consumers deduplicate by key.
Last-write-wins semantics: For state updates, include a timestamp in the message and use it to select the most recent value. Requires reliable clock synchronization (NTP within 100ms).
Idempotent consumers: Implement consumer logic that is safe to replay — the same message applied twice produces the same result as applying it once.

Active-Active is appropriate for:

User behavior events (clickstreams, analytics, telemetry) — naturally idempotent
Log aggregation — duplicate log lines are acceptable
Metrics reporting — latest value overwrites previous values

Active-Active is not appropriate for:

Payment processing and financial transactions — any duplicate is a serious incident
Inventory deduction — duplicates cause overselling
Order creation requiring global uniqueness guarantees

Topology 3: Aggregation (Many Edge → Central)

Scenario: Many edge clusters operate independently at the data source (factory floor, retail stores, IoT gateways). Their data flows unidirectionally to a central cluster for unified storage, analytics, and processing.

Factory-A ──MM2──┐
Factory-B ──MM2──┼──→ Central Cluster → Data Lake / Analytics
Factory-C ──MM2──┘

Configuration for the central collector:

# mm2.properties — aggregation topology
clusters = factory-a, factory-b, factory-c, central

factory-a.bootstrap.servers = factory-a-kafka:9092
factory-b.bootstrap.servers = factory-b-kafka:9092
factory-c.bootstrap.servers = factory-c-kafka:9092
central.bootstrap.servers = central-kafka-1:9092,central-kafka-2:9092,central-kafka-3:9092

factory-a->central.enabled = true
factory-b->central.enabled = true
factory-c->central.enabled = true

# No data flows from central back to edge clusters
central->factory-a.enabled = false
central->factory-b.enabled = false
central->factory-c.enabled = false

# Replicate sensor data from all factories
factory-a->central.topics = sensor-data, production-events
factory-b->central.topics = sensor-data, production-events
factory-c->central.topics = sensor-data, production-events

The central cluster receives topics named:

factory-a.sensor-data
factory-b.sensor-data
factory-c.sensor-data

A Kafka Streams application can merge these into a unified stream, or the analytics layer can consume each separately.

Active-Passive Failover Runbook

The following is the standard operational procedure for failing over from a failed primary cluster to the standby cluster.

Step 1: Confirm Primary Failure; Stop Producers

# Attempt to connect to the primary cluster
kafka-broker-api-versions.sh \
  --bootstrap-server kafka-east-1:9092 \
  --command-config client.properties
# If this times out or fails, the cluster is unreachable

# Trigger circuit breaker / feature flag in application configuration
# This stops producers from attempting to write to the primary
# Allow in-flight requests to time out (typically 30-60 seconds)
echo "Primary cluster confirmed unreachable at $(date)"

Step 2: Verify MM2 Has Caught Up

# Check the most recent heartbeat timestamp on the standby cluster
# If heartbeats stopped updating, the primary was completely unavailable
# during that window — records after the last heartbeat may be lost
kafka-console-consumer.sh \
  --bootstrap-server kafka-west-1:9092 \
  --topic us-east.heartbeats \
  --property print.timestamp=true \
  --from-beginning \
  --max-messages 20 | tail -5

# Check record lag on the MM2 connectors
# Zero lag means the standby has caught up to the last record received from primary
kafka-jmx.sh \
  --object-name "kafka.connect.mirror:type=MirrorSourceConnector" \
  --attributes record-lag

Step 3: Translate Consumer Group Offsets

This is the most technically precise step. Consumer offsets from the source cluster cannot be used directly on the destination cluster — they must be translated using the mapping stored in the MM2 checkpoint topic.

Programmatic translation using RemoteClusterUtils (recommended for automation):

import org.apache.kafka.connect.mirror.RemoteClusterUtils;

Map<String, Object> drClusterConfig = new HashMap<>();
drClusterConfig.put("bootstrap.servers", "kafka-west-1:9092");
drClusterConfig.put("security.protocol", "SASL_SSL");
// ... additional auth config

// Translate the consumer group's source offsets to destination equivalents
Map<TopicPartition, OffsetAndMetadata> translatedOffsets =
    RemoteClusterUtils.translateOffsets(
        drClusterConfig,
        "us-east",                  // Source cluster alias (matches mm2.properties)
        "payment-consumer-group",   // Consumer group to translate
        Duration.ofSeconds(30)      // Timeout for reading checkpoint topic
    );

System.out.println("Translated offsets: " + translatedOffsets);

// Commit the translated offsets to the DR cluster
try (AdminClient adminClient = AdminClient.create(drClusterConfig)) {
    adminClient.alterConsumerGroupOffsets(
        "payment-consumer-group",
        translatedOffsets
    ).all().get(30, TimeUnit.SECONDS);
    System.out.println("Offsets committed successfully to DR cluster");
}

Command-line translation for manual failovers:

# Read the latest checkpoint for the consumer group
kafka-console-consumer.sh \
  --bootstrap-server kafka-west-1:9092 \
  --topic us-east.checkpoints.internal \
  --property print.key=true \
  --property print.value=true \
  --property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer \
  --from-beginning \
  --max-messages 10000 \
  | grep "payment-consumer-group" \
  | tail -20

# Apply the translated offsets (adjust partition:offset pairs based on checkpoint output)
kafka-consumer-groups.sh \
  --bootstrap-server kafka-west-1:9092 \
  --group payment-consumer-group \
  --topic us-east.orders:0:95 \    # Partition 0 → offset 95 (from checkpoint)
  --reset-offsets \
  --execute

Step 4: Redirect Producers and Consumers to DR Cluster

# Update application configuration
# Change bootstrap.servers from kafka-east-1:9092 to kafka-west-1:9092
# This can be done via:
#   - Kubernetes ConfigMap + rolling restart
#   - Configuration management system (Consul, etcd)
#   - Environment variable update + pod restart

# Verify consumers are running on the DR cluster from the correct offsets
kafka-consumer-groups.sh \
  --bootstrap-server kafka-west-1:9092 \
  --group payment-consumer-group \
  --describe

# Monitor for any lag increase in the first minutes after failover
watch -n 10 "kafka-consumer-groups.sh \
  --bootstrap-server kafka-west-1:9092 \
  --group payment-consumer-group \
  --describe"

RPO and RTO Analysis

RPO (Recovery Point Objective — maximum data loss):

Equals MM2's replication lag at the moment of primary failure
Under normal network conditions: 1-10 seconds
Under degraded network (cross-region congestion): 30 seconds to a few minutes
If the primary cluster was completely unavailable (not just degraded), RPO equals the gap between the last successful replication and the failure — check the heartbeat timestamps to determine this

RTO (Recovery Time Objective — time to restore service):

Failure confirmation: 2-5 minutes (depends on monitoring alerting and on-call response)
Offset translation: 2-5 minutes (automated) or 10-20 minutes (manual)
Application restart and verification: 2-5 minutes
Total RTO with automation: 5-15 minutes; manual: 15-30 minutes

Confluent Cluster Linking vs MirrorMaker 2

Feature	MirrorMaker 2	Confluent Cluster Linking
License	Apache 2.0 (open source)	Confluent commercial only
Replication latency	Seconds to minutes	Sub-second (typically < 1s)
Implementation	Consumer + Producer	Reads directly from leader (like an internal follower)
Offset preservation	Requires translation via checkpoints	Original offsets preserved — no translation needed
Consumer group sync	Via checkpoint connector, minutes of delay	Automatic, millisecond-level
Topic naming	Default: adds source alias prefix	Optional; supports transparent mirroring
Availability	Any Kafka environment	Confluent Cloud or Confluent Platform

Cluster Linking's offset preservation is transformative for failover operations. The offset translation step — the most complex and error-prone part of MM2-based failover — is completely eliminated. Consumers simply reconnect to the destination cluster and continue from the same offsets. The cost is a Confluent commercial subscription.

For organizations running on Confluent Cloud, Cluster Linking is the clear choice. For open-source Kafka deployments, MM2 remains the standard.

Disaster Recovery Testing: Quarterly Failover Drills

An untested DR plan is not a DR plan — it is a document that creates false confidence. The Netflix approach to resilience engineering offers the right mindset: proactively inject failures before they happen spontaneously, during controlled conditions, so you can learn and improve.

Quarterly Failover Drill Template

Quarterly DR Drill Checklist

Pre-drill (1 week before):
  □ Select drill window (off-peak: Saturday 02:00-05:00 local)
  □ Notify stakeholders (on-call, product team, SRE)
  □ Define scope (which consumer groups will participate)
  □ Confirm MM2 replication is healthy and lag is near zero
  □ Verify DR cluster has sufficient capacity

During drill:
  □ Record start time
  □ Stop selected consumer group on primary
  □ Wait for MM2 to catch up (verify via heartbeats + record-lag)
  □ Execute offset translation (time this step)
  □ Start consumer group on DR cluster
  □ Verify: no messages skipped, no messages duplicated
  □ Record actual RTO (from consumer group stop to verified DR consumption)

Post-drill:
  □ Fail back to primary cluster (reverse the procedure)
  □ Verify primary consumer group resumes correctly
  □ Document: expected RTO vs actual RTO, any friction encountered
  □ Update runbook with lessons learned

Automated Failover Validation Script

#!/bin/bash
# dr-validation.sh — Run after every failover to validate correctness

set -e

DR_CLUSTER="${1:?Usage: dr-validation.sh <dr-bootstrap-servers> <consumer-group>}"
CONSUMER_GROUP="${2:?}"

echo "=== DR Failover Validation ==="
echo "Cluster: $DR_CLUSTER"
echo "Consumer Group: $CONSUMER_GROUP"
echo "Time: $(date -u)"
echo ""

# 1. Verify consumer group is present and active on DR cluster
echo "--- Consumer Group Status ---"
kafka-consumer-groups.sh \
  --bootstrap-server "$DR_CLUSTER" \
  --group "$CONSUMER_GROUP" \
  --describe

# 2. Check that lag is not exploding (consumer is making progress)
echo ""
echo "--- Lag Trend (checking twice, 30s apart) ---"
lag1=$(kafka-consumer-groups.sh \
  --bootstrap-server "$DR_CLUSTER" \
  --group "$CONSUMER_GROUP" \
  --describe 2>/dev/null \
  | awk 'NR>1 {sum += $5} END {print sum}')
echo "Lag at T+0: $lag1"

sleep 30

lag2=$(kafka-consumer-groups.sh \
  --bootstrap-server "$DR_CLUSTER" \
  --group "$CONSUMER_GROUP" \
  --describe 2>/dev/null \
  | awk 'NR>1 {sum += $5} END {print sum}')
echo "Lag at T+30s: $lag2"

if [ "$lag2" -lt "$lag1" ]; then
  echo "PASS: Lag is decreasing — consumer is catching up"
elif [ "$lag2" -eq "$lag1" ]; then
  echo "WARN: Lag is stable — consumer may be keeping pace but not catching up"
else
  echo "FAIL: Lag is increasing — consumer is falling behind on DR cluster"
  exit 1
fi

echo ""
echo "=== Validation complete ==="

Summary: Choosing Your Disaster Recovery Architecture

The right cross-cluster replication topology depends on the trade-offs your business can accept:

Dimension	Active-Passive	Active-Active	Aggregation
RPO	Seconds to minutes	Near zero (dual-write)	N/A (unidirectional)
RTO	5-30 minutes (manual)	Near zero (automatic)	N/A
Cost	Medium (idle standby)	High (double resources)	Low (lightweight edges)
Operational complexity	Low	High (conflict resolution)	Medium
Best for	General DR	Extreme availability	Edge data collection

No single architecture fits all scenarios. Business-critical systems (payments, core transactions) justify the complexity of Active-Active for near-zero RTO. Standard production services achieve a good cost/reliability balance with Active-Passive. Edge data collection is a natural fit for aggregation topology.

The most important principle, regardless of architecture choice: test your failover procedure regularly under realistic conditions, record the actual RTO achieved, and iteratively improve the runbook until the failover is routine. A DR plan that has never been executed will fail at the worst possible moment.

Rate this chapter

4.6 / 5 (3 ratings)