Chapter 17

Replication Lag, Consistency and Data Loss Scenarios

Chapter 17 โ€” Replication Lag, Consistency, and Data-Loss Scenarios

Redis master-replica replication is asynchronous by default. The primary acknowledges a write to the client before any replica has confirmed receipt. This chapter models four distinct data-loss scenarios with concrete timelines, analyzes min-replicas-to-write, the WAIT command, and the catastrophic replication backlog overflow spiral โ€” giving you the vocabulary and configuration levers to make an informed trade-off between availability and data safety.


17.1 The Nature of Asynchronous Replication

When a client sends SET k v, the primary follows this order:

1. Write to in-memory dict
2. Optionally write to AOF buffer (flushed asynchronously)
3. Return +OK to the client          โ† client considers the write durable
4. Write command bytes to repl_backlog
5. Push bytes to replica's output buffer
6. Replica reads and executes the command, updates its offset

The gap between steps 3 and 6 is the replication window โ€” the interval during which data exists only on the primary. Under normal conditions this window is less than 10 ms. Under load, network pressure, or misconfiguration it can grow to seconds.


17.2 Data-Loss Scenario Analysis

Scenario 1 โ€” Primary Crash with 1-Second Replica Lag

Timeline:
T=0:     Client writes 10,000 commands (~50 MB)
T=0.5s:  Primary has applied all 10,000; replica has applied 7,000 (500 ms lag)
T=1.0s:  Primary host crashes (power loss / OOM kill)
T=1.1s:  Sentinel detects primary unreachable
T=30s:   Sentinel completes failover; replica promoted to primary
Result:  3,000 commands (~15 MB of writes) are permanently lost

Quantifying the loss:

write_rate   = 100_000  # commands/second
lag_seconds  = 1.0
lost_commands = write_rate * lag_seconds  # 100,000

Mitigations:

# Option A: Force fsync on every write (strongest durability, ~30% throughput cost)
appendonly yes
appendfsync always

# Option B: Use WAIT after critical writes (see ยง17.4)
# Option C: Accept the loss for non-critical data (cache use cases)

Scenario 2 โ€” Network Partition and Split-Brain

This is the most dangerous pattern because the primary keeps accepting writes from clients while the cluster has already elected a new primary.

Initial topology:
  [Primary M] โ†replicationโ†’ [Replica S1]
                           โ†replicationโ†’ [Replica S2]
  [Sentinel-1] [Sentinel-2] [Sentinel-3]

T=0:     Network partition splits M from S1, S2, all Sentinels
         Client-A is co-located with M and continues writing to it
         M does not know it has been isolated

T=0~30s: Client-A writes 5 million commands to isolated M
         Sentinels cannot reach M, start failure detection

T=30s:   Sentinels declare M objectively down; S1 promoted to primary
         Client-A is notified of the new primary address
         Client-A redirects writes to S1

T=60s:   Network partition heals; old M comes back online
         Sentinel sends SLAVEOF S1 to old M
         Old M performs full sync with S1 โ†’ all writes from T=0 to T=30s are lost

Loss = partition_duration ร— write_rate

Defense: min-replicas-to-write

min-replicas-to-write 1
min-replicas-max-lag  10

With this configuration, when the partition occurs:

Cost: availability decreases โ€” M refuses writes when isolated.

Scenario 3 โ€” Output Buffer Overflow Spiral During BGSAVE

T=0:    Primary starts BGSAVE (forks child process)
T=0~5s: Heavy write load continues; Copy-on-Write inflates memory usage
        Primary generates 100 MB/s of commands that need to reach the replica
        Replica's client output buffer grows at 100 MB/s
T=2.5s: Output buffer reaches 256 MB hard limit
T=2.5s: Primary forcibly closes replica connection
T=2.5s: Replica immediately attempts to reconnect (PSYNC)
T=2.5s: BGSAVE still running; repl_backlog has been overrun
        โ†’ FULLRESYNC required
T=2.5s: Primary forks again for the new BGSAVE (double load)
โ†’ New BGSAVE increases memory pressure further
โ†’ Output buffer hits limit again in ~2 seconds
โ†’ Cycle repeats indefinitely

Resolving the spiral:

# 1. Raise the output buffer ceiling
client-output-buffer-limit replica 1gb 256mb 300

# 2. Use diskless replication to eliminate disk I/O during BGSAVE
repl-diskless-sync yes

# 3. Increase backlog so partial resync works if disconnect occurs
repl-backlog-size 2gb

# 4. Disable RDB auto-save if you rely on AOF
save ""

# 5. Real-time monitoring during incidents
watch -n1 "redis-cli client list | grep replica | grep -oP 'omem=\K[0-9]+'"

Scenario 4 โ€” min-replicas-to-write Blocking Availability

Configuration:
  min-replicas-to-write 1
  min-replicas-max-lag  10

Scenario: The single replica experiences a slowdown (heavy query / GC pause)
  โ†’ lag rises above 10 seconds

Effect:
  All writes to the primary are rejected:
  (error) NOREPLICAS Not enough good replicas to write.

  No data is lost, but the service is completely unavailable for writes.

Tiered approach for heterogeneous data criticality:

# Critical path (financial transactions, inventory):
redis_primary.set("account:balance:1001", 5000)
ack = redis_primary.wait(1, 500)  # wait up to 500ms for 1 replica
if ack < 1:
    alert("Replication confirmation missing โ€” investigate")

# Non-critical path (analytics counters, session metadata):
redis_primary.incr("page:view:home")  # fire and forget โ€” acceptable loss

17.3 min-replicas-to-write and min-replicas-max-lag

Semantics

# redis.conf
min-replicas-to-write 1   # primary refuses writes unless at least 1 replica qualifies
min-replicas-max-lag  10  # a replica qualifies only if its lag is <= 10 seconds

The primary checks this condition once per second. Lag is measured from the last REPLCONF ACK heartbeat received from each replica.

Internal Check Logic

/* replication.c โ€” simplified */
int replicationCountGoodReplicas(long long offset) {
    int good = 0;
    listIter li;
    listRewind(server.slaves, &li);
    while ((ln = listNext(&li))) {
        client *slave = ln->value;
        /* Skip replicas whose ACK is too old */
        if (server.unixtime - slave->repl_ack_time > server.repl_min_slaves_max_lag)
            continue;
        /* Skip replicas that haven't caught up to the required offset */
        if (slave->repl_ack_off < offset)
            continue;
        good++;
    }
    return good;
}

Configuration Mode Comparison

Mode Configuration Data Safety Availability Use Case
Pure async min-replicas-to-write 0 Low High Cache, disposable data
Semi-sync min-replicas-to-write 1, lag 10 Medium Medium General application data
WAIT-confirmed WAIT 2 500 per write High Low (latency) Financial, order data

Simulating and Verifying the Behavior

# Simulate replica lag (on the replica)
redis-cli debug sleep 20

# On the primary, observe write rejection
redis-cli set test_key value
# (error) NOREPLICAS Not enough good replicas to write.

# Replica recovers; writes resume automatically

17.4 The WAIT Command

Syntax and Semantics

WAIT numreplicas timeout_ms
โ†’ Returns: integer โ€” number of replicas that have acknowledged
           all write commands sent before this WAIT call

WAIT does not provide strong consistency in the distributed systems sense. It provides a best-effort confirmation that a specific number of replicas have received and applied all commands up to the current master_repl_offset.

Internal Protocol

1. Client issues: WAIT 1 500
2. Primary records current master_repl_offset = X
3. Primary broadcasts: REPLCONF GETACK * (to all replicas โ€” request immediate ACK)
4. Primary waits for replicas to respond with their current offset
5. Returns when:
   - numreplicas replicas report offset >= X, OR
   - timeout_ms elapses (whichever comes first)

Usage in Application Code

import redis

r = redis.Redis(host='primary_host', port=6379)

# Critical writes
pipe = r.pipeline(transaction=False)
pipe.set("account:balance:1001", 5000)
pipe.set("order:status:9999", "paid")
pipe.execute()

# Wait for at least 1 replica to confirm, up to 500ms
confirmed = r.wait(1, 500)

if confirmed >= 1:
    log.info(f"Write replicated to {confirmed} replica(s)")
else:
    log.warning("Replication not confirmed โ€” possible data-loss risk")
    # Decide: retry, alert, or accept the risk based on business requirements

WAIT Return Value Semantics

WAIT 1 0    โ†’ Returns immediately; 0 means no replica has confirmed yet
WAIT 1 500  โ†’ Returns 1: at least one replica confirmed within 500ms
WAIT 2 500  โ†’ Returns 1: only one replica confirmed (not an error โ€” just informational)

A return value less than numreplicas is not an error. The application layer must decide whether the partial confirmation is acceptable.

Performance Characteristics

# Baseline: no WAIT
redis-benchmark -n 100000 -c 50 -P 16 SET key value
# ~200,000 ops/sec

# With WAIT 1 500 (serialized โ€” WAIT blocks the connection)
# Use pipeline + single WAIT at the end of each batch
pipe = r.pipeline()
for i in range(1000):
    pipe.set(f"key:{i}", f"value:{i}")
pipe.execute()
r.wait(1, 1000)   # single WAIT per batch โ€” amortizes the cost

17.5 Monitoring Replication Lag

INFO replication Field Reference

# Replication
role:master
connected_slaves:2
slave0:ip=192.168.1.2,port=6380,state=online,offset=1234560,lag=0
slave1:ip=192.168.1.3,port=6381,state=online,offset=1234100,lag=1
master_replid:8371b4fb1155b71f4a04d3e1bc3e18c4a990aeeb
master_repl_offset:1234567

Important: lag is reported in whole seconds. lag=0 means lag is between 0 and 999ms. For sub-second precision, compute byte-level lag:

byte_lag = master_repl_offset - slave_offset
         = 1234567 - 1234100 = 467 bytes

Byte-Level Lag Script

#!/bin/bash
PRIMARY=192.168.1.1
REPLICA=192.168.1.2

master_off=$(redis-cli -h $PRIMARY info replication \
    | grep master_repl_offset | cut -d: -f2 | tr -d '\r ')
slave_off=$(redis-cli -h $REPLICA info replication \
    | grep master_repl_offset | cut -d: -f2 | tr -d '\r ')

echo "Byte lag: $((master_off - slave_off))"

Prometheus Alert Rules

# redis_exporter exposes these metrics automatically
groups:
  - name: redis_replication
    rules:
      - alert: RedisReplicationLag
        expr: redis_replication_lag > 30
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Replica {{ $labels.addr }} lag > 30s"

      - alert: RedisBacklogNearFull
        expr: redis_repl_backlog_histlen / redis_repl_backlog_size > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replication backlog > 90% full โ€” risk of full sync"

      - alert: RedisNoReplicaConnected
        expr: redis_connected_slaves == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Primary has no connected replicas"

17.6 Recovering from the Full-Sync Spiral

Diagnosing the Problem

# Look for full sync events in the primary log
grep -i "BGSAVE\|SYNC\|fullresync" /var/log/redis/redis.log | tail -50

# Watch for rapid increases in rdb_bgsave_in_progress
watch -n2 "redis-cli info persistence | grep rdb_bgsave"

# Count ASK/MOVED redirects and SYNC operations
redis-cli info stats | grep -E 'sync_full|sync_partial'
# total_commands_processed:...
# sync_full:12            โ† 12 full syncs since startup (high!)
# sync_partial_ok:3
# sync_partial_err:9      โ† partial sync attempts that failed

Backlog Sizing Matrix

Write Rate Max Tolerable Disconnect Recommended Backlog
<10 MB/s 30 s 1 GB
10โ€“50 MB/s 30 s 3 GB
50โ€“100 MB/s 60 s 12 GB
>100 MB/s Shard with Cluster N/A

Emergency Runtime Fix (No Restart)

# Immediately increase backlog (takes effect without restart)
redis-cli -h primary CONFIG SET repl-backlog-size 2147483648  # 2 GB

# Raise output buffer limits
redis-cli -h primary CONFIG SET client-output-buffer-limit \
    "replica 1gb 256mb 300"

# Verify replicas are stabilizing
watch -n2 "redis-cli -h primary INFO replication | grep -E 'slave|backlog'"

# Persist the new config
redis-cli -h primary CONFIG REWRITE

17.7 Consistency Model Summary

Redis replication provides eventual consistency:

Operational (no failures):  lag < 10 ms     โ†’ effectively strongly consistent
Transient disruption:       lag 100ms โ€“ 1s  โ†’ briefly inconsistent
Partition / crash:          lag unknown     โ†’ risk of permanent data loss

CAP Theorem placement:
  With min-replicas-to-write=0:  AP (favors availability over consistency)
  With min-replicas-to-write=1:  CP (favors consistency, sacrifices availability)

Read-After-Write Consistency Patterns

# Problem: write to primary, immediately read from replica โ†’ may see stale data
primary.set("user:1:name", "Alice")
name = replica.get("user:1:name")  # might still be None or old value

# Pattern 1: Read from primary (strong consistency, higher load on primary)
name = primary.get("user:1:name")

# Pattern 2: Wait before reading from replica
primary.set("user:1:name", "Alice")
primary.wait(1, 100)        # wait up to 100ms for 1 replica
name = replica.get("user:1:name")   # now almost certainly consistent

# Pattern 3: Accept staleness (appropriate for cache scenarios)
# Clients tolerate reading data that is up to N seconds old

Chapter Summary

Scenario Root Cause Defense Mechanism
Primary crash loses recent writes Async replication window AOF fsync=always or WAIT per write
Split-brain during network partition Isolated primary keeps accepting writes min-replicas-to-write
Output buffer overflow spiral BGSAVE COW + slow replica Larger buffer + diskless + large backlog
min-replicas-to-write blocks writes Replica lag exceeds threshold Monitor lag; optimize replica resources

The right configuration depends entirely on your workload: the higher your write rate and the more critical the data, the more aggressively you must invest in replication buffer sizing and synchronization guarantees. Chapter 18 examines the Sentinel system's Raft-based leader election and the nine-step failover state machine.

Rate this chapter
4.9  / 5  (15 ratings)

๐Ÿ’ฌ Comments