Replication Lag, Consistency and Data Loss Scenarios
Chapter 17: Replication Lag, Consistency, and Data-Loss Scenarios
Redis master-replica replication is asynchronous by default: the primary acknowledges a write to the client before any replica has confirmed receipt. This chapter models four distinct failure scenarios with concrete timelines, then analyzes min-replicas-to-write, the WAIT command, and the replication backlog overflow spiral, giving you the vocabulary and configuration levers to make an informed trade-off between availability and data safety.
17.1 The Nature of Asynchronous Replication
When a client sends SET k v, the primary follows this order:
1. Write to in-memory dict
2. Optionally write to AOF buffer (flushed asynchronously)
3. Return +OK to the client; the client now considers the write durable
4. Write command bytes to repl_backlog
5. Push bytes to replica's output buffer
6. Replica reads and executes the command, updates its offset
The gap between steps 3 and 6 is the replication window: the interval during which data exists only on the primary. Under normal conditions this window is less than 10 ms. Under load, network pressure, or misconfiguration it can grow to seconds.
17.2 Data-Loss Scenario Analysis
Scenario 1: Primary Crash with 1-Second Replica Lag
Timeline:
T=0: Client writes 10,000 commands (~50 MB)
T=0.5s: Primary has applied all 10,000; replica has applied 7,000 (500 ms lag)
T=1.0s: Primary host crashes (power loss / OOM kill)
T=1.1s: Sentinel detects primary unreachable
T=30s: Sentinel completes failover; replica promoted to primary
Result: 3,000 commands (~15 MB of writes) are permanently lost
Quantifying the loss:
write_rate = 100_000 # commands/second
lag_seconds = 1.0
lost_commands = write_rate * lag_seconds # 100,000
Mitigations:
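# Option A: Force fsync on every write (strongest durability on the primary's own disk, ~30% throughput cost)
# Note: this helps only if the crashed primary is restarted from its AOF; a promoted replica still lacks the lagging writes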
appendonly yes
appendfsync always
# Option B: Use WAIT after critical writes (see §17.4)
# Option C: Accept the loss for non-critical data (cache use cases)
Scenario 2: Network Partition and Split-Brain
This is the most dangerous pattern because the primary keeps accepting writes from clients while the cluster has already elected a new primary.
Initial topology:
[Primary M] --replication--> [Replica S1]
            --replication--> [Replica S2]
[Sentinel-1] [Sentinel-2] [Sentinel-3]
T=0: Network partition splits M from S1, S2, all Sentinels
Client-A is co-located with M and continues writing to it
M does not know it has been isolated
T=0~30s: Client-A writes 5 million commands to isolated M
Sentinels cannot reach M, start failure detection
T=30s: Sentinels declare M objectively down; S1 promoted to primary
Client-A is notified of the new primary address
Client-A redirects writes to S1
T=60s: Network partition heals; old M comes back online
Sentinel sends SLAVEOF S1 to old M
Old M performs full sync with S1; all writes it accepted from T=0 to T=30s are lost
Loss = partition_duration × write_rate
Defense: min-replicas-to-write
min-replicas-to-write 1
min-replicas-max-lag 10
With this configuration, when the partition occurs:
- M loses visibility of all replicas
- After 10 seconds without a replica ACK (lag threshold exceeded), M begins rejecting all writes, so the loss window is bounded at roughly 10 seconds of writes
- Clients receive NOREPLICAS errors instead of silent data loss
Cost: availability decreases, because M refuses writes whenever it is isolated.
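The bound that the lag threshold places on the loss can be estimated with simple arithmetic. The following is a minimal sketch, not a guarantee: the function name `estimated_lost_writes` and the one-second allowance for the primary's periodic check are assumptions made for illustration, and the write rate is derived from the Scenario 2 figures (5 million commands in 30 seconds).

```python
def estimated_lost_writes(write_rate, partition_seconds, max_lag_seconds=None,
                          check_interval_seconds=1.0):
    """Estimate commands lost when an isolated primary is later resynced.

    write_rate: commands per second accepted by the isolated primary.
    partition_seconds: how long the primary stays isolated before demotion.
    max_lag_seconds: value of min-replicas-max-lag, or None when the
        min-replicas check is disabled.
    check_interval_seconds: assumed granularity of the primary's periodic check.
    """
    if max_lag_seconds is None:
        # No protection: every write during the partition is exposed.
        exposed = partition_seconds
    else:
        # Writes stop once the last ACK is older than the threshold.
        exposed = min(partition_seconds, max_lag_seconds + check_interval_seconds)
    return write_rate * exposed


# Scenario 2 numbers: 5 million commands written during a 30-second partition.
rate = 5_000_000 / 30
unprotected = estimated_lost_writes(rate, 30)
protected = estimated_lost_writes(rate, 30, max_lag_seconds=10)
print(f"unprotected: {unprotected:,.0f} commands at risk")
print(f"protected:   {protected:,.0f} commands at risk")
```

Under these assumptions the threshold cuts the exposure to roughly a third of the unprotected case. It does not remove it.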
Scenario 3: Output Buffer Overflow Spiral During BGSAVE
T=0: Primary starts BGSAVE (forks child process)
T=0~5s: Heavy write load continues; Copy-on-Write inflates memory usage
Primary generates 100 MB/s of commands that need to reach the replica
Replica's client output buffer grows at 100 MB/s
T=2.5s: Output buffer reaches 256 MB hard limit
T=2.5s: Primary forcibly closes replica connection
T=2.5s: Replica immediately attempts to reconnect (PSYNC)
T=2.5s: BGSAVE still running; repl_backlog has been overrun
→ FULLRESYNC required
T=2.5s: Primary forks again for the new BGSAVE (double load)
→ New BGSAVE increases memory pressure further
→ Output buffer hits limit again in ~2 seconds
→ Cycle repeats indefinitely
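The 2.5-second figure in this timeline follows directly from the buffer limit and the fill rate. The sketch below models the two disconnect conditions of client-output-buffer-limit (hard limit reached, or soft limit exceeded for a sustained period) under the pessimistic assumption that the replica drains nothing. The function name `seconds_until_disconnect` is hypothetical, and the 64 MB / 60 s soft limit is Redis's shipped default for replicas rather than a value stated in the timeline.

```python
def seconds_until_disconnect(fill_rate_mb_s, hard_mb, soft_mb, soft_seconds):
    """Worst-case time until the primary drops a replica whose buffer only grows.

    Models `client-output-buffer-limit replica <hard> <soft> <soft-seconds>`:
    the connection is closed when the buffer reaches the hard limit, or when
    it stays above the soft limit for soft_seconds. A limit of 0 is disabled.
    """
    if fill_rate_mb_s <= 0:
        return float("inf")
    candidates = []
    if hard_mb > 0:
        candidates.append(hard_mb / fill_rate_mb_s)
    if soft_mb > 0:
        # Time to reach the soft limit, plus the grace period above it.
        candidates.append(soft_mb / fill_rate_mb_s + soft_seconds)
    return min(candidates) if candidates else float("inf")


# 100 MB/s of replication traffic, as in the timeline above.
default_limit = seconds_until_disconnect(100, hard_mb=256, soft_mb=64, soft_seconds=60)
raised_limit = seconds_until_disconnect(100, hard_mb=1024, soft_mb=256, soft_seconds=300)
print(f"default limits: disconnect after {default_limit:.2f} s")
print(f"raised limits:  disconnect after {raised_limit:.2f} s")
```

Raising the ceiling only buys time. If the replica cannot keep up with the sustained write rate, the buffer still fills eventually.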
Resolving the spiral:
# 1. Raise the output buffer ceiling
client-output-buffer-limit replica 1gb 256mb 300
# 2. Use diskless replication to eliminate disk I/O during BGSAVE
repl-diskless-sync yes
# 3. Increase backlog so partial resync works if disconnect occurs
repl-backlog-size 2gb
# 4. Disable RDB auto-save if you rely on AOF
save ""
# 5. Real-time monitoring during incidents
watch -n1 "redis-cli client list | grep -E 'flags=[A-Za-z]*S' | grep -oP 'omem=\K[0-9]+'"
Scenario 4: min-replicas-to-write Blocking Availability
Configuration:
min-replicas-to-write 1
min-replicas-max-lag 10
Scenario: The single replica experiences a slowdown (heavy query / GC pause)
→ lag rises above 10 seconds
Effect:
All writes to the primary are rejected:
(error) NOREPLICAS Not enough good replicas to write.
No data is lost, but the service is completely unavailable for writes.
Tiered approach for heterogeneous data criticality:
# Critical path (financial transactions, inventory):
redis_primary.set("account:balance:1001", 5000)
ack = redis_primary.wait(1, 500)  # wait up to 500 ms for 1 replica
if ack < 1:
    # alert() stands for an application-defined notification hook
    alert("Replication confirmation missing: investigate")
# Non-critical path (analytics counters, session metadata):
redis_primary.incr("page:view:home")  # fire and forget; acceptable loss
17.3 min-replicas-to-write and min-replicas-max-lag
Semantics
# redis.conf
min-replicas-to-write 1 # primary refuses writes unless at least 1 replica qualifies
min-replicas-max-lag 10 # a replica qualifies only if its lag is <= 10 seconds
The primary checks this condition once per second. Lag is measured from the last REPLCONF ACK heartbeat received from each replica.
Internal Check Logic
/* replication.c, simplified and merged for illustration;
 * the function name is not taken verbatim from the Redis source */
int replicationCountGoodReplicas(long long offset) {
    int good = 0;
    listIter li;
    listNode *ln;

    listRewind(server.slaves, &li);
    while ((ln = listNext(&li))) {
        client *slave = ln->value;
        /* Skip replicas whose ACK is too old */
        if (server.unixtime - slave->repl_ack_time > server.repl_min_slaves_max_lag)
            continue;
        /* Skip replicas that haven't caught up to the required offset */
        if (slave->repl_ack_off < offset)
            continue;
        good++;
    }
    return good;
}
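The same check can be modeled outside the server to reason about when writes will be refused. This is a minimal sketch of the logic above, not Redis code: `ReplicaState`, `count_good_replicas`, and `write_allowed` are hypothetical names, and times are plain floats on whatever clock the caller uses.

```python
from dataclasses import dataclass


@dataclass
class ReplicaState:
    name: str
    last_ack_time: float  # when the last REPLCONF ACK arrived, in seconds
    ack_offset: int       # replication offset reported in that ACK


def count_good_replicas(replicas, now, max_lag, required_offset=0):
    """Count replicas whose last ACK is recent enough and far enough along."""
    good = 0
    for replica in replicas:
        if now - replica.last_ack_time > max_lag:
            continue  # ACK too old
        if replica.ack_offset < required_offset:
            continue  # has not caught up to the required offset
        good += 1
    return good


def write_allowed(replicas, now, min_replicas, max_lag):
    """Mirror min-replicas-to-write / min-replicas-max-lag; 0 disables the check."""
    if min_replicas == 0 or max_lag == 0:
        return True
    return count_good_replicas(replicas, now, max_lag) >= min_replicas


replicas = [
    ReplicaState("s1", last_ack_time=100.0, ack_offset=500),
    ReplicaState("s2", last_ack_time=85.0, ack_offset=500),  # 15 s behind
]
print(write_allowed(replicas, now=100.0, min_replicas=1, max_lag=10))
print(write_allowed(replicas, now=100.0, min_replicas=2, max_lag=10))
```

With a 10-second threshold only s1 qualifies, so a requirement of one replica passes and a requirement of two is refused.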
Configuration Mode Comparison
| Mode | Configuration | Data Safety | Availability | Use Case |
|---|---|---|---|---|
| Pure async | min-replicas-to-write 0 | Low | High | Cache, disposable data |
| Semi-sync | min-replicas-to-write 1, lag 10 | Medium | Medium | General application data |
| WAIT-confirmed | WAIT 2 500 per write | High | Low (latency) | Financial, order data |
Simulating and Verifying the Behavior
# Simulate replica lag (on the replica)
redis-cli debug sleep 20
# On the primary, observe write rejection
redis-cli set test_key value
# (error) NOREPLICAS Not enough good replicas to write.
# Replica recovers; writes resume automatically
17.4 The WAIT Command
Syntax and Semantics
WAIT numreplicas timeout_ms
→ Returns: integer, the number of replicas that have acknowledged
all write commands sent on this connection before the WAIT call
WAIT does not provide strong consistency in the distributed-systems sense. It provides a best-effort confirmation that a given number of replicas have received and processed all writes issued on the calling connection before the WAIT. An acknowledged write is safer, not safe: it can still be lost if a failover promotes a replica that never received it.
Internal Protocol
1. Client issues: WAIT 1 500
2. Primary records the replication offset of this client's last write, X (at most the current master_repl_offset)
3. Primary broadcasts: REPLCONF GETACK * (to all replicas, requesting an immediate ACK)
4. Primary waits for replicas to respond with their current offset
5. Returns when:
- numreplicas replicas report offset >= X, OR
- timeout_ms elapses (whichever comes first)
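The return rule in step 5 can be exercised with a small offline simulation. This sketch only models the decision logic, with ACK arrivals supplied as a list. `simulate_wait` and the tuple layout are assumptions for illustration, and only positive timeouts are modeled.

```python
def simulate_wait(target_offset, numreplicas, timeout_ms, acks):
    """Replay ACK arrivals and return (confirmed_replicas, elapsed_ms).

    acks: iterable of (arrival_ms, replica_name, acked_offset) tuples.
    A replica counts as confirmed once it has acknowledged an offset at
    or beyond target_offset.
    """
    if timeout_ms <= 0:
        raise ValueError("this sketch only models positive timeouts")
    latest = {}  # replica name -> highest offset acknowledged so far
    confirmed = 0
    for arrival_ms, name, offset in sorted(acks):
        if arrival_ms > timeout_ms:
            break  # this ACK arrives after the deadline
        latest[name] = max(offset, latest.get(name, 0))
        confirmed = sum(1 for off in latest.values() if off >= target_offset)
        if confirmed >= numreplicas:
            return confirmed, arrival_ms  # satisfied before the timeout
    return confirmed, timeout_ms  # timed out with a partial count


# s1 acknowledges quickly; s2 lags and catches up only after 700 ms.
acks = [(3, "s1", 1000), (40, "s2", 900), (700, "s2", 1000)]
print(simulate_wait(1000, 1, 500, acks))
print(simulate_wait(1000, 2, 500, acks))
print(simulate_wait(1000, 2, 1000, acks))
```

The second call illustrates the partial case: the timeout expires with one confirmation, and the caller receives that count rather than an error.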
Usage in Application Code
import logging
import redis

log = logging.getLogger(__name__)
r = redis.Redis(host='primary_host', port=6379)
# Critical writes
pipe = r.pipeline(transaction=False)
pipe.set("account:balance:1001", 5000)
pipe.set("order:status:9999", "paid")
pipe.execute()
# Wait for at least 1 replica to confirm, up to 500 ms
confirmed = r.wait(1, 500)
if confirmed >= 1:
    log.info(f"Write replicated to {confirmed} replica(s)")
else:
    log.warning("Replication not confirmed: possible data-loss risk")
    # Decide: retry, alert, or accept the risk based on business requirements
WAIT Return Value Semantics
WAIT 0 0 → Returns immediately with the number of replicas that have already confirmed
WAIT 1 0 → Timeout 0 means no timeout: blocks until at least one replica confirms
WAIT 1 500 → Returns 1: at least one replica confirmed within 500 ms
WAIT 2 500 → Returns 1: only one replica confirmed (not an error, just informational)
A return value less than numreplicas is not an error. The application layer must decide whether the partial confirmation is acceptable.
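One way to make that decision explicit is to classify the return value before acting on it. The sketch below is an illustration under stated assumptions: `Confirmation`, `classify_wait_result`, and `set_with_confirmation` are hypothetical names, the client is any object exposing redis-py style `set` and `wait` methods, and both commands are assumed to travel over the same connection.

```python
from enum import Enum


class Confirmation(Enum):
    FULL = "full"        # at least the requested number of replicas acknowledged
    PARTIAL = "partial"  # some replicas acknowledged, but fewer than requested
    NONE = "none"        # no replica acknowledged before the timeout


def classify_wait_result(acked, required):
    """Map a WAIT return value onto an explicit outcome."""
    if acked >= required:
        return Confirmation.FULL
    if acked > 0:
        return Confirmation.PARTIAL
    return Confirmation.NONE


def set_with_confirmation(client, key, value, required=1, timeout_ms=500):
    """Write a key, then report how well the write was confirmed.

    `client` is assumed to behave like redis.Redis: set(key, value) and
    wait(num_replicas, timeout_ms) returning an integer.
    """
    client.set(key, value)
    acked = client.wait(required, timeout_ms)
    return classify_wait_result(acked, required)
```

A caller might treat PARTIAL as acceptable for order metadata but retry or alert on anything short of FULL for account balances. That policy belongs to the application, not to Redis.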
Performance Characteristics
# Baseline: no WAIT
redis-benchmark -n 100000 -c 50 -P 16 SET key value
# ~200,000 ops/sec
# With WAIT 1 500 (serialized: WAIT blocks the connection)
# Use pipeline + single WAIT at the end of each batch
pipe = r.pipeline()
for i in range(1000):
    pipe.set(f"key:{i}", f"value:{i}")
pipe.execute()
r.wait(1, 1000)  # single WAIT per batch amortizes the cost
17.5 Monitoring Replication Lag
INFO replication Field Reference
# Replication
role:master
connected_slaves:2
slave0:ip=192.168.1.2,port=6380,state=online,offset=1234560,lag=0
slave1:ip=192.168.1.3,port=6381,state=online,offset=1234100,lag=1
master_replid:8371b4fb1155b71f4a04d3e1bc3e18c4a990aeeb
master_repl_offset:1234567
Important: lag is reported in whole seconds. lag=0 means lag is between 0 and 999ms. For sub-second precision, compute byte-level lag:
byte_lag = master_repl_offset - slave_offset
= 1234567 - 1234100 = 467 bytes
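The same subtraction can be applied to every replica at once by parsing the INFO replication text. The sketch below works on the sample output shown above. The function name `byte_lag_per_replica` is hypothetical, and the parser assumes the `slaveN:ip=...,port=...,offset=...` line format from that sample.

```python
SAMPLE_INFO = """\
# Replication
role:master
connected_slaves:2
slave0:ip=192.168.1.2,port=6380,state=online,offset=1234560,lag=0
slave1:ip=192.168.1.3,port=6381,state=online,offset=1234100,lag=1
master_replid:8371b4fb1155b71f4a04d3e1bc3e18c4a990aeeb
master_repl_offset:1234567
"""


def byte_lag_per_replica(info_text):
    """Return {"ip:port": bytes_behind} from INFO replication output."""
    master_offset = None
    replica_offsets = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        if key == "master_repl_offset":
            master_offset = int(value)
        elif key.startswith("slave") and key[5:].isdigit():
            # value looks like "ip=...,port=...,state=...,offset=...,lag=..."
            fields = dict(item.split("=", 1) for item in value.split(","))
            address = f"{fields['ip']}:{fields['port']}"
            replica_offsets[address] = int(fields["offset"])
    if master_offset is None:
        raise ValueError("master_repl_offset not found in INFO output")
    return {addr: master_offset - off for addr, off in replica_offsets.items()}


for address, lag in byte_lag_per_replica(SAMPLE_INFO).items():
    print(f"{address}: {lag} bytes behind")
```

Because both offsets come from a single INFO call on the primary, the result avoids the small skew introduced by querying the primary and the replica at different moments.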
Byte-Level Lag Script
#!/bin/bash
PRIMARY=192.168.1.1
REPLICA=192.168.1.2
master_off=$(redis-cli -h $PRIMARY info replication \
| grep master_repl_offset | cut -d: -f2 | tr -d '\r ')
slave_off=$(redis-cli -h $REPLICA info replication \
| grep master_repl_offset | cut -d: -f2 | tr -d '\r ')
echo "Byte lag: $((master_off - slave_off))"
Prometheus Alert Rules
# Metric names follow redis_exporter conventions but vary by exporter version;
# confirm them against your exporter's /metrics output before deploying
groups:
  - name: redis_replication
    rules:
      - alert: RedisReplicationLag
        expr: redis_replication_lag > 30
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Replica {{ $labels.addr }} lag > 30s"
      - alert: RedisBacklogNearFull
        expr: redis_repl_backlog_histlen / redis_repl_backlog_size > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replication backlog > 90% full: risk of full sync"
      - alert: RedisNoReplicaConnected
        expr: redis_connected_slaves == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Primary has no connected replicas"
17.6 Recovering from the Full-Sync Spiral
Diagnosing the Problem
# Look for full sync events in the primary log
grep -i "BGSAVE\|SYNC\|fullresync" /var/log/redis/redis.log | tail -50
# Watch for rdb_bgsave_in_progress flipping to 1 repeatedly (back-to-back BGSAVEs)
watch -n2 "redis-cli info persistence | grep rdb_bgsave"
# Count full and partial resync operations
redis-cli info stats | grep -E 'sync_full|sync_partial'
# sync_full:12 ← 12 full syncs since startup (high!)
# sync_partial_ok:3
# sync_partial_err:9 ← partial sync attempts that failed
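These counters are easier to act on as a ratio. The sketch below parses the three sync counters from INFO stats text and computes the share of partial resync attempts that failed. `partial_sync_failure_ratio` is a hypothetical helper, and no particular alert threshold is implied.

```python
SAMPLE_STATS = """\
sync_full:12
sync_partial_ok:3
sync_partial_err:9
"""


def partial_sync_failure_ratio(stats_text):
    """Return (counters, ratio) where ratio = failed / attempted partial syncs."""
    counters = {}
    for line in stats_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key.startswith("sync_"):
            counters[key] = int(value)
    ok = counters.get("sync_partial_ok", 0)
    err = counters.get("sync_partial_err", 0)
    attempts = ok + err
    ratio = err / attempts if attempts else 0.0
    return counters, ratio


counters, ratio = partial_sync_failure_ratio(SAMPLE_STATS)
print(f"full syncs: {counters['sync_full']}, partial failure ratio: {ratio:.0%}")
```

A high ratio means most reconnecting replicas found the backlog already overwritten, which points to an undersized repl-backlog-size rather than to the network.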
Backlog Sizing Matrix
| Write Rate | Max Tolerable Disconnect | Recommended Backlog |
|---|---|---|
| <10 MB/s | 30 s | 1 GB |
| 10–50 MB/s | 30 s | 3 GB |
| 50–100 MB/s | 60 s | 12 GB |
| >100 MB/s | Shard with Cluster | N/A |
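The recommendations in this matrix are consistent with sizing the backlog to hold every byte written during the longest tolerable disconnect, then multiplying by a safety margin. The sketch below makes that rule explicit. The function name and the safety factor of 2 are assumptions inferred from the second and third rows, taking the top of each write-rate range. The first row rounds up further, to 1 GB.

```python
def recommended_backlog_mb(write_rate_mb_s, max_disconnect_s, safety_factor=2.0):
    """Backlog size in MB that covers a disconnect at the given write rate.

    write_rate_mb_s: peak replication traffic in MB per second.
    max_disconnect_s: longest replica disconnect you want to survive
        with a partial resync.
    safety_factor: assumed headroom multiplier (not a Redis setting).
    """
    return write_rate_mb_s * max_disconnect_s * safety_factor


for rate, disconnect in [(10, 30), (50, 30), (100, 60)]:
    size = recommended_backlog_mb(rate, disconnect)
    print(f"{rate} MB/s for {disconnect} s -> {size:,.0f} MB backlog")
```

Use measured peak write rates, not averages: the backlog is overrun during bursts, which is exactly when replicas are most likely to disconnect.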
Emergency Runtime Fix (No Restart)
# Immediately increase backlog (takes effect without restart)
redis-cli -h primary CONFIG SET repl-backlog-size 2147483648 # 2 GB
# Raise output buffer limits
redis-cli -h primary CONFIG SET client-output-buffer-limit \
"replica 1gb 256mb 300"
# Verify replicas are stabilizing
watch -n2 "redis-cli -h primary INFO replication | grep -E 'slave|backlog'"
# Persist the new config
redis-cli -h primary CONFIG REWRITE
17.7 Consistency Model Summary
Redis replication provides eventual consistency:
Operational (no failures): lag < 10 ms → effectively strongly consistent
Transient disruption: lag 100 ms to 1 s → briefly inconsistent
Partition / crash: lag unknown → risk of permanent data loss
CAP Theorem placement:
With min-replicas-to-write=0: AP (favors availability over consistency)
With min-replicas-to-write=1: leans CP (refuses writes when isolated; this bounds the loss window but does not eliminate it)
Read-After-Write Consistency Patterns
# Problem: write to primary, immediately read from replica; may see stale data
primary.set("user:1:name", "Alice")
name = replica.get("user:1:name") # might still be None or old value
# Pattern 1: Read from primary (strong consistency, higher load on primary)
name = primary.get("user:1:name")
# Pattern 2: Wait before reading from replica
primary.set("user:1:name", "Alice")
primary.wait(1, 100) # wait up to 100ms for 1 replica
name = replica.get("user:1:name") # now almost certainly consistent
# Pattern 3: Accept staleness (appropriate for cache scenarios)
# Clients tolerate reading data that is up to N seconds old
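Pattern 2 becomes safer when the WAIT result decides where the read goes. The sketch below falls back to the primary when no replica confirmed in time. `read_after_write` is a hypothetical helper, both clients are assumed to expose redis-py style `set`, `get`, and `wait`, and the topology is assumed to have a single replica so that the confirming replica is the one being read.

```python
def read_after_write(primary, replica, key, value, timeout_ms=100):
    """Write to the primary, then read from the replica only if it confirmed.

    Returns (value_read, source) where source is "replica" or "primary".
    Assumes a single-replica topology; with several replicas the ACK may
    come from a different replica than the one queried here.
    """
    primary.set(key, value)
    if primary.wait(1, timeout_ms) >= 1:
        # The replica acknowledged the write, so its copy should be current.
        return replica.get(key), "replica"
    # No confirmation in time: pay the cost of a primary read instead.
    return primary.get(key), "primary"
```

The fallback keeps reads correct during lag spikes at the price of extra primary load, which is usually the right trade for user-facing read-your-own-writes paths.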
Chapter Summary
| Scenario | Root Cause | Defense Mechanism |
|---|---|---|
| Primary crash loses recent writes | Async replication window | AOF fsync=always or WAIT per write |
| Split-brain during network partition | Isolated primary keeps accepting writes | min-replicas-to-write |
| Output buffer overflow spiral | BGSAVE COW + slow replica | Larger buffer + diskless + large backlog |
| min-replicas-to-write blocks writes | Replica lag exceeds threshold | Monitor lag; optimize replica resources |
The right configuration depends entirely on your workload: the higher your write rate and the more critical the data, the more aggressively you must invest in replication buffer sizing and synchronization guarantees. Chapter 18 examines the Sentinel system's Raft-based leader election and the nine-step failover state machine.