Cluster Failover, Scaling and Data Migration
Chapter 20 โ Cluster Failover, Scale-Out, and Data Migration
This chapter is a production operations manual. It covers every aspect of Redis Cluster runtime management: the pfail/fail detection pipeline, the replica election algorithm that prioritizes data freshness, three distinct manual failover modes, a step-by-step scale-out playbook, the ASK protocol during live slot migration, and the complete scale-in procedure. Operational impact analysis and emergency remediation scripts are included throughout.
20.1 pfail and fail Detection Pipeline
pfail โ Subjective Failure (Node-Local)
A node marks a peer as pfail when it cannot receive a valid response within cluster-node-timeout:
/* cluster.c โ inside clusterCron(), runs every 100ms */
if (node->link &&
now - node->link->last_recv_time > server.cluster_node_timeout &&
now - node->ping_sent > server.cluster_node_timeout / 2) {
/* No PONG within timeout window */
node->flags |= CLUSTER_NODE_PFAIL;
server.cluster->stats_pfail_nodes++;
}
pfail is purely local. No cluster-wide action is taken. The node continues normal operation; only gossip propagation carries this information to peers.
fail โ Objective Failure (Cluster-Wide)
Multiple nodes independently report pfail, and one node accumulates enough reports to cross the threshold:
Propagation mechanism:
1. Node-A marks Primary-M as pfail
2. Node-A's subsequent PING/PONG heartbeats include:
gossip entry for M: flags = CLUSTER_NODE_PFAIL
3. Node-B receives the heartbeat, records: "A reports M pfail"
(stored in M's fail_reports list with a timestamp)
4. Any node receiving enough pfail reports checks the threshold:
if fail_report_count >= quorum AND all reports within 2รcluster_node_timeout:
โ broadcast FAIL message for M immediately
5. FAIL broadcast:
- Unlike Gossip, FAIL is sent to ALL known nodes directly
- Ensures rapid propagation regardless of Gossip round timing
- Receiving nodes immediately set CLUSTER_NODE_FAIL on M
Quorum for fail Declaration
quorum = floor(total_masters / 2) + 1
Examples:
3 masters โ quorum = 2 (majority of 3)
5 masters โ quorum = 3
7 masters โ quorum = 4
All quorum nodes must have submitted pfail reports within
the 2รcluster_node_timeout window (stale reports don't count).
Key Configuration Parameters
# redis.conf (cluster node)
cluster-node-timeout 15000 # ms; balance between sensitivity and false positives
cluster-require-full-coverage yes # refuse all reads/writes if any slot is uncovered
cluster-slave-no-failover no # whether replicas participate in failover
cluster-allow-reads-when-down no # allow stale reads when cluster is degraded
cluster-migration-barrier 1 # min replicas a primary must have before
# one can be migrated to an orphaned primary
20.2 Replica Election Algorithm
Trigger Conditions
A replica initiates an election only when all of the following hold:
- Its primary is marked
fail(objective, not just pfail) - The primary owns at least one slot (replicas of non-slotted primaries don't vote)
- The replica's data is not too stale (replication offset is recent enough)
- Enough time has elapsed since the last failover attempt (epoch guard)
Election Delay โ Prioritizing Freshness
Replicas do not race immediately. Each waits a computed delay before sending its vote request, giving the most up-to-date replica the shortest wait:
/* cluster.c โ clusterHandleSlaveFailover() */
/* my_rank = position in sorted replica list (by repl_offset descending)
rank 0 = highest offset = freshest replica */
mstime_t delay =
500 /* fixed base delay */
+ (random() % 500) /* jitter (0โ500ms) to prevent ties */
+ (data_age * 0.1) /* penalize replicas with stale data */
+ (my_rank * 1000); /* 1 second per rank position */
/* Result:
rank 0 (freshest): delay โ 500โ1000ms โ fires first
rank 1: delay โ 1500โ2000ms โ fires second if rank 0 fails
rank 2: delay โ 2500โ3000ms โ fallback */
Concrete Example
Primary-M fails. Three replicas:
Replica-A: offset=10,000 rank=0 โ delay โ 750ms
Replica-B: offset=9,500 rank=1 โ delay โ 1,750ms
Replica-C: offset=9,000 rank=2 โ delay โ 2,750ms
T=750ms: Replica-A sends FAILOVER_AUTH_REQUEST to all primaries
T=751ms: Master-X votes for A (epoch increased, not yet voted this epoch)
T=752ms: Master-Y votes for A
T=752ms: A has majority โ elected new primary for M's slots
T=1750ms: B's timer fires, but A has already claimed leadership โ B aborts
Vote Request Protocol
1. Replica sends to each primary node:
FAILOVER_AUTH_REQUEST
Payload: {
epoch: currentEpoch + 1, (incremented before sending)
primary_id: <failed-primary-id>,
primary_slots: <slot-bitmap>
}
2. Each primary evaluates:
โ request epoch > own currentEpoch
โ No vote cast in this epoch yet (voted_time check)
โ The claimed failed primary is genuinely in fail state
โ Grant vote: send FAILOVER_AUTH_ACK with vote epoch
3. Replica counts votes:
- Received > (total_masters / 2) votes โ elected
- Timeout: retry with epoch+1 (back to delay calculation)
Post-Election Actions
/* cluster.c */
void clusterHandleSlaveFailover(void) {
if (server.cluster->failover_auth_count >= needed_quorum) {
serverLog(LL_WARNING,
"Failover election won: I'm the new master.");
/* 1. Assume ownership of all former primary's slots */
clusterNodeSetAsMaster(myself);
/* 2. Update configEpoch to election epoch */
clusterBumpConfigEpochWithoutConsensus();
/* 3. Broadcast PONG with new slot bitmap to entire cluster */
clusterBroadcastPong(CLUSTER_BROADCAST_ALL);
/* 4. Stop replicating; become standalone primary */
replicationUnsetMaster();
}
}
20.3 Manual Failover
Three Modes
# Mode 1: Graceful (recommended for planned maintenance)
#
# Internal flow:
# 1. Replica sends MANUELFAILOVER_PAUSE_CLIENTS to primary
# 2. Primary pauses client writes
# 3. Replica waits until its offset matches primary's
# 4. Replica initiates election (guaranteed to win โ all data is current)
# 5. Zero data loss
# Prerequisite: primary must be reachable
#
redis-cli -h replica_host -p 7003 CLUSTER FAILOVER
# Mode 2: Force (primary is slow/degraded but still reachable)
#
# Skips offset synchronization โ may lose a small number of recent writes.
# Replica immediately starts election without waiting to catch up.
#
redis-cli -h replica_host -p 7003 CLUSTER FAILOVER FORCE
# Mode 3: Takeover (primary is completely unreachable)
#
# Replica unilaterally promotes itself โ bypasses voting entirely.
# Increments configEpoch and broadcasts new slot ownership directly.
# Risk: if the old primary recovers, configEpoch collision resolution
# may cause unexpected slot reassignment. Use with caution.
#
redis-cli -h replica_host -p 7003 CLUSTER FAILOVER TAKEOVER
Graceful Failover Timeline
T=0: Operator runs: CLUSTER FAILOVER (on replica)
T=0: Replica โ Primary: CLUSTERMSG_TYPE_MFSTART
T=0: Primary pauses client connections (returns -LOADING temporarily)
T=0: Replica polls its offset vs primary's offset (INFO replication)
T=Xs: Replica offset == Primary offset (fully caught up)
T=Xs: Replica initiates election; wins immediately (freshest replica)
T=Xs+1: New primary begins accepting writes; old primary demoted to replica
Total downtime for writes: typically 1โ3 seconds
20.4 Online Scale-Out Playbook
Pre-Expansion Health Check
# Verify cluster is fully healthy before starting
redis-cli --cluster check 192.168.1.1:7000
# Expected:
# [OK] All nodes agree about slots configuration.
# [OK] All 16384 slots covered.
# Check current slot distribution
redis-cli --cluster info 192.168.1.1:7000
# 192.168.1.1:7000 โ 15000 keys | 5461 slots | 1 slave
# 192.168.1.2:7001 โ 14500 keys | 5461 slots | 1 slave
# 192.168.1.3:7002 โ 15200 keys | 5462 slots | 1 slave
Step 1 โ Start New Nodes
# New primary node (port 7006)
cat > /etc/redis/redis-7006.conf << 'EOF'
port 7006
bind 0.0.0.0
daemonize yes
logfile /var/log/redis/redis-7006.log
dir /var/lib/redis/7006
cluster-enabled yes
cluster-config-file /etc/redis/cluster/nodes-7006.conf
cluster-node-timeout 15000
cluster-announce-ip 192.168.1.4
cluster-announce-port 7006
cluster-announce-bus-port 17006
appendonly yes
appendfsync everysec
maxmemory 8gb
maxmemory-policy allkeys-lru
EOF
mkdir -p /var/lib/redis/7006
redis-server /etc/redis/redis-7006.conf
# Verify it started standalone (not yet in cluster)
redis-cli -p 7006 CLUSTER INFO | grep cluster_state
# cluster_state:ok (but cluster_known_nodes:1 โ only itself)
Step 2 โ Add New Primary to Cluster
redis-cli --cluster add-node 192.168.1.4:7006 192.168.1.1:7000
# Expected:
# [OK] New node added correctly.
# Verify: node is visible but has 0 slots
redis-cli -c -p 7000 CLUSTER NODES | grep 7006
# <id> 192.168.1.4:7006@17006 master - 0 ... connected
# Note: no slot range at the end โ 0 slots assigned
Step 3 โ Reshard: Migrate Slots to New Node
# Interactive mode (recommended for first time)
redis-cli --cluster reshard 192.168.1.1:7000
# How many slots do you want to move? โ 4096
# Receiving node ID? โ <7006-node-id>
# Source nodes? โ all
# Non-interactive (for automation)
redis-cli --cluster reshard 192.168.1.1:7000 \
--cluster-from all \
--cluster-to <7006-node-id> \
--cluster-slots 4096 \
--cluster-yes
# Monitor progress during reshard
watch -n2 "redis-cli --cluster info 192.168.1.1:7000"
Step 4 โ Add Replica for New Primary
# Start the replica node (port 7007)
redis-server /etc/redis/redis-7007.conf
# Add as replica of the new primary
redis-cli --cluster add-node 192.168.1.4:7007 192.168.1.1:7000 \
--cluster-slave \
--cluster-master-id <7006-node-id>
# Verify replication established
redis-cli -p 7006 INFO replication
# role:master
# connected_slaves:1
# slave0:ip=192.168.1.4,port=7007,state=online,offset=...,lag=0
Step 5 โ Post-Expansion Verification
# Full cluster health check
redis-cli --cluster check 192.168.1.1:7000
# Slot distribution should now show 4 primary nodes each with ~4096 slots
redis-cli --cluster info 192.168.1.1:7000
# 192.168.1.1:7000 โ ... | 4096 slots | 1 slave
# 192.168.1.2:7001 โ ... | 4096 slots | 1 slave
# 192.168.1.3:7002 โ ... | 4096 slots | 1 slave
# 192.168.1.4:7006 โ ... | 4096 slots | 1 slave โ new node
# Monitor for MOVED redirections (clients updating their routing cache)
redis-cli -p 7000 INFO stats | grep moved_redirect
20.5 Slot Migration and ASK Protocol in Detail
Migration State Setup
# Migrating slot 100 from Node-A (7000) to Node-B (7006)
NODE_A_ID=$(redis-cli -p 7000 CLUSTER MYID)
NODE_B_ID=$(redis-cli -p 7006 CLUSTER MYID)
# Step 1: Set importing state on destination
redis-cli -p 7006 CLUSTER SETSLOT 100 IMPORTING $NODE_A_ID
# Step 2: Set migrating state on source
redis-cli -p 7000 CLUSTER SETSLOT 100 MIGRATING $NODE_B_ID
MIGRATE Command: Atomic Key Transfer
# MIGRATE atomically moves a key from source to destination:
# 1. Source executes DUMP key โ serialized bytes
# 2. Source sends RESTORE key ttl serialized-value to destination
# 3. Destination confirms receipt
# 4. Source executes DEL key (only after destination confirms)
# Single key
redis-cli -p 7000 MIGRATE 192.168.1.4 7006 mykey 0 5000
# Batch (recommended: 10โ100 keys at once)
redis-cli -p 7000 MIGRATE 192.168.1.4 7006 "" 0 60000 KEYS key1 key2 key3
# Replace if key already exists at destination (handles retry safety)
redis-cli -p 7000 MIGRATE 192.168.1.4 7006 "" 0 60000 REPLACE KEYS key1
Request Routing During Migration
Complete decision tree for a client request on slot 100 (being migrated):
Case 1: Client sends GET user:1 to Node-A; key HAS NOT been migrated yet
Node-A: key exists locally โ respond normally
Result: +value (no redirection)
Case 2: Client sends GET user:1 to Node-A; key HAS been migrated to B
Node-A: key not found locally; slot 100 is in migrating state
Node-A responds: -ASK 100 192.168.1.4:7006
Client:
โ connect to Node-B
โ send ASKING (one-shot unlock for importing slot)
โ send GET user:1
โ Node-B responds with value
Note: client's routing cache NOT updated (migration still in progress)
Case 3: Client sends GET user:1 directly to Node-B (correct node, wrong timing)
No ASKING flag: Node-B returns -MOVED 100 192.168.1.1:7000
(redirects to source โ key is still there)
With ASKING flag: Node-B checks locally โ returns value or nil
(ASK indicates "you came from source, key may have been sent here")
Case 4: Key does not exist anywhere (genuinely missing)
Node-A: not found; slot is migrating โ -ASK to Node-B
Client โ ASKING โ GET โ Node-B returns nil
Correct behavior: nil is the right answer
Completing the Migration
# After all keys in slot 100 have been migrated:
# Notify both source and destination that migration is complete
redis-cli -p 7000 CLUSTER SETSLOT 100 NODE $NODE_B_ID
redis-cli -p 7006 CLUSTER SETSLOT 100 NODE $NODE_B_ID
# Effect:
# Node-A: removes slot 100 from its bitmap; migrating flag cleared
# Node-B: adds slot 100 to its bitmap; importing flag cleared
# Node-B increments its configEpoch and broadcasts PONG with new slot bitmap
# Other nodes update their routing tables via Gossip
Production-Grade Migration Script
#!/bin/bash
# migrate_slot.sh โ safely migrate a single slot with retry logic
set -euo pipefail
SRC_HOST=192.168.1.1
SRC_PORT=7000
DST_HOST=192.168.1.4
DST_PORT=7006
SLOT=$1
BATCH_SIZE=50
MIGRATE_TIMEOUT=60000 # 60 seconds
DST_NODE_ID=$(redis-cli -p $DST_PORT CLUSTER MYID | tr -d '\r')
SRC_NODE_ID=$(redis-cli -p $SRC_PORT CLUSTER MYID | tr -d '\r')
echo "Migrating slot $SLOT: $SRC_HOST:$SRC_PORT โ $DST_HOST:$DST_PORT"
# Set migration states
redis-cli -p $DST_PORT CLUSTER SETSLOT $SLOT IMPORTING $SRC_NODE_ID
redis-cli -p $SRC_PORT CLUSTER SETSLOT $SLOT MIGRATING $DST_NODE_ID
# Migrate all keys in batches
total_migrated=0
while true; do
KEYS=$(redis-cli -p $SRC_PORT CLUSTER GETKEYSINSLOT $SLOT $BATCH_SIZE 2>/dev/null | tr '\r' '\n')
if [ -z "$KEYS" ]; then
echo "Slot $SLOT: all keys migrated ($total_migrated total)"
break
fi
key_count=$(echo "$KEYS" | wc -l | tr -d ' ')
# Build MIGRATE command with KEYS list
if redis-cli -p $SRC_PORT MIGRATE $DST_HOST $DST_PORT "" 0 $MIGRATE_TIMEOUT \
REPLACE KEYS $KEYS 2>&1 | grep -q "ERR\|IOERR"; then
echo "Migration error for batch in slot $SLOT; retrying..."
sleep 1
continue
fi
total_migrated=$((total_migrated + key_count))
echo " migrated $total_migrated keys so far..."
done
# Finalize
redis-cli -p $SRC_PORT CLUSTER SETSLOT $SLOT NODE $DST_NODE_ID
redis-cli -p $DST_PORT CLUSTER SETSLOT $SLOT NODE $DST_NODE_ID
echo "Slot $SLOT migration complete."
20.6 Scale-In (Node Removal) Procedure
# Goal: remove node 7006 (192.168.1.4:7006) from the cluster
# Prerequisite: 7006 must have 0 slots (all migrated away)
# Step 1: Move all of 7006's slots to remaining nodes
redis-cli --cluster reshard 192.168.1.1:7000 \
--cluster-from <7006-node-id> \
--cluster-to <7000-node-id> \
--cluster-slots 4096 \
--cluster-yes
# Step 2: Verify 7006 has no remaining slots
redis-cli -p 7006 CLUSTER INFO | grep cluster_slots_assigned
# cluster_slots_assigned:0
# Step 3: Remove 7006's replica first (always remove replicas before primaries)
redis-cli --cluster del-node 192.168.1.1:7000 <7007-replica-node-id>
# [OK] Node 192.168.1.4:7007 removed from cluster.
# Step 4: Remove 7006 primary node
redis-cli --cluster del-node 192.168.1.1:7000 <7006-primary-node-id>
# [OK] Node 192.168.1.4:7006 removed from cluster.
# Step 5: Shutdown the removed nodes
redis-cli -h 192.168.1.4 -p 7006 SHUTDOWN
redis-cli -h 192.168.1.4 -p 7007 SHUTDOWN
# Step 6: Final cluster health check
redis-cli --cluster check 192.168.1.1:7000
20.7 Operational Impact Analysis
Impact 1 โ MIGRATE Blocks the Key
Duration: typically 0.1โ5ms per key (depends on value size)
For a 10 MB string value: MIGRATE may block for 100ms+
Mitigation:
- Batch small values (up to 100 keys per MIGRATE)
- Migrate large values individually with longer timeout
- Schedule migrations during low-traffic windows
- Use OBJECT ENCODING to identify large keys before migration
Impact 2 โ Client Routing Cache Updates
During migration, clients with stale routing caches receive MOVED redirections.
Each MOVED adds one extra round trip (~0.5โ2ms on LAN).
Only the first request for a migrated slot causes a redirect.
After the client updates its cache, subsequent requests go directly to the
correct node with no overhead.
Monitoring MOVED redirection rate:
redis-cli -p 7000 INFO stats | grep moved_redirect
# moved_redirect:847 โ acceptable during migration; should drop after
Impact 3 โ Full Sync to New Replica
When a new replica joins, it performs a full sync with its primary:
- Full sync blocks the replica from serving reads
- Primary forks for BGSAVE; memory may spike (COW)
- Network bandwidth consumed for RDB transfer
Monitoring:
watch -n2 "redis-cli -p 7006 INFO replication | grep -E 'master_sync|connected_slaves'"
watch -n2 "redis-cli -p 7006 INFO memory | grep used_memory_human"
Mitigation:
- Enable diskless sync: repl-diskless-sync yes
- Add replicas during off-peak hours
- Add one replica at a time (parallel full syncs compound memory pressure)
20.8 Cluster Operations Quick Reference
# Health and status
redis-cli --cluster check <host>:<port>
redis-cli --cluster info <host>:<port>
redis-cli -c -p <port> CLUSTER INFO
redis-cli -c -p <port> CLUSTER NODES
# Slot operations
redis-cli -c -p <port> CLUSTER KEYSLOT <key>
redis-cli -c -p <port> CLUSTER GETKEYSINSLOT <slot> <count>
redis-cli -c -p <port> CLUSTER SLOTS
# Rebalancing
redis-cli --cluster rebalance <host>:<port>
redis-cli --cluster rebalance <host>:<port> --cluster-use-empty-masters
# Repair
redis-cli --cluster fix <host>:<port>
redis-cli --cluster fix <host>:<port> --cluster-fix-with-unreachable-masters
# Node management
redis-cli --cluster add-node <new> <existing>
redis-cli --cluster del-node <existing> <node-id-to-remove>
# Dangerous: CLUSTER RESET clears all cluster state
redis-cli -p <port> CLUSTER RESET SOFT # remove from cluster, keep data
redis-cli -p <port> CLUSTER RESET HARD # remove from cluster, flush data
20.9 Failure Scenario Remediation
Scenario A โ cluster_state: fail
# Diagnosis
redis-cli -p 7000 CLUSTER INFO | grep cluster_state
# cluster_state:fail
redis-cli -p 7000 CLUSTER NODES | grep " fail"
# Identify which node is in fail state and whether it has a replica
# Remediation:
# Case 1: Failed node has a replica โ wait for automatic failover (~15โ30s)
# Case 2: Failed node has no replica โ add a new node and migrate its slots
# Case 3: Need immediate reads on partial cluster:
redis-cli -p 7000 CONFIG SET cluster-require-full-coverage no
Scenario B โ Duplicate Primary for Same Slots (Split-Brain)
# Symptom: two masters claim ownership of overlapping slots
redis-cli -p 7000 CLUSTER NODES | grep master | awk '{print $1, $9, $10}'
# Resolution: configEpoch determines the winner
# The node with the higher configEpoch retains ownership;
# the lower-epoch node automatically demotes itself when it receives
# the PONG from the higher-epoch winner.
# Force recalculation if stuck:
redis-cli --cluster fix 192.168.1.1:7000
# Check configEpoch values:
redis-cli -p 7000 CLUSTER NODES | awk '{print $1, $7}' # field 7 = config-epoch
Scenario C โ High ASK Redirection Rate After Migration
# Check redirection statistics
redis-cli -p 7000 INFO stats | grep -E "ask_redirect|moved_redirect"
# ask_redirect:50000 โ very high: migration is running or clients are slow to update
# Verify migration completed on all nodes:
redis-cli -p 7000 CLUSTER NODES | grep -v "connected$"
# Any node not in 'connected' state needs attention
# Force clients to refresh their routing table:
# Most smart clients refresh automatically on MOVED errors
# If using a proxy (twemproxy, etc.): restart or trigger route refresh
Chapter Summary
| Operation | Core Mechanism | Key Advice |
|---|---|---|
| Failure detection | pfail (local) โ Gossip โ fail (cluster quorum) | Tune cluster-node-timeout |
| Automatic failover | Rank-delayed replica election | Replica with highest offset wins |
| Manual failover | CLUSTER FAILOVER [FORCE|TAKEOVER] | Default (graceful) for planned ops |
| Scale-out | add-node โ reshard โ add-node (replica) | Off-peak; monitor memory and lag |
| Slot migration | MIGRATE + SETSLOT + ASK protocol | Batch 50โ100 keys; handle REPLACE |
| Scale-in | reshard (clear slots) โ del-node โ shutdown | Remove replicas before primaries |
| Migration impact | MIGRATE blocks key, MOVED on stale clients | Smart clients handle transparently |
Redis Cluster is the authoritative solution for workloads that outgrow a single primary's memory or write throughput. Mastering its failover algorithm, slot migration protocol, and operational playbooks is the mark of a production-ready Redis engineer.