Chapter 20

Cluster Failover, Scaling and Data Migration

Chapter 20 — Cluster Failover, Scale-Out, and Data Migration

This chapter is a production operations manual. It covers every aspect of Redis Cluster runtime management: the pfail/fail detection pipeline, the replica election algorithm that prioritizes data freshness, three distinct manual failover modes, a step-by-step scale-out playbook, the ASK protocol during live slot migration, and the complete scale-in procedure. Operational impact analysis and emergency remediation scripts are included throughout.

20.1 pfail and fail Detection Pipeline

pfail — Subjective Failure (Node-Local)

A node marks a peer as pfail when it cannot receive a valid response within cluster-node-timeout:

/* cluster.c — inside clusterCron(), runs every 100ms */
if (node->link &&
    now - node->link->last_recv_time > server.cluster_node_timeout &&
    now - node->ping_sent > server.cluster_node_timeout / 2) {
    /* No PONG within timeout window */
    node->flags |= CLUSTER_NODE_PFAIL;
    server.cluster->stats_pfail_nodes++;
}

pfail is purely local. No cluster-wide action is taken. The node continues normal operation; only gossip propagation carries this information to peers.

fail — Objective Failure (Cluster-Wide)

Multiple nodes independently report pfail, and one node accumulates enough reports to cross the threshold:

Propagation mechanism:
1. Node-A marks Primary-M as pfail
2. Node-A's subsequent PING/PONG heartbeats include:
   gossip entry for M: flags = CLUSTER_NODE_PFAIL

3. Node-B receives the heartbeat, records: "A reports M pfail"
   (stored in M's fail_reports list with a timestamp)

4. Any node receiving enough pfail reports checks the threshold:
   if fail_report_count >= quorum AND all reports within 2×cluster_node_timeout:
       → broadcast FAIL message for M immediately

5. FAIL broadcast:
   - Unlike Gossip, FAIL is sent to ALL known nodes directly
   - Ensures rapid propagation regardless of Gossip round timing
   - Receiving nodes immediately set CLUSTER_NODE_FAIL on M

Quorum for fail Declaration

quorum = floor(total_masters / 2) + 1

Examples:
  3 masters → quorum = 2   (majority of 3)
  5 masters → quorum = 3
  7 masters → quorum = 4

All quorum nodes must have submitted pfail reports within
the 2×cluster_node_timeout window (stale reports don't count).

Key Configuration Parameters

# redis.conf (cluster node)
cluster-node-timeout 15000           # ms; balance between sensitivity and false positives
cluster-require-full-coverage yes    # refuse all reads/writes if any slot is uncovered
cluster-slave-no-failover no         # whether replicas participate in failover
cluster-allow-reads-when-down no     # allow stale reads when cluster is degraded
cluster-migration-barrier 1          # min replicas a primary must have before
                                     # one can be migrated to an orphaned primary

20.2 Replica Election Algorithm

Trigger Conditions

A replica initiates an election only when all of the following hold:

Its primary is marked fail (objective, not just pfail)
The primary owns at least one slot (replicas of non-slotted primaries don't vote)
The replica's data is not too stale (replication offset is recent enough)
Enough time has elapsed since the last failover attempt (epoch guard)

Election Delay — Prioritizing Freshness

Replicas do not race immediately. Each waits a computed delay before sending its vote request, giving the most up-to-date replica the shortest wait:

/* cluster.c — clusterHandleSlaveFailover() */

/* my_rank = position in sorted replica list (by repl_offset descending)
   rank 0 = highest offset = freshest replica */
mstime_t delay =
    500                       /* fixed base delay */
    + (random() % 500)        /* jitter (0–500ms) to prevent ties */
    + (data_age * 0.1)        /* penalize replicas with stale data */
    + (my_rank * 1000);       /* 1 second per rank position */

/* Result:
   rank 0 (freshest): delay ≈ 500–1000ms   → fires first
   rank 1:            delay ≈ 1500–2000ms  → fires second if rank 0 fails
   rank 2:            delay ≈ 2500–3000ms  → fallback */

Concrete Example

Primary-M fails. Three replicas:
  Replica-A: offset=10,000  rank=0 → delay ≈ 750ms
  Replica-B: offset=9,500   rank=1 → delay ≈ 1,750ms
  Replica-C: offset=9,000   rank=2 → delay ≈ 2,750ms

T=750ms:  Replica-A sends FAILOVER_AUTH_REQUEST to all primaries
T=751ms:  Master-X votes for A (epoch increased, not yet voted this epoch)
T=752ms:  Master-Y votes for A
T=752ms:  A has majority → elected new primary for M's slots
T=1750ms: B's timer fires, but A has already claimed leadership → B aborts

Vote Request Protocol

1. Replica sends to each primary node:
   FAILOVER_AUTH_REQUEST
   Payload: {
     epoch: currentEpoch + 1,      (incremented before sending)
     primary_id: <failed-primary-id>,
     primary_slots: <slot-bitmap>
   }

2. Each primary evaluates:
   ✓ request epoch > own currentEpoch
   ✓ No vote cast in this epoch yet (voted_time check)
   ✓ The claimed failed primary is genuinely in fail state
   → Grant vote: send FAILOVER_AUTH_ACK with vote epoch

3. Replica counts votes:
   - Received > (total_masters / 2) votes → elected
   - Timeout: retry with epoch+1 (back to delay calculation)

Post-Election Actions

/* cluster.c */
void clusterHandleSlaveFailover(void) {
    if (server.cluster->failover_auth_count >= needed_quorum) {
        serverLog(LL_WARNING,
            "Failover election won: I'm the new master.");

        /* 1. Assume ownership of all former primary's slots */
        clusterNodeSetAsMaster(myself);

        /* 2. Update configEpoch to election epoch */
        clusterBumpConfigEpochWithoutConsensus();

        /* 3. Broadcast PONG with new slot bitmap to entire cluster */
        clusterBroadcastPong(CLUSTER_BROADCAST_ALL);

        /* 4. Stop replicating; become standalone primary */
        replicationUnsetMaster();
    }
}

20.3 Manual Failover

Three Modes

# Mode 1: Graceful (recommended for planned maintenance)
#
# Internal flow:
#   1. Replica sends MANUELFAILOVER_PAUSE_CLIENTS to primary
#   2. Primary pauses client writes
#   3. Replica waits until its offset matches primary's
#   4. Replica initiates election (guaranteed to win — all data is current)
#   5. Zero data loss
#   Prerequisite: primary must be reachable
#
redis-cli -h replica_host -p 7003 CLUSTER FAILOVER

# Mode 2: Force (primary is slow/degraded but still reachable)
#
# Skips offset synchronization — may lose a small number of recent writes.
# Replica immediately starts election without waiting to catch up.
#
redis-cli -h replica_host -p 7003 CLUSTER FAILOVER FORCE

# Mode 3: Takeover (primary is completely unreachable)
#
# Replica unilaterally promotes itself — bypasses voting entirely.
# Increments configEpoch and broadcasts new slot ownership directly.
# Risk: if the old primary recovers, configEpoch collision resolution
#       may cause unexpected slot reassignment. Use with caution.
#
redis-cli -h replica_host -p 7003 CLUSTER FAILOVER TAKEOVER

Graceful Failover Timeline

T=0:   Operator runs: CLUSTER FAILOVER (on replica)
T=0:   Replica → Primary: CLUSTERMSG_TYPE_MFSTART
T=0:   Primary pauses client connections (returns -LOADING temporarily)
T=0:   Replica polls its offset vs primary's offset (INFO replication)
T=Xs:  Replica offset == Primary offset (fully caught up)
T=Xs:  Replica initiates election; wins immediately (freshest replica)
T=Xs+1: New primary begins accepting writes; old primary demoted to replica
Total downtime for writes: typically 1–3 seconds

20.4 Online Scale-Out Playbook

Pre-Expansion Health Check

# Verify cluster is fully healthy before starting
redis-cli --cluster check 192.168.1.1:7000

# Expected:
# [OK] All nodes agree about slots configuration.
# [OK] All 16384 slots covered.

# Check current slot distribution
redis-cli --cluster info 192.168.1.1:7000
# 192.168.1.1:7000 → 15000 keys | 5461 slots | 1 slave
# 192.168.1.2:7001 → 14500 keys | 5461 slots | 1 slave
# 192.168.1.3:7002 → 15200 keys | 5462 slots | 1 slave

Step 1 — Start New Nodes

# New primary node (port 7006)
cat > /etc/redis/redis-7006.conf << 'EOF'
port 7006
bind 0.0.0.0
daemonize yes
logfile /var/log/redis/redis-7006.log
dir /var/lib/redis/7006
cluster-enabled yes
cluster-config-file /etc/redis/cluster/nodes-7006.conf
cluster-node-timeout 15000
cluster-announce-ip 192.168.1.4
cluster-announce-port 7006
cluster-announce-bus-port 17006
appendonly yes
appendfsync everysec
maxmemory 8gb
maxmemory-policy allkeys-lru
EOF

mkdir -p /var/lib/redis/7006
redis-server /etc/redis/redis-7006.conf

# Verify it started standalone (not yet in cluster)
redis-cli -p 7006 CLUSTER INFO | grep cluster_state
# cluster_state:ok  (but cluster_known_nodes:1 — only itself)

Step 2 — Add New Primary to Cluster

redis-cli --cluster add-node 192.168.1.4:7006 192.168.1.1:7000

# Expected:
# [OK] New node added correctly.

# Verify: node is visible but has 0 slots
redis-cli -c -p 7000 CLUSTER NODES | grep 7006
# <id> 192.168.1.4:7006@17006 master - 0 ... connected
# Note: no slot range at the end → 0 slots assigned

Step 3 — Reshard: Migrate Slots to New Node

# Interactive mode (recommended for first time)
redis-cli --cluster reshard 192.168.1.1:7000
# How many slots do you want to move? → 4096
# Receiving node ID? → <7006-node-id>
# Source nodes? → all

# Non-interactive (for automation)
redis-cli --cluster reshard 192.168.1.1:7000 \
    --cluster-from all \
    --cluster-to <7006-node-id> \
    --cluster-slots 4096 \
    --cluster-yes

# Monitor progress during reshard
watch -n2 "redis-cli --cluster info 192.168.1.1:7000"

Step 4 — Add Replica for New Primary

# Start the replica node (port 7007)
redis-server /etc/redis/redis-7007.conf

# Add as replica of the new primary
redis-cli --cluster add-node 192.168.1.4:7007 192.168.1.1:7000 \
    --cluster-slave \
    --cluster-master-id <7006-node-id>

# Verify replication established
redis-cli -p 7006 INFO replication
# role:master
# connected_slaves:1
# slave0:ip=192.168.1.4,port=7007,state=online,offset=...,lag=0

Step 5 — Post-Expansion Verification

# Full cluster health check
redis-cli --cluster check 192.168.1.1:7000

# Slot distribution should now show 4 primary nodes each with ~4096 slots
redis-cli --cluster info 192.168.1.1:7000
# 192.168.1.1:7000 → ... | 4096 slots | 1 slave
# 192.168.1.2:7001 → ... | 4096 slots | 1 slave
# 192.168.1.3:7002 → ... | 4096 slots | 1 slave
# 192.168.1.4:7006 → ... | 4096 slots | 1 slave  ← new node

# Monitor for MOVED redirections (clients updating their routing cache)
redis-cli -p 7000 INFO stats | grep moved_redirect

20.5 Slot Migration and ASK Protocol in Detail

Migration State Setup

# Migrating slot 100 from Node-A (7000) to Node-B (7006)
NODE_A_ID=$(redis-cli -p 7000 CLUSTER MYID)
NODE_B_ID=$(redis-cli -p 7006 CLUSTER MYID)

# Step 1: Set importing state on destination
redis-cli -p 7006 CLUSTER SETSLOT 100 IMPORTING $NODE_A_ID

# Step 2: Set migrating state on source
redis-cli -p 7000 CLUSTER SETSLOT 100 MIGRATING $NODE_B_ID

MIGRATE Command: Atomic Key Transfer

# MIGRATE atomically moves a key from source to destination:
# 1. Source executes DUMP key → serialized bytes
# 2. Source sends RESTORE key ttl serialized-value to destination
# 3. Destination confirms receipt
# 4. Source executes DEL key (only after destination confirms)

# Single key
redis-cli -p 7000 MIGRATE 192.168.1.4 7006 mykey 0 5000

# Batch (recommended: 10–100 keys at once)
redis-cli -p 7000 MIGRATE 192.168.1.4 7006 "" 0 60000 KEYS key1 key2 key3

# Replace if key already exists at destination (handles retry safety)
redis-cli -p 7000 MIGRATE 192.168.1.4 7006 "" 0 60000 REPLACE KEYS key1

Request Routing During Migration

Complete decision tree for a client request on slot 100 (being migrated):

Case 1: Client sends GET user:1 to Node-A; key HAS NOT been migrated yet
  Node-A: key exists locally → respond normally
  Result: +value  (no redirection)

Case 2: Client sends GET user:1 to Node-A; key HAS been migrated to B
  Node-A: key not found locally; slot 100 is in migrating state
  Node-A responds: -ASK 100 192.168.1.4:7006
  Client:
    → connect to Node-B
    → send ASKING  (one-shot unlock for importing slot)
    → send GET user:1
    → Node-B responds with value
  Note: client's routing cache NOT updated (migration still in progress)

Case 3: Client sends GET user:1 directly to Node-B (correct node, wrong timing)
  No ASKING flag: Node-B returns -MOVED 100 192.168.1.1:7000
    (redirects to source — key is still there)
  With ASKING flag: Node-B checks locally → returns value or nil
    (ASK indicates "you came from source, key may have been sent here")

Case 4: Key does not exist anywhere (genuinely missing)
  Node-A: not found; slot is migrating → -ASK to Node-B
  Client → ASKING → GET → Node-B returns nil
  Correct behavior: nil is the right answer

Completing the Migration

# After all keys in slot 100 have been migrated:

# Notify both source and destination that migration is complete
redis-cli -p 7000 CLUSTER SETSLOT 100 NODE $NODE_B_ID
redis-cli -p 7006 CLUSTER SETSLOT 100 NODE $NODE_B_ID

# Effect:
# Node-A: removes slot 100 from its bitmap; migrating flag cleared
# Node-B: adds slot 100 to its bitmap; importing flag cleared
# Node-B increments its configEpoch and broadcasts PONG with new slot bitmap
# Other nodes update their routing tables via Gossip

Production-Grade Migration Script

#!/bin/bash
# migrate_slot.sh — safely migrate a single slot with retry logic
set -euo pipefail

SRC_HOST=192.168.1.1
SRC_PORT=7000
DST_HOST=192.168.1.4
DST_PORT=7006
SLOT=$1
BATCH_SIZE=50
MIGRATE_TIMEOUT=60000  # 60 seconds

DST_NODE_ID=$(redis-cli -p $DST_PORT CLUSTER MYID | tr -d '\r')
SRC_NODE_ID=$(redis-cli -p $SRC_PORT CLUSTER MYID | tr -d '\r')

echo "Migrating slot $SLOT: $SRC_HOST:$SRC_PORT → $DST_HOST:$DST_PORT"

# Set migration states
redis-cli -p $DST_PORT CLUSTER SETSLOT $SLOT IMPORTING $SRC_NODE_ID
redis-cli -p $SRC_PORT CLUSTER SETSLOT $SLOT MIGRATING $DST_NODE_ID

# Migrate all keys in batches
total_migrated=0
while true; do
    KEYS=$(redis-cli -p $SRC_PORT CLUSTER GETKEYSINSLOT $SLOT $BATCH_SIZE 2>/dev/null | tr '\r' '\n')
    if [ -z "$KEYS" ]; then
        echo "Slot $SLOT: all keys migrated ($total_migrated total)"
        break
    fi

    key_count=$(echo "$KEYS" | wc -l | tr -d ' ')

    # Build MIGRATE command with KEYS list
    if redis-cli -p $SRC_PORT MIGRATE $DST_HOST $DST_PORT "" 0 $MIGRATE_TIMEOUT \
        REPLACE KEYS $KEYS 2>&1 | grep -q "ERR\|IOERR"; then
        echo "Migration error for batch in slot $SLOT; retrying..."
        sleep 1
        continue
    fi

    total_migrated=$((total_migrated + key_count))
    echo "  migrated $total_migrated keys so far..."
done

# Finalize
redis-cli -p $SRC_PORT CLUSTER SETSLOT $SLOT NODE $DST_NODE_ID
redis-cli -p $DST_PORT CLUSTER SETSLOT $SLOT NODE $DST_NODE_ID
echo "Slot $SLOT migration complete."

20.6 Scale-In (Node Removal) Procedure

# Goal: remove node 7006 (192.168.1.4:7006) from the cluster
# Prerequisite: 7006 must have 0 slots (all migrated away)

# Step 1: Move all of 7006's slots to remaining nodes
redis-cli --cluster reshard 192.168.1.1:7000 \
    --cluster-from <7006-node-id> \
    --cluster-to <7000-node-id> \
    --cluster-slots 4096 \
    --cluster-yes

# Step 2: Verify 7006 has no remaining slots
redis-cli -p 7006 CLUSTER INFO | grep cluster_slots_assigned
# cluster_slots_assigned:0

# Step 3: Remove 7006's replica first (always remove replicas before primaries)
redis-cli --cluster del-node 192.168.1.1:7000 <7007-replica-node-id>
# [OK] Node 192.168.1.4:7007 removed from cluster.

# Step 4: Remove 7006 primary node
redis-cli --cluster del-node 192.168.1.1:7000 <7006-primary-node-id>
# [OK] Node 192.168.1.4:7006 removed from cluster.

# Step 5: Shutdown the removed nodes
redis-cli -h 192.168.1.4 -p 7006 SHUTDOWN
redis-cli -h 192.168.1.4 -p 7007 SHUTDOWN

# Step 6: Final cluster health check
redis-cli --cluster check 192.168.1.1:7000

20.7 Operational Impact Analysis

Impact 1 — MIGRATE Blocks the Key

Duration: typically 0.1–5ms per key (depends on value size)
For a 10 MB string value: MIGRATE may block for 100ms+

Mitigation:
  - Batch small values (up to 100 keys per MIGRATE)
  - Migrate large values individually with longer timeout
  - Schedule migrations during low-traffic windows
  - Use OBJECT ENCODING to identify large keys before migration

Impact 2 — Client Routing Cache Updates

During migration, clients with stale routing caches receive MOVED redirections.
Each MOVED adds one extra round trip (~0.5–2ms on LAN).

Only the first request for a migrated slot causes a redirect.
After the client updates its cache, subsequent requests go directly to the
correct node with no overhead.

Monitoring MOVED redirection rate:
redis-cli -p 7000 INFO stats | grep moved_redirect
# moved_redirect:847   ← acceptable during migration; should drop after

Impact 3 — Full Sync to New Replica

When a new replica joins, it performs a full sync with its primary:
  - Full sync blocks the replica from serving reads
  - Primary forks for BGSAVE; memory may spike (COW)
  - Network bandwidth consumed for RDB transfer

Monitoring:
watch -n2 "redis-cli -p 7006 INFO replication | grep -E 'master_sync|connected_slaves'"
watch -n2 "redis-cli -p 7006 INFO memory | grep used_memory_human"

Mitigation:
  - Enable diskless sync: repl-diskless-sync yes
  - Add replicas during off-peak hours
  - Add one replica at a time (parallel full syncs compound memory pressure)

20.8 Cluster Operations Quick Reference

# Health and status
redis-cli --cluster check <host>:<port>
redis-cli --cluster info <host>:<port>
redis-cli -c -p <port> CLUSTER INFO
redis-cli -c -p <port> CLUSTER NODES

# Slot operations
redis-cli -c -p <port> CLUSTER KEYSLOT <key>
redis-cli -c -p <port> CLUSTER GETKEYSINSLOT <slot> <count>
redis-cli -c -p <port> CLUSTER SLOTS

# Rebalancing
redis-cli --cluster rebalance <host>:<port>
redis-cli --cluster rebalance <host>:<port> --cluster-use-empty-masters

# Repair
redis-cli --cluster fix <host>:<port>
redis-cli --cluster fix <host>:<port> --cluster-fix-with-unreachable-masters

# Node management
redis-cli --cluster add-node <new> <existing>
redis-cli --cluster del-node <existing> <node-id-to-remove>

# Dangerous: CLUSTER RESET clears all cluster state
redis-cli -p <port> CLUSTER RESET SOFT   # remove from cluster, keep data
redis-cli -p <port> CLUSTER RESET HARD   # remove from cluster, flush data

20.9 Failure Scenario Remediation

Scenario A — cluster_state: fail

# Diagnosis
redis-cli -p 7000 CLUSTER INFO | grep cluster_state
# cluster_state:fail

redis-cli -p 7000 CLUSTER NODES | grep " fail"
# Identify which node is in fail state and whether it has a replica

# Remediation:
# Case 1: Failed node has a replica → wait for automatic failover (~15–30s)
# Case 2: Failed node has no replica → add a new node and migrate its slots
# Case 3: Need immediate reads on partial cluster:
redis-cli -p 7000 CONFIG SET cluster-require-full-coverage no

Scenario B — Duplicate Primary for Same Slots (Split-Brain)

# Symptom: two masters claim ownership of overlapping slots
redis-cli -p 7000 CLUSTER NODES | grep master | awk '{print $1, $9, $10}'

# Resolution: configEpoch determines the winner
# The node with the higher configEpoch retains ownership;
# the lower-epoch node automatically demotes itself when it receives
# the PONG from the higher-epoch winner.

# Force recalculation if stuck:
redis-cli --cluster fix 192.168.1.1:7000

# Check configEpoch values:
redis-cli -p 7000 CLUSTER NODES | awk '{print $1, $7}'  # field 7 = config-epoch

Scenario C — High ASK Redirection Rate After Migration

# Check redirection statistics
redis-cli -p 7000 INFO stats | grep -E "ask_redirect|moved_redirect"
# ask_redirect:50000    ← very high: migration is running or clients are slow to update

# Verify migration completed on all nodes:
redis-cli -p 7000 CLUSTER NODES | grep -v "connected$"
# Any node not in 'connected' state needs attention

# Force clients to refresh their routing table:
# Most smart clients refresh automatically on MOVED errors
# If using a proxy (twemproxy, etc.): restart or trigger route refresh

Chapter Summary

Operation	Core Mechanism	Key Advice
Failure detection	pfail (local) → Gossip → fail (cluster quorum)	Tune `cluster-node-timeout`
Automatic failover	Rank-delayed replica election	Replica with highest offset wins
Manual failover	CLUSTER FAILOVER [FORCE\|TAKEOVER]	Default (graceful) for planned ops
Scale-out	add-node → reshard → add-node (replica)	Off-peak; monitor memory and lag
Slot migration	MIGRATE + SETSLOT + ASK protocol	Batch 50–100 keys; handle REPLACE
Scale-in	reshard (clear slots) → del-node → shutdown	Remove replicas before primaries
Migration impact	MIGRATE blocks key, MOVED on stale clients	Smart clients handle transparently

Redis Cluster is the authoritative solution for workloads that outgrow a single primary's memory or write throughput. Mastering its failover algorithm, slot migration protocol, and operational playbooks is the mark of a production-ready Redis engineer.

Rate this chapter

4.8 / 5 (10 ratings)