Chapter 20

Cluster Failover, Scaling and Data Migration

Chapter 20 — Cluster Failover, Scale-Out, and Data Migration

This chapter is a production operations manual. It covers every aspect of Redis Cluster runtime management: the pfail/fail detection pipeline, the replica election algorithm that prioritizes data freshness, three distinct manual failover modes, a step-by-step scale-out playbook, the ASK protocol during live slot migration, and the complete scale-in procedure. Operational impact analysis and emergency remediation scripts are included throughout.


20.1 pfail and fail Detection Pipeline

pfail — Subjective Failure (Node-Local)

A node marks a peer as pfail when it cannot receive a valid response within cluster-node-timeout:

/* cluster.c — inside clusterCron(), runs every 100ms */
if (node->link &&
    now - node->link->last_recv_time > server.cluster_node_timeout &&
    now - node->ping_sent > server.cluster_node_timeout / 2) {
    /* No PONG within timeout window */
    node->flags |= CLUSTER_NODE_PFAIL;
    server.cluster->stats_pfail_nodes++;
}

pfail is purely local. No cluster-wide action is taken. The node continues normal operation; only gossip propagation carries this information to peers.

fail — Objective Failure (Cluster-Wide)

Multiple nodes independently report pfail, and one node accumulates enough reports to cross the threshold:

Propagation mechanism:
1. Node-A marks Primary-M as pfail
2. Node-A's subsequent PING/PONG heartbeats include:
   gossip entry for M: flags = CLUSTER_NODE_PFAIL

3. Node-B receives the heartbeat, records: "A reports M pfail"
   (stored in M's fail_reports list with a timestamp)

4. Any node receiving enough pfail reports checks the threshold:
   if fail_report_count >= quorum AND all reports within 2×cluster_node_timeout:
       → broadcast FAIL message for M immediately

5. FAIL broadcast:
   - Unlike Gossip, FAIL is sent to ALL known nodes directly
   - Ensures rapid propagation regardless of Gossip round timing
   - Receiving nodes immediately set CLUSTER_NODE_FAIL on M

Quorum for fail Declaration

quorum = floor(total_masters / 2) + 1

Examples:
  3 masters → quorum = 2   (majority of 3)
  5 masters → quorum = 3
  7 masters → quorum = 4

All quorum nodes must have submitted pfail reports within
the 2×cluster_node_timeout window (stale reports don't count).

Key Configuration Parameters

# redis.conf (cluster node)
cluster-node-timeout 15000           # ms; balance between sensitivity and false positives
cluster-require-full-coverage yes    # refuse all reads/writes if any slot is uncovered
cluster-slave-no-failover no         # whether replicas participate in failover
cluster-allow-reads-when-down no     # allow stale reads when cluster is degraded
cluster-migration-barrier 1          # min replicas a primary must have before
                                     # one can be migrated to an orphaned primary

20.2 Replica Election Algorithm

Trigger Conditions

A replica initiates an election only when all of the following hold:

  1. Its primary is marked fail (objective, not just pfail)
  2. The primary owns at least one slot (replicas of non-slotted primaries don't vote)
  3. The replica's data is not too stale (replication offset is recent enough)
  4. Enough time has elapsed since the last failover attempt (epoch guard)

Election Delay — Prioritizing Freshness

Replicas do not race immediately. Each waits a computed delay before sending its vote request, giving the most up-to-date replica the shortest wait:

/* cluster.c — clusterHandleSlaveFailover() */

/* my_rank = position in sorted replica list (by repl_offset descending)
   rank 0 = highest offset = freshest replica */
mstime_t delay =
    500                       /* fixed base delay */
    + (random() % 500)        /* jitter (0–500ms) to prevent ties */
    + (data_age * 0.1)        /* penalize replicas with stale data */
    + (my_rank * 1000);       /* 1 second per rank position */

/* Result:
   rank 0 (freshest): delay ≈ 500–1000ms   → fires first
   rank 1:            delay ≈ 1500–2000ms  → fires second if rank 0 fails
   rank 2:            delay ≈ 2500–3000ms  → fallback */

Concrete Example

Primary-M fails. Three replicas:
  Replica-A: offset=10,000  rank=0 → delay ≈ 750ms
  Replica-B: offset=9,500   rank=1 → delay ≈ 1,750ms
  Replica-C: offset=9,000   rank=2 → delay ≈ 2,750ms

T=750ms:  Replica-A sends FAILOVER_AUTH_REQUEST to all primaries
T=751ms:  Master-X votes for A (epoch increased, not yet voted this epoch)
T=752ms:  Master-Y votes for A
T=752ms:  A has majority → elected new primary for M's slots
T=1750ms: B's timer fires, but A has already claimed leadership → B aborts

Vote Request Protocol

1. Replica sends to each primary node:
   FAILOVER_AUTH_REQUEST
   Payload: {
     epoch: currentEpoch + 1,      (incremented before sending)
     primary_id: <failed-primary-id>,
     primary_slots: <slot-bitmap>
   }

2. Each primary evaluates:
   ✓ request epoch > own currentEpoch
   ✓ No vote cast in this epoch yet (voted_time check)
   ✓ The claimed failed primary is genuinely in fail state
   → Grant vote: send FAILOVER_AUTH_ACK with vote epoch

3. Replica counts votes:
   - Received > (total_masters / 2) votes → elected
   - Timeout: retry with epoch+1 (back to delay calculation)

Post-Election Actions

/* cluster.c */
void clusterHandleSlaveFailover(void) {
    if (server.cluster->failover_auth_count >= needed_quorum) {
        serverLog(LL_WARNING,
            "Failover election won: I'm the new master.");

        /* 1. Assume ownership of all former primary's slots */
        clusterNodeSetAsMaster(myself);

        /* 2. Update configEpoch to election epoch */
        clusterBumpConfigEpochWithoutConsensus();

        /* 3. Broadcast PONG with new slot bitmap to entire cluster */
        clusterBroadcastPong(CLUSTER_BROADCAST_ALL);

        /* 4. Stop replicating; become standalone primary */
        replicationUnsetMaster();
    }
}

20.3 Manual Failover

Three Modes

# Mode 1: Graceful (recommended for planned maintenance)
#
# Internal flow:
#   1. Replica sends MANUELFAILOVER_PAUSE_CLIENTS to primary
#   2. Primary pauses client writes
#   3. Replica waits until its offset matches primary's
#   4. Replica initiates election (guaranteed to win — all data is current)
#   5. Zero data loss
#   Prerequisite: primary must be reachable
#
redis-cli -h replica_host -p 7003 CLUSTER FAILOVER

# Mode 2: Force (primary is slow/degraded but still reachable)
#
# Skips offset synchronization — may lose a small number of recent writes.
# Replica immediately starts election without waiting to catch up.
#
redis-cli -h replica_host -p 7003 CLUSTER FAILOVER FORCE

# Mode 3: Takeover (primary is completely unreachable)
#
# Replica unilaterally promotes itself — bypasses voting entirely.
# Increments configEpoch and broadcasts new slot ownership directly.
# Risk: if the old primary recovers, configEpoch collision resolution
#       may cause unexpected slot reassignment. Use with caution.
#
redis-cli -h replica_host -p 7003 CLUSTER FAILOVER TAKEOVER

Graceful Failover Timeline

T=0:   Operator runs: CLUSTER FAILOVER (on replica)
T=0:   Replica → Primary: CLUSTERMSG_TYPE_MFSTART
T=0:   Primary pauses client connections (returns -LOADING temporarily)
T=0:   Replica polls its offset vs primary's offset (INFO replication)
T=Xs:  Replica offset == Primary offset (fully caught up)
T=Xs:  Replica initiates election; wins immediately (freshest replica)
T=Xs+1: New primary begins accepting writes; old primary demoted to replica
Total downtime for writes: typically 1–3 seconds

20.4 Online Scale-Out Playbook

Pre-Expansion Health Check

# Verify cluster is fully healthy before starting
redis-cli --cluster check 192.168.1.1:7000

# Expected:
# [OK] All nodes agree about slots configuration.
# [OK] All 16384 slots covered.

# Check current slot distribution
redis-cli --cluster info 192.168.1.1:7000
# 192.168.1.1:7000 → 15000 keys | 5461 slots | 1 slave
# 192.168.1.2:7001 → 14500 keys | 5461 slots | 1 slave
# 192.168.1.3:7002 → 15200 keys | 5462 slots | 1 slave

Step 1 — Start New Nodes

# New primary node (port 7006)
cat > /etc/redis/redis-7006.conf << 'EOF'
port 7006
bind 0.0.0.0
daemonize yes
logfile /var/log/redis/redis-7006.log
dir /var/lib/redis/7006
cluster-enabled yes
cluster-config-file /etc/redis/cluster/nodes-7006.conf
cluster-node-timeout 15000
cluster-announce-ip 192.168.1.4
cluster-announce-port 7006
cluster-announce-bus-port 17006
appendonly yes
appendfsync everysec
maxmemory 8gb
maxmemory-policy allkeys-lru
EOF

mkdir -p /var/lib/redis/7006
redis-server /etc/redis/redis-7006.conf

# Verify it started standalone (not yet in cluster)
redis-cli -p 7006 CLUSTER INFO | grep cluster_state
# cluster_state:ok  (but cluster_known_nodes:1 — only itself)

Step 2 — Add New Primary to Cluster

redis-cli --cluster add-node 192.168.1.4:7006 192.168.1.1:7000

# Expected:
# [OK] New node added correctly.

# Verify: node is visible but has 0 slots
redis-cli -c -p 7000 CLUSTER NODES | grep 7006
# <id> 192.168.1.4:7006@17006 master - 0 ... connected
# Note: no slot range at the end → 0 slots assigned

Step 3 — Reshard: Migrate Slots to New Node

# Interactive mode (recommended for first time)
redis-cli --cluster reshard 192.168.1.1:7000
# How many slots do you want to move? → 4096
# Receiving node ID? → <7006-node-id>
# Source nodes? → all

# Non-interactive (for automation)
redis-cli --cluster reshard 192.168.1.1:7000 \
    --cluster-from all \
    --cluster-to <7006-node-id> \
    --cluster-slots 4096 \
    --cluster-yes

# Monitor progress during reshard
watch -n2 "redis-cli --cluster info 192.168.1.1:7000"

Step 4 — Add Replica for New Primary

# Start the replica node (port 7007)
redis-server /etc/redis/redis-7007.conf

# Add as replica of the new primary
redis-cli --cluster add-node 192.168.1.4:7007 192.168.1.1:7000 \
    --cluster-slave \
    --cluster-master-id <7006-node-id>

# Verify replication established
redis-cli -p 7006 INFO replication
# role:master
# connected_slaves:1
# slave0:ip=192.168.1.4,port=7007,state=online,offset=...,lag=0

Step 5 — Post-Expansion Verification

# Full cluster health check
redis-cli --cluster check 192.168.1.1:7000

# Slot distribution should now show 4 primary nodes each with ~4096 slots
redis-cli --cluster info 192.168.1.1:7000
# 192.168.1.1:7000 → ... | 4096 slots | 1 slave
# 192.168.1.2:7001 → ... | 4096 slots | 1 slave
# 192.168.1.3:7002 → ... | 4096 slots | 1 slave
# 192.168.1.4:7006 → ... | 4096 slots | 1 slave  ← new node

# Monitor for MOVED redirections (clients updating their routing cache)
redis-cli -p 7000 INFO stats | grep moved_redirect

20.5 Slot Migration and ASK Protocol in Detail

Migration State Setup

# Migrating slot 100 from Node-A (7000) to Node-B (7006)
NODE_A_ID=$(redis-cli -p 7000 CLUSTER MYID)
NODE_B_ID=$(redis-cli -p 7006 CLUSTER MYID)

# Step 1: Set importing state on destination
redis-cli -p 7006 CLUSTER SETSLOT 100 IMPORTING $NODE_A_ID

# Step 2: Set migrating state on source
redis-cli -p 7000 CLUSTER SETSLOT 100 MIGRATING $NODE_B_ID

MIGRATE Command: Atomic Key Transfer

# MIGRATE atomically moves a key from source to destination:
# 1. Source executes DUMP key → serialized bytes
# 2. Source sends RESTORE key ttl serialized-value to destination
# 3. Destination confirms receipt
# 4. Source executes DEL key (only after destination confirms)

# Single key
redis-cli -p 7000 MIGRATE 192.168.1.4 7006 mykey 0 5000

# Batch (recommended: 10–100 keys at once)
redis-cli -p 7000 MIGRATE 192.168.1.4 7006 "" 0 60000 KEYS key1 key2 key3

# Replace if key already exists at destination (handles retry safety)
redis-cli -p 7000 MIGRATE 192.168.1.4 7006 "" 0 60000 REPLACE KEYS key1

Request Routing During Migration

Complete decision tree for a client request on slot 100 (being migrated):

Case 1: Client sends GET user:1 to Node-A; key HAS NOT been migrated yet
  Node-A: key exists locally → respond normally
  Result: +value  (no redirection)

Case 2: Client sends GET user:1 to Node-A; key HAS been migrated to B
  Node-A: key not found locally; slot 100 is in migrating state
  Node-A responds: -ASK 100 192.168.1.4:7006
  Client:
    → connect to Node-B
    → send ASKING  (one-shot unlock for importing slot)
    → send GET user:1
    → Node-B responds with value
  Note: client's routing cache NOT updated (migration still in progress)

Case 3: Client sends GET user:1 directly to Node-B (correct node, wrong timing)
  No ASKING flag: Node-B returns -MOVED 100 192.168.1.1:7000
    (redirects to source — key is still there)
  With ASKING flag: Node-B checks locally → returns value or nil
    (ASK indicates "you came from source, key may have been sent here")

Case 4: Key does not exist anywhere (genuinely missing)
  Node-A: not found; slot is migrating → -ASK to Node-B
  Client → ASKING → GET → Node-B returns nil
  Correct behavior: nil is the right answer

Completing the Migration

# After all keys in slot 100 have been migrated:

# Notify both source and destination that migration is complete
redis-cli -p 7000 CLUSTER SETSLOT 100 NODE $NODE_B_ID
redis-cli -p 7006 CLUSTER SETSLOT 100 NODE $NODE_B_ID

# Effect:
# Node-A: removes slot 100 from its bitmap; migrating flag cleared
# Node-B: adds slot 100 to its bitmap; importing flag cleared
# Node-B increments its configEpoch and broadcasts PONG with new slot bitmap
# Other nodes update their routing tables via Gossip

Production-Grade Migration Script

#!/bin/bash
# migrate_slot.sh — safely migrate a single slot with retry logic
set -euo pipefail

SRC_HOST=192.168.1.1
SRC_PORT=7000
DST_HOST=192.168.1.4
DST_PORT=7006
SLOT=$1
BATCH_SIZE=50
MIGRATE_TIMEOUT=60000  # 60 seconds

DST_NODE_ID=$(redis-cli -p $DST_PORT CLUSTER MYID | tr -d '\r')
SRC_NODE_ID=$(redis-cli -p $SRC_PORT CLUSTER MYID | tr -d '\r')

echo "Migrating slot $SLOT: $SRC_HOST:$SRC_PORT → $DST_HOST:$DST_PORT"

# Set migration states
redis-cli -p $DST_PORT CLUSTER SETSLOT $SLOT IMPORTING $SRC_NODE_ID
redis-cli -p $SRC_PORT CLUSTER SETSLOT $SLOT MIGRATING $DST_NODE_ID

# Migrate all keys in batches
total_migrated=0
while true; do
    KEYS=$(redis-cli -p $SRC_PORT CLUSTER GETKEYSINSLOT $SLOT $BATCH_SIZE 2>/dev/null | tr '\r' '\n')
    if [ -z "$KEYS" ]; then
        echo "Slot $SLOT: all keys migrated ($total_migrated total)"
        break
    fi

    key_count=$(echo "$KEYS" | wc -l | tr -d ' ')

    # Build MIGRATE command with KEYS list
    if redis-cli -p $SRC_PORT MIGRATE $DST_HOST $DST_PORT "" 0 $MIGRATE_TIMEOUT \
        REPLACE KEYS $KEYS 2>&1 | grep -q "ERR\|IOERR"; then
        echo "Migration error for batch in slot $SLOT; retrying..."
        sleep 1
        continue
    fi

    total_migrated=$((total_migrated + key_count))
    echo "  migrated $total_migrated keys so far..."
done

# Finalize
redis-cli -p $SRC_PORT CLUSTER SETSLOT $SLOT NODE $DST_NODE_ID
redis-cli -p $DST_PORT CLUSTER SETSLOT $SLOT NODE $DST_NODE_ID
echo "Slot $SLOT migration complete."

20.6 Scale-In (Node Removal) Procedure

# Goal: remove node 7006 (192.168.1.4:7006) from the cluster
# Prerequisite: 7006 must have 0 slots (all migrated away)

# Step 1: Move all of 7006's slots to remaining nodes
redis-cli --cluster reshard 192.168.1.1:7000 \
    --cluster-from <7006-node-id> \
    --cluster-to <7000-node-id> \
    --cluster-slots 4096 \
    --cluster-yes

# Step 2: Verify 7006 has no remaining slots
redis-cli -p 7006 CLUSTER INFO | grep cluster_slots_assigned
# cluster_slots_assigned:0

# Step 3: Remove 7006's replica first (always remove replicas before primaries)
redis-cli --cluster del-node 192.168.1.1:7000 <7007-replica-node-id>
# [OK] Node 192.168.1.4:7007 removed from cluster.

# Step 4: Remove 7006 primary node
redis-cli --cluster del-node 192.168.1.1:7000 <7006-primary-node-id>
# [OK] Node 192.168.1.4:7006 removed from cluster.

# Step 5: Shutdown the removed nodes
redis-cli -h 192.168.1.4 -p 7006 SHUTDOWN
redis-cli -h 192.168.1.4 -p 7007 SHUTDOWN

# Step 6: Final cluster health check
redis-cli --cluster check 192.168.1.1:7000

20.7 Operational Impact Analysis

Impact 1 — MIGRATE Blocks the Key

Duration: typically 0.1–5ms per key (depends on value size)
For a 10 MB string value: MIGRATE may block for 100ms+

Mitigation:
  - Batch small values (up to 100 keys per MIGRATE)
  - Migrate large values individually with longer timeout
  - Schedule migrations during low-traffic windows
  - Use OBJECT ENCODING to identify large keys before migration

Impact 2 — Client Routing Cache Updates

During migration, clients with stale routing caches receive MOVED redirections.
Each MOVED adds one extra round trip (~0.5–2ms on LAN).

Only the first request for a migrated slot causes a redirect.
After the client updates its cache, subsequent requests go directly to the
correct node with no overhead.

Monitoring MOVED redirection rate:
redis-cli -p 7000 INFO stats | grep moved_redirect
# moved_redirect:847   ← acceptable during migration; should drop after

Impact 3 — Full Sync to New Replica

When a new replica joins, it performs a full sync with its primary:
  - Full sync blocks the replica from serving reads
  - Primary forks for BGSAVE; memory may spike (COW)
  - Network bandwidth consumed for RDB transfer

Monitoring:
watch -n2 "redis-cli -p 7006 INFO replication | grep -E 'master_sync|connected_slaves'"
watch -n2 "redis-cli -p 7006 INFO memory | grep used_memory_human"

Mitigation:
  - Enable diskless sync: repl-diskless-sync yes
  - Add replicas during off-peak hours
  - Add one replica at a time (parallel full syncs compound memory pressure)

20.8 Cluster Operations Quick Reference

# Health and status
redis-cli --cluster check <host>:<port>
redis-cli --cluster info <host>:<port>
redis-cli -c -p <port> CLUSTER INFO
redis-cli -c -p <port> CLUSTER NODES

# Slot operations
redis-cli -c -p <port> CLUSTER KEYSLOT <key>
redis-cli -c -p <port> CLUSTER GETKEYSINSLOT <slot> <count>
redis-cli -c -p <port> CLUSTER SLOTS

# Rebalancing
redis-cli --cluster rebalance <host>:<port>
redis-cli --cluster rebalance <host>:<port> --cluster-use-empty-masters

# Repair
redis-cli --cluster fix <host>:<port>
redis-cli --cluster fix <host>:<port> --cluster-fix-with-unreachable-masters

# Node management
redis-cli --cluster add-node <new> <existing>
redis-cli --cluster del-node <existing> <node-id-to-remove>

# Dangerous: CLUSTER RESET clears all cluster state
redis-cli -p <port> CLUSTER RESET SOFT   # remove from cluster, keep data
redis-cli -p <port> CLUSTER RESET HARD   # remove from cluster, flush data

20.9 Failure Scenario Remediation

Scenario A — cluster_state: fail

# Diagnosis
redis-cli -p 7000 CLUSTER INFO | grep cluster_state
# cluster_state:fail

redis-cli -p 7000 CLUSTER NODES | grep " fail"
# Identify which node is in fail state and whether it has a replica

# Remediation:
# Case 1: Failed node has a replica → wait for automatic failover (~15–30s)
# Case 2: Failed node has no replica → add a new node and migrate its slots
# Case 3: Need immediate reads on partial cluster:
redis-cli -p 7000 CONFIG SET cluster-require-full-coverage no

Scenario B — Duplicate Primary for Same Slots (Split-Brain)

# Symptom: two masters claim ownership of overlapping slots
redis-cli -p 7000 CLUSTER NODES | grep master | awk '{print $1, $9, $10}'

# Resolution: configEpoch determines the winner
# The node with the higher configEpoch retains ownership;
# the lower-epoch node automatically demotes itself when it receives
# the PONG from the higher-epoch winner.

# Force recalculation if stuck:
redis-cli --cluster fix 192.168.1.1:7000

# Check configEpoch values:
redis-cli -p 7000 CLUSTER NODES | awk '{print $1, $7}'  # field 7 = config-epoch

Scenario C — High ASK Redirection Rate After Migration

# Check redirection statistics
redis-cli -p 7000 INFO stats | grep -E "ask_redirect|moved_redirect"
# ask_redirect:50000    ← very high: migration is running or clients are slow to update

# Verify migration completed on all nodes:
redis-cli -p 7000 CLUSTER NODES | grep -v "connected$"
# Any node not in 'connected' state needs attention

# Force clients to refresh their routing table:
# Most smart clients refresh automatically on MOVED errors
# If using a proxy (twemproxy, etc.): restart or trigger route refresh

Chapter Summary

Operation Core Mechanism Key Advice
Failure detection pfail (local) → Gossip → fail (cluster quorum) Tune cluster-node-timeout
Automatic failover Rank-delayed replica election Replica with highest offset wins
Manual failover CLUSTER FAILOVER [FORCE|TAKEOVER] Default (graceful) for planned ops
Scale-out add-node → reshard → add-node (replica) Off-peak; monitor memory and lag
Slot migration MIGRATE + SETSLOT + ASK protocol Batch 50–100 keys; handle REPLACE
Scale-in reshard (clear slots) → del-node → shutdown Remove replicas before primaries
Migration impact MIGRATE blocks key, MOVED on stale clients Smart clients handle transparently

Redis Cluster is the authoritative solution for workloads that outgrow a single primary's memory or write throughput. Mastering its failover algorithm, slot migration protocol, and operational playbooks is the mark of a production-ready Redis engineer.

Rate this chapter
4.8  / 5  (10 ratings)

💬 Comments