Chapter 19

Redis Cluster: 16384 Slots and Gossip Protocol

Chapter 19 — Redis Cluster: 16,384 Hash Slots and the Gossip Protocol

Redis Cluster is the official distributed-scale solution, sharding data across multiple primary nodes while maintaining high availability through automatic failover. This chapter explains why consistent hashing was rejected in favor of hash slots, where the magic number 16,384 comes from, the MOVED/ASK redirection protocol, and the Gossip messaging system that keeps every node's view of the topology eventually consistent.

19.1 Why Hash Slots Instead of Consistent Hashing

Distributed caching systems traditionally use consistent hashing to distribute keys. Redis Cluster chose a different model for concrete technical reasons.

Problems with Consistent Hashing

Problem 1: Poor key distribution with small node counts
  With 3 physical nodes mapped to a 360-degree ring:
    Node-A owns 0°–45°    → ~12.5% of keys
    Node-B owns 45°–200°  → ~43.0% of keys   ← severe imbalance
    Node-C owns 200°–360° → ~44.5% of keys

  Fix: virtual nodes (150+ per physical node)
  Cost: configuration complexity, memory overhead

Problem 2: Imprecise migration scope during rebalance
  Consistent hashing affects only the two nodes adjacent to the
  moved virtual node. The actual number of keys migrated is
  probabilistic and hard to predict or control.

Problem 3: No native concept of co-location (HashTag)
  Consistent hashing has no built-in mechanism to guarantee that
  related keys end up on the same node — required for multi-key
  operations (MGET, MULTI/EXEC, pipelines).

Hash Slot Advantages

Fixed 16,384 slots:
  - Each node owns a contiguous or arbitrary subset of slots
  - Scale-out: move slots from existing nodes to the new node (precise)
  - Scale-in: move slots away from a node, then remove it
  - Co-location: HashTag ({}) syntax forces related keys into one slot
  - Routing: any node holds the full slot-to-node mapping table

19.2 Why Exactly 16,384 Slots?

The Heartbeat Packet Constraint

Every Redis Cluster node periodically broadcasts a heartbeat carrying its slot bitmap — a bitfield indicating which slots the node owns. The size of this bitmap directly affects cluster network overhead:

16,384 slots → 16,384 bits = 2,048 bytes = 2 KB per heartbeat
65,536 slots → 65,536 bits = 8,192 bytes = 8 KB per heartbeat (4× larger)

With clusters potentially having thousands of nodes each sending heartbeats every 100 ms, the 4× size difference is significant.

Slot Granularity vs. Cluster Size

Redis Cluster officially supports up to 1,000 nodes:

16,384 slots / 1,000 nodes ≈ 16 slots per node    (adequate granularity)
65,536 slots / 1,000 nodes ≈ 65 slots per node    (overkill)

Fewer slots per node means less flexibility in rebalancing, but 16,384 provides ample granularity for any realistic deployment.

Antirez stated in GitHub issue #2576: "Normal heartbeats packets carry the full configuration of a node, that can be encoded in 2k. With 65536 it would be 8k, and this is not cool."

Slot Bitmap in Source Code

/* cluster.h */
#define CLUSTER_SLOTS 16384

typedef struct clusterNode {
    /* 2,048-byte bitfield; bit N set = this node owns slot N */
    unsigned char slots[CLUSTER_SLOTS / 8];
    int numslots;
    /* ... */
} clusterNode;

/* Check whether a node owns a given slot */
#define bitmapTestBit(b, bit) \
    ((b)[(bit)/8] & (1 << ((bit) & 7)))

19.3 Hash Slot Calculation

Formula

HASH_SLOT = CRC16(effective_key) mod 16384

Redis uses the CCITT CRC16 variant (polynomial x^16 + x^12 + x^5 + 1). The 16-bit output (0–65535) is reduced modulo 16,384.

Python Reference Implementation

def crc16_ccitt(data: bytes) -> int:
    crc = 0xFFFF
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = (crc << 1) ^ 0x1021
            else:
                crc <<= 1
    return crc & 0xFFFF

def hash_slot(key: str) -> int:
    key_bytes = key.encode()
    # HashTag extraction
    start = key_bytes.find(b'{')
    if start != -1:
        end = key_bytes.find(b'}', start + 1)
        if end != -1 and end > start + 1:
            key_bytes = key_bytes[start + 1:end]
    return crc16_ccitt(key_bytes) % 16384

# Verification
print(hash_slot("foo"))                # 12352
print(hash_slot("{user:1}:profile"))   # same slot for all {user:1} keys
print(hash_slot("{user:1}:sessions"))  # same slot
print(hash_slot("{user:1}:orders"))    # same slot

HashTag Rules

Rule: If the key contains {, and the first } follows it, and there is at
      least one character between them, only the bytes between { and }
      participate in the CRC16 hash.

Examples:
  "order:9999"           → hash("order:9999") → some slot
  "{order:9999}:items"   → hash("order:9999") → same slot as above
  "{order:9999}:status"  → hash("order:9999") → same slot

  "{}key"                → hash("{}key") (empty tag — whole key used)
  "{a}b{c}"              → hash("a") (first {} wins)

Use case: group keys for atomic multi-key operations
  MGET {user:1}:name {user:1}:age        → valid (same slot)
  MULTI + SET {order:9}:* commands       → valid
  Lua script with multiple {order:9}:*   → valid

Checking Slot Assignment

# Compute slot for a key
redis-cli -c -p 7000 CLUSTER KEYSLOT "user:1:orders"
# (integer) 10778

redis-cli -c -p 7000 CLUSTER KEYSLOT "{user:1}:orders"
# (integer) 6112

redis-cli -c -p 7000 CLUSTER KEYSLOT "{user:1}:profile"
# (integer) 6112   ← same as above

19.4 MOVED and ASK Redirections

Redis Cluster is client-side routed: nodes never transparently proxy requests to other nodes. Instead, they issue redirections that the client must follow. This design avoids bottlenecks and keeps the routing logic in clients.

MOVED — Permanent Redirection

Trigger: client requests a key whose slot has been permanently migrated

Client  →  Node-A:  GET user:1
Node-A  →  Client:  -MOVED 6112 192.168.1.3:7002

Meaning: "Slot 6112 permanently lives on 192.168.1.3:7002.
          Update your routing table and retry there."

Client behavior:
  1. Update local slot cache: slot 6112 → 192.168.1.3:7002
  2. Send GET user:1 to 192.168.1.3:7002
  3. Subsequent requests for any slot-6112 key go directly to 7002

ASK — Transient Redirection

Trigger: slot is being migrated; this specific key has already moved
         but migration is not yet complete

Client  →  Node-A:  GET user:1
Node-A  →  Client:  -ASK 6112 192.168.1.3:7002

Meaning: "This particular key has moved to 192.168.1.3:7002,
          but the slot is still in transition. Don't update your
          routing table — just follow this redirect once."

Client behavior:
  1. Do NOT update local slot cache
  2. Connect to 192.168.1.3:7002
  3. Send: ASKING   (unlocks the importing slot on Node-B)
  4. Send: GET user:1
  5. Node-B returns the result (because ASKING was sent)

MOVED vs. ASK Side-by-Side

Dimension	MOVED	ASK
When triggered	Slot permanently migrated	Slot migration in progress
Update routing table	Yes	No
Must send ASKING first	No	Yes
Semantic	"Always go there"	"Go there just this once"

Smart Client Implementation

class ClusterClient:
    def __init__(self, startup_nodes: list[tuple]):
        self.slots = {}        # slot → (host, port)
        self._refresh_slots(startup_nodes)

    def _refresh_slots(self, nodes):
        for host, port in nodes:
            try:
                r = redis.Redis(host=host, port=port)
                # CLUSTER SLOTS returns [[start, end, [ip, port, id], ...], ...]
                for entry in r.execute_command("CLUSTER SLOTS"):
                    start, end, master_info = entry[0], entry[1], entry[2]
                    master_host = master_info[0].decode()
                    master_port = master_info[1]
                    for slot in range(start, end + 1):
                        self.slots[slot] = (master_host, master_port)
                return
            except Exception:
                continue

    def execute(self, key: str, command: str, *args, max_retries=3):
        slot = hash_slot(key)
        host, port = self.slots.get(slot, (None, None))
        if not host:
            raise Exception(f"No node found for slot {slot}")

        for attempt in range(max_retries):
            r = redis.Redis(host=host, port=port)
            try:
                return r.execute_command(command, key, *args)
            except redis.ResponseError as err:
                error_str = str(err)
                if error_str.startswith("MOVED"):
                    _, slot_str, addr = error_str.split()
                    h, p = addr.split(":")
                    self.slots[int(slot_str)] = (h, int(p))  # update cache
                    host, port = h, int(p)
                elif error_str.startswith("ASK"):
                    _, slot_str, addr = error_str.split()
                    h, p = addr.split(":")
                    ask_r = redis.Redis(host=h, port=int(p))
                    ask_r.execute_command("ASKING")
                    return ask_r.execute_command(command, key, *args)
                    # Note: slot cache NOT updated for ASK
                else:
                    raise
        raise Exception(f"Max retries exceeded for key {key}")

19.5 The Gossip Protocol

Gossip is the messaging substrate that propagates cluster state changes — node additions, deletions, failures, slot migrations — across all nodes in a scalable, decentralized fashion.

Message Types

Type	Purpose	Trigger
PING	Heartbeat + piggyback state	clusterCron every 100ms
PONG	Reply to PING	On receipt of PING
MEET	New node introduction	CLUSTER MEET command
FAIL	Rapid failure broadcast	Node reaches fail threshold
PUBLISH	Cross-node Pub/Sub forwarding	Client PUBLISH command
UPDATE	Force config update	Detected stale configEpoch

PING/PONG Message Structure

/* cluster.h — simplified clusterMsg layout */
typedef struct {
    char sig[4];                        /* "RCmb" */
    uint32_t totlen;                    /* total message length */
    uint16_t ver;                       /* protocol version */
    uint16_t port;                      /* sender's data port */
    uint16_t type;                      /* message type */
    uint16_t count;                     /* gossip sections in payload */
    uint64_t currentEpoch;              /* sender's current epoch */
    uint64_t configEpoch;               /* sender's config epoch */
    uint64_t offset;                    /* sender's replication offset */
    char sender[CLUSTER_NAMELEN];       /* sender node ID (40 bytes) */
    unsigned char myslots[CLUSTER_SLOTS/8]; /* sender's slot bitmap (2 KB) */
    char slaveof[CLUSTER_NAMELEN];      /* primary's ID if this is a replica */
    /* ... padding ... */
    union clusterMsgData data;          /* gossip payloads */
} clusterMsg;

Gossip Payload: State Piggybacking

Each PING/PONG carries randomly sampled state for approximately N/10 other nodes (minimum 3):

/* cluster.h — one gossip section per sampled node */
typedef struct {
    char nodename[CLUSTER_NAMELEN];  /* sampled node's ID */
    uint32_t ping_sent;              /* when this node last pinged the sample */
    uint32_t pong_received;          /* when PONG was last received */
    char ip[NET_IP_STR_LEN];         /* sample node's IP */
    uint16_t port;                   /* sample node's data port */
    uint16_t cport;                  /* sample node's cluster bus port */
    uint16_t flags;                  /* sample node's flags */
    uint32_t notused1;
} clusterMsgDataGossip;

This random propagation ensures that a state change on any node reaches the entire cluster within O(log N) gossip rounds.

clusterCron() Execution Logic

/* cluster.c — runs every 100ms */
void clusterCron(void) {
    static long long iteration = 0;
    iteration++;

    /* Every 100ms: select nodes to PING */
    if (!(iteration % 10)) {
        /* Once per second: find the node with the longest PONG gap */
        /* Guarantees every node receives at least 1 PING per second */
    }

    /* Scan all nodes: flag pfail if no PONG within cluster-node-timeout */
    listIter li;
    listNode *ln;
    listRewind(server.cluster->nodes, &li);
    while ((ln = listNext(&li)) != NULL) {
        clusterNode *node = ln->value;
        if (node->flags & (CLUSTER_NODE_MYSELF | CLUSTER_NODE_NOADDR))
            continue;
        mstime_t now = mstime();
        if (node->link &&
            now - node->link->last_recv_time > server.cluster_node_timeout &&
            now - node->ping_sent > server.cluster_node_timeout / 2) {
            /* Mark as pfail */
            node->flags |= CLUSTER_NODE_PFAIL;
        }
    }

    /* Handle slave failover election */
    if (nodeIsSlave(myself))
        clusterHandleSlaveFailover();

    /* Flush cluster state to disk if dirty */
    if (update_state || server.cluster->todo_before_sleep)
        clusterSaveConfigOrDie(0);
}

19.6 clusterNode Key Data Structure

typedef struct clusterNode {
    mstime_t ctime;               /* creation time */
    char name[CLUSTER_NAMELEN];   /* 40-byte hex node ID (random at startup) */
    int flags;                    /* bitmask of node state */

    /* Flag values: */
    /* CLUSTER_NODE_MASTER    = (1<<0) */
    /* CLUSTER_NODE_SLAVE     = (1<<1) */
    /* CLUSTER_NODE_PFAIL     = (1<<2) — subjective failure */
    /* CLUSTER_NODE_FAIL      = (1<<3) — objective failure */
    /* CLUSTER_NODE_MYSELF    = (1<<4) */
    /* CLUSTER_NODE_HANDSHAKE = (1<<5) — TCP handshake pending */
    /* CLUSTER_NODE_NOADDR    = (1<<6) — address not yet known */
    /* CLUSTER_NODE_NOFAILOVER= (1<<7) — replica won't initiate failover */

    uint64_t configEpoch;         /* last config epoch for this node */
    unsigned char slots[CLUSTER_SLOTS/8]; /* 2,048-byte slot bitmap */
    int numslots;                 /* number of owned slots */

    int numslaves;                /* number of known replicas (if primary) */
    struct clusterNode **slaves;  /* replica pointers */
    struct clusterNode *slaveof;  /* primary pointer (if replica) */

    mstime_t ping_sent;           /* time of last PING sent to this node */
    mstime_t pong_received;       /* time of last PONG received from this node */
    mstime_t data_received;       /* time of last data from this node */
    mstime_t fail_time;           /* time this node was marked FAIL */
    mstime_t voted_time;          /* time of last vote (for duplicate vote prevention) */
    mstime_t repl_offset_time;    /* when repl_offset was last updated */
    long long repl_offset;        /* known replication offset */

    char ip[NET_IP_STR_LEN];
    int port;                     /* client-facing port */
    int cport;                    /* cluster bus port (typically port + 10000) */

    clusterLink *link;            /* active TCP connection to this node */
    list *fail_reports;           /* list of pfail reports from other nodes */
} clusterNode;

Inspecting Node State

redis-cli -c -p 7000 CLUSTER NODES
# <id> <ip:port@cport> <flags> <primary-id> <ping-sent> <pong-recv> \
#   <config-epoch> <link-state> <slot-range...>
#
# Example:
# 1a2b... 192.168.1.1:7000@17000 master - 0 1714000000000 1 connected 0-5460
# 4d5e... 192.168.1.2:7001@17001 master - 0 1714000000001 2 connected 5461-10922
# 7g8h... 192.168.1.3:7002@17002 master - 0 1714000000002 3 connected 10923-16383
# 9j0k... 192.168.1.1:7003@17003 slave 1a2b... 0 1714000000003 1 connected

redis-cli -c -p 7000 CLUSTER INFO
# cluster_enabled:1
# cluster_state:ok
# cluster_slots_assigned:16384
# cluster_slots_ok:16384
# cluster_slots_pfail:0
# cluster_slots_fail:0
# cluster_known_nodes:6
# cluster_size:3
# cluster_current_epoch:3
# cluster_my_epoch:1
# cluster_stats_messages_ping_sent:1502
# cluster_stats_messages_pong_sent:1500
# cluster_stats_messages_meet_sent:1
# cluster_stats_messages_received:3001

19.7 Cluster Topology Discovery

New Node Join Protocol

# New node 7006 starts with no cluster membership
redis-server --port 7006 \
    --cluster-enabled yes \
    --cluster-config-file nodes-7006.conf \
    --cluster-node-timeout 15000 \
    --daemonize yes

# CLUSTER MEET: introduce new node to one existing cluster member
redis-cli -p 7000 CLUSTER MEET 127.0.0.1 7006

# Internal MEET sequence:
# 1. Node-7000 → Node-7006:  MEET message
# 2. Node-7006 → Node-7000:  PONG (with own state)
# 3. Node-7006 adds Node-7000 to its known nodes list
# 4. Next heartbeat from Node-7000 includes Node-7006 state in gossip payload
# 5. Node-7001, Node-7002, ... learn about Node-7006 via Gossip
# 6. They PING Node-7006 directly; full mesh established
# Convergence time: O(log N) heartbeat rounds ≈ a few seconds

Gossip Convergence Speed

Assumptions:
  - Cluster size: N nodes
  - Each PING carries state for N/10 random nodes (min 3)
  - Heartbeat interval: 100ms (clusterCron)

Convergence rounds ≈ log(N) / log(max(N/10, 3) + 1)

Empirical data:
  N = 10 nodes  → ~3–4 rounds → ~300–400ms
  N = 100 nodes → ~5–6 rounds → ~500–600ms
  N = 1,000 nodes → ~7–8 rounds → ~700–800ms

Even at 1,000 nodes, a topology change propagates cluster-wide in under 1 second.

19.8 Minimal Cluster Setup (6 Nodes: 3 Primary + 3 Replica)

# Configuration template for port 7000
cat > /etc/redis/redis-7000.conf << 'EOF'
port 7000
bind 0.0.0.0
daemonize yes
logfile /var/log/redis/redis-7000.log
dir /var/lib/redis/7000

cluster-enabled yes
cluster-config-file /etc/redis/cluster/nodes-7000.conf
cluster-node-timeout 15000

# Required for NAT / multi-NIC / container environments
cluster-announce-ip 192.168.1.1
cluster-announce-port 7000
cluster-announce-bus-port 17000

appendonly yes
appendfsync everysec

maxmemory 8gb
maxmemory-policy allkeys-lru
hz 10
EOF

# Create cluster (--cluster-replicas 1: one replica per primary)
redis-cli --cluster create \
    192.168.1.1:7000 192.168.1.1:7001 \
    192.168.1.2:7002 192.168.1.2:7003 \
    192.168.1.3:7004 192.168.1.3:7005 \
    --cluster-replicas 1

# Expected output:
# Master[0] -> Slots 0 - 5460       (192.168.1.1:7000)
# Master[1] -> Slots 5461 - 10922   (192.168.1.2:7002)
# Master[2] -> Slots 10923 - 16383  (192.168.1.3:7004)
# Adding replica 192.168.1.2:7003 to 192.168.1.1:7000
# Adding replica 192.168.1.3:7005 to 192.168.1.2:7002
# Adding replica 192.168.1.1:7001 to 192.168.1.3:7004
# Can I set the above configuration? (type 'yes' to accept): yes

19.9 Cluster Limitations and Workarounds

Unsupported Operations

# Cross-slot MGET (keys in different slots)
MGET user:1:name user:2:name
# (error) CROSSSLOT Keys in request don't hash to the same slot

# Solution: HashTag
MGET {user:1}:name {user:1}:age  # same slot — works

# Cross-slot transactions
MULTI
SET user:1:name "Alice"    # slot A
SET order:1:status "paid"  # slot B (different node)
EXEC
# (error) CROSSSLOT

# Multi-database SELECT
SELECT 1
# (error) ERR SELECT is not allowed in cluster mode

# Lua scripts with keys in different slots
EVAL "return redis.call('SET',KEYS[1],ARGV[1])" 2 key1 key2 value
# (error) CROSSSLOT (if key1 and key2 are in different slots)

Pipeline Usage in Cluster Mode

from rediscluster import RedisCluster

rc = RedisCluster(startup_nodes=[{"host": "127.0.0.1", "port": "7000"}],
                  decode_responses=True)

# Smart cluster clients batch pipeline commands by slot internally
# Use HashTag to ensure all pipeline commands target a single slot
pipe = rc.pipeline()
pipe.set("{user:1}:name", "Alice")
pipe.set("{user:1}:age", 30)
pipe.set("{user:1}:score", 9500)
results = pipe.execute()  # all on same slot → efficient single-node pipeline

Key Design Recommendations

1. Use HashTag for all related keys in a business entity:
   {order:<id>}:items, {order:<id>}:status, {order:<id>}:meta

2. Never use HashTag for globally distributed access patterns:
   Putting all user sessions under {session}: defeats the purpose of sharding

3. Plan slot distribution before choosing HashTag strategy:
   Monitor slot key counts: CLUSTER GETKEYSINSLOT <slot> <count>
   Identify hot slots with: redis-cli --cluster info <host>:<port>

4. For cross-entity operations, consider denormalization:
   Store redundant data in the same slot rather than atomic cross-slot reads

Chapter Summary

Concept	Key Detail	Production Advice
Hash slots	CRC16(key) % 16384	Use HashTag to control placement
16,384 choice	2 KB heartbeat vs 8 KB for 65,536	—
MOVED	Permanent redirect; update routing cache	Smart client required
ASK	Transient redirect; do not update cache	Must send ASKING first
Gossip	PING/PONG carry random node state	Tune `cluster-node-timeout`
Cluster limits	No cross-slot MGET/transactions/SELECT	Design with HashTag upfront

Chapter 20 covers cluster failover mechanics, the complete online scale-out playbook, and a detailed walkthrough of slot migration under the ASK protocol.

Rate this chapter

4.6 / 5 (12 ratings)