Redis Cluster: 16384 Slots and Gossip Protocol
Chapter 19 โ Redis Cluster: 16,384 Hash Slots and the Gossip Protocol
Redis Cluster is the official distributed-scale solution, sharding data across multiple primary nodes while maintaining high availability through automatic failover. This chapter explains why consistent hashing was rejected in favor of hash slots, where the magic number 16,384 comes from, the MOVED/ASK redirection protocol, and the Gossip messaging system that keeps every node's view of the topology eventually consistent.
19.1 Why Hash Slots Instead of Consistent Hashing
Distributed caching systems traditionally use consistent hashing to distribute keys. Redis Cluster chose a different model for concrete technical reasons.
Problems with Consistent Hashing
Problem 1: Poor key distribution with small node counts
With 3 physical nodes mapped to a 360-degree ring:
Node-A owns 0ยฐโ45ยฐ โ ~12.5% of keys
Node-B owns 45ยฐโ200ยฐ โ ~43.0% of keys โ severe imbalance
Node-C owns 200ยฐโ360ยฐ โ ~44.5% of keys
Fix: virtual nodes (150+ per physical node)
Cost: configuration complexity, memory overhead
Problem 2: Imprecise migration scope during rebalance
Consistent hashing affects only the two nodes adjacent to the
moved virtual node. The actual number of keys migrated is
probabilistic and hard to predict or control.
Problem 3: No native concept of co-location (HashTag)
Consistent hashing has no built-in mechanism to guarantee that
related keys end up on the same node โ required for multi-key
operations (MGET, MULTI/EXEC, pipelines).
Hash Slot Advantages
Fixed 16,384 slots:
- Each node owns a contiguous or arbitrary subset of slots
- Scale-out: move slots from existing nodes to the new node (precise)
- Scale-in: move slots away from a node, then remove it
- Co-location: HashTag ({}) syntax forces related keys into one slot
- Routing: any node holds the full slot-to-node mapping table
19.2 Why Exactly 16,384 Slots?
The Heartbeat Packet Constraint
Every Redis Cluster node periodically broadcasts a heartbeat carrying its slot bitmap โ a bitfield indicating which slots the node owns. The size of this bitmap directly affects cluster network overhead:
16,384 slots โ 16,384 bits = 2,048 bytes = 2 KB per heartbeat
65,536 slots โ 65,536 bits = 8,192 bytes = 8 KB per heartbeat (4ร larger)
With clusters potentially having thousands of nodes each sending heartbeats every 100 ms, the 4ร size difference is significant.
Slot Granularity vs. Cluster Size
Redis Cluster officially supports up to 1,000 nodes:
16,384 slots / 1,000 nodes โ 16 slots per node (adequate granularity)
65,536 slots / 1,000 nodes โ 65 slots per node (overkill)
Fewer slots per node means less flexibility in rebalancing, but 16,384 provides ample granularity for any realistic deployment.
Antirez stated in GitHub issue #2576: "Normal heartbeats packets carry the full configuration of a node, that can be encoded in 2k. With 65536 it would be 8k, and this is not cool."
Slot Bitmap in Source Code
/* cluster.h */
#define CLUSTER_SLOTS 16384
typedef struct clusterNode {
/* 2,048-byte bitfield; bit N set = this node owns slot N */
unsigned char slots[CLUSTER_SLOTS / 8];
int numslots;
/* ... */
} clusterNode;
/* Check whether a node owns a given slot */
#define bitmapTestBit(b, bit) \
((b)[(bit)/8] & (1 << ((bit) & 7)))
19.3 Hash Slot Calculation
Formula
HASH_SLOT = CRC16(effective_key) mod 16384
Redis uses the CCITT CRC16 variant (polynomial x^16 + x^12 + x^5 + 1). The 16-bit output (0โ65535) is reduced modulo 16,384.
Python Reference Implementation
def crc16_ccitt(data: bytes) -> int:
crc = 0xFFFF
for byte in data:
crc ^= byte << 8
for _ in range(8):
if crc & 0x8000:
crc = (crc << 1) ^ 0x1021
else:
crc <<= 1
return crc & 0xFFFF
def hash_slot(key: str) -> int:
key_bytes = key.encode()
# HashTag extraction
start = key_bytes.find(b'{')
if start != -1:
end = key_bytes.find(b'}', start + 1)
if end != -1 and end > start + 1:
key_bytes = key_bytes[start + 1:end]
return crc16_ccitt(key_bytes) % 16384
# Verification
print(hash_slot("foo")) # 12352
print(hash_slot("{user:1}:profile")) # same slot for all {user:1} keys
print(hash_slot("{user:1}:sessions")) # same slot
print(hash_slot("{user:1}:orders")) # same slot
HashTag Rules
Rule: If the key contains {, and the first } follows it, and there is at
least one character between them, only the bytes between { and }
participate in the CRC16 hash.
Examples:
"order:9999" โ hash("order:9999") โ some slot
"{order:9999}:items" โ hash("order:9999") โ same slot as above
"{order:9999}:status" โ hash("order:9999") โ same slot
"{}key" โ hash("{}key") (empty tag โ whole key used)
"{a}b{c}" โ hash("a") (first {} wins)
Use case: group keys for atomic multi-key operations
MGET {user:1}:name {user:1}:age โ valid (same slot)
MULTI + SET {order:9}:* commands โ valid
Lua script with multiple {order:9}:* โ valid
Checking Slot Assignment
# Compute slot for a key
redis-cli -c -p 7000 CLUSTER KEYSLOT "user:1:orders"
# (integer) 10778
redis-cli -c -p 7000 CLUSTER KEYSLOT "{user:1}:orders"
# (integer) 6112
redis-cli -c -p 7000 CLUSTER KEYSLOT "{user:1}:profile"
# (integer) 6112 โ same as above
19.4 MOVED and ASK Redirections
Redis Cluster is client-side routed: nodes never transparently proxy requests to other nodes. Instead, they issue redirections that the client must follow. This design avoids bottlenecks and keeps the routing logic in clients.
MOVED โ Permanent Redirection
Trigger: client requests a key whose slot has been permanently migrated
Client โ Node-A: GET user:1
Node-A โ Client: -MOVED 6112 192.168.1.3:7002
Meaning: "Slot 6112 permanently lives on 192.168.1.3:7002.
Update your routing table and retry there."
Client behavior:
1. Update local slot cache: slot 6112 โ 192.168.1.3:7002
2. Send GET user:1 to 192.168.1.3:7002
3. Subsequent requests for any slot-6112 key go directly to 7002
ASK โ Transient Redirection
Trigger: slot is being migrated; this specific key has already moved
but migration is not yet complete
Client โ Node-A: GET user:1
Node-A โ Client: -ASK 6112 192.168.1.3:7002
Meaning: "This particular key has moved to 192.168.1.3:7002,
but the slot is still in transition. Don't update your
routing table โ just follow this redirect once."
Client behavior:
1. Do NOT update local slot cache
2. Connect to 192.168.1.3:7002
3. Send: ASKING (unlocks the importing slot on Node-B)
4. Send: GET user:1
5. Node-B returns the result (because ASKING was sent)
MOVED vs. ASK Side-by-Side
| Dimension | MOVED | ASK |
|---|---|---|
| When triggered | Slot permanently migrated | Slot migration in progress |
| Update routing table | Yes | No |
| Must send ASKING first | No | Yes |
| Semantic | "Always go there" | "Go there just this once" |
Smart Client Implementation
class ClusterClient:
def __init__(self, startup_nodes: list[tuple]):
self.slots = {} # slot โ (host, port)
self._refresh_slots(startup_nodes)
def _refresh_slots(self, nodes):
for host, port in nodes:
try:
r = redis.Redis(host=host, port=port)
# CLUSTER SLOTS returns [[start, end, [ip, port, id], ...], ...]
for entry in r.execute_command("CLUSTER SLOTS"):
start, end, master_info = entry[0], entry[1], entry[2]
master_host = master_info[0].decode()
master_port = master_info[1]
for slot in range(start, end + 1):
self.slots[slot] = (master_host, master_port)
return
except Exception:
continue
def execute(self, key: str, command: str, *args, max_retries=3):
slot = hash_slot(key)
host, port = self.slots.get(slot, (None, None))
if not host:
raise Exception(f"No node found for slot {slot}")
for attempt in range(max_retries):
r = redis.Redis(host=host, port=port)
try:
return r.execute_command(command, key, *args)
except redis.ResponseError as err:
error_str = str(err)
if error_str.startswith("MOVED"):
_, slot_str, addr = error_str.split()
h, p = addr.split(":")
self.slots[int(slot_str)] = (h, int(p)) # update cache
host, port = h, int(p)
elif error_str.startswith("ASK"):
_, slot_str, addr = error_str.split()
h, p = addr.split(":")
ask_r = redis.Redis(host=h, port=int(p))
ask_r.execute_command("ASKING")
return ask_r.execute_command(command, key, *args)
# Note: slot cache NOT updated for ASK
else:
raise
raise Exception(f"Max retries exceeded for key {key}")
19.5 The Gossip Protocol
Gossip is the messaging substrate that propagates cluster state changes โ node additions, deletions, failures, slot migrations โ across all nodes in a scalable, decentralized fashion.
Message Types
| Type | Purpose | Trigger |
|---|---|---|
| PING | Heartbeat + piggyback state | clusterCron every 100ms |
| PONG | Reply to PING | On receipt of PING |
| MEET | New node introduction | CLUSTER MEET command |
| FAIL | Rapid failure broadcast | Node reaches fail threshold |
| PUBLISH | Cross-node Pub/Sub forwarding | Client PUBLISH command |
| UPDATE | Force config update | Detected stale configEpoch |
PING/PONG Message Structure
/* cluster.h โ simplified clusterMsg layout */
typedef struct {
char sig[4]; /* "RCmb" */
uint32_t totlen; /* total message length */
uint16_t ver; /* protocol version */
uint16_t port; /* sender's data port */
uint16_t type; /* message type */
uint16_t count; /* gossip sections in payload */
uint64_t currentEpoch; /* sender's current epoch */
uint64_t configEpoch; /* sender's config epoch */
uint64_t offset; /* sender's replication offset */
char sender[CLUSTER_NAMELEN]; /* sender node ID (40 bytes) */
unsigned char myslots[CLUSTER_SLOTS/8]; /* sender's slot bitmap (2 KB) */
char slaveof[CLUSTER_NAMELEN]; /* primary's ID if this is a replica */
/* ... padding ... */
union clusterMsgData data; /* gossip payloads */
} clusterMsg;
Gossip Payload: State Piggybacking
Each PING/PONG carries randomly sampled state for approximately N/10 other nodes (minimum 3):
/* cluster.h โ one gossip section per sampled node */
typedef struct {
char nodename[CLUSTER_NAMELEN]; /* sampled node's ID */
uint32_t ping_sent; /* when this node last pinged the sample */
uint32_t pong_received; /* when PONG was last received */
char ip[NET_IP_STR_LEN]; /* sample node's IP */
uint16_t port; /* sample node's data port */
uint16_t cport; /* sample node's cluster bus port */
uint16_t flags; /* sample node's flags */
uint32_t notused1;
} clusterMsgDataGossip;
This random propagation ensures that a state change on any node reaches the entire cluster within O(log N) gossip rounds.
clusterCron() Execution Logic
/* cluster.c โ runs every 100ms */
void clusterCron(void) {
static long long iteration = 0;
iteration++;
/* Every 100ms: select nodes to PING */
if (!(iteration % 10)) {
/* Once per second: find the node with the longest PONG gap */
/* Guarantees every node receives at least 1 PING per second */
}
/* Scan all nodes: flag pfail if no PONG within cluster-node-timeout */
listIter li;
listNode *ln;
listRewind(server.cluster->nodes, &li);
while ((ln = listNext(&li)) != NULL) {
clusterNode *node = ln->value;
if (node->flags & (CLUSTER_NODE_MYSELF | CLUSTER_NODE_NOADDR))
continue;
mstime_t now = mstime();
if (node->link &&
now - node->link->last_recv_time > server.cluster_node_timeout &&
now - node->ping_sent > server.cluster_node_timeout / 2) {
/* Mark as pfail */
node->flags |= CLUSTER_NODE_PFAIL;
}
}
/* Handle slave failover election */
if (nodeIsSlave(myself))
clusterHandleSlaveFailover();
/* Flush cluster state to disk if dirty */
if (update_state || server.cluster->todo_before_sleep)
clusterSaveConfigOrDie(0);
}
19.6 clusterNode Key Data Structure
typedef struct clusterNode {
mstime_t ctime; /* creation time */
char name[CLUSTER_NAMELEN]; /* 40-byte hex node ID (random at startup) */
int flags; /* bitmask of node state */
/* Flag values: */
/* CLUSTER_NODE_MASTER = (1<<0) */
/* CLUSTER_NODE_SLAVE = (1<<1) */
/* CLUSTER_NODE_PFAIL = (1<<2) โ subjective failure */
/* CLUSTER_NODE_FAIL = (1<<3) โ objective failure */
/* CLUSTER_NODE_MYSELF = (1<<4) */
/* CLUSTER_NODE_HANDSHAKE = (1<<5) โ TCP handshake pending */
/* CLUSTER_NODE_NOADDR = (1<<6) โ address not yet known */
/* CLUSTER_NODE_NOFAILOVER= (1<<7) โ replica won't initiate failover */
uint64_t configEpoch; /* last config epoch for this node */
unsigned char slots[CLUSTER_SLOTS/8]; /* 2,048-byte slot bitmap */
int numslots; /* number of owned slots */
int numslaves; /* number of known replicas (if primary) */
struct clusterNode **slaves; /* replica pointers */
struct clusterNode *slaveof; /* primary pointer (if replica) */
mstime_t ping_sent; /* time of last PING sent to this node */
mstime_t pong_received; /* time of last PONG received from this node */
mstime_t data_received; /* time of last data from this node */
mstime_t fail_time; /* time this node was marked FAIL */
mstime_t voted_time; /* time of last vote (for duplicate vote prevention) */
mstime_t repl_offset_time; /* when repl_offset was last updated */
long long repl_offset; /* known replication offset */
char ip[NET_IP_STR_LEN];
int port; /* client-facing port */
int cport; /* cluster bus port (typically port + 10000) */
clusterLink *link; /* active TCP connection to this node */
list *fail_reports; /* list of pfail reports from other nodes */
} clusterNode;
Inspecting Node State
redis-cli -c -p 7000 CLUSTER NODES
# <id> <ip:port@cport> <flags> <primary-id> <ping-sent> <pong-recv> \
# <config-epoch> <link-state> <slot-range...>
#
# Example:
# 1a2b... 192.168.1.1:7000@17000 master - 0 1714000000000 1 connected 0-5460
# 4d5e... 192.168.1.2:7001@17001 master - 0 1714000000001 2 connected 5461-10922
# 7g8h... 192.168.1.3:7002@17002 master - 0 1714000000002 3 connected 10923-16383
# 9j0k... 192.168.1.1:7003@17003 slave 1a2b... 0 1714000000003 1 connected
redis-cli -c -p 7000 CLUSTER INFO
# cluster_enabled:1
# cluster_state:ok
# cluster_slots_assigned:16384
# cluster_slots_ok:16384
# cluster_slots_pfail:0
# cluster_slots_fail:0
# cluster_known_nodes:6
# cluster_size:3
# cluster_current_epoch:3
# cluster_my_epoch:1
# cluster_stats_messages_ping_sent:1502
# cluster_stats_messages_pong_sent:1500
# cluster_stats_messages_meet_sent:1
# cluster_stats_messages_received:3001
19.7 Cluster Topology Discovery
New Node Join Protocol
# New node 7006 starts with no cluster membership
redis-server --port 7006 \
--cluster-enabled yes \
--cluster-config-file nodes-7006.conf \
--cluster-node-timeout 15000 \
--daemonize yes
# CLUSTER MEET: introduce new node to one existing cluster member
redis-cli -p 7000 CLUSTER MEET 127.0.0.1 7006
# Internal MEET sequence:
# 1. Node-7000 โ Node-7006: MEET message
# 2. Node-7006 โ Node-7000: PONG (with own state)
# 3. Node-7006 adds Node-7000 to its known nodes list
# 4. Next heartbeat from Node-7000 includes Node-7006 state in gossip payload
# 5. Node-7001, Node-7002, ... learn about Node-7006 via Gossip
# 6. They PING Node-7006 directly; full mesh established
# Convergence time: O(log N) heartbeat rounds โ a few seconds
Gossip Convergence Speed
Assumptions:
- Cluster size: N nodes
- Each PING carries state for N/10 random nodes (min 3)
- Heartbeat interval: 100ms (clusterCron)
Convergence rounds โ log(N) / log(max(N/10, 3) + 1)
Empirical data:
N = 10 nodes โ ~3โ4 rounds โ ~300โ400ms
N = 100 nodes โ ~5โ6 rounds โ ~500โ600ms
N = 1,000 nodes โ ~7โ8 rounds โ ~700โ800ms
Even at 1,000 nodes, a topology change propagates cluster-wide in under 1 second.
19.8 Minimal Cluster Setup (6 Nodes: 3 Primary + 3 Replica)
# Configuration template for port 7000
cat > /etc/redis/redis-7000.conf << 'EOF'
port 7000
bind 0.0.0.0
daemonize yes
logfile /var/log/redis/redis-7000.log
dir /var/lib/redis/7000
cluster-enabled yes
cluster-config-file /etc/redis/cluster/nodes-7000.conf
cluster-node-timeout 15000
# Required for NAT / multi-NIC / container environments
cluster-announce-ip 192.168.1.1
cluster-announce-port 7000
cluster-announce-bus-port 17000
appendonly yes
appendfsync everysec
maxmemory 8gb
maxmemory-policy allkeys-lru
hz 10
EOF
# Create cluster (--cluster-replicas 1: one replica per primary)
redis-cli --cluster create \
192.168.1.1:7000 192.168.1.1:7001 \
192.168.1.2:7002 192.168.1.2:7003 \
192.168.1.3:7004 192.168.1.3:7005 \
--cluster-replicas 1
# Expected output:
# Master[0] -> Slots 0 - 5460 (192.168.1.1:7000)
# Master[1] -> Slots 5461 - 10922 (192.168.1.2:7002)
# Master[2] -> Slots 10923 - 16383 (192.168.1.3:7004)
# Adding replica 192.168.1.2:7003 to 192.168.1.1:7000
# Adding replica 192.168.1.3:7005 to 192.168.1.2:7002
# Adding replica 192.168.1.1:7001 to 192.168.1.3:7004
# Can I set the above configuration? (type 'yes' to accept): yes
19.9 Cluster Limitations and Workarounds
Unsupported Operations
# Cross-slot MGET (keys in different slots)
MGET user:1:name user:2:name
# (error) CROSSSLOT Keys in request don't hash to the same slot
# Solution: HashTag
MGET {user:1}:name {user:1}:age # same slot โ works
# Cross-slot transactions
MULTI
SET user:1:name "Alice" # slot A
SET order:1:status "paid" # slot B (different node)
EXEC
# (error) CROSSSLOT
# Multi-database SELECT
SELECT 1
# (error) ERR SELECT is not allowed in cluster mode
# Lua scripts with keys in different slots
EVAL "return redis.call('SET',KEYS[1],ARGV[1])" 2 key1 key2 value
# (error) CROSSSLOT (if key1 and key2 are in different slots)
Pipeline Usage in Cluster Mode
from rediscluster import RedisCluster
rc = RedisCluster(startup_nodes=[{"host": "127.0.0.1", "port": "7000"}],
decode_responses=True)
# Smart cluster clients batch pipeline commands by slot internally
# Use HashTag to ensure all pipeline commands target a single slot
pipe = rc.pipeline()
pipe.set("{user:1}:name", "Alice")
pipe.set("{user:1}:age", 30)
pipe.set("{user:1}:score", 9500)
results = pipe.execute() # all on same slot โ efficient single-node pipeline
Key Design Recommendations
1. Use HashTag for all related keys in a business entity:
{order:<id>}:items, {order:<id>}:status, {order:<id>}:meta
2. Never use HashTag for globally distributed access patterns:
Putting all user sessions under {session}: defeats the purpose of sharding
3. Plan slot distribution before choosing HashTag strategy:
Monitor slot key counts: CLUSTER GETKEYSINSLOT <slot> <count>
Identify hot slots with: redis-cli --cluster info <host>:<port>
4. For cross-entity operations, consider denormalization:
Store redundant data in the same slot rather than atomic cross-slot reads
Chapter Summary
| Concept | Key Detail | Production Advice |
|---|---|---|
| Hash slots | CRC16(key) % 16384 | Use HashTag to control placement |
| 16,384 choice | 2 KB heartbeat vs 8 KB for 65,536 | โ |
| MOVED | Permanent redirect; update routing cache | Smart client required |
| ASK | Transient redirect; do not update cache | Must send ASKING first |
| Gossip | PING/PONG carry random node state | Tune cluster-node-timeout |
| Cluster limits | No cross-slot MGET/transactions/SELECT | Design with HashTag upfront |
Chapter 20 covers cluster failover mechanics, the complete online scale-out playbook, and a detailed walkthrough of slot migration under the ASK protocol.