Skiplist: Why Not Red-Black Tree
Chapter 7: Skiplist โ Why Not a Red-Black Tree
7.1 Why Skiplist Instead of Red-Black Tree
This is one of the most frequently asked Redis internals questions in engineering interviews. The choice is nuanced โ Redis uses both structures in the same codebase (rbtree for timer management), so the ZSet decision is deliberate, not ignorance.
7.1.1 Range Query Performance
ZSet's defining operation is ZRANGE/ZRANGEBYSCORE โ returning a contiguous slice of sorted elements. This is where the structural difference matters most.
Red-Black Tree range query:
RB-tree logical view:
30 (black)
/ \
20 (red) 50 (black)
/ \ / \
10(blk) 25(blk) 40(red) 60(blk)
ZRANGEBYSCORE 20 50 โ find [20, 25, 30, 40, 50]:
Step 1: Locate lower bound (20): O(log N)
Step 2: In-order traversal from 20 to 50:
20 โ go to parent 30 โ go to left subtree of 50 โ 40 โ ...
Memory access pattern: childโparentโchild (tree-shaped jumps)
CPU prefetch: unpredictable, cache miss rate high
Skiplist range query:
Skiplist level 0 (ground level โ sorted linked list):
head โ [10] โ [20] โ [25] โ [30] โ [40] โ [50] โ [60] โ NULL
ZRANGEBYSCORE 20 50:
Step 1: Find lower bound (20): O(log N) using upper levels
Step 2: Walk level[0] forward from 20 to 50: O(M)
20 โ 25 โ 30 โ 40 โ 50 (sequential linked list, predictable)
Memory access pattern: linear linked list traversal
CPU prefetch: next pointer is predictable, prefetcher effective
For M=1000 elements in range, the skiplist's level[0] traversal is essentially sequential linked list access โ the CPU prefetcher predicts each ->forward pointer with high accuracy. Red-Black tree in-order traversal requires visiting ancestors before descending to the next in-order node, producing an irregular access pattern.
7.1.2 Implementation Complexity
Red-Black Tree โ operations required:
Insert: BST insert + fix-up (up to 3 rotations + recoloring)
Delete: BST delete + fix-up (up to 3 rotations + recoloring)
Cases: 6+ distinct structural cases for insertion fix-up alone
Plus symmetric cases for right-side insertions
Pointers needed per node: left, right, parent, color
Skiplist โ operations required:
Insert: Find position at each level (top-down)
Record update[] array (at most MAXLEVEL=32 pointers)
Generate random level height
Splice in new node at each level
Update span values
Delete: Similar โ find update[], splice out, update span
No rotations. No recoloring. No parent pointer.
Code size: roughly 1/3 of a correct RB-tree implementation
The CLRS textbook's RB-tree implementation spans ~120 lines for just the fix-up routines. The complete Redis skiplist implementation (t_zset.c, skiplist functions) is around 200 lines including comments and handles all ZSet operations.
7.1.3 Concurrency Scalability
Redis is single-threaded for commands, but this is an architectural consideration for future evolution:
Red-Black tree rotation:
A single rotation modifies 3 nodes simultaneously
(parent, child, grandchild โ plus sibling for double rotations)
Fine-grained locking is extremely difficult
Lock-free RB-tree: research area, no production implementations widely used
Skiplist:
Each level's pointer modifications are relatively independent
Lock-free skiplists: well-studied, ConcurrentSkipListMap in Java since 2004
Partial range lock possible: lock only affected level[0] range
Redis IO threads and future parallelism would benefit
7.1.4 Memory Per Element
Red-Black tree node (fixed):
left* 8 bytes
right* 8 bytes
parent* 8 bytes
color typically stored in low bit of parent pointer or separate byte
Subtotal: 24 bytes of pointer overhead per node (minimum)
Skiplist node (probabilistic, p=0.25, MAXLEVEL=32):
Expected level height = 1/(1-p) = 1/(1-0.25) = 1.333
Expected forward pointers: 1.333 (one per level)
backward pointer (level[0]): 1
Expected pointer overhead: (1.333 + 1) ร 8 = 18.7 bytes per node
Skiplist wins on expected pointer count (18.7 vs 24 bytes).
7.2 zskiplistNode Complete Structure
7.2.1 C Struct Definitions (t_zset.c)
/* Skiplist node for ZSet */
typedef struct zskiplistNode {
sds ele; /* member string (SDS) */
double score; /* sort key */
struct zskiplistNode *backward; /* level[0] backward pointer for ZRANGE DESC */
struct zskiplistLevel {
struct zskiplistNode *forward; /* forward pointer at this level */
unsigned long span; /* # of level[0] nodes skipped by this pointer */
} level[]; /* flexible array โ size determined at alloc time */
} zskiplistNode;
/* Skiplist container */
typedef struct zskiplist {
struct zskiplistNode *header, *tail; /* sentinel header + tail pointer */
unsigned long length; /* node count (excluding header sentinel) */
int level; /* current maximum level in use */
} zskiplist;
7.2.2 Memory Layout Diagram (3-level skiplist, 4 nodes)
level header(sentinel) node A node B node C node D
score=-inf score=10 score=20 score=30 score=40
L3: [head]โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโบNULL
span=4
L2: [head]โโโโโโโโโโโโโโโโโโโโโโโโโโบ[node C]โโโโโโโโโโโโโโโโโโบNULL
span=3 span=1
L1: [head]โโโโโโโโโโโบ[node A]โโโโโโโโโโโโโโโโโโโบ[node C]โโโโโโโบNULL
span=1 span=2 span=1
L0: [head]โโโบ[node A]โโโบ[node B]โโโบ[node C]โโโบ[node D]โโโโโโโบNULL
span=1 span=1 span=1 span=1
backward: NULLโโโnodeAโโโnodeBโโโnodeCโโโnodeD (zskiplist.tail)
Span interpretation:
header L1 span=1: headerโnodeA crossing 1 L0 node
nodeA L1 span=2: nodeA โnodeC crossing 2 L0 nodes (nodeA itself + nodeB)
header L2 span=3: headerโnodeC crossing 3 L0 nodes
The span field is the secret behind O(log N) rank computation: accumulating span values along the search path gives you the exact rank of the found node without any additional traversal.
7.2.3 Sentinel Header Node
/* zslCreate โ header is pre-allocated with MAXLEVEL=32 levels */
zskiplist *zslCreate(void) {
int j;
zskiplist *zsl = zmalloc(sizeof(*zsl));
zsl->level = 1;
zsl->length = 0;
/* Header sentinel: always present, never contains real data */
zsl->header = zslCreateNode(ZSKIPLIST_MAXLEVEL, 0, NULL);
for (j = 0; j < ZSKIPLIST_MAXLEVEL; j++) {
zsl->header->level[j].forward = NULL;
zsl->header->level[j].span = 0;
}
zsl->header->backward = NULL;
zsl->tail = NULL;
return zsl;
}
Memory cost of one sentinel header: sizeof(zskiplistNode) + 32 * sizeof(zskiplistLevel) = (8+8+8) + 32*(8+8) = 24 + 512 = 536 bytes per skiplist โ a fixed overhead regardless of element count.
7.3 Random Level Algorithm
7.3.1 Implementation
/* server.h */
#define ZSKIPLIST_MAXLEVEL 32 /* maximum levels: supports 2^32 elements */
#define ZSKIPLIST_P 0.25 /* promotion probability */
int zslRandomLevel(void) {
static const int threshold = ZSKIPLIST_P * RAND_MAX;
int level = 1;
while (random() < threshold) /* random() returns [0, RAND_MAX] */
level += 1;
return (level < ZSKIPLIST_MAXLEVEL) ? level : ZSKIPLIST_MAXLEVEL;
}
7.3.2 Probability Distribution
Level height distribution (p = 0.25):
Height Probability Cumulative ~% of nodes at this height or higher
1 (1-p) = 0.750 0.750 100%
2 p(1-p) = 0.1875 0.9375 25%
3 pยฒ(1-p) = 0.046875 0.984375 6.25%
4 pยณ(1-p) = 0.01172 0.99609375 1.56%
5 pโด(1-p) = 0.00293 0.999023 0.39%
Expected height: E[H] = 1/(1-p) = 1/0.75 โ 1.333
For N = 1,000,000 elements:
Expected max level used: log_{1/p} N = logโ(1M) โ 10 levels
Expected search comparisons: (log_{1/p} N) / p + 1/(1-p) โ 44
Red-black tree comparisons: 2 ร logโ(1M) โ 40
Difference: ~10% more comparisons, but better cache behavior
7.3.3 Why p=0.25 Instead of p=0.5
p=0.5 (classic Pugh skiplist):
Expected pointers/node: 1/(1-0.5) = 2
Search complexity: O(logโ N)
Memory: 2 forward pointers expected per node
p=0.25 (Redis choice):
Expected pointers/node: 1/(1-0.25) = 1.333
Search complexity: O(logโ N) = O(log N) with larger constant
Memory: 1.333 forward pointers expected per node
Redis rationale for p=0.25:
1. 33% less expected pointer memory vs p=0.5
2. For typical ZSet sizes (100-10K), O(logโ N) is fast enough
3. Level cap at 32: logโ(2^64) = 32, handles any practical key count
4. Pugh's original paper benchmarked p=0.25 as sweet spot for space/time
7.4 ZADD: Complete Operation Flow
7.4.1 Phase 1 โ Locate Insert Position
zskiplistNode *zslInsert(zskiplist *zsl, double score, sds ele) {
zskiplistNode *update[ZSKIPLIST_MAXLEVEL]; /* predecessor at each level */
unsigned long rank[ZSKIPLIST_MAXLEVEL]; /* rank of predecessor at each level */
zskiplistNode *x;
int i, level;
x = zsl->header;
for (i = zsl->level-1; i >= 0; i--) {
/* rank[i] = cumulative rank at level i+1 (0 for top level) */
rank[i] = (i == zsl->level-1) ? 0 : rank[i+1];
while (x->level[i].forward &&
(x->level[i].forward->score < score ||
(x->level[i].forward->score == score &&
sdscmp(x->level[i].forward->ele, ele) < 0))) {
rank[i] += x->level[i].span; /* accumulate span = track rank */
x = x->level[i].forward;
}
update[i] = x; /* last node at level i that is < new node */
}
7.4.2 Phase 2 โ Splice Node and Update Spans
level = zslRandomLevel();
/* If new level exceeds current max, initialize new levels */
if (level > zsl->level) {
for (i = zsl->level; i < level; i++) {
rank[i] = 0;
update[i] = zsl->header;
update[i]->level[i].span = zsl->length; /* span = entire list */
}
zsl->level = level;
}
x = zslCreateNode(level, score, ele);
for (i = 0; i < level; i++) {
x->level[i].forward = update[i]->level[i].forward;
update[i]->level[i].forward = x;
/*
* Span calculation:
* rank[0] = rank of update[0] (immediate predecessor at L0)
* rank[i] = rank of update[i] at level i
*
* x.span[i] = update[i].old_span - (rank[0] - rank[i])
* update[i].span[i] = (rank[0] - rank[i]) + 1
*/
x->level[i].span = update[i]->level[i].span - (rank[0] - rank[i]);
update[i]->level[i].span = (rank[0] - rank[i]) + 1;
}
/* Levels not touched by new node: increment their span by 1 */
for (i = level; i < zsl->level; i++) {
update[i]->level[i].span++;
}
/* Maintain backward pointer (doubly-linked at level 0) */
x->backward = (update[0] == zsl->header) ? NULL : update[0];
if (x->level[0].forward)
x->level[0].forward->backward = x;
else
zsl->tail = x;
zsl->length++;
return x;
}
7.4.3 Insertion Visualization
Before insertion (zsl->level = 2):
L1: head โโโโโโโโโโโโโโโโโโบ [node20, score=20] โโโโโโโโโโโบ NULL
span=2 span=1
L0: head โโโบ [node10] โโโโโโโบ [node20] โโโโโโโโโโโโโโโโโโโบ NULL
span=1 span=1
Insert score=15, ele="new", zslRandomLevel() returns 1:
After:
L1: head โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโบ [node20] โโโโโโโโโโบ NULL
span=3 (updated: was 2, +1 for new node in between)
L0: head โโโบ [node10] โโโบ [new,15] โโโบ [node20] โโโโโโโโโโบ NULL
span=1 span=1 span=1
Span arithmetic:
update[0] = node10, rank[0] = 1
update[1] = head, rank[1] = 0
new.level[0].span = node10.level[0].old_span - (rank[0]-rank[0]) = 1
node10.level[0].span = (rank[0]-rank[0]) + 1 = 1
head.level[1].span += 1 โ 3 (level 1 not included in new node's levels)
7.5 ZRANK: O(log N) via Span Accumulation
/* O(log N) โ span accumulation gives rank without any additional scan */
long zslGetRank(zskiplist *zsl, double score, sds ele) {
zskiplistNode *x;
unsigned long rank = 0;
int i;
x = zsl->header;
for (i = zsl->level-1; i >= 0; i--) {
while (x->level[i].forward &&
(x->level[i].forward->score < score ||
(x->level[i].forward->score == score &&
sdscmp(x->level[i].forward->ele, ele) <= 0))) {
rank += x->level[i].span; /* each step adds the span to rank */
x = x->level[i].forward;
}
if (x->ele && sdscmp(x->ele, ele) == 0)
return rank; /* found the node, rank is exact */
}
return 0; /* not found */
}
This is the key insight that makes ZRANK O(log N): the search path itself accumulates the rank. Compare with an Augmented BST which requires maintaining subtree sizes at every node and updating them on every insertion/deletion.
Example trace (finding rank of score=30 in a 10-element skiplist):
Start at header (rank=0)
L3: header.span=8, forward.score=40 > 30 โ stop, don't advance
L2: header.span=4, forward.score=20 < 30 โ advance, rank += 4 โ 4
node20.span=3, forward.score=40 > 30 โ stop
L1: node20.span=2, forward.score=25 < 30 โ advance, rank += 2 โ 6
node25.span=1, forward.score=30 = 30 โ advance, rank += 1 โ 7
L0: node30.ele matches! return rank=7
7 comparisons for N=10. For N=1M: ~44 comparisons.
7.6 ZRANGE Range Query
/* Find first node in score range [min, max] */
zskiplistNode *zslFirstInRange(zskiplist *zsl, zrangespec *range) {
zskiplistNode *x;
int i;
/* Quick check: if the entire skiplist is out of range */
if (!zslIsInRange(zsl, range)) return NULL;
x = zsl->header;
for (i = zsl->level-1; i >= 0; i--) {
/* Advance while not yet at or past min */
while (x->level[i].forward &&
!zslValueGteMin(x->level[i].forward->score, range))
x = x->level[i].forward;
}
/* x is now the last node BEFORE the range */
x = x->level[0].forward;
/* Verify this node is within [min, max] */
if (!x || !zslValueLteMax(x->score, range)) return NULL;
return x;
}
/* ZRANGE main loop: O(log N) to find start, O(M) to collect results */
void addReplyZiplistRange(client *c, zskiplist *zsl, zrangespec *range,
long offset, long count, int withscores) {
zskiplistNode *node = zslFirstInRange(zsl, range);
while (node && count--) {
if (!zslValueLteMax(node->score, range)) break;
/* collect: node->ele, node->score */
node = node->level[0].forward; /* pure sequential level[0] traversal */
}
}
Time complexity: O(log N + M) where N = total elements, M = elements in range.
Space complexity: O(M) for the output.
7.7 Dual-Structure Design: Skiplist + Dict
7.7.1 The zset Struct
/* t_zset.c โ large ZSet uses both structures simultaneously */
typedef struct zset {
dict *dict; /* member โ score mapping, for O(1) ZSCORE lookups */
zskiplist *zsl; /* score-ordered index, for O(log N) range queries */
} zset;
Operation routing:
| Command | Primary structure | Complexity |
|---|---|---|
ZSCORE member |
dict only | O(1) |
ZADD score member |
dict + skiplist | O(log N) |
ZINCRBY delta member |
dict (find score) + skiplist (reposition) | O(log N) |
ZRANK member |
skiplist only | O(log N) |
ZRANGE min max |
skiplist only | O(log N + M) |
ZCARD |
zsl.length | O(1) |
ZCOUNT min max |
skiplist (zslCountInRange) | O(log N) |
7.7.2 Shared SDS Pointer (No Copying)
/* zadd path in t_zset.c */
static int zsetAdd(robj *zobj, double score, sds ele, ...) {
/* ... */
zskiplistNode *znode = zslInsert(zs->zsl, score, ele);
/* CRITICAL: dict key is the SAME sds pointer as znode->ele */
serverAssert(dictAdd(zs->dict, ele, &znode->score) == DICT_OK);
/* Both structures share one sds allocation โ no duplication */
}
Memory layout showing shared sds:
Heap: โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ SDS struct: "leaderboard_A" โ โ one allocation
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โฒ โฒ
โ โ
zskiplistNode.ele dictEntry.key
(skiplist owns) (dict references)
When deleting: must remove from BOTH structures, free sds ONCE
7.8 Memory: listpack vs skiplist+dict
ZSet with 100 elements:
member: "user:XXXXXXXX" (13 bytes), score: double (8 bytes)
listpack encoding (count โค 128 elements):
Header: 6 bytes
Per pair: member_elem(1+13+1=15) + score_elem(1+8+1=10) = 25 bytes
100 pairs: 2500 bytes
Total: ~2507 bytes โ 2.5 KB
skiplist + dict encoding (count > 128 or value > 64 bytes):
Per zskiplistNode:
zmalloc header: 8 bytes (jemalloc metadata)
sds ele: 16 bytes (sds header 3B + 13B data + '\0' + alignment)
score (double): 8 bytes
backward ptr: 8 bytes
level[] array: expected 1.333 levels ร 16 B/level = 21 bytes
Subtotal: ~61 bytes per node
100 nodes skiplist: ~6100 bytes
zskiplist struct: 24 bytes
dict (100 entries, load factor ~0.78 with 128 buckets):
buckets: 128 ร 8 = 1024 bytes
entries: 100 ร 32 = 3200 bytes (key ptr + val ptr + next ptr + metadata)
dict overhead: ~4224 bytes
Total skiplist+dict: ~10348 bytes โ 10 KB
Ratio: listpack uses ~4x less memory for 100 elements.
This is precisely why the default threshold is 128 entries.
7.9 Summary
| Criterion | Red-Black Tree | Skiplist |
|---|---|---|
| Range query cache behavior | Poor (tree traversal) | Excellent (sequential L0 scan) |
| Implementation complexity | High (rotations, recoloring) | Low (no structural rebalancing) |
| Expected pointers/node | 3 (left, right, parent) | 1.333 (p=0.25) |
| ZRANK operation | O(log N) w/ augmentation | O(log N) via span field |
| Lock-free potential | Very difficult | Well-studied (ConcurrentSkipList) |
| Constant factor (search) | Slightly smaller | Slightly larger |
The skiplist's dominance in ZSet comes from a synergy of factors: range queries map to sequential linked-list traversal (cache-friendly), the span field gives O(log N) rank for free, and the code simplicity reduces bugs in a system where correctness is paramount.
Redis's choice validates a broader principle: when multiple data structure choices give the same asymptotic complexity, implementation simplicity and cache behavior often matter more than constant-factor theoretical advantages.