Chapter 7

Skiplist: Why Not Red-Black Tree

Chapter 7: Skiplist โ€” Why Not a Red-Black Tree

7.1 Why Skiplist Instead of Red-Black Tree

This is one of the most frequently asked Redis internals questions in engineering interviews. The choice is nuanced โ€” Redis uses both structures in the same codebase (rbtree for timer management), so the ZSet decision is deliberate, not ignorance.

7.1.1 Range Query Performance

ZSet's defining operation is ZRANGE/ZRANGEBYSCORE โ€” returning a contiguous slice of sorted elements. This is where the structural difference matters most.

Red-Black Tree range query:

RB-tree logical view:
            30 (black)
           /           \
       20 (red)       50 (black)
      /       \       /         \
  10(blk)  25(blk) 40(red)   60(blk)

ZRANGEBYSCORE 20 50 โ€” find [20, 25, 30, 40, 50]:
  Step 1: Locate lower bound (20): O(log N)
  Step 2: In-order traversal from 20 to 50:
    20 โ†’ go to parent 30 โ†’ go to left subtree of 50 โ†’ 40 โ†’ ...
    Memory access pattern: childโ†’parentโ†’child (tree-shaped jumps)
    CPU prefetch: unpredictable, cache miss rate high

Skiplist range query:

Skiplist level 0 (ground level โ€” sorted linked list):
head โ†’ [10] โ†’ [20] โ†’ [25] โ†’ [30] โ†’ [40] โ†’ [50] โ†’ [60] โ†’ NULL

ZRANGEBYSCORE 20 50:
  Step 1: Find lower bound (20): O(log N) using upper levels
  Step 2: Walk level[0] forward from 20 to 50: O(M)
    20 โ†’ 25 โ†’ 30 โ†’ 40 โ†’ 50 (sequential linked list, predictable)
    Memory access pattern: linear linked list traversal
    CPU prefetch: next pointer is predictable, prefetcher effective

For M=1000 elements in range, the skiplist's level[0] traversal is essentially sequential linked list access โ€” the CPU prefetcher predicts each ->forward pointer with high accuracy. Red-Black tree in-order traversal requires visiting ancestors before descending to the next in-order node, producing an irregular access pattern.

7.1.2 Implementation Complexity

Red-Black Tree โ€” operations required:
  Insert:  BST insert + fix-up (up to 3 rotations + recoloring)
  Delete:  BST delete + fix-up (up to 3 rotations + recoloring)
  Cases:   6+ distinct structural cases for insertion fix-up alone
           Plus symmetric cases for right-side insertions
  Pointers needed per node: left, right, parent, color

Skiplist โ€” operations required:
  Insert:  Find position at each level (top-down)
           Record update[] array (at most MAXLEVEL=32 pointers)
           Generate random level height
           Splice in new node at each level
           Update span values
  Delete:  Similar โ€” find update[], splice out, update span
  No rotations. No recoloring. No parent pointer.
  Code size: roughly 1/3 of a correct RB-tree implementation

The CLRS textbook's RB-tree implementation spans ~120 lines for just the fix-up routines. The complete Redis skiplist implementation (t_zset.c, skiplist functions) is around 200 lines including comments and handles all ZSet operations.

7.1.3 Concurrency Scalability

Redis is single-threaded for commands, but this is an architectural consideration for future evolution:

Red-Black tree rotation:
  A single rotation modifies 3 nodes simultaneously
  (parent, child, grandchild โ€” plus sibling for double rotations)
  Fine-grained locking is extremely difficult
  Lock-free RB-tree: research area, no production implementations widely used

Skiplist:
  Each level's pointer modifications are relatively independent
  Lock-free skiplists: well-studied, ConcurrentSkipListMap in Java since 2004
  Partial range lock possible: lock only affected level[0] range
  Redis IO threads and future parallelism would benefit

7.1.4 Memory Per Element

Red-Black tree node (fixed):
  left*   8 bytes
  right*  8 bytes
  parent* 8 bytes
  color   typically stored in low bit of parent pointer or separate byte
  Subtotal: 24 bytes of pointer overhead per node (minimum)

Skiplist node (probabilistic, p=0.25, MAXLEVEL=32):
  Expected level height = 1/(1-p) = 1/(1-0.25) = 1.333
  Expected forward pointers: 1.333 (one per level)
  backward pointer (level[0]): 1
  Expected pointer overhead: (1.333 + 1) ร— 8 = 18.7 bytes per node

Skiplist wins on expected pointer count (18.7 vs 24 bytes).

7.2 zskiplistNode Complete Structure

7.2.1 C Struct Definitions (t_zset.c)

/* Skiplist node for ZSet */
typedef struct zskiplistNode {
    sds ele;                         /* member string (SDS) */
    double score;                    /* sort key */
    struct zskiplistNode *backward;  /* level[0] backward pointer for ZRANGE DESC */
    struct zskiplistLevel {
        struct zskiplistNode *forward;  /* forward pointer at this level */
        unsigned long span;             /* # of level[0] nodes skipped by this pointer */
    } level[];                       /* flexible array โ€” size determined at alloc time */
} zskiplistNode;

/* Skiplist container */
typedef struct zskiplist {
    struct zskiplistNode *header, *tail;  /* sentinel header + tail pointer */
    unsigned long length;                 /* node count (excluding header sentinel) */
    int level;                            /* current maximum level in use */
} zskiplist;

7.2.2 Memory Layout Diagram (3-level skiplist, 4 nodes)

level   header(sentinel)    node A      node B      node C      node D
        score=-inf          score=10    score=20    score=30    score=40

  L3:  [head]โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บNULL
        span=4

  L2:  [head]โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บ[node C]โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บNULL
        span=3                          span=1

  L1:  [head]โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บ[node A]โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บ[node C]โ”€โ”€โ”€โ”€โ”€โ”€โ–บNULL
        span=1           span=2                     span=1

  L0:  [head]โ”€โ”€โ–บ[node A]โ”€โ”€โ–บ[node B]โ”€โ”€โ–บ[node C]โ”€โ”€โ–บ[node D]โ”€โ”€โ”€โ”€โ”€โ”€โ–บNULL
        span=1   span=1     span=1     span=1

backward:  NULLโ—„โ”€โ”€nodeAโ—„โ”€โ”€nodeBโ—„โ”€โ”€nodeCโ—„โ”€โ”€nodeD (zskiplist.tail)

Span interpretation:
  header L1 span=1: headerโ†’nodeA crossing 1 L0 node
  nodeA  L1 span=2: nodeA โ†’nodeC crossing 2 L0 nodes (nodeA itself + nodeB)
  header L2 span=3: headerโ†’nodeC crossing 3 L0 nodes

The span field is the secret behind O(log N) rank computation: accumulating span values along the search path gives you the exact rank of the found node without any additional traversal.

7.2.3 Sentinel Header Node

/* zslCreate โ€” header is pre-allocated with MAXLEVEL=32 levels */
zskiplist *zslCreate(void) {
    int j;
    zskiplist *zsl = zmalloc(sizeof(*zsl));
    zsl->level = 1;
    zsl->length = 0;
    /* Header sentinel: always present, never contains real data */
    zsl->header = zslCreateNode(ZSKIPLIST_MAXLEVEL, 0, NULL);
    for (j = 0; j < ZSKIPLIST_MAXLEVEL; j++) {
        zsl->header->level[j].forward = NULL;
        zsl->header->level[j].span = 0;
    }
    zsl->header->backward = NULL;
    zsl->tail = NULL;
    return zsl;
}

Memory cost of one sentinel header: sizeof(zskiplistNode) + 32 * sizeof(zskiplistLevel) = (8+8+8) + 32*(8+8) = 24 + 512 = 536 bytes per skiplist โ€” a fixed overhead regardless of element count.

7.3 Random Level Algorithm

7.3.1 Implementation

/* server.h */
#define ZSKIPLIST_MAXLEVEL 32    /* maximum levels: supports 2^32 elements */
#define ZSKIPLIST_P 0.25         /* promotion probability */

int zslRandomLevel(void) {
    static const int threshold = ZSKIPLIST_P * RAND_MAX;
    int level = 1;
    while (random() < threshold)   /* random() returns [0, RAND_MAX] */
        level += 1;
    return (level < ZSKIPLIST_MAXLEVEL) ? level : ZSKIPLIST_MAXLEVEL;
}

7.3.2 Probability Distribution

Level height distribution (p = 0.25):

Height  Probability          Cumulative    ~% of nodes at this height or higher
  1     (1-p)   = 0.750      0.750         100%
  2     p(1-p)  = 0.1875     0.9375         25%
  3     pยฒ(1-p) = 0.046875   0.984375        6.25%
  4     pยณ(1-p) = 0.01172    0.99609375      1.56%
  5     pโด(1-p) = 0.00293    0.999023        0.39%

Expected height: E[H] = 1/(1-p) = 1/0.75 โ‰ˆ 1.333

For N = 1,000,000 elements:
  Expected max level used: log_{1/p} N = logโ‚„(1M) โ‰ˆ 10 levels
  Expected search comparisons: (log_{1/p} N) / p + 1/(1-p) โ‰ˆ 44
  Red-black tree comparisons:  2 ร— logโ‚‚(1M) โ‰ˆ 40
  Difference: ~10% more comparisons, but better cache behavior

7.3.3 Why p=0.25 Instead of p=0.5

p=0.5 (classic Pugh skiplist):
  Expected pointers/node: 1/(1-0.5) = 2
  Search complexity: O(logโ‚‚ N)
  Memory: 2 forward pointers expected per node

p=0.25 (Redis choice):
  Expected pointers/node: 1/(1-0.25) = 1.333
  Search complexity: O(logโ‚„ N) = O(log N) with larger constant
  Memory: 1.333 forward pointers expected per node

Redis rationale for p=0.25:
  1. 33% less expected pointer memory vs p=0.5
  2. For typical ZSet sizes (100-10K), O(logโ‚„ N) is fast enough
  3. Level cap at 32: logโ‚„(2^64) = 32, handles any practical key count
  4. Pugh's original paper benchmarked p=0.25 as sweet spot for space/time

7.4 ZADD: Complete Operation Flow

7.4.1 Phase 1 โ€” Locate Insert Position

zskiplistNode *zslInsert(zskiplist *zsl, double score, sds ele) {
    zskiplistNode *update[ZSKIPLIST_MAXLEVEL];  /* predecessor at each level */
    unsigned long rank[ZSKIPLIST_MAXLEVEL];     /* rank of predecessor at each level */
    zskiplistNode *x;
    int i, level;

    x = zsl->header;
    for (i = zsl->level-1; i >= 0; i--) {
        /* rank[i] = cumulative rank at level i+1 (0 for top level) */
        rank[i] = (i == zsl->level-1) ? 0 : rank[i+1];
        while (x->level[i].forward &&
               (x->level[i].forward->score < score ||
                (x->level[i].forward->score == score &&
                 sdscmp(x->level[i].forward->ele, ele) < 0))) {
            rank[i] += x->level[i].span;  /* accumulate span = track rank */
            x = x->level[i].forward;
        }
        update[i] = x;  /* last node at level i that is < new node */
    }

7.4.2 Phase 2 โ€” Splice Node and Update Spans

    level = zslRandomLevel();

    /* If new level exceeds current max, initialize new levels */
    if (level > zsl->level) {
        for (i = zsl->level; i < level; i++) {
            rank[i] = 0;
            update[i] = zsl->header;
            update[i]->level[i].span = zsl->length;  /* span = entire list */
        }
        zsl->level = level;
    }

    x = zslCreateNode(level, score, ele);
    for (i = 0; i < level; i++) {
        x->level[i].forward = update[i]->level[i].forward;
        update[i]->level[i].forward = x;

        /*
         * Span calculation:
         *   rank[0]  = rank of update[0] (immediate predecessor at L0)
         *   rank[i]  = rank of update[i] at level i
         *
         *   x.span[i]          = update[i].old_span - (rank[0] - rank[i])
         *   update[i].span[i]  = (rank[0] - rank[i]) + 1
         */
        x->level[i].span = update[i]->level[i].span - (rank[0] - rank[i]);
        update[i]->level[i].span = (rank[0] - rank[i]) + 1;
    }

    /* Levels not touched by new node: increment their span by 1 */
    for (i = level; i < zsl->level; i++) {
        update[i]->level[i].span++;
    }

    /* Maintain backward pointer (doubly-linked at level 0) */
    x->backward = (update[0] == zsl->header) ? NULL : update[0];
    if (x->level[0].forward)
        x->level[0].forward->backward = x;
    else
        zsl->tail = x;

    zsl->length++;
    return x;
}

7.4.3 Insertion Visualization

Before insertion (zsl->level = 2):
L1: head โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บ [node20, score=20] โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บ NULL
     span=2                   span=1
L0: head โ”€โ”€โ–บ [node10] โ”€โ”€โ”€โ”€โ”€โ”€โ–บ [node20] โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บ NULL
     span=1   span=1

Insert score=15, ele="new", zslRandomLevel() returns 1:

After:
L1: head โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บ [node20] โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บ NULL
     span=3  (updated: was 2, +1 for new node in between)
L0: head โ”€โ”€โ–บ [node10] โ”€โ”€โ–บ [new,15] โ”€โ”€โ–บ [node20] โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–บ NULL
     span=1   span=1       span=1

Span arithmetic:
  update[0] = node10, rank[0] = 1
  update[1] = head,   rank[1] = 0
  new.level[0].span = node10.level[0].old_span - (rank[0]-rank[0]) = 1
  node10.level[0].span = (rank[0]-rank[0]) + 1 = 1
  head.level[1].span += 1 โ†’ 3  (level 1 not included in new node's levels)

7.5 ZRANK: O(log N) via Span Accumulation

/* O(log N) โ€” span accumulation gives rank without any additional scan */
long zslGetRank(zskiplist *zsl, double score, sds ele) {
    zskiplistNode *x;
    unsigned long rank = 0;
    int i;

    x = zsl->header;
    for (i = zsl->level-1; i >= 0; i--) {
        while (x->level[i].forward &&
               (x->level[i].forward->score < score ||
                (x->level[i].forward->score == score &&
                 sdscmp(x->level[i].forward->ele, ele) <= 0))) {
            rank += x->level[i].span;   /* each step adds the span to rank */
            x = x->level[i].forward;
        }
        if (x->ele && sdscmp(x->ele, ele) == 0)
            return rank;  /* found the node, rank is exact */
    }
    return 0;  /* not found */
}

This is the key insight that makes ZRANK O(log N): the search path itself accumulates the rank. Compare with an Augmented BST which requires maintaining subtree sizes at every node and updating them on every insertion/deletion.

Example trace (finding rank of score=30 in a 10-element skiplist):
  Start at header (rank=0)
  L3: header.span=8, forward.score=40 > 30 โ€” stop, don't advance
  L2: header.span=4, forward.score=20 < 30 โ€” advance, rank += 4 โ†’ 4
      node20.span=3, forward.score=40 > 30 โ€” stop
  L1: node20.span=2, forward.score=25 < 30 โ€” advance, rank += 2 โ†’ 6
      node25.span=1, forward.score=30 = 30 โ€” advance, rank += 1 โ†’ 7
  L0: node30.ele matches! return rank=7

7 comparisons for N=10. For N=1M: ~44 comparisons.

7.6 ZRANGE Range Query

/* Find first node in score range [min, max] */
zskiplistNode *zslFirstInRange(zskiplist *zsl, zrangespec *range) {
    zskiplistNode *x;
    int i;

    /* Quick check: if the entire skiplist is out of range */
    if (!zslIsInRange(zsl, range)) return NULL;

    x = zsl->header;
    for (i = zsl->level-1; i >= 0; i--) {
        /* Advance while not yet at or past min */
        while (x->level[i].forward &&
               !zslValueGteMin(x->level[i].forward->score, range))
            x = x->level[i].forward;
    }
    /* x is now the last node BEFORE the range */
    x = x->level[0].forward;

    /* Verify this node is within [min, max] */
    if (!x || !zslValueLteMax(x->score, range)) return NULL;
    return x;
}

/* ZRANGE main loop: O(log N) to find start, O(M) to collect results */
void addReplyZiplistRange(client *c, zskiplist *zsl, zrangespec *range,
                          long offset, long count, int withscores) {
    zskiplistNode *node = zslFirstInRange(zsl, range);
    while (node && count--) {
        if (!zslValueLteMax(node->score, range)) break;
        /* collect: node->ele, node->score */
        node = node->level[0].forward;   /* pure sequential level[0] traversal */
    }
}

Time complexity: O(log N + M) where N = total elements, M = elements in range. Space complexity: O(M) for the output.

7.7 Dual-Structure Design: Skiplist + Dict

7.7.1 The zset Struct

/* t_zset.c โ€” large ZSet uses both structures simultaneously */
typedef struct zset {
    dict *dict;       /* member โ†’ score mapping, for O(1) ZSCORE lookups */
    zskiplist *zsl;   /* score-ordered index, for O(log N) range queries */
} zset;

Operation routing:

Command Primary structure Complexity
ZSCORE member dict only O(1)
ZADD score member dict + skiplist O(log N)
ZINCRBY delta member dict (find score) + skiplist (reposition) O(log N)
ZRANK member skiplist only O(log N)
ZRANGE min max skiplist only O(log N + M)
ZCARD zsl.length O(1)
ZCOUNT min max skiplist (zslCountInRange) O(log N)

7.7.2 Shared SDS Pointer (No Copying)

/* zadd path in t_zset.c */
static int zsetAdd(robj *zobj, double score, sds ele, ...) {
    /* ... */
    zskiplistNode *znode = zslInsert(zs->zsl, score, ele);
    /* CRITICAL: dict key is the SAME sds pointer as znode->ele */
    serverAssert(dictAdd(zs->dict, ele, &znode->score) == DICT_OK);
    /* Both structures share one sds allocation โ€” no duplication */
}
Memory layout showing shared sds:

  Heap:  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
         โ”‚  SDS struct: "leaderboard_A"   โ”‚  โ† one allocation
         โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                   โ–ฒ                โ–ฒ
                   โ”‚                โ”‚
  zskiplistNode.ele         dictEntry.key
  (skiplist owns)           (dict references)

  When deleting: must remove from BOTH structures, free sds ONCE

7.8 Memory: listpack vs skiplist+dict

ZSet with 100 elements:
  member: "user:XXXXXXXX" (13 bytes), score: double (8 bytes)

listpack encoding (count โ‰ค 128 elements):
  Header: 6 bytes
  Per pair: member_elem(1+13+1=15) + score_elem(1+8+1=10) = 25 bytes
  100 pairs: 2500 bytes
  Total: ~2507 bytes โ‰ˆ 2.5 KB

skiplist + dict encoding (count > 128 or value > 64 bytes):
  Per zskiplistNode:
    zmalloc header:  8 bytes (jemalloc metadata)
    sds ele:        16 bytes (sds header 3B + 13B data + '\0' + alignment)
    score (double):  8 bytes
    backward ptr:    8 bytes
    level[] array:   expected 1.333 levels ร— 16 B/level = 21 bytes
    Subtotal:       ~61 bytes per node
  100 nodes skiplist: ~6100 bytes
  zskiplist struct: 24 bytes
  dict (100 entries, load factor ~0.78 with 128 buckets):
    buckets: 128 ร— 8 = 1024 bytes
    entries: 100 ร— 32 = 3200 bytes (key ptr + val ptr + next ptr + metadata)
  dict overhead: ~4224 bytes
  Total skiplist+dict: ~10348 bytes โ‰ˆ 10 KB

Ratio: listpack uses ~4x less memory for 100 elements.
This is precisely why the default threshold is 128 entries.

7.9 Summary

Criterion Red-Black Tree Skiplist
Range query cache behavior Poor (tree traversal) Excellent (sequential L0 scan)
Implementation complexity High (rotations, recoloring) Low (no structural rebalancing)
Expected pointers/node 3 (left, right, parent) 1.333 (p=0.25)
ZRANK operation O(log N) w/ augmentation O(log N) via span field
Lock-free potential Very difficult Well-studied (ConcurrentSkipList)
Constant factor (search) Slightly smaller Slightly larger

The skiplist's dominance in ZSet comes from a synergy of factors: range queries map to sequential linked-list traversal (cache-friendly), the span field gives O(log N) rank for free, and the code simplicity reduces bugs in a system where correctness is paramount.

Redis's choice validates a broader principle: when multiple data structure choices give the same asymptotic complexity, implementation simplicity and cache behavior often matter more than constant-factor theoretical advantages.

Rate this chapter
4.6  / 5  (57 ratings)

๐Ÿ’ฌ Comments