Chapter 9

Hash Table and Incremental Rehash

Chapter 9: Hash Table and Incremental Rehashing

9.1 The Three-Layer dict Structure

Redis's dict is the foundation of the entire system โ€” the main key-value database is itself a dict. Understanding its structure is essential for understanding Redis's performance characteristics, memory behavior, and correctness guarantees.

9.1.1 C Struct Definitions (dict.h)

/* Hash table entry โ€” stores one key-value pair */
typedef struct dictEntry {
    void *key;               /* key (any type, accessed via dictType function pointers) */
    union {
        void *val;           /* generic value pointer */
        uint64_t u64;        /* unsigned integer (avoids malloc for integer values) */
        int64_t s64;         /* signed integer */
        double d;            /* double-precision float */
    } v;
    struct dictEntry *next;  /* chaining: next entry in the same bucket */
    void *metadata[];        /* flexible array for extended metadata (size via dictType) */
} dictEntry;

/* dictType โ€” function pointer table enabling type-generic behavior */
typedef struct dictType {
    uint64_t (*hashFunction)(const void *key);
    void *(*keyDup)(dict *d, const void *key);
    void *(*valDup)(dict *d, const void *obj);
    int (*keyCompare)(dict *d, const void *key1, const void *key2);
    void (*keyDestructor)(dict *d, void *key);
    void (*valDestructor)(dict *d, void *obj);
    int (*expandAllowed)(size_t moreMem, double usedRatio);
    size_t (*dictEntryMetadataBytes)(dict *d);  /* extra bytes per entry */
} dictType;

/* Redis 7.0+ โ€” dictht fields inlined into dict (eliminates one indirection) */
typedef struct dict {
    dictType *type;               /* type-specific function table */
    dictEntry **ht_table[2];      /* two bucket arrays: ht_table[0] active, ht_table[1] during rehash */
    unsigned long ht_used[2];     /* entry count in each table */
    long rehashidx;               /* -1 = not rehashing; >=0 = index of bucket being migrated */
    int16_t pauserehash;          /* >0 = rehash paused (safe iterator held) */
    signed char ht_size_exp[2];   /* exponent: bucket array size = 1 << ht_size_exp */
} dict;

Evolution note: Before Redis 7.0, there was a separate dictht struct. Inlining it into dict saves one pointer dereference per hash operation.

9.1.2 Memory Layout

dict struct (~64 bytes on heap):
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  type ptr 8B โ”‚ ht_table[0] ptr 8โ”‚ ht_table[1] ptr 8โ”‚ ht_used 16B โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ rehashidx 8B โ”‚ pauserehash 2B โ”‚ ht_size_exp[2] 2B โ”‚ padding     โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
       โ”‚                โ”‚                  โ”‚
       โ”‚                โ–ผ                  โ–ผ (during rehash only)
       โ”‚   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
       โ”‚   โ”‚  Bucket array ht[0]  โ”‚   โ”‚  Bucket array ht[1]  โ”‚
       โ”‚   โ”‚  [0] โ†’ entry โ†’ entry โ”‚   โ”‚  [0] โ†’ entry         โ”‚
       โ”‚   โ”‚  [1] โ†’ NULL          โ”‚   โ”‚  [1] โ†’ NULL          โ”‚
       โ”‚   โ”‚  [2] โ†’ entry         โ”‚   โ”‚  [2] โ†’ entry โ†’ entry โ”‚
       โ”‚   โ”‚  ...                 โ”‚   โ”‚  ...                 โ”‚
       โ”‚   โ”‚  [2^exp-1] โ†’ NULL    โ”‚   โ”‚  [2^exp-1] โ†’ NULL    โ”‚
       โ”‚   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
       โ–ผ
  dictType (function table โ€” shared across all dicts of same type)

dictEntry layout (chaining collision resolution):
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  key 8B  โ”‚  val union 8B  โ”‚  next 8B โ”‚  metadata[]  โ”‚
โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
     โ”‚                           โ”‚
     โ–ผ                           โ–ผ
  SDS key (or integer)    next entry in bucket chain (or NULL)

9.2 Hash Function: SipHash-1-2

9.2.1 Why Cryptographic Hash Function?

Redis 4.0 and earlier used MurmurHash2 โ€” fast but not adversarially secure. An attacker who knows the hash function can craft inputs that all hash to the same bucket, turning O(1) lookups into O(n) scans.

Hash Flooding Attack (Hash DoS) โ€” real incidents:
  2011: Python, PHP, Ruby, Java all vulnerable when exposed via web params
  Attack: POST 50,000 crafted form fields โ†’ all hash to bucket 0
          Server spends O(Nยฒ) time processing what should be O(N)
          CPU: 100%, response time: seconds instead of milliseconds

Redis exposure:
  HSET mykey field1 val1 field2 val2 ... (crafted field names)
  All fields land in the same bucket โ†’ O(n) per HSET โ†’ Redis blocked

9.2.2 SipHash-1-2 Implementation

/* siphash.c โ€” embedded in Redis source */
uint64_t siphash(const uint8_t *in, const size_t inlen, const uint8_t *k) {
    /* Initialization with magic constants XOR'd with 128-bit key */
    uint64_t v0 = UINT64_C(0x736f6d6570736575) ^ k0;
    uint64_t v1 = UINT64_C(0x646f72616e646f6d) ^ k1;
    uint64_t v2 = UINT64_C(0x6c7967656e657261) ^ k0;
    uint64_t v3 = UINT64_C(0x7465646279746573) ^ k1;

    /* Process 8-byte blocks: c=1 compression rounds */
    for each 8-byte block m:
        v3 ^= m;
        SIPROUND;    /* one SipRound: 4 ARX (add-rotate-xor) operations */
        v0 ^= m;

    /* Finalization: d=2 finalization rounds */
    v2 ^= 0xff;
    SIPROUND; SIPROUND;
    return v0 ^ v1 ^ v2 ^ v3;
}

/* Each SIPROUND:
   v0 += v1;  v1 = ROTL(v1,13);  v1 ^= v0;  v0 = ROTL(v0,32);
   v2 += v3;  v3 = ROTL(v3,16);  v3 ^= v2;
   v0 += v3;  v3 = ROTL(v3,21);  v3 ^= v0;
   v2 += v1;  v1 = ROTL(v1,17);  v1 ^= v2;  v2 = ROTL(v2,32);
*/

Redis uses a process-wide random 16-byte key, generated at startup:

/* server.c โ€” startup initialization */
uint8_t hashseed[16];
getRandomBytes(hashseed, sizeof(hashseed));
dictSetHashFunctionSeed(hashseed);

Each Redis process restart generates a new seed. An attacker cannot predict the hash distribution without knowing the seed, making crafted-collision attacks infeasible.

9.2.3 Performance Cost of SipHash

Benchmark: hashing 32-byte key, single call

SipHash-1-2:  ~12 ns/call
MurmurHash2:  ~5 ns/call
xxHash64:     ~3 ns/call

Overhead of SipHash vs MurmurHash: ~7 ns per hash operation

At 350,000 GET/s:
  350,000 ร— 7ns = 2.45ms/s of extra CPU time
  = 0.245% of a single core

Security benefit: prevents Hash DoS entirely.
Cost: negligible (< 0.25% CPU overhead).

9.3 Expansion Mechanism

9.3.1 Expansion Trigger Conditions

/* dict.c */
static int _dictExpandIfNeeded(dict *d) {
    if (dictIsRehashing(d)) return DICT_OK;  /* already expanding */

    /* Empty table: initialize to minimum size (4 buckets) */
    if (DICTHT_SIZE(d->ht_size_exp[0]) == 0)
        return dictExpand(d, DICT_HT_INITIAL_SIZE);

    /*
     * Two thresholds based on dict_can_resize:
     *
     * DICT_RESIZE_ENABLE (normal operation):
     *   Expand when load_factor >= 1 (used >= bucket_count)
     *   This keeps average chain length โ‰ค 2
     *
     * DICT_RESIZE_AVOID (during fork/persistence):
     *   Only expand when load_factor >= 5 (used >= 5 ร— bucket_count)
     *   Avoids COW memory explosion during BGSAVE
     *
     * DICT_RESIZE_FORBID:
     *   Never expand (rare; forced by external code)
     */
    if ((dict_can_resize == DICT_RESIZE_ENABLE &&
         d->ht_used[0] >= DICTHT_SIZE(d->ht_size_exp[0])) ||
        (dict_can_resize != DICT_RESIZE_FORBID &&
         d->ht_used[0] / DICTHT_SIZE(d->ht_size_exp[0]) > dict_force_resize_ratio))
    {
        return dictExpand(d, d->ht_used[0] + 1);
    }
    return DICT_OK;
}

Why two thresholds?

During BGSAVE or BGREWRITEAOF, Redis calls fork():
  Parent process: continues serving requests (writes trigger COW)
  Child process:  reads memory snapshot for persistence

If parent triggers hashtable expansion during fork:
  - New bucket array allocated: entirely NEW memory pages
  - Old entries migrated to new buckets: each migrated entry modifies a page
  - Each modified page triggers COW: OS duplicates the 4KB page
  - 1M entries ร— ~2 COW pages each = 2M extra pages = ~8GB extra memory!

Solution: raise load factor threshold to 5 during persistence
  - Allow table to become denser (slower, but no COW explosion)
  - After persistence completes: resume normal threshold (expand immediately)

dict_can_resize state machine:
  Normal:           DICT_RESIZE_ENABLE   (threshold = 1)
  During BGSAVE:    DICT_RESIZE_AVOID    (threshold = 5)
  Explicit disable: DICT_RESIZE_FORBID   (never expand)

9.3.2 New Capacity Calculation

/* Find smallest power of 2 >= size */
static signed char _dictNextExp(unsigned long size) {
    unsigned char e = DICT_HT_INITIAL_EXP;  /* = 2, meaning size=4 */
    if (size >= LONG_MAX) return (8*sizeof(long)-1);
    while (1) {
        if ((1ul << e) >= size) return e;
        e++;
    }
}

/* dictExpand โ€” allocate ht[1] and begin incremental rehash */
int dictExpand(dict *d, unsigned long size) {
    if (dictIsRehashing(d) || d->ht_used[0] > size) return DICT_ERR;

    signed char new_ht_size_exp = _dictNextExp(size);
    size_t newsize = 1ul << new_ht_size_exp;

    if (newsize < size) return DICT_ERR;   /* overflow check */

    /* Allocate new bucket array as ht[1] (zero-initialized) */
    dictEntry **new_ht_table = zcalloc(newsize * sizeof(dictEntry*));
    unsigned long new_ht_used = 0;

    d->ht_table[1] = new_ht_table;
    d->ht_used[1] = new_ht_used;
    d->ht_size_exp[1] = new_ht_size_exp;
    d->rehashidx = 0;  /* mark: rehash begins at bucket 0 */
    return DICT_OK;
}

/* Example expansion progression:
   4 entries โ†’ used=4, size=4 โ†’ expand to size=8
   8 entries โ†’ used=8, size=8 โ†’ expand to size=16
   ...
   N entries โ†’ expand to next power of 2 >= N*2
*/

9.4 Incremental Rehashing โ€” The Core Design

9.4.1 Why Incremental?

Consider: 1,000,000-entry dict needs to expand from 1M to 2M buckets

Naive approach (all-at-once rehash):
  1. Allocate 2M-bucket array
  2. Iterate all 1M entries in ht[0]
  3. Compute new bucket index: h & (2M-1) instead of h & (1M-1)
  4. Insert each entry into ht[1]
  5. Free ht[0], swap ht[1] โ†’ ht[0]
  Time: ~100-500ms (depends on hardware, cache state)
  Impact: Redis blocked for 100-500ms โ€” P99 spike from 0.5ms to 500ms!

Incremental rehash approach:
  - Migrate 1 bucket per client command (amortized into normal operations)
  - Migrate 100 buckets per serverCron tick (100ms interval, max 1ms total)
  - Total migration time: spread over seconds, no visible latency spike
  - While migrating: both ht[0] and ht[1] are live, both searched

9.4.2 One-Step Rehash

/* dict.c โ€” migrate n buckets (one step of incremental rehash) */
int dictRehash(dict *d, int n) {
    int empty_visits = n * 10;  /* bound on empty buckets visited: prevents long loops */
    if (!dictIsRehashing(d)) return 0;

    while (n-- && d->ht_used[0] != 0) {
        dictEntry *de, *nextde;

        /* Skip empty buckets (but limit how many we skip) */
        assert(DICTHT_SIZE(d->ht_size_exp[0]) > (unsigned long)d->rehashidx);
        while (d->ht_table[0][d->rehashidx] == NULL) {
            d->rehashidx++;
            if (--empty_visits == 0) return 1;
        }

        /* Migrate entire chain at d->rehashidx to ht[1] */
        de = d->ht_table[0][d->rehashidx];
        while (de) {
            uint64_t h;
            nextde = de->next;
            /* Compute new bucket index in ht[1] */
            h = dictHashKey(d, de->key) & DICTHT_SIZE_MASK(d->ht_size_exp[1]);
            /* Insert at head of ht[1] bucket (O(1)) */
            de->next = d->ht_table[1][h];
            d->ht_table[1][h] = de;
            d->ht_used[0]--;
            d->ht_used[1]++;
            de = nextde;
        }
        d->ht_table[0][d->rehashidx] = NULL;  /* mark bucket as migrated */
        d->rehashidx++;
    }

    /* Check if rehash is complete */
    if (d->ht_used[0] == 0) {
        zfree(d->ht_table[0]);
        /* Swap: ht[1] becomes the new ht[0] */
        d->ht_table[0]  = d->ht_table[1];
        d->ht_used[0]   = d->ht_used[1];
        d->ht_size_exp[0] = d->ht_size_exp[1];
        _dictReset(d, 1);   /* clear ht[1] fields */
        d->rehashidx = -1;  /* done! */
        return 0;
    }
    return 1;  /* more work to do */
}

9.4.3 Background Migration via serverCron

/* server.c โ€” called every serverCron tick (default: every 100ms) */
void databasesCron(void) {
    /* ... */
    if (server.activerehashing) {
        for (j = 0; j < dbs_per_call; j++) {
            if (incrementallyRehash(rehash_db)) break;
        }
    }
}

/* Rehash for at most 1ms per db per tick */
int incrementallyRehash(int dbid) {
    if (dictIsRehashing(server.db[dbid].dict)) {
        dictRehashMilliseconds(server.db[dbid].dict, 1);
        return 1;  /* there's work to do, call again next tick */
    }
    if (dictIsRehashing(server.db[dbid].expires)) {
        dictRehashMilliseconds(server.db[dbid].expires, 1);
        return 1;
    }
    return 0;
}

int dictRehashMilliseconds(dict *d, int ms) {
    if (d->pauserehash > 0) return 0;
    long long start = timeInMilliseconds();
    int rehashes = 0;
    while (dictRehash(d, 100)) {     /* migrate 100 buckets per call */
        rehashes += 100;
        if (timeInMilliseconds() - start > ms) break;
    }
    return rehashes;
}

Why both "per-operation" and "background" migration?

Per-operation migration (1 bucket per command):
  Ensures progress even when serverCron is delayed
  Amortizes migration cost into command latency (imperceptibly small)

Background migration (serverCron, 100 buckets/100ms):
  Handles low-traffic periods where commands are rare
  Without this, a suddenly quiet server would never finish rehashing
  Max 1ms/100ms = 1% CPU budget for background migration

Together: rehash always makes progress, bounded CPU usage.

9.4.4 Read/Write Behavior During Rehash

Lookup (dictFind):

dictEntry *dictFind(dict *d, const void *key) {
    dictEntry *he;
    uint64_t h, idx, table;
    if (dictSize(d) == 0) return NULL;

    /* Opportunistically advance rehash by 1 step */
    if (dictIsRehashing(d)) _dictRehashStep(d);

    h = dictHashKey(d, key);
    /* Search BOTH tables โ€” entry may be in ht[0] (not yet migrated) or ht[1] */
    for (table = 0; table <= 1; table++) {
        idx = h & DICTHT_SIZE_MASK(d->ht_size_exp[table]);
        he = d->ht_table[table][idx];
        while (he) {
            if (key == he->key || dictCompareKeys(d, key, he->key))
                return he;
            he = he->next;
        }
        if (!dictIsRehashing(d)) break;  /* not rehashing: ht[1] is empty, skip */
    }
    return NULL;
}

Insert (dictAddRaw):

/* During rehash: NEW entries always go into ht[1], never ht[0] */
static long _dictKeyIndex(dict *d, const void *key, uint64_t hash,
                          dictEntry **existing) {
    /* Check both tables for existing key */
    for (table = 0; table <= 1; table++) {
        idx = hash & DICTHT_SIZE_MASK(d->ht_size_exp[table]);
        /* ... scan chain for existing key ... */
        if (!dictIsRehashing(d)) break;
    }
    /* Return index in ht[1] if rehashing, ht[0] otherwise */
    table = dictIsRehashing(d) ? 1 : 0;
    return idx;
}

Invariant during rehash:

ht[0]: entries being migrated OUT (only decreases)
ht[1]: destination table (only increases via migration or new inserts)

This invariant guarantees:
  - Every entry exists in exactly one of the two tables
  - Once migrated from ht[0] to ht[1], it stays in ht[1]
  - New entries go to ht[1], so ht[0] monotonically drains
  - Rehash terminates when ht[0].used == 0

9.5 Shrinking Mechanism

/* Shrink trigger: load factor < 10% */
int htNeedsResize(dict *dict) {
    long long size = dictSlots(dict);
    long long used = dictSize(dict);
    return (size > DICT_HT_INITIAL_SIZE &&
            (used * 100 / size < HASHTABLE_MIN_FILL));  /* HASHTABLE_MIN_FILL = 10 */
}

/* Called from serverCron โ€” check all databases */
void tryResizeHashTables(int dbid) {
    if (htNeedsResize(server.db[dbid].dict))
        dictResize(server.db[dbid].dict);
    if (htNeedsResize(server.db[dbid].expires))
        dictResize(server.db[dbid].expires);
}

/* Shrink to smallest power of 2 >= used count */
int dictResize(dict *d) {
    unsigned long minimal;
    if (!dict_can_resize || dictIsRehashing(d)) return DICT_ERR;
    minimal = d->ht_used[0];
    if (minimal < DICT_HT_INITIAL_SIZE) minimal = DICT_HT_INITIAL_SIZE;
    return dictExpand(d, minimal);  /* same mechanism as expansion */
}

Shrinking uses the identical incremental rehash mechanism as expansion, just with a smaller target table.

9.6 Iterator Safety

9.6.1 Safe Iterator

/* Safe iterator: pauses rehash for duration of iteration */
dictIterator *dictGetIterator(dict *d) {
    dictIterator *iter = zmalloc(sizeof(*iter));
    iter->d = d;
    iter->table = 0;
    iter->index = -1;
    iter->safe = 1;
    iter->entry = iter->nextEntry = NULL;
    return iter;
}

/* First call to dictNext initializes: */
static void _dictInitIterator(dictIterator *iter) {
    if (iter->safe)
        dictPauseRehashing(iter->d);     /* d->pauserehash++ */
    else
        iter->fingerprint = dictFingerprint(iter->d);
}

/* dictReleaseIterator: */
void dictReleaseIterator(dictIterator *iter) {
    if (!(iter->index == -1 && iter->table == 0)) {
        if (iter->safe)
            dictResumeRehashing(iter->d);   /* d->pauserehash-- */
        else
            assert(iter->fingerprint == dictFingerprint(iter->d));
    }
    zfree(iter);
}

Why pausing rehash prevents duplicates/omissions:

Without pause during iteration:
  Iterating bucket 5 โ†’ rehash migrates bucket 5 โ†’ entries move to ht[1]
  Iterator then visits ht[1] later and sees those entries AGAIN (duplicates!)

  OR:
  Iterator at bucket 10 in ht[0] โ†’ rehash migrates bucket 3
  If entry from bucket 3 in ht[0] maps to bucket 3 in ht[1]:
    But we already passed bucket 3 in ht[1] โ†’ entry MISSED (omission!)

With pause:
  d->pauserehash > 0 โ†’ dictRehash() returns immediately without doing work
  Rehash resumes after dictReleaseIterator sets pauserehash back to 0

9.6.2 Unsafe Iterator (Fingerprint-Based)

/* Fingerprint: hash of the dict's structural state */
unsigned long long dictFingerprint(dict *d) {
    unsigned long long integers[6], hash = 0;
    integers[0] = (long)d->ht_table[0];
    integers[1] = d->ht_size_exp[0];
    integers[2] = d->ht_used[0];
    integers[3] = (long)d->ht_table[1];
    integers[4] = d->ht_size_exp[1];
    integers[5] = d->ht_used[1];

    /* SipHash mix of all 6 integers */
    for (int j = 0; j < 6; j++) {
        hash += integers[j];
        hash = (~hash) + (hash << 21);
        hash = hash ^ (hash >> 24);
        /* ... more mixing ... */
    }
    return hash;
}

/* Usage: safe=0 means "I know what I'm doing, no writes during iteration" */
dictIterator *dictGetSafeIterator(dict *d) { /* misnaming โ€” this is UNSAFE */ }
/* Actually: dictGetSafeIterator IS the safe one. The API is confusing. */
/* Correct usage:
   Safe (pauses rehash):   dictGetIterator() with safe=1
   Unsafe (fingerprint):   iter->safe = 0, uses fingerprint check */

Use cases:

Safe iterator (safe=1, pauses rehash):
  KEYS *           โ€” must return all keys, no duplicates
  HGETALL key      โ€” must return all fields, no duplicates
  SMEMBERS key     โ€” must return all members
  DEBUG DIGEST     โ€” checksum of all data requires complete iteration

Unsafe iterator (safe=0, fingerprint):
  SCAN cursor      โ€” incremental; duplicates acceptable, omissions not ideal
                    (SCAN guarantees full coverage over multiple calls)
  Internal cleanup routines that know they won't modify the dict

9.7 listpack โ†’ hashtable Upgrade

9.7.1 Upgrade Trigger and Process

/* t_hash.c โ€” check and trigger upgrade after each HSET */
void hashTypeConvert(robj *o, int enc) {
    if (enc == OBJ_ENCODING_HT) {
        dict *dict = dictCreate(&hashDictType);
        hashTypeIterator *hi;
        /* Iterate listpack, insert each field/value into new dict */
        hi = hashTypeInitIterator(o);
        while (hashTypeNext(hi) != C_ERR) {
            sds field = hashTypeCurrentObjectEncoding(hi, OBJ_HASH_KEY);
            sds value = hashTypeCurrentObjectEncoding(hi, OBJ_HASH_VALUE);
            dictAdd(dict, field, value);
        }
        hashTypeReleaseIterator(hi);
        /* Replace listpack with dict */
        o->encoding = OBJ_ENCODING_HT;
        zfree(o->ptr);         /* free listpack */
        o->ptr = dict;
    }
}

/* Trigger conditions (checked in hashTypeTryConversion): */
if (sdslen(value) > server.hash_max_listpack_value)   /* default: 64 bytes */
    hashTypeConvert(o, OBJ_ENCODING_HT);

unsigned int len = hashTypeLength(o);
if (len > server.hash_max_listpack_entries)           /* default: 128 entries */
    hashTypeConvert(o, OBJ_ENCODING_HT);

9.7.2 Memory Before and After Upgrade

Hash object: 100 fields, each field=value=8 bytes

listpack encoding:
  Header:       6 bytes
  Per pair:     (1+8+1) + (1+8+1) = 20 bytes  (enc+data+backlen each)
  100 pairs:    2000 bytes
  End byte:     1 byte
  Total:        ~2007 bytes โ‰ˆ 2.0 KB

hashtable encoding (dict):
  dict struct:             64 bytes
  Bucket array (128):     128 ร— 8 = 1024 bytes  (next power of 2 >= 100)
  100 ร— dictEntry:        100 ร— (8+8+8) = 2400 bytes
  100 ร— SDS field:        100 ร— (3+8+1) = 1200 bytes  (sds header + data + \0)
  100 ร— SDS value:        100 ร— (3+8+1) = 1200 bytes
  jemalloc overhead:      ~800 bytes (per-allocation headers, size classes)
  Total:                  ~6688 bytes โ‰ˆ 6.5 KB

Memory ratio: hashtable uses ~3.2ร— more memory than listpack.
This validates the 128-entry threshold: below it, listpack is a huge win.

9.8 Performance Benchmarks

9.8.1 Rehash Impact on QPS

Setup: 1M-key database, trigger expansion (1M โ†’ 2M buckets)
       Average key: 32 bytes, value: 64 bytes

Phase           Throughput   P99 Latency  Notes
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
Before rehash   350K GET/s   0.30ms       Normal operation
During rehash   310K GET/s   0.40ms       -11% due to 2-table search
After rehash    350K GET/s   0.30ms       Fully recovered

Rehash duration: ~3 seconds
  serverCron: 100 buckets/tick ร— 10 ticks/second = 1000 buckets/second
  1M buckets / 1000 per second = 1000 seconds? NO โ€” per-command migration helps:
  350K commands ร— 1 bucket each = 350K buckets/second
  Total: 350K + 1K = 351K buckets/second โ†’ 1M/351K โ‰ˆ 2.8 seconds

No single command blocked > 1ms (individual bucket migrations are ฮผs-level)

9.8.2 Hash Table Operation Complexity

Operation     Not Rehashing   During Rehash   Notes
โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
GET           O(1) avg        O(1) avg        Search 2 tables
SET           O(1) avg        O(1) avg        Insert only in ht[1]
DEL           O(1) avg        O(1) avg        Search and delete from 1 table
KEYS *        O(N)            O(N)            Iteration (safe iter pauses rehash)
SCAN          O(N/cursor)     O(N/cursor)     Incremental, cursor-based

Load factor impact on chain length:
  load_factor = 0.5: avg chain = 1.0 (nearly all buckets have 0-1 entries)
  load_factor = 1.0: avg chain = 1.5 (expand triggered here)
  load_factor = 2.0: avg chain = 2.5 (pathological; only during persistence)
  load_factor = 5.0: avg chain = 5.5 (force expand triggers here)

9.9 Summary

Redis dict's incremental rehashing is a textbook solution to the "online algorithm" problem: how to restructure a large data structure without blocking the service.

Design Decision Problem Solved
Two hash tables (ht[0]/ht[1]) Allow service during expansion
rehashidx tracks progress O(1) resume after any pause
New writes only to ht[1] ht[0] only drains โ€” migration is monotonic
Reads check both tables No data loss during migration
serverCron background migration Progress even under zero load
Safe iterator pauses rehash Full-scan commands return correct results
SipHash-1-2 with random seed Hash DoS attack prevention
Higher load factor during persistence Prevents COW memory explosion under fork

The combination of these mechanisms allows Redis to transparently grow from a 4-bucket table to a 2^32-bucket table while serving millions of requests per second, with no client-visible pause longer than a single command's execution time.

Rate this chapter
4.7  / 5  (44 ratings)

๐Ÿ’ฌ Comments