Chapter 13: RDB Snapshots — fork, Copy-On-Write, and File Format

13.1 Design Intent

RDB (Redis Database Backup) is Redis's point-in-time binary snapshot mechanism: it serializes the entire in-memory dataset to a single compact binary file.


13.2 Trigger Mechanisms

13.2.1 Manual Triggers

# Foreground save (blocks the main thread — avoid in production)
SAVE

# Background save (recommended)
BGSAVE

# Time of the last successful save
LASTSAVE                    # returns a Unix timestamp

# If an AOF rewrite is running, queue the BGSAVE instead of returning an error
BGSAVE SCHEDULE             # runs once the AOF rewrite completes

13.2.2 Automatic Triggers

In redis.conf:

# Format: save <seconds> <changes>
# Trigger BGSAVE if at least <changes> writes occurred within <seconds>
save 3600 1       # 1 write in the last hour
save 300 100      # 100 writes in 5 minutes
save 60 10000     # 10,000 writes in 1 minute

# Disable automatic RDB entirely
save ""

# File configuration
dbfilename dump.rdb
dir /var/lib/redis

serverCron() in server.c runs every 100 ms (at the default hz 10) and checks whether any save point is satisfied:

for (j = 0; j < server.saveparamslen; j++) {
    struct saveparam *sp = server.saveparams + j;
    if (server.dirty >= sp->changes &&
        server.unixtime - server.lastsave > sp->seconds &&
        (server.unixtime - server.lastbgsave_try > CONFIG_BGSAVE_RETRY_DELAY ||
         server.lastbgsave_status == C_OK))
    {
        serverLog(LL_NOTICE, "%d changes in %d seconds. Saving...",
                  sp->changes, (int)sp->seconds);
        rdbSaveBackground(server.rdb_filename, NULL);
        break;
    }
}
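
The save-point test in that loop can be isolated into a small pure function. The following is a minimal sketch, not Redis code: `struct save_point` and `first_due_save_point` are hypothetical names, and the retry-delay check on failed saves is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical mirror of one "save <seconds> <changes>" directive. */
struct save_point {
    time_t seconds;
    int changes;
};

/* Returns the index of the first save point that is due, or -1 if none is.
 * A save point is due when at least `changes` writes are pending and more
 * than `seconds` have passed since the last successful save. */
int first_due_save_point(const struct save_point *points, size_t count,
                         long long dirty, time_t now, time_t lastsave) {
    assert(points != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (dirty >= points[i].changes && now - lastsave > points[i].seconds)
            return (int)i;
    }
    return -1;
}
```

With the three save points from the configuration above, 150 pending writes and 400 seconds since the last save select index 1 (`save 300 100`); the same 150 writes after only 100 seconds select nothing.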

13.2.3 Other Trigger Scenarios

SHUTDOWN [NOSAVE|SAVE]   # graceful shutdown runs a blocking SAVE if save points are configured
FLUSHALL                 # rewrites the (now empty) RDB after flushing if save points are configured
DEBUG RELOAD             # used in tests to force RDB cycle
# Full resync with replica: master triggers BGSAVE to send RDB to replica

13.3 fork() and Copy-On-Write

13.3.1 The fork() Call

/* rdb.c: rdbSaveBackground(), simplified */
int rdbSaveBackground(char *filename, rdbSaveInfo *rsi) {
    pid_t childpid;

    if (hasActiveChildProcess()) return C_ERR; /* one child at a time */

    server.dirty_before_bgsave = server.dirty;
    server.lastbgsave_try = time(NULL);

    if ((childpid = redisFork(CHILD_TYPE_RDB)) == 0) {
        /* Child process: write the snapshot, report COW usage, exit */
        redisSetCpuAffinity(server.bgsave_cpulist);
        int retval = rdbSave(filename, rsi);
        if (retval == C_OK) sendChildCowInfo(CHILD_INFO_TYPE_RDB_COW_SIZE, "RDB");
        exitFromChild((retval == C_OK) ? 0 : 1);
    } else {
        /* Parent process (main thread) */
        if (childpid == -1) {
            /* fork() failed, typically for lack of memory */
            server.lastbgsave_status = C_ERR;
            return C_ERR;
        }
        server.rdb_child_pid = childpid;
        /* Avoid dict resizes while the child is alive to reduce COW pages */
        updateDictResizePolicy();
        return C_OK;
    }
    return C_OK; /* unreached */
}

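After fork(), the child holds a frozen view of the parent's address space and writes it to disk, while the parent resumes serving clients immediately.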

13.3.2 Copy-On-Write Mechanics

After fork() — shared physical pages:

  Physical Memory:
  ┌──────────────────────────────────┐
  │  Page A: {key1:"v1", key2:"v2"} │
  │  Page B: {key3:"v3", key4:"v4"} │
  │  Page C: {key5:"v5", key6:"v6"} │
  └──────────────────────────────────┘
       ▲ Parent virtual addr 0x1000     ▲ Child virtual addr 0x1000
       (both point to same pages, read-only)

  Parent executes: SET key2 "newvalue"
  ┌────────────────────────────────────────────────────────┐
  │ 1. CPU detects write to read-only page → Page Fault    │
  │ 2. Kernel copies Page A → new Page A' (4 KB)           │
  │ 3. Parent's virtual addr 0x1000 → Page A' (writable)  │
  │ 4. Child's  virtual addr 0x1000 → Page A  (unchanged) │
  └────────────────────────────────────────────────────────┘

  Result: child always sees the state at the moment of fork()
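
The snapshot semantics in the diagram can be observed with a few lines of POSIX C. This is a standalone sketch, not Redis code: the function name is hypothetical, a single heap-allocated int stands in for the dataset, and a pipe is used only to make sure the child reads after the parent has written.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Forks, lets the parent overwrite a value, then returns what the child
 * saw when it read the same address afterwards. Returns -1 on error. */
int child_view_after_parent_write(void) {
    int gate[2];
    int *value = malloc(sizeof *value);
    if (value == NULL) return -1;
    *value = 1;                               /* state at the moment of fork() */
    if (pipe(gate) == -1) { free(value); return -1; }

    pid_t pid = fork();
    if (pid == -1) { close(gate[0]); close(gate[1]); free(value); return -1; }
    if (pid == 0) {
        char c;
        close(gate[1]);
        /* Block until the parent signals that it has modified its copy */
        if (read(gate[0], &c, 1) != 1) _exit(255);
        _exit(*value);                        /* report the child's view */
    }

    close(gate[0]);
    *value = 2;                               /* write faults, kernel copies the page */
    assert(*value == 2);                      /* the parent sees its own write */
    ssize_t sent = write(gate[1], "x", 1);
    close(gate[1]);

    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    free(value);
    if (sent != 1 || waited != pid || !WIFEXITED(status)) return -1;
    return WEXITSTATUS(status);
}
```

The function returns 1: the child still reads the value that existed at fork() time even though the parent has since stored 2 at the same virtual address.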

13.3.3 COW Memory Overhead

Scenario: Redis uses 10 GB, BGSAVE runs for 60 seconds,
          30% of data is modified during that window

Best case (all writes land on same pages):
  COW copies: O(few pages)   → negligible overhead

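Page-aligned case (the modified 30% fills whole pages, each copied once):
  Modified data: 3 GB
  COW copies   : 3 GB (one physical copy per modified page)
  Peak memory  : 10 GB (original) + 3 GB (COW copies) = 13 GB

Worst case (small writes scattered across every page):
  COW copies   : approach 10 GB, because a page is copied whole
                 even when only a few bytes in it change
  Peak memory  : approaches 20 GB

Practical estimate:
  For typical workloads with hot keys, repeated writes hit pages that
  were already copied, so COW overhead is usually well below the write
  volume (roughly 20–40% as a rule of thumb, not 100%). Measure
  rdb_last_cow_size on your own workload.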

Monitoring COW size (Redis 4.0+):

redis-cli INFO persistence
# rdb_last_cow_size: 3145728    ← 3 MB copied during last BGSAVE
# aof_last_cow_size: 1048576   ← 1 MB during AOF rewrite

13.3.4 Transparent Huge Pages (THP) — A Dangerous Interaction

Linux THP uses 2 MB pages instead of 4 KB pages. When COW triggers on a THP page, the kernel must copy 2 MB instead of 4 KB — a 512× increase in copy granularity.
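
To put numbers on that granularity difference, the sketch below computes an upper bound on COW traffic when every write lands on a distinct page. It is illustrative arithmetic only; the function name and the assumption that writes never share a page are mine, not taken from Redis or the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on bytes copied by COW when `writes` small writes each hit
 * a different page. The result is capped at the (page-rounded) dataset
 * size, since a page is copied at most once per fork. */
uint64_t cow_bytes_upper_bound(uint64_t writes, uint64_t page_size,
                               uint64_t dataset_bytes) {
    assert(page_size > 0);
    uint64_t total_pages = (dataset_bytes + page_size - 1) / page_size;
    uint64_t touched = writes < total_pages ? writes : total_pages;
    return touched * page_size;
}
```

For a 10 GiB dataset, 1,000 scattered writes copy at most about 4 MB with 4 KB pages but about 2 GB with 2 MB pages, and a few thousand more writes are enough to copy the whole dataset.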

# Symptom: BGSAVE takes unusually long, rdb_last_cow_size is enormous

# Check THP status
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never  ← "always" is dangerous for Redis

# Disable THP for the running system (as root; lost on reboot)
echo never > /sys/kernel/mm/transparent_hugepage/enabled
# To persist across reboots, run the same command from /etc/rc.local
# or an equivalent boot-time unit

# Also disable defrag
echo never > /sys/kernel/mm/transparent_hugepage/defrag

13.3.5 Dict Resize Freeze During BGSAVE

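Redis avoids resizing its hash tables (and therefore starting new rehashes) while a child process exists, to prevent unnecessary COW:

void updateDictResizePolicy(void) {
    if (server.rdb_child_pid == -1 && server.aof_child_pid == -1)
        dictEnableResize();
    else
        dictDisableResize(); /* avoid resizes while a child exists */
}

Without this, a resize would allocate a new table and move every entry between the old and new bucket arrays, dirtying pages the child still shares and inflating COW memory. The freeze is a policy rather than a hard stop: a table whose load factor grows very high is still expanded.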


13.4 rdbSave() Serialization Pipeline

int rdbSaveRio(rio *rdb, int *error, int rdbflags, rdbSaveInfo *rsi) {
    char magic[10];
    int j;

    /* 1. Magic header: "REDIS" + 4-digit version */
    snprintf(magic, sizeof(magic), "REDIS%04d", RDB_VERSION);
    rdbWriteRaw(rdb, magic, 9);          /* e.g. "REDIS0011" */

    /* 2. AUX fields: metadata key-value pairs */
    rdbSaveAuxFieldStrStr(rdb, "redis-ver",  REDIS_VERSION);   /* "7.2.0" */
    rdbSaveAuxFieldStrInt(rdb, "redis-bits", sizeof(long)*8);  /* 64 */
    rdbSaveAuxFieldStrInt(rdb, "ctime",      time(NULL));
    rdbSaveAuxFieldStrInt(rdb, "used-mem",   server.used_memory);
    if (rsi) {
        rdbSaveAuxFieldStrInt(rdb, "repl-stream-db",  rsi->repl_stream_db);
        rdbSaveAuxFieldStrStr(rdb, "repl-id",         server.replid);
        rdbSaveAuxFieldStrInt(rdb, "repl-offset",     server.master_repl_offset);
    }

    /* 3. Iterate over all databases */
    for (j = 0; j < server.dbnum; j++) {
        redisDb *db = server.db + j;
        dict *d = db->dict;
        if (dictSize(d) == 0) continue;

        /* SELECTDB opcode + db index */
        rdbSaveType(rdb, RDB_OPCODE_SELECTDB);   /* 0xFE */
        rdbSaveLen(rdb, j);

        /* RESIZEDB opcode + sizes (speeds up loading by pre-sizing dicts) */
        rdbSaveType(rdb, RDB_OPCODE_RESIZEDB);   /* 0xFB */
        rdbSaveLen(rdb, dictSize(d));
        rdbSaveLen(rdb, dictSize(db->expires));

        /* Iterate every key-value pair */
        dictIterator *di = dictGetSafeIterator(d);
        dictEntry *de;
        while ((de = dictNext(di)) != NULL) {
            sds keystr = dictGetKey(de);
            robj key, *o = dictGetVal(de);
            initStaticStringObject(key, keystr); /* stack robj wrapping the key sds */
            long long expiretime = getExpire(db, &key);

            /* Expiry time (millisecond precision, opcode 0xFC) */
            if (expiretime != -1) {
                rdbSaveType(rdb, RDB_OPCODE_EXPIRETIME_MS);
                rdbSaveMillisecondTime(rdb, expiretime);
            }

            /* LRU idle time or LFU frequency */
            if (server.maxmemory_policy & MAXMEMORY_FLAG_LRU) {
                rdbSaveType(rdb, RDB_OPCODE_IDLE);
                rdbSaveLen(rdb, estimateObjectIdleTime(o) / 1000); /* seconds */
            } else if (server.maxmemory_policy & MAXMEMORY_FLAG_LFU) {
                /* o->lru is a bitfield, so copy the counter into a byte first */
                uint8_t freq = LFUDecrAndReturn(o);
                rdbSaveType(rdb, RDB_OPCODE_FREQ);
                rdbWriteRaw(rdb, &freq, 1);
            }

            /* Value type opcode */
            rdbSaveObjectType(rdb, o);

            /* Key: always a string */
            rdbSaveStringObject(rdb, &key);

            /* Value: type-specific encoding */
            rdbSaveObject(rdb, o, &key, j);
        }
        dictReleaseIterator(di);
    }

    /* 4. EOF marker */
    rdbSaveType(rdb, RDB_OPCODE_EOF);    /* 0xFF */

    /* 5. CRC-64 checksum of entire file content */
    uint64_t cksum = rdb->cksum;
    memrev64ifbe(&cksum);
    rdbWriteRaw(rdb, &cksum, 8);
    return C_OK;
}
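
Step 5 can be reproduced outside Redis. The sketch below is a bit-at-a-time CRC-64 using the Jones polynomial in reflected form with a zero initial value, which is the variant Redis's crc64.c documents; the function names are hypothetical and Redis itself uses a much faster table-driven implementation. The second helper writes the 8-byte trailer in little-endian order, as in step 5.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-64 (Jones polynomial, reflected, init 0, no final XOR).
 * Pass the previous return value as `crc` to checksum data in chunks. */
uint64_t crc64_jones(uint64_t crc, const unsigned char *data, size_t len) {
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1) ? (crc >> 1) ^ 0x95AC9329AC4BC9B5ULL : (crc >> 1);
    }
    return crc;
}

/* Serializes the checksum as the 8-byte little-endian RDB trailer. */
void crc64_trailer(uint64_t crc, unsigned char out[8]) {
    for (int i = 0; i < 8; i++)
        out[i] = (unsigned char)(crc >> (8 * i));
}
```

The standard check string "123456789" yields 0xe9c6d914c4b8d9ca, the check value listed in the header comment of crc64.c.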

13.5 Encoding Formats

13.5.1 Length Encoding

Redis uses a compact variable-length encoding for all integer lengths:

Byte pattern         Meaning
─────────────────────────────────────────────────────────
00xxxxxx             6-bit integer: 0–63          (1 byte total)
01xxxxxx xxxxxxxx    14-bit integer: 0–16383       (2 bytes total)
10000000 <4 bytes>   32-bit big-endian integer     (5 bytes total)
10000001 <8 bytes>   64-bit big-endian integer     (9 bytes total)
11000000 <1 byte>    Special: string encoded as 8-bit integer
11000001 <2 bytes>   Special: string encoded as 16-bit integer
11000010 <4 bytes>   Special: string encoded as 32-bit integer
11000011 ...         Special: LZF-compressed string

Examples:

Length 5:   0x05          (00000101, fits in 6 bits)
Length 200: 0x40 0xC8     (01000000 11001000, 14-bit: 0b00_11001000 = 200)
Length 100,000: 0x80 0x00 0x01 0x86 0xA0  (32-bit big-endian)
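
The first four rows of the table translate directly into an encoder. The following is a minimal sketch with a hypothetical function name (Redis's own routine is rdbSaveLen(), which writes to a rio stream rather than a buffer):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes the RDB length encoding of `len` into `out` (up to 9 bytes)
 * and returns the number of bytes written. */
size_t rdb_encode_len(uint64_t len, unsigned char *out) {
    assert(out != NULL);
    if (len < (1u << 6)) {                     /* 00xxxxxx */
        out[0] = (unsigned char)len;
        return 1;
    }
    if (len < (1u << 14)) {                    /* 01xxxxxx xxxxxxxx */
        out[0] = (unsigned char)(0x40 | (len >> 8));
        out[1] = (unsigned char)(len & 0xFF);
        return 2;
    }
    if (len <= UINT32_MAX) {                   /* 0x80 + 32-bit big-endian */
        out[0] = 0x80;
        for (int i = 0; i < 4; i++)
            out[1 + i] = (unsigned char)(len >> (24 - 8 * i));
        return 5;
    }
    out[0] = 0x81;                             /* 0x81 + 64-bit big-endian */
    for (int i = 0; i < 8; i++)
        out[1 + i] = (unsigned char)(len >> (56 - 8 * i));
    return 9;
}
```

Encoding 5, 200 and 100,000 with this function reproduces the three byte sequences listed in the examples.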

13.5.2 String Object Encoding

ssize_t rdbSaveRawString(rio *rdb, unsigned char *s, size_t len) {
    /* 1. Attempt integer encoding (saves length prefix for small integers) */
    if (len <= 11) {
        long long value;
        if (string2ll((char*)s, len, &value)) {
            return rdbSaveLongLongAsStringObject(rdb, value);
            /* writes: 0xC0 + int8, or 0xC1 + int16, or 0xC2 + int32 */
        }
    }

    /* 2. LZF compression for strings > 20 bytes (if rdbcompression yes).
     *    The output buffer is 4 bytes smaller than the input, so
     *    lzf_compress() returns 0 unless compression really saves space. */
    if (server.rdb_compression && len > 20) {
        size_t outlen = len - 4;
        void *out = zmalloc(outlen + 1);
        size_t comprlen = lzf_compress(s, len, out, outlen);
        if (comprlen > 0) {
            ssize_t nwritten = rdbSaveLzfBlob(rdb, out, comprlen, len);
            zfree(out);
            return nwritten;
        }
        zfree(out);
    }

    /* 3. Raw: length prefix + raw bytes */
    if (rdbSaveLen(rdb, len) == -1) return -1;
    return rdbWriteRaw(rdb, s, len);
}
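
Step 1 above picks the smallest integer form that fits. The sketch below shows that selection for an already-parsed value; the function name is hypothetical, and the payload bytes are written little-endian, which matches how Redis stores these special encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes `value` with the 0xC0/0xC1/0xC2 special prefixes.
 * Returns the number of bytes written to `out` (up to 5), or 0 when the
 * value needs more than 32 bits and must be stored as a plain string. */
size_t rdb_encode_int(long long value, unsigned char *out) {
    assert(out != NULL);
    uint32_t bits = (uint32_t)value;          /* two's-complement low bits */
    if (value >= INT8_MIN && value <= INT8_MAX) {
        out[0] = 0xC0;
        out[1] = (unsigned char)(bits & 0xFF);
        return 2;
    }
    if (value >= INT16_MIN && value <= INT16_MAX) {
        out[0] = 0xC1;
        out[1] = (unsigned char)(bits & 0xFF);
        out[2] = (unsigned char)((bits >> 8) & 0xFF);
        return 3;
    }
    if (value >= INT32_MIN && value <= INT32_MAX) {
        out[0] = 0xC2;
        for (int i = 0; i < 4; i++)
            out[1 + i] = (unsigned char)((bits >> (8 * i)) & 0xFF);
        return 5;
    }
    return 0;
}
```

A string value "64" therefore costs two bytes (C0 40) instead of three (length byte plus two ASCII digits), and "100000" costs five bytes instead of seven.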

13.5.3 Type-Specific Serialization

Redis type   Internal encoding   RDB serialization
─────────────────────────────────────────────────────────────────────────────────────
String       INT                 Compact integer (0xC0/C1/C2 prefix)
String       EMBSTR / RAW        Length-prefixed raw bytes (optionally LZF)
List         LISTPACK            Listpack blob (length + raw bytes)
List         QUICKLIST           Node count + per-node (listpack + compress flag)
Hash         LISTPACK            Listpack blob
Hash         HT                  Count + alternating key/value strings
Set          LISTPACK            Listpack blob
Set          INTSET              Intset blob (encoding + length + raw int array)
Set          HT                  Count + member strings
ZSet         LISTPACK            Listpack blob
ZSet         SKIPLIST            Count + [member string + score as 8-byte double] pairs
Stream       -                   Encoded radix tree + consumer groups

13.6 RDB File Structure — Hex Walkthrough

Minimal RDB for a dataset holding one key, written with SET key value:

Offset  Hex bytes                           Decoded value
──────────────────────────────────────────────────────────────────
0x0000  52 45 44 49 53 30 30 31 31          "REDIS0011" (magic + version 11)

0x0009  FA                                  RDB_OPCODE_AUX
0x000A  09                                  AUX key length: 9
0x000B  72 65 64 69 73 2D 76 65 72          "redis-ver"
0x0014  05                                  AUX val length: 5
0x0015  37 2E 32 2E 30                      "7.2.0"

0x001A  FA                                  RDB_OPCODE_AUX
0x001B  0A                                  key length: 10
0x001C  72 65 64 69 73 2D 62 69 74 73       "redis-bits"
0x0026  C0 40                               int encoding: 0xC0 = int8, 0x40 = 64

0x0028  FA 05 63 74 69 6D 65 ...            "ctime" AUX field

(more AUX fields...)

0x00XX  FE                                  RDB_OPCODE_SELECTDB (0xFE)
0x00XX  00                                  DB index: 0

0x00XX  FB                                  RDB_OPCODE_RESIZEDB (0xFB)
0x00XX  01                                  dict_size: 1
0x00XX  00                                  expires_size: 0

0x00XX  00                                  Type opcode: OBJ_STRING (0x00)
0x00XX  03                                  Key length: 3
0x00XX  6B 65 79                            Key: "key"
0x00XX  05                                  Value length: 5
0x00XX  76 61 6C 75 65                      Value: "value"

0x00XX  FF                                  RDB_OPCODE_EOF (0xFF)
0x00XX  XX XX XX XX XX XX XX XX             CRC-64 checksum (8 bytes, little-endian)

If the key had EX 100, the expiry section would appear before the type opcode:

0x00XX  FC                                  RDB_OPCODE_EXPIRETIME_MS (0xFC)
0x00XX  00 50 AF 6E A2 01 00 00             8 bytes: abs expiry in ms (little-endian)
0x00XX  00                                  Type opcode: OBJ_STRING
...
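
Unlike the big-endian length fields, the expiry timestamp is stored little-endian. The sketch below builds the 9-byte expiry prefix shown above; the function name is hypothetical and the timestamp is simply the one used in the example bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes opcode 0xFC followed by the absolute expiry time in
 * milliseconds as 8 little-endian bytes. Returns bytes written (9). */
size_t rdb_encode_expire_ms(uint64_t expire_ms, unsigned char *out) {
    assert(out != NULL);
    out[0] = 0xFC;                             /* RDB_OPCODE_EXPIRETIME_MS */
    for (int i = 0; i < 8; i++)
        out[1 + i] = (unsigned char)(expire_ms >> (8 * i));
    return 9;
}
```

Passing 0x000001A26EAF5000 produces FC 00 50 AF 6E A2 01 00 00, the bytes in the walkthrough.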

13.7 Load Path — rdbLoad()

int rdbLoad(char *filename, rdbSaveInfo *rsi, int rdbflags) {
    rio rdb;
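    FILE *fp = fopen(filename, "r");
    if (fp == NULL) return C_ERR;
    rioInitWithFile(&rdb, fp);

    /* Verify magic string "REDIS" */
    char buf[1024];
    if (rioRead(&rdb, buf, 9) == 0) { fclose(fp); return C_ERR; }
    buf[9] = '\0'; /* terminate so atoi() below stops after the version digits */
    if (memcmp(buf, "REDIS", 5) != 0) { fclose(fp); return C_ERR; } /* not an RDB file */

    /* Parse version */
    int rdbver = atoi(buf + 5);
    if (rdbver < 1 || rdbver > RDB_VERSION) { fclose(fp); return C_ERR; } /* incompatible */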

    redisDb *db = server.db;
    long long expiretime = -1, now = mstime();
    long long lfu_freq = -1, lru_idle = -1; /* applied to the loaded object in the real loader */
    int error;

    while (1) {
        int type = rdbLoadType(&rdb); /* read 1 byte */

        /* Opcodes that prefix a key record state, then loop to read the next opcode */
        if (type == RDB_OPCODE_EXPIRETIME_MS) {
            expiretime = rdbLoadMillisecondTime(&rdb, rdbver);
            continue;
        }
        if (type == RDB_OPCODE_EXPIRETIME) { /* old seconds precision */
            expiretime = (long long)rdbLoadTime(&rdb) * 1000;
            continue;
        }
        if (type == RDB_OPCODE_FREQ) {
            uint8_t byte; rioRead(&rdb, &byte, 1);
            lfu_freq = byte;
            continue;
        }
        if (type == RDB_OPCODE_IDLE) {
            lru_idle = rdbLoadLen(&rdb, NULL);
            continue;
        }
        if (type == RDB_OPCODE_EOF) break;
        if (type == RDB_OPCODE_SELECTDB) {
            int dbid = rdbLoadLen(&rdb, NULL);
            db = server.db + dbid;
            continue;
        }
        if (type == RDB_OPCODE_RESIZEDB) {
            uint64_t db_size = rdbLoadLen(&rdb, NULL);
            uint64_t exp_size = rdbLoadLen(&rdb, NULL);
            dictExpand(db->dict, db_size);
            dictExpand(db->expires, exp_size);
            continue;
        }
        if (type == RDB_OPCODE_AUX) {
            robj *auxkey = rdbLoadStringObject(&rdb);
            robj *auxval = rdbLoadStringObject(&rdb);
            loadAuxField(rsi, auxkey, auxval);
            continue;
        }

        /* Regular key-value */
        robj *key = rdbLoadStringObject(&rdb);
        robj *val = rdbLoadObject(type, &rdb, key->ptr, db, &error);

        /* Skip already-expired keys */
        if (expiretime != -1 && expiretime < now) {
            decrRefCount(key); decrRefCount(val);
            expiretime = -1;
            continue;
        }

        dbAdd(db, key, val);
        if (expiretime != -1) setExpire(NULL, db, key, expiretime);
        expiretime = -1;
    }

    /* Verify CRC-64 checksum (a stored value of 0 means the file was
     * written with rdbchecksum no, so there is nothing to compare) */
    uint64_t cksum_stored, cksum_computed = rdb.cksum;
    rioRead(&rdb, &cksum_stored, 8);
    memrev64ifbe(&cksum_stored);
    if (cksum_stored != 0 && cksum_stored != cksum_computed) {
        serverLog(LL_WARNING, "Wrong RDB checksum. Aborting now.");
        exit(1);
    }

    fclose(fp);
    return C_OK;
}
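
The header check at the top of the loader is easy to reuse in standalone tooling, for example to reject a wrong file before handing it to Redis. The following is a sketch with a hypothetical function name; it is deliberately stricter than the atoi()-based parse above in that it requires four decimal digits.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Validates the 9-byte RDB header ("REDIS" + 4 digits).
 * Returns the version number, or -1 if the header is malformed or the
 * version is outside 1..max_version. */
int rdb_parse_header(const unsigned char *buf, size_t len, int max_version) {
    assert(buf != NULL);
    if (len < 9 || memcmp(buf, "REDIS", 5) != 0) return -1;
    int version = 0;
    for (size_t i = 5; i < 9; i++) {
        if (!isdigit(buf[i])) return -1;
        version = version * 10 + (buf[i] - '0');
    }
    if (version < 1 || version > max_version) return -1;
    return version;
}
```

Called on the first nine bytes of the file from Section 13.6 with max_version set to 11, it returns 11; a file that starts with "REDIS0012" is rejected as too new.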

Load performance benchmarks:

Dataset                 RDB file size   Load time   Equivalent AOF replay
─────────────────────────────────────────────────────────────────────────
1M String keys          80 MB           ~2 s        ~15 s
10M String keys         800 MB          ~20 s       ~150 s
1M Hash keys (small)    200 MB          ~8 s        ~40 s
100M String keys        8 GB            ~3.5 min    ~30 min

In these figures, loading the RDB is roughly 5–9× faster than replaying the equivalent AOF.


13.8 Checking and Repairing RDB Files

# Check file integrity
redis-check-rdb /var/lib/redis/dump.rdb

# Sample healthy output (illustrative; exact offsets and wording vary by version):
# [offset 0] Checking RDB file dump.rdb
# [offset 26] AUX FIELD redis-ver = '7.2.0'
# [offset 40] AUX FIELD redis-bits = '64'
# [offset 57] AUX FIELD ctime = '1704067200'
# [offset 69] AUX FIELD used-mem = '1048576'
# [offset 84] Selecting DB ID 0
# [offset 87] Key count: 10000
# [offset 524352] Checksum OK
# \o/ RDB looks OK!

# Sample corrupted output (illustrative; exact wording varies by version):
# [offset 524288] FATAL: RDB CRC error
# Expected: 0xdeadbeefcafe1234
# Got:      0x12345678abcdef00

Corruption scenarios:

Scenario                                    Symptom                         Recovery
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Truncated file (interrupted write or copy)  EOF before 0xFF opcode          Partial data recoverable up to truncation point; use previous backup
CRC mismatch                                Checksum error at end           Data may be intact but unverifiable; try loading with rdbchecksum no
Wrong magic                                 "Wrong signature" error         File is not RDB (wrong path, corrupted header)
Version too new                             "RDB version N not supported"   Upgrade Redis to a release that supports RDB version N
Bit rot                                     Random CRC or parse error       Restore from backup; never serve corrupted data

13.9 Production Best Practices

# Layered save configuration
save 3600 1
save 300 100
save 60 10000

# Enable compression and checksum
rdbcompression yes
rdbchecksum yes

# Memory headroom: Redis should have at least 50% free RAM for COW
# If Redis uses 10 GB, system should have 15+ GB available

# Monitor continuously
watch -n 5 "redis-cli INFO persistence | grep -E 'rdb_|aof_'"

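# Backup script (run via cron every hour)
#!/bin/bash
set -eu
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
redis-cli BGSAVE
# Poll until the background save finishes instead of sleeping a fixed time
while redis-cli INFO persistence | grep -q '^rdb_bgsave_in_progress:1'; do
    sleep 1
done
# Stop if the save failed, so a stale dump.rdb is not archived
redis-cli INFO persistence | grep -q '^rdb_last_bgsave_status:ok' || exit 1
cp /var/lib/redis/dump.rdb "/backup/redis/dump_${TIMESTAMP}.rdb"
# Upload to object storage
aws s3 cp "/backup/redis/dump_${TIMESTAMP}.rdb" s3://my-bucket/redis-backups/
# Prune backups older than 7 days
find /backup/redis/ -name "dump_*.rdb" -mtime +7 -delete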

Memory sizing formula:

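Required RAM ≈ redis_used_memory × (1 + peak_write_rate_during_bgsave × bgsave_duration_s / total_keys_count)

Here peak_write_rate_during_bgsave is in writes per second, and the fraction estimates the share of keys (and so of pages) rewritten while the child is alive; treat any value above 1 as 1.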

Conservative rule: provision 2× the Redis working set size
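
The sizing formula above can be wrapped in a small helper for capacity planning. This is a sketch of that arithmetic only: the function name is hypothetical, and both the cap on the rewritten fraction and the fallback to the 2× rule for an unknown key count are assumptions added here.

```c
#include <assert.h>

/* Estimated RAM needed to survive a BGSAVE, in the same unit as
 * `used_memory`. The rewritten fraction is capped at 1 because a page
 * is copied at most once per fork. */
double required_ram(double used_memory, double writes_per_sec,
                    double bgsave_seconds, double total_keys) {
    assert(used_memory >= 0.0);
    if (total_keys <= 0.0) return used_memory * 2.0;   /* fall back to the 2x rule */
    double fraction = writes_per_sec * bgsave_seconds / total_keys;
    if (fraction > 1.0) fraction = 1.0;
    return used_memory * (1.0 + fraction);
}
```

For a 10 GB instance with 1,000,000 keys, 5,000 writes per second and a 60-second BGSAVE, the fraction is 0.3 and the estimate is 13 GB, which matches the page-aligned scenario in Section 13.3.3.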