RDB Snapshot: fork, COW and File Format
Chapter 13: RDB Snapshots (fork, Copy-On-Write, and File Format)
13.1 Design Intent
RDB (Redis Database Backup) is Redis's point-in-time binary snapshot mechanism. It serializes the in-memory dataset to a compact binary file with these characteristics:
- Small file size: custom binary encoding with variable-length integers and optional LZF compression
- Fast load time: direct deserialization, with no command log replay
- Non-blocking: uses fork() + kernel Copy-On-Write so the child process generates the snapshot while the parent continues serving requests
- Trade-off: data written between the last snapshot and a crash is lost
13.2 Trigger Mechanisms
13.2.1 Manual Triggers
# Foreground save (blocks the main thread; avoid in production)
SAVE
# Background save (recommended)
BGSAVE
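# Check when the last successful save finished
LASTSAVE # returns the Unix timestamp of the last successful save
BGSAVE SCHEDULE # if an AOF rewrite is in progress, schedule the BGSAVE to run afterwards instead of returning an error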
13.2.2 Automatic Triggers
In redis.conf:
# Format: save <seconds> <changes>
# Trigger BGSAVE if at least <changes> writes occurred within <seconds>
# (comments must be on their own line: redis.conf does not accept trailing comments)
# 1 write in the last hour
save 3600 1
# 100 writes in 5 minutes
save 300 100
# 10,000 writes in 1 minute
save 60 10000
# Disable automatic RDB entirely
save ""
# File configuration
dbfilename dump.rdb
dir /var/lib/redis
serverCron() in server.c runs every 100 ms by default (hz 10) and checks whether any save point is satisfied:
for (j = 0; j < server.saveparamslen; j++) {
    struct saveparam *sp = server.saveparams + j;

    /* Enough changes, enough elapsed time, and either the last attempt
     * succeeded or the retry delay after a failure has passed. */
    if (server.dirty >= sp->changes &&
        server.unixtime - server.lastsave > sp->seconds &&
        (server.unixtime - server.lastbgsave_try > CONFIG_BGSAVE_RETRY_DELAY ||
         server.lastbgsave_status == C_OK))
    {
        /* sp->changes is an int, so the format specifier is %d */
        serverLog(LL_NOTICE, "%d changes in %d seconds. Saving...",
                  sp->changes, (int)sp->seconds);
        rdbSaveBackground(server.rdb_filename, NULL);
        break;
    }
}
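The save-point check above can be sketched outside Redis as a small pure function. This is an illustrative model only: `SaveParam`, `should_bgsave`, and the value assigned to `BGSAVE_RETRY_DELAY` are assumptions made for the example, not names taken from the Redis source.

```python
from dataclasses import dataclass

# Assumed stand-in for CONFIG_BGSAVE_RETRY_DELAY (seconds); illustrative value.
BGSAVE_RETRY_DELAY = 5


@dataclass
class SaveParam:
    seconds: int   # time window of the save point
    changes: int   # minimum number of writes in that window


def should_bgsave(params, dirty, now, lastsave, lastbgsave_try, last_status_ok):
    """Return the first satisfied save point, or None (mirrors the loop above)."""
    for sp in params:
        enough_changes = dirty >= sp.changes
        enough_time = now - lastsave > sp.seconds
        # After a failed BGSAVE, wait for the retry delay before trying again.
        retry_allowed = last_status_ok or (now - lastbgsave_try > BGSAVE_RETRY_DELAY)
        if enough_changes and enough_time and retry_allowed:
            return sp
    return None


if __name__ == "__main__":
    points = [SaveParam(3600, 1), SaveParam(300, 100), SaveParam(60, 10000)]
    # 150 writes, 400 s since the last save: the "300 100" point fires.
    print(should_bgsave(points, dirty=150, now=1000, lastsave=600,
                        lastbgsave_try=600, last_status_ok=True))
```

Note that the save points are evaluated in order and the loop stops at the first match, so the log message reports whichever configured point happened to trigger first.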
13.2.3 Other Trigger Scenarios
SHUTDOWN [NOSAVE|SAVE] # graceful shutdown performs a blocking SAVE (not BGSAVE) when save points are configured
FLUSHALL # rewrites the (now empty) RDB after flushing, when save points are configured
DEBUG RELOAD # used in tests to force RDB cycle
# Full resync with replica: master triggers BGSAVE to send RDB to replica
13.3 fork() and Copy-On-Write
13.3.1 The fork() Call
/* rdb.c: rdbSaveBackground() (simplified) */
int rdbSaveBackground(char *filename, rdbSaveInfo *rsi) {
    pid_t childpid;

    if (hasActiveChildProcess()) return C_ERR; /* one child at a time */
    server.dirty_before_bgsave = server.dirty;
    server.lastbgsave_try = time(NULL);

    if ((childpid = redisFork(CHILD_TYPE_RDB)) == 0) {
        /* Child process */
        int retval;
        redisSetCpuAffinity(server.bgsave_cpulist);
        retval = rdbSave(filename, rsi);
        sendChildInfo(CHILD_INFO_TYPE_RDB_COW_SIZE, server.child_info_data.cow_size, "RDB");
        exitFromChild((retval == C_OK) ? 0 : 1);
    } else {
        /* Parent process (main thread) resumes immediately */
        if (childpid == -1) {
            /* fork() failed, typically for lack of memory */
            server.lastbgsave_status = C_ERR;
            return C_ERR;
        }
        server.rdb_child_pid = childpid;
        /* Disable dict resize during child process life to reduce COW pages */
        updateDictResizePolicy();
        return C_OK;
    }
    return C_OK; /* unreachable */
}
After fork():
- Parent and child share all physical memory pages
- The kernel marks every shared page as read-only
- The child walks all data structures and serializes them to a temp file
- The parent handles client requests normally
13.3.2 Copy-On-Write Mechanics
After fork(): shared physical pages
Physical Memory:
+----------------------------------+
| Page A: {key1:"v1", key2:"v2"}   |
| Page B: {key3:"v3", key4:"v4"}   |
| Page C: {key5:"v5", key6:"v6"}   |
+----------------------------------+
   ^ Parent virtual addr 0x1000      ^ Child virtual addr 0x1000
   (both point to same pages, read-only)
Parent executes: SET key2 "newvalue"
+---------------------------------------------------------+
| 1. CPU detects write to read-only page → Page Fault     |
| 2. Kernel copies Page A → new Page A' (4 KB)            |
| 3. Parent's virtual addr 0x1000 → Page A' (writable)    |
| 4. Child's virtual addr 0x1000 → Page A (unchanged)     |
+---------------------------------------------------------+
Result: child always sees the state at the moment of fork()
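The snapshot property can be observed from user space with a plain fork(). The sketch below is a minimal demonstration on a POSIX system, not how Redis is implemented: `snapshot_via_fork` is a hypothetical helper, and the two pipes exist only to force the parent's write to happen before the child reads its copy.

```python
import os


def snapshot_via_fork(data):
    """Fork, mutate `data` in the parent, and return what the child saw."""
    go_r, go_w = os.pipe()     # parent -> child: "I have modified the data"
    out_r, out_w = os.pipe()   # child -> parent: serialized snapshot

    pid = os.fork()
    if pid == 0:
        # Child: block until the parent has written to its own copy.
        os.close(go_w)
        os.close(out_r)
        os.read(go_r, 1)
        payload = ",".join(f"{k}={v}" for k, v in sorted(data.items()))
        os.write(out_w, payload.encode())
        os._exit(0)

    # Parent: this write lands on the parent's private copy of the page.
    os.close(go_r)
    os.close(out_w)
    data["key2"] = "newvalue"
    os.write(go_w, b"x")

    chunks = []
    while True:
        chunk = os.read(out_r, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(go_w)
    os.close(out_r)
    os.waitpid(pid, 0)
    return b"".join(chunks).decode()


if __name__ == "__main__":
    d = {"key1": "v1", "key2": "v2"}
    print("child saw :", snapshot_via_fork(d))
    print("parent has:", d)
```

The child reports the pre-fork value of key2 even though it reads after the parent's update, which is the same guarantee the RDB child relies on.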
13.3.3 COW Memory Overhead
Scenario: Redis uses 10 GB, BGSAVE runs for 60 seconds,
30% of data is modified during that window
Best case (all writes land on same pages):
COW copies: O(few pages) → negligible overhead
Worst case (each write touches a unique page):
Modified data: 3 GB
COW copies : 3 GB × 1 (one physical copy per modified page)
Peak memory : 10 GB (original) + 3 GB (COW copies) = 13 GB
Practical estimate:
For typical workloads with hot keys, actual COW overhead
is 20-40% of write volume, not 100%.
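The best-case and worst-case arithmetic above reduces to counting dirtied pages. A minimal sketch, assuming 4 KB pages and treating memory as a flat array of pages (`cow_peak_memory` is a hypothetical helper, not a Redis function):

```python
GIB = 1024 ** 3


def cow_peak_memory(used_bytes, dirtied_pages, page_size=4096):
    """Peak memory during BGSAVE: original pages plus one copy per dirtied page."""
    total_pages = -(-used_bytes // page_size)          # ceiling division
    copied_pages = min(dirtied_pages, total_pages)     # a page is copied at most once
    return used_bytes + copied_pages * page_size


if __name__ == "__main__":
    used = 10 * GIB
    pages = used // 4096
    # Worst case from the text: 30% of pages are each written at least once.
    print(cow_peak_memory(used, pages * 30 // 100) / GIB, "GiB")
    # Best case: all writes land on the same three pages.
    print(cow_peak_memory(used, 3) - used, "bytes of overhead")
```

The cap in `min()` reflects that a page is copied only on its first write after fork(); repeated writes to the same page cost nothing extra, which is why hot-key workloads land well below the worst case.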
Monitoring COW size (Redis 4.0+):
redis-cli INFO persistence
# rdb_last_cow_size:3145728 → 3 MB copied during last BGSAVE
# aof_last_cow_size:1048576 → 1 MB during AOF rewrite
13.3.4 Transparent Huge Pages (THP): A Dangerous Interaction
Linux THP uses 2 MB pages instead of 4 KB pages. When COW triggers on a THP page, the kernel must copy 2 MB instead of 4 KB, a 512× increase in copy granularity.
# Symptom: BGSAVE takes unusually long, rdb_last_cow_size is enormous
# Check THP status
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never   (the bracketed value is active; "always" is dangerous for Redis)
# Disable THP permanently
echo never > /sys/kernel/mm/transparent_hugepage/enabled
# Add to /etc/rc.local for persistence across reboots
# Also disable defrag
echo never > /sys/kernel/mm/transparent_hugepage/defrag
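A deployment check can read the same sysfs file and refuse to proceed when THP is set to "always". The sketch below is one way to do that; `active_thp_mode`, `read_thp_mode`, and `cow_amplification` are hypothetical helper names, and the parsing assumes the bracketed-token format shown above.

```python
THP_ENABLED_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def active_thp_mode(sysfs_text):
    """Return the bracketed (active) mode from e.g. '[always] madvise never'."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active mode found in {sysfs_text!r}")


def read_thp_mode(path=THP_ENABLED_PATH):
    """Read the current THP mode from sysfs (Linux only)."""
    with open(path) as f:
        return active_thp_mode(f.read())


def cow_amplification(huge_page_bytes=2 * 1024 * 1024, base_page_bytes=4096):
    """How many times more bytes one COW fault copies with huge pages."""
    return huge_page_bytes // base_page_bytes


if __name__ == "__main__":
    try:
        mode = read_thp_mode()
    except OSError:
        mode = "unknown (sysfs file not available)"
    print("THP mode:", mode)
    print("copy amplification per fault:", cow_amplification())
```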
13.3.5 Dict Resize Freeze During BGSAVE
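Redis avoids resizing hash tables while a child process exists, so that rehashing does not dirty shared pages and trigger unnecessary COW (a resize is still forced once a table's load factor exceeds a safety ratio):
void updateDictResizePolicy(void) {
    if (server.rdb_child_pid == -1 && server.aof_child_pid == -1)
        dictEnableResize();
    else
        dictDisableResize(); /* avoid resizing while a child exists */
}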
Without this, every dict rehash would COW-copy the old and new hash table arrays, potentially doubling memory usage.
13.4 rdbSave() Serialization Pipeline
/* Simplified: error checks on each write are omitted for brevity */
int rdbSaveRio(rio *rdb, int *error, int rdbflags, rdbSaveInfo *rsi) {
    char magic[10];
    int j;

    /* 1. Magic header: "REDIS" + 4-digit version */
    snprintf(magic, sizeof(magic), "REDIS%04d", RDB_VERSION);
    rdbWriteRaw(rdb, magic, 9); /* e.g. "REDIS0011" */

    /* 2. AUX fields: metadata key-value pairs */
    rdbSaveAuxFieldStrStr(rdb, "redis-ver", REDIS_VERSION); /* "7.2.0" */
    rdbSaveAuxFieldStrInt(rdb, "redis-bits", sizeof(long)*8); /* 64 */
    rdbSaveAuxFieldStrInt(rdb, "ctime", time(NULL));
    rdbSaveAuxFieldStrInt(rdb, "used-mem", zmalloc_used_memory());
    if (rsi) {
        rdbSaveAuxFieldStrInt(rdb, "repl-stream-db", rsi->repl_stream_db);
        rdbSaveAuxFieldStrStr(rdb, "repl-id", server.replid);
        rdbSaveAuxFieldStrInt(rdb, "repl-offset", server.master_repl_offset);
    }

    /* 3. Iterate over all databases */
    for (j = 0; j < server.dbnum; j++) {
        redisDb *db = server.db + j;
        dict *d = db->dict;
        if (dictSize(d) == 0) continue;

        /* SELECTDB opcode + db index */
        rdbSaveType(rdb, RDB_OPCODE_SELECTDB); /* 0xFE */
        rdbSaveLen(rdb, j);

        /* RESIZEDB opcode + sizes (speeds up loading by pre-sizing dicts) */
        rdbSaveType(rdb, RDB_OPCODE_RESIZEDB); /* 0xFB */
        rdbSaveLen(rdb, dictSize(d));
        rdbSaveLen(rdb, dictSize(db->expires));

        /* Iterate every key-value pair */
        dictIterator *di = dictGetSafeIterator(d);
        dictEntry *de;
        while ((de = dictNext(di)) != NULL) {
            sds keystr = dictGetKey(de);
            robj key, *o = dictGetVal(de);
            /* Wrap the sds key in a stack-allocated string object (no copy) */
            initStaticStringObject(key, keystr);
            long long expiretime = getExpire(db, &key);

            /* Expiry time (millisecond precision, opcode 0xFC) */
            if (expiretime != -1) {
                rdbSaveType(rdb, RDB_OPCODE_EXPIRETIME_MS);
                rdbSaveMillisecondTime(rdb, expiretime);
            }

            /* LRU idle time or LFU frequency */
            if (server.maxmemory_policy & MAXMEMORY_FLAG_LRU) {
                rdbSaveType(rdb, RDB_OPCODE_IDLE);
                rdbSaveLen(rdb, estimateObjectIdleTime(o) / 1000);
            } else if (server.maxmemory_policy & MAXMEMORY_FLAG_LFU) {
                /* o->lru is a bit-field, so its address cannot be taken:
                 * copy the counter into a byte first. */
                uint8_t freq = LFUDecrAndReturn(o);
                rdbSaveType(rdb, RDB_OPCODE_FREQ);
                rdbWriteRaw(rdb, &freq, 1); /* LFU counter byte */
            }

            /* Value type opcode */
            rdbSaveObjectType(rdb, o);
            /* Key: always a string */
            rdbSaveStringObject(rdb, &key);
            /* Value: type-specific encoding */
            rdbSaveObject(rdb, o, &key, j);
        }
        dictReleaseIterator(di);
    }

    /* 4. EOF marker */
    rdbSaveType(rdb, RDB_OPCODE_EOF); /* 0xFF */

    /* 5. CRC-64 checksum of everything written so far */
    uint64_t cksum = rdb->cksum;
    memrev64ifbe(&cksum); /* stored little-endian */
    rdbWriteRaw(rdb, &cksum, 8);
    return C_OK;
}
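Step 5 can be reproduced outside Redis with a bitwise CRC-64. The sketch below assumes the Jones polynomial 0xAD93D23594C935A9 in reflected form with a zero initial value, which is the variant commonly catalogued as CRC-64/REDIS; the function names are hypothetical, and the bit-at-a-time loop is chosen for clarity rather than speed.

```python
# Bit-reversed form of the polynomial 0xAD93D23594C935A9.
_POLY_REFLECTED = 0x95AC9329AC4BC9B5


def crc64_redis(data, crc=0):
    """Bitwise reflected CRC-64 (init 0, no final XOR)."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ _POLY_REFLECTED
            else:
                crc >>= 1
    return crc


def append_checksum(payload):
    """Append the 8-byte little-endian checksum, as the RDB trailer does."""
    return payload + crc64_redis(payload).to_bytes(8, "little")


def verify_checksum(blob):
    """True if the trailer matches, or if it is zero (checksum disabled)."""
    body, trailer = blob[:-8], blob[-8:]
    stored = int.from_bytes(trailer, "little")
    return stored == 0 or stored == crc64_redis(body)


if __name__ == "__main__":
    print(hex(crc64_redis(b"123456789")))
    print(verify_checksum(append_checksum(b"REDIS0011\xff")))
```

A useful property of this CRC layout is that running the CRC over the body plus its own little-endian trailer yields zero, so a verifier can also check `crc64_redis(blob) == 0` in a single pass.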
13.5 Encoding Formats
13.5.1 Length Encoding
Redis uses a compact variable-length encoding for all integer lengths:
Byte pattern            Meaning
---------------------------------------------------------
00xxxxxx                6-bit length: 0-63 (1 byte total)
01xxxxxx xxxxxxxx       14-bit length: 0-16383 (2 bytes total)
10000000 <4 bytes>      32-bit big-endian length (5 bytes total)
10000001 <8 bytes>      64-bit big-endian length (9 bytes total)
11000000 <1 byte>       Special: string holding an integer, stored as int8
11000001 <2 bytes>      Special: string holding an integer, stored as int16 (little-endian)
11000010 <4 bytes>      Special: string holding an integer, stored as int32 (little-endian)
11000011 ...            Special: LZF-compressed string
Examples:
Length 5: 0x05 (00000101, fits in 6 bits)
Length 200: 0x40 0xC8 (01000000 11001000, 14-bit: 0b00_11001000 = 200)
Length 100,000: 0x80 0x00 0x01 0x86 0xA0 (32-bit big-endian)
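The table and examples above translate directly into an encoder and decoder. This is a minimal sketch of the length encoding only; `encode_length` and `decode_length` are hypothetical names, and the decoder simply flags the "special" 11xxxxxx forms rather than interpreting them.

```python
import struct


def encode_length(n):
    """Encode a non-negative length using the RDB variable-length scheme."""
    if n < 0:
        raise ValueError("length must be non-negative")
    if n < (1 << 6):
        return bytes([n])                              # 00xxxxxx
    if n < (1 << 14):
        return bytes([0x40 | (n >> 8), n & 0xFF])      # 01xxxxxx xxxxxxxx
    if n <= 0xFFFFFFFF:
        return b"\x80" + struct.pack(">I", n)          # 32-bit big-endian
    return b"\x81" + struct.pack(">Q", n)              # 64-bit big-endian


def decode_length(buf, pos=0):
    """Return (value, is_special, next_pos) for the length starting at pos."""
    first = buf[pos]
    kind = first >> 6
    if kind == 0:
        return first & 0x3F, False, pos + 1
    if kind == 1:
        return ((first & 0x3F) << 8) | buf[pos + 1], False, pos + 2
    if kind == 3:
        # Special encoding: the low 6 bits select int8/int16/int32/LZF.
        return first & 0x3F, True, pos + 1
    if first == 0x80:
        return struct.unpack_from(">I", buf, pos + 1)[0], False, pos + 5
    if first == 0x81:
        return struct.unpack_from(">Q", buf, pos + 1)[0], False, pos + 9
    raise ValueError(f"unknown length prefix 0x{first:02x}")


if __name__ == "__main__":
    for n in (5, 200, 100_000):
        print(n, encode_length(n).hex(" "))
```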
13.5.2 String Object Encoding
ssize_t rdbSaveRawString(rio *rdb, unsigned char *s, size_t len) {
    /* 1. Attempt integer encoding (saves length prefix for small integers) */
    if (len <= 11) {
        long long value;
        if (string2ll((char*)s, len, &value)) {
            /* writes 0xC0 + int8, 0xC1 + int16, or 0xC2 + int32 when the
             * value fits in 32 bits; larger values are written as a string */
            return rdbSaveLongLongAsStringObject(rdb, value);
        }
    }

    /* 2. LZF compression for strings > 20 bytes (if rdbcompression=yes) */
    if (server.rdb_compression && len > 20) {
        /* Only worthwhile if at least 4 bytes are saved, so cap the output */
        size_t outlen = len - 4;
        void *out = zmalloc(outlen + 1);
        size_t comprlen = lzf_compress(s, len, out, outlen);
        if (comprlen > 0) { /* 0 means the result did not fit in outlen */
            ssize_t nwritten = rdbSaveLzfBlob(rdb, out, comprlen, len);
            zfree(out);
            return nwritten;
        }
        zfree(out);
    }

    /* 3. Raw: length prefix + raw bytes */
    if (rdbSaveLen(rdb, len) == -1) return -1;
    return rdbWriteRaw(rdb, s, len);
}
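The decision order above (integer first, raw last) can be modelled in a few lines. The sketch below deliberately leaves out the LZF step, since LZF is not part of the Python standard library; `encode_string` and `_encode_length` are hypothetical names, and the round-trip check stands in for the strictness of string2ll (no leading zeros, signs, or whitespace).

```python
import struct


def _encode_length(n):
    """RDB length prefix (6-bit, 14-bit, or 32-bit form)."""
    if n < (1 << 6):
        return bytes([n])
    if n < (1 << 14):
        return bytes([0x40 | (n >> 8), n & 0xFF])
    return b"\x80" + struct.pack(">I", n)


def encode_string(s):
    """Encode a byte string as an RDB string object (no LZF in this sketch)."""
    if len(s) <= 11:
        try:
            text = s.decode("ascii")
            value = int(text)
        except ValueError:          # also covers UnicodeDecodeError
            value = None
        else:
            if str(value) != text:  # reject "007", "+5", " 5", "1_000"
                value = None
        if value is not None:
            if -(1 << 7) <= value < (1 << 7):
                return b"\xc0" + struct.pack("<b", value)
            if -(1 << 15) <= value < (1 << 15):
                return b"\xc1" + struct.pack("<h", value)
            if -(1 << 31) <= value < (1 << 31):
                return b"\xc2" + struct.pack("<i", value)
    # Fallback: length prefix followed by the raw bytes.
    return _encode_length(len(s)) + s


if __name__ == "__main__":
    for sample in (b"64", b"1000", b"-1", b"value", b"007"):
        print(sample, "->", encode_string(sample).hex(" "))
```

The first sample reproduces the `C0 40` bytes that appear for the redis-bits AUX value in the hex walkthrough.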
13.5.3 Type-Specific Serialization
| Redis type | Internal encoding | RDB serialization |
|---|---|---|
| String | INT | Compact integer (0xC0/C1/C2 prefix) |
| String | EMBSTR / RAW | Length-prefixed raw bytes (optionally LZF) |
| List | LISTPACK | Listpack blob (length + raw bytes) |
| List | QUICKLIST | Node count + per-node (listpack + compress flag) |
| Hash | LISTPACK | Listpack blob |
| Hash | HT | Count + alternating key/value strings |
| Set | LISTPACK | Listpack blob |
| Set | INTSET | Intset blob (encoding + length + raw int array) |
| Set | HT | Count + member strings |
| ZSet | LISTPACK | Listpack blob |
| ZSet | SKIPLIST | Count + [member string + score as 8-byte double] pairs |
| Stream | - | Encoded Radix Tree + consumer groups |
13.6 RDB File Structure: Hex Walkthrough
Minimal RDB for one key SET key value:
Offset Hex bytes Decoded value
------------------------------------------------------------------
0x0000 52 45 44 49 53 30 30 31 31 "REDIS0011" (magic + version 11)
0x0009 FA RDB_OPCODE_AUX
0x000A 09 AUX key length: 9
0x000B 72 65 64 69 73 2D 76 65 72 "redis-ver"
0x0014 05 AUX val length: 5
0x0015 37 2E 32 2E 30 "7.2.0"
0x001A FA RDB_OPCODE_AUX
0x001B 0A key length: 10
0x001C 72 65 64 69 73 2D 62 69 74 73 "redis-bits"
0x0026 C0 40 int encoding: 0xC0 = int8, 0x40 = 64
0x0028 FA 05 63 74 69 6D 65 ... "ctime" AUX field
(more AUX fields...)
0x00XX FE RDB_OPCODE_SELECTDB (0xFE)
0x00XX 00 DB index: 0
0x00XX FB RDB_OPCODE_RESIZEDB (0xFB)
0x00XX 01 dict_size: 1
0x00XX 00 expires_size: 0
0x00XX 00 Type opcode: OBJ_STRING (0x00)
0x00XX 03 Key length: 3
0x00XX 6B 65 79 Key: "key"
0x00XX 05 Value length: 5
0x00XX 76 61 6C 75 65 Value: "value"
0x00XX FF RDB_OPCODE_EOF (0xFF)
0x00XX XX XX XX XX XX XX XX XX CRC-64 checksum (8 bytes, little-endian)
If the key had EX 100, the expiry section would appear before the type opcode:
0x00XX FC RDB_OPCODE_EXPIRETIME_MS (0xFC)
0x00XX 00 50 AF 6E A2 01 00 00 8 bytes: abs expiry in ms (little-endian)
0x00XX 00 Type opcode: OBJ_STRING
...
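The walkthrough can be turned around into a tiny writer that emits the same byte sequence. This is a sketch under several assumptions: `build_minimal_rdb` is a hypothetical helper, AUX fields are omitted, keys and values must be shorter than 64 bytes so the 1-byte length form applies, and the trailer is written as eight zero bytes, which is assumed to be accepted by the loader as "checksum disabled".

```python
import struct

RDB_OPCODE_EXPIRETIME_MS = 0xFC
RDB_OPCODE_RESIZEDB = 0xFB
RDB_OPCODE_SELECTDB = 0xFE
RDB_OPCODE_EOF = 0xFF
RDB_TYPE_STRING = 0x00


def _short_string(s):
    """Length-prefixed raw string; only the 6-bit length form is supported."""
    if len(s) >= 64:
        raise ValueError("this sketch only handles strings shorter than 64 bytes")
    return bytes([len(s)]) + s


def build_minimal_rdb(key, value, expire_ms=None, version=11):
    """Build an RDB image holding one string key in database 0."""
    out = bytearray(b"REDIS%04d" % version)            # magic + version
    out += bytes([RDB_OPCODE_SELECTDB, 0x00])          # DB index 0
    expires = 1 if expire_ms is not None else 0
    out += bytes([RDB_OPCODE_RESIZEDB, 0x01, expires]) # dict size, expires size
    if expire_ms is not None:
        # Absolute expiry in milliseconds, 8 bytes little-endian.
        out += bytes([RDB_OPCODE_EXPIRETIME_MS]) + struct.pack("<Q", expire_ms)
    out += bytes([RDB_TYPE_STRING])
    out += _short_string(key) + _short_string(value)
    out += bytes([RDB_OPCODE_EOF])
    out += b"\x00" * 8                                 # zero checksum (assumption)
    return bytes(out)


if __name__ == "__main__":
    print(build_minimal_rdb(b"key", b"value").hex(" "))
```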
13.7 Load Path: rdbLoad()
/* Simplified: I/O error checks are omitted for brevity */
int rdbLoad(char *filename, rdbSaveInfo *rsi, int rdbflags) {
    rio rdb;
    FILE *fp = fopen(filename, "r");
    if (fp == NULL) return C_ERR;
    rioInitWithFile(&rdb, fp);

    /* Verify magic string "REDIS" */
    char buf[1024];
    rioRead(&rdb, buf, 9);
    buf[9] = '\0'; /* terminate so that atoi() below is safe */
    if (memcmp(buf, "REDIS", 5) != 0) { /* not an RDB file */ }

    /* Parse version */
    int rdbver = atoi(buf + 5);
    if (rdbver < 1 || rdbver > RDB_VERSION) { /* incompatible */ }

    redisDb *db = server.db;
    long long expiretime = -1, now = mstime();
    long long lfu_freq = -1, lru_idle = -1, lru_clock = LRU_CLOCK();
    int error;

    while (1) {
        int type = rdbLoadType(&rdb); /* read 1 byte */

        /* Opcodes that annotate the next key: remember the value, read on */
        if (type == RDB_OPCODE_EXPIRETIME_MS) {
            expiretime = rdbLoadMillisecondTime(&rdb, rdbver);
            continue;
        } else if (type == RDB_OPCODE_EXPIRETIME) { /* old seconds precision */
            expiretime = (long long)rdbLoadTime(&rdb) * 1000;
            continue;
        } else if (type == RDB_OPCODE_FREQ) {
            uint8_t byte;
            rioRead(&rdb, &byte, 1);
            lfu_freq = byte;
            continue;
        } else if (type == RDB_OPCODE_IDLE) {
            lru_idle = rdbLoadLen(&rdb, NULL);
            continue;
        } else if (type == RDB_OPCODE_EOF) {
            break;
        } else if (type == RDB_OPCODE_SELECTDB) {
            int dbid = rdbLoadLen(&rdb, NULL);
            db = server.db + dbid;
            continue;
        } else if (type == RDB_OPCODE_RESIZEDB) {
            uint64_t db_size = rdbLoadLen(&rdb, NULL);
            uint64_t exp_size = rdbLoadLen(&rdb, NULL);
            dictExpand(db->dict, db_size);
            dictExpand(db->expires, exp_size);
            continue;
        } else if (type == RDB_OPCODE_AUX) {
            robj *auxkey = rdbLoadStringObject(&rdb);
            robj *auxval = rdbLoadStringObject(&rdb);
            loadAuxField(rsi, auxkey, auxval);
            continue;
        }

        /* Regular key-value */
        robj *key = rdbLoadStringObject(&rdb);
        robj *val = rdbLoadObject(type, &rdb, key->ptr, db->id, &error);

        /* Skip already-expired keys (the real code does this only on a master) */
        if (expiretime != -1 && expiretime < now) {
            decrRefCount(key); decrRefCount(val);
        } else {
            dbAdd(db, key, val);
            if (expiretime != -1) setExpire(NULL, db, key, expiretime);
            /* Restore the eviction metadata read from IDLE / FREQ */
            objectSetLRUOrLFU(val, lfu_freq, lru_idle, lru_clock, 1000);
        }
        /* Reset per-key state before the next entry */
        expiretime = -1; lfu_freq = -1; lru_idle = -1;
    }

    /* Verify CRC-64 checksum (present since RDB version 5) */
    if (rdbver >= 5) {
        uint64_t cksum_stored, cksum_computed = rdb.cksum;
        rioRead(&rdb, &cksum_stored, 8);
        memrev64ifbe(&cksum_stored);
        if (cksum_stored == 0) {
            /* file was saved with rdbchecksum no: nothing to verify */
        } else if (cksum_stored != cksum_computed) {
            serverLog(LL_WARNING, "Wrong RDB checksum. Aborting now.");
            exit(1);
        }
    }
    fclose(fp);
    return C_OK;
}
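The same opcode loop can be written as a small standalone reader, which is handy for inspecting test files. The sketch below is not a complete RDB parser: `load_rdb` and its helpers are hypothetical, only string values are supported, LZF-compressed strings and all other value types raise NotImplementedError, and the trailing checksum is not verified.

```python
import struct

OP_IDLE, OP_FREQ, OP_AUX, OP_RESIZEDB = 0xF8, 0xF9, 0xFA, 0xFB
OP_EXPIRE_MS, OP_EXPIRE_S, OP_SELECTDB, OP_EOF = 0xFC, 0xFD, 0xFE, 0xFF
TYPE_STRING = 0x00


def _read_length(buf, pos):
    """Return (value, is_special, next_pos)."""
    first = buf[pos]
    kind = first >> 6
    if kind == 0:
        return first & 0x3F, False, pos + 1
    if kind == 1:
        return ((first & 0x3F) << 8) | buf[pos + 1], False, pos + 2
    if kind == 3:
        return first & 0x3F, True, pos + 1
    if first == 0x80:
        return struct.unpack_from(">I", buf, pos + 1)[0], False, pos + 5
    if first == 0x81:
        return struct.unpack_from(">Q", buf, pos + 1)[0], False, pos + 9
    raise ValueError(f"bad length prefix 0x{first:02x}")


def _read_string(buf, pos):
    """Return (bytes or int, next_pos); integers come from the 0xC0-0xC2 forms."""
    n, special, pos = _read_length(buf, pos)
    if not special:
        return buf[pos:pos + n], pos + n
    fmt = {0: "<b", 1: "<h", 2: "<i"}.get(n)
    if fmt is None:
        raise NotImplementedError("LZF-compressed strings are not handled here")
    return struct.unpack_from(fmt, buf, pos)[0], pos + struct.calcsize(fmt)


def load_rdb(buf):
    """Parse an RDB image containing only string keys."""
    if buf[:5] != b"REDIS":
        raise ValueError("wrong signature")
    result = {"version": int(buf[5:9].decode("ascii")), "aux": {}, "dbs": {}}
    pos, db, expire_ms = 9, 0, None
    while True:
        op = buf[pos]
        pos += 1
        if op == OP_EOF:
            break
        if op == OP_AUX:
            k, pos = _read_string(buf, pos)
            v, pos = _read_string(buf, pos)
            result["aux"][k] = v
        elif op == OP_SELECTDB:
            db, _, pos = _read_length(buf, pos)
        elif op == OP_RESIZEDB:
            _, _, pos = _read_length(buf, pos)   # dict size hint
            _, _, pos = _read_length(buf, pos)   # expires size hint
        elif op == OP_EXPIRE_MS:
            expire_ms = struct.unpack_from("<Q", buf, pos)[0]
            pos += 8
        elif op == OP_EXPIRE_S:
            expire_ms = struct.unpack_from("<I", buf, pos)[0] * 1000
            pos += 4
        elif op == OP_FREQ:
            pos += 1                             # LFU counter byte, ignored
        elif op == OP_IDLE:
            _, _, pos = _read_length(buf, pos)   # LRU idle seconds, ignored
        elif op == TYPE_STRING:
            key, pos = _read_string(buf, pos)
            val, pos = _read_string(buf, pos)
            result["dbs"].setdefault(db, {})[key] = (val, expire_ms)
            expire_ms = None                     # expiry applies to one key only
        else:
            raise NotImplementedError(f"opcode/type 0x{op:02x} not handled")
    return result
```

Fed the bytes from the hex walkthrough, this returns version 11, the two AUX fields (with redis-bits decoded to the integer 64), and one string key in database 0.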
Load performance benchmarks:
| Dataset | RDB file size | Load time | Equivalent AOF replay |
|---|---|---|---|
| 1M String keys | 80 MB | ~2 s | ~15 s |
| 10M String keys | 800 MB | ~20 s | ~150 s |
| 1M Hash keys (small) | 200 MB | ~8 s | ~40 s |
| 100M String keys | 8 GB | ~3.5 min | ~30 min |
In the figures above, loading an RDB file is roughly 5-9× faster than replaying the equivalent AOF for the same dataset.
13.8 Checking and Repairing RDB Files
# Check file integrity
redis-check-rdb /var/lib/redis/dump.rdb
# Sample healthy output:
# [offset 0] Checking RDB file dump.rdb
# [offset 26] AUX FIELD redis-ver = '7.2.0'
# [offset 40] AUX FIELD redis-bits = '64'
# [offset 57] AUX FIELD ctime = '1704067200'
# [offset 69] AUX FIELD used-mem = '1048576'
# [offset 84] Selecting DB ID 0
# [offset 87] Key count: 10000
# [offset 524352] Checksum OK
# \o/ RDB looks OK!
# Sample corrupted output:
# [offset 524288] FATAL: RDB CRC error
# Expected: 0xdeadbeefcafe1234
# Got: 0x12345678abcdef00
Corruption scenarios:
| Scenario | Symptom | Recovery |
|---|---|---|
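| Truncated file (e.g. interrupted copy or full disk) | EOF before 0xFF opcode | Redis aborts the load; restore from the previous backup. BGSAVE itself writes to a temp file and renames it on success, so a crash mid-save leaves the old dump.rdb intact |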
| CRC mismatch | Checksum error at end | Data may be intact but unverifiable; try loading with rdbchecksum no |
| Wrong magic | "Wrong signature" error | File is not RDB (wrong path, corrupted header) |
| Version too new | "Can't handle RDB format version N" | Load the file with a Redis release that supports RDB version N (upgrade Redis) |
| Bit rot | Random CRC or parse error | Restore from backup; never serve corrupted data |
13.9 Production Best Practices
# Layered save configuration
save 3600 1
save 300 100
save 60 10000
# Enable compression and checksum
rdbcompression yes
rdbchecksum yes
# Memory headroom: Redis should have at least 50% free RAM for COW
# If Redis uses 10 GB, system should have 15+ GB available
# Monitor continuously
watch -n 5 "redis-cli INFO persistence | grep -E 'rdb_|aof_'"
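#!/bin/bash
# Backup script (run via cron every hour)
set -eu
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
redis-cli BGSAVE
# Wait for completion by polling instead of sleeping for a fixed time
while redis-cli INFO persistence | grep -q '^rdb_bgsave_in_progress:1'; do
    sleep 1
done
# Abort (via set -e) if the background save did not succeed
redis-cli INFO persistence | grep -q '^rdb_last_bgsave_status:ok'
cp /var/lib/redis/dump.rdb "/backup/redis/dump_${TIMESTAMP}.rdb"
# Upload to object storage
aws s3 cp "/backup/redis/dump_${TIMESTAMP}.rdb" s3://my-bucket/redis-backups/
# Prune backups older than 7 days
find /backup/redis/ -name "dump_*.rdb" -mtime +7 -delete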
Memory sizing formula:
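Required RAM ≈ redis_used_memory × (1 + min(1, peak_write_rate_during_bgsave × bgsave_duration_s / total_keys_count))
(write rate in writes per second; assumes each write touches a different key and that the fraction of keys modified approximates the fraction of pages copied)
Conservative rule: provision 2× the Redis working set size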
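As a worked instance of the sizing formula, the sketch below evaluates it with integer arithmetic. `required_ram_bytes` is a hypothetical helper; it caps the modified fraction at 1 on the assumption that COW can duplicate each page at most once, which is also where the 2× conservative rule comes from.

```python
GIB = 1024 ** 3


def required_ram_bytes(used_memory, write_rate_per_s, bgsave_duration_s, total_keys):
    """Estimate peak RAM during BGSAVE from the sizing formula above."""
    if total_keys <= 0:
        raise ValueError("total_keys must be positive")
    # Keys modified during the save, assuming every write hits a distinct key.
    dirtied = min(total_keys, write_rate_per_s * bgsave_duration_s)
    return used_memory + used_memory * dirtied // total_keys


if __name__ == "__main__":
    # 10 GiB dataset, 10M keys, 50k writes/s for a 60 s BGSAVE: 30% modified.
    print(required_ram_bytes(10 * GIB, 50_000, 60, 10_000_000) / GIB, "GiB")
    # Write rate high enough to touch every key: the estimate saturates at 2x.
    print(required_ram_bytes(10 * GIB, 1_000_000, 60, 10_000_000) / GIB, "GiB")
```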