Chapter 14

AOF Persistence: Write-After-Log and Rewrite

14.1 Design Philosophy: Write-After-Log vs WAL

Redis AOF (Append-Only File) uses a write-after-log design: a command is executed first and logged afterwards. This is the reverse of the classical Write-Ahead Logging (WAL) used in relational databases, and the deviation is deliberate.

Classical WAL (used by PostgreSQL, MySQL InnoDB):
  Write log → Execute command → Respond to client
  Pro: log exists before data changes; supports crash recovery and rollback
  Con: a write must wait for its log record to reach disk (fsync) before it can be acknowledged

Redis AOF (write-after-log):
  Execute command → Write log → Respond to client
  Pro: command execution never blocks on disk; no pre-validation of syntax needed
  Con: if Redis crashes between execution and log write, those commands are lost

Why Redis chose write-after-log:
  1. Redis is an in-memory store: the "source of truth" is RAM, not disk
  2. Only successfully executed commands are logged (no need to log failed attempts)
  3. The hot path (execute → respond) is entirely in-memory, typically on the order of microseconds per command

Important implication: AOF cannot support rollback. If a command executes and is later found to be logically wrong (e.g., wrong value), you cannot "undo" it via AOF. AOF is replay-only.
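
The ordering can be illustrated with a toy store. This is a minimal sketch, not Redis code: the class `TinyStore`, its two commands, and the in-memory `log` list standing in for the AOF file are all assumptions made for illustration. A command reaches the log only after it has executed successfully, and recovery is a plain replay from the start.

```python
class TinyStore:
    """Toy key-value store that logs a command only after it executed."""

    def __init__(self):
        self.data = {}
        self.log = []  # stands in for the AOF file

    def execute(self, *args):
        name = args[0].upper()
        if name == "SET" and len(args) == 3:
            self.data[args[1]] = args[2]
        elif name == "INCR" and len(args) == 2:
            # Raises ValueError if the stored value is not an integer
            self.data[args[1]] = str(int(self.data.get(args[1], "0")) + 1)
        else:
            raise ValueError("unknown command or wrong arity")
        # Reached only on success: failed commands never enter the log
        self.log.append(args)


def replay(log):
    """Rebuild state by re-executing the log from the beginning."""
    store = TinyStore()
    for args in log:
        store.execute(*args)
    return store.data


store = TinyStore()
store.execute("SET", "name", "alice")
store.execute("INCR", "hits")
try:
    store.execute("INCR", "name")  # not an integer: fails, nothing is logged
except ValueError:
    pass
print(len(store.log), replay(store.log) == store.data)
```

The failed INCR leaves no trace in the log, so replay reproduces the live state exactly. There is no way to step backwards through the log, which is the replay-only property described above.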


14.2 AOF Wire Format

In pure AOF mode (no RDB preamble), the AOF file is plain RESP (Redis Serialization Protocol) text and can be read in any text editor.

# Commands executed:
SET hello world
EXPIRE hello 100
RPUSH mylist a b c
HSET profile name Alice age 30

# Resulting AOF content (formatted for readability):
*2\r\n$6\r\nSELECT\r\n$1\r\n0\r\n
*3\r\n$3\r\nSET\r\n$5\r\nhello\r\n$5\r\nworld\r\n
*3\r\n$9\r\nPEXPIREAT\r\n$5\r\nhello\r\n$13\r\n1704067300000\r\n
*5\r\n$5\r\nRPUSH\r\n$6\r\nmylist\r\n$1\r\na\r\n$1\r\nb\r\n$1\r\nc\r\n
*6\r\n$4\r\nHSET\r\n$7\r\nprofile\r\n$4\r\nname\r\n$5\r\nAlice\r\n$3\r\nage\r\n$2\r\n30\r\n
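
The encoding shown above is an array header followed by one length-prefixed bulk string per argument. A minimal sketch of that serialization (the function name `encode_command` is hypothetical, not part of any Redis client library):

```python
def encode_command(*args):
    """Serialize one command as a RESP array of bulk strings."""
    parts = [b"*%d\r\n" % len(args)]  # array header: number of arguments
    for arg in args:
        raw = arg if isinstance(arg, bytes) else str(arg).encode("utf-8")
        # Bulk string: $<byte length>\r\n<payload>\r\n
        parts.append(b"$%d\r\n%s\r\n" % (len(raw), raw))
    return b"".join(parts)


print(encode_command("SET", "hello", "world"))
print(encode_command("RPUSH", "mylist", "a", "b", "c"))
```

Lengths are byte counts, not character counts, which is why the arguments are encoded before being measured.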

Two key transformations happen during AOF serialization:

1. Relative TTLs become absolute timestamps

EXPIRE hello 100 is stored as PEXPIREAT hello 1704067300000 (absolute millisecond epoch). This ensures that replaying the AOF 10 minutes later doesn't give hello an extra 10 minutes of life.

2. Some commands are rewritten for equivalence

SETEX key 100 value  →  SET key value + PEXPIREAT key <abs_ts>
GETSET key newval    →  SET key newval (GET result doesn't need logging)
SPOP key             →  SREM key <popped_member> (the random choice is logged as its deterministic effect)
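
The relative-to-absolute TTL conversion can be sketched as follows. The helper `to_absolute_expire` and its `now_ms` parameter are assumptions for illustration; the real translation happens inside the server, not in client code.

```python
import time


def to_absolute_expire(args, now_ms=None):
    """Rewrite EXPIRE/PEXPIRE (relative TTL) as PEXPIREAT (absolute ms)."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    name = args[0].upper()
    if name == "EXPIRE":   # TTL given in seconds
        return ["PEXPIREAT", args[1], str(now_ms + int(args[2]) * 1000)]
    if name == "PEXPIRE":  # TTL given in milliseconds
        return ["PEXPIREAT", args[1], str(now_ms + int(args[2]))]
    return list(args)      # everything else is logged unchanged


print(to_absolute_expire(["EXPIRE", "hello", "100"], now_ms=1704067200000))
```

With the clock fixed at 1704067200000 ms, a 100-second TTL becomes the timestamp 1704067300000, and replaying the log later yields the same deadline regardless of when the replay happens.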

14.3 The Three fsync Strategies

14.3.1 Strategy Details

# appendfsync always: maximum durability
# After each command, call fsync() synchronously
# Main thread blocks until disk acknowledges the write
appendfsync always

# appendfsync everysec: balanced (default)
# Background thread calls fsync() once per second
# Main thread only calls write(), which is normally non-blocking
appendfsync everysec

# appendfsync no: maximum throughput
# Never call fsync(); rely on OS to flush page cache
# On Linux: vm.dirty_expire_centisecs = 3000 → up to 30s of data at risk
appendfsync no
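
The difference between the three policies comes down to when fsync() follows write(). The sketch below is a simplified model under stated assumptions: `AofWriter` is a hypothetical class, and its everysec branch calls fsync inline, whereas Redis hands that call to a background thread.

```python
import os
import tempfile
import time


class AofWriter:
    """Append-only writer modelling the three fsync policies."""

    def __init__(self, path, policy="everysec"):
        if policy not in ("always", "everysec", "no"):
            raise ValueError("unknown policy: " + policy)
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.policy = policy
        self.last_fsync = time.monotonic()
        self.fsync_calls = 0

    def append(self, payload):
        os.write(self.fd, payload)  # data lands in the OS page cache
        now = time.monotonic()
        due = self.policy == "everysec" and now - self.last_fsync >= 1.0
        if self.policy == "always" or due:
            os.fsync(self.fd)       # force the data to stable storage
            self.last_fsync = now
            self.fsync_calls += 1
        # policy "no": never fsync; the kernel flushes on its own schedule

    def close(self):
        os.close(self.fd)


with tempfile.TemporaryDirectory() as tmp:
    results = {}
    for policy in ("always", "no"):
        writer = AofWriter(os.path.join(tmp, policy + ".aof"), policy)
        for _ in range(3):
            writer.append(b"*1\r\n$4\r\nPING\r\n")
        writer.close()
        results[policy] = writer.fsync_calls
print(results)
```

Three appends cost three fsync calls under always and none under no; everysec sits between the two, bounded by the one-second interval.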

14.3.2 Internal Implementation

/* aof.c: flushAppendOnlyFile(), called from beforeSleep() */
void flushAppendOnlyFile(int force) {
    ssize_t nwritten;
    int sync_in_progress = 0;
    mstime_t latency;   /* used by the latency monitor macros below */

    if (sdslen(server.aof_buf) == 0) {
        /* No data to write, but check if a pending fsync is needed */
        if (server.aof_fsync == AOF_FSYNC_EVERYSEC &&
            server.aof_fd != -1 &&
            server.unixtime > server.aof_last_fsync &&
            !(sync_in_progress = aofFsyncInProgress())) {
            aof_background_fsync(server.aof_fd);
            server.aof_last_fsync = server.unixtime;
        }
        return;
    }

    if (server.aof_fsync == AOF_FSYNC_EVERYSEC)
        sync_in_progress = aofFsyncInProgress();

    /* If a background fsync is running and not forced, skip this cycle
       to avoid starving the background thread */
    if (server.aof_fsync == AOF_FSYNC_EVERYSEC && !force) {
        if (sync_in_progress) {
            if (server.aof_flush_postponed_start == 0) {
                server.aof_flush_postponed_start = server.unixtime;
                return;
            } else if (server.unixtime - server.aof_flush_postponed_start < 2) {
                return; /* wait up to 2 seconds for background fsync */
            }
            /* 2 seconds elapsed: force the write anyway */
            server.aof_delayed_fsync++;
        }
    }

    /* Write from AOF buffer to kernel page cache */
    nwritten = aofWrite(server.aof_fd, server.aof_buf, sdslen(server.aof_buf));

    if (nwritten != (ssize_t)sdslen(server.aof_buf)) {
        /* Short write or error: this is serious */
        server.aof_last_write_status = C_ERR;
        /* Keep only the unwritten tail so the next call retries it */
        if (nwritten > 0) {
            server.aof_current_size += nwritten;
            sdsrange(server.aof_buf, nwritten, -1);
        }
        /* While aof_last_write_status is C_ERR, write commands are rejected */
        return;
    }

    /* Full write: the whole buffer reached the kernel page cache */
    server.aof_last_write_status = C_OK;
    server.aof_flush_postponed_start = 0;
    server.aof_current_size += nwritten;
    sdsclear(server.aof_buf);

    /* fsync according to strategy */
    if (server.aof_fsync == AOF_FSYNC_ALWAYS) {
        /* Synchronous: blocks main thread */
        latencyStartMonitor(latency);
        redis_fsync(server.aof_fd);
        latencyEndMonitor(latency);
        latencyAddSampleIfNeeded("aof-fsync-always", latency);
        server.aof_last_fsync = server.unixtime;
    } else if (server.aof_fsync == AOF_FSYNC_EVERYSEC &&
               server.unixtime > server.aof_last_fsync) {
        if (!sync_in_progress) {
            /* Delegate to BIO (Background I/O) thread */
            aof_background_fsync(server.aof_fd);
            server.aof_last_fsync = server.unixtime;
        }
    }
}

14.3.3 Performance Numbers

Test environment: Redis 7.2, NVMe SSD (Samsung 990 Pro), 8-core Xeon, 50 concurrent connections, pure SET workload.

appendfsync   TPS        p50 latency   p99 latency   Max data loss
always        ~15,000    0.6 ms        8 ms          ~1 command
everysec      ~420,000   0.4 ms        0.9 ms        ~1 second
no            ~520,000   0.3 ms        0.7 ms        OS-decided (~30 s)
(no AOF)      ~580,000   0.3 ms        0.6 ms        All data

The always strategy's throughput ceiling is determined by the NVMe drive's random write latency (60-80 µs per fsync including write barrier flush), not the drive's sequential throughput.

On spinning HDDs, always degrades further to ~200-500 TPS due to rotational latency (~5-10 ms per seek).
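
A back-of-the-envelope check of these ceilings, as a rough model only: it assumes throughput is bounded purely by fsync latency, and that one fsync can cover several commands because the whole AOF buffer is flushed once per event-loop pass. The function name and the batch size of 3 are illustrative assumptions, not measurements.

```python
def fsync_bound_tps(fsync_latency_s, commands_per_fsync=1):
    """Upper bound on commands/second when every batch waits for one fsync."""
    return commands_per_fsync / fsync_latency_s


nvme = round(fsync_bound_tps(70e-6))          # one command per 70 µs fsync
hdd_single = round(fsync_bound_tps(7.5e-3))   # one command per 7.5 ms fsync
hdd_batched = round(fsync_bound_tps(7.5e-3, commands_per_fsync=3))
print(nvme, hdd_single, hdd_batched)
```

The model gives about 14,286 commands per second at 70 µs, close to the measured ~15,000. At 7.5 ms it gives 133 per second for one command per fsync and 400 when three concurrent clients share each fsync, which suggests the HDD range quoted above depends on some batching across connections.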

14.3.4 The no-appendfsync-on-rewrite Option

# Default: no (keep calling fsync even during an AOF rewrite)
# Set to yes: suspend fsync while child is running AOF rewrite

no-appendfsync-on-rewrite no

When set to yes, the main process skips fsync() while the rewrite child is hammering the disk, which avoids the I/O contention that causes latency spikes. The tradeoff: for the duration of the rewrite, durability drops to that of appendfsync no, so a power loss or kernel crash in that window can lose whatever the OS has not yet flushed (up to about 30 seconds of writes with default Linux settings).


14.4 AOF Buffer Architecture

Client command
  │
  ▼
setCommand() / xaddCommand() / etc.
  │
  ▼ (on success)
feedAppendOnlyFile()
  │
  ├──▶ server.aof_buf                ← main buffer (SDS)
  │      written to disk each event loop iteration
  │
  └──▶ server.aof_rewrite_buf_blocks ← rewrite buffer (list of blocks)
         (only written when child is running AOF rewrite)
         contains all writes since fork(); will be appended to new AOF

void feedAppendOnlyFile(struct redisCommand *cmd, int dictid,
                        robj **argv, int argc) {
    sds buf = sdsempty();

    /* Build SELECT if DB changed */
    if (dictid != server.aof_selected_db) {
        char seldb[64];
        snprintf(seldb, sizeof(seldb), "%d", dictid);
        buf = sdscatprintf(buf, "*2\r\n$6\r\nSELECT\r\n$%zu\r\n%s\r\n",
                           strlen(seldb), seldb);
        server.aof_selected_db = dictid;
    }

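    /* Simplified: commands with a relative TTL are translated so that the
       log stores an absolute PEXPIREAT timestamp (SETEX and SET ... EX/PX
       get equivalent treatment in the full source) */
    if (cmd->proc == expireCommand || cmd->proc == pexpireCommand) {
        buf = catAppendOnlyExpireAtCommand(buf, cmd, argv[1], argv[2]);
    } else {
        /* Everything else is serialized to RESP unchanged */
        buf = catAppendOnlyGenericCommand(buf, argc, argv);
    }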

    /* Append to main AOF buffer */
    server.aof_buf = sdscatlen(server.aof_buf, buf, sdslen(buf));

    /* If rewrite child is running, also append to rewrite buffer */
    if (server.child_type == CHILD_TYPE_AOF)
        aofRewriteBufferAppend((unsigned char*)buf, sdslen(buf));

    sdsfree(buf);
}

14.5 AOF Rewrite: Complete Walkthrough

14.5.1 Why Rewrite Is Needed

Without rewriting, the AOF grows unboundedly. A key incremented 10,000 times generates 10,000 AOF lines, but only the final value matters:

# 10,000 entries in AOF, all redundant once the final value is known:
*3\r\n$6\r\nINCRBY\r\n$7\r\ncounter\r\n$1\r\n1\r\n
*3\r\n$6\r\nINCRBY\r\n$7\r\ncounter\r\n$1\r\n1\r\n
...  (9998 more)

# After rewrite, one entry captures the current state:
*3\r\n$3\r\nSET\r\n$7\r\ncounter\r\n$5\r\n10000\r\n

Rewrite also makes restart time predictable: a 10 GB AOF file that compacts to 1 GB loads roughly 10× faster.
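
The effect of compaction can be shown with a small sketch. One loud caveat: Redis does not replay the old file during a rewrite, it serializes the live in-memory state. The replay in `compact` below only stands in for "whatever the current state is", and the function name and the three supported commands are assumptions for illustration.

```python
def compact(log):
    """Derive the final state from a command log and emit one SET per key."""
    state = {}
    for cmd in log:
        name, key = cmd[0].upper(), cmd[1]
        if name == "SET":
            state[key] = cmd[2]
        elif name == "INCRBY":
            state[key] = str(int(state.get(key, "0")) + int(cmd[2]))
        elif name == "DEL":
            state.pop(key, None)
        else:
            raise ValueError("unsupported command: " + name)
    # One reconstruction command per surviving key
    return [["SET", key, value] for key, value in state.items()]


log = [["INCRBY", "counter", "1"]] * 10000
compacted = compact(log)
print(len(log), "->", len(compacted), compacted)
```

Ten thousand log entries collapse into a single SET counter 10000, and a deleted key would disappear from the output entirely.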

14.5.2 Trigger Conditions

# Both conditions must be true simultaneously:
# 1. AOF file size > auto-aof-rewrite-min-size
# 2. AOF has grown by auto-aof-rewrite-percentage% since last rewrite

auto-aof-rewrite-min-size 64mb
auto-aof-rewrite-percentage 100   # triggers when AOF doubles in size

/* serverCron(): evaluated every 100 ms */
if (server.aof_state == AOF_ON &&
    !hasActiveChildProcess() &&
    server.aof_current_size > server.aof_rewrite_min_size) {
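    /* Portable ternary (a ?: b is a GNU extension); 1 avoids division by zero */
    long long base = server.aof_rewrite_base_size ? server.aof_rewrite_base_size : 1;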
    long long growth_pct = (server.aof_current_size * 100 / base) - 100;
    if (growth_pct >= server.aof_rewrite_perc) {
        serverLog(LL_NOTICE, "Starting automatic rewriting of AOF on %lld%% growth",
                  growth_pct);
        rewriteAppendOnlyFileBackground();
    }
}
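
The same two-part test, sketched outside the server for experimenting with settings. The function `should_rewrite` and its parameter names are hypothetical; the arithmetic mirrors the C excerpt above, including integer division.

```python
MB = 1024 * 1024


def should_rewrite(current_size, base_size, min_size=64 * MB, percentage=100):
    """Return True when both automatic rewrite conditions hold."""
    if current_size <= min_size:
        return False                        # condition 1: file still too small
    base = base_size or 1                   # avoid division by zero
    growth_pct = current_size * 100 // base - 100
    return growth_pct >= percentage         # condition 2: enough growth


print(should_rewrite(128 * MB, 64 * MB))   # doubled since last rewrite
print(should_rewrite(100 * MB, 64 * MB))   # grew by only 56%
print(should_rewrite(60 * MB, 10 * MB))    # grew 500% but below min size
```

Only the first call reports a rewrite. The third case shows why the minimum size exists: small files can grow by large percentages without being worth a fork.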

14.5.3 The Full Rewrite Protocol

Phase 1: Fork

  Main process calls fork()
  ├── Child process starts:
  │     Reads current in-memory state
  │     Generates minimal RESP commands for every key
  │     Writes to temp-rewriteaof-<pid>.aof
  │     Reports COW size periodically via pipe
  │
  └── Main process continues:
        Serves client requests normally
        ├── Writes to server.aof_buf → original AOF file (safety net)
        └── Writes to server.aof_rewrite_buf_blocks (delta capture)

Phase 2: Child completes

  Child writes final bytes, calls fsync(), exits with code 0

  Main process (in serverCron → checkChildrenDone):
  ├── Appends server.aof_rewrite_buf_blocks to new AOF file
  │   (this contains all writes since fork())
  ├── Calls fsync() on new AOF file
  ├── rename(temp-rewriteaof-<pid>.aof, appendonly.aof)  ← atomic
  ├── Switches server.aof_fd to the new file
  └── Closes old file descriptor (file is deleted when refcount hits 0)

Phase 3: Cleanup

  server.aof_rewrite_buf_blocks is freed
  server.aof_rewrite_base_size = server.aof_current_size
  Normal AOF append continues to the new smaller file
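
The three phases can be modelled in a single process. This is a sketch under simplifying assumptions: there is no real fork(), the `snapshot` dict plays the role of the child's frozen view of memory, and `delta` plays the role of the rewrite buffer. The helper names `resp` and `rewrite` are hypothetical.

```python
import os
import tempfile


def resp(*args):
    """Encode one command as RESP bytes."""
    out = "*%d\r\n" % len(args)
    for a in args:
        out += "$%d\r\n%s\r\n" % (len(a.encode()), a)
    return out.encode()


def rewrite(aof_path, snapshot, delta):
    """snapshot: state frozen at fork time; delta: commands buffered since."""
    tmp_path = os.path.join(os.path.dirname(aof_path),
                            "temp-rewriteaof-%d.aof" % os.getpid())
    with open(tmp_path, "wb") as f:
        for key, value in snapshot.items():  # child: one SET per key
            f.write(resp("SET", key, value))
        for cmd in delta:                    # parent: append rewrite buffer
            f.write(resp(*cmd))
        f.flush()
        os.fsync(f.fileno())                 # durable before it is published
    os.replace(tmp_path, aof_path)           # atomic swap of old for new


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "appendonly.aof")
    with open(path, "wb") as f:
        f.write(resp("INCRBY", "counter", "1") * 1000)
    old_size = os.path.getsize(path)
    rewrite(path, {"counter": "1000"}, [("INCRBY", "counter", "1")])
    new_size = os.path.getsize(path)
    with open(path, "rb") as f:
        content = f.read()
print(old_size, "->", new_size)
```

Readers of the file see either the complete old log or the complete new one, never a half-written mix, because the new file becomes visible only through the final rename.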

14.5.4 Child-Side Serialization Logic

/* aof.c: rewriteAppendOnlyFile() */
int rewriteAppendOnlyFile(char *filename) {
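    rio aof;
    FILE *fp;
    char tmpfile[256];
    int j, error;
    dictIterator *di;
    dictEntry *de;
    long long now = mstime();   /* current time in ms, used to skip expired keys */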

    snprintf(tmpfile, sizeof(tmpfile), "temp-rewriteaof-%d.aof", (int)getpid());
    fp = fopen(tmpfile, "w");
    rioInitWithFile(&aof, fp);

    /* Optional: write RDB preamble if aof-use-rdb-preamble=yes */
    if (server.aof_use_rdb_preamble) {
        if (rdbSaveRio(&aof, &error, RDB_SAVE_AOF_PREAMBLE, NULL) == C_ERR)
            goto werr;
        /* Remaining RESP section will only have incremental commands
           from aof_rewrite_buf: the entire current state is in the RDB preamble */
        goto done;
    }

    /* Pure AOF mode: walk all databases */
    for (j = 0; j < server.dbnum; j++) {
        char selectcmd[] = "*2\r\n$6\r\nSELECT\r\n";
        /* write SELECT j */

        dict *d = server.db[j].dict;
        di = dictGetSafeIterator(d);
        while ((de = dictNext(di)) != NULL) {
            robj *key = dictGetKey(de);
            robj *val = dictGetVal(de);
            long long expiretime = getExpire(server.db+j, key);

            /* Skip already-expired keys */
            if (expiretime != -1 && expiretime < now) continue;

            /* Generate the appropriate reconstruction command */
            int ret = C_ERR;
            switch (val->type) {
            case OBJ_STRING:
                ret = rewriteStringObject(&aof, key, val);        /* SET */
                break;
            case OBJ_LIST:
                ret = rewriteListObject(&aof, key, val);          /* RPUSH (batched) */
                break;
            case OBJ_HASH:
                ret = rewriteHashObject(&aof, key, val);          /* HSET (batched) */
                break;
            case OBJ_SET:
                ret = rewriteSetObject(&aof, key, val);           /* SADD (batched) */
                break;
            case OBJ_ZSET:
                ret = rewriteSortedSetObject(&aof, key, val);     /* ZADD (batched) */
                break;
            case OBJ_STREAM:
                ret = rewriteStreamObject(&aof, key, val);
                break;
            }

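            if (ret == C_ERR) goto werr;

            /* Expiry: PEXPIREAT key <absolute_ms_timestamp> */
            if (expiretime != -1 &&
                rewriteExpireAtCommand(&aof, key, expiretime) == C_ERR)
                goto werr;
        }
        dictReleaseIterator(di);
    }

done:
    /* Flush stdio buffers, force data to disk, then publish atomically */
    if (fflush(fp) == EOF || fsync(fileno(fp)) == -1) goto werr;
    if (fclose(fp) == EOF) { unlink(tmpfile); return C_ERR; }
    if (rename(tmpfile, filename) == -1) { unlink(tmpfile); return C_ERR; }
    return C_OK;

werr:
    /* Any write failure: discard the partial temp file */
    fclose(fp);
    unlink(tmpfile);
    return C_ERR;
}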

Batching large collections (to avoid single commands with millions of arguments):

/* AOF_REWRITE_ITEMS_PER_CMD = 64 */
/* A Set with 200 members becomes: */
*66\r\n$4\r\nSADD\r\n$5\r\nmyset\r\n$2\r\nm1\r\n ... $3\r\nm64\r\n      (64 members)
*66\r\n$4\r\nSADD\r\n$5\r\nmyset\r\n$3\r\nm65\r\n ... $4\r\nm128\r\n    (64 members)
*66\r\n$4\r\nSADD\r\n$5\r\nmyset\r\n$4\r\nm129\r\n ... $4\r\nm192\r\n   (64 members)
*10\r\n$4\r\nSADD\r\n$5\r\nmyset\r\n$4\r\nm193\r\n ... $4\r\nm200\r\n   (remaining 8)
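
A sketch of the chunking arithmetic, assuming a cap of 64 members per command. The function `batch_sadd` is hypothetical and only builds argument lists; it does not write RESP.

```python
def batch_sadd(key, members, per_cmd=64):
    """Split a large set into SADD commands of at most per_cmd members."""
    return [["SADD", key] + members[i:i + per_cmd]
            for i in range(0, len(members), per_cmd)]


members = ["m%d" % i for i in range(1, 201)]   # m1 .. m200
cmds = batch_sadd("myset", members)
print([len(c) for c in cmds])
```

For 200 members this yields argument counts of 66, 66, 66 and 10: three full batches of 64 plus the command name and key, then the remaining 8.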

14.6 Mixed Persistence (aof-use-rdb-preamble)

14.6.1 Motivation

Pure AOF rewrite still produces a file that must be replayed command-by-command on startup. For a 100M-key dataset, AOF replay can take 5-10 minutes. The RDB format is 7-10× faster to load.

Mixed persistence (introduced in Redis 4.0) combines both formats in a single file:

# Enable mixed persistence (strongly recommended)
aof-use-rdb-preamble yes

14.6.2 Mixed AOF File Structure

┌──────────────────────────────────────────────────────────┐
│  RDB PREAMBLE (binary, fast to load)                     │
│  REDIS0011...                                            │
│  ← snapshot at the moment of BGREWRITEAOF                │
├──────────────────────────────────────────────────────────┤
│  AOF TAIL (RESP text, written incrementally)             │
│  *3\r\n$3\r\nSET\r\n...                                  │
│  ← all write commands since the fork() for rewrite       │
└──────────────────────────────────────────────────────────┘

Startup load sequence:

/* Simplified sketch of the AOF load path (loadAppendOnlyFile() in aof.c); names below are illustrative */
int loadDataFromAOF(redisDb *db, ...) {
    char sig[5];
    if (fread(sig, 1, 5, fp) != 5) goto readerr;

    if (memcmp(sig, "REDIS", 5) == 0) {
        /* File starts with RDB magic: load RDB preamble */
        if (rdbLoadRio(&rdb, RDB_LOAD_AOF, NULL) != C_OK) {
            serverLog(LL_WARNING, "Error reading the RDB preamble of the AOF file");
            goto readerr;
        }
        /* Continue reading the RESP portion after RDB */
    } else {
        /* Pure AOF file: seek back to position 0 */
        fseek(fp, 0, SEEK_SET);
    }

    /* Replay remaining RESP commands */
    while (1) {
        argc = readArgc(fp, &argv); /* parse RESP */
        if (argc == 0) break;       /* EOF */
        processCommand(argv, argc);  /* execute */
    }
}

14.6.3 File Detection Logic

# Simplified detection at startup
with open('appendonly.aof', 'rb') as f:
    header = f.read(5)
    if header == b'REDIS':
        mode = 'mixed'   # starts with RDB preamble
    elif header[0:1] == b'*':
        mode = 'pure-aof'  # starts with RESP array
    else:
        raise Exception("Unknown AOF format")

14.7 Multi-Part AOF (Redis 7.0+)

Redis 7.0 introduced a directory-based, multi-file AOF to eliminate the buffering overhead of single-file rewrite.

<appenddirname>/
├── appendonly.aof.1.base.rdb    ← BASE: latest snapshot (RDB format)
├── appendonly.aof.1.incr.aof    ← INCR: commands after base was written
├── appendonly.aof.2.incr.aof    ← INCR: commands written during rewrite
└── appendonly.aof.manifest      ← manifest: ordered file list

Manifest file format:

file appendonly.aof.1.base.rdb seq 1 type b
file appendonly.aof.1.incr.aof seq 1 type i
file appendonly.aof.2.incr.aof seq 2 type i
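
Reading a manifest in the format shown above can be sketched as follows. The parser assumes file names without spaces and exactly the three keys file, seq and type; both function names are hypothetical, and the load order (base first, then incremental files by sequence number) is an assumption drawn from the roles described in this section.

```python
def parse_manifest(text):
    """Parse 'file <name> seq <n> type <b|i>' lines into dicts."""
    entries = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or tokens[0].startswith("#"):
            continue
        fields = dict(zip(tokens[0::2], tokens[1::2]))  # key/value pairs
        entries.append({"file": fields["file"],
                        "seq": int(fields["seq"]),
                        "type": fields["type"]})
    return entries


def load_order(entries):
    """Base snapshot first, then incremental files in sequence order."""
    base = [e["file"] for e in entries if e["type"] == "b"]
    incr = sorted((e for e in entries if e["type"] == "i"),
                  key=lambda e: e["seq"])
    return base + [e["file"] for e in incr]


manifest = """\
file appendonly.aof.1.base.rdb seq 1 type b
file appendonly.aof.2.incr.aof seq 2 type i
file appendonly.aof.1.incr.aof seq 1 type i
"""
order = load_order(parse_manifest(manifest))
print(order)
```

Even with the incremental lines listed out of order, the base file comes first and the incremental files follow by sequence number.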

Rewrite protocol without double-buffering:

Phase 1: Fork
  └── Main process: new writes go to new INCR file (seq 2)
      Child process: writes RDB snapshot to new BASE file

Phase 2: Atomic swap
  └── Main process:
        rename new BASE into place
        update manifest: remove old base and old incr seq 1
        new manifest: base=new_rdb, incr=seq2_file

Phase 3: No aof_rewrite_buf needed!
  └── During rewrite, main process wrote to the new INCR file directly
      Child process only writes the BASE file
      No in-memory buffer required for capturing delta commands

This eliminates the peak memory spike from aof_rewrite_buf_blocks that could be hundreds of MB for busy servers.


14.8 Repairing Corrupted AOF Files

# Check integrity
redis-check-aof appendonly.aof
# Output for truncated file:
# [offset 1024] AOF is not valid. Truncating to 1024 bytes.

# Fix: truncate at last valid RESP boundary
redis-check-aof --fix appendonly.aof
# Successfully truncated AOF to offset 1024

# For Multi-Part AOF, check each file:
redis-check-aof --fix appendonlydir/appendonly.aof.1.incr.aof

# Manual recovery from FLUSHALL:
# 1. Stop Redis
redis-cli SHUTDOWN NOSAVE

# 2. Find the FLUSHALL command in the AOF
grep -n "FLUSHALL" appendonly.aof
# Line 1420

# 3. Identify the RESP block containing FLUSHALL (starts with *N)
# and delete that block (from *N\r\n through the final \r\n of the command)
# Use Python for precision:
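python3 - << 'EOF'
import re
import shutil

# Keep a backup: the edit below rewrites the file in place
shutil.copyfile('appendonly.aof', 'appendonly.aof.bak')

with open('appendonly.aof', 'rb') as f:
    data = f.read()

# Match only a standalone FLUSHALL command (a *1 array, any letter case).
# Anchoring on *1 avoids deleting commands that merely carry "FLUSHALL"
# as an argument, e.g. SET key FLUSHALL. A FLUSHALL with an option such as
# ASYNC is a *2 array and would need its own pattern.
pattern = re.compile(rb'\*1\r\n\$8\r\nFLUSHALL\r\n', re.IGNORECASE)
clean, removed = pattern.subn(b'', data)

with open('appendonly.aof', 'wb') as f:
    f.write(clean)
print(f"Removed {removed} FLUSHALL command(s). New size: {len(clean)} bytes")
EOF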

# 4. Verify and reload
redis-check-aof appendonly.aof
redis-server redis.conf

14.9 Configuration Quick Reference

appendonly yes                   # enable AOF
appendfilename "appendonly.aof"  # AOF file name (file name prefix under Multi-Part AOF)
appenddirname "appendonlydir"    # directory for Multi-Part AOF (Redis 7.0+)

appendfsync everysec             # fsync strategy: always | everysec | no
no-appendfsync-on-rewrite no     # suspend fsync during rewrite child

auto-aof-rewrite-percentage 100  # rewrite when AOF doubles in size
auto-aof-rewrite-min-size 64mb   # minimum size before rewrite considers triggering

aof-use-rdb-preamble yes         # mixed persistence (recommended)
aof-load-truncated yes           # tolerate truncated AOF at startup
aof-timestamp-enabled no         # embed timestamps (Redis 7.0+, for debugging)

14.10 Monitoring AOF Health

redis-cli INFO persistence

# Key fields to watch:
aof_enabled:1
aof_rewrite_in_progress:0         # 1 during active rewrite
aof_last_rewrite_time_sec:45      # how long the last rewrite took
aof_current_size:134217728        # 128 MB: twice the 64 MB base below
aof_base_size:67108864            # 64 MB: size right after the last rewrite
# ratio: 128/64 = 2.0 → rewrite should trigger soon

aof_buffer_length:0               # bytes pending in server.aof_buf
aof_rewrite_buffer_length:0       # bytes accumulated in rewrite delta buffer
aof_pending_bio_fsync:0           # background fsync queue depth
aof_delayed_fsync:0               # writes forced through while a background fsync was still running (slow disk)
aof_last_cow_size:3145728         # 3 MB COW during last rewrite: healthy
aof_last_bgrewrite_status:ok
aof_last_write_status:ok          # if "err" here, writes are being rejected!

Alerting thresholds:

Metric                             Warning     Critical
aof_rewrite_in_progress = 1 for    > 300 s     > 600 s
aof_pending_bio_fsync              > 100       > 1000
aof_delayed_fsync rate             > 10/min    > 100/min
aof_last_write_status              err         err (same: fix immediately)
aof_last_cow_size                  > 1 GB      > 5 GB
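
A sketch of how some of these thresholds might be checked against INFO persistence output. Both function names are hypothetical, only the three metrics that need no time series are covered, and the sample text is made up for the demonstration rather than captured from a server.

```python
def parse_info(text):
    """Parse 'key:value' lines from INFO output, ignoring comments."""
    fields = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key] = value.strip()
    return fields


def aof_alerts(fields):
    """Return (severity, metric) pairs using the thresholds listed above."""
    alerts = []
    if fields.get("aof_last_write_status", "ok") != "ok":
        alerts.append(("critical", "aof_last_write_status"))
    checks = [
        ("aof_pending_bio_fsync", 100, 1000),
        ("aof_last_cow_size", 1 << 30, 5 << 30),  # 1 GB and 5 GB in bytes
    ]
    for metric, warn, crit in checks:
        value = int(fields.get(metric, 0))
        if value > crit:
            alerts.append(("critical", metric))
        elif value > warn:
            alerts.append(("warning", metric))
    return alerts


sample = """
aof_last_write_status:err
aof_pending_bio_fsync:250
aof_last_cow_size:3145728
"""
alerts = aof_alerts(parse_info(sample))
print(alerts)
```

The sample raises a critical alert for the write status and a warning for the fsync queue depth, while the 3 MB copy-on-write size stays below both limits. Rate-based metrics such as aof_delayed_fsync need two samples over time and are left out of this sketch.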