Chapter 15

Persistence Strategy and Disaster Recovery

15.1 The Data Safety Matrix

Before choosing a persistence strategy, answer one foundational question: How much data loss is acceptable?

Strategy	Max data loss	Recovery speed	File size	CPU overhead	Memory overhead
No persistence	Everything (restart = empty)	N/A	0	Minimal	Minimal
RDB hourly	Up to 1 hour	Fastest (~40 s per 10 GB on NVMe; see 15.2.2)	Smallest	Low (fork)	Low
RDB every 5 min	Up to 5 min	Fastest	Small	Medium (frequent forks)	Medium
AOF everysec	~1 second	Slow (replay all commands)	Large (unbounded growth)	Medium	Medium
AOF always	~1 command	Slowest	Largest	High (fsync blocks)	Medium
Mixed persistence	~1 second	Fast (RDB + small AOF replay)	Medium	Medium	Medium
RDB + AOF + replica	Near zero	Fastest (promote replica)	Medium	Higher	Higher

15.2 Strategy Selection Guide by Use Case

15.2.1 Case 1: Pure Cache (Full Data Loss Acceptable)

# redis.conf
save ""            # disable RDB
appendonly no      # disable AOF

# Appropriate for:
# - Session store (users can re-authenticate)
# - Page/API response cache
# - CDN hot data prefetch
# - Temporary computation scratchpad

Benefits: Zero persistence overhead, maximum throughput, no fork() latency spikes.

Important: Disabling persistence on the master does not disable replication. Replicas hold an in-memory copy, but they will sync from an empty master if the master restarts cleanly — see Section 15.6.3 for the dangerous edge case.

Expected throughput gain: ~15–20% more ops/s versus mixed persistence, because there is no aof_buf write on every command and no periodic fork() stall.

15.2.2 Case 2: Cache with Fast Rebuild (Minutes of Loss OK)

save 3600 1
save 300 100
save 60 10000
appendonly no
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
dir /var/lib/redis

# Appropriate for:
# - Product catalog cache (rebuildable from SQL DB)
# - Leaderboards (minor regression acceptable)
# - Real-time counters (small error margin acceptable)

Recovery time estimates:

Dataset: 10 GB RDB on NVMe SSD
  Disk read @ 1 GB/s: 10 seconds
  Dict rebuild + pointer init: ~30 seconds
  Total restart time: ~40 seconds

Comparison: rebuild from MySQL (10M rows JOIN):
  Query execution: 5–20 minutes
  Network transfer: additional minutes
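
The restart estimate above is simple arithmetic, so it is easy to rerun for other dataset sizes. The sketch below assumes the two figures used in this section: 1 GB/s sequential read and about 3 seconds of in-memory rebuild per GB (which is what 30 seconds for 10 GB implies). Both defaults are illustrative, and the function name is hypothetical.

```python
def estimate_rdb_restart_seconds(rdb_size_gb, disk_read_gb_per_s=1.0,
                                 rebuild_s_per_gb=3.0):
    """Rough restart time: sequential disk read plus in-memory rebuild."""
    if disk_read_gb_per_s <= 0:
        raise ValueError("disk_read_gb_per_s must be positive")
    read_s = rdb_size_gb / disk_read_gb_per_s      # time to read the file
    rebuild_s = rdb_size_gb * rebuild_s_per_gb     # dict rebuild + pointer init
    return read_s + rebuild_s


print(estimate_rdb_restart_seconds(10))  # 40.0 with the defaults above
```

Measure both inputs on your own hardware before relying on the result; rebuild cost depends heavily on key count and data types, not only on file size.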

15.2.3 Case 3: Business Data (No More Than 1 Second of Loss)

# Mixed persistence — recommended default for production
save 3600 1
save 300 100
save 60 10000
appendonly yes
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-use-rdb-preamble yes

# Appropriate for:
# - User points and balances (1-second loss tolerated)
# - Order status cache (non-payment critical path)
# - Message queues (Stream)
# - Inventory with compensating transactions

Why mixed persistence is the sweet spot:

Restart recovery sequence:
  1. Load RDB preamble (snapshot at last rewrite): 40 seconds
  2. Replay incremental AOF since rewrite: 1–30 seconds (typically)
  Total: < 1 minute for 10 GB

versus pure AOF:
  Replay entire command history: 10–20 minutes for same dataset

15.2.4 Case 4: Financial / Orders (Near-Zero Loss)

appendonly yes
appendfsync always         # fsync on every command
aof-use-rdb-preamble yes  # still useful for fast restart
save ""                    # optional: disable RDB (AOF is primary safety net)

# Combine with synchronous replication:
# From the client (WAIT is a command, not a redis.conf directive), issue WAIT after critical writes:
# WAIT 1 100    # block until at least 1 replica acknowledges, timeout 100ms

# Appropriate for:
# - Payment transaction logs
# - Financial ledger balances
# - Auction / flash-sale inventory (strict correctness)

Throughput with appendfsync always:

Hardware: NVMe SSD (Samsung 990 Pro)
  fsync latency: ~60–80 µs
  Max TPS: 1000ms / 75µs = ~13,000 TPS

Hardware: SATA SSD
  fsync latency: ~200–500 µs
  Max TPS: ~2,000–5,000 TPS

Hardware: Spinning HDD
  fsync latency: ~5–10 ms
  Max TPS: ~100–200 TPS
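
The ceilings above all come from the same division: one second divided by the fsync latency. A minimal sketch of that calculation follows; the function name and the mid-range latencies in the loop are illustrative assumptions. It models the serial case where every command waits for its own fsync. With many concurrent clients Redis may flush several commands with a single fsync, so aggregate throughput can be higher than this bound.

```python
def max_serial_tps(fsync_latency_us):
    """Upper bound on commands/second if each command waits for one fsync."""
    if fsync_latency_us <= 0:
        raise ValueError("fsync_latency_us must be positive")
    return 1_000_000 / fsync_latency_us  # microseconds per second / latency


# Mid-range latencies taken from the hardware figures above (assumed values)
for name, latency_us in [("NVMe SSD", 75), ("SATA SSD", 350), ("HDD", 7500)]:
    print(f"{name}: ~{max_serial_tps(latency_us):,.0f} TPS")
```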

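Adding WAIT 1 100 reduces write throughput further, because each critical write now also waits for a replica round trip. It makes it far more likely that an acknowledged write survives a master crash, but it is not a guarantee: WAIT returns the number of replicas that acknowledged the write, so the client must check that the result is at least 1 and handle the case where the 100 ms timeout expires first. An acknowledgement also means the replica has received the write, not that the replica has fsynced it to disk.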
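
A minimal sketch of checking the WAIT result from application code. It assumes a redis-py style client, whose `wait(num_replicas, timeout)` method returns the number of replicas that acknowledged; `durable_set` and `ReplicaAckError` are hypothetical names introduced for this example.

```python
class ReplicaAckError(Exception):
    """Raised when too few replicas acknowledged a write in time."""


def durable_set(client, key, value, min_replicas=1, timeout_ms=100):
    """SET a key, then block until min_replicas acknowledge or timeout_ms passes.

    `client` is assumed to expose set() and wait() like redis-py's Redis class.
    """
    client.set(key, value)
    acked = client.wait(min_replicas, timeout_ms)
    if acked < min_replicas:
        # The write is already applied on the master; only the replica
        # acknowledgement is missing. The caller decides whether to retry,
        # compensate, or alert.
        raise ReplicaAckError(
            f"{acked}/{min_replicas} replicas acknowledged within {timeout_ms} ms"
        )
    return acked


# Usage (requires a running server and the redis package):
#   import redis
#   durable_set(redis.Redis(host="localhost", port=6379), "ledger:42", "100.00")
```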

15.3 Production Backup Architecture

15.3.1 Layered Backup Strategy

Layer 1 — Live AOF (continuous):
  Maximum exposure window: 1 second (with everysec)
  Storage: local disk, same server
  RPO: ~1 second

Layer 2 — Hourly RDB snapshots:
  Triggered: cron at :00 past each hour
  Retained: 7 days local
  RPO: up to 1 hour (if AOF also corrupted)

Layer 3 — Remote object storage:
  Uploaded: immediately after each hourly RDB
  Retained: 30 days (S3/GCS/OSS)
  RPO: same as Layer 2, but geographically redundant

Layer 4 — Cross-datacenter replica:
  Replication lag: typically < 100 ms (intra-region)
  RPO: milliseconds
  RTO: seconds (manual or Sentinel-automated failover)

15.3.2 Production Backup Script

#!/usr/bin/env bash
# /etc/cron.hourly/redis-backup
# Runs as root or redis user with S3 IAM role attached

set -euo pipefail

REDIS_CLI="redis-cli"
REDIS_DATA_DIR="/var/lib/redis"
BACKUP_DIR="/backup/redis"
S3_BUCKET="s3://acme-redis-backups"
HOST=$(hostname -s)
RETAIN_LOCAL_DAYS=7
RETAIN_S3_DAYS=30
ALERT_ENDPOINT="https://hooks.slack.com/services/..."

log()  { echo "[$(date -u '+%Y-%m-%dT%H:%M:%SZ')] [INFO]  $*" | tee -a /var/log/redis-backup.log; }
warn() { echo "[$(date -u '+%Y-%m-%dT%H:%M:%SZ')] [WARN]  $*" | tee -a /var/log/redis-backup.log; }
die()  { echo "[$(date -u '+%Y-%m-%dT%H:%M:%SZ')] [ERROR] $*" | tee -a /var/log/redis-backup.log
         curl -s -X POST "$ALERT_ENDPOINT" -d "{\"text\": \"Redis backup FAILED on $HOST: $*\"}" || true
         exit 1; }

TIMESTAMP=$(date -u '+%Y%m%d_%H%M%S')
DEST_FILE="$BACKUP_DIR/dump_${TIMESTAMP}.rdb"

# Step 1: Trigger background save
log "Triggering BGSAVE on $HOST"
$REDIS_CLI BGSAVE || die "BGSAVE command failed"

# Step 2: Poll until complete (max 10 minutes)
log "Waiting for BGSAVE to complete..."
for i in $(seq 1 120); do
    IN_PROG=$($REDIS_CLI INFO persistence | grep rdb_bgsave_in_progress | awk -F: '{print $2}' | tr -d $'\r')
    STATUS=$($REDIS_CLI INFO persistence | grep rdb_last_bgsave_status | awk -F: '{print $2}' | tr -d $'\r')
    if [ "$IN_PROG" = "0" ]; then
        [ "$STATUS" = "ok" ] || die "BGSAVE failed with status: $STATUS"
        log "BGSAVE completed after about $(( (i - 1) * 5 )) seconds."
        break
    fi
    [ $i -eq 120 ] && die "BGSAVE did not complete within 10 minutes"
    sleep 5
done

# Step 3: Copy and verify
mkdir -p "$BACKUP_DIR"
cp "$REDIS_DATA_DIR/dump.rdb" "$DEST_FILE"
SIZE=$(stat -c%s "$DEST_FILE")
log "Copied to $DEST_FILE (${SIZE} bytes)"

redis-check-rdb "$DEST_FILE" > /dev/null 2>&1 || die "RDB integrity check failed for $DEST_FILE"
log "RDB integrity check passed."

# Step 4: Upload to S3 with server-side encryption
S3_DEST="${S3_BUCKET}/${HOST}/$(basename "$DEST_FILE")"
aws s3 cp "$DEST_FILE" "$S3_DEST" \
    --storage-class STANDARD_IA \
    --server-side-encryption AES256 \
    --only-show-errors || die "S3 upload failed for $DEST_FILE"
log "Uploaded to $S3_DEST"

# Step 5: Prune local backups
find "$BACKUP_DIR" -name "dump_*.rdb" -mtime "+${RETAIN_LOCAL_DAYS}" -print -delete \
    | while read -r f; do log "Deleted local: $f"; done

# Step 6: Prune S3 backups (keys listed with --recursive already include the ${HOST}/ prefix)
aws s3 ls "${S3_BUCKET}/${HOST}/" --recursive | awk '{print $4}' | while read -r key; do
    FILE_DATE=$(basename "$key" | grep -oP '\d{8}' | head -1 || true)
    [ -z "$FILE_DATE" ] && continue
    # Skip unparseable dates: falling back to epoch 0 would make the file look ancient and delete it
    FILE_EPOCH=$(date -u -d "${FILE_DATE}" +%s 2>/dev/null) || { warn "Skipping $key: unparseable date"; continue; }
    AGE=$(( ( $(date +%s) - FILE_EPOCH ) / 86400 ))
    if [ "$AGE" -gt "$RETAIN_S3_DAYS" ]; then
        aws s3 rm "${S3_BUCKET}/${key}" --only-show-errors
        log "Deleted S3: $key (${AGE} days old)"
    fi
done

log "Backup cycle completed successfully."
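
The retention decision in Step 6 depends entirely on parsing the timestamp embedded in each file name, so it is worth testing in isolation. The sketch below shows one way to express the same rule in Python, assuming the `dump_YYYYMMDD_HHMMSS.rdb` naming used by this script with UTC timestamps; `keys_to_prune` is a hypothetical helper, not part of any AWS tooling. Names that cannot be dated are never selected for deletion.

```python
import re
from datetime import datetime, timedelta, timezone

_NAME_RE = re.compile(r"dump_(\d{8}_\d{6})\.rdb$")


def keys_to_prune(keys, now, retain_days=30):
    """Return the keys whose embedded UTC timestamp is older than retain_days."""
    cutoff = now - timedelta(days=retain_days)
    expired = []
    for key in keys:
        match = _NAME_RE.search(key)
        if not match:
            continue  # unknown naming scheme: do not delete what we cannot date
        try:
            taken = datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")
        except ValueError:
            continue  # impossible date such as month 13: skip it
        if taken.replace(tzinfo=timezone.utc) < cutoff:
            expired.append(key)
    return expired


now = datetime(2024, 2, 20, tzinfo=timezone.utc)
print(keys_to_prune(["web1/dump_20240115_130000.rdb",
                     "web1/dump_20240210_130000.rdb"], now))
```

For S3 specifically, a bucket lifecycle rule that expires objects after 30 days is usually simpler than pruning from a script.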

15.3.3 Weekly Backup Validation

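#!/usr/bin/env bash
# Run every Monday at 02:00 via cron

set -euo pipefail

S3_PREFIX="s3://acme-redis-backups/$(hostname -s)"
ALERT_ENDPOINT="https://hooks.slack.com/services/..."   # same webhook as the backup script

# Backup names embed a sortable UTC timestamp, so the last name is the newest
LATEST_REMOTE=$(aws s3 ls "${S3_PREFIX}/" | awk '{print $4}' | sort | tail -1)
[ -n "$LATEST_REMOTE" ] || { echo "FAIL: no backups found under ${S3_PREFIX}/"; exit 1; }

TMPFILE=$(mktemp /tmp/redis-verify-XXXXXX.rdb)
trap 'rm -f "$TMPFILE"' EXIT   # clean up even if the download fails

aws s3 cp "${S3_PREFIX}/${LATEST_REMOTE}" "$TMPFILE" --only-show-errors

if redis-check-rdb "$TMPFILE" > /dev/null 2>&1; then
    echo "PASS: Remote backup ${LATEST_REMOTE} is valid"
else
    echo "FAIL: Remote backup ${LATEST_REMOTE} is corrupted!"
    curl -s -X POST "$ALERT_ENDPOINT" -d "{\"text\": \"Redis remote backup corrupted: ${LATEST_REMOTE}\"}" || true
    exit 1
fi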

15.4 Disaster Recovery Playbooks

15.4.1 Playbook 1: Process Crash, Data on Disk

Trigger: redis-server process disappears; clients report connection refused.

Detection:

redis-cli PING   # Connection refused
systemctl status redis  # Active: failed

Recovery steps:

# 1. Verify data files exist and are intact
ls -la /var/lib/redis/
redis-check-rdb /var/lib/redis/dump.rdb
redis-check-aof /var/lib/redis/appendonly.aof  # if AOF enabled

# 2. If AOF is truncated (crash during write), fix it
redis-check-aof --fix /var/lib/redis/appendonly.aof

# 3. Restart: with AOF enabled, Redis loads the AOF (RDB preamble first, then the
#    incremental tail) and ignores dump.rdb; with AOF disabled it loads dump.rdb
systemctl start redis
systemctl status redis

# 4. Validate data integrity
redis-cli PING                    # PONG
redis-cli DBSIZE                  # compare with expected count
redis-cli INFO keyspace           # verify per-DB key counts
redis-cli INFO persistence | grep '^loading:'   # expect loading:0 once the dataset is fully loaded

Expected recovery time (10 GB, mixed persistence):

RDB preamble load: ~40 seconds
AOF tail replay (all commands since the last rewrite): ~1–30 seconds
Total: roughly 45–70 seconds from process start to first client response

15.4.2 Playbook 2: Disk Failure — Restore from Remote Backup

Trigger: Storage array failure, NVMe device failure, or accidental rm.

# 1. Stop Redis if still running
systemctl stop redis || true

# 2. Mount replacement disk or provision new volume
# (OS-level operation, varies by environment)

# 3. List available remote backups
aws s3 ls s3://acme-redis-backups/$(hostname -s)/ | sort | tail -10

# 4. Choose recovery point — latest backup before incident
TARGET="dump_20240115_130000.rdb"
aws s3 cp "s3://acme-redis-backups/$(hostname -s)/${TARGET}" \
    /var/lib/redis/dump.rdb

# 5. Set correct ownership and permissions
chown redis:redis /var/lib/redis/dump.rdb
chmod 640 /var/lib/redis/dump.rdb

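# 6. Remove stale AOF (it describes the pre-failure dataset, not the restored snapshot)
rm -f /var/lib/redis/appendonly.aof
rm -rf /var/lib/redis/appendonlydir/

# 7. If redis.conf has "appendonly yes", disable it for the first boot.
#    With AOF enabled, Redis loads only the AOF at startup; with no AOF on disk
#    it starts empty and ignores dump.rdb.
sed -i 's/^appendonly yes/appendonly no/' /etc/redis/redis.conf

# 8. Verify and start
redis-check-rdb /var/lib/redis/dump.rdb
systemctl start redis
redis-cli INFO keyspace

# 9. Re-enable AOF at runtime (Redis rewrites it from the loaded data) and persist the setting
redis-cli CONFIG SET appendonly yes
redis-cli CONFIG REWRITE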

RTO calculation example:

Scenario: 10 GB RDB in S3 Standard-IA
  Download @ 1 Gbps (125 MB/s): ~80 seconds
  File copy + permissions: 2 seconds
  Redis startup + RDB load: ~45 seconds
  Total RTO: ~2 minutes (127 seconds), excluding volume provisioning

Optimization strategies:
  - Keep a warm standby Redis with replica enabled (RTO: ~5 seconds)
  - Use S3 Transfer Acceleration for cross-region recovery
  - Pre-stage backups on local disk of standby host

15.4.3 Playbook 3: Accidental FLUSHALL

Trigger: redis-cli FLUSHALL executed (operator error, runaway script).

Critical first step: Stop Redis immediately with SHUTDOWN NOSAVE to prevent an empty RDB from overwriting the backup.

# IMMEDIATE ACTION — do this within seconds of the accident
redis-cli SHUTDOWN NOSAVE
# NOSAVE ensures no empty RDB is written to disk

Recovery from AOF (if enabled):

# 1. Backup the current AOF before modifying it
cp /var/lib/redis/appendonly.aof /var/lib/redis/appendonly.aof.bak.$(date +%s)

# 2. Remove the FLUSHALL command from the AOF file
python3 << 'PYEOF'
import re, sys

aof_path = '/var/lib/redis/appendonly.aof'
with open(aof_path, 'rb') as f:
    data = f.read()

original_size = len(data)

# Match only commands whose first word is FLUSHALL (case-insensitive):
# *1\r\n$8\r\nFLUSHALL\r\n
# *2\r\n$8\r\nFLUSHALL\r\n$5\r\nASYNC\r\n  (Redis 4.0+ FLUSHALL ASYNC)
# *2\r\n$8\r\nFLUSHALL\r\n$4\r\nSYNC\r\n
# Anchoring on the *1/*2 array header avoids matching "FLUSHALL" stored as a key or value.
pattern = re.compile(
    rb'\*1\r\n\$8\r\nFLUSHALL\r\n'
    rb'|\*2\r\n\$8\r\nFLUSHALL\r\n\$(?:4\r\nSYNC|5\r\nASYNC)\r\n',
    re.IGNORECASE
)

matches = list(pattern.finditer(data))
if not matches:
    print("ERROR: No FLUSHALL found in AOF. Wrong file?")
    sys.exit(1)

print(f"Found {len(matches)} FLUSHALL command(s):")
for m in matches:
    print(f"  Offset {m.start()}: {m.group()[:60]!r}")

# Remove only the LAST occurrence (the accident). Earlier FLUSHALLs were
# presumably intentional; removing them would resurrect data deleted on purpose.
last = matches[-1]
cleaned = data[:last.start()] + data[last.end():]
print(f"Cleaned: {original_size} → {len(cleaned)} bytes ({original_size - len(cleaned)} bytes removed)")

with open(aof_path, 'wb') as f:
    f.write(cleaned)
print("Done. AOF written.")
PYEOF

# 3. Verify the modified AOF
redis-check-aof /var/lib/redis/appendonly.aof
# Should report: "AOF analyzed: size=N, ok_up_to=N, ok_up_to_line=N"

# 4. Move the RDB aside if it exists
# (dump.rdb may have been rewritten empty if an auto-save ran before SHUTDOWN NOSAVE;
#  with appendonly yes, Redis loads the AOF at startup and ignores dump.rdb anyway)
mv /var/lib/redis/dump.rdb /var/lib/redis/dump.rdb.post-flushall 2>/dev/null || true

# 5. Restart and validate
systemctl start redis
redis-cli DBSIZE   # should be > 0 if recovery succeeded
redis-cli RANDOMKEY

Recovery from RDB backup (if AOF not enabled or AOF is corrupted):

# Find most recent pre-incident RDB
ls -lt /backup/redis/dump_*.rdb | head -5
# Choose the latest one that predates the FLUSHALL

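cp /backup/redis/dump_20240115_130000.rdb /var/lib/redis/dump.rdb
chown redis:redis /var/lib/redis/dump.rdb
# If redis.conf has "appendonly yes", Redis loads only the AOF at startup and ignores dump.rdb:
# move the corrupted AOF aside and set "appendonly no" for this boot,
# then run "redis-cli CONFIG SET appendonly yes" once the data is loaded.
systemctl start redis
redis-cli DBSIZE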

15.4.4 Playbook 4: Point-in-Time Recovery

Scenario: Application bug deployed at 14:00 wrote corrupted data for 30 minutes. Roll back to 13:55.

# Step 1: Identify candidate backup
ls /backup/redis/ | grep "20240115_1[23]"
# dump_20240115_120000.rdb  → 12:00 snapshot
# dump_20240115_130000.rdb  → 13:00 snapshot  ← best starting point

# Step 2: Spin up a recovery instance (don't touch production yet)
mkdir -p /tmp/redis-recovery
cp /backup/redis/dump_20240115_130000.rdb /tmp/redis-recovery/dump.rdb
redis-server --port 6399 \
             --dir /tmp/redis-recovery \
             --dbfilename dump.rdb \
             --save "" \
             --appendonly no \
             --daemonize yes \
             --logfile /tmp/redis-recovery/redis-recovery.log

redis-cli -p 6399 DBSIZE   # confirm data loaded

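# Step 3: If "aof-timestamp-enabled yes" was set before the incident (Redis 7.0+),
# truncate a COPY of the AOF to 13:55 and load that copy instead of the 13:00 snapshot
# TS=$(date -u -d '2024-01-15 13:55:00' +%s)   # 1705326900, assuming the times are UTC
# cp -a /var/lib/redis/appendonlydir /tmp/redis-recovery/
# redis-check-aof --truncate-to-timestamp "$TS" \
#     /tmp/redis-recovery/appendonlydir/appendonly.aof.manifest
# Then start the recovery instance with --appendonly yes instead of --appendonly no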

# Step 4 (without timestamps): manual AOF replay up to target time
# Search AOF for approximate position near 13:55 using known key patterns
# that should/shouldn't exist, then truncate at that offset

# Step 5: Validate recovered state
redis-cli -p 6399 RANDOMKEY
redis-cli -p 6399 TYPE <key>
redis-cli -p 6399 OBJECT ENCODING <key>   # DEBUG OBJECT requires enable-debug-command on Redis 7.0+
# Compare with expected state from application logs

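# Step 6: Promote recovery instance to production
redis-cli -p 6399 SAVE                             # write the validated state to dump.rdb
redis-cli SHUTDOWN NOSAVE                          # stop production
cp /tmp/redis-recovery/dump.rdb /var/lib/redis/dump.rdb
chown redis:redis /var/lib/redis/dump.rdb
# If production runs with "appendonly yes", move the old AOF aside and start with
# "appendonly no" for the first boot; otherwise Redis loads the AOF and ignores dump.rdb.
# Re-enable afterwards with: redis-cli CONFIG SET appendonly yes
systemctl start redis
redis-cli DBSIZE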

# Step 7: Clean up recovery instance
redis-cli -p 6399 SHUTDOWN NOSAVE
rm -rf /tmp/redis-recovery
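
Choosing the starting snapshot and converting the target time to a Unix timestamp are both easy to get wrong by hand during an incident. The sketch below automates the two steps, assuming the `dump_YYYYMMDD_HHMMSS.rdb` naming from the backup script and that all times are UTC; `pick_snapshot` is a hypothetical helper.

```python
import re
from datetime import datetime, timezone


def pick_snapshot(filenames, target):
    """Return the newest dump_YYYYMMDD_HHMMSS.rdb taken at or before target."""
    best = None
    for name in filenames:
        match = re.fullmatch(r"dump_(\d{8}_\d{6})\.rdb", name)
        if not match:
            continue
        try:
            taken = datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")
        except ValueError:
            continue  # not a real date, ignore the file
        taken = taken.replace(tzinfo=timezone.utc)
        if taken <= target and (best is None or taken > best[0]):
            best = (taken, name)
    return best[1] if best else None


target = datetime(2024, 1, 15, 13, 55, tzinfo=timezone.utc)
print(int(target.timestamp()))  # 1705326900
print(pick_snapshot(["dump_20240115_120000.rdb",
                     "dump_20240115_130000.rdb",
                     "dump_20240115_140000.rdb"], target))
```

The second line printed is `dump_20240115_130000.rdb`, the last snapshot that predates the 13:55 target.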

15.5 redis-check-rdb and redis-check-aof Reference

15.5.1 redis-check-rdb

# Basic check
redis-check-rdb dump.rdb
# Healthy: "\o/ RDB looks OK!"
# Corrupt: "CRITICAL: RDB CRC error" or "Wrong type ..."

# Inspect the checker's progress output (AUX fields, selected DBs, key count)
redis-check-rdb dump.rdb 2>&1 | head -200

# Exit codes: 0 = OK, non-zero = corrupted

# Common errors:
Error message                              | Cause                 | Fix
"Wrong RDB checksum"                       | Corrupt data or CRC   | Restore from backup
"FATAL: short read or OOM ..."             | File truncated        | Recover from earlier backup
"RDB version N is not supported"           | Newer Redis wrote it  | Upgrade Redis version
"DB load failed"                           | Mid-file corruption   | Restore from backup
"Unexpected EOF reading..."                | Crash during BGSAVE   | Use earlier backup or accept partial loss

15.5.2 redis-check-aof

# Check integrity
redis-check-aof appendonly.aof
# Output includes: "AOF analyzed: size=N, ok_up_to=N, ok_up_to_line=N"
# If ok_up_to < size: file is truncated; lines after ok_up_to are incomplete

# Repair: truncate to last complete command
redis-check-aof --fix appendonly.aof
# "Successfully truncated AOF appendonly.aof to offset N"
# Data after offset N is permanently lost

# For Multi-Part AOF
redis-check-aof --fix appendonlydir/appendonly.aof.2.incr.aof

# Check mixed persistence AOF (has RDB preamble)
redis-check-aof appendonly.aof
# Tool auto-detects RDB header and validates both sections

15.6 Master-Replica Persistence Combinations

15.6.1 Recommended: RDB Snapshots on the Replica, AOF on the Master

# Master (redis-master.conf)
save ""                    # no scheduled RDB on the master (AOF rewrites still fork)
appendonly yes             # AOF for real-time safety
appendfsync everysec
aof-use-rdb-preamble yes

# Replica (redis-replica.conf)
save 3600 1                # hourly RDB snapshot
save 300 100
appendonly yes
appendfsync everysec
aof-use-rdb-preamble yes
replicaof 192.168.1.10 6379

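Rationale: BGSAVE on the master calls fork(), which can cause latency spikes of 10–200 ms for large datasets (the time the kernel needs to copy the process page tables). Offloading scheduled snapshots to the replica keeps master latency more predictable. The master still forks for AOF rewrites, so tune auto-aof-rewrite-percentage and auto-aof-rewrite-min-size to keep those infrequent.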
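
Much of the fork() cost comes from duplicating the process page tables, whose size grows linearly with the dataset. The sketch below estimates that size, assuming 4 KiB pages and 8-byte page-table entries (typical for x86-64 Linux without huge pages; both values are assumptions, and the function name is hypothetical).

```python
def page_table_bytes(dataset_bytes, page_size=4096, entry_size=8):
    """Approximate page-table size that fork() must copy for a dataset."""
    pages = -(-dataset_bytes // page_size)  # ceiling division
    return pages * entry_size


GiB = 1024 ** 3
MiB = 1024 ** 2
print(page_table_bytes(10 * GiB) / MiB)  # 20.0 (MiB of page tables for 10 GiB)
```

The copy time for those page tables depends on the kernel and on whether Redis runs on bare metal or in a VM, so measure latest_fork_usec from INFO stats rather than relying on the estimate alone.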

15.6.2 Delayed Replica as Accidental Deletion Guard

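# Redis has no built-in delayed replication; the delay must come from an external mechanism
replicaof 192.168.1.10 6379
replica-lazy-flush yes     # frees old data asynchronously during a full resync; it does NOT add a delay
# External tool: Delphix or custom proxy to delay replay by 30 minutes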

A 30-minute delayed replica ensures that even after FLUSHALL, you have a 30-minute window to stop the replica before it syncs the deletion.

15.6.3 The Deadly Trap: No Persistence + Auto-Restart

Sequence of events:
  1. Master has no persistence (save "", appendonly no)
  2. Master crashes
  3. systemd restarts master (Restart=on-failure)
  4. Master starts with empty dataset (no files to load)
  5. Replica detects master restarted (new replication ID)
  6. Replica initiates full resync with master
  7. Replica replaces ALL its data with master's empty dataset
  8. BOTH master and replica now have zero data

Prevention:
  Option A: Never disable persistence on master (accept slight overhead)
  Option B: Set Restart=no in systemd — require manual intervention after master crash
  Option C: Sentinel handles failover — promotes replica to master BEFORE restarting old master

# Safe systemd unit for Redis master with no persistence
[Service]
ExecStart=/usr/bin/redis-server /etc/redis/redis.conf
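# DO NOT auto-restart: an operator must intervene after a crash.
# (systemd does not allow a trailing comment on the same line as a directive)
Restart=no
# If Sentinel is controlling this instance, let Sentinel complete failover before the old master is brought back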
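
The trap only exists when two settings combine: persistence is fully disabled and the service manager restarts the process on its own. A small pre-deployment check can flag that combination. This is a sketch under the assumption that the relevant values have already been read from redis.conf and the systemd unit; the function name and parameters are hypothetical.

```python
def restart_can_wipe_replicas(save, appendonly, systemd_restart):
    """True if a crashed master could come back empty and trigger a full resync.

    save:            value of the redis.conf "save" directive ('' or '""' = disabled)
    appendonly:      "yes" or "no"
    systemd_restart: value of Restart= in the unit file ("no", "on-failure", ...)
    """
    persistence_off = save.strip() in ("", '""') and appendonly.strip().lower() == "no"
    auto_restart = systemd_restart.strip().lower() not in ("", "no")
    return persistence_off and auto_restart


print(restart_can_wipe_replicas('""', "no", "on-failure"))  # True: the deadly trap
print(restart_can_wipe_replicas('""', "no", "no"))          # False: manual restart only
```

A check like this fits naturally into configuration management or CI for the Redis deployment.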

15.7 Persistence Monitoring and Alerting

import redis
import time
import requests

def check_redis_persistence(host='localhost', port=6379,
                             alert_webhook=None):
    r = redis.Redis(host=host, port=port, decode_responses=True)
    info = r.info('persistence')
    alerts = []

    # --- RDB checks ---
    last_save_age = time.time() - info['rdb_last_save_time']
    if last_save_age > 3600:  # no save in 1 hour
        alerts.append(('warn', f"No RDB save in {last_save_age/3600:.1f}h"))

    if info['rdb_last_bgsave_status'] != 'ok':
        alerts.append(('crit', f"BGSAVE failed: {info['rdb_last_bgsave_status']}"))

    if info.get('rdb_current_bgsave_time_sec', -1) > 300:  # 5 min, same as the warning threshold in the summary table
        t = info['rdb_current_bgsave_time_sec']
        alerts.append(('warn', f"BGSAVE running for {t}s — possible fork() issue"))

    # --- AOF checks ---
    if info.get('aof_enabled') == 1:
        if info.get('aof_last_write_status') != 'ok':
            alerts.append(('crit', "AOF write failed — WRITES BEING REJECTED"))

        cow = info.get('aof_last_cow_size', 0)
        if cow > 500 * 1024**2:  # > 500 MB, same as the warning threshold in the summary table
            alerts.append(('warn', f"AOF rewrite COW = {cow/1024**2:.0f} MB; check THP"))

        pending = info.get('aof_pending_bio_fsync', 0)
        if pending > 100:
            alerts.append(('warn', f"AOF pending fsync queue: {pending}"))

        delayed = info.get('aof_delayed_fsync', 0)
        if delayed > 0:
            # This counter is cumulative — alert on rate increase
            alerts.append(('info', f"Cumulative delayed fsyncs: {delayed}"))

    # Send alerts
    for level, msg in alerts:
        print(f"[{level.upper()}] {host}:{port} — {msg}")
        if alert_webhook and level in ('warn', 'crit'):
            requests.post(alert_webhook, json={'text': f"Redis {level}: {msg}"}, timeout=5)

    return len([a for a in alerts if a[0] == 'crit']) == 0

# Run every minute from monitoring infrastructure
if not check_redis_persistence(alert_webhook='https://hooks.slack.com/...'):
    print("CRITICAL issues detected — paging on-call")

Alerting thresholds summary:

Metric	Warning	Critical	Action
Last RDB save age	> 1 hour	> 3 hours	Check save config, disk space
BGSAVE status	N/A	`err`	Check disk space, ulimits
BGSAVE duration	> 5 min	> 15 min	Check COW, THP, memory
AOF write status	N/A	`err`	Disk full; CRITICAL
AOF rewrite duration	> 5 min	> 20 min	Check disk I/O
AOF COW size	> 500 MB	> 2 GB	Disable THP immediately
AOF pending fsyncs	> 100	> 1000	Disk overloaded
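
Counters such as aof_delayed_fsync only ever grow, so alerting on their absolute value fires forever after the first incident. One way to alert on the rate instead is to keep the previous sample and report the difference. The class below is a minimal sketch of that idea; `CounterDelta` is a hypothetical name, and it treats a decreasing value as a server restart.

```python
class CounterDelta:
    """Turn a cumulative counter into per-interval increases."""

    def __init__(self):
        self._last = None

    def update(self, value):
        """Return the increase since the previous sample (0 on first sample or reset)."""
        previous, self._last = self._last, value
        if previous is None or value < previous:
            return 0  # first sample, or the counter was reset by a restart
        return value - previous


delayed = CounterDelta()
for sample in (10, 13, 13, 2):
    print(delayed.update(sample))  # prints 0, 3, 0, 0
```

In the monitoring function, one instance per Redis host would be kept between runs and a warning raised only when the returned increase is greater than zero.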

15.8 Persistence Decision Tree

START: What is your data loss tolerance?
  │
  ├─ FULL (restart = empty is fine)
  │   └─▶ NO PERSISTENCE
  │         save ""
  │         appendonly no
  │         [Maximum performance, zero disk overhead]
  │
  ├─ MINUTES (5–60 minutes acceptable)
  │   └─▶ RDB ONLY
  │         save 3600 1
  │         save 300 100
  │         save 60 10000
  │         appendonly no
  │         [Fast restarts, simple operation]
  │
  ├─ SECONDS (~1 second acceptable)  ← MOST PRODUCTION WORKLOADS
  │   └─▶ MIXED PERSISTENCE
  │         appendonly yes
  │         appendfsync everysec
  │         aof-use-rdb-preamble yes
  │         save 3600 1
  │         [Recommended default: fast restart + 1-second RPO]
  │
  └─ NEAR-ZERO (1 command or 0 acceptable)
      └─▶ AOF ALWAYS + REPLICATION
            appendfsync always
            + WAIT 1 100 on critical writes
            + Sentinel/Redis Cluster for automatic failover
            [Highest durability; ~10K–15K TPS ceiling on NVMe]
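
The decision tree can also be written as a lookup, which is convenient for provisioning scripts that generate redis.conf fragments. The sketch below mirrors the four branches above; the numeric cut-offs (300 seconds for the minutes branch, 1 second for the seconds branch) are one reading of the tree, and the function name is hypothetical.

```python
def choose_persistence(max_loss_seconds):
    """Map a data-loss tolerance to a strategy label and redis.conf lines.

    Pass None when losing the whole dataset on restart is acceptable.
    """
    if max_loss_seconds is None:
        return "NO PERSISTENCE", ['save ""', "appendonly no"]
    if max_loss_seconds >= 300:
        return "RDB ONLY", ["save 3600 1", "save 300 100",
                            "save 60 10000", "appendonly no"]
    if max_loss_seconds >= 1:
        return "MIXED PERSISTENCE", ["appendonly yes", "appendfsync everysec",
                                     "aof-use-rdb-preamble yes", "save 3600 1"]
    # Below one second: fsync every command and pair it with replication + WAIT
    return "AOF ALWAYS + REPLICATION", ["appendonly yes", "appendfsync always"]


label, lines = choose_persistence(1)
print(label)
print("\n".join(lines))
```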
