Chapter 43: Production Post-Mortems: Root-Cause Analysis of 10 Real Incidents

Production Redis failures are rarely caused by a single factor. They are the combined result of misconfiguration, incorrect usage patterns, and architectural defects. This chapter uses a consistent format — Symptoms → Investigation → Root Cause → Fix → Prevention — to analyze 10 real production incidents. Every incident includes directly executable diagnostic commands and concrete remediation steps.

Incident 1: Bigkey Blocks the Main Thread

Symptoms

Business monitoring alerts: Redis response latency spikes intermittently from 1ms to 30–50ms, occurring every few minutes and lasting about 200ms. During each spike, all commands (including simple GET requests) are affected.

Investigation

# Step 1: Confirm the latency pattern
redis-cli --latency-history -i 1
# Output shows per-second maximum latency; observe for periodic spikes

# Step 2: Check the slow log
redis-cli SLOWLOG GET 20
# → 1)  1) (integer) 14            # slow log entry ID
#       2) (integer) 1706000000    # timestamp
#       3) (integer) 198432        # duration in microseconds ≈ 198ms
#       4) 1) "HGETALL"
#          2) "user:profile:hash"  # ← the culprit

# Step 3: Scan for bigkeys (non-blocking, uses SCAN iteration internally)
redis-cli --bigkeys
# → Biggest hash found so far 'user:profile:hash' with 523714 fields

# Step 4: Confirm the key size
redis-cli HLEN user:profile:hash
# → (integer) 523714

redis-cli DEBUG OBJECT user:profile:hash
# → Value at:0x7f... encoding:hashtable serializedlength:28432901 ...
# serializedlength ≈ 28 MB!

Root Cause

Application code accumulated all user activity logs into a single Hash key (using user ID + date as the key and event type + timestamp as fields). Over time, the Hash reached 523,714 fields. Each HGETALL had to serialize 28 MB of data and transmit it over the network, taking approximately 200ms on the main thread. During those 200ms, every other command queued behind it.

Fix

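# Step 1: Split the bigkey into 500 sub-hashes using HSCAN
# Sketch: buckets by CRC32 via cksum (stands in for CRC16; any stable hash works).
# Assumes field names and values contain no newlines; for a hash this large,
# batch the HSETs with redis-cli --pipe instead of one call per field.
cursor=0
while true; do
  result=$(redis-cli HSCAN user:profile:hash "$cursor" COUNT 1000)
  cursor=$(echo "$result" | head -1)
  # Remaining lines alternate: field, value, field, value, ...
  echo "$result" | tail -n +2 | while IFS= read -r field && IFS= read -r value; do
    bucket=$(( $(printf '%s' "$field" | cksum | cut -d' ' -f1) % 500 ))
    redis-cli HSET "user:profile:hash:$bucket" "$field" "$value" > /dev/null
  done
  [ "$cursor" = "0" ] && break
done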

# Step 2: Verify the split results
for i in $(seq 0 4); do redis-cli HLEN "user:profile:hash:$i"; done

# Step 3: Asynchronously delete the original bigkey (UNLINK is non-blocking)
redis-cli UNLINK user:profile:hash
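
Once the hash is split, application code has to route every read and write to the right sub-hash. The following is a minimal Python sketch of that routing, not the incident team's actual code. It assumes CRC32 from `zlib` in place of CRC16 (any stable hash works), and `bucket_key` is a hypothetical helper name.

```python
import zlib

NUM_BUCKETS = 500  # matches the 500 sub-hashes used in the split above


def bucket_key(base_key: str, field: str, num_buckets: int = NUM_BUCKETS) -> str:
    """Return the sub-hash key that a given field belongs to."""
    # zlib.crc32 is stable across processes and restarts,
    # unlike Python's built-in hash(), which is salted per process.
    bucket = zlib.crc32(field.encode("utf-8")) % num_buckets
    return f"{base_key}:{bucket}"


if __name__ == "__main__":
    for field in ("login:1706000000", "purchase:1706000042"):
        print(field, "->", bucket_key("user:profile:hash", field))
```

The writer and every reader must use the same hash function and bucket count; changing either one later means re-sharding the data.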

Prevention

# 1. Configure slow log threshold (record commands taking more than 10ms)
slowlog-log-slower-than 10000   # microseconds
slowlog-max-len 1000

# 2. Code level: ban HGETALL; use HMGET to retrieve only needed fields
# Wrong:  HGETALL user:profile:hash
# Right:  HMGET user:profile:hash field1 field2 field3

# 3. Regular bigkey monitoring
redis-cli --bigkeys 2>&1 | grep "Biggest"   # prints the largest key found per data type

Incident 2: Replication Backlog Overflow Triggers Full Resync Storm

Symptoms

Replica logs continuously show: Connecting to MASTER redis-master:6379. Master CPU and memory keep rising. INFO persistence shows rdb_bgsave_in_progress:1 nearly continuously. Business impact: both read and write latency increase (read requests fall back to the master as replicas go offline).

Investigation

# Step 1: Check master replication status
redis-cli -h redis-master INFO replication
# → master_repl_offset:52428800000
#   repl_backlog_size:1048576        # default 1 MB
#   repl_backlog_histlen:1048576     # backlog is FULL
#
# Replica entry:
#   slave0:ip=10.0.1.2,...,offset=52427750000,...
#   gap = 52428800000 - 52427750000 = 1,050,000 bytes > 1,048,576 (backlog size)!

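# Step 2: Measure the write rate
redis-cli -h redis-master INFO stats | grep instantaneous_input_kbps
# → instantaneous_input_kbps:102400   # ≈ 100 MB/s of inbound traffic, almost all of it writes here
# (for an exact figure, sample master_repl_offset twice and divide the difference by the interval)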

# Step 3: Inspect BGSAVE frequency
redis-cli -h redis-master INFO persistence
# → rdb_last_bgsave_time_sec:45     # last BGSAVE took 45s
#   rdb_current_bgsave_time_sec:12  # current BGSAVE has been running 12s

# Step 4: Calculate the required backlog size
# Peak write rate = 100 MB/s, max tolerable network interruption = 30s
# Recommended: 100 × 30 × 2 = 6,000 MB

Root Cause

repl-backlog-size defaults to 1 MB, while the peak write rate was 100 MB/s. A 15-second network interruption (caused by memory pressure on a Kubernetes node) let 100 × 15 = 1,500 MB of writes accumulate, far more than the 1 MB backlog could hold. When the replica reconnected, its last offset was no longer in the backlog, so PSYNC could not continue and the master fell back to a full resync. Each full resync forks a BGSAVE; under heavy writes the child's copy-on-write pages can push memory usage toward double the dataset size, and the writes that arrive during the RDB transfer leave the replica behind again. The result is a vicious cycle.

Fix

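# Immediate fix (hot config, no restart needed)
# 512 MB absorbs only about 5 s of writes at the 100 MB/s peak; treat it as a stopgap
# until there is enough memory for the size calculated in Step 4.
redis-cli -h redis-master CONFIG SET repl-backlog-size 536870912   # 512 MB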

# Verify
redis-cli -h redis-master CONFIG GET repl-backlog-size

# Enable diskless replication (reduces BGSAVE I/O pressure)
redis-cli -h redis-master CONFIG SET repl-diskless-sync yes
redis-cli -h redis-master CONFIG SET repl-diskless-sync-delay 5

# Persist to redis.conf
redis-cli -h redis-master CONFIG REWRITE

Prevention

# Formula:
# repl-backlog-size = peak_write_MB_per_sec × max_network_interruption_seconds × 2 × 1048576

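# Alert when the worst replica lag exceeds 80% of the backlog.
# (repl_backlog_histlen is not a useful signal on its own: the backlog is a ring buffer,
# so histlen equals repl_backlog_size on any busy master.)
redis-cli INFO replication | grep -E "^(master_repl_offset|repl_backlog_size|slave[0-9]+):"
# Alert when: (master_repl_offset - slaveN offset) / repl_backlog_size > 0.8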
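
The sizing formula above is easy to get wrong by a unit, so here is a small Python sketch of it. The function names are illustrative assumptions; the factor of 2 and the MB-to-bytes conversion come straight from the formula.

```python
MIB = 1024 * 1024  # bytes per MB, as used in the formula above


def backlog_size_bytes(peak_write_mb_per_sec: float,
                       max_interruption_sec: float,
                       safety_factor: float = 2.0) -> int:
    """repl-backlog-size in bytes for a given peak write rate and outage length."""
    return int(peak_write_mb_per_sec * max_interruption_sec * safety_factor * MIB)


def tolerated_interruption_sec(backlog_bytes: int,
                               peak_write_mb_per_sec: float) -> float:
    """How long a replica can be disconnected before the backlog wraps."""
    return backlog_bytes / (peak_write_mb_per_sec * MIB)


if __name__ == "__main__":
    # Incident numbers: 100 MB/s peak, 30 s tolerated interruption.
    print(backlog_size_bytes(100, 30) // MIB, "MB recommended")
    print(tolerated_interruption_sec(1 * MIB, 100), "s covered by the 1 MB default")
```

Running the second function against a proposed value is a quick sanity check: a backlog that covers less time than a typical network blip will not prevent the next full resync.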

Incident 3: Hot Key Cache Stampede Floods the Database

Symptoms

During a flash sale, database CPU spikes to 100%. API response time climbs from 50ms to 3s. Simultaneously, Redis hit rate drops from 99% to 20%.

Investigation

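# Step 1: Identify the hot key (requires maxmemory-policy to be an LFU policy)
redis-cli --hotkeys
# → Hot key 'product:flash:100' found so far with counter 255
# (the counter is the key's logarithmic LFU counter, capped at 255, not a request count)

# Step 2: Real-time access monitoring (use briefly; avoid on busy production)
# MONITOR never exits on its own, so bound the sample with timeout:
timeout 5 redis-cli MONITOR | grep -c "product:flash:100"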

# Step 3: Check the TTL
redis-cli TTL product:flash:100
# → 23    (23 seconds remaining)
# or → -2   (already expired!)

# Step 4: Estimate the DB hit rate
# DB QPS = app QPS × cache miss rate = 1000 × 0.8 = 800 queries/second hitting the DB

Root Cause

The hot product key product:flash:100 had a 60-second TTL. At each expiration, 1,000 concurrent requests detected the cache miss at the same moment and all queried the database. This is a classic "cache stampede": a single hot key expires and the full concurrent load falls through to the database.

Fix

# Solution 1: Mutex lock (distributed lock via SET NX)
# Pseudo-code:
#   value = redis.GET(key)
#   if value is None:
#     if redis.SET(key + ":lock", "1", NX=True, EX=5):
#       value = db.query(id)
#       redis.SET(key, value, EX=60)
#       redis.DELETE(key + ":lock")
#     else:
#       time.sleep(0.05)
#       value = redis.GET(key)   # retry after brief wait

# Solution 2: Logical expiration (key never physically expires)
# Store: SET product:flash:100 '{"data": {...}, "expire_at": 1706001000}'  (no EX)
# On read: if expire_at < now → trigger async background refresh, return stale value

# Solution 3: Local second-level cache (Caffeine / Guava)
# Cache hot keys in application memory for 10 seconds
# During those 10s, return local cache regardless of Redis state
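
Solution 1 can be sketched in runnable form. This is a minimal Python sketch under stated assumptions, not a production lock: `FakeRedis` is a hypothetical in-memory stand-in that ignores TTLs, `delete_if_equals` stands in for a compare-and-delete script on a real server, and `get_with_mutex` is an illustrative name. Compared with the pseudo-code above, it adds a unique lock token, a re-check after the lock is acquired, and a bounded retry loop.

```python
import threading
import time
import uuid


class FakeRedis:
    """In-memory stand-in for the few calls used below (TTLs are ignored)."""

    def __init__(self):
        self._data = {}
        self._guard = threading.Lock()

    def get(self, key):
        with self._guard:
            return self._data.get(key)

    def set(self, key, value, nx=False, ex=None):
        with self._guard:
            if nx and key in self._data:
                return None  # mimic SET ... NX failing when the key exists
            self._data[key] = value
            return True

    def delete_if_equals(self, key, expected):
        # On a real server this would be an atomic compare-and-delete script.
        with self._guard:
            if self._data.get(key) == expected:
                del self._data[key]
                return 1
            return 0


def get_with_mutex(cache, key, load_from_db, ttl=60, lock_ttl=5,
                   retries=40, wait=0.05):
    """Return the cached value, letting only one caller rebuild it on a miss."""
    lock_key = key + ":lock"
    for _ in range(retries):
        value = cache.get(key)
        if value is not None:
            return value
        token = uuid.uuid4().hex  # identifies this caller as the lock owner
        if cache.set(lock_key, token, nx=True, ex=lock_ttl):
            try:
                # Re-check: another caller may have filled the cache
                # between our miss and our lock acquisition.
                value = cache.get(key)
                if value is None:
                    value = load_from_db()
                    cache.set(key, value, ex=ttl)
                return value
            finally:
                # Release only our own lock, never one taken over by someone else.
                cache.delete_if_equals(lock_key, token)
        time.sleep(wait)  # someone else is rebuilding; wait and retry
    return load_from_db()  # last resort after all retries
```

The lock TTL must exceed the slowest expected database query; otherwise the lock expires mid-rebuild and a second caller starts a duplicate query.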

Prevention

# 1. Avoid short TTLs for hot keys; use random TTL jitter to stagger expirations
TTL = base_ttl + random(0, base_ttl * 0.1)

# 2. Pre-warm before events: actively populate caches before flash sales
redis-cli SET product:flash:100 "<value>" EX 7200   # 2 hours

# 3. Monitor and alert on hot keys
# (field positions assume the progress line "... Hot key '<name>' found so far with counter <n>")
redis-cli --hotkeys 2>&1 | awk '/Hot key/ {print $4, $NF}' | sort -k2 -rn | head -10
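
The TTL jitter rule from item 1 can be written as a small helper. This is a sketch only: `jittered_ttl` is a hypothetical name, and the 10% jitter fraction is the one given in the formula above.

```python
import random


def jittered_ttl(base_ttl: int, jitter_fraction: float = 0.1, rng=random) -> int:
    """Return base_ttl plus a random extra of up to jitter_fraction * base_ttl seconds."""
    max_extra = int(base_ttl * jitter_fraction)
    return base_ttl + rng.randint(0, max_extra)  # randint is inclusive on both ends


if __name__ == "__main__":
    # Ten keys written at the same moment now expire at slightly different times.
    print([jittered_ttl(60) for _ in range(10)])
```

Jitter spreads out the expiry of many keys written together. It does not protect a single hot key on its own, which is why the fixes above pair it with a mutex or logical expiration.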

Incident 4: Lua Script Infinite Loop Makes Redis Completely Unresponsive

Symptoms

After a deployment, Redis stops responding to all commands including PING. Connections can be established but hang indefinitely. All business requests time out, triggering circuit breakers.

Investigation

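# Step 1: Attempt PING
redis-cli -h redis-host --no-auth-warning -a "$PASS" PING
# → no response (or a client timeout) until lua-time-limit has elapsed; after that:
# → (error) BUSY Redis is busy running a script. You can only call SCRIPT KILL or SHUTDOWN NOSAVE.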

# Step 2: Check process state (bypassing Redis protocol)
ps aux | grep redis-server
# Process exists, CPU at 100% on one core

# Step 3: Check Redis logs
tail -f /var/log/redis/redis.log
# → Lua slow script detected...

# Step 4: Attempt SCRIPT KILL (from a second connection)
redis-cli -h redis-host SCRIPT KILL
# → OK   (if the script has not performed any writes)
# OR
# → (error) UNKILLABLE Script: Sorry the script already executed write commands...
# In this case the only command Redis still accepts is SHUTDOWN NOSAVE,
# so recovery means restarting the instance

# Step 5: Confirm the script has been killed
redis-cli -h redis-host PING   # → PONG once the script is gone

Root Cause

A developer submitted a Lua data-migration script containing while true do end (intended to wait for a condition, but the condition check had a bug and was always true). Redis's Lua engine runs on the main thread, so an infinite loop consumes the main thread completely and no other command can execute. lua-time-limit (default 5000 ms) does not stop the script; it only marks the point after which Redis starts answering other clients with BUSY and accepts SCRIPT KILL. If the script has already written data, SCRIPT KILL is rejected and a restart is required.

Fix

# If SCRIPT KILL is effective:
redis-cli SCRIPT KILL

# If the script has performed writes (unkillable):
# Promote a replica to take over, then restart the original master
redis-cli -h redis-replica SLAVEOF NO ONE

# In Kubernetes:
kubectl rollout restart statefulset/redis-master -n redis

Prevention

# 1. Set a reasonable lua-time-limit (milliseconds)
lua-time-limit 5000    # default; allows SCRIPT KILL after 5s timeout

# 2. Never run migration scripts directly in production Lua; use offline tools instead
# 3. Add static analysis to CI/CD: detect infinite loop patterns in Lua scripts
# 4. Use Function instead of bare EVAL for better version management and rollback
# 5. Validate scripts in a test environment before deploying:
redis-cli --eval script.lua key1 , arg1 arg2
# Redis never kills a script on its own: once lua-time-limit has passed it only
# starts accepting SCRIPT KILL, so a hung test run still has to be killed by hand
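
For item 3, a CI check does not need a full Lua parser to catch the pattern from this incident. The Python sketch below is a deliberately naive heuristic, and `find_unbounded_loops` is a hypothetical name: it flags `while true do` loops whose body contains no `break` or `return`. It ignores keywords inside string literals and block comments, and a `break` in a nested inner loop will hide an unbounded outer loop, so treat a clean result as a hint, not a proof.

```python
import re

_LOOP_START = re.compile(r"\bwhile\s+true\s+do\b")
_KEYWORDS = re.compile(r"\b(if|function|do|end|break|return)\b")


def find_unbounded_loops(lua_source: str) -> list:
    """Return 1-based line numbers of `while true do` loops with no break/return."""
    # Drop line comments but keep newlines so line numbers stay correct.
    source = re.sub(r"--[^\n]*", "", lua_source)
    flagged = []
    for start in _LOOP_START.finditer(source):
        depth = 1          # we are inside the loop's own do ... end block
        has_exit = False
        for tok in _KEYWORDS.finditer(source, start.end()):
            word = tok.group(1)
            if word in ("if", "function", "do"):
                depth += 1             # each of these opens a block closed by `end`
            elif word == "end":
                depth -= 1
                if depth == 0:         # reached the end of the while loop body
                    break
            else:                      # `break` or `return` somewhere in the body
                has_exit = True
        if not has_exit:
            flagged.append(source.count("\n", 0, start.start()) + 1)
    return flagged


if __name__ == "__main__":
    print(find_unbounded_loops("while true do end"))
```

A pipeline step could run this over every `.lua` file in the repository and fail the build when the returned list is not empty.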

Incident 5: Cluster Split-Brain Causes Dual-Write Data Loss

Symptoms

Order data is inconsistent: two records exist for the same order ID with different amounts. Tracing the timeline shows the two records were written approximately 40 seconds apart, matching a network fault window.

Investigation

# Step 1: Check current cluster state (fault has recovered)
redis-cli CLUSTER INFO
# → cluster_state:ok

# Step 2: Cross-check each node's view of the cluster
redis-cli -h node1 CLUSTER NODES
redis-cli -h node2 CLUSTER NODES
# Finding: during the fault, both node1 and node3 believed they were master for slots 0-5460

# Step 3: Analyze Redis logs
grep "Failover election won" /var/log/redis/redis-node3.log
# → [1234] 15 Jan 2024 14:30:05.123 # Failover election won: I'm the new master

grep "Connection refused" /var/log/redis/redis-node1.log
# → node1 continued accepting writes while isolated from the cluster majority

# Step 4: Reconstruct the event sequence
# 14:29:50 — node1 (master) loses connectivity with node2 and node4
# 14:30:05 — cluster majority elects node3 as the new master
# 14:29:50–14:30:30 — node1 continues accepting client writes (clients unaware of the partition)
# 14:30:30 — network recovers; node1 demoted to replica; its data overwritten by FULLRESYNC from node3

Root Cause

Network partition triggered a Redis Cluster split-brain: the old master (node1) continued accepting writes while isolated from the cluster majority. When the new master was elected and the partition healed, node1 rejoined as a replica and received a full resync from node3, overwriting all writes made to node1 during the partition window.

Fix

# Immediate: stop all writes, audit and manually reconcile data differences

# Long-term: configure anti-split-brain parameters (hot config)
redis-cli CONFIG SET min-replicas-to-write 1
# Master accepts writes only while at least 1 replica is connected and within the lag limit below
# With no such replica (for example, when partitioned from all replicas), writes return an error

redis-cli CONFIG SET min-replicas-max-lag 10
# A replica whose last acknowledgement is more than 10s old is excluded from the count

# Persist
redis-cli CONFIG REWRITE
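
How the two settings interact is easiest to see as a predicate. The Python sketch below models the decision only; the class and function names are illustrative assumptions, not Redis internals.

```python
from dataclasses import dataclass


@dataclass
class ReplicaState:
    name: str
    seconds_since_last_ack: float  # time since the replica last acknowledged


def master_accepts_writes(replicas, min_replicas_to_write=1, min_replicas_max_lag=10):
    """True if enough replicas are recent enough for the master to accept writes."""
    if min_replicas_to_write == 0:
        return True  # 0 disables the check
    good = sum(1 for r in replicas
               if r.seconds_since_last_ack <= min_replicas_max_lag)
    return good >= min_replicas_to_write


if __name__ == "__main__":
    healthy = [ReplicaState("node3", 1.0)]
    partitioned = [ReplicaState("node3", 40.0)]  # the 40 s window from this incident
    print(master_accepts_writes(healthy))      # True
    print(master_accepts_writes(partitioned))  # False
```

With these values, an isolated master stops accepting writes about 10 seconds into a partition, so the window in which writes can be lost shrinks from the full 40 seconds to roughly the lag limit.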

Prevention

# 1. Configure split-brain protection (accept slight availability reduction for data safety)
min-replicas-to-write 1
min-replicas-max-lag 10

# 2. Use WAIT for critical writes to confirm replication
# WAIT only covers writes made on the same connection, so send both commands through one redis-cli session:
printf 'SET order:1001 "<value>"\nWAIT 1 1000\n' | redis-cli
# → OK
# → (integer) 1   (1 replica confirmed within the 1000ms timeout)

# 3. Use Cluster-aware clients that detect topology changes promptly
# 4. Monitor inter-node RTT and alert on elevated latency

Incident 6: Memory Fragmentation Spikes Cause OOM Kill

Symptoms

Monitoring shows: used_memory=8GB, maxmemory=10GB, but used_memory_rss=18GB (RSS = physical memory consumed by the process). The Kubernetes node runs out of memory and the Redis Pod is OOM-killed, causing a service outage.

Investigation

# Step 1: Inspect memory details
redis-cli INFO memory
# → used_memory:8589934592          # Redis thinks it uses 8 GB
#   used_memory_rss:19327352832     # OS sees Redis using 18 GB
#   mem_fragmentation_ratio:2.25    # fragmentation ratio is 2.25 (healthy: 1.0–1.5)
#   mem_fragmentation_bytes:10737418240  # 10 GB of fragmentation!

# Step 2: Analyze key distribution
redis-cli INFO keyspace
# → db0:keys=5000000,expires=4900000,avg_ttl=30000
# 5 million keys, average TTL 30s: massive churn of short-lived keys

# Step 3: Check active defrag status
redis-cli CONFIG GET activedefrag
# → "no"   (active defragmentation is disabled!)

Root Cause

An inventory system writes tens of thousands of stock:{sku_id}:lock keys per second (different SKU IDs produce different value sizes), each with a TTL of 5–60 seconds. jemalloc allocates memory in size classes; constant creation and destruction of keys of varying sizes leaves "holes" in the allocated memory pages — memory that has been freed internally but not returned to the operating system. used_memory_rss grew continuously, eventually exceeding the Kubernetes limits.memory, triggering an OOM kill.

Fix

# Immediate: enable active defragmentation
redis-cli CONFIG SET activedefrag yes
redis-cli CONFIG SET active-defrag-ignore-bytes 100mb    # start defrag when >100MB fragmented
redis-cli CONFIG SET active-defrag-threshold-lower 10   # start when fragmentation > 10%
redis-cli CONFIG SET active-defrag-threshold-upper 100  # maximum effort above 100%
redis-cli CONFIG SET active-defrag-cycle-min 1          # minimum 1% CPU for defrag
redis-cli CONFIG SET active-defrag-cycle-max 25         # maximum 25% CPU for defrag

# Monitor defrag progress
watch -n 5 "redis-cli INFO memory | grep -E 'mem_fragmentation|used_memory'"

# Temporarily increase Kubernetes memory limits
# (patching the Pod template triggers a rolling restart of the StatefulSet's Pods by default)
kubectl patch statefulset redis-master -n redis -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"redis","resources":{"limits":{"memory":"14Gi"}}}]}}}}'

Prevention

# 1. Enable active defrag as a baseline configuration
activedefrag yes
active-defrag-ignore-bytes 100mb
active-defrag-threshold-lower 10
active-defrag-cycle-max 25

# 2. Three-layer memory planning in Kubernetes
# limits = maxmemory × 1.5 (fragmentation headroom)

# 3. Monitoring and alerting:
# mem_fragmentation_ratio > 1.5 → warning
# mem_fragmentation_ratio > 2.0 → critical (consider rolling restart)

# 4. Scheduled rolling restarts (replica → failover → restart former master)
# completely eliminates fragmentation without data loss
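
The alert thresholds in item 3 can be evaluated directly from `INFO memory` output. The Python sketch below assumes the text has already been fetched (for example with `redis-cli INFO memory`); the function names are hypothetical, and wasted memory is approximated as RSS minus `used_memory`.

```python
def parse_info(text: str) -> dict:
    """Parse `key:value` lines from INFO output, skipping section headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def fragmentation_status(info_text: str, warn: float = 1.5, crit: float = 2.0):
    """Return (ratio, wasted_bytes, level) using the thresholds from the text."""
    info = parse_info(info_text)
    used = int(info["used_memory"])
    rss = int(info["used_memory_rss"])
    ratio = rss / used
    if ratio > crit:
        level = "critical"
    elif ratio > warn:
        level = "warning"
    else:
        level = "ok"
    return ratio, rss - used, level


if __name__ == "__main__":
    sample = "# Memory\nused_memory:8589934592\nused_memory_rss:19327352832\n"
    print(fragmentation_status(sample))
```

Fed the values from this incident, the function reports a ratio of 2.25 and 10 GB of wasted memory, which is in the critical band.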

Incident 7: KEYS * Blocks Production for 30 Seconds

Symptoms

After a nightly maintenance script runs, Redis blocks for approximately 30 seconds. All business requests time out during this window and the alerting system fires massively. No other anomalies appear in the logs for that time period.

Investigation

# Step 1: Review the slow log (post-incident analysis)
redis-cli SLOWLOG GET 10
# → 1)  1) (integer) 201
#       2) (integer) 1706050000
#       3) (integer) 28432156    # 28 seconds!
#       4) 1) "KEYS"
#          2) "session:*"

# Step 2: Confirm the key count
redis-cli DBSIZE
# → (integer) 5234821   # over 5 million keys

# Step 3: Understand the complexity
# KEYS * is O(N) and runs on the main thread
# 5 million keys × ~5μs per string comparison ≈ 25 seconds

# Step 4: Identify the source
redis-cli CLIENT LIST
# → id=1234 addr=10.0.1.100:54321 cmd=keys ...

Root Cause

The nightly maintenance script needed to find all expired session keys matching session:* and used KEYS session:* directly. KEYS walks the entire keyspace on the main thread whatever the pattern, so with 5 million keys Redis spent 28 seconds on that one command. Every other command queued behind it for the full duration.
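
The back-of-envelope estimate from Step 3 generalizes to any keyspace size. In this small Python sketch, the 5 µs per-key cost is the rough figure quoted above, not a measured constant, and the function name is illustrative.

```python
def keys_blocking_seconds(key_count: int, per_key_microseconds: float = 5.0) -> float:
    """Estimate how long a full-keyspace KEYS call blocks the main thread."""
    return key_count * per_key_microseconds / 1_000_000


if __name__ == "__main__":
    # DBSIZE from this incident
    print(round(keys_blocking_seconds(5_234_821), 1), "seconds")
```

For the 5,234,821 keys reported by DBSIZE this gives about 26 seconds, in line with the 28 seconds recorded in the slow log.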

Fix

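# Ban the dangerous command.
# rename-command cannot be changed with CONFIG SET; it is read from redis.conf at startup.
# On Redis 6+, an ACL rule takes effect immediately without a restart:
redis-cli ACL SETUSER default -keys

# Persist to redis.conf (applies at the next restart)
echo 'rename-command KEYS ""' >> /etc/redis/redis.conf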

# Rewrite the maintenance script using SCAN (non-blocking, iterative)
# Wrong:
redis-cli KEYS "session:*"

# Right (iterate in batches, process each batch immediately; DEL stands in for the script's real cleanup step):
cursor=0
while true; do
  result=$(redis-cli SCAN $cursor MATCH "session:*" COUNT 100)
  cursor=$(echo "$result" | head -1)
  keys=$(echo "$result" | tail -n +2)
  if [ -n "$keys" ]; then
    echo "$keys" | xargs redis-cli DEL
  fi
  [ "$cursor" = "0" ] && break
  sleep 0.01   # rate-limit to avoid sustained pressure on Redis
done

Prevention

# Disable dangerous commands in redis.conf
rename-command KEYS     ""
rename-command FLUSHDB  ""
rename-command FLUSHALL ""
rename-command DEBUG    ""
rename-command CONFIG   "CONFIG-INTERNAL"   # rename rather than disable for admin use

# Add to code review checklist:
# - Ban KEYS / SMEMBERS / HGETALL on unbounded key spaces
# - All iteration must use SCAN / HSCAN / SSCAN / ZSCAN

Incident 8: Connection Pool Exhaustion (Too Many Connections)

Symptoms

Application logs flood with: JedisConnectionException: Could not get a resource from the pool. On the Redis side: ERR max number of clients reached. New connections cannot be established; existing connections continue to work normally.

Investigation

# Step 1: Check current connection count
redis-cli INFO clients
# → connected_clients:10000   # maxclients default 10000 — fully saturated!

# Step 2: Analyze connection sources (CLIENT LIST)
redis-cli CLIENT LIST | \
  awk -F'[ =]' '{for(i=1;i<=NF;i++) if($i=="addr") print $(i+1)}' | \
  cut -d: -f1 | sort | uniq -c | sort -rn | head -20
# → 3000 10.0.1.10   # app node 1
#   2800 10.0.1.11   # app node 2
#   500  10.0.1.100  # ops jumpbox (unexpectedly holding many connections!)

# Step 3: Find idle connections
redis-cli CLIENT LIST | \
  awk -F'[ =]' '{for(i=1;i<=NF;i++) if($i=="idle") print $(i+1)}' | \
  sort -n | tail -20
# → 3600   # connections idle for 3600 seconds (1 hour)!

# Step 4: Break down connection types
redis-cli CLIENT LIST | grep -c "cmd=replconf"   # replica connections
redis-cli CLIENT LIST | grep -c "cmd=ping"        # monitoring connections

Root Cause

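The Java application's JedisPool was configured with maxTotal=200. With 20 Pod instances, that allows up to 20 × 200 = 4,000 connections. On top of that:

Replica replication connections: 2 replicas × 1 = 2
Sentinel connections: 3 Sentinels × 2 = 6
Prometheus redis-exporter: 1
Ops scripts leaving idle connections behind: ~200 (scripts completed without properly closing connections)

These itemized sources add up to roughly 4,200 connections, well short of the limit on their own. The investigation output points to where the rest came from: Step 3 shows clients that had been idle for an hour, and Redis does not close idle clients unless timeout is set (the default, 0, disables it). Unreclaimed connections accumulated until the total hit the 10,000 maxclients limit.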

Fix

# Immediate: increase maxclients (hot config)
redis-cli CONFIG SET maxclients 50000

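# Kill connections idle for more than 1 hour (portable awk; compares idle numerically)
redis-cli CLIENT LIST | \
  awk '{id=""; idle=0; for(i=1;i<=NF;i++){split($i,kv,"="); if(kv[1]=="id") id=kv[2]; if(kv[1]=="idle") idle=kv[2]} if(idle+0>3600) print id}' | \
  xargs -I{} redis-cli CLIENT KILL ID {}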

# Set idle connection timeout (hot config)
redis-cli CONFIG SET timeout 300      # disconnect after 300 seconds of inactivity
redis-cli CONFIG SET tcp-keepalive 60  # TCP keepalive to detect dead connections

# Persist
redis-cli CONFIG REWRITE

Prevention

# Connection budget formula:
# total = Σ(app_pods × pool_maxTotal) + replica_connections + sentinel_connections
#       + monitoring_connections + ops_reserve
# Ensure: total < maxclients × 0.8
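
The budget formula can be checked with a few lines of Python. This is a sketch with hypothetical function names; the two-connections-per-Sentinel figure and the 0.8 headroom factor are the ones used in this incident.

```python
def connection_budget(app_pools, replicas=0, sentinels=0, conns_per_sentinel=2,
                      monitoring=0, ops_reserve=0):
    """Worst-case number of client connections one Redis node may see.

    app_pools is a list of (pod_count, pool_max_total) pairs, one per application.
    """
    app_total = sum(pods * pool_max_total for pods, pool_max_total in app_pools)
    return (app_total + replicas + sentinels * conns_per_sentinel
            + monitoring + ops_reserve)


def within_budget(total, maxclients=10000, headroom=0.8):
    """True if the worst case stays under maxclients with the given headroom."""
    return total < maxclients * headroom


if __name__ == "__main__":
    total = connection_budget([(20, 200)], replicas=2, sentinels=3,
                              monitoring=1, ops_reserve=200)
    print(total, within_budget(total))
```

Re-run the calculation whenever an application changes its Pod count or pool size; autoscaling can multiply the first term without anyone touching the Redis configuration.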

# JedisPool tuning:
config.setMaxTotal(50)                          # size based on actual QPS, not arbitrarily large
config.setMinIdle(5)                            # maintain warm connections
config.setTestOnBorrow(true)                    # validate before borrowing
config.setMaxWait(Duration.ofMillis(3000))      # wait timeout
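JedisPool pool = new JedisPool(config, "redis-host", 6379, 2000);  // 2s connect/read timeout is a pool constructor argument, not a config setter; host is a placeholder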

Incident 9: Disk Full During AOF Rewrite

Symptoms

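Write commands start failing with: MISCONF Redis is configured to save RDB snapshots, but it's currently unable to persist on disk. (Redis returns this error to clients while stop-writes-on-bgsave-error is enabled, which is the default.) RDB saves fail, and the AOF rewrite cannot complete, so the AOF keeps growing. Disk monitoring alerts: /data partition at 100% usage.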

Investigation

# Step 1: Check disk usage (from inside the Pod or host)
df -h /data
# → /dev/sdb    20G   20G   0G   100%  /data

# Step 2: Find large files
ls -lh /data/
# → total 20G
#   -rw-r--r-- 1 redis redis 8.0G Jan 15 14:30 appendonly.aof
#   -rw-r--r-- 1 redis redis 4.5G Jan 15 12:00 appendonly.aof.tmp.1
#   -rw-r--r-- 1 redis redis 3.2G Jan 14 23:00 appendonly.aof.tmp.2
#   -rw-r--r-- 1 redis redis 2.1G Jan 14 11:00 appendonly.aof.tmp.3
#   -rw-r--r-- 1 redis redis 2.1G Jan 15 02:00 dump.rdb

# Step 3: Identify the tmp file source
# Each .aof.tmp was left behind by an AOF rewrite interrupted by OOM Kill
# Redis does not clean up .tmp files from a previous failed run on restart

# Step 4: Confirm current AOF rewrite state
redis-cli INFO persistence
# → aof_rewrite_in_progress:1
#   aof_current_size:8589934592    # 8 GB current AOF
#   aof_base_size:1073741824       # 1 GB at last successful rewrite

Root Cause

AOF rewrite was interrupted by an OOM Kill, leaving appendonly.aof.tmp behind. Redis does not automatically clean up temporary files from a failed previous rewrite on the next startup. Subsequent rewrite attempts each created another .tmp file. Multiple accumulated .tmp files filled the disk.

Fix

# Step 1: Confirm no active rewrite before deleting
redis-cli INFO persistence | grep aof_rewrite_in_progress
# → aof_rewrite_in_progress:0

# Step 2: Remove stale tmp files
ls -lt /data/*.tmp.*   # verify the file list first (a bare *.tmp glob does not match the .tmp.N names)
rm /data/appendonly.aof.tmp.1
rm /data/appendonly.aof.tmp.2
rm /data/appendonly.aof.tmp.3

# Step 3: Optionally trigger AOF rewrite to compact the AOF
redis-cli BGREWRITEAOF
redis-cli INFO persistence   # monitor progress

# Step 4: Expand the PVC (Kubernetes; the StorageClass must have allowVolumeExpansion: true)
kubectl patch pvc redis-data-redis-0 -n redis -p \
  '{"spec":{"resources":{"requests":{"storage":"40Gi"}}}}'

Prevention

# 1. Set disk alert threshold at 75% (not 80%)
# AOF rewrite needs free space equal to the current AOF size (old + new coexist during rewrite)

# 2. Mount Redis data directory on an independent PVC
# Never share with the OS root filesystem

# 3. Monitor AOF rewrite state
redis-cli INFO persistence | grep -E "aof_rewrite|aof_current_size"

# 4. Upgrade to Redis 7.0 Multi-Part AOF
# Incremental INCR files are much smaller; a failed rewrite leaves a smaller tmp file
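
The free-space rule in item 1 can be turned into a pre-flight check before triggering BGREWRITEAOF. The Python sketch below assumes the worst case stated above, where the rewritten file is as large as the current AOF; the function names are hypothetical.

```python
GB = 1024 ** 3


def rewrite_fits(free_bytes: int, aof_current_size: int, safety: float = 1.0) -> bool:
    """True if the disk has room for a rewritten AOF next to the current one."""
    return free_bytes >= aof_current_size * safety


def reclaimable_bytes(stale_file_sizes) -> int:
    """Space that deleting the stale temp files would free."""
    return sum(stale_file_sizes)


if __name__ == "__main__":
    # File sizes from the directory listing in this incident
    stale = [int(4.5 * GB), int(3.2 * GB), int(2.1 * GB)]
    aof = 8 * GB
    print(rewrite_fits(0, aof))                         # disk full: no room
    print(rewrite_fits(reclaimable_bytes(stale), aof))  # after cleanup: fits
```

With the disk full the check fails; after removing the three stale files, about 9.8 GB is free and an 8 GB rewrite fits, which is why the fix deletes the temp files before calling BGREWRITEAOF.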

Incident 10: Misusing Deployment Causes Total Production Data Loss

Symptoms

After an ops engineer runs kubectl rollout restart on a Redis Deployment (to update the image version), all business-side cache data disappears. Database load spikes immediately. Confirmed: Redis data is completely gone, even though RDB persistence was enabled.

Investigation

# Step 1: Examine Pod history
kubectl describe pod redis-7d4f9b-abc12 -n prod
# → Name: redis-7d4f9b-abc12   (old Pod, Terminated)
kubectl get pod redis-8e5f0c-xyz89 -n prod
# → Name: redis-8e5f0c-xyz89   (new Pod, Running)
# Pod names are completely different!

# Step 2: Inspect PVC bindings
kubectl get pvc -n prod
# → NAME         STATUS    VOLUME      CAPACITY
#   redis-data   Bound     pvc-aaa     20Gi

# Step 3: Check which PVC the new Pod mounted
kubectl describe pod redis-8e5f0c-xyz89 -n prod | grep "ClaimName"
# → ClaimName: redis-data

# Step 4: Trace what happened during the rollout
# During rollout, Kubernetes launched the new Pod while the old one was still running
# The PVC (ReadWriteOnce) can only be mounted by one Pod at a time
# The new Pod remained Pending until the old Pod terminated
# A scheduler race condition led the new Pod to briefly mount an empty volume path
# By the time the correct PVC was available, Redis had already initialized with no data

# Step 5: Review the Deployment volume configuration
kubectl get deployment redis -n prod -o yaml | grep -A 10 volumes
# → volumes:
#   - name: redis-data
#     persistentVolumeClaim:
#       claimName: redis-data   # static binding, NOT volumeClaimTemplate!

Root Cause

The production Redis was deployed as a Deployment instead of a StatefulSet. Deployment binds PVCs statically via claimName. During rollout restart, Kubernetes starts the new Pod before the old one terminates. The ReadWriteOnce PVC can only be mounted by one Pod at a time, so the new Pod stayed Pending. A scheduler timing window caused the new Pod to start against an empty mount point; by the time the correct PVC was attachable, Redis had already initialized with an empty dataset and written a new (empty) RDB file, overwriting the original data on disk.

Fix

# 1. Immediately reduce database pressure (all cache is gone)
# Activate temporary rate limiting; increase DB connection pool limits

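# 2. Attempt to restore from backup
aws s3 ls s3://my-redis-backups/redis/ | sort | tail -5
# Find the most recent RDB, then restore it: stop Redis, copy the file into the
# data directory as dump.rdb, and start Redis again
# (with appendonly yes, Redis loads the AOF instead, so disable AOF for the restore)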

# 3. Migrate to StatefulSet (the correct long-term fix)
kubectl get deployment redis -n prod -o yaml > /tmp/redis-deployment.yaml
# Create a StatefulSet from this as a base:
# - Replace the static volumes.persistentVolumeClaim reference with volumeClaimTemplates
# - Add proper labels and the serviceName field
# - Apply and verify

# 4. Trigger application-side cache warming logic
# For data that cannot be restored from Redis backup,
# initiate database-to-cache warmup procedures

Prevention

# 1. Enforce policy: Redis must use StatefulSet
# Add a check to your CI/CD pipeline that fails and alerts if any redis Deployment is found:
if kubectl get deployments -n prod -o name | grep -q redis; then
  echo "Redis must run as a StatefulSet, not a Deployment" >&2
  exit 1
fi

# 2. OPA/Gatekeeper policy: deny Deployment resources with redis labels in production namespaces

# 3. Regular backup validation (restore each backup into a scratch instance and confirm the data loads)

# 4. Quarterly disaster recovery drills:
# Simulate total Pod loss → restore from backup → verify data integrity
# Document RTO and RPO from actual drill results

# 5. Always verify resource types before any ops action:
kubectl get all -n prod | grep -E "(deployment|statefulset).*redis"
# Expected: only StatefulSet entries
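
The same policy check can run as a script against machine-readable output instead of grepping table text. The Python sketch below assumes the JSON comes from `kubectl get deployments,statefulsets -n prod -o json`, which returns a list object with an `items` array. The function name and the name-based match on "redis" are illustrative assumptions.

```python
import json


def redis_deployments(resource_list: dict) -> list:
    """Return names of Deployments whose name contains 'redis'."""
    offenders = []
    for item in resource_list.get("items", []):
        name = item.get("metadata", {}).get("name", "")
        if item.get("kind") == "Deployment" and "redis" in name:
            offenders.append(name)
    return offenders


if __name__ == "__main__":
    sample = json.loads("""
    {"items": [
      {"kind": "Deployment",  "metadata": {"name": "redis"}},
      {"kind": "StatefulSet", "metadata": {"name": "redis-master"}},
      {"kind": "Deployment",  "metadata": {"name": "api-gateway"}}
    ]}
    """)
    print(redis_deployments(sample))
```

Matching on labels or container image is more robust than matching on names; the name check is used here only to keep the sketch short.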

Prevention Summary Matrix

Incident Type	Detection Command	Prevention
Bigkey blocking	`redis-cli --bigkeys` + SLOWLOG	Ban HGETALL; code review gates
Backlog overflow	`INFO replication` offset gap	Set backlog = peak_rate × tolerance × 2
Hot key stampede	`redis-cli --hotkeys` + MONITOR	Logical expiry + local L2 cache
Lua infinite loop	Process CPU at 100%	`lua-time-limit` + pre-deploy validation
Split-brain dual write	CLUSTER NODES cross-comparison	`min-replicas-to-write` + WAIT
Fragmentation OOM	`INFO memory` fragmentation_ratio	`activedefrag yes` + limits headroom
KEYS * blocking	SLOWLOG	`rename-command KEYS ""`
Connection exhaustion	`INFO clients` connected_clients	Budget planning + idle timeout
Disk full (AOF tmp)	`df -h` + ls *.tmp	75% disk alert + independent partition
Deployment data loss	`kubectl get all` resource types	Enforce StatefulSet + CI check

The central lesson across all 10 incidents is the same: monitoring must precede failures. Every command shown in the "Investigation" sections above should become a routine metric with a defined baseline and alert threshold before the first incident occurs. A Redis deployment without proactive alerting on replication lag, memory fragmentation, connection count, and slow log accumulation is not production-ready — it is simply an incident waiting to happen.
