Chapter 32

Production War Stories: Root Cause Analysis of 10 Real Incidents

Failures Are the Best Teacher

The ten incidents documented in this chapter are drawn from real production failure patterns — some directly experienced, others distilled from recurring themes in the Kafka community, engineering blogs, and postmortem archives. Every incident follows the same analytical framework: symptom → investigation steps → root cause → fix → prevention.

This framework matters as much as the individual cases. When a production incident strikes, the instinctive response is panic and blind action. Systematic diagnosis — observe symptoms, form a hypothesis, validate the hypothesis, identify the root cause — is what determines whether MTTR is measured in minutes or hours.

Incident 1: ISR Thrashing Triggers Throughput Collapse

Symptom

A payment system's Kafka cluster suffered a severe throughput drop during peak traffic — from a normal 500 MB/s to under 50 MB/s. Monitoring showed UnderReplicatedPartitions oscillating violently between 0 and 200+. The producer side logged a flood of NOT_ENOUGH_REPLICAS errors (caused by acks=all writes being rejected when ISR membership dropped below min.insync.replicas).

Investigation

# Identify which partitions are under-replicated
kafka-topics.sh --bootstrap-server kafka:9092 \
  --describe --under-replicated-partitions

# Find the ISR change pattern in broker logs
grep "ISR updated" /var/log/kafka/server.log | tail -50
# All ISR events point to Broker 3 as the lagging follower

# Check disk I/O on Broker 3
iostat -x 1 10 -d /dev/sdb
# Output: await=1200ms (normal < 5ms) — disk is extremely slow

# Check RAID controller status
/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aAll
# Output: "Degraded" — one disk failed, RAID 5 running in degraded mode
# In degraded mode, data on the failed disk must be reconstructed from the surviving disks plus parity → 10-20x I/O latency increase

Root Cause

A disk in the RAID 5 array failed, putting the array into degraded mode. In degraded mode, the RAID controller has to reconstruct the failed disk's data from the surviving disks plus parity on every access that touches it, which spiked disk latency from 2ms to 1,200ms. Broker 3's followers could not catch up to their leaders within replica.lag.time.max.ms (30 seconds), so the leaders removed them from the ISR. With ISR below min.insync.replicas, producer acks=all requests were rejected, causing the throughput cliff.

Fix

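# Immediate: Buy time by extending lag tolerance (temporary, not a cure)
# replica.lag.time.max.ms is a read-only broker setting, so kafka-configs.sh --alter
# cannot change it at runtime. Set it in server.properties on the brokers that lead
# the affected partitions (the leader decides ISR membership), then restart them one at a time:
#   replica.lag.time.max.ms=120000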

# Replace the failed disk and allow RAID rebuild (maintenance window required)
# Performance remains degraded during rebuild — monitor ISR status closely

# Long-term: Migrate from RAID to JBOD
# Kafka's replication already provides redundancy; RAID adds overhead and fragility

Prevention

Use JBOD storage, not RAID. Kafka's replication factor makes disk-level RAID redundant.
Alert on DiskReadBytes and DiskWriteBytes metrics, and on disk await latency.
Alert when ISR shrink rate exceeds 3 events per minute.

Incident 2: Consumer Rebalance Storm

Symptom

An engineering team performed a rolling deployment of 50 consumer pods in a consumer group. During the update, consumption stopped completely for more than 15 minutes. Each pod restart triggered a full group rebalance (approximately 20 seconds). With 50 sequential restarts, that was 50 sequential rebalances: 1,000 seconds (about 17 minutes) of zero consumption.

Investigation

# Watch the consumer group state during the rolling deployment
watch -n 2 "kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --group payment-consumer-group --describe"

# State cycles: Stable → PreparingRebalance → CompletingRebalance → Stable
# Each cycle takes ~20 seconds, repeating for every pod restart

# Count rebalance events in broker logs
grep "Preparing to rebalance group" /var/log/kafka/server.log | wc -l
# Output: 50+ — one per pod restart, exactly as suspected

Root Cause

Two design deficiencies compounded each other:

Eager rebalance strategy (default RangeAssignor / RoundRobinAssignor): When any member joins or leaves the group, every member stops consuming and surrenders all its partitions. Partitions are reassigned from scratch. The stop-the-world nature means the group produces zero output during any rebalance.
No static membership: Each pod restart generates a new randomly assigned member.id. The Group Coordinator sees a new member joining and the old member leaving — two events that each trigger a rebalance.

Fix

Properties props = new Properties();

// Fix 1: Switch to CooperativeStickyAssignor (incremental rebalance)
// Only partitions that need to move are reassigned; others keep consuming
props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
    "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");

// Fix 2: Enable static membership using pod name as group instance ID
// The Group Coordinator recognizes the restarted pod as the same member
// and delays rebalancing until session.timeout.ms expires (rather than triggering immediately)
props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG,
    System.getenv("POD_NAME"));    // e.g., "payment-consumer-0"

// Increase session timeout to give the pod time to restart before the coordinator
// treats it as failed and triggers rebalance
props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 90000);  // 90 seconds
props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 3000); // 3 seconds

Result: Rolling deployment of 50 pods reduced from 1,000 seconds of downtime to approximately 30 seconds (one final rebalance when the last batch completes).

Prevention

Use CooperativeStickyAssignor for all new consumer services from day one.
In Kubernetes, set group.instance.id to POD_NAME.
Alert on rebalance rate: more than 5 rebalances per hour in a single group warrants investigation.

Incident 3: Silent Message Loss

Symptom

A downstream analytics team discovered a gap in order data for a specific time window: approximately 3,000 records present in the upstream database were missing from Kafka. The gap corresponded to the 30 seconds surrounding a leader failover event.

Investigation

# Find leader election events for the affected partition
grep "New leader for partition" /var/log/kafka/server.log \
  | grep "orders-0" | head -10
# Confirms a leader election occurred within the loss window

# Check producer acks configuration
grep "acks" /opt/app/config/producer.properties
# Found: acks=1

# Reconstruct the data loss sequence:
# 1. Producer sends record → Leader writes to OS page cache
# 2. Leader sends ACK to producer (acks=1 = "I received it")
# 3. Follower has NOT yet fetched this record
# 4. Leader crashes before any follower has fetched the record
# 5. New leader elected — this record was never replicated
# 6. Record is permanently gone

Root Cause

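acks=1 is the direct cause. The leader acknowledges the producer write before any follower has replicated the record. If the leader fails before a follower fetches the record, the newly elected leader never had it. When the old leader later rejoins as a follower, it truncates its log to match the new leader, so the record is permanently lost even if it had already reached disk.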
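
The loss sequence can be reproduced with a toy model. The sketch below is not Kafka code: the class AckSimulation and its method are hypothetical, and the model assumes a single follower that becomes leader after the crash. It returns the records the producer saw acknowledged that the new leader does not have.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one partition with a leader and one follower (not Kafka code).
final class AckSimulation {
    /**
     * Returns records that were acknowledged to the producer but are missing
     * from the new leader after the old leader crashes.
     *
     * @param waitForFollower       true models acks=all, false models acks=1
     * @param replicatedBeforeCrash number of records the follower fetched before the crash
     */
    static List<String> lostAfterFailover(List<String> sent, boolean waitForFollower,
                                          int replicatedBeforeCrash) {
        List<String> followerLog = new ArrayList<>(sent.subList(0, replicatedBeforeCrash));
        List<String> acked = new ArrayList<>();
        for (int i = 0; i < sent.size(); i++) {
            boolean onFollower = i < replicatedBeforeCrash;
            // acks=1: the leader's append is enough. acks=all: the follower must hold the record too.
            if (!waitForFollower || onFollower) {
                acked.add(sent.get(i));
            }
        }
        // The leader crashes; the follower takes over with only its own log.
        List<String> lost = new ArrayList<>(acked);
        lost.removeAll(followerLog);
        return lost;
    }

    public static void main(String[] args) {
        List<String> sent = List.of("order-1", "order-2", "order-3");
        System.out.println("acks=1   lost: " + lostAfterFailover(sent, false, 1));
        System.out.println("acks=all lost: " + lostAfterFailover(sent, true, 1));
    }
}
```

With acks=1 the two unreplicated records were acknowledged and then vanish. With acks=all they were never acknowledged, so the producer still knows it has to retry them.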

Fix

// Change producer configuration — this is non-negotiable for critical data
props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true); // Prevents retry-induced duplicates

# Set min.insync.replicas=2 on the topic
# acks=all requires all ISR members to acknowledge
# With replication.factor=3 and min.insync.replicas=2:
# even if one broker is down, writes succeed and data survives one additional failure
kafka-configs.sh --bootstrap-server kafka:9092 --alter \
  --entity-type topics --entity-name orders \
  --add-config min.insync.replicas=2

Data Recovery

The 3,000 missing records were permanently lost from Kafka. Partial recovery was achieved by:

Re-reading from the upstream database (the producer wrote to DB before Kafka)
Triggering a business compensation process to republish the historical records

Prevention

Mandate acks=all for all production producers. Add a CI check that fails on acks=1 or acks=0 configurations.
Apply min.insync.replicas=2 to all business-critical topics.
Use enable.idempotence=true to prevent duplicate records from producer retries.

Incident 4: Producer OOM Blocks the Main Thread

Symptom

A microservice started exhibiting HTTP request timeouts under load. The main application thread became blocked, Kubernetes liveness probes failed, and pods restarted. Logs showed either java.lang.OutOfMemoryError: Java heap space or thread dumps with the main thread stuck inside KafkaProducer.send().

Investigation

# Check producer configuration
# Found: buffer.memory=33554432 (32 MB, default)
#        max.block.ms=60000 (60 seconds, default)

# Check broker-side produce latency
kafka-jmx.sh \
  --object-name "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce" \
  --attributes 99thPercentile
# Output: 15,000ms (P99 = 15 seconds!)

# Examine producer buffer metrics
kafka-jmx.sh \
  --object-name "kafka.producer:type=producer-metrics,client-id=payment-producer" \
  --attributes buffer-available-bytes,record-queue-time-avg
# buffer-available-bytes ≈ 0 (completely full)
# record-queue-time-avg = 58,000ms (approaching max.block.ms)

Root Cause

Broker-side produce latency spiked (due to an unrelated I/O issue). The producer's RecordAccumulator buffer filled up as records piled up waiting for broker acknowledgment. With buffer.memory=32 MB exhausted, the next call to KafkaProducer.send() blocked the calling thread for up to max.block.ms=60 seconds. The main HTTP request handling thread was frozen, triggering liveness probe failures and pod restarts.
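
The blocking behavior can be illustrated with a small stand-in for the producer's buffer pool. This is a sketch under simplifying assumptions, not the real RecordAccumulator: BoundedBufferSketch, tryAppend, and acknowledge are hypothetical names, and a Semaphore stands in for the free bytes in buffer.memory.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Toy stand-in for the producer's buffer pool (not the real RecordAccumulator).
final class BoundedBufferSketch {
    private final Semaphore freeBytes;
    private final long maxBlockMs;

    BoundedBufferSketch(int bufferBytes, long maxBlockMs) {
        this.freeBytes = new Semaphore(bufferBytes);
        this.maxBlockMs = maxBlockMs;
    }

    /** Reserves space for a record, blocking the calling thread for at most maxBlockMs. */
    boolean tryAppend(int recordBytes) {
        try {
            return freeBytes.tryAcquire(recordBytes, maxBlockMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }

    /** Models a broker acknowledgment that returns a batch's memory to the pool. */
    void acknowledge(int recordBytes) {
        freeBytes.release(recordBytes);
    }

    public static void main(String[] args) {
        BoundedBufferSketch buffer = new BoundedBufferSketch(100, 200);
        System.out.println("first append accepted: " + buffer.tryAppend(80));
        long start = System.nanoTime();
        boolean accepted = buffer.tryAppend(80); // no acknowledgment arrives, so this waits
        long blockedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("second append accepted: " + accepted + " after ~" + blockedMs + " ms");
    }
}
```

The second append is rejected only after the caller has waited out the full timeout. That is the time the HTTP thread spent frozen in this incident, scaled down from 60 seconds.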

Fix

// Approach 1: Increase buffer and reduce block timeout (palliative)
props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 134217728L);  // 128 MB
props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 5000);          // Fail fast after 5s

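// Approach 2: Asynchronous send with backpressure callback (recommended)
// Note: send() itself can still block for up to max.block.ms while the buffer is full,
// so combine this with the shorter timeout above or with the dedicated pool below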
producer.send(record, (metadata, exception) -> {
    if (exception != null) {
        if (exception instanceof BufferExhaustedException) {
            // Kafka is backed up — apply backpressure to the business layer
            metrics.increment("kafka.producer.buffer.exhausted");
            rateLimiter.reducePermits(1);
        }
        log.warn("Kafka send failed", exception);
    }
});

// Approach 3: Dedicated Kafka send thread pool (full isolation)
// The main request-handling thread never touches KafkaProducer.send() directly
// (ThreadFactoryBuilder comes from Guava; any ThreadFactory that names its threads works)
ExecutorService kafkaSendPool = Executors.newFixedThreadPool(
    4,
    new ThreadFactoryBuilder().setNameFormat("kafka-send-%d").build()
);
kafkaSendPool.submit(() -> producer.send(record));

Prevention

Never call KafkaProducer.send() from the main business thread in a latency-sensitive service.
Alert when buffer-available-bytes drops below 10% of buffer.memory.
Set max.block.ms to a value that aligns with your HTTP timeout budget (typically 1,000-5,000ms, not the default 60,000ms).

Incident 5: Disk Exhaustion from Compaction Lag

Symptom

A topic with log compaction enabled had a normally functioning log cleaner. Despite this, disk space declined steadily and was completely exhausted within 4 hours (480 GB SSD), causing the broker to stop accepting writes.

Investigation

# Check write rate on the affected topic
kafka-jmx.sh \
  --object-name "kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=user-events" \
  --attributes OneMinuteRate
# Output: 1,073,741,824 bytes/s = 1 GB/s write rate

# Check compaction throughput
kafka-jmx.sh \
  --object-name "kafka.log:type=LogCleaner,name=cleaner-io-ratio" \
  --attributes Value
# Output: 0.2 (the cleaner is running at 20% of disk bandwidth = ~200 MB/s)

# The math:
# Write rate: 1,000 MB/s
# Compaction rate: 200 MB/s
# Net accumulation: 800 MB/s = 48 GB/min
# Available disk: 480 GB / 48 GB/min = ~10 minutes to fill
# (Actually 4 hours because TTL-expired records are also being deleted concurrently)

Root Cause

Log compaction cleanup throughput (200 MB/s) is far below the write rate (1 GB/s). Superseded record versions accumulate faster than the cleaner can remove them. The fundamental bottleneck: log.cleaner.threads=1 (default). A single compaction thread cannot keep up with a 1 GB/s write workload.
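
The fill-time arithmetic from the investigation can be captured in a small helper. This is a minimal sketch with hypothetical names (DiskFillEstimate, minutesToFill). It uses decimal units as the text does and ignores time-based deletion, which is why it gives the 10-minute figure and not the observed 4 hours.

```java
// Back-of-the-envelope estimate of how long free disk space lasts
// when writes outpace the log cleaner (illustrative, decimal units).
final class DiskFillEstimate {
    /** Minutes until free space runs out, or infinity when the cleaner keeps up. */
    static double minutesToFill(double freeGb, double writeMbPerSec, double cleanMbPerSec) {
        double netMbPerSec = writeMbPerSec - cleanMbPerSec;
        if (netMbPerSec <= 0) {
            return Double.POSITIVE_INFINITY; // cleaner keeps pace, disk does not fill
        }
        double netGbPerMin = netMbPerSec * 60 / 1000;
        return freeGb / netGbPerMin;
    }

    public static void main(String[] args) {
        // Figures from the incident: 480 GB disk, 1,000 MB/s writes, 200 MB/s cleaning.
        System.out.println(minutesToFill(480, 1000, 200) + " minutes");
    }
}
```

Running the same helper against planned peak write rates is a quick way to see how much response time a given disk size actually buys.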

Fix

# Immediate: Increase compaction thread count (dynamic, no restart needed)
kafka-configs.sh --bootstrap-server kafka:9092 --alter \
  --entity-type brokers --entity-name 1 \
  --add-config log.cleaner.threads=4

# Increase compaction I/O budget
kafka-configs.sh --bootstrap-server kafka:9092 --alter \
  --entity-type brokers --entity-name 1 \
  --add-config log.cleaner.io.max.bytes.per.second=536870912  # 512 MB/s

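# Force compaction to run sooner on the high-write topic
# (min.compaction.lag.ms would delay compaction; max.compaction.lag.ms caps how long a record can stay uncompacted)
kafka-configs.sh --bootstrap-server kafka:9092 --alter \
  --entity-type topics --entity-name user-events \
  --add-config max.compaction.lag.ms=3600000   # Eligible for compaction within 1 hour at most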

Prevention

Alert at 60% disk usage (P2) and 80% (P1) — leave adequate response time.
Monitor cleaner-io-ratio: below 0.5 means compaction cannot keep pace with writes.
Capacity planning: size disks for at least 3x the data volume expected at peak write rate over the retention period.
For high-write compacted topics, provision separate fast disks and increase log.cleaner.threads proactively.

Incident 6: Consumer Group Offsets Expire During Extended Downtime

Symptom

A batch processing job's consumer group was paused for scheduled maintenance lasting 8 days. When restarted, instead of resuming from the last committed offset, the consumers started from the latest position — skipping millions of records accumulated during the maintenance window.

Investigation

# Attempt to describe the consumer group after restart
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --group batch-processor-group --describe
# Output: "Consumer group 'batch-processor-group' does not exist."
# The offsets are completely gone!

# Check the offset retention configuration
# (--all lists static and default values too; plain --describe shows only dynamic overrides)
kafka-configs.sh --bootstrap-server kafka:9092 \
  --describe --entity-type brokers --entity-name 1 --all | grep offsets.retention
# Output: offsets.retention.minutes=10080 (7 days, the default)

# Timeline:
# Maintenance downtime: 8 days
# offsets.retention.minutes: 7 days
# 8 days > 7 days → offsets expired and were deleted

Root Cause

offsets.retention.minutes (default: 10,080 minutes = 7 days) controls how long the broker retains committed offsets for a consumer group that has no active members. After 7 days of inactivity, this group's records in the __consumer_offsets topic were deleted. On restart, the consumer fell back to auto.offset.reset=latest, jumping to the current log end and skipping the entire 8-day backlog.
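
A pre-maintenance check along these lines would have flagged the risk. It is a minimal sketch with hypothetical names (OffsetExpiryCheck, offsetsAtRisk). It treats expiry as an exact cutoff, whereas the broker removes expired offsets in a periodic background pass, so a real check should leave a safety margin.

```java
import java.time.Duration;

// Compares planned consumer downtime against the broker's offset retention window.
final class OffsetExpiryCheck {
    /** True when the group will be empty for longer than offsets are retained. */
    static boolean offsetsAtRisk(Duration plannedDowntime, Duration offsetRetention) {
        return plannedDowntime.compareTo(offsetRetention) > 0;
    }

    public static void main(String[] args) {
        Duration retention = Duration.ofMinutes(10080); // 7 days, the default from the incident
        System.out.println("8-day downtime at risk: " + offsetsAtRisk(Duration.ofDays(8), retention));
        System.out.println("6-day downtime at risk: " + offsetsAtRisk(Duration.ofDays(6), retention));
    }
}
```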

Fix

# Increase offset retention to 30 days
# offsets.retention.minutes is a read-only broker setting: set it in server.properties
# on every broker and perform a rolling restart (kafka-configs.sh --alter cannot change it)
#   offsets.retention.minutes=43200   (30 days)

# For the immediate incident: attempt recovery from upstream sources
# (re-read from source DB or re-trigger the events)

// For critical batch consumer groups, use 'earliest' as the reset fallback
// This means data is reprocessed rather than skipped if offsets expire
// Requires the consumer to be idempotent (same record applied twice = same result)
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

Prevention

Set offsets.retention.minutes=43200 (30 days) or longer for any cluster with batch workloads.
Alert when a consumer group has been inactive for more than 72 hours.
Before planned extended downtime, export consumer group offsets to an external store (database, configuration management system).
For all business-critical consumer groups, set auto.offset.reset=earliest so that expired offsets cause reprocessing rather than data loss.

Incident 7: Too Many Partitions Crushes the Controller

Symptom

A cluster with 50 topics × 1,000 partitions each (50,000 partitions total). Any time a broker restarted — whether for a routine rolling upgrade, a GC-induced crash, or a hardware failure — the recovery time was 10-15 minutes. Normal broker recovery should take 1-2 minutes. During the extended recovery window, producers and consumers intermittently experienced connection errors.

Investigation

# Confirm total partition count
kafka-topics.sh --bootstrap-server kafka:9092 --describe \
  | grep "PartitionCount" \
  | awk '{for (i = 1; i < NF; i++) if ($i == "PartitionCount:") sum += $(i + 1)} END {print "Total partitions:", sum}'
# Output: Total partitions: 50000

# Count LeaderAndIsr requests during a broker restart
grep "Sent LeaderAndIsr" /var/log/kafka/server.log | wc -l
# Output: ~50,000 — one per partition, sent serially

# Measure controller leader election latency
kafka-jmx.sh \
  --object-name "kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs" \
  --attributes 99thPercentile
# Output: very high, indicating the controller is spending most of its time on elections

Root Cause

In ZooKeeper mode, the Kafka Controller processes leader elections serially (single-threaded). When a broker restarts, the Controller must send a LeaderAndIsr request for every partition previously led or followed by that broker — up to 50,000 in this case. Each request requires a network round-trip and a ZooKeeper write (1-5ms each). 50,000 × 3ms = 150 seconds minimum, just for the Controller to complete its work.
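
The lower bound above follows from a simple serial-cost model. The sketch below uses hypothetical names (ControllerRecoveryEstimate, serialSeconds) and assumes a fixed per-partition cost with no batching or parallelism, which is a deliberate simplification. It shows the range implied by the 1-5ms figure.

```java
// Serial-cost model for controller work after a broker restart (illustrative only).
final class ControllerRecoveryEstimate {
    /** Seconds needed when each partition is handled one after another. */
    static double serialSeconds(int partitions, double perPartitionMs) {
        return partitions * perPartitionMs / 1000.0;
    }

    public static void main(String[] args) {
        int partitions = 50_000; // partition count from the incident
        for (double costMs : new double[] {1.0, 3.0, 5.0}) {
            System.out.println(costMs + " ms each -> " + serialSeconds(partitions, costMs) + " s");
        }
    }
}
```

Because the cost is linear in the partition count, halving the partition count halves the floor on recovery time under this model.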

Fix

Short-term: Reduce partition count by consolidating low-traffic topics. The practical upper limit for ZooKeeper-mode clusters is roughly 1,000-4,000 partitions per broker, or 10,000-20,000 total.

Long-term: Migrate to KRaft mode

KRaft (Kafka Raft metadata mode) replaces ZooKeeper with a built-in Raft quorum. A small set of controller nodes forms the quorum, with one active controller at a time, and metadata is stored in an internal Kafka log rather than in ZooKeeper. The result is dramatically better performance at high partition counts: KRaft is designed to scale to millions of partitions per cluster.

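# Kafka 3.7+ supports migration from ZooKeeper mode to KRaft
# The migration reuses the existing cluster ID; kafka-storage.sh random-uuid is only for brand-new clusters
# Read the current cluster ID from ZooKeeper (zookeeper:2181 is a placeholder address)
zookeeper-shell.sh zookeeper:2181 get /cluster/id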

# See the official ZooKeeper to KRaft migration guide for the full procedure
# Typically requires a maintenance window of 1-2 hours for a 3-broker cluster

Prevention

Alert when total partition count exceeds 10,000 in a ZooKeeper-mode cluster.
Implement a partition count approval process for new topic creation (require justification for any topic requesting more than 200 partitions).
Plan a KRaft migration — it is production-stable in Kafka 3.3+ and strongly recommended in Kafka 3.7+.

Incident 8: Expired SSL Certificate Causes Cluster Split

Symptom

Early one morning, monitoring fired: UnderReplicatedPartitions jumped from 0 to 150+, ActiveControllerCount briefly dropped to 0, and consumers started logging SSL handshake failed errors in large numbers.

Investigation

# Check SSL certificate expiry on all brokers
for broker in kafka-1 kafka-2 kafka-3 kafka-4 kafka-5; do
    echo "=== $broker ==="
    openssl s_client -connect "$broker:9092" </dev/null 2>/dev/null \
    | openssl x509 -noout -dates 2>/dev/null
done

# Output:
# === kafka-1 === notAfter=Dec 31 23:59:59 2023 GMT  ← EXPIRED
# === kafka-2 === notAfter=Dec 31 23:59:59 2023 GMT  ← EXPIRED
# === kafka-3 === notAfter=Dec 31 23:59:59 2023 GMT  ← EXPIRED
# === kafka-4 === notAfter=Dec 31 23:59:59 2024 GMT  (valid)
# === kafka-5 === notAfter=Dec 31 23:59:59 2024 GMT  (valid)

# Three brokers had certificates issued in the same batch, expiring on the same day.
# Brokers 4 and 5 rejected SSL connections from Brokers 1, 2, and 3.
# The cluster split into two mutually isolated subclusters.

Root Cause

Three of five brokers used certificates issued in the same batch with the same expiry date. When that date passed, brokers 4 and 5 rejected SSL connections from brokers 1, 2, and 3 on the REPLICATION listener. Inter-broker replication stopped. ISR membership collapsed as followers could not reach leaders. The Controller briefly lost quorum as the split prevented consensus.

Fix

# Issue new certificates immediately via Vault
vault write -format=json pki/issue/kafka-broker \
  common_name="kafka-broker-1.internal" \
  ttl="8760h" > /tmp/new-cert.json

# Extract certificate and key
jq -r '.data.certificate' /tmp/new-cert.json \
  > /etc/kafka/ssl/new-broker.pem
jq -r '.data.private_key' /tmp/new-cert.json \
  > /etc/kafka/ssl/new-broker-key.pem

# Package the PEM pair into /etc/kafka/ssl/new-keystore.jks before reloading
# (for example with openssl pkcs12 and keytool; that step is not shown here)

# Kafka 3.x supports dynamic SSL certificate reload without broker restart
# Keep the --add-config value on one line: a continuation inside the quotes would embed spaces
kafka-configs.sh --bootstrap-server kafka-4:9092 \
  --command-config admin-client.properties \
  --alter \
  --entity-type brokers --entity-name 1 \
  --add-config "ssl.keystore.location=/etc/kafka/ssl/new-keystore.jks,ssl.keystore.password=newpassword"

Prevention

Prometheus alert: certificate expiry < 30 days → P2; < 7 days → P1; < 3 days → P0.
Stagger certificate expiry dates across brokers — never issue the same expiry date to more than one broker.
Automate certificate rotation using cert-manager (Kubernetes) or Vault PKI with short-lived certificates (365 days maximum).
Annually audit certificate expiry dates for all cluster components (brokers, Schema Registry, Kafka Connect workers, clients).
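
The alert thresholds above map directly onto a small classification function. This is a minimal sketch with hypothetical names (CertExpirySeverity, severity). In practice the expiry instant would come from the parsed certificate, and the rule would live in the alerting system, not in application code.

```java
import java.time.Duration;
import java.time.Instant;

// Maps remaining certificate lifetime to the alert levels used in this chapter.
final class CertExpirySeverity {
    static String severity(Instant notAfter, Instant now) {
        Duration remaining = Duration.between(now, notAfter);
        if (remaining.compareTo(Duration.ofDays(3)) < 0) {
            return "P0"; // also covers certificates that have already expired
        }
        if (remaining.compareTo(Duration.ofDays(7)) < 0) {
            return "P1";
        }
        if (remaining.compareTo(Duration.ofDays(30)) < 0) {
            return "P2";
        }
        return "OK";
    }

    public static void main(String[] args) {
        Instant notAfter = Instant.parse("2023-12-31T23:59:59Z"); // expiry seen in the incident
        Instant now = Instant.parse("2023-12-10T00:00:00Z");      // hypothetical check time
        System.out.println(severity(notAfter, now));
    }
}
```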

Incident 9: Cross-AZ Latency Spike Stalls High Watermark

Symptom

The business team reported end-to-end latency (P99) for Kafka messages spiking from 20ms to over 2 seconds, sustained for about an hour. Producer throughput appeared normal; consumers were actively consuming — but the messages being consumed were 1-2 seconds old.

Investigation

# Check ISR status — all replicas appear in-sync
kafka-topics.sh --bootstrap-server kafka:9092 \
  --describe --topic orders | grep "Isr:"
# ISR is full — no replicas have been removed

# Break down fetch latency by broker (and by AZ)
# AZ-A: Broker 1, 2 → P99 fetch latency: 15ms (normal)
# AZ-B: Broker 3, 4 → P99 fetch latency: 18ms (normal)
# AZ-C: Broker 5, 6 → P99 fetch latency: 2,100ms (anomalous!)

# Measure network latency from AZ-A to AZ-C
ping kafka-broker-5  # executed from a host in AZ-A
# Expected: < 1ms for same-region cross-AZ
# Actual: 52ms — AZ-C has an extra 50ms network delay (cloud provider issue)

# Trace the HW advancement path:
# HW = min(LEO of all ISR members)
# AZ-C followers are in ISR (50ms < replica.lag.time.max.ms = 30s)
# But AZ-C followers' LEO advances slowly due to slow fetching
# Leader's HW can only advance to AZ-C's current LEO
# Consumers can only read up to HW → consumer lag = AZ-C's fetch latency

Root Cause

A network anomaly introduced 50ms of additional latency to AZ-C. The AZ-C followers remained in ISR (their lag stayed far below the 30-second replica.lag.time.max.ms threshold), so they continued to constrain High Watermark advancement. Each follower fetch round trip now cost an extra 50ms, and over many produce-fetch cycles under sustained load the AZ-C replicas fell further behind the leader. Since consumers can only read records up to the High Watermark, that replication delay surfaced directly as 2+ seconds of end-to-end latency.
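
The rule HW = min(LEO of all ISR members) can be written out directly. The sketch below is a toy model, not Kafka internals: HighWatermarkSketch and the offsets in main are made up for illustration. It shows how one slow replica that is still in the ISR holds back what consumers can read.

```java
import java.util.Map;
import java.util.Set;

// Toy model of high watermark calculation (names and offsets are illustrative).
final class HighWatermarkSketch {
    /** HW is the smallest log end offset (LEO) among the replicas currently in the ISR. */
    static long highWatermark(Map<Integer, Long> leoByBroker, Set<Integer> isr) {
        return isr.stream()
                .mapToLong(leoByBroker::get)
                .min()
                .orElseThrow(() -> new IllegalArgumentException("ISR must not be empty"));
    }

    public static void main(String[] args) {
        // Broker 1 is the leader; broker 5 sits in the slow AZ and trails behind.
        Map<Integer, Long> leo = Map.of(1, 1000L, 3, 998L, 5, 700L);
        System.out.println("HW with slow replica in ISR: " + highWatermark(leo, Set.of(1, 3, 5)));
        System.out.println("HW after it leaves the ISR:  " + highWatermark(leo, Set.of(1, 3)));
    }
}
```

In this example the records between offsets 700 and 998 are already on two brokers but stay invisible to consumers until the slow replica either catches up or drops out of the ISR.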

Fix

Immediate: Reduce replica.lag.time.max.ms so that the slow AZ-C followers leave ISR sooner, allowing the HW to advance without them. This is a read-only broker setting, so it goes in server.properties on every broker and takes effect after a rolling restart. It only helps once a follower's lag exceeds the new threshold:

# server.properties: reduced from 30s to 5s
replica.lag.time.max.ms=5000

Long-term architectural improvement: Rack-aware replica placement + consumer rack affinity:

// Consumer: prefer to fetch from replicas in the same AZ
// (Kafka 2.4+ KIP-392: Fetch from closest replica)
props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "az-a");  // Consumer's AZ

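// Broker: must have broker.rack set (in server.properties)
// broker.rack=az-a (for brokers in AZ-A)
// Broker: must also enable the rack-aware selector, otherwise consumers keep fetching from the leader
// replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector

// When the consumer has client.rack set and the brokers have broker.rack plus the selector configured,
// Kafka routes fetch requests to the nearest replica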
// This reduces inter-AZ data transfer costs AND
// means a slow AZ's latency only affects consumers in that AZ

# Ensure topics have rack-aware replica assignment
# (partitions spread across AZs, each partition's replicas in different AZs)
# No topic-level flag is needed: when every broker sets broker.rack,
# Kafka spreads each partition's replicas across racks at creation time
kafka-topics.sh --bootstrap-server kafka:9092 \
  --create --topic orders-v2 \
  --partitions 12 \
  --replication-factor 3

Prevention

Test cross-AZ network latency as part of cluster health checks; alert on > 5ms baseline.
Use consumer rack affinity to limit the blast radius of AZ-specific network issues.
Consider whether all three AZs need to be in ISR — sometimes a 2-AZ ISR with min.insync.replicas=2 provides better latency predictability.

Incident 10: Kafka on Kubernetes Pod Reschedule Causes Data Loss

Symptom

After a Kubernetes node eviction, a broker pod was rescheduled to a different node. Upon restart, the broker's data directory was empty — all on-disk log segments were gone. The cluster showed a sudden spike in UnderReplicatedPartitions (the affected broker's partitions had no local data), and consumers on other brokers saw the broker's partitions re-synchronize from the beginning.

Investigation

# Check PVC status
kubectl get pvc -n kafka
# Output: kafka-data-kafka-0   Lost   (PVC is in Lost state!)

# Describe the PVC to understand the storage configuration
kubectl describe pvc kafka-data-kafka-0 -n kafka
# StorageClass: local-storage
# AccessModes: ReadWriteOnce

# Check the deployment type
kubectl get deployment,statefulset -n kafka
# Found: Deployment "kafka" (NOT a StatefulSet!)

# Root cause becomes clear:
# Deployment does not guarantee stable Pod identity or PVC binding
# When pod rescheduled to a new node:
# - The old PVC (bound to node A's local disk) cannot follow the pod to node B
# - A new empty PVC was created on node B
# - The broker started with an empty data directory

Root Cause

Two fundamental configuration errors compounded:

Deployment instead of StatefulSet: Deployments do not provide stable pod identities or stable PVC-to-pod bindings. StatefulSet is the only correct Kubernetes workload resource for stateful applications like Kafka.
local-storage StorageClass with ReadWriteOnce: Local storage is bound to the specific node. When the pod moves to a different node (due to eviction, cluster autoscaling, or node maintenance), the PVC cannot follow. A new empty PVC is created, and the broker sees no existing data.

Fix

# Correct Kafka on Kubernetes configuration
apiVersion: apps/v1
kind: StatefulSet    # StatefulSet, not Deployment
metadata:
  name: kafka
  namespace: kafka
spec:
  serviceName: kafka-headless
  replicas: 3
  podManagementPolicy: OrderedReady    # Start/stop pods in order (safer for Kafka)
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: kafka
  template:
    metadata:
      labels:
        app: kafka
    spec:
      terminationGracePeriodSeconds: 60    # Allow in-flight requests to complete
      containers:
        - name: kafka
          image: apache/kafka:3.7.0
          env:
            - name: KAFKA_LOG_DIRS
              value: /var/kafka/data
          volumeMounts:
            - name: kafka-data
              mountPath: /var/kafka/data

  # volumeClaimTemplates creates a stable PVC per pod, bound for the pod's lifetime
  volumeClaimTemplates:
    - metadata:
        name: kafka-data
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: gp3-encrypted    # Network-attached storage that can follow pods
        resources:
          requests:
            storage: 500Gi
---
# PodDisruptionBudget: prevent Kubernetes from evicting too many brokers simultaneously
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kafka-pdb
  namespace: kafka
spec:
  minAvailable: 2    # Always keep at least 2 of 3 brokers running
  selector:
    matchLabels:
      app: kafka

StorageClass selection by platform:

AWS EKS: gp3 (EBS) — network-attached, follows pods across nodes in the same AZ
GCP GKE: pd-ssd (Persistent Disk) — same principle
Azure AKS: managed-premium (Premium SSD)
On-premises: Rook-Ceph or Longhorn — distributed block storage that decouples data from specific nodes
Never use: local-storage, hostPath, or node-local volumes for Kafka data directories

Prevention

Never deploy Kafka on Kubernetes with a Deployment — always use StatefulSet.
Always use network-attached storage (not local/host storage) so pods can be rescheduled safely.
Configure PodDisruptionBudget to prevent batch evictions during node maintenance.
Quarterly drill: manually evict a broker pod and verify data integrity after rescheduling.

Incident Pattern Analysis and Systemic Prevention

Reviewing all ten incidents, root causes cluster into five categories:

Category	Incidents	Typical Root Cause
Storage and disk	1, 5	RAID degradation, compaction lag
Configuration defaults	3, 6	`acks=1`, `offsets.retention.minutes` too short
Scale-related	7, 9	Partition count too high, cross-AZ HW coupling
Infrastructure	8, 10	Certificate expiry, K8s storage misconfiguration
Client behavior	2, 4	Rebalance storm, producer backpressure mishandling

Systemic prevention priority order:

Infrastructure automation (eliminates incidents 8, 10): Certificate auto-rotation; StatefulSet + network-attached storage as the mandatory Kafka on K8s pattern.
Configuration hardening (eliminates incidents 3, 6): Enforce acks=all; set min.insync.replicas=2; increase offsets.retention.minutes to 30+ days. These should be platform defaults that new services inherit automatically.
Monitoring and alerting (enables early detection for incidents 1, 2, 5): Alert on disk usage, ISR churn rate, consumer group inactivity, and rebalance frequency — catch problems before they become outages.
Architecture review processes (prevents incidents 7, 9): Mandatory review for topics requesting >200 partitions; cross-AZ latency baselines tested at cluster setup time and periodically thereafter.

Every incident documented in this chapter had a postmortem moment where the team recognized "if we had done X earlier, this would not have happened." The discipline of Kafka reliability engineering is converting those retrospective insights into forward-looking practice — making the "obviously preventable" choices before the next incident proves them necessary.
