Chapter 28

Monitoring: The Complete Observability Stack

Why Kafka Monitoring Is Harder Than It Looks

A mid-sized Kafka cluster exposes more than 5,000 JMX MBeans. Every broker, every topic, every partition has its own metric dimensions. The challenge is not scarcity — it is signal-to-noise ratio. Most of those metrics are irrelevant to production stability. The few that matter are easily buried under the avalanche of data.

The central thesis of this chapter: Kafka monitoring is not about collecting everything; it is about detecting the highest-risk failures with the minimum viable set of signals. We build a complete observability stack around five golden signals, covering the full pipeline from JMX to Prometheus, from Grafana dashboards to PagerDuty alerts.

JMX Exporter: Bridging Kafka Metrics to Prometheus

Why JMX Exporter Over Confluent Metrics Reporter

Kafka natively exposes metrics through JMX (Java Management Extensions). In the Prometheus ecosystem, two approaches are common:

JMX Exporter (Prometheus official): Runs as a Java agent inside the Kafka process, exposing a /metrics HTTP endpoint. No separate process required. Highly configurable. Widely used in open-source deployments.
Confluent Metrics Reporter: A Confluent Platform product that writes metrics into an internal Kafka topic. Richer feature set but tightly coupled to the Confluent ecosystem.

For open-source Kafka 3.7+, JMX Exporter is the standard choice. Set KAFKA_OPTS before kafka-server-start.sh runs so that the agent is loaded into the broker JVM:

export KAFKA_OPTS="-javaagent:/opt/kafka/jmx_prometheus_javaagent-0.20.0.jar=7071:/opt/kafka/kafka-jmx-exporter.yml"

Port 7071 becomes the Prometheus scrape target.

kafka-jmx-exporter.yml: Production-Validated Configuration

The following configuration focuses on the five golden signals plus essential request-performance metrics. Every rule here has been field-tested in clusters handling hundreds of gigabytes per second.

# kafka-jmx-exporter.yml
# Tested with Apache Kafka 3.7+ and JMX Exporter 0.20+
startDelaySeconds: 30
lowercaseOutputName: true
lowercaseOutputLabelNames: true

rules:
  # ============================================================
  # Replication and ISR Health
  # ============================================================
  - pattern: 'kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value'
    name: kafka_server_replicamanager_underreplicatedpartitions
    help: "Number of partitions whose in-sync replica count is below the assigned replica count; must be 0"
    type: GAUGE

  # The MBean names use "Isr" (mixed case), and patterns are case-sensitive.
  # Only the cumulative Count attribute is exported; rate() is applied in PromQL.
  - pattern: 'kafka.server<type=ReplicaManager, name=IsrShrinksPerSec><>Count'
    name: kafka_server_replicamanager_isrshrinks_total
    help: "Cumulative count of ISR shrink events"
    type: COUNTER

  - pattern: 'kafka.server<type=ReplicaManager, name=IsrExpandsPerSec><>Count'
    name: kafka_server_replicamanager_isrexpands_total
    help: "Cumulative count of ISR expand events"
    type: COUNTER

  # ============================================================
  # Controller Health
  # ============================================================
  - pattern: 'kafka.controller<type=KafkaController, name=ActiveControllerCount><>Value'
    name: kafka_controller_kafkacontroller_activecontrollercount
    help: "Number of active controllers; must be exactly 1"
    type: GAUGE

  # Export only Count; matching every attribute under one name produces colliding series
  - pattern: 'kafka.controller<type=ControllerStats, name=LeaderElectionRateAndTimeMs><>Count'
    name: kafka_controller_stats_leaderelection_total
    help: "Cumulative count of leader elections"
    type: COUNTER

  # ============================================================
  # Request Handler Idle Percentage (I/O Bottleneck Signal)
  # ============================================================
  # This MBean is a meter; OneMinuteRate is the attribute that carries the idle fraction
  - pattern: 'kafka.server<type=KafkaRequestHandlerPool, name=RequestHandlerAvgIdlePercent><>OneMinuteRate'
    name: kafka_server_requesthandler_avgidlepercent
    help: "Average fraction of time request handler threads are idle; < 0.3 indicates I/O bottleneck"
    type: GAUGE

  # ============================================================
  # Network Request Latency (Produce / Fetch)
  # ============================================================
  - pattern: 'kafka.network<type=RequestMetrics, name=TotalTimeMs, request=(\w+)><>(\d+)thPercentile'
    name: kafka_network_requestmetrics_totaltimems
    help: "Total request latency in milliseconds by request type and percentile"
    labels:
      request: "$1"
      quantile: "$2"
    type: GAUGE

  - pattern: 'kafka.network<type=RequestMetrics, name=RequestsPerSec, request=(\w+), version=(\d+)><>Count'
    name: kafka_network_requestmetrics_requestspersec
    help: "Cumulative request count by type and version; apply rate() for requests per second"
    labels:
      request: "$1"
      version: "$2"
    type: COUNTER

  # ============================================================
  # Broker-Level Throughput
  # ============================================================
  # With lowercaseOutputName: true the exported names are lowercased,
  # e.g. kafka_server_brokertopicmetrics_bytesinpersec
  - pattern: 'kafka.server<type=BrokerTopicMetrics, name=(BytesInPerSec|BytesOutPerSec|MessagesInPerSec)><>Count'
    name: kafka_server_brokertopicmetrics_$1
    help: "Broker-level cumulative throughput counters"
    type: COUNTER

  - pattern: 'kafka.server<type=BrokerTopicMetrics, name=(BytesInPerSec|BytesOutPerSec), topic=(\S+)><>Count'
    name: kafka_server_brokertopicmetrics_topic_$1
    help: "Per-topic cumulative throughput counters"
    labels:
      topic: "$2"
    type: COUNTER

  # ============================================================
  # Disk Health
  # ============================================================
  - pattern: 'kafka.log<type=LogManager, name=OfflineLogDirectoryCount><>Value'
    name: kafka_log_logmanager_offlinelogdirectorycount
    help: "Number of offline log directories"
    type: GAUGE

  - pattern: 'kafka.log<type=Log, name=Size, topic=(\S+), partition=(\d+)><>Value'
    name: kafka_log_size_bytes
    help: "Partition log size in bytes"
    labels:
      topic: "$1"
      partition: "$2"
    type: GAUGE

  # ============================================================
  # JVM GC and Memory
  # ============================================================
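  # Collector names such as "G1 Young Generation" contain spaces, so \S+ would not match them
  - pattern: 'java.lang<type=GarbageCollector, name=(.+)><>(CollectionCount|CollectionTime)'
    name: jvm_gc_$2
    labels:
      gc: "$1"
    type: COUNTER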

  - pattern: 'java.lang<type=Memory><HeapMemoryUsage>(\w+)'
    name: jvm_memory_heap_$1
    type: GAUGE

Consumer Lag: Why JMX Alone Is Not Enough

Consumer Lag is the metric your business cares about most, but Kafka broker JMX does not directly expose it. The broker knows the LEO (Log End Offset) for each partition, but calculating lag requires knowing the consumer group's committed offset stored in __consumer_offsets. You need an external exporter that reads both pieces and computes the difference.
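
The arithmetic an external exporter performs can be sketched in a few lines of Java. This is a minimal illustration, not the code of any particular exporter: the class name LagCalculator, the "topic-partition" string keys, and the choice to treat a missing committed offset as 0 are all assumptions made to keep the example free of Kafka dependencies.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: lag = log end offset minus committed offset, per partition. */
final class LagCalculator {

    /**
     * Computes per-partition lag. Keys are "topic-partition" strings (an assumption
     * for this sketch; real code would use Kafka's TopicPartition type).
     * A partition with no committed offset is treated as committed at 0.
     */
    static Map<String, Long> lagByPartition(Map<String, Long> endOffsets,
                                            Map<String, Long> committedOffsets) {
        Map<String, Long> lag = new HashMap<>();
        for (Map.Entry<String, Long> e : endOffsets.entrySet()) {
            long committed = committedOffsets.getOrDefault(e.getKey(), 0L);
            // Clamp at zero: the two values are read at different instants,
            // so the committed offset can briefly appear ahead of the end offset.
            lag.put(e.getKey(), Math.max(0L, e.getValue() - committed));
        }
        return lag;
    }

    /** Sums lag across partitions, analogous to a per-group, per-topic total. */
    static long totalLag(Map<String, Long> lagByPartition) {
        long sum = 0L;
        for (long v : lagByPartition.values()) {
            sum += v;
        }
        return sum;
    }
}
```

With end offsets of 100 and 50 on two partitions and a committed offset of 90 on the first only, the per-partition lags are 10 and 50, for a total of 60.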

Option 1: kafka-consumer-groups.sh (simple, not scalable)

kafka-consumer-groups.sh \
  --bootstrap-server kafka:9092 \
  --describe \
  --group payment-consumer-group

Fine for ad-hoc inspection. Polling this in a script for Prometheus scraping does not scale beyond ~50 consumer groups.

Option 2: kafka_exporter (recommended for production)

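# kafka_exporter is written in Go, whose regexp engine has no lookahead,
# so internal topics are excluded with --topic.exclude instead of "^(?!__).*"
docker run -d \
  --name kafka-exporter \
  -p 9308:9308 \
  danielqsj/kafka_exporter \
  --kafka.server=kafka-broker-1:9092 \
  --kafka.server=kafka-broker-2:9092 \
  --kafka.server=kafka-broker-3:9092 \
  --group.filter=".*" \
  --topic.filter=".*" \
  --topic.exclude="^__.*" \
  --sasl.enabled \
  --sasl.username=monitor \
  --sasl.password=secret \
  --sasl.mechanism=PLAIN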

Key metrics exposed by kafka_exporter:

kafka_consumergroup_lag — lag per (consumergroup, topic, partition) tuple
kafka_consumergroup_lag_sum — total lag for a group on a topic
kafka_consumergroup_current_offset — committed offset per partition

Prometheus scrape config:

scrape_configs:
  - job_name: 'kafka-brokers'
    static_configs:
      - targets:
          - 'kafka-broker-1:7071'
          - 'kafka-broker-2:7071'
          - 'kafka-broker-3:7071'
    scrape_interval: 15s
    scrape_timeout: 10s

  - job_name: 'kafka-exporter'
    static_configs:
      - targets: ['kafka-exporter:9308']
    scrape_interval: 30s

The Five Golden Signals, In Depth

Signal 1: UnderReplicatedPartitions

What it means: Partitions where the number of in-sync replicas is less than the configured replication.factor. In a healthy cluster, this value is always 0. Any non-zero value is a data-safety risk: if the leader fails while replicas are lagging, you face either data loss or unavailability.

Common root causes (ranked by frequency):

Follower broker disk I/O saturated — fetch requests take longer than replica.lag.time.max.ms (default 30,000 ms), causing follower to be removed from ISR
Broker crashed or restarting
Network partition between leader and follower
JVM GC stop-the-world pause exceeding replica.lag.time.max.ms

Inspection commands:

# List all under-replicated partitions cluster-wide
kafka-topics.sh \
  --bootstrap-server kafka:9092 \
  --describe \
  --under-replicated-partitions

# Inspect replica placement for a specific topic
kafka-topics.sh \
  --bootstrap-server kafka:9092 \
  --describe \
  --topic orders

Alert rule:

- alert: KafkaUnderReplicatedPartitions
  expr: kafka_server_replicamanager_underreplicatedpartitions > 0
  for: 1m
  labels:
    severity: critical
    page: "true"
  annotations:
    summary: "P0: Kafka under-replicated partitions on {{ $labels.instance }}"
    description: |
      Broker {{ $labels.instance }} has {{ $value }} under-replicated partition(s)
      for more than 1 minute. Check disk I/O and follower broker health immediately.
    runbook: "https://wiki.internal/kafka/runbook-under-replicated"

Signal 2: ISRShrinkRate and ISRExpandRate

Frequent ISR membership changes are an early warning signal — they typically appear minutes before UnderReplicatedPartitions starts climbing. Think of them as the leading indicator to URP's lagging indicator.

Healthy state: Both shrink and expand rates near zero. One or two events per hour are normal (planned restarts, minor transient delays). Multiple events per minute across the cluster indicates that followers are chronically struggling to keep up with the leader's write rate.

PromQL queries:

# ISR shrinks per minute (5-minute rate window)
sum(rate(kafka_server_replicamanager_isrshrinks_total[5m])) * 60

# ISR expands per minute
sum(rate(kafka_server_replicamanager_isrexpands_total[5m])) * 60
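
To make the unit conversion in these queries concrete, the following Java sketch derives an events-per-minute figure from two samples of a cumulative counter. It is a simplification under stated assumptions: the class name CounterRate is hypothetical, and Prometheus's real rate() additionally extrapolates to the window boundaries, which this sketch does not attempt.

```java
/** Hypothetical helper: per-minute increase between two samples of a cumulative counter. */
final class CounterRate {

    /**
     * @param earlier        counter value at the start of the window
     * @param later          counter value at the end of the window
     * @param elapsedSeconds seconds between the two samples
     */
    static double perMinute(double earlier, double later, double elapsedSeconds) {
        if (elapsedSeconds <= 0) {
            throw new IllegalArgumentException("elapsedSeconds must be positive");
        }
        // A drop means the counter was reset (for example by a broker restart);
        // in that case only the growth since the reset is counted.
        double increase = later >= earlier ? later - earlier : later;
        return increase * 60.0 / elapsedSeconds;
    }
}
```

A counter that moves from 10 to 40 over a 300-second window yields 6 events per minute, which would exceed the 5 events/min threshold used in this chapter.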

Alert rule:

- alert: KafkaISRHighChurnRate
  expr: |
    (
      sum(rate(kafka_server_replicamanager_isrshrinks_total[5m]))
      +
      sum(rate(kafka_server_replicamanager_isrexpands_total[5m]))
    ) * 60 > 5
  for: 3m
  labels:
    severity: warning
  annotations:
    summary: "Kafka ISR churn rate > 5 events/min — cluster instability detected"
    description: "Frequent ISR changes typically indicate disk I/O saturation or network jitter on follower brokers."

Signal 3: ActiveControllerCount

The Kafka cluster requires exactly one active controller at all times. The controller is responsible for partition leader elections, broker join/leave events, and topic lifecycle management.

Value = 0: No controller. The cluster cannot elect new leaders. Producers will eventually exhaust retries; consumers will stall. Total outage.
Value = 1: Normal.
Value > 1: Split-brain. Two brokers simultaneously believe they are the controller. Extremely dangerous — conflicting leadership decisions can corrupt metadata. Usually caused by a ZooKeeper network partition where one broker sees a session expiry that others do not.
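
The three cases above hinge on the cluster-wide sum, not on any single broker's value, which is why the alert rule aggregates with sum(). A minimal Java sketch of that classification follows; the class and enum names are hypothetical and exist only for illustration.

```java
/** Hypothetical helper: interprets the cluster-wide sum of ActiveControllerCount. */
final class ControllerHealth {

    enum State { NO_CONTROLLER, HEALTHY, SPLIT_BRAIN }

    /** Each array element is one broker's reported ActiveControllerCount (0 or 1). */
    static State classify(int[] activeControllerCountPerBroker) {
        int total = 0;
        for (int count : activeControllerCountPerBroker) {
            total += count;
        }
        if (total == 0) {
            return State.NO_CONTROLLER;
        }
        return total == 1 ? State.HEALTHY : State.SPLIT_BRAIN;
    }
}
```

Looking at brokers individually would flag every non-controller broker as unhealthy, since each of them legitimately reports 0.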

- alert: KafkaControllerCountAbnormal
  expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
  for: 30s
  labels:
    severity: critical
    page: "true"
  annotations:
    summary: "P0: Kafka active controller count is {{ $value }}, expected 1"
    description: "The cluster may be unavailable or in split-brain state. Investigate ZooKeeper/KRaft connectivity immediately."

To find which broker currently holds the controller role:

# ZooKeeper mode
zookeeper-shell.sh zk:2181 get /controller

# KRaft mode
kafka-metadata-quorum.sh \
  --bootstrap-server kafka:9092 \
  describe --status

Signal 4: RequestHandlerAvgIdlePercent

This is the most direct measure of a broker's I/O processing headroom. Kafka request handling is split into two thread pools:

Network Threads (num.network.threads): Accept connections, read bytes off the socket, send responses back. Pure network I/O.
I/O Threads / Request Handlers (num.io.threads): Decode requests, perform disk reads/writes, coordinate with replicas. CPU and disk I/O bound.

RequestHandlerAvgIdlePercent measures the fraction of time the I/O thread pool is idle:

Range	Interpretation
> 0.7	Healthy, significant headroom
0.3 – 0.7	Normal operating load
0.1 – 0.3	High load — monitor closely
< 0.1	Severe bottleneck — requests queuing, latency spiking

- alert: KafkaBrokerIOBottleneck
  expr: kafka_server_requesthandler_avgidlepercent < 0.2
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "P1: Kafka broker {{ $labels.instance }} I/O thread pool near saturation"
    description: |
      Request handler idle percent is {{ $value | humanizePercentage }}.
      Consider increasing num.io.threads (current recommendation: 2x CPU cores)
      or redistributing partitions to reduce per-broker load.
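
The range table for RequestHandlerAvgIdlePercent can also be encoded directly, for example in a capacity report. The sketch below is one possible reading of it: the class and enum names are hypothetical, and because the table lists 0.3 and 0.7 in two rows each, the boundary handling here (0.3 and 0.7 both count as NORMAL) is an assumption.

```java
/** Hypothetical helper: maps an idle fraction to the load levels from the range table. */
final class HandlerLoad {

    enum Level { HEALTHY, NORMAL, HIGH, SEVERE }

    /** @param idleFraction value in [0, 1], as exported by the broker */
    static Level classify(double idleFraction) {
        if (idleFraction < 0.0 || idleFraction > 1.0) {
            throw new IllegalArgumentException("idleFraction must be within [0, 1]");
        }
        if (idleFraction > 0.7) {
            return Level.HEALTHY;   // significant headroom
        }
        if (idleFraction >= 0.3) {
            return Level.NORMAL;    // normal operating load
        }
        if (idleFraction >= 0.1) {
            return Level.HIGH;      // monitor closely
        }
        return Level.SEVERE;        // requests queuing, latency spiking
    }
}
```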

Signal 5: Consumer Lag

Consumer Lag is the business-impact signal. It directly measures the gap between what has been produced and what has been consumed. Different workloads have vastly different lag tolerances: a payment processor may need sub-second lag, while a log aggregation pipeline might tolerate minutes.

The critical insight is that lag trend matters more than absolute lag value. A lag of 100,000 messages that is stable is usually not a problem. A lag of 1,000 messages that is growing exponentially is a crisis in progress.

# Is lag growing? (5-minute delta > 0 means consumers are falling behind)
# kafka_exporter names the metric kafka_consumergroup_lag_sum and labels the group as consumergroup
delta(kafka_consumergroup_lag_sum{consumergroup="payment-consumer-group"}[5m]) > 0

# Combined: lag is large AND growing
kafka_consumergroup_lag_sum > 5000
  and
delta(kafka_consumergroup_lag_sum[5m]) > 1000
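
The "large and growing" condition can be reasoned about outside PromQL as well. The following Java sketch applies the same two thresholds to a window of lag samples; the class name LagTrend is hypothetical, and unlike Prometheus's delta() it simply subtracts the oldest sample from the newest without extrapolation.

```java
/** Hypothetical helper: evaluates "lag is large AND growing" over a window of samples. */
final class LagTrend {

    /**
     * @param samples         lag values ordered oldest to newest across the window
     * @param lagThreshold    absolute lag above which the group counts as behind
     * @param growthThreshold minimum increase over the window that counts as growing
     */
    static boolean shouldAlert(long[] samples, long lagThreshold, long growthThreshold) {
        if (samples.length < 2) {
            return false; // a trend needs at least two points
        }
        long newest = samples[samples.length - 1];
        long delta = newest - samples[0];
        return newest > lagThreshold && delta > growthThreshold;
    }
}
```

With thresholds of 5000 and 1000, a stable lag around 100,000 does not fire, a lag climbing from 4,000 to 7,000 does, and a lag climbing from 1,000 to 2,500 does not because it is still below the absolute threshold.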

Alert rule:

- alert: KafkaConsumerLagGrowing
  # The lag comparison comes first so that {{ $value }} in the annotations
  # is the lag itself rather than the 5-minute delta.
  expr: |
    kafka_consumergroup_lag_sum > 5000
    and
    delta(kafka_consumergroup_lag_sum[5m]) > 1000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "P1: Consumer group {{ $labels.consumergroup }} lag growing on {{ $labels.topic }}"
    description: |
      Lag is {{ $value }} and has been growing for 5+ minutes.
      Check consumer application health, GC pauses, and downstream dependencies.

Grafana Dashboard: Key Panel Definitions

A production Kafka dashboard contains roughly 20 panels organized into three rows: Cluster Health, Throughput Trends, and Consumer Status. Below are the JSON definitions for the most critical panels.

{
  "title": "Kafka Cluster Overview",
  "uid": "kafka-cluster-overview-v2",
  "tags": ["kafka", "platform"],
  "refresh": "30s",
  "time": {"from": "now-3h", "to": "now"},
  "panels": [
    {
      "id": 1,
      "title": "UnderReplicated Partitions",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
      "targets": [
        {
          "expr": "sum(kafka_server_replicamanager_underreplicatedpartitions)",
          "legendFormat": "URP Count"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"color": "green", "value": null},
              {"color": "red", "value": 1}
            ]
          },
          "color": {"mode": "thresholds"},
          "mappings": [
            {"type": "value", "options": {"0": {"text": "HEALTHY", "color": "green"}}}
          ]
        }
      }
    },
    {
      "id": 2,
      "title": "Active Controller Count",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
      "targets": [
        {
          "expr": "sum(kafka_controller_kafkacontroller_activecontrollercount)",
          "legendFormat": "Controllers"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "steps": [
              {"color": "red", "value": null},
              {"color": "green", "value": 1},
              {"color": "red", "value": 2}
            ]
          }
        }
      }
    },
    {
      "id": 3,
      "title": "Request Handler Idle % per Broker",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
      "targets": [
        {
          "expr": "kafka_server_requesthandler_avgidlepercent",
          "legendFormat": "{{ instance }}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percentunit",
          "min": 0,
          "max": 1,
          "thresholds": {
            "steps": [
              {"color": "red", "value": 0},
              {"color": "yellow", "value": 0.3},
              {"color": "green", "value": 0.7}
            ]
          },
          "custom": {"fillOpacity": 10}
        }
      }
    },
    {
      "id": 4,
      "title": "Consumer Group Lag",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 4},
      "targets": [
        {
          "expr": "kafka_consumergroup_lag_sum",
          "legendFormat": "{{ consumergroup }} / {{ topic }}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "custom": {"fillOpacity": 5}
        }
      }
    },
    {
      "id": 5,
      "title": "Network Throughput per Broker (Bytes In/Out)",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 24, "x": 0, "y": 12},
      "targets": [
        {
          "expr": "rate(kafka_server_brokertopicmetrics_bytesinpersec[5m])",
          "legendFormat": "In — {{ instance }}"
        },
        {
          "expr": "rate(kafka_server_brokertopicmetrics_bytesoutpersec[5m]) * -1",
          "legendFormat": "Out — {{ instance }}"
        }
      ],
      "fieldConfig": {"defaults": {"unit": "Bps"}}
    },
    {
      "id": 6,
      "title": "ISR Shrink / Expand Rate",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 20},
      "targets": [
        {
          "expr": "rate(kafka_server_replicamanager_isrshrinks_total[5m]) * 60",
          "legendFormat": "Shrink/min — {{ instance }}"
        },
        {
          "expr": "rate(kafka_server_replicamanager_isrexpands_total[5m]) * 60",
          "legendFormat": "Expand/min — {{ instance }}"
        }
      ]
    }
  ]
}

End-to-End Latency Tracing with Interceptors

JMX metrics reflect the broker's perspective. The business perspective requires measuring the full round-trip: from the moment producer.send() is called to the moment the consumer processes the record. Kafka's Interceptor API provides a clean way to instrument this without touching business logic.

ProducerInterceptor: Stamping the Produce Timestamp

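import java.nio.charset.StandardCharsets;
import java.util.Map;

import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

// Stamps every outgoing record with the wall-clock send time as a header
public class E2ELatencyProducerInterceptor<K, V>
    implements ProducerInterceptor<K, V> {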

    private static final String HEADER_PRODUCE_TS = "x-produce-timestamp";

    @Override
    public ProducerRecord<K, V> onSend(ProducerRecord<K, V> record) {
        long produceTime = System.currentTimeMillis();
        record.headers().add(
            HEADER_PRODUCE_TS,
            Long.toString(produceTime).getBytes(StandardCharsets.UTF_8)
        );
        return record;
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {
        // Optionally record broker-ack latency here
    }

    @Override public void close() {}
    @Override public void configure(Map<String, ?> configs) {}
}

ConsumerInterceptor: Computing the Delta

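import java.nio.charset.StandardCharsets;
import java.util.Map;

// Histogram comes from the Prometheus Java simpleclient library
import io.prometheus.client.Histogram;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerInterceptor;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.header.Header;

public class E2ELatencyConsumerInterceptor<K, V>
    implements ConsumerInterceptor<K, V> {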

    private static final String HEADER_PRODUCE_TS = "x-produce-timestamp";

    // Prometheus Histogram with meaningful buckets for streaming latency
    private static final Histogram E2E_LATENCY = Histogram.build()
        .name("kafka_e2e_latency_milliseconds")
        .help("End-to-end latency from producer send() to consumer onConsume()")
        .labelNames("topic", "consumer_group")
        .buckets(1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000)
        .register();

    private String consumerGroupId;

    @Override
    public void configure(Map<String, ?> configs) {
        consumerGroupId = (String) configs.get(ConsumerConfig.GROUP_ID_CONFIG);
    }

    @Override
    public ConsumerRecords<K, V> onConsume(ConsumerRecords<K, V> records) {
        long consumeTime = System.currentTimeMillis();

        for (ConsumerRecord<K, V> record : records) {
            Header tsHeader = record.headers().lastHeader(HEADER_PRODUCE_TS);
            if (tsHeader != null) {
                long produceTime = Long.parseLong(
                    new String(tsHeader.value(), StandardCharsets.UTF_8)
                );
                long latencyMs = consumeTime - produceTime;

                E2E_LATENCY
                    .labels(record.topic(), consumerGroupId)
                    .observe(latencyMs);
            }
        }
        return records;
    }

    @Override
    public void onCommit(Map<TopicPartition, OffsetAndMetadata> offsets) {}
    @Override public void close() {}
}
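
Two practical hazards with this header-based measurement are malformed header values and clock skew between producer and consumer hosts, which can make the computed latency negative. One way to guard against both is sketched below. The class ProduceTimestampCodec and its method names are hypothetical, and clamping negative latencies to zero is an assumed policy; dropping such samples would be an equally defensible choice.

```java
import java.nio.charset.StandardCharsets;
import java.util.OptionalLong;

/** Hypothetical helper: encodes and defensively decodes the produce-timestamp header value. */
final class ProduceTimestampCodec {

    /** Encodes epoch milliseconds as UTF-8 decimal text, matching the producer interceptor. */
    static byte[] encode(long epochMillis) {
        return Long.toString(epochMillis).getBytes(StandardCharsets.UTF_8);
    }

    /** Returns the timestamp, or empty if the header is missing or not a valid number. */
    static OptionalLong decode(byte[] headerValue) {
        if (headerValue == null) {
            return OptionalLong.empty();
        }
        try {
            String text = new String(headerValue, StandardCharsets.UTF_8).trim();
            return OptionalLong.of(Long.parseLong(text));
        } catch (NumberFormatException e) {
            return OptionalLong.empty();
        }
    }

    /** Latency in milliseconds, clamped at zero to absorb clock skew between hosts. */
    static long latencyMillis(long produceMillis, long consumeMillis) {
        return Math.max(0L, consumeMillis - produceMillis);
    }
}
```

Inside onConsume(), a decode() result that is empty would simply be skipped, so a single bad header does not interrupt the loop over the remaining records in the batch.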

Wiring the interceptors:

// Producer configuration
props.put(
    ProducerConfig.INTERCEPTOR_CLASSES_CONFIG,
    "com.example.kafka.E2ELatencyProducerInterceptor"
);

// Consumer configuration
props.put(
    ConsumerConfig.INTERCEPTOR_CLASSES_CONFIG,
    "com.example.kafka.E2ELatencyConsumerInterceptor"
);

Key advantages of this approach:

Zero business logic changes — interceptors are applied transparently
Headers survive MirrorMaker 2 replication, enabling cross-cluster latency tracking
Prometheus Histogram provides P50/P95/P99 latency distribution without additional aggregation tooling

PromQL for latency SLO monitoring:

# 99th percentile end-to-end latency over 5 minutes
histogram_quantile(0.99,
  sum(rate(kafka_e2e_latency_milliseconds_bucket[5m])) by (le, topic, consumer_group)
)

# Fraction of messages delivered under 100ms (latency SLO)
# The Java simpleclient renders bucket bounds as floats, hence le="100.0"
sum(rate(kafka_e2e_latency_milliseconds_bucket{le="100.0"}[5m])) by (topic)
/
sum(rate(kafka_e2e_latency_milliseconds_count[5m])) by (topic)
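
To build intuition for what these two queries compute, the sketch below estimates a quantile from cumulative bucket counts by linear interpolation inside the matching bucket, and derives the fraction of observations at or below a bucket bound. It is a simplified model of the idea behind histogram_quantile, not Prometheus's implementation; the class and method names are hypothetical, and it assumes all bucket bounds are positive, so the lowest bucket starts at 0.

```java
/** Hypothetical helper: quantile and SLO-fraction estimates from cumulative histogram buckets. */
final class BucketQuantile {

    /**
     * @param q                quantile in [0, 1]
     * @param upperBounds      ascending finite bucket bounds (the "le" values), without +Inf
     * @param cumulativeCounts cumulative counts per bucket; one extra last entry for +Inf
     */
    static double estimate(double q, double[] upperBounds, long[] cumulativeCounts) {
        if (q < 0.0 || q > 1.0) {
            throw new IllegalArgumentException("q must be within [0, 1]");
        }
        if (cumulativeCounts.length != upperBounds.length + 1) {
            throw new IllegalArgumentException("expected one count per bound plus +Inf");
        }
        long total = cumulativeCounts[cumulativeCounts.length - 1];
        if (total == 0) {
            return Double.NaN; // no observations
        }
        double rank = q * total;
        for (int i = 0; i < upperBounds.length; i++) {
            if (cumulativeCounts[i] >= rank) {
                double lower = i == 0 ? 0.0 : upperBounds[i - 1];
                long below = i == 0 ? 0L : cumulativeCounts[i - 1];
                long inBucket = cumulativeCounts[i] - below;
                if (inBucket == 0) {
                    return lower;
                }
                // Assume observations are spread evenly inside the bucket.
                return lower + (upperBounds[i] - lower) * (rank - below) / inBucket;
            }
        }
        // The rank falls into the +Inf bucket: report the highest finite bound.
        return upperBounds[upperBounds.length - 1];
    }

    /** Fraction of all observations at or below the bound of bucket {@code bucketIndex}. */
    static double fractionAtOrBelow(int bucketIndex, long[] cumulativeCounts) {
        long total = cumulativeCounts[cumulativeCounts.length - 1];
        return total == 0 ? Double.NaN : (double) cumulativeCounts[bucketIndex] / total;
    }
}
```

The interpolation step is why bucket boundaries matter: an estimated P99 can never be more precise than the width of the bucket it lands in, so buckets should be densest around the SLO threshold.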

Kafka Management UI Comparison

Feature	Kafka Manager (CMAK)	AKHQ	Redpanda Console
Maintenance status	Abandoned (Yahoo)	Active	Active (Redpanda Inc.)
Kafka compatibility	Up to Kafka 2.x	Kafka 2.x / 3.x	Kafka 2.x / 3.x
Authentication support	Basic	Full SASL/SSL	Full SASL/SSL
Consumer lag visualization	Yes	Yes, graphical	Yes, real-time
Schema Registry integration	No	Yes	Yes
Message browser/search	No	Yes, with filters	Yes, Protobuf support
Deployment	JAR	Docker / Helm	Docker / Helm
License	Apache 2.0	Apache 2.0	Redpanda BSL

Production recommendation: AKHQ is the most feature-complete open-source option, with multi-cluster support, RBAC, and active development. Redpanda Console has a more polished UI but the BSL license requires careful review for large-scale commercial use. CMAK (formerly Kafka Manager) is no longer maintained and should not be used for new deployments.

Deploy AKHQ with Docker Compose:

services:
  akhq:
    image: tchiotludo/akhq:0.24.0
    ports:
      - "8080:8080"
    environment:
      AKHQ_CONFIGURATION: |
        akhq:
          connections:
            production:
              properties:
                bootstrap.servers: "kafka-1:9092,kafka-2:9092,kafka-3:9092"
                security.protocol: SASL_SSL
                sasl.mechanism: SCRAM-SHA-256
                sasl.jaas.config: >
                  org.apache.kafka.common.security.scram.ScramLoginModule required
                  username="akhq-monitor" password="monitor-secret";
                ssl.truststore.location: /certs/kafka.truststore.jks
                ssl.truststore.password: changeit
              schema-registry:
                url: "http://schema-registry:8081"
    volumes:
      - ./certs:/certs:ro

Reading Broker Logs: What to Look For

Kafka broker logs (logs/server.log) record cluster lifecycle events. Training yourself to recognize these patterns is the difference between a 5-minute MTTR and a 2-hour war room.

ISR Change Events

# Follower removed from ISR (shrink)
[2024-01-15 10:23:45,123] INFO [Partition orders-0 broker=1]
ISR updated to [1, 2] (was [1, 2, 3]) due to follower 3
not making progress for 30009 ms (kafka.cluster.Partition)

# Follower caught up and rejoined ISR (expand)
[2024-01-15 10:25:12,456] INFO [Partition orders-0 broker=1]
ISR updated to [1, 2, 3] (was [1, 2]) (kafka.cluster.Partition)

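Seeing these lines repeatedly for the same partition is a strong indicator that broker 3 cannot keep up with the leader. Disk I/O pressure is the most common cause, though long GC pauses and network jitter produce the same pattern. Cross-reference with that broker's disk I/O metrics and fetch lag.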
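
When these lines need to be counted per broker rather than read by eye, a small parser helps. The Java sketch below extracts which broker IDs dropped out of the ISR from a single log line. It assumes the message layout shown in the example above, joined onto one line; the class name IsrLogParser is hypothetical, and the exact wording of this log message differs between Kafka versions, so the pattern would need adjusting against real logs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds brokers removed from the ISR in an "ISR updated" log line. */
final class IsrLogParser {

    // Groups: 1 = topic-partition, 2 = leader broker, 3 = new ISR, 4 = previous ISR
    private static final Pattern ISR_UPDATE = Pattern.compile(
        "\\[Partition (\\S+) broker=(\\d+)\\]\\s+ISR updated to \\[([^\\]]*)\\] \\(was \\[([^\\]]*)\\]\\)");

    /** Returns broker IDs in the previous ISR but not the new one; empty if the line does not match. */
    static List<Integer> removedBrokers(String line) {
        Matcher m = ISR_UPDATE.matcher(line);
        if (!m.find()) {
            return List.of();
        }
        List<Integer> updated = parseIds(m.group(3));
        List<Integer> removed = new ArrayList<>(parseIds(m.group(4)));
        removed.removeAll(updated);
        return removed;
    }

    private static List<Integer> parseIds(String csv) {
        List<Integer> ids = new ArrayList<>();
        for (String part : csv.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                ids.add(Integer.parseInt(trimmed));
            }
        }
        return ids;
    }
}
```

Applied to the shrink example above, removedBrokers returns a list containing only broker 3; for the expand example it returns an empty list.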

Controller Election Events

# Controller elected
[2024-01-15 10:30:00,789] INFO [KafkaController id=2]
2 successfully elected as the controller. Epoch incremented to 5

# Broker registered (joined the cluster)
[2024-01-15 10:30:01,234] INFO [KafkaController id=2]
Broker 3 has registered, need to process its partition assignments

Controller elections more than 2-3 times per hour are abnormal. Common causes: ZooKeeper session timeouts (usually from GC pauses), KRaft leader instability, or clock skew between nodes.

Authentication and Authorization Failures

# Authentication failure (TLS handshake) on a SASL_SSL listener
[2024-01-15 10:35:00,100] INFO [SocketServer listenerName=SASL_SSL]
Failed authentication with /10.0.0.5 (SSL handshake failed)

# ACL authorization denial
[2024-01-15 10:35:01,200] INFO Principal = User:service-account is Denied
Operation = Write from host = 10.0.0.5 on resource = Topic:LITERAL:orders
(kafka.authorizer.logger)

A burst of authentication failures usually means a certificate expired or credentials were rotated without updating all clients. Configure a separate log appender for the authorizer logger so these events go to a dedicated audit file:

# log4j.properties
log4j.logger.kafka.authorizer.logger=INFO, authorizerAppender
log4j.appender.authorizerAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.authorizerAppender.File=/var/log/kafka/kafka-authorizer.log
log4j.appender.authorizerAppender.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.authorizerAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.authorizerAppender.layout.ConversionPattern=[%d] %p %m (%c)%n
log4j.additivity.kafka.authorizer.logger=false

Complete Alert Rules Reference

# kafka-alerts.yml: load through rule_files in prometheus.yml (Prometheus evaluates the rules; Alertmanager only routes the resulting alerts)
groups:
  - name: kafka.critical
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 1m
        labels:
          severity: critical
          page: "true"
        annotations:
          summary: "P0: Kafka under-replicated partitions — data safety at risk"

      - alert: KafkaControllerCountAbnormal
        expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
        for: 30s
        labels:
          severity: critical
          page: "true"
        annotations:
          summary: "P0: Kafka controller count is {{ $value }}, must be 1"

  - name: kafka.warning
    rules:
      - alert: KafkaConsumerLagGrowing
        expr: |
          kafka_consumergroup_lag_sum > 5000
          and delta(kafka_consumergroup_lag_sum[5m]) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P1: Consumer lag growing — {{ $labels.consumergroup }}"

      - alert: KafkaBrokerIOBottleneck
        expr: kafka_server_requesthandler_avgidlepercent < 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P1: Broker {{ $labels.instance }} I/O threads saturated"

      - alert: KafkaBrokerDiskUsageHigh
        expr: |
          (1 - node_filesystem_avail_bytes{mountpoint=~"/data/kafka.*"}
               / node_filesystem_size_bytes{mountpoint=~"/data/kafka.*"}) > 0.80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P2: Kafka broker disk usage > 80% on {{ $labels.instance }}"

      - alert: KafkaISRHighChurnRate
        expr: |
          (
            sum(rate(kafka_server_replicamanager_isrshrinks_total[5m]))
            + sum(rate(kafka_server_replicamanager_isrexpands_total[5m]))
          ) * 60 > 5
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "P2: ISR churn rate > 5 events/min — instability detected"

Summary: The Observability Priority Ladder

Building an effective Kafka monitoring system is about discipline, not comprehensiveness. The alert fatigue that comes from alerting on 200 metrics is just as dangerous as having no monitoring at all — both lead to the real signal being missed.

The priority ladder:

P0 — Page immediately (< 15 min response): UnderReplicatedPartitions > 0, ActiveControllerCount != 1. These indicate active data safety or availability failures.
P1 — Ticket within 1 hour: Consumer lag growing, RequestHandlerAvgIdlePercent < 0.2. Business impact in progress or imminent.
P2 — Work-hours response: Disk usage > 80%, ISR churn rate elevated. Capacity and stability signals that need attention but are not yet critical.
Dashboard only (no alert): Throughput trends, end-to-end latency percentiles, GC pause times. Use these for baseline establishment and capacity planning, not for waking engineers.

This four-level structure ensures genuine production incidents get immediate attention while giving the team cognitive headroom to do proactive work.
