Monitoring: The Complete Observability Stack
Why Kafka Monitoring Is Harder Than It Looks
A mid-sized Kafka cluster exposes more than 5,000 JMX MBeans. Every broker, every topic, every partition has its own metric dimensions. The challenge is not scarcity โ it is signal-to-noise ratio. Most of those metrics are irrelevant to production stability. The few that matter are easily buried under the avalanche of data.
The central thesis of this chapter: Kafka monitoring is not about collecting everything; it is about detecting the highest-risk failures with the minimum viable set of signals. We build a complete observability stack around five golden signals, covering the full pipeline from JMX to Prometheus, from Grafana dashboards to PagerDuty alerts.
JMX Exporter: Bridging Kafka Metrics to Prometheus
Why JMX Exporter Over Confluent Metrics Reporter
Kafka natively exposes metrics through JMX (Java Management Extensions). In the Prometheus ecosystem, two approaches are common:
- JMX Exporter (Prometheus official): Runs as a Java agent inside the Kafka process, exposing a
/metricsHTTP endpoint. No separate process required. Highly configurable. Universally used in open-source deployments. - Confluent Metrics Reporter: A Confluent Platform product that writes metrics into an internal Kafka topic. Richer feature set but tightly coupled to the Confluent ecosystem.
For open-source Kafka 3.7+, JMX Exporter is the standard choice. Inject the agent into kafka-server-start.sh:
export KAFKA_OPTS="-javaagent:/opt/kafka/jmx_prometheus_javaagent-0.20.0.jar=7071:/opt/kafka/kafka-jmx-exporter.yml"
Port 7071 becomes the Prometheus scrape target.
kafka-jmx-exporter.yml: Production-Validated Configuration
The following configuration focuses on the five golden signals plus essential request-performance metrics. Every rule here has been field-tested in clusters handling hundreds of gigabytes per second.
# kafka-jmx-exporter.yml
# Tested with Apache Kafka 3.7+ and JMX Exporter 0.20+
startDelaySeconds: 30
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
# ============================================================
# Replication and ISR Health
# ============================================================
- pattern: 'kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value'
name: kafka_server_replicamanager_underreplicatedpartitions
help: "Number of partitions with fewer replicas than replication.factor; must be 0"
type: GAUGE
- pattern: 'kafka.server<type=ReplicaManager, name=ISRShrinksPerSec><>(\w+)'
name: kafka_server_replicamanager_isrshrinks_total
help: "Rate of ISR shrink events per second"
type: COUNTER
- pattern: 'kafka.server<type=ReplicaManager, name=ISRExpandsPerSec><>(\w+)'
name: kafka_server_replicamanager_isrexpands_total
help: "Rate of ISR expand events per second"
type: COUNTER
# ============================================================
# Controller Health
# ============================================================
- pattern: 'kafka.controller<type=KafkaController, name=ActiveControllerCount><>Value'
name: kafka_controller_kafkacontroller_activecontrollercount
help: "Number of active controllers; must be exactly 1"
type: GAUGE
- pattern: 'kafka.controller<type=ControllerStats, name=LeaderElectionRateAndTimeMs><>(\w+)'
name: kafka_controller_stats_leaderelection
help: "Leader election rate and duration"
type: SUMMARY
# ============================================================
# Request Handler Idle Percentage (I/O Bottleneck Signal)
# ============================================================
- pattern: 'kafka.server<type=KafkaRequestHandlerPool, name=RequestHandlerAvgIdlePercent><>(\w+)'
name: kafka_server_requesthandler_avgidlepercent
help: "Average fraction of time request handler threads are idle; < 0.3 indicates I/O bottleneck"
type: GAUGE
# ============================================================
# Network Request Latency (Produce / Fetch)
# ============================================================
- pattern: 'kafka.network<type=RequestMetrics, name=TotalTimeMs, request=(\w+)><>(\d+)thPercentile'
name: kafka_network_requestmetrics_totaltimems
help: "Total request latency in milliseconds by request type and percentile"
labels:
request: "$1"
quantile: "$2"
type: GAUGE
- pattern: 'kafka.network<type=RequestMetrics, name=RequestsPerSec, request=(\w+), version=(\d+)><>(\w+)'
name: kafka_network_requestmetrics_requestspersec
help: "Requests per second by type"
labels:
request: "$1"
version: "$2"
type: COUNTER
# ============================================================
# Broker-Level Throughput
# ============================================================
- pattern: 'kafka.server<type=BrokerTopicMetrics, name=(BytesInPerSec|BytesOutPerSec|MessagesInPerSec)><>(\w+)'
name: kafka_server_brokertopicmetrics_$1
help: "Broker-level throughput metrics"
type: COUNTER
- pattern: 'kafka.server<type=BrokerTopicMetrics, name=(BytesInPerSec|BytesOutPerSec), topic=(\S+)><>(\w+)'
name: kafka_server_brokertopicmetrics_topic_$1
help: "Per-topic throughput metrics"
labels:
topic: "$2"
type: COUNTER
# ============================================================
# Disk Health
# ============================================================
- pattern: 'kafka.log<type=LogManager, name=OfflineLogDirectoryCount><>Value'
name: kafka_log_logmanager_offlinelogdirectorycount
help: "Number of offline log directories"
type: GAUGE
- pattern: 'kafka.log<type=Log, name=Size, topic=(\S+), partition=(\d+)><>Value'
name: kafka_log_size_bytes
help: "Partition log size in bytes"
labels:
topic: "$1"
partition: "$2"
type: GAUGE
# ============================================================
# JVM GC and Memory
# ============================================================
- pattern: 'java.lang<type=GarbageCollector, name=(\S+)><>(\w+)'
name: jvm_gc_$2
labels:
gc: "$1"
type: UNTYPED
- pattern: 'java.lang<type=Memory><HeapMemoryUsage>(\w+)'
name: jvm_memory_heap_$1
type: GAUGE
Consumer Lag: Why JMX Alone Is Not Enough
Consumer Lag is the metric your business cares about most, but Kafka broker JMX does not directly expose it. The broker knows the LEO (Log End Offset) for each partition, but calculating lag requires knowing the consumer group's committed offset stored in __consumer_offsets. You need an external exporter that reads both pieces and computes the difference.
Option 1: kafka-consumer-groups.sh (simple, not scalable)
kafka-consumer-groups.sh \
--bootstrap-server kafka:9092 \
--describe \
--group payment-consumer-group
Fine for ad-hoc inspection. Polling this in a script for Prometheus scraping does not scale beyond ~50 consumer groups.
Option 2: kafka_exporter (recommended for production)
docker run -d \
--name kafka-exporter \
-p 9308:9308 \
danielqsj/kafka_exporter \
--kafka.server=kafka-broker-1:9092 \
--kafka.server=kafka-broker-2:9092 \
--kafka.server=kafka-broker-3:9092 \
--group.filter=".*" \
--topic.filter="^(?!__).*" \
--sasl.enabled \
--sasl.username=monitor \
--sasl.password=secret \
--sasl.mechanism=PLAIN
Key metrics exposed by kafka_exporter:
kafka_consumer_group_lagโ lag per (group, topic, partition) tuplekafka_consumer_group_lag_sumโ total lag for a group on a topickafka_consumer_group_current_offsetโ committed offset per partition
Prometheus scrape config:
scrape_configs:
- job_name: 'kafka-brokers'
static_configs:
- targets:
- 'kafka-broker-1:7071'
- 'kafka-broker-2:7071'
- 'kafka-broker-3:7071'
scrape_interval: 15s
scrape_timeout: 10s
- job_name: 'kafka-exporter'
static_configs:
- targets: ['kafka-exporter:9308']
scrape_interval: 30s
The Five Golden Signals, In Depth
Signal 1: UnderReplicatedPartitions
What it means: Partitions where the number of in-sync replicas is less than the configured replication.factor. In a healthy cluster, this value is always 0. Any non-zero value is a data-safety risk: if the leader fails while replicas are lagging, you face either data loss or unavailability.
Common root causes (ranked by frequency):
- Follower broker disk I/O saturated โ fetch requests take longer than
replica.lag.time.max.ms(default 30,000 ms), causing follower to be removed from ISR - Broker crashed or restarting
- Network partition between leader and follower
- JVM GC stop-the-world pause exceeding
replica.lag.time.max.ms
Inspection commands:
# List all under-replicated partitions cluster-wide
kafka-topics.sh \
--bootstrap-server kafka:9092 \
--describe \
--under-replicated-partitions
# Inspect replica placement for a specific topic
kafka-topics.sh \
--bootstrap-server kafka:9092 \
--describe \
--topic orders
Alert rule:
- alert: KafkaUnderReplicatedPartitions
expr: kafka_server_replicamanager_underreplicatedpartitions > 0
for: 1m
labels:
severity: critical
page: "true"
annotations:
summary: "P0: Kafka under-replicated partitions on {{ $labels.instance }}"
description: |
Broker {{ $labels.instance }} has {{ $value }} under-replicated partition(s)
for more than 1 minute. Check disk I/O and follower broker health immediately.
runbook: "https://wiki.internal/kafka/runbook-under-replicated"
Signal 2: ISRShrinkRate and ISRExpandRate
Frequent ISR membership changes are an early warning signal โ they typically appear minutes before UnderReplicatedPartitions starts climbing. Think of them as the leading indicator to URP's lagging indicator.
Healthy state: Both shrink and expand rates near zero. One or two events per hour are normal (planned restarts, minor transient delays). Multiple events per minute across the cluster indicates that followers are chronically struggling to keep up with the leader's write rate.
PromQL queries:
# ISR shrinks per minute (5-minute rate window)
sum(rate(kafka_server_replicamanager_isrshrinks_total[5m])) * 60
# ISR expands per minute
sum(rate(kafka_server_replicamanager_isrexpands_total[5m])) * 60
Alert rule:
- alert: KafkaISRHighChurnRate
expr: |
(
sum(rate(kafka_server_replicamanager_isrshrinks_total[5m]))
+
sum(rate(kafka_server_replicamanager_isrexpands_total[5m]))
) * 60 > 5
for: 3m
labels:
severity: warning
annotations:
summary: "Kafka ISR churn rate > 5 events/min โ cluster instability detected"
description: "Frequent ISR changes typically indicate disk I/O saturation or network jitter on follower brokers."
Signal 3: ActiveControllerCount
The Kafka cluster requires exactly one active controller at all times. The controller is responsible for partition leader elections, broker join/leave events, and topic lifecycle management.
- Value = 0: No controller. The cluster cannot elect new leaders. Producers will eventually exhaust retries; consumers will stall. Total outage.
- Value = 1: Normal.
- Value > 1: Split-brain. Two brokers simultaneously believe they are the controller. Extremely dangerous โ conflicting leadership decisions can corrupt metadata. Usually caused by a ZooKeeper network partition where one broker sees a session expiry that others do not.
- alert: KafkaControllerCountAbnormal
expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
for: 30s
labels:
severity: critical
page: "true"
annotations:
summary: "P0: Kafka active controller count is {{ $value }}, expected 1"
description: "The cluster may be unavailable or in split-brain state. Investigate ZooKeeper/KRaft connectivity immediately."
To find which broker currently holds the controller role:
# ZooKeeper mode
zookeeper-shell.sh zk:2181 get /controller
# KRaft mode
kafka-metadata-quorum.sh \
--bootstrap-server kafka:9092 \
describe --status
Signal 4: RequestHandlerAvgIdlePercent
This is the most direct measure of a broker's I/O processing headroom. Kafka request handling is split into two thread pools:
- Network Threads (
num.network.threads): Accept connections, read bytes off the socket, send responses back. Pure network I/O. - I/O Threads / Request Handlers (
num.io.threads): Decode requests, perform disk reads/writes, coordinate with replicas. CPU and disk I/O bound.
RequestHandlerAvgIdlePercent measures the fraction of time the I/O thread pool is idle:
| Range | Interpretation |
|---|---|
| > 0.7 | Healthy, significant headroom |
| 0.3 โ 0.7 | Normal operating load |
| 0.1 โ 0.3 | High load โ monitor closely |
| < 0.1 | Severe bottleneck โ requests queuing, latency spiking |
- alert: KafkaBrokerIOBottleneck
expr: kafka_server_requesthandler_avgidlepercent < 0.2
for: 5m
labels:
severity: warning
annotations:
summary: "P1: Kafka broker {{ $labels.instance }} I/O thread pool near saturation"
description: |
Request handler idle percent is {{ $value | humanizePercentage }}.
Consider increasing num.io.threads (current recommendation: 2x CPU cores)
or redistributing partitions to reduce per-broker load.
Signal 5: Consumer Lag
Consumer Lag is the business-impact signal. It directly measures the gap between what has been produced and what has been consumed. Different workloads have vastly different lag tolerances: a payment processor may need sub-second lag, while a log aggregation pipeline might tolerate minutes.
The critical insight is that lag trend matters more than absolute lag value. A lag of 100,000 messages that is stable is usually not a problem. A lag of 1,000 messages that is growing exponentially is a crisis in progress.
# Is lag growing? (5-minute delta > 0 means consumers are falling behind)
delta(kafka_consumer_group_lag_sum{group="payment-consumer"}[5m]) > 0
# Combined: lag is large AND growing
kafka_consumer_group_lag_sum > 5000
and
delta(kafka_consumer_group_lag_sum[5m]) > 1000
Alert rule:
- alert: KafkaConsumerLagGrowing
expr: |
delta(kafka_consumer_group_lag_sum[5m]) > 1000
and
kafka_consumer_group_lag_sum > 5000
for: 5m
labels:
severity: warning
annotations:
summary: "P1: Consumer group {{ $labels.consumergroup }} lag growing on {{ $labels.topic }}"
description: |
Lag is {{ $value }} and has been growing for 5+ minutes.
Check consumer application health, GC pauses, and downstream dependencies.
Grafana Dashboard: Key Panel Definitions
A production Kafka dashboard contains roughly 20 panels organized into three rows: Cluster Health, Throughput Trends, and Consumer Status. Below are the JSON definitions for the most critical panels.
{
"title": "Kafka Cluster Overview",
"uid": "kafka-cluster-overview-v2",
"tags": ["kafka", "platform"],
"refresh": "30s",
"time": {"from": "now-3h", "to": "now"},
"panels": [
{
"id": 1,
"title": "UnderReplicated Partitions",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
"targets": [
{
"expr": "sum(kafka_server_replicamanager_underreplicatedpartitions)",
"legendFormat": "URP Count"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "red", "value": 1}
]
},
"color": {"mode": "thresholds"},
"mappings": [
{"type": "value", "options": {"0": {"text": "HEALTHY", "color": "green"}}}
]
}
}
},
{
"id": 2,
"title": "Active Controller Count",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
"targets": [
{
"expr": "sum(kafka_controller_kafkacontroller_activecontrollercount)",
"legendFormat": "Controllers"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{"color": "red", "value": null},
{"color": "green", "value": 1},
{"color": "red", "value": 2}
]
}
}
}
},
{
"id": 3,
"title": "Request Handler Idle % per Broker",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
"targets": [
{
"expr": "kafka_server_requesthandler_avgidlepercent",
"legendFormat": "{{ instance }}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percentunit",
"min": 0,
"max": 1,
"thresholds": {
"steps": [
{"color": "red", "value": 0},
{"color": "yellow", "value": 0.3},
{"color": "green", "value": 0.7}
]
},
"custom": {"fillOpacity": 10}
}
}
},
{
"id": 4,
"title": "Consumer Group Lag",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 4},
"targets": [
{
"expr": "kafka_consumer_group_lag_sum",
"legendFormat": "{{ consumergroup }} / {{ topic }}"
}
],
"fieldConfig": {
"defaults": {
"custom": {"fillOpacity": 5}
}
}
},
{
"id": 5,
"title": "Network Throughput per Broker (Bytes In/Out)",
"type": "timeseries",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 12},
"targets": [
{
"expr": "rate(kafka_server_brokertopicmetrics_BytesInPerSec[5m])",
"legendFormat": "In โ {{ instance }}"
},
{
"expr": "rate(kafka_server_brokertopicmetrics_BytesOutPerSec[5m]) * -1",
"legendFormat": "Out โ {{ instance }}"
}
],
"fieldConfig": {"defaults": {"unit": "Bps"}}
},
{
"id": 6,
"title": "ISR Shrink / Expand Rate",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 20},
"targets": [
{
"expr": "rate(kafka_server_replicamanager_isrshrinks_total[5m]) * 60",
"legendFormat": "Shrink/min โ {{ instance }}"
},
{
"expr": "rate(kafka_server_replicamanager_isrexpands_total[5m]) * 60",
"legendFormat": "Expand/min โ {{ instance }}"
}
]
}
]
}
End-to-End Latency Tracing with Interceptors
JMX metrics reflect the broker's perspective. The business perspective requires measuring the full round-trip: from the moment producer.send() is called to the moment the consumer processes the record. Kafka's Interceptor API provides a clean way to instrument this without touching business logic.
ProducerInterceptor: Stamping the Produce Timestamp
public class E2ELatencyProducerInterceptor<K, V>
implements ProducerInterceptor<K, V> {
private static final String HEADER_PRODUCE_TS = "x-produce-timestamp";
@Override
public ProducerRecord<K, V> onSend(ProducerRecord<K, V> record) {
long produceTime = System.currentTimeMillis();
record.headers().add(
HEADER_PRODUCE_TS,
Long.toString(produceTime).getBytes(StandardCharsets.UTF_8)
);
return record;
}
@Override
public void onAcknowledgement(RecordMetadata metadata, Exception exception) {
// Optionally record broker-ack latency here
}
@Override public void close() {}
@Override public void configure(Map<String, ?> configs) {}
}
ConsumerInterceptor: Computing the Delta
public class E2ELatencyConsumerInterceptor<K, V>
implements ConsumerInterceptor<K, V> {
private static final String HEADER_PRODUCE_TS = "x-produce-timestamp";
// Prometheus Histogram with meaningful buckets for streaming latency
private static final Histogram E2E_LATENCY = Histogram.build()
.name("kafka_e2e_latency_milliseconds")
.help("End-to-end latency from producer send() to consumer onConsume()")
.labelNames("topic", "consumer_group")
.buckets(1, 5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000)
.register();
private String consumerGroupId;
@Override
public void configure(Map<String, ?> configs) {
consumerGroupId = (String) configs.get(ConsumerConfig.GROUP_ID_CONFIG);
}
@Override
public ConsumerRecords<K, V> onConsume(ConsumerRecords<K, V> records) {
long consumeTime = System.currentTimeMillis();
for (ConsumerRecord<K, V> record : records) {
Header tsHeader = record.headers().lastHeader(HEADER_PRODUCE_TS);
if (tsHeader != null) {
long produceTime = Long.parseLong(
new String(tsHeader.value(), StandardCharsets.UTF_8)
);
long latencyMs = consumeTime - produceTime;
E2E_LATENCY
.labels(record.topic(), consumerGroupId)
.observe(latencyMs);
}
}
return records;
}
@Override
public void onCommit(Map<TopicPartition, OffsetAndMetadata> offsets) {}
@Override public void close() {}
}
Wiring the interceptors:
// Producer configuration
props.put(
ProducerConfig.INTERCEPTOR_CLASSES_CONFIG,
"com.example.kafka.E2ELatencyProducerInterceptor"
);
// Consumer configuration
props.put(
ConsumerConfig.INTERCEPTOR_CLASSES_CONFIG,
"com.example.kafka.E2ELatencyConsumerInterceptor"
);
Key advantages of this approach:
- Zero business logic changes โ interceptors are applied transparently
- Headers survive MirrorMaker 2 replication, enabling cross-cluster latency tracking
- Prometheus Histogram provides P50/P95/P99 latency distribution without additional aggregation tooling
PromQL for latency SLO monitoring:
# 99th percentile end-to-end latency over 5 minutes
histogram_quantile(0.99,
sum(rate(kafka_e2e_latency_milliseconds_bucket[5m])) by (le, topic, consumer_group)
)
# Fraction of messages delivered under 100ms (latency SLO)
sum(rate(kafka_e2e_latency_milliseconds_bucket{le="100"}[5m])) by (topic)
/
sum(rate(kafka_e2e_latency_milliseconds_count[5m])) by (topic)
Kafka Management UI Comparison
| Feature | Kafka Manager (CMAK) | AKHQ | Redpanda Console |
|---|---|---|---|
| Maintenance status | Abandoned (Yahoo) | Active | Active (Redpanda Inc.) |
| Kafka compatibility | Up to Kafka 2.x | Kafka 2.x / 3.x | Kafka 2.x / 3.x |
| Authentication support | Basic | Full SASL/SSL | Full SASL/SSL |
| Consumer lag visualization | Yes | Yes, graphical | Yes, real-time |
| Schema Registry integration | No | Yes | Yes |
| Message browser/search | No | Yes, with filters | Yes, Protobuf support |
| Deployment | JAR | Docker / Helm | Docker / Helm |
| License | Apache 2.0 | Apache 2.0 | Redpanda BSL |
Production recommendation: AKHQ is the most feature-complete open-source option, with multi-cluster support, RBAC, and active development. Redpanda Console has a more polished UI but the BSL license requires careful review for large-scale commercial use. CMAK (formerly Kafka Manager) is no longer maintained and should not be used for new deployments.
Deploy AKHQ with Docker Compose:
services:
akhq:
image: tchiotludo/akhq:0.24.0
ports:
- "8080:8080"
environment:
AKHQ_CONFIGURATION: |
akhq:
connections:
production:
properties:
bootstrap.servers: "kafka-1:9092,kafka-2:9092,kafka-3:9092"
security.protocol: SASL_SSL
sasl.mechanism: SCRAM-SHA-256
sasl.jaas.config: >
org.apache.kafka.common.security.scram.ScramLoginModule required
username="akhq-monitor" password="monitor-secret";
ssl.truststore.location: /certs/kafka.truststore.jks
ssl.truststore.password: changeit
schema-registry:
url: "http://schema-registry:8081"
volumes:
- ./certs:/certs:ro
Reading Broker Logs: What to Look For
Kafka broker logs (logs/server.log) record cluster lifecycle events. Training yourself to recognize these patterns is the difference between a 5-minute MTTR and a 2-hour war room.
ISR Change Events
# Follower removed from ISR (shrink)
[2024-01-15 10:23:45,123] INFO [Partition orders-0 broker=1]
ISR updated to [1, 2] (was [1, 2, 3]) due to follower 3
not making progress for 30009 ms (kafka.cluster.Partition)
# Follower caught up and rejoined ISR (expand)
[2024-01-15 10:25:12,456] INFO [Partition orders-0 broker=1]
ISR updated to [1, 2, 3] (was [1, 2]) (kafka.cluster.Partition)
Seeing these lines repeatedly for the same partition is a definitive indicator of I/O pressure on broker 3. Cross-reference with that broker's disk I/O metrics and fetch lag.
Controller Election Events
# Controller elected
[2024-01-15 10:30:00,789] INFO [KafkaController id=2]
2 successfully elected as the controller. Epoch incremented to 5
# Broker registered (joined the cluster)
[2024-01-15 10:30:01,234] INFO [KafkaController id=2]
Broker 3 has registered, need to process its partition assignments
Controller elections more than 2-3 times per hour are abnormal. Common causes: ZooKeeper session timeouts (usually from GC pauses), KRaft leader instability, or clock skew between nodes.
Authentication and Authorization Failures
# SASL authentication failure
[2024-01-15 10:35:00,100] INFO [SocketServer listenerName=SASL_PLAINTEXT]
Failed authentication with /10.0.0.5 (SSL handshake failed)
# ACL authorization denial
[2024-01-15 10:35:01,200] INFO Principal = User:service-account is Denied
Operation = Write from host = 10.0.0.5 on resource = Topic:LITERAL:orders
(kafka.authorizer.logger)
A burst of authentication failures usually means a certificate expired or credentials were rotated without updating all clients. Configure a separate log appender for the authorizer logger so these events go to a dedicated audit file:
# log4j.properties
log4j.logger.kafka.authorizer.logger=INFO, authorizerAppender
log4j.appender.authorizerAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.authorizerAppender.File=/var/log/kafka/kafka-authorizer.log
log4j.appender.authorizerAppender.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.authorizerAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.authorizerAppender.layout.ConversionPattern=[%d] %p %m (%c)%n
log4j.additivity.kafka.authorizer.logger=false
Complete Alert Rules Reference
# kafka-alerts.yml โ load into Prometheus Alertmanager
groups:
- name: kafka.critical
rules:
- alert: KafkaUnderReplicatedPartitions
expr: kafka_server_replicamanager_underreplicatedpartitions > 0
for: 1m
labels:
severity: critical
page: "true"
annotations:
summary: "P0: Kafka under-replicated partitions โ data safety at risk"
- alert: KafkaControllerCountAbnormal
expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
for: 30s
labels:
severity: critical
page: "true"
annotations:
summary: "P0: Kafka controller count is {{ $value }}, must be 1"
- name: kafka.warning
rules:
- alert: KafkaConsumerLagGrowing
expr: |
delta(kafka_consumer_group_lag_sum[5m]) > 1000
and kafka_consumer_group_lag_sum > 5000
for: 5m
labels:
severity: warning
annotations:
summary: "P1: Consumer lag growing โ {{ $labels.consumergroup }}"
- alert: KafkaBrokerIOBottleneck
expr: kafka_server_requesthandler_avgidlepercent < 0.2
for: 5m
labels:
severity: warning
annotations:
summary: "P1: Broker {{ $labels.instance }} I/O threads saturated"
- alert: KafkaBrokerDiskUsageHigh
expr: |
(1 - node_filesystem_avail_bytes{mountpoint=~"/data/kafka.*"}
/ node_filesystem_size_bytes{mountpoint=~"/data/kafka.*"}) > 0.80
for: 5m
labels:
severity: warning
annotations:
summary: "P2: Kafka broker disk usage > 80% on {{ $labels.instance }}"
- alert: KafkaISRHighChurnRate
expr: |
(
sum(rate(kafka_server_replicamanager_isrshrinks_total[5m]))
+ sum(rate(kafka_server_replicamanager_isrexpands_total[5m]))
) * 60 > 5
for: 3m
labels:
severity: warning
annotations:
summary: "P2: ISR churn rate > 5 events/min โ instability detected"
Summary: The Observability Priority Ladder
Building an effective Kafka monitoring system is about discipline, not comprehensiveness. The alert fatigue that comes from alerting on 200 metrics is just as dangerous as having no monitoring at all โ both lead to the real signal being missed.
The priority ladder:
- P0 โ Page immediately (< 15 min response):
UnderReplicatedPartitions > 0,ActiveControllerCount != 1. These indicate active data safety or availability failures. - P1 โ Ticket within 1 hour: Consumer lag growing, RequestHandlerIdlePercent < 0.2. Business impact in progress or imminent.
- P2 โ Work-hours response: Disk usage > 80%, ISR churn rate elevated. Capacity and stability signals that need attention but are not yet critical.
- Dashboard only (no alert): Throughput trends, end-to-end latency percentiles, GC pause times. Use these for baseline establishment and capacity planning, not for waking engineers.
This four-level structure ensures genuine production incidents get immediate attention while giving the team cognitive headroom to do proactive work.