监控体系:从指标到告警的完整方案
第28章:监控体系:从指标到告警的完整方案
导读:如何构建完整的 Kafka 监控体系?
本章核心问题:如何构建完整的 Kafka 监控体系?
读完本章你将理解:
- 关键 JMX 指标分类
- Prometheus + Grafana 方案
- Consumer Lag 监控与告警
- 端到端延迟追踪
Level 1 · 你需要知道的(1-3年经验)
JMX Exporter:将 Kafka 指标接入 Prometheus
为什么选择 JMX Exporter 而非 Confluent Metrics Reporter
Kafka 原生通过 JMX(Java Management Extensions)暴露指标。Prometheus 生态下,有两种主流方案将 JMX 指标转换为 Prometheus 格式:
- JMX Exporter(Prometheus 官方):以 Java Agent 方式运行在 Kafka 进程内,通过 HTTP 接口暴露
/metrics。无需单独进程,配置灵活,社区广泛使用。 - Confluent Metrics Reporter:Confluent 商业版方案,将指标写入内部 Kafka Topic,功能更丰富但耦合 Confluent 平台。
开源环境下,JMX Exporter 是标准选择。在 kafka-server-start.sh 中注入 Agent:
export KAFKA_OPTS="-javaagent:/opt/kafka/jmx_prometheus_javaagent-0.20.0.jar=7071:/opt/kafka/kafka-jmx-exporter.yml"
端口 7071 将成为 Prometheus 的抓取目标(scrape target)。
kafka-jmx-exporter.yml 完整配置
以下是经过生产验证的 JMX Exporter 配置文件,聚焦五个黄金信号相关 MBean,同时保留必要的请求性能指标:
# kafka-jmx-exporter.yml
# 适用于 Apache Kafka 3.7+,基于 JMX Exporter 0.20+
startDelaySeconds: 30
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
# ============================================================
# 副本与 ISR 健康
# ============================================================
- pattern: 'kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value'
name: kafka_server_replicamanager_underreplicatedpartitions
help: "副本不足的分区数,> 0 立即告警"
type: GAUGE
- pattern: 'kafka.server<type=ReplicaManager, name=ISRShrinksPerSec><>(\w+)'
name: kafka_server_replicamanager_isrshrinks_total
help: "ISR 收缩速率(每秒)"
type: COUNTER
- pattern: 'kafka.server<type=ReplicaManager, name=ISRExpandsPerSec><>(\w+)'
name: kafka_server_replicamanager_isrexpands_total
help: "ISR 扩展速率(每秒)"
type: COUNTER
# ============================================================
# Controller 健康
# ============================================================
- pattern: 'kafka.controller<type=KafkaController, name=ActiveControllerCount><>Value'
name: kafka_controller_kafkacontroller_activecontrollercount
help: "活跃 Controller 数量,必须精确为 1"
type: GAUGE
- pattern: 'kafka.controller<type=ControllerStats, name=LeaderElectionRateAndTimeMs><>(\w+)'
name: kafka_controller_stats_leaderelection
help: "Leader 选举速率与延迟"
type: SUMMARY
# ============================================================
# 请求处理线程空闲率(I/O 瓶颈核心指标)
# ============================================================
- pattern: 'kafka.server<type=KafkaRequestHandlerPool, name=RequestHandlerAvgIdlePercent><>(\w+)'
name: kafka_server_requesthandler_avgidlepercent
help: "请求处理线程平均空闲百分比,< 0.3 表示 I/O 瓶颈"
type: GAUGE
# ============================================================
# 网络请求性能(Produce/Fetch 延迟)
# ============================================================
- pattern: 'kafka.network<type=RequestMetrics, name=TotalTimeMs, request=(\w+)><>(\d+)thPercentile'
name: kafka_network_requestmetrics_totaltimems
help: "请求总延迟(毫秒),按请求类型和百分位分组"
labels:
request: "$1"
quantile: "$2"
type: GAUGE
- pattern: 'kafka.network<type=RequestMetrics, name=RequestsPerSec, request=(\w+), version=(\d+)><>(\w+)'
name: kafka_network_requestmetrics_requestspersec
help: "每秒请求数,按请求类型分组"
labels:
request: "$1"
version: "$2"
type: COUNTER
# ============================================================
# Broker I/O 吞吐量
# ============================================================
- pattern: 'kafka.server<type=BrokerTopicMetrics, name=(BytesInPerSec|BytesOutPerSec|MessagesInPerSec)><>(\w+)'
name: kafka_server_brokertopicmetrics_$1
help: "Broker 级别吞吐量"
type: COUNTER
- pattern: 'kafka.server<type=BrokerTopicMetrics, name=(BytesInPerSec|BytesOutPerSec), topic=(\S+)><>(\w+)'
name: kafka_server_brokertopicmetrics_topic_$1
help: "Topic 级别吞吐量"
labels:
topic: "$2"
type: COUNTER
# ============================================================
# 磁盘使用(Log Size)
# ============================================================
- pattern: 'kafka.log<type=LogManager, name=OfflineLogDirectoryCount><>Value'
name: kafka_log_logmanager_offlinelogdirectorycount
help: "离线日志目录数"
type: GAUGE
- pattern: 'kafka.log<type=Log, name=Size, topic=(\S+), partition=(\d+)><>Value'
name: kafka_log_size_bytes
help: "分区日志大小(字节)"
labels:
topic: "$1"
partition: "$2"
type: GAUGE
# ============================================================
# Consumer Lag(通过 kafka_consumer_group_lag 外部采集,见下文)
# ============================================================
# ============================================================
# JVM GC 与内存(通用 Java 指标)
# ============================================================
- pattern: 'java.lang<type=GarbageCollector, name=(\S+)><>(\w+)'
name: jvm_gc_$2
labels:
gc: "$1"
type: UNTYPED
- pattern: 'java.lang<type=Memory><HeapMemoryUsage>(\w+)'
name: jvm_memory_heap_$1
type: GAUGE
Consumer Lag:为什么不能只靠 JMX
Consumer Lag(消费者积压)是业务最关心的指标,但 Kafka Broker 的 JMX 并不直接提供 Consumer Group 的 Lag 值。原因在于:Kafka 将 Consumer Group Offset 存储在内部 Topic __consumer_offsets,Broker 知道每个分区的 LEO(Log End Offset),但需要计算 LEO - CommittedOffset 才能得到 Lag。
生产环境有两种采集方式:
方案一:kafka-consumer-groups.sh 脚本(简单但性能差)
# 每 30 秒轮询一次,适合小规模集群(< 50 个 Consumer Group)
kafka-consumer-groups.sh \
--bootstrap-server kafka:9092 \
--describe \
--group my-consumer-group
方案二:kafka_exporter(推荐,生产级)
danielqsj/kafka_exporter 是专为 Prometheus 设计的独立采集器,通过 Kafka Admin API 拉取 Consumer Group Offset 并计算 Lag:
docker run -d \
--name kafka-exporter \
-p 9308:9308 \
danielqsj/kafka_exporter \
--kafka.server=kafka-broker-1:9092 \
--kafka.server=kafka-broker-2:9092 \
--kafka.server=kafka-broker-3:9092 \
--group.filter=".*" \
--topic.filter="^(?!__).*" \
--sasl.enabled \
--sasl.username=monitor \
--sasl.password=secret \
--sasl.mechanism=PLAIN
kafka_exporter 暴露的关键指标:
kafka_consumer_group_lag:每个 (group, topic, partition) 三元组的 Lag 值kafka_consumer_group_lag_sum:Consumer Group 对某个 Topic 的总 Lagkafka_consumer_group_current_offset:Consumer Group 当前提交的 Offset
Prometheus scrape 配置:
# prometheus.yml
scrape_configs:
- job_name: 'kafka-brokers'
static_configs:
- targets:
- 'kafka-broker-1:7071'
- 'kafka-broker-2:7071'
- 'kafka-broker-3:7071'
scrape_interval: 15s
scrape_timeout: 10s
- job_name: 'kafka-exporter'
static_configs:
- targets: ['kafka-exporter:9308']
scrape_interval: 30s
Grafana Dashboard:可视化完整方案
Dashboard JSON 模板核心片段
以下是生产级 Grafana Dashboard 的关键 Panel 配置。完整 Dashboard 包含约 20 个 Panel,分为三个 Row:集群健康、吞吐量趋势、消费者状态。
{
"title": "Kafka Cluster Overview",
"uid": "kafka-cluster-overview-v2",
"tags": ["kafka", "platform"],
"refresh": "30s",
"time": {"from": "now-3h", "to": "now"},
"panels": [
{
"id": 1,
"title": "UnderReplicated Partitions",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
"targets": [
{
"expr": "sum(kafka_server_replicamanager_underreplicatedpartitions)",
"legendFormat": "URP Count"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"mode": "absolute",
"steps": [
{"color": "green", "value": null},
{"color": "red", "value": 1}
]
},
"color": {"mode": "thresholds"},
"mappings": [{"type": "value", "options": {"0": {"text": "HEALTHY"}}}]
}
}
},
{
"id": 2,
"title": "Active Controller Count",
"type": "stat",
"gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
"targets": [
{
"expr": "sum(kafka_controller_kafkacontroller_activecontrollercount)",
"legendFormat": "Controllers"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{"color": "red", "value": null},
{"color": "green", "value": 1},
{"color": "red", "value": 2}
]
}
}
}
},
{
"id": 3,
"title": "Broker Request Handler Idle %",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
"targets": [
{
"expr": "kafka_server_requesthandler_avgidlepercent",
"legendFormat": "{{ instance }}"
}
],
"fieldConfig": {
"defaults": {
"unit": "percentunit",
"min": 0,
"max": 1,
"thresholds": {
"steps": [
{"color": "red", "value": 0},
{"color": "yellow", "value": 0.3},
{"color": "green", "value": 0.7}
]
}
}
}
},
{
"id": 4,
"title": "Consumer Group Lag",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 4},
"targets": [
{
"expr": "kafka_consumer_group_lag_sum",
"legendFormat": "{{ consumergroup }}/{{ topic }}"
}
]
},
{
"id": 5,
"title": "Bytes In/Out Per Broker",
"type": "timeseries",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 12},
"targets": [
{
"expr": "rate(kafka_server_brokertopicmetrics_BytesInPerSec[5m])",
"legendFormat": "In - {{ instance }}"
},
{
"expr": "rate(kafka_server_brokertopicmetrics_BytesOutPerSec[5m]) * -1",
"legendFormat": "Out - {{ instance }}"
}
],
"fieldConfig": {"defaults": {"unit": "Bps"}}
},
{
"id": 6,
"title": "ISR Shrink/Expand Rate",
"type": "timeseries",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 20},
"targets": [
{
"expr": "rate(kafka_server_replicamanager_isrshrinks_total[5m])",
"legendFormat": "ISR Shrink - {{ instance }}"
},
{
"expr": "rate(kafka_server_replicamanager_isrexpands_total[5m])",
"legendFormat": "ISR Expand - {{ instance }}"
}
]
}
]
}
Kafka 管理 UI 工具对比
| 特性 | Kafka Manager (CMAK) | AKHQ | Redpanda Console |
|---|---|---|---|
| 维护状态 | 停止维护(Yahoo) | 活跃 | 活跃(Redpanda 公司) |
| Kafka 兼容性 | 到 Kafka 2.x | Kafka 2.x/3.x | Kafka 2.x/3.x |
| 认证支持 | 基础 | SASL/SSL 完整 | SASL/SSL 完整 |
| Consumer Lag UI | 有 | 有,图形化 | 有,实时刷新 |
| Schema Registry | 无 | 有 | 有 |
| 消息搜索 | 无 | 有,支持过滤 | 有,支持 Protobuf |
| 部署方式 | JAR | Docker/Helm | Docker/Helm |
| License | Apache 2.0 | Apache 2.0 | Redpanda BSL |
生产建议:AKHQ 是当前开源生态中功能最完整的选择,支持多集群管理和 RBAC。Redpanda Console 界面更现代,但 License 需要注意(BSL 在一定规模后需商业授权)。CMAK 已停止维护,不建议新项目使用。
快速部署 AKHQ:
# docker-compose.yml
services:
akhq:
image: tchiotludo/akhq:0.24.0
ports:
- "8080:8080"
environment:
AKHQ_CONFIGURATION: |
akhq:
connections:
production:
properties:
bootstrap.servers: "kafka-broker-1:9092,kafka-broker-2:9092"
security.protocol: SASL_SSL
sasl.mechanism: SCRAM-SHA-256
sasl.jaas.config: |
org.apache.kafka.common.security.scram.ScramLoginModule required
username="admin" password="admin-secret";
schema-registry:
url: "http://schema-registry:8081"
小结:可观测性的优先级原则
构建 Kafka 监控体系的核心原则是分层递进:
- 第一层(P0 告警,15 分钟内响应):UnderReplicatedPartitions、ActiveControllerCount——直接关联数据安全和集群可用性
- 第二层(P1 告警,1 小时内响应):Consumer Lag 增长、RequestHandlerAvgIdlePercent——关联业务实时性和处理能力
- 第三层(P2 告警,工作时间响应):磁盘使用率、ISR 变动频率——容量规划和稳定性预警
- 第四层(Dashboard 观察,无告警):吞吐量趋势、端到端延迟分布——性能基线和容量规划
这个分层体系确保真正的生产事故能在第一时间被发现和响应,同时避免告警疲劳(alert fatigue)将团队拖入"狼来了"的困境。
Level 2 · 它是怎么运行的(3-5年经验)
端到端延迟追踪:ProducerInterceptor + ConsumerInterceptor
JMX 指标反映的是 Broker 端视角的延迟。但业务真正关心的是端到端延迟:从 Producer 调用 send() 到 Consumer 收到消息的完整时间。这需要通过 Kafka 的 Interceptor 机制来实现。
ProducerInterceptor:注入生产时间戳
public class E2ELatencyProducerInterceptor<K, V>
implements ProducerInterceptor<K, V> {
private static final String PRODUCE_TIMESTAMP_HEADER = "x-produce-timestamp";
@Override
public ProducerRecord<K, V> onSend(ProducerRecord<K, V> record) {
// 在消息 Header 中注入生产时间戳(毫秒)
long produceTime = System.currentTimeMillis();
record.headers().add(
PRODUCE_TIMESTAMP_HEADER,
Long.toString(produceTime).getBytes(StandardCharsets.UTF_8)
);
return record;
}
@Override
public void onAcknowledgement(RecordMetadata metadata, Exception exception) {
// 记录 Broker ACK 延迟(Producer 视角)
// 此处可以打点到 Prometheus Pushgateway
}
@Override public void close() {}
@Override public void configure(Map<String, ?> configs) {}
}
ConsumerInterceptor:计算端到端延迟
public class E2ELatencyConsumerInterceptor<K, V>
implements ConsumerInterceptor<K, V> {
private static final String PRODUCE_TIMESTAMP_HEADER = "x-produce-timestamp";
private static final Histogram E2E_LATENCY = Histogram.build()
.name("kafka_e2e_latency_milliseconds")
.help("Kafka 端到端延迟(毫秒)")
.labelNames("topic", "consumer_group")
.buckets(1, 5, 10, 50, 100, 500, 1000, 5000)
.register();
@Override
public ConsumerRecords<K, V> onConsume(ConsumerRecords<K, V> records) {
long consumeTime = System.currentTimeMillis();
for (ConsumerRecord<K, V> record : records) {
Header tsHeader = record.headers().lastHeader(PRODUCE_TIMESTAMP_HEADER);
if (tsHeader != null) {
long produceTime = Long.parseLong(
new String(tsHeader.value(), StandardCharsets.UTF_8)
);
long latencyMs = consumeTime - produceTime;
E2E_LATENCY
.labels(record.topic(), getConsumerGroupId())
.observe(latencyMs);
}
}
return records;
}
@Override
public void onCommit(Map<TopicPartition, OffsetAndMetadata> offsets) {}
@Override public void close() {}
@Override public void configure(Map<String, ?> configs) {}
}
Interceptor 注册(Producer/Consumer 配置):
// Producer
props.put(ProducerConfig.INTERCEPTOR_CLASSES_CONFIG,
"com.example.E2ELatencyProducerInterceptor");
// Consumer
props.put(ConsumerConfig.INTERCEPTOR_CLASSES_CONFIG,
"com.example.E2ELatencyConsumerInterceptor");
这套机制的优势:无侵入性,业务代码无需改动;Header 会随消息一起复制到所有副本,跨集群复制(MirrorMaker 2)后依然有效;Prometheus Histogram 提供 P50/P95/P99 分位数查询。
Broker 日志:Log4j 关键事件识别
Kafka Broker 日志(默认位于 logs/server.log)记录了集群生命周期中的关键事件。以下是生产环境中需要重点关注的日志模式:
ISR 变动日志
# ISR 收缩(副本落后被踢出 ISR)
[2024-01-15 10:23:45,123] INFO [Partition orders-0 broker=1]
ISR updated to [1, 2] (was [1, 2, 3]) due to follower 3
not making progress for 30009 ms (kafka.cluster.Partition)
# ISR 扩展(副本追上重新加入 ISR)
[2024-01-15 10:25:12,456] INFO [Partition orders-0 broker=1]
ISR updated to [1, 2, 3] (was [1, 2]) (kafka.cluster.Partition)
频繁出现 ISR 变动日志,立即检查对应 Follower Broker 的磁盘 I/O 和网络延迟。
Controller 选举日志
# Controller 选举成功
[2024-01-15 10:30:00,789] INFO [KafkaController id=2]
2 successfully elected as the controller.
Epoch incremented to 5 (kafka.controller.KafkaController)
# Broker 加入/离开集群
[2024-01-15 10:30:01,234] INFO [KafkaController id=2]
Broker 3 has registered, need to process its partition assignments
Controller 频繁重新选举(每小时超过 2-3 次)是严重的集群不稳定信号,通常由 ZooKeeper/KRaft 网络问题或 Broker JVM GC 停顿引起。
认证失败日志
# SASL 认证失败
[2024-01-15 10:35:00,100] INFO [SocketServer listenerName=SASL_PLAINTEXT
securityProtocol=SASL_PLAINTEXT] Failed authentication with /10.0.0.5
(SSL handshake failed) (org.apache.kafka.common.network.Selector)
# ACL 授权拒绝
[2024-01-15 10:35:01,200] INFO Principal = User:service-account is Denied
Operation = Write from host = 10.0.0.5 on resource = Topic:LITERAL:orders
(kafka.authorizer.logger)
认证/授权失败日志集中爆发,立即检查证书是否过期或 ACL 配置是否变更。生产环境建议配置 log4j.logger.kafka.authorizer.logger=INFO,authorizerAppender,将授权日志单独输出到审计文件。
磁盘空间告警
# Prometheus 磁盘告警(基于 node_exporter)
- alert: KafkaBrokerDiskUsageHigh
expr: |
(1 - node_filesystem_avail_bytes{mountpoint="/data/kafka"}
/ node_filesystem_size_bytes{mountpoint="/data/kafka"}) > 0.80
for: 5m
labels:
severity: warning
annotations:
summary: "Kafka Broker 磁盘使用率超过 80%"
description: "{{ $labels.instance }} 上 /data/kafka 磁盘使用率为 {{ $value | humanizePercentage }}。检查 log.retention.bytes 和 log.cleaner 状态。"
Level 3 · 规范怎么定义的(资深)
本章的协议与规范细节已融入上述 Level 中。如需深入了解,请参阅 Kafka 官方协议文档和对应 KIP 提案。
Level 4 · 边界与陷阱(所有人)
为什么 Kafka 监控比一般分布式系统更难
Kafka 的监控难点不在于指标数量少,恰恰相反——一个中等规模的 Kafka 集群能暴露超过 5000 个 JMX MBean,每个 Broker、每个 Topic、每个 Partition 都有独立的指标维度。问题在于:绝大多数指标对生产稳定性毫无影响,而那几个真正关键的指标,却容易被淹没在噪声之中。
本章的核心命题是:Kafka 监控不是"采集所有指标",而是"用最少的指标感知最大的风险"。我们将围绕五个黄金信号构建完整的可观测性栈,从 JMX 到 Prometheus,从 Grafana 到 PagerDuty,覆盖指标采集、可视化和告警的全链路。
五个黄金信号详解
信号一:UnderReplicatedPartitions(副本不足分区数)
含义:当前副本数量低于 replication.factor 配置值的分区数。正常状态应为 0。任何非零值都意味着数据安全风险——如果此时 Leader 宕机,可能发生数据丢失或服务不可用。
触发原因(按常见程度排序):
- Follower Broker 磁盘 I/O 过载,Fetch 请求延迟超过
replica.lag.time.max.ms(默认 30000ms) - Broker 宕机,副本暂时不在线
- 网络隔离,Follower 无法连接到 Leader
- JVM GC STW 暂停超过
replica.lag.time.max.ms
查询命令:
# 查看具体哪些分区副本不足
kafka-topics.sh \
--bootstrap-server kafka:9092 \
--describe \
--under-replicated-partitions
# 查看副本分布
kafka-topics.sh \
--bootstrap-server kafka:9092 \
--describe \
--topic my-topic
Prometheus Alert 规则:
- alert: KafkaUnderReplicatedPartitions
expr: kafka_server_replicamanager_underreplicatedpartitions > 0
for: 1m
labels:
severity: critical
team: platform
annotations:
summary: "Kafka 副本不足(Broker {{ $labels.instance }})"
description: "Broker {{ $labels.instance }} 有 {{ $value }} 个分区副本不足,持续超过 1 分钟。立即检查磁盘 I/O 和 Broker 健康状态。"
runbook: "https://wiki.internal/kafka/runbook-under-replicated"
信号二:ISRShrinkRate 与 ISRExpandRate(ISR 变动速率)
ISR(In-Sync Replicas)的频繁收缩与扩展,是集群不稳定的早期预警信号,往往比 UnderReplicatedPartitions 提前几分钟出现。
健康状态:Shrink 和 Expand 速率均接近 0(每小时偶发 1-2 次属正常)。频繁变动(每分钟数次)说明副本追赶速度与生产速度之间存在持续张力。
Prometheus 查询(计算每分钟 ISR 变动总次数):
# ISR 收缩速率(1分钟内总变动次数)
sum(rate(kafka_server_replicamanager_isrshrinks_total[5m])) * 60
# ISR 扩展速率
sum(rate(kafka_server_replicamanager_isrexpands_total[5m])) * 60
告警规则:
- alert: KafkaISRHighChurnRate
expr: |
(
sum(rate(kafka_server_replicamanager_isrshrinks_total[5m]))
+
sum(rate(kafka_server_replicamanager_isrexpands_total[5m]))
) * 60 > 5
for: 3m
labels:
severity: warning
annotations:
summary: "Kafka ISR 变动频繁(每分钟 > 5 次)"
description: "ISR 频繁变动通常预示磁盘 I/O 过载或网络抖动,请提前介入。"
信号三:ActiveControllerCount(活跃 Controller 数)
整个 Kafka 集群同一时刻必须有且仅有一个 Controller。Controller 负责:分区 Leader 选举、Broker 上下线处理、Topic 创建删除。
- 值为 0:无 Controller,集群完全不可用(生产者和消费者都会报错)
- 值为 1:正常
- 值 > 1:脑裂(Split Brain),极其危险,通常发生在 ZooKeeper 网络分区后
- alert: KafkaNoActiveController
expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
for: 30s
labels:
severity: critical
annotations:
summary: "Kafka 无活跃 Controller 或存在多个 Controller"
description: "当前活跃 Controller 数量为 {{ $value }},正常值为 1。集群可能不可用。"
查询哪个 Broker 是当前 Controller:
# ZooKeeper 模式
zookeeper-shell.sh zk:2181 get /controller
# KRaft 模式
kafka-metadata-quorum.sh \
--bootstrap-server kafka:9092 \
describe --status
信号四:RequestHandlerAvgIdlePercent(请求处理线程空闲率)
这是 Broker 端 I/O 处理能力最直接的度量指标。Kafka 的请求处理架构由两层线程池组成:
- Network Threads(
num.network.threads):负责网络 I/O,接收请求、发送响应 - I/O Threads / Request Handlers(
num.io.threads):负责实际处理请求(读写磁盘、与副本通信)
RequestHandlerAvgIdlePercent 反映的是 I/O Threads 的空闲时间比例:
- > 0.7:充裕,处理能力有余量
- 0.3 ~ 0.7:正常负载
- 0.1 ~ 0.3:高负载,需要关注
- < 0.1:严重瓶颈,请求排队,延迟飙升
# 查看每个 Broker 的请求处理线程空闲率
kafka_server_requesthandler_avgidlepercent
- alert: KafkaBrokerIOBottleneck
expr: kafka_server_requesthandler_avgidlepercent < 0.2
for: 5m
labels:
severity: warning
annotations:
summary: "Kafka Broker I/O 处理线程过忙({{ $labels.instance }})"
description: "请求处理线程空闲率 {{ $value | humanizePercentage }},低于 20%。增加 num.io.threads 或减少每 Broker 分区数。"
信号五:Consumer Lag(消费者积压)
Consumer Lag 是业务影响最直接的指标,直接关联到数据处理的实时性。不同业务场景对 Lag 的容忍度差异巨大:支付流水处理的 Lag 容忍度是秒级,日志聚合的 Lag 容忍度可以是分钟级。
关键的判断不是 Lag 的绝对值,而是 Lag 的变化趋势:
# Lag 是否在增长?(5分钟内 Lag 增量 > 0 说明消费速度跟不上生产速度)
delta(kafka_consumer_group_lag_sum{group="payment-consumer"}[5m]) > 0
Prometheus 告警:
- alert: KafkaConsumerLagGrowing
expr: |
delta(kafka_consumer_group_lag_sum[5m]) > 1000
and
kafka_consumer_group_lag_sum > 5000
for: 5m
labels:
severity: warning
annotations:
summary: "Kafka 消费积压持续增长(Group: {{ $labels.consumergroup }})"
description: "Consumer Group {{ $labels.consumergroup }} 在 Topic {{ $labels.topic }} 上的 Lag 为 {{ $value }},且持续增长超过 5 分钟。"
完整告警规则汇总
# kafka-alerts.yml
groups:
- name: kafka.critical
rules:
- alert: KafkaUnderReplicatedPartitions
expr: kafka_server_replicamanager_underreplicatedpartitions > 0
for: 1m
labels:
severity: critical
page: "true"
annotations:
summary: "P0: Kafka 副本不足"
- alert: KafkaNoActiveController
expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
for: 30s
labels:
severity: critical
page: "true"
annotations:
summary: "P0: Kafka Controller 异常"
- name: kafka.warning
rules:
- alert: KafkaConsumerLagGrowing
expr: delta(kafka_consumer_group_lag_sum[5m]) > 1000 and kafka_consumer_group_lag_sum > 5000
for: 5m
labels:
severity: warning
annotations:
summary: "P1: Kafka 消费积压持续增长"
- alert: KafkaBrokerIOBottleneck
expr: kafka_server_requesthandler_avgidlepercent < 0.2
for: 5m
labels:
severity: warning
annotations:
summary: "P1: Kafka Broker I/O 瓶颈"
- alert: KafkaBrokerDiskUsageHigh
expr: (1 - node_filesystem_avail_bytes{mountpoint=~"/data/kafka.*"} / node_filesystem_size_bytes{mountpoint=~"/data/kafka.*"}) > 0.80
for: 5m
labels:
severity: warning
annotations:
summary: "P2: Kafka Broker 磁盘使用率过高"
- alert: KafkaISRHighChurnRate
expr: (sum(rate(kafka_server_replicamanager_isrshrinks_total[5m])) + sum(rate(kafka_server_replicamanager_isrexpands_total[5m]))) * 60 > 5
for: 3m
labels:
severity: warning
annotations:
summary: "P2: Kafka ISR 变动异常频繁"