第 28 章

监控体系:从指标到告警的完整方案

第28章:监控体系:从指标到告警的完整方案

导读:如何构建完整的 Kafka 监控体系?

本章核心问题:如何构建完整的 Kafka 监控体系?

读完本章你将理解


Level 1 · 你需要知道的(1-3年经验)

JMX Exporter:将 Kafka 指标接入 Prometheus

为什么选择 JMX Exporter 而非 Confluent Metrics Reporter

Kafka 原生通过 JMX(Java Management Extensions)暴露指标。Prometheus 生态下,有两种主流方案将 JMX 指标转换为 Prometheus 格式:

开源环境下,JMX Exporter 是标准选择。在 kafka-server-start.sh 中注入 Agent:

export KAFKA_OPTS="-javaagent:/opt/kafka/jmx_prometheus_javaagent-0.20.0.jar=7071:/opt/kafka/kafka-jmx-exporter.yml"

端口 7071 将成为 Prometheus 的抓取目标(scrape target)。

kafka-jmx-exporter.yml 完整配置

以下是经过生产验证的 JMX Exporter 配置文件,聚焦五个黄金信号相关 MBean,同时保留必要的请求性能指标:

# kafka-jmx-exporter.yml
# 适用于 Apache Kafka 3.7+,基于 JMX Exporter 0.20+
startDelaySeconds: 30
lowercaseOutputName: true
lowercaseOutputLabelNames: true

rules:
  # ============================================================
  # 副本与 ISR 健康
  # ============================================================
  - pattern: 'kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value'
    name: kafka_server_replicamanager_underreplicatedpartitions
    help: "副本不足的分区数,> 0 立即告警"
    type: GAUGE

  - pattern: 'kafka.server<type=ReplicaManager, name=ISRShrinksPerSec><>(\w+)'
    name: kafka_server_replicamanager_isrshrinks_total
    help: "ISR 收缩速率(每秒)"
    type: COUNTER

  - pattern: 'kafka.server<type=ReplicaManager, name=ISRExpandsPerSec><>(\w+)'
    name: kafka_server_replicamanager_isrexpands_total
    help: "ISR 扩展速率(每秒)"
    type: COUNTER

  # ============================================================
  # Controller 健康
  # ============================================================
  - pattern: 'kafka.controller<type=KafkaController, name=ActiveControllerCount><>Value'
    name: kafka_controller_kafkacontroller_activecontrollercount
    help: "活跃 Controller 数量,必须精确为 1"
    type: GAUGE

  - pattern: 'kafka.controller<type=ControllerStats, name=LeaderElectionRateAndTimeMs><>(\w+)'
    name: kafka_controller_stats_leaderelection
    help: "Leader 选举速率与延迟"
    type: SUMMARY

  # ============================================================
  # 请求处理线程空闲率(I/O 瓶颈核心指标)
  # ============================================================
  - pattern: 'kafka.server<type=KafkaRequestHandlerPool, name=RequestHandlerAvgIdlePercent><>(\w+)'
    name: kafka_server_requesthandler_avgidlepercent
    help: "请求处理线程平均空闲百分比,< 0.3 表示 I/O 瓶颈"
    type: GAUGE

  # ============================================================
  # 网络请求性能(Produce/Fetch 延迟)
  # ============================================================
  - pattern: 'kafka.network<type=RequestMetrics, name=TotalTimeMs, request=(\w+)><>(\d+)thPercentile'
    name: kafka_network_requestmetrics_totaltimems
    help: "请求总延迟(毫秒),按请求类型和百分位分组"
    labels:
      request: "$1"
      quantile: "$2"
    type: GAUGE

  - pattern: 'kafka.network<type=RequestMetrics, name=RequestsPerSec, request=(\w+), version=(\d+)><>(\w+)'
    name: kafka_network_requestmetrics_requestspersec
    help: "每秒请求数,按请求类型分组"
    labels:
      request: "$1"
      version: "$2"
    type: COUNTER

  # ============================================================
  # Broker I/O 吞吐量
  # ============================================================
  - pattern: 'kafka.server<type=BrokerTopicMetrics, name=(BytesInPerSec|BytesOutPerSec|MessagesInPerSec)><>(\w+)'
    name: kafka_server_brokertopicmetrics_$1
    help: "Broker 级别吞吐量"
    type: COUNTER

  - pattern: 'kafka.server<type=BrokerTopicMetrics, name=(BytesInPerSec|BytesOutPerSec), topic=(\S+)><>(\w+)'
    name: kafka_server_brokertopicmetrics_topic_$1
    help: "Topic 级别吞吐量"
    labels:
      topic: "$2"
    type: COUNTER

  # ============================================================
  # 磁盘使用(Log Size)
  # ============================================================
  - pattern: 'kafka.log<type=LogManager, name=OfflineLogDirectoryCount><>Value'
    name: kafka_log_logmanager_offlinelogdirectorycount
    help: "离线日志目录数"
    type: GAUGE

  - pattern: 'kafka.log<type=Log, name=Size, topic=(\S+), partition=(\d+)><>Value'
    name: kafka_log_size_bytes
    help: "分区日志大小(字节)"
    labels:
      topic: "$1"
      partition: "$2"
    type: GAUGE

  # ============================================================
  # Consumer Lag(通过 kafka_consumer_group_lag 外部采集,见下文)
  # ============================================================

  # ============================================================
  # JVM GC 与内存(通用 Java 指标)
  # ============================================================
  - pattern: 'java.lang<type=GarbageCollector, name=(\S+)><>(\w+)'
    name: jvm_gc_$2
    labels:
      gc: "$1"
    type: UNTYPED

  - pattern: 'java.lang<type=Memory><HeapMemoryUsage>(\w+)'
    name: jvm_memory_heap_$1
    type: GAUGE

Consumer Lag:为什么不能只靠 JMX

Consumer Lag(消费者积压)是业务最关心的指标,但 Kafka Broker 的 JMX 并不直接提供 Consumer Group 的 Lag 值。原因在于:Kafka 将 Consumer Group Offset 存储在内部 Topic __consumer_offsets,Broker 知道每个分区的 LEO(Log End Offset),但需要计算 LEO - CommittedOffset 才能得到 Lag。

生产环境有两种采集方式:

方案一:kafka-consumer-groups.sh 脚本(简单但性能差)

# 每 30 秒轮询一次,适合小规模集群(< 50 个 Consumer Group)
kafka-consumer-groups.sh \
  --bootstrap-server kafka:9092 \
  --describe \
  --group my-consumer-group

方案二:kafka_exporter(推荐,生产级)

danielqsj/kafka_exporter 是专为 Prometheus 设计的独立采集器,通过 Kafka Admin API 拉取 Consumer Group Offset 并计算 Lag:

docker run -d \
  --name kafka-exporter \
  -p 9308:9308 \
  danielqsj/kafka_exporter \
  --kafka.server=kafka-broker-1:9092 \
  --kafka.server=kafka-broker-2:9092 \
  --kafka.server=kafka-broker-3:9092 \
  --group.filter=".*" \
  --topic.filter="^(?!__).*" \
  --sasl.enabled \
  --sasl.username=monitor \
  --sasl.password=secret \
  --sasl.mechanism=PLAIN

kafka_exporter 暴露的关键指标:

Prometheus scrape 配置:

# prometheus.yml
scrape_configs:
  - job_name: 'kafka-brokers'
    static_configs:
      - targets:
          - 'kafka-broker-1:7071'
          - 'kafka-broker-2:7071'
          - 'kafka-broker-3:7071'
    scrape_interval: 15s
    scrape_timeout: 10s

  - job_name: 'kafka-exporter'
    static_configs:
      - targets: ['kafka-exporter:9308']
    scrape_interval: 30s

Grafana Dashboard:可视化完整方案

Dashboard JSON 模板核心片段

以下是生产级 Grafana Dashboard 的关键 Panel 配置。完整 Dashboard 包含约 20 个 Panel,分为三个 Row:集群健康、吞吐量趋势、消费者状态。

{
  "title": "Kafka Cluster Overview",
  "uid": "kafka-cluster-overview-v2",
  "tags": ["kafka", "platform"],
  "refresh": "30s",
  "time": {"from": "now-3h", "to": "now"},
  "panels": [
    {
      "id": 1,
      "title": "UnderReplicated Partitions",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 0, "y": 0},
      "targets": [
        {
          "expr": "sum(kafka_server_replicamanager_underreplicatedpartitions)",
          "legendFormat": "URP Count"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"color": "green", "value": null},
              {"color": "red", "value": 1}
            ]
          },
          "color": {"mode": "thresholds"},
          "mappings": [{"type": "value", "options": {"0": {"text": "HEALTHY"}}}]
        }
      }
    },
    {
      "id": 2,
      "title": "Active Controller Count",
      "type": "stat",
      "gridPos": {"h": 4, "w": 4, "x": 4, "y": 0},
      "targets": [
        {
          "expr": "sum(kafka_controller_kafkacontroller_activecontrollercount)",
          "legendFormat": "Controllers"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "steps": [
              {"color": "red", "value": null},
              {"color": "green", "value": 1},
              {"color": "red", "value": 2}
            ]
          }
        }
      }
    },
    {
      "id": 3,
      "title": "Broker Request Handler Idle %",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 4},
      "targets": [
        {
          "expr": "kafka_server_requesthandler_avgidlepercent",
          "legendFormat": "{{ instance }}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percentunit",
          "min": 0,
          "max": 1,
          "thresholds": {
            "steps": [
              {"color": "red", "value": 0},
              {"color": "yellow", "value": 0.3},
              {"color": "green", "value": 0.7}
            ]
          }
        }
      }
    },
    {
      "id": 4,
      "title": "Consumer Group Lag",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 4},
      "targets": [
        {
          "expr": "kafka_consumer_group_lag_sum",
          "legendFormat": "{{ consumergroup }}/{{ topic }}"
        }
      ]
    },
    {
      "id": 5,
      "title": "Bytes In/Out Per Broker",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 24, "x": 0, "y": 12},
      "targets": [
        {
          "expr": "rate(kafka_server_brokertopicmetrics_BytesInPerSec[5m])",
          "legendFormat": "In - {{ instance }}"
        },
        {
          "expr": "rate(kafka_server_brokertopicmetrics_BytesOutPerSec[5m]) * -1",
          "legendFormat": "Out - {{ instance }}"
        }
      ],
      "fieldConfig": {"defaults": {"unit": "Bps"}}
    },
    {
      "id": 6,
      "title": "ISR Shrink/Expand Rate",
      "type": "timeseries",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 20},
      "targets": [
        {
          "expr": "rate(kafka_server_replicamanager_isrshrinks_total[5m])",
          "legendFormat": "ISR Shrink - {{ instance }}"
        },
        {
          "expr": "rate(kafka_server_replicamanager_isrexpands_total[5m])",
          "legendFormat": "ISR Expand - {{ instance }}"
        }
      ]
    }
  ]
}

Kafka 管理 UI 工具对比

特性 Kafka Manager (CMAK) AKHQ Redpanda Console
维护状态 停止维护(Yahoo) 活跃 活跃(Redpanda 公司)
Kafka 兼容性 到 Kafka 2.x Kafka 2.x/3.x Kafka 2.x/3.x
认证支持 基础 SASL/SSL 完整 SASL/SSL 完整
Consumer Lag UI 有,图形化 有,实时刷新
Schema Registry
消息搜索 有,支持过滤 有,支持 Protobuf
部署方式 JAR Docker/Helm Docker/Helm
License Apache 2.0 Apache 2.0 Redpanda BSL

生产建议:AKHQ 是当前开源生态中功能最完整的选择,支持多集群管理和 RBAC。Redpanda Console 界面更现代,但 License 需要注意(BSL 在一定规模后需商业授权)。CMAK 已停止维护,不建议新项目使用。

快速部署 AKHQ:

# docker-compose.yml
services:
  akhq:
    image: tchiotludo/akhq:0.24.0
    ports:
      - "8080:8080"
    environment:
      AKHQ_CONFIGURATION: |
        akhq:
          connections:
            production:
              properties:
                bootstrap.servers: "kafka-broker-1:9092,kafka-broker-2:9092"
                security.protocol: SASL_SSL
                sasl.mechanism: SCRAM-SHA-256
                sasl.jaas.config: |
                  org.apache.kafka.common.security.scram.ScramLoginModule required
                  username="admin" password="admin-secret";
              schema-registry:
                url: "http://schema-registry:8081"

小结:可观测性的优先级原则

构建 Kafka 监控体系的核心原则是分层递进

  1. 第一层(P0 告警,15 分钟内响应):UnderReplicatedPartitions、ActiveControllerCount——直接关联数据安全和集群可用性
  2. 第二层(P1 告警,1 小时内响应):Consumer Lag 增长、RequestHandlerAvgIdlePercent——关联业务实时性和处理能力
  3. 第三层(P2 告警,工作时间响应):磁盘使用率、ISR 变动频率——容量规划和稳定性预警
  4. 第四层(Dashboard 观察,无告警):吞吐量趋势、端到端延迟分布——性能基线和容量规划

这个分层体系确保真正的生产事故能在第一时间被发现和响应,同时避免告警疲劳(alert fatigue)将团队拖入"狼来了"的困境。


Level 2 · 它是怎么运行的(3-5年经验)

端到端延迟追踪:ProducerInterceptor + ConsumerInterceptor

JMX 指标反映的是 Broker 端视角的延迟。但业务真正关心的是端到端延迟:从 Producer 调用 send() 到 Consumer 收到消息的完整时间。这需要通过 Kafka 的 Interceptor 机制来实现。

ProducerInterceptor:注入生产时间戳

public class E2ELatencyProducerInterceptor<K, V>
    implements ProducerInterceptor<K, V> {

    private static final String PRODUCE_TIMESTAMP_HEADER = "x-produce-timestamp";

    @Override
    public ProducerRecord<K, V> onSend(ProducerRecord<K, V> record) {
        // 在消息 Header 中注入生产时间戳(毫秒)
        long produceTime = System.currentTimeMillis();
        record.headers().add(
            PRODUCE_TIMESTAMP_HEADER,
            Long.toString(produceTime).getBytes(StandardCharsets.UTF_8)
        );
        return record;
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {
        // 记录 Broker ACK 延迟(Producer 视角)
        // 此处可以打点到 Prometheus Pushgateway
    }

    @Override public void close() {}
    @Override public void configure(Map<String, ?> configs) {}
}

ConsumerInterceptor:计算端到端延迟

public class E2ELatencyConsumerInterceptor<K, V>
    implements ConsumerInterceptor<K, V> {

    private static final String PRODUCE_TIMESTAMP_HEADER = "x-produce-timestamp";
    private static final Histogram E2E_LATENCY = Histogram.build()
        .name("kafka_e2e_latency_milliseconds")
        .help("Kafka 端到端延迟(毫秒)")
        .labelNames("topic", "consumer_group")
        .buckets(1, 5, 10, 50, 100, 500, 1000, 5000)
        .register();

    @Override
    public ConsumerRecords<K, V> onConsume(ConsumerRecords<K, V> records) {
        long consumeTime = System.currentTimeMillis();

        for (ConsumerRecord<K, V> record : records) {
            Header tsHeader = record.headers().lastHeader(PRODUCE_TIMESTAMP_HEADER);
            if (tsHeader != null) {
                long produceTime = Long.parseLong(
                    new String(tsHeader.value(), StandardCharsets.UTF_8)
                );
                long latencyMs = consumeTime - produceTime;

                E2E_LATENCY
                    .labels(record.topic(), getConsumerGroupId())
                    .observe(latencyMs);
            }
        }
        return records;
    }

    @Override
    public void onCommit(Map<TopicPartition, OffsetAndMetadata> offsets) {}
    @Override public void close() {}
    @Override public void configure(Map<String, ?> configs) {}
}

Interceptor 注册(Producer/Consumer 配置):

// Producer
props.put(ProducerConfig.INTERCEPTOR_CLASSES_CONFIG,
    "com.example.E2ELatencyProducerInterceptor");

// Consumer
props.put(ConsumerConfig.INTERCEPTOR_CLASSES_CONFIG,
    "com.example.E2ELatencyConsumerInterceptor");

这套机制的优势:无侵入性,业务代码无需改动;Header 会随消息一起复制到所有副本,跨集群复制(MirrorMaker 2)后依然有效;Prometheus Histogram 提供 P50/P95/P99 分位数查询。

Broker 日志:Log4j 关键事件识别

Kafka Broker 日志(默认位于 logs/server.log)记录了集群生命周期中的关键事件。以下是生产环境中需要重点关注的日志模式:

ISR 变动日志

# ISR 收缩(副本落后被踢出 ISR)
[2024-01-15 10:23:45,123] INFO [Partition orders-0 broker=1] 
ISR updated to [1, 2] (was [1, 2, 3]) due to follower 3 
not making progress for 30009 ms (kafka.cluster.Partition)

# ISR 扩展(副本追上重新加入 ISR)
[2024-01-15 10:25:12,456] INFO [Partition orders-0 broker=1]
ISR updated to [1, 2, 3] (was [1, 2]) (kafka.cluster.Partition)

频繁出现 ISR 变动日志,立即检查对应 Follower Broker 的磁盘 I/O 和网络延迟。

Controller 选举日志

# Controller 选举成功
[2024-01-15 10:30:00,789] INFO [KafkaController id=2] 
2 successfully elected as the controller. 
Epoch incremented to 5 (kafka.controller.KafkaController)

# Broker 加入/离开集群
[2024-01-15 10:30:01,234] INFO [KafkaController id=2] 
Broker 3 has registered, need to process its partition assignments

Controller 频繁重新选举(每小时超过 2-3 次)是严重的集群不稳定信号,通常由 ZooKeeper/KRaft 网络问题或 Broker JVM GC 停顿引起。

认证失败日志

# SASL 认证失败
[2024-01-15 10:35:00,100] INFO [SocketServer listenerName=SASL_PLAINTEXT 
securityProtocol=SASL_PLAINTEXT] Failed authentication with /10.0.0.5 
(SSL handshake failed) (org.apache.kafka.common.network.Selector)

# ACL 授权拒绝
[2024-01-15 10:35:01,200] INFO Principal = User:service-account is Denied 
Operation = Write from host = 10.0.0.5 on resource = Topic:LITERAL:orders 
(kafka.authorizer.logger)

认证/授权失败日志集中爆发,立即检查证书是否过期或 ACL 配置是否变更。生产环境建议配置 log4j.logger.kafka.authorizer.logger=INFO,authorizerAppender,将授权日志单独输出到审计文件。

磁盘空间告警

# Prometheus 磁盘告警(基于 node_exporter)
- alert: KafkaBrokerDiskUsageHigh
  expr: |
    (1 - node_filesystem_avail_bytes{mountpoint="/data/kafka"}
    / node_filesystem_size_bytes{mountpoint="/data/kafka"}) > 0.80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Kafka Broker 磁盘使用率超过 80%"
    description: "{{ $labels.instance }} 上 /data/kafka 磁盘使用率为 {{ $value | humanizePercentage }}。检查 log.retention.bytes 和 log.cleaner 状态。"

Level 3 · 规范怎么定义的(资深)

本章的协议与规范细节已融入上述 Level 中。如需深入了解,请参阅 Kafka 官方协议文档和对应 KIP 提案。


Level 4 · 边界与陷阱(所有人)

为什么 Kafka 监控比一般分布式系统更难

Kafka 的监控难点不在于指标数量少,恰恰相反——一个中等规模的 Kafka 集群能暴露超过 5000 个 JMX MBean,每个 Broker、每个 Topic、每个 Partition 都有独立的指标维度。问题在于:绝大多数指标对生产稳定性毫无影响,而那几个真正关键的指标,却容易被淹没在噪声之中。

本章的核心命题是:Kafka 监控不是"采集所有指标",而是"用最少的指标感知最大的风险"。我们将围绕五个黄金信号构建完整的可观测性栈,从 JMX 到 Prometheus,从 Grafana 到 PagerDuty,覆盖指标采集、可视化和告警的全链路。

五个黄金信号详解

信号一:UnderReplicatedPartitions(副本不足分区数)

含义:当前副本数量低于 replication.factor 配置值的分区数。正常状态应为 0。任何非零值都意味着数据安全风险——如果此时 Leader 宕机,可能发生数据丢失或服务不可用。

触发原因(按常见程度排序):

  1. Follower Broker 磁盘 I/O 过载,Fetch 请求延迟超过 replica.lag.time.max.ms(默认 30000ms)
  2. Broker 宕机,副本暂时不在线
  3. 网络隔离,Follower 无法连接到 Leader
  4. JVM GC STW 暂停超过 replica.lag.time.max.ms

查询命令

# 查看具体哪些分区副本不足
kafka-topics.sh \
  --bootstrap-server kafka:9092 \
  --describe \
  --under-replicated-partitions

# 查看副本分布
kafka-topics.sh \
  --bootstrap-server kafka:9092 \
  --describe \
  --topic my-topic

Prometheus Alert 规则

- alert: KafkaUnderReplicatedPartitions
  expr: kafka_server_replicamanager_underreplicatedpartitions > 0
  for: 1m
  labels:
    severity: critical
    team: platform
  annotations:
    summary: "Kafka 副本不足(Broker {{ $labels.instance }})"
    description: "Broker {{ $labels.instance }} 有 {{ $value }} 个分区副本不足,持续超过 1 分钟。立即检查磁盘 I/O 和 Broker 健康状态。"
    runbook: "https://wiki.internal/kafka/runbook-under-replicated"

信号二:ISRShrinkRate 与 ISRExpandRate(ISR 变动速率)

ISR(In-Sync Replicas)的频繁收缩与扩展,是集群不稳定的早期预警信号,往往比 UnderReplicatedPartitions 提前几分钟出现。

健康状态:Shrink 和 Expand 速率均接近 0(每小时偶发 1-2 次属正常)。频繁变动(每分钟数次)说明副本追赶速度与生产速度之间存在持续张力。

Prometheus 查询(计算每分钟 ISR 变动总次数):

# ISR 收缩速率(1分钟内总变动次数)
sum(rate(kafka_server_replicamanager_isrshrinks_total[5m])) * 60

# ISR 扩展速率
sum(rate(kafka_server_replicamanager_isrexpands_total[5m])) * 60

告警规则

- alert: KafkaISRHighChurnRate
  expr: |
    (
      sum(rate(kafka_server_replicamanager_isrshrinks_total[5m]))
      +
      sum(rate(kafka_server_replicamanager_isrexpands_total[5m]))
    ) * 60 > 5
  for: 3m
  labels:
    severity: warning
  annotations:
    summary: "Kafka ISR 变动频繁(每分钟 > 5 次)"
    description: "ISR 频繁变动通常预示磁盘 I/O 过载或网络抖动,请提前介入。"

信号三:ActiveControllerCount(活跃 Controller 数)

整个 Kafka 集群同一时刻必须有且仅有一个 Controller。Controller 负责:分区 Leader 选举、Broker 上下线处理、Topic 创建删除。

- alert: KafkaNoActiveController
  expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
  for: 30s
  labels:
    severity: critical
  annotations:
    summary: "Kafka 无活跃 Controller 或存在多个 Controller"
    description: "当前活跃 Controller 数量为 {{ $value }},正常值为 1。集群可能不可用。"

查询哪个 Broker 是当前 Controller:

# ZooKeeper 模式
zookeeper-shell.sh zk:2181 get /controller

# KRaft 模式
kafka-metadata-quorum.sh \
  --bootstrap-server kafka:9092 \
  describe --status

信号四:RequestHandlerAvgIdlePercent(请求处理线程空闲率)

这是 Broker 端 I/O 处理能力最直接的度量指标。Kafka 的请求处理架构由两层线程池组成:

RequestHandlerAvgIdlePercent 反映的是 I/O Threads 的空闲时间比例:

# 查看每个 Broker 的请求处理线程空闲率
kafka_server_requesthandler_avgidlepercent
- alert: KafkaBrokerIOBottleneck
  expr: kafka_server_requesthandler_avgidlepercent < 0.2
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Kafka Broker I/O 处理线程过忙({{ $labels.instance }})"
    description: "请求处理线程空闲率 {{ $value | humanizePercentage }},低于 20%。增加 num.io.threads 或减少每 Broker 分区数。"

信号五:Consumer Lag(消费者积压)

Consumer Lag 是业务影响最直接的指标,直接关联到数据处理的实时性。不同业务场景对 Lag 的容忍度差异巨大:支付流水处理的 Lag 容忍度是秒级,日志聚合的 Lag 容忍度可以是分钟级。

关键的判断不是 Lag 的绝对值,而是 Lag 的变化趋势

# Lag 是否在增长?(5分钟内 Lag 增量 > 0 说明消费速度跟不上生产速度)
delta(kafka_consumer_group_lag_sum{group="payment-consumer"}[5m]) > 0

Prometheus 告警

- alert: KafkaConsumerLagGrowing
  expr: |
    delta(kafka_consumer_group_lag_sum[5m]) > 1000
    and
    kafka_consumer_group_lag_sum > 5000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Kafka 消费积压持续增长(Group: {{ $labels.consumergroup }})"
    description: "Consumer Group {{ $labels.consumergroup }} 在 Topic {{ $labels.topic }} 上的 Lag 为 {{ $value }},且持续增长超过 5 分钟。"

完整告警规则汇总

# kafka-alerts.yml
groups:
  - name: kafka.critical
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 1m
        labels:
          severity: critical
          page: "true"
        annotations:
          summary: "P0: Kafka 副本不足"

      - alert: KafkaNoActiveController
        expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
        for: 30s
        labels:
          severity: critical
          page: "true"
        annotations:
          summary: "P0: Kafka Controller 异常"

  - name: kafka.warning
    rules:
      - alert: KafkaConsumerLagGrowing
        expr: delta(kafka_consumer_group_lag_sum[5m]) > 1000 and kafka_consumer_group_lag_sum > 5000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P1: Kafka 消费积压持续增长"

      - alert: KafkaBrokerIOBottleneck
        expr: kafka_server_requesthandler_avgidlepercent < 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P1: Kafka Broker I/O 瓶颈"

      - alert: KafkaBrokerDiskUsageHigh
        expr: (1 - node_filesystem_avail_bytes{mountpoint=~"/data/kafka.*"} / node_filesystem_size_bytes{mountpoint=~"/data/kafka.*"}) > 0.80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P2: Kafka Broker 磁盘使用率过高"

      - alert: KafkaISRHighChurnRate
        expr: (sum(rate(kafka_server_replicamanager_isrshrinks_total[5m])) + sum(rate(kafka_server_replicamanager_isrexpands_total[5m]))) * 60 > 5
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "P2: Kafka ISR 变动异常频繁"
本章评分
4.5  / 5  (3 评分)

💬 留言讨论