Chapter 39

Monitoring: Prometheus + Grafana + Alerting Rules

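A complete Redis monitoring stack is the foundation of production reliability. Starting from the raw output of the INFO command, this chapter covers metric collection with redis_exporter, Grafana panel configuration, Prometheus alerting rules, and alert severity levels with their response workflow.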


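1. The INFO command, section by section

INFO is the entry point for Redis self-diagnosis, and every field has a defined meaning. In the listings below, output lines are shown as # comments, with annotations after each ←: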

1.1 server section

INFO server
# redis_version:7.2.3
# redis_git_sha1:00000000
# os:Linux 5.15.0-91-generic x86_64
# arch_bits:64
# tcp_port:6379
# config_file:/etc/redis/redis.conf
# uptime_in_seconds:864000           ← uptime in seconds; 864000 = 10 days
# uptime_in_days:10
# hz:10                              ← frequency of the internal timer (affects how precisely expirations are checked)
# configured_hz:10
# lru_clock:12345678                 ← LRU clock (second-level resolution)
# executable:/usr/bin/redis-server
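
To work with these fields programmatically, the raw reply can be turned into a dictionary. The sketch below assumes the standard INFO text format (section headers such as `# Server`, then plain `key:value` lines); the function name `parse_info` and the sample text are illustrative, not part of any Redis client library.

```python
def parse_info(text: str) -> dict:
    """Parse raw INFO output into a flat field -> value mapping."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Server".
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only: values like "os:Linux 5.15..." stay intact.
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


# Hypothetical excerpt of a raw INFO server reply.
sample = "# Server\r\nredis_version:7.2.3\r\nuptime_in_seconds:864000\r\nhz:10\r\n"
info = parse_info(sample)
uptime_days = int(info["uptime_in_seconds"]) // 86400
print(info["redis_version"], uptime_days)
```

All values come back as strings, so numeric fields need an explicit conversion, as done for `uptime_in_seconds` here.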

1.2 clients section

INFO clients
# connected_clients:127              ← current number of client connections (a key metric)
# cluster_connections:0
# maxclients:10000                   ← configured connection limit
# client_recent_max_input_buffer:20480   ← largest input buffer among current connections (bytes)
# client_recent_max_output_buffer:0
# blocked_clients:3                  ← clients blocked in BLPOP/BRPOP/WAIT
# tracking_clients:0
# clients_in_timeout_table:0

1.3 memory section

INFO memory
# used_memory:1073741824             ← memory allocated by Redis (bytes) = 1 GiB
# used_memory_human:1.00G
# used_memory_rss:1342177280         ← resident set size as seen by the OS (includes fragmentation) = 1.25 GiB
# used_memory_rss_human:1.25G
# used_memory_peak:1200000000        ← historical peak
# used_memory_peak_human:1.12G
# used_memory_peak_perc:89.48%       ← current / peak
# used_memory_overhead:524288        ← overhead of Redis internal structures
# used_memory_dataset:1073217536     ← actual data = used_memory - used_memory_overhead
# mem_fragmentation_ratio:1.25       ← RSS / used; 1.0-1.5 is normal
# mem_fragmentation_bytes:268435456
# mem_allocator:jemalloc-5.3.0
# maxmemory:12884901888              ← configured maxmemory = 12 GiB
# maxmemory_human:12.00G
# maxmemory_policy:allkeys-lru
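
The derived values in this section can be reproduced from three raw fields. The following is a minimal sketch using the sample numbers above; `memory_health` is a hypothetical helper name, and treating `maxmemory` of 0 as "no limit" reflects the Redis default.

```python
import math


def memory_health(used: int, rss: int, maxmemory: int) -> dict:
    """Derive fragmentation and usage figures from three INFO memory fields."""
    return {
        # RSS / used: how much more the OS holds than Redis has allocated.
        "fragmentation_ratio": rss / used,
        "fragmentation_bytes": rss - used,
        # maxmemory 0 means "no limit", so a usage percentage is undefined.
        "usage_pct": 100 * used / maxmemory if maxmemory else math.nan,
    }


health = memory_health(used=1073741824, rss=1342177280, maxmemory=12884901888)
print(health)
```

With these inputs the ratio and byte difference match `mem_fragmentation_ratio` and `mem_fragmentation_bytes` in the listing, and usage against the 12 GiB limit is roughly 8.3%.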

1.4 stats section

INFO stats
# total_connections_received:1234567 ← total connections accepted since startup
# total_commands_processed:98765432  ← total commands processed since startup
# instantaneous_ops_per_sec:12345    ← current QPS (operations per second)
# total_net_input_bytes:10737418240
# total_net_output_bytes:21474836480
# instantaneous_input_kbps:1024.00   ← current inbound traffic, in KB/s despite the field name
# instantaneous_output_kbps:2048.00
# rejected_connections:0             ← connections rejected (on reaching maxclients)
# sync_full:2                        ← number of full resyncs
# sync_partial_ok:100                ← successful partial resyncs
# sync_partial_err:1                 ← failed partial resyncs (each falls back to a full resync)
# expired_keys:456789                ← keys deleted because they expired
# evicted_keys:0                     ← keys evicted (on reaching maxmemory)
# keyspace_hits:9876543              ← lookup hits
# keyspace_misses:123456             ← lookup misses
# pubsub_channels:5                  ← number of pub/sub channels
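
The lifetime cache hit ratio follows directly from `keyspace_hits` and `keyspace_misses`. A small sketch with the sample values above (`hit_ratio` is an illustrative name):

```python
from typing import Optional


def hit_ratio(hits: int, misses: int) -> Optional[float]:
    """Return hits / (hits + misses), or None when there were no lookups."""
    total = hits + misses
    return hits / total if total else None


ratio = hit_ratio(hits=9876543, misses=123456)
print(f"{ratio:.2%}")
```

Because both counters accumulate since startup, this figure reacts slowly to recent changes; a windowed rate is usually more useful for dashboards and alerts.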

1.5 replication section

INFO replication
# role:master
# connected_slaves:2
# slave0:ip=192.168.1.11,port=6379,state=online,offset=123456789,lag=0
# slave1:ip=192.168.1.12,port=6379,state=online,offset=123456700,lag=1
# master_failover_state:no-failover
# master_replid:a1b2c3d4e5f6...       ← replication ID
# master_repl_offset:123456789        ← master replication offset
# repl_backlog_active:1
# repl_backlog_size:1073741824        ← replication backlog size = 1 GiB
# repl_backlog_first_byte_offset:1
# repl_backlog_histlen:123456789
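
Per-replica lag in bytes is the difference between `master_repl_offset` and the `offset` in each `slaveN` line. The sketch below parses the two sample lines above; `replica_lag_bytes` is a hypothetical helper, and it assumes the `key=value,key=value` layout shown in the listing.

```python
def replica_lag_bytes(lines, master_offset: int) -> dict:
    """Map 'ip:port' of each replica to how many bytes it trails the master."""
    lags = {}
    for line in lines:
        name, _, rest = line.partition(":")
        # Only slave0, slave1, ... lines carry key=value attribute lists.
        if not name.startswith("slave") or "=" not in rest:
            continue
        attrs = dict(pair.split("=", 1) for pair in rest.split(","))
        replica = f'{attrs["ip"]}:{attrs["port"]}'
        lags[replica] = master_offset - int(attrs["offset"])
    return lags


sample_lines = [
    "role:master",
    "slave0:ip=192.168.1.11,port=6379,state=online,offset=123456789,lag=0",
    "slave1:ip=192.168.1.12,port=6379,state=online,offset=123456700,lag=1",
]
lags = replica_lag_bytes(sample_lines, master_offset=123456789)
print(lags)
```

Here the first replica is fully caught up and the second trails by 89 bytes.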

1.6 keyspace section

INFO keyspace
# db0:keys=100000,expires=80000,avg_ttl=3600000
# db1:keys=5000,expires=5000,avg_ttl=86400000
# ↑ avg_ttl is in milliseconds; 3600000 ms = 1 hour

1.7 commandstats section

INFO commandstats
# cmdstat_get:calls=9876543,usec=12345678,usec_per_call=1.25,rejected_calls=0,failed_calls=0
# cmdstat_set:calls=1234567,usec=3456789,usec_per_call=2.80
# cmdstat_hget:calls=567890,usec=2345678,usec_per_call=4.13
# ↑ usec_per_call: average time per call for that command (microseconds) = usec / calls
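
The average can be checked by dividing `usec` by `calls`. A short sketch, assuming the `cmdstat_<name>:k=v,k=v` layout shown above (`parse_cmdstat` is an illustrative name):

```python
def parse_cmdstat(line: str):
    """Split a cmdstat line into (command name, {field: numeric value})."""
    name, _, rest = line.partition(":")
    stats = {k: float(v) for k, v in (pair.split("=", 1) for pair in rest.split(","))}
    # Drop the "cmdstat_" prefix to get the bare command name.
    return name[len("cmdstat_"):], stats


cmd, stats = parse_cmdstat("cmdstat_get:calls=9876543,usec=12345678,usec_per_call=1.25")
avg_usec = round(stats["usec"] / stats["calls"], 2)
print(cmd, avg_usec)
```

Sorting commands by `usec` rather than `usec_per_call` shows where the server spends most of its total time, which is often the more actionable view.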

2. Deploying redis_exporter

# docker-compose.yml
services:
  redis:
    image: redis:7.2
    command: redis-server /etc/redis/redis.conf
    volumes:
      - ./redis.conf:/etc/redis/redis.conf
    ports:
      - "6379:6379"

  redis-exporter:
    image: oliver006/redis_exporter:v1.58.0
    environment:
      REDIS_ADDR: "redis://redis:6379"
      REDIS_PASSWORD: "yourpassword"
      REDIS_EXPORTER_LOG_FORMAT: "json"
    ports:
      - "9121:9121"    # Prometheus scrape port
    depends_on:
      - redis

# Verify that the exporter is working (the ^ anchor skips the # HELP and # TYPE lines)
curl -s http://localhost:9121/metrics | grep '^redis_connected_clients'
# Output: redis_connected_clients 5
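
The same check can be scripted. The sketch below reads one sample out of Prometheus text exposition output; the sample text is made up, `metric_value` is a hypothetical helper, and it assumes lines carry no trailing timestamp.

```python
from typing import Optional


def metric_value(exposition: str, name: str) -> Optional[float]:
    """Return the first sample value for a metric in text exposition format."""
    for line in exposition.splitlines():
        # Comment lines (# HELP, # TYPE) and blanks carry no samples.
        if not line.strip() or line.startswith("#"):
            continue
        # Assumes "series value" with no timestamp, so the value is the last token.
        series, _, value = line.rpartition(" ")
        if series == name or series.startswith(name + "{"):
            return float(value)
    return None


sample = """# HELP redis_connected_clients connected_clients metric
# TYPE redis_connected_clients gauge
redis_connected_clients 5
redis_up 1
"""
clients = metric_value(sample, "redis_connected_clients")
print(clients)
```

In a real check the text would come from an HTTP GET against the exporter's `/metrics` endpoint instead of the inline sample.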

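3. Key metrics

# Core metrics collected by redis_exporter (Prometheus format)
# Names follow recent redis_exporter releases; confirm them against your exporter's /metrics output

# Connections and clients
redis_connected_clients              # current client connections
redis_blocked_clients               # blocked clients
redis_connected_slaves              # connected replicas
redis_rejected_connections_total    # rejected connections (maxclients exceeded)

# Memory
redis_memory_used_bytes             # used memory
redis_memory_used_rss_bytes         # RSS memory
redis_memory_used_peak_bytes        # peak memory
redis_mem_fragmentation_ratio       # fragmentation ratio
redis_memory_max_bytes              # configured maxmemory

# Performance
redis_commands_total                # cumulative command count per command (counter)
redis_commands_duration_seconds_total  # cumulative command time per command (counter)
redis_keyspace_hits_total           # cumulative hits (counter)
redis_keyspace_misses_total         # cumulative misses (counter)
redis_instantaneous_ops_per_sec     # instantaneous QPS (gauge)

# Persistence
redis_rdb_last_bgsave_status        # status of the last RDB save (1 = success)
redis_aof_enabled                   # whether AOF is enabled
redis_aof_rewrite_in_progress       # whether an AOF rewrite is running

# Replication
redis_master_repl_offset            # master replication offset
redis_slave_repl_offset             # replica replication offset (on the replica instance)
redis_connected_slave_offset_bytes  # per-replica offset as reported by the master

# Keyspace
redis_db_keys{db="db0"}            # total keys
redis_db_keys_expiring{db="db0"}   # keys with a TTL
redis_expired_keys_total           # cumulative keys deleted on expiry
redis_evicted_keys_total           # cumulative keys evicted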

4. Prometheus scrape configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "redis"
    static_configs:
      - targets:
          - "redis-exporter-01:9121"
          - "redis-exporter-02:9121"
          - "redis-exporter-03:9121"
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: "(.*):9121"
        replacement: "$1"

  # Cluster: multiple instances
  - job_name: "redis-cluster"
    static_configs:
      - targets:
          - "redis-node-01:9121"
          - "redis-node-02:9121"
          - "redis-node-03:9121"
          - "redis-node-04:9121"
          - "redis-node-05:9121"
          - "redis-node-06:9121"
    relabel_configs:
      - source_labels: [__address__]
        regex: "(.*):.*"
        target_label: node
        replacement: "$1"
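
Both relabel rules strip the port from `__address__` before writing the result to a label. Prometheus anchors relabel regexes at both ends and uses `$1` for capture groups; the sketch below imitates that behaviour with Python's `re` module. It is only an approximation for reasoning about the rules (Prometheus uses RE2, and `relabel` is a hypothetical helper).

```python
import re
from typing import Optional


def relabel(source_value: str, regex: str, replacement: str) -> Optional[str]:
    """Imitate a Prometheus 'replace' relabel step for a single source label."""
    # Relabel regexes are fully anchored, so use fullmatch rather than search.
    match = re.fullmatch(regex, source_value)
    if match is None:
        return None  # no match: the target label is left unchanged
    # Translate Prometheus-style $1 into Python's \g<1> before expanding.
    return match.expand(re.sub(r"\$(\d+)", r"\\g<\1>", replacement))


print(relabel("redis-exporter-01:9121", "(.*):9121", "$1"))
print(relabel("redis-node-01:9121", "(.*):.*", "$1"))
print(relabel("redis-exporter-01:9999", "(.*):9121", "$1"))
```

The third call returns `None` because the first rule only matches port 9121, whereas the cluster job's `(.*):.*` accepts any port.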

5. Core Grafana dashboard panels

5.1 Cache hit rate

# Hit rate (%) over a 5-minute window
rate(redis_keyspace_hits_total[5m])
  /
(rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
* 100

# Suggested visualization: Stat panel, unit %, thresholds: <80 red, 80-95 yellow, >95 green
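
Because the window length appears in both numerator and denominator, the windowed hit rate reduces to a ratio of counter increases. A simplified sketch with made-up counter readings; unlike `rate()`, it does not handle counter resets, and `window_hit_rate` is an illustrative name.

```python
from typing import Optional


def window_hit_rate(hits_then: int, hits_now: int,
                    misses_then: int, misses_now: int) -> Optional[float]:
    """Hit rate in percent between two readings of the hit/miss counters."""
    new_hits = hits_now - hits_then
    new_misses = misses_now - misses_then
    lookups = new_hits + new_misses
    # No lookups in the window: the ratio is undefined (PromQL would yield NaN).
    if lookups == 0:
        return None
    return 100 * new_hits / lookups


# Hypothetical readings taken 5 minutes apart.
rate_pct = window_hit_rate(hits_then=1000, hits_now=1900, misses_then=100, misses_now=200)
print(rate_pct)
```

An idle instance produces no value at all for the window, which is worth keeping in mind when a Stat panel shows "No data" rather than a low percentage.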

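5.2 QPS

# Commands per second, summed over all command types
sum by (instance) (rate(redis_commands_total[1m]))

# Broken down by command (the cmd label comes from INFO commandstats)
rate(redis_commands_total{cmd="get"}[1m])
rate(redis_commands_total{cmd="set"}[1m])
rate(redis_commands_total{cmd="hget"}[1m])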

5.3 Average latency

# Average command latency (milliseconds), across all commands
sum by (instance) (rate(redis_commands_duration_seconds_total[1m]))
  /
sum by (instance) (rate(redis_commands_total[1m]))
* 1000

# Per command
rate(redis_commands_duration_seconds_total{cmd="get"}[1m])
  /
rate(redis_commands_total{cmd="get"}[1m])
* 1000

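5.4 Memory usage

# Memory usage (%); only meaningful when maxmemory is set to a non-zero value
redis_memory_used_bytes / redis_memory_max_bytes * 100

# Memory fragmentation ratio
redis_mem_fragmentation_ratio

# Used memory trend
redis_memory_used_bytes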

5.5 Connections

# Current connections
redis_connected_clients

# Connection utilization (%)
redis_connected_clients / redis_config_maxclients * 100

# Rejected connections (alert on any non-zero value)
increase(redis_rejected_connections_total[5m])

5.6 Replication lag

# Master-replica lag in bytes, one series per replica, read from the master's exporter
redis_master_repl_offset - on(instance) group_right() redis_connected_slave_offset_bytes

# Simpler alternative: the lag in seconds that the master reports for each replica
# (verify that your exporter version exposes this metric)
redis_connected_slave_lag_seconds

6. Prometheus alerting rules

# redis_alerts.yml
groups:
  - name: redis.critical
    rules:
      # Instance down
      - alert: RedisDown
        expr: redis_up == 0
        for: 1m
        labels:
          severity: critical
          team: infra
        annotations:
          summary: "Redis instance {{ $labels.instance }} is down"
          description: "Redis has been unreachable for more than 1 minute."
          runbook: "https://wiki.internal/runbooks/redis-down"

      # Memory limit reached (keys are being evicted)
      - alert: RedisOOM
        expr: increase(redis_evicted_keys_total[5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Redis is evicting keys on {{ $labels.instance }}"
          description: "{{ $value }} keys evicted in last 5 minutes. maxmemory policy is active."

      # Replication broken: a master with no connected replicas
      - alert: RedisReplicationBroken
        expr: redis_connected_slaves < 1 and on(instance) redis_instance_info{role="master"}
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Redis master {{ $labels.instance }} has no slaves"

  - name: redis.warning
    rules:
      # Memory usage > 85% (skipped when maxmemory is 0, i.e. unlimited)
      - alert: RedisHighMemoryUsage
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.85 and redis_memory_max_bytes > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage > 85% on {{ $labels.instance }}"
          description: "Current usage: {{ $value | humanizePercentage }}. Consider expanding maxmemory or scaling."

      # Hit rate < 80%
      - alert: RedisLowHitRate
        expr: |
          rate(redis_keyspace_hits_total[5m]) 
            / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m])) 
            < 0.80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis cache hit rate < 80% on {{ $labels.instance }}"
          description: "Hit rate: {{ $value | humanizePercentage }}. Check for cache invalidation storms or incorrect TTL configuration."

      # Too many connections
      - alert: RedisHighConnections
        expr: redis_connected_clients > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis high connection count: {{ $value }} on {{ $labels.instance }}"

      # Replication lag > 1 MB (evaluated per replica on the master's exporter)
      - alert: RedisReplicationLag
        expr: redis_master_repl_offset - on(instance) group_right() redis_connected_slave_offset_bytes > 1000000
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Redis replication lag > 1MB on {{ $labels.instance }}"
          description: "Lag bytes: {{ $value }}. Slave may fall behind and trigger full resync."

      # Memory fragmentation ratio > 1.5
      - alert: RedisHighFragmentation
        expr: redis_mem_fragmentation_ratio > 1.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory fragmentation ratio > 1.5 on {{ $labels.instance }}"
          description: "Fragmentation ratio: {{ $value }}. Consider enabling activedefrag or scheduling a restart."

      # Rejected connections (maxclients exceeded)
      - alert: RedisRejectedConnections
        expr: increase(redis_rejected_connections_total[5m]) > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Redis is rejecting connections on {{ $labels.instance }}"

  - name: redis.info
    rules:
      # Slow log growth (slowlog_length is a gauge capped at slowlog-max-len, so track the last entry ID instead)
      - alert: RedisSlowlogGrowing
        expr: delta(redis_slowlog_last_id[5m]) > 10
        for: 0m
        labels:
          severity: info
        annotations:
          summary: "Redis slowlog has {{ $value }} new entries in last 5m on {{ $labels.instance }}"

      # RDB save failed
      - alert: RedisRDBSaveFailed
        expr: redis_rdb_last_bgsave_status == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis RDB last save failed on {{ $labels.instance }}"

      # Abnormal key expiration rate (a sudden burst of expirations)
      - alert: RedisHighExpiredKeyRate
        expr: rate(redis_expired_keys_total[5m]) > 1000
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "Redis high key expiration rate: {{ $value }}/s on {{ $labels.instance }}"
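
The `for:` field in these rules delays firing until the expression has stayed true for the whole duration; until then the alert is pending. The sketch below is a rough model of that state machine, assuming one sample per evaluation and no missing data. `alert_states` and `for_evals` (the `for` duration divided by the evaluation interval) are illustrative names, not Prometheus APIs.

```python
def alert_states(samples, threshold: float, for_evals: int) -> list:
    """Return 'inactive', 'pending' or 'firing' for each evaluation in turn."""
    states = []
    streak = 0  # consecutive evaluations in which the expression was true
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak == 0:
            states.append("inactive")
        elif streak > for_evals:
            # True now and for the for_evals evaluations before it.
            states.append("firing")
        else:
            states.append("pending")
    return states


# Hypothetical memory-usage ratios, one per evaluation; threshold 0.85, for = 2 intervals.
usage = [0.80, 0.86, 0.87, 0.84, 0.86, 0.88, 0.90]
print(alert_states(usage, threshold=0.85, for_evals=2))
```

The dip to 0.84 resets the pending timer, so the alert fires only on the last sample. With `for: 0m` (`for_evals=0`) the first true evaluation fires immediately, which is why that setting suits events such as evictions or rejected connections.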

7. Alertmanager routing configuration

# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: "smtp.company.com:587"
  smtp_from: "[email protected]"

route:
  group_by: ["alertname", "instance"]
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "default"
  
  routes:
    # Critical alerts: notify immediately via PagerDuty
    - match:
        severity: critical
      receiver: "pagerduty-critical"
      group_wait: 0s
      repeat_interval: 1h

    # Warning alerts: Slack notification
    - match:
        severity: warning
      receiver: "slack-warning"
      repeat_interval: 4h

    # Info alerts: email only, once a day
    - match:
        severity: info
      receiver: "email-info"
      repeat_interval: 24h

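receivers:
  # Fallback receiver referenced by the top-level route; Alertmanager rejects a config that names an undefined receiver
  - name: "default"
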
  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key: "your-pagerduty-service-key"

  - name: "slack-warning"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
        channel: "#redis-alerts"
        title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: "email-info"
    email_configs:
      - to: "[email protected]"
        send_resolved: true
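
Child routes are tried in order, and an alert goes to the first one whose matchers all agree with its labels; otherwise it falls back to the top-level receiver. A minimal model of that selection, ignoring grouping, timing and the `continue` option (`pick_receiver` and `ROUTES` are hypothetical names mirroring the config above):

```python
# (matchers, receiver) pairs in the same order as the routes in alertmanager.yml.
ROUTES = [
    ({"severity": "critical"}, "pagerduty-critical"),
    ({"severity": "warning"}, "slack-warning"),
    ({"severity": "info"}, "email-info"),
]


def pick_receiver(labels: dict, routes=ROUTES, default: str = "default") -> str:
    """Return the receiver of the first route whose matchers all match."""
    for matchers, receiver in routes:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    # No child route matched: the top-level route's receiver handles the alert.
    return default


print(pick_receiver({"alertname": "RedisDown", "severity": "critical"}))
print(pick_receiver({"alertname": "RedisHighConnections", "severity": "warning"}))
print(pick_receiver({"alertname": "Custom"}))
```

An alert without a `severity` label lands on the fallback receiver, so every rule should set that label explicitly.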

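8. Suggested operations dashboard layout

Row 1 (overview):
  [Instance status] [Current QPS] [Hit rate] [Memory usage] [Connections]

Row 2 (performance):
  [QPS trend (line chart, by command type)] | [Average latency (line chart)]

Row 3 (memory):
  [Memory usage trend] | [Fragmentation ratio trend] | [Evicted keys trend]

Row 4 (replication):
  [Master-replica offset difference] | [Connected replicas] | [Full sync count]

Row 5 (persistence):
  [Last RDB save time] | [AOF file size] | [AOF rewrite duration]

Row 6 (alert history):
  [Alerts in the last 24 hours]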

Chapter summary
