Chapter 39
Monitoring Stack: Prometheus + Grafana + Alerting Rules
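A complete Redis monitoring stack is the foundation of production reliability. Starting from the raw output of the INFO command, this chapter covers metric collection with redis_exporter, Grafana panel configuration, Prometheus alerting rules, and alert severity tiers with their response flow.
1. INFO Sections in Detail
INFO is the entry point for Redis self-diagnosis, and every field has a well-defined meaning:
1.1 The server section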
INFO server
# redis_version:7.2.3
# redis_git_sha1:00000000
# os:Linux 5.15.0-91-generic x86_64
# arch_bits:64
# tcp_port:6379
# config_file:/etc/redis/redis.conf
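# uptime_in_seconds:864000 ← uptime in seconds; 864000 s = 10 days
# uptime_in_days:10
# hz:10 ← frequency of the internal timer (affects how precisely expirations are checked)
# configured_hz:10
# lru_clock:12345678 ← LRU clock (second-level resolution)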
# executable:/usr/bin/redis-server
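The sample output above is a list of `key:value` fields. As a minimal sketch of how a script might turn a raw INFO reply into a dictionary, the following assumes the usual reply layout (a `# Section` header followed by one `key:value` pair per line). `parse_info` is a hypothetical helper written for this illustration, not part of any Redis client library.

```python
def parse_info(text: str) -> dict[str, dict[str, str]]:
    """Group the `key:value` lines of an INFO reply by their `# Section` header."""
    sections: dict[str, dict[str, str]] = {}
    current = "default"  # used only if a field appears before any header
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            current = line.lstrip("#").strip().lower()
            sections.setdefault(current, {})
            continue
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        sections.setdefault(current, {})[key] = value
    return sections


sample = """# Server
redis_version:7.2.3
uptime_in_seconds:864000
hz:10

# Clients
connected_clients:127
"""

info = parse_info(sample)
uptime_days = int(info["server"]["uptime_in_seconds"]) / 86400
print(f"uptime: {uptime_days} days")
```

All values stay as strings here; converting them to numbers is left to the caller because INFO mixes integers, floats, percentages, and free text.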
1.2 The clients section
INFO clients
# connected_clients:127 ← current client connections (a key metric)
# cluster_connections:0
# maxclients:10000 ← configured connection limit
# client_recent_max_input_buffer:20480 ← largest input buffer among current clients (bytes)
# client_recent_max_output_buffer:0
# blocked_clients:3 ← clients blocked on calls such as BLPOP/BRPOP/WAIT
# tracking_clients:0
# clients_in_timeout_table:0
1.3 The memory section
INFO memory
# used_memory:1073741824 ← bytes allocated by Redis = 1 GB
# used_memory_human:1.00G
# used_memory_rss:1342177280 ← resident set size as seen by the OS (includes fragmentation) = 1.25 GB
# used_memory_rss_human:1.25G
# used_memory_peak:1200000000 ← historical peak
# used_memory_peak_human:1.12G
# used_memory_peak_perc:89.48% ← used_memory / used_memory_peak
# used_memory_overhead:524288 ← overhead of Redis internal structures
# used_memory_dataset:1073217536 ← memory holding actual data = used_memory - used_memory_overhead
# mem_fragmentation_ratio:1.25 ← used_memory_rss / used_memory; 1.0-1.5 is normal
# mem_fragmentation_bytes:268435456
# mem_allocator:jemalloc-5.3.0
# maxmemory:12884901888 ← configured maxmemory = 12 GB
# maxmemory_human:12.00G
# maxmemory_policy:allkeys-lru
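The annotations above describe several fields as simple functions of others. The short sketch below recomputes them from the sample values to make those relationships concrete. It mirrors the arithmetic in the annotations only; a live Redis instance may apply small adjustments internally, so treat this as an illustration rather than the server's exact formula.

```python
# Raw values copied from the sample INFO memory output.
used_memory = 1_073_741_824
used_memory_rss = 1_342_177_280
used_memory_peak = 1_200_000_000
used_memory_overhead = 524_288
maxmemory = 12_884_901_888

# Derived values, following the annotations in the sample.
fragmentation_ratio = used_memory_rss / used_memory
fragmentation_bytes = used_memory_rss - used_memory
peak_perc = used_memory / used_memory_peak * 100
dataset = used_memory - used_memory_overhead
maxmemory_usage = used_memory / maxmemory * 100

print(f"mem_fragmentation_ratio: {fragmentation_ratio:.2f}")
print(f"mem_fragmentation_bytes: {fragmentation_bytes}")
print(f"used_memory_peak_perc:   {peak_perc:.2f}%")
print(f"used_memory_dataset:     {dataset}")
print(f"maxmemory usage:         {maxmemory_usage:.2f}%")
```

The last line is not an INFO field; it is the same ratio that a memory-usage dashboard panel or alert would track.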
1.4 The stats section
INFO stats
# total_connections_received:1234567 ← connections accepted since startup
# total_commands_processed:98765432 ← commands processed since startup
# instantaneous_ops_per_sec:12345 ← current QPS (operations per second)
# total_net_input_bytes:10737418240
# total_net_output_bytes:21474836480
# instantaneous_input_kbps:1024.00 ← current inbound traffic (KB/s, despite the field name)
# instantaneous_output_kbps:2048.00
# rejected_connections:0 ← connections rejected because maxclients was reached
# sync_full:2 ← number of full resyncs
# sync_partial_ok:100 ← accepted partial resyncs
# sync_partial_err:1 ← denied partial resyncs (each falls back to a full resync)
# expired_keys:456789 ← keys removed by expiration
# evicted_keys:0 ← keys evicted because maxmemory was reached
# keyspace_hits:9876543 ← successful key lookups
# keyspace_misses:123456 ← failed key lookups
# pubsub_channels:5 ← number of pub/sub channels
1.5 The replication section
INFO replication
# role:master
# connected_slaves:2
# slave0:ip=192.168.1.11,port=6379,state=online,offset=123456789,lag=0
# slave1:ip=192.168.1.12,port=6379,state=online,offset=123456700,lag=1
# master_failover_state:no-failover
# master_replid:a1b2c3d4e5f6... ← replication ID
# master_repl_offset:123456789 ← master replication offset
# repl_backlog_active:1
# repl_backlog_size:1073741824 ← replication backlog size = 1 GB
# repl_backlog_first_byte_offset:1
# repl_backlog_histlen:123456789
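The replication section already contains everything needed to compute per-replica lag in bytes: the master offset minus each `slaveN` offset. A minimal sketch follows, assuming the `slaveN:ip=...,port=...,offset=...` line format shown above; `replica_lag_bytes` is a hypothetical helper name.

```python
def replica_lag_bytes(info_replication: str) -> dict[str, int]:
    """Return {"ip:port": master_repl_offset - replica offset} for each slaveN line."""
    fields: dict[str, str] = {}
    for line in info_replication.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key] = value

    master_offset = int(fields["master_repl_offset"])
    lag: dict[str, int] = {}
    for key, value in fields.items():
        # Only slave0, slave1, ... lines; skips keys such as slave_repl_offset.
        if key.startswith("slave") and key[5:].isdigit():
            attrs = dict(item.split("=", 1) for item in value.split(","))
            lag[f'{attrs["ip"]}:{attrs["port"]}'] = master_offset - int(attrs["offset"])
    return lag


sample = """role:master
connected_slaves:2
slave0:ip=192.168.1.11,port=6379,state=online,offset=123456789,lag=0
slave1:ip=192.168.1.12,port=6379,state=online,offset=123456700,lag=1
master_repl_offset:123456789
"""

lag = replica_lag_bytes(sample)
print(lag)
```

Note that the `lag` attribute inside each `slaveN` line is a different quantity: it counts seconds since the replica's last acknowledgement, not bytes.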
1.6 The keyspace section
INFO keyspace
# db0:keys=100000,expires=80000,avg_ttl=3600000
# db1:keys=5000,expires=5000,avg_ttl=86400000
# ↑ avg_ttl is in milliseconds; 3600000 ms = 1 hour
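To make the keyspace line easier to read at a glance, the sketch below parses one `dbN` line and derives the share of keys that carry a TTL and the average TTL in hours. It assumes every attribute value is an integer, as in the sample above; `parse_keyspace_line` is a hypothetical helper.

```python
def parse_keyspace_line(line: str) -> tuple[str, dict[str, int]]:
    """Split 'db0:keys=...,expires=...,avg_ttl=...' into a name and integer stats."""
    db, _, stats = line.partition(":")
    values = {k: int(v) for k, v in (item.split("=", 1) for item in stats.split(","))}
    return db, values


db, stats = parse_keyspace_line("db0:keys=100000,expires=80000,avg_ttl=3600000")
ttl_share = stats["expires"] / stats["keys"]   # fraction of keys with a TTL
avg_ttl_hours = stats["avg_ttl"] / 3_600_000   # milliseconds to hours
print(db, f"{ttl_share:.0%} of keys have a TTL, average {avg_ttl_hours} h")
```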
1.7 The commandstats section
INFO commandstats
# cmdstat_get:calls=9876543,usec=12345678,usec_per_call=1.25,rejected_calls=0,failed_calls=0
# cmdstat_set:calls=1234567,usec=3456789,usec_per_call=2.80
# cmdstat_hget:calls=567890,usec=2345678,usec_per_call=4.13
# ↑ usec_per_call: average time per call for this command (microseconds)
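Because `usec_per_call` is simply `usec / calls`, the same lines can also be ranked by total time consumed, which is often more useful for spotting the command that dominates server time. The following is a small sketch under the assumption that every attribute parses as a number; `parse_cmdstat` is a hypothetical helper.

```python
def parse_cmdstat(line: str) -> tuple[str, dict[str, float]]:
    """Split 'cmdstat_get:calls=...,usec=...' into a command name and numeric stats."""
    name, _, stats = line.partition(":")
    fields = {k: float(v) for k, v in (item.split("=", 1) for item in stats.split(","))}
    return name.removeprefix("cmdstat_"), fields


lines = [
    "cmdstat_get:calls=9876543,usec=12345678,usec_per_call=1.25,rejected_calls=0,failed_calls=0",
    "cmdstat_set:calls=1234567,usec=3456789,usec_per_call=2.80",
    "cmdstat_hget:calls=567890,usec=2345678,usec_per_call=4.13",
]

commands = dict(parse_cmdstat(line) for line in lines)

# Recompute the average from the raw counters and rank by total time spent.
averages = {name: round(s["usec"] / s["calls"], 2) for name, s in commands.items()}
by_total_time = sorted(commands, key=lambda name: commands[name]["usec"], reverse=True)

print(averages)
print(by_total_time)
```

In this sample HGET has the highest per-call cost, yet GET accounts for the most total time because it is called far more often.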
2. Deploying redis_exporter
# docker-compose.yml
services:
  redis:
    image: redis:7.2
    command: redis-server /etc/redis/redis.conf
    volumes:
      - ./redis.conf:/etc/redis/redis.conf
    ports:
      - "6379:6379"
  redis-exporter:
    image: oliver006/redis_exporter:v1.58.0
    environment:
      REDIS_ADDR: "redis://redis:6379"
      REDIS_PASSWORD: "yourpassword"
      REDIS_EXPORTER_LOG_FORMAT: "json"
    ports:
      - "9121:9121" # port scraped by Prometheus
    depends_on:
      - redis
# Verify that the exporter is working (the ^ anchor skips the HELP/TYPE comment lines)
curl -s http://localhost:9121/metrics | grep '^redis_connected_clients'
# Output: redis_connected_clients 5
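The same check can be scripted. The sketch below extracts the samples of one metric from the Prometheus text exposition format using only string operations. It assumes samples carry no trailing timestamp, and `metric_samples` is a hypothetical helper, not part of any Prometheus client library. The sample text is inlined so the example runs without a live exporter.

```python
def metric_samples(exposition: str, name: str) -> list[tuple[str, float]]:
    """Return (label_string, value) pairs for one metric name."""
    samples = []
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE lines
            continue
        # The value is the last space-separated token (no timestamp assumed).
        series, _, value = line.rpartition(" ")
        metric, brace, labels = series.partition("{")
        if metric == name:
            samples.append((brace + labels, float(value)))
    return samples


sample = """# HELP redis_connected_clients connected_clients metric
# TYPE redis_connected_clients gauge
redis_connected_clients 5
redis_db_keys{db="db0"} 100000
redis_db_keys{db="db1"} 5000
"""

clients = metric_samples(sample, "redis_connected_clients")
db_keys = metric_samples(sample, "redis_db_keys")
print(clients)
print(db_keys)
```

Against a running exporter, the text could be fetched with `urllib.request.urlopen("http://localhost:9121/metrics").read().decode()` and passed to the same function.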
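3. Key Metrics
# Core metrics collected by redis_exporter (Prometheus format)
# Connections and clients
redis_connected_clients # current client connections
redis_blocked_clients # blocked clients
redis_connected_slaves # connected replicas
# Memory
redis_memory_used_bytes # used memory (INFO used_memory)
redis_memory_used_rss_bytes # RSS memory (INFO used_memory_rss)
redis_memory_used_peak_bytes # peak memory (INFO used_memory_peak)
redis_mem_fragmentation_ratio # fragmentation ratio
redis_memory_max_bytes # configured maxmemory
# Performance
redis_commands_total # cumulative calls per command, labeled by cmd (counter)
redis_commands_duration_seconds_total # cumulative time per command, labeled by cmd (counter)
redis_keyspace_hits_total # cumulative hits (counter)
redis_keyspace_misses_total # cumulative misses (counter)
redis_instantaneous_ops_per_sec # instantaneous QPS (gauge)
# Persistence
redis_rdb_last_bgsave_status # status of the last RDB save (1 = success)
redis_aof_enabled # whether AOF is enabled
redis_aof_rewrite_in_progress # whether an AOF rewrite is running
# Replication
redis_master_repl_offset # master replication offset
redis_connected_slave_offset_bytes # per-replica offset as seen by the master (labels: slave_ip, slave_port)
redis_slave_repl_offset # replica replication offset (exposed on the replica instance)
redis_connected_slaves # number of replicas
# Keyspace
redis_db_keys{db="db0"} # total keys
redis_db_keys_expiring{db="db0"} # keys with a TTL
redis_expired_keys_total # cumulative keys removed by expiration
redis_evicted_keys_total # cumulative evicted keys
redis_rejected_connections_total # rejected connections (maxclients exceeded)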
4. Prometheus Scrape Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: "redis"
    static_configs:
      - targets:
          - "redis-exporter-01:9121"
          - "redis-exporter-02:9121"
          - "redis-exporter-03:9121"
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: "(.*):9121"
        replacement: "$1"
  # Cluster with multiple instances
  - job_name: "redis-cluster"
    static_configs:
      - targets:
          - "redis-node-01:9121"
          - "redis-node-02:9121"
          - "redis-node-03:9121"
          - "redis-node-04:9121"
          - "redis-node-05:9121"
          - "redis-node-06:9121"
    relabel_configs:
      - source_labels: [__address__]
        regex: "(.*):.*"
        target_label: node
        replacement: "$1"
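The two `relabel_configs` entries strip the port from `__address__`. Prometheus anchors relabeling regexes at both ends, so the pattern has to match the whole value. The sketch below imitates that behavior with Python's `re.fullmatch` to show what each rule produces. It is an approximation for illustration: Prometheus uses RE2 rather than Python's regex engine, and `relabel_replace` is a hypothetical helper that only handles `$1`.

```python
import re
from typing import Optional


def relabel_replace(value: str, regex: str, replacement: str) -> Optional[str]:
    """Imitate a `replace` relabel action: fully anchored match, `$1` is group 1."""
    match = re.fullmatch(regex, value)
    if match is None:
        return None  # no match: Prometheus leaves the target label unchanged
    return match.expand(replacement.replace("$1", r"\g<1>"))


instance = relabel_replace("redis-exporter-01:9121", "(.*):9121", "$1")
node = relabel_replace("redis-node-04:9121", "(.*):.*", "$1")
unmatched = relabel_replace("redis-exporter-01:9122", "(.*):9121", "$1")

print(instance, node, unmatched)
```

The third call shows why the first rule is port-specific: a target on any other port keeps its original `instance` label.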
5. Core Grafana Dashboard Panels
5.1 Cache Hit Rate
# Hit rate (5-minute window)
rate(redis_keyspace_hits_total[5m])
/
(rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
* 100
# Suggested visualization: Stat panel, unit %, thresholds: <80 red, 80-95 yellow, >95 green
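The query above divides counter growth inside the window, not lifetime totals, which is why it reacts quickly when the hit rate drops. The sketch below reproduces that arithmetic from two hypothetical counter samples taken five minutes apart. It is a simplification: the real `rate()` also handles counter resets and extrapolation, and the per-second division cancels out in the ratio.

```python
from typing import Optional


def windowed_hit_rate(hits_start: int, hits_end: int,
                      misses_start: int, misses_end: int) -> Optional[float]:
    """Hit rate (%) computed from counter growth between two samples."""
    hit_delta = hits_end - hits_start
    miss_delta = misses_end - misses_start
    total = hit_delta + miss_delta
    if total <= 0:
        return None  # no lookups in the window, so the ratio is undefined
    return hit_delta / total * 100


# Hypothetical samples: 45,000 new hits and 5,000 new misses in the window.
window_rate = windowed_hit_rate(9_000_000, 9_045_000, 120_000, 125_000)
lifetime_rate = 9_045_000 / (9_045_000 + 125_000) * 100

print(f"window: {window_rate:.1f}%  lifetime: {lifetime_rate:.1f}%")
```

With these numbers the lifetime ratio still looks healthy while the windowed value has already fallen to the yellow threshold band, which is the behavior a dashboard panel needs.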
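5.2 QPS
# Commands per second (redis_commands_total is per command, so sum over the cmd label)
sum by (instance) (rate(redis_commands_total[1m]))
# Split by command (uses the cmd label derived from commandstats)
rate(redis_commands_total{cmd="get"}[1m])
rate(redis_commands_total{cmd="set"}[1m])
rate(redis_commands_total{cmd="hget"}[1m])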
5.3 Average Latency
# Average command latency across all commands (milliseconds)
sum by (instance) (rate(redis_commands_duration_seconds_total[1m]))
/
sum by (instance) (rate(redis_commands_total[1m]))
* 1000
# Per command
rate(redis_commands_duration_seconds_total{cmd="get"}[1m])
/
rate(redis_commands_total{cmd="get"}[1m])
* 1000
5.4 Memory Usage
# Memory usage (%); only meaningful when maxmemory is non-zero
redis_memory_used_bytes / redis_memory_max_bytes * 100
# Memory fragmentation ratio
redis_mem_fragmentation_ratio
# Used memory trend
redis_memory_used_bytes
5.5 Connections
# Current connections
redis_connected_clients
# Connection utilization (%)
redis_connected_clients / redis_config_maxclients * 100
# Rejected connections (alert on any non-zero value)
increase(redis_rejected_connections_total[5m])
5.6 Replication Lag
# Master-to-replica lag in bytes, evaluated on the master's exporter (one series per replica)
redis_master_repl_offset - on(instance) group_right() redis_connected_slave_offset_bytes
# Simpler alternative: seconds since each replica last acknowledged (the lag field in INFO replication)
redis_connected_slave_lag_seconds
6. Prometheus Alerting Rules
# redis_alerts.yml
groups:
  - name: redis.critical
    rules:
      # Instance down
      - alert: RedisDown
        expr: redis_up == 0
        for: 1m
        labels:
          severity: critical
          team: infra
        annotations:
          summary: "Redis instance {{ $labels.instance }} is down"
          description: "Redis has been unreachable for more than 1 minute."
          runbook: "https://wiki.internal/runbooks/redis-down"
      # Out of memory (evictions are happening)
      - alert: RedisOOM
        expr: increase(redis_evicted_keys_total[5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Redis is evicting keys on {{ $labels.instance }}"
          description: "{{ $value }} keys evicted in last 5 minutes. maxmemory policy is active."
      # Replication broken (a master with no connected replicas)
      - alert: RedisReplicationBroken
        expr: redis_connected_slaves < 1 and on(instance) redis_instance_info{role="master"}
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Redis master {{ $labels.instance }} has no slaves"
  - name: redis.warning
    rules:
      # Memory usage > 85% (skipped when maxmemory is 0, i.e. unlimited)
      - alert: RedisHighMemoryUsage
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.85 and on(instance) redis_memory_max_bytes > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage > 85% on {{ $labels.instance }}"
          description: "Current usage: {{ $value | humanizePercentage }}. Consider expanding maxmemory or scaling."
      # Hit rate < 80%
      - alert: RedisLowHitRate
        expr: |
          rate(redis_keyspace_hits_total[5m])
          / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
          < 0.80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis cache hit rate < 80% on {{ $labels.instance }}"
          description: "Hit rate: {{ $value | humanizePercentage }}. Check for cache invalidation storms or incorrect TTL configuration."
      # Too many connections
      - alert: RedisHighConnections
        expr: redis_connected_clients > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis high connection count: {{ $value }} on {{ $labels.instance }}"
      # Replication lag > 1 MB
      - alert: RedisReplicationLag
        expr: redis_master_repl_offset - on(instance) group_right() redis_connected_slave_offset_bytes > 1000000
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Redis replication lag > 1MB on {{ $labels.instance }}"
          description: "Lag bytes: {{ $value }}. Slave may fall behind and trigger full resync."
      # Memory fragmentation ratio > 1.5
      - alert: RedisHighFragmentation
        expr: redis_mem_fragmentation_ratio > 1.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory fragmentation ratio > 1.5 on {{ $labels.instance }}"
          description: "Fragmentation ratio: {{ $value }}. Consider enabling activedefrag or scheduling a restart."
      # Rejected connections (maxclients exceeded)
      - alert: RedisRejectedConnections
        expr: increase(redis_rejected_connections_total[5m]) > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Redis is rejecting connections on {{ $labels.instance }}"
      # RDB save failed
      - alert: RedisRDBSaveFailed
        expr: redis_rdb_last_bgsave_status == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis RDB last save failed on {{ $labels.instance }}"
  - name: redis.info
    rules:
      # Slow log growing (redis_slowlog_length is a capped gauge, so count new entries via the last slow log ID)
      - alert: RedisSlowlogGrowing
        expr: delta(redis_slowlog_last_id[5m]) > 10
        for: 0m
        labels:
          severity: info
        annotations:
          summary: "Redis slowlog has {{ $value }} new entries in last 5m on {{ $labels.instance }}"
      # Abnormal key expiration rate (a sudden burst of expirations)
      - alert: RedisHighExpiredKeyRate
        expr: rate(redis_expired_keys_total[5m]) > 1000
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "Redis high key expiration rate: {{ $value }}/s on {{ $labels.instance }}"
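Each rule combines an expression with a `for:` duration, and the duration is what filters out short spikes. The sketch below simulates that behavior over a sequence of evaluation results. It assumes the 15s `evaluation_interval` from the scrape configuration, so `for: 1m` corresponds to four intervals, and it is a simplified model: real Prometheus tracks timestamps and a pending state rather than counting evaluations. `firing_states` is a hypothetical helper.

```python
def firing_states(condition_results: list[bool], for_evaluations: int) -> list[bool]:
    """For each evaluation, report whether the alert would be firing.

    for_evaluations is the `for:` duration divided by the evaluation interval.
    The alert fires once the condition has held for that many intervals
    after first becoming true; any false evaluation resets the streak.
    """
    firing = []
    streak = 0
    for is_true in condition_results:
        streak = streak + 1 if is_true else 0
        firing.append(streak > for_evaluations)
    return firing


# One evaluation every 15s; a two-evaluation blip, then a sustained outage.
results = [True, True, False, True, True, True, True, True]
states = firing_states(results, for_evaluations=4)   # like `for: 1m`
immediate = firing_states([True, False], for_evaluations=0)  # like `for: 0m`

print(states)
print(immediate)
```

The initial blip never fires, while the sustained condition fires on its fifth consecutive evaluation, one minute after it first became true. With `for: 0m` the alert fires on the first true evaluation, which suits events such as evictions or rejected connections.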
7. Alertmanager Routing Configuration
# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: "smtp.company.com:587"
  smtp_from: "[email protected]"
route:
  group_by: ["alertname", "instance"]
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "default"
  routes:
    # Critical alerts: notify immediately via PagerDuty
    - match:
        severity: critical
      receiver: "pagerduty-critical"
      group_wait: 0s
      repeat_interval: 1h
    # Warning alerts: Slack notification
    - match:
        severity: warning
      receiver: "slack-warning"
      repeat_interval: 4h
    # Info alerts: email only, once a day
    - match:
        severity: info
      receiver: "email-info"
      repeat_interval: 24h
receivers:
  # Fallback receiver named by route.receiver; Alertmanager rejects a config that references an undefined receiver
  - name: "default"
  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key: "your-pagerduty-service-key"
  - name: "slack-warning"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
        channel: "#redis-alerts"
        title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: "email-info"
    email_configs:
      - to: "[email protected]"
        send_resolved: true
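The routing tree above can be read as "first matching child route wins, otherwise fall back to the root receiver". The sketch below models that lookup for the three severity routes. It is a simplified illustration that ignores grouping, inhibition, and the `continue` option, and `route` is a hypothetical function, not an Alertmanager API.

```python
# (matchers, receiver, repeat_interval), in the order they appear under `routes`.
ROUTES = [
    ({"severity": "critical"}, "pagerduty-critical", "1h"),
    ({"severity": "warning"}, "slack-warning", "4h"),
    ({"severity": "info"}, "email-info", "24h"),
]
DEFAULT = ("default", "4h")  # root route: receiver and repeat_interval


def route(labels: dict[str, str]) -> tuple[str, str]:
    """Return (receiver, repeat_interval) for an alert's label set."""
    for matchers, receiver, repeat_interval in ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver, repeat_interval
    return DEFAULT


critical = route({"alertname": "RedisDown", "severity": "critical"})
info = route({"alertname": "RedisHighExpiredKeyRate", "severity": "info"})
unlabeled = route({"alertname": "SomethingElse"})

print(critical, info, unlabeled)
```

The last call shows why every rule should carry a `severity` label: an alert without one silently falls through to the default receiver.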
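8. Suggested Operations Dashboard Layout
Row 1 (Overview):
[Instance status] [Current QPS] [Hit rate] [Memory usage] [Connections]
Row 2 (Performance):
[QPS trend (line chart, per command)] | [Average latency (line chart)]
Row 3 (Memory):
[Used memory trend] | [Fragmentation ratio trend] | [Evicted keys trend]
Row 4 (Replication):
[Master-replica offset difference] | [Connected replicas] | [Full resync count]
Row 5 (Persistence):
[Last RDB save time] | [AOF file size] | [AOF rewrite duration]
Row 6 (Alert history):
[Alerts from the last 24 hours]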
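Chapter Summary
- The INFO command is the complete entry point for Redis self-diagnosis; focus on the memory, clients, replication, and keyspace sections
- redis_exporter exports INFO metrics in Prometheus format and is the core component of the monitoring stack
- Alert tiers: Critical (instance down / OOM / replication broken) → Warning (high memory / low hit rate / replication lag) → Info (slow queries / abnormal expiration rate)
- Core Grafana panels: hit rate + QPS + latency + memory usage + replication lag
- Alert rules use explicit rate/increase windows in PromQL so that momentary jitter does not trigger false alarms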