Chapter 39
Monitoring: Prometheus + Grafana + Alert Rules
A complete Redis monitoring stack is the foundation of production reliability. This chapter walks through the raw output of the INFO command, redis_exporter metric collection, Grafana dashboard construction, Prometheus alerting rules, and alert routing with AlertManager.
1. INFO Command: Field-by-Field Reference
INFO is Redis's self-diagnostic endpoint. Understanding each field is a prerequisite to building meaningful dashboards.
1.1 server Section
INFO server
# redis_version:7.2.3
# os:Linux 5.15.0-91-generic x86_64
# arch_bits:64
# tcp_port:6379
# config_file:/etc/redis/redis.conf
# uptime_in_seconds:864000 ← uptime (864000 s = 10 days)
# uptime_in_days:10
# hz:10 ← internal timer frequency; affects expiry precision
# executable:/usr/bin/redis-server
1.2 clients Section
INFO clients
# connected_clients:127 ← current live connections — KEY METRIC
# maxclients:10000 ← configured ceiling
# client_recent_max_input_buffer:20480 ← peak input buffer (bytes)
# blocked_clients:3 ← connections blocked on BLPOP / BRPOP / WAIT
# tracking_clients:0
1.3 memory Section
INFO memory
# used_memory:1073741824 ← bytes allocated by Redis = 1 GB
# used_memory_rss:1342177280 ← OS-reported RSS (includes fragmentation) = 1.25 GB
# used_memory_peak:1200000000 ← historical maximum
# used_memory_overhead:524288 ← Redis internal structures
# used_memory_dataset:1073217536 ← actual data = used_memory - overhead
# mem_fragmentation_ratio:1.25 ← RSS / used_memory; healthy: 1.0–1.5
# mem_allocator:jemalloc-5.3.0
# maxmemory:12884901888 ← configured maxmemory = 12 GB
# maxmemory_policy:allkeys-lru
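The relationships between these fields can be checked by hand. The following is a minimal Python sketch using the sample values above; the helper name `memory_summary` and the dictionary layout are illustrative assumptions, not part of Redis or any client library.

```python
def memory_summary(info):
    """Derive headline memory figures from raw INFO memory fields (bytes)."""
    used = info["used_memory"]
    rss = info["used_memory_rss"]
    maxmemory = info["maxmemory"]
    return {
        # RSS / used_memory, the same ratio reported as mem_fragmentation_ratio
        "fragmentation_ratio": rss / used,
        # Share of the maxmemory budget in use; None when maxmemory is 0 (no limit)
        "utilization": used / maxmemory if maxmemory else None,
        # Dataset size = used_memory - used_memory_overhead
        "dataset_bytes": used - info["used_memory_overhead"],
    }

sample = {
    "used_memory": 1073741824,
    "used_memory_rss": 1342177280,
    "used_memory_overhead": 524288,
    "maxmemory": 12884901888,
}
summary = memory_summary(sample)
print(summary)
```

With the sample values, the ratio comes out at 1.25 and the instance is using one twelfth of its 12 GB budget.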
1.4 stats Section
INFO stats
# total_connections_received:1234567
# total_commands_processed:98765432
# instantaneous_ops_per_sec:12345 ← real-time QPS
# instantaneous_input_kbps:1024.00
# instantaneous_output_kbps:2048.00
# rejected_connections:0 ← connections rejected because maxclients was reached
# sync_full:2 ← number of full resync events
# sync_partial_ok:100 ← successful partial resyncs
# sync_partial_err:1 ← failed partial resyncs (triggers full resync)
# expired_keys:456789 ← keys expired and deleted so far
# evicted_keys:0 ← keys evicted by maxmemory policy
# keyspace_hits:9876543 ← cache hits (cumulative counter)
# keyspace_misses:123456 ← cache misses (cumulative counter)
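Because keyspace_hits and keyspace_misses are cumulative, dividing them directly gives a lifetime average (since startup or the last CONFIG RESETSTAT), not the current hit rate. The sketch below contrasts the two; the "earlier" snapshot values are invented for illustration and `hit_rate` is a hypothetical helper.

```python
from typing import Optional

def hit_rate(hits: int, misses: int) -> Optional[float]:
    """Fraction of lookups that hit; None when there were no lookups."""
    total = hits + misses
    return hits / total if total else None

# Lifetime ratio from the sample INFO output above
lifetime = hit_rate(9876543, 123456)

# Ratio over one interval: subtract an earlier snapshot from a later one.
# The earlier snapshot values are made up for illustration.
earlier = {"hits": 9870000, "misses": 120000}
later = {"hits": 9876543, "misses": 123456}
recent = hit_rate(later["hits"] - earlier["hits"],
                  later["misses"] - earlier["misses"])

print(f"lifetime={lifetime:.4f} recent={recent:.4f}")
```

In this invented case the lifetime figure still looks healthy while the most recent interval is much worse, which is why dashboards compute the ratio over a time window.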
1.5 replication Section
INFO replication
# role:master
# connected_slaves:2
# slave0:ip=192.168.1.11,port=6379,state=online,offset=123456789,lag=0
# slave1:ip=192.168.1.12,port=6379,state=online,offset=123456700,lag=1
# master_replid:a1b2c3d4e5f6...
# master_repl_offset:123456789 ← master's write cursor
# repl_backlog_active:1
# repl_backlog_size:1073741824 ← backlog buffer = 1 GB
# repl_backlog_histlen:123456789 ← bytes currently stored in backlog
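Each slaveN line packs several key=value pairs, and the byte distance between master_repl_offset and a replica's offset shows how far that replica is behind. A small Python sketch using the sample lines above (`parse_replica_line` is a hypothetical helper, not a client-library function):

```python
def parse_replica_line(value):
    """Parse 'ip=...,port=...,state=...,offset=...,lag=...' into a dict."""
    fields = dict(pair.split("=", 1) for pair in value.split(","))
    for key in ("port", "offset", "lag"):
        fields[key] = int(fields[key])
    return fields

master_offset = 123456789
replicas = [
    parse_replica_line("ip=192.168.1.11,port=6379,state=online,offset=123456789,lag=0"),
    parse_replica_line("ip=192.168.1.12,port=6379,state=online,offset=123456700,lag=1"),
]

# Bytes each replica still has to receive
behind = {r["ip"]: master_offset - r["offset"] for r in replicas}
print(behind)  # {'192.168.1.11': 0, '192.168.1.12': 89}
```

Note that the lag field is a time value (seconds since the replica last acknowledged), while the offset difference is measured in bytes.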
1.6 keyspace Section
INFO keyspace
# db0:keys=100000,expires=80000,avg_ttl=3600000
# db1:keys=5000,expires=5000,avg_ttl=86400000
# avg_ttl is in milliseconds: 3600000 ms = 1 hour
1.7 commandstats Section
INFO commandstats
# cmdstat_get:calls=9876543,usec=12345678,usec_per_call=1.25,rejected_calls=0,failed_calls=0
# cmdstat_set:calls=1234567,usec=3456789,usec_per_call=2.80
# cmdstat_hget:calls=567890,usec=2345678,usec_per_call=4.13
# usec_per_call is the average latency per command invocation in microseconds
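To see where usec_per_call comes from, the line can be split apart and the average recomputed from usec and calls. This is a sketch only; `parse_cmdstat` is an illustrative helper written for this example.

```python
def parse_cmdstat(line):
    """Split 'cmdstat_get:calls=...,usec=...' into (command, stats dict)."""
    name, _, rest = line.partition(":")
    stats = {}
    for pair in rest.split(","):
        key, _, value = pair.partition("=")
        stats[key] = float(value)
    prefix = "cmdstat_"
    command = name[len(prefix):] if name.startswith(prefix) else name
    return command, stats

command, stats = parse_cmdstat(
    "cmdstat_get:calls=9876543,usec=12345678,usec_per_call=1.25,rejected_calls=0,failed_calls=0"
)
# usec_per_call is total usec divided by calls
recomputed = stats["usec"] / stats["calls"]
print(command, round(recomputed, 2))  # get 1.25
```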
2. Deploying redis_exporter
# docker-compose.yml
services:
  redis:
    image: redis:7.2
    command: redis-server /etc/redis/redis.conf
    volumes:
      - ./redis.conf:/etc/redis/redis.conf
    ports:
      - "6379:6379"
  redis-exporter:
    image: oliver006/redis_exporter:v1.58.0
    environment:
      REDIS_ADDR: "redis://redis:6379"
      REDIS_PASSWORD: "yourpassword"
      REDIS_EXPORTER_LOG_FORMAT: "json"
    ports:
      - "9121:9121" # Prometheus scrape endpoint
    depends_on:
      - redis
# Verify the exporter is working
curl -s http://localhost:9121/metrics | grep redis_connected_clients
# redis_connected_clients 5
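The same check can be scripted. The sketch below parses the Prometheus text exposition format into a dictionary; it assumes samples carry no trailing timestamp, and the sample text (including the HELP wording) is illustrative rather than copied from a real exporter. `parse_metrics` is a hypothetical helper.

```python
def parse_metrics(text):
    """Parse samples from Prometheus text exposition format into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last space-separated token (assumes no timestamp)
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples

sample_text = """\
# HELP redis_up Illustrative help text
# TYPE redis_up gauge
redis_up 1
redis_connected_clients 5
redis_db_keys{db="db0"} 100000
"""
metrics = parse_metrics(sample_text)
print(metrics["redis_connected_clients"])
```

Against a live exporter, the text could be fetched with `urllib.request.urlopen("http://localhost:9121/metrics").read().decode()` and passed to the same function.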
3. Key Metrics Reference
# Core metrics exposed by redis_exporter (Prometheus format)
# Connections
redis_connected_clients # current live connections
redis_blocked_clients # connections blocked on blocking commands
redis_connected_slaves # number of replicas connected
# Memory
redis_memory_used_bytes # allocated memory (INFO used_memory)
redis_memory_used_rss_bytes # OS-reported RSS
redis_memory_used_peak_bytes # historical peak
redis_mem_fragmentation_ratio # fragmentation ratio
redis_memory_max_bytes # configured maxmemory (0 = no limit)
# Throughput
redis_commands_total # cumulative command count (counter)
redis_commands_duration_seconds_total # cumulative command duration (counter)
redis_keyspace_hits_total # cumulative hits (counter)
redis_keyspace_misses_total # cumulative misses (counter)
redis_instantaneous_ops_per_sec # real-time QPS (gauge)
# Persistence
redis_rdb_last_bgsave_status # 1 = last save succeeded; 0 = failed
redis_aof_enabled # 1 = AOF enabled
redis_aof_rewrite_in_progress # 1 = rewrite running
# Replication
redis_master_repl_offset # master write offset
redis_connected_slave_offset_bytes # per-replica acknowledged offset, as seen by the master
redis_connected_slave_lag_seconds # per-replica lag in seconds, as seen by the master
# Keyspace
redis_db_keys{db="db0"} # total key count
redis_db_keys_expiring{db="db0"} # keys with an expiry set
redis_expired_keys_total # cumulative expired key deletions
redis_evicted_keys_total # cumulative evictions by maxmemory policy
redis_rejected_connections_total # connections rejected (maxclients exceeded)
4. Prometheus Scrape Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: "redis"
    static_configs:
      - targets:
          - "redis-exporter-01:9121"
          - "redis-exporter-02:9121"
          - "redis-exporter-03:9121"
    relabel_configs:
      - source_labels: [__address__]
        regex: "(.*):9121"
        target_label: instance
        replacement: "$1"
  - job_name: "redis-cluster"
    static_configs:
      - targets:
          - "redis-node-01:9121"
          - "redis-node-02:9121"
          - "redis-node-03:9121"
          - "redis-node-04:9121"
          - "redis-node-05:9121"
          - "redis-node-06:9121"
    relabel_configs:
      - source_labels: [__address__]
        regex: "(.*):.*"
        target_label: node
        replacement: "$1"
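The two relabel rules strip the port from the scrape address. Prometheus anchors relabel regexes at both ends, which the sketch below imitates with `re.fullmatch`. This is an approximation for illustration: Prometheus uses RE2 rather than Python's `re`, only the `$1` form of the replacement is handled, and `relabel` is a hypothetical helper.

```python
import re

def relabel(address, regex, replacement):
    """Mimic a Prometheus 'replace' relabel action on one source label value."""
    match = re.fullmatch(regex, address)
    if match is None:
        return None  # no match: the target label would be left untouched
    # Prometheus writes capture groups as $1; Python's re uses \1
    return match.expand(replacement.replace("$1", r"\1"))

print(relabel("redis-exporter-01:9121", "(.*):9121", "$1"))  # redis-exporter-01
print(relabel("redis-node-04:9121", "(.*):.*", "$1"))        # redis-node-04
```

The first rule only rewrites addresses that end in :9121, while the second accepts any port.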
5. Core Grafana Panel Queries
5.1 Cache Hit Rate
# 5-minute rolling hit rate (percentage)
rate(redis_keyspace_hits_total[5m])
/
(rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
* 100
# Recommended display: Stat panel, unit = %, thresholds: red < 80, yellow 80–95, green > 95
5.2 Commands per Second (QPS)
# Total QPS (redis_commands_total carries a cmd label, so sum across commands)
sum by (instance) (rate(redis_commands_total[1m]))
# QPS broken down by command type
rate(redis_commands_total{cmd="get"}[1m])
rate(redis_commands_total{cmd="set"}[1m])
rate(redis_commands_total{cmd="hget"}[1m])
5.3 Average Command Latency
# Average latency in milliseconds (all commands, aggregated per instance)
sum by (instance) (rate(redis_commands_duration_seconds_total[1m]))
/
sum by (instance) (rate(redis_commands_total[1m]))
* 1000
# Latency for a specific command
rate(redis_commands_duration_seconds_total{cmd="get"}[1m])
/
rate(redis_commands_total{cmd="get"}[1m])
* 1000
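Dividing one rate by another over the same window cancels the time term, so the result is simply the change in total duration divided by the change in call count. A worked Python sketch with invented counter readings (`average_latency_ms` is a hypothetical helper):

```python
def average_latency_ms(duration_start_s, duration_end_s, calls_start, calls_end):
    """Average per-call latency in ms between two readings of the cumulative counters."""
    calls = calls_end - calls_start
    if calls <= 0:
        return None  # no calls in the window (or a counter reset)
    return (duration_end_s - duration_start_s) / calls * 1000

# Invented readings taken one minute apart:
# cumulative duration grew by 0.3 s while 120,000 calls were served
latency = average_latency_ms(12.0, 12.3, 9_000_000, 9_120_000)
print(f"{latency:.4f} ms per call")
```

Here 0.3 s spread over 120,000 calls works out to 0.0025 ms, or 2.5 microseconds, per call.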
5.4 Memory Usage
# Memory utilization (%); only meaningful when maxmemory is set (non-zero)
redis_memory_used_bytes / redis_memory_max_bytes * 100
# Fragmentation ratio
redis_mem_fragmentation_ratio
# Memory trend (bytes), for capacity planning
redis_memory_used_bytes
5.5 Connections
# Current connection count
redis_connected_clients
# Connection utilization (%)
redis_connected_clients / redis_config_maxclients * 100
# New rejected connections in the last 5 minutes (non-zero = alert)
increase(redis_rejected_connections_total[5m])
5.6 Replication Lag
# Replication lag in seconds, per replica, as reported by the master
redis_connected_slave_lag_seconds
# Replication lag in bytes: master offset minus each replica's acknowledged offset
redis_master_repl_offset
- on(instance) group_right()
redis_connected_slave_offset_bytes
6. Prometheus Alerting Rules
# redis_alerts.yml
groups:
  - name: redis.critical
    rules:
      # Instance down
      - alert: RedisDown
        expr: redis_up == 0
        for: 1m
        labels:
          severity: critical
          team: infra
        annotations:
          summary: "Redis instance {{ $labels.instance }} is down"
          description: "Redis has been unreachable for more than 1 minute."
          runbook: "https://wiki.internal/runbooks/redis-down"
      # Key eviction: the maxmemory policy is actively removing keys
      - alert: RedisOOM
        expr: increase(redis_evicted_keys_total[5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Redis is evicting keys on {{ $labels.instance }}"
          description: "{{ $value }} keys evicted in the last 5 minutes. maxmemory ceiling reached."
      # Master lost all replicas (role is exposed as a label on redis_instance_info)
      - alert: RedisReplicationBroken
        expr: redis_connected_slaves < 1 and on(instance) redis_instance_info{role="master"}
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Redis master {{ $labels.instance }} has no connected replicas"
      # Connections rejected (maxclients reached)
      - alert: RedisRejectedConnections
        expr: increase(redis_rejected_connections_total[5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Redis is rejecting new connections on {{ $labels.instance }}"
          description: "{{ $value }} connections rejected in last 5 min. Increase maxclients or reduce pool size."
  - name: redis.warning
    rules:
      # Memory > 85% (skipped when maxmemory is 0, i.e. unlimited)
      - alert: RedisHighMemoryUsage
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.85 and on(instance) redis_memory_max_bytes > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage > 85% on {{ $labels.instance }}"
          description: "Current: {{ $value | humanizePercentage }}. Plan capacity increase or lower TTLs."
      # Cache hit rate < 80%
      - alert: RedisLowHitRate
        expr: |
          rate(redis_keyspace_hits_total[5m])
          / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
          < 0.80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis cache hit rate < 80% on {{ $labels.instance }}"
          description: "Hit rate: {{ $value | humanizePercentage }}. Investigate TTL configuration or cache invalidation storms."
      # Too many connections
      - alert: RedisHighConnections
        expr: redis_connected_clients > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis connection count {{ $value }} on {{ $labels.instance }}"
          description: "Check connection pool configuration in client applications."
      # Replication lag > 1 MB (master offset minus each replica's acknowledged offset)
      - alert: RedisReplicationLag
        expr: redis_master_repl_offset - on(instance) group_right() redis_connected_slave_offset_bytes > 1000000
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Redis replication lag > 1 MB on {{ $labels.instance }}"
          description: "Lag: {{ $value }} bytes. Replica may fall behind and trigger a full resync."
      # Memory fragmentation > 1.5
      - alert: RedisHighFragmentation
        expr: redis_mem_fragmentation_ratio > 1.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory fragmentation ratio {{ $value }} on {{ $labels.instance }}"
          description: "Enable activedefrag or plan a maintenance restart."
      # RDB save failed
      - alert: RedisRDBSaveFailed
        expr: redis_rdb_last_bgsave_status == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis RDB background save failed on {{ $labels.instance }}"
  - name: redis.info
    rules:
      # Slowlog growing quickly (redis_slowlog_length is a gauge, so use delta() rather than increase())
      - alert: RedisSlowlogGrowing
        expr: delta(redis_slowlog_length[5m]) > 10
        for: 0m
        labels:
          severity: info
        annotations:
          summary: "Redis slowlog added {{ $value }} entries in last 5 min on {{ $labels.instance }}"
      # High key expiration rate (possible TTL misconfiguration)
      - alert: RedisHighExpiredKeyRate
        expr: rate(redis_expired_keys_total[5m]) > 1000
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "Redis expiring {{ $value }}/s keys on {{ $labels.instance }}"
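The for: field controls how long an expression must stay true before an alert moves from pending to firing. The sketch below is a simplified model of that state machine, assuming one evaluation per evaluation_interval; `alert_states` is a hypothetical helper, not Prometheus code.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return 'inactive', 'pending' or 'firing' for each rule evaluation."""
    states = []
    active_since = None
    for i, condition in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not condition:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= for_s else "pending")
    return states

# 15 s evaluations with for: 1m, as in RedisDown
timeline = alert_states([True] * 5 + [False, True], eval_interval_s=15, for_s=60)
print(timeline)
```

With for: 0m the alert fires on the first true evaluation, which is why that setting is reserved above for events that should never occur at all, such as evictions or rejected connections.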
7. AlertManager Routing
# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: "smtp.company.com:587"
  smtp_from: "[email protected]"
route:
  group_by: ["alertname", "instance"]
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "default"
  routes:
    - match:
        severity: critical
      receiver: "pagerduty-critical"
      group_wait: 0s
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: "slack-warning"
      repeat_interval: 4h
    - match:
        severity: info
      receiver: "email-info"
      repeat_interval: 24h
receivers:
  # Fallback receiver referenced by the top-level route; it must be defined
  - name: "default"
  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key: "YOUR_PAGERDUTY_SERVICE_KEY"
  - name: "slack-warning"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
        channel: "#redis-alerts"
        title: "[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
  - name: "email-info"
    email_configs:
      - to: "[email protected]"
        send_resolved: true
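The routing tree above is evaluated top to bottom: the first child route whose labels match receives the alert, and anything unmatched falls back to the top-level receiver. A simplified Python model of that lookup follows; it ignores `continue`, nested routes, and regex matchers, and `pick_receiver` is a hypothetical helper.

```python
# Child routes in the order they appear in alertmanager.yml
ROUTES = [
    ({"severity": "critical"}, "pagerduty-critical"),
    ({"severity": "warning"}, "slack-warning"),
    ({"severity": "info"}, "email-info"),
]
DEFAULT_RECEIVER = "default"

def pick_receiver(labels):
    """Return the receiver of the first child route whose match labels all agree."""
    for match, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in match.items()):
            return receiver
    return DEFAULT_RECEIVER

print(pick_receiver({"alertname": "RedisDown", "severity": "critical"}))
print(pick_receiver({"alertname": "Unlabelled"}))
```

An alert rule that omits the severity label therefore lands on the fallback receiver, which is worth checking whenever a new rule is added.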
8. Recommended Grafana Dashboard Layout
Row 1 — Overview
[Instance Status] [QPS] [Hit Rate] [Memory Usage %] [Connection Count]
Row 2 — Throughput
[QPS over time (line, per-command breakdown)] | [Average latency (line)]
Row 3 — Memory
[Used memory trend] | [Fragmentation ratio] | [Evicted keys rate]
Row 4 — Replication
[Replication lag (bytes)] | [Connected replicas] | [Full resyncs count]
Row 5 — Persistence
[Time since last RDB] | [AOF file size] | [AOF rewrite duration]
Row 6 — Alert History
[Last 24-hour alert timeline]
Chapter Summary
- INFO provides a complete self-diagnostic snapshot; prioritize the memory, clients, replication, and keyspace sections.
- redis_exporter translates INFO output into Prometheus-compatible metrics; it is the critical bridge between Redis and your monitoring stack.
- Alert severity tiers: Critical (instance down / OOM / replication broken) → Warning (high memory / low hit rate / replication lag) → Info (slowlog growth / expiration rate spikes).
- Core Grafana panels: hit rate + QPS + latency + memory utilization + replication lag.
- Use rate() and increase() in alert expressions rather than raw counter values: a cumulative counter only reflects the total since startup, whereas a windowed rate shows current behavior and smooths out transient spikes.
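One reason raw counters mislead is that they drop back to zero when Redis restarts. The sketch below shows the reset handling idea behind increase() in simplified form: any decrease between samples is treated as a restart. It deliberately omits the window extrapolation that Prometheus also applies, and the function here is an illustration, not Prometheus's implementation.

```python
def increase(samples):
    """Sum the growth across samples, treating any drop as a counter reset."""
    total = 0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            total += curr  # counter restarted from zero and grew to curr
    return total

# Invented readings: the counter resets between the second and third sample
readings = [100, 130, 5, 25]
print(increase(readings))            # 55
print(readings[-1] - readings[0])    # -75, the naive difference
```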