Chapter 39

Monitoring: Prometheus + Grafana + Alert Rules

A complete Redis monitoring stack is the foundation of production reliability. This chapter walks through the raw output of the INFO command, redis_exporter metric collection, Grafana dashboard construction, Prometheus alerting rules, and alert routing with AlertManager.


1. INFO Command: Field-by-Field Reference

INFO is Redis's self-diagnostic endpoint. Understanding each field is a prerequisite for building meaningful dashboards.

1.1 server Section

INFO server
# redis_version:7.2.3
# os:Linux 5.15.0-91-generic x86_64
# arch_bits:64
# tcp_port:6379
# config_file:/etc/redis/redis.conf
# uptime_in_seconds:864000       ← uptime (864000 s = 10 days)
# uptime_in_days:10
# hz:10                          ← internal timer frequency; affects expiry precision
# executable:/usr/bin/redis-server

1.2 clients Section

INFO clients
# connected_clients:127          ← current live connections (KEY METRIC)
# maxclients:10000               ← configured ceiling
# client_recent_max_input_buffer:20480   ← peak input buffer (bytes)
# blocked_clients:3              ← connections blocked on BLPOP / BRPOP / WAIT
# tracking_clients:0

1.3 memory Section

INFO memory
# used_memory:1073741824         ← bytes allocated by Redis = 1 GB
# used_memory_rss:1342177280     ← OS-reported RSS (includes fragmentation) = 1.25 GB
# used_memory_peak:1200000000    ← historical maximum
# used_memory_overhead:524288    ← Redis internal structures
# used_memory_dataset:1073217536 ← actual data = used_memory - overhead
# mem_fragmentation_ratio:1.25   ← RSS / used_memory; healthy: 1.0 to 1.5
# mem_allocator:jemalloc-5.3.0
# maxmemory:12884901888          ← configured maxmemory = 12 GB
# maxmemory_policy:allkeys-lru
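
The derived values in this section (fragmentation ratio, memory utilization) are plain arithmetic on the raw fields. A minimal Python sketch using the sample numbers above; the helper names are illustrative and not part of any Redis client library:

```python
from typing import Optional

# Values copied from the sample INFO memory output above.
info_memory = {
    "used_memory": 1073741824,
    "used_memory_rss": 1342177280,
    "maxmemory": 12884901888,
}

def fragmentation_ratio(info: dict) -> float:
    """RSS divided by the memory Redis itself reports as allocated."""
    return info["used_memory_rss"] / info["used_memory"]

def memory_utilization(info: dict) -> Optional[float]:
    """Fraction of maxmemory in use; None when maxmemory is 0 (no limit)."""
    if info["maxmemory"] == 0:
        return None
    return info["used_memory"] / info["maxmemory"]

print(f"fragmentation: {fragmentation_ratio(info_memory):.2f}")   # 1.25
print(f"utilization:   {memory_utilization(info_memory):.2%}")    # 8.33%
```

Returning None for an unlimited instance avoids a division by zero; a dashboard would typically hide the utilization panel in that case.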

1.4 stats Section

INFO stats
# total_connections_received:1234567
# total_commands_processed:98765432
# instantaneous_ops_per_sec:12345    ← real-time QPS
# instantaneous_input_kbps:1024.00
# instantaneous_output_kbps:2048.00
# rejected_connections:0             ← connections rejected because maxclients was reached
# sync_full:2                        ← number of full resync events
# sync_partial_ok:100                ← successful partial resyncs
# sync_partial_err:1                 ← failed partial resyncs (triggers full resync)
# expired_keys:456789                ← keys expired and deleted so far
# evicted_keys:0                     ← keys evicted by maxmemory policy
# keyspace_hits:9876543              ← cache hits (cumulative counter)
# keyspace_misses:123456             ← cache misses (cumulative counter)
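
Because keyspace_hits and keyspace_misses are cumulative, dividing them directly gives a lifetime hit ratio rather than a recent one. A small Python sketch with the sample values above (the function name is hypothetical):

```python
def lifetime_hit_ratio(hits: int, misses: int) -> float:
    """Hit ratio since the counters were last reset (restart or CONFIG RESETSTAT)."""
    total = hits + misses
    return hits / total if total else 0.0

# Values copied from the sample INFO stats output above.
ratio = lifetime_hit_ratio(9876543, 123456)
print(f"{ratio:.2%}")  # 98.77%
```

A lifetime ratio reacts slowly to recent changes, which is why dashboards usually compute the ratio over a rolling window instead.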

1.5 replication Section

INFO replication
# role:master
# connected_slaves:2
# slave0:ip=192.168.1.11,port=6379,state=online,offset=123456789,lag=0
# slave1:ip=192.168.1.12,port=6379,state=online,offset=123456700,lag=1
# master_replid:a1b2c3d4e5f6...
# master_repl_offset:123456789       ← master's write cursor
# repl_backlog_active:1
# repl_backlog_size:1073741824       ← backlog buffer = 1 GB
# repl_backlog_histlen:123456789     ← bytes currently stored in backlog
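
Each slaveN line packs several fields into one comma-separated value. As a rough sketch (the helper name is hypothetical), the byte lag of each replica can be derived by subtracting its offset from master_repl_offset. Note that the lag field in these lines counts seconds since the replica's last acknowledgement; it is not a byte count.

```python
def parse_replica_line(value: str) -> dict:
    """Turn 'ip=...,port=...,state=...,offset=...,lag=...' into a dict."""
    fields = dict(pair.split("=", 1) for pair in value.split(","))
    for key in ("port", "offset", "lag"):
        fields[key] = int(fields[key])
    return fields

# Values copied from the sample INFO replication output above.
master_repl_offset = 123456789
replica_lines = {
    "slave0": "ip=192.168.1.11,port=6379,state=online,offset=123456789,lag=0",
    "slave1": "ip=192.168.1.12,port=6379,state=online,offset=123456700,lag=1",
}

byte_lag = {
    name: master_repl_offset - parse_replica_line(line)["offset"]
    for name, line in replica_lines.items()
}
print(byte_lag)  # {'slave0': 0, 'slave1': 89}
```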

1.6 keyspace Section

INFO keyspace
# db0:keys=100000,expires=80000,avg_ttl=3600000
# db1:keys=5000,expires=5000,avg_ttl=86400000
# avg_ttl is in milliseconds: 3600000 ms = 1 hour

1.7 commandstats Section

INFO commandstats
# cmdstat_get:calls=9876543,usec=12345678,usec_per_call=1.25,rejected_calls=0,failed_calls=0
# cmdstat_set:calls=1234567,usec=3456789,usec_per_call=2.80
# cmdstat_hget:calls=567890,usec=2345678,usec_per_call=4.13
# usec_per_call is the average latency per command invocation in microseconds
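
The usec_per_call value is just usec divided by calls, so it can be recomputed from the other two fields. The following Python sketch (function and variable names are illustrative) parses the sample lines above and ranks commands by total time spent:

```python
from typing import Dict, Tuple

def parse_commandstats_line(line: str) -> Tuple[str, Dict[str, float]]:
    """Split 'cmdstat_get:calls=..,usec=..' into ('get', {'calls': .., 'usec': ..})."""
    name, _, rest = line.partition(":")
    command = name[len("cmdstat_"):] if name.startswith("cmdstat_") else name
    stats = {}
    for pair in rest.split(","):
        key, _, value = pair.partition("=")
        stats[key] = float(value)
    return command, stats

# Lines copied from the sample INFO commandstats output above.
LINES = [
    "cmdstat_get:calls=9876543,usec=12345678,usec_per_call=1.25,rejected_calls=0,failed_calls=0",
    "cmdstat_set:calls=1234567,usec=3456789,usec_per_call=2.80",
    "cmdstat_hget:calls=567890,usec=2345678,usec_per_call=4.13",
]

parsed = dict(parse_commandstats_line(line) for line in LINES)

# Rank by cumulative time, then recompute the per-call average.
by_total_time = sorted(parsed, key=lambda cmd: parsed[cmd]["usec"], reverse=True)
for cmd in by_total_time:
    stats = parsed[cmd]
    print(f"{cmd}: {stats['usec'] / stats['calls']:.2f} us/call")
```

In this sample, hget is the slowest command per call, yet get accounts for the most total time because it is called far more often. Both views are worth a panel.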

2. Deploying redis_exporter

# docker-compose.yml
services:
  redis:
    image: redis:7.2
    command: redis-server /etc/redis/redis.conf
    volumes:
      - ./redis.conf:/etc/redis/redis.conf
    ports:
      - "6379:6379"

  redis-exporter:
    image: oliver006/redis_exporter:v1.58.0
    environment:
      REDIS_ADDR: "redis://redis:6379"
      REDIS_PASSWORD: "yourpassword"
      REDIS_EXPORTER_LOG_FORMAT: "json"
    ports:
      - "9121:9121"    # Prometheus scrape endpoint
    depends_on:
      - redis

# Verify the exporter is working (run from the host shell)
curl -s http://localhost:9121/metrics | grep redis_connected_clients
# redis_connected_clients 5
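
The same check can be scripted. Below is a minimal Python sketch that reads the Prometheus text format into a dict; the function names are hypothetical, the sample text is illustrative rather than captured exporter output, and the parser ignores optional sample timestamps:

```python
from urllib.request import urlopen

def fetch_metrics_text(url: str = "http://localhost:9121/metrics") -> str:
    """Download the raw exposition text from the exporter."""
    with urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")

def parse_samples(text: str) -> dict:
    """Map 'name{labels}' to its float value, skipping comment lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples

# Illustrative sample; in practice pass fetch_metrics_text() instead.
SAMPLE = """\
# TYPE redis_connected_clients gauge
redis_connected_clients 5
redis_up 1
redis_db_keys{db="db0"} 100000
"""

samples = parse_samples(SAMPLE)
print(samples["redis_connected_clients"])  # 5.0
```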

3. Key Metrics Reference

# Core metrics exposed by redis_exporter (Prometheus format)

# Connections
redis_connected_clients              # current live connections
redis_blocked_clients               # connections blocked on blocking commands
redis_connected_slaves              # number of replicas connected

# Memory (names as emitted by oliver006/redis_exporter; confirm against /metrics)
redis_memory_used_bytes             # allocated memory
redis_memory_used_rss_bytes         # OS-reported RSS
redis_memory_used_peak_bytes        # historical peak
redis_mem_fragmentation_ratio       # fragmentation ratio
redis_memory_max_bytes              # configured maxmemory

# Throughput
redis_commands_total                # cumulative command count (counter)
redis_commands_duration_seconds_total  # cumulative command duration (counter)
redis_keyspace_hits_total           # cumulative hits (counter)
redis_keyspace_misses_total         # cumulative misses (counter)
redis_instantaneous_ops_per_sec     # real-time QPS (gauge)

# Persistence
redis_rdb_last_bgsave_status        # 1 = last save succeeded; 0 = failed
redis_aof_enabled                   # 1 = AOF enabled
redis_aof_rewrite_in_progress       # 1 = rewrite running

# Replication
redis_replication_offset            # master write offset
redis_slave_replication_offset      # replica applied offset
redis_replication_lag               # lag in bytes (exporter >= 1.45)

# Keyspace
redis_db_keys{db="db0"}            # total key count
redis_db_expiring_keys{db="db0"}   # keys with an expiry set
redis_expired_keys_total           # cumulative expired key deletions
redis_evicted_keys_total           # cumulative evictions by maxmemory policy
redis_rejected_connections_total   # connections rejected (maxclients exceeded)

4. Prometheus Scrape Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "redis"
    static_configs:
      - targets:
          - "redis-exporter-01:9121"
          - "redis-exporter-02:9121"
          - "redis-exporter-03:9121"
    relabel_configs:
      - source_labels: [__address__]
        regex: "(.*):9121"
        target_label: instance
        replacement: "$1"

  - job_name: "redis-cluster"
    static_configs:
      - targets:
          - "redis-node-01:9121"
          - "redis-node-02:9121"
          - "redis-node-03:9121"
          - "redis-node-04:9121"
          - "redis-node-05:9121"
          - "redis-node-06:9121"
    relabel_configs:
      - source_labels: [__address__]
        regex: "(.*):.*"
        target_label: node
        replacement: "$1"
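
Prometheus anchors relabel regexes at both ends, so "(.*):9121" must match the whole address. A small Python sketch that mimics the two rules above with re.fullmatch (the helper is hypothetical, and Python's regex engine stands in for the RE2 syntax Prometheus uses):

```python
import re
from typing import Optional

def relabel_replace(address: str, regex: str) -> Optional[str]:
    """Return capture group 1 if the regex matches the whole address, else None."""
    match = re.fullmatch(regex, address)
    return match.group(1) if match else None

print(relabel_replace("redis-exporter-01:9121", r"(.*):9121"))  # redis-exporter-01
print(relabel_replace("redis-node-04:9121", r"(.*):.*"))        # redis-node-04
print(relabel_replace("redis-exporter-01:9122", r"(.*):9121"))  # None
```

When the regex does not match, the replace action leaves the target label untouched, which the sketch models by returning None.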

5. Core Grafana Panel Queries

5.1 Cache Hit Rate

# 5-minute rolling hit rate (percentage)
rate(redis_keyspace_hits_total[5m])
  /
(rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
* 100

# Recommended display: Stat panel, unit = %, thresholds: red < 80, yellow 80 to 95, green > 95

5.2 Commands per Second (QPS)

# Total QPS per instance (redis_commands_total carries one series per command)
sum by (instance) (rate(redis_commands_total[1m]))

# QPS broken down by command type
rate(redis_commands_total{cmd="get"}[1m])
rate(redis_commands_total{cmd="set"}[1m])
rate(redis_commands_total{cmd="hget"}[1m])

5.3 Average Command Latency

# Average latency in milliseconds (all commands, aggregated per instance)
sum by (instance) (rate(redis_commands_duration_seconds_total[1m]))
  /
sum by (instance) (rate(redis_commands_total[1m]))
* 1000

# Latency for a specific command
rate(redis_commands_duration_seconds_total{cmd="get"}[1m])
  /
rate(redis_commands_total{cmd="get"}[1m])
* 1000
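
The latency queries divide one rate by another, so the window length cancels out and only the counter deltas matter. A Python sketch of that arithmetic, with hypothetical counter readings from two scrapes:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Scrape:
    duration_seconds: float  # redis_commands_duration_seconds_total
    calls: float             # redis_commands_total

def average_latency_ms(earlier: Scrape, later: Scrape) -> Optional[float]:
    """Mean latency of the commands executed between two scrapes, in ms."""
    calls = later.calls - earlier.calls
    if calls <= 0:
        return None  # no commands in the window (PromQL would yield NaN)
    return (later.duration_seconds - earlier.duration_seconds) / calls * 1000

# Hypothetical readings: 600,000 commands took 0.6 s in total.
before = Scrape(duration_seconds=12.0, calls=9_000_000)
after = Scrape(duration_seconds=12.6, calls=9_600_000)
print(f"{average_latency_ms(before, after):.4f} ms")  # 0.0010 ms
```

This sketch ignores counter resets, which Prometheus's rate() compensates for automatically.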

5.4 Memory Usage

# Memory utilization (%); only meaningful when maxmemory is set (non-zero)
redis_memory_used_bytes / redis_memory_max_bytes * 100

# Fragmentation ratio
redis_mem_fragmentation_ratio

# Memory trend (bytes), useful for capacity planning
redis_memory_used_bytes

5.5 Connections

# Current connection count
redis_connected_clients

# Connection utilization (%)
redis_connected_clients / redis_config_maxclients * 100

# New rejected connections in the last 5 minutes (non-zero = alert)
increase(redis_rejected_connections_total[5m])

5.6 Replication Lag

# Replication lag in bytes (from exporter >= 1.45)
redis_replication_lag

# Alternative calculation
redis_replication_offset
  - on(instance) group_right()
redis_slave_replication_offset

6. Prometheus Alerting Rules

# redis_alerts.yml
groups:
  - name: redis.critical
    rules:
      # Instance down
      - alert: RedisDown
        expr: redis_up == 0
        for: 1m
        labels:
          severity: critical
          team: infra
        annotations:
          summary: "Redis instance {{ $labels.instance }} is down"
          description: "Redis has been unreachable for more than 1 minute."
          runbook: "https://wiki.internal/runbooks/redis-down"

      # Key eviction: the maxmemory policy is actively removing keys
      - alert: RedisOOM
        expr: increase(redis_evicted_keys_total[5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Redis is evicting keys on {{ $labels.instance }}"
          description: "{{ $value }} keys evicted in the last 5 minutes. maxmemory ceiling reached."

      # Master lost all replicas
      - alert: RedisReplicationBroken
        expr: redis_connected_slaves < 1 and redis_replication_role == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Redis master {{ $labels.instance }} has no connected replicas"

      # Connections rejected (maxclients reached)
      - alert: RedisRejectedConnections
        expr: increase(redis_rejected_connections_total[5m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Redis is rejecting new connections on {{ $labels.instance }}"
          description: "{{ $value }} connections rejected in last 5 min. Increase maxclients or reduce pool size."

  - name: redis.warning
    rules:
      # Memory > 85%
      - alert: RedisHighMemoryUsage
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage > 85% on {{ $labels.instance }}"
          description: "Current: {{ $value | humanizePercentage }}. Plan capacity increase or lower TTLs."

      # Cache hit rate < 80%
      - alert: RedisLowHitRate
        expr: |
          rate(redis_keyspace_hits_total[5m])
            / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
            < 0.80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis cache hit rate < 80% on {{ $labels.instance }}"
          description: "Hit rate: {{ $value | humanizePercentage }}. Investigate TTL configuration or cache invalidation storms."

      # Too many connections
      - alert: RedisHighConnections
        expr: redis_connected_clients > 500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis connection count {{ $value }} on {{ $labels.instance }}"
          description: "Check connection pool configuration in client applications."

      # Replication lag > 1 MB
      - alert: RedisReplicationLag
        expr: redis_replication_offset - on(instance) group_right() redis_slave_replication_offset > 1000000
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Redis replication lag > 1 MB on {{ $labels.instance }}"
          description: "Lag: {{ $value }} bytes. Replica may fall behind and trigger a full resync."

      # Memory fragmentation > 1.5
      - alert: RedisHighFragmentation
        expr: redis_mem_fragmentation_ratio > 1.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory fragmentation ratio {{ $value }} on {{ $labels.instance }}"
          description: "Enable activedefrag or plan a maintenance restart."

      # RDB save failed
      - alert: RedisRDBSaveFailed
        expr: redis_rdb_last_bgsave_status == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis RDB background save failed on {{ $labels.instance }}"

  - name: redis.info
    rules:
      # Slowlog growing quickly (slowlog length is a gauge, so use delta rather than increase)
      - alert: RedisSlowlogGrowing
        expr: delta(redis_slowlog_length[5m]) > 10
        for: 0m
        labels:
          severity: info
        annotations:
          summary: "Redis slowlog added {{ $value }} entries in last 5 min on {{ $labels.instance }}"

      # High key expiration rate (possible TTL misconfiguration)
      - alert: RedisHighExpiredKeyRate
        expr: rate(redis_expired_keys_total[5m]) > 1000
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "Redis expiring {{ $value }}/s keys on {{ $labels.instance }}"

7. AlertManager Routing

# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: "smtp.company.com:587"
  smtp_from: "[email protected]"

route:
  group_by: ["alertname", "instance"]
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "default"

  routes:
    - match:
        severity: critical
      receiver: "pagerduty-critical"
      group_wait: 0s
      repeat_interval: 1h

    - match:
        severity: warning
      receiver: "slack-warning"
      repeat_interval: 4h

    - match:
        severity: info
      receiver: "email-info"
      repeat_interval: 24h

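receivers:
  # The root route names "default", so it must be declared or AlertManager
  # refuses to load the file. With no notifier configs it discards alerts
  # that match none of the child routes.
  - name: "default"
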
  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key: "YOUR_PAGERDUTY_SERVICE_KEY"

  - name: "slack-warning"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
        channel: "#redis-alerts"
        title: "[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"

  - name: "email-info"
    email_configs:
      - to: "[email protected]"
        send_resolved: true
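
The routing tree above is evaluated top to bottom: the first child route whose matchers all equal the alert's labels wins, and anything unmatched falls through to the root receiver. A simplified Python model of that behavior (it ignores grouping, timing, and the continue option):

```python
from typing import Dict

# Child routes in the order they appear in alertmanager.yml.
ROUTES = [
    ({"severity": "critical"}, "pagerduty-critical"),
    ({"severity": "warning"}, "slack-warning"),
    ({"severity": "info"}, "email-info"),
]
DEFAULT_RECEIVER = "default"

def pick_receiver(labels: Dict[str, str]) -> str:
    """Return the receiver of the first route whose matchers all match."""
    for matchers, receiver in ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return DEFAULT_RECEIVER

print(pick_receiver({"alertname": "RedisDown", "severity": "critical"}))
# pagerduty-critical
print(pick_receiver({"alertname": "RedisHighFragmentation", "severity": "warning"}))
# slack-warning
print(pick_receiver({"alertname": "Unlabelled"}))
# default
```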

8. Grafana Dashboard Layout

Row 1: Overview
  [Instance Status]  [QPS]  [Hit Rate]  [Memory Usage %]  [Connection Count]

Row 2: Throughput
  [QPS over time (line, per-command breakdown)]  |  [Average latency (line)]

Row 3: Memory
  [Used memory trend]  |  [Fragmentation ratio]  |  [Evicted keys rate]

Row 4: Replication
  [Replication lag (bytes)]  |  [Connected replicas]  |  [Full resyncs count]

Row 5: Persistence
  [Time since last RDB]  |  [AOF file size]  |  [AOF rewrite duration]

Row 6: Alert History
  [Last 24-hour alert timeline]

Chapter Summary
