Chapter 23

Monitoring & Alerting

MySQL Monitoring and Alerting Complete Guide

Effective monitoring is critical for maintaining MySQL performance and availability. This guide covers essential metrics, tools, alert design, and practical monitoring strategies.

1. Critical Metrics to Monitor

1.1 Query Performance Metrics


QUERY PERFORMANCE INDICATORS (QPI):

Queries Per Second (QPS):
  SHOW GLOBAL STATUS WHERE variable_name IN (
    'Questions',      -- All SQL statements
    'Com_select',     -- SELECT count
    'Com_insert',     -- INSERT count
    'Com_update',     -- UPDATE count
    'Com_delete'      -- DELETE count
  );

  Calculation:
  Current QPS = (Questions_now - Questions_previous) / 60

  Baseline analysis:
  ├─ Peak QPS: 1000-5000 (normal)
  ├─ Off-peak: 100-500
  └─ Alert trigger: > 150% of baseline

Query Latency:
  -- Global execution time
  SELECT *,
         ROUND(SUM_TIMER_WAIT/SUM_ROWS_EXAMINED, 2) as avg_latency_ms
  FROM performance_schema.table_io_waits_summary_by_table
  ORDER BY SUM_TIMER_WAIT DESC;

  Distribution metrics:
  ├─ P50 (median):  100 slow queries/minute
  ├─ P99 latency increases > 2x
  └─ Single query > 5 seconds

1.2 Connection and Concurrency Metrics


CONNECTION METRICS:

Active Connections:
  SHOW PROCESSLIST;  -- Current connections

  Key metrics:
  ├─ Threads_connected: Active connections
  ├─ Threads_running: Executing queries
  ├─ max_connections: Server limit
  └─ Connections ratio: Active / Max

  Baseline:
  ├─ Typical: 10-50 connections
  ├─ Peak: 100-500 connections
  └─ Alert when > 80% of max_connections

Connection Errors:
  SHOW GLOBAL STATUS LIKE '%Connection%';

  Critical metrics:
  ├─ Connection_errors_max_connections: Hit limit
  ├─ Connection_errors_tcpwrap: TCP wrapper reject
  ├─ Aborted_clients: Client disconnect
  └─ Aborted_connects: Failed auth/connection

  Action:
  ├─ If errors rising: Increase max_connections
  ├─ If aborted_connects spike: Check firewall rules
  └─ If aborted_clients rising: Connection pool issues

Thread Usage:
  -- Monitor thread creation
  SHOW GLOBAL STATUS LIKE 'Threads_%';

  Results interpretation:
  ├─ Threads_cached: Reusable threads in pool
  ├─ Threads_created: New threads created (should stay low)
  ├─ Threads_running: Currently executing
  └─ Threads_connected: Active connections

  Optimization:
  ├─ Keep thread_cache_size = 50-100
  └─ If Threads_created/hour > 100: Increase cache

1.3 Replication and Data Consistency Metrics


REPLICATION MONITORING:

Replication Lag (Seconds Behind Master):
  SHOW SLAVE STATUS\G

  Key field: Seconds_Behind_Master
  ├─ 0 seconds: Perfect sync
  ├─  30 seconds: Critical (page on-call)

  Common causes:
  ├─ Long-running transaction on master
  ├─ Replica under-resourced (CPU/IO)
  ├─ Network latency
  ├─ Large batch inserts
  └─ Replica query contention

Binlog Status:
  -- Check binlog position
  SHOW MASTER STATUS;

  Fields:
  ├─ File: Current binlog filename
  ├─ Position: Write position
  └─ Binlog_Do_DB: Included databases

Data Consistency:
  -- Quick consistency check
  CHECKSUM TABLE orders;  -- On master and slave

  Compare checksums:
  Master: Checksum 12345678
  Slave:  Checksum 12345678  -- Should match

  If mismatch:
  ├─ Use pt-table-sync for repair
  ├─ Compare row counts
  └─ Identify missing rows

1.4 Memory and Buffer Pool Metrics


MEMORY PERFORMANCE:

InnoDB Buffer Pool:
  -- Check buffer pool usage
  SELECT * FROM performance_schema.global_variables
  WHERE variable_name = 'innodb_buffer_pool_size';

  Key metrics:
  ├─ Buffer pool size: 50-80% of RAM (typical)
  ├─ Reads from disk: Should be  99% (target)
  └─ Pages dirty:  disk

  If disk temp tables spike:
  ├─ Increase tmp_table_size setting
  ├─ Optimize query to reduce temp table
  └─ Consider partition by time

1.5 I/O and Disk Metrics


DISK I/O MONITORING:

InnoDB I/O Statistics:
  SHOW ENGINE INNODB STATUS\G | grep -A 20 "Innodb_data"

  Key metrics:
  ├─ Innodb_data_read: Bytes read from disk
  ├─ Innodb_data_written: Bytes written to disk
  ├─ Innodb_data_fsyncs: Fsync operations
  └─ Innodb_data_pending_ops: Pending operations

Fsync Operations:
  -- Measure fsync frequency
  SHOW GLOBAL STATUS LIKE 'Innodb_data_fsyncs';

  Target:
  ├─  1000 fsyncs/second: Alert (storage bottleneck)

  Optimization:
  ├─ Set innodb_flush_log_at_trx_commit = 2 (less durable but faster)
  ├─ Use SSD for better random I/O
  └─ Consider group commit batching

Table Lock Metrics:
  SELECT * FROM performance_schema.table_lock_waits_summary_by_table
  ORDER BY COUNT_STAR DESC;

  Analysis:
  ├─ High COUNT_STAR: Many lock waits
  ├─ Tables with > 100 waits/hour: Problematic
  └─ Solution: Optimize query, increase parallelism

2. Monitoring Stack Setup

2.1 Prometheus + Grafana Architecture


MONITORING STACK COMPONENTS:

┌─────────────────────────────────────────────────────────┐
│ MySQL Servers (Master + Replicas)                       │
│ ├─ mysqld_exporter (port 9104)                          │
│ └─ node_exporter (port 9100)                            │
└──────────┬──────────────────────────────────────────────┘
           │ (Scrape every 15s)
           ↓
┌─────────────────────────────────────────────────────────┐
│ Prometheus Server (port 9090)                           │
│ ├─ Stores metrics (TSDB)                                │
│ ├─ Alert evaluation                                     │
│ └─ Alertmanager (port 9093)                             │
└──────────┬──────────────────────────────────────────────┘
           │
           ↓
┌─────────────────────────────────────────────────────────┐
│ Grafana (port 3000)                                     │
│ ├─ Dashboard visualization                              │
│ ├─ Alert routing and notifications                      │
│ └─ User access control                                  │
└─────────────────────────────────────────────────────────┘

Alert Flow:
MySQL metric spike → Prometheus detects → Alert rule fired →
Alertmanager → Route to channel (Slack/PagerDuty/Email) →
On-call engineer

INSTALLATION:

1. mysqld_exporter Setup

   Create MySQL user for exporter:
   CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'exporter_password';
   GRANT REPLICATION CLIENT ON *.* TO 'exporter'@'localhost';
   GRANT PROCESS ON *.* TO 'exporter'@'localhost';
   FLUSH PRIVILEGES;

   Create config file: ~/.my.cnf
   [mysqld_exporter]
   user=exporter
   password=exporter_password

   Start exporter:
   mysqld_exporter --config.my-cnf=/root/.my.cnf --web.telemetry-path=/metrics &

   Verify: curl http://localhost:9104/metrics | head -20

2. Prometheus Configuration

   /etc/prometheus/prometheus.yml:

   global:
     scrape_interval: 15s
     evaluation_interval: 15s

   scrape_configs:
     - job_name: 'mysql-master'
       static_configs:
         - targets: ['192.168.1.100:9104']
       scrape_interval: 10s

     - job_name: 'mysql-slave-1'
       static_configs:
         - targets: ['192.168.1.101:9104']

   alerting:
     alertmanagers:
       - static_configs:
           - targets: ['localhost:9093']

3. Alertmanager Configuration

   /etc/alertmanager/alertmanager.yml:

   route:
     receiver: 'default'
     group_by: ['alertname', 'cluster']
     routes:
       - match:
           severity: critical
         receiver: 'pagerduty'
       - match:
           severity: warning
         receiver: 'slack'

   receivers:
     - name: 'pagerduty'
       pagerduty_configs:
         - service_key: 'YOUR_PAGERDUTY_KEY'

     - name: 'slack'
       slack_configs:
         - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK'
           channel: '#alerts'

2.2 Alert Rules Design


PROMETHEUS ALERT RULES:

/etc/prometheus/alerts.yml:

groups:
  - name: mysql_alerts
    interval: 30s
    rules:

      # Query Performance Alerts
      - alert: HighQPS
        expr: rate(mysql_global_status_questions[5m]) > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High QPS on  $labels.instance "}}"
          description: "QPS is  $value "}} (threshold: 10000)"

      - alert: SlowQuerySpike
        expr: rate(mysql_global_status_slow_queries[5m]) > 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Slow query spike on  $labels.instance "}}"
          description: " $value "}} slow queries/second"

      # Connection Alerts
      - alert: ConnectionPoolExhaustion
        expr: mysql_global_variables_max_connections * 0.8  80%"

      # Replication Alerts
      - alert: ReplicationLagHigh
        expr: mysql_slave_status_seconds_behind_master > 30
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Replication lag on  $labels.instance "}}"
          description: "Lag:  $value "}} seconds (critical > 30s)"

      - alert: ReplicationStopped
        expr: mysql_slave_status_slave_io_running == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Replication IO thread stopped on  $labels.instance "}}"
          description: "Immediate investigation required"

      # Memory Alerts
      - alert: BufferPoolHitRatioLow
        expr: mysql_innodb_buffer_pool_hit_ratio  30s
├─ Connection pool > 90%
├─ Disk space  50% of baseline
├─ Query latency P99 > 200ms
├─ Replication lag > 5s
├─ Buffer pool hit ratio < 95%
└─ Action: Alert slack/email

3. Grafana Dashboard Setup


GRAFANA DASHBOARD CONFIGURATION:

1. Data Source Setup

   Grafana UI → Configuration → Data Sources

   Create Prometheus data source:
   ├─ Name: Prometheus
   ├─ Type: Prometheus
   ├─ URL: http://localhost:9090
   └─ Save & Test

2. Dashboard Creation

   CREATE DASHBOARD: MySQL Overview

   Row 1: System Metrics
   ├─ Panel: CPU Usage (gauge)
   │  Query: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
   │  Threshold: Warning 70%, Critical 90%
   │
   ├─ Panel: Memory Usage (gauge)
   │  Query: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
   │
   └─ Panel: Disk Usage (gauge)
       Query: (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100

   Row 2: MySQL Query Performance
   ├─ Panel: QPS (graph)
   │  Query: rate(mysql_global_status_questions[1m])
   │
   ├─ Panel: Slow Queries (counter)
   │  Query: rate(mysql_global_status_slow_queries[5m])
   │
   └─ Panel: Query Type Distribution (pie)
       Queries: Com_select, Com_insert, Com_update, Com_delete

   Row 3: Connection & Concurrency
   ├─ Panel: Active Connections (graph)
   │  Query: mysql_global_status_threads_connected
   │
   └─ Panel: Connection Pool Usage (gauge)
       Query: (mysql_global_status_threads_connected / mysql_global_variables_max_connections) * 100
       Threshold: 80%

   Row 4: Replication Status
   ├─ Panel: Replication Lag (graph)
   │  Query: mysql_slave_status_seconds_behind_master
   │
   └─ Panel: Slave IO/SQL Running (single stat)
       Query: mysql_slave_status_slave_io_running

   Row 5: Memory & Buffer Pool
   ├─ Panel: Buffer Pool Hit Ratio (gauge)
   │  Query: mysql_innodb_buffer_pool_hit_ratio * 100
   │
   └─ Panel: Temp Tables (counter)
       Query: rate(mysql_global_status_created_tmp_disk_tables[5m])

   Row 6: InnoDB I/O
   └─ Panel: InnoDB Read/Write Ops (graph)
       Queries: rate(mysql_innodb_data_reads[5m]), rate(mysql_innodb_data_writes[5m])

4. Performance Baseline Establishment


BASELINE METRICS COLLECTION:

Week 1: Data Collection Phase

Baseline Calculation (After 7 days):

1. Calculate percentiles for each metric:

   QPS Baseline:
   ├─ Off-peak average (2 AM - 8 AM): 150 QPS
   ├─ Normal (8 AM - 6 PM): 1200 QPS
   ├─ Peak (6 PM - 10 PM): 3500 QPS
   ├─ P95 peak: 4000 QPS
   └─ P99 peak: 4500 QPS

   Alert threshold: P99 + 20% = 5400 QPS
   Warning threshold: P95 + 20% = 4800 QPS

   Query Latency Baseline:
   ├─ P50 (median): 5ms
   ├─ P95: 15ms
   ├─ P99: 50ms
   └─ P100 (max): 800ms

   Alert on P99 > 100ms (2x baseline)
   Warning on P95 > 30ms (2x baseline)

2. Create baseline profile by time:

   Time_Window  QPS      Latency_P95  Connections  CPU
   ─────────────────────────────────────────────────────
   00:00-08:00  150      3ms          5            5%
   08:00-12:00  800      8ms          20           25%
   12:00-18:00  1200     10ms         30           40%
   18:00-22:00  3500     25ms         50           75%
   22:00-00:00  500      5ms          15           20%

3. Document baseline anomalies:

   Monday mornings: 30% higher QPS (batch processing)
   Noon peak: 15% latency increase
   End-of-week report run: Custom query spike
   Database backups: Brief I/O spike (expected)

Trend Analysis:

Week 1 baseline: QPS average 1000
Week 2 baseline: QPS average 1100 (+10%)
Week 3 baseline: QPS average 1250 (+25%)
Week 4 baseline: QPS average 1350 (+35%)

Trend: +10-15% per week growth
├─ Linear projection: Baseline doubles in 6-7 weeks
├─ Action: Plan scaling before saturation
└─ Monitor growth rate: Alert if > 25% week-over-week

5. Monitoring Best Practices


OPERATIONAL GUIDELINES:

Metric Collection Strategy:

1. Sample Interval Tuning

   High-frequency metrics (1s interval):
   ├─ Replication lag (critical to catch)
   ├─ Connection count
   ├─ Active queries
   └─ Error rate spikes

   Standard metrics (15s interval):
   ├─ QPS and latency
   ├─ CPU and memory
   ├─ Buffer pool stats
   └─ I/O operations

   Low-frequency metrics (5m interval):
   ├─ Table sizes
   ├─ Replication consistency checks
   ├─ Backup verification
   └─ Long-term trend data

2. Metric Retention Policy

   Prometheus retention:
   ├─ 15-day high-resolution (30s)
   ├─ 90-day medium-resolution (1m)
   └─ 1-year low-resolution (1h)

Alerting Best Practices:

1. Alert Fatigue Prevention

   Problem: Too many alerts → ignored
   Solution:
   ├─ Alert on root cause, not symptom
   ├─ Set meaningful thresholds using baseline analysis
   ├─ Group related alerts
   │  If master down → suppress replica-related alerts
   └─ Set quiet hours for non-critical

2. On-call Readiness

   Create runbooks for top 10 alerts:

   Alert: ReplicationLagHigh (> 30 seconds)
   Runbook Steps:
   ├─ 1. Check SHOW SLAVE STATUS
   ├─ 2. Identify long-running query on master
   ├─ 3. Kill query or wait for completion
   ├─ 4. Verify lag returns to normal
   ├─ 5. If problem persists:
   │      - Check network latency
   │      - Check replica resource usage
   └─ 6. Document root cause

3. Dashboard Review Cadence

   Daily review (on-call engineer):
   ├─ Check for any overnight alerts
   └─ Verify key metrics are healthy (5 minutes)

   Weekly review (team):
   ├─ Analyze alert trends
   └─ Update thresholds based on data (30 minutes)

   Monthly review (operations + engineering):
   ├─ Capacity planning discussion
   └─ Review alert effectiveness (1-2 hours)

6. Practical Monitoring Scenarios


REAL-WORLD MONITORING CASES:

Scenario 1: Replication Lag Spike
─────────────────────────────────

Symptoms:
├─ Alert: ReplicationLagHigh (lag = 45 seconds)
├─ Dashboard: Lag spike visible last 5 minutes
└─ Queries: No obvious slow query spike

Investigation Steps:

1. Check slave status:
   SHOW SLAVE STATUS\G
   Seconds_Behind_Master: 45

2. Identify long-running query on master:
   SHOW PROCESSLIST;
   Result: UPDATE large_table SET ... (40 seconds still running)

3. Options:
   Option A: Wait for completion (lag resolves automatically)
   Option B: KILL QUERY and re-run with better WHERE clause

Resolution: Schedule alert reduction for known batch windows

---

Scenario 2: Connection Pool Exhaustion
──────────────────────────────────────

Symptoms:
├─ Alert: ConnectionPoolExhaustion (490/500)
└─ Application: Getting "too many connections" error

Investigation:

1. Check current connections: SHOW PROCESSLIST;
   ├─ 200 sleep processes (idle)
   └─ 90 unauthenticated connections

2. Immediate action:
   -- Kill idle connections
   SET GLOBAL wait_timeout = 600;  (10 minutes)
   -- Immediately closes idle connections

3. Long-term fix:
   ├─ Use connection pooling (HikariCP)
   ├─ Set max pool size = 20-50
   └─ Monitor pool metrics

---

Scenario 3: Buffer Pool Hit Ratio Degradation
──────────────────────────────────────────────

Symptoms:
├─ Alert: BufferPoolHitRatioLow (91% vs 99% baseline)
└─ Application: Slightly slower queries

Investigation:

1. Check buffer pool:
   SHOW ENGINE INNODB STATUS\G

2. Recent query found:
   SELECT * FROM analytics_table (4 GB scan, not in buffer pool)
   └─ This evicted 4 GB of hot data from pool

Solutions:
├─ Short-term: Kill query, allow warm-up (10 minutes)
├─ Medium-term: Add WHERE clause, create index
└─ Long-term: Separate OLTP and OLAP workloads

---

Scenario 4: Query Latency Increase
──────────────────────────────────

Symptoms:
├─ Alert: P99 latency 180ms (was 50ms)
└─ User complaints: "Site is slow"

Investigation:

1. Find top slow queries:
   SELECT * FROM performance_schema.events_statements_summary_by_digest
   ORDER BY SUM_TIMER_WAIT DESC LIMIT 10;

2. Found: SELECT users WHERE status='active' (no index, full scan)
   EXPLAIN SELECT ... WHERE status='active'\G
   Result: Full table scan on 50M rows

3. Immediate fix:
   CREATE INDEX idx_status ON users(status);
   Query time: 1000ms → 5ms

4. Prevent recurrence:
   ├─ Add EXPLAIN to code review
   └─ Test with production data volumes

7. Common Monitoring Pitfalls


MISTAKES TO AVOID:

1. Monitoring Wrong Metrics

   ❌ Alert on CPU > 80%
   └─ Better: Alert on query latency/QPS ratio

   ✅ Alert on response time P99 > 100ms
   └─ Slow response = actual user impact

2. Alerting on Correlated Metrics

   ❌ Alert on both:
   ├─ high_connections (alert)
   └─ high_memory_usage (alert)

   ✅ Alert on root cause:
   └─ connection_pool_exhaustion (one alert)

3. Ignoring Baseline Seasonality

   ❌ Fixed threshold: QPS > 1000
   └─ Misses real issues at different times

   ✅ Time-based thresholds:
   ├─ 6 AM: Alert if QPS > 300 (2x baseline)
   └─ 8 PM: Alert if QPS > 5500 (1.5x baseline)

4. Alert Storms

   ❌ Without inhibition rules:
   └─ Master fails → 500+ notifications → alert fatigue

   ✅ With inhibition rules:
   └─ Master fails → 1-3 actionable alerts

5. Incomplete Replication Monitoring

   ❌ Only monitor Seconds_Behind_Master

   ✅ Monitor both:
   ├─ Seconds_Behind_Master > 5 seconds
   ├─ Slave_IO_Running != 'Yes'
   └─ Slave_SQL_Running != 'Yes'

6. Not Testing Alert Conditions

   ❌ Assumption: "Our alerts will catch it"
   └─ Reality: Alert misconfigured, never fires

   ✅ Quarterly drill:
   ├─ Simulate high QPS: run benchmark
   ├─ Verify alert fires
   └─ Check notification reaches on-call

7. Using Rate() Incorrectly

   ❌ rate(mysql_questions[1h])  (1 hour window too large)
   └─ Smooths out spikes, misses short bursts

   ✅ rate(mysql_questions[5m])  (5 minute = ideal)
   └─ Captures spike patterns without jitter

Conclusion

Effective monitoring is about understanding what matters and alerting only on actionable issues. Build your monitoring progressively: collect data, establish baselines, set meaningful thresholds, and evolve based on real incidents. The best monitoring system is one that catches problems before users notice them, while minimizing alert fatigue for your team.

Rate this chapter
4.8  / 5  (7 ratings)

💬 Comments