Chapter 23
Monitoring & Alerting
MySQL Monitoring and Alerting Complete Guide
Effective monitoring is critical for maintaining MySQL performance and availability. This guide covers essential metrics, tools, alert design, and practical monitoring strategies.
1. Critical Metrics to Monitor
1.1 Query Performance Metrics
QUERY PERFORMANCE INDICATORS (QPI):
Queries Per Second (QPS):
SHOW GLOBAL STATUS WHERE variable_name IN (
'Questions', -- All SQL statements
'Com_select', -- SELECT count
'Com_insert', -- INSERT count
'Com_update', -- UPDATE count
'Com_delete' -- DELETE count
);
Calculation:
Current QPS = (Questions_now - Questions_previous) / 60
Baseline analysis:
├─ Peak QPS: 1000-5000 (normal)
├─ Off-peak: 100-500
└─ Alert trigger: > 150% of baseline
Query Latency:
-- Global execution time
SELECT *,
ROUND(SUM_TIMER_WAIT/SUM_ROWS_EXAMINED, 2) as avg_latency_ms
FROM performance_schema.table_io_waits_summary_by_table
ORDER BY SUM_TIMER_WAIT DESC;
Distribution metrics:
├─ P50 (median): 100 slow queries/minute
├─ P99 latency increases > 2x
└─ Single query > 5 seconds
1.2 Connection and Concurrency Metrics
CONNECTION METRICS:
Active Connections:
SHOW PROCESSLIST; -- Current connections
Key metrics:
├─ Threads_connected: Active connections
├─ Threads_running: Executing queries
├─ max_connections: Server limit
└─ Connections ratio: Active / Max
Baseline:
├─ Typical: 10-50 connections
├─ Peak: 100-500 connections
└─ Alert when > 80% of max_connections
Connection Errors:
SHOW GLOBAL STATUS LIKE '%Connection%';
Critical metrics:
├─ Connection_errors_max_connections: Hit limit
├─ Connection_errors_tcpwrap: TCP wrapper reject
├─ Aborted_clients: Client disconnect
└─ Aborted_connects: Failed auth/connection
Action:
├─ If errors rising: Increase max_connections
├─ If aborted_connects spike: Check firewall rules
└─ If aborted_clients rising: Connection pool issues
Thread Usage:
-- Monitor thread creation
SHOW GLOBAL STATUS LIKE 'Threads_%';
Results interpretation:
├─ Threads_cached: Reusable threads in pool
├─ Threads_created: New threads created (should stay low)
├─ Threads_running: Currently executing
└─ Threads_connected: Active connections
Optimization:
├─ Keep thread_cache_size = 50-100
└─ If Threads_created/hour > 100: Increase cache
1.3 Replication and Data Consistency Metrics
REPLICATION MONITORING:
Replication Lag (Seconds Behind Master):
SHOW SLAVE STATUS\G
Key field: Seconds_Behind_Master
├─ 0 seconds: Perfect sync
├─ 30 seconds: Critical (page on-call)
Common causes:
├─ Long-running transaction on master
├─ Replica under-resourced (CPU/IO)
├─ Network latency
├─ Large batch inserts
└─ Replica query contention
Binlog Status:
-- Check binlog position
SHOW MASTER STATUS;
Fields:
├─ File: Current binlog filename
├─ Position: Write position
└─ Binlog_Do_DB: Included databases
Data Consistency:
-- Quick consistency check
CHECKSUM TABLE orders; -- On master and slave
Compare checksums:
Master: Checksum 12345678
Slave: Checksum 12345678 -- Should match
If mismatch:
├─ Use pt-table-sync for repair
├─ Compare row counts
└─ Identify missing rows
1.4 Memory and Buffer Pool Metrics
MEMORY PERFORMANCE:
InnoDB Buffer Pool:
-- Check buffer pool usage
SELECT * FROM performance_schema.global_variables
WHERE variable_name = 'innodb_buffer_pool_size';
Key metrics:
├─ Buffer pool size: 50-80% of RAM (typical)
├─ Reads from disk: Should be 99% (target)
└─ Pages dirty: disk
If disk temp tables spike:
├─ Increase tmp_table_size setting
├─ Optimize query to reduce temp table
└─ Consider partition by time
1.5 I/O and Disk Metrics
DISK I/O MONITORING:
InnoDB I/O Statistics:
SHOW ENGINE INNODB STATUS\G | grep -A 20 "Innodb_data"
Key metrics:
├─ Innodb_data_read: Bytes read from disk
├─ Innodb_data_written: Bytes written to disk
├─ Innodb_data_fsyncs: Fsync operations
└─ Innodb_data_pending_ops: Pending operations
Fsync Operations:
-- Measure fsync frequency
SHOW GLOBAL STATUS LIKE 'Innodb_data_fsyncs';
Target:
├─ 1000 fsyncs/second: Alert (storage bottleneck)
Optimization:
├─ Set innodb_flush_log_at_trx_commit = 2 (less durable but faster)
├─ Use SSD for better random I/O
└─ Consider group commit batching
Table Lock Metrics:
SELECT * FROM performance_schema.table_lock_waits_summary_by_table
ORDER BY COUNT_STAR DESC;
Analysis:
├─ High COUNT_STAR: Many lock waits
├─ Tables with > 100 waits/hour: Problematic
└─ Solution: Optimize query, increase parallelism
2. Monitoring Stack Setup
2.1 Prometheus + Grafana Architecture
MONITORING STACK COMPONENTS:
┌─────────────────────────────────────────────────────────┐
│ MySQL Servers (Master + Replicas) │
│ ├─ mysqld_exporter (port 9104) │
│ └─ node_exporter (port 9100) │
└──────────┬──────────────────────────────────────────────┘
│ (Scrape every 15s)
↓
┌─────────────────────────────────────────────────────────┐
│ Prometheus Server (port 9090) │
│ ├─ Stores metrics (TSDB) │
│ ├─ Alert evaluation │
│ └─ Alertmanager (port 9093) │
└──────────┬──────────────────────────────────────────────┘
│
↓
┌─────────────────────────────────────────────────────────┐
│ Grafana (port 3000) │
│ ├─ Dashboard visualization │
│ ├─ Alert routing and notifications │
│ └─ User access control │
└─────────────────────────────────────────────────────────┘
Alert Flow:
MySQL metric spike → Prometheus detects → Alert rule fired →
Alertmanager → Route to channel (Slack/PagerDuty/Email) →
On-call engineer
INSTALLATION:
1. mysqld_exporter Setup
Create MySQL user for exporter:
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'exporter_password';
GRANT REPLICATION CLIENT ON *.* TO 'exporter'@'localhost';
GRANT PROCESS ON *.* TO 'exporter'@'localhost';
FLUSH PRIVILEGES;
Create config file: ~/.my.cnf
[mysqld_exporter]
user=exporter
password=exporter_password
Start exporter:
mysqld_exporter --config.my-cnf=/root/.my.cnf --web.telemetry-path=/metrics &
Verify: curl http://localhost:9104/metrics | head -20
2. Prometheus Configuration
/etc/prometheus/prometheus.yml:
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'mysql-master'
static_configs:
- targets: ['192.168.1.100:9104']
scrape_interval: 10s
- job_name: 'mysql-slave-1'
static_configs:
- targets: ['192.168.1.101:9104']
alerting:
alertmanagers:
- static_configs:
- targets: ['localhost:9093']
3. Alertmanager Configuration
/etc/alertmanager/alertmanager.yml:
route:
receiver: 'default'
group_by: ['alertname', 'cluster']
routes:
- match:
severity: critical
receiver: 'pagerduty'
- match:
severity: warning
receiver: 'slack'
receivers:
- name: 'pagerduty'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_KEY'
- name: 'slack'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK'
channel: '#alerts'
2.2 Alert Rules Design
PROMETHEUS ALERT RULES:
/etc/prometheus/alerts.yml:
groups:
- name: mysql_alerts
interval: 30s
rules:
# Query Performance Alerts
- alert: HighQPS
expr: rate(mysql_global_status_questions[5m]) > 10000
for: 5m
labels:
severity: warning
annotations:
summary: "High QPS on $labels.instance "}}"
description: "QPS is $value "}} (threshold: 10000)"
- alert: SlowQuerySpike
expr: rate(mysql_global_status_slow_queries[5m]) > 10
for: 2m
labels:
severity: critical
annotations:
summary: "Slow query spike on $labels.instance "}}"
description: " $value "}} slow queries/second"
# Connection Alerts
- alert: ConnectionPoolExhaustion
expr: mysql_global_variables_max_connections * 0.8 80%"
# Replication Alerts
- alert: ReplicationLagHigh
expr: mysql_slave_status_seconds_behind_master > 30
for: 5m
labels:
severity: critical
annotations:
summary: "Replication lag on $labels.instance "}}"
description: "Lag: $value "}} seconds (critical > 30s)"
- alert: ReplicationStopped
expr: mysql_slave_status_slave_io_running == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Replication IO thread stopped on $labels.instance "}}"
description: "Immediate investigation required"
# Memory Alerts
- alert: BufferPoolHitRatioLow
expr: mysql_innodb_buffer_pool_hit_ratio 30s
├─ Connection pool > 90%
├─ Disk space 50% of baseline
├─ Query latency P99 > 200ms
├─ Replication lag > 5s
├─ Buffer pool hit ratio < 95%
└─ Action: Alert slack/email
3. Grafana Dashboard Setup
GRAFANA DASHBOARD CONFIGURATION:
1. Data Source Setup
Grafana UI → Configuration → Data Sources
Create Prometheus data source:
├─ Name: Prometheus
├─ Type: Prometheus
├─ URL: http://localhost:9090
└─ Save & Test
2. Dashboard Creation
CREATE DASHBOARD: MySQL Overview
Row 1: System Metrics
├─ Panel: CPU Usage (gauge)
│ Query: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
│ Threshold: Warning 70%, Critical 90%
│
├─ Panel: Memory Usage (gauge)
│ Query: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
│
└─ Panel: Disk Usage (gauge)
Query: (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100
Row 2: MySQL Query Performance
├─ Panel: QPS (graph)
│ Query: rate(mysql_global_status_questions[1m])
│
├─ Panel: Slow Queries (counter)
│ Query: rate(mysql_global_status_slow_queries[5m])
│
└─ Panel: Query Type Distribution (pie)
Queries: Com_select, Com_insert, Com_update, Com_delete
Row 3: Connection & Concurrency
├─ Panel: Active Connections (graph)
│ Query: mysql_global_status_threads_connected
│
└─ Panel: Connection Pool Usage (gauge)
Query: (mysql_global_status_threads_connected / mysql_global_variables_max_connections) * 100
Threshold: 80%
Row 4: Replication Status
├─ Panel: Replication Lag (graph)
│ Query: mysql_slave_status_seconds_behind_master
│
└─ Panel: Slave IO/SQL Running (single stat)
Query: mysql_slave_status_slave_io_running
Row 5: Memory & Buffer Pool
├─ Panel: Buffer Pool Hit Ratio (gauge)
│ Query: mysql_innodb_buffer_pool_hit_ratio * 100
│
└─ Panel: Temp Tables (counter)
Query: rate(mysql_global_status_created_tmp_disk_tables[5m])
Row 6: InnoDB I/O
└─ Panel: InnoDB Read/Write Ops (graph)
Queries: rate(mysql_innodb_data_reads[5m]), rate(mysql_innodb_data_writes[5m])
4. Performance Baseline Establishment
BASELINE METRICS COLLECTION:
Week 1: Data Collection Phase
Baseline Calculation (After 7 days):
1. Calculate percentiles for each metric:
QPS Baseline:
├─ Off-peak average (2 AM - 8 AM): 150 QPS
├─ Normal (8 AM - 6 PM): 1200 QPS
├─ Peak (6 PM - 10 PM): 3500 QPS
├─ P95 peak: 4000 QPS
└─ P99 peak: 4500 QPS
Alert threshold: P99 + 20% = 5400 QPS
Warning threshold: P95 + 20% = 4800 QPS
Query Latency Baseline:
├─ P50 (median): 5ms
├─ P95: 15ms
├─ P99: 50ms
└─ P100 (max): 800ms
Alert on P99 > 100ms (2x baseline)
Warning on P95 > 30ms (2x baseline)
2. Create baseline profile by time:
Time_Window QPS Latency_P95 Connections CPU
─────────────────────────────────────────────────────
00:00-08:00 150 3ms 5 5%
08:00-12:00 800 8ms 20 25%
12:00-18:00 1200 10ms 30 40%
18:00-22:00 3500 25ms 50 75%
22:00-00:00 500 5ms 15 20%
3. Document baseline anomalies:
Monday mornings: 30% higher QPS (batch processing)
Noon peak: 15% latency increase
End-of-week report run: Custom query spike
Database backups: Brief I/O spike (expected)
Trend Analysis:
Week 1 baseline: QPS average 1000
Week 2 baseline: QPS average 1100 (+10%)
Week 3 baseline: QPS average 1250 (+25%)
Week 4 baseline: QPS average 1350 (+35%)
Trend: +10-15% per week growth
├─ Linear projection: Baseline doubles in 6-7 weeks
├─ Action: Plan scaling before saturation
└─ Monitor growth rate: Alert if > 25% week-over-week
5. Monitoring Best Practices
OPERATIONAL GUIDELINES:
Metric Collection Strategy:
1. Sample Interval Tuning
High-frequency metrics (1s interval):
├─ Replication lag (critical to catch)
├─ Connection count
├─ Active queries
└─ Error rate spikes
Standard metrics (15s interval):
├─ QPS and latency
├─ CPU and memory
├─ Buffer pool stats
└─ I/O operations
Low-frequency metrics (5m interval):
├─ Table sizes
├─ Replication consistency checks
├─ Backup verification
└─ Long-term trend data
2. Metric Retention Policy
Prometheus retention:
├─ 15-day high-resolution (30s)
├─ 90-day medium-resolution (1m)
└─ 1-year low-resolution (1h)
Alerting Best Practices:
1. Alert Fatigue Prevention
Problem: Too many alerts → ignored
Solution:
├─ Alert on root cause, not symptom
├─ Set meaningful thresholds using baseline analysis
├─ Group related alerts
│ If master down → suppress replica-related alerts
└─ Set quiet hours for non-critical
2. On-call Readiness
Create runbooks for top 10 alerts:
Alert: ReplicationLagHigh (> 30 seconds)
Runbook Steps:
├─ 1. Check SHOW SLAVE STATUS
├─ 2. Identify long-running query on master
├─ 3. Kill query or wait for completion
├─ 4. Verify lag returns to normal
├─ 5. If problem persists:
│ - Check network latency
│ - Check replica resource usage
└─ 6. Document root cause
3. Dashboard Review Cadence
Daily review (on-call engineer):
├─ Check for any overnight alerts
└─ Verify key metrics are healthy (5 minutes)
Weekly review (team):
├─ Analyze alert trends
└─ Update thresholds based on data (30 minutes)
Monthly review (operations + engineering):
├─ Capacity planning discussion
└─ Review alert effectiveness (1-2 hours)
6. Practical Monitoring Scenarios
REAL-WORLD MONITORING CASES:
Scenario 1: Replication Lag Spike
─────────────────────────────────
Symptoms:
├─ Alert: ReplicationLagHigh (lag = 45 seconds)
├─ Dashboard: Lag spike visible last 5 minutes
└─ Queries: No obvious slow query spike
Investigation Steps:
1. Check slave status:
SHOW SLAVE STATUS\G
Seconds_Behind_Master: 45
2. Identify long-running query on master:
SHOW PROCESSLIST;
Result: UPDATE large_table SET ... (40 seconds still running)
3. Options:
Option A: Wait for completion (lag resolves automatically)
Option B: KILL QUERY and re-run with better WHERE clause
Resolution: Schedule alert reduction for known batch windows
---
Scenario 2: Connection Pool Exhaustion
──────────────────────────────────────
Symptoms:
├─ Alert: ConnectionPoolExhaustion (490/500)
└─ Application: Getting "too many connections" error
Investigation:
1. Check current connections: SHOW PROCESSLIST;
├─ 200 sleep processes (idle)
└─ 90 unauthenticated connections
2. Immediate action:
-- Kill idle connections
SET GLOBAL wait_timeout = 600; (10 minutes)
-- Immediately closes idle connections
3. Long-term fix:
├─ Use connection pooling (HikariCP)
├─ Set max pool size = 20-50
└─ Monitor pool metrics
---
Scenario 3: Buffer Pool Hit Ratio Degradation
──────────────────────────────────────────────
Symptoms:
├─ Alert: BufferPoolHitRatioLow (91% vs 99% baseline)
└─ Application: Slightly slower queries
Investigation:
1. Check buffer pool:
SHOW ENGINE INNODB STATUS\G
2. Recent query found:
SELECT * FROM analytics_table (4 GB scan, not in buffer pool)
└─ This evicted 4 GB of hot data from pool
Solutions:
├─ Short-term: Kill query, allow warm-up (10 minutes)
├─ Medium-term: Add WHERE clause, create index
└─ Long-term: Separate OLTP and OLAP workloads
---
Scenario 4: Query Latency Increase
──────────────────────────────────
Symptoms:
├─ Alert: P99 latency 180ms (was 50ms)
└─ User complaints: "Site is slow"
Investigation:
1. Find top slow queries:
SELECT * FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC LIMIT 10;
2. Found: SELECT users WHERE status='active' (no index, full scan)
EXPLAIN SELECT ... WHERE status='active'\G
Result: Full table scan on 50M rows
3. Immediate fix:
CREATE INDEX idx_status ON users(status);
Query time: 1000ms → 5ms
4. Prevent recurrence:
├─ Add EXPLAIN to code review
└─ Test with production data volumes
7. Common Monitoring Pitfalls
MISTAKES TO AVOID:
1. Monitoring Wrong Metrics
❌ Alert on CPU > 80%
└─ Better: Alert on query latency/QPS ratio
✅ Alert on response time P99 > 100ms
└─ Slow response = actual user impact
2. Alerting on Correlated Metrics
❌ Alert on both:
├─ high_connections (alert)
└─ high_memory_usage (alert)
✅ Alert on root cause:
└─ connection_pool_exhaustion (one alert)
3. Ignoring Baseline Seasonality
❌ Fixed threshold: QPS > 1000
└─ Misses real issues at different times
✅ Time-based thresholds:
├─ 6 AM: Alert if QPS > 300 (2x baseline)
└─ 8 PM: Alert if QPS > 5500 (1.5x baseline)
4. Alert Storms
❌ Without inhibition rules:
└─ Master fails → 500+ notifications → alert fatigue
✅ With inhibition rules:
└─ Master fails → 1-3 actionable alerts
5. Incomplete Replication Monitoring
❌ Only monitor Seconds_Behind_Master
✅ Monitor both:
├─ Seconds_Behind_Master > 5 seconds
├─ Slave_IO_Running != 'Yes'
└─ Slave_SQL_Running != 'Yes'
6. Not Testing Alert Conditions
❌ Assumption: "Our alerts will catch it"
└─ Reality: Alert misconfigured, never fires
✅ Quarterly drill:
├─ Simulate high QPS: run benchmark
├─ Verify alert fires
└─ Check notification reaches on-call
7. Using Rate() Incorrectly
❌ rate(mysql_questions[1h]) (1 hour window too large)
└─ Smooths out spikes, misses short bursts
✅ rate(mysql_questions[5m]) (5 minute = ideal)
└─ Captures spike patterns without jitter
Conclusion
Effective monitoring is about understanding what matters and alerting only on actionable issues. Build your monitoring progressively: collect data, establish baselines, set meaningful thresholds, and evolve based on real incidents. The best monitoring system is one that catches problems before users notice them, while minimizing alert fatigue for your team.