Chapter 25

Troubleshooting Handbook

MySQL Troubleshooting and Diagnosis Guide

Systematic troubleshooting saves hours of debugging. This guide covers diagnostic procedures, common errors, and resolution strategies.

1. Performance Bottleneck Diagnosis

1.1 Systematic Diagnosis Workflow


PERFORMANCE DIAGNOSIS FLOW:

Problem: "Database is slow"

Step 1: Verify the issue is real
SHOW STATUS LIKE 'Questions';  -- Get baseline QPS
SHOW PROCESSLIST;  -- Are there slow queries running?
SELECT * FROM slow_log ORDER BY start_time DESC LIMIT 5;

Step 2: Check resource utilization
SHOW ENGINE INNODB STATUS\G  -- CPU, I/O, locks
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW STATUS LIKE 'Threads_%';

Step 3: Identify slow queries
SELECT * FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC LIMIT 10;

-- Get full query
SELECT DIGEST_TEXT FROM performance_schema.events_statements_summary_by_digest
WHERE DIGEST = 'hex_digest_from_above';

Step 4: Analyze slow query
EXPLAIN FORMAT=JSON SELECT ...;  -- Detailed execution plan

Step 5: Check for locking/blocking
SELECT * FROM performance_schema.data_locks;
SELECT * FROM performance_schema.table_io_waits_summary_by_table
WHERE COUNT_STAR > 1000;

ROOT CAUSE CATEGORIES:

1. Query-level (70% of issues)
   ├─ Missing index
   ├─ Full table scan
   ├─ Poor JOIN order
   └─ Subquery not optimized

2. Server-level (20%)
   ├─ Buffer pool too small
   ├─ Not enough CPU
   ├─ Disk I/O bottleneck
   └─ Memory pressure

3. Application-level (10%)
   ├─ Connection pool misconfigured
   ├─ N+1 query problem
   ├─ Redundant queries
   └─ Batch size too large

1.2 Common Slow Query Patterns


PATTERN 1: Missing Index

Query:
SELECT * FROM users WHERE email = '[email protected]';

EXPLAIN shows: type=ALL (full table scan)

Fix:
CREATE INDEX idx_email ON users(email);

PATTERN 2: ORDER BY Not Using Index

Query:
SELECT id, name FROM users ORDER BY created_at DESC LIMIT 10;

Issue: File sort (slow for large tables)

Fix: Create covering index
CREATE INDEX idx_created ON users(created_at, id, name);

PATTERN 3: N+1 Query Problem

Bad code:
users = getAllUsers();  -- 1 query, returns 1000 users
for (user in users) {
    user.orders = getOrdersByUser(user.id);  -- 1000 queries!
}

Solution A: JOIN
SELECT u.*, o.* FROM users u LEFT JOIN orders o ON u.id = o.user_id;

Solution B: Batch query
SELECT * FROM orders WHERE user_id IN (list of IDs);

PATTERN 4: Large JOIN with Wrong Column Type

Bad:
CREATE TABLE users (id INT);
CREATE TABLE logs (user_id VARCHAR(50));  -- Wrong type!

SELECT * FROM users u JOIN logs l ON u.id = l.user_id;
-- Type coercion forces full scan

Fix: Match column types
ALTER TABLE logs MODIFY user_id INT;

PATTERN 5: Subquery Not Optimized

Bad:
SELECT * FROM orders
WHERE user_id IN (
  SELECT id FROM users WHERE status = 'active'
);

Better:
SELECT o.* FROM orders o
JOIN users u ON o.user_id = u.id
WHERE u.status = 'active';

PATTERN 6: LIKE With Leading Wildcard

Bad:
SELECT * FROM users WHERE email LIKE '%@example.com';
-- Cannot use index (full scan)

Better:
SELECT * FROM users WHERE email LIKE '[email protected]';
-- Can use index

2. Lock and Deadlock Troubleshooting


DETECTING DEADLOCKS:

View deadlock log:
SHOW ENGINE INNODB STATUS\G | grep -A 20 "DEADLOCK"

Output:
--- Last deadlock time 2024-04-24 15:30:45 ---

Transaction A: UPDATE orders SET status='shipped' WHERE id=1
  Holds lock on: orders row 1
  Waits for: orders row 2

Transaction B: UPDATE orders SET status='paid' WHERE id=2
  Holds lock on: orders row 2
  Waits for: orders row 1

Result: Deadlock detected, Transaction B rolled back

PREVENTING DEADLOCKS:

1. Lock ordering (most effective)

Bad (can deadlock):
Transaction A: Lock table1, then table2
Transaction B: Lock table2, then table1

Good (no deadlock):
All transactions lock in same order: table1, then table2

2. Transaction isolation level tuning

For deadlock-prone workload:
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
START TRANSACTION;
...
COMMIT;

3. Timeout and retry

SET innodb_lock_wait_timeout = 5;  -- Seconds before timeout

Application code:
try {
  START TRANSACTION;
  UPDATE orders SET status='shipped' WHERE id = order_id;
  COMMIT;
} catch (DeadlockException e) {
  // Retry up to 3 times
  sleep(random(1,3) seconds);  // Backoff
  retry();
}

LOCK WAIT DIAGNOSIS:

View current locks:
SELECT * FROM performance_schema.data_locks;

Session waiting for lock:
SELECT * FROM performance_schema.data_lock_waits;

Kill blocking session (if necessary):
KILL 12345;  -- Session ID from SHOW PROCESSLIST

3. Memory and Resource Issues


OUT OF MEMORY (OOM) ERROR:

Cause 1: Buffer pool too large
-- System has 16GB, innodb_buffer_pool_size = 20GB → OOM

Fix:
SET GLOBAL innodb_buffer_pool_size = 12GB;

Cause 2: Large temporary table in memory
Solution:
-- Increase tmp_table_size
SET GLOBAL tmp_table_size = 1GB;
SET GLOBAL max_heap_table_size = 1GB;

-- Or rewrite query to reduce temp table size

Cause 3: Query result set too large
SELECT * FROM big_table;  -- All 100M rows loaded into memory

Fix: Use LIMIT and pagination
SELECT * FROM big_table LIMIT 0, 1000;

MONITORING MEMORY:

Check memory usage by thread:
SELECT
  THREAD_ID,
  SUM(CURRENT_NUMBER_OF_BYTES_USED) as bytes_used
FROM performance_schema.memory_summary_by_thread_by_event_name
GROUP BY THREAD_ID
ORDER BY bytes_used DESC;

If current_bytes keeps growing:
├─ Memory leak suspected
├─ Check table cache size
├─ Restart MySQL to reclaim memory
└─ Investigate application connection pool

4. Common Error Codes and Solutions


ERROR 1040: Too many connections
Cause: max_connections limit reached
Solution: SHOW VARIABLES LIKE 'max_connections';
         SET GLOBAL max_connections = 500;
         Or kill idle connections: KILL ;

ERROR 1090: You are using SafeUpdates mode
Cause: MySQL running with safe update mode
Solution: SET SQL_SAFE_UPDATES = 0;

ERROR 1317: Query interrupted
Cause: User cancelled query or timeout
Solution: SHOW VARIABLES LIKE 'connect_timeout';
          SET GLOBAL connect_timeout = 60;

ERROR 1205: Lock wait timeout exceeded
Cause: Transaction waiting too long for lock
Solution: SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';
          SET GLOBAL innodb_lock_wait_timeout = 60;
          Analyze blocking query (see lock troubleshooting section)

ERROR 1366: Incorrect integer value
Cause: Inserting string into INT column
Solution: CHECK DATA TYPE BEFORE INSERT

ERROR 1451: Cannot delete or update (FK constraint)
Cause: Foreign key constraint prevents deletion
Solution: SET FOREIGN_KEY_CHECKS=0;  (temporary for maintenance)
          Or fix the constraint relationship

ERROR 2006: MySQL has gone away
Cause: Connection closed unexpectedly
Solution: Implement connection retry logic
          Increase wait_timeout: SET GLOBAL wait_timeout = 3600;

ERROR 2013: Lost connection during query
Cause: Network timeout or server restart
Solution: Implement connection pool with retry
          Increase read_timeout in connection string

DISK FULL ERROR (No space left on device)
Cause: Disk 100% utilization
Solution: Free up disk space immediately
          PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 7 DAY);

5. Best Practices for Troubleshooting


DOCUMENTATION:

Maintain a runbook for each production error:

ERROR: Slow replication lag (> 30 seconds)
├─ Symptoms: Lag increasing, users see stale data
├─ Root cause: Long-running transaction on master
├─ Detection: SHOW SLAVE STATUS; check Seconds_Behind_Master
├─ Resolution steps:
│  1. Check SHOW PROCESSLIST; on master
│  2. Identify long query (usually SELECT or batch UPDATE)
│  3. Either wait for completion or KILL QUERY if safe
│  4. Verify lag returns to <1 second
├─ Prevention: Monitor query execution time
└─ Escalation: If lag > 60 seconds, page DBA immediately

TESTING PROCEDURES:

Before applying fixes to production:
1. Reproduce in staging
2. Document expected impact (latency, downtime)
3. Have rollback plan ready
4. Get approval from team lead
5. Execute during maintenance window
6. Monitor metrics closely
7. Document what was fixed

COMMUNICATION:

Status update template:
[HH:MM] Issue detected: Slow queries (P99 > 500ms)
[HH:MM] Diagnosis: Missing index on users.email
[HH:MM] Fix applied: CREATE INDEX idx_email...
[HH:MM] Verification: P99 latency returned to 20ms
[HH:MM] Root cause post-mortem: Added to code review checklist

TOOLS FOR TROUBLESHOOTING:

Percona Toolkit (pt-query-advisor):
pt-query-advisor /var/log/mysql/slow.log  -- Get recommendations

MySQL Workbench:
-- Visual EXPLAIN output
-- Connection management
-- Schema comparison

Performance Schema queries:
SELECT * FROM performance_schema.global_status;
SELECT * FROM performance_schema.events_statements_summary_global_by_event_name;
SELECT * FROM performance_schema.data_locks;

Rate this chapter
4.7  / 5  (6 ratings)

💬 Comments