Chapter 40

Charset & Collation

MySQL Charset & Collation

Charset and collation are among the most overlooked yet deeply impactful MySQL configurations. A charset mismatch in a JOIN can silently kill indexes; a wrong collation produces surprising case-sensitivity behavior. This chapter clears up every aspect of this topic.

Production incident warning: New system using utf8mb4_0900_ai_ci, old system using utf8mb4_general_ci — JOIN triggered implicit conversion, indexes failed, QPS dropped from 20K to 200, cascading service outage. Unify charset at table creation; fixing it later is extremely costly.

1. Charset Basics

MySQL allows charset configuration at 4 levels (higher overrides lower): server → database → table → column. Connection charset (SET NAMES utf8mb4) controls encoding between client and server.

2. utf8 vs utf8mb4: A Historical Trap

MySQL's utf8 is NOT real UTF-8! MySQL's utf8 stores max 3 bytes, covering only the Basic Multilingual Plane — it cannot store 4-byte characters (Emojis, rare CJK characters). Real UTF-8 is 4 bytes max, which MySQL calls utf8mb4 (mb4 = most bytes 4). Always use utf8mb4.

MySQL utf8 MySQL utf8mb4
Max bytes/char 3 4
Emoji support 😀 ❌ truncated or error
Rare CJK (𠮷)

3. Collation Deep Dive

Collation defines string comparison and sort order — case sensitivity, accent sensitivity, and sort sequence.

Collation Case Accent MySQL 8.0 Default
utf8mb4_general_ci insensitive insensitive No (old default)
utf8mb4_0900_ai_ci insensitive insensitive Yes
utf8mb4_0900_as_cs sensitive sensitive No
utf8mb4_bin sensitive (binary) sensitive No

Use _ci for usernames (case-insensitive), _bin for passwords/tokens (strict binary comparison).

4. Impact on Indexes and JOINs

JOINing columns with different collations triggers implicit conversion, which kills indexes. Always unify collation across joined columns. Use COLLATE clause in the JOIN condition as a temporary fix, or ALTER the tables to match collation.

5. Emoji Storage

Emojis require 4 bytes — only stored correctly with utf8mb4. Using utf8 causes Incorrect string value errors. Note: compound Emojis (👨‍👩‍👧‍👦 family) use ZWJ sequences and have CHAR_LENGTH > 1.

6. Charset Migration (utf8 → utf8mb4)

-- Convert entire table (use Online DDL tool for large tables)
ALTER TABLE users
  CONVERT TO CHARACTER SET utf8mb4
  COLLATE utf8mb4_0900_ai_ci;

-- Check remaining non-utf8mb4 tables
SELECT TABLE_NAME, TABLE_COLLATION FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'mydb'
  AND TABLE_COLLATION NOT LIKE 'utf8mb4%';

7. Best Practices

Rate this chapter
4.5  / 5  (3 ratings)

💬 Comments