Charset & Collation
MySQL Charset & Collation
Charset and collation are among the most overlooked yet deeply impactful MySQL configurations. A charset mismatch in a JOIN can silently kill indexes; a wrong collation produces surprising case-sensitivity behavior. This chapter clears up every aspect of this topic.
Production incident warning: New system using utf8mb4_0900_ai_ci, old system using utf8mb4_general_ci — JOIN triggered implicit conversion, indexes failed, QPS dropped from 20K to 200, cascading service outage. Unify charset at table creation; fixing it later is extremely costly.
1. Charset Basics
MySQL allows charset configuration at 4 levels (higher overrides lower): server → database → table → column. Connection charset (SET NAMES utf8mb4) controls encoding between client and server.
2. utf8 vs utf8mb4: A Historical Trap
MySQL's utf8 is NOT real UTF-8! MySQL's
utf8stores max 3 bytes, covering only the Basic Multilingual Plane — it cannot store 4-byte characters (Emojis, rare CJK characters). Real UTF-8 is 4 bytes max, which MySQL callsutf8mb4(mb4 = most bytes 4). Always use utf8mb4.
| MySQL utf8 | MySQL utf8mb4 | |
|---|---|---|
| Max bytes/char | 3 | 4 |
| Emoji support 😀 | ❌ truncated or error | ✅ |
| Rare CJK (𠮷) | ❌ | ✅ |
3. Collation Deep Dive
Collation defines string comparison and sort order — case sensitivity, accent sensitivity, and sort sequence.
| Collation | Case | Accent | MySQL 8.0 Default |
|---|---|---|---|
| utf8mb4_general_ci | insensitive | insensitive | No (old default) |
| utf8mb4_0900_ai_ci | insensitive | insensitive | Yes |
| utf8mb4_0900_as_cs | sensitive | sensitive | No |
| utf8mb4_bin | sensitive (binary) | sensitive | No |
Use _ci for usernames (case-insensitive), _bin for passwords/tokens (strict binary comparison).
4. Impact on Indexes and JOINs
JOINing columns with different collations triggers implicit conversion, which kills indexes. Always unify collation across joined columns. Use COLLATE clause in the JOIN condition as a temporary fix, or ALTER the tables to match collation.
5. Emoji Storage
Emojis require 4 bytes — only stored correctly with utf8mb4. Using utf8 causes Incorrect string value errors. Note: compound Emojis (👨👩👧👦 family) use ZWJ sequences and have CHAR_LENGTH > 1.
6. Charset Migration (utf8 → utf8mb4)
-- Convert entire table (use Online DDL tool for large tables)
ALTER TABLE users
CONVERT TO CHARACTER SET utf8mb4
COLLATE utf8mb4_0900_ai_ci;
-- Check remaining non-utf8mb4 tables
SELECT TABLE_NAME, TABLE_COLLATION FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'mydb'
AND TABLE_COLLATION NOT LIKE 'utf8mb4%';
7. Best Practices
- New MySQL 8.0+ systems: utf8mb4 + utf8mb4_0900_ai_ci (server default)
- Password/token columns: utf8mb4_bin (case-sensitive binary)
- Cross-system JOINs: unify collation or use COLLATE clause
- Always SET NAMES utf8mb4 in connection string
- Large table conversion: use pt-online-schema-change or gh-ost, not raw ALTER