String and Unicode: UTF-16, Surrogate Pairs, and Code Point Traps
'😀'.length equals 2 in JavaScript, not 1. This reveals the core design of JavaScript strings: they are sequences of UTF-16 code units, not sequences of Unicode characters. That decision dates back to 1995 and still affects every developer working with internationalized text.
🔹 Level 1 · What You Need to Know
JS Strings Are UTF-16, Not UTF-8
// Basic ASCII characters: everything works fine
'hello'.length // 5 (each character is 1 code unit)
'hello'[0] // 'h'
// CJK characters (BMP range, U+4E00-U+9FFF): usually fine
'你好'.length // 2 (each character is 1 code unit)
'你好'[0] // '你'
// Emoji (supplementary planes, U+10000+): this is where it breaks
'😀'.length // 2 (one emoji takes two code units!)
'😀'[0] // '\uD83D' (high surrogate, isolated invalid character!)
'😀'[1] // '\uDE00' (low surrogate, isolated invalid character!)
// Some rarely-used CJK characters also take two code units
'𠀋'.length // 2 (U+2000B, outside BMP)
APIs Affected by UTF-16 Encoding
const str = '😀Hello😀'; // 7 real characters, but length = 9
// ❌ Code-unit-level operations, may break surrogate pairs
str.length // 9 (not 7)
str[0] // '\uD83D' (isolated high surrogate)
str.charAt(0) // '\uD83D' (same)
str.charCodeAt(0) // 55357 (numeric value of high surrogate)
str.slice(0, 1) // '\uD83D' (isolated surrogate!)
str.indexOf('😀') // 0 (but indexed by code unit)
str.substring(1, 3) // '\uDE00H' (emoji broken!)
// ✅ APIs that correctly handle code points
str.codePointAt(0) // 128512 (complete emoji code point)
String.fromCodePoint(128512) // '😀'
[...str] // ['😀', 'H', 'e', 'l', 'l', 'o', '😀'] (7 elements!)
Array.from(str) // ['😀', 'H', 'e', 'l', 'l', 'o', '😀']
[...str].length // 7 (correct!)
// Regex: needs /u flag to handle surrogate pairs correctly
/^.$/u.test('😀') // true (/u treats surrogate pair as one char)
/^.$/.test('😀') // false (without /u, . doesn't span two code units)
Day-to-Day Development Guidelines
// ❌ Don't use .length directly to count characters
function countCodeUnits(str) {
  return str.length; // counts each supplementary-plane character (e.g. emoji) twice
}
// ✅ Use spread or Array.from to count code points
function countCodePoints(str) {
  return [...str].length; // surrogate pairs count once
}
// Or use Intl.Segmenter (most accurate, handles combining characters)
function countGraphemes(str) {
  const segmenter = new Intl.Segmenter(); // default granularity is 'grapheme'
  return [...segmenter.segment(str)].length;
}
// ❌ Slicing strings can break emoji
'😀hello'.slice(0, 1) // '\uD83D' (lone high surrogate: the emoji is cut in half)
// ✅ Safe slicing
function safeSlice(str, maxChars) {
  return [...str].slice(0, maxChars).join('');
}
safeSlice('😀hello', 3) // '😀he'
// ❌ charAt doesn't handle supplementary plane characters
'😀'.charAt(0) // '\uD83D' (wrong)
// ✅ codePointAt + String.fromCodePoint
'😀'.codePointAt(0) // 128512 (correct code point)
String.fromCodePoint(128512) // '😀' (correct character)
🔸 Level 2 · How It Actually Runs
Unicode Fundamentals
Unicode code point range: U+0000 to U+10FFFF
┌────────────────────────────────────────────────────────┐
│ Unicode Code Space Divisions │
│ │
│ U+0000 ~ U+FFFF Basic Multilingual Plane (BMP) │
│ 65,536 code points │
│ Contains: ASCII, most CJK characters, common symbols│
│ UTF-16 encoding: 1 × 16-bit code unit (stored as-is)│
│ │
│ U+10000 ~ U+1FFFF Supplementary Multilingual Plane │
│ U+20000 ~ U+2FFFF Supplementary Ideographic Plane │
│ U+30000 ~ U+3FFFF Tertiary Ideographic Plane │
│ ... │
│ U+E0000 ~ U+EFFFF Plane 14 (tags) │
│ U+F0000 ~ U+FFFFF Plane 15 (Private Use Area A) │
│ U+100000 ~ U+10FFFF Plane 16 (Private Use Area B) │
│ │
│ Supplementary planes: ~1 million code points │
│ UTF-16 encoding: 2 × 16-bit code units (surrogate pair)│
└────────────────────────────────────────────────────────┘
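The plane layout above maps directly onto integer arithmetic: each plane spans 0x10000 code points, so the plane number is the code point shifted right by 16 bits. The following is a minimal sketch; `planeOf` and `utf16UnitsNeeded` are hypothetical helper names, not built-in APIs.

```javascript
// Hypothetical helper: returns the Unicode plane number (0 to 16) of a code point.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError(`Not a Unicode code point: ${codePoint}`);
  }
  return codePoint >> 16; // each plane spans 0x10000 code points
}

// Hypothetical helper: BMP code points need one UTF-16 code unit, all others need two.
function utf16UnitsNeeded(codePoint) {
  return planeOf(codePoint) === 0 ? 1 : 2;
}

planeOf(0x4F60);           // 0  (BMP)
planeOf(0x1F600);          // 1  (Supplementary Multilingual Plane)
planeOf(0x2000B);          // 2  (Supplementary Ideographic Plane)
planeOf(0x10FFFF);         // 16 (Private Use Area B)
utf16UnitsNeeded(0x1F600); // 2
```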
UTF-16 Surrogate Pair Encoding Algorithm
Surrogate pair encoding principle:
High surrogate range: U+D800 to U+DBFF (1,024 values)
Low surrogate range: U+DC00 to U+DFFF (1,024 values)
Encodable supplementary characters: 1,024 × 1,024 = 1,048,576 (fits all supplementary planes)
Encoding algorithm (code point → surrogate pair):
┌────────────────────────────────────────────────────┐
│ Input: code point C (U+10000 ≤ C ≤ U+10FFFF) │
│ │
│ Step 1: C' = C - 0x10000 │
│ C' range: 0x00000 to 0xFFFFF (20 bits) │
│ │
│ Step 2: Split the 20 bits of C': │
│ High 10 bits → H (range 0x000 to 0x3FF) │
│ Low 10 bits → L (range 0x000 to 0x3FF) │
│ │
│ Step 3: Compute surrogate code units: │
│ High surrogate = 0xD800 + H (range D800-DBFF) │
│ Low surrogate = 0xDC00 + L (range DC00-DFFF) │
└────────────────────────────────────────────────────┘
Example: 😀 has code point U+1F600
Step 1: C' = 0x1F600 - 0x10000 = 0xF600
Step 2: H = 0xF600 >> 10 = 0x3D (high 10 bits)
L = 0xF600 & 0x3FF = 0x200 (low 10 bits)
Step 3: High surrogate = 0xD800 + 0x3D = 0xD83D
Low surrogate = 0xDC00 + 0x200 = 0xDE00
Verification:
'😀'.charCodeAt(0).toString(16) // "d83d" (high surrogate)
'😀'.charCodeAt(1).toString(16) // "de00" (low surrogate)
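The three encoding steps can be sketched as a small function. `toSurrogatePair` is a hypothetical name used for illustration; in real code, String.fromCodePoint performs this conversion for you.

```javascript
// Hypothetical helper implementing the encoding steps above.
function toSurrogatePair(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError('Expected a supplementary code point (U+10000 to U+10FFFF)');
  }
  const offset = codePoint - 0x10000;    // Step 1: 20-bit value C'
  const high = 0xD800 + (offset >> 10);  // Steps 2-3: high 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // Steps 2-3: low 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F600);
high.toString(16); // "d83d"
low.toString(16);  // "de00"
String.fromCharCode(high, low) === String.fromCodePoint(0x1F600); // true
```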
Surrogate pair decoding (surrogate pair → code point):
Input: high surrogate H_sur (D800-DBFF), low surrogate L_sur (DC00-DFFF)
Step 1: H = H_sur - 0xD800
Step 2: L = L_sur - 0xDC00
Step 3: C' = (H << 10) | L
Step 4: C = C' + 0x10000
Example: D83D + DE00 → 😀
H = 0xD83D - 0xD800 = 0x3D
L = 0xDE00 - 0xDC00 = 0x200
C' = (0x3D << 10) | 0x200 = 0xF400 | 0x200 = 0xF600
C = 0xF600 + 0x10000 = 0x1F600 = U+1F600 = 😀 ✓
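The decoding steps translate to code just as directly. This is a sketch under the assumption that the caller passes two numeric code unit values; `fromSurrogatePair` is a hypothetical name, and codePointAt does the same work natively.

```javascript
// Hypothetical helper implementing the decoding steps above.
function fromSurrogatePair(high, low) {
  const isHigh = high >= 0xD800 && high <= 0xDBFF;
  const isLow = low >= 0xDC00 && low <= 0xDFFF;
  if (!isHigh || !isLow) {
    throw new RangeError('Expected a high surrogate followed by a low surrogate');
  }
  const h = high - 0xD800;        // Step 1
  const l = low - 0xDC00;         // Step 2
  const offset = (h << 10) | l;   // Step 3: C'
  return offset + 0x10000;        // Step 4: C
}

fromSurrogatePair(0xD83D, 0xDE00).toString(16); // "1f600"
```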
Unicode Normalization (NFC/NFD)
Some characters can be encoded in multiple ways, causing strings that look identical to be unequal:
Two ways to represent "é":
Form 1: Precomposed character (NFC)
U+00E9 (é) ← single code point, one character
Form 2: Base character + combining diacritic (NFD)
U+0065 (e) + U+0301 (´) ← two code points, visually identical
In JavaScript:
'\u00E9' === '\u0065\u0301' // false!
'\u00E9'.length // 1
'\u0065\u0301'.length // 2
// Visually identical, but code thinks they're different!
const str1 = 'café'; // U+00E9 precomposed
const str2 = 'cafe\u0301'; // e + combining character
str1 === str2 // false
str1.length // 4
str2.length // 5
Solution: string normalization:
// The normalize() method
const nfc = str.normalize('NFC'); // canonical precomposed (recommended for storage)
const nfd = str.normalize('NFD'); // canonical decomposed
const nfkc = str.normalize('NFKC'); // compatibility precomposed (normalizes full-width chars, etc.)
const nfkd = str.normalize('NFKD'); // compatibility decomposed
// Normalize before comparing
function normalizedEqual(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}
normalizedEqual('café', 'cafe\u0301') // true
// Also normalize before searching
const text = 'résumé'.normalize('NFC');
const query = 're\u0301sume\u0301'.normalize('NFC');
text.includes(query) // true (correct after normalization)
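To see what each normalization form actually does, it helps to print the code points before and after. The sketch below uses a hypothetical `describeCodePoints` helper; the strings are written with escapes so the form of each input is unambiguous.

```javascript
// Hypothetical helper: lists a string's code points in U+XXXX notation.
function describeCodePoints(str) {
  return Array.from(str, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

describeCodePoints('caf\u00E9');                  // ['U+0063', 'U+0061', 'U+0066', 'U+00E9']
describeCodePoints('caf\u00E9'.normalize('NFD')); // ['U+0063', 'U+0061', 'U+0066', 'U+0065', 'U+0301']

// Compatibility forms also fold characters that only differ in presentation
describeCodePoints('\uFB01'.normalize('NFC'));    // ['U+FB01'] (the "fi" ligature is kept)
describeCodePoints('\uFB01'.normalize('NFKC'));   // ['U+0066', 'U+0069'] (split into "f" + "i")
describeCodePoints('\uFF21'.normalize('NFKC'));   // ['U+0041'] (full-width A becomes ASCII A)
```

Because NFKC and NFKD discard such distinctions, they suit search and matching better than storage of user-authored text.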
🔺 Level 3 · How the Spec Defines It
6.1.4 The String Type
Spec text (ECMA-262, Section 6.1.4):
6.1.4 The String Type
The String type is the set of all ordered sequences of zero or more 16-bit unsigned integer values ("elements") up to a maximum length of 2^53 - 1 elements. The String type is generally used to represent textual data in a running ECMAScript program, in which case each element in the String is treated as a UTF-16 code unit value. Each element is considered to be a value at a position within the sequence; the first element is at index 0, the next at index 1, and so on. The length of a String is the number of elements (i.e., 16-bit values) within it. The empty String has length zero and therefore contains no elements.
Where ECMAScript operations interpret String values, each element is interpreted as a single UTF-16 code unit. However, ECMAScript does not place any restrictions on or requirements for the sequence of code units in a String value, so they may be ill-formed when interpreted as UTF-16 code unit sequences. Operations that do not interpret String contents treat them as sequences of undifferentiated 16-bit unsigned integers.
Key point: The spec explicitly allows "ill-formed UTF-16 sequences" — that is, lone surrogate code units (a high surrogate without a following low surrogate, or vice versa). This means JavaScript can legally store '\uD800' (a lone high surrogate), even though it is not valid Unicode text.
StringToCodePoints Abstract Operation
Spec Section 11.1 (Source Text) defines how to convert a String into a list of code points:
StringToCodePoints ( string )
- Let codePoints be a new empty List.
- Let size be the length of string.
- Let position be 0.
- Repeat, while position < size,
  a. Let cp be CodePointAt(string, position).
  b. Append cp.[[CodePoint]] to codePoints.
  c. Set position to position + cp.[[CodeUnitCount]].
- Return codePoints.
CodePointAt(string, position):
- Let size be the length of string.
- Let first be the numeric value of the code unit at index position within string.
- If first is not a leading surrogate or trailing surrogate, then a. Return the Record { [[CodePoint]]: first, [[CodeUnitCount]]: 1, [[IsUnpairedSurrogate]]: false }.
- If first is a trailing surrogate or position + 1 = size, then a. Return the Record { [[CodePoint]]: first, [[CodeUnitCount]]: 1, [[IsUnpairedSurrogate]]: true }.
- Let second be the numeric value of the code unit at index position + 1 within string.
- If second is not a trailing surrogate, then a. Return the Record { [[CodePoint]]: first, [[CodeUnitCount]]: 1, [[IsUnpairedSurrogate]]: true }.
- Let cp be UTF16SurrogatePairToCodePoint(first, second).
- Return the Record { [[CodePoint]]: cp, [[CodeUnitCount]]: 2, [[IsUnpairedSurrogate]]: false }.
This is why [...str] correctly handles surrogate pairs: the iterator uses the CodePointAt algorithm, consuming 1 or 2 code units per iteration depending on whether a valid surrogate pair is found.
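The two abstract operations can be transcribed into JavaScript almost step for step. This is an illustrative sketch, not engine code: the function names and the plain-object record shape are assumptions made for readability.

```javascript
// Sketch of the spec's CodePointAt(string, position); names are illustrative.
function codePointAtSketch(string, position) {
  const size = string.length;
  const first = string.charCodeAt(position);
  const isLeading = first >= 0xD800 && first <= 0xDBFF;
  const isTrailing = first >= 0xDC00 && first <= 0xDFFF;
  if (!isLeading && !isTrailing) {
    return { codePoint: first, codeUnitCount: 1, isUnpairedSurrogate: false };
  }
  if (isTrailing || position + 1 === size) {
    return { codePoint: first, codeUnitCount: 1, isUnpairedSurrogate: true };
  }
  const second = string.charCodeAt(position + 1);
  if (second < 0xDC00 || second > 0xDFFF) {
    return { codePoint: first, codeUnitCount: 1, isUnpairedSurrogate: true };
  }
  // UTF16SurrogatePairToCodePoint(first, second)
  const cp = ((first - 0xD800) << 10) + (second - 0xDC00) + 0x10000;
  return { codePoint: cp, codeUnitCount: 2, isUnpairedSurrogate: false };
}

// Sketch of StringToCodePoints(string).
function stringToCodePointsSketch(string) {
  const codePoints = [];
  let position = 0;
  while (position < string.length) {
    const cp = codePointAtSketch(string, position);
    codePoints.push(cp.codePoint);
    position += cp.codeUnitCount; // advance by 1 or 2 code units
  }
  return codePoints;
}

stringToCodePointsSketch('a\uD83D\uDE00'); // [97, 128512]
stringToCodePointsSketch('\uD83Dx');       // [55357, 120] (lone surrogate passes through)
```

Note that an unpaired surrogate is not an error here: it is returned as a code point of its own, which is why iteration never throws on ill-formed strings.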
Unicode Normalization in the Spec
Spec Section 22.1.3.12 (String.prototype.normalize):
This method normalizes the code points of the String value according to the form specified by form. The following forms are supported:
- "NFC", the Canonical Decomposition, followed by Canonical Composition.
- "NFD", the Canonical Decomposition.
- "NFKC", the Compatibility Decomposition, followed by Canonical Composition.
- "NFKD", the Compatibility Decomposition.
💎 Level 4 · Edge Cases and Traps
Trap 1: '😀'.length === 2
// Full surrogate pair demonstration
const emoji = '😀';
// length is the number of code units, not characters
emoji.length // 2
// Accessing individual code units (isolated surrogates — invalid Unicode!)
emoji[0] // '\uD83D' (high surrogate, U+D83D)
emoji[1] // '\uDE00' (low surrogate, U+DE00)
// charCodeAt returns the numeric value of a code unit
emoji.charCodeAt(0) // 55357 (= 0xD83D)
emoji.charCodeAt(1) // 56832 (= 0xDE00)
// codePointAt correctly handles surrogate pairs
emoji.codePointAt(0) // 128512 (= 0x1F600, the correct emoji code point)
emoji.codePointAt(1) // 56832 (low surrogate value — but this is the "wrong" position)
// Reconstructing the character
String.fromCodePoint(128512) // '😀' (correct)
String.fromCharCode(55357, 56832) // '😀' (also correct — manually providing the surrogate pair)
// Real-world: counting emoji correctly
function countRealChars(str) {
  return [...str].length; // spread uses the iterator, handles surrogate pairs
}
countRealChars('😀😂🎉') // 3 (three emoji)
'😀😂🎉'.length // 6 (six code units)
Trap 2: '😀'[0] Yields a Lone High Surrogate
// The danger of lone surrogates
const half = '😀'[0]; // '\uD83D', a lone high surrogate
// Lone surrogates cause problems in some operations
half.length // 1 (looks like 1 character)
encodeURIComponent(half) // throws URIError: a lone surrogate cannot be encoded as UTF-8
JSON.stringify(half) // '"\\ud83d"' (ES2019+: emitted as the escape sequence \ud83d)
// A string containing lone surrogates is ill-formed UTF-16. Only lenient encodings
// such as WTF-8 ("Wobbly Transformation Format, 8-bit") can represent it.
// In practice, such strings can cause:
// 1. Garbled text when transferred across language boundaries
// 2. Invalid JSON output in pre-ES2019 engines, which emitted the raw surrogate
// 3. Silent data loss with TextEncoder, which replaces each lone surrogate with U+FFFD
// ES2019+ JSON.stringify improvement: escape lone surrogates
JSON.stringify('\uD800') // '"\\ud800"' (escaped; pre-ES2019 engines emitted the raw lone surrogate)
// Detecting lone surrogates
function hasLoneSurrogates(str) {
  return /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/u.test(str);
}
hasLoneSurrogates('😀') // false (complete surrogate pair)
hasLoneSurrogates('\uD800') // true (lone high surrogate)
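Newer engines (ES2024) ship String.prototype.isWellFormed and String.prototype.toWellFormed for exactly this check. The sketch below prefers them when present and otherwise falls back to a /u regex, relying on the fact that in Unicode mode a surrogate character class only matches unpaired surrogates. The wrapper names `isWellFormedString` and `toWellFormedString` are hypothetical.

```javascript
// Hypothetical wrapper: true when the string contains no lone surrogates.
function isWellFormedString(str) {
  if (typeof str.isWellFormed === 'function') return str.isWellFormed();
  return !/[\uD800-\uDFFF]/u.test(str); // with /u, valid pairs are read as one code point
}

// Hypothetical wrapper: replaces each lone surrogate with U+FFFD.
function toWellFormedString(str) {
  if (typeof str.toWellFormed === 'function') return str.toWellFormed();
  return str.replace(/[\uD800-\uDFFF]/gu, '\uFFFD');
}

isWellFormedString('\uD83D\uDE00');               // true  (complete pair)
isWellFormedString('a\uD83D');                    // false (lone high surrogate)
toWellFormedString('a\uD83Db') === 'a\uFFFDb';    // true
encodeURIComponent(toWellFormedString('\uD83D')); // '%EF%BF%BD' (no URIError)

// TextEncoder applies the same U+FFFD replacement silently
Array.from(new TextEncoder().encode('\uD83D'));   // [239, 191, 189]
```

Sanitizing at system boundaries (before encodeURIComponent, network transfer, or storage) keeps ill-formed strings from surfacing as exceptions or corrupted data later.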
Trap 3: Regex . Doesn't Match Surrogate Pairs Without the /u Flag
// Without /u, . matches a single code unit
/^.$/.test('😀') // false! (😀 is two code units)
/^..$/.test('😀') // true (two code units, one . each)
// With /u, . matches a full code point (including surrogate pairs)
/^.$/u.test('😀') // true! (/u treats surrogate pair as one character)
/^..$/u.test('😀') // false (with /u, the emoji is a single character)
// Character classes are also affected
/[\uD800-\uDFFF]/.test('😀') // true (matches first code unit D83D)
/[\u{1F600}]/u.test('😀') // true (Unicode escape with /u flag)
// Real-world impact: form validation
// ❌ Wrong: without /u the limit counts code units, so valid emoji input is rejected
const maxLen = 10;
const regex = new RegExp(`^.{0,${maxLen}}$`);
regex.test('😀'.repeat(10)) // false (10 emoji = 20 code units, so input within the limit is rejected)
// ✅ Correct: add the /u flag so the limit counts code points
const regexU = new RegExp(`^.{0,${maxLen}}$`, 'u');
regexU.test('😀'.repeat(10)) // true (10 emoji = 10 code points, at the limit)
regexU.test('😀'.repeat(11)) // false (11 code points, over the limit)
Trap 4: 'café' === 'café' Can Be false
// NFC vs NFD normalization difference
const nfc = '\u00E9'; // é (precomposed, NFC)
const nfd = 'e\u0301'; // é (base letter + combining diacritic, NFD)
nfc === nfd // false! (different code point sequences)
nfc.length // 1
nfd.length // 2
// They look completely identical, but the code sees different strings
console.log(nfc); // é
console.log(nfd); // é (looks the same)
nfc === nfd // false
// Where does this problem appear?
// 1. macOS file systems (HFS+) tend to use NFD
// 2. Merging text from different data sources (APIs, databases, user input)
// 3. Strings passed between libraries written in different languages
// A real bug
const userInput = 'café'; // from user (NFC)
const dbRecord = 'cafe\u0301'; // from database (NFD)
userInput === dbRecord // false! Search fails!
userInput.includes(dbRecord) // false! Text matching fails!
// ✅ Fix: normalize before comparing
userInput.normalize('NFC') === dbRecord.normalize('NFC') // true
Trap 5: Intl.Segmenter for Complex Emoji
Some emoji are sequences of several code points joined by zero-width joiners (ZWJ), which makes correct character counting even more complex:
// Family emoji: multiple code points combined
const family = '👨\u200D👩\u200D👧\u200D👦'; // ZWJs written as \u200D so they survive copy and paste
family.length // 11 (4 surrogate pairs + 3 ZWJ joiners)
[...family].length // 7 (7 code points, but visually 1 character!)
// ZWJ (Zero Width Joiner, U+200D) connects multiple emoji into one
// 👨 + ZWJ + 👩 + ZWJ + 👧 + ZWJ + 👦 = 👨‍👩‍👧‍👦
// Intl.Segmenter is the correct way to handle grapheme clusters
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
function countGraphemes(str) {
  return [...segmenter.segment(str)].length;
}
countGraphemes('😀') // 1 (single emoji)
countGraphemes(family) // 1 (family emoji, 1 visual character)
countGraphemes('café') // 4 (4 graphemes)
countGraphemes('e\u0301') // 1 (e + combining diacritic = 1 grapheme)
// Input box character limit (based on user-perceived characters)
function limitInput(str, maxGraphemes) {
  const segments = [...segmenter.segment(str)];
  if (segments.length <= maxGraphemes) return str;
  return segments.slice(0, maxGraphemes).map(s => s.segment).join('');
}
limitInput('😀😂🎉👨👩👧👦', 2) // '😀😂' (correctly slices 2 user-perceived chars)
Trap 6: String.prototype.normalize in Internationalized Search
// A correct internationalized search implementation
function searchText(text, query, locale = 'en') {
  // 1. Normalize (unify NFC/NFD)
  const normalizedText = text.normalize('NFC');
  const normalizedQuery = query.normalize('NFC');
  // 2. Case-insensitive search
  const lowerText = normalizedText.toLocaleLowerCase(locale);
  const lowerQuery = normalizedQuery.toLocaleLowerCase(locale);
  return lowerText.includes(lowerQuery);
}
searchText('Résumé', 'RÉSUMÉ') // true (normalized + case-insensitive)
searchText('cafe\u0301', 'café') // true (NFD and NFC of same content)
// Even more complete: use Intl.Collator
const collator = new Intl.Collator('en', {
  sensitivity: 'base', // ignore case and diacritics
  usage: 'search'
});
function collatorSearch(text, query) {
  // Normalize first so both sides use the same number of code units per character
  const t = text.normalize('NFC');
  const q = query.normalize('NFC');
  for (let i = 0; i <= t.length - q.length; i++) {
    if (collator.compare(t.slice(i, i + q.length), q) === 0) return true;
  }
  return false;
}
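The loop above slides a window measured in code units, which can cut through a surrogate pair or a combining sequence. One possible refinement, sketched here under the assumption that Intl.Segmenter is available, is to window over grapheme clusters instead. `graphemeSearch`, `toGraphemes`, `baseCollator`, and `graphemeSegmenter` are hypothetical names.

```javascript
const baseCollator = new Intl.Collator('en', { sensitivity: 'base', usage: 'search' });
const graphemeSegmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });

// Hypothetical helper: splits a string into user-perceived characters.
function toGraphemes(str) {
  return Array.from(graphemeSegmenter.segment(str), (s) => s.segment);
}

// Hypothetical search: compares grapheme-aligned windows with the collator.
function graphemeSearch(text, query) {
  const textGraphemes = toGraphemes(text);
  const queryLength = toGraphemes(query).length;
  for (let i = 0; i + queryLength <= textGraphemes.length; i++) {
    const candidate = textGraphemes.slice(i, i + queryLength).join('');
    if (baseCollator.compare(candidate, query) === 0) return true;
  }
  return false;
}

graphemeSearch('Un re\u0301sume\u0301 court', 'RESUME'); // true (decomposed text, unaccented uppercase query)
graphemeSearch('Un r\u00E9sum\u00E9 court', 'summer');   // false
```

This still assumes that a match spans the same number of graphemes as the query, which does not hold for every language (for example, expansions such as "ß" versus "ss"), so treat it as a starting point rather than a complete solution.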
Chapter Summary
- JavaScript strings are sequences of UTF-16 code units, not Unicode code points: length returns the number of code units (16-bit integers), and str[i] accesses the i-th code unit. Characters in the supplementary planes (U+10000 and above, such as emoji) occupy two code units, which is why '😀'.length === 2.
- Surrogate pairs are UTF-16's mechanism for encoding supplementary-plane characters: a high surrogate (U+D800-U+DBFF) paired with a low surrogate (U+DC00-U+DFFF) represents one code point. Indexing with str[0] may yield a lone high surrogate, which is not valid Unicode text and causes problems in JSON serialization, URI encoding, and more.
- Handling emoji requires code-point-level APIs: use str.codePointAt(), String.fromCodePoint(), [...str] (spread uses the iterator, which correctly handles surrogate pairs), and the /u flag for regex (so that . matches a full code point). Use [...str].length to count code points, not str.length.
- Unicode normalization (NFC/NFD) affects string equality: 'é' (U+00E9) and 'e\u0301' (e + combining diacritic) are visually identical, but === returns false and their lengths differ (1 vs 2). When comparing strings across systems, always call .normalize('NFC') first to unify the form.
- Intl.Segmenter is the reliable built-in way to count user-perceived characters: the family emoji 👨‍👩‍👧‍👦 consists of 7 code points (11 code units) but is visually 1 "character" (grapheme cluster). [...str].length returns 7; only Intl.Segmenter returns 1. Use Intl.Segmenter when limiting user input length or truncating displayed strings.