Chapter 30

Key Design and Data Modeling: Avoiding All Common Pitfalls


30.1 Why Key Design Matters

In Redis, a key is not just an identifier. Its length affects memory consumption; its structure affects readability and operability; its slot determines which cluster node owns it; its naming convention determines whether you can efficiently scan, monitor, and maintain the system at scale.

Poor key design is insidious: it looks harmless at 10K keys and becomes catastrophic at 10 billion. This chapter covers five dimensions of production key design (naming, serialization, hot keys, big keys, and TTL strategy), then applies them to data modeling and day-to-day keyspace operations.

30.2 Key Naming Conventions

30.2.1 Hierarchical Naming

The broadly accepted convention is domain:entity_type:id[:subfield]:

# Recommended patterns
user:1000:profile           # User 1000's profile (Hash)
user:1000:orders            # User 1000's order list (List or ZSet)
order:20240501:12345        # Specific order (String or Hash)
cache:product:SKU-9527      # Product cache (String)
lock:payment:order-12345    # Distributed lock (String)
rate:api:user:1000:v2       # Rate limiter counter (String)
leaderboard:game:101:2024W20  # Weekly leaderboard (ZSet)
session:abc123def456        # Session data (Hash)

Separator choice:

: is the de facto Redis community standard. Tools like RedisInsight parse : to build a tree view of your keyspace
_ is an acceptable alternative in codebases where colon conflicts with language conventions
Never mix: pick one separator and enforce it project-wide

30.2.2 Key Length

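# Approximate memory cost of a key in Redis internals (64-bit build; exact figures vary by version and allocator)
# - dictEntry struct:      24 bytes in Redis up to 7.0 (key, value, and next pointers), rounded up by the allocator
# - Hash table bucket:     one 8-byte pointer per bucket
# - SDS (key string):      small header (3 bytes for keys under 256 bytes) + key content + terminating NUL
# - Total overhead:        tens of bytes per key before the value itself, more if the key has a TTL
#                          (a second entry in the expires dict)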

# Too long — redundant verbosity
user_profile_data_for_the_user_with_id_12345_and_type_premium  # 61 chars

# Too short — unreadable
u:1:p   # What does this mean without documentation?

# Good — clear hierarchy, reasonable length
user:12345:profile   # 18 chars, self-documenting

Guidelines:

Target < 60 bytes; anything over 100 bytes needs justification
Don't sacrifice readability for brevity
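At scale: 100 million keys × ~90 bytes of per-key overhead (an illustrative figure) = 9 GB just for key metadata. That fixed overhead does not shrink with key length, but the key bytes add on top of it: trimming 20 bytes from each of those keys saves about 2 GB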
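The scale arithmetic above can be reproduced with a small helper. This is a rough model under stated assumptions: a flat per-key overhead (the illustrative 90 bytes, not a measured Redis constant) plus the key's own bytes, ignoring values, allocator rounding, and TTL bookkeeping. The function names are hypothetical.

```python
def estimate_key_memory_bytes(num_keys: int, avg_key_len: int,
                              per_key_overhead: int = 90) -> int:
    """Rough key-metadata estimate: fixed overhead plus key bytes, per key.

    per_key_overhead=90 is an illustrative assumption, not a Redis constant;
    values, allocator rounding, and TTL entries are ignored.
    """
    return num_keys * (per_key_overhead + avg_key_len)


def savings_from_shorter_keys(num_keys: int, old_len: int, new_len: int) -> int:
    """Bytes saved by shortening every key from old_len to new_len bytes."""
    return num_keys * (old_len - new_len)


if __name__ == "__main__":
    n = 100_000_000
    # A 60-byte key (the guideline limit) vs. an 18-byte hierarchical key
    print(estimate_key_memory_bytes(n, 60) / 1e9, "GB with 60-byte keys")
    print(estimate_key_memory_bytes(n, 18) / 1e9, "GB with 18-byte keys")
    print(savings_from_shorter_keys(n, 60, 18) / 1e9, "GB saved")
```

Under these assumptions, shortening 100 million keys from 60 to 18 bytes saves about 4.2 GB, while the 9 GB of fixed overhead stays the same.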

30.2.3 Special Characters and Encoding

# Redis keys support arbitrary binary content, but avoid:
# - Spaces:        must be quoted in redis-cli and are easy to mishandle in shell scripts
# - Newlines:      invisible in logs and monitoring tools
# - Control chars: difficult to debug

# Non-ASCII keys (valid, but not recommended)
SET 用户:1000:档案 value  # Legal; each CJK character = 3 bytes, making keys longer
# Prefer ASCII with numeric IDs

# Hash tags in Cluster mode
# The substring inside {} determines the hash slot
SET {user:1000}:profile value
SET {user:1000}:session value
# Both keys land on the same slot, so they can be used together in multi-key
# commands (e.g. MGET), MULTI/EXEC transactions, and Lua scripts
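To see why the two keys above share a slot, here is a sketch of the slot calculation described in the Redis Cluster specification: CRC16 (the XMODEM variant) of the key, modulo 16384, where only the substring inside the first non-empty `{...}` is hashed. The function names are illustrative; cluster-aware clients do this internally.

```python
HASH_SLOTS = 16384  # Redis Cluster has 16384 hash slots


def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def hash_tag(key: bytes) -> bytes:
    """Return the part of the key that is actually hashed.

    If the key contains '{' followed later by '}', with at least one byte
    between them, only that substring is hashed; otherwise the whole key is.
    """
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            return key[start + 1:end]
    return key


def key_hash_slot(key: str) -> int:
    """Slot number (0-16383) that owns this key."""
    return crc16_xmodem(hash_tag(key.encode("utf-8"))) % HASH_SLOTS


if __name__ == "__main__":
    for k in ("{user:1000}:profile", "{user:1000}:session", "user:1000:profile"):
        print(k, key_hash_slot(k))
```

The two tagged keys print the same slot, because both hash only `user:1000`. Note the edge cases from the spec: `foo{}{bar}` hashes the whole key (the first braces are empty), and `foo{{bar}}zap` hashes `{bar`.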

30.2.4 Namespace Planning for Multi-Tenant Redis

When multiple systems share a Redis instance, prefix isolation is mandatory:

ecommerce:product:123      # E-commerce system
crm:user:456               # CRM system
bi:report:cache:20240501   # BI cache
analytics:event:20240501   # Analytics system

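Better architecture: separate Redis instances per domain. The databases setting (by default 16 logical databases, numbered 0–15) is often misused as namespacing, but all databases share one instance's CPU, memory limit, and metrics: there is no performance isolation, no independent maxmemory, and no separate memory or latency monitoring: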

# Avoid using SELECT 1, SELECT 2 for namespace isolation
# Prefer: one Redis instance per major business domain
# Or: Redis Cluster with dedicated key prefix per service
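One way to enforce a per-service prefix, together with the separator and character rules from 30.2, is to build keys only through a helper. This is a minimal sketch: `make_key`, the 100-byte limit, and the forbidden-character rule are assumptions drawn from the guidelines above, not a Redis API.

```python
import re

MAX_KEY_BYTES = 100          # "anything over 100 bytes needs justification"
SEPARATOR = ":"
_FORBIDDEN = re.compile(r"[\s\x00-\x1f\x7f]")  # whitespace and control characters


def make_key(service: str, *parts: object) -> str:
    """Build 'service:part1:part2...' and reject risky components."""
    components = [service, *(str(p) for p in parts)]
    for c in components:
        if not c:
            raise ValueError("empty key component")
        if SEPARATOR in c:
            # Prevents an ID like "a:b" from silently adding a hierarchy level
            raise ValueError(f"component {c!r} contains the separator")
        if _FORBIDDEN.search(c):
            raise ValueError(f"component {c!r} contains whitespace or control chars")
    key = SEPARATOR.join(components)
    size = len(key.encode("utf-8"))
    if size > MAX_KEY_BYTES:
        raise ValueError(f"key is {size} bytes (> {MAX_KEY_BYTES})")
    return key


if __name__ == "__main__":
    print(make_key("ecommerce", "product", 123))   # ecommerce:product:123
    print(make_key("crm", "user", 456))            # crm:user:456
```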

30.3 Serialization Format Selection

30.3.1 Format Comparison

Redis values are byte strings. The application chooses how to serialize objects:

Format	Size vs. JSON	Speed	Cross-language	Readability	Best For
JSON	Baseline	Slow (text parse)	Excellent	High	Debugging, small objects, API response caches
Protobuf	~1/3 of JSON	Very fast (binary)	Good (needs IDL)	Low	High-frequency large objects, cross-language microservices
MessagePack	~1/2 of JSON	Fast (binary)	Good (no IDL needed)	Low	General use; drop-in JSON replacement
Avro	Similar to Protobuf	Fast	Good (Schema Registry)	Low	Kafka + Redis pipelines
Kryo/Hessian	~1/2 of JSON	Fast	Poor (JVM-only)	Low	Java monolith

30.3.2 Benchmark: 100-Field User Object

Format	Payload Size	Serialize (μs)	Deserialize (μs)
JSON	2.1 KB	45	62
MessagePack	1.0 KB	18	22
Protobuf	650 B	8	11
Custom string	500 B	3	12

Takeaway: For objects larger than a few hundred bytes accessed thousands of times per second, Protobuf's bandwidth and CPU savings are meaningful. For objects needing human inspection (debugging, small configs), JSON's readability is genuinely valuable.
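To make the size gap concrete without third-party libraries, the sketch below encodes the same record as JSON and as a fixed binary layout built with Python's `struct` module. The layout is a hypothetical stand-in for what Protobuf or MessagePack do (compact binary fields instead of repeated text field names); it is neither format.

```python
import json
import struct

# Hypothetical fixed layout: user_id (u64), age (u8), login_count (u32),
# last_login_ts (u32), then a length-prefixed UTF-8 name. Little-endian, no padding.
_HEADER = struct.Struct("<QBII")
_NAME_LEN = struct.Struct("<H")


def encode_binary(user: dict) -> bytes:
    name = user["name"].encode("utf-8")
    return (_HEADER.pack(user["user_id"], user["age"], user["login_count"],
                         user["last_login_ts"])
            + _NAME_LEN.pack(len(name)) + name)


def decode_binary(data: bytes) -> dict:
    user_id, age, login_count, last_login_ts = _HEADER.unpack_from(data, 0)
    (name_len,) = _NAME_LEN.unpack_from(data, _HEADER.size)
    start = _HEADER.size + _NAME_LEN.size
    name = data[start:start + name_len].decode("utf-8")
    return {"user_id": user_id, "age": age, "login_count": login_count,
            "last_login_ts": last_login_ts, "name": name}


if __name__ == "__main__":
    user = {"user_id": 1000, "age": 30, "login_count": 150,
            "last_login_ts": 1716000000, "name": "Alice"}
    as_json = json.dumps(user).encode("utf-8")
    as_binary = encode_binary(user)
    print(len(as_json), "bytes as JSON,", len(as_binary), "bytes as binary")
```

The trade-off mirrors the table: the binary form is much smaller but unreadable in redis-cli, and any change to the layout needs a versioning plan.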

30.3.3 Compressing Large Values

For values exceeding 10 KB, compress after serializing:

import gzip, json, redis

r = redis.Redis()  # keep decode_responses=False (the default): stored values are binary gzip data

def set_compressed(key: str, obj: dict, ttl: int = 3600):
    """Serialize → gzip compress → store in Redis."""
    serialized = json.dumps(obj).encode('utf-8')
    compressed = gzip.compress(serialized, compresslevel=6)
    r.setex(key, ttl, compressed)

def get_compressed(key: str) -> dict | None:
    """Fetch from Redis → decompress → deserialize."""
    raw = r.get(key)
    if raw is None:
        return None
    return json.loads(gzip.decompress(raw).decode('utf-8'))

# Typical result: 10 KB JSON → 2 KB compressed (5x ratio)
# CPU cost: ~0.3–0.5ms additional latency per operation

Enable compression when:

Value size > 5 KB and access frequency < 1K QPS (the extra CPU per request stays small in aggregate)
Network bandwidth is billed (cloud environments)
Memory is the bottleneck (compression reduces memory 50–80%)
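Because the criteria above apply only to some values, a common approach is to compress selectively and prefix each stored value with a one-byte marker so readers know whether to decompress. This is a sketch under assumptions: the marker bytes and the 5 KB threshold are illustrative choices, not a Redis or library convention.

```python
import gzip

RAW = b"\x00"         # marker: payload stored as-is
GZIP = b"\x01"        # marker: payload is gzip-compressed
THRESHOLD = 5 * 1024  # compress only payloads larger than 5 KB (illustrative)


def encode_value(payload: bytes) -> bytes:
    """Compress large payloads, keep small ones raw, and prefix a format marker."""
    if len(payload) > THRESHOLD:
        compressed = gzip.compress(payload, compresslevel=6)
        if len(compressed) < len(payload):   # skip if compression doesn't help
            return GZIP + compressed
    return RAW + payload


def decode_value(stored: bytes) -> bytes:
    """Inverse of encode_value: inspect the marker and decompress if needed."""
    marker, body = stored[:1], stored[1:]
    if marker == GZIP:
        return gzip.decompress(body)
    if marker == RAW:
        return body
    raise ValueError(f"unknown format marker {marker!r}")
```

With redis-py you would store `encode_value(json.dumps(obj).encode())` and call `decode_value` on read; the client must not use `decode_responses=True`, since the stored bytes are binary.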

30.4 Hot Keys

30.4.1 Defining a Hot Key

A key becomes a "hot key" when it receives a disproportionate share of total QPS—typically 10–30%+ of a single node's request volume:

Typical hot key scenarios:
- Top trending topic on social media (GET hotSearch:rank:1)
- Flash sale product inventory (GET/DECR stock:product:SKU-9527)
- Homepage configuration (GET config:homepage)
- Global page view counter (INCR global:pageview)
- Popular discount coupon (GET coupon:activity:999)

Consequences:

Single node CPU hits 100%; other keys on that node suffer latency
Cluster data skew: hot node handles 10x the load of neighboring nodes
Network saturation on large-value hot keys under high QPS

30.4.2 Detecting Hot Keys

# Method 1: redis-cli --hotkeys (requires maxmemory-policy = *-lfu)
redis-cli --hotkeys -h redis-host -p 6379
# Output:
# hot key found with counter: 9842  keyname: hotSearch:rank:1
# hot key found with counter: 7234  keyname: stock:product:SKU-9527

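# Method 2: MONITOR sampling (use sparingly: a single MONITOR client can reduce server throughput by more than 50%)
# MONITOR lines look like: 1716000000.123456 [0 127.0.0.1:60866] "GET" "key", so field 5 is the key
redis-cli monitor | head -5000 | grep -i '"get"' \
  | awk '{print $5}' | tr -d '"' \
  | sort | uniq -c | sort -rn | head -20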

# Method 3: LFU frequency counter (Redis 4.0+ with LFU policy)
OBJECT FREQ hotSearch:rank:1  # Returns LFU access frequency estimate

# Method 4: Client-side instrumentation (recommended for production)
# Intercept all Redis calls in your client wrapper, record key access counts,
# export to Prometheus/Grafana without impacting Redis performance
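Method 4 can be sketched as a thin wrapper that counts sampled key accesses in process before delegating to the real client. `HotKeyTracker` and its sampling scheme are hypothetical; in production you would export the counts (for example to Prometheus) per time window rather than only keep them in memory.

```python
import random
import threading
from collections import Counter


class HotKeyTracker:
    """Wraps any client object exposing get(key); counts sampled key accesses."""

    def __init__(self, client, sample_rate: float = 1.0, rng=None):
        self._client = client
        self._sample_rate = sample_rate        # fraction of calls to record
        self._rng = rng or random.Random()
        self._counts = Counter()
        self._lock = threading.Lock()

    def get(self, key):
        if self._rng.random() < self._sample_rate:
            with self._lock:
                self._counts[key] += 1
        return self._client.get(key)

    def top(self, n: int = 10):
        """Most frequently sampled keys, highest first."""
        with self._lock:
            return self._counts.most_common(n)

    def reset(self):
        """Start a new measurement window (e.g. once per minute)."""
        with self._lock:
            self._counts.clear()
```

Usage might look like `tracked = HotKeyTracker(redis.Redis(), sample_rate=0.01)`; with 1% sampling, multiply counts by 100 to estimate real access rates.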

30.4.3 Hot Key Solutions

Solution 1: Local (L1) Cache

from cachetools import TTLCache
import threading
import redis

r = redis.Redis()
_local = TTLCache(maxsize=500, ttl=3)  # 3-second local cache
_lock = threading.Lock()               # cachetools caches are not thread-safe: guard reads and writes

def get_with_local_cache(key: str):
    """Check local cache first, fall back to Redis."""
    with _lock:
        value = _local.get(key)
    if value is not None:
        return value
    value = r.get(key)
    if value is not None:
        with _lock:
            _local[key] = value
    return value

def invalidate_local(key: str):
    """Call on writes to keep this process's local cache fresh."""
    with _lock:
        _local.pop(key, None)

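Pros: zero Redis traffic for cached hot reads, sub-0.1ms latency
Cons: reads can lag writes by up to the TTL window; invalidate_local only clears the current process, so other app instances keep serving stale values until their entries expire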

Solution 2: Key Sharding (Read Replicas in Key Form)

import random, redis

r = redis.Redis()

SHARD_COUNT = 10

def read_sharded(base_key: str) -> bytes | None:
    """Read from a random shard — distributes hot reads across 10 keys."""
    shard_idx = random.randint(0, SHARD_COUNT - 1)
    return r.get(f"{base_key}:shard:{shard_idx}")

def write_sharded(base_key: str, value: str, ttl: int = 60):
    """Write to ALL shards to keep them consistent."""
    pipe = r.pipeline()
    for i in range(SHARD_COUNT):
        pipe.setex(f"{base_key}:shard:{i}", ttl, value)
    pipe.execute()

# Usage:
# Write (updates all 10 copies)
write_sharded("hotSearch:rank:1", "Redis 8.0 Released")
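# Read (randomly picks 1 of 10). In Redis Cluster (e.g. with a redis.RedisCluster client), the copies
# hash to different slots and can spread QPS across nodes; on a single instance they all share one server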
val = read_sharded("hotSearch:rank:1")
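Key sharding only spreads load if the shard keys map to different hash slots (and therefore can live on different nodes). The sketch below checks this with the CRC16 slot rule from the Redis Cluster specification, using Python's `binascii.crc_hqx`, which computes that CRC16 variant. `key_slot` and `shard_slots` are illustrative helpers. It also shows the anti-pattern: wrapping the base key in a hash tag pins every copy to one slot.

```python
from binascii import crc_hqx

HASH_SLOTS = 16384


def key_slot(key: str) -> int:
    """Redis Cluster slot: CRC16 of the key (or its {hash tag}) mod 16384."""
    k = key.encode("utf-8")
    start = k.find(b"{")
    if start != -1:
        end = k.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            k = k[start + 1:end]
    return crc_hqx(k, 0) % HASH_SLOTS


def shard_slots(base_key: str, shard_count: int = 10) -> list:
    """Slots of the copies that write_sharded() would create."""
    return [key_slot(f"{base_key}:shard:{i}") for i in range(shard_count)]


if __name__ == "__main__":
    print(shard_slots("hotSearch:rank:1"))     # ten different slots
    print(shard_slots("{hotSearch:rank:1}"))   # anti-pattern: one slot for all copies
```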

Solution 3: Read-Write Splitting

from redis.sentinel import Sentinel

sentinel = Sentinel(
    [('sentinel1', 26379), ('sentinel2', 26379), ('sentinel3', 26379)],
    socket_timeout=0.1
)
master = sentinel.master_for('mymaster', socket_timeout=0.1)
replica = sentinel.slave_for('mymaster', socket_timeout=0.1)

# Writes go to primary
master.set('hotkey', 'value')

# Reads distributed across replicas; replication is asynchronous, so a read right after a write may return stale data
val = replica.get('hotkey')

30.5 Big Keys

30.5.1 Defining a Big Key

Commonly cited thresholds (rules of thumb rather than Redis limits; tune them to your workload):

Data Type	Big Key Threshold
String	> 10 KB
Hash	> 5,000 fields
List	> 5,000 elements
Set	> 5,000 members
ZSet	> 5,000 members
Stream	> 10,000 entries
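These thresholds are easy to encode in an audit script. The sketch below only applies the table; the type names are those returned by Redis TYPE, and it assumes the size is measured in bytes for strings (as STRLEN reports) and in elements otherwise (as HLEN, LLEN, SCARD, ZCARD, or XLEN report). "10 KB" is taken as 10 × 1024 bytes, and `is_big_key` is a hypothetical name.

```python
# Thresholds from the table above: bytes for strings, element counts otherwise.
BIG_KEY_THRESHOLDS = {
    "string": 10 * 1024,
    "hash": 5_000,
    "list": 5_000,
    "set": 5_000,
    "zset": 5_000,
    "stream": 10_000,
}


def is_big_key(redis_type: str, size: int) -> bool:
    """True if size exceeds the threshold for this type (as returned by TYPE)."""
    threshold = BIG_KEY_THRESHOLDS.get(redis_type)
    if threshold is None:
        raise ValueError(f"no threshold defined for type {redis_type!r}")
    return size > threshold
```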

30.5.2 The Damage Big Keys Cause

Slow network transfers:

Fetching a 1 MB String value:
- 10 Gbps intranet:   ~0.8 ms network time
- 1 Gbps intranet:    ~8 ms network time
- Client deserialize: +1–5 ms
Compare: a 100-byte GET typically completes in 0.1 ms
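The network figures above are simple arithmetic: bits on the wire divided by link speed. This helper (its name is illustrative) ignores protocol overhead and congestion, and treats 1 MB as 10^6 bytes.

```python
def transfer_time_ms(size_bytes: int, link_gbps: float) -> float:
    """Ideal wire time in milliseconds: size in bits / link speed in bits per second."""
    bits = size_bytes * 8
    return bits / (link_gbps * 1e9) * 1000


if __name__ == "__main__":
    one_mb = 1_000_000
    print(f"1 MB on 10 Gbps: {transfer_time_ms(one_mb, 10):.2f} ms")
    print(f"1 MB on 1 Gbps:  {transfer_time_ms(one_mb, 1):.2f} ms")
    print(f"100 B on 1 Gbps: {transfer_time_ms(100, 1) * 1000:.1f} us")
```

For a 100-byte value the wire time is under a microsecond, so the ~0.1 ms of a small GET is almost entirely round trip and processing; for a 1 MB value, transfer time dominates.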

Main thread blocking during deletion:

# DEL on a Hash with 100,000 fields can block for tens of milliseconds
DEL big_hash   # Blocks main thread during memory release!

# Use UNLINK instead — returns immediately, releases memory in background thread
UNLINK big_hash

# Same applies to flush operations
FLUSHDB ASYNC
FLUSHALL ASYNC

Cluster data skew: a 10 MB key on one node causes that node's memory, CPU, and network utilization to far exceed its neighbors, breaking load balance.

RDB and AOF impact: large keys take longer to serialize during RDB snapshots and AOF rewrites, and writes to a big key while the forked child is running force copy-on-write duplication of its memory pages, increasing memory pressure.

30.5.3 Detecting Big Keys

# Method 1: redis-cli --bigkeys (uses SCAN, non-blocking)
redis-cli --bigkeys -h redis-host -p 6379
# Output:
# Biggest string: 'user:1000:bio' with 52428 bytes
# Biggest hash:   'user:events:all' with 125432 fields

# Method 2: Check a specific key's memory footprint
redis-cli MEMORY USAGE user:1000:bio           # Returns bytes
redis-cli DEBUG OBJECT user:1000:bio           # Encoding, serialization length

# Method 3: RDB offline analysis
# rdb-tools (Python)
pip install rdbtools
rdb --command memory dump.rdb | sort -t, -k4 -rn | head -20

# redis-rdb-cli (Java, richer output)
rct -c memory -s /var/lib/redis/dump.rdb -o big_keys_report.csv -t string,hash,list,set,zset

30.5.4 Big Key Remediation Strategies

String — large values:

import json
import boto3
import redis

r = redis.Redis()
s3 = boto3.client('s3')

OFFLOAD_THRESHOLD = 10 * 1024  # bytes; matches the String big-key threshold above

def store_large_content(content_id: str, content: dict, ttl: int = 3600):
    content_bytes = json.dumps(content).encode('utf-8')  # measure bytes, not characters
    key = f"article:{content_id}:content"
    if len(content_bytes) > OFFLOAD_THRESHOLD:  # > 10 KB → offload to object storage
        s3_key = f"content/{content_id}.json"
        s3.put_object(Bucket='my-cache-bucket', Key=s3_key, Body=content_bytes)
        # Store only the reference URL in Redis; readers detect the "s3://" prefix.
        # The Redis TTL does not delete the S3 object: use a bucket lifecycle rule for that.
        r.setex(key, ttl, f"s3://my-cache-bucket/{s3_key}")
    else:
        # Small enough to store directly
        r.setex(key, ttl, content_bytes)

Hash — too many fields:

# Before: single Hash with 500+ fields
# HSET user:1000 name Alice age 30 email ... [500 fields]

# After: split by field category
HSET user:1000:basic    name Alice age 30 gender F
HSET user:1000:contact  email alice@example.com phone 13800138000
HSET user:1000:prefs    language zh timezone Asia/Shanghai theme dark
HSET user:1000:stats    login_count 150 last_login_ts 1716000000

# Benefit: most operations only need one sub-Hash
HGETALL user:1000:basic     # Fast — only 3-5 fields
HGET user:1000:contact email  # Even faster — one field
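Splitting by category works when fields fall into natural groups. When a single Hash holds many homogeneous entries (such as the 125,432-field user:events:all reported by --bigkeys above), a different technique is to bucket fields by a hash of the field name into a fixed number of sub-Hashes. This is a sketch; the bucket count, the `:bucket:` key suffix, the choice of BLAKE2b, and `bucket_key` itself are assumptions.

```python
import hashlib

BUCKETS = 64  # illustrative: 125,432 fields / 64 ≈ 1,960 per bucket, under the 5,000 threshold


def bucket_key(base_key: str, field: str, buckets: int = BUCKETS) -> str:
    """Deterministically map a field to one of `buckets` sub-Hash keys."""
    digest = hashlib.blake2b(field.encode("utf-8"), digest_size=8).digest()
    idx = int.from_bytes(digest, "big") % buckets
    return f"{base_key}:bucket:{idx}"


# Hypothetical usage with a redis-py client `r` (not created here):
#   r.hset(bucket_key("user:events:all", field), field, value)
#   r.hget(bucket_key("user:events:all", field), field)
# Whole-Hash operations (HGETALL, HLEN) now have to visit every bucket.
```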

List/ZSet — too many elements:

from datetime import datetime
import redis

r = redis.Redis()

# Time-based sharding: split history ZSet by month
def add_user_event(user_id: int, event: str, score: float):
    month = datetime.now().strftime("%Y%m")
    key = f"user:{user_id}:events:{month}"
    r.zadd(key, {event: score})
    r.expire(key, 86400 * 90)  # Keep for 90 days after the last write

def get_user_events(user_id: int, months: int = 3):
    now = datetime.now()
    year, month = now.year, now.month
    results = []
    for _ in range(months):
        key = f"user:{user_id}:events:{year:04d}{month:02d}"
        # Top 1000 per month is enough, since only 1000 are returned overall
        results.extend(r.zrevrangebyscore(key, '+inf', '-inf', start=0, num=1000, withscores=True))
        # Step back exactly one calendar month (subtracting 30 days can skip or repeat a month)
        year, month = (year, month - 1) if month > 1 else (year - 1, 12)
    return sorted(results, key=lambda x: x[1], reverse=True)[:1000]

# Fixed-size capped list: keep only the most recent N items
def append_capped(key: str, value: str, max_size: int = 1000):
    pipe = r.pipeline()
    pipe.lpush(key, value)
    pipe.ltrim(key, 0, max_size - 1)
    pipe.execute()

30.6 TTL Design Principles

30.6.1 Why Cache Keys Must Have TTLs

Without TTLs, memory grows unboundedly until maxmemory is hit. Under the noeviction policy (the default), Redis then rejects writes with an OOM error, which is a production outage. Under volatile-* policies, keys without TTLs are never eviction candidates, so they crowd out keys that do expire; only allkeys-* policies can evict them.

Rule: every cache key must have a TTL. Even "permanent" data should refresh periodically.

30.6.2 Preventing Cache Stampede with TTL Jitter

When many keys expire simultaneously (common after a cache warm-up or deployment), all requests hit the database at once—a "thundering herd" or cache stampede:

import random, redis

r = redis.Redis()

def set_with_jitter(key: str, value, base_ttl: int = 3600):
    """Add random jitter to TTL to prevent synchronized expiry."""
    jitter = random.randint(0, base_ttl // 10)  # 0 to +10% randomization
    actual_ttl = base_ttl + jitter
    r.setex(key, actual_ttl, value)

# 1000 keys with base TTL 3600s will now expire
# spread across 3600–3960s instead of all in the same second

# Wider window: base ± 10% (a 20%-wide spread centered on base_ttl)
def set_with_wide_jitter(key: str, value, base_ttl: int = 3600):
    spread = base_ttl // 5   # 20% spread
    actual_ttl = base_ttl + random.randint(-spread // 2, spread // 2)
    r.setex(key, max(1, actual_ttl), value)
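The effect of jitter shows up clearly in a quick simulation that counts how many of 1,000 keys, all written at the same moment, expire in each one-second bucket. It uses the same jitter range as set_with_jitter (0 to +10% of the base TTL); `expiry_histogram` and the seeded random generator are illustrative.

```python
import random
from collections import Counter


def expiry_histogram(num_keys: int, base_ttl: int, jitter_pct: int,
                     rng: random.Random) -> Counter:
    """Count keys expiring at each second, for keys all written at t=0."""
    max_jitter = base_ttl * jitter_pct // 100
    return Counter(base_ttl + rng.randint(0, max_jitter) for _ in range(num_keys))


if __name__ == "__main__":
    rng = random.Random(42)
    no_jitter = expiry_histogram(1000, 3600, 0, rng)
    with_jitter = expiry_histogram(1000, 3600, 10, rng)
    # Peak number of keys (and so potential cache misses) expiring in one second
    print("peak without jitter:", max(no_jitter.values()))
    print("peak with jitter:   ", max(with_jitter.values()))
```

Without jitter, every key expires in the same second; with 10% jitter, the same misses are spread over roughly 360 seconds.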

30.6.3 Aligning TTL with Data Volatility

import fnmatch
import redis

r = redis.Redis()

# Rule: Cache TTL must be shorter than the meaningful freshness window,
# but long enough to provide a cache hit rate worth having.

# Anti-pattern: product price cached for 1 second
# (prices change hourly, so almost every read misses and falls through to the database)
r.setex("product:price:123", 1, "99.9")  # Pointless!

# Correct: align TTL to actual change frequency.
# Glob patterns follow the key conventions from 30.2; the first matching pattern wins.
TTL_POLICY = {
    "user:*:profile":     86400,   # Changes rarely: 1 day
    "session:*":          7200,    # Session timeout: 2 hours
    "cache:product:*":    300,     # Updated every few minutes: 5 min
    "stock:product:*":    30,      # Volatile: 30 seconds
    "config:homepage":    600,     # Config reloads: 10 minutes
    "rate:api:*":         60,      # Rate limiter window: 1 minute
    "lock:*":             30,      # Lock timeout + safety margin
}

def set_with_policy(key: str, value):
    for pattern, ttl in TTL_POLICY.items():
        if fnmatch.fnmatchcase(key, pattern):  # case-sensitive, like Redis keys
            r.setex(key, ttl, value)
            return
    r.setex(key, 300, value)  # Default: 5 minutes

30.6.4 Sliding Window TTL and Background Refresh

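import threading
import redis

r = redis.Redis()

def get_sliding_ttl(key: str, ttl: int = 3600):
    """Reset TTL on each access; useful for session-like data."""
    pipe = r.pipeline()
    pipe.get(key)
    pipe.expire(key, ttl)   # Refresh TTL on read (no-op if the key is missing)
    value, _ = pipe.execute()
    return value

class ProactiveCache:
    """Refresh cache before it expires to avoid a miss on expiry."""

    def __init__(self, refresh_fn, base_ttl: int = 3600, refresh_threshold: int = 60):
        self.refresh_fn = refresh_fn          # refresh_fn(key) -> value to store
        self.base_ttl = base_ttl
        self.refresh_threshold = refresh_threshold
        self._refreshing = set()              # keys with a refresh in flight (this process only)
        self._lock = threading.Lock()         # guards _refreshing across request threads

    def get(self, key: str):
        # Fetch value and remaining TTL atomically in one round trip
        pipe = r.pipeline()
        pipe.get(key)
        pipe.ttl(key)
        value, remaining = pipe.execute()

        if value is None:
            # Cold miss: load synchronously so the caller still gets data
            value = self.refresh_fn(key)
            r.setex(key, self.base_ttl, value)
            return value

        # Proactively refresh if near expiry (-1 means no TTL, so skip)
        if 0 <= remaining < self.refresh_threshold:
            with self._lock:
                if key in self._refreshing:
                    return value
                self._refreshing.add(key)
            threading.Thread(
                target=self._background_refresh,
                args=(key,),
                daemon=True
            ).start()

        return value

    def _background_refresh(self, key: str):
        try:
            new_value = self.refresh_fn(key)
            r.setex(key, self.base_ttl, new_value)
        finally:
            with self._lock:
                self._refreshing.discard(key)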

30.7 Data Modeling — Translating RDBMS Patterns to Redis

30.7.1 Common Relational Pattern Translations

One-to-many (User → Orders):

# RDBMS: orders table with user_id foreign key + query by user_id

# Redis Option A: ZSet (score = timestamp → range queries by time)
ZADD user:1000:orders 1716000000 "order:20240501:12345"
ZADD user:1000:orders 1716003600 "order:20240501:12346"

# Query: all orders in a time range
ZRANGEBYSCORE user:1000:orders 1716000000 1716086400

# Redis Option B: List (insertion order → fast access to recent N)
# An alternative to Option A: a key holds only one data type, so pick one per key
LPUSH user:1000:orders "order:20240501:12346"
LTRIM user:1000:orders 0 999   # Keep latest 1000

# Fetch recent 10
LRANGE user:1000:orders 0 9

Many-to-many (Users ↔ Tags):

# Bidirectional Sets: tag→users and user→tags
SADD tag:redis:users "user:1000" "user:1001" "user:1002"
SADD tag:python:users "user:1001" "user:1003"

SADD user:1000:tags "redis" "distributed"
SADD user:1001:tags "redis" "python"

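# Find users with BOTH redis AND python tags (set intersection)
# (In Redis Cluster, all keys in one SINTERSTORE must share a hash slot; use hash tags)
SINTERSTORE result:redis_and_python tag:redis:users tag:python:users
EXPIRE result:redis_and_python 60   # Temporary result key: give it a TTL
SMEMBERS result:redis_and_python    # → user:1001

# Find users with redis OR python (set union)
SUNIONSTORE result:redis_or_python tag:redis:users tag:python:users
EXPIRE result:redis_or_python 60
SCARD result:redis_or_python        # Count of union → 4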

Sorted pagination:

# ZSet provides natural ordered pagination
ZADD articles 1716000000 "article:1"
ZADD articles 1716003600 "article:2"
ZADD articles 1716007200 "article:3"

# Page 1 (newest first): items 1–10
ZREVRANGEBYSCORE articles +inf -inf LIMIT 0 10

# Page 2: items 11–20
ZREVRANGEBYSCORE articles +inf -inf LIMIT 10 10

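# Redis 6.2+ unified ZRANGE syntax (ZREVRANGEBYSCORE is deprecated as of 6.2)
ZRANGE articles 0 9 REV                     # Newest 10
# Cursor-based: replace last_seen_score with the last score from the previous page.
# With BYSCORE REV the max comes first; "(" makes it exclusive, so items sharing exactly
# that score are skipped (use unique scores, or offset paging, if ties are possible)
ZRANGE articles "(last_seen_score" "-inf" BYSCORE REV LIMIT 0 10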

30.7.2 Eliminating N+1 Queries

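import json
import redis

# decode_responses=True returns str instead of bytes, so list members can be used as keys directly
r = redis.Redis(decode_responses=True)

# WRONG: N+1 pattern (each order is a separate round trip)
order_keys = r.lrange("user:1000:orders", 0, 49)  # members are full keys, e.g. "order:20240501:12345"
orders = []
for okey in order_keys:
    orders.append(r.hgetall(okey))  # 50 round trips!

# CORRECT: Pipeline batches all requests into one network round trip
order_keys = r.lrange("user:1000:orders", 0, 49)
pipe = r.pipeline(transaction=False)  # plain batching; no MULTI/EXEC needed
for okey in order_keys:
    pipe.hgetall(okey)
orders = pipe.execute()  # 1 round trip for 50 orders!

# Even more efficient with MGET for String values
product_ids = ["SKU-9527", "SKU-9528"]  # hypothetical IDs for illustration
product_keys = [f"product:{pid}" for pid in product_ids]
products_raw = r.mget(product_keys)  # in Redis Cluster, MGET keys must share a hash slot
products = [json.loads(p) for p in products_raw if p is not None]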

30.8 Production Monitoring and Operations

30.8.1 Key Space Monitoring

# Keyspace overview — O(1)
redis-cli INFO keyspace
# db0:keys=1500000,expires=800000,avg_ttl=3541000
# db1:keys=50000,expires=50000,avg_ttl=7200000

# Memory overview
redis-cli INFO memory
# used_memory_human: 2.50G
# mem_fragmentation_ratio: 1.15    ← ideal: 1.0–1.5
# maxmemory_human: 8.00G

# Total key count — O(1)
redis-cli DBSIZE

# Count keys matching a pattern — uses SCAN (non-blocking)
redis-cli --scan --pattern "user:*" | wc -l
redis-cli --scan --pattern "cache:*" | wc -l
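Instead of one --scan pass per prefix, a single SCAN pass can tally every namespace at once. This sketch accepts any iterable of keys, so in production you could feed it `r.scan_iter(count=1000)` from redis-py (which yields bytes unless `decode_responses=True`); `count_by_prefix` and its `depth` parameter are illustrative. Because SCAN can return a key more than once, the counts are approximate.

```python
from collections import Counter
from typing import Iterable, Union


def count_by_prefix(keys: Iterable[Union[str, bytes]], depth: int = 1,
                    sep: str = ":") -> Counter:
    """Count keys by their first `depth` separator-delimited segments."""
    counts = Counter()
    for key in keys:
        if isinstance(key, bytes):
            key = key.decode("utf-8", errors="replace")
        counts[sep.join(key.split(sep)[:depth])] += 1
    return counts


if __name__ == "__main__":
    sample = ["user:1000:profile", "user:1001:profile", "cache:product:SKU-9527",
              "session:abc123def456", b"user:1000:orders"]
    for prefix, n in count_by_prefix(sample).most_common():
        print(f"{prefix:10} {n}")
```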

30.8.2 Safe Key Scanning

# NEVER use KEYS in production — O(N), blocks all clients
KEYS user:*           # DANGEROUS: blocks Redis for seconds on 10M+ keys

# ALWAYS use SCAN — non-blocking, iterative
SCAN 0 MATCH "user:*" COUNT 100
# Returns: [next_cursor, [key1, key2, ...]]
# Repeat until cursor returns 0

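# Python: lazy iteration over all matching keys
# (r is a redis-py client; process() stands in for your own handler)
for key in r.scan_iter("user:*", count=100):
    process(key)   # COUNT is a hint: each round trip examines ~100 keys, MATCH may return fewer, and SCAN can repeat a key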

30.8.3 Key Expiry Monitoring

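# Monitor expiry events (enable keyspace notifications first)
CONFIG SET notify-keyspace-events Ex   # E = keyevent channel, x = expired events
SUBSCRIBE __keyevent@0__:expired
# Expired events fire when Redis actually deletes the key (on access or in the background
# expire cycle), which can lag the moment the TTL reaches zero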

# Check if a key's TTL is set
TTL mykey          # Returns -1 (no TTL), -2 (doesn't exist), or seconds remaining
PTTL mykey         # Same, in milliseconds

# Find keys WITHOUT a TTL (potential memory leak)
# Note: this is a full scan that spawns one redis-cli process per key, so it is slow
# on large keyspaces; only run it during off-peak hours
redis-cli --scan | while IFS= read -r key; do
    ttl=$(redis-cli TTL "$key")
    if [ "$ttl" = "-1" ]; then
        echo "No TTL: $key"
    fi
done | head -100

30.9 Summary and Production Checklist

Key design and data modeling underpin every production Redis deployment. Here is the distilled checklist from this chapter:

Naming conventions:

Use domain:entity:id[:subfield] hierarchy with : as separator
Keep key length under 60 bytes
Isolate namespaces with prefixes when multiple systems share a Redis instance
In Cluster mode, use hash tags {} to control slot placement for co-located keys

Serialization:

Use Protobuf or MessagePack for large, high-frequency objects instead of JSON
Apply gzip compression for values exceeding 10 KB
Offload values larger than 100 KB to object storage (S3/OSS); store the URL in Redis

Hot keys:

Monitor key access frequency via LFU counters or client-side instrumentation
Apply in-process local caching (e.g. cachetools in Python, Caffeine or Guava in Java) for extreme hot keys
Shard hot reads across N replica keys to distribute load
Use read-write splitting to route reads to replicas

Big keys:

Run redis-cli --bigkeys periodically (ideally in CI/CD or scheduled jobs)
Always use UNLINK instead of DEL for large keys
Split large Hashes by field category; split large ZSets by time window
Move large String values to object storage

TTL design:

Every cache key must have a TTL — no exceptions
Add random jitter (around 10% of the base TTL) to prevent synchronized cache expiry storms
Align TTL with actual data change frequency (not too short, not too long)
Use sliding TTL for session-like data; use proactive background refresh for critical hot keys
