Chapter 38

Slow Query and Big Key Analysis

Performance problems hide in details: a forgotten KEYS *, a Hash that silently grew to 100,000 fields, a single key absorbing 80% of all traffic. This chapter covers the complete toolkit for discovering, diagnosing, and fixing these issues without impacting production.


1. Slowlog Configuration and Usage

1.1 Configuration Parameters

# redis.conf
slowlog-log-slower-than 10000   # threshold in microseconds (μs); 10000 μs = 10 ms
                                 # set to 0 to log every command (debug use only)
                                 # set to -1 to disable entirely
slowlog-max-len 128             # ring buffer of at most 128 entries; older entries are dropped

Dynamic changes (no restart needed):

CONFIG SET slowlog-log-slower-than 5000   # lower to 5 ms to catch more
CONFIG SET slowlog-max-len 256
CONFIG REWRITE                             # persist to redis.conf

1.2 Querying the Slowlog

SLOWLOG GET          # retrieve the 10 most recent entries (the default count)
SLOWLOG GET 128      # retrieve the 128 most recent entries
SLOWLOG LEN          # how many entries are currently stored
SLOWLOG RESET        # clear all entries

# Each entry contains six fields (Redis 4.0+):
# 1) 1) (integer) 14          ← unique, auto-incrementing ID
#    2) (integer) 1699000000  ← Unix timestamp at which the command was processed
#    3) (integer) 28000       ← execution time in microseconds (28 ms); excludes network I/O
#    4) 1) "KEYS"             ← command name
#       2) "*"                ← arguments (each capped at 128 bytes, at most 32 logged)
#    5) "127.0.0.1:54321"     ← client address
#    6) "myapp"               ← client name (set via CLIENT SETNAME)

1.3 Parsing the Slowlog in Python

import redis
from datetime import datetime

r = redis.Redis()

def print_slowlog(count: int = 50):
    for entry in r.slowlog_get(count):
        ts = datetime.fromtimestamp(entry['start_time'])
        ms = entry['duration'] / 1000          # microseconds -> milliseconds
        # redis-py returns the command as one space-joined value
        # (bytes unless decode_responses=True), not as a list of arguments
        cmd = entry['command']
        if isinstance(cmd, bytes):
            cmd = cmd.decode(errors='replace')
        print(f"[{ts}] {ms:.1f} ms | {cmd[:120]}")

print_slowlog()
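
Individual entries are easier to act on once they are grouped by command. The following is a minimal sketch, assuming entries shaped like the dictionaries redis-py returns from slowlog_get (duration in microseconds, command as bytes or str). The helper name summarize_slowlog is hypothetical and not part of redis-py.

```python
from collections import defaultdict

def summarize_slowlog(entries):
    """Group slowlog entries by command name.

    Returns a list of (name, count, total_ms, max_ms) tuples,
    sorted by total_ms in descending order.
    """
    stats = defaultdict(lambda: [0, 0, 0])   # name -> [count, total_us, max_us]
    for entry in entries:
        cmd = entry["command"]
        if isinstance(cmd, bytes):
            cmd = cmd.decode(errors="replace")
        parts = cmd.split()
        name = parts[0].upper() if parts else "(empty)"
        bucket = stats[name]
        bucket[0] += 1
        bucket[1] += entry["duration"]
        bucket[2] = max(bucket[2], entry["duration"])
    rows = [(n, c, t / 1000, m / 1000) for n, (c, t, m) in stats.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

With a live connection this could be called as summarize_slowlog(r.slowlog_get(128)); a command that dominates total_ms is usually the first one worth replacing.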

2. Commands Known to Be Slow

Command          Complexity        Blocking Risk         Replacement
KEYS *           O(N)              Critical              SCAN
HGETALL          O(N)              High (large Hashes)   HSCAN iteratively
SMEMBERS         O(N)              High (large Sets)     SSCAN iteratively
SORT             O(N + M·log M)    High                  Pre-sort into ZSET
LRANGE 0 -1      O(N)              High (long Lists)     Paginate with LRANGE
SUNIONSTORE      O(N)              High                  Split into smaller operations
SINTERSTORE      O(N·M)            Critical              Step-by-step intersections
DEL (large key)  O(N)              High                  UNLINK (async)
FLUSHDB (sync)   O(N)              Critical              FLUSHDB ASYNC

# DEL is synchronous — freeing a 100,000-field Hash blocks the main thread
DEL big_hash          # dangerous: may take hundreds of milliseconds

# UNLINK detaches the key in O(1) on the main thread; a background thread frees the memory
UNLINK big_hash       # safe

# Production policy (Python, redis-py): always prefer UNLINK for keys that could be large
def safe_delete(*keys):
    return r.unlink(*keys)
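
The two replacements from the table, SCAN instead of KEYS and UNLINK instead of DEL, combine naturally for bulk cleanup. This is a sketch under the assumption that client is a redis-py client, or any object exposing scan_iter() and unlink() with the same signatures. The function name and the batch size of 500 are illustrative choices.

```python
def unlink_by_pattern(client, pattern, batch_size=500):
    """Remove all keys matching `pattern` without calling KEYS or DEL.

    SCAN walks the keyspace incrementally, and UNLINK hands the memory
    reclamation to a background thread. Returns the number of keys removed.
    """
    removed = 0
    batch = []
    for key in client.scan_iter(match=pattern, count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            removed += client.unlink(*batch)
            batch.clear()
    if batch:                                  # flush the final partial batch
        removed += client.unlink(*batch)
    return removed
```

Batching keeps each UNLINK call small, so no single command holds the main thread for long even when the pattern matches millions of keys.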

3. Big Key Scanning

3.1 redis-cli --bigkeys

redis-cli --bigkeys
# Uses SCAN internally — does NOT block the main thread
# Reports the largest key for each data type, ranked by element count (bytes for strings), not by memory

# Throttle scan rate to reduce production impact
redis-cli --bigkeys -i 0.1   # sleep 0.1 s after every 100 SCAN commands

Sample output:

# Scanning the entire keyspace to find biggest keys as well as
# average sizes per key type.

[00.00%] Biggest string found so far 'config:blob'       with 512000 bytes
[23.45%] Biggest hash   found so far 'user:profile:9999' with 18432 fields
[67.89%] Biggest zset   found so far 'leaderboard:all'   with 2048321 members

-------- summary -------
Sampled 1000000 keys in the keyspace!
Total key length in bytes is 24000000 (avg len 24.00)

Biggest   string found 'config:blob'       has 512000 bytes
Biggest     hash found 'user:profile:9999' has 18432 fields
Biggest     zset found 'leaderboard:all'   has 2048321 members
Biggest      set found 'tags:popular'      has 10240 members
Biggest     list found 'queue:jobs'        has 50000 items

3.2 Per-Key Memory Usage

# Recursive memory calculation including encoding overhead
MEMORY USAGE user:profile:9999
# returns bytes used by the key and its value (including Redis metadata)

MEMORY USAGE user:profile:9999 SAMPLES 0   # exact (slow) — samples all nested elements
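
Because --bigkeys ranks collections by element count, a ranking by actual bytes needs MEMORY USAGE on every key. A possible sketch follows, assuming a redis-py style client with scan_iter() and memory_usage(); largest_keys is a hypothetical helper name. Like --bigkeys, it issues one extra command per key, so it is best run off-peak or against a replica.

```python
import heapq

def largest_keys(client, top_n=10, match="*", samples=5, scan_count=1000):
    """Return [(bytes, key), ...] for the top_n keys by MEMORY USAGE."""
    def sizes():
        for key in client.scan_iter(match=match, count=scan_count):
            size = client.memory_usage(key, samples=samples)
            if size is not None:      # key may have expired since SCAN returned it
                yield size, key
    # nlargest keeps only top_n candidates in memory while consuming the generator
    return heapq.nlargest(top_n, sizes())
```

The samples argument maps to the SAMPLES option shown above: 5 is the server default, and 0 trades speed for an exact figure.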

4. Hot Key Detection

4.1 Enabling LFU Policy

Hot key detection requires an LFU eviction policy so Redis tracks access frequency:

# redis.conf
maxmemory-policy allkeys-lfu    # or volatile-lfu

# or dynamically
CONFIG SET maxmemory-policy allkeys-lfu

# then scan for hot keys (redis-cli reads each key's LFU counter via OBJECT FREQ)
redis-cli --hotkeys
# Sample output (illustrative; the counter is an 8-bit logarithmic LFU value from 0 to 255, not a raw hit count):
# [67.89%] Hot key 'product:detail:10086' found so far with counter 231
# [89.12%] Hot key 'user:session:abc123' found so far with counter 187
#
# -------- summary -------
# hot key found with counter: 231  keyname: product:detail:10086
# hot key found with counter: 187  keyname: user:session:abc123

4.2 Real-Time Hot Key Detection via MONITOR

# WARNING: MONITOR streams every command to the client and can cut throughput by 50% or more
# Use only briefly for urgent investigation, never in steady state
# Splitting on double quotes, field 4 is the first argument, which is the key for most commands
redis-cli monitor | head -n 50000 \
    | awk -F'"' 'NF >= 5 {print $4}' \
    | sort | uniq -c | sort -rn | head -20
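
The same tally can be done in Python when the captured MONITOR output needs further processing. This is a rough sketch: it assumes the usual MONITOR line layout (timestamp, bracketed database and client address, then double-quoted tokens) and treats the first argument as the key, which is wrong for commands such as CONFIG GET or EVAL. The name count_keys is hypothetical.

```python
import re
from collections import Counter

# Matches one double-quoted token, allowing backslash escapes inside it.
_QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')

def count_keys(monitor_lines, top_n=20):
    """Count how often each key appears as the first argument of a command."""
    counts = Counter()
    for line in monitor_lines:
        tokens = _QUOTED.findall(line)
        if len(tokens) >= 2:          # tokens[0] is the command, tokens[1] its first argument
            counts[tokens[1]] += 1
    return counts.most_common(top_n)
```

Feeding it a saved capture, for example the lines of a file produced by redirecting redis-cli monitor for a few seconds, keeps the MONITOR session itself as short as possible.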

4.3 Mitigating Hot Keys

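# Strategy 1: Local in-process cache (bounded size, short TTL)
# functools.lru_cache never expires entries, so this uses TTLCache from the
# third-party cachetools package (pip install cachetools)
from cachetools import TTLCache, cached

@cached(cache=TTLCache(maxsize=1000, ttl=1))   # entries expire after 1 second
def get_product(product_id: str) -> dict:
    """Cached at process level; Redis is queried only on a miss or after expiry."""
    data = r.hgetall(f"product:{product_id}")
    return {k.decode(): v.decode() for k, v in data.items()}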

# Strategy 2: Key replication across multiple Redis keys
# (in Redis Cluster the suffixed copies hash to different slots, spreading reads across nodes)
import random

REPLICAS = 10

def get_hot_key(base_key: str) -> bytes | None:
    """Read from one randomly chosen copy out of REPLICAS."""
    replica = random.randint(0, REPLICAS - 1)
    return r.get(f"{base_key}:r{replica}")

def set_hot_key(base_key: str, value: str, ttl: int = 300):
    """Write every copy in one pipeline (wrapped in MULTI/EXEC on a single instance)."""
    pipe = r.pipeline()
    for i in range(REPLICAS):
        pipe.setex(f"{base_key}:r{i}", ttl, value)
    pipe.execute()

5. Memory Analysis Tools

5.1 Online Analysis

# Memory usage of a single key
redis-cli memory usage mykey

# Aggregated memory statistics
redis-cli memory stats
# Key fields:
#   peak.allocated   : highest memory allocation ever seen
#   total.allocated  : current total allocated
#   fragmentation    : RSS / used_memory (reported as mem_fragmentation_ratio in INFO memory)

# Memory doctor: actionable recommendations
redis-cli memory doctor
# Reports problems such as high fragmentation or a peak far above current usage,
# or states that it found no memory issue

5.2 Offline RDB Analysis

# Option 1: redis-rdb-tools (Python)
pip install rdbtools python-lzf

rdb --command memory dump.rdb > memory.csv
# CSV columns: database, type, key, size_in_bytes, encoding, num_elements, len_largest_element

# Find the 20 largest keys (numeric sort on column 4 only; assumes key names contain no commas)
sort -t',' -k4,4nr memory.csv | head -20

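# Option 2: HDT3213/rdb (Go; much faster on large RDB files)
# Note: its binary is also named rdb, so it can shadow the rdbtools command on PATH
go install github.com/hdt3213/rdb@latest
rdb -c memory -o memory.csv dump.rdb       # writes a per-key memory report as CSV
rdb -c flamegraph -port 7379 dump.rdb      # spawns an HTTP server; view the flame graph in your browser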
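
A per-key report rarely shows which feature owns the memory; grouping by key prefix usually does. The sketch below assumes the rdbtools CSV layout listed above (the key and size_in_bytes columns) and a colon-separated naming scheme. The helper name bytes_by_prefix is hypothetical.

```python
import csv
from collections import Counter

def bytes_by_prefix(csv_file, sep=":", depth=1):
    """Sum size_in_bytes per key prefix from an rdbtools memory report.

    `depth` is the number of separator-delimited segments kept, so depth=2
    turns 'user:profile:9999' into 'user:profile'.
    """
    totals = Counter()
    for row in csv.DictReader(csv_file):
        prefix = sep.join(row["key"].split(sep)[:depth])
        totals[prefix] += int(row["size_in_bytes"])
    return totals.most_common()
```

Typical use would be opening memory.csv with open("memory.csv", newline="") and passing the file object in; the csv module also handles key names that contain commas, which the sort one-liner does not.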

5.3 Generating an RDB Snapshot

# Trigger a background save (brief fork overhead)
BGSAVE
LASTSAVE    # Unix timestamp of the most recent successful save

# Monitor progress
redis-cli info persistence | grep rdb_bgsave_in_progress
# 0 = save complete; 1 = in progress
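
When the snapshot feeds an automated analysis job, the polling step can be scripted. This is a minimal sketch, assuming a redis-py style client with bgsave() and info(); the function name, poll interval, and timeout are illustrative. Note that BGSAVE itself returns an error if another save is already running, which this sketch lets propagate.

```python
import time

def wait_for_bgsave(client, poll_seconds=0.5, timeout=300.0):
    """Start BGSAVE and block until rdb_bgsave_in_progress returns to 0.

    Returns True if the server reports the last background save as 'ok'.
    """
    client.bgsave()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        info = client.info("persistence")
        if not info["rdb_bgsave_in_progress"]:
            return info["rdb_last_bgsave_status"] == "ok"
        time.sleep(poll_seconds)
    raise TimeoutError("BGSAVE did not finish in time")
```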

6. Big Key Refactoring Playbook

6.1 Scenario: Oversized Hash (100,000 fields)

# Problem: HGETALL user:1000 returns 100,000 fields → 500 ms+
HGETALL user:1000

Solution: shard fields across multiple Hashes

import hashlib

SHARD_COUNT = 16

def _shard(field: str) -> int:
    return int(hashlib.md5(field.encode()).hexdigest(), 16) % SHARD_COUNT

def hset_sharded(base_key: str, field: str, value: str):
    r.hset(f"{base_key}:s{_shard(field)}", field, value)

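def hget_sharded(base_key: str, field: str) -> str | None:
    raw = r.hget(f"{base_key}:s{_shard(field)}", field)
    return raw.decode() if raw is not None else None   # an empty value is still a hit

def hgetall_sharded(base_key: str) -> dict:
    # transaction=False sends plain pipelined commands; the default MULTI/EXEC
    # would run all 16 HGETALLs back to back and block as long as the original key
    pipe = r.pipeline(transaction=False)
    for i in range(SHARD_COUNT):
        pipe.hgetall(f"{base_key}:s{i}")
    result = {}
    for shard_data in pipe.execute():
        result.update({k.decode(): v.decode() for k, v in shard_data.items()})
    return result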
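
Before migrating, it is worth checking that the hash function spreads real field names evenly. The sketch below reuses the MD5-modulo scheme of _shard, with the shard count as a parameter, and counts how many fields land in each shard. The synthetic field names are an assumption for illustration; a real check would use a sample of production field names.

```python
import hashlib
from collections import Counter

def shard_of(field: str, shard_count: int) -> int:
    """MD5-modulo shard index, the same scheme as _shard."""
    return int(hashlib.md5(field.encode()).hexdigest(), 16) % shard_count

def shard_sizes(fields, shard_count=16):
    """Return how many of `fields` land in each shard, indexed by shard number."""
    counts = Counter(shard_of(f, shard_count) for f in fields)
    return [counts.get(i, 0) for i in range(shard_count)]

# 100,000 synthetic fields over 16 shards: the ideal is 100_000 / 16 = 6_250 per shard
sizes = shard_sizes((f"field:{i}" for i in range(100_000)), shard_count=16)
print(min(sizes), max(sizes))
```

If the minimum and maximum sit close to 6,250, each shard's HGETALL returns roughly one sixteenth of the original payload.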

6.2 Scenario: Oversized Set (Millions of Members)

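import zlib

SET_SHARD_COUNT = 100

def _set_shard(member: str) -> int:
    # Built-in hash() is salted per process for str, so the same member could map
    # to a different shard after a restart; CRC32 gives the same result everywhere
    return zlib.crc32(member.encode()) % SET_SHARD_COUNT

def sadd_sharded(base_key: str, member: str):
    r.sadd(f"{base_key}:s{_set_shard(member)}", member)

def sismember_sharded(base_key: str, member: str) -> bool:
    return bool(r.sismember(f"{base_key}:s{_set_shard(member)}", member))

def scard_sharded(base_key: str) -> int:
    pipe = r.pipeline(transaction=False)   # no MULTI/EXEC needed for read-only counts
    for i in range(SET_SHARD_COUNT):
        pipe.scard(f"{base_key}:s{i}")
    return sum(pipe.execute())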

6.3 Scenario: Oversized String Value (> 1 MB)

import gzip, json

def set_compressed(key: str, data: dict, ttl: int = 3600):
    """Compress large JSON payloads before storing."""
    raw = json.dumps(data, ensure_ascii=False).encode()
    compressed = gzip.compress(raw, compresslevel=6)
    compression_ratio = len(raw) / len(compressed)
    print(f"Stored {len(compressed)} bytes (ratio {compression_ratio:.1f}x)")
    r.setex(key, ttl, compressed)

def get_compressed(key: str) -> dict | None:
    compressed = r.get(key)
    if not compressed:
        return None
    return json.loads(gzip.decompress(compressed).decode())

Guideline for values over 100 KB: store the object in object storage (S3/OSS) and keep only a URL + metadata in Redis.


7. Memory Fragmentation

INFO memory | grep mem_fragmentation_ratio
# mem_fragmentation_ratio: 1.89
#   = RSS memory / used_memory
# Healthy range: 1.0 – 1.5
# > 1.5: significant fragmentation — consider active defrag or restart
# < 1.0: memory has been swapped to disk — investigate immediately
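
The thresholds above translate directly into a monitoring check. This is a small sketch, assuming a dictionary with the used_memory_rss and used_memory fields that INFO memory reports (redis-py's info("memory") returns them as integers). The function name and verdict labels are hypothetical.

```python
def classify_fragmentation(info: dict) -> tuple[float, str]:
    """Compute RSS / used_memory and label it using the thresholds above."""
    ratio = info["used_memory_rss"] / info["used_memory"]
    if ratio < 1.0:
        verdict = "possible swapping"
    elif ratio <= 1.5:
        verdict = "healthy"
    else:
        verdict = "fragmented"
    return round(ratio, 2), verdict
```

On nearly empty instances the ratio can look alarming simply because the process baseline dominates RSS, so the absolute difference in bytes is worth reading alongside it.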

7.1 Active Defragmentation (Redis 4.0+)

CONFIG SET activedefrag yes
CONFIG SET active-defrag-ignore-bytes 100mb    # start only if waste > 100 MB
CONFIG SET active-defrag-threshold-lower 10    # start defrag above 10% fragmentation
CONFIG SET active-defrag-threshold-upper 100   # run at full speed above 100%
CONFIG SET active-defrag-cycle-min 1           # minimum CPU % allocated to defrag
CONFIG SET active-defrag-cycle-max 25          # maximum CPU % allocated to defrag

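# One-shot manual purge: asks jemalloc to return unused dirty pages to the OS
# (it does not relocate live data, so it complements active defrag rather than replacing it)
MEMORY PURGE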

7.2 Restart-Based Defragmentation

For severe fragmentation (ratio > 2.0), the fastest remedy is a controlled failover:

  1. Promote a replica to primary.
  2. Restart the old primary — RDB reload produces a compact, unfragmented layout.
  3. Rejoin as a replica.

8. Production Diagnosis Checklist

When Redis response times spike, work through this list in order:

□ 1.  SLOWLOG GET 20                   — what commands are slow?
□ 2.  INFO stats | grep instantaneous  — current QPS
□ 3.  INFO memory                      — used_memory, maxmemory, fragmentation
□ 4.  INFO clients                     — connected_clients, blocked_clients
□ 5.  redis-cli --bigkeys -i 0.1       — any surprise big keys? (run off-peak)
□ 6.  redis-cli --hotkeys              — concentrated traffic on a few keys?
□ 7.  INFO persistence                 — bgsave / AOF rewrite in progress?
□ 8.  INFO replication                 — repl_backlog overflow?
□ 9.  top -p $(pgrep redis-server)     — CPU saturation?
□ 10. ss -tnp | grep 6379 | wc -l     — TCP connection count normal?
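
The INFO-based items on this checklist can be collected in a single pass, so the numbers are captured while the incident is still in progress. The following is a sketch assuming a redis-py style client with info() and slowlog_len(); collect_snapshot and the output key names are hypothetical. The big key and hot key scans are left out on purpose because they are too expensive to run automatically.

```python
def collect_snapshot(client) -> dict:
    """Gather the INFO-based checklist items from one Redis instance."""
    stats = client.info("stats")
    memory = client.info("memory")
    clients = client.info("clients")
    persistence = client.info("persistence")
    return {
        "slowlog_len": client.slowlog_len(),
        "ops_per_sec": stats.get("instantaneous_ops_per_sec"),
        "used_memory": memory.get("used_memory"),
        "maxmemory": memory.get("maxmemory"),
        "fragmentation_ratio": memory.get("mem_fragmentation_ratio"),
        "connected_clients": clients.get("connected_clients"),
        "blocked_clients": clients.get("blocked_clients"),
        "bgsave_in_progress": persistence.get("rdb_bgsave_in_progress"),
        "aof_rewrite_in_progress": persistence.get("aof_rewrite_in_progress"),
    }
```

Logging this dictionary with a timestamp every few seconds during an incident makes it possible to line up latency spikes with a running BGSAVE or a jump in blocked clients afterwards.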
