Chapter 24

Memory Allocation and jemalloc: Defragmentation Internals

Memory management is one of the most critical factors affecting Redis performance and stability. This chapter provides a comprehensive source-code-level analysis covering allocator selection, the zmalloc wrapper layer, jemalloc's internal architecture, online defragmentation (activedefrag), and background lazy freeing (lazyfree).

24.1 Memory Allocator Selection

Redis supports three memory allocators, selected at compile time:

make MALLOC=jemalloc   # Default on Linux (recommended for production)
make MALLOC=libc       # System libc malloc (default on non-Linux platforms such as macOS and the BSDs)
make MALLOC=tcmalloc   # Google tcmalloc (requires libgoogle-perftools-dev)

To check which allocator is in use at runtime:

redis-cli INFO memory | grep mem_allocator
# Output: mem_allocator:jemalloc-5.3.0

Allocator Comparison

Feature	glibc malloc	jemalloc	tcmalloc
Fragmentation rate	Higher (older ptmalloc2 design)	Low (carefully engineered size classes)	Low
Multi-thread performance	Moderate (arena locks can contend)	Excellent (multiple arenas plus per-thread caches)	Excellent (per-thread cache)
Memory return to OS	Slower	Faster (configurable)	Faster
Debug/profiling support	Limited	Rich (stats, heap profiling)	Rich
Production track record	Mature	Mature (Firefox, Facebook, Redis)	Mature (Chrome)

Why Redis chose jemalloc as the Linux default:

glibc ptmalloc2 tends to fragment more under Redis's allocate/free churn, and its arenas can contend when I/O and background threads allocate concurrently
jemalloc's size class design minimizes internal fragmentation
jemalloc supports fine-grained tuning via MALLOC_CONF
Proven stability at Facebook, Twitter, and other high-scale deployments

24.2 The zmalloc Wrapper Layer

Redis never calls malloc directly. Instead it uses zmalloc.c as a wrapper whose primary purpose is tracking total allocated memory.

Core Implementation

// zmalloc.c

// PREFIX_SIZE: extra bytes allocated before the user data
// On 64-bit systems with libc: sizeof(size_t) = 8 bytes for storing the size
// With jemalloc/tcmalloc: PREFIX_SIZE = 0 (we use malloc_usable_size instead)
#ifdef HAVE_MALLOC_SIZE
#define PREFIX_SIZE (0)               // jemalloc / tcmalloc mode
#else
#define PREFIX_SIZE (sizeof(size_t))  // libc mode: manually store size in header
#endif

void *zmalloc(size_t size) {
    void *ptr = malloc(size + PREFIX_SIZE);

    if (!ptr) zmalloc_oom_handler(size);

#ifdef HAVE_MALLOC_SIZE
    // jemalloc mode: use malloc_usable_size to get the actual allocated size
    update_zmalloc_stat_alloc(zmalloc_size(ptr));
    return ptr;
#else
    // libc mode: store the requested size in the first PREFIX_SIZE bytes
    *((size_t*)ptr) = size;
    update_zmalloc_stat_alloc(size + PREFIX_SIZE);
    return (char*)ptr + PREFIX_SIZE;
#endif
}

void zfree(void *ptr) {
    if (ptr == NULL) return;
#ifdef HAVE_MALLOC_SIZE
    update_zmalloc_stat_free(zmalloc_size(ptr));
    free(ptr);
#else
    void *realptr = (char*)ptr - PREFIX_SIZE;
    size_t oldsize = *((size_t*)realptr);
    update_zmalloc_stat_free(oldsize + PREFIX_SIZE);
    free(realptr);
#endif
}
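
The libc-mode branch above (store the size in a header, hand back a pointer just past it) can be tried outside Redis. The following is a minimal sketch, assuming a single-threaded program; the names tracked_malloc, tracked_free, tracked_used, and TRACK_PREFIX are hypothetical and do not appear in zmalloc.c. Unlike the real code, it uses a two-word header so the returned pointer keeps 16-byte alignment on 64-bit systems.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical header size: two machine words keep the user pointer
 * aligned like a plain malloc() result on common 64-bit platforms. */
#define TRACK_PREFIX (2 * sizeof(size_t))

/* Plain counter: this sketch is single-threaded, unlike Redis. */
static size_t tracked_bytes = 0;

/* Allocate size bytes plus a header that remembers the requested size. */
void *tracked_malloc(size_t size) {
    unsigned char *raw = malloc(size + TRACK_PREFIX);
    if (raw == NULL) return NULL;
    *(size_t *)raw = size;                 /* store size in the header */
    tracked_bytes += size + TRACK_PREFIX;  /* count payload and header */
    return raw + TRACK_PREFIX;             /* caller sees only the payload */
}

/* Step back to the header, read the size, and undo the accounting. */
void tracked_free(void *ptr) {
    if (ptr == NULL) return;
    unsigned char *raw = (unsigned char *)ptr - TRACK_PREFIX;
    size_t size = *(size_t *)raw;
    tracked_bytes -= size + TRACK_PREFIX;
    free(raw);
}

/* Current total, the analogue of zmalloc_used_memory(). */
size_t tracked_used(void) { return tracked_bytes; }
```

With an allocator that can report the usable size of a pointer, the header becomes unnecessary, which is why the jemalloc branch sets PREFIX_SIZE to 0.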

used_memory Tracking

// Atomically update the memory usage counter
// (this simplified version issues one atomic add per allocation)
#define update_zmalloc_stat_alloc(__n) do { \
    size_t _n = (__n); \
    /* Round up to sizeof(long) boundary */ \
    if (_n & (sizeof(long) - 1)) \
        _n += sizeof(long) - (_n & (sizeof(long) - 1)); \
    atomicIncr(used_memory, _n); \
} while(0)

// Current total allocation in bytes
size_t zmalloc_used_memory(void) {
    size_t um;
    atomicGet(used_memory, um);
    return um;
}
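
The rounding and atomic accounting in the macro above can be sketched with C11 atomics. This is an illustration under assumptions, not the Redis implementation: the counter and the sketch_* function names are hypothetical, and relaxed memory ordering is assumed to be sufficient because the counter is only a statistic.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for Redis's used_memory counter. */
static atomic_size_t sketch_used_memory;

/* Round n up to the next multiple of sizeof(long), as the macro does. */
size_t round_up_long(size_t n) {
    size_t mask = sizeof(long) - 1;
    return (n & mask) ? n + (sizeof(long) - (n & mask)) : n;
}

/* Add the rounded size to the counter with one atomic operation. */
void sketch_stat_alloc(size_t n) {
    atomic_fetch_add_explicit(&sketch_used_memory, round_up_long(n),
                              memory_order_relaxed);
}

/* Subtract the same rounded size when the allocation is released. */
void sketch_stat_free(size_t n) {
    atomic_fetch_sub_explicit(&sketch_used_memory, round_up_long(n),
                              memory_order_relaxed);
}

/* Read the current total. */
size_t sketch_used(void) {
    return atomic_load_explicit(&sketch_used_memory, memory_order_relaxed);
}
```

A 13-byte allocation is therefore counted as 16 bytes, whether long is 4 or 8 bytes wide.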

The gap between used_memory and RSS:

used_memory: bytes Redis believes it has allocated (tracked by zmalloc)
used_memory_rss: physical memory the OS has assigned to the process (on Linux, read from /proc/<pid>/stat)
The difference comes from memory fragmentation, allocator-retained pages, and OS page rounding

24.3 jemalloc's Three-Layer Architecture

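Understanding jemalloc's internals explains why Redis behaves the way it does under memory pressure. The chunk and run terminology below follows jemalloc 3.x and 4.x; jemalloc 5 replaced fixed-size chunks with variable-size extents and calls runs slabs, but the layering is the same.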

Hierarchy

OS (operating system)
    │
    │  mmap / brk (in 2MB chunks)
    ▼
Chunk (2MB block obtained from OS)
    │
    │  Split by size class into runs
    ▼
Run (called a slab in jemalloc 5): a contiguous range of pages for one size class
    │
    │  Run subdivided into equally-sized regions
    ▼
Region — the actual memory returned to the caller

Arenas: Reducing Lock Contention

jemalloc creates (CPU count × 4) arenas by default

Thread 1 ──→ Arena 0
Thread 2 ──→ Arena 1
Thread 3 ──→ Arena 2
Thread 4 ──→ Arena 0  (round-robin assignment)

Each arena independently manages its own Chunks/Bins/Runs.
Threads almost never need to synchronize with each other.
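
The diagram's assignment rule can be written down in a few lines. This is a rough sketch under the chapter's round-robin description; real jemalloc picks an arena when a thread first allocates and takes arena load into account. Both function names are hypothetical.

```c
/* Default arena count as described above: four per CPU,
 * or a single arena on a single-CPU machine. */
unsigned default_narenas(unsigned ncpus) {
    return ncpus > 1 ? ncpus * 4 : 1;
}

/* Round-robin mapping from a thread's creation index to an arena.
 * narenas must be at least 1. */
unsigned arena_for_thread(unsigned thread_index, unsigned narenas) {
    return thread_index % narenas;
}
```

With three arenas, the fourth thread (index 3) wraps around to arena 0, matching the diagram.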

Configuring via MALLOC_CONF:

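# Set number of arenas (default: CPU count × 4)
# Note: Redis's bundled jemalloc is built with the je_ prefix, so with that
# build the prefixed variable JE_MALLOC_CONF may be required instead
MALLOC_CONF=narenas:8 redis-server redis.conf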

# View jemalloc's internal statistics
redis-cli MEMORY MALLOC-STATS

Bins and Size Classes

jemalloc predefines a fine-grained sequence of size classes to minimize internal fragmentation:

Small objects (0–14KB):
  8, 16, 32, 48, 64, 80, 96, 112, 128, 160, 192, 224, 256,
  320, 384, 448, 512, 640, 768, 896, 1024, 1280, 1536, 1792, 2048,
  2560, 3072, 3584, 4096, 5120, 6144, 7168, 8192, 10240, 12288, 14336

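Medium objects (14KB–4MB): rounded up to page-aligned size classes (4KB pages), not to 2MB boundaries
Large objects (>4MB): satisfied directly via mmap
(The exact boundaries depend on the jemalloc version and its chunk size.)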

Internal fragmentation analysis:

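Request 100 bytes → jemalloc allocates 112 bytes → waste: 12 bytes (12%)
Request 200 bytes → jemalloc allocates 224 bytes → waste: 24 bytes (12%)
Power-of-two rounding: request 100 bytes → allocates 128 bytes → waste: 28 bytes (28%)

Compared with naive power-of-two rounding, jemalloc's class spacing roughly halves the internal fragmentation per allocation, which compounds significantly at millions of keys. glibc's ptmalloc2 has a similarly small per-allocation overhead (a 100-byte request occupies a 112-byte chunk on 64-bit systems); where it typically loses to jemalloc on Redis workloads is external fragmentation.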
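
The jemalloc rounding figures above can be reproduced with a small lookup over the small-object class list from this section. This is a sketch, assuming that table (4KB pages) is the one in effect; small_size_class and internal_waste are hypothetical helper names, not jemalloc functions.

```c
#include <stddef.h>

/* Small size classes exactly as listed in this section. */
static const size_t small_classes[] = {
    8, 16, 32, 48, 64, 80, 96, 112, 128, 160, 192, 224, 256,
    320, 384, 448, 512, 640, 768, 896, 1024, 1280, 1536, 1792, 2048,
    2560, 3072, 3584, 4096, 5120, 6144, 7168, 8192, 10240, 12288, 14336
};
#define N_SMALL (sizeof(small_classes) / sizeof(small_classes[0]))

/* Smallest class that fits the request, or 0 if the request is too
 * large to be a small object. */
size_t small_size_class(size_t request) {
    for (size_t i = 0; i < N_SMALL; i++)
        if (small_classes[i] >= request) return small_classes[i];
    return 0;
}

/* Bytes lost to rounding for a small request (0 for non-small). */
size_t internal_waste(size_t request) {
    size_t cls = small_size_class(request);
    return cls ? cls - request : 0;
}
```

For example, small_size_class(100) yields 112 and internal_waste(200) yields 24, the two jemalloc cases worked through above.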

24.4 Memory Fragmentation Analysis

Key Fields in INFO memory

redis-cli INFO memory

Annotated output:

# Memory
used_memory:1073741824          # 1 GB: what Redis believes it has allocated
used_memory_human:1.00G
used_memory_rss:1610612736      # 1.5 GB: actual RSS from the OS perspective
used_memory_rss_human:1.50G
used_memory_peak:1073741824     # Historical peak allocation
used_memory_peak_human:1.00G
used_memory_peak_perc:100.00%   # current / peak ratio
used_memory_overhead:847249408  # Redis internal overhead (dicts, obj headers, eventloop)
used_memory_startup:895776      # Baseline memory at startup
used_memory_dataset:226492416   # Pure data memory (used_memory - overhead)
used_memory_dataset_perc:21.09% # Data as fraction of total allocation
allocator_allocated:1073815552  # What the allocator thinks it has given out
allocator_active:1342177280     # Bytes in pages that hold live allocations (includes external fragmentation)
allocator_resident:1610612736   # Bytes physically resident for the allocator, including pages it could purge (≈ RSS)
mem_fragmentation_ratio:1.50    # = used_memory_rss / used_memory (THE key metric)
mem_fragmentation_bytes:536870912 # Fragmentation in bytes
mem_not_counted_for_evict:0     # Memory excluded from maxmemory accounting
mem_replication_backlog:1048576 # Replication backlog buffer
mem_total_replication_buffers:2097152
mem_clients_slaves:0            # Memory for replica client objects
mem_clients_normal:20512        # Memory for normal client objects
mem_cluster_links:0             # Memory for cluster bus connections
mem_aof_buffer:8                # AOF buffer memory
mem_allocator:jemalloc-5.3.0
active_defrag_running:0         # Is active defrag currently running?
lazyfree_pending_objects:0      # Objects queued for async free
lazyfreed_objects:0             # Total objects async-freed since start

Fragmentation Ratio Interpretation

mem_fragmentation_ratio interpretation:
  < 1.0   → Using swap — CRITICAL, system is out of physical RAM
  1.0–1.1 → Healthy, minimal fragmentation
  1.1–1.5 → Mild fragmentation (acceptable for most workloads)
  1.5–2.0 → Significant fragmentation — consider enabling activedefrag
  > 2.0   → Severe fragmentation — immediate action required

Root causes of fragmentation:

Mix of objects with widely varying sizes
High churn of SET/DEL operations (memory repeatedly allocated and freed)
Key expiration creating holes in memory pages
Large values allocated via mmap that don't immediately return pages to the OS after freeing
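
The ratio bands from the interpretation list earlier in this section can be turned into a small classifier, for instance inside a monitoring script. This is a sketch: the enum and function names are hypothetical, and because the listed bands share their endpoints, the choice to put a boundary value such as 1.5 into the higher band is an assumption.

```c
/* Hypothetical labels for the bands listed in this section. */
enum frag_band {
    FRAG_SWAP_SUSPECTED,  /* ratio < 1.0 */
    FRAG_HEALTHY,         /* 1.0 up to 1.1 */
    FRAG_MILD,            /* 1.1 up to 1.5 */
    FRAG_SIGNIFICANT,     /* 1.5 through 2.0 */
    FRAG_SEVERE           /* above 2.0 */
};

/* mem_fragmentation_ratio = used_memory_rss / used_memory. */
double frag_ratio(unsigned long long rss, unsigned long long used) {
    return used ? (double)rss / (double)used : 0.0;
}

/* mem_fragmentation_bytes = used_memory_rss - used_memory. */
long long frag_bytes(unsigned long long rss, unsigned long long used) {
    return (long long)rss - (long long)used;
}

/* Map a ratio to its band; shared endpoints go to the higher band,
 * except 2.0 itself, which stays in "significant". */
enum frag_band frag_band(double ratio) {
    if (ratio < 1.0) return FRAG_SWAP_SUSPECTED;
    if (ratio < 1.1) return FRAG_HEALTHY;
    if (ratio < 1.5) return FRAG_MILD;
    if (ratio <= 2.0) return FRAG_SIGNIFICANT;
    return FRAG_SEVERE;
}
```

Feeding in the values from the annotated INFO output (RSS 1610612736, used 1073741824) gives a ratio of 1.5 and 536870912 fragmentation bytes.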

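24.5 activedefrag: Online Defragmentation

Introduced in Redis 4.0, activedefrag compacts memory incrementally while the server keeps running, so no restart is needed. It relies on a defrag hint patched into Redis's bundled jemalloc, so it is unavailable in libc or tcmalloc builds.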

How It Works

// defrag.c (internal implementation)
// Core idea:
// 1. Scan each key in the database
// 2. For each value, check whether its memory address is in a fragmented slab
// 3. If yes, allocate fresh memory, copy the data, update all pointers
// 4. Free the old memory → jemalloc reclaims the fragmented slab

// Simplified sketch of the cycle, not the verbatim source
void activeDefragCycle(void) {
    // The scan position must persist across calls so that each cycle
    // resumes where the previous one stopped
    static unsigned long cursor = 0;

    // The CPU budget scales with the current fragmentation level and stays
    // between active-defrag-cycle-min and active-defrag-cycle-max percent
    // (computeDefragCycles() in the real source)

    // Scan the current DB's dictionary
    // dictScanDefrag is a defrag-aware scan that visits one bucket per call
    cursor = dictScanDefrag(server.db[current_db].dict,
                            cursor,
                            defragCallback,
                            &server.db[current_db]);
}

// Reallocate a single object to a fresh memory location
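void *activeDefragAlloc(void *ptr) {
    size_t size;
    void *newptr;

    // Ask jemalloc whether this address is worth defragmenting
    // je_get_defrag_hint() returns true if the slab utilization is below threshold
    if (!je_get_defrag_hint(ptr)) return NULL;

    // Allocate fresh memory and copy. zmalloc() aborts on OOM rather than
    // returning NULL, so no NULL check is needed. The real source uses
    // zmalloc_no_tcache()/zfree_no_tcache() so the thread cache cannot hand
    // back the very region that is being vacated.
    size = zmalloc_size(ptr);
    newptr = zmalloc(size);
    memcpy(newptr, ptr, size);   // Copy the size of the OLD allocation
    zfree(ptr);
    return newptr;
}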

Configuration Reference

# redis.conf

# Enable online defragmentation (default: no)
activedefrag yes

# Only start defragmenting if fragmentation exceeds this many bytes (default: 100mb)
active-defrag-ignore-bytes 100mb

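# Start defragmenting when the fragmentation percentage exceeds this (default: 10).
# Redis derives the percentage from allocator statistics, roughly
# (allocator_active / allocator_allocated - 1) × 100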
active-defrag-threshold-lower 10

# Run at maximum CPU when fragmentation exceeds this threshold (default: 100 = ratio 2.0)
active-defrag-threshold-upper 100

# Minimum CPU% to dedicate to defragmentation (default: 1)
active-defrag-cycle-min 1

# Maximum CPU% to dedicate to defragmentation (default: 25)
active-defrag-cycle-max 25

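# Maximum number of set/hash/zset/list fields processed within one main-dictionary
# scan step; larger collections are deferred and defragmented separately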
active-defrag-max-scan-fields 1000
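
How the four threshold and cycle settings interact is easier to see as arithmetic: effort is interpolated linearly between cycle-min and cycle-max as fragmentation moves from threshold-lower to threshold-upper. The following is a minimal sketch modeled on that idea; the struct, its field names, and defrag_cpu_percent are hypothetical, and the real logic in defrag.c has additional conditions (for example, it behaves differently once a defrag run is already in progress).

```c
/* Hypothetical mirror of the redis.conf settings above. */
struct defrag_cfg {
    unsigned long long ignore_bytes;  /* active-defrag-ignore-bytes */
    int threshold_lower;              /* percent */
    int threshold_upper;              /* percent */
    int cycle_min;                    /* CPU percent */
    int cycle_max;                    /* CPU percent */
};

/* Returns the CPU percent to spend on defrag, or 0 if defrag
 * should not start at this fragmentation level. */
int defrag_cpu_percent(const struct defrag_cfg *c, double frag_pct,
                       unsigned long long frag_bytes) {
    /* Both the percentage and the absolute waste must be large enough. */
    if (frag_pct < c->threshold_lower || frag_bytes < c->ignore_bytes)
        return 0;
    /* Degenerate configuration: avoid dividing by zero. */
    if (c->threshold_upper <= c->threshold_lower)
        return c->cycle_max;
    /* Linear interpolation between (lower, min) and (upper, max). */
    double pct = c->cycle_min +
                 (frag_pct - c->threshold_lower) *
                 (c->cycle_max - c->cycle_min) /
                 (double)(c->threshold_upper - c->threshold_lower);
    /* Clamp to the configured range. */
    if (pct < c->cycle_min) pct = c->cycle_min;
    if (pct > c->cycle_max) pct = c->cycle_max;
    return (int)pct;
}
```

With the defaults shown above, 55% fragmentation lands halfway between the thresholds and yields 13% CPU, while anything at or beyond 100% is capped at 25%.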

Pause Characteristics During Defragmentation

activedefrag is not completely pause-free:

Moving small objects (< 1KB):   pause < 1µs (negligible)
Moving medium objects (< 64KB): pause < 10µs
Moving large objects (> 1MB):   pause 1–10ms (noticeable for latency-sensitive apps)

Operational recommendations:
- During off-peak hours: raise active-defrag-cycle-max to 50–75% for faster progress
- During peak hours: lower active-defrag-cycle-max to 5–10% to minimize impact
- For latency-sensitive services (P99 < 1ms requirement):
  Do NOT enable activedefrag. Instead, handle fragmentation by periodically
  restarting replicas and then promoting them.

24.6 lazyfree: Background Asynchronous Object Freeing

Synchronously freeing a large object with free() can block the main thread for milliseconds or even tens of milliseconds. Redis 4.0 introduced lazyfree to make freeing asynchronous.

DEL vs UNLINK

# DEL: synchronous deletion (large objects will block the main thread)
DEL bigkey

# UNLINK: asynchronous deletion (main thread only severs the reference;
# the bio background thread performs the actual free)
UNLINK bigkey

# Practical comparison:
redis-cli DEL biglist    # A list with 1 million elements may block for 100ms+
redis-cli UNLINK biglist # Returns immediately; freeing happens in background

lazyfree Source Code Flow

// lazyfree.c

int dbAsyncDelete(redisDb *db, robj *key) {
    // Remove from expires dict (small operation, always synchronous)
    if (dictSize(db->expires) > 0)
        dictDelete(db->expires, key->ptr);

    // Unlink the entry from the main dict WITHOUT freeing the value yet
    dictEntry *de = dictUnlink(db->dict, key->ptr);
    if (de) {
        robj *val = dictGetVal(de);

        // Estimate the cost of freeing this object
        size_t free_effort = lazyfreeGetFreeEffort(key, val);

        // If cost is high enough, free asynchronously
        if (free_effort > LAZYFREE_THRESHOLD && val->refcount == 1) {
            atomicIncr(lazyfree_objects, 1);
            // Submit to the BIO_LAZY_FREE queue
            bioCreateLazyFreeJob(lazyfreeFreeObject, 1, val);
            dictSetVal(db->dict, de, NULL);  // Detach the value so it is not freed below
        }
        // Frees the entry and the key now; the value is also freed here,
        // synchronously, if it was not handed to the background thread
        dictFreeUnlinkedEntry(db->dict, de);
        return 1;  // Key existed and was deleted
    }
    return 0;      // Key not found
}

// The actual free function executed by the bio thread
void lazyfreeFreeObject(void *args[]) {
    robj *o = (robj *) args[0];
    decrRefCount(o);  // When refcount hits 0, the object is truly freed
    atomicDecr(lazyfree_objects, 1);
}

lazyfreeGetFreeEffort: Cost Estimation

size_t lazyfreeGetFreeEffort(robj *key, robj *obj) {
    if (obj->type == OBJ_LIST) {
        quicklist *ql = obj->ptr;
        return ql->len;                 // Number of quicklist nodes (each node packs many elements)
    } else if (obj->type == OBJ_SET &&
               obj->encoding == OBJ_ENCODING_HT) {
        dict *ht = obj->ptr;
        return dictSize(ht);            // Number of set members
    } else if (obj->type == OBJ_ZSET &&
               obj->encoding == OBJ_ENCODING_SKIPLIST) {
        zset *zs = obj->ptr;
        return zs->zsl->length;         // Number of sorted set members
    } else if (obj->type == OBJ_HASH &&
               obj->encoding == OBJ_ENCODING_HT) {
        dict *ht = obj->ptr;
        return dictSize(ht);            // Number of hash fields
    } else if (obj->type == OBJ_STREAM) {
        size_t effort = 0;
        stream *s = obj->ptr;
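        effort += s->rax->numnodes;     // Rax nodes storing the entries (one allocation each)
        if (s->cgroups)                 // Consumer groups may not exist
            effort += raxSize(s->cgroups);  // Simplified: the real code also weighs each group's PEL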
        return effort;
    } else {
        return 1;  // Strings and other simple objects: cost = 1
    }
}

// Objects with effort > 64 are freed asynchronously
#define LAZYFREE_THRESHOLD 64
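
The decision rule that ties the effort estimate to the threshold is small enough to isolate. A minimal sketch follows, with hypothetical names (SKETCH_LAZYFREE_THRESHOLD, should_free_async) standing in for the condition inside dbAsyncDelete.

```c
#include <stddef.h>

/* Same value as LAZYFREE_THRESHOLD above; renamed to mark it as a sketch. */
#define SKETCH_LAZYFREE_THRESHOLD 64

/* Returns 1 if the value should go to the background thread.
 * Shared objects (refcount > 1) are never queued, because another
 * owner may still be using them on the main thread. */
int should_free_async(size_t free_effort, int refcount) {
    return free_effort > SKETCH_LAZYFREE_THRESHOLD && refcount == 1;
}
```

Note the strict comparison: an object with an effort of exactly 64 is still freed synchronously, and so is every plain string, whose effort is always 1 regardless of its size in bytes.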

lazyfree Configuration

# redis.conf

# Whether key expiration triggers async freeing (default: no — recommend: yes)
lazyfree-lazy-expire yes

# Whether server-internal deletions (e.g., RENAME overwriting a key) are async
# (default: no — recommend: yes)
lazyfree-lazy-server-del yes

# Whether a replica flushes its old dataset asynchronously before loading the
# RDB it receives during a full resynchronization
# (default: no; recommend: yes, since it prevents multi-second pauses on replicas)
replica-lazy-flush yes

# Whether memory eviction uses async freeing (default: no — recommend: yes)
lazyfree-lazy-eviction yes

# Whether user-issued DEL behaves like UNLINK and frees asynchronously (Redis 6.0+)
lazyfree-lazy-user-del yes

# Whether FLUSHDB/FLUSHALL issued without an explicit SYNC or ASYNC flag run asynchronously (Redis 6.2+)
lazyfree-lazy-user-flush yes

24.7 bio.c: Background I/O Threads

bio.c implements Redis's background thread mechanism for operations that should not block the main thread.

// bio.c
// Three categories of background tasks:
#define BIO_CLOSE_FILE   0  // Async close() of file descriptors
#define BIO_AOF_FSYNC    1  // Background fsync for AOF (appendfsync everysec; "always" fsyncs on the main thread)
#define BIO_LAZY_FREE    2  // Lazy freeing of objects and datasets

// Task structure
struct bio_job {
    time_t time;               // Submission timestamp
    void (*free_fn)(void *args[]); // The function to execute
    void *args[3];             // Arguments to pass to free_fn
};

// Each task type has its own queue, mutex, and thread
static pthread_t bio_threads[BIO_NUM_OPS];
static pthread_mutex_t bio_mutex[BIO_NUM_OPS];
static pthread_cond_t bio_newjob_cond[BIO_NUM_OPS];
static list *bio_jobs[BIO_NUM_OPS];

// Background thread worker loop
void *bioProcessBackgroundJobs(void *arg) {
    int type = (unsigned long) arg;

    // Block SIGALRM in this thread so the watchdog signal is always
    // delivered to the main thread (bio.c does not change scheduling priority)
    sigset_t sigset;
    sigemptyset(&sigset);
    sigaddset(&sigset, SIGALRM);
    pthread_sigmask(SIG_BLOCK, &sigset, NULL);

    pthread_mutex_lock(&bio_mutex[type]);
    while (1) {
        listNode *ln = listFirst(bio_jobs[type]);
        if (ln == NULL) {
            // Queue empty — block until a new job arrives
            pthread_cond_wait(&bio_newjob_cond[type], &bio_mutex[type]);
            continue;
        }

        // Dequeue the job and release the lock while executing
        struct bio_job *job = ln->value;
        listDelNode(bio_jobs[type], ln);
        pthread_mutex_unlock(&bio_mutex[type]);

        // Execute the appropriate operation
        if (type == BIO_CLOSE_FILE) {
            close((long)job->args[0]);
        } else if (type == BIO_AOF_FSYNC) {
            redis_fsync((long)job->args[0]);
        } else if (type == BIO_LAZY_FREE) {
            job->free_fn(job->args);  // e.g., lazyfreeFreeObject
        }

        zfree(job);
        pthread_mutex_lock(&bio_mutex[type]);
    }
}
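
The worker loop above follows a standard mutex plus condition variable pattern, which can be reproduced as a standalone program. This is a sketch under assumptions, not bio.c: every identifier (job_queue, jq_start, jq_submit, jq_stop, free_buffer_job) is hypothetical, there is a single queue instead of one per job type, and a stop flag is added so the thread can be joined, which the Redis loop does not need.

```c
#include <pthread.h>
#include <stdlib.h>

typedef void (*job_fn)(void *arg);

struct job { job_fn fn; void *arg; struct job *next; };

struct job_queue {
    pthread_mutex_t mutex;
    pthread_cond_t newjob;
    struct job *head, *tail;   /* FIFO list of pending jobs */
    int stopping;              /* set by jq_stop(): drain, then exit */
    unsigned long processed;   /* jobs completed so far */
    pthread_t thread;
};

/* Worker: sleep while the queue is empty, run jobs without the lock. */
static void *jq_worker(void *p) {
    struct job_queue *q = p;
    pthread_mutex_lock(&q->mutex);
    for (;;) {
        if (q->head == NULL) {
            if (q->stopping) break;
            pthread_cond_wait(&q->newjob, &q->mutex);
            continue;
        }
        struct job *j = q->head;
        q->head = j->next;
        if (q->head == NULL) q->tail = NULL;
        pthread_mutex_unlock(&q->mutex);
        j->fn(j->arg);                 /* the slow part runs unlocked */
        free(j);
        pthread_mutex_lock(&q->mutex);
        q->processed++;
    }
    pthread_mutex_unlock(&q->mutex);
    return NULL;
}

/* Initialize the queue and start its worker thread. */
int jq_start(struct job_queue *q) {
    q->head = q->tail = NULL;
    q->stopping = 0;
    q->processed = 0;
    if (pthread_mutex_init(&q->mutex, NULL) != 0) return -1;
    if (pthread_cond_init(&q->newjob, NULL) != 0) return -1;
    return pthread_create(&q->thread, NULL, jq_worker, q) == 0 ? 0 : -1;
}

/* Append a job and wake the worker; the caller returns immediately. */
int jq_submit(struct job_queue *q, job_fn fn, void *arg) {
    struct job *j = malloc(sizeof(*j));
    if (j == NULL) return -1;
    j->fn = fn; j->arg = arg; j->next = NULL;
    pthread_mutex_lock(&q->mutex);
    if (q->tail) q->tail->next = j; else q->head = j;
    q->tail = j;
    pthread_cond_signal(&q->newjob);
    pthread_mutex_unlock(&q->mutex);
    return 0;
}

/* Let the worker drain the remaining jobs, then join and clean up. */
void jq_stop(struct job_queue *q) {
    pthread_mutex_lock(&q->mutex);
    q->stopping = 1;
    pthread_cond_signal(&q->newjob);
    pthread_mutex_unlock(&q->mutex);
    pthread_join(q->thread, NULL);
    pthread_mutex_destroy(&q->mutex);
    pthread_cond_destroy(&q->newjob);
}

/* Example job: release a buffer off the submitting thread. */
void free_buffer_job(void *arg) { free(arg); }
```

A caller would run jq_start(&q), submit work with jq_submit(&q, free_buffer_job, buffer), and finish with jq_stop(&q). Releasing the mutex around the job is what keeps submission cheap for the submitting thread even when a single job takes a long time.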

24.8 Memory Analysis Tools

Per-Key Memory Analysis

# Estimated memory cost of a single key in bytes (nested types are sampled)
redis-cli MEMORY USAGE mykey

# Increase sampling depth for accuracy with collection types
redis-cli MEMORY USAGE mykey SAMPLES 100

# Diagnostic advice in plain English
redis-cli MEMORY DOCTOR
# Illustrative output (actual wording varies by version):
# - High fragmentation: RSS is 1.5G, but used_memory is 1.0G.
#   mem_fragmentation_ratio = 1.50. This could be due to memory fragmentation.
#   To reduce it, configure 'activedefrag yes'.

Big Key Scanning

# Find the biggest key of each type (by string length or element count)
redis-cli --bigkeys
# Output:
# Biggest string found 'user:profile:12345' has 51200 bytes
# Biggest list   found 'events:queue'       has 100000 items
# Biggest hash   found 'product:catalog'    has 50000 fields

# Memory distribution scan (estimated memory per key)
redis-cli --memkeys

# Hot key analysis (Redis 4.0+, requires an LFU maxmemory-policy such as allkeys-lfu)
redis-cli --hotkeys

OBJECT Introspection

# Inspect internal encoding
OBJECT ENCODING mykey         # listpack, skiplist, embstr, raw, int, ...

# Seconds since last access (LRU mode)
OBJECT IDLETIME mykey

# Reference count (usually 1; shared integers 0–9999 report INT_MAX)
OBJECT REFCOUNT mykey

# Access frequency counter (LFU mode only)
OBJECT FREQ mykey

Memory Reclamation Operations

# Ask jemalloc to release dirty (freed but still retained) pages to the OS (brief stall; jemalloc only)
MEMORY PURGE

# Inspect jemalloc's internal state
MEMORY MALLOC-STATS

# Dynamically adjust defragmentation aggressiveness
CONFIG SET active-defrag-cycle-max 50
CONFIG SET activedefrag yes

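# There is no command that forces a full defrag pass on demand.
# The DEBUG subcommand below is a testing knob for quicklist node storage, unrelated to defrag:
DEBUG QUICKLIST-PACKED-THRESHOLD 1  # Sets the size above which list elements are stored as plain nodes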

24.9 Production Memory Optimization Guidelines

Recommended Configuration

# 1. Enable lazyfree everywhere to prevent large-object blocking
lazyfree-lazy-expire yes
lazyfree-lazy-server-del yes
lazyfree-lazy-eviction yes
replica-lazy-flush yes
lazyfree-lazy-user-del yes

# 2. Enable activedefrag when fragmentation is an issue
activedefrag yes
active-defrag-ignore-bytes 100mb
active-defrag-threshold-lower 10
active-defrag-threshold-upper 100
active-defrag-cycle-min 1
active-defrag-cycle-max 25

# 3. jemalloc is already the default on Linux — no extra config needed

Data Structure Sizing for Memory Efficiency

# listpack encoding (small objects) saves 50-70% vs skiplist/hashtable
# Keep data counts within thresholds:

# Hash: listpack thresholds
CONFIG GET hash-max-listpack-entries  # Default: 512 fields
CONFIG GET hash-max-listpack-value    # Default: 64 bytes per value

# ZSet: listpack thresholds
CONFIG GET zset-max-listpack-entries  # Default: 128 members
CONFIG GET zset-max-listpack-value    # Default: 64 bytes per member

# Set: intset threshold
CONFIG GET set-max-intset-entries     # Default: 512 integers

# List: quicklist node size limit
CONFIG GET list-max-listpack-size     # Default: -2 (each node holds up to 8KB; positive values cap elements per node)
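
Whether a hash keeps its compact encoding depends on two limits at once. A minimal sketch of that check, with a hypothetical function name and the limits passed in explicitly so they can be read from CONFIG GET rather than assumed:

```c
#include <stddef.h>

/* Returns 1 if a hash with this many fields, whose longest field name
 * or value is longest_item bytes, stays within both listpack limits.
 * max_entries and max_value correspond to hash-max-listpack-entries
 * and hash-max-listpack-value. */
int hash_fits_listpack(size_t fields, size_t longest_item,
                       size_t max_entries, size_t max_value) {
    return fields <= max_entries && longest_item <= max_value;
}
```

A single oversized value is enough to convert the whole hash to a hash table, and the conversion is one-way: shrinking the hash afterwards does not bring the listpack encoding back.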

Memory Monitoring Alerts

# Prometheus alert rules (using redis_exporter metrics):

# Fragmentation ratio alert
redis_mem_fragmentation_ratio > 1.5  # Warning
redis_mem_fragmentation_ratio > 2.0  # Critical

# Memory utilization alert
redis_memory_used_bytes / redis_memory_max_bytes > 0.85  # Warning
redis_memory_used_bytes / redis_memory_max_bytes > 0.95  # Critical

# Lazyfree backlog alert (indicates large object deletion pressure)
redis_lazyfree_pending_objects > 10000  # Warning

Disable Transparent Huge Pages (Critical for Production)

# THP causes massive Copy-on-Write overhead during fork() for RDB saves
# Add to /etc/rc.local or systemd unit:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Verify
cat /sys/kernel/mm/transparent_hugepage/enabled
# Expected: always madvise [never]

Chapter Summary

Redis supports jemalloc, libc, and tcmalloc allocators; jemalloc is the Linux default and is strongly recommended for production
The zmalloc layer tracks used_memory; the gap between it and OS RSS measures fragmentation plus allocator overhead
jemalloc's layered design (per-arena Chunks → Runs → Regions, organized by size-class bins) delivers low fragmentation and high concurrency; its fine-grained size classes waste roughly half as much per allocation as power-of-two rounding
Monitor mem_fragmentation_ratio: enable activedefrag above 1.5; treat values above 2.0 as urgent
activedefrag works by scanning keys, reallocating fragmented values to fresh memory, and updating pointers — all without stopping the world for small objects
lazyfree/UNLINK moves large object freeing to a bio background thread, eliminating the 10–100ms main-thread stalls that can occur when deleting million-element collections
bio.c runs one background thread per job type (file close, AOF fsync, and lazy free in the versions covered here), keeping these slow operations off the main thread
Enable all lazyfree options + activedefrag, disable THP, and alert on fragmentation ratio and lazyfree backlog in production
