Memory Allocation and jemalloc: Defragmentation Internals
Chapter 24: Memory Allocation and jemalloc - Fragmentation Internals
Memory management is one of the most critical factors affecting Redis performance and stability. This chapter provides a comprehensive source-code-level analysis covering allocator selection, the zmalloc wrapper layer, jemalloc's internal architecture, online defragmentation (activedefrag), and background lazy freeing (lazyfree).
24.1 Memory Allocator Selection
Redis supports three memory allocators, selected at compile time:
make MALLOC=jemalloc # Default on Linux (recommended for production)
make MALLOC=libc # System malloc (default on macOS/BSD; on Linux this selects glibc's malloc)
make MALLOC=tcmalloc # Google tcmalloc (requires libgoogle-perftools-dev)
To check which allocator is in use at runtime:
redis-cli INFO memory | grep mem_allocator
# Output: mem_allocator:jemalloc-5.3.0
Allocator Comparison
| Feature | glibc malloc | jemalloc | tcmalloc |
|---|---|---|---|
| Fragmentation rate | Higher (older ptmalloc2 design) | Low (carefully engineered size classes) | Low |
| Multi-thread performance | Moderate (arenas and locks, contention under heavy threading) | Excellent (multiple arenas plus thread caches) | Excellent (per-thread cache) |
| Memory return to OS | Slower | Faster (configurable) | Faster |
| Debug/profiling support | Limited | Rich (stats, heap profiling) | Rich |
| Production track record | Mature | Mature (Firefox, Facebook, Redis) | Mature (Chrome) |
Why Redis chose jemalloc as the Linux default:
- glibc ptmalloc2 suffers from lock contention under concurrent writes
- jemalloc's size class design minimizes internal fragmentation
- jemalloc supports fine-grained tuning via MALLOC_CONF
- Proven stability at Facebook, Twitter, and other high-scale deployments
24.2 The zmalloc Wrapper Layer
Redis never calls malloc directly. Instead it uses zmalloc.c as a wrapper whose primary purpose is tracking total allocated memory.
Core Implementation
// zmalloc.c
// PREFIX_SIZE: extra bytes allocated before the user data
// On 64-bit systems with libc: sizeof(size_t) = 8 bytes for storing the size
// With jemalloc/tcmalloc: PREFIX_SIZE = 0 (we use malloc_usable_size instead)
#ifdef HAVE_MALLOC_SIZE
#define PREFIX_SIZE (0) // jemalloc / tcmalloc mode
#else
#define PREFIX_SIZE (sizeof(size_t)) // libc mode: manually store size in header
#endif
void *zmalloc(size_t size) {
void *ptr = malloc(size + PREFIX_SIZE);
if (!ptr) zmalloc_oom_handler(size);
#ifdef HAVE_MALLOC_SIZE
// jemalloc mode: use malloc_usable_size to get the actual allocated size
update_zmalloc_stat_alloc(zmalloc_size(ptr));
return ptr;
#else
// libc mode: store the requested size in the first PREFIX_SIZE bytes
*((size_t*)ptr) = size;
update_zmalloc_stat_alloc(size + PREFIX_SIZE);
return (char*)ptr + PREFIX_SIZE;
#endif
}
void zfree(void *ptr) {
if (ptr == NULL) return;
#ifdef HAVE_MALLOC_SIZE
update_zmalloc_stat_free(zmalloc_size(ptr));
free(ptr);
#else
void *realptr = (char*)ptr - PREFIX_SIZE;
size_t oldsize = *((size_t*)realptr);
update_zmalloc_stat_free(oldsize + PREFIX_SIZE);
free(realptr);
#endif
}
used_memory Tracking
// Atomically update the memory usage counter
// The size is rounded up to a multiple of sizeof(long), then added with one atomic increment
#define update_zmalloc_stat_alloc(__n) do { \
size_t _n = (__n); \
/* Round up to sizeof(long) boundary */ \
if (_n & (sizeof(long) - 1)) \
_n += sizeof(long) - (_n & (sizeof(long) - 1)); \
atomicIncr(used_memory, _n); \
} while(0)
// Current total allocation in bytes
size_t zmalloc_used_memory(void) {
size_t um;
atomicGet(used_memory, um);
return um;
}
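To make the libc-mode bookkeeping above concrete, the following is a minimal standalone sketch of the size-prefix technique. The names demo_zmalloc, demo_zfree, and demo_used_memory are hypothetical stand-ins, the counter uses C11 atomics rather than Redis's atomicIncr macros, and out-of-memory simply aborts; it illustrates the idea and is not the Redis implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for zmalloc's header-based accounting (libc mode). */
#define DEMO_PREFIX_SIZE (sizeof(size_t))

static atomic_size_t demo_used = 0; /* plays the role of used_memory */

/* Allocate `size` bytes and remember the request size in a hidden header. */
void *demo_zmalloc(size_t size) {
    assert(size <= SIZE_MAX - DEMO_PREFIX_SIZE); /* guard against overflow */
    void *raw = malloc(size + DEMO_PREFIX_SIZE);
    if (raw == NULL) abort();                    /* assumption: no OOM handler */
    *(size_t *)raw = size;                       /* header stores the request */
    atomic_fetch_add(&demo_used, size + DEMO_PREFIX_SIZE);
    return (char *)raw + DEMO_PREFIX_SIZE;       /* caller sees bytes after header */
}

/* Free a pointer returned by demo_zmalloc and subtract its recorded size. */
void demo_zfree(void *ptr) {
    if (ptr == NULL) return;
    void *raw = (char *)ptr - DEMO_PREFIX_SIZE;  /* step back to the header */
    size_t size = *(size_t *)raw;
    atomic_fetch_sub(&demo_used, size + DEMO_PREFIX_SIZE);
    free(raw);
}

/* Current tracked total in bytes, header overhead included. */
size_t demo_used_memory(void) {
    return atomic_load(&demo_used);
}
```

Allocating 100 bytes with this sketch raises the counter by 100 + sizeof(size_t), and freeing the pointer restores the previous value. This is why the header approach costs one extra word per allocation when malloc_usable_size is unavailable.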
The gap between used_memory and RSS:
- used_memory: bytes Redis believes it has allocated (tracked by zmalloc)
- used_memory_rss: physical memory the OS has assigned to the process (on Linux, read from /proc/<pid>/stat)
- The difference comes from: memory fragmentation, allocator-retained pages, and OS page-rounding
24.3 jemalloc's Three-Layer Architecture
Understanding jemalloc's internals explains why Redis behaves the way it does under memory pressure.
Hierarchy
OS (operating system)
│
│ mmap / brk (in 2MB chunks)
▼
Chunk (2MB block obtained from OS)
│
│ Split by size class into runs
▼
Run (internal name: slab): a contiguous range of pages for one size class
│
│ Run subdivided into equally-sized regions
▼
Region: the actual memory returned to the caller
Arenas: Reducing Lock Contention
jemalloc creates (CPU count × 4) arenas by default
Thread 1 → Arena 0
Thread 2 → Arena 1
Thread 3 → Arena 2
Thread 4 → Arena 0 (round-robin assignment)
Each arena independently manages its own Chunks/Bins/Runs.
Threads almost never need to synchronize with each other.
Configuring via MALLOC_CONF:
# Set number of arenas (default: CPU count × 4)
MALLOC_CONF=narenas:8 redis-server redis.conf
# View jemalloc's internal statistics
redis-cli MEMORY MALLOC-STATS
Bins and Size Classes
jemalloc predefines a fine-grained sequence of size classes to minimize internal fragmentation:
Small objects (0-14KB):
8, 16, 32, 48, 64, 80, 96, 112, 128, 160, 192, 224, 256,
320, 384, 448, 512, 640, 768, 896, 1024, 1280, 1536, 1792, 2048,
2560, 3072, 3584, 4096, 5120, 6144, 7168, 8192, 10240, 12288, 14336
Medium objects (14KB-4MB): aligned to 4KB page boundaries
Large objects (>4MB): satisfied directly via mmap
Internal fragmentation analysis:
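- Request 100 bytes → jemalloc allocates 112 bytes → waste: 12 bytes (12%)
- Request 200 bytes → jemalloc allocates 224 bytes → waste: 24 bytes (12%)
- For comparison, power-of-two rounding: request 100 bytes → allocates 128 bytes → waste: 28 bytes (28%)
Compared with power-of-two rounding, this roughly 2× reduction in internal fragmentation per allocation compounds significantly at millions of keys. (glibc's ptmalloc2 does not round to powers of two; its per-allocation cost comes mainly from chunk headers.)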
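The rounding arithmetic above can be reproduced with a small lookup. This is a sketch that hard-codes the small size classes exactly as listed in this section; demo_size_class and demo_internal_waste are hypothetical helper names, and a real jemalloc build may use a different table.

```c
#include <assert.h>
#include <stddef.h>

/* Small size classes copied from the list in this section. */
static const size_t demo_size_classes[] = {
    8, 16, 32, 48, 64, 80, 96, 112, 128, 160, 192, 224, 256,
    320, 384, 448, 512, 640, 768, 896, 1024, 1280, 1536, 1792, 2048,
    2560, 3072, 3584, 4096, 5120, 6144, 7168, 8192, 10240, 12288, 14336
};

/* Return the smallest listed class that fits `request`,
 * or 0 when the request is larger than every small class. */
size_t demo_size_class(size_t request) {
    size_t n = sizeof(demo_size_classes) / sizeof(demo_size_classes[0]);
    for (size_t i = 0; i < n; i++) {
        if (demo_size_classes[i] >= request) return demo_size_classes[i];
    }
    return 0;
}

/* Bytes lost to rounding for a request that fits a small class. */
size_t demo_internal_waste(size_t request) {
    size_t cls = demo_size_class(request);
    assert(cls != 0); /* caller must pass a small-object request */
    return cls - request;
}
```

With this table, a 100-byte request maps to the 112-byte class (12 bytes wasted) and a 200-byte request maps to 224 bytes (24 bytes wasted), matching the figures above.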
24.4 Memory Fragmentation Analysis
Key Fields in INFO memory
redis-cli INFO memory
Annotated output:
# Memory
used_memory:1073741824 # 1 GB: what Redis believes it has allocated
used_memory_human:1.00G
used_memory_rss:1610612736 # 1.5 GB: actual RSS from the OS perspective
used_memory_rss_human:1.50G
used_memory_peak:1073741824 # Historical peak allocation
used_memory_peak_human:1.00G
used_memory_peak_perc:100.00% # current / peak ratio
used_memory_overhead:847249408 # Redis internal overhead (dicts, obj headers, eventloop)
used_memory_startup:895776 # Baseline memory at startup
used_memory_dataset:226492416 # Pure data memory (used_memory - overhead)
used_memory_dataset_perc:21.09% # Data as fraction of total allocation
allocator_allocated:1073815552 # What the allocator thinks it has given out
allocator_active:1342177280 # Allocator active memory (includes allocator reserves)
allocator_resident:1610612736 # Memory the allocator has obtained from OS (≈ RSS)
mem_fragmentation_ratio:1.50 # = used_memory_rss / used_memory (THE key metric)
mem_fragmentation_bytes:536870912 # Fragmentation in bytes
mem_not_counted_for_evict:0 # Memory excluded from maxmemory accounting
mem_replication_backlog:1048576 # Replication backlog buffer
mem_total_replication_buffers:2097152
mem_clients_slaves:0 # Memory for replica client objects
mem_clients_normal:20512 # Memory for normal client objects
mem_cluster_links:0 # Memory for cluster bus connections
mem_aof_buffer:8 # AOF buffer memory
mem_allocator:jemalloc-5.3.0
active_defrag_running:0 # Is active defrag currently running?
lazyfree_pending_objects:0 # Objects queued for async free
lazyfreed_objects:0 # Total objects async-freed since start
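Monitoring scripts usually pull individual fields out of this text. The sketch below shows one way to extract an unsigned integer field from INFO-style "name:value" lines; demo_info_u64 is a hypothetical helper and not part of any Redis client library, and it assumes the reply is already in memory as a NUL-terminated string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Look for a line that starts with "name:" and parse the unsigned integer
 * that follows. Returns 1 and stores the value on success, 0 otherwise. */
int demo_info_u64(const char *info, const char *name, unsigned long long *out) {
    assert(info != NULL && name != NULL && out != NULL);
    size_t nlen = strlen(name);
    const char *line = info;
    while (*line != '\0') {
        /* Requiring ':' right after the name keeps "used_memory" from
         * matching "used_memory_rss" or "used_memory_human". */
        if (strncmp(line, name, nlen) == 0 && line[nlen] == ':') {
            const char *start = line + nlen + 1;
            char *end = NULL;
            unsigned long long v = strtoull(start, &end, 10);
            if (end == start) return 0; /* no digits after the colon */
            *out = v;
            return 1;
        }
        const char *nl = strchr(line, '\n');
        if (nl == NULL) break;
        line = nl + 1; /* move to the start of the next line */
    }
    return 0;
}
```

Reading used_memory and used_memory_rss this way yields the two inputs needed to recompute mem_fragmentation_ratio independently of the server.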
Fragmentation Ratio Interpretation
mem_fragmentation_ratio interpretation:
< 1.0 → Using swap: CRITICAL, system is out of physical RAM
1.0-1.1 → Healthy, minimal fragmentation
1.1-1.5 → Mild fragmentation (acceptable for most workloads)
1.5-2.0 → Significant fragmentation: consider enabling activedefrag
> 2.0 → Severe fragmentation: immediate action required
Root causes of fragmentation:
- Mix of objects with widely varying sizes
- High churn of SET/DEL operations (memory repeatedly allocated and freed)
- Key expiration creating holes in memory pages
- Large values allocated via mmap that don't immediately return pages to the OS after freeing
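The interpretation bands for mem_fragmentation_ratio translate directly into a small classifier. This is a sketch with hypothetical names (demo_frag_ratio, demo_classify_frag). Because the bands in the text share their boundary values, it assumes that 1.1 and 1.5 fall into the higher band and that exactly 2.0 still counts as significant.

```c
#include <assert.h>

typedef enum {
    DEMO_FRAG_SWAP_SUSPECTED, /* ratio below 1.0 */
    DEMO_FRAG_HEALTHY,        /* 1.0 up to 1.1 */
    DEMO_FRAG_MILD,           /* 1.1 up to 1.5 */
    DEMO_FRAG_SIGNIFICANT,    /* 1.5 through 2.0 */
    DEMO_FRAG_SEVERE          /* above 2.0 */
} demo_frag_level;

/* mem_fragmentation_ratio = used_memory_rss / used_memory */
double demo_frag_ratio(unsigned long long rss, unsigned long long used) {
    assert(used > 0); /* avoid division by zero */
    return (double)rss / (double)used;
}

/* Map a ratio onto the bands described in this section. */
demo_frag_level demo_classify_frag(double ratio) {
    if (ratio < 1.0) return DEMO_FRAG_SWAP_SUSPECTED;
    if (ratio < 1.1) return DEMO_FRAG_HEALTHY;
    if (ratio < 1.5) return DEMO_FRAG_MILD;
    if (ratio <= 2.0) return DEMO_FRAG_SIGNIFICANT;
    return DEMO_FRAG_SEVERE;
}
```

Feeding in the sample INFO values (RSS 1610612736, used 1073741824) gives a ratio of 1.5, which this sketch places in the significant band.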
24.5 activedefrag: Online Defragmentation
Introduced in Redis 4.0, activedefrag defragments memory while the server keeps running, so no restart is needed.
How It Works
// defrag.c (simplified sketch of the internal implementation)
// Core idea:
// 1. Scan each key in the database
// 2. For each value, ask jemalloc whether its address sits in a sparsely used slab
// 3. If yes, allocate fresh memory, copy the data, update all pointers
// 4. Free the old memory so jemalloc can reclaim the emptied slab
void activeDefragCycle(void) {
    // The cursor must persist across calls so each cycle resumes where the last one stopped
    static unsigned long cursor = 0;
    // CPU usage is controlled dynamically from the current fragmentation ratio and
    // stays between active-defrag-cycle-min and active-defrag-cycle-max percent
    // (budget computation omitted in this sketch)
    // Scan the current DB's dictionary
    // dictScanDefrag is a defrag-aware scan that visits one bucket per call
    cursor = dictScanDefrag(server.db[current_db].dict,
                            cursor,
                            defragCallback,
                            &server.db[current_db]);
}
// Reallocate a single object to a fresh memory location
void *activeDefragAlloc(void *ptr) {
    size_t size;
    void *newptr;
    // Ask jemalloc whether this address is worth defragmenting
    // je_get_defrag_hint() returns true if the slab utilization is below threshold
    if (!je_get_defrag_hint(ptr)) return NULL;
    // Allocate fresh memory and copy. The thread cache is bypassed so the
    // allocator does not hand back a region from the same sparse slab.
    size = zmalloc_size(ptr);
    newptr = zmalloc_no_tcache(size);
    memcpy(newptr, ptr, size); // copy the old allocation's size
    zfree_no_tcache(ptr);
    return newptr;
}
Configuration Reference
# redis.conf
# Enable online defragmentation (default: no)
activedefrag yes
# Only start defragmenting if fragmentation exceeds this many bytes (default: 100mb)
active-defrag-ignore-bytes 100mb
# Start defragmenting when (mem_fragmentation_ratio - 1) × 100 exceeds this (default: 10)
active-defrag-threshold-lower 10
# Run at maximum CPU when fragmentation exceeds this threshold (default: 100 = ratio 2.0)
active-defrag-threshold-upper 100
# Minimum CPU% to dedicate to defragmentation (default: 1)
active-defrag-cycle-min 1
# Maximum CPU% to dedicate to defragmentation (default: 25)
active-defrag-cycle-max 25
# Maximum number of set/hash/zset/list fields processed per step of the main dictionary scan (default: 1000)
active-defrag-max-scan-fields 1000
Pause Characteristics During Defragmentation
activedefrag is not completely pause-free:
Moving small objects (< 1KB): pause < 1µs (negligible)
Moving medium objects (< 64KB): pause < 10µs
Moving large objects (> 1MB): pause 1-10ms (noticeable for latency-sensitive apps)
Operational recommendations:
- During off-peak hours: raise active-defrag-cycle-max to 50-75% for faster progress
- During peak hours: lower active-defrag-cycle-max to 5-10% to minimize impact
- For latency-sensitive services (P99 < 1ms requirement):
Do NOT enable activedefrag. Instead, handle fragmentation by periodically
restarting replicas and then promoting them.
24.6 lazyfree: Background Asynchronous Object Freeing
Synchronously freeing a large object with free() can block the main thread for milliseconds or even tens of milliseconds. Redis 4.0 introduced lazyfree to make freeing asynchronous.
DEL vs UNLINK
# DEL: synchronous deletion (large objects will block the main thread)
DEL bigkey
# UNLINK: asynchronous deletion (main thread only severs the reference;
# the bio background thread performs the actual free)
UNLINK bigkey
# Practical comparison:
redis-cli DEL biglist # A list with 1 million elements may block for 100ms+
redis-cli UNLINK biglist # Returns immediately; freeing happens in background
lazyfree Source Code Flow
// lazyfree.c
int dbAsyncDelete(redisDb *db, robj *key) {
// Remove from expires dict (small operation, always synchronous)
if (dictSize(db->expires) > 0)
dictDelete(db->expires, key->ptr);
// Unlink the entry from the main dict WITHOUT freeing the value yet
dictEntry *de = dictUnlink(db->dict, key->ptr);
if (de) {
robj *val = dictGetVal(de);
// Estimate the cost of freeing this object
size_t free_effort = lazyfreeGetFreeEffort(key, val);
// If cost is high enough, free asynchronously
if (free_effort > LAZYFREE_THRESHOLD && val->refcount == 1) {
atomicIncr(lazyfree_objects, 1);
// Submit to the BIO_LAZY_FREE queue
bioCreateLazyFreeJob(lazyfreeFreeObject, 1, val);
dictSetVal(db->dict, de, NULL); // Prevent double-free
}
dictFreeUnlinkedEntry(db->dict, de);
}
// Report whether a key was actually removed (1) or did not exist (0)
return de != NULL;
}
// The actual free function executed by the bio thread
void lazyfreeFreeObject(void *args[]) {
robj *o = (robj *) args[0];
decrRefCount(o); // When refcount hits 0, the object is truly freed
atomicDecr(lazyfree_objects, 1);
}
lazyfreeGetFreeEffort: Cost Estimation
size_t lazyfreeGetFreeEffort(robj *key, robj *obj) {
if (obj->type == OBJ_LIST) {
quicklist *ql = obj->ptr;
return ql->len; // Number of quicklist nodes (not individual elements)
} else if (obj->type == OBJ_SET &&
obj->encoding == OBJ_ENCODING_HT) {
dict *ht = obj->ptr;
return dictSize(ht); // Number of set members
} else if (obj->type == OBJ_ZSET &&
obj->encoding == OBJ_ENCODING_SKIPLIST) {
zset *zs = obj->ptr;
return zs->zsl->length; // Number of sorted set members
} else if (obj->type == OBJ_HASH &&
obj->encoding == OBJ_ENCODING_HT) {
dict *ht = obj->ptr;
return dictSize(ht); // Number of hash fields
} else if (obj->type == OBJ_STREAM) {
size_t effort = 0;
stream *s = obj->ptr;
effort += s->length; // Number of stream entries
effort += raxSize(s->cgroups); // Number of consumer groups
return effort;
} else {
return 1; // Strings and other simple objects: cost = 1
}
}
// Objects with effort > 64 are freed asynchronously
#define LAZYFREE_THRESHOLD 64
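The decision made in dbAsyncDelete reduces to a two-part predicate, sketched here in isolation. demo_should_free_async is a hypothetical name. The check mirrors the condition shown earlier: the effort must strictly exceed the threshold, and the object must not be shared.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_LAZYFREE_THRESHOLD 64

/* Return 1 when the value should be handed to the background thread,
 * 0 when it is cheap enough (or shared) and should be freed inline. */
int demo_should_free_async(size_t free_effort, int refcount) {
    assert(refcount >= 1); /* a live object always has at least one reference */
    return free_effort > DEMO_LAZYFREE_THRESHOLD && refcount == 1;
}
```

Note the strict comparison: a hash with exactly 64 fields is still freed synchronously, while one with 65 fields goes to the background thread. A shared object (refcount above 1) is never queued, because another owner may still be using it.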
lazyfree Configuration
# redis.conf
# Whether key expiration triggers async freeing (default: no; recommended: yes)
lazyfree-lazy-expire yes
# Whether server-internal deletions (e.g., RENAME overwriting a key) are async
# (default: no; recommended: yes)
lazyfree-lazy-server-del yes
# Whether a replica flushes its old dataset asynchronously during a full resynchronization
# (default: no; recommended: yes, which prevents multi-second pauses on replicas)
replica-lazy-flush yes
# Whether memory eviction uses async freeing (default: no; recommended: yes)
lazyfree-lazy-eviction yes
# Whether a user-issued DEL behaves like UNLINK and frees asynchronously
lazyfree-lazy-user-del yes
# Whether FLUSHDB/FLUSHALL run asynchronously when neither SYNC nor ASYNC is given
lazyfree-lazy-user-flush yes
24.7 bio.c: Background I/O Threads
bio.c implements Redis's background thread mechanism for operations that should not block the main thread.
// bio.c
// Three categories of background tasks:
#define BIO_CLOSE_FILE 0 // Async close() of file descriptors
#define BIO_AOF_FSYNC 1 // Deferred fsync for AOF (appendfsync everysec; "always" fsyncs on the main thread)
#define BIO_LAZY_FREE 2 // Lazy freeing of objects and datasets
// Task structure
struct bio_job {
time_t time; // Submission timestamp
void (*free_fn)(void *args[]); // The function to execute
void *args[3]; // Arguments to pass to free_fn
};
// Each task type has its own queue, mutex, and thread
static pthread_t bio_threads[BIO_NUM_OPS];
static pthread_mutex_t bio_mutex[BIO_NUM_OPS];
static pthread_cond_t bio_newjob_cond[BIO_NUM_OPS];
static list *bio_jobs[BIO_NUM_OPS];
// Background thread worker loop
void *bioProcessBackgroundJobs(void *arg) {
int type = (unsigned long) arg;
// Block SIGALRM in this thread so the watchdog signal is always delivered to the main thread
sigset_t sigset;
sigemptyset(&sigset);
sigaddset(&sigset, SIGALRM);
pthread_sigmask(SIG_BLOCK, &sigset, NULL);
pthread_mutex_lock(&bio_mutex[type]);
while (1) {
listNode *ln = listFirst(bio_jobs[type]);
if (ln == NULL) {
// Queue empty: block until a new job arrives
pthread_cond_wait(&bio_newjob_cond[type], &bio_mutex[type]);
continue;
}
// Dequeue the job and release the lock while executing
struct bio_job *job = ln->value;
listDelNode(bio_jobs[type], ln);
pthread_mutex_unlock(&bio_mutex[type]);
// Execute the appropriate operation
if (type == BIO_CLOSE_FILE) {
close((long)job->args[0]);
} else if (type == BIO_AOF_FSYNC) {
redis_fsync((long)job->args[0]);
} else if (type == BIO_LAZY_FREE) {
job->free_fn(job->args); // e.g., lazyfreeFreeObject
}
zfree(job);
pthread_mutex_lock(&bio_mutex[type]);
}
}
24.8 Memory Analysis Tools
Per-Key Memory Analysis
# Estimated memory cost of a single key, in bytes
redis-cli MEMORY USAGE mykey
# Increase sampling depth for accuracy with collection types
redis-cli MEMORY USAGE mykey SAMPLES 100
# Diagnostic advice in plain English
redis-cli MEMORY DOCTOR
# Sample output:
# - High fragmentation: RSS is 1.5G, but used_memory is 1.0G.
# mem_fragmentation_ratio = 1.50. This could be due to memory fragmentation.
# To avoid this, configure 'activedefrag yes'.
Big Key Scanning
# Scan the keyspace and report the largest key of each data type
redis-cli --bigkeys
# Output:
# Biggest string found 'user:profile:12345' has 51200 bytes
# Biggest list found 'events:queue' has 100000 items
# Biggest hash found 'product:catalog' has 50000 fields
# Memory distribution scan (estimated memory per key)
redis-cli --memkeys
# Hot key analysis (Redis 4.0+, requires maxmemory-policy LFU)
redis-cli --hotkeys
OBJECT Introspection
# Inspect internal encoding
OBJECT ENCODING mykey # listpack, skiplist, embstr, raw, int, ...
# Seconds since last access (LRU mode)
OBJECT IDLETIME mykey
# Reference count (usually 1; shared integers 0-9999 have higher counts)
OBJECT REFCOUNT mykey
# Access frequency counter (LFU mode only)
OBJECT FREQ mykey
Memory Reclamation Operations
# Trigger memory purge: return fragmented pages to OS (brief stall)
MEMORY PURGE
# Inspect jemalloc's internal state
MEMORY MALLOC-STATS
# Dynamically adjust defragmentation aggressiveness
CONFIG SET active-defrag-cycle-max 50
CONFIG SET activedefrag yes
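# There is no command that forces a full defrag pass; activedefrag starts on its own once its thresholds are exceeded
# Testing aid only: changes when list elements are stored as plain quicklist nodes (does not trigger defragmentation)
DEBUG QUICKLIST-PACKED-THRESHOLD 1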
24.9 Production Memory Optimization Guidelines
Recommended Configuration
# 1. Enable lazyfree everywhere to prevent large-object blocking
lazyfree-lazy-expire yes
lazyfree-lazy-server-del yes
lazyfree-lazy-eviction yes
replica-lazy-flush yes
lazyfree-lazy-user-del yes
# 2. Enable activedefrag when fragmentation is an issue
activedefrag yes
active-defrag-ignore-bytes 100mb
active-defrag-threshold-lower 10
active-defrag-threshold-upper 100
active-defrag-cycle-min 1
active-defrag-cycle-max 25
# 3. jemalloc is already the default on Linux, so no extra config is needed
Data Structure Sizing for Memory Efficiency
# listpack encoding (small objects) saves 50-70% vs skiplist/hashtable
# Keep data counts within thresholds:
# Hash: listpack thresholds
CONFIG GET hash-max-listpack-entries # Default: 128 fields
CONFIG GET hash-max-listpack-value # Default: 64 bytes per value
# ZSet: listpack thresholds
CONFIG GET zset-max-listpack-entries # Default: 128 members
CONFIG GET zset-max-listpack-value # Default: 64 bytes per member
# Set: intset threshold
CONFIG GET set-max-intset-entries # Default: 512 integers
# List: listpack threshold
CONFIG GET list-max-listpack-size # Default: -2 (each quicklist node is capped at 8 KB)
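The hash and sorted-set thresholds combine as a simple conjunction: a collection keeps the compact encoding only while both the entry count and the longest element stay within their limits. The sketch below illustrates that rule with a hypothetical helper, demo_fits_listpack. It assumes the limits are inclusive and ignores other conversion triggers the server may apply.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 when a hash or sorted set with `entries` elements, whose longest
 * field or value is `longest_len` bytes, stays within both listpack limits. */
int demo_fits_listpack(size_t entries, size_t longest_len,
                       size_t max_entries, size_t max_value) {
    assert(max_entries > 0 && max_value > 0);
    return entries <= max_entries && longest_len <= max_value;
}
```

With the defaults of 128 entries and 64 bytes, a hash of 128 short fields stays compact, but adding a 129th field or a single 65-byte value would push it over a limit.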
Memory Monitoring Alerts
# Prometheus alert rules (using redis_exporter metrics):
# Fragmentation ratio alert
redis_mem_fragmentation_ratio > 1.5 # Warning
redis_mem_fragmentation_ratio > 2.0 # Critical
# Memory utilization alert
redis_memory_used_bytes / redis_memory_max_bytes > 0.85 # Warning
redis_memory_used_bytes / redis_memory_max_bytes > 0.95 # Critical
# Lazyfree backlog alert (indicates large object deletion pressure)
redis_lazyfree_pending_objects > 10000 # Warning
Disable Transparent Huge Pages (Critical for Production)
# THP causes massive Copy-on-Write overhead during fork() for RDB saves
# Add to /etc/rc.local or systemd unit:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# Verify
cat /sys/kernel/mm/transparent_hugepage/enabled
# Expected: always madvise [never]
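A startup check can verify the setting programmatically by looking for the bracketed entry, which marks the active mode. The sketch below parses one line in the format shown above; demo_thp_active_mode is a hypothetical helper, and reading the sysfs file itself is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the bracketed (active) mode from a line such as
 * "always madvise [never]" into `out`. Returns 1 on success,
 * 0 if no bracketed token is found or `out` is too small. */
int demo_thp_active_mode(const char *line, char *out, size_t outlen) {
    assert(line != NULL && out != NULL && outlen > 0);
    const char *open = strchr(line, '[');
    if (open == NULL) return 0;
    const char *close = strchr(open + 1, ']');
    if (close == NULL) return 0;
    size_t len = (size_t)(close - open - 1); /* characters between the brackets */
    if (len + 1 > outlen) return 0;          /* need room for the terminator */
    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 1;
}
```

A deployment script could compare the extracted mode against "never" and refuse to start, or log a warning, when THP is still enabled.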
Chapter Summary
- Redis supports jemalloc, libc, and tcmalloc allocators; jemalloc is the Linux default and is strongly recommended for production
- The zmalloc layer tracks used_memory; the gap between it and OS RSS measures fragmentation plus allocator overhead
- jemalloc's layered design (arenas managing chunks, runs, and regions) delivers low fragmentation and high concurrency, with fine-grained size classes that keep internal waste to about 12% in the examples from section 24.3
- Monitor mem_fragmentation_ratio: enable activedefrag above 1.5; treat values above 2.0 as urgent
- activedefrag works by scanning keys, reallocating fragmented values to fresh memory, and updating pointers, all without stopping the world for small objects
- lazyfree/UNLINK moves large object freeing to a bio background thread, avoiding the 10-100ms main-thread stalls that can occur when deleting million-element collections
- bio.c maintains 3 background job types, each with its own thread: file close, AOF fsync, and lazy free
- Enable all lazyfree options + activedefrag, disable THP, and alert on fragmentation ratio and lazyfree backlog in production