Chapter 2: The Open-Source Competitive Landscape: Memcached, Aerospike, Dragonfly, Valkey, KeyDB

2.1 Feature Comparison Matrix

Dimension	Redis 7.2	Memcached 1.6	Aerospike 6.x	Dragonfly 1.x	Valkey 7.2	KeyDB 6.x
Data structures	10+ types	String only	KV + List + Map	Redis-compatible	Redis-compatible	Redis-compatible
Persistence	AOF / RDB	None	Native SSD	Snapshot	AOF / RDB	AOF / RDB
Clustering	Hash-slot Cluster	None (client sharding)	Native cluster	Native multi-shard	Hash-slot Cluster	Active-Active replication
Thread model	Single-cmd + multi I/O	Multi-threaded	Multi-core	One thread per core	Single-cmd + multi I/O	Multi-threaded event loops
Protocol	RESP2 / RESP3	Text / binary	Proprietary	RESP (compatible)	RESP2 / RESP3	RESP2
Max dataset size	RAM-bounded	RAM-bounded	SSD (multi-TB)	RAM-bounded	RAM-bounded	RAM-bounded
License	BSD-3 (7.2.x); RSALv2 / SSPLv1 from 7.4	BSD-3	AGPLv3 (server)	BSL (delayed open)	BSD-3	BSD-3
Ecosystem maturity	Very high	High	Medium	Low	Growing	Low
Transactions	MULTI / Lua	None	Lua	MULTI / Lua	MULTI / Lua	MULTI / Lua
Pub/Sub	Yes	No	No	Yes	Yes	Yes
Streams	Yes	No	No	Partial	Yes	Yes

2.2 Memcached: Simplicity as a Design Principle

2.2.1 Architecture

Brad Fitzpatrick wrote Memcached in 2003 for LiveJournal. The design philosophy is radical minimalism: a multi-threaded, in-memory hash table with a deliberately small command set (GET, SET, DELETE, CAS, plus a few variants such as ADD, REPLACE, INCR/DECR, and TOUCH). No persistence, no replication, no data structures beyond opaque byte strings.

Memory management uses a slab allocator — memory is pre-partitioned into fixed-size chunk classes:

Slab Class  Chunk Size    Chunks/Slab   Notes
1           96 bytes      10,922        Small strings, session tokens
2           120 bytes     8,738
3           152 bytes     6,898
...
39          512 KB        2
40          640 KB        1
41          1 MB          1             Maximum value size

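Allocation: find the smallest chunk class that fits the item (key, value, and item header together), then return one chunk from that class's free list. Deallocation: return the chunk to the same class's free list. Because every chunk in a class has the same size, there is no external fragmentation within a class. Internal fragmentation remains: a 100-byte item does not fit the 96-byte class, so it occupies a 120-byte chunk and wastes 20 bytes.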
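The class-selection rule can be sketched in a few lines. This is a minimal illustration, not Memcached's actual allocator: the growth factor of 1.25 and the 8-byte alignment are assumptions chosen because they reproduce the first rows of the table above, and the function names are hypothetical.

```python
PAGE_SIZE = 1024 * 1024   # 1 MB slab page, as in the table above
GROWTH_FACTOR = 1.25      # assumed ratio between consecutive chunk classes
ALIGN = 8                 # assumed chunk alignment in bytes


def build_chunk_sizes(first=96, limit=PAGE_SIZE):
    """Generate chunk sizes by repeated growth, rounded up to ALIGN bytes."""
    sizes = []
    size = first
    while size < limit:
        sizes.append(size)
        size = int(size * GROWTH_FACTOR)
        size = (size + ALIGN - 1) // ALIGN * ALIGN
    sizes.append(limit)  # the largest class holds one page-sized chunk
    return sizes


def pick_class(chunk_sizes, item_size):
    """Return (class_id, chunk_size, wasted_bytes) for the smallest fitting class."""
    for class_id, chunk in enumerate(chunk_sizes, start=1):
        if item_size <= chunk:
            return class_id, chunk, chunk - item_size
    raise ValueError("item larger than the largest chunk")


sizes = build_chunk_sizes()
print(sizes[:3])                            # [96, 120, 152]
print([PAGE_SIZE // s for s in sizes[:3]])  # [10922, 8738, 6898]
print(pick_class(sizes, 100))               # (2, 120, 20)
```

The wasted bytes returned by pick_class are the internal fragmentation for that item. Because the generated class list depends on the assumed growth factor, it will not necessarily match the class numbering of a real deployment.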

The slab allocator's global lock is Memcached's primary multi-threading bottleneck. On a 32-core machine, throughput scales sub-linearly — roughly 8–10x rather than 32x — because threads contend on slab free-list access.

2.2.2 Multi-Threaded Model

Memcached uses a dispatcher + worker thread model via libevent:

Main Thread
(accept() new connections)
        │
        ├──→ Worker Thread 0  (handles connections assigned to it)
        ├──→ Worker Thread 1
        ├──→ Worker Thread 2
        └──→ Worker Thread N-1

Each worker has its own libevent instance and event loop. The dispatcher round-robins new connections to workers via a pipe. Within a worker, processing is single-threaded for its connections. Between workers, shared state (the global hash table, slab allocator) requires locking.
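The dispatch pattern can be sketched with standard-library threads. This is an illustration of round-robin hand-off only: a queue.Queue stands in for the notification pipe, strings stand in for sockets, and the Worker class and dispatch function are hypothetical names rather than Memcached code.

```python
import queue
import threading


class Worker(threading.Thread):
    """One worker thread with a private inbox (stand-in for the pipe)."""

    def __init__(self, worker_id):
        super().__init__(daemon=True)
        self.worker_id = worker_id
        self.inbox = queue.Queue()
        self.handled = []  # connections owned by this worker

    def run(self):
        while True:
            conn = self.inbox.get()
            if conn is None:  # shutdown sentinel
                break
            # A real worker would register the socket with its own event loop.
            self.handled.append(conn)


def dispatch(connections, n_workers):
    """Hand each new connection to the next worker in round-robin order."""
    workers = [Worker(i) for i in range(n_workers)]
    for w in workers:
        w.start()
    for i, conn in enumerate(connections):
        workers[i % n_workers].inbox.put(conn)
    for w in workers:
        w.inbox.put(None)
    for w in workers:
        w.join()
    return {w.worker_id: w.handled for w in workers}


assignment = dispatch([f"conn{i}" for i in range(6)], n_workers=3)
print(assignment)
```

Once a connection is assigned, only its owning worker touches it, which is why per-connection processing needs no locking while the shared hash table does.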

Benchmark comparison on 8-core server:

Concurrent clients	Memcached GET/s	Redis GET/s
1	85,000	95,000
8	580,000	100,000
64	1,200,000	108,000
256	1,400,000	115,000

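With many concurrent clients, Memcached's multi-threading advantage materializes. In this test, Redis's single-threaded command processing caps out around 100–120k ops/sec per instance. Redis can recover the difference by running more instances: Redis Cluster throughput grows roughly linearly with the number of shards for single-key workloads.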

2.2.3 When to Use Memcached

Good fit: Pure caching workload, data can be rebuilt from primary store on miss, no persistence needed, simple key-value only, need to maximize throughput per core under many concurrent connections.

Poor fit: Any requirement for data persistence, replication, complex data structures, pub/sub, atomic operations beyond CAS, or TTL on specific fields.

In 2024, Memcached remains relevant for large-scale caching where the data model is simple and horizontal scaling is done at the application layer via consistent hashing (libketama). Its declining market share is not due to technical failure — it simply hasn't grown beyond its original scope.
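Client-side sharding with consistent hashing can be sketched as a hash ring with virtual points per server. This is a simplified illustration of the idea, not libketama's exact algorithm: libketama hashes with MD5, whereas SHA-256 is used here as a stand-in, and the class name and the choice of 100 points per node are assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring: each node owns many points on a 32-bit circle."""

    def __init__(self, nodes, points_per_node=100):
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(points_per_node)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(text):
        digest = hashlib.sha256(text.encode()).digest()
        return int.from_bytes(digest[:4], "big")

    def node_for(self, key):
        """Walk clockwise from the key's hash to the next node point."""
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


before = HashRing(["cache-a", "cache-b", "cache-c"])
after = HashRing(["cache-a", "cache-b"])  # cache-c removed

keys = [f"session:{i}" for i in range(1000)]
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
print(f"{len(moved)} of {len(keys)} keys changed servers")
```

The point of the ring is that removing a server only remaps the keys that server owned. With plain `hash(key) % n`, nearly every key would move when n changes, causing a cache-wide miss storm.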

2.3 Aerospike: Breaking the Memory Barrier

2.3.1 Hybrid Memory Architecture (HMA)

Aerospike's defining innovation is decoupling the index (in DRAM) from the data (on SSD or DRAM), allowing datasets far exceeding available RAM:

┌─────────────────────────────────────────────────────────────┐
│                     Aerospike Node                          │
│                                                             │
│  DRAM (Primary Index)              SSD (Data Layer)         │
│  ┌──────────────────────┐         ┌───────────────────────┐ │
│  │  Hash → Record Ptr   │────────→│  Record Bytes         │ │
│  │  64 bytes per record │         │  Direct Block I/O     │ │
│  │  (regardless of      │         │  (O_DIRECT, bypassing │ │
│  │   value size)        │         │   OS page cache)      │ │
│  └──────────────────────┘         └───────────────────────┘ │
│                                                             │
│  Memory cost: 64B × N records     Capacity: SSD total      │
│  1 billion records = 64 GB DRAM   1 billion × 1KB = 1 TB  │
└─────────────────────────────────────────────────────────────┘

The per-record index entry (simplified):

/* Aerospike primary index record, simplified; laid out to total 64 bytes */
#include <stdint.h>

struct as_index {
    uint8_t  digest[20];         /* RIPEMD-160 of key */
    uint16_t set_id;             /* which set (namespace.set) */
    uint16_t tree_id;            /* partition tree */
    uint64_t rblock_id;          /* block address on SSD */
    uint16_t n_rblocks;          /* number of blocks occupied */
    uint32_t void_time;          /* TTL expiry (epoch seconds) */
    uint32_t generation;         /* write generation (for CAS) */
    uint8_t  replication_state;  /* master/replica/migrating */
    uint8_t  reserved[19];       /* illustrative stand-in: in the real struct,
                                    bit-packed flags fill the remaining space */
};

/* With natural alignment the fields above occupy exactly 64 bytes. */
_Static_assert(sizeof(struct as_index) == 64, "index entry must be 64 bytes");

Why bypass the OS page cache: The kernel page cache manages data in 4 KB pages. For random access to many small records (512 bytes to 2 KB each), the cache thrashes: the kernel loads 4 KB to serve 512 bytes, and the page is often evicted before it is touched again. O_DIRECT lets Aerospike issue block-aligned I/O (as small as 512 bytes on devices with 512-byte logical sectors) straight to the device and manage caching itself.

Measured latency on NVMe SSD (Samsung 983 DCT):

Random read: 120–150 μs (vs. 5–10 ms for a spinning hard disk and roughly 0.1 μs for a DRAM access)
Random write: 80–120 μs
Throughput: 500,000+ random read IOPS

2.3.2 Smart Client: No Proxy Layer

Traditional distributed key-value systems route requests through a proxy tier:

Client → Proxy (routing) → Data Node
                           Data Node
                           Data Node

Aerospike embeds routing intelligence in the client library:

Client (Smart Client)
  ├── Cluster map: {partition_id → node_address}
  ├── compute: node = cluster_map[hash(key) % partition_count]
  └── connect directly to node (no proxy)

On startup, the client fetches the cluster's partition map. On every operation, it computes the target node locally and connects directly. This eliminates proxy latency (typically 0.1–0.5ms) and proxy as a single point of failure.
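A minimal sketch of client-side routing follows. It assumes 4,096 partitions and uses SHA-256 as a stand-in for the RIPEMD-160 key digest; the SmartClient class, its methods, and the node addresses are hypothetical and do not reflect the real Aerospike client API.

```python
import hashlib

PARTITION_COUNT = 4096  # assumed partition count for this sketch


class SmartClient:
    """Keeps a local partition map and picks the target node without a proxy."""

    def __init__(self, partition_map):
        # partition_map[i] is the address of the node that owns partition i
        self.partition_map = list(partition_map)

    def partition_of(self, key):
        digest = hashlib.sha256(key.encode()).digest()  # stand-in digest
        return int.from_bytes(digest[:4], "little") % len(self.partition_map)

    def node_for(self, key):
        """Pure local computation: no network hop is needed to find the owner."""
        return self.partition_map[self.partition_of(key)]

    def refresh(self, new_partition_map):
        """Swap in a new map after a topology change."""
        self.partition_map = list(new_partition_map)


nodes = ["10.0.0.1:3000", "10.0.0.2:3000", "10.0.0.3:3000"]
client = SmartClient(nodes[p % len(nodes)] for p in range(PARTITION_COUNT))
print(client.node_for("user:42"))

# After node 3 fails, its partitions are reassigned and the client refreshes.
survivors = nodes[:2]
client.refresh(survivors[p % len(survivors)] for p in range(PARTITION_COUNT))
print(client.node_for("user:42"))
```

The key never changes partition; only the partition's owner changes. That is what lets a map refresh redirect traffic without rehashing any data on the client side.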

When the cluster topology changes (node addition, failure, rebalancing), the partition map changes with it. Clients pick up the new map by periodically polling the nodes and typically refresh within ~1 second.

2.3.3 Strong Consistency (CP Mode)

Aerospike supports two consistency models:

Availability mode (AP): Asynchronous replication. Primary acknowledges writes after local commit only. Risk of data loss if primary fails before replication completes. Maximum throughput, minimum latency.

Strong consistency mode (CP): Write quorum based on a Paxos-like protocol:

Write flow (replication factor RF=3, quorum = RF/2+1 = 2):
1. Client → Primary: write record
2. Primary commits to local SSD
3. Primary → Replica-1 (parallel): replicate
4. Primary → Replica-2 (parallel): replicate
5. Primary waits for 2 confirmations (quorum met)
6. Primary → Client: ACK

Write latency in CP mode is bounded by the replication round-trip to the second-fastest replica, typically adding 1–3ms on a well-connected cluster. Reads in CP mode always go to the primary, ensuring no stale reads.

2.3.4 Use Cases and Comparison with Redis

Scenario	Aerospike	Redis
Dataset >> RAM	Wins	Loses (OOM risk)
Adtech / RTB user profiles	Wins	Loses at scale
Lowest read latency	Loses (0.2–1 ms, SSD-backed)	Wins (< 0.1 ms, DRAM)
Complex data structures	Loses	Wins
Pub/Sub, Streams, Geo	N/A	Wins
Cost at 1B records × 1KB	~$3K/mo (SSD)	~$20K/mo (RAM)

Real-world scale: A major advertising exchange stores 2 billion user profiles (1KB average each = 2TB) in a 6-node Aerospike cluster with 128GB DRAM per node and 4TB NVMe per node. That is 768GB of DRAM in total, of which the primary index needs only 2 billion × 64 bytes = 128GB per copy of the data. An equivalent Redis deployment would require at least 2TB of RAM across the cluster, which is roughly 7x the infrastructure cost.
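The sizing arithmetic can be checked with a short calculation. This sketch assumes a flat 64-byte index entry, decimal units (1 KB = 1,000 bytes), and ignores all other overhead; the replication factor of 2 in the second call is an assumption, since the example above does not state one.

```python
INDEX_ENTRY_BYTES = 64  # DRAM cost per record, independent of value size


def sizing(records, avg_record_bytes, replication_factor=1):
    """Return DRAM needed for the index and SSD needed for the data."""
    index_bytes = records * INDEX_ENTRY_BYTES * replication_factor
    data_bytes = records * avg_record_bytes * replication_factor
    return {
        "index_dram_gb": index_bytes / 1e9,
        "data_ssd_tb": data_bytes / 1e12,
    }


# The figures from the architecture diagram: 1 billion records of 1 KB each.
print(sizing(1_000_000_000, 1000))
# {'index_dram_gb': 64.0, 'data_ssd_tb': 1.0}

# The advertising-exchange example, with an assumed replication factor of 2.
print(sizing(2_000_000_000, 1000, replication_factor=2))
# {'index_dram_gb': 256.0, 'data_ssd_tb': 4.0}
```

Even with two copies of every record, the index fits comfortably in the cluster's 768GB of DRAM, while the data volume lands on SSD.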

2.4 Dragonfly: Rethinking Redis from Scratch

2.4.1 Shared-Nothing Architecture

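Dragonfly (first released in 2022, co-created by Roman Gershman) is built on a shared-nothing, thread-per-core model. Inside each thread, fibers (lightweight user-space threads scheduled cooperatively) keep I/O asynchronous without blocking the core: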

4-core Dragonfly instance:

Core 0       Core 1       Core 2       Core 3
  │            │            │            │
Shard 0      Shard 1      Shard 2      Shard 3
(hash%4=0)   (hash%4=1)   (hash%4=2)   (hash%4=3)

Each shard: independent hash table, independent event loop
Cross-shard ops: message passing via lock-free queues

For single-key operations (GET, SET, INCR), the request routes to exactly one shard — zero cross-thread communication, zero locking. For multi-key operations (MGET, MSET), the dispatcher sends sub-requests to each relevant shard and aggregates results.
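The routing and fan-out logic can be sketched as follows. This single-threaded model only shows which shard each key touches; the per-core threads and lock-free queues are omitted, CRC32 is an arbitrary stand-in hash, and ShardedStore is a hypothetical name.

```python
import zlib


class ShardedStore:
    """Each shard is an independent dict; a key belongs to exactly one shard."""

    def __init__(self, n_shards):
        self.shards = [dict() for _ in range(n_shards)]

    def shard_of(self, key):
        return zlib.crc32(key.encode()) % len(self.shards)

    def set(self, key, value):
        # Single-key write: touches one shard only.
        self.shards[self.shard_of(key)][key] = value

    def get(self, key):
        # Single-key read: touches one shard only.
        return self.shards[self.shard_of(key)].get(key)

    def mget(self, keys):
        """Multi-key read: group by shard, query each shard, merge in order."""
        by_shard = {}
        for position, key in enumerate(keys):
            by_shard.setdefault(self.shard_of(key), []).append((position, key))
        results = [None] * len(keys)
        for shard_id, wanted in by_shard.items():
            shard = self.shards[shard_id]  # one sub-request per shard
            for position, key in wanted:
                results[position] = shard.get(key)
        return results


store = ShardedStore(n_shards=4)
for i in range(8):
    store.set(f"k{i}", i)
print(store.mget(["k3", "missing", "k0"]))  # [3, None, 0]
```

The cost of a multi-key command grows with the number of distinct shards it touches, which is the reason cross-shard commands are the expensive case in a shared-nothing design.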

Throughput scales nearly linearly with CPU cores because there is no shared mutable state between shards.

2.4.2 Dashtable: Lock-Free Hash Table

Dragonfly replaces the standard chaining hash table with Dashtable, a design derived from the paper "Dash: Scalable Hashing on Persistent Memory":

Traditional hash table resize problem:

Allocate new larger array
Rehash all entries — O(n) blocking operation
Replace old table

Redis's mitigation: incremental rehash (move a fixed number of buckets per operation). Dragonfly's solution: segment-level growth.

Dashtable structure:
┌─────────────────────────────────────────────┐
│  Segment Directory (array of segment ptrs)  │
│  [ptr0] [ptr1] [ptr2] ... [ptrN]            │
└────┬──────────┬─────────────────────────────┘
     │          │
┌────▼───┐ ┌────▼───┐
│ Seg 0  │ │ Seg 1  │  Each segment: fixed-size probing table
│14 slots│ │14 slots│  Load factor 75% → split this segment only
└────────┘ └────────┘  Other segments continue serving requests

When a segment reaches capacity, it splits into two — only that segment is reorganized, taking O(segment_size) time rather than O(total_keys) time. No global lock required.
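Segment-level growth can be illustrated with a small extendible-hashing table. This is a sketch of the splitting idea only, not Dragonfly's Dashtable: each segment is a plain dict rather than a probing table, a segment splits when it is completely full rather than at 75% load, and all names here are hypothetical.

```python
import hashlib

SEGMENT_CAPACITY = 14  # slots per segment, taken from the diagram above


def h64(key):
    return int.from_bytes(hashlib.blake2b(key.encode(), digest_size=8).digest(), "big")


class Segment:
    def __init__(self, local_depth):
        self.local_depth = local_depth  # number of hash bits this segment owns
        self.items = {}


class SegmentedTable:
    def __init__(self):
        self.global_depth = 0
        self.directory = [Segment(0)]
        self.max_moved_in_one_split = 0

    def _segment_for(self, key):
        return self.directory[h64(key) & ((1 << self.global_depth) - 1)]

    def get(self, key):
        return self._segment_for(key).items.get(key)

    def put(self, key, value):
        while True:
            seg = self._segment_for(key)
            if key in seg.items or len(seg.items) < SEGMENT_CAPACITY:
                seg.items[key] = value
                return
            self._split(seg)  # only the full segment is reorganized

    def _split(self, seg):
        if seg.local_depth == self.global_depth:
            # Double the directory of pointers; no entries are rehashed here.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        bit = 1 << seg.local_depth
        low, high = Segment(seg.local_depth + 1), Segment(seg.local_depth + 1)
        for k, v in seg.items.items():  # at most SEGMENT_CAPACITY entries move
            (high if h64(k) & bit else low).items[k] = v
        for i, s in enumerate(self.directory):
            if s is seg:
                self.directory[i] = high if i & bit else low
        self.max_moved_in_one_split = max(self.max_moved_in_one_split, len(seg.items))


table = SegmentedTable()
for i in range(1000):
    table.put(f"key:{i}", i)
print(len({id(s) for s in table.directory}), "segments,",
      "max entries moved in one split:", table.max_moved_in_one_split)
```

The number of entries rehashed per split is bounded by the segment capacity no matter how large the table grows. In this sketch the directory pointer update still scans the directory, which is cheap relative to rehashing but is a simplification.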

2.4.3 Performance Numbers and Limitations

Official benchmark (AWS c6gn.12xlarge: 48 vCPU, 96 GiB RAM, up to 75 Gbps network):

System	GET ops/sec	SET ops/sec
Redis 7.0 (single instance)	800,000	700,000
Dragonfly 1.0 (single instance)	18,000,000	15,000,000
Redis Cluster (48 nodes × 1 core)	~16,000,000	~14,000,000

The comparison is somewhat unfair: Dragonfly uses all 48 cores, while the single Redis instance executes commands on one. Against the 48-shard Redis Cluster in the same table, the advantage shrinks to roughly 1.1x (18M vs. ~16M GET/s, 15M vs. ~14M SET/s).

Limitations:

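Business Source License: Dragonfly is source-available under BSL 1.1 rather than open source. Using it inside your own product is allowed, but offering Dragonfly to third parties as a managed service requires a commercial license. Each release converts to Apache 2.0 on its change date, which under BSL 1.1 falls no later than four years after that release.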
Immature ecosystem: Few production case studies, limited tooling, sparse documentation on failure scenarios and recovery procedures.
Compatibility gaps: Dragonfly claims full Redis protocol compatibility, but there are known issues with:
- Complex Lua scripts using redis.call in certain patterns
- Some RESP3 protocol extensions
- Cluster mode behavior differences (Dragonfly's sharding model differs from Redis Cluster's 16384-slot model)
Cross-shard command complexity: Commands that touch keys in different shards (such as RENAME or LMOVE between keys owned by different threads) need coordination across shard threads, so their latency and blocking behavior can differ from Redis.

2.5 Valkey: The Open-Source Redis Successor

2.5.1 The License Change Event

Redis Inc.'s license change on March 20, 2024 was not subtle:

RSALv2 (Redis Source Available License v2): Source code visible and modifiable, but you may not offer Redis as a service to third parties. Specifically targets cloud providers offering managed Redis.
SSPLv1 (Server Side Public License v1): If you offer Redis as a service, you must open-source your entire service stack. Unacceptable to AWS, Google, Microsoft.

The practical impact: cloud providers such as AWS and Google could no longer ship new Redis versions in their managed offerings without negotiating a commercial license with Redis Inc. (Microsoft reached such an agreement for Azure Cache for Redis.) This triggered a fork.

Timeline:

2024-03-20  Redis Inc. announces RSALv2 + SSPLv1 dual license
2024-03-28  Linux Foundation announces Valkey project (fork of Redis 7.2.4),
            backed by AWS, Google Cloud, Oracle, Ericsson, and Snap Inc.
2024-04-02  Valkey repository live on GitHub under LF umbrella
2024-04-16  Valkey 7.2.5 — first official release
2024-08     Valkey 8.0.0-rc1 (multi-threading improvements)
2024-10-08  AWS announces ElastiCache and MemoryDB support for Valkey
2024-Q4     Alibaba Cloud Tair adds Valkey protocol compatibility layer

2.5.2 Technical Direction of Valkey 8.0

Valkey 8.0 introduces meaningful performance improvements while maintaining full Redis 7.2 protocol compatibility:

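I/O threading improvements: Valkey 8.0 reworks I/O threading so that reading, parsing, and writing are offloaded asynchronously to I/O threads while the main thread keeps executing commands. As in Redis 6.0+, the thread count is set with the io-threads directive; Valkey then adjusts how many of the configured threads are active based on load.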

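Dual-channel replication: During a full synchronization, the RDB snapshot and the ongoing replication stream (the commands written while the snapshot is in transit) travel over separate TCP connections. The replica buffers the stream locally while it loads the snapshot, so a large initial sync no longer forces the primary to hold that backlog in its own memory.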

Slot migration optimization: Cluster rebalancing (CLUSTER SETSLOT + MIGRATE) is 60% faster in benchmarks, reducing the disruption window during scale-out events.

Memory efficiency: Valkey 8.0 reduces per-key overhead by ~10 bytes through internal struct packing improvements. At 100M keys, this saves ~1GB of RAM.

2.5.3 Migration Considerations

Valkey 7.2 is a drop-in replacement for Redis 7.2. Client libraries, command syntax, persistence formats (RDB/AOF), and cluster protocols are identical. A migration from Redis 7.2 to Valkey comes down to the following:

Replace the redis-server binary with valkey-server
No configuration changes required (valkey.conf uses the same directives as redis.conf)
No client library changes (Valkey speaks RESP2/RESP3)
No data migration (Valkey reads RDB files written by Redis 7.2 and earlier directly)

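The only practical concern is modules loaded with redis-server --loadmodule. Redis Inc.'s own modules (RediSearch, RedisJSON, RedisBloom) are distributed under Redis's source-available licenses, so verify that each module you depend on is both licensed and built for Valkey. The Valkey project maintains its own alternatives, including valkey-search, valkey-json, and valkey-bloom.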
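Client compatibility follows from the wire format: a client sends each command as a RESP array of bulk strings, and the bytes are the same regardless of which server receives them. The encoder below is a minimal sketch of that request framing; encode_command is a hypothetical helper, not part of any client library.

```python
def encode_command(*args):
    """Encode a command as a RESP array of bulk strings."""
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg if isinstance(arg, bytes) else str(arg).encode()
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


print(encode_command("PING"))
# b'*1\r\n$4\r\nPING\r\n'
print(encode_command("SET", "k", "v"))
# b'*3\r\n$3\r\nSET\r\n$1\r\nk\r\n$1\r\nv\r\n'
```

Since an existing client library already produces exactly this framing, pointing it at a Valkey endpoint requires changing only the connection address.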

2.6 KeyDB: Active-Active Replication

KeyDB (originally open-sourced by EQ Alpha Technology, later acquired by Snap Inc.) extends Redis with multi-threaded command execution and active-active (multi-primary) replication.

2.6.1 Per-Thread Event Loop

Redis's threading model (6.0+) keeps command execution on the main thread and offloads only network I/O:

Redis 6.0+:
io_thread_0 → reads request bytes → main_thread → executes → io_thread_0 → writes response

KeyDB moves command execution into each thread:

KeyDB:
Thread 0: read request → parse → execute → write response  (for its connections)
Thread 1: read request → parse → execute → write response  (for its connections)
Thread 2: read request → parse → execute → write response  (for its connections)

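Access to the shared keyspace (the main hash table and the values in it) is serialized by a lightweight spinlock, while network I/O and query parsing run in parallel across threads. KeyDB also uses MVCC (Multi-Version Concurrency Control) snapshots so that some long-running operations do not block writers. Unlike Dragonfly's shared-nothing shards, every KeyDB thread operates on the same dataset.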

KeyDB benchmark (8-core server, 100 concurrent connections):

	Redis 7.0	KeyDB 6.3
GET ops/sec	100,000	680,000
SET ops/sec	95,000	620,000
Memory overhead	Baseline	+5% (MVCC metadata)

2.6.2 Active-Active Replication

Traditional Redis replication is single-primary:

Redis: Primary (writes) → Replica (reads only)

KeyDB supports multiple primaries, each accepting writes:

KeyDB Active-Active:
Node A ←───────────────→ Node B
(accepts writes)         (accepts writes)
           bidirectional sync

Write conflicts use Last-Write-Wins (LWW) semantics based on the server timestamp. This is suitable for workloads where write conflicts are rare and occasional LWW resolution is acceptable — session stores, counters with high write volume, geographically distributed caches.

Caution: LWW conflict resolution means data can be silently lost if two nodes receive conflicting writes for the same key within the replication propagation window. This is not appropriate for financial data or inventory systems.
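The silent-loss behavior is easy to see in a minimal model of last-write-wins merging. This is a generic LWW sketch, not KeyDB's implementation: the Write record, the millisecond timestamps, and the node-id tie-break are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    timestamp_ms: int  # server clock at the node that accepted the write
    node_id: str       # assumed tie-breaker when timestamps are equal


def lww_merge(a, b):
    """Keep the write with the later timestamp; the other one is discarded."""
    return max(a, b, key=lambda w: (w.timestamp_ms, w.node_id))


# Two clients update the same key on different nodes within the sync window.
on_node_a = Write("cart=[book]", timestamp_ms=1000, node_id="A")
on_node_b = Write("cart=[pen]", timestamp_ms=1001, node_id="B")

winner = lww_merge(on_node_a, on_node_b)
print(winner.value)  # cart=[pen]
```

Both nodes converge on the same value regardless of the order in which they see the two writes, but the write to node A is gone and neither client is told. The outcome also depends on server clocks, so clock skew between nodes can decide which write survives.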

2.7 Decision Framework

Starting point: need a key-value store

Q1: Does dataset size exceed available RAM?
    YES → Aerospike (HMA: DRAM index + SSD data)
    NO  → continue

Q2: Do you need complex data structures (ZSet, Stream, Geo, etc.)?
    YES → Redis 7.2 (ecosystem) or Valkey 8.0 (open-source future)
    NO  → continue

Q3: Pure caching, no persistence, maximize multi-core throughput?
    YES → Memcached
    NO  → continue

Q4: Maximum single-node throughput, BSL license acceptable?
    YES → Dragonfly
    NO  → continue

Q5: Multi-primary writes required?
    YES → KeyDB or Redis Enterprise (CRDT-based)
    NO  → continue

Q6: Cloud-managed preferred?
    YES → AWS:     ElastiCache for Valkey (caching) or MemoryDB for Valkey (durable)
          GCP:     Memorystore for Valkey
          Azure:   Azure Cache for Redis (Enterprise tier for modules)
          Alibaba: Tair (enhanced features: field TTL, multi-score ZSet, etc.)
    NO  → Valkey (best open-source default: BSD license, active community)
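The question sequence can be written as a function that checks the conditions in the same order. This is a direct transcription of the framework above for illustration; the function name, parameter names, and cloud keys are hypothetical.

```python
CLOUD_OPTIONS = {
    "aws": "ElastiCache for Valkey or MemoryDB for Valkey",
    "gcp": "Memorystore for Valkey",
    "azure": "Azure Cache for Redis",
    "alibaba": "Tair",
}


def recommend(*, exceeds_ram=False, rich_types=False, pure_cache=False,
              max_single_node=False, bsl_ok=False, multi_primary=False,
              cloud=None):
    """Walk Q1..Q6 in order and return the first matching recommendation."""
    if exceeds_ram:                      # Q1
        return "Aerospike"
    if rich_types:                       # Q2
        return "Redis 7.2 or Valkey 8.0"
    if pure_cache:                       # Q3
        return "Memcached"
    if max_single_node and bsl_ok:       # Q4
        return "Dragonfly"
    if multi_primary:                    # Q5
        return "KeyDB or Redis Enterprise"
    if cloud is not None:                # Q6
        return CLOUD_OPTIONS[cloud.lower()]
    return "Valkey"


print(recommend(exceeds_ram=True))                       # Aerospike
print(recommend(max_single_node=True, bsl_ok=False))     # Valkey
print(recommend(cloud="AWS"))
```

Writing the framework this way makes its ordering explicit: an earlier question always wins, so a workload that both exceeds RAM and needs rich data structures is routed to Aerospike without Q2 ever being asked.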

2.8 Summary

The competitive landscape around Redis has never been more active. The 2024 license change functionally split the ecosystem into two tracks: Redis Inc.'s commercial product and the open-source Valkey lineage backed by the Linux Foundation and major cloud providers.

For most new projects, the choice is straightforward: Valkey for self-managed deployments, cloud-managed Valkey for teams that prefer operational simplicity.

Aerospike fills the specific niche of large-scale data (billions of records) where the dataset cannot fit in RAM. Memcached remains a valid choice for pure high-throughput caching with a multi-threaded advantage. Dragonfly and KeyDB are technically interesting but carry ecosystem risk for production use.

Chapter 3 goes deeper into cloud-managed services — specifically the technical architecture differences between AWS ElastiCache, MemoryDB, and Alibaba Cloud Tair.
