Chapter 2: The Open-Source Competitive Landscape: Memcached, Aerospike, Dragonfly, Valkey, KeyDB
2.1 Feature Comparison Matrix
| Dimension | Redis 7.2 | Memcached 1.6 | Aerospike 6.x | Dragonfly 1.x | Valkey 7.2 | KeyDB 6.x |
|---|---|---|---|---|---|---|
| Data structures | 10+ types | String only | KV + List + Map | Redis-compatible | Redis-compatible | Redis-compatible |
| Persistence | AOF / RDB | None | Native SSD | Snapshot | AOF / RDB | AOF / RDB |
| Clustering | Hash-slot Cluster | None (client sharding) | Native cluster | Native multi-shard | Hash-slot Cluster | Active-Active replication |
| Thread model | Single-cmd + multi I/O | Multi-threaded | Multi-core | One thread per core | Single-cmd + multi I/O | Multi-threaded event loops |
| Protocol | RESP2 / RESP3 | Text / binary | Proprietary | RESP (compatible) | RESP2 / RESP3 | RESP2 |
| Max dataset size | RAM-bounded | RAM-bounded | SSD (multi-TB) | RAM-bounded | RAM-bounded | RAM-bounded |
| License | BSD-3 through 7.2; RSALv2 + SSPLv1 from 7.4 | BSD-3 | AGPL v3 (server) | BSL (delayed open) | BSD-3 | BSD-3 |
| Ecosystem maturity | Very high | High | Medium | Low | Growing | Low |
| Transactions | MULTI / Lua | None | Lua | MULTI / Lua | MULTI / Lua | MULTI / Lua |
| Pub/Sub | Yes | No | No | Yes | Yes | Yes |
| Streams | Yes | No | No | Partial | Yes | Yes |
2.2 Memcached: Simplicity as a Design Principle
2.2.1 Architecture
Brad Fitzpatrick wrote Memcached in 2003 for LiveJournal. The design philosophy is radical minimalism: a multi-threaded, in-memory hash table with a deliberately small command set (GET, SET, DELETE, CAS, and a handful of variants such as ADD, APPEND, and INCR/DECR). There is no persistence, no replication, and no data structure beyond opaque byte strings.
Memory management uses a slab allocator, which pre-partitions memory into fixed-size chunk classes:
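| Slab class | Chunk size | Chunks per 1 MB slab | Notes |
|---|---|---|---|
| 1 | 96 bytes | 10,922 | Small strings, session tokens |
| 2 | 120 bytes | 8,738 | |
| 3 | 152 bytes | 6,898 | |
| ... | | | |
| 39 | 512 KB | 2 | |
| 40 | 640 KB | 1 | |
| 41 | 1 MB | 1 | Maximum value size |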
Allocation: find the smallest chunk class that fits the item and return one chunk from that class's free list. Deallocation: return the chunk to the same free list. Because every chunk in a class has the same size, there is no external fragmentation within a class. Internal fragmentation remains: a 100-byte item occupies a 120-byte chunk and wastes 20 bytes.
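The class-selection rule can be sketched in a few lines. This is an illustration rather than Memcached's actual allocator: the growth factor of 1.25 and the 8-byte alignment are assumptions chosen because they reproduce the first chunk sizes in the table (96, 120, 152), and `build_chunk_sizes` and `pick_chunk` are hypothetical helper names.

```python
import bisect

PAGE_SIZE = 1024 * 1024   # 1 MB slab page, matching the table
GROWTH_FACTOR = 1.25      # assumed ratio between consecutive classes
ALIGN = 8                 # assumed chunk alignment in bytes


def build_chunk_sizes(first=96, limit=PAGE_SIZE // 2):
    """Return ascending chunk sizes, ending with one page-sized class."""
    sizes = []
    size = first
    while size <= limit:
        sizes.append(size)
        size = int(size * GROWTH_FACTOR)
        size = (size + ALIGN - 1) // ALIGN * ALIGN  # round up to alignment
    sizes.append(PAGE_SIZE)
    return sizes


def pick_chunk(sizes, item_size):
    """Smallest chunk size that can hold item_size bytes."""
    i = bisect.bisect_left(sizes, item_size)
    if i == len(sizes):
        raise ValueError("item larger than the largest chunk")
    return sizes[i]


sizes = build_chunk_sizes()
chunk = pick_chunk(sizes, 100)
wasted = chunk - 100   # internal fragmentation for this item
```

With these assumptions a 100-byte item lands in the 120-byte class, and `PAGE_SIZE // sizes[0]` gives the 10,922 chunks per slab listed for class 1. Later class boundaries depend on the real growth factor and item header size, so they will not necessarily match a running server.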
The slab allocator's global lock is Memcached's primary multi-threading bottleneck. On a 32-core machine, throughput scales sub-linearly (roughly 8–10x rather than 32x) because threads contend on slab free-list access.
2.2.2 Multi-Threaded Model
Memcached uses a dispatcher + worker thread model via libevent:
Main Thread
(accept() new connections)
│
├── Worker Thread 0 (handles connections assigned to it)
├── Worker Thread 1
├── Worker Thread 2
└── Worker Thread N-1
Each worker has its own libevent instance and event loop. The dispatcher round-robins new connections to workers via a pipe. Within a worker, processing is single-threaded for its connections. Between workers, shared state (the global hash table, slab allocator) requires locking.
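The hand-off pattern can be imitated with standard-library threads. In this sketch a `queue.Queue` per worker stands in for the notification pipe and integers stand in for accepted sockets; `run_dispatch_demo` is a hypothetical name, and none of this mirrors Memcached's C code.

```python
import itertools
import queue
import threading


def run_dispatch_demo(num_workers=4, num_connections=10):
    """Hand fake connection ids to workers round-robin; return worker -> ids."""
    inboxes = [queue.Queue() for _ in range(num_workers)]
    handled = {i: [] for i in range(num_workers)}

    def worker(idx):
        # Each worker drains only its own inbox, so work on its own
        # connections needs no locking.
        while True:
            conn = inboxes[idx].get()
            if conn is None:          # sentinel: shut down
                return
            handled[idx].append(conn)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for t in threads:
        t.start()

    # The main-thread role: accept connections and hand them out in turn.
    next_worker = itertools.cycle(range(num_workers))
    for conn_id in range(num_connections):
        inboxes[next(next_worker)].put(conn_id)

    for inbox in inboxes:
        inbox.put(None)
    for t in threads:
        t.join()
    return handled


assignment = run_dispatch_demo(4, 10)
```

Anything the workers share beyond their own inbox (in Memcached, the hash table and slab allocator) would still need a lock, which is where the contention described earlier comes from.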
Benchmark comparison on 8-core server:
| Concurrent clients | Memcached GET/s | Redis GET/s |
|---|---|---|
| 1 | 85,000 | 95,000 |
| 8 | 580,000 | 100,000 |
| 64 | 1,200,000 | 108,000 |
| 256 | 1,400,000 | 115,000 |
Memcached's multi-threading advantage materializes once many clients connect concurrently. In this test, Redis's single-threaded command processing caps out around 100–120k ops/sec per instance; Redis Cluster raises that ceiling by adding shards, with throughput growing roughly linearly for single-key workloads.
2.2.3 When to Use Memcached
Good fit: Pure caching workload, data can be rebuilt from primary store on miss, no persistence needed, simple key-value only, need to maximize throughput per core under many concurrent connections.
Poor fit: Any requirement for data persistence, replication, complex data structures, pub/sub, atomic operations beyond CAS, or TTL on specific fields.
In 2024, Memcached remains relevant for large-scale caching where the data model is simple and horizontal scaling is done at the application layer via consistent hashing (libketama). Its declining market share is not due to technical failure; it simply hasn't grown beyond its original scope.
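Client-side consistent hashing is what makes this application-layer scaling workable: adding a cache node should remap only a fraction of the keys. The sketch below is a ketama-style ring, not libketama itself; the `HashRing` class, the 100 virtual points per node, and the use of the first 4 bytes of MD5 are all assumptions made for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        ring = []
        for node in nodes:
            for i in range(vnodes):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._ring = ring
        self._points = [point for point, _ in ring]

    @staticmethod
    def _hash(text):
        # First 4 bytes of MD5 as an unsigned 32-bit ring position.
        return int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "big")

    def node_for(self, key):
        # First ring point clockwise from the key, wrapping at the end.
        i = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[i][1]


keys = [f"user:{i}" for i in range(1000)]
before = HashRing(["cache-a", "cache-b", "cache-c"])
after = HashRing(["cache-a", "cache-b", "cache-c", "cache-d"])
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
```

Every key in `moved` now belongs to the new node, and the rest stay where they were. With plain `hash(key) % node_count` sharding, most keys would change owner and the cache would go cold after a resize.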
2.3 Aerospike: Breaking the Memory Barrier
2.3.1 Hybrid Memory Architecture (HMA)
Aerospike's defining innovation is decoupling the index (in DRAM) from the data (on SSD or DRAM), allowing datasets far exceeding available RAM:
┌──────────────────────────────────────────────────────────────┐
│                        Aerospike Node                        │
│                                                              │
│  DRAM (Primary Index)            SSD (Data Layer)            │
│  ┌──────────────────────┐        ┌───────────────────────┐   │
│  │ Hash → Record Ptr    │───────►│ Record Bytes          │   │
│  │ 64 bytes per record  │        │ Direct Block I/O      │   │
│  │ (regardless of       │        │ (O_DIRECT, bypassing  │   │
│  │  value size)         │        │  OS page cache)       │   │
│  └──────────────────────┘        └───────────────────────┘   │
│                                                              │
│  Memory cost: 64 B × N records   Capacity: SSD total         │
│  1 billion records = 64 GB DRAM  1 billion × 1 KB = 1 TB     │
└──────────────────────────────────────────────────────────────┘
The per-record index entry (simplified):
/* Aerospike primary index record, packed to 64 bytes (simplified) */
#include <stdint.h>

struct as_index {
    uint8_t  digest[20];         /* RIPEMD-160 of key */
    uint16_t set_id;             /* which set (namespace.set) */
    uint16_t tree_id;            /* partition tree */
    uint64_t rblock_id;          /* block address on SSD */
    uint16_t n_rblocks;          /* number of blocks occupied */
    uint32_t void_time;          /* TTL expiry (epoch seconds) */
    uint32_t generation;         /* write generation (for CAS) */
    uint8_t  replication_state;  /* master/replica/migrating */
    /* bit-packed flags fill remaining space to exactly 64 bytes */
};
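As a quick sanity check on the numbers, the named fields of the simplified struct can be summed and the index cost scaled to a record count. The format string below simply transcribes the fields shown above with padding disabled; `named_field_bytes` and `index_dram_bytes` are hypothetical helpers, and the optional replication factor is an assumption about how many copies of the index a cluster holds.

```python
import struct

# digest[20], set_id, tree_id, rblock_id, n_rblocks, void_time,
# generation, replication_state; "<" means no alignment padding.
NAMED_FIELDS = "<20sHHQHIIB"
INDEX_ENTRY_BYTES = 64


def named_field_bytes():
    """Bytes used by the fields that the simplified struct names."""
    return struct.calcsize(NAMED_FIELDS)


def index_dram_bytes(records, replication_factor=1):
    """DRAM needed for the primary index, at 64 bytes per record copy."""
    return records * INDEX_ENTRY_BYTES * replication_factor


spare = INDEX_ENTRY_BYTES - named_field_bytes()   # left for flags etc.
one_billion = index_dram_bytes(10**9)             # 64 * 10**9 bytes
```

The named fields account for 43 of the 64 bytes, and a billion records cost 64 GB of DRAM per copy of the index regardless of how large the values on SSD are.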
Why bypass the OS page cache: The kernel page cache manages data in 4 KB pages. For random access to many small records (512 bytes to 2 KB each), this wastes both bandwidth and memory: the kernel loads 4 KB to serve 512 bytes, and the page is often evicted before it is touched again. O_DIRECT lets Aerospike issue 512-byte-aligned block I/O straight to the device, so each read transfers only the blocks the record occupies.
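The arithmetic behind that argument is small enough to write down. This sketch does no real I/O (direct I/O also needs aligned buffers, which is out of scope here); `aligned_span` and `read_amplification` are hypothetical helpers, and the 512-byte block size is taken from the paragraph above.

```python
def aligned_span(offset, length, block=512):
    """Smallest block-aligned (start, size) covering [offset, offset+length)."""
    start = offset - offset % block
    end = -(-(offset + length) // block) * block   # round up to a block
    return start, end - start


def read_amplification(record_len, io_bytes):
    """How many bytes are transferred per byte actually wanted."""
    return io_bytes / record_len


direct = aligned_span(offset=1000, length=512)      # unaligned record
page_cached = read_amplification(512, 4096)         # one 4 KB page
direct_io = read_amplification(512, direct[1])      # two 512 B blocks
```

A 512-byte record that straddles a block boundary costs two blocks (1 KB) with direct I/O, against a full 4 KB page through the page cache; an aligned record costs exactly one block.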
Measured latency on NVMe SSD (Samsung 983 DCT):
- Random read: 120–150 μs (vs. 5–10 ms for a spinning-disk seek, or 0.05–0.1 ms end to end for a DRAM-resident record)
- Random write: 80–120 μs
- Throughput: 500,000+ random read IOPS
2.3.2 Smart Client: No Proxy Layer
Traditional distributed key-value systems route requests through a proxy tier:
Client → Proxy (routing) → Data Node
                         → Data Node
                         → Data Node
Aerospike embeds routing intelligence in the client library:
Client (Smart Client)
├── Cluster map: {partition_id → node_address}
├── compute: node = cluster_map[hash(key) % partition_count]
└── connect directly to node (no proxy)
On startup, the client fetches the cluster's partition map. On every operation, it computes the target node locally and connects directly. This eliminates proxy latency (typically 0.1–0.5 ms) and removes the proxy tier as a single point of failure.
When the cluster topology changes (node addition, failure, rebalancing), each node broadcasts partition map updates, and clients refresh within ~1 second.
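A toy version of the client-side routing looks like this. It is not the Aerospike client API: `SmartClient`, `partition_id`, and `refresh` are hypothetical names, SHA-1 stands in for RIPEMD-160 only because both give a 20-byte digest and SHA-1 is always available in `hashlib`, and the partition count of 4096 is an assumption of the sketch.

```python
import hashlib

N_PARTITIONS = 4096   # assumed partition count for this sketch


def partition_id(key: bytes) -> int:
    """Map a key to a partition using the leading bytes of its digest."""
    digest = hashlib.sha1(key).digest()   # stand-in for RIPEMD-160
    return int.from_bytes(digest[:2], "little") % N_PARTITIONS


class SmartClient:
    def __init__(self, partition_map):
        # partition_map[i] is the address of the node owning partition i.
        self.partition_map = list(partition_map)

    def node_for(self, key: bytes) -> str:
        # Pure local computation: no proxy hop and no lookup round trip.
        return self.partition_map[partition_id(key)]

    def refresh(self, new_map):
        # Called when the cluster publishes a new partition map.
        self.partition_map = list(new_map)


client = SmartClient(f"node-{p % 3}" for p in range(N_PARTITIONS))
target = client.node_for(b"user:42")
```

The cost of skipping the proxy is that every client must hold the map and keep it fresh; a stale map sends a request to a node that no longer owns the partition, and the client has to retry after refreshing.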
2.3.3 Strong Consistency (CP Mode)
Aerospike supports two consistency models:
Availability mode (AP): Asynchronous replication. Primary acknowledges writes after local commit only. Risk of data loss if primary fails before replication completes. Maximum throughput, minimum latency.
Strong consistency mode (CP): Write quorum based on a Paxos-like protocol:
Write flow (replication factor RF=3, quorum = ⌊RF/2⌋ + 1 = 2):
1. Client → Primary: write record
2. Primary commits to local SSD
3. Primary → Replica-1 (parallel): replicate
4. Primary → Replica-2 (parallel): replicate
5. Primary waits for 2 confirmations (quorum met)
6. Primary → Client: ACK
Write latency in CP mode is bounded by the replication round-trip to the second-fastest replica, typically adding 1–3 ms on a well-connected cluster. Reads in CP mode always go to the primary, ensuring no stale reads.
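The latency bound follows directly from the write flow as described here: replication fans out in parallel, so the wait ends when the required number of confirmations has arrived. The helpers below model only that arithmetic under the flow given in this section; they are hypothetical and say nothing about Aerospike's internal protocol. Times are integers in microseconds to keep the arithmetic exact.

```python
def quorum_size(replication_factor):
    """Majority quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def commit_latency_us(local_commit_us, replica_rtts_us, acks_needed):
    """Local commit time plus the wait for the k-th fastest replica ack."""
    if acks_needed > len(replica_rtts_us):
        raise ValueError("not enough replicas to satisfy the ack count")
    if acks_needed == 0:
        return local_commit_us
    # Replicas are contacted in parallel, so only the k-th fastest matters.
    return local_commit_us + sorted(replica_rtts_us)[acks_needed - 1]


needed = quorum_size(3)
latency = commit_latency_us(100, [3000, 1000], acks_needed=needed)
```

With a 100 μs local commit and replica round trips of 1 ms and 3 ms, waiting for two confirmations costs 3.1 ms, while waiting for one would cost 1.1 ms. That gap is the price of the stronger guarantee.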
2.3.4 Use Cases and Comparison with Redis
| Scenario | Aerospike | Redis |
|---|---|---|
| Dataset >> RAM | Wins | Loses (OOM risk) |
| Adtech / RTB user profiles | Wins | Loses at scale |
| Sub-millisecond DRAM latency | Loses (0.2–1 ms) | Wins (< 0.1 ms) |
| Complex data structures | Loses | Wins |
| Pub/Sub, Streams, Geo | N/A | Wins |
| Cost at 1B records × 1 KB | ~$3K/mo (SSD) | ~$20K/mo (RAM) |
Real-world scale: A major advertising exchange stores 2 billion user profiles (1 KB average each, 2 TB in total) in a 6-node Aerospike cluster with 128 GB of DRAM and 4 TB of NVMe per node. The 768 GB of cluster DRAM comfortably holds the primary index, which at 64 bytes per record needs about 128 GB per copy of the data. An equivalent Redis deployment would need 2 TB of RAM across the cluster, roughly 7x the infrastructure cost.
2.4 Dragonfly: Rethinking Redis from Scratch
2.4.1 Shared-Nothing Architecture
Dragonfly (first released in 2022, co-founded by Roman Gershman) is built on a shared-nothing multi-threaded model in which each thread runs its work as fibers (lightweight cooperative tasks) rather than blocking OS threads:
4-core Dragonfly instance:
  Core 0       Core 1       Core 2       Core 3
    │            │            │            │
 Shard 0      Shard 1      Shard 2      Shard 3
(hash%4=0)   (hash%4=1)   (hash%4=2)   (hash%4=3)
Each shard: independent hash table, independent event loop
Cross-shard ops: message passing via lock-free queues
For single-key operations (GET, SET, INCR), the request routes to exactly one shard, with zero cross-thread communication and zero locking. For multi-key operations (MGET, MSET), the dispatcher sends sub-requests to each relevant shard and aggregates results.
Throughput scales nearly linearly with CPU cores because there is no shared mutable state between shards.
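The routing rule and the multi-key fan-out can be shown without any threads at all. This sketch keeps every shard in one process and one thread, so it illustrates only the data partitioning, not the concurrency; `ShardedStore` is a hypothetical class and CRC32 is an arbitrary stand-in for whatever hash Dragonfly uses.

```python
import zlib
from collections import defaultdict


def shard_for(key: str, num_shards: int) -> int:
    """Deterministic key -> shard mapping (hash modulo shard count)."""
    return zlib.crc32(key.encode()) % num_shards


class ShardedStore:
    def __init__(self, num_shards=4):
        # One private dict per shard; in a real engine each would be
        # owned by exactly one thread.
        self.shards = [dict() for _ in range(num_shards)]

    def set(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)

    def mget(self, keys):
        # Group keys by owning shard, issue one sub-request per shard,
        # then reassemble results in the caller's key order.
        by_shard = defaultdict(list)
        for key in keys:
            by_shard[shard_for(key, len(self.shards))].append(key)
        found = {}
        for shard_id, shard_keys in by_shard.items():
            shard = self.shards[shard_id]
            found.update({k: shard.get(k) for k in shard_keys})
        return [found[k] for k in keys]


store = ShardedStore(4)
for i in range(20):
    store.set(f"k{i}", i)
values = store.mget(["k3", "k17", "nope", "k0"])
```

Single-key calls touch one dict and nothing else, which is why they need no locks. `mget` is the only path that visits several shards, and in a threaded engine that is where message passing and result aggregation come in.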
2.4.2 Dashtable: Lock-Free Hash Table
Dragonfly replaces the standard chaining hash table with Dashtable, a segmented design derived from extendible hashing:
Traditional hash table resize problem:
- Allocate new larger array
- Rehash all entries: an O(n) blocking operation
- Replace old table
Redis's mitigation: incremental rehash (move a fixed number of buckets per operation). Dragonfly's solution: segment-level growth.
Dashtable structure:
┌─────────────────────────────────────────────┐
│ Segment Directory (array of segment ptrs)   │
│ [ptr0] [ptr1] [ptr2] ... [ptrN]             │
└────┬────────────┬───────────────────────────┘
     │            │
┌────▼─────┐ ┌────▼─────┐
│ Seg 0    │ │ Seg 1    │  Each segment: fixed-size probing table
│ 14 slots │ │ 14 slots │  Load factor 75% → split this segment only
└──────────┘ └──────────┘  Other segments continue serving requests
When a segment reaches capacity, it splits into two. Only that segment is reorganized, which takes O(segment_size) time rather than O(total_keys) time, and no global lock is required.
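Segment-level growth is the core idea of extendible hashing, and a compact version fits in a short class. This is a teaching sketch, not Dragonfly's Dashtable: segments are plain dicts rather than probing tables, a segment splits when full rather than at a 75% load factor, and `SegmentedTable` is a hypothetical name.

```python
class Segment:
    def __init__(self, depth, capacity):
        self.depth = depth        # number of hash bits this segment owns
        self.capacity = capacity
        self.items = {}


class SegmentedTable:
    def __init__(self, capacity=14):
        self.capacity = capacity
        self.global_depth = 0
        self.directory = [Segment(0, capacity)]
        self.splits = 0

    def _index(self, key):
        # Low global_depth bits of the hash select a directory slot.
        return hash(key) & ((1 << self.global_depth) - 1)

    def get(self, key, default=None):
        return self.directory[self._index(key)].items.get(key, default)

    def put(self, key, value):
        while True:
            seg = self.directory[self._index(key)]
            if key in seg.items or len(seg.items) < seg.capacity:
                seg.items[key] = value
                return
            self._split(seg)      # only the full segment is reorganized

    def _split(self, seg):
        if seg.depth == self.global_depth:
            # Doubling copies pointers only; no entries are rehashed.
            self.directory = self.directory + self.directory
            self.global_depth += 1
        seg.depth += 1
        bit = 1 << (seg.depth - 1)
        sibling = Segment(seg.depth, self.capacity)
        for key in list(seg.items):
            if hash(key) & bit:
                sibling.items[key] = seg.items.pop(key)
        for i, slot in enumerate(self.directory):
            if slot is seg and i & bit:
                self.directory[i] = sibling
        self.splits += 1


table = SegmentedTable(capacity=14)
for i in range(1000):
    table.put(i, i * 2)
```

Each split moves at most one segment's worth of entries (14 here), however many keys the table holds in total. The directory still doubles occasionally, but that copies pointers rather than rehashing data.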
2.4.3 Performance Numbers and Limitations
Official benchmark (AWS c6gn.12xlarge, 48 vCPU, 192 GB RAM, 25 Gbps NIC):
| System | GET ops/sec | SET ops/sec |
|---|---|---|
| Redis 7.0 (single instance) | 800,000 | 700,000 |
| Dragonfly 1.0 (single instance) | 18,000,000 | 15,000,000 |
| Redis Cluster (48 nodes × 1 core) | ~16,000,000 | ~14,000,000 |
The headline comparison is somewhat unfair: Dragonfly uses all 48 cores, while the single Redis instance uses one. Against a Redis Cluster sized to the same 48 cores, the advantage in this table shrinks to roughly 1.1x (18M vs. ~16M GET/s, 15M vs. ~14M SET/s).
Limitations:
- Business Source License: The BSL restricts commercial use, in particular offering Dragonfly as a managed service, until 4 years after each release, at which point that release converts to Apache 2.0. In practice, you cannot build a commercial managed Dragonfly service without a paid license.
- Immature ecosystem: Few production case studies, limited tooling, sparse documentation on failure scenarios and recovery procedures.
- Compatibility gaps: Dragonfly claims full Redis protocol compatibility, but there are known issues with:
  - Complex Lua scripts using redis.call in certain patterns
  - Some RESP3 protocol extensions
  - Cluster mode behavior differences (Dragonfly's sharding model differs from Redis Cluster's 16384-slot model)
- Cross-shard command complexity: Commands that touch keys in different shards (such as RENAME or LMOVE across shards) require cross-shard coordination, and are either blocked or handled differently from Redis.
2.5 Valkey: The Open-Source Redis Successor
2.5.1 The License Change Event
Redis Inc.'s license change on March 20, 2024 was not subtle:
- RSALv2 (Redis Source Available License v2): Source code visible and modifiable, but you may not offer Redis as a service to third parties. Specifically targets cloud providers offering managed Redis.
- SSPLv1 (Server Side Public License v1): If you offer Redis as a service, you must open-source your entire service stack. Unacceptable to AWS, Google, Microsoft.
The practical impact: AWS, Google, and Azure could no longer upgrade their managed Redis offerings to new versions without negotiating a commercial license with Redis Inc. This triggered a fork.
Timeline:
2024-03-20 Redis Inc. announces RSALv2 + SSPLv1 dual license
2024-03-21 Linux Foundation announces Valkey project (fork of Redis 7.2.4)
2024-03-27 AWS, Google Cloud, Oracle, Alibaba Cloud, Ericsson join Valkey
2024-04-02 Valkey repository live on GitHub under LF umbrella
2024-04-16 Valkey 7.2.5: first official release
2024-06-11 Valkey 8.0.0-rc1: multi-threading improvements
2024-10-01 AWS makes Valkey default engine for new ElastiCache clusters
2024-Q4 Alibaba Cloud Tair adds Valkey protocol compatibility layer
2.5.2 Technical Direction of Valkey 8.0
Valkey 8.0 introduces meaningful performance improvements while maintaining full Redis 7.2 protocol compatibility:
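I/O threading improvements: Valkey 8.0 reworks I/O threading so that more of the read, parse, and write path runs in parallel off the main thread, while command execution stays single-threaded. The thread count is still governed by the io-threads setting inherited from Redis 6.0; check its default in your build before assuming parallel I/O is active.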
Dual-channel replication: Full synchronization (RDB transfer) and incremental replication (the ongoing command stream) use separate TCP connections. The replica can buffer new writes while the RDB is still transferring, so a large initial sync no longer stalls write propagation.
Slot migration optimization: Cluster rebalancing (CLUSTER SETSLOT + MIGRATE) is 60% faster in benchmarks, reducing the disruption window during scale-out events.
Memory efficiency: Valkey 8.0 reduces per-key overhead by ~10 bytes through internal struct packing improvements. At 100M keys, this saves ~1GB of RAM.
2.5.3 Migration Considerations
Valkey is a drop-in replacement for Redis 7.2. Client libraries, command syntax, persistence formats (RDB/AOF), and cluster protocols are identical. A migration from Redis 7.2 to Valkey requires:
- Replace the redis-server binary with valkey-server
- No configuration changes required (valkey.conf is a renamed redis.conf)
- No client library changes (Valkey speaks RESP2/RESP3)
- No data migration (Valkey reads Redis RDB files directly)
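The only practical concern is module availability (anything loaded via --loadmodule). Redis Inc.'s modules (RediSearch, RedisJSON, RedisBloom) are licensed separately from the server, so confirm that each module you depend on has a compatible build or a Valkey-native equivalent (for example valkey-search, valkey-json, or valkey-bloom) before migrating.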
2.6 KeyDB: Active-Active Replication
KeyDB (originally open-sourced by EQ Alpha Technology, later acquired by Snap Inc.) extends Redis with multi-threaded command execution and active-active (multi-primary) replication.
2.6.1 Per-Thread Event Loop
Redis's threading model (6.0+) keeps command execution on the main thread and offloads only network I/O:
Redis 6.0+:
io_thread_0 → reads request bytes → main_thread → executes → io_thread_0 → writes response
KeyDB moves command execution into each thread:
KeyDB:
Thread 0: read request → parse → execute → write response (for its connections)
Thread 1: read request → parse → execute → write response (for its connections)
Thread 2: read request → parse → execute → write response (for its connections)
Shared data structures (the main hash table, sorted sets) use a combination of fine-grained locking and MVCC (Multi-Version Concurrency Control) to allow safe concurrent access. In running commands on every thread, KeyDB resembles Dragonfly more than Redis, although its threads share one keyspace rather than owning disjoint shards.
KeyDB benchmark (8-core server, 100 concurrent connections):
| Metric | Redis 7.0 | KeyDB 6.3 |
|---|---|---|
| GET ops/sec | 100,000 | 680,000 |
| SET ops/sec | 95,000 | 620,000 |
| Memory overhead | Baseline | +5% (MVCC metadata) |
2.6.2 Active-Active Replication
Traditional Redis replication is single-primary:
Redis: Primary (writes) → Replica (reads only)
KeyDB supports multiple primaries, each accepting writes:
KeyDB Active-Active:
Node A  ────────────────→  Node B
(accepts writes)           (accepts writes)
        ←────────────────
        bidirectional sync
Write conflicts are resolved with Last-Write-Wins (LWW) semantics based on the server timestamp. This suits workloads where conflicting writes are rare and an occasional lost update is acceptable: session stores, geographically distributed caches, and approximate counters.
Caution: LWW conflict resolution means data can be silently lost if two nodes receive conflicting writes for the same key within the replication propagation window. This is not appropriate for financial data or inventory systems.
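The silent loss is easy to reproduce with a toy merge function. The sketch below assumes each write carries a millisecond timestamp and the id of the node that accepted it; `Versioned` and `lww_merge` are hypothetical names, and the tie-break on node id is an assumption added so that both nodes converge on the same winner, not a documented KeyDB rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp_ms: int
    node_id: str


def lww_merge(local: Versioned, remote: Versioned) -> Versioned:
    """Keep the write with the later timestamp; break ties by node id."""
    return max(local, remote, key=lambda v: (v.timestamp_ms, v.node_id))


# Two clients update the same key on different nodes 1 ms apart,
# before either write has replicated to the other side.
on_a = Versioned("cart=[book]", timestamp_ms=1000, node_id="A")
on_b = Versioned("cart=[pen]", timestamp_ms=1001, node_id="B")

winner_at_a = lww_merge(on_a, on_b)
winner_at_b = lww_merge(on_b, on_a)
```

Both nodes converge on the write from node B, and the book added on node A is gone without any error being raised. Clock skew between nodes makes this worse, because the "later" write is decided by whichever server clock runs ahead.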
2.7 Decision Framework
Starting point: need a key-value store
Q1: Does dataset size exceed available RAM?
    YES → Aerospike (HMA: DRAM index + SSD data)
    NO  → continue
Q2: Do you need complex data structures (ZSet, Stream, Geo, etc.)?
    YES → Redis 7.2 (ecosystem) or Valkey 8.0 (open-source future)
    NO  → continue
Q3: Pure caching, no persistence, maximize multi-core throughput?
    YES → Memcached
    NO  → continue
Q4: Maximum single-node throughput, BSL license acceptable?
    YES → Dragonfly
    NO  → continue
Q5: Multi-primary writes required?
    YES → KeyDB or Redis Enterprise (CRDT-based)
    NO  → continue
Q6: Cloud-managed preferred?
    YES → AWS: ElastiCache for Valkey (caching) or MemoryDB for Valkey (durable)
          GCP: Memorystore for Valkey
          Azure: Azure Cache for Redis (Enterprise tier for modules)
          Alibaba: Tair (enhanced features: field TTL, multi-score ZSet, etc.)
    NO  → Valkey (best open-source default: BSD license, active community)
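For teams that want to record the choice in a design document or a provisioning script, the questions above translate into a first-match function. This is only a transcription of the framework in this section; `recommend`, its parameter names, and the cloud keys are hypothetical, and the order of the checks is the order of the questions.

```python
MANAGED_OPTIONS = {
    "aws": "ElastiCache for Valkey or MemoryDB for Valkey",
    "gcp": "Memorystore for Valkey",
    "azure": "Azure Cache for Redis",
    "alibaba": "Tair",
}


def recommend(*, dataset_exceeds_ram=False, needs_rich_types=False,
              pure_cache=False, wants_max_single_node=False,
              bsl_acceptable=False, multi_primary=False, cloud=None):
    """Return the first recommendation whose question is answered yes."""
    if dataset_exceeds_ram:                          # Q1
        return "Aerospike"
    if needs_rich_types:                             # Q2
        return "Redis 7.2 or Valkey 8.0"
    if pure_cache:                                   # Q3
        return "Memcached"
    if wants_max_single_node and bsl_acceptable:     # Q4
        return "Dragonfly"
    if multi_primary:                                # Q5
        return "KeyDB or Redis Enterprise"
    if cloud is not None:                            # Q6
        if cloud not in MANAGED_OPTIONS:
            raise ValueError(f"unknown cloud: {cloud!r}")
        return MANAGED_OPTIONS[cloud]
    return "Valkey"


choice = recommend(pure_cache=True, cloud="aws")
```

Because the first matching question wins, a pure-cache workload on AWS comes back as Memcached rather than a managed Valkey service. Reordering the checks changes the answer, so the order itself is a decision worth reviewing.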
2.8 Summary
The competitive landscape around Redis has never been more active. The 2024 license change functionally split the ecosystem into two tracks: Redis Inc.'s commercial product and the open-source Valkey lineage backed by the Linux Foundation and major cloud providers.
For most new projects, the choice is straightforward: Valkey for self-managed deployments, cloud-managed Valkey for teams that prefer operational simplicity.
Aerospike fills the specific niche of large-scale data (billions of records) where the dataset cannot fit in RAM. Memcached remains a valid choice for pure high-throughput caching with a multi-threaded advantage. Dragonfly and KeyDB are technically interesting but carry ecosystem risk for production use.
Chapter 3 goes deeper into cloud-managed services, specifically the technical architecture differences between AWS ElastiCache, MemoryDB, and Alibaba Cloud Tair.