Chapter 3: Cloud-Managed Redis: ElastiCache, MemoryDB, and Alibaba Cloud Tair

3.1 The Value Proposition of Managed Redis

Self-managing a Redis cluster means owning: primary failover (Sentinel), cluster resharding, version upgrades, backup and restore, security patching, monitoring, and capacity planning. Cloud-managed services transfer this operational burden to the provider, not because they are cheaper (they are usually more expensive), but because the operational complexity has real engineering cost.

However, "managed Redis" covers a wide spectrum of architectures. Choosing the wrong product can result in unexpected data loss, latency surprises, or capability gaps that require re-architecture later.

3.2 AWS ElastiCache for Redis/Valkey

3.2.1 Two Cluster Modes

Cluster Mode Disabled (CMD), single shard:

Primary Node
  ├── Replica 0 (read replica)
  └── Replica 1 (read replica)

All keys reside on one primary. Read throughput scales horizontally (route reads to replicas); write throughput does not scale. Suitable when the dataset fits comfortably in one node (<= 340 GB per node) and write throughput is not the bottleneck.

Cluster Mode Enabled (CME), multi-shard:

Shard 0: Primary-0 + Replica-0a + Replica-0b  (hash slots   0 – 5460)
Shard 1: Primary-1 + Replica-1a + Replica-1b  (hash slots 5461 – 10922)
Shard 2: Primary-2 + Replica-2a + Replica-2b  (hash slots 10923 – 16383)

16,384 hash slots distributed across shards. Each shard scales independently. Suitable for large datasets (multi-TB by adding shards) or write-heavy workloads requiring horizontal write scaling.

Critical CME constraint: Multi-key operations (MGET, MSET, MULTI transactions, Lua scripts touching multiple keys) require all keys to hash to the same slot. Use hash tags to enforce co-location:

# WRONG: 'user:1001' and 'order:1001' may hash to different slots
MGET user:1001 order:1001
# → CROSSSLOT error in cluster mode

# CORRECT: hash tag {} forces slot based on content inside braces
MGET {1001}:user {1001}:order
# CRC16("1001") % 16384 = same slot for both keys

# Lua scripts must also use hash tags if accessing multiple keys
EVAL "
local user = redis.call('GET', KEYS[1])
local order = redis.call('GET', KEYS[2])
return {user, order}
" 2 {1001}:user {1001}:order

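The slot arithmetic behind hash tags can be checked offline. The following is a minimal sketch, assuming the hashing rule from the Redis Cluster specification (CRC16/XMODEM of the key, or of the first non-empty `{...}` tag, modulo 16384). The function names `crc16_xmodem` and `key_slot` are illustrative, not part of any client library.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0 (XMODEM variant)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Return the hash slot (0..16383) for a key, honoring hash tags."""
    raw = key.encode("utf-8")
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        # Only a non-empty tag counts; "{}" means the whole key is hashed.
        if end != -1 and end > start + 1:
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384


# Keys sharing the tag {1001} always land in the same slot.
print(key_slot("{1001}:user") == key_slot("{1001}:order"))
# Untagged keys are hashed in full, so their slots are unrelated.
print(key_slot("user:1001"), key_slot("order:1001"))
```

A check like this is useful in unit tests for key-naming helpers: it catches a missing hash tag before the application ever sees a CROSSSLOT error.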
3.2.2 Global Datastore: Cross-Region Replication

Global Datastore creates an active-passive relationship between ElastiCache clusters in different AWS Regions:

us-east-1 (Primary Cluster)
    │  Async replication over AWS backbone (~50ms typical)
    ▼
eu-west-1 (Secondary Cluster, read-only)
    │  Async replication
    ▼
ap-southeast-1 (Secondary Cluster, read-only)

Operational characteristics:

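Cost consideration: Cross-region data transfer is billed at $0.02–$0.09/GB depending on regions. At 1 GB/day of replication traffic that is only $0.02–$0.09 per day for each secondary Region, but a write-heavy cluster replicating 50 GB/day pays roughly $1–$4.50 per day per secondary. Calculate before enabling Global Datastore on high-write clusters.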
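The transfer cost is simple multiplication, and it is worth scripting so it can be rerun as write volume grows. This is a rough sketch that uses the $0.02–$0.09/GB range quoted above. The function name and the example volume of 50 GB/day with two secondary Regions are illustrative assumptions, not measured figures.

```python
def replication_cost_per_day(gb_per_day: float,
                             usd_per_gb: float,
                             secondary_regions: int = 1) -> float:
    """Daily cross-region transfer cost: each secondary receives the full stream."""
    return gb_per_day * usd_per_gb * secondary_regions


# Hypothetical workload: 50 GB/day of replicated writes, two secondary Regions.
low = replication_cost_per_day(50, 0.02, secondary_regions=2)
high = replication_cost_per_day(50, 0.09, secondary_regions=2)
print(f"${low:.2f} to ${high:.2f} per day")
print(f"${low * 30:.2f} to ${high * 30:.2f} per 30-day month")
```

The replicated volume is driven by write traffic, not dataset size, so the input should come from measured write throughput rather than from memory usage.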

3.2.3 Post-2024: Embracing Valkey

As of October 2024, AWS defaults all new ElastiCache clusters to Valkey rather than Redis. Existing Redis clusters continue to operate unchanged; AWS provides a migration path but does not force upgrades.

Functionally, ElastiCache for Valkey 7.2 and ElastiCache for Redis 7.2 are equivalent from the application perspective. The change is primarily contractual: AWS no longer needs a commercial license from Redis Inc., which may eventually reduce managed service costs.

3.2.4 Migration: Self-Managed Redis to ElastiCache

Four approaches, in order of operational complexity:

# Approach 1: redis-shake (recommended for online migration)
# Full RDB sync followed by incremental AOF replication
# Downtime window: < 1 second (cut DNS after sync catches up)
./redis-shake \
  -source=self-managed-redis:6379 \
  -target=elasticache-endpoint.cache.amazonaws.com:6379 \
  -auth_target=your-auth-token

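# Approach 2: DUMP / RESTORE (key-by-key, small datasets only)
# Export each key from source, import to target
# --raw prints the serialized payload plus a trailing newline; head -c -1 (GNU coreutils) strips it;
# -x makes redis-cli read the last argument (the payload) from stdin
redis-cli -h source --raw DUMP mykey | head -c -1 | redis-cli -h target -x RESTORE mykey 0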

# Approach 3: AWS Database Migration Service (DMS)
# GUI-driven, no command line required
# Supports Redis → ElastiCache online migration

# Approach 4: Application-level dual-write during cutover
# Write to both old and new clusters simultaneously
# Read from new cluster after warm-up period
# Zero downtime, but complex application logic

Connection limit: ElastiCache nodes have per-instance connection limits (e.g., cache.r6g.large allows up to 65,000 connections). Self-managed Redis instances often have this set to 10,000. Verify your application's connection pool settings before cutover.
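A quick budget check makes the connection-limit comparison concrete. This is a minimal sketch that assumes the worst case, where every pool in every application instance is fully open against a single node. The function names, the 80% headroom factor, and the fleet size are illustrative assumptions.

```python
def connections_per_node(app_instances: int, pool_size: int) -> int:
    """Worst case: every application instance opens its full pool to the node."""
    return app_instances * pool_size


def fits_limit(app_instances: int, pool_size: int,
               maxclients: int, headroom: float = 0.8) -> bool:
    """True if the worst-case connection count stays within a share of maxclients."""
    return connections_per_node(app_instances, pool_size) <= maxclients * headroom


# Hypothetical fleet: 200 application instances with a pool of 50 connections each.
print(connections_per_node(200, 50))            # worst-case connections
print(fits_limit(200, 50, maxclients=10_000))   # against a 10,000 limit
print(fits_limit(200, 50, maxclients=65_000))   # against a 65,000 limit
```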

3.3 AWS MemoryDB for Redis/Valkey (Deep Dive)

3.3.1 Fundamental Difference from ElastiCache

This is the most frequently misunderstood distinction in AWS's Redis offering:

Dimension | ElastiCache | MemoryDB
----------|-------------|---------
Primary use | Cache tier | Durable primary database
Persistence | Optional (AOF/RDB; restart may lose data) | Mandatory (multi-AZ transaction log)
Write ACK timing | After in-memory write on primary | After commit to all-AZ transaction log
Write latency (p50) | < 0.5ms | 2–5ms
Write latency (p99) | < 1ms | 5–10ms
Data loss risk | Possible on primary failure | Zero (by design)
Usable as primary DB | No (cache miss = rebuild) | Yes (explicit design goal)
Price (vs ElastiCache) | Baseline | ~1.5–2x higher

3.3.2 MemoryDB's Persistence Mechanism

MemoryDB does not rely on Redis's built-in AOF or RDB. Instead, it wraps Redis with an external multi-AZ transaction log implemented with a Raft-like consensus protocol:

Write path (simplified):

Client
  │
  ▼ SET order:10086 {...}
Primary Node (us-east-1a)
  │
  ├──→ Write to AZ-a transaction log segment
  ├──→ Replicate to AZ-b transaction log segment (parallel)
  ├──→ Replicate to AZ-c transaction log segment (parallel)
  │
  │ Wait for quorum (2 of 3 AZs confirmed write)
  │
  ├── Apply write to in-memory Redis data structure
  │
  └──→ Return ACK to client

Failure scenario:
  Primary node crashes after AZ-b and AZ-c wrote, before AZ-a wrote
  → New primary reads from AZ-b/AZ-c log, reconstructs state
  → Zero data loss (quorum was met before crash)
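The ordering in the write path (log to a quorum first, apply and acknowledge second) can be illustrated with a toy model. This is a sketch of the general quorum-then-apply idea only; it is not MemoryDB's implementation, and the class and function names (`AZLog`, `QuorumWriter`, `recover`) are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class AZLog:
    """One availability zone's log segment (toy model)."""
    name: str
    available: bool = True
    entries: List[Tuple[str, str]] = field(default_factory=list)

    def append(self, entry: Tuple[str, str]) -> bool:
        if not self.available:
            return False
        self.entries.append(entry)
        return True


class QuorumWriter:
    """Applies a write in memory only after a majority of logs accepted it."""

    def __init__(self, logs: List[AZLog]):
        self.logs = logs
        self.quorum = len(logs) // 2 + 1
        self.memory: Dict[str, str] = {}

    def set(self, key: str, value: str) -> bool:
        acks = sum(1 for log in self.logs if log.append((key, value)))
        if acks < self.quorum:
            # Not durable: do not apply and do not ACK.
            # (A real log would also discard the partial entry; omitted here.)
            return False
        self.memory[key] = value
        return True


def recover(logs: List[AZLog]) -> Dict[str, str]:
    """Rebuild state by replaying the longest log that is still reachable."""
    best = max((log for log in logs if log.available), key=lambda l: len(l.entries))
    state: Dict[str, str] = {}
    for key, value in best.entries:
        state[key] = value
    return state


# AZ-a is down, but AZ-b and AZ-c form a quorum, so the write is acknowledged.
logs = [AZLog("az-a", available=False), AZLog("az-b"), AZLog("az-c")]
writer = QuorumWriter(logs)
print(writer.set("order:10086", "paid"))
print(recover(logs))
```

The point of the model is the ordering: because the in-memory apply happens after the quorum check, any write the client saw acknowledged exists in at least two logs and survives the loss of the primary.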

The transaction log uses a two-tier storage model:

On primary failover, MemoryDB selects the replica with the most up-to-date transaction log position, applies any remaining log entries, and promotes it to primary. The entire process typically completes in 10–30 seconds.

3.3.3 Comparing MemoryDB to Redis with AOF

You might wonder: can't I get the same durability from Redis with appendfsync always?

appendfsync always:  every write fsyncs to disk; durable, but single-AZ
appendfsync everysec: fsync every second; up to 1 second of data loss possible
appendfsync no:      OS decides when to flush; best performance, worst durability

appendfsync always on a self-managed Redis:

MemoryDB with multi-AZ transaction log:

MemoryDB provides stronger durability guarantees than any single-node Redis configuration.

3.3.4 Practical Use Cases

Suitable for MemoryDB:

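import json
import redis

# Placeholder connection; replace host/port with your MemoryDB endpoint.
# user_id, player_id, and cost are supplied by the calling code.
r = redis.Redis(host="localhost", port=6379)

class InsufficientCreditError(Exception):
    """Raised when a debit would take the balance below zero."""

# Shopping cart: loss means customer must re-add items
r.hset(f"cart:{user_id}", mapping={
    "item:SKU-A001": json.dumps({"qty": 2, "price": 29.99}),
    "item:SKU-B007": json.dumps({"qty": 1, "price": 149.99}),
})

# Game state: loss means player progress rollback
r.zadd(f"player:{player_id}:inventory", {
    "sword:legendary": 1,
    "gold": 48250
})

# Rate limiting with financial implications (credit debits)
remaining = r.decrby(f"credit:{user_id}", cost)
if remaining < 0:
    r.incrby(f"credit:{user_id}", cost)  # rollback
    raise InsufficientCreditError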

Not suitable for MemoryDB:

  1. Pure cache (data rebuildable from database): Use ElastiCache and save 40–50% in cost.
  2. Sub-millisecond write latency required: MemoryDB's 2–5ms write latency may violate SLAs for high-frequency trading or similar use cases.
  3. Complex SQL queries: MemoryDB is still a KV store. Complex aggregations belong in a database.

3.3.5 MemoryDB vs DynamoDB

Both are AWS-managed, durable, highly available. The choice depends on data model and access pattern:

Dimension | MemoryDB | DynamoDB
----------|----------|---------
Latency (read) | < 1ms | Single-digit ms
Latency (write) | 2–5ms | 5–10ms
Data model | Redis structures (ZSet, List, Hash, etc.) | Item-based (nested documents)
Query capability | Key-based + ZSet ranges | Primary key + GSI + LSI
Auto-scaling | Manual (resize cluster) | Automatic (on-demand mode)
Serverless option | No | Yes (DynamoDB on-demand)
Max item/value size | 512 MB | 400 KB
Global tables | MemoryDB Global Clusters | DynamoDB Global Tables

Choose MemoryDB when your access pattern maps naturally to Redis data structures (leaderboards, session state, real-time counters) and you need sub-millisecond reads. Choose DynamoDB when you need auto-scaling, serverless economics, or richer query capability (GSI, sparse indexes, FilterExpression).

3.4 Alibaba Cloud Tair: The Most Extensive Redis Enhancement

3.4.1 Background

Alibaba Cloud's "ApsaraDB for Redis" was rebranded to Tair in 2021, coinciding with a major internal initiative to extend Redis's capabilities for Alibaba's specific production requirements. Tair's enhancements are implemented as Redis modules (using the Redis Module API) and loaded by default on all Tair instances. Users access them through new commands with the EX/TAIR/GIS prefix.

The problems Tair addresses are real gaps in upstream Redis that Alibaba engineers discovered at scale. Field-level TTL, multi-dimensional sorted sets, and polygon geographic queries are not niche requirements; they appear in production at most large internet companies.

3.4.2 TairString: Versioned CAS Operations

Problem: Implementing optimistic locking in Redis requires a Lua script:

-- Verbose Lua CAS (current Redis approach)
local current = redis.call('GET', KEYS[1])
if current == ARGV[1] then
    redis.call('SET', KEYS[1], ARGV[2])
    return 1
end
return 0

This works but requires Lua script registration, awkward error handling, and cannot be used across clients without coordination on the script SHA.

TairString commands:

# Set value with explicit version
EXSET inventory:SKU-A001 "500" VER 1

# Atomic compare-and-set (version-based)
EXCAS inventory:SKU-A001 "498" 1
# Returns OK if current version == 1 (sets new value, increments version to 2)
# Returns ERR CAS conflict if version mismatch

# Get value and current version atomically
EXGET inventory:SKU-A001
# 1) "498"
# 2) (integer) 2

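import redis

class MaxRetriesExceeded(Exception):
    """Raised when every attempt ends in a version conflict."""

# Application-level optimistic locking pattern.
# Assumes r is a redis-py client created with decode_responses=True,
# so a status reply compares equal to the str 'OK'.
def update_inventory(sku_id: str, delta: int, max_retries: int = 3) -> int:
    key = f'inventory:{sku_id}'
    for attempt in range(max_retries):
        result = r.execute_command('EXGET', key)
        current_val = int(result[0])
        current_ver = int(result[1])
        new_val = current_val + delta

        try:
            status = r.execute_command('EXCAS', key, str(new_val), current_ver)
        except redis.exceptions.ResponseError:
            continue  # error reply (CAS conflict as described above): re-read and retry
        if status == 'OK':
            return new_val
        # Any other reply is treated as a version conflict: retry
    raise MaxRetriesExceeded(key)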
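The version semantics described in this section can be modeled in memory, which is handy for unit-testing retry logic without a Tair instance. This is a sketch only: `VersionedStore` is a hypothetical stand-in that mirrors the behavior stated above (set with a version, compare-and-set bumps the version by one), not Tair's implementation or reply format.

```python
from typing import Dict, Tuple


class VersionedStore:
    """In-memory model of a value + version pair with compare-and-set."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, int]] = {}

    def exset(self, key: str, value: str, version: int = 1) -> None:
        self._data[key] = (value, version)

    def exget(self, key: str) -> Tuple[str, int]:
        return self._data[key]

    def excas(self, key: str, new_value: str, expected_version: int) -> bool:
        """Write only if the stored version matches; bump the version on success."""
        _, version = self._data[key]
        if version != expected_version:
            return False
        self._data[key] = (new_value, version + 1)
        return True


store = VersionedStore()
store.exset("inventory:SKU-A001", "500", version=1)
print(store.excas("inventory:SKU-A001", "498", expected_version=1))  # first writer wins
print(store.exget("inventory:SKU-A001"))
print(store.excas("inventory:SKU-A001", "497", expected_version=1))  # stale version loses
```

The second `excas` call models a concurrent writer that read version 1 before the first update landed; its write is rejected and it must re-read before retrying.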

3.4.3 TairHash: Per-Field TTL

Problem: Redis Hash supports TTL only at the key level. There is no way to expire individual fields. Common workarounds:

# Workaround A: separate String keys per field
SET user:1001:name "Alice"
SET user:1001:session "tok-xyz" EX 300   # session expires in 5min
# Problem: HGETALL equivalent requires N round trips

# Workaround B: store expiry metadata as a field
HSET user:1001 session "tok-xyz"
HSET user:1001 session_exp "1703001234"  # manual expiry check in application
# Problem: requires application-side expiry logic on every read

TairHash solves this natively:

# Set fields with individual TTLs
EXHSET user:1001 name "Alice"
EXHSET user:1001 session "tok-xyz" EX 300    # expires in 5 minutes
EXHSET user:1001 avatar_url "https://..." EX 86400  # expires in 1 day

# Get a specific field (returns nil if expired)
EXHGET user:1001 session   # → "tok-xyz" (if within 5 minutes)

# Get all non-expired fields (atomic)
EXHGETALL user:1001
# Returns only non-expired fields:
# 1) "name"
# 2) "Alice"
# (session and avatar_url excluded if expired)

# Modify a field's TTL without changing its value
EXHEXPIRE user:1001 session 600    # extend session TTL to 10 minutes

# Check remaining TTL for a field
EXHTTL user:1001 session    # → 547 (seconds remaining)

# Check field existence (respects TTL)
EXHEXISTS user:1001 session  # → 1 (exists and not expired)

Implementation details: TairHash stores a per-field expiry timestamp (Unix epoch in milliseconds) alongside each value. On every EXHGET or EXHGETALL, it checks the timestamp against current time and returns nil for expired fields. A background thread performs active expiry (similar to Redis's active key expiry), scanning for and deleting expired fields to reclaim memory.
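The two expiry paths described above (a timestamp check on read, plus a background sweep) can be sketched in a few lines. This is an illustrative in-memory model under those stated assumptions, not TairHash's code. `FieldTTLHash` and its method names are hypothetical, and the clock is injectable so the behavior can be tested without sleeping.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class FieldTTLHash:
    """Hash whose fields each carry an optional expiry timestamp (ms since epoch)."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._fields: Dict[str, Tuple[str, Optional[int]]] = {}

    def _now_ms(self) -> int:
        return int(self._clock() * 1000)

    def _expired(self, expire_at: Optional[int]) -> bool:
        return expire_at is not None and expire_at <= self._now_ms()

    def hset(self, name: str, value: str, ex: Optional[int] = None) -> None:
        """Store a field; ex is a TTL in seconds, None means no expiry."""
        expire_at = self._now_ms() + ex * 1000 if ex is not None else None
        self._fields[name] = (value, expire_at)

    def hget(self, name: str) -> Optional[str]:
        item = self._fields.get(name)
        if item is None:
            return None
        value, expire_at = item
        if self._expired(expire_at):
            del self._fields[name]  # lazy expiry on read
            return None
        return value

    def hgetall(self) -> Dict[str, str]:
        return {f: v for f, (v, exp) in self._fields.items() if not self._expired(exp)}

    def active_expire(self) -> int:
        """Sweep that a background task would run; returns the number of fields removed."""
        dead = [f for f, (_, exp) in self._fields.items() if self._expired(exp)]
        for f in dead:
            del self._fields[f]
        return len(dead)


# Fake clock so the example is deterministic.
now = [1_700_000_000.0]
h = FieldTTLHash(clock=lambda: now[0])
h.hset("name", "Alice")
h.hset("session", "tok-xyz", ex=300)
now[0] += 301                      # advance past the session TTL
print(h.hgetall())                 # only the non-expired field remains visible
print(h.active_expire())           # the sweep reclaims the expired field
```

The read-time check guarantees correctness; the sweep exists only to reclaim memory for fields that are never read again.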

Production use cases:

3.4.4 TairZset: Multi-Dimensional Scoring

Problem: Redis ZSet supports only a single scalar score. Implementing a leaderboard where ties are broken by time-of-achievement requires either:

TairZset supports tuples of scores (score1, score2, ..., scoreN):

# Add members with 2-dimensional scores: (game_score, timestamp)
EXZADD leaderboard 9850 1703000000 "player:1001"
EXZADD leaderboard 9850 1703000100 "player:2007"   # Same score, 100s later
EXZADD leaderboard 12400 1703001000 "player:3015"

# Rank by score1 descending, then score2 ascending (higher score first, earlier timestamp first for ties)
EXZREVRANGEBYSCORE leaderboard +inf -inf EXWITHSCORES
# 1) "player:3015"  12400  1703001000
# 2) "player:1001"  9850   1703000000  ← arrived first
# 3) "player:2007"  9850   1703000100  ← arrived second

# Range by score1 only (returns all records with score1 in range)
EXZRANGEBYSCORE leaderboard 8000 10000 EXWITHSCORES

# Get rank of a specific member
EXZREVRANK leaderboard "player:1001"  # → 1 (0-indexed, second place)
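The intended ordering (higher game score first, earlier timestamp first among ties) is a two-key sort. The sketch below reproduces it in plain Python with the same three players, as a reference for what a leaderboard query is expected to return. The helper names `ranked` and `rank_of` are illustrative, and nothing here calls Tair.

```python
from typing import List, Tuple

# (member, game_score, achieved_at) taken from the example above.
entries: List[Tuple[str, int, int]] = [
    ("player:1001", 9850, 1703000000),
    ("player:2007", 9850, 1703000100),
    ("player:3015", 12400, 1703001000),
]


def ranked(rows: List[Tuple[str, int, int]]) -> List[str]:
    """Sort by score descending, then by timestamp ascending to break ties."""
    return [member for member, _, _ in sorted(rows, key=lambda r: (-r[1], r[2]))]


def rank_of(rows: List[Tuple[str, int, int]], member: str) -> int:
    """0-indexed position of a member in the ranked list."""
    return ranked(rows).index(member)


print(ranked(entries))
print(rank_of(entries, "player:1001"))
```

Note that the tie-break dimension sorts in the opposite direction from the primary score; any encoding of this leaderboard has to preserve that asymmetry.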

Application examples:

3.4.5 TairDoc: Native JSON Path Queries

# Store a JSON document
EXJSONSET user:1001 $ '{"name":"Alice","prefs":{"theme":"dark","lang":"zh"},"scores":[95,87,92]}'

# Path queries (JSONPath syntax)
EXJSONGET user:1001 $.name
# → ["Alice"]

EXJSONGET user:1001 $.prefs.theme
# → ["dark"]

# Update a nested field without fetching the full document
EXJSONSET user:1001 $.prefs.theme '"light"'

# Append to an array
EXJSONARRAYAPPEND user:1001 $.scores 88
# scores is now [95, 87, 92, 88]

# Get array length
EXJSONARRLEN user:1001 $.scores
# → 4

# Increment a numeric value
EXJSONNUMINCRBY user:1001 $.scores[0] 5
# scores[0] is now 100

The advantage over the client-side approach (GET + JSON.parse + modify + SET) is atomicity and network efficiency: partial updates avoid the full round-trip of a large document.

3.4.6 TairBloom: Scalable Bloom Filter

Problem with standard Bloom Filters: Fixed capacity at creation. When the filter reaches its element count limit, you must rebuild it with a larger capacity, which requires a maintenance window and temporarily raises the false-positive rate.

TairBloom implements a Scalable Bloom Filter (SBF), a cascade of fixed-size BFs that grows automatically:

# Create with target false positive rate (no fixed capacity)
TFBFRESERVE crawler:visited 0.001    # 0.1% FPR target

# Add URLs
TFBFADD crawler:visited "https://example.com/page1"
TFBFMADD crawler:visited "https://example.com/page2" "https://example.com/page3"

# Query
TFBFEXISTS crawler:visited "https://example.com/page1"  # → 1 (exists)
TFBFEXISTS crawler:visited "https://example.com/new"    # → 0 (definitely not exists)

# Batch query
TFBFMEXISTS crawler:visited "https://a.com" "https://b.com" "https://c.com"
# → 1 0 1 (per-URL results)

Internal structure:

TairBloom (capacity auto-grows):
┌──────────────────────────────────────────────┐
│ Sub-filter 0: capacity 10,000  FPR 0.001     │ ← fills first
├──────────────────────────────────────────────┤
│ Sub-filter 1: capacity 20,000  FPR 0.0005    │ ← created when sub-0 is full
├──────────────────────────────────────────────┤
│ Sub-filter 2: capacity 40,000  FPR 0.00025   │ ← created when sub-1 is full
└──────────────────────────────────────────────┘

Query: check all sub-filters; return 1 if any returns 1
Total FPR: ≤ sum of sub-filter FPRs ≈ 2× original FPR (geometric series)
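The cascade above is straightforward to prototype. The following is a minimal sketch of a scalable Bloom filter, assuming the growth policy shown in the diagram (capacity doubles and the per-filter error rate halves with each new sub-filter). It uses standard Bloom sizing formulas and SHA-256 double hashing as a stand-in; the hash function, class names, and parameters are assumptions, not TairBloom internals.

```python
import hashlib
import math
from typing import List


class BloomFilter:
    """Fixed-capacity Bloom filter sized for a target false-positive rate."""

    def __init__(self, capacity: int, error_rate: float) -> None:
        self.capacity = capacity
        self.error_rate = error_rate
        self.count = 0
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str) -> List[int]:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step for double hashing
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)
        self.count += 1

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


class ScalableBloomFilter:
    """Cascade of Bloom filters; a new, larger, stricter one is added when the last fills."""

    def __init__(self, error_rate: float = 0.001, initial_capacity: int = 10_000) -> None:
        self.filters = [BloomFilter(initial_capacity, error_rate)]

    def add(self, item: str) -> bool:
        if item in self:
            return False  # already present (or a false positive)
        current = self.filters[-1]
        if current.count >= current.capacity:
            current = BloomFilter(current.capacity * 2, current.error_rate * 0.5)
            self.filters.append(current)
        current.add(item)
        return True

    def __contains__(self, item: str) -> bool:
        return any(item in f for f in self.filters)


sbf = ScalableBloomFilter(error_rate=0.001, initial_capacity=100)
for i in range(350):
    sbf.add(f"https://example.com/page{i}")
print([f.capacity for f in sbf.filters])
print("https://example.com/page7" in sbf)
```

Halving the error rate at each stage is what keeps the total bounded: the per-filter rates form the series p + p/2 + p/4 + ..., which sums to less than 2p no matter how many sub-filters are created.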

3.4.7 TairGIS: Polygon Geospatial Queries

Redis's native GEOSEARCH supports only circular range queries. TairGIS adds polygon and multi-geometry support:

# Add point locations (GeoJSON format)
GISADD delivery:stores \
  '{"type":"Feature","geometry":{"type":"Point","coordinates":[116.397,39.907]},"properties":{"id":"store-001"}}'

# Add delivery zone as polygon
GISADD delivery:zones \
  '{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[116.39,39.90],[116.41,39.90],[116.41,39.92],[116.39,39.92],[116.39,39.90]]]},"properties":{"name":"zone-A"}}'

# Is a customer's location within any delivery zone?
GISCONTAINS delivery:zones \
  '{"type":"Point","coordinates":[116.395,39.910]}'
# → ["zone-A"]

# Which stores fall within a specific zone?
GISWITHIN delivery:zones delivery:stores
# → ["store-001", "store-007"]

# Do two zones overlap?
GISINTERSECTS delivery:zones \
  '{"type":"Polygon","coordinates":[[[116.40,39.91],[116.42,39.91],[116.42,39.93],[116.40,39.93],[116.40,39.91]]]}'
# → ["zone-A"] (if they intersect)

Use cases: Food delivery platform service area validation, ride-hailing service zones, geofence marketing (trigger push notification when user enters/exits a polygon region).
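For intuition about what a containment query computes, here is a minimal point-in-polygon sketch using the classic ray-casting test on the zone-A coordinates from the example. It treats longitude and latitude as planar x/y, which is a simplifying assumption that is reasonable only for small areas, and it says nothing about how TairGIS is implemented or how it treats points exactly on a boundary.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude), GeoJSON order


def point_in_polygon(point: Point, ring: List[Point]) -> bool:
    """Ray casting: count how many polygon edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Edge (j -> i) straddles the horizontal line through the point.
        if (yi > y) != (yj > y):
            x_cross = (xj - xi) * (y - yi) / (yj - yi) + xi
            if x < x_cross:
                inside = not inside
        j = i
    return inside


# Outer ring of zone-A from the GISADD example (closed: first point repeated last).
zone_a = [(116.39, 39.90), (116.41, 39.90), (116.41, 39.92), (116.39, 39.92), (116.39, 39.90)]

print(point_in_polygon((116.395, 39.910), zone_a))  # customer location from the example
print(point_in_polygon((116.420, 39.910), zone_a))  # a point east of the zone
```

A client-side check like this is useful for testing fixtures; at production scale the server-side query matters because it can use a spatial index instead of testing every polygon.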

3.5 Other Cloud Providers

3.5.1 Azure Cache for Redis: Enterprise Tier

Azure's Enterprise tier (E tier) uses Redis Enterprise (not open-source Redis) as the underlying engine:

Feature | Standard Tier | Enterprise (E) Tier
--------|---------------|--------------------
Engine | Open-source Redis | Redis Enterprise
RediSearch (full-text) | No | Yes
RedisJSON | No | Yes
RedisTimeSeries | No | Yes
Active geo-replication | No | Yes (CRDT-based)
Active-Active clusters | No | Yes
Price | Baseline | ~3–5x standard

The Enterprise tier is worth evaluating if you need vector similarity search (RedisVL integration), full-text search, or active-active multi-region writes.

3.5.2 Google Cloud Memorystore for Redis/Valkey

Google Memorystore added Valkey support in 2024, alongside the existing Redis offering. Key characteristics:

3.5.3 Upstash Serverless Redis

Upstash is a per-request-billed Redis service:

Pricing model:
- $0.20 per 100,000 commands
- No fixed monthly charge
- Data persisted to disk (not purely in-memory like standard Redis)
- Read replicas available for multi-region reads
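Per-request pricing is easiest to reason about as a break-even against a fixed-price node. The sketch below uses the $0.20 per 100,000 commands figure quoted above. The function names, the 30-day month, and the $50/month comparison price are illustrative assumptions.

```python
def monthly_cost(commands_per_day: float, usd_per_100k: float = 0.20, days: int = 30) -> float:
    """Monthly bill under per-command pricing."""
    return commands_per_day * days / 100_000 * usd_per_100k


def break_even_commands_per_day(fixed_monthly_usd: float,
                                usd_per_100k: float = 0.20,
                                days: int = 30) -> float:
    """Daily command volume at which per-command pricing equals a fixed monthly price."""
    return fixed_monthly_usd / usd_per_100k * 100_000 / days


# Hypothetical: 1 million commands per day, compared with a $50/month fixed-price node.
print(f"${monthly_cost(1_000_000):.2f} per month")
print(f"{break_even_commands_per_day(50):,.0f} commands/day to match $50/month")
```

Below the break-even volume the per-request model is cheaper; above it, a provisioned node wins on price before latency is even considered.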

Cold start latency: Upstash instances are not always hot in memory. First access after an idle period may take 10–50ms. This makes it unsuitable for latency-sensitive production workloads, but excellent for:

3.6 The Full Timeline: Redis License Change Events

2024-03-20  Redis Inc. announces RSALv2 + SSPLv1 for Redis 7.4+
            (Redis 7.2 and below remain BSD-3)
2024-03-21  Linux Foundation announces Valkey project
            AWS, Google, Oracle, Alibaba Cloud join as founding members
2024-03-22  Ericsson, Snap, and 15+ other companies announce Valkey support
2024-03-27  Oracle, Verizon, additional enterprise members join
2024-04-02  Valkey GitHub repository goes live (valkey-io/valkey)
2024-04-16  Valkey 7.2.5 released (first official version, BSD-3 license)
2024-05-22  Valkey 7.2.6 released (bug fixes)
2024-06-11  Valkey 8.0.0-rc1 released (I/O threading improvements)
2024-08-xx  Valkey 8.0.0-rc2, rc3: stability improvements
2024-10-01  AWS announces ElastiCache default engine is now Valkey
2024-Q4     Alibaba Cloud Tair adds Valkey protocol compatibility
2024-Q4     Google Memorystore for Valkey reaches GA

What this means for existing Redis users:

3.7 Managed Service Comparison Table

Dimension | AWS ElastiCache | AWS MemoryDB | Alibaba Cloud Tair | Tencent Cloud Redis
----------|-----------------|--------------|--------------------|--------------------
Max capacity | 340 GB/shard | 427 GB/shard | 4 TB (cluster) | 4 TB (cluster)
Write latency (p50) | < 0.5ms | 2–5ms | < 0.5ms | < 0.5ms
Write latency (p99) | < 1ms | 5–10ms | < 1ms | < 1ms
Durability | Optional (cache semantics) | Multi-AZ transaction log | Optional | Optional
Extended data types | None | None | TairHash, TairZset, TairGIS, TairBloom | None
Price (128 GB node) | ~$500/mo | ~$900/mo | ~$400/mo | ~$350/mo
SLA | 99.99% | 99.99% | 99.99% | 99.99%
Cross-region | Global Datastore | Global Clusters | Cross-region sync | Global region replication
Best for | General caching | Durable primary store | Specialized data ops | General caching

3.8 Summary

The cloud-managed Redis market has fragmented meaningfully since the 2024 license event:

Chapter 4 descends into Redis internals, starting with the redisObject structure that underpins every value stored in the system.
