Data Modeling Guide
Query-First Design Principle
In Cassandra, design tables around your queries, not around entity relationships. One query = one table.
-- Queries to support:
-- Q1: Get all orders for a user
-- Q2: Get orders by status (across users)
-- Q3: Get order by ID
-- Q1: orders_by_user
CREATE TABLE orders_by_user (
user_id UUID,
order_id TIMEUUID,
status TEXT,
total DECIMAL,
PRIMARY KEY (user_id, order_id)
) WITH CLUSTERING ORDER BY (order_id DESC);
-- Q2: orders_by_status (denormalized copy)
CREATE TABLE orders_by_status (
status TEXT,
order_id TIMEUUID,
user_id UUID,
total DECIMAL,
PRIMARY KEY (status, order_id)
) WITH CLUSTERING ORDER BY (order_id DESC);
-- Q3: orders_by_id
CREATE TABLE orders_by_id (
order_id TIMEUUID PRIMARY KEY,
user_id UUID,
status TEXT,
total DECIMAL
);
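Each table then answers its query with a single-partition read. A quick sketch (the 'PENDING' value is illustrative):
-- Q1: all orders for a user, newest first
SELECT order_id, status, total FROM orders_by_user WHERE user_id = ?;
-- Q2: all orders with a given status
SELECT order_id, user_id, total FROM orders_by_status WHERE status = 'PENDING';
-- Q3: one order by ID
SELECT user_id, status, total FROM orders_by_id WHERE order_id = ?;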
Partition Strategy
Good partition keys distribute data evenly; bad keys cause hot spots. Keep partitions under 100 MB, with under 10 MB as the ideal.
-- Anti-pattern: date as partition key (hot partition on today)
CREATE TABLE logs_bad (
date DATE,
ts TIMEUUID,
message TEXT,
PRIMARY KEY (date, ts)
);
-- All of today's logs land in ONE partition (on a single replica set) = hot spot!
-- Better: bucket by hour to spread load
CREATE TABLE logs_by_hour (
sensor_id TEXT,
hour TEXT, -- '2024-01-15-10' (YYYY-MM-DD-HH)
ts TIMEUUID,
message TEXT,
PRIMARY KEY ((sensor_id, hour), ts)
);
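A read then targets exactly one partition. A sketch, assuming a sensor named 'web-01':
-- Read one hour of logs for one sensor (single-partition read)
SELECT ts, message FROM logs_by_hour
WHERE sensor_id = 'web-01' AND hour = '2024-01-15-10';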
-- Or use a random bucket number to spread further
CREATE TABLE logs_bucketed (
bucket INT, -- random 0-9
sensor_id TEXT,
ts TIMEUUID,
message TEXT,
PRIMARY KEY ((bucket, sensor_id), ts)
);
-- Write: pick bucket = random 0-9
-- Read: query all 10 buckets (fan-out read)
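A sketch of the fan-out read, assuming a sensor id 'temp-001': the application issues one single-partition query per bucket (ideally in parallel) and merges the results client-side.
-- One query per bucket; shown for bucket 0, repeat for 1-9
SELECT ts, message FROM logs_bucketed
WHERE bucket = 0 AND sensor_id = 'temp-001'
AND ts > minTimeuuid('2024-01-15 00:00:00');
-- Merge the 10 result sets by ts on the client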
Denormalization Patterns
-- Pattern: write to multiple tables for different query patterns
-- Application logic handles consistency
-- Write function (pseudocode)
async function createOrder(order) {
  // If any single insert fails, the tables diverge; retry on
  // failure, or use a logged batch (below) for atomicity.
  await Promise.all([
    db.execute(
      "INSERT INTO orders_by_user (user_id, order_id, status, total) VALUES (?,?,?,?)",
      [order.userId, order.id, order.status, order.total]
    ),
    db.execute(
      "INSERT INTO orders_by_status (status, order_id, user_id, total) VALUES (?,?,?,?)",
      [order.status, order.id, order.userId, order.total]
    ),
    db.execute(
      "INSERT INTO orders_by_id (order_id, user_id, status, total) VALUES (?,?,?,?)",
      [order.id, order.userId, order.status, order.total]
    )
  ]);
}
-- Use a logged BATCH for cross-partition atomicity: all statements eventually
-- apply or none do (at a coordinator cost). Unlogged batches skip that guarantee
-- and should only group writes to the same partition.
BEGIN BATCH
INSERT INTO orders_by_user (user_id, order_id, status) VALUES (?, ?, ?);
INSERT INTO orders_by_id (order_id, user_id, status) VALUES (?, ?, ?);
APPLY BATCH;
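One wrinkle worth noting: a status change must touch all three tables, and because status is the partition key of orders_by_status, the row moves partitions there, which takes a DELETE plus an INSERT. A sketch (status values are illustrative):
BEGIN BATCH
UPDATE orders_by_user SET status = 'SHIPPED' WHERE user_id = ? AND order_id = ?;
UPDATE orders_by_id SET status = 'SHIPPED' WHERE order_id = ?;
DELETE FROM orders_by_status WHERE status = 'PENDING' AND order_id = ?;
INSERT INTO orders_by_status (status, order_id, user_id, total) VALUES ('SHIPPED', ?, ?, ?);
APPLY BATCH;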
Time Series Pattern
-- Time series with wide partitions (efficient for range reads)
CREATE TABLE sensor_readings (
sensor_id TEXT,
bucket TEXT, -- e.g., '2024-01-15' groups daily data
ts TIMESTAMP,
value DOUBLE,
PRIMARY KEY ((sensor_id, bucket), ts)
) WITH CLUSTERING ORDER BY (ts ASC)
AND compaction = {
'class': 'TimeWindowCompactionStrategy',
'compaction_window_unit': 'DAYS',
'compaction_window_size': 1
};
-- Read one day of readings for a sensor (one bucket = one partition)
SELECT ts, value FROM sensor_readings
WHERE sensor_id = 'temp-001' AND bucket = '2024-01-15'
AND ts >= '2024-01-15 00:00:00' AND ts < '2024-01-16 00:00:00';
-- TimeWindowCompactionStrategy is ideal for time series: it groups SSTables
-- by time window, so fully expired (TTL'd) SSTables can be dropped whole.
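TWCS pairs naturally with the TTL golden rule below. One way to set it, assuming a 30-day retention requirement:
-- Table-level default TTL (30 days, in seconds)
ALTER TABLE sensor_readings WITH default_time_to_live = 2592000;
-- Or per write:
INSERT INTO sensor_readings (sensor_id, bucket, ts, value)
VALUES ('temp-001', '2024-01-15', '2024-01-15 10:00:00', 21.5)
USING TTL 2592000;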
Data Modeling Golden Rules
| Rule | Detail |
|---|---|
| Design for your queries | List all queries first, then design tables |
| Avoid ALLOW FILTERING | It forces a full table scan across every node |
| No JOINs | Denormalize instead; write to multiple tables |
| Avoid hot partitions | Add bucket/shard to high-traffic partition keys |
| Max partition size | Keep under 100 MB; ideal <10 MB |
| Use TTL | Set TTL for time-bounded data (logs, sessions) |
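To make the first two rules concrete, here is a query that would need ALLOW FILTERING next to the query-first alternative already defined above (the 'PENDING' value is illustrative):
-- Anti-pattern: filtering on a non-key column scans the whole table
SELECT * FROM orders_by_user WHERE status = 'PENDING' ALLOW FILTERING;
-- Query-first alternative: the purpose-built table answers it directly
SELECT * FROM orders_by_status WHERE status = 'PENDING';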