Chapter 1

Kafka Is Not Just a Message Queue

The LinkedIn Problem That Birthed Kafka

In 2010, LinkedIn's engineering team was drowning in data. Hundreds of millions of user behavior events — page views, connection requests, job clicks, search queries — needed to flow in real time into offline analytics systems, recommendation engines, A/B testing frameworks, and monitoring dashboards. They tried ActiveMQ. They built custom pipelines. Every solution collapsed under the pressure of millions of messages per second.

Jay Kreps, Neha Narkhede, and Jun Rao sat down to rethink the problem from first principles. They weren't asking "how do we move a message from A to B?" — that question had been answered a dozen times. They were asking something fundamentally different: how do we build a unified data bus where any number of consumers, at any point in time, can read the same stream of events at their own pace, and the data is still there after they've read it?

Traditional message queues couldn't answer that question. The answer they built was Kafka, open-sourced in 2011 and donated to the Apache Software Foundation in 2012.

Why Traditional Message Queues Fall Short

The Push Model and Its Constraints

RabbitMQ, ActiveMQ, and their kin are built on the push model: the broker actively delivers messages to consumers. This creates a tight coupling between broker and consumer that has deep architectural consequences.

The broker must maintain per-consumer state: which messages have been delivered, which have been acknowledged, how many are in-flight. It tracks consumer connection health, applies backpressure when consumers slow down, and decides when it's safe to delete a message. The broker is a stateful, opinionated intermediary.

This design works well for task queues where you want exactly-once delivery of work items to exactly one worker. But it breaks down when you need:

Multiple independent consumers reading the same data — you must duplicate the message into multiple queues, multiplying storage.
Replay — once a consumer acknowledges a message, it's gone. If your ETL job had a bug and you need to reprocess last week's events, you're out of luck.
Consumers slower than producers — the broker accumulates unacknowledged messages in memory, eventually pushing back on producers.

Broker-Side Queue: The Root of Limited Replay

The broker-side queue is the core data structure of traditional MQ systems. Messages exist in the queue until acknowledged. This lifecycle, driven by consumer acknowledgment rather than time or size policy, makes replay structurally impossible. The moment a message is ACK'd, it is deleted. There is no concept of "read position" — only "delivered or not delivered."

For LinkedIn's use case — feeding the same activity stream into Hadoop for batch analytics, into a recommendation system for real-time scoring, and into a monitoring system for alerting — this model was a non-starter. Three consumers, three queues, three copies of every event, and still no replay capability.

Kafka's Core Design: Pull Model + Append-Only Log

The Append-Only Log

Kafka doesn't use queues. It uses logs. Each Topic is divided into Partitions, and each Partition is a sequence of append-only log segment files on disk. Producers write messages to the end of the log, and each message receives a monotonically increasing offset — its permanent position in the log.

Partition 0 physical files:
00000000000000000000.log        (segment data)
00000000000000000000.index      (sparse offset index)
00000000000000000000.timeindex  (timestamp index)

Conceptual layout:
offset=0: {"user": "alice", "action": "view", "item": "12345"}
offset=1: {"user": "bob",   "action": "click", "item": "67890"}
offset=2: {"user": "alice", "action": "purchase", "item": "12345"}
offset=3: ...

The append-only nature is not just a design choice — it's a performance foundation. Sequential disk writes on a spinning hard drive achieve 500–600 MB/s. Random writes achieve perhaps 0.4 MB/s. For SSDs the gap is smaller but still significant for high-throughput workloads. Kafka exploits this by writing only sequentially, and by relying entirely on the OS page cache rather than maintaining an application-level cache. The write path is brutally simple: a write() syscall deposits data into the page cache; the OS flushes to disk asynchronously. Zero memory copies, zero data structure overhead.

The Pull Model

Kafka consumers issue Fetch Requests to brokers, asking for messages starting at a specific offset. The consumer decides how many bytes to fetch, how long to wait if no messages are available (fetch.min.bytes, fetch.max.wait.ms), and when to commit its offset. The broker knows nothing about consumer state.

Consumer offsets are stored in a special internal Kafka topic: __consumer_offsets. This means:

Broker is stateless with respect to consumer progress — it stores bytes, not delivery state.
Consumers are independently scalable — adding a new consumer group does not increase broker load (beyond the additional fetch traffic).
Replay is trivial — reset the consumer's offset to any historical position and re-consume. The broker doesn't care; it just serves bytes at the requested offset.

// Resetting a consumer to the beginning — broker doesn't need to know
consumer.seekToBeginning(consumer.assignment());

// Resetting to a specific timestamp
Map<TopicPartition, Long> timestampsToSearch = new HashMap<>();
timestampsToSearch.put(partition, targetTimestampMs);
Map<TopicPartition, OffsetAndTimestamp> offsets =
    consumer.offsetsForTimes(timestampsToSearch);
consumer.seek(partition, offsets.get(partition).offset());

Infinite Replay

Because messages are never deleted upon consumption, Kafka's retention is governed purely by time (log.retention.hours) or size (log.retention.bytes) policies. Within the retention window, any consumer — even one that has never existed before — can read from offset 0. Multiple consumer groups can read the same partition simultaneously at different positions, with no coordination and no interference.

This property transforms Kafka's operational model. When a Flink job discovers a bug in its aggregation logic, the fix is to reset the consumer group offset to the point before the error began and rerun. No data backfill, no coordination with producers, no special tooling — just an offset reset.

Kafka's Three Roles

Role 1: Message Queue (Async Decoupling)

For traditional async messaging use cases — order services decoupled from inventory, notification, and analytics — Kafka works as a high-throughput message queue. The Consumer Group abstraction provides competing-consumer semantics: multiple consumers in the same group share partitions, each message processed by exactly one group member.

Where Kafka differs from RabbitMQ in this role: it trades lower latency for higher throughput and replay capability. RabbitMQ can achieve sub-millisecond delivery; Kafka's p99 typically runs 5–20ms due to batching and the overhead of its replication protocol. For most async workflows, this is irrelevant. For latency-critical RPC patterns, it matters.

Role 2: Distributed Storage System

Configure Kafka with log.retention.ms=-1 (infinite retention) and you have a distributed, fault-tolerant, append-only storage system organized by time. This enables:

Event Sourcing: A Kafka Compacted Topic stores the complete change history of an entity. The compaction policy retains only the latest value per key, so the topic holds the current state of every entity. New services can reconstruct the full current state by replaying from offset 0.

Change Data Capture (CDC): Tools like Debezium tap database write-ahead logs and publish every row change to Kafka. Downstream systems subscribe to get a real-time, ordered, replayable replica of the database — without touching the primary database.

Kafka Streams State Stores: Kafka Streams (covered in Chapter 10) stores stream processing intermediate state in changelog topics, using Kafka itself as the durable state backend. This eliminates the external state management problem that plagues other stream processing frameworks.

Role 3: Stream Processing Engine

Kafka Streams and ksqlDB allow Kafka to execute stream processing logic natively, without a separate Flink or Spark cluster.

// Kafka Streams: count clicks per product per minute
StreamsBuilder builder = new StreamsBuilder();
KStream<String, ClickEvent> clicks = builder.stream("product-clicks",
    Consumed.with(Serdes.String(), clickSerde));

KTable<Windowed<String>, Long> clickCounts = clicks
    .groupByKey()
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
    .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("click-counts-store")
        .withKeySerde(Serdes.String())
        .withValueSerde(Serdes.Long()));

clickCounts
    .toStream()
    .map((windowedKey, count) -> new KeyValue<>(
        windowedKey.key(), new ClickCountEvent(windowedKey.window(), count)))
    .to("click-counts-output", Produced.with(Serdes.String(), clickCountSerde));

KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();

The same Kafka cluster stores the input stream, the output stream, and the intermediate state. This "storage as computation substrate" architecture eliminates the Lambda Architecture anti-pattern, where batch and streaming pipelines implement the same logic twice.

Deep Comparison: Four Competing Systems

RabbitMQ

Protocol: AMQP 0-9-1 (also MQTT, STOMP via plugins). A proper standard protocol with formal semantics for acknowledgment, transactions, and flow control.

Core Model: Exchange → Binding → Queue. Four exchange types — Direct, Fanout, Topic, Headers — give RabbitMQ routing flexibility that Kafka simply doesn't have. A single message can be routed to multiple queues based on routing keys or header attributes, with zero application-layer logic required.

Latency: Sub-millisecond end-to-end under normal load. The push model and in-memory queue deliver messages as fast as the network allows.

Throughput: 20K–50K msg/s per node without aggressive tuning. Kafka can do 500K–1M+ msg/s on the same hardware.

Persistence: Optional. Messages are in-memory by default; durable=true on the queue plus delivery_mode=2 on the message enables disk persistence with fsync, at significant throughput cost.

Replay: Not supported. ACK = delete.

Where it wins: Task queues (background jobs, email delivery), complex routing (content-based routing to different processing chains), request-reply patterns (RPC over MQ), systems where message volume is moderate and routing complexity is high.

RocketMQ

Origin: Developed by Alibaba for Double Eleven (Singles' Day) peak traffic — the world's largest e-commerce event. Production-proven at tens of billions of messages per day. Open-sourced in 2012, Apache Top-Level Project since 2017.

Unique capabilities:

Transaction Messages: RocketMQ's killer feature. A producer sends a "half message" (invisible to consumers), executes a local database transaction, then either commits or rolls back the half message based on the transaction outcome. A background thread periodically queries the producer for the transaction status if no commit/rollback is received within a timeout. This provides distributed transaction semantics without XA or two-phase commit overhead.

// RocketMQ transaction message
TransactionMQProducer producer = new TransactionMQProducer("order-group");
producer.setTransactionListener(new TransactionListener() {
    @Override
    public LocalTransactionState executeLocalTransaction(Message msg, Object arg) {
        try {
            orderDatabase.insertOrder((Order) arg);
            return LocalTransactionState.COMMIT_MESSAGE;
        } catch (Exception e) {
            return LocalTransactionState.ROLLBACK_MESSAGE;
        }
    }

    @Override
    public LocalTransactionState checkLocalTransaction(MessageExt msg) {
        // Called if commit/rollback was not received in time
        return orderDatabase.orderExists(msg.getKeys())
            ? LocalTransactionState.COMMIT_MESSAGE
            : LocalTransactionState.ROLLBACK_MESSAGE;
    }
});

TransactionSendResult result = producer.sendMessageInTransaction(msg, order);

Delay Queues: 18 fixed delay levels (1s, 5s, 10s, 30s, 1m, 2m, 3m, 4m, 5m, 6m, 7m, 8m, 9m, 10m, 20m, 30m, 1h, 2h). An order timeout cancellation job becomes a single setDelayTimeLevel(3) call.

Consumer Retry and Dead Letter: Failed messages are automatically retried with exponential backoff. After exhausting retries, they land in a Dead Letter Queue, visible in the RocketMQ console for manual intervention.

Where it wins: Financial systems requiring distributed transaction guarantees, e-commerce platforms with delay queue requirements, domestic China deployments where RocketMQ's ecosystem is stronger.

Apache Pulsar

Architecture: Pulsar separates compute from storage at the architectural level. Brokers are completely stateless — they handle client connections and protocol logic but store no data. All storage is handled by Apache BookKeeper, a distributed write-ahead log system. Brokers and BookKeeper clusters scale independently.

Multi-tenancy: Pulsar's namespace hierarchy (Tenant → Namespace → Topic) provides genuine multi-tenancy with per-namespace quotas, authentication policies, and replication configurations. A single Pulsar cluster can serve dozens of independent teams with hard isolation guarantees.

Tiered Storage: Pulsar can automatically offload old segments to object storage (S3, GCS, HDFS) as they age, while keeping recent segments in BookKeeper for low-latency reads. The result is near-unlimited retention at object storage prices — typically 1/10th the cost of block storage.

Geo-replication: Built-in asynchronous cross-datacenter replication at the topic level, configurable per namespace. No external tooling (like Kafka's MirrorMaker 2) required.

Unified messaging model: The same topic can serve queue consumers (Shared subscription, competing consumers) and stream consumers (Exclusive or Failover subscription) simultaneously.

Where it wins: Multi-cloud/multi-datacenter deployments, SaaS platforms requiring tenant isolation, systems requiring very long data retention at low cost.

Operational cost: Significantly higher than Kafka. Three independent systems (ZooKeeper, BookKeeper, Pulsar Brokers) must be maintained, monitored, and tuned. The ecosystem is less mature than Kafka's.

Redpanda

Implementation: A complete reimplementation of the Kafka protocol in C++, built on the Seastar framework. Seastar uses the reactor pattern with per-core thread pinning — each CPU core runs its own event loop with no cross-core synchronization overhead, and no JVM GC pauses.

No JVM, no GC: Kafka's JVM-based implementation suffers from GC pauses that can spike p99.9 latency by 10x during major collections. Redpanda's C++ implementation has deterministic latency — p99.9 latency on Redpanda is typically 2–5x lower than Kafka for the same workload.

No ZooKeeper, no KRaft quorum complexity: Redpanda ships with its own Raft implementation baked into every broker. No separate controller quorum to manage.

Kafka protocol compatibility: Existing Kafka producers, consumers, and admin tools work without modification. Changing from Kafka to Redpanda requires changing a single configuration property (the bootstrap server address).

Resource efficiency: Benchmarks consistently show Redpanda consuming 1/3 to 1/2 the CPU and memory of Kafka for equivalent throughput, primarily due to eliminating JVM overhead and more efficient I/O handling via Seastar.

Where it wins: Latency-sensitive applications (high-frequency trading, real-time gaming, IoT), resource-constrained environments (edge deployments), teams that want Kafka semantics with lower operational complexity.

Limitations: Kafka Streams, Kafka Connect, and ksqlDB are JVM-based and connect to Redpanda via the Kafka protocol, but may not support all features. Some advanced Kafka features (quotas, ACLs, delegated tokens) have partial or lagging support.

Decision Matrix

Dimension	Kafka	RabbitMQ	RocketMQ	Pulsar	Redpanda
Peak throughput	★★★★★	★★★	★★★★★	★★★★	★★★★★
p99 latency	★★★	★★★★★	★★★★	★★★★	★★★★★
Message replay	★★★★★	✗	★★	★★★★★	★★★★★
Transaction messages	★★★	★★★	★★★★★	★★★	★★★
Complex routing	★★	★★★★★	★★★	★★★	★★
Multi-tenancy	★★	★★★	★★★	★★★★★	★★★
Operational simplicity	★★★	★★★★	★★★	★	★★★★
Ecosystem maturity	★★★★★	★★★★	★★★★	★★★	★★★
Tiered storage	★★★	✗	✗	★★★★★	★★★★
Stream processing	★★★★★	✗	★	★★★	★★★★

Choose Kafka when: You need a large-scale data pipeline (>100K msg/s), you're building stream processing with Flink/Spark/Kafka Streams, you need event sourcing or CDC, multiple independent consumer groups need to read the same data.

Choose RabbitMQ when: Routing complexity is high, latency must be sub-millisecond, you're building task queues with competing consumers, overall message volume is moderate.

Choose RocketMQ when: You need distributed transaction guarantees, delay queues are a core requirement, you're building a finance-grade system in the Chinese ecosystem.

Choose Pulsar when: You need native multi-datacenter replication, genuine multi-tenancy with hard isolation, or very long retention at low cost.

Choose Redpanda when: You want Kafka semantics with lower GC-induced latency jitter, you're in a resource-constrained environment, or you want to simplify operations by eliminating the JVM and KRaft quorum management.

Kafka's Enduring Insight

LinkedIn's Jay Kreps wrote in his 2013 blog post "The Log: What every software engineer should know about real-time data's unifying abstraction": the append-only log is the simplest possible data structure that is also completely general. Every database, every distributed system, every replication protocol is fundamentally a log at its core.

Kafka made that insight operational at scale. By treating the log as the primary abstraction — rather than the queue, the message, or the subscriber — Kafka built something that transcends any single messaging use case. It is simultaneously a message bus, a distributed storage system, and a stream processing substrate, because logs are all three of those things simultaneously.

This is why the rest of this book doesn't just cover "how to use Kafka" — it goes into the internals of how a distributed log actually works: how messages flow through the system, how replicas stay in sync, how the protocol is encoded on the wire, and how KRaft replaced ZooKeeper without breaking a single client. The answers to those questions reveal why Kafka's design decisions are not arbitrary but deeply principled.

Rate this chapter

4.7 / 5 (112 ratings)