Log Storage Engine: Why Kafka Is So Fast
Log Storage Engine: Why Kafka Is So Fast
Kafka's extreme throughput does not come from any single technology. It is the compound effect of several mutually reinforcing design decisions: sequential disk writes, OS page cache exploitation, zero-copy network transfer, and memory-mapped index access. Understanding these mechanisms is the prerequisite for optimizing Kafka deployments and diagnosing performance problems.
LogSegment File Structure: What's Inside a Partition Directory
Each Kafka partition corresponds to a directory on disk. Inside that directory, files are grouped into LogSegments — each segment is a set of related files identified by a common base offset:
# Inspect a real partition directory
ls -lh /data/kafka/orders-0/
# Typical output for a busy production partition:
-rw-r--r-- 1 kafka kafka 1.0G Apr 26 00:00 00000000000000000000.log
-rw-r--r-- 1 kafka kafka 10M Apr 26 00:00 00000000000000000000.index
-rw-r--r-- 1 kafka kafka 10M Apr 26 00:00 00000000000000000000.timeindex
-rw-r--r-- 1 kafka kafka 0 Apr 26 00:00 00000000000000000000.txnindex
-rw-r--r-- 1 kafka kafka 1.0G Apr 26 01:00 00000000000010485760.log
-rw-r--r-- 1 kafka kafka 10M Apr 26 01:00 00000000000010485760.index
-rw-r--r-- 1 kafka kafka 10M Apr 26 01:00 00000000000010485760.timeindex
-rw-r--r-- 1 kafka kafka 0 Apr 26 01:00 00000000000010485760.txnindex
-rw-r--r-- 1 kafka kafka 542M Apr 26 10:33 00000000000020971520.log ← active segment
-rw-r--r-- 1 kafka kafka 5M Apr 26 10:33 00000000000020971520.index
-rw-r--r-- 1 kafka kafka 5M Apr 26 10:33 00000000000020971520.timeindex
-rw-r--r-- 1 kafka kafka 0 Apr 26 10:33 00000000000020971520.txnindex
-rw-r--r-- 1 kafka kafka 0 Apr 26 10:33 leader-epoch-checkpoint
-rw-r--r-- 1 kafka kafka 60 Apr 26 10:33 partition.metadata
The 20-digit number in each filename is the base offset of the first message in that segment.
The .log File: Raw Message Data
The .log file is an append-only record store. Messages are stored in RecordBatch format (Magic V2, introduced in Kafka 0.11):
RecordBatch layout:
├── baseOffset (8 bytes) — offset of the first record in this batch
├── batchLength (4 bytes) — total byte length of this batch
├── partitionLeaderEpoch (4 bytes) — leader generation (for consistency verification)
├── magic (1 byte) — format version, currently 2
├── crc (4 bytes) — CRC32C checksum of everything after this field
├── attributes (2 bytes) — compression type, timestamp type, transactional flag
├── lastOffsetDelta (4 bytes) — (last record offset) - baseOffset
├── baseTimestamp (8 bytes) — timestamp of the first record
├── maxTimestamp (8 bytes) — maximum timestamp in this batch
├── producerId (8 bytes) — for idempotent/transactional producers
├── producerEpoch (2 bytes) — fencing zombie writers
├── baseSequence (4 bytes) — starting sequence number for idempotency
├── recordsCount (4 bytes) — number of records in this batch
└── Records[]:
├── length (varint)
├── attributes (int8)
├── timestampDelta (varint) — delta from baseTimestamp (saves space)
├── offsetDelta (varint) — delta from baseOffset (saves space)
├── keyLength (varint)
├── key (bytes)
├── valueLength (varint)
├── value (bytes)
└── headers[] — optional key-value metadata
Using variable-length integer encoding (varint) for per-record timestamp deltas and offset deltas means records within a batch are extremely compact — the deltas are typically just 1-2 bytes each.
The .index File: Sparse Offset Index
The .index file is a sparse index — not every record gets an entry. Instead, one index entry is written for every log.index.interval.bytes (default: 4096 bytes) of log data written.
Each index entry is exactly 8 bytes:
- Relative offset (4 bytes): offset relative to the segment's base offset
- Physical position (4 bytes): byte position in the
.logfile where this offset begins
Index entries for segment with base offset 10,485,760:
Relative Offset Physical Position
0 0
157 4,096
314 8,192
471 12,288
628 16,384
...
Offset lookup procedure:
- Binary search across all LogSegment base offsets → identify which segment contains the target offset
- Binary search within that segment's
.indexfile → find the nearest index entry (floor entry for target offset) - Sequential scan of the
.logfile starting from the index entry's physical position → locate the exact record
The worst case for step 3 is scanning at most log.index.interval.bytes (4KB) of log data — negligible for sequential reads.
The .timeindex File: Timestamp-Based Index
The .timeindex file has the same structure as .index but indexed by timestamp rather than offset:
Timeindex entry (12 bytes):
├── timestamp (8 bytes) — record's timestamp
└── relative offset (4 bytes) — corresponding offset in the segment
This index supports timestamp-based operations:
kafka-consumer-groups.sh --reset-offsets --to-datetime "2024-04-26T00:00:00.000"- Time-based retention: finding the earliest offset with a timestamp newer than
log.retention.hours
The .txnindex File: Transaction Abort Index
The .txnindex file records offset ranges of aborted transactions. When a consumer is configured with isolation.level=read_committed, it reads this file to determine which message ranges to skip.
If transactions are not used, this file remains empty (0 bytes), as shown in the directory listing above.
# Inspect index file contents
kafka-dump-log.sh \
--files /data/kafka/orders-0/00000000000000000000.index \
--index-sanity-check
# Output:
# Dumping /data/kafka/orders-0/00000000000000000000.index
# offset: 0 position: 0
# offset: 157 position: 4096
# offset: 314 position: 8192
# ...
# Decode actual message contents
kafka-dump-log.sh \
--files /data/kafka/orders-0/00000000000000000000.log \
--print-data-log | head -30
Why Sequential Writes Are Fast: Physics of Storage
HDD Physics
A traditional spinning hard disk's performance is governed by three mechanical factors:
- Seek time: moving the read/write head to the target track — typically 3–15ms
- Rotational latency: waiting for the target sector to rotate under the head — at 7,200 RPM, average latency is ~4ms
- Transfer time: the actual data movement once the head is positioned
Random I/O: Each access requires seek + rotational latency = ~7–20ms overhead per operation. This yields roughly 100–200 IOPS, translating to ~0.1 MB/s for random 4K block writes.
Sequential I/O: No head movement; data occupies adjacent sectors. Throughput reaches 100–200 MB/s. This is a 1,000× difference.
This massive gap is why append-only log design was not merely an optimization for Kafka — for HDDs, it was a requirement to achieve any meaningful throughput at all.
SSD Characteristics
SSDs eliminate mechanical motion, dramatically reducing random I/O latency (SATA SSD: ~0.1ms, NVMe SSD: ~0.02ms). However, sequential access still holds significant advantages:
Write amplification: SSDs must erase in large blocks (typically 512KB to 4MB). Many small random writes trigger frequent block erasures — each erase cycle degrades the block's write endurance and creates I/O stalls.
Parallelism: SSDs contain multiple NAND channels that can be written in parallel. Sequential writes can sustain this parallelism; scattered random writes cannot.
Practical gap: Even on NVMe SSDs, sequential write throughput typically exceeds random write throughput by 2–5×, and sequential writes produce far less write amplification (longer SSD lifespan).
Kafka's append-only log guarantees sequential writes regardless of storage medium. On HDDs this is existential; on SSDs it extends device lifespan and maximizes channel parallelism.
Page Cache: Why Kafka Doesn't Manage Its Own Memory Cache
How the OS Page Cache Works
The Linux page cache is a kernel-managed buffer that sits between file system calls and physical storage:
Write path:
Application write() → copied into page cache → returns immediately to application
↓ (asynchronously, in background)
pdflush/kworker kernel threads write dirty pages to disk
Read path:
Application read() → check page cache → HIT: return data directly from RAM (no disk I/O!)
MISS: read from disk into page cache, then return
For a message broker like Kafka, the write-then-read pattern strongly favors the page cache: producers write messages, and consumers read them shortly after. If consumers are keeping up with producers — the common case — the messages they read are still in the page cache from the producer's write. Consumption becomes a memory-to-network transfer rather than a disk-to-network transfer.
Why Not a JVM Heap Cache?
Alternative: JVM heap-based cache (approach Kafka rejected):
- Requires reimplementing LRU/LFU eviction, cache warming, and consistency management
- Stored as Java objects — massive object count increases GC pressure
- Large heaps (32GB+) cause GC stop-the-world pauses lasting hundreds of milliseconds
- After Kafka restart, cache is cold — throughput degrades until data is re-read from disk
- Data lives twice: once in JVM heap, once in OS page cache — wasted RAM
Actual approach: rely on OS page cache:
- Kernel engineers have spent decades optimizing the page cache — no need to reinvent it
- After Kafka restart, page cache data survives (until evicted by memory pressure from other processes)
- JVM heap remains small (4–8GB), keeping GC pauses short and infrequent
- Kernel automatically applies workload-appropriate algorithms (readahead, LRU eviction)
- Zero implementation cost — Kafka just does regular file I/O
# Check how much of the active log segment is in page cache
# Install pcstat: go install github.com/tobert/pcstat@latest
pcstat /data/kafka/orders-0/00000000000020971520.log
# Output:
# +---------------------------------------------------+----------------+--------+--------+---------+
# | Name | Size (bytes) | Pages | Cached | Percent |
# |---------------------------------------------------+----------------+--------+--------+---------|
# | /data/kafka/orders-0/00000000000020971520.log | 567,705,600 | 138600 | 138542 | 99.958% |
# +---------------------------------------------------+----------------+--------+--------+---------+
#
# → ~100% of the active segment is in page cache
# → Consumers are effectively reading from RAM
# Check total page cache usage on the broker
free -h
# Shows used/available memory — "cached" column represents page cache
This is why Kafka's official guidance recommends limiting JVM heap to 4–8 GB and letting the operating system use the remainder as page cache. On a 64GB broker, -Xmx6g leaves approximately 56GB available for caching active log data.
Zero-Copy: The sendfile() Optimization
Traditional Data Transfer Path (Without Zero-Copy)
When a consumer fetches data from a broker using conventional read/write system calls:
1. Broker calls read() system call
→ OS reads data from disk into page cache (if not already there)
→ OS copies data from page cache into Kafka's JVM heap buffer (user space)
[CPU copy #1: kernel space → user space]
2. Kafka's Java code inspects/processes data (if needed)
3. Broker calls write() system call
→ OS copies data from JVM heap buffer into kernel socket send buffer
[CPU copy #2: user space → kernel space]
4. OS uses DMA to transfer data from socket buffer to NIC
[hardware DMA, no CPU involvement]
Total: 2 CPU memory copies + 4 user/kernel mode switches
sendfile(): Data Never Enters User Space
The sendfile() Linux system call was designed specifically for file-to-socket transfers:
Broker calls sendfile(socket_fd, file_fd, offset, length)
↓
Kernel checks page cache for file data (reads from disk if needed)
↓
Kernel directly DMA-transfers data from page cache to NIC send buffer
(entire operation stays in kernel space)
Total: 0 CPU memory copies (with scatter-gather DMA capable NIC)
2 user/kernel mode switches (just the sendfile() call itself)
Visual comparison:
Traditional:
Disk → [DMA] → Page Cache → [CPU copy] → JVM Heap → [CPU copy] → Socket Buffer → [DMA] → NIC
Zero-copy (sendfile):
Disk → [DMA] → Page Cache ─────────────────────────[DMA]──────────────────────> NIC
Kafka implements zero-copy in Java via FileChannel.transferTo(), which maps to sendfile() on Linux:
// Conceptual representation of how Kafka sends log data to consumers
// (from kafka/core/src/main/scala/kafka/log/LogSegment.scala region)
public long transferTo(FileChannel sourceFile, long position, long count,
WritableByteChannel destSocket) throws IOException {
// Java NIO FileChannel.transferTo() → JVM → Linux sendfile()
// Data flows: page cache → NIC, bypassing JVM heap entirely
return sourceFile.transferTo(position, count, destSocket);
}
When zero-copy doesn't apply: If the broker needs to decompress and recompress messages (when producer and consumer use different compression codecs, or when broker-side re-compression is configured), the data must enter user space for processing. This is why it's best to configure end-to-end compression with consistent codecs — producer compresses, broker stores as-is, consumer decompresses — allowing zero-copy for the broker-to-consumer path.
mmap: Memory-Mapped Access for Index Files
Kafka accesses its index files (.index and .timeindex) via memory-mapped I/O (mmap):
// Conceptual index file access pattern (from OffsetIndex.scala behavior)
MappedByteBuffer mmap = new RandomAccessFile(indexFile, "rw")
.getChannel()
.map(FileChannel.MapMode.READ_WRITE, 0, maxIndexSize);
// Binary search directly in memory — no read() system calls
// OS transparently handles page faults (loads index pages from disk if not in cache)
public int lookup(long targetOffset) {
int relativeTarget = (int)(targetOffset - baseOffset);
int lo = 0;
int hi = (mmap.position() / 8) - 1; // each entry is 8 bytes
while (lo <= hi) {
int mid = (lo + hi) >>> 1;
int relativeOffset = mmap.getInt(mid * 8); // direct memory read
int physicalPosition = mmap.getInt(mid * 8 + 4);
if (relativeOffset < relativeTarget) {
lo = mid + 1;
} else if (relativeOffset > relativeTarget) {
hi = mid - 1;
} else {
return physicalPosition; // exact match
}
}
// Return position of largest offset <= target
return (hi >= 0) ? mmap.getInt(hi * 8 + 4) : -1;
}
With mmap, the index file appears to the application as a simple byte array. The kernel manages the mapping between virtual memory addresses and physical disk pages, transparently loading pages on demand. This eliminates all read() system call overhead for index access — just pointer arithmetic and direct memory reads.
Log Recovery on Startup
Every time a Kafka broker starts, it performs a recovery scan on the last (active) LogSegment of each partition. This is necessary because a broker crash may have left incomplete RecordBatches at the end of the active segment (partially written before the crash):
Recovery algorithm:
- Starting from the beginning of the segment, iterate through RecordBatches
- For each batch, compute CRC32C and compare against the stored CRC
- When a CRC mismatch is found, truncate the file at the byte position of the corrupt batch
- Rebuild
.indexand.timeindexby scanning the now-clean.logfile from the beginning
# Monitor broker startup log for recovery events
grep -E "(Recovering|Truncating|Recovery)" /var/log/kafka/server.log
# Normal recovery (no corruption):
# [2024-04-26 08:00:01,234] INFO Recovering unflushed segment 20971520
# in log /data/kafka/orders-0 (kafka.log.Log)
# [2024-04-26 08:00:01,890] INFO Completed load of log /data/kafka/orders-0
# with 3 log segments and log end offset 20973847 (kafka.log.Log)
# Recovery with truncation (corruption found):
# [2024-04-26 08:00:01,234] INFO Recovering unflushed segment 20971520
# [2024-04-26 08:00:01,456] WARN Found invalid messages in log segment
# /data/kafka/orders-0/00000000000020971520.log at byte offset 542351088.
# Message: Invalid record (e=Record is corrupt (stored crc = 1234567890,
# computed crc = 987654321))
# [2024-04-26 08:00:01,457] INFO Truncating segment 20971520 to 542351088 bytes
log.recovery.threads.per.data.dir (default: 1) controls the parallelism of log recovery per data directory. For brokers with thousands of partitions, increasing this to match the number of disks can significantly reduce restart time:
# Increase recovery parallelism for brokers with many partitions
log.recovery.threads.per.data.dir=4 # or match disk count
Key Configuration Reference
| Parameter | Default | Purpose |
|---|---|---|
log.segment.bytes |
1 GB | Maximum size of a single .log segment file |
log.index.size.max.bytes |
10 MB | Maximum size of .index and .timeindex files |
log.index.interval.bytes |
4 KB | How often to add an index entry (bytes of log data per entry) |
log.flush.interval.messages |
Long.MAX_VALUE | Force fsync after this many messages (default: rely on OS) |
log.flush.interval.ms |
Long.MAX_VALUE | Force fsync after this interval (default: rely on OS) |
log.recovery.threads.per.data.dir |
1 | Threads for parallel log recovery on startup |
Critical note on fsync: Kafka defaults to not forcing fsync, relying on the OS to flush page cache dirty pages to disk asynchronously. This means in a broker crash scenario, the last few unflushed messages could be lost from that specific broker's local disk.
This is intentional — durability is achieved through replication, not local fsync. With acks=all and min.insync.replicas=2, data has been written to at least 2 brokers' page caches (and likely disks). The probability of both crashing simultaneously before data is flushed is negligible in practice.
Enabling per-message fsync (log.flush.interval.messages=1) would nearly eliminate the throughput advantage of sequential writes — disk write throughput would drop from hundreds of MB/s to hundreds of KB/s. Never enable this in production; rely on replication for durability instead.