Chapter 29

Performance Tuning: OS, JVM and Broker End-to-End

The Right Approach: Bottleneck Identification, Not Config Cargo-Culting

The most common anti-pattern in Kafka performance tuning is "config cargo-culting": copying a stack of parameters from a blog post, pasting them into server.properties without understanding them, and hoping throughput doubles. This approach is not just ineffective; it often introduces new problems.

The correct method is systematic bottleneck identification. Start at the operating system layer, work upward through the JVM, and then address broker-level configuration. At each layer, identify the actual limiting factor, make a targeted change, and measure the outcome. This chapter follows that path, providing the theoretical reasoning, specific parameters, and validation methods at every layer.

Kafka performance bottlenecks typically fall into a small number of categories:

Disk I/O: Sequential write speed determines the producer throughput ceiling; random reads affect consumer catch-up speed
Network bandwidth: Intra-cluster replication and client data transfer share the same network
JVM GC pauses: Stop-the-world pauses directly cause replicas to fall behind the leader, triggering ISR shrinks
Page cache: Consumers typically read recent data; page cache hits are 1-2 orders of magnitude faster than disk reads
CPU: Becomes a bottleneck when TLS termination or compression is enabled, but usually ranks last

OS Layer Tuning

Disk Selection: SSD vs HDD, JBOD vs RAID, XFS vs ext4

SSD vs HDD

Kafka's write pattern is sequential append, precisely the access pattern where HDDs perform best. Sequential write throughput on modern HDDs reaches roughly 150-250 MB/s per drive, which narrows the gap with SATA SSDs considerably for this workload. That said, SSDs have a clear advantage in specific scenarios:

ZooKeeper and KRaft metadata disks: ZooKeeper issues an fsync on every transaction commit. HDD fsync latency of 5-10ms becomes the serialization bottleneck for the entire cluster's metadata operations. Always use a dedicated SSD for ZooKeeper data and KRaft's metadata.log.dir.
Low-latency requirements (< 10ms P99): SSD random read performance reduces tail latency when consumers read historical data not in page cache.
High partition count: Many partitions mean many small files. SSD random I/O handles this far better than HDDs.

For pure throughput-optimized workloads (log aggregation, batch processing), high-capacity SATA HDDs offer the best cost-per-gigabyte.

JBOD vs RAID

Kafka has built-in data replication via replication.factor. It does not need disk-level redundancy, and RAID is more harmful than helpful in Kafka deployments:

RAID 5/6 write penalty: every write must read the parity block, compute the new parity, and write it back — turning sequential writes into mixed read-write operations
The RAID controller is a single point of failure; its failure takes all disks offline
RAID rebuilds severely degrade I/O performance for hours or days, potentially causing cluster-wide ISR thrashing

JBOD is the recommended Kafka storage architecture. Configure multiple disks as independent entries in log.dirs:

# Multi-disk JBOD configuration
log.dirs=/data/kafka/disk1,/data/kafka/disk2,/data/kafka/disk3,/data/kafka/disk4

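When a new partition replica is created, the broker places it in the log directory that currently holds the fewest partitions, so replicas spread across all disks over time (balanced by partition count, not by size). If one disk fails, only the partitions assigned to that disk are affected; partitions on other disks continue serving traffic.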

XFS vs ext4

Both are production-validated. The practical differences are minor:

XFS: Slightly better sequential write performance for large files. No performance degradation at high inode count. Kafka log segments (typically 1 GB each) benefit from XFS's extent-based allocation.
ext4: Equally stable, widely deployed, and sufficient for most Kafka workloads.

Regardless of choice, mount options matter more than the filesystem:

# Recommended mount options in /etc/fstab
/dev/sdb /data/kafka/disk1 xfs defaults,noatime,nodiratime 0 0

# noatime: disables file access timestamp updates — prevents spurious writes on every read
# nodiratime: same for directories (on Linux, noatime already implies it; listing both is harmless)

Validate raw disk performance before deploying Kafka:

# Sequential write throughput (what limits producer ingestion)
fio --name=kafka-seq-write \
    --filename=/data/kafka/disk1/perf-test \
    --rw=write \
    --bs=1m \
    --size=10g \
    --numjobs=1 \
    --ioengine=libaio \
    --direct=1 \
    --runtime=60 \
    --group_reporting

# fdatasync latency (critical for ZooKeeper and KRaft metadata logs, which sync on every commit)
fio --name=kafka-fsync-latency \
    --filename=/data/kafka/disk1/perf-test \
    --rw=write \
    --bs=4k \
    --size=1g \
    --fdatasync=1 \
    --ioengine=sync \
    --numjobs=1

Memory Tuning: Page Cache Is Kafka's First-Level Accelerator

Leave 60-75% of RAM to page cache, not the JVM heap

This is the most counter-intuitive and most important Kafka memory principle. Many operations teams instinctively allocate large JVM heaps, reasoning that "more memory is better." This reasoning fails for Kafka:

Kafka stores its data on disk. The JVM heap holds metadata only: partition state, ISR lists, request queues, NetworkClient objects, compression buffers. 6-8 GB is sufficient for even large clusters.
Consumers typically read recent messages (real-time streaming). Those messages are still in the OS page cache — no disk I/O required.
If the JVM heap consumes most of the available RAM, the page cache is starved. Consumer reads trigger disk I/O, degrading throughput by 10-100x.

On a 32 GB RAM machine, recommended allocation:

JVM heap: 6-8 GB
Page cache available: ~20-24 GB (managed automatically by the OS)
System and other processes: ~4 GB
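The split above is simple subtraction, and it can help to make it explicit when sizing other machine types. The following is a minimal sketch; the class and method names are illustrative, and the 4 GB system reservation is the figure from the list above rather than a measured value.

```java
// Sketch: how much RAM is left for the OS page cache on a Kafka broker.
// All figures are in GB; the class and method names are illustrative only.
final class PageCacheBudget {

    /** RAM left for page cache after the JVM heap and a system reservation. */
    static int pageCacheGb(int totalRamGb, int heapGb, int systemReserveGb) {
        int remaining = totalRamGb - heapGb - systemReserveGb;
        if (remaining < 0) {
            throw new IllegalArgumentException("heap + system reserve exceed total RAM");
        }
        return remaining;
    }

    /** Share of total RAM (percent, rounded down) that ends up as page cache. */
    static int pageCachePercent(int totalRamGb, int heapGb, int systemReserveGb) {
        return 100 * pageCacheGb(totalRamGb, heapGb, systemReserveGb) / totalRamGb;
    }

    public static void main(String[] args) {
        // The 32 GB example from the text: 6 GB heap, about 4 GB for the system.
        System.out.println(pageCacheGb(32, 6, 4) + " GB page cache");      // 22 GB page cache
        System.out.println(pageCachePercent(32, 6, 4) + "% of total RAM"); // 68% of total RAM
    }
}
```

The kernel does not reserve this memory in any formal sense; the number is only an estimate of what is left for caching once the heap and other processes have taken their share.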

Dirty page write-back tuning:

# View current values
sysctl vm.dirty_background_ratio vm.dirty_ratio vm.swappiness

# Recommended settings: add to /etc/sysctl.conf
# (comments sit on their own lines; sysctl.conf only treats a line as a comment when it starts with # or ;)

# Begin background write-back when dirty pages reach 5% of available memory
vm.dirty_background_ratio=5
# Block writers and force synchronous write-back once dirty pages reach 80% of available memory
vm.dirty_ratio=80
# Strongly prefer page cache eviction over swapping
# (0 risks the OOM killer; 1 is the safe minimum)
vm.swappiness=1

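dirty_ratio=80 allows the OS to accumulate a large dirty page buffer and flush it in large sequential writes, which is exactly what Kafka needs. This is safe only because durability comes from Kafka's replication rather than from the OS flush schedule. It assumes topics are actually replicated (for example replication factor 3 with acks=all), so that one broker losing its unflushed pages does not lose data.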

Apply immediately without reboot:

sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=80
sysctl -w vm.swappiness=1

Network Tuning: High-Throughput TCP Parameters

Kafka is network-intensive. Each broker simultaneously maintains connections with multiple producers, consumers, and peer brokers (for replication). Default Linux TCP parameters are tuned for general workloads, not for sustained high-throughput streaming.

# /etc/sysctl.conf network parameters

# Socket buffer maximums (allow large TCP windows on high-bandwidth links)
# 128 MB receive and send buffer maximums
net.core.rmem_max=134217728
net.core.wmem_max=134217728
# 64 MB defaults for sockets that do not size their own buffers.
# TCP sockets take their default from tcp_rmem/tcp_wmem below instead.
net.core.rmem_default=67108864
net.core.wmem_default=67108864

# TCP auto-tuning (kernel adjusts buffer sizes based on network conditions)
net.ipv4.tcp_rmem=4096 65536 134217728
net.ipv4.tcp_wmem=4096 65536 134217728

# TCP window scaling (essential for high-BDP links: cross-AZ, cross-region)
net.ipv4.tcp_window_scaling=1

# Backlogs: packets queued from the NIC, and half-open connections awaiting the handshake
net.core.netdev_max_backlog=30000
net.ipv4.tcp_max_syn_backlog=8192

# TIME_WAIT socket reuse (reduces port exhaustion under many short connections)
net.ipv4.tcp_tw_reuse=1

Mirror these in the Kafka broker configuration:

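# server.properties: socket buffer sizes
# (Java .properties files have no inline comments, so each note sits on its own line)
# 1 MB send and receive buffers
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
# 100 MB maximum single-request size
socket.request.max.bytes=104857600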
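The 1 MB broker socket buffers and the 128 MB kernel maximums can be sanity-checked against the bandwidth-delay product (BDP): the number of bytes that must be in flight to keep a link full. The sketch below is a rough illustration; the 1 ms and 80 ms round-trip times are assumed values for a cross-AZ and a cross-region path, not measurements.

```java
// Sketch: bandwidth-delay product as a lower bound for TCP buffer sizing.
// The link speed and round-trip times used in main() are assumed example values.
final class BandwidthDelayProduct {

    /** Bytes that must be in flight to keep a link of the given speed full. */
    static long bdpBytes(long linkBitsPerSecond, long rttMicros) {
        long bytesPerSecond = linkBitsPerSecond / 8;
        return bytesPerSecond * rttMicros / 1_000_000L;
    }

    public static void main(String[] args) {
        long tenGbit = 10_000_000_000L;
        // 10 Gbit/s at 1 ms RTT: about 1.25 MB, in line with a 1 MB socket buffer.
        System.out.println(bdpBytes(tenGbit, 1_000L));   // 1250000
        // 10 Gbit/s at 80 ms RTT: 100 MB, which fits under a 128 MB kernel maximum.
        System.out.println(bdpBytes(tenGbit, 80_000L));  // 100000000
    }
}
```

A buffer smaller than the BDP caps a single connection below line rate, which is why long-haul replication links need much larger windows than traffic inside one data center.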

JVM Layer Tuning

Heap Size: The 6-8 GB Sweet Spot

The recommended JVM heap for Kafka brokers is 6-8 GB. Going higher than 12 GB is almost always counterproductive. Three reasons:

Kafka's hot data path is in the page cache, not on the heap. The heap holds metadata, request buffers, and NIO objects — not message data.
Large heaps produce longer GC pauses. A G1GC full GC on a 24 GB heap can pause for tens of seconds, long enough to cause client timeouts, ZooKeeper session expiry, and ISR shrinks.
Page cache is more efficient than heap for Kafka's access patterns and can be shared across processes.

# Set in kafka-server-start.sh or via environment
export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
# -Xms equals -Xmx: prevents JVM from dynamically resizing the heap,
# which causes additional GC activity during resize operations

G1GC: The Default Choice for Kafka 3.x

G1GC became the default GC in JDK 9 and is the recommended choice for Kafka 3.x on JDK 11 and JDK 17. The following configuration has been validated in multi-terabyte-per-day production deployments:

export KAFKA_JVM_PERFORMANCE_OPTS="
  -server
  -XX:+UseG1GC
  -XX:MaxGCPauseMillis=20
  -XX:InitiatingHeapOccupancyPercent=35
  -XX:+ExplicitGCInvokesConcurrent
  -XX:MaxInlineLevel=15
  -XX:+HeapDumpOnOutOfMemoryError
  -XX:HeapDumpPath=/var/log/kafka/
  -Xlog:gc*:file=/var/log/kafka/gc.log:time,uptime:filecount=10,filesize=100m
"

Parameter rationale:

-XX:MaxGCPauseMillis=20: G1GC's pause-time goal. It is a soft target, not a guarantee: G1GC sizes the young generation and chooses which regions to collect so as to stay under it. 20ms is far below the replica.lag.time.max.ms default of 30,000ms, so ordinary young and mixed collections will not trigger ISR shrinks.
-XX:InitiatingHeapOccupancyPercent=35: Start the concurrent marking cycle when heap occupancy reaches 35% (default is 45%). Starting earlier gives G1GC more time to collect before the heap fills up, reducing the likelihood of an evacuation failure that triggers a full STW collection.
-XX:+ExplicitGCInvokesConcurrent: When code calls System.gc() explicitly (some JDK internals and third-party libraries do this), run a concurrent GC cycle instead of a blocking full GC.

Monitor GC behavior:

# Watch GC pauses in real time
tail -f /var/log/kafka/gc.log | grep "Pause"

# Sample heap occupancy and cumulative GC time once per second
jstat -gcutil $(pgrep -f kafka.Kafka) 1000
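For a quick check of whether pauses stay near the 20 ms goal, the pause summary lines in gc.log can be scanned programmatically. The sketch below assumes the JDK unified logging format produced by the -Xlog option shown earlier, where a pause line ends in a duration such as "14.821ms". The sample lines are made up for illustration, and the class name is hypothetical.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: extract stop-the-world pause durations from unified GC log lines.
final class GcPauseScan {

    // Matches pause summaries such as
    // "... GC(7) Pause Young (Normal) (G1 Evacuation Pause) 512M->128M(6144M) 14.821ms"
    private static final Pattern PAUSE =
        Pattern.compile("GC\\(\\d+\\) Pause .*? (\\d+(?:\\.\\d+)?)ms\\s*$");

    /** Pause duration in milliseconds, or -1 if the line is not a pause summary. */
    static double pauseMillis(String line) {
        Matcher m = PAUSE.matcher(line);
        return m.find() ? Double.parseDouble(m.group(1)) : -1.0;
    }

    /** Number of pauses longer than the given threshold. */
    static long countPausesOver(List<String> lines, double thresholdMs) {
        return lines.stream()
                    .mapToDouble(GcPauseScan::pauseMillis)
                    .filter(p -> p > thresholdMs)
                    .count();
    }

    /** Longest pause seen, or -1 if no pause line was found. */
    static double maxPause(List<String> lines) {
        return lines.stream().mapToDouble(GcPauseScan::pauseMillis).max().orElse(-1.0);
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
            "[2024-05-01T10:00:00.000+0000][12.300s] GC(7) Pause Young (Normal) (G1 Evacuation Pause) 512M->128M(6144M) 14.821ms",
            "[2024-05-01T10:00:05.000+0000][17.300s] GC(8) Pause Remark 600M->590M(6144M) 31.5ms",
            "[2024-05-01T10:00:05.100+0000][17.400s] GC(8) Concurrent Mark Cycle 412.7ms");
        System.out.println(maxPause(sample));              // 31.5
        System.out.println(countPausesOver(sample, 20.0)); // 1
    }
}
```

Concurrent phases are deliberately ignored because they do not stop application threads; only lines containing "Pause" count toward the pause budget.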

ZGC: For Ultra-Low Latency Requirements

ZGC (Z Garbage Collector) reached production quality in JDK 15 and is an excellent choice for Kafka 3.x on JDK 17 LTS when latency is a primary concern. ZGC's defining characteristic is sub-millisecond STW pauses, even on 32 GB heaps.

ZGC is appropriate when:

P99/P999 latency requirements are below 5ms
A single broker hosts tens of thousands of partitions (large metadata working set)
The team is running JDK 17+ and is comfortable with ZGC's operational characteristics

# ZGC configuration for JDK 17+
export KAFKA_JVM_PERFORMANCE_OPTS="
  -server
  -XX:+UseZGC
  -XX:ZCollectionInterval=5
  -XX:+ZProactive
  -XX:+HeapDumpOnOutOfMemoryError
  -XX:HeapDumpPath=/var/log/kafka/
  -Xlog:gc*:file=/var/log/kafka/gc.log:time,uptime:filecount=10,filesize=100m
"

# ZGC is less sensitive to heap size; slightly larger heaps are fine
export KAFKA_HEAP_OPTS="-Xms8g -Xmx8g"

ZGC trade-offs: Higher CPU consumption (concurrent collection requires more CPU cycles) and higher memory bandwidth requirements. On brokers where CPU is already near saturation, ZGC's concurrent threads can actually reduce throughput. Benchmark before committing.

Broker Layer Tuning

Thread Pool Sizing: Align With CPU Cores

Two thread pools control how Kafka handles incoming requests. Their default values predate modern multi-core servers and are chronically undersized for production workloads:

# num.io.threads: handles the actual request processing (disk reads/writes, replica coordination)
# Recommended: 2 × CPU cores (default 8 is inadequate for 16-core+ servers)
# Signal: RequestHandlerAvgIdlePercent < 0.3 means increase this
# Value below is for an 8-core server
num.io.threads=16

# num.network.threads: handles socket reads and writes (receives bytes, queues requests)
# Recommended: CPU cores (default 3 is inadequate)
# Value below is for an 8-core server
num.network.threads=8

# Maximum depth of the request queue before back-pressure kicks in
queued.max.requests=500

Diagnosing thread pool bottlenecks:

# Check I/O thread idle ratio via JMX (the metric is a meter, so read a rate attribute)
kafka-jmx.sh \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --object-name kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent \
  --attributes OneMinuteRate

# Check per-request type latency
kafka-jmx.sh \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --object-name "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce" \
  --attributes 99thPercentile

When idle percent falls below 30%, increase num.io.threads. If idle percent is above 70% but latency is still high, the bottleneck is disk I/O, not thread count — adding threads will not help.
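The rule of thumb above can be written down as a small decision function. This is a sketch only: the 0.30 and 0.70 thresholds come from the text, while the latency budget is a hypothetical parameter that each team has to choose, since the text does not define what counts as high latency.

```java
// Sketch: turn the idle-ratio and latency readings into a next step.
// latencyBudgetMs is an assumed, team-specific threshold for "latency is too high".
final class RequestHandlerDiagnosis {

    enum Verdict { ADD_IO_THREADS, CHECK_DISK_IO, HEALTHY, INCONCLUSIVE }

    static Verdict diagnose(double idleRatio, double p99Ms, double latencyBudgetMs) {
        if (idleRatio < 0.30) {
            return Verdict.ADD_IO_THREADS;   // handler threads are saturated
        }
        if (p99Ms <= latencyBudgetMs) {
            return Verdict.HEALTHY;          // spare threads and acceptable latency
        }
        if (idleRatio > 0.70) {
            return Verdict.CHECK_DISK_IO;    // threads mostly idle, yet requests are slow
        }
        return Verdict.INCONCLUSIVE;         // between the two thresholds: gather more data
    }

    public static void main(String[] args) {
        System.out.println(diagnose(0.22, 40.0, 15.0)); // ADD_IO_THREADS
        System.out.println(diagnose(0.85, 40.0, 15.0)); // CHECK_DISK_IO
        System.out.println(diagnose(0.85, 8.0, 15.0));  // HEALTHY
    }
}
```

The inconclusive band is intentional: between 30% and 70% idle, neither thread count nor disk I/O is clearly implicated, so other metrics should be consulted before changing anything.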

Replica Fetch Tuning: The Most Underestimated Parameters

Replication configuration is where the most headroom is left on the table in the majority of production clusters:

# num.replica.fetchers: fetcher threads a follower runs per source (leader) broker
# Default of 1 often cannot keep up on fast networks and NVMe storage, because one
# thread issues one fetch request at a time for all partitions led by a given broker
# Recommended: 4-8 (lower for HDDs, higher for NVMe arrays)
num.replica.fetchers=4

# Maximum bytes to fetch per partition in each replica fetch request
# Default 1 MB is too small for high-throughput clusters
# Recommended: 10-100 MB depending on network and disk speed; the whole response
# is additionally capped by replica.fetch.response.max.bytes (default 10 MB)
# Value below is 10 MB
replica.fetch.max.bytes=10485760

# How long the leader waits to accumulate data before responding to a fetch
replica.fetch.wait.max.ms=500

# Socket receive buffer for replica fetch connections
replica.socket.receive.buffer.bytes=65536

Validate replication throughput:

# Check replication bytes received per broker per second
kafka-jmx.sh \
  --object-name "kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesInPerSec" \
  --attributes OneMinuteRate

# Check replica fetch request latency (follower fetches are reported as FetchFollower)
kafka-jmx.sh \
  --object-name "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=FetchFollower" \
  --attributes 99thPercentile

If replication throughput is substantially below the producer ingestion rate, and num.replica.fetchers is at 1, incrementing it to 4 and restarting the broker will typically produce an immediate and dramatic improvement.

Log Flush Strategy: Let the OS Manage It

This is the most debated Kafka parameter, and the answer is clear: rely on OS page cache write-back. Do not manually trigger fsyncs.

# Let the OS decide when to flush (this is the correct production setting)
# Both values are Long.MAX_VALUE, which effectively disables broker-initiated flushes
log.flush.interval.messages=9223372036854775807
log.flush.interval.ms=9223372036854775807

Why forced fsyncs are counterproductive:

Kafka's durability guarantee comes from replication, not fsync: With acks=all, a message is not acknowledged until it has been written to the OS page cache on all ISR replicas. If one broker crashes before fsync, the other replicas have complete data.
OS write-back is reliable at the configured dirty page ratios: With vm.dirty_background_ratio=5 the kernel starts flushing in the background early, so dirty pages rarely approach the vm.dirty_ratio=80 ceiling at which writers would block.
Forced fsync is extremely expensive: An fsync blocks the writing thread until data is physically on disk. Under high write throughput, this converts sequential micro-second-latency writes into millisecond-latency disk operations, reducing throughput by 5-10x.

The only legitimate reason to set aggressive flush intervals is regulatory compliance requirements that mandate physical data persistence before acknowledgment — rare in practice.

Message Size and Batch Tuning

# Maximum record batch size the broker will accept (coordinate with producers and consumers)
# Default 1 MB is too small for many use cases (large Avro records, binary payloads, video thumbnails)
# Value below is 10 MB
message.max.bytes=10485760

# Keep the replica fetch size >= message.max.bytes so large batches replicate in a single fetch
replica.fetch.max.bytes=10485760

Producer batching has a disproportionate impact on throughput. These settings belong in producer application configuration, not broker config:

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
// Larger batches = fewer requests = higher throughput
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);          // 64 KB (default 16 KB is too small)
props.put(ProducerConfig.LINGER_MS_CONFIG, 5);              // Wait up to 5ms to fill a batch
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");   // LZ4: best throughput/CPU balance
// zstd: better compression ratio at more CPU cost; good for expensive network paths
// snappy: legacy, use lz4 or zstd instead

props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 67108864L);  // 64 MB total buffer for records awaiting send
props.put(ProducerConfig.SEND_BUFFER_CONFIG, 1048576);      // 1 MB TCP socket send buffer
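To see why batch size matters so much, it helps to count how many batches a producer must ship per second at a given throughput. The sketch below is back-of-the-envelope arithmetic under two simplifying assumptions: every batch is completely full, and compression is ignored. The 100 MiB/s figure is an arbitrary example, not a benchmark result.

```java
// Sketch: batches per second needed to carry a given byte rate.
// Assumes full batches and no compression; names are illustrative.
final class BatchMath {

    /** Batches per second required for the given throughput, rounded up. */
    static long batchesPerSecond(long bytesPerSecond, int batchSizeBytes) {
        return (bytesPerSecond + batchSizeBytes - 1) / batchSizeBytes;
    }

    public static void main(String[] args) {
        long hundredMiB = 100L * 1024 * 1024;
        System.out.println(batchesPerSecond(hundredMiB, 16 * 1024)); // 6400 with 16 KB batches
        System.out.println(batchesPerSecond(hundredMiB, 64 * 1024)); // 1600 with 64 KB batches
    }
}
```

Moving from 16 KB to 64 KB batches cuts the per-batch overhead (headers, CRC checks, broker-side bookkeeping) by a factor of four at the same byte rate, provided linger.ms gives batches enough time to fill.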

Complete Performance Tuning Reference

Layer	Parameter	Recommended	Default	Impact
OS Disk	Filesystem mount options	`noatime,nodiratime`	`relatime`	5-10% I/O reduction
OS Disk	Storage topology	JBOD	varies	2-5x write performance vs RAID 5/6
OS Disk	Filesystem	XFS or ext4	varies	Minor; mount options matter more
OS Memory	`vm.dirty_ratio`	`80`	`20`	Sequential write efficiency
OS Memory	`vm.dirty_background_ratio`	`5`	`10`	Write-back smoothness
OS Memory	`vm.swappiness`	`1`	`60`	Prevents GC-induced swap thrashing
OS Network	`net.core.rmem_max`	`134217728`	`212992`	High-throughput TCP
OS Network	`net.ipv4.tcp_window_scaling`	`1`	`1`	Cross-AZ throughput
JVM	Heap size	`6-8 GB`	`1 GB`	GC stability vs page cache
JVM	GC implementation	G1GC or ZGC	G1GC	STW pause duration
JVM	`MaxGCPauseMillis`	`20`	`200`	GC pause target
JVM	`InitiatingHeapOccupancyPercent`	`35`	`45`	GC trigger timing
Broker	`num.io.threads`	`2 × CPU cores`	`8`	Request handling capacity
Broker	`num.network.threads`	`CPU cores`	`3`	Network I/O throughput
Broker	`num.replica.fetchers`	`4-8`	`1`	Replication throughput
Broker	`replica.fetch.max.bytes`	`10 MB+`	`1 MB`	Replication batch efficiency
Broker	`log.flush.interval.messages`	`Long.MAX_VALUE`	`Long.MAX_VALUE`	Write throughput
Broker	`socket.send.buffer.bytes`	`1 MB`	`100 KB`	Network throughput
Broker	`message.max.bytes`	`10 MB`	`1 MB`	Maximum payload support
Producer	`batch.size`	`64 KB`	`16 KB`	Write throughput
Producer	`linger.ms`	`5-20`	`0`	Batch fill efficiency
Producer	`compression.type`	`lz4` or `zstd`	`none`	Network and disk savings

Benchmarking: Measuring What You Changed

After making tuning changes, always validate with benchmarks. Kafka ships with built-in performance testing tools:

# Producer throughput benchmark (--throughput -1 = unlimited, measures maximum)
kafka-producer-perf-test.sh \
  --topic perf-benchmark \
  --num-records 10000000 \
  --record-size 1024 \
  --throughput -1 \
  --producer-props \
    bootstrap.servers=kafka-broker-1:9092 \
    acks=all \
    batch.size=65536 \
    linger.ms=5 \
    compression.type=lz4

# Consumer throughput benchmark
kafka-consumer-perf-test.sh \
  --bootstrap-server kafka-broker-1:9092 \
  --topic perf-benchmark \
  --messages 10000000 \
  --fetch-size 1048576 \
  --group perf-benchmark-consumer

# Producer latency benchmark (rate-limited to measure the latency distribution from send to acknowledgment)
kafka-producer-perf-test.sh \
  --topic latency-benchmark \
  --num-records 100000 \
  --record-size 1024 \
  --throughput 10000 \
  --producer-props \
    bootstrap.servers=kafka-broker-1:9092 \
    acks=all

Expected performance baselines on a well-tuned cluster (3 brokers, 3× NVMe SSD each, 10GbE network, replication factor 3):

Producer throughput: 800-1,200 MB/s aggregate (including replication)
Consumer throughput: 1,500-2,000 MB/s aggregate across consumers (page cache hits); a single consumer behind one 10GbE link tops out near 1,250 MB/s on the wire
End-to-end P99 latency with acks=all and 64 KB batches: 5-15ms

If your cluster falls significantly below these numbers after OS and JVM tuning, the broker layer is likely the next focus area: check num.replica.fetchers, num.io.threads, and whether page cache is being evicted by a competing heap allocation.
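One way to sanity-check the producer baseline is to compute the ceiling that network ingress alone imposes. The sketch below assumes leaders are spread evenly across brokers, that each 10GbE NIC delivers 1,250 MB/s of ingress, and that protocol overhead is negligible; all three are simplifications.

```java
// Sketch: upper bound on aggregate produce throughput from NIC ingress alone.
// Assumes evenly spread leaders and ignores protocol overhead.
final class ReplicationCeiling {

    /** Maximum aggregate produce rate in MB/s for the given cluster shape. */
    static double maxProduceMBps(int brokers, double nicMBps, int replicationFactor) {
        // Every produced byte arrives at one leader and is then fetched by
        // (replicationFactor - 1) followers, so total broker ingress is
        // replicationFactor times the produce rate.
        return brokers * nicMBps / replicationFactor;
    }

    public static void main(String[] args) {
        // The reference cluster: 3 brokers, 10GbE (1,250 MB/s), replication factor 3.
        System.out.println(maxProduceMBps(3, 1250.0, 3)); // 1250.0
        // Same hardware without replication, for comparison.
        System.out.println(maxProduceMBps(3, 1250.0, 1)); // 3750.0
    }
}
```

Under these assumptions the 800-1,200 MB/s producer baseline sits just below the 1,250 MB/s network ceiling, which suggests that on such a cluster the network, not the NVMe storage, is the first hard limit.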

Summary: The Layered Tuning Methodology

The single biggest mistake in Kafka performance tuning is skipping OS configuration and jumping straight to broker parameters. The correct sequence:

OS disk layer: Choose the right storage architecture (JBOD), set filesystem mount options, and baseline raw I/O performance with fio.
OS memory layer: Verify the JVM heap is sized conservatively (6-8 GB), check that dirty page write-back ratios allow efficient batching, set swappiness to 1.
OS network layer: Expand TCP socket buffers, enable window scaling, increase connection backlog limits.
JVM layer: Set heap to the sweet spot, choose G1GC (general) or ZGC (ultra-low latency), tune GC triggers.
Broker layer: Align thread pools with CPU core count, increase replica fetcher concurrency, trust the OS for fsync timing.

Measure one layer at a time. Changing three parameters simultaneously makes it impossible to determine which change produced the observed effect — or which change caused the new regression.
