Chapter 29

Performance Tuning: OS, JVM and Broker End-to-End

The Right Approach: Bottleneck Identification, Not Config Cargo-Culting

The most common anti-pattern in Kafka performance tuning is what engineers call "config cargo-culting" — copying a stack of parameters from a blog post, pasting them into server.properties without understanding them, and hoping for a performance doubling. This approach is not just ineffective; it often introduces new problems.

The correct method is systematic bottleneck identification. Start at the operating system layer, work upward through the JVM, and then address broker-level configuration. At each layer, identify the actual limiting factor, make a targeted change, and measure the outcome. This chapter follows that path, providing the theoretical reasoning, specific parameters, and validation methods at every layer.

Kafka performance bottlenecks typically fall into a small number of categories:

OS Layer Tuning

Disk Selection: SSD vs HDD, JBOD vs RAID, XFS vs ext4

SSD vs HDD

Kafka's write pattern is sequential append — precisely the access pattern where HDDs perform best. Sequential write throughput on modern HDDs reaches 200-400 MB/s, substantially closing the gap with SSDs. That said, SSDs have a clear advantage in specific scenarios:

For pure throughput-optimized workloads (log aggregation, batch processing), high-capacity SATA HDDs offer the best cost-per-gigabyte.

JBOD vs RAID

Kafka has built-in data replication via replication.factor. It does not need disk-level redundancy, and RAID is more harmful than helpful in Kafka deployments:

JBOD is the recommended Kafka storage architecture. Configure multiple disks as independent entries in log.dirs:

# Multi-disk JBOD configuration
log.dirs=/data/kafka/disk1,/data/kafka/disk2,/data/kafka/disk3,/data/kafka/disk4

Kafka's partition assignment algorithm round-robins across all log directories. If one disk fails, only the partitions assigned to that disk are affected; partitions on other disks continue serving traffic.

XFS vs ext4

Both are production-validated. The practical differences are minor:

Regardless of choice, mount options matter more than the filesystem:

# Recommended mount options in /etc/fstab
/dev/sdb /data/kafka/disk1 xfs defaults,noatime,nodiratime 0 0

# noatime: disables file access timestamp updates — prevents spurious writes on every read
# nodiratime: same for directories

Validate raw disk performance before deploying Kafka:

# Sequential write throughput (what limits producer ingestion)
fio --name=kafka-seq-write \
    --filename=/data/kafka/disk1/perf-test \
    --rw=write \
    --bs=1m \
    --size=10g \
    --numjobs=1 \
    --ioengine=libaio \
    --direct=1 \
    --runtime=60 \
    --group_reporting

# fdatasync latency (critical for acks=all producers and ZooKeeper)
fio --name=kafka-fsync-latency \
    --filename=/data/kafka/disk1/perf-test \
    --rw=write \
    --bs=4k \
    --size=1g \
    --fsync=1 \
    --ioengine=sync \
    --numjobs=1

Memory Tuning: Page Cache Is Kafka's First-Level Accelerator

Give 50-60% of RAM to page cache, not the JVM heap

This is the most counter-intuitive and most important Kafka memory principle. Many operations teams instinctively allocate large JVM heaps, reasoning that "more memory is better." This reasoning fails for Kafka:

On a 32 GB RAM machine, recommended allocation:

Dirty page write-back tuning:

# View current values
sysctl vm.dirty_background_ratio vm.dirty_ratio vm.swappiness

# Recommended settings — add to /etc/sysctl.conf
vm.dirty_background_ratio=5    # Begin background write-back when dirty pages reach 5% of total RAM
vm.dirty_ratio=80              # Block writers and force write-back at 80% of total RAM dirty
vm.swappiness=1                # Strongly prefer page cache eviction over swapping
                               # (0 risks OOM killer; 1 is the safe minimum)

dirty_ratio=80 allows the OS to accumulate a large dirty page buffer and flush it sequentially — exactly what Kafka needs. This is safe because Kafka's replication mechanism guarantees data durability, not the OS fsync schedule.

Apply immediately without reboot:

sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=80
sysctl -w vm.swappiness=1

Network Tuning: High-Throughput TCP Parameters

Kafka is network-intensive. Each broker simultaneously maintains connections with multiple producers, consumers, and peer brokers (for replication). Default Linux TCP parameters are tuned for general workloads, not for sustained high-throughput streaming.

# /etc/sysctl.conf network parameters

# Socket buffer maximums (allows large TCP windows for high-bandwidth links)
net.core.rmem_max=134217728           # 128 MB receive buffer maximum
net.core.wmem_max=134217728           # 128 MB send buffer maximum
net.core.rmem_default=67108864        # 64 MB receive buffer default
net.core.wmem_default=67108864        # 64 MB send buffer default

# TCP auto-tuning (kernel adjusts buffer sizes based on network conditions)
net.ipv4.tcp_rmem=4096 65536 134217728
net.ipv4.tcp_wmem=4096 65536 134217728

# TCP window scaling (essential for high-BDP links: cross-AZ, cross-region)
net.ipv4.tcp_window_scaling=1

# Connection backlog (for high-concurrency broker endpoints)
net.core.netdev_max_backlog=30000
net.ipv4.tcp_max_syn_backlog=8192

# TIME_WAIT socket reuse (reduces port exhaustion under many short connections)
net.ipv4.tcp_tw_reuse=1

Mirror these in the Kafka broker configuration:

# server.properties — socket buffer sizes
socket.send.buffer.bytes=1048576      # 1 MB
socket.receive.buffer.bytes=1048576   # 1 MB
socket.request.max.bytes=104857600    # 100 MB maximum single-request size

JVM Layer Tuning

Heap Size: The 6-8 GB Sweet Spot

The recommended JVM heap for Kafka brokers is 6-8 GB. Going higher than 12 GB is almost always counterproductive. Three reasons:

  1. Kafka's hot data path is in the page cache, not on the heap. The heap holds metadata, request buffers, and NIO objects — not message data.
  2. Large heaps produce longer GC pauses. A G1GC full GC on a 24 GB heap can pause for tens of seconds, guaranteed to trigger ISR shrinks and client timeouts.
  3. Page cache is more efficient than heap for Kafka's access patterns and can be shared across processes.
# Set in kafka-server-start.sh or via environment
export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
# -Xms equals -Xmx: prevents JVM from dynamically resizing the heap,
# which causes additional GC activity during resize operations

G1GC: The Default Choice for Kafka 3.x

G1GC became the default GC in JDK 9 and is the recommended choice for Kafka 3.x on JDK 11 and JDK 17. The following configuration has been validated in multi-terabyte-per-day production deployments:

export KAFKA_JVM_PERFORMANCE_OPTS="
  -server
  -XX:+UseG1GC
  -XX:MaxGCPauseMillis=20
  -XX:InitiatingHeapOccupancyPercent=35
  -XX:+ExplicitGCInvokesConcurrent
  -XX:MaxInlineLevel=15
  -XX:+HeapDumpOnOutOfMemoryError
  -XX:HeapDumpPath=/var/log/kafka/
  -Xlog:gc*:file=/var/log/kafka/gc.log:time,uptime:filecount=10,filesize=100m
"

Parameter rationale:

Monitor GC behavior:

# Watch GC log in real time
tail -f /var/log/kafka/gc.log | grep -E "(pause|GC|Pause)"

# Check GC statistics via JMX
jcmd $(pgrep -f kafka.Kafka) GC.stat

ZGC: For Ultra-Low Latency Requirements

ZGC (Z Garbage Collector) reached production quality in JDK 15 and is an excellent choice for Kafka 3.x on JDK 17 LTS when latency is a primary concern. ZGC's defining characteristic is sub-millisecond STW pauses, even on 32 GB heaps.

ZGC is appropriate when:

# ZGC configuration for JDK 17+
export KAFKA_JVM_PERFORMANCE_OPTS="
  -server
  -XX:+UseZGC
  -XX:ZCollectionInterval=5
  -XX:+ZProactive
  -XX:+HeapDumpOnOutOfMemoryError
  -XX:HeapDumpPath=/var/log/kafka/
  -Xlog:gc*:file=/var/log/kafka/gc.log:time,uptime:filecount=10,filesize=100m
"

# ZGC is less sensitive to heap size; slightly larger heaps are fine
export KAFKA_HEAP_OPTS="-Xms8g -Xmx8g"

ZGC trade-offs: Higher CPU consumption (concurrent collection requires more CPU cycles) and higher memory bandwidth requirements. On brokers where CPU is already near saturation, ZGC's concurrent threads can actually reduce throughput. Benchmark before committing.

Broker Layer Tuning

Thread Pool Sizing: Align With CPU Cores

Two thread pools control how Kafka handles incoming requests. Their default values predate modern multi-core servers and are chronically undersized for production workloads:

# num.io.threads: handles the actual request processing (disk reads/writes, replica coordination)
# Recommended: 2 × CPU cores (default 8 is inadequate for 16-core+ servers)
# Signal: RequestHandlerAvgIdlePercent < 0.3 means increase this
num.io.threads=16      # for an 8-core server

# num.network.threads: handles socket reads and writes (receives bytes, queues requests)
# Recommended: CPU cores (default 3 is inadequate)
num.network.threads=8  # for an 8-core server

# Maximum depth of the request queue before back-pressure kicks in
queued.max.requests=500

Diagnosing thread pool bottlenecks:

# Check I/O thread idle percent via JMX
kafka-jmx.sh \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --object-name kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent \
  --attributes Value

# Check per-request type latency
kafka-jmx.sh \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --object-name "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce" \
  --attributes 99thPercentile

When idle percent falls below 30%, increase num.io.threads. If idle percent is above 70% but latency is still high, the bottleneck is disk I/O, not thread count — adding threads will not help.

Replica Fetch Tuning: The Most Underestimated Parameters

Replication configuration is where the most headroom is left on the table in the majority of production clusters:

# num.replica.fetchers: threads per follower broker for pulling data from leaders
# Default of 1 is drastically insufficient for modern NVMe storage
# An NVMe SSD can sustain 50,000+ IOPS; 1 fetcher thread cannot saturate it
# Recommended: 4-8 (lower for HDDs, higher for NVMe arrays)
num.replica.fetchers=4

# Maximum bytes to fetch per replica fetch request
# Default 1MB is too small for high-throughput clusters
# Recommended: 10-100 MB depending on network and disk speed
replica.fetch.max.bytes=10485760      # 10 MB

# How long the leader waits to accumulate data before responding to a fetch
replica.fetch.wait.max.ms=500

# Socket receive buffer for replica fetch connections
replica.socket.receive.buffer.bytes=65536

Validate replication throughput:

# Check replication bytes received per broker per second
kafka-jmx.sh \
  --object-name "kafka.server:type=BrokerTopicMetrics,name=ReplicationBytesInPerSec" \
  --attributes OneMinuteRate

# Check replica fetch request rate and latency
kafka-jmx.sh \
  --object-name "kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Fetch" \
  --attributes 99thPercentile

If replication throughput is substantially below the producer ingestion rate, and num.replica.fetchers is at 1, incrementing it to 4 and restarting the broker will typically produce an immediate and dramatic improvement.

Log Flush Strategy: Let the OS Manage It

This is the most debated Kafka parameter, and the answer is clear: rely on OS page cache write-back. Do not manually trigger fsyncs.

# Let the OS decide when to flush (this is the correct production setting)
log.flush.interval.messages=9223372036854775807   # Long.MAX_VALUE — effectively disabled
log.flush.interval.ms=9223372036854775807         # Long.MAX_VALUE — effectively disabled

Why forced fsyncs are counterproductive:

  1. Kafka's durability guarantee comes from replication, not fsync: With acks=all, a message is not acknowledged until it has been written to the OS page cache on all ISR replicas. If one broker crashes before fsync, the other replicas have complete data.
  2. OS write-back is reliable at the configured dirty page ratios: The vm.dirty_ratio=80 setting ensures dirty pages are flushed well before memory pressure becomes critical.
  3. Forced fsync is extremely expensive: An fsync blocks the writing thread until data is physically on disk. Under high write throughput, this converts sequential micro-second-latency writes into millisecond-latency disk operations, reducing throughput by 5-10x.

The only legitimate reason to set aggressive flush intervals is regulatory compliance requirements that mandate physical data persistence before acknowledgment — rare in practice.

Message Size and Batch Tuning

# Maximum message size the broker will accept (coordinate with producers and consumers)
# Default 1 MB is too small for many use cases (Avro schemas, binary payloads, video thumbnails)
message.max.bytes=10485760            # 10 MB

# Replica fetch size must be >= message.max.bytes
replica.fetch.max.bytes=10485760

Producer batching has a disproportionate impact on throughput. These settings belong in producer application configuration, not broker config:

Properties props = new Properties();
// Larger batches = fewer requests = higher throughput
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);          // 64 KB (default 16 KB is too small)
props.put(ProducerConfig.LINGER_MS_CONFIG, 5);              // Wait up to 5ms to fill a batch
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");   // LZ4: best throughput/CPU balance
// zstd: better compression ratio, more CPU — good for expensive network paths
// snappy: legacy, use lz4 or zstd instead

props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 67108864L);  // 64 MB in-flight buffer
props.put(ProducerConfig.SEND_BUFFER_CONFIG, 1048576);      // 1 MB TCP socket send buffer

Complete Performance Tuning Reference

Layer Parameter Recommended Default Impact
OS Disk Filesystem mount options noatime,nodiratime relatime 5-10% I/O reduction
OS Disk Storage topology JBOD varies 2-5x write performance vs RAID 5/6
OS Disk Filesystem XFS or ext4 varies Minor; mount options matter more
OS Memory vm.dirty_ratio 80 20 Sequential write efficiency
OS Memory vm.dirty_background_ratio 5 10 Write-back smoothness
OS Memory vm.swappiness 1 60 Prevents GC-induced swap thrashing
OS Network net.core.rmem_max 134217728 212992 High-throughput TCP
OS Network net.ipv4.tcp_window_scaling 1 1 Cross-AZ throughput
JVM Heap size 6-8 GB 1 GB GC stability vs page cache
JVM GC implementation G1GC or ZGC G1GC STW pause duration
JVM MaxGCPauseMillis 20 200 GC pause target
JVM InitiatingHeapOccupancyPercent 35 45 GC trigger timing
Broker num.io.threads 2 × CPU cores 8 Request handling capacity
Broker num.network.threads CPU cores 3 Network I/O throughput
Broker num.replica.fetchers 4-8 1 Replication throughput
Broker replica.fetch.max.bytes 10 MB+ 1 MB Replication batch efficiency
Broker log.flush.interval.messages Long.MAX_VALUE Long.MAX_VALUE Write throughput
Broker socket.send.buffer.bytes 1 MB 100 KB Network throughput
Broker message.max.bytes 10 MB 1 MB Maximum payload support
Producer batch.size 64 KB 16 KB Write throughput
Producer linger.ms 5-20 0 Batch fill efficiency
Producer compression.type lz4 or zstd none Network and disk savings

Benchmarking: Measuring What You Changed

After making tuning changes, always validate with benchmarks. Kafka ships with built-in performance testing tools:

# Producer throughput benchmark (--throughput -1 = unlimited, measures maximum)
kafka-producer-perf-test.sh \
  --topic perf-benchmark \
  --num-records 10000000 \
  --record-size 1024 \
  --throughput -1 \
  --producer-props \
    bootstrap.servers=kafka-broker-1:9092 \
    acks=all \
    batch.size=65536 \
    linger.ms=5 \
    compression.type=lz4

# Consumer throughput benchmark
kafka-consumer-perf-test.sh \
  --bootstrap-server kafka-broker-1:9092 \
  --topic perf-benchmark \
  --messages 10000000 \
  --fetch-size 1048576 \
  --group perf-benchmark-consumer

# End-to-end latency benchmark (rate-limited to measure latency distribution)
kafka-producer-perf-test.sh \
  --topic latency-benchmark \
  --num-records 100000 \
  --record-size 1024 \
  --throughput 10000 \
  --producer-props \
    bootstrap.servers=kafka-broker-1:9092 \
    acks=all

Expected performance baselines on a well-tuned cluster (3 brokers, 3× NVMe SSD each, 10GbE network, replication factor 3):

If your cluster falls significantly below these numbers after OS and JVM tuning, the broker layer is likely the next focus area: check num.replica.fetchers, num.io.threads, and whether page cache is being evicted by a competing heap allocation.

Summary: The Layered Tuning Methodology

The single biggest mistake in Kafka performance tuning is skipping OS configuration and jumping straight to broker parameters. The correct sequence:

  1. OS disk layer: Choose the right storage architecture (JBOD), set filesystem mount options, and baseline raw I/O performance with fio.
  2. OS memory layer: Verify the JVM heap is sized conservatively (6-8 GB), check that dirty page write-back ratios allow efficient batching, set swappiness to 1.
  3. OS network layer: Expand TCP socket buffers, enable window scaling, increase connection backlog limits.
  4. JVM layer: Set heap to the sweet spot, choose G1GC (general) or ZGC (ultra-low latency), tune GC triggers.
  5. Broker layer: Align thread pools with CPU core count, increase replica fetcher concurrency, trust the OS for fsync timing.

Measure one layer at a time. Changing three parameters simultaneously makes it impossible to determine which change produced the observed effect — or which change caused the new regression.

Rate this chapter
4.9  / 5  (3 ratings)

💬 Comments