Chapter 8

Producer Tuning: Reaching 1 Million TPS

Theory provides direction; benchmarks provide truth. This chapter follows a single narrative arc — taking a real pipeline from 50K TPS to 1 million TPS — while systematically explaining the why behind every optimization. Along the way we cover compression algorithm internals, the ordering implications of max.in.flight, and three production-ready configuration templates for different SLA profiles.

Compression: The Highest-Leverage Optimization

Compression simultaneously reduces network bandwidth, broker disk usage, and replication traffic. In most Kafka clusters, network and disk I/O are the binding constraints long before CPU is saturated, making compression the single most impactful configuration change you can make.

Five Algorithms: A Deep Comparison

Kafka 3.7 supports five compression types. The following benchmarks use 1 million 1KB JSON records on a 4-core producer host:

Algorithm Ratio Compress Speed Decompress Speed CPU Overhead Best For
none 1.0x None Already-compressed payloads (images, Parquet, video)
gzip 3.8x Slow Medium High (~25%) Batch archival, storage cost sensitive
snappy 2.3x Fast Fast Low (~8%) Legacy systems requiring broad compatibility
lz4 2.1x Very fast Very fast Very low (~5%) Low-latency with moderate compression needs
zstd 3.5x Fast Fast Medium (~12%) Best overall choice for Kafka 2.1+

Why zstd Wins in 2024

Facebook developed Zstandard (zstd) to unify the compression space: near-gzip ratios at near-lz4 speeds. Its key advantage over alternatives is adjustable compression levels (1-22), letting you tune the CPU/ratio tradeoff at runtime. For JSON and text payloads typical of Kafka workloads, zstd achieves ~30% better compression than snappy with similar CPU cost.

The ratio difference materializes as concrete savings. If your topic produces 1 TB/day uncompressed:

compression.type=zstd

Where Compression Happens (and Why Batch Size Matters)

Compression is applied on the main thread, at the ProducerBatch level, not per-record. MemoryRecordsBuilder compresses the entire batch just before it transitions to the READY state.

Two consequences:

1. Larger batches compress better. Compression algorithms exploit repetition across the input. A batch containing 100 JSON records with similar field names achieves far higher compression than 100 individually-compressed records. This creates a virtuous cycle: batch.size=64KB + linger.ms=5 → larger batches → better compression → fewer bytes on wire → higher effective throughput.

2. Compression time is on your critical path (partially). The main thread compresses before the batch is ready for the Sender. If you're CPU-bound on the producer, choose lz4 over zstd. If you're I/O-bound (typical), zstd is better.

Broker-Side Compression Interaction

By default, log.compression.type=producer on the broker — it stores data in whatever format the producer sent. If you configure a different log.compression.type, the broker recompresses every incoming batch, which adds CPU load and latency. Keep producer and broker compression types aligned:

# Check topic-level compression config
kafka-topics.sh --bootstrap-server kafka1:9092 \
  --describe --topic orders
# Look for: Configs: compression.type=zstd

Benchmarking: Establishing Your Baseline

Before tuning anything, measure. The kafka-producer-perf-test.sh tool drives the official benchmark:

kafka-producer-perf-test.sh \
  --topic perf-test \
  --num-records 5000000 \
  --record-size 1024 \
  --throughput -1 \
  --producer-props \
    bootstrap.servers=kafka1:9092,kafka2:9092,kafka3:9092 \
    acks=1 \
    compression.type=none \
    batch.size=16384 \
    linger.ms=0 \
    buffer.memory=33554432 \
    max.in.flight.requests.per.connection=5

# Sample output:
# 5000000 records sent, 52341.2 records/sec (51.2 MB/sec),
# 1.23 ms avg latency, 234.00 ms max latency,
# 1 ms 50th, 3 ms 95th, 18 ms 99th, 234 ms 99.9th.

Interpreting the output:

Run each configuration three times and discard the first (JVM warmup). Report the median.

Case Study: 50K TPS to 1 Million TPS

This is a real e-commerce order event pipeline: average message size 1.2KB, mixed key/value JSON. The initial deployment delivered 50K TPS.

Baseline Diagnosis (50K TPS)

Starting configuration — essentially all defaults:

batch.size=16384
linger.ms=0
buffer.memory=33554432
compression.type=none
acks=1
max.in.flight.requests.per.connection=5

JMX observation:

Metric Value Interpretation
batch-size-avg ~820 bytes Batches are nearly empty
records-per-request-avg ~1.2 Virtually no batching happening
buffer-available-bytes ~30MB Memory is not the constraint
Network utilization ~15% Far from saturated
CPU (broker) ~8% Plenty of headroom

Root cause: With linger.ms=0, every message that arrives during a Sender poll cycle forms its own batch. The Sender is spending most of its time in request/response cycles rather than sending data. This is a classic batching inefficiency problem — the fix has nothing to do with adding hardware.

Step 1: Larger Batches + Linger → 200K TPS

batch.size=65536         # 64KB (4x increase)
linger.ms=5              # 5ms accumulation window
buffer.memory=134217728  # 128MB (prevent buffer becoming bottleneck)

After these three changes alone:

Why does this work so dramatically? With linger.ms=5, the Sender waits up to 5ms to let the accumulator fill. In 5ms at our message rate, ~48 records arrive per partition. They're packed into one 58KB batch and sent in one ProduceRequest. The broker processes one request instead of 48 — that's 47 fewer request headers, 47 fewer acknowledgment round-trips, dramatically lower per-message overhead.

Step 2: Enable Compression → 450K TPS

compression.type=zstd

With 64KB batches of similar JSON records, zstd achieves approximately 3.5x compression. Each 64KB batch compresses to ~18KB. This:

TPS: 200K → 450K

One thing to verify: with linger.ms=5, batches consistently reach near-batch.size before they're drained. If batch-size-avg is well below batch.size, the linger time is dominating (message rate is too low) or messages are arriving in bursts and not accumulating well.

Step 3: Increase Parallelism → 800K TPS

A Kafka partition is the unit of parallelism. Multiple producers (or multiple threads) writing to a single partition contend for the partition lock on the broker. Increasing partition count directly increases broker-side write parallelism:

# Increase from 6 to 24 partitions
kafka-topics.sh --bootstrap-server kafka1:9092 \
  --alter --topic orders --partitions 24

# Run 4 parallel producer processes
for i in {1..4}; do
  kafka-producer-perf-test.sh \
    --topic orders \
    --num-records 1250000 \
    --record-size 1024 \
    --throughput -1 \
    --producer-props \
      bootstrap.servers=kafka1:9092,kafka2:9092,kafka3:9092 \
      batch.size=65536 \
      linger.ms=5 \
      compression.type=zstd \
      buffer.memory=134217728 &
done
wait

TPS: 450K → 800K (4 processes × ~200K per process, with some overhead)

Why not just run more partitions without limit? Each partition consumes:

For a 3-broker cluster, 24 partitions × 3 replicas = 72 partition-replicas, each broker holding 24. This is sustainable. At 2,400 partitions it starts to hurt. Kafka's general guidance: 10-100 partitions per broker is a healthy operating range.

Step 4: Fine-Tune for 1 Million TPS

At 800K TPS the remaining gains come from squeezing out per-request overhead:

# Increase batch size further — at 800K TPS, batches fill quickly
batch.size=131072          # 128KB
linger.ms=10               # Can afford slightly longer window at this throughput

# Increase buffer memory proportionally
buffer.memory=268435456    # 256MB (8 partitions × 128KB × pipeline depth)

# Socket buffer tuning (also set OS-side with sysctl)
send.buffer.bytes=131072   # 128KB TCP send buffer
receive.buffer.bytes=65536 # 64KB TCP receive buffer

After tuning: 1 million TPS sustained.

max.in.flight and Message Ordering: The Full Story

max.in.flight.requests.per.connection is the most commonly misunderstood tuning knob.

How Ordering Breaks Without Idempotence

With max.in.flight=5 and enable.idempotence=false:

Timeline:
T1: Sender dispatches Batch-A (seq 0-99) and Batch-B (seq 100-199) concurrently
T2: Broker writes Batch-B successfully, sends ACK-B
T3: ACK-A times out (transient network blip)
T4: Sender puts Batch-A back in retry queue; dispatches Batch-C (seq 200-299)
T5: Broker writes Batch-C; ACK-C received
T6: Batch-A retried; Broker writes it

Final partition log:
Batch-B[100-199] → Batch-C[200-299] → Batch-A[0-99]
                                              ↑ Reordered!

The retry of Batch-A "jumps ahead" in the logical sequence but lands after later batches in the physical log. Consumers reading this partition will see records 100-199, then 200-299, then 0-99 — completely wrong order.

Safe Configuration Matrix

Ordering Requirement Idempotence max.in.flight Throughput Impact
Strict order required, EOS true 1 Lowest (~60% of baseline)
Strict order required true 5 Medium (~85% of baseline)
Best-effort order, idempotence desired true 5 Medium
No order requirement, max throughput false 10-20 Highest

With enable.idempotence=true, the broker detects sequence gaps and rejects out-of-order deliveries from the same (PID, partition) pair. Retried batches are correctly placed because the broker can distinguish between "a retry of batch I've seen" (duplicate) and "a batch with an unexpected gap" (error). The net effect: up to 5 in-flight requests with idempotence are safe for ordering; the broker handles the reordering at ACK time.

Three Production Configuration Templates

Template 1: High Throughput (Log Archival, Analytics, Clickstream)

bootstrap.servers=kafka1:9092,kafka2:9092,kafka3:9092
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer

# Batching — maximize batch fill rate
batch.size=131072             # 128KB
linger.ms=20                  # 20ms accumulation window

# Memory — large pool to absorb bursts
buffer.memory=268435456       # 256MB
max.block.ms=60000            # 60s before throwing TimeoutException

# Reliability — leader-only ACK for speed
acks=1
retries=3
retry.backoff.ms=100
delivery.timeout.ms=120000

# Compression — best ratio for storage/network savings
compression.type=zstd

# Parallelism
max.in.flight.requests.per.connection=10  # safe without idempotence if no ordering needed

Expected performance: 800K-1.5M TPS with 20-50ms P99 latency (network-dependent).

Template 2: Low Latency (Real-Time Alerting, Interactive Events)

bootstrap.servers=kafka1:9092
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.StringSerializer

# Batching — send immediately, do not accumulate
batch.size=16384              # 16KB (default)
linger.ms=0                   # No waiting

# Memory — modest; low-throughput scenario
buffer.memory=33554432        # 32MB
max.block.ms=5000             # Fail fast: 5s

# Reliability
acks=1
retries=1
delivery.timeout.ms=5000      # 5s max delivery time

# No compression (saves CPU time, reduces latency by ~0.5ms)
compression.type=none

# Minimize queuing depth
max.in.flight.requests.per.connection=1

Expected performance: 10K-50K TPS, P99 latency < 5ms.

Template 3: Reliable Delivery (Financial Transactions, Orders)

bootstrap.servers=kafka1:9092,kafka2:9092,kafka3:9092
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer

# Batching — balanced
batch.size=65536              # 64KB
linger.ms=5                   # 5ms window

# Memory
buffer.memory=134217728       # 128MB
max.block.ms=30000            # 30s before failing

# Reliability — exactly-once
acks=all
enable.idempotence=true
transactional.id=payment-processor-${INSTANCE_ID}  # unique per instance
max.in.flight.requests.per.connection=5            # max safe with idempotence
retries=2147483647            # effectively infinite (enforced by delivery.timeout.ms)
delivery.timeout.ms=120000    # 2 minutes max delivery window

# Compression
compression.type=zstd

# Timeouts
request.timeout.ms=30000

Expected performance: 200K-400K TPS, P99 latency 10-40ms.

JVM Tuning for High-Throughput Producers

Producer-embedded applications benefit from GC tuning. G1GC is the recommended collector for Kafka 3.7+:

# JVM flags for a producer embedded in a service process
JAVA_OPTS="\
  -Xms4g -Xmx4g \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=20 \
  -XX:InitiatingHeapOccupancyPercent=35 \
  -XX:G1HeapRegionSize=16m \
  -XX:+ParallelRefProcEnabled \
  -XX:+DisableExplicitGC \
  -Xlog:gc*:file=/var/log/app/gc.log:time,tags:filecount=10,filesize=100m"

Key principle: buffer.memory (JVM heap) + application heap + OS page cache must all fit within the host's physical memory. A common mistake is allocating buffer.memory=512MB on a 2GB host running a 1.5GB JVM — the OS has nothing left for page cache, turning every Kafka read into disk I/O.

Production Tuning Checklist

# 1. Verify partition count is sufficient
kafka-topics.sh --bootstrap-server kafka1:9092 \
  --describe --topic your-topic | grep PartitionCount

# 2. Watch producer metrics in real time
kafka-run-class.sh kafka.tools.JmxTool \
  --object-name "kafka.producer:type=producer-metrics,client-id=YOUR_CLIENT_ID" \
  --attributes "record-send-rate,batch-size-avg,records-per-request-avg,\
                compression-rate-avg,buffer-available-bytes,\
                buffer-exhausted-rate,record-error-rate,request-latency-avg" \
  --jmx-url "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi" \
  --reporting-interval 5000

# 3. Diagnose buffer exhaustion (main thread blocking root cause)
#    If buffer-exhausted-rate > 0: either increase buffer.memory
#    or reduce production rate or fix slow broker/network

# 4. Verify batching efficiency
#    batch-size-avg should be ≥ 80% of batch.size for high-throughput workloads
#    If lower: increase linger.ms or increase batch.size

# 5. Check for send errors
#    record-error-rate > 0 sustained: look at broker logs for the cause

One million TPS is not magic. It is the compounded result of eliminating bottlenecks one layer at a time: batching efficiency, compression ratio, partition-level parallelism, memory headroom, and JVM behavior. Each bottleneck announces itself through a specific JMX metric. The skill is knowing which metric maps to which configuration, and understanding why the fix works rather than just copying a template.

Rate this chapter
4.8  / 5  (45 ratings)

💬 Comments