Producer Tuning: Reaching 1 Million TPS
Theory provides direction; benchmarks provide truth. This chapter follows a single narrative arc โ taking a real pipeline from 50K TPS to 1 million TPS โ while systematically explaining the why behind every optimization. Along the way we cover compression algorithm internals, the ordering implications of max.in.flight, and three production-ready configuration templates for different SLA profiles.
Compression: The Highest-Leverage Optimization
Compression simultaneously reduces network bandwidth, broker disk usage, and replication traffic. In most Kafka clusters, network and disk I/O are the binding constraints long before CPU is saturated, making compression the single most impactful configuration change you can make.
Five Algorithms: A Deep Comparison
Kafka 3.7 supports five compression types. The following benchmarks use 1 million 1KB JSON records on a 4-core producer host:
| Algorithm | Ratio | Compress Speed | Decompress Speed | CPU Overhead | Best For |
|---|---|---|---|---|---|
none |
1.0x | โ | โ | None | Already-compressed payloads (images, Parquet, video) |
gzip |
3.8x | Slow | Medium | High (~25%) | Batch archival, storage cost sensitive |
snappy |
2.3x | Fast | Fast | Low (~8%) | Legacy systems requiring broad compatibility |
lz4 |
2.1x | Very fast | Very fast | Very low (~5%) | Low-latency with moderate compression needs |
zstd |
3.5x | Fast | Fast | Medium (~12%) | Best overall choice for Kafka 2.1+ |
Why zstd Wins in 2024
Facebook developed Zstandard (zstd) to unify the compression space: near-gzip ratios at near-lz4 speeds. Its key advantage over alternatives is adjustable compression levels (1-22), letting you tune the CPU/ratio tradeoff at runtime. For JSON and text payloads typical of Kafka workloads, zstd achieves ~30% better compression than snappy with similar CPU cost.
The ratio difference materializes as concrete savings. If your topic produces 1 TB/day uncompressed:
snappyat 2.3x: ~435 GB/day stored, ~435 GB/day replicated per replicazstdat 3.5x: ~286 GB/day stored, ~286 GB/day replicated per replica- Annual delta: ~54 TB less storage per replica. On three-replica clusters with SSDs: significant cost.
compression.type=zstd
Where Compression Happens (and Why Batch Size Matters)
Compression is applied on the main thread, at the ProducerBatch level, not per-record. MemoryRecordsBuilder compresses the entire batch just before it transitions to the READY state.
Two consequences:
1. Larger batches compress better. Compression algorithms exploit repetition across the input. A batch containing 100 JSON records with similar field names achieves far higher compression than 100 individually-compressed records. This creates a virtuous cycle: batch.size=64KB + linger.ms=5 โ larger batches โ better compression โ fewer bytes on wire โ higher effective throughput.
2. Compression time is on your critical path (partially). The main thread compresses before the batch is ready for the Sender. If you're CPU-bound on the producer, choose lz4 over zstd. If you're I/O-bound (typical), zstd is better.
Broker-Side Compression Interaction
By default, log.compression.type=producer on the broker โ it stores data in whatever format the producer sent. If you configure a different log.compression.type, the broker recompresses every incoming batch, which adds CPU load and latency. Keep producer and broker compression types aligned:
# Check topic-level compression config
kafka-topics.sh --bootstrap-server kafka1:9092 \
--describe --topic orders
# Look for: Configs: compression.type=zstd
Benchmarking: Establishing Your Baseline
Before tuning anything, measure. The kafka-producer-perf-test.sh tool drives the official benchmark:
kafka-producer-perf-test.sh \
--topic perf-test \
--num-records 5000000 \
--record-size 1024 \
--throughput -1 \
--producer-props \
bootstrap.servers=kafka1:9092,kafka2:9092,kafka3:9092 \
acks=1 \
compression.type=none \
batch.size=16384 \
linger.ms=0 \
buffer.memory=33554432 \
max.in.flight.requests.per.connection=5
# Sample output:
# 5000000 records sent, 52341.2 records/sec (51.2 MB/sec),
# 1.23 ms avg latency, 234.00 ms max latency,
# 1 ms 50th, 3 ms 95th, 18 ms 99th, 234 ms 99.9th.
Interpreting the output:
records/sec: Raw throughput โ your primary tuning targetavg latency: Mean time fromsend()call to ACK reception99th / 99.9th percentile: Tail latency โ a 99th of 18ms means 1% of messages took 18+ms; this is what kills SLA agreementsmax latency: Usually a warmup artifact or GC pause; focus on 99.9th
Run each configuration three times and discard the first (JVM warmup). Report the median.
Case Study: 50K TPS to 1 Million TPS
This is a real e-commerce order event pipeline: average message size 1.2KB, mixed key/value JSON. The initial deployment delivered 50K TPS.
Baseline Diagnosis (50K TPS)
Starting configuration โ essentially all defaults:
batch.size=16384
linger.ms=0
buffer.memory=33554432
compression.type=none
acks=1
max.in.flight.requests.per.connection=5
JMX observation:
| Metric | Value | Interpretation |
|---|---|---|
batch-size-avg |
~820 bytes | Batches are nearly empty |
records-per-request-avg |
~1.2 | Virtually no batching happening |
buffer-available-bytes |
~30MB | Memory is not the constraint |
| Network utilization | ~15% | Far from saturated |
| CPU (broker) | ~8% | Plenty of headroom |
Root cause: With linger.ms=0, every message that arrives during a Sender poll cycle forms its own batch. The Sender is spending most of its time in request/response cycles rather than sending data. This is a classic batching inefficiency problem โ the fix has nothing to do with adding hardware.
Step 1: Larger Batches + Linger โ 200K TPS
batch.size=65536 # 64KB (4x increase)
linger.ms=5 # 5ms accumulation window
buffer.memory=134217728 # 128MB (prevent buffer becoming bottleneck)
After these three changes alone:
batch-size-avg: 820 bytes โ ~58KB (71x improvement)records-per-request-avg: 1.2 โ ~48- TPS: 50K โ 200K
Why does this work so dramatically? With linger.ms=5, the Sender waits up to 5ms to let the accumulator fill. In 5ms at our message rate, ~48 records arrive per partition. They're packed into one 58KB batch and sent in one ProduceRequest. The broker processes one request instead of 48 โ that's 47 fewer request headers, 47 fewer acknowledgment round-trips, dramatically lower per-message overhead.
Step 2: Enable Compression โ 450K TPS
compression.type=zstd
With 64KB batches of similar JSON records, zstd achieves approximately 3.5x compression. Each 64KB batch compresses to ~18KB. This:
- Reduces network bytes by ~72%
- Reduces broker disk writes by ~72%
- Allows the same network capacity to carry 3.5x more messages
TPS: 200K โ 450K
One thing to verify: with linger.ms=5, batches consistently reach near-batch.size before they're drained. If batch-size-avg is well below batch.size, the linger time is dominating (message rate is too low) or messages are arriving in bursts and not accumulating well.
Step 3: Increase Parallelism โ 800K TPS
A Kafka partition is the unit of parallelism. Multiple producers (or multiple threads) writing to a single partition contend for the partition lock on the broker. Increasing partition count directly increases broker-side write parallelism:
# Increase from 6 to 24 partitions
kafka-topics.sh --bootstrap-server kafka1:9092 \
--alter --topic orders --partitions 24
# Run 4 parallel producer processes
for i in {1..4}; do
kafka-producer-perf-test.sh \
--topic orders \
--num-records 1250000 \
--record-size 1024 \
--throughput -1 \
--producer-props \
bootstrap.servers=kafka1:9092,kafka2:9092,kafka3:9092 \
batch.size=65536 \
linger.ms=5 \
compression.type=zstd \
buffer.memory=134217728 &
done
wait
TPS: 450K โ 800K (4 processes ร ~200K per process, with some overhead)
Why not just run more partitions without limit? Each partition consumes:
- ~1 open file handle per segment (default segment size 1GB)
- Memory for the partition metadata in broker
- Replication connections per-follower
For a 3-broker cluster, 24 partitions ร 3 replicas = 72 partition-replicas, each broker holding 24. This is sustainable. At 2,400 partitions it starts to hurt. Kafka's general guidance: 10-100 partitions per broker is a healthy operating range.
Step 4: Fine-Tune for 1 Million TPS
At 800K TPS the remaining gains come from squeezing out per-request overhead:
# Increase batch size further โ at 800K TPS, batches fill quickly
batch.size=131072 # 128KB
linger.ms=10 # Can afford slightly longer window at this throughput
# Increase buffer memory proportionally
buffer.memory=268435456 # 256MB (8 partitions ร 128KB ร pipeline depth)
# Socket buffer tuning (also set OS-side with sysctl)
send.buffer.bytes=131072 # 128KB TCP send buffer
receive.buffer.bytes=65536 # 64KB TCP receive buffer
After tuning: 1 million TPS sustained.
max.in.flight and Message Ordering: The Full Story
max.in.flight.requests.per.connection is the most commonly misunderstood tuning knob.
How Ordering Breaks Without Idempotence
With max.in.flight=5 and enable.idempotence=false:
Timeline:
T1: Sender dispatches Batch-A (seq 0-99) and Batch-B (seq 100-199) concurrently
T2: Broker writes Batch-B successfully, sends ACK-B
T3: ACK-A times out (transient network blip)
T4: Sender puts Batch-A back in retry queue; dispatches Batch-C (seq 200-299)
T5: Broker writes Batch-C; ACK-C received
T6: Batch-A retried; Broker writes it
Final partition log:
Batch-B[100-199] โ Batch-C[200-299] โ Batch-A[0-99]
โ Reordered!
The retry of Batch-A "jumps ahead" in the logical sequence but lands after later batches in the physical log. Consumers reading this partition will see records 100-199, then 200-299, then 0-99 โ completely wrong order.
Safe Configuration Matrix
| Ordering Requirement | Idempotence | max.in.flight | Throughput Impact |
|---|---|---|---|
| Strict order required, EOS | true |
1 | Lowest (~60% of baseline) |
| Strict order required | true |
5 | Medium (~85% of baseline) |
| Best-effort order, idempotence desired | true |
5 | Medium |
| No order requirement, max throughput | false |
10-20 | Highest |
With enable.idempotence=true, the broker detects sequence gaps and rejects out-of-order deliveries from the same (PID, partition) pair. Retried batches are correctly placed because the broker can distinguish between "a retry of batch I've seen" (duplicate) and "a batch with an unexpected gap" (error). The net effect: up to 5 in-flight requests with idempotence are safe for ordering; the broker handles the reordering at ACK time.
Three Production Configuration Templates
Template 1: High Throughput (Log Archival, Analytics, Clickstream)
bootstrap.servers=kafka1:9092,kafka2:9092,kafka3:9092
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
# Batching โ maximize batch fill rate
batch.size=131072 # 128KB
linger.ms=20 # 20ms accumulation window
# Memory โ large pool to absorb bursts
buffer.memory=268435456 # 256MB
max.block.ms=60000 # 60s before throwing TimeoutException
# Reliability โ leader-only ACK for speed
acks=1
retries=3
retry.backoff.ms=100
delivery.timeout.ms=120000
# Compression โ best ratio for storage/network savings
compression.type=zstd
# Parallelism
max.in.flight.requests.per.connection=10 # safe without idempotence if no ordering needed
Expected performance: 800K-1.5M TPS with 20-50ms P99 latency (network-dependent).
Template 2: Low Latency (Real-Time Alerting, Interactive Events)
bootstrap.servers=kafka1:9092
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.StringSerializer
# Batching โ send immediately, do not accumulate
batch.size=16384 # 16KB (default)
linger.ms=0 # No waiting
# Memory โ modest; low-throughput scenario
buffer.memory=33554432 # 32MB
max.block.ms=5000 # Fail fast: 5s
# Reliability
acks=1
retries=1
delivery.timeout.ms=5000 # 5s max delivery time
# No compression (saves CPU time, reduces latency by ~0.5ms)
compression.type=none
# Minimize queuing depth
max.in.flight.requests.per.connection=1
Expected performance: 10K-50K TPS, P99 latency < 5ms.
Template 3: Reliable Delivery (Financial Transactions, Orders)
bootstrap.servers=kafka1:9092,kafka2:9092,kafka3:9092
key.serializer=org.apache.kafka.common.serialization.StringSerializer
value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
# Batching โ balanced
batch.size=65536 # 64KB
linger.ms=5 # 5ms window
# Memory
buffer.memory=134217728 # 128MB
max.block.ms=30000 # 30s before failing
# Reliability โ exactly-once
acks=all
enable.idempotence=true
transactional.id=payment-processor-${INSTANCE_ID} # unique per instance
max.in.flight.requests.per.connection=5 # max safe with idempotence
retries=2147483647 # effectively infinite (enforced by delivery.timeout.ms)
delivery.timeout.ms=120000 # 2 minutes max delivery window
# Compression
compression.type=zstd
# Timeouts
request.timeout.ms=30000
Expected performance: 200K-400K TPS, P99 latency 10-40ms.
JVM Tuning for High-Throughput Producers
Producer-embedded applications benefit from GC tuning. G1GC is the recommended collector for Kafka 3.7+:
# JVM flags for a producer embedded in a service process
JAVA_OPTS="\
-Xms4g -Xmx4g \
-XX:+UseG1GC \
-XX:MaxGCPauseMillis=20 \
-XX:InitiatingHeapOccupancyPercent=35 \
-XX:G1HeapRegionSize=16m \
-XX:+ParallelRefProcEnabled \
-XX:+DisableExplicitGC \
-Xlog:gc*:file=/var/log/app/gc.log:time,tags:filecount=10,filesize=100m"
Key principle: buffer.memory (JVM heap) + application heap + OS page cache must all fit within the host's physical memory. A common mistake is allocating buffer.memory=512MB on a 2GB host running a 1.5GB JVM โ the OS has nothing left for page cache, turning every Kafka read into disk I/O.
Production Tuning Checklist
# 1. Verify partition count is sufficient
kafka-topics.sh --bootstrap-server kafka1:9092 \
--describe --topic your-topic | grep PartitionCount
# 2. Watch producer metrics in real time
kafka-run-class.sh kafka.tools.JmxTool \
--object-name "kafka.producer:type=producer-metrics,client-id=YOUR_CLIENT_ID" \
--attributes "record-send-rate,batch-size-avg,records-per-request-avg,\
compression-rate-avg,buffer-available-bytes,\
buffer-exhausted-rate,record-error-rate,request-latency-avg" \
--jmx-url "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi" \
--reporting-interval 5000
# 3. Diagnose buffer exhaustion (main thread blocking root cause)
# If buffer-exhausted-rate > 0: either increase buffer.memory
# or reduce production rate or fix slow broker/network
# 4. Verify batching efficiency
# batch-size-avg should be โฅ 80% of batch.size for high-throughput workloads
# If lower: increase linger.ms or increase batch.size
# 5. Check for send errors
# record-error-rate > 0 sustained: look at broker logs for the cause
One million TPS is not magic. It is the compounded result of eliminating bottlenecks one layer at a time: batching efficiency, compression ratio, partition-level parallelism, memory headroom, and JVM behavior. Each bottleneck announces itself through a specific JMX metric. The skill is knowing which metric maps to which configuration, and understanding why the fix works rather than just copying a template.