Kafka for Java Developers: Beyond the Hello World
The first Kafka tutorial you find online shows you how to send a string and read it back. That part is easy. The hard part is everything that comes after — understanding what "at least once" really means when your consumer crashes mid-batch, why your consumer group is stuck on one partition, or why messages are being processed out of order even though you were sure Kafka guarantees ordering.
This post skips the basic setup and goes straight to the concepts and code patterns that actually matter once you're past the hello world. We'll cover the data model, producers, consumers, consumer groups, offset management, and a handful of production pitfalls I've run into personally.
The Data Model in Plain Terms
Kafka stores messages in topics. A topic is divided into partitions — append-only, ordered logs on disk. Each message written to a partition gets a monotonically increasing offset, which is just a number. That's it. There is no global ordering across partitions, only per-partition ordering.
When you publish a message, Kafka decides which partition it goes to. If you provide
a key, Kafka hashes it to pick a partition — the same key always lands on the same
partition, which is how you get per-entity ordering (all events for a given
orderId, for example). If you publish without a key, the partitioner spreads messages
across partitions (newer clients use sticky batching rather than strict round-robin,
but the effect is the same: no ordering relationship between keyless messages).
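The key-to-partition mapping can be sketched in plain Java. The real producer hashes the serialized key bytes with murmur2; this simplified stand-in uses hashCode, but it demonstrates the property that matters: the same key always maps to the same partition.

```java
public class PartitionSketch {
    // Simplified: Kafka actually applies murmur2 to the key bytes, but the
    // important property is the same — the mapping is deterministic.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int p1 = partitionFor("order-42", 4);
        int p2 = partitionFor("order-42", 4);
        System.out.println(p1 == p2); // same key, same partition, every time
    }
}
```

This is also why adding partitions to an existing topic reshuffles the mapping: the modulus changes, so existing keys may start landing on different partitions.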
Consumers don't delete messages when they read them. They just advance their offset. This means you can replay a topic from offset 0 at any time, have multiple independent consumer groups reading the same topic without interfering with each other, and retain messages for as long as your retention policy allows (time- or size-based).
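Replaying needs nothing special: any consumer can seek back to the start. With the console consumer that ships with Kafka (broker address and topic name here are illustrative):

```shell
# read the whole topic from offset 0, ignoring any committed offsets
bin/kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic orders \
  --from-beginning
```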
Producing Messages
In Spring Boot, KafkaTemplate is the main abstraction. A minimal
producer looks like this:
@Service
public class OrderEventPublisher {

    private final KafkaTemplate<String, OrderEvent> kafkaTemplate;

    public OrderEventPublisher(KafkaTemplate<String, OrderEvent> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void publish(OrderEvent event) {
        kafkaTemplate.send("orders", event.orderId(), event);
    }
}
The second argument to send() is the message key — here we use
orderId so all events for the same order land on the same partition
and are processed in order.
KafkaTemplate.send() returns a CompletableFuture. By
default it's fire-and-forget — if you want to know whether the broker acknowledged
the message, you need to handle the future:
public void publishWithConfirmation(OrderEvent event) {
    kafkaTemplate.send("orders", event.orderId(), event)
        .whenComplete((result, ex) -> {
            if (ex != null) {
                log.error("Failed to publish event for order {}", event.orderId(), ex);
                // dead-letter, retry queue, or alerting here
            } else {
                log.debug("Published to partition {} offset {}",
                    result.getRecordMetadata().partition(),
                    result.getRecordMetadata().offset());
            }
        });
}
The key producer config that influences reliability is acks. Set it to
all (or -1) and Kafka will wait for all in-sync replicas
to acknowledge before confirming the write. This is slower but means you won't lose
a message if the leader broker crashes immediately after accepting it.
# application.yml
spring:
  kafka:
    producer:
      acks: all
      retries: 3
      properties:
        enable.idempotence: true
        max.in.flight.requests.per.connection: 5
enable.idempotence: true pairs with acks: all to stop producer retries from
creating duplicates: the broker deduplicates retried messages using a sequence
number it tracks per producer, per partition. Note that this covers only the
producer's writes within a single session — it says nothing about your consumer
processing a message twice.
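The broker-side bookkeeping is roughly: remember the last sequence number accepted from each producer on each partition, and drop anything at or below it. A toy model of that idea, not Kafka's actual code:

```java
import java.util.HashMap;
import java.util.Map;

public class DedupSketch {
    // last sequence number accepted, keyed by "producerId:partition"
    private final Map<String, Long> lastSeq = new HashMap<>();

    /** Returns true if the record is new, false if it is a retried duplicate. */
    public boolean accept(String producerId, int partition, long sequence) {
        String k = producerId + ":" + partition;
        long last = lastSeq.getOrDefault(k, -1L);
        if (sequence <= last) {
            return false; // duplicate caused by a producer retry — drop it
        }
        lastSeq.put(k, sequence);
        return true;
    }
}
```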
Consuming Messages
The consumer side is where most of the interesting problems live. A basic listener with Spring Kafka:
@Component
public class OrderEventConsumer {

    @KafkaListener(topics = "orders", groupId = "order-processor")
    public void handle(OrderEvent event) {
        processOrder(event);
    }
}
The groupId is critical. Kafka assigns partitions to consumers within
the same group — each partition is owned by exactly one consumer at a time. If you
have a topic with 4 partitions and start 4 instances of your service, each instance
gets one partition. If you start a 5th instance, it sits idle — there are no
partitions left to assign.
Consumers in different groups each get a full copy of the topic, completely
independently. This is useful when multiple services need to react to the same events
— an order-processor and a notification-service can both
subscribe to orders without any coordination.
Offsets and Commit Semantics
The plain Kafka client defaults to enable.auto.commit=true, which commits
offsets in the background on a timer. This is convenient but dangerous: the
commit can fire after your consumer has fetched a batch but before it has
finished processing it. If the application then crashes before saving the
results to your database, the messages are gone — Kafka thinks they were
processed. (Spring Kafka disables auto-commit by default and commits after the
listener returns, which is safer, but still commits a whole batch at a time.)
For explicit, per-record control, use manual acknowledgement with AckMode.MANUAL_IMMEDIATE:
@KafkaListener(topics = "payments", groupId = "payment-processor")
public void handle(PaymentEvent event, Acknowledgment ack) {
    try {
        paymentService.process(event);
        ack.acknowledge(); // only commit after successful processing
    } catch (Exception e) {
        log.error("Failed to process payment {}", event.paymentId(), e);
        // do NOT ack — message will be redelivered
    }
}
# application.yml
spring:
  kafka:
    listener:
      ack-mode: MANUAL_IMMEDIATE
    consumer:
      enable-auto-commit: false
This gives you at-least-once delivery: messages are redelivered on failure, but your handler might see the same message more than once. That means your processing logic should be idempotent — processing the same event twice should produce the same result as processing it once. The classic pattern is storing the event ID and checking for duplicates before writing.
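That dedup-check pattern can be sketched like this, using an in-memory set where production code would use a database table with a unique constraint on the event ID (class and method names here are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

public class IdempotentHandler {
    // stand-in for a processed_events table with a unique key on event_id
    private final Set<String> processedIds = new HashSet<>();

    /** Returns true if the event was processed, false if it was a duplicate. */
    public boolean handle(String eventId, Runnable processing) {
        if (!processedIds.add(eventId)) {
            return false; // already seen: redelivery after a crash or rebalance
        }
        processing.run();
        return true;
    }
}
```

In a real service the ID check and the business write should happen in the same database transaction, so a crash between them can't record the event as processed without its effects.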
Dead Letter Topics
Not all failures are transient. Sometimes a message is simply malformed or triggers
a bug that will never resolve itself — retrying it forever would block the partition.
The solution is a dead letter topic (DLT). Spring Kafka's
DeadLetterPublishingRecoverer makes this straightforward:
@Bean
public DefaultErrorHandler errorHandler(
        KafkaTemplate<Object, Object> template) {
    var recoverer = new DeadLetterPublishingRecoverer(template);
    var backoff = new ExponentialBackOff(1_000L, 2.0);
    backoff.setMaxElapsedTime(30_000L); // give up after 30 s
    return new DefaultErrorHandler(recoverer, backoff);
}
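With a 1 s initial interval and a multiplier of 2, the retry delays grow as 1 s, 2 s, 4 s, and so on until the 30 s budget is exhausted. A quick sketch of that schedule (the exact cut-off semantics of Spring's ExponentialBackOff may differ slightly at the boundary):

```java
import java.util.ArrayList;
import java.util.List;

public class BackoffSchedule {
    /** Delays of initial * multiplier^n, emitted until their sum reaches maxElapsed. */
    static List<Long> delays(long initial, double multiplier, long maxElapsed) {
        List<Long> out = new ArrayList<>();
        long delay = initial;
        long elapsed = 0;
        while (elapsed < maxElapsed) {
            out.add(delay);
            elapsed += delay;
            delay = (long) (delay * multiplier);
        }
        return out;
    }

    public static void main(String[] args) {
        // 1 s start, doubling, 30 s budget -> 1, 2, 4, 8, 16 seconds
        System.out.println(delays(1_000L, 2.0, 30_000L));
    }
}
```

So a poisoned message gets roughly five retries over half a minute before it is handed to the recoverer and published to the DLT.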
By convention, the DLT name is the original topic name suffixed with
.DLT — failed records from payments are published to payments.DLT.
You can then have a separate consumer monitoring the DLT, alerting on failures,
and replaying messages once the underlying issue is fixed.
Partition Assignment and Rebalancing
When a consumer joins or leaves a group, Kafka triggers a rebalance: it reassigns partitions across the active consumers. With the default (eager) protocol, all consumers in the group pause during a rebalance; the cooperative sticky assignor softens this, but rebalances are never free. In the worst case, a consumer that processes messages slowly (or holds a lot of state) can cause frequent rebalances that grind your throughput to a halt.
A few settings help here. max.poll.interval.ms defines how long Kafka
will wait between polls before deciding the consumer is dead and triggering a
rebalance. If your processing logic is slow, increase this rather than letting
Kafka kick the consumer out. session.timeout.ms and
heartbeat.interval.ms control how quickly a genuinely crashed consumer
is detected — keep heartbeat.interval.ms at roughly a third of
session.timeout.ms.
# application.yml
spring:
  kafka:
    consumer:
      properties:
        max.poll.interval.ms: 600000   # up from the 300000 default
        session.timeout.ms: 30000
        heartbeat.interval.ms: 10000
        max.poll.records: 50           # smaller batches = less risk of timeout
Reducing max.poll.records is often the simplest fix for rebalance
problems. If each poll returns 500 records and processing takes 2 seconds each,
you'll blow through max.poll.interval.ms easily. Fetch fewer records
per poll and you stay well within the timeout.
Ordering Guarantees: What Kafka Actually Promises
Kafka guarantees ordering within a single partition. That's the whole promise. If you need all events for a given entity to be processed in order, route all of that entity's events to the same partition using a consistent key. If you need global ordering across all entities — Kafka is probably the wrong tool, or you need a single partition (which eliminates horizontal scaling).
A common mistake is consuming a topic with multiple threads inside a single consumer
instance. Spring Kafka's @KafkaListener supports
concurrency — each thread becomes a consumer and gets its own
partition(s). That's fine for throughput, but it means two messages from
different partitions may be processed simultaneously. If your processing
touches shared state, you need to account for that.
@KafkaListener(
    topics = "inventory-updates",
    groupId = "inventory-service",
    concurrency = "4" // 4 threads, one per partition
)
public void handle(InventoryEvent event) {
    inventoryService.apply(event);
}
Transactional Producers
There's one case where you need exactly-once semantics end-to-end: consuming a
message, doing some work, and producing a new message — all as an atomic unit.
Kafka supports this with transactional producers. In Spring Kafka, setting a
transaction-id-prefix on the producer makes the listener container run each
delivery inside a Kafka transaction: sends from the listener and the consumed
offsets are committed together, or rolled back together. (executeInTransaction,
by contrast, starts a transaction that does not include the consumer's offsets,
so it isn't enough on its own for read-process-write.)
@KafkaListener(topics = "raw-events", groupId = "enricher")
public void enrich(RawEvent raw) {
    EnrichedEvent enriched = enrichmentService.enrich(raw);
    // runs inside the container-managed transaction: this send and the
    // offset commit for the consumed record succeed or fail together
    kafkaTemplate.send("enriched-events", enriched.entityId(), enriched);
}
This requires setting a transactional-id-prefix on the producer and
configuring your consumer with isolation.level=read_committed so it
only sees messages from committed transactions and never reads partial writes from
an in-flight one.
# application.yml
spring:
  kafka:
    producer:
      transaction-id-prefix: enricher-tx-
    consumer:
      properties:
        isolation.level: read_committed
A Few Things I've Learned the Hard Way
Don't use Long keys blindly. If you publish with a
numeric ID as the key, convert it to a String first (or configure a
LongSerializer explicitly). Passing a raw Long through the default
StringSerializer fails with a ClassCastException at send time — and a
mismatched deserializer on the consuming side can silently give you garbage
keys instead of an error.
Schema evolution is a day-one problem. If you're serialising with JSON, adding a new field is usually fine. Removing or renaming one will break consumers that are still running the old code during a rolling deploy. Use Avro or Protobuf with a schema registry if you care about backward compatibility — the setup cost pays for itself the first time you try to deploy a breaking change without downtime.
Consumer lag is your primary health metric. The number you want to
watch is not throughput, not latency — it's how far behind each consumer group is
from the latest offset. A lag that grows over time means your consumers can't keep
up. Expose this via Micrometer (kafka.consumer.fetch.manager.records.lag)
and alert on it before it becomes a crisis.
Log compaction changes retention semantics. If you enable log compaction on a topic, Kafka only guarantees that the latest value for each key is retained. Older values for the same key will eventually be removed. This is great for event-sourced state (like a user profile topic where you only care about the current state), but catastrophic if you were counting on full history for audit or replay.
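Compaction's end state can be sketched as a fold over the log that keeps only the latest value per key. A toy model (real compaction runs in the background over log segments and also honours tombstones, which delete a key entirely):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CompactionSketch {
    record Record(String key, String value) {}

    /** What survives compaction: the last value written for each key. */
    static Map<String, String> compact(List<Record> log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (Record r : log) {
            latest.put(r.key(), r.value()); // later writes overwrite earlier ones
        }
        return latest;
    }
}
```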
Where to Go from Here
The concepts in this post get you to a point where you can build something reliable. The next level is understanding how Kafka itself works internally — leader election, ISR lists, how log segments are written and compacted. The official documentation is genuinely good and worth reading cover to cover at least once.
If you're using Spring Boot, the
Spring Kafka reference docs
are thorough. And when something goes wrong in production,
kafka-consumer-groups.sh --describe is the first command to reach for —
it shows you exactly which partitions are assigned to which consumers and what the
current lag is.
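For reference, the shape of that command (group name illustrative); its output lists, per partition, the current offset, the log-end offset, the lag between them, and the consumer currently assigned:

```shell
bin/kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --group order-processor
```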