
Kafka for Java Developers: Beyond the Hello World

Longin Koziolkiewicz · March 2026 · 12 min read

The first Kafka tutorial you find online shows you how to send a string and read it back. That part is easy. The hard part is everything that comes after — understanding what "at least once" really means when your consumer crashes mid-batch, why your consumer group is stuck on one partition, or why messages are being processed out of order even though you were sure Kafka guarantees ordering.

This post skips the basic setup and goes straight to the concepts and code patterns that actually matter once you're past the hello world. We'll cover the data model, producers, consumers, consumer groups, offset management, and a handful of production pitfalls I've run into personally.

The Data Model in Plain Terms

Kafka stores messages in topics. A topic is divided into partitions — append-only, ordered logs on disk. Each message written to a partition gets a monotonically increasing offset, which is just a number. That's it. There is no global ordering across partitions, only per-partition ordering.

When you publish a message, Kafka decides which partition it goes to. If you provide a key, Kafka hashes it to pick a partition — the same key always lands on the same partition, which is how you get per-entity ordering (all events for a given orderId, for example). If you publish without a key, the producer spreads messages across partitions; since Kafka 2.4 the default is a "sticky" strategy that fills a batch for one partition before switching, rather than strict per-message round-robin.
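This invariant is easy to demonstrate. The sketch below is a simplified stand-in for the default partitioner (the real one applies murmur2 to the serialized key bytes, not String.hashCode), but it shows the property that matters: the same key deterministically maps to the same partition.

```java
public class KeyPartitioning {

    // Simplified stand-in for Kafka's key-based partitioning:
    // hash the key, mask off the sign bit, mod by the partition count.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 4;
        // Every event for the same orderId routes to the same partition,
        // so per-order processing order is preserved.
        int first = partitionFor("order-1042", partitions);
        int again = partitionFor("order-1042", partitions);
        System.out.println(first == again); // true: routing is deterministic
    }
}
```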

Consumers don't delete messages when they read them. They just advance their offset. This means you can replay a topic from offset 0 at any time, have multiple independent consumer groups reading the same topic without interfering with each other, and retain messages for as long as your retention policy allows (time- or size-based).

Producing Messages

In Spring Boot, KafkaTemplate is the main abstraction. A minimal producer looks like this:

@Service
public class OrderEventPublisher {

    private final KafkaTemplate<String, OrderEvent> kafkaTemplate;

    public void publish(OrderEvent event) {
        kafkaTemplate.send("orders", event.orderId(), event);
    }
}

The second argument to send() is the message key — here we use orderId so all events for the same order land on the same partition and are processed in order.

KafkaTemplate.send() returns a CompletableFuture. If you ignore it, the send is effectively fire-and-forget; if you want to know whether the broker acknowledged the message, handle the future:

public void publishWithConfirmation(OrderEvent event) {
    kafkaTemplate.send("orders", event.orderId(), event)
        .whenComplete((result, ex) -> {
            if (ex != null) {
                log.error("Failed to publish event for order {}", event.orderId(), ex);
                // dead-letter, retry queue, or alerting here
            } else {
                log.debug("Published to partition {} offset {}",
                    result.getRecordMetadata().partition(),
                    result.getRecordMetadata().offset());
            }
        });
}

The key producer config that influences reliability is acks. Set it to all (or -1) and Kafka will wait for all in-sync replicas to acknowledge before confirming the write. This is slower but means you won't lose a message if the leader broker crashes immediately after accepting it.

# application.yml
spring:
  kafka:
    producer:
      acks: all
      retries: 3
      properties:
        enable.idempotence: true
        max.in.flight.requests.per.connection: 5

enable.idempotence: true pairs with acks: all to stop producer retries from creating duplicates: the broker tracks a sequence number per producer session and discards any write it has already accepted. Note that this is exactly-once for the write into the log, not end-to-end — consumers can still see a message twice if they reprocess after a crash.
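A toy model makes the mechanism concrete. Real brokers track sequence numbers per producer ID and partition and work on batches; this single-partition sketch only illustrates why a retried send doesn't become a duplicate.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of idempotent-producer dedup: the "broker" remembers the
// last sequence number it accepted and drops anything it has seen before.
public class IdempotentBroker {
    private final List<String> log = new ArrayList<>();
    private int lastSeq = -1;

    // Returns true if the record was appended, false if deduplicated.
    public boolean append(int seq, String value) {
        if (seq <= lastSeq) {
            return false; // duplicate from a producer retry: ignore it
        }
        lastSeq = seq;
        log.add(value);
        return true;
    }

    public List<String> log() {
        return log;
    }
}
```

A producer that times out waiting for an ack resends the same record with the same sequence number; the check above is what turns that retry into a no-op instead of a duplicate.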

Consuming Messages

The consumer side is where most of the interesting problems live. A basic listener with Spring Kafka:

@Component
public class OrderEventConsumer {

    @KafkaListener(topics = "orders", groupId = "order-processor")
    public void handle(OrderEvent event) {
        processOrder(event);
    }
}

The groupId is critical. Kafka assigns partitions to consumers within the same group — each partition is owned by exactly one consumer at a time. If you have a topic with 4 partitions and start 4 instances of your service, each instance gets one partition. If you start a 5th instance, it sits idle — there are no partitions left to assign.
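The assignment arithmetic can be simulated in a few lines. This is not the actual group protocol or any particular assignor, just an illustration of why a fifth consumer on a four-partition topic sits idle.

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionAssignment {

    // Distribute partitions across consumers round-robin; consumers
    // beyond the partition count end up with an empty list (idle).
    static List<List<Integer>> assign(int partitions, int consumers) {
        List<List<Integer>> result = new ArrayList<>();
        for (int c = 0; c < consumers; c++) {
            result.add(new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            result.get(p % consumers).add(p);
        }
        return result;
    }

    public static void main(String[] args) {
        // 4 partitions, 5 consumers: the 5th consumer gets nothing
        System.out.println(assign(4, 5)); // [[0], [1], [2], [3], []]
    }
}
```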

Consumers in different groups each get a full copy of the topic, completely independently. This is useful when multiple services need to react to the same events — an order-processor and a notification-service can both subscribe to orders without any coordination.

Offsets and Commit Semantics

The raw Kafka client defaults to enable.auto.commit=true, which commits offsets in the background on a timer. This is convenient but dangerous: if your consumer processes a batch, the auto-commit fires, and then the application crashes before saving the results to your database, the messages are gone — Kafka thinks they were processed. (Spring Kafka actually disables auto-commit by default and commits after the listener returns, which narrows the window but doesn't close it; being explicit about the ack mode removes the ambiguity.)

The safer approach is manual acknowledgement with AckMode.MANUAL_IMMEDIATE:

@KafkaListener(topics = "payments", groupId = "payment-processor")
public void handle(PaymentEvent event, Acknowledgment ack) {
    try {
        paymentService.process(event);
        ack.acknowledge(); // only commit after successful processing
    } catch (Exception e) {
        log.error("Failed to process payment {}", event.paymentId(), e);
        // do NOT ack — message will be redelivered
    }
}

# application.yml
spring:
  kafka:
    listener:
      ack-mode: MANUAL_IMMEDIATE
    consumer:
      enable-auto-commit: false

This gives you at-least-once delivery: messages are redelivered on failure, but your handler might see the same message more than once. That means your processing logic should be idempotent — processing the same event twice should produce the same result as processing it once. The classic pattern is storing the event ID and checking for duplicates before writing.
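A minimal sketch of the dedup pattern, assuming an in-memory seen-ID store. In production this store would be a database table or cache shared across instances, and the duplicate check would need to be atomic with the side effect (e.g. in the same database transaction); the names here are illustrative.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class IdempotentPaymentProcessor {

    // Stand-in for a persistent "processed events" store
    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();
    private final AtomicInteger applied = new AtomicInteger();

    // Safe to call twice with the same event: the second call is a no-op.
    public boolean process(String paymentId) {
        if (!processedIds.add(paymentId)) {
            return false; // already processed: a redelivered duplicate
        }
        applied.incrementAndGet(); // the real side effect would go here
        return true;
    }

    public int appliedCount() {
        return applied.get();
    }
}
```

With this in place, an at-least-once redelivery after a crash is harmless: the handler recognizes the event ID and skips the side effect.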

Dead Letter Topics

Not all failures are transient. Sometimes a message is simply malformed or triggers a bug that will never resolve itself — retrying it forever would block the partition. The solution is a dead letter topic (DLT). Spring Kafka's DeadLetterPublishingRecoverer makes this straightforward:

@Bean
public DefaultErrorHandler errorHandler(
        KafkaTemplate<Object, Object> template) {

    var recoverer = new DeadLetterPublishingRecoverer(template);

    var backoff = new ExponentialBackOff(1_000L, 2.0);
    backoff.setMaxElapsedTime(30_000L); // give up after 30 s

    return new DefaultErrorHandler(recoverer, backoff);
}

By convention, the DLT topic name is the original topic name suffixed with .DLT — so failed payments messages land on payments.DLT. You can then have a separate consumer monitoring the DLT, alerting on failures, and replaying messages once the underlying issue is fixed.

Partition Assignment and Rebalancing

When a consumer joins or leaves a group, Kafka triggers a rebalance: it reassigns partitions across the active consumers. During a rebalance, all consumers in the group pause. In the worst case, a consumer that processes messages slowly (or holds a lot of state) can cause frequent rebalances that grind your throughput to a halt.

A few settings help here. max.poll.interval.ms defines how long Kafka will wait between polls before deciding the consumer is dead and triggering a rebalance. If your processing logic is slow, increase this rather than letting Kafka kick the consumer out. session.timeout.ms and heartbeat.interval.ms control how quickly a genuinely crashed consumer is detected — keep heartbeat.interval.ms at roughly a third of session.timeout.ms.

# application.yml
spring:
  kafka:
    consumer:
      properties:
        max.poll.interval.ms: 60000
        session.timeout.ms: 30000
        heartbeat.interval.ms: 10000
        max.poll.records: 50   # smaller batches = less risk of timeout

Reducing max.poll.records is often the simplest fix for rebalance problems. If each poll returns 500 records and processing takes 2 seconds each, you'll blow through max.poll.interval.ms easily. Fetch fewer records per poll and you stay well within the timeout.
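The back-of-the-envelope check is worth writing down: worst-case processing time per poll is records per poll times time per record, and it has to stay under max.poll.interval.ms (300,000 ms by default).

```java
public class PollBudget {

    // Worst-case time to process one poll, in milliseconds
    static long worstCasePollMillis(int maxPollRecords, long perRecordMillis) {
        return maxPollRecords * perRecordMillis;
    }

    public static void main(String[] args) {
        long maxPollIntervalMs = 300_000; // Kafka's default: 5 minutes

        // 500 records at 2 s each blows through the budget...
        System.out.println(worstCasePollMillis(500, 2_000) > maxPollIntervalMs); // true

        // ...50 records at 2 s each fits comfortably
        System.out.println(worstCasePollMillis(50, 2_000) > maxPollIntervalMs); // false
    }
}
```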

Ordering Guarantees: What Kafka Actually Promises

Kafka guarantees ordering within a single partition. That's the whole promise. If you need all events for a given entity to be processed in order, route all of that entity's events to the same partition using a consistent key. If you need global ordering across all entities — Kafka is probably the wrong tool, or you need a single partition (which eliminates horizontal scaling).

A common mistake is consuming a topic with multiple threads inside a single consumer instance. Spring Kafka's @KafkaListener supports concurrency — each thread becomes a consumer and gets its own partition(s). That's fine for throughput, but it means two messages from different partitions may be processed simultaneously. If your processing touches shared state, you need to account for that.

@KafkaListener(
    topics = "inventory-updates",
    groupId = "inventory-service",
    concurrency = "4"  // 4 threads, one per partition
)
public void handle(InventoryEvent event) {
    inventoryService.apply(event);
}

Transactional Producers

There's one case where you need exactly-once semantics end-to-end: consuming a message, doing some work, and producing a new message, all as an atomic unit. Kafka supports this with transactional producers. With Spring Kafka, the cleanest approach is to let the listener container manage the transaction: when the producer factory is transactional, Spring Boot auto-configures a KafkaTransactionManager, the listener runs inside a Kafka transaction, sends through KafkaTemplate join it, and the consumed offsets are committed atomically with the outgoing message:

@KafkaListener(topics = "raw-events", groupId = "enricher")
public void enrich(RawEvent raw) {
    EnrichedEvent enriched = enrichmentService.enrich(raw);

    // Joins the container-managed transaction: this send and the
    // consumed offset commit (or abort) together.
    kafkaTemplate.send("enriched-events", enriched.entityId(), enriched);
}

Avoid executeInTransaction for this pattern: it starts a separate local transaction that does not include the consumer offsets, so a crash between the send and the offset commit can still replay the input and duplicate the output.

This requires setting a transactional-id-prefix on the producer and configuring your consumer with isolation.level=read_committed so it only sees messages from committed transactions and never reads partial writes from an in-flight one.

# application.yml
spring:
  kafka:
    producer:
      transaction-id-prefix: enricher-tx-
    consumer:
      properties:
        isolation.level: read_committed

A Few Things I've Learned the Hard Way

Don't use Long keys blindly. If you publish with a numeric ID as the key, convert it to a String first (or configure a key serializer that matches the type). Passing a raw Long to the default StringSerializer fails at runtime with a ClassCastException — and if different producers serialize the same logical key differently, it can hash to different partitions, silently breaking per-entity ordering.

Schema evolution is a day-one problem. If you're serialising with JSON, adding a new field is usually fine. Removing or renaming one will break consumers that are still running the old code during a rolling deploy. Use Avro or Protobuf with a schema registry if you care about backward compatibility — the setup cost pays for itself the first time you try to deploy a breaking change without downtime.

Consumer lag is your primary health metric. The number you want to watch is not throughput, not latency — it's how far behind each consumer group is from the latest offset. A lag that grows over time means your consumers can't keep up. Expose it via Micrometer (kafka.consumer.fetch.manager.records.lag) and alert on it before it becomes a crisis.

Log compaction changes retention semantics. If you enable log compaction on a topic, Kafka only guarantees that the latest value for each key is retained. Older values for the same key will eventually be removed. This is great for event-sourced state (like a user profile topic where you only care about the current state), but catastrophic if you were counting on full history for audit or replay.
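Compaction's effect is easy to model in plain Java: the compacted view keeps only the latest value per key, while the full history disappears.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CompactionModel {

    public record Event(String key, String value) {}

    // A compacted topic converges on the latest value per key;
    // earlier values for the same key are eventually removed.
    static Map<String, String> compact(List<Event> log) {
        Map<String, String> latest = new LinkedHashMap<>();
        for (Event e : log) {
            latest.put(e.key(), e.value()); // later write wins
        }
        return latest;
    }

    public static void main(String[] args) {
        var log = List.of(
            new Event("user-1", "name=Ann"),
            new Event("user-2", "name=Bob"),
            new Event("user-1", "name=Anna") // supersedes the first event
        );
        System.out.println(compact(log));
        // {user-1=name=Anna, user-2=name=Bob}: the old value for user-1 is gone
    }
}
```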

Where to Go from Here

The concepts in this post get you to a point where you can build something reliable. The next level is understanding how Kafka itself works internally — leader election, ISR lists, how log segments are written and compacted. The official documentation is genuinely good and worth reading cover to cover at least once.

If you're using Spring Boot, the Spring Kafka reference docs are thorough. And when something goes wrong in production, kafka-consumer-groups.sh --describe is the first command to reach for — it shows you exactly which partitions are assigned to which consumers and what the current lag is.