
Event streaming with Apache Kafka

From distributed commit log to streaming platform: what Kafka does differently from a classical message broker, how topics, partitions, replication, and consumer groups fit together, how to run a productive cluster — and when RabbitMQ is still the better answer. A grounded ten-page survey, with an architecture diagram, a runnable setup, and an honest tool comparison at the end.

Positioning — streaming, not queuing

Apache Kafka is a distributed streaming platform: not a message broker in the classical sense, but a durable, horizontally scalable, partitioned commit log. Once that distinction has clicked, the most important conceptual step between RabbitMQ and Kafka is already behind you.

The distinction is not academic. With a classical queue broker, a message lands in a queue, is delivered to a consumer, and is gone. With Kafka, an event lands in a partition of a topic, stays there for the configured retention period (hours, days, weeks, in some cases indefinitely), and can be read by any number of consumers as often as needed — including after the fact. "Deliver and forget" becomes "write the event and keep it available on demand."

That property is what makes Kafka the default architecture for three application areas that have shown up in practically every larger platform over the past years. First: event sourcing. The state of an application is modelled as a stream of state changes; the current state emerges from replaying all the events. Second: microservice integration. Instead of synchronous REST calls between services, they communicate asynchronously through events — decoupled, more resilient to failure, easier to extend. Third: log aggregation and stream processing. Log files, metrics, or sensor data are collected centrally, filtered, enriched, and forwarded into downstream systems in real time.

Kafka was developed at LinkedIn in 2010, open-sourced in 2011, and has been an Apache top-level project since 2012. Its commercial sibling is Confluent — the company founded by the original LinkedIn engineers — which offers additional components (Schema Registry, KSQL, Connect connectors) and a cloud variant. Licence: Apache 2.0 for the core. Maturity: extremely high — Kafka runs in production today at Netflix, LinkedIn, Uber, Pinterest, Spotify, Deutsche Telekom, and countless other platforms, often with cluster sizes well beyond a hundred brokers.

An honest caveat: Kafka is powerful, but it is not a tool to operate on the side. A productive cluster needs attention to sizing, monitoring, the version upgrade path, and the disaster-recovery strategy. If all you want is "asynchronous messages between two services," RabbitMQ gets you there with less effort. If you understand streams as a central architectural primitive, Kafka is the tool with which platforms can be built whose scaling stays predictable for years.

Core concepts — topics, partitions, offsets

Most of the practical pain in working with Kafka can be traced back to a missing understanding of three concepts: what a topic is, how it is partitioned, and how a consumer manages its read progress.

Topic — the logical collection of events

A topic is the business category into which events are written — typically named after a business event: orders, payments, user-clicks, iot-telemetry. Topics in Kafka are created explicitly; unlike queue brokers, they do not come into existence implicitly the moment someone writes. That is a deliberate design choice: a topic is a business asset, whose lifecycle, retention, and schema should be decided consciously.

Partition — the unit of scaling

Each topic consists of one or more partitions. A partition is an append-only log: new events are appended at the end, never inserted in the middle or modified. Within a partition, strict ordering applies — event N+1 was written after event N. Between partitions, that ordering does not hold. This is the property that most often trips people up in practice: if you need ordering, you must route events with the same key into the same partition.

The partition is also the unit of scaling. More partitions means more consumers can work in parallel. Millions of events per second can be spread across hundreds of partitions, which in turn are spread across dozens of brokers. Rule of thumb: between 10 and 100 partitions per topic is appropriate in most setups. Several thousand partitions per broker is technically possible but operationally uncomfortable.

Partition key — the routing decision

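When a producer writes an event, it can attach a key. The producer client hashes this key and takes the result modulo the number of partitions to decide which partition the event lands in. Without a key, events are spread across partitions with no ordering guarantee (strict round-robin in older clients, batch-wise "sticky" assignment since Kafka 2.4). With a key, all events with the same key end up in the same partition and thereby keep their ordering, as long as the partition count of the topic does not change. In an orders topic, the customer_id is a good key: all orders of one customer land in the same partition, and a consumer can process them in order.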
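
The routing rule can be sketched in a few lines. This is a minimal illustration, not the client's actual code: CRC32 stands in for the real hash function as an assumption, and partition_for is a hypothetical helper name.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition: hash the key bytes, then take the hash modulo the partition count."""
    # Assumption: CRC32 is only a stand-in for the hash used by a real producer client.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

events = [
    ("customer-42", "order created"),
    ("customer-7", "order created"),
    ("customer-42", "order paid"),
]

# Three in-memory lists play the role of the three partitions of a topic.
partitions = {p: [] for p in range(3)}
for key, payload in events:
    partitions[partition_for(key, 3)].append((key, payload))
```

Both events for customer-42 end up in the same list, in the order they were written. Because the partition is computed modulo the partition count, the same key can map to a different partition once partitions are added to the topic.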

Offset — the read progress

Each event in a partition is assigned a sequential number, the offset. Offsets start at 0 and grow monotonically with every new event. A consumer tracks its read position per partition and commits it as the offset of the next event to read (the last processed offset plus one). This information is stored in a special internal topic (__consumer_offsets), so a consumer resumes exactly where it stopped after a restart. A new consumer group, or one whose offsets are explicitly reset, can instead read from the beginning, which is Kafka's native replay mechanism.
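
A toy in-memory model makes the relationship between log, offset, and committed position concrete. It is a sketch under simplifying assumptions: PartitionLog is a hypothetical class, and the variable committed stands in for the entry a real consumer group keeps in __consumer_offsets. The sketch stores the position of the next event to read.

```python
class PartitionLog:
    """Append-only log; the offset of an event is its index in the list."""

    def __init__(self):
        self._events = []

    def append(self, event) -> int:
        """Append an event at the end and return its offset."""
        self._events.append(event)
        return len(self._events) - 1

    def read_from(self, offset: int):
        """Return (offset, event) pairs starting at the given offset."""
        return list(enumerate(self._events))[offset:]

log = PartitionLog()
for e in ["a", "b", "c", "d"]:
    log.append(e)

committed = 0      # next offset to read
processed = []
for offset, event in log.read_from(committed):
    processed.append(event)
    committed = offset + 1
    if event == "b":
        break      # simulated consumer stop after two events

# After a restart the consumer continues at the committed position ...
resumed = [event for _, event in log.read_from(committed)]
# ... while a replay simply starts again at offset 0.
replayed = [event for _, event in log.read_from(0)]
```

Reading never removes anything from the log, which is why the replay sees all four events again.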

Consumer group — load distribution

Multiple consumer instances can join a consumer group. Kafka distributes the partitions of a topic across the members of the group — each partition is read by exactly one consumer of the group. With three partitions and two consumers, one consumer reads two partitions, the other one. With three consumers, each reads one. With four consumers, one stays idle. Multiple independent consumer groups read the same topic in full — this is the mechanism that makes Kafka so flexible in the multi-subscriber model: one topic, any number of read pipelines.
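
The three cases above can be reproduced with a simple round-robin assignment. This is only a sketch: assign is a hypothetical function, and the assignment strategies shipped with Kafka (range, round-robin, sticky variants) differ in detail.

```python
def assign(partitions, consumers):
    """Spread partitions over the consumers of one group; every partition gets exactly one owner."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment

two = assign([0, 1, 2], ["c1", "c2"])
three = assign([0, 1, 2], ["c1", "c2", "c3"])
four = assign([0, 1, 2], ["c1", "c2", "c3", "c4"])
```

With four consumers, c4 receives an empty list: adding consumers beyond the partition count adds no parallelism.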

Cluster, brokers, and the end of ZooKeeper

A Kafka cluster consists of multiple brokers and a metadata mechanism that coordinates the cluster members. Until a few years ago, that mechanism was a separate ZooKeeper ensemble; with KRaft, production-ready since Kafka 3.3, it is integrated directly into Kafka. Anyone setting up a new cluster today should choose KRaft: ZooKeeper mode has been deprecated since Kafka 3.5 and is removed entirely in Kafka 4.0.

Broker

A broker is a single Kafka instance that holds partitions and serves producer and consumer connections. A productive cluster typically consists of at least three brokers — three is the minimum requirement for meaningful replication, because it allows surviving a single broker failure without data loss. In larger setups, clusters with twenty, fifty, or a hundred brokers are normal. Brokers are equals among themselves — no master-slave hierarchy at the broker level.

Controller

Within the cluster, however, there is one role that is elected among the brokers: the controller. The controller manages metadata — which partition lives on which broker, which broker is the leader for which partition, which replicas are currently in sync. With ZooKeeper, that role was held by one of the brokers; in KRaft mode, it exists as a dedicated quorum (three or five nodes, typically on the same machines as the brokers or on dedicated controller nodes in very large clusters).

KRaft — why it replaces ZooKeeper

ZooKeeper was historically the external coordination system without which Kafka could not run. It was robust, but it was also a second stack that had to be operated, monitored, upgraded, and secured on its own. KRaft (Kafka Raft) is the Raft consensus protocol implemented inside Kafka itself, which takes over that role. The advantage is not primarily technical — both approaches are robust — but operational: one stack instead of two, one place to look for failures, one version-management story. Anyone setting up new today has no good reason to touch ZooKeeper.

Leader and followers

Each partition in a replicated cluster has exactly one leader and, with replication factor N, N-1 followers. Producers always write to the leader, and consumers read from it by default (fetching from followers has been possible since Kafka 2.4). Followers continuously fetch data from the leader and stand ready to take over if the leader fails. The number of replicas is a topic setting (replication.factor). Three is the standard value in production setups: a partition then survives the loss of up to two brokers without data loss, provided the surviving replica was in sync at the time of the failure.

Architecture in one picture

The figure below shows a typical three-broker cluster with a KRaft controller quorum, two producers, and two consumer groups. The topic orders shown here has three partitions (P0, P1, P2) with replication factor three — each partition exists on all three brokers, but only one holds the leader role.

Figure 1 — Three-broker cluster with KRaft quorum: producers write into the cluster without having to care about the broker split. Each partition (P0, P1, P2) lives on all three brokers, but only one is leader. Two consumer groups read independently — each sees all events, each maintains its own read progress.

Producers, consumers, and the question of delivery guarantees

Three guarantees, one choice per application — the most important design decision when writing into Kafka lies in the trade-off between latency, throughput, and delivery certainty.

At-most-once, at-least-once, exactly-once

Classically, three delivery guarantees exist. At-most-once means: each message is processed at most once, but may be lost in failures. At-least-once means: each message is processed at least once, but may arrive twice on retries. Exactly-once means: each message is processed exactly once — the Holy Grail, long considered impossible, supported within a clearly defined frame since Kafka 0.11.
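
On the consumer side, the difference between the first two guarantees comes down to whether the read position is committed before or after processing. The following simulation is a sketch with a hypothetical run function; it injects one crash and then restarts from the committed offset.

```python
def run(events, commit_first: bool, crash_at: int):
    """Process events with one simulated crash at offset crash_at, restarting from the committed offset."""
    processed, committed = [], 0
    crashed = False
    offset = 0
    while offset < len(events):
        if commit_first:
            committed = offset + 1              # commit, then process
            if offset == crash_at and not crashed:
                crashed = True
                offset = committed              # restart: event was committed but never processed
                continue
            processed.append(events[offset])
        else:
            processed.append(events[offset])    # process, then commit
            if offset == crash_at and not crashed:
                crashed = True
                offset = committed              # restart: event was processed but never committed
                continue
            committed = offset + 1
        offset += 1
    return processed

at_most_once = run(["a", "b", "c"], commit_first=True, crash_at=1)
at_least_once = run(["a", "b", "c"], commit_first=False, crash_at=1)
```

Committing first loses event b, processing first handles it twice. Exactly-once behaviour needs either idempotent processing or an atomic combination of result and offset.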

Producer configuration

On the producer side, three parameters are decisive:

acks: how many replicas must have acknowledged a written message before the producer treats the send as successful. acks=0 is fire-and-forget (maximum performance, no delivery guarantee). acks=1 waits for the leader only (the default before Kafka 3.0, but lossy on leader failure). acks=all waits for all in-sync replicas (the default since Kafka 3.0) and, combined with a sensible min.insync.replicas, is the only setting that reliably excludes data loss on broker failures.
retries and retry.backoff.ms: how many times and with what spacing the producer retries a failed send. Combined with acks=all, this yields at-least-once semantics.
enable.idempotence=true: Kafka assigns each producer session an ID and numbers the messages per partition. The broker recognises duplicate messages by sequence number and ID and discards them. This effectively turns at-least-once into exactly-once on the producer side, within one producer session. Since Kafka 3.0 this is the default in the Java client; with older or third-party clients it should be set explicitly for every new application.
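
How sequence numbers let the broker discard retried duplicates can be sketched as follows. This is a deliberately simplified model with a hypothetical Broker class: a real broker tracks sequences per producer and partition and also rejects gaps in the sequence.

```python
class Broker:
    """Toy broker that deduplicates by (producer ID, sequence number)."""

    def __init__(self):
        self.log = []
        self._last_seq = {}   # producer_id -> highest sequence number appended so far

    def append(self, producer_id: str, seq: int, value: str) -> bool:
        """Append the value unless this sequence number was already written; return True if appended."""
        if seq <= self._last_seq.get(producer_id, -1):
            return False      # retry of an already written message: acknowledge, do not append again
        self.log.append(value)
        self._last_seq[producer_id] = seq
        return True

broker = Broker()
broker.append("producer-A", 0, "order 1001")
# The acknowledgement got lost, so the producer retries the same message with the same sequence number.
retry_appended = broker.append("producer-A", 0, "order 1001")
broker.append("producer-A", 1, "order 1002")
```

The retry is acknowledged but leaves the log unchanged, so the retry loop of at-least-once delivery no longer produces duplicates in the partition.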

Consumer configuration

On the consumer side, the central question is when the read progress (offset) is committed. By default, the consumer does so in the background every few seconds — convenient, but on a crash between processing and committing, messages get lost or are processed twice. The more robust variant is manual commits after successful processing:

from kafka import KafkaConsumer   # kafka-python client

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers  = ["kafka-1:9092", "kafka-2:9092", "kafka-3:9092"],
    group_id           = "billing",
    enable_auto_commit = False,       # no background commits
    auto_offset_reset  = "earliest",  # start at the beginning if no offset is stored yet
)

for message in consumer:
    process_order(message.value)   # your own handler; must be idempotent!
    consumer.commit()              # only after success: save the offset

Important here: the processing itself must be idempotent, otherwise exactly-once on the producer side and manual commits on the consumer side together still produce duplicates on a crash between writing the result and committing the offset. In practice, idempotence is achieved through a unique event ID, checked in the downstream system (e.g. database) as a unique constraint.
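
A minimal sketch of that pattern, using an in-memory SQLite database as a stand-in for the downstream system. The table layout, the event_id field, and the process_order function are illustrative assumptions, not part of any Kafka API.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# The primary key on event_id is the unique constraint that makes processing idempotent.
db.execute("CREATE TABLE invoices (event_id TEXT PRIMARY KEY, total REAL)")

def process_order(event: dict) -> bool:
    """Insert the invoice once; a redelivered event violates the key constraint and is skipped."""
    try:
        with db:   # commits on success, rolls back on error
            db.execute(
                "INSERT INTO invoices (event_id, total) VALUES (?, ?)",
                (event["event_id"], event["total"]),
            )
        return True
    except sqlite3.IntegrityError:
        return False   # duplicate delivery: already processed, safe to commit the offset

event = {"event_id": "order-1001", "total": 99.95}
first = process_order(event)
second = process_order(event)   # simulated redelivery after a crash before the offset commit
row_count = db.execute("SELECT COUNT(*) FROM invoices").fetchone()[0]
```

The second call changes nothing in the database, so committing the offset afterwards is safe in both cases.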

Transactional API

For applications that read from one topic and write into another within a single transaction (the classical read-process-write pattern), Kafka offers a Transactional API. Writes to multiple topics plus the commit of the source offset are bundled into one atomic operation. That is the clean end-to-end exactly-once solution, and at the same time the most complex configuration; it is genuinely needed in only a fraction of the setups in which it is used.

Replication and high availability

Replication is not a nice feature; it is the foundation of every productive Kafka installation. Without replication, every broker failure is a potential data loss.

Each partition is replicated N times, where N is the replication.factor of the topic. Standard value in productive setups: three. With that, each partition is distributed across three different brokers. One holds the leader role, the other two are followers and continuously copy data from the leader. Producers and consumers talk exclusively to the leader.

The term ISR (in-sync replicas) is central. The ISR list of a partition contains all replicas that are in sync with the leader, meaning they have caught up with the leader within the last replica.lag.time.max.ms milliseconds (default: 30 seconds). If a follower falls behind, it drops out of the ISR. A broker failure leads to a new leader election for all partitions whose leader was on that broker: the controller picks a new leader from the ISR. If no ISR members remain (e.g. after multiple simultaneous failures), the partition becomes unavailable, unless unclean.leader.election.enable=true is set, which trades possible data loss for availability. This switch defaults to false and should stay there.
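
The election rule can be written down as a small function. This is a sketch of the decision only, with a hypothetical elect_leader helper; it assumes the replica list is in preference order and ignores everything else the controller does.

```python
def elect_leader(replicas, isr, alive, unclean=False):
    """Return the broker ID of the new leader, or None if the partition becomes unavailable."""
    for broker in replicas:                  # replica list order is taken as preference order
        if broker in isr and broker in alive:
            return broker
    if unclean:
        for broker in replicas:
            if broker in alive:
                return broker                # out-of-sync replica: possible data loss
    return None

# Broker 1 (the old leader) fails; broker 2 is in sync, broker 3 has fallen behind.
clean = elect_leader([1, 2, 3], isr={1, 2}, alive={2, 3})
# Only the failed broker was in sync: no eligible leader remains.
unavailable = elect_leader([1, 2, 3], isr={1}, alive={2, 3})
# Same situation with unclean election enabled: broker 2 takes over despite missing data.
forced = elect_leader([1, 2, 3], isr={1}, alive={2, 3}, unclean=True)
```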

The min.insync.replicas setting determines how many replicas must at minimum be in the ISR for a producer with acks=all to be allowed to write at all. With replication.factor=3 and min.insync.replicas=2, the behaviour is: if one broker fails, the topic keeps accepting writes; if two fail, writes from producers with acks=all are rejected, but no acknowledged data is lost. This combination is the industry standard for production Kafka clusters.
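
The arithmetic behind that combination, as a small check. The function write_accepted is a hypothetical helper and assumes that every surviving replica is still in sync.

```python
def write_accepted(replication_factor: int, min_insync_replicas: int, failed_brokers: int) -> bool:
    """Would a write with acks=all be accepted after the given number of broker failures?"""
    in_sync = replication_factor - failed_brokers   # assumption: all surviving replicas are in the ISR
    return in_sync >= min_insync_replicas

outcomes = {failed: write_accepted(3, 2, failed) for failed in range(3)}
```

With replication factor 3 and a minimum of 2, the result is True for zero or one failed broker and False for two.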

Beyond simple replication, MirrorMaker 2 (built on Connect) provides cross-cluster replication — typically between data centres or between an on-premise and a cloud region. For disaster-recovery setups, MirrorMaker 2 is the right tool; any multi-region architecture should plan for it from the start.

Setup with Docker Compose

For a local setup or a development environment, a single-broker cluster in KRaft mode is enough — defined in a single Compose file. Production needs three brokers plus a dedicated controller quorum — but the configuration shown here scales as a template.

Step 1 — create a docker-compose.yml:

services:
  kafka:
    image: confluentinc/cp-kafka:7.6.0
    container_name: kafka
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: 'broker,controller'
      KAFKA_LISTENERS: 'PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093'
      KAFKA_ADVERTISED_LISTENERS: 'PLAINTEXT://localhost:9092'
      KAFKA_CONTROLLER_LISTENER_NAMES: 'CONTROLLER'
      KAFKA_INTER_BROKER_LISTENER_NAME: 'PLAINTEXT'
      KAFKA_CONTROLLER_QUORUM_VOTERS: '1@kafka:9093'
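      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: 'CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT'
      # Single broker: internal topics cannot use the default replication factor of 3
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR: 1
      KAFKA_TRANSACTION_STATE_LOG_MIN_ISR: 1
      KAFKA_LOG_DIRS: '/var/lib/kafka/data'
      CLUSTER_ID: 'MkU3OEVBNTcwNTJENDM2Qk'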
    volumes:
      - kafka-data:/var/lib/kafka/data

volumes:
  kafka-data:

Step 2 — start the cluster and create a topic:

docker compose up -d

docker exec -it kafka \
  kafka-topics --bootstrap-server localhost:9092 \
               --create --topic orders \
               --partitions 3 --replication-factor 1

Step 3 — test producer and consumer from the command line:

# In one shell — interactive producer:
docker exec -it kafka \
  kafka-console-producer --bootstrap-server localhost:9092 \
                         --topic orders

# In another shell — consumer from the beginning of the topic:
docker exec -it kafka \
  kafka-console-consumer --bootstrap-server localhost:9092 \
                         --topic orders --from-beginning

Step 4 — a minimal Python producer writing real events:

from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers     = "localhost:9092",
    value_serializer      = lambda v: json.dumps(v).encode("utf-8"),
    key_serializer        = lambda k: str(k).encode("utf-8"),
    acks                  = "all",
    enable_idempotence    = True,   # only accepted by recent kafka-python releases; drop it if your version rejects it
)

producer.send(
    "orders",
    key   = "customer-42",
    value = {"order_id": 1001, "customer_id": 42, "total": 99.95}
)
producer.flush()

Step 5 — the matching consumer, with manual commits:

from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers   = "localhost:9092",
    group_id            = "billing",
    enable_auto_commit  = False,
    auto_offset_reset   = "earliest",
    value_deserializer  = lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    print(f"Partition {message.partition} offset {message.offset}:")
    print(f"  key={message.key} value={message.value}")
    # ... processing (must be idempotent) ...
    consumer.commit()

Practical note

For production: extend the Compose file to three brokers, each with a unique KAFKA_NODE_ID and the corresponding KAFKA_CONTROLLER_QUORUM_VOTERS list. The CLUSTER_ID must be identical across all brokers — it is generated once via kafka-storage random-uuid. On Kubernetes, the Strimzi operator project handles all of this declaratively and is the established way to run Kafka in production on Kubernetes today.
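
The voters list follows the pattern id@host:port, comma-separated, as in the single-node file above. A small helper can generate it; the host names kafka-1 to kafka-3 and the function name quorum_voters are assumptions for illustration.

```python
def quorum_voters(hosts, port: int = 9093) -> str:
    """Build the KAFKA_CONTROLLER_QUORUM_VOTERS value, numbering the nodes from 1."""
    return ",".join(
        f"{node_id}@{host}:{port}" for node_id, host in enumerate(hosts, start=1)
    )

voters = quorum_voters(["kafka-1", "kafka-2", "kafka-3"])
print(voters)
```

Every broker receives the same voters string, while KAFKA_NODE_ID must match the number in front of that broker's own entry.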

Streams, Connect, Schema Registry

The core broker is only half the story. Three companion components turn Kafka into a complete streaming platform.

Kafka Streams — processing inside the application process

Kafka Streams is a Java library that makes stream processing available directly inside your own application — without running a separate cluster such as Spark or Flink. The library offers the typical operations: map, filter, groupBy, aggregate, join, with time windows. Internal state (e.g. counters, aggregates) is kept in an embedded RocksDB instance per application instance and backed up via dedicated topics inside Kafka itself. This makes Streams scale horizontally: multiple application instances share the partitions without needing a central coordinator.
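
To make "aggregate with time windows" concrete, here is the idea of a tumbling-window count in plain Python. It is a conceptual sketch only: Kafka Streams itself is a Java library, and tumbling_counts is a hypothetical function that keeps its state in a local dictionary rather than in RocksDB.

```python
from collections import Counter

def tumbling_counts(events, window_seconds: int):
    """Count events per (window start, key) for fixed, non-overlapping time windows."""
    counts = Counter()
    for timestamp, key in events:
        window_start = timestamp - (timestamp % window_seconds)
        counts[(window_start, key)] += 1
    return dict(counts)

# (timestamp in seconds, key)
clicks = [(0, "a"), (30, "a"), (61, "a"), (65, "b")]
per_minute = tumbling_counts(clicks, window_seconds=60)
```

The first two clicks of key a fall into the window starting at 0, the remaining events into the window starting at 60.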

Anyone working with Streams should know the term KTable: a view of a topic as "latest value per key". KTables are how event streams turn into queryable state — the conceptual counterpart to a materialised view in a database.
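
The "latest value per key" view is easy to picture as a fold over the stream. The sketch below uses a hypothetical materialize function and a plain dictionary; a value of None is treated as a delete marker, mirroring how a null value removes a key from a KTable.

```python
def materialize(stream):
    """Fold a stream of (key, value) events into a table holding the latest value per key."""
    table = {}
    for key, value in stream:
        if value is None:
            table.pop(key, None)   # delete marker: remove the key from the table
        else:
            table[key] = value     # a newer value overwrites the older one
    return table

stream = [
    ("customer-42", "bronze"),
    ("customer-7", "silver"),
    ("customer-42", "gold"),
    ("customer-7", None),
]
table = materialize(stream)
```

The stream holds four events, the table one row: customer-42 with its latest value.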

Kafka Connect — integration with the outside world

Connect is a framework for prefabricated connectors: source connectors pull data from external systems into Kafka, sink connectors write Kafka data into external systems. Hundreds of ready-made connectors exist: Postgres, MySQL, MongoDB, S3, Elasticsearch, Snowflake, Salesforce, Datadog, Splunk. A Debezium connector (the standard tool for change data capture) reads the write-ahead log (WAL) of a Postgres database and pushes every row change as an event into Kafka, which is the standard way to bring relational data into a real-time event stream.

Schema Registry — contracts between producers and consumers

One of the more unpleasant properties of an uncontrolled Kafka setup: a producer changes the JSON format, and all consumers silently break. The Schema Registry (a separate service, not part of the core) solves that by holding schemas centrally — typically Apache Avro, optionally JSON Schema or Protocol Buffers. Producers and consumers check the schema against the registry before writing or reading. Schema changes follow compatibility rules (backward, forward, full) that prevent retroactive breakage. If you deploy Kafka across multiple teams in production, you should plan for the Schema Registry from day one — without it, the data model quickly turns into a black box.
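
What a backward-compatibility rule checks can be sketched in a few lines. This is a strongly simplified model, not the registry's algorithm: schemas are reduced to a mapping from field name to "has a default value", type changes are ignored, and backward_compatible is a hypothetical function.

```python
def backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    """Can a consumer using the new schema still read data written with the old schema?

    Both arguments map field name -> True if the field declares a default value.
    """
    for name, has_default in new_fields.items():
        if name not in old_fields and not has_default:
            return False   # new required field: old records cannot supply a value
    return True            # removed fields are fine: the new reader simply ignores them

v1 = {"order_id": False, "total": False}
v2_ok = {"order_id": False, "total": False, "currency": True}    # new field with default
v2_bad = {"order_id": False, "total": False, "currency": False}  # new field without default

ok = backward_compatible(v1, v2_ok)
bad = backward_compatible(v1, v2_bad)
```

Adding a field with a default passes, adding a required field fails, which is exactly the kind of change that would otherwise break consumers silently.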

KSQL / ksqlDB

A SQL-like language to query and transform streams and KTables, without writing a line of Java code. ksqlDB runs as its own server process and is attractive for teams that have SQL as their lingua franca and want quick stream aggregations without a code stack.

Comparison with RabbitMQ — two related but very different tools

Kafka and RabbitMQ often appear side by side in tool-selection discussions because both transport "messages" between systems. The similarity ends there. The two tools represent two different views of asynchronous communication.

The fundamental concept

RabbitMQ is a classical message broker in the AMQP 0-9-1 spirit. Messages enter an exchange, get distributed via bindings and routing keys into queues, are pulled from there by consumers, and disappear. A message has a defined lifetime: from write to ack. If you want "yesterday's message", tough luck — it was consumed and is gone.

Kafka is a distributed commit log. Events are written, persisted (days, weeks, months), and consumers decide from which offset they want to read. "Yesterday's message" is trivial — set the offset to yesterday and reprocess the entire stream.
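
"Set the offset to yesterday" boils down to finding the first offset at or after a timestamp. The sketch below does this on an in-memory list and assumes timestamps never decrease within the partition; offset_for_time is a hypothetical helper. Client libraries expose a comparable lookup against the broker, for instance offsets_for_times in kafka-python.

```python
from bisect import bisect_left

def offset_for_time(timestamps, target: int) -> int:
    """Return the first offset whose timestamp is >= target (timestamps sorted ascending)."""
    return bisect_left(timestamps, target)

# Index = offset, value = event timestamp in seconds.
timestamps = [100, 200, 300, 400]
start = offset_for_time(timestamps, 250)
replay = timestamps[start:]
```

A result equal to the length of the list means that no event is that recent, so there is nothing to replay.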

Side by side — Apache Kafka and RabbitMQ

Apache Kafka

Model: partitioned commit log
Data retention: persistent, configurable (hours to indefinite)
Routing: producer chooses partition via key hash
Consumer model: pull, with self-managed offset
Ordering: guaranteed per partition, not topic-wide
Replay: native through offset reset, arbitrary number of times
Throughput: very high — hundreds of thousands to millions of messages per second per cluster
Latency: typically 5–20 ms end-to-end
Protocol: Kafka's own binary protocol
Ecosystem: Streams, Connect, Schema Registry, ksqlDB
Operational complexity: high — own cluster, sizing, ISR, retention
Classical use cases: event sourcing, log aggregation, stream processing, CDC

RabbitMQ

Model: classical message broker (AMQP)
Data retention: queue lives until the message is consumed
Routing: exchange types (direct, topic, fanout, headers) with flexible bindings
Consumer model: push, with ack/nack
Ordering: per queue, strict within each queue
Replay: not built in (streams feature since 3.9 is closing the gap)
Throughput: moderate — tens of thousands of messages per second
Latency: typically 1–5 ms, often lower than Kafka for small messages
Protocol: AMQP 0-9-1 native, plus AMQP 1.0, MQTT, STOMP through plugins
Ecosystem: Federation, Shovel, management UI, plugin architecture
Operational complexity: moderate — easier to handle than Kafka, fewer fundamental concepts
Classical use cases: task queues, worker patterns, fanout notifications, RPC, IoT

The typical points of friction in comparison

Routing flexibility. RabbitMQ is markedly more expressive here. A topic exchange with routing patterns such as orders.*.cancelled or a headers exchange with attribute-based bindings allow distribution patterns that, in Kafka, can only be modelled through topic naming or additional stream-processing steps. If you need business-complex routing without a streaming aspect, RabbitMQ is easier.

Per-message TTL and priority queues. RabbitMQ supports time-to-live per message and priority queues natively. Kafka knows neither — all events of a partition share the same retention, and ordering is ordering (no priority skip). "Urgent orders first" is trivial in RabbitMQ, an architectural exercise in Kafka.

Push versus pull. RabbitMQ pushes messages to consumers as soon as they arrive. Kafka has to be polled by the consumer. In very latency-critical scenarios (sub-millisecond), push wins; in any scenario with backpressure (slow consumers, load spikes), pull wins, because the consumer dictates the pace.

Replay. The operation natural to Kafka, "read this topic from the position it had two days ago", is not provided by classic RabbitMQ queues. The streams feature introduced in RabbitMQ 3.9 closes part of that gap but is not yet at the same maturity as Kafka.

Throughput. Kafka regularly beats RabbitMQ in throughput benchmarks by an order of magnitude. In most applications, that does not matter — anyone handling 5,000 messages per second is well below the capacity limit of either tool. But if you talk about 100,000 messages per second and beyond, Kafka is effectively without alternative.

When Kafka fits — and when RabbitMQ is the better choice

Both tools are mature, widely adopted, and production-ready. The right choice follows from the use case, not from the hype.

Kafka is the right choice when …

events should be retained durably and consumed multiple times — for event sourcing, replay-capable processing, or audit trails;
throughput exceeds tens of thousands of messages per second, or is foreseeably going to;
multiple independent consumer pipelines work on the same event stream — one computes invoices, another feeds an analytics dashboard, a third sends push notifications;
stream processing is part of the requirement — aggregates, joins, time windows over live data;
change data capture from relational databases is to be established — Debezium on Kafka Connect is the standard here;
a platform-wide event architecture with clearly defined schemas (Schema Registry) is to be established across multiple teams.

RabbitMQ is the right choice when …

classical task queues and worker pools are modelled — image processing, mail dispatch, long report computations;
routing is business-complex and maps well to exchange types, routing keys, and headers;
RPC-style request-reply patterns are needed (reply queue with correlation ID — idiomatic in RabbitMQ, awkward in Kafka);
priority queues, per-message TTL, or dead-letter queues with finely tunable policies are required;
the system is polyglot and must serve AMQP alongside MQTT (IoT) or STOMP (web) through plugins;
operational overhead should be minimal and the foreseeable volume nowhere near RabbitMQ's limits.

In most platforms we work on, both tools end up in play — not against each other, but complementary. A typical picture: Kafka as the central event backbone for the business events of a platform (orders, payments, status changes), RabbitMQ as the task queue for operational background work (mail dispatch, report generation, asynchronous API calls). The two worlds overlap at the edges, but they rarely compete directly. Anyone who knows both tools and cleanly separates their strengths builds architectures that deliver faster, scale more easily, and stand up to operational scrutiny. A dedicated RabbitMQ introduction sits in a separate article.
