Kafka: a distributed log, and what you can build on top of one

Kafka is often introduced as a “message queue,” and that description is wrong in a way that matters. A message queue is a piece of middleware whose job is to hand a message to a consumer and then forget about it. A broker holds the message long enough to deliver it once, acknowledges it, and moves on. Kafka does not work that way. Kafka is a distributed, partitioned, replicated commit log, and the consumer is a cursor that moves along the log at its own pace. The broker remembers everything for as long as the retention policy says to — hours, days, weeks, forever. Delivery is not an event the broker initiates; it is an act of reading the consumer performs.

That distinction is the whole reason Kafka fits the problems it fits. Once you internalize that the log is the source of truth and the consumer is a cursor, the rest of the system — replication, consumer groups, exactly-once semantics, stream processing, change data capture — is a set of patterns layered onto that one primitive.

The core primitive: the partitioned log

A topic is a named, append-only log. Producers append records to the end of the log; consumers read records from wherever their cursor sits. That is the entire data model at the conceptual level. Everything else is either an implementation detail or a policy decision on top of it.

The implementation detail that matters most is that a topic is not one log. It is split into partitions — one or more ordered logs, each living on one broker at a time. A record goes to exactly one partition, chosen by the producer (usually by hashing a key). Inside a partition, records have a strictly increasing offset and are delivered in the order they were written. Across partitions, there is no ordering.

That is the first tradeoff Kafka asks you to accept. Ordering is cheap and strong per partition, and does not exist between partitions. A topic with twelve partitions can be consumed in parallel by up to twelve consumers, and within each partition the events arrive in the order the producer wrote them. If ordering matters for a particular entity — say, all events for order #1234 — put the order id in the record key, and Kafka’s default partitioner will hash every event for that order to the same partition. If ordering does not matter across entities, pick as many partitions as your throughput requires.
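The keying behavior can be sketched in a few lines. Kafka’s default partitioner actually uses a murmur2 hash over the serialized key; the CRC32 below is a stand-in to illustrate the property that matters — equal keys always land on the same partition:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # crc32 as a stand-in for Kafka's murmur2 key hash
    return zlib.crc32(key) % num_partitions

NUM_PARTITIONS = 12

# Every event keyed by the same order id hashes to the same
# partition, so per-order ordering is preserved.
p1 = partition_for(b"order-1234", NUM_PARTITIONS)
p2 = partition_for(b"order-1234", NUM_PARTITIONS)
assert p1 == p2

# Different keys may land on different partitions; between those
# partitions there is no ordering guarantee at all.
```

A record with no key is spread across partitions (round-robin or sticky, depending on client version), which maximizes balance at the cost of any per-entity ordering.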

A consumer commits its offset back to Kafka (in an internal topic called __consumer_offsets) so that if it crashes and restarts, it resumes from where it left off. Two consumers of the same topic have independent offsets. The broker does not know or care who else is reading; it just serves bytes from a file.
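The cursor model is worth seeing concretely. In this toy sketch the “broker” is just a list and each consumer owns its committed offset — the names and the two-consumer scenario are invented for illustration:

```python
log = ["e0", "e1", "e2", "e3"]          # one partition, offsets 0..3

class Consumer:
    def __init__(self):
        self.committed = 0               # next offset to read

    def poll(self, log, max_records=2):
        records = log[self.committed : self.committed + max_records]
        self.committed += len(records)   # advance and commit the cursor
        return records

analytics, billing = Consumer(), Consumer()
assert analytics.poll(log) == ["e0", "e1"]
assert billing.poll(log, max_records=4) == ["e0", "e1", "e2", "e3"]

# analytics "crashes" and resumes: it picks up from its own committed
# offset, entirely unaffected by billing's position.
assert analytics.poll(log) == ["e2", "e3"]
```

The broker-side reality is richer (offsets are committed to `__consumer_offsets`, fetches are batched), but the shape is exactly this: independent cursors over a shared, immutable sequence.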

Replication and the ISR

Every partition has a configurable number of replicas across brokers. One replica is the leader — it handles all reads and writes — and the others are followers that pull from the leader. Followers that are caught up are members of the in-sync replica set (ISR). A producer that sets acks=all is told the write has succeeded only once every ISR member has it. If the leader fails, a member of the ISR is elected as the new leader; the log continues.

Two levers control the durability–availability tradeoff:

  • acks. acks=0 means fire-and-forget. acks=1 means the leader has it. acks=all means all ISR members have it.
  • min.insync.replicas. If the ISR shrinks below this number, the broker refuses writes with acks=all rather than silently accepting them at reduced durability.

This is the piece of configuration most teams get wrong the first time. A replication factor of 3 does not protect your writes if acks=1 and a disk fails between the leader’s flush and the followers’ pulls. Durability is the combination of replication factor, acks, and min.insync.replicas — not any of them alone.
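The interaction is easy to model. This toy predicate — not real broker code — captures the behavior the paragraph describes, assuming a replication factor of 3 and min.insync.replicas=2:

```python
MIN_ISR = 2   # min.insync.replicas

def accept_write(isr_size: int, acks: str) -> bool:
    if acks == "all":
        # broker refuses the write rather than degrade durability
        return isr_size >= MIN_ISR
    return isr_size >= 1          # acks=0 / acks=1 only need a live leader

assert accept_write(3, "all")     # healthy: leader plus two followers
assert accept_write(2, "all")     # one follower lost, still durable enough
assert not accept_write(1, "all") # ISR shrank below the floor: rejected
assert accept_write(1, "1")       # acks=1 would still accept -- the trap
```

The last line is the failure mode from the paragraph above: with acks=1, a write the leader alone has seen is reported as a success, and a single disk failure can erase it.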

Retention, compaction, and the log as state

The broker’s retention policy decides how long a log is kept. The two options compose rather than compete:

  • Time/size retention. Keep the last N days or M gigabytes of records, then delete the oldest segments. Useful when the log is a stream of events that become uninteresting after a while — clicks, telemetry, structured logs.
  • Log compaction. For each key, keep at least the latest record. Older records for the same key may be removed by a background compactor. Useful when the log represents the current state of a set of entities — user profiles, configuration, the latest balance per account.

Compacted topics are how Kafka supports a style of system where the log is the database. You can throw away a downstream service’s local state and rebuild it by replaying the compacted topic from offset zero. The log holds one record per key — the latest one — and that is enough to reconstitute the world.
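Compaction and replay can be sketched together. This is a simplified model (the real compactor works segment by segment and respects additional settings), but it shows the invariant: replaying the compacted log rebuilds the same state as replaying the full one:

```python
def compact(log):
    # keep only the last occurrence of each key, preserving log order
    latest = {}
    for offset, (key, _value) in enumerate(log):
        latest[key] = offset
    keep = set(latest.values())
    return [rec for off, rec in enumerate(log) if off in keep]

def replay(log):
    # materialize current state: last write per key wins
    state = {}
    for key, value in log:
        state[key] = value
    return state

full = [("user-1", "v1"), ("user-2", "v1"), ("user-1", "v2")]
compacted = compact(full)
assert compacted == [("user-2", "v1"), ("user-1", "v2")]
assert replay(compacted) == replay(full) == {"user-1": "v2", "user-2": "v1"}
```

That equality is the whole point: a downstream service can lose its local store, re-read the compacted topic from offset zero, and end up in the same state.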

Consumer groups and the partition-as-unit-of-parallelism

A consumer group is a set of consumer processes that together consume a topic. Kafka assigns each partition to exactly one consumer in the group, so the total parallelism of a group is bounded by the number of partitions. Add more consumers than partitions and the extras sit idle. Add fewer consumers than partitions and some consumers handle more than one.
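A round-robin-style assignment makes the arithmetic visible. Kafka ships several assignor strategies (range, round-robin, sticky); this sketch is only the shape of the constraint, not any particular assignor:

```python
def assign(partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    # each partition goes to exactly one consumer in the group
    assignment = {c: [] for c in consumers}
    for p in range(partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

# 4 partitions, 2 consumers: each handles two partitions.
assert assign(4, ["c1", "c2"]) == {"c1": [0, 2], "c2": [1, 3]}

# 2 partitions, 3 consumers: c3 gets nothing and sits idle.
assert assign(2, ["c1", "c2", "c3"]) == {"c1": [0], "c2": [1], "c3": []}
```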

When a consumer joins or leaves the group, Kafka triggers a rebalance: partitions are redistributed across the surviving members. Rebalances are cheap when they are rare and expensive when they happen constantly — a consumer that crashes and restarts every minute will drag the whole group’s throughput down while the group thrashes.

The upshot for capacity planning is that partitions are a commitment. You can increase a topic’s partition count later, but doing so breaks the key-to-partition mapping for existing data, which breaks per-key ordering for any consumer that cares. Pick a count that leaves room for a few years of growth in parallelism, and live with a bit of over-partitioning rather than re-partitioning a production topic under load.
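Why growing the partition count breaks the mapping is simple modular arithmetic. With a stand-in hash (Kafka’s default is murmur2), most keys map to a different partition once the count changes, so new records for a key land away from its history:

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    return zlib.crc32(key) % num_partitions   # stand-in for murmur2

# How many of 1000 keys change partition when 12 partitions become 16?
moved = sum(
    1 for i in range(1000)
    if partition_for(f"order-{i}".encode(), 12)
    != partition_for(f"order-{i}".encode(), 16)
)
assert moved > 0
# Roughly three quarters of keys move; any consumer relying on
# "all events for a key are on one partition" now sees a split history.
```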

Use case: microservice communication

The most common reason a team reaches for Kafka is to stop having services call each other synchronously over HTTP for everything. A synchronous call graph is a runtime-coupling machine: every call is a place where latency can stack and availability can multiply downward. If service A calls B and B calls C, then A is up only when B and C are both up, and every request to A pays B’s and C’s latency on top of A’s own.

Kafka decouples that graph in time. A service that wants to tell the rest of the world “an order was placed” writes an OrderPlaced event to a topic and is done. Whichever services care — inventory, payments, notification, analytics — read the topic at their own pace. If the notification service is down for ten minutes, events accumulate in the topic; when it comes back, it catches up. Neither the producer nor the other consumers notice. The producer does not know who the consumers are. New consumers can be added later by subscribing to the existing topic; existing producers do not change.

Two styles dominate:

Event notification. The event is a small fact (“order 1234 was placed”). A consumer that needs more information calls the producer’s API to fetch it. This is cheap to produce and keeps the event schema small, at the cost of re-introducing a synchronous call for the details.

Event-carried state transfer. The event contains the full state the consumer needs to act on. The consumer maintains its own local copy of whatever it cares about from the stream and does not need to call anyone. This is more expensive on the wire and requires more discipline about schemas, but it lets consumers serve their own reads without a dependency on the producer being up.

In practice most systems use both, picking per event based on how expensive the fetch would be and how badly the consumer needs to survive the producer being down.

Use case: the saga, done with events

A saga — a sequence of local transactions across services, with compensations for failure — is one of the places Kafka earns its keep. The choreography style of saga is essentially a pattern of services reacting to each other’s events. OrderService writes OrderPlaced. PaymentService reads it, charges the card, writes PaymentAuthorized (or PaymentFailed). InventoryService reads PaymentAuthorized, reserves stock, writes StockReserved. And so on, with compensating events (PaymentRefunded, OrderCancelled) on failure paths.

The reason Kafka fits this well is that it provides the two guarantees a saga actually needs: durable, ordered delivery of events per key (so the steps of saga #1234 are processed in order), and the ability to replay if a participant was down. The reason it is still hard has nothing to do with Kafka — it is the saga’s own irreducible problems: no automatic rollback, no isolation, compensations that must be designed for every step. Kafka makes the mechanics cheaper, not the design.
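The choreography shape can be sketched with an in-memory “topic” standing in for Kafka. The event names follow the example above; the handlers, the queue, and the wiring are all hypothetical scaffolding:

```python
from collections import deque

log, queue = [], deque([("OrderPlaced", {"order_id": 1234})])

def payment_service(evt, body):
    # reacts to OrderPlaced; on failure it would emit PaymentFailed instead
    if evt == "OrderPlaced":
        return [("PaymentAuthorized", body)]

def inventory_service(evt, body):
    if evt == "PaymentAuthorized":
        return [("StockReserved", body)]

handlers = [payment_service, inventory_service]

while queue:
    evt, body = queue.popleft()
    log.append(evt)                       # the durable record of the saga
    for handler in handlers:
        queue.extend(handler(evt, body) or [])

assert log == ["OrderPlaced", "PaymentAuthorized", "StockReserved"]
```

No service calls any other; each reacts to what it reads and appends what it did. The failure paths (PaymentFailed, compensating OrderCancelled) would be more handlers of exactly the same shape.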

The transactional outbox: the one pattern you should not skip

Kafka’s delivery story for microservices collapses if a service cannot guarantee that updating its own database and publishing to Kafka happen atomically. The naive approach — commit the transaction, then send the event — fails whenever the process crashes between the two. You have updated the database and lost the event. Every downstream service now has a view that disagrees with the source service, and there is no way to notice.

The transactional outbox fixes this. Inside the same local transaction that updates the business state, the service inserts a row into an outbox table describing the event to publish. A separate relay — a poller, or a change data capture pipeline — reads the outbox and publishes to Kafka, then marks the row as sent. If the service crashes after the commit, the row is there; the relay will pick it up on restart. If the relay crashes after publishing but before marking, the event is re-delivered; consumers must be idempotent, which they should be anyway.
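A minimal sketch, with SQLite standing in for the service’s database and a list standing in for the Kafka producer. The table and event shapes are invented for illustration; the point is that the business write and the outbox insert share one transaction:

```python
import json, sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
           "payload TEXT, sent INTEGER DEFAULT 0)")

def place_order(order_id: int):
    with db:  # one local transaction covers both inserts
        db.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        db.execute("INSERT INTO outbox (payload) VALUES (?)",
                   (json.dumps({"type": "OrderPlaced",
                                "order_id": order_id}),))

def relay(publish):
    # poller-style relay: publish unsent rows, then mark them sent
    rows = db.execute("SELECT id, payload FROM outbox "
                      "WHERE sent = 0 ORDER BY id").fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))   # a crash here means re-delivery
        db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    db.commit()

published = []
place_order(1234)
relay(published.append)
relay(published.append)   # second pass finds nothing unsent
assert published == [{"type": "OrderPlaced", "order_id": 1234}]
```

A crash between `publish` and the `sent = 1` update re-delivers the event on the next pass — which is exactly why the consumers must be idempotent.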

There is a similar pattern for the consumer side — the inbox — where the consumer records the ids of events it has handled so it can deduplicate retries. Idempotency keys, deduplication windows, and event ids are how you get effectively-once behavior out of at-least-once delivery, which is what Kafka actually gives you without transactions enabled. (It can give exactly-once within the Kafka-to-Kafka path via the transactional producer and read_committed consumer, but not across Kafka and your database; the outbox is how you bridge that gap.)
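The inbox side is even smaller. Here the set of processed ids stands in for a table in the consumer’s database; the event shape is hypothetical:

```python
processed_ids = set()     # in production: a table in the consumer's DB
balance = 0

def handle(event):
    global balance
    if event["id"] in processed_ids:
        return            # duplicate delivery: skip the side effect
    balance += event["amount"]
    processed_ids.add(event["id"])   # ideally in the same transaction

deposit = {"id": "evt-42", "amount": 100}
handle(deposit)
handle(deposit)           # redelivered after a relay crash
assert balance == 100     # the side effect happened exactly once
```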

Use case: event streaming and stream processing

Once a stream of events exists, it is natural to derive other streams from it. An OrderPlaced stream plus an OrderShipped stream can be joined to produce a stream of time-to-ship durations. A ClickEvent stream can be aggregated into a stream of per-user five-minute session counts. These derivations are stream processing, and Kafka ships with two ways to do it:

  • Kafka Streams, a client library that turns a Java/Scala service into a stream processor: subscribe to input topics, apply operators (map, filter, groupBy, join, windowed aggregations), write to output topics. State lives in local RocksDB stores, backed by internal compacted topics so that it can be rebuilt after a crash.
  • ksqlDB, which layers a SQL-like language on top of the same machinery for people who would rather write SELECT ... FROM orders JOIN shipments ... than a topology in code.

Outside the Kafka project, Flink is the other mainstream engine that treats Kafka topics as inputs and outputs. All of them are instances of the same idea: the log is the source of truth; a processor is a function that reads from one or more logs and writes to another, and its own state is just a table materialized from the input.

The payoff is that analytical and operational derivations become the same kind of thing — both are just derived streams or derived tables. A dashboard that shows “orders placed per minute” is reading from a topic produced by a one-line windowed count. A real-time fraud rule that flags unusual transactions is another consumer on the same input. They do not contend; they have independent cursors.
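The “orders placed per minute” derivation really is close to a one-liner. This sketch is a plain-Python stand-in for a tumbling-window count (timestamps as epoch seconds), not Kafka Streams code:

```python
from collections import Counter

def orders_per_minute(event_timestamps):
    counts = Counter()
    for ts in event_timestamps:
        window_start = ts - (ts % 60)    # tumbling 60-second window
        counts[window_start] += 1
    return dict(counts)

events = [0, 10, 59, 60, 61, 150]
assert orders_per_minute(events) == {0: 3, 60: 2, 120: 1}
```

A real streaming job adds what the toy omits — incremental updates as events arrive, late-event handling, fault-tolerant state — but the derivation itself is this small.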

Use case: change data capture

Most organizations have data that already lives in databases and is not, yet, on a stream. Change data capture (CDC) gets it onto one. A CDC connector — Debezium is the common choice — tails the database’s write-ahead log (Postgres WAL, MySQL binlog, SQL Server transaction log) and publishes every row-level change as an event on a Kafka topic.

This is the cheapest way to turn a legacy system into a participant in an event-driven architecture without modifying its code. The rest of the organization subscribes to the CDC topic and treats the database as if it had been publishing events all along. It is also, notably, the mechanism by which the transactional outbox’s relay typically works: Debezium on the outbox table, writing to a Kafka topic, with the relay’s state being “the last WAL position I read.”

CDC has costs. Schema changes in the source database become schema changes in the stream, and the downstream consumers now depend on a shape the source team controls. If the source team drops a column or renames a table without coordination, every downstream breaks. The trick is to not expose the raw CDC topic as the public contract. Instead, a small transformation job reads the CDC stream and emits a curated event stream with a contract the source team has agreed to keep stable. The raw stream is private; the curated stream is the interface.
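The curation step is a small mapping job. The record below follows the rough shape of a Debezium change event (an `op` code and an `after` image), but the exact field names and the curated contract are illustrative, not Debezium’s precise envelope:

```python
def curate(raw_change):
    # map a raw row-change to a stable public event, hiding internal
    # columns (audit fields, soft-delete flags) from downstream consumers
    after = raw_change["after"]           # row state after the change
    return {
        "type": "CustomerUpdated",
        "customer_id": after["id"],
        "email": after["email"],
    }

raw = {"op": "u",
       "after": {"id": 7, "email": "a@b.c", "_audit_user": "etl"}}
assert curate(raw) == {"type": "CustomerUpdated",
                       "customer_id": 7, "email": "a@b.c"}
```

The raw topic can then change shape with the source table; only this one job has to absorb it, and the curated topic stays stable for everyone else.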

Use case: log aggregation, metrics, and the operational firehose

The original LinkedIn use case — the one Kafka was built for — was shipping application logs and metrics off thousands of machines into a central system. That use case has not gone away. A Kafka cluster is a very good place to terminate firehoses:

  • Application logs from every service, to be consumed by a log-storage backend (Elasticsearch, ClickHouse, S3).
  • Metrics from every host, to be consumed by a time-series store.
  • Audit events, to be consumed by long-term storage and by real-time alerting.
  • Click and telemetry events, to be consumed by analytics warehouses.

The reason Kafka is a good fit is that it accepts writes fast, buffers them durably, and lets multiple downstream systems each read the full stream at their own rate. If the warehouse is slow, analytics lags but logging does not. If the log backend goes down for an hour, the topic absorbs the traffic and the backend catches up on recovery. The alternative — every producer writing directly to every backend — means every producer has to know every backend, handle every backend’s backpressure, and fail if any of them fails.

What Kafka is not

Kafka is not a database. It has no query engine, no secondary indexes, no way to ask “what is the current value for key X?” other than to read the whole partition or let a stream processor materialize a view. Tools like ksqlDB and Kafka Streams’ GlobalKTable make those views easier to build, but the views are still built on top of the log, not inside the broker.

Kafka is not, out of the box, low-latency in the sense a queue is low-latency. A well-tuned Kafka setup moves messages end-to-end in a few milliseconds, but the design prioritizes throughput and durability. If you need sub-millisecond p99 for a small number of messages, a purpose-built queue or an in-process channel is probably a better tool.

Kafka is not free. A production cluster is a serious operational commitment: broker sizing, partition strategy, retention tuning, replication health, consumer lag monitoring, schema governance. Hosted offerings (Confluent Cloud, MSK, Redpanda Cloud, WarpStream) move most of that off the team’s plate, at a bill that is real. The cheap moment is a single topic at low volume; the expensive moments are the ones where the log has become the backbone of the organization’s data flow and needs to be treated as such.

Kafka does not make your event schemas good. A topic with a loose or undocumented schema becomes a coupling surface that every consumer depends on and nobody owns. A schema registry (Avro, Protobuf, or JSON Schema) with enforced compatibility rules — backward compatibility so upgraded consumers can still read old records, forward compatibility so consumers on the old schema can read records from upgraded producers — is part of the production setup, not an optional extra. Without it, schema evolution is done by email.

The shape of a well-run Kafka estate

Pull the ideas together and a healthy use of Kafka looks like this:

  • Topics are contracts. Each topic has an owning team, a schema registered with compatibility rules, and documented semantics (keying, ordering guarantees, retention).
  • Producers write through an outbox. Business state and events commit atomically; a relay publishes to Kafka. Consumers are idempotent.
  • Consumers own their offsets and their lag. Lag is a first-class monitored metric; a consumer that is falling behind is a problem before it is an outage.
  • State is derived, not stored. Stream processors and CDC pipelines materialize views that consumers read; the canonical facts live on the log.
  • Replication and acks are tuned for the durability you actually want. acks=all, a replication factor of 3, and min.insync.replicas of 2 are a common baseline for business-critical topics.
  • Partition counts leave room to grow. You over-partition a little at creation time because you cannot repartition cleanly under load.

None of that is exotic, and none of it happens by accident. The value Kafka delivers — decoupled services, replayable streams, event-driven derivations, a single spine for operational data — comes from the boring discipline of treating the log as the canonical thing and building the rest of the system around it. Get that right and Kafka is quietly load-bearing. Get it wrong and it is an expensive queue you could have replaced with Postgres.

The broker is simple on purpose. The architecture around it is where the work is.