Apache Kafka

By definition, Kafka is a distributed streaming platform. In simple terms, it moves data from one system to another in real time. It does this with a publish-subscribe (pub-sub) mechanism: a publisher publishes messages to the Kafka cluster, and a subscriber receives those messages from the cluster.

Figure 1. Streaming in Kafka


Kafka uses topics to deliver messages from the producer to the consumer. A producer sends the messages to the topic and a consumer subscribes to the topic to receive those messages.

A topic is divided into partitions. Partitioning lets a topic be spread across multiple brokers in the Kafka cluster, which in turn allows consumers to read its messages in parallel. Within a consumer group, each partition is assigned to only one consumer at a time.
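As a minimal sketch of how messages might be spread over partitions, the snippet below maps a message key to a partition index. The function name `choose_partition`, the three-partition topic, and the use of CRC32 are assumptions made for illustration; Kafka's own default partitioner uses a different hash (murmur2), but the idea that records with the same key land in the same partition is the same.

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index (toy stand-in for a partitioner)."""
    # CRC32 is an assumption for illustration; Kafka's default partitioner uses murmur2.
    return zlib.crc32(key) % num_partitions

NUM_PARTITIONS = 3  # hypothetical topic with three partitions

# One list per partition, standing in for the partition logs on the brokers.
partitions = {p: [] for p in range(NUM_PARTITIONS)}

events = [(b"user-1", "login"), (b"user-2", "login"), (b"user-1", "logout")]
for key, value in events:
    partitions[choose_partition(key, NUM_PARTITIONS)].append(value)

# Both events keyed "user-1" end up in the same partition, in the order sent.
user1_partition = choose_partition(b"user-1", NUM_PARTITIONS)
```

Because ordering is only kept within a partition, choosing the key is effectively choosing which messages must stay in order relative to each other.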

Partition replication
Partitions can be replicated across different brokers in the Kafka cluster, so that messages are not lost when a broker fails. The leader replica of a partition handles the reads and writes, whereas the other replicas (followers) act as backups. If the leader fails, one of the followers takes over the leadership role.
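The failover described above can be pictured with a small toy model. `PartitionReplicas` and `fail_broker` are hypothetical names, and promoting the first remaining follower is a simplification: a real cluster chooses the new leader from the replicas that are in sync with the old one.

```python
from dataclasses import dataclass, field

@dataclass
class PartitionReplicas:
    """Toy model of one partition's replica set (not Kafka's real data structure)."""
    leader: int                                     # broker id serving reads and writes
    followers: list = field(default_factory=list)   # broker ids holding backup copies

    def fail_broker(self, broker_id: int) -> None:
        """Drop a failed broker; promote a follower if the leader was lost."""
        if broker_id in self.followers:
            self.followers.remove(broker_id)
        elif broker_id == self.leader:
            if not self.followers:
                raise RuntimeError("no replica left to take over")
            # Simplification: promote the first follower in the list.
            self.leader = self.followers.pop(0)

p0 = PartitionReplicas(leader=1, followers=[2, 3])
p0.fail_broker(1)                 # the leader's broker goes down
print(p0.leader, p0.followers)    # 2 [3]
```

With a replica count of three, the partition in this sketch survives two broker failures before no copy is left.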

A record consists of the message itself along with metadata such as a timestamp and a message key.

An offset is the position of a record within a partition. The consumer commits the offset it has read up to, and uses this position to know which message should be read next. If the consumer needs to re-read earlier messages, its offset can be reset to an earlier position.
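To make the offset idea concrete, here is a small in-memory sketch. `PartitionLog` and `ToyConsumer` are hypothetical classes written for this illustration; the method names loosely mirror the poll, commit, and seek operations of a consumer, but this is not the Kafka client API.

```python
class PartitionLog:
    """Toy append-only log standing in for one partition."""
    def __init__(self):
        self.records = []

    def append(self, value):
        self.records.append(value)
        return len(self.records) - 1   # offset of the record just written


class ToyConsumer:
    """Tracks a read position in a log; not the real Kafka consumer."""
    def __init__(self, log):
        self.log = log
        self.position = 0    # offset of the next record to read
        self.committed = 0   # position saved as "processed up to here"

    def poll(self):
        """Return the next record, or None if the log has been fully read."""
        if self.position >= len(self.log.records):
            return None
        value = self.log.records[self.position]
        self.position += 1
        return value

    def commit(self):
        self.committed = self.position

    def seek(self, offset):
        self.position = offset


log = PartitionLog()
for v in ["a", "b", "c"]:
    log.append(v)

consumer = ToyConsumer(log)
first = consumer.poll()      # "a"
second = consumer.poll()     # "b"
consumer.commit()            # committed position is now 2
consumer.seek(0)             # rewind to re-read from the start
replayed = consumer.poll()   # "a" again
```

Rewinding does not change the log: the records stay where they are, and only the consumer's position moves.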

A producer is the client application that generates messages and sends them to a topic (more precisely, to a partition of that topic) in the Kafka cluster.

Brokers are the servers that form a Kafka cluster. The messages are stored on the brokers. One broker in the cluster acts as the controller. The controller monitors the other brokers and is responsible for administrative tasks such as moving partition leadership to another replica when a broker fails.

The consumer is the end system that consumes messages by subscribing to a topic in the Kafka cluster. After reading messages, the consumer acknowledges them by committing its offset back to the Kafka cluster.

Consumer groups
Consumers that share the same group id form a consumer group. The offset is stored per consumer group, so it applies to whichever consumer in the group reads the partition. If there is only a single consumer in the group, all the partitions of the topic are assigned to that consumer by default. As you add more consumers to the group, the partitions are divided among them.
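The way partitions are shared within a group can be sketched as follows. `assign_partitions` is a hypothetical helper that deals partitions out round-robin; Kafka ships several assignment strategies, and this simplified version is only meant to show the effect of growing the group.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin style."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        owner = consumers[i % len(consumers)]
        assignment[owner].append(partition)
    return assignment

partitions = [0, 1, 2, 3]   # hypothetical topic with four partitions

solo = assign_partitions(partitions, ["c1"])
# {'c1': [0, 1, 2, 3]}

shared = assign_partitions(partitions, ["c1", "c2"])
# {'c1': [0, 2], 'c2': [1, 3]}

crowded = assign_partitions(partitions, ["c1", "c2", "c3", "c4", "c5"])
# c5 receives no partition and stays idle
```

The last case shows why the consumer count in a group is usually kept at or below the partition count: any extra consumer has nothing to read.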

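ZooKeeper keeps track of the brokers, topics, partitions, and replicas, and is responsible for coordinating operations in a Kafka cluster. The brokers rely on this metadata to tell producers and consumers about the cluster status and about which broker leads each partition, so that requests reach the partition leaders. Older Kafka versions also stored consumer offsets in ZooKeeper, while current versions keep them in an internal Kafka topic (__consumer_offsets). Recent Kafka releases can also run without ZooKeeper, using KRaft mode.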

Kafka source connector
It connects an external system that produces data to the Kafka cluster, so that the data is written as messages to topics. It is part of Kafka Connect, a component provided by Apache Kafka.

Kafka sink connector
It connects the Kafka cluster to an external system that consumes data, so that messages from topics are delivered to that system. It is part of Kafka Connect, a component provided by Apache Kafka.

Having this basic understanding of the flow and the components in Kafka helps you decide on the broker count, partition count, replica count, consumer group count, and consumer count. These numbers are chosen based on the rate at which producers publish messages and the rate at which consumers can process them. With the right configuration, we can achieve real-time streaming using Apache Kafka.
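As a rough illustration of turning those rates into a number, the sketch below applies a commonly cited rule of thumb: take the target throughput, divide it by what a single partition can sustain on the producer side and on the consumer side, and use the larger result. The function name `estimate_partitions` and the figures are assumptions for the example; real per-partition throughput has to be measured on your own hardware and workload.

```python
import math

def estimate_partitions(target_mb_s, producer_mb_s_per_partition, consumer_mb_s_per_partition):
    """Rough lower bound on partition count from throughput figures (rule of thumb)."""
    for_producers = target_mb_s / producer_mb_s_per_partition
    for_consumers = target_mb_s / consumer_mb_s_per_partition
    # The slower side dictates how many partitions are needed.
    return math.ceil(max(for_producers, for_consumers))

# Hypothetical figures: 100 MB/s target, 10 MB/s per partition when producing,
# 20 MB/s per partition when consuming.
needed = estimate_partitions(100, 10, 20)   # 10
```

The result is a starting point rather than a final answer: the consumer count in a group is then capped by this partition count, and the replica count is chosen separately from how many broker failures should be tolerated.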

Experience the streaming 🙂