Kafka internals : how Kafka brokers work
The main source for this article is the book Kafka: The Definitive Guide (Real-Time Data and Stream Processing at Scale), which is available for free online using this link.
Just like the ACID properties for databases, Kafka has three main principles that it tries to satisfy while handling data streams.
The architecture of a Kafka cluster is similar to the image below :
It has three components :
Topic partitions are the main components that need to be hosted. As you know, a topic is split into partitions. Since a partition can be read by at most one consumer per consumer group, having multiple partitions allows parallel processing and leads to better data throughput.
It is possible to choose which partition a message goes to : messages with the same key go to the same partition. It is very unwise to add partitions after creating the topic, because messages with the same key could then go to a different partition.
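The key-to-partition mapping can be sketched as below. Kafka's default partitioner actually hashes the key with murmur2 and takes the result modulo the partition count ; this sketch substitutes a simple md5-based hash just to illustrate why adding a partition remaps existing keys.

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Deterministic stand-in for Kafka's murmur2-based default partitioner
    digest = int.from_bytes(hashlib.md5(key).digest()[:4], "big")
    return digest % num_partitions

keys = [f"user-{i}".encode() for i in range(10)]
before = {k: partition_for(k, 3) for k in keys}  # topic created with 3 partitions
after = {k: partition_for(k, 4) for k in keys}   # a 4th partition is added
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys change partition after adding a partition")
```

Any key whose hash lands differently modulo the new count silently moves, which breaks the "same key, same partition" guarantee for data already written.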
It is impossible to delete a partition (for consistency reasons).
Partitions are assigned to consumers in consumer groups according to one of the following strategies :
- Range (the default) : consumers C1 and C2 from the same consumer group process messages from two topics T1 and T2, each with 3 partitions. Range assignment splits each topic's partitions between the consumers ; since the number of partitions is odd, C1 gets 2 partitions from T1 and 2 from T2, and C2 gets the rest. A total of 4 partitions for C1 and 2 for C2.
- Round robin : all partitions are interleaved across the consumers according to the round-robin algorithm, so C1 and C2 get 3 partitions each.
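The two strategies above can be sketched as follows. This is a simplified model of the assignors (in a real consumer they are selected via the partition.assignment.strategy setting), reproducing the C1/C2 example.

```python
from collections import defaultdict

def range_assign(consumers, topics):
    # Range: for each topic independently, split its partitions into
    # contiguous ranges; the first consumers get the larger ranges.
    assignment = defaultdict(list)
    for topic in sorted(topics):
        num_parts = topics[topic]
        per, extra = divmod(num_parts, len(consumers))
        start = 0
        for i, c in enumerate(sorted(consumers)):
            count = per + (1 if i < extra else 0)
            assignment[c] += [(topic, p) for p in range(start, start + count)]
            start += count
    return dict(assignment)

def round_robin_assign(consumers, topics):
    # Round robin: interleave the full list of partitions across consumers.
    assignment = defaultdict(list)
    all_parts = [(t, p) for t in sorted(topics) for p in range(topics[t])]
    ordered = sorted(consumers)
    for i, tp in enumerate(all_parts):
        assignment[ordered[i % len(ordered)]].append(tp)
    return dict(assignment)

topics = {"T1": 3, "T2": 3}
r = range_assign(["C1", "C2"], topics)
rr = round_robin_assign(["C1", "C2"], topics)
print(len(r["C1"]), len(r["C2"]))    # 4 2
print(len(rr["C1"]), len(rr["C2"]))  # 3 3
```

Note how range assignment skews the load when the partition count is not a multiple of the consumer count, because the imbalance repeats for every topic.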
In order to satisfy two of these principles, availability and partition tolerance, partitions are replicated across brokers.
We will have a leader partition and a number of other partitions called “followers”. The leader replica is the only one that communicates with the clients (Producers and Consumers).
The other replicas only send requests to the leader replica to replicate the data it has.
If the leader replica goes down, one of the follower replicas becomes the new leader replica.
There is also a preferred leader system. As you know, to guarantee availability Kafka spreads the leaders across the brokers, so that when a broker crashes we don't lose too many leaders at once. By default the first replica of a partition is the preferred leader, so each partition's first replica should be placed on a different broker.
Another role of the partition leader is to know which follower replicas are up to date, in order to choose one of them as the new leader. To get the messages, follower replicas send FETCH requests to the leader. By looking at the latest offset requested by each replica, the leader knows which ones are up to date (in-sync).
A replica is in-sync when :
- it has sent a heartbeat to Zookeeper within the last 6 seconds (configurable) ;
- it has sent a fetch request to the leader within the last 10 seconds (configurable) ;
- it has fetched the most recent messages within the last 10 seconds (configurable).
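A simplified model of the leader's bookkeeping might look like this ; the field names are illustrative, and the thresholds mirror the configurable timeouts listed above.

```python
import time

# Illustrative thresholds mirroring the configurable defaults above
HEARTBEAT_TIMEOUT_S = 6.0
FETCH_TIMEOUT_S = 10.0

def is_in_sync(replica, leader_end_offset, now):
    # replica: per-follower state the leader tracks (hypothetical fields)
    return (
        now - replica["last_heartbeat"] <= HEARTBEAT_TIMEOUT_S
        and now - replica["last_fetch"] <= FETCH_TIMEOUT_S
        and replica["fetched_offset"] >= leader_end_offset  # caught up
    )

now = time.time()
caught_up = {"last_heartbeat": now - 1, "last_fetch": now - 2, "fetched_offset": 42}
lagging = {"last_heartbeat": now - 1, "last_fetch": now - 30, "fetched_offset": 40}
print(is_in_sync(caught_up, 42, now))  # True
print(is_in_sync(lagging, 42, now))    # False
```

A replica that falls out of this set is no longer eligible to become leader until it catches up again.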
Brokers are basically servers that host topic partitions. There is a specific broker called the controller (at most one per cluster), which is responsible for electing the leader replica of each partition.
Each broker has a cache containing the metadata of all the brokers, including the address of the leader replica of each partition.
When a leader partition receives a message, consumers cannot fetch it until it has been replicated to all the in-sync replicas, because until then the message is considered unsafe. If a message that is not yet present on all in-sync replicas got consumed and the leader partition then crashed, other consumers might never find it => inconsistency.
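Kafka calls this boundary the high watermark : consumers can only read messages below the smallest offset that every in-sync replica has already replicated. A minimal sketch :

```python
def high_watermark(leader_end_offset, follower_offsets):
    # Consumers may only read messages below the high watermark:
    # the smallest log-end offset among the leader and its in-sync followers.
    return min([leader_end_offset] + list(follower_offsets))

# Leader has written up to offset 10; followers have replicated to 8 and 10
print(high_watermark(10, [8, 10]))  # 8 -> offsets 8 and 9 are not yet consumable
```

This is why a slow in-sync follower directly delays message visibility for consumers.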
When hosted on brokers, Kafka partitions are split into segments. A segment is bounded by a size limit and a time limit : once either limit is reached, the segment is closed and a new one is started. Once a segment is closed and the configured retention time has passed, the segment is either deleted or compacted.
When compaction is chosen as the retention policy, we only keep the latest state of each message, using its key :
The offset is not the same thing as the key of a message in Kafka
When we want to delete a message, we produce a message with the same key and a null value. This is called a tombstone. When a consumer sees it, it understands that the key has to be deleted. It corresponds to deleting a row from a database table.
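Compaction and tombstones can be sketched as follows. A real broker additionally keeps tombstones around for a configurable period so that consumers get a chance to see them ; this sketch ignores that detail.

```python
def compact(segment):
    # Keep only the latest value for each key; a None value (tombstone)
    # removes the key entirely, like deleting a row from a database table.
    latest = {}
    for key, value in segment:
        latest[key] = value
    return {k: v for k, v in latest.items() if v is not None}

segment = [
    ("user-1", "alice@old.com"),
    ("user-2", "bob@mail.com"),
    ("user-1", "alice@new.com"),  # supersedes the first record
    ("user-2", None),             # tombstone: delete user-2
]
print(compact(segment))  # {'user-1': 'alice@new.com'}
```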
Zookeeper servers :
Zookeeper stores the metadata concerning brokers, topics and consumers. Its servers are organized as a quorum (3 servers can run with 1 missing, 5 servers can run with 2 missing) : 2n+1 servers, where n is the number of servers that can go down.
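The quorum arithmetic above is simple to sketch :

```python
def tolerable_failures(num_servers: int) -> int:
    # A quorum of 2n+1 servers keeps working as long as a majority
    # (n+1) is still up, so it tolerates n failures.
    return (num_servers - 1) // 2

for servers in (3, 5, 7):
    print(servers, "servers ->", tolerable_failures(servers), "can fail")
# 3 servers -> 1 can fail
# 5 servers -> 2 can fail
# 7 servers -> 3 can fail
```

This is why Zookeeper ensembles are deployed with an odd number of servers : adding a 4th server to a 3-server ensemble adds cost without tolerating any additional failure.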
Zookeeper also stores the ACLs (access control lists), i.e. the authorizations required to access each topic.