KRaft is production ready!
Kafka marks KRaft as a production-ready feature, removing the need to run ZooKeeper. To celebrate this, let's meet Raft!
On October 3rd 2022, Kafka 3.3 was released, marking the KRaft consensus protocol as production ready and removing the need for ZooKeeper in Kafka clusters. This is of course very exciting news, but let's better understand what is going on under the hood. In this article we will explore Raft and take a high-level look at Kafka's KRaft implementation. I will include extra links for the curious minds, so that you can learn more about these concepts if you wish.
Raft overview
The idea behind the Raft consensus protocol comes from this paper. The paper is easy to follow and a good place to begin your journey of reading academic papers; I highly encourage you to check it out. I'll walk you through it too.
Raft was designed to be easy to understand. It belongs to the same family of consensus protocols as Paxos. There is a famous quote from the engineers behind Chubby (Google's internal lock service):
There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system. In order to build a real-world system, an expert needs to use numerous ideas scattered in the literature and make several relatively small protocol extensions. The cumulative effort will be substantial and the final system will be based on an unproven protocol.
The authors decided to design a simpler protocol for the state machine replication model (also referred to as quorum replication). First things first, what is a distributed consensus protocol? Distributed means involving many computers, consensus means agreement; thus distributed consensus means agreeing on a shared state across multiple servers in a way that is resilient to node failures and network partitions.
Raft defines three roles:
- A leader. It serves as a mediator between clients and the distributed network of servers. Data flow in Raft is unidirectional: data is passed from the leader to the followers. There is just one leader in the cluster (also known as strong leadership) per term (identified by a term number).
- A follower. Any other server in the cluster is a follower. Followers get information from the leader and maintain their own copy of the state.
- A candidate. Whenever the leader is gone, a follower may become a candidate. A candidate collects votes from other servers during the election of a new leader. To become the new leader, a candidate must collect a majority of votes in the cluster (also known as a quorum). Raft is designed in a way that guarantees that eventually a new leader will be elected.
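To make the roles and their transitions a bit more concrete, here is a minimal sketch in Java. It is purely illustrative (not Kafka's or any real Raft library's code): followers become candidates when they stop hearing from the leader, a candidate either wins or steps back down, and a leader steps down when it sees a higher term.

```java
// Illustrative sketch of the three Raft roles and their allowed transitions.
public enum RaftRole {
    FOLLOWER, CANDIDATE, LEADER;

    /** Returns true if a node in this role may move to the target role. */
    public boolean canTransitionTo(RaftRole target) {
        switch (this) {
            case FOLLOWER:  return target == CANDIDATE;  // election timeout fired
            case CANDIDATE: return target == LEADER      // won a majority of votes
                                || target == FOLLOWER;   // lost, or saw a higher term
            case LEADER:    return target == FOLLOWER;   // saw a higher term
            default:        return false;
        }
    }
}
```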
To achieve consensus, Raft introduces three main procedures: leader election, log replication and cluster membership adjustment. Let’s explore them.
Leader election
In order for the system to be able to propagate changes, a leader has to be present and able to reach a majority of the nodes in the cluster. Raft doesn't rely on any external party to step in and select a new leader; the nodes need to figure this out for themselves. That's why any node may try to become the leader whenever the current leader dies or a network partition happens.
Whenever a follower stops receiving heartbeats from the leader, after a certain time period it will increment its term number and become a candidate. It will send a RequestVote RPC to all nodes in the cluster. A voting node follows an algorithm that ensures it votes at most once per term and only for a candidate whose log is at least as up to date as its own. As a consequence, either:
- one of the candidates gets the majority of votes and becomes the new leader, or
- none of the candidates gets a majority and another election has to take place. Because the election timeouts are randomized (typically 150–300 ms), it is unlikely that the same split vote will happen again.
This algorithm is simple and effective. There are some edge cases to handle, e.g. a node that gets cut off from the cluster would keep bumping its term number over and over again, eventually triggering an election once it reconnects with a very high term number. Implementations usually don't initiate an election unless the node can connect to a majority of the nodes in the first place.
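Here is a small sketch of the candidate side of an election, assuming a hypothetical `Peer` abstraction with a `requestVote` call (these names are mine, not a real Raft library API): bump the term, vote for yourself, ask everyone else, and check whether you reached a majority.

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative candidate-side election logic, not a real implementation.
class ElectionSketch {
    int currentTerm = 0;

    /** Randomized election timeout, as suggested in the paper (150-300 ms). */
    long nextElectionTimeoutMs() {
        return ThreadLocalRandom.current().nextLong(150, 301);
    }

    /** Called when no heartbeat arrived within the election timeout. */
    boolean startElection(List<Peer> peers) {
        currentTerm++;                      // become a candidate in a new term
        int votes = 1;                      // a candidate votes for itself
        for (Peer peer : peers) {
            // RequestVote carries the candidate's term (and, in full Raft, the
            // index/term of its last log entry) so peers only vote for
            // candidates with a sufficiently up-to-date log.
            if (peer.requestVote(currentTerm)) {
                votes++;
            }
        }
        int clusterSize = peers.size() + 1;
        return votes > clusterSize / 2;     // majority wins; otherwise retry later
    }

    interface Peer {
        boolean requestVote(int candidateTerm);
    }
}
```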
Log replication (and a heartbeat)
In the state machine replication model a log entry is a client command. It is stored in the leader's log with a term number and an index. On each new entry, the leader sends an AppendEntries RPC to all followers and, once it gets a response from a majority of them, commits the entry.
The log can get out of sync during network partitions or leader crashes. To fix this, the RPC message contains the term number and index of the entry that precedes the new entries. If that entry doesn't match the follower's log, the follower rejects the request. The leader then retries with the preceding entry from its log, walking backwards until both sides agree on a common entry. Everything that follows that point in the follower's log is discarded, and the follower syncs the rest of the log from the leader.
The log replication message has another purpose: it also serves as a heartbeat. It is emitted even if no new log entry was produced on the leader's side. As we saw in the paragraph about leader election, if a follower stops receiving these messages, it will trigger an election.
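The follower-side consistency check can be sketched like this. It is a simplified illustration (not Kafka's or etcd's code), where the `Entry` record and `appendEntries` method are hypothetical names standing in for the AppendEntries handling described above.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified follower-side view of the AppendEntries consistency check.
class FollowerLogSketch {
    record Entry(int term, String command) {}
    final List<Entry> log = new ArrayList<>();

    /**
     * The leader sends the term and index of the entry preceding the new ones.
     * If the follower's log has no matching entry at that index, it rejects
     * the request and the leader retries one entry further back.
     */
    boolean appendEntries(int prevIndex, int prevTerm, List<Entry> newEntries) {
        if (prevIndex >= 0) {
            if (prevIndex >= log.size() || log.get(prevIndex).term() != prevTerm) {
                return false;               // logs diverge here; leader will back up
            }
        }
        // Drop everything after the matching entry, then append the leader's entries.
        while (log.size() > prevIndex + 1) {
            log.remove(log.size() - 1);
        }
        log.addAll(newEntries);
        return true;
    }
}
```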
Cluster membership adjustment
Changing cluster membership is not a trivial task when we want to keep the system functional during the transition period. Raft uses a two-phase model where a joint consensus is formed as an intermediate step.
For a certain period of time, two cluster configurations are in effect. Some servers are members of both configurations. In order to achieve consensus, we need a quorum in both the old and the new configuration separately. The leader of the joint consensus may not even be a member of the new configuration.
This process is driven by the log replication process: to enter joint consensus, a configuration entry has to be committed. Once the joint consensus is formed, a second entry describing the new configuration is committed. And once that is achieved, old log entries and unnecessary nodes can be safely decommissioned.
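The commit rule during joint consensus can be written down in a few lines. This is a hedged sketch of the rule described above, with illustrative names; the point is that an entry counts as committed only if a majority of the old configuration and a majority of the new configuration have acknowledged it.

```java
import java.util.Set;

// Illustrative check for the joint-consensus commit rule.
class JointQuorumSketch {
    static boolean isMajority(Set<String> acked, Set<String> config) {
        long ackedInConfig = config.stream().filter(acked::contains).count();
        return ackedInConfig > config.size() / 2;
    }

    /** An entry is committed under joint consensus only if both quorums agree. */
    static boolean committedUnderJointConsensus(Set<String> ackedNodes,
                                                Set<String> oldConfig,
                                                Set<String> newConfig) {
        return isMajority(ackedNodes, oldConfig) && isMajority(ackedNodes, newConfig);
    }
}
```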
As a rule of thumb though, a Raft cluster should consist of 3 to 5 nodes. With more nodes, the randomized nature of the protocol may have a severe impact on performance. If you need to scale the system further, it is suggested to optimize your network layer or decouple the system and spin up multiple independent clusters.
Raft is just a protocol, not a concrete implementation. There are many implementations out there; possibly the most famous one comes from the etcd project, the key-value store behind Kubernetes. The Kafka team has its own implementation called KRaft.
Differences from classic Raft
As you may have noticed, the Raft protocol is push-based. The Kafka team leveraged existing features to build a pull-based protocol.
Not all nodes in a Kafka cluster form the quorum. Kafka identifies three types of nodes: leader, voter and observer. Only a handful of nodes form the quorum (the leader and the voters). They follow a similar consensus logic to the one described in Raft.
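A tiny sketch of these node types, with descriptive names of my own (not Kafka's internal classes): only the leader and the voters take part in the quorum, while observers simply replicate the metadata log.

```java
// Illustrative model of the KRaft node types described above.
enum KRaftNodeType {
    LEADER(true), VOTER(true), OBSERVER(false);

    final boolean partOfQuorum;
    KRaftNodeType(boolean partOfQuorum) { this.partOfQuorum = partOfQuorum; }
}
```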
Implications
The pull-based model means that:
- the log replication logic is reversed (as sketched below): a voter issues a fetch request with its epoch and offset, and retries if the leader rejects the request due to log divergence
- there is less network overhead in reconfiguration scenarios: decommissioned voters can be informed that they are no longer voters and should transition into observers on their next fetch from the leader
- an extra BeginEpoch API is needed to inform the voters about a new leader election; in the push-based model this is a natural role of the AppendEntries RPC
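Here is a rough sketch of the reversed (pull-based) replication flow. The `FetchResponse` shape, the `Leader` interface and the field names are hypothetical, chosen only to illustrate the idea; they are not KRaft's actual API. The voter drives replication: it fetches with its epoch and offset, and if the logs diverged, the leader's response tells it where to truncate before the next poll.

```java
// Illustrative voter-side fetch loop for a pull-based replication model.
class PullReplicationSketch {
    /** Hypothetical response shape: either new data or a divergence hint. */
    record FetchResponse(boolean diverged, long truncateToOffset, int recordsAppended) {}

    interface Leader {
        FetchResponse fetch(int voterEpoch, long fetchOffset);
    }

    long fetchOffset = 0;

    void pollOnce(Leader leader, int epoch) {
        FetchResponse response = leader.fetch(epoch, fetchOffset);
        if (response.diverged()) {
            // Logs diverged: truncate the local log to the leader's hint
            // and retry from there on the next poll.
            fetchOffset = response.truncateToOffset();
        } else {
            // Records were appended locally; advance the next fetch position.
            fetchOffset += response.recordsAppended();
        }
    }
}
```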
Raft allowed the Kafka team to reduce the latency of the metadata log, which in turn resulted in faster recovery times.
Unsupported Kafka features
The most painful missing feature, which is not directly related to the protocol itself, is the lack of an option to upgrade from ZooKeeper mode. Apart from that, the gaps include:
- security: no way to configure SCRAM/SASL users, no delegation token authentication
- storage: limited JBOD support
More yet to come
We are still waiting for a way to migrate existing ZooKeeper clusters to KRaft-based ones; that feature is planned for the Kafka 3.5 release. Kafka 4.0 will come with KRaft-only mode support.
Conclusion
In this post we briefly explored Raft consensus and hopefully you got a better understanding of the protocol. You should now understand that there are many Raft implementations out there and some of them diverge from the reference design, just as KRaft does. By leveraging existing, battle-tested pieces of Kafka infrastructure, the team was able to come up with an implementation that will hopefully stay with Kafka for decades, even though certain features are not supported yet.
It is important to point out that these changes are transparent to Kafka consumers and producers. Since version 2.5.0, Kafka has removed direct ZooKeeper access from all client-side APIs (including administrative tools). It was a huge undertaking, spanning many versions, but the day has finally come.
If you wish to learn more, I encourage you to watch this great visualization.