
Raft Consensus Algorithm

This article gives a brief history of Raft and explains what consensus is, what the Raft protocol is, what its advantages are, how it compares with its alternatives, and what some of its limitations are.

 



Introduction

The Raft protocol was developed by Diego Ongaro and John Ousterhout at Stanford University, and the work earned Diego his Ph.D. in 2014 (the link to the paper is in the References section at the end of the article). Raft was designed to make consensus (we will explain what consensus is in a moment) easier to understand, because its predecessor, the Paxos algorithm developed by Leslie Lamport, is very difficult to understand and implement. Hence the title of the paper, ‘In Search of an Understandable Consensus Algorithm’. Before Raft, Paxos was considered the holy grail of achieving consensus.
Let's start.

 



Consensus

To understand Raft, we first look at the problem it tries to solve: achieving consensus. Consensus means multiple servers agreeing on the same information, which is imperative for designing fault-tolerant distributed systems. Let's describe it with the help of a couple of visuals.
First, let's define the process a client uses to interact with a server.
Process: The client sends a message to the server, and the server responds with a reply.

A consensus protocol tolerating failures must have the following features : 
 

Now, assuming only one client (for the sake of understandability), there can be two types of systems:
 

single server raft visual

A system in which all the servers replicate (or maintain) the same data (shared state) over time can, for now, be referred to as a replicated state machine.
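As a rough illustration of this idea (a minimal Python sketch, not part of Raft itself; the `KeyValueStore` class and the `("set", key, value)` command format are assumptions made up for this example), applying the same commands in the same order to deterministic state machines leaves every server with the same state:

```python
class KeyValueStore:
    """A tiny deterministic state machine (hypothetical, for illustration)."""

    def __init__(self):
        self.data = {}

    def apply(self, command):
        # Each command is a tuple of the form ("set", key, value).
        op, key, value = command
        if op != "set":
            raise ValueError(f"unknown operation: {op}")
        self.data[key] = value


# The same ordered log of commands is delivered to every server.
log = [("set", "x", 1), ("set", "y", 2), ("set", "x", 3)]

servers = [KeyValueStore() for _ in range(3)]
for server in servers:
    for command in log:
        server.apply(command)

# Every replica ends up holding identical state.
states = [server.data for server in servers]
```

The hard part, which the consensus protocol addresses, is getting all servers to agree on the contents and order of that log when servers can fail.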

We shall now define some terms used to refer to individual servers in a distributed system.
 

The above system can now be labelled as shown in the following figure.
 

multiple server labelled raft visual

CAP theorem
The CAP theorem states that a distributed database system can guarantee only two of the following three properties: consistency, availability, and partition tolerance.

 

What is the Raft protocol

Raft is a consensus algorithm that is designed to be easy to understand. It’s equivalent to Paxos in fault-tolerance and performance. The difference is that it’s decomposed into relatively independent subproblems, and it cleanly addresses all major pieces needed for practical systems. We hope Raft will make consensus available to a wider audience, and that this wider audience will be able to develop a variety of higher quality consensus-based systems than are available today. 

 

Raft consensus algorithm explained

To begin with, Raft states that each node in a replicated state machine (server cluster) is in one of three states: leader, candidate, or follower. The image below provides the necessary visual aid.
 

Under normal conditions, a node is in exactly one of the above three states. Only the leader interacts with the client; any request sent to a follower node is redirected to the leader node. A candidate can ask for votes to become the leader. A follower only responds to candidates or the leader.

To maintain these server states, the Raft algorithm divides time into terms of arbitrary length. Each term is identified by a monotonically increasing number, called the term number.
Term number
The term number is maintained by every node and is passed along in communications between nodes. Every term starts with an election to determine the new leader. Candidates ask the other server nodes for votes to gather a majority. If a candidate gathers a majority, it becomes the leader for the current term. If no candidate gathers a majority, the situation is called a split vote and the term ends with no leader. Hence, a term can have at most one leader.
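A minimal sketch of how a node might react to the term number carried by an incoming message is given here. The `Role` and `Node` names and the `observe_term` method are hypothetical and chosen only for this example; the rule itself (adopt a higher term and step down to follower, reject messages from an older term) follows the Raft paper.

```python
from enum import Enum


class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


class Node:
    """Hypothetical node that tracks only its role and current term."""

    def __init__(self):
        self.role = Role.FOLLOWER
        self.current_term = 0

    def observe_term(self, incoming_term):
        """Process the term carried by an incoming message.

        Returns False if the message is stale and should be rejected.
        """
        if incoming_term < self.current_term:
            # The sender is behind: reject its message.
            return False
        if incoming_term > self.current_term:
            # A newer term exists: adopt it and fall back to follower.
            self.current_term = incoming_term
            self.role = Role.FOLLOWER
        return True
```

For example, a leader in term 3 that sees a message from term 5 updates its own term to 5 and becomes a follower, while a later message from term 4 is rejected as stale.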
Purpose of maintaining term number 
The following tasks are carried out by observing the term number of each node:
 

The Raft algorithm uses two types of Remote Procedure Calls (RPCs) to carry out its functions:
 

Now, let's have a look at the process of leader election.
 

Leader election

To maintain its authority as Leader of the cluster, the Leader node sends heartbeats to the Follower nodes. A leader election takes place when a Follower node times out while waiting for a heartbeat from the Leader node. At this point, the timed-out node changes its state to Candidate, votes for itself and issues RequestVote RPCs to gather a majority and attempt to become the Leader. The election can go one of the following three ways:
 

 

raft leader election

The following excerpt from the Raft paper (linked in the references below) explains a significant aspect of server timeouts.

 

Raft uses randomized election timeouts to ensure that split votes are rare and that they are resolved quickly. To prevent split votes in the first place, election timeouts are chosen randomly from a fixed interval (e.g., 150–300ms). This spreads out the servers so that in most cases only a single server will time out; it wins the election and sends heartbeats before any other servers time out. The same mechanism is used to handle split votes. Each candidate restarts its randomized election timeout at the start of an election, and it waits for that timeout to elapse before starting the next election; this reduces the likelihood of another split vote in the new election. 
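The two ideas in this excerpt and in the election description above, randomized timeouts and majority voting, can be sketched as follows. This is a simplified illustration, not a full election: the function names and the five-server cluster are assumptions for the example, and the 150–300 ms interval is the one quoted above.

```python
import random

ELECTION_TIMEOUT_MS = (150, 300)  # interval taken from the quoted example


def random_election_timeout(rng=random):
    """Pick a fresh timeout, in milliseconds, from the fixed interval."""
    low, high = ELECTION_TIMEOUT_MS
    return rng.uniform(low, high)


def has_majority(votes_received, cluster_size):
    """A candidate wins once more than half of the cluster voted for it."""
    return votes_received > cluster_size // 2


# Each server draws its own timeout; the one with the smallest value
# times out first and becomes the first candidate.
rng = random.Random(42)  # seeded only to make the example repeatable
timeouts = {f"server-{i}": random_election_timeout(rng) for i in range(5)}
first_candidate = min(timeouts, key=timeouts.get)
```

In a five-server cluster a candidate needs three votes (its own included), so `has_majority(3, 5)` is true and `has_majority(2, 5)` is false.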
 

 

Log Replication

For simplicity, and with a beginner audience in mind, we restrict our scope to a client making only write requests. Each request made by the client is stored in the log of the Leader. This log is then replicated to the other nodes (Followers). Typically, a log entry contains the following three pieces of information:
 

The Leader node sends AppendEntries RPCs to all other servers (Followers) to bring their logs in sync with its own. The Leader keeps sending the RPCs until all the Followers have safely replicated the new entry in their logs.

The algorithm has a concept of committing an entry. When a majority of the servers in the cluster have successfully copied a new entry into their logs, the entry is considered committed. At this point, the Leader also marks the entry as committed in its own log to show that it has been successfully replicated. All preceding entries in the log are also considered committed, because every server that stores the new entry stores the entries before it as well. After the entry is committed, the leader executes it and returns the result to the client.
Note that entries are executed in log order, which is the order in which the Leader received them.
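The majority rule for committing can be sketched as a small function. This is a simplified illustration: the name `majority_replicated_index` and the `match_index` list are assumptions for the example, and the sketch leaves out the additional restriction in the Raft paper that a leader only counts replicas this way for entries from its own current term.

```python
def majority_replicated_index(match_index):
    """Return the highest log index stored on a majority of servers.

    match_index holds, for every server in the cluster (leader included),
    the highest log index known to be replicated on that server.
    """
    ordered = sorted(match_index, reverse=True)
    majority = len(ordered) // 2 + 1
    # The value at position majority - 1 is held by at least
    # `majority` servers, and no higher index is.
    return ordered[majority - 1]
```

For instance, with five servers whose logs reach indexes 5, 5, 4, 2 and 1, index 4 is the highest one present on three servers, so entries up to index 4 can be treated as committed.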

If two entries in different logs (the Leader’s and a Follower’s) have the same index and term, they are guaranteed to store the same command, and the logs are identical in all entries up to that index.

However, if the Leader crashes, the logs may become inconsistent. Quoting the Raft paper:
 

In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log. 
 

The Leader looks for the last index at which its log and the Follower’s log match, and then overwrites any entries in the Follower’s log beyond that point with the entries from its own log. This brings the Follower’s log in line with the Leader’s. To find the matching point, the Leader resends the AppendEntries RPC with progressively lower index numbers until a match is found. When the match is found, the RPC succeeds.
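The back-off procedure described above might look roughly like the sketch below. It runs both sides in one process with no networking, and the `Entry` class and the function names are hypothetical. As a simplification, the follower here always discards everything after the matching point, whereas the Raft paper only removes entries that actually conflict.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    term: int
    command: str


def append_entries(follower_log, prev_index, prev_term, entries):
    """Follower side of a simplified AppendEntries consistency check.

    Indexes are 1-based; prev_index == 0 means "before the first entry".
    Returns True on success, False if the follower's log does not match.
    """
    if prev_index > len(follower_log):
        return False  # follower is missing the entry at prev_index
    if prev_index > 0 and follower_log[prev_index - 1].term != prev_term:
        return False  # entry at prev_index is from a different term
    # Match found: drop everything after prev_index, then append.
    del follower_log[prev_index:]
    follower_log.extend(entries)
    return True


def sync_follower(leader_log, follower_log):
    """Leader side: retry with a lower index until the follower accepts."""
    next_index = len(leader_log) + 1
    while True:
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1].term if prev_index > 0 else 0
        if append_entries(follower_log, prev_index, prev_term,
                          leader_log[prev_index:]):
            return prev_index  # last index on which the logs matched
        next_index -= 1
```

With a leader log whose entries have terms 1, 1, 2, 3 and a follower log with terms 1, 1, 2, 2, 2, the first attempt (at index 4) fails, the second (at index 3) matches, and the follower's two extra entries are replaced by the leader's fourth entry.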

 

Safety

To maintain consistency across the server nodes, the Raft consensus algorithm ensures that the leader of any term has all the entries committed in previous terms in its log.

During a leader election, the RequestVote RPC also carries information about the candidate’s log (the index and term of its last entry) so that voters can figure out whose log is more up to date. If the candidate requesting the vote has a less up-to-date log than the Follower it is requesting the vote from, the Follower simply doesn’t vote for that candidate. The following excerpt from the original Raft paper states this precisely.
 

Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date. 
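The comparison in this excerpt translates almost directly into code. The following is a minimal sketch with a hypothetical function name and parameters; it returns whether a candidate's log is at least as up to date as the voter's, which is the condition for the voter to consider granting its vote.

```python
def candidate_log_is_up_to_date(candidate_last_term, candidate_last_index,
                                voter_last_term, voter_last_index):
    """Compare two logs by the term and index of their last entries."""
    if candidate_last_term != voter_last_term:
        # Different last terms: the later term wins.
        return candidate_last_term > voter_last_term
    # Same last term: the longer (or equally long) log wins.
    return candidate_last_index >= voter_last_index
```

A candidate whose last entry is from term 3 beats a voter whose last entry is from term 2 even if the voter's log is longer, because the term is compared first.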
 

Rules for Safety in the Raft protocol 
The Raft protocol guarantees the following safety properties by virtue of its design:
 

Cluster membership and Joint Consensus

When the set of nodes in the cluster changes (a cluster configuration change), the system becomes susceptible to faults that can break it. To prevent this, Raft uses a two-phase approach to changing the cluster membership: the cluster first switches to an intermediate state (known as joint consensus) before adopting the new membership configuration. Joint consensus allows the system to keep responding to client requests even while the transition between configurations is taking place, which increases the availability of the distributed system, one of its main aims.
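According to the Raft paper, while the cluster is in the joint consensus state, agreement (for elections and for committing entries) requires separate majorities from both the old and the new configuration. A minimal sketch of that check, with hypothetical function names and server identifiers, could be:

```python
def majority_of(acks, members):
    """True if more than half of `members` appear in `acks`."""
    return len(acks & members) > len(members) // 2


def joint_agreement(acks, old_config, new_config):
    """Agreement during joint consensus needs both majorities."""
    return majority_of(acks, old_config) and majority_of(acks, new_config)


old_config = {"a", "b", "c"}
new_config = {"c", "d", "e"}

# All of the old servers agree, but only one of the new ones does.
only_old = joint_agreement({"a", "b", "c"}, old_config, new_config)

# Two servers from each configuration agree ("c" counts in both).
both = joint_agreement({"a", "c", "d"}, old_config, new_config)
```

Here `only_old` is false and `both` is true, which shows why neither configuration can make decisions on its own during the transition.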

 

What are its advantages/Features

 

Raft Alternatives

Limitations

