Recovery in Distributed Systems

Last Updated : 22 Nov, 2022

Recovery from errors is fundamental to fault tolerance. An error is a part of a system’s state that may lead to a failure. The whole idea of error recovery is to replace an erroneous state with an error-free state. Error recovery can be broadly divided into two categories.

1. Backward Recovery:

The main challenge in backward recovery is to bring the system from its current erroneous state back into a previously correct state. To do so, the system’s state must be recorded from time to time and restored when something goes wrong. Each time (part of) the system’s current state is recorded, a checkpoint is said to be made.
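The checkpoint-and-restore cycle can be sketched in a few lines of Python. This is a minimal, single-process illustration and not a distributed implementation. The class `CheckpointedCounter` and the helper `run_with_recovery` are hypothetical names invented for this example, and the checkpoint is held in memory rather than on stable storage.

```python
import copy


class CheckpointedCounter:
    """Toy component that supports backward recovery (hypothetical example)."""

    def __init__(self):
        self.state = {"total": 0, "applied": 0}
        # Assumption: the checkpoint lives in memory; a real system would
        # write it to stable storage.
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        # Record a copy of the current state, which is assumed to be correct.
        self._checkpoint = copy.deepcopy(self.state)

    def apply(self, amount):
        # A non-integer input stands in for "something goes wrong".
        if not isinstance(amount, int):
            raise ValueError("corrupt input")
        self.state["total"] += amount
        self.state["applied"] += 1

    def rollback(self):
        # Replace the erroneous state with the last recorded state.
        self.state = copy.deepcopy(self._checkpoint)


def run_with_recovery(counter, amounts):
    """Apply a batch of updates; roll back to the checkpoint if any update fails."""
    counter.checkpoint()
    try:
        for amount in amounts:
            counter.apply(amount)
    except ValueError:
        counter.rollback()
        return False
    return True
```

If a batch fails halfway, the partially applied updates are discarded and the counter returns to the state recorded just before the batch started.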

2. Forward Recovery:

In this case, when the system has entered an erroneous state, no attempt is made to return to a previous, checkpointed state. Instead, an effort is made to bring the system into a correct new state from which it can continue to operate. The main problem with forward error recovery is that potential errors must be anticipated in advance. Only then is it possible to correct those errors and move to a new state.
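One way to picture forward recovery is erasure correction, where a lost packet is rebuilt from the packets that did arrive instead of being requested again. The sketch below uses a single XOR parity packet. That technique is chosen here for illustration and is not prescribed by the text above. The function names are hypothetical, and all packets are assumed to have equal length.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def add_parity(packets):
    """Return the packets followed by one XOR parity packet."""
    parity = reduce(xor_bytes, packets)
    return list(packets) + [parity]


def recover_missing(received):
    """Rebuild at most one missing packet (marked None) and return the data packets."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) > 1:
        # The anticipated error is a single loss; anything else is not covered.
        raise ValueError("XOR parity can repair only one lost packet")
    received = list(received)
    if missing:
        present = [p for p in received if p is not None]
        # XOR of all surviving packets (including parity) equals the lost one.
        received[missing[0]] = reduce(xor_bytes, present)
    return received[:-1]  # drop the parity packet
```

The limitation mentioned above is visible in the code: only the anticipated error, a single lost packet, can be corrected. Two losses raise an exception because this scheme has no correct new state to move to.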

Both forms of recovery are used to provide fault tolerance in distributed systems.

Stable Storage :

Stable storage is designed to survive anything except major disasters such as floods and earthquakes. It can be implemented with a pair of ordinary disks. Each block on drive 2 is an exact copy of the corresponding block on drive 1. Whenever a block is updated, the block on drive 1 is updated and verified first; then the same block on drive 2 is updated.

Suppose that the system crashes after drive 1 is updated but before drive 2 is updated. Upon recovery, the two drives can be compared block by block. Whenever two corresponding blocks differ, it is safe to assume that drive 1 holds the correct one, because drive 1 is always updated before drive 2, so the new block is copied from drive 1 to drive 2. Once the recovery process completes, both drives are identical again.

Another potential problem is the spontaneous decay of a block. A previously valid block may suddenly develop a checksum error for no apparent reason. When such an error is detected, the bad block can be regenerated from the corresponding block on the other drive.
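The write order and both repair cases can be simulated in memory. The following is a sketch under simplifying assumptions: the `Drive` class, the `crash_between` flag, and the use of CRC-32 from Python's `zlib` module as the block checksum are all choices made for this illustration, not part of any real disk interface.

```python
import zlib


class Drive:
    """In-memory stand-in for a disk; each block is stored with a checksum."""

    def __init__(self, n_blocks):
        self.blocks = [(b"", zlib.crc32(b""))] * n_blocks

    def write(self, i, data):
        self.blocks[i] = (data, zlib.crc32(data))

    def read(self, i):
        data, crc = self.blocks[i]
        # None signals a checksum error (a decayed block).
        return data if zlib.crc32(data) == crc else None


def stable_write(d1, d2, i, data, crash_between=False):
    """Update drive 1, verify it, then update drive 2."""
    d1.write(i, data)
    if d1.read(i) != data:
        raise IOError("verification of drive 1 failed")
    if crash_between:
        return  # simulate a crash before drive 2 is updated
    d2.write(i, data)


def recover(d1, d2):
    """Compare the drives block by block and repair any differences."""
    for i in range(len(d1.blocks)):
        b1, b2 = d1.read(i), d2.read(i)
        if b1 is None and b2 is not None:
            d1.write(i, b2)   # drive 1 decayed: regenerate it from drive 2
        elif b1 is not None and b1 != b2:
            d2.write(i, b1)   # drive 2 is stale or decayed: drive 1 wins
```

If both copies of a block fail their checksum, `recover` leaves them untouched, which corresponds to the kind of failure that stable storage is not expected to survive.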

Checkpointing :

In a fault-tolerant distributed system, backward error recovery requires that the system regularly save its state onto stable storage. In particular, we need to record a consistent global state, also called a distributed snapshot. In a distributed snapshot, if a process P has recorded the receipt of a message, then there should also be a process Q that has recorded the sending of that message. After all, the message must have come from somewhere.
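The consistency condition can be written as a small check. The representation below is an assumption made for this sketch: each process's recorded state is a dictionary holding the set of message identifiers it has recorded as sent and the set it has recorded as received.

```python
def is_consistent(snapshot):
    """Return True if every recorded receipt has a matching recorded send.

    snapshot maps a process name to {"sent": set, "received": set}.
    (Hypothetical layout chosen for this example.)
    """
    all_sent = set().union(*(s["sent"] for s in snapshot.values()))
    return all(s["received"] <= all_sent for s in snapshot.values())


# P recorded the receipt of m1 and Q recorded sending it: consistent.
good = {
    "P": {"sent": set(), "received": {"m1"}},
    "Q": {"sent": {"m1"}, "received": set()},
}

# P recorded the receipt of m1, but no process recorded sending it.
bad = {
    "P": {"sent": set(), "received": {"m1"}},
    "Q": {"sent": set(), "received": set()},
}
```

A message that was recorded as sent but not yet as received does not violate the condition; it is simply still in transit when the snapshot is taken.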

In backward error recovery schemes, each process saves its state from time to time to a locally available stable storage. To recover after a process or system failure, we must construct a consistent global state from these local states. In particular, it is best to recover to the most recent distributed snapshot, also known as a recovery line. In other words, as depicted in Fig., a recovery line corresponds to the most recent consistent collection of checkpoints.
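When processes checkpoint independently, a recovery line can be searched for by starting from each process's latest checkpoint and rolling back any process that has recorded the receipt of a message nobody has recorded as sent. The sketch below is one possible way to do this, not a standard library routine. It assumes that every checkpoint stores cumulative `sent` and `received` sets and that each process's oldest checkpoint is its initial, empty state.

```python
def recovery_line(checkpoints):
    """Return the index of the chosen checkpoint for each process.

    checkpoints maps a process name to a list of checkpoints, oldest first.
    Each checkpoint is {"sent": set, "received": set} with cumulative sets.
    (Hypothetical layout chosen for this example.)
    """
    chosen = {p: len(cps) - 1 for p, cps in checkpoints.items()}
    while True:
        sent = set()
        for p, i in chosen.items():
            sent |= checkpoints[p][i]["sent"]
        # Processes whose checkpoint records a receipt without a matching send.
        orphans = [p for p, i in chosen.items()
                   if not checkpoints[p][i]["received"] <= sent]
        if not orphans:
            return chosen
        for p in orphans:
            if chosen[p] == 0:
                raise ValueError("initial checkpoint is inconsistent")
            chosen[p] -= 1  # roll this process back one checkpoint


# P sent m2 and received m1 before its checkpoint 1.
# Q received m2 before its checkpoint 1 but sent m1 only afterwards.
history = {
    "P": [{"sent": set(), "received": set()},
          {"sent": {"m2"}, "received": {"m1"}}],
    "Q": [{"sent": set(), "received": set()},
          {"sent": set(), "received": {"m2"}}],
}
```

For `history`, rolling P back removes the record of sending m2, which in turn forces Q back as well, so both processes end up at their initial checkpoints. This cascading rollback is the domino effect that uncoordinated checkpoints can cause.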

Coordinated Checkpointing :

As the name suggests, in coordinated checkpointing all processes synchronize to jointly write their state to local stable storage. The main advantage of coordinated checkpointing is that the saved state is automatically globally consistent, which avoids cascading rollbacks that could lead to a domino effect.
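One simple way to achieve this synchronization is a two-phase blocking protocol. The text above does not spell out a protocol, so this is only one possible design. In the single-threaded simulation below, a coordinator asks every process to take a tentative checkpoint and to hold outgoing messages, and it makes the checkpoints permanent only after all processes have acknowledged. The class, method, and field names are hypothetical, and the `stable` attribute stands in for local stable storage.

```python
class Process:
    """Simulated process taking part in a coordinated checkpoint."""

    def __init__(self, name):
        self.name = name
        self.state = 0
        self.blocked = False    # True while a checkpoint round is in progress
        self.outbox = []        # messages queued while blocked
        self.tentative = None
        self.stable = None      # stands in for local stable storage

    def send(self, msg):
        # While blocked, outgoing messages are queued instead of sent.
        if self.blocked:
            self.outbox.append(msg)
            return False
        return True  # the message would be handed to the network here

    def on_checkpoint_request(self):
        # Phase 1: take a tentative checkpoint and stop sending.
        self.blocked = True
        self.tentative = self.state
        return "ack"

    def on_checkpoint_done(self):
        # Phase 2: make the checkpoint permanent and resume sending.
        self.stable = self.tentative
        self.blocked = False
        queued, self.outbox = self.outbox, []
        return queued


def coordinated_checkpoint(processes):
    """Run one checkpoint round; commit only if every process acknowledged."""
    acks = [p.on_checkpoint_request() for p in processes]
    if all(a == "ack" for a in acks):
        for p in processes:
            p.on_checkpoint_done()
        return True
    return False
```

Because no process sends application messages between its tentative checkpoint and the end of the round, no saved state in this simulation can record the receipt of a message whose sending is missing from another saved state.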

Message Logging :

The basic idea underlying message logging is that if the transmission of messages can be replayed, we can still reach a globally consistent state without having to restore that state from stable storage. Instead, a checkpointed state is taken as the starting point, and all messages sent since that checkpoint are simply retransmitted and handled accordingly.

Incorrect replay of messages after recovery, leading to an orphan process

As the system executes, messages are recorded on stable storage. A message is called logged if its data and the index of the state interval that it started are both recorded on stable storage. In the above Fig., logged and unlogged messages are denoted by different arrows. After a failure, the logged messages can be read back and replayed so that execution continues from a globally consistent state.
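Replay from a checkpoint plus a message log can be sketched as follows. This is a minimal single-process illustration that assumes message handling is deterministic, so that replaying the same messages in the same order reproduces the same state. The class `LoggedProcess` is hypothetical, and its `checkpoint` and `log` attributes stand in for data kept on stable storage.

```python
class LoggedProcess:
    """Toy process that logs every delivered message before handling it."""

    def __init__(self):
        self.state = 0
        self.checkpoint = 0   # stands in for a checkpoint on stable storage
        self.log = []         # messages delivered since the last checkpoint

    def deliver(self, msg):
        self.log.append(msg)  # log first, so the message survives a crash
        self._handle(msg)

    def _handle(self, msg):
        # Deterministic and order-sensitive state transition.
        self.state = self.state * 2 + msg

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.log.clear()      # earlier messages are covered by the checkpoint

    def recover(self):
        # Start from the checkpoint and replay the logged messages in order.
        self.state = self.checkpoint
        for msg in self.log:
            self._handle(msg)
```

After a simulated crash that wipes `state`, calling `recover()` rebuilds the state the process had reached before the crash. A message that was delivered but never logged could not be replayed, and that is the situation in which orphan processes can arise.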
