
Recovery in Distributed Systems

Last Updated : 22 Nov, 2022

Pre-requisites: Distributed System

Recovery from an error is essential to fault tolerance. An error is the part of a system's state that could lead to a failure. The whole idea of error recovery is to replace an erroneous state with an error-free state. Error recovery can be broadly divided into two categories.

1.  Backward Recovery:

The main issue in backward recovery is bringing the system from its present erroneous state back into a previously correct state. To do so, the system's state must be recorded from time to time and restored when something goes wrong. Each time (part of) the system's current state is recorded, a checkpoint is said to be made.
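The idea of recording a state and restoring it later can be sketched for a single process as follows. This is a minimal illustration, not a complete mechanism: the class `CheckpointedProcess`, its `balance` field, and the in-memory list standing in for stable storage are all assumptions made for the example.

```python
import copy


class CheckpointedProcess:
    """Toy process whose state can be checkpointed and rolled back."""

    def __init__(self):
        self.state = {"balance": 0}
        # Hypothetical stand-in for stable storage; a real system would
        # write checkpoints to durable media instead of a Python list.
        self._checkpoints = []

    def checkpoint(self):
        # Store a deep copy so later updates cannot alter the saved state.
        self._checkpoints.append(copy.deepcopy(self.state))

    def rollback(self):
        # Backward recovery: replace the current (possibly erroneous)
        # state with the most recently recorded one.
        if not self._checkpoints:
            raise RuntimeError("no checkpoint to roll back to")
        self.state = copy.deepcopy(self._checkpoints[-1])


proc = CheckpointedProcess()
proc.state["balance"] = 100
proc.checkpoint()                # known-good state recorded
proc.state["balance"] = -999     # the process enters an erroneous state
proc.rollback()                  # restore the checkpointed state
print(proc.state)                # {'balance': 100}
```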

2.  Forward Recovery:

In this case, when the system has entered an erroneous state, no attempt is made to return to a previous, checkpointed state. Instead, an effort is made to bring the system into a correct new state from which it can continue to operate. The fundamental problem with forward error recovery techniques is that the errors that may occur must be known in advance. Only then is it possible to correct those errors and move to a new state.

Both forms of recovery are used to provide fault tolerance in distributed systems.

Stable Storage :

Stable storage is storage designed to survive anything except major disasters such as floods and earthquakes. It can be implemented with a pair of ordinary disks. Each block on drive 2 is an exact copy of the corresponding block on drive 1. When a block is updated, the block on drive 1 is updated and verified first; then the same block on drive 2 is updated.

Suppose that the system crashes after drive 1 is updated but before drive 2 is updated. Upon recovery, the two drives can be compared block by block. Because drive 1 is always updated before drive 2, whenever two corresponding blocks differ it is safe to assume that drive 1 holds the correct one, so the new block is copied from drive 1 to drive 2. Once the recovery process is finished, both drives are identical again.

Another potential problem is the spontaneous decay of a block. A previously valid block may suddenly produce a checksum error for no apparent reason. When such an error is detected, the bad block can be regenerated from the corresponding block on the other drive.
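The two-drive scheme described above might look roughly like the sketch below. It is a simplified in-memory model under stated assumptions: the class `StableStorage`, the `crash_after_drive1` flag used to simulate a crash, and the use of CRC-32 as the block checksum are choices made for illustration, not details given in the text.

```python
import zlib


class StableStorage:
    """In-memory model of stable storage built from two mirrored drives."""

    def __init__(self, n_blocks):
        # Each block is stored as a (data, checksum) pair.
        self.drive1 = [self._pack(b"") for _ in range(n_blocks)]
        self.drive2 = [self._pack(b"") for _ in range(n_blocks)]

    @staticmethod
    def _pack(data):
        return (data, zlib.crc32(data))

    @staticmethod
    def _valid(block):
        data, checksum = block
        return zlib.crc32(data) == checksum

    def write(self, i, data, crash_after_drive1=False):
        # Update and verify drive 1 first, then update drive 2.
        self.drive1[i] = self._pack(data)
        if not self._valid(self.drive1[i]):
            raise IOError("verification of drive 1 failed")
        if crash_after_drive1:
            return  # simulated crash between the two updates
        self.drive2[i] = self._pack(data)

    def recover(self):
        # Compare the drives block by block and repair differences.
        for i in range(len(self.drive1)):
            b1, b2 = self.drive1[i], self.drive2[i]
            ok1, ok2 = self._valid(b1), self._valid(b2)
            if ok1 and (not ok2 or b1 != b2):
                self.drive2[i] = b1   # drive 1 is newer or drive 2 decayed
            elif ok2 and not ok1:
                self.drive1[i] = b2   # drive 1 decayed; use the mirror
            # If both copies are invalid, the block cannot be repaired here.


store = StableStorage(2)
store.write(0, b"old")
store.write(0, b"new", crash_after_drive1=True)  # drive 2 still holds b"old"
store.write(1, b"payload")

# Simulate spontaneous decay of block 1 on drive 1: the data changes
# but the stored checksum does not.
_, checksum = store.drive1[1]
store.drive1[1] = (b"pXyload", checksum)

store.recover()
```

After `recover()` returns, both drives hold `b"new"` in block 0 and `b"payload"` in block 1, which covers both the crash case and the decayed-block case.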

Checkpointing :

In a fault-tolerant distributed system, backward error recovery requires the system to save its state regularly onto stable storage. In particular, we need to record a consistent global state, also called a distributed snapshot. In a distributed snapshot, if a process P has recorded the receipt of a message, then there should also be a process Q that has recorded the sending of that message. After all, the message must have come from somewhere.
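The consistency condition can be checked mechanically. The following sketch assumes a hypothetical representation in which each process's recorded state lists the identifiers of the messages it has sent and received; the function name `is_consistent` and the dictionary layout are illustrative only.

```python
def is_consistent(snapshot):
    """Return True if every recorded receipt has a matching recorded send.

    `snapshot` maps a process name to a dict with two sets of message
    identifiers: "sent" and "received".
    """
    all_sent = set()
    for state in snapshot.values():
        all_sent |= state["sent"]
    # A receipt without a recorded send makes the snapshot inconsistent.
    return all(state["received"] <= all_sent for state in snapshot.values())


# P recorded receiving m1 and Q recorded sending it: consistent.
consistent = {
    "P": {"sent": set(), "received": {"m1"}},
    "Q": {"sent": {"m1"}, "received": set()},
}

# P recorded receiving m1, but no process recorded sending it.
inconsistent = {
    "P": {"sent": set(), "received": {"m1"}},
    "Q": {"sent": set(), "received": set()},
}

print(is_consistent(consistent))    # True
print(is_consistent(inconsistent))  # False
```

Note that the reverse situation, a send recorded without the matching receipt, does not violate the condition: the message may simply still be in transit.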

A Recovery Line


In backward error recovery schemes, each process periodically saves its state to locally available stable storage. To recover from a process or system failure, we must construct a consistent global state from these local states. In particular, it is best to recover to the most recent distributed snapshot, also known as a recovery line. In other words, as depicted in the figure, a recovery line corresponds to the most recent consistent collection of checkpoints.
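One way to picture the search for a recovery line is to start from every process's latest checkpoint and roll back any process whose checkpoint records a message that no selected checkpoint records as sent. The sketch below follows that idea under several assumptions: checkpoints are hypothetical dicts of cumulative `sent` and `received` message identifiers, message identifiers are unique, and checkpoint 0 of every process is its initial state with nothing received. It is an illustration, not an algorithm taken from the text.

```python
def recovery_line(checkpoints):
    """Return, per process, the index of its checkpoint on the recovery line.

    `checkpoints` maps a process name to a list of checkpoints, oldest
    first. Index 0 is assumed to be the initial state (nothing received).
    """
    # Start from the most recent checkpoint of every process.
    line = {p: len(cps) - 1 for p, cps in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        sent = set()
        for p, i in line.items():
            sent |= checkpoints[p][i]["sent"]
        for p, i in line.items():
            # A receipt with no recorded send forces the receiver back
            # by one checkpoint; the loop then re-examines everything.
            if not checkpoints[p][i]["received"] <= sent:
                line[p] = i - 1
                changed = True
    return line


def cp(sent=(), received=()):
    """Small helper to build a checkpoint record."""
    return {"sent": set(sent), "received": set(received)}


# Q sends m1 after its last checkpoint; P receives m1 before its last one.
history = {
    "P": [cp(), cp(), cp(received=["m1"])],
    "Q": [cp(), cp()],
}
print(recovery_line(history))  # {'P': 1, 'Q': 1}
```

In the example, P's latest checkpoint cannot be part of the recovery line because Q never recorded sending `m1`, so P falls back to its previous checkpoint. When such rollbacks keep triggering further rollbacks, the processes may end up at their initial states.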

Coordinated Checkpointing : 

As the name suggests, in coordinated checkpointing all processes synchronise to write their state to local stable storage jointly. The key benefit of coordinated checkpointing is that the saved state is automatically globally consistent, which avoids cascading rollbacks that could lead to a domino effect.
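The text does not name a specific protocol, so the sketch below uses one common approach as an assumption: a two-phase blocking scheme in which a coordinator asks every process to checkpoint, each process stops handing new messages to the network until told otherwise, and the coordinator releases everyone once all acknowledgements are in. The classes, method names, and the single-threaded simulation are hypothetical.

```python
class Process:
    """Participant in a simulated two-phase blocking checkpoint."""

    def __init__(self, name):
        self.name = name
        self.state = 0
        self.stable = None      # stand-in for local stable storage
        self.blocked = False
        self.outbox = []        # messages queued while blocked
        self.delivered = []     # messages handed to the network

    def send(self, msg):
        # While a checkpoint is in progress, outgoing messages are queued.
        if self.blocked:
            self.outbox.append(msg)
        else:
            self.delivered.append(msg)

    def on_checkpoint_request(self):
        # Phase 1: save local state, block outgoing traffic, acknowledge.
        self.stable = self.state
        self.blocked = True
        return "ACK"

    def on_checkpoint_done(self):
        # Phase 2: resume and flush messages queued in the meantime.
        self.blocked = False
        self.delivered.extend(self.outbox)
        self.outbox.clear()


def coordinated_checkpoint(processes):
    """Coordinator: request checkpoints, then release all processes."""
    acks = [p.on_checkpoint_request() for p in processes]
    if all(a == "ACK" for a in acks):
        for p in processes:
            p.on_checkpoint_done()
        return True
    return False


procs = [Process("P"), Process("Q")]
procs[0].state, procs[1].state = 7, 3
ok = coordinated_checkpoint(procs)
print(ok, [p.stable for p in procs])  # True [7, 3]
```

Because no process sends a message between taking its checkpoint and being released, no checkpoint in the set can record the receipt of a message whose send is missing from another checkpoint.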

Message Logging :

The core principle of message logging is that if the transmission of messages can be replayed, we can still reach a globally consistent state without having to restore that state from stable storage. Instead, a checkpointed state is taken as the starting point, and all messages that have been sent since then are simply retransmitted and handled accordingly.

Incorrect replay of messages after recovery, leading to an orphan process


As the system executes, messages are recorded on stable storage. A message is said to be logged if both its data and the index of the state interval that it started are recorded on stable storage. In the figure above, logged and unlogged messages are denoted by different arrows. After a failure, a process can be restored from its last checkpoint and the logged messages replayed, so that execution continues from a globally consistent state.
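Replay from a log can be sketched for a single receiver as shown below. The example assumes that message handling is deterministic, so that replaying the same messages in the same order reproduces the same state, and that every message is logged before it is processed. The class `LoggedProcess` and its in-memory log are hypothetical stand-ins for a real process and its stable storage.

```python
class LoggedProcess:
    """Toy process that logs received messages and replays them on recovery."""

    def __init__(self):
        self.state = []
        self.checkpoint_state = []
        self.log = []   # stand-in for the message log on stable storage

    def checkpoint(self):
        # Save the current state; earlier log entries are no longer needed.
        self.checkpoint_state = list(self.state)
        self.log.clear()

    def receive(self, msg):
        # Log the message before handling it so it survives a crash.
        self.log.append(msg)
        self._handle(msg)

    def _handle(self, msg):
        # Deterministic handler: the same input always gives the same result.
        self.state.append(msg.upper())

    def crash(self):
        # Volatile state is lost; checkpoint and log are assumed to survive.
        self.state = None

    def recover(self):
        # Restore the checkpoint, then replay logged messages in order.
        self.state = list(self.checkpoint_state)
        for msg in self.log:
            self._handle(msg)


p = LoggedProcess()
p.receive("a")
p.checkpoint()
p.receive("b")
p.receive("c")
before_crash = list(p.state)

p.crash()
p.recover()
print(p.state == before_crash)  # True
```

A message that was received but not logged before the crash would be missing from the replay, which is how a recovered process can end up inconsistent with the processes that depend on it.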
