
Handling Failure in Distributed System

A distributed system is a group of independent computers that appears to its clients as a single cohesive system. Several components in any distributed system work together to execute a task. As the system becomes more complicated and contains more components, the likelihood that some component fails rises, which decreases reliability. In other words, at any given time some parts of a distributed system may be broken while others function normally. This is known as a partial failure.

Partial failures are hard to reason about because the time a message takes to travel across a network is non-deterministic. When no reply arrives, the sender has no way of knowing whether the request was lost, the remote node failed, or the reply is merely delayed. This is what makes working with distributed systems tough.

Partial failures such as node crashes or communication link failures can cause the following issues during inter-process communication:

1. Request Message Loss: The request can be lost when the communication link between sender and receiver fails, or when the receiver’s node is down at the time the request message reaches it.

2. Loss of Response Message: The response can be lost when the communication link between sender and receiver fails, or when the node that sent the request is down at the time the response message reaches it.

3. Unsuccessful Request Execution: This occurs when the receiver’s node crashes while it is processing the request.

To handle these issues, a message-passing system employs a reliable IPC protocol. Such a protocol relies on two mechanisms: the sender internally retransmits a message after a fixed timeout interval, and the kernel on the receiving machine returns an acknowledgment message to the kernel on the sending machine.
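The interplay of timeout, retransmission, and acknowledgment can be sketched as a small simulation. This is a minimal illustration, not the protocol of any particular kernel: `ScriptedChannel` and `send_reliably` are hypothetical names, and the "timeout" is modeled simply as moving on to the next attempt.

```python
class ScriptedChannel:
    """Hypothetical channel whose deliveries follow a fixed script.

    Each call to deliver() consumes one outcome: True means the message
    arrived, False means it was lost. When the script runs out, every
    further message is treated as lost.
    """

    def __init__(self, outcomes):
        self._outcomes = iter(outcomes)

    def deliver(self):
        return next(self._outcomes, False)


def send_reliably(channel, max_attempts=5):
    """Retransmit a request until an acknowledgment comes back.

    Returns (attempts_used, copies_received_by_receiver).
    attempts_used is None if the sender gives up.
    """
    copies_received = 0
    for attempt in range(1, max_attempts + 1):
        if channel.deliver():        # the request reaches the receiver
            copies_received += 1
            if channel.deliver():    # the acknowledgment reaches the sender
                return attempt, copies_received
        # No acknowledgment arrived: a real sender would wait here until
        # the timeout expires, then loop around and retransmit.
    return None, copies_received


# The first acknowledgment is lost, so the request is sent twice
# and the receiver ends up seeing two copies of it.
print(send_reliably(ScriptedChannel([True, False, True, True])))
```

Note that a lost acknowledgment makes the receiver see the same request more than once, so the receiver must be prepared for duplicate requests.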

The following reliable IPC protocols are used in client-server communication between two processes:

1. Four-Message Reliable IPC Protocol: In this protocol, client-server communication between two processes takes place in the following manner:

2. Three-Message Reliable IPC Protocol: In client-server communication, a successful response received by the client process confirms that the server received the request message. The protocol is based on this concept:

A problem can arise if the request takes a long time to process. A message can be retransmitted only after a fixed timeout interval, which is generally set to a large value to avoid wasteful retransmissions. On the other hand, if the timeout is too short relative to the time needed to process the request, the request message may be sent multiple times. To deal with this issue, use the following protocol:

3. Two-Message Reliable IPC Protocol: The Two-Message Reliable IPC Protocol is used for client-server communication between two processes. A message-passing system might implement it as follows:

Idempotency:

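Idempotency essentially refers to “repeatability.” Executing an idempotent operation several times with the same parameters produces the same outcome as executing it once, with no additional side effects from the repeated executions. This matters because retransmission can cause a server to receive and execute the same request more than once.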
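The difference can be seen in a small sketch. The account dictionary and the function names `set_balance` and `deposit` are hypothetical and chosen only for illustration.

```python
def set_balance(account, amount):
    """Idempotent: repeating the call leaves the same final state."""
    account["balance"] = amount
    return account["balance"]


def deposit(account, amount):
    """Not idempotent: every repeated call changes the state again."""
    account["balance"] += amount
    return account["balance"]


account = {"balance": 100}

# A duplicated set_balance request is harmless.
set_balance(account, 50)
set_balance(account, 50)
print(account["balance"])  # 50

# A duplicated deposit request credits the account twice.
deposit(account, 10)
deposit(account, 10)
print(account["balance"])  # 70
```

If a retransmitted `deposit` request were executed twice, the account would be credited twice, so a non-idempotent operation needs extra protection against duplicate requests.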

Tracking of Lost and Out-of-Sequence Packets in Multidatagram Messages:

Transmission of a multidatagram message is complete only when the process to which it was sent has received all of the message’s packets, because every packet is needed. The simple approach is to acknowledge each packet separately (called the stop-and-wait protocol). The second approach (called the blast protocol) is to use a single acknowledgment packet for all the packets of a message. With this method, however, a node crash or a communication link failure may result in the following issues:

To handle these problems, a bitmap is used to identify the packets of a message, so that the receiver can tell which packets have arrived and which are still missing.
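One way such a bitmap could work on the receiving side is sketched below. It assumes that every packet carries its position in the message and that the receiver knows the total number of packets. The class name `MessageAssembler` and its methods are hypothetical and are not taken from any specific system.

```python
class MessageAssembler:
    """Hypothetical receiver-side tracker for one multidatagram message."""

    def __init__(self, total_packets):
        self.total_packets = total_packets
        self.bitmap = 0                          # bit i is set once packet i arrives
        self.payloads = [None] * total_packets

    def receive(self, index, payload):
        """Record a packet; arrival order and duplicates do not matter."""
        if not 0 <= index < self.total_packets:
            raise ValueError("packet index out of range")
        self.bitmap |= 1 << index
        self.payloads[index] = payload

    def missing(self):
        """Positions of packets that have not arrived yet."""
        return [i for i in range(self.total_packets)
                if not (self.bitmap >> i) & 1]

    def is_complete(self):
        return self.bitmap == (1 << self.total_packets) - 1

    def message(self):
        """Reassemble the packets in order once all have arrived."""
        if not self.is_complete():
            raise RuntimeError("message is still missing packets")
        return b"".join(self.payloads)


assembler = MessageAssembler(total_packets=4)
for index, payload in [(2, b"c"), (0, b"a"), (3, b"d")]:  # out of sequence
    assembler.receive(index, payload)

print(assembler.missing())   # packet 1 is missing and can be requested again
assembler.receive(1, b"b")
print(assembler.message())
```

Because the bitmap records exactly which positions are missing, the receiver can ask the sender to retransmit only those packets instead of the whole message.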

Various other kinds of failures can also occur in a distributed system:

The other failures mentioned above can be handled in the following manner:

