
Basic Fault Tolerant Software Techniques


Fault tolerance is a property of software systems that allows them to continue functioning even in the event of failures or errors. In this article, we discuss the fault-tolerance techniques used in software systems in detail. The following are some basic techniques used to improve the fault tolerance of software systems:

  1. Redundancy: This involves duplicating critical components of the software system so that if one component fails, the others can take over and keep the system running. This can include using redundant hardware, such as redundant servers or storage systems, or creating redundant software components.
  2. Checkpointing: This involves periodically saving the state of the software system so that if a failure occurs, the system can be restored to a previous state. This can be useful in systems that require a lot of processing time, as it allows the system to restart from a saved state if it crashes or fails.
  3. Error Detection and Correction: This involves detecting errors and correcting them before they cause problems. For example, error detection and correction algorithms can be used to detect and correct errors in data transmission.
  4. Failure Prediction: This involves using algorithms or heuristics to predict when a failure is likely to occur so that the system can take appropriate action to prevent or mitigate the failure.
  5. Load Balancing: This involves distributing workloads across multiple components so that no single component is overburdened. This can help to prevent failures and improve the overall performance of the system.
  6. Autonomous Systems: Autonomous systems are made to identify, diagnose, and fix errors on their own without the need for human assistance. To ensure ongoing operation, these systems use automatic fault isolation, recovery, and identification procedures.
  7. Isolation and Containment: The goal of isolation and containment approaches is to build systems so that errors in one component do not spread to the rest of the system. This can involve partitioning components and minimizing the impact of errors through the use of virtualization, microservices, or containers.
  8. Replication: The practice of making multiple copies of essential system components or services and distributing them to several places is known as replication. These are designed to be fault-tolerant and to function continuously even in the event of a failure.
  9. Dynamic reconfiguration: This technique allows a system to dynamically respond to faults, reallocate resources, and adapt to changing conditions. By modifying the configuration of the system in real time according to the operational conditions, this technique improves system resilience.

These are just a few of the basic techniques used to improve the fault tolerance of software systems. In practice, many systems use a combination of these techniques to provide the highest level of fault tolerance possible.
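For instance, the checkpointing technique above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the file name and state layout are hypothetical, and a real system would also guard against partial writes.

```python
import json
import os

CHECKPOINT_FILE = "state.json"  # hypothetical checkpoint location

def save_checkpoint(state):
    """Persist the current state so a crashed run can resume later."""
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump(state, f)

def load_checkpoint():
    """Restore the last saved state, or start fresh if none exists."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    return {"next_item": 0, "total": 0}

def process(items):
    """Sum a list of numbers, checkpointing after every step so a
    restarted run picks up where the previous one left off."""
    state = load_checkpoint()
    for i in range(state["next_item"], len(items)):
        state["total"] += items[i]   # the actual work
        state["next_item"] = i + 1
        save_checkpoint(state)       # save state after each step
    return state["total"]
```

If the process crashes midway, the next invocation of `process` resumes from the last saved `next_item` instead of recomputing from scratch.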

Fault tolerance is the ability of a system, such as a computer or a network, to continue working without interruption when one or more of its components fail.

The main objective of building a fault-tolerant system is to prevent disruptions arising from a single point of failure, thereby ensuring the high availability of applications, especially the mission-critical applications on which business continuity depends. Fault-tolerant systems also use backup components that automatically take the place of failed components, ensuring no loss of service. These backup components include power sources, hardware systems, and software systems.

The study of software fault tolerance is relatively new compared with the study of fault-tolerant hardware. In general, fault-tolerant approaches can be classified into fault-removal and fault-masking approaches.

Fault-removal techniques can be either forward error recovery or backward error recovery. Forward error recovery aims to identify the error and, based on this knowledge, correct the system state containing the error. Exception handling in high-level languages, such as Ada and PL/1, provides a system structure that supports forward recovery. Backward error recovery corrects the system state by restoring the system to a state that occurred before the manifestation of the fault. The recovery block scheme provides such a system structure.

Another commonly used fault-tolerant software technique is error masking. The NVP scheme uses several independently developed versions of an algorithm; a final voting system is applied to the results of these N versions and a correct result is generated.

A fundamental way of improving the reliability of software systems rests on the principle of design diversity, where different versions of the functions are implemented. To prevent software failure caused by unpredicted conditions, different (alternative) programs are developed separately, preferably based on different programming logic, algorithms, computer languages, etc. This diversity is normally applied in the form of recovery blocks or N-version programming. Fault-tolerant software thus assures system reliability by using protective redundancy at the software level.

Types of Fault-Tolerant Software Techniques

There are two basic techniques for obtaining fault-tolerant software: the RB scheme and NVP. Both are based on software redundancy and assume that coincidental failures of independently developed versions are rare. Checkpointing and rollback recovery is a third, related technique.

  1. Recovery Block Scheme
  2. N-version Programming
  3. Check Pointing and Rollback Recovery

Recovery Block Scheme

The recovery block scheme consists of three elements: primary module, acceptance tests, and alternate modules for a given task. The simplest scheme of the recovery block is as follows:

Ensure T
   By P
   Else by Q1
   Else by Q2
      .
      .
      .
   Else by Qn-1
   Else Error

Where T is an acceptance test condition that is expected to be met by successful execution of either the primary module P or the alternate modules Q1, Q2, . . ., Qn-1. The process begins when the output of the primary module is tested for acceptability. If the acceptance test determines that the output of the primary module is not acceptable, the system state is rolled back to what it was before the primary module executed, and the first alternate module Q1 is allowed to run. The acceptance test is then repeated to check the successful execution of Q1; if it fails, module Q2 is executed, and so on. The alternate modules are identified by the keywords “else by”. When all alternate modules are exhausted, the recovery block itself is considered to have failed, and the final keyword “else error” declares this fact. In other words, when all modules execute and none produces an acceptable output, the system fails. A reliability optimization model has been studied by Pham (1989b) to determine the optimal number of modules in a recovery block scheme that minimizes the total system cost, given the reliability of the individual modules. In a recovery block, a programming function is realized by n alternative programs, P1, P2, . . ., Pn. The computational result generated by each alternative program is checked by an acceptance test, T.
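The ensure/else-by structure above can be sketched in Python. This is a minimal illustration rather than a full implementation: the acceptance test and modules are caller-supplied stand-ins, and state rollback is noted only in comments (each module is assumed side-effect free here).

```python
def recovery_block(acceptance_test, primary, *alternates):
    """Run the primary module first; on a crash or an unacceptable
    result, try each alternate in turn (the 'else by' clauses).
    A real implementation would checkpoint state before each module
    and roll back before retrying."""
    for module in (primary, *alternates):
        try:
            result = module()
        except Exception:
            continue  # module crashed; try the next alternate
        if acceptance_test(result):
            return result  # acceptance test T passed
        # result rejected by T; fall through to the next alternate
    raise RuntimeError("recovery block failed: all modules exhausted")

# Hypothetical usage: the primary returns a negative (unacceptable)
# value, so the alternate's result is delivered instead.
print(recovery_block(lambda r: r >= 0, lambda: -1, lambda: 42))  # 42
```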


Recovery Block Scheme


If the result is rejected, another alternative program is executed. This process repeats until an acceptable result is generated by one of the n alternatives or until all the alternative programs fail. The probability of failure of the RB scheme, $P_{rb}$, is as follows:

$$ P_{rb}= \prod_{i=1}^n (e_i+t_{2i})+\sum_{i=1}^n t_{1i}e_i\left( \prod_{j=1}^{i-1} (e_j+t_{2j}) \right) $$
where

$e_i$ = probability of failure for version $P_i$

$t_{1i}$ = probability that the acceptance test judges an incorrect result as correct

$t_{2i}$ = probability that the acceptance test judges a correct result as incorrect

The first term of this equation corresponds to the case in which all versions fail the acceptance test. The second term corresponds to the probability that the acceptance test judges an incorrect result as correct at the ith trial of the n versions.
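To make the formula concrete, $P_{rb}$ can be evaluated numerically. The probabilities below are illustrative values chosen for the example, not taken from the text.

```python
from math import prod

def p_rb(e, t1, t2):
    """Failure probability of a recovery block with n alternates.
    e[i]  = failure probability of version P_i
    t1[i] = P(acceptance test accepts an incorrect result)
    t2[i] = P(acceptance test rejects a correct result)"""
    n = len(e)
    # First term: every version fails the acceptance test.
    all_fail = prod(e[i] + t2[i] for i in range(n))
    # Second term: an incorrect result slips through at the i-th trial
    # (empty product for i = 0, matching prod_{j=1}^{i-1} in the formula).
    wrong_accepted = sum(
        t1[i] * e[i] * prod(e[j] + t2[j] for j in range(i))
        for i in range(n)
    )
    return all_fail + wrong_accepted

# Illustrative values for a primary plus one alternate:
print(p_rb(e=[0.1, 0.2], t1=[0.01, 0.01], t2=[0.02, 0.02]))  # ~0.02764
```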

N-version Programming

NVP is used for providing fault tolerance in software. In concept, the NVP scheme is similar to the N-modular redundancy scheme used to provide tolerance against hardware faults. The NVP is defined as the independent generation of $N\geq 2$ functionally equivalent programs, called versions, from the same initial specification. Independent generation of programs means that the programming efforts are carried out by N individuals or groups that do not interact concerning the programming process. Whenever possible, different algorithms, techniques, programming languages, environments, and tools are used in each effort. In this technique, N program versions are executed in parallel on identical input and the results are obtained by voting on the outputs from the individual programs. The advantage of NVP is that when a version failure occurs, no additional time is required for reconfiguring the system and redoing the computation. Consider an NVP scheme consisting of n programs and a voting mechanism, V.


N-Version Programming


As opposed to the RB approach, all alternative programs are usually executed simultaneously and their results are sent to a decision mechanism that selects the final result. The decision mechanism is normally a voter when there are more than two versions and a comparator when there are exactly two. The syntactic structure of NVP is as follows:

seq
 par
  P1(version 1)
  P2(version 2)
  .
  .
  .
  Pn(version n)
  decision V 







Assume that a correct result is delivered when at least two of the versions produce correct results. The probability of failure of the NVP scheme, $P_{nv}$, can be expressed as

$$ P_{nv}= \prod_{i=1}^n e_i + \sum_{i=1}^n (1-e_i)\prod_{\substack{j=1 \\ j\neq i}}^n e_j + d $$
The first term of this equation is the probability that all versions fail. The second term is the probability that only one version is correct. The third term, d, is the probability that there are at least two correct results but the decision algorithm fails to deliver the correct result. It is worthwhile to note that the goal of the NVP approach is to ensure that multiple versions will be unlikely to fail on the same inputs. With each version independently developed by a different programming team, design approach, etc., the goal is that the versions will be different enough so that they will not fail too often on the same inputs. However, multiversion programming is still a controversial topic.
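The execute-and-vote structure can be sketched in Python. The three versions here are hypothetical stand-ins for independently developed implementations, and a real NVP system would run them in parallel rather than in a loop.

```python
from collections import Counter

def n_version_execute(versions, inputs):
    """Run N independently developed versions on the same input and
    vote: a result is accepted if at least two versions agree."""
    results = [v(*inputs) for v in versions]  # ideally run in parallel
    value, count = Counter(results).most_common(1)[0]
    if count >= 2:
        return value  # majority (or at least a pair) agrees
    raise RuntimeError("NVP failed: no two versions agree")

# Three hypothetical versions of the same midpoint computation;
# the third contains a deliberate off-by-one fault.
v1 = lambda a, b: (a + b) // 2
v2 = lambda a, b: a + (b - a) // 2
v3 = lambda a, b: (a + b) // 2 + 1   # faulty version, outvoted

print(n_version_execute([v1, v2, v3], (2, 10)))  # 6
```

The faulty version's output is masked by the vote, which is precisely the error-masking behavior described above.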

Check-Pointing and Rollback Recovery

Checkpointing and rollback recovery differ from the two techniques above. The state of the system is saved periodically (checkpointed) while computation proceeds; when a process failure or data corruption is detected, the system rolls back to the most recent checkpoint and resumes execution from there.
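A minimal in-memory sketch of this idea, assuming a simple validity predicate stands in for failure detection (the class and method names are hypothetical):

```python
import copy

class CheckpointedComputation:
    """Snapshot state before each risky step; roll back if the step
    leaves the state invalid (simulating detected data corruption)."""

    def __init__(self, state):
        self.state = state
        self._snapshot = None

    def checkpoint(self):
        self._snapshot = copy.deepcopy(self.state)

    def rollback(self):
        self.state = self._snapshot

    def run_step(self, step, is_valid):
        self.checkpoint()            # save state before the step
        step(self.state)             # perform the (possibly faulty) work
        if not is_valid(self.state):
            self.rollback()          # corruption detected: restore checkpoint
            return False
        return True

comp = CheckpointedComputation({"balance": 100})
ok = comp.run_step(lambda s: s.update(balance=s["balance"] - 30),
                   lambda s: s["balance"] >= 0)   # valid step
bad = comp.run_step(lambda s: s.update(balance=s["balance"] - 500),
                    lambda s: s["balance"] >= 0)  # rolled back
print(comp.state)  # the corrupting step leaves no trace
```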


Check Pointing and Rollback Recovery


The main difference between the recovery block scheme and N-version programming is that the modules are executed sequentially in the former and in parallel in the latter. Because of this sequential execution, the recovery block approach generally does not apply to critical systems where real-time response is of great concern.

For more, refer to the Difference Between Recovery Block Scheme and N-Version Programming.

Advantages of Using Fault-Tolerant Techniques in Software Systems

  • Improved Reliability: Fault-tolerant techniques help to ensure that software systems continue to function even in the event of failures or errors, improving the overall reliability of the system.
  • Increased Availability: By preventing failures and downtime, fault tolerance techniques help to increase the overall availability of the system, leading to increased user satisfaction and adoption.
  • Reduced Downtime: By preventing failures and mitigating the impact of errors, fault tolerance techniques help to reduce the amount of downtime experienced by the software system, leading to increased productivity and efficiency.
  • Improved Performance: By distributing workloads across multiple components and preventing overburdening of any single component, fault tolerance techniques can help to improve the overall performance of the software system.

Disadvantages of Using Fault-Tolerant Techniques in Software Systems

  • Increased complexity: Implementing fault tolerance techniques can add complexity to the software system, making it more difficult to develop, maintain, and test.
  • Increased cost: Implementing fault tolerance techniques can be expensive, requiring specialized hardware, software, and expertise.
  • Reduced performance: In some cases, implementing fault tolerance techniques can lead to reduced performance, as the system must devote resources to error detection, correction, and recovery.
  • Overhead: The process of detecting and recovering from failures can introduce overhead into the software system, reducing its overall performance.
  • False alarms: In some cases, fault-tolerant techniques may detect errors or failures that are not present, leading to false alarms and unnecessary downtime.

Questions For Practice

1. The extent to which the software can continue to operate correctly despite the introduction of invalid input is called

(A) Reliability

(B) Robustness

(C) Fault Tolerance

(D) Portability

Answer: The correct answer is (B).

FAQs

1. What are the four phases of fault tolerance?

Ans: The four phases of Fault-Tolerance

  • Fault Detection
  • Damage Assessment
  • Error Recovery
  • Fault Treatment

2. What is fault tolerance testing?

Ans: Fault tolerance testing means assessing or verifying a system’s operation at various checkpoints in order to anticipate potential future failures.



Last Updated : 01 Feb, 2024