Basic fault tolerant software techniques
The study of software fault-tolerance is relatively new as compared with the study of fault-tolerant hardware. In general, fault-tolerant approaches can be classified into fault-removal and fault-masking approaches. Fault-removal techniques can be either forward error recovery or backward error recovery.
Forward error recovery aims to identify the error and, based on this knowledge, correct the system state containing the error. Exception handling in high-level languages, such as Ada and PL/1, provides a system structure that supports forward recovery. Backward error recovery corrects the system state by restoring the system to a state which occurred prior to the manifestation of the fault. The recovery block scheme provides such a system structure. Another fault-tolerant software technique commonly used is error masking. The NVP scheme uses several independently developed versions of an algorithm. A final voting system is applied to the results of these N-versions and a correct result is generated.
A fundamental way of improving the reliability of software systems depends on the principle of design diversity where different versions of the functions are implemented. In order to prevent software failure caused by unpredicted conditions, different programs (alternative programs) are developed separately, preferably based on different programming logic, algorithm, computer language, etc. This diversity is normally applied under the form of recovery blocks or N-version programming.
Fault-tolerant software assures system reliability by using protective redundancy at the software level. There are two basic techniques for obtaining fault-tolerant software: RB scheme and NVP. Both schemes are based on software redundancy assuming that the events of coincidental software failures are rare.
- Recovery Block Scheme –
The recovery block scheme consists of three elements: primary module, acceptance tests, and alternate modules for a given task. The simplest scheme of the recovery block is as follows:
Ensure T By P Else by Q1 Else by Q2 . . . Else by Qn-1 Else Error
Where T is an acceptance test condition that is expected to be met by successful execution of either the primary module P or the alternate modules Q1, Q2, . . ., Qn-1.
The process begins when the output of the primary module is tested for acceptability. If the acceptance test determines that the output of the primary module is not acceptable, it recovers or rolls back the state of the system before the primary module is executed. It allows the second module Q1, to execute. The acceptance test is repeated to check the successful execution of module Q1. If it fails, then module Q2 is executed, etc.
The alternate modules are identified by the keywords “else by” When all alternate modules are exhausted, the recovery block itself is considered to have failed and the final keywords “else error” declares the fact. In other words, when all modules execute and none produce acceptable outputs, then the system falls. A reliability optimization model has been studied by Pham (1989b) to determine the optimal number of modules in a recovery block scheme that minimizes the total system cost given the reliability of the individual modules.
In a recovery block, a programming function is realized by n alternative programs, P1, P2, . . . ., Pn. The computational result generated by each alternative program is checked by an acceptance test, T. If the result is rejected, another alternative program is then executed. The program will be repeated until an acceptable result is generated by one of the n alternatives or until all the alternative programs fail.
The probability of failure of the RB scheme, , is as follows:
= probability of failure for version Pi
= probability that acceptance test i judges an incorrect result as correct
t = probability that acceptance test i judges a correct result as incorrect.
The above equation corresponds to the case when all versions fall the acceptance test. The second term corresponds to the probability that acceptance test i judges an incorrect result as correct at the ith trial of the n versions.
- N-version Programming –
NVP is used for providing fault-tolerance in software. In concept, the NVP scheme is similar to the N-modular redundancy scheme used to provide tolerance against hardware faults.
The NVP is defined as the independent generation of functionally equivalent programs, called versions, from the same initial specification. Independent generation of programs means that the programming efforts are carried out by N individuals or groups that do not interact with respect to the programming process. Whenever possible, different algorithms, techniques, programming languages, environments, and tools are used in each effort. In this technique, N program versions are executed in parallel on identical input and the results are obtained by voting on the outputs from the individual programs. The advantage of NVP is that when a version failure occurs, no additional time is required for reconfiguring the system and redoing the computation.
Consider an NVP scheme consists of n programs and a voting mechanism, V. As opposed to the RB approach, all n alternative programs are usually executed simultaneously and their results are sent to a decision mechanism which selects the final result. The decision mechanism is normally a voter when there are more than two versions (or, more than k versions, in general), and it is a comparator when there are only two versions (k versions). The syntactic structure of NVP is as follows:
seq par P1(version 1) P2(version 2) . . . Pn(version n) decision V
Assume that a correct result is expected where there are at least two correct results.
The probability of failure of the NVP scheme, Pn, can be expressed as
The first term of this equation is the probability that all versions fail. The second term is the probability that only one version is correct. The third term, d, is the probability that there are at least two correct results but the decision algorithm fails to deliver the correct result.
It is worthwhile to note that the goal of the NVP approach is to ensure that multiple versions will be unlikely to fail on the same inputs. With each version independently developed by a different programming team, design approach, etc., the goal is that the versions will be different enough in order that they will not fail too often on the same inputs. However, multiversion programming is still a controversial topic.
The main difference between the recovery block scheme and the N-version programming is that the modules are executed sequentially in the former. The recovery block generally is not applicable to critical systems where real-time response is of great concern.