Fault-tolerant technology is a capability of a computer system, electronic system or network to deliver uninterrupted service, despite one or more of its components failing. Fault tolerance also resolves potential service interruptions related to software or logic errors. The purpose is to prevent catastrophic failure that could result from a single point of failure.
By submitting your personal information, you agree that TechTarget and its partners may contact you regarding relevant content, products and special offers.
VMware vSphere 6 Fault Tolerance is a branded, continuous data availability architecture that exactly replicates a VMware virtual machine on an alternate physical host if the main host server fails.
Fault-tolerant systems are designed to compensate for multiple failures. Such systems automatically detect a failure of the computer processor unit, I/O subsystem, memory cards, motherboard, power supply or network components. The failure point is identified, and a backup component or procedure immediately takes its place with no loss of service.
To ensure fault tolerance, enterprises need to purchase an inventory of formatted computer equipment and a secondary uninterruptible power supply device. The goal is to prevent the crash of key systems and networks, focusing on issues related to uptime and downtime. Fault tolerance can be provided with software embedded in hardware, or by some combination of the two.
In a software implementation, the operating system (OS) provides an interface that allows a programmer to checkpoint critical data at predetermined points within a transaction. In a hardware implementation (for example, with Stratus and its Virtual Operating System), the programmer does not need to be aware of the fault-tolerant capabilities of the machine.
At a hardware level, fault tolerance is achieved by duplexing each hardware component. Disks are mirrored. Multiple processors are lockstepped together and their outputs are compared for correctness. When an anomaly occurs, the faulty component is determined and taken out of service, but the machine continues to function as usual.
Fault tolerance vs. high availability
Fault tolerance is closely associated with maintaining business continuity via highly available computer systems and networks. Fault-tolerant environments are defined as those that restore service instantaneously following a service outage, whereas a high-availability environment strives for five nines of operational service.
In a high-availability cluster, sets of independent servers are coupled loosely together to guarantee system-wide sharing of critical data and resources. The clusters monitor each other's health and provide fault recovery to ensure applications remain available. Conversely, a fault-tolerant cluster consists of multiple physical systems that share a single copy of a computer's OS. Software commands issued by one system are also executed on the other system.
This video from Professor Messer
describes the differences between redundancy,
fault tolerance and high availability.
The tradeoff between fault tolerance and high availability is cost. Systems with integrated fault tolerance incur a higher cost due to the inclusion of additional hardware.
What is graceful degradation?
Fault tolerance is often used synonymously with graceful degradation, although the latter is more aligned with the more holistic discipline of fault management, which aims to detect, isolate and resolve problems pre-emptively. A fault-tolerant system swaps in backup componentry to maintain high levels of system availability and performance. Graceful degradation allows a system to continue operations, albeit in a reduced state of performance.
Matching data protection and fault tolerance
Fault tolerance hinges on redundancy. Namely, information is redundantly protected via data replication or synchronous mirroring of volumes to an off-site data center. For physical redundancy, extra hardware equipment remains on standby for failover of operational systems.
Data backup is frequently combined with redundancy. Both strategies are intended as a safeguard against data loss, although backup tends to focus on point-in-time recovery, including granular recovery of a discrete data object. Redundant systems are engineered specifically for application workloads that tolerate very little downtime.
When implementing fault tolerance, enterprises should match data availability requirements to the appropriate level of data protection with redundant array of independent disks (RAID). The RAID technique ensures data is written to multiple hard disks, both to balance I/O operations and boost overall system performance.
Organizations that prioritize fault tolerance above speed and performance would be best served by RAID 1 disk mirroring or RAID 10, which combines disk mirroring and disk striping. If fault tolerance and system performance are equally important, an enterprise may find it worthwhile to spend a little extra money combining RAID 10 with RAID 6, or double-parity RAID, which tolerates the loss of two disk failures before data is lost. Aside from higher cost, the other drawback is data writes occur more slowly to the RAID set.
Aside from hardware, a fault-tolerant architecture should be coordinated with regularly scheduled backups of critical data, perhaps including a mirrored copy at a secondary or alternate location. Security needs to be part of the planning to prevent unauthorized access, and to apply antivirus tools and the most recent version of the computing system OS.
Which industries depend on system fault tolerance?
Fault tolerance refers not only to the consequence of having redundant equipment, but also to the ground-up methodology computer makers use to engineer and design their systems for reliability. Fault tolerance is a required design specification for computer equipment used in online transaction processing systems, such as airline flight control and reservations systems. Fault-tolerant systems are also widely used in sectors such as distribution and logistics, electric power plants, heavy manufacturing, industrial control systems and retailing.
A New Trend on the Development of Fault-Tolerant Applications: Software Meta-Level Architectures
|Maria Lúcia Blanck Lisbôa|
Universidade Federal do Rio Grande do Sul
Instituto de Informática, Caixa Postal 15064
91501-970 Porto Alegre, RS
|Abstract||The purpose of this paper is to investigate a clearly defined way of developing fault-tolerant applications using software meta-level architectures. Meta-level architectures are software architectures based on computational reflection. It addresses complex pieces of software: fault-tolerant software. Fault-tolerant applications must cope with several non-functional requirements to maintain its functionality. So, it is particularly relevant to investigate how to alleviate developers from repeatedly dealing with this complexity. Some solutions are presented, such as software patterns and basic guidelines to help the development of such applications.|
Keywords: meta-level architecture, computational reflection, software patterns, fault-tolerance
Meta-level architectures have recently acquired a great deal of attention from researchers as this offers new solutions to well known problems. Particularly among those are researchers interested in fault-tolerance as can be seen on some recent works [1, 2, 3, 4, 5].
As happens with any emerging methodology, the concepts related to meta-level architectures are not well understood. These concepts range from computational reflection to software architecture styles. Both are becoming explicit fields of study and there is currently no well-established methodology combining them into meta-level architectures.
Meta-level architectures are software architectures based on computational reflection. While software architecture addresses the structure and interaction of software components, computational reflection concerns are about introspection on software components. This means that a component can get information about the internal properties of another component.
Some reflection-based solutions offer obvious benefits. However, they can introduce additional complexity and liability on the fault-tolerant software development. The challenge of this paper is to investigate a clearly defined way of developing fault-tolerant applications using software meta-level architectures. Some ideas were inspired by the work of Buschmann  who documented a collection of eight architectural patterns including a general reflection pattern. Also recently, Islam  documented a pattern for distributed systems, which emphasizes the two fundamental forces: fault-tolerance and performance.
2 Software architecture and design patterns
Fault-tolerant software can be very complex, its development being subject to human fallibility. Based on intentional inclusion of redundancy supported by special mechanisms as exception handling, checkpointing and persistence, fault tolerant applications must cope with several non-functional requirements to maintain its functionality. Also, as stated by , to specify non-functional requirements for software architecture, it is necessary to consider the interdependencies and trade-offs that exist between them. So, it is particularly relevant to investigate how to alleviate developers from repeatedly deal with this complexity.
As observed by , designers have begun to develop a shared repertoire of methods, techniques, patterns and idioms for structure complex software systems. Informally speaking, in a bottom-up point of view, programs are composed by interacting components, which are styled and organized according to a structure defined by a software architecture. Some of these components might be implemented following a design pattern while others could be specializations of classes from a framework. Hence, software architectures molds some relationships among frameworks, design patterns and components used to build an application.
Architectural styles define the fundamental structure and interactions among the components of an application. Components are atomic units and their interaction provides the functional and non-functional requirements. One or a group of components fulfills each requirement. Conversely, one component can fulfill a number of requirements. Moreover, an application may include several components not directly related to any requirement (figure 1). The structural and logic relationships among these components constitute the major task in system design.
Figure 1: Relationships among components and requirements
However, it is possible to look at some of these relationships and identify them as recurrent on different systems. Hence, such problems can have an application independent solution and thus described using abstra ctions of structure and interactions. The description of a solution using components, collaborations among components and their design context is a design pattern.
In an attempt to provide help to a number of software designers, experts capture the essentials of a design pro-blem that arises in many systems, and then they describe the basic structure of a well proven solution to it. As stated by , early experiences indicate that pattern-based solutions design reuse reduces the time needed to solve a problem and makes it easy to customize the core solution for a specific implementation.
Both frameworks and design patterns describe solutions to software problems. Frameworks intend to solve several problems related to a specific domain presenting a collection of classes. A framework class is an incomplete or complete implementation of an abstraction. As a consequence, a class is programming language depen-dent. A design pattern describes a solution using a language independent idiom , but it is not supposed to present a specific solution to problems found in an application. It focuses on a particular, well-known and widely applicable software design problem.
3 Meta-level architectures
The key concept in the design of software using meta-level architecture style is the separation of the system in two tiered layers: a meta-level and a base level. Additionally, those layers have different but related responsibilities.
The base level encompasses all the components that implement the functionalities of an application as defined on its functional requirements. The meta-level provides a self-representation of the software to give it knowledge of its own structure and behavior . Computational reflection provides this basis for meta-level architectures.
Thus, a reflective architecture defines a system which incorporates data structures representing itself. That self-representation makes it possible for the system to get information concerning to the state, structure and behavior of some systems components. Moreover, based on that information, it can dynamically interfere on its current computations. More precisely, the meta-level components can change and adapt some structural and behavioral aspects of the base-level components.
A MOP - Metaobject Protocol, establishes the relationships among the base level and the meta-level components. The MOP provides a high level interface to the programming language implementation in order to reveal to the program information normally hidden by the compiler and/or run-time environment. As a consequence, a programmer can develop language extensions, adapt component behavior and even make non-permanent changes into the system.
During the compilation process, the compiler has information about the classes, class hierarchies, identi-fiers, types and other attributes of classes, data structures and methods regarding to the program. Such information, that is not likely to change during the execution, constitutes the structural meta-information part of a metaobject protocol.
Conversely, the non-permanent information, e.g. the values of arguments passed to a method, depends on the execution. This kind of data deals with the behavioral aspects of the components and it is part of the dynamic information that a metaobject holds. Depending on the metaobject protocol, the metaobjects provide different meta-information. A metaobject holds static and dynamic information about the base-level objects. Figure 2 depicts some static (e.g. class and type) and dynamic information (e.g. values) reified as data within the metaobject.
Figure 2: Meta-information
On the object model, the meta-information interface is usually implemented at one single class, which by further specialization or direct instantiation offers to the program a meta-level interface.
The meta-level interface defines the methods that implement the reflective services. Generalizing from different MOPs, it is possible to identify four groups representing the basic aspects of metaobjects protocols.
- Attachment of base and meta-level: The purpose of this group is to bind base-level components to meta-level entities in a static or dynamic way. An entity is one or a group of objects or classes. It is possible to bind them in a one-to-one basis and also multiple metaobjects can be bound to one base-level entity. Thus, one can devise many semantics: one metaobject can control just one base object or a number of base objects of the same class, or multiple metaobjects can be used to control one or a number of base objects.
- Reification: Reflection means materialization of information otherwise hidden to the program. This group deals with reification of incoming and outgoing messages, reification of arguments, data and other structural information.
- Execution: This group addresses the meta-level computation and the ability to invoke base-level methods. Metaobjects are ordinary objects and thus they provide services through public interfaces and invoke other objects in a normal way. However, to perform base-level computation transpa-rently to the server they depend on reification. When a base object receives a message, the corresponding metaobject intercepts and materializes the method identifier and the actual parameters. Re-sending the message to the original receiver, before or after meta-level computations, the metaobject acts as an indirect client of the base object.
- Modification: Once the meta-level holds information about the behavior and structure of base objects, computations can change some aspects of the base components. The ability to handle structural information is critical since it can affect multiple objects. One alternative is to make changes on shadow classes, as proposed by  without affecting semantics of other objects of the same class.
There is no universally accepted MOP that serves as a recipe to implement reflective support in programming languages. First, it is necessary to establish which kind of meta-information is useful to reveal to the program. Second, there are several implementations of programming languages intended to support the object model. Some of them use a common root class to derive the classes used to instantiate application objects, while others use objects to describe the application objects. Some programming languages have native reflective facilities, while in others the support for reflection is included by language extensions.
Smalltalk meta-classes define a native interface for structural reflection. The original design of the Java Virtual Machine supports some introspection about classes and objects, but recently it has been extended to offer more reflective mechanisms [9,10]. C++ does not support reflection - only little information about types provided by library functions. However, different MOPs [11, 12, 13] extended the C++ language.
4 The meta-level approach for fault-tolerance
Software fault-tolerance encompasses all techniques and programming language mechanisms intended to support the development of high reliability software. We can consider the fault-tolerance area a specific domain of knowledge composed by well-defined techniques used to guarantee the reliability of applications built over other domains. Thus, we can figure a domain-specific architecture for software fault-tolerance.
4.1 Domain-specific meta-level architecture
A domain-specific architecture, as stated by , comprises: a) a reference architecture, which describes a general computational framework for a significant domain of applications, b) a component library, which contains reusable chunks of domain expertise, and c) an application configuration method for selecting and configuring components within the architecture to meet particular application requirements.
Thus, the fault-tolerance meta-level architecture comprises:
- A meta-level architecture as reference architecture for structuring fault-tolerant applications split on two layers: the fault-tolerance layer and the application logic layer.
- A collection of design patterns describing successful solutions of fault-tolerance problems, and an open component library, which contains reusable meta-classes implementing fault-tolerant mechanisms to support the development of fault-tolerant applications.
- A metaobject protocol to provide a nice interface among the application components and the fault-tolerant level.
Meta-level architectures have been achieving considerable attention as a reference architecture for fault-tolerant applications. Evidence of this trend can be seen on some recent papers.
- Fabre : "... our experiments in implementing fault-tolerance and secure communication metaobjects were really successful and confirmed that a really good level of transparency and abstraction could be achieved within application objects, fault-tolerance and secure communication metaobjects."
- Stroud : "With a metaobject protocol approach, non-functional aspects of atomic data types such as synchronization and recovery are implemented in metaobject classes at the meta-level, whilst functional aspects such as object operations are implemented at the base-level."
- Fraga : "To improve the flexibility of the implementation, we use a reflexive approach which allows to separate aspects related to the replication model from those related uniquely to the service being replicated."
- Lisboa : "... reflection is used to implement different fault-tolerance schemes in the meta-level in a transparent and non-intrusive way..."
Those authors stressed the major issue related to meta-level architecture: the separation of concerns within the fault-tolerant application. It is advisable to separate the fault-tolerant components from the rest of the system to reduce problems associated with the interaction between redundant components . It fo-llows that this separation, although focusing on independent responsibilities, relies on a tight interaction among components.
Computational reflection helps to organize fault-tolerant activities at meta-level and it provides transparent interfaces among components dealing with fault-tolerance and those implementing the functional requirements.
There are tangible benefits from this separation but there are also some liabilities. First, computational reflection adds cost to the application due to extra-level computation and indirection mechanisms. Second, it depends on reflective facilities to be supported by the target implementation language. Third, reflection is not a universal solution to programming problems. Instead, it is primarily concerned with implementation of open library components.
4.2 Drawing design patterns
The next step toward a reference meta-level architecture is to devise design patterns capturing the essentials of fault-tolerance schemes. Considering the fault-tolerance domain, the body of knowledge and the solutions given to many non-functional requirements are well established and thus it is possible to use design patterns to describe successful solutions.
Patterns help reducing software complexity by clearly describing a software solution in terms of structure, dynamic behavior and context. Similar to what happens for any intrinsically complex systems, the definition of clean interfaces among components contributes to their independent development and reuse. In the case of software fault-tolerance domain, patterns can be used to present a collection of relatively independent solutions to common non-functional properties problems such as atomic actions, recovery points, replication policies and exceptions.
Suppose a fault-tolerant application built from several components. Some of these components should provide critical services while others are not so critical, but there are many interactions occurring among them. Both critical and non-critical objects must provide reliable services but only the critical ones are subject to fault-tolerance techniques. The following pattern presents a basic guideline to solve the interaction problem between a fault-tolerant object and its client.
Pattern: Fault-tolerant Object interaction
Fault-tolerant objects (FT objects) are highly re-liable objects whose services are assured by redundant components.
Context: Software fault-tolerance
Redundant components can be managed according to different fault-tolerance techniques as n-version programming, recovery blocks, active or passive replication protocols and so on. Regardless of the technique adopted, a fault-tolerant object should interact with other objects to provide reliable services.
The client requests services to a fault-tolerant object but it is not aware of the existence of redundant components used to provide the requested service. As an alternate, the fault-tolerant component can manage all its counterparts thus providing the expected service. The inclusion of redundancy within an application component is quite intrusive and it violates the separation of concerns. Hence, the provision of redundant services must be done transparently to the application objects. Both the client and the server must not be involved with this issue.
Solution: Meta-level architecture
The client, the fault-tolerant object and the redundant objects are implemented as base-level objects but the redundant services are managed at the meta-level. Using Class-Responsibilities-Collaborators (CRC-) cards , figures 3a and 3b depict the responsibilities and collaborators of base-level components.
The fault-tolerant object is connected to a metaobject which intercepts the incoming messages. For each incoming message, the target method and the actual parameters are materialized at meta-level, and then the same message is dispatched to all redundant components. The responses to the requested service are collected at meta-level and one response is selected according to the adopted fault-tolerance technique. It can follow the first-response semantics, roting or acceptance. Finally, the selected response is delivered to the original client. Figures 4a and 4b show the responsibilities and collaborators of the meta-level components.
|Class Base-level Client||Collaborator|
Figure 3a: Base level client
|Class FT Object||Collaborator|
Figure 3b: Base-level FT objects
|Meta-level FT manager|
Figure 4a: Meta level object
|Class Meta-level FT Manager||Collaborator|
Figure 4b: Meta-level FT manager
Another concern is about the fault-tolerance technique adopted by the application. The functional core of the meta-level fault-tolerance scheme should be separated into a component. The component class shall encapsulate the minimum memory size and services to manage the redundant components because it can be inherited or instantiated by the specialized metaobject associated with the base-level FT-object. To avoid performance problems and to guarantee independent reusability, the services dealing with the multiple results and the selection of one response - the adjudicators - should be encapsulated in separate components. Note that the meta-level program is composed by several metaobjects.
They are distinct from base-level objects since they are specialists on fault-tolerance schemes and thus hold information about the application objects. However, meta-level components provide their own interface and the meta-level program has its own dynamic behavior, very much like the base-level program.
Example: Consider a Recovery Block (RB-) scheme to manage redundancy. A recovery blocks a control structure that consists of several alternates and an acceptance test. This structure can be represented by a code fragment as:
ensure <acceptance test>
by <alternate 1>
else by <alternate 2>
else by <alternate n>
The execution of the recovery block terminates successfully when a result from any one of the alternates satisfies the acceptance test. When a faulty alternate is executed, the state that existed prior to the execution of the faulty alternate is restored and the next alternate is tried . Resuming the design pattern example, diagrams are used in addition to CRC cards to illustrate runtime scenarios. The diagram in figure 5 shows the dynamic behavior of an execution of recovery block.
As previously stated by , dealing with state restoration can be as simple as to make a copy of the original object or as complex as recovery cache memory implemented in hardware. Thus, the problem of state restoration needs special solutions and this concern suggests the existence of another opportunity to create design patterns for different contexts of state restoration. The discussion about patterns can go further but, for the moment, it is enough to think about the benefits they can bring to fault-tolerant software development.
4.3 Configuring fault-tolerant applications
When the fault-tolerance domain is merged into an application domain, it becomes responsible for some non-functional requirements of the application. But, as previously discussed, there are mutual dependencies between both domains. However, the designer of the whole application need not be concerned with a" global" property of fault tolerance of the application. The designer can identify individual critical objects thus subject to fault tolerant behavior. The other objects are considered non-critical and thus not subject to fault tolerant behavior.
Figure 5: The Recovery Block dynamic behavior
Since the application consists of fault tolerant (FT-) and non-fault tolerant objects, and only fault tolerant objects are connected to metaobjects, the meta-level program must provide information only about those aspects on which the metaobjects depend. For example, metaobjects must be aware of incoming messages addressed to critical methods within the FT objects. Also, metaobjects often need information about the current state of computation.
The following guidelines, adapted from those suggested by , help in designing a meta-level architecture:
1 Define a model of the fault-tolerant application using an appropriate analysis method. Identify the services according to the functional requirements, the components that can fulfill these services and the structure and the relationships among the components.
2 Identifystructuralaspects of the system, which, when changed, should not affect the implementation of the components. Examples include the type structure or the distribution of components.
3 Identify fault-tolerantbehavior of the application such as atomic transactions, real-time constraints, reliable distribution, exceptional behavior and special algorithms for critical services.
4 Identify fault-tolerant services to be provided to guarantee the reliability, availability and safety of the whole application. Several fault-tolerance strategies can be combined to achieve these goals. Identify the system components, which provides the services. For example, the run-time system may provide some safety/security mechanisms.
5 Define the metaobjects that provide the fault-tolerance services needed by the application. The metaobjects can be specially implemented to fulfill the requested services, or they can be reused from frameworks or library classes.
6 Define the metaobject protocol suitable to the application. A metaobject protocol can be defined on early stages of software development, when an implementation language is chosen. Otherwise, it can be implemented as one separate component to just operate on other object components. As proposed by , metaobject support can be customized to accommodate different policies of metaobject organization and activation.
This paper presented important remarks about the current trend in adopting a software reference architecture to provide a skeleton for building applications. Concerning the development of fault-tolerant applications, a meta-level software architecture was suggested as a reference architecture. Then, basic guidelines to help the development of such applications were addressed.
One could have chosen any other software archite-cture or other ways to develop fault-tolerant software. However, recent works dealing with software develo-pment on the fault-tolerance domain serve as evidence of this trend. Those works have confidence on the object model for composing fault-tolerant applications from pre-defined components and they use a reflective architecture to achieve separation between functional requirements and those related to fault-tolerance. Meta-level architectures attack several important issues in implementing non-functional requirements in a non-intrusive, transparent way to the application.
This work was partially supported by grants from FAPERGS and CAPES.
|||M. Ancona et al. Reflective architecture for reusable fault-tolerant software. In: ProcLatin American Conference in Informatics, pp. 87-98, 1995.|
|||J. C. Fabre et al. Implementing fault-tolerant applications using reflective o-o programming. In: IEEE International Symposium on Fault-Tolerant Computing, pp. 489-498, 1995.|
|||R. Stroud and Z. Wu. Using metaobject protocols to implement atomic data types. Proc ECOOP95, Lecture Notes in Computer Science 952:168-189, 1995.|
|||M. L. Lisboa, C. M. F. Rubira and L. E. Buzato. Arquitetura reflexiva para o desenvolvimento de software tolerante a falhas. In: XVI Congresso da Sociedade Brasileira de Computação, pp. 155-166, 1996.|
|||J. Fraga et al. Implementação de Serviços Replicados em um Ambiente Aberto Usando uma Abordagem Reflexiva. In: XVI Congresso da Sociedade Brasileira de Computação, pp. 261-272, 1996.|
|||Buschmann et al. A system of patterns: pattern-oriented software architecture. John Wiley & Sons, England, 1996.|
|||N. Islam and M. Devarakonda. An essential design pattern for fault-tolerant distributed state sharing. Communicatins of the ACM, 39(10): 65-74, 1996.|
|||D. Garlan and D. E. Perry. Introduction to the special issue on software architecture. IEEE Trans on Software Engineering, 21(4):269-274, 1995.|
|||M. Golm. Design and implementation of a meta architecture for Java. Diplomarbeit im Fach Informatik, Friedrich-Alexander Universität, Erlangen-Nürnberg, january 1997.|
|||JAVA Core Reflection: API and specification. Java Soft, Mountain View, CA, USA, october 1996.|
|||S. Chiba and T. Masuda. Designing an extensible distributed language with meta-level architecture. In Proc ECOOP93, Lecture Notes in Computer Science 707: 482-501, 1993.|
|||S. Chiba. A metaobject protocol for C++. ACM SIGPLAN Notices, 30(10): 482-50, 1995.|
|||Buschmann, K. Kiefer, M. Stal and F. Paulisch. The meta-information protocol: run-time type information for C++. In: International Workshop on New Models for Software Architecture (IMSA92), pp. 82-87, 1992.|
|||Hayes-Roth. A domain specific software architecture for adaptative intelligent systems. IEEE Transactions on Software Engineering, 21(4): 288-301, 1995.|
|||M. F Rubira and R. Stroud. Forward and backward error recovery in C++. Object Oriented Systems, 1(1): 1-85, 1994.|
|||A. Burnsand R. Wellings. Real-time systems and programming languages. Addison Wesley, England, 1996.|
|||M. Campo and R. T. Price. Meta-object manager: a framework for customizable meta-object support for smalltalk 80. In: Simpósio Brasileiro de Linguagens de Programação. pp. 399-413, 1996|