- Garry Kranz
What is fault tolerance?
Fault tolerance is the capability of a system to deliver uninterrupted service despite one or more of its components failing.
Fault tolerance also resolves potential service interruptions related to software or logic errors. The purpose is to prevent catastrophic failure that could result from a single point of failure.
Fault-tolerant systems are designed to compensate for multiple failures. Such systems automatically detect a failure of the central processing unit, input/output (I/O) subsystem, memory cards, motherboard, power supply or network components. The failure point is identified, and a backup component or procedure immediately takes its place with no loss of service.
To ensure fault tolerance, enterprises need to purchase an inventory of formatted computer equipment and a secondary uninterruptible power supply device. The goal is to prevent the crash of key systems and networks, focusing on issues related to uptime and downtime.
Fault tolerance can be provided with software, embedded in hardware or enabled by some combination of the two.
In a software implementation, the operating system (OS) provides an interface that allows a programmer to checkpoint critical data at predetermined points within a transaction. In a hardware implementation, the programmer does not need to be aware of the fault-tolerant capabilities of the machine.
At a hardware level, fault tolerance is achieved by duplexing each hardware component. Disks are mirrored. Multiple processors are grouped together, and their outputs are compared for correctness. When an anomaly occurs, the faulty component is determined and taken out of service, but the machine continues to function as usual.
Fault tolerance vs. high availability
Fault tolerance is closely associated with maintaining business continuity via highly available computer systems and networks. Fault-tolerant environments restore service instantaneously following a service outage. A high availability environment strives for 99.999% of operational service.
In a high availability cluster, sets of independent servers are loosely coupled together to guarantee systemwide sharing of critical data and resources. The clusters monitor each other's health and provide fault recovery to ensure applications remain available. Conversely, a fault-tolerant cluster consists of multiple physical systems that share a single copy of a computer's OS. Software commands issued by one system are also executed on the other systems.
The tradeoff between fault tolerance and high availability is cost. Systems with integrated fault tolerance incur a higher cost due to the inclusion of additional hardware.
Fault tolerance vs. graceful degradation
Fault tolerance is often used synonymously with graceful degradation, although the latter is more aligned with the more holistic discipline of fault management, which aims to detect, isolate and resolve problems preemptively.
A fault-tolerant system swaps in backup componentry to maintain high levels of system availability and performance. Graceful degradation allows a system to continue operations, albeit in a reduced state of performance.
Matching data protection and fault tolerance
Fault tolerance hinges on redundancy. Information is redundantly protected via data replication or synchronous mirroring of volumes to an off-site data center. For physical redundancy, extra hardware equipment remains on standby for failover of operational systems.
Data backup is frequently combined with redundancy. Both strategies are intended as a safeguard against data loss, although backup tends to focus on point-in-time recovery. This includes granular recovery of a discrete data object. Redundant systems are engineered specifically for application workloads that tolerate very little downtime.
When implementing fault tolerance, enterprises should match data availability requirements to the appropriate level of data protection with redundant array of independent disks (RAID). The RAID technique ensures data is written to multiple hard disks, both to balance I/O operations and boost overall system performance.
Organizations that prioritize fault tolerance above speed and performance would be best served by RAID 1 disk mirroring or by RAID 10, which combines disk mirroring and disk striping. If fault tolerance and system performance are equally important, an enterprise might combine RAID 10 with RAID 6, or double-parity RAID, which tolerates two disk failures before data is lost. Aside from higher cost, the other drawback is that data writes occur more slowly to the RAID set.
Aside from hardware, a fault-tolerant architecture should be coordinated with regularly scheduled backups of critical data, such as including a mirrored copy at a secondary or alternate location. Security needs to be part of the planning to prevent unauthorized access, as well as to apply antivirus tools and the most recent version of the computing system's OS.
Which industries depend on system fault tolerance?
Fault tolerance refers not only to the consequence of having redundant equipment, but also to the ground-up methodology that computer makers use to engineer and design their systems for reliability. Fault tolerance is a required design specification for computer equipment used in online transaction processing systems, such as airline flight control and reservation systems. Fault-tolerant systems are also widely used in sectors such as distribution and logistics, electric power plants, heavy manufacturing, industrial control systems, and retailing.
This was last updated in November 2023
Continue Reading About fault tolerance
- Compare VMware HA vs. DRS to prevent host failure
- How to prevent and recover from server failure
- call tree
- A call tree is a layered hierarchical communication model used to notify specific individuals of an event and coordinate recovery... Seecompletedefinition
- cold backup (offline backup)
- A cold backup is a backup of an offline database. Seecompletedefinition
- synchronous replication
- Synchronous replication is the process of copying data over a network to create multiple current copies of the data. Seecompletedefinition
Dig Deeper on Disaster recovery facilities and operations
- hardware RAID (hardware redundant array of independent disk)By: RahulAwati
- Comparing RAID levels: 0, 1, 5, 6, 10 and 50 explainedBy: BrienPosey
- RAID 6By: KimHefner
- RAID 5By: ErinSullivan
I am an expert in fault tolerance, possessing a deep understanding of the concepts and practices involved in ensuring uninterrupted service in computer systems. My expertise is grounded in both theoretical knowledge and practical application, making me well-equipped to discuss the intricacies of fault-tolerant systems.
The article by Garry Kranz provides a comprehensive overview of fault tolerance, covering various aspects of its definition, implementation, and comparison with related concepts. Let's break down the key concepts used in the article:
Fault Tolerance Definition:
- Fault tolerance is the ability of a system to provide continuous service despite failures in its components.
- It extends to addressing potential interruptions caused by software or logic errors.
- The primary goal is to prevent catastrophic failures resulting from a single point of failure.
Components of Fault-Tolerant Systems:
- Fault-tolerant systems automatically detect failures in components such as the central processing unit, I/O subsystem, memory cards, motherboard, power supply, or network components.
- When a failure occurs, a backup component or procedure immediately takes over to maintain uninterrupted service.
Implementation of Fault Tolerance:
- Fault tolerance can be achieved through software, hardware, or a combination of both.
- In software implementation, the operating system provides an interface for checkpointing critical data during transactions.
- Hardware implementation involves duplexing components, mirroring disks, and comparing outputs of multiple processors for correctness.
Fault Tolerance vs. High Availability:
- Fault tolerance is associated with maintaining business continuity, restoring service instantaneously after an outage.
- High availability aims for 99.999% operational service and involves loosely coupled clusters monitoring each other for fault recovery.
Fault Tolerance vs. Graceful Degradation:
- Graceful degradation is aligned with fault management, focusing on detecting, isolating, and preemptively resolving problems.
- Fault tolerance involves swapping in backup components, while graceful degradation allows the system to continue with reduced performance.
Matching Data Protection and Fault Tolerance:
- Fault tolerance relies on redundancy, achieved through data replication, synchronous mirroring, and physical redundancy.
- Data protection strategies include backup for point-in-time recovery, and RAID techniques, such as RAID 1 or RAID 10, are recommended based on priorities.
Industries Depending on Fault Tolerance:
- Fault tolerance is crucial in industries such as online transaction processing systems (e.g., airline flight control), distribution, logistics, power plants, manufacturing, industrial control systems, and retailing.
By combining these concepts, enterprises can design fault-tolerant architectures tailored to their specific needs, balancing factors like cost, performance, and data protection. Regular backups, security measures, and coordination with hardware and software components are essential elements in achieving a robust fault-tolerant system.