The ability to continue non-stop when a hardware failure occurs. A fault-tolerant system is designed from the ground up for reliability by building multiples of all critical components, such as CPUs, memories, disks and power supplies, into the same computer. If one component fails, another takes over without skipping a beat.
Tandem and Stratus were the first two manufacturers dedicated to building fault-tolerant computer systems for the online transaction processing (OLTP) market.
Many systems are designed to recover from a failure by detecting the failed component and switching to another computer system. Although these systems are sometimes called fault tolerant, they are more widely known as "high availability" systems, and they require the software to resubmit the job once the second system is available.
True fault-tolerant systems with redundant hardware are the most costly, because every critical component is duplicated. However, fault-tolerant systems provide the same processing capacity after a failure as before, whereas high availability systems often provide reduced capacity. See fault management.
This 1992 RAID II prototype, which embodies principles of high performance and fault tolerance, was designed and built by graduate students at the University of California, Berkeley. Housing thirty-six 320MB disk drives, its total storage was less than that of the disk drive in the cheapest PC only six years later. (Image courtesy of The Computer History Museum, www.computerhistory.org)
Redundancy is the hallmark of fault-tolerant systems. This storage server from Xtore Extreme Storage (www.xtore-es.com) contains multiple hot-swappable power supplies to ensure continued operation.