Fault Tolerance in Large Clusters

The increased reliance on supercomputing clusters built from commercial off-the-shelf (COTS) parts has created a growing need for lightweight fault tolerance. As clusters grow from hundreds of processors to thousands, and even tens of thousands, the mean time to failure (MTTF) of the entire cluster drops from a few days to less than an hour (assuming that failures follow a geometric distribution and that the reliability of a single-processor system is an optimistic 99.99% per hour, i.e., MTTF = 10,000 hours ≈ 1.14 years).
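
As a rough check on the figures above, the following sketch uses the common first-order approximation that a cluster of N independent processors, each with a constant failure rate, fails N times as often as a single processor, so the cluster MTTF is about the single-processor MTTF divided by N. This approximation is an assumption made for illustration and is not necessarily the exact model behind the numbers quoted above; the function name `cluster_mttf_hours` is hypothetical.

```python
# Single-processor MTTF quoted in the text (10,000 hours, about 1.14 years).
NODE_MTTF_HOURS = 10_000.0


def cluster_mttf_hours(num_processors, node_mttf_hours=NODE_MTTF_HOURS):
    """Estimate cluster MTTF in hours.

    Assumptions (illustrative only): processors fail independently at a
    constant rate, and the cluster counts as failed as soon as any one
    processor fails.
    """
    if num_processors < 1:
        raise ValueError("num_processors must be at least 1")
    return node_mttf_hours / num_processors


if __name__ == "__main__":
    # Print the estimate for a range of cluster sizes.
    for n in (1, 100, 1_000, 10_000, 50_000):
        hours = cluster_mttf_hours(n)
        print(f"{n:>6} processors: {hours:10.2f} h ({hours / 24:8.2f} days)")
```

Under this approximation, 100 processors give an MTTF of about 100 hours (roughly four days), 10,000 processors give about one hour, and larger clusters fall below one hour.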

At the scale of the clusters in question, resource constraints make traditional fault-tolerance techniques based on active replication impractical. Rollback-recovery protocols, however, provide an attractive low-overhead solution for making such clusters resilient to crash failures. These protocols save information during failure-free execution that can be used during recovery to restore a crashed process to its pre-crash state.
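
To make the idea concrete, here is a minimal sketch of one form of rollback recovery: periodic checkpoints combined with a log of inputs received since the last checkpoint. It is a toy, single-process illustration that assumes the process is deterministic and simulates stable storage with ordinary attributes. The class and method names are hypothetical, and the sketch does not represent the Egida toolkit or any particular protocol.

```python
class RecoverableAccumulator:
    """Toy deterministic process whose state is a running total of its inputs."""

    def __init__(self, checkpoint_interval=3):
        self.total = 0  # volatile state, lost on a crash
        self.checkpoint_interval = checkpoint_interval
        # Simulated stable storage (assumption: survives a crash). A real
        # system would write this to disk or to another node.
        self._checkpoint = 0
        self._log = []  # inputs received since the last checkpoint

    def receive(self, value):
        """Log the input, apply it, and checkpoint periodically."""
        self._log.append(value)
        self.total += value
        if len(self._log) >= self.checkpoint_interval:
            self._take_checkpoint()

    def _take_checkpoint(self):
        """Save the current state; the logged inputs are no longer needed."""
        self._checkpoint = self.total
        self._log.clear()

    def crash(self):
        """Simulate a crash failure: the volatile state disappears."""
        self.total = None

    def recover(self):
        """Roll back to the last checkpoint, then replay the logged inputs."""
        self.total = self._checkpoint
        for value in self._log:
            self.total += value


if __name__ == "__main__":
    process = RecoverableAccumulator(checkpoint_interval=3)
    for v in (5, 1, 2, 7, 4):
        process.receive(v)
    before_crash = process.total
    process.crash()
    process.recover()
    print("state restored:", process.total == before_crash)
```

Because the process is deterministic, replaying the logged inputs on top of the last checkpoint reproduces the pre-crash state. The checkpoint interval trades failure-free overhead against the amount of work replayed during recovery.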

This joint project, led by Lorenzo Alvisi and Harrick Vin from the University of Texas at Austin, builds on their prior work on Egida, an object-oriented toolkit designed to support transparent rollback recovery and low-overhead fault tolerance.