Table of Contents

Software Design - (Fault Tolerance|Resilience)

About

Fault tolerance (or resilience) is the ability to recover from errors (fault), regardless of whether those errors resulted from:

A system tolerant of every possible kind of fault is not feasible.

See also: Software Design - Recovery (Restartable) (same thing ?)

Term

When talking about fault tolerance, the following terms are often used:

Implementation

Checkpoint

fault tolerance is generally provided via a mechanism called checkpoints, essentially taking a consistent snapshot periodically without ever stopping the computation.

Savepoint

Svepoints makes checkpointing mechanism available directly to the user. Savepoints are checkpoints that are triggered externally by the user. Savepoints make it possible to “version” applications by taking consistent snapshots of the state at well-defined time points, and then rerunning the application (or a different version of the application code from that time point). In practice, savepoints are essential for production use, enabling easy debugging, code upgrades, what-if simulations, and A/B testing.

Documentation / Reference