Checkpointing and failure recovery techniques that have low overhead
and provide fast recovery from failures are integral to the design of
fault-tolerant, high-performance mobile computing systems.
This talk presents a new approach, quasi-synchronous checkpointing
and failure recovery, for mobile computing systems.
The checkpointing algorithm preserves process autonomy by allowing processes to
take checkpoints asynchronously, and it uses communication-induced checkpointing
to advance the recovery line, which helps bound rollback
propagation during recovery. It thus combines the simplicity and low overhead
of asynchronous checkpointing with the recovery-time advantages of
synchronous checkpointing. There is no extra message overhead
involved during checkpointing and the additional checkpointing overhead
is nominal. The algorithm ensures that a recovery line consistent
with the latest checkpoint of any process exists at all times.
The recovery algorithm exploits this feature to restore the system to a
state consistent with the latest checkpoint of a failed process. The recovery
algorithm has no domino effect, and a failed process only needs to roll back
to its latest checkpoint and request other processes to roll back to a
consistent checkpoint. To avoid the domino effect altogether, selective
pessimistic message logging at the receiver end is used.
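
As a rough illustration of how communication-induced checkpointing can keep a
recovery line available, the sketch below assumes a sequence-number scheme that
the abstract itself does not spell out: each process tags its checkpoints with
a sequence number, piggybacks that number on outgoing messages, and takes a
forced checkpoint before processing a message that carries a higher number.
The names Process, Message, basic_checkpoint, rollback_to and recover are
hypothetical, and message logging is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str
    sn: int  # sender's checkpoint sequence number, piggybacked (assumed scheme)


@dataclass
class Process:
    pid: int
    sn: int = 0                                      # number of the latest checkpoint
    state: list = field(default_factory=list)        # toy state: payloads processed so far
    checkpoints: dict = field(default_factory=dict)  # sequence number -> saved state

    def __post_init__(self):
        self.checkpoints[0] = []  # initial checkpoint of the empty state

    def basic_checkpoint(self):
        """Checkpoint taken independently, at the process's own convenience."""
        self.sn += 1
        self.checkpoints[self.sn] = list(self.state)

    def send(self, payload):
        """Piggyback the current sequence number; no extra messages are sent."""
        return Message(payload, self.sn)

    def receive(self, msg):
        """Take a forced checkpoint before processing a message from 'the future'."""
        if msg.sn > self.sn:
            self.sn = msg.sn
            self.checkpoints[self.sn] = list(self.state)
        self.state.append(msg.payload)

    def rollback_to(self, target_sn):
        """Restore the earliest checkpoint numbered >= target_sn, if one exists."""
        candidates = [s for s in self.checkpoints if s >= target_sn]
        if not candidates:
            return False  # nothing received from beyond the line; keep current state
        chosen = min(candidates)
        self.state = list(self.checkpoints[chosen])
        self.sn = chosen
        # Checkpoints taken after the restored one are no longer valid.
        self.checkpoints = {s: c for s, c in self.checkpoints.items() if s <= chosen}
        return True


def recover(failed, others):
    """The failed process restores its latest checkpoint and asks the others to follow."""
    k = failed.sn
    failed.rollback_to(k)
    for proc in others:
        proc.rollback_to(k)
    return k
```

In this sketch, a process that receives a message sent after the sender's
checkpoint k always holds a checkpoint numbered at least k that predates the
receipt, so a single rollback request suffices and no cascade of rollbacks
occurs. Messages that were sent before the recovery line but received after it
would have to be replayed from a log, which is the role the abstract assigns to
receiver-side message logging.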