Fast, Low-Cost Checkpointing and Recovery Techniques
for Mobile Computing Systems

Mukesh Singhal
National Science Foundation
The Ohio State University


Lecture Abstract

Checkpointing and failure recovery techniques that have low overhead and provide fast recovery from failures are integral to the design of fault-tolerant, high-performance mobile computing systems. This talk will present a new approach to checkpointing and failure recovery for mobile computing systems, called quasi-synchronous checkpointing. The checkpointing algorithm preserves process autonomy by allowing processes to take checkpoints asynchronously, and it uses communication-induced checkpointing to advance the recovery line, which helps bound rollback propagation during recovery. It thus combines the simplicity and low overhead of asynchronous checkpointing with the recovery-time advantages of synchronous checkpointing. Checkpointing involves no extra messages, and the additional checkpointing overhead is nominal. The algorithm ensures that, at all times, a recovery line exists that is consistent with the latest checkpoint of any process. The recovery algorithm exploits this property to restore the system to a state consistent with the latest checkpoint of a failed process. The recovery algorithm has no domino effect: a failed process only needs to roll back to its latest checkpoint and request the other processes to roll back to a consistent checkpoint. Selective pessimistic message logging at the receiver end is used to avoid the domino effect altogether.
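
The abstract does not say how communication-induced checkpoints advance the recovery line. One common way to realize the idea is to piggyback a checkpoint sequence number on every application message. The Python sketch below illustrates that scheme under this assumption. The names Message, Process, basic_checkpoint, receive, rollback_to, and recovery_line are hypothetical and chosen only for illustration, and the sketch omits message logging and the mobile-host aspects entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str
    sn: int  # sender's checkpoint sequence number, piggybacked (assumed scheme)


@dataclass
class Process:
    pid: int
    sn: int = 0                                   # sequence number of latest checkpoint
    state: list = field(default_factory=list)     # toy application state
    # Saved states keyed by sequence number; 0 is the initial checkpoint.
    checkpoints: dict = field(default_factory=lambda: {0: []})

    def basic_checkpoint(self):
        """Checkpoint taken autonomously, whenever the process chooses."""
        self.sn += 1
        self.checkpoints[self.sn] = list(self.state)

    def send(self, payload):
        # No extra messages: the sequence number rides on the application message.
        return Message(payload, self.sn)

    def receive(self, msg):
        if msg.sn > self.sn:
            # Forced (communication-induced) checkpoint, taken BEFORE the
            # message is processed, so the checkpoint cannot record the
            # receipt of a message sent after the sender's checkpoint.
            self.sn = msg.sn
            self.checkpoints[self.sn] = list(self.state)
        self.state.append(msg.payload)

    def rollback_to(self, n):
        """Restore the earliest checkpoint with sequence number >= n, if any."""
        candidates = [sn for sn in self.checkpoints if sn >= n]
        if not candidates:
            return  # no such checkpoint: the current state is kept
        target = min(candidates)
        self.state = list(self.checkpoints[target])
        self.checkpoints = {s: c for s, c in self.checkpoints.items() if s <= target}
        self.sn = target


def recovery_line(processes, n):
    """Map each pid to its earliest checkpoint with sn >= n (None: current state)."""
    line = {}
    for p in processes:
        candidates = [sn for sn in p.checkpoints if sn >= n]
        line[p.pid] = min(candidates) if candidates else None
    return line


# Usage: p checkpoints on its own, and its next message forces q to checkpoint.
p, q, r = Process(1), Process(2), Process(3)
p.basic_checkpoint()
q.receive(p.send("hello"))
print(recovery_line([p, q, r], p.sn))
```

In this sketch, a process that fails rolls back to its latest checkpoint, with sequence number n, and asks every other process to call rollback_to(n). A message sent after a checkpoint numbered n or higher always carries a sequence number of at least n, so its receiver takes a checkpoint before processing it. The selected checkpoints therefore never record a message whose send has been undone.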

