FTOP: A Library for Fault Tolerance in a Cluster

R. Badrinath, R. Gupta, and N. Shrivastava (India)

Keywords

Coordinated Checkpointing, Rollback recovery, Fault tolerance.

Abstract

Checkpointing and rollback recovery is a simple technique for fault tolerance: the state of a process is saved to a file on disk, from which the process can recover after a failure. In this paper we describe the implementation of FTOP (Fault Tolerant PVM), a coordinated checkpointing library integrated with PVM. Existing PVM applications require only minor changes to incorporate fault tolerance using FTOP. Beyond those changes, FTOP provides a fault tolerance mechanism that is transparent to the programmer, and it does not require any changes to the kernel. FTOP handles in-transit messages, open files, and routing, which makes it a very useful fault-tolerant library.
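
The basic checkpoint-and-rollback idea described in the abstract can be illustrated with a minimal single-process sketch. This is not FTOP's API or implementation: it does no coordination between processes and does not deal with in-transit messages, open files, or routing. The names `save_checkpoint`, `load_checkpoint`, `run_sum`, and the `fail_at` parameter are hypothetical and exist only for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to disk via a temp file and rename, so a crash
    during the write cannot leave a partially written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomically replaces any older checkpoint


def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run_sum(path, n, checkpoint_every=10, fail_at=None):
    """Sum 0..n-1, checkpointing every few iterations.

    fail_at (hypothetical, for illustration) simulates a crash at a
    given iteration. A later call resumes from the last checkpoint.
    """
    state = load_checkpoint(path, {"i": 0, "total": 0})
    while state["i"] < n:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["i"]
        state["i"] += 1
        if state["i"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

After a failure, calling `run_sum` again with the same checkpoint path rolls back to the most recent saved state and redoes only the work performed since that checkpoint. A coordinated scheme such as the one the paper describes must additionally ensure that the checkpoints of all communicating processes form a consistent global state.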
