Transparent Fault-tolerance in Parallel Orca Programs

Bibliographic Details
Main Authors: Kaashoek, M.F., Michiels, R., Bal, H.E., Tanenbaum, A.S.
Format: Other Non-Article Part of Journal/Newspaper
Language: English
Published: 1992
Subjects:
Online Access: https://research.vu.nl/en/publications/81c2f9d5-f003-4f22-bfe3-ca0d9d915c5e
http://hdl.handle.net/1871.1/81c2f9d5-f003-4f22-bfe3-ca0d9d915c5e
http://www.cs.vu.nl/~ast/Publications/Papers/sedms-1992.pdf
Description
Summary: With the advent of large-scale parallel computing systems, making parallel programs fault-tolerant becomes an important problem, because the probability of a failure increases with the number of processors. In this paper, we describe a very simple scheme for rendering a class of parallel Orca programs fault-tolerant. Also, we discuss our experience with implementing this scheme on Amoeba. Our approach works for parallel applications that are not interactive. The approach is based on making a globally consistent checkpoint from time to time and rolling back to the last checkpoint when a processor fails. Making a consistent global checkpoint is easy in Orca, because its implementation is based on reliable broadcast. The advantages of our approach are its simplicity, ease of implementation, low overhead, and transparency to the Orca programmer.
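
The checkpoint-and-rollback idea in the summary can be illustrated with a small simulation. This is only a sketch in Python, not the Orca or Amoeba implementation described in the paper. It assumes the broadcast delivers every message to every process in the same total order, which is what lets a checkpoint marker define a consistent cut. The names Process, Broadcaster, deliver, recover, and demo are hypothetical and chosen for illustration.

```python
import copy


class Process:
    """One simulated worker holding a replica of the shared state (hypothetical)."""

    def __init__(self, pid):
        self.pid = pid
        self.state = {"sum": 0}
        self.checkpoint = copy.deepcopy(self.state)
        self.checkpoint_seq = 0   # sequence number of the last checkpoint marker
        self.delivered = 0        # sequence number of the last message applied

    def deliver(self, seq, msg):
        kind, value = msg
        if kind == "add":
            self.state["sum"] += value
        elif kind == "checkpoint":
            # Every process sees the marker at the same position in the
            # message order, so the saved states together form a consistent cut.
            self.checkpoint = copy.deepcopy(self.state)
            self.checkpoint_seq = seq
        self.delivered = seq

    def rollback(self):
        self.state = copy.deepcopy(self.checkpoint)
        self.delivered = self.checkpoint_seq


class Broadcaster:
    """Stand-in for a reliable broadcast; total ordering is an assumption here."""

    def __init__(self, processes):
        self.processes = processes
        self.log = []

    def broadcast(self, msg):
        self.log.append(msg)
        seq = len(self.log)
        for p in self.processes:
            p.deliver(seq, msg)
        return seq

    def recover(self):
        # On a failure, all processes return to the last global checkpoint
        # and messages sent after it are discarded.
        for p in self.processes:
            p.rollback()
        self.log = self.log[: self.processes[0].checkpoint_seq]


def demo():
    procs = [Process(i) for i in range(3)]
    bus = Broadcaster(procs)
    bus.broadcast(("add", 5))
    bus.broadcast(("checkpoint", None))
    bus.broadcast(("add", 7))   # work done after the checkpoint
    bus.recover()               # simulated processor failure
    return [p.state["sum"] for p in procs]


if __name__ == "__main__":
    print(demo())
```

In this sketch the work done after the checkpoint (the second add) is lost on recovery and would have to be re-executed, which is one reason the summary restricts the approach to non-interactive applications.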