Normal Versus Worst-case Performance in High-Availability Cluster and Distributed Computing

L. Lundberg and C. Svahnberg (Sweden)


Keywords: Clusters, Availability, Performance, Recovery schemes


Clusters and distributed systems offer fault tolerance and high performance. When all computers are up and running, the load should be evenly distributed among them. When a computer breaks down, its load must be redistributed to the other computers in the cluster. Most cluster systems are designed to tolerate a single fault, so one can distinguish between two modes of operation: normal operation, when all computers are up and running, and worst-case operation, when one computer is down. The performance in these two modes is determined by how work is allocated to the computers in the cluster or distributed system. It turns out that, in general, the same allocation cannot achieve optimal normal-case and worst-case performance, i.e. there is a trade-off. In this paper we establish an optimal upper bound on the loss of worst-case performance when optimizing for normal-case performance. We also provide a heuristic algorithm for making engineering trade-offs between worst-case and normal-case performance.
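To make the two performance measures concrete, the following sketch (an illustrative toy model, not the paper's algorithm; the allocation, the ring-style recovery scheme, and the function names are assumptions) computes the normal-case load and the worst-case load of a given allocation, where the worst case is taken over every possible single-computer failure:

```python
def normal_case_load(loads):
    """Normal-case performance: the maximum load when all computers are up.

    loads[i] is the amount of work allocated to computer i.
    """
    return max(loads)


def worst_case_load(loads, backup):
    """Worst-case performance: the maximum load over all single failures.

    backup[i] is the (hypothetical) recovery scheme: the computer that
    takes over computer i's work when computer i goes down.
    """
    worst = 0
    for failed in range(len(loads)):
        after_failure = list(loads)
        after_failure[backup[failed]] += after_failure[failed]
        after_failure[failed] = 0
        worst = max(worst, max(after_failure))
    return worst


# Example (assumed data): an even allocation over three computers with a
# ring recovery scheme.  The even allocation is optimal in the normal case,
# but after any failure one survivor carries a double load.
loads = [4, 4, 4]
backup = [1, 2, 0]  # computer 0 fails -> computer 1 takes its work, etc.
print(normal_case_load(loads))        # prints 4
print(worst_case_load(loads, backup)) # prints 8
```

The example shows the trade-off in miniature: the allocation that is best when all computers are up can still leave a large worst-case load, which is exactly the gap the paper bounds.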
