A Scalable Algorithm for Compiler-Placed Staggered Checkpointing

Alison N. Norman and Calvin Lin

Keywords

Checkpointing, Supercomputing, Parallel Computing, Compilers and Run Time Support

Abstract

To make progress in the face of possible system failures, long-running parallel applications often checkpoint, or save their state, so they can resume execution. Many current checkpointing techniques require user input, impose run-time performance penalties, or result in all processes checkpointing synchronously which leads to network and file system contention, again resulting in significant performance penalty. This paper presents an algorithm to perform compiler-placed staggered checkpointing, where process checkpoints are placed at compile-time at different places in the application text, thereby reducing contention for the network and file system. Identifying staggered checkpoints is algorithmically challenging since the number of possible solutions is enormous---it grows as L^P, where L is the number of possible checkpoint locations and P is the number of processes---and the number of desirable solutions is relatively small. But our algorithm successfully places staggered checkpoints in parallel applications that use tens of thousands of processes. On benchmarks, our algorithm generates and places checkpoints that improve performance by an average of 35% over the current technique.

Important Links:



Go Back