
Checkpointing and Process Migration.

Previous research in C-based BSP systems has shown that the call to bsp_sync() at the end of a superstep provides a convenient place to checkpoint the program state and migrate processes, which enables adaptive parallelism and crash-tolerance [15,13]. In our implementation, checkpointing happens automatically as each work object returns itself to the server at the end of a superstep. Since the returned work object includes all necessary process state in bspVars, the MsgQueues, and other non-transient fields, it can easily be saved and restored later, or migrated to another machine.
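As a rough illustration of why a returned work object is enough to checkpoint a process, the sketch below saves and restores one using Java's built-in object serialization. This is an assumption on our part: the text only says that non-transient fields carry the state, and does not say which mechanism the server uses. The names WorkSketch, CheckpointSketch, checkpoint, restore, superstep, and scratch are hypothetical; only bspVars and the notion of a message queue come from the text above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.HashMap;

// Hypothetical stand-in for a work object; not the system's actual class.
class WorkSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    int superstep;                                             // assumed: superstep to resume from
    final HashMap<String, Integer> bspVars = new HashMap<>();  // shared-variable state
    final ArrayDeque<String> msgQueue = new ArrayDeque<>();    // pending messages
    transient StringBuilder scratch;                           // transient: skipped by serialization

    WorkSketch(int superstep) {
        this.superstep = superstep;
    }
}

class CheckpointSketch {
    // Serialize the work object to bytes, as a server might when the
    // object returns at the end of a superstep (assumed mechanism).
    static byte[] checkpoint(WorkSketch work) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(work);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Rebuild the work object from saved bytes, either to recover from a
    // crash or to hand the process to a different worker machine.
    static WorkSketch restore(byte[] saved) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(saved))) {
            return (WorkSketch) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch the saved bytes could be written to disk for crash recovery or sent to another worker for migration; either way the restored object resumes with the same bspVars and queued messages, while transient scratch data is simply rebuilt as needed.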



Luis Sarmenta
1/19/1999