
Fault and Sabotage Tolerance.

To be useful on a large scale, volunteer computing systems must tolerate not only random data faults, but also intentional sabotage by malicious volunteers who submit erroneous data. While such fault-tolerance is difficult to achieve in most parallel programming models, where data can move arbitrarily between processes, it is relatively easy in master-worker systems, where result data are neatly packaged into independent packets that can be checked, compared, and, if necessary, recomputed.

At present, we have implemented and are studying two approaches to fault- and sabotage-tolerance for master-worker systems: majority voting and spot-checking. In majority voting, the work manager waits until an agreeing majority of at least m out of r results from different workers has been received before considering a work object done. If r results are received without reaching majority agreement, all r results are invalidated and the work object is redone. In spot-checking, the work manager precomputes a randomly selected work object before the start of a new superstep, and returns this spotter work with probability p at each call to getWork(). If a worker is sent a spotter work and does not return the expected result, it is blacklisted as untrustworthy, and the work manager backtracks through all the current results, invalidating any results that depend on results from the offending worker. Spot-checking can be less reliable when saboteurs do not always submit bad data, but it is much more time-efficient: it costs only an extra fraction p of the time, instead of making the computation m times longer as voting does.
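The two checks above can be sketched as follows. This is a minimal illustration, not the actual work manager: the class and method names (other than getWork()) are hypothetical, and results are represented as plain strings for brevity.

```java
import java.util.*;

// Illustrative sketch of majority voting and spot-checking dispatch.
// Class and helper names are hypothetical; only getWork() appears in the text.
public class VotingSketch {

    // Majority voting: a work object is done once at least m of the
    // r collected results agree. Returns the agreed-upon result, or
    // null if no majority was reached (in which case all r results
    // are invalidated and the work object is redone).
    static String majorityResult(List<String> results, int m) {
        Map<String, Integer> counts = new HashMap<>();
        for (String res : results) {
            int c = counts.merge(res, 1, Integer::sum);
            if (c >= m) return res;   // agreeing majority reached
        }
        return null;                  // no majority among the r results
    }

    // Spot-checking dispatch: with probability p, hand out the
    // precomputed spotter work instead of the next real work object.
    static String getWork(Random rng, double p, String spotter, String next) {
        return (rng.nextDouble() < p) ? spotter : next;
    }

    public static void main(String[] args) {
        List<String> r = Arrays.asList("42", "42", "7", "42");
        System.out.println(majorityResult(r, 3));                        // "42"
        System.out.println(majorityResult(Arrays.asList("1", "2"), 2));  // null
    }
}
```

The time costs quoted in the text follow directly from this sketch: voting requires roughly m-fold recomputation of every work object, while spot-checking adds only the p fraction of calls that return spotter work.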

Early experiments have demonstrated the potential effectiveness of these techniques [19]. Spot-checking, tested with backtracking but without blacklisting, was particularly promising: it maintained error rates of only a few percent, even in cases where over half of the volunteers were saboteurs. Furthermore, the high rate at which saboteurs were caught suggests that these error rates will be even smaller with blacklisting enabled. While still not zero, such low error rates may suffice for applications where a few bad answers are acceptable or can be screened out. These include image rendering, where a few scattered bad pixels are acceptable; some statistical computations, where outliers can be detected and either ignored or double-checked; genetic algorithms, where bad results are naturally screened out by the system; and others. For applications that require 100% reliability, however, we are exploring ways of combining voting, spot-checking, backtracking, and blacklisting, as well as developing new mechanisms, to reduce the error rate as much as possible.

Any progress made in developing such mechanisms for master-worker systems in general applies automatically to BSP programs, by virtue of our master-worker-based implementation of BSP. Programmers need only define an appropriate hasSameValue() method for each data class used in their work objects; two work objects can then be compared just like ordinary result objects, by comparing their bspVars and MsgQueue fields.
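A comparison of this kind might look like the sketch below. Only hasSameValue(), bspVars, and MsgQueue come from the text; the internal representations (a map of named variables, a list of pending messages) and all other names are assumptions made for illustration.

```java
import java.util.*;

// Hypothetical message queue; the real MsgQueue's contents are not
// specified in the text, so a simple list of strings stands in here.
class MsgQueue {
    List<String> msgs = new ArrayList<>();

    boolean hasSameValue(MsgQueue other) {
        return msgs.equals(other.msgs);
    }
}

// Hypothetical BSP work object holding the state to be voted on.
class BspWork {
    Map<String, Integer> bspVars = new HashMap<>();  // assumed: named BSP variables
    MsgQueue msgQueue = new MsgQueue();              // assumed: outgoing messages

    // Two work objects agree iff their bspVars and MsgQueue fields agree,
    // letting the work manager vote on them like ordinary result objects.
    boolean hasSameValue(BspWork other) {
        return bspVars.equals(other.bspVars)
            && msgQueue.hasSameValue(other.msgQueue);
    }
}

public class CompareSketch {
    public static void main(String[] args) {
        BspWork a = new BspWork(), b = new BspWork();
        a.bspVars.put("x", 1);
        b.bspVars.put("x", 1);
        System.out.println(a.hasSameValue(b));  // true: same vars, same messages
        b.msgQueue.msgs.add("extra");
        System.out.println(a.hasSameValue(b));  // false: message queues differ
    }
}
```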


Luis Sarmenta
1/19/1999