Fault-Tolerance.

Next: Security. Up: Research Issues: Problems and Previous: Adaptive Parallelism.

Fault-Tolerance.

Faults in distributed systems can generally be classified into stopping faults and Byzantine faults [33]. Stopping faults cover cases of processors crashing or leaving, and are automatically handled by adaptively parallel systems. Byzantine faults cover all other kinds of faults, including unintentional random faults such as data loss or corruption due to faulty network links, or faulty processors, as well as intentional malicious attacks. In general, however, we can think of Byzantine faults simply as faults that result in the generation of incorrect result packets.

Byzantine faults can be handled using replication techniques such as majority voting or the more sophisticated and reliable algorithms in [33]. (To prevent sabotage by groups, replicated nodes would preferrably belong to different Internet domains.) However, this has the disadvantage of being very inefficient since a replication factor r means a factor r drop in aggregate computational speed.

We can improve efficiency by spot-checking. For each work packet that a node receives, there would be some probability p that the server already knows the correct answer, and just wants to check if the node is faulty. Although faulty nodes may be able to slip through for a while, the probability of not getting caught approaches zero exponentially in time, so faulty nodes get caught eventually.

Once a node gets caught, the server backtracks through the results, recomputing any results depending on the offending nodes results, and then blacklists the offending node, never allowing it to join the computation again. (More flexible versions of blacklisting can also be used in situations where unintentional and transient occasional errors are expected even from non-faulty nodes, e.g., due to power fluctuations, and we do not want to blacklist a node immediately and forever just because of these temporary failures.)

Another way to achieve fault-tolerance is by simply choosing more fault-tolerant problems. Such problems include those that do not require 100% accuracy in the first place, such as sound or graphics processing where a little static or a few scattered erroneous pixels would be unnoticeable or can be averaged out to be make them unnoticeable. Other problems include those that have easily-verifiable results, such as search problems with rare results that can easily be checked by the server without adding significant extra load (e.g., travelling salesman problem), and problems like rendering, where a human user can visually recognize any unacceptable errors and ask the system to recalculate (and blacklist the node responsible for the problematic areas).

Next: Security. Up: Research Issues: Problems and Previous: Adaptive Parallelism.

Luis Sarmenta
1/2/1998