The knowledge base, with its binary nodes and probabilities, resembles a
Bayesian belief network as defined by Pearl [Pearl86]. Such networks
have been the subject of considerable research by Pearl, Lauritzen and
Spiegelhalter [Lauritzen88], Cooper [Cooper86], and others.
However, the network of our model does not conform to the definition of
a Bayesian belief network because it has forward cycles. Even if we
removed all of the forward cycles, there would still be multiple paths
between nodes. Multiple paths are not forbidden in Bayesian belief networks,
but they greatly increase the amount of computation required to compute the
probabilities. The algorithms that have been developed to handle
multiple paths are exponential, and Cooper has shown that the problem is
NP-hard [Cooper87]. We analyzed an earlier version of the model and
determined that using the method described by Pearl [Pearl86],
instantiating a set of nodes whose values cut the multiple paths, would
require more than 40 nodes in the cut set. Since the probabilities in the
network would have to be evaluated for every combination of values of this
set (more than 2^40, or roughly 10^12, combinations for 40 binary nodes),
an exact solution is infeasible.
These observations led us to develop heuristic methods for
approximating the probabilities of nodes and finding good hypotheses.
The approach we developed is based on an observation about the
nature of hypotheses. A hypothesis is a subset of the model such that
every finding and node in the hypothesis is accounted for by another
node or is primary. This subset can be viewed as the union of one or
more paths through the model from a primary node to each finding. Thus,
the best hypothesis is the set of paths covering all of the findings
that, taken together, has the highest probability. The problem is difficult
because paths that share links have higher probability than the product
of the individual path probabilities. Thus, the best explanation for
two findings may not include either of the best paths for the findings
taken separately. Still, in a reasonably behaved network, a good
heuristic approach is to take the findings one at a time and build
up a hypothesis using the best path to each finding.
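To make the strategy concrete, here is a minimal Python sketch of the greedy
construction; the data structures (candidate paths stored as a probability
together with their list of nodes) are assumptions of the sketch rather than
the program's actual representation.

def build_hypothesis(findings, candidate_paths):
    # candidate_paths maps each finding to a list of (probability, nodes)
    # pairs, each path ending at that finding (assumed structure).
    hypothesis = set()
    for finding in findings:
        paths = candidate_paths.get(finding, [])
        if not paths:
            continue  # no explanation available; leave the finding uncovered
        # Favor paths that reuse nodes already in the hypothesis, since shared
        # links raise the combined probability above the simple product of
        # independent path probabilities.
        prob, nodes = max(
            paths,
            key=lambda p: (len(hypothesis.intersection(p[1])), p[0]))
        hypothesis.update(nodes)
        hypothesis.add(finding)
    return hypothesis

Competing hypotheses built this way can then be compared by their total
probability, as described next.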
While this is the basic approach we are using, there are a number of
enhancements that can be made, most of which have already been
incorporated into our algorithm. Each of these is a heuristic for
restricting the search and increasing the probability of the hypotheses
found. The test of whether a subset of the nodes in a hypothesis is
better than an alternative set of nodes is to compare the total
probability of the hypotheses.
- Some nodes may be definitely true or definitely false given the
findings. These are set and used as if they were findings. For
example, if a finding has only one cause and requires a
cause (i.e., it cannot occur by itself), that cause must be true.
Similarly, if a node always produces a particular finding and that
finding is absent, the node must be false. Often several nodes are
definitely known, which significantly constrains the search for hypotheses
(this constraint propagation is sketched in the first example following the list).
- The search for a path for a finding can make use of the paths
already selected for other findings. Often the best explanation of a finding
is a short path from a node in the partially generated hypothesis rather
than a path from a primary node.
- Since sharing of paths decreases the number of primary nodes in
the hypothesis, the algorithm starts the search by selecting a cover set
of primary nodes to account for all of the findings and restricts the
search to these. Currently, the program uses the primary nodes that
account for a large number of findings and for each of these generates
all minimal additional sets of primary nodes needed to account for the
findings. In this way, all minimal cover sets are used, plus cover sets that
include each primary node that accounts for a large number of findings.
Care must be taken in generating cover sets because there are times when
an explanation with two causes has a higher probability than an
explanation with only one of the causes, even though the single cause is
sufficient.
- When there are multiple pieces of evidence (findings or true nodes) for
a node, it is often possible to show that the node is the best
explanation for the evidence under fairly conservative assumptions by
comparing the probability of the findings with the node in the
hypothesis to an estimated maximum probability with the node false. If
this comparison is done later in the hypothesis generation process, the
probability estimates are better but the savings in search are smaller.
This process takes considerable computation and we have yet to establish
a definite benefit.
- The probabilities of the paths can be adjusted for the negative findings
and nodes that are direct consequences of the nodes in the path. Since
the probability that a node is false is $\prod_i (1 - p_i)$, where the
$p_i$ are the probabilities associated with its true causes, $(1 - p_j)$ is
the probability that the node will remain false when the node $j$ in the
path is true. This gives a better estimate of the probability that the
path is part of the diagnosis (see the path-adjustment sketch following the list).
- The diagnostic nodes provide a natural place to break the paths
so that fewer paths have to be considered. Thus, if the probability of
the diagnostic nodes can be approximated, they can be used as if they
were primary nodes, dividing the search. When this approach was added to the
algorithm, the number of causal paths was reduced by a factor of about
four.
- Ordering the findings by the difference in probability between
the best path and the second best path improves the chances that the
best combinations of paths will be found. This selects first
the paths that are most clearly the best for their findings. We have tried
several methods of ordering the findings and while it is clear that the
ordering makes a difference, it is difficult to tell how much better
this method is than others.
- Pruning a hypothesis can sometimes improve it - essentially hill
climbing to a local maximum probability. Sometimes a primary cause is
unneeded because some findings are better left unexplained or because other
primary causes cover them. Sometimes sections of paths are unneeded
because paths added later account for all of the findings. Care must be
taken in removing sections to ensure that all nodes remain reachable
from a primary node, because otherwise circularities in the causality can
appear to be self-supporting.
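The first enhancement above, fixing nodes that are definitely true or
definitely false, can be sketched as a simple propagation step. The link
attributes used here (whether a finding has a single necessary cause, and
which findings a node always produces) are assumed representations for the
example, not the program's own.

def propagate_definite_nodes(present_findings, absent_findings,
                             sole_required_cause, always_produces):
    # sole_required_cause maps a finding to its only possible cause when the
    # finding cannot occur spontaneously; always_produces maps a node to the
    # findings it produces whenever it is true (assumed structures).
    definitely_true = set()
    definitely_false = set()
    for finding in present_findings:
        cause = sole_required_cause.get(finding)
        if cause is not None:
            definitely_true.add(cause)   # the finding's only cause must be true
    for node, findings in always_produces.items():
        if any(f in absent_findings for f in findings):
            definitely_false.add(node)   # a guaranteed consequence is absent
    return definitely_true, definitely_false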
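The adjustment of path probabilities for negative findings can be sketched as
follows; link_prob, mapping a (cause, effect) pair to the probability p that
the cause produces the effect, is an assumed structure for the example.

def adjust_for_negative_findings(path_prob, path_nodes, absent_findings,
                                 link_prob):
    # For each absent finding that is a direct consequence of a node j on the
    # path, multiply in (1 - p_j), the probability that the finding stays
    # false even though j is true.
    adjusted = path_prob
    for node in path_nodes:
        for finding in absent_findings:
            p = link_prob.get((node, finding))
            if p is not None:
                adjusted *= 1.0 - p
    return adjusted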
There are also a number of computational tactics that can be employed
to make computing the differential diagnosis reasonably efficient.
- Precompute all of the causal paths in the model when the system is
initialized. This can be done efficiently by generating a tree
structure through the effects from each primary node and diagnostic node.
Thus, the paths share structure and the only link unique to each path is
the last link. Before the diagnostic nodes were used to divide the
paths and unusable paths were eliminated, there were about 70,000 paths.
There are now about 7,000.
- Some paths can be eliminated if there is another path with a subset
of its nodes and the minimum probability of the shorter path is always
higher than the maximum probability of the longer path. (The longer
path has a detour around one link in the shorter path.) Eliminating
such paths more than halved the number of paths.
- Use bit vectors to do comparisons between paths and sets of nodes.
For example, examining paths that come from the primary nodes in the
cover set is done by checking for a non-zero intersection between the
bit vector of the cover set and the bit vector of each path. Also, a
path is discarded if its bit vector intersects the bit vector of the set
of nodes known to be false (a sketch of these bit-vector tests follows the list).
- The probabilities along the paths can be computed once, when the user
enters the findings. Since paths are examined multiple times and share
structure, this is much more efficient than computing the probabilities
as needed.
- Once consideration has been restricted to a small number of
primary nodes in generating a hypothesis, there are often intermediate
nodes that must be true because all causal paths from the selected
primary nodes to the findings go through them. These can be found by
intersecting bit vectors.
- Hypotheses can be compared by their total probability -
essentially the probability that a patient with that mechanism producing
that set of findings would exist in the general population. It is
unnecessary to normalize by the probability of the findings because all
of the hypotheses to be compared have the same findings.
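The bit-vector tactic can be illustrated with Python integers used as bit
vectors; the node numbering and the path representation here are assumptions
of the sketch, not the program's actual encoding.

def to_bitvector(nodes, node_bit):
    # node_bit assigns each node a fixed bit position.
    vec = 0
    for node in nodes:
        vec |= 1 << node_bit[node]
    return vec

def select_paths(paths, cover_vec, false_vec):
    # paths is a list of (path_vec, path) pairs with precomputed bit vectors.
    selected = []
    for path_vec, path in paths:
        if path_vec & false_vec:      # path touches a node known to be false
            continue
        if path_vec & cover_vec:      # path uses a primary node in the cover set
            selected.append(path)
    return selected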
With these enhancements and computational tactics, the algorithm has
proven effective for generating differential diagnoses in our domain.
Even with the large size of the probabilistic model, the algorithm runs
in less than a minute on a typical case with about a dozen abnormal
findings on a Symbolics 3650. This makes it useful as an interactive
tool for physicians analyzing a case. The method is still heuristic.
Indeed, we have found cases where there was a better hypothesis than any
of those that were generated. In each such case, the better hypothesis
differed from a generated hypothesis in relatively minor ways. From the
viewpoint of the domain, the performance is still adequate. It also
appears that we should be able to extend the final pruning process to
identify many of these situations and modify the hypothesis - in
effect, a local hill-climbing process to make sure there is no better
hypothesis in the region of the candidate (sketched below).
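A local hill-climbing pass of this kind might look like the following sketch,
where score returns the total probability of a hypothesis for the given
findings and effects gives the causal links; both are assumed interfaces, not
the program's own.

def reachable_from_primary(nodes, primaries, effects):
    # Check that every node in the set can be reached from a primary node
    # through causal links that stay inside the set.
    frontier = [n for n in nodes if n in primaries]
    reached = set(frontier)
    while frontier:
        node = frontier.pop()
        for effect in effects.get(node, ()):
            if effect in nodes and effect not in reached:
                reached.add(effect)
                frontier.append(effect)
    return reached == set(nodes)

def hill_climb_prune(hypothesis, findings, primaries, effects, score):
    # Repeatedly drop any node whose removal raises the total probability,
    # provided every remaining node is still reachable from a primary node.
    improved = True
    while improved:
        improved = False
        for node in list(hypothesis):
            if node in findings:
                continue              # keep the findings themselves
            candidate = hypothesis - {node}
            if not reachable_from_primary(candidate, primaries, effects):
                continue
            if score(candidate, findings) > score(hypothesis, findings):
                hypothesis = candidate
                improved = True
    return hypothesis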