The knowledge base, with its binary nodes and probabilities, resembles a
Bayesian belief network as defined by Pearl [Pearl86]. Such networks
have been the subject of considerable research by Pearl, Lauritzen and
Spiegelhalter [Lauritzen88], Cooper [Cooper86], and others.
However, the network of our model does not conform to the definition of
a Bayesian belief network because it has forward cycles. Even if we
removed all of the forward cycles, there would still be multiple paths
between nodes. Multiple paths are not forbidden in Bayesian belief networks,
but they greatly increase the amount of computation required to compute the
probabilities. The algorithms that have been developed to handle
multiple paths are exponential, and Cooper has shown that the problem is
NP-hard [Cooper87]. We analyzed an earlier version of the model and
determined that using the method described by Pearl [Pearl86],
instantiating a set of nodes whose values cut the multiple paths, would
require more than 40 nodes in the cut set. Since the probabilities in the
network would have to be evaluated for every combination of values of this
set (more than 2^40, or roughly 10^12, combinations for 40 binary nodes),
an exact solution is infeasible.
These observations led us to develop heuristic methods for
approximating the probabilities of nodes and finding good hypotheses.
The approach we developed is based on an observation about the
nature of hypotheses. A hypothesis is a subset of the model such that
every finding and node in the hypothesis is accounted for by another
node or is primary. This subset can be viewed as the union of one or
more paths through the model from a primary node to each finding. Thus,
the best hypothesis is the set of paths covering all of the findings
that, taken together, has the highest probability. The problem is difficult
because paths that share links have higher probability than the product
of the individual path probabilities. Thus, the best explanation for
two findings may not include either of the best paths for the findings
taken separately. Still, in a reasonably behaved network, a good
heuristic approach is to take the findings one at a time and build
up a hypothesis using the best path to each finding.
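To make the strategy concrete, here is a minimal Python sketch of the greedy
construction; the data structures (candidate paths stored as a probability
together with their list of nodes) are assumptions of the sketch rather than
the program's actual representation.

def build_hypothesis(findings, candidate_paths):
    # candidate_paths maps each finding to a list of (probability, nodes)
    # pairs, each path ending at that finding (assumed structure).
    hypothesis = set()
    for finding in findings:
        paths = candidate_paths.get(finding, [])
        if not paths:
            continue  # no explanation available; leave the finding uncovered
        # Favor paths that reuse nodes already in the hypothesis, since shared
        # links raise the combined probability above the simple product of
        # independent path probabilities.
        prob, nodes = max(
            paths,
            key=lambda p: (len(hypothesis.intersection(p[1])), p[0]))
        hypothesis.update(nodes)
        hypothesis.add(finding)
    return hypothesis

Competing hypotheses built this way can then be compared by their total
probability, as described next.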
While this is the basic approach we are using, there are a number of
enhancements that can be made, most of which have already been
incorporated into our algorithm. Each of these is a heuristic for
restricting the search and increasing the probability of the hypotheses
found. The test of whether a subset of the nodes in a hypothesis is
better than an alternative set of nodes is to compare the total
probability of the hypotheses.
- Some nodes may be definitely true or definitely false given the
findings. These are set and used as if they were findings. For
example, if a finding has only one cause and requires a
cause (i.e., it cannot occur by itself), that cause must be true.
Similarly, if a node always produces a particular finding and that
finding is absent, the node must be false. Often several nodes are
definitely known, which significantly constrains the search for hypotheses
(this constraint propagation is sketched in the first example following the list).
- The search for a path for a finding can make use of the paths
already selected for other findings. Often the best explanation of a finding
is a short path from a node in the partially generated hypothesis rather
than a path from a primary node.
- Since sharing of paths decreases the number of primary nodes in
the hypothesis, the algorithm starts the search by selecting a cover set
of primary nodes to account for all of the findings and restricts the
search to these. Currently, the program uses the primary nodes that
account for a large number of findings and for each of these generates
all minimal additional sets of primary nodes needed to account for the
findings. In this way, all minimal cover sets are used, plus cover sets that
include each primary node that accounts for a large number of findings.
Care must be taken in generating cover sets because there are times when
an explanation with two causes has a higher probability than an
explanation with only one of the causes, even though the single cause is
sufficient.
- When there are multiple pieces of evidence (findings or true nodes) for
a node, it is often possible to show that the node is the best
explanation for the evidence under fairly conservative assumptions by
comparing the probability of the findings with the node in the
hypothesis to an estimated maximum probability with the node false. If
this comparison is done later in the hypothesis generation process, the
probability estimates are better but the savings in search are smaller.
This process takes considerable computation and we have yet to establish
a definite benefit.
- The probabilities of the paths can be adjusted for the negative findings
and nodes that are direct consequences of the nodes in the path. Since
the probability that a node is false is $\prod_i (1 - p_i)$, where the
$p_i$ are the probabilities associated with its true causes, $(1 - p_j)$ is
the probability that the node will remain false when the node $j$ in the
path is true. This gives a better estimate of the probability that the
path is part of the diagnosis (see the path-adjustment sketch following the list).
- The diagnostic nodes provide a natural place to break the paths
so that fewer paths have to be considered. Thus, if the probability of
the diagnostic nodes can be approximated, they can be used as if they
were primary nodes, dividing the search. When this approach was added to the
algorithm, the number of causal paths was reduced by a factor of about
four.
- Ordering the findings by the difference in probability between
the best path and the second best path improves the chances that the
best combinations of paths will be found. This selects first
the paths that are most clearly the best for their findings. We have tried
several methods of ordering the findings and while it is clear that the
ordering makes a difference, it is difficult to tell how much better
this method is than others.
- Pruning a hypothesis can sometimes improve it - essentially hill
climbing to a local maximum probability. Sometimes a primary cause is
unneeded because some findings are better left unexplained or because other
primary causes cover them. Sometimes sections of paths are unneeded
because paths added later account for all of the findings. Care must be
taken in removing sections to ensure that all nodes remain reachable
from a primary node, because otherwise circularities in the causality can
appear to be self-supporting.
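The first enhancement above, fixing nodes that are definitely true or
definitely false, can be sketched as a simple propagation step. The link
attributes used here (whether a finding has a single necessary cause, and
which findings a node always produces) are assumed representations for the
example, not the program's own.

def propagate_definite_nodes(present_findings, absent_findings,
                             sole_required_cause, always_produces):
    # sole_required_cause maps a finding to its only possible cause when the
    # finding cannot occur spontaneously; always_produces maps a node to the
    # findings it produces whenever it is true (assumed structures).
    definitely_true = set()
    definitely_false = set()
    for finding in present_findings:
        cause = sole_required_cause.get(finding)
        if cause is not None:
            definitely_true.add(cause)   # the finding's only cause must be true
    for node, findings in always_produces.items():
        if any(f in absent_findings for f in findings):
            definitely_false.add(node)   # a guaranteed consequence is absent
    return definitely_true, definitely_false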
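The adjustment of path probabilities for negative findings can be sketched as
follows; link_prob, mapping a (cause, effect) pair to the probability p that
the cause produces the effect, is an assumed structure for the example.

def adjust_for_negative_findings(path_prob, path_nodes, absent_findings,
                                 link_prob):
    # For each absent finding that is a direct consequence of a node j on the
    # path, multiply in (1 - p_j), the probability that the finding stays
    # false even though j is true.
    adjusted = path_prob
    for node in path_nodes:
        for finding in absent_findings:
            p = link_prob.get((node, finding))
            if p is not None:
                adjusted *= 1.0 - p
    return adjusted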
There are also a number of computational tactics that can be employed
to make computing the differential diagnosis reasonably efficient.
- Precompute all of the causal paths in the model when the system is
initialized. This can be done efficiently by generating a tree
structure through the effects from each primary node and diagnostic node.
Thus, the paths share structure and the only link unique to each path is
the last link. Before the diagnostic nodes were used to divide the
paths and unusable paths were eliminated, there were about 70,000 paths.
There are now about 7,000.
- Some paths can be eliminated if there is another path with a subset
of its nodes and the minimum probability of the shorter path is always
higher than the maximum probability of the longer path. (The longer
path has a detour around one link in the shorter path.) Eliminating
such paths more than halved the number of paths.
- Use bit vectors to do comparisons between paths and sets of nodes.
For example, examining paths that come from the primary nodes in the
cover set is done by checking for a non-zero intersection between the
bit vector of the cover set and the bit vector of each path. Also, a
path is discarded if its bit vector intersects the bit vector of the set
of nodes known to be false (a sketch of these bit-vector tests follows the list).
- The probabilities along the paths can be computed once, when the user
enters the findings. Since paths are examined multiple times and share
structure, this is much more efficient than computing the probabilities
as needed.
- Once consideration has been restricted to a small number of
primary nodes in generating a hypothesis, there are often intermediate
nodes that must be true because all causal paths from the selected
primary nodes to the findings go through them. These can be found by
intersecting bit vectors.
- Hypotheses can be compared by their total probability -
essentially the probability that a patient with that mechanism producing
that set of findings would exist in the general population. It is
unnecessary to normalize by the probability of the findings because all
of the hypotheses to be compared have the same findings.
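The bit-vector tactic can be illustrated with Python integers used as bit
vectors; the node numbering and the path representation here are assumptions
of the sketch, not the program's actual encoding.

def to_bitvector(nodes, node_bit):
    # node_bit assigns each node a fixed bit position.
    vec = 0
    for node in nodes:
        vec |= 1 << node_bit[node]
    return vec

def select_paths(paths, cover_vec, false_vec):
    # paths is a list of (path_vec, path) pairs with precomputed bit vectors.
    selected = []
    for path_vec, path in paths:
        if path_vec & false_vec:      # path touches a node known to be false
            continue
        if path_vec & cover_vec:      # path uses a primary node in the cover set
            selected.append(path)
    return selected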
With these enhancements and computational tactics, the algorithm has
proven effective for generating differential diagnoses in our domain.
Even with the large size of the probabilistic model, the algorithm runs
in less than a minute on a typical case with about a dozen abnormal
findings on a Symbolics 3650. This makes it useful as an interactive
tool for physicians analyzing a case. The method is still heuristic.
Indeed, we have found cases where there was a better hypothesis than any
of those that were generated. In each such case, the better hypothesis
differed from a generated hypothesis in relatively minor ways. From the
viewpoint of the domain, the performance is still adequate. It also
appears that we should be able to extend the final pruning process to
identify many of these situations and modify the hypothesis - in
effect, a local hill-climbing process to make sure there is no better
hypothesis in the region of the candidate (sketched below).
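A local hill-climbing pass of this kind might look like the following sketch,
where score returns the total probability of a hypothesis for the given
findings and effects gives the causal links; both are assumed interfaces, not
the program's own.

def reachable_from_primary(nodes, primaries, effects):
    # Check that every node in the set can be reached from a primary node
    # through causal links that stay inside the set.
    frontier = [n for n in nodes if n in primaries]
    reached = set(frontier)
    while frontier:
        node = frontier.pop()
        for effect in effects.get(node, ()):
            if effect in nodes and effect not in reached:
                reached.add(effect)
                frontier.append(effect)
    return reached == set(nodes)

def hill_climb_prune(hypothesis, findings, primaries, effects, score):
    # Repeatedly drop any node whose removal raises the total probability,
    # provided every remaining node is still reachable from a primary node.
    improved = True
    while improved:
        improved = False
        for node in list(hypothesis):
            if node in findings:
                continue              # keep the findings themselves
            candidate = hypothesis - {node}
            if not reachable_from_primary(candidate, primaries, effects):
                continue
            if score(candidate, findings) > score(hypothesis, findings):
                hypothesis = candidate
                improved = True
    return hypothesis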