Categorical and Probabilistic Reasoning in Medicine Revisited^

Peter Szolovits, PhD
Stephen G. Pauker, MD

1. Introduction

Our 1978 paper reviewed the artificial intelligence-based medical (AIM) diagnostic systems. Medical diagnosis is one of the earliest difficult intellectual domains to which AI applications were suggested, and one where success could (and still can) lead to benefit for society. The early 1970's brought Schwartz's clarion call to adopt computers to augment human reasoning in medicine, Gorry's rejection of older flowchart and probabilistic methods, and the first demonstrations of "expert systems" that could indeed achieve human expert-level performance on bounded but challenging intellectual tasks that were important to practicing professionals, such as symbolic mathematics and the determination of chemical structure. By the mid-70's, a handful of first-generation medical AI systems had been developed, demonstrated, and at least partially evaluated. Although the methods each employed appeared on the surface to be very different, we identified the underlying knowledge on which each operated and classified the general methods they used. We emphasized the distinction between the "categorical" or structural knowledge of the programs and the particular calculi they employed to approximate some form of probabilistic reasoning. Our analysis also suggested that alternative combinations of these methods could be equally well applied to solve diagnostic problems.

In this note, we explore several research themes concerning medical expert systems that have emerged during the past fifteen years, review the roles of diagnosis and therapy in medicine, and make some observations on the future of medical AI tools in the changing context of clinical care. This paper is discursive; for a more complete treatment of these issues, please refer to a paper now in preparation, "The Third Decade of Artificial Intelligence in Medicine, Challenges and Prospects in a New Medical Era."*

2. Reasoning about multiple diagnoses

One difficulty encountered by most of the initial AIM programs was how to deal with the simultaneous presence of multiple disorders. The AIM programs we described in 1978 each took a different, and not altogether satisfactory, approach. Basically, CASNET/Glaucoma concerned only a single disease. MYCIN worked in a domain in which any strongly suspected disease must be treated because the risk of not treating a serious infection typically far outweighs the risk of the treatment. Thus, whether two suspected infections might actually both be present or might really represent an ambiguous presentation of a single infection, MYCIN proposed treatment for both. PIP considered all disorders to be competitors to explain all findings unless the disorders were connected by causal or associational links. PIP therefore tended to produce many alternative explanations for a case, each centered on a likely actual disorder but with no indication that some combinations of disorders were more suitable diagnostic conclusions. INTERNIST had a clever partitioning heuristic that helped sort out and allocate specific findings to particular disorders and worked well when asked to identify co-occurring disorders whose findings neither overlapped significantly nor interfered with each other; it was weak, however, at identifying clusters of related diseases, such as might arise in the multiple facets of a systemic disorder.

Although Pople suggested interesting heuristics for dealing with multiple disorders and reformulated diagnosis as a search problem, Reggia, and Reiter and de Kleer best formalized and solved at least a simplified version of the multiple-diagnosis problem. These efforts consider the simplified case where the knowledge base consists of a bipartite graph of diseases (each unrelated to the other) and findings (also unrelated to each other), interconnected so that each disease is linked to each finding that it may cause. Reggia's insight was that for such a knowledge base, an adequate diagnosis consisted of a minimal set of diseases whose associated findings covered the set of observed findings. Normally, there is no unique minimal set, and additional criteria (e.g., likelihood) could be used to select among alternative minimal candidate sets. This process is, in general, combinatorial, and careful guidance is needed to achieve practical performance. Wu has recently shown that a heuristic but robust symptom clustering method dramatically improves search performance by introducing the equivalent of "planning islands" in the overall diagnostic process: it first partitions the symptoms into sets in which all symptoms must have a common explanation; then it performs differential diagnosis among the possible causes of each set. The overall diagnosis will identify a single disorder as the cause of each set of symptoms. An efficient factored representation cuts down combinatorial growth of the search space. The simple bipartite model is not entirely satisfactory, however. It does not deal well with situations in which the absence of a particular observation is key, nor with keeping clear the distinction between a finding that is unknown because its presence has yet to be investigated and one that is known to be absent.
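Reggia's set-covering formulation can be illustrated with a small sketch. Given a bipartite knowledge base, we enumerate the smallest sets of diseases whose associated findings cover everything observed (one simple reading of the minimality criterion; the full formulation also admits irredundant covers and likelihood-based ranking). The diseases, findings, and links below are hypothetical toy data, not drawn from any of the systems discussed:

```python
from itertools import combinations

# A toy bipartite knowledge base (hypothetical diseases and findings):
# each disease is linked to the findings it may cause.
CAUSES = {
    "flu":       {"fever", "cough", "myalgia"},
    "pneumonia": {"fever", "cough", "dyspnea"},
    "anemia":    {"fatigue", "pallor"},
}

def minimal_covers(observed, causes=CAUSES):
    """Return all smallest sets of diseases whose findings cover `observed`."""
    diseases = list(causes)
    for size in range(1, len(diseases) + 1):
        covers = [set(combo) for combo in combinations(diseases, size)
                  if observed <= set().union(*(causes[d] for d in combo))]
        if covers:
            return covers   # minimal covers; in general there are several
    return []

# Two minimal covers explain this presentation: {flu, anemia} and
# {pneumonia, anemia}; no single disease suffices.
print(minimal_covers({"fever", "cough", "pallor"}))
```

As the text notes, this enumeration is combinatorial in the worst case; Wu's symptom-clustering heuristic attacks exactly this growth by first partitioning the findings into sets that must share a common explanation.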

3. Reasoning at multiple levels of detail

Each of the programs we surveyed in 1978 based its reasoning on associations among diseases or abnormal states and their manifestations, without any explicit representation of the mechanisms whereby the diseases caused their symptoms. In cases where two or more disorders interact, such associational methods either require explicit encoding of the effects of all possible combinations of disorders (prohibitive for sizable domains) or some generative theory that allows the program to predict their joint effects. Such a generative theory must typically be based on a much more causal and possibly quantitative description of the domain than what would serve to diagnose isolated disorders. An easy illustration is the case of two diseases, one of which raises the blood concentration of some ion, the other of which lowers it. In this case, a measurement of that ion in the blood can indicate only the relative severity of the two disorders, and may take on virtually any value. In particular, it could be perfectly normal if the two disorders happen to cancel out in this dimension. Yet it would be odd for an associational diagnostician to link both disorders with a normal level of that ion!
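A toy quantitative model makes the point concrete. Suppose two hypothetical disorders shift a serum ion's concentration in opposite directions; a simple additive causal model predicts an abnormal value for either disorder alone, but a perfectly normal value when both are present at matched severity. All numbers here are illustrative, not clinical:

```python
# Hypothetical quantitative sketch: two disorders with opposing effects on
# a serum ion. Baseline and effect sizes are illustrative, not clinical.
NORMAL = 4.0                      # baseline concentration (arbitrary units)
EFFECT = {"raising_disorder": +1.5, "lowering_disorder": -1.5}

def predicted_level(present, severity):
    """Net level predicted by a simple additive causal model."""
    return NORMAL + sum(EFFECT[d] * severity[d] for d in present)

# Either disorder alone produces an abnormal value...
print(predicted_level({"raising_disorder"}, {"raising_disorder": 1.0}))  # 5.5
# ...but together, at equal severity, they cancel to a normal reading.
print(predicted_level({"raising_disorder", "lowering_disorder"},
                      {"raising_disorder": 1.0, "lowering_disorder": 1.0}))  # 4.0
```

A purely associational diagnostician, linking each disorder only to its characteristic abnormal reading, has no way to entertain both disorders in the face of the normal value that their interaction produces.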

This problem is far more ubiquitous in contemporary medicine than one might at first imagine, because almost any identifiable disease normally elicits medical treatment. One goal of treatment is to eliminate the underlying disturbance, but another common one is to compensate for abnormalities caused by the disease, to keep the patient in some stable state, even if not perfectly healthy. But each instance of this leads precisely (and deliberately) to just the sorts of interactions that make associational diagnosis difficult. Furthermore, even when a diagnosis is not firmly established, empiric treatments or therapeutic trials often produce partially treated (and often partially obscured) diseases. Although it is tempting therefore to adopt a diagnostic strategy based on reasoning from first principles about the possible consequences of any combinations of diseases and their treatments, this turns out to be impractical for several reasons. First, much of medicine remains a sufficient mystery that even moderately accurate models are beyond our knowledge. Second, even in domains where such models have been developed (e.g., circulatory dynamics), the models require a level of detail and consistency of global analysis that is clinically not achievable.

Ideally, a program should reason simply about simple cases and resort to a more complex generative theory only when actual complexity of the case requires it. Human experts appear to have this skill—to recognize simple cases as simple and, after a cursory ruling out of potentially serious alternatives, to treat them as adequately solved. With this skill, intensive investigation is reserved for cases (or aspects of cases) that actually involve complicated interactions and require deeper thought. How is a program to recognize, however, that a case thought to be simple really is not? And how is it to do so without concluding that every case is complex?

Patil's ABEL program for diagnosing acid-base and electrolyte disorders employed a five-level pathophysiologic model in which the top level contained associational links among clinically-significant states and lower levels introduced successively more physiological detail, until the lowest layer represented our most detailed biochemical understanding of the mechanisms of acid-base physiology (at least as clinicians think of it). ABEL would begin by trying to construct a consistent explanation at the highest level, and then would use discrepancies between that model's predictions and subsequently gathered data as the clue that more detailed investigation was needed. Although this insight is valuable, the technique suffered some important deficiencies: As soon as any unexpected finding arose, its control strategy would tend to "dive" down to the deepest model layer because that is the one at which quantitative measurements could best resolve questions of interaction among multiple disturbances. In its narrow domain, this was tolerable, but more generally it leads to a very detailed and costly analysis of any case that departs from the textbook norm. Furthermore, a discrepancy detected in a hypothesis at some level could actually mean either that a more detailed analysis will lead to an incrementally-augmented successful hypothesis, or that this hypothesis should be abandoned in favor of one of its competitors.

A few other successful programs in non-medical domains (e.g., Hamscher's program for diagnosing faulty circuit boards by considering their aggregate temporal behavior) have also demonstrated the value of more abstract or aggregate levels of representation in diagnostic problem solving, but a clear exposition and formal analysis of this approach remains in the future.

4. Doing probabilities "right"

Medical diagnosis is innately an uncertain business, due to our imperfect understanding of medicine, variations among individual patients, measurement and observational errors, etc. Numerical measures, such as probabilities, are a convenient and conventional way to summarize uncertainty and are an essential part of medical diagnosis programs. Programs in the mid-1970's used a variety of metrics and mechanisms for propagation, inspired by classical probability theory, fuzzy set theory, Dempster-Shafer belief functions, and other formalisms. Bayesian probabilities were often taken as the norm, but were thought to be impractical. Our paper argued, for example, that a full accounting for non-independence would require vast numbers of estimated joint probabilities for even a small diagnostic problem. The alternative, assuming global independence, was clearly wrong. CASNET/Glaucoma used a causal-associational network that could be interpreted to represent probabilistic dependence, but it was overloaded because it also attempted to represent temporal progression of disease. Weiss' formulas for propagating numerical estimates of certainty, therefore, were a combination of probabilistic, fuzzy set and heuristic approaches, and did not easily lend themselves to systematic analysis. INTERNIST used crudely quantized scores that seemed somewhat like log odds. MYCIN presented an idiosyncratic system of certainty factors that had to be revised in response to empirical inadequacies. We considered interpreting the causal and associational links in PIP as probabilistic influences, but our attempts to assign both prior probabilities to every node in the graph and conditional probabilities along links led to an over-constrained model that we failed to solve by relaxation methods. Duda and Hart had presented an uncertainty scheme based on propagation of likelihood ratios, but required fudge factors to make the estimated values non-contradictory.

The early 1980's finally saw insightful analyses of networks of probabilistic dependence by Cooper and Pearl, and Pearl's formulation has had a revolutionary impact on much of AI. (It is interesting to note, by the way, that equivalent techniques had been published by geneticists interested in the analysis of pedigrees with inbreeding in 1975, but were unknown to the general computer science community until very recently.) The critical insight here is that in even large networks of interrelated nodes (representing, say, diseases and symptoms), most nodes are not directly related to most others; indeed, it is only a very small subset of all the possible links that need be present to represent the important structure of the domain. When such a network is singly-connected, efficient methods can propagate the probabilistic consequences of hypotheses and observations. When there are multiple connections among portions of the network, exact evaluation is combinatorial, but may still be practical for networks with few multiple connections. Approximation methods based on sampling are relatively fast, but cannot guarantee the accuracy of their estimates. With these new powerful probabilistic propagation methods, it has become possible to build medical expert systems that obey a correct Bayesian probabilistic model. Cooper's program for diagnosing hypercalcemic disorders has been followed by MUNIN (for diagnosing muscle diseases), PATHFINDER (for pathology diagnosis), and BUNYAN (for critiquing clinically oriented decision trees). Several other programs now use these techniques, in domains ranging from physiologic monitoring to trying to recast the INTERNIST (now QMR) knowledge base in strictly probabilistic terms. Eddy has been developing a related basis for supporting and quantifying the uncertainty in inferences based on partial data from separate studies.
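In the smallest singly-connected case, a single disease node with conditionally independent findings, exact evaluation reduces to summing out the disease's states. The sketch below uses hypothetical probabilities, not estimates from any of the systems named above:

```python
# A minimal singly-connected belief network (one disease node, two finding
# nodes), evaluated exactly by enumeration. All probabilities are
# illustrative placeholders, not clinical estimates.
P_D = 0.01                          # prior P(disease present)
P_F_GIVEN_D = {                     # P(finding present | disease state)
    "fever": {True: 0.90, False: 0.05},
    "rash":  {True: 0.60, False: 0.02},
}

def posterior(evidence):
    """P(disease | observed findings), by summing out the disease node."""
    def joint(d):
        p = P_D if d else 1 - P_D
        for finding, present in evidence.items():
            p_f = P_F_GIVEN_D[finding][d]
            p *= p_f if present else 1 - p_f
        return p
    return joint(True) / (joint(True) + joint(False))

print(posterior({"fever": True, "rash": True}))   # ≈ 0.85
```

With multiple connections among portions of a larger network, this enumeration becomes combinatorial, which is precisely why the exact and sampling-based approximation methods discussed above matter.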

5. Reasoning about therapy

Although it is tempting to view medical management as a sequential process consisting first of diagnosis and then of therapy, the separation of those two phases, with an expert system focussing primarily on one or the other, is too simplistic. In 1978 our descriptions of first generation AIM programs presented an iterative diagnostic process that included context formation, information gathering, information interpretation, hypothesis generation and hypothesis revision. A second module, in programs that contained it, was invoked to select therapy, typically from a rather straightforward table or list, only when the diagnosis had been established. The probabilistic threshold or categorical rule by which a diagnosis is established must be explicitly specified so that the diagnostic program has clearly defined "stopping criteria." In some programs those thresholds and rules were based on information content alone (either how certain the leading diagnosis is or how unlikely the next leading contending hypothesis is), whereas in others those cutoff points included explicit consideration of the risks and benefits of errors of omission and commission. In programs whose focus was purely diagnostic, like PIP and INTERNIST, such criteria were either arbitrary or altogether absent.
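The second kind of stopping criterion mentioned above, one that weighs errors of omission against errors of commission, can be sketched as a treatment-threshold rule: treat once the probability of disease exceeds the point at which the expected benefit of treating the diseased outweighs the expected harm of treating the healthy. The utilities below are hypothetical placeholders:

```python
# A sketch of a treatment-threshold stopping criterion. Utility values
# are hypothetical; in practice they must be elicited for each decision.
def treatment_threshold(benefit, harm):
    """
    benefit: utility gained by treating a patient who HAS the disease
             (the avoided error of omission)
    harm:    utility lost by treating a patient who does NOT have it
             (the error of commission)
    Treating is preferred when P(disease) exceeds harm / (harm + benefit).
    """
    return harm / (harm + benefit)

def decide(p_disease, benefit=10.0, harm=2.0):
    return "treat" if p_disease > treatment_threshold(benefit, harm) else "withhold"

print(treatment_threshold(10.0, 2.0))   # ≈ 0.167
print(decide(0.3))                      # treat
print(decide(0.1))                      # withhold
```

When the benefit of treating true disease greatly exceeds the harm of unnecessary treatment, as in MYCIN's domain of serious infections, the threshold falls so low that treating every strongly suspected disease becomes the rational policy.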

In medical practice, the management process is far more complex, with diagnostic reasoning and therapeutic action belonging to a spectrum of options that are presented and manipulated iteratively. First, the patient's presenting or chief complaint allows the clinician to form a context for further management. Of course, context revision occurs as additional information is acquired and processed. As Elstein and Kassirer and Kopelman have noted, the physician gathers and then interprets information, typically in discrete chunks. It would seem only reasonable for the clinician to process all available information before electing either to gather more information, to perform an additional test that carries some risk or cost, to undertake a therapeutic maneuver (be it administering a drug or submitting the patient to surgery), or to announce a diagnosis. In actual practice, though, physicians will often take an action before all the available information has been processed. Because the "complete" interpretation of all available information can be costly in terms of time and processing effort, the physician sometimes feels compelled to deal with either seemingly urgent or important hypotheses even while additional information remains in the input list. Of course, some preprocessing and ranking of the list usually occurs to identify critical findings that require immediate attention.

Actions to be taken may be clearly therapeutic, clearly diagnostic, or more likely may represent a mixture of goals. When a therapy is given, the patient's response often provides important information that allows refinement or modification of diagnostic hypotheses. Silverman's Digitalis Therapy Advisor and its progeny (Long's Heart Failure Program and Russ's approach to diabetic ketoacidosis), as well as ONCOCIN, VM and T-HELPER utilize such responses in either simple rules or a complex network to improve diagnostic certainty. Even diagnostic maneuvers often affect patient outcomes. Tests can produce complications that either introduce entirely new diseases, make the existing disease more severe, or preclude certain therapeutic options.

Expert system-based decision support can address many phases of patient management, but as yet, no integrated system has addressed the entire problem, even in rather limited domains. In our 1978 paper, we applied the terms probabilistic and categorical reasoning exclusively to medical diagnosis. We now believe that such a classification scheme should be applied to therapeutic problems as well. Although therapy selection might be considered to be largely a categorical process (e.g., given a diagnosis and perhaps a set of complicating factors, optimal therapy can be defined by a simple table, algorithm or heuristic), it need not be. Quantitative and probabilistic information is available about the benefits and risks of alternative therapies (even at different stages of an evolving disease), and the risks, benefits, costs and efficiencies of different strategies must be weighed. In the 1990's, it is no longer sufficient to say that pneumocystis pneumonia in a patient with AIDS is treated in the hospital. One must ask whether therapy can be delivered on an out-patient basis and what patient characteristics should dictate the choice and setting of therapy. Analogously, thrombolytic therapy has become the standard therapy for patients with acute myocardial infarction, but we must consider whether or not it should be administered to patients with suspected infarction as well. Although such therapeutic tradeoffs can and perhaps should be approached quantitatively and explicitly, they are now prey to implicit clinical judgment because the explicit process is often too cumbersome and inadequately explicated for the practicing physician. Perhaps expert systems can help.

6. Machinable data

Perhaps the greatest limitation to providing medical decision support is the absence of clinical data that can be conveniently accessed by such programs and the inability to incorporate either computer or paper-based decision support into the physician's daily routine in an efficient manner. Although decision support has been included in hospital information systems such as Pryor and Warner's HELP and although new information systems have been designed for some AIM projects like ONCOCIN, the total absence—or certainly the lack of uniformity—of credible hospital and office practice information systems means that the clinician must manually enter her patients' clinical information, typically in a somewhat different format and with a different syntax and vocabulary for each program she might want to use. Although this problem will eventually be ameliorated by the standardization that MacDonald proposes and by unification of medical terminology (such as Lindberg's UMLS), it will remain problematic for most practice settings at least for the next decade.

The importance of having slick and efficient "front ends" to make any system acceptable is particularly important in dealing with medical professionals who are often pressed simultaneously by the double burdens of limited time and enhanced responsibility. We believe, however, that an efficient and perhaps invisible "back end" connection to both the patient's clinical data and to a variety of knowledge bases is even more important. In system development and maintenance, the medical informatician must have access to large clinical datasets, to create appropriate interfaces and to develop knowledge from the clinical data elements. In the past, such access has been either developed manually in a single institution or has used a registry (local or national) of relevant cases (e.g., ARAMIS, Duke Cardiovascular Database). Given the expense and administrative problems of collecting such special datasets for a broad spectrum of disease, we believe that expert system developers will need to rely increasingly on general hospital information systems, hoping to merge data from a variety of such sources. One additional information source is readily available, although as yet of only limited utility: the large administrative datasets, such as those maintained by HCFA and private insurers. Although the primary motivation for maintaining such datasets is financial and although the clinical information in them is often sparse, they have revealed important lessons about variations and patterns of care and complications. The routine inclusion of some clinical information in those datasets will surely increase. The AIM community should have at least two lines of interest in these datasets. First, they may provide useful machinable information for expert system development. Second, the adaptation of such administrative data to provide clinical insights about clinical practice is complex but will occur increasingly. 
Expertise in this process is, however, limited to a few groups and poorly distributed. The development of expert systems to manage, query and interpret responses from such datasets may be an important new domain for system development.

In providing expert systems-based decision support for clinicians, the AIM community must rely on developers of medical information systems to support AIM projects, both by supplying machinable data and by presenting decision support to the clinician in a timely and efficient manner. For decision support to be relevant, it must be provided at the time that choices must be made. The clinician must not need to interrupt her flow of activity either to enter information to the expert system or to review the suggestions of the system. When the clinician is making a choice among therapies and writing the relevant orders, she must have access to any advice the system provides, whether that advice is individualized to the specifics of the patient at hand or is generic, suggesting something that should apply to all patients with that problem.

Substantial effort is being devoted to developing guidelines and algorithms for clinical care. To the extent that such paper-based decision support is read by the clinician and perhaps even filed in a notebook over her desk, its implementation will be limited by the clinician's ability to remember the details of the often complex guideline. Much as a logistic regression equation or its daughter, a clinical decision rule, can be quite misleading if certain terms are omitted, even the most carefully crafted guideline can produce grossly inadequate care if it is not followed in its entirety. If critical loops and checks are omitted because they are not remembered, then the guided care may be erroneous. A related problem arises if the guideline is applied in an inappropriate context or to the wrong patient. Similarly, expert system developers must be cognizant of the environment in which their decision support tool will be placed so its database is as complete as necessary and so its advice can be followed as completely as possible. Developers must also be careful to include surface or common sense checks to detect situations in which the system is applied inappropriately in a setting for which it was not intended and must anticipate that the system's advice may not be followed completely. Given the limitations of the clinical environment, we believe that all such implementations must include a feedback and iteration phase so that errors, mistranslations and omissions can be detected and corrected in a timely fashion.

7. Problem breadth

Expert systems in medicine have always represented a mix of domains broad (e.g., the general diagnostic problem in medicine as evidenced by INTERNIST, and now DXplain and QMR) and narrow (e.g., CASNET/Glaucoma, MYCIN, ABEL, ONCOCIN). The attraction of the broad domain has always been to provide a critical mass of expertise in one system, such that it was applicable to many patients and physicians. Yet if one examines critically the successes of expert systems outside of medicine, the importance of a narrow focus becomes clear. The general diagnostic problem is just too hard and covers too much territory both for knowledge capture and knowledge-base maintenance and for the efficient application of reasoning algorithms. Although both QMR and DXplain have interesting behaviors which clinicians find fascinating and initially helpful, the problem of inappropriate context and false positives ("false drops" in the vernacular of medical library science) may become sufficiently annoying to limit continuing use. These programs serve to jog the clinician's memory and suggest unconsidered possibilities (often unusual ones) but do not tend to replace consultation by an experienced clinician. Narrower, more specialized systems may have a more compelling, if limited, impact. However, a proliferation of specialized systems, each relying on its idiosyncratic interface, database, forms of explanation, and medical knowledge base, cannot offer a satisfactory systematic solution to the broad needs of clinical medicine. Much remains to be done in creating common frameworks in which specialized systems can work effectively together to produce a usable and useful whole.

8. New themes in medicine

Medicine is changing. The past quarter century has been called the golden years of medicine. But knowledge and technology, exploding at breakneck pace, have run headlong into the wall of resource limitation. Fueled by the need to constrain the unchecked expansion of medical resource use, which even now consumes some 10% of our gross national product (contributing more to the cost of Detroit's now non-competitive behemoths than does their steel), the clinician is for the first time beset by the need to consider the resources that her care consumes, to explain and justify her care, to be accountable for both the cost of care and any adverse outcomes that may occur, and to explain variations in the kind and intensity of care she provides, especially if those variations increase cost or the number of procedures performed. Unfortunately, by both education and temperament most clinicians are ill-equipped to prosper or even to survive professionally on the new playing field.

When medicine first provided a fertile playground for computer scientists developing artificial intelligence approaches to problem solving, the primary concern was the distribution of expertise to medically underserved areas. Physicians were increasingly less able to capture knowledge and update their personal knowledge bases because the exponential expansion of medical information was producing an unmanageable tide. Thus, programs focussed on making a correct diagnosis and demonstrating expertise similar to that of experienced clinicians. These programs also addressed the variability of patients' presentations and the optimization of therapy based on such variations. The goal was not to minimize variation but rather to maximize flexibility.

In the third decade of AIM, the goals of a relevant expert system have changed and additional attributes must be considered. The use of health care resources must be made more efficient. For hospitalized patients, length of stay must be minimized, within the constraints of not compromising therapeutic efficacy or increasing the rate of complications. The process of care can no longer run without feedback or oversight. Health outcomes and resource use must be accurately quantified and monitored.

The new goals redefine both relevant expertise for medical practice and decision support. The expert system developer may no longer be able just to simulate the behavior of experienced clinicians. The environment and its rules are new and evolving rapidly. Some experienced clinicians will adapt effectively to these additional concerns while others may be unable to alter their styles sufficiently. Expert systems may be able to provide important services in this environment but will likely need to be driven more directly by the new multi-attribute goal structure.

9. Conclusions

In this brief note, we have surveyed the major technical advances in medical reasoning systems since our 1978 paper, described the shift from purely diagnostic programs to those whose concern is more with therapeutic management, reviewed changing needs and opportunities that are coming into play as medical data are becoming more routinely computerized, and outlined some dramatic changes in the practice of medicine itself that will have profound impacts on medical AI research in the coming years. We continue to be optimistic, although the dissemination and use of AIM systems has remained minuscule.

Physicians and other health-care personnel seem unlikely to reject computer technology, which has become ubiquitous in our society. It is already cheaper to maintain electronic rather than paper records, and pressures for accountability and cost containment will undoubtedly bring about the availability of machinable data that we had anticipated long ago.

Dramatic advances in genetics and molecular biology that will surely result from the Human Genome Project will both fuel clinicians' information overload and create further technological imperatives for diagnosis and therapy. So long as constraints of cost, laboratory capacity, and human cognition remain, however, difficult diagnostic and therapeutic choices in clinical care will be necessary. The methods of model-based reasoning being developed by AIM researchers may take a more prominent role in clinical informatics as more detailed models become available. Constraining the use of such models to appropriate contexts, getting multiple models to interact in the absence of an overarching model, and developing abstract models for reasoning when detailed data are unavailable will continue to pose formidable and exciting technical challenges.

Changes in medicine call into question some of AIM's original goals while providing new challenges. Twenty years ago, an anticipated severe shortage of well-trained physicians motivated, in large part, the development of AIM systems. Today we speak of a doctor glut; the proliferation of physicians appears to generate a proliferation of medical services. Early efforts focussed on improving the quality of care for all patients to that provided by the world's foremost experts. Today society places a higher value on uniform, accessible care at prices we can all afford. These factors suggest that AIM programs should be an integral part of a broader system to support all aspects of health care delivery, evaluation, and policy.

^This invited paper was written for the 25th anniversary edition of Artificial Intelligence as a commentary on our 1978 paper, which was one of the most frequently-cited (and hence, presumably, influential) papers to appear in the journal.

*Alas, rashly-made promises, even in print, are not always kept. As of May 1995, this paper has not been written.