This invited paper was written for the 25th anniversary edition of Artificial Intelligence as a commentary on our 1978 paper, Peter Szolovits and Stephen G. Pauker. Categorical and probabilistic reasoning in medical diagnosis. Artificial Intelligence, 11:115-144, 1978, which was one of the most frequently-cited (and hence, presumably, influential) papers to appear in the journal. The present paper was published as P. Szolovits and S. G. Pauker. Categorical and Probabilistic Reasoning in Medicine Revisited. Artificial Intelligence, 59:167-180, 1993.
In this note, we explore several research themes concerning medical expert systems that have emerged during the past fifteen years, review the roles of diagnosis and therapy in medicine, and make some observations on the future of medical AI tools in the changing context of clinical care. This paper is discursive; for a more complete treatment of these issues, please refer to a paper now in preparation, "The Third Decade of Artificial Intelligence in Medicine, Challenges and Prospects in a New Medical Era."
Although Pople suggested interesting heuristics for dealing with multiple disorders and reformulated diagnosis as a search problem, Reggia, and Reiter and de Kleer best formalized and solved at least a simplified version of the multiple-diagnosis problem. These efforts consider the simplified case where the knowledge base consists of a bipartite graph of diseases (each unrelated to the other) and findings (also unrelated to each other), interconnected so that each disease is linked to each finding that it may cause. Reggia's insight was that for such a knowledge base, an adequate diagnosis consisted of a minimal set of diseases whose associated findings covered the set of observed findings. Normally, there is no unique minimal set, and additional criteria (e.g., likelihood) could be used to select among alternative minimal candidate sets. This process is, in general, combinatorial, and careful guidance is needed to achieve practical performance. Wu has recently shown that a heuristic but robust symptom clustering method dramatically improves search performance by introducing the equivalent of "planning islands" in the overall diagnostic process: it first partitions the symptoms into sets in which all symptoms must have a common explanation; then it performs differential diagnosis among the possible causes of each set. The overall diagnosis will identify a single disorder as the cause of each set of symptoms. An efficient factored representation cuts down combinatorial growth of the search space. The simple bipartite model is not entirely satisfactory, however. It does not deal well with situations in which the absence of a particular observation is key, nor with keeping clear the distinction between a finding that is unknown because its presence has yet to be investigated and one that is known to be absent.
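The covering idea described above can be illustrated with a minimal sketch. The disease and finding names below are hypothetical, the brute-force enumeration is only an illustration of why the problem is combinatorial (it is not Reggia's or Wu's algorithm), and "minimal" is taken here to mean smallest cardinality, which is one of several possible readings.

```python
from itertools import combinations

# Hypothetical toy knowledge base (not from the paper): a bipartite graph
# mapping each disease to the findings it may cause.
CAUSES = {
    "d1": {"f1", "f2"},
    "d2": {"f2", "f3"},
    "d3": {"f3", "f4"},
    "d4": {"f1", "f4"},
}


def covers(diseases, observed, causes=CAUSES):
    """True if the findings caused by `diseases` include every observed finding."""
    explained = set()
    for disease in diseases:
        explained |= causes[disease]
    return set(observed) <= explained


def minimal_covers(observed, causes=CAUSES):
    """Return every smallest-cardinality set of diseases covering `observed`.

    Candidate sets are enumerated by increasing size, so the cost grows
    combinatorially with the number of diseases.
    """
    observed = set(observed)
    names = sorted(causes)
    for size in range(len(names) + 1):
        found = [set(c) for c in combinations(names, size)
                 if covers(c, observed, causes)]
        if found:
            return found
    return []  # some observed finding has no possible cause in the knowledge base


if __name__ == "__main__":
    # Two equally small explanations exist, so a further criterion
    # (e.g., likelihood) would be needed to choose between them.
    print(minimal_covers({"f1", "f2", "f3", "f4"}))
```

With this toy knowledge base, the four observed findings admit two different two-disease explanations, which illustrates the point that the minimal set is normally not unique.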
This problem is far more ubiquitous in contemporary medicine than one might at first imagine, because almost any identifiable disease normally elicits medical treatment. One goal of treatment is to eliminate the underlying disturbance, but another common one is to compensate for abnormalities caused by the disease, to keep the patient in some stable state, even if not perfectly healthy. But each instance of this leads precisely (and deliberately) to just the sorts of interactions that make associational diagnosis difficult. Furthermore, even when a diagnosis is not firmly established, empiric treatments or therapeutic trials often produce partially treated (and often partially obscured) diseases. Although it is tempting therefore to adopt a diagnostic strategy based on reasoning from first principles about the possible consequences of any combinations of diseases and their treatments, this turns out to be impractical for several reasons. First, much of medicine remains a sufficient mystery that even moderately accurate models are beyond our knowledge. Second, even in domains where such models have been developed (e.g., circulatory dynamics), the models require a level of detail and consistency of global analysis that is clinically not achievable.
Ideally, a program should reason simply about simple cases and resort to a more complex generative theory only when actual complexity of the case requires it. Human experts appear to have this skill: to recognize simple cases as simple and, after a cursory ruling out of potentially serious alternatives, to treat them as adequately solved. With this skill, intensive investigation is reserved for cases (or aspects of cases) that actually involve complicated interactions and require deeper thought. How is a program to recognize, however, that a case thought to be simple really is not? And how is it to do so without concluding that every case is complex?
Patil's ABEL program for diagnosing acid-base and electrolyte disorders employed a five-level pathophysiologic model in which the top level contained associational links among clinically-significant states and lower levels introduced successively more physiological detail, until the lowest layer represented our most detailed biochemical understanding of the mechanisms of acid-base physiology (at least as clinicians think of it). ABEL would begin by trying to construct a consistent explanation at the highest level, and then would use discrepancies between that model's predictions and subsequently gathered data as the clue that more detailed investigation was needed. Although this insight is valuable, the technique suffered some important deficiencies: As soon as any unexpected finding arose, its control strategy would tend to "dive" down to the deepest model layer because that is the one at which quantitative measurements could best resolve questions of interaction among multiple disturbances. In its narrow domain, this was tolerable, but more generally it leads to a very detailed and costly analysis of any case that departs from the textbook norm. Furthermore, a discrepancy detected in a hypothesis at some level could actually mean either that a more detailed analysis will lead to an incrementally-augmented successful hypothesis, or that this hypothesis should be abandoned in favor of one of its competitors.
A few other successful programs in non-medical domains (e.g., Hamscher's program for diagnosing faulty circuit boards by considering their aggregate temporal behavior) have also demonstrated the value of more abstract or aggregate levels of representation in diagnostic problem solving, but a clear exposition and formal analysis of this approach remains in the future.
The early 1980's finally saw insightful analyses of networks of probabilistic dependence by Cooper and Pearl, and Pearl's formulation has had a revolutionary impact on much of AI. (It is interesting to note, by the way, that equivalent techniques had been published by geneticists interested in the analysis of pedigrees with inbreeding in 1975, but were unknown to the general computer science community until very recently.) The critical insight here is that even in large networks of interrelated nodes (representing, say, diseases and symptoms), most nodes are not directly related to most others; indeed, it is only a very small subset of all the possible links that need be present to represent the important structure of the domain. When such a network is singly-connected, efficient methods can propagate the probabilistic consequences of hypotheses and observations. When there are multiple connections among portions of the network, exact evaluation is combinatorial, but may still be practical for networks with few multiple connections. Approximation methods based on sampling are relatively fast, but cannot guarantee the accuracy of their estimates. With these new powerful probabilistic propagation methods, it has become possible to build medical expert systems that obey a correct Bayesian probabilistic model. Cooper's program for diagnosing hypercalcemic disorders has been followed by MUNIN (for diagnosing muscle diseases), PATHFINDER (for pathology diagnosis), and BUNYAN (for critiquing clinically oriented decision trees). Several other programs now use these techniques, in domains ranging from physiologic monitoring to trying to recast the INTERNIST (now QMR) knowledge base in strictly probabilistic terms. Eddy has been developing a related basis for supporting and quantifying the uncertainty in inferences based on partial data from separate studies.
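As a small illustration of probabilistic updating in the simplest singly-connected case, the sketch below applies Bayes' rule to a single disease node whose findings are assumed conditionally independent given the disease. This is not Pearl's general propagation algorithm, and the function name and all probabilities are hypothetical values chosen only for illustration.

```python
def posterior(prior, likelihoods, evidence):
    """Return P(disease | evidence) for one disease with independent findings.

    prior:       P(disease) before any finding is observed.
    likelihoods: finding -> (P(present | disease), P(present | no disease)).
    evidence:    finding -> True (present) or False (known absent);
                 findings not yet investigated are simply left out.
    """
    p_disease, p_healthy = prior, 1.0 - prior
    for finding, present in evidence.items():
        p_given_d, p_given_h = likelihoods[finding]
        p_disease *= p_given_d if present else 1.0 - p_given_d
        p_healthy *= p_given_h if present else 1.0 - p_given_h
    # Normalize the two joint probabilities to obtain the posterior.
    return p_disease / (p_disease + p_healthy)


# Hypothetical numbers, not taken from any clinical source.
LIKELIHOODS = {"f1": (0.9, 0.1), "f2": (0.7, 0.2)}

if __name__ == "__main__":
    print(posterior(0.01, LIKELIHOODS, {}))                          # no evidence yet
    print(posterior(0.01, LIKELIHOODS, {"f1": True}))                # one finding present
    print(posterior(0.01, LIKELIHOODS, {"f1": True, "f2": False}))   # second finding absent
```

Leaving a finding out of `evidence` is different from recording it as absent: an uninvestigated finding does not change the posterior, whereas a known-absent finding lowers it.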
In medical practice, the management process is far more complex, with diagnostic reasoning and therapeutic action belonging to a spectrum of options that are presented and manipulated iteratively. First, the patient's presenting or chief complaint allows the clinician to form a context for further management. Of course, context revision occurs as additional information is acquired and processed. As Elstein and Kassirer and Kopelman have noted, the physician gathers and then interprets information, typically in discrete chunks. It would seem only reasonable for the clinician to process all available information before electing either to gather more information, to perform an additional test that carries some risk or cost, to undertake a therapeutic maneuver (be it administering a drug or submitting the patient to surgery), or to announce a diagnosis. In actual practice, though, physicians will often take an action before all the available information has been processed. Because the "complete" interpretation of all available information can be costly in terms of time and processing effort, the physician sometimes feels compelled to deal with either seemingly urgent or important hypotheses even while additional information remains in the input list. Of course, some preprocessing and ranking of the list usually occurs to identify critical findings that require immediate attention.
Actions to be taken may be clearly therapeutic, clearly diagnostic, or more likely may represent a mixture of goals. When a therapy is given, the patient's response often provides important information that allows refinement or modification of diagnostic hypotheses. Silverman's Digitalis Therapy Advisor and its progeny (Long's Heart Failure Program and Russ's approach to diabetic ketoacidosis), as well as ONCOCIN, VM and T-HELPER utilize such responses in either simple rules or a complex network to improve diagnostic certainty. Even diagnostic maneuvers often affect patient outcomes. Tests can produce complications that introduce entirely new diseases, that make the existing disease more severe, or that preclude certain therapeutic options.
Expert system-based decision support can address many phases of patient management, but as yet, no integrated system has addressed the entire problem, even in rather limited domains. In our 1978 paper, we applied the terms probabilistic and categorical reasoning exclusively to medical diagnosis. We now believe that such a classification scheme should be applied to therapeutic problems as well. Although therapy selection might be considered to be largely a categorical process (e.g., given a diagnosis and perhaps a set of complicating factors, optimal therapy can be defined by a simple table, algorithm or heuristic), it need not be. Quantitative and probabilistic information is available about the benefits and risks of alternative therapies (even at different stages of an evolving disease), and the risks, benefits, costs and efficiencies of different strategies must be weighed. In the 1990's, it is no longer sufficient to say that pneumocystis pneumonia in a patient with AIDS is treated in the hospital. One must ask whether therapy can be delivered on an out-patient basis and what patient characteristics should dictate the choice and setting of therapy. Analogously, thrombolytic therapy has become the standard therapy for patients with acute myocardial infarction, but we must consider whether or not it should be administered to patients with suspected infarction as well. Although such therapeutic tradeoffs can and perhaps should be approached quantitatively and explicitly, they are now prey to implicit clinical judgment because the explicit process is often too cumbersome and inadequately explicated for the practicing physician. Perhaps expert systems can help.
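One way to make such a tradeoff explicit is a simple expected-utility comparison between treating and withholding treatment when the diagnosis is uncertain. The sketch below is illustrative only: the function names are our own, and the utility values are hypothetical numbers on a 0 to 1 scale, not clinical estimates.

```python
def expected_utility(p_disease, u_if_disease, u_if_no_disease):
    """Expected utility of an action given the probability of disease."""
    return p_disease * u_if_disease + (1.0 - p_disease) * u_if_no_disease


def treatment_threshold(u_treat_d, u_treat_nd, u_wait_d, u_wait_nd):
    """Probability of disease at which treating and withholding are equal.

    benefit: utility gained by treating a patient who has the disease.
    harm:    utility lost by treating a patient who does not have it.
    Assumes benefit and harm are both positive.
    """
    benefit = u_treat_d - u_wait_d
    harm = u_wait_nd - u_treat_nd
    return harm / (harm + benefit)


# Hypothetical utilities: (treat, disease), (treat, no disease),
# (withhold, disease), (withhold, no disease).
U_TREAT_D, U_TREAT_ND, U_WAIT_D, U_WAIT_ND = 0.90, 0.97, 0.80, 1.00

if __name__ == "__main__":
    threshold = treatment_threshold(U_TREAT_D, U_TREAT_ND, U_WAIT_D, U_WAIT_ND)
    print(f"treat when P(disease) exceeds {threshold:.3f}")
    for p in (0.1, 0.5):
        treat = expected_utility(p, U_TREAT_D, U_TREAT_ND)
        wait = expected_utility(p, U_WAIT_D, U_WAIT_ND)
        print(p, "treat" if treat > wait else "withhold")
```

Under these assumed numbers, treatment is preferred only when the probability of disease exceeds the ratio of harm to the sum of harm and benefit; different utilities would move that threshold.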
Slick and efficient "front ends" that make a system acceptable are particularly important in dealing with medical professionals, who are often pressed simultaneously by the double burdens of limited time and enhanced responsibility. We believe, however, that an efficient and perhaps invisible "back end" connection to both the patient's clinical data and to a variety of knowledge bases is even more important. In system development and maintenance, the medical informatician must have access to large clinical datasets, to create appropriate interfaces and to develop knowledge from the clinical data elements. In the past, such access has been either developed manually in a single institution or has used a registry (local or national) of relevant cases (e.g., ARAMIS, Duke Cardiovascular Database). Given the expense and administrative problems of collecting such special datasets for a broad spectrum of disease, we believe that expert system developers will need to rely increasingly on general hospital information systems, hoping to merge data from a variety of such sources. One additional information source is readily available, although as yet of only limited utility: the large administrative datasets, such as those maintained by HCFA and private insurers. Although the primary motivation for maintaining such datasets is financial and although the clinical information in them is often sparse, they have revealed important lessons about variations and patterns of care and complications. The routine inclusion of some clinical information in those datasets will surely increase. The AIM community should have at least two lines of interest in these datasets. First, they may provide useful machinable information for expert system development. Second, the adaptation of such administrative data to provide clinical insights about clinical practice is complex but will occur increasingly. Expertise in this process is, however, limited to a few groups and poorly distributed. The development of expert systems to manage, query and interpret responses from such datasets may be an important new domain for system development.
In providing expert systems-based decision support for clinicians, the AIM community must rely on developers of medical information systems to support AIM projects, both by supplying machinable data and by presenting decision support to the clinician in a timely and efficient manner. For decision support to be relevant, it must be provided at the time that choices must be made. The clinician must not need to interrupt her flow of activity either to enter information to the expert system or to review the suggestions of the system. When the clinician is making a choice among therapies and writing the relevant orders, she must have access to any advice the system provides, whether that advice is individualized to the specifics of the patient at hand or is generic, suggesting something that should apply to all patients with that problem.
Substantial effort is being devoted to developing guidelines and algorithms for clinical care. To the extent that such paper-based decision support is read by the clinician and perhaps even filed in a notebook over her desk, its implementation will be limited by the clinician's ability to remember the details of the often complex guideline. Much as a logistic regression equation or its daughter, a clinical decision rule, can be quite misleading if certain terms are omitted, even the most carefully crafted guideline can produce grossly inadequate care if it is not followed in its entirety. If critical loops and checks are omitted because they are not remembered, then the guided care may be erroneous. A related problem arises if the guideline is applied in an inappropriate context or to the wrong patient. Similarly, expert system developers must be cognizant of the environment in which their decision support tool will be placed so its database is as complete as necessary and so its advice can be followed as completely as possible. Developers must also be careful to include surface or common sense checks to detect situations in which the system is applied inappropriately in a setting for which it was not intended and must anticipate that the system's advice may not be followed completely. Given the limitations of the clinical environment, we believe that all such implementations must include a feedback and iteration phase so that errors, mistranslations and omissions can be detected and corrected in a timely fashion.
When medicine first provided a fertile playground for computer scientists developing artificial intelligence approaches to problem solving, the primary concern was the distribution of expertise to medically underserved areas. Physicians were increasingly less able to capture knowledge and update their personal knowledge bases because the exponential expansion of medical information was producing an unmanageable tide. Thus, programs focussed on making a correct diagnosis and demonstrating expertise similar to that of experienced clinicians. These programs also addressed the variability of patients' presentations and the optimization of therapy based on such variations. The goal was not to minimize variation but rather to maximize flexibility.
In the third decade of AIM, the goals of a relevant expert system have changed and additional attributes must be considered. The use of health care resources must be made more efficient. For hospitalized patients, length of stay must be minimized, within the constraints of not compromising therapeutic efficacy or increasing the rate of complications. The process of care can no longer run without feedback or oversight. Health outcomes and resource use must be accurately quantified and monitored.
The new goals redefine both relevant expertise for medical practice and decision support. The expert system developer may no longer be able just to simulate the behavior of experienced clinicians. The environment and its rules are new and evolving rapidly. Some experienced clinicians will adapt effectively to these additional concerns while others may be unable to alter their styles sufficiently. Expert systems may be able to provide important services in this environment but will likely need to be driven more directly by the new multi-attribute goal structure.
Physicians and other health-care personnel seem unlikely to reject computer technology, which has become ubiquitous in our society. It is already cheaper to maintain electronic rather than paper records, and pressures for accountability and cost containment will undoubtedly bring about the availability of machinable data that we had anticipated long ago.
Dramatic advances in genetics and molecular biology that will surely result from the Human Genome Project will both fuel clinicians' information overload and create further technological imperatives for diagnosis and therapy. So long as constraints of cost, laboratory capacity, and human cognition remain, however, difficult diagnostic and therapeutic choices in clinical care will be necessary. The methods of model-based reasoning being developed by AIM researchers may take a more prominent role in clinical informatics as more detailed models become available. Constraining the use of such models to appropriate contexts, getting multiple models to interact in the absence of an overarching model, and developing abstract models for reasoning when detailed data are unavailable will continue to pose formidable and exciting technical challenges.
Changes in medicine call into question some of AIM's original goals while providing new challenges. Twenty years ago, an anticipated severe shortage of well-trained physicians motivated, in large part, the development of AIM systems. Today we speak of a doctor glut; the proliferation of physicians appears to generate a proliferation of medical services. Early efforts focussed on improving the quality of care for all patients to that provided by the world's foremost experts. Today society values more highly uniform, accessible care at prices we can all afford. These factors suggest that AIM programs should be an integral part of a broader system to support all aspects of health care delivery, evaluation, and policy.