Artificial Intelligence and Medicine(1)

Peter Szolovits

Szolovits, P.  "Artificial Intelligence and Medicine."  Chapter 1 in Szolovits, P. (Ed.) Artificial Intelligence in Medicine. Westview Press, Boulder, Colorado.  1982.

Man strives to augment his abilities by building tools. From the invention of the club to lengthen his reach and strengthen his blow to the refinement of the electron microscope to sharpen his vision, tools have extended his ability to sense and to manipulate the world about him. Today we stand on the threshold of new technical developments which will augment man's reasoning: the computer and the programming methods being devised for it are the new tools to effect this change.

Medicine is a field in which such help is critically needed. Our increasing expectations of the highest quality health care and the rapid growth of ever more detailed medical knowledge leave the physician without adequate time to devote to each case and struggling to keep up with the newest developments in his field. For lack of time, most medical decisions must be based on rapid judgments of the case relying on the physician's unaided memory. Only in rare situations can a literature search or other extended investigation be undertaken to assure the doctor (and the patient) that the latest knowledge is brought to bear on any particular case. Continued training and recertification procedures encourage the physician to keep more of the relevant information constantly in mind, but fundamental limitations of human memory and recall coupled with the growth of knowledge assure that most of what is known cannot be known by most individuals. This is the opportunity for new computer tools: to help organize, store, and retrieve appropriate medical knowledge needed by the practitioner in dealing with each difficult case, and to suggest appropriate diagnostic, prognostic and therapeutic decisions and decision making techniques.

In a 1970 review article, Schwartz speaks of

the possibility that the computer as an intellectual tool can reshape the present system of health care, fundamentally alter the role of the physician, and profoundly change the nature of medical manpower recruitment and medical education--in short, the possibility that the health-care system by the year 2000 will be basically different from what it is today. [18]

The key technical developments leading to this reshaping will

almost certainly involve exploitation of the computer as an 'intellectual,' 'deductive' instrument--a consultant that is built into the very structure of the medical-care system and that augments or replaces many traditional activities of the physician. Indeed, it seems probable that in the not too distant future the physician and the computer will engage in frequent dialogue, the computer continuously taking note of history, physical findings, laboratory data, and the like, alerting the physician to the most probable diagnoses and suggesting the appropriate, safest course of action. [18]

This vision is only slowly coming to reality. The techniques needed to implement computer programs to achieve these goals are still elusive, and many other factors influence the acceptability of the programs.

This book is an introduction to the field of Artificial Intelligence in Medicine (abbreviated AIM), which is now taking up the challenge of creating and distributing the tools mentioned above. This introductory chapter defines the problems addressed by the field, gives a short overview of other technical approaches to these problems, introduces some of the fundamental ideas of artificial intelligence, briefly describes the current state of the art of AIM, discusses its technical accomplishments and current problems, and looks at likely future developments. The other four chapters each describe one of the current AIM projects in some detail, pointing out not only the accomplishments of the programs built so far but also what we have learned in the process of creating them.

Definitions

What is "Artificial Intelligence in Medicine?" One introductory textbook defines artificial intelligence (called AI) this way:

Artificial Intelligence is the study of ideas which enable computers to do the things that make people seem intelligent ... The central goals of Artificial Intelligence are to make computers more useful and to understand the principles which make intelligence possible. [30]

This is a rather straightforward definition, but it embodies certain assumptions about the idea of intelligence and the relationship between human reasoning and computation which are, in some circles, quite controversial. The coupling of the study of how to make computers useful with the study of the principles which underlie human intelligence clearly implies that the researcher expects the two to be related. Indeed, in the newly-developing field of cognitive science, computer models of thought are explicitly used to describe human capabilities.

Historically, researchers in AI have had to defend this linkage against humanist attacks on the reduction of the human intellect to computational steps. The debate has sometimes been heated, as exemplified by the following quote from the introduction to an early collection of AI papers:

Is it Possible for Computing Machines to Think?

No--if one defines thinking as an activity peculiarly and exclusively human. Any such behavior in machines, therefore, would have to be called thinking-like behavior.

No--if one postulates that there is something in the essence of thinking which is inscrutable, mysterious, mystical.

Yes--if one admits that the question is to be answered by experiment and observation, comparing the behavior of the computer with that behavior of human beings to which the term "thinking" is generally applied.

We regard the two negative views as unscientifically dogmatic. [5, p.2]

An enlightening review of the history of AI and the bouts between its proponents and adversaries may be found in the recently published Machines Who Think [13].

AI in Medicine (AIM) is AI specialized to medical applications. Researchers in AIM need not engage in the controversy introduced above. Although we employ human-like reasoning methods in the programs we write, we may justify that choice either as a commitment to a human/computer equivalence sought by some or as a good engineering technique for capturing the best-understood source of existing expertise on medicine--the practice of human experts. Most researchers adopt the latter view.

The choice to model the behavior of a computer expert in medicine on the expertise of human consultants is by no means logically necessary. If we could understand the functioning in health and in disease of the human body in sufficient depth to model the detailed disease processes which disturb health, then, at least in principle, we could perform diagnosis by fitting our model to the actually observable characteristics of the patient at hand. Further, we could try out possible therapies on the model to select the optimum one to use on the patient. Unfortunately, although biomedical research strives for such a depth of understanding, it has not been achieved in virtually any area of medical practice. The AIM methodology does not dogmatically reject the use of non-human modes of expertise in the computer. Indeed, accurate computations of probabilities and solutions of simple differential equations--tasks at which human experts are rather poor without special training--play a role in some of our programs. Nevertheless, most of what we know about the practice of medicine we know from interrogating the best human practitioners; therefore, the techniques we tend to build into our programs mimic those used by our clinician informants.

Relying on the knowledge of human experts to build expert computer programs is actually helpful for several additional reasons: First, the decisions and recommendations of a program can be explained to its users and evaluators in terms which are familiar to the experts. Second, because we hope to duplicate the expertise of human specialists, we can measure the extent to which our goal is achieved by a direct comparison of the program's behavior to that of the experts. Finally, within the collaborative group of computer scientists and physicians engaged in AIM research, basing the logic of the programs on human models supports each of the three somewhat disparate goals that the researchers may hold:

History

AIM is certainly not the first use of computers in medicine. Many of the administrative and financial record keeping needs of the hospital, health center, and even small group medical practice have been turned over to computer systems. Such use of computers differs little from similar applications in a wide range of businesses, and few technical developments have been motivated specifically by medical use of what could be called "business computing." Obviously, such use will continue to benefit from the increasing performance of general business-oriented systems; just as computer suppliers now aim for the small retail store as a possible market, they also envision the computerization of even individual doctors' offices, providing billing, scheduling, forms preparation, word processing, and other services.

It appears unlikely, however, that such business uses of computing in medical applications will fulfill the promise to "reshape" medicine. In a recent book on management decision support systems, McCosh and Scott Morton, writing about management information systems (MIS), note that

despite the tremendous growth in computer-related activities, [MIS] has had little significant impact on management. The kinds of decisions and the ways in which they are made have been very little affected by computers over the last fifteen years. We believe that this can be traced in large part to the lack of proper perspective on the problems involved in augmenting the decision-making ability of management. [12, p.3]

Similarly, much of the business computing in medicine impacts only on the periphery of the physician's task.

A second, currently much smaller use of computers in medicine is their application to the substance rather than the form of health care. If the computer is a useful manager of billing records, it should also maintain medical records, laboratory data, data from clinical trials, etc. And if the computer is useful to store data, it should also help to analyze, organize, and retrieve it. Three main approaches to this second type of medical computing have so far been used: the clinical algorithm or flowchart, the matching of cases to large data bases of previous cases, and applications of decision theory. Each of these has had notable successes, but also a more limited applicability than its developers had hoped. All contribute to the development of the AI approaches described here. A good recent review of the state of the art of computer tools for medical decision making can be found in [19] and an accompanying argument for the AI orientation in [25].

Flowcharts

A flowchart is conceptually the simplest decision making tool. It encodes, in principle, the sequences of actions a good clinician would perform for any one of some population of patients. We may imagine, for example, recording all sequences of questions asked, answers given, procedures performed, laboratory analyses obtained and eventual diagnoses, treatments and outcomes for a number of patients who present at the emergency room with severe chest pain. If we observe enough patients and allow expert cardiologists to suggest an appropriate retrospective analysis of each case based on their excellent knowledge of the field, we may be able to identify a suitable sequence of actions to take under all possible circumstances. This approach has been successfully applied to the encoding of triage protocols for use by nurses [15], and has also formed the basis for several programs for patient interviewing [20]. A very large flowchart program has also been built for giving therapeutic advice in the acid/base area [1].
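As a minimal sketch of the idea, a clinical flowchart can be written as nested decision nodes that are walked with one patient's answers. The questions, branch structure and recommended actions below are invented purely for illustration; they are not taken from any of the cited protocols and are not clinical guidance.

```python
# Hypothetical triage flowchart. Every question, branch and action is an
# illustrative assumption, not a real protocol.
FLOWCHART = {
    "question": "severe_chest_pain",
    "yes": {
        "question": "pain_radiates_to_arm",
        "yes": "refer to cardiologist immediately",
        "no": {
            "question": "recent_heavy_lifting",
            "yes": "suspect muscle strain; advise rest",
            "no": "order electrocardiogram",
        },
    },
    "no": "routine examination",
}


def follow(node, answers):
    """Walk the flowchart until a leaf (an action string) is reached.

    Returns the action and the list of (question, branch) steps taken.
    """
    path = []
    while isinstance(node, dict):
        question = node["question"]
        branch = "yes" if answers[question] else "no"
        path.append((question, branch))
        node = node[branch]
    return node, path


action, path = follow(
    FLOWCHART,
    {
        "severe_chest_pain": True,
        "pain_radiates_to_arm": False,
        "recent_heavy_lifting": True,
    },
)
print(action)
```

Each node in this sketch holds only its own question and its two successors, so the structure says nothing about why a question is asked at that point.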

The principal deficiency of the flowchart as a general technique for encoding medical decision making knowledge is its lack of compactness and perspicuity. When used in a very large problem domain, the flowchart is likely to become huge, because the number of possible sequences of situations to be considered is enormous.(2) Furthermore, the flowchart does not include information about its own logical organization: each decision point appears to be independent of the others, no record exists of all logical places where each piece of information is used, and no discipline exists for systematic revision or updating of the program. Therefore, inconsistencies may easily arise due to incomplete updating of knowledge in only some of the appropriate places, the totality of knowledge of the flowchart is difficult to characterize, and the lack of any explicit underlying model makes justification of the program very difficult.

Data Bases

Large data bases of clinical histories of patients sharing a common presentation or disease are now being collected in several fields. The growth of data capture and storage facilities and their co-occurring decline in cost make attractive the accumulation of enormous numbers of cases, both for research and clinical uses. Today we are engaged in numerous long-term studies of the health effects of various substances, the eventual outcomes of competing methods of treatment, and the clinical development of diseases. Large databases on significant populations, concentrating on cardiovascular disease, arthritis, cancer and other major medical problems, are now being collected and used to clarify the true incidence of diseases, to identify demographic factors and to measure therapeutic efficacy of drugs and procedures [10, 17, 29].

For clinical purposes, the typical use of large data bases is to select a set of previously known cases which are most similar to the case at hand by some statistical measures of similarity. Then, diagnostic, therapeutic and prognostic conclusions may be drawn by assuming that the current case is drawn from the same sample as members of that set and extrapolating the known outcomes of the past cases to the current one.
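The retrieval step described above might be sketched as follows, assuming a plain Euclidean distance as the similarity measure and a majority vote over the nearest past cases. The feature layout (age, systolic pressure, a flag for radiating pain), the case records and the function names are all hypothetical; a real system would at least rescale the features and use far more data.

```python
import math
from collections import Counter

# Hypothetical past cases: ((age, systolic_bp, radiating_pain), outcome).
PAST_CASES = [
    ((62, 150, 1), "infarction"),
    ((58, 160, 1), "infarction"),
    ((35, 120, 0), "muscle strain"),
    ((41, 118, 0), "muscle strain"),
    ((38, 125, 0), "muscle strain"),
]


def distance(a, b):
    """Unscaled Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_similar(case, k=3):
    """Return the k past cases closest to the current one."""
    return sorted(PAST_CASES, key=lambda rec: distance(case, rec[0]))[:k]


def extrapolate(case, k=3):
    """Predict the outcome that is most frequent among the k nearest cases."""
    outcomes = Counter(outcome for _, outcome in most_similar(case, k))
    return outcomes.most_common(1)[0][0]


print(extrapolate((38, 122, 0)))
print(extrapolate((60, 155, 1)))
```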

The use of collected past records either for research or clinical practice is clearly a data-intensive activity. To sift through the voluminous information at hand, to identify the important generalizations to be found among the thousands of detailed records and to select previous cases likely to shed light on the one under current consideration, numerous statistical techniques have been developed and applied. The literature of medical statistics is large, and will not be reviewed here; a good survey may be found in [26] and accompanying articles.

Although vast collections of data and processing techniques for them are an important advance, the application of this methodology to all of medicine appears unlikely for several reasons. Firstly, the collection and maintenance of the data in a consistent and accessible form is very costly and extremely time consuming. Old data are difficult to reconcile with the new, because continual refinements introduced as medical knowledge deepens introduce distinctions which were absent in previously-collected cases. Rare disorders may be infrequent enough that an insufficient number are seen within the "catchment basin" of any data collection scheme to provide adequate data. Historical and regional differences in nomenclature and interpretation can make the reconciliation of separately-collected data virtually impossible. Thus, it appears likely that only the more common and severe disorders generate enough interest, resources, and clinical cases to make the collection of data practical. Secondly, and equally importantly, the existing expertise of physicians is a highly valuable body of knowledge which cannot be recovered from just the processing of many cases by statistical techniques. A method of diagnosis, prognosis or therapy which relies on the projection of past data without detailed explanations of the causality of the illness under consideration seems unlikely to attract the confidence of physician or patient. People feel the need to explain phenomena in terms of mechanisms they understand, and tend to reject predictions which cannot be understood in such terms. Therefore, clinical judgment based on comparisons with collected data will fill an important but limited role. Other methods of computer use in medicine, relying on the encoding of knowledge held by the expert physician, will be at least as important.

Decision Theory

Decision theory is a mathematical theory of decision making under uncertainty. It assumes that one can quantify the a priori and conditional likelihoods of existing states and their manifestations and can similarly determine an evaluation (utility) of all contemplated outcomes. Given these data, decision theory offers a normative, rational theory of optimal decision making which is urged by its practitioners as an effective technique for structuring medical decision making problems [16]. Although there is considerable evidence that most human decision makers not specifically trained in decision analysis deviate from this model in their decision making activities [27], the theory is nevertheless appealing as a norm for helping to make explicit the bases of decision making and any existing disagreements among decision makers. Numerous computer programs for decision making in small domains of medicine have employed the decision theoretic formalism [6, 8].
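A small sketch may make the formalism concrete: beliefs over states are revised by Bayes' rule, and the action with the highest expected utility is chosen. The two states, two actions, and every probability and utility value below are made-up assumptions chosen only to show the mechanics.

```python
# All numbers are hypothetical and for illustration only.
PRIOR = {"disease": 0.1, "healthy": 0.9}  # a priori likelihoods of states

# UTILITY[action][state]: evaluation of each contemplated outcome.
UTILITY = {
    "treat": {"disease": 0.9, "healthy": 0.8},
    "wait": {"disease": 0.2, "healthy": 1.0},
}

# P(positive test | state): conditional likelihoods of a manifestation.
POSITIVE_TEST = {"disease": 0.9, "healthy": 0.1}


def posterior(prior, likelihood):
    """Bayes' rule: revise beliefs after observing one finding."""
    joint = {state: prior[state] * likelihood[state] for state in prior}
    total = sum(joint.values())
    return {state: p / total for state, p in joint.items()}


def expected_utility(action, belief):
    """Utility of an action averaged over the believed states."""
    return sum(belief[state] * UTILITY[action][state] for state in belief)


def best_action(belief):
    """The normative choice: maximize expected utility."""
    return max(UTILITY, key=lambda action: expected_utility(action, belief))


belief_after_test = posterior(PRIOR, POSITIVE_TEST)
print(best_action(PRIOR), best_action(belief_after_test))
```

With these assumed numbers the preferred action changes from waiting to treating once the positive finding is taken into account.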

The chief disadvantages of the decision theoretic approach are the difficulties of obtaining reasonable estimates of probabilities and utilities for a particular analysis. Although techniques such as sensitivity analysis help greatly to indicate which potential inaccuracies are unimportant, the lack of adequate data often forces artificial simplifications of the problem and lowers confidence in the outcome of the analysis. Attempts to extend these techniques to large medical domains in which multiple disorders may co-occur, temporal progressions of findings may offer important diagnostic clues, or partial effects of therapy can be used to guide further diagnostic reasoning, have not been successful. The typical language of probability and utility theory is not rich enough to discuss such issues, and its extension within the original spirit leads to untenably large decision problems. For example, one could handle the problem of multiple disorders by considering all possible subsets of the primitive disorders as mutually competing hypotheses. The number of a priori and conditional probabilities required for such an analysis is, however, exponentially larger than that needed for the original problem, and that is unacceptable.
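The growth in the number of required probabilities can be illustrated with a rough count. The sketch assumes binary findings that are conditionally independent given each hypothesis, and counts one prior per hypothesis plus one conditional probability per (hypothesis, finding) pair; the function names and the example sizes are hypothetical.

```python
def parameters_single_disorder(n_disorders, n_findings):
    """Probabilities needed when exactly one disorder is assumed present."""
    return n_disorders + n_disorders * n_findings


def parameters_all_subsets(n_disorders, n_findings):
    """Probabilities needed when every subset of disorders is a hypothesis."""
    hypotheses = 2 ** n_disorders  # each disorder is either present or absent
    return hypotheses + hypotheses * n_findings


# Example sizes chosen arbitrarily for illustration.
for n in (5, 10, 20):
    print(n, parameters_single_disorder(n, 20), parameters_all_subsets(n, 20))
```

Under these assumptions, ten disorders and twenty findings require 210 numbers in the single-disorder formulation and 21,504 when all subsets compete as hypotheses.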

A second difficulty for decision analysis is the relatively mysterious reasoning of a decision theoretic program--an explanation of the results is to be understood in terms of the numeric manipulations involved in expected value computations, which is not a natural way of thinking for most people. The role of decision theoretic computations is discussed further in [24].

Additional Flexibility

A careful analysis of the shortcomings of any of the above techniques reveals numerous possible improvements. An interesting observation of the AIM community is that the improvements more often involve bringing to bear specific knowledge on selected subproblems of an application than developing a new complete theory for it. For example, in the decision theoretic framework, if most hypotheses are disjoint and most observations are conditionally independent, then it is very helpful to be able to express the few exceptions without resorting to expanding the complete database to give joint probabilities. Flexibility in knowledge representation and problem solving techniques is highly desirable to allow the inclusion of these bits of specific knowledge without needing to magnify greatly the whole program.

The five research projects reported on in this volume all employ AI techniques to represent and reason with their knowledge. In each case, similarities to more traditional forms of program organization will--not so surprisingly--be apparent. Each project is pragmatically oriented, with the intent of ultimately producing a clinically significant tool. Although each is based in part on its developers' insights into how expert physicians reason, none is intended as a serious psychological model of human performance in medical reasoning. Thus, aspects of the predetermined clinical flowchart, pattern matching to a data base of known or prototypical cases, and probabilistic reasoning underlie each program where those techniques are appropriate. Of particular interest are the new techniques and their combinations which have been developed for these programs to provide the additional flexibility described above.

Expertise and Common Sense

Encoding human expertise in the computer is amazingly difficult. The difficulty rests both on our lack of understanding of how people know what they know and on technical problems of structuring and accessing large amounts of knowledge in the machine. For an example of a simple human reasoning task that is somewhat beyond the ability of current computer techniques to handle, consider the following dialog, quoted from The New Yorker in [7]:

Mrs. Eloise Dobbs, 38, is married to a feed store owner and she comes to her physician, Dr. Elwood Schmidt, complaining of chest pain. The following dialogue ensues:

"This whole side of my chest hurts, Elwood. It really hurts."

"What about your heart--any irregular beats?"

"I haven't noticed any. Elwood, I just want to feel good again."

"That's a reasonable request. And I think it's very possible you will."

"But what do you think? Is it my heart? Is it my lungs?"

"Now, you won't believe this--but I don't know. I do not know. But I wonder. Are you lifting any sacks down at the store?"

"I lift some. But only fifty pounds or so. And only for the woman customers."

"I think you'd better let your lady customers lift their own sacks. If I know those ladies, they can do it just as well as you can. Maybe better."

The doctor in this story relies not only on his understanding of the physiological basis of pain (that although overexertion can exacerbate some underlying disorder to cause pain, especially in an older person it can cause pain by itself) but also on his knowledge of the patient and her occupation, the common practices of small-town stores, the weight of typical sacks of feed, etc. Therefore, we would not expect even the most sophisticated computer program, charged only with the latest of pathophysiological theory, to arrive at the parsimonious diagnosis of the local doctor.

An optimistic assessment holds that "tricks" like the above do not pose any real difficulty. After all, that reasoning process can be defined in terms of a small set of rules and facts:

  1. Try to explain isolated complaints by possible non-pathological causes.
  2. Overexertion can cause chest pain.
  3. Small-town people are prone to overexertion.
  4. ...

Some programs actually manage to make use of some such knowledge. For example, the Present Illness Program (PIP) [14] is able to infer that if the patient passed a military physical or a life-insurance company's health examination, then neither blood, sugar, nor protein were present at that time in the urine. This is a widely-known heuristic among physicians, being one of the many ways that past data can be inferred in the absence of definitive reports.

How many such "tricks of the trade" are there, however? How can we learn them all, to include them in programs? Given the knowledge, how can we know when to apply which piece to achieve the desired ends? One may guess that such special knowledge is vast--facts perhaps numbering in the millions. Each particular situation may demand the correct application of only a few of these facts for its resolution, but a program with broad expertise must be able to use a very large number, to select the right ones for each case. The problems of acquiring, organizing, retrieving and applying the large amount of knowledge we now believe necessary are part of the focus of knowledge based systems research in AI.

A more pessimistic evaluation of AI applications, held by some of the leading practitioners of AI, takes the bleak (to us) view that expert consultant programs of the type built by AIM researchers cannot meet the challenge of general competence and reliability until much more fundamental progress is made by AI in understanding the operation of common sense. This argument suggests that the ultimate reliability of all reasoning, whether by human or computer, rests on a supervisory evaluation of the outcome of that reasoning to assure that it is sensible. Just what that means in computational terms is rather difficult to even imagine specifying, though we suspect that it has much to do with checking the result against a considerable stock of experience acquired in interacting with the real world. The story of Mrs. Dobbs and her physician is an illustration of the possibly necessary experience. This argument against AIM claims that although the formal expertise of the country doctor can be modeled, his common sense cannot, at the present state of the art, and this failure will vitiate the considerable accomplishments of the implementations of the formal expertise.

Although having better general theories of common sense reasoning would be an undeniable benefit, its current lack is not as large a handicap to AIM as the above view claims. It is the very expertise of the expert that is the chief escape from the "common sense is indispensable" attack. Building a medical expert consultant may in fact be easier than building a program to act as a general practitioner. The family doctor is much concerned with the interpretation of everyday events into their medical significance--thus, with common sense interpretation. The medical expert, by contrast, typically gets information from the report of the general practitioner and from laboratory data, both of which require far less real world interpretation. One can imagine an expert consultant, but not the family doctor, acquiring an understanding of a case by telephone.

Medical expertise, by its very nature as a taught body of material, is formalized as no common experience is. The structure of the formalization used in teaching physicians is useful in capturing that expertise within the computer. Thus, the formal reasoning of the expert physician, seemingly paradoxically, is actually a better ground for building computer models than the less formal knowledge of the physician who must be in direct contact with patients and their world. Assuming that the program acts as advisor to a person (doctor, nurse, medical technician) who provides a critical layer of interpretation between an actual patient and the formal models of the programs, the limited ability of the program to make a few common sense inferences is likely to be enough to make the expert program usable and valuable.

Artificial Intelligence and Knowledge Based Systems

How do we currently understand those "ideas which enable computers to do the things that make people seem intelligent?" Although the details are controversial, most researchers agree that problem solving (in a broad sense) is an appropriate view of the task to be attacked by AI programs, and that the ability to solve problems rests on two legs: knowledge and the ability to reason.

Historically, the latter has attracted more attention, leading to the development of complex reasoning programs working on relatively simple data bases. The General Problem Solver (GPS) [4] formalized notions of problem solving by successive decomposition of a goal into its subgoals and by establishing new goals based on differences between the current and desired states. Theorem provers based on variations on the resolution principle explored generality in reasoning, deriving problem solutions by a method of contradiction. More recently, AI languages like PLANNER [9], CONNIVER [22], and various production system languages [28] have explored various control mechanisms to generate powerful reasoning. In the Truth Maintenance System (TMS) [3], such schemes are strengthened by the introduction of dependency-directed backtracking, in which contradictions resulting from assumptions made by the reasoner can be traced back to the offending assumption(s), thereby assuring that only those assumptions and their consequences are re-thought which could have led to the contradiction. Even extremely unsophisticated reasoning schemes can at times be useful, such as the "British Museum Algorithm," which tries all possible conclusions from all known facts (theorems) and inference rules.(3)
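The flavor of such an exhaustive scheme can be suggested by a small sketch that applies every rule to every known fact until nothing new can be concluded. It is only a finite, propositional analogue under stated assumptions: the rules and fact names are invented for illustration, and no attempt is made to direct the search toward a goal.

```python
# Hypothetical rules: (set of premises, conclusion). Invented for illustration.
RULES = [
    ({"lifts_heavy_sacks"}, "overexertion"),
    ({"overexertion"}, "possible_muscle_strain"),
    ({"possible_muscle_strain", "no_irregular_beats"}, "benign_cause_plausible"),
]


def derive_everything(facts, rules):
    """Blindly apply all rules to all known facts until no new fact appears."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # A rule fires when all of its premises are already known.
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


conclusions = derive_everything({"lifts_heavy_sacks", "no_irregular_beats"}, RULES)
print(sorted(conclusions))
```

The procedure derives every consequence whether or not it is relevant to the question at hand, which is why more controlled reasoning schemes are usually preferred.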

In addition to considering how a program reasons, it is essential to ask what sorts of data it reasons about--i.e., how is its knowledge represented. During the past ten years, the notion has gained acceptance that reasoning becomes simpler if the structure of the representation reflects the structure of the reality being reasoned about. Much current research focuses on the design of new knowledge representation languages which permit this principle to be applied [2, 23]. Two aspects of this structure that are receiving much attention are the representation of structured objects and of processes. Although this is not the place to enter a comprehensive discussion of the field, a small example will illustrate some of the concerns.

Early representation languages were based on the predicate calculus, in which each fact, or item of knowledge, was represented as a single expression in the language. There would be separate entries in the data base, for example, for the facts that CHAIR-1 is a chair, that it has a back, CHAIR-BACK-1, that it has four legs, that it is in my office, and that I am sitting on it. Another chair, CHAIR-2, also has a back, CHAIR-BACK-2, also has four legs, is located in my living room, and is currently occupied by a cat.

It is obvious that one wants to be able to make generalizations about individual entities rather than specifying everything about each object in detail. In the example of the chairs, above, it is useful to assume that our knowledge representation contains a description of some prototypical CHAIR, that the individual chairs we discuss can be said to be instances or kinds of the prototype, and that much of what we know about each individual chair is in fact shared knowledge more appropriately known about the prototype. Normally, we would not feel a need to mention the legs or back of any chair, yet we would be prepared to hear that such objects existed and could be described. In fact, we assume that, in all ways not explicitly stated, any individual chair we consider inherits a default description of its form and function from the prototype.
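One way to picture this arrangement is a small frame-like structure in which an individual records only what is peculiar to it and looks everything else up in its prototype. The class name, slot names and lookup method below are assumptions made for this sketch and do not describe any particular representation language.

```python
class Frame:
    """A named bundle of slots with an optional prototype to inherit from."""

    def __init__(self, name, prototype=None, **slots):
        self.name = name
        self.prototype = prototype
        self.slots = dict(slots)

    def get(self, slot):
        """Return the slot value, searching up the chain of prototypes."""
        frame = self
        while frame is not None:
            if slot in frame.slots:
                return frame.slots[slot]
            frame = frame.prototype
        raise KeyError(slot)


# Shared knowledge is stated once, on the prototype.
CHAIR = Frame("CHAIR", legs=4, has_back=True)

# Individual chairs mention only what distinguishes them.
CHAIR_1 = Frame("CHAIR-1", prototype=CHAIR, location="office", occupant="me")
CHAIR_2 = Frame("CHAIR-2", prototype=CHAIR, location="living room", occupant="cat")

print(CHAIR_2.get("legs"), CHAIR_2.get("location"))
```

Nothing about legs or backs is stored on CHAIR-1 or CHAIR-2, yet a query about either is answered from the prototype's default description.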

What does it mean to speak of inheritance, for example? The data structures in terms of which a program's knowledge is represented cannot be said to have meaning on their own, independent of the way they are used. Therefore, what we understand representation systems to specify is the standard ways in which certain often-needed, perhaps trivial inferences will be made automatically by the system whenever they are needed. In the above case, we can think of the representation and its interpreter deducing the existence of CHAIR-2's back when we state that it is sturdy. It is important that we can think of this deduction as trivial and computationally inexpensive. In the predicate calculus, we can certainly express the same notion, that each chair has a chair-back as a part:

ForAll(x)(CHAIR(x) -> Exists(y)(CHAIR-BACK(y) & PART-OF(y, x))).

This solution is not satisfactory as an actual computational mechanism to many workers in AI, however, because it fails to distinguish in any way the reasoning involved in determining that a chair has a back from that in concluding that Mrs. Dobbs is not having a heart attack. We feel that some aspects of the world (e.g., the relations among objects and their components, things and their attributes) are so basic that we should not have to describe them or reason about them in the same way that we express our "real" problems. The power of the newer AI representation languages comes from their incorporation of automatic mechanisms to make the simple, local deductions implied by the conventions of their knowledge representation scheme.
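One way to picture such an automatic, local deduction is a representation that brings a part into existence only when it is first mentioned. The following Python sketch is our own illustration; the class `Thing`, the table `PARTS_BY_KIND`, and the naming of generated parts are assumptions, not features of any system described in this book.

```python
class Thing:
    """An individual whose parts are created only when first mentioned."""

    # Hypothetical table of the parts each kind of thing is assumed to have.
    PARTS_BY_KIND = {"CHAIR": ["BACK", "LEGS"]}

    def __init__(self, name, kind):
        self.name = name
        self.kind = kind
        self.parts = {}       # parts instantiated so far
        self.properties = {}

    def part(self, part_name):
        # The "trivial inference": if the kind has such a part, so does this individual.
        if part_name not in self.parts:
            if part_name not in self.PARTS_BY_KIND.get(self.kind, []):
                raise KeyError(f"a {self.kind} has no part {part_name}")
            self.parts[part_name] = Thing(f"{self.name}-{part_name}", part_name)
        return self.parts[part_name]


chair_2 = Thing("CHAIR-2", "CHAIR")
assert chair_2.parts == {}  # nothing has been recorded about its back yet

# Stating that the back is sturdy causes the back to be deduced and created.
chair_2.part("BACK").properties["sturdy"] = True
print(chair_2.part("BACK").name)
```

The deduction costs one table lookup and never engages a general theorem prover, which is the distinction the predicate calculus formulation fails to draw.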

The parsimony of description hinted at above is not without its price. What are we to do about the chair without a back? Or one with but three legs? If our underlying reasoning mechanism enforces (indeed embodies) assumptions of certain regularities in the world, we need to provide mechanisms for expressing exceptions. Very quickly, however, we tread on the infirm ground of philosophical conundrum. How can we mention that back of CHAIR-3 which it does not have? Such problems often arise to plague existing knowledge representation formalisms, and it is an active area of research to find adequate solutions to them [11, 21]. Nevertheless, some reasonable scheme of knowledge representation is necessary to permit the concise expression of all the information needed by AI programs that must know about a great deal.
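A common programming device for exceptions is to let an individual override an inherited default, with a special marker for "explicitly lacks this". The Python sketch below is a hypothetical illustration of that device only; the `ABSENT` sentinel and the `Frame` class are our assumptions, and the device sidesteps rather than resolves the philosophical difficulty just raised.

```python
ABSENT = object()  # sentinel meaning "this individual explicitly lacks the slot"


class Frame:
    """A description with inherited defaults that individuals may override."""

    def __init__(self, name, prototype=None, **slots):
        self.name = name
        self.prototype = prototype
        self.slots = dict(slots)

    def get(self, slot, default=None):
        # The nearest statement wins, so an exception masks the prototype's default.
        frame = self
        while frame is not None:
            if slot in frame.slots:
                value = frame.slots[slot]
                return default if value is ABSENT else value
            frame = frame.prototype
        return default


CHAIR = Frame("CHAIR", legs=4, back="a chair-back")
chair_2 = Frame("CHAIR-2", prototype=CHAIR)                       # an ordinary chair
chair_3 = Frame("CHAIR-3", prototype=CHAIR, legs=3, back=ABSENT)  # the exceptional one

print(chair_2.get("back"))  # the inherited default
print(chair_3.get("legs"))  # the override
print(chair_3.get("back"))  # the explicit exception
```

Note that the program can now say that CHAIR-3 has no back only by mentioning the "back" slot, which is a small instance of the conundrum in the text.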

To summarize, we might loosely say that the power of a problem solver is proportional to the product of its reasoning power and the expressiveness of its knowledge representation scheme. Obviously, a more sophisticated reasoning mechanism can make more powerful conclusions from the same data. Just as obviously, however, the same reasoning mechanism can make more powerful conclusions by reasoning with an expression of knowledge that permits large steps to be taken by automatically supplying the simple intermediate details without the need for attention from the reasoning mechanism.

Research in AIM has relied on progress in both domains, as is apparent in the descriptions of the AIM programs in this book. The representation of rules as the predominant form of knowledge in MYCIN, the patient-specific model in the digitalis therapy advisor, the causal-associational network in CASNET/Glaucoma, disease frames in INTERNIST and the Present Illness Program are all important representational mechanisms. The partitioning heuristic of INTERNIST, the computation of "points of interest" in CASNET, the recursive control mechanism of MYCIN, and the expectation-driven procedures of the digitalis program are all reasoning mechanisms of some power.

The First Generation Programs

This book is a collection of chapters describing and critiquing what is perhaps best called "the first generation" of AIM programs. As the reader will see, each program concentrates on a particular aspect of the medical diagnostic or therapeutic problem, bringing to bear techniques derived from or inspired by the methods of AI to overcome deficiencies of the traditional approaches to decision making in medicine. Each discussion recapitulates the successful development of a tool or set of tools and its successful application to a medical decision making problem. Each also goes on to suggest remaining problems with the chosen methods and research plans currently under way to improve the state of the art.

As a general introduction, it may be useful to give here a brief overview of the five programs to be presented and the intellectual concerns addressed in each project.

Chapter 2 presents an overview of the mechanisms of the CASNET system, developed at Rutgers University, in its major incarnation as a diagnostic and therapeutic program for Glaucoma and related diseases of the eye, and then it describes EXPERT, a somewhat simpler and more widely applied system which is being used in the analysis of thyroid disorders and in rheumatology. CASNET identified the fundamental issue of causality as essential in the diagnostic and therapeutic process. Simply put, any abnormal phenomenon must have some causal pathway which can be traced back to an ultimate etiological factor. Conversely, if every possible pathway to a suspected disorder can be ruled out, belief in that disorder is not tenable. CASNET uses these simple observations as the basis of a rather complex computational mechanism which permits it to assess the likelihood of a node in the causal network based either on directly observable evidence, on expectation from known causally antecedent states, or by inference from known causally subsequent states. Differences among the scores calculated in these diverse ways are used as heuristic guidance to suggest which areas of the network are most worth exploring further. CASNET also has the attractive feature that it models the partial or complete failure of treatment with the same mechanisms as the progression of disease. The development of a national network and several extensive demonstrations and trials are other important hallmarks of this project. The more recent development of the EXPERT system generalizes the computational techniques of CASNET and makes them simpler to apply to other domains. Unfortunately, the semantic interpretation of links as causal connections is at least partially abandoned, leaving a system that is easier to use but one which offers a potential user less guidance on how to use it appropriately.
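The observation that a disorder is untenable once every causal pathway to it has been ruled out can be illustrated with a small graph search. The Python sketch below is our own and rests on loud assumptions: the network `CAUSES` and its state names are invented for illustration, the network is assumed to be acyclic, and none of CASNET's actual scoring or weighting is reproduced.

```python
# Hypothetical causal network: each state maps to the states that can directly cause it.
# States with no listed causes stand for ultimate etiological factors.
CAUSES = {
    "visual field loss": ["optic nerve damage"],
    "optic nerve damage": ["elevated pressure", "poor perfusion"],
    "elevated pressure": ["angle closure"],
    "poor perfusion": [],
    "angle closure": [],
}


def tenable(state, ruled_out):
    """Return True if some causal pathway from an ultimate cause to `state` survives."""
    if state in ruled_out:
        return False
    causes = CAUSES[state]
    if not causes:  # an ultimate etiological factor that has not been ruled out
        return True
    # Belief is tenable only if at least one antecedent state is itself tenable.
    return any(tenable(cause, ruled_out) for cause in causes)


print(tenable("visual field loss", set()))
print(tenable("visual field loss", {"angle closure"}))
print(tenable("visual field loss", {"angle closure", "poor perfusion"}))
```

With one etiology ruled out a pathway still remains; with both ruled out, no pathway reaches the suspected state and the function reports it as untenable.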

Chapter 3 is a description of the MYCIN system, developed at Stanford University originally for the diagnosis and treatment of bacterial infections of the blood and later extended to handle other infectious diseases as well. The fundamental insight of the MYCIN investigators was that the complex behavior of a program which might require a flowchart of hundreds of pages to implement as a clinical algorithm could be reproduced by a few hundred concise rules and a simple recursive algorithm (described in a one-page flowchart) to apply each rule just when it promised to yield information needed by another rule. For example, if the identity of some organism is required to decide whether some rule's conclusion is to be made, all those rules which are capable of concluding about the identities of organisms are automatically brought to bear on the question. The modularity of such a system is obviously advantageous, because each individual rule can be independently created, analyzed by a group of experts, experimentally modified, or discarded, always incrementally modifying the behavior of the overall program in a relatively simple manner. Other advantages of the simple, uniform representation of knowledge which are not as immediately apparent but equally important are that the system can reason not only with the knowledge in the rules but also about them. Thus, it is possible to build up facilities to help acquire new rules from the expert user when the expert and program disagree, to suggest generalizations of some of the rules based on their similarity to others, and to explain the knowledge of the rules and how they are used to the system's users.
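The flavor of such a recursive rule interpreter can be conveyed by a short Python sketch. It is a simplification under our own assumptions and is not MYCIN's code: the two rules, the attribute names, and the placeholder values "organism-X" and "drug-A" are hypothetical, and certainty factors are omitted entirely.

```python
# Hypothetical rules, each a pair (premises, conclusion). Not taken from MYCIN.
RULES = [
    ({"gram_stain": "negative", "morphology": "rod"}, ("identity", "organism-X")),
    ({"identity": "organism-X"}, ("therapy", "drug-A")),
]


def find_out(attribute, known, ask):
    """Establish `attribute` by trying every rule able to conclude about it."""
    if attribute in known:
        return known[attribute]
    relevant = [rule for rule in RULES if rule[1][0] == attribute]
    if not relevant:
        # No rule concludes about this attribute, so ask the user.
        known[attribute] = ask(attribute)
        return known[attribute]
    for premises, (_, value) in relevant:
        # Checking a premise may recursively bring other rules to bear.
        if all(find_out(a, known, ask) == v for a, v in premises.items()):
            known[attribute] = value
            return value
    known[attribute] = None  # no applicable rule succeeded
    return None


answers = {"gram_stain": "negative", "morphology": "rod"}
asked = []


def ask(attribute):
    asked.append(attribute)
    return answers[attribute]


print(find_out("therapy", {}, ask))
print(asked)
```

Asking for a therapy causes the identity rule to be invoked, which in turn causes the two laboratory observations to be requested; adding or removing a rule changes the behavior without touching the interpreter.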

Chapter 4 provides an overview of our efforts at M.I.T. to develop a program for advising physicians using the drug digitalis for patients with heart disease. The major insight in developing this program was the recognition that therapeutic advice must be based on a patient-specific model which includes, in addition to all relevant factors from the patient's medical history, the goals of therapy and how previous sessions have shed light on the drug's effects on the individual patient. This program, as the others described here, has undergone several tests indicating its human-like competence; it has also served as the vehicle for ongoing research in the automatic generation of explanations of program behavior which are based on programs expressed as procedures (as opposed to rules) and on the relation between medical knowledge about the underlying domain and the performance of the program. We use the discussion of this chapter to address a number of non-technical issues in the development of AIM programs as well: the nature of collaboration between physicians and computer scientists, the trial-and-error method of program and theory refinement, the requirements for careful testing of programs intended for potential life-saving or life-threatening applications, and the eventual need for commercial involvement in the development of such programs before they can be broadly disseminated.

Chapter 5 introduces an AI framework for thinking about the diagnostic problem, and presents an overview of the INTERNIST system developed at the University of Pittsburgh for diagnosis in general internal medicine. INTERNIST-1 uses a problem-formulation heuristic to select from among all its known diseases that set which should be considered as competing explanations of the currently-known abnormal findings in a case. A distinction is made between the tasks of formulating such a differential problem and of solving it. Formulating the problem is what might be called an ill-structured task, similar to the problem of making up an interesting mathematical theorem or designing a house; solving the differentiation problem once formulated is well-structured, inviting the application of numerous conventional methods. The simple heuristic of INTERNIST-1 is seen to do well on many complex cases, but falters on cases requiring an analysis from several different viewpoints, e.g., an interaction between the causal mechanism of the disease and the organ systems involved in it. Based on such deficiencies, the chapter presents a new, extended method of medical knowledge representation and problem formulation that is intended to form the basis for CADUCEUS, the second-generation follow-on to INTERNIST-1.

Chapter 6 introduces a formalism for reasoning with a causal representation of illness, one that permits multiple levels of detail at which to consider portions of the diagnostic task. It is an outgrowth of work at M.I.T. on the Present Illness Program (PIP), for taking the history of the present illness of a patient with renal (kidney) disease [14]. Although PIP's performance on some cases was comparable to that of a human expert, it, as well as the other programs, suffered from weaknesses on complex cases. This chapter presents the design of ABEL, a program for the diagnosis (and eventually treatment) of acid/base and electrolyte disturbances. The design is based on the recognition that the earlier programs used representations of medical knowledge that were not able to capture the subtlety of medical reasoning actually used by expert physicians, especially in cases of multiple disorders. ABEL therefore includes mechanisms to express causal and associational relationships at different levels of aggregation and detail, the quantitative decomposition of constituents and summation of changes resulting from different pathophysiological pathways, and temporal aggregation. The rich set of descriptive mechanisms permits construction of a similarly rich set of operations for building up descriptions into hypotheses and changing them to reflect new information as it is discovered about the case. Although the program proposed here has now begun to work, it is described in the chapter not as a working program but as a set of requirements for sophisticated representations for the second generation of AIM programs.

The remaining section of this chapter brings together an assessment of just how successful the first generation programs have been, and outlines a set of concerns identified in their construction which now provide the focus for ongoing research.

The State of the Art and Future Prospects

Just how good are AIM programs now? The remaining chapters of this book present a number of the more mature programs in use today, with reports on their formal and informal evaluations. Each existing program has, in some trials, been judged comparable in competence to expert physicians--this is indeed an outstanding result. Some have been shown to be statistically indistinguishable from experts in the field; others were judged by true experts as giving expert advice. Although the trials have ranged in rigor from well-controlled experiments to almost anecdotal testimonials, an objective examination of their performance clearly demonstrates that they have captured an important aspect of what it means to be an expert in a particular field of medicine and provided a good demonstration of their capabilities on some significant medical cases.

However, the programs' performance can also be non-uniform, exhibiting the "plateau and cliff effect" well known to all program developers: the program is outstanding on the core set of anticipated applications, but degrades rather ungracefully for problems just outside its domain of coverage. On very difficult cases, which are not typical of the ones used in formal evaluations, the programs may even be misled in cases that fall within their central domain by complex interactions or multiple disorders that they are unable to untangle successfully.

We cannot ask, naturally, that an AIM program be flawless before it can be acceptable, when the domain of its expertise is subject to uncertainties of knowledge and lack of data that can equally trip up a human expert.(4) Nevertheless, for programs to function effectively in a typically human role, advising on decisions where life may hang in the balance, very high degrees of competence and reliability will be demanded. Can they be achieved?

The answer depends on two considerations: Is the AIM methodology sound enough to base the work on, or is it profoundly flawed by our lack of understanding of common sense human reasoning? Is the depth of knowledge which is expressible with the techniques we now have adequate to yield highly competent and reliable programs? To the first question, we will suggest "yes" as the answer, based on the observation that medical expertise is already rather formal (as argued above). To deal with the second, we point to some currently-developing ideas which suggest that richer representations are probably needed and probably achievable. First, however, we touch briefly on another aspect of AIM programs' acceptability in clinical use.

We must realize that although current AIM programs already give quite impressive demonstrations of the success of the techniques used and of the dedication of the investigators, none of the programs reported on here or developed by other, similar efforts is in current clinical use. Perhaps, as it has been argued, programs will be clinically accepted only once their indispensability is established, that is, only when successful demonstrations exist that physicians or other medical personnel working with such programs are more successful than those without. Alternatively, social and administrative mechanisms may be more responsible for the ultimate utilization or abandonment of these tools. In any case, improved competence and reliability will surely be necessary and perhaps sufficient to help propel the programs into use. We take up this topic in greater detail in Chapter 4.

Depth of Knowledge

What technical problem most fundamentally accounts for the failure of current AIM programs when they encounter difficulty? Our view here is that they fail to be able to exploit the recognition that a problem exists (that their reasoning procedures are producing conflicting results) to seek and create a deeper analysis of the problem at hand. Much of the knowledge embedded in AIM programs is what we can appropriately call phenomenological: that is, concerned with the relations among phenomena more than with an understanding of the mechanisms which are suggested by the observations. For example, a MYCIN rule relating the gram stain and morphology of an organism to its likely identity is based on a human belief in the validity of that deduction, not on any significant theory of microscopic observation and staining. Similarly, the digitalis therapy advisor's conclusion that an increase in premature ventricular beats indicates a toxic response to the drug it is trying to manage is based on that specific knowledge, learned from an expert, and not on any bioelectrical theory of heart tissue conductivity and its modification by the drug. Such phenomenological descriptions of reality provide a good first approximation to the way that physicians reason about medical reality, but they fail to capture the subtlety of which physicians are capable when difficulties arise in the straightforward phenomenological interpretation of the data at hand.

Consider what happens when two "rules of thumb" (as we may identify a bit of phenomenological knowledge in medicine) conflict. Every AIM program written so far evaluates that conflict by reducing it to a numerical judgment of likelihood (or certainty, belief, etc.) in the hypotheses it holds: MYCIN computes a revised certainty factor, CASNET computes new weights, INTERNIST computes new scores, and the digitalis program often computes a weighted sum of its observations to evaluate their joint effect. Thus, conflict, just as agreement, is reduced to a manipulation of strength of belief. Yet, by contrast, we believe that human experts make a much more powerful use of occasions where they detect conflict. They are not satisfied by a simple revision of their degree of belief in the hypotheses which they have previously held; they seek a deeper, more detailed understanding of the causes of the conflict they have detected. For it is just at such times of conflicting information that interesting new facets of the problem are visible. Conflicts provide the occasion for contemplating a needed re-interpretation of previously-accepted data, the addition of possible new disorders to the set of hypotheses under consideration, and the reformulation of hypotheses thus far loosely held into a more satisfying, cohesive whole. Much of human experts' ability to do these things depends on their knowledge of the domain in greater depth than what is typically needed to interpret simple cases not involving conflict. To move beyond the sometimes fragile nature of today's programs, we believe that future AIM programs will have to represent medical knowledge and medical hypotheses at the same depth of detail as used by expert physicians. Some of the additionally needed representations are:

The AIM field, scarcely a few years old, has already produced a handful of impressive programs demonstrating that the application of AI techniques to medical decision making problems is a fruitful methodology. A number of the programs already exhibit expert-level behavior on some realistic, important medical problems. The field is also rich with many other problems in representation and reasoning, ready to challenge the interested investigator with projects in artificial intelligence research and its applications, urging us all to discover and use what is knowable in the art and science of medicine.

References

1. Bleich, H. L., "Computer-Based Consultation: Electrolyte and Acid-Base Disorders," Amer. J. Med. 53, (1972), 285.

2. Bobrow, D. G., and Winograd, T., An Overview of KRL, a Knowledge Representation Language. Technical Report AIM-293, Stanford Artificial Intelligence Lab., Stanford, Ca., (1976).

3. Doyle, J., "A Truth Maintenance System," Artificial Intelligence 12, (1979), 231-272.

4. Ernst, G. and Newell, A., GPS: A Case Study in Generality and Problem Solving. Academic Press, New York, (1969).

5. Feigenbaum, E. A., and Feldman, J., (Eds.), Computers and Thought, McGraw-Hill, New York, (1963).

6. Gorry, G. A., Kassirer, J. P., Essig, A., and Schwartz, W. B., "Decision Analysis as the Basis for Computer-Aided Management of Acute Renal Failure," Amer. J. Med. 55, (1973), 473-484.

7. Gorry, G. A., "On the Mechanization of Clinical Judgment," in Weller, C., (Ed.), Computer Applications in Health Care Delivery, Symposia Specialists, Miami, Florida, (1976).

8. Gorry, G. A., Silverman, H., and Pauker, S. G., "Capturing Clinical Expertise: A Computer Program that Considers Clinical Responses to Digitalis," Amer. J. Med. 64, (March 1978), 452-460.

9. Hewitt, C., Description and Theoretical Analysis (Using Schemata) of PLANNER: A Language for Proving Theorems and Manipulating Models in a Robot, AI-TR-258, MIT Artificial Intelligence Lab., Cambridge, Mass., (1972).

10. Mabry, J. C., Thompson, H. K., Hopwood, M. D., and Baker, W. R., "A Prototype Data Management and Analysis System--CLINFO: System description and user experience," MEDINFO 77, North-Holland, Amsterdam, (1977), 71-75.

11. Martin, W. A., Roles, co-descriptors, and the formal representation of quantified English expressions. TM-139, MIT Lab. for Comp. Sci., Cambridge, Mass., (September 1979).

12. McCosh, A. M., and Scott Morton, M. S., Management Decision Support Systems. John Wiley and Sons, New York, (1978).

13. McCorduck, P., Machines Who Think, W. H. Freeman and Co., (1980).

14. Pauker, S. G., Gorry, G. A., Kassirer, J. P., and Schwartz, W. B., "Toward the Simulation of Clinical Cognition: Taking a Present Illness by Computer," Amer. J. Med. 60, (June 1976), 981-995.

15. Perlman, F., McCue, J. D., and Friedland, G., Urinary Tract Infection (UTI) / Vaginitis Protocol, Introduction. Ambulatory Care Project, Lincoln Laboratory, Massachusetts Institute of Technology, and Beth Israel Hospital, Harvard Medical School, (July 1974).

16. Raiffa, H., Decision Analysis. Addison-Wesley, Reading, Mass., (1970).

17. Rosati, R. D., McNeer, J. F., and Stead, E. A., "A New Information System for Medical Practice," Archives of Internal Medicine 135, (1975), 1017-1024.

18. Schwartz, W. B., "Medicine and the Computer: The Promise and Problems of Change," New Engl. J. Med. 283, (1970), 1257-1264.

19. Shortliffe, E. H., et al., "Knowledge Engineering for Medical Decision Making: A Review of Computer-Based Clinical Decision Aids," Proceedings of the IEEE 67, (9) (1979), 1207-1224.

20. Slack, W. V., and Van Cura, L. J., "Patient Reaction to Computer-Based Medical Interviewing," Comput. Biomed. Res. 1, (1968), 527-531.

21. Smith, B. C., Levels, Layers, and Planes: A Framework of a Theory of Knowledge Representation Semantics. S.M. thesis, Dept. of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, (Feb. 1978).

22. Sussman, G. J., and McDermott, D. V., "From PLANNER to CONNIVER--A Genetic Approach," Proceedings of the 1976 Fall Joint Computer Conference. AFIPS Press, (1976), 1171-1179.

23. Szolovits, P., Hawkinson, L., and Martin, W. A., An Overview of OWL, a Language for Knowledge Representation. MIT/LCS/TM-86, MIT Lab. for Comp. Sci., Cambridge, Mass., (June 1977), also in Rahmstorf, G., and Ferguson, M., (Eds.), Proceedings of the Workshop on Natural Language Interaction with Databases, International Institute for Applied Systems Analysis, Schloss Laxenburg, Austria, Jan. 10, 1977.

24. Szolovits, P., and Pauker, S. G., "Categorical and Probabilistic Reasoning in Medical Diagnosis," Artificial Intelligence 11, (1978), 115-144.

25. Szolovits, P., and Pauker, S. G., "Computers and Clinical Decision Making: Whether, How, and For Whom?," Proceedings of the IEEE 67, (9) (1979), 1224-1226.

26. Tautu, P., and Wagner, G., "The Process of Medical Diagnosis: Routes of Mathematical Investigations," Meth. Inform. Med. 7, (1) (1978).

27. Tversky, A., and Kahneman, D., "Judgment under Uncertainty: Heuristics and Biases," Science 185, (September 1974), 1124-1131.

28. Waterman, D. A., and Hayes-Roth, F., (Eds.), Pattern-Directed Inference Systems. Academic Press, (1978).

29. Weyl, S., Fries, J., Wiederhold, G., and Germano, F., "A modular self-describing clinical databank system," Comp. Biomed. Res. 8, (1975), 279-293.

30. Winston, P. H., Artificial Intelligence. Addison-Wesley, Reading, Mass., (1977).

Notes

(1) This research was supported (in part) by the National Institutes of Health Grant No. 1 P01 LM 03374 from the National Library of Medicine and Grant No. 1 P41 RR 01096 from the Division of Research Resources.

(2) Its developer has kindly provided the author with a listing of a recent version of the Acid/Base flowchart program mentioned above [1]. The listing, in a variant of the MUMPS language, occupies close to 150 pages.

(3) The British Museum Algorithm is the most primitive of theorem provers, considered primarily as a theoretical vehicle. It is named for Bertrand Russell's suggestion that a set of monkeys typing randomly at typewriters would, eventually, reproduce all the books in the British Museum.

(4) Some AI researchers have labeled this the super-human human fallacy, which requires that a program must virtually do the impossible before it can be called "intelligent."


This is part of a Web-based reconstruction of the book originally published as
   Szolovits, P. (Ed.).  Artificial Intelligence in Medicine. Westview Press, Boulder, Colorado. 1982.
The text was scanned, OCR'd, and re-set in HTML by Peter Szolovits in 1999.