The Development of Clinical Expertise in the Computer

Peter Szolovits, William J. Long

Szolovits, P. and Long, W. J.  "The Development of Clinical Expertise in the Computer."  Chapter 4 in Szolovits, P. (Ed.) Artificial Intelligence in Medicine. Westview Press, Boulder, Colorado.  1982.

Introduction

We describe the development of programs which apply Artificial Intelligence (AI) techniques to medical applications such as diagnosis and the selection and interpretation of therapy. This domain of application, called Artificial Intelligence in Medicine (AIM), is introduced in Chapter 1 of this work and is exemplified by the descriptions of several AIM programs in other chapters. Here our interest is in exploring the natural course of evolution of a typical AIM project, which we illustrate by following as our major example the Digitalis Therapy Advisor, which offers advice on the administration of the drug digitalis for patients with cardiac rhythm disturbances and congestive heart failure.(1) We suggest an analogy between the development of an AIM program and the growth of a child, as a handy mnemonic to guide one's expectations of the stages of development of the program.

An AIM program is born in the timely confluence of a wide set of physicians' and computer scientists' interests and the identification of a suitable medical problem and an appropriate computing technology. Its infancy is the initial concerted effort to lay out the problem in an insightful way, creating new understanding and techniques, and leading to the first functioning implementation. Childhood consists of repeated trial and revision, as the initial ideas are integrated, tested, and evaluated against the original requirements. Often a successful (though informal) test signals the transition to adolescence, in which the program continues to undergo revision and testing, as it is made to accommodate to the detailed realities of its practical environment. Adulthood, which is practical use, is a stage which no AIM program has yet reached.

The analogy of a program's development to the growth of a child is clearly not to be taken too literally; however, it does suggest an appropriate need for nurture and a surprising, though approximately realistic time schedule. The initial program is often fragile, limited in its scope of applicability, and awkward to use. It must be protected from attempts to use it in broad clinical settings, where its limitations would show themselves to be overwhelming. With time, as the program's developers gain an understanding of the strengths and weaknesses of the original version, improvements are made to both the depth and breadth of the program's competence. A schedule of successively more demanding trials continues to point out limitations, each suggesting a focus for further development. As the program begins to meet its technical requirements, economic, social and human engineering issues of usability come to the fore, testing its ability to adapt to the real world in which it must seek its ultimate application. In the following sections, we take up each of these phases, illustrating our views with the description of our experiences with the digitalis advisor and also drawing on our own experiences and our colleagues' with other programs. 

In addition to the digitalis program, we at the Clinical Decision Making Group of MIT's Laboratory for Computer Science, along with our collaborators at the Tufts-New England Medical Center Hospital, have been engaged for several years in the development of other computer programs which embody expert medical knowledge. We have applied techniques developed in the study of AI to encode expert clinicians' approaches to handling several important medical problems: taking the history of the present illness of a patient with kidney disease (the program is called PIP) [16, 18, 35], and diagnosing and treating patients with acid/base and electrolyte disturbances [15]. We have also studied the application of decision analytic techniques to the testing and treatment of Hodgkin's disease [25], the differential diagnosis of acute renal failure [4], the value of coronary surgery to the individual patient [17], and other problems [12, 19, 20]. Thus, our observations here rest also on our experience with these programs. In addition, we have followed closely the work of colleagues in the AIM community, who are investigating the treatment of infectious disease [28, 42, 43], diagnosis and therapy for glaucoma [38, 39, 11], and diagnosis in internal medicine [22]. Although we would not expect all these investigators to agree with our characterization of their efforts, the ability to follow their work has influenced both our developing projects and our assessment of the general process of AIM program development.

Birth and Infancy

The motivation for building an AIM program may range from a pressing medical need for a program that embodies some valuable and not universally-known expertise to a desire to try out a new computer technology in an interesting domain of application. In fact, many simultaneous motivations may co-exist, with each member of a research group pursuing the work for different reasons. Additional motivations include educational goals, such as making clear the conceptual basis for some aspect of medicine so that it may be taught better to medical students; simple curiosity, as in much of science, merely to see if some intellectual activity can be thoroughly understood and mechanically reproduced; and "political" considerations, taking into account the anticipated response, from the public and the funding agencies, to a successfully demonstrable program.

Naturally, the desire to build a useful application is not in itself sufficient to make it possible. Another crucial requirement, especially using the AIM methodology, is the existence of a significant gap in capability between the intended users of the application and the best available expert. Whereas no great differences in ability seem to bear on how various physicians treat the common cold, it is generally recognized in the medical community that certain experts, because of their greater experience, insight, or intelligence, are indeed capable of making better medical decisions. Consultation with such experts on difficult medical cases is the rule. A program can be most useful when it offers a capability lacking to its users. Whether it is to give advice in difficult cases or to serve as part of a teaching system, it should include expert capabilities that surpass the knowledge and the ability to use that knowledge of its audience. For a program to capture that sort of expertise, it is essential that a physician who is one of the experts be an active collaborator in the project. Further, that physician must have a special sensitivity to the need to formalize that "art" of medicine which is the expert's special skill. Such a collaborator is not generally easy to find.

The Medical Problem

In the case of the Digitalis program, a serious medical need was identified based on estimates that as many as 20% of hospitalized patients treated with digitalis develop a toxic reaction to the drug [24], with more recent studies indicating a 15% toxicity incidence [8], and that as many as 30% of those patients die of the toxicity [1]. This is the result of widely varying response to the drug in different patients, combined with some difficulty in identifying and distinguishing the more subtle therapeutic and toxic effects of the drug. Based on clinical experience, physicians in our group felt that the rate of toxicity could be significantly reduced by a careful anticipation of potential causes of that toxicity.

After hundreds of years of essential constancy in the use of digitalis, an improved pharmacokinetic understanding of the action of the drug had recently led to a better management technique for it. Before this change, William Withering's 1785 suggestions from An Account of the Foxglove [41] were generally followed: 

[L]et [the medicine] be continued until it either acts on the kidneys, the stomach, the pulse, or the bowels; let it be stopped upon the first appearance of any of these effects. [41, p.186]

His recommendation followed an analysis of the then-current understanding of the drug's actions: 

The Foxglove when given in very large and quickly-repeated doses, occasions sickness, vomiting, purging, giddiness, confused vision, objects appearing green or yellow; increased secretion of urine, with frequent motions to part with it, and sometimes inability to retain it; slow pulse, even as slow as 35 in a minute, cold sweats, convulsions, syncope, death. [41, p.184]

As we now understand the drug, its effect is primarily on the heart, and a major beneficial secondary effect is on the kidney. Effects on the stomach, the vision and some of the effects on the heart are undesirable side-effects; the drug may produce death through heart rhythm disturbances.

Withering's advice is, in essence, to continue use of the drug until either a therapeutic or toxic effect is apparent. Naturally, such a method of use, current until about two decades ago, results in many instances of toxicity. Indeed, the chief of cardiology at one of Boston's major teaching hospitals recalls believing earlier in his career that frequent digitalis toxicity was an unavoidable consequence of the drug's use (personal communication). Toxicity can arise if subtle therapeutic or toxic clues are missed and the dose is increased until significant toxic effects occur. It may also arise if the drug is given rapidly and toxic levels are reached before the therapeutic evidence has a chance to be manifested. In addition, the drug is not appropriately given to some patients, in whom toxic reactions appear to occur before significant therapeutic benefit can be reaped from the drug.

Medical Expertise

Through the pioneering work on the pharmacokinetics of digitalis, it is now understood that digitalis preparations in current use act directly on the heart muscle, where they are concentrated from the blood, and digoxin, the most commonly used, is excreted mainly through the kidney. With this knowledge, dosage of the drug can be appropriately modified to account for the actual expected rate of renal excretion. In addition, various other factors about the patient are known to cause a higher likelihood of toxic reaction to the drug. For example, a decrease in serum potassium levels is correlated with increased likelihood of toxicity. Such knowledge, in a somewhat more quantitative form, may be used to modify the prescribed dose, to lower the anticipated digitalis concentration and thus to lower the chances of toxic effects. Clearly, such a reduction in dose is a conservative strategy, trading off potential additional therapeutic effect from the higher dose for increased safety in the use of the drug. The appropriate balancing of this tradeoff is part of the expertise that an expert prescriber of digitalis brings to the problem.

When our group became interested in the problem of digitalis therapy, two programs which showed some promise of utility had already been developed for this problem domain. One of these, written by Jelliffe [9, 10], implemented the computations implied by the renal excretion model described above. Thus, it was able to transform a given desired therapeutic level into a dosage schedule, based on the patient's weight and renal function. It also successfully dealt with a range of digitalis preparations, each with somewhat different pharmacokinetic models, being able to compute appropriate schedules for moving the patient from one to another form of the drug. The other program, developed by Sheiner and his colleagues [27, 21], introduced the critical notion of feedback, suggesting that therapeutic recommendations must be based on an evaluation of the previous effect of the drug. This program used a measurement of the serum level of digitalis (measured by laboratory techniques first developed in the early 1960's) and a statistical model to suggest appropriate changes in the dosage to reach the goal level of serum concentration.

Both these programs encouraged us in the belief that some body of expert knowledge could be codified to improve therapy with digitalis. We felt, however, that neither program as it stood was clinically adequate. The Jelliffe program made only a "one-shot" recommendation based on limited information about the current state of the patient; it was unable to evaluate the actual response to the suggested therapy. Further, the program dealt with patients assumed to have normal liver and thyroid function, normal electrolyte concentrations, no evidence of gastrointestinal malabsorption, and no other drugs being given which could alter the metabolism of digitalis. Such a conjunction of normality is rather unusual in the population of ill patients who require digitalis therapy. 

The Sheiner program, although adding feedback, considered the outcome of therapy to hinge on the measurable serum levels of digitalis rather than on the clinical effects of the drug. Our medical colleagues, based on their experience with the drug and with serum digitalis levels, felt strongly that feedback adjustments must be made with respect to the clinical problems manifested by the patient, not the serum digitalis level. Indeed, later evidence [8] indicates a fairly poor correlation between the serum digitalis level and clinical effects of the drug. According to this view, then, the goal state of therapy is to be defined by clinical desiderata, and the goal level of the drug's serum concentration is simply a major parameter which may be varied to try to achieve the clinical goals.

A new program for digitalis therapy, ANNA, was implemented in our group by Howard Silverman, based on the advice of Dr. Stephen G. Pauker and Prof. G. A. Gorry in 1974 [29]. This program incorporated solutions to the deficiencies in the previous programs suggested above. It used a patient specific model (PSM) which encodes not only the values of clinical variables such as age, sex and weight; and histories of serum drug levels, creatinine clearance, electrolyte levels, etc., but also the reason the drug is being given, previous response of the patient to various levels of the drug, and expectations of possible future therapeutic and toxic signs and symptoms. The program has an initiation phase in which it constructs an initial PSM by gathering the needed background data and determining the reason why the drug is being given. From this information, an initial dose recommendation is made, and the program enters its feedback phase. The initial dose is computed based on Jelliffe's model, extended to modify the desired serum drug level based on the reason for digitalization and any sensitivities the patient may have which predispose him toward toxic responses to the drug. 

Figure 1 consists of excerpts from a transcript illustrating the interaction between the user and the program during the initial session. This transcript is actually from a later version of the digitalis program, but the interaction does not differ in any essential way from the interaction with ANNA. In this case, the patient is being treated for an arrhythmia. The program asks for the basic patient information, and then asks various questions to determine the features of the cardiac rhythm. This information is used to establish a base for assessing the therapeutic and toxic changes. After the program has assessed the renal function, assessed any previous digitalis use, and checked for factors that might increase the likelihood of toxicity, it provides a recommended schedule for loading the patient with digitalis.

The following is an annotated transcript of the first session of a patient being treated for atrial fibrillation (a type of fast irregular rhythm). This transcript begins after the name, time, age, weight, height and sex have been gathered.
WHAT IS THE CARDIAC RHYTHM? 
THE POSSIBILITIES ARE:
        1. NORMAL SINUS RHYTHM
        2. PACED-RHYTHM
        3. ATRIAL FIBRILLATION
        4. ATRIAL FLUTTER
        5. PAT
        6. PAT WITH BLOCK
        7. NON-PAROXYSMAL JUNCTIONAL TACHYCARDIA
ENTER SINGLE VALUE ====> 3
IS THERE REGULARIZATION OF VENTRICULAR RHYTHM? 
THE POSSIBILITIES ARE:
        1. PRESENT
        2. ABSENT
ENTER SINGLE VALUE ====> 2

Regularization of the rhythm would imply that the rhythm is probably actually non-paroxysmal junctional tachycardia, which is not treatable with digitalis and is usually a sign of digitalis toxicity.

WHAT IS THE VENTRICULAR RATE? 
ENTER NUMERICAL VALUE ====> 138
WHAT IS THE STATUS OF PVCS? 
THE POSSIBILITIES ARE:
        1. PRESENT
        2. ABSENT
ENTER SINGLE VALUE ====> 1
IS THE PATIENT REALLY HAVING PVCS OR ABERRANTLY CONDUCTED
SUPRAVENTRICULAR BEATS (ASHMAN BEATS)?
THE POSSIBILITIES ARE:
        1. PVCS
        2. ASHMAN BEATS 
ENTER SINGLE VALUE ====> 2

If they were actually PVCs they should be closely watched because an increase in the number of PVCs is an early sign of digitalis toxicity. At this point the program assesses present status of digitalis use, assesses the renal function, and starts checking for possible increased sensitivity to digitalis.

WHAT IS THE LEVEL OF SERUM POTASSIUM?
UNITS: MEQ PER LITER
ENTER NUMERICAL VALUE ====> 3.4
IS THE PATIENT BEING GIVEN OR ABOUT TO BE GIVEN DIURETICS? 
THE POSSIBILITIES ARE:
        1. GIVEN
        2. NOT GIVEN 
ENTER SINGLE VALUE ====> 2

Most diuretics would further deplete the potassium.

IT IS GENERALLY AGREED THAT PATIENTS WITH LOW SERUM POTASSIUM
LEVELS ARE MORE PRONE TO DEVELOP DIG TOXICITY. PLEASE WATCH
THIS PATIENT CAREFULLY AND ADMINISTER POTASSIUM SUPPLEMENTS.

The program continues to query the user about cardiomyopathy, possible acute myocardial infarctions (heart attacks), the oxygen level, thyroid condition and upcoming procedures (cardioversion, catheterization, and surgery). (These questions have been left out of the transcript to conserve space.) In this case the user indicates that cardiac surgery is anticipated in two days, requiring that the digitalis level be fairly low at the time of surgery. After asking all of these questions, the program prints the parameters it will use in making its recommendations.

PARAMETERS USED TO MAKE RECOMMENDATIONS:
REDUCTION FACTOR DUE TO HYPOKALEMIA ===> 0.800
TOTAL ALTERATION SENSITIVITY FACTOR ===> 0.800

The program will aim for four-fifths of what it normally would because of the low potassium level.

BODY STORE GOAL ===>                     0.551 MG

This is the desired average amount of digitalis in the body. It is computed using the basic requirements of the disease state (atrial fibrillation requires more digitalis than heart failure), the patient's weight, as well as the sensitivity, and in later sessions the response to the previous level of digitalis.

PROJECTED AVERAGE AMOUNT ON BOARD ====>  0.000 MG
PROJECTED CURRENT AMOUNT ON BOARD ===>   0.000 MG
BODY STORES SHIFT ===>                   0.637 MG

This amount would have to be given now for the average over the next 24 hours to be .551 mg.

REMAINING TIME TO REACH GOAL ===>       24.0 HRS.
HALF LIFE ===>                          55.8 HOURS   2.3 DAYS
DAILY LOSS AT EQUILIBRIUM ===>           0.164 MG

Once the goal is reached, this amount per day will be required to maintain that level.

ABSORPTION FACTOR ===>                   0.75

Unless the program discovers otherwise, it is assumed that 75% of an oral dose reaches the bloodstream. Next the program asks how often digitalis is to be administered (left out of the transcript), and being told "twice daily" it provides the following dosage recommendations:

THE DOSAGE RECOMMENDATIONS ARE:
DATE       TIME      ORAL              IV
11/10/79    9:00     .5 MG             0.4375 MG (OR 1.75 CC)
REPORT BACK AFTER THE FIRST DOSE.

The effects of the first dose will be assessed and adjustments made in the schedule.

           21:00     .25 MG            0.125 MG (OR 0.5 CC)
11/11/79    9:00     .0625 MG          0.0625 MG (OR 0.25 CC)
HOLD THE DIGITALIS FOR THE CARDIAC-SURGERY.

The digitalis should be held until the surgery has been completed.

Fig. 1.  Transcript of an interaction with the Digitalis Program.
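The body-store figures printed in Figure 1 are consistent with a simple one-compartment model with first-order elimination. The following is a minimal sketch under that assumption; the source does not give the program's formulas, and the function names are hypothetical. Subtracting the amount already on board is likewise an assumption (it is zero in this session).

```python
import math


def elimination_rate(half_life_hours: float) -> float:
    """First-order elimination constant k (per hour) for a given half-life."""
    return math.log(2) / half_life_hours


def body_stores_shift(goal_avg_mg: float, on_board_mg: float,
                      half_life_hours: float, hours: float = 24.0) -> float:
    """Amount to add now so the AVERAGE body store over `hours` equals the goal.

    Assumes exponential decay A(t) = A0 * exp(-k t); the average of A over
    [0, T] is then A0 * (1 - exp(-k T)) / (k T).
    """
    kt = elimination_rate(half_life_hours) * hours
    average_fraction = (1.0 - math.exp(-kt)) / kt
    return goal_avg_mg / average_fraction - on_board_mg


def daily_loss_at_equilibrium(goal_avg_mg: float, half_life_hours: float) -> float:
    """Drug eliminated per day when the average body store sits at the goal."""
    return goal_avg_mg * elimination_rate(half_life_hours) * 24.0


# Values printed in Figure 1: goal 0.551 mg, nothing on board, half-life 55.8 h.
shift = body_stores_shift(0.551, 0.0, 55.8)
loss = daily_loss_at_equilibrium(0.551, 55.8)
print(f"{shift:.3f} mg, {loss:.3f} mg")  # 0.637 mg, 0.164 mg
```

Under these assumptions the sketch reproduces the BODY STORES SHIFT (0.637 mg) and DAILY LOSS AT EQUILIBRIUM (0.164 mg) lines of the transcript; it says nothing about how the program arrives at the body store goal itself or at the rounded dose schedule.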

On every further interaction between ANNA and a particular patient's case, the PSM is updated and modified to indicate the digitalis history of the patient, his tolerance for and response to various levels of the drug, planned surgery or other procedures which require an adjustment to digitalis therapy, etc. In each such session, the patient's response is summarized along two dimensions, therapeutic benefit and toxic manifestations, and the recommended dosage is modified based on this summary. Thus, for example, if the patient shows no toxic response and only partial or no therapeutic response, the program records the current digitalis level as minimal (given other current parameter values) and raises the goal serum level of the drug. In the worse case of definite toxic response and partial therapeutic response, the program suggests:

  1. immediate steps to treat the toxic manifestations (if possible), 
  2. obtaining a serum digitalis level (if not already available), 
  3. that any of the patient's sensitivities to the drug be reduced (e.g., by correction of a low serum potassium level), 
  4. that digitalis administration be discontinued until the toxicity abates, and 
  5. that alternative therapies be considered. 

ANNA also records that toxicity occurred at the current level of digitalis and uses this as a cap on maximum future dosage.
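The two feedback cases described above can be sketched as a small update rule. This is an illustration only: the class, the field names, the response labels and the 10% step by which the goal is raised are all assumptions, and the cases the text does not describe are deliberately left unhandled.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FeedbackState:
    goal_level: float                      # current goal for the drug level
    minimal_level: Optional[float] = None  # level recorded as giving too little benefit
    toxic_cap: Optional[float] = None      # level at which toxicity was observed
    advice: List[str] = field(default_factory=list)


TOXICITY_ADVICE = [
    "take immediate steps to treat the toxic manifestations (if possible)",
    "obtain a serum digitalis level (if not already available)",
    "reduce the patient's sensitivities to the drug",
    "discontinue digitalis until the toxicity abates",
    "consider alternative therapies",
]


def assess(state: FeedbackState, current_level: float,
           therapeutic: str, toxic: str, step: float = 0.10) -> FeedbackState:
    """Update the state after one session; `step` is an assumed increment."""
    if toxic == "none" and therapeutic in ("none", "partial"):
        # Too little effect and no toxicity: note the level, raise the goal.
        state.minimal_level = current_level
        raised = state.goal_level * (1.0 + step)
        if state.toxic_cap is not None:
            raised = min(raised, state.toxic_cap)  # never exceed a known toxic level
        state.goal_level = raised
    elif toxic == "definite" and therapeutic == "partial":
        # Definite toxicity: remember the level as a cap and issue the advice.
        state.toxic_cap = current_level
        state.advice.extend(TOXICITY_ADVICE)
    else:
        state.advice.append("case not covered by this sketch")
    return state
```

Keeping the cap in the state means that a later session which would again raise the goal cannot push it past the level at which toxicity was seen.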

Computational Capabilities

An important share of the reasons for building ANNA came from possibly interesting computational problems in its design and implementation. According to its author, the desired components of the system were [29]:

  1. computation facilities for performing the mathematical calculations of the digitalis pharmacokinetic model, 
  2. model-tailoring facilities for maintaining the PSM, 
  3. explanation capabilities for providing accounts of the program's reasoning that are meaningful to intended users, and 
  4. extensibility to permit changes to the program in an orderly manner, without the need for complete re-design or re-implementation. 

At the time, progress in the design of programming and representational languages in Artificial Intelligence promised to provide methods for achieving elegant versions of components 2-4.

Representation. It is a truism of Computer Science that all programming languages are capable of expressing the same computations, in principle. Nevertheless, progress in computing often relies on the recognition of important generalizations and their incorporation into a programming language or system. For example, an idea in computing such as list processing has greatly simplified the conceptual view of program designers and implementers when they work with data that is fruitfully viewed as lists. Identifying such key ideas is important because large applications cannot even be conceived if the conception must be in terms of very low level, unimportant details. Realizing that the computer actually implements lists as chains of indices in some vector is quite irrelevant and simply gets in the way of most uses of list structure. Also, programming is a difficult, time-consuming task. If a general technique useful in a number of programs can be built "once and for all" in a general way, then future programmers can use that technique without once again programming it.

Minsky's seminal paper on the organization of knowledge within a computer [13] suggested a natural approach to implementing the PSM. One view of a frame (Minsky's basic memory structure) is as "a collection of the most serious problems and questions commonly associated with [a scenario structure]." [13, p.37] The PSM was implemented as a memory structure capable of answering those serious problems and questions associated with the patient's treatment: What toxic effects should be expected? What changes would represent improvement? What is the maximum serum level we have ever safely reached? What special sensitivities does the patient exhibit?
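One way to picture the PSM as a frame is as a record whose slots and methods answer the questions listed above. The sketch below is a loose modern illustration, not the structure used in ANNA; every field and method name is hypothetical, and the example values are taken from the session in Figure 1.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PatientSpecificModel:
    reason_for_digitalis: str
    # "What toxic effects should be expected?"
    expected_toxic_effects: List[str] = field(default_factory=list)
    # "What changes would represent improvement?"
    improvement_criteria: List[str] = field(default_factory=list)
    # "What special sensitivities does the patient exhibit?" (name -> dose factor)
    sensitivities: Dict[str, float] = field(default_factory=dict)
    # History of levels reached without toxicity, one entry per session.
    safe_levels: List[float] = field(default_factory=list)

    def record_safe_level(self, level: float) -> None:
        """Note a level that was reached without toxic signs."""
        self.safe_levels.append(level)

    def max_safe_level(self) -> Optional[float]:
        """"What is the maximum serum level we have ever safely reached?"""
        return max(self.safe_levels) if self.safe_levels else None


psm = PatientSpecificModel(
    reason_for_digitalis="atrial fibrillation",
    expected_toxic_effects=["increase in the number of PVCs"],
    improvement_criteria=["lower ventricular rate"],  # assumed criterion
    sensitivities={"hypokalemia": 0.8},
)
```

The point of the frame view is that later sessions ask the model these questions directly instead of re-deriving the answers from raw patient data.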

Explanation. Explainability of a program's detailed behavior is important for a number of reasons. Ultimately, the users of a program such as the Digitalis advisor will not be able to read the code of the program (or to grasp it all, even if they can read it). Yet users need to be able to question both the general competence of the program and its specific conclusions if they are to have faith in its recommendations. We reject the model of the program as oracle; it is not a model of any current acceptable style of medical consultation, and it would very likely be defeated by users' opposition. In the short term, an emphasis on explainability encourages the program designer to choose appropriate modular decompositions of the task, so that individual modules can be separately explained and understood. Indeed, the inclusion of capabilities for explanation has aided the discovery and correction of errors in the program.

Although ANNA did not actually provide the planned explanation capabilities, these were added to the later version of the digitalis advisor implemented by William Swartout [33], illustrated in Figure 2. Most of these explanations, as well as those provided by MYCIN [28] (the only other AIM program offering explanations), are based on the techniques used in Winograd's classic program SHRDLU [40], which first demonstrated high quality English language interaction between a program and its human users. Assuming that the program's structure is expressed as a set of goals which are implemented in terms of sub-goals, simple questions about why, how and when may easily be answered: Why is a step being taken? Because it is a sub-goal of some higher goal. How is a goal to be pursued? By working in turn on each of its sub-goals. When is a goal pursued? During the pursuit of its super-goal, after pursuit of the previous sub-goal. Such explanations rely on an orderly structuring of the program's tasks. These techniques apply to systems which express their procedures as production rules, goal trees, or simple stylized subroutines.
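The why/how/when scheme can be sketched as a walk over a goal tree. The code below is a minimal illustration of that idea, not Swartout's implementation; the class and function names are hypothetical, and the goal names are paraphrased from Figure 2.

```python
class Goal:
    """A node in a goal tree; children are kept in the order they are pursued."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.subgoals = []
        if parent is not None:
            parent.subgoals.append(self)


def why(goal):
    """A step is taken because it is a sub-goal of some higher goal."""
    if goal.parent is None:
        return f"'{goal.name}' is the top-level goal."
    return f"'{goal.name}' is a sub-goal of '{goal.parent.name}'."


def how(goal):
    """A goal is pursued by working in turn on each of its sub-goals."""
    if not goal.subgoals:
        return f"'{goal.name}' is carried out directly."
    steps = "; ".join(sub.name for sub in goal.subgoals)
    return f"'{goal.name}' is pursued by working in turn on: {steps}."


def when(goal):
    """A goal is pursued during its super-goal, after the previous sub-goal."""
    if goal.parent is None:
        return f"'{goal.name}' is the top-level goal."
    siblings = goal.parent.subgoals
    position = siblings.index(goal)
    if position == 0:
        return f"'{goal.name}' is pursued first during '{goal.parent.name}'."
    previous = siblings[position - 1].name
    return f"'{goal.name}' is pursued during '{goal.parent.name}', after '{previous}'."


check = Goal("check the cardiac rhythm")
ask_rhythm = Goal("ask the user the cardiac rhythm", check)
ask_rate = Goal("ask the user the ventricular rate", check)
print(how(check))
print(when(ask_rate))
```

Because each answer is read off the tree's structure, the explanations stay accurate only as long as the program's tasks really are organized as goals and sub-goals.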

Figure 2 shows how the digitalis program would answer a question about checking the cardiac rhythm during the initial session illustrated in the previous figure. This explanation is produced from a trace of the goal structure that was generated by the program. Each accomplished goal is given to the English generator which produces a sentence stating what was accomplished. The levels of the goals are reflected in the levels of the explanation. The resulting explanation specifies the steps the program went through to reach its conclusions.

The following is a transcript of the interaction when a user asks a question about how the Digitalis Advisor accomplished one of its goals during the session in the previous figure.

-> HOW DID YOU CHECK THE CARDIAC RHYTHM?

The user types a question to the explanation facility.

DURING THE CURRENT SESSION, I CHECKED THE CARDIAC RHYTHM BY
EXECUTING THE FOLLOWING STEPS: 

The program starts to enumerate the subgoals that had a part in accomplishing the goal to check the cardiac rhythm.

1. I ASKED THE USER THE CARDIAC RHYTHM. THE USER RESPONDED
THAT THE CARDIAC RHYTHM WAS ATRIAL FIBRILLATION.

This was the first question in the previous figure.

2. SINCE THE CARDIAC RHYTHM WAS ATRIAL FIBRILLATION I DID
THE FOLLOWING:
    2.1 I ASKED THE USER THE STATUS OF VENTRICULAR-RHYTHM
    REGULARIZATION. THE USER RESPONDED THAT THE STATUS 
    OF VENTRICULAR-RHYTHM REGULARIZATION WAS ABSENT. 

This is a subgoal that was conditional on the cardiac rhythm being atrial fibrillation.

3. SINCE THE CARDIAC RHYTHM WAS ATRIAL FIBRILLATION I ASKED THE
USER THE VENTRICULAR RATE. THE USER RESPONDED THAT THE
VENTRICULAR RATE WAS 138.
4. I ASKED THE USER THE STATUS OF PVCS. SINCE THE CURRENT VALUE
OF THE STATUS OF PVCS WAS PRESENT I DID THE FOLLOWING:
    4.1 I ASKED THE USER THE CAUSE OF PVCS. SINCE THE CARDIAC
    RHYTHM WAS ATRIAL FIBRILLATION AND THE CURRENT VALUE OF THE 
    CAUSE OF PVCS WAS ASHMAN BEATS I SET THE STATUS OF PVCS TO
    ABSENT. THE STATUS OF PVCS WAS PREVIOUSLY PRESENT.

This is because Ashman beats, which appear similar to PVCs, occur as a result of a mechanism not related to digitalis. Thus as far as the program is concerned there are no PVCs.

5. SINCE THE STATUS OF PVCS WAS ABSENT I DID THE FOLLOWING:
    5.1 I DID THE FOLLOWING SUBSTEPS:

The following steps set up variables which will be used in the rest of the program when the status of PVCs is needed.

        5.1.1 I SET THE NUMBER OF PVCS TO 0.00. THE NUMBER OF
        PVCS WAS PREVIOUSLY UNDETERMINED.

The number of PVCs is watched from session to session to distinguish between PVCs resulting from digitalis toxicity and those from other causes.

        5.1.2 I SET THE STATUS OF SALVOS OF PVCS TO ABSENT. THE
        STATUS OF SALVOS OF PVCS WAS PREVIOUSLY UNDETERMINED.
        5.1.3 I SET THE TYPE OF PVCS TO NONE. THE TYPE OF PVCS
        WAS PREVIOUSLY UNDETERMINED.

Fig. 2.  Explanation of the cardiac rhythm determination as generated by the program.

Extensibility. Easy extensibility of a program depends on the ability to incorporate new knowledge sources (rules, procedures, etc.) without modifying any of the existing program. The best solution to this requirement is based on pattern directed invocation of knowledge. In this scheme, any procedure (to choose a concrete example) has an associated pattern which identifies the goals to which the procedure can contribute. If, for example, the digitalis advisor has a goal of reducing the target dosage due to sensitivities, then any procedure whose pattern matches that goal will be run. To add an additional procedure then requires no changes to the existing program; the new procedure will be selected because its pattern matches the goal. Languages such as PLANNER and CONNIVER [32] provide this capacity as a fundamental mechanism. ANNA and some later versions of the digitalis advisor use such mechanisms to implement extensibility.
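Pattern-directed invocation can be sketched with a registry that pairs goal patterns with procedures. This is an illustration in a modern idiom, not PLANNER or CONNIVER and not the advisor's code: the goal strings, the glob-style matching, the 3.5 potassium threshold, the thyroid procedure and its 0.9 factor, and the choice to multiply contributions are all assumptions. Only the 0.8 hypokalemia factor comes from Figure 1.

```python
import fnmatch
import math
from typing import Callable, Dict, List, Tuple

Patient = Dict[str, float]
Procedure = Callable[[Patient], float]

# Each entry pairs a goal pattern with a procedure that can contribute to it.
REGISTRY: List[Tuple[str, Procedure]] = []


def contributes_to(pattern: str) -> Callable[[Procedure], Procedure]:
    """Decorator: advertise that a procedure serves goals matching `pattern`."""
    def register(proc: Procedure) -> Procedure:
        REGISTRY.append((pattern, proc))
        return proc
    return register


def pursue(goal: str, patient: Patient) -> List[float]:
    """Run every procedure whose pattern matches the goal; collect the results."""
    return [proc(patient) for pattern, proc in REGISTRY
            if fnmatch.fnmatchcase(goal, pattern)]


def total_factor(patient: Patient) -> float:
    """Combine contributions by multiplication (an assumption of this sketch)."""
    return math.prod(pursue("reduce-target-dose/sensitivities", patient))


@contributes_to("reduce-target-dose/sensitivities")
def hypokalemia_factor(patient: Patient) -> float:
    # 0.8 is the reduction shown in Figure 1; the 3.5 threshold is assumed.
    return 0.8 if patient.get("serum_potassium", 4.0) < 3.5 else 1.0


# Extending the program: a new procedure is added and nothing above is edited.
@contributes_to("reduce-target-dose/*")
def hypothetical_thyroid_factor(patient: Patient) -> float:
    # Placeholder value for illustration only; not a clinical figure.
    return 0.9 if patient.get("hypothyroid", 0.0) else 1.0
```

The second procedure is picked up purely because its pattern matches the goal, which is the property the text relies on: `pursue` and `total_factor` never name the procedures they run.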

Collaboration

Collaboration appears to be the sine qua non of research and development for creating sophisticated computer programs which perform interesting applications in medicine. Generally, computer scientists have little understanding of how building computer programs can benefit the practice of medicine. In fact, most computing development for the medical field has consisted of the extension of programming methodologies originally applied to other applications. For example, database and record keeping techniques are just as necessary in health care as they are in many business environments, and a computer scientist's natural inclinations are to extend just such techniques to the medical area.

Just as many computer scientists have a rather poor understanding of how their programs may be most useful to the medical community, most doctors have little knowledge of computers and virtually no idea how best to apply them. Doctors' reactions to computing tend to range from the science fiction optimism of those who anticipate nearly miraculous applications of micro-technology to diagnose and treat diseases to the pessimistic reaction of some who fear the machine as an oppressive, dehumanizing instrument which threatens both their professional worth and the patient's good care.

Through a number of joint M.D./Ph.D. training programs, a "new breed" of medical scientists who are expert both as doctors and computer scientists is being trained. So far, however, the number of such people is very small; the length of specialty training in two disparate areas is so long that it both discourages those thinking about entering the field and guarantees that even an enthusiastic response today will not be felt in the form of trained scholars for many years. Other interdisciplinary M.D./Ph.D. areas between medicine and engineering (e.g., biomedical engineering) have attracted students in the past, so the long-term prospect for the training of people with expertise in both areas is good. For at least the short term, however, collaboration is essential.

Collaborative work between doctors and computer scientists is not by any means easy. As outlined above, their respective spheres of knowledge are sufficiently far apart that even with the best of intentions, it is difficult for each to communicate his views in a comprehensible way to the other. Each must learn to deal with a whole new vocabulary which the other takes for granted. A phrase like "the distal part of an and/or tree" (actually occurring in some of our discussions) illustrates the mixture of these vocabularies--a mixture that either an untrained doctor or computer scientist would find puzzling. Months of intensive exposure to the language of the other are typically needed before a discussion can move beyond constantly having to define terms. Joint research involves the development and sharing of common ideas, and this is not possible until the computer scientist can sit through a discussion of a relatively complex medical case and concentrate on the decision-making criteria being exposed rather than on his ignorance of the medical content of the case. Similarly, the physician must be able to understand the principles of organization used in a computer program before he can intelligently critique its strengths and failures. This need for investigators from different spheres to "grow together" is the greatest impediment to the establishment of a successful collaborative effort, and explains in large part the relatively small number of workers in this and similar fields.

As secondary problems, a number of cultural difficulties also intrude on the collaborators. These range from the substantive to the silly. In our projects, a continuing conflict surfaces repeatedly over the proper relationship between medical doctors and computer science graduate students. Initially, the doctors viewed our students in the same way they see medical students--as young people who must be taught a large number of facts and techniques in a short time. Research does not play a large role in the training of medical students. Computer science doctoral students, however, do not fit that model well. By the time they are pursuing their thesis research, there is little factual knowledge we have yet to teach them; mostly, they need to work on good research problems. Currently, our graduate students are called "fellows," which reflects a higher valuation in the medical hierarchy and allows them to be accepted as equals within the project. At the other extreme, unimportant issues like style of dress and formality of manner have sometimes complicated the collaborative work. Doctors are generally people-intensive, sensitive to the image they project to their patients and their community. Computer scientists are often loners whose first devotion is to the machine, which cares little about their appearance or manner. The early-morning meeting between the doctor in coat and tie preparing for a day of visits with patients and the computer scientist in disheveled clothing coming off a long night's vigil debugging a program is an ideal theme for a comedy routine--and sometimes a focus of conflict.

Childhood

From the moment an AIM program first "turns over," a great enthusiasm envelops the developing team. The physicians involved, often left out of the technical side of program development, begin to see the fruits of months of problem formulation, soul searching about the best medical approach to difficult problems, and difficulties teaching the computer scientists about the medical domain. The program now does something, embodying in an active way the medical expertise elucidated for it. Especially within the AIM methodology, the program is seen to exhibit some tentative steps toward an artificial intelligence. The temptation is strong to succumb to anthropomorphic models: "It says to give .125 daily." "It thinks the patient is getting toxic."

Initially, the program is terrible. It is still full of programming errors. The novel methods developed for expressing its knowledge base have not been fully tested--they often fail to produce the expected answers or simply "break" on a badly-formed programming construction. The knowledge base of the program is extremely incomplete. Often, elation comes to the developers when the program can limp through a single, uncomplicated case. The programmer feels that now, getting the rest working is "simply a matter of more programming"--a phrase known to have grave consequences in all areas of computer development. The physicians are anxious to try out the program, but quickly become frustrated by its strict limitations. Elation is easily followed by a period of disappointment when everyone realizes that that simple matter of more programming will actually involve several months, and the program cannot be used in any interesting way until those months have passed.

Eventually, an initial version of the program is complete and free of the programming errors that would make it unusable. Then the really important investigation begins: How do the models developed for the project actually work out in practice? 

One major advantage of the AIM methodology comes into play here. The program is seen as an expression of human expertise. Its behavior, in detail, is intended to correspond to the behavior of a medical expert in the domain. Thus, a natural mode of investigation is often followed: A physician creates or selects a case to illustrate some point of interest, the case is run in the presence of a computer scientist and a physician, and discrepancies between the expert's expectation of what the program should have done and what it did are traced down to specific flaws in the program's knowledge. (This technique is elegantly utilized in the work of Davis [2], where part of TEIRESIAS interacts directly with the physicians to allow them to identify and correct the program's misconception.) These "debugging" sessions often last far into the night, as the correction of each error enables the detection of another. The sessions can also be tense, because substantive medical questions must often be (at least tentatively) settled quickly.

Of course, the process of knowledge accretion does not proceed monotonically. Often a correction introduced to handle one difficult case destroys the program's previously-correct performance on another. At other times, a set of corrections appears difficult to implement and suggests the need for some new program mechanism to capture the previously unseen commonality among the tools in use. At times, the medical knowledge incorporated in the program may fall under suspicion and lead to a careful review of the literature and discussion with other physicians. In any of these circumstances, weeks (at least) may pass before the program is back in operation, and progress toward the ultimate goal seems painfully slow. After sufficient exercise with challenging cases invented by members of the development group, the program needs to be challenged by cases which are less contrived--more reflective of the ultimately intended domain of application. In the case of ANNA, one of our students started to spend much of his time at New England Medical Center Hospital (NEMCH--where our physician colleagues practice), collecting especially interesting cases and trying the program out on them. Difficulties with these led to further testing and modification, until we were all reasonably satisfied with the overall performance of the program.
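The hazard noted above, that a correction made for one case can undo correct behavior on another, can be illustrated with a small sketch in which every settled case is stored with its agreed answer and re-run after each change. The sketch is only illustrative: the case fields, the advice strings, and both rule functions are hypothetical toy stand-ins and do not represent the logic of the digitalis program.

```python
def run_regression(advise, cases):
    """Re-run every stored case and return the names of those whose
    output no longer matches the answer previously agreed on."""
    failures = []
    for name, (inputs, expected) in cases.items():
        if advise(inputs) != expected:
            failures.append(name)
    return failures


# Hypothetical toy rules, invented purely for illustration.
def original_advise(case):
    return "reduce dose" if case["toxic_signs"] else "continue dose"


def revised_advise(case):
    # A careless "fix" for a new renal case that forgets the toxicity rule.
    return "reduce dose" if case["renal_impairment"] else "continue dose"


# Library of cases already settled with the physicians (toy data).
cases = {
    "case_1": ({"toxic_signs": False, "renal_impairment": False}, "continue dose"),
    "case_2": ({"toxic_signs": True, "renal_impairment": False}, "reduce dose"),
}

print(run_regression(original_advise, cases))  # no failures expected
print(run_regression(revised_advise, cases))   # the revision breaks case_2
```

Running the whole library after every change makes a regression visible immediately, rather than weeks later when the affected case happens to be tried again.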

The First Evaluation

In the Winter of 1976, we felt ready to try the program on a somewhat more formal basis. For one month, the histories and progress notes of every patient who was receiving Digitalis on the Cardiology Service at NEMCH (except those on a stable maintenance schedule) were run through the program. Fifteen of the nineteen patients followed by the program had an increased sensitivity to digitalis, which would have made the techniques used by earlier programs inappropriate. Four of the patients manifested toxic signs during the month. In each case, the program detected the early development of toxicity before the physicians handling the cases did, and in each case the program had recommended lower doses than what had actually been given. The program did not err by concluding definite toxicity in any of the fifteen non-toxic patients, though in general the program's dosage recommendations were somewhat more conservative than those of the clinicians [5]. In reporting the conclusions of the study, Gorry et al. suggest that

[t]his trial does not establish that this program performs more expertly than a cardiologist would; it only demonstrates that the use of such programs might be utilized to distribute knowledge about digitalis therapy to settings in which cardiac consultation may not be readily available ... Further evaluation of this program is clearly required. The design of such trials will be difficult since they must put no patient at an increased risk of toxicity. The next step will be to use this program in a wider spectrum of patients in other settings (e.g., on a general medical service, on a surgical service, in an outpatient clinic) and have the program's performance compared to the performance of clinicians by a panel of cardiologists. [5, p.459]

Certainly to its developers, the program had been successful. It had passed a necessary milestone, demonstrating that the hours of design and iterative improvements had resulted in a program which performed adequately in a limited but fair trial of its capabilities. The next step, to try out the program in a broader setting, would be delayed by several years because of changes in personnel, a new emphasis on the ease of use of the program, and a re-evaluation of the underlying implementation methodology. In a move not unusual in our experience, Silverman left our group to enroll in medical school, having become fascinated by the practice of medicine through his exposure in this project. Gorry also moved to a more medically-oriented setting, to head a program in health management at the Baylor College of Medicine.

Just how good was the program at this time, compared to the ultimate capabilities that eventual clinical use would demand? What remained to be done? With the benefit of hindsight, we may now examine where it stood. Because its medical competence had (appropriately) received our greatest attention, that was certainly the best-developed facet of the program. Although later experiences would show the need for some substantial revisions of individual parts of the medical knowledge used by the program, its essential structure has remained stable. Thus, by the end of its childhood, the program had matured enough to form a viable continuing basis for future improvements. Naturally, we would also discover medical areas in which the program's coverage was incomplete; fortunately, these also proved to fit well within the structure of the existing program. Other areas of development were far less complete. The program's interface to the user was rather primitive, discouraging its use by any but its most enthusiastic advocates. Its implementation was inefficient and cumbersome because much of the generality originally designed into the program had turned out to be unnecessary when a better understanding of the medical domain had suggested more straightforward program structures. Although facilities for explanation had been part of the original design, lack of time and technical difficulties of implementation had thus far kept them from fruition. At the end of its childhood, then, the digitalis program faced two needs: to evaluate further its depth and breadth of medical knowledge, comparing it to the capabilities that would be needed for the program's clinical application; and to alter and complete the implementation--to speed up the program, to make its internal structure more elegant and accessible to explanation, and to provide a much improved user interface.

Adolescence

The adolescent stage of a program's development begins after it has been recognized as a viable, working representation of medical knowledge. The process of maturing involves extension, revision, and adjustment to meet all of the possible situations to be faced by a program used as a medical tool rather than merely as part of a research project in medical knowledge organization. Adolescence can be characterized as a time of preparation for reality. The adult world has no filters to weed out unusual situations or even occasional errors. It requires speedy responses, appropriate questions asked in the right sequence and manner, complete coverage of the intended field of application, and the ability to explain and justify the program's answers. In this section we first examine some of the issues of testing the program's performance against the real world and we then describe the changes in medical models, user interface, and implementation methods that the digitalis program has undergone.

Assessing Reality

The first question to be faced when a program enters into its adolescent period is "What is reality?" Obviously, when the initial design of the program was set out, an effort was made to define, explore, and understand the problem area. Now the needed effort is a little different. The program is in hand and must be made to fit the problem area more closely than it already does. To do this the problem area is explored again, in terms of the program. There are many different kinds of evaluation and typically there will be several during the adolescent period, each a little more formal and extensive than the previous. The first kind of evaluation exposes the program to a diverse sample of cases to test its capabilities. These cases may be gathered casually, as they were during the program's childhood development, or their selection may be carefully planned to test various capabilities of the program. The earlier evaluations usually only test the medical knowledge of the program, but later evaluations (perhaps in the form of field tests) evaluate all aspects of the program. Evaluations are very revealing because they expose the program to the non-textbook cases, including the fuzzy data, differences of interpretation, confusing unrelated signs, and non-standard starting points. 

There are a number of examples of adolescent stage evaluations represented by the digitalis program and the programs covered in other chapters of this book. The simplest kind of evaluation is just to gather a set of interesting cases to try out on the program. The CASNET/Glaucoma effort did this: 

Early in our work, we collected a sample of 40 difficult cases. Initially, the program did not classify (diagnose) all cases correctly. However, as our model improved, it was soon able to diagnose the 40 cases correctly. This result demonstrated, at a relatively early stage, that our approach did provide an incremental means of improving the program's performance. We became confident that poor or inaccurate conclusions could be corrected, that cases diagnosed correctly would remain correct, and that diagnostic and therapeutic recommendations could be improved. [11]

The Digitalis program, MYCIN and INTERNIST each went through similar stages of development. The more extensive the medical knowledge of the program, the longer it takes to test the program effectively with cases. INTERNIST, which has as its domain the diagnosis of all of the diseases in internal medicine, is still in the process of being evaluated by sample cases chosen by the developers for their difficulty. Since INTERNIST does not yet cover all of the diseases, this is a reasonable way to test incrementally the knowledge that it has.

There are typically a number of times during adolescence when it is appropriate to evaluate a program. From the simple evaluation described above one can go either to a more extensive gathering of cases or to a more controlled evaluation of cases. The Glaucoma project chose the former approach. In 1974 they established a network of researchers for the purpose of giving a wider audience access to the program for more extensive testing and assessment. Then at a recent meeting of the American Academy of Ophthalmology and Otolaryngology, they set up the program and ran it on all of the cases presented to it. The results of these evaluations have been favorable and have pointed out a number of facts about the real world of medicine that will have to be faced by all programs. One interesting advantage of broad exposure occurred at a recent trial of the glaucoma program in Japan. The program was given a case involving a characteristically Japanese form of glaucoma unfamiliar to the developers. It was unable to diagnose the case properly, but because of this exposure the appropriate information has been added to the program--knowledge that most Western doctors would not have. 

More controlled kinds of evaluations have taken place for both the digitalis program and MYCIN. The digitalis program has recently undergone an evaluation utilizing fifty cases from the Veterans Administration hospital in Houston. This evaluation differed from the previous one in a number of respects. First, the cases included all of the patients taking digitalis in the cardiac intensive care unit during a period of time in the fall of 1977 (except for a number for whom records were unavailable). As a result of not eliminating the "mundane" cases, as we had in the 1976 evaluation, we exposed an unanticipated difficulty in handling patients already on maintenance schedules of digitalis. (In patients being treated for something unrelated to their digitalis needs, the program would inappropriately attempt to overturn a long-established pattern of maintenance digitalis use to try to treat residual signs of the past disease for which digitalis was being given. Some of these signs are permanent--e.g., some degree of residual cardiomegaly--and are judged inappropriate to treat.) Because the sample of cases was more extensive, there were also situations that the program did not fully appreciate, such as the limitations on toxic responses of patients with pacemakers. The evaluation was conducted blind to keep the evaluators (four expert cardiologists from different hospitals in the Houston area) from introducing whatever biases they might have for or against computer programs. At each decision point in the case the evaluators were given two possible therapy recommendations (one actually given by the attending doctor and the other recommended by the program), and asked to rate the two and to indicate what they would do. In this way it is possible to see how the expert rates both the treating physician and the program and how the experts differ among themselves. The results of this study are currently being evaluated and prepared for publication.
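The text does not say how the two recommendations were laid out for the evaluators. One plausible mechanism for keeping such a comparison blind is sketched below: the pair is shown under neutral labels in random order, and a separate key, withheld from the evaluator, records which label belongs to which source. The function names, the labels "A" and "B", and the example dose strings are all hypothetical.

```python
import random


def blind_pair(physician_rec, program_rec, rng):
    """Return (shown, key). `shown` maps the neutral labels A and B to the
    two recommendations in random order; `key` records the true source of
    each label and is withheld from the evaluator."""
    items = [("physician", physician_rec), ("program", program_rec)]
    rng.shuffle(items)
    shown = {"A": items[0][1], "B": items[1][1]}
    key = {"A": items[0][0], "B": items[1][0]}
    return shown, key


def unblind(ratings, key):
    """Map ratings given per label back to the source that produced them."""
    return {key[label]: score for label, score in ratings.items()}


rng = random.Random()  # seed it for a reproducible presentation order
shown, key = blind_pair("0.25 mg daily", "0.125 mg daily", rng)
ratings = {"A": 4, "B": 2}    # scores an evaluator might assign to A and B
print(unblind(ratings, key))  # scores re-attached to physician / program
```

As the chapter observes later, randomizing the order does not by itself guarantee blindness: a recommendation may still be recognizable from its form, so both must first be abstracted into a common format.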

MYCIN has also been the subject of a more formal evaluation [42] on fifteen patients (also consecutively selected) using ten experts, five from Stanford and five from other places. This study was not conducted blind. It was felt that to evaluate the program properly, both the diagnosis and the therapy needed to be judged. Thus, the experts were given an abstract of the case (and any other information they might want to know not contained in the abstract was made available) and asked to judge the program's results concerning the significance of the bacterial organism, the identity of the organism and the therapy selection. The results of this study, which were quite encouraging, are reported in [42].

Beyond these evaluations lie several more demanding kinds once the programs have passed the earlier hurdles. Just as physicians in training are given greater responsibilities with lesser oversight from their seniors as they progress from medical student status through internship and residency to fellowships and clinical practice, so a program should be evaluated in successively more independent modes of action. The next logical step is an evaluation in which the program is asked to provide a consultation whenever its user makes any treatment decision. The results of that consultation are, however, filtered by an expert consultant who takes responsibility for their reasonableness. In this mode of evaluation, the consequences of the program's actions can be gauged more readily than in purely retrospective studies. When the user accepts the program's advice, its efficacy can be determined from actual outcome in the patient. When the advice is rejected, the reasons for its rejection help determine the deficiency in the program. The importance of this kind of testing depends somewhat on the kind of program involved. For the digitalis program or any other program largely dependent on a feedback model to adjust its recommendations on the basis of the results of previous recommendations, this testing is very important. For programs that do more static determinations without depending on previous results, this testing is only marginally better than using past cases. 

As a program approaches use in the adult world, there will also be independent prospective evaluations and field tests, representing stages none of the AIM programs have yet reached. Prospective evaluations represent an important step because in that case the program is being utilized to provide the primary care for some group of patients. Thus, it is a test to see if the program is really ready for adulthood. 

The evaluations conducted on existing AIM programs have pointed out a number of features of the world of medicine that these programs will have to face. First, there are biases and differences of opinion that enter into any evaluation. The MYCIN developers discovered a definite institutional bias, where MYCIN reflected the Stanford approach and some of the outside experts differed with the results which were considered acceptable by Stanford physicians [42]. There is a similar problem in the domain of digitalis therapy, as clinicians from different institutions tend to make different judgments in controversial cases. We have also noted different degrees of conservatism in digitalis use at different institutions.

Another kind of bias is against computer programs (or perhaps against any formalism claiming to have correct answers). When the program is distinguishable from the human practitioners against whom it is compared, it may be unfairly penalized for failing to meet expectations that would not be demanded from a human expert. There is some evidence that both MYCIN and the Glaucoma program have experienced this. In one MYCIN evaluation the subjective reaction of the experts was that too few questions were asked by the program--they suggested an average of seven additional questions per patient, although there was no consensus among them on the desired content of the missing questions. A few also thought that some questions were extraneous, again without any consensus [42]. In a trial of the Glaucoma program each doctor who tried out the program was asked to fill out a survey form which included questions to indicate the applicability to glaucoma research and the importance to health care. It was interesting to see that while 77% judged the clinical proficiency to be at the very competent or expert level and 71% judged it to be very applicable for research, only 45% judged it to be very important for health care [11]. Although we have trouble interpreting just what "very important" means, it would seem perhaps that doctors want a considerably higher standard of accomplishment from programs than they demand of themselves. In evaluating the Digitalis program, we have attempted to suppress this source of possible bias by similarly abstracting the behavior of the program and the physicians into a common format, thus hiding such irregularities as style of expression, etc. This attempt has been only partly successful, as we have discovered that some of the program's dosage schedules (although they are appropriately given in terms of available pill or bolus sizes) are recognizably not human-generated.

Another problem that arises in evaluations and in the real world is the needed recognition and delineation of the limits of a program. Much of the difficulty experienced by the Glaucoma program was a result of doctors trying out cases on it for which a significant part of the diagnosis involved diseases other than Glaucoma [11]. This will probably always be a problem because the areas of medicine are so highly intertwined that there will always be situations that stretch outside a program's expertise. Also, there is often a need to evaluate more than just the final results of a program. The programs reach intermediate conclusions, producing diagnoses and alternatives in addition to therapy recommendations. Eliminating biased judging of such auxiliary products is an especially challenging problem in testing methodology.

From the simple evaluations of sample cases, to the exposure to various research environments, to the more formal evaluations judged by uninvolved experts, the programs are exposed to increasingly realistic facets of the world they will have to survive in. Whatever kind of evaluation might be conducted, it gives an indication of the difficulties that the program will have to face upon entering into the adult world. However, because of the complex nature of the programs and the users' interactions with them, many of the lessons are learned by going through the process of evaluation rather than from the results of the evaluation. The process of opening the program to the world causes questions to be generated about the medical knowledge, about the interactions with the user and about the way the program is implemented. These in turn spur additional rounds of growth.

Medical Knowledge

As a program is growing in its mastery of the medical knowledge of its field, there are several kinds of changes taking place. First, the program develops an explicitly defined area of specialization that is often broader than what was originally implemented. The initial program usually handles an interesting, important, but small set of problems. The adult program's domain should correspond to some existing medical domain of expertise, and the program should be able to handle all of the problems within those limits. Both the breadth of expertise and its explicit definition are important: one to assure that the program can do enough to make its use worthwhile and the other to assure that potential users can determine for which situations its use is appropriate. Secondly, the models of medical knowledge are adjusted in light of a better understanding of the needs of the domain to reflect the appropriate levels of detail to be used in each model. In the critical parts of the problem the models may need to be more sophisticated, while in peripheral parts a simpler one may replace a complicated model which slows the program down without adding appreciable benefit. Thirdly, through added medical knowledge, the program exhibits greater competence. With more exposure to actual cases, the probing of experts and the literature, the number of medical facts that are embodied in the program grows. Finally, the program in one way or another is made to deal with the existing legitimate differences in medical opinion within its domain. 

Completeness. The area in which a medical program is probably judged most harshly is on its completeness. The tendency of doctors in judging a program, since there is no way that they can completely explore what it can do, is to try out a limited number of what they consider to be hard cases. It should be noted that a doctor's conception of a hard case may be different from what is hard for the program. Problems involving complex but precise models are easy for programs and difficult for people. Problems involving common sense are hard for programs and easy for people. If the program can do these hard problems, physicians feel it to be competent; otherwise they will have nothing to do with it. The Glaucoma program was judged inadequate by one of its users because it failed to diagnose a case involving a disease other than glaucoma. Thus, it is very important to set the limits of claimed competence of the program emphatically and publicly. The limits should also be ones that are easily definable before anyone tries to use the program. There is no stigma attached to the program if the user knows that, for example, it does not handle cases of children under the age of twelve; the user will be less accepting if the program's last question is the patient's age and the whole interaction is repudiated only at the final step. 

As we have mentioned before, the AIM methodology is particularly attractive for addressing problems of completeness. Whereas other popular approaches such as statistical or pattern-matching ones innately admit some small fraction of erroneous conclusions as necessary consequences of the technique, the AIM approach encourages the inclusion of additional (perhaps specialized, situation-specific) knowledge to correct the program's behavior. In principle, each erroneous conclusion by the program is subject to correction. Naturally, this approach is only successful if the corrections can be based on a systematic understanding of the problems being encountered--a long succession of ad hoc fixes will lead to an intractable program. Nevertheless, the encouragement of an emphasis on "debugging," to get each case correct, is a strong push toward the possibility of achieving realistic completeness within a domain.

Level of model detail. The medical knowledge in a program can be thought of in terms of models. These models need to be at the appropriate level of detail required to fulfill the true needs of the program without including so much detail that the program bogs down in consideration of unimportant matters. For consistent reasonable behavior, the various components of the program must each be developed to approximately the same level. This reflects a classical engineering issue, the avoidance of over-design of certain components of a system. For example, the usual model of renal excretion of digitalis is a classical one tank model: input to the tank represents the drug coming into the body, output from the tank represents the drug leaving, and the quantity in the tank represents the amount of drug in the body. This is only one of many possible models one might imagine. There is evidence in the literature to suggest that the relationships between the digitalis dosage, the serum digitalis levels and the renal excretion rate are better represented by a two compartment model [23], with one compartment representing the blood level and one compartment representing the tissue level. On the other hand, there is evidence to suggest that from a statistical viewpoint the simplest model consistent with all of the data is a three compartment model [31] with "shallow" and "deep" tissue compartments not corresponding to particular anatomical entities. Beyond this, there are more detailed renal models that break down the renal excretion into glomerular filtration and tubular secretion [30], which may be separately influenced by different agents. Thus, for this one aspect of the Digitalis program alone, there is a whole range of possible models that might be used. The correct model depends on the rest of the program. A model with too little detail would not be able to account for and respond appropriately to all of the important influences from the rest of the program. A model with too much detail slows the program, adds requirements for data that may not be available, and provides a degree of resolution that may have negligible influence on the rest of the program. Because the appropriate model is dependent on the rest of the program, it may not be possible to tell at the beginning of a program's life what level of detail will be proper for the adult program. During adolescence the models of the program will be developed and adjusted to find the appropriate level of detail for all of the models.
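To make the one tank model concrete, the sketch below tracks the amount of drug in the body from day to day. It rests on assumptions that the text does not state: elimination is first-order (the amount leaving per day is proportional to the amount in the tank), one dose is given per day, and each dose is absorbed instantly and completely. The dose and half-life values are illustrative placeholders rather than clinical parameters, and the function names are hypothetical.

```python
import math


def simulate_one_compartment(doses_mg, half_life_days, initial_mg=0.0):
    """Return the amount of drug in the tank (mg) just after each daily dose.

    Assumed, not taken from the source: first-order elimination, one dose
    per day, instantaneous and complete absorption.
    """
    k = math.log(2) / half_life_days   # elimination rate constant (1/day)
    fraction_remaining = math.exp(-k)  # share of the tank left after one day
    amount = initial_mg
    history = []
    for dose in doses_mg:
        amount = amount * fraction_remaining + dose  # output, then input
        history.append(amount)
    return history


def steady_state_peak(dose_mg, half_life_days):
    """Post-dose amount the tank settles to under a constant daily dose.

    Solves A = A * f + dose for A, where f is the daily fraction remaining.
    """
    fraction_remaining = math.exp(-math.log(2) / half_life_days)
    return dose_mg / (1.0 - fraction_remaining)


# Illustrative values only: 0.25 mg daily, 1.5-day half-life.
history = simulate_one_compartment([0.25] * 30, half_life_days=1.5)
print(round(history[0], 3), round(history[-1], 3))
print(round(steady_state_peak(0.25, 1.5), 3))
```

A two or three compartment model would add further state variables and transfer rates between them, which shows the trade-off discussed above: each added compartment requires parameters and data that may not be available for a given patient.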

Competence. The primary responsibility of a program in the medical domain, or any domain for that matter, is to be competent. The program must handle the situations in its domain properly. This goal extends to all facets of the programming endeavor, since the program can be considered "correct" only if all of the parts are correct. With respect to the medical models in the program, competence means that they can be counted on to give appropriate answers when properly interpreted. This is related to the issue of completeness, but has a different emphasis and different implications. A program cannot be competent unless it is complete. It must be able to handle all of the situations in its domain. But, more than that, it must be able to recognize and deal with all of those situations as well as a competent human practitioner in the field. This is the goal; it may not be completely attainable. A doctor has five senses with which to perceive the patient; a program must rely solely on asking questions. In the questioning process it is easy to overlook areas that would be apparent to a doctor because of clues provided by his senses. To be truly competent a program must demand and accept sufficient data input to recognize any situations that would substantially affect its appropriate responses. This goal implies that domains which rest critically on clinical discriminations that are hard to describe are not yet ripe for program support. For example, much of the expertise of dermatologists is thought to lie in their ability to recognize subtle differences in abnormalities of the skin. Some of these distinctions are so incapable of verbal description that they must be "seen to be understood." To the extent that this characterization is correct, we would not think dermatology to be a fruitful domain for computational support of the kind we describe here.
In areas that are ripe, the program must have available to it sufficient information and utilize models sufficiently detailed to keep from making blunders. The only way to test the competence of a program is to test it on a wide range of real cases. Thus, the primary goal of evaluations either formal or informal during the adolescent period is to improve and establish the competence of the program.

Diversity of opinion. There will remain areas in most of the medical disciplines where there are honest differences of opinion among the authorities. This is a problem that is often overlooked. There are several possible ways to address it, any one of which might be satisfactory; for a program to receive wide acceptance, the issue will have to be faced. The simplest approach is just to represent a single consistent, widely held, supportable view and make that fact known to the user in the beginning. This approach has the advantage of keeping the program simple and making it possible to design the program around the approach of a single expert or group of experts. It is honest: after all, if that expert were called in on a consultation, it is that single opinion that the user would receive. It should also be emphasized that a consistent viewpoint must be represented. It often happens that while an effect may be known, its cause may be controversial. Through experience an expert has determined a consistent way to account for the effect and react to it. Thus he has developed for himself an effective way of dealing with the effect which might be destroyed by replacing part of the model with someone else's theory. An example of this phenomenon occurred in the Digitalis program. The original renal function determination was quite simple, only taking into account the serum creatinine and the sex of the patient. The program also included a sensitivity factor to digitalis based on age. When the renal function determination was changed to a more sophisticated model that also took into account age and body size, it became apparent that the sensitivity factor due to age, although previously achieving an effect similar to the new age-dependence introduced in the renal function model, would now cause older patients to be underdigitalized. Thus, to maintain a consistent viewpoint, this factor had to be modified.
Note that the principle--give less digitalis to older patients--is reflected in either view. However, one assigns impaired renal function in old age as the cause whereas the other assigns a more mysterious tendency of older patients to develop digitoxicity. The former is probably a better account for the known data, and has replaced the latter in our program. We had to recognize this underlying connection, however, to avoid the double correction for old age.
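
The double correction can be seen in a toy calculation. The sketch below is purely illustrative: the formulas and numbers are invented for the example and have no clinical validity, the function names are hypothetical, and dropping the age sensitivity factor altogether is just one simple way of "modifying" it.

```python
def age_term(age: int) -> float:
    """Toy age adjustment: 1.0 up to age 40, falling linearly after that."""
    return min(1.0, (140 - age) / 100.0)


def renal_factor_simple(creatinine: float) -> float:
    """Toy renal factor based on serum creatinine only (1.0 = normal)."""
    return min(1.0, 1.0 / creatinine)


def renal_factor_with_age(creatinine: float, age: int) -> float:
    """Toy renal factor that also declines with age."""
    return min(1.0, age_term(age) / creatinine)


def age_sensitivity(age: int) -> float:
    """Toy stand-alone sensitivity factor: less drug for older patients."""
    return age_term(age)


def dose_factor(renal: float, sensitivity: float) -> float:
    """Overall multiplier applied to the nominal dose."""
    return renal * sensitivity


age, creatinine = 80, 1.0
original = dose_factor(renal_factor_simple(creatinine), age_sensitivity(age))
double_counted = dose_factor(renal_factor_with_age(creatinine, age), age_sensitivity(age))
consistent = dose_factor(renal_factor_with_age(creatinine, age), 1.0)
```

In this toy case the original and the consistent versions both reduce the dose by the same amount for an 80-year-old, while the naive combination applies the age reduction twice and gives a markedly smaller dose, which is the underdigitalization described above.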

Another approach to the problem of different opinions is to provide different programs, each fitting the overall "style" of the user community that will be using it. The advantage is that wider use may be possible. The disadvantage is that more people would have to be involved in the development of the program to ensure competence for all of the different versions. The most ambitious approach would be to have the different theories represented in the same program and provide alternate suggestions when those views come into conflict. We know of no program in which this has actually been tried, but it may offer a reasonable way to handle this problem as well as a way to more readily resolve some of the disputes.

The User Interface

The program also grows to meet the requirements of the users. Medical programs have a much larger set of user requirements than most other kinds of programs: many potential users must be satisfied, the program may be used in a number of significantly different modes, input and output of data must be quite flexible, expected frequent changes in the knowledge of the program must be explainable to the user, and the problem being addressed by the program is often highly complex. Exactly what the users' requirements are depends on who the users of the program are and how the program is used.

Variety of users. Several different kinds of users must be satisfied. The most obvious are the physicians who come to the program wanting information helpful to the treatment of their patients. They are more demanding users than might at first be suspected. Physicians are not relinquishing responsibility for their patients; therefore not only do they want information, but they want the reasons why the conclusions were reached, both to convince themselves that the program is reliable and to incorporate the program's mode of reasoning into their own further considerations of the patients' cases. To complicate the situation, the program might be dealing with other kinds of medical personnel or only indirectly with the doctor. These situations imply different tactics to be used in interaction. A second kind of user is represented by those who want to make improvements to the program. It is often assumed that these are the computer scientists who wrote the program and that therefore they can change it without any trouble. However, the writing of a medical program is a collaboration between medical experts and computer scientists. Thus modifications of the program may involve either or both kinds of people. To require that the computer scientists always modify the program implies that medical knowledge will be filtered and reevaluated through them. It also ties the program to the computer laboratory even though the medical environment is the place where it will be used. On the other hand to have the medical experts make the modifications implies that they will have to know and understand the program. In particular it means that the program will have to be sufficiently transparent to make understanding from a medical viewpoint practical, and the mechanisms will have to be available to make possible modification and testing without a great deal of programming background. A third kind of user is the student. 
It has long been recognized that medical programs have a potential as teaching aids. This places further demands on the ability of the program to explain its data and the process of its logic, not only for specific cases but also in general, not from a programming viewpoint but from a viewpoint that will be consistent with the students' other medical training. 

Modes of use. Related to the problem of who the users of the program are is the way in which the program will be used. One of the considerations that must be addressed during adolescence is determining the possible modes of use the program will later have. The first assumption is usually that the program will be used as a consultant for doctors treating patients. This is only one of many possible modes in which a medical program could be used. For the Digitalis program, we have considered using the program as a remote consultant available by telephone (with a doctor reviewing the results), as a therapy checker running in the background of an automated hospital record system, as a consultant run by a nurse or para-medical personnel, as a research aid to provide a standard against which to compare options, or as a teaching aid. Other medical programs might also be used in any of these modes and possibly more. Probably a successful medical program will be used in more than one mode. Most are useful as teaching aids besides whatever other modes might be desirable. How the program will be used changes the requirements placed on it. For example, the most recent evaluation of the Digitalis program was run retrospectively on hospital cases. However, the program was designed as a consultant (as was illustrated in Figure 1). As a result it asks questions of the doctor, such as the rate at which the patient should be digitalized, which the doctor was not available to answer. To carry out the study, it was necessary to set up rules for answering these questions to maintain consistency. To use the program without the physician available, there would have to be logic to cover all such missing pieces of information.

The users and uses of a program will evolve during the adolescent period of the program. At the end of childhood, the primary users are just the computer scientists and the medical experts. As this shifts to the user community of the adult program, the program will have to be developed to be ready for the ultimate modes of operation. These considerations influence how the program should gather information from the user, how the program should inform the user and answer his questions, and the general operation and access properties of the program.

Data input. The program is at somewhat of a disadvantage when it comes to gathering input--a disadvantage that may result in a frustrated user. The problem is that the program must rely on the user for all of the information about the patient (unless there is some other computer accessible information available). This is not a problem that the physician has with a human consultant, because the human consultant usually sees the patient. A single glance conveys a considerable amount of information: that the patient is a middle-aged, slightly obese, average height male in moderate distress and slightly pale but otherwise functioning normally with no missing limbs, open wounds or sores, etc. All of these facts would be needed by one or another of the programs mentioned in this book. Entering such information can be time-consuming. In fact, in demonstrating the digitalis program one of the most serious complaints has been the amount of time it takes to use it.

There are many ways that a program can get information from the user. A common strategy used by all of the programs represented in this volume is just to ask questions as the need arises in executing the algorithms of the program. This can be somewhat troublesome for several reasons. Since the questions come one at a time, users may have some difficulty maintaining a consistent picture in their own minds of what the program already knows about the case. It is like viewing a construction project through a peephole. Also, if the methods of the program are different from those of the users, the questions may come in an order that seems unnatural or unnecessary, lowering trust that the program is really doing something that is correct and possibly lowering the reliability of the answers. And since the order and exact nature of the questions often depend on complicated interactions between the information provided so far and the medical models, this style makes it difficult or impossible to set up cases in advance to operate the program away from the user, because it may be difficult to anticipate what questions the program will want to ask.

A related difficulty faced by programs which attempt to acquire information from a sophisticated user is the degree to which that user's own interpretation of the medical situation affects the data given to the program. What happens, for example, when a physician suspects some data to be faulty, therefore reports a "corrected" value to the program, which in turn also corrects the data (a second, inappropriate time)? Can the user and the program share a model of just what interpretation each is responsible for? Without that, many problems of inconsistent interaction are likely to arise.

Questions also vary in their expected information content. It is fine to ask questions individually when there is a reasonable expectation for more than one of the possible answers. Yet in most domains there are questions, often many questions, which are only very rarely pertinent. But when they are, they change the whole complexion of the problem. It is appropriate to find ways so that the user expends as little energy as possible in providing this information while still having the opportunity to specify such contingencies when they are pertinent. If the questions involve running tests on the patient, the problem is more serious, because tests have associated costs which must be weighed against the expected benefits of having the answers. Hence, the program must be prepared to accept information in whatever form it is available and possibly do without when it is still possible to arrive at a reasonable conclusion with less than optimal information. An adequate analysis of this problem demands finding a balance between the sins of omission encouraged by a permissive data gathering module that fails to press the user for needed information and the sins of commission of a module that insists on too much information that may be difficult to obtain. In any case the problem of what questions to ask is one that must receive considerable attention during the adolescent period.

A second style of "questioning" that shows a great deal of promise is letting the user provide the initial specification of the problem in free form text. Doctors are used to presenting cases to their colleagues in this manner and the technical language is sufficiently constrained to make parsing a solvable problem. There is no guarantee that the users will include all of the information that is necessary to run the program, but it is always possible to resort to individual questions to fill in the gaps. This style also has the advantage of giving users a chance to specify any known unusual situations without the program having to specifically ask questions about them. Users' aversion to typing is, however, a problem here, which may be ameliorated by rapid and flexible menu selection systems and which may ultimately be overcome if speech input techniques are perfected. There are also a number of other simple techniques that could help, such as tabular display of questions or simplified forms for knowledgeable users. Adolescence is a time for experimenting with these possibilities to find the combinations that best match the needs of the program to the desires of the user community.

If the program can be connected to existing computerized information sources such as medical record systems, then the medical system can be used in another important way. If the general patient information and laboratory values are available automatically, the burden on the user of providing inputs is eased considerably. Also, any of that information which might be only rarely useful but often available can be checked without worrying whether it is worth asking about. Besides the problems of interfacing to the computerized information and making whatever transformations of form that might be required, there are several interesting problems in using this kind of information. While the information may purport to be the same as that provided by a physician, there is a subtle refinement process that has not taken place. For example, if the program needs the heart rate it might get the current heart rate automatically. If the physician had been asked, he might have given a number somewhat lower, realizing that the patient is agitated. The opposite effect is also possible, because monitoring instruments can often provide detailed information on trends which is not available to the physician. The problems of interpretation of data, as raised above between physician and program, redouble when other programs may be the source of the data. These are questions that have not yet been faced by any of the programs.

The use of computerized information can be carried a step further if in some mode of use the program receives all of its information from other programs. This places new stresses on the program structure because it is no longer possible to ask all of the questions that seem appropriate. A program used in such form would have to decide when it does not have enough information to make conclusions, when it can make general but not specific conclusions, and when its recommendations should be in the form of conditional actions, to be taken only if the conditions as determined by the user are satisfied. In most domains these are hard problems to handle, because missing information can change the possible ways in which a problem can be approached.

Giving explanations. The second principal area of the user/program interaction is the problem of answering questions from the user and providing the needed recommendations; that is, maintaining the flow of information from the program to the user. In most general kinds of computer programs this is not a difficult problem because the program is accepted as an authority. In the medical domain, however, physicians retain responsibility for their actions and therefore must demand sufficient justification for the answers they receive to understand and accept them. Both the MYCIN project and the Digitalis project have ventured into the problem of explaining what the program is doing. This is a difficult area for a number of reasons. First, explaining what the program is doing requires knowing what the program is doing. That means that the control structure of the program must be sufficiently open for other parts of the program to examine it and determine what is taking place. Second, the kind of explanation that is appropriate depends on the kind of user. Computer specialists working on the program want to know precisely, in terms of the incorporated models, what is going on. Doctors treating patients want to know in terms of the medical concepts they are familiar with. Medical students need information that is sufficiently complete to help tie the process incorporated in the program into the knowledge the students already possess. It would even be nice to tailor the response to the individuals who are interacting with the program by maintaining a model of their level of knowledge. Third, there is often a mismatch between the models appropriate for explaining to a person and those that would be used in a program. To overcome this mismatch, the explanation facility needs to develop a higher level of interpretation to match the models the user may have to the ones used in the program.

There are several different kinds of explanations that a program could provide. (Here we follow the classification of explanations developed by Swartout [33].) The most obvious is just explaining what it is doing. This could be either a path description of how it got to where it is (as was illustrated in Figure 2), which is very useful for people working on the program, or a higher level description of the medical aspects it is checking and how that is taking place, which is useful for a doctor. It could also explain the general approaches that the program takes; that is, explain the algorithms. This is useful to give users the confidence that the program will really consider all of the things they know are important. It may also be appropriate to provide justification for approaches through literature citations, especially in areas where there are competing opinions. For medical students in particular and for others with a need for deeper understanding, explanations in terms of the basic physiological models are needed.

Finally, there are the more general issues of user/program interaction such as response time, availability, and the details of interaction. The response time is always an important issue because users will not tolerate a slow program for very long. This is a problem because the more complete and sophisticated a program becomes, the larger it is and the more slowly it runs. Often in providing the program for the ultimate user some compromises will have to be made to bring response time to a level where it is acceptable. Response time is one reason that alternate methods of interacting with the program, such as running the program off-line and placing the results in the patient record (as is now done with laboratory results), become inviting. Availability is closely associated with response time. Often the size and sophistication of the program are dictated by the computers that will be available to run the program. The locally available computers in hospitals tend to be fairly small right now, while the AIM programs tend to be large. This may mean that the programs will have to be accessed remotely or even that "stripped down" versions will have to be provided for local use with the full version available when the need arises.

There are many "small" issues about the characteristics of the user interface which ultimately determine how friendly the program is to the users. These include such things as the presence of a strange operating system between the user and the program, the format of questions, the ability to discover and recover from faulty input or computer disruptions, the ability to change answers in the middle of a session, and more sophisticated features such as the ability to do selected sensitivity analyses on the information given by the user. These are all important and should be developed long before a program is considered ready for the real world. Such simple problems as the program failing because the operating system of the computer occasionally takes inappropriate actions can cause the program to lose its user community.

Some features help alleviate the users' fears of being misled by the program by supporting an ability to explore and test the recommendations of the program under altered circumstances. Figure 3 illustrates the ability to change answers and investigate the consequences of those changes in the Digitalis program. If after examining the recommendations given in Figure 1, the user decides to change the answer to the question about diuretic use, the illustrated interactions will take place. The program only does the work and asks the questions necessitated by the changed answer. The addition of diuretic therapy without the use of potassium supplements or a potassium preserving diuretic makes it likely that the potassium level will decrease further, increasing the likelihood of toxicity. In response the program further decreases the initial digitalis goal.
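
One way to do only the work necessitated by a changed answer is to record which derived quantities read which inputs and to recompute only those downstream of the change. The sketch below illustrates that general idea; it is not a description of how the Digitalis program is actually built, and the dependency table and all names in it are hypothetical.

```python
from collections import deque
from typing import Dict, List

# Hypothetical dependency table: each derived item lists the items it reads.
DEPENDS_ON: Dict[str, List[str]] = {
    "potassium_risk": ["diuretics", "potassium_supplements"],
    "sensitivity_factor": ["potassium_risk"],
    "body_stores_goal": ["sensitivity_factor", "weight"],
    "renal_function": ["creatinine"],
    "half_life": ["renal_function"],
    "dose_schedule": ["body_stores_goal", "half_life", "rate"],
}


def affected_by(changed: str) -> List[str]:
    """Return the derived items that must be recomputed after `changed` changes.

    Items are listed in breadth-first discovery order, without duplicates.
    """
    seen: List[str] = []
    queue = deque([changed])
    while queue:
        item = queue.popleft()
        for derived, inputs in DEPENDS_ON.items():
            if item in inputs and derived not in seen:
                seen.append(derived)
                queue.append(derived)
    return seen
```

Under these assumed dependencies, changing the diuretics answer touches the potassium risk, the sensitivity factor, the body stores goal and the dose schedule, but leaves the renal function and half-life untouched, so no questions about them need to be asked again.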

Clearly the interface of an AIM program with the user is something that must be considered carefully and developed with the possible user communities in mind. The program must use as much information as is available to it, but demand the collection of as little additional information as possible. Thus, the program must be able to obtain the information it needs to reach appropriate conclusions without wasting the time of its users or, more importantly, requiring tests which are not really justified. The program must be able to keep the users informed of what it is doing and answer their questions about the approach and justify those answers. And finally, to be successful in the unforgiving environment of the general medical world, the program must present a solid, friendly face to the user.

The following is an annotated transcript of the interaction that takes place when the user decides to change the answer about giving diuretics in the previous example.
IS THE PATIENT BEING GIVEN OR ABOUT TO BE GIVEN DIURETICS?
THE POSSIBILITIES ARE:
        1. GIVEN
        2. NOT GIVEN
ENTER SINGLE VALUE ====> 1

This time the user indicates diuretics are about to be given.

WHAT IS THE TYPE OF THE DIURETIC USE?
THE POSSIBILITIES ARE:
        1. RECEIVING DIURETICS REGULARLY
        2. RECEIVING DIURETICS FOR THE FIRST TIME
        3. RECEIVING AN INCREASED DOSE OF DIURETICS
ENTER SINGLE VALUE ====> 2
IS THE PATIENT RECEIVING ALDACTONE OR DYRENIUM?
THE POSSIBILITIES ARE:
        1. GIVEN
        2. NOT GIVEN
ENTER SINGLE VALUE ====> 2

If those diuretics were Aldactone or Dyrenium they would not affect the potassium level.

IS THE PATIENT RECEIVING POTASSIUM SUPPLEMENTS?
THE POSSIBILITIES ARE:
        1. GIVEN
        2. NOT GIVEN 
ENTER SINGLE VALUE ====> 2

Potassium supplements would reduce the likelihood of potassium depletion from the diuretics.

SINCE ACUTE DIURETIC ADMINISTRATION CAN LEAD TO UNEXPECTED
POTASSIUM DEPLETION AND PRECIPITATE A TOXIC EPISODE AN
APPROPRIATE CHANGE IN THE BODY STORES GOAL WILL BE MADE. IT IS
GENERALLY AGREED THAT PATIENTS WITH LOW SERUM POTASSIUM LEVELS
ARE MORE PRONE TO DEVELOP DIG TOXICITY. PLEASE WATCH THIS
PATIENT CAREFULLY AND ADMINISTER POTASSIUM SUPPLEMENTS. IN THIS
PATIENT THE USE OF DIURETICS WITHOUT THE CONCURRENT ADMINISTRATION
OF POTASSIUM SUPPLEMENTS MAY LEAD TO UNEXPECTED TOXICITY.

This time the program has more warnings and suggestions to give about the possible results of this course of action.

PARAMETERS USED TO MAKE RECOMMENDATIONS:
REDUCTION FACTOR DUE TO HYPOKALEMIA ===> 0.633 
TOTAL ALTERATION SENSITIVITY FACTOR ===> 0.633

The reduction factor is now less than two-thirds of the amount that would normally be given and this is reflected in the goal level and then in the daily dose.

BODY STORE GOAL ===>                   0.436 MG
PROJECTED AVERAGE AMOUNT ON BOARD ===> 0.000 MG
PROJECTED CURRENT AMOUNT ON BOARD ===> 0.000 MG
BODY STORES SHIFT ===>                 0.504 MG
REMAINING TIME TO REACH GOAL ===>     24.0 HRS.
HALF LIFE ===>                        55.B HOURS 2.3 DAYS
DAILY LOSS AT EQUILIBRIUM ===>         0.130 MG
ABSORPTION FACTOR ===>                 0.75
THE DOSAGE RECOMMENDATIONS ARE:
DATE       TIME      ORAL              IV
1/10/79     9:00     .25 + .125 MG     0.375 MG (OR 1.5 CC)
REPORT BACK AFTER THE FIRST DOSE.

The lower goal implies a smaller first dose.

           21:00     .125 + .0625 MG   0.0625 MG (OR 0.25 CC)
1/11/79     9:00     .0625 MG          0.0625 MG (OR 0.25 CC)
HOLD THE DIGITALIS FOR THE CARDIAC-SURGERY.
UPDATE COMPLETED.

The user could then ask for a description of the update. This would explain the interaction that took place above, including the changes that took place in various internal variables leading to the change in the sensitivity factor and the change in the body stores goal.

Fig. 3. Changing the answer to the question about diuretic use.

The Implementation Methodology

The previous discussion has mentioned a wide range of requirements imposed on the implementation methodology of the program. As we first described in recollecting the birth of the program, numerous techniques from AI have been instrumental in providing the mechanisms needed to satisfy the Digitalis program's requirements for maintaining a ISM, for extensibility, and for explainability. Nevertheless, as we have become more and more familiar with the particulars of our domain of application, numerous simplifications have become possible. As we look forward to the "real world" application of the program, further changes in its implementation, toward more traditional programming styles, become possible at the cost of giving up some of the program's once-desirable features. This is, we believe, a common development in AIM programs. The properties of the medical models and interface dictate the requirements for the implementation methodology, and as they change during adolescence the requirements change. In the beginning the needs of the programmers for flexibility during the rapid development of the program are the dominant demands on the implementation. Thus, the programs tend to have many features that aid the programmers in changing and testing the program. Also, during the early development the basic models of the program tend to be less structured because the inherent structure of the domain has not become clear. As a result, the implementation often tends to have more powerful, more general, and slower constructs than are really necessary.

As an example of how our improved understanding of the medical domain has simplified our implementation technique, we will describe a change in one component of the program from its first to its current state. ANNA was implemented in terms of a Therapy Transition NETwork (TTNET), part of which is illustrated in Figure 4. (This discussion is drawn from Chapter 4 of [29].) Each node in the network represented a procedure used to gather information or make a conclusion, and links among them represented functional dependency, alternative choices, and flow of control. Two of the three possible kinds of links are illustrated in the figure. The solid lines are non-selective links implying that all of the connected nodes will be executed. The dotted lines are selective links implying that one of the nodes must be selected and executed. Virtually any decision implicit in the structure of the TTNET is subject to change by providing additional procedures to mediate the decision, and the allocation of responsibility for decision making is widely scattered in this implementation. The appropriateness of any node concluding its own validity is based on matching patterns of prerequisites and precludes and sufficient assertions. One such example (translated from ANNA's structure into English) consists of two pattern assertions associated with the SLOW-RATE node [29, p.60]: 

Selection of SLOW-RATE is precluded if the reason for digitalization is an arrhythmia, if pulmonary edema is present, or if the user specifies some other rate of digitalization. 

and 

Specification by the user that slow digitalization is desired is sufficient to qualify SLOW-RATE.

Thus, each of the nodes SLOW-RATE, MODERATE-RATE, RAPID-RATE and INSTANTANEOUS-RATE may become qualified or disqualified, and a final choice is then made (by a choice-daemon--another mechanism) among the qualified nodes.
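
The flavor of this qualification scheme can be suggested by a small sketch. It is not ANNA's implementation: only the SLOW-RATE patterns come from the text, while the patterns for the other nodes, the rule that a preclusion overrides a sufficient assertion, and the "first qualified node wins" choice are assumptions made for the illustration. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Case = Dict[str, object]


@dataclass
class Node:
    name: str
    precluded: Callable[[Case], bool]   # pattern that disqualifies the node
    sufficient: Callable[[Case], bool]  # pattern that qualifies the node

    def qualified(self, case: Case) -> bool:
        # Assumption: a preclusion overrides a sufficient assertion.
        return self.sufficient(case) and not self.precluded(case)


def other_rate_requested(case: Case, rate: str) -> bool:
    return case.get("requested_rate") not in (None, rate)


# SLOW-RATE follows the two pattern assertions quoted in the text.
slow = Node(
    "SLOW-RATE",
    precluded=lambda c: c.get("reason") == "arrhythmia"
    or bool(c.get("pulmonary_edema"))
    or other_rate_requested(c, "slow"),
    sufficient=lambda c: c.get("requested_rate") == "slow",
)


def user_choice_node(name: str, rate: str) -> Node:
    """Hypothetical pattern for the other nodes: driven only by the user's request."""
    return Node(
        name,
        precluded=lambda c: other_rate_requested(c, rate),
        sufficient=lambda c: c.get("requested_rate") == rate,
    )


NODES: List[Node] = [
    slow,
    user_choice_node("MODERATE-RATE", "moderate"),
    user_choice_node("RAPID-RATE", "rapid"),
    user_choice_node("INSTANTANEOUS-RATE", "instantaneous"),
]


def choose(case: Case) -> Optional[str]:
    """Stand-in for the choice-daemon: return the first qualified node, if any."""
    for node in NODES:
        if node.qualified(case):
            return node.name
    return None
```

Even in this reduced form, the decision is spread across per-node patterns and a separate choice mechanism, which is the scattering of responsibility the text describes.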

Fig. 4. Section of the TTNET in ANNA.

Note that ANNA's structure makes very little commitment about the form of decision making to be used by the program. The interpretation of the TTNET by matching data inputs and intermediate conclusions to such patterns allows remarkable flexibility for the program. Indeed, this flexibility was quite valuable when the program was originally built, because we had little confidence in our ability to make long-lasting commitments without some experience with the program and the domain. However, the present version of the digitalis advisor recognizes the selection of the rate of digitalization as a simple encapsulated issue, simply asking for the rate of digitalization. The logic to assess the appropriateness of the rate is encoded in a validity test which is executed when the user enters an answer. The validity test contains the following logic: In the case of either a patient with a ventricular rate greater than 130 or a heart failure patient with definite pulmonary edema, a slow or moderate rate is inappropriate. If the user selects such a rate, the program will suggest that a rapid rate is more appropriate and give the user the opportunity to change his answer. If in the example session the user had specified a moderate rate, the interaction would have been as follows:

HOW LONG DO YOU WISH TO TAKE TO REACH FULL DIGITALIZATION? 
THE POSSIBILITIES ARE: 
        1. INSTANTANEOUS RATE (1-2 HOURS) 
        2. RAPID RATE (1 DAY) 
        3. MODERATE RATE (2-3 DAYS) 
        4. SLOW RATE (4-7 DAYS)
ENTER SINGLE VALUE ====> 3
SINCE THE VENTRICULAR RATE IS HIGH IT IS ADVISABLE TO ACHIEVE FULL
DIGITALIZATION AS QUICKLY AS POSSIBLE.
WOULD YOU LIKE TO CHANGE YOUR ANSWER CONCERNING THE RATE OF
DIGITALIZATION? (YES OR NO) YES
HOW LONG DO YOU WISH TO TAKE TO REACH FULL DIGITALIZATION?
THE POSSIBILITIES ARE:
        1. INSTANTANEOUS RATE (1-2 HOURS) 
        2. RAPID RATE (1 DAY) 
        3. MODERATE RATE (2-3 DAYS) 
        4. SLOW RATE (4-7 DAYS)
ENTER SINGLE VALUE ====> 3

This implementation, requiring no pattern matching, no daemons, and no multi-level decisions, is a much simpler expression of essentially the same decision task (with some modifications representing other changes in the program). We were able to achieve this simplicity because we now understand that the decision does not depend on any more factors than those we included. In this implementation, ANNA's search procedures to discover what information was relevant to the particular decision have been eliminated by hand-crafting the appropriate procedure. In further work now being pursued by Swartout, this "compilation" process will automatically derive the procedural form of the decision from a data base of possible dependencies and domain principles [34]. 
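
By contrast, the validity test of the present version amounts to a single small procedure. The following Python sketch restates the logic given above; the function name, parameter names, and message wording are hypothetical, and the actual program is written in OWL, not Python.

```python
from typing import Optional


def rate_warning(rate: str,
                 ventricular_rate: int,
                 heart_failure: bool,
                 definite_pulmonary_edema: bool) -> Optional[str]:
    """Return a suggestion if the chosen rate looks inappropriate, else None.

    rate is one of "instantaneous", "rapid", "moderate", "slow".
    """
    # Either condition calls for reaching full digitalization quickly.
    urgent = ventricular_rate > 130 or (heart_failure and definite_pulmonary_edema)
    if urgent and rate in ("slow", "moderate"):
        # The program only suggests; the user remains free to keep the answer.
        return ("A rapid rate is more appropriate. "
                "Would you like to change your answer?")
    return None
```

A caller would print the returned message, if any, and re-ask the question only when the user agrees to change the answer.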

How far could we go in simplifying the program's implementation? If we were interested only in the advice the program provides, we could undoubtedly re-implement it in a conventional programming language. As we have described above, however, we are also very much concerned (at least in most modes of possible use of the program) with its ability to explain and justify its medical content and behavior. The various desirable features discussed in previous sections place a number of requirements on the implementation that make the AI methodologies not only helpful for programming but necessary. The basic facts of the medical model provide a good illustration of how a flexible representation permits multiple uses of the same information. First, a fact may be used to make conclusions about the case; the appropriate time to use the fact may not be clear in general, thus sometimes necessitating its indexing by some pattern of use rather than including it directly in an algorithm. If some information is missing in consideration of the case, a fact may be appropriately used to determine what could be known if the information were known. If tests are expensive, the fact could be used to assess the importance of determining the information--that is, for test selection. Depending on the kind of program and the kind of fact, the fact may be useful in assessing the patient's present state, possible previous states, and the probable results of different kinds of therapy. In addition to these uses, the fact must be accessible for explanation and modification. To explain a fact it is necessary either to be able to examine all of its parts or to have some kind of "canned" explanation associated with it. Examining it is much more desirable since there is no assurance that a canned explanation reflects the actual code. If modification of the online system is to be supported, it is also necessary to take apart, examine, and replace parts of the fact. 
Not all AIM programs will have all of these capabilities, but each places requirements on the features of the implementation. The explanation facility also places requirements on the control structure of the program. To explain what has happened in a program it is necessary to have a trace of what facts and procedures have been used. This can be accomplished in a number of ways, such as leaving a list of what procedures have been called and the important changes they made, or just preserving the control structure in some explicit way.
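
The two requirements just named, facts whose parts can be examined and a trace of what was used, can be sketched briefly. The Python below is an illustration only; Fact, Tracer, and their fields are hypothetical names and do not describe how OWL represents knowledge. The point is that the explanation is generated from the same parts that drive the conclusion, so it cannot drift from the code as a canned explanation might.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Fact:
    """A fact kept as data, so its parts can be examined or replaced."""
    condition: str   # finding that must hold in the case
    conclusion: str  # what the fact then lets the program conclude

    def explain(self) -> str:
        # Built from the parts actually used in reasoning, not stored text.
        return f"If {self.condition} then {self.conclusion}."


class Tracer:
    """Applies facts and keeps a list of those that actually fired."""

    def __init__(self) -> None:
        self.used: List[Fact] = []

    def apply(self, fact: Fact, case: Dict[str, bool],
              conclusions: List[str]) -> None:
        if case.get(fact.condition, False):
            conclusions.append(fact.conclusion)
            self.used.append(fact)  # the trace needed for later explanation

    def explain(self) -> List[str]:
        return [fact.explain() for fact in self.used]
```

For instance, a fact with condition "the ventricular rate is high" and conclusion "full digitalization should be reached quickly" would, once applied to a matching case, be reported by Tracer.explain() in exactly those words.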

The Digitalis advisor is currently implemented in OWL [6, 7, 36], a language based on a linguistic theory of knowledge representation, and itself implemented in LISP. OWL's data base component, LMS, provides a complete dynamic cross-linking among all mentions and uses of a concept, thus providing one of the most basic requirements of our implementation [7]. The OWL interpreter also maintains the dynamic record of how procedures were invoked and how values were changed, underlying the program's ability to explain its course of action [33].

The Adult Program

The adult program would be one that is out in the medical community, being used as a tool to aid physicians or other medical personnel in carrying out their primary functions. It would not require constant upkeep by computer science personnel other than an occasional "new release" to clear up minor problems and to incorporate new medical findings. We would be delighted to describe the characteristics of such a program in great detail, but unfortunately none of the AIM programs have yet progressed that far. We can, however, make some general observations about how we envision the world of the adult program and what difficulties we see in its way.

User Acceptance. As we suggested in discussing the possible modes of use of an AIM program, various ways of applying the expertise of such a program are available. We suspect that those which have the least disruptive impact on the health care system as it is currently organized are the most likely to be accepted early. Thus, the use of programs in educational settings appears quite feasible, their installation as background monitors in computerized record systems to apprehend gross errors seems likely, and their employment with nurse-practitioners or para-medical technicians is likely to begin soon after the programs have demonstrated their completeness in some domain of interest and their competence. Even in these forms of use, the programs are likely to generate a significant amount of antipathy from parts of the medical community, and this antipathy will have to be sensitively dealt with. An early article by Schwartz [26] gives a good discussion of the range of likely difficulties and possible new approaches to be tried. 

From the limited experience of other computer programs in the medical application area, we can understand some of the ways in which delicacy in the introduction of the new programs can lead to successful acceptance. A Stanford group developed and first installed at the Stanford University Hospital a background system to monitor all drug orders to the pharmacy for potential toxic drug interactions [14, 37]. The system never refuses to deliver a drug, but takes the liberty of accompanying the drug with cautionary messages of several levels of severity ranging from an informative note about a recent article discussing this drug to definite warnings that the drug is known to interact badly with another drug that has been ordered for the same patient and is still being given. The emphasized phrase has been a key to the success of the program, in contrast with other similar efforts which have been rejected by their intended medical audiences because they violated the previously-accepted manner of practicing medicine. Note that the pharmacy does not know what happens to the drugs it delivers. The physician may have decided to discontinue a mode of therapy for which the drugs have already been ordered, without alerting the pharmacy. In that case, a second drug order, though it would conflict with the first if both drugs were actually given, may be quite appropriate. A computer program cannot presume to make judgments in the absence of adequate information about the environment--this is exactly why this drug interaction program has succeeded when others, more willing to impose their incomplete (and often incorrect) conceptions of the world on their users, have failed.
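
The design principle at work here, always deliver and only annotate, can be made concrete with a short sketch. The Python below is not the Stanford system; the drug names, the interaction table, and the two severity levels are placeholders invented for illustration, and the source describes several levels without enumerating them.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, FrozenSet, List, Set, Tuple


class Severity(IntEnum):
    NOTE = 1     # e.g. pointer to a recent article
    WARNING = 2  # known bad interaction with a drug still on order


@dataclass
class Delivery:
    drug: str
    delivered: bool
    messages: List[Tuple[Severity, str]]


# Placeholder interaction table; real content would come from pharmacists.
INTERACTIONS: Dict[FrozenSet[str], Tuple[Severity, str]] = {
    frozenset({"drug-a", "drug-b"}): (Severity.WARNING, "known to interact badly"),
    frozenset({"drug-a", "drug-c"}): (Severity.NOTE, "discussed in a recent article"),
}


def process_order(drug: str, active_orders: Set[str]) -> Delivery:
    """Attach cautionary messages to an order; never refuse delivery."""
    messages: List[Tuple[Severity, str]] = []
    for other in sorted(active_orders):
        hit = INTERACTIONS.get(frozenset({drug, other}))
        if hit is not None:
            severity, text = hit
            messages.append((severity, f"{drug} + {other}: {text}"))
    # Most severe messages first; the order is filled regardless.
    messages.sort(key=lambda m: m[0], reverse=True)
    return Delivery(drug=drug, delivered=True, messages=messages)
```

The function has no branch that withholds the drug, reflecting the fact that the pharmacy cannot know whether an earlier order is still being given.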

A careful examination of the successes and failures of other computer augmentations to the health care environment can point out some of the more common pitfalls and ways to avoid them. Because the scope of AIM programs comes closer to the way medicine is actually practiced by physicians, many mistakes are, however, likely to be made as we try to install such programs into common use.

Development. The ever-present abbreviation "R&D" recognizes that in the creation of useful technical products a possibly long phase of development must follow the initial research effort that created the product. In the adoption of AIM programs for actual use, the advance from research to development will need to be accompanied by two concomitant moves. First, although university-based laboratories are the current centers of research for the creation of AIM programs and these same laboratories are likely to produce the first prototype programs to be adopted in practical use, the long-term success of the field requires the initiation of actual development facilities, probably outside the universities and preferably within commercial companies. Second, the basis on which funding is allocated for AIM R&D must change as a project moves from research to development.

The university environment is an ideal one for the conception and initial production of a new AIM program. As we have suggested in this chapter, the close availability of colleagues from different technical and medical backgrounds, of students with the intelligence, willingness and enthusiasm to undertake large projects, and the constant ferment of new ideas percolating through the environment all contribute to the possibility of formulating and undertaking an interesting new project. These advantages persist through the stages of growth we have described above, because the project continues to raise interesting technical questions which are of interest not only to those working closely with it but also to the others in the environment who merely provide the supportive intellectual atmosphere. In this way, the project repays the debt it owes to that environment. As the difficult technical problems are in time solved, however, and as the attention of the investigators must shift from internal problems of technique to external problems of application, the project often ceases to interest the rest of the academic community. This change even percolates to many of those working on the program itself--they would like to be done with it and move on to other, newer technical challenges. (This observation is especially appropriate for students, who are generally expected to base their incipient careers on brilliant technical innovation, not on a successful application development.) Thus, the original nurturing environment becomes far less suitable when the program is in its ultimate development stage. 

Another difficulty in the university environment is the frequent limitation on project size imposed by the limited size of the laboratory. Throughout our work on the Digitalis program, about two full-time equivalent employees have carried the effort. (This number has often represented as many as seven individuals, each of whom had numerous other commitments.) Naturally, the efforts of this limited group have been focused on the most important and the technically most interesting aspects of the problem. Many other features which we would find highly desirable have not been provided, for lack of manpower and lack of the commitment actually to turn the program into a production instrument. For example, although we have discussed augmented modes of user input such as a simple English-language interface, a fast table-driven data acquisition module, connection to existing data-bases, etc., we have not in fact built any of these components. Ultimately they will all be critical to the success of the program. We anticipate that the adoption of responsibility for continued development by commercial institutions is the best solution for these difficulties. A commercial organization has the flexibility to devote the needed manpower to a project when it becomes necessary; it can establish a reward system for its employees in which successful application is seen by all as a worthy goal. Actual movement of university-developed computer tools into commercial companies is not a frequent event. Even such long-standing successful computer tools as MACSYMA and DENDRAL, which are now both in actual application use, continue to be run by a university-centered consortium, not a commercial vendor. Somewhat more successful has been the history of projects undertaken and completely built at a commercial laboratory based on the technology previously developed in the university environment. 
For example, SRI International's PROSPECTOR system for the "diagnosis" of possibly valuable mineral-bearing sites is built on the methodology created in the MYCIN project [3]. SRI is, however, a rather unusual commercial institution, having begun in close affiliation with Stanford University and having been carefully groomed exactly for providing the kind of development opportunity we describe. Most large vendors who traditionally supply computation-based tools to the health-care system still concentrate on tools based on simpler techniques, with more limited goals.

Virtually all funding for AIM programs in the United States has come from the Federal government. Much of this support has been from the National Institutes of Health (NIH), principally through its Division of Research Resources, and (recently) from the National Library of Medicine (NLM), as part of its extramural research program. In both cases, the support of AIM programs has required a deliberate commitment to change the basic funding orientation of the institution. The NIH's charter is to support basic research in the biomedical sciences to improve the scientific basis of health care in the nation. To the extent that AIM research efforts contribute more to the codification and dissemination of medical expertise than to its discovery, the support of AIM research itself is a somewhat peripheral NIH activity. The pursuit of an expensive and long-term AIM development project thus has appeared especially unappealing to NIH. The NLM's interest in AIM is primarily for the development of knowledge representation and retrieval techniques and problem-solving strategies applicable to the storage and management of the tremendous volume of information within the Library's purview. Again, the commitment of any large fraction of its available research funds for the costly development of a single application seems unlikely. Other agencies have at times also supported AIM projects, though typically for too short a time to face the difficulties of placing a program into practice.

Sometimes the difficulties of obtaining funding for development efforts have stymied a project just at the time when the initial efforts had been completed successfully. Two of the programs described in this book, MYCIN and the CASNET/Glaucoma program, have both been inactive at times for lack of funds, despite demonstrating expert-level performance in adolescent-stage trials. The involvement of commercial development teams at this stage of the AIM projects may be the most appropriate escape from this problem. The Defense Department's Advanced Research Projects Agency (DARPA) has dealt with a similar set of problems by explicitly providing a number of funding categories ranging from basic research to deployment and supporting the movement of a project along this continuum when appropriate.

Optimism. The conjunction of two trends makes us optimistic. Pressure continues to build on the health care system to provide better care for more people at limited expense, making the adoption of useful technology an attractive possibility. At the same time, continued improvement of computer technology, both in hardware and software, promises to make more and more power available to deal in programs with the hardest medical needs. AIM programs are growing up, overcoming many of the technical and application difficulties we have considered, demonstrating their ability to handle difficult medical diagnostic and therapeutic problems, and providing accurate and useful consulting advice about those problems to their users. We look forward to the success of those programs described here and their intellectual descendants.

References

1. Beller, G. A., Smith, T. W., et al., "Digitalis Intoxication, A Prospective Clinical Study with Serum Level Correlations," New England J. Med. 284, (May 6, 1971), 989-997. 

2. Davis, R., "Interactive Transfer of Expertise: Acquisition of New Inference Rules," Artificial Intelligence 12, (1979), 121-157. 

3. Duda, R. O., Hart, P. E., Nilsson, N. J., Reboh, R., Slocum, J., and Sutherland, G. L., Development of a Computer-Based Consultant for Mineral Exploration, Annual Report, SRI International, Menlo Park, Ca., (1977). 

4. Gorry, G. A., Kassirer, J. P., Essig, A., and Schwartz, W. B., "Decision Analysis as the Basis for Computer-Aided Management of Acute Renal Failure," Amer. J. Med. 55, (1973), 473-484. 

5. Gorry, G. A., Silverman, H., and Pauker, S. G., "Capturing Clinical Expertise: A Computer Program that Considers Clinical Responses to Digitalis," Amer. J. Med. 64, (March 1978), 452-460. 

6. Hawkinson, L. B., "The Representation of Concepts in OWL," Proceedings of the Fourth International Joint Conference on Artificial Intelligence, MIT Artificial Intelligence Laboratory, (1975). 

7. Hawkinson, L. B., XLMS: A Linguistic Memory System. TM-173, MIT Lab. for Comp. Sci., Cambridge, Mass., (1980). 

8. Ingelfinger, J. A., and Goldman, P., "The serum digitalis concentration--Does it diagnose digitalis toxicity?," New England J. Med. 294, (April 15, 1976), 867-870. 

9. Jelliffe, R. W., Buell, J., Kalaba, R., Sridhar, R., and Rockwell, R., "A Computer Program for Digitalis Dosage Regimens," Mathematical Biosciences 9, (1970), 179-193. 

10. Jelliffe, R. W., Buell, J., and Kalaba, R., "Reduction of Digitalis Toxicity by Computer-Assisted Glycoside Dosage Regimens," Ann. Int. Med. 77, (1972), 891-906. 

11. Kulikowski, C. A., and Weiss, S. M., "Representation of Expert Knowledge for Consultation: The CASNET and EXPERT Projects," in Szolovits, P., (Ed.), Artificial Intelligence in Medicine. Westview Press. (1981), this volume. 

12. McNeil, B. J., Weichselbaum, R., and Pauker, S. G., "The Fallacy of the Five Year Survival in Lung Cancer," New England J. Med. 299, (1978), 1397-1401. 

13. Minsky, M., A Framework for Representing Knowledge, Memo 306, MIT AI Lab, (1974), condensed version also published in Winston, P. (Ed.), The Psychology of Computer Vision, (1975), McGraw Hill, New York. 

14. Morrell, J., Podlone, M., and Cohen, S. N., "Receptivity of physicians in a teaching hospital to a computerized drug interaction monitoring and reporting system," Medical Care 15, (1977), 68-78. 

15. Patil, R. S., Design of a Program for Expert Diagnosis of Acid Base and Electrolyte Disturbances, TM-132, MIT Laboratory for Computer Science, (May, 1979). 

16. Pauker, S. G., Gorry, G. A., Kassirer, J. P., and Schwartz, W. B., "Toward the Simulation of Clinical Cognition: Taking a Present Illness by Computer," Amer. J. Med. 60, (June 1976), 981-995. 

17. Pauker, S. G., "Coronary Artery Surgery: The Use of Decision Analysis," Ann. Int. Med. 85, (8) (1976). 

18. Pauker, S. G., and Szolovits, P., "Analyzing and Simulating Taking the History of the Present Illness: Context Formation," in Schneider/Sagvall Hein (Eds.), Computational Linguistics in Medicine. North-Holland, (1977). 

19. Pauker, S. P., and Pauker, S. G., "Prenatal Diagnosis: A Directive Approach to Genetic Counseling Using Decision Analysis," Yale J. Med. 50, (1977), 275. 

20. Pauker, S. G., and Kassirer, J. P., "Clinical Applications of Decision Analysis: A detailed illustration," Seminars in Nuclear Medicine (Oct. 1975). 

21. Peck, C. C., Sheiner, L. B., et al., "Computer-Assisted Digoxin Therapy," New England J. Med. 289, (1973), 441-446. 

22. Pople, H. E., Jr., "The Formation of Composite Hypotheses in Diagnostic Problem Solving: an Exercise in Synthetic Reasoning," Proceedings of the Fifth International Joint Conference on Artificial Intelligence, Department of Computer Science, Carnegie-Mellon University, Pittsburgh, PA 15213, (1977). 

23. Rabkin, S. W. and Grupp, G., "A Two Compartment Open Model for Digoxin Pharmacokinetics in Patients Receiving a Wide Range of Digoxin Doses," Acta Cardiologica 30, (1975), 343-351. 

24. Rodensky, P. L. and Wasserman, F., "Observations on Digitalis Intoxication," Arch. Int. Med. 108, (1961), 171-188. 

25. Safran, C., Tsichlis, P. N., Bluming, A. Z., and Desforges, J. F., "Diagnostic Planning using Computer Assisted Decision-Making for Patients with Hodgkin's Disease," Cancer 39, (June 1977), 2426-2434. 

26. Schwartz, W. B., "Medicine and the Computer: The Promise and Problems of Change," New England J. Med. 283, (1970), 1257-1264. 

27. Sheiner, L. B., Halkin, H., et al., "Improved Computer-Assisted Digoxin Therapy," Ann. Int. Med. 82, (1975), 619-627. 

28. Shortliffe, E. H., Computer Based Medical Consultations: MYCIN, Elsevier-North Holland Inc., (1976). 

29. Silverman, H., A Digitalis Therapy Advisor. Technical Report TR-143, MIT Project MAC, (1975). 

30. Steiness, F., "Renal Tubular Secretion of Digoxin," Circulation 50, (1974), 103-107. 

31. Sumner, D. J., and Russell, A. J., "Digoxin Pharmacokinetics: Multicompartmental Analysis and Its Clinical Implications," Br. J. Clin. Pharmac. 3, (1976), 221-229. 

32. Sussman, G. J., and McDermott, D. V., "From PLANNER to CONNIVER - A Genetic Approach," Proceedings of the 1976 Fall Joint Computer Conference, AFIPS Press, (1976), 1171-1179. 

33. Swartout, W. R., A Digitalis Therapy Advisor with Explanations. Technical Report TR-176, MIT Laboratory for Computer Science, (February 1977). 

34. Swartout, W. R., Producing Program Explanations from Multiple Hierarchical Models. Ph.D. thesis. Dept. of Electr. Eng. and Comp. Sci., MIT, Cambridge, Ma., (Jan.1981). 

35. Szolovits, P., and Pauker, S. G., "Research on a Medical Consultation System for Taking the Present Illness," Proceedings of the Third Illinois Conference on Medical Information Systems, University of Illinois at Chicago Circle, (November 1976). 

36. Szolovits, P., Hawkinson, L., and Martin, W. A., An Overview of OWL, a Language for Knowledge Representation, TM-56, MIT Laboratory for Computer Science, Cambridge, Mass., (June 1977), also in Rahmstorf, G., and Ferguson, M., (Eds.), Proceedings of the Workshop on Natural Language Interaction with Databases, International Institute for Applied Systems Analysis, Schloss Laxenburg, Austria, Jan. 10, 1977. 

37. Tatro, D. S., Moore, T. N., and Cohen, S. N., "Computer-Based System for adverse drug reaction detection and prevention," Amer. J. Hosp. Pharm. 36, (1979), 198-201. 

38. Weiss, S. M., A System for Model-Based Computer-Aided Diagnosis and Therapy, Ph.D. thesis. CBM-TR-27-Thesis, Computers in Biomedicine, Department of Computer Science, Rutgers University, (June 1974). 

39. Weiss, S., Kern, K., Kulikowski, C. and Safir, A., "System for Interactive Analysis of a Time-Sequenced Ophthalmological Data Base," Proc. Third Illinois Conference on Medical Information Systems (1976). 

40. Winograd, T., Understanding Natural Language, Academic Press, New York, (1972). 

41. Withering, W., An Account of the Foxglove, and some of Its Medical Uses: with Practical Remarks on Dropsy, and Other Diseases, G. G. J. and J. Robinson, Paternoster-Row, London, (1785). 

42. Yu, V. L., Buchanan, B. G., Shortliffe, E. H., Wraith, S. M., Davis, R., Scott, A. C., and Cohen, S. N., "Evaluating the Performance of a Computer-Based Consultant," Comput. Programs in Biomed. 9, (1979), 95-102.  

43. Yu, V. L., Fagan, L. M., Wraith, S. M., Clancey, W. J., Scott, A. C., Hannigan, J., Blum, R. L., Buchanan, B. G., and Cohen, S. N., "Antimicrobial Selection by a Computer: A Blinded Evaluation by Infectious Diseases Experts," J. Amer. Med. Assoc. 242, (1979), 1279-1282.

Notes

(1) The digitalis program has been developed through several versions by a group of collaborators including, in addition to the authors, G. Anthony Gorry, Stephen G. Pauker, Howard Silverman and William Swartout. Although much of the style and content of the digitalis therapy advisor will be reviewed here, the reader interested primarily in a description of that program is referred to other publications [5, 29, 33]. This research was supported (in part) by the National Institutes of Health Grant No. 1 P01 LM 03374-01 from the National Library of Medicine and Grant No. 1 P41 RR 01096-03 from the Division of Research Resources.


This is part of a Web-based reconstruction of the book originally published as
   Szolovits, P. (Ed.).  Artificial Intelligence in Medicine. Westview Press, Boulder, Colorado. 1982.
The text was scanned, OCR'd, and re-set in HTML by Peter Szolovits in 2000.