
Evaluation and Refinement Method

The discharge summaries were used to create worksheets with the information used by the program. Since the most completely described point in the discharge summaries is the initial examination, that point was used for determining a diagnosis. Thus, the presenting illness, initial examination data, and some laboratory findings were likely to be available. Anything measured or done after that time was excluded from the input.

Filling out the worksheets involved some interpretation, since the terminology used in the discharge summaries was not always consistent with the measurement values used by the program. Some of the program inputs are interpreted values rather than raw test results. To translate test values such as pO2, pCO2, hematocrit, and white blood count into the qualitative values accepted by the input menu, a table was used to maintain consistency. Interpretation of chest pain and electrocardiogram (EKG) results was more difficult. For chest pain, the program has four descriptors: anginal, atypical, pleuritic, and other non-ischemic chest pain. Chest pain whose description was consistent with the characteristic attributes of anginal chest pain was entered as anginal. If the chest pain had some of the features of anginal chest pain but also characteristics that were not typical, it was entered as atypical. To interpret the EKG description, it was necessary to decide whether the description was consistent with old, evolving, or acute MI, or with ischemia. Often the description was in these terms, but when it was in terms of changes in specific leads, we used a simple table to translate the description. When the description did not match any of the interpretation descriptions, but there were still changes in the ST segment or T wave, it was entered as non-specific ST and T changes. Times of tests are not presently included in the input to the program. This precludes distinguishing between tests done during the present admission and ones done in the past that still provide useful information. (This will be corrected in a future version of the program.) The murmur descriptions also presented problems. Many times the murmurs were described only as systolic or diastolic without specifying character. These were given default characteristics as appropriate. 
The location descriptions were often less specific than the options in the program, and these were translated using a simple table. Sometimes locations were omitted entirely. In case 13 only the secondary locations, the locations of radiation, were given.
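
The table-driven translation of laboratory values into qualitative menu values can be sketched roughly as follows. This is only an illustration: the paper does not reproduce the table that was used, so the cut-offs below are placeholders (not clinical reference ranges), and the names QUALITATIVE_TABLE and qualitative are hypothetical.

```python
from bisect import bisect_right

# PLACEHOLDER cut-offs for illustration only. They are assumptions, not the
# table used in the study and not clinical reference values.
# Each entry: (ascending cut-offs, labels); labels has one more item than cut-offs.
QUALITATIVE_TABLE = {
    "hematocrit": ([36.0, 50.0], ["low", "normal", "high"]),
    "white-blood-count": ([4.0, 11.0], ["low", "normal", "high"]),
}


def qualitative(test, value):
    """Map a raw test value to the qualitative label of the band it falls in.

    A value equal to a cut-off is placed in the upper band.
    """
    cutoffs, labels = QUALITATIVE_TABLE[test]
    return labels[bisect_right(cutoffs, value)]
```

Keeping the cut-offs in a single table, rather than deciding case by case, is what makes the worksheet entries consistent from one discharge summary to the next.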

Once the worksheets were completed, the patient information was entered and the program was used to print a textual version of the information. For example, the computer-generated description of case 13 was:

Patient:: PT1013 Version 1 @ 2/01/90 20:15:00
HISTORY:: 67 year old female with nausea/vomiting having known-diagnoses of coronary-heart-disease and paroxysmal-atrial-fibrillation and on furosemide digitalis coronary-artery-bypass-graft
VITAL-SIGNS:: bp: 132/80 hr: 80 and T: 98.6
PHYSICAL-EXAM:: chest was clear, auscultation revealed normal s1, normal s2, a III/VI holosystolic-murmur also at the left-axilla and a III/VI systolic-ejection-murmur also at the neck, normal abdomen and normal extremities
LABORATORY-FINDINGS:: ekg: atrial-fibrillation, cxr: no cardiac-enlargement, Na: 125, k: 3.3, bun: 13, creat: 0.9 and normal urinalysis

These program-generated summaries were used by the cardiologists to determine the diagnoses. The final diagnoses given in the discharge summaries were not used because they were diagnoses based on more information than was available initially or even than was included in the summary. Often those diagnoses were not adequately supported by the information in the discharge summary. Using the program summaries means that the program diagnoses are determined from the same information as the expert diagnoses. The diagnoses were determined by agreement between the two cardiologists. In the process of examining the summaries, a number of data inconsistencies were discovered, which were corrected by a more careful reading of the discharge summary.

A typical expert diagnosis (the one for case 13) is:

Coronary heart disease, atrial fibrillation, compensated heart failure, mitral regurgitation, aortic stenosis or aortic sclerosis, possible digitalis toxicity, and possible diuretic complications

This diagnosis is actually a differential in terms of the program because it admits a number of possible hypotheses expressed as nodes by the program. The hypothesis must have atrial fibrillation. It must have either aortic stenosis or aortic sclerosis. It may or may not have digitalis or furosemide (a diuretic) accounting for abnormal states. Coronary heart disease is not a single node but may be accounted for by either fixed coronary obstruction or an old MI. For such terms used in diagnosis, we defined expansions in terms of the nodes in the KB. Compensated heart failure is part of the heart failure syndrome and can have several incarnations. There must be either low left ventricular (LV) systolic function, low LV compliance, or low right ventricular (RV) systolic function. These are the minimum systolic or diastolic manifestations that would count as heart failure. Since it is compensated, the left atrial pressure (LAP) is not high and there are no congestive findings on the left or right. This description does not include all of the intermediate pathophysiologic nodes that might be in a corresponding computer-generated hypothesis, but we assumed that all of the diagnostic nodes in a computer hypothesis are either in the expert diagnosis or definitely implied by the input. That is, states listed as previously known diseases in the input, heart rhythm on the EKG, or other direct inferences are automatically included in the diagnosis. When the diagnosis included unknown etiology, we allowed the program to attach some plausible etiology, since HFP produces completely specified hypotheses.
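
The expansion of diagnostic terms into alternative sets of KB nodes can be pictured with a small Python sketch. The node names and the EXPANSIONS table are hypothetical stand-ins, since the actual KB identifiers and translation table are not given here. The sketch covers only the "one of several alternatives" aspect and omits exclusions such as "LAP is not high" for compensated heart failure.

```python
from itertools import product

# Hypothetical translation table: each diagnostic term maps to a list of
# alternative node sets, any one of which satisfies the term.
EXPANSIONS = {
    "coronary heart disease": [{"fixed-coronary-obstruction"}, {"old-MI"}],
    "compensated heart failure": [
        {"low-LV-systolic-function"},
        {"low-LV-compliance"},
        {"low-RV-systolic-function"},
    ],
    "aortic stenosis or aortic sclerosis": [
        {"aortic-stenosis"},
        {"aortic-sclerosis"},
    ],
}


def allowed_combinations(required_terms):
    """Return every node set obtained by choosing one alternative per term.

    A term with no entry in EXPANSIONS is treated as a single node.
    """
    choices = [EXPANSIONS.get(term, [{term}]) for term in required_terms]
    return [frozenset().union(*combo) for combo in product(*choices)]
```

Under these assumptions, a diagnosis requiring coronary heart disease, atrial fibrillation, and aortic stenosis or sclerosis expands to 2 x 1 x 2 = 4 admissible required node sets.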

The hypotheses implied by the expert diagnoses were considered unordered. There were times during the process of collecting the diagnoses when the cardiologists put some partial ordering on them, such as stating that the patient probably had unstable angina, but it could be an MI. Because this information is available in only a fraction of the cases and is difficult to use in the comparisons, we chose to consider all of the expert differential diagnoses as unordered.

Once the diagnoses were decided by the cardiologists, we used HFP to generate a differential diagnosis. The differential diagnoses produced by HFP are ordered lists of one or more completely specified hypotheses with relative probabilities, as discussed in section 2. The criterion we used for accepting the machine's diagnosis is that the top hypothesis in the differential list match one of the admissible diagnoses listed by the experts. That is, the top hypothesis must include all of the required entities in the diagnosis, may include any of the optional entities, and may not include any diagnostic node that is not part of the diagnosis or definitely implied by the input. More or less stringent criteria could have been specified, ranging from requiring that every hypothesis in the differential be acceptable and that every acceptable hypothesis be included, to requiring only that some acceptable hypothesis be included in the differential. Any kind of comparison that considers how many of the alternatives in the expert diagnosis are included in the computer differential is difficult because it depends on what probability cutoff is chosen in selecting the differential and because the process of pruning the hypotheses eliminates some of the alternatives. The top hypothesis criterion seemed best for identifying the main issues for further research.
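
The difference between the chosen criterion and the looser or stricter alternatives can be made concrete with a short Python sketch. It assumes a differential is represented as a list of hypotheses ordered from most to least probable, and that `acceptable` is some predicate deciding whether a single hypothesis matches the expert diagnosis; both representations are assumptions made for illustration. The strictest variant shown checks only that every listed hypothesis is acceptable, not that every acceptable hypothesis is listed.

```python
def top_hypothesis_ok(differential, acceptable):
    """Criterion used in the study: the first (most probable) hypothesis matches."""
    return bool(differential) and acceptable(differential[0])


def some_hypothesis_ok(differential, acceptable):
    """Looser criterion: at least one hypothesis in the differential matches."""
    return any(acceptable(h) for h in differential)


def all_hypotheses_ok(differential, acceptable):
    """Stricter criterion: every hypothesis in the differential matches."""
    return bool(differential) and all(acceptable(h) for h in differential)
```

The two alternative criteria depend on how many hypotheses survive pruning and the probability cutoff, whereas the top-hypothesis test looks only at the first entry.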

The matching process can be illustrated with the case in figure 1. Coronary heart disease is covered by fixed coronary obstruction; atrial fibrillation was explicit; compensated heart failure was manifest as low LV systolic function; mitral regurgitation is included as a chronic condition; aortic stenosis, rather than aortic sclerosis, was used to account for the systolic ejection murmur; digitalis accounted for the nausea or vomiting, a sign of toxicity; and furosemide accounted for the low potassium. The hypothesis also suggests a mechanism for the low sodium level, but no diagnostic node is included in that causal chain so it is not evaluated. There are no diagnostic nodes in the hypothesis that are not accounted for by the diagnosis, so the match is successful.

The matching is done by a small program that takes the diagnosis and a table of the description translations and generates all of the allowed combinations of nodes. It compares that list to the top hypothesis. If there is a combination of nodes that all occur in the hypothesis and any other nodes in the hypothesis are non-diagnostic or definitely true from the input, the match succeeds.
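
A minimal Python sketch of such a matching step is given below. The original matching program is not shown in the paper, so the function name, its arguments, and the set-based representation are all assumptions. In particular, treating the optional entities as a separate set, rather than folding them into the generated combinations, is a simplification made here.

```python
def match(top_hypothesis, allowed, optional, diagnostic, implied):
    """Return True if the top hypothesis is an admissible diagnosis.

    top_hypothesis: set of node names in the program's top hypothesis
    allowed:        iterable of sets, each one admissible combination of required nodes
    optional:       nodes the diagnosis permits but does not require
    diagnostic:     all nodes of the KB that count as diagnostic
    implied:        nodes definitely true from the input
    """
    hyp = set(top_hypothesis)
    for combo in allowed:
        if not set(combo) <= hyp:
            continue  # some required node is missing from the hypothesis
        # Nodes not explained by the diagnosis or by the input.
        extras = hyp - set(combo) - set(optional) - set(implied)
        if not (extras & set(diagnostic)):
            return True  # any leftover nodes are non-diagnostic
    return False
```

With this formulation, intermediate pathophysiologic nodes in the hypothesis never cause a failure; only an unexplained diagnostic node or a missing required node does.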

If the top hypothesis was not acceptable, there were three possible explanations: 1) the hypothesis from HFP was wrong, 2) the expert diagnosis was wrong, or 3) the translation of the diagnosis into required nodes was wrong. We reviewed a sample of the unacceptable cases, analyzed the nature of the problems, corrected what was easy to correct (either the KB, the expert diagnosis, or the diagnosis translation, as appropriate), and repeated the process. Over the course of a dozen iterations, 93 of the cases were analyzed in detail by the cardiologists, some more than once.

The analysis of the erroneous hypotheses and the corrections to the KB will be discussed in the next section. The kinds of corrections that were made to the expert diagnoses are shown in table 2.

For the most part, these are relatively minor changes in the diagnosis, for example, adding possible chronic obstructive pulmonary disease (COPD) that was overlooked, or no longer requiring aortic stenosis when it is supported only by a murmur that could be functional. Still, they indicate how difficult it is to specify a complete diagnosis.

The translation of the diagnosis into nodes in the model was also fairly difficult. For example, stating that the patient had left heart failure usually meant that there was the systolic causal chain of low LV emptying, low cardiac output, and high LAP. However, it also happened that the high LAP could be caused by diastolic dysfunction manifest as low LV compliance, LV hypertrophy, or some cause that produced a chronic state of high LAP (such as mitral stenosis) and those situations were also called left heart failure. Furthermore, specifying that the left heart failure was systolic, diastolic, or compensated changed the list of nodes that characterized the state. It took a number of iterations to get the matching program to accept all of the hypotheses that were in fact consistent with the diagnoses.

wjl@MEDG.lcs.mit.edu
Sat Nov 4 11:03:23 EST 1995