
Discussion

This evaluation provides us with a reasonable picture of the diagnostic performance of the Heart Failure Program. The number of cases in the evaluation was too small to measure the performance relative to physicians. In any case, that would be missing the point. The performance of the program is at a level such that the experts can relate to the explanations, and their criticisms are mostly about the details of the diagnoses. There are still occasional serious errors made by the program, but they imply that further refinement of the knowledge base is needed rather than fundamental changes to the reasoning mechanisms. The severity and temporal constraints give the program the tools necessary to keep from generating impossible causal pathways in the hypotheses and to discriminate better among the likelihoods of the pathways it does generate. Some additional reasoning is needed to create a better fit with human expectations of good hypotheses.

The evaluation also illuminated a number of difficulties faced in evaluating a program that provides highly detailed assessments. The first issue the evaluators commented on was the artificiality of the case descriptions, from which most of the echocardiographic findings had been left out. We chose to rely mostly on physical examination, electrocardiographic, and X-ray findings to get a more extensive test of the program's reasoning, but cardiology in the United States has come to a point where, in complicated cases, cardiologists are reluctant to form conclusions without seeing the echocardiographic findings.

The summarization of the differential diagnoses was not part of the program's diagnostic reasoning; it was developed for the evaluation to reduce the workload for the reviewers. Indeed, it decreased the number of nodes in the hypotheses to a manageable number, making an evaluation of the detailed reasoning feasible. Because it proved to be a useful tool for conveying the important conclusions in the diagnosis, we intend to incorporate it into the program. At present, however, the summarizer can also obscure the nature of the relationships present in the hypothesis. The main problem is that not all of the summary links have the same meaning. The labels need to reflect important distinctions such as "causing" versus "possibly contributing to," or "chronic with acute worsening." Using the word "causes" for all links often misrepresents the kind of relationship that exists. There are also a number of conventions people use to convey degrees of uncertainty that need to be incorporated in the summaries, such as saying "pulmonary disease" rather than "COPD" when the findings are non-specific. Thus, summarization of the differential diagnoses is a difficult problem, and more work is needed before the summarizer is competent enough to be an effective utility.
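
The two labeling needs described above, distinguishing kinds of summary links and backing off to a less specific term when findings are non-specific, might be represented along the following lines. This is only an illustrative sketch and not the Heart Failure Program's implementation: the names LinkKind, SummaryLink, GENERALIZATIONS, and hedge_label are hypothetical, and the node "dyspnea" in the example is made up for illustration. Only the three link labels and the COPD / pulmonary disease pair come from the text.

```python
from dataclasses import dataclass
from enum import Enum


class LinkKind(Enum):
    """Hypothetical link labels; the three distinctions are those named in the text."""
    CAUSING = "causing"
    POSSIBLY_CONTRIBUTING_TO = "possibly contributing to"
    CHRONIC_WITH_ACUTE_WORSENING = "chronic with acute worsening"


@dataclass(frozen=True)
class SummaryLink:
    """One labeled edge in a summarized hypothesis (assumed representation)."""
    source: str
    target: str
    kind: LinkKind

    def render(self) -> str:
        # Show the label explicitly instead of a blanket "causes".
        return f"{self.source} --[{self.kind.value}]--> {self.target}"


# Hypothetical table mapping a specific diagnosis to a less specific term.
GENERALIZATIONS = {"COPD": "pulmonary disease"}


def hedge_label(diagnosis: str, findings_specific: bool) -> str:
    """Return the specific name only when the findings support it."""
    if findings_specific:
        return diagnosis
    return GENERALIZATIONS.get(diagnosis, diagnosis)


# Illustrative usage with a made-up target node.
link = SummaryLink(
    source=hedge_label("COPD", findings_specific=False),
    target="dyspnea",
    kind=LinkKind.POSSIBLY_CONTRIBUTING_TO,
)
print(link.render())
```

Keeping the link kind as an explicit field, rather than folding it into the node names, is one way to let a summary preserve the distinction between a definite cause and a possible contributor.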


wjl@MEDG.lcs.mit.edu
Sat Nov 4 11:23:04 EST 1995