Evaluation of expert systems is always difficult, but it was more difficult for the HFP because of the complexity of the conclusions being evaluated. Conducting the evaluation as a single session with all reviewers was extremely beneficial. It enabled us to answer their questions as they arose, monitor their progress, make sure they understood the instructions as intended, and encourage them to provide comments with their judgments. It also allowed for some discussion, which slowed progress. Since three hours is about the maximum length of time people can effectively do such detailed work, a larger evaluation may require multiple sessions, or the remaining reviews may need to be done in a different context.
The use of evaluation forms proved to be much more practical than using the program interactively because of the logistics involved, the additional delays that the program would have introduced, and the ability to write comments anywhere on paper. The evaluation forms themselves left us with the difficult task of analyzing the judgments to determine the source of the criticisms. Fortunately, the reviewers wrote enough comments to make that analysis possible. The problem is essentially one of designing a multiple-question test when an essay is needed. The reviewers are being asked to critique the details, but there are too many details to consider one by one. Because the nodes are interconnected, a single issue changes the rating of multiple nodes. A possible improvement would be to organize the questions around paths through the summaries rather than single nodes, or around ``diagnostic'' nodes (the main disease and syndrome nodes).
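As a rough illustration of the path-oriented organization suggested above, the following sketch enumerates root-to-leaf paths in a small directed acyclic graph and turns each path into one review item. The graph contents, the function name `enumerate_paths`, and the dictionary representation are all assumptions made for illustration; they are not taken from the HFP knowledge base or its summaries.

```python
from typing import Dict, List


def enumerate_paths(graph: Dict[str, List[str]]) -> List[List[str]]:
    """Return every root-to-leaf path in a directed acyclic graph.

    `graph` maps each node to the nodes it is assumed to cause.
    Nodes that never appear as a target are treated as roots.
    """
    targets = {child for children in graph.values() for child in children}
    roots = [node for node in graph if node not in targets]
    paths: List[List[str]] = []

    def walk(node: str, prefix: List[str]) -> None:
        current = prefix + [node]
        successors = graph.get(node, [])
        if not successors:          # leaf: the path is complete
            paths.append(current)
            return
        for nxt in successors:
            walk(nxt, current)

    for root in roots:
        walk(root, [])
    return paths


# Hypothetical summary graph, invented for this example only.
summary = {
    "aortic stenosis": ["left ventricular hypertrophy", "low cardiac output"],
    "left ventricular hypertrophy": ["high left atrial pressure"],
    "high left atrial pressure": ["dyspnea"],
    "low cardiac output": ["fatigue"],
}

paths = enumerate_paths(summary)
# One review item per path instead of one per node.
questions = [" -> ".join(path) for path in paths]
for question in questions:
    print("Is this chain reasonable?", question)
```

With this grouping, a reviewer would judge two chains rather than six separate nodes, so a single underlying objection would be recorded once rather than against every node it touches.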
The automatic generation of the evaluation forms used a limited vocabulary and ``computer logic'' that made some of the statements difficult to understand. Particularly difficult were statements about nodes appearing only in some of the hypotheses. The significantly poorer ratings of such possible nodes and the smaller number of correct ratings of alternative hypotheses may be due to these problems of presentation, or it may be that determining the range of a differential is a more difficult problem than determining the best hypothesis.
The final issue that makes an evaluation of diagnostic programs difficult is the lack of a gold standard. It is tempting to say that the final diagnosis of the patient is the gold standard, but the objective of the differential is to determine all of the hypotheses that are consistent with the patient presentation rather than only the diseases that the patient actually has. Because there is no objective way of obtaining such a list, the best that can be achieved is expert consensus. As a result, there will always be some level of controversy and misunderstanding in the critiques. A common strategy to control for these factors is to have the experts critique each other as well as the program. That is not feasible when the program's diagnoses are this detailed, because experts give their diagnoses with much more selective detail. An alternative would be to have the reviewers individually critique a case and then collectively agree on a final critique. The problem, of course, is that it would lengthen an already time-consuming and somewhat tedious process.
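One way to limit the added time of a collective critique would be to discuss only the items on which the individual critiques disagree. The sketch below is a minimal illustration of that idea, not a procedure used in this evaluation; the rating labels, reviewer identifiers, item names, and the `min_agreement` threshold are all hypothetical.

```python
from collections import Counter
from typing import Dict, Tuple


def items_needing_discussion(
    ratings: Dict[str, Dict[str, str]],
    min_agreement: float = 1.0,
) -> Dict[str, Tuple[str, float]]:
    """Return items whose most common rating falls below `min_agreement`.

    `ratings` maps an item (e.g. a node or hypothesis) to a mapping of
    reviewer -> rating. The result maps each flagged item to its most
    common rating and the fraction of reviewers who gave that rating.
    """
    flagged: Dict[str, Tuple[str, float]] = {}
    for item, by_reviewer in ratings.items():
        if not by_reviewer:         # nothing to compare for unrated items
            continue
        counts = Counter(by_reviewer.values())
        top_rating, top_count = counts.most_common(1)[0]
        agreement = top_count / len(by_reviewer)
        if agreement < min_agreement:
            flagged[item] = (top_rating, agreement)
    return flagged


# Hypothetical individual critiques from three reviewers.
individual = {
    "node A": {"r1": "correct", "r2": "correct", "r3": "correct"},
    "node B": {"r1": "correct", "r2": "incorrect", "r3": "correct"},
}

to_discuss = items_needing_discussion(individual)
print(sorted(to_discuss))
```

Here only "node B" would be brought to the group, since the reviewers already agree on "node A". The final critique would still come from discussion among the reviewers; the tally only narrows what they need to discuss.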
Even with the difficulties of designing and analyzing this type of evaluation, it has been very helpful in determining the level of performance of the program. Overall, the program is capable of providing high-quality, detailed diagnostic hypotheses for complex cardiovascular cases. With some additional refinement of the knowledge base and processing of the hypotheses, the mistakes encountered by the reviewers should be eliminated and the error rate decreased significantly. Once these changes and the summarizer have been validated, the appropriate next step is a prospective evaluation to address the usefulness of the program.
This research was supported by National Institutes of Health Grant No. R01 HL33041 from the National Heart, Lung, and Blood Institute and No. R01 LM04493 from the National Library of Medicine.