Over the past several years we have been developing the Heart Failure Program to assist physicians in reasoning about patients with cardiovascular disease. The program takes a description of the case, including information about the history, symptoms, physical examination, and test results, and generates a differential diagnosis that explains all of the findings that might indicate cardiovascular disease. The program can also suggest further measurements to refine the diagnosis and therapies to manage the problem, and can predict the hemodynamic effects of the therapies. Only the differential diagnosis is addressed by the experiments described in this paper; other aspects of the system are described elsewhere[3][2][1].
This paper reports on a formative evaluation of the diagnostic capabilities of the Heart Failure Program (HFP). The process of formative evaluation combines aspects of system development with assessment of effectiveness and was undertaken with specific objectives in mind. The major part of the development effort on the basic diagnostic algorithms and diagnostic knowledge base of the program has been completed, and the program has been functioning in a reasonably stable way for a couple of years. In that time we identified two main circumstances that can lead to incorrect diagnoses: cases in which the temporal relationships among the diseases and findings determine the diagnosis, and cases in which the relationships between severities of findings are important. Solving either problem in its full generality would require a major effort and could greatly increase computational requirements, but specific instances can be handled by making provision for them in the knowledge base. Since the frequency and extent of these problems in practice were unknown, we did not know their practical significance. Given this state of affairs, we conducted this development and assessment process 1) to determine the accuracy of the program with the present diagnostic algorithms, and 2) to determine the applicability of the system for diagnosis of patients typical of a tertiary care hospital.
To conduct the formative evaluation, we collected a set of 242 cases of patients classified by diagnosis-related group (DRG) as falling within the domain of the program. On these cases we analyzed the performance of the program in its present state, refined the knowledge base to obtain the best performance achievable with the present algorithms, and used the results to focus our plans for further development. The cases were distilled from hospital discharge summaries and entered into the program. They were separately diagnosed by the cardiologists on the project from the program's case summary, without seeing the computer-generated diagnosis. Errors made by the program were classified into those correctable by refinements of the knowledge base and those that would require additional reasoning algorithms. The corrections to the knowledge base were made and the whole process was repeated through a number of iterations until optimal correlation with the cardiologists' diagnoses was obtained. The program currently produces a first hypothesis that agrees with the diagnosis of the cardiologists in about 90% of the cases. We have analyzed the cases in which disagreement with the clinical diagnoses remained after correcting the knowledge base, and we used these analyses to categorize the limitations of the current reasoning mechanisms and determine their significance.
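The scoring step of one evaluation iteration can be sketched as follows. This is a minimal illustration only, not the procedure actually used in the study: the `Case` record, its field names, the exact-match comparison of diagnoses, and the labels `"knowledge_base"` and `"algorithm"` are all assumptions introduced here to mirror the two error classes described above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Case:
    """Hypothetical record for one evaluated case (all field names are assumptions)."""
    case_id: str
    program_first_hypothesis: str
    cardiologist_diagnosis: str
    # Assigned by a reviewer when the two diagnoses disagree:
    # "knowledge_base" (fixable by refining the knowledge base) or
    # "algorithm" (would need additional reasoning mechanisms).
    error_class: Optional[str] = None


def disagrees(case: Case) -> bool:
    """Assumed criterion: the diagnoses disagree unless the strings match exactly."""
    return case.program_first_hypothesis != case.cardiologist_diagnosis


def agreement_rate(cases: List[Case]) -> float:
    """Fraction of cases whose first hypothesis matches the cardiologists' diagnosis."""
    if not cases:
        raise ValueError("at least one case is required")
    agreeing = sum(1 for c in cases if not disagrees(c))
    return agreeing / len(cases)


def error_breakdown(cases: List[Case]) -> Counter:
    """Count the disagreeing cases by error class ("unclassified" if not yet reviewed)."""
    return Counter(c.error_class or "unclassified" for c in cases if disagrees(c))
```

In such a sketch, the cases labeled `"knowledge_base"` would drive the next round of knowledge base corrections, while those labeled `"algorithm"` would be set aside for the analysis of the limitations of the reasoning mechanisms. In practice, agreement between diagnoses would be a clinical judgment rather than a string comparison.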