
Results

The statements that were judged can be divided into those about the overall diagnosis and those about the details. Overall rating was done on the first hypothesis in each case and on the summary of the alternatives in the 21 cases that had additional hypotheses. The ratings are given in table 1.

There are three cases in which at least one reviewer considered the first hypothesis wrong. In two of these, the program missed diagnoses of mitral stenosis, once because the location of the murmur only matched tricuspid stenosis and once because the program considered the murmur to be a flow murmur of the mitral regurgitation that was present. In the third case, the program left pleuritic chest pain unaccounted for. The reviewer criticized it for not suggesting pulmonary embolism, although the actual cause was pericarditis. The two seriously wrong judgments were made by different reviewers. The judgments of the alternatives included fewer in complete agreement and none considered wrong. With only about 10% of first hypotheses judged by any reviewer to be wrong, there seems to be general agreement, but with only 25% of first hypotheses judged by both to be completely correct, and fewer of the alternatives, there is considerable disagreement in the details.

In this analysis and in the analysis of the detailed statements, it became clear that some of the distinctions in the rating scale had no practical significance. The reviewers seemed to use possible and partly correct interchangeably, so in the following analysis they are combined and called possible. There were no statements that both reviewers rated as seriously wrong; in fact, there were no statements that one reviewer rated seriously wrong that the other reviewer even rated wrong, so wrong and seriously wrong are combined as wrong.
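The merging of rating categories described above can be sketched as a simple lookup. This is a minimal illustration, assuming the original scale consisted of the five labels named in the text; the name `collapse` and the exact label strings are hypothetical and not taken from the study's materials.

```python
# Assumed five-point scale, collapsed to the three categories used in the analysis.
COLLAPSE = {
    "correct": "correct",
    "possible": "possible",
    "partly correct": "possible",   # used interchangeably with "possible"
    "wrong": "wrong",
    "seriously wrong": "wrong",     # never corroborated by the second reviewer
}

def collapse(rating: str) -> str:
    """Map a reviewer's original rating to correct, possible, or wrong."""
    return COLLAPSE[rating.strip().lower()]
```

An unknown label raises `KeyError`, which seems preferable here to silently assigning a default category.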

To determine whether there was any systematic bias among the reviewers, we compared the fraction of the statements each rated as correct, possible, or wrong to the fraction with that rating as judged by the rest of the reviewers. We also compared the ratings of statements about disease nodes in all hypotheses versus those in some hypotheses. Those fractions with their significance and the fraction of the total statements they represent are given in table 2.
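One way such a reviewer-versus-rest comparison could be carried out is sketched below. The text does not say which significance test was used; the pooled two-proportion z-test here is an assumption chosen for illustration, and the function names and the layout of `counts` are hypothetical.

```python
import math

def two_proportion_p(k1, n1, k2, n2):
    """Two-sided p-value of a pooled two-proportion z-test (normal approximation)."""
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # every rating identical; no difference to detect
    z = (k1 / n1 - k2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))

def reviewer_vs_rest(counts, reviewer, rating):
    """Compare one reviewer's fraction of `rating` with all other reviewers pooled.

    counts: {reviewer: {rating: number_of_statements}}
    Returns (own fraction, rest-of-panel fraction, p-value).
    """
    own = counts[reviewer]
    k1, n1 = own.get(rating, 0), sum(own.values())
    rest = [c for r, c in counts.items() if r != reviewer]
    k2 = sum(c.get(rating, 0) for c in rest)
    n2 = sum(sum(c.values()) for c in rest)
    return k1 / n1, k2 / n2, two_proportion_p(k1, n1, k2, n2)
```

With small rating counts, as for the reviewers who contributed the fewest ratings, an exact test would be a safer choice than the normal approximation.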

This analysis indicates that the reviewers were well balanced. The two reviewers who differed significantly from the rest contributed the smallest number of ratings, and their differences balanced each other. Statements about possible nodes were judged more harshly than those about definite nodes.

Analysis of the detailed statements requires caution because each statement may overlap with one to several other statements in the same case. In relating a node to its definite and possible causes and effects, each statement touches on several nodes. If there is a node that the reviewer feels should not be in the hypothesis, that affects his judgment of statements about each of the causes and effects as well as the statement about the node itself. For that reason we have analyzed the critiques to determine the source of negative judgments and clustered the affected statements together. This process required some interpretation in many cases, but fortunately the reviewers' comments were extensive enough that it was always possible to have a good idea what they were concerned about.

In the 285 detailed questions there were 137 issues raised by one or both of the reviewers. One might infer that both reviewers were in complete agreement with 148 of the statements, but because of the influence of the issues on the statements about causes and effects, there were only 92 statements that were considered correct by both reviewers. For each reviewer, each issue was assigned the most serious rating of any of the statements in which that issue arose. Some issues were comments from the reviewers about relations and hypotheses that they considered missing. Because these did not have ratings, they were all rated as possible. Thus, if P denotes possible and W denotes wrong, the issues that concerned both reviewers could be rated WW, WP, or PP, and those that were of concern to only one would be rated W or P. The 137 issues that arose in the cases were rated as follows: WW 11, WP 10, PP 16, W 27, P 73. The majority (53%) were possible changes that were of concern to only one reviewer. Still, there were 11 issues that the reviewers agreed were wrong and another 10 that one considered wrong and the other reviewer thought were not the best choice.
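The rule for coding an issue, and the arithmetic behind the reported tally, can be sketched as follows. The counts are the ones given in the text; the function names and the representation of ratings as lowercase strings are assumptions made for illustration.

```python
SEVERITY = {"correct": 0, "possible": 1, "wrong": 2}
LETTER = {"possible": "P", "wrong": "W"}

def worst(ratings):
    """Most serious rating one reviewer gave to any statement touching the issue."""
    return max(ratings, key=SEVERITY.__getitem__, default="correct")

def issue_code(reviewer_a, reviewer_b):
    """Combine both reviewers' worst ratings into WW, WP, PP, W, or P.

    A reviewer whose worst rating is "correct" contributes no letter,
    so an empty string means neither reviewer raised the issue.
    """
    letters = [LETTER[w] for w in (worst(reviewer_a), worst(reviewer_b)) if w in LETTER]
    return "".join(sorted(letters, reverse=True))  # puts W before P

# Tally as reported in the text.
REPORTED = {"WW": 11, "WP": 10, "PP": 16, "W": 27, "P": 73}
total = sum(REPORTED.values())                   # 137 issues
single_possible_share = REPORTED["P"] / total    # about 0.53, the stated majority
```

A comment about a missing relation or hypothesis, which carries no rating of its own, would enter this scheme as a single "possible" rating for the reviewer who made it.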

To determine the source of these issues, we analyzed and classified them. Classifying the issues is open to some interpretation, but it is very useful in determining whether the issues imply that refinement of the program is needed, that the method of summarization and presentation misrepresented the conclusions of the program, that there was some misunderstanding or mistake on the part of the reviewer, or that there is an underlying difference of opinion among cardiologists. To avoid missing problems with the program, we have classified issues as relating to the program if there is any doubt.

The classifications we used are as follows:

Controversy: These are issues in which there is clearly a difference of opinion among cardiologists. Several of these came to light during the evaluation in discussions in which the reviewers disagreed with each other. Others are disagreements with carefully considered representations in the program, that is, disagreements between a reviewer and the developers. These were reviewed and classified by the cardiologist developers.
Reviewer wrong or inconsistent: These are issues where the criticism is in conflict with the actual patient state as indicated in the patient record or, in two cases, the reviewer's ratings of different statements in the same case are mutually inconsistent. Given the limited information in the input, several diseases may be appropriate to consider in addition to what the patient actually had. Therefore, the judgment was only classified as wrong when the program's statement corresponded to the actual situation and the reviewer rated that as incorrect.
Misunderstanding: These are issues in which the reviewer probably overlooked part of the information or misunderstood some of the (occasionally convoluted) automatically generated text. That is, the intended meaning of the statement is consistent with the reviewer's objection.
Summarization: These are issues in which the program that summarized the hypotheses and put the information from multiple hypotheses together into single statements about nodes obscured the relationships that exist among the nodes or inappropriately labeled node clusters.
Program: Relationships in the program that need to be reexamined. (These will be further classified.)

As an example of how the cases were analyzed, consider the evaluation form in the appendix. The first reviewer marked all statements correct except two: one about atrial fibrillation and one about atrial septal defect (ASD), which were marked possible. The second reviewer marked the atrial fibrillation and mitral regurgitation statements partly correct with a note that atrial fibrillation is caused by dilated cardiomyopathy not mitral regurgitation. He marked the statement about dopamine wrong and those about acute MI, coronary artery disease, and ASD as possible. Thus, these reviews resulted in one W: dopamine contributing to heart failure; two PPs: atrial fibrillation caused by dilated cardiomyopathy instead of mitral regurgitation and ASD causing fixed splitting S2; and one P: acute MI as evidenced by elevated CPK-MB (with the coronary artery disease statement considered to be the same issue). These were classified as follows: ASD, because it is a weak alternative hypothesis, was attributed to the program; the dopamine and the cause of atrial fibrillation were considered controversies, because the cardiologist developers consider them correct; and the acute MI as a misunderstanding. The acute MI is an example of the kind of misunderstanding that can arise from an automatically generated evaluation form. The acute MI was only included in the first of the two hypotheses and therefore listed with the nodes only present in some hypotheses. The reviewer considered acute MI a likely possibility, and therefore was in agreement with the program, but marked the statement possible rather than correct, confusing the correctness of the statement with the likelihood of acute MI.

Given this scheme, the 137 issues identified by the reviewers were classified as shown in table 3.

One controversy that accounted for six of the issues was whether diastolic dysfunction causes low cardiac output and left heart failure. That is, whether a patient with LV hypertrophy, a normal ejection fraction, fatigue and pulmonary congestion should be described as having diastolic dysfunction causing the findings. It is clear that patients, especially older patients, present with those findings, but it depends on how one defines "diastolic dysfunction" whether that is the cause or not. This difference generated discussion among the reviewers. Several other controversies were probably issues of definition as well - whether mitral stenosis causes anginal chest pain, what constitutes a left atrial abnormality on electrocardiogram, or how broadly one may define COPD (chronic obstructive pulmonary disease). Some of the disagreements were whether a particular disease was adequately supported. For example, whether murmurs as the only direct evidence were sufficient to suggest aortic stenosis or tricuspid regurgitation. Others were findings that the program left unaccounted for that the reviewers felt should be accounted for, such as cough and non-specific ST and T wave changes.

Assuming that different reviewers take different sides in controversies, make different mistakes, and have different misunderstandings, the first three categories in the table are issues that would have a high rate of disagreement between reviewers. Because these are only a sample of what might contribute to inter-reviewer disagreement and they are 45% of the issues, the differences among the reviewers are likely to be comparable to the differences the reviewers had with the program.

Most of the summarization problems resulted from using the phrases caused by and accounted for in the statements to represent all of the linkages between summary nodes. Once clusters of nodes are abstracted to summary nodes, many of these become influences not normally considered causality (e.g., the summary saying that aortic stenosis is causing hypertension because the hypertension cluster includes the high LV pressure node, which is also caused by aortic stenosis) or a cluster having an inappropriate label given the severity or what it was influencing (e.g., a cluster labeled high blood volume causing elevated liver function tests, skipping the intermediate splanchnic congestion). Another summarization problem was grouping acute and chronic manifestations of nodes together. For example, the program hypothesized that one patient had chronic mitral regurgitation worsened by an acute MI. In summarizing the mitral regurgitation, it listed the acute MI as a cause and included both acute and chronic findings as effects. Another problem was not having different names for different severities of the clusters. For example, situations ranging from mild tachypnea to frank pulmonary edema were all labeled left heart failure, whereas treated failure without symptoms or with minor symptoms should be called compensated left failure.

Criticisms Related to the Program


wjl@MEDG.lcs.mit.edu
Sat Nov 4 11:23:04 EST 1995