Allocating CHI reviewers, a sequel
Last year I used an analysis of CHI review data to argue that we could save a lot of reviewers’ time on low-quality papers by modifying our review process. With all the current talk of the value of replication, I figured it was worth testing the same procedure with this year’s review data, which Dan Olsen was kind enough to provide.
CHI currently collects three external reviews on every paper before engaging the program committee to make an accept/reject decision. Last year I did an analysis that suggested rejecting papers with two bad reviews (scores below 3) without requiring a third, and showed that this would have saved 469 reviews (of papers that should have been rejected) while accidentally rejecting just 6 of about 300 papers. I argued that 6 papers was probably dwarfed by other mistakes the PC makes, so this would be a worthwhile tradeoff.
Because I’m replicating, I don’t need to detail the analysis again; you can find that in last year’s post. I explored the following procedure: send each paper to two reviewers and, if the scores are too low, reject it without further consideration. Otherwise, send it to a third reviewer and then on to the program committee for a decision. Varying the definition of “too low” provides a tradeoff between false positives (extra reviews for papers that ultimately get rejected) and false negatives (accidental early rejection of papers that would have been accepted). Looking at last year’s data, a pretty appealing threshold was to reject any paper whose two scores were both lower than 3. This is a natural rule because 3 is the “neutral” score in CHI reviews; two scores below 3 signify that both reviewers recommend rejection. It was also a good threshold because it saved 469 reviews while creating only 6 accidental rejections.
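The two-stage rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `needs_third_review` and the `threshold` parameter are my own labels, not part of any CHI tooling, and the scores in the demo loop are made up.

```python
def needs_third_review(score1: float, score2: float, threshold: float = 3.0) -> bool:
    """Return True if the paper should go on to a third reviewer.

    A paper is rejected early only when BOTH scores fall strictly below
    the threshold; any other combination earns a third review.
    """
    return not (score1 < threshold and score2 < threshold)


if __name__ == "__main__":
    # Hypothetical score pairs, chosen only to exercise each branch.
    for pair in [(2.5, 2.0), (3.0, 2.0), (4.0, 1.0)]:
        outcome = "third review" if needs_third_review(*pair) else "early reject"
        print(pair, outcome)
```

Raising or lowering `threshold` is what moves you along the false positive/false negative tradeoff.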
So, did the results replicate this year? Applying exactly the same procedure, we get a remarkable level of agreement. Of the roughly 1567 papers submitted, there were 369 acceptances and 1198 rejections. Using the “two below 3” rule, we skip 491 reviews of low-rated papers, while accidentally rejecting 4.7 papers (see last year’s post, or the discussion below, to understand why we get a fraction). An almost identical outcome, just slightly better.
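To put those counts in proportion, here is a quick back-of-the-envelope calculation using only the figures quoted above. The function name is hypothetical, and the denominator for the workload figure assumes every submission would otherwise have received exactly three reviews, which is only approximately true given the handful of papers with fewer.

```python
def summarize(accepted: int, rejected: int, false_rejects: float, saved_reviews: float):
    """Return (false-reject rate among accepted papers, fraction of reviews saved)."""
    submitted = accepted + rejected
    # Assumption: three external reviews per submission under the current process.
    total_reviews = 3 * submitted
    return false_rejects / accepted, saved_reviews / total_reviews


if __name__ == "__main__":
    fr_rate, saved_frac = summarize(369, 1198, 4.7, 491)
    print(f"accidental rejections: {fr_rate:.1%} of accepted papers")
    print(f"reviews saved: {saved_frac:.1%} of all reviews")
```

Under these assumptions, the accidental rejections come to a bit over 1% of accepted papers, and the skipped reviews to roughly a tenth of the total reviewing load.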
Some caveats: the data I got was slightly messy, with a few papers showing no reviews or fewer than three. I presume these were withdrawn submissions, but haven’t had time to find out. There were fewer than ten of these, not enough to influence the results significantly.
For those who want to explore the data themselves, I’ve prepared a table of the relevant numbers. Score 1 and Score 2 are the reviewer scores; Accepts counts the number of pairs (remember, three pairs to a paper) from accepted papers that got those two scores, and Rejects the number of pairs from rejected papers that got those two scores. All other columns are functions of these two. Ratio is the ratio of accepts to rejects for a given pair of scores. I’ve ordered the rows by this ratio, which is useful for visualizing the optimum false positive/false negative tradeoff. Cum acc and Cum rej are running totals of the accept and reject pairs down through the current line; dividing them by three lets us count papers (averaged over outcomes) instead of pairs. These cumulative totals thus tell you how many papers from each category would be given a third review if you used the given line as the threshold for deciding on that third review. False rejects then shows how many papers that were ultimately accepted would have been rejected in the first round using the given threshold, while saved reviews counts the number of ultimately rejected papers for which the threshold would have led to skipping the third review.
Note that the Accepts and Rejects columns count “review pairs”. Each paper in the data set got three reviews. If we imagine the chair selecting three reviewers but then requesting a review from only two of them (keeping the third in reserve for when the first reviews are good), then there are three possible pairs the chair could have picked, and so three possible outcomes. Counting over all three outcomes of every paper yields the average outcome: the expectation should the chair select their two reviewers at random from the pool of three. It is this averaging that produces fractions in the other columns.
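A small sketch may make the pair-averaging concrete. The helper below (a hypothetical name, with made-up scores in the demo) enumerates the three possible reviewer pairs for one paper and reports what fraction of them would trigger an early rejection under a “both below threshold” rule.

```python
from itertools import combinations


def early_reject_probability(scores, threshold=3.0):
    """Fraction of reviewer pairs whose scores are both strictly below threshold.

    With three reviews there are three possible pairs, so the result is
    always a multiple of 1/3.
    """
    pairs = list(combinations(scores, 2))
    rejected = sum(1 for a, b in pairs if a < threshold and b < threshold)
    return rejected / len(pairs)


if __name__ == "__main__":
    # Hypothetical paper: two low scores and one high score.
    print(early_reject_probability([2.5, 2.0, 4.0]))
```

An accepted paper with these scores would be rejected early only if the chair happened to pick the two low-scoring reviewers, so it contributes one third of a paper to the false-reject count.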
| Score 1 | Score 2 | Accepts | Rejects | Ratio | Cum acc | Cum rej | false rejects | saved reviews |
|---|---|---|---|---|---|---|---|---|
| 5 | 5 | 14 | 0 | NaN | 14 | 0 | 364.3 | 1197.3 |
| 4.5 | 4.5 | 37 | 3 | 12.33 | 51 | 3 | 352.0 | 1196.3 |
| 5 | 4 | 63 | 8 | 7.87 | 114 | 11 | 331.0 | 1193.7 |
| 5 | 4.5 | 41 | 6 | 6.83 | 155 | 17 | 317.3 | 1191.7 |
| 4.5 | 4 | 113 | 21 | 5.38 | 268 | 38 | 279.7 | 1184.7 |
| 4 | 4 | 125 | 31 | 4.03 | 393 | 69 | 238.0 | 1174.3 |
| 4.5 | 3.5 | 65 | 20 | 3.25 | 458 | 89 | 216.3 | 1167.7 |
| 5 | 3 | 22 | 8 | 2.75 | 480 | 97 | 209.0 | 1165.0 |
| 5 | 3.5 | 19 | 8 | 2.37 | 499 | 105 | 202.7 | 1162.3 |
| 4 | 3.5 | 129 | 68 | 1.90 | 628 | 173 | 159.7 | 1139.7 |
| 4.5 | 3 | 32 | 23 | 1.39 | 660 | 196 | 149.0 | 1132.0 |
| 4 | 3 | 70 | 81 | 0.86 | 730 | 277 | 125.7 | 1105.0 |
| 3.5 | 3.5 | 52 | 62 | 0.84 | 782 | 339 | 108.3 | 1084.3 |
| 4.5 | 2.5 | 20 | 29 | 0.69 | 802 | 368 | 101.7 | 1074.7 |
| 5 | 1 | 2 | 3 | 0.67 | 804 | 371 | 101.0 | 1073.7 |
| 5 | 2.5 | 9 | 17 | 0.53 | 813 | 388 | 98.0 | 1068.0 |
| 4 | 2.5 | 43 | 95 | 0.45 | 856 | 483 | 83.7 | 1036.3 |
| 4.5 | 1.5 | 8 | 18 | 0.44 | 864 | 501 | 81.0 | 1030.3 |
| 4.5 | 2 | 11 | 25 | 0.44 | 875 | 526 | 77.3 | 1022.0 |
| 4 | 2 | 39 | 95 | 0.41 | 914 | 621 | 64.3 | 990.3 |
| 3.5 | 3 | 48 | 117 | 0.41 | 962 | 738 | 48.3 | 951.3 |
| 5 | 2 | 6 | 20 | 0.30 | 968 | 758 | 46.3 | 944.7 |
| 5 | 1.5 | 2 | 8 | 0.25 | 970 | 766 | 45.7 | 942.0 |
| 3 | 3 | 14 | 66 | 0.21 | 984 | 832 | 41.0 | 920.0 |
| 4 | 1 | 5 | 28 | 0.18 | 989 | 860 | 39.3 | 910.7 |
| 3.5 | 2.5 | 30 | 172 | 0.17 | 1019 | 1032 | 29.3 | 853.3 |
| 4 | 1.5 | 6 | 36 | 0.17 | 1025 | 1068 | 27.3 | 841.3 |
| 4.5 | 1 | 2 | 13 | 0.15 | 1027 | 1081 | 26.7 | 837.0 |
| 3.5 | 2 | 24 | 199 | 0.12 | 1051 | 1280 | 18.7 | 770.7 |
| 3.5 | 1.5 | 6 | 68 | 0.09 | 1057 | 1348 | 16.7 | 748.0 |
| 3 | 2.5 | 15 | 188 | 0.08 | 1072 | 1536 | 11.7 | 685.3 |
| 3 | 1 | 3 | 52 | 0.06 | 1075 | 1588 | 10.7 | 668.0 |
| 3.5 | 1 | 3 | 55 | 0.05 | 1078 | 1643 | 9.7 | 649.7 |
| 3 | 2 | 10 | 214 | 0.05 | 1088 | 1857 | 6.3 | 578.3 |
| 3 | 1.5 | 2 | 107 | 0.02 | 1090 | 1964 | 5.7 | 542.7 |
| 2.5 | 2.5 | 3 | 155 | 0.02 | 1093 | 2119 | 4.7 | 491.0 |
| 2 | 2 | 4 | 223 | 0.02 | 1097 | 2342 | 3.3 | 416.7 |
| 2 | 1.5 | 3 | 227 | 0.01 | 1100 | 2569 | 2.3 | 341.0 |
| 2.5 | 2 | 4 | 331 | 0.01 | 1104 | 2900 | 1.0 | 230.7 |
| 2.5 | 1 | 1 | 83 | 0.01 | 1105 | 2983 | 0.7 | 203.0 |
| 1.5 | 1 | 1 | 144 | 0.01 | 1106 | 3127 | 0.3 | 155.0 |
| 2 | 1 | 1 | 157 | 0.01 | 1107 | 3284 | 0.0 | 102.7 |
| 1 | 1 | 0 | 94 | 0.00 | 1107 | 3378 | 0.0 | 71.3 |
| 1.5 | 1.5 | 0 | 76 | 0.00 | 1107 | 3454 | 0.0 | 46.0 |
| 2.5 | 1.5 | 0 | 138 | 0.00 | 1107 | 3592 | 0.0 | 0.0 |
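For anyone checking the arithmetic, the last two columns appear to follow directly from the cumulative ones: subtract the running total from the grand total of pairs, then divide by three to convert pairs to papers. The sketch below reproduces that calculation; the constant and function names are mine, and the grand totals are simply the final Cum acc and Cum rej values in the table.

```python
TOTAL_ACCEPT_PAIRS = 1107  # final "Cum acc" value in the table
TOTAL_REJECT_PAIRS = 3592  # final "Cum rej" value in the table
PAIRS_PER_PAPER = 3        # three possible reviewer pairs per paper


def derived_columns(cum_acc: int, cum_rej: int):
    """Return (false rejects, saved reviews) for a row, rounded to one decimal."""
    false_rejects = (TOTAL_ACCEPT_PAIRS - cum_acc) / PAIRS_PER_PAPER
    saved_reviews = (TOTAL_REJECT_PAIRS - cum_rej) / PAIRS_PER_PAPER
    return round(false_rejects, 1), round(saved_reviews, 1)


if __name__ == "__main__":
    # Cumulative totals taken from the 2.5 / 2.5 row of the table.
    print(derived_columns(1093, 2119))
```

Feeding in the cumulative totals from the 2.5 / 2.5 row gives back the 4.7 accidental rejections and 491 saved reviews quoted in the post.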
Interesting.
I’m not sure what the CHI procedure was this year; apparently it was streamlined compared with past years. For CSCW, more reviews could be saved while still giving all papers three reviews by not assigning a fourth reviewer/metareviewer to papers with three reviews under a 4. For example, if every paper had 2 external reviewers and one AC assigned to do a review, and was rejected with no further comment if none of them gave it a 4 or 5, and the same data pattern was found, almost 950 papers would have ended up with only three reviews. If in fact all papers ended up with three externals and an AC, it would save 950 or more, and our data indicated that for CSCW no accidental kills would occur. How many reviews/metareviews were there in total this year?
As to the papers with fewer than 3 reviews, they could have been “desk rejects” ruled out of scope by the chairs prior to reviewing or after one reviewer noted they were entirely out of scope or had some other fatal flaw. 1% of the CSCW 2012 submissions were in this category.
Are you sure that your dataset reflects the original review scores, and not scores that were revised after discussion or rebuttals? Revised scores are more likely to be compatible with final decisions.
The data from year 1 of the analysis is perfect: an archival copy of the scores that were entered from just before reviewers began discussion. In the second year, I missed the beginning of discussion by 3 days—some eager reviewers may have begun discussion, but I doubt that many had. It’s impossible to know, and I’ll be more careful to request the data in time next year.