The effect of instructions on a proofreading task

  • Total Cost: $0.94
  • 15 iterative HITs
  • TurKit code for this experiment: proofread.js
  • TurKit Version: 0.1.42

Inspired by previous posts about proofreading research papers (1, 2), I used a proofreading task to explore how iterative tasks on MTurk converge or diverge.  There was an interesting effect caused by a small change in instructions.

The HITs used a paragraph drawn from one of our research papers.  Each iteration introduced a single random error in it — an inserted character, a deleted character, or a transposition of two adjacent characters.  The paragraph was presented to the turker in a textbox, and the turker was asked to proofread it, correct any errors found, and submit it.  For example, here is the paragraph with a random error (highlighted in red):

Automatic clustering generally helps separate different kinds of records that need to be edited differently, but it isn't perfect. Sometimes it creates more clusters than needed, because the differences in structure aren't important to the user's particular editing task.  For example, if the user only needs to edit near the end of each line, then differences at the start of the line are largely irrelevant, and it isn't necessary to split based on those differences.  Conversely, sometimes the clustering isn't fine enough, leaving heterogeneous clusters that must be edited one line at a time.  One solution to this problem would be to let the user rearrange the clustering manually, perhaps using drag-and-drop to merge and split clusters.  Clustering and selection generalizaxtion would also be improved by recognizing common text structure like URLs, filenames, email addresses, dates, times, etc.

The turker didn’t see the error highlighted like this, but the turker’s web browser may have highlighted it anyway.  Firefox, for example, puts a red underline under suspiciously-spelled words in a textbox.  So “generalizaxtion” would in fact be underlined above.  So, incidentally, would be “filenames,” because Firefox prefers “file names.”
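The error-introduction step described above can be sketched in a few lines of JavaScript.  This is not the code from proofread.js; the function name `introduceRandomError`, the optional `rng` parameter, and the choice to insert only lowercase letters are my own assumptions for illustration.

```javascript
// Hypothetical helper, not taken from proofread.js.
// Introduces exactly one character-level error into `text`:
// an inserted character, a deleted character, or a swap of two adjacent characters.
// `rng` is any function returning a number in [0, 1); it defaults to Math.random.
function introduceRandomError(text, rng = Math.random) {
  if (text.length < 2) {
    throw new Error("text must contain at least two characters");
  }
  const letters = "abcdefghijklmnopqrstuvwxyz"; // assumed alphabet for insertions
  const kind = ["insert", "delete", "transpose"][Math.floor(rng() * 3)];

  if (kind === "insert") {
    // Any position from 0 (before the first character) to text.length (at the end).
    const pos = Math.floor(rng() * (text.length + 1));
    const ch = letters[Math.floor(rng() * letters.length)];
    return text.slice(0, pos) + ch + text.slice(pos);
  }
  if (kind === "delete") {
    const pos = Math.floor(rng() * text.length);
    return text.slice(0, pos) + text.slice(pos + 1);
  }
  // Transpose the characters at pos and pos + 1.
  const pos = Math.floor(rng() * (text.length - 1));
  return text.slice(0, pos) + text[pos + 1] + text[pos] + text.slice(pos + 2);
}
```

One caveat of this sketch: transposing two identical adjacent characters (the "ee" in "needed", say) leaves the text unchanged, so a real implementation would probably want to retry in that case.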

After one turker edited the paragraph (hopefully fixing the introduced error), a new error would be introduced in the submitted version, and another turker would edit the paragraph.  The structure was iterative, so if one turker made radical edits to the paragraph, those changes would persist.  No voting or other validation process was used to approve the edits, so it would be possible for the paragraph to significantly diverge from the original if a turker thought it would be better written differently.  That was the goal of this little exploration — to see what might encourage or discourage this kind of divergence.
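To make the iterative structure concrete, here is a minimal simulation of it.  The names `runIterations`, `corrupt`, and `askWorker` are hypothetical; in the real experiment the `askWorker` step was a HIT posted through TurKit, which is not modeled here.

```javascript
// Sketch of the iterative structure. `corrupt` and `askWorker` are hypothetical
// stand-ins supplied by the caller; `askWorker` represents posting a HIT and
// getting back the edited text (synchronous here for simplicity).
function runIterations(original, iterations, corrupt, askWorker) {
  let current = original;
  const history = [];
  for (let i = 0; i < iterations; i++) {
    const shown = corrupt(current);          // introduce one new error
    const submitted = askWorker(shown, i);   // the worker's edited text
    history.push({ iteration: i + 1, shown, submitted });
    current = submitted;                     // no voting: the edit is accepted as-is
  }
  return { finalText: current, history };
}

// Simulated usage: the "error" is a stray "x" appended to the text, and every
// simulated worker removes it and changes nothing else.
const appendX = (text) => text + "x";
const carefulWorker = (shown) => shown.replace(/x$/, "");
const run = runIterations("It isn't perfect.", 3, appendX, carefulWorker);
console.log(run.finalText === "It isn't perfect."); // true
```

Because each submission becomes the input to the next iteration without any check, a single worker's stylistic rewrite is carried forward unless a later worker happens to undo it.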

The first time I tried this task, the instructions were:

  • Please proofread and make corrections to the text below.

Each HIT paid $0.01, and the paragraph went through five iterations of editing before I terminated the process.  The full results for trial 1 show exactly the kind of divergence I was hoping for — turkers not only fixed the introduced error, but made other changes as well.  Here is the final version of the paragraph (with all changes from the original paragraph highlighted in yellow):

Automatic clustering generally helps separate different kinds of records that need to be edited differently, but it isn't perfect.

Sometimes it creates more clusters than needed because the differences in structure aren't important to the user's particular editing task. 

For example, if the user only needs to edit near the end of each line, then differences at the start of the line are largely irrelevant, and it isn't necessary to split based on those differences. 

Conversely, sometimes the clustering isn't fine enough which leaves heterogeneous clusters that must be edited one line at a time. 

One solution to this problem would be to let the user rearrange the clustering manually, perhaps using drag-and-drop to merge and split clusters.  

Clustering and selection generalization would also be improved by recognizing common text structure such as URLs, file names, e-mail addresses, dates, times, and so on.

In fact, all five turkers made at least two edits, even though there was only one glaring typo in each iteration.  (One might argue that “filenames” was also a glaring typo, because Firefox’s spellchecker pointed it out.)  One turker changed “email” to “e-mail”; another preferred “such as” instead of “like”; another changed “etc.” to “and so on.”  The most radical change was made by a turker who split the text into one-sentence paragraphs — possibly a newspaper copy editor?
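One rough way to put a number on this kind of divergence is to count the words that differ between two versions.  The sketch below uses a word-level longest common subsequence; the function names are my own, and this is not how the edits in this post were actually counted (I counted them by eye).

```javascript
// Split text into whitespace-separated words, ignoring empty strings.
function splitWords(text) {
  return text.trim().split(/\s+/).filter((w) => w.length > 0);
}

// Hypothetical divergence measure: how many words were removed from `before`
// and how many were added in `after`, based on their longest common subsequence.
function countWordChanges(before, after) {
  const a = splitWords(before);
  const b = splitWords(after);
  // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
  const lcs = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0)
  );
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] =
        a[i] === b[j]
          ? lcs[i + 1][j + 1] + 1
          : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  const common = lcs[0][0];
  return { removed: a.length - common, added: b.length - common };
}

// The tail of the last sentence, original versus trial 1 final version.
const originalTail =
  "common text structure like URLs, filenames, email addresses, dates, times, etc.";
const trial1Tail =
  "common text structure such as URLs, file names, e-mail addresses, dates, times, and so on.";
console.log(countWordChanges(originalTail, trial1Tail)); // { removed: 4, added: 8 }
```

Since words are split on whitespace, this measure ignores the split into one-sentence paragraphs entirely, and punctuation attached to a word counts as part of it.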

The second trial of this task started with the same paragraph, but slightly different instructions:

  • Please proofread and correct the text below.

and a larger payout: $0.04 per HIT, instead of just a cent.  This process ran for 10 iterations, but no divergence occurred.  The full results for trial 2 end with this final version:

Automatic clustering generally helps separate different kinds of records that need to be edited differently, but it isn't perfect. Sometimes it creates more clusters than needed, because the differences in structure aren't important to the user's particular editing task.  For example, if the user only needs to edit near the end of each line, then differences at the start of the line are largely irrelevant, and it isn't necessary to split based on those differences.  Conversely, sometimes the clustering isn't fine enough, leaving heterogeneous clusters that must be edited one line at a time.  One solution to this problem would be to let the user rearrange the clustering manually, perhaps using drag-and-drop to merge and split clusters.  Clustering and selection generalization would also be improved by recognizing common text structure like URLs, file names, email addresses, dates, times, etc.

which differs from the original only in the place where Firefox’s spellchecker flagged a supposed typo, “filenames.”  The change from “filenames” to “file names” was made in the first iteration of trial 2, just as it was in trial 1, which strongly suggests that Firefox is to blame.

No turker worked on more than one HIT: all 15 HITs across trials 1 and 2 were completed by different people.  TurKit allowed me to enforce that even though the iterations were posted as separate HITs.
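I won't go into how TurKit does this, but the bookkeeping behind a "no repeat workers" rule is simple.  The sketch below is only an illustration of the idea, with a hypothetical `makeWorkerRegistry` helper and made-up worker IDs; it does not use or describe TurKit's actual API.

```javascript
// Hypothetical bookkeeping for "each worker may contribute only once".
// Worker IDs are treated as opaque strings.
function makeWorkerRegistry() {
  const seen = new Set();
  return {
    // Returns true and records the worker if they are new; false if they
    // have already contributed to an earlier iteration or trial.
    admit(workerId) {
      if (seen.has(workerId)) return false;
      seen.add(workerId);
      return true;
    },
    count() {
      return seen.size;
    },
  };
}

// Usage with made-up IDs: the second attempt by "W1" is turned away.
const registry = makeWorkerRegistry();
console.log(registry.admit("W1")); // true
console.log(registry.admit("W2")); // true
console.log(registry.admit("W1")); // false
```

Sharing one registry across both trials is what would keep a trial 1 worker from reappearing in trial 2.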

Discussion

It’s interesting to speculate why divergence occurred in trial 1 but not in trial 2.  Trial 1 ran only half as many iterations as trial 2, yet it diverged much more: divergent editing happened on every iteration of trial 1 and on no iteration of trial 2.  Something must be up.

Trial 1 turkers actually did more work (more edits each) for less money (one cent instead of four).  But many of those edits were unnecessary and unhelpful, at least in the opinion of one author of the paragraph (me).

My guess is that the wording of the trial 1 instructions (“…make corrections…”, in the plural) biased them to make more than one edit, lest their work not be accepted.  So trial 1 turkers were actually hunting for something to change.

Trial 2 turkers, on the other hand, merely had to correct the text.  So it was sufficient to make the obvious corrections that Firefox suggested, and not introduce arbitrary changes for the sake of earning their pay.

One trial 2 turker actually made no changes at all, leaving the introduced error unfixed.  The next turker fixed both errors (the leftover one and the newly introduced one), however, so trial 2 successfully converged, even in the absence of voting or other verification.

One idea that this experiment suggests is that turkers will work hard to find something to do in a proofreading task, even if there isn’t anything useful for them to do.  So a proofreading application that uses MTurk may be more effective if it intentionally introduces at least one error in each work piece — not only to catch lazy turkers, but also to reduce the risk of divergence due to unnecessary changes by honest turkers who are just trying to prove they’re really working.
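A proofreading application built on this idea would need some check on what came back.  Here is a minimal sketch, assuming the application keeps both the text before seeding and the seeded text; the function name and the three labels are hypothetical.

```javascript
// Hypothetical check for a submission to a HIT with one seeded error.
//   clean     - the text before the error was seeded
//   seeded    - the text shown to the worker (clean plus one seeded error)
//   submitted - the text the worker returned
// Whitespace differences are ignored when comparing.
function classifySubmission(clean, seeded, submitted) {
  const norm = (s) => s.trim().replace(/\s+/g, " ");
  const c = norm(clean);
  const s = norm(seeded);
  const t = norm(submitted);
  if (t === s) return "unchanged";   // seeded error still there: possibly a lazy worker
  if (t === c) return "fixed-only";  // seeded error fixed, nothing else touched
  return "other-changes";            // differs from both: needs a closer look
}

console.log(classifySubmission("it isn't perfect", "it isn't perfcet", "it isn't perfcet")); // unchanged
console.log(classifySubmission("it isn't perfect", "it isn't perfcet", "it isn't perfect")); // fixed-only
console.log(classifySubmission("it isn't perfect", "it isn't perfcet", "it is not perfect")); // other-changes
```

The "other-changes" bucket deliberately lumps together genuine extra corrections and unnecessary rewrites; telling those apart is exactly the part that still needs a human (or a vote).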


1 Comment »

 
  • There really are instances where proofreading may be misguided. Of course, we depend on proofreaders to point out the errors, and I mean real errors, not changes made just for the sake of doing work. I hope many proofreaders will do justice to the goal.