A cautionary note on selection

As a follow-up to my experiment on anchoring, I had subjects guess the number of dots in a series of images. Workers were “randomly” (I’ll explain the scare quotes later) assigned to one of 7 such tasks, labeled A-G in order from fewest dots to most.
Here is A:
Picture A
and here is G:
Picture G
After the initial task, workers could keep working, making their way through the other 6 tasks, which were also “randomly” assigned. Most did all 7, but a sizable number did just 1. As data began to come in, I checked whether approximately equal numbers of workers had done each of the 7 pictures. They had, but on AMT this tells us nothing about selection: when a worker “returns” a HIT rather than completing it, the HIT goes back into the pool of uncompleted HITs, where it becomes available to a future worker.

When I looked at the distribution of “first HITs,” i.e., the first picture completed by each subject (remember that workers could do more than one picture), a striking pattern jumped out:

> table(data$first_pic_name)

  A   B   C   D   E   F   G
101 100 141 170 154 158 169

In general, the more dots a picture had, the more likely it was to be the first task done by a worker. A chi-square test confirmed that the distribution of first pictures deviated significantly from uniform.
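For reference, the test is a one-liner in R (a sketch assuming, as above, that the first-picture labels live in data$first_pic_name):

> # goodness-of-fit test against equal expected counts for A-G (chisq.test's default)
> chisq.test(table(data$first_pic_name))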

What’s going on?

The over-representation of high-dot pictures among the first pictures completed by a worker suggests that a disproportionate number of high-dot HITs are being returned. It is of course possible that workers find evaluating many-dot pictures more onerous, but this seems unlikely, especially since there was no penalty for bad guesses. My hunch is that workers find the high-dot tasks less pleasant because they take longer to load, not because they are intrinsically more difficult. Picture G’s file is about 35K, while picture A’s is only about 10K. That difference is trivial on a fast connection, but many workers presumably have slow connections and grew tired of waiting for a G image to load. I plan to test this hypothesis by using image files of approximately identical size and seeing whether the pattern persists.
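Checking the file sizes directly is straightforward; this R sketch assumes hypothetical file names for the seven images:

# paths are hypothetical; substitute the files actually served in the HITs
pics <- paste0("picture_", LETTERS[1:7], ".jpg")
# file.info() reports size in bytes; convert to kilobytes
data.frame(picture = LETTERS[1:7],
           size_kb = round(file.info(pics)$size / 1024, 1))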

So what?

If workers are returning HITs for essentially arbitrary reasons, then the AMT “back to the pool” sampling is as good as random. However, if workers are non-randomly non-complying, you can end up making biased inferences. In addition to the problem of non-random attrition, because of how AMT allocates workers to tasks, the probability that a subject is assigned to a given group also changes over time (e.g., late arrivals to the experiment are more likely to be assigned to the group earlier workers found distasteful). In future posts, I hope to discuss some of the ways this problem can be circumvented.
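To see how the bias creeps in, here is a toy R simulation (entirely invented numbers, not the experimental data). Workers differ in patience, and impatient workers are more likely to return a slow-loading, high-dot HIT; among workers who actually complete a task, average patience then rises from A to G, so the groups are no longer comparable on unobservables:

set.seed(1)
n <- 5000
patience <- runif(n)                            # latent worker trait
pic <- sample(LETTERS[1:7], n, replace = TRUE)  # AMT's initial "random" draw
load_time <- match(pic, LETTERS)                # proxy: more dots = slower load
p_return <- 0.1 * load_time * (1 - patience)    # slow load + impatience = return
completed <- runif(n) >= p_return
# mean patience among completers, by picture: increases from A to G
round(tapply(patience[completed], pic[completed], mean), 2)

Any outcome correlated with patience (connection speed, effort, guessing accuracy) will then differ across pictures even if the pictures themselves have no effect.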


5 Comments

 
  • Dean Eckles says:

    I’m excited about this new blog. Thanks for sharing your experiences.

    For any study on Mechanical Turk in which you want random assignment, I would use ExternalQuestions in which the content is determined when the task is previewed (or accepted). If it is returned or abandoned and then previewed (or accepted) by another worker later, then the random assignment code just runs again.
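
    A minimal sketch of that idea (server-side R; serve_hit, log_assignment, and render_task are hypothetical names):

      # This runs every time the HIT is previewed or accepted, so a
      # returned HIT gets a fresh random draw when the next worker opens it.
      serve_hit <- function(worker_id) {
        condition <- sample(LETTERS[1:7], 1)   # re-randomize among pictures A-G
        log_assignment(worker_id, condition)   # record who saw which picture
        render_task(condition)                 # build the task page for that picture
      }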

    I wrote a post about using dynamic content for HITs, though with the aim of reprioritization and quality of service when using MTurk to power something that is “live”.

  • John Horton says:

    Thanks Dean – you’re right: external questions are the way to go when you need random assignment (though even then, non-random uptake is still a problem). In this particular experiment, I assumed (wrongly) that since the tasks were so similar and there were no obvious differences that could affect compliance, I could simply rely on the AMT ‘randomization’ – lesson learned!

  • Michael says:

    I think an alternative hypothesis is that the high-dot tasks, at first glance, appeared to the Turkers to be much more work. I’d guess that a Turker makes a mental calculation upon seeing a HIT about how much work it will take and how much it pays, and then decides whether to do it. I’d guess that both mental workload and estimated time enter into the equation.

    If that’s true, it would actually say something good about MTurk: that people don’t always try to cheat the system. (If they had tried to cheat, you would have seen a flat rate of take-up across the numbers of dots, because they would have answered with a cheat no matter how hard the task was.) It seems that Turkers would, in general, rather return a HIT than cheat. That is A Good Thing.

  • John Horton says:

    Agreed, though I also think cheating is interesting :)

  • Dean Eckles says:

    One angle on the first-glance assessment of work required, Michael, would be to see how to make tasks that are the same in difficulty and work required strike workers as easier (not through any kind of deception, but through other means).