Image Rating — MTurk vs. Lab Members
We have often used Mechanical Turk to rate things, like brainstorming ideas and image descriptions. We were curious how MTurk ratings compared to ratings we might obtain from a traditional user study.
We selected 10 images from www.publicdomainpictures.net. Each image on this site has a rating on a scale from 1 to 10, obtained from members of the site. We chose images at random, but tried to end up with a roughly uniform distribution of site ratings.
Next, we had two groups of people rate the images on a scale of 1 to 10:
- MTurk workers: We posted the images on MTurk, and solicited ratings from 10 different turkers for each image (1 cent per rating).
- Lab members: We sent an e-mail to the MIT CSAIL mailing list, asking self-identified amateur photographers to rate our images using a web interface, with the promise of $25 to one lucky participant. The response was greater than we expected: 56 participants answered all our questions.
Each group used the same interface to rate the images:
Results:
The following table shows each of the images. Beneath each image are three ratings: MTurk rating, lab member rating, and site rating.
Discussion:
MTurk ratings are correlated with the ratings obtained from amateur photographers in our lab (R^2 = 0.597), though the R^2 value is low enough to suggest that the two populations also differ in meaningful ways. MTurk ratings are also correlated with site ratings, but more weakly (R^2 = 0.182). Interestingly, the correlation between lab member ratings and site ratings is higher (R^2 = 0.495).
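For readers who want to run the same kind of comparison, here is a minimal sketch of an R^2 computation. It assumes each image has been reduced to one mean rating per group, and that R^2 means the squared Pearson correlation between two groups' means. The helper name `r_squared` and the numbers below are made-up placeholders for illustration, not the study's data.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical per-image mean ratings (placeholders, NOT the study's data).
mturk_means = [7.1, 6.4, 8.0, 5.9, 7.5, 6.8, 8.2, 6.0, 7.7, 6.6]
lab_means = [5.0, 4.1, 6.3, 3.8, 5.6, 4.4, 6.9, 4.5, 5.2, 4.9]

print(f"R^2 = {r_squared(mturk_means, lab_means):.3f}")
```

Swapping in the site ratings for either list gives the other two comparisons.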
Lab member ratings were 2 points lower than MTurk ratings on average, a difference that is statistically significant under a paired t-test (p < 0.0001). We suspected that this might be because lab members rated all the images, whereas MTurk workers could rate as many or as few as they wanted. However, each lab member saw the images in a random order, and we didn't see any correlation between an image's position in the sequence and the rating it received. Another hypothesis is that amateur photographers are more discriminating in their taste in photography, or at least that our study encouraged them to be more discriminating, since we explicitly requested their skills as amateur photographers.
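A paired t-test of this kind can be sketched as follows. This is an illustration under assumptions, not necessarily the exact analysis used here: it assumes the pairs are per-image mean ratings from the two groups, and the helper name `paired_t_statistic` and the sample numbers are hypothetical. It returns only the t statistic and degrees of freedom. Turning those into a p-value needs the t distribution, for example from a statistics table or from `scipy.stats.ttest_rel` if SciPy is available.

```python
from math import sqrt


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for a paired t-test on a[i] vs. b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length sequences with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    if var_d == 0:
        raise ValueError("t is undefined when all differences are identical")
    t = mean_d / sqrt(var_d / n)
    return t, n - 1


# Hypothetical per-image mean ratings (placeholders, NOT the study's data).
mturk_means = [7.1, 6.4, 8.0, 5.9, 7.5, 6.8, 8.2, 6.0, 7.7, 6.6]
lab_means = [5.0, 4.1, 6.3, 3.8, 5.6, 4.4, 6.9, 4.5, 5.2, 4.9]

t, df = paired_t_statistic(mturk_means, lab_means)
print(f"t = {t:.2f} with {df} degrees of freedom")
```

A large positive t here would mean the first group rated consistently higher than the second across the paired images.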
Code: