Blurry Text Transcription

  • Total Cost: $4.73
  • 40 HITs, including both text transcription and voting HITs.
  • TurKit code for this experiment: code.js, all files
    • NOTE: The experiment was stopped prematurely, so you probably can’t run the code.js file using the provided database, since the program will try to make calls to MTurk for HITs under my account. However, all the partially completed HITs have been written out to a file called output.txt.
  • TurKit Version: 0.1.37

We are in search of a task that lends itself to iteration — a task where it is easier to understand previous people’s work than it is to redo that work. We did an experiment in the past, before this blog, where people attempted to decipher extremely poor handwriting (page 6). This handwriting seemed impossible for anyone to decipher alone, but by building on each other’s work, turkers were able to transcribe it almost verbatim.

Of course, we didn’t have a control in that experiment, so we don’t actually know that nobody could decipher it all on their own. So we are now running a set of experiments comparing iteration and non-iteration for handwriting recognition.

Well, almost. It is difficult to write with consistently poor handwriting, so instead I wrote passages with the text tool in GIMP and obfuscated them with a distortion filter. To be precise, I used the “Sans” font at 22 pixels, then distorted each passage with Filters > Noise > Spread, using 9 pixels for both horizontal and vertical spread. Here is a result:

writing-1-distort-1

You may see this in a sample HIT here.

We then adopted the experimental design of a previous blog post comparing image description writing iteratively and non-iteratively, except we used 8 iterations for both conditions instead of 6. We also paid $0.10 for each iteration instead of $0.02 — we tried $0.02, but it didn’t seem like we were getting any takers.

Even with $0.10, the experiment ran for a few days without completing, and I could already see some room for improvement, so I shut it off prematurely.

Results:

Here is the final iterative version of the passage shown above:

I had intended to hit the nail, but I’m not a very good aim it seems and I ended up hitting my thumb. This is a common occurence I know, but it doesn’t make me feel any less ridiculous having done it myself. My new strategy will involve lightly tapping the nail while holding it until it is embedded into the wood enough that the wood itself is holding it straight and then I’ll remove my hand and pound carefully away. We’ll see how this goes.

The highlighted word (“embedded”) should have been “wedged”. This transcription also fixes — or almost fixes — a couple of spelling errors in the original passage.

The following pages show the iterative and non-iterative submissions for each of the three blurry passages. Note that the iterative submissions build on each other, while the non-iterative submissions are all independent.

Nail Passage: After eight iterations, turkers transcribe the text with only one error, shown above. We only have 5 non-iterative responses — all of them essentially say that the text is unreadable.

Boat Passage: The first seven iterations don’t transcribe any words. The last iteration is a promising start. We got 7 non-iterative responses before terminating the experiment, and one of these is even more promising than the last iterative response.

Babysitting Passage: In this passage, we solicited non-iterative responses before the iterative ones. One of the responses is very good, with only about seven missed words. The iterative process only got through the first two iterations before the experiment stopped.

Discussion:

We stopped the experiment because it was taking a while, and many people were submitting responses that essentially said “this text is unreadable.” Empirically, the text appears to be mostly readable to multiple people (the major contributors from each experiment were different turkers).

I hypothesize that the real problem is convincing turkers that progress is possible, since it looks impossible. An early iteration for the Nail Passage made a good effort, and laid the groundwork for future iterations. None of the subsequent turkers iterating on this passage complained about readability — at least not as an addendum to the transcribed passage — which suggests that people are more comfortable with this task after it has been broken down a little.

Another related problem is convincing turkers that it’s ok to do just a little bit when faced with the entire passage. Most turkers would either do none of it, or make an attempt at most of it. One counterexample is the very first turker on the Nail Passage, who attempted to transcribe only the first line or so. But then the voters voted against that version. So even the voters need to be convinced that it’s ok to do just a little bit.

The plan for version 2.0 of this experiment is to have a textbox beneath each word, and instructions saying it’s ok to contribute only 1 or 2 words. The hope is to make turkers more comfortable contributing just what they are able. This should also make a better comparison between the iterative and non-iterative conditions, since it will be easier for the non-iterative contributors to make guesses on individual words, without feeling like they need to transcribe the entire text. These guesses can then be combined programmatically later, similar to how we combine tags from non-iterative responses in the tag cloud experiment.
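As a rough sketch of the kind of programmatic combination we have in mind, the following R snippet takes one guess per worker for each word position and keeps the most common guess per position. The guesses data frame and its column names are hypothetical, just for illustration:

# Hypothetical input: one row per (worker, word position) guess.
guesses <- data.frame(
  position = c(1, 1, 1, 2, 2, 2),
  word     = c("I", "I", "it", "had", "had", "hand"),
  stringsAsFactors = FALSE
)

# For each word position, keep the most frequently guessed word (ties go to the
# first alphabetically; a real version might require a minimum level of agreement).
combine_guesses <- function(g) {
  sapply(split(g$word, g$position), function(words) names(which.max(table(words))))
}

combine_guesses(guesses)
#     1     2
#   "I" "had"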

Website Clustering

  • Total Cost: $114.80
  • 553 HITs, each with 10 assignments, each paying $0.01, $0.02 or $0.05.
  • TurKit code for this experiment: cluster22.js, all files.
  • TurKit Version: 0.1.37

This experiment explores clustering using Mechanical Turk. This work is done in conjunction with Thomas W. Malone and Robert Laubacher of the Center for Collective Intelligence (CCI). The CCI is interested in categorizing and understanding the multitude of websites and organizations that make use of vast numbers of people to achieve various goals in intelligent ways.

The CCI has a growing list of websites exhibiting collective intelligence. For this experiment, we have taken a subset of this list consisting of 22 websites. I hand-picked these websites, often in pairs that I thought were similar, like Facebook and Orkut.

All the HITs in this experiment ask 10 different people to decide how similar two websites are, using the following user interface:

sampleHIT

We asked people to compare every pair of websites, paying $0.01 for each comparison. We then posted all the HITs again, offering $0.02 for each comparison. We posted the HITs a third time paying $0.05 per comparison, but this time we reduced the initial set of 22 websites down to 14, to save money.

Results:

Similarity Matrix for 1-cent.

Similarity Matrix for 2-cents.

Similarity Matrix for 5-cents (note that this matrix has only 14 sites).

If we subtract each value in the similarity matrices above from 5, then we get a form of distance matrix. We might envision that each website is a point in a high-dimensional space, and we know its distance from every other point in that space.

To help visualize the points in this high-dimensional space, we use a technique to place the points in a 2-dimensional space, while preserving the distance relationships between the points as much as possible. This is done using the Matlab Toolbox for Dimensionality Reduction from Laurens van der Maaten — specifically using the “Multidimensional scaling” technique in this toolbox.
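We used the MATLAB toolbox for the plots below, but the same idea can be sketched in a few lines of R using the built-in classical MDS function cmdscale. The small similarity matrix here is a hypothetical stand-in for the matrices linked above:

# Hypothetical 0-5 similarity matrix (5 = identical), standing in for the real data.
sites <- c("Facebook", "Orkut", "Hulu")
sim <- matrix(c(5, 4, 1,
                4, 5, 2,
                1, 2, 5),
              nrow = 3, dimnames = list(sites, sites))

dist_mat <- 5 - sim                              # similarity -> distance
coords   <- cmdscale(as.dist(dist_mat), k = 2)   # classical multidimensional scaling

plot(coords, type = "n", xlab = "", ylab = "")   # 2-D layout preserving distances
text(coords, labels = sites)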

Plot for 1-cent matrix:

cluster1

Plot for 2-cent matrix:

cluster2

Plot for 5-cent matrix (note that this plot has only 14 sites):

cluster5

Discussion:

First question: are we getting meaningful data?

I think so. We see websites like Facebook and Orkut near each other. Of course, we also see some potential anomalies, like YouTube closer to Facebook than to Hulu in the first plot — note that YouTube and Hulu are closer together in the 2-cent plot.

Second question: how close is the data to ground truth?

This question is harder to answer. Recall the coin flipping experiment, where turkers were asked to flip a coin and submit whether it landed on heads or tails. We saw a bias toward heads, suggesting that some turkers were cheating.

Turkers could be cheating here too, but it would be hard to detect, since we do not know the underlying distributions. Any variation or disagreement we see could be reflective of actual disagreement between turkers over how similar websites are.

The problem is similar to polling for an election, when we can’t trust the answers people give us. If we get 40 votes for candidate A, and 60 votes for candidate B, it could be that everyone answered truthfully, or it could be that 80 people answered randomly, and the remaining 20 people answered truthfully in favor of B.

One interesting observation is that the variance decreases when we offer more money. If we offer 1-cent for each similarity measure, the standard deviation of 10 responses is, on average, 0.86. If we offer 2-cents per similarity measure, this average goes down to 0.83. This difference is not significant (paired t-test, p=0.3). However, if we offer 5-cents for each similarity measure, the average standard deviation goes down to 0.59. This difference is significant (paired t-test, p<0.0001 — note that the t-test can be paired since we only look at comparisons between the 14 websites common between the 2-cent and 5-cent cases).
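For reference, the paired comparison can be sketched in R; the vectors below are hypothetical stand-ins for the per-pair standard deviations of the 10 ratings under each price:

# Hypothetical standard deviations of the 10 ratings for each website pair,
# restricted to the pairs among the 14 websites common to both conditions.
sd_2cents <- c(0.95, 0.80, 0.70, 0.88, 0.91)
sd_5cents <- c(0.60, 0.55, 0.52, 0.65, 0.58)

# Paired, because each position refers to the same website pair under both prices.
t.test(sd_2cents, sd_5cents, paired = TRUE)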

This observation suggests that there is a ground truth similarity between websites, and turkers come closer to discovering this value by exerting more mental energy, which they are encouraged to do with an increased reward.

However, it turns out that turkers seem to spend less time completing these tasks when offered more money. Average time per turker for 1-cent is 70 seconds. This decreases to 63 seconds for 2-cents, which is not significant (paired t-test, p=0.19). However, it drops again to 50 seconds for 5-cents, which is significant (paired t-test, p=0.01).

This seems odd, and it could be that the “average time on task” is masking the real story. If not, it raises many questions, like: are turkers spending less time because they are more focused? Are we attracting a different segment of the turker population at 5 cents, one that happens to work faster? Is the decrease we saw in variance actually a bad thing?

There is a lot of data here, and some more thought will probably reveal additional experiments to shed light on these questions. Comments and suggestions are welcome.

Image Description — Iterative vs Non-Iterative

  • Total Cost: $10.82
  • Running Time: 75.8 hours
  • 200 HITs, paying between $0.01 and $0.02
  • TurKit code for this experiment: code.js, all files
  • TurKit Version: 0.1.37

This experiment uses the experimental design described in the pilot experiment, except that we run the procedure on ten different images. We also alternate whether the iterative or non-iterative descriptions are solicited first.

Note that the same file-synchronization bug exists here as with the pilot experiment, so it is unclear whether turkers saw the example writing style.

Results:

The following table shows the iterative description in green, along with the iterations that led up to that description. The non-iterative descriptions are shown in white, and all descriptions are sorted by their average rating.

You may also view a complete list of these tables for all ten images.

rating | description (image: sunrise)
4.6 An array of majestic sunbeams peaking over the horizon and piercing through the low clouds display a changing explosion of colors that constantly maneuver through the sky and are captured by the camera.
4.5 (iterative) A blazing golden-yellow sun sets beneath a sky of swirling orange clouds in the moments before twilight, silhouetting a pine tree-lined horizon. The northern wilderness is calming down for the day and the creatures of the night will soon begin their nocturnal activities. A lone owl is perched high in the tree, ever watchful of the area around her.

1 Golden yellow sunset with silhouetted pine trees on the horizon.
2 Golden yellow sunset with silhouetted pine trees on the horizon. The sky is very fire color and that sun light veru pure white.
3 Blazing orange sunset with swirling clouds silhouettes a line of trees.
4 Blazing sun sets beneath a sky of swirling orange in the moments before twilight, silhouetting a tree-lined horizon.
5 A blazing golden-yellow sun sets beneath a sky of swirling orange clouds in the moments before twilight, silhouetting a pine tree-lined horizon.
6 A blazing golden-yellow sun sets beneath a sky of swirling orange clouds in the moments before twilight, silhouetting a pine tree-lined horizon. The northern wilderness is calming down for the day and the creatures of the night will soon begin their nocturnal activities. A lone owl is perched high in the tree, ever watchful of the area around her.
3.4 Sunset, orange hazy in the sky, and shadows of the tree line.
3 A red sunset in an equatorial country.
2.7 This image is a picture of a sun sets in the afternoon that many people loved to see and discover how great was the Earth.
2.7 it is a red yellowish sunset
1.7 Aku terus cuba dan mencuba untuk aku serasikan bidang photography dan kerja2 editing dan aku berharap akan ianya akan terus bersatu dalam jiwa akubak kata cameramerah http://www.flickr.com/photos/cameramerah
photography dan editing saling perlu memerlukan bagi menambah seri sesuatu hasil tp tidak keterlaluan tp aku punya nih dah macam mengubah alam la plak hehehehe xpelakan banyak masa untuk nih so sebagai ganti rugi aku upload untuk kawan2 comment dan bg pendapat ble kan

Discussion:

It is difficult to see a significant difference between the two methods. The average top ranked non-iterative description has a rating of 3.9, while the average iteratively generated description has a rating of 3.86. This difference is not significant (paired t-test p = 0.65).

The average length of the top ranked non-iterative description is 238 characters, compared with 237.5 in the iterative case. Again, this is not significant (paired t-test p = .99).

The average time spent on task for each turker in the non-iterative process is 214 seconds, compared with 195 seconds in the iterative case. Again, not significant (t-test p = 0.51).

Due to technical difficulties, it’s hard to compare the total running time of each process — sometimes the program running the tasks was shut down, which would add unnecessary time to the iterative process. Still, removing obvious outliers, we get 0.97 hours for non-iterative processes, and 1.13 hours for iterative processes, which is not a significant difference (t-test p = 0.54).

It would make sense that the two methods would be similar if the iterative descriptions could all be traced to the work of a single contributor, but this doesn’t appear to be the case. Subjectively, it appears that turkers often build on, or at least incorporate each other’s work.

One potential explanation is that it takes as much time and effort to read someone else’s work and improve upon it, as it does to write from scratch — at least in these examples. For iteration to be useful, it seems like we need each increment to save time for the next contributor, so that they can go even further.

Image Description — Iterative vs Non-Iterative (Pilot)

  • Total Cost: $1.08
  • Running Time: 15.7 hours
  • 20 HITs, paying between $0.01 and $0.02
  • TurKit code for this experiment: code.js, all files
  • TurKit Version: 0.1.37

A couple of previous experiments compare iterative and non-iterative methods of generating tag clouds for websites. This is a pilot experiment of a similar comparison for writing image descriptions.

Experimental Setup:

We had turkers describe a fan iteratively and non-iteratively. The non-iterative method asks 6 turkers to write an entire description (2 cents each). None of these workers were allowed to participate in the iterative version.

The iterative version alternates between an improve HIT (paying 2 cents) and one of two vote HITs: vote on first improvement, and vote on subsequent improvements. The vote HITs pay 1 cent per vote, and we collect votes until at least 2 votes agree with each other. The improve HIT seeds the textarea with the most recent description that was voted in favor of.
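The stopping rule for voting (keep collecting votes until at least 2 agree) can be sketched as a small R function; get_vote below is a hypothetical stand-in for posting one more 1-cent vote assignment and reading back its answer:

# Keep soliciting votes until some option has been chosen at least `needed` times,
# then return the winning option.
collect_until_agreement <- function(get_vote, needed = 2) {
  votes <- character(0)
  repeat {
    votes <- c(votes, get_vote())          # e.g., "new" or "old" description
    counts <- table(votes)
    if (max(counts) >= needed) {
      return(names(which.max(counts)))
    }
  }
}

# Example with a canned sequence of votes instead of real HITs.
fake_votes <- c("old", "new", "new")
i <- 0
get_vote <- function() { i <<- i + 1; fake_votes[i] }
collect_until_agreement(get_vote)          # returns "new" after the third vote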

Next, we had ten turkers rate each description on a scale of 1-5 (1 cent each). None of the turkers who participated in either of the previous processes were allowed to rate the descriptions.

NOTE: All of the HITs linked above show an “Example writing style” with a description of the aurora borealis. However, due to a file-synchronization bug, it is unclear whether the turkers actually saw this example, or a previous version with the text “…put writing example here…”.

Results:

The following table shows the iterative description in green, along with the six non-iterative descriptions. All descriptions are sorted by their average rating.

rating | description (image: fan)
3.7 This is a Japaneese style folding fan and when opened it spreads like an accordian. The beautiful scene of the waterfall in the tropics, is in colors of deep forest green and white cascading falls. If you look closely between the folds it gives the appearance of sunshine glistening off the falls, almost as if you can see the water actually falling.
3.6 (iterative) Decorative wall fans are one of the most beautiful ways you can decorate your home. These large wall fans come in an endless variety of designs and colors. This is a large decorative oriental fan with a touch of traditional Chinese brush painting. This large oriental wall fan has a landscape motif of waterfalls and trees and two people sitting on the stones by enjoying the beautiful secenery. The material used for large wall fans is bamboo wood, which forms the spokes and base of the fan.
3.2 This beautiful decorative fan displays a nature image of waterfalls and trees against a mountainous background. The dark wood accent and natural setting would be a nice addition to any home.
3.2 This beautiful oriental, hand held fan unfolds to reveal ascenic waterfall. The muted colors are elegant on an ivory background accented by the deep red and white sunburst at the center of the fan. The black lacquered sides of the open fan complete a very sturdy and visually appealing design.
3 This classically beautiful asian fan features a landscape scene as well as asian characters. It depicts a landscape scene with a river, trees, waterfalls, and rocks on a white/beige background.
2.8 This is a vintage Japanese hand fan. It has a traditional landscape image of a couple on a hillside watching a waterfall done in cascading blues and grays.
1.7 it looks like a large fan. The colors are white,black, and red. It has lettering on top and a drawing of a forest.

Discussion:

There is a story here. A story of two turkers. One turker was shown an image of a fan, and was asked to describe it. This turker applied themselves, and wrote a 3.7 level description. The best description offered by their peers rated 3.2.

Another turker was shown an image of a fan, and was asked to improve someone else’s description of it. This turker applied themselves, adding 400% new content, and crafting a 3.6 level description. Subsequent peers failed to improve upon it.

Turker-Spark Theory: The turkers in the story above experienced a turker-spark — a spark of insight and ambition that set their contribution a notch above their peers.

Several stars aligned to trigger the spark. Some of these stars have to do with the turker: native writing expertise; meticulous attention to detail.

Some of the stars have to do with the moment that they accepted the task: creative insight about the prompt; time and willingness to apply themselves.

According to turker-spark theory, the way to solve tough problems on MTurk is to post many similar HIT assignments, and hope that one of them is ignited by a turker’s spark of inspired effort.

If turker-spark theory is true, it would be nice to know the probability of a spark happening.

It would also be interesting to know if an iteration ungraced by any sparks can measure up to a turker-spark on a non-iterative assignment. That is, can turkers doing essentially mediocre work — iteratively — match the quality of a single turker doing great work?

This is a pilot experiment; we’ll soon have results of running this task 10 times, which should shed a little more light on the subject.

Response Time Over 24 Hours

  • Total Cost: $3.60
  • Running Time: 24 hours (1 new HIT with 5 assignments every 20 minutes)
  • 72 HITs with 5 assignments, each paying $0.01
  • TurKit code for this experiment: response-time.zip, response-time.js
  • TurKit Version: 0.1.29

This experiment revisits the question of response time initially explored as part of our small where’s the duct tape example.  Every 20 minutes over 24 hours we posted a HIT with 5 assignments.  The assignments were drawn from 12 tasks resembling the duct tape example – turkers were shown a picture containing an object and asked to click on the object.  You can try for yourself in our object recognition test page.

Turkers successfully identified the object in almost all cases. The main exception was the following picture, in which we asked turkers to find sunglasses:

Television with Sunglasses on the File Cabinet Beside It

Only about half of the turkers shown this example correctly identified the sunglasses. If you’re having trouble, they’re on top of the file cabinet.

You can see the results visually here. These graphs were created with Protovis, which unfortunately only works in Firefox at the moment, so they are also summarized below. The blue color represents the first turker to submit, accept, or complete the HIT; the aqua, green, yellow, and pink colors represent later turkers.

Submission Times Over 24 Hours, generally the first submission happened in less than 100 seconds regardless of the time of day.

Time Taken after Accepting the HIT, generally this required less than a minute, with a few peaks closer to 100 seconds.

Time to Accept the HIT, most HITs accepted in less than 30 seconds

What’s great about this is that without doing much of anything to encourage quick responses, we were still able to get an answer within 100 seconds most of the time.

In this experiment, we also looked to see how many assignments you would need to put out to get the fastest submissions. This is not straightforward because turkers can accept a HIT quickly and then wait a while before submitting it – it’s not clear why they would want to do this but they do.  In the worst case, all 5 assignments are accepted quickly but none are quickly submitted. We could avoid this situation by setting a shorter maximum time, but this might have adverse effects on the pickup time of the HIT and is something to consider in future iterations of the experiment.  In this study we used the TurKit default maximum of 60 minutes.

In 40 cases (56%), the turker who was first to accept the HIT was also the first to submit it.  17 2nd-acceptors submitted first, 10 3rd-acceptors submitted first, 4 4th acceptors submitted first, and 1 5th acceptor submitted first.

This study suggests at least two interesting properties of Mechanical Turk:

First, time of day does not seem to matter all that much – turkers took a similar amount of time to respond at 5am EST as they did at 5pm  EST.

Second, if you want something done quickly, you may need to recruit multiple turkers, as any one turker may not reliably complete a HIT expeditiously. The turker who accepts first will not necessarily be the first to submit.

As we move forward, we plan to try more active techniques for getting turkers to answer more quickly. How much would you have to be willing to spend over a 24-hour period to guarantee that you could always get an answer to a question like this in less than 10 seconds? How would you do it? Hopefully, we’ll have an answer!

Website Tags — Not Iterative

  • Total Cost: $0.84
  • Running Time: 18.4 hours
  • 42 Improve HITs, paying $0.02 each
  • Not Iterative
  • TurKit code for this experiment: tags-flat.js, all files (including database file*)
    • *NOTE: The database file contains fake worker Ids, to protect the privacy of turkers. These worker Ids are consistent within this database file, but not necessarily with future fake worker Ids.
  • TurKit Version: 0.1.34

This experiment is a non-iterative version of a previous blog post about creating tag clouds for websites.

Both experiments make six requests for new tags for each website. The previous experiment makes these requests in serial, where each HIT contains the tags from the previous improvement.

This experiment makes the requests in parallel, where each HIT contains a blank slate. Note that the parallel approach guarantees six unique turkers for each website, whereas the serial approach allowed turkers to improve upon their own work (this happened about once per website in the previous experiment).

I did not have people vote in this experiment. After inspecting the results by hand, I voted against four of the responses, which were obviously not lists of tags.

Results:

You may view the complete list of responses here.

In order to compare tags from this experiment and the previous experiment, we refined the tag lists in several ways. First, the tags from each response for a given website in this experiment were converted to lower-case and combined, removing duplicates. These tag lists were then put in alphabetical order. Tag lists from the previous iterative experiment were also converted to lower-case, deduplicated, and put into alphabetical order.
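This refinement amounts to a few lines of R; the tags vector here is a hypothetical example of raw tags pooled from the six responses for one website:

# Hypothetical raw tags pooled from the six responses for one website.
tags <- c("Art", "artists", "comics", "art", "Community", "comics")

# Lower-case, remove duplicates, and alphabetize, as described above.
refined <- sort(unique(tolower(tags)))
refined
# [1] "art"       "artists"   "comics"    "community"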

You may view comparisons of the tag clouds for each website here.

The first two comparisons are shown here:

http://www.zeros2heroes.com/
Not Iterative (18 tags): art, artists, campaign, classified, collectibles, comics, community, conventions, development, feedback, figurines, manga, marketing, online, publishing, social network, talent, workshops

Iterative (28 tags): artist, artists, beta, blog, classifieds, comic, comic community, comics, comics publisher, community, contests, editors, feedback, for kids, forums, free comics, gamers, games, heroes, participate, peoples publishers, portal, publisher, webcomics, workshops, writer, writers, zeros
http://www.deviantart.com
Not Iterative (33 tags): art, art collection, articles, artists, artwork, business, characters, chat, community, deviant, drawings, fanfiction, forum, graphics, images, manga, network, online, painters, paintings, photographers, photography, photos, pixel art, poetry, sales, sculpture, share, share artwork, sharing, sweet adult fun, thumbnails, wallpaper

Iterative (22 tags): animation, art, art games, blog, buy, cartoons, comics, community, contests, expression, film, flash, forums, images, individual, merchandising, photography, pictures, ratings, scraps, sell, services

Discussion:

Quantity: It appears that both methods produce similar numbers of tags. The non-iterative method generated an average of 22.3 tags, compared with 20.4 tags using the iterative method. This difference has a p value of 0.93, which is not significant. It is interesting to note that there is very little overlap between the groups — this suggests that neither approach is reaching a fundamental maximum number of tags.

Quality: I don’t see much difference in quality between the approaches, though I haven’t looked too deeply. I had wanted to run an experiment where I ask people to say how relevant each tag is to the website, but I’m not sure these results would be significant either way.

Time: The non-iterative experiment took a little longer, 18.4 hours instead of 12.6 hours. It’s not clear whether this is significant, but it is interesting, since the non-iterative experiment posts all the HITs at once, while the iterative experiment only posts new HITs after the first ones have been completed. A couple of reasons for this may be: first, the iterative experiment allows turkers to participate more than once in adding tags for a single website, so we don’t have to wait for unique turkers. As mentioned before, this happened about once for each website. Second, 1-cent HITs have a brief time once they are posted to be at the top of the “most recent” search results, before they sink into the unsearchable abyss. The iterative experiment takes better advantage of this time.

Conclusion: In this case, people seem as likely to generate novel tags looking at a blank slate as they are looking at previously generated tags. It’s possible that iteration would be useful in refining the tag cloud, but before doing anything that complicated, I may investigate iteration in other similar domains.

Website Tags

  • Total Cost: $1.80
  • Running Time: 12.6 hours
  • 42 Improve HITs, paying $0.02 each
  • 42 Vote HITs with a total of 96 assignments, paying $0.01 each
  • Iterative
  • TurKit code for this experiment: tags.js, all-files.zip
  • TurKit Version: 0.1.34

This experiment tries to iteratively create tag clouds for websites. It is similar to an earlier blog post about summarizing websites.

The improve HITs look like this:


Improve HIT

And the vote HITs look like this:


Vote HIT

This experiment uses the same websites from the previous blog post.

Results:

Here we show the iterations for http://www.ask500people.com:

version 1

opinion polls
poll
vote
survey
question
quiz
ask 500 people
version 2

opinion polls
poll
survey
sample
question
quiz
ask 500 people
vox pop
vote
answer
quiz
version 3

opinion polls
poll
survey
sample
question
quiz
ask 500 people
vox pop
vote
answer
quiz

questions
varied answers
questionnaire
ask world

version 4

opinion polls
poll
survey
sample
question
quiz
ask 500 people
vox pop
vote
answer
quiz

questions
varied answers
questionnaire
ask world

new
books
version 5

opinion polls
poll
survey
sample
question
quiz
ask 500 people
vox pop
vote
answer
quiz
forum
questions
varied answers
questionnaire
ask world
opinions
version 6

opinion polls
poll
survey
sample
question
quiz
ask 500 people
vox pop
vote
answer
quiz
forum
questions
varied answers
questionnaire
ask world
opinions
forum
blog
network

You may view the iterations for each website here.

Below, we show the final tag cloud generated for each website (except the one above):

http://www.zeros2heroes.com/

zeros2heroes
publisher
comics
comics publisher
webcomics
free comics
writer
artist
workshops
feedback
editors
Peoples Publishers
Comic Community
gamers
games
for kids
community
portal
contests
participate
writers
artists

publisher
beta
comic
blog
classifieds
forums
http://www.deviantart.com

art
photography
film
animation
pictures
images
flash
buy
sell
community
blog

art games
cartoons
comics
contests
Services
scraps
forums
ratings
merchandising

individual
expression
http://folding.stanford.edu

folding
stanford
university
distributed computing
protein folding
disease
charity
help
all languages
cloud computing
higher education
education
subjects offered
Forum
Research
http://www.artbreak.com

artbreak
art
commission free
sell art
shop for art
paintings
view art
display art
upload art
creative art
artists' chat
information on art
market art
postures
Randomizer
Share
work
world
online community
forum
http://www.orkut.com

orkut
instant message
share your videos, photos
free chat
free messaging
connect
profile
social network
google
blog
connect
connections
friends
family
scraps
people
communities
pictures
friend connectivity
media sharing
video sharing
MSN
AOL
YIM
communication
http://friendfinder.com

friend finder
social network
online dating
profile
personals
blogs
community
forums
networking
dating site
find a partner
man seeking woman
woman seeking man

Discussion:

One thing to note is that people typically append to the end of the tag cloud, rather than refining or removing any of the existing tags. If we want any refinement of the tags, we may need to create a separate task for this.

The first tagger seems to add more tags than subsequent taggers: 5.7 tags for the first tagger compared to 3.1 for subsequent taggers. If we throw out versions that were rejected, and take the average number of tags added in each iteration, we get: 5.7, 3.7, 5, 2.6, 2.3, 2.5. If we squint enough, we might convince ourselves that these numbers go down, presumably because it becomes increasingly difficult to find new tags.
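One way to check this trend without squinting is a rank correlation between iteration number and average tags added, using the numbers above; a quick R sketch (treating the reported averages as the data):

tags_added <- c(5.7, 3.7, 5, 2.6, 2.3, 2.5)    # average tags added in iterations 1-6
cor.test(1:6, tags_added, method = "spearman") # negative rho supports a downward trend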

At first, I was tempted to say that people invest the same amount of effort on each task, but their effort yields fewer tags on later iterations. However, the time spent on task seems to go down after each iteration (except the first): 133.3, 175.4, 171.7, 121, 96.1, 91 seconds. Of course, this doesn’t mean that people are expending less mental energy on later iterations — perhaps they are spending less time looking at the webpage, since the tags do a good job of telling them what to expect there.

This experiment raises an interesting question about iteration — since people add more tags on the first iteration, maybe we should be showing everyone a blank slate. We would probably get duplicate tags this way, but if we remove the duplicates, will we have more remaining tags than we get in the iterative version?

Of course, this question existed in the previous summary writing experiments as well, but it might be easier to test with the tag cloud generation experiments, since we can more-or-less quantify the amount of contribution in each iteration.

The real hope of iteration is that people are able to use previous people’s work as a jumping-off point to spur their creativity, and at the same time focus their contributions by showing them what has already been done. However, this jumping-off point may also stifle creativity by acting as a powerful suggestion about the direction that the tag cloud should take. Ultimately both points of view are probably true, and the optimal Human Computation Algorithm for generating tag clouds must account for both.

A cautionary note on selection

As a follow-up to my experiment on anchoring, I had subjects guess the number of dots in a series of images. Workers were “randomly” (I’ll explain the scare quotes later) assigned to one of 7 such tasks (labeled A-G), ordered from the fewest dots to the most dots.
Here is A:
Picture A
and here is G:
Picture G
After the initial task, workers could keep working, making their way through the other 6 tasks, which were also “randomly” assigned. Most did all 7, but a sizable number did just 1. As data began to come in, I checked that approximately equal numbers of workers did each of the 7 pictures. They did, but in AMT this tells us nothing about selection: when a user “returns” a HIT rather than completing it, that HIT is returned to the pool of uncompleted HITs, making it available for a future worker.

When I looked at the distribution of “first HITs”, i.e., the first picture performed by a subject (remember that they could complete more than one picture), a striking pattern jumped out:

> table(data$first_pic_name)

A B C D E F G
101 100 141 170 154 158 169
>

In general, the more dots a picture had, the more likely it was to be the first task done by a worker. A chi-square test confirmed that the distribution of first pictures was not uniform; the skew toward high-dot pictures is unlikely to be due to chance.
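The test itself is one line of R, using the first-picture counts shown above:

first_pic_counts <- c(A = 101, B = 100, C = 141, D = 170, E = 154, F = 158, G = 169)
chisq.test(first_pic_counts)   # goodness-of-fit test against a uniform split over the 7 pictures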

What’s going on?

The over-representation of high-dot pictures among the first pictures completed by a worker suggests that a disproportionate number of high-dot pictures are being returned. It is of course possible that workers find evaluating many-dot pictures more onerous, but this seems unlikely, especially since there was no penalty for bad guesses. My hunch is that workers find the high-dot tasks less pleasant because they take longer to load—not because they are intrinsically more difficult. Picture G’s file size is 35K, while picture A is only about 10K. This difference is not much for someone with a fast connection, but many workers presumably have slow connections and grew tired of waiting for a G image to load. I plan to test this hypothesis by using image files with approximately identical sizes and see if the pattern still persists.

So what?

If workers are returning HITs for basically arbitrary reasons, then the AMT “back to the pool” sampling is as good as random. However, if users are non-randomly non-complying, you can potentially make biased inferences. In addition to the problem of non-random attrition, because of how AMT allocates workers to tasks, you also get changes in the probability that a subject will be assigned to a group (e.g., late arrivals to the experiment are more likely to be assigned to the group people found distasteful). In future posts, I hope to discuss some of the ways this problem can be circumvented.

More on whether turkers are willing to wait

Experiment #1 (waiting 15 seconds for 2 cents):

  • Total Cost: $1.00
  • Running time: 12 minutes
  • 50 HITs, each paying $0.02
  • Experimental set up and analysis very similar to previous waiting post
Temporal Worker Activity - 15 second waiting time shown as horizontal lines

Experiment #2 (waiting 30 seconds for 10 cents):

  • Total Cost: $5.00
  • Running time: 8 minutes
  • 50 HITs, each paying $0.10
  • Experimental set up and analysis very similar to previous waiting post
Temporal Worker Activity - 30 second waiting time shown as horizontal lines

Conclusions:

Although I didn’t rigorously compare the two cases, it seems as though turkers are willing to wait, and are willing to wait longer for more money. If the objective of waiting is to have turkers available at the same time to contribute to a task cooperatively, these experiments show some, but not complete, success. In the case of paying workers 10 cents to wait 30 seconds, we got more instances of overlap in the time multiple workers were waiting. We also got more people (16 as opposed to 11 turkers) to do the task, and it was finished faster. In both experiments, we had people accept the task but not press the “GO” button in time, indicating that they didn’t understand the instructions or were not willing to wait (1 person for the 30-second waiting experiment and 2 people for the 15-second waiting experiment).

Anchors have sway

Anchoring refers to the cognitive bias where people rely too heavily on a single factor or piece of information when making a decision. Sometimes narrowing a problem and focusing on just a few key details works well in practice, but anchoring can be deeply irrational in some cases:

…consider an illustration presented by MIT professor Dan Ariely. An audience is first asked to write the last 2 digits of their social security number, and, second, to submit mock bids on items such as wine and chocolate. The half of the audience with higher two-digit numbers would submit bids that were between 60 percent and 120 percent higher than those of the other half, far higher than a chance outcome; the simple act of thinking of the first number strongly influences the second, even though there is no logical connection between them.

The Setup

I conducted a randomized experiment where I showed subjects a picture with exactly 500 dots (see below). Subjects were first asked to state whether the number of dots in the picture was greater than or less than X (the anchor), where X was a random variable I generated with a snippet of JavaScript code. X was uniformly distributed on [100, 1200]. After answering the greater/less than question, subjects reported their own best point estimate.

Dot picture

Logistics

  • 200 subjects, 3 dropped for missing data
  • Experiment run on July 17th
  • Each paid 1 cent to evaluate picture
  • Data: anchor.csv
  • R code for data analysis: analysis.R

Did the anchor influence guesses?

After dropping one worker who guessed over 15,000 (~ 31 times too large) and plotting log guess vs. log anchor, we can see a clear relationship:

in_logs

In a linear regression of log guess on log anchor, the coefficient is about 0.33, meaning that, on average, a 10% increase in the anchor leads to roughly 3% higher guesses. Because we randomized the anchor after subjects had already accepted the HIT, we can be confident that the anchor is causing the increase in guesses (though causality in AMT is a bit complicated – a topic I’ll return to in future posts).
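In R, with anchor.csv loaded into data, the regression looks roughly like the following (the column names guess and anchor are assumptions about the file layout):

data <- read.csv("anchor.csv")               # columns assumed: guess, anchor

# The coefficient on log(anchor) is (approximately) the percent change in the
# guess for a 1% change in the anchor.
fit <- lm(log(guess) ~ log(anchor), data = data)
summary(fit)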

One thing to note about this experiment is that the effect of the anchor is neither that surprising nor necessarily irrational. Unlike the Ariely experiment (where the individual’s social security number was the anchor), a subject in this experiment might reasonably believe that the anchor conveys some useful information.

Wisdom of crowds?

The dot guessing game is exactly the kind of task where we might expect the “wisdom of crowds” effect, e.g., that the mean of all guesses would be better than many of the individual guesses.

We can compute raw and trimmed mean of all guesses and compare those estimates to the distribution of errors made by subjects.

> woc1 <- mean(data$guess)
> woc2 <- mean(trim$guess)
> woc1
[1] 608.7513
> woc2
[1] 517.441
> 1-ecdf(data$error)(abs(500-woc1))
[1] 0.6954315
> 1-ecdf(data$error)(abs(500-woc2))
[1] 0.928934
>

As we can see, our trimmed mean of 517 is pretty close – and better than 92% of all individual guesses. We can be a little more formal in our trimming and examine how much of the tails we should have chopped off. The plot shows the trimmed mean versus alpha – the fraction of the distribution cut off from both ends of the distribution of guesses. At least in this example, we would have done well to trim about 3% off both ends of the distribution of guesses.

trimmed mean
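The sweep over trimming fractions can be reproduced with the trim argument of R's built-in mean function (again assuming a guess column in the data):

alphas  <- seq(0, 0.25, by = 0.01)
trimmed <- sapply(alphas, function(a) mean(data$guess, trim = a))

plot(alphas, trimmed, type = "l",
     xlab = "alpha (fraction trimmed from each tail)",
     ylab = "trimmed mean of guesses")
abline(h = 500, lty = 2)   # the true number of dots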