Latest Publications

Image Description Revisited

This experiment revisits a blog post about writing image descriptions iteratively and in parallel. That experiment did not see any statistically significant difference between the iterative and parallel processes. This experiment does.

This experiment is larger, using 30 images instead of 10 (all taken from publicdomainpictures.net). We also made a number of changes to the process. First, we tried to make the instructions for HITs in each process as similar as possible. For instance, the title for HITs in the previous iterative process said “Improve Image Description” and the title for HITs in the previous parallel process said “Describe Image”. In this experiment, the title in both cases is “Describe Image Factually”. This title points to another change: we added the instruction “Please describe the image factually”. This was intended to discourage turkers from thinking they needed to advertise these images, and to make the descriptive styles more consistent. Here is an example HIT:

Example HIT for the iterative process.

This image shows a HIT in the iterative process. It contains the instruction “You may use the provided text as a starting point, or delete it and start over.” This instruction deliberately avoids suggesting that the turker just needs to improve the provided text. We wanted the two processes to be as similar as possible, so it didn’t seem fair for turkers in one condition to think they only needed to make a small improvement while turkers in the other condition thought they needed to write an entire final draft. Note that the very presence of text in the box may alert turkers to the possibility that other turkers will see their work and be asked to write a description using it as a starting point, but we do not explicitly confirm this to workers.

This instruction is omitted for the parallel HITs. It is the only difference between the two, except of course that all of the parallel HITs start with a blank textarea, whereas all but the first iterative HITs will show prior work.

This experiment also uses the 1-10 rating scale introduced in the updated blog post about brainstorming company names.

Finally, in order to compare the output from each process, we wanted some way of selecting a description in the parallel process to be the output. We do this by voting between descriptions, and keeping the best one in exactly the same way as the iterative process. (Note: one difference is that the iterative process highlights differences between the descriptions, whereas the parallel process does not. Since the descriptions in the parallel process are not based on each other, they are likely to be completely different, making the highlighting a distraction.)

Results:

Raw Results

Average ratings for descriptions generated in each iteration.

This graph shows the average rating of descriptions generated in each iteration of the iterative processes (blue), along with the average rating of all descriptions generated in the parallel processes (red). Error bars show standard error.
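The error bars in a graph like this can be computed from the individual ratings as the sample standard deviation divided by the square root of the number of ratings. Here is a minimal Python sketch; the function name is hypothetical and the ratings are made-up values for illustration, not data from this experiment.

```python
import math
import statistics

def mean_and_standard_error(ratings):
    """Return (mean, standard error of the mean) for a list of ratings."""
    mean = statistics.mean(ratings)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, sem

# Made-up ratings on the 1-10 scale, for illustration only.
example_ratings = [7, 8, 6, 9, 7, 8, 7, 6, 8, 7]
mean, sem = mean_and_standard_error(example_ratings)
print(f"mean = {mean:.2f}, standard error = {sem:.2f}")
```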

Discussion:

The final description from each iterative process averaged a rating of 7.86, which is statistically significantly greater than the 7.4 average rating for the final description in each parallel process (paired t-test, t(29) = 2.12, p = 0.043). We can also see from the graph that ratings appear to improve with each iteration.

This suggests there may be a positive correlation between the quality of prior work shown to a turker, and the quality of their resulting description. Of course, a confounding factor here is that turkers who choose to invest very little effort are not likely to harm the average rating as much when they are starting with an already good description, since any small change to it will probably still be good, whereas a very curt description of the whole image is likely to be rated much worse. This factor alone could explain an increase in the average rating of all descriptions in the iterative process, but it would not explain an increase in the average rating of the best description from the iterative process — for that, it must be that some people are writing descriptions that are better than they would have written if they were not shown prior work.
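The paired comparison reported at the start of this discussion can be sketched in plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, the per-image ratings are not reproduced here, and the p-value is obtained by numerically integrating the Student's t density rather than by whatever statistics package was actually used.

```python
import math

def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n), n - 1

def two_tailed_p(t, df, steps=20000):
    """Two-tailed p-value for Student's t, via Simpson's rule on the density."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)

    def pdf(x):
        return coef * (1 + x * x / df) ** (-(df + 1) / 2)

    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3  # integral of the density from 0 to |t|
    return 1 - 2 * area

# The statistic reported above: t(29) = 2.12.
print(round(two_tailed_p(2.12, 29), 3))
```

With the reported statistic, the computed p-value falls between 0.04 and 0.05, in line with the value quoted above.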

So why are we seeing a difference now when we didn’t before? We changed a number of factors in this experiment, but my guess is that the most important change was altering the instructions in the iterative process. I think the instructions in the old version encouraged turkers not to try as hard, since they merely needed to improve the description, rather than write a final-quality description. In this experiment, all turkers were asked to write a final description, but some were given something to start with.

I think this same idea explains the results seen in the previous blog post about brainstorming company names. Turkers in the iterative process of that experiment were required to generate the same number of names as turkers in the parallel process, whereas the experiment before that suggested that turkers in the iterative process just needed to add names to a growing list, and we saw that they generated fewer names.

In any case, these are only my guesses. More experiments to come.

Brainstorming Company Names Revisited

I’ve been gone for a bit working on a research paper, and then attending conferences, but now it is time to get back to business.

This experiment is an extension of the previous blog post about brainstorming company names. In that post, it seemed like iteration wasn’t making a difference, except to encourage fewer responses. This time, we decided to enforce that each worker contribute the maximum number of responses. We also reduced that number from 10 to 5, since it felt daunting to force people to come up with 10 names. This also reduced the number of names we needed to rate, which is the most expensive part of this experiment.

Finally, we decided to show all the names suggested so far in the iterative condition. Previously, we showed only the best 10 names, but this required rating the names, which seemed bad for a number of reasons. Most notably, it seemed like an awkward blend of using the ratings both as part of the iterative process, and also as the evaluation metric between the iterative and non-iterative (or parallel) conditions.

The new iterative HIT looks like this:

Example iterative HIT

The parallel version doesn’t have the “Names suggested so far:” section.

We also changed the rating scale from 1-5 to 1-10. This was done because 1-10 felt more intuitive, and provided a bit more granularity. It would be nice to run experiments concentrating on rating scales to verify that this was a good choice (anyone?). Here is the new scale:

Rating scale from 1 to 10

We brainstormed names for 6 new fake companies (we had 4 in the previous study). You can read the descriptions for each company in the “Raw Results” link below.

Results:

Raw Results

Average rating of names in each iteration of iterative processes.

This graph shows the average rating of names generated in each iteration of the iterative processes (blue), along with the average rating of all names generated in the parallel processes (red). Error bars show standard error.

Discussion:

Names generated in the iterative processes averaged 6.38 compared with 6.23 in the parallel process. This is not quite significant (two-sample t(357) = 1.56, p = 0.12). However, it does appear that iteration is having an effect. Names generated in the last two iterations of the iterative processes averaged 6.57, which is significantly greater than the parallel process (two-sample t(237) = 2.48, p = 0.014) — at least in the statistical sense; the actual difference is relatively small: 0.34.
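For readers who want to run this kind of comparison themselves, here is a minimal sketch of a two-sample t statistic. It assumes the pooled-variance (Student's) form, which the post does not confirm; the function name and the example ratings are made up for illustration.

```python
import math

def pooled_two_sample_t(xs, ys):
    """Return (t, degrees of freedom) using a pooled variance estimate."""
    n1, n2 = len(xs), len(ys)
    m1, m2 = sum(xs) / n1, sum(ys) / n2
    ss1 = sum((x - m1) ** 2 for x in xs)
    ss2 = sum((y - m2) ** 2 for y in ys)
    df = n1 + n2 - 2
    pooled_var = (ss1 + ss2) / df
    t = (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df

# Made-up ratings for two conditions, for illustration only.
iterative = [6, 7, 7, 8, 6, 7]
parallel = [6, 6, 7, 5, 6, 7]
t, df = pooled_two_sample_t(iterative, parallel)
print(f"t({df}) = {t:.2f}")
```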

There is also the issue of iteration 4. Why is it so low? This appears to be a coincidence—3 of the contributions in this iteration were considerably below average. Two of these contributions were made by the same turker (for different companies). A number of their suggestions appear to have been marked down for being grammatically awkward: “How to Work Computer”, and “Shop Headphone”. The other turker suggested names that could be considered offensive: “the galloping coed” and “stick a fork in me”.

These results suggest that iteration may be good after all, in some cases, if we do it right, maybe. Naturally we will continue to investigate this. We have already done a couple of studies with similar results to this one, suggesting that iteration does have an effect. After posting these studies on the blog (soon), the hope will be to start studying more complicated iterative tasks.

Brainstorming Company Names

We asked turkers to brainstorm company names, both iteratively and non-iteratively. We used an experimental design based on Image Description — Iterative vs Non-Iterative. We kept 6 iterations for each condition in this experiment.

The experiment itself is based on Website Tags and Website Tags — Not Iterative. Instead of generating tags for websites, we are generating names for companies. We also provide separate input fields for people to add names, rather than a single textbox.

We made up a brief description for four fake companies. You can see the description of one fake company in this sample brainstorming HIT:

example brainstorming HIT

The HIT asks turkers to come up with at most 10 company names, and supplies 10 input fields. Turkers in the iterative condition were shown “Example names suggested so far”, which shows company names supplied by previous turkers in the iterative process. This text appeared even if there were no names suggested yet. The non-iterative condition did not have this text.

We also created HITs to evaluate the quality of each name. These evaluations were done on a scale of “1: Poor” to “5: Extremely Good”, similar to the image description experiment. Turkers were not allowed to rate any suggestions for company X if they supplied any suggestions for company X.

We intended to show turkers in the iterative condition the ten best previous suggestions, but due to a bug, we only showed them the suggestions from the turker right before them. The suggestions were still sorted from best to worst.
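The intended behavior (show the ten best previous suggestions, sorted from best to worst) can be sketched as below. This is not the TurKit code from the experiment; the function name is hypothetical, and the ratings in the example are a handful of values taken from the results table further down.

```python
def best_suggestions(rated_names, limit=10):
    """Return up to `limit` names, sorted from best to worst average rating."""
    ranked = sorted(rated_names.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:limit]]

# A few names and average ratings from the productivity-tools company.
rated = {"Timetool": 3.0, "MasterTime": 3.6, "T2": 1.9, "Time Managers": 3.5}
print(best_suggestions(rated, limit=2))  # prints ['MasterTime', 'Time Managers']
```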

Note that turkers in the iterative condition were allowed to contribute more than once. This has always been the case, even in previous blog experiments of this sort. Probably this shouldn’t be allowed, and our future investigations will probably prohibit this. With that said, this only happened once in these experiments (i.e. one turker, in the iterative process for one fake company, did the brainstorming HIT twice).

Results:

You can see all the names generated for each fake company here, in both the iterative and non-iterative conditions.

Here are the results for the “productivity tools” fake company:

description: As technology grows, people have less and less time. Time management tools are still poorly understood. We believe our tools will help people take control of their lives again, while being more productive than before.
iterative:
  • MasterTime 3.6
  • Time Managers 3.5
  • PowerSource 3.5
  • Time Keepers 3.5
  • Endless Time 3.5
  • NewKnowledge 3.4
  • Time’s On UR Side 3.4
  • Productivity Maximizers 3.4
  • Time Breakers 3.4
  • Manage Me 3.3
  • Now’s The Time 3.3
  • Time Stoppers 3.2
  • Time Maker 3.2
  • DayPlanners 3.2
  • All In A Day 3.2
  • Time Pusher 3.1
  • MakeTheBest 3.1
  • Timetool 3
  • Productive Management 3
  • Time Warp 3
  • Time Control 3
  • My Time Counts 3
  • Life Tools 3
  • Productivity Boost 3
  • KarmaSpice 2.9
  • Minute Worker 2.9
  • BoldNewWorld 2.9
  • WorldKnow 2.9
  • My Brother’s Keeper 2.8
  • Wast’en No Time 2.8
  • AimTime 2.8
  • Recreating Eden 2.6
  • The Daytimers 2.5
  • ToolStrong 2.5
  • Robotime 2.4
  • Produc’en Time 2.4
  • HolyKnow 2.4
  • Yes Time 2.3
  • It’s Manage 2.2
  • Ti-me 2.1
  • Producorobo 2
  • Time 1.9
  • T2 1.9
  • Don’t Count Pennies, Count Minutes 1.8

non-iterative:
  • Working With Time 3.7
  • The Eleventh Hour 3.5
  • Take Back Time 3.5
  • Hour Glass 3.5
  • Time For Everything 3.4
  • It’s About Time 3.4
  • smarttime 3.3
  • Plus Time 3.3
  • Time Warp 3.3
  • More Time 3.3
  • simplyfy 3.2
  • Tick Tock 3.2
  • Manage Time 3.2
  • Time management Trendz 3.2
  • Time Enough 3.2
  • Tools & Time 3.2
  • Time Control 3.1
  • Life Management 3.1
  • Punctual Trix 3.1
  • octopus 3
  • Extend 3
  • Time Manager 3
  • Stopwatch 2.9
  • timeismoney 2.9
  • Time Friendly 2.9
  • Worth 2.8
  • prioritize 2.8
  • freetime 2.7
  • zurich 2.7
  • Time Shared Systems 2.7
  • Tiempo Domado 2.6
  • Becton-Dickinson Medical Device Company 2.6
  • The Control Company 2.6
  • latice 2.6
  • Time Organizer 2.6
  • Timex 2.6
  • python 2.4
  • the Ticker 2.4
  • idecluster 2.3
  • adrino 2.3
  • Philips Healthcare Medical Device Company 2.3
  • Denso Corporation: Company 2.2
  • sea horse 2.2
  • Daiwa House Industry Co. Ltd.: Company 2.2
  • Danske Bank A/S: Company 2.2
  • hypnosis 2.2
  • Controla 2.2
  • Controlia 2.2
  • nephkin 2.2
  • Dentsu Inc.: Company 2.1
  • jessico 2.1
  • Duke Energy Corporation: Company 2
  • Axel Springer AG – Company 1.9
  • Elekta Medical Device Company 1.9
  • Dsg International Plc: Company 1.8
  • fercy 1.8
  • emilee 1.6

Discussion:

The most striking observation is that the non-iterative condition generates significantly more company names than the iterative condition: 47.8 vs 34.3 (paired t-test p < 0.002). The highest possible number of names for either condition is 60, since there are 6 iterations, and each iteration asks a turker for up to 10 names.

This suggests that turkers will generate fewer names if they are shown some examples of other people’s names. Possible explanations for this include:

  • Seeing other people’s names biases turkers toward those names, and they must think harder to come up with different names.
  • Seeing other people’s names suggests that other people are making progress toward the goal, and so it’s not as important to do a good job on the task, since other people will pick up the slack.
  • Seeing fewer than 10 example names may cue turkers into thinking it’s ok to provide fewer than 10 names. In fact, turkers shown 10 example names provide 7.2 names on average, while turkers shown between 1 and 9 example names provide 3.3 names (p < 0.01).

The average quality of names generated in each condition seems to be the same. The average is 2.85 for the iterative condition, and 2.82 for the non-iterative condition. A t-test gives a p-value of 0.57, which is not significant.

Conclusion: Showing people other people’s ideas doesn’t seem to increase the quality of suggestions, and seems to encourage fewer suggestions. It would be nice to figure out some way to increase the quality of suggestions. Perhaps if we show people words that are associated with the company — like “love” and “cupid” for the fake online dating site — they can be used as building blocks for names. We could generate these words in a separate brainstorming process.

Blurry Text Transcription

  • Total Cost: $4.73
  • 40 HITs, including both text transcription and voting HITs.
  • TurKit code for this experiment: code.js, all files
    • NOTE: The experiment was stopped prematurely, so you probably can’t run the code.js file using the provided database, since the program will try to make calls to MTurk for HITs under my account. However, all the partially completed HITs have been written out to a file called output.txt.
  • TurKit Version: 0.1.37

We are in search of a task that lends itself to iteration — a task where it is easier to understand previous people’s work than it is to redo that work. We did an experiment in the past, before this blog, where people attempted to decipher extremely poor handwriting (page 6). This handwriting seemed impossible for anyone to decipher alone, but by building on each other’s work, turkers were able to transcribe it almost verbatim.

Of course, we didn’t have a control in that experiment, so we don’t actually know that nobody could decipher it all on their own. So we are now running a set of experiments comparing iteration and non-iteration for handwriting recognition.

Well, almost. It is difficult to write with consistently poor handwriting, so instead I wrote passages with the text tool in GIMP, and obfuscated them with a distortion filter. To be precise, I used the “Sans” font, size 22 pixels. Then I distorted each passage with Filters–Noise–Spread, using 9 pixels for both horizontal and vertical spread. Here is a result:

writing-1-distort-1

You may see this in a sample HIT here.
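To give a feel for what a spread-style distortion does, here is a rough Python approximation that works on a grid of pixel values. It is an assumption-laden sketch, not GIMP's actual implementation: each output pixel simply copies a source pixel at a random offset of up to `amount` pixels horizontally and vertically. The function name and the tiny test image are hypothetical.

```python
import random

def spread(pixels, amount, seed=None):
    """Return a copy of `pixels` where each pixel is taken from a random
    neighbour at most `amount` pixels away in x and in y."""
    rng = random.Random(seed)
    height, width = len(pixels), len(pixels[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Pick a random source position, clamped to the image bounds.
            sx = min(max(x + rng.randint(-amount, amount), 0), width - 1)
            sy = min(max(y + rng.randint(-amount, amount), 0), height - 1)
            row.append(pixels[sy][sx])
        out.append(row)
    return out

# A tiny made-up image: a dark vertical bar (0) on a white background (255).
image = [[0 if 3 <= x <= 4 else 255 for x in range(8)] for _ in range(5)]
for row in spread(image, amount=2, seed=1):
    print("".join("#" if value == 0 else "." for value in row))
```

The experiment used a spread of 9 pixels on 22-pixel text, so the displacement was a sizable fraction of each letter's height.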

We then adopted the experimental design of a previous blog post comparing image description writing iteratively and non-iteratively, except we used 8 iterations for both conditions instead of 6. We also paid $0.10 for each iteration instead of $0.02 — we tried $0.02, but it didn’t seem like we were getting any takers.

Even with $0.10, the experiment ran for a few days without completing, and I could already see some room for improvement, so I shut it off prematurely.

Results:

Here is the final iterative version of the passage shown above:

I had intended to hit the nail, but I’m not a very good aim it seems and I ended up hitting my thumb. This is a common occurence I know, but it doesn’t make me feel any less ridiculous having done it myself. My new strategy will involve lightly tapping the nail while holding it until it is embedded into the wood enough that the wood itself is holding it straight and then I’ll remove my hand and pound carefully away. We’ll see how this goes.

The highlighted word should have been “wedged”. This transcription also fixes — or almost fixes — a couple of spelling errors in the original passage.

The following pages show the iterative and non-iterative submissions for each of the three blurry passages. Note that the iterative submissions build on each other, while the non-iterative submissions are all independent.

Nail Passage: After eight iterations, turkers transcribe the text with only one error, shown above. We only have 5 non-iterative responses, and all of them essentially say that the text is unreadable.

Boat Passage: The first seven iterations don’t transcribe any words. The last iteration is a promising start. We got 7 non-iterative responses before terminating the experiment, and one of these is even more promising than the last iterative response.

Babysitting Passage: In this passage, we solicited non-iterative responses before the iterative ones. One of the responses is very good, with only about seven missed words. The iterative process only gets the first two iterations before the experiment stops.

Discussion:

We stopped the experiment because it was taking a while, and many people were submitting responses that essentially said “this text is unreadable.” Empirically, the text appears to be mostly readable to multiple people (the major contributors from each experiment were different turkers).

I hypothesize that the real problem is convincing turkers that progress is possible, since it looks impossible. An early iteration for the Nail Passage made a good effort, and laid the groundwork for future iterations. None of the subsequent turkers iterating on this passage complained about readability — at least not as an addendum to the transcribed passage — which suggests that people are more comfortable with this task after it has been broken down a little.

Another related problem is convincing turkers that it’s ok to just do a little bit, when faced with the entire passage. Most turkers would either do none of it, or make an attempt at most of it. One counterexample is the very first turker in the Nail Passage, who attempted to transcribe only the first line or so. But then the voters voted against it. So even the voters need to be convinced that it’s ok to do just a little bit.

The plan for version 2.0 of this experiment is to have a textbox beneath each word, and instructions saying it’s ok to contribute only 1 or 2 words. The hope is to make turkers more comfortable contributing just what they are able. This should also make a better comparison between the iterative and non-iterative conditions, since it will be easier for the non-iterative contributors to make guesses on individual words, without feeling like they need to transcribe the entire text. These guesses can then be combined programmatically later, similar to how we combine tags from non-iterative responses in the tag cloud experiment.
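One simple way the per-word guesses could be combined programmatically is a majority vote at each word position. The sketch below is only an assumption about how that might work, since version 2.0 has not been built; the function name, the `None` convention for "no guess", and the sample submissions are all hypothetical.

```python
from collections import Counter

def combine_guesses(submissions, unknown="?"):
    """Combine per-word guesses from several turkers by majority vote.

    Each submission is a list with one entry per word position; an entry is
    None (or an empty string) when the turker left that word blank.
    """
    length = max(len(s) for s in submissions)
    combined = []
    for position in range(length):
        guesses = [s[position] for s in submissions
                   if position < len(s) and s[position]]
        if guesses:
            # Ties go to the guess that was seen first.
            combined.append(Counter(guesses).most_common(1)[0][0])
        else:
            combined.append(unknown)
    return combined

submissions = [
    ["I", None, "intended", None],
    ["I", "had", None, None],
    [None, "bad", "intended", None],
    [None, "had", None, None],
]
print(" ".join(combine_guesses(submissions)))  # prints "I had intended ?"
```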

Website Clustering

  • Total Cost: $114.80
  • 553 HITs, each with 10 assignments, each paying $0.01, $0.02 or $0.05.
  • TurKit code for this experiment: cluster22.js, all files.
  • TurKit Version: 0.1.37

This experiment explores clustering using Mechanical Turk. This work is done in conjunction with Thomas W. Malone and Robert Laubacher of the Center for Collective Intelligence (CCI). The CCI is interested in categorizing and understanding the multitude of websites and organizations that make use of vast numbers of people to achieve various goals in intelligent ways.

The CCI has a growing list of websites exhibiting collective intelligence. For this experiment, we have taken a subset of this list consisting of 22 websites. I hand-picked these websites, often in pairs that I thought were similar, like Facebook and Orkut.

All the HITs in this experiment ask 10 different people to decide how similar two websites are, using the following user interface:

sampleHIT

We asked people to compare every pair of websites, paying $0.01 for each comparison. We then posted all the HITs again, offering $0.02 for each comparison. We posted the HITs a third time paying $0.05 per comparison, but this time we reduced the initial set of 22 websites down to 14, to save money.
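The number of HITs and the total cost follow from counting unordered pairs of websites in each round. The short sketch below works through that arithmetic; it assumes one HIT per pair with 10 assignments each, and that the listed cost excludes any Mechanical Turk fees. The function name is hypothetical.

```python
from math import comb

ASSIGNMENTS_PER_HIT = 10

def round_totals(num_sites, cents_per_assignment):
    """Return (number of HITs, cost in cents) for comparing every pair once."""
    hits = comb(num_sites, 2)  # one HIT per unordered pair of websites
    return hits, hits * ASSIGNMENTS_PER_HIT * cents_per_assignment

rounds = [(22, 1), (22, 2), (14, 5)]  # (websites, price in cents)
total_hits = sum(round_totals(n, c)[0] for n, c in rounds)
total_cents = sum(round_totals(n, c)[1] for n, c in rounds)
print(f"{total_hits} HITs, ${total_cents / 100:.2f}")  # prints "553 HITs, $114.80"
```

Under those assumptions the result matches the totals listed at the top of this post: 231 + 231 + 91 = 553 HITs.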

Results:

Similarity Matrix for 1-cent.

Similarity Matrix for 2-cents.

Similarity Matrix for 5-cents (note that this matrix has only 14 sites).

If we subtract each value in the similarity matrices above from 5, then we get a form of distance matrix. We might envision that each website is a point in a high-dimensional space, and we know its distance from every other point in that space.
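The conversion described above is a one-liner. In this sketch the function name and the small similarity matrix are made up for illustration; note that a site only ends up at distance 0 from itself if its self-similarity is entered as 5, which is an assumption here.

```python
def similarity_to_distance(similarity, max_score=5):
    """Turn a matrix of similarity scores into distances: max_score - score."""
    return [[max_score - value for value in row] for row in similarity]

# Made-up average similarity scores for three sites (diagonal set to 5).
similarity = [
    [5.0, 4.2, 1.5],
    [4.2, 5.0, 2.0],
    [1.5, 2.0, 5.0],
]
for row in similarity_to_distance(similarity):
    print([round(value, 1) for value in row])
```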

To help visualize the points in this high-dimensional space, we use a technique to place the points in a 2-dimensional space, while preserving the distance relationships between the points as much as possible. This is done using the Matlab Toolbox for Dimensionality Reduction from Laurens van der Maaten — specifically using the “Multidimensional scaling” technique in this toolbox.
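For readers without Matlab, the idea behind multidimensional scaling can be sketched in plain Python. This is classical (Torgerson) MDS with a simple power iteration, written for illustration only; it is not the toolbox's code, the function name is hypothetical, and it assumes a symmetric distance matrix whose largest eigenvalues after centering are positive (true when the distances are Euclidean).

```python
import math
import random

def classical_mds(dist, dims=2, iters=500, seed=0):
    """Embed points in `dims` dimensions from a symmetric distance matrix."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    # Double-centre the squared distances to get an inner-product matrix.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Power iteration converges to the dominant eigenvector of b.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        start_norm = math.sqrt(sum(x * x for x in v))
        v = [x / start_norm for x in v]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        eigenvalue = sum(v[i] * bv[i] for i in range(n))
        if eigenvalue > 1e-12:
            scale = math.sqrt(eigenvalue)
            for i in range(n):
                coords[i][d] = v[i] * scale
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigenvalue * v[i] * v[j]
    return coords

# Made-up distances between three sites (a 3-4-5 triangle).
distances = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
for point in classical_mds(distances):
    print([round(c, 3) for c in point])
```

The orientation of the resulting plot is arbitrary; only the distances between points are meaningful.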

Plot for 1-cent matrix:

cluster1

Plot for 2-cent matrix:

cluster2

Plot for 5-cent matrix (note that this plot has only 14 sites):

cluster5

Discussion:

First question: are we getting meaningful data?

I think so. We see websites like Facebook and Orkut near each other. Of course, we also see some potential anomalies, like YouTube closer to Facebook than to Hulu in the first plot — note that YouTube and Hulu are closer together in the 2-cent plot.

Second question: how close is the data to ground truth?

This question is harder to answer. Recall the coin flipping experiment, where turkers were asked to flip a coin and submit whether it landed on heads or tails. We saw a bias toward heads, suggesting that some turkers were cheating.

Turkers could be cheating here too, but it would be hard to detect, since we do not know the underlying distributions. Any variation or disagreement we see could be reflective of actual disagreement between turkers over how similar websites are.

The problem is similar to polling for an election, when we can’t trust the answers people give us. If we get 40 votes for candidate A, and 60 votes for candidate B, it could be that everyone answered truthfully, or it could be that 80 people answered randomly, and the remaining 20 people answered truthfully in favor of B.
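The arithmetic behind this example is easy to check: if random voters split evenly on average, both scenarios produce the same expected tally. The function name below is hypothetical.

```python
def expected_votes(truthful_for_a, truthful_for_b, random_voters):
    """Expected tally when random voters split evenly between A and B."""
    a = truthful_for_a + random_voters / 2
    b = truthful_for_b + random_voters / 2
    return a, b

print(expected_votes(40, 60, 0))   # everyone truthful: (40.0, 60.0)
print(expected_votes(0, 20, 80))   # 80 random, 20 truthful for B: (40.0, 60.0)
```

Since the two tallies are identical, the vote counts alone cannot tell the two scenarios apart.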

One interesting observation is that the variance decreases when we offer more money. If we offer 1-cent for each similarity measure, the standard deviation of 10 responses is, on average, 0.86. If we offer 2-cents per similarity measure, this average goes down to 0.83. This difference is not significant (paired t-test, p=0.3). However, if we offer 5-cents for each similarity measure, the average standard deviation goes down to 0.59. This difference is significant (paired t-test, p<0.0001 — note that the t-test can be paired since we only look at comparisons between the 14 websites common between the 2-cent and 5-cent cases).

This observation suggests that there is a ground truth similarity between websites, and turkers come closer to discovering this value by exerting more mental energy, which they are encouraged to do with an increased reward.
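The spread measure used in this comparison is the standard deviation of the 10 responses for each pair of websites, averaged over all pairs. A minimal sketch follows; it assumes the sample standard deviation (the post does not say which form was used), and the function name and responses are made up for illustration.

```python
import statistics

def average_spread(responses_by_pair):
    """Average, over website pairs, of the standard deviation of responses."""
    deviations = [statistics.stdev(r) for r in responses_by_pair.values()]
    return statistics.mean(deviations)

# Made-up similarity responses for two website pairs, for illustration only.
responses = {
    ("Facebook", "Orkut"): [5, 4, 5, 4, 5, 5, 4, 5, 4, 5],
    ("YouTube", "Hulu"): [4, 2, 5, 3, 4, 1, 4, 3, 5, 2],
}
print(round(average_spread(responses), 2))
```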

However, it turns out that turkers seem to spend less time completing these tasks when offered more money. Average time per turker for 1-cent is 70 seconds. This decreases to 63 seconds for 2-cents, which is not significant (paired t-test, p=0.19). However, it drops again to 50 seconds for 5-cents, which is significant (paired t-test, p=0.01).

This seems odd, and it could be that the “average time on task” is masking the real story. If not, it raises many questions, like: are turkers spending less time because they are more focused? Are we attracting a different segment of the turk population for 5-cents, one that happens to work faster? Is the decrease we saw in variance actually a bad thing?

There is a lot of data here, and some more thought will probably reveal additional experiments to shed light on these questions. Comments and suggestions are welcome.

Image Description — Iterative vs Non-Iterative

  • Total Cost: $10.82
  • Running Time: 75.8 hours
  • 200 HITs, paying between $0.01 and $0.02
  • TurKit code for this experiment: code.js, all files
  • TurKit Version: 0.1.37

This experiment uses the experimental design described in the pilot experiment, except that we run the procedure on ten different images. We also alternate whether the iterative or non-iterative descriptions are solicited first.

Note that the same file-synchronization bug exists here as with the pilot experiment, so it is unclear whether turkers saw the example writing style.

Results:

The following table shows the iterative description in green, along with the iterations that led up to that description. The non-iterative descriptions are shown in white, and all descriptions are sorted by their average rating.

You may also view a complete list of these tables for all ten images.

rating sunrise
description
4.6 An array of majestic sunbeams peaking over the horizon and piercing through the low clouds display a changing explosion of colors that constantly maneuver through the sky and are captured by the camera.
4.5 A blazing golden-yellow sun sets beneath a sky of swirling orange clouds in the moments before twilight, silhouetting a pine tree-lined horizon. The northern wilderness is calming down for the day and the creatures of the night will soon begin their nocturnal activities. A lone owl is perched high in the tree, ever watchful of the area around her.

Iterations leading up to the iterative description (rated 4.5 above):
1 Golden yellow sunset with silhouetted pine trees on the horizon.
2 Golden yellow sunset with silhouetted pine trees on the horizon. The sky is very fire color and that sun light veru pure white.
3 Blazing orange sunset with swirling clouds silhouettes a line of trees.
4 Blazing sun sets beneath a sky of swirling orange in the moments before twilight, silhouetting a tree-lined horizon.
5 A blazing golden-yellow sun sets beneath a sky of swirling orange clouds in the moments before twilight, silhouetting a pine tree-lined horizon.
6 A blazing golden-yellow sun sets beneath a sky of swirling orange clouds in the moments before twilight, silhouetting a pine tree-lined horizon. The northern wilderness is calming down for the day and the creatures of the night will soon begin their nocturnal activities. A lone owl is perched high in the tree, ever watchful of the area around her.
3.4 Sunset, orange hazy in the sky, and shadows of the tree line.
3 A red sunset in an equatorial country.
2.7 This image is a picture of a sun sets in the afternoon that many people loved to see and discover how great was the Earth.
2.7 it is a red yellowish sunset
1.7 Aku terus cuba dan mencuba untuk aku serasikan bidang photography dan kerja2 editing dan aku berharap akan ianya akan terus bersatu dalam jiwa akubak kata cameramerah http://www.flickr.com/photos/cameramerah
photography dan editing saling perlu memerlukan bagi menambah seri sesuatu hasil tp tidak keterlaluan tp aku punya nih dah macam mengubah alam la plak hehehehe xpelakan banyak masa untuk nih so sebagai ganti rugi aku upload untuk kawan2 comment dan bg pendapat ble kan

Discussion:

It is difficult to see a significant difference between the two methods. The average top ranked non-iterative description has a rating of 3.9, while the average iteratively generated description has a rating of 3.86. This difference is not significant (paired t-test p = 0.65).

The average length of the top ranked non-iterative description is 238 characters, compared with 237.5 in the iterative case. Again, this is not significant (paired t-test p = 0.99).

The average time spent on task for each turker in the non-iterative process is 214 seconds, compared with 195 seconds in the iterative case. Again, not significant (t-test p = 0.51).

Due to technical difficulties, it’s hard to compare the total running time of each process — sometimes the program running the tasks was shut down, which would add unnecessary time to the iterative process. Still, removing obvious outliers, we get 0.97 hours for non-iterative processes, and 1.13 hours for iterative processes, which is not a significant difference (t-test p = 0.54).

It would make sense that the two methods would be similar if the iterative descriptions could all be traced to the work of a single contributor, but this doesn’t appear to be the case. Subjectively, it appears that turkers often build on, or at least incorporate each other’s work.

One potential explanation is that it takes as much time and effort to read someone else’s work and improve upon it, as it does to write from scratch — at least in these examples. For iteration to be useful, it seems like we need each increment to save time for the next contributor, so that they can go even further.

Image Description — Iterative vs Non-Iterative (Pilot)

  • Total Cost: $1.08
  • Running Time: 15.7 hours
  • 20 HITs, paying between $0.01 and $0.02
  • TurKit code for this experiment: code.js, all files
  • TurKit Version: 0.1.37

A couple of previous experiments compare iterative and non-iterative methods of generating tag clouds for websites. This is a pilot experiment of a similar comparison for writing image descriptions.

Experimental Setup:

We had turkers describe a fan iteratively and non-iteratively. The non-iterative method asks 6 turkers to write an entire description (2 cents each). None of these workers were allowed to participate in the iterative version.

The iterative version alternates between an improve HIT (paying 2 cents) and one of two vote HITs: vote on first improvement, and vote on subsequent improvements. The vote HITs pay 1 cent per vote, and we collect votes until at least 2 votes agree with each other. The improve HIT seeds the textarea with the most recent description that was voted in favor of.
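The control flow of the iterative version can be sketched as below. This is a simulation in Python, not the TurKit code linked above: the `improve` and `make_voter` callables stand in for the improve and vote HITs, and all names are hypothetical. Votes are collected one at a time until two agree.

```python
def collect_vote(ask_voter, needed=2):
    """Ask voters one at a time until one option has `needed` votes."""
    counts = {"old": 0, "new": 0}
    while max(counts.values()) < needed:
        counts[ask_voter()] += 1
    return "new" if counts["new"] >= needed else "old"

def iterate(improve, make_voter, iterations=6):
    """Alternate improve and vote steps, keeping the description voted for."""
    best = ""
    history = []
    for _ in range(iterations):
        # The improve step is seeded with the most recent winning description.
        candidate = improve(best)
        history.append(candidate)
        if collect_vote(make_voter(best, candidate)) == "new":
            best = candidate
    return best, history

# Toy stand-ins: each "turker" appends a word; voters prefer longer text.
def toy_improve(text):
    return (text + " fan").strip()

def toy_voter(old, new):
    return lambda: "new" if len(new) > len(old) else "old"

final, versions = iterate(toy_improve, toy_voter, iterations=3)
print(final)  # prints "fan fan fan"
```

With two options and a threshold of two agreeing votes, each vote step needs at most three votes.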

Next, we had ten turkers rate each description on a scale of 1-5 (1 cent each). None of the turkers who participated in either of the previous processes were allowed to rate the descriptions.

NOTE: All of the HITs linked above show an “Example writing style” with a description of the aurora borealis. However, due to a file-synchronization bug, it is unclear whether the turkers actually saw this example, or a previous version with the text “…put writing example here…”.

Results:

The following table shows the iterative description in green, along with the six non-iterative descriptions. All descriptions are sorted by their average rating.

rating  description (fan image)
3.7 This is a Japaneese style folding fan and when opened it spreads like an accordian. The beautiful scene of the waterfall in the tropics, is in colors of deep forest green and white cascading falls. If you look closely between the folds it gives the appearance of sunshine glistening off the falls, almost as if you can see the water actually falling.
3.6 (iterative; versions) Decorative wall fans are one of the most beautiful ways you can decorate your home. These large wall fans come in an endless variety of designs and colors. This is a large decorative oriental fan with a touch of traditional Chinese brush painting. This large oriental wall fan has a landscape motif of waterfalls and trees and two people sitting on the stones by enjoying the beautiful secenery. The material used for large wall fans is bamboo wood, which forms the spokes and base of the fan.
3.2 This beautiful decorative fan displays a nature image of waterfalls and trees against a mountainous background. The dark wood accent and natural setting would be a nice addition to any home.
3.2 This beautiful oriental, hand held fan unfolds to reveal ascenic waterfall. The muted colors are elegant on an ivory background accented by the deep red and white sunburst at the center of the fan. The black lacquered sides of the open fan complete a very sturdy and visually appealing design.
3 This classically beautiful asian fan features a landscape scene as well as asian characters. It depicts a landscape scene with a river, trees, waterfalls, and rocks on a white/beige background.
2.8 This is a vintage Japanese hand fan. It has a traditional landscape image of a couple on a hillside watching a waterfall done in cascading blues and grays.
1.7 it looks like a large fan. The colors are white,black, and red. It has lettering on top and a drawing of a forest.

Discussion:

There is a story here. A story of two turkers. One turker was shown an image of a fan, and was asked to describe it. This turker applied themselves, and wrote a 3.7 level description. The best description offered by their peers rated 3.2.

Another turker was shown an image of a fan, and was asked to improve someone else’s description of it. This turker applied themselves, adding 400% new content, and crafting a 3.6 level description. Subsequent peers failed to improve upon it.

Turker-Spark Theory: The turkers in the story above experienced a turker-spark — a spark of insight and ambition that set their contribution a notch above their peers.

Several stars aligned to trigger the spark. Some of these stars have to do with the turker: native writing expertise; meticulous attention to detail.

Some of the stars have to do with the moment that they accepted the task: creative insight about the prompt; time and willingness to apply themselves.

According to turker-spark theory, the way to solve tough problems on MTurk is to post many similar HIT assignments, and hope that one of them is ignited by a turker’s spark of inspired effort.

If turker-spark theory is true, it would be nice to know the probability of a spark happening.
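To make the question concrete, here is a small sketch of how a spark probability would translate into a number of assignments to post. It assumes each assignment sparks independently with the same probability `p`; both the independence and any particular value of `p` are assumptions, since the pilot does not measure either.

```python
import math


def p_at_least_one(p: float, n: int) -> float:
    """Chance that at least one of n independent assignments sparks."""
    return 1.0 - (1.0 - p) ** n


def assignments_needed(p: float, target: float) -> int:
    """Smallest n for which p_at_least_one(p, n) reaches `target`."""
    if not (0.0 < p < 1.0 and 0.0 < target < 1.0):
        raise ValueError("p and target must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


# Hypothetical value, for illustration only: one spark in ten assignments.
hypothetical_p = 0.1
n = assignments_needed(hypothetical_p, target=0.9)
```

Under that hypothetical one-in-ten rate, it would take 22 assignments to have a 90% chance of seeing at least one spark.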

It would also be interesting to know if an iteration ungraced by any sparks can measure up to a turker-spark on a non-iterative assignment. That is, can turkers doing essentially mediocre work — iteratively — match the quality of a single turker doing great work?

This is a pilot experiment; we’ll soon have results of running this task 10 times, which should shed a little more light on the subject.

Response Time Over 24 Hours

  • Total Cost: $3.60
  • Running Time: 24 hours (1 new HIT with 5 assignments each 20 minutes)
  • 72 HITs with 5 assignments, each paying $0.01
  • TurKit code for this experiment:  response-time.zip response-time.js
  • TurKit Version: 0.1.29

This experiment revisits the question of response time initially explored as part of our small where’s the duct tape example.  Every 20 minutes over 24 hours we posted a HIT with 5 assignments.  The assignments were drawn from 12 tasks resembling the duct tape example – turkers were shown a picture containing an object and asked to click on the object.  You can try for yourself in our object recognition test page.

Turkers successfully identified the object in almost all cases.  The main exception was the following picture, in which we asked turkers to find sunglasses:

Television with Sunglasses on the File Cabinet Beside It

Only about half of the turkers shown this example correctly identified the sunglasses.  If you’re having trouble, they’re on top of the file cabinet.

You can see the results visually here.  These graphs were created with Protovis, which unfortunately only works with Firefox at the moment, and so are repeated below.  The blue color represents the first turker to submit, accept, or complete the HIT; the aqua, green, yellow, and pink represent later turkers.

Submission Times Over 24 Hours, generally the first submission happened in less than 100 seconds regardless of the time of day.

Time Taken after Accepting the HIT, generally this required less than a minute, with a few peaks closer to 100 seconds.

Time to Accept the HIT, most HITs accepted in less than 30 seconds

What’s great about this is that without doing much of anything to encourage quick responses, we were still able to get an answer within 100 seconds most of the time.

In this experiment, we also looked to see how many assignments you would need to put out to get the fastest submissions. This is not straightforward because turkers can accept a HIT quickly and then wait a while before submitting it – it’s not clear why they would want to do this but they do.  In the worst case, all 5 assignments are accepted quickly but none are quickly submitted. We could avoid this situation by setting a shorter maximum time, but this might have adverse effects on the pickup time of the HIT and is something to consider in future iterations of the experiment.  In this study we used the TurKit default maximum of 60 minutes.

In 40 cases (56%), the turker who was first to accept the HIT was also the first to submit it.  17 2nd-acceptors submitted first, 10 3rd-acceptors submitted first, 4 4th-acceptors submitted first, and 1 5th-acceptor submitted first.
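The tally above can be turned into cumulative shares, which speak to how many assignments are worth posting. The sketch below uses the counts reported in this post; the helper `accept_rank_of_first_submitter` is a hypothetical illustration of how such a tally could be derived from per-assignment timestamps, not the code in response-time.js.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple

# Counts reported above: how often the k-th acceptor (k = 1..5) submitted first.
first_submitter_by_accept_rank = [40, 17, 10, 4, 1]


def cumulative_share(counts: Sequence[int]) -> List[float]:
    """Share of HITs whose first submitter was among the first k acceptors."""
    total = sum(counts)
    return [running / total for running in accumulate(counts)]


def accept_rank_of_first_submitter(
        assignments: Sequence[Tuple[float, float]]) -> int:
    """Given (accept_time, submit_time) pairs for one HIT, return the
    1-based acceptance rank of the assignment that was submitted first."""
    by_accept = sorted(assignments, key=lambda a: a[0])
    first = min(range(len(by_accept)), key=lambda i: by_accept[i][1])
    return first + 1


shares = cumulative_share(first_submitter_by_accept_rank)
```

In this data the first submission came from one of the first three acceptors in 67 of 72 HITs (about 93%), so the fourth and fifth assignments rarely produced the fastest answer.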

This study suggests at least two interesting properties of MTurk:

First, time of day does not seem to matter all that much – turkers took a similar amount of time to respond at 5am EST as they did at 5pm  EST.

Second, if you want something done quickly, you may need to recruit multiple turkers, as any one turker may not reliably complete a HIT expeditiously.  In 44% of cases, the turker who accepted first was not the first to submit.

As we move forward, we plan to try more active techniques for getting turkers to answer more quickly.  How much would you have to be willing to spend over a 24-hour period to guarantee that you could always get an answer to a question like this in less than 10 seconds?  How would you do it?  Hopefully, we’ll have an answer!

Website Tags — Not Iterative

  • Total Cost: $0.84
  • Running Time: 18.4 hours
  • 42 Improve HITs, paying $0.02 each
  • Not Iterative
  • TurKit code for this experiment: tags-flat.js, all files (including database file*)
    • *NOTE: The database file contains fake worker Ids, to protect the privacy of turkers. These worker Ids are consistent within this database file, but not necessarily with future fake worker Ids.
  • TurKit Version: 0.1.34

This experiment is a non-iterative version of a previous blog post about creating tag clouds for websites.

Both experiments make six requests for new tags for each website. The previous experiment makes these requests in serial, where each HIT contains the tags from the previous improvement.

This experiment makes the requests in parallel, where each HIT contains a blank slate. Note that the parallel approach guarantees six unique turkers for each website, whereas the serial approach allowed turkers to improve upon their own work (this happened about one time for each website in the previous experiment).

I did not have people vote in this experiment. After inspecting the results by hand, I voted against four of the responses, which were obviously not lists of tags.

Results:

You may view the complete list of responses here.

In order to compare tags from this experiment and the previous experiment, we refined the tag lists in several ways. First, tags from each response for a given website in this experiment were combined together removing duplicates, after converting all tags to lower-case. These tag lists were then put in alphabetical order. Tag lists from the previous iterative experiment were also converted to lower-case and put into alphabetical order, with duplicates removed.
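The refinement steps just described are simple enough to sketch. The function below is an illustration rather than the code actually used; in particular, trimming surrounding whitespace and dropping empty entries are assumptions not mentioned in the post.

```python
from typing import Iterable, List


def normalize_tags(responses: Iterable[Iterable[str]]) -> List[str]:
    """Merge tag lists, lower-case them, drop duplicates, sort alphabetically."""
    unique = set()
    for response in responses:
        for tag in response:
            cleaned = tag.strip().lower()  # whitespace trimming is an assumption
            if cleaned:
                unique.add(cleaned)
    return sorted(unique)


# Hypothetical responses from three parallel HITs for one website.
parallel_responses = [
    ["Art", "comics"],
    ["art ", "Manga"],
    ["comics", "social network"],
]
merged = normalize_tags(parallel_responses)
```

For the iterative experiment the same function applies to a single list, for example `normalize_tags([final_tag_list])`, since only the final tag cloud for each website is compared.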

You may view comparisons of the tag clouds for each website here.

The first two comparisons are shown here:

http://www.zeros2heroes.com/
Not Iterative (18 tags):
art
artists
campaign
classified
collectibles
comics
community
conventions
development
feedback
figurines
manga
marketing
online
publishing
social network
talent
workshops

Iterative (28 tags):
artist
artists
beta
blog
classifieds
comic
comic community
comics
comics publisher
community
contests
editors
feedback
for kids
forums
free comics
gamers
games
heroes
participate
peoples publishers
portal
publisher
webcomics
workshops
writer
writers
zeros

http://www.deviantart.com
Not Iterative (33 tags):
art
art collection
articles
artists
artwork
business
characters
chat
community
deviant
drawings
fanfiction
forum
graphics
images
manga
network
online
painters
paintings
photographers
photography
photos
pixel art
poetry
sales
sculpture
share
share artwork
sharing
sweet adult fun
thumbnails
wallpaper

Iterative (22 tags):
animation
art
art games
blog
buy
cartoons
comics
community
contests
expression
film
flash
forums
images
individual
merchandising
photography
pictures
ratings
scraps
sell
services

Discussion:

Quantity: It appears that both methods produce similar numbers of tags. The non-iterative method generated an average of 22.3 tags, compared with 20.4 tags using the iterative method. This difference has a p value of 0.93, which is not significant. It is interesting to note that there is very little overlap between the groups — this suggests that neither approach is reaching a fundamental maximum number of tags.
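The overlap claim can be checked directly on the zeros2heroes lists shown above. This sketch uses exact string matching on the normalized tags, which is an assumption about how overlap should be counted: near-duplicates such as "classified" and "classifieds" count as different tags.

```python
not_iterative = {
    "art", "artists", "campaign", "classified", "collectibles", "comics",
    "community", "conventions", "development", "feedback", "figurines",
    "manga", "marketing", "online", "publishing", "social network",
    "talent", "workshops",
}
iterative = {
    "artist", "artists", "beta", "blog", "classifieds", "comic",
    "comic community", "comics", "comics publisher", "community",
    "contests", "editors", "feedback", "for kids", "forums", "free comics",
    "gamers", "games", "heroes", "participate", "peoples publishers",
    "portal", "publisher", "webcomics", "workshops", "writer", "writers",
    "zeros",
}

shared = sorted(not_iterative & iterative)   # tags produced by both methods
combined = not_iterative | iterative         # every distinct tag from either
jaccard = len(shared) / len(combined)        # overlap as a fraction of the union
```

For this website only 5 tags appear in both lists (artists, comics, community, feedback, workshops) out of 41 distinct tags overall.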

Quality: I don’t see much difference in quality between the approaches, though I haven’t looked too deeply. I had wanted to run an experiment where I ask people to say how relevant each tag is to the website, but I’m not sure these results would be significant either way.

Time: The non-iterative experiment took a little longer, 18.4 hours instead of 12.6 hours. It’s not clear whether this is significant, but it is interesting, since the non-iterative experiment posts all the HITs at once, while the iterative experiment only posts new HITs after the previous ones have been completed. There are a couple of possible reasons. First, the iterative experiment allows a turker to add tags for the same website more than once, so we don’t have to wait for unique turkers. As mentioned before, this happened about once for each website. Second, 1-cent HITs have a brief time once they are posted to be at the top of the “most recent” search results, before they sink into the unsearchable abyss. The iterative experiment takes better advantage of this time.

Conclusion: In this case, people seem as likely to generate novel tags looking at a blank slate as they are looking at previously generated tags. It’s possible that iteration would be useful in refining the tag cloud, but before doing anything that complicated, I may investigate iteration in other similar domains.

Website Tags

  • Total Cost: $1.80
  • Running Time: 12.6 hours
  • 42 Improve HITs, paying $0.02 each
  • 42 Vote HITs with a total of 96 assignments, paying $0.01 each
  • Iterative
  • TurKit code for this experiment: tags.js, all-files.zip
  • TurKit Version: 0.1.34
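As a quick sanity check, the totals for this experiment and its non-iterative counterpart follow from the HIT counts and rewards listed. The sketch works in whole cents to avoid floating-point rounding, and assumes the reported totals are rewards only, with no MTurk fees included.

```python
from typing import Iterable, Tuple


def total_cost_cents(items: Iterable[Tuple[int, int]]) -> int:
    """Sum count * reward over (count, reward_in_cents) pairs."""
    return sum(count * cents for count, cents in items)


# Iterative: 42 improve HITs at 2 cents, 96 vote assignments at 1 cent.
iterative_cents = total_cost_cents([(42, 2), (96, 1)])
# Non-iterative: 42 improve HITs at 2 cents, no votes.
non_iterative_cents = total_cost_cents([(42, 2)])
```

These come to 180 and 84 cents, matching the reported $1.80 and $0.84, so voting accounts for a little over half of the iterative version's cost.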

This experiment tries to iteratively create tag clouds for websites. It is similar to an earlier blog post about summarizing websites.

The improve HITs look like this:


Improve HIT

And the vote HITs look like this:


Vote HIT

This experiment uses the same websites from the previous blog post.

Results:

Here we show the iterations for http://www.ask500people.com:

version 1

opinion polls
poll
vote
survey
question
quiz
ask 500 people
version 2

opinion polls
poll
survey
sample
question
quiz
ask 500 people
vox pop
vote
answer
quiz
version 3

opinion polls
poll
survey
sample
question
quiz
ask 500 people
vox pop
vote
answer
quiz

questions
varied answers
questionnaire
ask world

version 4

opinion polls
poll
survey
sample
question
quiz
ask 500 people
vox pop
vote
answer
quiz

questions
varied answers
questionnaire
ask world

new
books
version 5

opinion polls
poll
survey
sample
question
quiz
ask 500 people
vox pop
vote
answer
quiz
forum
questions
varied answers
questionnaire
ask world
opinions
version 6

opinion polls
poll
survey
sample
question
quiz
ask 500 people
vox pop
vote
answer
quiz
forum
questions
varied answers
questionnaire
ask world
opinions
forum
blog
network

You may view the iterations for each website here.

Below, we show the final tag cloud generated for each website (except the one above):

http://www.zeros2heroes.com/

zeros2heroes
publisher
comics
comics publisher
webcomics
free comics
writer
artist
workshops
feedback
editors
Peoples Publishers
Comic Community
gamers
games
for kids
community
portal
contests
participate
writers
artists

publisher
beta
comic
blog
classifieds
forums
http://www.deviantart.com

art
photography
film
animation
pictures
images
flash
buy
sell
community
blog

art games
cartoons
comics
contests
Services
scraps
forums
ratings
merchandising

individual
expression
http://folding.stanford.edu

folding
stanford
university
distributed computing
protein folding
disease
charity
help
all languages
cloud computing
higher education
education
subjects offered
Forum
Research
http://www.artbreak.com

artbreak
art
commission free
sell art
shop for art
paintings
view art
display art
upload art
creative art
artists' chat
information on art
market art
postures
Randomizer
Share
work
world
online community
forum
http://www.orkut.com

orkut
instant message
share your videos, photos
free chat
free messaging
connect
profile
social network
google
blog
connect
connections
friends
family
scraps
people
communities
pictures
friend connectivity
media sharing
video sharing
MSN
AOL
YIM
communication
http://friendfinder.com

friend finder
social network
online dating
profile
personals
blogs
community
forums
networking
dating site
find a partner
man seeking woman
woman seeking man

Discussion:

One thing to note is that people typically append to the end of the tag cloud, rather than refining or removing any of the existing tags. If we want any refinement of the tags, we may need to create a separate task for this.

The first tagger seems to add more tags than subsequent taggers: 5.7 tags for the first tagger compared to 3.1 for subsequent taggers. If we throw out versions that were rejected, and take the average number of tags added in each iteration, we get: 5.7, 3.7, 5, 2.6, 2.3, 2.5. If we squint enough, we might convince ourselves that these numbers go down, presumably because it becomes increasingly difficult to find new tags.

At first, I was tempted to say that people invest the same amount of effort on each task, but their effort yields fewer tags on later iterations. However, the time spent on task seems to go down after each iteration (except the first): 133.3, 175.4, 171.7, 121, 96.1, 91. Of course, this doesn’t mean that people are expending less mental energy on later iterations; perhaps they are spending less time looking at the webpage, since the tags do a good job of telling them what to expect there.
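One way to squint a little more rigorously is to fit a least-squares line through each series of per-iteration averages. The sketch below does this for the tags-added and time-on-task numbers quoted above; the function name is made up, and the time units are left as reported because the post does not state them.

```python
from typing import Sequence


def ls_slope(values: Sequence[float]) -> float:
    """Least-squares slope of values against iteration number 1..n."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Averages per iteration, as quoted in the discussion above.
tags_added = [5.7, 3.7, 5, 2.6, 2.3, 2.5]
time_on_task = [133.3, 175.4, 171.7, 121, 96.1, 91]  # units not stated in the post

tags_slope = ls_slope(tags_added)
time_slope = ls_slope(time_on_task)
```

Both slopes come out negative (roughly -0.65 tags and -14.3 time units per iteration), but with only six points this is a descriptive trend line, not a significance test.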

This experiment raises an interesting question about iteration — since people add more tags on the first iteration, maybe we should be showing everyone a blank slate. We would probably get duplicate tags this way, but if we remove the duplicates, will we have more remaining tags than we get in the iterative version?

Of course, this question existed in the previous summary writing experiments as well, but it might be easier to test with the tag cloud generation experiments, since we can more-or-less quantify the amount of contribution in each iteration.

The real hope of iteration is that people are able to use previous people’s work as a jumping-off point to spur their creativity, and at the same time focus their effort by seeing what has already been done. However, this jumping-off point may also stifle creativity by acting as a powerful suggestion about the direction that the tag cloud should take. Ultimately both points of view are probably true, and the optimal Human Computation Algorithm for generating tag clouds must benefit from both.