
Turkers as Proofreaders

I hate proofreading my own writing and would love to find a way to reliably turn this task over to the crowd. It is important to be clear about what I mean by proofreading—I’m not looking for someone to crack open Warriner’s and provide pedantic comments on split infinitives—I want them to find gems like:

“In the following repression, we can see that the coefficient on…”

These are the sorts of mistakes that fresh eyes can find easily, and yet slip past someone engrossed in their own work.

Anyway, I’m currently running an experiment to see how well Turkers can proofread a paper (which, ironically, is about AMT). Because the whole paper is too much of an elephant for them to swallow, I broke it up into single pages, double-spaced. I’m paying 25 cents per page to find and note any errors. I imagine one could offer Knuth-style incentives and get better performance, but for this pilot I’m keeping things simple (and I don’t want workers to make too many Type I errors and over-report “problems”).

Step 1: Create a PDF version of your document that has line numbers. I think MS Word can do this automatically—in LaTeX, you can use the lineno package.

Step 2: Use tools like pdftk to “burst” the PDF into single pages. In pdftk, this is just:

$> pdftk your_seminal_work.pdf burst

This will generate a set of single-page PDFs, one for each page in your document. Put all the PDFs somewhere exposed to the web.

Step 3: Create a CSV file with a single column named “url”, and fill the rows with links to the pages you just posted online.
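If you want to script Steps 2 and 3, here is a rough Python sketch. It assumes pdftk’s default pg_0001.pdf-style output names and a hypothetical hosting URL; adjust both to match your setup.

    import csv
    import glob
    import os

    # Hypothetical base URL where the burst pages will be hosted.
    BASE_URL = "http://example.com/proofreading/"

    # pdftk burst (by default) names its output pg_0001.pdf, pg_0002.pdf, ...
    pages = sorted(glob.glob("pg_*.pdf"))

    with open("pages.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url"])  # the column name referenced by ${url} in the AMT template
        for page in pages:
            writer.writerow([BASE_URL + os.path.basename(page)])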

Step 4: Create a template in AMT, and use a ${url} construct so that each HIT will link to one of the pages. Add a textbox to the template where workers can note the errors they find.

Step 5: Upload your CSV file and associate it with the template you just made.

Step 6: Launch and see what happens.

Right now, I’m on Step 6. I’ll report the results as an update to this post.

Writing an Essay on MTurk: Part 1

We plan to write an essay on Mechanical Turk — the sort of 5-paragraph essay that you would write in high school or for the SAT or GRE.

The idea will be to coordinate the efforts of many contributors, rather than having a single worker write the entire essay. The plan is to proceed in phases, similar to the ones English teachers drill into their students: brainstorm ideas, write an outline, convert the outline into prose, add transitions, comb over the result for grammar and spelling, etc…

To start with, we need a topic. We modified the brainstorming code from a previous blog post to generate topic ideas instead of company names. We used the iterative method — where turkers could see all the suggestions so far — and seeded it with two topics so that people would have an example of what we were looking for:

  • Mankind should send a man to Mars in the next 10 years.
  • The criminal justice system should focus on rehabilitation rather than deterance [sic].

Here are the topic suggestions supplied by MTurk workers:

  • Is Hugo Chavez and his anti-United States of America sentiment good for Latin America?
  • Is a salary cap necessary to restore competitive parity in Major League Baseball?
  • Has the invention of the internet and the expansion of new-age media been good or bad news for the music industry?
  • Would it be good for the United States to put a cap on the number of children you can have while on Federal Welfare?
  • Should the drinking age be lowered?
  • How has the number of hours a kid is on his computer or watching tv affected his grades in school?
  • why do so many older cats die of cancer?
  • why are there not efforts to curb office bullies?
  • What new direction will hip hop music take?
  • Do you think sports stars get special treatment from the judicial system?
  • Should illegal immigrants be deserving of health benefits?
  • Should instant-replay be used for more than home runs in baseball?
  • Should Physician-Assisted-Suicide, or euthanasia be legalized?
  • How important is a college education in todays job market?
  • What ar the pros and cons of the bailouts our government has given?
  • Smoking should not be allowed in restaurants and bars.
  • What country has the most gender equality?
  • How effective is recycling in saving the environment?

We had intended, with the seed-examples, to push people toward statements instead of questions. This didn’t work. Perhaps it would have worked better in a parallel process where each person only saw the seed-examples. In any case, I suppose we’ll be writing an essay that answers a question ;)

Next we want to find the best topic. The process of finding the best n items in a list is likely to be reused for many aspects of writing the essay, so it may be worth thinking about this next step carefully…

Code:

How many turkers are there?

With the recent interest in getting multiple turkers into a virtual room at the same time, it would be nice to know how many individual turkers are out there. We don’t have the answer, but we have run lots of experiments, and we can aggregate all of our historical data (much of which is posted on this blog) to get a sense for how many turkers might be out there, and how active they are.

These graphs are based on 4,449 HITs, with a total of 28,168 assignments. Most of these were posted over a 75-day period. Here is the number of assignments completed each day of that span:

Assignments completed each day over a 75-day period.

Most of our assignments were completed in bursts on a handful of days, but there is some spread in the data.

We had work done by 1,496 individual turkers. Here is the number of assignments completed by each turker:

Assignments completed by each turker.

This is a classic power-law distribution. We even see something close to the 80-20 rule: 80% of assignments were completed by the top 22% of turkers. The bottom 25% of turkers completed just one assignment each.
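As a rough sanity check on that 80-20 figure (the real analysis was done in the Excel file linked below), here is a small Python sketch that computes what fraction of turkers accounts for 80% of the assignments, given the assignment count for each turker:

    def share_of_top_turkers(assignments_per_turker, work_share=0.80):
        """Fraction of turkers (most active first) needed to cover work_share of all assignments."""
        counts = sorted(assignments_per_turker, reverse=True)
        total = sum(counts)
        done = 0
        for i, count in enumerate(counts, start=1):
            done += count
            if done >= work_share * total:
                return i / len(counts)
        return 1.0

    # Hypothetical usage, with counts loaded from the Excel data:
    # print(share_of_top_turkers(counts))  # reported above as roughly 0.22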

So, it looks like there may be relatively few active turkers out there doing most of the work. I have some anecdotal evidence that it is hard to get, say, 500 separate turkers to complete a single HIT over a short time-span. Of course, this will change as MTurk grows, and it’s not even clear that this data gives a good picture of everything that happens on MTurk. We have posted a variety of types of HITs, but nothing compared to the variety of HITs that are out there.

It might be nice if someone ran a study soliciting as many individual turkers as possible to complete a single HIT, just to see how many there are at a given time — something simple, like paying 50 cents to click a button.

There may also be some questions we can answer with the data we have, but haven’t thought to ask yet. Please don’t be shy about posting comments and suggestions.

Code:

Here is the Excel file used for the graphs above.

Image Rating — MTurk vs. Lab Members

We have often used Mechanical Turk to rate things, like brainstorming ideas and image descriptions. We were curious how MTurk ratings compared to ratings we might obtain from a traditional user study.

We randomly selected 10 images from www.publicdomainpictures.net. This site has a rating for each image obtained from members of the site, on a scale from 1 to 10. We tried to select images with a uniform distribution of image ratings.

Next, we had two groups of people rate the images on a scale of 1 to 10:

  • MTurk workers: We posted the images on MTurk, and solicited ratings from 10 different turkers for each image (1 cent per rating).
  • Lab members: We sent an e-mail to the MIT CSAIL mailing list, asking for self-identified amateur photographers to rate our images using a web interface, with the promise of $25 to one lucky participant. The response rate was greater than we expected: 56 participants answered all our questions.

Each group used the same interface to rate the images:

The rating interface, on a scale of 1 to 10.

Results:

The following table shows the three ratings for each of the 10 images (the images themselves are omitted here), listed as MTurk Rating, Lab Member Rating, Site Rating:

  • 8.8, 6.4, 9
  • 8.7, 7.4, 8
  • 7.6, 5.9, 7
  • 8.4, 6.7, 7
  • 6.9, 5.0, 3
  • 7.5, 6.4, 6
  • 4.1, 3.4, 4
  • 7.7, 4.6, 2
  • 7.8, 4.9, 1
  • 7.3, 4.0, 5

Linear regression between MTurk Ratings and Lab Member Ratings

Discussion:

MTurk ratings definitely seem correlated with ratings obtained from amateur photographers in our lab (R^2 = 0.597), though the R^2 value is low enough to suggest that there may be some significant differences between these populations as well. MTurk ratings are also correlated with site ratings, though the correlation is not as high (R^2 = 0.182). Interestingly, the correlation between lab member ratings and site ratings is higher (R^2 = 0.495).
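For anyone who wants to reproduce these numbers, here is a small Python sketch (assuming NumPy and SciPy are installed) that computes the R^2 values from the ratings in the table above:

    import numpy as np
    from scipy import stats

    # Per-image ratings from the table above: MTurk mean, lab-member mean, site rating.
    mturk = np.array([8.8, 8.7, 7.6, 8.4, 6.9, 7.5, 4.1, 7.7, 7.8, 7.3])
    lab = np.array([6.4, 7.4, 5.9, 6.7, 5.0, 6.4, 3.4, 4.6, 4.9, 4.0])
    site = np.array([9, 8, 7, 7, 3, 6, 4, 2, 1, 5])

    for name, x, y in [("MTurk vs. lab", mturk, lab),
                       ("MTurk vs. site", mturk, site),
                       ("lab vs. site", lab, site)]:
        r_squared = stats.linregress(x, y).rvalue ** 2
        print(name, round(r_squared, 3))  # 0.597, 0.182, 0.495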

Lab member ratings were 2 points lower than MTurk ratings on average, a difference that is statistically significant under a paired t-test (p < 0.0001). We suspected this might be because lab members rated all the images, whereas MTurk workers could rate as many or as few as they wanted. However, each lab member saw the images in a random order, and we saw no correlation between where an image appeared in the sequence and the rating it received. Another hypothesis is that amateur photographers are more discriminating in their photographic taste — or at least, they may have been encouraged to be more discriminating in our study, since we specifically recruited them as amateur photographers.

Code:

  • TurKit code for this experiment: code.js, all files
    NOTE: The code runs an additional set of experiments not discussed in this blog post, having to do with image comparisons (as opposed to image ratings).
  • TurKit Version: 0.1.42

Getting Turkers Together in Real Time

  • Total Cost: about $40
  • Running Time: 10-30 minutes
  • 6 HITs with 165 Assignments total
  • Payment per Assignment: $0.01, $0.25, $0.50
  • Not Iterative
  • TurKit code for this experiment: wait-for-letter.js
  • TurKit version 0.1.42
  • Flex code for this experiment: WaitForLetter.mxml (compiled version WaitForLetter.swf)

For many collaborative uses of MTurk (like the Etherpad experiment in the last post), it would be desirable to have a number of turkers interacting at the same time.  This is an unusual use of MTurk; most HITs are designed to be done separately and asynchronously by each turker.

This experiment continues our exploration of what it might take to get a group of turkers to be working on a HIT at the same time.   (See also Are Turkers Willing to Wait for a Task?)

The HIT used here is a simple letter-transcription task: a picture of a letter (A-Z) is shown, and the turker must type it into a textbox.  The catch is that the letter won’t appear until some time in the future, typically a few minutes away.  The deadline is the same for all turkers who pick up the HIT.  The deadline is stored internally in GMT, but displayed in the turker’s local timezone.  Before the deadline, the instructions for the HIT look like this:

  • At the time shown below, an image of a letter (A-Z) will appear. In the box at the bottom, please type the letter you see.
  • Early answers and wrong answers will not be approved.
  • The letter will be visible for 10 seconds.
  • If you keep this HIT open in your browser, then a chime will sound while the letter is visible, to remind you to look at it.
  • If you don’t see a time below, don’t accept this HIT, because you won’t be able to do it.
Waiting for letter to appear

When the deadline is reached, the letter appears for 10 seconds, and a chime sound plays to alert the turker to pay attention to it:

You can see an example of what turkers saw when they waited for a letter (with only a 6-second deadline).

Because all turkers have the same deadline, and the letter is only visible for 10 seconds, we can assume that all the turkers who submit the correct answer were viewing and interacting with the HIT within that 10-second interval.  So the group was synchronized.  A synchronized group could be used to chat with each other, play a game, or pound on a website.

On a Friday evening (11pm-midnight EST), I ran several small pilot experiments to test this approach for collecting synchronized groups of turkers.  In the experiments below,

n = number of synchronized turkers desired (number of assignments in the HIT)
reward = payment for the HIT
deadline = time from posting of the HIT when the letter would appear

For n = 10, reward=$0.01, deadline=2 minutes, the HIT obtained 4 synchronized turkers.  One cent is evidently not enough to entice turkers to wait around that long.

Increasing the reward to $0.50 produced 6 synchronized turkers.  Two minutes may not be enough time to collect 10 turkers; the flow of new turkers interested in this task isn’t large enough.

Increasing the deadline to 10 minutes produced 9 synchronized turkers, which is close to the goal.

Finally, for n=30, reward=$0.50, deadline=10 minutes, we obtained 25 synchronized turkers.

In all cases, every turker provided the correct letter, which was different for each HIT.  We also recorded each turker’s total wait time from accepting the HIT to finally submitting it.  For the final 10-minute experiment, the wait time averaged 5 minutes 21 seconds; the longest waiter took 10 minutes and 21 seconds, while the fastest waited only 12 seconds.  This suggests that the HIT was close to saturating the flow rate of turkers who discovered this HIT and were willing to do it at this time of day and at this price point.

Update 1: I reran the experiment (Saturday night between 9pm and 10pm EST) with n=30, reward=$0.25, deadline=10 minutes, and obtained 21 synchronized turkers.  Halving the reward still produced about the same number of turkers, which may be around saturation at this time of day.

Another point of curiosity: even though Amazon guarantees that these turkers are all using different MTurk accounts, it seems possible that the same turker could be using multiple accounts at once (perhaps created on behalf of friends and family).  Consulting the server logs, however, we find that all 21 turkers came from different IP addresses, which makes it more likely that they were 21 unique people.

Update 2: One more run of the experiment (Sunday night, 8pm-9pm), with n=75, reward=$0.25, deadline=30 minutes, produced 33 synchronized turkers.

In this run, two turkers gave wrong answers early on, rather than wait for 30 minutes (they were not considered synchronized, so were not counted in the 33 above).  One turker produced a correct answer 30 seconds before the deadline, probably because their computer’s clock was fast.  A flaw in the experiment is that it assumes the turker’s own clock is accurate; a better implementation would measure local clock skew and adjust the deadline accordingly.
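The fix could be as simple as having the page ask the server for its current time and correcting for the offset. Here is a minimal sketch of the idea, in Python rather than the Flex used for the actual HIT; the server time endpoint is hypothetical.

    import time
    import urllib.request

    def estimate_clock_offset(server_time_url):
        """Estimate (server clock - local clock) in seconds, splitting the round-trip latency."""
        t0 = time.time()
        server_now = float(urllib.request.urlopen(server_time_url).read().decode())
        t1 = time.time()
        return server_now - (t0 + t1) / 2.0  # assume the server replied mid round trip

    def seconds_until_deadline(deadline_epoch, offset):
        """Seconds remaining until the server-defined deadline, measured on the local clock."""
        return deadline_epoch - (time.time() + offset)

    # Hypothetical usage:
    # offset = estimate_clock_offset("http://example.com/now")
    # remaining = seconds_until_deadline(deadline_epoch, offset)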

In this run, I also watched the server access logs to see how many turkers previewed the HIT.  74 unique IP addresses previewed the HIT, very smoothly distributed over the course of its 30-minute duration.  Since 10-minute HITs on Friday and Saturday night obtained roughly 25 turkers, and a 30-minute HIT at least drew the attention of roughly 75 (even though the price wasn’t enough to make them stick around), this suggests that the flow rate is roughly 2.5 turkers per minute (for this task and price on weekend evenings EST).

Of the 74 turkers who previewed the HIT, 58 accepted it, according to Amazon’s requester report.  Since only 33 successfully submitted, many of the others may have missed the 10-second window.

Turker Talk

One of the reasons I think Greg’s experiments are so neat is that he has turkers interact with the work of other turkers, which is very different from most of the current uses of AMT.

In this vein, I just launched a silly little experiment where turkers interact directly—I created a HIT that will bring 10 turkers (8 so far) to this Etherpad (which lets them collaboratively work on a document and chat with each other).

Update: Here is a permanent, read-only link, in case the live, writable Etherpad gets vandalized.

I prompted workers to discuss why they use AMT. I told them that they could type as little or as much as they liked. If you click on the time slider (upper right corner), you can watch how the conversation evolved. In the future, I’d like to run more experiments like this, with turkers given a more structured task or conversation topic, while manipulating group sizes, the direction of communication, the possibility (and structure) of payment, etc. Any suggestions?

Blurry Text Transcription Revisited

  • Total Cost: $19.20
  • 384 HITs
  • TurKit code for this experiment: code.js, all files
  • TurKit Version: 0.1.42
  • Java code for blurrifying some text and creating a webpage with a textbox beneath each word

This experiment builds upon a previous blog post about transcribing blurry text iteratively and in parallel. This experiment is larger, and makes a number of changes to the experimental design:

  • Instead of using a single textbox for the entire passage, this experiment places a textbox beneath each word — see sample HIT below.
  • This experiment uses more sample passages — 12 instead of 3.
  • The passages in this experiment are smaller — about 1/3 the size of the previous experiment.
  • We paid $0.05 per HIT instead of $0.10.
  • We ran 16 iterations instead of 8. NOTE: we originally ran 8 iterations and didn’t see a significant difference between the iterative and parallel conditions. We thought that we might see a difference if we continued running for another 8 iterations. The code is a bit confusing for this reason.
  • Punctuation characters like commas and periods are not blurred, and we automatically fill the textbox beneath each punctuation mark with the correct character.

Here is a sample HIT:

Sample HIT: blurred text with a textbox beneath each word.

Results:

Cumulative accuracies after each iteration.

This graph shows the average cumulative accuracy after each iteration, i.e., the accuracy we would achieve if we stopped the process after that many iterations. Accuracy is the proportion of words guessed correctly, not including punctuation. The guess for each word is whichever word appears most frequently across the iterations. In the case of a tie, we give partial credit inversely proportional to the number of guesses that tied (provided that the correct guess is among them).
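As a sketch of this scoring scheme (not the actual TurKit code), the per-passage accuracy might be computed like this:

    from collections import Counter

    def score_passage(guesses_per_word, answer_words):
        """guesses_per_word[i] is the list of guesses (one per iteration) for word i;
        answer_words[i] is the correct word. Returns the fraction of words credited."""
        credit = 0.0
        for guesses, answer in zip(guesses_per_word, answer_words):
            counts = Counter(g.lower() for g in guesses if g)
            if not counts:
                continue
            best = max(counts.values())
            tied = [w for w, c in counts.items() if c == best]
            if answer.lower() in tied:
                credit += 1.0 / len(tied)  # partial credit, split among the tied guesses
        return credit / len(answer_words)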

Note that in the parallel case, it doesn’t really make sense to talk about the first n iterations, since all the iterations happen in parallel — to account for this, we average over up to 100 possible choices for the first n iterations (we could average over all possible choices, but this is a little slow in JavaScript).
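The parallel-condition average for the first n iterations could then be approximated by random sampling, reusing score_passage from the sketch above (this is an approximation of the averaging described in the paragraph above, not the original code):

    import random

    def average_parallel_accuracy(guesses_per_word, answer_words, n, samples=100):
        """Average accuracy over `samples` random choices of n of the parallel guesses."""
        num_iterations = len(guesses_per_word[0])
        total = 0.0
        for _ in range(samples):
            chosen = random.sample(range(num_iterations), n)
            subset = [[guesses[i] for i in chosen] for guesses in guesses_per_word]
            total += score_passage(subset, answer_words)
        return total / samples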

Raw Results (NOTE: The “combined” line in each result uses the most frequent guess from each iteration. In the case of a tie, it chooses randomly from among the most frequent guesses.)

Comments — 18% of workers left a comment (17 out of the 94 different contributors). Some left more than one comment. Almost all the comments bemoan the unreadability of the text.

Discussion:

As expected, the accuracy increases with each iteration. It grows quickly at first, and levels off at about 65%. In this run of the experiment, the iterative process does a little better than the parallel process, but not statistically significantly better.

It is worth noting that the iterative process appears to get stuck sometimes due to poor guesses early in the process. For instance, one iterative process ended up with 30% accuracy after sixteen iterations. The final result was very similar to the eighth iteration, where most of the words had guesses, and they made a kind of sense:

8th iteration: “Please do ask * anything you need *me. Everything is going fine, there * * , show me then * *anything you desire.”

16th iteration: “Please do ask *about anything you need *me. Everything is going fine, there *were * , show me then *bring * anything you desire.”

(Incorrect guesses have been crossed out.)

Here is the actual passage: “Please do not touch anything in this house. Everything is very old, and very expensive, and you will probably break anything you touch.”

Note that a single turker deciphered this entire passage correctly in the parallel process, suggesting that progress was hampered by poor guesses rather than by unreadable text. This sentiment is echoed by a comment from one of the turkers: “It’s distracting with the words filled in–it’s hard to see anything else but the words already there,so I don’t feel I can really give a good “guess”.”

It may be a good idea to try a hybrid approach with several iterative processes done in parallel, in case one of them gets stuck in a local maximum.

Image Description Revisited

This experiment revisits a blog post about writing image descriptions iteratively and in parallel. That experiment did not see any statistically significant difference between the iterative and parallel process. This experiment does.

This experiment is larger, using 30 images instead of 10 (all taken from publicdomainpictures.net). We also made a number of changes to the process. First, we tried to make the instructions for HITs in each process as similar as possible. For instance, the title for HITs in the previous iterative process said “Improve Image Description” and the title for HITs in the previous parallel process said “Describe Image”. In this experiment, the title in both cases is “Describe Image Factually”. This title points to another change — we added the instruction “Please describe the image factually”. This was intended to discourage turkers from thinking they needed to advertise these images, and to make the descriptive styles more consistent. Here is an example HIT:

Example HIT for the iterative process.

This image shows a HIT in the iterative process. It contains the instruction “You may use the provided text as a starting point, or delete it and start over.” This instruction deliberately avoids suggesting that the turker only needs to improve the existing description. The idea is that we wanted each process to be as similar as possible, so it didn’t seem fair for turkers in one condition to think they only needed to make a small improvement, while turkers in the other condition think they need to write an entire final draft. Note that the very presence of text in the box may alert turkers to the possibility that other turkers will see their work and be asked to write a description using it as a starting point, but we did not explicitly test this hypothesis.

This instruction is omitted for the parallel HITs. It is the only difference between the two conditions, except of course that all of the parallel HITs start with a blank textarea, whereas all iterative HITs but the first show prior work.

This experiment also uses the 1-10 rating scale introduced in the updated blog post about brainstorming company names.

Finally, in order to compare the output from each process, we wanted some way of selecting a description in the parallel process to be the output. We do this by voting between descriptions, and keeping the best one in exactly the same way as the iterative process. (Note: one difference is that the iterative process highlights differences between the descriptions, whereas the parallel process does not. Since the descriptions in the parallel process are not based on each other, they are likely to be completely different, making the highlighting a distraction.)

Results:

Raw Results

Average ratings for descriptions generated in each iteration.

This graph shows the average rating of descriptions generated in each iteration of the iterative processes (blue), along with the average rating of all descriptions generated in the parallel processes (red). Error bars show standard error.

Discussion:

The final description from each iterative process averaged a rating of 7.86, which is statistically significantly greater than the 7.4 average rating for the final description in each parallel process (paired t-test, t(29) = 2.12, p = 0.043). We can also see from the graph that ratings appear to improve with each iteration.
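For reference, this is a standard paired t-test over the 30 image-wise rating pairs. Here is a minimal Python sketch with placeholder ratings standing in for our actual data (we did not publish the per-image arrays here):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Placeholder values: one rating per image (30 images) for the final iterative
    # description and the final parallel description of that image.
    final_iterative_ratings = rng.normal(7.86, 1.0, size=30)
    final_parallel_ratings = rng.normal(7.40, 1.0, size=30)

    t, p = stats.ttest_rel(final_iterative_ratings, final_parallel_ratings)
    print("paired t-test: t(%d) = %.2f, p = %.3f" % (len(final_iterative_ratings) - 1, t, p))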

This suggests there may be a positive correlation between the quality of prior work shown to a turker, and the quality of their resulting description. Of course, a confounding factor here is that turkers who choose to invest very little effort are not likely to harm the average rating as much when they are starting with an already good description, since any small change to it will probably still be good, whereas a very curt description of the whole image is likely to be rated much worse. This factor alone could explain an increase in the average rating of all descriptions in the iterative process, but it would not explain an increase in the average rating of the best description from the iterative process — for that, it must be that some people are writing descriptions that are better than they would have written if they were not shown prior work.

So why are we seeing a difference now when we didn’t before? We changed a number of factors in this experiment, but my guess is that the most important change was altering the instructions in the iterative process. I think the instructions in the old version encouraged turkers not to try as hard, since they merely needed to improve the description, rather than write a final-quality description. In this experiment, all turkers were asked to write a final description, but some were given something to start with.

I think this same idea explains the results seen in the previous blog post about brainstorming company names. In that experiment, turkers in the iterative process were required to generate the same number of names as turkers in the parallel process. In the experiment before that, the instructions suggested that turkers in the iterative process just needed to add names to a growing list, and we saw that they generated fewer names.

In any case, these are only my guesses. More experiments to come.

Brainstorming Company Names Revisited

I’ve been gone for a bit working on a research paper, and then attending conferences, but now it is time to get back to business.

This experiment is an extension of the previous blog post about brainstorming company names. In that post, it seemed like iteration wasn’t making a difference, except to encourage fewer responses. This time, we decided to enforce that each worker contribute the maximum number of responses. We also reduced that number from 10 to 5, since it felt daunting to force people to come up with 10 names. This also reduced the number of names we needed to rate, which is the most expensive part of this experiment.

Finally, we decided to show all the names suggested so far in the iterative condition. Previously, we showed only the best 10 names, but this required rating the names, which seemed bad for a number of reasons. Most notably, it seemed like an awkward blend of using the ratings both as part of the iterative process, and also as the evaluation metric between the iterative and non-iterative (or parallel) conditions.

The new iterative HIT looks like this:

Example iterative HIT

The parallel version doesn’t have the “Names suggested so far:” section.

We also changed the rating scale from 1-5 to 1-10. This was done because 1-10 felt more intuitive, and provided a bit more granularity. It would be nice to run experiments concentrating on rating scales to verify that this was a good choice (anyone?). Here is the new scale:

Rating scale from 1 to 10

We brainstormed names for 6 new fake companies (we had 4 in the previous study). You can read the descriptions for each company in the “Raw Results” link below.

Results:

Raw Results

Average rating of names in each iteration of iterative processes.

This graph shows the average rating of names generated in each iteration of the iterative processes (blue), along with the average rating of all names generated in the parallel processes (red). Error bars show standard error.

Discussion:

Names generated in the iterative processes averaged 6.38 compared with 6.23 in the parallel process. This is not quite significant (two-sample t(357) = 1.56, p = 0.12). However, it does appear that iteration is having an effect. Names generated in the last two iterations of the iterative processes averaged 6.57, which is significantly greater than the parallel process (two-sample t(237) = 2.48, p = 0.014) — at least in the statistical sense; the actual difference is relatively small: 0.34.

There is also the issue of iteration 4. Why is it so low? This appears to be a coincidence—3 of the contributions in this iteration were considerably below average. Two of these contributions were made by the same turker (for different companies). A number of their suggestions appear to have been marked down for being grammatically awkward: “How to Work Computer”, and “Shop Headphone”. The other turker suggested names that could be considered offensive: “the galloping coed” and “stick a fork in me”.

These results suggest that iteration may be good after all, in some cases, if we do it right, maybe. Naturally we will continue to investigate this. We have already done a couple of studies with similar results to this one, suggesting that iteration does have an effect. After posting these studies on the blog (soon), the hope will be to start studying more complicated iterative tasks.

Brainstorming Company Names

We asked turkers to brainstorm company names, both iteratively and non-iteratively. We used an experimental design based on Image Description — Iterative vs Non-Iterative, keeping 6 iterations for each condition in this experiment.

The experiment itself is based on Website Tags and Website Tags — Not Iterative. Instead of generating tags for websites, we are generating names for companies. We also provide separate input fields for people to add names, rather than a single textbox.

We made up a brief description for four fake companies. You can see the description of one fake company in this sample brainstorming HIT:

example brainstorming HIT

The HIT asks turkers to come up with at most 10 company names, and supplies 10 input fields. Turkers in the iterative condition were shown “Example names suggested so far”, which shows company names supplied by previous turkers in the iterative process. This text appeared even if there were no names suggested yet. The non-iterative condition did not have this text.

We also created HITs to evaluate the quality of each name. These evaluations were done on a scale of “1: Poor” to “5: Extremely Good”, similar to the image description experiment. Turkers were not allowed to rate any suggestions for company X if they supplied any suggestions for company X.

We intended to show turkers in the iterative condition the ten best previous suggestions, but due to a bug, we only showed them the suggestions from the turker right before them. The suggestions were still sorted from best to worst.

Note that turkers in the iterative condition were allowed to contribute more than once. This has always been the case, even in previous blog experiments of this sort. This probably shouldn’t be allowed, and our future investigations will likely prohibit it. With that said, it only happened once in these experiments (i.e., one turker, in the iterative process for one fake company, did the brainstorming HIT twice).

Results:

You can see all the names generated for each fake company here, in both the iterative and non-iterative conditions.

Here are the results for the “productivity tools” fake company:

Description: As technology grows, people have less and less time. Time management tools are still poorly understood. We believe our tools will help people take control of their lives again, while being more productive than before.

Iterative condition (each name is followed by its average rating):
  • MasterTime 3.6
  • Time Managers 3.5
  • PowerSource 3.5
  • Time Keepers 3.5
  • Endless Time 3.5
  • NewKnowledge 3.4
  • Time’s On UR Side 3.4
  • Productivity Maximizers 3.4
  • Time Breakers 3.4
  • Manage Me 3.3
  • Now’s The Time 3.3
  • Time Stoppers 3.2
  • Time Maker 3.2
  • DayPlanners 3.2
  • All In A Day 3.2
  • Time Pusher 3.1
  • MakeTheBest 3.1
  • Timetool 3
  • Productive Management 3
  • Time Warp 3
  • Time Control 3
  • My Time Counts 3
  • Life Tools 3
  • Productivity Boost 3
  • KarmaSpice 2.9
  • Minute Worker 2.9
  • BoldNewWorld 2.9
  • WorldKnow 2.9
  • My Brother’s Keeper 2.8
  • Wast’en No Time 2.8
  • AimTime 2.8
  • Recreating Eden 2.6
  • The Daytimers 2.5
  • ToolStrong 2.5
  • Robotime 2.4
  • Produc’en Time 2.4
  • HolyKnow 2.4
  • Yes Time 2.3
  • It’s Manage 2.2
  • Ti-me 2.1
  • Producorobo 2
  • Time 1.9
  • T2 1.9
  • Don’t Count Pennies, Count Minutes 1.8
Non-iterative condition:

  • Working With Time 3.7
  • The Eleventh Hour 3.5
  • Take Back Time 3.5
  • Hour Glass 3.5
  • Time For Everything 3.4
  • It’s About Time 3.4
  • smarttime 3.3
  • Plus Time 3.3
  • Time Warp 3.3
  • More Time 3.3
  • simplyfy 3.2
  • Tick Tock 3.2
  • Manage Time 3.2
  • Time management Trendz 3.2
  • Time Enough 3.2
  • Tools & Time 3.2
  • Time Control 3.1
  • Life Management 3.1
  • Punctual Trix 3.1
  • octopus 3
  • Extend 3
  • Time Manager 3
  • Stopwatch 2.9
  • timeismoney 2.9
  • Time Friendly 2.9
  • Worth 2.8
  • prioritize 2.8
  • freetime 2.7
  • zurich 2.7
  • Time Shared Systems 2.7
  • Tiempo Domado 2.6
  • Becton-Dickinson Medical Device Company 2.6
  • The Control Company 2.6
  • latice 2.6
  • Time Organizer 2.6
  • Timex 2.6
  • python 2.4
  • the Ticker 2.4
  • idecluster 2.3
  • adrino 2.3
  • Philips Healthcare Medical Device Company 2.3
  • Denso Corporation: Company 2.2
  • sea horse 2.2
  • Daiwa House Industry Co. Ltd.: Company 2.2
  • Danske Bank A/S: Company 2.2
  • hypnosis 2.2
  • Controla 2.2
  • Controlia 2.2
  • nephkin 2.2
  • Dentsu Inc.: Company 2.1
  • jessico 2.1
  • Duke Energy Corporation: Company 2
  • Axel Springer AG – Company 1.9
  • Elekta Medical Device Company 1.9
  • Dsg International Plc: Company 1.8
  • fercy 1.8
  • emilee 1.6

Discussion:

The most striking observation is that the non-iterative condition generates significantly more company names than the iterative condition: 47.8 vs 34.3 (paired t-test p < 0.002). The highest possible number of names for either condition is 60, since there are 6 iterations, and each iteration asks a turker for up to 10 names.

This suggests that turkers will generate fewer names if they are shown some examples of other people’s names. Possible explanations for this include:

  • Seeing other people’s names biases turkers toward those names, and they must think harder to come up with different names.
  • Seeing other people’s names suggests that other people are making progress toward the goal, and so it’s not as important to do a good job on the task, since other people will pick up the slack.
  • Seeing fewer than 10 example names may cue turkers into thinking it’s ok to provide fewer than 10 names (In fact, turkers shown 10 example names provide 7.2 names on average, while turkers shown between 1 and 9 example names provide 3.3 names (p < 0.01)).

The average quality of names generated in each condition seems to be the same. The average is 2.85 for the iterative condition, and 2.82 for the non-iterative condition. A t-test gives a p-value of 0.57, which is not significant.

Conclusion: Showing people other people’s ideas doesn’t seem to increase the quality of suggestions, and seems to encourage fewer suggestions. It would be nice to figure out some way to increase the quality of suggestions. Perhaps if we show people words that are associated with the company — like “love” and “cupid” for the fake online dating site — they can be used as building blocks for names. We could generate these words in a separate brainstorming process.