SIGIR09: A Statistical Comparison of Tag and Query Logs
Mark Carman presented the paper in the title. The authors are interested in studying personalization, but for that they need personalized relevance judgements. Query logs are a great source of that information but aren’t available due to privacy concerns. So they started looking at whether tag data (which is public) could be used as a substitute for query logs. They tried doing that in previous work; now they’re going back and investigating whether what they did was well founded: they want to look at whether tags statistically “look like” queries. They compare the AOL query log and Delicious, each of which associates terms with various web pages.
The first obvious question is about vocabulary overlap between query terms and tags for a page. They found it averaged above 50%. They went on to compare the term distributions and found them much more similar than a random model would predict. This similarity doesn’t seem to be limited to specific topics (as categorized by DMOZ). The next natural question is whether the tags and query terms might both be represented as samples from the same multinomial distribution. They could not verify that hypothesis, so despite the overlap, it does seem something different is going on in the two cases. On further analysis they found one significant difference: query terms tend to be similar to the content of the web page, while tags tend to differ from the content. This would seem to make sense: results are only retrieved if the query terms match the content, but one of tags’ main benefits is to add terms that don’t appear in the content. Even so, tags and queries are more similar to each other than either is to the content.
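To make the two comparisons above concrete, here is a minimal sketch in Python. It is not the authors' method: I am assuming Jaccard overlap for the vocabulary comparison and Jensen-Shannon divergence for the distribution comparison, and the function names and toy terms are made up for illustration.

```python
import math
from collections import Counter


def vocabulary_overlap(query_terms, tags):
    """Jaccard overlap between the distinct query terms and distinct tags.

    Assumption: the paper's exact overlap measure is not given in this
    post, so Jaccard is used here purely as an illustration.
    """
    q_vocab, t_vocab = set(query_terms), set(tags)
    union = q_vocab | t_vocab
    if not union:
        return 0.0
    return len(q_vocab & t_vocab) / len(union)


def to_distribution(terms):
    """Turn a list of term occurrences into relative frequencies."""
    counts = Counter(terms)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two term distributions.

    0.0 means the distributions are identical; 1.0 means they share no terms.
    """
    divergence = 0.0
    for term in set(p) | set(q):
        p_t = p.get(term, 0.0)
        q_t = q.get(term, 0.0)
        m_t = 0.5 * (p_t + q_t)  # mixture of the two distributions
        if p_t > 0:
            divergence += 0.5 * p_t * math.log2(p_t / m_t)
        if q_t > 0:
            divergence += 0.5 * q_t * math.log2(q_t / m_t)
    return divergence


if __name__ == "__main__":
    # Hypothetical terms for a single web page (not real AOL or Delicious data).
    queries = ["python", "tutorial", "python", "docs", "download"]
    tags = ["python", "programming", "reference", "docs", "python"]

    print("vocabulary overlap:", vocabulary_overlap(queries, tags))
    print("JS divergence:", js_divergence(to_distribution(queries),
                                          to_distribution(tags)))
```

To judge whether a page's divergence is "low", you would compare it against the divergence between the page's queries and tags drawn from unrelated pages, which is roughly the role the random model plays above.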
I wonder if they’ve considered the inverse approach: use query logs to create tags for web pages?
Hello,
Thanks for the SIGIR coverage!
At Stanford, we have been trying to answer the exact same question you posed at the end of your post: whether query logs can be used to create tags for web pages. We have been calling tags that we create using query logs “query tags”, and we’ve seen that they can indeed be useful.
In addition, we are working on different ways to enable web site owners to automatically share query tags from their sites, and we are also looking at other applications of query tags: we are currently working on improving navigation on the web using query tags.
You can find more details in the paper “Tagging with queries: How and Why?” at http://ilpubs.stanford.edu:8090/883/ and you should also check out the project web site http://tags.stanford.edu.
Yannis
There was a great paper at IIiX2008, called tagging for use, which showed that a lot of tags were ‘for use’. When they asked repository managers about their collections, the responses were often along the lines of: well, we have data about x, which can be used for this or that. Their analysis of an intranet tagging system found that the majority of tags were about how a document can be used.
Of course, Cathy Marshall’s paper on tags at JCDL09 (http://www.csdl.tamu.edu/~marshall/JCDL2009-fp146-marshall.pdf), which showed that tags were very different to other metadata, was perhaps controversial in that community, and wasn’t well received by everyone there.
The CHI09 paper on tags from PARC also had a mixed response, but like what you say above, people who had to tag using words from the blogs, rather than their own words, were slower at recognising documents. One hypothesis to draw from these papers is that tagging is all about externalising the way our long-term memory is ‘indexing’ the information. Further, that we can learn about how the majority of people ‘index’ things in their long-term memory through social tagging. I suppose even further, it’s possible that long-term memory stores such information in an action (or ‘for use’) oriented way. Although much of the ‘for use’ paper found that people were tagging in a way that they thought other people would find valuable, rather than in a way they themselves would find valuable.
interesting stuff though.