SIGIR09: Predicting User Interests from Contextual Information
Peter Bailey et al. from Microsoft described a study comparing sources of context for predicting the subsequent interests of a user looking at a particular page. (A note of caution: in the IR community, the term “context” refers to auxiliary text or documents used to augment a query, typically to help disambiguate it and improve precision, rather than to activity or situational context.) Their assessment used data collected from 250K users of Microsoft’s Live toolbar, consisting of timestamped logs of every web site visited by each user. Bailey et al. analyzed web page views by choosing random pages along each user’s history and asking: what other pages would act as the best sources of context to predict the categories of the next page(s) the user visited? Categories for each page were identified by looking up the page in the Open Directory Project and retrieving its most prominent categories.
The sets of other pages consisted of the following conditions:
- “none” – Only the current page the user was viewing
- “interaction” – The 5 last pages visited prior to the page
- “task” – the set of pages resulting from a random walk from a page to the queries that led to that page, and other pages returned by those same queries
- “collection” – pages linking to the current page
- “historic” – long term visit history of the user
- “social” – combination of long term visit histories of others who visit the same url
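To make the setup concrete, here is a minimal sketch of how two of these conditions could be compared on a toy browsing history. It is not the authors' method: the category table, the helper names (`context_pages`, `predict_category`, `agrees`) and the "most frequent category wins" rule are all assumptions made for illustration, and the hard-coded dictionary merely stands in for an Open Directory Project lookup.

```python
from collections import Counter

# Hypothetical stand-in for an Open Directory Project category lookup.
PAGE_CATEGORIES = {
    "python.org": ["Computers"],
    "github.com": ["Computers"],
    "stackoverflow.com": ["Computers"],
    "espn.com": ["Sports"],
    "nytimes.com": ["News"],
}


def context_pages(history, i, condition):
    """Return the pages used as context for the page at position i."""
    if condition == "none":
        # Only the page currently being viewed.
        return [history[i]]
    if condition == "interaction":
        # The (up to) 5 pages visited just before the current page.
        return history[max(0, i - 5):i]
    raise ValueError(f"condition not covered by this sketch: {condition}")


def predict_category(pages):
    """Assumed rule: predict the most frequent category among the context pages."""
    counts = Counter(cat for page in pages for cat in PAGE_CATEGORIES[page])
    return counts.most_common(1)[0][0]


def agrees(history, i, condition):
    """True if the predicted category matches a category of the next page."""
    predicted = predict_category(context_pages(history, i, condition))
    return predicted in PAGE_CATEGORIES[history[i + 1]]


history = [
    "python.org", "github.com", "espn.com",
    "stackoverflow.com", "nytimes.com", "github.com",
]
current = 4  # the user is on nytimes.com; the next page is github.com

for condition in ("none", "interaction"):
    print(condition, agrees(history, current, condition))
```

In this toy history the current page alone points to the wrong category, while the preceding pages point to the right one. A real comparison would average this agreement over many sampled pages and users, and would need the query, link and social data that the other conditions rely on.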
Their results revealed that the “best” source of context depended on whether one wanted to predict the next immediate pages or the set of pages visited over the long term. Not surprisingly, the last five pages the user visited (“interaction” above) were found to most often predict the category of the next immediate page(s) visited. “Task” contexts performed second best at these immediate predictions. The long-term history (“historic” above) was the best predictor of the pages the user went on to visit over the long term.
These results back up our intuitive notion that people tend to have a mixture of short-term (task-oriented) interests and long-term (low-frequency) interests, the latter perhaps reflecting the user’s personal interests.
What I found somewhat misleading about the study was its title. I was initially drawn to the paper because I expected it to address anticipating user information needs via implicit information retrieval based on activity/situational context. The study did not examine user interests at all; it merely looked at the categories assigned to pages visited after a particular page view. Moreover, instead of examining the predictive power of particular document collections, the authors merely assessed the overlap in these categories. I had expected the paper to examine the performance of various types of classifiers trained on subsets of page visits to predict subsequent pages. Such an analysis would give greater insight into the predictive power of various information sources for future browsing than measuring simple category agreement.
Nonetheless, this is interesting, simple and important work, as it provides evidence about sources of information that might be used in the future to anticipate users’ information needs. This study also highlights the current situation in web-scale research: that only companies like Microsoft, Google or Yahoo! have access to the sheer volume of data needed to do such an analysis. Of the three companies, Microsoft/MSR has been the most forthcoming with papers and involvement in conferences like SIGIR, WWW and SIGCHI, particularly over the last two years, with fascinating papers like the Large Scale Revisitation Patterns analysis by Adar, Teevan and Dumais et al. that won best paper at CHI ’07.
In the future, if academic research institutions like universities or the Web Science Research Initiative (WSRI) had access to such detailed usage data, there would likely be a significantly greater wealth of interesting insights about human behavior surrounding information access and retrieval. The UMass Lemur Project and our own projects list.it and eyebrowse (forthcoming) hope to give academic institutions greater access to real-life search, web-browsing and note-taking behaviors through open-source tools that facilitate the capture of user behavior, and through real user-donated corpora.
[...] task, collection, social, historic) for different temporal durations. Max Van Kleek wrote a nice summary of the talk at the Haystack blog. The paper doesn’t investigate seasonality (perhaps because [...]
“This study also highlights the current situation in web-scale research: that only companies like Microsoft, Google or Yahoo! have access to the sheer volume of data needed to do such an analysis.”
This has been brought up numerous times in the past and I couldn’t agree more. What is really important is for society (i.e. internet users) to realise that they are the creators of the content and that they should demand access to the aggregated datasets. Or, in other words (since most people wouldn’t know what to do with the datasets), some kind of compensation.
This applies less to search engines, since they provide a free service, and more to other collaborative web 2.0 apps. However, search engines make money on advertising, and that’s the reason they are free to use, not because users are being compensated for the datasets that these search companies decide to store and aggregate.
My whole point of criticism is that search engines shouldn’t keep the raw data hidden behind proprietary domains but should open it up to the worldwide research community.
On this topic, you might want to look at and contribute to our eyebrowse project, which aims to build the kind of publicly accessible corpus you want. Max’s blog post explains more.