<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Haystack Blog &#187; Publication</title>
	<atom:link href="http://groups.csail.mit.edu/haystack/blog/category/publication/feed/" rel="self" type="application/rss+xml" />
	<link>http://groups.csail.mit.edu/haystack/blog</link>
	<description>MIT CSAIL Research</description>
	<lastBuildDate>Tue, 24 Nov 2009 04:05:39 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>ISWC Afterthoughts</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/11/11/iswc-afterthoughts/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/11/11/iswc-afterthoughts/#comments</comments>
		<pubDate>Wed, 11 Nov 2009 21:49:27 +0000</pubDate>
		<dc:creator>David Karger</dc:creator>
				<category><![CDATA[ISWC]]></category>
		<category><![CDATA[Publication]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Thought Piece]]></category>
		<category><![CDATA[CSAIL]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=698</guid>
		<description><![CDATA[After recovery from chairing ISWC 2009, I had always intended to blog about some of the changes my co-chair Avi Bernstein and I tried for the conference.   I was prompted to do it today by a very interesting post by James Landay&#8212;it describes problems with the current CHI/UIST reviewing process, and led to some fascinating [...]]]></description>
			<content:encoded><![CDATA[<p>After recovery from chairing ISWC 2009, I had always intended to blog about some of the changes my co-chair Avi Bernstein and I tried for the conference.   I was prompted to do it today by a very interesting <a href="http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html">post by James Landay</a>&#8212;it describes problems with the current CHI/UIST reviewing process, and led to some fascinating discussion comments by a bunch of the biggest names in CHI.  Well worth a read.  I&#8217;m not going to address it directly here, but a bunch of what we did with ISWC was motivated by related thoughts.  I&#8217;m writing about this because I think these were mainly good ideas, and recommend them to anyone else chairing a conference.</p>
<p>One set of changes we made was aimed at squeezing the gap between submission and conference.  It seems obvious to me that conferences should inform us of new work.   This doesn&#8217;t happen if the conference presents work that was submitted 7 months earlier (cf. <a href="http://www.chi2010.org/">CHI</a>)!  We made a few changes to tighten things up:</p>
<ol>
<li>We compressed the initial reviewing period to two weeks.  While many conferences give reviewers several months to examine their papers, I&#8217;m certain that few people look at them more than a week ahead of the deadline.  Certainly most reviews don&#8217;t arrive until day-of!  We got exactly one reviewer complaint about this rushed schedule.   Perhaps we&#8217;ll lose some unusually disciplined and organized reviewers who like to plan ahead; I&#8217;m still convinced this is a big improvement for the community.  For the same reason, we offered the program committee only a week to digest and summarize these reviews before the committee meeting.</li>
<li>Similarly, we gave authors only a week to revise their papers for publication after acceptance.   Again, we assumed most revision happens at the last minute, so we jumped straight to the last minute.  I&#8217;d say that a paper requiring more than a week of revision probably wasn&#8217;t ready to submit anyway.</li>
<li>We advocated moving to a model, like NIPS, of producing the printed proceedings after the conference.   Given the slow printing process, distributing printed proceedings at the conference requires that final versions be submitted long before the conference.  I haven&#8217;t opened a paper proceedings for years, but enough people are still attached to them that I don&#8217;t think the time is ripe to do away with them entirely.  But distributing them post-conference lets us keep them without keeping the delay they introduce.    In the end, we weren&#8217;t able to convince the steering committee to make this change; however, our push led the publisher to offer an unusually swift timeline for publication.</li>
</ol>
<p>At the end, we squeezed out enough delay that the critical path became something outside our control: travel to the US.  For cheap fares, authors need to buy their airline tickets a month before the conference.  On top of this we had to add a month for non-US authors to get visas for travel to the US (sadly, one of next year&#8217;s co-chairs could not get a visa in time, and missed the conference).  Combining these requirements with our one-month review process produced our final, 3-month lag between submission and conference.</p>
<p>Looking ahead, I&#8217;d advocate for ISWC to eliminate the paper proceedings and to select countries with more sensible visa frameworks&#8212;Canada anyone?  This ought to let us cut the submission lag by another factor of 2.</p>
<p>We also made changes to the reviewing process.</p>
<ol>
<li>We eliminated topic tracks, instead allowing authors to choose their preferred program committee member for review of their paper (as well as an alternate).  My experience has been that most track descriptions tend to be ambiguous gobbledygook that produces tremendous ambiguity about the &#8220;right&#8221; track to submit to&#8212;especially if, as often seems to be the case for me, your work seems to span multiple tracks.  Given that decisions are made by the people on the program committee rather than abstract tracks, an &#8220;end-to-end&#8221; argument says picking the person makes more sense.  Authors understand their papers well and can study committee members to identify the one most likely to favorably receive their paper.   This is analogous to the submission process for journals where one chooses an editor.   There was some worry about load balance&#8212;what if everyone picks the &#8220;nice&#8221; reviewer?   But in practice we were able to assign every paper to one of the two chosen committee members.</li>
<li>We eliminated the paper &#8220;bidding&#8221; process.  In past ISWCs, all potential reviewers were able to survey all submissions and bid for those they wanted to review.   We felt that it made more sense for our knowledgeable program committee members to select reviewers who they thought were best suited to each paper&#8212;aiming for a &#8220;best for the paper&#8221;, rather than &#8220;fun for the reviewer&#8221; assignment of papers.  We feel this worked well, guaranteeing good reviews of each paper.  Interestingly, this was the one change on which we got negative pushback at the ISWC town hall meeting&#8212;even though it was populated by the beneficiaries of our reviewing process, the majority advocated a return to reviewer bidding.  I still think our approach is superior.</li>
<li>By a second end-to-end argument, we eliminated generic 1-5 scoring of papers, which ultimately must be translated into an accept/reject, and replaced it with actionable descriptions: we asked each reviewer to take a position for or against acceptance of each paper, or to &#8220;give up&#8221; and indicate they didn&#8217;t care.</li>
<li>Perhaps most important, we sought <strong>controversial papers</strong>.  The obvious and common measure of  paper quality is the average of the reviewer scores.  However, this buries one of the most important potential roles of a conference&#8212;to encourage debate about research.  Rather than a paper that everyone considers OK, a paper that half the reviewers love and half hate seems much more important to present at a conference, since it indicates a fundamental disagreement about research that is well worth airing in public, exploring, and trying to resolve.   To seek out such papers we applied two policies.  The first was that each paper with at least one vote for acceptance was discussed at the PC meeting.  The second was that any paper that, after discussion, had at least one program committee member advocate was accepted to the conference, regardless of the amount of opposition.  Because &#8220;average scoring&#8221; is so deeply embedded in reviewing practice, this change required changes to our conference management software&#8212;which were cheerfully carried out by the folks at <a href="http://precisionconference.com/">precision conference</a>, along with all the other great customer support they offered.  We never had to wait more than a few hours for an answer to a support request.  I highly recommend them for other conferences.</li>
<li>We introduced conditional acceptances.  We got some pushback that this is a big burden for authors, but I don&#8217;t buy that&#8212;obviously, the authors could choose to treat this as a rejection and skip the hassle, but in fact many of our conditional accepts were worked on and turned into accepts&#8212;including one of the four papers we ultimately highlighted as a best paper.  It is certainly a big burden for the program committee, and I&#8217;d like to thank our members for being willing to tackle this extra responsibility.</li>
</ol>
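<p>The two policies for controversial papers in item 4 above can be sketched in a few lines of Python. This is only an illustration: the vote labels, the function names, and the 3.0 averaging threshold are assumptions made for the example, not the behavior of the actual conference management software.</p>
<pre>
```python
from statistics import mean

# Hypothetical vote labels; the post only says that reviewers argued for or
# against acceptance, or "gave up".
ACCEPT, REJECT, ABSTAIN = "accept", "reject", "abstain"


def needs_discussion(votes):
    """Policy 1: any paper with at least one accept vote is discussed."""
    return ACCEPT in votes


def accepted_after_discussion(pc_advocates):
    """Policy 2: one remaining program committee advocate is enough."""
    return pc_advocates >= 1


def accepted_by_average(scores, threshold=3.0):
    """The conventional rule being replaced: threshold on the mean score.

    The threshold value is an assumption chosen for this example.
    """
    return mean(scores) >= threshold


# A split paper: one reviewer loves it, two are against.
split_votes = [ACCEPT, REJECT, REJECT]
split_scores = [5, 1, 2]  # hypothetical 1-5 scores for the same paper
```
</pre>
<p>Under these assumptions the split paper is rejected by averaging, yet it reaches the committee meeting under the first policy and is accepted under the second as long as one advocate remains after discussion.</p>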
<p>Finally, we made some changes to the conference itself.</p>
<ol>
<li>We scheduled our parallel tracks based on surveys of attendees about which talks they wished to attend. It was trivial to gather this information using a Google spreadsheet.   Using it, we created a schedule with almost no conflicts among the papers people stated plans to attend.</li>
<li>We introduced a category of &#8220;general interest&#8221; papers.  These were specifically not intended to be &#8220;best&#8221; papers; rather, they were papers we felt ought to be attended to by anyone interested in achieving a broad sense of the work going on around the Semantic Web.   The Semantic Web community is made up of subcommunities that risk becoming quite isolated from one another&#8212;the description logic community, the Semantic Web user interface community, the scalability community, the standards community, and so on.   To combat isolation we need to share knowledge, so we identified papers that we felt were representative of each subcommunity but also broad enough that members of the other subcommunities could understand them.</li>
<li>We introduced a town-hall meeting so the community could give us feedback about all the innovations I&#8217;ve just described.  The town hall was reasonably well attended even without any food or beer incentive.  With the exception of reviewer bidding, all of our changes were favored by a majority of the town hall attendees.</li>
</ol>
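<p>The survey-based scheduling in item 1 above can be illustrated with a small greedy heuristic in Python. The post does not say how the real schedule was built, so everything here is an assumption: the function names, the toy survey data, and the greedy strategy itself, which gives no optimality guarantee.</p>
<pre>
```python
from collections import Counter
from itertools import combinations


def pair_weights(wishlists):
    """Count, for each pair of talks, how many attendees want both."""
    weights = Counter()
    for wanted in wishlists:
        for a, b in combinations(sorted(set(wanted)), 2):
            weights[(a, b)] += 1
    return weights


def conflicts(schedule, weights):
    """Total conflicts: wanted pairs that share a time slot."""
    total = 0
    for slot in schedule:
        for a, b in combinations(sorted(slot), 2):
            total += weights[(a, b)]
    return total


def greedy_schedule(talks, wishlists, num_slots, tracks):
    """Place each talk, most requested first, where it adds fewest conflicts."""
    if len(talks) > num_slots * tracks:
        raise ValueError("not enough room for all talks")
    weights = pair_weights(wishlists)
    demand = Counter(t for wanted in wishlists for t in set(wanted))
    schedule = [[] for _ in range(num_slots)]
    for talk in sorted(talks, key=lambda t: (-demand[t], t)):
        open_slots = [s for s in schedule if len(s) < tracks]
        best = min(
            open_slots,
            key=lambda slot: sum(
                weights[tuple(sorted((talk, other)))] for other in slot
            ),
        )
        best.append(talk)
    return schedule


# Toy survey: each set lists the talks one attendee plans to see.
wishlists = [{"A", "B"}, {"A", "B"}, {"C", "D"}, {"A", "C"}]
schedule = greedy_schedule(["A", "B", "C", "D"], wishlists, num_slots=2, tracks=2)
remaining = conflicts(schedule, pair_weights(wishlists))
```
</pre>
<p>For this toy input the heuristic keeps every requested pair in different time slots, so <code>remaining</code> is zero. Real survey data would rarely be this tidy.</p>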
<p>There were three changes we completely failed to pull off:</p>
<ol>
<li>ISWC offers &#8220;in use&#8221; and &#8220;industry&#8221; tracks with their own submissions, committees, and schedules independent of the research track.  This creates ambiguity for authors about where they should submit and ambiguity about which track should accept, and also leads to tracks that are not coordinated at the conference.  I would like to see a single committee considering all this work as a whole.</li>
<li>We had hoped to install Semantic Web tools for managing information at the conference&#8212;letting people annotate the papers/talks they saw, provide information about restaurants they&#8217;ve visited, coordinate meetings with others at the conference.   We failed&#8212;creating a system for this still requires too much special purpose engineering.    I see that as a (to-date) failure of the Semantic Web community to provide tools that are easy to install and use.</li>
<li>Both Avi and I were eager to nurture the study of <a href="http://swui.semanticweb.org/">user interaction with the Semantic Web</a>.  We both consider this a topic that is not being sufficiently explored&#8212;without good user interfaces, we don&#8217;t think the promise of the Semantic Web can be fulfilled. Unfortunately, our plan was frustrated by a lack of submissions in this area&#8212;it seems the work simply isn&#8217;t being done.   I wonder how we can convince the community of the importance of pursuing it.</li>
</ol>
<p>In all, chairing the conference was surprisingly fun and easy.  For years, I&#8217;ve declined chairing requests, on the theory that with my organizational capabilities any conference I chaired would not actually take place.   For ISWC, my worries were assuaged by a fantastic and well organized (Swiss) co-chair who made sure everything actually happened on time.  So, a big thank you to Avi.</p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/11/11/iswc-afterthoughts/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Paper Awards at ISWC</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/11/08/paper-awards-at-iswc/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/11/08/paper-awards-at-iswc/#comments</comments>
		<pubDate>Sun, 08 Nov 2009 14:06:32 +0000</pubDate>
		<dc:creator>David Karger</dc:creator>
				<category><![CDATA[ISWC]]></category>
		<category><![CDATA[Publication]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[CSAIL]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=692</guid>
		<description><![CDATA[Having gotten ISWC&#8217;s Ontology Panel off my chest, I want to take the time to discuss the Best Paper Awards we gave at the conference.  The papers that got these awards received uniformly high ratings from their reviewers, were recommended for awards by the program committee, impressed the program chairs, and had good presentations at [...]]]></description>
			<content:encoded><![CDATA[<p>Having gotten ISWC&#8217;s Ontology Panel <a href="http://groups.csail.mit.edu/haystack/blog/2009/11/03/does-the-semantic-web-need-ontologies/">off my chest</a>, I want to take the time to discuss the Best Paper Awards we gave at the conference.  The papers that got these awards received uniformly high ratings from their reviewers, were recommended for awards by the program committee, impressed the program chairs, and had good presentations at the conference.</p>
<p>Best paper went to Ugur Kuter and Jennifer Golbeck for &#8220;Semantic Web Service Composition in Social Environments&#8221;.   This paper reflected a nice linking of two disparate fields.  The first is semantic web service composition.   This area looks ahead to when users will create workflows by composing&#8212;chaining together&#8212;a series of &#8220;web services&#8221; (processes on the web with exposed APIs that can be invoked from elsewhere on the web), passing data from one to the next in order to perform some computation.  Generally, web service composition is looked at as a logic problem&#8212;figuring out which web services can be composed to meet a particular specification.  Kuter and Golbeck instead look at a trust problem.   Many of these services on the web might not be completely trustworthy&#8212;perhaps they are fault-prone or even malicious.   The paper framed the problem of choosing, among many compositions, the most trustworthy one.  The algorithms in the paper are heuristic and surely not the last word, but the real value of the paper was in framing the problem.</p>
<p>The best student paper, by Vicky Papavassiliou, Giorgos Flouris, Irini Fundulaki, Dimitris Kotzinos, and Vassilis Christophides, &#8220;On Detecting High-Level Changes in RDF/S KBs&#8221;,  was also a problem-framing paper.   Ontologies are obviously an important part of the Semantic Web (my <a href="http://groups.csail.mit.edu/haystack/blog/2009/11/03/does-the-semantic-web-need-ontologies/">last post</a> notwithstanding) and are already heavily used to characterize data in a variety of domains (medical being a significant one).    Over time, these ontologies will evolve.   People will need to make sense of the changes to these ontologies.  It&#8217;s easy to describe &#8220;atomic&#8221; changes (this element was added, this removed) but these atomic changes are likely to occur in groups to achieve some higher level change.  It is these higher level changes that will serve as the most meaningful description to an end user.   This paper poses the two related questions of &#8220;how should these higher level changes be described?&#8221; and &#8220;how can they be detected from the raw description of atomic changes to the ontology?&#8221; Again, there is likely to be follow-on work, but this paper did a nice job initiating study of the problem.</p>
<p>We also awarded an honorable mention in each category.  For best paper, honorable mention went to Daniela Petrelli, Suvodeep Mazumdar, Aba-Sah Dadzie, and Fabio Ciravegna for &#8220;Multi Visualization and Dynamic Query for Effective Exploration of Semantic Data.&#8221; Working at Rolls-Royce, they mapped a large amount of information (about jet engine design) into a Semantic Web framework and deployed a new user interface for visualizing and navigating that information.  They then studied its use and usefulness in the company and reported conclusions.   The paper was a real user study, something of which we see far too little in the semantic web community.  It was particularly nice to give this award after hearing a researcher declare the previous day that &#8220;there is no scientific method for evaluating semantic web user interfaces.&#8221;  The community needs more papers of this sort.</p>
<p>An honorable mention went to Jacopo Urbani, Spyros Kotoulas, Eyal Oren, and Frank van Harmelen for &#8220;Scalable Distributed Reasoning using MapReduce.&#8221;  This was a performance paper, showing how to do typical semantic-web logical inference tasks on a large parallel cluster.  As is usual, the main problem is data communication bottlenecks between the processors, and this paper showed how a very limited amount of replication could dramatically reduce those communication bottlenecks.</p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/11/08/paper-awards-at-iswc/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Does the Semantic Web Need Ontologies?</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/11/03/does-the-semantic-web-need-ontologies/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/11/03/does-the-semantic-web-need-ontologies/#comments</comments>
		<pubDate>Tue, 03 Nov 2009 06:42:34 +0000</pubDate>
		<dc:creator>David Karger</dc:creator>
				<category><![CDATA[ISWC]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Thought Piece]]></category>
		<category><![CDATA[CSAIL]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=669</guid>
		<description><![CDATA[Ever since returning from the 2009 International Semantic Web Conference last week I&#8217;ve been bursting to discuss a panel that took place there on the topic &#8220;Does the Semantic Web need Ontologies?&#8221;.    But the WWW2010 deadline was today and we had 3 papers to write.  With that deadline now 10 minutes past, I can [...]]]></description>
			<content:encoded><![CDATA[<p>Ever since returning from the <a href="http://iswc2009.semanticweb.org/">2009 International Semantic Web Conference</a> last week I&#8217;ve been bursting to discuss a panel that took place there on the topic &#8220;Does the Semantic Web need Ontologies?&#8221;.    But the WWW2010 deadline was today and we had 3 papers to write.  With that deadline now 10 minutes past, I can finally post!  When it was first proposed, I was concerned because panels need controversy to be fun, and I didn&#8217;t think there&#8217;d be debate on this topic.  However, the organizer was confident that he&#8217;d be able to arrange different viewpoints on the panel.</p>
<p>When I attended the panel I was sorry to discover that the panelists did in fact all agree.  Far worse, they all said &#8220;yes&#8221; and wanted to debate what <em>kind</em> of ontologies were needed. Those who&#8217;ve followed my slow <a href="http://groups.csail.mit.edu/haystack/blog/2009/09/14/in-defense-of-a-semantic-web-wild-west/">conversation with Stefano Mazzocchi</a> won&#8217;t be surprised at my reaction&#8212;a jump to the audience microphone to voice a strong &#8220;no!&#8221;   I asserted that a bunch of data presented in spreadsheets was already a big step forward over our current unstructured web.  This led to some interesting discussion that helped me clarify some points in my mind that I&#8217;ll try to lay out here.</p>
<p>The panelists&#8217; general reaction was amazement that I could be opposed to ontologies.  Without ontologies, how could any tool actually use the data?  What good would that data be without an explanation of what it meant?</p>
<p>Tim Berners-Lee tried to mediate by suggesting that I did support ontologies.  After all, a spreadsheet has an ontology: the ontology specifies rows, columns, cells, and the relationship between them. But by this definition, any structured data necessarily has an (implicit) ontology, and saying &#8220;ontology&#8221; is just another way of saying &#8220;structured data&#8221;.   And I think this diverges from the standard meaning of &#8220;ontology&#8221; in the Semantic Web community, which I would read as &#8220;an explicitly recorded, machine readable description of the ontology of the given data.&#8221;   While I am a big proponent of structured data, I&#8217;m going to bet that the panelists would not consider their implicit ontologies to be ontologies in the Semantic Web sense.   So we do in fact disagree.</p>
<p>Why then do I think we don&#8217;t need (explicit) ontologies?   Because I&#8217;m focused on the ways that human beings, rather than machine agents, will consume the data being shared.  And for humans, a machine-readable explanation of the data&#8217;s meaning is often unnecessary because the human who is consuming that data can figure it out in other ways.  For example, the meaning of the data elements might be explained in English, a &#8220;caption&#8221; of the data I am inspecting.     Even without captions, if I get a data table with column headings, I can use my comprehension of English to understand the meaning of those headings and from it infer the roles of the columns.  Even if there aren&#8217;t column headings, the &#8220;shape&#8221; of the data can tell me a lot&#8212;I&#8217;ll recognize standard person names, phone numbers, addresses, prices, book titles, and such from the textual patterns or from matches to my large wetware database of known entities.  And if I see enough examples I can draw conclusions about the values in the column (indeed, <a href="http://www.google.com/squared">Google Squared</a> suggests that you might not even need a human in the loop to make these inferences).</p>
<p>So humans can understand data without (explicit) ontologies, but is it any use?  Sure!  Just to plug some of my own group&#8217;s tools, they can use <a href="http://www.simile-widgets.org/exhibit/">Exhibit</a> to throw it into a rich visualization&#8212;a map, timeline, or list with faceted browsing and sorting.   Or they can combine it with another data set using <a href="http://simile.mit.edu/potluck/">Potluck</a>, and throw the combined data into an Exhibit visualization.  I can make a post on <a href="http://manyeyes.alphaworks.ibm.com/manyeyes/">ManyEyes</a> or throw the data into <a href="http://www.dabbledb.com/">DabbleDB</a> for further processing.  These activities typically require me to match certain properties (columns) of the data set into roles in the UI (Exhibit, ManyEyes) or to properties in the other data set (Potluck, DabbleDB)&#8212;a straightforward task.  They don&#8217;t require the machine to understand the data, because I&#8217;m the one taking these actions.  They do require that the data be structured, since otherwise there&#8217;s no way for me to say &#8220;which column&#8221; to the tools I&#8217;m trying to use.</p>
<p>That&#8217;s the argument I wanted to make at the panel, but it&#8217;s a bit hard to squeeze into 20 seconds at the audience-feedback microphone.  So I&#8217;m afraid the panelists instead thought that I was arguing against ontologies, asserting that they should not be deployed at all.</p>
<p>On the contrary, I like ontologies.  But I&#8217;m convinced that ontologies are a luxury, not a necessity. They&#8217;re certainly nice to have, and there are some things you can only do if you have them&#8212;for example, they can help me understand column headings written in Russian or Spanish by connecting them to explanations in English.  But I remain captivated by all the opportunities that arise just by making data easily accessible in raw form.   Too often, what people want to do with information is perfectly easy to explain, but impossible to do without serious programming, for silly reasons.</p>
<p>And it&#8217;s that enthusiasm for open data that keeps me energetically arguing that we don&#8217;t need ontologies.  If we need ontologies, then work on freeing data needs to stop until we get them.  I think that&#8217;s a very dangerous perspective.  It&#8217;s the one that says &#8220;there&#8217;s no point to building tools for scientists to publish their data, until we&#8217;ve figured out the right huge ontology that we&#8217;ll force them all to publish in.&#8221;</p>
<p>Instead, I think we should go right ahead with our research on ontologies and tools for them, but in the meantime, let the data fly!</p>
<p>P.S. When someone rose to support me, arguing that we should forget ontologies and concentrate on Linked Open Data, I muddied things further by asserting that we don&#8217;t really need the &#8220;Linked&#8221; part, and Open Data is useful in its own right.  While it comes from the same place as my perspective on ontologies above, that&#8217;s the substance of my <a href="../../2009/09/14/in-defense-of-a-semantic-web-wild-west/">discussion with Stefano</a>, and I won&#8217;t repeat it here.</p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/11/03/does-the-semantic-web-need-ontologies/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>SIGIR09: Predicting User Interests from Contextual Information</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/07/24/sigir09-predicting-user-interests-from-contextual-information/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/07/24/sigir09-predicting-user-interests-from-contextual-information/#comments</comments>
		<pubDate>Fri, 24 Jul 2009 18:29:58 +0000</pubDate>
		<dc:creator>Max Van Kleek</dc:creator>
				<category><![CDATA[SIGIR]]></category>
		<category><![CDATA[CSAIL]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=375</guid>
		<description><![CDATA[Peter Bailey et al. from Microsoft described a study comparing sources of context in predicting the subsequent interests of a user looking at a particular page.  (A note of caution: the term context in the IR community refers to auxiliary text or documents used to augment a query typically to help disambiguate it and improve [...]]]></description>
			<content:encoded><![CDATA[<p>Peter Bailey et al. from Microsoft described a study comparing sources of context in predicting the subsequent interests of a user looking at a particular page.  (A note of caution: the term <em>context</em> in the IR community refers to auxiliary text or documents used to augment a query, typically to help disambiguate it and improve precision, rather than activity or situational context.) Their assessment applied to data collected from 250K users of Microsoft&#8217;s Live toolbar, consisting of logs of timestamped visits to every web site visited by each user.   Bailey et al. analyzed web page views by choosing random pages along each user&#8217;s history and asking the question: what <em>other pages</em> would act as the best sources of context to predict the categories of the next page(s) the user visited?  Categories for each page were identified by looking up the page under the <a href="http://www.dmoz.org/">Open Directory Project</a> and retrieving the most prominent categories for each.</p>
<p>The sets of other pages consisted of the following conditions:</p>
<ul>
<li>&#8220;none&#8221; &#8211; Only the current page the user was viewing</li>
<li>&#8220;interaction&#8221; &#8211; The 5 last pages visited prior to the page</li>
<li>&#8220;task&#8221; &#8211; the set of pages resulting from a random walk from a page to the queries that led to that page, and other pages returned by those same queries</li>
<li>&#8220;collection&#8221; &#8211; pages linking to the current page</li>
<li>&#8220;historic&#8221; &#8211; long term visit history of the user</li>
<li>&#8220;social&#8221; &#8211; combination of long term visit histories of others who visit the same url</li>
</ul>
<p>Their results revealed that the &#8220;best&#8221; sources of context depended on whether one wanted to predict the next immediate pages, or next set of pages over the long term.  Not surprisingly, the last five pages the user visited (&#8220;interaction&#8221; above) were found to most often predict the category of the next immediate page(s) visited.  &#8220;Task&#8221; contexts performed second best at these immediate predictions. And the long-term history was the best overall predictor of the pages the user visited over the long term.</p>
<p>These results back up our intuitive notion that people tend to have a mixture of short-term (task-oriented) interests and long-term (low-frequency) interests, the latter perhaps reflecting the user&#8217;s personal interests.</p>
<p>What I thought was somewhat misleading about the study was its title &#8212; I was initially drawn to the paper because I expected to see the paper address anticipating user information needs via activity/situational context based implicit information retrieval.  The study did not examine user <em>interests</em> at all &#8212; but merely looked at the <em>categories</em> assigned to pages subsequent to a particular visit.  Moreover, instead of examining the predictive power of particular document collections, these merely assessed the <em>overlap</em> in these categories.  I had expected instead the paper to examine the performance of various types of classifiers trained on subsets of page visits to predict subsequent pages &#8212; such an analysis would lead to greater insight surrounding the predictive power of various information sources towards future browsing than measuring simple category agreement.</p>
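<p>To make the notion of category agreement concrete, here is a minimal Python sketch that scores a context source by the overlap between its categories and those of the pages visited next. The Jaccard measure, the page-to-category table, and the example URLs are all illustrative assumptions; this is not the metric or the data used in the paper.</p>
<pre>
```python
def categories_of(pages, page_categories):
    """Union of the category labels attached to a set of pages."""
    labels = set()
    for page in pages:
        labels.update(page_categories.get(page, set()))
    return labels


def category_overlap(context_pages, future_pages, page_categories):
    """Jaccard overlap between context categories and future categories."""
    context = categories_of(context_pages, page_categories)
    future = categories_of(future_pages, page_categories)
    union = context.union(future)
    if not union:
        return 0.0
    return len(context.intersection(future)) / len(union)


# Hypothetical lookup table standing in for a directory of page categories.
page_categories = {
    "a.example/headlines": {"News"},
    "b.example/scores": {"Sports"},
    "c.example/match-report": {"Sports"},
}

interaction = ["b.example/scores"]     # pages visited just before
historic = ["a.example/headlines"]     # pages from long-term history
future = ["c.example/match-report"]    # page visited next

interaction_score = category_overlap(interaction, future, page_categories)
historic_score = category_overlap(historic, future, page_categories)
```
</pre>
<p>In this toy case the recent page shares its category with the next page and the long-term history does not, which is the kind of agreement being counted; it says nothing about how well a trained classifier would predict the next page.</p>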
<p>Nonetheless, this is interesting, simple and important work, as it provides evidence surrounding sources of information that might be used in the future to anticipate users&#8217; information needs.  This study also highlights the current situation in web-scale research: that only companies like Microsoft, Google or Yahoo! have access to the sheer volume of data needed to do such an analaysis.  Out of the three companies, Microsoft/MSR has been the most forthcoming with papers and involvement in conferences like SIGIR, WWW and SIGCHI, particularly over the last two years, with fascinating papers like the <a title="Large Scale Revisitation Patterns (ACM DL)" href="http://portal.acm.org/citation.cfm?id=1357054.1357241&amp;coll=ACM&amp;dl=ACM&amp;type=series&amp;idx=SERIES260&amp;part=series&amp;WantType=Proceedings&amp;title=CHI&amp;CFID=45230716&amp;CFTOKEN=42499154">Large Scale Revisitation Patterns analysis</a> by Adar, Teevan and Dumais et al. that won best paper at CHI &#8216;07.</p>
<p>In the future, if academic research institutions like universities or the Web Science Research Initiative (WSRI) had access to such detailed usage data, there would likely be a significantly greater wealth of interesting insights about human behavior surrounding information access and retrieval.  The <a href="http://www.lemurproject.org/">UMass Lemur Project</a> and our own projects <a href="http://listit.csail.mit.edu">list.it and eyebrowse (forthcoming)</a> hope to contribute to greater access to real-life search, web-browsing and note-taking behaviors for academic institutions through open-source tools that facilitate the capture of user behavior, and real user-donated corpora.</p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/07/24/sigir09-predicting-user-interests-from-contextual-information/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>SIGIR09: Telling Experts from Spammers: Expertise Ranking in Folksonomies</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/07/22/sigir09-telling-experts-from-spammers-expertise-ranking-in-folksonomies/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/07/22/sigir09-telling-experts-from-spammers-expertise-ranking-in-folksonomies/#comments</comments>
		<pubDate>Wed, 22 Jul 2009 20:58:30 +0000</pubDate>
		<dc:creator>David Karger</dc:creator>
				<category><![CDATA[Collective Intelligence]]></category>
		<category><![CDATA[Publication]]></category>
		<category><![CDATA[SIGIR]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Social Computing]]></category>
		<category><![CDATA[CSAIL]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=386</guid>
		<description><![CDATA[From our friends in Southampton (correction: and Hasso-Plattner), a study of how to differentiate experts (who really know how to tag stuff) from spammers (who want to tag their own stuff, but try to acquire credibility by copying tags others have used).   They try to exploit the difference that the people who tag first are [...]]]></description>
			<content:encoded><![CDATA[<p>From our friends in Southampton (correction: and Hasso-Plattner), a study of how to differentiate experts (who really know how to tag stuff) from spammers (who want to tag their own stuff, but try to acquire credibility by copying tags others have used).   They try to exploit the difference that the people who tag first are obviously not copying.  They compared their classifier to some obvious baselines, such as assigning expertise to those with the most tags.  Evaluating their classifier was tricky because there isn&#8217;t a ground-truth data set.   So they used a simulation, inserting a variety of different simulated experts and spammers into the tag stream of delicious, and checking how their classifier deals with them. Their classifier won.</p>
<p>Of course, you can only draw limited confidence from this kind of simulation.  Their simulated users fit their model of the world (spammers labeled late), so of course a tool designed to their model will do well on their simulated users.  I wonder, would it have been that hard to just do manual labeling of expertise on some real delicious users?  This would obviously give more trustworthy results than simulations.   Indeed, they found by manual examination that the top 50 users of the tag &#8220;mortgage&#8221; were spammers.  However, they say that the problem was finding a good ground truth for experts.   But that suggests it would still be possible to evaluate differentiation of spammers from non-spammers, even if you can&#8217;t evaluate differentiation of experts.</p>
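<p>To make the &#8220;people who tag first are not copying&#8221; intuition concrete, here is a minimal sketch. It is not the authors&#8217; classifier: the function names, the event format and the scoring rule (one point per later user who applies the same tag to the same document) are all assumptions made for illustration. It is set next to the &#8220;most tags&#8221; baseline mentioned above.</p>

```python
from collections import defaultdict


def early_tagger_scores(events):
    """Score users by how many others later applied the same tag to the same document.

    events: iterable of (timestamp, user, document, tag) tuples (hypothetical format).
    """
    by_item = defaultdict(list)
    for timestamp, user, document, tag in events:
        by_item[(document, tag)].append((timestamp, user))

    scores = defaultdict(int)
    for taggers in by_item.values():
        taggers.sort()  # earliest tagger first
        total = len(taggers)
        for position, (_, user) in enumerate(taggers):
            # Credit equals the number of users who used this tag afterwards.
            scores[user] += total - position - 1
    return dict(scores)


def tag_count_scores(events):
    """Baseline: expertise is simply the number of tags a user has applied."""
    counts = defaultdict(int)
    for _, user, _, _ in events:
        counts[user] += 1
    return dict(counts)


# Toy tag stream: the "spammer" tags a lot, but always late or alone.
events = [
    (1, "alice", "doc1", "python"),
    (2, "bob", "doc1", "python"),
    (3, "spammer", "doc1", "python"),
    (4, "spammer", "doc2", "python"),
    (5, "spammer", "doc3", "python"),
]
print(early_tagger_scores(events))
print(tag_count_scores(events))
```

<p>On this toy stream the baseline ranks the prolific late tagger highest, while the early-tagger score ranks that user last. That is also exactly why a simulation built on the same assumption tells you little.</p>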
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/07/22/sigir09-telling-experts-from-spammers-expertise-ranking-in-folksonomies/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>SIGIR09: The Wisdom of the Few: A Collaborative Filtering Approach Based on Expert Opinions from the Web</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/07/22/sigir09-the-wisdom-of-the-few-a-collaborative-filtering-approach-based-on-expert-opinions-from-the-web/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/07/22/sigir09-the-wisdom-of-the-few-a-collaborative-filtering-approach-based-on-expert-opinions-from-the-web/#comments</comments>
		<pubDate>Wed, 22 Jul 2009 18:45:21 +0000</pubDate>
		<dc:creator>David Karger</dc:creator>
				<category><![CDATA[Collective Intelligence]]></category>
		<category><![CDATA[Publication]]></category>
		<category><![CDATA[SIGIR]]></category>
		<category><![CDATA[CSAIL]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=383</guid>
		<description><![CDATA[Xavier  Amatriain of Telefonica research presented work on collaborative filtering.  Usually you do collaborative filtering by finding the other users &#8220;similar&#8221; to your subject and combining their recommendations.  This paper argued/demonstrated that sometimes you are better off figuring out who the experts are and only paying attention to their opinions.  You might just create [...]]]></description>
			<content:encoded><![CDATA[<p>Xavier  Amatriain of Telefonica research presented work on collaborative filtering.  Usually you do collaborative filtering by finding the other users &#8220;similar&#8221; to your subject and combining their recommendations.  This paper argued/demonstrated that sometimes you are better off figuring out who the experts are and only paying attention to their opinions.  You might just create non-personalized recommendations from them, or you might personalize by finding the best <em>experts</em> to recommend for a user.  They experimented by exploring movie recommendation using the Netflix challenge mass ratings versus using the (expert) critics&#8217; recommendations on Rotten Tomatoes.  They found expert recommendations often worked better.</p>
<p>I asked about past work on, e.g., semi-supervised learning, which suggests various approaches to combining small amounts of high-quality data (experts) with large amounts of messier data (mass user ratings).  It suggests, for example, some sort of weighted combination of expert and mass user opinion.  They know this could help a lot, but don&#8217;t have a general approach to separating the experts from everyone else in a large mass of recommendations (Rotten Tomatoes did it for them).</p>
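<p>As a rough illustration of the &#8220;weighted combination&#8221; idea, the simplest version is a convex blend of the mean expert rating and the mean crowd rating for an item. This is only a sketch of what I had in mind, not anything from the paper; the function name and the default weight of 0.7 are made up, and in practice the weight would have to be tuned on held-out data.</p>

```python
def blended_prediction(expert_ratings, crowd_ratings, expert_weight=0.7):
    """Blend mean expert and mean crowd ratings for one item.

    expert_weight is a hypothetical tuning knob in [0, 1]; 0.7 is arbitrary.
    Falls back to whichever source is available if the other is empty.
    """
    if not expert_ratings and not crowd_ratings:
        return None
    if not expert_ratings:
        return sum(crowd_ratings) / len(crowd_ratings)
    if not crowd_ratings:
        return sum(expert_ratings) / len(expert_ratings)

    expert_mean = sum(expert_ratings) / len(expert_ratings)
    crowd_mean = sum(crowd_ratings) / len(crowd_ratings)
    return expert_weight * expert_mean + (1 - expert_weight) * crowd_mean


# Two critics and three ordinary users rating the same movie.
print(blended_prediction([4.0, 5.0], [2.0, 3.0, 4.0], expert_weight=0.5))
```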
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/07/22/sigir09-the-wisdom-of-the-few-a-collaborative-filtering-approach-based-on-expert-opinions-from-the-web/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SIGIR09: An Aspectual Interface for Supporting Complex Search Tasks</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/07/21/sigir09-an-aspectual-interface-for-supporting-complex-search-tasks/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/07/21/sigir09-an-aspectual-interface-for-supporting-complex-search-tasks/#comments</comments>
		<pubDate>Tue, 21 Jul 2009 19:31:08 +0000</pubDate>
		<dc:creator>David Karger</dc:creator>
				<category><![CDATA[Publication]]></category>
		<category><![CDATA[SIGIR]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=377</guid>
		<description><![CDATA[Robert Villa of U. Glasgow presented.   They consider how broad complex tasks can be supported through a search interface, covering the multiple subtasks someone might need to perform&#8212;e.g., investigating several candidate resources in the middle of the search, or, doing multiple related searches.  They developed an interface that let users create and [...]]]></description>
			<content:encoded><![CDATA[<p>Robert Villa of U. Glasgow presented.   They consider how broad complex tasks can be supported through a search interface, covering the multiple subtasks someone might need to perform&#8212;e.g., investigating several candidate resources in the middle of the search, or doing multiple related searches.  They developed an interface that let users create and manage multiple &#8220;aspects&#8221; for working on the complex search task. The interface offered a &#8220;parallel view&#8221; of several aspects as well as a tabbed view showing one aspect at a time.  Each aspect had a name, a search box, and a list of search results.  They had three simulated tasks&#8212;decision making (one solution with multiple paths), explicit aspects (multiple solutions with independent aspects like biographies of several individuals) and implicit aspects (multiple solutions where the aspects are implicit and interrelated).  They compared to a baseline (standard search interface), and they recorded how many documents users viewed and marked as relevant, how many searches they did, and user-reported perception of task difficulty.  Users were found to view significantly more documents when tackling the implicit/interrelated task.  Users carried out noticeably more searches with the aspectual interface.  Query vocabulary was quite different too.  People were allowed to finish early; for the baseline interface people stopped much sooner while with the aspectual interface they used the whole time.  For the implicit task they found the aspectual interface led to the task being perceived as significantly easier.  Users seemed to prefer the tabbed view to the columnar view.  Their conclusion is that for really hard tasks, aspectual interfaces seem to help. They failed to provide any measure of the quality of the results.</p>
<p>All in all, a relatively inconclusive outcome.  But this is an area that needs more attention&#8212;support for a complex process involving many related searches, management of the result sets, etc.  Merrie Morris has done some nice work here, with a focus on the collaborative aspects.</p>
<p>Followup: more coverage from <a href="http://palblog.fxpal.com/?p=1445">Fxpal</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/07/21/sigir09-an-aspectual-interface-for-supporting-complex-search-tasks/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SIGIR09: a comparison of query and term suggestion features for interactive searching</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/07/21/sigir09-a-comparison-of-query-and-term-suggestion-features-for-interactive-searching/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/07/21/sigir09-a-comparison-of-query-and-term-suggestion-features-for-interactive-searching/#comments</comments>
		<pubDate>Tue, 21 Jul 2009 19:06:54 +0000</pubDate>
		<dc:creator>David Karger</dc:creator>
				<category><![CDATA[Publication]]></category>
		<category><![CDATA[SIGIR]]></category>
		<category><![CDATA[CSAIL]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=373</guid>
		<description><![CDATA[Diane Kelly of UNC Chapel Hill presented some interfaces for helping users refine their queries.  Lots of situations arise where people need to enter a series of queries to home in on what they are looking for.  Literature shows people can exhaust their ideas for good search queries.  IR has explored techniques [...]]]></description>
			<content:encoded><![CDATA[<p>Diane Kelly of UNC Chapel Hill presented some interfaces for helping users refine their queries.  Lots of situations arise where people need to enter a series of queries to home in on what they are looking for.  Literature shows people can exhaust their ideas for good search queries.  IR has explored techniques for term or whole-query suggestion.</p>
<p>The problem with term suggestion is that it&#8217;s based on common terms in the top-ranked documents, but these may be the unwanted/distracting documents.  How do you suggest ways deeper into the corpus?  Also, there are problems with terms being presented out of context.  There are also basic low-level UI annoyances.  With query suggestion, there&#8217;s the problem of finding a good corpus of queries and figuring out which ones are related/similar to the user&#8217;s query.  Kelly proposes using the automatic term selection techniques as a way to generate whole queries.  First extract some terms, then suggest ways of combining them to make a good query.   To generate terms, they clustered the documents, took the 5 largest clusters, then selected &#8220;good&#8221; terms from each of the clusters.  They considered offering these terms individually to users, but also just offering a &#8220;new query&#8221; consisting of the old query with the top terms appended.    They also considered user-generated suggestions.</p>
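<p>The pipeline described above can be sketched in a few lines. This is my own toy reconstruction, not the code from the talk: it assumes the clustering has already been done, and it uses raw term frequency as a stand-in for whatever &#8220;good term&#8221; selection they actually used. The function and parameter names are made up.</p>

```python
from collections import Counter


def suggest_queries(query, clusters, num_clusters=5, terms_per_cluster=2):
    """Build whole-query suggestions by appending top cluster terms to the query.

    clusters: list of clusters, each a list of document strings (already clustered).
    Term "goodness" here is plain frequency, which is an assumption for illustration.
    """
    query_terms = set(query.lower().split())
    largest = sorted(clusters, key=len, reverse=True)[:num_clusters]

    suggestions = []
    for documents in largest:
        counts = Counter(
            term
            for document in documents
            for term in document.lower().split()
            if term not in query_terms
        )
        top_terms = [term for term, _ in counts.most_common(terms_per_cluster)]
        if top_terms:
            # The suggested "new query" is the old query plus the cluster's top terms.
            suggestions.append(query + " " + " ".join(top_terms))
    return suggestions


clusters = [
    ["jaguar car dealer", "jaguar car price"],
    ["jaguar cat habitat"],
]
print(suggest_queries("jaguar", clusters))
```

<p>Offering the individual entries of <code>top_terms</code> instead of the joined string would give the term-suggestion variant of the same interface.</p>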
<p>Diane can always be trusted to carefully work out a good user study protocol so I won&#8217;t describe details of the corpora (TREC Robust track, with queries of different levels of difficulty) or user conditions or metrics (&#8220;Session-Based Normalized Discounted Cumulative Gain&#8221;).  That&#8217;s all well done but details are in the paper.</p>
<p>As queries became more difficult, users made more queries, and also used more query suggestions.  Users saw term suggestions as a way to modify their query, but query suggestions as a way to make a whole new query.  A bit funny, as the query suggestions were in fact modifications of their original query.  Qualitative feedback was that people liked the flexibility of term suggestion, and its use to refine the query.  They didn&#8217;t like the jumbling together of terms, and said it was too much effort to use.  People said it was hard to see how terms related to their search.   People liked query suggestion for its &#8220;all in one&#8221; approach.  They liked the specificity and focus of the query&#8212;the query made a more meaningful semantic unit than the individual term suggestions.  They liked that the queries suggested ways of manually changing their query.  On the con side, they wished they could click on individual query terms and felt many queries were redundant.  In a followup study they found that people used lots of query suggestions.  Query suggestions were generally rated higher than terms.  In particular, those who got user-generated suggestions preferred the whole queries to having them chopped up into term suggestions; perhaps the user suggestions did not make a lot of sense as individual terms.  They&#8217;d like to go back and develop a hybrid that lets people get the whole queries but then manipulate pieces of them.</p>
<p>I quite liked this talk but it made me think back to our old work on Scatter/Gather.  Just like Diane&#8217;s, our system clustered the document collection, then picked important terms from each cluster.  But instead of presenting &#8220;queries&#8221;, we just presented each cluster&#8212;through its descriptive terms, and also through titles of some &#8220;representative documents&#8221;.  Scatter/Gather offered less flexibility to the user to mix and match terms for a new query; on the other hand, I think there is some interesting difference between the &#8220;cluster&#8221; metaphor and the &#8220;query&#8221; metaphor.  I bet that presenting terms in clusters would fix some of the complaints about terms not making sense in isolation.</p>
<p>Followup: this talk was much tweeted and blogged about elsewhere:</p>
<ul>
<li><a title="Fxpal" href="http://palblog.fxpal.com/?p=1435">Fxpal</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/07/21/sigir09-a-comparison-of-query-and-term-suggestion-features-for-interactive-searching/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SIGIR09: Search Engine Predilection Towards News Media Providers</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/07/21/sigir09-search-engine-predeliction-towards-news-media-providers/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/07/21/sigir09-search-engine-predeliction-towards-news-media-providers/#comments</comments>
		<pubDate>Tue, 21 Jul 2009 14:49:46 +0000</pubDate>
		<dc:creator>David Karger</dc:creator>
				<category><![CDATA[SIGIR]]></category>
		<category><![CDATA[CSAIL]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=370</guid>
		<description><![CDATA[



I saw a nice poster from U. Glasgow which shows that different search engines exhibit different biases on which news media providers are returned as results for various queries.  E.g., one search engine avoids New York Times, another avoids Reuters.  They can&#8217;t tell whether these biases are intentional or side effects of ranking [...]]]></description>
			<content:encoded><![CDATA[<table border="0">
<tbody>
<tr>
<td><a href="http://groups.csail.mit.edu/haystack/blog/wordpress/wp-content/uploads/2009/07/news-media.jpg"><img class="alignnone size-medium wp-image-371" title="Poster Thumbnail" src="http://groups.csail.mit.edu/haystack/blog/wordpress/wp-content/uploads/2009/07/img_0116.jpg" alt="" width="160" height="120" /></a></td>
<td><p>I saw a nice poster from U. Glasgow which shows that different search engines exhibit different biases on which news media providers are returned as results for various queries.  E.g., one search engine avoids New York Times, another avoids Reuters.  They can&#8217;t tell whether these biases are intentional or side effects of ranking algorithms, but it&#8217;s important and interesting either way.  Click the thumbnail for a larger image.</p>
<p><strong>Update:</strong> I&#8217;ve fixed the download problem outlined in the comment.  Now clicking on the thumbnail should give you the full image.</p></td>
</tr>
</tbody>
</table>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/07/21/sigir09-search-engine-predeliction-towards-news-media-providers/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>SIGIR09: Enhancing Cluster Labeling using Wikipedia</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/07/20/sigir09-enhancing-cluster-labeling-using-wikipedia/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/07/20/sigir09-enhancing-cluster-labeling-using-wikipedia/#comments</comments>
		<pubDate>Mon, 20 Jul 2009 19:33:52 +0000</pubDate>
		<dc:creator>David Karger</dc:creator>
				<category><![CDATA[Publication]]></category>
		<category><![CDATA[SIGIR]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=363</guid>
		<description><![CDATA[David Carmel from IBM Haifa spoke about the problem of labelling document clusters.  The goal is to find short labels for the clusters that describe them well to end users.  The typical approach seeks important terms in the clusters.  But sometimes important terms aren&#8217;t helpful/meaningful, and sometimes the best labels don&#8217;t show up in the [...]]]></description>
			<content:encoded><![CDATA[<p>David Carmel from IBM Haifa spoke about the problem of labelling document clusters.  The goal is to find short labels for the clusters that describe them well to end users.  The typical approach seeks important terms in the clusters.  But sometimes important terms aren&#8217;t helpful/meaningful, and sometimes the best labels don&#8217;t show up in the cluster at all.  For example, at the Open Directory Project, the category labels appeared in the text of the documents in the corresponding cluster only 85% of the time, and were rarely among the statistically important terms.</p>
<p>In this work, they try to match the cluster contents to articles in Wikipedia, then look at the Wikipedia articles&#8217; metadata (titles, categories) to find good descriptive labels for the clusters.  It seems to work pretty well.</p>
<p>To test this they took a bunch of text documents with labeled categories, and tested whether the manual label got selected by their Wikipedia algorithm.  They tested on some standard corpora: the 20 newsgroups, and the Open Directory Project, for which they manually labeled 100 categories.  They carefully explored effects of cluster coherence, noise, etc.</p>
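<p>Here is a toy sketch of the general idea, as I understood it from the talk. It is not the authors&#8217; system: the three &#8220;articles&#8221; are a made-up in-memory stand-in for Wikipedia, matching is plain term overlap rather than a real search engine, and the voting scheme and function name are my own assumptions.</p>

```python
from collections import Counter


def label_cluster(cluster_terms, articles, top_k=2):
    """Propose labels for a cluster from the metadata of its best-matching articles.

    articles: list of dicts with "title", "categories" and "text" keys
    (a hypothetical stand-in for a Wikipedia index).
    """
    terms = set(cluster_terms)
    # Rank articles by how many of the cluster's important terms they contain.
    ranked = sorted(
        articles,
        key=lambda a: len(terms.intersection(a["text"].lower().split())),
        reverse=True,
    )

    # Titles and categories of the top matches vote for candidate labels.
    votes = Counter()
    for article in ranked[:top_k]:
        votes[article["title"]] += 1
        for category in article["categories"]:
            votes[category] += 1
    return [label for label, _ in votes.most_common(3)]


articles = [
    {"title": "Ice hockey", "categories": ["Winter sports", "Team sports"],
     "text": "hockey puck goalie rink skate"},
    {"title": "Figure skating", "categories": ["Winter sports"],
     "text": "skate rink jump spin"},
    {"title": "Baseball", "categories": ["Team sports"],
     "text": "bat pitcher inning"},
]
print(label_cluster(["puck", "goalie", "rink", "skate"], articles))
```

<p>In this toy example the top label is &#8220;Winter sports&#8221;, a phrase that appears nowhere among the cluster&#8217;s own terms, which is the kind of label a purely term-based approach cannot produce.</p>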
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/07/20/sigir09-enhancing-cluster-labeling-using-wikipedia/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
