<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Haystack Blog &#187; Semantic Web</title>
	<atom:link href="http://groups.csail.mit.edu/haystack/blog/category/semantic-web/feed/" rel="self" type="application/rss+xml" />
	<link>http://groups.csail.mit.edu/haystack/blog</link>
	<description>MIT CSAIL Research</description>
	<lastBuildDate>Tue, 24 Nov 2009 04:05:39 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Building a Social Data Commons</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/11/23/building-a-social-data-commons/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/11/23/building-a-social-data-commons/#comments</comments>
		<pubDate>Tue, 24 Nov 2009 03:28:04 +0000</pubDate>
		<dc:creator>Adam Marcus</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Social Computing]]></category>
		<category><![CDATA[Thought Piece]]></category>
		<category><![CDATA[eGovernment]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=743</guid>
		<description><![CDATA[Inspired by Ted’s vision of what he’d like to see happen to data.gov, I decided to have a try at my hopes for it. Ted’s desires for data.gov are all ones that I agree would make the data more accessible. I would now like to discuss what else I might want in a world where [...]]]></description>
			<content:encoded><![CDATA[<p>Inspired by <a href="http://groups.csail.mit.edu/haystack/blog/2009/11/18/plotting-a-course-for-data-gov/">Ted’s vision</a> of what he’d like to see happen to <a href="http://www.data.gov/">data.gov</a>, I decided to have a try at my hopes for it. Ted’s desires for data.gov are all ones that I agree would make the data more accessible. I would now like to discuss what else I might want in a world where such steps were taken: a world in which government data was centralized, versioned, searchable, and accessible.</p>
<p>Now what? Given the large and growing pile of data we will optimistically uncover, we will run into new frustrations. People will claim that the published data formats are not the ones that their analysis tool requires. People will be overwhelmed by dataset size, not knowing where to start. People will unknowingly recreate someone else’s data-munging workflows on the way to repeating analyses of the same data. People will become the next bottleneck if data ever ceases to be.</p>
<p>There’s no one answer to the concerns listed above because everyone has a different goal for the data. To handle these issues, we will need more than a place to find up-to-date datasets; we will also need a place where it is easy for people to share ideas and strategies for tackling data. We will need a <em>social data commons</em>.</p>
<p>Whereas blogs and wikis help report findings, steps, and missteps, a social data commons can be the place to go to “talk shop” about the available data. Even if people post their solutions using decentralized means, there will be benefit to pooling all of these resources in one place on the web. Here are some tools that will help the data-tinkerers get things done:</p>
<ul>
<li><strong>Data-munging war stories</strong>. The first stage in data analysis is often long and frustrating. One must digest the dataset in the form they received it, and transform, clean, and filter out the subset that they wish to analyze, visualize, or otherwise present. The workflow differs for each dataset and application, but to the extent that people can share tools and instructions for processing each dataset, these should be written up in the form of recipes for baking the data.</li>
<li><strong>Crowdsourced analysis</strong>. Datasets can be overwhelming. While many exploration tasks are easily automated, it is often easiest to leave certain tasks (e.g., “Find the interesting pictures”) to humans. <a href="https://www.mturk.com/mturk/">Mechanical Turk</a> gives us a hint at what this might look like, and the Guardian provides a wonderful <a href="http://mps-expenses.guardian.co.uk/">example</a> of crowdsourced public data analysis in action.</li>
<li><strong>Current uses showcases</strong>. To spark competition, avoid duplicating work, and inspire follow-on projects, visitors should see a showcase of the current uses of each dataset. Aside from links to sites built around a dataset, the list can include <a href="http://manyeyes.alphaworks.ibm.com/manyeyes/">embedded visualizations</a> of finished work.</li>
<li><strong>Analysis wishlists</strong>. Given that data released by a government reaches more than just programmers, there will be more people with ideas than people who can implement the ideas. People with ideas should be given an outlet, and passers-by should be asked to vote on these ideas to help data geeks with some free cycles discover the most interesting unimplemented project.</li>
<li><strong>Data wishlists</strong>.  If an agency were to dedicate resources to releasing another dataset, which one is in highest demand?  As Ted <a href="http://groups.csail.mit.edu/haystack/blog/2009/11/18/plotting-a-course-for-data-gov/">mentioned</a>, governments should let demand drive delivery.</li>
<li><strong>Forums</strong>. No set of tools will encompass all use cases for social data analysis. A discussion forum can lead to the formation of interest groups while serving as a catch-all for needs not served by the list above.</li>
</ul>
<p>The US government might hit a few bumps trying to implement some of these social features. For example, a conflict of interest might arise if the showcase of uses of a dataset includes a site critical of the current administration. Having the executive branch ban spam or abusive comments on a forum raises concerns about limits on <a href="http://www.wired.com/techbiz/people/magazine/17-04/st_thompson">free speech</a>.  These details are not roadblocks, but they do signal that we can’t expect a social overlay to spring out of data.gov <em>per se</em>: if we want these features, we may have to build and manage them on a third-party site.</p>
<p>I’m sure there’s more to the social data commons than I listed here. What did I miss, and where can we seek further inspiration?</p>
<p><em>Thanks to Ted for reading the first version of this entry.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/11/23/building-a-social-data-commons/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Plotting a Course for Data.gov</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/11/18/plotting-a-course-for-data-gov/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/11/18/plotting-a-course-for-data-gov/#comments</comments>
		<pubDate>Wed, 18 Nov 2009 16:15:59 +0000</pubDate>
		<dc:creator>Edward Benson</dc:creator>
				<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Social Computing]]></category>
		<category><![CDATA[Thought Piece]]></category>
		<category><![CDATA[eGovernment]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=734</guid>
		<description><![CDATA[The US Government’s effort to create a culture of open government data is a big deal. Hopefully it signals a shift from the “pull” model of FOIA to a “push” mindset in which data is proactively returned to the public without first having to ask (and pay). Still, data.gov has a lot of room for [...]]]></description>
			<content:encoded><![CDATA[<p>The US Government’s effort to create a culture of open government data is a big deal. Hopefully it signals a shift from the “pull” model of FOIA to a “push” mindset in which data is proactively returned to the public without first having to ask (and pay). Still, data.gov has a lot of room for improvement, as Clay Johnson of Sunlight Labs mentions <a href="http://www.sunlightlabs.com/blog/2009/get-your-act-together-datagov/">here</a> and <a href="http://www.sunlightlabs.com/blog/2009/what-id-change-about-datagov/">here</a>. </p>
<p>Clay’s criticisms are well founded, but what I’d like to see more of is some brainstorming about what our ideal data.gov would look like. A <a href="http://thedextrousweb.com/2009/10/the-wraps-come-off-data-gov-uk/">recent post</a> about the coming  data.gov.uk site provides a nice foil for us, for one, as the UK seems to be taking a very different approach. But more importantly, what would you want to see in a government data site, and how would you use it? </p>
<p>I heard once that it is a good exercise to try to compress an idea into three sentences or fewer: it forces you to understand what you really want to say. So here is my three-sentence suggestion:</p>
<ul>
<li><b>Bring it all under one roof</b>. The current data.gov site is like Yahoo! from the mid-90s: it is just a directory of links to other sites. This is a noble start, but we really need to get a single point of access if we want to revolutionize eGovernment. The government is an immense, heterogeneous organization, so this is as much an organizational challenge as a technical one. But there are plenty of precedents of systems which allow individual data publishers (the government agencies) to retain control over the publishing and updating of their own data, while allowing data consumers (the public) to access it all from a single location.</li>
<li><b>.. But don’t forget to give credit.</b> When offering a single access point for all the data, it is essential to keep metadata that tracks which data came from where. This is as important for book-keeping and data integration reasons as it is for simply giving credit where credit is due. Agencies that publish data sets of great use should be recognized for their work.</li>
<li><b>Build it as you go</b>. We don’t need the perfect system overnight. No single ontology, schema, or data format will be able to encompass all the government’s data. That’s OK — it doesn’t have to. Don’t let fear of not getting it perfect slow down incremental progress toward our goal. Just bringing the data under one roof is a fantastic start; you can always try to begin standardizing formats and “linking” it later. My next blog post will specifically address this topic.</li>
<li><b>…And version data sets</b>. The benefits of offering “versions” of datasets are threefold. First, it allows you to maintain a system in which data providers feel comfortable updating their data at will. Second, it allows the implementors of the system to feel comfortable experimenting with data integration techniques, knowing that, if it doesn’t work out, users still have access to the same system they had last week. Third, it is the ultimate expression of openness: like a Subversion repository for the government, everyone will be able to see the evolution of data over time.</li>
<li><b>Help users discover data</b>. With the sheer volume of data available, publishing it isn’t enough — you have to help people find what they want. The current data.gov site already does a decent job of offering search functionality. We can go further, providing data “footnotes” for bloggers to link back into the data.gov site (see the <a href="http://projects.csail.mit.edu/datapress/">DataPress</a> project for an idea of how this might work), suggestions of “hot” data sets for particular areas of interest, or a government data blog that highlights new and important data that has been recently published.</li>
<li><b>.. And let them tell you what they want.</b> Your users — the citizens — are your best assets. Let them prioritize your tasks for you by allowing them to suggest and vote on features and data sets they would like to see added. This type of decentralized management strategy is making waves among the business community, and the same mindset can be applied to government.</li>
</ul>
<p>So there are my three sentences: Bring it all under one roof, but don’t forget to give credit. Build it as you go, and version data sets along the way. Help users discover data, and let them tell you what they want.</p>
<p>What are your three?</p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/11/18/plotting-a-course-for-data-gov/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>ISWC Afterthoughts</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/11/11/iswc-afterthoughts/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/11/11/iswc-afterthoughts/#comments</comments>
		<pubDate>Wed, 11 Nov 2009 21:49:27 +0000</pubDate>
		<dc:creator>David Karger</dc:creator>
				<category><![CDATA[ISWC]]></category>
		<category><![CDATA[Publication]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Thought Piece]]></category>
		<category><![CDATA[CSAIL]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=698</guid>
		<description><![CDATA[After recovery from chairing ISWC 2009, I had always intended to blog about some of the changes my co-chair Avi Bernstein and I tried for the conference.   I was prompted to do it today by a very interesting post by James Landay&#8212;it describes problems with the current CHI/UIST reviewing process, and led to some fascinating [...]]]></description>
			<content:encoded><![CDATA[<p>After recovery from chairing ISWC 2009, I had always intended to blog about some of the changes my co-chair Avi Bernstein and I tried for the conference.   I was prompted to do it today by a very interesting <a href="http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html">post by James Landay</a>&#8212;it describes problems with the current CHI/UIST reviewing process, and led to some fascinating discussion comments by a bunch of the biggest names in CHI.  Well worth a read.  I&#8217;m not going to address it directly here, but a bunch of what we did with ISWC was motivated by related thoughts.  I&#8217;m writing about this because I think these were mainly good ideas, and recommend them to anyone else chairing a conference.</p>
<p>One set of changes we made was aimed at squeezing the gap between submission and conference.  It seems obvious to me that conferences should inform us of new work.   This doesn&#8217;t happen if the conference presents work that was submitted 7 months earlier (cf. <a href="http://www.chi2010.org/">CHI</a>)!  We made a few changes to tighten things up:</p>
<ol>
<li>We compressed the initial reviewing period to two weeks.  While many conferences give reviewers several months to examine their papers, I&#8217;m certain that few people look at them more than a week ahead of the deadline.  Certainly most reviews don&#8217;t arrive until day-of!  We got exactly one reviewer complaint about this rushed schedule.   Perhaps we&#8217;ll lose some unusually disciplined and organized reviewers who like to plan ahead; I&#8217;m still convinced this is a big improvement for the community.  For the same reason, we offered the program committee only a week to digest and summarize these reviews before the committee meeting.</li>
<li>Similarly, we gave authors only a week to revise their papers for publication after acceptance.   Again, we assumed most revision happens at the last minute, so we jumped straight to the last minute.  I&#8217;d say that a paper requiring more than a week of revision probably wasn&#8217;t ready to submit anyway.</li>
<li>We advocated moving to a model, like NIPS, of producing the printed proceedings after the conference.   Given the slow printing process, distributing printed proceedings at the conference requires that final versions be submitted long before the conference.  I haven&#8217;t opened a paper proceedings for years, but enough people are still attached to them that I don&#8217;t think the time is ripe to do away with them entirely.  But distributing them post-conference lets us keep them without keeping the delay they introduce.    In the end, we weren&#8217;t able to convince the steering committee to make this change; however, our push led the publisher to offer an unusually swift timeline for publication.</li>
</ol>
<p>In the end, we squeezed out enough delay that the critical path became something outside our control: travel to the US.  For cheap fares, authors need to buy their airline tickets a month before the conference.  On top of this we had to add a month for non-US authors to get visas for travel to the US (sadly, one of next year&#8217;s co-chairs could not get a visa in time, and missed the conference).  Combining these requirements with our one-month review process produced our final, 3-month lag between submission and conference.</p>
<p>Looking ahead, I&#8217;d advocate for ISWC to eliminate the paper proceedings and to select countries with more sensible visa frameworks&#8212;Canada anyone?  This ought to let us cut the submission lag by another factor of 2.</p>
<p>We also made changes to the reviewing process.</p>
<ol>
<li>We eliminated topic tracks, instead allowing authors to choose their preferred program committee member for review of their paper (as well as an alternate).  My experience has been that most track descriptions tend to be ambiguous gobbledygook that produces tremendous uncertainty about the &#8220;right&#8221; track to submit to&#8212;especially if, as often seems to be the case for me, your work seems to span multiple tracks.  Given that decisions are made by the people on the program committee rather than abstract tracks, an &#8220;end-to-end&#8221; argument says picking the person makes more sense.  Authors understand their papers well and can study committee members to identify the one most likely to favorably receive their paper.   This is analogous to the submission process for journals where one chooses an editor.   There was some worry about load balance&#8212;what if everyone picks the &#8220;nice&#8221; reviewer?   But in practice we were able to assign every paper to one of the two chosen committee members.</li>
<li>We eliminated the paper &#8220;bidding&#8221; process.  In past ISWCs, all potential reviewers were able to survey all submissions and bid for those they wanted to review.   We felt that it made more sense for our knowledgeable program committee members to select reviewers who they thought were best suited to each paper&#8212;aiming for a &#8220;best for the paper&#8221;, rather than &#8220;fun for the reviewer&#8221; assignment of papers.  We feel this worked well, guaranteeing good reviews of each paper.  Interestingly, this was the one change on which we got negative pushback at the ISWC town hall meeting&#8212;even though it was populated by the beneficiaries of our reviewing process, the majority advocated a return to reviewer bidding.  I still think our approach is superior.</li>
<li>By a second end-to-end argument, we eliminated generic 1-5 scoring of papers, which ultimately must be translated into an accept/reject, and replaced it with actionable descriptions: we asked each reviewer to take a position for or against acceptance of each paper, or to &#8220;give up&#8221; and indicate they didn&#8217;t care.</li>
<li>Perhaps most important, we sought <strong>controversial papers</strong>.  The obvious and common measure of paper quality is the average of the reviewer scores.  However, this buries one of the most important potential roles of a conference&#8212;to encourage debate about research.  Rather than a paper that everyone considers OK, a paper that half the reviewers love and half hate seems much more important to present at a conference, since it indicates a fundamental disagreement about research that is well worth airing in public, exploring, and trying to resolve.   To seek out such papers we applied two policies.  The first was that each paper with at least one vote for acceptance was discussed at the PC meeting.  The second was that any paper that, after discussion, had at least one program committee member as an advocate was accepted to the conference, regardless of the amount of opposition.  Because &#8220;average scoring&#8221; is so deeply embedded in reviewing practice, this change required changes to our conference management software&#8212;which were cheerfully carried out by the folks at <a href="http://precisionconference.com/">Precision Conference</a>, along with all the other great customer support they offered.  We never had to wait more than a few hours for an answer to a support request.  I highly recommend them for other conferences.</li>
<li>We introduced conditional acceptances.  We got some pushback that this is a big burden for authors, but I don&#8217;t buy that&#8212;obviously, the authors could choose to treat this as a rejection and skip the hassle, but in fact many of our conditional accepts were worked on and turned into accepts&#8212;including one of the four papers we ultimately highlighted as a best paper.  It is certainly a big burden for the program committee, and I&#8217;d like to thank our members for being willing to tackle this extra responsibility.</li>
</ol>
<p>Finally, we made some changes to the conference itself.</p>
<ol>
<li>We scheduled our parallel tracks based on surveys of attendees about which talks they wished to attend. It was trivial to gather this information using a Google spreadsheet.   Using it, we created a schedule with almost no conflicts among the papers people stated plans to attend.</li>
<li>We introduced a category of &#8220;general interest&#8221; papers.  These were specifically not intended to be &#8220;best&#8221; papers; rather, they were papers we felt ought to be attended to by anyone interested in achieving a broad sense of the work going on around the Semantic Web.   The Semantic Web community is made up of subcommunities that risk becoming quite isolated from one another&#8212;the description logic community, the Semantic Web user interface community, the scalability community, the standards community, and so on.   To combat isolation we need to share knowledge, so we identified papers that we felt were representative of each subcommunity but also broad enough that members of the other subcommunities could understand them.</li>
<li>We introduced a town-hall meeting so the community could give us feedback about all the innovations I&#8217;ve just described.  The town hall was reasonably well attended even without any food or beer incentive.  With the exception of reviewer bidding, all of our changes were favored by a majority of the town hall attendees.</li>
</ol>
<p>There were three changes we completely failed to pull off:</p>
<ol>
<li>ISWC offers &#8220;in use&#8221; and &#8220;industry&#8221; tracks with their own submissions, committees, and schedules independent of the research track.  This creates ambiguity for authors about where they should submit and ambiguity about which track should accept, and also leads to tracks that are not coordinated at the conference.  I would like to see a single committee considering all this work as a whole.</li>
<li>We had hoped to install Semantic Web tools for managing information at the conference&#8212;letting people annotate the papers/talks they saw, provide information about restaurants they&#8217;ve visited, coordinate meetings with others at the conference.   We failed&#8212;creating a system for this still requires too much special purpose engineering.    I see that as a (to-date) failure of the Semantic Web community to provide tools that are easy to install and use.</li>
<li>Both Avi and I were eager to nurture the study of <a href="http://swui.semanticweb.org/">user interaction with the Semantic Web</a>.  We both consider this a topic that is not being sufficiently explored&#8212;without good user interfaces, we don&#8217;t think the promise of the Semantic Web can be fulfilled. Unfortunately, our plan was frustrated by a lack of submissions in this area&#8212;it seems the work simply isn&#8217;t being done.   I wonder how we can convince the community of the importance of pursuing it.</li>
</ol>
<p>In all, chairing the conference was surprisingly fun and easy.  For years, I&#8217;ve declined chairing requests, on the theory that with my organizational capabilities any conference I chaired would not actually take place.   For ISWC, my worries were assuaged by a fantastic and well organized (Swiss) co-chair who made sure everything actually happened on time.  So, a big thank you to Avi.</p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/11/11/iswc-afterthoughts/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Paper Awards at ISWC</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/11/08/paper-awards-at-iswc/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/11/08/paper-awards-at-iswc/#comments</comments>
		<pubDate>Sun, 08 Nov 2009 14:06:32 +0000</pubDate>
		<dc:creator>David Karger</dc:creator>
				<category><![CDATA[ISWC]]></category>
		<category><![CDATA[Publication]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[CSAIL]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=692</guid>
		<description><![CDATA[Having gotten ISWC&#8217;s Ontology Panel off my chest, I want to take the time to discuss the Best Paper Awards we gave at the conference.  The papers that got these awards received uniformly high ratings from their reviewers, were recommended for awards by the program committee, impressed the program chairs, and had good presentations at [...]]]></description>
			<content:encoded><![CDATA[<p>Having gotten ISWC&#8217;s Ontology Panel <a href="http://groups.csail.mit.edu/haystack/blog/2009/11/03/does-the-semantic-web-need-ontologies/">off my chest</a>, I want to take the time to discuss the Best Paper Awards we gave at the conference.  The papers that got these awards received uniformly high ratings from their reviewers, were recommended for awards by the program committee, impressed the program chairs, and had good presentations at the conference.</p>
<p>Best paper went to Ugur Kuter and Jennifer Golbeck for &#8220;Semantic Web Service Composition in Social Environments&#8221;.   This paper reflected a nice linking of two disparate fields.  The first is semantic web service composition.   This area looks ahead to when users will create workflows by composing&#8212;chaining together&#8212;a series of &#8220;web services&#8221; (processes on the web with exposed APIs that can be invoked from elsewhere on the web), passing data from one to the next in order to perform some computation.  Generally, web service composition is looked at as a logic problem&#8212;figuring out which web services can be composed to meet a particular specification.  Kuter and Golbeck instead look at a trust problem.   Many of these services on the web might not be completely trustworthy&#8212;perhaps they are fault-prone or even malicious.   The paper framed the problem of choosing, among many compositions, the most trustworthy one.  The algorithms in the paper are heuristic and surely not the last word, but the real value of the paper was in framing the problem.</p>
<p>The best student paper, by Vicky Papavassiliou, Giorgos Flouris, Irini Fundulaki, Dimitris Kotzinos, and Vassilis Christophides, &#8220;On Detecting High-Level Changes in RDF/S KBs&#8221;,  was also a problem-framing paper.   Ontologies are obviously an important part of the Semantic Web (my <a href="http://groups.csail.mit.edu/haystack/blog/2009/11/03/does-the-semantic-web-need-ontologies/">last post</a> notwithstanding) and are already heavily used to characterize data in a variety of domains (medical being a significant one).    Over time, these ontologies will evolve.   People will need to make sense of the changes to these ontologies.  It&#8217;s easy to describe &#8220;atomic&#8221; changes (this element was added, this removed) but these atomic changes are likely to occur in groups to achieve some higher level change.  It is these higher level changes that will serve as the most meaningful description to an end user.   This paper poses the two related questions of &#8220;how should these higher level changes be described?&#8221; and &#8220;how can they be detected from the raw description of atomic changes to the ontology?&#8221; Again, there is likely to be follow-on work, but this paper did a nice job initiating study of the problem.</p>
<p>We also awarded an honorable mention in each category.  For best paper, honorable mention went to Daniela Petrelli, Suvodeep Mazumdar, Aba-Sah Dadzie, and Fabio Ciravegna for &#8220;Multi Visualization and Dynamic Query for Effective Exploration of Semantic Data.&#8221; Working at Rolls-Royce, they mapped a large amount of information (about jet engine design) into a Semantic Web framework and deployed a new user interface for visualizing and navigating that information.  They then studied its use and usefulness in the company and reported conclusions.   The paper was a real user study, something of which we see far too little in the semantic web community.  It was particularly nice to give this award after hearing a researcher declare the previous day that &#8220;there is no scientific method for evaluating semantic web user interfaces.&#8221;  The community needs more papers of this sort.</p>
<p>An honorable mention went to Jacopo Urbani, Spyros Kotoulas, Eyal Oren, and Frank van Harmelen for &#8220;Scalable Distributed Reasoning using MapReduce.&#8221;  This was a performance paper, showing how to do typical semantic-web logical inference tasks on a large parallel cluster.  As is usual, the main problem is data communication bottlenecks between the processors, and this paper showed how a very limited amount of replication could dramatically reduce those communication bottlenecks.</p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/11/08/paper-awards-at-iswc/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Does the Semantic Web Need Ontologies?</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/11/03/does-the-semantic-web-need-ontologies/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/11/03/does-the-semantic-web-need-ontologies/#comments</comments>
		<pubDate>Tue, 03 Nov 2009 06:42:34 +0000</pubDate>
		<dc:creator>David Karger</dc:creator>
				<category><![CDATA[ISWC]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Thought Piece]]></category>
		<category><![CDATA[CSAIL]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=669</guid>
		<description><![CDATA[Ever since returning from the 2009 International Semantic Web Conference last week I&#8217;ve been bursting to discuss a panel that took place there on the topic &#8220;Does the Semantic Web need Ontologies?&#8221;.    But the WWW2010 deadline was today and we had 3 papers to write.  With that deadline now 10 minutes past, I can [...]]]></description>
			<content:encoded><![CDATA[<p>Ever since returning from the <a href="http://iswc2009.semanticweb.org/">2009 International Semantic Web Conference</a> last week I&#8217;ve been bursting to discuss a panel that took place there on the topic &#8220;Does the Semantic Web need Ontologies?&#8221;.    But the WWW2010 deadline was today and we had 3 papers to write.  With that deadline now 10 minutes past, I can finally post!  When it was first proposed, I was concerned because panels need controversy to be fun, and I didn&#8217;t think there&#8217;d be debate on this topic.  However, the organizer was confident that he&#8217;d be able to arrange different viewpoints on the panel.</p>
<p>When I attended the panel I was sorry to discover that the panelists did in fact all agree.  Far worse, they all said &#8220;yes&#8221; and wanted to debate what <em>kind</em> of ontologies were needed. Those who&#8217;ve followed my slow <a href="http://groups.csail.mit.edu/haystack/blog/2009/09/14/in-defense-of-a-semantic-web-wild-west/">conversation with Stefano Mazzocchi</a> won&#8217;t be surprised at my reaction&#8212;a jump to the audience microphone to voice a strong &#8220;no!&#8221;   I asserted that a bunch of data presented in spreadsheets was already a big step forward over our current unstructured web.  This led to some interesting discussion that helped me clarify some points in my mind that I&#8217;ll try to lay out here.</p>
<p>The panelists&#8217; general reaction was amazement that I could be opposed to ontologies.  Without ontologies, how could any tool actually use the data?  What good would that data be without an explanation of what it meant?</p>
<p>Tim Berners-Lee tried to mediate by suggesting that I did support ontologies.  After all, a spreadsheet has an ontology: the ontology specifies rows, columns, cells, and the relationship between them. But by this definition, any structured data necessarily has an (implicit) ontology, and saying &#8220;ontology&#8221; is just another way of saying &#8220;structured data&#8221;.   And I think this diverges from the standard meaning of &#8220;ontology&#8221; in the Semantic Web community, which I would read as &#8220;an explicitly recorded, machine readable description of the ontology of the given data.&#8221;   While I am a big proponent of structured data I&#8217;m going to bet that the panelists would not consider their implicit ontologies to be ontologies in the Semantic Web sense.   So we do in fact disagree.</p>
<p>Why then do I think we don&#8217;t need (explicit) ontologies?   Because I&#8217;m focused on the ways that human beings, rather than machine agents, will consume the data being shared.  And for humans, a machine-readable explanation of the data&#8217;s meaning is often unnecessary because the human who is consuming that data can figure it out in other ways.  For example, the meaning of the data elements might be explained in English, a &#8220;caption&#8221; of the data I am inspecting.     Even without captions, if I get a data table with column headings, I can use my comprehension of English to understand the meaning of those headings and from it infer the roles of the columns.  Even if there aren&#8217;t column headings, the &#8220;shape&#8221; of the data can tell me a lot&#8212;I&#8217;ll recognize standard person names, phone numbers, addresses, prices, book titles, and such from the textual patterns or from matches to my large wetware database of known entities.  And if I see enough examples I can draw conclusions about the values in the column (indeed, <a href="http://www.google.com/squared">Google Squared</a> suggests that you might not even need a human in the loop to make these inferences).</p>
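<p>To make the point about the &#8220;shape&#8221; of the data concrete, here is a minimal Python sketch of guessing a column&#8217;s role from textual patterns alone. The pattern table, the <tt>guess_role</tt> function, and the sample values are all made up for illustration; real data would need far more forgiving rules, and this is not how Google Squared or any other tool actually works.</p>

```python
import re

# Hypothetical shape patterns (US-style phones, dollar prices, 4-digit years).
# These are illustrative assumptions, not a complete or robust rule set.
PATTERNS = {
    "phone": re.compile(r"^\(?\d{3}\)?[ -]?\d{3}-\d{4}$"),
    "price": re.compile(r"^\$\d+(\.\d{2})?$"),
    "year": re.compile(r"^(1[89]|20)\d{2}$"),
}


def guess_role(values):
    """Return the first role whose pattern matches every value, else 'text'."""
    for role, pattern in PATTERNS.items():
        if values and all(pattern.match(value) for value in values):
            return role
    return "text"


# A headerless table, stored column by column (made-up sample values).
columns = [
    ["617-555-0100", "(617) 555-0199"],
    ["$12.50", "$8"],
    ["1998", "2009"],
    ["Exhibit", "Potluck"],
]
roles = [guess_role(column) for column in columns]
print(roles)
```

<p>A human reader does this kind of matching without thinking about it; the sketch only shows that no machine-readable ontology is involved in the guess.</p>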
<p>So humans can understand data without (explicit) ontologies, but is it any use?  Sure!  Just to plug some of my own group&#8217;s tools, they can use <a href="http://www.simile-widgets.org/exhibit/">Exhibit</a> to throw it into a rich visualization&#8212;a map, timeline, or list with faceted browsing and sorting.   Or they can combine it with another data set using <a href="http://simile.mit.edu/potluck/">Potluck</a>, and throw the combined data into an Exhibit visualization.  I can make a post on <a href="http://manyeyes.alphaworks.ibm.com/manyeyes/">ManyEyes</a> or throw the data into <a href="http://www.dabbledb.com/">DabbleDB</a> for further processing.  These activities typically require me to match certain properties (columns) of the data set into roles in the UI (Exhibit, ManyEyes) or to properties in the other data set (Potluck, DabbleDB)&#8212;a straightforward task.  They don&#8217;t require the machine to understand the data, because I&#8217;m the one taking these actions.  They do require that the data be structured, since otherwise there&#8217;s no way for me to say &#8220;which column&#8221; to the tools I&#8217;m trying to use.</p>
<p>That&#8217;s the argument I wanted to make at the panel, but it&#8217;s a bit hard to squeeze into 20 seconds at the audience-feedback microphone.  So I&#8217;m afraid the panelists instead thought that I was arguing against ontologies, asserting that they should not be deployed at all.</p>
<p>On the contrary, I like ontologies.  But I&#8217;m convinced that ontologies are a luxury, not a necessity. They&#8217;re certainly nice to have, and there are some things you can only do if you have them&#8211;for example, they can help me understand column headings written in Russian or Spanish by connecting them to explanations in English.  But I remain captivated by all the opportunities that arise just by making data easily accessible in raw form.   Too often, what people want to do with information is perfectly easy to explain, but impossible to do without serious programming, for silly reasons.</p>
<p>And it&#8217;s that enthusiasm for open data that keeps me energetically arguing that we don&#8217;t need ontologies.  If we need ontologies, then work on freeing data needs to stop until we get them.  I think that&#8217;s a very dangerous perspective.  It&#8217;s the one that says &#8220;there&#8217;s no point to building tools for scientists to publish their data, until we&#8217;ve figured out the right huge ontology that we&#8217;ll force them all to publish in.&#8221;</p>
<p>Instead, I think we should go right ahead with our research on ontologies and tools for them, but in the meantime, let the data fly!</p>
<p>P.S. When someone rose to support me, arguing that we should forget ontologies and concentrate on Linked Open Data, I muddied things further by asserting that we don&#8217;t really need the &#8220;Linked&#8221; part, and Open Data is useful in its own right.  While it comes from the same place as my perspective on ontologies above, that&#8217;s the substance of my <a href="../../2009/09/14/in-defense-of-a-semantic-web-wild-west/">discussion with Stefano</a>, and I won&#8217;t repeat it here.</p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/11/03/does-the-semantic-web-need-ontologies/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Tales of a Semantic Web Skeptic</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/10/25/tales-of-a-semantic-web-skeptic/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/10/25/tales-of-a-semantic-web-skeptic/#comments</comments>
		<pubDate>Sun, 25 Oct 2009 20:10:10 +0000</pubDate>
		<dc:creator>Michael Bernstein</dc:creator>
				<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Thought Piece]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=660</guid>
		<description><![CDATA[Right now the world&#8217;s premier semantic web conference is happening in Washington, D.C. As a graduate student of the fellow who&#8217;s chairing the conference this year, and working down the hall from Sir Linked Data himself, I&#8217;ve had my fair share of semantic web experiences. But my background is not in Semantic Web technology, so [...]]]></description>
			<content:encoded><![CDATA[<p>Right now <a href="http://iswc2009.semanticweb.org/">the world&#8217;s premier semantic web conference</a> is happening in Washington, D.C. As a graduate student of <a href="http://people.csail.mit.edu/karger/">the fellow who&#8217;s chairing the conference this year</a>, and working down the hall from <a href="http://www.w3.org/People/Berners-Lee/">Sir Linked Data himself</a>, I&#8217;ve had my fair share of semantic web experiences. But my background is not in Semantic Web technology, so joining a group so focused on the semantic web threw some of its core tenets into sharp relief for me.</p>
<p>So here, from the perspective of a human-computer interaction guy, is what I&#8217;d like to see changed about the semantic web:</p>
<p><strong>Stop Calling Everything &#8216;Semantic&#8217;</strong></p>
<p>At worst, the term &#8216;semantic&#8217; in a title can mean &#8220;we re-did existing research using {RDF, RDFa, N3, OWL, SPARQL, DBpedia, Semantic MediaWiki},&#8221; without a clear notion of why this would be a good thing. The strength of the semantic web is its ability to interoperate heterogeneous data, yet the inclination is to ignore this and work on the problem, any problem, in a semantic web framework. Semantic web research papers can feel like a bunch of hammers running around in search of a nail. And there are plenty of oft-hammered nails: semantic query visualizations, semantic desktops, semantic wikis, semantic ontology alignment, and semantic web service composition, to name a few. Why do these benefit from being semantic, any more so than taking another approach?</p>
<p>At best, the term is still very unclear about what it implies. &#8216;Semantic&#8217; should mean more than a language or a framework. It is an idea, and the idea should drive the research. Saying that something is semantic should imply something as clearly as saying that it is a proof by reduction, or a tangible user interface, or a static code analysis technique.</p>
<p><strong>Who&#8217;s the User? (And Why Would They Ever Use This?)</strong></p>
<p>That&#8217;s not a jab. It composes two specific critiques:</p>
<p>To solve important problems, you need to know who your users are. What are their problems? What biases and constraints do they bring to your system? This is true whether you&#8217;re composing web services or creating a Linux desktop. Semantic web technology&#8217;s greatest strength and greatest weakness is that it is very general. Too many projects focus on trying to help everybody; but too often, &#8220;everybody&#8221; is too vague to give you a good foothold, and it trends toward &#8220;semantic web-interested people&#8221;. This leads to many issues with the research, not the least of which is the <a href="http://swui.semanticweb.org/swui06/papers/Karger/Pathetic_Fallacy.html">Big Fat Graph solution</a> to every semantic web problem and the requirement that I manually author RDF triples. When you&#8217;re defining the problem, define the set of users! &#8220;Everybody&#8221; is too vague; start with some <a href="http://www.amazon.com/gp/product/0672326140/ref=pd_lpo_k2_dp_sr_1?pf_rd_p=486539851&amp;pf_rd_s=lpo-top-stripe-1&amp;pf_rd_t=201&amp;pf_rd_i=0672316498&amp;pf_rd_m=ATVPDKIKX0DER&amp;pf_rd_r=1KAN35XCZNJCDZESWKT7">personas or scenarios</a>, or build systems that aim at some subset of the world. This will give you the insights necessary to generalize back out to &#8220;everybody&#8221;.</p>
<p>Second, there are some serious questions about user motivation. The semantic web suffers from a real cold start problem &#8212; how to get all that data into linked format.  Again, no single motivator will work for everybody, so the resulting motivators are so general, or so tied to implicit semantic web assumptions, that few get off the ground. Nobody wants to sit and re-encode their data into semantic web format.  But given a real problem, and the promise of a solution that just so happens to involve RDF, it will happen.</p>
<p>This is why I think Semantic Web UIs is something of a misnomer. It&#8217;s like &#8220;Java Swing UIs&#8221; or &#8220;UIs based on a relational database backend and a PHP frontend&#8221;. The critical irony of a good semantic web UI is that there should be no indication that it&#8217;s semantic. You <em>could </em>do this using a standard database and data model, but it&#8217;s easier because it uses semantic web technologies. Again, the interface should flow from the problems, not from the data model (flexible as it is).</p>
<p>I&#8217;d love to hear a semantic web researcher&#8217;s critique of human-computer interaction. Or your thoughts on my thoughts&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/10/25/tales-of-a-semantic-web-skeptic/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Will the Namespace Traffic Jam Kill RDFa in HTML5?</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/09/21/will-the-namespace-traffic-jam-kill-rdfa-in-html5/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/09/21/will-the-namespace-traffic-jam-kill-rdfa-in-html5/#comments</comments>
		<pubDate>Mon, 21 Sep 2009 17:44:27 +0000</pubDate>
		<dc:creator>Edward Benson</dc:creator>
				<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Thought Piece]]></category>
		<category><![CDATA[Web Architectures]]></category>
		<category><![CDATA[HTML5]]></category>
		<category><![CDATA[Microdata]]></category>
		<category><![CDATA[RDFa]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=517</guid>
		<description><![CDATA[One of the most exciting aspects of the (in-progress) HTML5 specification is the number of data-centric features it contains. It&#8217;s almost as if the committee is saying a big, &#8220;OK, OK! We heard you!&#8221;  to all the data-heads out there and is providing not one, not two, not three, but four different ways to [...]]]></description>
			<content:encoded><![CDATA[<p>One of the most exciting aspects of the (in-progress) HTML5 specification is the number of data-centric features it contains. It&#8217;s almost as if the committee is saying a big, &#8220;OK, OK! We heard you!&#8221;  to all the data-heads out there and is providing not one, not two, not three, but four different ways to access and manage structured data from within the client browser:</p>
<ol>
<li><b>Data Attributes</b> are key-value pairs that may be added to any DOM node</li>
<li><b>Microdata</b> provides a way to interweave objects and object-properties amidst the DOM</li>
<li><b>RDFa</b> provides a way to interweave RDF amidst the DOM</li>
<li><b>Client-side Database Support</b> provides full relational data access from JavaScript (the spec says this will be SQL compliant, but in reality it will likely just be the SQLite subset of SQL).</li>
</ol>
<p>These are all great developments, and will no doubt bring about a lot of creativity about how data can be used on the client-side, but what interests me the most is <i>why the HTML5 working group felt the need to include Microdata alongside RDFa</i>. </p>
<p>The capabilities of HTML5 Microdata and RDFa are nearly identical, albeit with slightly different terminology. Both provide a way to embed data within HTML attributes and tag contents. Both allow for both named entities and blank nodes. And both allow for a variety of more complex constructions, such as lists and HREF property values. One of the only real differences, as far as I can tell from glancing over the specs, is that RDFa requires URIs whereas Microdata simply uses ordinary strings to reference entities and properties. And that is what worries me: one of the biggest benefits of RDF is its use of URIs, yet URIs seem to be exactly what is preventing the adoption of RDF. </p>
<p>One problem is probably that URIs look funny as data model elements, even to a programmer. <i>&#8220;A person has name&#8221;</i> is much more natural sounding than <i>&#8220;A http://csail.mit.edu/Contact#Person has a http://csail.mit.edu/Contact#name&#8221;</i>. We think of our code in natural language terms, and URIs obfuscate our real world metaphors. </p>
<p>Far more serious a problem is the <b>namespace traffic jam</b> that currently exists. If I want to publish an RDF document that describes this blog, for example, best practice would have me draw class types and property types from no fewer than <b>six</b> ontologies!</p>
<ul>
<li>The RDF ontology to describe object properties</li>
<li>The RDFS ontology to describe object classes and labels</li>
<li>The Dublin Core (DC) ontology to describe the titles, authors, and the like</li>
<li>The Friend of a Friend (FOAF) ontology to describe my contact information</li>
<li>The XSD ontology to describe literal dates, strings, and numbers</li>
<li>And yet another, custom, ontology to describe everything else particular to the blog</li>
</ul>
<p>That is already 6 ontologies, and we haven&#8217;t even raised the possibility of using OWL Time, Snap, Span, and GeoOWL for things like time and space description! Even for a semantic web developer, the complexity of managing all of these ontologies, and the namespaces that go with them, becomes pretty burdensome pretty quickly. </p>
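<p>To give a feel for the bookkeeping involved, here is a rough Python sketch of the prefix table a developer ends up carrying around just to name the properties of one blog post. The first five namespace URIs are the commonly published ones for those vocabularies; the <tt>blog</tt> namespace and the <tt>expand</tt> helper are hypothetical, invented only for this illustration.</p>

```python
# Prefix-to-namespace table for the six vocabularies listed above.
# The "blog" entry is a made-up placeholder for the custom ontology.
NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "blog": "http://example.org/blog-ontology#",
}


def expand(prefixed_name):
    """Expand a prefixed name such as 'dc:creator' into a full URI."""
    prefix, _, local = prefixed_name.partition(":")
    if prefix not in NAMESPACES:
        raise KeyError("unknown prefix: " + prefix)
    return NAMESPACES[prefix] + local


# Describing a single post already touches most of the table.
post_properties = ["rdf:type", "rdfs:label", "dc:creator", "foaf:name", "blog:category"]
for name in post_properties:
    print(name, "=", expand(name))
```

<p>Every one of those prefixes has to be declared, spelled correctly, and kept in sync across documents, which is the overhead that plain Microdata strings avoid.</p>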
<p>And that is why I worry about the future of RDFa in HTML5. It appears that the Microdata specification in HTML5 is essentially the RDF graph data model with the URIs neutered out. Given essentially the same data model, no doubt most developers will pick the easier of two formats to implement. </p>
<p>In order to get more people on the RDF bandwagon, we need to make the RDF path just as easy to follow as the Microdata one. How can this be done? If you ask me, the best way is to get rid of this namespace traffic jam and cultivate a set of community-oriented ontologies. </p>
<p>Rather than trying to create base ontologies that address abstract universal concepts, why not try to have each community standardize a single ontology for their particular domain? Have WordPress and Blogger sponsor the Blog Ontology. Have Amazon.com and eBay sponsor the Marketplace Ontology. Have Facebook and MySpace sponsor the Social Ontology. Then, instead of reusing bits from other ontologies, such as <tt>dc:creator</tt> or <tt>foaf:name</tt>, have each of these community-focused ontologies be self-sufficient, covering all the concepts necessary for their domain. We can always apply mapping rules to distinguish between <tt>social:name</tt> and <tt>store:book-author-name</tt> later. With only a single ontology per domain area to worry about, the namespace traffic jam will disappear and it will be easier for people to get on board with RDF and RDFa.  </p>
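<p>As a rough idea of what such after-the-fact mapping rules might look like, here is a small Python sketch. The two property names come from the paragraph above; the shared <tt>person-name</tt> concept, the <tt>normalize</tt> helper, and the sample records are assumptions made up for the example, not part of any existing ontology.</p>

```python
# Hypothetical mapping rules from community-specific properties
# to a shared concept, written down only once the need arises.
EQUIVALENT = {
    "social:name": "person-name",
    "store:book-author-name": "person-name",
}


def normalize(record):
    """Rename any mapped property to its shared concept; keep the rest as-is."""
    return {EQUIVALENT.get(key, key): value for key, value in record.items()}


# Made-up records from two self-sufficient community ontologies.
profile = {"social:name": "Ada Lovelace", "social:friend-count": 12}
listing = {"store:book-author-name": "Ada Lovelace", "store:price": "9.99"}

same_person = normalize(profile)["person-name"] == normalize(listing)["person-name"]
print(same_person)
```

<p>The publishers of each record only ever deal with one vocabulary; the cross-domain rule lives with whoever wants to combine the two.</p>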
<p>All in all, it seems the good news coming out of the HTML5 spec is that we can expect rich data annotation to soon be arriving to HTML content everywhere. But what we need to work on as a community is a way to make URIs, and the Ontologies that give them meaning, easier for programmers to use so that the web won&#8217;t just be full of data with Microdata, but full of <i>linked</i> data with RDFa. </p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/09/21/will-the-namespace-traffic-jam-kill-rdfa-in-html5/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>In Defense of a Semantic Web Wild West</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/09/14/in-defense-of-a-semantic-web-wild-west/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/09/14/in-defense-of-a-semantic-web-wild-west/#comments</comments>
		<pubDate>Mon, 14 Sep 2009 06:17:23 +0000</pubDate>
		<dc:creator>David Karger</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[PIM]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Web Architectures]]></category>
		<category><![CDATA[CSAIL]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=447</guid>
		<description><![CDATA[A month ago Stefano Mazzocchi published an interesting article on data reconciliation (detecting when two identifiers refer to the same item, and merging them) where he advocated a more centralized &#8220;a priori&#8221; approach (trying to keep the identifiers merged at the beginning).  I posted a response arguing the value of a more anarchic &#8220;a posteriori&#8221; [...]]]></description>
			<content:encoded><![CDATA[<p>A month ago Stefano Mazzocchi published an interesting <a title="Stefano's blog post" href="http://www.betaversion.org/~stefano/linotype/news/304/">article</a> on data reconciliation (detecting when two identifiers refer to the same item, and merging them) where he advocated a more centralized &#8220;a priori&#8221; approach (trying to keep the identifiers merged at the beginning).  I posted a <a title="my blog response" href="http://groups.csail.mit.edu/haystack/blog/2009/07/24/is-rdf-any-good-without-a-web-of-linked-data/">response</a> arguing the value of a more anarchic &#8220;a posteriori&#8221; approach where you let anyone create whatever identifiers and relations they want, and worry about detecting linkages later.   Stefano <a title="stefano blog response" href="http://www.betaversion.org/~stefano/linotype/news/311/">responded</a> to that, but by then I was busy chairing the submissions for the <a title="ISWC 2009 home page" href="http://iswc2009.semanticweb.org/">2009 International Semantic Web Conference</a>.   Now that that&#8217;s over (I hope you will attend what should be an interesting meeting&#8212;October 25-29 near Washington DC) I&#8217;d like to pick up the discussion again.</p>
<p>I argued in favor of letting individuals make their own RDF collections (using, for example, our <a href="http://www.simile-widgets.org/exhibit/">Exhibit</a> framework) and worry about merging them with other people&#8217;s data later.  Stefano&#8217;s response accused me of using &#8220;RDF&#8221; and &#8220;structured data&#8221; interchangeably, asserting Exhibit is really just a nice UI over spreadsheet (tabular) data&#8212;that although it can export RDF, it is &#8220;not properly using RDF&#8221; because it has &#8220;lost the notion of globally unique identifiers (and in that regard, is much more similar to <a href="http://en.wikipedia.org/wiki/Microsoft_Excel">Excel</a> than to <a href="http://www.w3.org/2005/ajar/tab">Tabulator</a>)&#8221;.  Tim Berners-Lee has made similar complaints to me about Exhibit not using RDF.</p>
<p>This argument highlights for me yet another important ambiguity about what RDF <em>is</em>.   I occasionally have to help people understand that RDF is a <em>model</em>, not a syntax.  That some data can be RDF even if it isn&#8217;t serialized to RDF/XML.  That the key is to have items named by URIs, connected by relations named by URIs.  Stefano&#8217;s argument suggests a different blurring: between the model and its intended use.  Stefano&#8217;s &#8220;not properly using&#8221; phrase implies that if you don&#8217;t intend to merge your data into the global namespace, then even if you implement the model and write it down as RDF/XML to boot, you won&#8217;t be &#8220;properly using RDF&#8221;.</p>
<p>I want to address both these claims: that Exhibit is just a UI over spreadsheets, and that using RDF this way isn&#8217;t proper.</p>
<p><strong>RDF and spreadsheets</strong></p>
<p>Regarding the spreadsheet claim, I&#8217;ll begin by admitting that Stefano is absolutely right:  Exhibit is a visualization tool for tabular (spreadsheet) data.  But notice that <em>all</em> RDF is spreadsheet data&#8212;I can take all the RDF in the world and throw it into one spreadsheet.  In fact, I only need three columns to contain the subject (tail), object (head), and predicate (link) for each RDF statement.  Admittedly none of today&#8217;s spreadsheets would have enough rows, but that&#8217;s an engineering detail.  So, the spreadsheet <em>model</em> isn&#8217;t the problem.   And we also agree that Exhibit&#8217;s <em>interface</em> is nothing like spreadsheets&#8217;, and far better for the collection visualization tasks it is designed for.</p>
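<p>The three-column claim is easy to sketch in Python. The <tt>example.org</tt> URIs and the two helper functions below are invented for illustration; the point is only that a list of RDF statements writes out as a three-column sheet, and pivots back into the familiar one-row-per-item layout.</p>

```python
import csv
import io

# Made-up statements; the example.org URIs are placeholders.
triples = [
    ("http://example.org/dance/1", "http://example.org/title", "Sample Dance A"),
    ("http://example.org/dance/1", "http://example.org/year", "1985"),
    ("http://example.org/dance/2", "http://example.org/title", "Sample Dance B"),
]


def triples_to_sheet(statements):
    """Write statements as CSV text with subject, predicate, object columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["subject", "predicate", "object"])
    writer.writerows(statements)
    return buffer.getvalue()


def pivot(statements):
    """Group statements into one dict per subject (the usual wide table).

    A repeated predicate on the same subject overwrites the earlier value;
    that is a simplification of this sketch, not a property of RDF.
    """
    rows = {}
    for subject, predicate, obj in statements:
        rows.setdefault(subject, {})[predicate] = obj
    return rows


sheet = triples_to_sheet(triples)
print(sheet)
print(pivot(triples))
```

<p>Nothing here depends on the identifiers being globally unique; the same code works if the subjects are just row labels in someone&#8217;s local file.</p>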
<p>I think instead that what Stefano is objecting to is a <em>usage</em> characteristic of spreadsheets versus RDF.  When I open a spreadsheet, the data it shows me is right there, in a file on my own system.  Global identifiers don&#8217;t matter because the data is all there (and presumably self-consistent) in the one spreadsheet.   In contrast, in Stefano&#8217;s image of RDF (and in Tim&#8217;s, as one can see from the Tabulator project) the data about a particular entity is spread all over the web, and it is the globally unique identifier that lets you go out, gather all that data together, and know that it is all about the same entity.</p>
<p>This is certainly an appealing vision.  But I want to argue that a focus on globally unique identifiers neglects two benefits of RDF that I consider equally important: <strong>data portability</strong> and <strong>schema flexibility</strong>.</p>
<p><strong>Spreadsheets suffice</strong></p>
<p>To illustrate this argument, I&#8217;ll hark back to a <a title="Hard data management blog post" href="../../2008/11/20/hard-information-management-that-should-have-been-easy/">previous post</a> where I discussed a data integration problem that should have been easy but wasn&#8217;t.   I keep an <a href="http://simile.mit.edu/exhibit/">Exhibit</a> of folk dance videos on the web.   Recently, Nissim Ben Ami posted a <a href="http://il.youtube.com/profile_videos?p=r&amp;user=NissimBenAmi&amp;page=1">collection</a> of 511 new dance videos on YouTube.  I wanted to incorporate it into my site.  But it quickly became apparent that said incorporation would basically require my entering all 511 video descriptions manually into my system, and I still haven&#8217;t gotten around to it.</p>
<p>The major barriers were twofold.  The first was syntactic: the structured descriptions of the videos were delivered as XML.   That meant that in order to get at the data, I was going to have to learn XSLT&#8212;something I&#8217;ve been putting off for years.   The second hurdle was semantic: YouTube has the wrong schema for my folk dance videos.  I care about choreographer, dance type, and year choreographed; YouTube only offers slots for submitter and submission date of the video.  So, as you can see from <a title="Matzlichim video" href="http://www.youtube.com/watch?v=PgbRwUqHsOM">this example</a>, the contributor takes the usual approach: he takes his nice structured data and shoves it into the generic comment (info) field as free text.  All that structure is instantly lost.</p>
<p>Suppose instead that spreadsheets (or, in a pinch, RDF) were the accepted framework for publishing information on the web.  The YouTube &#8220;spreadsheet&#8221; would contain submitter and submission date information, but Nissim could just add &#8220;artist&#8221; and &#8220;composition-date&#8221; columns to hold the data he wanted to enter.   I would then be in a great position to download his data and incorporate it into my own catalog (spreadsheet).  What would I have to do?  After opening his spreadsheet and mine, I&#8217;d have to match columns&#8212;perhaps he called his &#8220;artist&#8221; and &#8220;composition date&#8221; while mine are &#8220;choreographer&#8221; and &#8220;year&#8221;.  But a simple copy and paste fixes that discrepancy.  Merging entities is not much harder than merging properties: a simple global replace will convert his choreographer &#8220;Israel Ya&#8217;akovi&#8221; to my &#8220;Israel Yakovee&#8221;.  The local consistency of his data and mine means that I only have to work once per choreographer (and in most cases I won&#8217;t have to: there&#8217;s a standard spelling for almost every choreographer&#8217;s name, which serves as a unique identifier<em> in this context</em> even if it isn&#8217;t a URL).</p>
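<p>The merge just described can be sketched in a few lines of Python. The column names and the two spellings come from the paragraph above; the <tt>import_rows</tt> helper, the mapping tables, and the sample rows are hypothetical, standing in for the copy-paste and global replace a person would do by hand in a spreadsheet.</p>

```python
# Hand-written mappings, decided after looking at both sheets.
COLUMN_MAP = {"artist": "choreographer", "composition-date": "year"}
NAME_MAP = {"Israel Ya'akovi": "Israel Yakovee"}


def import_rows(their_rows):
    """Rename their columns to mine, then normalize choreographer spellings."""
    imported = []
    for row in their_rows:
        new_row = {COLUMN_MAP.get(column, column): value for column, value in row.items()}
        if "choreographer" in new_row:
            name = new_row["choreographer"]
            new_row["choreographer"] = NAME_MAP.get(name, name)
        imported.append(new_row)
    return imported


# Made-up sample rows; the titles and years are placeholders.
my_catalog = [
    {"title": "Sample Dance A", "choreographer": "Israel Yakovee", "year": "1985"},
]
his_sheet = [
    {"title": "Sample Dance B", "artist": "Israel Ya'akovi", "composition-date": "1990"},
]

my_catalog.extend(import_rows(his_sheet))
print(my_catalog)
```

<p>The two small dictionaries are the whole of the human effort: one entry per mismatched column, and one per choreographer whose spelling differs.</p>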
<p>Overall, my work has been reduced by an order of magnitude.  Instead of laboriously entering 511 new records, I just download a spreadsheet and match up a handful of properties (columns) and a few tens of choreographer names.</p>
<p>Stepping back, observe that I&#8217;ve relied on two things.   First, on <strong>data portability</strong>&#8212;my being able to download the data in a convenient form: not XML, which is a programmer&#8217;s friend but an end-user&#8217;s enemy; rather, something I can just look at and understand.  Second, on <strong>schema flexibility</strong>&#8212;on Nissim&#8217;s being able to add whatever columns/properties he decides are important, instead of being limited to those used on the hosting web application.</p>
<p>I&#8217;m also relying on some features of this particular scenario, but I believe they often hold.   I am relying on Nissim&#8217;s data having only a small number of properties so that I can map them manually to mine.   I also rely on there being a small number of choreographers, and hope to take advantage of most of them having matching names in his data and mine&#8212;these names certainly aren&#8217;t globally unique identifiers, but they are &#8220;unique enough&#8221; when considering just my data and his.  Critically, I am not thinking of pulling all data about a given dance from a multitude of different web sites&#8212;this would demand global unique identifiers to link data since I would never have the patience.  Rather, I am considering a pairwise data acquisition: taking data I want from one internally consistent site.</p>
<p>Such pairwise acquisition is commonplace: any time a scientist wants to pull a data set from some other scientist&#8217;s lab, or a consumer wants to download product information about several cameras from a review site, or a student wants to include a Wikipedia data set in a report they are writing, there is an obvious single source and target for a data merger.   And there&#8217;s a human being who has the incentive, and with the right tools the capability, to do the limited amount of work needed to accomplish that merger.</p>
<p>This is a simple low-hanging fruit argument.  It would be wonderful to be able to <em>automatically</em> merge data from <em>thousands</em> of different sources into a coherent whole.  And this is a problem Freebase will need to solve, if they want to become the hub for aggregation of structured data.  But right now we can&#8217;t even <em>manually</em> merge data from <em>two</em> sites without doing a ridiculous amount of grunt work&#8212;so perhaps we should give some attention to that easier problem on our way to solving the hard one.</p>
<p><strong>Don&#8217;t skip the wild west</strong></p>
<p>I&#8217;d like to see these efforts proceed in parallel, but I&#8217;m worried about enthusiasm for the more ambitious goal blocking movement toward the low-hanging fruit.  I recently submitted a proposal to NIH on the topic of data integration that reflected my perspective above.  I argued that the current efforts in the Biology community to force everyone to adopt a common ontology (and sometimes repository) for their experimental data are being resisted by biologists who think they know best how to present their data.  I suggested as an alternative that we give biologists tools, such as Exhibit, that would encourage them to publish their data in a common structured syntax, and worry about integrating all that data <em>after</em> it has become available in structured form.  The proposal rejection was accompanied by a review that said, on the one hand, &#8220;The benefit of the proposed approach is that it is very different from some multi-institutional data sharing projects (like caBIG), which have used a very rigid, top-down approach to creating semantics. Even if this project is unsuccessful it could bring to light new ideas and strategies that might make those large-scale projects more responsive to investigators and more successful.&#8221;  At the same time, it argued for rejection because &#8220;The absence of any control over the information models and ontologies – truly a semantic wild west – is daring and may ultimately be the downfall of this project.&#8221;</p>
<p>I&#8217;m fascinated to see, in the same review, a recognition of the problems that the current centralized approach is bringing (lack of buy-in to common ontologies by individual scientists who think they know better and probably do), and an unwillingness to tolerate the contrary (anarchic) solution.  I also love the metaphor of the &#8220;semantic wild west&#8221; because I think it supports my argument.  Would anyone have suggested establishing a city of several million people just after the west was opened for settlement?  The west&#8217;s early wildness was an unavoidable phase of its evolution towards the thickly settled and uniformly governed area it is now.    In the same vein, I think that our semantic web is best grown by encouraging individual semantic-web settlers to create their own data homesteads and begin looking for the trails that connect them to neighboring collections.  We need to get the data into plain view first.   Later we can send in the data sheriffs and place all those data sets under uniform governance.</p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/09/14/in-defense-of-a-semantic-web-wild-west/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Talk: Community-based ontology development, alignment, and evaluation</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/07/27/talk-community-based-ontology-development-alignment-and-evaluation/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/07/27/talk-community-based-ontology-development-alignment-and-evaluation/#comments</comments>
		<pubDate>Mon, 27 Jul 2009 19:23:10 +0000</pubDate>
		<dc:creator>David Karger</dc:creator>
				<category><![CDATA[Collective Intelligence]]></category>
		<category><![CDATA[Databases]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[CSAIL]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=410</guid>
		<description><![CDATA[Natasha Noy gave a talk at CSAIL with the above title.  She works with a large medical bioinformatics group at Stanford.  The bioinformatics community in general couldn&#8217;t care less about cool computer science but is one of the few groups that have heavily adopted formal ontologies as a way to get their work done.  [...]]]></description>
			<content:encoded><![CDATA[<p>Natasha Noy gave a talk at CSAIL with the above title.  She works with a <a href="http://protege.stanford.edu/">large medical bioinformatics group</a> at Stanford.  The bioinformatics community in general couldn&#8217;t care less about cool computer science but is one of the few groups that have heavily adopted formal ontologies as a way to get their work done.  They have tons of data partitioned over many silos.   Biologists have adopted ontologies to provide canonical representations of scientific knowledge, or to annotate data to let others make use of it.  Often, the annotation is done not by the data&#8217;s authors but by curators or automatic tools.</p>
<p>There are now hundreds of ontologies with tens of thousands of terms.  However, it has always been a &#8220;cottage industry&#8221;&#8212;various groups develop their own ontologies, then publish them for use by others.  Is there a way to open the development of the ontology up to the community?  The community might be just a few people or thousands.</p>
<p>As an example, the Gene Ontology (28K terms) has 3 full-time curators.  People from the community submit requests to an issue tracker to get new terms, etc.  A new version is released daily.  In contrast, the NCI Thesaurus (for cancer) has 20 full-time editors with 1 lead editor who runs everything, and a slow cycle of &#8220;releases&#8221; with less community input.  Others work like typical open source projects with 20-30 team members involved in active discussions.</p>
<p>Natasha&#8217;s group builds on Protege, a very old open source ontology editor that is now one of the most popular, with 120,000 registered users.  It has a very open plugin architecture with dozens of plugins for visualization, import, export, NLP, and lots of unknowns.  They&#8217;ve been working to augment Protege with support for collaboration.  It works in a distributed fashion (desktop and web clients).  It supports simultaneous editing, but also annotation, discussion, proposals and voting in the context of the ontology.  There are many types of annotations&#8212;questions, comments, proposals&#8212;on any element of the ontology&#8212;classes, properties, instances.   While the tool handles most types of structured data, it is focused on taxonomic hierarchies where stuff gets inherited down the hierarchy.</p>
<p>They investigated use of their tool for several tasks.  One is ontology evaluation&#8212;finding existing ontologies that might be useful for you.  One source of information for this is author-contributed metadata about the ontologies&#8212;domain, key classes and concepts, who the developer is, etc. Another is automatic tools that compute quality metrics, and a third is annotations by other users of the ontologies.</p>
<p>This last is important because some ontology metrics are subjective&#8212;a feature that is &#8220;good&#8221; in one setting can be awful in another.  An example might be a high level of axiomatization.  This is important for inference, but creates clutter if you just want description.  There&#8217;s also the problem of crosscutting taxonomies&#8212;you might have two different ways of describing the same domain that form a &#8220;matrix&#8221; of non-overlapping hierarchies.  To address this sort of subjectivity, they allow users to record evaluations of ontologies.</p>
<p>These tools can be explored at their <a title="Bioportal web site" href="http://bioportal.bioontology.org/">BioPortal web site</a> where they have a large library of biomedical ontologies.  On that site, users can describe their ontology-based projects, and list/review the ontologies they are using.  Reviewers give general reviews, usage information, problems encountered, coverage of the key terms, major gaps, and issues with specific elements of the ontology.  This site aims to make ontology evaluation/creation a truly democratic process.  This is controversial&#8212;some argue that ontologies need a more rigorous editorial process (mirroring a current debate about open vs. traditional journal publication).</p>
<p>Another big task is mapping: connecting two ontologies by asserting that terms in two different ontologies &#8220;match&#8221;.  They aren&#8217;t trying to find mappings, but want to enable others to upload the mappings they have found.  Mappings can be created manually or uploaded in bulk (if computed by someone&#8217;s tool).  Mappings are themselves metadata, which can be annotated and discussed just like other data in the ontology.</p>
<p>Of course a big question is whether people will use these tools.  Right now, many users are asking for these features and reporting lots of bugs&#8212;good signs of demand.</p>
<p>A lot of questions have now arisen that need some serious user studies&#8212;what are the dynamics of the social networks that form around collaborative ontologies?  What are the different types of users/editors?  What produces the most discussion/controversy?  Do these tools help or hinder collaboration?</p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/07/27/talk-community-based-ontology-development-alignment-and-evaluation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Is RDF any good without a web of linked data?</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/07/24/is-rdf-any-good-without-a-web-of-linked-data/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/07/24/is-rdf-any-good-without-a-web-of-linked-data/#comments</comments>
		<pubDate>Fri, 24 Jul 2009 05:39:10 +0000</pubDate>
		<dc:creator>David Karger</dc:creator>
				<category><![CDATA[PIM]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Web Architectures]]></category>
		<category><![CDATA[CSAIL]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=403</guid>
		<description><![CDATA[Stefano Mazzocchi used to work at our SIMILE project here at MIT, where we explored the use of RDF and Semantic Web tools for the sharing of knowledge.  He has since gone to work at Metaweb and, it seems, become much more friendly to their &#8220;top down&#8221; approach of trying to create a centralized repository [...]]]></description>
			<content:encoded><![CDATA[<p>Stefano Mazzocchi used to work at our <a title="Simile Project web site" href="http://simile.mit.edu/">SIMILE project</a> here at MIT, where we explored the use of RDF and Semantic Web tools for the sharing of knowledge.  He has since gone to work at <a href="http://www.metaweb.com/">Metaweb</a> and, it seems, become <a href="http://www.betaversion.org/~stefano/linotype/news/304/">much more friendly</a> to their &#8220;top down&#8221; approach of trying to create a <a href="http://www.freebase.com/">centralized repository</a> of structured data with consistent identifiers, as opposed to letting that data grow all over the place any which way and get <a href="http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData/">linked together afterwards</a>.  In particular, he argues for the critical importance of <em>relational density</em> in the data.  His point is that when there are many distinct, unlinked identifiers for the same object, then what one person says about one of those identifiers (&#8220;Chicago&#8221;) won&#8217;t be visible to someone looking at a different identifier (&#8220;the Windy City&#8221;).  He opines that &#8220;without it [relational density] there would be very little value in it compared to what traditional search engines are already doing&#8221;.</p>
<p>Being argumentative by nature, I wanted to highlight some of the benefits of the looser, sloppier approach to data sharing that we took for SIMILE.   Obviously, being able to link data from multiple sources, and feed it into a search engine as Stefano describes, is a great thing.  But there are some tremendous advantages that accrue when even a single individual decides to create a blob of structured data <em>with no reference to anyone else&#8217;s</em>.</p>
<p>The first is interaction.  As shown with our <a href="http://www.simile-widgets.org/exhibit/">Exhibit framework</a> (created by <a href="http://davidhuynh.net/">David Huynh</a>, now also at Metaweb), structured data enables rich visualization.  If my data objects have coordinates, I can plot them on a map.  If they have dates, I can put them on a timeline.  If they have colors, I can filter or sort by color.  It doesn&#8217;t matter if I call those properties latitude, longitude, date and color, or northSouth, eastWest, sinceTheCreation and elementOfTheRainbow, and whether I decide that my city is Chicago or the Windy City&#8212;as long as I have my own internally consistent names for them, I can use them to hook my data into interesting visualizations and interactions.</p>
<p>The second benefit is portability.  If I publish some interesting data as part of an HTML document, then anyone who wants to use that data for something else&#8212;to rebut my argument, to mash it up with some other data, to put it to some use I never thought of&#8212;has the unpleasant job of <a href="http://en.wikipedia.org/wiki/Web_scraping">scraping</a> said data out of the HTML into a usable form.  This generally requires a programmer, and even for them it&#8217;s a tedious task that distracts them, and may deter them, from what they really want to do with the data.  But if that data is published as data&#8212;even in something as old-fashioned as a spreadsheet&#8212;it becomes way easier to grab it and reuse it.  Look at how much of the blogosphere is made up of cross-references, trackbacks, and responses to other blog postings.  If you&#8217;re going to argue about something involving data&#8212;for example, whether a single payer system is going to end up saving or costing money, or whether <a title="Perfect Game story" href="http://sportsillustrated.cnn.com/2009/baseball/mlb/07/23/buehrle.cnn/index.html?cnn=yes">today&#8217;s perfect game</a> is all that unusual&#8212;you probably want to publish that data to support your argument.  At which point, someone who wants to refute your argument is going to want to use that same data.  That&#8217;s going to be a lot easier if they can get that data from your posting.  That&#8217;s the theory behind our <a href="http://projects.csail.mit.edu/datapress/">Datapress</a> project, which aims to let you post data sets (and visualizations of them) in your WordPress blog, and lets other people refer to and reuse that data.  In that sort of one-on-one debate over data, it really doesn&#8217;t matter whether I use the same identifiers as Freebase&#8212;you can take my identifiers and use them to build your rebuttal.</p>
<p>Uniformity does start to matter when someone wants to mash up data from multiple sources.  If those sources haven&#8217;t agreed on identifiers beforehand, then the masher has some work ahead&#8212;this is a case where a centralized vocabulary is really helpful.  But again, getting the data <em>at all</em> is such a big jump over the current state of affairs&#8212;imagine how grateful mashup makers would be if all they had to do was merge some identifiers instead of retyping a whole spreadsheet from scratch.  The point here is that unlike Stefano&#8217;s hypothetical search engine, which wants to issue a query against all the world&#8217;s data at once, your typical mashup author just needs to deal with a couple of (probably small) data sets.  His or her <em>particular</em> data integration problem is quite manageable <em>a posteriori</em>.</p>
<p>I&#8217;ll also dust off an argument David Huynh once made to me, even if it might get him in trouble with his current employer.   Unification is not an absolute, but contextual.  Whether two things are the same may change depending on what you are doing with them.   Continuing my never-before-attempted forays into sports analogies, are the Brooklyn Dodgers the same as the L.A. Dodgers?  If you want to talk about the team that moved from Brooklyn to LA, the answer must be yes!  But in a different context you might be interested in comparing the lifetime records of these two distinct teams.  (In fact, Freebase tries to have it both ways: it asserts that the <a title="Brooklyn Dodgers on Freebase" href="http://www.freebase.com/view/guid/9202a8c04000641f800000000ad5a169">Brooklyn Dodgers</a> were &#8220;later known as&#8221; the <a title="LA Dodgers on Freebase" href="http://www.freebase.com/view/en/los_angeles_dodgers">Los Angeles Dodgers</a> (implying they are the same team with a name change) but asserts that the Los Angeles Dodgers were founded in 1958, which clearly isn&#8217;t true of the Brooklyn Dodgers that folded in &#8217;57.)</p>
<p>This is obviously one of those half-empty, half-full debates:  We both recognize the value of both approaches, but are compelled by different aspects.  Stefano looks at the amazing things that could be done with a single consistent data universe, and worries about how to create it.  I look at the amazing things that can already be done with a host of disjoint but internally consistent data microverses, and find that compelling enough to allay any worry about whether we&#8217;ll ever need <a title="Freebase Lion Article" href="http://www.freebase.com/view/en/lion">http://www.freebase.com/view/en/lion</a> to unify with <a title="Wikipedia Lamb Article" href="http://en.wikipedia.org/wiki/Lamb">http://en.wikipedia.org/wiki/Lamb</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/07/24/is-rdf-any-good-without-a-web-of-linked-data/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
	</channel>
</rss>
