<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Haystack Blog &#187; Web Architectures</title>
	<atom:link href="http://groups.csail.mit.edu/haystack/blog/category/web-architectures/feed/" rel="self" type="application/rss+xml" />
	<link>http://groups.csail.mit.edu/haystack/blog</link>
	<description>MIT CSAIL Research</description>
	<lastBuildDate>Tue, 24 Nov 2009 04:05:39 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>How Safari and Firefox handle HTML 5 Manifest files</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/09/26/how-safari-and-firefox-handle-html-5-manifest-files/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/09/26/how-safari-and-firefox-handle-html-5-manifest-files/#comments</comments>
		<pubDate>Sat, 26 Sep 2009 18:49:16 +0000</pubDate>
		<dc:creator>Edward Benson</dc:creator>
				<category><![CDATA[Web Architectures]]></category>
		<category><![CDATA[HTML5]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=567</guid>
		<description><![CDATA[I was doing some experiments with Adam in the lab on Friday, and we discovered some interesting variations in the way that Firefox and Safari implement the HTML 5 Cache Manifest specification. I think this is a particularly important feature to have implemented consistently across platforms because it is the make-or-break feature of HTML5 that [...]]]></description>
			<content:encoded><![CDATA[<p>I was doing some experiments with Adam in the lab on Friday, and we discovered some interesting variations in the way that Firefox and Safari implement the HTML 5 Cache Manifest specification. I think this is a particularly important feature to have implemented consistently across platforms because it is the make-or-break feature of HTML5 that will permit web applications to function offline. </p>
<h3>First, what is the manifest?</h3>
<p>For people who haven&#8217;t heard about this feature before, the manifest is essentially a special file that lists the portions of a web site that should be cached locally for offline access. This is the feature of HTML 5 that will standardize the type of &#8220;airplane mode&#8221; access that Gmail users have with Google&#8217;s custom Gears plugin.  </p>
<p>The manifest is served as a regular old file, with MIME type <i>text/cache-manifest</i>, and is linked from the <i>html</i> tag itself, as follows:</p>
<pre name="code" class="html">

&lt;html manifest="site.manifest"&gt;

..

&lt;/html&gt;
</pre>
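<p>To make the file itself concrete, here is a minimal Ruby sketch that builds the text of a manifest. It assumes the section syntax from the draft spec (a first line of <tt>CACHE MANIFEST</tt>, then optional <tt>CACHE:</tt> and <tt>NETWORK:</tt> sections); the helper name <tt>build_manifest</tt> and the file names are hypothetical and are not taken from my test project.</p>
<pre name="code" class="ruby">

# Build the text of a cache manifest from lists of paths.
# The first line of a manifest must be exactly "CACHE MANIFEST".
def build_manifest(cached_paths, network_paths = [])
  lines = ["CACHE MANIFEST", "", "CACHE:"]
  lines.concat(cached_paths)
  unless network_paths.empty?
    # Paths listed under NETWORK: are always fetched from the server.
    lines.push("", "NETWORK:")
    lines.concat(network_paths)
  end
  lines.join("\n") + "\n"
end

# Hypothetical file names, for illustration only.
manifest = build_manifest(["index.html", "app.js", "style.css"], ["/api/status"])
puts manifest
</pre>
<p>Whatever generates the file, the result still has to be served with the <i>text/cache-manifest</i> MIME type mentioned above.</p>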
<p>Once a web site is marked as being cached, then the browser will use the local cached copy of all the files specified in the manifest instead of attempting to load them from the internet. Say you&#8217;re on an airplane and type in the URL for <tt>http://my_cached_site.com</tt>. The browser will recognize it as a cached one, load it from its local storage instead, and then use a new JavaScript API to inform the web site that it is running in offline mode.  </p>
<p>So now for the important part, how do these two browsers (Firefox and Safari) handle this file?</p>
<h3>Firefox</h3>
<p>Upon loading an HTML5 document with a manifest attached, Firefox first asks permission to cache the site offline before requesting the manifest file from the server. Here is how the toolbar looks on my browser:</p>
<p><a href="http://groups.csail.mit.edu/haystack/blog/wordpress/wp-content/uploads/2009/09/cache_ffox_permission.png"><img src="http://groups.csail.mit.edu/haystack/blog/wordpress/wp-content/uploads/2009/09/cache_ffox_permission-300x133.png" alt="cache_ffox_permission" title="cache_ffox_permission" width="300" height="133" class="aligncenter size-medium wp-image-569" /></a></p>
<p>And here is the server log (I&#8217;m using a Rails project to test this) to show that the manifest was not yet loaded:</p>
<p><a href="http://groups.csail.mit.edu/haystack/blog/wordpress/wp-content/uploads/2009/09/cache_firefox_first_load.png"><img src="http://groups.csail.mit.edu/haystack/blog/wordpress/wp-content/uploads/2009/09/cache_firefox_first_load-300x83.png" alt="cache_firefox_first_load" title="cache_firefox_first_load" width="300" height="83" class="aligncenter size-medium wp-image-571" /></a></p>
<p>If you choose to allow offline caching, the browser then requests the manifest file, as can be seen in this screenshot.</p>
<p><a href="http://groups.csail.mit.edu/haystack/blog/wordpress/wp-content/uploads/2009/09/cache_ffox_after_perm.png"><img src="http://groups.csail.mit.edu/haystack/blog/wordpress/wp-content/uploads/2009/09/cache_ffox_after_perm-300x83.png" alt="cache_ffox_after_perm" title="cache_ffox_after_perm" width="300" height="83" class="aligncenter size-medium wp-image-568" /></a></p>
<p>Now here&#8217;s the cool thing: I set the headers on the manifest file so that the manifest itself is also cached on the client side:</p>
<pre name="code" class="ruby">

        # Allow the client to cache the manifest file itself.
        headers["Expires"] = "Sat, 30 Oct 2010 14:19:41 GMT"
        headers["Cache-Control"] = "max-age=3600, must-revalidate"
</pre>
<p>The result is that on subsequent loads, <b>Firefox requests no files at all</b>: it operates entirely offline. Notice the completely empty server log as I reload the site 2..n times. </p>
<p><a href="http://groups.csail.mit.edu/haystack/blog/wordpress/wp-content/uploads/2009/09/cache_ffox_second_load.png"><img src="http://groups.csail.mit.edu/haystack/blog/wordpress/wp-content/uploads/2009/09/cache_ffox_second_load-300x83.png" alt="cache_ffox_second_load" title="cache_ffox_second_load" width="300" height="83" class="aligncenter size-medium wp-image-570" /></a></p>
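<p>For readers who want to reproduce the header setup outside of Rails, here is a rough, framework-free sketch. It returns a Rack-style <tt>[status, headers, body]</tt> triple; the method name <tt>serve_manifest</tt>, the manifest contents, and the choice to make <tt>Expires</tt> match the one-hour <tt>max-age</tt> are all assumptions made for illustration, not what my test project does.</p>
<pre name="code" class="ruby">

require "time"

# Hypothetical manifest contents, for illustration only.
MANIFEST_BODY = "CACHE MANIFEST\nindex.html\n"

# Return a Rack-style [status, headers, body] triple for the manifest.
def serve_manifest(now = Time.now.utc)
  headers = {}
  headers["Content-Type"] = "text/cache-manifest"
  # Time#httpdate produces a valid HTTP date, weekday included.
  headers["Expires"] = (now + 3600).httpdate
  headers["Cache-Control"] = "max-age=3600, must-revalidate"
  [200, headers, [MANIFEST_BODY]]
end

status, headers, body = serve_manifest(Time.utc(2009, 9, 26, 18, 0, 0))
puts status
puts headers["Expires"]
</pre>
<p>Generating the date with <tt>httpdate</tt> rather than typing it by hand avoids mismatched weekday and date fields.</p>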
<h3>Safari</h3>
<p>Now let&#8217;s look at how Safari does it. Upon loading the web page, Safari also does not load the manifest file, as can be seen from this screen shot:</p>
<p><a href="http://groups.csail.mit.edu/haystack/blog/wordpress/wp-content/uploads/2009/09/cache_safari_first_load.png"><img src="http://groups.csail.mit.edu/haystack/blog/wordpress/wp-content/uploads/2009/09/cache_safari_first_load-300x111.png" alt="cache_safari_first_load" title="cache_safari_first_load" width="300" height="111" class="aligncenter size-medium wp-image-572" /></a></p>
<p>However, it also does not ask any questions about offline access. The <i>next</i> time I load the web page, something strange happens. Safari checks the manifest file <i>twice</i> and then doesn&#8217;t load the actual HTML page (because it doesn&#8217;t have to). The double-loading of the manifest file appears to be on the second page load, not split 1/1 between the page departure and subsequent reload. A little strange, if you ask me. </p>
<p><a href="http://groups.csail.mit.edu/haystack/blog/wordpress/wp-content/uploads/2009/09/cache_safari_second_load.png"><img src="http://groups.csail.mit.edu/haystack/blog/wordpress/wp-content/uploads/2009/09/cache_safari_second_load-300x136.png" alt="cache_safari_second_load" title="cache_safari_second_load" width="300" height="136" class="aligncenter size-medium wp-image-573" /></a></p>
<p>Furthermore, when I reload the page, Safari reloads the manifest file, despite the HTTP headers specifying that the manifest should be cached. At least it loads it only once on each subsequent reload:</p>
<p><a href="http://groups.csail.mit.edu/haystack/blog/wordpress/wp-content/uploads/2009/09/cache_safari_third_load.png"><img src="http://groups.csail.mit.edu/haystack/blog/wordpress/wp-content/uploads/2009/09/cache_safari_third_load-300x83.png" alt="cache_safari_third_load" title="cache_safari_third_load" width="300" height="83" class="aligncenter size-medium wp-image-574" /></a></p>
<h3>Conclusion</h3>
<p>I&#8217;m no spec-master, but it seems like Firefox&#8217;s implementation of this feature is what I would want to happen as a web architect, while Safari&#8217;s behavior seems a bit strange. </p>
<p>Firefox:</p>
<ol>
<li>Only loads the web page once </li>
<li>Asks the user for permission to enter offline mode</li>
<li>Only downloads the manifest file once if given permission</li>
<li>Then obeys HTTP Cache Control headers to suppress reloading the manifest file on future loads</li>
</ol>
<p>For Safari to behave like this as well, a few fixes would need to be implemented. Namely: </p>
<ol>
<li>Ask the user if offline access should be allowed</li>
<li>Load the manifest when the user loads the page the first time (and approves offline mode), not the second time, when the user might be on an airplane</li>
<li>Stop loading the manifest file multiple times in a single page load</li>
<li>Start obeying the HTTP cache headers so that <i>zero</i> web connections are necessary if the cache says so</li>
</ol>
<p>Safari&#8217;s Manifest handling quirks aside, both browser teams should be applauded for so aggressively implementing the HTML5 spec. It is a real treat as someone researching web platforms to get to test the in-progress spec on real browsers instead of just talking about what might eventually happen down the road. </p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/09/26/how-safari-and-firefox-handle-html-5-manifest-files/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>eyebrowse &#124; update and user reactions</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/09/22/eyebrowse-update-and-user-reactions/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/09/22/eyebrowse-update-and-user-reactions/#comments</comments>
		<pubDate>Tue, 22 Sep 2009 06:48:15 +0000</pubDate>
		<dc:creator>Brennan Moore</dc:creator>
				<category><![CDATA[Search]]></category>
		<category><![CDATA[User Interfaces]]></category>
		<category><![CDATA[Web Architectures]]></category>
		<category><![CDATA[Information visualization]]></category>
		<category><![CDATA[life-tracking]]></category>
		<category><![CDATA[real-time]]></category>
		<category><![CDATA[social browsing]]></category>
		<category><![CDATA[temporal data]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=530</guid>
		<description><![CDATA[
Today, we rely increasingly on the Web for a multitude of everyday activities that run the gamut from simple queries to complex social interactions. As a result, our browsing patterns are starting to reflect the intricate and multi-faceted nature of our daily lives, but web browsers retain little of the nuanced richness of this information [...]]]></description>
			<content:encoded><![CDATA[
<p style="text-align: left">Today, we rely increasingly on the Web for a multitude of everyday activities that run the gamut from simple queries to complex social interactions. As a result, our browsing patterns are starting to reflect the intricate and multi-faceted nature of our daily lives, but web browsers retain little of the nuanced richness of this information beyond simple &#8220;page histories&#8221; of previously visited sites. Analytics providers such as Google and Alexa regularly collect statistics of browsing activity, but such analytics are site-centric and not clustered around individual end-users. Moreover, despite the social nature of web browsing, individuals have little awareness of what others are looking at and how often; while sites like del.icio.us facilitate social exploration, they focus on what people choose to share rather than on their actual habits.</p>
<p style="text-align: left">When we created <a href="http://eyebrowse.csail.mit.edu">Eyebrowse</a>, we sought to allow users to capture their web browsing activity to examine whether it could help them better understand how they and their friends use the web. Specifically, Eyebrowse lets people examine long-term patterns in their web browsing activity; facilitates sharing, comparison, and increased social awareness of browsing patterns among friends; and, finally, aims to form a public, democratized corpus of web browsing data for the research community. So, how much are people willing to share, and how does sharing impact the web browsing experiences and habits of the individual?</p>
<p style="text-align: left">After three weeks, we have over 200 users sharing selected portions of their web browsing activity. We surveyed some of them and found that public web browsing was most useful to them for seeing socially derived information in the context of their own web browsing activity and for viewing other users&#8217; profiles for the purposes of social awareness and information discovery. Almost all users reported social- or work-related privacy concerns, and their comments indicated a fear of being misrepresented by their web browsing activity. To help cope with this, we are considering implementing a &#8216;greylist&#8217; that would hide specific page titles but still track overall activity, as well as multiple whitelists, such as one for home and one for work.</p>
<p><img class="size-full wp-image-552 aligncenter" src="http://groups.csail.mit.edu/haystack/blog/wordpress/wp-content/uploads/2009/09/eyebrowse_plugin.jpg" alt="eyebrowse_plugin" width="306" height="288" /></p>
<p>We plan to continue to grow Eyebrowse into a service that supports social browsing through collaborative filtering and other crowd-sourcing techniques, promotes users&#8217; awareness of the patterns in their own browsing activities, and provides researchers with useful web browsing data without violating the users&#8217; privacy sensibilities.</p>
<p>Thanks to all our users and supporters! We were recently featured on <a href="http://infosthetics.com/archives/2009/09/eyebrowse_record_visualize_and_share_your_browser_history.html">infosthetics</a>!</p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/09/22/eyebrowse-update-and-user-reactions/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Will the Namespace Traffic Jam Kill RDFa in HTML5?</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/09/21/will-the-namespace-traffic-jam-kill-rdfa-in-html5/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/09/21/will-the-namespace-traffic-jam-kill-rdfa-in-html5/#comments</comments>
		<pubDate>Mon, 21 Sep 2009 17:44:27 +0000</pubDate>
		<dc:creator>Edward Benson</dc:creator>
				<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Thought Piece]]></category>
		<category><![CDATA[Web Architectures]]></category>
		<category><![CDATA[HTML5]]></category>
		<category><![CDATA[Microdata]]></category>
		<category><![CDATA[RDFa]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=517</guid>
		<description><![CDATA[One of the most exciting aspects of the (in-progress) HTML5 specification is the number of data-centric features it contains. It&#8217;s almost as if the committee is saying a big, &#8220;OK, OK! We heard you!&#8221;  to all the data-heads out there and is providing not one, not two, not three, but four different ways to [...]]]></description>
			<content:encoded><![CDATA[<p>One of the most exciting aspects of the (in-progress) HTML5 specification is the number of data-centric features it contains. It&#8217;s almost as if the committee is saying a big, &#8220;OK, OK! We heard you!&#8221;  to all the data-heads out there and is providing not one, not two, not three, but four different ways to access and manage structured data from within the client browser:</p>
<ol>
<li><b>Data Attributes</b> are key-value pairs that may be added to any DOM node</li>
<li><b>Microdata</b> provides a way to interweave objects and object-properties amidst the DOM</li>
<li><b>RDFa</b> provides a way to interweave RDF amidst the DOM</li>
<li><b>Client-side Database Support</b> provides full relational data access from JavaScript (the spec says this will be SQL compliant, but in reality it will likely just be the SQLite subset of SQL).</li>
</ol>
<p>These are all great developments, and will no doubt bring about a lot of creativity about how data can be used on the client-side, but what interests me the most is <i>why the HTML5 working group felt the need to include Microdata alongside RDFa</i>. </p>
<p>The capabilities of HTML5 Microdata and RDFa are nearly identical, albeit with slightly different terminology. Both provide a way to embed data within HTML attributes and tag contents. Both allow for named entities as well as blank nodes. And both allow for a variety of more complex constructions, such as lists and HREF property values. One of the only real differences, as far as I can tell from glancing over the specs, is that RDFa requires URIs whereas Microdata simply uses ordinary strings to reference entities and properties. And that is what worries me: one of the biggest benefits of RDF is its use of URIs, yet URIs seem to be exactly what is preventing the adoption of RDF. </p>
<p>One problem is probably that URIs look funny as data model elements, even to a programmer. <i>&#8220;A person has name&#8221;</i> is much more natural sounding than <i>&#8220;A http://csail.mit.edu/Contact#Person has a http://csail.mit.edu/Contact#name&#8221;</i>. We think of our code in natural language terms, and URIs obfuscate our real world metaphors. </p>
<p>A far more serious problem is the <b>namespace traffic jam</b> that currently exists. If I want to publish an RDF document that describes this blog, for example, best practice would have me draw class types and property types from no fewer than <b>six</b> ontologies!</p>
<ul>
<li>The RDF ontology to describe object properties</li>
<li>The RDFS ontology to describe object classes and labels</li>
<li>The Dublin Core (DC) ontology to describe the titles, authors, and the like</li>
<li>The Friend of a Friend (FOAF) ontology to describe my contact information</li>
<li>The XSD ontology to describe literal dates, strings, and numbers</li>
<li>And yet another, custom, ontology to describe everything else particular to the blog</li>
</ul>
<p>That is already 6 ontologies, and we haven&#8217;t even raised the possibility of using OWL Time, Snap, Span, and GeoOWL for things like time and space description! Even for a semantic web developer, the complexity of managing all of these ontologies, and the namespaces that go with them, becomes pretty burdensome pretty quickly. </p>
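<p>To give a feel for the bookkeeping involved, here is a small Ruby sketch that expands prefixed names into full URIs for the six vocabularies listed above. The first five namespace URIs are the published ones; the <tt>blog</tt> namespace, the <tt>expand</tt> helper, and the sample terms are made up for illustration.</p>
<pre name="code" class="ruby">

# Prefix table a publisher has to carry around just to describe a blog.
PREFIXES = {
  rdf:  "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
  rdfs: "http://www.w3.org/2000/01/rdf-schema#",
  dc:   "http://purl.org/dc/elements/1.1/",
  foaf: "http://xmlns.com/foaf/0.1/",
  xsd:  "http://www.w3.org/2001/XMLSchema#",
  blog: "http://example.org/blog-ontology#",  # hypothetical custom ontology
}

# Expand a prefixed name such as "dc:creator" into a full URI.
# Raises KeyError if the prefix is not in the table.
def expand(curie)
  prefix, local = curie.split(":", 2)
  PREFIXES.fetch(prefix.to_sym) + local
end

%w[rdf:type rdfs:label dc:creator foaf:name xsd:dateTime blog:category].each do |term|
  puts "#{term} = #{expand(term)}"
end
</pre>
<p>Every prefix in that table is one more thing an author has to look up, declare, and keep consistent across documents.</p>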
<p>And that is why I worry about the future of RDFa in HTML5. It appears that the Microdata specification in HTML5 is essentially the RDF graph data model with the URIs neutered out. Given essentially the same data model, no doubt most developers will pick the easier of two formats to implement. </p>
<p>In order to get more people on the RDF bandwagon, we need to make the RDF path just as easy to follow as the Microdata one. How can this be done? If you ask me, the best way is to get rid of this namespace traffic jam and cultivate a set of community-oriented ontologies. </p>
<p>Rather than trying to create base ontologies that address abstract universal concepts, why not try to have each community standardize a single ontology for its particular domain? Have WordPress and Blogger sponsor the Blog Ontology. Have Amazon.com and eBay sponsor the Marketplace Ontology. Have Facebook and MySpace sponsor the Social Ontology. Then, instead of reusing bits from other ontologies, such as <tt>dc:creator</tt> or <tt>foaf:name</tt>, have each of these community-focused ontologies be self-sufficient, covering all the concepts necessary for their domain. We can always apply mapping rules to distinguish between <tt>social:name</tt> and <tt>store:book-author-name</tt> later. With only a single ontology per domain area to worry about, the namespace traffic jam will disappear and it will be easier for people to get on board with RDF and RDFa.  </p>
<p>All in all, it seems the good news coming out of the HTML5 spec is that we can expect rich data annotation to soon be arriving to HTML content everywhere. But what we need to work on as a community is a way to make URIs, and the Ontologies that give them meaning, easier for programmers to use so that the web won&#8217;t just be full of data with Microdata, but full of <i>linked</i> data with RDFa. </p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/09/21/will-the-namespace-traffic-jam-kill-rdfa-in-html5/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>In Defense of a Semantic Web Wild West</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/09/14/in-defense-of-a-semantic-web-wild-west/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/09/14/in-defense-of-a-semantic-web-wild-west/#comments</comments>
		<pubDate>Mon, 14 Sep 2009 06:17:23 +0000</pubDate>
		<dc:creator>David Karger</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[PIM]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Web Architectures]]></category>
		<category><![CDATA[CSAIL]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=447</guid>
		<description><![CDATA[A month ago Stefano Mazzocchi published an interesting article on data reconciliation (detecting when two identifiers refer to the same item, and merging them) where he advocated a more centralized &#8220;a priori&#8221; approach (trying to keep the identifiers merged at the beginning).  I posted a response arguing the value of a more anarchic &#8220;a posteriori&#8221; [...]]]></description>
			<content:encoded><![CDATA[<p>A month ago Stefano Mazzocchi published an interesting <a title="Stefano's blog post" href="http://www.betaversion.org/~stefano/linotype/news/304/">article</a> on data reconciliation (detecting when two identifiers refer to the same item, and merging them) where he advocated a more centralized &#8220;a priori&#8221; approach (trying to keep the identifiers merged at the beginning).  I posted a <a title="my blog response" href="http://groups.csail.mit.edu/haystack/blog/2009/07/24/is-rdf-any-good-without-a-web-of-linked-data/">response</a> arguing the value of a more anarchic &#8220;a posteriori&#8221; approach where you let anyone create whatever identifiers and relations they want, and worry about detecting linkages later.   Stefano <a title="stefano blog response" href="http://www.betaversion.org/~stefano/linotype/news/311/">responded</a> to that, but by then I was busy chairing the submissions for the <a title="ISWC 2009 home page" href="http://iswc2009.semanticweb.org/">2009 International Semantic Web Conference</a>.   Now that that&#8217;s over (I hope you will attend what should be an interesting meeting&#8212;October 25-29 near Washington DC) I&#8217;d like to pick up the discussion again.</p>
<p>I argued in favor of letting individuals make their own RDF collections (using, for example, our <a href="http://www.simile-widgets.org/exhibit/">Exhibit</a> framework) and worry about merging them with other people&#8217;s data later.  Stefano&#8217;s response accused me of using &#8220;RDF&#8221; and &#8220;structured data&#8221; interchangeably, asserting that Exhibit is really just a nice UI over spreadsheet (tabular) data: although it can export RDF, it is &#8220;not properly using RDF&#8221; because it has &#8220;lost the notion of globally unique identifiers (and in that regard, is much more similar to <a href="http://en.wikipedia.org/wiki/Microsoft_Excel">Excel</a> than to <a href="http://www.w3.org/2005/ajar/tab">Tabulator</a>)&#8221;.  Tim Berners-Lee has made similar complaints to me about Exhibit not using RDF.</p>
<p>This argument highlights for me yet another important ambiguity about what RDF <em>is</em>.   I occasionally have to help people understand that RDF is a <em>model</em>, not a syntax.  That some data can be RDF even if it isn&#8217;t serialized to RDF/XML.  That the key is to have items named by URIs, connected by relations named by URIs.  Stefano&#8217;s argument suggests a different blurring: between the model and its intended use.  Stefano&#8217;s &#8220;not properly using&#8221; phrase implies that if you don&#8217;t intend to merge your data into the global namespace, then even if you implement the model and write it down as RDF/XML to boot, you won&#8217;t be &#8220;properly using RDF&#8221;.</p>
<p>I want to address both these claims: that Exhibit is just a UI over spreadsheets, and that using RDF this way isn&#8217;t proper.</p>
<p><strong>RDF and spreadsheets</strong></p>
<p>Regarding the spreadsheet claim, I&#8217;ll begin by admitting that Stefano is absolutely right:  Exhibit is a visualization tool for tabular (spreadsheet) data.  But notice that <em>all</em> RDF is spreadsheet data&#8212;I can take all the RDF in the world and throw it into one spreadsheet.  In fact, I only need three columns to contain the subject (tail), object (head), and predicate (link) for each RDF statement.  Admittedly none of today&#8217;s spreadsheets would have enough rows, but that&#8217;s an engineering detail.  So, the spreadsheet <em>model</em> isn&#8217;t the problem.   And we also agree that Exhibit&#8217;s <em>interface</em> is nothing like spreadsheets&#8217;, and far better for the collection visualization tasks it is designed for.</p>
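<p>As a small illustration of the three-column point, here is a Ruby sketch that writes a couple of RDF-style statements out as tab-separated rows. The <tt>Triple</tt> struct, the <tt>example.org</tt> identifiers, and the values are all invented for the example.</p>
<pre name="code" class="ruby">

# Each RDF statement is one row with three columns.
Triple = Struct.new(:subject, :predicate, :object)

# Hypothetical identifiers and values, for illustration only.
triples = [
  Triple.new("http://example.org/dance/1", "http://example.org/vocab#choreographer", "A. Choreographer"),
  Triple.new("http://example.org/dance/1", "http://example.org/vocab#year", "1980"),
]

# One header row, then one tab-separated row per statement.
rows = triples.map { |t| [t.subject, t.predicate, t.object].join("\t") }
sheet = (["subject\tpredicate\tobject"] + rows).join("\n")
puts sheet
</pre>
<p>Any spreadsheet program can open the result, which is the whole point: the tabular model is not what separates the two camps.</p>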
<p>I think instead that what Stefano is objecting to is a <em>usage</em> characteristic of spreadsheets versus RDF.  When I open a spreadsheet, the data it shows me is right there, in a file on my own system.  Global identifiers don&#8217;t matter because the data is all there (and presumably self-consistent) in the one spreadsheet.   In contrast, in Stefano&#8217;s image of RDF (and in Tim&#8217;s, as one can see from the Tabulator project) the data about a particular entity is spread all over the web, and it is the globally unique identifier that lets you go out, gather all that data together, and know that it is all about the same entity.</p>
<p>This is certainly an appealing vision.  But I want to argue that a focus on globally unique identifiers neglects two benefits of RDF that I consider equally important: <strong>data portability</strong> and <strong>schema flexibility</strong>.</p>
<p><strong>Spreadsheets suffice</strong></p>
<p>To illustrate this argument, I&#8217;ll hark back to a <a title="Hard data management blog post" href="../../2008/11/20/hard-information-management-that-should-have-been-easy/">previous post</a> where I discussed a data integration problem that should have been easy but wasn&#8217;t.   I keep an  <a href="http://simile.mit.edu/exhibit/">Exhibit</a> of folk dance videos on the web.   Recently, Nissim Ben Ami posted a <a href="http://il.youtube.com/profile_videos?p=r&amp;user=NissimBenAmi&amp;page=1">collection</a> of 511 new dance videos on YouTube.  I wanted to incorporate it into my site.  But it quickly became apparent that said incorporation would basically require my entering all 511 video descriptions manually into my system, and I still haven&#8217;t gotten around to it.</p>
<p>The major barriers were twofold.  The first was syntactic: the structured descriptions of the videos were delivered as XML.   That meant that in order to get at the data, I was going to have to learn XSLT, something I&#8217;ve been putting off for years.   The second hurdle was semantic: YouTube has the wrong schema for my folkdance videos.  I care about choreographer, dance type, and year choreographed; YouTube only offers slots for submitter and submission date of the video.  So, as you can see from <a title="Matzlichim video" href="http://www.youtube.com/watch?v=PgbRwUqHsOM">this example</a>, the contributor takes the usual approach: he takes his nicely structured data and shoves it into the generic comment (info) field as free text.  All that structure is instantly lost.</p>
<p>Suppose instead that spreadsheets (or, in a pinch, RDF) were the accepted framework for publishing information on the web.  The YouTube &#8220;spreadsheet&#8221; would contain submitter and submission date information, but Nissim could just add &#8220;artist&#8221; and &#8220;composition-date&#8221; columns to hold the data he wanted to enter.   I would then be in a great position to download his data and incorporate it into my own catalog (spreadsheet).  What would I have to do?  After opening his spreadsheet and mine, I&#8217;d have to match columns&#8212;perhaps he called his &#8220;artist&#8221; and &#8220;composition date&#8221; while mine are &#8220;choreographer&#8221; and &#8220;year&#8221;.  But a simple copy and paste fixes that discrepancy.  Merging entities is not much harder than merging properties: a simple global replace will convert his choreographer &#8220;Israel Ya&#8217;akovi&#8221; to my &#8220;Israel Yakovee&#8221;.  The local consistency of his data and mine means that I only have to work once per choreographer (and in most cases I won&#8217;t have to: there&#8217;s a standard spelling for almost every choreographer&#8217;s name, which serves as a unique identifier<em> in this context</em> even if it isn&#8217;t a URL).</p>
<p>Overall, my work has been reduced by an order of magnitude.  Instead of laboriously entering 511 new records, I just download a spreadsheet and match up a handful of properties (columns) and a few tens of choreographer names.</p>
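<p>The merge described above can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions: rows are plain hashes, the column names come from the example in the text, the two name spellings are the ones mentioned above, and the titles and years are placeholders rather than real catalog entries.</p>
<pre name="code" class="ruby">

# Map the other catalog's column names onto mine (the copy-and-paste step).
COLUMN_MAP = { artist: :choreographer, composition_date: :year }

# Map the other catalog's spellings onto mine (the global-replace step).
NAME_MAP = {}
NAME_MAP["Israel Ya'akovi"] = "Israel Yakovee"

# Convert one row from the other catalog into my schema.
def import_row(row)
  renamed = {}
  row.each do |column, value|
    renamed[COLUMN_MAP.fetch(column, column)] = value
  end
  name = renamed[:choreographer]
  renamed[:choreographer] = NAME_MAP.fetch(name, name)
  renamed
end

# Placeholder rows, for illustration only.
mine = [{ title: "Dance A", choreographer: "Israel Yakovee", year: "1980" }]
theirs = [{ title: "Dance B", artist: "Israel Ya'akovi", composition_date: "1990" }]

merged = mine + theirs.map { |row| import_row(row) }
merged.each { |row| puts row.inspect }
</pre>
<p>The two lookup tables are the only parts a person has to fill in by hand, and each entry is written once no matter how many rows it applies to.</p>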
<p>Stepping back, observe that I&#8217;ve relied on two things.   First, on <strong>data portability</strong>&#8212;my being able to download the data in a convenient form: not XML, which is a programmer&#8217;s friend but an end-user&#8217;s enemy; rather, something I can just look at and understand.  Second, on <strong>schema flexibility</strong>&#8212;on Nissim&#8217;s being able to add whatever columns/properties he decides are important, instead of being limited to those used on the hosting web application.</p>
<p>I&#8217;m also relying on some features of this particular scenario, but I believe they often hold.   I am relying on Nissim&#8217;s data having only a small number of properties so that I can map them manually to mine.   I also rely on there being a small number of choreographers, and hope to take advantage of most of them having matching names in his data and mine; these names certainly aren&#8217;t globally unique identifiers, but they are &#8220;unique enough&#8221; when considering just my data and his.  Critically, I am not thinking of pulling all data about a given dance from a multitude of different web sites; this would demand global unique identifiers to link data since I would never have the patience.  Rather, I am considering a pairwise data acquisition: taking data I want from one internally consistent site.</p>
<p>Such pairwise acquisition is commonplace: any time a scientist wants to pull a data set from some other scientist&#8217;s lab, or a consumer wants to download product information about several cameras from a review site, or a student wants to include a Wikipedia data set in a report they are writing, there is an obvious single source and target for a data merger.   And there&#8217;s a human being who has the incentive, and with the right tools the capability, to do the limited amount of work needed to accomplish that merger.</p>
<p>This is a simple low-hanging fruit argument.  It would be wonderful to be able to <em>automatically</em> merge data from <em>thousands</em> of different sources into a coherent whole.  And this is a problem Freebase will need to solve, if they want to become the hub for aggregation of structured data.  But right now we can&#8217;t even <em>manually</em> merge data from <em>two</em> sites without doing a ridiculous amount of grunt work&#8212;so perhaps we should give some attention to that easier problem on our way to solving the hard one.</p>
<p><strong>Don&#8217;t skip the wild west<br />
</strong></p>
<p>I&#8217;d like to see these efforts proceed in parallel, but I&#8217;m worried about enthusiasm for the more ambitious goal blocking movement toward the low-hanging fruit.  I recently submitted a proposal to NIH on the topic of data integration that reflected my perspective above.  I argued that the current efforts in the Biology community to force everyone to adopt a common ontology (and sometimes repository) for their experimental data are being resisted by biologists who think they know best how to present their data.  I suggested as an alternative that we give biologists tools, such as Exhibit, that would encourage them to publish their data in a common structured syntax, and worry about integrating all that data <em>after</em> it has become available in structured form.  The proposal rejection was accompanied by a review that said, on the one hand, &#8220;The benefit of the proposed approach is that it is very different from some multi-institutional data sharing projects (like caBIG), which have used a very rigid, top-down approach to creating semantics. Even if this project is unsuccessful it could bring to light new ideas and strategies that might make those large-scale projects more responsive to investigators and more successful.&#8221;  At the same time, it argued for rejection because &#8220;The absence of any control over the information models and ontologies – truly a semantic wild west – is daring and may ultimately be the downfall of this project.&#8221;</p>
<p>I&#8217;m fascinated to see, in the same review, a recognition of the problems that the current centralized approach is bringing (lack of buy-in to common ontologies by individual scientists who think they know better and probably do), and an unwillingness to tolerate the contrary (anarchic) solution.  I also love the metaphor of the &#8220;semantic wild west&#8221; because I think it supports my argument.  Would anyone have suggested establishing a city of several million people just after the west was opened for settlement?  The west&#8217;s early wildness was an unavoidable phase of its evolution towards the thickly settled and uniformly governed area it is now.    In the same vein, I think that our semantic web is best grown by encouraging individual semantic-web settlers to create their own data homesteads and begin looking for the trails that connect them to neighboring collections.  We need to get the data into plain view first.   Later we can send in the data sheriffs and place all those data sets under uniform governance.</p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/09/14/in-defense-of-a-semantic-web-wild-west/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Introducing &#8220;Eyebrowse&#8221; &#8211; Track and share your web browsing in real time</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/08/28/introducing-eyebrowse-track-and-share-your-web-browsing-in-real-time/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/08/28/introducing-eyebrowse-track-and-share-your-web-browsing-in-real-time/#comments</comments>
		<pubDate>Fri, 28 Aug 2009 06:55:05 +0000</pubDate>
		<dc:creator>Max Van Kleek</dc:creator>
				<category><![CDATA[Collective Intelligence]]></category>
		<category><![CDATA[News]]></category>
		<category><![CDATA[Social Computing]]></category>
		<category><![CDATA[Web Architectures]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=450</guid>
		<description><![CDATA[We&#8217;ve launched a service for letting people share, in real time, what pages they&#8217;re looking at on the web.  Our system, eyebrowse, lets the person choose exactly what sites they want to share their viewing patterns about, and eyebrowse does the rest &#8212; producing statistical visualisations of your web browsing habits over time, compared to [...]]]></description>
			<content:encoded><![CDATA[<p>We&#8217;ve launched a service for letting people share, in real time, what pages they&#8217;re looking at on the web.  Our system, eyebrowse, lets the person choose exactly what sites they want to share their viewing patterns about, and eyebrowse does the rest &#8212; producing statistical visualisations of your web browsing habits over time, compared to your friends and the world.  It is available here:</p>
<p><strong><a href="http://eyebrowse.csail.mit.edu">http://eyebrowse.csail.mit.edu</a></strong></p>
<p>It currently requires Firefox/Iceweasel and works on all major platforms.  All data that is collected is <strong>public</strong> and available to <strong>anyone</strong> who wants it (we do not hoard or claim to own any of your data. We like Twitter&#8217;s model.)  We will soon provide a nice interface with daily tarballs of the database in RDF, XML and CSV.</p>
<p><strong>Why would you want to share your web trails?</strong></p>
<p>1. For Science!  It&#8217;s not fair that certain Search Engine Companies can do web trail research because they have access to massive repositories of data.  There should be public corpora for IR researchers around the world.  And these should be OPEN.</p>
<p>2. For your friends!  You look at lots of cool stuff on the web every day.  You might not think of explicitly sharing every single thing you read.  Eyebrowse is lightweight enough that you just have to tell it once per site you want to share.  I&#8217;ve already discovered tons of weird things that my friends are looking at that they would not have bothered to share explicitly.</p>
<p>3. To understand your own browsing habits.  How many times do you read ACM/IEEE every day? I bet you don&#8217;t know. Now you can get quantitative statistics and visualise long-term journal revisitation patterns &#8211; and other things.</p>
<p><strong>Will it violate my privacy?</strong></p>
<p>1. We give you control.  You have to tell eyebrowse explicitly what you want to share on a site-by-site (host) basis. You can take things off the whitelist at any time.  You can also go back and delete things that it has logged in the past, all through our web interface.   It also respects Private Browsing (aka pornmode) and will not log any data while that mode is active.</p>
<p>2. It fosters contemplation/awareness: We are trying to also raise awareness of what OTHERS (e.g. Google Analytics) are collecting about you as you surf the web, by showing you what you can learn from yourself by selectively publishing your own data feeds.</p>
<p>By letting people selectively publish web trails in an open, non-invasive way, we are hoping to foster a discussion of how we can use our web browsing behavior to build more adaptive and effective interfaces that <strong>respect people&#8217;s privacy</strong>.</p>
<p>Feedback is appreciated.  Please email us directly at : eyebrowse@csail.mit.edu</p>
<p>Oh and eyebrowse is free and open source software, licensed under the MIT License.  The source is available as part of the list-it codebase here: <a href="http://code.google.com/p/list-it">http://code.google.com/p/list-it</a></p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/08/28/introducing-eyebrowse-track-and-share-your-web-browsing-in-real-time/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Is RDF any good without a web of linked data?</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/07/24/is-rdf-any-good-without-a-web-of-linked-data/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/07/24/is-rdf-any-good-without-a-web-of-linked-data/#comments</comments>
		<pubDate>Fri, 24 Jul 2009 05:39:10 +0000</pubDate>
		<dc:creator>David Karger</dc:creator>
				<category><![CDATA[PIM]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Web Architectures]]></category>
		<category><![CDATA[CSAIL]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=403</guid>
		<description><![CDATA[Stefano Mazzocchi used to work at our SIMILE project here at MIT, where we explored the use of RDF and Semantic Web tools for the sharing of knowledge.  He has since gone to work at Metaweb and, it seems, become much more friendly to their &#8220;top down&#8221; approach of trying to create a centralized repository [...]]]></description>
			<content:encoded><![CDATA[<p>Stefano Mazzocchi used to work at our <a title="Simile Project web site" href="http://simile.mit.edu/">SIMILE project</a> here at MIT, where we explored the use of RDF and Semantic Web tools for the sharing of knowledge.  He has since gone to work at <a href="http://www.metaweb.com/">Metaweb</a> and, it seems, become <a href="http://www.betaversion.org/~stefano/linotype/news/304/">much more friendly</a> to their &#8220;top down&#8221; approach of trying to create a <a href="http://www.freebase.com/">centralized repository</a> of structured data with consistent identifiers, as opposed to letting that data grow all over the place any which way and get <a href="http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData/">linked together afterwards</a>.  In particular, he argues for the critical importance of <em>relational density</em> in the data.  His point is that when there are many distinct, unlinked identifiers for the same object, then what one person says about one of those identifiers (&#8220;Chicago&#8221;) won&#8217;t be visible to someone looking at a different identifier (&#8220;the Windy City&#8221;).  He opines that &#8220;without it [relational density] there would be very little value in it compared to what traditional search engines are already doing&#8221;.</p>
<p>Being argumentative by nature, I wanted to highlight some of the benefits of the looser, sloppier approach to data sharing that we took for SIMILE.   Obviously, being able to link data from multiple sources, and feed it into a search engine as Stefano describes, is a great thing.  But there are some tremendous advantages that accrue when even a single individual decides to create a blob of structured data <em>with no reference to anyone else&#8217;s</em>.</p>
<p>The first is interaction.  As shown with our <a href="http://www.simile-widgets.org/exhibit/">Exhibit framework</a> (created by <a href="http://davidhuynh.net/">David Huynh</a>, now also at Metaweb), structured data enables rich visualization.  If my data objects have coordinates, I can plot them on a map.  If they have dates, I can put them on a timeline.  If they have colors, I can filter or sort by color.  It doesn&#8217;t matter if I call those properties latitude, longitude, date and color, or northSouth, eastWest, sinceTheCreation and elementOfTheRainbow, and whether I decide that my city is Chicago or the Windy City&#8212;as long as I have my own internally consistent names for them, I can use them to hook my data into interesting visualizations and interactions.</p>
<p>The second benefit is portability.  If I publish some interesting data as part of an HTML document, then anyone who wants to use that data for something else&#8212;to rebut my argument, to mash it up with some other data, to put it to some use I never thought of&#8212;has the unpleasant job of <a href="http://en.wikipedia.org/wiki/Web_scraping">scraping</a> said data out of the HTML into a usable form.  This generally requires a programmer, and even for them it&#8217;s a tedious task that distracts them, and may deter them, from what they really want to do with the data.  But if that data is published as data&#8212;even in something as old-fashioned as a spreadsheet&#8212;it becomes way easier to grab it and reuse it.  Look at how much of the blogosphere is made up of cross-references, trackbacks, and responses to other blog postings.  If you&#8217;re going to argue about something involving data&#8212;for example, whether a single payer system is going to end up saving or costing money, or whether <a title="Perfect Game story" href="http://sportsillustrated.cnn.com/2009/baseball/mlb/07/23/buehrle.cnn/index.html?cnn=yes">today&#8217;s perfect game</a> is all that unusual&#8212;you probably want to publish that data to support your argument.  At which point, someone who wants to refute your argument is going to want to use that same data.  That&#8217;s going to be a lot easier if they can get that data from your posting.  That&#8217;s the theory behind our <a href="http://projects.csail.mit.edu/datapress/">Datapress</a> project, which aims to let you post data sets (and visualizations of them) in your WordPress blog, and lets other people refer to and reuse that data.  In that sort of one-on-one debate over data, it really doesn&#8217;t matter whether I use the same identifiers as Freebase&#8212;you can take my identifiers and use them to build your rebuttal.</p>
<p>Uniformity does start to matter when someone wants to mash up data from multiple sources.  If those sources haven&#8217;t agreed on identifiers beforehand, then the masher has some work ahead&#8212;this is a case where a centralized vocabulary is really helpful.  But again, getting the data <em>at all</em> is such a big jump over the current state of affairs&#8212;I imagine how grateful mashup makers would be if all they had to do was merge some identifiers instead of retyping a whole spreadsheet from scratch.  The point here is that unlike Stefano&#8217;s hypothetical search engine, that wants to issue a query against all the world&#8217;s data at once, your typical mashup author just needs to deal with a couple of (probably small) data sets.  His or her <em>particular </em>data integration problem is quite manageable <em>a posteriori</em>.</p>
<p>I&#8217;ll also dust off an argument David Huynh once made to me, even if it might get him in trouble with his current employer.   Unification is not an absolute, but contextual.  Whether two things are the same may change depending on what you are doing with them.   Continuing my never-before attempted forays into sports analogies, are the Brooklyn Dodgers the same as the L.A. Dodgers?  If you want to talk about the team that moved from Brooklyn to LA, the answer must be yes!  But in a different context you might be interested in comparing the lifetime records of these two distinct teams.  (In fact, Freebase tries to have it both ways: it asserts that the <a title="Brooklyn Dodgers on Freebase" href="http://www.freebase.com/view/guid/9202a8c04000641f800000000ad5a169">Brooklyn Dodgers</a> were &#8220;later known as&#8221; the <a title="LA Dodgers on Freebase" href="http://www.freebase.com/view/en/los_angeles_dodgers">Los Angeles Dodgers</a> (implying they are the same team with a name change) but asserts that Los Angeles Dodgers were founded in 1958, which clearly isn&#8217;t true of the Brooklyn Dodgers that folded in 57.)</p>
<p>This is obviously one of those half-empty half-full debates:  We both recognize the value of both approaches, but are compelled by different aspects.  Stefano looks at the amazing things that could be done with a single consistent data universe, and worries about how to create it.  I look at the amazing things that can already be done with a host of disjoint but internally-consistent data microverses, and find that compelling enough to allay any worry about whether we&#8217;ll ever need <a title="Freebase Lion Article" href="http://www.freebase.com/view/en/lion">http://www.freebase.com/view/en/lion</a> to unify with <a title="Wikipedia Lamb Article" href="http://en.wikipedia.org/wiki/Lamb">http://en.wikipedia.org/wiki/Lamb</a> .</p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/07/24/is-rdf-any-good-without-a-web-of-linked-data/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>A Simple Extension for Microformat &amp; RDFa Table Support</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/05/19/a-simple-extension-for-microformat-rdfa-table-support/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/05/19/a-simple-extension-for-microformat-rdfa-table-support/#comments</comments>
		<pubDate>Tue, 19 May 2009 14:01:48 +0000</pubDate>
		<dc:creator>Edward Benson</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Web Architectures]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=332</guid>
		<description><![CDATA[Microformats and RDFa provide a way to interweave semantic markup within a web document so that structured information can be more easily extracted. Both Microformats and RDFa follow the hierarchical model of HTML: structured data to be extracted may exist spread across several layers of the DOM 
hierarchy. A pseudocode example of this is below, where we see that [...]]]></description>
			<content:encoded><![CDATA[<div>Microformats and RDFa provide a way to interweave semantic markup within a web document so that structured information can be more easily extracted. Both Microformats and RDFa follow the hierarchical model of HTML: structured data to be extracted may exist spread across several layers of the DOM </div>
<div>hierarchy. A pseudocode example of this is below, where we see that the statement &lt;Tom eats Hamburgers&gt; is spread across three levels of the DOM Tree.</div>
<p> </p>
<p><code></p>
<div>&lt;div subject="Tom"&gt;</div>
<div>   &lt;div property="eats"&gt;</div>
<div>      Hamburgers</div>
<div>   &lt;/div&gt;</div>
<div>&lt;/div&gt;</div>
<p></code></p>
<p> </p>
<div>A wrinkle in this hierarchical mindset is the fact that a great deal of structured information on the web lives in the one HTML construct that does not fit a hierarchical representation: tables. </div>
<div>Consider Wikipedia, as just one source of structured data virtually begging to be annotated with RDFa. On many Wikipedia pages one of the first things that catches a visitor&#8217;s eye is the &#8220;Info Box&#8221;. This is a small box containing a structured summary of the key factual items on the page. Equally important is the way in which Info Boxes are populated: they begin their life as templates &#8212; the &#8220;Capital City&#8221; or &#8220;Baseball Player&#8221; template, for example &#8211; and all the Wikipedia contributor has to do is fill in the empty field values.</div>
<div>Microformats and RDFa are able to mark up tables, such as these Info Boxes, but they do so in a suboptimal way: they require repetition of semantic markup across each row (or column, depending on the orientation of the table). This:</div>
<div>
<ul>
<li>Is redundant from a representational standpoint </li>
<li>Clouds the ability of a data enhanced web browser to infer that entire rows and columns of the table talk about the same thing. </li>
<li>Forces the template writer to put markup within the table body, rather than on its column and row declarations</li>
</ul>
</div>
<div>To cure this, here is a simple proposal to add table support to both Microformats and RDFa. It is compact, needing only a single sentence to describe:</div>
<div></div>
<p></p>
<div><em><strong>When parsing the DOM to extract embedded data and a &lt;TD&gt; element is encountered,<span style="font-style: normal;"> </span>treat its corresponding &lt;COL&gt; element as if it were the DOM parent of the &lt;TR&gt; element for that cell.</strong></em></div>
<p></p>
<div></div>
<div>The problem with tables is that a table cell (TD) is contained in HTML by its row (TR) but not its column  (COL) because of the way HTML works. This means that when we use tables and microformats/RDFa  together, we&#8217;re stuck: we can put information across each row, but not down each column. So the proposal is to make a special case for table columns when it comes to pulling out structured information in the cells: even though the cell isn&#8217;t technically contained by the column element, pretend that it is. </div>
<div>Let&#8217;s see how this cleans up the representation of a simple table of president names. This table has one president per row: Thomas and John. </div>
<div>Here is a pseudocode example of how microformats/RDFa currently require such a table to be marked up. Notice how the &#8220;first&#8221; and &#8220;last&#8221; properties needed to be repeated across each row.</div>
<p> </p>
<p><code></p>
<div>&lt;TABLE&gt;</div>
<div>  &lt;TR&gt;&lt;TH&gt;First&lt;/TH&gt;&lt;TH&gt;Last&lt;/TH&gt;&lt;/TR&gt;</div>
<div>  &lt;TR subject="tj"&gt;&lt;TD property="first"&gt;Thomas&lt;/TD&gt;&lt;TD property="last"&gt;Jefferson&lt;/TD&gt;&lt;/TR&gt;</div>
<div>  &lt;TR subject="ja"&gt;&lt;TD property="first"&gt;John&lt;/TD&gt;&lt;TD property="last"&gt;Adams&lt;/TD&gt;&lt;/TR&gt;</div>
<div>&lt;/TABLE&gt;</div>
<p></code></p>
<p> </p>
<div>If we allow structured content to live in the &lt;COL /&gt; elements, then we do not need this repetition:</div>
<p> </p>
<p><code></p>
<div>&lt;TABLE&gt;</div>
<div>  &lt;COL property="first" /&gt;&lt;COL property="last" /&gt;</div>
<div>  &lt;TR&gt;&lt;TH&gt;First&lt;/TH&gt;&lt;TH&gt;Last&lt;/TH&gt;&lt;/TR&gt;</div>
<div>  &lt;TR subject="tj"&gt;&lt;TD&gt;Thomas&lt;/TD&gt;&lt;TD&gt;Jefferson&lt;/TD&gt;&lt;/TR&gt;</div>
<div>  &lt;TR subject="ja"&gt;&lt;TD&gt;John&lt;/TD&gt;&lt;TD&gt;Adams&lt;/TD&gt;&lt;/TR&gt;</div>
<div>&lt;/TABLE&gt;</div>
<p></code></p>
<p> </p>
<div>When parsing either of these two tables for data, we extract the same information:</div>
<div>
<ul>
<li>:tj :first &#8220;Thomas&#8221;</li>
<li>:tj :last  &#8220;Jefferson&#8221;</li>
<li>:ja :first &#8220;John&#8221;</li>
<li>:ja :last  &#8220;Adams&#8221;</li>
</ul>
</div>
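<p>The extraction rule can be sketched in a few lines of Python with the standard <code>html.parser</code> module. This is an illustration of the idea and not a real Microformats or RDFa parser: the <code>subject</code> and <code>property</code> attributes are the pseudocode attributes used in the tables above, the class and function names are made up, and letting a property on the cell itself override the one on its column is an assumption the proposal does not spell out.</p>

```python
from html.parser import HTMLParser


class ColumnAwareExtractor(HTMLParser):
    """Collect (subject, property, value) triples from a table.

    Each TD inherits the property declared on the COL at the same
    position, as if that COL were one of the cell's DOM ancestors.
    """

    def __init__(self):
        super().__init__()
        self.col_props = []  # property declared on each COL, in order
        self.subject = None  # subject declared on the current TR
        self.col = 0         # position of the current cell in its row
        self.prop = None     # property in effect for the current cell
        self.text = None     # text buffer, set only while inside a TD
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # HTMLParser lowercases tag and attribute names
        if tag == "col":
            self.col_props.append(attrs.get("property"))
        elif tag == "tr":
            self.subject, self.col = attrs.get("subject"), 0
        elif tag == "td":
            has_col = self.col < len(self.col_props)
            inherited = self.col_props[self.col] if has_col else None
            # Assumption: markup on the cell itself beats the column's.
            self.prop = attrs.get("property") or inherited
            self.text = []

    def handle_data(self, data):
        if self.text is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self.text is not None:
            if self.subject and self.prop:
                value = "".join(self.text).strip()
                self.triples.append((self.subject, self.prop, value))
            self.text = None
        if tag in ("td", "th"):
            self.col += 1  # header cells occupy a column too


def extract_triples(html):
    """Return the triples found in a string of table markup."""
    parser = ColumnAwareExtractor()
    parser.feed(html)
    parser.close()
    return parser.triples
```

<p>Fed either of the two president tables above, this sketch should produce the same four triples, which is the point of the proposal. It ignores <code>colspan</code>, <code>span</code> on COL, and nested tables.</p>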
<div>Tables are all over the web, and they make great templates to assist users in entering structured information. This small change to the semantics of microformat and RDFa parsing should allow a cleaner syntax for publishing both data and data-templates, easing the adoption of the respective</div>
<div>formats.</div>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/05/19/a-simple-extension-for-microformat-rdfa-table-support/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Making the Case for Raw Data</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/03/24/making-the-case-for-raw-data/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/03/24/making-the-case-for-raw-data/#comments</comments>
		<pubDate>Tue, 24 Mar 2009 14:46:05 +0000</pubDate>
		<dc:creator>Adam Marcus</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Web Architectures]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=273</guid>
		<description><![CDATA[Tim Berners-Lee’s recent TED talk on Linked Data has inspired quite a few people to ask what exactly linked data is, how it differs from data on the semantic web, and how realistic it is to assume universal and unique addressability of data items. A world with linked data would be a world with richer, [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.ted.com/index.php/talks/tim_berners_lee_on_the_next_web.html">Tim Berners-Lee</a>’s recent TED talk on Linked Data has inspired quite a few people to ask what exactly linked data is, how it differs from data on the semantic web, and how realistic it is to assume universal and unique addressability of data items. A world with linked data would be a world with richer, more explorable data, and that notion on its own makes Tim’s talk worth viewing. The most inspiring part of his talk, in my opinion, was the one in which he got the entire crowd to loudly demand RAW DATA NOW. Given the push for more open datasets in government, and given that more websites are becoming API-providing data platforms, it is important to demand raw data where possible.</p>
<h2>The magic behind raw data</h2>
<p>The best thing about raw data is that almost everyone knows how it works. This means that as far as the data (re)user is concerned, the datasets are text files (or perhaps a close variant) that they can download, open in some default application, and get some immediate use out of.</p>
<p>If the US Federal budget dataset is released as a comma-separated file, a middle-schooler can download the file, open it in a spreadsheet application, and sum the columns to see how much we’re spending on the Department of Education this year. A more skilled high-schooler can upload the file to <a href="http://manyeyes.alphaworks.ibm.com/manyeyes/">Many Eyes</a>, make a pie chart out of it, and post it to their blog. A first-year college student can write a php script to allow people to comment on various parts of that pie chart, allowing you to drill in to various slices to get a finer granularity.</p>
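<p>The same sum is only a few lines of script for anyone who outgrows the spreadsheet. Here is a minimal sketch in Python, assuming a made-up layout with <code>department</code> and <code>amount</code> columns; the layout of any real budget file would differ, and the function name is only illustrative.</p>

```python
import csv
import io
from collections import defaultdict


def total_by_department(csv_text):
    """Sum the 'amount' column for each value of the 'department' column."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["department"]] += float(row["amount"])
    return dict(totals)
```

<p>Nothing here depends on the publisher having anticipated the question, which is the appeal of handing out the raw file.</p>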
<p>With raw data, you’ve opened more people to more visualization, exploration, and discussion than was available through the original web application that acted as a firewall to your database.</p>
<h2>Hugging the data to death</h2>
<p>During his talk, Tim spoke about “Database Huggers,” or people who, for various reasons, hide their data away in databases. Once the data sits in a database, the publisher might provide a specific and constrained view of the data by way of a website, or they might hide it even more, simply calculating some aggregate statistic over the data and claiming, without verification, that the data has certain properties.</p>
<p>There are several legitimate reasons for database hugging. Some data was meant to be private—academic, medical, and financial information are all datapoints we’d prefer to keep private. We’d hope our service providers will keep it out of the hands of others. Similarly, a company might have competitive reasons for keeping information private, especially when it would be equally valuable to their competitors and not too valuable to the public—lists of customers and transaction histories come to mind. Keeping this information far from the publicly accessible web is responsible and wise.</p>
<p>There are other cases, however, where the data should legitimately stay open and publicly accessible. Open government initiatives will result in many datasets published by organizations that <a href="http://www.recovery.gov/">will</a> or <a href="http://www.nih.gov/">should</a> exist in the public domain.  Many <a href="http://en.wikipedia.org/wiki/The_Long_Tail">Long Tail</a> websites, maintained by small groups of <a href="http://simile.mit.edu/exhibit/examples/cereals/cereal-characters.html">hobbyists</a>, probably would not mind if the datasets they generate are published in their full glory. For these types of applications, raw data is ideal.</p>
<p>Even in the case of datasets that should be open to the public, database huggers will sometimes disable direct access to the data, instead opting to place it in a database that sits behind an html-generating web application. Thinking that you’ve hidden your data behind HTML, thus making it safe from reuse, is an unwise assumption. In about an hour, a decent programmer can write a perl script to crawl your site and tease the data apart from the obfuscated HTML that surrounds it, reverse-engineering your database without asking for permission. In fact, there are <a href="http://simile.mit.edu/wiki/Solvent">tools</a> that make this process easier than writing a one-off perl script. And if you think you can block the person from accessing every page on your site in a short period of time, then they will just collaborate with <em>everyone else</em> who wants the data, write a <a href="http://www.greasespot.net/">Greasemonkey</a> script to collect parts of the site that they browse, and eventually collect your entire presented dataset.</p>
<p>Databases are not inherently evil. They provide an excellent way to store, index, and query data, but they also have a way of separating the average user from that data. Most websites, for example, do not publish a read-only username and password to their database, for fear of arbitrary queries that could easily take down their machines, or at least keep the machines busy for a long time. We should design tools to maintain the excellent services that databases have been built to provide over the last four decades, without limiting the access to the raw data when such access would be most valuable.</p>
<h2>Are APIs the future of raw data?</h2>
<p>There is a middle ground between the highly private datasets and the obviously open ones. Most forward-thinking organizations have realized this. They have also realized that if they have something to sell, be it in meatspace or screenspace, it’s better to release the data about their offerings to anyone that wants to use it, so that people eventually end up at their site. They do this by providing a web <a href="http://en.wikipedia.org/wiki/API">API</a> to make their dataset queryable, essentially telling other software developers which questions they can answer about the dataset (<em>query for books by author</em>, <em>query for restaurants by cuisine</em>).  <a href="http://www.amazon.com/">Amazon</a> has some APIs, as does <a href="http://www.yelp.com/">Yelp</a>, and you’d have to be a pretty self-loathing web 2.0 company to not provide an API over <em>some portion</em> of your data.  So are APIs the solution?  Not always.</p>
<p>APIs are a step in the right direction—open data is better than obfuscated data. APIs help both third-party developers and dataset publishers get more out of a dataset. They have a few drawbacks as well:</p>
<ul>
<li>The API is an HTTP interface to <em>your</em> database.  This means that if <em>someone else</em> makes a third-party application that is immensely popular, it’s your database that pays for the brunt of its popularity. You weren’t expecting a huge ramp-up in server load? Too bad.</li>
<li>As kind as the dataset publisher is, they can’t predict <em>every</em> use of the data—if they could, they already would have implemented the best use cases. If they can’t predict how the consumer/developer will use the data, they might not publish a good hook into the dataset. This would either prevent or make awkward the interaction between the third-party application and the publisher.</li>
<li>Building an API for a dataset makes the people who are nice enough to share their data do <em>more work</em> on top of designing their application.  Following common <a href="http://en.wikipedia.org/wiki/Representational_State_Transfer">REST</a> or <a href="http://en.wikipedia.org/wiki/Create,_read,_update_and_delete">CRUD</a> conventions makes this easier, but still puts the onus on the developer. As a corollary, APIs don’t change with the data: whenever the shape of your data changes, the API has to be revised to match, so it needs constant upkeep.</li>
</ul>
<p>One might argue that some of the criticisms of APIs are unfair:</p>
<ul>
<li>Saying that raw data will reduce the load on your database implies that the third party has some cache of the data, which is thus slightly out-of-date. You could imagine some sort of <a href="http://en.wikipedia.org/wiki/Comet_%28programming%29">Comet</a>-updated raw dataset system, but it’s unlikely for now that dataset publishers will be willing to stream live updates to third parties.</li>
<li>Perhaps the limited API functionality is for good reason. Amazon might never want you to be able to download their entire dataset—they don’t want to waste the bandwidth and they don’t want competitors to know exactly how many items they have on hand.</li>
<li>Publishing any sort of raw data will require extra work on behalf of the dataset publisher. Perhaps API-writing is the least invasive of their time?</li>
</ul>
<p>An ideal data management tool would allow raw data publishing when possible, and make it easier to build APIs when some limited access is desirable. We should not pretend to know the point at which raw data is superior to APIs, but the point exists somewhere. It’s important to understand the benefits that raw data provides on top of web APIs, so that you can think about when it would be valuable to use.</p>
<h2>After all this time, the answer was text files?</h2>
<p>You’ve probably become skeptical of these suggestions. Are we really supposed to throw away decades of database research in how to properly store, index, and query reasonably sized datasets so that a middle-schooler can look at the data in a different way? Of course not. The interesting research question becomes whether we can give the user the illusion of raw data while still benefiting from database technology where possible.</p>
<p>That’s one research direction we’re taking within the Haystack group. With the constraint that the raw data, in human-readable text files, should always be available, we’d like to blur the boundaries between databases and data-aware webservers.</p>
<p>Specifically, what we plan on designing is an Apache web server module that recognizes when it is serving a dataset, perhaps by taking note that it is serving a .csv, .rdf, or .json file. In such cases, the server would cook the data into a database behind the scenes. Data-aware clients (in JavaScript for the time being, but in the browser one day) can then query the web server about the data directly. Updates become difficult, but we can make consistency guarantees about the original raw data text files to ensure that someone can download them and see up-to-date information.</p>
<p>If you prefer programmatic access to the files, the module turns into an endpoint that speaks REST (or SQL, SPARQL, or your favorite path language). If you prefer to get down and dirty with the data, you’ve got the text files.</p>
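<p>As a rough illustration of the cooking step, here is a minimal Python sketch (not the planned Apache module). It loads CSV text into an in-memory SQLite table and answers a query from it. The sample rows, the table name <code>dataset</code>, and the helper <code>cook_csv</code> are all made up for this example.</p>
<pre style="padding-left: 30px;">
```python
import csv
import io
import sqlite3

# Hypothetical raw dataset, standing in for a .csv file the server would serve.
RAW_CSV = "name,city,year\nAda,London,1843\nGrace,New York,1952\nAlan,London,1936\n"


def cook_csv(raw_text, table="dataset"):
    """Load CSV text into an in-memory SQLite table and return the connection."""
    reader = csv.reader(io.StringIO(raw_text))
    header = next(reader)  # first row supplies the column names
    conn = sqlite3.connect(":memory:")
    # Identifiers are interpolated directly, so this assumes a trusted header.
    columns = ", ".join('"{}"'.format(name) for name in header)
    conn.execute('CREATE TABLE "{}" ({})'.format(table, columns))
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), reader
    )
    return conn


conn = cook_csv(RAW_CSV)
# A data-aware client could now ask questions instead of parsing the file.
london = conn.execute(
    "SELECT name FROM dataset WHERE city = ? ORDER BY name", ("London",)
).fetchall()
print(london)
```
</pre>
<p>The text stays the source of truth here: the database is rebuilt from it, which sidesteps the harder problem of keeping the two in sync when updates arrive through the database.</p>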
<p>We certainly don’t want to stand in the way of a world with Linked Data, so if you’d like, the tool will eventually return data with URIs. We can’t guarantee the URIs will resolve to anything useful, but that just might require a human’s touch. We’re not sure how that fits into the picture for the average data publisher, since the marginal benefit to the individual of universally addressing your own data is small, whereas the benefit to everyone else of adding another linked dataset grows with the number of datasets it is linked to.</p>
<h2>And now, for some questions</h2>
<p>We’re early in the development of our tools, so we’re open to your ideas and suggestions. Keeping text files up-to-date with the database that’s proxying them is nontrivial. Thinking of the ideal client/server mode of operation will also take time. We probably haven’t thought of the most important must-have feature yet, so any suggestions are welcome.</p>
<p><em>Thanks to Ted Benson, Sam Madden, and David Karger for their thoughts on this post.</em></p>
<p><em>(Cross-posted on <a href="http://blog.marcua.net/post/89373158/making-the-case-for-raw-data">my blog</a>)</em></p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/03/24/making-the-case-for-raw-data/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>What&#8217;s Wrong with SQL?</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/02/16/whats-wrong-with-sql/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/02/16/whats-wrong-with-sql/#comments</comments>
		<pubDate>Tue, 17 Feb 2009 04:27:33 +0000</pubDate>
		<dc:creator>Eirik Bakke</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Web Architectures]]></category>
		<category><![CDATA[data model]]></category>
		<category><![CDATA[facets]]></category>
		<category><![CDATA[hierarchical]]></category>
		<category><![CDATA[json]]></category>
		<category><![CDATA[nested relations]]></category>
		<category><![CDATA[object-relational impedance mismatch]]></category>
		<category><![CDATA[relational]]></category>
		<category><![CDATA[sql]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=260</guid>
		<description><![CDATA[A lot of things, Mike Stonebraker might say, but I have something rather fundamental in mind.
Suppose I&#8217;m developing some sort of academic course management system. Chances are I&#8217;ll want to display to the user a list of course offerings and their associated course codes, readings from the syllabus, meeting times etc. Maybe something like this:

Now [...]]]></description>
			<content:encoded><![CDATA[<p>A lot of things, Mike Stonebraker might say, but I have something rather fundamental in mind.</p>
<p>Suppose I&#8217;m developing some sort of academic course management system. Chances are I&#8217;ll want to display to the user a list of course offerings and their associated course codes, readings from the syllabus, meeting times etc. Maybe something like this:</p>
<p><a href="http://groups.csail.mit.edu/haystack/blog/wordpress/wp-content/uploads/2009/02/ill1.png"><img class="alignnone size-full wp-image-266" title="Logical Query (example from the Princeton University course catalog)" src="http://groups.csail.mit.edu/haystack/blog/wordpress/wp-content/uploads/2009/02/ill1.png" alt="" width="500" height="268" /></a></p>
<p>Now according to Good Rules of Normalization and Decency, I probably stored this data across several database tables, related by foreign keys. I might have tables named &#8220;offerings&#8221;, &#8220;course_codes&#8221;, &#8220;readings&#8221;, &#8220;sections&#8221;, &#8220;meetings&#8221; and so forth. So how do I retrieve all this related data from the database?</p>
<p>The good news is that relational databases are made for just this kind of task: joining tables efficiently is what they do for a living. Unsuspectingly, I run my query [1]:</p>
<pre style="padding-left: 30px;">SELECT o.title, cc.code, r.author, r.title, s.name,
       m.start_time, m.end_time, m.day, m.place
FROM   offerings o, course_codes cc, readings r, sections s,
       meetings m
WHERE cc.oid = o.id
AND   r.oid = o.id
AND   s.oid = o.id
AND   m.sid = s.id;</pre>
<p><a href="http://groups.csail.mit.edu/haystack/blog/wordpress/wp-content/uploads/2009/02/ill2.png"><img class="alignnone size-full wp-image-267" title="SQL Query" src="http://groups.csail.mit.edu/haystack/blog/wordpress/wp-content/uploads/2009/02/ill2.png" alt="" width="500" height="365" /></a></p>
<p>The bad news is: That didn&#8217;t work too well. The mistake may seem obvious to seasoned database application developers: I can&#8217;t just do several unrelated joins in parallel like that, or I&#8217;ll get a gazillion rows [2] back. Not only does this lead to exponentially bad performance, but the result is also in a rather annoying form as far as the client application is concerned. There is even another problem: if any of the courses in the database do not happen to have any sections or readings listed, they will be omitted from the result. SQL &#8220;fixes&#8221; this through a hack known as <a title="Outer joins" href="http://en.wikipedia.org/wiki/Join_(SQL)#Outer_joins">outer joins</a>. It introduces NULL values into the result and, rather undeclaratively, requires each join to have its particular join condition specified explicitly rather than as part of the more general WHERE clause.</p>
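<p>The row blow-up is easy to reproduce. The following Python sketch uses SQLite with a cut-down version of the tables above. The column choices and the sample rows are invented for illustration: one offering has 2 course codes, 3 readings, and 2 sections with 2 meetings each, and a second offering has no readings at all.</p>
<pre style="padding-left: 30px;">
```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE offerings    (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE course_codes (oid INTEGER, code TEXT);
CREATE TABLE readings     (oid INTEGER, author TEXT, title TEXT);
CREATE TABLE sections     (id INTEGER PRIMARY KEY, oid INTEGER, name TEXT);
CREATE TABLE meetings     (sid INTEGER, day TEXT);

-- Offering 1: 2 codes, 3 readings, 2 sections with 2 meetings each.
-- Offering 2: 1 code, 1 section with 1 meeting, and no readings.
INSERT INTO offerings VALUES (1, 'Databases'), (2, 'Seminar');
INSERT INTO course_codes VALUES (1, 'CS101'), (1, 'EE101'), (2, 'CS900');
INSERT INTO readings VALUES (1, 'A', 'R1'), (1, 'B', 'R2'), (1, 'C', 'R3');
INSERT INTO sections VALUES (10, 1, 'L01'), (11, 1, 'P01'), (20, 2, 'S01');
INSERT INTO meetings VALUES (10, 'M'), (10, 'W'), (11, 'T'), (11, 'Th'), (20, 'F');
""")

# Same shape as the query in the post: every table joined in one WHERE clause.
rows = conn.execute("""
SELECT o.title, cc.code, r.title, s.name, m.day
FROM   offerings o, course_codes cc, readings r, sections s, meetings m
WHERE  cc.oid = o.id AND r.oid = o.id AND s.oid = o.id AND m.sid = s.id
""").fetchall()

titles = sorted({row[0] for row in rows})
print(len(rows))  # 2 codes * 3 readings * 4 section meetings = 24
print(titles)     # the offering without readings is missing
```
</pre>
<p>Nine stored facts about the first offering (2 codes, 3 readings, 4 meetings) come back as 24 rows, and the second offering disappears entirely because the inner join on <code>readings</code> finds nothing to match.</p>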
<p>So how <em>do</em> we retrieve data like this from a relational database? We pull the joins out of the database and evaluate them ourselves, in our own application-specific data structures. Just about every non-trivial database web app out there does this in some way or another. The data is stored across multiple related tables in some MySQL or Postgres database. When the JavaScript in the end user&#8217;s browser needs to present data to the user in some hierarchical fashion like the example above, it issues a request to a server-side middle layer, written in PHP, Ruby on Rails, Python, Java, <a href="http://www.cs.princeton.edu/~bwk/reg.html">awk</a> or whatnot. The middle layer, possibly with the help of a persistence library, then issues a bunch of separate SQL queries to the database to retrieve all the data involved, assembles (read: joins) this into some hierarchical data structure, and returns it to the JavaScript app in <a title="JSON Spec" href="http://json.org/">JSON</a> or <a title="XML Big Picture" href="http://www.wdvl.com/Authoring/Languages/XML/XMLFamily/BigPicture/bigpix20a.html">XML</a> form. True, the database does help limit the data enough that this assembly process is not too much of a performance concern. <em>But joining tables is the job of the database, and we shouldn&#8217;t have to write middle layers to do it ourselves.</em></p>
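<p>For concreteness, here is roughly what such a middle layer ends up doing, sketched in Python against SQLite. The schema is a reduced, made-up version of the tables above, and <code>load_offerings</code> is a hypothetical helper, not part of any framework.</p>
<pre style="padding-left: 30px;">
```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE offerings (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE readings  (oid INTEGER, title TEXT);
CREATE TABLE sections  (id INTEGER PRIMARY KEY, oid INTEGER, name TEXT);
CREATE TABLE meetings  (sid INTEGER, day TEXT);
INSERT INTO offerings VALUES (1, 'Databases'), (2, 'Seminar');
INSERT INTO readings VALUES (1, 'R1'), (1, 'R2');
INSERT INTO sections VALUES (10, 1, 'L01'), (20, 2, 'S01');
INSERT INTO meetings VALUES (10, 'M'), (10, 'W'), (20, 'F');
""")


def load_offerings(conn):
    """Run one query per table, then nest the rows by foreign key."""
    offerings = {
        oid: {"title": title, "readings": [], "sections": []}
        for oid, title in conn.execute("SELECT id, title FROM offerings ORDER BY id")
    }
    for oid, title in conn.execute("SELECT oid, title FROM readings ORDER BY title"):
        offerings[oid]["readings"].append(title)
    sections = {}
    for sid, oid, name in conn.execute("SELECT id, oid, name FROM sections ORDER BY id"):
        sections[sid] = {"name": name, "meetings": []}
        offerings[oid]["sections"].append(sections[sid])
    for sid, day in conn.execute("SELECT sid, day FROM meetings ORDER BY rowid"):
        sections[sid]["meetings"].append(day)
    return list(offerings.values())


result = load_offerings(conn)
print(json.dumps(result, indent=2))
```
</pre>
<p>Each loop is a hand-written join. The offering without readings survives with an empty list, which is the behavior we wanted from the database in the first place.</p>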
<p>There should be a general and declarative way to make big joiny queries like the above work efficiently, returning the data in exactly the hierarchical form we want it &#8212; strictly relational result sets are not expressive enough. I am currently working on a simple SQL-like query language that does just this: send my generalized middleware a single big, <em>declarative</em> (no for loops or outer joins here!) query, and you&#8217;ll get back the JSON equivalent of the relational result set with the data nested into arrays and objects any way you want it.</p>
<p>[1] &#8220;No one does this!&#8221; some may object. Actually, Ruby on Rails&#8217; own ActiveRecord <a title="ActiveRecord Cartesian Product" href="http://dev.rubyonrails.org/ticket/9640">did for a while</a>.<br />
[2] I believe the technical term is &#8220;The Cartesian Product.&#8221; Darn you, Descartes.</p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/02/16/whats-wrong-with-sql/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Building a content management system just by drawing the web forms</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/01/06/building-a-content-management-system-just-by-drawing-the-web-forms/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/01/06/building-a-content-management-system-just-by-drawing-the-web-forms/#comments</comments>
		<pubDate>Tue, 06 Jan 2009 20:37:19 +0000</pubDate>
		<dc:creator>David Karger</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Publication]]></category>
		<category><![CDATA[Web Architectures]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=233</guid>
		<description><![CDATA[This is a nice talk by Kian Win Ong of UCSD called &#8220;Do It Yourself custom forms-driven workflow applications.&#8221;   They&#8217;re looking at all the work people invest building special purpose content management systems that really offer users little more than &#8220;CRUD&#8221; (create, read, update, delete) interactions for certain specialized kinds of content.
The basic approach is [...]]]></description>
			<content:encoded><![CDATA[<p>This is a nice talk by Kian Win Ong of UCSD called &#8220;Do It Yourself custom forms-driven workflow applications.&#8221;   They&#8217;re looking at all the work people invest building special purpose content management systems that really offer users little more than &#8220;CRUD&#8221; (create, read, update, delete) interactions for certain specialized kinds of content.</p>
<p>The basic approach is for the owner to manipulate the visible parts of the system (the forms that people use to enter data, and the pages that show the data in the system) and for the server to automatically create the schemas and databases needed to support those interfaces.  For example, if the owner adds a field to a form, the server will add a matching field to the back-end database, without the owner knowing anything about that database.  This class of tools is known as &#8220;forms-driven applications&#8221;.</p>
<p>The main contribution here is the observation that an important part of such applications is managing &#8220;workflows&#8221;: the way content is entered and then flows through various stages of the system, evolving and changing who can and should see it as it goes.   There needs to be a notion of roles and access permissions, and pages need to behave differently depending on both the state of the data and who is accessing the page.   It&#8217;s hard to do this if you only work with one page/form at a time.  Their tool tries to provide &#8220;guided debugging&#8221; of the entire workflow, suggesting the next steps that should happen to a particular piece of data, data it should be combined with, and roles it should be assigned to.</p>
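<p>To make the form-to-schema idea concrete, here is a small Python sketch using SQLite. It is only a guess at the general technique, not how the speakers&#8217; system works. The helper <code>sync_form</code> and the form and field names are invented for the example.</p>
<pre style="padding-left: 30px;">
```python
import sqlite3


def sync_form(conn, form, fields):
    """Create the form's table if needed, then add a column per new field."""
    # Identifiers are interpolated directly, so this assumes trusted names.
    conn.execute(
        'CREATE TABLE IF NOT EXISTS "{}" (id INTEGER PRIMARY KEY)'.format(form)
    )
    info = "PRAGMA table_info('{}')".format(form)
    existing = {row[1] for row in conn.execute(info)}  # row[1] is the column name
    for field in fields:
        if field not in existing:
            conn.execute('ALTER TABLE "{}" ADD COLUMN "{}" TEXT'.format(form, field))
    return [row[1] for row in conn.execute(info)]


conn = sqlite3.connect(":memory:")
first = sync_form(conn, "signup", ["name", "email"])
# The owner later draws one more field on the form.
second = sync_form(conn, "signup", ["name", "email", "phone"])
print(first)
print(second)
```
</pre>
<p>The owner only ever edits the list of fields, and the table follows along. Removing or renaming fields, which a real tool would also have to handle, is left out here.</p>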
<p>These ideas have been pushed into a startup called <a href="http://www.app2you.com/">app2you</a>.  I quite like the approach and hope it is successful.</p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/01/06/building-a-content-management-system-just-by-drawing-the-web-forms/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
