<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Haystack Blog &#187; Databases</title>
	<atom:link href="http://groups.csail.mit.edu/haystack/blog/category/databases/feed/" rel="self" type="application/rss+xml" />
	<link>http://groups.csail.mit.edu/haystack/blog</link>
	<description>MIT CSAIL Research</description>
	<lastBuildDate>Tue, 24 Nov 2009 04:05:39 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Building a Social Data Commons</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/11/23/building-a-social-data-commons/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/11/23/building-a-social-data-commons/#comments</comments>
		<pubDate>Tue, 24 Nov 2009 03:28:04 +0000</pubDate>
		<dc:creator>Adam Marcus</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Social Computing]]></category>
		<category><![CDATA[Thought Piece]]></category>
		<category><![CDATA[eGovernment]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=743</guid>
		<description><![CDATA[Inspired by Ted’s vision of what he’d like to see happen to data.gov, I decided to have a try at my hopes for it. Ted’s desires for data.gov are all ones that I agree would make the data more accessible. I would now like to discuss what else I might want in a world where [...]]]></description>
			<content:encoded><![CDATA[<p>Inspired by <a href="http://groups.csail.mit.edu/haystack/blog/2009/11/18/plotting-a-course-for-data-gov/">Ted’s vision</a> of what he’d like to see happen to <a href="http://www.data.gov/">data.gov</a>, I decided to have a try at my hopes for it. Ted’s desires for data.gov are all ones that I agree would make the data more accessible. I would now like to discuss what else I might want in a world where such steps were taken: a world in which government data was centralized, versioned, searchable, and accessible.</p>
<p>Now what? Given the large and growing pile of data we will optimistically uncover, we will run into new frustrations. People will claim that the published data formats are not the ones that their analysis tools require. People will be overwhelmed by dataset size, not knowing where to start. People will unknowingly recreate someone else’s data-munging workflows on the way to repeating analyses of the same data. People will become the next bottleneck if data ever ceases to be one.</p>
<p>There’s no one answer to the concerns listed above because everyone has a different goal for the data. To handle these issues, we will need more than a place to find up-to-date datasets; we will also need a place where it is easy for people to share ideas and strategies for tackling data. We will need a <em>social data commons</em>.</p>
<p>Whereas blogs and wikis help report findings, steps, and missteps, a social data commons can be the place to go to “talk shop” about the available data. Even if people post their solutions using decentralized means, there will be benefit to pooling all of these resources in one place on the web. Here are some tools that will help the data-tinkerers get things done:</p>
<ul>
<li><strong>Data-munging war stories</strong>. The first stage in data analysis is often long and frustrating. Analysts must digest the dataset in the form in which they received it, then transform, clean, and filter out the subset that they wish to analyze, visualize, or otherwise present. The workflow differs for each dataset and application, but to the extent that people can share tools and instructions for processing each dataset, these should be written up in the form of recipes for baking the data.</li>
<li><strong>Crowdsourced analysis</strong>. Datasets can be overwhelming. While many exploration tasks are easily automated, it is often easiest to leave certain tasks (e.g., “Find the interesting pictures”) to humans. <a href="https://www.mturk.com/mturk/">Mechanical Turk</a> gives us a hint at what this might look like, and the Guardian provides a wonderful <a href="http://mps-expenses.guardian.co.uk/">example</a> of crowdsourced public data analysis in action.</li>
<li><strong>Current uses showcases</strong>. To spark competition, avoid duplicating work, and inspire follow-on projects, visitors should see a showcase of the current uses of each dataset. Aside from links to sites built around a dataset, the list can include <a href="http://manyeyes.alphaworks.ibm.com/manyeyes/">embedded visualizations</a> of finished work.</li>
<li><strong>Analysis wishlists</strong>. Given that data released by a government reaches more than just programmers, there will be more people with ideas than people who can implement the ideas. People with ideas should be given an outlet, and passers-by should be asked to vote on these ideas to help data geeks with some free cycles discover the most interesting unimplemented projects.</li>
<li><strong>Data wishlists</strong>.  If an agency were to dedicate resources to releasing another dataset, which one is in highest demand?  As Ted <a href="http://groups.csail.mit.edu/haystack/blog/2009/11/18/plotting-a-course-for-data-gov/">mentioned</a>, governments should let demand drive delivery.</li>
<li><strong>Forums</strong>. No set of tools will encompass all use cases for social data analysis. A discussion forum can lead to the formation of interest groups while serving as a catch-all for needs not served by the list above.</li>
</ul>
<p>The US government might hit a few bumps trying to implement some of these social features. For example, a conflict of interest might arise if the showcase of uses of a dataset includes a site critical of the current administration. Having the executive branch ban spam or abusive comments on a forum raises concerns over limits on <a href="http://www.wired.com/techbiz/people/magazine/17-04/st_thompson">free speech</a>. These details are not roadblocks, but they do signal that we can’t expect a social overlay to spring out of data.gov <em>per se</em>: if we want these features, we may have to build and manage them on a third-party site.</p>
<p>I’m sure there’s more to the social data commons than I listed here. What did I miss, and where can we seek further inspiration?</p>
<p><em>Thanks to Ted for reading the first version of this entry.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/11/23/building-a-social-data-commons/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Spreadsheets vs. Relational Databases: Bridging the Gap</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/09/16/spreadsheets-vs-relational-databases-bridging-the-gap/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/09/16/spreadsheets-vs-relational-databases-bridging-the-gap/#comments</comments>
		<pubDate>Thu, 17 Sep 2009 01:54:06 +0000</pubDate>
		<dc:creator>Eirik Bakke</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[PIM]]></category>
		<category><![CDATA[User Interfaces]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=501</guid>
		<description><![CDATA[For non-programmers, spreadsheets are usually the option of choice when it comes  to keeping track of non-trivial amounts of structured data. This is seen in all  kinds of settings ranging from the business world to public administration and academic research. Spreadsheets, however, can only  capture one kind of data structure: separate tabular [...]]]></description>
			<content:encoded><![CDATA[<p>For non-programmers, spreadsheets are usually the option of choice when it comes to keeping track of non-trivial amounts of structured data. This is seen in all kinds of settings ranging from <a title="Is Excel Running Your Business?" href="http://www.hiredbrains.com/proclarity.pdf">the business world</a> to public administration and academic research. Spreadsheets, however, can only capture one kind of data structure: separate tabular views of the data. This is a significant constraint for the user, who arguably thinks of the data, and needs to navigate it, in a more hierarchical manner (e.g. &#8220;each student takes a number of courses, each of which has a number of TAs&#8221;). In the &#8220;Hierarchical Spreadsheet&#8221; project we tried to extend the spreadsheet paradigm to include some useful features usually found only in the relational database world. Some potentially novel concepts included:</p>
<p>1) Strongly  typed worksheets with &#8220;advisory&#8221; error checking. For instance, the user can designate a particular column to hold numbers only, and maybe proceed to enter a date, but would then see an Excel-style warning dot in the cell in question.</p>
<p><img class="aligncenter" style="margin-top: 10px; margin-bottom: 10px;" title="Incorrectly Formatted Data" src="http://courses.csail.mit.edu/6.831/wiki/images/3/37/Hier_warning.png" alt="" width="219" height="190" />2) Transparent many-to-many or one-to-many relationships between worksheets in a workbook (think foreign key relationships in database-speak). The user can designate a particular column to hold references to rows in another worksheet, or lists of such. The other worksheet will then automatically have a corresponding column added containing references going in the other direction. (E.g. if each row in the &#8220;Departments&#8221; worksheet has a column referencing &#8220;Courses&#8221;, then &#8220;Courses&#8221; has a column referencing the corresponding rows in the &#8220;Departments&#8221; worksheet.)</p>
<p style="text-align: center;"><img class="aligncenter" title="Selecting a Row from a Referenced Worksheet" src="http://courses.csail.mit.edu/6.831/wiki/images/thumb/9/91/Hier_reference.png/500px-Hier_reference.png" alt="" width="500" height="121" /></p>
<p>3) Hierarchical  presentation of relationships between worksheets in the workbook. Columns that reference other worksheets may be configured to show any subset of columns from the referenced worksheet, and so on.</p>
<p><img class="aligncenter" style="margin-top: 10px; margin-bottom: 10px;" title="Showing Data from a Referenced Worksheet" src="http://courses.csail.mit.edu/6.831/wiki/images/thumb/7/78/Hier_main.png/500px-Hier_main.png" alt="" width="500" height="341" /></p>
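<p>As a rough illustration of concepts 1 and 2 above, here is a minimal Python sketch. It is not the project's implementation: the function names <code>advisory_warnings</code> and <code>back_references</code>, and all of the sample values, are assumptions made up for this example. The first function flags cells that do not match a column's declared type without rejecting them; the second derives the automatically added reverse column from a column of references.</p>

```python
def advisory_warnings(column_values, expected_type):
    """Return the indexes of cells that do not parse as expected_type.

    The cells are left exactly as entered; the result only says where a
    warning marker would be shown (hypothetical helper, not the project's code).
    """
    flagged = []
    for index, value in enumerate(column_values):
        try:
            expected_type(value)
        except ValueError:
            flagged.append(index)
    return flagged


def back_references(forward):
    """Given {row: [referenced rows]}, build the column pointing the other way."""
    reverse = {}
    for source, targets in forward.items():
        for target in targets:
            reverse.setdefault(target, []).append(source)
    return reverse


# A numbers-only column in which the user typed a date into the third cell.
credits = ["12", "9", "2009-09-16"]
flagged = advisory_warnings(credits, float)

# Hypothetical "Departments" worksheet whose rows reference "Courses" rows.
departments = {"Dept A": ["Course 1", "Course 2"], "Dept B": ["Course 2"]}
courses = back_references(departments)
```

<p>Here <code>flagged</code> holds only the index of the date cell, and <code>courses</code> maps each course back to the departments that reference it, which covers the many-to-many case.</p>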
<p>User testing with multiple prototypes showed that the user interface needed to be very similar to that of a traditional spreadsheet (e.g. Excel) to be usable by most users in the target population. Significant features hypothesized to make the interface more efficient (e.g. automatic report layout management) proved only to confuse the users and make it harder to design consistent editing affordances. Nevertheless, we did manage to integrate the key high-level features of the application (relationships between worksheets and the presentation of resulting hierarchical data on screen) into a prototype bearing a strong resemblance to Excel.</p>
<p>(This project was done by Paul Grogan, Yod Watanaprakornkul, and me.)</p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/09/16/spreadsheets-vs-relational-databases-bridging-the-gap/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>In Defense of a Semantic Web Wild West</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/09/14/in-defense-of-a-semantic-web-wild-west/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/09/14/in-defense-of-a-semantic-web-wild-west/#comments</comments>
		<pubDate>Mon, 14 Sep 2009 06:17:23 +0000</pubDate>
		<dc:creator>David Karger</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[PIM]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[Web Architectures]]></category>
		<category><![CDATA[CSAIL]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=447</guid>
		<description><![CDATA[A month ago Stefano Mazzocchi published an interesting article on data reconciliation (detecting when two identifiers refer to the same item, and merging them) where he advocated a more centralized &#8220;a priori&#8221; approach (trying to keep the identifiers merged at the beginning).  I posted a response arguing the value of a more anarchic &#8220;a posteriori&#8221; [...]]]></description>
			<content:encoded><![CDATA[<p>A month ago Stefano Mazzocchi published an interesting <a title="Stefano's blog post" href="http://www.betaversion.org/~stefano/linotype/news/304/">article</a> on data reconciliation (detecting when two identifiers refer to the same item, and merging them) where he advocated a more centralized &#8220;a priori&#8221; approach (trying to keep the identifiers merged at the beginning).  I posted a <a title="my blog response" href="http://groups.csail.mit.edu/haystack/blog/2009/07/24/is-rdf-any-good-without-a-web-of-linked-data/">response</a> arguing the value of a more anarchic &#8220;a posteriori&#8221; approach where you let anyone create whatever identifiers and relations they want, and worry about detecting linkages later.   Stefano <a title="stefano blog response" href="http://www.betaversion.org/~stefano/linotype/news/311/">responded</a> to that, but by then I was busy chairing the submissions for the <a title="ISWC 2009 home page" href="http://iswc2009.semanticweb.org/">2009 International Semantic Web Conference</a>.   Now that that&#8217;s over (I hope you will attend what should be an interesting meeting&#8212;October 25-29 near Washington DC) I&#8217;d like to pick up the discussion again.</p>
<p>I argued in favor of letting individuals make their own RDF collections (using, for example, our <a href="http://www.simile-widgets.org/exhibit/">Exhibit</a> framework) and worry about merging them with other people&#8217;s data later.  Stefano&#8217;s response accused me of using &#8220;RDF&#8221; and &#8220;structured data&#8221; interchangeably, asserting Exhibit is really just a nice UI over spreadsheet (tabular) data&#8212;that although it can export RDF, it is &#8220;not properly using RDF&#8221; because it has &#8220;lost the notion of globally unique identifiers (and in that regard, is much more similar to <a href="http://en.wikipedia.org/wiki/Microsoft_Excel">Excel</a> than to <a href="http://www.w3.org/2005/ajar/tab">Tabulator</a>)&#8221;.  Tim Berners-Lee has made similar complaints to me about Exhibit not using RDF.</p>
<p>This argument highlights for me yet another important ambiguity about what RDF <em>is</em>.   I occasionally have to help people understand that RDF is a <em>model</em>, not a syntax.  That some data can be RDF even if it isn&#8217;t serialized to RDF/XML.  That the key is to have items named by URIs, connected by relations named by URIs.  Stefano&#8217;s argument suggests a different blurring: between the model and its intended use.  Stefano&#8217;s &#8220;not properly using&#8221; phrase implies that if you don&#8217;t intend to merge your data into the global namespace, then even if you implement the model and write it down as RDF/XML to boot, you won&#8217;t be &#8220;properly using RDF&#8221;.</p>
<p>I want to address both these claims: that Exhibit is just a UI over spreadsheets, and that using RDF this way isn&#8217;t proper.</p>
<p><strong>RDF and spreadsheets</strong></p>
<p>Regarding the spreadsheet claim, I&#8217;ll begin by admitting that Stefano is absolutely right:  Exhibit is a visualization tool for tabular (spreadsheet) data.  But notice that <em>all</em> RDF is spreadsheet data&#8212;I can take all the RDF in the world and throw it into one spreadsheet.  In fact, I only need three columns to contain the subject (tail), object (head), and predicate (link) for each RDF statement.  Admittedly none of today&#8217;s spreadsheets would have enough rows, but that&#8217;s an engineering detail.  So, the spreadsheet <em>model</em> isn&#8217;t the problem.   And we also agree that Exhibit&#8217;s <em>interface</em> is nothing like spreadsheets&#8217;, and far better for the collection visualization tasks it is designed for.</p>
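<p>To make the three-column claim concrete, here is a minimal Python sketch that writes a few RDF-style statements into a CSV &#8220;spreadsheet&#8221; and reads them back. The <code>example.org</code> URIs, the sample values, and the helper names <code>triples_to_sheet</code> and <code>sheet_to_triples</code> are assumptions invented for illustration; nothing here is Exhibit&#8217;s own code.</p>

```python
import csv
import io

# Hypothetical statements: (subject, predicate, object). All URIs are made up.
triples = [
    ("http://example.org/dance/1", "http://example.org/prop/choreographer", "Choreographer A"),
    ("http://example.org/dance/1", "http://example.org/prop/type", "circle"),
    ("http://example.org/dance/2", "http://example.org/prop/choreographer", "Choreographer A"),
]


def triples_to_sheet(rows):
    """Write the statements as a three-column CSV sheet and return its text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["subject", "predicate", "object"])
    writer.writerows(rows)
    return buffer.getvalue()


def sheet_to_triples(text):
    """Read the three-column sheet back into a list of statements."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return [tuple(row) for row in reader]


sheet = triples_to_sheet(triples)
round_trip = sheet_to_triples(sheet)
```

<p>The round trip returns exactly the statements that went in, which is the sense in which the spreadsheet model loses nothing relative to the RDF model.</p>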
<p>I think instead that what Stefano is objecting to is a <em>usage</em> characteristic of spreadsheets versus RDF.  When I open a spreadsheet, the data it shows me is right there, in a file on my own system.  Global identifiers don&#8217;t matter because the data is all there (and presumably self-consistent) in the one spreadsheet.   In contrast, in Stefano&#8217;s image of RDF (and in Tim&#8217;s, as one can see from the Tabulator project) the data about a particular entity is spread all over the web, and it is the globally unique identifier that lets you go out, gather all that data together, and know that it is all about the same entity.</p>
<p>This is certainly an appealing vision.  But I want to argue that a focus on globally unique identifiers neglects two benefits of RDF that I consider equally important: <strong>data portability</strong> and <strong>schema flexibility</strong>.</p>
<p><strong>Spreadsheets suffice</strong></p>
<p>To illustrate this argument, I&#8217;ll hark back to a <a title="Hard data management blog post" href="../../2008/11/20/hard-information-management-that-should-have-been-easy/">previous post</a> where I discussed a data integration problem that should have been easy but wasn&#8217;t.   I keep an  <a href="http://simile.mit.edu/exhibit/">Exhibit</a> of folk dance videos on the web.   Recently, Nissim Ben Ami posted a <a href="http://il.youtube.com/profile_videos?p=r&amp;user=NissimBenAmi&amp;page=1">collection</a> of 511 new dance videos on YouTube.  I wanted to incorporate it into my site.  But it quickly became apparent that said incorporation would basically require my entering all 511 video descriptions manually into my system, and I still haven&#8217;t gotten around to it.</p>
<p>The major barriers were twofold.  The first was syntactic: the structured descriptions of the videos were delivered as XML.   That meant that in order to get at the data, I was going to have to learn XSLT&#8212;something I&#8217;ve been putting off for years.   The second hurdle was semantic: YouTube has the wrong schema for my folk dance videos.  I care about choreographer, dance type, and year choreographed; YouTube only offers slots for submitter and submission date of the video.  So, as you can see from <a title="Matzlichim video" href="http://www.youtube.com/watch?v=PgbRwUqHsOM">this example</a>, the contributor takes the usual approach: he takes his nice structured data and shoves it into the generic comment (info) field as free text.  All that structure is instantly lost.</p>
<p>Suppose instead that spreadsheets (or, in a pinch, RDF) were the accepted framework for publishing information on the web.  The YouTube &#8220;spreadsheet&#8221; would contain submitter and submission date information, but Nissim could just add &#8220;artist&#8221; and &#8220;composition-date&#8221; columns to hold the data he wanted to enter.   I would then be in a great position to download his data and incorporate it into my own catalog (spreadsheet).  What would I have to do?  After opening his spreadsheet and mine, I&#8217;d have to match columns&#8212;perhaps he called his &#8220;artist&#8221; and &#8220;composition date&#8221; while mine are &#8220;choreographer&#8221; and &#8220;year&#8221;.  But a simple copy and paste fixes that discrepancy.  Merging entities is not much harder than merging properties: a simple global replace will convert his choreographer &#8220;Israel Ya&#8217;akovi&#8221; to my &#8220;Israel Yakovee&#8221;.  The local consistency of his data and mine means that I only have to work once per choreographer (and in most cases I won&#8217;t have to: there&#8217;s a standard spelling for almost every choreographer&#8217;s name, which serves as a unique identifier<em> in this context</em> even if it isn&#8217;t a URL).</p>
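<p>The two manual steps just described, matching columns and then doing a global replace on names, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows, titles, and years are invented, and <code>COLUMN_MAP</code>, <code>NAME_MAP</code>, and <code>to_my_schema</code> are hypothetical names, not part of any existing tool.</p>

```python
# Step 1: which of his column names correspond to which of mine.
COLUMN_MAP = {"artist": "choreographer", "composition date": "year"}

# Step 2: which of his spellings correspond to my spelling of the same person.
NAME_MAP = {"Israel Ya'akovi": "Israel Yakovee"}


def to_my_schema(row):
    """Rename the columns of one row, then normalize the choreographer name."""
    renamed = {COLUMN_MAP.get(column, column): value for column, value in row.items()}
    if "choreographer" in renamed:
        name = renamed["choreographer"]
        renamed["choreographer"] = NAME_MAP.get(name, name)
    return renamed


# Invented sample rows standing in for the two spreadsheets.
his_rows = [
    {"title": "Dance A", "artist": "Israel Ya'akovi", "composition date": "2001"},
    {"title": "Dance B", "artist": "Choreographer B", "composition date": "2005"},
]
my_rows = [{"title": "Dance C", "choreographer": "Israel Yakovee", "year": "1999"}]

merged = my_rows + [to_my_schema(row) for row in his_rows]
```

<p>The two mapping tables are the only human input: one entry per mismatched column and one per mismatched name, however many rows the downloaded sheet contains.</p>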
<p>Overall, my work has been reduced by an order of magnitude.  Instead of laboriously entering 511 new records, I just download a spreadsheet and match up a handful of properties (columns) and a few tens of choreographer names.</p>
<p>Stepping back, observe that I&#8217;ve relied on two things.   First, on <strong>data portability</strong>&#8212;my being able to download the data in a convenient form: not XML, which is a programmer&#8217;s friend but an end-user&#8217;s enemy; rather, something I can just look at and understand.  Second, on <strong>schema flexibility</strong>&#8212;on Nissim&#8217;s being able to add whatever columns/properties he decides are important, instead of being limited to those used on the hosting web application.</p>
<p>I&#8217;m also relying on some features of this particular scenario, but I believe they often hold.   I am relying on Nissim&#8217;s data having only a small number of properties so that I can map them manually to mine.   I also rely on there being a small number of choreographers, and hope to take advantage of most of them having matching names in his data and mine&#8212;these names certainly aren&#8217;t globally unique identifiers, but they are &#8220;unique enough&#8221; when considering just my data and his.  Critically, I am not thinking of pulling all data about a given dance from a multitude of different web sites&#8212;this would demand global unique identifiers to link data since I would never have the patience.  Rather, I am considering a pairwise data acquisition: taking data I want from one internally consistent site.</p>
<p>Such pairwise acquisition is commonplace: any time a scientist wants to pull a data set from some other scientist&#8217;s lab, or a consumer wants to download product information about several cameras from a review site, or a student wants to include a Wikipedia data set in a report they are writing, there is an obvious single source and target for a data merger.   And there&#8217;s a human being who has the incentive, and with the right tools the capability, to do the limited amount of work needed to accomplish that merger.</p>
<p>This is a simple low-hanging fruit argument.  It would be wonderful to be able to <em>automatically</em> merge data from <em>thousands</em> of different sources into a coherent whole.  And this is a problem Freebase will need to solve, if they want to become the hub for aggregation of structured data.  But right now we can&#8217;t even <em>manually</em> merge data from <em>two</em> sites without doing a ridiculous amount of grunt work&#8212;so perhaps we should give some attention to that easier problem on our way to solving the hard one.</p>
<p><strong>Don&#8217;t skip the wild west<br />
</strong></p>
<p>I&#8217;d like to see these efforts proceed in parallel, but I&#8217;m worried about enthusiasm for the more ambitious goal blocking movement toward the low-hanging fruit.  I recently submitted a proposal to NIH on the topic of data integration that reflected my perspective above.  I argued that the current efforts in the biology community to force everyone to adopt a common ontology (and sometimes repository) for their experimental data are being resisted by biologists who think they know best how to present their data.  I suggested as an alternative that we give biologists tools, such as Exhibit, that would encourage them to publish their data in a common structured syntax, and worry about integrating all that data <em>after</em> it has become available in structured form.  The proposal rejection was accompanied by a review that said, on the one hand, &#8220;The benefit of the proposed approach is that it is very different from some multi-institutional data sharing projects (like caBIG), which have used a very rigid, top-down approach to creating semantics. Even if this project is unsuccessful it could bring to light new ideas and strategies that might make those large-scale projects more responsive to investigators and more successful.&#8221;  At the same time, it argued for rejection because &#8220;The absence of any control over the information models and ontologies – truly a semantic wild west – is daring and may ultimately be the downfall of this project.&#8221;</p>
<p>I&#8217;m fascinated to see, in the same review, a recognition of the problems that the current centralized approach is bringing (lack of buy-in to common ontologies by individual scientists who think they know better and probably do), and an unwillingness to tolerate the contrary (anarchic) solution.  I also love the metaphor of the &#8220;semantic wild west&#8221; because I think it supports my argument.  Would anyone have suggested establishing a city of several million people just after the west was opened for settlement?  The west&#8217;s early wildness was an unavoidable phase of its evolution towards the thickly settled and uniformly governed area it is now.    In the same vein, I think that our semantic web is best grown by encouraging individual semantic-web settlers to create their own data homesteads and begin looking for the trails that connect them to neighboring collections.  We need to get the data into plain view first.   Later we can send in the data sheriffs and place all those data sets under uniform governance.</p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/09/14/in-defense-of-a-semantic-web-wild-west/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Talk: Community-based ontology development alignment and evaluation</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/07/27/talk-community-based-ontology-development-alignment-and-evaluation/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/07/27/talk-community-based-ontology-development-alignment-and-evaluation/#comments</comments>
		<pubDate>Mon, 27 Jul 2009 19:23:10 +0000</pubDate>
		<dc:creator>David Karger</dc:creator>
				<category><![CDATA[Collective Intelligence]]></category>
		<category><![CDATA[Databases]]></category>
		<category><![CDATA[Semantic Web]]></category>
		<category><![CDATA[CSAIL]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=410</guid>
		<description><![CDATA[Natasha Noy gave a talk at CSAIL with the above title.  She works with a large medical bioinformatics group at Stanford.  The bioinformatics community in general couldn&#8217;t care less about cool computer science but is one of the few groups that have heavily adopted formal ontologies as a way to get their work done.  [...]]]></description>
			<content:encoded><![CDATA[<p>Natasha Noy gave a talk at CSAIL with the above title.  She works with a <a href="http://protege.stanford.edu/">large medical bioinformatics group</a> at Stanford.  The bioinformatics community in general couldn&#8217;t care less about cool computer science but is one of the few groups that have heavily adopted formal ontologies as a way to get their work done.  They have tons of data partitioned over many silos.   Biologists have adopted ontologies to provide canonical representations of scientific knowledge, or to annotate data to let others make use of it.  Often, it is not the authors who do it, but curators or automatic tools.</p>
<p>There are now hundreds of ontologies with tens of thousands of terms.  However, it has always been a &#8220;cottage industry&#8221;&#8212;various groups develop their own ontologies, then publish them for use by others.  Is there a way to open the development of the ontology up to the community?  The community might be just a few people or thousands.</p>
<p>As an example, the gene ontology (28K terms) has 3 full time curators.  People from the community submit to an issue tracker to get new terms etc.  A new version is released daily.  In contrast, the NCI thesaurus (for cancer) has 20 full time editors with 1 lead editor who runs everything, and a slow cycle of &#8220;releases&#8221; with less community input.  Others work like typical open source projects with 20-30 team members involved in active discussions.</p>
<p>Natasha&#8217;s group builds on Protege, a very old open source ontology editor that is now one of the most popular, with 120,000 registered users.  It has a very open plugin architecture with dozens of plugins for visualization, import, export, NLP, and lots of unknowns.  They&#8217;ve been working to augment Protege with support for collaboration.  It works in a distributed fashion (desktop and web clients).  It supports simultaneous editing, but also annotation, discussion, proposals and voting in the context of the ontology.  There are many types of annotations&#8212;questions, comments, proposals&#8212;on any elements of the ontology&#8212;classes, properties, instances.   While the tool handles most types of structured data, it is focused on taxonomic hierarchies where stuff gets inherited down the hierarchy.</p>
<p>They investigated use of their tool for several tasks.  One is ontology evaluation&#8212;finding existing ontologies that might be useful for you.  One source of information for this is author-contributed metadata about the ontologies&#8212;domain, key classes and concepts, who the developer is, etc. Another is automatic tools that compute quality metrics, and another is annotations by other users of the ontologies.</p>
<p>This last is important because some ontology metrics are subjective&#8212;a feature that is &#8220;good&#8221; in one setting can be awful in another.  An example might be a high level of axiomatization.  This is important for inference, but creates clutter if you just want description.  There&#8217;s also the problem of crosscutting taxonomies&#8212;you might have two different ways of describing the same domain that form a &#8220;matrix&#8221; of non-overlapping hierarchies.  To address this sort of subjectivity, they allow users to record evaluations of ontologies.</p>
<p>These tools can be explored at their <a title="Bioportal web site" href="http://bioportal.bioontology.org/">bioportal web site</a> where they have a large library of biomedical ontologies.  On that site, users can describe their ontology-based projects, and list/review the ontologies they are using.  Reviewers give general reviews, usage information, problems encountered, coverage of the key terms, major gaps, and issues with specific elements of the ontology.  This site aims to make ontology evaluation/creation a truly democratic process.  This is controversial&#8212;some argue that ontologies need a more rigorous editorial process (mirroring a current debate about open vs. traditional journal publication).</p>
<p>Another big task is mapping: connecting two ontologies by asserting that terms in two different ontologies &#8220;match&#8221;.  They aren&#8217;t trying to find mappings, but want to enable others to upload the mappings they have found.  Mappings can be created manually or uploaded in bulk (if computed by someone&#8217;s tool).  Mappings are themselves metadata, which can be annotated and discussed just like other data in the ontology.</p>
<p>Of course a big question is whether people will use these tools.  Right now, many users are asking for these features and reporting lots of bugs&#8212;good signs of demand.</p>
<p>A lot of questions have now arisen that need some serious user studies&#8212;what are the dynamics of the social networks that form around collaborative ontologies?  What are the different types of users/editors?  What produces the most discussion/controversy?  Do these tools help or hinder collaboration?</p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/07/27/talk-community-based-ontology-development-alignment-and-evaluation/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Interacting with Temporal Data @CHI09</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/04/17/interacting-with-temporal-data-chi09/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/04/17/interacting-with-temporal-data-chi09/#comments</comments>
		<pubDate>Fri, 17 Apr 2009 07:01:34 +0000</pubDate>
		<dc:creator>Max Van Kleek</dc:creator>
				<category><![CDATA[CHI]]></category>
		<category><![CDATA[Collective Intelligence]]></category>
		<category><![CDATA[Databases]]></category>
		<category><![CDATA[chi]]></category>
		<category><![CDATA[temporal data]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=310</guid>
		<description><![CDATA[This year Wendy Mackay, Aurélien Tabard and I held a workshop for examining interaction challenges surrounding time, in particular time as a component of temporal data sets.  Our interest in this topic was brought about by the observation that low-cost storage, cheap sensing technologies, the Web and high speed networking have started to bring us [...]]]></description>
			<content:encoded><![CDATA[<p>This year Wendy Mackay, Aurélien Tabard and I held a workshop for examining interaction challenges surrounding time, in particular time as a component of temporal data sets.  Our interest in this topic was brought about by the observation that low-cost storage, cheap sensing technologies, the Web and high speed networking have started to bring us vast quantities of rich temporal data &#8212; whether it is in &#8220;traditional&#8221; forms (such as audio or video), or &#8220;new&#8221; forms such as rich activity logs of people, places and things.  The availability of these volumes of new data presents new opportunities but also poses interaction challenges that we wished to start to identify and address.  From our CfP:</p>
<blockquote><p><em>Is time just another attribute of data? Or is it something more? Time brings meaning to data, especially data about the real world. Time is also essential for understanding human activity and an essential element of design processes. Sometimes we address time explicitly, sometimes implicitly. It structures how people interact with computers, but is also a measurable effect of that interaction. The goal of this workshop is to explore human-computer interaction from a temporal perspective.</em></p></blockquote>
<p>We were pleased that our workshop drew 35 participants with a variety of interests and backgrounds &#8212; from architects, interaction designers to data mining analysts, doctors, ubicomp researchers, and of course HCI researchers.  As can be seen in the <a title="Interacting with Temporal Data Workshop proceedings" href="http://temporal.csail.mit.edu/exhibit">workshop proceedings</a>, our participants were interested in a number of different types of temporal data:</p>
<ul>
<li>Media: audio + video capture, manipulation, editing, sharing</li>
<li>Personal health</li>
<li>Personal information management</li>
<li>Life logging (Personal activity data recording + Reflection)</li>
<li>Air traffic control</li>
<li>Financial data analysis</li>
<li>Sensor networks</li>
<li>Environmental impact monitoring</li>
<li>Product research</li>
<li>Software engineering</li>
</ul>
<p>Despite the diversity, several common themes emerged.</p>
<p>The first was empowerment: the idea that accurate, low-cost-to-capture rich records of people&#8217;s everyday activities could thoroughly change the way we live.  Participants highlighted several creative examples of how such records of our lives could help us &#8212; in personal, social and work contexts.  For example, getting accurate records of one&#8217;s daily routines (such as exercise and diet) could let people identify ways to live healthier [a la Thomas Goertz's Decision Tree].  Or such records could enable the hacking of social dynamics: for example, analyzing in situ or post-hoc repeated patterns of conflict in interactions with particular individuals so as to better understand sources of stress related to collaboration.  Or they could simply help the user more easily retrieve and manage their personal information in an activity-centric manner that complements human episodic memory.</p>
<p>The essential challenge was the question of how to give individuals (just-plain-folks, end-users) access to this rich data about themselves in a way that they could easily analyze, understand, manage and use.  One participant commented that such information was &#8220;turning citizens into intelligence analysts &#8212; about their own lives&#8221;.  Intelligence analysts, of course, have extensive training in how to look at data; end-users don&#8217;t.</p>
<p>Another was the question of accountability, access, protection and privacy: we have never previously had access to accurate records about any aspects of our lives.  Once we have these records, what sort of implications will this have on our interactions with others (e.g., indelible records of where people were, how long they were there, what they did)?  How will such records affect the process of scientific discovery, its processes and protocols, and how scientists work with one another?  How will we control or grant others access to these records in a way that provides individuals privacy?  If individuals are employees/members of organizations, who &#8220;owns&#8221; the data about an individual&#8217;s activities at work, and what rights does the individual have to access their own activity records?  Finally, after an individual departs (passes away or leaves an organization), how should such data be handled or retired? Who has rights to a deceased individual&#8217;s life log?</p>
<p>Other themes and topics of discussion included: the need for interfaces to help reconcile subjective/emotional memories of the past with &#8220;cold, hard lifelogs&#8221;; implicit versus explicit representations of time (e.g., different ways of portraying dynamic processes); and explanation facilities for time-dynamic pattern recognition.</p>
<p>Based on the strong interest from our workshop participants, we have decided to start a discussion group / online watering hole for us to further discuss some of the issues surrounding interaction with temporal data.  We welcome anyone interested (not only workshop participants) to join and post their thoughts, questions, projects and ideas:</p>
<ul>
<li><a title="Google Group on Temporal Data" href="http://group.google.com/group/temporal-data">Temporal-Data @ groups.google.com</a> &#8211; Temporal Data Google Group</li>
</ul>
<p>With this Google group we wish to continue our consolidated, cross-application domain discussion of interaction issues in the hope of taming the complexities of our data-rich environments.</p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/04/17/interacting-with-temporal-data-chi09/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Making the Case for Raw Data</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/03/24/making-the-case-for-raw-data/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/03/24/making-the-case-for-raw-data/#comments</comments>
		<pubDate>Tue, 24 Mar 2009 14:46:05 +0000</pubDate>
		<dc:creator>Adam Marcus</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Web Architectures]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=273</guid>
		<description><![CDATA[Tim Berners-Lee’s recent TED talk on Linked Data has inspired quite a few people to ask what exactly linked data is, how it differs from data on the semantic web, and how realistic it is to assume universal and unique addressability of data items. A world with linked data would be a world with richer, [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.ted.com/index.php/talks/tim_berners_lee_on_the_next_web.html">Tim Berners-Lee</a>’s recent TED talk on Linked Data has inspired quite a few people to ask what exactly linked data is, how it differs from data on the semantic web, and how realistic it is to assume universal and unique addressability of data items. A world with linked data would be a world with richer, more explorable data, and that notion on its own makes Tim’s talk worth viewing. The most inspiring part of his talk, in my opinion, was the one in which he got the entire crowd to loudly demand RAW DATA NOW. Given the push for more open datasets in government, and given that more websites are becoming API-providing data platforms, it is important to demand raw data where possible.</p>
<h2>The magic behind raw data</h2>
<p>The best thing about raw data is that almost everyone knows how it works. This means that as far as the data (re)user is concerned, the datasets are text files (or perhaps a close variant) that they can download, open in some default application, and get some immediate use out of.</p>
<p>If the US Federal budget dataset is released as a comma-separated file, a middle-schooler can download the file, open it in a spreadsheet application, and sum the columns to see how much we’re spending on the Department of Education this year. A more skilled high-schooler can upload the file to <a href="http://manyeyes.alphaworks.ibm.com/manyeyes/">Many Eyes</a>, make a pie chart out of it, and post it to their blog. A first-year college student can write a PHP script that lets people comment on various parts of that pie chart and drill in to various slices to get a finer granularity.</p>
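<p>To make the first of those scenarios concrete, here is a minimal sketch of summing one column of a comma-separated file with nothing but a standard library. The column names (<code>department</code>, <code>program</code>, <code>amount_usd</code>) and the figures are hypothetical placeholders, not real budget data.</p>

```python
import csv
import io

# Hypothetical rows standing in for a downloaded budget file.
# A real file would be read with open(path, newline="") instead.
RAW_BUDGET = """department,program,amount_usd
Education,Grants,120
Education,Loans,80
Energy,Research,50
"""


def total_for(raw_text, department):
    """Sum the amount_usd column over the rows of one department."""
    reader = csv.DictReader(io.StringIO(raw_text))
    return sum(int(row["amount_usd"]) for row in reader
               if row["department"] == department)


education_total = total_for(RAW_BUDGET, "Education")
print(education_total)
```

<p>The point is less the ten lines of code than what they did not need: no API key, no schema documentation, and no permission from the publisher.</p>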
<p>With raw data, you’ve opened more people to more visualization, exploration, and discussion than was available through the original web application that acted as a firewall to your database.</p>
<h2>Hugging the data to death</h2>
<p>During his talk, Tim spoke about “Database Huggers,” or people who, for various reasons, hide their data away in databases. Once the data sits in a database, the publisher might provide a specific and constrained view of the data by way of a website, or they might hide it even more, simply calculating some aggregate statistic over the data and claiming, without verification, that the data has certain properties.</p>
<p>There are several legitimate reasons for database hugging. Some data was meant to be private—academic, medical, and financial information are all datapoints we’d prefer to keep private. We’d hope our service providers will keep it out of the hands of others. Similarly, a company might have competitive reasons for keeping information private, especially when it would be equally valuable to their competitors and not too valuable to the public—lists of customers and transaction histories come to mind. Keeping this information far from the publicly accessible web is responsible and wise.</p>
<p>There are other cases, however, where the data should legitimately stay open and publicly accessible. Open government initiatives will result in many datasets published by organizations that <a href="http://www.recovery.gov/">will</a> or <a href="http://www.nih.gov/">should</a> exist in the public domain.  Many <a href="http://en.wikipedia.org/wiki/The_Long_Tail">Long Tail</a> websites, maintained by small groups of <a href="http://simile.mit.edu/exhibit/examples/cereals/cereal-characters.html">hobbyists</a>, probably would not mind if the datasets they generate are published in their full glory. For these types of applications, raw data is ideal.</p>
<p>Even in the case of datasets that should be open to the public, database huggers will sometimes disable direct access to the data, instead opting to place it in a database that sits behind an HTML-generating web application. Assuming that you’ve hidden your data behind HTML, thus making it safe from reuse, is unwise. In about an hour, a decent programmer can write a Perl script to crawl your site and tease the data apart from the obfuscated HTML that surrounds it, reverse-engineering your database without asking for permission. In fact, there are <a href="http://simile.mit.edu/wiki/Solvent">tools</a> that make this process easier than writing a one-off Perl script. And if you think you can block the person from accessing every page on your site in a short period of time, then they will just collaborate with <em>everyone else</em> who wants the data, write a <a href="http://www.greasespot.net/">Greasemonkey</a> script to collect parts of the site that they browse, and eventually collect your entire presented dataset.</p>
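<p>As a rough illustration of how little protection the HTML wrapper offers, the sketch below recovers rows from a table using only Python's standard <code>html.parser</code> module (Python rather than Perl here, purely for illustration). The page fragment and its contents are made up, and a real scraper would also need to fetch pages and cope with messier markup.</p>

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real crawler would download this over HTTP.
PAGE = ("<table>"
        "<tr><td>Widget A</td><td>3</td></tr>"
        "<tr><td>Widget B</td><td>5</td></tr>"
        "</table>")


class CellScraper(HTMLParser):
    """Collect the text of every table cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")      # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


scraper = CellScraper()
scraper.feed(PAGE)
print(scraper.rows)
```

<p>The result is the same two-column dataset the publisher kept in the database, reconstructed from the presentation layer alone.</p>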
<p>Databases are not inherently evil. They provide an excellent way to store, index, and query data, but they also have a way of separating the average user from that data. Most websites, for example, do not publish a read-only username and password to their database, for fear of arbitrary queries that could easily take down their machines, or at least keep the machines busy for a long time. We should design tools to maintain the excellent services that databases have been built to provide over the last four decades, without limiting the access to the raw data when such access would be most valuable.</p>
<h2>Are APIs the future of raw data?</h2>
<p>There is a middle ground between the highly private datasets and the obviously open ones. Most forward-thinking organizations have realized this. They have also realized that if they have something to sell, be it in meatspace or screenspace, it’s better to release the data about their offerings to anyone that wants to use it, so that people eventually end up at their site. They do this by providing a web <a href="http://en.wikipedia.org/wiki/API">API</a> to make their dataset queriable, essentially telling other software developers which questions they can answer about the dataset (<em>query for books by author</em>, <em>query for restaurants by cuisine</em>).  <a href="http://www.amazon.com/">Amazon</a> has some APIs, as does <a href="http://www.yelp.com/">Yelp</a>, and you’d have to be a pretty self-loathing web 2.0 company to not provide an API over <em>some portion</em> of your data.  So are APIs the solution?  Not always.</p>
<p>APIs are a step in the right direction—open data is better than obfuscated data. APIs help both third-party developers and dataset publishers get more out of a dataset. They have a few drawbacks as well:</p>
<ul>
<li>The API is an HTTP interface to <em>your</em> database.  This means that if <em>someone else</em> makes a third-party application that is immensely popular, it’s your database that pays for the brunt of its popularity. You weren’t expecting a huge ramp-up in server load? Too bad.</li>
<li>As kind as the dataset publisher is, they can’t predict <em>every</em> use of the data—if they could, they already would have implemented the best use cases. If they can’t predict how the consumer/developer will use the data, they might not publish a good hook into the dataset. This would either prevent or make awkward the interaction between the third-party application and the publisher.</li>
<li>Building an API for a dataset makes the people who are nice enough to share their data do <em>more work</em> on top of designing their application.  Following common <a href="http://en.wikipedia.org/wiki/Representational_State_Transfer">REST</a> or <a href="http://en.wikipedia.org/wiki/Create,_read,_update_and_delete">CRUD</a> conventions makes this easier, but still puts the onus on the developer. As a corollary, APIs don’t change with the data on their own: a change in your data means revising your API, so the API requires constant upkeep.</li>
</ul>
<p>One might argue that some of the criticisms of APIs are unfair:</p>
<ul>
<li>Saying that raw data will reduce the load on your database implies that the third party has some cache of the data, which is thus slightly out-of-date. You could imagine some sort of <a href="http://en.wikipedia.org/wiki/Comet_%28programming%29">Comet</a>-updated raw dataset system, but it’s unlikely for now that dataset publishers will be willing to stream live updates to third parties.</li>
<li>Perhaps the limited API functionality is for good reason. Amazon might never want you to be able to download their entire dataset—they don’t want to waste the bandwidth and they don’t want competitors to know exactly how many items they have on hand.</li>
<li>Publishing any sort of raw data will require extra work on behalf of the dataset publisher. Perhaps API-writing is the least invasive of their time?</li>
</ul>
<p>An ideal data management tool would allow raw data publishing when possible, and make it easier to build APIs when some limited access is desirable. We should not pretend to know the point at which raw data is superior to APIs, but the point exists somewhere. It’s important to understand the benefits that raw data provides on top of web APIs, so that you can think about when it would be valuable to use.</p>
<h2>After all this time, the answer was text files?</h2>
<p>You’ve probably become skeptical of these suggestions. Are we really supposed to throw away decades of database research in how to properly store, index, and query reasonably sized datasets so that a middle-schooler can look at the data in a different way? Of course not. The interesting research question becomes whether we can give the user the illusion of raw data while still benefiting from database technology where possible.</p>
<p>That’s one research direction we’re taking within the Haystack group. With the constraint that the raw data, in human-readable text files, should always be available, we’d like to blur the boundaries between databases and data-aware webservers.</p>
<p>Specifically, what we plan on designing is an Apache web server module that recognizes when it is serving a dataset, perhaps by taking note that it is serving a .csv, .rdf, or .json file. In such cases, the server would cook the data into a database behind the scenes. Data-aware clients (in JavaScript for the time being, but in the browser one day) can then query the web server about the data directly. Updates become difficult, but we can make consistency guarantees about the original raw data text files to ensure that someone can download them and see up-to-date information.</p>
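<p>The &#8220;cooking&#8221; step can be sketched in a few lines. This is only an illustration of the idea, not the planned Apache module: it assumes a CSV file with a header row, stores every column as text in an in-memory SQLite database, and uses a hypothetical table name (<code>dataset</code>) and made-up rows.</p>

```python
import csv
import io
import sqlite3

# Hypothetical raw dataset, as the web server would find it on disk.
RAW_CSV = """city,population
Springfield,30000
Shelbyville,20000
"""


def cook(raw_text, table="dataset"):
    """Load CSV text into an in-memory SQLite table, one TEXT column per header."""
    rows = list(csv.reader(io.StringIO(raw_text)))
    header, body = rows[0], rows[1:]
    db = sqlite3.connect(":memory:")
    columns = ", ".join('"{}" TEXT'.format(name) for name in header)
    db.execute('CREATE TABLE "{}" ({})'.format(table, columns))
    marks = ", ".join("?" for _ in header)
    db.executemany('INSERT INTO "{}" VALUES ({})'.format(table, marks), body)
    return db


db = cook(RAW_CSV)
# A data-aware client could now ask questions the text file cannot answer directly.
big = db.execute(
    "SELECT city FROM dataset WHERE CAST(population AS INTEGER) > 25000"
).fetchall()
print(big)
```

<p>The text file stays the source of truth; the database is a derived index that can be rebuilt whenever the file changes, which is one simple (if not very efficient) way to approach the consistency problem mentioned above.</p>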
<p>If you prefer programmatic access to the files, the module turns into a REST(, SQL, SPARQL, your favorite path language)-capable endpoint. If you prefer to get down and dirty with the data, you’ve got the text files.</p>
<p>We certainly don’t want to stand in the way of a world with Linked Data, so if you’d like, the tool will eventually return data with URIs. We can’t guarantee the URIs will resolve to anything useful, but that just might require a human’s touch. We’re not sure how that fits into the picture for the average data publisher, since the marginal benefit to the individual of universally addressing your own data is small, whereas the benefit to everyone else of adding another linked dataset grows with the number of datasets it is linked to.</p>
<h2>And now, for some questions</h2>
<p>We’re early in the development of our tools, so we’re open to your ideas and suggestions. Keeping text files up-to-date with the database that’s proxying them is nontrivial. Thinking of the ideal client/server mode of operation will also take time. We probably haven’t thought of the most important must-have feature yet, so any suggestions are welcome.</p>
<p><em>Thanks to Ted Benson, Sam Madden, and David Karger for their thoughts on this post.</em></p>
<p><em>(Cross-posted on <a href="http://blog.marcua.net/post/89373158/making-the-case-for-raw-data">my blog</a>)</em></p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/03/24/making-the-case-for-raw-data/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>What&#8217;s Wrong with SQL?</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/02/16/whats-wrong-with-sql/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/02/16/whats-wrong-with-sql/#comments</comments>
		<pubDate>Tue, 17 Feb 2009 04:27:33 +0000</pubDate>
		<dc:creator>Eirik Bakke</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Web Architectures]]></category>
		<category><![CDATA[data model]]></category>
		<category><![CDATA[facets]]></category>
		<category><![CDATA[hierarchical]]></category>
		<category><![CDATA[json]]></category>
		<category><![CDATA[nested relations]]></category>
		<category><![CDATA[object-relational impedance mismatch]]></category>
		<category><![CDATA[relational]]></category>
		<category><![CDATA[sql]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=260</guid>
		<description><![CDATA[A lot of things, Mike Stonebraker might say, but I have something rather fundamental in mind.
Suppose I&#8217;m developing some sort of academic course management system. Chances are I&#8217;ll want to display to the user a list of course offerings and their associated course codes, readings from the syllabus, meeting times etc. Maybe something like this:

Now [...]]]></description>
			<content:encoded><![CDATA[<p>A lot of things, Mike Stonebraker might say, but I have something rather fundamental in mind.</p>
<p>Suppose I&#8217;m developing some sort of academic course management system. Chances are I&#8217;ll want to display to the user a list of course offerings and their associated course codes, readings from the syllabus, meeting times etc. Maybe something like this:</p>
<p><a href="http://groups.csail.mit.edu/haystack/blog/wordpress/wp-content/uploads/2009/02/ill1.png"><img class="alignnone size-full wp-image-266" title="Logical Query (example from the Princeton University course catalog)" src="http://groups.csail.mit.edu/haystack/blog/wordpress/wp-content/uploads/2009/02/ill1.png" alt="" width="500" height="268" /></a></p>
<p>Now according to Good Rules of Normalization and Decency, I probably stored this data across several database tables, related by foreign keys. I might have tables named &#8220;offerings&#8221;, &#8220;course_codes&#8221;, &#8220;readings&#8221;, &#8220;sections&#8221;, &#8220;meetings&#8221; and so forth. So how do I retrieve all this related data from the database?</p>
<p>The good news is that relational databases are made for just this kind of task: joining tables efficiently is what they do for a living. Unsuspectingly, I run my query [1]:</p>
<pre style="padding-left: 30px;">SELECT o.title, cc.code, r.author, r.title, s.name,
       m.start_time, m.end_time, m.day, m.place
FROM   offerings o, course_codes cc, readings r, sections s,
       meetings m
WHERE cc.oid = o.id
AND   r.oid = o.id
AND   s.oid = o.id
AND   m.sid = s.id;</pre>
<p><a href="http://groups.csail.mit.edu/haystack/blog/wordpress/wp-content/uploads/2009/02/ill2.png"><img class="alignnone size-full wp-image-267" title="SQL Query" src="http://groups.csail.mit.edu/haystack/blog/wordpress/wp-content/uploads/2009/02/ill2.png" alt="" width="500" height="365" /></a></p>
<p>The bad news is: That didn&#8217;t work too well. The mistake may seem obvious to seasoned database application developers: I can&#8217;t just do several unrelated joins in parallel like that, or I&#8217;ll get a gazillion rows [2] back. Not only does this lead to exponentially bad performance, but the result is also in a rather annoying form as far as the client application is concerned. There is even another problem: if any of the courses in the database do not happen to have any sections or readings listed, they will be omitted from the result. SQL &#8220;fixes&#8221; this through a hack known as <a title="Outer joins" href="http://en.wikipedia.org/wiki/Join_(SQL)#Outer_joins">outer joins</a>. It introduces NULL values into the result and, rather undeclaratively, requires each join to have its particular join condition specified explicitly rather than as part of the more general WHERE clause.</p>
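<p>Both problems are easy to reproduce on a toy database. The sketch below uses Python's standard <code>sqlite3</code> module with a cut-down, hypothetical version of the schema (three child tables, made-up rows); it is meant to show the row blow-up and the dropped course, not to model the real catalog.</p>

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE offerings (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE course_codes (oid INTEGER, code TEXT);
CREATE TABLE readings (oid INTEGER, title TEXT);
CREATE TABLE sections (oid INTEGER, name TEXT);
INSERT INTO offerings VALUES (1, 'Course A'), (2, 'Course B');
INSERT INTO course_codes VALUES (1, 'X101'), (1, 'Y101'), (2, 'Z200');
INSERT INTO readings VALUES (1, 'R1'), (1, 'R2'), (1, 'R3');
INSERT INTO sections VALUES (1, 'L01'), (1, 'P01'), (2, 'L01');
""")

# Unrelated joins in parallel: every code is paired with every reading
# and every section of the same offering.
inner = db.execute("""
    SELECT o.title, cc.code, r.title, s.name
    FROM offerings o, course_codes cc, readings r, sections s
    WHERE cc.oid = o.id AND r.oid = o.id AND s.oid = o.id
""").fetchall()

# Outer joins keep offerings with no readings, at the price of NULLs
# and a join condition attached to each join.
outer = db.execute("""
    SELECT o.title, cc.code, r.title, s.name
    FROM offerings o
    LEFT JOIN course_codes cc ON cc.oid = o.id
    LEFT JOIN readings r ON r.oid = o.id
    LEFT JOIN sections s ON s.oid = o.id
""").fetchall()

print(len(inner), len(outer))
```

<p>With these rows, Course A has only 7 stored facts (2 codes, 3 readings, 2 sections) yet comes back as 2 &#215; 3 &#215; 2 = 12 rows, and Course B, which has no readings, is missing from the first result altogether. The outer-join version returns 13 rows: the same 12, plus one row for Course B with a NULL reading.</p>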
<p>So how <em>do</em> we retrieve data like this from a relational database? We pull the joins out of the database and evaluate them ourselves, in our own application-specific data structures. Just about every non-trivial database web app out there does this in some way or another. The data is stored across multiple related tables in some MySQL or Postgres database. When the Javascript in the end user&#8217;s browser needs to present data to the user in some hierarchical fashion like the example above, it issues a request to a server-side middle layer, written in PHP, Ruby on Rails, Python, Java, <a href="http://www.cs.princeton.edu/~bwk/reg.html">awk</a> or whatnot. The middle layer, possibly with the help of a persistence library, then issues a bunch of separate SQL queries to the database to retrieve all the data involved, assembles (read: joins) this into some hierarchical data structure, and returns it to the Javascript app in <a title="JSON Spec" href="http://json.org/">JSON</a> or <a title="XML Big Picture" href="http://www.wdvl.com/Authoring/Languages/XML/XMLFamily/BigPicture/bigpix20a.html">XML</a> form. True, the database does help limit the data enough that this assembly process is not too much of a performance concern. <em>But joining tables is the job of the database, and we shouldn&#8217;t have to write middle layers to do it ourselves.</em></p>
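<p>For readers who have not written such a middle layer, a minimal sketch of the pattern follows, again using Python's <code>sqlite3</code> with a hypothetical two-child-table schema and made-up rows. The function names (<code>build_db</code>, <code>column</code>, <code>fetch_nested</code>) are illustrative only.</p>

```python
import json
import sqlite3


def build_db():
    """Create a tiny hypothetical course database in memory."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE offerings (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE course_codes (oid INTEGER, code TEXT);
    CREATE TABLE readings (oid INTEGER, title TEXT);
    INSERT INTO offerings VALUES (1, 'Course A'), (2, 'Course B');
    INSERT INTO course_codes VALUES (1, 'X101'), (1, 'Y101'), (2, 'Z200');
    INSERT INTO readings VALUES (1, 'R1'), (1, 'R2');
    """)
    return db


def column(db, sql, oid):
    """Run a single-column query for one offering and return a flat list."""
    return [row[0] for row in db.execute(sql, (oid,))]


def fetch_nested(db):
    """Issue one query per child table and do the 'join' in application code."""
    result = []
    offerings = db.execute("SELECT id, title FROM offerings ORDER BY id").fetchall()
    for oid, title in offerings:
        result.append({
            "title": title,
            "codes": column(
                db, "SELECT code FROM course_codes WHERE oid = ? ORDER BY code", oid),
            "readings": column(
                db, "SELECT title FROM readings WHERE oid = ? ORDER BY title", oid),
        })
    return result


nested = fetch_nested(build_db())
print(json.dumps(nested, indent=2))
```

<p>Note that this sketch issues 1 + 2N queries for N offerings and hard-codes the shape of the output; both are exactly the kind of hand-written plumbing the paragraph above complains about.</p>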
<p>There should be a general and declarative way to make big joiny queries like the above work efficiently, returning the data in exactly the hierarchical form we want it &#8212; strictly relational result sets are not expressive enough. I am currently working on a simple SQL-like query language that does just this: send my generalized middleware a single big, <em>declarative</em> (no for loops or outer joins here!) query, and you&#8217;ll get back the JSON equivalent of the relational result set with the data nested into arrays and objects any way you want it.</p>
<p>[1] &#8220;No one does this!&#8221; some may object. Actually, Ruby on Rails&#8217; own ActiveRecord <a title="ActiveRecord Cartesian Product" href="http://dev.rubyonrails.org/ticket/9640">did for a while</a>.<br />
[2] I believe the technical term is &#8220;The Cartesian Product.&#8221; Darn you, Descartes.</p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/02/16/whats-wrong-with-sql/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Building a content management system just by drawing the web forms</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/01/06/building-a-content-management-system-just-by-drawing-the-web-forms/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/01/06/building-a-content-management-system-just-by-drawing-the-web-forms/#comments</comments>
		<pubDate>Tue, 06 Jan 2009 20:37:19 +0000</pubDate>
		<dc:creator>David Karger</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Publication]]></category>
		<category><![CDATA[Web Architectures]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=233</guid>
		<description><![CDATA[This is a nice talk by Kian Win Ong of UCSD called &#8220;Do It Yourself custom forms-driven workflow applications.&#8221;   They&#8217;re looking at all the work people invest building special purpose content management systems that really offer users little more than &#8220;CRUD&#8221; (create, read, update, delete) interactions for certain specialized kinds of content.
The basic approach is [...]]]></description>
			<content:encoded><![CDATA[<p>This is a nice talk by Kian Win Ong of UCSD called &#8220;Do It Yourself custom forms-driven workflow applications.&#8221;   They&#8217;re looking at all the work people invest building special purpose content management systems that really offer users little more than &#8220;CRUD&#8221; (create, read, update, delete) interactions for certain specialized kinds of content.</p>
<p>The basic approach is for the owner to manipulate the visible parts of the system&#8212;the forms that people use to enter data, and the pages that show the data in the system&#8212;and for the server to automatically create the schemas and databases needed to support those interfaces.  For example, if the owner adds a field in the form, the backend will add a field in the back-end database, without the owner knowing anything about that database.  This class of tools is known as &#8220;forms-driven applications&#8221;.</p>
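<p>The &#8220;add a form field, get a database field&#8221; idea can be sketched as follows. This is not how the system in the talk is implemented (the talk summary does not say); it is a minimal illustration using SQLite, in which the class name <code>FormBackedTable</code>, the table name, and the field names are all hypothetical and every field is stored as text.</p>

```python
import sqlite3


class FormBackedTable:
    """Keep a SQLite table's columns in step with a form's fields (all TEXT)."""

    def __init__(self, db, name, fields):
        self.db = db
        self.name = name
        self.fields = list(fields)
        cols = ", ".join('"{}" TEXT'.format(f) for f in self.fields)
        db.execute('CREATE TABLE "{}" ({})'.format(name, cols))

    def add_field(self, field):
        """Called when the owner draws a new field on the form."""
        self.db.execute(
            'ALTER TABLE "{}" ADD COLUMN "{}" TEXT'.format(self.name, field))
        self.fields.append(field)

    def submit(self, values):
        """Store one form submission given as a {field: value} dict."""
        names = ", ".join('"{}"'.format(f) for f in values)
        marks = ", ".join("?" for _ in values)
        self.db.execute(
            'INSERT INTO "{}" ({}) VALUES ({})'.format(self.name, names, marks),
            list(values.values()))


db = sqlite3.connect(":memory:")
form = FormBackedTable(db, "requests", ["name"])
form.submit({"name": "Ada"})
form.add_field("email")                      # the owner edits the form
form.submit({"name": "Bob", "email": "bob@example.org"})
rows = db.execute("SELECT name, email FROM requests ORDER BY name").fetchall()
print(rows)
```

<p>Submissions made before the field existed simply come back with an empty (NULL) value for it, so the owner never has to think about migrating old data.</p>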
<p>The main contribution here is the observation that an important part of such systems is managing &#8220;workflows&#8221;&#8212;the way content is entered and then flows through various stages of the system, evolving and changing who can and should see it as it goes.   There needs to be a notion of roles and access permissions, and pages need to behave differently depending on both the state of the data and who is accessing the page.   It&#8217;s hard to do this if you only work with one page/form at a time.  Their tool tries to provide &#8220;guided debugging&#8221; of the entire workflow, suggesting the next steps that should happen to a particular piece of data, data it should be combined with, and roles it should be assigned to.</p>
<p>These ideas have been pushed into a startup called <a href="http://www.app2you.com/">app2you</a>.  I quite like the approach and hope it is successful.</p>
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/01/06/building-a-content-management-system-just-by-drawing-the-web-forms/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>A Case for a Collaborative Query Management System</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/01/06/a-case-for-a-collaborative-query-management-system/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/01/06/a-case-for-a-collaborative-query-management-system/#comments</comments>
		<pubDate>Tue, 06 Jan 2009 20:14:49 +0000</pubDate>
		<dc:creator>David Karger</dc:creator>
				<category><![CDATA[Collective Intelligence]]></category>
		<category><![CDATA[Databases]]></category>
		<category><![CDATA[Publication]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=231</guid>
		<description><![CDATA[This is a CIDR presentation by Nodira Khoussainova of University of Washington arguing for a collaborative repository of complex SQL database queries.  Sounds like they want co-scripter for SQL.
There&#8217;s a problem of hunting through all the queries to find the one you want.  They want effective search and browsing, and also assistance in composing new [...]]]></description>
			<content:encoded><![CDATA[<p>This is a CIDR presentation by Nodira Khoussainova of University of Washington arguing for a collaborative repository of complex SQL database queries.  Sounds like they want co-scripter for SQL.</p>
<p>There&#8217;s a problem of hunting through all the queries to find the one you want.  They want effective search and browsing, and also assistance in composing new queries.    There are challenges:</p>
<ul>
<li>queries are not just strings, but complex objects with inputs, outputs, and semantics.  Two similar queries can have very different outputs, and two different queries can return the same output</li>
<li>typical search problem: need to avoid giving too many matches</li>
<li>efficient algorithms (this is a database conference after all)</li>
</ul>
<p>An application they have in mind is scientific data management.  There&#8217;s tons of data and lots of (shared) data analysis with complex queries that are frequently evolving.</p>
<p>Consider the scenario of a novice user trying to create a query, given a large repository of past queries by others. He&#8217;ll try to find a perfect match but will probably need to take something close and then modify it.  There must be a metaquery language for describing the kind of query you want.  Since the matching query was probably built over time, there may be many versions that evolved, and it can be useful for him to see all the different versions and find the best ones for his use.   It will be useful to explain to the user how these versions are related, e.g. this refines that.  One needs to watch out for the metaquery being more complicated to construct than the query the user wants to find.  One approach is the &#8220;partial query&#8221;: the user builds as much of the query as he can, then looks for other queries that are similar.</p>
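<p>The partial query idea can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the talk did not say how similarity would be measured, so the token-set Jaccard score, the function names, and the sample queries below are all hypothetical.</p>

```python
import re


def tokens(sql):
    """Lowercased set of the keywords and identifiers appearing in a query."""
    return set(re.findall(r"[a-z_][a-z0-9_]*", sql.lower()))


def jaccard(a, b):
    """Share of tokens two queries have in common, from 0.0 to 1.0."""
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0


def rank(partial, repository):
    """Repository queries ordered from most to least similar to the partial query."""
    want = tokens(partial)
    return sorted(repository, key=lambda q: jaccard(want, tokens(q)), reverse=True)


# Hypothetical repository of past queries (invented for illustration).
repository = [
    "SELECT name, salary FROM employees WHERE dept = 'ops'",
    "SELECT star_id, AVG(brightness) FROM observations GROUP BY star_id",
    "SELECT star_id, brightness FROM observations WHERE night = 12",
]

# The user writes as much of the query as they can, then searches.
partial = "SELECT star_id FROM observations"
best = rank(partial, repository)[0]
```

<p>Comparing tokens treats queries as strings, which is exactly the weakness the first challenge above points out; a real system would presumably compare parsed structure or outputs as well.</p>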
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/01/06/a-case-for-a-collaborative-query-management-system/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Role of Schema Matching in Large Enterprises</title>
		<link>http://groups.csail.mit.edu/haystack/blog/2009/01/06/the-role-of-schema-matching-in-large-enterprises/</link>
		<comments>http://groups.csail.mit.edu/haystack/blog/2009/01/06/the-role-of-schema-matching-in-large-enterprises/#comments</comments>
		<pubDate>Tue, 06 Jan 2009 19:43:31 +0000</pubDate>
		<dc:creator>David Karger</dc:creator>
				<category><![CDATA[Databases]]></category>
		<category><![CDATA[Semantic Web]]></category>

		<guid isPermaLink="false">http://groups.csail.mit.edu/haystack/blog/?p=228</guid>
		<description><![CDATA[A CIDR presentation by Ken Smith from MITRE on the use of the &#8220;match&#8221; operation that pairs properties of two different schemas.  It&#8217;s used to merge data from two different sources.  He&#8217;s arguing that there are tons of uses of schema matching that precede the actual merging of data.

When you are trying to decide whether [...]]]></description>
			<content:encoded><![CDATA[<p>A CIDR presentation by Ken Smith from MITRE on the use of the &#8220;match&#8221; operation that pairs properties of two different schemas.  It&#8217;s used to merge data from two different sources.  He&#8217;s arguing that there are tons of uses of schema matching that precede the actual merging of data.</p>
<ul>
<li>When you are trying to decide <em>whether</em> you should merge data from two different sources, e.g. to find out what portion of keys and concepts they have in common.</li>
<li>To decide between different approaches to integration.</li>
<li>To become aware of what information you even have.  E.g., the Department of Homeland Security was formed by mashing a bunch of different agencies together, and it isn&#8217;t even clear if they know what they know.</li>
<li>To help form communities, by discovering subcommunities with overlapping knowledge who could benefit from talking to each other.</li>
<li>The government often uses a &#8220;one to rule them all&#8221; massive schema as a means of data exchange, a hub and spoke model where everyone migrates their data into and out of the huge schema.  To use it, you have to find out where your little schema fits into the huge one.</li>
</ul>
<p>He described a case study of such schema matching and outlined limitations in existing tools and needs for the next generation.  Schema A, with 1374 elements, was a relational schema envisioned as a hub schema for the whole military.  Schema B was a relatively small (800-element) legacy schema.  They hoped to subsume B into A and forget about it going forward.  But they asked: do these schemas overlap (and what is the nature of the overlap)?  If not, maybe B should be left as an island.  What is distinctive about each?  Can you produce a comprehensive vocabulary of terms participating in one or both?  Nobody wanted any mappings (yet).  They did want summaries, statistics, and high level concepts of what the schemas address.  What were the commonalities and distinctions?</p>
<p>They used a schema mapping tool called Harmony.  It was hard to identify high level concepts from lists of matches: what is the &#8220;date_begin+156&#8221; property?  They started by <em>manually</em> identifying &#8220;high level concepts&#8221; in both schemas.  For each A concept, they looked for the strongest matches in B, then reported the numbers of overlaps and distinct concepts.  They concluded there weren&#8217;t many overlaps.  Customers said &#8220;great, can you incorporate these 7 other schemas?&#8221;  They really needed a way to automatically summarize a huge schema into coherent parts.   They found schema-centric views (showing the whole schema as one object) were insufficient.  They discovered belatedly that spreadsheets were actually a good way to show the pairwise matches of schema terms.  But this doesn&#8217;t work beyond 2 schemas.  Multi-way matching is hard and vital.</p>
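<p>To make the kind of summary they wanted concrete, here is a rough Python sketch that turns a list of pairwise matches into overlap counts. Everything in it is an assumption for illustration: the element names are invented, the matches are written by hand, and it does not reflect Harmony&#8217;s actual input or output format.</p>

```python
def overlap_summary(schema_a, schema_b, matches):
    """Count matched and unmatched elements, given (a_element, b_element) pairs."""
    # Keep only pairs whose two sides really belong to the two schemas.
    valid = [(a, b) for a, b in matches if a in schema_a and b in schema_b]
    matched_a = {a for a, b in valid}
    matched_b = {b for a, b in valid}
    return {
        "matched_in_a": len(matched_a),
        "matched_in_b": len(matched_b),
        "only_in_a": len(schema_a) - len(matched_a),
        "only_in_b": len(schema_b) - len(matched_b),
        # Fraction of the smaller legacy schema that the hub schema covers.
        "share_of_b_covered": len(matched_b) / len(schema_b) if schema_b else 0.0,
    }


# Hypothetical toy schemas; the real ones had 1374 and 800 elements.
schema_a = {"unit_id", "unit_name", "date_begin", "date_end", "location"}
schema_b = {"org_code", "start_dt", "supply_item", "quantity"}

# Hand-written match pairs standing in for the output of a matching tool.
matches = [("unit_id", "org_code"), ("date_begin", "start_dt")]

summary = overlap_summary(schema_a, schema_b, matches)
```

<p>A low covered share would support leaving B as an island. Note that this only handles two schemas at a time, which is the same limitation the spreadsheet view ran into.</p>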
]]></content:encoded>
			<wfw:commentRss>http://groups.csail.mit.edu/haystack/blog/2009/01/06/the-role-of-schema-matching-in-large-enterprises/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
