On a Few Deadly Data Sins and the Entropy of Open Data

I just ran into a lovely and frustrating open-government-style map of stimulus funding put together in Colorado. The same tool is used in a number of other states, listed in Brady Forrest’s blog post at O’Reilly Radar. Lovely because it’s always nice to look at maps; frustrating because that’s all I can do. Where’s the data? That is, the little table consisting of project name, geographic coordinates, category, and dollar amount? I can’t find it anywhere on the page, or even on the site. I don’t know if this data set was created in Colorado; I’m betting it was actually assembled from information at data.gov. (As evidence, another map on the site claims “The reports were compiled from a variety of sources, including data received directly from government agencies and information posted on the federal Recovery Act website.”) Clearly this data exists, as it’s necessary to drive the application. But there’s no apparent way for me to get at it: the visualization is a Flash application that, as far as I can tell, has compiled the data into the body of the Flash app, where the only access would be through a Flash decompiler.

This is an example of what I’m going to call “Open Data Entropy” or perhaps “Opentropy”: the natural tendency of open data to decay into closed data over time. While it’s often understandable why certain data has never been opened (the cost of that initial preparation is too high), it’s a lot harder to justify closing off data that’s already been opened. But it happens a lot. Sometimes, this may be an active decision on the part of the author, driven by greed (bringing eyeballs to the site), pride (thinking only his visualization is good enough), or lust (wanting to be engaged in all uses of the data).

But I want to focus on another likely culprit: sloth. Many authors of data visualizations simply don’t care whether the data underlying those visualizations is open. They just want to publish the visualization, and will do the minimum necessary to get there. This throws responsibility for open data back to us tool developers. If we build tools where the user has to do something extra to open the data, they won’t bother. On the other hand, if we build tools where the user has to do something extra to close the data, they also won’t bother!

This perspective is part of the genius of the Exhibit data visualization framework that David Huynh built while he was still my student at MIT. Exhibit doesn’t say anything about open data. Instead, it focuses on incentivizing the author through beautiful data visualizations that can be created with ease. But as a side effect, any Exhibit created by any author will automatically make its data open through a simple copy button that appears by default when you hover over the visualization. Many authors probably don’t even notice it’s there. Those who do can dig through the manual to figure out how to turn it off, but very few bother.

Indeed, Exhibit has all the features necessary to replicate the Colorado map: a map view, an icon-based facet for selecting categories, and a pie chart. Plus, they could have thrown in an expenditure timeline and a pivot table for exploring the data. I wonder how much time or money they spent on their custom-built Flex application, with its side effect of closing off the data?

At the tail end of a nice post on the cool new Gridworks tool he built with that same David Huynh, Stefano Mazzocchi muses on the challenges of getting people who download data and improve it to share it back.  He points at this tweet from someone who’s pondering the pros and cons, and muses about how to push such people to play nice.  This is an important question, but I think it misses a much larger and easier target—those people who just don’t care.  No matter how willing someone is to share their data, it isn’t going to happen if it’s too hard.  On the other hand, if we make open data a default part of our authoring tools, we’ll see it popping up all over.

11 Responses to “On a Few Deadly Data Sins and the Entropy of Open Data”

  • I’d go a step further and say that for the new generation of data tools, it is not enough to just allow the data to be extracted. That’s technically sufficient to claim openness, but in most non-trivial cases, most viewers of the published visualization/report/story will not be in the position to do anything meaningful with the data in its extracted form. I think the tools have to understand that they exist to serve the publisher’s readers and doubters and critics as much as they serve the publishers themselves. (The publishers have to understand this about the tools, too.) Open Data, in a practical sense, has to mean that the tools for presenting an argument in data inherently facilitate examining the argument, debugging its reasoning, exploring alternatives, etc.

    This is part of what we’re trying to do (eventually, slowly) with Needle. If you’re looking at this table of top World Cup scorers, and doubt Ronaldo’s supposed 15 goals, you can click on the 15 and see when each one supposedly happened. If I made a logic error in the query, you have a prayer of finding it, just by clicking around. If you want to redo the table excluding penalty goals, you can (like this, which requires no editing privileges).

  • David Karger says:

    Well said! I agree completely with your sentiment. Interaction conveys way more information than looking at a static image—if pictures are 2d, then an interactive visualization adds a third dimension through interaction over time, creating much higher communication bandwidth.

    But I’ll quibble over tying the phrase “open data” to it. There is indeed huge benefit in allowing a user to interact with a data visualization. But this benefit accrues equally to visualizations backed by open data and by closed data. If I give you a nifty interactive timeline, I’m communicating better with you. But if I don’t give you the data that underlies that timeline, then you can’t use it for anything else. And I’m still controlling the discussion—you see what I want you to see, even if you see it better. To effectively argue over the data, you may need to create a visualization completely different than mine; one that highlights different aspects of the data. For that, grabbing the raw data is a must.

  • “you see what I want you to see, even if you see it better”

    No, my point is that in an open tool, you can see what you want to see. That last link in my previous post is to a reader-created query that does a qualitatively different analysis than the one the publisher of this data initially presented. In Needle the reader has the same analytical powers that the publisher does (just not the same publishing powers).

    Whereas if I just made the data available, and you discover that it consists of 17 relational tables with complicated linkages, it would take you a really long time to get it remodeled into some other analysis system where you could do anything with it at all. For most values of “you”, you’ll never be able to replicate my query in any other system, let alone vary it.

    (I’m also in favor of allowing the data to be extracted, of course, for the values of “you” where you really can make use of it, and Needle lets you export various formats of anything you can see. This is necessary. But it’s not at all sufficient.)

  • David Karger says:

    We’re arguing about which of two really good things—open data or flexible interfaces to it—is more important. But it’s still interesting to compare.

    As you say, in Needle the reader has the same analytical powers as the publisher. But that also means they have the same limitations. If Needle doesn’t offer the visualization they want, they can’t produce it in Needle. Fortunately, since Needle does offer open data, they can take that data somewhere else for the visualization they want. This is good: Needle offers a powerful query tool but only a limited set of visualizations. A tool like manyEyes offers the reverse. With open data, these tools work together.

    So which matters more? Well, if I have a tool with a great interface but no open data, I’m guaranteed to stumble someday on the visualization I want but can’t make with it. On the other hand, no matter how bad the tool is, if the data is open I can take it somewhere else. Which would I rather rely on: that there exists a tool that can create every interesting visualization, or that for every interesting visualization there exists a tool that can create it? I’ll put my money on placing the universal quantifier first, which is why in the end I think open data matters more than open query interfaces.

  • “if the data is open I can take it somewhere else”

    If the data is open, you can try to take it somewhere else. And if you’re you or me, there’s a reasonable chance that you’ll succeed. Eventually. Maybe.

    I’m saying that’s the wrong base criterion. For most people and most data, there’s no “take it somewhere else”. There’s no “somewhere else”, and a lot of times the “it” is not realistically manageable to begin with.

    Thus I’m saying that both things are essential.

    “Which would I rather rely on: that there exists a tool that can create every interesting visualization, or that for every interesting visualization there exists a tool that can create it?”

    “Visualizations” are not the problem. I’m talking about answers. How many goals did Ronaldo score? In how many games? Did anybody else score more goals? Or score in more games? You have to be able to answer questions before you can worry about whether you “visualize” the answers in a streamgraph or a wordle or a hyperfaceted whatever. Needle isn’t trying to offer every widget in the universe. But it is trying to provide a universal domain-agnostic data-querying language and data-exploration environment in which you can answer any computable question for which you have the necessary data. Maybe that doesn’t initially sound more realistic, but I think it actually is.

  • David Karger says:

    It’s true that nowadays it’s hard to take data somewhere else. But I think it’s going to be essential, for the same reasons I discussed in my last comment. If you want answers that are in Needle, fine. But what if the answer requires combining data in Needle with data in Freebase? Something’s got to move. So we have to find ways that make it possible for regular people to move data around. I think that’s actually one of the main opportunities offered by the Semantic Web (you knew that would creep in eventually): it offers a uniform data model and consistent entity names (resource URIs) that give some hope that you can copy data out of one repository, move it to another, and discover it linking up with its new environment.

    Now you might argue that making data mobile can be solved on the input side rather than the output side, by designing tools like Needle to understand and import a huge variety of data formats. But for that to work, all the other tools and repositories out there have to be able to export their data in some form that Needle can understand.

  • Right, and that’s both why I care about the Semantic Web, and why I think the RDF/SPARQL version of the SW toolset is undermining its important potential:

    - As a universal data-representation, RDF is a level too low (an assembly language where we need a Basic/Python/Ruby/Perl or even C/C++/Java, to use a programming analogy).

    - As a common tongue for discussing data and data-relationships, SPARQL is unwieldy, poorly expressive and generally not modeled on natural human patterns of inquiry.

    Thus my own personal Semantic Web agenda, manifest in Needle (along with various ITA corporate imperatives that aren’t about the SW), is to show a draft of an alternate data-model and query-language better suited for real human data needs. That is, I’m trying to do for interconnected graph-shaped data what the spreadsheet model and cell-reference/function language did for columns of numbers.

    Or, put another way, if every data system had Needle’s data-model and querying ability as its minimum function, in the same way that any spreadsheet is expected to talk CSV and calculate SUMs, then you wouldn’t need to move things around to do anything basic, but you could move things around if you really needed to.

  • [...] sparking these considerations, but it’s not: it’s something that Prof. David Karger wrote about my previous post (we deeply enjoy these blog-based conversations). He’s suggesting [...]

  • I think you are absolutely spot on with your analysis of online data. Great looking user interfaces, but no accountability for the data.

  • [...] a post last week I argued that the key to making structured data pervasive on the web was tools that make [...]

  • [...] claimed by analogy that people will be happy to share structured data (given the right authoring tools) the [...]