Building a Social Data Commons

Inspired by Ted’s vision of what he’d like to see happen to data.gov, I decided to take a stab at describing my own hopes for it. Ted’s desires for data.gov are all ones that I agree would make the data more accessible. I would now like to discuss what else I might want in a world where such steps were taken: a world in which government data were centralized, versioned, searchable, and accessible.

Now what? Given the large and growing pile of data we will optimistically uncover, we will run into new frustrations. People will claim that the published data formats are not the ones that their analysis tool requires. People will be overwhelmed by dataset size, not knowing where to start. People will unknowingly recreate someone else’s data-munging workflows on the way to repeating analyses of the same data. And if data ever ceases to be the bottleneck, people will become the next one.

There’s no one answer to the concerns listed above because everyone has a different goal for the data. To handle these issues, we will need more than a place to find up-to-date datasets: we will also need a place where it is easy for people to share ideas and strategies for tackling data. We will need a social data commons.

Whereas blogs and wikis help report findings, steps, and missteps, a social data commons can be the place to go to “talk shop” about the available data. Even if people post their solutions using decentralized means, there will be benefit to pooling all of these resources in one place on the web. Here are some tools that will help the data-tinkerers get things done:

  • Data-munging war stories. The first stage in data analysis is often long and frustrating. One must digest the dataset in the form they received it, and transform, clean, and filter out the subset that they wish to analyze, visualize, or otherwise present. The workflow differs for each dataset and application, but to the extent that people can share tools and instructions for processing each dataset, these should be written up in the form of recipes for baking the data.
  • Crowdsourced analysis. Datasets can be overwhelming. While many exploration tasks are easily automated, it is often easiest to leave certain tasks (e.g., “Find the interesting pictures”) to humans. Mechanical Turk gives us a hint at what this might look like, and the Guardian provides a wonderful example of crowdsourced public data analysis in action.
  • Current uses showcases. To spark competition, avoid duplicating work, and inspire follow-on projects, visitors should see a showcase of the current uses of each dataset. Aside from links to sites built around a dataset, the list can include embedded visualizations of finished work.
  • Analysis wishlists. Given that data released by a government reaches more than just programmers, there will be more people with ideas than people who can implement the ideas. People with ideas should be given an outlet, and passers-by should be asked to vote on these ideas to help data geeks with some free cycles discover the most interesting unimplemented projects.
  • Data wishlists. If an agency were to dedicate resources to releasing another dataset, which one is in highest demand? As Ted mentioned, governments should let demand drive delivery.
  • Forums. No set of tools will encompass all use cases for social data analysis. A discussion forum can lead to the formation of interest groups while serving as a catch-all for needs not served by the list above.
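To make the "recipes for baking the data" idea from the list above a little more concrete, here is a minimal sketch of what a shared data-munging recipe might look like in Python. Everything in it is an assumption for illustration: the function names (`clean_rows`, `filter_rows`, `run_recipe`), the column names, and the sample rows are made up and do not come from any real dataset on data.gov.

```python
import csv
import io


def clean_rows(rows):
    """Strip whitespace from column names and values; drop rows with no data."""
    for row in rows:
        cleaned = {
            key.strip(): (value or "").strip()
            for key, value in row.items()
            if key is not None  # ignore overflow fields from malformed rows
        }
        if any(cleaned.values()):
            yield cleaned


def filter_rows(rows, column, wanted):
    """Keep only the rows whose `column` equals `wanted`."""
    for row in rows:
        if row.get(column) == wanted:
            yield row


def run_recipe(csv_text, column, wanted):
    """Hypothetical recipe: parse the CSV, clean it, then filter a subset."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return list(filter_rows(clean_rows(reader), column, wanted))


# Made-up sample data with the kind of stray whitespace and empty
# rows that tend to show up in published files.
SAMPLE = "region, species ,count\nnorth, salmon ,12\n , ,\nsouth,trout,7\n"

if __name__ == "__main__":
    for row in run_recipe(SAMPLE, "region", "north"):
        print(row)
```

The point of writing a recipe as small, named steps is that someone else working with the same dataset can reuse the cleaning step and swap in their own filter, rather than rediscovering the dataset's quirks from scratch.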

The US government might hit a few bumps trying to implement some of these social features. For example, a conflict of interest might arise if the showcase of uses of a dataset includes a site critical of the current administration. Having the executive branch ban spam or abusive comments on a forum raises concerns about limiting free speech. These details are not roadblocks, but they do signal that we can’t expect a social overlay to spring out of data.gov per se. If we want these features, we may have to build and manage them on a third-party site.

I’m sure there’s more to the social data commons than I listed here. What did I miss, and where can we seek further inspiration?

Thanks to Ted for reading the first version of this entry.

3 Responses to “Building a Social Data Commons”

  • ManyEyes and others have done a great job in making it easier to create visualizations, but I think there are still a lot more unexplored possibilities to create viz tools for amateurs. That’s something we’re trying to explore with the next version of Scratch. One of the challenges is to come up with compelling examples that are appealing to regular people. A lot of the data out there is from the government, which is awesome but not the most exciting thing for a lot of people. Just like silly internet memes have helped spread wide adoption, I wonder what the equivalent is in the web of data age. Excellent post Adam!

  • Great post Adam, thanks. Found it in my search for existing social network sites for govt data, and
    tools for creating such sites. Currently trying to make good on a commitment I made recently
    at a regional workshop on Data Sharing, largely focused on salmon related data here in the
    Pacific Northwest (agencies/orgs involved: NOAA, BPA (under DoE), USFWS, USGS, USFS,
    US ACE, US BOR, EPA, WDFW, ODFW, Defenders of Wildlife, Nature Conservancy, etc.).
    My offer at the workshop was to help form a “guild” of data practitioners – people with skills
    in data modeling, information design, data sharing & mgmt. Generated lots of interest from
    data geeks as well as policy level folks looking for a forum for idea sharing and developing
    use cases for better data mgmt.

    Hence the tie into your ideas of showcases and wishlists. Love the way you’ve described
    them above – we’ve talked about the same things, albeit less eloquently.

    Would love to hear about similar regional or national level social networks / forums / user
    groups and learn from what they’ve done, what’s worked and what hasn’t. Also interested
    in learning which tools folks are using (e.g. jive, ning, various wiki products, etc.).

    By way of context/introduction, I am a partner in a small private company (Sitka Technology
    Group) here in Portland, Oregon on contract with the Fish & Wildlife group within
    Bonneville Power Administration (under the DoE) to build a web app to help them
    manage their F&W program.

    While we’re still building it (15 mos into a 24 mos project), it’s out there:
    http://www.cbfish.org We’ve recently added a slew of web
    services and are working with a number of other fed and state agencies to do data
    integrations. We did one web service integration with Streamnet where we help expose
    scientific data sets generated by projects that our system manages.

    I mention the above as it is but one small example of data visualizations that perhaps are
    appealing to regular people (to Andres Monroy-Hernadez’s point). Here are the data
    visualizations we’ve done using the datasets we manage for the fed govt:
    http://www.cbfish.org/Report.mvc/Index

  • Adam Marcus says:

    @Matt: It’s great to see examples such as yours, which show why the data collection, aggregation, discovery, and visualization process is important not only at the national level, but also for other data-oriented sites. It’s also nice to see the government funding regional data collection processes. Since you would benefit from functionality similar to what is discussed in this post, perhaps the correct solution is to build either a web service that provides this sort of functionality for many clients on a hosted platform, or an open solution that you can host yourself. Do you know of any such services or projects? Thanks for stopping by!