Hard information management that should have been easy
I’m composing the first of what I hope will be many posts describing information tasks that should be easy (ideally, point-and-click easy) but instead seem to be very hard (requiring programming). I post these in the hope of
- finding out about tools that do make these tasks easy, or
- fomenting thought about why these tasks are hard, and what new tools ought to be developed to make them easy
So on to the task. I maintain an index of Israeli Folk Dance videos. I (and others) build the collection manually. I find videos on youtube, get data about the dance from the IFD database, enter data about the dance and the video on a submission form and feed it into an Exhibit that makes it look nice. Recently, someone posted a collection of 511 new videos. Obviously I’d like to incorporate them into my index, and equally obviously I’m not interested in entering them all manually. What do I need to do? My goal is to produce a json file (or table, or spreadsheet) of the information from those videos, that I can add to my existing data file.
Let’s talk first about getting the information out of the video site. This site has some of the information I want for each video: a title, a link to the video, a link to the thumbnail, and usually even choreographer information in the comments. But how do I get a table of that information? Scraping it out of the html would be an tricky task—the stuff is spread over multiple pages, and scraping is always so error prone. Sifter has a trick for dealing with the multiple page problem, but is still susceptible to scraping errors. Fortunately, youtube has an API that lets me get out the data in machine readable form, like this. It was pretty easy for me to go read a document that told me what URL to type in to get the data dump.
As one particular annoying issue, the API also enforces pagination! I can only get a maximum of 25 items at a time. I can imagine the reason—they don’t want to waste processing serving too much data to someone who doesn’t actually need it—but wouldn’t it make sense to offer a “give me all the data” API for those who do need it?
Leaving the silly paging problem aside, and assuming I’ve managed to get a single lump of data, there’s the serious problem of data transformation. It’s no surprise that the data won’t be in the form I want, but what tools exist to help me transform it? The Yahoo API gives “tree-shaped” data results in XML (or json) syntax. Here’s an example:
<entry>
<id>http://gdata.youtube.com/feeds/api/videos/Qv-Ayqj4cjw</id>
<published>2008-10-07T00:51:30.000Z</published>
<updated>2008-11-19T16:23:49.000Z</updated>
<category scheme='http://gdata.youtube.com/schemas/2007/keywords.cat' term='dance'/>
<category scheme='http://gdata.youtube.com/schemas/2007/keywords.cat' term='דודו'/>
<title type='text'>El HaOr אל האור</title>
<content type='text'>Choreographer: Dudu Barzilay Year: 2003
Performing: Nissim Ben-Ami</content>
<link rel='alternate' type='text/html' href='http://www.youtube.com/watch?v=Qv-Ayqj4cjw'/>
<link rel='http://gdata.youtube.com/schemas/2007#video.responses' type='application/atom+xml'
href='http://gdata.youtube.com/feeds/api/videos/Qv-Ayqj4cjw/responses'/>
<author>
<name>NissimBenAmi</name>
<uri>http://gdata.youtube.com/feeds/api/users/nissimbenami</uri>
</author>
<media:group><media:title type='plain'>El HaOr אל האור</media:title>
<media:description type='plain'>Choreographer: Dudu Barzilay Year: 2003
Performing: Nissim Ben-Ami</media:description>
<media:keywords>dance, folk, israel, ברזילי, דודו, עם, ריקוד, ריקודי</media:keywords>
<yt:duration seconds='167'/>
<media:content url='http://www.youtube.com/v/Qv-Ayqj4cjw&f=gdata_user_uploads'
type='application/x-shockwave-flash' medium='video'
isDefault='true' expression='full' duration='167' yt:format='5'/>
<media:thumbnail url='http://i.ytimg.com/vi/Qv-Ayqj4cjw/2.jpg' height='97'
width='130' time='00:01:23.500'/>
<media:thumbnail url='http://i.ytimg.com/vi/Qv-Ayqj4cjw/1.jpg' height='97'
width='130' time='00:00:41.750'/>
<media:player url='http://www.youtube.com/watch?v=Qv-Ayqj4cjw'/></media:group>
<yt:statistics viewCount='342' favoriteCount='3'/>
<gd:rating min='1' max='5' numRaters='1' average='5.00'/>
<gd:comments>
<gd:feedLink href='http://gdata.youtube.com/feeds/api/videos/Qv-Ayqj4cjw/comments'
countHint='0'/>
</gd:comments>
</entry>
In fact, it came back even harder to read, but I’ve removed lots of extraeneous info and added some indentation. Even that required that I be comfortable reading and editing raw XML, which rules out most people.
The small bit of stuff I want is certainly in here, but how do I get it out? There are two stages to consider. The first is just plucking out the specific fields I want. Looking at the raw XML, I can see that I want the <title> field, the <content> field (which contains a bunch of info—choreographer, year—I’ll get back to that), the <media:thumbnail> and <media:player> fields. But that’s the easy part—how do I actually extract that data?
There are plenty of programming approaches, with the most suited probably being to write an XSLT transform. But why do I need to program? In fact, why did I need to look at the raw XML data at all? Describing what I want to do is easy, so doing it should be easy too! Is there a tool out there that will take this data and show it to me in a nice tree viewer, let me pick the elements I want, and export a table? Perhaps the right brute force approach is simply a tool that takes any xml data set and flattens it into a table, from which I can cut and paste columns using my favorite spreadsheet program.
Actually, pulling individual properties is only the start. Some of what I want is embedded in text, with only limited consistency—the various videos’ comments offer “choreographed by Dudu Barzilay”, “Dudu Barzilay Choreographer”, “CHOREOGRAPH: Dudy Barzilay”, etc. Also, an odd case here, some of the fields such as “media:title” above contain both an English and a Hebrew segment, which I would like to extract as separate properties so I can show e.g. a Hebrew title and an English Title in separate places instead of lumped together. Trivial for the human eye, but how do I specify that it should be done for me? Is this too weird a case to have non-programmatic support?
The next problem is data integration. How do I get all the rest of the information I want for my index? It’s already available online, at the IFD database, but how to I combine it? If I’m lucky, the title is spelled the same in that database and in the videos I’ve just extracted, so “all” I need to do is search for the title of each video, then go through another data scraping process to get the fields out of the tabular form that comes back. This database is actually more “primitive” than youtube, offering a tabular result set in an html page instead of an XML data feed; ironically, with today’s tools that may actually be easier to scrape. I can probably use a scripting tool to “learn” to do the submission and scraping for for me, and maybe if I’m lucky to iterate through all 511 videos. So perhaps this can all be done without programming. But it seems a really time consuming process, so it’ll probably be quite a while before I give it a try.