Zachary Ives presented at CIDR about combining data from a large number of sources. Generally the DB community thinks of this as a hard problem for sophisticated data integration professionals. But what about cases like:
- combining data for an emergency management effort, where speed is essential
- a scientist gathering data from many bio portals relating to some specific gene sequence, perhaps changing the integration as they see more data
- assembling info about phones, changing the schema as we go
You want to quickly see the results of what you are integrating, but change as you go. The data is spread across many sources that you see for the first time the moment you want to integrate. Traditionally, integration has meant a lengthy design phase followed by application; we need a tighter feedback loop. So people often just manually copy and paste data into Excel. Can we start there and make something that is even easier and more intuitive than spreadsheet work? Can we avoid the separation between integrating data and then querying it to answer a specific question? It's important to be able to add new sources and attributes as understanding of the data develops.
Their system, called CopyCat, uses "smart copy and paste," which works by example. Users paste data into the system, which then proposes "auto-completions." Besides traditional "wrapper induction" (learning the pattern of data extraction), the system tries to guess the query/join the user is trying to construct, and to suggest new attributes: e.g., if it sees a street and state it might suggest a zip code. The user sees results and explanations, then gives feedback (or ignores them). A challenge is interpreting the user's feedback: what does it mean when they dislike a particular tuple?
Their UI looks like a table. You select and copy an element from a web page and paste it into a column; the column gets filled with similar elements. You add a type field to the header (the system makes some suggestions based on the contents and past experience), saying, for instance, that this is a school. It uses XPaths and also regular expressions (e.g., it can notice that addresses are all text fragments that start with numbers). It supports some simultaneous editing of the data in the columns.
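As a rough sketch of the by-example idea (not CopyCat's actual algorithm — the function names and the single address-like rule here are my own invention for illustration), filling a column from one pasted example might work like this: induce a pattern from the example, then keep every page fragment that matches it.

```python
import re

def induce_pattern(examples):
    # Toy wrapper induction: if every pasted example is a number
    # followed by words (address-like), generalize to that shape.
    if all(re.fullmatch(r"\d+ [A-Za-z][A-Za-z ]*", e) for e in examples):
        return re.compile(r"\d+ [A-Za-z][A-Za-z ]*")
    return None

def autocomplete_column(fragments, examples):
    # "Smart paste": fill the column with every page fragment that
    # matches the pattern induced from the user's pasted example(s).
    pat = induce_pattern(examples)
    return [f for f in fragments if pat and pat.fullmatch(f)]

fragments = ["77 Massachusetts Ave", "MIT", "32 Vassar St", "617-555-0100"]
print(autocomplete_column(fragments, ["77 Massachusetts Ave"]))
# → ['77 Massachusetts Ave', '32 Vassar St']
```

A real system would induce patterns over XPaths and richer token classes rather than one hard-coded regex, but the feedback loop is the same: paste one value, get the whole column.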
If you do this from multiple sources, making multiple tables, it looks for overlapping values in columns and uses them to suggest joins between tables that might be useful. If you highlight a column in one table, it shows which tables it will join with, and you can specify columns you would like to select from that join to combine into the first table, e.g. adding a phone number column that comes from joining through another table that has phone numbers.
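The overlap heuristic above can be sketched in a few lines. This is my own minimal version, assuming tables are dicts of column-name to value-list; it flags a candidate join wherever two columns from different tables share a large fraction of their distinct values:

```python
def suggest_joins(tables, threshold=0.5):
    # tables: {table_name: {column_name: [values]}}. Suggest a join
    # wherever two columns from different tables overlap heavily.
    suggestions = []
    names = list(tables)
    for i, t1 in enumerate(names):
        for t2 in names[i + 1:]:
            for c1, v1 in tables[t1].items():
                for c2, v2 in tables[t2].items():
                    s1, s2 = set(v1), set(v2)
                    # Containment-style overlap: shared values as a
                    # fraction of the smaller column's distinct values.
                    overlap = len(s1 & s2) / min(len(s1), len(s2))
                    if overlap >= threshold:
                        suggestions.append((t1, c1, t2, c2, overlap))
    return suggestions

schools = {"name": ["Cambridge Rindge", "Boston Latin"],
           "city": ["Cambridge", "Boston"]}
phones = {"school": ["Boston Latin", "Cambridge Rindge"],
          "phone": ["617-555-0001", "617-555-0002"]}
print(suggest_joins({"schools": schools, "phones": phones}))
# → [('schools', 'name', 'phones', 'school', 1.0)]
```

Here the `name`/`school` pair is suggested, so adding a phone number column to the schools table is one join away. A production system would also weigh value types and sample rather than compare full columns.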
Their system really comes in two pieces: data extraction and data processing. If tools like Exhibit became popular, the data extraction part would become much easier and might not require such powerful tools, but they still offer some interesting directions for helping the user figure out how to manipulate the data.