The Role of Schema Matching in Large Enterprises

A CIDR presentation by Ken Smith from MITRE on uses of the “match” operation, which pairs up properties of two different schemas.  Matching is usually a step toward merging data from two different sources.  He argues that there are tons of uses for schema matching that precede the actual merging of data:

  • To decide whether you should merge data from two different sources at all, e.g. by finding out what portion of keys and concepts they have in common.
  • To decide between different approaches to integration.
  • To become aware of what information you even have.  E.g., the Department of Homeland Security was formed by mashing a bunch of different agencies together, and it isn’t even clear that they know what they know.
  • To help form communities, by discovering subcommunities with overlapping knowledge who could benefit from talking to each other.
  • To locate your schema within a hub.  The government often uses a “one to rule them all” massive schema as a means of data exchange: a hub-and-spoke model where everyone migrates their data into and out of the huge schema.  To use it, you have to find out where your little schema fits into the huge one.
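To make the first bullet concrete, here is a minimal sketch of estimating what portion of one schema's element names have a counterpart in another.  It is not how Harmony or any tool from the talk works; it simply assumes that plain string similarity (Python's difflib) over element names is an acceptable stand-in for a real matcher, and the element names and the 0.8 threshold are made up for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and treat underscores and hyphens as spaces."""
    return name.lower().replace("_", " ").replace("-", " ").strip()


def similarity(a, b):
    """String similarity in [0, 1] between two normalized element names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def overlap_fraction(schema_a, schema_b, threshold=0.8):
    """Fraction of elements in schema_a with at least one element in
    schema_b scoring at or above the (assumed) threshold."""
    if not schema_a:
        return 0.0
    matched = sum(
        1
        for a in schema_a
        if any(similarity(a, b) >= threshold for b in schema_b)
    )
    return matched / len(schema_a)


# Hypothetical element names, not taken from the schemas in the talk.
schema_a = ["person_id", "date_begin", "unit_name", "vehicle_type"]
schema_b = ["PersonID", "begin_date", "unit-name"]

print(overlap_fraction(schema_a, schema_b))
```

Note that a purely character-based score like this misses pairs whose words are reordered (date_begin vs. begin_date), which is one reason real matchers combine several kinds of evidence.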

He described a case study of such schema matching and outlined the limitations of existing tools and the needs for the next generation.  Schema A had 1374 elements and was a relational schema envisioned as a hub schema for the whole military.  Schema B was a relatively small (800-element) legacy schema.  They hoped to subsume B into A and forget about it going forward.  But first they asked: do these schemas overlap, and what is the nature of the overlap?  If they don’t, maybe B should be left as an island.  What is distinctive about each?  Can you produce a comprehensive vocabulary of the terms participating in one or both?  Nobody wanted any mappings (yet).  They did want summaries, statistics, and the high-level concepts each schema addresses.  What were the commonalities and distinctions?

They used a schema mapping tool called Harmony.  It was hard to identify high-level concepts from lists of matches: what is a “date_begin+156” property?  They started by manually identifying “high-level concepts” in both schemas.  For each A concept, they looked for the strongest matches in B, then reported the numbers of overlapping and distinct concepts.  They concluded there weren’t many overlaps.  The customers said “great, can you incorporate these 7 other schemas?”.  They really needed a way to automatically summarize a huge schema into coherent parts.  They found that schema-centric views (showing the whole schema as one object) were insufficient.  They discovered belatedly that spreadsheets were actually a good way to show the pairwise matches of schema terms.  But this doesn’t work beyond 2 schemas.  Multi-way matching is hard and vital.
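The concept-level comparison described above can be sketched roughly as follows.  This is only an illustration under assumptions: the concept names, the string-similarity score, and the 0.8 threshold are all hypothetical, and none of it reflects how Harmony scores matches.  For each A concept it picks the strongest match in B, counts overlapping and distinct concepts, and emits the pairwise matches as CSV that a spreadsheet could open.

```python
import csv
import io
from difflib import SequenceMatcher


def score(a, b):
    """Assumed stand-in for a matcher's confidence score, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def strongest_matches(concepts_a, concepts_b):
    """For each A concept, return (a, best_b, score); best_b is None if B is empty."""
    rows = []
    for a in concepts_a:
        best = max(concepts_b, key=lambda b: score(a, b), default=None)
        rows.append((a, best, score(a, best) if best is not None else 0.0))
    return rows


def summarize(rows, concepts_b, threshold=0.8):
    """Count overlapping concepts and those distinct to each schema."""
    overlaps = [(a, b) for a, b, s in rows if s >= threshold]
    matched_a = {a for a, _ in overlaps}
    matched_b = {b for _, b in overlaps}
    return {
        "overlapping": len(overlaps),
        "distinct_to_a": len(rows) - len(matched_a),
        "distinct_to_b": len(set(concepts_b) - matched_b),
    }


def to_csv(rows):
    """Render the pairwise matches as CSV text for a spreadsheet."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["concept_a", "best_match_in_b", "score"])
    for a, b, s in rows:
        writer.writerow([a, b or "", f"{s:.2f}"])
    return buf.getvalue()


# Hypothetical high-level concepts, not the ones from the case study.
concepts_a = ["Personnel", "Logistics", "Weather"]
concepts_b = ["personnel", "Medical"]

rows = strongest_matches(concepts_a, concepts_b)
summary = summarize(rows, concepts_b)
print(summary)
print(to_csv(rows))
```

A table like this is inherently pairwise: with N schemas there are N(N-1)/2 such tables, which hints at why the spreadsheet view stops scaling past two schemas.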
