A Simple Extension for Microformat & RDFa Table Support
Microformats and RDFa provide a way to interweave semantic markup within a web document so that structured information can be more easily extracted. Both follow the hierarchical model of HTML: the structured data to be extracted may be spread across several layers of the DOM hierarchy. A pseudocode example is below, where the statement <Tom eats Hamburgers> is spread across three levels of the DOM tree.
<div subject="Tom">
<div property="eats">
Hamburgers
</div>
</div>
A wrinkle in this hierarchical mindset is the fact that a great deal of structured information on the web lives in the one HTML construct that does not fit a hierarchical representation: tables.
Consider Wikipedia, as just one source of structured data virtually begging to be annotated with RDFa. On many Wikipedia pages one of the first things that catches a visitor’s eye is the “Info Box”, a small box containing a structured summary of the key factual items on the page. Equally important is the way in which Info Boxes are populated: they begin their life as templates (the “Capital City” or “Baseball Player” template, for example), and all the Wikipedia contributor has to do is fill in the empty field values.
Microformats and RDFa are able to mark up tables, such as these Info Boxes, but they do so in a suboptimal way: they require repetition of semantic markup across each row (or column, depending on the orientation of the table). This:
- Is redundant from a representational standpoint.
- Clouds the ability of a data-enhanced web browser to infer that entire rows and columns of the table talk about the same thing.
- Forces the template writer to put markup within the table body, rather than on its column and row declarations.
To cure this, here is a simple proposal to add table support to both Microformats and RDFa. It is compact, needing only a single sentence to describe:
When parsing the DOM to extract embedded data and a <TD> element is encountered, treat its corresponding <COL> element as if it were the DOM parent of the <TR> element for that cell.
The problem with tables is that a table cell (TD) is contained in HTML by its row (TR) but not by its column (COL): cells are nested inside rows, while COL elements are declared separately at the top of the table and have no children of their own. This means that when we use tables and microformats/RDFa together, we’re stuck: we can put information across each row, but not down each column. So the proposal is to make a special case for table columns when it comes to pulling out structured information in the cells: even though the cell isn’t technically contained by the column element, pretend that it is.
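To make "its corresponding <COL> element" concrete, here is a minimal Python sketch of the lookup a parser would need. The function name column_for_cell and the dictionary representation of COL attributes are hypothetical, chosen only for illustration. The sketch honours the HTML span attribute on COL but assumes the cells themselves do not use colspan.

```python
from typing import Optional


def column_for_cell(cols: list, cell_index: int) -> Optional[dict]:
    """Return the attributes of the COL that covers a cell.

    cols       -- COL attribute dicts, in document order (hypothetical shape)
    cell_index -- 0-based position of the cell within its row
    Assumes cells do not use colspan; a COL may cover several
    columns through its span attribute.
    """
    position = 0
    for col in cols:
        width = int(col.get("span", 1))
        if position <= cell_index < position + width:
            return col
        position += width
    return None  # no COL was declared for this column


# Two declared columns, as in a table with "first" and "last" columns.
cols = [{"property": "first"}, {"property": "last"}]
second_cell_column = column_for_cell(cols, 1)
```

A parser following the proposal would then consult the returned column's attributes whenever the cell itself carries no property.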
Let’s see how this cleans up the representation of a simple table of president names. This table has one president per row: Thomas Jefferson and John Adams.
Here is a pseudocode example of how microformats/RDFa currently require such a table to be marked up. Notice how the “first” and “last” properties must be repeated in every row.
<TABLE>
<TR><TH>First</TH><TH>Last</TH></TR>
<TR subject="tj"><TD property="first">Thomas</TD><TD property="last">Jefferson</TD></TR>
<TR subject="ja"><TD property="first">John</TD><TD property="last">Adams</TD></TR>
</TABLE>
If we allow structured content to live in the <COL /> elements, then we do not need this repetition:
<TABLE>
<COL property="first" /><COL property="last" />
<TR><TH>First</TH><TH>Last</TH></TR>
<TR subject="tj"><TD>Thomas</TD><TD>Jefferson</TD></TR>
<TR subject="ja"><TD>John</TD><TD>Adams</TD></TR>
</TABLE>
When parsing either of these two tables for data, we extract the same information:
- :tj :first “Thomas”
- :tj :last “Jefferson”
- :ja :first “John”
- :ja :last “Adams”
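The equivalence of the two tables can be checked with a small Python sketch built on the standard library's html.parser module. It is one possible reading of the proposal, not a real microformat or RDFa parser: the class TableTripleParser and the function extract are hypothetical names, the subject and property attributes are the pseudocode attributes used above, and a cell is assumed to inherit the property of its column only when it declares none itself.

```python
from html.parser import HTMLParser


class TableTripleParser(HTMLParser):
    """Collects (subject, property, value) triples from the pseudocode tables."""

    def __init__(self):
        super().__init__()
        self.col_props = []   # property declared by each COL, by column position
        self.subject = None   # subject declared on the current TR
        self.cell_index = -1  # 0-based index of the current cell in its row
        self.prop = None      # property in effect for the current TD
        self.text = []
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # html.parser lowercases tag and attribute names
        if tag == "col":
            span = int(attrs.get("span", 1))
            self.col_props.extend([attrs.get("property")] * span)
        elif tag == "tr":
            self.subject = attrs.get("subject")
            self.cell_index = -1
        elif tag in ("td", "th"):
            self.cell_index += 1
            if tag == "td":
                # The proposed rule: fall back to the column's markup.
                inherited = None
                if self.cell_index < len(self.col_props):
                    inherited = self.col_props[self.cell_index]
                self.prop = attrs.get("property", inherited)
                self.text = []

    def handle_data(self, data):
        if self.prop is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "td":
            if self.subject and self.prop:
                value = "".join(self.text).strip()
                self.triples.append((self.subject, self.prop, value))
            self.prop = None
        elif tag == "tr":
            self.subject = None


def extract(html):
    parser = TableTripleParser()
    parser.feed(html)
    parser.close()
    return parser.triples


REPEATED = """<TABLE>
<TR><TH>First</TH><TH>Last</TH></TR>
<TR subject="tj"><TD property="first">Thomas</TD><TD property="last">Jefferson</TD></TR>
<TR subject="ja"><TD property="first">John</TD><TD property="last">Adams</TD></TR>
</TABLE>"""

WITH_COL = """<TABLE>
<COL property="first" /><COL property="last" />
<TR><TH>First</TH><TH>Last</TH></TR>
<TR subject="tj"><TD>Thomas</TD><TD>Jefferson</TD></TR>
<TR subject="ja"><TD>John</TD><TD>Adams</TD></TR>
</TABLE>"""

# Both markups should yield the same triples.
assert extract(REPEATED) == extract(WITH_COL)
```

The sketch keeps the row as the source of the subject and the column as the source of the property, which is the division of labour the two tables above illustrate.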
Tables are all over the web, and they make great templates to assist users in entering structured information. This small change to the semantics of microformat and RDFa parsing should allow a cleaner syntax for publishing both data and data-templates, easing the adoption of the respective formats.