SIGIR09: Enhancing Cluster Labeling using Wikipedia
David Carmel from IBM Haifa spoke about the problem of labeling document clusters. The goal is to find short labels that describe each cluster well to end users. The typical approach looks for important terms in the clusters. But sometimes important terms aren’t helpful or meaningful, and sometimes the best labels don’t show up in the cluster at all. For example, in the Open Directory Project, a category’s label appeared in the text of the documents in that category only 85% of the time, and was rarely among the statistically important terms.
In this work, they try to match the cluster contents to articles in Wikipedia, then look at those articles’ metadata (titles, categories) to find good descriptive labels for the clusters. It seems to work pretty well.
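To make the idea concrete, here is a minimal sketch of the two steps described above: match a cluster against Wikipedia articles, then collect label candidates from the matched articles' titles and categories. Everything in it is an assumption for illustration, not the method from the talk: `TOY_WIKI` is a tiny hand-made stand-in for a real Wikipedia index, term importance is plain term frequency, articles are matched by word overlap, and candidates are ranked by summing the overlap scores of the articles that propose them.

```python
from collections import Counter

# Hypothetical stand-in for a Wikipedia index: title -> (article text, categories).
TOY_WIKI = {
    "Ice hockey": ("ice hockey team puck goal nhl players season",
                   ["Team sports", "Winter sports"]),
    "Baseball": ("baseball team pitcher inning bat players season",
                 ["Team sports", "Ball games"]),
    "Cryptography": ("encryption key cipher security algorithm",
                     ["Computer security"]),
}

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "in", "of", "and", "to", "was", "last"}


def important_terms(cluster_docs, k=10):
    """Return the k most frequent non-stopword terms in the cluster."""
    counts = Counter(
        word
        for doc in cluster_docs
        for word in doc.lower().split()
        if word not in STOPWORDS
    )
    return [term for term, _ in counts.most_common(k)]


def label_candidates(cluster_docs, wiki=TOY_WIKI, top_articles=2):
    """Rank titles and categories of the articles that best match the cluster."""
    query = set(important_terms(cluster_docs))

    # Step 1: score each article by how many query terms its text contains.
    scored = []
    for title, (text, categories) in wiki.items():
        overlap = len(query & set(text.split()))
        if overlap:
            scored.append((overlap, title, categories))
    scored.sort(key=lambda item: (-item[0], item[1]))

    # Step 2: every matched article votes for its own title and its categories.
    votes = Counter()
    for overlap, title, categories in scored[:top_articles]:
        votes[title] += overlap
        for category in categories:
            votes[category] += overlap

    # Highest vote first; ties broken alphabetically for a stable order.
    return sorted(votes, key=lambda label: (-votes[label], label))


cluster = [
    "the nhl season opened with a hockey goal",
    "hockey players and the puck",
    "team goal in the last season",
]
labels = label_candidates(cluster)
print(labels)
```

On this toy cluster the top-ranked candidate comes from the category metadata and never appears verbatim in the documents, which is the situation the talk was motivated by.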
To test this, they took a collection of text documents with labeled categories and checked whether the manual label was selected by their Wikipedia algorithm. They tested on some standard corpora: 20 Newsgroups, and the Open Directory Project, for which they manually labeled 100 categories. They also carefully explored the effects of cluster coherence, noise, etc.
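One simple way to score this kind of test is to count how often the manual label shows up among the top k suggested labels. The sketch below assumes that setup; the function name `match_at_k`, the case-insensitive exact-match rule, and the sample data are all hypothetical, and the talk's actual metrics may differ.

```python
def normalize(label):
    """Lowercase and collapse whitespace so 'Ice  Hockey' equals 'ice hockey'."""
    return " ".join(label.lower().split())


def match_at_k(suggestions, gold, k=5):
    """Fraction of clusters whose manual label is among the top-k suggestions.

    suggestions: cluster id -> ranked list of suggested labels
    gold:        cluster id -> manual label
    """
    if not gold:
        return 0.0
    hits = 0
    for cluster_id, gold_label in gold.items():
        top = [normalize(s) for s in suggestions.get(cluster_id, [])[:k]]
        if normalize(gold_label) in top:
            hits += 1
    return hits / len(gold)


# Hypothetical example data, not results from the talk.
gold = {"c1": "Ice Hockey", "c2": "Cryptography"}
suggestions = {
    "c1": ["Team sports", "ice hockey"],
    "c2": ["Computer security"],
}
score_top2 = match_at_k(suggestions, gold, k=2)
score_top1 = match_at_k(suggestions, gold, k=1)
print(score_top2, score_top1)
```

Exact matching is strict: a suggestion like "Computer security" for a cluster manually labeled "Cryptography" counts as a miss, so a real evaluation may want a looser notion of a correct label.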