SIGIR09: simultaneously modeling semantics and structure of threaded discussions
A group from MSR Asia is working on modeling threaded discussions. Threaded discussions pervade IMs, chat rooms, web forums, and mailing lists, and they're hierarchical. This group wants to mine both the semantics (discover the topics) and the structure (the author-reply relationships). The applications include spam blocking, reply reconstruction (figuring out which specific post another post is replying to, which may not be clear if the system is linear, like chat), and expert identification. Obviously later posts often reply to earlier ones, but which one? They also hope to identify and remove chitchat and spam.
As for the model, they posit that each thread has several topics (which kind of contradicts the “pure” notion of a thread, but is certainly true in practice). Each individual post, by contrast, is assumed to cover just a couple of topics. They try to approximate each post as a linear combination of the (topics of the) previous posts, but a sparse one: only a few nonzero coefficients, matching the idea that each post is narrow and replies to only a few earlier posts. For training, they used forums like Slashdot, which do track replies to a specific comment.
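To make the "sparse linear combination of previous posts" idea concrete, here is a minimal sketch, not the authors' actual formulation. It assumes each post is already represented as a topic vector, and it swaps in a generic non-negative L1-regularized least-squares fit solved by proximal gradient descent (ISTA). The names `sparse_reply_weights` and `predict_parent` and the parameters `lam` and `steps` are hypothetical, chosen for illustration only.

```python
import numpy as np

def sparse_reply_weights(prev_posts, post, lam=0.2, steps=500):
    """Fit post ~= sum_i w[i] * prev_posts[i] with w >= 0 and few nonzeros.

    Minimizes 0.5 * ||post - A @ w||^2 + lam * sum(w) subject to w >= 0,
    where the columns of A are the earlier posts' topic vectors.
    This is an illustrative stand-in, not the paper's model.
    """
    A = np.asarray(prev_posts, dtype=float).T   # shape: (topics, earlier posts)
    y = np.asarray(post, dtype=float)
    w = np.zeros(A.shape[1])
    lipschitz = np.linalg.norm(A, 2) ** 2       # largest singular value, squared
    if lipschitz == 0.0:
        return w                                # all earlier posts are empty
    for _ in range(steps):
        grad = A.T @ (A @ w - y)                # gradient of the squared error
        # Gradient step, then shrink by lam and clip at zero (sparsity).
        w = np.maximum(w - (grad + lam) / lipschitz, 0.0)
    return w

def predict_parent(prev_posts, post, lam=0.2):
    """Guess the replied-to post: the earlier post with the largest weight."""
    w = sparse_reply_weights(prev_posts, post, lam=lam)
    return int(np.argmax(w)), w

# Toy thread: three earlier posts, each about one of three topics.
earlier = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0]]
new_post = [0.0, 0.9, 0.1]                      # mostly about the second topic
parent, weights = predict_parent(earlier, new_post)
print(parent, weights)
```

In this toy thread the new post is dominated by the second topic, so the fit puts all of its weight on the second earlier post, and the small amount of third-topic content is shrunk to zero by the penalty. On a forum like Slashdot, where the true parent is recorded, a guess like this could be checked against the logged reply links.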