Short Text Conceptualization Using a Probabilistic Knowledgebase
Yangqiu Song, Haixun Wang, Zhongyuan Wang, Hongsong Li, Weizhu Chen
Most text mining tasks, such as clustering, are dominated by statistical approaches that treat text as a bag of words. Semantics in the text is largely ignored in the mining process, and the mining results are often not easily interpretable. One particular challenge faced by such approaches is short text understanding, as short text lacks enough content from which a statistical conclusion can be drawn. For example, traditional topic analysis methods consider topic segments of tens to hundreds of words. Latent topic modeling, such as latent Dirichlet allocation, also requires sufficient words to infer a document's topic distribution. We enhance machine learning algorithms by first giving the machine a probabilistic knowledgebase whose concepts (of worldly facts) are as big, rich, and consistent as those in our mental world. Then a Bayesian inference mechanism is developed to conceptualize words and short text. We conducted comprehensive tests of our method on conceptualizing sets of text terms, as well as on clustering Twitter messages (tweets), which are typically about ten words long. Compared with latent semantic topic modeling and four other kinds of methods that use WordNet, Freebase, and Wikipedia (category links and explicit semantic analysis), we show significant improvements in tweet clustering accuracy.
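To illustrate the kind of Bayesian conceptualization the abstract describes, the following is a minimal sketch, not the paper's actual method: it assumes a toy concept-term co-occurrence table (a real probabilistic knowledgebase would hold millions of facts) and ranks concepts by the posterior P(concept | terms) under a naive-Bayes independence assumption. The `kb` table, its counts, and the function name are all hypothetical.

```python
# Hypothetical toy knowledgebase: how often each term is observed
# under each concept. A real system would derive these counts from
# a large web-scale knowledgebase, not a hand-written dict.
kb = {
    "fruit":   {"apple": 80, "pear": 60, "banana": 70},
    "company": {"apple": 90, "microsoft": 95, "google": 85},
}

def conceptualize(terms, kb, smoothing=1e-6):
    """Rank concepts by posterior P(c | terms), assuming terms are
    conditionally independent given the concept (naive Bayes):
        P(c | t1..tn) proportional to P(c) * prod_i P(t_i | c).
    """
    total = sum(sum(counts.values()) for counts in kb.values())
    scores = {}
    for concept, term_counts in kb.items():
        concept_total = sum(term_counts.values())
        score = concept_total / total  # prior P(c)
        for t in terms:
            # Smooth unseen terms so one miss doesn't zero the posterior.
            score *= max(term_counts.get(t, 0) / concept_total, smoothing)
        scores[concept] = score
    z = sum(scores.values())  # normalize into a distribution
    return {c: s / z for c, s in scores.items()} if z else scores
```

For an ambiguous single term like "apple", the posterior stays split across "fruit" and "company"; adding a disambiguating context term such as "pear" shifts the mass toward "fruit", which is the intuition behind conceptualizing short text rather than isolated words.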