Short and Sparse Text Topic Modeling via Self-Aggregation / 2270
Xiaojun Quan, Chunyu Kit, Yong Ge, Sinno Jialin Pan
The overwhelming amount of short text data on social media and elsewhere has posed great challenges to topic modeling due to the sparsity problem. Most existing attempts to alleviate this problem resort to heuristic strategies to aggregate short texts into pseudo-documents before the application of standard topic modeling. Although such strategies cannot be well generalized to more general genres of short texts, the success has shed light on how to develop a generalized solution. In this paper, we present a novel model towards this goal by integrating topic modeling with short text aggregation during topic inference. The aggregation is founded on general topical affinity of texts rather than particular heuristics, making the model readily applicable to various short texts. Experimental results on real-world datasets validate the effectiveness of this new model, suggesting that it can distill more meaningful topics from short texts.