Dataless Short Text Classification Based on Biterm Topic Model and Word Embeddings
Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence
Main track. Pages 3969-3975. https://doi.org/10.24963/ijcai.2020/549
Dataless text classification has attracted increasing attention recently. It needs only a few seed words per category to classify documents, making it much cheaper than supervised text classification, which requires massive labeling effort. However, most existing models focus on long texts and perform unsatisfactorily on short texts, which have become increasingly popular on the Internet. In this paper, we first propose a novel model named Seeded Biterm Topic Model (SeedBTM), which extends BTM to solve the problem of dataless short text classification with seed words. It takes advantage of both word co-occurrence information in the topic model and category-word similarity from widely used word embeddings as the prior topic-in-set knowledge. Moreover, with the same approach, we also propose Seeded Twitter Biterm Topic Model (SeedTBTM), which extends Twitter-BTM and utilizes additional user information to achieve higher classification accuracy. Experimental results on five real short-text datasets show that our models outperform the state-of-the-art methods and perform especially well when the categories are overlapping and interrelated.
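The two signals named in the abstract, word co-occurrence within a short text (biterms) and category-word similarity derived from seed words and word embeddings, can be illustrated with a minimal Python sketch. It is not the SeedBTM algorithm: the abstract does not specify how the prior is built, so averaging cosine similarity over seed words is an assumption, and the function names, the toy 2-d vectors, and the seed lists are all hypothetical.

```python
from itertools import combinations
import math


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) in one short text."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def category_word_similarity(word, seed_words, embeddings):
    """Average cosine similarity between a word and a category's seed words.

    Assumption: a plain average; the paper may combine similarities differently.
    """
    sims = [
        cosine(embeddings[word], embeddings[seed])
        for seed in seed_words
        if seed in embeddings
    ]
    return sum(sims) / len(sims) if sims else 0.0


# Hypothetical toy embeddings; real ones would come from pretrained vectors.
EMBEDDINGS = {
    "goal": [0.9, 0.1],
    "match": [0.8, 0.2],
    "football": [1.0, 0.0],
    "stock": [0.1, 0.9],
    "market": [0.0, 1.0],
}
# Hypothetical seed words, one list per category.
SEEDS = {"sports": ["football"], "business": ["market"]}

# Co-occurrence signal: all word pairs of one short text.
biterms = extract_biterms(["goal", "match", "stock"])

# Prior signal: how strongly the word "goal" is tied to each category.
prior = {
    category: category_word_similarity("goal", seeds, EMBEDDINGS)
    for category, seeds in SEEDS.items()
}
```

In this toy setup the word "goal" scores higher for the "sports" category than for "business", which is the kind of category-word prior a seeded topic model could use alongside the biterm counts.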
Natural Language Processing: Text Classification
Data Mining: Mining Text, Web, Social Media
Machine Learning: Knowledge-based Learning
Machine Learning: Learning Graphical Models