Dataless Short Text Classification Based on Biterm Topic Model and Word Embeddings

Yi Yang; Hongan Wang; Jiaqi Zhu; Yunkun Wu; Kailong Jiang; Wenli Guo; Wandong Shi

doi:10.24963/ijcai.2020/549

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence

Main track. Pages 3969-3975. https://doi.org/10.24963/ijcai.2020/549

Dataless text classification has attracted increasing attention recently. It needs only a few seed words per category to classify documents, which is much cheaper than supervised text classification, which requires massive labeling effort. However, most existing models focus on long texts and perform unsatisfactorily on short texts, which have become increasingly popular on the Internet. In this paper, we first propose a novel model named Seeded Biterm Topic Model (SeedBTM), which extends BTM to solve the problem of dataless short text classification with seed words. It takes advantage of both word co-occurrence information in the topic model and category-word similarity from widely used word embeddings as prior topic-in-set knowledge. Moreover, with the same approach, we also propose Seeded Twitter Biterm Topic Model (SeedTBTM), which extends Twitter-BTM and utilizes additional user information to achieve higher classification accuracy. Experimental results on five real short-text datasets show that our models outperform the state-of-the-art methods and perform especially well when the categories are overlapping and interrelated.
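
As a rough illustration of the two signals the abstract mentions, the following minimal sketch extracts biterms (unordered word pairs co-occurring in one short text) and scores a word against a category's seed words using embedding cosine similarity. It is not the SeedBTM inference procedure: the function names, the toy two-dimensional embeddings, the seed words, and the choice of averaged cosine similarity are all assumptions made for this example.

```python
from itertools import combinations
import math


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) from one short document."""
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def category_word_similarity(word, seed_words, embeddings):
    """Average cosine similarity between a word and a category's seed words.

    Averaging is an assumption of this sketch, not the paper's definition.
    """
    sims = [
        cosine(embeddings[word], embeddings[seed])
        for seed in seed_words
        if seed in embeddings
    ]
    return sum(sims) / len(sims) if sims else 0.0


# Hypothetical toy embeddings; real use would load pretrained vectors.
embeddings = {
    "goal": [1.0, 0.0],
    "match": [0.9, 0.1],
    "stock": [0.0, 1.0],
    "market": [0.1, 0.9],
}
# Hypothetical seed words, one per category.
seeds = {"sports": ["goal"], "business": ["stock"]}

biterms = extract_biterms(["goal", "match", "market"])
sports_score = category_word_similarity("match", seeds["sports"], embeddings)
business_score = category_word_similarity("match", seeds["business"], embeddings)
```

In this toy setup, "match" scores higher against the "sports" seeds than against the "business" seeds, which is the kind of category-word prior a seeded topic model could combine with biterm co-occurrence counts.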

Keywords:

Natural Language Processing: Text Classification

Data Mining: Mining Text, Web, Social Media

Machine Learning: Knowledge-based Learning

Machine Learning: Learning Graphical Models