Learning to Identify Unexpected Instances in the Test Set

Xiao-Li Li, Bing Liu, See-Kiong Ng

Traditional classification involves building a classifier using labeled training examples from a set of predefined classes and then applying the classifier to classify test instances into the same set of classes. In practice, this paradigm can be problematic because the test data may contain instances that do not belong to any of the previously defined classes. Detecting such unexpected instances in the test set is therefore an important problem. The problem can be formulated as learning from positive and unlabeled examples (PU learning). However, current PU learning algorithms require a large proportion of negative instances in the unlabeled set to be effective. This paper proposes a novel technique to solve this problem in the text classification domain. The technique first generates a single artificial negative document AN. The sets P and {AN} are then used to build a naïve Bayesian classifier. Our experimental results show that this method is significantly better than existing techniques.
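The two steps named above (generate one artificial negative document AN, then train a naïve Bayesian classifier on P and {AN}) can be illustrated with a minimal sketch. The abstract does not say how AN is generated, so the selection rule used here (keep the words that are relatively more frequent in the unlabeled set than in P) is an assumption made only for illustration, as are all function names, the whitespace tokenization, and the Laplace smoothing. It should not be read as the authors' actual procedure.

```python
import math
from collections import Counter


def word_counts(docs):
    """Count whitespace-separated, lowercased tokens over a list of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def make_artificial_negative(positive_docs, unlabeled_docs):
    """Build a single artificial negative document AN as a bag of words.

    ASSUMPTION (not stated in the abstract): AN keeps every word whose
    relative frequency in the unlabeled set exceeds its relative
    frequency in the positive set P.
    """
    p_counts = word_counts(positive_docs)
    u_counts = word_counts(unlabeled_docs)
    p_total = sum(p_counts.values()) or 1
    u_total = sum(u_counts.values()) or 1
    an = Counter()
    for word, count in u_counts.items():
        if count / u_total > p_counts[word] / p_total:
            an[word] = count
    return an


def train_nb(positive_docs, an_counts):
    """Train a multinomial naive Bayes model on P (class 'pos') and {AN} (class 'neg')."""
    p_counts = word_counts(positive_docs)
    vocab = set(p_counts) | set(an_counts)
    n_docs = len(positive_docs) + 1  # AN counts as one document
    model = {}
    for label, counts, n in (("pos", p_counts, len(positive_docs)),
                             ("neg", an_counts, 1)):
        total = sum(counts.values())
        model[label] = {
            "log_prior": math.log(n / n_docs),
            # Laplace (add-one) smoothing over the shared vocabulary.
            "log_lik": {w: math.log((counts[w] + 1) / (total + len(vocab)))
                        for w in vocab},
        }
    return model


def classify(model, doc):
    """Return 'pos' or 'neg' for a document; out-of-vocabulary words are ignored."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for w in doc.lower().split():
            score += params["log_lik"].get(w, 0.0)
        scores[label] = score
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    P = ["stock market shares rise",
         "market trading shares fall",
         "investors buy stock shares"]
    U = ["stock market rally shares",
         "football match goal striker",
         "striker scores goal football match",
         "team wins football match"]
    AN = make_artificial_negative(P, U)
    nb = train_nb(P, AN)
    for doc in ("football match goal", "stock market shares"):
        print(doc, "->", classify(nb, doc))
```

In this sketch a test instance labeled "neg" would be treated as an unexpected instance. Because AN is a single document, the class prior leans toward the positive class as P grows; whether and how that should be corrected is left open here.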