Semi-supervised Learning for Multi-component Data Classification

Akinori Fujino, Naonori Ueda, Kazumi Saito

This paper presents a method for designing a semi-supervised classifier for multi-component data such as web pages consisting of text and link information. The proposed method is based on a hybrid of generative and discriminative approaches to take advantage of both approaches. With our hybrid approach, for each component, we consider an individual generative model trained on labeled samples and a model introduced to reduce the effect of the bias that results when there are few labeled samples. Then, we construct a hybrid classifier by combining all the models based on the maximum entropy principle. In our experimental results using three test collections such as web pages and technical papers, we confirmed that our hybrid approach was effective in improving the generalization performance of multi-component data classification.