Avoidance of Model Re-Induction in SVM-based Feature Selection for Text Categorization

Aleksander Kolcz, Abdur Chowdhury

Searching the feature space for a subset that yields optimal performance tends to be expensive, especially in applications where the cardinality of the feature space is high (e.g., text categorization). This is particularly true for massive datasets and for learning algorithms that scale worse than linearly. Linear Support Vector Machines (SVMs) are among the top performers in the text classification domain and often work best with very rich feature representations. Even they, however, benefit from reducing the number of features, sometimes to a large extent. In this work we propose alternatives to exact re-induction of SVM models during the search for the optimal feature subset. These approximations offer substantial gains in computational efficiency. We demonstrate that they do not significantly compromise model quality and that, in some cases, they even improve accuracy.
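
As an illustration of what avoiding re-induction could look like, the sketch below assumes a linear model that has already been trained once on the full feature set. Each candidate subset is then scored by zeroing the weights of the discarded features instead of retraining. The abstract does not spell out the proposed approximations, so this is only one plausible reading. The function names (`top_k_mask`, `sweep_subset_sizes`), the ranking by absolute weight, and the toy data are all assumptions made for the example.

```python
def top_k_mask(weights, k):
    """Keep the k largest-magnitude weights and zero out the rest."""
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(ranked[:k])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]


def predict(weights, bias, x):
    """Linear decision rule: sign of the dot product plus bias."""
    score = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if score >= 0 else -1


def accuracy(weights, bias, X, y):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(predict(weights, bias, x) == label for x, label in zip(X, y))
    return correct / len(y)


def sweep_subset_sizes(weights, bias, X_val, y_val, sizes):
    """Score each subset size using masked weights, with no retraining."""
    return {k: accuracy(top_k_mask(weights, k), bias, X_val, y_val) for k in sizes}


# Hypothetical weights standing in for a linear SVM trained once on all features.
weights = [2.0, -1.5, 0.1, -0.05]
bias = 0.0

# Hypothetical validation set: four documents over four binary features.
X_val = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
y_val = [1, -1, 1, 1]

results = sweep_subset_sizes(weights, bias, X_val, y_val, sizes=[1, 2, 4])
for k, acc in sorted(results.items()):
    print(f"top-{k} features: validation accuracy = {acc:.2f}")
```

In this sketch the only training cost is the single initial fit. Each additional subset size costs one pass over the validation data, whereas exact re-induction would require a new SVM fit per candidate subset.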