Robust High-Dimensional Classification From Few Positive Examples

Deepayan Chakrabarti, Benjamin Fauber

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence
Main Track. Pages 1952-1958. https://doi.org/10.24963/ijcai.2022/271

We tackle an extreme form of imbalanced classification, with up to 10^5 features but as few as 5 samples from the minority class. This problem arises in predicting tumor types and in fraud detection, among other applications. Standard imbalanced classification methods are not designed for such severe data scarcity: sampling-based methods need too many samples due to the high dimensionality, while cost-based methods must place too high a weight on the few minority samples. Our proposed method, called DIRECT, bypasses sample generation by training the classifier over a robust smoothed distribution of the minority class. DIRECT is fast, simple, robust, parameter-free, and easy to interpret. We validate DIRECT on several real-world datasets spanning document, image, and medical classification. DIRECT is up to 5x-7x better than SMOTE-like methods, 30-200% better than ensemble methods, and 3x-7x better than cost-sensitive methods. The greatest gains are in settings with the fewest minority-class samples, where DIRECT's robustness is most helpful.
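To make the core idea concrete, below is a minimal Python sketch of training a classifier over a Gaussian-smoothed distribution of the minority class, approximated here by Monte Carlo draws. This is not the paper's DIRECT procedure, which bypasses sample generation entirely; the smoothing bandwidth sigma, the draw count m, and the toy data are illustrative assumptions.

```python
# Minimal sketch (NOT the paper's DIRECT): approximate "training over a
# smoothed distribution of the minority class" by Monte Carlo sampling.
# DIRECT itself avoids generating samples; this only illustrates the idea.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy high-dimensional, severely imbalanced data (assumed shapes).
d = 1000
X_maj = rng.normal(0.0, 1.0, size=(500, d))   # majority class
X_min = rng.normal(0.5, 1.0, size=(5, d))     # as few as 5 minority samples

sigma = 0.3   # smoothing bandwidth (hypothetical choice)
m = 200       # Monte Carlo draws per minority point

# Each minority point x_i is replaced by the distribution N(x_i, sigma^2 I);
# drawing m samples per point approximates the expected loss under it.
X_min_smooth = (X_min[:, None, :] +
                sigma * rng.normal(size=(X_min.shape[0], m, d))).reshape(-1, d)

X = np.vstack([X_maj, X_min_smooth])
y = np.concatenate([np.zeros(len(X_maj)), np.ones(len(X_min_smooth))])

clf = LogisticRegression(max_iter=1000).fit(X, y)
```

The Monte Carlo average converges to the expected loss over the smoothed minority distribution as m grows; the appeal of avoiding sampling altogether, as DIRECT does, is exactly that such approximations become unreliable in high dimensions.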
Keywords:
Data Mining: Class Imbalance and Unequal Cost
Machine Learning: Robustness
Machine Learning: Classification
Data Mining: Applications