On the Complexity of Learning from Label Proportions
Benjamin Fish, Lev Reyzin

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence
Main track. Pages 1675-1681. https://doi.org/10.24963/ijcai.2017/232

In the problem of learning with label proportions (also known as the problem of estimating class ratios), the training data is unlabeled, and only the proportions of examples receiving each label are given. The goal is to learn a hypothesis that predicts the proportions of labels on the distribution underlying the sample. This model of learning is useful in a wide variety of settings, including predicting the number of votes for candidates in political elections from polls. In this paper, we resolve foundational questions regarding the computational complexity of learning in this setting. We formalize a simple version of the setting, and we compare the computational complexity of learning in this model to that of classical PAC learning. Perhaps surprisingly, we show that what can be learned efficiently in this model is a strict subset of what can be learned efficiently in PAC, under standard complexity assumptions. We give a characterization in terms of VC dimension, and we show that there are non-trivial problems in this model that can be efficiently learned. We also give an algorithm that demonstrates the feasibility of learning under well-behaved distributions.
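To make the setting concrete, here is a minimal toy sketch (an illustration of the general setup, not the paper's algorithm): each "bag" of unlabeled one-dimensional points comes with the fraction of positives it contains, and we search over a small set of candidate thresholds for the classifier whose predicted proportions best match the given ones.

```python
def predicted_proportion(bag, threshold):
    """Fraction of points the threshold classifier labels positive."""
    return sum(1 for x in bag if x >= threshold) / len(bag)

def fit_threshold(bags, proportions, candidates):
    """Pick the candidate threshold minimizing total proportion error."""
    def total_error(t):
        return sum(abs(predicted_proportion(b, t) - p)
                   for b, p in zip(bags, proportions))
    return min(candidates, key=total_error)

# Hypothetical data: the underlying concept is "x >= 0.5", but we only
# see each bag's positive-label proportion, never individual labels.
bags = [[0.1, 0.4, 0.6, 0.9], [0.2, 0.3, 0.7, 0.8, 0.95]]
proportions = [0.5, 0.6]  # fraction of positives in each bag
candidates = [0.0, 0.25, 0.5, 0.75, 1.0]
best = fit_threshold(bags, proportions, candidates)
print(best)  # -> 0.5
```

Note that matching the observed proportions does not by itself identify the labels of individual examples, which is what makes the computational complexity of this model differ from standard PAC learning.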
Keywords:
Machine Learning: Learning Theory
Machine Learning: Semi-Supervised Learning