Marginal Posterior Sampling for Slate Bandits

Maria Dimakopoulou, Nikos Vlassis, Tony Jebara

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence
Main track. Pages 2223-2229. https://doi.org/10.24963/ijcai.2019/308

We introduce marginal posterior sampling, a new Thompson sampling-based algorithm for online slate bandits, characterized by three key ideas. First, it postulates that the slate-level reward is a monotone function of the marginal, unobserved rewards of the base actions selected in the slate's slots, but it does not attempt to estimate this function. Second, instead of maintaining a slate-level reward posterior, the algorithm maintains posterior distributions for the marginal reward of each slot's base actions and uses samples from these marginal posteriors to select the next slate. Third, marginal posterior sampling optimizes at the slot level rather than the slate level, which makes the approach computationally efficient. Simulation results establish substantial advantages of marginal posterior sampling over alternative Thompson sampling-based approaches that are widely used in the domain of web services.
Keywords:
Machine Learning: Online Learning
Machine Learning: Learning Preferences or Rankings
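
Below is a minimal, illustrative sketch of the slot-level Thompson sampling idea described in the abstract. It is not the authors' exact specification: the environment (a Bernoulli slate reward whose mean is the product of per-slot action probabilities, a monotone function of the marginals), the Beta posteriors, and the update rule that credits the observed binary slate reward to every chosen slot-level action are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

num_slots = 3            # slots in the slate (assumed)
actions_per_slot = 5     # candidate base actions per slot (assumed)
num_rounds = 2000

# One Beta(alpha, beta) posterior per (slot, base action) marginal reward.
alpha = np.ones((num_slots, actions_per_slot))
beta = np.ones((num_slots, actions_per_slot))

# Hypothetical environment: slate-level reward is Bernoulli with mean equal
# to the product of the chosen actions' per-slot probabilities (a monotone
# function of the marginals, as the abstract postulates).
true_probs = rng.uniform(0.2, 0.9, size=(num_slots, actions_per_slot))

for t in range(num_rounds):
    # Sample from each marginal posterior and optimize slot by slot:
    # the slate is assembled by an independent argmax per slot, which is
    # what keeps the selection step computationally cheap.
    samples = rng.beta(alpha, beta)
    slate = samples.argmax(axis=1)           # one base action per slot

    # Observe only the slate-level (binary) reward.
    slate_mean = np.prod(true_probs[np.arange(num_slots), slate])
    reward = rng.random() < slate_mean

    # Assumed update: credit the slate-level reward to each chosen
    # (slot, action) marginal posterior.
    alpha[np.arange(num_slots), slate] += reward
    beta[np.arange(num_slots), slate] += 1 - reward

best = true_probs.argmax(axis=1)
print("chosen per slot:", slate, "optimal per slot:", best)
```

With enough rounds, the per-slot posteriors typically concentrate on the actions with the highest marginal probabilities, so the final slate tends to match the optimal per-slot choices in this toy setup.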