Improved Pure Exploration in Linear Bandits with No-Regret Learning

Mohammadi Zaki, Avi Mohan, Aditya Gopalan

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence
Main Track. Pages 3709-3715. https://doi.org/10.24963/ijcai.2022/515

We study the best arm identification problem in linear multi-armed bandits (LMAB) in the fixed confidence setting; this is also the problem of optimizing an unknown linear function over a discrete domain with noisy, zeroth-order access. We propose an explicitly implementable algorithm with provably order-optimal sample complexity for best arm identification. Existing approaches that achieve optimal sample complexity assume either access to a nontrivial minimax optimization oracle (e.g., RAGE and Lazy-TS) or knowledge of an upper bound on the norm of the unknown parameter vector (e.g., LinGame). Our algorithm, which we call the Phased Elimination Linear Exploration Game (PELEG), maintains a high-probability confidence ellipsoid containing the unknown vector and uses it to eliminate suboptimal arms in phases, where a minimax problem is essentially solved in each phase using two-player low-regret learners. PELEG does not require knowledge of a bound on the norm of the unknown vector and is asymptotically optimal, settling an open problem. We show that the sample complexity of PELEG matches, up to order and in a non-asymptotic sense, an instance-dependent universal lower bound for linear bandits. PELEG is thus the first algorithm to achieve both order-optimal sample complexity and explicit implementability for this setting. We also provide numerical results for the proposed algorithm consistent with its theoretical guarantees.
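The abstract describes an algorithmic template: maintain a confidence ellipsoid around a least-squares estimate of the unknown vector, and eliminate suboptimal arms in phases, with each phase's sampling allocation obtained by (approximately) solving a minimax design problem. The sketch below illustrates that template in Python under simplifying assumptions; it is not the authors' PELEG implementation. In particular, the per-phase allocation here uses a Frank-Wolfe approximation to the G-optimal design in place of PELEG's two-player no-regret game, and the phase lengths, confidence widths, and elimination thresholds are illustrative choices rather than the paper's exact constants.

```
# Illustrative sketch of phased elimination for best-arm identification in
# linear bandits. NOT the authors' PELEG: the design subroutine (Frank-Wolfe
# G-optimal design instead of a two-player no-regret game), phase lengths,
# and confidence widths are simplifying assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

def g_optimal_design(X, iters=200):
    """Frank-Wolfe approximation to the G-optimal design over the rows of X."""
    K, d = X.shape
    w = np.full(K, 1.0 / K)                    # start from the uniform allocation
    for t in range(iters):
        A = X.T @ (w[:, None] * X)             # information matrix: sum_i w_i x_i x_i^T
        g = np.einsum('ij,jk,ik->i', X, np.linalg.pinv(A), X)  # x_i^T A^{-1} x_i
        i = int(np.argmax(g))                  # direction of largest uncertainty
        gamma = 2.0 / (t + 2.0)                # standard Frank-Wolfe step size
        w *= 1.0 - gamma
        w[i] += gamma
    return w / w.sum()

def phased_elimination(X, theta, delta=0.05, noise=1.0, max_phases=12):
    """Phased elimination driven by a least-squares confidence ellipsoid."""
    K, d = X.shape
    active = np.arange(K)
    for r in range(1, max_phases + 1):
        eps = 2.0 ** (-r)                      # target gap resolution for this phase
        Xa = X[active]
        w = g_optimal_design(Xa)
        # Assumed phase length: enough pulls for a confidence width of about eps.
        n = int(np.ceil(8 * d * np.log(2 * K * r * (r + 1) / delta) / eps ** 2))
        counts = rng.multinomial(n, w)
        A, b = 1e-6 * np.eye(d), np.zeros(d)
        for x, c in zip(Xa, counts):
            if c == 0:
                continue
            # Sum of c noisy rewards for arm x is N(c * <x, theta>, c * noise^2).
            reward_sum = c * (x @ theta) + noise * np.sqrt(c) * rng.standard_normal()
            A += c * np.outer(x, x)
            b += x * reward_sum
        theta_hat = np.linalg.solve(A, b)      # regularized least-squares estimate
        vals = Xa @ theta_hat
        active = active[vals >= vals.max() - 2 * eps]  # drop eps-suboptimal arms
        if active.size == 1:
            break
    return int(active[np.argmax(X[active] @ theta_hat)])

# Toy instance: unit-norm arms; the best arm is the one most aligned with theta.
d, K = 5, 20
X = rng.standard_normal((K, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
theta = rng.standard_normal(d)
print("identified:", phased_elimination(X, theta), "true best:", int(np.argmax(X @ theta)))
```

The elimination rule keeps every arm whose empirical value is within twice the phase's target resolution of the empirical best, so the true best arm survives each phase with high probability while clearly suboptimal arms are discarded early; the paper's analysis makes this precise for PELEG's game-based allocation.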
Keywords:
Machine Learning: Online Learning
Machine Learning: Learning Theory