Search-based Reinforcement Learning through Bandit Linear Optimization

Milan Peelman; Antoon Bronselaer; Guy De Tré

doi:10.24963/ijcai.2022/469

Search-based Reinforcement Learning through Bandit Linear Optimization

Milan Peelman, Antoon Bronselaer, Guy De Tré

Watch video

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence

Main Track. Pages 3380-3386. https://doi.org/10.24963/ijcai.2022/469

PDF BibTeX

The development of AlphaZero was a breakthrough in search-based reinforcement learning, by employing a given world model in a Monte-Carlo tree search (MCTS) algorithm to incrementally learn both an action policy and a value estimation. When extending this paradigm to the setting of simultaneous move games we find that the selection strategy of AlphaZero has theoretical shortcomings, including that convergence to a Nash equilibrium is not guaranteed. By analyzing these shortcomings, we find that the selection strategy corresponds to an approximated version of bandit linear optimization using Tsallis entropy regularization with α parameter set to zero, which is equivalent to log-barrier regularization. This observation allows us to refine the search method used by AlphaZero to obtain an algorithm that has theoretically optimal regret as well as superior empirical performance on our evaluation benchmark.

Keywords:

Machine Learning: Reinforcement Learning

Agent-based and Multi-agent Systems: Algorithmic Game Theory

Search: Game Playing