Search-based Reinforcement Learning through Bandit Linear Optimization

Search-based Reinforcement Learning through Bandit Linear Optimization

Milan Peelman, Antoon Bronselaer, Guy De Tré

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence
Main Track. Pages 3380-3386. https://doi.org/10.24963/ijcai.2022/469

The development of AlphaZero was a breakthrough in search-based reinforcement learning, by employing a given world model in a Monte-Carlo tree search (MCTS) algorithm to incrementally learn both an action policy and a value estimation. When extending this paradigm to the setting of simultaneous move games we find that the selection strategy of AlphaZero has theoretical shortcomings, including that convergence to a Nash equilibrium is not guaranteed. By analyzing these shortcomings, we find that the selection strategy corresponds to an approximated version of bandit linear optimization using Tsallis entropy regularization with α parameter set to zero, which is equivalent to log-barrier regularization. This observation allows us to refine the search method used by AlphaZero to obtain an algorithm that has theoretically optimal regret as well as superior empirical performance on our evaluation benchmark.
Keywords:
Machine Learning: Reinforcement Learning
Agent-based and Multi-agent Systems: Algorithmic Game Theory
Search: Game Playing