Multi-player Multi-armed Bandits with Delayed Feedback
Jingqi Fan, Zilong Wang, Shuai Li, Linghe Kong
Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 5065-5073.
https://doi.org/10.24963/ijcai.2025/564
Multi-player multi-armed bandits (MP-MAB) have been extensively studied due to their application in cognitive radio networks. In this setting, multiple players simultaneously select arms and instantly receive feedback. However, in realistic decentralized networks, feedback is often delayed due to sensing latency and signal processing. Without a central coordinator, explicit communication is impossible, and delayed feedback disrupts implicit coordination, which depends on synchronous observations. As a result, collisions are frequent and system performance degrades significantly. In this paper, we propose an algorithm for MP-MAB with stochastically delayed feedback. Each player independently maintains an estimate of the optimal arm set based on its own delayed rewards and pulls arms only from this set, which is, with high probability, identical to the sets maintained by the other players, thus avoiding collisions. The identical arm set also enables implicit communication, allowing players to utilize the exploration results of others. We establish a regret upper bound and derive a matching lower bound showing that the algorithm is near-optimal. Numerical experiments on both synthetic and real-world datasets validate the effectiveness of our algorithm.
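To make the setting concrete, the sketch below simulates the scenario described in the abstract: M players, K arms, and rewards that arrive after a random delay, with each player forming its own estimate of the top-M arm set from delayed observations and pulling only inside that set via a rank-based slot assignment. This is a minimal illustration under assumed names and a simple uniform delay model, not the authors' algorithm or its guarantees.

```python
# Hypothetical sketch of MP-MAB with stochastic delayed feedback.
# Each player keeps empirical means from rewards that arrive late,
# estimates the set of the M best arms, and pulls within that set
# using its own rank to avoid collisions. All parameters are assumptions.

import random

K, M, T = 8, 3, 5000                 # arms, players, horizon
MAX_DELAY = 20                       # feedback delay is uniform on [0, MAX_DELAY]
true_means = [random.random() for _ in range(K)]   # Bernoulli arm means

class Player:
    def __init__(self, rank):
        self.rank = rank             # fixed player index 0..M-1
        self.sums = [0.0] * K        # sums of observed (delayed) rewards
        self.counts = [0] * K        # counts of observed rewards
        self.pending = []            # (arrival_time, arm, reward)

    def estimated_top_set(self):
        # Rank arms by empirical mean; unobserved arms are treated optimistically.
        means = [self.sums[a] / self.counts[a] if self.counts[a] else 1.0
                 for a in range(K)]
        return sorted(range(K), key=lambda a: -means[a])[:M]

    def choose_arm(self, t):
        if t % 10 == 0:              # occasional uniform exploration round
            return random.randrange(K)
        return self.estimated_top_set()[self.rank]   # rank-th best arm

    def observe_delayed(self, t):
        # Incorporate rewards whose delay has elapsed by round t.
        ready = [p for p in self.pending if p[0] <= t]
        self.pending = [p for p in self.pending if p[0] > t]
        for _, arm, r in ready:
            self.sums[arm] += r
            self.counts[arm] += 1

players = [Player(rank=i) for i in range(M)]
total_reward, collisions = 0.0, 0

for t in range(T):
    pulls = [p.choose_arm(t) for p in players]
    for player, arm in zip(players, pulls):
        collided = pulls.count(arm) > 1
        collisions += collided       # counts each colliding player once
        reward = 0.0 if collided else float(random.random() < true_means[arm])
        delay = random.randint(0, MAX_DELAY)
        player.pending.append((t + delay, arm, reward))
        total_reward += reward
    for player in players:
        player.observe_delayed(t)

print(f"total reward: {total_reward:.0f}, collisions: {collisions}")
```

Because all players apply the same estimation rule, their top-M sets tend to coincide once enough delayed feedback has arrived, at which point the rank-based assignment keeps their pulls disjoint; the exploration schedule and optimistic initialization here are illustrative choices only.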
Keywords:
Machine Learning: ML: Multi-armed bandits
Machine Learning: ML: Online learning
Machine Learning: ML: Learning theory
