Batch-Switching Policy Iteration / 3147
Shivaram Kalyanakrishnan, Utkarsh Mall, Ritish Goyal
Policy Iteration (PI) is a widely-used family of algorithms for computing an optimal policy for a given Markov Decision Problem (MDP). Starting with an arbitrary initial policy, PI repeatedly updates to a dominating policy until an optimal policy is found. The update step involves switching the actions corresponding to a set of "improvable" states, which are easily identified. Whereas progress is guaranteed even if just one improvable state is switched at every step, the canonical variant of PI, attributed to Howard , switches every improvable state in order to obtain the next iterate. For MDPs with n states and 2 actions per state, the tightest known bound on the complexity of Howard's PI is O(2n/n) iterations. To date, the tightest bound known across all variants of PI is O(1.7172n) expected iterations for a randomised variant introduced by Mansour and Singh . We introduce Batch-Switching Policy Iteration (BSPI), a family of deterministic PI algorithms that switches states in "batches," taking the batch size b as a parameter. By varying b, BSPI interpolates between Howard's PI and another previously-studied variant called Simple PI [Melekopoglou and Condon, 1994]. Our main contribution is a bound of O(1.6479n) on the number of iterations taken by an instance of BSPI. We believe this is the tightest bound shown yet for any variant of PI. We also present experimental results that suggest Howard's PI might itself enjoy an even tighter bound.