Reward-Constrained Behavior Cloning

Zhaorong Wang, Meng Wang, Jingqi Zhang, Yingfeng Chen, Chongjie Zhang

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence
Main Track. Pages 3169-3175. https://doi.org/10.24963/ijcai.2021/436

Deep reinforcement learning (RL) has demonstrated success in challenging decision-making and control tasks. However, RL methods, which solve tasks by maximizing the expected reward, may generate undesirable behaviors due to poor local convergence or flawed reward design. Such undesirable behaviors may not reduce the total reward, but they degrade the user experience of the application. For example, in autonomous driving, a policy trained with a speed reward brakes suddenly far more often than human drivers typically do. To overcome this problem, we present a novel method named Reward-Constrained Behavior Cloning (RCBC), which synthesizes imitation learning and constrained reinforcement learning. RCBC leverages human demonstrations to induce desirable or human-like behaviors and employs lower-bound reward constraints during policy optimization to guarantee reward performance. Empirical results on popular benchmark environments show that RCBC learns significantly more human-desired policies with performance guarantees that satisfy the lower-bound reward constraints, while performing as well as or better than baseline methods in terms of reward maximization.
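
The abstract does not spell out the optimization problem, but one plausible formalization consistent with the description above is behavior cloning subject to a lower-bound constraint on the expected return. In the sketch below, the demonstration set D, the reward lower bound J_min, the discount factor gamma, and the multiplier lambda are assumed notation for illustration, not symbols taken from the paper:

% Hypothetical formalization: behavior cloning under a return lower bound.
% \mathcal{D}, J_{\min}, \gamma, and \lambda are assumed symbols, not from the paper.
\begin{aligned}
\max_{\pi_\theta}\quad & \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\log \pi_\theta(a \mid s)\big]
  && \text{(imitate human demonstrations)}\\
\text{s.t.}\quad & J(\pi_\theta) \;=\; \mathbb{E}_{\pi_\theta}\Big[\textstyle\sum_{t\ge 0}\gamma^{t}\, r(s_t,a_t)\Big] \;\ge\; J_{\min}
  && \text{(lower-bound reward constraint)}
\end{aligned}

A common way to handle such a constraint is a Lagrangian relaxation, \(\max_{\pi_\theta}\min_{\lambda\ge 0}\; \mathbb{E}_{(s,a)\sim\mathcal{D}}[\log \pi_\theta(a\mid s)] + \lambda\,\big(J(\pi_\theta)-J_{\min}\big)\), though the abstract does not state which solution method the authors actually use.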
Keywords:
Machine Learning: Deep Reinforcement Learning
Machine Learning: Reinforcement Learning
Constraints and SAT: Constraint Optimization