Explanation-Guided Reward Alignment

Saaduddin Mahmud, Sandhya Saisubramanian, Shlomo Zilberstein

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence
Main Track. Pages 473-482. https://doi.org/10.24963/ijcai.2023/53

Agents often need to infer a reward function from observations in order to learn desired behaviors. However, an agent may infer a reward function that does not align with the original intent, because multiple reward functions can be consistent with the same observations. Operating based on such misaligned rewards can be risky. Furthermore, black-box reward representations make it difficult to verify the learned rewards and to prevent harmful behavior. We present a framework for verifying and improving reward alignment using explanations and show how explanations can help detect misalignment and reveal failure cases in novel scenarios. The problem is formulated as inverse reinforcement learning from ranked trajectories. Verification tests created from the trajectory dataset are used to iteratively validate and improve reward alignment. The agent explains its learned reward and a tester signals whether the explanation passes the test. When an explanation fails, the agent offers alternative explanations to gather feedback, which is then used to improve the learned reward. We analyze the efficiency of our approach in improving reward alignment using different types of explanations and demonstrate its effectiveness in five domains.
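To make the "inverse reinforcement learning from ranked trajectories" setup concrete, the sketch below fits a linear reward to pairwise trajectory rankings with a Bradley-Terry style logistic loss. This is only a minimal illustration of learning a reward from rankings, not the paper's actual objective or implementation; the function name fit_reward_from_rankings, the feature dimensions, and the toy data are assumptions made for this example.

```python
import numpy as np

def fit_reward_from_rankings(trajectory_features, rankings, lr=0.1, epochs=500):
    """Fit a linear reward w so that higher-ranked trajectories score higher.

    trajectory_features: (N, d) array, one summed feature vector per trajectory.
    rankings: list of (i, j) pairs meaning trajectory i is ranked above j.
    Uses a Bradley-Terry / logistic pairwise loss, a common surrogate for
    reward learning from rankings (not necessarily the paper's exact loss).
    """
    n, d = trajectory_features.shape
    w = np.zeros(d)
    for _ in range(epochs):
        grad = np.zeros(d)
        for i, j in rankings:
            diff = trajectory_features[i] - trajectory_features[j]
            # Probability that trajectory i is preferred over j under current w
            p = 1.0 / (1.0 + np.exp(-w @ diff))
            grad += (1.0 - p) * diff  # gradient of the pairwise log-likelihood
        w += lr * grad / max(len(rankings), 1)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(6, 3))            # toy per-trajectory feature sums
    true_w = np.array([1.0, -0.5, 0.2])        # hidden "intended" reward
    scores = feats @ true_w
    # Build pairwise rankings from the hidden reward to serve as training data
    pairs = [(i, j) for i in range(6) for j in range(6)
             if i != j and scores[i] > scores[j]]
    print("learned reward weights:", fit_reward_from_rankings(feats, pairs))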
Keywords:
AI Ethics, Trust, Fairness: ETF: Safety and robustness
AI Ethics, Trust, Fairness: ETF: Explainability and interpretability
Machine Learning: ML: Reinforcement learning