Learning to Explain: Towards Human-Aligned Explainability in Deep Reinforcement Learning via Attention Guidance
Bokai Ji, Guangxia Li, Yulong Shen, Gang Xiao

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 5472-5479. https://doi.org/10.24963/ijcai.2025/609

Recent advances in explainable deep reinforcement learning (DRL) have provided insights into the reasoning behind decisions made by DRL agents. However, existing methods often overlook the subjective nature of explanations and fail to account for human cognitive styles and preferences. This neglect tends to reduce the interpretability and relevance of the generated explanations from a human evaluator's perspective. To address this issue, we introduce human cognition into the explanation procedure by integrating DRL with attention guidance in a novel manner. The proposed Concept Proximal Policy Optimization (Concept-PPO) learns to generate human-aligned explanations by jointly optimizing DRL performance and the discrepancy between generated explanations and human annotations. Its key component is a specially designed spatial concept transformer that enhances explanation efficiency by pre-masking decision-irrelevant information. Experiments on the ATARI benchmark demonstrate that Concept-PPO achieves better policies than its black-box counterparts, and user studies confirm its superiority in generating human-aligned explanations compared with existing explainable DRL methods.
Keywords:
Machine Learning: ML: Explainable/Interpretable machine learning
Machine Learning: ML: Reinforcement learning
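
The abstract above describes a joint objective: the usual PPO policy loss combined with a penalty on the discrepancy between the agent's attention-based explanation and a human annotation. The sketch below is a minimal, hypothetical illustration of such a joint loss in PyTorch; the weighting term lambda_align, the KL-based alignment measure, and all function names are assumptions for illustration, not the authors' implementation.

    # Minimal sketch: PPO clipped surrogate loss plus an attention-alignment
    # penalty. Names and the choice of KL divergence are illustrative
    # assumptions, not details taken from the paper.
    import torch

    def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
        """Standard clipped PPO surrogate objective (to be minimized)."""
        ratio = torch.exp(log_probs - old_log_probs)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        return -torch.min(unclipped, clipped).mean()

    def attention_alignment_loss(agent_attention, human_mask, eps=1e-8):
        """Discrepancy between the agent's spatial attention map and a human
        annotation mask. Both inputs are (batch, H, W) non-negative maps;
        each is normalized to a spatial distribution and compared with a
        KL divergence."""
        p = human_mask.flatten(1)
        p = p / (p.sum(dim=1, keepdim=True) + eps)
        q = agent_attention.flatten(1)
        q = q / (q.sum(dim=1, keepdim=True) + eps)
        return (p * (torch.log(p + eps) - torch.log(q + eps))).sum(dim=1).mean()

    def joint_loss(log_probs, old_log_probs, advantages,
                   agent_attention, human_mask, lambda_align=0.1):
        """Joint objective: task performance plus human-aligned explanations."""
        return (ppo_clip_loss(log_probs, old_log_probs, advantages)
                + lambda_align * attention_alignment_loss(agent_attention, human_mask))

In this reading, lambda_align trades off policy quality against agreement with human annotations; the paper's spatial concept transformer would additionally mask decision-irrelevant regions before the attention map is produced, which the sketch does not model.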