Interpreting Pretrained Language Models via Concept Bottlenecks (Extended Abstract)

Zhen Tan, Lu Cheng, Song Wang, Yuan Bo, Jundong Li, Huan Liu

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2025),
Sister Conferences Best Papers track. Pages 10942-10946. https://doi.org/10.24963/ijcai.2025/1221

Pretrained language models (PLMs) achieve state-of-the-art results but often function as "black boxes", hindering interpretability and responsible deployment. Existing interpretation methods, such as attention analysis, often yield explanations that are neither clear nor intuitive. We propose interpreting PLMs through high-level, human-understandable concepts using Concept Bottleneck Models (CBMs). This extended abstract introduces C3M (ChatGPT-guided Concept augmentation with Concept-level Mixup), a novel framework for training Concept-Bottleneck-Enabled PLMs (CBE-PLMs). C3M leverages Large Language Models (LLMs) such as ChatGPT to augment concept sets and generate noisy concept labels, combined with a concept-level MixUp mechanism that enhances robustness and enables effective learning from both human-annotated and machine-generated concepts. Empirical results show that our approach provides intuitive explanations, aids model diagnosis via test-time intervention, and improves the interpretability-utility trade-off, even with limited or noisy concept annotations. This is a concise version of [Tan et al., 2024b], recipient of the Best Paper Award at PAKDD 2024. Code and data are released at https://github.com/Zhen-Tan-dmml/CBM_NLP.git.
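
To make the architecture concrete, the sketch below illustrates the general CBE-PLM recipe described above (text embedding, then an interpretable concept bottleneck, then the task label) together with a concept-level MixUp step. It is a minimal, assumption-laden illustration, not the authors' released C3M implementation: the names `ConceptBottleneckHead` and `concept_level_mixup` are hypothetical, and the repository linked above is authoritative.

```python
# Minimal sketch of a Concept-Bottleneck-Enabled PLM (CBE-PLM) head with
# concept-level MixUp. Illustrative only; names are hypothetical and the
# released C3M code in the authors' repository is the reference implementation.
import torch
import torch.nn as nn

class ConceptBottleneckHead(nn.Module):
    """Predicts interpretable concept scores from a PLM sentence embedding,
    then predicts the task label from those concepts alone."""
    def __init__(self, hidden_dim: int, num_concepts: int, num_labels: int):
        super().__init__()
        self.concept_predictor = nn.Linear(hidden_dim, num_concepts)  # x -> c
        self.label_predictor = nn.Linear(num_concepts, num_labels)    # c -> y

    def forward(self, h: torch.Tensor):
        concepts = torch.sigmoid(self.concept_predictor(h))  # the bottleneck
        logits = self.label_predictor(concepts)
        return concepts, logits  # concept scores expose the model's rationale

def concept_level_mixup(h, c, y, alpha: float = 0.2):
    """Interpolate embeddings, concept labels, and task labels between random
    example pairs, so clean human-annotated concepts can regularize noisy,
    machine-generated ones. Assumes c holds float concept labels in [0, 1]
    and y holds one-hot (or soft) task labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(h.size(0))
    h_mix = lam * h + (1.0 - lam) * h[perm]
    c_mix = lam * c + (1.0 - lam) * c[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return h_mix, c_mix, y_mix
```

In this formulation, test-time intervention amounts to inspecting and manually correcting the predicted concept scores before they are passed to the label predictor, which is what enables the model diagnosis described above.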
Keywords:
Sister Conferences Best Papers: Humans and AI
Sister Conferences Best Papers: AI Ethics, Trust, Fairness
Sister Conferences Best Papers: Machine Learning
Sister Conferences Best Papers: Natural Language Processing