Towards Lossless Head Pruning through Automatic Peer Distillation for Language Models

Bingbing Li; Zigeng Wang; Shaoyi Huang; Mikhail Bragin; Ji Li; Caiwen Ding

doi:10.24963/ijcai.2023/568

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence

Main Track. Pages 5113-5121. https://doi.org/10.24963/ijcai.2023/568

Pruning has been extensively studied as a way to improve the efficiency of Transformer-based language models. Typically, unimportant model weights are zeroed (pruned) and the derived compact model is then trained to improve final accuracy. The pruned weights are treated as useless and discarded, which usually leads to significant accuracy degradation. In this paper, we focus on attention head pruning, since attention heads are a key component of Transformer-based language models and carry interpretable knowledge. We reveal the relationship between pruned and retained attention heads and propose peer distillation, a method that recycles the knowledge otherwise discarded with the pruned heads. We also develop an automatic framework that locates the to-be-pruned attention heads in each layer, removing the time-consuming manual effort of tuning hyperparameters. We report experimental results on the General Language Understanding Evaluation (GLUE) benchmark using the BERT model. By recycling discarded knowledge from pruned heads, the proposed method maintains model performance across all nine tasks while reducing the number of heads by over 58% on average, and it outperforms state-of-the-art techniques (e.g., Random, HISP, L0 Norm, SMP).

Keywords:

Natural Language Processing: NLP: Language models