ER-SAN: Enhanced-Adaptive Relation Self-Attention Network for Image Captioning

Jingyu Li, Zhendong Mao, Shancheng Fang, Hao Li

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence
Main Track. Pages 1081-1087. https://doi.org/10.24963/ijcai.2022/151

Image captioning (IC), which brings vision to language, has drawn extensive attention. Precisely describing the visual relations between image objects is a key challenge in IC. We argue that visual relations, i.e., geometric positions (distance and size) and semantic interactions (actions and possessives), capture the mutual correlations between objects. Existing Transformer-based methods typically resort to geometric positions to enhance the representation of visual relations, yet shallow geometric cues alone cannot precisely cover the complex, action-related correlations. In this paper, we propose to enhance the correlations between objects from a comprehensive view that jointly considers explicit semantic and geometric relations, generating plausible captions with accurate relationship predictions. Specifically, we propose a novel Enhanced-Adaptive Relation Self-Attention Network (ER-SAN). We design a direction-sensitive semantic-enhanced attention that attends from content objects to semantic relations and from semantic relations to content objects, learning explicit semantic-aware relations. Further, we devise an adaptive re-weight relation module that determines how much semantic and geometric attention should be activated for each relation feature. Extensive experiments on the MS-COCO dataset demonstrate the effectiveness of our ER-SAN, which improves CIDEr from 128.6% to 135.3% and achieves state-of-the-art performance. Code will be released at https://github.com/CrossmodalGroup/ER-SAN.
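
The abstract only sketches the adaptive re-weight relation module, so the following is a minimal, hypothetical illustration of the general idea (a learned gate mixing semantic and geometric relation features), not the authors' released implementation; the class name, tensor shapes, and gating design are assumptions for illustration only.

```python
# Illustrative sketch only: a per-pair gate that decides how much semantic vs.
# geometric relation information to keep, in the spirit of "adaptive re-weighting".
import torch
import torch.nn as nn

class AdaptiveRelationGate(nn.Module):
    """Hypothetical gate fusing semantic and geometric pairwise relation features."""
    def __init__(self, d_model: int):
        super().__init__()
        # Gate conditioned on both relation features (assumed design choice).
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, sem_rel: torch.Tensor, geo_rel: torch.Tensor) -> torch.Tensor:
        # sem_rel, geo_rel: [batch, n_obj, n_obj, d_model] pairwise relation features.
        g = self.gate(torch.cat([sem_rel, geo_rel], dim=-1))  # per-pair mixing weights in (0, 1)
        return g * sem_rel + (1.0 - g) * geo_rel              # adaptively re-weighted relation

# Usage on dummy pairwise relation features (36 detected objects, 512-d features assumed).
sem = torch.randn(2, 36, 36, 512)   # semantic-aware relation features
geo = torch.randn(2, 36, 36, 512)   # geometric relation features (distance/size)
fused = AdaptiveRelationGate(512)(sem, geo)
print(fused.shape)                  # torch.Size([2, 36, 36, 512])
```

For the exact formulation of the direction-sensitive semantic-enhanced attention and the re-weighting used in ER-SAN, refer to the paper and the code repository linked above.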
Keywords:
Computer Vision: Vision and language
Natural Language Processing: Language Generation