S2 Transformer for Image Captioning

Pengpeng Zeng, Haonan Zhang, Jingkuan Song, Lianli Gao

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence
Main Track. Pages 1608-1614. https://doi.org/10.24963/ijcai.2022/224

Transformer-based architectures with grid features represent the state of the art in visual and language reasoning tasks, such as visual question answering and image-text matching. However, directly applying them to image captioning may result in the loss of spatial and fine-grained semantic information, and their applicability to image captioning remains largely under-explored. Towards this goal, we propose a simple yet effective method, the Spatial- and Scale-aware Transformer (S2 Transformer), for image captioning. Specifically, we first propose a Spatial-aware Pseudo-supervised (SP) module, which resorts to feature clustering to help preserve spatial information in grid features. Next, to maintain the model size while producing superior results, we build a simple weighted residual connection, named the Scale-wise Reinforcement (SR) module, to simultaneously exploit both low- and high-level encoded features with rich semantics. Extensive experiments on the MSCOCO benchmark demonstrate that our method achieves new state-of-the-art performance without introducing excessive parameters compared with the vanilla transformer. The source code is available at https://github.com/zchoi/S2-Transformer
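As a rough illustration of the two components the abstract describes, below is a minimal PyTorch sketch. It is not the paper's implementation (see the linked repository for that): the class names, the softmax layer weighting in the SR module, and the argmax pseudo-label clustering loss in the SP module are all illustrative assumptions we make here to show one plausible reading of "feature clustering as pseudo-supervision" and "weighted residual connection over low- and high-level encoded features".

```python
import torch
import torch.nn as nn


class SpatialAwarePseudoSupervised(nn.Module):
    """Hypothetical sketch of the SP module: soft-cluster grid features
    against learnable centroids and use the hard assignments as a
    pseudo-supervision signal (assumption, not the paper's exact loss)."""

    def __init__(self, num_clusters: int, d_model: int):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_clusters, d_model))

    def forward(self, grids: torch.Tensor):
        # grids: (B, N, D) grid features; similarity to each centroid.
        logits = grids @ self.centroids.t()            # (B, N, K)
        assign = torch.softmax(logits, dim=-1)         # soft cluster assignments
        # Pseudo-labels: the model's own hard assignments, treated as targets.
        pseudo = logits.argmax(dim=-1).detach()        # (B, N)
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), pseudo.reshape(-1))
        return assign, loss


class ScaleWiseReinforcement(nn.Module):
    """Hypothetical sketch of the SR module: a learnable weighted residual
    connection that fuses the outputs of all encoder layers, so both low-
    and high-level encoded features reach the decoder."""

    def __init__(self, num_layers: int, d_model: int):
        super().__init__()
        # One learnable scalar weight per encoder layer.
        self.layer_weights = nn.Parameter(torch.ones(num_layers) / num_layers)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, layer_outputs: list[torch.Tensor]) -> torch.Tensor:
        # layer_outputs: per-layer (B, N, D) tensors, low- to high-level.
        stacked = torch.stack(layer_outputs, dim=0)            # (L, B, N, D)
        w = torch.softmax(self.layer_weights, dim=0)           # normalized weights
        fused = (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)     # weighted sum
        # Residual connection with the top-layer output.
        return self.norm(fused + layer_outputs[-1])
```

In such a setup, the SP loss would be added to the captioning objective during training, while the SR output would replace the final encoder layer's output as the memory attended to by the decoder; both additions introduce only a handful of parameters (K centroids and L scalars), consistent with the abstract's claim of avoiding excessive parameters.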
Keywords:
Computer Vision: Vision and language
Computer Vision: Visual reasoning and symbolic representation