Show, Observe and Tell: Attribute-driven Attention Model for Image Captioning

Hui Chen, Guiguang Ding, Zijia Lin, Sicheng Zhao, Jungong Han

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence
Main track. Pages 606-612. https://doi.org/10.24963/ijcai.2018/84

Although attribute-based and attention-based approaches have both proven effective for image captioning, most attribute-based approaches predict attributes independently, without taking the co-occurrence dependencies among attributes into account. Moreover, most attention-based captioning models directly leverage the feature map extracted from a CNN, in which many features may be redundant with respect to the image content. In this paper, we focus on training a good attribute-inference model via a recurrent neural network (RNN) for image captioning, so that the co-occurrence dependencies among attributes can be maintained. The uniqueness of our inference model lies in its use of an RNN with a visual attention mechanism to observe the image before generating captions. Additionally, we observe that compact, attribute-driven features are more useful for the attention-based captioning model. To this end, we extract a context feature for each attribute and guide the captioning model to adaptively attend to these context features. We verify the effectiveness and superiority of the proposed approach over other captioning approaches by conducting extensive experiments and comparisons on the MS COCO image captioning dataset.
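To make the two ideas in the abstract concrete, the following is a minimal PyTorch sketch, not the authors' released implementation: (1) an RNN that "observes" CNN feature-map regions via soft attention and predicts attributes jointly, so that co-occurrence dependencies can be captured by its recurrent state, and (2) a captioning LSTM that adaptively attends to the resulting attribute-driven context features instead of the raw feature map. All module names, dimensions, step counts, and vocabulary sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftAttention(nn.Module):
    """Additive (Bahdanau-style) soft attention over a set of feature vectors."""

    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, R, feat_dim) regional features; hidden: (B, hidden_dim)
        e = self.score(torch.tanh(self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)                 # (B, R, 1) attention weights
        context = (alpha * feats).sum(dim=1)        # (B, feat_dim) attended context
        return context, alpha.squeeze(-1)


class AttributeObserver(nn.Module):
    """RNN that attends over image regions for several steps and scores
    attributes jointly from its final state (a stand-in for the paper's
    attribute-inference model)."""

    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=512, num_attrs=1000, steps=5):
        super().__init__()
        self.steps = steps
        self.attend = SoftAttention(feat_dim, hidden_dim, attn_dim)
        self.rnn = nn.GRUCell(feat_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_attrs)

    def forward(self, feats):
        # feats: (B, R, feat_dim) CNN feature map flattened into R regions
        h = feats.new_zeros(feats.size(0), self.rnn.hidden_size)
        contexts = []
        for _ in range(self.steps):                 # "observe" the image step by step
            ctx, _ = self.attend(feats, h)
            h = self.rnn(ctx, h)
            contexts.append(ctx)
        attr_logits = self.classifier(h)            # joint multi-label attribute scores
        # Per-step attended contexts serve as compact, attribute-driven features
        return attr_logits, torch.stack(contexts, dim=1)


class CaptionDecoder(nn.Module):
    """Captioning LSTM that adaptively attends to the attribute-driven context
    features rather than the raw CNN feature map."""

    def __init__(self, vocab_size=10000, embed_dim=512, hidden_dim=512, feat_dim=2048, attn_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attend = SoftAttention(feat_dim, hidden_dim, attn_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, contexts, captions):
        # contexts: (B, K, feat_dim) attribute context features; captions: (B, T) token ids
        B = captions.size(0)
        h = contexts.new_zeros(B, self.lstm.hidden_size)
        c = contexts.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(captions.size(1)):
            ctx, _ = self.attend(contexts, h)       # attend to attribute contexts
            h, c = self.lstm(torch.cat([self.embed(captions[:, t]), ctx], dim=1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)           # (B, T, vocab_size)


if __name__ == "__main__":
    feats = torch.randn(2, 49, 2048)                # e.g. a 7x7 CNN feature map per image
    attr_logits, contexts = AttributeObserver()(feats)
    word_logits = CaptionDecoder()(contexts, torch.randint(0, 10000, (2, 12)))
    print(attr_logits.shape, contexts.shape, word_logits.shape)
```

In this sketch the attribute observer would be trained with a multi-label attribute loss and the decoder with the usual cross-entropy over ground-truth captions; the key design point mirrored from the abstract is that the decoder's attention operates over a small set of attribute-conditioned contexts rather than the full, partly redundant CNN feature map.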
Keywords:
Computer Vision: Language and Vision
Computer Vision: Computer Vision