An Attention-based Regression Model for Grounding Textual Phrases in Images

Ko Endo, Masaki Aono, Eric Nichols, Kotaro Funakoshi

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence
Main track. Pages 3995-4001. https://doi.org/10.24963/ijcai.2017/558

Grounding, or localizing, a textual phrase in an image is a challenging problem that is integral to visual language understanding. Previous approaches to this task typically rely on candidate region proposals, so end performance depends on the quality of the region proposal method and additional computational cost is incurred. In this paper, we treat grounding as a regression problem and propose a method that directly identifies the region referred to by a textual phrase, eliminating the need for external candidate region prediction. Our approach uses deep neural networks to combine image and text representations and refines the target region with attention models over both image subregions and the words of the textual phrase. Despite the challenging nature of this task and the sparsity of available data, in evaluation on the ReferIt dataset our proposed method achieves a new state of the art of 37.26% accuracy, surpassing the previously reported best by over 5 percentage points. We find that combining the image and text attention models and using a loss function sensitive to the image attention area both contribute to substantial improvements.
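
As a rough illustration of the idea in the abstract (attention over image subregions and over words, followed by direct regression of a box), the following minimal sketch may help. It is not the authors' architecture: the dot-product attention, the mean-of-words query, the single linear output layer, the `(x1, y1, x2, y2)` box format, and the names `attend` and `regress_box` are all assumptions made for illustration, and the deep feature extractors and the area-sensitive loss are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, query):
    """Dot-product attention (an assumed form): score each feature
    vector against the query, then return the weighted average."""
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * feat[d] for w, feat in zip(weights, features))
              for d in range(dim)]
    return pooled, weights


def regress_box(image_regions, word_vectors, W, b):
    """Predict a box directly, with no external region proposals.

    image_regions: feature vectors for image subregions (hypothetical input)
    word_vectors:  feature vectors for the phrase's words (hypothetical input)
    W, b:          weights of an assumed linear layer with 4 outputs
    """
    dim = len(word_vectors[0])
    # Assumed initial query: the mean of the word vectors.
    text_mean = [sum(w[d] for w in word_vectors) / len(word_vectors)
                 for d in range(dim)]
    # Image attention guided by the text, then text attention guided
    # by the attended image context.
    image_ctx, image_weights = attend(image_regions, text_mean)
    text_ctx, text_weights = attend(word_vectors, image_ctx)
    joint = image_ctx + text_ctx  # list concatenation of the two contexts
    box = [sum(wi * x for wi, x in zip(row, joint)) + bi
           for row, bi in zip(W, b)]
    return box, image_weights, text_weights


if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy region features
    words = [[0.5, 0.5], [1.0, 0.0]]                # toy word features
    W = [[0.1] * 4 for _ in range(4)]               # toy, untrained weights
    b = [0.0, 0.0, 0.5, 0.5]
    print(regress_box(regions, words, W, b))
```

In a trained model the weights would be learned by minimizing a regression loss against the annotated region; here they are toy values, so the printed box has no meaning beyond showing the data flow.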
Keywords:
Natural Language Processing: Information Retrieval
Natural Language Processing: Natural Language Semantics
Machine Learning: Deep Learning
Robotics and Vision: Robotics and Vision