Multi-modal Sentence Summarization with Modality Attention and Image Filtering

Haoran Li, Junnan Zhu, Tianshang Liu, Jiajun Zhang, Chengqing Zong

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence
Main track. Pages 4152-4158. https://doi.org/10.24963/ijcai.2018/577

In this paper, we introduce a multi-modal sentence summarization task that produces a short summary from a sentence-image pair. This task is more challenging than sentence summarization: a model must not only incorporate visual features effectively into a standard text summarization framework but also avoid the noise introduced by the image. To this end, we propose a modality-based attention mechanism that pays different attention to image patches and text units, and we design image filters that selectively use visual information to enhance the semantics of the input sentence. We construct a multi-modal sentence summarization dataset, and extensive experiments on this dataset demonstrate that our models significantly outperform conventional models that employ only text as input. Further analyses suggest that the sentence summarization task can benefit from visually grounded representations in a variety of ways.
Keywords:
Natural Language Processing: Natural Language Summarization
Natural Language Processing: Natural Language Processing