Multi-modal Sentence Summarization with Modality Attention and Image Filtering

Haoran Li; Junnan Zhu; Tianshang Liu; Jiajun Zhang; Chengqing Zong

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence

Main track. Pages 4152-4158. https://doi.org/10.24963/ijcai.2018/577

In this paper, we introduce a multi-modal sentence summarization task that produces a short summary from a sentence paired with an image. This task is more challenging than text-only sentence summarization: it must not only incorporate visual features effectively into a standard text summarization framework, but also avoid the noise introduced by the image. To this end, we propose a modality-based attention mechanism that pays different attention to image patches and text units, and we design image filters that selectively use visual information to enhance the semantics of the input sentence. We construct a multi-modal sentence summarization dataset, and extensive experiments on this dataset demonstrate that our models significantly outperform conventional models that employ only text as input. Further analyses suggest that the sentence summarization task can benefit from visually grounded representations in a variety of aspects.
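
The abstract gives no equations, so the following is only a minimal sketch of how modality attention and an image filter might fit together, not the authors' formulation. It assumes dot-product scoring, a scalar sigmoid gate as the filter, and plain Python lists as vectors. All function names (`attend`, `image_gate`, `fuse`) are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, units):
    """Dot-product attention (an assumption): weighted average of `units`."""
    weights = softmax([dot(query, u) for u in units])
    dim = len(units[0])
    return [sum(w * u[k] for w, u in zip(weights, units)) for k in range(dim)]

def image_gate(query, image_context):
    """Hypothetical image filter: a scalar in (0, 1) that scales the
    visual context, so a poorly matching image contributes less."""
    return 1.0 / (1.0 + math.exp(-dot(query, image_context)))

def fuse(query, text_units, image_patches):
    """Attend within each modality, filter the image context, then
    weight the two modalities against each other (modality attention)."""
    text_ctx = attend(query, text_units)
    image_ctx = attend(query, image_patches)
    gate = image_gate(query, image_ctx)
    filtered = [gate * v for v in image_ctx]
    # Modality attention: one weight for text, one for the filtered image.
    beta = softmax([dot(query, text_ctx), dot(query, filtered)])
    fused = [beta[0] * t + beta[1] * i for t, i in zip(text_ctx, filtered)]
    return fused, beta, gate

# Toy usage: a 2-dimensional decoder state, two text units, two image patches.
fused, beta, gate = fuse(
    [1.0, 0.0],
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.2, 0.8]],
)
```

In this sketch the gate acts before the modality weights are computed, so a weak image is discounted twice: once by the filter and again by the modality attention. A trained model would learn the scoring functions rather than use raw dot products.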

Keywords:

Natural Language Processing: Natural Language Summarization

Natural Language Processing: Natural Language Processing