Swell-and-Shrink: Decomposing Image Captioning by Transformation and Summarization

Hanzhang Wang, Hanli Wang, Kaisheng Xu

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence
Main track. Pages 5226-5232. https://doi.org/10.24963/ijcai.2019/726

Image captioning is currently viewed as a problem analogous to machine translation. However, it consistently suffers from poor interpretability and from coarse or even incorrect descriptions of regional details. Moreover, information abstraction and compression, which are essential characteristics of captioning, are overlooked and seldom discussed. To overcome these shortcomings, a swell-shrink method is proposed that redefines image captioning as a compositional task consisting of two separate modules: modality transformation and text compression. The former is guaranteed to accurately transform adequate visual content into textual form, while the latter consists of a hierarchical LSTM that emphasizes removing redundancy among multiple phrases and organizing the final abstractive caption. Additionally, the order and quality of regions of interest and of modality processing are studied to give insight into how regional visual cues influence language formation. Experiments demonstrate the effectiveness of the proposed method.
Keywords:
Natural Language Processing: Natural Language Summarization
Computer Vision: Language and Vision
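
The two-module decomposition described in the abstract can be pictured with a minimal sketch. Everything in it is an illustrative assumption rather than the authors' implementation: `Region`, `swell`, `shrink`, and `caption` are hypothetical names, the "swell" step simply reuses a region label as its phrase, and the "shrink" step replaces the hierarchical LSTM with a plain de-duplication heuristic. It only shows the interface: many region-level phrases in, one compressed caption out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    """Hypothetical region of interest: a phrase-like label and a score."""
    label: str
    score: float


def swell(regions: List[Region]) -> List[str]:
    """Stand-in for modality transformation: one phrase per region.

    Assumption: a real system would run a region-level captioner here.
    This stub returns the region labels ordered by descending score.
    """
    ordered = sorted(regions, key=lambda r: r.score, reverse=True)
    return [r.label for r in ordered]


def shrink(phrases: List[str], max_phrases: int = 3) -> str:
    """Stand-in for text compression: drop repeats, keep the first few.

    Assumption: the abstract names a hierarchical LSTM for this step;
    this heuristic is NOT that model, only a placeholder for its role.
    """
    seen = set()
    kept = []
    for phrase in phrases:
        key = phrase.strip().lower()
        if key and key not in seen:  # skip empty and redundant phrases
            seen.add(key)
            kept.append(phrase.strip())
    return ", ".join(kept[:max_phrases])


def caption(regions: List[Region], max_phrases: int = 3) -> str:
    """Compose the two stages: swell to phrases, then shrink to a caption."""
    return shrink(swell(regions), max_phrases)


if __name__ == "__main__":
    demo = [
        Region("a dog", 0.9),
        Region("green grass", 0.7),
        Region("A dog", 0.6),  # redundant with the first region
        Region("a red ball", 0.8),
    ]
    print(caption(demo))
```

Keeping the two stages as separate functions mirrors the interpretability argument in the abstract: the intermediate phrase list can be inspected before any compression is applied.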