Human Consensus-Oriented Image Captioning
Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence
Main track. Pages 659-665. https://doi.org/10.24963/ijcai.2020/92
Image captioning aims to describe an image with a concise, accurate, and interesting sentence. To build such an automatic neural captioner, traditional models align the generated words with a number of human-annotated sentences in order to mimic human-like captions. However, crowd-sourced annotations inevitably come with data quality issues such as grammatical errors, misidentified visual objects, and sub-optimal sentence focus. During model training, existing methods treat all annotations equally, regardless of their quality. In this work, we explicitly use human consensus to measure the quality of ground-truth captions in advance, and directly encourage the model to learn high-quality captions with high priority. As a result, the proposed consensus-oriented method accelerates training and achieves superior performance with only a supervised objective, without time-consuming reinforcement learning. The novel consensus loss can be incorporated into most existing state-of-the-art methods, improving BLEU-4 by up to 12.47% (relative) compared to the conventional cross-entropy loss. Extensive experiments on the MS-COCO image captioning dataset demonstrate that the proposed human consensus-oriented training method significantly improves training efficiency and model effectiveness.
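The idea of prioritizing high-quality captions can be illustrated with a minimal sketch of a consensus-weighted cross-entropy. The abstract does not say how consensus is measured, so everything below is an assumption made for illustration: the token-overlap (Jaccard) score, the normalization of scores into weights, and the hypothetical names `consensus_scores`, `normalize`, and `consensus_weighted_loss`. It is not the paper's actual loss.

```python
import math


def consensus_scores(captions):
    """Score each caption by its mean token-set Jaccard overlap with the others.

    Assumption: agreement with the other annotators stands in for caption quality.
    """
    token_sets = [set(c.lower().split()) for c in captions]
    scores = []
    for i, a in enumerate(token_sets):
        others = [b for j, b in enumerate(token_sets) if j != i]
        if not others:
            scores.append(1.0)  # a single caption has nothing to disagree with
            continue
        overlaps = [len(a & b) / len(a | b) if (a | b) else 0.0 for b in others]
        scores.append(sum(overlaps) / len(overlaps))
    return scores


def normalize(scores):
    """Turn raw scores into weights that sum to 1 (uniform if all scores are 0)."""
    total = sum(scores)
    if total == 0:
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def consensus_weighted_loss(token_log_probs, weights):
    """Weighted sum of per-caption mean negative log-likelihoods.

    token_log_probs[k] holds the log-probabilities the model assigns to the
    ground-truth tokens of caption k; weights[k] is that caption's weight.
    """
    loss = 0.0
    for log_probs, w in zip(token_log_probs, weights):
        if not log_probs:
            continue  # skip empty captions
        nll = -sum(log_probs) / len(log_probs)
        loss += w * nll
    return loss


# Hypothetical annotations for one image; the third caption is an outlier.
captions = [
    "a dog runs on grass",
    "a dog runs on the grass",
    "purple car",
]
weights = normalize(consensus_scores(captions))

# Made-up model log-probabilities for the ground-truth tokens of each caption.
token_log_probs = [
    [math.log(0.5), math.log(0.5)],
    [math.log(0.25)],
    [math.log(0.1)],
]
loss = consensus_weighted_loss(token_log_probs, weights)
```

In this toy setup the outlier caption shares no tokens with the other two, so it receives zero weight and contributes nothing to the loss, whereas plain cross-entropy would weight all three captions equally.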
Computer Vision: Language and Vision
Machine Learning: Deep Learning
Natural Language Processing: Machine Translation