Incorporating Unlikely Negative Cues for Distinctive Image Captioning

Zhengcong Fei, Junshi Huang

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence
Main Track. Pages 745-753. https://doi.org/10.24963/ijcai.2023/83

While recent neural image captioning models have shown great promise in terms of automatic metrics, they still tend to produce generic sentences, which limits their use to a handful of simple scenarios. On the other hand, negative training has been suggested as an effective way to prevent models from producing frequent yet meaningless sentences. However, when applied to image captioning, this approach may overlook low-frequency but generic and vague sentences, which is problematic when dealing with diverse and changeable visual scenes. In this paper, we introduce an approach that improves image captioning by integrating negative knowledge, preventing the model from producing undesirable generic descriptions while addressing the previous limitations. We accomplish this by training a negative teacher model that generates image-wise generic sentences from retrieval entropy-filtered data. The student model is then required to maximize its distance from this teacher through multi-level negative knowledge transfer for optimal guidance. Empirical results on the MS COCO benchmark confirm that our plug-and-play framework incorporating unlikely negative knowledge yields significant improvements in both accuracy and diversity, surpassing previous state-of-the-art methods for distinctive image captioning.
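The abstract does not specify the exact form of the multi-level transfer objective. As a rough illustration only, the sketch below shows one plausible token-level variant of the distance-maximization term: a hinged KL divergence that pushes the student's next-token distribution away from the negative teacher's. The function name, tensor shapes, and margin hinge are assumptions for illustration, not the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def negative_distillation_loss(student_logits: torch.Tensor,
                               neg_teacher_logits: torch.Tensor,
                               margin: float = 5.0) -> torch.Tensor:
    """Repel the student's token distributions from a negative teacher's.

    Hypothetical sketch: both tensors have shape (batch, seq_len, vocab_size).
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    with torch.no_grad():  # the negative teacher is frozen
        teacher_p = F.softmax(neg_teacher_logits, dim=-1)
    # Per-token KL(teacher || student): larger means the student sits
    # further from the teacher's generic-caption distribution.
    kl = (teacher_p * (teacher_p.clamp_min(1e-8).log() - student_logp)).sum(-1)
    # Hinge so the repulsion saturates once the distance exceeds `margin`,
    # keeping the otherwise unbounded maximization well-posed.
    return F.relu(margin - kl).mean()
```

In a plug-and-play setup such as the one described, a term like this would presumably be added to the standard cross-entropy caption loss with a small weight, so the student remains accurate while being steered away from generic phrasing.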
Keywords:
Computer Vision: CV: Vision and language
Machine Learning: ML: Learning preferences or rankings