Find and Perceive: Tell Visual Change with Fine-Grained Comparison

Feixiao Lv, Rui Wang, Lihua Jing, Lijun Liu

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 5878-5886. https://doi.org/10.24963/ijcai.2025/654

The goal of the image change captioning task is to capture the differences between two similar images and describe them in natural language. In this paper, we decompose this task into two sub-problems: fine-grained change feature learning and discrimination of changed regions. In contrast to existing methods, which focus only on change feature learning, we propose a novel change captioning learning paradigm, Find and Perceive (F&P). F&P builds on two main ideas: a Fine-Grained Semantic Change Perception (FGSCP) module that improves the model's ability to perceive subtle changes, and a Weakly-Supervised Discriminator (WSD) of changed regions that improves the model's ability to localise the regions that matter. Specifically, FGSCP works in two steps, first introducing fine-grained categorisation and then enhancing the interaction between the paired images; WSD uses the contribution of each image region to the final generated caption to indicate which regions are important for change captioning, without requiring any extra annotations. Finally, we conduct extensive experiments on four change captioning datasets; the results show that F&P outperforms existing change captioning methods and achieves new state-of-the-art performance.
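The abstract only outlines the two modules at a high level. The sketch below illustrates, in minimal PyTorch, how a two-stage pipeline of this kind could be wired: a perception module that contrasts and lets the paired image features interact, followed by a region discriminator that weights each region's contribution to the caption context without region-level labels. All module names, dimensions, and the specific cross-attention and softmax-scoring choices are assumptions for illustration and do not reproduce the authors' implementation.

```python
# Illustrative sketch of a two-stage change-captioning pipeline in the spirit
# of F&P. Module names, feature sizes, and operations are assumptions, not the
# paper's actual architecture.
import torch
import torch.nn as nn


class FineGrainedChangePerception(nn.Module):
    """Step 1: contrast paired image features; step 2: let them interact."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.contrast = nn.Linear(2 * dim, dim)          # fuse before/after region features
        self.interact = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feat_before: torch.Tensor, feat_after: torch.Tensor):
        # feat_*: (batch, regions, dim) region-level features of the image pair
        fused = self.contrast(torch.cat([feat_before, feat_after], dim=-1))
        # Cross-image interaction: fused change cues attend over the "after" image.
        enhanced, attn = self.interact(fused, feat_after, feat_after)
        return enhanced, attn


class WeaklySupervisedRegionDiscriminator(nn.Module):
    """Scores each region's contribution to the caption without region annotations."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, change_features: torch.Tensor):
        weights = torch.softmax(self.score(change_features), dim=1)   # (batch, regions, 1)
        pooled = (weights * change_features).sum(dim=1)               # caption context vector
        return pooled, weights.squeeze(-1)


if __name__ == "__main__":
    before = torch.randn(2, 49, 512)   # e.g. 7x7 grid features from a visual backbone
    after = torch.randn(2, 49, 512)
    perception = FineGrainedChangePerception()
    discriminator = WeaklySupervisedRegionDiscriminator()
    enhanced, _ = perception(before, after)
    context, region_weights = discriminator(enhanced)
    print(context.shape, region_weights.shape)  # torch.Size([2, 512]) torch.Size([2, 49])
```

In this sketch the region weights play the role of a weak supervisory signal: they are learned only through the captioning objective, yet they expose which regions the model relies on when generating the change description.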
Keywords:
Machine Learning: ML: Weakly supervised learning
Computer Vision: CV: Vision, language and reasoning