Boosting Visual Knowledge-Intensive Training for LVLMs Through Causality-Driven Visual Object Completion
Qingguo Hu, Ante Wang, Jia Song, Delai Qiu, Qingsong Liu, Jinsong Su

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 1125-1133. https://doi.org/10.24963/ijcai.2025/126

Large Vision-Language Models (LVLMs) have experienced significant advancements in recent years. However, their performance still falls short in tasks requiring deep visual perception, such as identifying subtle differences between images. A potential cause is the scarcity of visual knowledge in popular instruction-tuning corpora, resulting in inadequate visual perception and reasoning capabilities. To address this challenge, we introduce a self-improvement framework grounded in a novel visual knowledge-intensive task, Causality-driven Visual object Completion (CVC). This task requires LVLMs to infer a masked object in an image based on its causal relationships with the other visible information. We first obtain rich examples cheaply through our automated instance construction pipeline, without relying on sophisticated LVLMs (e.g., GPT-4V) or human assistance. Then, LVLMs effectively self-improve through trial-and-error learning on these created instances. Our experiments demonstrate substantial gains across four challenging specialized tasks and four widely used comprehensive benchmarks. On the specialized tasks in particular, our method achieves average improvements of 5.4% and 4.0% over the corresponding baselines when using LLaVA-1.5-7B and LLaVA-1.5-13B, respectively. Code and the supplementary file are available at https://github.com/XMUDeepLIT/CVC.
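To make the CVC setup concrete, the sketch below illustrates how one such instance might be constructed: an object's region is occluded in the image, and the model is queried to infer the hidden object from the remaining visual context, with the masked label serving as the reference answer for trial-and-error learning. All names here (`build_cvc_instance`, the image path, the box coordinates) are hypothetical illustrations under stated assumptions, not the authors' actual pipeline.

```python
# A minimal sketch of CVC instance construction, assuming an object detector
# has already supplied a bounding box and label for the target object.
# Hypothetical example; the paper's pipeline may differ in its details.
from PIL import Image, ImageDraw


def build_cvc_instance(image_path: str,
                       box: tuple[int, int, int, int],
                       object_label: str) -> dict:
    """Mask the target object's region and pair the edited image with a
    completion query; the masked object's label is the reference answer."""
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    draw.rectangle(box, fill=(128, 128, 128))  # occlude the object region
    prompt = ("An object in this image has been masked out. Based on its "
              "causal relationships with the visible content, what is the "
              "masked object?")
    return {"image": image, "prompt": prompt, "answer": object_label}


# Example usage (path, coordinates, and label are placeholders):
# instance = build_cvc_instance("kitchen.jpg", (40, 60, 180, 220), "kettle")
```

In the self-improvement stage described in the abstract, the LVLM's sampled answers to such queries would be checked against the stored reference label, and the successful attempts reused as training signal.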
Keywords:
Computer Vision: CV: Vision, language and reasoning
Computer Vision: CV: Multimodal learning
Natural Language Processing: NLP: Language generation
Natural Language Processing: NLP: Language models