Boosting Visual Knowledge-Intensive Training for LVLMs Through Causality-Driven Visual Object Completion
Qingguo Hu, Ante Wang, Jia Song, Delai Qiu, Qingsong Liu, Jinsong Su

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 1125-1133. https://doi.org/10.24963/ijcai.2025/126

Large Vision-Language Models (LVLMs) have experienced significant advancements in recent years. However, their performance still falls short in tasks requiring deep visual perception, such as identifying subtle differences between images. A potential cause is the scarcity of visual knowledge in popular instruction-tuning corpora, resulting in inadequate visual perception and reasoning capabilities. To address this challenge, we introduce a self-improvement framework grounded in a novel visual knowledge-intensive task, Causality-driven Visual object Completion (CVC). This task requires LVLMs to infer a masked object in an image based on its causal relationships with the other visible information. We first obtain rich examples cheaply through our automated instance construction pipeline, without relying on sophisticated LVLMs (e.g., GPT-4V) or human assistance. Then, LVLMs effectively self-improve through trial-and-error learning on these created instances. Our experiments demonstrate substantial gains across four challenging specialized tasks and four widely used comprehensive benchmarks. On the specialized tasks in particular, our method achieves average improvements of 5.4% and 4.0% over the corresponding baselines when using LLaVA-1.5-7B and LLaVA-1.5-13B, respectively. Code and the supplementary file are available at https://github.com/XMUDeepLIT/CVC.
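To make the CVC setup concrete, the sketch below illustrates how one such instance might be constructed: an object's region is occluded in the image, and the model is queried to infer the hidden object from the remaining visual context, with the masked label serving as the reference answer for trial-and-error learning. All names here (`build_cvc_instance`, the image path, the box coordinates) are hypothetical illustrations under stated assumptions, not the authors' actual pipeline.

```python
# A minimal sketch of CVC instance construction, assuming an object detector
# has already supplied a bounding box and label for the target object.
# Hypothetical example; the paper's pipeline may differ in its details.
from PIL import Image, ImageDraw


def build_cvc_instance(image_path: str,
                       box: tuple[int, int, int, int],
                       object_label: str) -> dict:
    """Mask the target object's region and pair the edited image with a
    completion query; the masked object's label is the reference answer."""
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    draw.rectangle(box, fill=(128, 128, 128))  # occlude the object region
    prompt = ("An object in this image has been masked out. Based on its "
              "causal relationships with the visible content, what is the "
              "masked object?")
    return {"image": image, "prompt": prompt, "answer": object_label}


# Example usage (path, coordinates, and label are placeholders):
# instance = build_cvc_instance("kitchen.jpg", (40, 60, 180, 220), "kettle")
```

In the self-improvement stage described in the abstract, the LVLM's sampled answers to such queries would be checked against the stored reference label, and the successful attempts reused as training signal.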
Keywords:
Computer Vision: CV: Vision, language and reasoning
Computer Vision: CV: Multimodal learning
Natural Language Processing: NLP: Language generation
Natural Language Processing: NLP: Language models