Words Over Pixels? Rethinking Vision in Multimodal Large Language Models
Anubhooti Jain, Mayank Vatsa, Richa Singh
Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Survey Track. Pages 10481-10489.
https://doi.org/10.24963/ijcai.2025/1164
Multimodal Large Language Models (MLLMs) promise seamless integration of vision and language understanding. However, despite their strong performance, recent studies reveal that MLLMs often fail to effectively utilize visual information, frequently relying on textual cues instead. This survey provides a comprehensive analysis of the vision component in MLLMs, covering both application-level and architectural aspects. We investigate critical challenges such as weak spatial reasoning, poor fine-grained visual perception, and suboptimal fusion of visual and textual modalities. Additionally, we explore limitations in current vision encoders, benchmark inconsistencies, and their implications for downstream tasks. By synthesizing recent advancements, we highlight key research opportunities to enhance visual understanding, improve cross-modal alignment, and develop more robust and efficient MLLMs. Our observations emphasize the urgent need to elevate vision to an equal footing with language, paving the way for more reliable and perceptually aware multimodal models.
Keywords:
Computer Vision: CV: Vision, language and reasoning
Computer Vision: CV: Multimodal learning
