Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models

Xin He, Longhui Wei, Lingxi Xie, Qi Tian

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 1098-1106. https://doi.org/10.24963/ijcai.2025/123

Multimodal Large Language Models (MLLMs) are developing rapidly, and a plethora of novel works has emerged recently. The prevailing trend is to adopt data-driven methodologies, in which diverse instruction-following datasets are collected. However, these approaches face the challenge of limited visual perception capability, as they rely solely on CLIP-like encoders to extract visual information from the inputs. Although such encoders are pre-trained on billions of image-text pairs, they still grapple with the information-loss dilemma, since textual captions capture only part of the content depicted in an image. To address this limitation, this paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism. Specifically, this work introduces a novel method that incorporates multi-task encoders and existing visual tools into the MLLM training and inference pipeline, aiming to provide a more comprehensive summarization of the visual inputs. Extensive experiments demonstrate the effectiveness of this approach, showing that integrating visual experts improves the visual perception capability of MLLMs.
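The abstract describes the core idea at a high level: features from a CLIP-like encoder and additional visual experts are combined before being fed to the language model. The minimal PyTorch sketch below illustrates one plausible way to realize this, by projecting each expert's output into the LLM embedding space and concatenating the results as extra visual tokens; the class name `VisualExpertFusion`, the expert dimensions, and the simple concatenation scheme are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


class VisualExpertFusion(nn.Module):
    """Project features from several visual experts into the LLM token space
    and concatenate them as additional visual tokens.

    Hypothetical sketch: the per-expert linear projectors and token-level
    concatenation are assumptions for illustration only.
    """

    def __init__(self, expert_dims, llm_dim):
        super().__init__()
        # One linear projector per expert (e.g., CLIP, detection, OCR encoders).
        self.projectors = nn.ModuleList(
            [nn.Linear(d, llm_dim) for d in expert_dims]
        )

    def forward(self, expert_features):
        # expert_features: list of tensors, each of shape (batch, tokens_i, dim_i)
        projected = [
            proj(feat) for proj, feat in zip(self.projectors, expert_features)
        ]
        # Concatenate along the token axis -> (batch, sum(tokens_i), llm_dim)
        return torch.cat(projected, dim=1)


if __name__ == "__main__":
    # Toy example: a CLIP-like encoder plus two auxiliary experts.
    fusion = VisualExpertFusion(expert_dims=[1024, 256, 512], llm_dim=4096)
    feats = [
        torch.randn(2, 256, 1024),  # CLIP-like patch tokens
        torch.randn(2, 100, 256),   # e.g., detector region features
        torch.randn(2, 32, 512),    # e.g., OCR / tool-derived features
    ]
    visual_tokens = fusion(feats)
    print(visual_tokens.shape)  # torch.Size([2, 388, 4096])
```

In such a setup, the fused visual tokens would simply be prepended to the text embeddings before the LLM forward pass; the paper's actual fusion and training details are given in the full text.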
Keywords:
Computer Vision: CV: Multimodal learning
Computer Vision: CV: Vision, language and reasoning
Machine Learning: ML: Multi-modal learning