Understanding Visual Detail Hallucinations of Large Vision-Language Models
Xiaoxi Sun, Jianxin Liang, Yueqian Wang, Huishuai Zhang, Dongyan Zhao
Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 1900-1908.
https://doi.org/10.24963/ijcai.2025/212
Understanding small visual objects is crucial in fields such as video surveillance, remote sensing, and autonomous driving. In this paper, we investigate the capability of advanced large vision-language models (LVLMs) to recognize and interpret small objects in visual data. To this end, we curate a specialized dataset for evaluating fine-grained visual hallucinations, incorporating two object categories and three types of hallucinations.
First, we assess 11 state-of-the-art LVLMs, yielding several key insights. As anticipated, LVLMs perform significantly worse on queries about small objects than on queries about regular-sized ones, and performance on regular objects proves to be an unreliable predictor of performance on small objects. This finding underscores the need for dedicated research on fine-grained visual hallucinations. Second, we evaluate three training-free methods: Scaffold, Chain of Thought (CoT), and Image Resizing, all of which yield varying degrees of improvement. Furthermore, we conduct a series of detailed ablation studies on the visual encoders of Eagle-X5, examining their performance across fine-grained visual hallucination tasks. Our findings reveal that the ConvNeXt architecture is critical for object existence recognition. In contrast, for mitigating other types of hallucinations, integrating information from multiple visual encoders is significantly more effective than relying on a single encoder.
These results highlight several promising directions for advancing small object recognition with LVLMs.
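To make the Image Resizing baseline mentioned above concrete, the following is a minimal sketch of upscaling an image before querying an LVLM, so that small objects occupy more of the vision encoder's input resolution. This is an illustrative assumption, not the authors' exact pipeline; the function name and the lvlm.generate call are hypothetical.

# Hypothetical sketch of an Image Resizing preprocessing step:
# upscale the input so small objects cover more encoder patches.
# The scale factor and the downstream LVLM call are assumptions,
# not the setup evaluated in the paper.
from PIL import Image

def upscale_for_lvlm(image_path: str, scale: float = 2.0) -> Image.Image:
    """Upscale an image so small objects span more pixels."""
    img = Image.open(image_path).convert("RGB")
    new_size = (int(img.width * scale), int(img.height * scale))
    # Bicubic interpolation preserves edges better than nearest-neighbor.
    return img.resize(new_size, Image.BICUBIC)

# Usage (hypothetical LVLM interface):
# answer = lvlm.generate(upscale_for_lvlm("frame.jpg"), "Is there a bird in the image?")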
Keywords:
Computer Vision: CV: Multimodal learning
Computer Vision: CV: Recognition (object detection, categorization)
