Hallucination Reduction in Video-Language Models via Hierarchical Multimodal Consistency
Jisheng Dang, Shengjun Deng, Haochen Chang, Teng Wang, Bimei Wang, Shude Wang, Nannan Zhu, Guo Niu, Jingwen Zhao, Jizhao Liu
Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
AI4Tech: AI Enabling Technologies. Pages 9167-9175.
https://doi.org/10.24963/ijcai.2025/1019
The rapid advancement of large language models (LLMs) has led to the widespread adoption of video-language models (VLMs) across various domains. However, VLMs are often hindered by limited semantic discrimination capability, exacerbated by the narrow diversity and biased sample distribution of most video-language datasets. This limitation results in a biased understanding of the semantic relationships between visual concepts, leading to hallucinations. To address this challenge, we propose a Multi-level Multimodal Alignment (MMA) framework that leverages a text encoder and a semantic discriminative loss to achieve multi-level alignment, enabling the model to capture both low-level and high-level semantic relationships and thereby reducing hallucinations. By incorporating language-level alignment into the training process, our approach ensures stronger semantic consistency between the video and textual modalities. Furthermore, we introduce a two-stage progressive training strategy that exploits larger and more diverse datasets to enhance semantic alignment and better capture general semantic relationships between the visual and textual modalities. Comprehensive experiments demonstrate that the proposed MMA method significantly mitigates hallucinations and achieves state-of-the-art performance across multiple video-language tasks, establishing a new benchmark in the field.
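The abstract does not spell out the form of the multi-level alignment objective. As an illustration only, the minimal sketch below shows how a two-level video-text contrastive alignment loss could be combined from a low-level (frame-pooled) term and a high-level (clip-level) term. The function names, tensor shapes, weights, and the symmetric InfoNCE formulation are assumptions for illustration, not the paper's actual loss or code.

```python
# Hypothetical sketch of a multi-level video-text alignment loss.
# Names, shapes, and the InfoNCE choice are illustrative assumptions,
# not the MMA implementation described in the paper.
import torch
import torch.nn.functional as F


def info_nce(x, y, temperature=0.07):
    """Symmetric InfoNCE between two batches of L2-normalized embeddings."""
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(x.size(0), device=x.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def multi_level_alignment_loss(frame_feats, clip_feats, text_feats,
                               w_low=0.5, w_high=0.5):
    """Combine low-level and high-level video-text alignment terms.

    frame_feats: (B, T, D) per-frame video features
    clip_feats:  (B, D)    clip-level video features
    text_feats:  (B, D)    sentence embeddings from a text encoder
    """
    low = info_nce(frame_feats.mean(dim=1), text_feats)   # low-level alignment
    high = info_nce(clip_feats, text_feats)                # high-level alignment
    return w_low * low + w_high * high


if __name__ == "__main__":
    B, T, D = 4, 8, 256
    loss = multi_level_alignment_loss(torch.randn(B, T, D),
                                      torch.randn(B, D),
                                      torch.randn(B, D))
    print(loss.item())
```

Under the two-stage progressive training strategy described in the abstract, a loss of this kind would presumably first be optimized on a larger, more diverse corpus for general alignment and then refined on task-specific data; the staging details here are likewise an assumption.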
Keywords:
Advanced AI4Tech: Multimodal AI4Tech
Advanced AI4Tech: Generative and LLMs-driven AI4Tech
Domain-specific AI4Tech: AI4Safety
Domain-specific AI4Tech: General
