Diff-LMM: Diffusion Teacher-Guided Spatio-Temporal Perception for Video Large Multimodal Models

Jisheng Dang, Ligen Chen, Jingze Wu, Ronghao Lin, Bimei Wang, Yun Wang, Liting Wang, Nannan Zhu, Teng Wang

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 873-881. https://doi.org/10.24963/ijcai.2025/98

Dynamic spatio-temporal understanding is essential for video-based multimodal tasks, yet existing methods often struggle to capture fine-grained temporal and spatial relationships in long videos. Current approaches primarily rely on pre-trained CLIP encoders, which excel at semantic understanding but lack spatially-aware visual context, often leading to hallucinations when the model interprets fine-grained objects or scenes. To address these limitations, we propose a novel framework that integrates diffusion models into video large multimodal models. By employing diffusion encoders at intermediate layers, we enhance visual representations through feature alignment and knowledge distillation losses, significantly improving the model's ability to capture spatial patterns over time. Additionally, we introduce a multi-level alignment strategy to learn robust feature correspondence from pre-trained diffusion models. Extensive experiments on benchmark datasets demonstrate our approach's state-of-the-art performance across multiple video understanding tasks. These results establish diffusion models as a powerful tool for enhancing multimodal video models in complex, dynamic scenarios.
Keywords:
Computer Vision: CV: Multimodal learning
Computer Vision: CV: Video analysis and understanding
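
The abstract describes aligning the LMM's CLIP-based visual features with intermediate diffusion-encoder features through feature alignment and knowledge distillation losses. Below is a minimal PyTorch sketch of what such an objective could look like; the function name, the trainable projector, the cosine alignment term, and the relational KL distillation term are all illustrative assumptions, since the paper's actual multi-level alignment losses are not specified in this abstract.

```python
# Hypothetical sketch of a diffusion teacher-guided alignment objective.
# All names and loss choices are assumptions for illustration only.
import torch
import torch.nn.functional as F


def alignment_distillation_loss(clip_feats, diffusion_feats, projector, temperature=0.07):
    """Align student (CLIP-based) visual tokens with frozen diffusion-teacher features.

    clip_feats:      (B, N, D_s) intermediate features from the LMM's CLIP encoder.
    diffusion_feats: (B, N, D_t) intermediate features from a frozen diffusion encoder.
    projector:       trainable module mapping D_s -> D_t.
    """
    student = projector(clip_feats)        # (B, N, D_t)
    teacher = diffusion_feats.detach()     # teacher features are not updated

    # Feature alignment: cosine distance between corresponding tokens.
    align_loss = 1.0 - F.cosine_similarity(student, teacher, dim=-1).mean()

    # Knowledge distillation: match token-affinity distributions
    # (soft relational distillation over the token dimension).
    s_logits = student @ student.transpose(1, 2) / temperature
    t_logits = teacher @ teacher.transpose(1, 2) / temperature
    kd_loss = F.kl_div(
        F.log_softmax(s_logits, dim=-1),
        F.softmax(t_logits, dim=-1),
        reduction="batchmean",
    )
    return align_loss + kd_loss


if __name__ == "__main__":
    # Toy usage with random features and a linear projector (hypothetical sizes).
    projector = torch.nn.Linear(768, 1024)
    clip_feats = torch.randn(2, 256, 768)
    diffusion_feats = torch.randn(2, 256, 1024)
    print(alignment_distillation_loss(clip_feats, diffusion_feats, projector))
```

In this reading, the frozen diffusion encoder acts purely as a teacher: only the projector (and the LMM's visual pathway) receive gradients, so the spatially-aware structure of the diffusion features is distilled into the CLIP-based representation rather than replacing it.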