CMFS: CLIP-Guided Modality Interaction for Mitigating Noise in Multi-Modal Image Fusion and Segmentation

Guilin Su, Yuqing Huang, Chao Yang, Zhenyu He

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 1873-1881. https://doi.org/10.24963/ijcai.2025/209

Infrared-visible image fusion and semantic segmentation are pivotal tasks for robust scene understanding under challenging conditions such as low light. However, existing methods often struggle with high noise, modality inconsistencies, and inefficient cross-modal interactions, which limit both fusion quality and segmentation accuracy. To address these issues, we propose CMFS, a unified framework that leverages CLIP-guided modality interaction to mitigate noise in multi-modal image fusion and segmentation. Our approach features a region-aware Modal Interaction Alignment module that pairs a VMamba-based encoder, augmented with a shuffle layer for more robust features, with a CLIP-guided, region-constrained multi-modal feature interaction block that emphasizes foreground targets while suppressing low-light noise. In addition, a Frequency-Spatial Collaboration module applies selective scanning and integrates wavelet-, spatial-, and Fourier-domain features for adaptive denoising and balanced feature allocation. Finally, we employ a low-rank mixture-of-experts with dynamic routing to improve region-specific fusion and enhance pixel-level accuracy. Extensive experiments on several benchmarks show that the proposed approach outperforms state-of-the-art methods in both image fusion quality and semantic segmentation accuracy, especially in complex environments. The source code will be released at IJCAI2025-CMFS.
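To make the frequency-spatial idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: a spatial convolution branch runs alongside a learnable Fourier-domain filter, and a 1x1 convolution fuses the two. The wavelet branch and selective scanning described in the paper are omitted for brevity; a Haar decomposition would slot in as a third branch in the same way. All module and parameter names here are hypothetical.

import torch
import torch.nn as nn

class FreqSpatialBlock(nn.Module):
    """Hypothetical sketch: spatial conv branch plus a learnable
    Fourier-domain filter, fused by a 1x1 convolution."""
    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1)
        # Per-frequency complex weights over the rFFT spectrum
        # (real/imag stacked in the last dim), initialized to the
        # identity filter 1 + 0j so the branch starts as a pass-through.
        w = torch.zeros(channels, height, width // 2 + 1, 2)
        w[..., 0] = 1.0
        self.freq_weight = nn.Parameter(w)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):                                   # x: (B, C, H, W)
        s = self.spatial(x)                                 # spatial-domain branch
        spec = torch.fft.rfft2(x, norm="ortho")             # frequency-domain branch
        filt = torch.view_as_complex(self.freq_weight)
        f = torch.fft.irfft2(spec * filt, s=x.shape[-2:], norm="ortho")
        return self.fuse(torch.cat([s, f], dim=1))          # fuse both domains

x = torch.randn(2, 32, 64, 64)
y = FreqSpatialBlock(32, 64, 64)(x)                         # -> (2, 32, 64, 64)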
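Similarly, the low-rank mixture-of-experts with dynamic routing can be sketched as follows; this is one plausible reading of the abstract rather than the published code. Each expert is a low-rank bottleneck (down-project, nonlinearity, up-project), and a learned router produces input-dependent mixture weights per token. CMFS's routing would additionally be region-aware; this sketch keeps a plain per-token softmax gate for brevity, and all names are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankExpert(nn.Module):
    """One expert as a low-rank bottleneck: dim -> rank -> dim."""
    def __init__(self, dim: int, rank: int):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)

    def forward(self, x):
        return self.up(F.gelu(self.down(x)))

class LowRankMoE(nn.Module):
    """Mixes low-rank experts with input-dependent (dynamic) gate weights."""
    def __init__(self, dim: int, num_experts: int = 4, rank: int = 16):
        super().__init__()
        self.experts = nn.ModuleList(
            [LowRankExpert(dim, rank) for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)   # per-token routing logits

    def forward(self, x):                            # x: (B, tokens, dim)
        gates = F.softmax(self.router(x), dim=-1)    # (B, T, E)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, T, D, E)
        fused = (outs * gates.unsqueeze(2)).sum(dim=-1)  # gate-weighted mixture
        return x + fused                             # residual connection

# Usage: fuse per-pixel features (flattened to tokens) from the two modalities
feats = torch.randn(2, 64 * 64, 128)   # hypothetical fused feature tokens
out = LowRankMoE(dim=128)(feats)        # -> (2, 4096, 128)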
Keywords:
Computer Vision: CV: Low-level Vision
Computer Vision: CV: Applications and Systems
Computer Vision: CV: Segmentation, grouping and shape analysis