Multimodal Prior Learning with Double Constraint Alignment for Snapshot Spectral Compressive Imaging

Mingjin Zhang; Longyi Li; Fei Gao; Qiming Zhang; Jie Guo

doi:10.24963/ijcai.2025/263

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence

Main Track. Pages 2359-2367. https://doi.org/10.24963/ijcai.2025/263

The objective of snapshot spectral compressive imaging reconstruction is to recover the 3D hyperspectral image (HSI) from a 2D measurement. Existing methods either focus on network architecture design or simply introduce an image-level prior into the model. However, these methods lack guiding information for accurate reconstruction. Recognizing that textual descriptions contain rich semantic information that can significantly enhance details, this paper introduces a novel framework, CAMM, which integrates text information into the model to improve reconstruction performance. The framework comprises two key components: the Fine-grained Alignment Module (FAM) and the Multimodal Fusion Mamba (MFM). Specifically, FAM reduces the knowledge gap between the RGB domain of the pre-trained vision-language model and the HSI domain. Through the double constraints of distribution similarity and entropy, it adaptively aligns features of differing complexity, which makes the encoded features more accurate. MFM aims to identify the guiding effect of RGB features and text features on the HSI in the spatial and channel dimensions. Instead of fusing features directly, it integrates image-level and text-level priors into Mamba's state-space equation, so that each scanning step can be accurately guided. This kind of positive feedback adjustment ensures the authenticity of the guiding information. To our knowledge, this is the first text-guided model for compressive spectral imaging. Extensive experimental results on public datasets demonstrate the superior performance of CAMM, validating the effectiveness of our proposed method.

Keywords:

Computer Vision: CV: Computational photography