Bidirectional Dilation Transformer for Multispectral and Hyperspectral Image Fusion

Shangqi Deng, Liang-Jian Deng, Xiao Wu, Ran Ran, Rui Wen

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence
Main Track. Pages 3633-3641. https://doi.org/10.24963/ijcai.2023/404

Transformer-based methods have proven effective at long-distance modeling, capturing spatial and spectral information, and exhibiting strong inductive bias in various computer vision tasks. Generally, the Transformer model includes two common modes of multi-head self-attention (MSA): spatial MSA (Spa-MSA) and spectral MSA (Spe-MSA). Spa-MSA is computationally efficient, but it confines the global spatial response to a local window. Spe-MSA, on the other hand, computes self-attention across channels and can therefore accommodate high-resolution images, but it disregards the local information that is essential for low-level vision tasks. In this study, we propose a bidirectional dilation Transformer (BDT) for multispectral and hyperspectral image fusion (MHIF), which aims to leverage the advantages of both MSA modes and the latent multiscale information specific to MHIF tasks. The BDT consists of two designed modules: the dilation Spa-MSA (D-Spa), which dynamically expands the spatial receptive field through a given hollow strategy, and the grouped Spe-MSA (G-Spe), which extracts latent features within the feature map and learns local data behavior. Additionally, to fully exploit the multiscale information from the two inputs, which have different spatial resolutions, we employ a bidirectional hierarchy strategy in the BDT, resulting in improved performance. Finally, extensive experiments on two commonly used datasets, CAVE and Harvard, demonstrate the superiority of BDT both visually and quantitatively. The related code will be available on the authors' GitHub page.
Keywords:
Machine Learning: ML: Attention models
Computer Vision: CV: Machine learning for vision
Machine Learning: ML: Multi-modal learning
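
The two attention modes contrasted in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `window`, `dilation`, and `patch` parameters, the use of identity query/key/value projections, and the reading of "grouped" as non-overlapping spatial patches are all assumptions made for illustration. The first function attends over tokens sampled with a stride inside each window, so a window of fixed token count covers a wider area; the second computes a channel-by-channel attention matrix separately within each spatial patch.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dilated_window_attention(feat, window=4, dilation=2):
    """Spatial self-attention over strided (dilated) windows (illustrative only).

    feat: array of shape (H, W, C); H and W must be divisible by window * dilation.
    Tokens in one window lie `dilation` pixels apart, so `window` tokens per side
    span window * dilation pixels. Q, K, V are the raw features (assumption).
    """
    H, W, C = feat.shape
    span = window * dilation
    assert H % span == 0 and W % span == 0
    # Pixel row index = block * span + win * dilation + offset (same for columns).
    x = feat.reshape(H // span, window, dilation, W // span, window, dilation, C)
    # Bring the within-window token axes together: (bh, dh, bw, dw, wh, ww, C).
    x = x.transpose(0, 2, 3, 5, 1, 4, 6)
    tokens = x.reshape(-1, window * window, C)
    attn = softmax(tokens @ tokens.transpose(0, 2, 1) / np.sqrt(C))
    out = attn @ tokens
    out = out.reshape(H // span, dilation, W // span, dilation, window, window, C)
    # Undo the earlier transpose: back to (bh, wh, dh, bw, ww, dw, C).
    out = out.transpose(0, 4, 1, 2, 5, 3, 6)
    return out.reshape(H, W, C)

def grouped_spectral_attention(feat, patch=4):
    """Channel self-attention computed per spatial patch (illustrative only).

    feat: array of shape (H, W, C); H and W must be divisible by patch.
    Each patch gets its own (C, C) attention matrix, so the cost of the
    attention map does not grow with the number of pixels in a patch.
    """
    H, W, C = feat.shape
    assert H % patch == 0 and W % patch == 0
    x = feat.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch, C)  # (groups, N, C)
    attn = softmax(x.transpose(0, 2, 1) @ x / np.sqrt(patch * patch))  # (groups, C, C)
    out = x @ attn.transpose(0, 2, 1)  # mix channels at every pixel of the patch
    out = out.reshape(H // patch, W // patch, patch, patch, C)
    return out.transpose(0, 2, 1, 3, 4).reshape(H, W, C)
```

With `window=2, dilation=2` on an 8 x 8 feature map, the pixel at (0, 0) attends to (0, 2), (2, 0), and (2, 2) rather than to its immediate neighbours, which is the sense in which dilation widens the receptive field at the same attention cost. A real model would add learned projections, multiple heads, and the bidirectional multiscale hierarchy described in the abstract.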