Enhancing Semantic Clarity: Discriminative and Fine-grained Information Mining for Remote Sensing Image-Text Retrieval
Yu Liu, Haipeng Chen, Yuheng Liang, Yuheng Yang, Xun Yang, Yingda Lyu
Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 5815-5823.
https://doi.org/10.24963/ijcai.2025/647
Remote sensing image-text retrieval is a fundamental task in remote sensing multimodal analysis, requiring the alignment of visual and language representations. Mainstream approaches commonly focus on capturing shared semantic representations between the visual and textual modalities. However, the inherent characteristics of remote sensing image-text pairs, namely redundant visual representations and high inter-class similarity, give rise to a semantic confusion problem. To tackle this problem, we propose a novel Discriminative and Fine-grained Information Mining (DFIM) model, which enhances semantic clarity by reducing visual redundancy and widening the semantic gap between classes. Specifically, the Dynamic Visual Enhancement (DVE) module adaptively strengthens discriminative visual features under the guidance of multimodal fusion information, while the Fine-grained Semantic Matching (FSM) module models the matching relationship between image regions and text words as an optimal transport problem, thereby refining intra-instance matching. Extensive experiments on two benchmark datasets demonstrate the superiority of DFIM over leading methods in terms of retrieval accuracy and visual interpretability.
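To make the FSM module's optimal-transport formulation concrete, the sketch below shows one standard way to cast region-word matching as entropic optimal transport solved with Sinkhorn iterations: a cosine-distance cost matrix between region and word embeddings is relaxed into a soft transport plan whose entries act as fine-grained matching weights. This is a minimal illustration under assumptions, not the authors' implementation; the names sinkhorn_match, eps, and n_iters are hypothetical, and the paper may use different cost functions, marginals, or solver settings.

import numpy as np

def sinkhorn_match(regions, words, eps=0.05, n_iters=50):
    """Return a soft transport plan aligning image regions with text words.

    regions: (R, d) array of region embeddings
    words:   (W, d) array of word embeddings
    eps:     entropic regularization strength (smaller = sharper plan)
    """
    # Cost = 1 - cosine similarity for every region-word pair.
    r = regions / np.linalg.norm(regions, axis=1, keepdims=True)
    w = words / np.linalg.norm(words, axis=1, keepdims=True)
    cost = 1.0 - r @ w.T                      # (R, W)

    # Uniform marginals: each region and each word carries equal mass.
    a = np.full(cost.shape[0], 1.0 / cost.shape[0])
    b = np.full(cost.shape[1], 1.0 / cost.shape[1])

    K = np.exp(-cost / eps)                   # Gibbs kernel of the cost
    u = np.ones_like(a)
    for _ in range(n_iters):                  # Sinkhorn fixed-point updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]        # transport plan, shape (R, W)

# Example: 6 image regions, 4 text words, 128-dim embeddings.
rng = np.random.default_rng(0)
plan = sinkhorn_match(rng.normal(size=(6, 128)), rng.normal(size=(4, 128)))
print(plan.sum())  # ~1.0: the plan is a joint distribution over pairs

High-mass entries of the resulting plan indicate which regions a given word should attend to, and a retrieval score can be obtained by summing the plan-weighted similarities, which is one plausible way such intra-instance matching is refined.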
Keywords:
Machine Learning: ML: Multi-modal learning
Machine Learning: ML: Representation learning
