Language-Guided Hybrid Representation Learning for Visual Grounding on Remote Sensing Images

Biao Liu; Xu Liu; Lingling Li; Licheng Jiao; Fang Liu; Xinyu Sun; Youlin Huang

doi:10.24963/ijcai.2025/174

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence

Main Track. Pages 1557-1566. https://doi.org/10.24963/ijcai.2025/174

Visual grounding (VG) refers to detecting specific objects in images based on linguistic expressions, and it has profound significance in the advanced interpretation of natural images. In remote sensing image interpretation, visual grounding is limited by characteristics such as complex scenes and diverse object sizes. To address these limitations, we propose a novel remote sensing visual grounding (RSVG) framework, named language-guided hybrid representation learning Transformer (LGFormer). Specifically, we design a multimodal dual-encoder Transformer structure called the adaptive multimodal feature fusion module. This structure integrates text and visual features as hybrid queries, enabling early-stage decoding queries to perceive the target position accurately. The hybrid queries then aggregate the different modal information from the dual encoders to obtain the final object embedding for coordinate regression. In addition, a multi-scale cross-modal feature enhancement module (MSCM) is designed to enhance the self-representation of the extracted text and visual features and to align them semantically. For the hybrid queries, we use linguistic guidance to select visual features as the visual part and use sentence-level features as the textual part. Finally, LGFormer achieves the best results compared to existing models on the DIOR-RSVG and OPT-RSVG datasets.
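As a rough illustration of the hybrid-query idea described in the abstract, the sketch below selects the visual tokens most similar to a sentence-level feature and appends that sentence feature as the textual part. This is only one plausible reading, not the authors' implementation: the abstract does not specify the selection rule, so cosine similarity with top-k selection is an assumption, and the function name `build_hybrid_queries`, the parameter `k`, and the random toy features are all hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_hybrid_queries(visual_tokens, sentence_feature, k=4):
    """Hypothetical helper (assumption, not from the paper).

    Scores every visual token against the sentence-level feature, keeps the
    k highest-scoring tokens as the visual part, and appends the sentence
    feature as the textual part.
    """
    scores = [cosine(tok, sentence_feature) for tok in visual_tokens]
    # Indices of the k best-matching visual tokens, highest score first.
    top_idx = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    visual_part = [visual_tokens[i] for i in top_idx]
    textual_part = [list(sentence_feature)]
    return visual_part + textual_part, top_idx


# Toy features: 16 visual tokens and one sentence feature, all of dimension 8.
random.seed(0)
visual_tokens = [[random.gauss(0, 1) for _ in range(8)] for _ in range(16)]
sentence_feature = [random.gauss(0, 1) for _ in range(8)]

queries, top_idx = build_hybrid_queries(visual_tokens, sentence_feature)
print(len(queries), len(queries[0]))  # 5 8
```

In a full model, queries built this way would be passed to a Transformer decoder that attends over the encoder outputs; that stage is omitted here because the abstract gives no details about it.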

Keywords:

Computer Vision: CV: Multimodal learning

Computer Vision: CV: Recognition (object detection, categorization)

Computer Vision: CV: Vision, language and reasoning

Machine Learning: ML: Multi-modal learning