VimGeo: Efficient Cross-View Geo-Localization with Vision Mamba Architecture

Jinglin Huang; Maoqiang Wu; Peichun Li; Wen Wu; Rong Yu

doi:10.24963/ijcai.2025/133

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence

Main Track. Pages 1188-1196. https://doi.org/10.24963/ijcai.2025/133

Cross-view geo-localization is a crucial task with diverse applications, yet it remains challenging because viewpoint and visual appearance vary significantly between images taken from different perspectives. Despite recent advances, existing methods often suffer from high model complexity, excessive resource consumption, and optimization that is sensitive to how difficult individual samples are to learn. To overcome these limitations, we optimize the Vision Mamba (Vim) model, built on a State Space Model (SSM) architecture, by replacing the traditional classification head with Channel Group Pooling (CGP) for efficient feature integration. This optimization reduces model parameters by 1.5% and computational complexity by 0.4%. Additionally, we propose a novel Dynamic Weighted Batch-tuple Loss (DWBL) that dynamically adjusts the weighting of negative samples, improving model performance. By combining CGP and DWBL, we develop an efficient end-to-end network, VimGeo, which achieves state-of-the-art performance with enhanced computational efficiency. Specifically, VimGeo achieves a Recall@1 of 81.67% on the CVACT_test dataset, outperforming prior approaches. Extensive experiments on the CVUSA, CVACT, and VIGOR datasets validate VimGeo's effectiveness and competitiveness in cross-view geo-localization, where it achieves the leading results among sequence modeling-based methods. The implementation is available at: https://github.com/VimGeoTeam/VimGeo.
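
For readers unfamiliar with the Recall@1 metric reported above, the following is a minimal sketch of how Recall@K is commonly computed in cross-view retrieval. It assumes each query (for example, a ground-level image) has exactly one ground-truth reference (for example, an aerial image) and that matching is done by cosine similarity between embeddings. The function names and toy vectors are illustrative assumptions and are not taken from the VimGeo repository.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def recall_at_k(query_embs, ref_embs, gt_indices, k=1):
    """Fraction of queries whose ground-truth reference is in the top-k matches.

    query_embs: list of query embedding vectors (hypothetical model outputs)
    ref_embs:   list of reference embedding vectors
    gt_indices: gt_indices[i] is the index in ref_embs of the true match for query i
    """
    hits = 0
    for query, gt in zip(query_embs, gt_indices):
        sims = [cosine(query, ref) for ref in ref_embs]
        # Rank reference indices from most to least similar.
        ranked = sorted(range(len(ref_embs)), key=lambda i: sims[i], reverse=True)
        if gt in ranked[:k]:
            hits += 1
    return hits / len(query_embs)

# Toy example with made-up 2-D embeddings.
refs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.1]]
ground_truth = [0, 1, 2]

print(recall_at_k(queries, refs, ground_truth, k=1))
print(recall_at_k(queries, refs, ground_truth, k=2))
```

In this toy setup the third query is closest to the first reference rather than its true match, so it counts as a miss at K=1 but as a hit at K=2.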

Keywords:

Computer Vision: CV: Representation learning

Computer Vision: CV: Efficiency and Optimization

Computer Vision: CV: Image and video retrieval

Computer Vision: CV: Scene analysis and understanding