VimGeo: Efficient Cross-View Geo-Localization with Vision Mamba Architecture
Jinglin Huang, Maoqiang Wu, Peichun Li, Wen Wu, Rong Yu
Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 1188-1196.
https://doi.org/10.24963/ijcai.2025/133
Cross-view geo-localization is a crucial task with diverse applications, yet it remains challenging because images from different perspectives vary significantly in viewpoint and visual appearance. Despite recent advances, existing methods often suffer from high model complexity, excessive resource consumption, and optimization that is affected by varying sample learning difficulty. To overcome these limitations, we optimize the Vision Mamba (Vim) model, which is built on a State Space Model (SSM) architecture, by replacing the traditional classification head with Channel Group Pooling (CGP) for efficient feature integration. This optimization reduces model parameters by 1.5% and computational complexity by 0.4%. Additionally, we propose a novel Dynamic Weighted Batch-tuple Loss (DWBL) that dynamically adjusts the weighting of negative samples, improving model performance. By combining CGP and DWBL, we develop an efficient end-to-end network, VimGeo, which achieves state-of-the-art performance with enhanced computational efficiency. Specifically, VimGeo achieves a Recall@1 of 81.67% on the CVACT_test dataset, outperforming prior approaches. Extensive experiments on the CVUSA, CVACT, and VIGOR datasets validate VimGeo's effectiveness and competitiveness in cross-view geo-localization, where it achieves the leading results among sequence modeling-based methods. The implementation is available at: https://github.com/VimGeoTeam/VimGeo.
Keywords:
Computer Vision: CV: Representation learning
Computer Vision: CV: Efficiency and Optimization
Computer Vision: CV: Image and video retrieval
Computer Vision: CV: Scene analysis and understanding
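
The abstract reports results as Recall@1, the standard retrieval metric: the fraction of query images whose true match is ranked first among all reference images. The sketch below shows how such a score is typically computed from a query-by-reference similarity matrix. It is a minimal illustration, not the authors' evaluation code; the function name `recall_at_k`, the toy similarity values, and the tie-handling rule (ties count in favor of the true match) are assumptions made for this example.

```python
def recall_at_k(similarity, gt_index, k=1):
    """Fraction of queries whose true match is ranked within the top k.

    similarity[i][j] is the score between query i and reference j
    (higher means more similar); gt_index[i] is the index of the
    reference that truly matches query i. Both names are hypothetical.
    """
    hits = 0
    for scores, gt in zip(similarity, gt_index):
        # Zero-based rank of the true match: how many references
        # score strictly higher than it does.
        rank = sum(1 for s in scores if s > scores[gt])
        if rank < k:
            hits += 1
    return hits / len(gt_index)


# Toy example: 3 queries, 3 references, query i matches reference i.
sim = [
    [0.9, 0.1, 0.3],  # true match ranked first
    [0.2, 0.4, 0.8],  # true match ranked second
    [0.5, 0.6, 0.7],  # true match ranked first
]
gt = [0, 1, 2]

print(f"Recall@1 = {recall_at_k(sim, gt, k=1):.4f}")
print(f"Recall@2 = {recall_at_k(sim, gt, k=2):.4f}")
```

In this toy case two of the three queries retrieve their true match at rank one, so Recall@1 is 2/3, and relaxing to the top two raises the score to 1.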
