Understanding Matters: Semantic-Structural Determined Visual Relocalization for Large Scenes

Jingyi Nie; Liangliang Cai; Qichuan Geng; Zhong Zhou

doi:10.24963/ijcai.2025/974

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence

Main Track. Pages 8759-8767. https://doi.org/10.24963/ijcai.2025/974

Scene Coordinate Regression (SCR) estimates 3D scene coordinates from 2D images and has become an important approach in visual relocalization. Existing methods achieve high localization accuracy in small scenes but still face substantial challenges in large-scale scenes, which usually exhibit significant variations in depth, scale, and occlusion. Although structure-guided scene partitioning is commonly adopted, over-partitioned elements and large feature variances within subscenes impede the estimation of 3D coordinates and introduce misleading information for subsequent processing. To address these issues, we propose the Semantic-Structural Determined Visual Relocalization method for SCR, which leverages semantic-structural partition learning and partition-determined pose refinement to better understand the semantic and structural information of large scenes. First, we partition the scene into small subscenes with label assignments, ensuring semantic consistency and structural continuity within each subscene. A classifier is then trained with sampling-based learning to predict these labels. Second, the partition predictions are encoded into embeddings and integrated with local features for intra-class compactness and inter-class separation, producing partition-aware features. To further reduce feature variance, we employ a discriminability metric and suppress ambiguous points, improving subsequent computations. Experimental results on the Cambridge Landmarks dataset demonstrate that the proposed method achieves significant improvements at lower training cost on large-scale scenes, reducing the median error by 38% compared to the state-of-the-art SCR method DSAC*. Code is available at https://gitee.com/VR_NAVE/ss-dvr.
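
The second stage described in the abstract (fusing partition embeddings with local features, then suppressing ambiguous points) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the fusion operator or the discriminability metric, so the sketch assumes simple concatenation for fusion and the margin between the two largest partition probabilities as the metric. The names `partition_aware_features`, `embedding_table`, and `margin_threshold` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def partition_aware_features(local_feats, partition_logits,
                             embedding_table, margin_threshold=0.2):
    """Fuse per-point local features with a partition embedding.

    Assumptions (not taken from the paper):
      * fusion is plain concatenation of the local feature and the
        embedding of the predicted partition label;
      * discriminability is the gap between the top-2 class
        probabilities; points below `margin_threshold` are flagged
        as ambiguous and would be dropped before pose estimation.
    Requires at least two partition classes.
    """
    fused, labels, keep = [], [], []
    for feat, logits in zip(local_feats, partition_logits):
        probs = softmax(logits)
        # Predicted partition label for this point.
        label = max(range(len(probs)), key=probs.__getitem__)
        # Top-2 probability margin as a stand-in discriminability score.
        top2 = sorted(probs, reverse=True)[:2]
        margin = top2[0] - top2[1]
        fused.append(list(feat) + list(embedding_table[label]))
        labels.append(label)
        keep.append(margin >= margin_threshold)
    return fused, labels, keep
```

With three partitions, a point whose classifier logits are nearly uniform (for example `[0.1, 0.0, 0.0]`) has a small margin and is flagged for suppression, while points with one dominant logit are kept. In a full pipeline, only the kept points would feed the downstream pose solver.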

Keywords:

Robotics: ROB: Localization, mapping, state estimation

Computer Vision: CV: 3D computer vision

Computer Vision: CV: Scene analysis and understanding

Robotics: ROB: Robotics and vision