DDPA-3DVG: Vision-Language Dual-Decoupling and Progressive Alignment for 3D Visual Grounding

Hongjie Gu, Jinlong Fan, Liang Zheng, Jing Zhang, Yuxiang Yang

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 1044-1052. https://doi.org/10.24963/ijcai.2025/117

3D visual grounding aims to localize target objects in point clouds based on free-form natural language, which often describes both target and reference objects. Effective alignment between visual and textual features is crucial for this task. However, existing two-stage methods that rely solely on object-level features can yield suboptimal accuracy, while one-stage methods that align only point-level features can be prone to noise. In this paper, we propose DDPA-3DVG, a novel framework that progressively aligns visual features and language descriptions at multiple granularities. Specifically, we decouple natural language descriptions into distinct representations of target objects, reference objects, and their mutual relationships, while disentangling 3D scenes into object-level, voxel-level, and point-level features. By progressively fusing these dual-decoupled features from coarse to fine, our method enhances cross-modal alignment and achieves state-of-the-art performance on three challenging benchmarks: ScanRefer, Nr3D, and Sr3D. The code will be released at https://github.com/HDU-VRLab/DDPA-3DVG.
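To make the dual-decoupling and progressive coarse-to-fine fusion concrete, the following is a minimal PyTorch sketch. It assumes pre-computed object-, voxel-, and point-level scene features and a description already decoupled into target, reference, and relation token embeddings; the module names, tensor shapes, and cross-attention layout are illustrative assumptions for exposition, not the authors' released implementation.

import torch
import torch.nn as nn

class ProgressiveAlignment(nn.Module):
    # Illustrative coarse-to-fine fusion: object -> voxel -> point.
    # All names and shapes are assumptions, not the paper's code.
    def __init__(self, dim=256, heads=8):
        super().__init__()
        # One cross-attention block per visual granularity.
        self.obj_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vox_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)  # per-proposal grounding score

    def forward(self, obj_feat, vox_feat, pt_feat, tgt, ref, rel):
        # Coarse stage: object-level features attend to the full
        # decoupled language triple (target, reference, relation).
        lang = torch.cat([tgt, ref, rel], dim=1)            # (B, Lt+Lr+Lg, C)
        obj, _ = self.obj_attn(obj_feat, lang, lang)        # (B, No, C)
        # Middle stage: voxel-level features refine the alignment,
        # conditioned on relation cues plus the coarse-stage output.
        vox_ctx = torch.cat([obj, rel], dim=1)
        vox, _ = self.vox_attn(vox_feat, vox_ctx, vox_ctx)  # (B, Nv, C)
        # Fine stage: point-level features align with the target phrase,
        # grounded by the voxel-level context.
        pt_ctx = torch.cat([vox, tgt], dim=1)
        pt, _ = self.pt_attn(pt_feat, pt_ctx, pt_ctx)       # (B, Np, C)
        # Pool fine features back onto proposals (mean pooling here,
        # purely for simplicity) and score each object proposal.
        fused = obj + pt.mean(dim=1, keepdim=True)
        return self.score(fused).squeeze(-1)                # (B, No)

# Toy usage: random tensors stand in for real scene and text encoders.
B, C = 2, 256
model = ProgressiveAlignment(dim=C)
scores = model(
    obj_feat=torch.randn(B, 32, C),    # 32 object proposals
    vox_feat=torch.randn(B, 128, C),   # 128 voxel features
    pt_feat=torch.randn(B, 1024, C),   # 1024 point features
    tgt=torch.randn(B, 4, C),          # target-phrase tokens
    ref=torch.randn(B, 4, C),          # reference-phrase tokens
    rel=torch.randn(B, 4, C),          # relation-phrase tokens
)
print(scores.shape)  # torch.Size([2, 32])

The staging mirrors the abstract's coarse-to-fine idea: object-level fusion first sees the whole description, voxel-level fusion emphasizes inter-object relations, and point-level fusion sharpens localization of the target phrase. How the paper actually conditions each stage and pools across granularities should be taken from the released code.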
Keywords:
Computer Vision: CV: 3D computer vision
Computer Vision: CV: Multimodal learning