Pre-defined Keypoints Promote Category-level Articulation Pose Estimation via Multi-Modal Alignment

Wenbo Xu; Li Zhang; Liu Liu; Yan Zhong; Haonan Jiang; Xue Wang; Rujing Wang

doi:10.24963/ijcai.2025/237

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence

Main Track. Pages 2125-2133. https://doi.org/10.24963/ijcai.2025/237

Articulated objects are essential in everyday interactions, yet traditional RGB-based pose estimation methods often struggle with lighting variations and shadows. To overcome these challenges, we propose a novel Pre-defined keypoint-based framework for category-level articulation pose estimation via multi-modal Alignment, coined PAGE. Specifically, we first propose a customized keypoint estimation method that avoids the divergent distance pattern between heuristically generated keypoints and visible points. In addition, to reduce the mutual information redundancy between point clouds and RGB images, we design a geometry-color alignment that aligns the two modalities before fusing their features. We then decode the radius for each visible point and apply our proposal integration scoring strategy to predict keypoints. Finally, the framework outputs the per-part 6D pose of the articulated object. We conduct extensive experiments to evaluate PAGE across a variety of datasets, from synthetic to real-world scenarios, demonstrating its robustness and superior performance.
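To illustrate how per-point radii can determine a keypoint, the sketch below recovers a 3D keypoint from the distances between it and a set of visible points, using linear least squares. It is not the paper's proposal integration scoring strategy, which the abstract does not specify. It assumes that "radius" means the Euclidean distance from a visible point to the keypoint, and the function name `keypoint_from_radii` and the synthetic data are hypothetical.

```python
import numpy as np

def keypoint_from_radii(points, radii):
    """Estimate a 3D keypoint k from visible points p_i and radii r_i = ||p_i - k||.

    Hypothetical helper, not the authors' method. Subtracting the sphere
    equation of the first point from the others removes the ||k||^2 term:
        2 (p_i - p_0) . k = ||p_i||^2 - ||p_0||^2 - r_i^2 + r_0^2
    which is a linear system in k.
    """
    points = np.asarray(points, dtype=float)
    radii = np.asarray(radii, dtype=float)
    p0, r0 = points[0], radii[0]
    A = 2.0 * (points[1:] - p0)
    b = (np.sum(points[1:] ** 2, axis=1) - np.dot(p0, p0)
         - radii[1:] ** 2 + r0 ** 2)
    # Least squares tolerates noisy radii when there are more than four points.
    k, *_ = np.linalg.lstsq(A, b, rcond=None)
    return k

# Synthetic check: noise-free radii computed from a known keypoint.
rng = np.random.default_rng(0)
visible_points = rng.uniform(-1.0, 1.0, size=(50, 3))
true_keypoint = np.array([0.2, -0.1, 0.4])
radii = np.linalg.norm(visible_points - true_keypoint, axis=1)
estimate = keypoint_from_radii(visible_points, radii)
```

With noise-free radii and points that do not all lie in one plane, the estimate matches the true keypoint up to floating-point error. With predicted radii, the least-squares fit averages out per-point errors.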

Keywords:

Computer Vision: CV: 3D computer vision

Computer Vision: CV: Multimodal learning