Top-Down Guidance for Learning Object-Centric Representations

Junhong Zou, Xiangyu Zhu, Zhaoxiang Zhang, Zhen Lei

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 2548-2556. https://doi.org/10.24963/ijcai.2025/284

Humans' innate ability to decompose scenes into objects allows for efficient understanding, prediction, and planning. In light of this, Object-Centric Learning (OCL) attempts to endow networks with a similar capability: learning to represent scenes as compositions of objects. However, existing OCL models learn only by reconstructing the input images, which does not help the model distinguish objects and thus yields suboptimal object-centric representations. This flaw limits current object-centric models to relatively simple downstream tasks. To address this issue, we draw on the top-down pathway of human vision and propose the Top-Down Guided Network (TDGNet), which includes a top-down pathway that improves object-centric representations. During training, the top-down pathway constructs guidance from high-level object-centric representations to optimize the low-level grid features output by the backbone; during inference, it refines object-centric representations by detecting and resolving conflicts between low- and high-level features. We show that TDGNet outperforms current object-centric models on multiple datasets of varying complexity. In addition, we broaden the downstream scope of object-centric representations by applying TDGNet to robotics, validating its effectiveness on downstream tasks including video prediction and visual planning. Code will be available at https://github.com/zoujunhong/RHGNet.
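
The abstract describes the top-down pathway only at a high level. Below is a minimal PyTorch sketch of how such a mechanism could be wired, assuming slot-attention-style representations: high-level slots are broadcast back onto the grid to supervise low-level backbone features during training, and low/high-level disagreement is flagged at inference. All names (top_down_guidance_loss, conflict_mask, the cosine-distance threshold, and the broadcast scheme) are illustrative assumptions, not the TDGNet implementation; see the repository above for the authors' code.

```python
import torch
import torch.nn.functional as F

def top_down_guidance_loss(grid_feats, slots, attn):
    """Training-time guidance (sketch).

    grid_feats: (B, N, D) low-level grid features from the backbone.
    slots:      (B, K, D) high-level object-centric representations.
    attn:       (B, K, N) soft assignment of grid positions to slots.
    """
    # Broadcast each slot back to the grid positions it claims, forming a
    # "top-down" feature map from the high-level representations.
    top_down = torch.einsum('bkn,bkd->bnd', attn, slots)        # (B, N, D)
    # Pull the low-level grid features toward the top-down map; the target
    # is detached so only the backbone features are optimized here.
    return F.mse_loss(grid_feats, top_down.detach())

@torch.no_grad()
def conflict_mask(grid_feats, slots, attn, thresh=0.5):
    """Inference-time conflict detection (sketch): flag grid positions whose
    low-level feature disagrees with the slot assigned to them. A refinement
    step (e.g. re-running the slot update on flagged positions) would follow.
    """
    top_down = torch.einsum('bkn,bkd->bnd', attn, slots)        # (B, N, D)
    disagreement = 1 - F.cosine_similarity(grid_feats, top_down, dim=-1)
    return disagreement > thresh                                # (B, N)

# Toy shapes: batch of 2 images, a 16x16 grid flattened to 256 positions,
# 7 slots, 64-dim features.
B, N, K, D = 2, 256, 7, 64
grid_feats = torch.randn(B, N, D, requires_grad=True)
slots = torch.randn(B, K, D)
attn = torch.softmax(torch.randn(B, K, N), dim=1)  # slots compete per position
loss = top_down_guidance_loss(grid_feats, slots, attn)
loss.backward()
mask = conflict_mask(grid_feats.detach(), slots, attn)
```

Detaching the top-down target in the loss reflects the stated direction of the guidance, high-level representations optimizing low-level features; the reverse (or a symmetric) gradient flow is an equally plausible design that the paper itself would settle.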
Keywords:
Computer Vision: CV: Representation learning
Computer Vision: CV: Embodied vision: Active agents, simulation
Computer Vision: CV: Segmentation, grouping and shape analysis