G3PT: Unleash the Power of Autoregressive Modeling in 3D Generation via Cross-Scale Querying Transformer
Jinzhi Zhang, Feng Xiong, Guangyu Wang, Mu Xu
Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 2350-2358.
https://doi.org/10.24963/ijcai.2025/262
Autoregressive transformers have revolutionized generative modeling in language processing and shown substantial promise in image and video generation. However, these models face significant challenges when extended to 3D generation because their reliance on next-token prediction over ordered token sequences is incompatible with the unordered nature of 3D data. Instead of imposing an artificial order on 3D data, in this paper we introduce G3PT, a scalable, coarse-to-fine native 3D generative model built on cross-scale vector quantization and cross-scale autoregressive modeling. The key idea is to map point-based 3D data into discrete tokens at different levels of detail, which naturally establishes a sequential relationship across scales suitable for autoregressive modeling. Notably, our method connects tokens globally across levels of detail without any manually specified ordering. Benefiting from this design, G3PT offers a versatile 3D generation pipeline that readily supports generating 3D shapes under diverse conditioning modalities. Extensive experiments demonstrate that G3PT achieves superior generation quality and generalization ability compared to previous baselines. Most importantly, for the first time in 3D generation, scaling up G3PT reveals distinct power-law scaling behavior.
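To make the cross-scale idea above concrete, the following is a minimal, hypothetical PyTorch sketch of next-scale autoregressive prediction over multi-scale token maps. The codebook size, per-scale token counts, model width, and block-causal masking scheme are illustrative assumptions only; it does not reproduce G3PT's cross-scale querying transformer or its vector quantizer.

```python
# Hypothetical sketch of cross-scale ("next-scale") autoregressive modeling,
# NOT the authors' G3PT implementation. All sizes below are assumptions.
import torch
import torch.nn as nn

VOCAB = 1024                      # assumed codebook size
SCALES = [1, 4, 16, 64]           # assumed token counts per level of detail
D = 256                           # assumed model width


class CrossScaleAR(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, D)
        self.scale_emb = nn.Embedding(len(SCALES), D)
        layer = nn.TransformerEncoderLayer(D, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D, VOCAB)

        # Block-causal mask: a token may attend to every token at its own
        # scale and at all coarser scales, but never to finer scales.
        scale_id = torch.cat(
            [torch.full((n,), i) for i, n in enumerate(SCALES)]
        )
        self.register_buffer(
            "mask", scale_id[None, :] > scale_id[:, None]  # True = blocked
        )
        self.register_buffer("scale_id", scale_id)

    def forward(self, tokens):
        """tokens: (B, sum(SCALES)) integer codes, coarsest scale first."""
        x = self.tok_emb(tokens) + self.scale_emb(self.scale_id)[None]
        h = self.backbone(x, mask=self.mask)
        # Per-position logits; a real model would map these to predictions
        # for the tokens of the next (finer) scale.
        return self.head(h)


if __name__ == "__main__":
    model = CrossScaleAR()
    codes = torch.randint(0, VOCAB, (2, sum(SCALES)))
    print(model(codes).shape)  # torch.Size([2, 85, 1024])
```

The block-causal mask lets every token attend to all tokens at coarser scales, so each level of detail is generated conditioned on the full coarser context rather than on a manually ordered token sequence, which is the property the abstract highlights.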
Keywords:
Computer Vision: CV: 3D computer vision
Machine Learning: ML: Foundation models
Machine Learning: ML: Generative models
