A Dual Stream Visual Tokenizer for LLM Image Generation
Yongqian Li, Yong Luo, Xiantao Cai, Zheng He, Zhennan Meng, Nidong Wang, Yunlin Chen, Zhifei Li
Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 1494-1502.
https://doi.org/10.24963/ijcai.2025/167
We propose a novel visual tokenizer that combines high-level semantic tokens with low-level pixel tokens to represent images, addressing the challenges of image-to-sequence conversion for Large Language Models (LLMs). Existing visual tokenizers, such as VQ-VAE and diffusion-based models, either suffer from token explosion as image resolution increases or fail to capture detailed structural information. Our method introduces a dual-token system: high-level semantic tokens capture the main content of the image, while low-level pixel tokens preserve its structural details. Integrating these tokens in a hybrid architecture, we leverage a VQ-VAE branch to generate low-resolution guidance and a diffusion process to reconstruct high-resolution images with both semantic coherence and structural accuracy. This approach significantly reduces the number of required tokens and improves reconstruction quality, offering an efficient solution for LLM-based image generation and understanding.
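The dual-token split described above can be illustrated with a toy sketch: a handful of global semantic tokens summarize image content, while a grid of per-patch pixel tokens retains coarse spatial structure, each token chosen by nearest-neighbour lookup against a small codebook. All names, codebooks, and shapes below are hypothetical illustrations of the idea, not the paper's actual encoders or quantizers.

```python
import numpy as np

def dual_stream_tokenize(image, codebook_sem, codebook_pix, patch=8):
    """Toy dual-stream tokenizer over a grayscale image (H, W).

    - Semantic stream: one token per image quadrant (4 tokens total),
      standing in for a small set of high-level content tokens.
    - Pixel stream: one VQ-style token per patch, standing in for
      low-level structural tokens at reduced spatial resolution.
    Codebooks here are 1-D arrays of scalar "feature" entries.
    """
    H, W = image.shape

    # High-level stream: summarize each quadrant by its mean intensity
    # and snap it to the nearest semantic codebook entry.
    sem_tokens = []
    for qi in range(2):
        for qj in range(2):
            quad = image[qi * H // 2:(qi + 1) * H // 2,
                         qj * W // 2:(qj + 1) * W // 2]
            sem_tokens.append(int(np.argmin(np.abs(codebook_sem - quad.mean()))))

    # Low-level stream: nearest-neighbour token per patch mean,
    # preserving where structure sits in the image grid.
    pix_tokens = []
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            feat = image[i:i + patch, j:j + patch].mean()
            pix_tokens.append(int(np.argmin(np.abs(codebook_pix - feat))))

    return sem_tokens, pix_tokens
```

The point of the sketch is the token budget: for a 16x16 image with 8x8 patches it emits only 4 semantic plus 4 pixel tokens, rather than one token per pixel, which is the kind of reduction the paper targets at much larger scales.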
Keywords:
Computer Vision: CV: Image and video synthesis and generation
Machine Learning: ML: Generative models
