Unified Molecule-Text Language Model with Discrete Token Representation

Shuhan Guo, Yatao Bian, Ruibing Wang, Nan Yin, Zhen Wang, Quanming Yao

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
AI4Tech: AI Enabling Technologies. Pages 9205-9213. https://doi.org/10.24963/ijcai.2025/1023

The remarkable success of Large Language Models (LLMs) across diverse tasks has driven the research community to extend their capabilities to molecular applications. However, most molecular LLMs employ adapter-based architectures that fail to integrate the molecule and text modalities on an equal footing and lack explicit supervision signals for the molecular modality. To address these issues, we introduce UniMoT, a Unified Molecule-Text LLM adopting a tokenizer-based architecture that expands the vocabulary of LLMs with molecule tokens. Specifically, we introduce a Vector Quantization-driven tokenizer that incorporates a Q-Former to bridge the modality gap between molecules and text. This tokenizer transforms molecular structures into sequences of tokens with causal dependency, encapsulating both high-level molecular features and textual information. Equipped with this tokenizer, UniMoT unifies the molecule and text modalities under a shared token representation and an autoregressive training paradigm, enabling the model to process molecular structures as a distinct linguistic system and to generate them in textual form. Through a four-stage training scheme, UniMoT functions as a multi-modal generalist capable of performing both molecule-to-text and text-to-molecule tasks. Extensive experiments demonstrate that UniMoT achieves state-of-the-art performance across a wide range of molecule comprehension and generation tasks.
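
To make the tokenizer idea concrete, the sketch below (not the authors' released code) illustrates how a Vector Quantization-driven molecule tokenizer with a Q-Former-style cross-attention module might map molecule features to discrete token ids appended after the text vocabulary of an LLM. All module names, dimensions, and the dummy graph-encoder features are illustrative assumptions.

```python
# Minimal illustrative sketch of a VQ-driven molecule tokenizer.
# Assumptions: molecule node features come from some graph encoder (not shown);
# codebook indices are offset to occupy new slots in an expanded LLM vocabulary.
import torch
import torch.nn as nn


class VQMoleculeTokenizer(nn.Module):
    def __init__(self, mol_dim=300, hidden_dim=256, num_queries=8,
                 codebook_size=2048, llm_vocab_size=32000):
        super().__init__()
        # Learnable queries attend to molecule features (Q-Former-style cross-attention).
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        self.mol_proj = nn.Linear(mol_dim, hidden_dim)
        # Discrete codebook: each code index becomes a new token id in the LLM vocabulary.
        self.codebook = nn.Embedding(codebook_size, hidden_dim)
        self.llm_vocab_size = llm_vocab_size

    def forward(self, mol_feats):
        # mol_feats: (batch, num_atoms, mol_dim) node features from a graph encoder (assumed).
        kv = self.mol_proj(mol_feats)
        q = self.queries.unsqueeze(0).expand(mol_feats.size(0), -1, -1)
        attended, _ = self.cross_attn(q, kv, kv)              # (batch, num_queries, hidden_dim)
        # Nearest-neighbor vector quantization against the codebook.
        codebook = self.codebook.weight.unsqueeze(0).expand(attended.size(0), -1, -1)
        dists = torch.cdist(attended, codebook)                # (batch, num_queries, codebook_size)
        code_ids = dists.argmin(dim=-1)                        # discrete molecule tokens
        quantized = self.codebook(code_ids)
        # Straight-through estimator so gradients flow back to the encoder side.
        quantized = attended + (quantized - attended).detach()
        # Offset code ids so they land in the slots appended after the text vocabulary.
        llm_token_ids = code_ids + self.llm_vocab_size
        return llm_token_ids, quantized


if __name__ == "__main__":
    tokenizer = VQMoleculeTokenizer()
    fake_mol = torch.randn(2, 30, 300)        # 2 molecules, 30 atoms each (dummy features)
    token_ids, _ = tokenizer(fake_mol)
    print(token_ids.shape)                     # torch.Size([2, 8]) molecule tokens per molecule
```

Under this view, the resulting molecule token ids can be interleaved with ordinary text tokens and trained with the same next-token prediction objective, which is what allows a single autoregressive LLM to handle both molecule-to-text and text-to-molecule directions.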
Keywords:
Advanced AI4Tech: Multimodal AI4Tech
Advanced AI4Tech: AI4Tech foundations
Advanced AI4Tech: Data-driven AI4Tech
Emerging AI4Tech: Emerging AI4Tech areas