IterMeme: Expert-Guided Multimodal LLM for Interactive Meme Creation with Layout-Aware Generation
Yaqi Cai, Shancheng Fang, Yadong Qu, Xiaorui Wang, Meng Shao, Hongtao Xie
Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 720-728.
https://doi.org/10.24963/ijcai.2025/81
Meme creation is a creative process that blends images and text. However, existing methods lack critical components: they fail to support intent-driven caption-layout generation and personalized generation, which makes it difficult to produce high-quality memes. To address these limitations, we propose IterMeme, an end-to-end interactive meme creation framework that uses a unified Multimodal Large Language Model (MLLM) to enable seamless collaboration among multiple components. To overcome the absence of a caption-layout generation component, we develop a robust layout representation method and construct a large-scale image-caption-layout dataset, MemeCap, which strengthens the model's ability to comprehend emotions and coordinate caption-layout generation.
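The abstract does not specify how caption-layout pairs are represented in MemeCap; the snippet below is a minimal, hypothetical sketch of one possible record format, assuming each caption is paired with a normalized bounding box and serialized into the MLLM's text stream. The field names and the <box> tag are illustrative assumptions, not the released schema.

# Hypothetical example of a caption-layout record in a MemeCap-style dataset.
# Field names and the normalized-box convention are illustrative assumptions;
# the abstract does not specify the released schema.
sample = {
    "image": "meme_000123.jpg",
    "intent": "sarcastic reaction to Monday mornings",
    "captions": [
        {"text": "ME ON FRIDAY", "bbox": [0.05, 0.02, 0.95, 0.18]},  # x1, y1, x2, y2 in [0, 1]
        {"text": "ME ON MONDAY", "bbox": [0.05, 0.80, 0.95, 0.96]},
    ],
}

def serialize_layout(record: dict) -> str:
    """Flatten captions and normalized boxes into a single string that an MLLM
    could read and generate, one caption-layout pair per line."""
    lines = []
    for cap in record["captions"]:
        x1, y1, x2, y2 = cap["bbox"]
        lines.append(f"<box>{x1:.2f},{y1:.2f},{x2:.2f},{y2:.2f}</box> {cap['text']}")
    return "\n".join(lines)

print(serialize_layout(sample))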
To address the lack of a personalization component, we introduce a parameter-shared dual-LLM architecture that decouples the intricate representations of reference images and text. Furthermore, we incorporate the expert-guided M³OE for fine-grained identity-property (IP) feature extraction and cross-modal fusion. By dynamically injecting these features into every layer of the model, we enable adaptive refinement of both visual and semantic information.
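The following is a minimal sketch of how such per-layer, expert-guided injection could look, assuming a small mixture-of-experts gate that fuses pooled reference-image (IP) features into each LLM layer's hidden states. The module names, shapes, and additive injection are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of per-layer expert-guided feature injection.
# Names (M3OEInjector, num_experts, etc.) are illustrative assumptions,
# not the authors' released implementation.
import torch
import torch.nn as nn


class M3OEInjector(nn.Module):
    """Mixture-of-experts gate that fuses reference-image (IP) features
    into the hidden states of a single LLM layer."""

    def __init__(self, hidden_dim: int, ip_dim: int, num_experts: int = 4):
        super().__init__()
        # Each expert projects IP features into the LLM hidden space.
        self.experts = nn.ModuleList(
            [nn.Linear(ip_dim, hidden_dim) for _ in range(num_experts)]
        )
        # Router scores experts from the current hidden states.
        self.router = nn.Linear(hidden_dim, num_experts)

    def forward(self, hidden: torch.Tensor, ip_feat: torch.Tensor) -> torch.Tensor:
        # hidden:  (batch, seq_len, hidden_dim) -- layer hidden states
        # ip_feat: (batch, ip_dim)              -- pooled reference-image features
        gate = self.router(hidden.mean(dim=1)).softmax(dim=-1)  # (batch, num_experts)
        expert_out = torch.stack(
            [expert(ip_feat) for expert in self.experts], dim=1
        )  # (batch, num_experts, hidden_dim)
        fused = (gate.unsqueeze(-1) * expert_out).sum(dim=1)  # (batch, hidden_dim)
        # Inject the fused IP signal additively at every token position.
        return hidden + fused.unsqueeze(1)


if __name__ == "__main__":
    injector = M3OEInjector(hidden_dim=768, ip_dim=512)
    h = torch.randn(2, 16, 768)   # dummy layer hidden states
    ip = torch.randn(2, 512)      # dummy reference-image features
    print(injector(h, ip).shape)  # torch.Size([2, 16, 768])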
Experimental results demonstrate that IterMeme significantly advances the field of meme creation by delivering consistently high-quality outcomes. The code, model, and dataset will be open-sourced to the community.
Keywords:
Computer Vision: CV: Multimodal learning
Computer Vision: CV: Image and video synthesis and generation
