CSF-GAN: Cross-modal Semantic Fusion-based Generative Adversarial Network for Text-guided Image Inpainting

Shilin Zhang; Suixue Wang; Qingchen Zhang; Liang Zhao; Weiliang Huo; Sijia Hou; Chunjiang Fu

doi:10.24963/ijcai.2025/265

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence

Main Track. Pages 2377-2385. https://doi.org/10.24963/ijcai.2025/265

Most visually guided image inpainting methods based on generative adversarial networks (GANs) struggle when the missing region is only weakly correlated with the surrounding visual context. Recently, diffusion-based methods guided by textual context have been proposed to address this limitation by leveraging additional semantic information to restore corrupted objects. However, these models typically have more parameters and generate images more slowly than GAN-based approaches. To address this problem, we propose a novel text-guided image inpainting model, the cross-modal semantic fusion generative adversarial network (CSF-GAN). CSF-GAN is designed as a one-stage GAN with the following key contributions. First, a novel semantic fusion module (SFM) integrates sentence- and word-level textual context into the inpainting process, enabling more effective guidance from multi-granularity semantic information. Second, a newly designed word-level local discriminator provides detailed feedback to the generator, improving how accurately the generated content aligns with word-level semantics. Third, two loss functions, an inpainting loss and an edge loss, are employed to enhance both structural coherence and textural realism in the generated results. Extensive experiments on two benchmark datasets demonstrate that CSF-GAN outperforms state-of-the-art methods.
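
The abstract names an inpainting loss and an edge loss but does not define them. As a rough illustration only, the sketch below shows one common way such terms are built: a masked L1 reconstruction loss and an L1 distance between Sobel edge maps. The function names, the `hole_weight` parameter, and both formulations are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def composite(image, generated, mask):
    """Keep known pixels and fill the hole (mask == 1) with generated pixels."""
    return image * (1.0 - mask) + generated * mask

def sobel_edges(gray):
    """Gradient magnitude of a 2-D array using 3x3 Sobel kernels (edge padding)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    padded = np.pad(gray, 1, mode="edge")
    h, w = gray.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    # Accumulate the 3x3 neighbourhood response one kernel tap at a time.
    for i in range(3):
        for j in range(3):
            window = padded[i:i + h, j:j + w]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    return np.sqrt(gx ** 2 + gy ** 2)

def inpainting_loss(target, output, mask, hole_weight=6.0):
    """Assumed form: L1 error, weighted more heavily inside the hole."""
    hole = np.abs((output - target) * mask).mean()
    valid = np.abs((output - target) * (1.0 - mask)).mean()
    return hole_weight * hole + valid

def edge_loss(target, output):
    """Assumed form: L1 distance between Sobel edge maps."""
    return np.abs(sobel_edges(output) - sobel_edges(target)).mean()

# Usage on a toy 8x8 grayscale image with a 4x4 hole.
rng = np.random.default_rng(0)
target = rng.random((8, 8))
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0
output = composite(target, np.zeros((8, 8)), mask)
total = inpainting_loss(target, output, mask) + edge_loss(target, output)
```

In a real GAN these terms would be computed on framework tensors and added to the adversarial loss with tuned weights; NumPy is used here only to keep the sketch self-contained.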

Keywords:

Computer Vision: CV: Image and video synthesis and generation

Computer Vision: CV: Adversarial learning, adversarial attack and defense methods

Computer Vision: CV: Vision, language and reasoning