Mask Does Not Matter: A Unified Latent Diffusion-Enhanced Framework for Mask-Free Virtual Try-On

Chenghu Du; Junyin Wang; Kai Liu; Shengwu Xiong; Yi Rong

doi:10.24963/ijcai.2025/105

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence

Main Track. Pages 936-944. https://doi.org/10.24963/ijcai.2025/105

A good virtual try-on model should introduce minimal redundant conditional information, both to avoid instability and to improve inference efficiency. Existing methods rely on inpainting masks to guide generation, but these masks are produced by unstable human parsers, and segmentation errors often yield unreliable results with fabric residue. Moreover, large mask regions can discard spatial structure and identity information, so extra conditional inputs are needed to compensate, which increases model instability and reduces efficiency. To address these problems, we present a novel Mask-Free virtual Try-ON (MFTON) framework. Specifically, we propose a mask-free strategy that eliminates all denoising conditions except the clothing and person images, so that spatial structure and identity information are extracted directly from the person image, improving efficiency and reducing instability. Additionally, to optimize the generated clothing regions, we propose a clothing texture-aware attention mechanism that enables the model to focus on generating textures with significant visual differences. We then introduce a geometric detail capture loss that further enables the model to capture more high-frequency information. Finally, we propose an appearance consistency inference method that significantly reduces the initial randomness of the sampling process. Extensive experiments on popular datasets demonstrate that our method outperforms state-of-the-art virtual try-on methods.
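The abstract does not define the geometric detail capture loss. As a rough illustration of what a loss sensitive to high-frequency information can look like, the sketch below compares the Laplacian responses of a predicted and a target image. It is an assumption-laden example, not the authors' formulation: the function names `laplacian` and `high_frequency_l1`, the 4-neighbour Laplacian filter, and the L1 distance are all hypothetical choices made for this example.

```python
# Hypothetical illustration only: the paper's actual loss is not given in the abstract.
from typing import List

Image = List[List[float]]  # grayscale image stored as rows of floats


def laplacian(img: Image) -> Image:
    """Apply a 4-neighbour Laplacian to interior pixels (border pixels are skipped)."""
    h, w = len(img), len(img[0])
    out: Image = []
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            row.append(
                img[y - 1][x] + img[y + 1][x]
                + img[y][x - 1] + img[y][x + 1]
                - 4.0 * img[y][x]
            )
        out.append(row)
    return out


def high_frequency_l1(pred: Image, target: Image) -> float:
    """Mean absolute difference between the Laplacian responses of two images."""
    lap_pred, lap_target = laplacian(pred), laplacian(target)
    total, count = 0.0, 0
    for row_p, row_t in zip(lap_pred, lap_target):
        for a, b in zip(row_p, row_t):
            total += abs(a - b)
            count += 1
    # Images smaller than 3x3 have no interior pixels; return 0.0 in that case.
    return total / count if count else 0.0
```

Because the Laplacian of a constant image is zero, a term of this kind ignores uniform brightness offsets and penalizes only mismatches in edges and fine texture. In practice it would typically be added to a standard reconstruction or denoising objective rather than used on its own.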

Keywords:

Computer Vision: CV: Image and video synthesis and generation

Computer Vision: CV: Machine learning for vision

Humans and AI: HAI: Applications

Machine Learning: ML: Deep learning architectures