DiffFERV: Diffusion-based Facial Editing of Real Videos

Xiangyi Chen, Han Xue, Li Song

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 819-827. https://doi.org/10.24963/ijcai.2025/92

Face video editing presents significant challenges, requiring precise preservation of facial identity, temporal consistency, and background details. Existing methods have three major limitations: inaccurate facial reconstruction, poor performance on challenging real-world videos, and reliance on a crop-edit-stitch paradigm that confines editing to localized facial regions. In response, we introduce DiffFERV, a novel diffusion-based framework for realistic face video editing that addresses these limitations through three core contributions. (1) A specialization stage extends the general prior of large Text-to-Image (T2I) models to faces while retaining their broad generative capabilities, enabling robust performance on non-aligned and challenging face images. (2) Temporal modeling, implemented through two distinct attention mechanisms, complements the specialization stage to ensure joint and temporally consistent processing of video frames. (3) A holistic editing pipeline, together with the concept of preservation features, leverages our model's enhanced priors and temporal mechanisms to achieve faithful edits of entire video frames without cropping, even in real-world scenarios. Extensive experiments demonstrate that DiffFERV achieves state-of-the-art performance in both reconstruction and editing tasks.
Keywords:
Computer Vision: CV: Biometrics, face, gesture and pose recognition
Computer Vision: CV: Applications and Systems
Computer Vision: CV: Image and video synthesis and generation