Shaping a Stabilized Video by Mitigating Unintended Changes for Concept-Augmented Video Editing

Mingce Guo, Jingxuan He, Yufei Yin, Zhangye Wang, Shengeng Tang, Lechao Cheng

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 1062-1070. https://doi.org/10.24963/ijcai.2025/119

Text-driven video editing powered by generative diffusion models holds significant promise for applications spanning film production, advertising, and beyond. However, the limited expressiveness of pre-trained word embeddings often restricts nuanced edits, especially when targeting novel concepts with specific attributes. In this work, we present a novel Concept-Augmented Textual Inversion (CATI) framework that flexibly integrates new object information from user-provided concept videos. By fine-tuning only the V (Value) projection in attention via Low-Rank Adaptation (LoRA), our approach preserves the original attention distribution of the diffusion model while efficiently incorporating external concept knowledge. To further stabilize editing results and mitigate the issue of attention dispersion when prompt keywords are modified, we introduce a Dual Prior Supervision (DPS) mechanism. DPS supervises cross-attention between the source and target prompts, preventing undesired changes to non-target areas and improving the fidelity of novel concepts. Extensive evaluations demonstrate that our plug-and-play solution not only maintains spatial and temporal consistency but also outperforms state-of-the-art methods in generating lifelike and stable edited videos. The source code is publicly available at https://guomc9.github.io/STIVE-PAGE/.
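The abstract does not provide implementation details, so the PyTorch sketch below is only a rough illustration of the two ideas it names: (i) a LoRA adapter attached solely to the value projection of an attention block, which leaves the softmax(QK^T) attention weights of the pre-trained model untouched while letting the injected values carry the new concept, and (ii) a toy cross-attention preservation term in the spirit of DPS that discourages changes on tokens outside the edited keyword. Names such as LoRALinear, to_q/to_k/to_v (diffusers-style attribute names), and dual_prior_attention_loss are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Low-rank adapter on top of a frozen linear layer: y = W x + scale * B(A x)."""
    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # keep pre-trained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)               # adapter starts as a no-op
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

def adapt_value_projection_only(attn_block, rank: int = 4):
    """Wrap only the V projection with LoRA (assumed diffusers-style to_q/to_k/to_v
    attributes). Q and K stay frozen, so softmax(QK^T / sqrt(d)) -- the attention
    distribution -- is identical to the pre-trained model's."""
    attn_block.to_v = LoRALinear(attn_block.to_v, rank=rank)
    for proj in (attn_block.to_q, attn_block.to_k):
        for p in proj.parameters():
            p.requires_grad = False
    return attn_block

def dual_prior_attention_loss(attn_src, attn_tgt, edit_token_mask):
    """Hypothetical DPS-style preservation term: cross-attention maps of the target
    prompt should match the source prompt's maps on tokens that were not edited,
    discouraging changes to non-target regions.
    attn_src, attn_tgt: (batch, heads, query_pixels, tokens)
    edit_token_mask:    (tokens,) bool, True for the replaced keyword tokens."""
    keep = ~edit_token_mask
    return (attn_tgt[..., keep] - attn_src[..., keep]).abs().mean()
```

The concept-fidelity side of DPS is not specified in the abstract and is omitted here; the sketch only captures the preservation constraint on non-target tokens, and the exact loss form used by the authors may differ.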
Keywords:
Computer Vision: CV: Image and video synthesis and generation 
Computer Vision: CV: Machine learning for vision
Computer Vision: CV: Video analysis and understanding