ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning

Zhongjie Duan, Hong Zhang, Wenmeng Zhou, Cen Chen, Yaliang Li, Yu Zhang, Yingda Chen

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, AI, Arts & Creativity track, pages 10063-10071. https://doi.org/10.24963/ijcai.2025/1118

Recently, advancements in video synthesis have attracted significant attention, and video synthesis models have demonstrated the practical applicability of diffusion models for creating dynamic visual content. Despite this progress, the length of generated videos remains constrained by computational resources, and most existing models can produce only short clips. In this paper, we propose ExVideo, a novel post-tuning methodology for video synthesis models. It is designed to enhance current video synthesis models so that they can produce content over extended temporal durations at a lower training cost. In particular, we design extension strategies for the common temporal architecture components, including 3D convolution, temporal attention, and positional embedding. To evaluate the efficacy of the proposed post-tuning approach, we trained ExSVD, an extended model based on the Stable Video Diffusion model. Our approach increases the model's capacity to generate up to 5x its original number of frames, while requiring only 1.5k GPU hours of training on a dataset of 40k videos. Importantly, the substantial increase in video length does not compromise the model's innate generalization capabilities, and the model retains its advantages in generating videos of diverse styles and resolutions. We have publicly released the source code and the enhanced model.
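As a concrete illustration of the positional-embedding extension strategy mentioned above, the sketch below stretches a learned temporal positional embedding to a longer clip length via linear interpolation. This is a minimal, hypothetical PyTorch example: the function name, embedding shape, and interpolation scheme are assumptions for illustration, not necessarily the exact mechanism used in ExVideo.

```python
import torch
import torch.nn.functional as F

def extend_temporal_pos_embedding(pos_embed: torch.Tensor,
                                  new_num_frames: int) -> torch.Tensor:
    """Interpolate a learned temporal positional embedding of shape
    (num_frames, dim) so that it covers new_num_frames frames.

    Hypothetical illustration of one way to extend the positional-embedding
    component of a video diffusion model to longer clips.
    """
    # (num_frames, dim) -> (1, dim, num_frames) for 1D interpolation
    x = pos_embed.t().unsqueeze(0)
    x = F.interpolate(x, size=new_num_frames, mode="linear", align_corners=True)
    # Back to (new_num_frames, dim)
    return x.squeeze(0).t()

# Example: stretch a 25-frame embedding (a typical Stable Video Diffusion
# clip length) to 128 frames; the dimension 1024 is arbitrary here.
orig = torch.randn(25, 1024)
extended = extend_temporal_pos_embedding(orig, 128)
print(extended.shape)  # torch.Size([128, 1024])
```

Interpolating the existing embedding, rather than training new positions from scratch, keeps the extended model close to the pretrained one, which is in the spirit of parameter-efficient post-tuning.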
Keywords: Application domains: Images, movies and visual arts