VideoHumanMIB: Unlocking Appearance Decoupling for Video Human Motion In-betweening

Haiwei Xue, Zhensong Zhang, Minglei Li, Zonghong Dai, Fei Yu, Fei Ma, Zhiyong Wu

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 4254-4262. https://doi.org/10.24963/ijcai.2025/474

We propose VideoHumanMIB, a novel framework for Video Human Motion In-betweening that enables seamless transitions between different motion video clips, facilitating the generation of longer and more natural digital human videos. While existing video frame interpolation methods work well when adjacent frames contain similar motion, they often struggle with complex human movements, producing artifacts and unrealistic transitions. To address these challenges, we introduce a two-stage approach: first, we design an Appearance Reconstruction AutoEncoder to decouple appearance and motion information, extracting robust appearance-invariant features; second, we develop an enhanced pretrained diffusion network that leverages both motion optical flow and human pose as guidance conditions, enabling the model to learn comprehensive latent distributions of possible motions. Rather than operating directly in pixel space, our model works in a learned latent space, allowing it to better capture the underlying motion dynamics. The framework is optimized with a dual-frame constraint loss and a motion flow loss to ensure temporal consistency and natural movement transitions. Extensive experiments demonstrate that our approach generates highly realistic transition sequences that significantly outperform existing methods, particularly in challenging scenarios with large motion variations. VideoHumanMIB establishes a new baseline for human motion synthesis and enables more natural and controllable digital human animation.
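The abstract names two training objectives: a dual-frame constraint loss that ties the generated transition to the two given boundary frames, and a motion flow loss that encourages temporally consistent movement. The minimal PyTorch sketch below shows one plausible form these objectives could take in a learned latent space; it is an illustration only, and all function names, tensor shapes, and the 0.5 loss weighting are hypothetical assumptions rather than the authors' implementation.

    # Hypothetical sketch of the two training objectives described in the
    # abstract. Module names, shapes, and weights are illustrative
    # assumptions, not the authors' released code.
    import torch
    import torch.nn.functional as F

    def dual_frame_constraint_loss(pred_latents, start_latent, end_latent):
        """One plausible reading of the 'dual-frame constraint loss':
        penalize deviation of the first/last generated latent frames
        from the encoded boundary frames."""
        return (F.l1_loss(pred_latents[:, 0], start_latent) +
                F.l1_loss(pred_latents[:, -1], end_latent))

    def motion_flow_loss(pred_flow, ref_flow):
        """Match the optical flow of the generated frames to a reference
        flow field, e.g. from an off-the-shelf flow estimator."""
        return F.l1_loss(pred_flow, ref_flow)

    # Toy shapes: batch of 2, 8 in-between frames, 4-channel latent, 32x32 grid.
    B, T, C, H, W = 2, 8, 4, 32, 32
    pred_latents = torch.randn(B, T, C, H, W, requires_grad=True)
    start_latent = torch.randn(B, C, H, W)
    end_latent   = torch.randn(B, C, H, W)
    pred_flow    = torch.randn(B, T - 1, 2, H, W, requires_grad=True)  # flow between consecutive frames
    ref_flow     = torch.randn(B, T - 1, 2, H, W)

    # 0.5 is an assumed weighting between the two terms.
    loss = (dual_frame_constraint_loss(pred_latents, start_latent, end_latent)
            + 0.5 * motion_flow_loss(pred_flow, ref_flow))
    loss.backward()
    print(float(loss))

In this reading, the dual-frame term anchors the endpoints of the synthesized segment to the real clips being joined, while the flow term regularizes the in-between frames toward smooth motion; how the paper actually balances and computes these terms is not specified in the abstract.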
Keywords:
Humans and AI: HAI: Applications
Humans and AI: HAI: Other