HAF-SVG: Hierarchical Stochastic Video Generation with Aligned Features

Zhihui Lin, Chun Yuan, Maomao Li

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence
Main track. Pages 991-997. https://doi.org/10.24963/ijcai.2020/138

Stochastic video generation methods predict diverse videos from observed frames, where the main challenge lies in modeling the complex future uncertainty and generating realistic frames. Numerous recurrent-VAE-based methods have achieved state-of-the-art results. However, on the one hand, the independence assumption on the variables of the approximate posterior limits inference performance. On the other hand, although these methods adopt skip connections between the encoder and decoder to exploit multi-level features, they still produce blurry frames because the encoder and decoder features at different time steps are spatially misaligned. In this paper, we propose a hierarchical recurrent VAE with a feature aligner, which not only relaxes the independence assumption of the typical VAE but also uses the feature aligner to let the decoder obtain spatially aligned information from the last observed frames. The proposed model is named the Hierarchical Stochastic Video Generation network with Aligned Features, referred to as HAF-SVG. Experiments on the Moving-MNIST, BAIR, and KTH datasets demonstrate that the hierarchical structure helps model future uncertainty more accurately and that the feature aligner is beneficial for generating realistic frames. Moreover, HAF-SVG exceeds SVG in both prediction accuracy and the quality of generated frames.
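To make the feature-alignment idea concrete, the PyTorch sketch below shows one plausible way a feature aligner could work: predict a per-pixel offset field from the encoder and decoder features and warp the encoder's skip features toward the decoder's current spatial layout. This is a minimal illustration under our own assumptions (the offset-based warping scheme, module names, and shapes are hypothetical), not the paper's published implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAligner(nn.Module):
    # Hypothetical sketch: predict a per-pixel offset field from the
    # concatenated encoder (observed-frame) and decoder features, then
    # warp the encoder features so the skip connection is spatially
    # aligned with the frame currently being generated.
    def __init__(self, channels):
        super().__init__()
        self.offset_net = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2, 3, padding=1),  # (dx, dy) in normalized coords
        )

    def forward(self, enc_feat, dec_feat):
        b, _, h, w = enc_feat.shape
        offset = self.offset_net(torch.cat([enc_feat, dec_feat], dim=1))
        # Base sampling grid over [-1, 1] x [-1, 1]; grid_sample expects
        # (x, y) order in the last dimension.
        ys, xs = torch.meshgrid(
            torch.linspace(-1.0, 1.0, h, device=enc_feat.device),
            torch.linspace(-1.0, 1.0, w, device=enc_feat.device),
            indexing="ij",
        )
        base = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        grid = base + offset.permute(0, 2, 3, 1)
        # Bilinearly sample encoder features at the shifted locations;
        # the result serves as the aligned skip input to the decoder.
        return F.grid_sample(enc_feat, grid, align_corners=True)

# Example: align 64-channel 16x16 encoder features to the decoder state.
aligner = FeatureAligner(channels=64)
enc = torch.randn(2, 64, 16, 16)
dec = torch.randn(2, 64, 16, 16)
aligned = aligner(enc, dec)  # shape (2, 64, 16, 16)

In this sketch the offset field plays the role of spatial alignment: rather than concatenating possibly misaligned encoder features directly, the decoder receives features resampled to match its current layout, which is one way to mitigate the blur attributed to misaligned skip connections.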
Keywords:
Computer Vision: Other
Machine Learning: Deep Generative Models
Computer Vision: Video: Events, Activities and Surveillance
Machine Learning: Learning Generative Models