Hierarchical Attention Based Spatial-Temporal Graph-to-Sequence Learning for Grounded Video Description

Kai Shen; Lingfei Wu; Fangli Xu; Siliang Tang; Jun Xiao; Yueting Zhuang

doi:10.24963/ijcai.2020/131

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence

Main track. Pages 941-947. https://doi.org/10.24963/ijcai.2020/131

The task of Grounded Video Description (GVD) is to generate sentences whose objects can be grounded with the bounding boxes in the video frames. Existing works often fail to exploit structural information, both in modeling the relationships among the region proposals and in attending to them for text generation. To address these issues, we cast the GVD task as a spatial-temporal Graph-to-Sequence learning problem, where we model video frames as a spatial-temporal sequence graph in order to better capture implicit structural relationships. In particular, we explore two ways to construct a sequence graph that captures spatial-temporal correlations among different objects in each frame, and we further present a novel graph topology refinement technique to discover the optimal underlying graph structure. In addition, we present a hierarchical attention mechanism that attends to the sequence graph at different resolution levels to better generate the sentences. Our extensive experiments demonstrate the effectiveness of our proposed method compared to state-of-the-art methods.

Keywords:

Computer Vision: Language and Vision

Computer Vision: Video: Events, Activities and Surveillance