Dig into Multi-modal Cues for Video Retrieval with Hierarchical Alignment

Wenzhe Wang, Mengdan Zhang, Runnan Chen, Guanyu Cai, Penghao Zhou, Pai Peng, Xiaowei Guo, Jian Wu, Xing Sun

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence
Main Track. Pages 1113-1121. https://doi.org/10.24963/ijcai.2021/154

Multi-modal cues presented in videos are usually beneficial for the challenging video-text retrieval task on internet-scale datasets. Recent video retrieval methods exploit multi-modal cues by aggregating them into holistic high-level semantics and matching these against text representations in a global view. In contrast to this global alignment, the local alignment between the detailed semantics encoded in multi-modal cues and those in distinct phrases remains largely unexplored. In this paper, we therefore leverage hierarchical video-text alignment to fully exploit the diverse, detailed characteristics of multi-modal cues for fine-grained alignment with the local semantics of phrases, while also capturing high-level semantic correspondence. Specifically, multi-step attention is learned to achieve progressively more comprehensive local alignment, and a holistic transformer is utilized to summarize multi-modal cues for global alignment. With hierarchical alignment, our model outperforms state-of-the-art methods on three public video retrieval datasets.
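The abstract names two components but gives no implementation details, so the PyTorch module below is a minimal illustrative sketch of one plausible reading: phrase queries repeatedly cross-attend to multi-modal cue tokens for the local score, and a transformer with a learned summary token pools the cues for the global score. All class names, dimensions, and step counts here are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiStepLocalAlignment(nn.Module):
    """Hypothetical multi-step cross-attention: phrase queries attend to
    multi-modal cue tokens repeatedly, refining the local alignment."""

    def __init__(self, dim: int, num_steps: int = 3, num_heads: int = 4):
        super().__init__()
        self.num_steps = num_steps
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, phrases: torch.Tensor, cues: torch.Tensor) -> torch.Tensor:
        # phrases: (B, P, D) phrase embeddings; cues: (B, C, D) cue tokens
        # (e.g. appearance, motion, audio features projected to dim D).
        q, ctx = phrases, None
        for _ in range(self.num_steps):
            ctx, _ = self.attn(q, cues, cues)  # gather cue evidence per phrase
            q = self.norm(q + ctx)             # refine the phrase query
        # Local score: each phrase vs. its final attended cue context.
        return F.cosine_similarity(phrases, ctx, dim=-1).mean(dim=1)  # (B,)


class HolisticGlobalAlignment(nn.Module):
    """Hypothetical holistic transformer: a learned summary token pools all
    cue tokens into one video embedding matched against the sentence."""

    def __init__(self, dim: int, num_layers: int = 2, num_heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.summary = nn.Parameter(torch.randn(1, 1, dim))

    def forward(self, cues: torch.Tensor, sentence: torch.Tensor) -> torch.Tensor:
        tok = self.summary.expand(cues.size(0), -1, -1)
        out = self.encoder(torch.cat([tok, cues], dim=1))
        video = out[:, 0]  # summary-token output as holistic video embedding
        return F.cosine_similarity(video, sentence, dim=-1)  # (B,)


class HierarchicalAlignment(nn.Module):
    """Combine the local (phrase-level) and global (sentence-level) scores."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.local_align = MultiStepLocalAlignment(dim)
        self.global_align = HolisticGlobalAlignment(dim)

    def forward(self, phrases, sentence, cues):
        return self.local_align(phrases, cues) + self.global_align(cues, sentence)


if __name__ == "__main__":
    B, P, C, D = 2, 5, 12, 256  # batch, phrases, cue tokens, feature dim
    model = HierarchicalAlignment(D)
    score = model(torch.randn(B, P, D), torch.randn(B, D), torch.randn(B, C, D))
    print(score.shape)  # torch.Size([2])
```

Summing the two scores is just one simple way to fuse the hierarchy; a weighted combination or separate retrieval losses per level would be equally consistent with the abstract's description.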
Keywords:
Computer Vision: Language and Vision
Machine Learning: Deep Learning