Localizing Unseen Activities in Video via Image Query

Zhu Zhang, Zhou Zhao, Zhijie Lin, Jingkuan Song, Deng Cai

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence
Main track. Pages 4390-4396. https://doi.org/10.24963/ijcai.2019/610

Action localization in untrimmed videos is an important topic in video understanding. However, existing action localization methods are restricted to a pre-defined set of actions and cannot localize unseen activities. We therefore consider a new task, Image-Based Activity Localization, which localizes unseen activities in videos via image queries. This task faces three inherent challenges: (1) how to eliminate the influence of semantically inessential content in image queries; (2) how to handle the fuzzy localization caused by inaccurate image queries; and (3) how to determine the precise boundaries of target segments. To address these challenges, we propose a novel self-attention interaction localizer that retrieves unseen activities in an end-to-end fashion. Specifically, we first devise a region self-attention method with relative position encoding to learn fine-grained image region representations. We then employ a local transformer encoder to perform multi-step fusion and reasoning over image and video contents. Finally, we adopt an order-sensitive localizer to directly retrieve the target segment. Furthermore, we construct a new dataset, ActivityIBAL, by reorganizing the ActivityNet dataset. Extensive experiments show the effectiveness of our method.
Keywords:
Machine Learning: Deep Learning
Computer Vision: Computer Vision
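
Illustrative sketch (not the authors' implementation): the abstract mentions region self-attention with relative position encoding but gives no formulation. The Python below shows one common way such a layer can be built, assuming image regions lie on a height x width grid and that the relative position enters as an additive bias looked up by row/column offset. The function names, the bias table `rel_bias`, and the omission of learned query/key/value projections are all assumptions made for brevity.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def region_self_attention(regions, rel_bias, height, width):
    """Self-attention over grid regions with an additive relative position bias.

    regions  : list of height*width feature vectors (row-major grid order)
    rel_bias : hypothetical (2*height-1) x (2*width-1) table, indexed by the
               row/column offset between two regions (assumed, not from the paper)
    Returns (attended_features, attention_weights).
    """
    n = len(regions)
    d = len(regions[0])
    assert n == height * width, "regions must cover the whole grid"

    attended, weights = [], []
    for i in range(n):
        yi, xi = divmod(i, width)
        scores = []
        for j in range(n):
            yj, xj = divmod(j, width)
            # Scaled dot-product similarity between region i and region j.
            dot = sum(a * b for a, b in zip(regions[i], regions[j]))
            # Bias depends only on the relative offset (shifted to be >= 0).
            bias = rel_bias[yi - yj + height - 1][xi - xj + width - 1]
            scores.append(dot / math.sqrt(d) + bias)
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of all region features for region i.
        attended.append(
            [sum(w[j] * regions[j][k] for j in range(n)) for k in range(d)]
        )
    return attended, weights
```

Because the bias depends only on the offset between two regions, the same table entry is shared by every pair of regions with the same spatial relationship, which is the usual motivation for relative rather than absolute position encoding.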