Actor-Multi-Scale Context Bidirectional Higher Order Interactive Relation Network for Spatial-Temporal Action Localization

Jun Yu; Yingshuai Zheng; Shulan Ruan; Qi Liu; Zhiyuan Cheng; Jinze Wu

doi:10.24963/ijcai.2023/186

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence

Main Track. Pages 1676-1685. https://doi.org/10.24963/ijcai.2023/186

The key to video action detection lies in understanding the interactions between persons and background objects in a video. Current methods usually employ object detectors to extract objects directly or use grid features to represent objects in the environment, an approach that underestimates the potential of multi-scale context information (e.g., objects and scenes of different sizes). How to represent multi-scale context precisely and make full use of it remains an unresolved challenge for spatial-temporal action localization. In this paper, we propose a novel Actor-Multi-Scale Context Bidirectional Higher Order Interactive Relation Network (AMCRNet) that extracts multi-scale context through multiple pooling layers of different sizes. Specifically, we develop an Interactive Relation Extraction module to model the higher-order relation between the target person and the context (e.g., other persons and objects). Along this line, we further propose a History Feature Bank and Interaction method that improves performance by modeling such relations across consecutive video clips. Extensive experimental results on AVA2.2 and UCF101-24 demonstrate the superiority and rationality of the proposed AMCRNet.
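
As a rough illustration of extracting context at several scales by pooling one feature map with windows of different sizes, here is a minimal sketch in plain Python. It is not the AMCRNet implementation: the abstract does not state the pooling type, kernel sizes, or tensor layout, so average pooling, the kernel sizes (1, 2, 4), the single-channel 4x4 grid, and the names `avg_pool2d` and `multi_scale_context` are all assumptions made for this example.

```python
from typing import Dict, List, Sequence

Grid = List[List[float]]


def avg_pool2d(feature: Grid, kernel: int) -> Grid:
    """Non-overlapping average pooling (stride equals kernel) over a 2D grid."""
    rows, cols = len(feature), len(feature[0])
    if rows % kernel or cols % kernel:
        raise ValueError("kernel must divide both grid dimensions")
    pooled: Grid = []
    for r in range(0, rows, kernel):
        out_row = []
        for c in range(0, cols, kernel):
            # Collect every value inside the kernel x kernel window.
            window = [
                feature[i][j]
                for i in range(r, r + kernel)
                for j in range(c, c + kernel)
            ]
            out_row.append(sum(window) / len(window))
        pooled.append(out_row)
    return pooled


def multi_scale_context(feature: Grid, kernels: Sequence[int] = (1, 2, 4)) -> Dict[int, Grid]:
    """Pool the same feature map at several window sizes (hypothetical helper).

    Small kernels keep fine detail (small objects); large kernels
    summarize wide regions (scene-level context).
    """
    return {k: avg_pool2d(feature, k) for k in kernels}


# Toy single-channel 4x4 feature map with values 0..15 (assumed input).
feature_map = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
contexts = multi_scale_context(feature_map)

for kernel, grid in contexts.items():
    print(f"kernel={kernel}: {len(grid)}x{len(grid[0])} context grid")
```

In this sketch each pooled grid would act as a separate set of context features at one scale; how the network actually relates these features to the actor features is described in the paper itself, not here.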

Keywords:

Computer Vision: CV: Action and behavior recognition

Computer Vision: CV: Machine learning for vision

Computer Vision: CV: Video analysis and understanding