Deeply-Supervised CNN Model for Action Recognition with Trainable Feature Aggregation

Yang Li, Kan Li, Xinxin Wang

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence
Main track. Pages 807-813. https://doi.org/10.24963/ijcai.2018/112

In this paper, we propose a deeply-supervised CNN model for action recognition that fully exploits the powerful hierarchical features of CNNs. The model builds multi-level video representations by applying our proposed aggregation module at different convolutional layers. We train the model with deep supervision, which improves both performance and efficiency. To capture temporal structure while preserving more detail about actions, we propose a trainable aggregation module. It models the temporal evolution of each spatial location and projects these temporal evolutions into a semantic space using the Vector of Locally Aggregated Descriptors (VLAD) technique. Integrating this aggregation module, the deeply-supervised CNN model provides a promising solution for recognizing actions in videos. We conduct experiments on two action recognition datasets, HMDB51 and UCF101. Results show that our model outperforms state-of-the-art methods.
Keywords:
Computer Vision: Action Recognition
Computer Vision: Video: Events, Activities and Surveillance
Computer Vision: Computer Vision
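
The abstract names VLAD as the technique behind the aggregation module but does not spell out the computation. As a rough illustration only, the following minimal sketch shows generic VLAD aggregation with soft assignment over the local descriptors of a convolutional feature map. It is not the authors' module: the function name `soft_vlad`, the sharpness parameter `alpha`, the normalization steps, and all tensor sizes are assumptions chosen for the example, and the centers are random here, whereas in a trainable module they would be learned.

```python
import numpy as np

def soft_vlad(descriptors, centers, alpha=1.0):
    """Aggregate N local descriptors of shape (N, D) into one (K*D,) vector.

    Hypothetical helper: generic soft-assignment VLAD, not the paper's module.
    """
    # Residual of every descriptor to every center: (N, K, D)
    diff = descriptors[:, None, :] - centers[None, :, :]
    # Squared Euclidean distances: (N, K)
    sq_dist = np.sum(diff ** 2, axis=2)
    # Soft assignment: softmax over centers of the negative scaled distance
    logits = -alpha * sq_dist
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    # Sum of assignment-weighted residuals per center: (K, D)
    vlad = np.einsum("nk,nkd->kd", weights, diff)
    # L2-normalize each center's residual, then the flattened vector
    vlad /= np.maximum(np.linalg.norm(vlad, axis=1, keepdims=True), 1e-12)
    vlad = vlad.ravel()
    return vlad / max(np.linalg.norm(vlad), 1e-12)

# Assumed sizes: 8 frames, a 7x7 spatial grid, 64 channels, 16 centers.
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 7, 7, 64))  # (T, H, W, D) feature map
centers = rng.standard_normal((16, 64))        # (K, D) cluster centers
video_vector = soft_vlad(features.reshape(-1, 64), centers)
```

Here every spatio-temporal location contributes one descriptor, and the result is a single fixed-length vector per video. Applying the same aggregation to feature maps from several convolutional layers would give one such vector per layer, which is one way to read the multi-level representation described in the abstract.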