Unsupervised Learning of Deep Feature Representation for Clustering Egocentric Actions

Bharat Lal Bhatnagar, Suriya Singh, Chetan Arora, C.V. Jawahar

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence
Main track. Pages 1447-1453. https://doi.org/10.24963/ijcai.2017/200

The popularity of wearable cameras in life logging, law enforcement, assistive vision, and other similar applications is leading to an explosion in the generation of egocentric video content. First-person action recognition is an important aspect of the automatic analysis of such videos. Annotating these videos is hard, not only because of obvious scalability constraints but also because of the privacy issues often associated with egocentric video. This motivates the use of unsupervised methods for egocentric video analysis. In this work, we propose a robust and generic unsupervised approach for first-person action clustering. Unlike contemporary approaches, our technique is neither limited to any particular class of actions nor does it require priors such as pre-training or fine-tuning. We learn time-sequenced visual and flow features from an array of weak feature extractors based on convolutional and LSTM autoencoder networks. We demonstrate that clustering such features leads to the discovery of semantically meaningful actions present in the video. We validate our approach on four disparate public egocentric action datasets amounting to approximately 50 hours of video. We show that our approach surpasses supervised state-of-the-art accuracies without using the action labels.
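To make the pipeline described in the abstract concrete, the sketch below shows one way an LSTM autoencoder can embed clips of per-frame features and how those embeddings could then be clustered. This is a minimal illustration assuming per-frame visual or flow features have already been extracted; the class names, dimensions, and the use of k-means are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans


class LSTMAutoencoder(nn.Module):
    """Sequence autoencoder sketch: encode a clip of per-frame features,
    reconstruct it, and use the bottleneck state as the clip representation.
    (Dimensions are placeholders, not taken from the paper.)"""

    def __init__(self, feat_dim=512, hidden_dim=128):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(hidden_dim, feat_dim, batch_first=True)

    def forward(self, x):                        # x: (batch, time, feat_dim)
        _, (h, _) = self.encoder(x)              # h: (1, batch, hidden_dim)
        code = h.transpose(0, 1)                 # (batch, 1, hidden_dim)
        code_rep = code.repeat(1, x.size(1), 1)  # feed the code at every step
        recon, _ = self.decoder(code_rep)        # reconstruct the sequence
        return recon, code.squeeze(1)            # reconstruction + clip embedding


def cluster_clips(model, clips, n_actions=10):
    """Embed each clip with a trained autoencoder and group the embeddings
    with k-means (one hypothetical choice of clustering algorithm)."""
    model.eval()
    with torch.no_grad():
        _, codes = model(clips)                  # codes: (num_clips, hidden_dim)
    return KMeans(n_clusters=n_actions, n_init=10).fit_predict(codes.numpy())
```

In such a setup, the autoencoder would be trained with a reconstruction loss (e.g. mean-squared error between `recon` and the input clip) so that no action labels are needed; the resulting clusters can then be compared against ground-truth action annotations only at evaluation time.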
Keywords:
Machine Learning: Ensemble Methods
Machine Learning: Feature Selection/Construction
Machine Learning: Unsupervised Learning
Robotics and Vision: Vision and Perception