Part Aware Contrastive Learning for Self-Supervised Action Recognition

Part Aware Contrastive Learning for Self-Supervised Action Recognition

Yilei Hua, Wenhan Wu, Ce Zheng, Aidong Lu, Mengyuan Liu, Chen Chen, Shiqian Wu

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence
Main Track. Pages 855-863. https://doi.org/10.24963/ijcai.2023/95

In recent years, remarkable results have been achieved in self-supervised action recognition using skeleton sequences with contrastive learning. It has been observed that the semantic distinction of human action features is often represented by local body parts, such as legs or hands, which are advantageous for skeleton-based action recognition. This paper proposes an attention-based contrastive learning framework for skeleton representation learning, called SkeAttnCLR, which integrates local similarity and global features for skeleton-based action representations. To achieve this, a multi-head attention mask module is employed to learn the soft attention mask features from the skeletons, suppressing non-salient local features while accentuating local salient features, thereby bringing similar local features closer in the feature space. Additionally, ample contrastive pairs are generated by expanding contrastive pairs based on salient and non-salient features with global features, which guide the network to learn the semantic representations of the entire skeleton. Therefore, with the attention mask mechanism, SkeAttnCLR learns local features under different data augmentation views. The experiment results demonstrate that the inclusion of local feature similarity significantly enhances skeleton-based action representation. Our proposed SkeAttnCLR outperforms state-of-the-art methods on NTURGB+D, NTU120-RGB+D, and PKU-MMD datasets. The code and settings are available at this repository: https://github.com/GitHubOfHyl97/SkeAttnCLR.
Keywords:
Computer Vision: CV: Action and behavior recognition
Computer Vision: CV: 3D computer vision
Computer Vision: CV: Representation learning
Machine Learning: ML: Self-supervised Learning