PACE: Predictive and Contrastive Embedding for Unsupervised Action Segmentation

Jiahao Wang; Jie Qin; Yunhong Wang; Annan Li

doi:10.24963/ijcai.2022/198

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence

Main Track. Pages 1423-1429. https://doi.org/10.24963/ijcai.2022/198

Action segmentation, inferring temporal positions of human actions in an untrimmed video, is an important prerequisite for various video understanding tasks. Recently, unsupervised action segmentation (UAS) has emerged as a more challenging task due to the unavailability of frame-level annotations. Existing clustering- or prediction-based UAS approaches suffer from either over-segmentation or overfitting, leading to unsatisfactory results. To address these problems, we propose Predictive And Contrastive Embedding (PACE), a unified UAS framework that leverages both predictability and similarity information for more accurate action segmentation. On the basis of an auto-regressive transformer encoder, predictive embeddings are learned by exploiting the predictability of video context, while contrastive embeddings are generated by leveraging the similarity of adjacent short video clips. Extensive experiments on three challenging benchmarks demonstrate the superiority of our method, with improvements of up to 26.9% in F1-score over the state of the art.
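
The two signals named in the abstract, similarity of adjacent clips and predictability of context, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the temperature value, the pairing of clips into non-overlapping adjacent pairs, and the use of a simple "repeat the previous clip" predictor in place of the auto-regressive transformer encoder are all assumptions made for illustration.

```python
import numpy as np


def adjacent_clip_contrastive_loss(clips, temperature=0.1):
    """InfoNCE-style loss over non-overlapping adjacent clip pairs.

    Hypothetical helper: clips (0,1), (2,3), ... are treated as positive
    pairs, and the second clip of every other pair acts as a negative.
    """
    clips = np.asarray(clips, dtype=float)
    # L2-normalise so that dot products are cosine similarities.
    z = clips / np.linalg.norm(clips, axis=1, keepdims=True)
    n_pairs = len(z) // 2
    anchors = z[0:2 * n_pairs:2]
    positives = z[1:2 * n_pairs:2]
    logits = anchors @ positives.T / temperature
    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Diagonal entries correspond to the true adjacent pairs.
    return float(-np.mean(np.diag(log_prob)))


def persistence_prediction_error(clips):
    """Squared error of predicting each clip as a copy of the previous one.

    Stand-in for a learned predictor (assumption): entry t compares
    clip t with clip t + 1, so a large value hints at a boundary there.
    """
    clips = np.asarray(clips, dtype=float)
    return np.sum((clips[:-1] - clips[1:]) ** 2, axis=1)


# Toy input: four clips of one "action" followed by four of another.
toy_clips = np.vstack([np.tile([1.0, 0.0], (4, 1)),
                       np.tile([0.0, 1.0], (4, 1))])
toy_loss = adjacent_clip_contrastive_loss(toy_clips)
toy_error = persistence_prediction_error(toy_clips)
boundary_index = int(np.argmax(toy_error))
```

On the toy input, the prediction error is zero inside each constant segment and nonzero only at the transition between them, which is the intuition behind using predictability as a boundary cue. In practice, a learned encoder would replace both the raw clip features and the persistence predictor.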

Keywords:

Computer Vision: Video analysis and understanding

Computer Vision: Action and Behaviour Recognition

Computer Vision: Transfer, low-shot, semi- and un- supervised learning