Unsupervised Alignment of Actions in Video with Text Descriptions / 2025
Young Chol Song, Iftekhar Naim, Abdullah Al Mamun, Kaustubh Kulkarni, Parag Singla, Jiebo Luo, Daniel Gildea, Henry Kautz
Advances in video technology and data storage have made large scale video data collections of complex activities readily accessible. An increasingly popular approach for automatically inferring the details of a video is to associate the spatio-temporal segments in a video with its natural language descriptions. Most algorithms for connecting natural language with video rely on pre-aligned supervised training data. Recently, several models have been shown to be effective for unsupervised alignment of objects in video with language. However, it remains difficult to generate good spatio-temporal video segments for actions that align well with language. This paper presents a framework that extracts higher level representations of low-level action features through hyperfeature coding from video and aligns them with language. We propose a two-step process that creates a high-level action feature codebook with temporally consistent motions, and then applies an unsupervised alignment algorithm over the action codewords and verbs in the language to identify individual activities. We show an improvement over previous alignment models of objects and nouns on videos of biological experiments, and also evaluate our system on a larger scale collection of videos involving kitchen activities.