Dual Video Summarization: From Frames to Captions

Dual Video Summarization: From Frames to Captions

Zhenzhen Hu, Zhenshan Wang, Zijie Song, Richang Hong

Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence
Main Track. Pages 846-854. https://doi.org/10.24963/ijcai.2023/94

Video summarization and video captioning both condense the video content from the perspective of visual and text modes, i.e. the keyframe selection and language description generation. Existing video-and-language learning models commonly sample multiple frames for training instead of observing all. These sampled deputies greatly improve computational efficiency, but do they represent the original video content enough with no more redundancy? In this work, we propose a dual video summarization framework and verify it in the context of video captioning. Given the video frames, we firstly extract the visual representation based on the ViT model fine-tuned on the video-text domain. Then we summarize the keyframes according to the frame-lever score. To compress the number of keyframes as much as possible while ensuring the quality of captioning, we learn a cross-modal video summarizer to select the most semantically consistent frames according to the pseudo score label. Top K frames ( K is no more than 3% of the entire video.) are chosen to form the video representation. Moreover, to evaluate the static appearance and temporal information of video, we design the ranking scheme of video representation from two aspects: feature-oriented and sequence-oriented. Finally, we generate the descriptions with a lightweight LSTM decoder. The experiment results on the MSR-VTT and MSVD dataset reveal that, for the generative task as video captioning, a small number of keyframes can convey the same semantic information to perform well on captioning, or even better than the original sampling.
Keywords:
Computer Vision: CV: Vision and language 
Computer Vision: CV: Video analysis and understanding